Issue 735701

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Fuchsia
Pri: 2
Type: Bug



Various TaskSchedulerImplTest sometimes crash/hang on Fuchsia

Project Member Reported by scottmg@chromium.org, Jun 21 2017

Issue description

While soaking to find a crash in another test

OneTraitsExecutionModePair/TaskSchedulerImplTest.PostTaskViaTaskRunnerBeforeStart

crashed.

Not easily reproducible, no useful output in the crash log.
 
Possibly similar to bug 730340. Bug 735740 too.
Issue 735740 has been merged into this issue.
Components: Internals>TaskScheduler
Status: Available (was: Untriaged)
A couple of recent flakes; one release run:

https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.fyi%2FFuchsia%2F5440%2F%2B%2Frecipes%2Fsteps%2Fbase_unittests%2F0%2Fstdout

one debug run:

https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.fyi%2FFuchsia__dbg_%2F1076%2F%2B%2Frecipes%2Fsteps%2Fbase_unittests%2F0%2Fstdout

Which specific tests fail varies between runs, but the overall result is consistent.
Owner: scottmg@chromium.org
Status: Started (was: Available)
Summary: Various TaskSchedulerImplTest sometimes crash on Fuchsia (was: OneTraitsExecutionModePair/TaskSchedulerImplTest.PostTaskViaTaskRunnerBeforeStart sometimes crashes on Fuchsia)
Aha! I caught a crash with a stack. It makes me ... nervous.

[00013.302] 01123.01174> <== fatal exception: process [157426] thread [158207]
[00013.302] 01123.01174> <== fatal page fault, PC at 0x2776db98f7a9
[00013.302] 01123.01174>  CS:                   0 RIP:     0x2776db98f7a9 EFL:            0x10297 CR2:                  0
[00013.302] 01123.01174>  RAX:         0xfffffffe RBX:     0x74e676630e18 RCX:                  0 RDX:                0x9
[00013.302] 01123.01174>  RSI:                  0 RDI:     0x74e676630e18 RBP:     0x204b0d449320 RSP:     0x74c24b6dcb70
[00013.302] 01123.01174>   R8:               0x10  R9:         0x204b0d44 R10:     0x2c74f1304500 R11:        0x800000000
[00013.302] 01123.01174>  R12:                  0 R13:     0x74c24b6dce50 R14:     0x204b0d449140 R15:                0x1
[00013.302] 01123.01174>  errc:               0x6
[00013.302] 01123.01174> bottom of user stack:
[00013.302] 01123.01174> 0x000074c24b6dcb70: 1e2e29a0 000044ee db97b083 00000000 |.)...D..........|
[00013.302] 01123.01174> 0x000074c24b6dcb80: 00000000 00004ff6 db97aff1 00002776 |.....O......v'..|
[00013.302] 01123.01174> 0x000074c24b6dcb90: 00000000 00000000 db97aff1 00002776 |............v'..|
[00013.302] 01123.01174> 0x000074c24b6dcba0: ffffffff 7fffffff c21fdaa0 00006f46 |............Fo..|
[00013.302] 01123.01174> 0x000074c24b6dcbb0: 1c072920 000038e6 236bf1a8 000027c3 | )...8....k#.'..|
[00013.302] 01123.01174> 0x000074c24b6dcbc0: 1c06cd60 000038e6 98820a80 000063c5 |`....8.......c..|
[00013.302] 01123.01174> 0x000074c24b6dcbd0: e4874020 00007d03 98820a88 000063c5 | @...}.......c..|
[00013.302] 01123.01174> 0x000074c24b6dcbe0: 4b6dce50 000074c2 98820a80 000063c5 |P.mK.t.......c..|
[00013.303] 01123.01174> 0x000074c24b6dcbf0: 0d449320 0000204b 98820a88 000063c5 | .D.K .......c..|
[00013.303] 01123.01174> 0x000074c24b6dcc00: 4b6dce50 000074c2 0d449140 0000204b |P.mK.t..@.D.K ..|
[00013.303] 01123.01174> 0x000074c24b6dcc10: 00000001 00000000 c2198d03 00006f46 |............Fo..|
[00013.303] 01123.01174> 0x000074c24b6dcc20: a54aa020 00000afa 00000000 00000000 | .J.............|
[00013.303] 01123.01174> 0x000074c24b6dcc30: 00000000 00000000 ed5576da 00002d97 |.........vU..-..|
[00013.303] 01123.01174> 0x000074c24b6dcc40: 00000001 00000000 76630e50 000074e6 |........P.cv.t..|
[00013.303] 01123.01174> 0x000074c24b6dcc50: 00000000 00000000 db986aee 00002776 |.........j..v'..|
[00013.303] 01123.01174> 0x000074c24b6dcc60: 4b6dcca0 000074c2 e4874020 00007d03 |..mK.t.. @...}..|
[00013.303] 01123.01174> arch: x86_64
[00013.303] 01123.01174> dso: id=0ee2c1798ac9282426341de1bfb7f8d509010a73 base=0x700f9f64b000 name=libmxio.so
[00013.303] 01123.01174> dso: id=7232e5e96679f46409d5b78b2b94bccadd2de16a base=0x6f46c2122000 name=libc.so
[00013.303] 01123.01174> dso: id=2b6882f235d28d75ef453da68f39774ef99c8388 base=0x4b39eca33000 name=<vDSO>
[00013.303] 01123.01174> dso: id=6682d527b69ec892f730643f7ff711e8f78f4116 base=0x46b56707d000 name=liblaunchpad.so
[00013.303] 01123.01174> dso: id=765f3f6ec08e043f base=0x2d97ed4ab000 name=libc++.so.2
[00013.303] 01123.01174> dso: id=fd295ef1fd598df5 base=0x2776dadcf000 name=app:/system/base_unittests
[00013.303] 01123.01174> dso: id=05a640477f586007 base=0x25c63b95a000 name=libc++abi.so.1
[00013.303] 01123.01174> dso: id=aeae5a3cc3b99502 base=0x19e739f63000 name=libunwind.so.1
[00013.303] 01123.01174> bt#01: pc 0x2776db98f7a9 sp 0x74c24b6dcb70 (app:/system/base_unittests,0xbc07a9)
[00013.303] 01123.01174> bt#02: pc 0x2776db98f484 sp 0x74c24b6dccc0 (app:/system/base_unittests,0xbc0484)
[00013.304] 01123.01174> bt#03: pc 0x2776db98937a sp 0x74c24b6dce30 (app:/system/base_unittests,0xbba37a)
[00013.304] 01123.01174> bt#04: pc 0x2776db999cd2 sp 0x74c24b6dcfb0 (app:/system/base_unittests,0xbcacd2)
[00013.304] 01123.01174> bt#05: pc 0x6f46c213b63f sp 0x74c24b6dcfd0 (libc.so,0x1963f)
[00013.304] 01123.01174> bt#06: pc 0x6f46c21b30da sp 0x74c24b6dcff0 (libc.so,0x910da)
[00013.304] 01123.01174> bt#07: pc 0 sp 0x74c24b6dd000
[00013.304] 01123.01174> bt#08: end
----- start symbolized stack
#01: std::__2::__atomic_base<int, true>::fetch_add(int, std::__2::memory_order) volatile at /work/cr/src/out/fuch/../../third_party/fuchsia-sdk/sysroot/x86_64-fuchsia/include/c++/v1/atomic:1001
      (inlined by) base::subtle::NoBarrier_AtomicIncrement(int volatile*, int) at /work/cr/src/out/fuch/../../base/atomicops_internals_portable.h:86
      (inlined by) base::internal::TaskTracker::State::DecrementNumTasksBlockingShutdown() at /work/cr/src/out/fuch/../../base/task_scheduler/task_tracker.cc:157
      (inlined by) base::internal::TaskTracker::AfterRunTask(base::TaskShutdownBehavior) at /work/cr/src/out/fuch/../../base/task_scheduler/task_tracker.cc:503
#02: base::time_internal::TimeBase<base::TimeTicks>::is_null() const at /work/cr/src/out/fuch/../../base/time/time.h:338
      (inlined by) base::internal::TaskTracker::RunNextTask(base::internal::Sequence*) at /work/cr/src/out/fuch/../../base/task_scheduler/task_tracker.cc:248
#03: base::internal::SchedulerWorker::Thread::ThreadMain() at /work/cr/src/out/fuch/../../base/task_scheduler/scheduler_worker.cc:80
#04: base::(anonymous namespace)::ThreadFunc(void*) at /work/cr/src/out/fuch/../../base/threading/platform_thread_posix.cc:73
#05: ?? ??:0
#06: ?? ??:0
----- end symbolized stack

Got this again

----- start symbolized stack
#01: std::__2::__atomic_base<int, true>::fetch_add(int, std::__2::memory_order) volatile at /work/cr/src/out/fuch/../../third_party/fuchsia-sdk/sysroot/x86_64-fuchsia/include/c++/v1/atomic:1001
      (inlined by) base::subtle::NoBarrier_AtomicIncrement(int volatile*, int) at /work/cr/src/out/fuch/../../base/atomicops_internals_portable.h:86
      (inlined by) base::internal::TaskTracker::State::DecrementNumTasksBlockingShutdown() at /work/cr/src/out/fuch/../../base/task_scheduler/task_tracker.cc:157
      (inlined by) base::internal::TaskTracker::AfterRunTask(base::TaskShutdownBehavior) at /work/cr/src/out/fuch/../../base/task_scheduler/task_tracker.cc:503
#02: base::time_internal::TimeBase<base::TimeTicks>::is_null() const at /work/cr/src/out/fuch/../../base/time/time.h:338
      (inlined by) base::internal::TaskTracker::RunNextTask(base::internal::Sequence*) at /work/cr/src/out/fuch/../../base/task_scheduler/task_tracker.cc:248
#03: base::internal::SchedulerWorker::Thread::ThreadMain() at /work/cr/src/out/fuch/../../base/task_scheduler/scheduler_worker.cc:80
#04: base::(anonymous namespace)::ThreadFunc(void*) at /work/cr/src/out/fuch/../../base/threading/platform_thread_posix.cc:73
#05: ?? ??:0
#06: ?? ??:0
----- end symbolized stack

on ContinueOnShutdown/TaskSchedulerTaskTrackerTest.WillPostAndRunLongTaskBeforeShutdown/0 this time.

Very strange, but at least occasionally reproducible now.

Comment 7 Deleted

(doh, wrong bug, sorry, removing #7)
Issue 735160 has been merged into this issue.
Similar but different in

AllModes/TaskSchedulerSingleThreadTaskRunnerManagerCommonTest.PostTaskAfterShutdown/0



[00014.224] 01123.01174> <== fatal exception: process [156691] thread [157726]
[00014.224] 01123.01174> <== fatal page fault, PC at 0x5abf2bfa915a
[00014.224] 01123.01174>  CS:                   0 RIP:     0x5abf2bfa915a EFL:            0x10206 CR2:               0x18
[00014.224] 01123.01174>  RAX:                  0 RBX:     0x432e2b913e18 RCX:      0x3c8a1d3b000 RDX:                0x1
[00014.224] 01123.01174>  RSI:                0x1 RDI:                  0 RBP:                0x1 RSP:      0x8c0069ff9c0
[00014.224] 01123.01174>   R8:                  0  R9:                  0 R10:     0x2c1db8b45100 R11:            0x1000f
[00014.224] 01123.01174>  R12:                  0 R13:               0x18 R14:     0x432e2b913e30 R15:                0x2
[00014.224] 01123.01174>  errc:               0x6
[00014.224] 01123.01174> bottom of user stack:
[00014.224] 01123.01174> 0x000008c0069ff9c0: 9317e210 00003fd1 92eff286 00000000 |.....?..........|
[00014.224] 01123.01174> 0x000008c0069ff9d0: 2b913e50 00000000 57fc453c ff27ac6b |P>.+....<E.Wk.'.|
[00014.224] 01123.01174> 0x000008c0069ff9e0: 2b913e30 0000432e 2b913e18 0000432e |0>.+.C...>.+.C..|
[00014.224] 01123.01174> 0x000008c0069ff9f0: a464c960 0000742e 00000000 00000000 |`.d..t..........|
[00014.224] 01123.01174> 0x000008c0069ffa00: 069ffe50 000008c0 2bfa8c9e 00005abf |P..........+.Z..|
[00014.224] 01123.01174> 0x000008c0069ffa10: 00000002 00000000 92f6a0be 00003fd1 |.............?..|
[00014.224] 01123.01174> 0x000008c0069ffa20: 069ffcc8 000008c0 2c006d03 00005abf |.........m.,.Z..|
[00014.224] 01123.01174> 0x000008c0069ffa30: 00000000 00000000 00000000 00000000 |................|
[00014.224] 01123.01174> 0x000008c0069ffa40: 069ffa58 000008c0 069ffa57 000008c0 |X.......W.......|
[00014.224] 01123.01174> 0x000008c0069ffa50: 00000000 00000000 2c06baa0 00005abf |...........,.Z..|
[00014.224] 01123.01174> 0x000008c0069ffa60: 00000000 00000000 00000000 00000000 |................|
[00014.224] 01123.01174> 0x000008c0069ffa70: 00000000 00000000 00000000 00000000 |................|
[00014.224] 01123.01174> 0x000008c0069ffa80: 57fc453c ff27ac6b 931f1001 00003fd1 |<E.Wk.'......?..|
[00014.224] 01123.01174> 0x000008c0069ffa90: 9317e220 00003fd1 931f1028 00003fd1 | ....?..(....?..|
[00014.224] 01123.01174> 0x000008c0069ffaa0: 069ffcc8 000008c0 92f9568c 00003fd1 |.........V...?..|
[00014.224] 01123.01174> 0x000008c0069ffab0: 931f10b0 00003fd1 00000000 00000000 |.....?..........|
[00014.224] 01123.01174> arch: x86_64
[00014.225] 01123.01174> dso: id=aeae5a3cc3b99502 base=0x6fb8a8425000 name=libunwind.so.1
[00014.225] 01123.01174> dso: id=2b6882f235d28d75ef453da68f39774ef99c8388 base=0x66005e9cb000 name=<vDSO>
[00014.225] 01123.01174> dso: id=7232e5e96679f46409d5b78b2b94bccadd2de16a base=0x5abf2bf90000 name=libc.so
[00014.225] 01123.01174> dso: id=05a640477f586007 base=0x44bdc7ba5000 name=libc++abi.so.1
[00014.225] 01123.01174> dso: id=49cf2f4e3d6e4c96 base=0x3fd1923b7000 name=app:/system/base_unittests
[00014.225] 01123.01174> dso: id=0ee2c1798ac9282426341de1bfb7f8d509010a73 base=0x1cd0e5f64000 name=libmxio.so
[00014.225] 01123.01174> dso: id=6682d527b69ec892f730643f7ff711e8f78f4116 base=0x1a00b90d1000 name=liblaunchpad.so
[00014.225] 01123.01174> dso: id=765f3f6ec08e043f base=0x850f9f3d000 name=libc++.so.2
[00014.225] 01123.01174> bt#01: pc 0x5abf2bfa915a sp 0x8c0069ff9c0 (libc.so,0x1915a)
[00014.225] 01123.01174> bt#02: pc 0x5abf2bfa8c9e sp 0x8c0069ffa10 (libc.so,0x18c9e)
[00014.225] 01123.01174> bt#03: pc 0x3fd192f6a0be sp 0x8c0069ffa20 (app:/system/base_unittests,0xbb30be)
[00014.226] 01123.01174> bt#04: pc 0x3fd192f7eaa4 sp 0x8c0069ffb70 (app:/system/base_unittests,0xbc7aa4)
[00014.226] 01123.01174> bt#05: pc 0x3fd192f7e6a1 sp 0x8c0069ffcc0 (app:/system/base_unittests,0xbc76a1)
[00014.226] 01123.01174> bt#06: pc 0x3fd192f7858a sp 0x8c0069ffe30 (app:/system/base_unittests,0xbc158a)
[00014.226] 01123.01174> bt#07: pc 0x3fd192f88ee2 sp 0x8c0069fffb0 (app:/system/base_unittests,0xbd1ee2)
[00014.226] 01123.01174> bt#08: pc 0x5abf2bfa963f sp 0x8c0069fffd0 (libc.so,0x1963f)
[00014.226] 01123.01174> bt#09: pc 0x5abf2c0210da sp 0x8c0069ffff0 (libc.so,0x910da)
[00014.226] 01123.01174> bt#10: pc 0 sp 0x8c006a00000
[00014.226] 01123.01174> bt#11: end
----- start symbolized stack
#01: ?? ??:0
#02: ?? ??:0
#03: base::ConditionVariable::Signal() at /work/cr/src/out/fuch/../../base/synchronization/condition_variable_posix.cc:?
#04: ~AutoSchedulerLock at /work/cr/src/out/fuch/../../base/task_scheduler/scheduler_lock.h:75
      (inlined by) base::internal::TaskTracker::DecrementNumPendingUndelayedTasks() at /work/cr/src/out/fuch/../../base/task_scheduler/task_tracker.cc:526
#05: base::internal::TaskTracker::RunNextTask(base::internal::Sequence*) at /work/cr/src/out/fuch/../../base/task_scheduler/task_tracker.cc:258
#06: base::internal::SchedulerWorker::Thread::ThreadMain() at /work/cr/src/out/fuch/../../base/task_scheduler/scheduler_worker.cc:80
#07: base::(anonymous namespace)::ThreadFunc(void*) at /work/cr/src/out/fuch/../../base/threading/platform_thread_posix.cc:73
#08: ?? ??:0
#09: ?? ??:0
----- end symbolized stack

It feels like TaskTracker and therefore state_ must be getting prematurely destroyed, but I don't immediately see how that could happen.

This one seems to be the most common

[00021.731] 02404.02591>     ContinueOnShutdown/TaskSchedulerTaskTrackerTest.WillPostAndRunLongTaskBeforeShutdown/0 (../../base/task_scheduler/task_tracker_unittest.cc:253)


[00018.398] 01123.01174> <== fatal exception: process [171679] thread [174237]
[00018.398] 01123.01174> <== fatal page fault, PC at 0x3edfe7863789
[00018.398] 01123.01174>  CS:                   0 RIP:     0x3edfe7863789 EFL:            0x10297 CR2:                  0
[00018.398] 01123.01174>  RAX:         0xfffffffe RBX:     0x4dad76be6140 RCX:                  0 RDX:                0x9
[00018.398] 01123.01174>  RSI:                  0 RDI:     0x4dad76be6140 RBP:                  0 RSP:     0x4b1398028b70
[00018.398] 01123.01174>   R8:                0x8  R9:         0x4dad76be R10:     0x43d78acef500 R11:                0x2
[00018.398] 01123.01174>  R12:                  0 R13:     0x4b1398028e50 R14:     0x4dad76be66e0 R15:                0x1
[00018.398] 01123.01174>  errc:               0x6
[00018.398] 01123.01174> bottom of user stack:
[00018.398] 01123.01174> 0x00004b1398028b70: b3916740 000074c9 e784fcac 00003edf |@g...t.......>..|
[00018.398] 01123.01174> 0x00004b1398028b80: 1295a2f8 00005f32 1295a2e0 00005f32 |....2_......2_..|
[00018.398] 01123.01174> 0x00004b1398028b90: 00000000 00000000 e785047f 00003edf |.............>..|
[00018.398] 01123.01174> 0x00004b1398028ba0: ffffffff 7fffffff faa878c8 00005941 |.........x..AY..|
[00018.398] 01123.01174> 0x00004b1398028bb0: 98028bc8 00004b13 9cf18420 000047e4 |.....K.. ....G..|
[00018.398] 01123.01174> 0x00004b1398028bc0: faa878c8 00005941 9cf12880 000047e4 |.x..AY...(...G..|
[00018.398] 01123.01174> 0x00004b1398028bd0: 9cf148a0 000047e4 00000000 00000000 |.H...G..........|
[00018.398] 01123.01174> 0x00004b1398028be0: 9cf18401 000047e4 dd300165 00007855 |.....G..e.0.Ux..|
[00018.398] 01123.01174> 0x00004b1398028bf0: 00000001 00000000 faa878c8 00000000 |.........x......|
[00018.398] 01123.01174> 0x00004b1398028c00: 98028d48 00004b13 00000008 00000000 |H....K..........|
[00018.398] 01123.01174> 0x00004b1398028c10: 98028c50 00004b13 00000001 00000000 |P....K..........|
[00018.398] 01123.01174> 0x00004b1398028c20: 3525f020 00006395 00000000 00000000 | .%5.c..........|
[00018.398] 01123.01174> 0x00004b1398028c30: 00000000 00000000 6f8006da 000041be |...........o.A..|
[00018.398] 01123.01174> 0x00004b1398028c40: 00000001 00000000 76be6178 00004dad |........xa.v.M..|
[00018.398] 01123.01174> 0x00004b1398028c50: 00000000 00000000 e785aace 00003edf |.............>..|
[00018.398] 01123.01174> 0x00004b1398028c60: 98028ca0 00004b13 76be6178 00004dad |.....K..xa.v.M..|
[00018.398] 01123.01174> arch: x86_64
[00018.399] 01123.01174> dso: id=7232e5e96679f46409d5b78b2b94bccadd2de16a base=0x7855dd290000 name=libc.so
[00018.399] 01123.01174> dso: id=05a640477f586007 base=0x6a240b40f000 name=libc++abi.so.1
[00018.399] 01123.01174> dso: id=6682d527b69ec892f730643f7ff711e8f78f4116 base=0x659a812ce000 name=liblaunchpad.so
[00018.399] 01123.01174> dso: id=2b6882f235d28d75ef453da68f39774ef99c8388 base=0x5bd2c8035000 name=<vDSO>
[00018.399] 01123.01174> dso: id=0ee2c1798ac9282426341de1bfb7f8d509010a73 base=0x5702e53c7000 name=libmxio.so
[00018.399] 01123.01174> dso: id=aeae5a3cc3b99502 base=0x5215cc633000 name=libunwind.so.1
[00018.399] 01123.01174> dso: id=765f3f6ec08e043f base=0x41be6f754000 name=libc++.so.2
[00018.399] 01123.01174> dso: id=2b214fc470c9e4e5 base=0x3edfe6ca3000 name=app:/system/base_unittests
[00018.399] 01123.01174> bt#01: pc 0x3edfe7863789 sp 0x4b1398028b70 (app:/system/base_unittests,0xbc0789)
[00018.400] 01123.01174> bt#02: pc 0x3edfe7863464 sp 0x4b1398028cc0 (app:/system/base_unittests,0xbc0464)
[00018.400] 01123.01174> bt#03: pc 0x3edfe785d35a sp 0x4b1398028e30 (app:/system/base_unittests,0xbba35a)
[00018.400] 01123.01174> bt#04: pc 0x3edfe786dcb2 sp 0x4b1398028fb0 (app:/system/base_unittests,0xbcacb2)
[00018.400] 01123.01174> bt#05: pc 0x7855dd2a963f sp 0x4b1398028fd0 (libc.so,0x1963f)
[00018.400] 01123.01174> bt#06: pc 0x7855dd3210da sp 0x4b1398028ff0 (libc.so,0x910da)
[00018.400] 01123.01174> bt#07: pc 0 sp 0x4b1398029000
[00018.401] 01123.01174> bt#08: end
----- start symbolized stack
#01: std::__2::__atomic_base<int, true>::fetch_add(int, std::__2::memory_order) volatile at /work/cr/src/out/fuch/../../third_party/fuchsia-sdk/sysroot/x86_64-fuchsia/include/c++/v1/atomic:1001
      (inlined by) base::subtle::NoBarrier_AtomicIncrement(int volatile*, int) at /work/cr/src/out/fuch/../../base/atomicops_internals_portable.h:86
      (inlined by) base::internal::TaskTracker::State::DecrementNumTasksBlockingShutdown() at /work/cr/src/out/fuch/../../base/task_scheduler/task_tracker.cc:157
      (inlined by) base::internal::TaskTracker::AfterRunTask(base::TaskShutdownBehavior) at /work/cr/src/out/fuch/../../base/task_scheduler/task_tracker.cc:503
#02: base::time_internal::TimeBase<base::TimeTicks>::is_null() const at /work/cr/src/out/fuch/../../base/time/time.h:338
      (inlined by) base::internal::TaskTracker::RunNextTask(base::internal::Sequence*) at /work/cr/src/out/fuch/../../base/task_scheduler/task_tracker.cc:248
#03: base::internal::SchedulerWorker::Thread::ThreadMain() at /work/cr/src/out/fuch/../../base/task_scheduler/scheduler_worker.cc:80
#04: base::(anonymous namespace)::ThreadFunc(void*) at /work/cr/src/out/fuch/../../base/threading/platform_thread_posix.cc:73
#05: ?? ??:0
#06: ?? ??:0
----- end symbolized stack

Comment 12 by gab@chromium.org, Jun 26 2017

Hey Scott, this looks like a racy access to an atomic variable... could it be a bug in atomics on Fuchsia?
That's definitely possible too. It happens "sometimes" but not often enough to really debug it yet. I'll try to work on making it more reproducible I think.
I just got a memory corruption warning running on a plain Linux build, so I think something's messed up here. Unfortunately no symbols on the build.

I'll try ASan or TSan I guess, maybe that'll clear things up.

Found a corrupted memory buffer in MallocBlock (may be offset from user ptr): buffer index: 0, buffer ptr: 0x3722fa4fd600, size of buffer: 680
Buffer byte 144 is 0x02 (should be 0xcd).
Buffer byte 145 is 0x00 (should be 0xcd).
Buffer byte 146 is 0x00 (should be 0xcd).
Buffer byte 147 is 0x00 (should be 0xcd).
Deleted by thread 0x7fbe5eb3a740
*** WARNING: Cannot convert addresses to symbols in output below.
*** Reason: Cannot find 'pprof' (is PPROF_PATH set correctly?)
*** If you cannot fix this, try running pprof directly.
    @ 0x7fbe5e65a233 
    @ 0x7fbe5e5f38ad 
    @ 0x7fbe5e5f33b3 
    @ 0x7fbe5e5f2d95 
    @ 0xc28b02 
    @ 0xf3ad4b 
    @ 0xf497de 
    @ 0xf3a9e2 
    @ 0xf2fa05 
    @ 0xf2ff5f 
    @ 0xf3530c 
    @ 0xf4d76e 
    @ 0xf3c2e2 
    @ 0xf34faf 
    @ 0xf82e51 
    @ 0xf81b92 
Memory was written to after being freed.  MallocBlock: 0x3722fa4fd600, user ptr: 0x3722fa4fd620, size: 680.  If you can't find the source of the error, try using ASan (http://code.google.com/p/address-sanitizer/), Valgrind, or Purify, or study the output of the deleter's stack printed above.
Received signal 11 SEGV_MAPERR 000000000039
#0 0x7fbe5e3263bb base::debug::StackTrace::StackTrace()
#1 0x7fbe5e3250bc base::debug::StackTrace::StackTrace()
#2 0x7fbe5e325ecf base::debug::(anonymous namespace)::StackDumpSignalHandler()
#3 0x7fbe5e7be390 <unknown>
#4 0x7fbe5e601cb8 tcmalloc::Abort()
#5 0x7fbe5e609cda LogPrintf()
#6 0x7fbe5e609b5b RAW_VLOG()
#7 0x7fbe5e63202e MallocBlock::CheckForCorruptedBuffer()
#8 0x7fbe5e631d1b MallocBlock::CheckForDanglingWrites()
#9 0x7fbe5e62fb76 MallocBlock::ProcessFreeQueue()
#10 0x7fbe5e633704 MallocBlock::Deallocate()
#11 0x7fbe5e62c3e5 DebugDeallocate()
#12 0x7fbe5e65a233 tc_free
#13 0x7fbe5e5f38ad (anonymous namespace)::TCFree()
#14 0x7fbe5e5f33b3 ShimCppDelete
#15 0x7fbe5e5f2d95 operator delete()
#16 0x00000053564d __gnu_cxx::new_allocator<>::deallocate()
#17 0x00000053561b std::string::_Rep::_M_destroy()
#18 0x000000535595 std::string::_Rep::_M_dispose()
#19 0x000000533c57 std::string::~string()
#20 0x000000f2f513 testing::TestInfo::~TestInfo()
#21 0x000000f2fce7 testing::internal::Delete<>()
#22 0x000000f4a304 std::for_each<>()
#23 0x000000f3b01e testing::internal::ForEach<>()
#24 0x000000f2fc68 testing::TestCase::~TestCase()
#25 0x000000f2fd19 testing::TestCase::~TestCase()
#26 0x000000f35a7b testing::internal::Delete<>()
#27 0x000000f42914 std::for_each<>()
#28 0x000000f3c7ee testing::internal::ForEach<>()
#29 0x000000f3590a testing::internal::UnitTestImpl::~UnitTestImpl()
#30 0x000000f35ae9 testing::internal::UnitTestImpl::~UnitTestImpl()
#31 0x000000f35611 testing::UnitTest::~UnitTest()
#32 0x7fbe5cf6aff8 <unknown>
#33 0x7fbe5cf6b045 exit
#34 0x7fbe5cf51837 __libc_start_main
#35 0x000000532c14 <unknown>
  r8: 00007fffe38c0b40  r9: 000000000000002a r10: 000000000000c0e0 r11: 0000000000000202
 r12: 0000000000000001 r13: 00007fbe5d2f6c40 r14: 0000000000000000 r15: 0000000000000000
  di: 0000000000000002  si: 00007fffe38c0b40  bp: 00007fffe38c0ae0  bx: 0000000000000000
  dx: 0000000000000125  ax: 0000000000000000  cx: 0000000000000000  sp: 00007fffe38c0ae0
  ip: 00007fbe5e601cb8 efl: 0000000000010246 cgf: 0000000000000033 erf: 0000000000000006
 trp: 000000000000000e msk: 0000000000000000 cr2: 0000000000000039
[end of stack trace]
Calling _exit(1). Core file will not be generated.
[1341/2480] TaskSchedulerSingleThreadTaskRunnerManagerTest.DifferentThreadsUsed (2 ms)
[ RUN      ] TaskSchedulerSingleThreadTaskRunnerManagerTest.SameThreadUsed
[       OK ] TaskSchedulerSingleThreadTaskRunnerManagerTest.SameThreadUsed (8 ms)
[ RUN      ] TaskSchedulerSingleThreadTaskRunnerManagerTest.RunsTasksOnCurrentThread
[       OK ] TaskSchedulerSingleThreadTaskRunnerManagerTest.RunsTasksOnCurrentThread (14 ms)
[ RUN      ] TaskSchedulerSingleThreadTaskRunnerManagerTest.SharedWithBaseSyncPrimitivesDCHECKs

[WARNING] ../../third_party/googletest/src/googletest/src/gtest-death-test.cc:836:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 3 threads.
[       OK ] TaskSchedulerSingleThreadTaskRunnerManagerTest.SharedWithBaseSyncPrimitivesDCHECKs (60 ms)
[----------] 4 tests from TaskSchedulerSingleThreadTaskRunnerManagerTest (85 ms total)

[----------] 1 test from TaskSchedulerSingleThreadTaskRunnerManagerJoinTest
[ RUN      ] TaskSchedulerSingleThreadTaskRunnerManagerJoinTest.ConcurrentJoin
[       OK ] TaskSchedulerSingleThreadTaskRunnerManagerJoinTest.ConcurrentJoin (9 ms)
[----------] 1 test from TaskSchedulerSingleThreadTaskRunnerManagerJoinTest (9 ms total)

[----------] 5 tests from TaskSchedulerWorkerStackTest
[ RUN      ] TaskSchedulerWorkerStackTest.PushPop
[       OK ] TaskSchedulerWorkerStackTest.PushPop (1 ms)
[ RUN      ] TaskSchedulerWorkerStackTest.PeekPop
[       OK ] TaskSchedulerWorkerStackTest.PeekPop (0 ms)
[ RUN      ] TaskSchedulerWorkerStackTest.Contains
[       OK ] TaskSchedulerWorkerStackTest.Contains (1 ms)
[ RUN      ] TaskSchedulerWorkerStackTest.Remove
[       OK ] TaskSchedulerWorkerStackTest.Remove (0 ms)
[ RUN      ] TaskSchedulerWorkerStackTest.PushAfterRemove
[       OK ] TaskSchedulerWorkerStackTest.PushAfterRemove (1 ms)
[----------] 5 tests from TaskSchedulerWorkerStackTest (3 ms total)

[----------] Global test environment tear-down
[==========] 10 tests from 3 test cases ran. (99 ms total)
[  PASSED  ] 10 tests.
Found a corrupted memory buffer in MallocBlock (may be offset from user ptr): buffer index: 0, buffer ptr: 0x3722fa4fd600, size of buffer: 680
Buffer byte 144 is 0x02 (should be 0xcd).
Buffer byte 145 is 0x00 (should be 0xcd).
Buffer byte 146 is 0x00 (should be 0xcd).
Buffer byte 147 is 0x00 (should be 0xcd).
Deleted by thread 0x7fbe5eb3a740
*** WARNING: Cannot convert addresses to symbols in output below.
*** Reason: Cannot find 'pprof' (is PPROF_PATH set correctly?)
*** If you cannot fix this, try running pprof directly.
    @ 0x7fbe5e65a233 
    @ 0x7fbe5e5f38ad 
    @ 0x7fbe5e5f33b3 
    @ 0x7fbe5e5f2d95 
    @ 0xc28b02 
    @ 0xf3ad4b 
    @ 0xf497de 
    @ 0xf3a9e2 
    @ 0xf2fa05 
    @ 0xf2ff5f 
    @ 0xf3530c 
    @ 0xf4d76e 
    @ 0xf3c2e2 
    @ 0xf34faf 
    @ 0xf82e51 
    @ 0xf81b92 
Memory was written to after being freed.  MallocBlock: 0x3722fa4fd600, user ptr: 0x3722fa4fd620, size: 680.  If you can't find the source of the error, try using ASan (http://code.google.com/p/address-sanitizer/), Valgrind, or Purify, or study the output of the deleter's stack printed above.
Received signal 11 SEGV_MAPERR 000000000039
#0 0x7fbe5e3263bb base::debug::StackTrace::StackTrace()
#1 0x7fbe5e3250bc base::debug::StackTrace::StackTrace()
#2 0x7fbe5e325ecf base::debug::(anonymous namespace)::StackDumpSignalHandler()
#3 0x7fbe5e7be390 <unknown>
#4 0x7fbe5e601cb8 tcmalloc::Abort()
#5 0x7fbe5e609cda LogPrintf()
#6 0x7fbe5e609b5b RAW_VLOG()
#7 0x7fbe5e63202e MallocBlock::CheckForCorruptedBuffer()
#8 0x7fbe5e631d1b MallocBlock::CheckForDanglingWrites()
#9 0x7fbe5e62fb76 MallocBlock::ProcessFreeQueue()
#10 0x7fbe5e633704 MallocBlock::Deallocate()
#11 0x7fbe5e62c3e5 DebugDeallocate()
#12 0x7fbe5e65a233 tc_free
#13 0x7fbe5e5f38ad (anonymous namespace)::TCFree()
#14 0x7fbe5e5f33b3 ShimCppDelete
#15 0x7fbe5e5f2d95 operator delete()
#16 0x00000053564d __gnu_cxx::new_allocator<>::deallocate()
#17 0x00000053561b std::string::_Rep::_M_destroy()
#18 0x000000535595 std::string::_Rep::_M_dispose()
#19 0x000000533c57 std::string::~string()
#20 0x000000f2f513 testing::TestInfo::~TestInfo()
#21 0x000000f2fce7 testing::internal::Delete<>()
#22 0x000000f4a304 std::for_each<>()
#23 0x000000f3b01e testing::internal::ForEach<>()
#24 0x000000f2fc68 testing::TestCase::~TestCase()
#25 0x000000f2fd19 testing::TestCase::~TestCase()
#26 0x000000f35a7b testing::internal::Delete<>()
#27 0x000000f42914 std::for_each<>()
#28 0x000000f3c7ee testing::internal::ForEach<>()
#29 0x000000f3590a testing::internal::UnitTestImpl::~UnitTestImpl()
#30 0x000000f35ae9 testing::internal::UnitTestImpl::~UnitTestImpl()
#31 0x000000f35611 testing::UnitTest::~UnitTest()
#32 0x7fbe5cf6aff8 <unknown>
#33 0x7fbe5cf6b045 exit
#34 0x7fbe5cf51837 __libc_start_main
#35 0x000000532c14 <unknown>
  r8: 00007fffe38c0b40  r9: 000000000000002a r10: 000000000000c0e0 r11: 0000000000000202
 r12: 0000000000000001 r13: 00007fbe5d2f6c40 r14: 0000000000000000 r15: 0000000000000000
  di: 0000000000000002  si: 00007fffe38c0b40  bp: 00007fffe38c0ae0  bx: 0000000000000000
  dx: 0000000000000125  ax: 0000000000000000  cx: 0000000000000000  sp: 00007fffe38c0ae0
  ip: 00007fbe5e601cb8 efl: 0000000000010246 cgf: 0000000000000033 erf: 0000000000000006
 trp: 000000000000000e msk: 0000000000000000 cr2: 0000000000000039
[end of stack trace]
Calling _exit(1). Core file will not be generated.

Comment 15 by gab@chromium.org, Jun 26 2017

@fdoray/robliao: can you think of a situation that would cause a use-after-free of TaskTracker from SchedulerWorker::ThreadMain() in some scheduler unit tests? (the DecrementNumTasksBlockingShutdown() instance on the stack is likely bogus; it just happens to do the first member access on the freed TaskTracker in that stack, which results in the first deref of the invalid |this|)

@scottmg: trying to repro with ASAN or TSAN sure would help
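
For readers unfamiliar with how such a use-after-free shows up in a backtrace, here is a small illustrative sketch (plain C++ with made-up names, not the Chromium code): the faulting frame is simply whichever member of the freed object is touched first, which is why the DecrementNumTasksBlockingShutdown() frame above is probably incidental rather than the real bug.

  #include <atomic>
  #include <cstdio>

  struct FakeState {
    std::atomic<int> num_tasks_blocking_shutdown{0};
  };

  struct FakeTracker {
    FakeState state;
    void AfterRunTask() {
      // If |this| is dangling, the crash PC lands here, in the first member
      // access (the atomic increment), regardless of which method the caller
      // thought it was in.
      state.num_tasks_blocking_shutdown.fetch_add(1, std::memory_order_relaxed);
    }
  };

  int main() {
    FakeTracker tracker;
    tracker.AfterRunTask();  // fine while |tracker| is alive
    std::printf("%d\n", tracker.state.num_tasks_blocking_shutdown.load());
    return 0;
  }
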
Here are a few hits under ASan on Linux:

https://gist.github.com/sgraham/7792bce3a1748cbecf0f2a4bb5c05a14
https://gist.github.com/sgraham/54f7bd309bbc34e8bbc5a6fb471f9c7a
https://gist.github.com/sgraham/f5390f5b93e234860538066fb18063c1

I seem to be able to repro fairly reliably on my current config, so I'll poke at it a bit. The 54f7... one looks similar to the stack I frequently see on Fuchsia.

Looks like the test is getting torn down, but the SchedulerWorker::Thread hasn't shut down yet. (?)

Comment 17 by gab@chromium.org, Jun 27 2017

Ok, so worker threads are somehow outliving TaskTracker, which means that

  virtual void TearDownSingleThreadTaskRunnerManager() {
    single_thread_task_runner_manager_->JoinForTesting();
    single_thread_task_runner_manager_.reset();
  }

which is called from TearDown(), doesn't properly join all worker threads before TaskTracker is destroyed along with the fixture?

@robliao?

Most stacks are in TaskSchedulerSingleThreadTaskRunnerManagerTest, but there are a few in other fixtures above, so this might be two different bugs. At this point I'd lean toward saying these are likely test-only races.
Nico pointed out https://build.chromium.org/p/chromium.memory/builders/Linux%20ASan%20LSan%20Tests%20%281%29/builds/36788 running base_unittests happily. I can repro maybe 50% of the time, but I'm on 16.04, so maybe that's a relevant variable somehow. Otherwise the config is fairly plain, I think:

sgraham@river:/work/cr/src$ cat out/Debug/args.gn
proprietary_codecs = true
ffmpeg_branding = "Chrome"
is_debug = true
symbol_level = 2
is_component_build = true
is_clang = true
enable_nacl = false
use_goma = true
is_asan = true

(Or maybe due to component?)
On Linux it does seem to be only in TaskSchedulerSingleThreadTaskRunnerManagerTest. Unfortunately I can't get it to repro when running only those tests, only when running all tests, which makes it awkward to narrow down.

Another one that I hadn't seen before, but indirectly still is that test fixture: https://gist.github.com/sgraham/a240720e9daace1c0fe6b05a13d3da6a

Hmm.

My current suspicion is aimed at SchedulerSingleThreadTaskRunnerManager::JoinForTesting() doing ReleaseSharedSchedulerWorkers(). That could release a worker via Cleanup(), removing it from workers_, but then nothing would join it, so it's still running after the test's TearDown.

But I haven't totally convinced myself of that yet.

Hmm, but then that'd only fail in SHARED mode, but it also fails in DEDICATED mode. So maybe that's not it.
Thought I'd hit this in a new and exciting place, but no it's just some other broken thing O_o https://bugs.chromium.org/p/chromium/issues/detail?id=737311.
I didn't notice this in the spray of long class names in the output before, but in https://gist.github.com/sgraham/49e9912258fff6068b2637e7ae016912 it seems clear that I've been copying the wrong test name in most cases. In particular, it _appears_ that

  AllModes/TaskSchedulerSingleThreadTaskRunnerManagerCommonTest.PostTaskAfterShutdown/0

is crashing, but the memory that's being U-A-F'd is both allocated and freed by

  base::internal::TaskSchedulerSingleThreadTaskRunnerManagerCommonTest_ThreadNamesSet_Test

that is, the test immediately before it. The crashing thread (T12) was also created by base::internal::TaskSchedulerSingleThreadTaskRunnerManagerCommonTest_ThreadNamesSet_Test::TestBody(). So the fact that it's still running during the subsequent test seems likely-not-good.

I would have thought I could make this more reproducible by adding delays in the SchedulerWorker thread proc, but I haven't been able to, so it's more subtle than that, somehow.

I might try scripting a soak run that scrapes the ASan output, to see which tests are really the ones leaking threads that outlive the TaskTracker.
Cc: robliao@chromium.org
Ah, I think I found one.

I read this comment https://cs.chromium.org/chromium/src/base/task_scheduler/scheduler_worker.cc?l=328 and even fell for it the first time! But even though the checks are all in one statement, there's still a race. :)

If should_exit_ and join_called_for_testing_ are checked first, then the thread is suspended, then the test's torn down, then TaskTracker is gone when the rest of the statement is evaluated.
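
To make the shape of that race concrete, here is a compile-only sketch with approximate names (not the real SchedulerWorker code): even though the whole check is one statement, the worker thread can be preempted between the flag loads and the task_tracker_ dereference, so if the fixture destroys the TaskTracker in that window the final term is a use-after-free.

  #include <atomic>

  // Stand-in for base::internal::TaskTracker; in the tests it is owned by the
  // fixture and destroyed during TearDown().
  struct FakeTaskTracker {
    bool IsShutdownComplete() const { return shutdown_complete_.load(); }
    std::atomic<bool> shutdown_complete_{false};
  };

  class FakeSchedulerWorker {
   public:
    explicit FakeSchedulerWorker(FakeTaskTracker* tracker)
        : task_tracker_(tracker) {}

    bool ShouldExit() const {
      // One statement, but not one atomic step: after the first two loads the
      // worker thread can be suspended, the test torn down, and
      // |task_tracker_| left dangling before IsShutdownComplete() runs.
      return should_exit_.load() || join_called_for_testing_.load() ||
             task_tracker_->IsShutdownComplete();
    }

   private:
    std::atomic<bool> should_exit_{false};
    std::atomic<bool> join_called_for_testing_{false};
    FakeTaskTracker* task_tracker_;  // not owned; owner may be destroyed first
  };

  int main() {
    FakeTaskTracker tracker;
    FakeSchedulerWorker worker(&tracker);
    return worker.ShouldExit() ? 1 : 0;
  }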

With https://chromium-review.googlesource.com/c/551358/1/base/task_scheduler/scheduler_worker.cc applied, even without ASan, a Linux build spews lock errors, because TaskTracker and shutdown_lock_ have been destroyed by the time we get to https://cs.chromium.org/chromium/src/base/task_scheduler/task_tracker.cc?l=268.

I think there might be more similar issues though, because not all the crash stacks are in the lock code or SchedulerWorker::ShouldExit(). 

Comment 24 by gab@chromium.org, Jun 28 2017

Cc: scottmg@chromium.org
Owner: gab@chromium.org
Aaah, nice one. I think other stacks may also be related: e.g. say ShouldExit() returns false (so it doesn't crash there) but TaskTracker is freed immediately after; the body executed based on !ShouldExit() then uses TaskTracker and races with it being freed at any moment.

We will need to be more strict about joining SingleThread workers in unit tests.

We will fix.

Comment 25 by gab@chromium.org, Jun 28 2017

@scottmg: Can you try https://chromium-review.googlesource.com/c/552379/ locally?
Still seeing hits in ShouldExit() with that patch:

[2048/2446] Allocator/StatisticsRecorderTest.SetCallbackFailsWithoutHistogramTest/0 (1 ms)
[2049/2446] Allocator/StatisticsRecorderTest.SetCallbackFailsWithoutHistogramTest/1 (0 ms)
[2050/2446] Allocator/StatisticsRecorderTest.SetCallbackFailsWithHistogramTest/0 (0 ms)
[2051/2446] AllModes/TaskSchedulerSingleThreadTaskRunnerManagerCommonTest.PostTaskAfterShutdown/1 (24 ms)
[ RUN      ] AllModes/TaskSchedulerSingleThreadTaskRunnerManagerCommonTest.PostDelayedTask/0
[24126:24208:0628/100035.513555:1209272186027:FATAL:lock_impl_posix.cc(63)] Check failed: rv == 0 (22 vs. 0). Invalid argument
    #0 0x00000066dd31 in __interceptor_backtrace ??:0:0
    #1 0x7f1f2a07422a in base::debug::StackTrace::StackTrace(unsigned long) /work/cr/src/out/Debug/../../base/debug/stack_trace_posix.cc:729:41
    #2 0x7f1f2a06e30c in base::debug::StackTrace::StackTrace() /work/cr/src/out/Debug/../../base/debug/stack_trace.cc:199:28
    #3 0x7f1f2a201a3c in logging::LogMessage::~LogMessage() /work/cr/src/out/Debug/../../base/logging.cc:553:29
    #4 0x7f1f2a5dbe6a in base::internal::LockImpl::Lock() /work/cr/src/out/Debug/../../base/synchronization/lock_impl_posix.cc:63:3
    #5 0x7f1f29f6d963 in base::Lock::Acquire() /work/cr/src/out/Debug/../../base/synchronization/lock.h:45:11
    #6 0x7f1f2a648869 in base::internal::SchedulerLockImpl::Acquire() /work/cr/src/out/Debug/../../base/task_scheduler/scheduler_lock_impl.cc:137:9
    #7 0x7f1f2a62b4b1 in base::internal::AutoSchedulerLock::AutoSchedulerLock(base::internal::SchedulerLock&) /work/cr/src/out/Debug/../../base/task_scheduler/scheduler_lock.h:71:11
    #8 0x7f1f2a6af036 in base::internal::TaskTracker::IsShutdownComplete() const /work/cr/src/out/Debug/../../base/task_scheduler/task_tracker.cc:268:21
    #9 0x7f1f2a671aea in base::internal::SchedulerWorker::ShouldExit() /work/cr/src/out/Debug/../../base/task_scheduler/scheduler_worker.cc:328:25
    #10 0x7f1f2a6735a3 in base::internal::SchedulerWorker::Thread::ThreadMain() /work/cr/src/out/Debug/../../base/task_scheduler/scheduler_worker.cc:54:21
    #11 0x7f1f2a6e28f9 in base::(anonymous namespace)::ThreadFunc(void*) /work/cr/src/out/Debug/../../base/threading/platform_thread_posix.cc:71:13
    #12 0x7f1f2886f6ba in start_thread ??:0:0
    #13 0x7f1f27a333dd in clone /build/glibc-bfm8X4/glibc-2.23/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:109:0

...

[2048/2446] MemoryPeakDetectorTest.PeakCallbackThrottling (128 ms)
[2049/2446] MemoryPeakDetectorTest.StdDev (189 ms)
[2050/2446] MemoryPeakDetectorTest.Throttle (197 ms)
[2051/2446] AllModes/TaskSchedulerSingleThreadTaskRunnerManagerCommonTest.PostTaskAfterShutdown/1 (4 ms)
[ RUN      ] AllModes/TaskSchedulerSingleThreadTaskRunnerManagerCommonTest.PostDelayedTask/0
[26976:27040:0628/100253.888508:1209410561037:FATAL:lock_impl_posix.cc(63)] Check failed: rv == 0 (22 vs. 0). Invalid argument
    #0 0x00000066dd31 in __interceptor_backtrace ??:0:0
    #1 0x7fc469c0e22a in base::debug::StackTrace::StackTrace(unsigned long) /work/cr/src/out/Debug/../../base/debug/stack_trace_posix.cc:729:41
    #2 0x7fc469c0830c in base::debug::StackTrace::StackTrace() /work/cr/src/out/Debug/../../base/debug/stack_trace.cc:199:28
    #3 0x7fc469d9ba3c in logging::LogMessage::~LogMessage() /work/cr/src/out/Debug/../../base/logging.cc:553:29
    #4 0x7fc46a175e6a in base::internal::LockImpl::Lock() /work/cr/src/out/Debug/../../base/synchronization/lock_impl_posix.cc:63:3
    #5 0x7fc469b07963 in base::Lock::Acquire() /work/cr/src/out/Debug/../../base/synchronization/lock.h:45:11
    #6 0x7fc46a1e2869 in base::internal::SchedulerLockImpl::Acquire() /work/cr/src/out/Debug/../../base/task_scheduler/scheduler_lock_impl.cc:137:9
    #7 0x7fc46a1c54b1 in base::internal::AutoSchedulerLock::AutoSchedulerLock(base::internal::SchedulerLock&) /work/cr/src/out/Debug/../../base/task_scheduler/scheduler_lock.h:71:11
    #8 0x7fc46a249036 in base::internal::TaskTracker::IsShutdownComplete() const /work/cr/src/out/Debug/../../base/task_scheduler/task_tracker.cc:268:21
    #9 0x7fc46a20baea in base::internal::SchedulerWorker::ShouldExit() /work/cr/src/out/Debug/../../base/task_scheduler/scheduler_worker.cc:328:25
    #10 0x7fc46a20d5a3 in base::internal::SchedulerWorker::Thread::ThreadMain() /work/cr/src/out/Debug/../../base/task_scheduler/scheduler_worker.cc:54:21
    #11 0x7fc46a27c8f9 in base::(anonymous namespace)::ThreadFunc(void*) /work/cr/src/out/Debug/../../base/threading/platform_thread_posix.cc:71:13
    #12 0x7fc4684096ba in start_thread ??:0:0
    #13 0x7fc4675cd3dd in clone /build/glibc-bfm8X4/glibc-2.23/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:109:0

[2052/2446] AllModes/TaskSchedulerSingleThreadTaskRunnerManagerCommonTest.PostDelayedTask/0 (CRASHED)


I don't seem to hit the crash on Fuchsia though with that CL, so that's progress! :)
Another problem is in ~SchedulerSingleThreadTaskRunner(). In DEDICATED mode, that releases worker threads without joining them.
Project Member

Comment 28 by bugdroid1@chromium.org, Jun 29 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/02c80fa9c3cc98fe3ae5e94e881f4a05d9cd174c

commit 02c80fa9c3cc98fe3ae5e94e881f4a05d9cd174c
Author: Scott Graham <scottmg@chromium.org>
Date: Thu Jun 29 00:52:30 2017

fuchsia: Disable TaskScheduler tests

This is unfortunately ~300 tests, but I'd like to get the bot to a
reliable state (or at least identify the next racy code).

Bug:  735701 
Change-Id: I45aaac819dcb15fda1cc64a9407142a885fe193b
Reviewed-on: https://chromium-review.googlesource.com/553637
Reviewed-by: Nico Weber <thakis@chromium.org>
Commit-Queue: Scott Graham <scottmg@chromium.org>
Cr-Commit-Position: refs/heads/master@{#483234}
[modify] https://crrev.com/02c80fa9c3cc98fe3ae5e94e881f4a05d9cd174c/testing/buildbot/filters/fuchsia.base_unittests.filter

Comment 29 by w...@chromium.org, Jul 15 2017

Components: Internals>PlatformIntegration
Project Member

Comment 30 by bugdroid1@chromium.org, Jul 17 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a2136defe9b4d7c0d8ea340862e66b3ab1560e00

commit a2136defe9b4d7c0d8ea340862e66b3ab1560e00
Author: Robert Liao <robliao@chromium.org>
Date: Mon Jul 17 19:40:45 2017

Release Shared SchedulerWorkers After Joining

If shared SchedulerWorkers are released before joining, they are
effectively detached and may outlive the
SchedulerSingleThreadTaskRunnerManager.

BUG= 735701 

Change-Id: Ie83b50b99698a4059041454f579f4a2c1fbaabc0
Reviewed-on: https://chromium-review.googlesource.com/572233
Reviewed-by: Gabriel Charette <gab@chromium.org>
Commit-Queue: Robert Liao <robliao@chromium.org>
Cr-Commit-Position: refs/heads/master@{#487197}
[modify] https://crrev.com/a2136defe9b4d7c0d8ea340862e66b3ab1560e00/base/task_scheduler/scheduler_single_thread_task_runner_manager.cc
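
For illustration, a minimal standalone sketch (generic std::thread code with hypothetical names, not the actual SchedulerWorker/manager classes) of the hazard the commit above describes: a worker that is released or detached before being joined can keep running after its owner, and the test fixture, have been destroyed.

  #include <atomic>
  #include <chrono>
  #include <thread>

  class FakeWorker {
   public:
    void Start() {
      thread_ = std::thread([this] {
        while (!stop_.load())
          std::this_thread::sleep_for(std::chrono::milliseconds(1));
      });
    }

    // Unsafe teardown: detaches instead of joining, so the loop may still be
    // running (and touching soon-to-be-freed state) after the owner is gone.
    void ReleaseWithoutJoin() {
      stop_.store(true);
      thread_.detach();
    }

    // Safe teardown, matching the "release after joining" ordering: join
    // first, so the thread is guaranteed to be done before anything it uses
    // is destroyed.
    void JoinThenRelease() {
      stop_.store(true);
      if (thread_.joinable())
        thread_.join();
    }

   private:
    std::atomic<bool> stop_{false};
    std::thread thread_;
  };

  int main() {
    FakeWorker worker;
    worker.Start();
    worker.JoinThenRelease();
    return 0;
  }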

Comment 31 by gab@chromium.org, Jul 17 2017

Thanks Rob, @scottmg can you let us know what is still failing (ideally
under ASAN or TSAN)?

On Mon, Jul 17, 2017 at 15:42, bugdroid1 via monorail <monorail+v2.3275348242@chromium.org> wrote:

Comment 32 by w...@chromium.org, Jul 17 2017

scottmg is OOO right now. We have all the TaskScheduler tests filtered out, but I'm doing a try run with the filter removed to see what fails. We don't yet have ASan/TSan, though.

Comment 33 by gab@chromium.org, Jul 17 2017

Ok thanks, Scott was doing ASAN/TSAN locally I think to pinpoint the issues

On Mon, Jul 17, 2017 at 17:39, wez via monorail <monorail+v2.1573466780@chromium.org> wrote:
Project Member

Comment 34 by bugdroid1@chromium.org, Jul 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4a3ab2babaf8960a0c4747d3fd8e67b66facdc83

commit 4a3ab2babaf8960a0c4747d3fd8e67b66facdc83
Author: Wez <wez@chromium.org>
Date: Thu Jul 20 23:55:39 2017

Remove *TaskScheduler* filter, and more specific suppressions.

Bug:  735701 
Change-Id: I99b4e0c075c8c4787ae84eaf45351e5e69d1fa20
Reviewed-on: https://chromium-review.googlesource.com/575308
Reviewed-by: Sergey Ulanov <sergeyu@chromium.org>
Commit-Queue: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#488496}
[modify] https://crrev.com/4a3ab2babaf8960a0c4747d3fd8e67b66facdc83/testing/buildbot/filters/fuchsia.base_unittests.filter

Comment 35 by gab@chromium.org, Jul 27 2017

Did r488496 re-enable the TaskScheduler tests on Fuchsia? Are we good here? If not, let's sprint through the issues, as they could be product-facing issues.
Status: Fixed (was: Started)
Yes, the tests are enabled now.

Comment 37 by gab@chromium.org, Jul 27 2017

Awesome, so r487197 was the only fix required?
Yes, there were other crbugs for other issues in test code. But this one only looked like multiple tests were failing because it was crashing in the test after the one that was actually leaking the worker.
Oh wow, r487197 fixed it? Nice!
Status: Started (was: Fixed)
Summary: Various TaskSchedulerImplTest sometimes crash/hang on Fuchsia (was: Various TaskSchedulerImplTest sometimes crash on Fuchsia)
Looks like we're not quite out of the woods yet; still seeing hangs on the Fuchsia waterfall:

https://luci-milo.appspot.com/buildbot/chromium.fyi/Fuchsia/7725

[01150.774] 02484.02512> 5 tests timed out:
[01150.775] 02484.02512>     HeapAndPersistent/HistogramTest.FactoryTime/1 (../../base/metrics/histogram_unittest.cc:94)
[01150.775] 02484.02512>     OneTraitsExecutionModePair/TaskSchedulerImplTest.PostDelayedTaskWithTraitsWithDelay/12 (../../base/task_scheduler/task_scheduler_impl_unittest.cc:221)
[01150.775] 02484.02512>     OneTraitsExecutionModePair/TaskSchedulerImplTest.PostDelayedTaskWithTraitsWithDelay/13 (../../base/task_scheduler/task_scheduler_impl_unittest.cc:221)
[01150.775] 02484.02512>     OneTraitsExecutionModePair/TaskSchedulerImplTest.PostDelayedTaskWithTraitsWithDelay/3 (../../base/task_scheduler/task_scheduler_impl_unittest.cc:221)
[01150.775] 02484.02512>     OneTraitsExecutionModePair/TaskSchedulerImplTest.PostTasksViaTaskRunner/1 (../../base/task_scheduler/task_scheduler_impl_unittest.cc:221)


https://luci-milo.appspot.com/buildbot/chromium.fyi/Fuchsia/7727
[00390.568] 02602.02675> 3 tests timed out:
[00390.569] 02602.02675>     OneTraitsExecutionModePair/TaskSchedulerImplTest.PostDelayedTaskWithTraitsWithDelay/14 (../../base/task_scheduler/task_scheduler_impl_unittest.cc:221)
[00390.569] 02602.02675>     OneTraitsExecutionModePair/TaskSchedulerImplTest.PostDelayedTaskWithTraitsWithDelay/7 (../../base/task_scheduler/task_scheduler_impl_unittest.cc:221)
[00390.569] 02602.02675>     OneTraitsExecutionModePair/TaskSchedulerImplTest.PostTasksViaTaskRunner/11 (../../base/task_scheduler/task_scheduler_impl_unittest.cc:221)

Blocking: 738275

Comment 42 by gab@chromium.org, Jul 31 2017

Status: Assigned (was: Started)
Any way to get better logs? Try under TSAN again maybe?

Comment 43 by gab@chromium.org, Jul 31 2017

Cc: gab@chromium.org
Labels: Needs-Feedback
Owner: scottmg@chromium.org
There's no TSan on Fuchsia yet. I tried TSan on Linux for a while, but it didn't repro there.
Project Member

Comment 45 by bugdroid1@chromium.org, Aug 2 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/9759fff7fc48f5d70e0555d51a54540536cf9cc7

commit 9759fff7fc48f5d70e0555d51a54540536cf9cc7
Author: Scott Graham <scottmg@chromium.org>
Date: Wed Aug 02 01:53:03 2017

fuchsia: Filter out some known-flake tests in base_unittests

Bug: 738275,  735701 ,  750370 ,  750756 
Change-Id: Ibaa143616ce60ae6d94854f9cadc8daf2d84a37c
Reviewed-on: https://chromium-review.googlesource.com/596914
Reviewed-by: Wez <wez@chromium.org>
Commit-Queue: Scott Graham <scottmg@chromium.org>
Cr-Commit-Position: refs/heads/master@{#491200}
[modify] https://crrev.com/9759fff7fc48f5d70e0555d51a54540536cf9cc7/testing/buildbot/filters/fuchsia.base_unittests.filter

Some different ones flaked here:

https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.fyi%2FFuchsia%2F8174%2F%2B%2Frecipes%2Fsteps%2Fbase_unittests%2F0%2Fstdout

[01212.357] 02669.02715>     TaskSchedulerImplTest.FileDescriptorWatcherNoOpsAfterShutdown (../../base/task_scheduler/task_scheduler_impl_unittest.cc:471)
[01212.357] 02669.02715>     TaskSchedulerImplTest.MultipleTraitsExecutionModePairs (../../base/task_scheduler/task_scheduler_impl_unittest.cc:347)
[01212.357] 02669.02715>     TaskSchedulerWorkerPoolHistogramTest.NumTasksBetweenWaits (../../base/task_scheduler/scheduler_worker_pool_impl_unittest.cc:559)

Comment 47 by gab@chromium.org, Aug 4 2017

Any way to get the hang stack?
Another perhaps more debuggable flake:

TaskSchedulerWorkerPoolHistogramTest.NumTasksBetweenWaitsWithCleanup

[00119.528] 02951.03022> [1468/2456] TemplateUtil.Less (0 ms)
[00119.528] 02951.03022> [1469/2456] HistogramTesterTest.Scope (4 ms)
[00119.528] 02951.03022> [1470/2456] HistogramTesterTest.GetHistogramSamplesSinceCreationNotNull (1 ms)
[00119.579] 02951.03022> [1471/2456] TaskSchedulerWorkerPoolHistogramTest.NumTasksBetweenWaits (44 ms)
[00119.580] 02951.03022> [1472/2456] TaskSchedulerWorkerPoolHistogramTest.NumTasksBetweenWaitsWithCleanup (1692 ms)
[00119.580] 02951.03022> [ RUN      ] TaskSchedulerWorkerPoolHistogramTest.NumTasksBeforeCleanup
[00119.580] 02951.03022> ../../base/task_scheduler/scheduler_worker_pool_impl_unittest.cc:767: Failure
[00119.580] 02951.03022>       Expected: 1
[00119.580] 02951.03022> To be equal to: histogram->SnapshotSamples()->GetCount(0)
[00119.580] 02951.03022>       Which is: 0
[00119.580] 02951.03022> [  FAILED  ] TaskSchedulerWorkerPoolHistogramTest.NumTasksBeforeCleanup (1395 ms)
[00119.580] 02951.03022> [1473/2456] TaskSchedulerWorkerPoolHistogramTest.NumTasksBeforeCleanup (1395 ms)
[00119.581] 02951.03022> [1474/2456] TaskSchedulerWorkerPoolStandbyPolicyTest.InitOne (18 ms)
[00119.581] 02951.03022> [1475/2456] TaskSchedulerWorkerPoolStandbyPolicyTest.VerifyStandbyThread (1549 ms)
[00119.581] 02951.03022> [1476/2456] TaskSchedulerWorkerStackTest.PushPop (2 ms)
[00119.581] 02951.03022> [1477/2456] TaskSchedulerWorkerStackTest.PeekPop (6 ms)
[00119.581] 02951.03022> [1478/2456] TaskSchedulerWorkerStackTest.Contains (2 ms)

Or maybe something completely different.

Project Member

Comment 49 by bugdroid1@chromium.org, Aug 12 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c4822834e91f44835e3fcdd49fcf9ccafe6a84ec

commit c4822834e91f44835e3fcdd49fcf9ccafe6a84ec
Author: Wez <wez@chromium.org>
Date: Sat Aug 12 01:59:30 2017

Fix PlatformThread::CurrentId() under Fuchsia.

Under Fuchsia CurrentId() was returning the pthread_t result of calling
pthread_self() as the PlatformThreadId, which under POSIX platforms is
defined to pid_t. Under Fuchsia these are an unsigned long derived from
the pthread's internal point (i.e. 64-bit), and an int (32-bit)
respectively. This is changed to use the platform thread handle, which
is a 32-bit value as well.

Bug:  751894 ,  735701 , 738275
Change-Id: Ie5ab21c3c01d609b9b2d5b6a43d04ae57990eaf3
Reviewed-on: https://chromium-review.googlesource.com/611853
Commit-Queue: Wez <wez@chromium.org>
Reviewed-by: Daniel Cheng <dcheng@chromium.org>
Reviewed-by: Scott Graham <scottmg@chromium.org>
Cr-Commit-Position: refs/heads/master@{#493935}
[modify] https://crrev.com/c4822834e91f44835e3fcdd49fcf9ccafe6a84ec/base/threading/platform_thread.h
[modify] https://crrev.com/c4822834e91f44835e3fcdd49fcf9ccafe6a84ec/base/threading/platform_thread_posix.cc
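
For illustration, a rough sketch in plain C++ (generic types, not the actual Fuchsia or Chromium APIs) of the bug class this commit describes: storing a pointer-derived 64-bit thread identity in a 32-bit id type means lookups keyed on the narrow id can truncate and disagree with ids obtained another way.

  #include <cstdint>
  #include <cstdio>

  using NarrowThreadId = int32_t;  // pid_t-like type the rest of the code expects

  int main() {
    // Pretend pthread_self() returned a pointer-derived 64-bit value, as it
    // does on Fuchsia.
    const uint64_t pthread_like_identity = 0x00005abf2bfa9000ULL;

    const NarrowThreadId truncated =
        static_cast<NarrowThreadId>(pthread_like_identity);

    std::printf("64-bit identity: 0x%llx, truncated 32-bit id: 0x%x\n",
                static_cast<unsigned long long>(pthread_like_identity),
                static_cast<uint32_t>(truncated));
    // Any registry keyed on NarrowThreadId (e.g. one mapping ids to thread
    // names) can now mismatch, which is why the fix switched to a stable
    // 32-bit thread handle instead.
    return 0;
  }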

Project Member

Comment 50 by bugdroid1@chromium.org, Aug 12 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0cdc0f4c4db9d2766b29d3a93b3617c107f9f8b0

commit 0cdc0f4c4db9d2766b29d3a93b3617c107f9f8b0
Author: Wez <wez@chromium.org>
Date: Sat Aug 12 16:45:16 2017

Fix inappropriate use of pthread_self() in PlatformThread::SetName().

We updated PlatformThread::CurrentID() to work correctly for Fuchsia
in https://chromium-review.googlesource.com/c/611853, but ironically
the Fuchsia version of PlatformThread::SetName() was still mis-using
pthread_self() as the PlatformThreadId as well, causing swathes of
tests to fail due to ThreadIdManager::Set/Get being called with
different Ids (pthread_t vs mx_handle_t) and therefore mis-matching.

TBR: scottmg@chromium.org
Bug:  751894 ,  735701 , 738275
Change-Id: I520560050c18efe624f701b4b9c5925b92bbec9a
Reviewed-on: https://chromium-review.googlesource.com/612027
Reviewed-by: Wez <wez@chromium.org>
Commit-Queue: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#493980}
[modify] https://crrev.com/0cdc0f4c4db9d2766b29d3a93b3617c107f9f8b0/base/threading/platform_thread_fuchsia.cc

Failure in https://build.chromium.org/p/chromium.fyi/builders/Fuchsia/builds/8805 on TaskSchedulerWorkerPoolImplTest.PostDelayedTask.

[00215.689] 02705.02792> [2431/2472] Sequenced/TaskSchedulerWorkerPoolImplTest.PostTaskAfterShutdown/0 (219 ms)
[00215.690] 02705.02792> [ RUN      ] Sequenced/TaskSchedulerWorkerPoolImplTest.PostDelayedTask/0
[00215.690] 02705.02792> ../../base/task_scheduler/scheduler_worker_pool_impl_unittest.cc:340: Failure
[00215.690] 02705.02792> Expected: (actual_delay) < (TimeDelta::FromMilliseconds(250) + TestTimeouts::tiny_timeout()), actual: 0.364163 s vs 0.35 s
[00215.690] 02705.02792> [  FAILED  ] Sequenced/TaskSchedulerWorkerPoolImplTest.PostDelayedTask/0, where GetParam() = 4-byte object <01-00 00-00> (658 ms)
https://build.chromium.org/p/chromium.fyi/builders/Fuchsia/builds/8817 hit this one again:

[00126.073] 02369.02432> [ RUN      ] TaskSchedulerWorkerPoolHistogramTest.NumTasksBeforeCleanup
[00126.073] 02369.02432> ../../base/task_scheduler/scheduler_worker_pool_impl_unittest.cc:787: Failure
[00126.073] 02369.02432>       Expected: 1
[00126.073] 02369.02432> To be equal to: histogram->SnapshotSamples()->GetCount(0)
[00126.073] 02369.02432>       Which is: 0
[00126.073] 02369.02432> [  FAILED  ] TaskSchedulerWorkerPoolHistogramTest.NumTasksBeforeCleanup (7765 ms)
https://build.chromium.org/p/chromium.fyi/builders/Fuchsia/builds/8927

had a timeout of

[00418.681] 02400.02428>     TaskSchedulerTaskTrackerTest.LoadWillPostAndRunBeforeShutdown (../../base/task_scheduler/task_tracker_unittest.cc:728)

Haven't investigated.
MAY_BLOCK/TaskSchedulerWorkerPoolBlockingEnterExitTest.ThreadBlockedUnblocked/0

https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.fyi%2FFuchsia%2F9291%2F%2B%2Frecipes%2Fsteps%2Fbase_unittests%2F0%2Fstdout
https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.fyi%2FFuchsia%2F9313%2F%2B%2Frecipes%2Fsteps%2Fbase_unittests%2F0%2Fstdout

[00057.449] 02788.02888> [ RUN      ] TaskSchedulerWorkerPoolHistogramTest.NumTasksBeforeCleanup
[00057.449] 02788.02888> ../../base/task_scheduler/scheduler_worker_pool_impl_unittest.cc:620: Failure
[00057.451] 02788.02888>       Expected: *thread_ref
[00057.451] 02788.02888>       Which is: 8-byte object <30-DB F0-75 B7-0B 00-00>
[00057.451] 02788.02888> To be equal to: PlatformThread::CurrentRef()
[00057.452] 02788.02888>       Which is: 8-byte object <30-DB 9C-9C 25-1B 00-00>
[00057.453] 02788.02888> ../../base/task_scheduler/scheduler_worker_pool_impl_unittest.cc:651: Failure
[00057.454] 02788.02888> Expected: (thread_ref) != (PlatformThread::CurrentRef()), actual: 8-byte object <30-DB F0-75 B7-0B 00-00> vs 8-byte object <30-DB F0-75 B7-0B 00-00>
[00057.455] 02788.02888> Worker reused. Worker will not cleanup and the histogram value will be wrong.
[00057.455] 02788.02888> ../../base/task_scheduler/scheduler_worker_pool_impl_unittest.cc:678: Failure
[00057.456] 02788.02888>       Expected: 0
[00057.456] 02788.02888> To be equal to: histogram->SnapshotSamples()->GetCount(1)
[00057.456] 02788.02888>       Which is: 1
[00057.456] 02788.02888> ../../base/task_scheduler/scheduler_worker_pool_impl_unittest.cc:680: Failure
[00057.457] 02788.02888>       Expected: 1
[00057.457] 02788.02888> To be equal to: histogram->SnapshotSamples()->GetCount(3)
[00057.457] 02788.02888>       Which is: 0
[00057.458] 02788.02888> [  FAILED  ] TaskSchedulerWorkerPoolHistogramTest.NumTasksBeforeCleanup (1340 ms)
[00057.458] 02788.02888> [1520/2530] TaskSchedulerWorkerPoolHistogramTest.NumTasksBeforeCleanup (1340 ms)
Similarly:

https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.fyi%2FFuchsia%2F9328%2F%2B%2Frecipes%2Fsteps%2Fbase_unittests%2F0%2Fstdout

[00054.894] 02742.02836> [ RUN      ] TaskSchedulerWorkerPoolHistogramTest.NumTasksBeforeCleanup
[00054.894] 02742.02836> ../../base/task_scheduler/scheduler_worker_pool_impl_unittest.cc:620: Failure
[00054.894] 02742.02836>       Expected: *thread_ref
[00054.895] 02742.02836>       Which is: 8-byte object <30-3B C6-F6 E9-0E 00-00>
[00054.895] 02742.02836> To be equal to: PlatformThread::CurrentRef()
[00054.896] 02742.02836>       Which is: 8-byte object <30-2B 96-07 E7-72 00-00>
[00054.896] 02742.02836> ../../base/task_scheduler/scheduler_worker_pool_impl_unittest.cc:651: Failure
[00054.896] 02742.02836> Expected: (thread_ref) != (PlatformThread::CurrentRef()), actual: 8-byte object <30-3B C6-F6 E9-0E 00-00> vs 8-byte object <30-3B C6-F6 E9-0E 00-00>
[00054.897] 02742.02836> Worker reused. Worker will not cleanup and the histogram value will be wrong.
[00054.897] 02742.02836> ../../base/task_scheduler/scheduler_worker_pool_impl_unittest.cc:678: Failure
[00054.897] 02742.02836>       Expected: 0
[00054.897] 02742.02836> To be equal to: histogram->SnapshotSamples()->GetCount(1)
[00054.898] 02742.02836>       Which is: 1
[00054.898] 02742.02836> ../../base/task_scheduler/scheduler_worker_pool_impl_unittest.cc:680: Failure
[00054.898] 02742.02836>       Expected: 1
[00054.899] 02742.02836> To be equal to: histogram->SnapshotSamples()->GetCount(3)
[00054.899] 02742.02836>       Which is: 0
[00054.899] 02742.02836> [  FAILED  ] TaskSchedulerWorkerPoolHistogramTest.NumTasksBeforeCleanup (1697 ms)
[00054.899] 02742.02836> [1520/2530] TaskSchedulerWorkerPoolHistogramTest.NumTasksBeforeCleanup (1697 ms)
Urgh, this one's ugly: https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.fyi%2FFuchsia%2F9351%2F%2B%2Frecipes%2Fsteps%2Fbase_unittests%2F0%2Fstdout

[00123.963] 02704.02749> 10 tests failed on exit:
[00123.963] 02704.02749>     TaskSchedulerLock.AcquireLocksDifferentThreadsSafely (../../base/task_scheduler/scheduler_lock_unittest.cc:178)
[00123.964] 02704.02749>     TaskSchedulerLock.AcquireLocksWithPredecessorDifferentThreadsSafelyNoInterference (../../base/task_scheduler/scheduler_lock_unittest.cc:230)
[00123.964] 02704.02749>     TaskSchedulerLock.AcquireLocksWithPredecessorDifferentThreadsSafelyPredecessorFirst (../../base/task_scheduler/scheduler_lock_unittest.cc:192)
[00123.964] 02704.02749>     TaskSchedulerLock.AcquireLocksWithPredecessorDifferentThreadsSafelyPredecessorLast (../../base/task_scheduler/scheduler_lock_unittest.cc:211)
[00123.966] 02704.02749>     TaskSchedulerLock.PredecessorCycle (../../base/task_scheduler/scheduler_lock_unittest.cc:264)
[00123.966] 02704.02749>     TaskSchedulerLock.PredecessorLongerCycle (../../base/task_scheduler/scheduler_lock_unittest.cc:275)
[00123.966] 02704.02749>     TaskSchedulerLock.SelfReferentialLock (../../base/task_scheduler/scheduler_lock_unittest.cc:254)
[00123.966] 02704.02749>     TaskSchedulerSingleThreadTaskRunnerManagerTest.DifferentThreadsUsed (../../base/task_scheduler/scheduler_single_thread_task_runner_manager_unittest.cc:96)
[00123.967] 02704.02749>     TaskSchedulerSingleThreadTaskRunnerManagerTest.RunsTasksInCurrentSequence (../../base/task_scheduler/scheduler_single_thread_task_runner_manager_unittest.cc:149)
[00123.967] 02704.02749>     TaskSchedulerSingleThreadTaskRunnerManagerTest.SameThreadUsed (../../base/task_scheduler/scheduler_single_thread_task_runner_manager_unittest.cc:122)

Comment 58 by w...@chromium.org, Jan 19 2018

Blockedon: 803900
Filed issue 803900 for what looks like a timing-dependent flake in a TaskScheduler test.

Comment 59 by w...@chromium.org, Jan 19 2018

Labels: -Needs-Feedback
gab: We have a hook in place so that when a test job process hangs, we should get a full set of thread dumps - you can see an example in issue 803900 (though that one is a straightforward case of invalid expectations, I think).

We can re-enable some of the filtered-out tests on our FYI bots to gather hang backtraces.

Comment 60 by gab@chromium.org, Jan 22 2018

Labels: -Pri-3 M-66 Pri-2
@wez: sounds very exciting, let's wait until  issue 803900  is fixed and then enable it to diagnose remaining errors?

Comment 61 by w...@chromium.org, Jan 22 2018

Re #60: The thread dumps in issue 803900 are already in place; since we see a variety of tests hang[1] on the Fuchsia bots now and again, we have a load of these diagnostics enabled to try to track down the issues.

[1] The thread dumps are triggered when the TestLauncher thinks that a test job process (one of the processes that actually runs a batch of tests) has taken too long, so technically the process may not actually be hung, just excessively slow. In practice, though, our timeouts are so huge that processes hitting them _are_ hung. :)
Owner: ----
Status: Available (was: Assigned)

Comment 63 by bugdroid1@chromium.org, Feb 8 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2cf7a75a8ee8fe4d1a28dedd0841e3806df2896e

commit 2cf7a75a8ee8fe4d1a28dedd0841e3806df2896e
Author: Scott Graham <scottmg@chromium.org>
Date: Thu Feb 08 19:55:40 2018

fuchsia: Filter TaskSchedulerWorkerPoolHistogramTest.NumTasksBeforeCleanup

https://build.chromium.org/p/chromium.fyi/builders/Fuchsia%20%28dbg%29/builds/16001

Bug:  735701 
Change-Id: Ic21baf4b50b66ecd3b765763f10a4a82189e3171
Reviewed-on: https://chromium-review.googlesource.com/909233
Commit-Queue: Wez <wez@chromium.org>
Reviewed-by: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#535479}
[modify] https://crrev.com/2cf7a75a8ee8fe4d1a28dedd0841e3806df2896e/testing/buildbot/filters/fuchsia.base_unittests.filter

Comment 64 by gab@chromium.org, Feb 21 2018

Blockedon: 813278

Comment 65 by w...@chromium.org, Feb 21 2018

Blockedon: 810464

Comment 66 by w...@chromium.org, Feb 21 2018

Blockedon: 806003

Comment 67 by bugdroid1@chromium.org, Feb 24 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/35b3464dccd14d07d154bd660adc9a608ed5bd5c

commit 35b3464dccd14d07d154bd660adc9a608ed5bd5c
Author: Wez <wez@chromium.org>
Date: Sat Feb 24 23:49:27 2018

Broaden test filter for TaskSchedulerWorkerPoolHistogramTests.

These tests rely on thread sleeps for synchronization, which makes them
racey when run on heavily-loaded systems.

TBR: scottmg
Bug:  816170 ,  735701 
Change-Id: I9cad471cf8bc88fd91dc33662489983f13f18786
Reviewed-on: https://chromium-review.googlesource.com/936446
Reviewed-by: Wez <wez@chromium.org>
Commit-Queue: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#539025}
[modify] https://crrev.com/35b3464dccd14d07d154bd660adc9a608ed5bd5c/testing/buildbot/filters/fuchsia.base_unittests.filter
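
For context on why this filter was broadened: sleep-based synchronization only works if the OS actually schedules the other thread inside the sleep window, which a heavily-loaded bot does not guarantee. A minimal standalone sketch of the anti-pattern (plain C++, not Chromium code; all names and timings here are invented for illustration):

// Minimal illustration of the anti-pattern: "sleep long enough" as a stand-in
// for real synchronization. Not Chromium code; timings are arbitrary.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
  std::atomic<bool> cleaned_up{false};

  // Background "worker" that becomes idle and cleans up after ~50ms of work.
  std::thread worker([&] {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    cleaned_up = true;
  });

  // Test body: sleep "longer than the worker needs" and then assert.
  // On an idle machine 100ms is plenty; on a heavily-loaded bot the worker
  // thread may not have been scheduled at all yet, so the check can fail
  // even though nothing is actually broken.
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  std::printf(cleaned_up ? "PASS (this time)\n" : "FLAKE: worker not done yet\n");

  worker.join();
  return 0;
}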

Comment 68 by gab@chromium.org, Feb 26 2018

Blockedon: 816170

Comment 69 by bugdroid1@chromium.org, Mar 1 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/274f2e7f66d77271128910677f72b802a92e6c78

commit 274f2e7f66d77271128910677f72b802a92e6c78
Author: Gabriel Charette <gab@chromium.org>
Date: Thu Mar 01 15:03:00 2018

[SchedulerWorkerPoolImpl] Make sure workers do not cleanup during WaitForWorkersIdle*().

While auditing the code I noticed that this was a problem: if a worker
cleans up while waiting for n workers to be idle, the idle stack will
shrink and this condition may never be reached.

This prevents this completely for WaitForAllWorkersIdleForTesting() and
greatly reduces the likelihood for WaitForWorkersIdleForTesting(n).

WaitForAllWorkersIdleForTesting() is safe because "n" is obtained under
the lock. Whereas an external call to WaitForWorkersIdleForTesting(n)
could wait on a worker which was already detached. But in practice
if the reclaim timeout is long enough and the main thread invokes this
immediately after unblocking a test task and waiting for it to resolve,
the race would require the main thread to be preempted for the full
reclaim timeout for no reason which seems highly unlikely.

Audited current usage and added a comment as such.

R=fdoray@chromium.org, robliao@chromium.org

Bug:  816170 ,  735701 
Change-Id: Ie7b4fd81c04f53cc9597ebc02a25b39d4f1d658f
Reviewed-on: https://chromium-review.googlesource.com/937361
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Reviewed-by: Robert Liao <robliao@chromium.org>
Cr-Commit-Position: refs/heads/master@{#540154}
[modify] https://crrev.com/274f2e7f66d77271128910677f72b802a92e6c78/base/task_scheduler/scheduler_worker_pool_impl.cc
[modify] https://crrev.com/274f2e7f66d77271128910677f72b802a92e6c78/base/task_scheduler/scheduler_worker_pool_impl.h
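
Roughly, the race and the mitigation described in this CL look like the sketch below. This is invented illustration code, not the real SchedulerWorkerPoolImpl; ToyWorkerPool, waiting_for_idle_, and the other names are assumptions. The key points are that the wait predicate and the cleanup decision run under the same lock, and cleanup is refused while a test-only wait is in progress, so the idle set can no longer shrink underneath the waiter.

// Invented illustration of the race and the mitigation; not the real
// base/task_scheduler code.
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <set>

class ToyWorkerPool {
 public:
  // Worker -> pool: "I just went idle."
  void WorkerDidBecomeIdle(int worker_id) {
    std::lock_guard<std::mutex> lock(lock_);
    idle_workers_.insert(worker_id);
    cv_.notify_all();
  }

  // Worker -> pool: "My reclaim timeout fired, may I clean up?"
  bool WorkerCanCleanup(int worker_id) {
    std::lock_guard<std::mutex> lock(lock_);
    if (waiting_for_idle_) {
      // Without this check, a worker could remove itself from the idle set
      // while WaitForWorkersIdleForTesting() is blocked below, shrinking the
      // set so that the target count might never be observed.
      return false;
    }
    idle_workers_.erase(worker_id);
    return true;
  }

  // Test -> pool: block until |n| workers are idle. Both the predicate and
  // the cleanup decision run under |lock_|, so they cannot interleave badly.
  // A cleanup that happens *before* this is called is still possible, which
  // is the residual (unlikely) race the CL description mentions for the
  // caller-supplied-n variant.
  void WaitForWorkersIdleForTesting(size_t n) {
    std::unique_lock<std::mutex> lock(lock_);
    waiting_for_idle_ = true;
    cv_.wait(lock, [&] { return idle_workers_.size() >= n; });
    waiting_for_idle_ = false;
  }

 private:
  std::mutex lock_;
  std::condition_variable cv_;
  bool waiting_for_idle_ = false;
  std::set<int> idle_workers_;
};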


Comment 70 by bugdroid1@chromium.org, Mar 1 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0fa2041e49fc2c94f22c314bde7ef0ab5ffe105c

commit 0fa2041e49fc2c94f22c314bde7ef0ab5ffe105c
Author: Gabriel Charette <gab@chromium.org>
Date: Thu Mar 01 15:30:20 2018

[SchedulerWorkerPoolImpl] Introduce WaitForNWorkersCleanedUpForTesting()

And use it instead of flaky sleep timings in tests that exercise the
reclaim timeout.

Bonus: this speeds up these tests from ~1500ms to ~500ms in the normal case.

R=fdoray@chromium.org, robliao@chromium.org

Bug:  735701 ,  803900 
Change-Id: I3b271fda6c7b42033ac7b3c128e8663e80acebb3
Reviewed-on: https://chromium-review.googlesource.com/937505
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Reviewed-by: Robert Liao <robliao@chromium.org>
Cr-Commit-Position: refs/heads/master@{#540162}
[modify] https://crrev.com/0fa2041e49fc2c94f22c314bde7ef0ab5ffe105c/base/task_scheduler/scheduler_worker_pool_impl.cc
[modify] https://crrev.com/0fa2041e49fc2c94f22c314bde7ef0ab5ffe105c/base/task_scheduler/scheduler_worker_pool_impl.h
[modify] https://crrev.com/0fa2041e49fc2c94f22c314bde7ef0ab5ffe105c/base/task_scheduler/scheduler_worker_pool_impl_unittest.cc
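
The shape of the new primitive, in a standalone sketch (again invented code, not the actual base/task_scheduler implementation; ToyPool and its workers are stand-ins): instead of sleeping for the reclaim timeout plus a fudge factor, the test blocks on a counter that the cleanup path increments, so it resumes the moment the n-th worker has actually cleaned up. That is also why the tests get faster in the normal case: the wait returns as soon as the condition holds rather than always burning a worst-case sleep.

// Invented standalone sketch of a WaitForNWorkersCleanedUpForTesting()-style
// primitive; not the actual SchedulerWorkerPoolImpl.
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

class ToyPool {
 public:
  // Each toy "worker" pretends to idle for |reclaim_timeout| and then cleans
  // itself up, signalling the condition variable.
  explicit ToyPool(std::chrono::milliseconds reclaim_timeout) {
    for (int i = 0; i < 4; ++i) {
      workers_.emplace_back([this, reclaim_timeout] {
        std::this_thread::sleep_for(reclaim_timeout);
        std::lock_guard<std::mutex> lock(lock_);
        ++num_cleaned_up_;
        cv_.notify_all();
      });
    }
  }

  ~ToyPool() {
    for (auto& worker : workers_) worker.join();
  }

  // Blocks until at least |n| workers have cleaned up. No sleeps: the caller
  // resumes as soon as the condition actually holds, which is both reliable
  // under load and faster than sleeping for a worst-case timeout.
  void WaitForNWorkersCleanedUpForTesting(int n) {
    std::unique_lock<std::mutex> lock(lock_);
    cv_.wait(lock, [&] { return num_cleaned_up_ >= n; });
  }

 private:
  std::mutex lock_;
  std::condition_variable cv_;
  int num_cleaned_up_ = 0;
  std::vector<std::thread> workers_;
};

int main() {
  ToyPool pool(std::chrono::milliseconds(100));
  pool.WaitForNWorkersCleanedUpForTesting(4);
  std::printf("all toy workers cleaned up\n");
  return 0;
}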

Comment 71 by gab@chromium.org, Apr 26 2018

Blockedon: 837341

Comment 72 by gab@chromium.org, Apr 27 2018

Owner: gab@chromium.org
Status: Fixed (was: Available)
I think we can call this fixed.  Issue 768436  is tracking the last flakes in TaskSchedulerWorkerPoolBlockingTests (fix incoming).

Woohoo :)! Thanks Fuchsia for being the best stress test of TaskScheduler tests!
