TaskSchedulerWorkerPoolTest.RacyCleanup fails under Fuchsia
Issue description

This test started failing in https://ci.chromium.org/buildbot/chromium.fyi/Fuchsia%20ARM64/6429 with a hang. The Fuchsia-specific on-hang thread dumper starts with:

[00552.130] 62472.62498> Backtrace of threads of process 32748: base_unittests__exec
[00552.132] 62472.62498> arch: aarch64
[00552.132] 62472.62498> 257 thread(s)
...

so I'm not going to copy all the thread stacks here. :P

Looking at the thread-dump output, though, the tool fails to suspend a lot of the threads, which suggests either that they were in the process of exiting when the hang timeout triggered (which seems unlikely), or that we had gotten stuck in a thread spawn/teardown live-lock.

The first failing builds (across the Fuchsia x64 and ARM64 Release bots) were at TaskScheduler CL https://chromium-review.googlesource.com/931547, which is the CL that actually introduced the test.
,
Feb 26 2018
,
Feb 26 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/f8a3b7bc2e618a652d980d8cf8de66d2664f554d

commit f8a3b7bc2e618a652d980d8cf8de66d2664f554d
Author: Wez <wez@chromium.org>
Date: Mon Feb 26 19:23:40 2018

Disable TaskSchedulerWorkerPoolTest.RacyCleanup under Fuchsia.

This test flakily hangs, appearing to live-lock in a worker-thread creation/teardown cycle.

TBR: gab
Bug: 816575, 810464
Change-Id: I22c42740a2e9af229400bc2da077ae962a287804
Reviewed-on: https://chromium-review.googlesource.com/937984
Reviewed-by: Wez <wez@chromium.org>
Reviewed-by: Robert Liao <robliao@chromium.org>
Commit-Queue: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#539228}

[modify] https://crrev.com/f8a3b7bc2e618a652d980d8cf8de66d2664f554d/base/task_scheduler/scheduler_worker_pool_impl_unittest.cc
,
Feb 27 2018
Hmmmm... interesting. Agreed that 256 workers is kind of a lot but doesn't seem unmanageable... Where can I see the stacks?
,
Mar 1 2018
Re #4: Click through the ci.chromium.org link and find the shard (failed) link for the failed test. That will contain the complete output from the run of the test suite shard on the VM. Many of the thread stacks are missing due to timing out trying to suspend them, but a lot of stacks were still successfully captured.
,
Mar 13 2018
Ok, so it looks like many worker threads are just sleeping while the main thread is trying to JoinForTesting().
Here's my theory:
SchedulerWorker::JoinForTesting() flags an atomic and then invokes SchedulerWorker::WakeUp() to ensure the worker sees it.
If the WakeUp() actually results in signaling the WaitableEvent, it generates a happens-after relationship with reading the atomic, and the thread should exit.
However, this WakeUp() is racy and this test is generating exactly the timing that makes it racy.
The timing is such that if WakeUp() is called just before the thread wakes up on its own (due to the reclaim timing), the WakeUp() can be a no-op (see below).
If the WakeUp() doesn't generate a happens-after relationship between the worker invoking ShouldExit() and the main thread setting join_called_for_testing_, then there's no reason the worker is forced to see the new value.
Then at the end of ThreadMain() we do:
// Calling WakeUp() guarantees that this SchedulerWorker will run
// Tasks from Sequences returned by the GetWork() method of |delegate_|
// until it returns nullptr. Resetting |wake_up_event_| here doesn't break
// this invariant and avoids a useless loop iteration before going to
// sleep if WakeUp() is called while this SchedulerWorker is awake.
wake_up_event_.Reset();
(this is what makes the previous WakeUp() a no-op).
What I don't understand, though, is that this should merely cause the worker to go back to sleep for another reclaim period (which is set to 10ms in this test). However, nothing in the atomics rules forces it to synchronize with the main thread, so it may sleep in 10ms increments forever, never seeing |join_called_for_testing_.IsSet()|?!
I think if we remove wake_up_event_.Reset(); at the end of ThreadMain() we may avoid this, as it will force the worker to observe the WakeUp() call and hence the atomic that was set before it (so long as WaitableEvent uses locks, or at least fenced atomics, on Fuchsia).
@robliao : thoughts?
,
Mar 14 2018
Hmmm, but getting rid of |wake_up_event_.Reset()| is tricky, since GetWork() assumes that if it's called a second time while the worker is still on the idle stack, it must be because it should attempt to clean up.

FTR, the stacks in the timing-out tests look like this: 257 threads active (the main thread and the 256 workers used in the RacyCleanup test); many failed to be dumped, but we still have most of the important info.

Main thread:
#06: base::PlatformThread::Join(base::PlatformThreadHandle) at ??:?
#07: base::internal::SchedulerWorker::JoinForTesting() at ??:?
#08: base::internal::SchedulerWorkerPoolImpl::JoinForTesting() at ??:?
#09: base::internal::TaskSchedulerWorkerPoolTest_RacyCleanup_Test::TestBody() at ??:?
#10: testing::Test::Run() at ??:?
#11: testing::TestInfo::Run() at ??:?
#12: testing::TestCase::Run() at ??:?
#13: testing::internal::UnitTestImpl::RunAllTests() at ??:?
#14: testing::UnitTest::Run() at ??:?
#15: base::TestSuite::Run() at ??:?

In 59/59 workers that could be dumped:
#05: base::ConditionVariable::TimedWait(base::TimeDelta const&) at ??:?
#06: base::WaitableEvent::TimedWaitUntil(base::TimeTicks const&) at ??:?
#07: base::WaitableEvent::TimedWait(base::TimeDelta const&) at ??:?
#08: base::internal::SchedulerWorker::Delegate::WaitForWork(base::WaitableEvent*) at ??:?
#09: base::internal::SchedulerWorker::Thread::ThreadMain() at ??:?
#10: base::(anonymous namespace)::ThreadFunc(void*) at ??:?

So somehow the workers are idle and not picking up the JoinForTesting() signal.
,
Mar 14 2018
@wez : fdoray and I spent a couple of hours fully wrapping our minds around this. Our conclusion is that thread exit on Fuchsia+QEMU must hang (or be painfully slow) when there are lots of threads. Is that possible?!

Why we think this:
1) There are 257 threads (256 workers + main thread), so no threads were successfully joined.
2) The main thread is in the process of joining a SchedulerWorker.
3) We know there's no work remaining (i.e. GetWork() always returns nullptr).
4) The main thread's stack indicates that:
   a) It's past the point where it |join_called_for_testing_.Set()|.
   b) It's also past |thread_->WakeUp()|, which results in |waitable_event_.Signal()|.
5) The worker being signaled may either:
   a) Wake up because of the signal, in which case:
      i. The thread will wake up with a happens-after relationship to |join_called_for_testing_.Set()|, so ShouldExit() will return true and the SchedulerWorker will unwind.
   b) Wake up on its own from the TimedWait() expiring per the reclaim timeout (in this test: kReclaimTimeForRacyCleanupTest == 10ms), in which case:
      i. The thread isn't guaranteed to see |join_called_for_testing_.IsSet()| (assume it doesn't; otherwise it's just case (a)).
      ii. The thread will invoke GetWork().
      iii. GetWork() will attempt to Cleanup() but be denied, because |SchedulerWorkerPoolImpl::worker_cleanup_disallowed_for_testing_| was set in JoinForTesting().
      iv. GetWork() will return nullptr.
      v. ThreadMain() will then invoke WaitForWork() again and consume the signal from (4.b), establishing the sought happens-after relationship and forcing ShouldExit().

Hence, the timeout occurring with all 256 workers still alive indicates a likely hang in thread teardown at the OS level (if the test were just slow, it would most likely hang with *some* threads alive; precisely all threads is suspicious).
PS: We did check that SchedulerWorkerPoolImpl::SchedulerWorkerDelegateImpl::OnMainExit() shouldn't hang either, so we really think it's hung in actual OS thread exit. You could make the conditions simpler by removing PlatformThread::Sleep(kReclaimTimeForRacyCleanupTest) locally in that test and hence show that the problem is with the OS thread exit rather than with any of the funky timings described above. Back to you.
,
Mar 19 2018
,
Apr 26 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/e0941c29b9ddb8b2acd32e781eb11e7e280ca5e0

commit e0941c29b9ddb8b2acd32e781eb11e7e280ca5e0
Author: Gabriel Charette <gab@chromium.org>
Date: Thu Apr 26 20:47:51 2018

[Fuchsia+TaskScheduler] Re-enable a lighter version of TaskSchedulerWorkerPoolTest.RacyCleanup on Fuchsia

R=wez@chromium.org
Bug: 816575
Change-Id: I04e95fb1832dc35ff3b9265ab3d5f7c5f1131619
Reviewed-on: https://chromium-review.googlesource.com/1030815
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#554145}

[modify] https://crrev.com/e0941c29b9ddb8b2acd32e781eb11e7e280ca5e0/base/task_scheduler/scheduler_worker_pool_impl_unittest.cc
,
Apr 27 2018
Removing from the TaskScheduler queue, as this is now strictly an issue with creation/teardown of many threads on Fuchsia+QEMU (the test was re-enabled in a lighter mode on Fuchsia in r554145).
,
May 23 2018
,
May 23 2018
,
May 26 2018
Comment 1 by w...@chromium.org, Feb 26 2018