Crash in base::internal::TaskTracker::IsShutdownComplete() in base_unittests
Issue description:

We have a report of a raft of tests failing in base_unittests under Fuchsia (dbg) FYI, which appears to be due to a single crash of the test batch process. Output from the tests is:

[00162.989] 03907.03948> [ RUN ] TaskSchedulerWorkerPoolBlockingTest.PostBeforeBlocking/WILL_BLOCK_MAY_BLOCK
[00162.989] 03907.03948> [ OK ] TaskSchedulerWorkerPoolBlockingTest.PostBeforeBlocking/WILL_BLOCK_MAY_BLOCK (146 ms)
[00162.989] 03907.03948> [ RUN ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/MAY_BLOCK
[00162.990] 03907.03948> [ OK ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/MAY_BLOCK (333 ms)
[00162.990] 03907.03948> [ RUN ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/WILL_BLOCK
[00162.991] 03907.03948> [ OK ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/WILL_BLOCK (181 ms)
[00162.991] 03907.03948> [ RUN ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/MAY_BLOCK_WILL_BLOCK
[00162.991] 03907.03948> [ OK ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/MAY_BLOCK_WILL_BLOCK (202 ms)
[00162.992] 03907.03948> [ RUN ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/WILL_BLOCK_MAY_BLOCK
[00162.992] 03907.03948> [ OK ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/WILL_BLOCK_MAY_BLOCK (168 ms)
[00162.992] 03907.03948> [----------] 5 tests from TaskSchedulerWorkerPoolBlockingTest (1085 ms total)
[00162.993] 03907.03948>
[00162.993] 03907.03948> [----------] 1 test from TaskSchedulerWorkerPoolOverWorkerCapacityTest
[00162.993] 03907.03948> [ RUN ] TaskSchedulerWorkerPoolOverWorkerCapacityTest.VerifyCleanup
[00162.995] 03907.03948> [ OK ] TaskSchedulerWorkerPoolOverWorkerCapacityTest.VerifyCleanup (558 ms)
[00162.995] 03907.03948> [----------] 1 test from TaskSchedulerWorkerPoolOverWorkerCapacityTest (584 ms total)
[00162.996] 03907.03948>
[00162.996] 03907.03948> [----------] 4 tests from TaskSchedulerWorkerStackTest
[00162.997] 03907.03948> [ RUN ] TaskSchedulerWorkerStackTest.PushPop
[00162.998] 03907.03948> [ OK ] TaskSchedulerWorkerStackTest.PushPop (0 ms)
[00162.998] 03907.03948> [ RUN ] TaskSchedulerWorkerStackTest.PeekPop
[00162.998] 03907.03948> [ OK ] TaskSchedulerWorkerStackTest.PeekPop (0 ms)
[00163.004] 03907.03948> [ RUN ] TaskSchedulerWorkerStackTest.Contains
[00163.004] 03907.03948> [ OK ] TaskSchedulerWorkerStackTest.Contains (0 ms)
[00163.004] 03907.03948> [ RUN ] TaskSchedulerWorkerStackTest.Remove
[00163.004] 03907.03948> [ OK ] TaskSchedulerWorkerStackTest.Remove (0 ms)

We run ten tests per batch, so this is likely a crash during process teardown. The crash backtrace is:

[00162.845] 01196.01248> <== fatal exception: process base_unittests__exec[37366] thread TaskSchedulerAWorker[38439]
[00162.846] 01196.01248> <== fatal page fault, PC at 0x4e0f7ee80b84
...
#01: pc 0x4e0f7ee80b84 sp 0x6aa4ca486690 (libc.so,0x18b84)
#02: base::internal::LockImpl::Lock() at ??:?
#03: base::Lock::Acquire() at ??:?
#04: base::internal::SchedulerLockImpl::Acquire() at ??:?
#05: base::internal::AutoSchedulerLock::AutoSchedulerLock(base::internal::SchedulerLock&) at ??:?
#06: base::internal::TaskTracker::IsShutdownComplete() const at ??:?
#07: base::internal::SchedulerWorkerPoolImpl::SchedulerWorkerDelegateImpl::OnMainExit(base::internal::SchedulerWorker*) at ??:?
#08: base::internal::SchedulerWorker::Thread::ThreadMain() at ??:?
#09: base::(anonymous namespace)::ThreadFunc(void*) at ??:?
#10: pc 0x4e0f7ee80876 sp 0x6aa4ca486fe0 (libc.so,0x18876)
#11: pc 0x4e0f7eef66c9 sp 0x6aa4ca486ff0 (libc.so,0x8e6c9)
#12: pc 0 sp 0x6aa4ca487000

Looking at https://cs.chromium.org/chromium/src/base/task_scheduler/scheduler_worker_pool_impl.cc?gsn=outer_&l=780 it appears that the SchedulerWorkerDelegateImpl is owned by the SchedulerWorker, which is ref-counted, with a reference held by the SchedulerWorkerPoolImpl. The SWDI then holds a raw pointer back to the SchedulerWorkerPoolImpl.

I think what's happening here is a race condition between an idle SchedulerWorker deciding to exit and SchedulerWorkerPoolImpl teardown:
- The idle SchedulerWorker enters GetWork(), takes the pool lock, confirms it can clean up, and removes itself from |workers_|.
- The idle SchedulerWorker thread is preempted.
- The SchedulerWorkerPoolImpl is torn down, because this is a test (as per comments in the dtor, this doesn't happen in production, probably?).
- The idle SchedulerWorker thread is re-scheduled by the OS and invokes OnMainExit() on its delegate; the delegate then use-after-frees the deleted pool impl (see the sketch below).

Marking P1 for M67, since this risks making tests randomly flake at a very low (== hard to diagnose) rate.
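To make the suspected interleaving concrete, here is a minimal standalone sketch. All names are hypothetical simplifications (this is not the real Chromium code), and the program intentionally contains the use-after-free, so run it under AddressSanitizer to observe the crash:

// Build: clang++ -std=c++17 -fsanitize=address -pthread repro.cc
#include <chrono>
#include <mutex>
#include <thread>

struct Pool {
  std::mutex lock;
  int live_workers = 1;

  // Worker side: decide to clean up and deregister from the pool
  // (analogous to the worker removing itself from |workers_|).
  void CleanupWorker() {
    std::lock_guard<std::mutex> guard(lock);
    --live_workers;
  }

  // The use-after-free site (TaskTracker::IsShutdownComplete() in the
  // real stack): acquiring |lock| touches freed memory once the pool
  // has been deleted.
  bool IsShutdownComplete() {
    std::lock_guard<std::mutex> guard(lock);
    return live_workers == 0;
  }
};

int main() {
  Pool* pool = new Pool;
  std::thread worker([pool] {
    pool->CleanupWorker();  // deregistered: teardown won't wait for us
    std::this_thread::sleep_for(std::chrono::milliseconds(50));  // "preempted"
    pool->IsShutdownComplete();  // pool may already be gone: UAF
  });
  // Test teardown: the pool no longer accounts for the worker, so it is
  // destroyed without joining it.
  std::this_thread::sleep_for(std::chrono::milliseconds(10));
  delete pool;
  worker.join();  // only here so the sketch terminates cleanly
}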
Feb 21 2018
Dug into this from other (dupped) bugs, copying diagnosis from there:
"""
Trying to figure out why a worker could possibly outlive the pool/scheduler/TaskTracker (after all, we always ensure that the pool/workers were joined in the destructor):

TaskSchedulerImpl::~TaskSchedulerImpl() {
#if DCHECK_IS_ON()
  DCHECK(join_for_testing_returned_.IsSet());
#endif
}
I now have a theory: we rely on atomics in an incorrect way in one testing-only method.
That is, SchedulerWorkerPoolImpl::DisallowWorkerCleanupForTesting() sets an atomic bit, and workers are expected not to clean up once they see that bit. However, absolutely nothing in the atomics rules forces the other threads to see that value: atomics merely specify that *once* a thread sees the value, it is also guaranteed to see everything that happened before the value was set (per the release/acquire barriers used in AtomicFlag), but that's irrelevant here.
Because of this, a worker can still observe CanCleanup() == true after the main thread has invoked SchedulerWorkerPoolImpl::DisallowWorkerCleanupForTesting(). In that case it will remove itself from |workers_| and will not be waited upon when JoinForTesting() is invoked on the main thread. And hence we have our worker outliving its pool..!
"""
CL @ https://chromium-review.googlesource.com/#/c/chromium/src/+/929061
Note to self: try to revert https://chromium-review.googlesource.com/925546 to re-enable some TaskScheduler tests after this is fixed.
Feb 21 2018
Issue 813651 has been merged into this issue.
Feb 21 2018
Issue 806003 has been merged into this issue.
Feb 22 2018
gab: Did you see robliao@'s comment at https://bugs.chromium.org/p/chromium/issues/detail?id=813278#c12 ? Do we have a clear idea of how this can also sometimes result in hitting RunNextTask() after the SchedulerWorkerPool and TaskTracker are gone?
Feb 22 2018
Copying my response from there (let's keep discussing here): I don't trust this stack much. This is a Release build and the symbols may be misaligned. All we know is that the worker is doing a use-after-free; everything else may be garbage, IMO. I'm wrapping my head around this as we speak (r538388 is a first step, but not sufficient).
Feb 22 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/fc40b1417b669d7df75b4d61491c521048027fda

commit fc40b1417b669d7df75b4d61491c521048027fda
Author: Gabriel Charette <gab@chromium.org>
Date: Thu Feb 22 10:02:04 2018

Make SchedulerWorkerPoolImpl::DisallowWorkerPoolCleanupForTesting() synchronized with SchedulerWorkers' cleanup step.

Atomics didn't guarantee synchronization: a worker could observe it was allowed to clean up, be preempted, and later actually clean up after cleanup had been disallowed. Synchronizing on the pool's lock prevents this.

I believe this is the source of most TaskScheduler flakes we've seen on Fuchsia. The tests being blamed were also random because a leaky worker would result in a use-after-free in the following test, not the test at fault.

While this CL synchronizes cleanups with DisallowWorkerPoolCleanupForTesting(), a follow-up CL will be required to ensure detached workers do not touch pool state after cleanup (since they removed themselves from |workers_| they will not be joined, and hence using pool state can result in a use-after-free).

R=fdoray@chromium.org, robliao@chromium.org
Bug: 813278
Change-Id: I455a2c8cff3aa9b1176567cc4a6409ff6ca6807d
Reviewed-on: https://chromium-review.googlesource.com/929061
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: Robert Liao <robliao@chromium.org>
Cr-Commit-Position: refs/heads/master@{#538388}

[modify] https://crrev.com/fc40b1417b669d7df75b4d61491c521048027fda/base/task_scheduler/scheduler_worker_pool_impl.cc
[modify] https://crrev.com/fc40b1417b669d7df75b4d61491c521048027fda/base/task_scheduler/scheduler_worker_pool_impl.h
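A minimal sketch of the synchronization pattern this CL describes (hypothetical names, not the actual Chromium code): both the disallow and the worker's cleanup decision take the same pool lock, so they can no longer interleave:

#include <mutex>

class PoolSketch {
 public:
  void DisallowWorkerCleanupForTesting() {
    std::lock_guard<std::mutex> guard(lock_);
    worker_cleanup_disallowed_ = true;
    // Any worker checking after this point sees the bit. A worker that
    // already decided to clean up did so under |lock_| and has already
    // left |workers_|; the follow-up CL handles that case.
  }

  // Worker side: the decision and the deregistration form one critical
  // section, mutually exclusive with the disallow above.
  bool TryCleanupWorker() {
    std::lock_guard<std::mutex> guard(lock_);
    if (worker_cleanup_disallowed_)
      return false;
    // ... remove the calling worker from |workers_| ...
    return true;
  }

 private:
  std::mutex lock_;
  bool worker_cleanup_disallowed_ = false;  // previously an AtomicFlag
};

int main() {
  PoolSketch pool;
  pool.TryCleanupWorker();                 // allowed: returns true
  pool.DisallowWorkerCleanupForTesting();
  pool.TryCleanupWorker();                 // now atomically denied
}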
Feb 23 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/d5b04e7e34dac974c7d6ce4b138f66419fca0ca9

commit d5b04e7e34dac974c7d6ce4b138f66419fca0ca9
Author: Gabriel Charette <gab@chromium.org>
Date: Fri Feb 23 06:19:02 2018

SchedulerWorker: Do not go back to sleep after cleanup.

At first sight this is merely a correctness fix to avoid sleeping twice. But it is also a prerequisite to fix crbug.com/810464, as calling WaitForWork() after cleanup (which may occur as part of GetWork()) is not safe.

This may also help timing flakes on reclaim tests, for which the typical resolution wait is 3X the reclaim time (but was only 1.5X in practice because of this bug).

R=fdoray@chromium.org, robliao@chromium.org
Bug: 810464
Change-Id: I5e00de763f95773120022d258a513bb19a30c6b1
Reviewed-on: https://chromium-review.googlesource.com/931541
Reviewed-by: François Doray <fdoray@chromium.org>
Commit-Queue: Gabriel Charette <gab@chromium.org>
Cr-Commit-Position: refs/heads/master@{#538712}

[modify] https://crrev.com/d5b04e7e34dac974c7d6ce4b138f66419fca0ca9/base/task_scheduler/scheduler_worker.cc
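A simplified control-flow sketch of what this fixes, with trivial stand-ins for the real delegate/TaskTracker calls (all names hypothetical): once GetWork() triggers cleanup, the worker must fall out of its main loop rather than reach WaitForWork() again:

#include <optional>

struct Worker {
  bool cleaned_up = false;
};

// Trivial stand-ins: GetWork() here always triggers cleanup so the
// sketch terminates.
std::optional<int> GetWork(Worker& w) {
  w.cleaned_up = true;
  return std::nullopt;
}
void WaitForWork() {}
void RunTask(int) {}

void ThreadMainSketch(Worker& w) {
  while (true) {
    std::optional<int> task = GetWork(w);
    if (w.cleaned_up)
      break;  // the fix: never go back to WaitForWork() after cleanup
    if (!task) {
      WaitForWork();  // pre-fix, a cleaned-up worker could land here
      continue;
    }
    RunTask(*task);
  }
  // OnMainExit() etc. would run here.
}

int main() {
  Worker w;
  ThreadMainSketch(w);
}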
Feb 23 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/a8863d2fc9bc0ede0e9f35306ac85cc0195f3e6c

commit a8863d2fc9bc0ede0e9f35306ac85cc0195f3e6c
Author: Gabriel Charette <gab@chromium.org>
Date: Fri Feb 23 10:25:03 2018

[TaskScheduler] Do not log 0 to the TaskScheduler.NumTasksBetweenWaits histogram on detach.

Doing so accesses |outer_|, which is unsafe after invoking Cleanup(). While this will be made safe by https://chromium-review.googlesource.com/c/chromium/src/+/931547, we had noticed this metric being wrong in the past but had postponed cleaning it up to avoid perturbing metrics as part of a bigger change. Now is the time to drop this report.

I don't think we need to rename the metric because it's an internal diagnosis metric for our team.

R=fdoray@chromium.org, robliao@chromium.org
Bug: 756898, 810464
Change-Id: Ica379d6f3c21aebabeb0574847ba0bfcda346df8
Reviewed-on: https://chromium-review.googlesource.com/931524
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Cr-Commit-Position: refs/heads/master@{#538751}

[modify] https://crrev.com/a8863d2fc9bc0ede0e9f35306ac85cc0195f3e6c/base/task_scheduler/scheduler_worker_pool_impl.cc
[modify] https://crrev.com/a8863d2fc9bc0ede0e9f35306ac85cc0195f3e6c/base/task_scheduler/scheduler_worker_pool_impl_unittest.cc
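The shape of the change, sketched with hypothetical names (the real method lives on SchedulerWorkerDelegateImpl): the cleanup path returns early instead of reporting a bogus 0 sample through |outer_|:

void ReportNumTasksOnDetach(bool detaching_for_cleanup, int num_tasks) {
  if (detaching_for_cleanup)
    return;  // drop the sample entirely; |outer_| is not safe to touch
  // outer_->ReportNumTasksBetweenWaits(num_tasks);  // normal wait path
}

int main() {
  ReportNumTasksOnDetach(/*detaching_for_cleanup=*/true, 0);
}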
Feb 24 2018
Issue 763051 has been merged into this issue.
Feb 25 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/550cb548914d5fc99258ced55c94f3348ad5a55a

commit 550cb548914d5fc99258ced55c94f3348ad5a55a
Author: Gabriel Charette <gab@chromium.org>
Date: Sun Feb 25 21:57:00 2018

[SchedulerWorkerDelegateImpl] Add THREAD_CHECKER checks to document and enforce existing behaviour.

After just doing a full pass of this code for issue 810464, I think leaving these CHECKs behind will help a future reader as well as enforce the required calling semantics.

R=fdoray@chromium.org
Bug: 810464
Change-Id: I48bd2c3c0308be3b856ef518ed41382497ca5cf8
Reviewed-on: https://chromium-review.googlesource.com/934081
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: Robert Liao <robliao@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Cr-Commit-Position: refs/heads/master@{#539052}

[modify] https://crrev.com/550cb548914d5fc99258ced55c94f3348ad5a55a/base/task_scheduler/scheduler_worker_pool_impl.cc
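For reference, a sketch of the pattern being added (the class is hypothetical; THREAD_CHECKER, DETACH_FROM_THREAD and DCHECK_CALLED_ON_VALID_THREAD are the real macros from base/threading/thread_checker.h): the delegate is constructed on the main thread, so it detaches first and the checker binds to the worker thread on the first checked call:

#include "base/threading/thread_checker.h"

class WorkerDelegateSketch {
 public:
  WorkerDelegateSketch() {
    // Constructed on the main thread; rebind to the worker thread on
    // the first DCHECK_CALLED_ON_VALID_THREAD() below.
    DETACH_FROM_THREAD(worker_thread_checker_);
  }

  // All main-loop entry points must be invoked from the same worker
  // thread for the lifetime of the delegate.
  void OnMainEntry() { DCHECK_CALLED_ON_VALID_THREAD(worker_thread_checker_); }
  void GetWork() { DCHECK_CALLED_ON_VALID_THREAD(worker_thread_checker_); }
  void OnMainExit() { DCHECK_CALLED_ON_VALID_THREAD(worker_thread_checker_); }

 private:
  THREAD_CHECKER(worker_thread_checker_);
};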
Feb 26 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/1ae9a275a252299bfc2936cc899143f5bad88ab4

commit 1ae9a275a252299bfc2936cc899143f5bad88ab4
Author: Gabriel Charette <gab@chromium.org>
Date: Mon Feb 26 09:56:23 2018

[SchedulerWorkerPoolImpl] Reinforce JoinForTesting() to prevent use-after-free post Cleanup().

A SchedulerWorker can touch SchedulerWorkerPoolImpl through SchedulerWorker::delegate_->outer_. This could cause a use-after-free in tests (not a problem in prod, because we never join workers nor destroy the scheduler/pools), because SchedulerWorkerPoolImpl no longer accounts for a SchedulerWorker after invoking SchedulerWorker::Cleanup() and hence will not join it in JoinForTesting() before it is itself destroyed.

Before this CL: a SchedulerWorker could access SchedulerWorkerPoolImpl after Cleanup(), and hence potentially use-after-free if the pool is destroyed before the detached thread completes execution.

This CL reinforces JoinForTesting() to wait for all workers that ever had a reference to |this| (i.e. its |workers_| plus any recently reclaimed workers).

Added TaskSchedulerWorkerPoolTest.RacyCleanup as a regression test, although it passes in most scenarios without the fix because this race is hard to exercise... Also added a DCHECK on start that the requested pool size isn't greater than kMaxNumberOfWorkers (as I tried to add 1000 workers in my new test and had to debug why it hung until I rediscovered this intentional behavior).

Bug: 810464
Change-Id: Ibaefb5446d462c325e4ddfd34a54ad3c448c1417
Reviewed-on: https://chromium-review.googlesource.com/931547
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Cr-Commit-Position: refs/heads/master@{#539095}

[modify] https://crrev.com/1ae9a275a252299bfc2936cc899143f5bad88ab4/base/task_scheduler/scheduler_worker.cc
[modify] https://crrev.com/1ae9a275a252299bfc2936cc899143f5bad88ab4/base/task_scheduler/scheduler_worker.h
[modify] https://crrev.com/1ae9a275a252299bfc2936cc899143f5bad88ab4/base/task_scheduler/scheduler_worker_pool_impl.cc
[modify] https://crrev.com/1ae9a275a252299bfc2936cc899143f5bad88ab4/base/task_scheduler/scheduler_worker_pool_impl.h
[modify] https://crrev.com/1ae9a275a252299bfc2936cc899143f5bad88ab4/base/task_scheduler/scheduler_worker_pool_impl_unittest.cc
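A sketch of the reinforced join, with std::thread standing in for the ref-counted SchedulerWorker and all names hypothetical: workers that reclaim themselves are moved to a side list instead of being forgotten, so JoinForTesting() can still wait for every thread that ever referenced the pool:

#include <memory>
#include <mutex>
#include <thread>
#include <vector>

class PoolSketch {
 public:
  // Worker-side cleanup: the worker leaves |workers_| but its handle is
  // retained so the join below can still wait for it.
  void OnWorkerCleanedUp(std::shared_ptr<std::thread> worker) {
    std::lock_guard<std::mutex> guard(lock_);
    // ... erase |worker| from |workers_| ..., then:
    recently_cleaned_up_workers_.push_back(std::move(worker));
  }

  void JoinForTesting() {
    std::vector<std::shared_ptr<std::thread>> to_join;
    {
      std::lock_guard<std::mutex> guard(lock_);
      to_join = workers_;
      to_join.insert(to_join.end(), recently_cleaned_up_workers_.begin(),
                     recently_cleaned_up_workers_.end());
    }
    // Join outside the lock: after this, every thread that ever held a
    // reference to |this| has finished, so destruction is safe.
    for (auto& worker : to_join) {
      if (worker && worker->joinable())
        worker->join();
    }
  }

 private:
  std::mutex lock_;
  std::vector<std::shared_ptr<std::thread>> workers_;
  std::vector<std::shared_ptr<std::thread>> recently_cleaned_up_workers_;
};

int main() {
  PoolSketch pool;
  pool.JoinForTesting();  // trivially empty here; shown for shape only
}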
Feb 26 2018
Issue 761400 has been merged into this issue.
Feb 26 2018
Issue 763058 has been merged into this issue.
Feb 26 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/f8a3b7bc2e618a652d980d8cf8de66d2664f554d

commit f8a3b7bc2e618a652d980d8cf8de66d2664f554d
Author: Wez <wez@chromium.org>
Date: Mon Feb 26 19:23:40 2018

Disable TaskSchedulerWorkerPoolTest.RacyCleanup under Fuchsia.

This test flakily hangs, appearing to live-lock in a worker-thread creation/teardown cycle.

TBR: gab
Bug: 816575, 810464
Change-Id: I22c42740a2e9af229400bc2da077ae962a287804
Reviewed-on: https://chromium-review.googlesource.com/937984
Reviewed-by: Wez <wez@chromium.org>
Reviewed-by: Robert Liao <robliao@chromium.org>
Commit-Queue: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#539228}

[modify] https://crrev.com/f8a3b7bc2e618a652d980d8cf8de66d2664f554d/base/task_scheduler/scheduler_worker_pool_impl_unittest.cc
Mar 1 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/cb4da43870053c2857661082f2ef29c938a69d6a

commit cb4da43870053c2857661082f2ef29c938a69d6a
Author: Gabriel Charette <gab@chromium.org>
Date: Thu Mar 01 09:58:16 2018

[SchedulerWorkerPoolImpl] Remove DisallowWorkerCleanupForTesting().

Since cleanup was made thread-safe in r539095 (and disallowing cleanup was synchronized on |lock_| in r538388), this was only still needed during JoinForTesting(), to ensure workers in |workers_copy| didn't decide to clean up during the join. As such, always ban cleanup during join and remove the API that controlled that bit.

R=fdoray@chromium.org, robliao@chromium.org
Bug: 810464
Change-Id: I46edfcc7e37f3a5cb54320b0c38cad5d567c8eb7
Reviewed-on: https://chromium-review.googlesource.com/937264
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Reviewed-by: Robert Liao <robliao@chromium.org>
Cr-Commit-Position: refs/heads/master@{#540119}

[modify] https://crrev.com/cb4da43870053c2857661082f2ef29c938a69d6a/base/task_scheduler/scheduler_worker_pool_impl.cc
[modify] https://crrev.com/cb4da43870053c2857661082f2ef29c938a69d6a/base/task_scheduler/scheduler_worker_pool_impl.h
[modify] https://crrev.com/cb4da43870053c2857661082f2ef29c938a69d6a/base/task_scheduler/scheduler_worker_pool_impl_unittest.cc
[modify] https://crrev.com/cb4da43870053c2857661082f2ef29c938a69d6a/base/task_scheduler/task_scheduler_impl.cc
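End-state sketch (hypothetical names, same shape as the sketch a few comments up): there is no public toggle anymore; JoinForTesting() itself forbids cleanup, under the same lock the workers' check takes:

#include <mutex>

class PoolSketch {
 public:
  // Worker side: checked under |lock_|, so it cannot race with the join.
  bool CanWorkerCleanup() {
    std::lock_guard<std::mutex> guard(lock_);
    return !join_in_progress_;
  }

  void JoinForTesting() {
    {
      std::lock_guard<std::mutex> guard(lock_);
      join_in_progress_ = true;  // always banned during join; no API needed
    }
    // ... join a copy of |workers_| as before ...
  }

 private:
  std::mutex lock_;
  bool join_in_progress_ = false;
};

int main() {
  PoolSketch pool;
  pool.JoinForTesting();
}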
Mar 21 2018
I think this one should be fixed now.
Comment 1 by gab@chromium.org, Feb 13 2018
Owner: robliao@chromium.org