TaskSchedulerWorkerPoolImplPostTaskBeforeStartTest.PostManyTasks failed under Fuchsia/ARM64 FYI
Issue description

TaskSchedulerWorkerPoolImplPostTaskBeforeStartTest.PostManyTasks failed in https://build.chromium.org/p/chromium.fyi/builders/Fuchsia%20ARM64/builds/5819 with a crash:

[00104.257] 01200.01255> <== fatal exception: process base_unittests__exec[32690] thread TaskSchedulerSingleThreadForegr[33216]
[00104.258] 01200.01255> <== fatal page fault, PC at 0x211d571ce0d0
...
[00104.294] 01200.01255> dso: id=97caafd6646d3ad81e2dc07bba6fb4fec7d3eb3a base=0xb3cd515b0000 name=<vDSO>
[00104.294] 01200.01255> dso: id=9e3569cc22dfd432bc2ac3d3b2829196909ad4be base=0x90e47c61a000 name=libc.so
[00104.294] 01200.01255> dso: id=45ed5eb239c1b28eeb9706e4130e24f4a5f378d6 base=0x68b0dca4f000 name=liblaunchpad.so
[00104.294] 01200.01255> dso: id=e5a518a1be244c24d4eadc3ff79bd2478f978376 base=0x52cef2d12000 name=libfdio.so
[00104.294] 01200.01255> dso: id=4542a42823657d4a base=0x211d56713000 name=app:base_unittests__exec
#01: base::internal::TaskTracker::RunNextTask(scoped_refptr<base::internal::Sequence>, base::internal::CanScheduleSequenceObserver*) at ??:?
#02: base::internal::SchedulerWorker::Thread::ThreadMain() at ??:?
#03: base::(anonymous namespace)::ThreadFunc(void*) at ??:?
#04: pc 0x90e47c62ae4c sp 0x8ad8deb7ffc0 (libc.so,0x10e4c)
#05: pc 0x90e47c69d820 sp 0x8ad8deb7ffe0 (libc.so,0x83820)
...
[00104.431] 03757.03783> [1566/2928] TaskSchedulerWorkerPoolImplPostTaskBeforeStartTest.PostManyTasks (CRASHED)
Feb 20 2018
Re #1: This failed on Fuchsia/ARM64 FYI, which is a Release builder so |symbol_level| should be zero; I'm actually a little surprised that we're getting symbols at all. We could increase the |symbol_level| on our FYI bots but then we'd be running (much) bigger binaries, which will increase build & run times, etc.
Feb 21 2018
That's unfortunate. On Windows we have symbols even for Release builds. Can't do much to diagnose failures otherwise... FWIW, this stack looks similar to issue 813651. Maybe the worker is calling into a freed TaskTracker and the rest is garbage..? (shouldn't happen per my current mental model, but eh)
Feb 21 2018
I dug into this and other bugs to figure out how a worker could possibly outlive the pool/scheduler/TaskTracker (after all, we always verify in the destructor that the pool's workers were joined):
TaskSchedulerImpl::~TaskSchedulerImpl() {
#if DCHECK_IS_ON()
DCHECK(join_for_testing_returned_.IsSet());
#endif
}
I now have a theory: we rely on atomics in an incorrect way in one testing-only method.
That is, SchedulerWorkerPoolImpl::DisallowWorkerCleanupForTesting() sets an atomic bit and workers are expected not to clean up once they see that bit. However, nothing in the atomics rules forces other threads to see that value promptly: release/acquire semantics (as used by AtomicFlag) only guarantee that *once* a thread sees the value, it also sees everything that happened before the value was set, which is irrelevant here.
Because of this, a worker can still observe CanCleanup() == true after the main thread has invoked SchedulerWorkerPoolImpl::DisallowWorkerCleanupForTesting(). In that case it will remove itself from |workers_| and will not be waited upon when JoinForTesting() is invoked on the main thread. And hence we have our worker outliving its pool..!
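For illustration, here's a minimal standalone C++ sketch of that check-then-act race (not the actual Chromium code; RacyPoolSketch, cleanup_disallowed_, MaybeCleanupWorker, etc. are made-up names standing in for the AtomicFlag set by DisallowWorkerCleanupForTesting() and the worker's cleanup decision):

#include <atomic>
#include <thread>
#include <vector>

class RacyPoolSketch {
 public:
  // Main thread (test teardown): disallow further cleanup, then join only the
  // workers still registered in |workers_|.
  void DisallowCleanupAndJoin() {
    cleanup_disallowed_.store(true, std::memory_order_release);
    // JoinForTesting() would only join workers still in |workers_|; a worker
    // that already de-registered itself is never waited upon.
  }

  // Worker thread: decide whether to clean up (de-register) when idle.
  void MaybeCleanupWorker() {
    // RACE: nothing forces this load to observe the main thread's store "in
    // time". Release/acquire only guarantees that *once* the store is seen,
    // everything sequenced before it is also seen.
    if (!cleanup_disallowed_.load(std::memory_order_acquire)) {
      // <-- The worker can be preempted right here while the main thread runs
      // DisallowCleanupAndJoin() and returns, believing no more cleanups can
      // happen. The worker then de-registers and outlives the join.
      RemoveSelfFromWorkers();
    }
  }

 private:
  void RemoveSelfFromWorkers() { /* erase this worker from |workers_| */ }

  std::atomic<bool> cleanup_disallowed_{false};
  std::vector<int> workers_;  // Placeholder for the real worker list.
};

int main() {
  RacyPoolSketch pool;
  std::thread worker([&pool] { pool.MaybeCleanupWorker(); });
  pool.DisallowCleanupAndJoin();  // May interleave either way with the worker.
  worker.join();
  return 0;
}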
(working on fix now)
Feb 21 2018
CL @ https://chromium-review.googlesource.com/#/c/chromium/src/+/929061
Note to self: try to revert https://chromium-review.googlesource.com/925546 to re-enable some TaskScheduler tests after this is fixed.
Feb 21 2018
Re #5: Based on your description, this sounds like a duplicate of issue 810464, right?
Feb 21 2018
Ah, yes, missed that one when digging for open bugs... thanks!
Feb 21 2018
Could this also fit issue 806003?
Feb 21 2018
Yes, thanks! Marked as duplicate. I think this encompasses all the flakes except the timing ones based on worker capacity.
Feb 21 2018
Clearing up #5 after some offline discussion: there is a synchronization issue between CanWorkerCleanupForTesting() and the state synchronized by SchedulerWorkerPoolImpl's |lock_|. CanWorkerCleanupForTesting() could transition from true to false before CleanupLockRequired() occurs. However, even in this unsynchronized case it's not clear how the thread would make it to RunNextTask(): once the cleanup path is triggered, the thread exits its loop, and for the thread to continue to RunNextTask(), GetWork() must have returned a sequence, which means the cleanup work hadn't started yet.
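To make that ordering argument concrete, here's a rough, heavily simplified sketch of the worker main loop as I read it (not the actual SchedulerWorker::Thread::ThreadMain(); WorkerSketch and its placeholder methods are made up): the cleanup decision only happens on the no-work path, so reaching RunNextTask() implies GetWork() returned a sequence in that iteration.

#include <optional>

struct Sequence {};

class WorkerSketch {
 public:
  void ThreadMain() {
    for (;;) {
      std::optional<Sequence> sequence = GetWork();
      if (!sequence) {
        // No work: this is the only place the cleanup decision is made. If
        // cleanup happens, the loop exits and RunNextTask() is never reached.
        if (TryCleanup())
          return;
        continue;
      }
      // Work was found before any cleanup decision in this iteration, so the
      // pool/TaskTracker should still be alive here -- unless this worker was
      // leaked by a previous test, as theorized in the earlier comment.
      RunNextTask(*sequence);
    }
  }

 private:
  std::optional<Sequence> GetWork() { return std::nullopt; }  // Placeholder.
  bool TryCleanup() { return true; }                          // Placeholder.
  void RunNextTask(const Sequence&) {}                        // Placeholder.
};

int main() {
  WorkerSketch worker;
  worker.ThreadMain();  // With the placeholders above, this returns right away.
  return 0;
}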
Feb 22 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/fc40b1417b669d7df75b4d61491c521048027fda

commit fc40b1417b669d7df75b4d61491c521048027fda
Author: Gabriel Charette <gab@chromium.org>
Date: Thu Feb 22 10:02:04 2018

Make SchedulerWorkerPoolImpl::DisallowWorkerPoolCleanupForTesting() synchronized with SchedulerWorkers' cleanup step.

Atomics didn't guarantee synchronization, as a worker could observe that it was allowed to clean up, then be preempted, and later actually clean up after cleanup had been disallowed. Synchronizing on the pool's lock prevents this.

I believe this is the source of most TaskScheduler flakes we've seen on Fuchsia. The actual tests being blamed were also random, because a leaky worker would result in a use-after-free in the following test, not the test at fault.

While this CL synchronizes cleanups with DisallowWorkerPoolCleanupForTesting(), a follow-up CL will be required to ensure detached workers do not touch pool state after cleanup (since they removed themselves from |workers_|, they will not be joined, and hence using pool state can result in a use-after-free).

R=fdoray@chromium.org, robliao@chromium.org

Bug: 813278
Change-Id: I455a2c8cff3aa9b1176567cc4a6409ff6ca6807d
Reviewed-on: https://chromium-review.googlesource.com/929061
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: Robert Liao <robliao@chromium.org>
Cr-Commit-Position: refs/heads/master@{#538388}

[modify] https://crrev.com/fc40b1417b669d7df75b4d61491c521048027fda/base/task_scheduler/scheduler_worker_pool_impl.cc
[modify] https://crrev.com/fc40b1417b669d7df75b4d61491c521048027fda/base/task_scheduler/scheduler_worker_pool_impl.h
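A minimal standalone sketch of the lock-based approach the commit describes (simplified and assumed, not the actual Chromium change; LockedPoolSketch and TryCleanupWorker are hypothetical names): both the "cleanup disallowed" bit and each worker's de-registration happen under the same lock, so a worker can no longer decide to clean up after cleanup has been disallowed.

#include <algorithm>
#include <mutex>
#include <vector>

class LockedPoolSketch {
 public:
  // Main thread (test teardown).
  void DisallowWorkerCleanupForTesting() {
    std::lock_guard<std::mutex> guard(lock_);
    worker_cleanup_disallowed_ = true;
    // From here on, no worker can pass the check in TryCleanupWorker().
  }

  // Worker thread, when it considers cleaning up after being idle. Returns
  // true if it de-registered itself.
  bool TryCleanupWorker(int worker_id) {
    std::lock_guard<std::mutex> guard(lock_);
    if (worker_cleanup_disallowed_)
      return false;  // Stay registered; JoinForTesting() will join this worker.
    // Still allowed: de-register while holding the lock, so the check and the
    // removal are atomic with respect to DisallowWorkerCleanupForTesting().
    workers_.erase(std::remove(workers_.begin(), workers_.end(), worker_id),
                   workers_.end());
    return true;
  }

 private:
  std::mutex lock_;
  bool worker_cleanup_disallowed_ = false;
  std::vector<int> workers_;
};

int main() {
  LockedPoolSketch pool;
  pool.DisallowWorkerCleanupForTesting();
  return pool.TryCleanupWorker(0) ? 1 : 0;  // 0: cleanup correctly refused.
}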
Feb 22 2018
@#12: right, but I'm not trusting this stack much. This is a Release build and the symbols may be misaligned. All we know is that the worker is doing a use-after-free; everything else may be garbage IMO.