
Issue 813278

Starred by 2 users

Issue metadata

Status: Duplicate
Merged: issue 810464
Owner:
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Fuchsia
Pri: 3
Type: Bug

Blocking:
issue 735701




TaskSchedulerWorkerPoolImplPostTaskBeforeStartTest.PostManyTasks failed under Fuchsia/ARM64 FYI

Project Member Reported by w...@chromium.org, Feb 17 2018

Issue description

TaskSchedulerWorkerPoolImplPostTaskBeforeStartTest.PostManyTasks failed in https://build.chromium.org/p/chromium.fyi/builders/Fuchsia%20ARM64/builds/5819 with a crash:

[00104.257] 01200.01255> <== fatal exception: process base_unittests__exec[32690] thread TaskSchedulerSingleThreadForegr[33216]
[00104.258] 01200.01255> <== fatal page fault, PC at 0x211d571ce0d0
...
[00104.294] 01200.01255> dso: id=97caafd6646d3ad81e2dc07bba6fb4fec7d3eb3a base=0xb3cd515b0000 name=<vDSO>
[00104.294] 01200.01255> dso: id=9e3569cc22dfd432bc2ac3d3b2829196909ad4be base=0x90e47c61a000 name=libc.so
[00104.294] 01200.01255> dso: id=45ed5eb239c1b28eeb9706e4130e24f4a5f378d6 base=0x68b0dca4f000 name=liblaunchpad.so
[00104.294] 01200.01255> dso: id=e5a518a1be244c24d4eadc3ff79bd2478f978376 base=0x52cef2d12000 name=libfdio.so
[00104.294] 01200.01255> dso: id=4542a42823657d4a base=0x211d56713000 name=app:base_unittests__exec
#01: base::internal::TaskTracker::RunNextTask(scoped_refptr<base::internal::Sequence>, base::internal::CanScheduleSequenceObserver*) at ??:?
#02: base::internal::SchedulerWorker::Thread::ThreadMain() at ??:?
#03: base::(anonymous namespace)::ThreadFunc(void*) at ??:?
#04: pc 0x90e47c62ae4c sp 0x8ad8deb7ffc0 (libc.so,0x10e4c)
#05: pc 0x90e47c69d820 sp 0x8ad8deb7ffe0 (libc.so,0x83820)
...
[00104.431] 03757.03783> [1566/2928] TaskSchedulerWorkerPoolImplPostTaskBeforeStartTest.PostManyTasks (CRASHED)


 

Comment 1 by gab@chromium.org, Feb 19 2018

Labels: Test-Flaky
Hmmm, any way to get a better stack? Thanks!

Comment 2 by w...@chromium.org, Feb 20 2018

Cc: scottmg@chromium.org
Re #1: This failed on Fuchsia/ARM64 FYI, which is a Release builder, so |symbol_level| should be zero; I'm actually a little surprised that we're getting symbols at all.

We could increase |symbol_level| on our FYI bots, but then we'd be building (much) bigger binaries, which would increase build & run times, etc.

Comment 3 by gab@chromium.org, Feb 21 2018

That's unfortunate. On Windows we have symbols even for Release builds. Can't do much to diagnose failures otherwise...

FWIW, this stack looks similar to issue 813651.

Maybe the worker is calling into a freed TaskTracker and the rest is garbage..? (shouldn't happen per my current mental model but eh)

Comment 4 by gab@chromium.org, Feb 21 2018

Cc: akhaus...@yandex-team.ru
Issue 813651 has been merged into this issue.

Comment 5 by gab@chromium.org, Feb 21 2018

Owner: gab@chromium.org
Status: Started (was: Untriaged)
I dug into this and other bugs to figure out why a worker could possibly outlive the pool/scheduler/TaskTracker (we always ensure the workers are joined in the destructor, after all):

TaskSchedulerImpl::~TaskSchedulerImpl() {
#if DCHECK_IS_ON()
  DCHECK(join_for_testing_returned_.IsSet());
#endif
}

I now have a theory: we rely on atomics in an incorrect way in one testing-only method.

That is, SchedulerWorkerPoolImpl::DisallowWorkerCleanupForTesting() sets an atomic bit, and workers are expected not to clean up when they see that bit. However, absolutely nothing in the atomic rules forces the other threads to see that value in a timely fashion (atomics only specify that *once* a thread sees the value, it is also guaranteed to see everything that happened before it was set, per the release/acquire barriers in AtomicFlag, but that's irrelevant here).

Because of this, a worker can still get CanCleanup() == true after the main thread has invoked SchedulerWorkerPoolImpl::DisallowWorkerCleanupForTesting(). In that case it will remove itself from |workers_| and will not be waited upon when JoinForTesting() is invoked on the main thread. And hence we have a worker outliving its pool..!
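
To make the theorized race concrete, here is a minimal standalone C++ sketch. It is not the real SchedulerWorkerPoolImpl code: WorkerPool, DisallowCleanupForTesting() and WorkerMaybeCleanup() are hypothetical stand-ins, and std::atomic<bool> stands in for base's AtomicFlag. What it illustrates is that the flag check and the worker's self-removal are two separate steps, so the disallow can land in the gap between them:

#include <algorithm>
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

class WorkerPool {
 public:
  void DisallowCleanupForTesting() {
    // Sets the flag, but a worker that already read 'false' below may be
    // preempted right after its check and finish cleaning up after this
    // call returns.
    cleanup_disallowed_.store(true, std::memory_order_release);
  }

  // Called on a worker thread when it decides it is no longer needed.
  void WorkerMaybeCleanup(std::thread::id id) {
    // Check...
    if (cleanup_disallowed_.load(std::memory_order_acquire))
      return;
    // ...then act. If DisallowCleanupForTesting() runs in between, this
    // worker still removes itself from |workers_|, so a later
    // JoinForTesting() never joins it and the worker outlives the pool.
    std::lock_guard<std::mutex> lock(lock_);
    workers_.erase(std::remove(workers_.begin(), workers_.end(), id),
                   workers_.end());
  }

 private:
  std::atomic<bool> cleanup_disallowed_{false};
  std::mutex lock_;
  std::vector<std::thread::id> workers_;  // The threads JoinForTesting() joins.
};

In the failing schedule, a worker is past the flag check in WorkerMaybeCleanup() when the main thread runs DisallowCleanupForTesting() and then joins |workers_|, missing that worker entirely.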

(working on fix now)

Comment 6 by gab@chromium.org, Feb 21 2018

Blocking: 735701

Comment 7 by gab@chromium.org, Feb 21 2018

CL @ https://chromium-review.googlesource.com/#/c/chromium/src/+/929061 

Note to self: try to revert https://chromium-review.googlesource.com/925546 to re-enable some TaskScheduler tests after this is fixed.

Comment 8 by w...@chromium.org, Feb 21 2018

Re #5: Based on your description this sounds like a duplicate of issue 810464, right?

Comment 9 by gab@chromium.org, Feb 21 2018

Mergedinto: 810464
Status: Duplicate (was: Started)
Ah, yes, missed that one when digging for open bugs.. thanks!

Comment 10 by w...@chromium.org, Feb 21 2018

Could this also fit issue 806003?

Comment 11 by gab@chromium.org, Feb 21 2018

Yes, thanks! Marked as duplicate. I think this encompasses all the flakes except the timing ones based on worker capacity.

Comment 12

Clearing up #5 after some offline discussion:

There is a synchronization issue between CanWorkerCleanupForTesting() and the state synchronized by the SchedulerWorkerPoolImpl lock_: CanWorkerCleanupForTesting() could transition from true to false before CleanupLockRequired() occurs.

However, even in this unsynchronized case, it's not clear how the thread would make it to RunNextTask(): once the cleanup path is triggered, the thread exits its loop.

For the thread to reach RunNextTask(), GetWork() must have returned a sequence, which means the cleanup work hasn't started yet.
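
To see that ordering, here is a toy model of the worker loop, with hypothetical names standing in for the real SchedulerWorker::Thread::ThreadMain() logic, reduced to just the order of operations discussed above. Once the cleanup branch is taken the loop exits, so a thread that reaches RunNextTask() must first have gotten a sequence from GetWork(), i.e. before any cleanup started:

#include <functional>
#include <optional>

struct ToyWorker {
  std::function<std::optional<int>()> get_work;   // stands in for GetWork()
  std::function<bool()> can_cleanup;              // the cleanup check
  std::function<void(int)> run_next_task;         // TaskTracker::RunNextTask()

  void ThreadMain() {
    for (;;) {
      std::optional<int> sequence = get_work();
      if (!sequence) {
        if (can_cleanup())
          return;  // Cleanup path: the thread exits; RunNextTask() is
                   // unreachable from here on.
        continue;  // (The real worker would block waiting for work here.)
      }
      // Reaching this line implies GetWork() returned a sequence, which in
      // turn implies the cleanup path was not taken on this iteration.
      run_next_task(*sequence);
    }
  }
};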

Project Member Comment 13 by bugdroid1@chromium.org, Feb 22 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/fc40b1417b669d7df75b4d61491c521048027fda

commit fc40b1417b669d7df75b4d61491c521048027fda
Author: Gabriel Charette <gab@chromium.org>
Date: Thu Feb 22 10:02:04 2018

Make SchedulerWorkerPoolImpl::DisallowWorkerPoolCleanupForTesting() synchronized with SchedulerWorkers' cleanup step.

Atomics didn't guarantee synchronization as a worker could observe it
was allowed to cleanup, then be preempted, and later actually cleanup
after cleanup had been disallowed. Synchronizing on the pool's lock
prevents this.

I believe this is the source of most TaskScheduler flakes we've seen on
Fuchsia. The actual tests being blamed were also random because a leaky
worker would result in a use-after-free in the following test, not
the test at fault.

While this CL synchronizes cleanups with DisallowWorkerPoolCleanupForTesting(),
a follow-up CL will be required to ensure detached workers do not touch pool
state after cleanup (since they removed themselves from |workers_| they will
not be joined and hence using pool state can result in a use-after-free).

R=fdoray@chromium.org, robliao@chromium.org

Bug:  813278 
Change-Id: I455a2c8cff3aa9b1176567cc4a6409ff6ca6807d
Reviewed-on: https://chromium-review.googlesource.com/929061
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: Robert Liao <robliao@chromium.org>
Cr-Commit-Position: refs/heads/master@{#538388}
[modify] https://crrev.com/fc40b1417b669d7df75b4d61491c521048027fda/base/task_scheduler/scheduler_worker_pool_impl.cc
[modify] https://crrev.com/fc40b1417b669d7df75b4d61491c521048027fda/base/task_scheduler/scheduler_worker_pool_impl.h
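
For reference, the shape of the fix described by this commit message, sketched with the same hypothetical WorkerPool from the comment 5 example (an assumption about the patch's structure, not the actual diff): the disallow bit moves under the pool's lock, so flipping it and a worker's check-and-remove can no longer interleave.

#include <algorithm>
#include <mutex>
#include <thread>
#include <vector>

class WorkerPool {
 public:
  void DisallowCleanupForTesting() {
    std::lock_guard<std::mutex> lock(lock_);
    // Once this returns, any worker that passed its own locked check has
    // already removed itself; every other worker will observe 'true' and
    // stay in |workers_| to be joined.
    cleanup_disallowed_ = true;
  }

  void WorkerMaybeCleanup(std::thread::id id) {
    std::lock_guard<std::mutex> lock(lock_);
    if (cleanup_disallowed_)
      return;  // Stay joinable via JoinForTesting().
    workers_.erase(std::remove(workers_.begin(), workers_.end(), id),
                   workers_.end());
  }

 private:
  std::mutex lock_;
  bool cleanup_disallowed_ = false;  // Now guarded by |lock_|, not atomic.
  std::vector<std::thread::id> workers_;
};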

Comment 14 by gab@chromium.org, Feb 22 2018

@#12: right, but I'm not trusting this stack much. This is a Release build and symbols may be misaligned. All we know is that the worker is doing a use-after-free; everything else may be garbage IMO.
