Issue 810464

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Fuchsia
Pri: 1
Type: Bug

Blocked on:
issue 756898

Blocking:
issue 735701



Crash in base::internal::TaskTracker::IsShutdownComplete() in base_unittests

Reported by w...@chromium.org, Feb 8 2018

Issue description

We have a report of a raft of tests failing in base_unittests on the Fuchsia (dbg) FYI bot, which appears to be due to a single crash of the test batch process. Output from the tests is:

[00162.989] 03907.03948> [ RUN      ] TaskSchedulerWorkerPoolBlockingTest.PostBeforeBlocking/WILL_BLOCK_MAY_BLOCK
[00162.989] 03907.03948> [       OK ] TaskSchedulerWorkerPoolBlockingTest.PostBeforeBlocking/WILL_BLOCK_MAY_BLOCK (146 ms)
[00162.989] 03907.03948> [ RUN      ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/MAY_BLOCK
[00162.990] 03907.03948> [       OK ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/MAY_BLOCK (333 ms)
[00162.990] 03907.03948> [ RUN      ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/WILL_BLOCK
[00162.991] 03907.03948> [       OK ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/WILL_BLOCK (181 ms)
[00162.991] 03907.03948> [ RUN      ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/MAY_BLOCK_WILL_BLOCK
[00162.991] 03907.03948> [       OK ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/MAY_BLOCK_WILL_BLOCK (202 ms)
[00162.992] 03907.03948> [ RUN      ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/WILL_BLOCK_MAY_BLOCK
[00162.992] 03907.03948> [       OK ] TaskSchedulerWorkerPoolBlockingTest.WorkersIdleWhenOverCapacity/WILL_BLOCK_MAY_BLOCK (168 ms)
[00162.992] 03907.03948> [----------] 5 tests from TaskSchedulerWorkerPoolBlockingTest (1085 ms total)
[00162.993] 03907.03948>
[00162.993] 03907.03948> [----------] 1 test from TaskSchedulerWorkerPoolOverWorkerCapacityTest
[00162.993] 03907.03948> [ RUN      ] TaskSchedulerWorkerPoolOverWorkerCapacityTest.VerifyCleanup
[00162.995] 03907.03948> [       OK ] TaskSchedulerWorkerPoolOverWorkerCapacityTest.VerifyCleanup (558 ms)
[00162.995] 03907.03948> [----------] 1 test from TaskSchedulerWorkerPoolOverWorkerCapacityTest (584 ms total)
[00162.996] 03907.03948>
[00162.996] 03907.03948> [----------] 4 tests from TaskSchedulerWorkerStackTest
[00162.997] 03907.03948> [ RUN      ] TaskSchedulerWorkerStackTest.PushPop
[00162.998] 03907.03948> [       OK ] TaskSchedulerWorkerStackTest.PushPop (0 ms)
[00162.998] 03907.03948> [ RUN      ] TaskSchedulerWorkerStackTest.PeekPop
[00162.998] 03907.03948> [       OK ] TaskSchedulerWorkerStackTest.PeekPop (0 ms)
[00163.004] 03907.03948> [ RUN      ] TaskSchedulerWorkerStackTest.Contains
[00163.004] 03907.03948> [       OK ] TaskSchedulerWorkerStackTest.Contains (0 ms)
[00163.004] 03907.03948> [ RUN      ] TaskSchedulerWorkerStackTest.Remove
[00163.004] 03907.03948> [       OK ] TaskSchedulerWorkerStackTest.Remove (0 ms)

We run ten tests per batch, so this is likely a crash during teardown of the batch process.

The crash backtrace is:

[00162.845] 01196.01248> <== fatal exception: process base_unittests__exec[37366] thread TaskSchedulerAWorker[38439]
[00162.846] 01196.01248> <== fatal page fault, PC at 0x4e0f7ee80b84
...
#01: pc 0x4e0f7ee80b84 sp 0x6aa4ca486690 (libc.so,0x18b84)
#02: base::internal::LockImpl::Lock() at ??:?
#03: base::Lock::Acquire() at ??:?
#04: base::internal::SchedulerLockImpl::Acquire() at ??:?
#05: base::internal::AutoSchedulerLock::AutoSchedulerLock(base::internal::SchedulerLock&) at ??:?
#06: base::internal::TaskTracker::IsShutdownComplete() const at ??:?
#07: base::internal::SchedulerWorkerPoolImpl::SchedulerWorkerDelegateImpl::OnMainExit(base::internal::SchedulerWorker*) at ??:?
#08: base::internal::SchedulerWorker::Thread::ThreadMain() at ??:?
#09: base::(anonymous namespace)::ThreadFunc(void*) at ??:?
#10: pc 0x4e0f7ee80876 sp 0x6aa4ca486fe0 (libc.so,0x18876)
#11: pc 0x4e0f7eef66c9 sp 0x6aa4ca486ff0 (libc.so,0x8e6c9)
#12: pc 0 sp 0x6aa4ca487000

Looking at https://cs.chromium.org/chromium/src/base/task_scheduler/scheduler_worker_pool_impl.cc?gsn=outer_&l=780, it appears that the SchedulerWorkerDelegateImpl is owned by the SchedulerWorker, which is ref-counted, with a reference held by the SchedulerWorkerPoolImpl. The SchedulerWorkerDelegateImpl in turn holds a raw pointer back to the SchedulerWorkerPoolImpl.
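
For illustration, a minimal standalone model of that ownership graph (simplified stand-ins for the Chromium classes, with std::shared_ptr in place of base's ref-counting; this is a sketch, not the real API):

#include <memory>
#include <vector>

struct SchedulerWorkerPoolImpl;

struct SchedulerWorkerDelegateImpl {
  // Raw back-pointer to the pool: nothing here keeps the pool alive.
  SchedulerWorkerPoolImpl* outer_ = nullptr;
};

struct SchedulerWorker {
  // The delegate is owned by the worker...
  std::unique_ptr<SchedulerWorkerDelegateImpl> delegate_;
};

struct SchedulerWorkerPoolImpl {
  // ...and the worker is ref-counted, with a reference held by the pool.
  std::vector<std::shared_ptr<SchedulerWorker>> workers_;
};

So once the pool drops its references and is destroyed, nothing prevents a still-running worker thread from dereferencing |outer_|.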

I think what's happening here is a race condition between an idle SchedulerWorker deciding to exit and SchedulerWorkerPoolImpl teardown (sketched in code after this list):
- The idle SchedulerWorker enters GetWork(), takes the pool lock, confirms it can clean up, and removes itself from |workers_|.
- The idle SchedulerWorker thread is preempted.
- The SchedulerWorkerPoolImpl is torn down, because this is a test (per the comments in the dtor, this presumably doesn't happen in production).
- The idle SchedulerWorker thread is re-scheduled by the OS and invokes OnMainExit() on its delegate; the delegate then use-after-frees the deleted pool impl.
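
A standalone sketch of that interleaving, with std::thread/std::mutex stand-ins (all names are hypothetical; the sleeps just force the bad timing that the OS scheduler only produces occasionally):

#include <chrono>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

struct Pool {  // Hypothetical stand-in for SchedulerWorkerPoolImpl.
  std::mutex lock;
  std::vector<int> workers;  // Stand-in for |workers_|.

  bool IsShutdownComplete() {
    std::lock_guard<std::mutex> auto_lock(lock);  // Faults if |this| is gone.
    return true;
  }
};

int main() {
  auto pool = std::make_unique<Pool>();
  Pool* raw = pool.get();  // The delegate's raw |outer_| pointer.

  std::thread worker([raw] {
    {
      // GetWork(): under the pool lock, decide to clean up and remove
      // self from |workers_|.
      std::lock_guard<std::mutex> auto_lock(raw->lock);
      raw->workers.clear();
    }
    // Worker is "preempted" here; the pool no longer accounts for it,
    // so teardown will not join it.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    // OnMainExit() -> IsShutdownComplete(): |raw| now dangles.
    raw->IsShutdownComplete();  // Deliberate use-after-free.
  });

  std::this_thread::sleep_for(std::chrono::milliseconds(10));
  pool.reset();  // Pool torn down while the worker thread still runs.
  worker.join();
}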

Marking P1 for M67, since this risks making tests randomly flake at a very low (== hard to diagnose) rate.
 

Comment 1 by gab@chromium.org, Feb 13 2018

Cc: robliao@chromium.org fdoray@chromium.org
Owner: robliao@chromium.org
@robliao: looks like some workers touch SchedulerWorkerPoolImpl after it was destroyed. This shouldn't be possible because ~SchedulerWorkerPoolImpl() confirms it's called after SchedulerWorkerPoolImpl::JoinForTesting(). Maybe some workers go unaccounted for when they detach? In which case they wouldn't be waited upon in JoinForTesting?

As agreed yesterday, Detach+JoinForTesting is your area :). This may overlap into Francois' worker-capacity territory, but he's already got more than his share.

Comment 2 by w...@chromium.org, Feb 21 2018

Blocking: 735701

Comment 3 by gab@chromium.org, Feb 21 2018

Cc: akhaus...@yandex-team.ru gab@chromium.org
Issue 813278 has been merged into this issue.

Comment 4 by gab@chromium.org, Feb 21 2018

Owner: gab@chromium.org
Status: Started (was: Assigned)
Dug into this from other (dupped) bugs, copying diagnosis from there:

"""
Trying to figure out why a worker could possibly outlive the pool/scheduler/TaskTracker (we always ensure that the pool/workers were joined in the destructor, after all):

TaskSchedulerImpl::~TaskSchedulerImpl() {
#if DCHECK_IS_ON()
  DCHECK(join_for_testing_returned_.IsSet());
#endif
}

I now have a theory: we rely on atomics in an incorrect way in one testing-only method.

That is, SchedulerWorkerPoolImpl::DisallowWorkerCleanupForTesting() sets an atomic bit, and workers are expected not to clean up once they see that bit. However, nothing in the atomics rules forces the other threads to see that value promptly: the release/acquire barriers in AtomicFlag only guarantee that *once* a thread sees the value, it also sees everything that happened before the value was set, which is irrelevant here.

Because of this, a worker can observe CanCleanup() == true after the main thread has invoked SchedulerWorkerPoolImpl::DisallowWorkerCleanupForTesting(). In that case it will remove itself from |workers_| and will not be waited upon when JoinForTesting() is invoked on the main thread. And hence we have our worker outliving its pool..!
"""

CL @ https://chromium-review.googlesource.com/#/c/chromium/src/+/929061 

Note to self: try to revert https://chromium-review.googlesource.com/925546 to re-enable some TaskScheduler tests after this is fixed.

Comment 5 by gab@chromium.org, Feb 21 2018

Issue 813651 has been merged into this issue.

Comment 6 by gab@chromium.org, Feb 21 2018

Issue 806003 has been merged into this issue.

Comment 7 by w...@chromium.org, Feb 22 2018

gab: Did you see robliao@'s https://bugs.chromium.org/p/chromium/issues/detail?id=813278#c12 comment? Do we have a clear idea of how this can sometimes also result in hitting RunNextTask() after the SchedulerWorkerPool and TaskTracker are gone?

Comment 8 by gab@chromium.org, Feb 22 2018

Labels: OS-Fuchsia
Copying response from there (let's keep discussing here):

I'm not trusting this stack much. This is a Release build and symbols may be misaligned. All we know is the worker is doing a use-after-free, everything else may be garbage IMO. I'm wrapping my head around this as we speak (r538388 is a first step but not sufficient).

Comment 9 by gab@chromium.org, Feb 22 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/fc40b1417b669d7df75b4d61491c521048027fda

commit fc40b1417b669d7df75b4d61491c521048027fda
Author: Gabriel Charette <gab@chromium.org>
Date: Thu Feb 22 10:02:04 2018

Make SchedulerWorkerPoolImpl::DisallowWorkerPoolCleanupForTesting() synchronized with SchedulerWorkers' cleanup step.

Atomics didn't guarantee synchronization, as a worker could observe it
was allowed to clean up, then be preempted, and later actually clean up
after cleanup had been disallowed. Synchronizing on the pool's lock
prevents this.

I believe this is the source of most TaskScheduler flakes we've seen on
Fuchsia. The actual tests being blamed were also random, because a leaky
worker would result in a use-after-free in the following test, not
the test at fault.

While this CL synchronizes cleanups with DisallowWorkerPoolCleanupForTesting(),
a follow-up CL will be required to ensure detached workers do not touch pool
state after cleanup (since they removed themselves from |workers_| they will
not be joined and hence using pool state can result in a use-after-free).

R=fdoray@chromium.org, robliao@chromium.org

Bug:   813278  
Change-Id: I455a2c8cff3aa9b1176567cc4a6409ff6ca6807d
Reviewed-on: https://chromium-review.googlesource.com/929061
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: Robert Liao <robliao@chromium.org>
Cr-Commit-Position: refs/heads/master@{#538388}
[modify] https://crrev.com/fc40b1417b669d7df75b4d61491c521048027fda/base/task_scheduler/scheduler_worker_pool_impl.cc
[modify] https://crrev.com/fc40b1417b669d7df75b4d61491c521048027fda/base/task_scheduler/scheduler_worker_pool_impl.h
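
The shape of that synchronization, in the simplified model from earlier in this bug (a sketch of the idea behind r538388, not the actual CL):

#include <algorithm>
#include <mutex>
#include <vector>

struct Pool {
  std::mutex lock;                  // Stand-in for the pool's lock.
  bool cleanup_disallowed = false;  // Now guarded by |lock|, not atomic.
  std::vector<int> workers;         // Stand-in for |workers_|.

  // Worker side: deciding to clean up and leaving |workers_| form one
  // critical section, so neither can interleave with the disallow step.
  bool TryCleanup(int worker_id) {
    std::lock_guard<std::mutex> auto_lock(lock);
    if (cleanup_disallowed)
      return false;
    workers.erase(std::remove(workers.begin(), workers.end(), worker_id),
                  workers.end());
    return true;
  }

  // Main-thread side: once this returns, no worker can still be inside
  // TryCleanup() concluding that cleanup is allowed.
  void DisallowWorkerCleanupForTesting() {
    std::lock_guard<std::mutex> auto_lock(lock);
    cleanup_disallowed = true;
  }
};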

Comment 10 by gab@chromium.org, Feb 22 2018

Blockedon: 756898

Comment 11 by bugdroid1@chromium.org, Feb 23 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d5b04e7e34dac974c7d6ce4b138f66419fca0ca9

commit d5b04e7e34dac974c7d6ce4b138f66419fca0ca9
Author: Gabriel Charette <gab@chromium.org>
Date: Fri Feb 23 06:19:02 2018

SchedulerWorker: Do not go back to sleep after cleanup.

At first sight this is merely a correctness fix to avoid sleeping twice.

But it is also a prerequisite to fix crbug.com/810464, as calling
WaitForWork() after cleanup (which may occur as part of GetWork())
is not safe.

This may also help timing flakes on reclaim tests for which the typical
resolution wait is 3X the reclaim time (but was only 1.5X in practice
because of this bug).

R=fdoray@chromium.org, robliao@chromium.org

Bug:  810464 
Change-Id: I5e00de763f95773120022d258a513bb19a30c6b1
Reviewed-on: https://chromium-review.googlesource.com/931541
Reviewed-by: François Doray <fdoray@chromium.org>
Commit-Queue: Gabriel Charette <gab@chromium.org>
Cr-Commit-Position: refs/heads/master@{#538712}
[modify] https://crrev.com/d5b04e7e34dac974c7d6ce4b138f66419fca0ca9/base/task_scheduler/scheduler_worker.cc
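
The loop shape this CL implies, as a hedged standalone sketch (every name below is an illustrative stub, not the real SchedulerWorker API):

#include <cstdio>

struct Task {};
struct Delegate {
  Task* GetWork() { return nullptr; }  // May trigger cleanup internally.
  void WaitForWork() {}
  void OnMainExit() { std::puts("worker exiting"); }
};

bool ShouldExit() { return true; }  // True once cleaned up or joined.

void ThreadMain(Delegate* delegate) {
  while (true) {
    Task* task = delegate->GetWork();
    // The fix: check for exit *before* sleeping again. A worker that
    // cleaned up inside GetWork() must not re-enter WaitForWork(),
    // since waiting touches state the worker no longer owns.
    if (ShouldExit())
      break;
    if (!task) {
      delegate->WaitForWork();
      continue;
    }
    // RunTask(task) would go here.
  }
  delegate->OnMainExit();
}

int main() {
  Delegate d;
  ThreadMain(&d);
}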

Comment 12 by bugdroid1@chromium.org, Feb 23 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a8863d2fc9bc0ede0e9f35306ac85cc0195f3e6c

commit a8863d2fc9bc0ede0e9f35306ac85cc0195f3e6c
Author: Gabriel Charette <gab@chromium.org>
Date: Fri Feb 23 10:25:03 2018

[TaskScheduler] Do not log 0 to TaskScheduler.NumTasksBetweenWaits histogram on detach.

Doing so accesses |outer_|, which is unsafe after invoking Cleanup().

While this will be made safe with https://chromium-review.googlesource.com/c/chromium/src/+/931547
we had noticed this metric being wrong in the past but had postponed
cleaning up to avoid perturbing metrics as part of a bigger change.
Now is the time to drop this report. I don't think we need to rename
the metric because it's an internal diagnosis metric for our team.

R=fdoray@chromium.org, robliao@chromium.org

Bug:  756898 ,  810464 
Change-Id: Ica379d6f3c21aebabeb0574847ba0bfcda346df8
Reviewed-on: https://chromium-review.googlesource.com/931524
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Cr-Commit-Position: refs/heads/master@{#538751}
[modify] https://crrev.com/a8863d2fc9bc0ede0e9f35306ac85cc0195f3e6c/base/task_scheduler/scheduler_worker_pool_impl.cc
[modify] https://crrev.com/a8863d2fc9bc0ede0e9f35306ac85cc0195f3e6c/base/task_scheduler/scheduler_worker_pool_impl_unittest.cc

Comment 13 by gab@chromium.org, Feb 24 2018

Issue 763051 has been merged into this issue.

Comment 14 by bugdroid1@chromium.org, Feb 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/550cb548914d5fc99258ced55c94f3348ad5a55a

commit 550cb548914d5fc99258ced55c94f3348ad5a55a
Author: Gabriel Charette <gab@chromium.org>
Date: Sun Feb 25 21:57:00 2018

[SchedulerWorkerDelegateImpl] Add THREAD_CHECKER checks to document and enforce existing behaviour.

After just doing a full pass of this code for issue 810464, I think
leaving these CHECKs behind will help a future reader as well as
enforce the required calling semantics.

R=fdoray@chromium.org

Bug:  810464 
Change-Id: I48bd2c3c0308be3b856ef518ed41382497ca5cf8
Reviewed-on: https://chromium-review.googlesource.com/934081
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: Robert Liao <robliao@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Cr-Commit-Position: refs/heads/master@{#539052}
[modify] https://crrev.com/550cb548914d5fc99258ced55c94f3348ad5a55a/base/task_scheduler/scheduler_worker_pool_impl.cc

Comment 15 by bugdroid1@chromium.org, Feb 26 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/1ae9a275a252299bfc2936cc899143f5bad88ab4

commit 1ae9a275a252299bfc2936cc899143f5bad88ab4
Author: Gabriel Charette <gab@chromium.org>
Date: Mon Feb 26 09:56:23 2018

[SchedulerWorkerPoolImpl] Reinforce JoinForTesting to prevent use-after-free post Cleanup().

SchedulerWorker can touch SchedulerWorkerPoolImpl
through SchedulerWorker::delegate_->outer_.

This could cause use-after-free in tests (not a problem in prod because
we never join workers nor destroy the scheduler/pools) because
SchedulerWorkerPoolImpl no longer accounts for a SchedulerWorker
after invoking SchedulerWorker::Cleanup and hence will not join it
in JoinForTesting() before it is itself destroyed.

Before this CL, a SchedulerWorker could access SchedulerWorkerPoolImpl
after Cleanup(), and hence potentially use-after-free if the pool was
destroyed before the detached thread completed execution.

This CL reinforces JoinForTesting() to wait for all workers that
ever had a reference to |this| (i.e. its |workers_| + any recently
reclaimed workers).

Added TaskSchedulerWorkerPoolTest.RacyCleanup as a regression test, although
it passes in most scenarios without the fix because this race is hard to
exercise...

And added a DCHECK on start that the requested pool size isn't greater than
kMaxNumberOfWorkers (as I tried to add 1000 workers in my new test and had to
debug why it hung until I rediscovered this intentional behavior).

Bug:  810464 
Change-Id: Ibaefb5446d462c325e4ddfd34a54ad3c448c1417
Reviewed-on: https://chromium-review.googlesource.com/931547
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Cr-Commit-Position: refs/heads/master@{#539095}
[modify] https://crrev.com/1ae9a275a252299bfc2936cc899143f5bad88ab4/base/task_scheduler/scheduler_worker.cc
[modify] https://crrev.com/1ae9a275a252299bfc2936cc899143f5bad88ab4/base/task_scheduler/scheduler_worker.h
[modify] https://crrev.com/1ae9a275a252299bfc2936cc899143f5bad88ab4/base/task_scheduler/scheduler_worker_pool_impl.cc
[modify] https://crrev.com/1ae9a275a252299bfc2936cc899143f5bad88ab4/base/task_scheduler/scheduler_worker_pool_impl.h
[modify] https://crrev.com/1ae9a275a252299bfc2936cc899143f5bad88ab4/base/task_scheduler/scheduler_worker_pool_impl_unittest.cc
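
The accounting idea behind the reinforced join, as a simplified sketch (the real CL tracks workers differently; std::thread stands in for the ref-counted SchedulerWorker):

#include <mutex>
#include <thread>
#include <utility>
#include <vector>

struct Pool {
  std::mutex lock;
  std::vector<std::thread> workers;             // Stand-in for |workers_|.
  std::vector<std::thread> cleaned_up_workers;  // Recently reclaimed workers.

  // When a worker cleans up, its handle is moved aside rather than
  // dropped, so the pool still accounts for it at join time.
  void OnWorkerCleanup(size_t i) {
    std::lock_guard<std::mutex> auto_lock(lock);
    cleaned_up_workers.push_back(std::move(workers[i]));
  }

  // Waits on every worker that ever had a reference to |this|, so none
  // can outlive the pool and use-after-free through its raw pointer.
  void JoinForTesting() {
    std::vector<std::thread> to_join;
    {
      std::lock_guard<std::mutex> auto_lock(lock);
      for (auto& t : workers)
        if (t.joinable()) to_join.push_back(std::move(t));
      for (auto& t : cleaned_up_workers)
        to_join.push_back(std::move(t));
      workers.clear();
      cleaned_up_workers.clear();
    }
    for (auto& t : to_join)
      t.join();
  }
};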

Comment 16 by gab@chromium.org, Feb 26 2018

Issue 761400 has been merged into this issue.

Comment 17 by gab@chromium.org, Feb 26 2018

Issue 763058 has been merged into this issue.

Comment 18 by bugdroid1@chromium.org, Feb 26 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f8a3b7bc2e618a652d980d8cf8de66d2664f554d

commit f8a3b7bc2e618a652d980d8cf8de66d2664f554d
Author: Wez <wez@chromium.org>
Date: Mon Feb 26 19:23:40 2018

Disable TaskSchedulerWorkerPoolTest.RacyCleanup under Fuchsia.

This test flakily hangs, appearing to live-lock in a worker-thread
creation/teardown cycle.

TBR: gab
Bug: 816575,  810464 
Change-Id: I22c42740a2e9af229400bc2da077ae962a287804
Reviewed-on: https://chromium-review.googlesource.com/937984
Reviewed-by: Wez <wez@chromium.org>
Reviewed-by: Robert Liao <robliao@chromium.org>
Commit-Queue: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#539228}
[modify] https://crrev.com/f8a3b7bc2e618a652d980d8cf8de66d2664f554d/base/task_scheduler/scheduler_worker_pool_impl_unittest.cc

Comment 19 by bugdroid1@chromium.org, Mar 1 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/cb4da43870053c2857661082f2ef29c938a69d6a

commit cb4da43870053c2857661082f2ef29c938a69d6a
Author: Gabriel Charette <gab@chromium.org>
Date: Thu Mar 01 09:58:16 2018

[SchedulerWorkerPoolImpl] Remove DisallowWorkerCleanupForTesting().

Since cleanup was made thread-safe in r539095 (and disallowing cleanup
was synchronized on |lock_| in r538388), this was only still needed
during JoinForTesting(), to ensure workers in |workers_copy| didn't
decide to clean up during the join. As such, always ban cleanup during
join and remove the API that controlled that bit.

R=fdoray@chromium.org, robliao@chromium.org

Bug:  810464 
Change-Id: I46edfcc7e37f3a5cb54320b0c38cad5d567c8eb7
Reviewed-on: https://chromium-review.googlesource.com/937264
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Reviewed-by: Robert Liao <robliao@chromium.org>
Cr-Commit-Position: refs/heads/master@{#540119}
[modify] https://crrev.com/cb4da43870053c2857661082f2ef29c938a69d6a/base/task_scheduler/scheduler_worker_pool_impl.cc
[modify] https://crrev.com/cb4da43870053c2857661082f2ef29c938a69d6a/base/task_scheduler/scheduler_worker_pool_impl.h
[modify] https://crrev.com/cb4da43870053c2857661082f2ef29c938a69d6a/base/task_scheduler/scheduler_worker_pool_impl_unittest.cc
[modify] https://crrev.com/cb4da43870053c2857661082f2ef29c938a69d6a/base/task_scheduler/task_scheduler_impl.cc

Comment 20 by gab@chromium.org, Mar 21 2018

Status: Fixed (was: Started)
I think this one should be fixed now.
