
Issue 813651

Starred by 1 user

Issue metadata

Status: Duplicate
Merged: issue 810464
Owner:
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Fuchsia
Pri: 3
Type: Bug

Blocking:
issue 738275




TaskSchedulerWorkerPoolCheckTlsReuse.CheckCleanupWorkers crashed in base_unittests on Fuchsia

Project Member Reported by scottmg@chromium.org, Feb 19 2018

Issue description

https://build.chromium.org/p/chromium.fyi/builders/Fuchsia%20ARM64/builds/6051

https://chromium-swarm.appspot.com/task?id=3bc87f3ae4c98f10&refresh=10&show_raw=1

[00089.853] 03761.03789> [1557/2894] TaskSchedulerLock.PredecessorLongerCycle (4 ms)
[00089.855] 03761.03789> [1558/2894] TaskSchedulerSingleThreadTaskRunnerManagerTest.DifferentThreadsUsed (46 ms)
[00089.856] 03761.03789> [1559/2894] TaskSchedulerSingleThreadTaskRunnerManagerTest.SameThreadUsed (15 ms)
[00089.856] 03761.03789> [1560/2894] TaskSchedulerSingleThreadTaskRunnerManagerTest.RunsTasksInCurrentSequence (38 ms)
[00089.888] 03761.04135> [3:409166567:0219/200914.888797:89888845:ERROR:test_launcher.cc(369)] adding 18cb64bd to live process list
[00089.891] 03761.04135> [3:409166567:0219/200914.890308:89890348:ERROR:process_fuchsia.cc(168)] going to wait for process 18cb64bd (deadline=539890235000, now=89890380000)
[00090.317] 01200.01260> <== fatal exception: process base_unittests__exec[32656] thread TaskSchedulerSingleThreadForegr[33180]
[00090.320] 01200.01260> <== fatal page fault, PC at 0x13b67bf2d7f4
[00090.322] 01200.01260>  x0      0xa5a8c5d69d58 x1      0xc8e0280d6ac8 x2                 0x4 x3                 0x4
[00090.323] 01200.01260>  x4      0xc8e0280d68c0 x5      0xc8fc1712ac80 x6              0xffff x7             0x10000
[00090.323] 01200.01260>  x8                   0 x9                   0 x10     0xc8e02809bf88 x11     0xc8e02809bf88
[00090.323] 01200.01260>  x12            0x10031 x13                  0 x14                  0 x15                  0
[00090.323] 01200.01260>  x16     0xa4c229da1ef0 x17     0xa4c229cdef60 x18                  0 x19     0xa5a8c5d69df0
[00090.323] 01200.01260>  x20     0xc8e028032d60 x21     0xc8e027f58398 x22     0xa5a8c5d69f28 x23     0xa5a8c5d69f30
[00090.329] 01200.01260>  x24                0x1 x25                0x1 x26                0x1 x27                  0
[00090.329] 01200.01260>  x28     0xa5a8c5d69f48 x29     0xa5a8c5d69fc0 lr      0x13b67bf2d7e0 sp      0xa5a8c5d69d50
[00090.329] 01200.01260>  pc      0x13b67bf2d7f4 psr         0x80000000
[00090.332] 01200.01260>  far                  0 esr         0x92000004
[00090.333] 01200.01260> bottom of user stack:
[00090.335] 01200.01260> 0x0000a5a8c5d69d50: 924b24b8 000003db 00000000 00000000 |.$K.............|
[00090.335] 01200.01260> 0x0000a5a8c5d69d60: 7b4762b3 000013b6 7b49559b 000013b6 |.bG{.....UI{....|
[00090.335] 01200.01260> 0x0000a5a8c5d69d70: 00000258 00000000 7bc88270 000013b6 |X.......p..{....|
[00090.335] 01200.01260> 0x0000a5a8c5d69d80: 00000000 00000000 00000000 00000000 |................|
[00090.336] 01200.01260> 0x0000a5a8c5d69d90: 00000000 00000000 00000000 00000000 |................|
[00090.336] 01200.01260> 0x0000a5a8c5d69da0: 00000000 00000000 00000005 00000000 |................|
[00090.336] 01200.01260> 0x0000a5a8c5d69db0: 2809be00 5a8b2e00 00000001 00000000 |...(...Z........|
[00090.336] 01200.01260> 0x0000a5a8c5d69dc0: 00000001 27fa0000 00000000 00000000 |.......'........|
[00090.336] 01200.01260> 0x0000a5a8c5d69dd0: 056020c0 00000000 00000000 00000000 |. `.............|
[00090.336] 01200.01260> 0x0000a5a8c5d69de0: 280b9a80 0000c8e0 29d34601 0000a4c2 |...(.....F.)....|
[00090.336] 01200.01260> 0x0000a5a8c5d69df0: 00000000 00000000 7b4762b3 000013b6 |.........bG{....|
[00090.336] 01200.01260> 0x0000a5a8c5d69e00: 7b49559b 000013b6 00000258 00000000 |.UI{....X.......|
[00090.337] 01200.01260> 0x0000a5a8c5d69e10: 7bc88270 000013b6 00000000 00000000 |p..{............|
[00090.337] 01200.01260> 0x0000a5a8c5d69e20: 00000000 00000000 00000000 00000000 |................|
[00090.337] 01200.01260> 0x0000a5a8c5d69e30: 00000000 00000000 00000000 00000000 |................|
[00090.337] 01200.01260> 0x0000a5a8c5d69e40: 00000005 00000000 4f09c500 5a8b2e00 |...........O...Z|
[00090.337] 01200.01260> arch: aarch64
[00090.368] 01200.01260> dso: id=97caafd6646d3ad81e2dc07bba6fb4fec7d3eb3a base=0xb7a0de324000 name=<vDSO>
[00090.368] 01200.01260> dso: id=9e3569cc22dfd432bc2ac3d3b2829196909ad4be base=0xa4c229cce000 name=libc.so
[00090.368] 01200.01260> dso: id=e5a518a1be244c24d4eadc3ff79bd2478f978376 base=0xa308c9134000 name=libfdio.so
[00090.368] 01200.01260> dso: id=45ed5eb239c1b28eeb9706e4130e24f4a5f378d6 base=0x1d575a99d000 name=liblaunchpad.so
[00090.368] 01200.01260> dso: id=2228e864d342e1c0 base=0x13b67b472000 name=app:base_unittests__exec
#01: base::internal::TaskTracker::RunAndPopNextTask(scoped_refptr<base::internal::Sequence>, base::internal::CanScheduleSequenceObserver*) at ??:?
#02: base::internal::SchedulerWorker::Thread::ThreadMain() at ??:?
#03: base::(anonymous namespace)::ThreadFunc(void*) at ??:?
#04: pc 0xa4c229cdee4c sp 0xa5a8c5d69fc0 (libc.so,0x10e4c)
#05: pc 0xa4c229d51820 sp 0xa5a8c5d69fe0 (libc.so,0x83820)
[00090.438] 03761.04098> [3:403923711:0219/200915.438309:90438365:ERROR:test_launcher.cc(426)] going to zx_task_kill(job) for 1cab671f
[00090.439] 03761.04098> [3:403923711:0219/200915.439462:90439520:ERROR:test_launcher.cc(443)] removing 1cab671f from live process list
[00090.461] 03761.03789> [1561/2894] TaskSchedulerSingleThreadTaskRunnerManagerTest.SharedWithBaseSyncPrimitivesDCHECKs (55 ms)
[00090.461] 03761.03789> [1562/2894] TaskSchedulerSingleThreadTaskRunnerManagerJoinTest.ConcurrentJoin (30 ms)
[00090.461] 03761.03789> [1563/2894] TaskSchedulerSingleThreadTaskRunnerManagerJoinTest.ConcurrentJoinExtraSkippedTask (10 ms)
[00090.461] 03761.03789> [1564/2894] TaskSchedulerSingleThreadTaskRunnerManagerStartTest.PostTaskBeforeStart (108 ms)
[00090.461] 03761.03789> [1565/2894] TaskSchedulerWorkerPoolImplPostTaskBeforeStartTest.PostTasksBeforeStart (39 ms)
[00090.461] 03761.03789> [1566/2894] TaskSchedulerWorkerPoolImplPostTaskBeforeStartTest.PostManyTasks (49 ms)
[00090.462] 03761.03789> [ RUN      ] TaskSchedulerWorkerPoolCheckTlsReuse.CheckCleanupWorkers
[00090.462] 03761.03789> [1567/2894] TaskSchedulerWorkerPoolCheckTlsReuse.CheckCleanupWorkers (CRASHED)
[00090.462] 03761.04098> [3:403923711:0219/200915.462841:90462889:ERROR:test_launcher.cc(369)] adding 1adb66e3 to live process list
[00090.464] 03761.04098> [3:403923711:0219/200915.464355:90464395:ERROR:process_fuchsia.cc(168)] going to wait for process 1adb66e3 (deadline=540464285000, now=90464425000)
[00091.249] 03761.04135> [3:409166567:0219/200916.248855:91248910:ERROR:test_launcher.cc(426)] going to zx_task_kill(job) for 18cb64bd
[00091.250] 03761.04135> [3:409166567:0219/200916.249965:91250015:ERROR:test_launcher.cc(443)] removing 18cb64bd from live process list
 
Cc: akhaus...@yandex-team.ru
Possibly related to reland of https://chromium-review.googlesource.com/c/chromium/src/+/924001 ?

Project Member Comment 2 by bugdroid1@chromium.org, Feb 19 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/36460f0ebdbd5211029488808d3cbef13b20c306

commit 36460f0ebdbd5211029488808d3cbef13b20c306
Author: Scott Graham <scottmg@chromium.org>
Date: Mon Feb 19 22:03:47 2018

fuchsia: Disable TaskSchedulerWorkerPoolCheckTlsReuse.CheckCleanupWorkers

Flaked on ARM64 at
https://build.chromium.org/p/chromium.fyi/builders/Fuchsia%20ARM64/builds/6051,

Blamelist includes
https://chromium-review.googlesource.com/c/chromium/src/+/924001
which looks like it could be related, but also looks similar to general
TaskScheduler timing flake, so disabling.

TBR: wez@chromium.org
Bug: 813651
Change-Id: Ic67fb40d5efe5cd614d6413e180311f372ae7269
Reviewed-on: https://chromium-review.googlesource.com/925546
Reviewed-by: Scott Graham <scottmg@chromium.org>
Commit-Queue: Scott Graham <scottmg@chromium.org>
Cr-Commit-Position: refs/heads/master@{#537686}
[modify] https://crrev.com/36460f0ebdbd5211029488808d3cbef13b20c306/testing/buildbot/filters/fuchsia.base_unittests.filter

I've had a look at TaskSchedulerWorkerPoolCheckTlsReuse.CheckCleanupWorkers. The semantics of TaskSchedulerWorkerPoolCheckTlsReuse::slot_ are identical to those before the reland, so it shouldn't be the problem.

Comment 4 by w...@chromium.org, Feb 20 2018

Looking at the CL, I notice that it also removes the seemingly-redundant base::subtle::Release_Store(&initialized_, ...) calls.  Given that we saw the flake on Fuchsia/ARM64, and given that (IIRC) C++ atomics currently rely on DSB instructions, it seems that change may have altered the synchronization semantics for other fields of the Slot?
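
For illustration, a minimal standard-C++ sketch of what that pairing buys on a weakly-ordered CPU such as ARM64; std::atomic stands in for the base::subtle primitives, and FakeSlot / slot_index / initialized are hypothetical names, not the real Slot fields:

#include <atomic>
#include <cassert>
#include <thread>

struct FakeSlot {
  int slot_index = -1;                   // stands in for the other Slot fields
  std::atomic<bool> initialized{false};  // stands in for |initialized_|
};

FakeSlot g_slot;

void Initializer() {
  g_slot.slot_index = 42;
  // The release store guarantees the write to |slot_index| is visible before
  // |initialized| reads as true. Without release semantics, ARM64 is free to
  // make the two stores visible out of order.
  g_slot.initialized.store(true, std::memory_order_release);
}

void Reader() {
  // The acquire load pairs with the release store above; it is this pairing
  // that establishes the happens-before edge.
  while (!g_slot.initialized.load(std::memory_order_acquire)) {
  }
  assert(g_slot.slot_index == 42);
}

int main() {
  std::thread a(Initializer);
  std::thread b(Reader);
  a.join();
  b.join();
  return 0;
}

If the store were demoted to relaxed ordering (loosely analogous to dropping the Release_Store), a reader could observe initialized == true while slot_index still holds its old value.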

It's a shame that the crash report only lists the PC, but not the actual faulting address; my wild guess would be that this is a use-after-free from some use of Bind(Unretained) in this test suite.

Looking at the test base-classes, I also notice that we flush and join the |pool_| for each test in the base-class TearDown(). That means that if any task is posted to the |pool_| with Bind(Unretained) to a variable on the stack of an individual test (e.g. in this particular test we Bind() to Unretained(this), which would be sufficient), then we may use-after-delete during TearDown().
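
A rough sketch of that lifetime hazard, with std::function / std::thread standing in for Bind(Unretained) and the real worker pool; FakePool and the variable names are made up for illustration:

#include <functional>
#include <thread>
#include <vector>

// FakePool stands in for the worker pool: posted tasks are queued and only
// run when the pool is flushed and joined, as in the base-class TearDown().
struct FakePool {
  std::vector<std::function<void()>> pending;
  void Post(std::function<void()> task) { pending.push_back(std::move(task)); }
  void FlushAndJoin() {
    std::thread worker([this] {
      for (auto& task : pending)
        task();  // tasks only run here, at teardown time
    });
    worker.join();
  }
};

int main() {
  FakePool pool;
  {
    int local = 0;  // lives on the "test body" stack
    // In spirit this is Bind(&Fn, Unretained(&local)): the task captures a raw
    // pointer with no lifetime guarantee.
    pool.Post([ptr = &local] { ++*ptr; });
  }                      // |local| is destroyed here...
  pool.FlushAndJoin();   // ...but the task dereferences it only now.
  return 0;
}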

Comment 5 by gab@chromium.org, Feb 21 2018

Re. removing the Release_Store(&initialized_, ...): releasing the AutoLock under which the state is modified should also provide (at least) a Release_Store-equivalent barrier on the lock.

And I agree with #3.

I doubt that CL is to blame.

Comment 6 by gab@chromium.org, Feb 21 2018

Actually, hold on. While the AutoLock will have a ReleaseStore barrier on the lock, that doesn't matter if the state is read without an AcquireLoad (only paired barriers enforce anything). Since the lock is only taken by other initializers, not by the getters, this doesn't guarantee anything.
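
A small sketch of that unpaired case, using standard C++ stand-ins (std::mutex for AutoLock, a relaxed atomic for the plain read; all names here are hypothetical), to show the flaw rather than a correct pattern:

#include <atomic>
#include <mutex>
#include <thread>

std::mutex g_lock;                 // stands in for the Lock behind AutoLock
int g_state = 0;                   // written only under the lock
std::atomic<bool> g_ready{false};

void Initializer() {
  std::lock_guard<std::mutex> auto_lock(g_lock);
  g_state = 1;
  g_ready.store(true, std::memory_order_relaxed);
  // Unlocking has release semantics, but a release only orders things with
  // respect to a matching acquire: another lock of g_lock, or an acquire load
  // paired with a release store.
}

void Getter() {
  // Reads without taking the lock and without an acquire load, so the release
  // on unlock has nothing to pair with: observing g_ready == true does not
  // imply observing g_state == 1. This is the bug being illustrated.
  if (g_ready.load(std::memory_order_relaxed)) {
    (void)g_state;
  }
}

int main() {
  std::thread a(Initializer);
  std::thread b(Getter);
  a.join();
  b.join();
  return 0;
}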

But I'd assume that C++11 thread-safe static initialization makes it such that the constructor's side-effects "happen-before" other threads seeing the constructed TLS::Slot value.

If that's not true (i.e. if other threads can ever see slot_ == kInvalidSlotValue, despite it being initialized before the constructor returns) then this isn't safe.

Comment 7 by gab@chromium.org, Feb 21 2018

From the spec, I expect this to be safe:

"""
such a variable is initialized the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization. [...] If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization.
"""

Comment 8 by gab@chromium.org, Feb 21 2018

And, hmmm, from the above stack, what makes you think it's TLS related?

TaskTracker::RunAndPopNextTask() doesn't seem to use TLS at first sight?

Comment 9 by gab@chromium.org, Feb 21 2018

Labels: Test-Flaky

Comment 10 by gab@chromium.org, Feb 21 2018

@#8, nvm I see, the whole test is about TLS. Still looking...

FWIW, while akhaustov@'s CL was the first to trigger this flake, the tests then passed 4 times before scottmg@'s disabling CL:

2018-02-19 11:30 PM (CET)	15 mins 15 secs	36460f0ebdbd5211029488808d3cbef13b20c306	Success	#6056	(scottmg@chromium.org)	build successful
2018-02-19 11:14 PM (CET)	10 mins 22 secs	33801db5ba5de9ece3d07a7df09774f895b3ec49	Success	#6055	(shend@chromium.org)	build successful
2018-02-19 10:43 PM (CET)	34 mins 49 secs	38ead92c3968463939f32114da88812b01fd3e22	Success	#6054	(v8-autoroll@chromium.org)	build successful
2018-02-19 10:08 PM (CET)	8 mins 42 secs	424f960b73554a0d2ae371a8d21ade15647afdca	Success	#6053	(chengx@chromium.org)	build successful
2018-02-19 9:58 PM (CET)	11 mins 36 secs	88fb0a0ee29a2bca380e2b6392a0a7880bb5309a	Success	#6052	(pdfium-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com)	build successful
2018-02-19 9:16 PM (CET)	45 mins 37 secs	fd6d802597d97fa206271a941575f0ebb5fd8df8	Failed	#6051	(vasilii@chromium.org) , (nparker@chromium.org) , (akhaustov@yandex-team.ru)	failed base_unittests

Comment 11 by gab@chromium.org, Feb 21 2018

Mergedinto: 813278
Owner: gab@chromium.org
Status: Duplicate (was: Unconfirmed)
Ok, I have a theory; I'll be working on it in issue 813278 (it may be related to other flakes we've seen where workers outlive the pool!)

Comment 12 by gab@chromium.org, Feb 21 2018

Mergedinto: -813278 810464
