TaskSchedulerWorkerPoolCheckTlsReuse.CheckCleanupWorkers crashed in base_unittests on Fuchsia
Issue description

https://build.chromium.org/p/chromium.fyi/builders/Fuchsia%20ARM64/builds/6051
https://chromium-swarm.appspot.com/task?id=3bc87f3ae4c98f10&refresh=10&show_raw=1

[00089.853] 03761.03789> [1557/2894] TaskSchedulerLock.PredecessorLongerCycle (4 ms)
[00089.855] 03761.03789> [1558/2894] TaskSchedulerSingleThreadTaskRunnerManagerTest.DifferentThreadsUsed (46 ms)
[00089.856] 03761.03789> [1559/2894] TaskSchedulerSingleThreadTaskRunnerManagerTest.SameThreadUsed (15 ms)
[00089.856] 03761.03789> [1560/2894] TaskSchedulerSingleThreadTaskRunnerManagerTest.RunsTasksInCurrentSequence (38 ms)
[00089.888] 03761.04135> [3:409166567:0219/200914.888797:89888845:ERROR:test_launcher.cc(369)] adding 18cb64bd to live process list
[00089.891] 03761.04135> [3:409166567:0219/200914.890308:89890348:ERROR:process_fuchsia.cc(168)] going to wait for process 18cb64bd (deadline=539890235000, now=89890380000)
[00090.317] 01200.01260> <== fatal exception: process base_unittests__exec[32656] thread TaskSchedulerSingleThreadForegr[33180]
[00090.320] 01200.01260> <== fatal page fault, PC at 0x13b67bf2d7f4
[00090.322] 01200.01260> x0 0xa5a8c5d69d58 x1 0xc8e0280d6ac8 x2 0x4 x3 0x4
[00090.323] 01200.01260> x4 0xc8e0280d68c0 x5 0xc8fc1712ac80 x6 0xffff x7 0x10000
[00090.323] 01200.01260> x8 0 x9 0 x10 0xc8e02809bf88 x11 0xc8e02809bf88
[00090.323] 01200.01260> x12 0x10031 x13 0 x14 0 x15 0
[00090.323] 01200.01260> x16 0xa4c229da1ef0 x17 0xa4c229cdef60 x18 0 x19 0xa5a8c5d69df0
[00090.323] 01200.01260> x20 0xc8e028032d60 x21 0xc8e027f58398 x22 0xa5a8c5d69f28 x23 0xa5a8c5d69f30
[00090.329] 01200.01260> x24 0x1 x25 0x1 x26 0x1 x27 0
[00090.329] 01200.01260> x28 0xa5a8c5d69f48 x29 0xa5a8c5d69fc0 lr 0x13b67bf2d7e0 sp 0xa5a8c5d69d50
[00090.329] 01200.01260> pc 0x13b67bf2d7f4 psr 0x80000000
[00090.332] 01200.01260> far 0 esr 0x92000004
[00090.333] 01200.01260> bottom of user stack:
[00090.335] 01200.01260> 0x0000a5a8c5d69d50: 924b24b8 000003db 00000000 00000000 |.$K.............|
[00090.335] 01200.01260> 0x0000a5a8c5d69d60: 7b4762b3 000013b6 7b49559b 000013b6 |.bG{.....UI{....|
[00090.335] 01200.01260> 0x0000a5a8c5d69d70: 00000258 00000000 7bc88270 000013b6 |X.......p..{....|
[00090.335] 01200.01260> 0x0000a5a8c5d69d80: 00000000 00000000 00000000 00000000 |................|
[00090.336] 01200.01260> 0x0000a5a8c5d69d90: 00000000 00000000 00000000 00000000 |................|
[00090.336] 01200.01260> 0x0000a5a8c5d69da0: 00000000 00000000 00000005 00000000 |................|
[00090.336] 01200.01260> 0x0000a5a8c5d69db0: 2809be00 5a8b2e00 00000001 00000000 |...(...Z........|
[00090.336] 01200.01260> 0x0000a5a8c5d69dc0: 00000001 27fa0000 00000000 00000000 |.......'........|
[00090.336] 01200.01260> 0x0000a5a8c5d69dd0: 056020c0 00000000 00000000 00000000 |. `.............|
[00090.336] 01200.01260> 0x0000a5a8c5d69de0: 280b9a80 0000c8e0 29d34601 0000a4c2 |...(.....F.)....|
[00090.336] 01200.01260> 0x0000a5a8c5d69df0: 00000000 00000000 7b4762b3 000013b6 |.........bG{....|
[00090.336] 01200.01260> 0x0000a5a8c5d69e00: 7b49559b 000013b6 00000258 00000000 |.UI{....X.......|
[00090.337] 01200.01260> 0x0000a5a8c5d69e10: 7bc88270 000013b6 00000000 00000000 |p..{............|
[00090.337] 01200.01260> 0x0000a5a8c5d69e20: 00000000 00000000 00000000 00000000 |................|
[00090.337] 01200.01260> 0x0000a5a8c5d69e30: 00000000 00000000 00000000 00000000 |................|
[00090.337] 01200.01260> 0x0000a5a8c5d69e40: 00000005 00000000 4f09c500 5a8b2e00 |...........O...Z|
[00090.337] 01200.01260> arch: aarch64
[00090.368] 01200.01260> dso: id=97caafd6646d3ad81e2dc07bba6fb4fec7d3eb3a base=0xb7a0de324000 name=<vDSO>
[00090.368] 01200.01260> dso: id=9e3569cc22dfd432bc2ac3d3b2829196909ad4be base=0xa4c229cce000 name=libc.so
[00090.368] 01200.01260> dso: id=e5a518a1be244c24d4eadc3ff79bd2478f978376 base=0xa308c9134000 name=libfdio.so
[00090.368] 01200.01260> dso: id=45ed5eb239c1b28eeb9706e4130e24f4a5f378d6 base=0x1d575a99d000 name=liblaunchpad.so
[00090.368] 01200.01260> dso: id=2228e864d342e1c0 base=0x13b67b472000 name=app:base_unittests__exec
#01: base::internal::TaskTracker::RunAndPopNextTask(scoped_refptr<base::internal::Sequence>, base::internal::CanScheduleSequenceObserver*) at ??:?
#02: base::internal::SchedulerWorker::Thread::ThreadMain() at ??:?
#03: base::(anonymous namespace)::ThreadFunc(void*) at ??:?
#04: pc 0xa4c229cdee4c sp 0xa5a8c5d69fc0 (libc.so,0x10e4c)
#05: pc 0xa4c229d51820 sp 0xa5a8c5d69fe0 (libc.so,0x83820)
[00090.438] 03761.04098> [3:403923711:0219/200915.438309:90438365:ERROR:test_launcher.cc(426)] going to zx_task_kill(job) for 1cab671f
[00090.439] 03761.04098> [3:403923711:0219/200915.439462:90439520:ERROR:test_launcher.cc(443)] removing 1cab671f from live process list
[00090.461] 03761.03789> [1561/2894] TaskSchedulerSingleThreadTaskRunnerManagerTest.SharedWithBaseSyncPrimitivesDCHECKs (55 ms)
[00090.461] 03761.03789> [1562/2894] TaskSchedulerSingleThreadTaskRunnerManagerJoinTest.ConcurrentJoin (30 ms)
[00090.461] 03761.03789> [1563/2894] TaskSchedulerSingleThreadTaskRunnerManagerJoinTest.ConcurrentJoinExtraSkippedTask (10 ms)
[00090.461] 03761.03789> [1564/2894] TaskSchedulerSingleThreadTaskRunnerManagerStartTest.PostTaskBeforeStart (108 ms)
[00090.461] 03761.03789> [1565/2894] TaskSchedulerWorkerPoolImplPostTaskBeforeStartTest.PostTasksBeforeStart (39 ms)
[00090.461] 03761.03789> [1566/2894] TaskSchedulerWorkerPoolImplPostTaskBeforeStartTest.PostManyTasks (49 ms)
[00090.462] 03761.03789> [ RUN ] TaskSchedulerWorkerPoolCheckTlsReuse.CheckCleanupWorkers
[00090.462] 03761.03789> [1567/2894] TaskSchedulerWorkerPoolCheckTlsReuse.CheckCleanupWorkers (CRASHED)
[00090.462] 03761.04098> [3:403923711:0219/200915.462841:90462889:ERROR:test_launcher.cc(369)] adding 1adb66e3 to live process list
[00090.464] 03761.04098> [3:403923711:0219/200915.464355:90464395:ERROR:process_fuchsia.cc(168)] going to wait for process 1adb66e3 (deadline=540464285000, now=90464425000)
[00091.249] 03761.04135> [3:409166567:0219/200916.248855:91248910:ERROR:test_launcher.cc(426)] going to zx_task_kill(job) for 18cb64bd
[00091.250] 03761.04135> [3:409166567:0219/200916.249965:91250015:ERROR:test_launcher.cc(443)] removing 18cb64bd from live process list
Feb 19 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/36460f0ebdbd5211029488808d3cbef13b20c306

commit 36460f0ebdbd5211029488808d3cbef13b20c306
Author: Scott Graham <scottmg@chromium.org>
Date: Mon Feb 19 22:03:47 2018

fuchsia: Disable TaskSchedulerWorkerPoolCheckTlsReuse.CheckCleanupWorkers

Flaked on ARM64 at https://build.chromium.org/p/chromium.fyi/builders/Fuchsia%20ARM64/builds/6051.
Blamelist includes https://chromium-review.googlesource.com/c/chromium/src/+/924001 which looks like it could be related, but also looks similar to general TaskScheduler timing flake, so disabling.

TBR: wez@chromium.org
Bug: 813651
Change-Id: Ic67fb40d5efe5cd614d6413e180311f372ae7269
Reviewed-on: https://chromium-review.googlesource.com/925546
Reviewed-by: Scott Graham <scottmg@chromium.org>
Commit-Queue: Scott Graham <scottmg@chromium.org>
Cr-Commit-Position: refs/heads/master@{#537686}

[modify] https://crrev.com/36460f0ebdbd5211029488808d3cbef13b20c306/testing/buildbot/filters/fuchsia.base_unittests.filter
Feb 20 2018
I've had a look at TaskSchedulerWorkerPoolCheckTlsReuse.CheckCleanupWorkers. The semantics of TaskSchedulerWorkerPoolCheckTlsReuse::slot_ are identical to what they were before the reland, so that shouldn't be the problem.
Feb 20 2018
Looking at the CL, I notice that it also removes the seemingly-redundant base::subtle::Release_Store(&initialized_, ...) calls. Given that we saw the flake on Fuchsia/ARM64, and given that (IIRC) C++ atomics currently rely on DSB instructions there, that change may have altered the synchronization semantics for other fields of the Slot.

It's a shame that the crash report only lists the PC and not the actual faulting address; my wild guess is that this is a use-after-free from some use of Bind(Unretained) in this test suite. Looking at the test base classes, I also notice that we flush and join |pool_| for each test in the base class's TearDown(). That means that if any task posted to |pool_| Bind()s Unretained() to a variable on the stack of an individual test (in this particular test we Bind() to Unretained(this), which would be sufficient), we may use-after-delete during TearDown().
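[Editorial note: a minimal, purely illustrative sketch of the lifetime hazard described above, in plain standard C++. std::thread stands in for the worker pool and a captured raw pointer stands in for Bind(Unretained(this)); the names are made up and this is not the actual Chromium test code.]

#include <chrono>
#include <thread>

struct Fixture {
  int value = 0;  // stands in for state owned by the test fixture
};

int main() {
  std::thread worker;
  {
    Fixture fixture;  // stack-scoped, like a test's |this|
    // "Unretained": the posted work captures a raw pointer with no
    // lifetime guarantee.
    worker = std::thread([p = &fixture] {
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
      p->value++;  // may run after ~Fixture(): use-after-free
    });
  }  // |fixture| is destroyed here, but the worker may still be running.
  worker.join();  // joining only now is too late, like a late TearDown()
  return 0;
}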
Feb 21 2018
Re: removing the Release_Store(&initialized_, ...): releasing the AutoLock within which the state is modified should also provide (at least) a Release_Store-equivalent barrier on the lock. And I agree with #3; I doubt that CL is to blame.
Feb 21 2018
Actually, hold on. While the AutoLock will have a Release_Store barrier on the lock, that doesn't matter if the state is read without a matching Acquire_Load (only paired barriers enforce anything). Since the lock is only taken by other initializers, not by getters, it doesn't guarantee anything here. But I'd assume that C++11 thread-safe static initialization ensures the constructor's side effects "happen-before" other threads see the constructed TLS::Slot value. If that's not true (i.e. if other threads can ever see slot_ == kInvalidSlotValue despite it being initialized before the constructor returns), then this isn't safe.
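[Editorial note: a minimal standard-C++ sketch of the pairing requirement mentioned above; illustrative only, not Chromium code. The reader is guaranteed to see the plain write only because its acquire load pairs with the writer's release store on the same atomic.]

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain, non-atomic state
std::atomic<bool> ready{false};  // the flag the barriers pair on

int main() {
  std::thread writer([] {
    payload = 42;                                  // plain write
    ready.store(true, std::memory_order_release);  // publish
  });
  std::thread reader([] {
    while (!ready.load(std::memory_order_acquire)) {  // matching acquire
    }
    assert(payload == 42);  // guaranteed only because the barriers pair up
  });
  writer.join();
  reader.join();
  return 0;
}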
Feb 21 2018
From the spec, I expect this to be safe:

"""
such a variable is initialized the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization. [...] If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization.
"""
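[Editorial note: a small illustration of that guarantee under assumed names (this is not the real base::ThreadLocalStorage code). Every thread calling GetSlot() either performs the initialization itself or waits for it to finish, so no caller should be able to observe the sentinel value.]

#include <cassert>
#include <thread>
#include <vector>

struct Slot {
  static constexpr int kInvalidSlotValue = -1;
  int slot_ = kInvalidSlotValue;
  Slot() { slot_ = 7; }  // constructor side effects finish before any use
};

Slot& GetSlot() {
  static Slot slot;  // C++11 "magic static": thread-safe initialization
  return slot;
}

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < 8; ++i) {
    threads.emplace_back([] {
      // Callers either initialize |slot| or block until it is initialized,
      // so the sentinel is never visible here.
      assert(GetSlot().slot_ != Slot::kInvalidSlotValue);
    });
  }
  for (auto& t : threads) t.join();
  return 0;
}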
Feb 21 2018
And, hmmm, from the above stack, what makes you think it's TLS related? TaskTracker::RunAndPopNextTask() doesn't seem to use TLS at first sight?
Feb 21 2018
@#8: never mind, I see it now; the whole test is about TLS. Still looking...

FWIW, while akhaustov@'s CL was the first to trigger this flake, the tests then passed 4 times before scott's disabling CL:

2018-02-19 11:30 PM (CET)  15 mins 15 secs  36460f0ebdbd5211029488808d3cbef13b20c306  Success  #6056  (scottmg@chromium.org)  build successful
2018-02-19 11:14 PM (CET)  10 mins 22 secs  33801db5ba5de9ece3d07a7df09774f895b3ec49  Success  #6055  (shend@chromium.org)  build successful
2018-02-19 10:43 PM (CET)  34 mins 49 secs  38ead92c3968463939f32114da88812b01fd3e22  Success  #6054  (v8-autoroll@chromium.org)  build successful
2018-02-19 10:08 PM (CET)  8 mins 42 secs   424f960b73554a0d2ae371a8d21ade15647afdca  Success  #6053  (chengx@chromium.org)  build successful
2018-02-19 9:58 PM (CET)   11 mins 36 secs  88fb0a0ee29a2bca380e2b6392a0a7880bb5309a  Success  #6052  (pdfium-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com)  build successful
2018-02-19 9:16 PM (CET)   45 mins 37 secs  fd6d802597d97fa206271a941575f0ebb5fd8df8  Failed   #6051  (vasilii@chromium.org), (nparker@chromium.org), (akhaustov@yandex-team.ru)  failed base_unittests
Feb 21 2018
OK, I have a theory, and I'll be working on it in issue 813278 (it may be related to other flakes we've seen where workers outlive the pool!).