Fatal error in ../../v8/src/base/platform/semaphore.cc, line 111: Semaphore signal failure: 22
Issue description

In this tryjob on linux_chromium_rel_ng:

https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/223551

WebglConformance.conformance_ogles_GL_atan_atan_009_to_012 failed because of the following assertion failure in the renderer process:

#
# Fatal error in ../../v8/src/base/platform/semaphore.cc, line 111
# Semaphore signal failure: 22
#

Thread 11 (crashed)
 0  chrome!v8::base::OS::Abort() + 0xf
 1  0x3000000020
 2  chrome!WaitForTask [lock.h : 50 + 0x8]
 3  chrome!ThreadMain [callback.h : 397 + 0x7]
 4  chrome!ThreadFunc [platform_thread_posix.cc : 70 + 0x8]
 5  libpthread-2.19.so + 0x8182
 6  libc-2.19.so + 0xfa47d

Similar to Issue 605349, but apparently a different failure mode.

Hannes, could you or someone else on V8's GC team please see why this may have happened?
,
May 6 2016
errno = 22 means "Invalid argument", so the native_handle of the semaphore is invalid. We know that the handle is pointer-aligned because we check for it in the Semaphore constructor. I think this leaves two possibilities:

1. The semaphore is used after being destroyed.
2. The native handle of the semaphore is corrupted.

Ken, any chance we can get a better stack trace? The one posted misses all the interesting bits: where Semaphore::Signal is called from.
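For readers following along, here is a minimal sketch of how an errno value like the 22 in the fatal message can be decoded. The helper names DescribeSignalFailure and IsInvalidSemaphoreError are made up for this illustration and are not part of V8; only EINVAL, std::strerror and std::to_string are standard.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper, not part of V8: formats an errno value in the style of
// the fatal message quoted in this thread and appends the system's own
// description of the error.
std::string DescribeSignalFailure(int error) {
  return "Semaphore signal failure: " + std::to_string(error) + " (" +
         std::strerror(error) + ")";
}

// Hypothetical helper: true when the error code is EINVAL, the value that
// POSIX sem_post() reports for a semaphore it considers invalid.
bool IsInvalidSemaphoreError(int error) { return error == EINVAL; }
```

On Linux, EINVAL is 22, so IsInvalidSemaphoreError(22) holds and DescribeSignalFailure(22) carries the "Invalid argument" text mentioned above (the exact wording comes from the C library).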
,
May 6 2016
I'm not sure why the stack is getting truncated. This is a 64-bit build; is stack unwind info needed that might not be present for Crashpad? CC'ing mark@ and thakis@, Crashpad and Clang owners.

Can you try a local Release build with dcheck_always_on=1 and try running things in a loop? e.g.:

./content/test/gpu/run_gpu_test.py webgl_conformance --browser=release --pageset-repeat=1000 --story-filter=conformance_ogles --max-failures=1 -v --extra-browser-args="--enable-logging=stderr --js-flags=--expose-gc"

Then in a separate terminal:

tail -f output.txt

Hopefully it'll catch one of these failures. Thanks.
,
May 7 2016
Note: more failures on the CQ:

https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/224558
https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/224541

Bumping to P1. I'd suggest instrumenting V8's code so that it prints more detailed information when the failure happens.
,
May 9 2016
The following revision refers to this bug:
https://chromium.googlesource.com/v8/v8.git/+/5d9f6da6541c24077f236fe9a1837e0e8261a3ea

commit 5d9f6da6541c24077f236fe9a1837e0e8261a3ea
Author: ulan <ulan@chromium.org>
Date: Mon May 09 11:55:14 2016

Instrument callers of Semaphore::Signal to help with investigation of flaky crashes.

BUG= chromium:609249
LOG=NO

Review-Url: https://codereview.chromium.org/1961893002
Cr-Commit-Position: refs/heads/master@{#36106}

[modify] https://crrev.com/5d9f6da6541c24077f236fe9a1837e0e8261a3ea/src/base/platform/semaphore.cc
[modify] https://crrev.com/5d9f6da6541c24077f236fe9a1837e0e8261a3ea/src/base/platform/semaphore.h
[modify] https://crrev.com/5d9f6da6541c24077f236fe9a1837e0e8261a3ea/src/debug/debug.cc
[modify] https://crrev.com/5d9f6da6541c24077f236fe9a1837e0e8261a3ea/src/heap/mark-compact.cc
[modify] https://crrev.com/5d9f6da6541c24077f236fe9a1837e0e8261a3ea/src/heap/page-parallel-job.h
[modify] https://crrev.com/5d9f6da6541c24077f236fe9a1837e0e8261a3ea/src/heap/spaces.cc
[modify] https://crrev.com/5d9f6da6541c24077f236fe9a1837e0e8261a3ea/src/log.cc
,
May 10 2016
Good progress, Ulan. Here's one failure with the above instrumentation in place:

https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/225698

[ RUN ] WebglConformance.conformance_ogles_GL_asin_asin_001_to_006
#
# Fatal error in ../../v8/src/base/platform/semaphore.cc, line 110
# Semaphore signal failure: 22 called by 'PageParallelJob::Task::RunInternal'
#
Found Minidump: True
Stack Trace:
********************************************************************************
Operating system: Linux
                  0.0.0 Linux 3.13.0-61-generic #100-Ubuntu SMP Wed Jul 29 11:21:34 UTC 2015 x86_64
CPU: amd64
     family 6 model 60 stepping 3
     1 CPU
GPU: UNKNOWN

Crash reason: SIGILL
Crash address: 0x7fbabb78967f
Process uptime: not available

Thread 12 (crashed)
 0  chrome!v8::base::OS::Abort() + 0xf
    rax = 0x0000000000000000   rdx = 0x0000000000000000
    rcx = 0xffffffffffffffff   rbx = 0x00007fbabeaaf239
    rsi = 0x00007fbab23d49d0   rdi = 0x00007fbab23d31c0
    rbp = 0x000000000000006e   rsp = 0x00007fba99c4a9e8
     r8 = 0x00007fba99c4b700    r9 = 0x00007fbab20b7927
    r10 = 0x00007fbab23d0be0   r11 = 0x0000000000000000
    r12 = 0x0000095a58061cc0   r13 = 0x00007fba99c4ab60
    r14 = 0x00007fbabeaaf2af   r15 = 0x00007fbab23d3868
    rip = 0x00007fbabb78967f
    Found by: given as instruction pointer in context
 1  0x3000000028
    rbx = 0x00007fbabeaaf239   rbp = 0x000000000000006e
    rsp = 0x00007fba99c4a9f8   r12 = 0x0000095a58061cc0
    r13 = 0x00007fba99c4ab60   r14 = 0x00007fbabeaaf2af
    r15 = 0x00007fbab23d3868   rip = 0x0000003000000028
    Found by: call frame info
 2  chrome!WaitForTask [lock.h : 50 + 0x8]
    rsp = 0x00007fba99c4aa10   rip = 0x00007fbabdf8cfcd
    Found by: stack scanning
 3  chrome!ThreadMain [callback.h : 397 + 0x7]
...
,
May 11 2016
Yep, I am checking the code related to PageParallelJob. Yesterday I removed a few reinterpret_casts in https://codereview.chromium.org/1963853004/. There is a slim chance that they were causing undefined behavior.
,
May 12 2016
No new crashes so far.
,
May 12 2016
That's good news. Assuming the V8 change in https://codereview.chromium.org/1963853004/ has already rolled in? Thanks for persisting with this issue.
,
May 13 2016
We have a new crash today. Investigation continues.

https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/228346

#
# Fatal error in ../../v8/src/base/platform/semaphore.cc, line 110
# Semaphore signal failure: 22 called by 'PageParallelJob::Task::RunInternal'
#
,
May 13 2016
The following revision refers to this bug:
https://chromium.googlesource.com/v8/v8.git/+/7bc54ba832c9b0ba750e9efb1fc6636560861aad

commit 7bc54ba832c9b0ba750e9efb1fc6636560861aad
Author: ulan <ulan@chromium.org>
Date: Fri May 13 12:35:49 2016

Check number of finished tasks in PageParallelJob.

BUG= chromium:609249
LOG=NO

Review-Url: https://codereview.chromium.org/1976133002
Cr-Commit-Position: refs/heads/master@{#36238}

[modify] https://crrev.com/7bc54ba832c9b0ba750e9efb1fc6636560861aad/src/heap/page-parallel-job.h
,
May 17 2016
The check in #11 ensures that the main thread waits for tasks, so the semaphore is not destroyed at the time of signalling. At this point, the only explanations I can think of for this crash are memory corruption or a compiler bug.

The stdio log says something about a minidump. Is it possible to get the minidump?

I'll try to reproduce again. Last time I didn't succeed.
,
May 17 2016
+nednguyen

Ned, do you think it would be feasible to let Telemetry clients enumerate any minidumps that were produced during the run and upload them to cloud storage? I haven't looked at the code in question yet but it's buried pretty deeply in the desktop browser backend as far as I remember.

Ulan, if you can try to reproduce locally that would be really helpful. Try building Release with dcheck_always_on=1 and run:

src/content/test/gpu/run_gpu_test.py webgl_conformance --browser=release --pageset-repeat=100 --max-failures=1 --show-stdout --extra-browser-args="--enable-logging=stderr --js-flags=--expose-gc" -v > output.txt 2>&1

I note that:

https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/228346

was job:

https://chromium-swarm.appspot.com/user/task/2ec1f0fe62903610

which ran on build145-m4, while:

https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/229599

was job:

https://chromium-swarm.appspot.com/user/task/2ed3b9378f9b2110

which ran on build154-m4. Since these failures aren't machine specific I doubt it's something like bad RAM.
,
May 18 2016
kbr@, I think that's OK to support. We can expose the path to the minidump in Telemetry's AppCrashException (https://github.com/catapult-project/catapult/blob/master/telemetry/telemetry/core/exceptions.py#L54). Clients can catch this exception and do whatever they want with the minidump paths.
,
May 18 2016
I ran 3 parallel sessions today for 5 hours. No crash. Maybe the crash depends on the binary (because of the V8 snapshot). Is there a way to download the binary of the failing run?
,
May 19 2016
Yes. Go to the task page, like:

https://chromium-swarm.appspot.com/user/task/2ec1f0fe62903610

(I think this requires you to be logged in with your @google.com account)

Then find the hash of the "Isolated inputs", e.g.:

1ae56b273feb9741f97fd462a9a02d568a5aaf77

Then do the following (while cd'd into "src" in your Chromium checkout):

./tools/swarming_client/isolateserver.py download -t foo -s [hash] -I https://isolateserver.appspot.com

This will download the isolate into the directory "foo".

Or, to run the test directly from the isolate, which is closer to how the bots do it, do the following (note the command line comes from the task page above, though the escaping of quotes is bad in the user interface):

./tools/swarming_client/run_isolated.py -s [hash] -I https://isolateserver.appspot.com -- webgl_conformance --show-stdout --browser=release -v --extra-browser-args="--enable-logging=stderr --js-flags=--expose-gc" --isolated-script-test-output=/tmp/foo.json
,
May 19 2016
Also: filed https://github.com/catapult-project/catapult/issues/2346 about uploading minidumps from the bots. Ulan, please tell us if this would really be useful. I don't have a good sense for how many Chromium developers would actually look at these and be able to make debugging progress with them.
,
May 19 2016
Thank you, Ken. The instructions worked. I am running the tests with the binary from the bot.

We on the V8 team routinely look at minidumps for Chrome crashes. I think minidumps from bots would be useful if the failure is hard to reproduce, like in this case.
,
May 19 2016
OK. Thanks for the feedback, Ulan. We will work on getting minidumps off the bots. See the recent comment on https://github.com/catapult-project/catapult/issues/2346. Right now the plan isn't to work on this immediately, but rather to do it after the harness is simplified in Issue 352807. Let me know if you can't make progress without the minidumps, or if you require access to one of the bots to reproduce there. The latter is simpler.
,
May 20 2016
I think we are hitting this glibc bug: https://sourceware.org/bugzilla/show_bug.cgi?id=12674

The bug is fixed in glibc 2.21, but my Ubuntu Trusty has glibc 2.19. Ken, I assume the bots also run with glibc 2.19?

The race matches our use case in PageParallelJob:

1. The main thread initializes the semaphore and spawns a background thread.
2. The background thread atomically increments the semaphore counter but goes to sleep before checking for waiters.
3. The main thread sees the increment of the counter, assumes that the background thread is finished, and destroys the semaphore.
4. The background thread wakes up and reads the now-invalid semaphore.

Semaphore::Signal implementation that I see on my machine:

0x7ffff493c8e0 <sem_post>:    mov (%rdi),%eax
0x7ffff493c8e2 <sem_post+2>:  cmp $0x7fffffff,%eax
0x7ffff493c8e7 <sem_post+7>:  je 0x7ffff493c91c <sem_post+60>
0x7ffff493c8e9 <sem_post+9>:  lea 0x1(%rax),%esi
0x7ffff493c8ec <sem_post+12>: lock cmpxchg %esi,(%rdi)
// Race with the main thread starts here
0x7ffff493c8f0 <sem_post+16>: jne 0x7ffff493c8e2 <sem_post+2>
0x7ffff493c8f2 <sem_post+18>: cmpq $0x0,0x8(%rdi)
0x7ffff493c8f7 <sem_post+23>: je 0x7ffff493c912 <sem_post+50>
0x7ffff493c8f9 <sem_post+25>: mov $0xca,%eax
0x7ffff493c8fe <sem_post+30>: mov $0x1,%esi
0x7ffff493c903 <sem_post+35>: or 0x4(%rdi),%esi
0x7ffff493c906 <sem_post+38>: mov $0x1,%edx
0x7ffff493c90b <sem_post+43>: syscall
0x7ffff493c90d <sem_post+45>: test %rax,%rax
0x7ffff493c910 <sem_post+48>: js 0x7ffff493c915 <sem_post+53>
0x7ffff493c912 <sem_post+50>: xor %eax,%eax
0x7ffff493c914 <sem_post+52>: retq
0x7ffff493c915 <sem_post+53>: mov $0x16,%eax
0x7ffff493c91a <sem_post+58>: jmp 0x7ffff493c921 <sem_post+65>
0x7ffff493c91c <sem_post+60>: mov $0x4b,%eax
0x7ffff493c921 <sem_post+65>: mov 0x20a660(%rip),%rdx # 0x7ffff4b46f88
0x7ffff493c928 <sem_post+72>: mov %eax,%fs:(%rdx)
0x7ffff493c92b <sem_post+75>: or $0xffffffff,%eax
0x7ffff493c92e <sem_post+78>: retq
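The four steps of the race can be lined up against code in a reduced sketch like the one below. This is an illustration only: RunJobWithShortLivedSemaphore is a hypothetical function, std::thread stands in for V8's background tasks, and none of it is taken from PageParallelJob. On a glibc with the fix it simply runs to completion.

```cpp
#include <semaphore.h>

#include <cerrno>
#include <thread>

// Hypothetical reduction of the pattern described above, not V8 code.
// Returns the number of completion signals the main thread observed.
int RunJobWithShortLivedSemaphore() {
  // Step 1: the main thread initializes a semaphore that lives only as long
  // as this job and spawns a background thread.
  sem_t done;
  sem_init(&done, 0, 0);

  // Step 2 happens inside sem_post(): the counter is incremented first and
  // the waiter check follows as a separate step.
  std::thread worker([&done] { sem_post(&done); });

  // Step 3: the main thread sees the increment and treats the worker as
  // finished. Retry if the wait is interrupted by a signal.
  int observed = 0;
  int rc;
  do {
    rc = sem_wait(&done);
  } while (rc != 0 && errno == EINTR);
  if (rc == 0) ++observed;

  // With glibc < 2.21 the worker may still be inside sem_post() here, so
  // destroying the semaphore (and later reusing its memory) is what makes
  // step 4 possible.
  sem_destroy(&done);

  worker.join();
  return observed;
}
```

The join at the end exists only so the sketch cleans up its thread; the point is the ordering of sem_wait() returning and sem_destroy() running while sem_post() may not have returned yet.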
,
May 20 2016
We can work around this bug by keeping one global static semaphore instead of dynamically creating and destroying it. I will upload a CL.
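A rough sketch of the idea, assuming a long-lived owner object: the semaphore is created once and reused by every job, so it is never destroyed while a worker might still be inside sem_post(). The Collector class and RunParallelJob method are hypothetical names for this illustration and do not mirror the actual CL.

```cpp
#include <semaphore.h>

#include <cerrno>
#include <thread>
#include <vector>

// Hypothetical stand-in for a long-lived object that owns the semaphore.
class Collector {
 public:
  Collector() { sem_init(&pending_tasks_, 0, 0); }
  ~Collector() { sem_destroy(&pending_tasks_); }
  Collector(const Collector&) = delete;
  Collector& operator=(const Collector&) = delete;

  // Runs num_tasks workers and returns how many completion signals were
  // received. The same semaphore is reused on every call.
  int RunParallelJob(int num_tasks) {
    std::vector<std::thread> workers;
    for (int i = 0; i < num_tasks; ++i) {
      workers.emplace_back([this] { sem_post(&pending_tasks_); });
    }
    int finished = 0;
    for (int i = 0; i < num_tasks; ++i) {
      int rc;
      do {
        rc = sem_wait(&pending_tasks_);
      } while (rc != 0 && errno == EINTR);  // Retry if interrupted.
      if (rc == 0) ++finished;
    }
    // Joining is only needed because this sketch uses std::thread.
    for (std::thread& worker : workers) worker.join();
    return finished;
  }

 private:
  sem_t pending_tasks_;  // Outlives every job that signals it.
};
```

Calling RunParallelJob repeatedly on one Collector reuses the same semaphore each time. The trade-off is that the semaphore's lifetime is tied to the owner rather than to each job.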
,
May 20 2016
The following revision refers to this bug:
https://chromium.googlesource.com/v8/v8.git/+/84ee947013003d481b1222d975e4456122b6efe9

commit 84ee947013003d481b1222d975e4456122b6efe9
Author: ulan <ulan@chromium.org>
Date: Fri May 20 12:15:35 2016

Workaround for glibc semaphore bug.

Instead of dynamically creating semaphore for each page parallel job, we create one semaphore for MarkCompact and reuse it.

This patch also removes all instrumentation code that was added to help with investigation.

BUG= chromium:609249
LOG=NO

Review-Url: https://codereview.chromium.org/1998213002
Cr-Commit-Position: refs/heads/master@{#36407}

[modify] https://crrev.com/84ee947013003d481b1222d975e4456122b6efe9/src/base/platform/semaphore.cc
[modify] https://crrev.com/84ee947013003d481b1222d975e4456122b6efe9/src/base/platform/semaphore.h
[modify] https://crrev.com/84ee947013003d481b1222d975e4456122b6efe9/src/debug/debug.cc
[modify] https://crrev.com/84ee947013003d481b1222d975e4456122b6efe9/src/heap/mark-compact.cc
[modify] https://crrev.com/84ee947013003d481b1222d975e4456122b6efe9/src/heap/mark-compact.h
[modify] https://crrev.com/84ee947013003d481b1222d975e4456122b6efe9/src/heap/page-parallel-job.h
[modify] https://crrev.com/84ee947013003d481b1222d975e4456122b6efe9/src/heap/spaces.cc
[modify] https://crrev.com/84ee947013003d481b1222d975e4456122b6efe9/src/log.cc
,
May 20 2016
> Ken, I assume the bots also run with glibc 2.19?

The stack trace in the original post confirms it.
,
May 20 2016
Awesome work Ulan! Thank you for persisting with this nasty bug. The fix looks good and I am sure it will stick on the bots. Sorry they are not running a more recent Ubuntu version, but I think the code change you made is worth it for stability on more Linux machines.
,
May 23 2016
Thank you, Ken. No worries about Ubuntu version on bots, even my machine has glibc 2.19. :)
,
May 23 2016
Marking Fixed. Please change if that's not appropriate. Thanks again!
,
Jun 6 2016
Issue 616789 has been merged into this issue.
,
Jun 6 2016
This crash has high impact on Chrome's stability. Channel: beta. Platform: linux. Labeling issue 609249 with Pri-0. Labeling issue 609249 with ReleaseBlock-Stable. If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates. - Go/Fracas
,
Jun 7 2016
Requesting merge of cl in #23 to M51 and M52.
,
Jun 7 2016
[Automated comment] Request affecting a post-stable build (M51), manual review required.
,
Jun 7 2016
Your change meets the bar and is auto-approved for M52 (branch: 2743)
,
Jun 7 2016
[Automated comment] Request affecting a post-stable build (M51), manual review required.
,
Jun 7 2016
,
Jun 7 2016
Please merge your change ASAP so that it will be picked up for next Beta release.
,
Jun 8 2016
Please merge your change by 4:00 PM PST Friday (06/10) at the latest so we can pick it up for next week's M51 Stable refresh.
,
Jun 9 2016
The following revision refers to this bug:
https://chromium.googlesource.com/v8/v8.git/+/82ec83d83c4f8997175946f3d859a2ae2d6192b6

commit 82ec83d83c4f8997175946f3d859a2ae2d6192b6
Author: Ulan Degenbaev <ulan@chromium.org>
Date: Thu Jun 09 08:58:21 2016

Version 5.2.361.20 (cherry-pick)

Merged 84ee947013003d481b1222d975e4456122b6efe9

Workaround for glibc semaphore bug.

BUG= chromium:609249
LOG=N
R=mlippautz@chromium.org

Review URL: https://codereview.chromium.org/2048313002 .
Cr-Commit-Position: refs/branch-heads/5.2@{#26}
Cr-Branched-From: 2cd36d6d0439ddfbe84cd90e112dced85084ec95-refs/heads/5.2.361@{#1}
Cr-Branched-From: 3fef34e02388e07d46067c516320f1ff12304c8e-refs/heads/master@{#36332}

[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/include/v8-version.h
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/base/platform/semaphore.cc
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/base/platform/semaphore.h
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/debug/debug.cc
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/heap/mark-compact.cc
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/heap/mark-compact.h
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/heap/page-parallel-job.h
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/heap/spaces.cc
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/log.cc
,
Jun 9 2016
The following revision refers to this bug:
https://chromium.googlesource.com/v8/v8.git/+/5ff2778f7b3db0b118d1488655c9d22f095b3f32

commit 5ff2778f7b3db0b118d1488655c9d22f095b3f32
Author: Ulan Degenbaev <ulan@chromium.org>
Date: Thu Jun 09 09:32:51 2016

Version 5.1.281.63 (cherry-pick)

Merged 84ee947013003d481b1222d975e4456122b6efe9

Workaround for glibc semaphore bug.

BUG= chromium:609249
LOG=N
R=mlippautz@chromium.org

Review URL: https://codereview.chromium.org/2050893003 .
Cr-Commit-Position: refs/branch-heads/5.1@{#74}
Cr-Branched-From: 167dc63b4c9a1d0f0fe1b19af93644ac9a561e83-refs/heads/5.1.281@{#1}
Cr-Branched-From: 03953f52bd4a184983a551927c406be6489ef89b-refs/heads/master@{#35282}

[modify] https://crrev.com/5ff2778f7b3db0b118d1488655c9d22f095b3f32/include/v8-version.h
[modify] https://crrev.com/5ff2778f7b3db0b118d1488655c9d22f095b3f32/src/heap/mark-compact.cc
[modify] https://crrev.com/5ff2778f7b3db0b118d1488655c9d22f095b3f32/src/heap/mark-compact.h
[modify] https://crrev.com/5ff2778f7b3db0b118d1488655c9d22f095b3f32/src/heap/page-parallel-job.h
,
Jun 9 2016
,
Aug 12 2016
ulan@: Can you comment on whether this fix would be needed on V8 5.0 and V8 4.5 (for Node.js)?
,
Aug 16 2016
Ulan is OOO till at least 8/25. Systems with glibc <2.21 are affected so if that's what you intend to support with node, then yes the fix is needed. While V8 4.5 is in theory also affected, it lacks the features that triggered the bug.
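For embedders who want to check which side of that boundary a system falls on, a small sketch follows. IsGlibcOlderThan221 is a hypothetical helper that only parses a version string; the 2.21 threshold is the one stated in this thread. Distributions can backport fixes, so treat the result as a hint, not proof.

```cpp
#include <cstdio>

// Hypothetical helper: returns true when a glibc version string such as
// "2.19" is older than 2.21, the release this thread names as containing the
// semaphore fix. Strings that cannot be parsed yield false (unknown).
bool IsGlibcOlderThan221(const char* version) {
  int major = 0;
  int minor = 0;
  if (std::sscanf(version, "%d.%d", &major, &minor) != 2) return false;
  if (major != 2) return major < 2;
  return minor < 21;
}
```

On glibc systems the version string can be obtained at run time from gnu_get_libc_version(), declared in <gnu/libc-version.h>. It is left out of the sketch so that the helper compiles on other C libraries too.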
,
Sep 28 2016
,
Oct 18 2016
Node.js v6.x has upgraded to V8 5.1 which already has the fix. Node.js v4.x is not affected.
Comment 1 by kbr@chromium.org, May 4 2016