
Issue 609249

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: May 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 0
Type: Bug

Blocked on:
issue 605349

Blocking:
issue 352807
issue 596622




Fatal error in ../../v8/src/base/platform/semaphore.cc, line 111: Semaphore signal failure: 22

Project Member Reported by kbr@chromium.org, May 4 2016

Issue description

In this tryjob on linux_chromium_rel_ng:
https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/223551

WebglConformance.conformance_ogles_GL_atan_atan_009_to_012 failed because of the following assertion failure in the renderer process:

#
# Fatal error in ../../v8/src/base/platform/semaphore.cc, line 111
# Semaphore signal failure: 22
#

	Thread 11 (crashed)
	 0  chrome!v8::base::OS::Abort() + 0xf
	 1  0x3000000020
	 2  chrome!WaitForTask [lock.h : 50 + 0x8]
	 3  chrome!ThreadMain [callback.h : 397 + 0x7]
	 4  chrome!ThreadFunc [platform_thread_posix.cc : 70 + 0x8]
	 5  libpthread-2.19.so + 0x8182
	 6  libc-2.19.so + 0xfa47d

Similar to Issue 605349, but apparently a different failure mode.

Hannes, could you or someone else on V8's GC team please see why this may have happened?

 

Comment 1 by kbr@chromium.org, May 4 2016

Forgot to attach stdout.

Attachment: stdout.txt (4.5 MB)

Comment 2 by u...@chromium.org, May 6 2016

Cc: mlippautz@chromium.org
errno = 22 means "Invalid argument". So the native_handle of the semaphore is invalid. We know that the handle is pointer-aligned because we check for it in the Semaphore constructor.

I think this leaves two possibilities:
1. The semaphore is used after being destroyed.
2. The native handle of the semaphore is corrupted.
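
For illustration, a minimal sketch of possibility 1 using the plain POSIX API (not V8's Semaphore wrapper). Signalling a destroyed semaphore is undefined behavior, so EINVAL is one possible outcome rather than a guarantee:

  #include <semaphore.h>
  #include <cerrno>
  #include <cstdio>

  int main() {
    sem_t sem;
    sem_init(&sem, /*pshared=*/0, /*value=*/0);
    sem_destroy(&sem);
    // Deliberate use-after-destroy. When glibc does report a failure
    // here, errno is EINVAL (22), matching the fatal error above.
    if (sem_post(&sem) != 0) {
      std::printf("sem_post failed: errno = %d\n", errno);
    } else {
      std::printf("sem_post 'succeeded' -- undefined behavior, not a guarantee\n");
    }
    return 0;
  }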

Ken, any chance we can get a better stack trace? The one posted is missing all the interesting bits: where Semaphore::Signal is called from.

Comment 3 by kbr@chromium.org, May 6 2016

Cc: mark@chromium.org hpayer@chromium.org thakis@chromium.org dyen@chromium.org
Owner: u...@chromium.org
I'm not sure why the stack is getting truncated. This is a 64-bit build; does Crashpad need stack unwind info that might not be present? CC'ing mark@ and thakis@, the Crashpad and Clang owners.

Can you do a local Release build with dcheck_always_on=1 and run the tests in a loop? e.g.:

./content/test/gpu/run_gpu_test.py webgl_conformance --browser=release --pageset-repeat=1000 --story-filter=conformance_ogles --max-failures=1 -v --extra-browser-args="--enable-logging=stderr --js-flags=--expose-gc" > output.txt 2>&1

Then in a separate terminal:

tail -f output.txt

Hopefully it'll catch one of these failures.

Thanks.

Comment 4 by kbr@chromium.org, May 7 2016

Labels: -Pri-2 Pri-1
Note: more failures on the CQ:

https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/224558
https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/224541

Bumping to P1. I'd suggest instrumenting V8's code so that it can print more detailed information when the failure happens.
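
A hedged sketch of what such instrumentation might look like (illustrative names, plain POSIX in place of V8's Semaphore class; the message format matches the output shown in comment 6):

  #include <semaphore.h>
  #include <cerrno>
  #include <cstdio>
  #include <cstdlib>

  // Thread the call-site name into the signal path so the fatal message
  // names the caller, e.g.
  // "Semaphore signal failure: 22 called by 'PageParallelJob::Task::RunInternal'".
  void SignalOrDie(sem_t* sem, const char* caller) {
    if (sem_post(sem) != 0) {
      std::fprintf(stderr, "Semaphore signal failure: %d called by '%s'\n",
                   errno, caller);
      std::abort();
    }
  }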

Comment 6 by kbr@chromium.org, May 10 2016

Good progress, Ulan. Here's one failure with the above instrumentation in place:

https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/225698

[ RUN      ] WebglConformance.conformance_ogles_GL_asin_asin_001_to_006


#
# Fatal error in ../../v8/src/base/platform/semaphore.cc, line 110
# Semaphore signal failure: 22 called by 'PageParallelJob::Task::RunInternal'
#

Found Minidump: True
Stack Trace:
********************************************************************************
	Operating system: Linux
	                  0.0.0 Linux 3.13.0-61-generic #100-Ubuntu SMP Wed Jul 29 11:21:34 UTC 2015 x86_64
	CPU: amd64
	     family 6 model 60 stepping 3
	     1 CPU
	
	GPU: UNKNOWN
	
	Crash reason:  SIGILL
	Crash address: 0x7fbabb78967f
	Process uptime: not available
	
	Thread 12 (crashed)
	 0  chrome!v8::base::OS::Abort() + 0xf
	    rax = 0x0000000000000000   rdx = 0x0000000000000000
	    rcx = 0xffffffffffffffff   rbx = 0x00007fbabeaaf239
	    rsi = 0x00007fbab23d49d0   rdi = 0x00007fbab23d31c0
	    rbp = 0x000000000000006e   rsp = 0x00007fba99c4a9e8
	     r8 = 0x00007fba99c4b700    r9 = 0x00007fbab20b7927
	    r10 = 0x00007fbab23d0be0   r11 = 0x0000000000000000
	    r12 = 0x0000095a58061cc0   r13 = 0x00007fba99c4ab60
	    r14 = 0x00007fbabeaaf2af   r15 = 0x00007fbab23d3868
	    rip = 0x00007fbabb78967f
	    Found by: given as instruction pointer in context
	 1  0x3000000028
	    rbx = 0x00007fbabeaaf239   rbp = 0x000000000000006e
	    rsp = 0x00007fba99c4a9f8   r12 = 0x0000095a58061cc0
	    r13 = 0x00007fba99c4ab60   r14 = 0x00007fbabeaaf2af
	    r15 = 0x00007fbab23d3868   rip = 0x0000003000000028
	    Found by: call frame info
	 2  chrome!WaitForTask [lock.h : 50 + 0x8]
	    rsp = 0x00007fba99c4aa10   rip = 0x00007fbabdf8cfcd
	    Found by: stack scanning
	 3  chrome!ThreadMain [callback.h : 397 + 0x7]
...

Comment 7 by u...@chromium.org, May 11 2016

Yep, I am checking the code related to PageParallelJob.

Yesterday I removed a few reinterpret_casts in https://codereview.chromium.org/1963853004/

There is a slim chance that they were causing undefined behavior.

Comment 8 by u...@chromium.org, May 12 2016

No new crashes so far.

Comment 9 by kbr@chromium.org, May 12 2016

That's good news. I assume the V8 change in https://codereview.chromium.org/1963853004/ has already rolled in? Thanks for persisting with this issue.

Comment 10 by u...@chromium.org, May 13 2016

We have a new crash today. Investigation continues.

https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/228346

#
# Fatal error in ../../v8/src/base/platform/semaphore.cc, line 110
# Semaphore signal failure: 22 called by 'PageParallelJob::Task::RunInternal'
#


Project Member Comment 11 by bugdroid1@chromium.org, May 13 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/7bc54ba832c9b0ba750e9efb1fc6636560861aad

commit 7bc54ba832c9b0ba750e9efb1fc6636560861aad
Author: ulan <ulan@chromium.org>
Date: Fri May 13 12:35:49 2016

Check number of finished tasks in PageParallelJob.

BUG= chromium:609249 
LOG=NO

Review-Url: https://codereview.chromium.org/1976133002
Cr-Commit-Position: refs/heads/master@{#36238}

[modify] https://crrev.com/7bc54ba832c9b0ba750e9efb1fc6636560861aad/src/heap/page-parallel-job.h

Comment 13 by u...@chromium.org, May 17 2016

The check in #11 ensures that the main thread waits for all tasks, so the semaphore is not destroyed at the time of signalling.
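
For context, a hedged sketch of the invariant that check enforces (illustrative code, not V8's PageParallelJob): the main thread performs one wait per spawned task, so destruction can only happen after every task has signalled.

  #include <semaphore.h>
  #include <pthread.h>

  namespace {
  constexpr int kNumTasks = 4;
  sem_t pending_tasks;

  void* TaskMain(void*) {
    // ... per-page work would happen here ...
    sem_post(&pending_tasks);  // "this task is finished"
    return nullptr;
  }
  }  // namespace

  int main() {
    sem_init(&pending_tasks, /*pshared=*/0, /*value=*/0);
    pthread_t threads[kNumTasks];
    for (pthread_t& t : threads) pthread_create(&t, nullptr, TaskMain, nullptr);
    for (int i = 0; i < kNumTasks; ++i) sem_wait(&pending_tasks);
    for (pthread_t& t : threads) pthread_join(t, nullptr);
    sem_destroy(&pending_tasks);  // safe only after all waits have returned
    return 0;
  }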

At this point, I can think of only memory corruption or a compiler bug as an explanation for this crash.

The stdio log says something about a minidump. Is it possible to get the minidump?

I'll try to reproduce again. Last time I didn't succeed.

Comment 14 by kbr@chromium.org, May 17 2016

Cc: nedngu...@google.com
Components: Tests>Telemetry
+nednguyen

Ned, do you think it would be feasible to let Telemetry clients enumerate any minidumps that were produced during the run and upload them to cloud storage? I haven't looked at the code in question yet but it's buried pretty deeply in the desktop browser backend as far as I remember.

Ulan, if you can try to reproduce locally, that would be really helpful. Try building Release with dcheck_always_on=1 and run:

src/content/test/gpu/run_gpu_test.py webgl_conformance --browser=release --pageset-repeat=100 --max-failures=1 --show-stdout --extra-browser-args="--enable-logging=stderr --js-flags=--expose-gc" -v > output.txt 2>&1

I note that:

https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/228346
was job:
https://chromium-swarm.appspot.com/user/task/2ec1f0fe62903610
which ran on build145-m4, while:

https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/229599
was job:
https://chromium-swarm.appspot.com/user/task/2ed3b9378f9b2110
which ran on build154-m4. Since these failures aren't machine-specific, I doubt it's something like bad RAM.

Comment 15 by nedngu...@google.com

kbr@, I think that's ok to support. We can expose the path to the minidump in Telemetry's AppCrashException (https://github.com/catapult-project/catapult/blob/master/telemetry/telemetry/core/exceptions.py#L54)

Clients can catch this exception & do whatever they want with the minidump paths.

Comment 16 by u...@chromium.org, May 18 2016

I ran 3 parallel sessions today for 5 hours. No crash.

Maybe the crash depends on the binary (because of the V8 snapshot). Is there a way to download the binary of the failing run?

Comment 17 by kbr@chromium.org, May 19 2016

Yes. Go to the task page, like:
https://chromium-swarm.appspot.com/user/task/2ec1f0fe62903610

(I think this requires you to be logged in with your @google.com account)

Then find the hash of the "Isolated inputs", e.g.:
1ae56b273feb9741f97fd462a9a02d568a5aaf77

Then do the following (while cd'd into "src" in your Chromium checkout):

./tools/swarming_client/isolateserver.py download -t foo -s [hash] -I https://isolateserver.appspot.com

This will download the isolate into the directory "foo". Or, to run the test directly from the isolate, which is closer to how the bots do it, do the following (note the command line comes from the task page above, though the escaping of quotes is bad in the user interface):

./tools/swarming_client/run_isolated.py -s [hash] -I https://isolateserver.appspot.com -- webgl_conformance --show-stdout --browser=release -v --extra-browser-args="--enable-logging=stderr --js-flags=--expose-gc" --isolated-script-test-output=/tmp/foo.json

Comment 18 by kbr@chromium.org, May 19 2016

Also: filed https://github.com/catapult-project/catapult/issues/2346 about uploading minidumps from the bots. Ulan, please tell us if this would really be useful. I don't have a good sense for how many Chromium developers would actually look at these and be able to make debugging progress with them.

Comment 19 by u...@chromium.org, May 19 2016

Thank you, Ken. The instructions worked. I am running the tests with the binary from the bot.

We on the V8 team routinely look at minidumps for Chrome crashes. I think minidumps from bots would be useful when a failure is hard to reproduce, like in this case.


Comment 20 by kbr@chromium.org, May 19 2016

Blocking: 352807
OK. Thanks for the feedback, Ulan. We will work on getting minidumps off the bots; see the recent comment on https://github.com/catapult-project/catapult/issues/2346 . Right now the plan isn't to work on this immediately, but rather to do it after the harness is simplified in Issue 352807. Let me know if you can't make progress without the minidumps, or if you require access to one of the bots to reproduce there. The latter is simpler.

Comment 21 by u...@chromium.org, May 20 2016

I think we are hitting this glibc bug: https://sourceware.org/bugzilla/show_bug.cgi?id=12674

The bug is fixed in glibc 2.21, but my Ubuntu Trusty machine has glibc 2.19.

Ken, I assume the bots also run with glibc 2.19?
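
(For reference, one quick way to check the runtime glibc version on any of the machines involved, equivalent to running "ldd --version" in a shell:)

  #include <gnu/libc-version.h>
  #include <cstdio>

  int main() {
    std::printf("glibc %s\n", gnu_get_libc_version());  // e.g. "glibc 2.19"
    return 0;
  }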

The race matches our use case in PageParallelJob:

1. The main thread initializes semaphore and spawns a background thread.
2. The background thread atomically increments the semaphore counter but goes to sleep before checking for waiters.
3. The main thread sees the increment of the counter, assumes that the background thread is finished, and destroys the semaphore.
4. The background thread wakes up and reads the now-invalid semaphore.
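
A minimal, hedged repro sketch of that interleaving (assumed shape, not V8's code; build with -pthread, run on glibc < 2.21, and expect many iterations before a failure, if one happens at all):

  #include <semaphore.h>
  #include <pthread.h>

  void* Worker(void* arg) {
    // Step 2: sem_post bumps the count atomically; if preempted here, it
    // checks for waiters only after the main thread has moved on.
    sem_post(static_cast<sem_t*>(arg));
    return nullptr;
  }

  int main() {
    for (;;) {
      sem_t* done = new sem_t;                    // step 1
      sem_init(done, /*pshared=*/0, /*value=*/0);
      pthread_t t;
      pthread_create(&t, nullptr, Worker, done);
      sem_wait(done);                             // step 3: sees the increment
      sem_destroy(done);
      delete done;  // freed, and likely reused by the next iteration's new
      pthread_join(t, nullptr);                   // step 4 happened before this
    }
  }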

The sem_post implementation behind Semaphore::Signal that I see on my machine:

   0x7ffff493c8e0 <sem_post>:           mov    (%rdi),%eax
   0x7ffff493c8e2 <sem_post+2>:         cmp    $0x7fffffff,%eax
   0x7ffff493c8e7 <sem_post+7>:         je     0x7ffff493c91c <sem_post+60>
   0x7ffff493c8e9 <sem_post+9>:         lea    0x1(%rax),%esi
   0x7ffff493c8ec <sem_post+12>:        lock cmpxchg %esi,(%rdi)
// Race with the main thread starts here
   0x7ffff493c8f0 <sem_post+16>:        jne    0x7ffff493c8e2 <sem_post+2>
   0x7ffff493c8f2 <sem_post+18>:        cmpq   $0x0,0x8(%rdi)
   0x7ffff493c8f7 <sem_post+23>:        je     0x7ffff493c912 <sem_post+50>
   0x7ffff493c8f9 <sem_post+25>:        mov    $0xca,%eax
   0x7ffff493c8fe <sem_post+30>:        mov    $0x1,%esi
   0x7ffff493c903 <sem_post+35>:        or     0x4(%rdi),%esi
   0x7ffff493c906 <sem_post+38>:        mov    $0x1,%edx
   0x7ffff493c90b <sem_post+43>:        syscall 
   0x7ffff493c90d <sem_post+45>:        test   %rax,%rax
   0x7ffff493c910 <sem_post+48>:        js     0x7ffff493c915 <sem_post+53>
   0x7ffff493c912 <sem_post+50>:        xor    %eax,%eax
   0x7ffff493c914 <sem_post+52>:        retq   
   0x7ffff493c915 <sem_post+53>:        mov    $0x16,%eax
   0x7ffff493c91a <sem_post+58>:        jmp    0x7ffff493c921 <sem_post+65>
   0x7ffff493c91c <sem_post+60>:        mov    $0x4b,%eax
   0x7ffff493c921 <sem_post+65>:        mov    0x20a660(%rip),%rdx        # 0x7ffff4b46f88
   0x7ffff493c928 <sem_post+72>:        mov    %eax,%fs:(%rdx)
   0x7ffff493c92b <sem_post+75>:        or     $0xffffffff,%eax
   0x7ffff493c92e <sem_post+78>:        retq 


Comment 22 by u...@chromium.org, May 20 2016

We can work around this bug by keeping one global static semaphore instead of dynamically creating and destroying it.

I will upload a CL.
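
A hedged sketch of that direction (illustrative names, not the actual CL): a semaphore with process lifetime is never destroyed, so sem_post can never race with sem_destroy.

  #include <semaphore.h>

  sem_t* SharedPendingTasksSemaphore() {
    struct NeverDestroyed {
      sem_t sem;
      NeverDestroyed() { sem_init(&sem, /*pshared=*/0, /*value=*/0); }
      // No sem_destroy anywhere: the semaphore must outlive every signaller.
    };
    static NeverDestroyed instance;  // initialized once; thread-safe since C++11
    return &instance.sem;
  }

Sharing one semaphore presumably relies on the check from comment 11: with exactly one wait per signal, counts cannot leak from one job into the next.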

Comment 24 by u...@chromium.org, May 20 2016

> Ken, I assume the bots also run with glibc 2.19?
The stack trace in the original post confirms it (frame 5 is libpthread-2.19.so).

Comment 25 by kbr@chromium.org, May 20 2016

Awesome work, Ulan! Thank you for persisting with this nasty bug. The fix looks good, and I'm sure it will stick on the bots. Sorry they aren't running a more recent Ubuntu version, but I think the code change you made is worth it for stability on more Linux machines.

Comment 26 by u...@chromium.org, May 23 2016

Thank you, Ken. 

No worries about the Ubuntu version on the bots; even my machine has glibc 2.19. :)

Comment 27 by kbr@chromium.org, May 23 2016

Status: Fixed (was: Assigned)
Marking Fixed. Please change if that's not appropriate. Thanks again!

Comment 28 by u...@chromium.org, Jun 6 2016

Issue 616789 has been merged into this issue.
Project Member Comment 29 by sheriffbot@chromium.org, Jun 6 2016

Labels: -Pri-1 ReleaseBlock-Stable Pri-0
This crash has high impact on Chrome's stability.
Channel: beta. Platform: linux.
Labeling issue 609249 with Pri-0.
Labeling issue 609249 with ReleaseBlock-Stable.


If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas

Comment 30 by u...@chromium.org, Jun 7 2016

Labels: Merge-Request-51 Merge-Request-52
Requesting merge of the CL in #23 to M51 and M52.

Comment 31 by tin...@google.com, Jun 7 2016

Labels: -Merge-Request-51 Merge-Review-51 Hotlist-Merge-Review
[Automated comment] Request affecting a post-stable build (M51), manual review required.

Comment 32 by tin...@google.com, Jun 7 2016

Labels: -Merge-Request-52 Merge-Approved-52 Hotlist-Merge-Approved
Your change meets the bar and is auto-approved for M52 (branch: 2743)

Comment 33 by tin...@google.com, Jun 7 2016

Labels: -Merge-Request-51 Merge-Review-51 Hotlist-Merge-Review
[Automated comment] Request affecting a post-stable build (M51), manual review required.
Labels: -Merge-Review-51 Merge-Approved-51
Please merge your change ASAP so that it will be picked up for the next Beta release.
Please merge your change by 4:00 PM PST Friday (06/10) at the latest so we can pick it up for next week's M51 Stable refresh.
Project Member Comment 37 by bugdroid1@chromium.org, Jun 9 2016

Labels: merge-merged-5.2
The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/82ec83d83c4f8997175946f3d859a2ae2d6192b6

commit 82ec83d83c4f8997175946f3d859a2ae2d6192b6
Author: Ulan Degenbaev <ulan@chromium.org>
Date: Thu Jun 09 08:58:21 2016

Version 5.2.361.20 (cherry-pick)

Merged 84ee947013003d481b1222d975e4456122b6efe9

Workaround for glibc semaphore bug.

BUG= chromium:609249 
LOG=N
R=mlippautz@chromium.org

Review URL: https://codereview.chromium.org/2048313002 .

Cr-Commit-Position: refs/branch-heads/5.2@{#26}
Cr-Branched-From: 2cd36d6d0439ddfbe84cd90e112dced85084ec95-refs/heads/5.2.361@{#1}
Cr-Branched-From: 3fef34e02388e07d46067c516320f1ff12304c8e-refs/heads/master@{#36332}

[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/include/v8-version.h
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/base/platform/semaphore.cc
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/base/platform/semaphore.h
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/debug/debug.cc
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/heap/mark-compact.cc
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/heap/mark-compact.h
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/heap/page-parallel-job.h
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/heap/spaces.cc
[modify] https://crrev.com/82ec83d83c4f8997175946f3d859a2ae2d6192b6/src/log.cc

Labels: -Merge-Approved-51 -Merge-Approved-52
Labels: backport-review
ulan@: Can you comment on whether this fix would be needed on V8 5.0 and V8 4.5 (for Node.js)?
Ulan is OOO till at least 8/25. 

Systems with glibc < 2.21 are affected, so if that's what you intend to support with Node.js, then yes, the fix is needed.

While V8 4.5 is in theory also affected, it lacks the features that triggered the bug.


Labels: -Backport-review NodeJS-Backport-Review
Labels: -NodeJS-Backport-Review NodeJS-Backport-Done
Node.js v6.x has upgraded to V8 5.1, which already has the fix. Node.js v4.x is not affected.
