
Issue 620259

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Jul 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 1
Type: Bug-Regression




Crash: [Shutdown hang] gpu::GpuChannelHost::Send

Project Member Reported by sheriffbot@chromium.org, Jun 15 2016

Issue description

Crash Signature: [Shutdown hang] gpu::GpuChannelHost::Send
Process Type: Browser
Platform: Mac
Channel: Dev
Version: 53.0.2763.0
Distinct Clients: 1
CPM: 0.20
Crash Reports: 1
Median Uptime: shutdown
Infected Clients: 0.0%

Sample Reports:
https://crash.corp.google.com/browse?q=reportid=%270e205e3c00000000%27
https://crash.corp.google.com/browse?q=reportid=%273c57d3dc00000000%27
https://crash.corp.google.com/browse?q=reportid=%27b9f70ddc00000000%27
https://crash.corp.google.com/browse?q=reportid=%2735c856fa00000000%27
https://crash.corp.google.com/browse?q=reportid=%27e7ec7afa00000000%27

Crash Link:
https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20product.version%3D%2753.0.2763.0%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BShutdown%20hang%5D%20gpu%3A%3AGpuChannelHost%3A%3ASend%27

Crash Link (with version impact distribution):
https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BShutdown%20hang%5D%20gpu%3A%3AGpuChannelHost%3A%3ASend%27

Crash Stacktrace:
0x43507378 (0x105ad061b)
#1 0x105b3ac77 in base::WaitableEvent::WaitMany base/synchronization/waitable_event_posix.cc:283
#2 0x10671d952 in IPC::SyncChannel::WaitForReply ipc/ipc_sync_channel.cc:524
#3 0x10671d63c in IPC::SyncChannel::Send ipc/ipc_sync_channel.cc:505
#4 0x106a58c16 in gpu::GpuChannelHost::Send gpu/ipc/client/gpu_channel_host.cc:121
#5 0x106a53d9f in gpu::CommandBufferProxyImpl::DisconnectChannel gpu/ipc/client/command_buffer_proxy_impl.cc:868
#6 0x106a53b16 in gpu::CommandBufferProxyImpl::~CommandBufferProxyImpl gpu/ipc/client/command_buffer_proxy_impl.cc:109
#7 0x106a53e4d in <name omitted> gpu/ipc/client/command_buffer_proxy_impl.cc:107
#8 0x10678f056 in content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer third_party/llvm-build/Release+Asserts/include/c++/v1/memory:2540
#9 0x10678f15d in content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer content/common/gpu/client/context_provider_command_buffer.cc:88
#10 0x10934bd07 in content::GpuProcessTransportFactory::~GpuProcessTransportFactory base/memory/ref_counted.h:196
#11 0x10934be34 in non-virtual thunk to content::GpuProcessTransportFactory::~GpuProcessTransportFactory content/browser/compositor/gpu_process_transport_factory.cc:180
#12 0x10934ede7 in content::ImageTransportFactory::Terminate content/browser/compositor/image_transport_factory.cc:50
#13 0x1090156dc in content::BrowserMainLoop::ShutdownThreadsAndCleanUp content/browser/browser_main_loop.cc:1033
#14 0x10901788d in content::BrowserMainRunnerImpl::Shutdown content/browser/browser_main_runner.cc:210
#15 0x109011818 in content::BrowserMain content/browser/browser_main.cc:48
#16 0x105a9faaf in content::ContentMainRunnerImpl::Run content/app/content_main_runner.cc:787
#17 0x105a9ecf5 in content::ContentMain content/app/content_main.cc:20
#18 0x10556bb69 in ChromeMain chrome/app/chrome_main.cc:84
#19 0x1054edd41 in main chrome/app/chrome_exe_main_mac.c:87
#20 0x1054edb23 in start 


Reporter: beherad

 
Cc: -beherad@google.com
Components: Internals>GPU
Labels: -Type-Bug M-53 OS-Mac Type-Bug-Regression
Owner: jaydasika@chromium.org
Status: Assigned (was: Untriaged)
1) This is a regression that appeared recently, in M51.
2) It is currently the #7 top browser crash on Mac Dev 53.0.2763.0, with 6 crashes from 5 different client IDs.
3) Link to list of builds where crashes are seen.
https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27browser%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BShutdown%20hang%5D%20gpu%3A%3AGpuChannelHost%3A%3ASend%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#samplereports:5,productversion:1000
4) Omahaproxy UI CL link:
https://chromium.googlesource.com/chromium/src/+log/51.0.2696.0..51.0.2697.0?pretty=fuller&n=10000
5) Possible suspect from the above CL range:
Suspect : https://codereview.chromium.org/1846043002
jaydasika@: Could you please take a look and check whether this is related to your change?
6) Not adding a blocker label, as the crash rate is low.
7) Crashes are seen only on Mac.
https://crash.corp.google.com/browse?q=custom_data.ChromeCrashProto.ptype%3D%27browser%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BShutdown%20hang%5D%20gpu%3A%3AGpuChannelHost%3A%3ASend%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D

Comment 2 by sheriffbot@chromium.org, Jun 15 2016

Labels: FoundIn-M-53
Users experienced this crash on the following builds:

Mac Dev 53.0.2763.0 -  0.55 CPM, 5 reports, 4 clients (signature [Shutdown hang] gpu::GpuChannelHost::Send)

If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas
Cc: piman@chromium.org
Owner: ----
Status: Available (was: Assigned)
This is not related to my CL.
+piman : Can you find the right owner for this ?

Comment 4 by piman@chromium.org, Jun 15 2016

Cc: erikc...@chromium.org
Components: -Internals>GPU Internals>GPU>Internals
Labels: -Pri-1 -Stability-Crash Stability-Hang Pri-2
Owner: ericrk@chromium.org
It's very low rate on 51.
This is not a crash, but a hang on shutdown. This is waiting for the GPU process to finish, so I suspect the GPU itself is hung.

My suspicion for the uptick in 53 is Eric's https://codereview.chromium.org/2028303002 (it fits in the 53.0.2757-53.0.2763.0 range, and it looks like there are possible cases where the fence doesn't trigger for unknown reasons). Some mitigation just landed (https://codereview.chromium.org/2064853002), so maybe this will resolve itself?
Another possibility would be Erik's changes around glDescheduleUntilFinishedCHROMIUM (which would also delay the GpuChannelMsg_DestroyCommandBuffer), but those were reverted and relanded a couple of times, and it doesn't seem to fit the uptick pattern.

An alternative theory is that there is a regression at the IPC level, where sync messages are not properly cancelled in some cases when the other end is terminated (GPU process terminated). However, I don't think that should be the case at this stage of shutdown, unless the GPU process crashed (and I don't see evidence of that).
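
The IPC-level theory can be pictured with a minimal Python sketch. This is not Chromium's SyncChannel; `PendingSyncCall`, `on_reply`, `on_channel_error`, and `send_sync` are hypothetical names invented for illustration. It only shows that a blocking send should be woken by either a reply or a channel error, and that if neither wake-up arrives the caller sits in its wait until some outer timeout fires.

```python
import threading


class PendingSyncCall:
    """Hypothetical stand-in for one in-flight synchronous IPC message."""

    def __init__(self):
        self._done = threading.Event()
        self.reply = None
        self.cancelled = False

    def on_reply(self, payload):
        # Called by the I/O thread when the peer's answer arrives.
        self.reply = payload
        self._done.set()

    def on_channel_error(self):
        # Called by the I/O thread when the peer goes away. Without this
        # wake-up the sender would stay blocked until the timeout expires.
        self.cancelled = True
        self._done.set()

    def wait(self, timeout):
        # True if the call finished (reply or cancel), False on timeout.
        return self._done.wait(timeout)


def send_sync(io_action, timeout=0.2):
    """Start a call, let io_action play the I/O thread, then block."""
    call = PendingSyncCall()
    io_thread = threading.Thread(target=io_action, args=(call,))
    io_thread.start()
    finished = call.wait(timeout)
    io_thread.join()
    return finished, call


replied, call_a = send_sync(lambda c: c.on_reply(b"ok"))
cancelled, call_b = send_sync(lambda c: c.on_channel_error())
hung, call_c = send_sync(lambda c: None)  # neither reply nor cancel arrives
```

In the third call nothing ever signals the event, which is the shape of the shutdown hang: the waiting thread is healthy, it is just never told that anything happened.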

Comment 5 by ericrk@chromium.org, Jun 16 2016

The mitigation I landed for the other issue (https://codereview.chromium.org/2064853002) doesn't seem to have fixed this issue, so I wonder if this is a different issue? (My mitigation *should* stop us from ever waiting more than 32ms. Would the browser ever trigger a hang due to a 32ms delay?)

Comment 6 by sheriffbot@chromium.org, Jun 16 2016

Labels: FoundIn-M-52
Users experienced this crash on the following builds:

Mac Beta 52.0.2743.33 -  0.11 CPM, 6 reports, 6 clients (signature [Shutdown hang] gpu::GpuChannelHost::Send)

If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas

Comment 7 by piman@chromium.org, Jun 16 2016

I think the browser timeout is much longer than that (seconds).

Comment 8 by ericrk@chromium.org, Jun 16 2016

FYI, a bit more data:

Looking at Canary crashes only (for more granularity), we see the following:

|    2757 - 1 crash
     2758 - 0 crashes   < My change landed
     2759 - 0 crashes
     2760 - 0 crashes
     2761 - 0 crashes
     2762 - 0 crashes
|    2763 - 1 crash
     2764 - 0 crashes
||   2765 - 2 crashes
|||| 2766 - 4 crashes
||   2767 - 2 crashes  < My mitigation landed
|||  2768 - 3 crashes

This doesn't quite line up with my change (we had five zero-crash builds after my CL landed). Instead, I'd suspect some change between 2761/2762 and 2763.

Also interesting is that Beta seems to spike as well (https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BShutdown%20hang%5D%20gpu%3A%3AGpuChannelHost%3A%3ASend%27%20AND%20custom_data.ChromeCrashProto.channel%3D%27beta%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D)

Beta goes from 0 reports in 2743.24 to 6 reports in 2743.33. This may indicate that the regression was merged into Beta between Jun 02 (2743.24) and Jun 09 (2743.33). So my guess is we're looking at something that was checked in between 2761.00 and 2763.00, and was then merged to Beta between 2743.24 and 2743.33. I did a manual scan of these changes and nothing stood out :/ So not 100% sure.
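
The Canary counts listed in this comment can be checked mechanically. The short Python sketch below copies the per-build numbers from the table; the helper names `zero_crash_run` and `total` are made up for illustration and are not part of any crash tooling.

```python
# Canary crash counts per build, copied from the table in this comment.
canary_crashes = {
    2757: 1, 2758: 0, 2759: 0, 2760: 0, 2761: 0, 2762: 0,
    2763: 1, 2764: 0, 2765: 2, 2766: 4, 2767: 2, 2768: 3,
}
CHANGE_LANDED = 2758  # "My change landed"


def zero_crash_run(counts, start):
    """Count consecutive builds with no reports, beginning at `start`."""
    run = 0
    for build in sorted(b for b in counts if b >= start):
        if counts[build] != 0:
            break
        run += 1
    return run


def total(counts, first, last):
    """Sum reports for builds in the inclusive range [first, last]."""
    return sum(n for b, n in counts.items() if first <= b <= last)


quiet_builds = zero_crash_run(canary_crashes, CHANGE_LANDED)
before_spike = total(canary_crashes, 2757, 2764)
from_spike = total(canary_crashes, 2765, 2768)
print(quiet_builds, before_spike, from_spike)  # prints: 5 2 11
```

Five quiet builds right after 2758, followed by most reports landing from 2765 onward, is the pattern the comment reads off the table.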

Comment 9 by ericrk@chromium.org, Jun 16 2016

Ok, I have a new theory here - if you look at the current crash rates, you'll see that the crashes occur on 3 OSs:

10.1 (puma)  <<<<<< Misreported - this is actually 10.12 Sierra Beta
10.11 (el capitan)
10.10 (yosemite)

On 10.11 and 10.10, the crash rate is fairly steady (no large spike in counts, sub 0.75% browser crash percentages):
https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Mac%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BShutdown%20hang%5D%20gpu%3A%3AGpuChannelHost%3A%3ASend%27%20AND%20custom_data.ChromeCrashProto.os_family!%3D%2710.1%20(Puma)%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D

If you look at 10.12 only, we see that the crash rate is much higher - it's consistently at 2 or 3 crashes a day, which, given the smaller 10.12 population, accounts for a 50-85% browser crash percentage.

Taking these two things together, we see that 10.12 only started appearing this week (when Apple released it at WWDC), and because of this, the impact of 10.12's much higher crash rate is only seen in recent chrome builds. This is why Canary doesn't seem to spike until 2765/2766 (releases from Sun/Mon, when sierra became available to devs), while Dev spikes at 2763 (the dev release available at that time).

Given this, I think the issue is 10.12 specific. Looking at 53.0.2763.0, this is the #1 browser crash on 10.12 at 87.5%. In comparison, on 10.11/10.10, this is the #64 browser crash at 0.53%.

Given the really high crash rate on 10.12, I'm hoping this is reproducible. I'll try to get a 10.12 machine to experiment with.

Comment 10 by piman@chromium.org, Jun 16 2016

Thank you, Eric, great find. I think you're spot on.
Labels: Hotlist-Sierra

Comment 12 by sheriffbot@chromium.org, Jun 29 2016

Labels: FoundIn-M-51
Users experienced this crash on the following builds:

Mac Stable 51.0.2704.106 -  0.07 CPM, 4 reports, 4 clients (signature [Shutdown hang] gpu::GpuChannelHost::Send)

If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas

Comment 13 by sheriffbot@chromium.org, Jul 1 2016

Labels: -Pri-2 ReleaseBlock-Dev Pri-0
This crash has high impact on Chrome's stability.
Signature: [Shutdown hang] gpu::GpuChannelHost::Send.
Channel: canary. Platform: mac.
Labeling  issue 620259  with Pri-0.
Labeling  issue 620259  with ReleaseBlock-Dev.


If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas
I set up a mac Sierra machine this week, I'll start investigating the issue.

Not sure that Pri-0 / ReleaseBlock-Dev makes sense for a non-regression that only affects a pre-release OS?

Comment 15 by sheriffbot@chromium.org, Jul 3 2016

Labels: FoundIn-M-54
Users experienced this crash on the following builds:

Mac Canary 54.0.2786.0 -  6.66 CPM, 5 reports, 5 clients (signature [Shutdown hang] gpu::GpuChannelHost::Send)

If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas
Labels: -Pri-0 -ReleaseBlock-Dev ReleaseBlock-Beta Pri-1
I changed this to Pri-1, RB-B. Note that we may have a significant number of users on M53 by the time Sierra is released to the public, so we do need to get it done for M53.
This is very reproducible - just launch Chrome under Sierra and ctrl+Q - may take 2 or 3 attempts, but should repro.
This looks like an IPC issue - I added tracing to my binary and am seeing the following:

1) Browser sends GpuChannelMessage_DestroyCommandBuffer
2) GPU Proc receives message and shuts down (no hang during shutdown/etc...)
3) GPU Proc believes it successfully sends the response (traced this down through the various levels to ChannelPosix::ProcessOutgoingMessages, where the fn believes it successfully sends the message without error).
4) Browser process appears to never get the response.

I'll look at the browser process next to see why the message seems to be dropped. However, it might make sense for a mac IPC expert to take a look at this.
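
Steps 3 and 4 are compatible with each other, as the following Python sketch with a plain socket pair suggests. It is only a model, not Chromium IPC: `wait_for_reply` and the socket names are hypothetical, and leaving a descriptor out of the watch set merely stands in for "the receiver is never notified". It does not claim to be the cause of this bug.

```python
import select
import socket


def wait_for_reply(sock, watched, timeout=0.2):
    """Block until one of `watched` is readable, then read from `sock`.

    Returns the bytes read, or None if the wait timed out.
    """
    readable, _, _ = select.select(watched, [], [], timeout)
    if sock in readable:
        return sock.recv(64)
    return None


browser_end, gpu_end = socket.socketpair()
unrelated_a, unrelated_b = socket.socketpair()  # a descriptor that never fires

# Step 3: the "GPU" side writes its reply. sendall() raising no error only
# means the kernel accepted the bytes.
gpu_end.sendall(b"destroyed")

# Step 4, broken case: the waiter is (wrongly) not watching browser_end.
missed = wait_for_reply(browser_end, [unrelated_a])

# Healthy case: the same bytes are still queued and are seen once watched.
seen = wait_for_reply(browser_end, [browser_end])

for s in (browser_end, gpu_end, unrelated_a, unrelated_b):
    s.close()
```

The sender's success is decided at write time, while the receiver's progress depends entirely on its event-notification path, which is why the browser side is the next place to look.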
Cc: mark@chromium.org rsesek@chromium.org
Adding mark@ and rsesek@ to help triage this Sierra issue.
quick note: in comment 17, I mean cmd+Q, not ctrl+Q (basically just anything that quits Chrome).
Cc: rnimmagadda@chromium.org
 Issue 620501  has been merged into this issue.

Comment 22 by mark@chromium.org, Jul 7 2016

Here (attached) is how I see the browser’s main thread getting stuck. As Eric found, I’m able to reproduce this in about ¼–⅓ of all quits.

All child processes except for the GPU process are gone. The GPU process has a main thread sitting idly in its MessagePumpCFRunLoop run loop, an idle work queue thread, and a Chrome_ChildIOThread sitting idly in its MessagePumpLibevent run loop.
browser_hang_bt.txt
6.7 KB View Download
Cc: -erikc...@chromium.org ericrk@chromium.org
Owner: erikc...@chromium.org
Status: Assigned (was: Available)
Cc: roc...@chromium.org
Here's what I know so far:

1. I've confirmed that the GPU process is writing the response to the underlying socket. (Similar to c#18, but I traced the message all the way to sendmsg()).

2. The browser process waits forever for the message. The message is never parsed by IPC::ChannelReader. This suggests that there may be an issue with the underlying event waiting mechanism. Note the scary warning during startup:
"""
[warn] kq_init: detected broken kqueue; not using.: Undefined error: 0
https://bugs.chromium.org/p/chromium/issues/detail?id=626534
"""

3. rockot@ has been making non-trivial changes to SyncChannel in the last two weeks, including reverts/relands. They appear to be independent of the problem, since the issue also occurs on M51. The latest reland https://codereview.chromium.org/2101163002 does not fix the problem.
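
For readers unfamiliar with the warning in point 2: libevent tries its event backends in a fixed order and uses the first one whose initialization succeeds, so a failed kqueue self-test silently demotes the process to a later backend. The Python sketch below models only that selection loop; `pick_backend` and the lambda probes are hypothetical, and the real probe logic lives in libevent's kqueue.c, which this thread does not quote.

```python
def pick_backend(backends):
    """Return (name, skipped) for the first backend whose probe passes."""
    skipped = []
    for name, probe in backends:
        if probe():
            return name, skipped
        skipped.append(name)
    raise RuntimeError("no usable event backend")


# Hypothetical probes: the preferred backend's self-test fails, so the
# loop quietly continues with the next candidate.
candidates = [
    ("kqueue", lambda: False),  # "detected broken kqueue; not using."
    ("poll", lambda: True),
    ("select", lambda: True),
]
chosen, skipped = pick_backend(candidates)
```

Nothing fails loudly in this flow; the only visible symptom is the one-line startup warning, which is why it is easy to overlook.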


Comment 25 by mark@chromium.org, Jul 8 2016

 Bug 626534  comment 2 ought to contain the fix to point 2 in this bug’s comment 24, and may also be the key to the fix for this bug.
Making mark's suggested change in  Bug 626534  removes the warning, and also fixes the hang on shutdown bug.
Note that my changes to SyncChannel have no effect on the underlying interprocess I/O; they only affect the mechanism used to synchronize between the I/O thread and the SyncChannel's owning thread.

Having said that, I have no interesting ideas about what could be causing this.

Comment 28 by mark@chromium.org, Jul 8 2016

Status: Started (was: Assigned)
https://codereview.chromium.org/2134603002/

Comment 29 by bugdroid1@chromium.org, Jul 8 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/3b07cd446f6bf33618ebae11ca68273b7a0de2f8

commit 3b07cd446f6bf33618ebae11ca68273b7a0de2f8
Author: erikchen <erikchen@chromium.org>
Date: Fri Jul 08 17:06:18 2016

Fix a logic bug in kqueue.c.

Remove an unnecessary workaround for OS X 10.4 from kqueue.c. It was causing
problems on macOS Sierra.

All credit for this CL goes to mark@chromium.org.

BUG= 626534 ,  620259 

Review-Url: https://codereview.chromium.org/2134603002
Cr-Commit-Position: refs/heads/master@{#404421}

[modify] https://crrev.com/3b07cd446f6bf33618ebae11ca68273b7a0de2f8/base/third_party/libevent/README.chromium
[modify] https://crrev.com/3b07cd446f6bf33618ebae11ca68273b7a0de2f8/base/third_party/libevent/kqueue.c

Status: Fixed (was: Started)

Comment 31 by bugdroid1@chromium.org, Jul 11 2016

Labels: merge-merged-2785
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/45a5e0217612dfb48e8c3fd5dcd7e186d8288c52

commit 45a5e0217612dfb48e8c3fd5dcd7e186d8288c52
Author: erikchen <erikchen@chromium.org>
Date: Mon Jul 11 17:21:58 2016

Fix a logic bug in kqueue.c.

Remove an unnecessary workaround for OS X 10.4 from kqueue.c. It was causing
problems on macOS Sierra.

All credit for this CL goes to mark@chromium.org.

BUG= 626534 ,  620259 

Review-Url: https://codereview.chromium.org/2134603002
Cr-Commit-Position: refs/heads/master@{#404421}
(cherry picked from commit 3b07cd446f6bf33618ebae11ca68273b7a0de2f8)

Review URL: https://codereview.chromium.org/2140723002 .

Cr-Commit-Position: refs/branch-heads/2785@{#83}
Cr-Branched-From: 68623971be0cfc492a2cb0427d7f478e7b214c24-refs/heads/master@{#403382}

[modify] https://crrev.com/45a5e0217612dfb48e8c3fd5dcd7e186d8288c52/base/third_party/libevent/README.chromium
[modify] https://crrev.com/45a5e0217612dfb48e8c3fd5dcd7e186d8288c52/base/third_party/libevent/kqueue.c

Comment 32 by mark@chromium.org, Jul 13 2016

Labels: -Restrict-View-EditIssue
