Issue 612607

Starred by 3 users

Issue metadata

Status: Untriaged
Owner: ----
EstimatedDays: ----
NextAction: 2019-07-09
OS: Windows
Pri: 3
Type: Bug

Blocked on:
issue 619196



Improve GPU watchdog to postpone crashing when I/O queue is saturated

Project Member Reported by stanisc@chromium.org, May 17 2016

Issue description

The idea is to attempt an unbuffered write in the GPU watchdog just before crashing the process. If the process is slow because of heavy I/O, the write should block as well, which gives the process more time to unblock.

In theory, this should help with some GPU hangs.
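
A minimal sketch of the proposed flow, not the code from any CL. It assumes the watchdog checks again whether the watched thread has responded once the write returns; the issue text does not spell out that step. `WatchdogAction`, `OnWatchdogTimeout`, and both callbacks are hypothetical names used only for illustration.

```cpp
#include <functional>

// Hypothetical outcome of a watchdog timeout; not a Chromium type.
enum class WatchdogAction { kKeepWaiting, kTerminate };

// Sketch: before terminating, run a blocking I/O probe (for example an
// unbuffered write). If the disk queue is saturated, the probe itself
// blocks, which gives the watched thread extra time to make progress.
// Assumption: the watchdog re-checks the thread after the probe returns.
WatchdogAction OnWatchdogTimeout(
    const std::function<void()>& io_probe,
    const std::function<bool()>& thread_responded) {
  if (thread_responded()) return WatchdogAction::kKeepWaiting;
  io_probe();  // May block for seconds under heavy I/O.
  return thread_responded() ? WatchdogAction::kKeepWaiting
                            : WatchdogAction::kTerminate;
}
```

Passing the probe and the liveness check as callbacks keeps the decision logic testable without touching the disk or a real GPU thread.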
 
Project Member

Comment 1 by bugdroid1@chromium.org, Jun 10 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4015b488f743a7399e3362fd49917f494ff7caaf

commit 4015b488f743a7399e3362fd49917f494ff7caaf
Author: stanisc <stanisc@chromium.org>
Date: Fri Jun 10 17:47:04 2016

GPU Watchdog to check I/O before terminating

The idea is to try to do an unbuffered write in GPU watchdog
just before crashing the process. If the process is slow due
to heavy I/O this should give it more time to unblock.

This should theoretically help with some of GPU hangs.

BUG=612607

Review-Url: https://codereview.chromium.org/1980263002
Cr-Commit-Position: refs/heads/master@{#399222}

[modify] https://crrev.com/4015b488f743a7399e3362fd49917f494ff7caaf/content/browser/gpu/gpu_process_host.cc
[add] https://crrev.com/4015b488f743a7399e3362fd49917f494ff7caaf/content/common/gpu_watchdog_utils.cc
[add] https://crrev.com/4015b488f743a7399e3362fd49917f494ff7caaf/content/common/gpu_watchdog_utils.h
[modify] https://crrev.com/4015b488f743a7399e3362fd49917f494ff7caaf/content/content_common.gypi
[modify] https://crrev.com/4015b488f743a7399e3362fd49917f494ff7caaf/content/gpu/gpu_watchdog_thread.cc
[modify] https://crrev.com/4015b488f743a7399e3362fd49917f494ff7caaf/content/gpu/gpu_watchdog_thread.h

Comment 3 by kbr@chromium.org, Jun 15 2016

Blockedon: 619196

Comment 4 by kbr@chromium.org, Jun 15 2016

Status: Started (was: Assigned)
Note that the CL above caused flakiness in the context_lost_tests on the GPU bots, affecting the commit queue as well as some of the waterfall bots. See Issue 619196. A revert is in progress in https://codereview.chromium.org/2071613002/.

Project Member

Comment 5 by bugdroid1@chromium.org, Jun 15 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/40684e97fcc0cc24a2504f0dc2679a3b88557d9b

commit 40684e97fcc0cc24a2504f0dc2679a3b88557d9b
Author: kbr <kbr@chromium.org>
Date: Wed Jun 15 20:18:29 2016

Revert of GPU Watchdog to check I/O before terminating GPU process (patchset #5 id:120001 of https://codereview.chromium.org/1980263002/ )

Reason for revert:
This CL seems to have caused intermittent assertion failures in the context_lost_tests on the commit queue and reliable assertion failures on some of the GPU bots. See  http://crbug.com/619196  .

Original issue's description:
> GPU Watchdog to check I/O before terminating
>
> The idea is to try to do an unbuffered write in GPU watchdog
> just before crashing the process. If the process is slow due
> to heavy I/O this should give it more time to unblock.
>
> This should theoretically help with some of GPU hangs.
>
> BUG=612607
>
> Committed: https://crrev.com/4015b488f743a7399e3362fd49917f494ff7caaf
> Cr-Commit-Position: refs/heads/master@{#399222}

TBR=jbauman@chromium.org,wfh@chromium.org,nick@chromium.org,manzagop@chromium.org,pmonette@chromium.org,brucedawson@chromium.org,stanisc@chromium.org
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG=612607

Review-Url: https://codereview.chromium.org/2071613002
Cr-Commit-Position: refs/heads/master@{#399998}

[modify] https://crrev.com/40684e97fcc0cc24a2504f0dc2679a3b88557d9b/content/browser/gpu/gpu_process_host.cc
[delete] https://crrev.com/cd6c0ba34a71547db27d1abf9016d86e13e1b7ea/content/common/gpu_watchdog_utils.cc
[delete] https://crrev.com/cd6c0ba34a71547db27d1abf9016d86e13e1b7ea/content/common/gpu_watchdog_utils.h
[modify] https://crrev.com/40684e97fcc0cc24a2504f0dc2679a3b88557d9b/content/content_common.gypi
[modify] https://crrev.com/40684e97fcc0cc24a2504f0dc2679a3b88557d9b/content/gpu/gpu_watchdog_thread.cc
[modify] https://crrev.com/40684e97fcc0cc24a2504f0dc2679a3b88557d9b/content/gpu/gpu_watchdog_thread.h

This change ran on builds 53.0.2765.0 - 53.0.2768.0

It looks like it had a positive impact on the crash rate.
For MessagePumpForGpu::WaitForWork, for the 5 builds before the fix the CPM was: 7.358, 11.46, 11.282, 9.809, 10.437

In the 5 builds after the fix the CPM was: 5.911, 7.101, 7.324, 3.831, 5.327
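
To put the two CPM series side by side, the averages can be computed directly from the figures quoted above. This is only arithmetic on the ten numbers in this comment; `Mean` and `RelativeDrop` are helper names introduced here, and five builds per side is a small sample.

```cpp
#include <numeric>
#include <vector>

// Arithmetic mean; returns 0 for an empty input.
double Mean(const std::vector<double>& values) {
  if (values.empty()) return 0.0;
  return std::accumulate(values.begin(), values.end(), 0.0) /
         static_cast<double>(values.size());
}

// Fractional reduction from `before` to `after` (0.25 means 25% lower).
double RelativeDrop(double before, double after) {
  return before == 0.0 ? 0.0 : (before - after) / before;
}

// CPM figures quoted in this comment for MessagePumpForGpu::WaitForWork.
const std::vector<double> kCpmBefore = {7.358, 11.46, 11.282, 9.809, 10.437};
const std::vector<double> kCpmAfter = {5.911, 7.101, 7.324, 3.831, 5.327};
```

With these inputs the mean CPM is about 10.07 before and about 5.90 after, a drop of roughly 41%.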



I looked at a number of crash dumps for [GPU hang] MessagePumpForGpu::WaitForWork and other [GPU hang] crash signatures. The interesting number captured there is the duration of the I/O check, that is, how long it took to write 32 bytes of data to the temp file and flush the changes to disk.
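
A portable sketch of such a timed I/O check, under stated assumptions: `MeasureIoCheck` is a hypothetical name, and `std::ofstream::flush()` stands in for the real flush-to-disk step. It hands the data to the OS but does not guarantee it reaches the disk. On Windows, a write-through handle (`FILE_FLAG_WRITE_THROUGH`) plus `FlushFileBuffers` would be one way to force that, though this issue does not say which calls the reverted CL used.

```cpp
#include <chrono>
#include <filesystem>
#include <fstream>

// Writes 32 bytes to `file`, flushes, and returns the elapsed time in
// seconds. Returns a negative duration if the write failed.
// Assumption: stream flush approximates the flush-to-disk step.
std::chrono::duration<double> MeasureIoCheck(
    const std::filesystem::path& file) {
  const char payload[32] = {};
  const auto start = std::chrono::steady_clock::now();
  {
    std::ofstream out(file, std::ios::binary | std::ios::trunc);
    out.write(payload, sizeof(payload));
    out.flush();
    if (!out) return std::chrono::duration<double>(-1.0);
  }  // The stream is closed here, inside the timed region.
  return std::chrono::duration<double>(std::chrono::steady_clock::now() -
                                       start);
}
```

A caller would pass a path under the temp directory, record `count()` of the result alongside the crash report, and delete the file afterwards.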

Here are some examples of I/O check duration (in seconds):

MessagePumpForGpu::WaitForWork:
2.605, 2.243, 1.023, 0.617, 0.543, 0.287, 0.231, 0.166, 0.129, 0.083, 0.072, 0.044, 0.018, 0.001.

d3dcompiler_47.dll:
1.812, 0.621, 0.107

CreateD3DDevManager:
0.487

MessagePumpForGpu::DoRunLoop (with PeekMessage at the top):
0.008, 0.005, 0.004

It is interesting that the I/O check duration was consistently low for hangs with a PeekMessage call at the top of the call stack. Those look like true deadlocks to me.

Results of investigating another batch of crash dumps in one of the M53 Dev builds:

For MessagePumpForGpu::WaitForWork, 52.6% of cases had an I/O check duration longer than 1 second; the longest was 5.6 seconds and the average was 1.4 seconds (19 samples).

For [GPU hang] overall (that is, GPU hangs with all signatures, including WaitForWork), 40% of cases had an I/O check duration longer than 1 second; the longest was 8.9 seconds and the average was 1.2 seconds (68 samples).

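
The raw samples for this second batch are not included in the issue, so the sketch below applies the same kind of summary (share over a threshold, longest, average) to the first-batch WaitForWork durations listed earlier. `IoCheckSummary` and `Summarize` are hypothetical names.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct IoCheckSummary {
  double fraction_over_threshold;  // Share of samples above the threshold.
  double longest;                  // Largest sample, in seconds.
  double average;                  // Arithmetic mean, in seconds.
};

// Summarizes I/O check durations (seconds). Empty input yields all zeros.
IoCheckSummary Summarize(const std::vector<double>& seconds,
                         double threshold) {
  if (seconds.empty()) return {0.0, 0.0, 0.0};
  const auto over =
      std::count_if(seconds.begin(), seconds.end(),
                    [threshold](double s) { return s > threshold; });
  const double n = static_cast<double>(seconds.size());
  return {static_cast<double>(over) / n,
          *std::max_element(seconds.begin(), seconds.end()),
          std::accumulate(seconds.begin(), seconds.end(), 0.0) / n};
}

// First-batch MessagePumpForGpu::WaitForWork samples from this issue.
const std::vector<double> kWaitForWorkSamples = {
    2.605, 2.243, 1.023, 0.617, 0.543, 0.287, 0.231,
    0.166, 0.129, 0.083, 0.072, 0.044, 0.018, 0.001};
```

For the first batch, `Summarize(kWaitForWorkSamples, 1.0)` gives 3 of 14 samples (about 21%) over one second, a longest check of 2.605 seconds, and an average of about 0.58 seconds.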
If we end up re-implementing this, the code in gpu_process_host.cc would need to make sure that the temp path used with the sandbox rule isn't a reparse point.

Previously it failed here:

  if (!PreProcessName(&mod_name)) {
    // The path to be added might contain a reparse point.
    NOTREACHED();  // <-- the failure was hit on this line
    return false;
  }
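
One way such a pre-check could look, as a portable approximation only: walk every prefix of the temp path and report whether any component is a symbolic link. `PathContainsSymlink` is a hypothetical helper, not sandbox or Chromium code. On Windows, reparse points also include junctions and other types, so a real check would more likely test `FILE_ATTRIBUTE_REPARSE_POINT` via `GetFileAttributesW` on each component; that is an assumption, not something stated in this issue.

```cpp
#include <filesystem>
#include <system_error>

// Returns true if any existing component of `path` is a symbolic link.
// Components that do not exist are ignored. Portable stand-in for a
// Windows reparse-point check (assumption, see the note above).
bool PathContainsSymlink(const std::filesystem::path& path) {
  std::filesystem::path prefix;
  for (const std::filesystem::path& part : path) {
    prefix /= part;
    std::error_code ec;  // Errors are swallowed; status is then not a symlink.
    const std::filesystem::file_status status =
        std::filesystem::symlink_status(prefix, ec);
    if (std::filesystem::is_symlink(status)) return true;
  }
  return false;
}
```

A caller would run this on the temp path before adding the sandbox rule and fall back to skipping the I/O check if it returns true.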
Owner: ----
Status: Untriaged (was: Started)
Labels: Pri-3
NextAction: 2019-07-09
Downgrading P2s that haven't been modified in more than 6 months, which have no component or owner.
