
Issue 603329

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug

Blocked on:
issue 352807
issue 606112
issue 608923
issue 619196




context_lost_tests flakiness

Project Member Reported by rjkroege@chromium.org, Apr 13 2016

Issue description

Build failed (once):
context_lost_tests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1

Revision range:
chromium 387106 : 387112

Failing builders:
Win7 Release (NVIDIA): https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20(NVIDIA)

It's not clear to me what's going on here. kbr@ any ideas? Maybe this is a one-off?

https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47813/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio


 

Comment 1 by kbr@chromium.org, Apr 13 2016

Cc: danakj@chromium.org siev...@chromium.org
Components: Blink>WebGL
Noticed that, but no concrete ideas. sievers@ did rework the logic that dispatches these events, and danakj@ has been massively cleaning up and refactoring the intermediate layers in this area. If we can reproduce it locally we should debug it urgently. Dana, Daniel, can either of you try a Release build on Windows and see whether it reproduces?

Comment 2 by kbr@chromium.org, Apr 13 2016

Here's a command line which would run the test in a loop:

python content\test\gpu\run_gpu_test.py context_lost --browser=release --show-stdout --extra-browser-args="--enable-logging=stderr --js-flags=--expose-gc" --story-filter=WebGLContextLostFromGPUProcessExit --page-repeat=1000 --max-failures=1
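For anyone scripting the loop instead of pasting it into a shell, here is a minimal Python 3 sketch that builds the same command as an argument list, which sidesteps the nested quoting around --extra-browser-args. The flags are copied from the command above. The helper names build_context_lost_command and run_until_failure are hypothetical, and the script is assumed to be launched from the root of a Chromium checkout on Windows.

```python
import subprocess
import sys


def build_context_lost_command(story_filter, repeats=1000, max_failures=1):
    """Build the argument list for one looped run of the context_lost suite."""
    # Both browser flags travel as a single value of --extra-browser-args.
    browser_args = "--enable-logging=stderr --js-flags=--expose-gc"
    return [
        sys.executable,
        r"content\test\gpu\run_gpu_test.py",
        "context_lost",
        "--browser=release",
        "--show-stdout",
        "--extra-browser-args=" + browser_args,
        "--story-filter=" + story_filter,
        "--page-repeat=%d" % repeats,
        "--max-failures=%d" % max_failures,
    ]


def run_until_failure(story_filter):
    """Run the suite (assumes a Chromium checkout); returns the exit code."""
    return subprocess.call(build_context_lost_command(story_filter))
```

Calling run_until_failure("WebGLContextLostFromGPUProcessExit") should be equivalent to the command line above; with --max-failures=1 the run stops at the first failure.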

Comment 3 by danakj@chromium.org, Apr 14 2016

It looks like there are two tests flaking there. If you look at the last 200 runs, you see:

GpuCrash.GPUProcessCrashesExactlyOnce https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47653/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio

ContextLost.WebGLContextLostFromGPUProcessExit https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47666/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio

ContextLost.WebGLContextLostFromGPUProcessExit https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47690/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio

GpuCrash.GPUProcessCrashesExactlyOnce https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47712/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio

ContextLost.WebGLContextLostFromGPUProcessExit https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47751/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio

ContextLost.WebGLContextLostFromGPUProcessExit https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47753/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio

GpuCrash.GPUProcessCrashesExactlyOnce https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47775/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio

ContextLost.WebGLContextLostFromGPUProcessExit https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47813/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio
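
To put rough numbers on the list above, here is a small Python 3 sketch that tallies the failures per test. The build numbers and test names are copied from the links; the assumption (not stated outright in the thread) is that these eight builds are all of the context_lost_tests failures within the 200 runs examined. The name flake_summary is hypothetical.

```python
from collections import Counter

# (build number, failing test) pairs copied from the links above.
FAILURES = [
    (47653, "GpuCrash.GPUProcessCrashesExactlyOnce"),
    (47666, "ContextLost.WebGLContextLostFromGPUProcessExit"),
    (47690, "ContextLost.WebGLContextLostFromGPUProcessExit"),
    (47712, "GpuCrash.GPUProcessCrashesExactlyOnce"),
    (47751, "ContextLost.WebGLContextLostFromGPUProcessExit"),
    (47753, "ContextLost.WebGLContextLostFromGPUProcessExit"),
    (47775, "GpuCrash.GPUProcessCrashesExactlyOnce"),
    (47813, "ContextLost.WebGLContextLostFromGPUProcessExit"),
]
RUNS_EXAMINED = 200  # "the last 200 runs"; assumed to cover all listed builds


def flake_summary(failures, runs):
    """Map each test name to (failure count, failures per run)."""
    counts = Counter(name for _, name in failures)
    return {name: (n, n / runs) for name, n in counts.items()}


if __name__ == "__main__":
    for name, (count, rate) in sorted(flake_summary(FAILURES, RUNS_EXAMINED).items()):
        print("%s: %d failures (%.1f%% of runs)" % (name, count, 100 * rate))
```

Under that assumption the WebGL context-lost test accounts for five of the eight failures and the GPU crash test for three.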

Comment 4 by kbr@chromium.org, Apr 14 2016

Cc: dyen@chromium.org
Components: Internals>CrashReporting
Thanks for that diagnosis, Dana.

I ran this test in a loop locally and wasn't able to provoke a failure. I did notice, however, that the test seemed to hang for a surprisingly long time, and it looked to me like the new minidump symbolization (Issue 561763) might be the reason.

We should probably change these tests so they don't produce minidumps for the GPU process crashes they provoke. Not sure whether that's possible with Crashpad. Does anyone who works on Crashpad know how it's hooked up and whether it'd be easy to tell it (on demand) to ignore crashes from a specific Chrome sub-process?

Comment 5 by danakj@chromium.org, Apr 14 2016

Cc: mark@chromium.org scottmg@chromium.org
+crashpad ppls

Comment 6 by mark@chromium.org, Apr 14 2016

A process can make a Crashpad API call that says “don’t handle me.”

crashpad::CrashpadInfo::GetCrashpadInfo()->set_crashpad_handler_behavior(crashpad::TriState::kDisabled);

https://crashpad.chromium.org/doxygen/structcrashpad_1_1CrashpadInfo.html#affa1b598fdd468a56d5cd1c7241ca85d

I don’t think that we should be setting this in any production code without a very good reason. I don’t think that we should be disabling it in tests, either. Perhaps what you need is a way to disable the symbolizer, or a timeout for it. Crashpad in Chrome is first and foremost intended to get us some remote debugging capabilities of things that we’ve shipped. It’s nice that people have figured out a way to use it to get local stack traces too, but we shouldn’t impair its primary use case because the local stack traces aren’t perfect.

Comment 7 by kbr@chromium.org, Apr 14 2016

Thanks for the pointer, Mark. The only situation in which I would consider doing this is for those tests that deliberately crash a sub-process (the GPU process) in order to test how the parent process reacts, and where processing of that sub-process's crash dump is undesired. The stack traces produced by Crashpad are awesome. Now that we finally have unit tests for symbolized stacks enabled on all desktop platforms (Issue 561447, Issue 561763, Issue 563716) we are able to file reasonable bugs for flaky crashes in the product that show up on the bots (Issue 603595).

In this particular case it's undesirable to disable the symbolizer. If it happens that the browser process or render process crashes as a consequence of the GPU process crashing, we want to see those stack traces in the test's logs. It has definitely happened in the past that the renderer has crashed as a consequence of the GPU process crashing, which should not happen.

Does that alleviate your concern and would you agree with our using this API in this test to filter out the GPU process's crash dumps?

Comment 8 by mark@chromium.org, Apr 14 2016

If you keep this limited to test code that only runs in a test process and you only do it for a process that’s crashed intentionally, I think I can live with it.

If you find that you “fail to crash” in the expected spot and want to turn Crashpad back on to catch an unexpected crash in an unexpected spot, call crashpad::CrashpadInfo::GetCrashpadInfo()->set_crashpad_handler_behavior(crashpad::TriState::kUnset) to restore the default behavior.
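
The disable-then-restore pattern discussed in Comments 6 to 8 can be modelled in a few lines of Python 3. This is only a toy model of the tri-state semantics, not Crashpad code: TriState here is a Python enum with illustrative values, FakeCrashpadInfo stands in for the per-process CrashpadInfo structure, and handler_should_write_dump encodes the assumption that the handler skips a process only when its behavior is explicitly kDisabled.

```python
import contextlib
import enum


class TriState(enum.Enum):
    """Toy counterpart of crashpad::TriState; the values are illustrative."""
    UNSET = 0     # default: the handler decides, which means "handle"
    ENABLED = 1
    DISABLED = 2


class FakeCrashpadInfo:
    """Hypothetical stand-in for the per-process CrashpadInfo structure."""

    def __init__(self):
        self.handler_behavior = TriState.UNSET


def handler_should_write_dump(info):
    """Assumed handler rule: skip only an explicitly disabled process."""
    return info.handler_behavior is not TriState.DISABLED


@contextlib.contextmanager
def crashpad_disabled(info):
    """Disable handling around an intentional crash, then restore the default."""
    info.handler_behavior = TriState.DISABLED
    try:
        yield info
    finally:
        # Reached only if the process "fails to crash" in the expected spot.
        info.handler_behavior = TriState.UNSET
```

The point of restoring UNSET rather than ENABLED is that UNSET returns the process to whatever the default behavior is, so an unexpected crash later in the test is still reported.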

Comment 9 by kbr@chromium.org, Apr 22 2016

Since the logs above expired, attached are two from these recent runs:

https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/48277
https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/48322

I don't seem to be able to reproduce this locally, at least not easily. The behavior of the renderer process crashing when navigating to about:gpucrash doesn't happen locally.

stdout1.txt (132 KB)
stdout2.txt (116 KB)

Comment 10 by kbr@chromium.org, Apr 23 2016

 Issue 606049  has been merged into this issue.

Comment 11 by kbr@chromium.org, Apr 23 2016

Blockedon: 606112

Comment 12 by kbr@chromium.org, Apr 23 2016

Cc: kbr@chromium.org
Owner: mark@chromium.org
Summary: context_lost_tests flakiness (was: win7 context loss test flake)
Some flakiness is seen on other platforms as well. Filed  Issue 606112  about another problem seen.

I did not have success disabling Crashpad. Please see the attached disable-crashpad.patch, which was a total hack just to see if it would work. Applying the attached catapult patch to src/third_party/catapult will preserve the temporary directories for the minidumps. Running:

python content\test\gpu\run_gpu_test.py context_lost --browser=release --show-stdout -v --extra-browser-args=--enable-logging=stderr > context-lost-output.txt 2>&1

and then examining the preserved directories and context-lost-output.txt indicates that two minidumps are still generated during the test run.

Mark, could you please help me figure out how to disable the generation of those minidumps? I had a lot of difficulty figuring out where the crash_reporter code could legally be called from (e.g. content/ or chrome/) and what dependencies were needed to successfully link against code in src/components/crash/content/app.

Ultimately, gpu_benchmarking_extension.cc will need a new method which will disable Crashpad in the GPU process, and an IPC will need to be sent to the GPU process to disable Crashpad. If the crash_reporter namespace is only accessible from chrome/ rather than content/, I don't know where in the GPU process's code to put the call to crash_reporter::DisableCrashpadForTesting.

Mark, could I please assign this bug to you for your advice on how to hook this up? Please assign it back to me once we've figured out how this will be done. Thanks.

disable-crashpad.patch (1.7 KB)
catapult-preserve-minidump-dir.patch (1.4 KB)

Comment 13 by kbr@chromium.org, Apr 23 2016

Cc: bajones@chromium.org zmo@chromium.org
Components: Internals>GPU>Testing

Comment 14 by kbr@chromium.org, Apr 23 2016

Cc: vmi...@chromium.org
The ContextLost.WebGLContextLostFromGPUProcessExit failure seems to happen when the following is in the log:

Tab crashed while navigating to chrome://gpucrash
Waiting for page to finish.
Tab crashed while closing chrome://gpucrash

I vaguely recall vmiura@ mentioning that there were some recent changes to the handling of debug URLs. Those should be investigated.
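
To check saved bot logs for this signature, a short Python 3 sketch like the following could be used. The two message strings are copied from the log excerpt above; the function name has_gpucrash_tab_crash is hypothetical, and treating "navigation crash followed later by close crash" as the signature is an assumption based on the ordering in that excerpt.

```python
import re

# Message text copied from the log excerpt above.
NAVIGATE_CRASH = re.compile(
    re.escape("Tab crashed while navigating to chrome://gpucrash"))
CLOSE_CRASH = re.compile(
    re.escape("Tab crashed while closing chrome://gpucrash"))


def has_gpucrash_tab_crash(log_text):
    """Return True if the close-crash message follows a navigation crash."""
    nav = NAVIGATE_CRASH.search(log_text)
    if nav is None:
        return False
    # Only look for the close message after the navigation message.
    return CLOSE_CRASH.search(log_text, nav.end()) is not None
```

Passing the contents of a downloaded stdio log to has_gpucrash_tab_crash gives a quick yes/no on whether a given failure matches this pattern.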

Comment 15 by bugdroid1@chromium.org, Apr 23 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6947a59811079d0b91ee60eee810fee5857ddc2d

commit 6947a59811079d0b91ee60eee810fee5857ddc2d
Author: kbr <kbr@chromium.org>
Date: Sat Apr 23 03:52:49 2016

Suppress ContextLost.WebGLContextLostFromGPUProcessExit flakes on Win7.

BUG= 603329 
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel
NOTRY=true
TBR=zmo@chromium.org

Review URL: https://codereview.chromium.org/1916653002

Cr-Commit-Position: refs/heads/master@{#389357}

[modify] https://crrev.com/6947a59811079d0b91ee60eee810fee5857ddc2d/content/test/gpu/gpu_tests/context_lost_expectations.py

Comment 16 by bugdroid1@chromium.org, Apr 25 2016

Labels: merge-merged-2716
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6947a59811079d0b91ee60eee810fee5857ddc2d


Cc: jdonnelly@chromium.org
This flake is still showing up on Win7 Release (NVIDIA) despite the entry kbr added to context_lost_expectations.py. Any ideas why?

See:
https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/48621
https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/48635

Comment 18 by kbr@chromium.org, Apr 28 2016

@jdonnelly: sorry about that. Fixing the flaky retry mechanism to handle timeouts in https://codereview.chromium.org/1915033009/ .

Comment 19 by kbr@chromium.org, Apr 28 2016

The attached log from https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/48621 shows that the exception that's leaking and breaking the flaky test suppression is:

Exception raised when cleaning story run: 

Traceback (most recent call last):
  _RunStoryAndProcessErrorIfNeeded at c:\users\chrome~1\appdata\local\temp\runlhwond\third_party\catapult\telemetry\telemetry\internal\story_runner.py:104
    state.DidRunStory(results)
  DidRunStory at c:\users\chrome~1\appdata\local\temp\runlhwond\third_party\catapult\telemetry\telemetry\page\shared_page_state.py:163
    self._current_tab.Close()
  Close at c:\users\chrome~1\appdata\local\temp\runlhwond\third_party\catapult\telemetry\telemetry\internal\browser\tab.py:100
    self._tab_list_backend.CloseTab(self.id)
  CloseTab at c:\users\chrome~1\appdata\local\temp\runlhwond\third_party\catapult\telemetry\telemetry\internal\backends\chrome\tab_list_backend.py:66
    util.WaitFor(lambda: tab_id not in self.IterContextIds(), timeout=5)
  WaitFor at c:\users\chrome~1\appdata\local\temp\runlhwond\third_party\catapult\telemetry\telemetry\core\util.py:94
    (timeout, GetConditionString()))
TimeoutException: Timed out while waiting 5s for util.WaitFor(lambda: tab_id not in self.IterContextIds(), timeout=5).
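
The failure mode in this traceback can be illustrated with a self-contained Python 3 sketch. It is not Telemetry's code and not the fix in the codereview linked in Comment 18. TimeoutException and wait_for are simplified stand-ins for the Telemetry names in the traceback, and run_with_flaky_retries is a hypothetical retry wrapper. The idea shown is that a timeout raised during cleanup (closing the tab) has to be caught by the same retry logic as a failure in the story itself, otherwise it escapes and the flaky-test suppression never gets a chance to retry.

```python
import time


class TimeoutException(Exception):
    """Simplified stand-in for Telemetry's exception of the same name."""


def wait_for(condition, timeout, poll_interval=0.01):
    """Poll condition() until it is truthy or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            raise TimeoutException("Timed out while waiting %ss" % timeout)
        time.sleep(poll_interval)


def run_with_flaky_retries(run_story, cleanup, max_attempts=3):
    """Retry a story; a timeout in cleanup counts as a failed attempt.

    Returns the 1-based number of the attempt that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            run_story()
            cleanup()  # e.g. closing the tab, which may itself time out
            return attempt
        except TimeoutException:
            if attempt == max_attempts:
                raise  # out of retries: report the timeout
```

With this shape, a tab close that times out once and succeeds on the next attempt is reported as a pass on the second attempt rather than as an unhandled exception.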


stdout.txt (132 KB)

Comment 20 by kbr@chromium.org, May 4 2016

Owner: kbr@chromium.org
mark@ provided feedback via email:

"""
Sounds like you did it correctly. Something must be broken, let’s debug!

Let’s split the difference on the crashpad_handler side and see what you’re getting in options->crashpad_handler_behavior at the return from ProcessSnapshotMac::GetCrashpadOptions():
https://chromium.googlesource.com/crashpad/crashpad/+/a02ba240062a3d9c2f5b8941ab6429367884d0ea/snapshot/mac/process_snapshot_mac.cc#90

when crashpad_handler’s looking at a process that you expect handling to be disabled for. We want to see TriState::kDisabled there. If it is, we’ll want to look on the handler side:
https://chromium.googlesource.com/crashpad/crashpad/+/6c0d42ce9dee55eaa906865191e28df35b32910d/handler/mac/crash_report_exception_handler.cc#113

and if it’s not, we’ll want to look on the module snapshot side:
https://chromium.googlesource.com/crashpad/crashpad/+/cf452d9a860885cf134051c8e5cb3a2f7f468fc2/snapshot/mac/module_snapshot_mac.cc#60

Code bisection, ’cuz we know how to save time.

Crashpad should also be somewhat wordy when unexpected things happen, but those words might not go anywhere you’re looking. If you launched Chrome from a command line, you’ll see them on stderr, so you probably wouldn’t miss anything. Otherwise, try the system log (Console.app) and look for things from crashpad_handler.
"""

Taking this bug back, but due to other recently-filed issues it may be a while before I can return to this.

Comment 21 by kbr@chromium.org, Jun 13 2016

Blockedon: 619196 608923

Comment 22 by kbr@chromium.org, Jun 13 2016

Blockedon: 352807

Comment 23 by bugdroid1@chromium.org, Jan 13 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/25bb65182cdcba9fd26e1876cc7c20ed3766470c

commit 25bb65182cdcba9fd26e1876cc7c20ed3766470c
Author: kbr <kbr@chromium.org>
Date: Fri Jan 13 03:48:15 2017

Try to mark ContextLost_WebGLContextLostFromGPUProcessExit flaky again.

The new test harness has been deployed; let's see whether it's more
reliable.

BUG= 603329 
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel;master.tryserver.chromium.android:android_optional_gpu_tests_rel
NOTRY=true

Review-Url: https://codereview.chromium.org/2629763002
Cr-Commit-Position: refs/heads/master@{#443479}

[modify] https://crrev.com/25bb65182cdcba9fd26e1876cc7c20ed3766470c/content/test/gpu/gpu_tests/context_lost_expectations.py

Comment 24 by kbr@chromium.org, Jan 13 2017

Status: Fixed (was: Assigned)
Calling this fixed. Let's reopen it if the flakiness resurfaces.
