context_lost_tests flakiness
Issue description
Build failed (once): context_lost_tests on NVIDIA GPU on Windows on Windows-2008ServerR2-SP1
Revision range: chromium 387106 : 387112
Failing builders:
Win7 Release (NVIDIA): https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20(NVIDIA)
It's not clear to me what's going on here. kbr@, any ideas? Maybe this is a one-off?
https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47813/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio
,
Apr 13 2016
Here's a command line which would run the test in a loop:
python content\test\gpu\run_gpu_test.py context_lost --browser=release --show-stdout --extra-browser-args="--enable-logging=stderr --js-flags=--expose-gc" --story-filter=WebGLContextLostFromGPUProcessExit --page-repeat=1000 --max-failures=1
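For anyone scripting the loop above, here is a minimal sketch that builds the same command as an argument list, so the quoting of --extra-browser-args does not depend on the shell. The helper name build_context_lost_command is hypothetical; the flags and values are the ones from the command line above.

```python
import os


def build_context_lost_command(story_filter, page_repeat=1000, max_failures=1):
    """Return the argv list for one looped context_lost run.

    Hypothetical helper: only the flags quoted in this comment thread are
    used, and nothing here is part of run_gpu_test.py itself.
    """
    browser_args = '--enable-logging=stderr --js-flags=--expose-gc'
    return [
        'python',
        os.path.join('content', 'test', 'gpu', 'run_gpu_test.py'),
        'context_lost',
        '--browser=release',
        '--show-stdout',
        # One argv element, so no shell quoting is needed around the value.
        '--extra-browser-args=' + browser_args,
        '--story-filter=' + story_filter,
        '--page-repeat=%d' % page_repeat,
        '--max-failures=%d' % max_failures,
    ]
```

The resulting list can be handed to subprocess.call() from the root of a Chromium checkout.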
,
Apr 14 2016
It looks like there are two tests that are flaking there. If you look at the last 200 runs you see:
GpuCrash.GPUProcessCrashesExactlyOnce https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47653/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio
ContextLost.WebGLContextLostFromGPUProcessExit https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47666/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio
ContextLost.WebGLContextLostFromGPUProcessExit https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47690/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio
GpuCrash.GPUProcessCrashesExactlyOnce https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47712/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio
ContextLost.WebGLContextLostFromGPUProcessExit https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47751/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio
ContextLost.WebGLContextLostFromGPUProcessExit https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47753/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio
GpuCrash.GPUProcessCrashesExactlyOnce https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47775/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio
ContextLost.WebGLContextLostFromGPUProcessExit https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/47813/steps/context_lost_tests%20on%20NVIDIA%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio
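To see how the failures split between the two tests, the builds listed above can be tallied with a few lines of Python. This is only a sketch: the build numbers and test names are copied from the list above, and the names FLAKES and tally_flakes are made up for illustration.

```python
import collections

# (build number, failing test) pairs, copied from the list in this comment.
FLAKES = [
    (47653, 'GpuCrash.GPUProcessCrashesExactlyOnce'),
    (47666, 'ContextLost.WebGLContextLostFromGPUProcessExit'),
    (47690, 'ContextLost.WebGLContextLostFromGPUProcessExit'),
    (47712, 'GpuCrash.GPUProcessCrashesExactlyOnce'),
    (47751, 'ContextLost.WebGLContextLostFromGPUProcessExit'),
    (47753, 'ContextLost.WebGLContextLostFromGPUProcessExit'),
    (47775, 'GpuCrash.GPUProcessCrashesExactlyOnce'),
    (47813, 'ContextLost.WebGLContextLostFromGPUProcessExit'),
]


def tally_flakes(flakes):
    """Count how many listed builds failed on each test."""
    return collections.Counter(test for _, test in flakes)
```

For the eight builds listed, this gives five failures of ContextLost.WebGLContextLostFromGPUProcessExit and three of GpuCrash.GPUProcessCrashesExactlyOnce.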
,
Apr 14 2016
Thanks for that diagnosis, Dana. I ran this test in a loop locally and wasn't able to provoke a failure. I did notice, however, that the test seemed to hang for a surprisingly long time, and it looked to me like the new minidump symbolization (Issue 561763) might be the reason why.
We should probably change these tests so they don't produce minidumps for the GPU process crashes they provoke. Not sure whether that's possible with Crashpad. Does anyone who works on Crashpad know how it's hooked up and whether it'd be easy to tell it (on demand) to ignore crashes from a specific Chrome sub-process?
,
Apr 14 2016
+crashpad ppls
,
Apr 14 2016
A process can make a Crashpad API call that says “don’t handle me.”
crashpad::CrashpadInfo::set_crashpad_handler_behavior(crashpad::TriState::kDisabled);
https://crashpad.chromium.org/doxygen/structcrashpad_1_1CrashpadInfo.html#affa1b598fdd468a56d5cd1c7241ca85d
I don’t think that we should be setting this in any production code without a very good reason. I don’t think that we should be disabling it in tests, either. Perhaps what you need is a way to disable the symbolizer, or a timeout for it.
Crashpad in Chrome is first and foremost intended to get us some remote debugging capabilities of things that we’ve shipped. It’s nice that people have figured out a way to use it to get local stack traces too, but we shouldn’t impair its primary use case because the local stack traces aren’t perfect.
,
Apr 14 2016
Thanks for the pointer, Mark. The only situation in which I would consider doing this is for those tests that deliberately crash a sub-process (the GPU process) in order to test how the parent process reacts, and where processing of that sub-process's crash dump is undesired.
The stack traces produced by Crashpad are awesome. Now that we finally have unit tests for symbolized stacks enabled on all desktop platforms (Issue 561447, Issue 561763, Issue 563716) we are able to file reasonable bugs for flaky crashes in the product that show up on the bots (Issue 603595).
In this particular case it's undesirable to disable the symbolizer. If it happens that the browser process or render process crashes as a consequence of the GPU process crashing, we want to see those stack traces in the test's logs. It has definitely happened in the past that the renderer has crashed as a consequence of the GPU process crashing, which should not happen.
Does that alleviate your concern, and would you agree with our using this API in this test to filter out the GPU process's crash dumps?
,
Apr 14 2016
If you keep this limited to test code that only runs in a test process and you only do it for a process that’s crashed intentionally, I think I can live with it. If you find that you “fail to crash” in the expected spot and want to turn Crashpad back on to catch an unexpected crash in an unexpected spot, call crashpad::CrashpadInfo::set_crashpad_handler_behavior(crashpad::TriState::kUnset) to restore the default behavior.
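The disable-then-restore pattern described here can be modeled in a few lines of Python. This is a toy model only, not a binding to Crashpad: the real calls are the C++ ones quoted in this thread, and FakeProcess, handler_should_write_dump and expecting_intentional_crash are hypothetical names. kEnabled is included for completeness; the thread itself only mentions kDisabled and kUnset.

```python
import contextlib
import enum


class TriState(enum.Enum):
    """Python stand-in for the three states of crashpad::TriState."""
    UNSET = 'kUnset'
    ENABLED = 'kEnabled'
    DISABLED = 'kDisabled'


class FakeProcess(object):
    """Hypothetical process object; starts with the default behavior."""

    def __init__(self):
        self.crashpad_handler_behavior = TriState.UNSET


def handler_should_write_dump(process, default=True):
    """Unset falls back to the default; otherwise the explicit value wins."""
    behavior = process.crashpad_handler_behavior
    if behavior is TriState.UNSET:
        return default
    return behavior is TriState.ENABLED


@contextlib.contextmanager
def expecting_intentional_crash(process):
    """Disable handling around a deliberate crash, then restore the default."""
    process.crashpad_handler_behavior = TriState.DISABLED
    try:
        yield process
    finally:
        # If the process "failed to crash", unexpected crashes are caught again.
        process.crashpad_handler_behavior = TriState.UNSET
```

Keeping the restore in a finally block mirrors the advice above: handling is off only for the window in which the crash is expected.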
,
Apr 22 2016
Since the logs above expired, attached are two from these recent runs: https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/48277 https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/48322 I don't seem to be able to reproduce this locally, at least not easily. The behavior of the renderer process crashing when navigating to about:gpucrash doesn't happen locally.
,
Apr 23 2016
Issue 606049 has been merged into this issue.
,
Apr 23 2016
,
Apr 23 2016
Some flakiness is seen on other platforms as well. Filed Issue 606112 about another problem seen.
I did not have success disabling Crashpad. Please see the attached disable-crashpad.patch, which was a total hack just to see if it would work. Applying the attached catapult patch to src/third_party/catapult will preserve the temporary directories for the minidumps. Examining them after running:
python content\test\gpu\run_gpu_test.py context_lost --browser=release --show-stdout -v --extra-browser-args=--enable-logging=stderr > context-lost-output.txt 2>&1
and examining context-lost-output.txt indicates that two minidumps are still generated during the test run.
Mark, could you please help me figure out how to disable the generation of those minidumps? I had a lot of difficulty figuring out from where the crash_reporter code could legally be called (e.g. content/ or chrome/) and what dependencies were needed to successfully link against code in src/components/crash/content/app.
Ultimately, gpu_benchmarking_extension.cc will need a new method which will disable Crashpad in the GPU process, and an IPC will need to be sent to the GPU process to disable Crashpad. If the crash_reporter namespace is only accessible from chrome/ rather than content/, I don't know where in the GPU process's code to put the call to crash_reporter::DisableCrashpadForTesting.
Mark, could I please assign this bug to you for your advice on how to hook this up? Please assign it back to me once we've figured out how this will be done. Thanks.
,
Apr 23 2016
,
Apr 23 2016
The ContextLost.WebGLContextLostFromGPUProcessExit failure seems to happen when the following is in the log:
Tab crashed while navigating to chrome://gpucrash
Waiting for page to finish.
Tab crashed while closing chrome://gpucrash
I vaguely recall vmiura@ mentioning that there were some recent changes to the handling of debug URLs. Those should be investigated.
,
Apr 23 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/6947a59811079d0b91ee60eee810fee5857ddc2d
commit 6947a59811079d0b91ee60eee810fee5857ddc2d
Author: kbr <kbr@chromium.org>
Date: Sat Apr 23 03:52:49 2016
Suppress ContextLost.WebGLContextLostFromGPUProcessExit flakes on Win7.
BUG=603329
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel
NOTRY=true
TBR=zmo@chromium.org
Review URL: https://codereview.chromium.org/1916653002
Cr-Commit-Position: refs/heads/master@{#389357}
[modify] https://crrev.com/6947a59811079d0b91ee60eee810fee5857ddc2d/content/test/gpu/gpu_tests/context_lost_expectations.py
,
Apr 28 2016
This flake is still showing up on Win7 Release (NVIDIA) despite the entry kbr added to context_lost_expectations.py. Any ideas why? See:
https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/48621
https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/48635
,
Apr 28 2016
@jdonnelly: sorry about that. Fixing the flaky retry mechanism to handle timeouts in https://codereview.chromium.org/1915033009/.
,
Apr 28 2016
The attached log from https://build.chromium.org/p/chromium.gpu/builders/Win7%20Release%20%28NVIDIA%29/builds/48621 shows that the exception that's leaking and breaking the flaky test suppression is:
Exception raised when cleaning story run:
Traceback (most recent call last):
  _RunStoryAndProcessErrorIfNeeded at c:\users\chrome~1\appdata\local\temp\runlhwond\third_party\catapult\telemetry\telemetry\internal\story_runner.py:104
    state.DidRunStory(results)
  DidRunStory at c:\users\chrome~1\appdata\local\temp\runlhwond\third_party\catapult\telemetry\telemetry\page\shared_page_state.py:163
    self._current_tab.Close()
  Close at c:\users\chrome~1\appdata\local\temp\runlhwond\third_party\catapult\telemetry\telemetry\internal\browser\tab.py:100
    self._tab_list_backend.CloseTab(self.id)
  CloseTab at c:\users\chrome~1\appdata\local\temp\runlhwond\third_party\catapult\telemetry\telemetry\internal\backends\chrome\tab_list_backend.py:66
    util.WaitFor(lambda: tab_id not in self.IterContextIds(), timeout=5)
  WaitFor at c:\users\chrome~1\appdata\local\temp\runlhwond\third_party\catapult\telemetry\telemetry\core\util.py:94
    (timeout, GetConditionString()))
TimeoutException: Timed out while waiting 5s for util.WaitFor(lambda: tab_id not in self.IterContextIds(), timeout=5).
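The failure mode in that traceback, a timeout raised during story cleanup that escapes the flaky-retry logic, can be illustrated with a small standalone sketch. None of this is Telemetry code: TimeoutException, wait_for and run_flaky are simplified, hypothetical stand-ins, and the actual fix is whatever landed in the review linked earlier in this thread.

```python
import time


class TimeoutException(Exception):
    """Local stand-in for the timeout exception seen in the traceback."""


def wait_for(condition, timeout, poll_interval=0.01):
    """Poll condition() until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutException(
                'Timed out while waiting %ss for condition' % timeout)
        time.sleep(poll_interval)


def run_flaky(run_story, cleanup, max_attempts=3):
    """Retry a story, counting cleanup failures as failures of the attempt.

    Returns the 1-based attempt number that succeeded, or re-raises the
    last error once max_attempts is exhausted.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            try:
                run_story()
            finally:
                # Cleanup sits inside the retried region, so a timeout while
                # closing the tab cannot leak past the retry loop.
                cleanup()
        except Exception as error:
            last_error = error
            continue
        return attempt
    raise last_error
```

The design point is simply where the cleanup call sits: if it ran outside the retried region, a timeout like the one above would surface as a hard failure even for a test marked flaky.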
,
May 4 2016
mark@ provided feedback via email:
"""
Sounds like you did it correctly. Something must be broken, let’s debug!
Let’s split the difference on the crashpad_handler side and see what you’re getting in options->crashpad_handler_behavior at the return from ProcessSnapshotMac::GetCrashpadOptions():
https://chromium.googlesource.com/crashpad/crashpad/+/a02ba240062a3d9c2f5b8941ab6429367884d0ea/snapshot/mac/process_snapshot_mac.cc#90
when crashpad_handler’s looking at a process that you expect handling to be disabled for. We want to see TriState::kDisabled there. If it is, we’ll want to look on the handler side:
https://chromium.googlesource.com/crashpad/crashpad/+/6c0d42ce9dee55eaa906865191e28df35b32910d/handler/mac/crash_report_exception_handler.cc#113
and if it’s not, we’ll want to look on the module snapshot side:
https://chromium.googlesource.com/crashpad/crashpad/+/cf452d9a860885cf134051c8e5cb3a2f7f468fc2/snapshot/mac/module_snapshot_mac.cc#60
Code bisection, ’cuz we know how to save time.
Crashpad should also be somewhat wordy when unexpected things happen, but those words might not go anywhere you’re looking. If you launched Chrome from a command line, you’ll see them on stderr, so you probably wouldn’t miss anything. Otherwise, try the system log (Console.app) and look for things from crashpad_handler.
"""
Taking this bug back, but due to other recently-filed issues it may be a while before I can return to this.
,
Jun 13 2016
,
Jan 13 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/25bb65182cdcba9fd26e1876cc7c20ed3766470c
commit 25bb65182cdcba9fd26e1876cc7c20ed3766470c
Author: kbr <kbr@chromium.org>
Date: Fri Jan 13 03:48:15 2017
Try to mark ContextLost_WebGLContextLostFromGPUProcessExit flaky again.
The new test harness has been deployed; let's see whether it's more reliable.
BUG=603329
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel;master.tryserver.chromium.android:android_optional_gpu_tests_rel
NOTRY=true
Review-Url: https://codereview.chromium.org/2629763002
Cr-Commit-Position: refs/heads/master@{#443479}
[modify] https://crrev.com/25bb65182cdcba9fd26e1876cc7c20ed3766470c/content/test/gpu/gpu_tests/context_lost_expectations.py
,
Jan 13 2017
Calling this fixed. Let's reopen it if the flakiness resurfaces.
Comment 1 by kbr@chromium.org, Apr 13 2016
Components: Blink>WebGL