Random GPU process hangs on Mac and Windows causing random webgl_conformance_tests failures |
||||||||||||||||||||||||||||||||||||||||
Issue descriptionRandom GPU process hangs are being observed on the Mac OS trybots in the Swarming pool. They are happening both on the Mac Minis with Intel GPUs and MacBook Pro Retinas with AMD GPUs, so are unlikely to be a bug in the driver, but rather a bug in Chrome's logic. Here are a couple of tryjobs which had random tests fail because the GPU process's watchdog terminated them: https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/221910 https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/221792 The log excerpt is: [22046:16387:0503/184522:ERROR:gpu_watchdog_thread.cc(368)] The GPU process hung. Terminating after 10000 ms. Received signal 11 SEGV_MAPERR 000000000000 [0x00010621a866] [0x7fff89464f1a] ... It's doubtful that the stack trace of this thread would be useful, but as a first priority we should figure out how to get the GPU process to produce a minidump when this happens so that Telemetry can symbolize all the threads' stacks.
,
May 7 2016
Actually, these are happening on multiple platforms, not just Mac. Here's a failure of WebglConformance.conformance_more_conformance_quickCheckAPI_G_I caused by the watchdog firing: https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/217774 Excerpt: [ RUN ] WebglConformance.conformance_more_conformance_quickCheckAPI_G_I ... [2492:2176:0506/153704:INFO:CONSOLE(0)] "WebGL: INVALID_VALUE: getAttribLocation: string not ASCII", source: http://127.0.0.1:59072/third_party/webgl/src/sdk/tests/conformance/more/conformance/quickCheckAPI-G_I.html (0) content::GpuWatchdogThread::DeliberatelyTerminateToRecoverFromHang [0x6137437E+590] base::internal::Invoker<base::IndexSequence<0>,base::internal::BindState<base::internal::RunnableAdapter<void (__thiscall content::WebMediaPlayerMS::*)(void)>,void __cdecl(content::WebMediaPlayerMS *),base::WeakPtr<content::WebMediaPlayerMS> >,base::inter [0x60D179A9+73] base::debug::TaskAnnotator::RunTask [0x5F6190A7+247] base::MessageLoop::RunTask [0x5F5E685B+1211] base::MessageLoop::DoDelayedWork [0x5F5E590A+218] [2836:2780:0506/153704:ERROR:gles2_cmd_decoder.cc(6240)] [.Offscreen-For-WebGL-051989D8]GL ERROR :GL_INVALID_VALUE : glBindAttribLocation: Invalid character base::MessagePumpDefault::Run [0x5F61B41A+138] base::MessageLoop::RunHandler [0x5F5E6397+103] base::RunLoop::Run [0x5F61B629+41] base::MessageLoop::Run [0x5F5E6322+98] [2492:2176:0506/153704:INFO:CONSOLE(0)] "WebGL: INVALID_VALUE: getAttribLocation: string not ASCII", source: http://127.0.0.1:59072/third_party/webgl/src/sdk/tests/conformance/more/conformance/quickCheckAPI-G_I.html (0) base::Thread::Run [0x5F5FBC1B+11] base::Thread::ThreadMain [0x5F5FC297+471] [2836:2780:0506/153704:ERROR:gles2_cmd_decoder.cc(6240)] [.Offscreen-For-WebGL-051989D8]GL ERROR :GL_INVALID_VALUE : glBindAttribLocation: Invalid character base::PlatformThread::Sleep [0x5F603972+290] BaseThreadInitThunk [0x767F337A+18] RtlInitializeExceptionChain [0x774392B2+99] RtlInitializeExceptionChain [0x77439285+54] [0506/153704:ERROR:process_info.cc(608)] range at 0xfffde000, size 0x230 fully unreadable ... [2492:2176:0506/153704:INFO:CONSOLE(0)] "WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost", source: http://127.0.0.1:59072/third_party/webgl/src/sdk/tests/conformance/more/conformance/quickCheckAPI-G_I.html (0) Traceback (most recent call last): File "c:\users\chrome~1\appdata\local\temp\runslirap\third_party\catapult\telemetry\telemetry\internal\story_runner.py", line 84, in _RunStoryAndProcessErrorIfNeeded state.RunStory(results) File "c:\users\chrome~1\appdata\local\temp\runslirap\content\test\gpu\gpu_tests\gpu_test_base.py", line 122, in RunStory RunStoryWithRetries(DesktopGpuSharedPageState, self, results) File "c:\users\chrome~1\appdata\local\temp\runslirap\content\test\gpu\gpu_tests\gpu_test_base.py", line 72, in RunStoryWithRetries super(cls, shared_page_state).RunStory(results) File "c:\users\chrome~1\appdata\local\temp\runslirap\third_party\catapult\telemetry\telemetry\page\shared_page_state.py", line 326, in RunStory self._current_page, self._current_tab, results) File "c:\users\chrome~1\appdata\local\temp\runslirap\content\test\gpu\gpu_tests\webgl_conformance.py", line 87, in ValidateAndMeasurePage raise page_test.Failure(_WebGLTestMessages(tab)) Failure: testValidArgs [ FAILED ] WebglConformance.conformance_more_conformance_quickCheckAPI_G_I (31781 ms) Again, it's crucial for diagnosis that we (1) ensure that a minidump is created for the GPU process in this case and (2) that we get a symbolized stack for the main thread which is hung at this point. To that end, we should instrument the GPU process to introduce a random hang, run the WebGL conformance tests locally (say, on Windows), and make sure that the watchdog causes a minidump to be generated and that it's symbolized by the harness. Victor, could we please find someone to own that task in particular? Once that's done, diagnosis should be easy the next time this happens on the bots.
,
May 12 2016
,
May 24 2016
,
May 26 2016
This continues to affect commit queue jobs randomly. https://codereview.chromium.org/2009073002/ is one example where the GPU process hung at some point during test execution. It's urgent that someone step up and make the harness more robust so that a minidump is generated and symbolized when this happens, which will provide stack traces for all threads, including the hung one.
,
Jun 15 2016
These random GPU process hangs continue to be a cause of flakiness of the WebGL conformance tests on the commit queue. A few example builds: https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/241986 https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/242912 https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/242958 https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/237686 https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/239161 The tests affected by the hang are random: WebglConformance.conformance_attribs_gl_vertexattribpointer WebglConformance.conformance_uniforms_uniform_samplers_test WebglConformance.conformance_more_conformance_quickCheckAPI_G_I WebglConformance.conformance_textures_video_tex_2d_rgba_rgba_unsigned_short_5_5_5_1 WebglConformance.conformance_uniforms_uniform_default_values All flakes: https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyLwsSBUZsYWtlIiR3ZWJnbF9jb25mb3JtYW5jZV90ZXN0cyAod2l0aCBwYXRjaCkM There have been recent reports of a spike in GPU process crashes and it's entirely plausible that they are in fact GPU process hangs and have the same root cause as these test failures.
,
Jun 16 2016
Working on getting the stack traces.
,
Jun 16 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0cb2844464155e533e83b10facd8523884a406be commit 0cb2844464155e533e83b10facd8523884a406be Author: vmiura <vmiura@chromium.org> Date: Thu Jun 16 22:35:44 2016 Output stacktraces on WebGL conformance failure. If a minidump stack trace is available, output it along with the page test failure. R=kbr@chromium.org BUG= 609252 CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel Review-Url: https://codereview.chromium.org/2070233002 Cr-Commit-Position: refs/heads/master@{#400282} [modify] https://crrev.com/0cb2844464155e533e83b10facd8523884a406be/content/test/gpu/gpu_tests/webgl_conformance.py
,
Jun 17 2016
,
Jun 17 2016
This job flaked on Windows but we only got the call stack of the crasher thread: https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/240988/steps/webgl_conformance_tests%20on%20NVIDIA%20GPU%20on%20Windows%20%28with%20patch%29%20on%20Windows-2008ServerR2-SP1 On Mac I was able to get a call stack from all threads in the GPU process.
,
Jun 21 2016
Some callstacks from recent Mac hangs. It's unclear if the thread is hung in one of these stacks handlers, or there is a MessageLoop problem.
mac_chromium_rel_ng : build 246392 : WebglConformance.conformance_extensions_webgl_compressed_texture_s3tc
Thread 0
0 Chromium Framework!(anonymous namespace)::ZombieDealloc(objc_object*, objc_selector*) + 0x22
rax = 0x00007fff91b110b4 rdx = 0x00000000000dc870
rcx = 0x00000000000fc080 rbx = 0x00007fd2d280a400
rsi = 0x00007fff91b110b4 rdi = 0x00007fd2d280a400
rbp = 0x00007fff558d95b0 rsp = 0x00007fff558d92e0
r8 = 0x0000000000000001 r9 = 0x00007fd2d2800000
r10 = 0x00007fd2ca60d580 r11 = 0x00007fff76738050
r12 = 0x0000000000000000 r13 = 0x00007fd2d283c320
r14 = 0x0000000000000001 r15 = 0x00007fd2d280a400
rip = 0x000000010abc14b2
Found by: given as instruction pointer in context
1 CoreFoundation + 0x2d448
rbx = 0x00007fd2d280a400 rbp = 0x00007fff558d9600
rsp = 0x00007fff558d95c0 r12 = 0x0000000000000000
r13 = 0x00007fd2d283c320 r14 = 0x0000000000000001
r15 = 0x00007fff76739ed0 rip = 0x00007fff91f65448
Found by: call frame info
2 libobjc.A.dylib + 0x2189c
rbp = 0x00007fff558d9670 rsp = 0x00007fff558d9610
rip = 0x00007fff8854489c
Found by: previous frame's frame pointer
3 CoreFoundation + 0x18db0
rbp = 0x00007fff558d96c0 rsp = 0x00007fff558d9680
rip = 0x00007fff91f50db0
Found by: previous frame's frame pointer
4 CoreFoundation + 0x12a566
rbp = 0x00007fff558d9710 rsp = 0x00007fff558d96d0
rip = 0x00007fff92062566
Found by: previous frame's frame pointer
5 CoreFoundation + 0x72608
rbp = 0x00007fff558da3f0 rsp = 0x00007fff558d9720
rip = 0x00007fff91faa608
Found by: previous frame's frame pointer
6 CoreFoundation + 0x71bd8
rbp = 0x00007fff558da450 rsp = 0x00007fff558da400
rip = 0x00007fff91fa9bd8
Found by: previous frame's frame pointer
7 Chromium Framework!base::MessagePumpCFRunLoop::DoRun(base::MessagePump::Delegate*) + 0x4f
rbp = 0x00007fff558da490 rsp = 0x00007fff558da460
rip = 0x000000010a9f7e0f
Found by: previous frame's frame pointer
8 Chromium Framework!base::MessagePumpCFRunLoopBase::Run(base::MessagePump::Delegate*) + 0x7f
rbx = 0x00007fd2ca536340 rbp = 0x00007fff558da4c0
rsp = 0x00007fff558da4a0 r12 = 0x0000000000000000
r13 = 0x000000011336be20 r14 = 0x00007fff558dac90
r15 = 0x0000000000000000 rip = 0x000000010a9f781f
Found by: call frame info
9 Chromium Framework!base::MessageLoop::RunHandler() + 0xd7
rbx = 0x00007fff558dac90 rbp = 0x00007fff558da620
rsp = 0x00007fff558da4d0 r12 = 0x00007fd2ca537bc0
r13 = 0x000000011336be20 r14 = 0x00007fff558da630
r15 = 0x000000011336d847 rip = 0x000000010aa34fd7
Found by: call frame info
10 Chromium Framework!base::RunLoop::Run() + 0x33
rbx = 0x00007fff558da7b0 rbp = 0x00007fff558da670
rsp = 0x00007fff558da630 r12 = 0x00007fd2ca537bc0
r13 = 0x000000011336be20 r14 = 0x00007fff558da630
r15 = 0x000000011336d847 rip = 0x000000010aa553d3
Found by: call frame info
11 Chromium Framework!base::MessageLoop::Run() + 0x67
rbx = 0x00007fff558da7b0 rbp = 0x00007fff558da7e0
rsp = 0x00007fff558da680 r12 = 0x00007fd2ca537bc0
r13 = 0x000000011336be20 r14 = 0x000000010a9a1c70
r15 = 0x000000011336d847 rip = 0x000000010aa34187
Found by: call frame info
12 Chromium Framework!content::GpuMain(content::MainFunctionParams const&) + 0xee7
rbx = 0x00007fd2ca813800 rbp = 0x00007fff558db510
rsp = 0x00007fff558da7f0 r12 = 0x00007fd2ca537bc0
r13 = 0x000000011336be20 r14 = 0x000000010a9a1c70
r15 = 0x000000011336d847 rip = 0x000000010a9a1817
========================================================================
mac_chromium_rel_ng : build 246795 : WebglConformance.conformance_uniforms_uniform_samplers_test
Thread 0
0 libsystem_kernel.dylib + 0x11622
rax = 0x0000000000000000 rdx = 0x0000000000000002
rcx = 0x00007fff5f382598 rbx = 0x0701ddafe9383b41
rsi = 0x0701ddafe9383b41 rdi = 0x000000000000360b
rbp = 0x00007fff5f3825e0 rsp = 0x00007fff5f382598
r8 = 0x00007fee1b61a820 r9 = 0x0000000000000000
r10 = 0x0701ddafe9383c00 r11 = 0x0000000000000202
r12 = 0x0701ddafe9383b81 r13 = 0x0701ddafe9383b41
r14 = 0x00007fee1b61aa90 r15 = 0x0000000000000000
rip = 0x00007fff88626622
Found by: given as instruction pointer in context
1 CoreFoundation + 0x79c4f
rbp = 0x00007fff5f382620 rsp = 0x00007fff5f3825f0
rip = 0x00007fff8e518c4f
Found by: previous frame's frame pointer
2 CoreFoundation + 0xb7217
rbp = 0x00007fff5f3836c0 rsp = 0x00007fff5f382630
rip = 0x00007fff8e556217
Found by: previous frame's frame pointer
3 CoreFoundation + 0x12a53d
rbp = 0x00007fff5f383710 rsp = 0x00007fff5f3836d0
rip = 0x00007fff8e5c953d
Found by: previous frame's frame pointer
4 CoreFoundation + 0x72608
rbp = 0x00007fff5f3843f0 rsp = 0x00007fff5f383720
rip = 0x00007fff8e511608
Found by: previous frame's frame pointer
5 CoreFoundation + 0x71bd8
rbp = 0x00007fff5f384450 rsp = 0x00007fff5f384400
rip = 0x00007fff8e510bd8
Found by: previous frame's frame pointer
6 Chromium Framework!base::MessagePumpCFRunLoop::DoRun(base::MessagePump::Delegate*) + 0x4f
rbp = 0x00007fff5f384490 rsp = 0x00007fff5f384460
rip = 0x0000000100f4d73f
Found by: previous frame's frame pointer
7 Chromium Framework!base::MessagePumpCFRunLoopBase::Run(base::MessagePump::Delegate*) + 0x7f
rbx = 0x00007fee1b61a080 rbp = 0x00007fff5f3844c0
rsp = 0x00007fff5f3844a0 r12 = 0x0000000000000000
r13 = 0x00000001098c1480 r14 = 0x00007fff5f384c90
r15 = 0x0000000000000000 rip = 0x0000000100f4d14f
Found by: call frame info
8 Chromium Framework!base::MessageLoop::RunHandler() + 0xd7
rbx = 0x00007fff5f384c90 rbp = 0x00007fff5f384620
rsp = 0x00007fff5f3844d0 r12 = 0x00007fee1b61ba70
r13 = 0x00000001098c1480 r14 = 0x00007fff5f384630
r15 = 0x00000001098c2ea7 rip = 0x0000000100f8a907
Found by: call frame info
9 Chromium Framework!base::RunLoop::Run() + 0x33
rbx = 0x00007fff5f3847b0 rbp = 0x00007fff5f384670
rsp = 0x00007fff5f384630 r12 = 0x00007fee1b61ba70
r13 = 0x00000001098c1480 r14 = 0x00007fff5f384630
r15 = 0x00000001098c2ea7 rip = 0x0000000100faad03
Found by: call frame info
10 Chromium Framework!base::MessageLoop::Run() + 0x67
rbx = 0x00007fff5f3847b0 rbp = 0x00007fff5f3847e0
rsp = 0x00007fff5f384680 r12 = 0x00007fee1b61ba70
r13 = 0x00000001098c1480 r14 = 0x0000000100ef75a0
r15 = 0x00000001098c2ea7 rip = 0x0000000100f89ab7
Found by: call frame info
11 Chromium Framework!content::GpuMain(content::MainFunctionParams const&) + 0xee7
rbx = 0x00007fee1c002400 rbp = 0x00007fff5f385510
rsp = 0x00007fff5f3847f0 r12 = 0x00007fee1b61ba70
r13 = 0x00000001098c1480 r14 = 0x0000000100ef75a0
r15 = 0x00000001098c2ea7 rip = 0x0000000100ef7147
,
Jun 22 2016
Thanks for gathering these stacks Victor. Let's please continue to gather more. A "git log" in src/base/message_loop/ turns up some recent changes; CC'ing a few people. Could there be a bug where the GpuWatchdogTaskObserver isn't getting called for the do-nothing task posted by GpuWatchdogThread::OnCheck? Could a recent code change have introduced such a race condition? We should be looking for changes prior to May 4, but there are quite a few changes to the core message loop code recently, like: 05e1178680e47f4263f9c7087dad1d8e099a5027 aacdfd0e17cb1597dd22745f73b172a0e2797359 7fa6701bc5183bd5a73203d4fe1309f75ccfd5b4
,
Jun 22 2016
I don't think my change https://chromium.googlesource.com/chromium/src/+/aacdfd0e17cb1597dd22745f73b172a0e2797359 affects anything other than the experimental win/chromeos "mus" build, which is tested by views_mus_unittests and similar suites. In normal builds/test suites, no MessageLoop::NestingObservers will be added, so I doubt I changed the behavior. Which doesn't really help you, I know. :-P
,
Jun 22 2016
The first change mentioned in comment #12 applied to Windows only so that wouldn't explain GPU hangs on Mac. That change was actually intended to reduce the number of GPU hang crashes. When did this issue started? There was a change to GPU watchdog in mid march that changed the hang detection logic that resulted in a surge of GPU hang crashes triggered by the watchdog.
,
Jun 22 2016
As reported at the top of this bug, it was first noticed on May 4. It's possible the flakiness was present earlier than that, for example starting in mid-March.
,
Jun 22 2016
For the record: the change in https://codereview.chromium.org/2070233002 seems to cause test failures on Android, possibly because it's attempting to fetch stack traces for expected failures and that doesn't work on Android yet. Conditionalizing this new code.
,
Jun 22 2016
Example failure log excerpt from https://build.chromium.org/p/chromium.gpu.fyi/builders/Android%20Release%20%28Nexus%205X%29/builds/332/steps/webgl_conformance_tests%20on%20Android/logs/stdio : [ RUN ] WebglConformance.conformance_textures_misc_tex_image_and_sub_image_2d_with_array_buffer_view (INFO) 2016-06-22 15:07:18,668 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 shell '( rm -f /data/local/chrome-trace-config.json );echo %$?' (INFO) 2016-06-22 15:07:18,739 cache_temperature.EnsurePageCacheTemperature:55 PageCacheTemperature: any (INFO) 2016-06-22 15:07:32,421 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 shell '( uiautomator dump /data/local/tmp/temp_file-c624139d4bc4c );echo %$?' (INFO) 2016-06-22 15:07:39,163 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 pull /data/local/tmp/temp_file-c624139d4bc4c /tmp/isolated_tmpJ_SBTo/tmpzT4_t2/tmp_ReadFileWithPull (INFO) 2016-06-22 15:07:39,258 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 shell 'rm -f /data/local/tmp/temp_file-c624139d4bc4c' (CRITICAL) 2016-06-22 15:07:39,271 timeout_retry.Run:172 (TimeoutThread-1-for-MainThread) Exception on _GetRootUiNode(<devil.android.app_ui.AppUi object at 0x7f897f3e7dd0>, retries=3, timeout=10), attempt 1 of 4: AdbCommandFailedError("(device: 00c74b7f7941caf7) adb pull /data/local/tmp/temp_file-c624139d4bc4c /tmp/isolated_tmpJ_SBTo/tmpzT4_t2/tmp_ReadFileWithPull: failed with exit status 1 and output:\n- remote object '/data/local/tmp/temp_file-c624139d4bc4c' does not exist\n",) (INFO) 2016-06-22 15:07:39,273 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 shell '( uiautomator dump /data/local/tmp/temp_file-117a89efd55ca );echo %$?' (INFO) 2016-06-22 15:07:46,037 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 pull /data/local/tmp/temp_file-117a89efd55ca /tmp/isolated_tmpJ_SBTo/tmp6P5uZ4/tmp_ReadFileWithPull (INFO) 2016-06-22 15:07:46,129 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 shell 'rm -f /data/local/tmp/temp_file-117a89efd55ca' (CRITICAL) 2016-06-22 15:07:46,156 timeout_retry.Run:172 (TimeoutThread-2-for-MainThread) Exception on _GetRootUiNode(<devil.android.app_ui.AppUi object at 0x7f897f3e7dd0>, retries=3, timeout=10), attempt 2 of 4: AdbCommandFailedError("(device: 00c74b7f7941caf7) adb pull /data/local/tmp/temp_file-117a89efd55ca /tmp/isolated_tmpJ_SBTo/tmp6P5uZ4/tmp_ReadFileWithPull: failed with exit status 1 and output:\n- remote object '/data/local/tmp/temp_file-117a89efd55ca' does not exist\n",) (INFO) 2016-06-22 15:07:46,157 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 shell '( uiautomator dump /data/local/tmp/temp_file-b233f89f865b0 );echo %$?' (INFO) 2016-06-22 15:07:52,906 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 pull /data/local/tmp/temp_file-b233f89f865b0 /tmp/isolated_tmpJ_SBTo/tmp1w7Zuy/tmp_ReadFileWithPull (INFO) 2016-06-22 15:07:52,998 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 shell 'rm -f /data/local/tmp/temp_file-b233f89f865b0' (CRITICAL) 2016-06-22 15:07:53,008 timeout_retry.Run:172 (TimeoutThread-3-for-MainThread) Exception on _GetRootUiNode(<devil.android.app_ui.AppUi object at 0x7f897f3e7dd0>, retries=3, timeout=10), attempt 3 of 4: AdbCommandFailedError("(device: 00c74b7f7941caf7) adb pull /data/local/tmp/temp_file-b233f89f865b0 /tmp/isolated_tmpJ_SBTo/tmp1w7Zuy/tmp_ReadFileWithPull: failed with exit status 1 and output:\n- remote object '/data/local/tmp/temp_file-b233f89f865b0' does not exist\n",) (INFO) 2016-06-22 15:07:53,010 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 shell '( uiautomator dump /data/local/tmp/temp_file-b99cd7b4dced3 );echo %$?' (INFO) 2016-06-22 15:07:59,748 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 pull /data/local/tmp/temp_file-b99cd7b4dced3 /tmp/isolated_tmpJ_SBTo/tmpdDCtbJ/tmp_ReadFileWithPull (INFO) 2016-06-22 15:07:59,841 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 shell 'rm -f /data/local/tmp/temp_file-b99cd7b4dced3' Expected exception while running WebglConformance.conformance_textures_misc_tex_image_and_sub_image_2d_with_array_buffer_view Traceback (most recent call last): RunStoryWithRetries at content/test/gpu/gpu_tests/gpu_test_base.py:72 super(cls, shared_page_state).RunStory(results) RunStory at third_party/catapult/telemetry/telemetry/page/shared_page_state.py:321 self._current_page, self._current_tab, results) ValidateAndMeasurePage at content/test/gpu/gpu_tests/webgl_conformance.py:87 is_valid_dump, trace_output = tab.browser.GetStackTrace() GetStackTrace at third_party/catapult/telemetry/telemetry/internal/browser/browser.py:279 return self._browser_backend.GetStackTrace() GetStackTrace at third_party/catapult/telemetry/telemetry/internal/backends/chrome/android_browser_backend.py:230 return self.platform_backend.GetStackTrace() GetStackTrace at third_party/catapult/telemetry/telemetry/internal/platform/android_platform_backend.py:667 ret = Decorate('UI dump', '\n'.join(self.GetSystemUi().ScreenDump())) ScreenDump at third_party/catapult/devil/devil/android/app_ui.py:212 return self._GetRootUiNode().Dump() timeout_retry_wrapper at third_party/catapult/devil/devil/android/decorators.py:57 retry_if_func=retry_if_func) Run at third_party/catapult/devil/devil/utils/timeout_retry.py:157 error_log_func=error_log_func) JoinAll at third_party/catapult/devil/devil/utils/reraiser_thread.py:186 self._JoinAll(watcher, timeout) _JoinAll at third_party/catapult/devil/devil/utils/reraiser_thread.py:158 thread.ReraiseIfException() run at third_party/catapult/devil/devil/utils/reraiser_thread.py:81 self._ret = self._func(*self._args, **self._kwargs) <lambda> at third_party/catapult/devil/devil/utils/timeout_retry.py:150 child_thread = reraiser_thread.ReraiserThread(lambda: func(*args, **kwargs), impl at third_party/catapult/devil/devil/android/decorators.py:47 return f(*args, **kwargs) _GetRootUiNode at third_party/catapult/devil/devil/android/app_ui.py:203 self._device.ReadFile(dtemp.name, force_pull=True)) timeout_retry_wrapper at third_party/catapult/devil/devil/android/decorators.py:51 return impl() impl at third_party/catapult/devil/devil/android/decorators.py:47 return f(*args, **kwargs) ReadFile at third_party/catapult/devil/devil/android/device_utils.py:1541 return self._ReadFileWithPull(device_path) _ReadFileWithPull at third_party/catapult/devil/devil/android/device_utils.py:1493 self.adb.Pull(device_path, host_temp_path) Pull at third_party/catapult/devil/devil/android/sdk/adb_wrapper.py:393 self._RunDeviceAdbCmd(cmd, timeout, retries) _RunDeviceAdbCmd at third_party/catapult/devil/devil/android/sdk/adb_wrapper.py:262 check_error=check_error) timeout_retry_wrapper at third_party/catapult/devil/devil/android/decorators.py:51 return impl() impl at third_party/catapult/devil/devil/android/decorators.py:47 return f(*args, **kwargs) _RunAdbCmd at third_party/catapult/devil/devil/android/sdk/adb_wrapper.py:238 args, output, status, device_serial) AdbCommandFailedError: (device: 00c74b7f7941caf7) adb pull /data/local/tmp/temp_file-b99cd7b4dced3 /tmp/isolated_tmpJ_SBTo/tmpdDCtbJ/tmp_ReadFileWithPull: failed with exit status 1 and output: - remote object '/data/local/tmp/temp_file-b99cd7b4dced3' does not exist Locals: args : ['pull', '/data/local/tmp/temp_file-b99cd7b4dced3', '/tmp/isolated_tmpJ_SBTo/tmpdDCtbJ/tmp_ReadFileWithPull'] check_error : True cls : <class 'devil.android.sdk.adb_wrapper.AdbWrapper'> cpu_affinity : None device_serial : '00c74b7f7941caf7' output : "remote object '/data/local/tmp/temp_file-b99cd7b4dced3' does not exist\n" retries : 2 status : 1 timeout : 300 [ OK ] WebglConformance.conformance_textures_misc_tex_image_and_sub_image_2d_with_array_buffer_view (71215 ms) [ RUN ] WebglConformance.conformance_textures_misc_tex_image_and_uniform_binding_bugs (INFO) 2016-06-22 15:08:29,884 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 shell '( rm -f /data/local/chrome-trace-config.json );echo %$?' (INFO) 2016-06-22 15:09:00,013 browser.DumpStateUponFailure:326 *************** BROWSER STANDARD OUTPUT *************** (INFO) 2016-06-22 15:09:00,014 browser.DumpStateUponFailure:328 None (INFO) 2016-06-22 15:09:00,014 browser.DumpStateUponFailure:331 *********** END OF BROWSER STANDARD OUTPUT ************ (INFO) 2016-06-22 15:09:00,014 browser.DumpStateUponFailure:333 ********************* BROWSER LOG ********************* (ERROR) 2016-06-22 15:09:00,014 browser.DumpStateUponFailure:337 Failed to get browser log: Traceback (most recent call last): File "/tmp/isolated_run1bYPqQ/third_party/catapult/telemetry/telemetry/internal/browser/browser.py", line 335, in DumpStateUponFailure logging.info(self.GetLogFileContents()) File "/tmp/isolated_run1bYPqQ/third_party/catapult/telemetry/telemetry/internal/browser/browser.py", line 276, in GetLogFileContents return self._browser_backend.GetLogFileContents() File "/tmp/isolated_run1bYPqQ/third_party/catapult/telemetry/telemetry/internal/backends/browser_backend.py", line 43, in GetLogFileContents with file(self.log_file_path) as f: TypeError: coercing to Unicode: need string or buffer, NoneType found (INFO) 2016-06-22 15:09:00,015 browser.DumpStateUponFailure:338 ***************** END OF BROWSER LOG ****************** (WARNING) 2016-06-22 15:09:00,015 shared_page_state.DumpStateUponFailure:144 Taking screenshots upon failures disabled. Traceback (most recent call last): File "/tmp/isolated_run1bYPqQ/third_party/catapult/telemetry/telemetry/internal/story_runner.py", line 78, in _RunStoryAndProcessErrorIfNeeded state.WillRunStory(story) File "/tmp/isolated_run1bYPqQ/third_party/catapult/telemetry/telemetry/page/shared_page_state.py", line 245, in WillRunStory if len(self.browser.tabs) == 0: File "/tmp/isolated_run1bYPqQ/third_party/catapult/telemetry/telemetry/internal/browser/tab_list.py", line 15, in __len__ return self._tab_list_backend.__len__() File "/tmp/isolated_run1bYPqQ/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend_list.py", line 95, in __len__ self._Update() File "/tmp/isolated_run1bYPqQ/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend_list.py", line 99, in _Update backends_map = self._devtools_client.GetUpdatedInspectableContexts() File "/tmp/isolated_run1bYPqQ/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/devtools_client_backend.py", line 296, in GetUpdatedInspectableContexts contexts = self._ListInspectableContexts() File "/tmp/isolated_run1bYPqQ/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/devtools_client_backend.py", line 221, in _ListInspectableContexts return self._devtools_http.RequestJson('') File "/tmp/isolated_run1bYPqQ/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/devtools_http.py", line 107, in RequestJson return json.loads(self.Request(path, timeout)) File "/tmp/isolated_run1bYPqQ/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/devtools_http.py", line 88, in Request response = self._conn.getresponse() File "/usr/lib/python2.7/httplib.py", line 1051, in getresponse response.begin() File "/usr/lib/python2.7/httplib.py", line 415, in begin version, status, reason = self._read_status() File "/usr/lib/python2.7/httplib.py", line 371, in _read_status line = self.fp.readline(_MAXLINE + 1) File "/usr/lib/python2.7/socket.py", line 476, in readline data = self._sock.recv(self._rbufsize) DevToolsClientConnectionError: timed out (INFO) 2016-06-22 15:09:00,032 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 shell '( ps | grep -F org.chromium.chrome; echo "PIPESTATUS: ${PIPESTATUS[@]}" );echo %$?' (INFO) 2016-06-22 15:09:00,267 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 shell '( ls /root );echo %$?' (INFO) 2016-06-22 15:09:00,370 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 shell '( cd /sys/devices/system/cpu && for CPU in cpu0 cpu1 cpu2 cpu3 cpu4 cpu5; do test -e "$CPU/cpufreq/scaling_governor" && echo ondemand > "$CPU/cpufreq/scaling_governor"; echo -n "%~%$?%~%"; done );echo %$?' (INFO) 2016-06-22 15:09:00,437 perf_control._WriteEachCpuFile:114 Successfully set cpufreq/scaling_governor to 'ondemand' on: cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 (INFO) 2016-06-22 15:09:00,438 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 shell '( cd /sys/devices/system/cpu && for CPU in cpu0 cpu1 cpu2 cpu3 cpu4 cpu5; do cat "$CPU/online"; echo -n "%~%$?%~%"; done );echo %$?' (INFO) 2016-06-22 15:09:00,549 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 shell '( p=org.chromium.chrome;if [[ "$(ps)" = *$p* ]]; then am force-stop $p; fi );echo %$?' (INFO) 2016-06-22 15:09:01,519 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 forward --list (INFO) 2016-06-22 15:09:01,523 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb -s 00c74b7f7941caf7 forward --remove tcp:54768 (INFO) 2016-06-22 15:09:01,529 forwarder._UnmapDevicePortLocked:232 Undo forwarding using command: [u'--adb=/tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb', '--serial-id=00c74b7f7941caf7', '--unmap', '60466'] (INFO) 2016-06-22 15:09:01,529 cmd_helper._ValidateAndLogCommand:153 [host]> /tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/forwarder_host --adb=/tmp/isolated_run1bYPqQ/third_party/catapult/devil/bin/deps/linux2/x86_64/bin/adb --serial-id=00c74b7f7941caf7 --unmap 60466 [ FAILED ] WebglConformance.conformance_textures_misc_tex_image_and_uniform_binding_bugs (31657 ms)
,
Jun 23 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/6ce7da1ad477f7e562961a155c69381219ac6835 commit 6ce7da1ad477f7e562961a155c69381219ac6835 Author: kbr <kbr@chromium.org> Date: Thu Jun 23 01:43:25 2016 Conditionalize new code which fetches stack traces upon failure. BUG= 609252 CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel NOTRY=true Review-Url: https://codereview.chromium.org/2085213003 Cr-Commit-Position: refs/heads/master@{#401509} [modify] https://crrev.com/6ce7da1ad477f7e562961a155c69381219ac6835/content/test/gpu/gpu_tests/webgl_conformance.py
,
Jun 23 2016
Another incidence (reported on Issue 596622 ): https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/249027 WebglConformance.conformance_attribs_gl_vertexattribpointer failed because the GPU process hung. The symbolized stack of the main thread is attached and excerpted here: Thread 0 0 libsystem_kernel.dylib + 0x11622 1 CoreFoundation + 0x79c4f 2 CoreFoundation + 0xb7217 3 CoreFoundation + 0x12a53d 4 CoreFoundation + 0x72608 5 CoreFoundation + 0x71bd8 6 Chromium Framework!__ZN4base20MessagePumpCFRunLoop5DoRunEPNS_11MessagePump8DelegateE + 0x4f 7 Chromium Framework!__ZN4base24MessagePumpCFRunLoopBase3RunEPNS_11MessagePump8DelegateE + 0x7f 8 Chromium Framework!__ZN4base11MessageLoop10RunHandlerEv + 0xd7 9 Chromium Framework!__ZN4base7RunLoop3RunEv + 0x33 10 Chromium Framework!__ZN4base11MessageLoop3RunEv + 0x67 11 Chromium Framework!__ZN7content7GpuMainERKNS_18MainFunctionParamsE + 0xee7 12 Chromium Framework!__ZN7content21ContentMainRunnerImpl3RunEv + 0x322 13 Chromium Framework!__ZN7content11ContentMainERKNS_17ContentMainParamsE + 0x36 14 Chromium Framework!_ChromeMain + 0x3a 15 Chromium Helper!_main + 0x212 16 Chromium Helper!start + 0x34 This doesn't look like anything suspicious is happening on the main thread; it looks like it's just waiting for the next event. There's no reason the watchdog should consider the main thread as being hung.
,
Jun 27 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f9319d64b21f372bddd073b46c6bd0a089dc657f commit f9319d64b21f372bddd073b46c6bd0a089dc657f Author: kbr <kbr@chromium.org> Date: Mon Jun 27 21:39:14 2016 Copy more helper DLLs for cdb.exe. These are required to get access to the !uniqstack command, to be able to dump all threads' stacks from a minidump. BUG= 609252 Review-Url: https://codereview.chromium.org/2092083004 Cr-Commit-Position: refs/heads/master@{#402295} [modify] https://crrev.com/f9319d64b21f372bddd073b46c6bd0a089dc657f/build/win/copy_cdb_to_output.py [modify] https://crrev.com/f9319d64b21f372bddd073b46c6bd0a089dc657f/chrome/chrome_tests.gypi [modify] https://crrev.com/f9319d64b21f372bddd073b46c6bd0a089dc657f/tools/perf/chrome_telemetry_build/BUILD.gn
,
Jun 28 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/6df91ef69945c39e4a080762973503c4bc9ba5af commit 6df91ef69945c39e4a080762973503c4bc9ba5af Author: catapult-deps-roller <catapult-deps-roller@chromium.org> Date: Tue Jun 28 21:39:28 2016 Roll src/third_party/catapult/ d44648720..f49c20888 (4 commits). https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/d44648720513..f49c20888bb8 $ git log d44648720..f49c20888 --date=short --no-merges --format='%ad %ae %s' BUG= 609252 , 623963 , 609252 TBR=catapult-sheriff@chromium.org Review-Url: https://codereview.chromium.org/2110613002 Cr-Commit-Position: refs/heads/master@{#402553} [modify] https://crrev.com/6df91ef69945c39e4a080762973503c4bc9ba5af/DEPS
,
Jun 29 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2f793dcc427d8345550acb237aa54c144e68ccce commit 2f793dcc427d8345550acb237aa54c144e68ccce Author: catapult-deps-roller <catapult-deps-roller@chromium.org> Date: Wed Jun 29 18:12:37 2016 Roll src/third_party/catapult/ f49c20888..a45087135 (9 commits). https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/f49c20888bb8..a4508713566b $ git log f49c20888..a45087135 --date=short --no-merges --format='%ad %ae %s' BUG= 619045 , 609252 , 605842 TBR=catapult-sheriff@chromium.org Review-Url: https://codereview.chromium.org/2111593002 Cr-Commit-Position: refs/heads/master@{#402870} [modify] https://crrev.com/2f793dcc427d8345550acb237aa54c144e68ccce/DEPS
,
Jun 29 2016
In Issue 596622 another flake was reported in https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/250753 . The stack trace of the main thread after the GPU watchdog fired looked like this: Thread 0 0 libobjc.A.dylib + 0x6c42 1 Foundation + 0x20832 2 Chromium Framework!__ZN4base24MessagePumpCFRunLoopBase7RunWorkEv + 0xa7 3 Chromium Framework!__ZN4base3mac15CallWithEHFrameEU13block_pointerFvvE + 0xa 4 Chromium Framework!__ZN4base24MessagePumpCFRunLoopBase13RunWorkSourceEPv + 0x44 5 CoreFoundation + 0x80a01 6 CoreFoundation + 0x72b8d 7 CoreFoundation + 0x721bf 8 CoreFoundation + 0x71bd8 9 Chromium Framework!__ZN4base20MessagePumpCFRunLoop5DoRunEPNS_11MessagePump8DelegateE + 0x4f 10 Chromium Framework!__ZN4base24MessagePumpCFRunLoopBase3RunEPNS_11MessagePump8DelegateE + 0x7f 11 Chromium Framework!__ZN4base11MessageLoop10RunHandlerEv + 0xd7 12 Chromium Framework!__ZN4base7RunLoop3RunEv + 0x33 13 Chromium Framework!__ZN4base11MessageLoop3RunEv + 0x67 14 Chromium Framework!__ZN7content7GpuMainERKNS_18MainFunctionParamsE + 0xee7 15 Chromium Framework!__ZN7content21ContentMainRunnerImpl3RunEv + 0x322 16 Chromium Framework!__ZN7content11ContentMainERKNS_17ContentMainParamsE + 0x36 17 Chromium Framework!_ChromeMain + 0x3a 18 Chromium Helper!_main + 0x212 19 Chromium Helper!start + 0x34 Blocking this on Issue 620904 . These hangs are happening on multiple platforms and it seems likely there is a cross-platform race in the MessagePump implementation.
,
Jun 29 2016
By the way, a list of all recent flakes of webgl_conformance_tests: https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyLwsSBUZsYWtlIiR3ZWJnbF9jb25mb3JtYW5jZV90ZXN0cyAod2l0aCBwYXRjaCkM&show_all=1 chromium-try-flakes is currently reporting new flakes on Issue 596622 .
,
Jun 29 2016
As of https://codereview.chromium.org/2105063003/ , if the GPU watchdog fires on Windows, Telemetry will dump the stacks for all threads, not just the watchdog thread. We should carefully watch for new flakes on Windows and see if the main thread seems to be hung in similar fashion to on the Mac.
,
Jul 13 2016
,
Jul 20 2016
Another instance: https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_optional_gpu_tests_rel/builds/2079 WebglConformance_deqp_functional_gles3_negativeshaderapi failed on the AMD MacBook Pro Retinas because of a GPU process hang. For some reason, either no minidump was generated, or it didn't get symbolized. stdout excerpt attached.
,
Jul 20 2016
...actually, I don't think Victor's code which tried to symbolize minidumps upon test failure has been ported to the new WebGL test harness that's now run on the bots.
,
Jul 21 2016
Per https://bugs.chromium.org/p/chromium/issues/detail?id=596622#c113 , all of the current flakes of the WebGL 1.0 conformance suite on the CQ are these random GPU process hangs. Victor, any chance you could port your code which symbolizes minidumps upon test failure to the new test harness (webgl_conformance_integration_test.py, gpu_integration_test.py)?
,
Aug 17 2016
,
Aug 31 2016
Another instance: WebglConformance_deqp_functional_gles3_negativetextureapi failed on the Mac Minis in this try job: https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_optional_gpu_tests_rel/builds/3071 of this CL: https://codereview.chromium.org/2273683003 stdout of this single test attached.
,
Aug 31 2016
As an update, I'm making some progress by provoking these GPU hangs by repeatedly running WebGL 2 conformance test shards. This is a list of shards that provoked a failure: https://chromium-swarm.appspot.com/user/task/30fb1b7db5a33910 https://chromium-swarm.appspot.com/user/task/30f9324f2b743a10 https://chromium-swarm.appspot.com/user/task/30f901ac903feb10 https://chromium-swarm.appspot.com/user/task/30f778804f711110 https://chromium-swarm.appspot.com/user/task/30f76b5757a3c910 https://chromium-swarm.appspot.com/user/task/30f733ba19ca7b10 I'm adding increasing diagnostic information to narrow things down, including stack traces and side channel information about task execution times. An interesting observation is the GPU Main thread stack trace on https://chromium-swarm.appspot.com/user/task/30f76b5757a3c910. The thread was iterating TaskObservers either before or after running a pending task. This indicates that the thread did not (permanently) hang in a task. The final PC is in a LockImpl::Unlock()... which shouldn't really hang unless there's memory corruption. Possible theories: - A) A task took a long time to run which triggered the watchdog timeout, and the main thread happened to have just finished the task. - B) The task didn't take a long time but the watchdog timed out too soon. - C) There's some other race condition / memory corruption issue around the watchdog's TaskObserver implementation. 0 libsystem_pthread.dylib + 0x1c28 1 libsystem_pthread.dylib + 0x19ea 2 Chromium Framework!__ZN4base8internal8LockImpl6UnlockEv + 0x15 3 Chromium Framework!__ZNK4base17ThreadCheckerImpl19CalledOnValidThreadEv + 0xc6 4 Chromium Framework!__ZNK4base19SequenceCheckerImpl21CalledOnValidSequenceEv + 0x76 5 Chromium Framework!__ZNK4base8internal13WeakReference4Flag7IsValidEv + 0x1a 6 Chromium Framework!__ZNK4base8internal13WeakReference8is_validEv + 0x11 7 Chromium Framework!__ZN4base16ObserverListBaseINS_11MessageLoop12TaskObserverEE8IteratorD2Ev + 0x13 8 Chromium Framework!__ZN4base11MessageLoop7RunTaskERKNS_11PendingTaskE + 0x222 9 Chromium Framework!__ZN4base11MessageLoop21DeferOrRunPendingTaskENS_11PendingTaskE + 0x2c 10 Chromium Framework!__ZN4base11MessageLoop6DoWorkEv + 0x173 ===== In another run, it looks like we did hang and time out inside a task. https://chromium-swarm.appspot.com/user/task/30fb1b7db5a33910 [33659:17923:0831/133912:ERROR:gpu_watchdog_thread.cc(444)] TASK_START time 27.175ms [33659:17923:0831/133912:ERROR:gpu_watchdog_thread.cc(444)] TASK_END time 27.195ms [33659:17923:0831/133912:ERROR:gpu_watchdog_thread.cc(444)] TASK_START time 33.736ms <---- main thread is running a task [33659:17923:0831/133912:ERROR:gpu_watchdog_thread.cc(444)] ARM time 6026.98ms <---- watchdog arms 6 seconds later [33659:17923:0831/133912:ERROR:gpu_watchdog_thread.cc(444)] TIMEOUT time 17591.6ms <---- watchdog times out 11 seconds later. However the stack trace doesn't look like we're in a task... 0 libsystem_kernel.dylib + 0x1148a 1 CoreFoundation + 0x72445 2 CoreFoundation + 0x71bd8 3 Chromium Framework!__ZN4base20MessagePumpCFRunLoop5DoRunEPNS_11MessagePump8DelegateE + 0x4f 4 Chromium Framework!__ZN4base24MessagePumpCFRunLoopBase3RunEPNS_11MessagePump8DelegateE + 0x7f 5 Chromium Framework!__ZN4base11MessageLoop10RunHandlerEv + 0xd7 6 Chromium Framework!__ZN4base7RunLoop3RunEv + 0x33 Finally, as I'm adding more diagnostic code it's getting harder to reproduce the hangs, which again points to raciness.
,
Aug 31 2016
For https://chromium-swarm.appspot.com/user/task/30fb1b7db5a33910 could there be a C++ exception or NSException that CFRunLoop is swallowing? Though r338332 seems to say those should all be fatal.
,
Sep 1 2016
^ Good point. Don't we compile C++ with no exceptions, which would make passing exceptions through C++ stacks 'a bad thing'?
,
Sep 1 2016
Answering my own questions yes and yes, we compile without C++ exceptions, and if ObjC swallows an exception through C++ stacks then that would be pretty bad. I'm not sure if that's happening, but I worry about the comment in r338332: "Using try/catch is now unrestricted".
,
Sep 1 2016
+rsesek for any comments about the above conversation as it relates to r338332 / https://chromium.googlesource.com/chromium/src/+/73dc3d5a6345ddd43a1a54f06dd4058be8a9842b .
,
Sep 1 2016
It looks to me like if an exception were being thrown by a task invoked by the RunLoop that base::mac::CallWithEHFrame would always see it, and crash (from MessagePumpCFRunLoopBase::RunWorkSource, src/base/message_loop/message_pump_mac.mm), so I suspect that mechanism introduced in r338332 is working as expected. The hang in base::LockImpl::Unlock is strange. It reminds me of Issue 609249 , but that was Linux-only.
,
Sep 1 2016
Yes I agree the mechanism in r338332 is working as expected. I tested it out and a termination with full stack dump from the thrower. However if any code inside a base::mac::CallWithEHFrame uses a "@try @catch" frame I think we must still take care about calling into C++ from that frame.
,
Sep 1 2016
,
Sep 2 2016
I'm having a hard time getting the GPU hangs when adding any diagnostic code to the Main thread. The most recent flaky crash happened in the Renderer process. See: https://chromium-swarm.appspot.com/user/task/3101207f37443b10 Stack Trace: ******************************************************************************** Operating system: Mac OS X 10.10.5 14F1021 CPU: amd64 family 6 model 70 stepping 1 8 CPUs GPU: UNKNOWN Crash reason: EXC_BAD_ACCESS / KERN_PROTECTION_FAILURE Crash address: 0x7fff50fdccec Process uptime: 27 seconds Thread 0 (crashed) 0 0x7fff50fdccec rax = 0x0000000000000000 rdx = 0x00007fd45871c6e0 rcx = 0x0000000000000000 rbx = 0x00007fd45871c6c0 rsi = 0x009d3800009d3800 rdi = 0x009d3700009d3803 rbp = 0x00007fff50fdcd10 rsp = 0x00007fff50fdcce0 r8 = 0x00007fff50fdccf0 r9 = 0x00007fff790e3300 r10 = 0x00007fd45c819590 r11 = 0x00007fff79afcfb0 r12 = 0x00007fff50fdce20 r13 = 0x0000000000002503 r14 = 0x00007fd45871c6b0 r15 = 0x00007fd45871c5f0 rip = 0x00007fff50fdccec Found by: given as instruction pointer in context 1 CoreFoundation + 0x72b03 2 CoreFoundation + 0x721bf 3 CoreFoundation + 0x71bd8 4 Foundation + 0x90b29 5 Chromium Framework!__ZN4base20MessagePumpNSRunLoop5DoRunEPNS_11MessagePump8DelegateE + 0x7e 6 Chromium Framework!__ZN4base24MessagePumpCFRunLoopBase3RunEPNS_11MessagePump8DelegateE + 0x7f 7 Chromium Framework!__ZN4base11MessageLoop10RunHandlerEv + 0xd7 8 Chromium Framework!__ZN4base7RunLoop3RunEv + 0x33 9 Chromium Framework!__ZN7content12RendererMainERKNS_18MainFunctionParamsE + 0x2ca What could make NSRunLoop jump to a stack address? Corrupted stack frame?
,
Sep 6 2016
ObjC and C++ exceptions are the same thing under the hood, so yes, an ObjC exception handler can swallow C++ exceptions. The goal of r338332 is to: 1) Let any code (either in third_party or in a binary library) use try/catch/@try/@catch as it wants. 2) If any exception makes it to CallWithEHFrame, crash at the original point of throw. From the stack in #40 above, the three frames in CoreFoundation 1-3 are: 0x00072b03 [CoreFoundation + 0x72b03] __CFRunLoopDoSources0 0x000721bf [CoreFoundation + 0x721bf] __CFRunLoopRun 0x00071bd8 [CoreFoundation + 0x71bd8] CFRunLoopRunSpecific But frame 0 corresponds to an %rip that doesn't appear to be in any specific module. Looking at the "loaded module" section at the bottom of the crash report, no address range contains 0x00007fff50fdccec. That actually looks more like a stack pointer address.
,
Sep 6 2016
Got the following failure running with ASAN on WebGL 2.0 Conformance tests. Looks like use after free for a TaskObserverAdapter on the Renderer thread. It occurs usually occurs within a minute, on a random test. This failure reason looks very similar to what we're looking for on the GPU thread (i.e. something's up with Watchdog TaskObserver). Either way, it's a cause of flaky failures on the tests. To reproduce run with ASAN Mac build, with '--enable-unsafe-es3-apis' and run tests in: https://www.khronos.org/registry/webgl/sdk/tests/webgl-conformance-tests.html?version=2.0.0 ==30774==ERROR: AddressSanitizer: use-after-poison on address 0x7ee4598f5a00 at pc 0x000264c18abb bp 0x7fff57de6af0 sp 0x7fff57de6ae8 READ of size 8 at 0x7ee4598f5a00 thread T0 ==30774==WARNING: invalid path to external symbolizer! ==30774==WARNING: Failed to use and restart external symbolizer! #0 0x264c18aba in blink::scheduler::WebThreadBase::TaskObserverAdapter::DidProcessTask(base::PendingTask const&) (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x11318aba) #1 0x264bf09ef in blink::scheduler::TaskQueueManager::ProcessTaskFromWorkQueue(blink::scheduler::internal::WorkQueue*) (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x112f09ef) #2 0x264beb49e in blink::scheduler::TaskQueueManager::DoWork(base::TimeTicks, bool) (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x112eb49e) #3 0x25a7972e5 in base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask const&) (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x6e972e5) #4 0x25a82dbb4 in base::MessageLoop::RunTask(base::PendingTask const&) (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x6f2dbb4) #5 0x25a82e8a6 in base::MessageLoop::DeferOrRunPendingTask(base::PendingTask) (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x6f2e8a6) #6 0x25a82fcee in base::MessageLoop::DoWork() (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x6f2fcee) #7 0x25a83a649 in base::MessagePumpCFRunLoopBase::RunWork() (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x6f3a649) #8 0x25a804d19 in base::mac::CallWithEHFrame(void () block_pointer) (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x6f04d19) #9 0x25a83831f in base::MessagePumpCFRunLoopBase::RunWorkSource(void*) (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x6f3831f) #10 0x7fff85873880 in __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ (/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation+0xaa880) #11 0x7fff85852fbb in __CFRunLoopDoSources0 (/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation+0x89fbb) #12 0x7fff858524de in __CFRunLoopRun (/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation+0x894de) #13 0x7fff85851ed7 in CFRunLoopRunSpecific (/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation+0x88ed7) #14 0x7fff819c8ed8 in -[NSRunLoop(NSRunLoop) runMode:beforeDate:] (/System/Library/Frameworks/Foundation.framework/Versions/C/Foundation+0x24ed8) #15 0x25a83c5ae in base::MessagePumpNSRunLoop::DoRun(base::MessagePump::Delegate*) (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x6f3c5ae) #16 0x25a839c8b in base::MessagePumpCFRunLoopBase::Run(base::MessagePump::Delegate*) (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x6f39c8b) #17 0x25a8b8bbb in base::RunLoop::Run() (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x6fb8bbb) #18 0x26a9e8bc1 in content::RendererMain(content::MainFunctionParams const&) (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x170e8bc1) #19 0x259582d4c in content::ContentMainRunnerImpl::Run() (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x5c82d4c) #20 0x25957f333 in content::ContentMain(content::ContentMainParams const&) (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x5c7f333) #21 0x2539054af in ChromeMain (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x54af) #22 0x107e16bd7 in main (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Helper.app/Contents/MacOS/Chromium Helper+0x100001bd7) #23 0x107e167c3 in start (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Helper.app/Contents/MacOS/Chromium Helper+0x1000017c3) AddressSanitizer can not describe address in more detail (wild memory access suspected). SUMMARY: AddressSanitizer: use-after-poison (/asan-mac-release-416628/Chromium.app/Contents/Versions/55.0.2852.0/Chromium Framework.framework/Chromium Framework+0x11318aba) in blink::scheduler::WebThreadBase::TaskObserverAdapter::DidProcessTask(base::PendingTask const&) Shadow bytes around the buggy address: 0x1fdc8b31eaf0: 00 00 00 00 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 0x1fdc8b31eb00: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 00 00 00 00 00 0x1fdc8b31eb10: 00 00 00 f7 00 00 00 00 00 00 00 00 00 00 00 00 0x1fdc8b31eb20: 00 00 00 f7 00 00 00 00 00 00 00 00 00 f7 00 00 0x1fdc8b31eb30: 00 00 00 00 00 00 00 00 00 00 00 f7 f7 f7 f7 f7 =>0x1fdc8b31eb40:[f7]f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 00 00 0x1fdc8b31eb50: 00 00 00 00 00 00 f7 00 00 00 00 00 00 00 00 f7 0x1fdc8b31eb60: 00 00 00 00 00 00 00 00 f7 f7 f7 f7 f7 f7 f7 f7 0x1fdc8b31eb70: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 0x1fdc8b31eb80: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 0x1fdc8b31eb90: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb ==30774==ABORTING Received signal 6 [0x00025a7957ea] [0x7fff87b2c52a] [0x000108b5c79b] [0x7fff949686df] [0x000107e8bac6] [0x000107e7217b] [0x000264c18abb] [0x000264bf09f0] [0x000264beb49f] [0x00025a7972e6] [0x00025a82dbb5] [0x00025a82e8a7] [0x00025a82fcef] [0x00025a83a64a] [0x00025a804d1a] [0x00025a838320] [0x7fff85873881] [0x7fff85852fbc] [0x7fff858524df] [0x7fff85851ed8] [0x7fff819c8ed9] [0x00025a83c5af] [0x00025a839c8c] [0x00025a8b8bbc] [0x00026a9e8bc2] [0x000259582d4d] [0x00025957f334] [0x0002539054b0] [0x000107e16bd8] [0x000107e167c4] [end of stack trace]
,
Sep 7 2016
,
Sep 7 2016
@rsesek: Thanks for the explanations. Do we always have accurate stack unwind data up to CallWithEHFrame for C++ code, which is built without exceptions enabled? Or is it simply not a concern, since the uncaught exception is fatal?
,
Sep 7 2016
On Mac we always emit unwind tables, even with C++ exceptions disabled, so it's not an issue.
,
Sep 7 2016
,
Sep 13 2016
,
Sep 13 2016
Attached is stdout from https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/292381 in which WebglConformance_conformance_uniforms_uniform_samplers_test failed on Windows due to a GPU process hang. Since this is on Windows it definitely isn't the exception issue described on macOS. It looks to me like Issue 644575 only affects the renderer process, correct? Is there a hypothesis that TaskObservers may be being mismanaged in the GPU process as well? This is still a major source of flakiness on the tryservers. I'm getting daily pings from chromium-try-flakes on Issue 596622 . How will we make forward progress on this?
,
Sep 14 2016
kbr@: Adding additional code for diagnosis has prevented the issue from repeating over thousands of runs. It strongly points to races & data corruption. Issue 644575 is likely unrelated, but I'd like to see what that turns out to be. Until that is fixed it's also blocking us from running extended ASAN tests to try to catch the GPU side race. I plan to return to this.
,
Sep 19 2016
,
Sep 19 2016
,
Sep 22 2016
I think so but am not 100% sure. We need to fix the new gpu_integration_test.py harness to symbolize all unsymbolized stack traces upon test failure.
,
Sep 22 2016
There are symbols later on in the log, though I'm not sure if from the same stack.
,
Sep 22 2016
#52: It may be related, but I'm not seeing a GPU watchdog timeout in this log.
,
Sep 27 2016
,
Oct 7 2016
Issue 612219 is tracking a couple of changes to the GPU Watchdog that may affect the flakiness here. Let's keep an eye on any changes. 1) Changes the Ack-ing logic from the watched thread. 2) Adds more watchdog ping signals to ContextGroup destruction code known to sometimes time out in the wild.
,
Oct 10 2016
Have these fixes landed yet? Per https://bugs.chromium.org/p/chromium/issues/detail?id=619264#c78 , all of the most recent WebGL 2.0 conformance test flakes are due to the GPU process being reported as hanging.
,
Oct 25 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/51e832338dfaeedf418778541a10003fd1feef8b commit 51e832338dfaeedf418778541a10003fd1feef8b Author: kbr <kbr@chromium.org> Date: Tue Oct 25 02:34:54 2016 Try disabling GPU watchdog in WebGL conformance tests. It's believed that the flaky GPU process hangs seen on the waterfall are spurious, not real. Try disabling the GPU watchdog. If the hangs are real, this will result in test timeouts rather than GPU process crashes, context loss, and test failures. BUG= 596622 , 609252 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel;master.tryserver.chromium.android:android_optional_gpu_tests_rel Review-Url: https://codereview.chromium.org/2446033002 Cr-Commit-Position: refs/heads/master@{#427247} [modify] https://crrev.com/51e832338dfaeedf418778541a10003fd1feef8b/content/test/gpu/gpu_tests/webgl_conformance_integration_test.py
,
Nov 9 2016
vmiura: ping. Is this still an issue? It looks like conformance tests are still flaky, but hard to tell what's this and what's 596622 and if this should be closed or not.
,
Nov 9 2016
The majority of the remaining WebGL conformance flakes were worked around by disabling the GPU watchdog timer. It's still not clear to me whether there is a real bug in the watchdog that's being observed in the wild. If we're going to close this then we should somehow resolve the two outstanding issues it's blocked on (Issue 611390 and Issue 646538). Let me downgrade this to P2 on Victor's behalf since it's not affecting the WebGL conformance tests on the waterfall any more.
,
Nov 22 2016
Symbolization of unsymbolized minidumps has been added to the gpu_integration_test.py harness, and it's working (modulo a recent problem, Issue 667475 ), so removing the Catapult folks from the CC: list.
,
Nov 22 2016
,
Nov 29 2016
,
Dec 19 2016
,
Dec 19 2016
,
Dec 30 2016
,
Jan 9 2017
,
Jan 9 2017
,
Jan 12 2017
,
Jan 12 2017
,
Mar 23 2017
,
May 3 2017
,
May 17 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/8f97866e8f0c1bae205fe0f3a4107d41ba8bae28 commit 8f97866e8f0c1bae205fe0f3a4107d41ba8bae28 Author: kbr <kbr@chromium.org> Date: Wed May 17 08:03:24 2017 Re-enable GPU watchdog for WebGL conformance tests. See whether this is working any more reliably now. BUG= 609252 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel Review-Url: https://codereview.chromium.org/2887983002 Cr-Commit-Position: refs/heads/master@{#472378} [modify] https://crrev.com/8f97866e8f0c1bae205fe0f3a4107d41ba8bae28/content/test/gpu/gpu_tests/webgl_conformance_integration_test.py
,
May 18 2017
This issue is still occurring after re-enabling the GPU watchdog for these tests. Here are two failures: https://luci-milo.appspot.com/buildbot/chromium.gpu/Mac%20Release%20%28Intel%29/64401 WebglConformance_conformance_textures_misc_texture_upload_size failed in this run. https://luci-milo.appspot.com/buildbot/chromium.gpu/Mac%20Retina%20Release%20%28AMD%29/6737 WebglConformance_conformance_more_conformance_quickCheckAPI_B1 failed in this run. In both cases the tests failed because the GPU watchdog was triggered. Here's the symbolized minidump from one of the runs. Symbolized minidump: Operating system: Mac OS X 10.12.4 16E195 CPU: amd64 family 6 model 69 stepping 1 4 CPUs GPU: UNKNOWN Crash reason: EXC_BAD_ACCESS / KERN_INVALID_ADDRESS Crash address: 0x0 Process uptime: 29 seconds Thread 3 (crashed) 0 Chromium Framework!__ZN3gpu17GpuWatchdogThread38DeliberatelyTerminateToRecoverFromHangEv + 0x11f rax = 0x200c6932b5420050 rdx = 0x00000000000a9940 rcx = 0x0000000000000000 rbx = 0x00007fe29fc17c88 rsi = 0x00007fe29ff00000 rdi = 0x0000000000000000 rbp = 0x00007000085c9250 rsp = 0x00007000085c90f0 r8 = 0x0000000000000007 r9 = 0x00007fe29ff3bd20 r10 = 0x0000000000000001 r11 = 0x0000000000000008 r12 = 0x0000000000000000 r13 = 0x000000011da3de00 r14 = 0x00007000085c90f8 r15 = 0x00007000085c9100 rip = 0x000000011788a7bf Found by: given as instruction pointer in context 1 Chromium Framework!__ZN3gpu17GpuWatchdogThread14OnCheckTimeoutEv + 0x99 rbp = 0x00007000085c9410 rsp = 0x00007000085c9260 rip = 0x000000011788a449 Found by: previous frame's frame pointer 2 Chromium Framework!__ZN4base8internal7InvokerINS0_9BindStateIMN3gpu17GpuWatchdogThreadEFvvEJNS_7WeakPtrIS4_EEEEEFvvEE3RunEPNS0_13BindStateBaseE + 0xb4 rbp = 0x00007000085c9570 rsp = 0x00007000085c9420 rip = 0x000000011788ac84 Found by: previous frame's frame pointer 3 Chromium Framework!__ZN4base5debug13TaskAnnotator7RunTaskEPKcPNS_11PendingTaskE + 0x136 rbp = 0x00007000085c9710 rsp = 0x00007000085c9580 rip = 0x000000011648d4a6 Found by: previous frame's frame pointer 4 Chromium Framework!__ZN4base11MessageLoop7RunTaskEPNS_11PendingTaskE + 0x1b9 rbp = 0x00007000085c98e0 rsp = 0x00007000085c9720 rip = 0x00000001164cb8c9 Found by: previous frame's frame pointer 5 Chromium Framework!__ZN4base11MessageLoop21DeferOrRunPendingTaskENS_11PendingTaskE + 0x1f rbp = 0x00007000085c9900 rsp = 0x00007000085c98f0 rip = 0x00000001164cbc7f Found by: previous frame's frame pointer 6 Chromium Framework!__ZN4base11MessageLoop13DoDelayedWorkEPNS_9TimeTicksE + 0x137 rbp = 0x00007000085c9a70 rsp = 0x00007000085c9910 rip = 0x00000001164cc117 Found by: previous frame's frame pointer 7 Chromium Framework!__ZN4base24MessagePumpCFRunLoopBase7RunWorkEv + 0x4c rbp = 0x00007000085c9aa0 rsp = 0x00007000085c9a80 rip = 0x00000001164d07cc Found by: previous frame's frame pointer ... I'm going to turn the watchdog back off on the bots. We really should get to the bottom of this. It may be happening in the field and may be the cause of increased incidences of the "Rats, WebGL hit a snag" infobar. Any ideas about what might be going wrong with the TaskObserver list or how to continue to track this down would be appreciated.
,
May 19 2017
Note: another failure showing this is happening on Windows too: https://luci-milo.appspot.com/buildbot/tryserver.chromium.win/win_optional_gpu_tests_rel/10429 WebglConformance_deqp_functional_gles3_shaderoperator_binary_operator_06 failed because the GPU watchdog triggered. Symbolized minidump: Last event: f08.d3c: Access violation - code c0000005 (first/second chance not available) debugger time: Thu May 18 14:43:49.208 2017 (UTC - 7:00) ChildEBP RetAddr Args to Child 0429efd0 688947c8 6889461c 00000000 0429f838 chrome_child!gpu::GpuWatchdogThread::DeliberatelyTerminateToRecoverFromHang+0x27f 0429f0e4 69bf919f 0429f838 6a6f9dcc 0429f104 chrome_child!gpu::GpuWatchdogThread::OnCheckTimeout+0x1ac 0429f0f4 69bfa69a 06c1eb60 06c1eb68 0429f1ec chrome_child!??$MakeItSo@ABQ8AsyncReadPixelsCompletedQuery@gles2@gpu@@AEXXZABV?$WeakPtr@VAsyncReadPixelsCompletedQuery@gles2@gpu@@@base@@$$V@?$InvokeHelper@$00X@internal@base@@SAXABQ8AsyncReadPixelsCompletedQuery@gles2@gpu@@AEXXZABV?$WeakPtr@VAsyncReadPixelsCompletedQuery@gles2@gpu@@@2@@Z+0x26 0429f104 681815d9 06c1eb48 00000000 688945df chrome_child!base::internal::Invoker<base::internal::BindState<void (__thiscall gpu::gles2::AsyncReadPixelsCompletedQuery::*)(void),base::WeakPtr<gpu::gles2::AsyncReadPixelsCompletedQuery> >,void __cdecl(void)>::Run+0x13 0429f1ec 68137471 6a4c51dc 0429f838 00facaf8 chrome_child!base::debug::TaskAnnotator::RunTask+0x199 0429f81c 68136018 0429f838 00facaf8 a4627e48 chrome_child!base::MessageLoop::RunTask+0x4d1 0429f830 681363d0 00fa0e20 0277ad77 a4627e48 chrome_child!base::MessageLoop::DeferOrRunPendingTask+0x48 0429f8d0 681834c1 00f9feb8 00000000 00facafc chrome_child!base::MessageLoop::DoDelayedWork+0xc0 0429f99c 68136f8b 00facaf8 00000000 0429fd30 chrome_child!base::MessagePumpDefault::Run+0x81 0429fa68 68150f03 00000000 00faca1c 00000000 chrome_child!base::MessageLoop::Run+0x6b 0429fbac 68133fed 00000000 0429fbe8 681220f4 chrome_child!base::RunLoop::Run+0x93 0429fc68 681348ce 0429fd30 00f92d50 76291420 chrome_child!base::Thread::Run+0xad 0429fd74 6811acc2 00000000 00000000 00f92d50 chrome_child!base::Thread::ThreadMain+0x26e *** WARNING: Unable to verify checksum for kernel32.dll ...
,
May 19 2017
Victor pointed out something suspicious about the Mac watchdog timeouts so here's the entire symbolized minidump from the Windows failure.
Symbolized minidump:
Last event: f08.d3c: Access violation - code c0000005 (first/second chance not available)
debugger time: Thu May 18 14:43:49.208 2017 (UTC - 7:00)
ChildEBP RetAddr Args to Child
0429efd0 688947c8 6889461c 00000000 0429f838 chrome_child!gpu::GpuWatchdogThread::DeliberatelyTerminateToRecoverFromHang+0x27f
0429f0e4 69bf919f 0429f838 6a6f9dcc 0429f104 chrome_child!gpu::GpuWatchdogThread::OnCheckTimeout+0x1ac
0429f0f4 69bfa69a 06c1eb60 06c1eb68 0429f1ec chrome_child!??$MakeItSo@ABQ8AsyncReadPixelsCompletedQuery@gles2@gpu@@AEXXZABV?$WeakPtr@VAsyncReadPixelsCompletedQuery@gles2@gpu@@@base@@$$V@?$InvokeHelper@$00X@internal@base@@SAXABQ8AsyncReadPixelsCompletedQuery@gles2@gpu@@AEXXZABV?$WeakPtr@VAsyncReadPixelsCompletedQuery@gles2@gpu@@@2@@Z+0x26
0429f104 681815d9 06c1eb48 00000000 688945df chrome_child!base::internal::Invoker<base::internal::BindState<void (__thiscall gpu::gles2::AsyncReadPixelsCompletedQuery::*)(void),base::WeakPtr<gpu::gles2::AsyncReadPixelsCompletedQuery> >,void __cdecl(void)>::Run+0x13
0429f1ec 68137471 6a4c51dc 0429f838 00facaf8 chrome_child!base::debug::TaskAnnotator::RunTask+0x199
0429f81c 68136018 0429f838 00facaf8 a4627e48 chrome_child!base::MessageLoop::RunTask+0x4d1
0429f830 681363d0 00fa0e20 0277ad77 a4627e48 chrome_child!base::MessageLoop::DeferOrRunPendingTask+0x48
0429f8d0 681834c1 00f9feb8 00000000 00facafc chrome_child!base::MessageLoop::DoDelayedWork+0xc0
0429f99c 68136f8b 00facaf8 00000000 0429fd30 chrome_child!base::MessagePumpDefault::Run+0x81
0429fa68 68150f03 00000000 00faca1c 00000000 chrome_child!base::MessageLoop::Run+0x6b
0429fbac 68133fed 00000000 0429fbe8 681220f4 chrome_child!base::RunLoop::Run+0x93
0429fc68 681348ce 0429fd30 00f92d50 76291420 chrome_child!base::Thread::Run+0xad
0429fd74 6811acc2 00000000 00000000 00f92d50 chrome_child!base::Thread::ThreadMain+0x26e
*** WARNING: Unable to verify checksum for kernel32.dll
*** ERROR: Symbol file could not be found. Defaulted to export symbols for kernel32.dll -
0429fd90 7629336a 0000018c 0429fddc 772c92b2 chrome_child!base::PlatformThread::Sleep+0x122
WARNING: Stack unwind information not available. Following frames may be wrong.
0429fd9c 772c92b2 00f92d50 7163c889 00000000 kernel32!BaseThreadInitThunk+0x12
0429fddc 772c9285 6811ac30 00f92d50 ffffffff ntdll!RtlInitializeExceptionChain+0x63
0429fdf4 00000000 6811ac30 00f92d50 00000000 ntdll!RtlInitializeExceptionChain+0x36
0 Id: f08.ccc Suspend: 0 Teb: fffdd000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
001ef020 76291194 00000174 000074f8 00000000 ntdll!ZwWaitForSingleObject+0x15
001ef038 76291148 00000174 000074f8 00000000 kernel32!WaitForSingleObjectEx+0x43
001ef04c 68133aa7 00000174 000074f8 00f9fe24 kernel32!WaitForSingleObject+0x12
001ef130 68133748 00000174 001ef15c 00f9fe28 chrome_child!base::WaitableEvent::WaitMany+0x2b7
001ef164 68183507 00f9fe28 00000000 00fabf6c chrome_child!base::WaitableEvent::TimedWaitUntil+0x68
001ef230 68136f8b 00fabf68 00000000 001ef454 chrome_child!base::MessagePumpDefault::Run+0xc7
001ef2fc 68150f03 00000000 6aeef204 00000000 chrome_child!base::MessageLoop::Run+0x6b
001ef440 691207b6 00f9a4b0 00000004 00f92748 chrome_child!base::RunLoop::Run+0x93
001ef79c 680f2324 001ef934 02e88180 00000003 chrome_child!content::GpuMain+0x2ba
001ef860 680f2243 001ef97c 001ef934 001efcb0 chrome_child!content::RunNamedProcessTypeMain+0xb0
001ef998 680f59cd 00dca3e8 6e4f9576 00000000 chrome_child!content::ContentMainRunnerImpl::Run+0x19d
001efbbc 680f190c 001efbe8 6e4f9576 6a4bef40 chrome_child!service_manager::Main+0x20c
001efbec 6770bf5d 001efcd0 0000000b 00dc5e88 chrome_child!content::ContentMain+0x27
*** WARNING: Unable to verify checksum for chrome.exe
001efce4 00ff4f33 00ff0000 001efd08 9188af3b chrome_child!ChromeMain+0xdd
001efd6c 00ff2cfd 00ff0000 9188af3b 00000009 chrome!MainDllLoader::Launch+0x1bc
001efe90 01045cd8 00ff0000 00000000 01252964 chrome!wWinMain+0x17f
(Inline) -------- -------- -------- -------- chrome!invoke_main+0x1a
001efedc 7629336a fffde000 001eff28 772c92b2 chrome!__scrt_common_main_seh+0xf6
001efee8 772c92b2 fffde000 7554ca7d 00000000 kernel32!BaseThreadInitThunk+0x12
001eff28 772c9285 01045d50 fffde000 ffffffff ntdll!RtlInitializeExceptionChain+0x63
001eff40 00000000 01045d50 fffde000 00000000 ntdll!RtlInitializeExceptionChain+0x36
# 1 Id: f08.d3c Suspend: 0 Teb: fffda000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
0429e5ac 768744a5 0000ea60 00000000 0429e684 ntdll!NtDelayExecution+0x15
*** WARNING: Unable to verify checksum for chrome_elf.dll
0429e5bc 6e4fb076 0000ea60 0429e848 0429e848 KERNELBASE!Sleep+0xf
0429e684 6816da57 0429e848 7636030c 688940eb chrome_elf!crashpad::CrashpadClient::StartHandler+0xc0c
0429e790 762d0313 0429e848 c490c92f 00000000 chrome_child!base::debug::StackTrace::Print+0x47
0429e818 77301ca7 0429e848 77301b84 00000000 kernel32!GetProfileStringW+0x12d77
0429fddc 772c9285 6811ac30 00f92d50 ffffffff ntdll!RtlKnownExceptionFilter+0xb7
0429fdf4 00000000 6811ac30 00f92d50 00000000 ntdll!RtlInitializeExceptionChain+0x36
2 Id: f08.bdc Suspend: 0 Teb: fffaf000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
03e1f33c 76291194 0000026c ffffffff 00000000 ntdll!ZwWaitForSingleObject+0x15
03e1f354 76291148 0000026c ffffffff 00000000 kernel32!WaitForSingleObjectEx+0x43
*** WARNING: Unable to verify checksum for nvwgf2um.dll
*** ERROR: Symbol file could not be found. Defaulted to export symbols for nvwgf2um.dll -
03e1f368 65704c48 0000026c ffffffff 0470a32c kernel32!WaitForSingleObject+0x12
03e1f8ac 65703f03 03e1f8d4 65703ef5 03e1f8c8 nvwgf2um!OpenAdapter12+0x10b2d8
03e1f8d4 65c24e25 05d66050 03e1f918 65f45f02 nvwgf2um!OpenAdapter12+0x10a593
03e1f8e0 65f45f02 0470a414 c5d1057e 00000000 nvwgf2um!NVAPI_Thunk+0x86285
03e1f918 65f4602a 00000000 03e1f930 7629336a nvwgf2um!NVAPI_Thunk+0x3a7362
03e1f924 7629336a 05d53c98 03e1f970 772c92b2 nvwgf2um!NVAPI_Thunk+0x3a748a
03e1f930 772c92b2 05d53c98 76abcc25 00000000 kernel32!BaseThreadInitThunk+0x12
03e1f970 772c9285 65f45fae 05d53c98 ffffffff ntdll!RtlInitializeExceptionChain+0x63
03e1f988 00000000 65f45fae 05d53c98 00000000 ntdll!RtlInitializeExceptionChain+0x36
3 Id: f08.ed4 Suspend: 0 Teb: fffac000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
0636f4bc 76291194 00000298 ffffffff 00000000 ntdll!ZwWaitForSingleObject+0x15
0636f4d4 76291148 00000298 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0x43
0636f4e8 6813378a 00000298 ffffffff 00fa0198 kernel32!WaitForSingleObject+0x12
0636f5c0 681834ff 00000000 00fb5fcc 00fb5fc8 chrome_child!base::WaitableEvent::Wait+0x2a
0636f688 68136f8b 00fb5fc8 00000000 0636fa1c chrome_child!base::MessagePumpDefault::Run+0xbf
0636f754 68150f03 00000000 00fb57e0 00000000 chrome_child!base::MessageLoop::Run+0x6b
0636f898 68133fed 00000000 0636f8d4 6812215f chrome_child!base::RunLoop::Run+0x93
0636f954 681348ce 0636fa1c 00fb13b8 76291420 chrome_child!base::Thread::Run+0xad
0636fa60 6811acc2 00000000 00000000 00fb13b8 chrome_child!base::Thread::ThreadMain+0x26e
0636fa7c 7629336a 00000294 0636fac8 772c92b2 chrome_child!base::PlatformThread::Sleep+0x122
0636fa88 772c92b2 00fb13b8 737ccf9d 00000000 kernel32!BaseThreadInitThunk+0x12
0636fac8 772c9285 6811ac30 00fb13b8 ffffffff ntdll!RtlInitializeExceptionChain+0x63
0636fae0 00000000 6811ac30 00fb13b8 00000000 ntdll!RtlInitializeExceptionChain+0x36
4 Id: f08.9f0 Suspend: 0 Teb: fffa9000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
064cf774 68182915 000002a4 064cf86c 064cf798 ntdll!NtRemoveIoCompletion+0x15
064cf79c 68183243 ffffffff 00000000 06b32f98 chrome_child!base::MessagePumpForIO::GetIOItem+0x35
064cf87c 681826b0 00000000 00fae8d8 00000000 chrome_child!base::MessagePumpForIO::WaitForWork+0xa3
064cf95c 68182f7a 00fb643c 00fb6438 00fb6438 chrome_child!base::MessagePumpForIO::DoRunLoop+0xb0
064cf984 68136f8b 00fb6438 00000000 064cfd18 chrome_child!base::MessagePumpWin::Run+0x4a
064cfa50 68150f03 00000000 001ef70c 00000000 chrome_child!base::MessageLoop::Run+0x6b
064cfb94 68133fed 00000000 064cfbd0 68122148 chrome_child!base::RunLoop::Run+0x93
064cfc50 681348ce 064cfd18 00fb5ab0 76291420 chrome_child!base::Thread::Run+0xad
064cfd5c 6811acc2 00000000 00000000 00fb5ab0 chrome_child!base::Thread::ThreadMain+0x26e
064cfd78 7629336a 0000029c 064cfdc4 772c92b2 chrome_child!base::PlatformThread::Sleep+0x122
064cfd84 772c92b2 00fb5ab0 7306c891 00000000 kernel32!BaseThreadInitThunk+0x12
064cfdc4 772c9285 6811ac30 00fb5ab0 ffffffff ntdll!RtlInitializeExceptionChain+0x63
064cfddc 00000000 6811ac30 00fb5ab0 00000000 ntdll!RtlInitializeExceptionChain+0x36
5 Id: f08.d50 Suspend: 0 Teb: fffa6000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
0674fa58 7629336a 00000000 0674faa4 772c92b2 ntdll!ZwWaitForMultipleObjects+0x15
0674fa64 772c92b2 05d9e9d8 733ecff1 00000000 kernel32!BaseThreadInitThunk+0x12
0674faa4 772c9285 772d13b5 05d9e9d8 ffffffff ntdll!RtlInitializeExceptionChain+0x63
0674fabc 00000000 772d13b5 05d9e9d8 00000000 ntdll!RtlInitializeExceptionChain+0x36
6 Id: f08.76c Suspend: 0 Teb: fffa3000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
0690f8c8 7629336a 05d9f358 0690f914 772c92b2 ntdll!NtWaitForWorkViaWorkerFactory+0x12
0690f8d4 772c92b2 05d9f358 73dacc41 00000000 kernel32!BaseThreadInitThunk+0x12
0690f914 772c9285 772e27e5 05d9f358 ffffffff ntdll!RtlInitializeExceptionChain+0x63
0690f92c 00000000 772e27e5 05d9f358 00000000 ntdll!RtlInitializeExceptionChain+0x36
7 Id: f08.97c Suspend: 0 Teb: fffd7000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
0722fb30 7629336a 05eaac78 0722fb7c 772c92b2 ntdll!NtWaitForWorkViaWorkerFactory+0x12
0722fb3c 772c92b2 05eaac78 7268ce29 00000000 kernel32!BaseThreadInitThunk+0x12
0722fb7c 772c9285 772e27e5 05eaac78 ffffffff ntdll!RtlInitializeExceptionChain+0x63
0722fb94 00000000 772e27e5 05eaac78 00000000 ntdll!RtlInitializeExceptionChain+0x36
8 Id: f08.ef8 Suspend: 0 Teb: fffa0000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
0741feb0 7629336a 05eaac78 0741fefc 772c92b2 ntdll!NtWaitForWorkViaWorkerFactory+0x12
0741febc 772c92b2 05eaac78 720bcba9 00000000 kernel32!BaseThreadInitThunk+0x12
0741fefc 772c9285 772e27e5 05eaac78 ffffffff ntdll!RtlInitializeExceptionChain+0x63
0741ff14 00000000 772e27e5 05eaac78 00000000 ntdll!RtlInitializeExceptionChain+0x36
9 Id: f08.45c Suspend: 0 Teb: fff9d000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
0759fd38 7629336a 05eaac78 0759fd84 772c92b2 ntdll!NtWaitForWorkViaWorkerFactory+0x12
0759fd44 772c92b2 05eaac78 7213c8d1 00000000 kernel32!BaseThreadInitThunk+0x12
0759fd84 772c9285 772e27e5 05eaac78 ffffffff ntdll!RtlInitializeExceptionChain+0x63
0759fd9c 00000000 772e27e5 05eaac78 00000000 ntdll!RtlInitializeExceptionChain+0x36
10 Id: f08.6a4 Suspend: 0 Teb: fff9a000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
0680fe84 76291194 00000394 ffffffff 00000000 ntdll!ZwWaitForSingleObject+0x15
0680fe9c 76291148 00000394 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0x43
0680feb0 65764d9a 00000394 ffffffff 00000000 kernel32!WaitForSingleObject+0x12
0680fecc 65764cc0 00000000 0680ff10 65f45f02 nvwgf2um!OpenAdapter12+0x16b42a
0680fed8 65f45f02 04730238 c0b00376 00000000 nvwgf2um!OpenAdapter12+0x16b350
0680ff10 65f4602a 00000000 0680ff28 7629336a nvwgf2um!NVAPI_Thunk+0x3a7362
0680ff1c 7629336a 05eb7570 0680ff68 772c92b2 nvwgf2um!NVAPI_Thunk+0x3a748a
0680ff28 772c92b2 05eb7570 73caca3d 00000000 kernel32!BaseThreadInitThunk+0x12
0680ff68 772c9285 65f45fae 05eb7570 ffffffff ntdll!RtlInitializeExceptionChain+0x63
0680ff80 00000000 65f45fae 05eb7570 00000000 ntdll!RtlInitializeExceptionChain+0x36
11 Id: f08.d44 Suspend: 0 Teb: fff97000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
03f1fba0 76291194 00000394 ffffffff 00000000 ntdll!ZwWaitForSingleObject+0x15
03f1fbb8 76291148 00000394 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0x43
03f1fbcc 65764d9a 00000394 ffffffff 00000000 kernel32!WaitForSingleObject+0x12
03f1fbe8 65764cc0 00000001 03f1fc2c 65f45f02 nvwgf2um!OpenAdapter12+0x16b42a
03f1fbf4 65f45f02 04730240 c5c1004a 00000000 nvwgf2um!OpenAdapter12+0x16b350
03f1fc2c 65f4602a 00000000 03f1fc44 7629336a nvwgf2um!NVAPI_Thunk+0x3a7362
03f1fc38 7629336a 05eb7950 03f1fc84 772c92b2 nvwgf2um!NVAPI_Thunk+0x3a748a
03f1fc44 772c92b2 05eb7950 76bbc9d1 00000000 kernel32!BaseThreadInitThunk+0x12
03f1fc84 772c9285 65f45fae 05eb7950 ffffffff ntdll!RtlInitializeExceptionChain+0x63
03f1fc9c 00000000 65f45fae 05eb7950 00000000 ntdll!RtlInitializeExceptionChain+0x36
12 Id: f08.f50 Suspend: 0 Teb: fff94000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
0625fb50 76291194 00000394 ffffffff 00000000 ntdll!ZwWaitForSingleObject+0x15
0625fb68 76291148 00000394 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0x43
0625fb7c 65764d9a 00000394 ffffffff 00000000 kernel32!WaitForSingleObject+0x12
0625fb98 65764cc0 00000002 0625fbdc 65f45f02 nvwgf2um!OpenAdapter12+0x16b42a
0625fba4 65f45f02 04730248 c01507ba 00000000 nvwgf2um!OpenAdapter12+0x16b350
0625fbdc 65f4602a 00000000 0625fbf4 7629336a nvwgf2um!NVAPI_Thunk+0x3a7362
0625fbe8 7629336a 05eb7d18 0625fc34 772c92b2 nvwgf2um!NVAPI_Thunk+0x3a748a
0625fbf4 772c92b2 05eb7d18 736fc961 00000000 kernel32!BaseThreadInitThunk+0x12
0625fc34 772c9285 65f45fae 05eb7d18 ffffffff ntdll!RtlInitializeExceptionChain+0x63
0625fc4c 00000000 65f45fae 05eb7d18 00000000 ntdll!RtlInitializeExceptionChain+0x36
13 Id: f08.730 Suspend: 0 Teb: fff91000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
06fdf838 76291194 00000394 ffffffff 00000000 ntdll!ZwWaitForSingleObject+0x15
06fdf850 76291148 00000394 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0x43
06fdf864 65764d9a 00000394 ffffffff 00000000 kernel32!WaitForSingleObject+0x12
06fdf880 65764cc0 00000003 06fdf8c4 65f45f02 nvwgf2um!OpenAdapter12+0x16b42a
06fdf88c 65f45f02 04730250 c0cd04a2 00000000 nvwgf2um!OpenAdapter12+0x16b350
06fdf8c4 65f4602a 00000000 06fdf8dc 7629336a nvwgf2um!NVAPI_Thunk+0x3a7362
06fdf8d0 7629336a 05eb8870 06fdf91c 772c92b2 nvwgf2um!NVAPI_Thunk+0x3a748a
06fdf8dc 772c92b2 05eb8870 73b7cc49 00000000 kernel32!BaseThreadInitThunk+0x12
06fdf91c 772c9285 65f45fae 05eb8870 ffffffff ntdll!RtlInitializeExceptionChain+0x63
06fdf934 00000000 65f45fae 05eb8870 00000000 ntdll!RtlInitializeExceptionChain+0x36
14 Id: f08.e78 Suspend: 0 Teb: fff8e000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
063bfde8 762919fc 063bfd9c 063bfe10 00000000 ntdll!ZwWaitForMultipleObjects+0x15
063bfe30 762941d8 00000003 fffde000 00000000 kernel32!WaitForMultipleObjectsEx+0x8e
063bfe4c 65c0a722 00000003 05eb65f8 00000000 kernel32!WaitForMultipleObjects+0x18
063bfe90 65c24e25 05eb9b58 063bfed4 65f45f02 nvwgf2um!NVAPI_Thunk+0x6bb82
063bfe9c 65f45f02 05eb65e8 c00b02b2 00000000 nvwgf2um!NVAPI_Thunk+0x86285
063bfed4 65f4602a 00000000 063bfeec 7629336a nvwgf2um!NVAPI_Thunk+0x3a7362
063bfee0 7629336a 05eb8870 063bff2c 772c92b2 nvwgf2um!NVAPI_Thunk+0x3a748a
063bfeec 772c92b2 05eb8870 7371ca79 00000000 kernel32!BaseThreadInitThunk+0x12
063bff2c 772c9285 65f45fae 05eb8870 ffffffff ntdll!RtlInitializeExceptionChain+0x63
063bff44 00000000 65f45fae 05eb8870 00000000 ntdll!RtlInitializeExceptionChain+0x36
15 Id: f08.d64 Suspend: 0 Teb: fff8b000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
047af840 76291194 0000036c ffffffff 00000000 ntdll!ZwWaitForSingleObject+0x15
047af858 76291148 0000036c ffffffff 00000000 kernel32!WaitForSingleObjectEx+0x43
047af86c 65764d9a 0000036c ffffffff 00000000 kernel32!WaitForSingleObject+0x12
047af888 65764cc0 00000000 047af8cc 65f45f02 nvwgf2um!OpenAdapter12+0x16b42a
047af894 65f45f02 047304c8 c24a04aa 00000000 nvwgf2um!OpenAdapter12+0x16b350
047af8cc 65f4602a 00000000 047af8e4 7629336a nvwgf2um!NVAPI_Thunk+0x3a7362
047af8d8 7629336a 05eb8870 047af924 772c92b2 nvwgf2um!NVAPI_Thunk+0x3a748a
047af8e4 772c92b2 05eb8870 7130cc71 00000000 kernel32!BaseThreadInitThunk+0x12
047af924 772c9285 65f45fae 05eb8870 ffffffff ntdll!RtlInitializeExceptionChain+0x63
047af93c 00000000 65f45fae 05eb8870 00000000 ntdll!RtlInitializeExceptionChain+0x36
16 Id: f08.29c Suspend: 0 Teb: fff88000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
03f0f8f0 76291194 0000036c ffffffff 00000000 ntdll!ZwWaitForSingleObject+0x15
03f0f908 76291148 0000036c ffffffff 00000000 kernel32!WaitForSingleObjectEx+0x43
03f0f91c 65764d9a 0000036c ffffffff 00000000 kernel32!WaitForSingleObject+0x12
03f0f938 65764cc0 00000001 03f0f97c 65f45f02 nvwgf2um!OpenAdapter12+0x16b42a
03f0f944 65f45f02 047304d0 c5c0051a 00000000 nvwgf2um!OpenAdapter12+0x16b350
03f0f97c 65f4602a 00000000 03f0f994 7629336a nvwgf2um!NVAPI_Thunk+0x3a7362
03f0f988 7629336a 05eba2e8 03f0f9d4 772c92b2 nvwgf2um!NVAPI_Thunk+0x3a748a
03f0f994 772c92b2 05eba2e8 76bacc81 00000000 kernel32!BaseThreadInitThunk+0x12
03f0f9d4 772c9285 65f45fae 05eba2e8 ffffffff ntdll!RtlInitializeExceptionChain+0x63
03f0f9ec 00000000 65f45fae 05eba2e8 00000000 ntdll!RtlInitializeExceptionChain+0x36
17 Id: f08.9c0 Suspend: 0 Teb: fff85000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
03dff8f8 65781771 00000005 00000000 0d2c0558 nvwgf2um!OpenAdapter12+0x187f08
03dff910 658a3f75 00000014 0d186be0 00000000 nvwgf2um!OpenAdapter12+0x187e01
03dff928 65a689d1 0d185828 00000027 00000001 nvwgf2um!OpenAdapter12+0x2aa605
03dff958 6590720d 0d185828 03dff9e0 659094f8 nvwgf2um!OpenAdapter12+0x46f061
03dff964 659094f8 0d185828 0d186708 0d185828 nvwgf2um!OpenAdapter12+0x30d89d
03dff9e0 657c4a7e 0d185828 03dffb64 00000003 nvwgf2um!OpenAdapter12+0x30fb88
03dff9f4 657815dc 0d185828 0d17aeb8 00000001 nvwgf2um!OpenAdapter12+0x1cb10e
03dffa6c 65781176 00000001 03dffb64 03dffb6c nvwgf2um!OpenAdapter12+0x187c6c
03dffbbc 65763e24 6578141a 079d8268 65763e14 nvwgf2um!OpenAdapter12+0x187806
03dffbdc 65763eaa 00000001 079d8258 0009001f nvwgf2um!OpenAdapter12+0x16a4b4
03dffbf4 65759177 079d8268 0009001f 030b027f nvwgf2um!OpenAdapter12+0x16a53a
03dffc20 65d0fa1d 09d4e098 01340140 09d4ddc8 nvwgf2um!OpenAdapter12+0x15f807
03dffc40 65cef1cd 03dffc70 0470ab80 09f53e58 nvwgf2um!NVAPI_Thunk+0x170e7d
03dffd10 656ed0e0 0cd2a9f8 00000000 09d4ddc8 nvwgf2um!NVAPI_Thunk+0x15062d
03dffd24 65604359 09f53e58 0cd2a9f8 6b38cffc nvwgf2um!OpenAdapter12+0xf3770
03dffd88 6575b7d3 09d28ef0 6b38cffc 09d27ca8 nvwgf2um!OpenAdapter12+0xa9e9
03dffde8 6575b70b 09d27c50 03dffe10 65764d60 nvwgf2um!OpenAdapter12+0x161e63
03dffdf4 65764d60 013416c0 09d27c50 00000000 nvwgf2um!OpenAdapter12+0x161d9b
03dffe10 65764cc0 00000002 03dffe54 65f45f02 nvwgf2um!OpenAdapter12+0x16b3f0
03dffe1c 65f45f02 047304d8 c5ef0232 00000000 nvwgf2um!OpenAdapter12+0x16b350
03dffe54 65f4602a 00000000 03dffe6c 7629336a nvwgf2um!NVAPI_Thunk+0x3a7362
03dffe60 7629336a 05eba6b0 03dffeac 772c92b2 nvwgf2um!NVAPI_Thunk+0x3a748a
03dffe6c 772c92b2 05eba6b0 7695cbf9 00000000 kernel32!BaseThreadInitThunk+0x12
03dffeac 772c9285 65f45fae 05eba6b0 ffffffff ntdll!RtlInitializeExceptionChain+0x63
03dffec4 00000000 65f45fae 05eba6b0 00000000 ntdll!RtlInitializeExceptionChain+0x36
18 Id: f08.a24 Suspend: 0 Teb: fff82000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
03e9f840 6590c1ec 0d175898 08c0b850 00000000 nvwgf2um!OpenAdapter12+0x311e01
03e9f85c 6590984f 00000620 0d16af20 0d175898 nvwgf2um!OpenAdapter12+0x31287c
03e9f8dc 657bec90 0d175898 03e9fa64 00000001 nvwgf2um!OpenAdapter12+0x30fedf
03e9f8f4 657815dc 0d175898 0d16af20 00000001 nvwgf2um!OpenAdapter12+0x1c5320
03e9f96c 65781176 00000001 03e9fa64 03e9fa6c nvwgf2um!OpenAdapter12+0x187c6c
03e9fabc 65763e24 6578141a 0a5e9cf0 65763e14 nvwgf2um!OpenAdapter12+0x187806
03e9fadc 65763eaa 00000001 0a5e9ce0 0009001f nvwgf2um!OpenAdapter12+0x16a4b4
03e9faf4 65759177 0a5e9cf0 0009001f 030b027f nvwgf2um!OpenAdapter12+0x16a53a
03e9fb20 65d0fa1d 00000000 0ceeb780 0ceeb780 nvwgf2um!OpenAdapter12+0x15f807
03e9fb40 65cf0acb 03e9fcb0 0470ab80 08b48918 nvwgf2um!NVAPI_Thunk+0x170e7d
03e9fd50 65cf1876 088cff90 0ceeb780 0470ab80 nvwgf2um!NVAPI_Thunk+0x151f2b
03e9fd68 656ed0e0 088cff90 00000000 0ceeb780 nvwgf2um!NVAPI_Thunk+0x152cd6
03e9fd7c 65604359 08b48918 088cff90 6b38d03c nvwgf2um!OpenAdapter12+0xf3770
03e9fde0 6575b7d3 09d2a1f8 6b38d03c 09d28fb0 nvwgf2um!OpenAdapter12+0xa9e9
03e9fe44 6575b70b 09d28f58 03e9fe6c 65764d60 nvwgf2um!OpenAdapter12+0x161e63
03e9fe50 65764d60 013416c0 09d28f58 00000000 nvwgf2um!OpenAdapter12+0x161d9b
03e9fe6c 65764cc0 00000003 03e9feb0 65f45f02 nvwgf2um!OpenAdapter12+0x16b3f0
03e9fe78 65f45f02 047304e0 c5d902d6 00000000 nvwgf2um!OpenAdapter12+0x16b350
03e9feb0 65f4602a 00000000 03e9fec8 7629336a nvwgf2um!NVAPI_Thunk+0x3a7362
03e9febc 7629336a 05eb8870 03e9ff08 772c92b2 nvwgf2um!NVAPI_Thunk+0x3a748a
03e9fec8 772c92b2 05eb8870 76a3ca5d 00000000 kernel32!BaseThreadInitThunk+0x12
03e9ff08 772c9285 65f45fae 05eb8870 ffffffff ntdll!RtlInitializeExceptionChain+0x63
03e9ff20 00000000 65f45fae 05eb8870 00000000 ntdll!RtlInitializeExceptionChain+0x36
19 Id: f08.74c Suspend: 0 Teb: fff7f000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
030cfc80 76291194 0000033c ffffffff 00000000 ntdll!ZwWaitForSingleObject+0x15
030cfc98 76291148 0000033c ffffffff 00000000 kernel32!WaitForSingleObjectEx+0x43
030cfcac 65764d9a 0000033c ffffffff 00000000 kernel32!WaitForSingleObject+0x12
030cfcc8 65764cc0 00000000 030cfd0c 65f45f02 nvwgf2um!OpenAdapter12+0x16b42a
030cfcd4 65f45f02 01347808 c53c016a 00000000 nvwgf2um!OpenAdapter12+0x16b350
030cfd0c 65f4602a 00000000 030cfd24 7629336a nvwgf2um!NVAPI_Thunk+0x3a7362
030cfd18 7629336a 05eb8870 030cfd64 772c92b2 nvwgf2um!NVAPI_Thunk+0x3a748a
030cfd24 772c92b2 05eb8870 7646c831 00000000 kernel32!BaseThreadInitThunk+0x12
030cfd64 772c9285 65f45fae 05eb8870 ffffffff ntdll!RtlInitializeExceptionChain+0x63
030cfd7c 00000000 65f45fae 05eb8870 00000000 ntdll!RtlInitializeExceptionChain+0x36
20 Id: f08.ba0 Suspend: 0 Teb: fff7c000 Unfrozen
*** WARNING: Unable to verify checksum for MFPlat.DLL
*** ERROR: Symbol file could not be found. Defaulted to export symbols for MFPlat.DLL -
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
0a98fac8 675e3978 00000474 0a98fb00 0a98faf4 ntdll!NtRemoveIoCompletion+0x15
0a98faf8 675e1b11 0a98fbfc 00000000 00000000 MFPlat!MFLockPlatform+0x28d
0a98fc00 675e8ce7 0a98fc40 76a21287 08b753c8 MFPlat+0x1b11
*** WARNING: Unable to verify checksum for msvcrt.dll
*** ERROR: Symbol file could not be found. Defaulted to export symbols for msvcrt.dll -
0a98fc08 76a21287 08b753c8 ca21a4ba 00000000 MFPlat!FormatTagFromWfx+0xf3
0a98fc40 76a21328 0a98fc54 7629336a 02de2ab0 msvcrt!itow_s+0x4c
0a98fc48 7629336a 02de2ab0 0a98fc94 772c92b2 msvcrt!endthreadex+0x6c
0a98fc54 772c92b2 02de2ab0 7fd2c9c1 00000000 kernel32!BaseThreadInitThunk+0x12
0a98fc94 772c9285 76a212e5 02de2ab0 ffffffff ntdll!RtlInitializeExceptionChain+0x63
0a98fcac 00000000 76a212e5 02de2ab0 00000000 ntdll!RtlInitializeExceptionChain+0x36
21 Id: f08.5dc Suspend: 0 Teb: fff79000 Unfrozen
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
0923fd5c 76291194 00000490 ffffffff 00000000 ntdll!ZwWaitForSingleObject+0x15
0923fd74 675e2094 00000490 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0x43
0923fd98 675e1fe1 08fdd4e0 0923fdd8 00000000 MFPlat!MFHeapFree+0x1d4
0923fde0 675e1b11 0923fee4 00000000 00000000 MFPlat!MFHeapFree+0x121
0923fee8 675e8ce7 0923ff28 76a21287 08b75428 MFPlat+0x1b11
0923fef0 76a21287 08b75428 c99aa7d2 00000000 MFPlat!FormatTagFromWfx+0xf3
0923ff28 76a21328 0923ff3c 7629336a 02de2ab0 msvcrt!itow_s+0x4c
0923ff30 7629336a 02de2ab0 0923ff7c 772c92b2 msvcrt!endthreadex+0x6c
0923ff3c 772c92b2 02de2ab0 7c69ca29 00000000 kernel32!BaseThreadInitThunk+0x12
0923ff7c 772c9285 76a212e5 02de2ab0 ffffffff ntdll!RtlInitializeExceptionChain+0x63
0923ff94 00000000 76a212e5 02de2ab0 00000000 ntdll!RtlInitializeExceptionChain+0x36
,
May 19 2017
What looked suspicious was seeing the NativeStackSamplerMac was sampling at the time that the watchdog triggered. However this doesn't look consistent across the hangs. https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.gpu%2FMac_Release__Intel_%2F64401%2F%2B%2Frecipes%2Fsteps%2Fwebgl_conformance_tests_on_Intel_GPU_on_Mac_on_Mac-10.12%2F0%2Fstdout Thread 3 (crashed) 0 Chromium Framework!__ZN3gpu17GpuWatchdogThread38DeliberatelyTerminateToRecoverFromHangEv + 0x11f 1 Chromium Framework!__ZN3gpu17GpuWatchdogThread14OnCheckTimeoutEv + 0x99 2 Chromium Framework!__ZN4base8internal7InvokerINS0_9BindStateIMN3gpu17GpuWatchdogThreadEFvvEJNS_7WeakPtrIS4_EEEEEFvvEE3RunEPNS0_13BindStateBaseE + 0xb4 3 Chromium Framework!__ZN4base5debug13TaskAnnotator7RunTaskEPKcPNS_11PendingTaskE + 0x136 Thread 0 0 libsystem_kernel.dylib + 0x1234a 1 CoreFoundation + 0x884e4 2 CoreFoundation + 0x87971 3 CoreFoundation + 0x871c4 4 Chromium Framework!__ZN4base20MessagePumpCFRunLoop5DoRunEPNS_11MessagePump8DelegateE + 0x4f 5 Chromium Framework!__ZN4base24MessagePumpCFRunLoopBase3RunEPNS_11MessagePump8DelegateE + 0x77 6 Chromium Framework!__ZN4base11MessageLoop10RunHandlerEv + 0x122 7 Chromium Framework!__ZN4base7RunLoop3RunEv + 0x97 8 Chromium Framework!__ZN7content7GpuMainERKNS_18MainFunctionParamsE + 0x417 9 Chromium Framework!__ZN7content21ContentMainRunnerImpl3RunEv + 0x415 Thread 2 0 dyld + 0x168ea 1 dyld + 0xd6e0 2 libdyld.dylib + 0x2790 3 Chromium Framework!__ZN4base12_GLOBAL__N_120WalkStackFromContextIZNS0_21NativeStackSamplerMac27SuspendThreadAndRecordStackEPNS_18NativeStackSampler11StackBufferEPNS_21StackSamplingProfiler6SampleEE3$_0EEbP13unw_context_tPmRKT_ + 0xeb 4 Chromium Framework!__ZN4base12_GLOBAL__N_121NativeStackSamplerMac17RecordStackSampleEPNS_18NativeStackSampler11StackBufferEPNS_21StackSamplingProfiler6SampleE + 0x77a 5 Chromium Framework!__ZN4base21StackSamplingProfiler14SamplingThread12RecordSampleEPNS1_17CollectionContextE + 0x43f
,
May 19 2017
OK, that's very bad. Let me look into it; the thread suspension was written to be careful to not use anything that isn't already linked in but we may have failed.
,
May 19 2017
,
May 19 2017
What's super concerning is this: 1 dyld + 0xd6e0 2 libdyld.dylib + 0x2790 3 Chromium Framework!base::WalkStackFromContext() + 0xeb Which makes it look like we're making a call to a function that isn't linked in, and then deadlocking when hitting a suspended main thread. But what call? The offset is 0xeb. This is the release bot so I did a release build of base/, Hoppered it, and looked at that offset: 0000000000073979 4C8B7808 mov r15, qword [ds:rax+8] 000000000007397d 488B5810 mov rbx, qword [ds:rax+0x10] 0000000000073981 48897D80 mov qword [ss:rbp+var_80], rdi 0000000000073985 4C89E6 mov rsi, r12 ; argument "info" for method imp___stubs__dladdr 0000000000073988 E84F860800 call imp___stubs__dladdr // *** 000000000007398d 85C0 test eax, eax 000000000007398f 745F je 0x739f0 The offset isn't perfect, but it's likely dladdr. Which makes sense from the forensic point-of-view. The stack trace, again, was: 1 dyld + 0xd6e0 2 libdyld.dylib + 0x2790 3 Chromium Framework!base::WalkStackFromContext() + 0xeb So WalkStackFromContext() calls dladdr, which is libdyld.dylib, which isn't linked, so we call into dyld, the linker. What doesn't make sense here is the flow. I'm pretty sure that we aren't calling that inside of the thread suspension. I have some thoughts here but nothing solid; I'm still looking at this.
,
May 19 2017
I logged into the bot that produced the data from #78 to symbolize those dyld symbols: Thread 2: ImageLoaderMachO::findClosestSymbol(mach_header const*, void const*, void const**) (in dyld) + 386 dladdr (in dyld) + 133 dladdr (in libdyld.dylib) + 72 Thread 0: mach_msg_trap (in libsystem_kernel.dylib) + 10 __CFRunLoopServiceMachPort (in CoreFoundation) + 212 __CFRunLoopRun (in CoreFoundation) + 1361 CFRunLoopRunSpecific (in CoreFoundation) + 420 All other threads are also waiting in their proper idle state.
,
May 19 2017
The main thread doesn't look like it's suspended holding a lock. It looks like it's waiting for the kernel. What's weird here is that: 1. Thread 2, the stack sampler thread, is exploding deep in dyld, the linker. 2. Thread 0 doesn't look like it's suspended, but is still failing the watchdog thread.
,
May 19 2017
Two things before I finish for today. First, in comment 78: "What looked suspicious was seeing the NativeStackSamplerMac was sampling at the time that the watchdog triggered. However this doesn't look consistent across the hangs." Note that the sampler is running periodically all the time, so it's possible that it can end up in crash dumps simply by chance. I'm not claiming that it's not the stack sampler; I'm still going to investigate, but I would recommend that other people also keep investigating. A second note: The stack sampler is going to be turned off (https://chromium-review.googlesource.com/c/508414/) because of suspicions that it is regressing the runtime of some test suites. This is an opportunity to watch these hangs. If the hangs disappear, then that's evidence for me to dig harder to figure out what's up with the stack sampler. If the hangs continue, though, then it's strong evidence the other way.
,
May 24 2017
Note: the minidump from this failure: https://luci-milo.appspot.com/buildbot/tryserver.chromium.win/win_optional_gpu_tests_rel/10429 is available here: gs://chrome-telemetry-output/minidump-2017-05-18_14-43-45-122205.dmp
,
May 24 2017
Note that the stack sampler on the Mac is turned off. If you're still getting gpu hangs, it's not the stack sampler.
,
May 24 2017
+maruel for question on Swarming. Please see below. I've been trying to debug these Gpu watchdog timeouts by mass re-running one of the tasks that kbr@ listed in comment #75, triggering up to 100 simultaneous tasks. Here are 1396 runs of the job: https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=user&c=bot&c=completed_ts&et=1495584420000&f=reason%3Abisect&l=50&n=true&s=completed_ts%3Adesc&st=1495143540000 What's interesting is there were 1200+ successful job runs, and then 22 all failed at once within 2 minutes of each-other, on different bots. This observation would point to something like a network effect, affecting multiple bots. Talking with kbr@, logging output from the GPU process comes to mind, as the failing test spams a lot of logs. Is it possible the Swarming log service could put back-pressure on the GPU process? All failures were in test: third_party/webgl/src/sdk/tests/conformance/attribs/gl-vertexattribpointer.html Here are some of the failing runs: https://chromium-swarm.appspot.com/task?id=365065a590df4e10&refresh=10&show_raw=1 https://chromium-swarm.appspot.com/task?id=3650662580940510&refresh=10&show_raw=1 https://chromium-swarm.appspot.com/task?id=365065f11f709c10&refresh=10&show_raw=1 Tail of system log: May 23 16:15:32 build614-m4 diagnosticd[213]: Unexpected xpc type [<double: 0x7ffacac02970>: 9223372036855179264.000000] May 23 16:15:32 build614-m4 diagnosticd[213]: Unexpected xpc type [<double: 0x7ffacae04d70>: 9223372036855179264.000000] May 23 16:15:32 build614-m4 diagnosticd[213]: Unexpected xpc type [<double: 0x7ffacac02990>: 9223372036855179264.000000] May 23 16:15:34 build614-m4 xcodebuild[40698]: BUG in libdispatch client: kevent[EVFILT_READ] monitored resource vanished before the source cancel handler was invoked May 23 16:15:34 build614-m4 com.apple.xpc.launchd[1] (com.apple.iphonesimulator.968[40700]): Service exited due to signal: Terminated: 15 sent by killall[40955] May 23 16:16:04 build614-m4 com.apple.usbmuxd[39644]: stopping. May 23 16:16:04 build614-m4 xpcproxy[40993]: libcoreservices: _dirhelper_userdir: 523: bootstrap_look_up returned 268435459 May 23 16:16:04 build614-m4 com.apple.usbmuxd[40993]: log filter changed from 4 to 6 May 23 16:16:04 build614-m4 com.apple.usbmuxd[40993]: usbmuxd-374.70 on Aug 1 2016 at 12:50:26, running 64 bit May 23 16:16:13 build614-m4 xpcproxy[41015]: libcoreservices: _dirhelper_userdir: 523: bootstrap_look_up returned 268435459 May 23 16:16:34 build614-m4 com.apple.xpc.launchd[1] (com.apple.quicklook[41028]): Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.quicklook Also avi@, all of these had the StackSamplingProfiler running at time of hang. This task still had the stack sampler enabled. However due to the coincidental timing on multiple bots, this may just be a symptom rather than a cause of the timeouts.
,
May 24 2017
+maruel, please see comment #87.
,
May 24 2017
This is very interesting. Swarming doesn't do anything in the background. But puppet does. +Elliot and Vadim to know if they can correlate with their puppet schedule. Keep in mind this is likely totally red herring.
,
May 24 2017
Puppet run schedule is not correlated between bots. It is deliberately randomized. I've confirmed that a bunch of bots that fail tests in #87 ran puppet at different times: http://shortn/_9wLjTv6hvb
,
May 24 2017
,
May 24 2017
The failing test (gl-vertexattribpointer) has about 100KB of GPU ERROR output to STDOUT. My theory is that if STDOUT buffers fill up, the GPU threads can stall and cause watchdog timeouts. I wrote a quick test script that adds long delays to reading the STDOUT of testing/scripts/run_gpu_integration_test_as_googletest.py and this causes the 'gl-vertexattribpointer' test to stall mid-way. maruel@ could you give some pointers to how stdout logging is happening on the bots? Swarming task pages show live log updates, so I'm wondering where are we reading the PIPE and if that could stall due to slow network? Discussing last night with kbr@, he is thinking to change our logging so that we buffer browser log output in the test harness, and isolate browser execution from possible downstream PIPE back-pressure. Possibly more red herrings, but in past experiments we noticed that adding or removing logging would make these timeout issues go away or become harder to reproduce. Also enabling ASAN made the issue go away. When running with ASAN, STDOUT gets piped through sanitizer's symbolizer which could affect buffering.
,
May 24 2017
The Swarming bot reads the pipes as fast as it can and streams packets over HTTPS on a specific basis as decided at: https://github.com/luci/luci-py/blob/master/appengine/swarming/swarming_bot/bot_code/task_runner.py#L323 So yes there's a magic 100kb number there, which may be related with what you are seeing. https://github.com/luci/luci-py/blob/master/appengine/swarming/swarming_bot/bot_code/task_runner.py#L41 When this happens, stdout is effectively blocked for a small amount of time if it is not buffered. The kernel may buffer a bit, like 4Kb or so, I forget the details. And yes, intermediary python scripts will likely affect buffering.
,
May 24 2017
@#92: that is interesting and a bit unexpected. GPU process logging is sent via IPCs, either via GpuService::RecordLogMessage for LOG, or via GpuCommandBufferStub::SendConsoleMessage for console messages - e.g. GL errors or perf/correctness warnings - either of which would hop to the IO thread. If the reader (browser or renderer) is stalled, I could see causing IPC queuing delays on the GPU IO thread, which is certainly bad in itself, however it shouldn't block the GPU main thread and trigger the watchdog - we shouldn't be blocking on the IO thread since we don't send sync messages from the GPU process. I'm curious to see if we could get a trace of this behavior.
,
May 24 2017
#94: That's interesting. Setting up our webgl conformance tests to neglect reading the browser's pipe entirely with subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, env=env) and then never reading from the pipe, I get a stall exactly in test [10/859] gl-vertexattribpointer. It just happens to be the first test that fills the output buffers. The browser stalls somewhere, however I haven't seen that turn into a GPU watchdog timeout.
,
May 24 2017
So local debugging and investigation showed that #94 is incorrect, we do log from the GPU process for every message that is sent to the console (which is also logged from the browser process on the UI thread). So if the stdout/stderr pipe is full, we may block the UI thread as well as the GPU main thread (or any other thread that logs), possibly triggering the watchdog. So the solution would be to either: 1- log less (e.g. it's probably not necessary to log GL errors twice, and maybe not at all if the test passes) 2- add buffering in the harness to isolate the tests from this. or por qué no los dos.
,
May 24 2017
We also log right before the GPU watchdog crashes, so that may explain the callstacks - it only crashes after the pipe is unblocked, so at that point the main thread could finish logging and return to the message loop before the crash actually happens.
,
May 26 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/d5e5040512747a4654b34c6ebaf0a301fb755f43 commit d5e5040512747a4654b34c6ebaf0a301fb755f43 Author: jbauman <jbauman@chromium.org> Date: Fri May 26 02:06:48 2017 Don't log to stderr during GPU process watchdog timeout It's possible stderr's buffer is full and that's causing the main thread to hang, so just attempt to send the error message to the browser process instead of possibly blocking as well. BUG= 609252 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel Review-Url: https://codereview.chromium.org/2902383003 Cr-Commit-Position: refs/heads/master@{#474878} [modify] https://crrev.com/d5e5040512747a4654b34c6ebaf0a301fb755f43/gpu/ipc/service/gpu_watchdog_thread.cc
,
Sep 20 2017
Thanks to the excellent analysis above, this issue seems to be fixed. There are no flaky failures in the last 200 builds on these Windows and Mac machines, which are representative of the configurations where the problem used to be seen: https://luci-milo.appspot.com/buildbot/chromium.gpu.fyi/Win7%20Release%20%28NVIDIA%29/?limit=200 https://luci-milo.appspot.com/buildbot/chromium.gpu.fyi/Mac%20Release%20%28Intel%29/?limit=200 Since John landed the fix to the logging, giving this to him. Thanks John. |
||||||||||||||||||||||||||||||||||||||||
►
Sign in to add a comment |
||||||||||||||||||||||||||||||||||||||||
Comment 1 by kbr@chromium.org
, May 4 2016