Flaky "Gathering system GPU info" timeout during GPU tests on Windows |
|||||||||
Issue descriptionFor example, in https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win10%20FYI%20Exp%20Release%20%28Intel%20HD%20630%29/54 this happened during webgl2_conformance_tests, webgl_conformance_d3d9_passthrough_tests and webgl_conformance_d3d9_tests. https://chromium-swarm.appspot.com/task?id=3d77aad3bb276d10&refresh=10&show_raw=1 https://chromium-swarm.appspot.com/task?id=3d77d70017fd2f10&refresh=10&show_raw=1 https://chromium-swarm.appspot.com/task?id=3d77e1fb60967f10&refresh=10&show_raw=1 [4320:2528:0514/072153.706:FATAL:system_info_handler.cc(191)] Check failed: false. Gathering system GPU info took more than 5 seconds.
,
May 14 2018
Sorry, just checked back from the beginning. Even the first build has this issue. I guess it's something with the new driver that caused info collection takes longer
,
May 18 2018
Just saw it on another bot. https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win7%20FYI%20Release%20%28AMD%29/1231 Can this affect Chrome start up on user machines? Maybe this is higher priority?
,
May 18 2018
Also here https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win7%20FYI%20Release%20(NVIDIA)/997 Corentin, PTAL.
,
May 18 2018
,
May 18 2018
I tried loading one of the minidumps to know where the code is stuck in the GPU process but even with the Chromium symbol server, Visual studio is not able to show any useful stack trace. zmo@ do you know if there is a way to get that information from the minidump or the logs?
,
May 18 2018
sunnyps@ or ulan@ may be able to help you. I'm not sure whether it will help if you download the isolate which had the binaries which produced that minidump. See ./tools/swarming_client/isolateserver.py help download and use the hash for the isolated inputs. For example, for this failure: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win7%20FYI%20Release%20(NVIDIA)/997 and this Swarming task: https://chromium-swarm.appspot.com/task?id=3d89a16b58cd8710&refresh=10&show_raw=1 the isolated input is: 59618d4d8841c30ad35a47f01efaafd89ed81a56
,
May 18 2018
I suggest after the next Thursday branch cut, add a bunch of temporary logs and land them so we can see where the bots get stuck.
,
May 18 2018
Issue 842180 has been merged into this issue.
,
May 18 2018
Adding Internals>Sandbox>SiteIsolation component because r552589 was blamed as the culprit both in https://crbug.com/842180#c5 and in https://crbug.com/839470#c6
,
May 23 2018
The log in https://chromium-swarm.appspot.com/task?id=3d89a16b58cd8710&refresh=10&show_raw=1 contains a stack trace that seems to be from GPU process. Found Minidump: True Stack Trace: ******************************************************************************** Last event: 11d8.ae8: Access violation - code c0000005 (first/second chance not available) debugger time: Thu May 17 18:15:57.276 2018 (UTC - 7:00) ChildEBP RetAddr Args to Child WARNING: Stack unwind information not available. Following frames may be wrong. 0287e890 6a97fa75 000003b1 00000000 075a38a8 nvwgf2um!OpenAdapter12+0x1860da 0287e8a8 6a97fa75 00000016 0084c400 0085c88c nvwgf2um!OpenAdapter12+0x186125 0287e8c0 6a97d5ae 00000018 0287e998 0287e91c nvwgf2um!OpenAdapter12+0x186125 0287e8d8 6a97b234 0085c88c 00000001 0287e8f4 nvwgf2um!OpenAdapter12+0x183c5e 0287e8fc 6a96cb06 075a1600 05f01d98 0287e91c nvwgf2um!OpenAdapter12+0x1818e4 0287e97c 6a96c609 0287e998 0761b8d8 007cf6e0 nvwgf2um!OpenAdapter12+0x1731b6 0287e9b8 6a96c1c6 0761b8d8 05f01d98 007cf6e0 nvwgf2um!OpenAdapter12+0x172cb9 0287ea30 6a903c37 0609ea90 05f01d98 0609a730 nvwgf2um!OpenAdapter12+0x172876 0287ea64 6a8d506f 00000001 00830000 008069f0 nvwgf2um!OpenAdapter12+0x10a2e7 0287eb00 6a8d5878 40001440 00000000 00000000 nvwgf2um!OpenAdapter12+0xdb71f 0287f778 6a8d5be1 00000006 00827880 00000000 nvwgf2um!OpenAdapter12+0xdbf28 0287f7d0 6b09c321 00806080 076a1658 00806080 nvwgf2um!OpenAdapter12+0xdc291 0287f82c 6a919a99 00806080 076a1658 04d793e8 nvwgf2um!NVAPI_Thunk+0x25c4f1 0287f850 6a9199bd 0287f868 0080582c 00000000 nvwgf2um!OpenAdapter12+0x120149 0287f874 6aed6a95 04d8b710 0287f8b8 6b24385e nvwgf2um!OpenAdapter12+0x12006d 0287f880 6b24385e 00805920 54ec7e1c 00000000 nvwgf2um!NVAPI_Thunk+0x96c65 0287f8b8 6b243986 00000000 0287f8d0 7502336a nvwgf2um!NVAPI_Thunk+0x403a2e *** WARNING: Unable to verify checksum for kernel32.dll *** ERROR: Symbol file could not be found. Defaulted to export symbols for kernel32.dll - 0287f8c4 7502336a 04d793e8 0287f910 771d98f2 nvwgf2um!NVAPI_Thunk+0x403b56 0287f8d0 771d98f2 04d793e8 752b90f9 00000000 kernel32!BaseThreadInitThunk+0x12 0287f910 771d98c5 6b24390a 04d793e8 ffffffff ntdll!RtlInitializeExceptionChain+0x63 0287f928 00000000 6b24390a 04d793e8 00000000 ntdll!RtlInitializeExceptionChain+0x36 0 Id: 11d8.12cc Suspend: 0 Teb: fffdd000 Unfrozen ChildEBP RetAddr Args to Child WARNING: Stack unwind information not available. Following frames may be wrong. 002fd514 75021194 00000288 000003e8 00000000 ntdll!NtWaitForSingleObject+0x15 002fd52c 75021148 00000288 000003e8 00000000 kernel32!WaitForSingleObjectEx+0x43 002fd540 6a919d55 00000288 000003e8 0000009a kernel32!WaitForSingleObject+0x12 002fda70 6a91acca 00000007 0000009a 00000000 nvwgf2um!OpenAdapter12+0x120405 002fda8c 6a91b06b 0000009a 00000061 00000000 nvwgf2um!OpenAdapter12+0x12137a 002fdac4 6a91d1fd 00000000 00000000 002fdaf0 nvwgf2um!OpenAdapter12+0x12171b 002fdaf8 6a935c70 00000000 00000000 04dec238 nvwgf2um!OpenAdapter12+0x1238ad *** WARNING: Unable to verify checksum for d3d11.dll *** ERROR: Symbol file could not be found. Defaulted to export symbols for d3d11.dll - 002fdb28 6cc9dead 008055c4 56444a4c 04deb9c8 nvwgf2um!OpenAdapter12+0x13c320 *** WARNING: Unable to verify checksum for dxgi.dll *** ERROR: Symbol file could not be found. Defaulted to export symbols for dxgi.dll - 002fdbac 6e2d3b21 007fb420 04dec014 00000000 d3d11!D3D11CreateDeviceAndSwapChain+0x2c2c7 002fe090 6e2d1db1 00000002 00000000 00000000 dxgi!DXGID3D10RegisterLayers+0x6439 *** WARNING: Unable to verify checksum for libglesv2.dll No .natvis files found at c:\b\s\w\ir\out\Release\cdb\Visualizers. No .natvis files found at C:\Users\chrome-bot\AppData\Local\Dbg\Visualizers. 002fe0e8 6bdc76e2 04deb9c8 00000000 00000000 dxgi!DXGID3D10RegisterLayers+0x46c9 002fe1d8 6bdc6081 00000001 00000005 00000005 libglesv2!rx::SwapChain11::present+0xa8 002fe200 6bd55a78 04dee2a8 00000005 00000005 libglesv2!rx::SwapChain11::swapRect+0x63 002fe22c 6bd55d20 002fe2a0 04dee2a8 00000005 libglesv2!rx::SurfaceD3D::swapRect+0x60 002fe250 6bcafa0e 002fe2a0 04dee2a8 00000005 libglesv2!rx::SurfaceD3D::postSubBuffer+0x1c 002fe278 6bc783e8 002fe2a0 04dee2a8 00000005 libglesv2!egl::Surface::postSubBuffer+0x2a *** WARNING: Unable to verify checksum for chrome_child.dll 002fe4b0 6051e6ac 007e1ed8 04de9370 00000005 libglesv2!egl::PostSubBufferNV+0x176 002fe5a4 608422d1 00000005 00000051 00000402 chrome_child!gl::NativeViewGLSurfaceEGL::PostSubBuffer+0xba 002fe60c 6079c470 00000005 00000051 00000402 chrome_child!gpu::PassThroughImageTransportSurface::PostSubBuffer+0xa9 002fe650 607906b4 00000030 00000000 00000005 chrome_child!gpu::gles2::GLES2DecoderPassthroughImpl::DoPostSubBufferCHROMIUM+0xac 002fe68c 6075a3fc 00000000 06116258 0268e4ac chrome_child!gpu::gles2::GLES2DecoderPassthroughImpl::HandlePostSubBufferCHROMIUM+0x44 002fe6b8 6075a01b 00000014 061160a0 00000076 chrome_child!gpu::gles2::GLES2DecoderPassthroughImpl::DoCommandsImpl<0>+0xc0 002fe6d0 6181691f 00000014 061160a0 00000076 chrome_child!gpu::gles2::GLES2DecoderPassthroughImpl::DoCommands+0x21 002fe800 6182607d 0000589e 0268e4a8 002fe860 chrome_child!gpu::CommandBufferService::Flush+0x163 002fe924 61825eaf 0000589e 000000d7 00000001 chrome_child!gpu::CommandBufferStub::OnAsyncFlush+0x12f 002fe980 61824dfc 028f13b8 0267bfe8 0267bfe8 chrome_child!IPC::MessageT<GpuCommandBufferMsg_AsyncFlush_Meta,std::tuple<int,unsigned int,bool>,void>::Dispatch<gpu::CommandBufferStub,gpu::CommandBufferStub,void,void (__thiscall gpu::CommandBufferStub::*)(int,unsigned int,bool)>+0x6f 002fea50 61823713 028f13b8 002fea70 02906610 chrome_child!gpu::CommandBufferStub::OnMessageReceived+0x22c 002fea64 618222a8 028f13b8 0267bfe8 02679c70 chrome_child!gpu::GpuChannel::HandleMessageHelper+0x2d 002feb54 61823f1d 028f13b8 61822194 002feb7c chrome_child!gpu::GpuChannel::HandleMessage+0x114 002feb70 5e735d2b 028f1398 028f1398 5649e309 chrome_child!base::internal::Invoker<base::internal::BindState<void (__thiscall gpu::GpuChannel::*)(IPC::Message const &),base::WeakPtr<gpu::GpuChannel>,IPC::Message>,void __cdecl(void)>::Run+0x39 002feb88 6181b1a8 002febbc 002febbc 02669b58 chrome_child!base::OnceCallback<void __cdecl(void)>::Run+0x2b 002feccc 5faede94 026bfe38 5faa4567 000012cc chrome_child!gpu::Scheduler::RunNextTask+0x332 002fede0 5fb1b449 62862f8b 002ff0c0 002feeb8 chrome_child!base::debug::TaskAnnotator::RunTask+0x134 002feeb4 5fad1817 002ff0c0 028ef728 63025e2c chrome_child!base::internal::IncomingTaskQueue::RunTask+0x69 002fefdc 5fad1b6d 002ff0c0 5faa4157 0266aad0 chrome_child!base::MessageLoop::RunTask+0x207 002ff0b8 5fad1d9a 00000000 63026090 63025e2c chrome_child!base::MessageLoop::DeferOrRunPendingTask+0x9d 002ff218 5fb1da7c 028d29b0 028d29a8 00001201 chrome_child!base::MessageLoop::DoWork+0x1fa 002ff23c 5fad1334 028ef728 002ff3f8 5fa8c71e chrome_child!base::MessagePumpDefault::Run+0x9c 002ff30c 5fa97cdc 00000001 002ff3f8 62858c78 chrome_child!base::MessageLoop::Run+0x74 002ff3e0 612c2c0a 028d2501 002ff3e8 028ff160 chrome_child!base::RunLoop::Run+0xcc 002ff5dc 5fa60b94 002ff6cc 00000000 00000000 chrome_child!content::GpuMain+0x38e 002ff6ac 5fa61125 002ff6e0 002ff6cc 002ff92c chrome_child!content::RunNamedProcessTypeMain+0x10c 002ff7a8 5fa65ae5 028d4280 00000003 002ff8c8 chrome_child!content::ContentMainRunnerImpl::Run+0x117 002ff8b8 5fa60a5f 002ff8c4 002ff8c8 6284f750 chrome_child!service_manager::Main+0x2a5 002ff8f4 5e73111e 002ff918 002ff900 002ff8fc chrome_child!content::ContentMain+0x33 *** WARNING: Unable to verify checksum for chrome.exe 002ff968 00375384 00370000 002ff9b0 0c97b89f chrome_child!ChromeMain+0x11e 002ff9f4 00371543 00370000 0c97b89f 00000002 chrome!MainDllLoader::Launch+0x230 002ffb6c 00455df8 00370000 00000000 00763016 chrome!wWinMain+0x543
,
May 26 2018
Yikes. Thanks Ulan for tracking that down. I'm mystified because the original timeouts were reported on Win/Intel but the crash above is coming from NVIDIA's driver. This timeout of GPU info gathering on Win/Intel: https://chromium-swarm.appspot.com/task?id=3d77aad3bb276d10&refresh=10&show_raw=1 doesn't show evidence of a GPU process crash. Maybe the bots are slow and we should have a longer timeout, especially for the first cold start of the browser.
,
May 29 2018
I am also confused by this stack in #11. I don't understand how it is possible that we already have a functioning command buffer but the info collection is still pending and triggering a timeout.
,
May 29 2018
I think the stack in #11 from the swarming task in #7 is a different problem where GPU process crashes in nvidia driver. #12 has a link to another task where GPU info collection timed out, and there's no GPU process crash there.
GPU info collection times out flakily, but we retry browser process creation, and it seems to succeed eventually.
The actual test failure is:
[2/136] gpu_tests.webgl_conformance_integration_test.WebGLConformanceIntegrationTest.WebglConformance_conformance2_glsl3_short_circuiting_in_loop_condition failed unexpectedly
...
Traceback (most recent call last):
File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\testing\serially_executed_browser_test_case.py", line 206, in <lambda>
return lambda self: based_method(self, *args)
File "c:\b\s\w\ir\content\test\gpu\gpu_tests\gpu_integration_test.py", line 179, in _RunGpuTest
self.RunActualGpuTest(url, *args)
File "c:\b\s\w\ir\content\test\gpu\gpu_tests\webgl_conformance_integration_test.py", line 188, in RunActualGpuTest
getattr(self, test_name)(test_path, *args[1:])
File "c:\b\s\w\ir\content\test\gpu\gpu_tests\webgl_conformance_integration_test.py", line 202, in _RunConformanceTest
self._CheckTestCompletion()
File "c:\b\s\w\ir\content\test\gpu\gpu_tests\webgl_conformance_integration_test.py", line 198, in _CheckTestCompletion
self.fail(self._WebGLTestMessages(self.tab))
AssertionError: should be green
at (0, 0) expected: 0,255,0,255 was 0,255,0,0
FAIL should be green
at (0, 0) expected: 0,255,0,255 was 0,255,0,0
should be green
at (0, 0) expected: 0,255,0,255 was 0,255,0,0
FAIL should be green
at (0, 0) expected: 0,255,0,255 was 0,255,0,0
should be green
at (0, 0) expected: 0,255,0,255 was 0,255,0,0
FAIL should be green
at (0, 0) expected: 0,255,0,255 was 0,255,0,0
should be green
at (0, 0) expected: 0,255,0,255 was 0,255,0,0
FAIL should be green
at (0, 0) expected: 0,255,0,255 was 0,255,0,0
,
May 29 2018
Sorry about the unreliability of this code. I think the timeouts in system_info_handler.cc are likely caused by slow bots and slow cold startup of the browser. Per https://cs.chromium.org/chromium/src/content/browser/devtools/protocol/system_info_handler.cc?q=system_info_handler.cc&sq=package:chromium&dr&l=184 we still can't yet assert that DevTools' system info collection will succeed because Issue 796386 hasn't been fixed yet. Basically, this failure is non-fatal and is being worked around by retrying browser startup in the GPU test harness. For this reason I'm closing it as WontFix. |
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by zmo@chromium.org
, May 14 2018Status: Assigned (was: Untriaged)