Flaky failures on win7_chromium_rel_ng and win_optional_gpu_tests_rel - Browser start failed |
||||||||
Issue descriptionI'm seeing a lot of failed runs on win_optional_gpu_tests_rel with the same pattern: the browser fails to start three times in a row. Here are five recent examples: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_optional_gpu_tests_rel/12538 https://chromium-swarm.appspot.com/user/task/41e173531a44a810 https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_optional_gpu_tests_rel/12604 https://chromium-swarm.appspot.com/user/task/41e5e3eec9e09410 https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_optional_gpu_tests_rel/12629 https://chromium-swarm.appspot.com/user/task/41e6f279bb744b10 https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_optional_gpu_tests_rel/12609 https://chromium-swarm.appspot.com/user/task/41e6122c6205b110 https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_optional_gpu_tests_rel/12607 https://chromium-swarm.appspot.com/user/task/41e605fe1fe29e10
,
Dec 21
Does anyone have any ideas about what could be going wrong here? During the 3 attempted launches, DevTools chooses a different port every time. From https://chromium-swarm.appspot.com/task?id=41e173531a44a810&refresh=10&show_raw=1 : """ DevTools listening on ws://127.0.0.1:61539/devtools/browser/cac57439-ad9f-4fbe-ae8c-40ab9d2fb7ab I DevTools listening on ws://127.0.0.1:61571/devtools/browser/1274c48e-032a-477d-ac64-ec6a551606c9 [ DevTools listening on ws://127.0.0.1:61597/devtools/browser/b717aed0-9c1f-472c-8d54-c7e37fbc2be5 [ """ So if for some reason one of the ports had a stray process blocking it, that would have been handled during the retries. Could TsProxy be broken? I note that it's started only once during the run: """ INFO:root:Tsproxy commandline: ['c:\\b\\s\\w\\ir\\.swarming_module_cache\\vpython\\587cd8\\Scripts\\python.exe', 'c:\\b\\s\\w\\ir\\third_party\\catapult\\telemetry\\third_party\\tsproxy\\tsproxy.py', '--port=0'] INFO:root:TsProxy port: 61538 INFO:root:DoNothingForwarder started between 127.0.0.1:61538 and 61538 """ Any other ideas? It's important to get to the bottom of this. Thanks.
,
Dec 21
I'be been meaning to write better docs for this (issue 859813), but for now I have this: https://docs.google.com/document/d/1l4CIqZACAlyogsbPESz0IY3FS4bS7f-DimVQC1uM-os/edit#heading=h.u1qm2dk5idpx So, it sounds correct. TsProxy is created once (1), serving content from 61538 in your example. The forwarder (2) does nothing, so we still have TsProxy serving content on 61538. Let's skip (3). Looks like there is no WPR in your case. This info is passed in the command line (4) as --proxy-server=socks://localhost:61538 on each of the 3 attempts to start Chrome (5). Now on each of those attempts Chrome ends up picking a different port for the DevTools connection; which is fine and makes sense. This is the port that would be used to send commands from the DevTools client (7) in Telemetry back to Chrome; and is distinct from the port used to serve content from TsProxy to the browser. Also note, the fact that we see: INFO:root:Browser started (pid=4956). and all the browser info means that the DevTools connection *was* successful. Moreover, the error that eventually kills the test run says: (c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome\tab_list_backend.py:113 _HandleDevToolsConnectionError) The browser exists and can be reached. The devtools target probably crashed. <traceback object at 0x0000000005CA4488> So this is not an issue with the network controller and its connections. The timeout occurs, later when trying to reach a tab in the browser: cls.StartBrowser() StartBrowser at c:\b\s\w\ir\content\test\gpu\gpu_tests\gpu_integration_test.py:110 cls.tab = cls.browser.tabs[0] __getitem__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\browser\tab_list.py:18 return self._tab_list_backend.__getitem__(index) __getitem__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend_list.py:64 return self.GetBackendFromContextId(context_id) GetBackendFromContextId at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend_list.py:75 context_id) GetInspectorBackend at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\devtools_client_backend.py:594 self._app_backend.app, self._devtools_client, context) traced_function at c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py:52 return func(*args, **kwargs) __init__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:82 self._ConvertExceptionFromInspectorWebsocket(e) traced_function at c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py:52 return func(*args, **kwargs) __init__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:72 self._log = inspector_log.InspectorLog(self._websocket) __init__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_log.py:19 self._Enable() _Enable at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_log.py:29 self._inspector_websocket.SyncRequest({'method': 'Log.enable'}, timeout) maybe the tab crashed?
,
Dec 21
,
Dec 21
,
Dec 21
Same issue occurred in Issue 917388 on win7_chromium_rel_ng. CL: https://chromium-review.googlesource.com/c/chromium/src/+/1388246/2 Failing build: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win7_chromium_rel_ng/157047 Build succeeds on retry: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win7_chromium_rel_ng/157073 Swarming task: https://chromium-swarm.appspot.com/task?id=41e79f2ee4942210&refresh=10&show_raw=1 The browser failed to start 3 times in a row. No minidump was left by the renderer process. In the end, Swarming had to hard-kill several processes, and the entire shard failed as a consequence. What can we do to improve reliability in this area? It's distressing that the retry mechanism we've added to Telemetry isn't working. Do we need to tear down the whole Telemetry harness and try bringing it back up if this happens? ----- ... WARNING:root:Browser start failed (attempt 3 of 3) WARNING:root:Either tab has crashed or browser does not support taking tab screenshot. Skip taking screenshot on failure. WARNING:root:GpuIntegrationTest unable to take screenshot INFO:root:Try printing formatted exception: <class 'telemetry.core.exceptions.TimeoutException'> ******************************************************************************** (c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:513 _ConvertExceptionFromInspectorWebsocket) The app is probably crashed: Found Minidump: False Stack Trace: ******************************************************************************** No crash dump found. ******************************************************************************** Standard output: ******************************************************************************** ******************************************************************************** System log: (Not implemented) ******************************************************************************** (c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:518 _ConvertExceptionFromInspectorWebsocket) Original exception: WebSocketException of type <class 'websocket._exceptions.WebSocketTimeoutException'>. Error message: timed out ******************************************************************************** (c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:539 _AddDebuggingInformation) Received a socket error in the browser connection and the tab still exists. The operation probably timed out. ******************************************************************************** (c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:540 _AddDebuggingInformation) Debugger url: ws://127.0.0.1:60293/devtools/page/F4ADB14C8BF21FFD58694B2BB1D04006 ******************************************************************************** (c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome\tab_list_backend.py:113 _HandleDevToolsConnectionError) The browser exists and can be reached. The devtools target probably crashed. <traceback object at 0x0000000005289348> Traceback (most recent call last): <module> at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\testing\run_browser_tests.py:399 ret_code = RunTests(sys.argv[1:]) RunTests at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\testing\run_browser_tests.py:341 ret, _, _ = runner.run() run at c:\b\s\w\ir\third_party\catapult\third_party\typ\typ\runner.py:199 result_set, test_set.copy(), all_tests) _run_tests at c:\b\s\w\ir\third_party\catapult\third_party\typ\typ\runner.py:506 self._run_one_set(self.stats, result_set, test_set) _run_one_set at c:\b\s\w\ir\third_party\catapult\third_party\typ\typ\runner.py:548 test_set.isolated_tests, 1) _run_list at c:\b\s\w\ir\third_party\catapult\third_party\typ\typ\runner.py:574 _setup_process, _teardown_process) make_pool at c:\b\s\w\ir\third_party\catapult\third_party\typ\typ\pool.py:28 return _AsyncPool(host, jobs, callback, context, pre_fn, post_fn) __init__ at c:\b\s\w\ir\third_party\catapult\third_party\typ\typ\pool.py:188 self.context_after_pre = pre_fn(self.host, 1, self.context) _setup_process at c:\b\s\w\ir\third_party\catapult\third_party\typ\typ\runner.py:849 child.context_after_setup = child.setup_fn(child, child.context) _SetUpProcess at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\testing\run_browser_tests.py:389 context.test_class.SetUpProcess() SetUpProcess at c:\b\s\w\ir\content\test\gpu\gpu_tests\info_collection_test.py:36 cls.StartBrowser() StartBrowser at c:\b\s\w\ir\content\test\gpu\gpu_tests\gpu_integration_test.py:110 cls.tab = cls.browser.tabs[0] __getitem__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\browser\tab_list.py:18 return self._tab_list_backend.__getitem__(index) __getitem__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend_list.py:64 return self.GetBackendFromContextId(context_id) GetBackendFromContextId at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend_list.py:75 context_id) GetInspectorBackend at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\devtools_client_backend.py:594 self._app_backend.app, self._devtools_client, context) traced_function at c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py:52 return func(*args, **kwargs) __init__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:82 self._ConvertExceptionFromInspectorWebsocket(e) traced_function at c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py:52 return func(*args, **kwargs) __init__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:72 self._log = inspector_log.InspectorLog(self._websocket) __init__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_log.py:19 self._Enable() _Enable at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_log.py:29 self._inspector_websocket.SyncRequest({'method': 'Log.enable'}, timeout) SyncRequest at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_websocket.py:132 res = self._Receive(timeout) _Receive at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_websocket.py:181 raise WebSocketException(err)
,
Dec 21
Note: it seems unlikely that the renderers would be reliably crashing upon startup just on this machine. The same binary is sent to other machines, and subsequent runs of different binaries on the same machine work fine. If the renderer were actually crashing, I'd expect a minidump to be generated - that mechanism is pretty well tested on all platforms. Is there more logging we can turn on to help diagnose this?
,
Jan 2
,
Jan 2
I'm assigning this to myself to make sure Telemetry retries work correctly.
,
Jan 2
Thanks Caleb. For the record, here are the last 200 builds on these two tryservers: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_optional_gpu_tests_rel?limit=200 https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win7_chromium_rel_ng?limit=200 Need to re-triage recent failures to see if this has been happening recently. If not, we can close this as WontFix.
,
Jan 3
I'm not seeing any recent failures. I'm closing this for now. Please reopen if this happens again. I agree that it is suspicious that Telemetry suspects a crash but can't get a crash dump.
,
Today
(5 hours ago)
It happened again. https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win-angle-rel/280 https://chromium-swarm.appspot.com/task?id=4290b5364539e810&refresh=10&show_raw=1 |
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by jdarpinian@chromium.org
, Dec 21