New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 917211 link

Starred by 2 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug



Sign in to add a comment

Flaky failures on win7_chromium_rel_ng and win_optional_gpu_tests_rel - Browser start failed

Project Member Reported by jdarpinian@chromium.org, Dec 21

Issue description

Components: Platform>Apps>DevTools
I don't see any crashes in the log. It says:

WARNING:root:Browser start failed (attempt 3 of 3)

and then 

(c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome\tab_list_backend.py:113 _HandleDevToolsConnectionError) The browser exists and can be reached. The devtools target probably crashed. <traceback object at 0x0000000005BE0D08>

Cc: eyaich@chromium.org crouleau@chromium.org perezju@chromium.org
Does anyone have any ideas about what could be going wrong here?

During the 3 attempted launches, DevTools chooses a different port every time. From https://chromium-swarm.appspot.com/task?id=41e173531a44a810&refresh=10&show_raw=1 :

"""
DevTools listening on ws://127.0.0.1:61539/devtools/browser/cac57439-ad9f-4fbe-ae8c-40ab9d2fb7ab
I
DevTools listening on ws://127.0.0.1:61571/devtools/browser/1274c48e-032a-477d-ac64-ec6a551606c9
[
DevTools listening on ws://127.0.0.1:61597/devtools/browser/b717aed0-9c1f-472c-8d54-c7e37fbc2be5
[
"""

So if for some reason one of the ports had a stray process blocking it, that would have been handled during the retries.

Could TsProxy be broken? I note that it's started only once during the run:

"""
INFO:root:Tsproxy commandline: ['c:\\b\\s\\w\\ir\\.swarming_module_cache\\vpython\\587cd8\\Scripts\\python.exe', 'c:\\b\\s\\w\\ir\\third_party\\catapult\\telemetry\\third_party\\tsproxy\\tsproxy.py', '--port=0']
INFO:root:TsProxy port: 61538
INFO:root:DoNothingForwarder started between 127.0.0.1:61538 and 61538
"""

Any other ideas? It's important to get to the bottom of this. Thanks.

I'be been meaning to write better docs for this (issue 859813), but for now I have this:
https://docs.google.com/document/d/1l4CIqZACAlyogsbPESz0IY3FS4bS7f-DimVQC1uM-os/edit#heading=h.u1qm2dk5idpx

So, it sounds correct. TsProxy is created once (1), serving content from 61538 in your example. The forwarder (2) does nothing, so we still have TsProxy serving content on 61538.

Let's skip (3). Looks like there is no WPR in your case.

This info is passed in the command line (4) as --proxy-server=socks://localhost:61538 on each of the 3 attempts to start Chrome (5).

Now on each of those attempts Chrome ends up picking a different port for the DevTools connection; which is fine and makes sense. This is the port that would be used to send commands from the DevTools client (7) in Telemetry back to Chrome; and is distinct from the port used to serve content from TsProxy to the browser.

Also note, the fact that we see:

  INFO:root:Browser started (pid=4956).

and all the browser info means that the DevTools connection *was* successful.

Moreover, the error that eventually kills the test run says:

(c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome\tab_list_backend.py:113 _HandleDevToolsConnectionError) The browser exists and can be reached. The devtools target probably crashed. <traceback object at 0x0000000005CA4488>

So this is not an issue with the network controller and its connections. The timeout occurs, later when trying to reach a tab in the browser:

    cls.StartBrowser()
  StartBrowser at c:\b\s\w\ir\content\test\gpu\gpu_tests\gpu_integration_test.py:110
    cls.tab = cls.browser.tabs[0]
  __getitem__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\browser\tab_list.py:18
    return self._tab_list_backend.__getitem__(index)
  __getitem__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend_list.py:64
    return self.GetBackendFromContextId(context_id)
  GetBackendFromContextId at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend_list.py:75
    context_id)
  GetInspectorBackend at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\devtools_client_backend.py:594
    self._app_backend.app, self._devtools_client, context)
  traced_function at c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py:52
    return func(*args, **kwargs)
  __init__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:82
    self._ConvertExceptionFromInspectorWebsocket(e)
  traced_function at c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py:52
    return func(*args, **kwargs)
  __init__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:72
    self._log = inspector_log.InspectorLog(self._websocket)
  __init__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_log.py:19
    self._Enable()
  _Enable at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_log.py:29
    self._inspector_websocket.SyncRequest({'method': 'Log.enable'}, timeout)
 
maybe the tab crashed?
Labels: Infra-Platform-Test
Cc: estaab@chromium.org
 Issue 917388  has been merged into this issue.
Summary: Flaky failures on win7_chromium_rel_ng and win_optional_gpu_tests_rel - Browser start failed (was: Flaky trybot win_optional_gpu_tests_rel - Browser start failed)
Same issue occurred in  Issue 917388  on win7_chromium_rel_ng.

CL:
https://chromium-review.googlesource.com/c/chromium/src/+/1388246/2

Failing build: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win7_chromium_rel_ng/157047

Build succeeds on retry:
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win7_chromium_rel_ng/157073

Swarming task:
https://chromium-swarm.appspot.com/task?id=41e79f2ee4942210&refresh=10&show_raw=1

The browser failed to start 3 times in a row. No minidump was left by the renderer process. In the end, Swarming had to hard-kill several processes, and the entire shard failed as a consequence.

What can we do to improve reliability in this area? It's distressing that the retry mechanism we've added to Telemetry isn't working. Do we need to tear down the whole Telemetry harness and try bringing it back up if this happens?

-----
...
WARNING:root:Browser start failed (attempt 3 of 3)
WARNING:root:Either tab has crashed or browser does not support taking tab screenshot. Skip taking screenshot on failure.
WARNING:root:GpuIntegrationTest unable to take screenshot
INFO:root:Try printing formatted exception: <class 'telemetry.core.exceptions.TimeoutException'> 
********************************************************************************
(c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:513 _ConvertExceptionFromInspectorWebsocket) The app is probably crashed:

Found Minidump: False
Stack Trace:
********************************************************************************
	No crash dump found.
********************************************************************************
Standard output:
********************************************************************************
********************************************************************************
System log:
(Not implemented)
********************************************************************************
(c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:518 _ConvertExceptionFromInspectorWebsocket) Original exception:
WebSocketException of type <class 'websocket._exceptions.WebSocketTimeoutException'>. Error message: timed out
********************************************************************************
(c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:539 _AddDebuggingInformation) Received a socket error in the browser connection and the tab still exists. The operation probably timed out.
********************************************************************************
(c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:540 _AddDebuggingInformation) Debugger url: ws://127.0.0.1:60293/devtools/page/F4ADB14C8BF21FFD58694B2BB1D04006
********************************************************************************
(c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome\tab_list_backend.py:113 _HandleDevToolsConnectionError) The browser exists and can be reached. The devtools target probably crashed. <traceback object at 0x0000000005289348>

Traceback (most recent call last):
  <module> at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\testing\run_browser_tests.py:399
    ret_code = RunTests(sys.argv[1:])
  RunTests at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\testing\run_browser_tests.py:341
    ret, _, _ = runner.run()
  run at c:\b\s\w\ir\third_party\catapult\third_party\typ\typ\runner.py:199
    result_set, test_set.copy(), all_tests)
  _run_tests at c:\b\s\w\ir\third_party\catapult\third_party\typ\typ\runner.py:506
    self._run_one_set(self.stats, result_set, test_set)
  _run_one_set at c:\b\s\w\ir\third_party\catapult\third_party\typ\typ\runner.py:548
    test_set.isolated_tests, 1)
  _run_list at c:\b\s\w\ir\third_party\catapult\third_party\typ\typ\runner.py:574
    _setup_process, _teardown_process)
  make_pool at c:\b\s\w\ir\third_party\catapult\third_party\typ\typ\pool.py:28
    return _AsyncPool(host, jobs, callback, context, pre_fn, post_fn)
  __init__ at c:\b\s\w\ir\third_party\catapult\third_party\typ\typ\pool.py:188
    self.context_after_pre = pre_fn(self.host, 1, self.context)
  _setup_process at c:\b\s\w\ir\third_party\catapult\third_party\typ\typ\runner.py:849
    child.context_after_setup = child.setup_fn(child, child.context)
  _SetUpProcess at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\testing\run_browser_tests.py:389
    context.test_class.SetUpProcess()
  SetUpProcess at c:\b\s\w\ir\content\test\gpu\gpu_tests\info_collection_test.py:36
    cls.StartBrowser()
  StartBrowser at c:\b\s\w\ir\content\test\gpu\gpu_tests\gpu_integration_test.py:110
    cls.tab = cls.browser.tabs[0]
  __getitem__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\browser\tab_list.py:18
    return self._tab_list_backend.__getitem__(index)
  __getitem__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend_list.py:64
    return self.GetBackendFromContextId(context_id)
  GetBackendFromContextId at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend_list.py:75
    context_id)
  GetInspectorBackend at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\devtools_client_backend.py:594
    self._app_backend.app, self._devtools_client, context)
  traced_function at c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py:52
    return func(*args, **kwargs)
  __init__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:82
    self._ConvertExceptionFromInspectorWebsocket(e)
  traced_function at c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py:52
    return func(*args, **kwargs)
  __init__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py:72
    self._log = inspector_log.InspectorLog(self._websocket)
  __init__ at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_log.py:19
    self._Enable()
  _Enable at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_log.py:29
    self._inspector_websocket.SyncRequest({'method': 'Log.enable'}, timeout)
  SyncRequest at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_websocket.py:132
    res = self._Receive(timeout)
  _Receive at c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_websocket.py:181
    raise WebSocketException(err)

Note: it seems unlikely that the renderers would be reliably crashing upon startup just on this machine. The same binary is sent to other machines, and subsequent runs of different binaries on the same machine work fine. If the renderer were actually crashing, I'd expect a minidump to be generated - that mechanism is pretty well tested on all platforms.

Is there more logging we can turn on to help diagnose this?

Components: -Infra>Client>Chrome
Owner: crouleau@chromium.org
Status: Assigned (was: Untriaged)
I'm assigning this to myself to make sure Telemetry retries work correctly.
Thanks Caleb. For the record, here are the last 200 builds on these two tryservers:

https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_optional_gpu_tests_rel?limit=200
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win7_chromium_rel_ng?limit=200

Need to re-triage recent failures to see if this has been happening recently. If not, we can close this as WontFix.

Status: WontFix (was: Assigned)
I'm not seeing any recent failures. I'm closing this for now. Please reopen if this happens again. I agree that it is suspicious that Telemetry suspects a crash but can't get a crash dump.

Sign in to add a comment