New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 688850 link

Starred by 3 users

Issue metadata

Status: WontFix
Owner:
Closed: May 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Telemetry autotests are flaky with TimeoutException and DevtoolsTargetCrashException

Project Member Reported by derat@chromium.org, Feb 5 2017

Issue description

I wrote a simple UI autotest that logs in using chrome.Chrome() and then attempts to verify that a webpage can be loaded using:

  tab = cr.browser.tabs.New()
  tab.Navigate(url)
  tab.WaitForDocumentReadyStateToBeComplete()
  content = tab.EvaluateJavaScript('document.documentElement.innerText')

(You can see an updated version of it at https://chromium-review.googlesource.com/c/436766/.)

I ran the test in a loop from my workstation against a lab device using test_that. The first time through, I ran it 338 times and saw 7 failures in the page-loading part of the test due to either TimeoutException or DevtoolsTargetCrashException, similar to:

...
02/02 18:10:26.622 INFO |desktopui_ChromeSa:0054| Successfully logged in as "test@test.test"
02/02 18:10:26.962 INFO |desktopui_ChromeSa:0067| Loading http://127.0.0.1:36615/test.html...
02/02 18:12:27.060 INFO |    cros_interface:0539| (Re)starting the ui (logs the user out)
02/02 18:12:28.192 WARNI|              test:0606| Autotest caught exception when running test:
Traceback (most recent call last):
  File "/usr/local/autotest/common_lib/test.py", line 600, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/common_lib/test.py", line 810, in _call_test_function
    raise error.UnhandledTestFail(e)
UnhandledTestFail: Unhandled TimeoutException: 
********************************************************************************
(/usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend.py:483 _ConvertExceptionFromInspectorWebsocket) The app is probably crashed:
...

I couldn't find anything in Chrome or session_manager's logs suggesting that a crash occurred.

Next, I changed the test to catch both of these and try loading the page up to three times. Now the test ran 764 times with 1 failure due to a different exception, DevToolsClientUrlError. It looks like the test caught one of earlier exceptions and retried, but Telemetry restarted the UI at that point and the new exception was thrown:

...
02/05 02:52:46.200 INFO |desktopui_ChromeSa:0059| Successfully logged in as "test@test.test"
02/05 02:52:46.940 INFO |desktopui_ChromeSa:0077| Loading http://127.0.0.1:47856/test.html...
02/05 02:52:53.304 ERROR|desktopui_ChromeSa:0089| Telemetry error while asking Chrome to load page: Devtools target crashed
...
02/05 02:52:53.342 INFO |desktopui_ChromeSa:0091| Retrying page load
02/05 02:52:53.343 INFO |desktopui_ChromeSa:0077| Loading http://127.0.0.1:47856/test.html...
02/05 02:52:53.441 INFO |    cros_interface:0539| (Re)starting the ui (logs the user out)
02/05 02:53:21.497 WARNI|              test:0606| Autotest caught exception when running test:
Traceback (most recent call last):
  File "/usr/local/autotest/common_lib/test.py", line 600, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/common_lib/test.py", line 810, in _call_test_function
    raise error.UnhandledTestFail(e)
UnhandledTestFail: Unhandled DevToolsClientUrlError: [Errno 111] Connection refused
...

I suspect that it's not possible to recover from the first exception. (I didn't see any other runs where an exception was thrown at all, so I think I was just lucky and only encountered it once.)

Any ideas about the cause of these failures?

When the test passes, the output looks like:

02/05 07:59:40.512 INFO |desktopui_ChromeSa:0061| Successfully logged in as "test@test.test"
02/05 07:59:40.754 INFO |desktopui_ChromeSa:0079| Loading http://127.0.0.1:57092/test.html...
02/05 07:59:57.883 INFO |desktopui_ChromeSa:0105| Saw expected content
02/05 07:59:57.910 INFO |    cros_interface:0539| (Re)starting the ui (logs the user out)

In the earlier log snippet, it looks like the UI was restarted before the test received the page content from Chrome. Why?
 
Labels: -Pri-3 Pri-2
Dan, can you take a screenshot when the test fails? You can invoke /usr/local/autotest/bin/screenshot.py, and this might give a better idea of what's going on.

Also, do you mind trying to add a short sleep (like 2 seconds) after WaitForDocumentReadyStateToBeComplete and see if this makes a difference.

We have numerous tests that do something similar, both perf and functional, so this ought to work pretty reliably. You can do a search of WaitForDocumentReadyStateToBeComplete in the chromeos repo and you should be able to see them.

There isn't anything chromeos/autotest specific here. This looks like a devtools problem if anything.
I believe DevtoolsTargetCrashException is a renderer crash? DevToolsClientUrlError seems like chrome was restarted?
Cc: xiy...@chromium.org abodenha@chromium.org
Labels: -Pri-2 Hotlist-CrOS-Gardener Pri-1
Flaky tests are bad. Upping the priority on this.

Comment 4 by derat@chromium.org, Feb 7 2017

screenshot.py seems like it may be old and possibly broken; the only tests that use it appear to be graphics_Sanity (which stops the UI job before doing anything) and graphics_VTSwitch (which looks like it still thinks X is present).

The client.cros.graphics.graphics_utils module's take_screenshot() function appears to work, though, so I've added a call to that on failure. Thanks!


I'm pretty sure screenshot.py works. We use it in telemetry:
https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/core/cros_interface.py?l=498

You need to pass in a path with a .png suffix.

If you're happy with take_screenshot(), that's fine by me.

Comment 6 by derat@chromium.org, Feb 7 2017

Oh, huh... This is what I was seeing initially:

localhost ~ # /usr/local/autotest/bin/screenshot.py /tmp/sshot.png
Traceback (most recent call last):
  File "/usr/local/autotest/bin/screenshot.py", line 21, in <module>
    image = crtcScreenshot(args.crtc)
  File "/usr/local/autotest/cros/graphics/drm.py", line 687, in crtcScreenshot
    "Unable to take screenshot. There may not be anything on the screen.")
RuntimeError: Unable to take screenshot. There may not be anything on the screen.

It looks like it's working now, though. The 'ui' job was running before, so I'm not sure what changed. Do you know if there are additional requirements for screenshot.py to run successfully?
Cc: ihf@chromium.org
Not that I know of. Ilja might know more about this script.

Comment 8 by derat@chromium.org, Feb 8 2017

After looking at some screenshots taken right after the exception is seen, it appears that a second blank tab is always open (see attached screenshot).

Here's a stack trace:

(/usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/tab_list_backend.py:111 _HandleDevToolsConnectionError) Th
e browser exists and can be reached. The devtools target probably crashed.
Traceback (most recent call last):
  File "/usr/local/autotest/common_lib/test.py", line 804, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/common_lib/test.py", line 461, in execute
    dargs)
  File "/usr/local/autotest/common_lib/test.py", line 347, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/common_lib/test.py", line 376, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/tests/desktopui_ChromeSanity/desktopui_ChromeSanity.py", line 79, in run_once
    tab = cr.browser.tabs.New()
  File "/usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/browser/tab_list.py", line 9, in New
    return self._tab_list_backend.New(timeout)
  File "/usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/tab_list_backend.py", line 43, in New
    return self.GetBackendFromContextId(context_id)
  File "/usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend_list.py", line 78, in GetBackendFromContextId
    raise e
TimeoutException: 

---

And the corresponding code:

 55         with chrome.Chrome() as cr:
...
 73             cr.browser.platform.SetHTTPServerDirectories(self.bindir)
 74             url = cr.browser.platform.http_server.UrlOf(
 75                     os.path.join(self.bindir, self._TEST_FILENAME))
 76             logging.info('Loading %s...', url)
 77 
 78             try:
 79                 # Figure out why this is flaky:  http://crbug.com/688850 
 80                 tab = cr.browser.tabs.New()
 81                 tab.Navigate(url)
 82                 tab.WaitForDocumentReadyStateToBeComplete()
 83                 content = tab.EvaluateJavaScript(
 84                         'document.documentElement.innerText')
 85                 if content != self._TEST_CONTENT:
 86                     raise error.TestFail(
 87                             'Expected page content "%s" but got "%s"' %
 88                             (self._TEST_CONTENT, content))
 89                 logging.info('Saw expected content')
 90             except:
 91                 prefix = 'screenshot-%s' % time.strftime('%Y%m%d-%H%M%S')
 92                 logging.info(
 93                         'Got exception; saving screenshot to %s/%s',
 94                         self._SCREENSHOT_DIR, prefix)

So sleeping after WaitForDocumentReadyStateToBeComplete won't help, since it's the earlier call to cr.browser.tabs.New that generates the exception. I'll try sleeping before that call instead, although it's still troubling if it turns out that the object returned by chrome.Chrome() isn't immediately usable in some cases.
screenshot-20170207-105105-0.png
17.6 KB View Download

Comment 9 by ihf@chromium.org, Feb 8 2017

Cc: dbehr@chromium.org
screenshot.py is still used. Dominik worked on it recently. That said it is unable to take a screenshot when nothing is displayed, which is what you may run into here.
For the record, screenshot.py calls into the same screenshot function (drm.crtcScreenshot()) as take_screenshot()

Taking a screenshot will fail with "Unable to take screenshot. There may not be anything on the screen." most likely if there is no video mode set, so there is nothing to take screenshot of.
Thanks for the info on screenshot.py. Maybe Chrome was hanging earlier and hadn't set a video mode, or something like that.

On the reliable-testing front, after sleeping for 3 seconds before interacting with the chrome.Chrome object (updated version of the test is at https://chromium-review.googlesource.com/c/436766/), I've seen 0 failures in 530 runs.

Achuith, do you have any guesses about what could be going on here? It seems like existing tests may also be affected by this, since I believe that many don't (and shouldn't) explicitly sleep after logging in.
Cc: dgozman@chromium.org
I don't think it's the chrome.Chrome object that's not available. You could probably move the sleep to right before the tabs.New() command and still see the same behavior. I believe tabs.New() is the first time we interact with DevTools, and there's probably something that failed to get initialized in time.

A LOT of our autotests have JOB_RETRIES set to paper over flakes like this:
https://cs.corp.google.com/search/?q=package:%5Echromeos_public$+file:control+JOB_RETRIES&type=cs

We also have suite-level retries, so the entire VM test suite gets retried (the suite takes an hour to run) if a test fails.

It's very worthwhile tracking down the root cause of this problem. When you run the test, you ought to be able to see the devtools traffic from the python side (you may need to increase logging verbosity).

This could likely be a ChromeOS specific bug, but possibly not - I've seen some telemetry flakiness on the main waterfall bots.

I think dgozman@ and pfeldman@ might be able to help us figure out how to debug this, so adding them to cc (dgozman appears to be OOO for another week). I'm not able to see any additional VLOGs in devtools code, so we might need to add some additional logging to figure out what's going wrong.

Another thing to consider would be for telemetry to detect the first communication with DevTools, and retry when this fails. I can add some code along these lines if we think this is the right solution. 
Cc: pfeldman@chromium.org
Owner: nednguyen@chromium.org
Status: Assigned (was: Untriaged)
Ned, could you please take a look?
Owner: achuith@chromium.org
Pinging the bug as gardener. Are there any updates?
Lloyd - this shouldn't be affecting the PFQ, unless you've found some evidence of a problem?
Issue 715808 has been merged into this issue.

Comment 19 by ihf@chromium.org, Apr 30 2017

This may as of yesterday be fixed/stable (at least logging into Chrome is). Worth checking once more.
Status: WontFix (was: Assigned)
Thanks! I ran my test, minus the timeout, in a loop against a lumpy device in the lab and saw 302 successes (before I made the mistake of syncing my checkout, which broke everything :-/). This looks like it's reliable now.

Sign in to add a comment