Telemetry autotests are flaky with TimeoutException and DevtoolsTargetCrashException |
|||||||||
Issue description
I wrote a simple UI autotest that logs in using chrome.Chrome() and then attempts to verify that a webpage can be loaded using:
tab = cr.browser.tabs.New()
tab.Navigate(url)
tab.WaitForDocumentReadyStateToBeComplete()
content = tab.EvaluateJavaScript('document.documentElement.innerText')
(You can see an updated version of it at https://chromium-review.googlesource.com/c/436766/.)
I ran the test in a loop from my workstation against a lab device using test_that. The first time through, I ran it 338 times and saw 7 failures in the page-loading part of the test due to either TimeoutException or DevtoolsTargetCrashException, similar to:
...
02/02 18:10:26.622 INFO |desktopui_ChromeSa:0054| Successfully logged in as "test@test.test"
02/02 18:10:26.962 INFO |desktopui_ChromeSa:0067| Loading http://127.0.0.1:36615/test.html...
02/02 18:12:27.060 INFO | cros_interface:0539| (Re)starting the ui (logs the user out)
02/02 18:12:28.192 WARNI| test:0606| Autotest caught exception when running test:
Traceback (most recent call last):
File "/usr/local/autotest/common_lib/test.py", line 600, in _exec
_call_test_function(self.execute, *p_args, **p_dargs)
File "/usr/local/autotest/common_lib/test.py", line 810, in _call_test_function
raise error.UnhandledTestFail(e)
UnhandledTestFail: Unhandled TimeoutException:
********************************************************************************
(/usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend.py:483 _ConvertExceptionFromInspectorWebsocket) The app is probably crashed:
...
I couldn't find anything in Chrome or session_manager's logs suggesting that a crash occurred.
Next, I changed the test to catch both of these and try loading the page up to three times. Now the test ran 764 times with 1 failure due to a different exception, DevToolsClientUrlError. It looks like the test caught one of earlier exceptions and retried, but Telemetry restarted the UI at that point and the new exception was thrown:
...
02/05 02:52:46.200 INFO |desktopui_ChromeSa:0059| Successfully logged in as "test@test.test"
02/05 02:52:46.940 INFO |desktopui_ChromeSa:0077| Loading http://127.0.0.1:47856/test.html...
02/05 02:52:53.304 ERROR|desktopui_ChromeSa:0089| Telemetry error while asking Chrome to load page: Devtools target crashed
...
02/05 02:52:53.342 INFO |desktopui_ChromeSa:0091| Retrying page load
02/05 02:52:53.343 INFO |desktopui_ChromeSa:0077| Loading http://127.0.0.1:47856/test.html...
02/05 02:52:53.441 INFO | cros_interface:0539| (Re)starting the ui (logs the user out)
02/05 02:53:21.497 WARNI| test:0606| Autotest caught exception when running test:
Traceback (most recent call last):
File "/usr/local/autotest/common_lib/test.py", line 600, in _exec
_call_test_function(self.execute, *p_args, **p_dargs)
File "/usr/local/autotest/common_lib/test.py", line 810, in _call_test_function
raise error.UnhandledTestFail(e)
UnhandledTestFail: Unhandled DevToolsClientUrlError: [Errno 111] Connection refused
...
I suspect that it's not possible to recover from the first exception. (I didn't see any other runs where an exception was thrown at all, so I think I was just lucky and only encountered it once.)
Any ideas about the cause of these failures?
When the test passes, the output looks like:
02/05 07:59:40.512 INFO |desktopui_ChromeSa:0061| Successfully logged in as "test@test.test"
02/05 07:59:40.754 INFO |desktopui_ChromeSa:0079| Loading http://127.0.0.1:57092/test.html...
02/05 07:59:57.883 INFO |desktopui_ChromeSa:0105| Saw expected content
02/05 07:59:57.910 INFO | cros_interface:0539| (Re)starting the ui (logs the user out)
In the earlier log snippet, it looks like the UI was restarted before the test received the page content from Chrome. Why?
,
Feb 6 2017
I believe DevtoolsTargetCrashException is a renderer crash? DevToolsClientUrlError seems like chrome was restarted?
,
Feb 6 2017
Flaky tests are bad. Upping the priority on this.
,
Feb 7 2017
screenshot.py seems like it may be old and possibly broken; the only tests that use it appear to be graphics_Sanity (which stops the UI job before doing anything) and graphics_VTSwitch (which looks like it still thinks X is present). The client.cros.graphics.graphics_utils module's take_screenshot() function appears to work, though, so I've added a call to that on failure. Thanks!
,
Feb 7 2017
I'm pretty sure screenshot.py works. We use it in telemetry: https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/core/cros_interface.py?l=498 You need to pass in a path with a .png suffix. If you're happy with take_screenshot(), that's fine by me.
,
Feb 7 2017
Oh, huh... This is what I was seeing initially:
localhost ~ # /usr/local/autotest/bin/screenshot.py /tmp/sshot.png
Traceback (most recent call last):
File "/usr/local/autotest/bin/screenshot.py", line 21, in <module>
image = crtcScreenshot(args.crtc)
File "/usr/local/autotest/cros/graphics/drm.py", line 687, in crtcScreenshot
"Unable to take screenshot. There may not be anything on the screen.")
RuntimeError: Unable to take screenshot. There may not be anything on the screen.
It looks like it's working now, though. The 'ui' job was running before, so I'm not sure what changed. Do you know if there are additional requirements for screenshot.py to run successfully?
,
Feb 7 2017
Not that I know of. Ilja might know more about this script.
,
Feb 8 2017
After looking at some screenshots taken right after the exception is seen, it appears that a second blank tab is always open (see attached screenshot).
Here's a stack trace:
(/usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/tab_list_backend.py:111 _HandleDevToolsConnectionError) Th
e browser exists and can be reached. The devtools target probably crashed.
Traceback (most recent call last):
File "/usr/local/autotest/common_lib/test.py", line 804, in _call_test_function
return func(*args, **dargs)
File "/usr/local/autotest/common_lib/test.py", line 461, in execute
dargs)
File "/usr/local/autotest/common_lib/test.py", line 347, in _call_run_once_with_retry
postprocess_profiled_run, args, dargs)
File "/usr/local/autotest/common_lib/test.py", line 376, in _call_run_once
self.run_once(*args, **dargs)
File "/usr/local/autotest/tests/desktopui_ChromeSanity/desktopui_ChromeSanity.py", line 79, in run_once
tab = cr.browser.tabs.New()
File "/usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/browser/tab_list.py", line 9, in New
return self._tab_list_backend.New(timeout)
File "/usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome/tab_list_backend.py", line 43, in New
return self.GetBackendFromContextId(context_id)
File "/usr/local/telemetry/src/third_party/catapult/telemetry/telemetry/internal/backends/chrome_inspector/inspector_backend_list.py", line 78, in GetBackendFromContextId
raise e
TimeoutException:
---
And the corresponding code:
55 with chrome.Chrome() as cr:
...
73 cr.browser.platform.SetHTTPServerDirectories(self.bindir)
74 url = cr.browser.platform.http_server.UrlOf(
75 os.path.join(self.bindir, self._TEST_FILENAME))
76 logging.info('Loading %s...', url)
77
78 try:
79 # Figure out why this is flaky: http://crbug.com/688850
80 tab = cr.browser.tabs.New()
81 tab.Navigate(url)
82 tab.WaitForDocumentReadyStateToBeComplete()
83 content = tab.EvaluateJavaScript(
84 'document.documentElement.innerText')
85 if content != self._TEST_CONTENT:
86 raise error.TestFail(
87 'Expected page content "%s" but got "%s"' %
88 (self._TEST_CONTENT, content))
89 logging.info('Saw expected content')
90 except:
91 prefix = 'screenshot-%s' % time.strftime('%Y%m%d-%H%M%S')
92 logging.info(
93 'Got exception; saving screenshot to %s/%s',
94 self._SCREENSHOT_DIR, prefix)
So sleeping after WaitForDocumentReadyStateToBeComplete won't help, since it's the earlier call to cr.browser.tabs.New that generates the exception. I'll try sleeping before that call instead, although it's still troubling if it turns out that the object returned by chrome.Chrome() isn't immediately usable in some cases.
,
Feb 8 2017
screenshot.py is still used. Dominik worked on it recently. That said it is unable to take a screenshot when nothing is displayed, which is what you may run into here.
,
Feb 8 2017
For the record, screenshot.py calls into the same screenshot function (drm.crtcScreenshot()) as take_screenshot() Taking a screenshot will fail with "Unable to take screenshot. There may not be anything on the screen." most likely if there is no video mode set, so there is nothing to take screenshot of.
,
Feb 8 2017
Thanks for the info on screenshot.py. Maybe Chrome was hanging earlier and hadn't set a video mode, or something like that. On the reliable-testing front, after sleeping for 3 seconds before interacting with the chrome.Chrome object (updated version of the test is at https://chromium-review.googlesource.com/c/436766/), I've seen 0 failures in 530 runs. Achuith, do you have any guesses about what could be going on here? It seems like existing tests may also be affected by this, since I believe that many don't (and shouldn't) explicitly sleep after logging in.
,
Feb 8 2017
I don't think it's the chrome.Chrome object that's not available. You could probably move the sleep to right before the tabs.New() command and still see the same behavior. I believe tabs.New() is the first time we interact with DevTools, and there's probably something that failed to get initialized in time. A LOT of our autotests have JOB_RETRIES set to paper over flakes like this: https://cs.corp.google.com/search/?q=package:%5Echromeos_public$+file:control+JOB_RETRIES&type=cs We also have suite-level retries, so the entire VM test suite gets retried (the suite takes an hour to run) if a test fails. It's very worthwhile tracking down the root cause of this problem. When you run the test, you ought to be able to see the devtools traffic from the python side (you may need to increase logging verbosity). This could likely be a ChromeOS specific bug, but possibly not - I've seen some telemetry flakiness on the main waterfall bots. I think dgozman@ and pfeldman@ might be able to help us figure out how to debug this, so adding them to cc (dgozman appears to be OOO for another week). I'm not able to see any additional VLOGs in devtools code, so we might need to add some additional logging to figure out what's going wrong. Another thing to consider would be for telemetry to detect the first communication with DevTools, and retry when this fails. I can add some code along these lines if we think this is the right solution.
,
Feb 8 2017
,
Feb 13 2017
Ned, could you please take a look?
,
Feb 13 2017
,
Mar 27 2017
Pinging the bug as gardener. Are there any updates?
,
Mar 30 2017
Lloyd - this shouldn't be affecting the PFQ, unless you've found some evidence of a problem?
,
Apr 28 2017
Issue 715808 has been merged into this issue.
,
Apr 30 2017
This may as of yesterday be fixed/stable (at least logging into Chrome is). Worth checking once more.
,
May 1 2017
Thanks! I ran my test, minus the timeout, in a loop against a lumpy device in the lab and saw 302 successes (before I made the mistake of syncing my checkout, which broke everything :-/). This looks like it's reliable now. |
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by achuith@chromium.org
, Feb 6 2017