telemetry_unittests frequently stuck on "Mac10.9 Tests (dbg)" bot
Issue description

In the following runs, webkit_layout_tests failed but there are no failing tests listed:
https://uberchromegw.corp.google.com/i/chromium.mac/builders/Mac10.11%20Tests/builds/19596
https://uberchromegw.corp.google.com/i/chromium.mac/builders/Mac10.11%20Tests/builds/19599

Looking at the stderr, could these be having issues copying to the disk?

----------8<----------
2017-10-25 09:31:10,352 - webkitpy.layout_tests.merge_results: [DEBUG] Creating merged /b/rr/tmpIH5f6L/w/layout-test-results/webexposed/global-interface-listing-platform-specific-stderr.txt from ['/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/8/layout-test-results/webexposed/global-interface-listing-platform-specific-stderr.txt']
2017-10-25 09:31:10,352 - webkitpy.layout_tests.merge_results: [DEBUG] Creating merged /b/rr/tmpIH5f6L/w/layout-test-results/webexposed/global-interface-listing-pretty-diff.html from ['/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/9/layout-test-results/webexposed/global-interface-listing-pretty-diff.html']
2017-10-25 09:31:10,353 - webkitpy.layout_tests.merge_results: [DEBUG] Creating merged /b/rr/tmpIH5f6L/w/layout-test-results/webexposed/global-interface-listing-stderr.txt from ['/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/9/layout-test-results/webexposed/global-interface-listing-stderr.txt']
2017-10-25 09:31:10,353 - webkitpy.layout_tests.merge_results: [DEBUG] Creating merged /b/rr/tmpIH5f6L/w/output.json from ['/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/0/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/1/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/2/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/4/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/5/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/6/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/7/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/8/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/9/output.json']
2017-10-25 09:31:12,827 - root: [DEBUG] Copying output.json from /b/rr/tmpIH5f6L/w/output.json to /var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpMo9IXU.json
step returned non-zero exit code: 255
----------8<----------
Comment 1 by pdr@chromium.org, Oct 25 2017
Right, I think all of these failures are related to one shard timing out. The next question is: is something hanging? Why is one shard timing out?
Oct 25 2017
I'm seeing this on "Mac10.9 Tests (dbg)" for the telemetry_unittests step too: https://uberchromegw.corp.google.com/i/chromium.mac/builders/Mac10.9%20Tests%20%28dbg%29/builds/46497 https://uberchromegw.corp.google.com/i/chromium.mac/builders/Mac10.9%20Tests%20%28dbg%29/builds/46496
Nov 8 2017
This happened again today: https://uberchromegw.corp.google.com/i/chromium.mac/builders/Mac10.9%20Tests%20%28dbg%29/builds/46977
Nov 8 2017
Adding infra-troopers. It seems that different build steps time out at different times. All of those mentioned here are Mac bots.
Nov 9 2017
The October builds point to Swarming tasks that have no output. The builds were so long ago that the task_runner logs from that time are no longer on the bot, so I don't know why a timed-out task (not expired) has no output. More recent builds do have output. I see that https://uberchromegw.corp.google.com/i/chromium.mac/builders/Mac10.9%20Tests%20%28dbg%29/builds/46981 points to the timed-out task https://chromium-swarm.appspot.com/task?id=39b44e8046b96010&refresh=10&show_raw=1 and the telemetry tests there look particularly slow:

[72/558] telemetry.internal.actions.action_runner_unittest.ActionRunnerTest.testWaitForElementWithWrongText passed 13.0663s
[119/558] telemetry.internal.backends.chrome.tab_list_backend_unittest.TabListBackendTest.testTabIdStableAfterTabCrash passed 124.8882s
[165/558] telemetry.internal.browser.tab_unittest.TabTest.testRendererCrash passed 123.2531s
[166/558] telemetry.internal.browser.tab_unittest.TabTest.testTabIsAlive passed 118.8558s
[167/558] telemetry.internal.browser.tab_unittest.TabTest.testTimeoutExceptionIncludeConsoleMessage passed 13.2820s
[244/558] telemetry.internal.results.page_test_results_unittest.PageTestResultsTest.testNoTracesLeftAfterCleanUp passed 12.7915s
[247/558] telemetry.internal.results.page_test_results_unittest.PageTestResultsTest.testTraceValue passed 12.9959s
[534/558] telemetry.web_perf.timeline_based_page_test_unittest.TimelineBasedPageTestTest.testFirstPaintMetricSmoke passed 12.4532s
[557/558] telemetry.web_perf.timeline_based_page_test_unittest.TimelineBasedPageTestTest.testTBM2ForSmoke passed 11.1559s

The longest ones are defined in https://github.com/catapult-project/catapult/blob/master/telemetry/telemetry/internal/browser/tab_unittest.py, so +people who probably own this. This does not look like a build infrastructure issue.
Nov 9 2017
Seems like we need to increase the number of shards.
Nov 9 2017
I also filed bug 783093 to turn such steps purple.
Nov 9 2017
I looked more into this; it's not a problem with telemetry_unittests taking too much time, but there are some flaky tests that occasionally get stuck :-( Example: https://chromium-swarm.appspot.com/task?id=39b94e02f2e23b10&refresh=10&show_raw=1
Nov 9 2017
Dirk: debugging these stuck tests with typ has always been a pain. Is it possible to tweak typ so that when it receives a kill signal (which I assume swarming sends when tests reach the timeout), it logs which tests are currently running? This would make debugging stuck tests a lot easier.
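Not typ's actual code, but a minimal sketch of the kind of handler being asked for here, assuming the runner keeps a set of in-flight test names; the running_tests bookkeeping below is hypothetical:

import signal
import sys

# Hypothetical bookkeeping: the runner would add/remove test names here
# as it starts and finishes each test.
running_tests = set()

def _dump_running_tests(signum, _frame):
    # Log whatever is still in flight before the process goes away.
    sys.stderr.write('Received signal %d; tests still running: %s\n'
                     % (signum, sorted(running_tests) or 'none'))
    sys.stderr.flush()
    sys.exit(128 + signum)

# SIGTERM is presumably what swarming sends on timeout; SIGKILL cannot be
# caught, so this only helps if the runner is given a chance to shut down.
signal.signal(signal.SIGTERM, _dump_running_tests)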
Nov 9 2017
Looking at the task in #c4, the process was killed with SIGTERM, which typ should have received and done something with, but I don't see any output from it: https://chromium-swarm.appspot.com/task?id=39b3cb0e2c369310&refresh=10&request_detail=true&show_raw=1, which is a bit odd. In any case, you're running with --jobs=1 --verbose, so we log the start and stop of every test and there's only one running at a time. The last message in the log is "telemetry.internal.browser.tab_unittest.TabTest.testRendererCrash queued", so there you go. I can see if I can do better about handling SIGTERM and logging what's going on.
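For what it's worth, a low-effort way to get a traceback out of a process that is killed with SIGTERM is Python's faulthandler module; this is only a sketch of the idea (it assumes Python 3, or the faulthandler backport on Python 2), not typ's actual handling:

import faulthandler
import signal
import sys

# On SIGTERM, dump the Python traceback of every thread to stderr before
# the process dies, which shows which test the runner is stuck in.
# chain=True keeps any handler the runner has already installed.
faulthandler.register(signal.SIGTERM, file=sys.stderr, all_threads=True,
                      chain=True)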
Nov 10 2017
Thanks for the tip in #13, Dirk. telemetry_unittests hasn't timed out in the last 20 builds, though, so I think someone probably fixed the problem.