
Issue 778303


Issue metadata

Status: WontFix
Owner:
Closed: Nov 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug-Regression




telemetry_unittests frequently stuck on "Mac10.9 Tests (dbg)" bot

Project Member Reported by pdr@chromium.org, Oct 25 2017

Issue description

In the following runs, webkit_layout_tests failed but there are no failing tests listed:
https://uberchromegw.corp.google.com/i/chromium.mac/builders/Mac10.11%20Tests/builds/19596
https://uberchromegw.corp.google.com/i/chromium.mac/builders/Mac10.11%20Tests/builds/19599

Judging from the stderr, could these runs be having trouble copying files to disk?
----------8<----------
2017-10-25 09:31:10,352 - webkitpy.layout_tests.merge_results: [DEBUG] Creating merged /b/rr/tmpIH5f6L/w/layout-test-results/webexposed/global-interface-listing-platform-specific-stderr.txt from ['/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/8/layout-test-results/webexposed/global-interface-listing-platform-specific-stderr.txt']
2017-10-25 09:31:10,352 - webkitpy.layout_tests.merge_results: [DEBUG] Creating merged /b/rr/tmpIH5f6L/w/layout-test-results/webexposed/global-interface-listing-pretty-diff.html from ['/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/9/layout-test-results/webexposed/global-interface-listing-pretty-diff.html']
2017-10-25 09:31:10,353 - webkitpy.layout_tests.merge_results: [DEBUG] Creating merged /b/rr/tmpIH5f6L/w/layout-test-results/webexposed/global-interface-listing-stderr.txt from ['/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/9/layout-test-results/webexposed/global-interface-listing-stderr.txt']
2017-10-25 09:31:10,353 - webkitpy.layout_tests.merge_results: [DEBUG] Creating merged /b/rr/tmpIH5f6L/w/output.json from ['/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/0/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/1/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/2/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/4/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/5/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/6/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/7/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/8/output.json', '/private/var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpwghwZ9/9/output.json']
2017-10-25 09:31:12,827 - root: [DEBUG] Copying output.json from /b/rr/tmpIH5f6L/w/output.json to /var/folders/9x/6c6sv3cj4j53wzpzthbp4ksm0000gm/T/tmpMo9IXU.json
step returned non-zero exit code: 255
----------8<----------

 

Comment 1 by pdr@chromium.org, Oct 25 2017

Summary: Mac webkit_layout_tests failing with no output (was: Mac10.11 Tests failing with no output)
I'm also seeing this on another bot:
https://uberchromegw.corp.google.com/i/chromium.webkit/builders/WebKit%20Mac10.11/builds/24812
https://uberchromegw.corp.google.com/i/chromium.webkit/builders/WebKit%20Mac10.11/builds/24811

Could this be related to the line "shard #0 timed out, took too much time to complete"?
Right, I think all of these failures are related to one shard timing out. The next question: is something hanging, and why is that one shard timing out?

Comment 3 by pdr@chromium.org, Oct 25 2017

Summary: Mac bots failing with "shard timed out, took too much time to complete" (was: Mac webkit_layout_tests failing with no output)
I'm seeing this on "Mac10.9 Tests (dbg)" for the telemetry_unittests step too:
https://uberchromegw.corp.google.com/i/chromium.mac/builders/Mac10.9%20Tests%20%28dbg%29/builds/46497
https://uberchromegw.corp.google.com/i/chromium.mac/builders/Mac10.9%20Tests%20%28dbg%29/builds/46496
Labels: Infra-Troopers
Status: Untriaged (was: Available)
Adding infra-troopers. It seems that different build steps time out at different times. All of the bots mentioned here are Mac bots.

Comment 6 by no...@chromium.org, Nov 9 2017

Cc: no...@chromium.org charliea@chromium.org nednguyen@chromium.org
Labels: -Infra-Troopers
October builds point to swarming tasks that have no output. Those builds ran so long ago that the task_runner logs from that time are no longer on the bot, so I don't know why a timed-out (not expired) task has no output.

More recent builds do have output.
I see that https://uberchromegw.corp.google.com/i/chromium.mac/builders/Mac10.9%20Tests%20%28dbg%29/builds/46981
points to the timed-out task
https://chromium-swarm.appspot.com/task?id=39b44e8046b96010&refresh=10&show_raw=1
and the telemetry tests there look particularly slow:

[72/558] telemetry.internal.actions.action_runner_unittest.ActionRunnerTest.testWaitForElementWithWrongText passed 13.0663s:
[119/558] telemetry.internal.backends.chrome.tab_list_backend_unittest.TabListBackendTest.testTabIdStableAfterTabCrash passed 124.8882s:
[165/558] telemetry.internal.browser.tab_unittest.TabTest.testRendererCrash passed 123.2531s:
[166/558] telemetry.internal.browser.tab_unittest.TabTest.testTabIsAlive passed 118.8558s:
[167/558] telemetry.internal.browser.tab_unittest.TabTest.testTimeoutExceptionIncludeConsoleMessage passed 13.2820s:
[244/558] telemetry.internal.results.page_test_results_unittest.PageTestResultsTest.testNoTracesLeftAfterCleanUp passed 12.7915s
[247/558] telemetry.internal.results.page_test_results_unittest.PageTestResultsTest.testTraceValue passed 12.9959s
[534/558] telemetry.web_perf.timeline_based_page_test_unittest.TimelineBasedPageTestTest.testFirstPaintMetricSmoke passed 12.4532s:
[557/558] telemetry.web_perf.timeline_based_page_test_unittest.TimelineBasedPageTestTest.testTBM2ForSmoke passed 11.1559s:

The longest ones are defined in
https://github.com/catapult-project/catapult/blob/master/telemetry/telemetry/internal/browser/tab_unittest.py
so I'm adding people who probably own this.

This does not look like a build infrastructure issue.
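
For anyone repeating this triage, here is a minimal sketch of pulling the slowest tests out of a log in the format quoted above. The helper name `slowest_tests` and the 60-second threshold are hypothetical, and the regular expression is only inferred from the quoted lines (including the optional trailing colon); it is not taken from typ itself.

```python
import re

# Pattern inferred from the log lines quoted in this comment, e.g.
# "[119/558] some.module.TestCase.testName passed 124.8882s:"
# The trailing colon is optional because some quoted lines lack it.
LINE_RE = re.compile(
    r"^\[(\d+)/(\d+)\]\s+(\S+)\s+(\w+)\s+(\d+(?:\.\d+)?)s:?\s*$"
)


def slowest_tests(log_text, threshold_s=60.0):
    """Return (seconds, test_name) pairs at or above threshold_s, slowest first."""
    results = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # Not a per-test result line; skip it.
        name = match.group(3)
        seconds = float(match.group(5))
        if seconds >= threshold_s:
            results.append((seconds, name))
    return sorted(results, reverse=True)


if __name__ == "__main__":
    sample = "\n".join([
        "[72/558] telemetry.internal.actions.action_runner_unittest.ActionRunnerTest.testWaitForElementWithWrongText passed 13.0663s:",
        "[119/558] telemetry.internal.backends.chrome.tab_list_backend_unittest.TabListBackendTest.testTabIdStableAfterTabCrash passed 124.8882s:",
        "[165/558] telemetry.internal.browser.tab_unittest.TabTest.testRendererCrash passed 123.2531s:",
    ])
    for seconds, name in slowest_tests(sample):
        print("%9.4fs  %s" % (seconds, name))
```

Run against the full task output, this would list only the tests that take a minute or more, which here are the crash-related tab tests.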
Cc: -nednguyen@chromium.org
Owner: nedngu...@google.com
Status: Assigned (was: Untriaged)
Seems like we need to increase the number of shards.

Comment 8 by no...@chromium.org, Nov 9 2017

I also filed bug 783093 to turn such steps purple.
Components: -Blink>Infra Speed>Telemetry
I looked into this more. The problem is not that telemetry_unittests takes too much time; rather, some flaky tests occasionally get stuck :-(

Example: https://chromium-swarm.appspot.com/task?id=39b94e02f2e23b10&refresh=10&show_raw=1
Summary: telemetry_unittests frequently stuck on "Mac10.9 Tests (dbg)" bot (was: Mac bots failing with "shard timed out, took too much time to complete")
Cc: dpranke@chromium.org
Dirk: debugging these stuck tests with typ has always been a pain. Is it possible to tweak typ so that when it receives a KILL signal (which I assume swarming sends when tests reach the timeout), it logs which tests are running? That would make debugging stuck tests a lot easier.
Looking at the task in #c4, the process was killed with SIGTERM, which typ should've gotten and done something with, but I don't see any output from it:

https://chromium-swarm.appspot.com/task?id=39b3cb0e2c369310&refresh=10&request_detail=true&show_raw=1

which is a bit odd.

In any case, you're running with --jobs=1 --verbose, so we log the start and stop of every test and there's only one running at a time. The last message in the log is "telemetry.internal.browser.tab_unittest.TabTest.testRendererCrash queued" so there you go.
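
A rough sketch of automating that check: scan the log for tests that have a "queued" line but no later result line. The function name `unfinished_tests` is hypothetical, and the line formats are assumptions based only on the messages quoted in this issue ("<test name> queued" and "<test name> passed <time>s"); the "failed" status in the pattern is a guess as well.

```python
import re

# Assumed formats, based only on the log excerpts quoted in this issue.
QUEUED_RE = re.compile(r"(\S+) queued\s*$")
DONE_RE = re.compile(r"(\S+) (?:passed|failed)\b")


def unfinished_tests(log_text):
    """Return names of tests that were queued but never reported a result."""
    queued = []
    finished = set()
    for line in log_text.splitlines():
        match = QUEUED_RE.search(line)
        if match:
            queued.append(match.group(1))
            continue
        match = DONE_RE.search(line)
        if match:
            finished.add(match.group(1))
    return [name for name in queued if name not in finished]


if __name__ == "__main__":
    sample = "\n".join([
        "[1/3] pkg.TabTest.testTabIsAlive queued",
        "[1/3] pkg.TabTest.testTabIsAlive passed 0.5000s",
        "[2/3] pkg.TabTest.testRendererCrash queued",
    ])
    print(unfinished_tests(sample))
```

With --jobs=1 there should be at most one such test, and it is the one that was running when the task was killed.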

I can see if I can do better about handling SIGTERM and logging what's going on.
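
To illustrate the idea, here is a minimal sketch of a SIGTERM handler that reports which tests were in flight before exiting. This is not typ's implementation: `RunningTestTracker` and `make_sigterm_handler` are hypothetical names, and the sketch assumes a single-threaded runner (as with --jobs=1). Note that SIGKILL cannot be caught by a process, so this only helps when the runner is sent SIGTERM first.

```python
import signal
import sys


class RunningTestTracker:
    """Records which tests have started but not yet finished.

    Hypothetical helper; assumes start/stop are called from the main
    thread only, so no locking is needed inside the signal handler.
    """

    def __init__(self):
        self._running = set()

    def start(self, name):
        self._running.add(name)

    def stop(self, name):
        self._running.discard(name)

    def snapshot(self):
        return sorted(self._running)


def make_sigterm_handler(tracker, stream=sys.stderr):
    """Build a handler that logs in-flight tests and then exits."""

    def handler(signum, frame):
        running = tracker.snapshot()
        stream.write("Received signal %d; tests still running:\n" % signum)
        for name in running:
            stream.write("  %s\n" % name)
        if not running:
            stream.write("  (none)\n")
        stream.flush()
        # 128 + SIGTERM (15) is the conventional exit status here.
        sys.exit(128 + signal.SIGTERM)

    return handler


def install(tracker):
    """Register the handler for SIGTERM; must be called from the main thread."""
    signal.signal(signal.SIGTERM, make_sigterm_handler(tracker))
```

A runner would call `tracker.start(name)` before each test and `tracker.stop(name)` after it, so the handler's output names exactly the test that was stuck.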
Status: WontFix (was: Assigned)
Thanks for the tip in #13, Dirk. telemetry_unittests has not timed out in the last 20 builds, though, so someone has probably fixed the problem.
Components: Test>Telemetry
Components: -Speed>Telemetry
