New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 676742 link

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Last visit > 30 days ago
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocked on:
issue 697924



Sign in to add a comment

telemetry_unittests sometimes takes a long time

Project Member Reported by jam@chromium.org, Dec 23 2016

Issue description

See 
https://chromium-swarm.appspot.com/oldui/user/tasks?limit=100&sort=created_ts&state=completed_success&task_tag=name%3Atelemetry_unittests

The median is 12 minutes, but sometimes it takes 20, 25, or even 35 minutes. Every time that happens, the try run becomes slow because it's not sharded (and it shouldn't really, since it's usually fast enough that it doesn't need to be).

As one example, here's my try run which took 28 minutes
https://chromium-swarm.appspot.com/task?id=3341d3801b538b10&refresh=10&show_raw=1
I've been seeing this often though.
 

Comment 1 by jam@chromium.org, Dec 23 2016

Owner: nedngu...@google.com

Comment 2 by jam@chromium.org, Dec 27 2016

Labels: -OS-Mac
Cc: eyaich@chromium.org sullivan@chromium.org
This is strange, there are some overhead from something else than the tests. Here are the log of a long run vs normal run:

Long (37m 20s): https://chromium-swarm.appspot.com/task?id=335abdea9bd82010&refresh=10&show_raw=1

Normal (15m 43s): https://chromium-swarm.appspot.com/task?id=335aaf9d2ad4cb10&refresh=10&show_raw=1

Using the script in https://github.com/catapult-project/catapult/blob/master/telemetry/list_telemetry_unittests to parse the log output & list out all the tests' time, they are mostly equal & take around 15 minutes total (928.2037 for the first one vs 917.4851s for the seconds one).


I am suspecting the step of fetching all binaries files from cloud storage before running the telemetry tests is to blame for the time blows up here.
Cc: nednguyen@chromium.org
Owner: achuith@chromium.org
Oh nvm, my script did not take into account how long did the failed tests take. It turns out in the "37m 20s", there are 4 failed flaky telemetry test & it takes a long time to fail because of the 240s timeout in https://github.com/catapult-project/catapult/blob/3838eb5d96b721896766c557a3d7b514fa427b21/telemetry/telemetry/internal/browser/browser_options.py#L263

@Achuith: since you bumped up that time out limit for CrOS tests, can you readjust it with a saner timeout limit?

Comment 6 by jam@chromium.org, Dec 27 2016

Thanks Ned for tracking this down :)

Test timeout is something we've had to keep small to avoid flakiness causing big slowdowns. We shouldn't have any test launchers that wait up to 4 minutes. The most we wait is a minute for our other test launchers.

I see that Achuith changed it from 2 minutes to 4 minutes. Can we change this to 1 minute, and then cros bots (not on CQ or main waterfall) can override it?
Sure, I'll take a look
Status: Started (was: Assigned)
https://codereview.chromium.org/2610783002/
Project Member

Comment 9 by bugdroid1@chromium.org, Jan 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/64657ddad6ba9f1c432aee54c0282b1c7aca1e73

commit 64657ddad6ba9f1c432aee54c0282b1c7aca1e73
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Fri Jan 06 17:50:13 2017

Roll src/third_party/catapult/ 9e6944a8f..fa1926f93 (23 commits).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/9e6944a8f310..fa1926f937dc

$ git log 9e6944a8f..fa1926f93 --date=short --no-merges --format='%ad %ae %s'
2017-01-06 sullivan Update perf dashboard tooltips to use logdog logs and show status pages.
2017-01-06 charliea Undo unnecessary disables from http://crrev.com/2236493003/
2017-01-05 simonhatch Dashboard - Add some extra logging info for failed auto-bisect.
2017-01-05 nednguyen Revert of Asserting telemetry story display name length for filename creation. (patchset #3 id:40001 of https://codereview.chromium.org/2488743006/ )
2017-01-05 dtu [pinpoint] Fix error message for missing parameters in /isolated.
2017-01-05 benjhayden Update UserModel to an ES6 class.
2017-01-05 benjhayden Configure chai to include stack traces.
2017-01-05 achuith Revert retries for DNS failures.
2017-01-05 benjhayden Finish renaming InteractionRecords to UserExpectations.
2017-01-05 sullivan Add a cached last_row_timestamp to StoppageAlert.
2017-01-05 eakuefner [Dashboard] Factor business logic in AddPointHandler out into AddData method
2017-01-05 simonhatch Dashboard - Bisect output should include percent change if possible.
2017-01-05 jessimb Changing group reports to use an id for multiple keys.
2017-01-05 charliea Replace iterItems() with Object.entries() in a few files
2017-01-05 charliea Use individual power sample for story:power metric
2017-01-05 eakuefner [Dashboard] Eliminate redundancy in AddPointHandler.post
2017-01-05 mikecase Add logcat markers to begin and end of each story.
2017-01-05 charliea Migrate more files from iterItems() to Object methods
2017-01-05 simonhatch Dashboard - Refactor output and clarify bisect failures.
2017-01-04 charliea [catapult android trybot] Make Telemetry tests run on Android
2017-01-04 eyaich Asserting telemetry story display name length for filename creation.
2017-01-04 achuith Browser startup timeout 4 min on ChromeOS, 1 min elsewhere.
2017-01-04 aiolos Unit refactor cl 1: Rename UnitScale to UnitPrefixScale.

BUG= 662941 , 546625 , 665439 ,678282, 665439 , 662941 ,676742

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2613243002
Cr-Commit-Position: refs/heads/master@{#441970}

[modify] https://crrev.com/64657ddad6ba9f1c432aee54c0282b1c7aca1e73/DEPS

Comment 10 by jam@chromium.org, Jan 6 2017

Thank you!
Status: Fixed (was: Started)
No worries, sorry for the trouble
 Issue 679429  has been merged into this issue.

Comment 13 by jam@chromium.org, Jan 23 2017

Owner: nednguyen@chromium.org
Status: Assigned (was: Fixed)
I just had a run take 29 minutes: https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/376250/steps/telemetry_unittests%20%28with%20patch%29/logs/stdio

There are 44 matches for:
BrowserConnectionGoneException: Timed out while waiting 60s for HasBrowserFinishedLaunching.

Comment 14 by jam@chromium.org, Jan 23 2017

Owner: nedngu...@google.com

Comment 15 by jam@chromium.org, Jan 23 2017

From looking at 
https://build.chromium.org/p/tryserver.chromium.linux/stats/linux_chromium_rel_ng

The median time for the telemetry_unittests target is 11.2 minutes and 90th percentile is 18 minutes.

If the expected time to complete this is longer than 5 minutes, then it should be sharded so that each shard takes 5 minutes. However I'm concerned with the timeouts we're seeing and the resulting ~half hour runs that they sometimes cause. It seems like this isn't fixed yet.
Cc: -nednguyen@chromium.org -eyaich@chromium.org nedngu...@google.com
Owner: achuith@chromium.org
I can take a look at this one too.
Cc: achuith@chromium.org
Owner: nedngu...@google.com
I thought this was a different timeout, but it's the same one I fixed earlier. 

We could reduce the timeout value further to say 30 sec, but I wonder if a better fix is to timeout the suite if there's more than X errors or something.

Comment 18 by jam@chromium.org, Jan 24 2017

Why are these timeouts occurring at all normally?
I may be a little off here.

There's >1000 telemetry unit tests. These are bundled into groups (not sure how many - 20? 40?). For each group, telemetry launches a browser instance, waits to connect to the browser, and runs the tests in that group. If there's a bug in a CL that causes the browser to not launch, crash, or not be connectable by devtools, we'll timeout while waiting for the browser instance every single time, leading to this worst case where you hit this 1 min timeout 20-40 times.

Not sure if this is clear?

Comment 20 by jam@chromium.org, Jan 24 2017

The linked run in comment 13 was for a successful run; there were no failures. It still failed to connect 44 times.

Folks: if telemetry_unittests is so unstable such that it needs so many slow retries to pass, can we take it off CQ until this is resolved?
I think this has to do with Telemetry being flaky when run in parallel on Linux (see https://github.com/catapult-project/catapult/issues/3074#issuecomment-266562500). I will disable the parallelization for telemetry_unittest for now.
Project Member

Comment 22 by bugdroid1@chromium.org, Jan 25 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/027b82c601d86240ce7cdfe1f08b2f73eebbe840

commit 027b82c601d86240ce7cdfe1f08b2f73eebbe840
Author: nednguyen <nednguyen@google.com>
Date: Wed Jan 25 11:58:24 2017

Disable parallelization on telemetry_unittests & set the number of shards to 2 (linux)

It's a known problem that launching browsers in parallel on Linux will cause the
test to hang due to deadlock (
https://github.com/catapult-project/catapult/issues/3074#issuecomment-266562500),
and since each hang test takes 1 min timeout, so we disable telemetry
parallelization & shards it on 2 machines to reduce the overall wait time.

BUG=676742

Review-Url: https://codereview.chromium.org/2653163002
Cr-Commit-Position: refs/heads/master@{#445998}

[modify] https://crrev.com/027b82c601d86240ce7cdfe1f08b2f73eebbe840/testing/buildbot/chromium.fyi.json
[modify] https://crrev.com/027b82c601d86240ce7cdfe1f08b2f73eebbe840/testing/buildbot/chromium.linux.json

Cc: kbr@chromium.org
I hope #22 will stabilize things a bit, but the log still shows that the act of starting the browser in Telemetry is flaky on Linux:

Log:

File "/b/s/w/irI7x6YK/third_party/catapult/telemetry/telemetry/internal/backends/chrome/chrome_browser_backend.py", line 154, in _WaitForBrowserToComeUp
raise exceptions.BrowserConnectionGoneException(self.browser, e)
BrowserConnectionGoneException: Timed out while waiting 60s for HasBrowserFinishedLaunching.
Found Minidump: False
Stack Trace:
********************************************************************************
No crash dump found.
********************************************************************************
Standard output:
********************************************************************************
[1:1:0124/173653.684891:ERROR:memory_mapped_file.cc(52)] Couldn't open /b/s/w/irI7x6YK/out/Release/chrome_200_percent.pak
[1:1:0124/173653.685013:ERROR:data_pack.cc(79)] Failed to mmap datapack
Xlib: extension "RANDR" missing on display ":99".
[3055:3055:0124/173653.718592:ERROR:memory_mapped_file.cc(52)] Couldn't open /b/s/w/irI7x6YK/out/Release/chrome_200_percent.pak
[3055:3055:0124/173653.718657:ERROR:data_pack.cc(79)] Failed to mmap datapack
Xlib: extension "RANDR" missing on display ":99".
[3100:3100:0124/173653.866100:ERROR:sandbox_linux.cc(343)] InitializeSandbox() called with multiple threads in process gpu-process. 
********************************************************************************

Not sure whether things in the browser log are red herring or we have problem with xvfb. Add Ken who is also investigating recent browser startup flakes in Telemetry.

Comment 24 by jam@chromium.org, Jan 25 2017

regarding logs: you can try running locally to see if these show up. I suspect these are harmless, since if those files were necessary but weren't on the swarming bot then no test would pass.

regarding flakiness: I'm surprised to hear that starting the browser is flaky in parallel. We do this for browser_tests and don't see any issues? These shards are using separate profile directories right?

regarding sharding: what was the default parallelization count before it's disabled? i.e. I'm wondering if a sharding level of 2 is enough.
To #24:

Log: Thanks for the comments. I will try to reproduce this locally. 

Flakiness: the shards are using separate profile directories.

Sharding: it was 1. I wasn't aware of the recommendation of each shard should takes about 5 minutes. I can update the number of shards to be 4 that for the telemetry_unittests suite on all desktop bots.

Comment 26 by jam@chromium.org, Jan 25 2017

Re sharding, mac/windows are not running with jobs==1, so do they need 4 shards?
Project Member

Comment 27 by bugdroid1@chromium.org, Jan 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e3737854d8bf3f7d81ae7e5d852b0c914541935f

commit e3737854d8bf3f7d81ae7e5d852b0c914541935f
Author: nednguyen <nednguyen@google.com>
Date: Thu Jan 26 00:52:24 2017

telemetry_unittests: set number of shards to be 4 on linux & 2 on other desktop bots

BUG=676742

Review-Url: https://codereview.chromium.org/2655063002
Cr-Commit-Position: refs/heads/master@{#446177}

[modify] https://crrev.com/e3737854d8bf3f7d81ae7e5d852b0c914541935f/testing/buildbot/chromium.fyi.json
[modify] https://crrev.com/e3737854d8bf3f7d81ae7e5d852b0c914541935f/testing/buildbot/chromium.linux.json

Components: Infra>CQ
Labels: -Build-CommitQueue
Build-CommitQueue has been long deprecated, but until now was still offered in auto-complete. Please use Infra>CQ component for CQ-related issues instead.
Status: Fixed (was: Assigned)
Seems like this fixes the problem. Jam: please reopen if you think this is still happening.

Comment 31 by jam@chromium.org, Mar 2 2017

Status: Assigned (was: Fixed)
well this seems like a band-aid to make the hanging not cause big slowness, but shouldn't we fix the hanging as well?
Blockedon: 697924
Oh right, I make a separate bug for investigating the flakiness of starting browser in parallel in Telemetry:  issue 697924 .
Components: Speed>Telemetry
Labels: -Pri-1 Pri-2
Downgrade this to P2 since the band-aid fix is making this no longer urgent.
Components: -Infra>CQ
Owner: nednguyen@chromium.org
Components: Test>Telemetry
Components: -Speed>Telemetry

Sign in to add a comment