telemetry_unittests sometimes takes a long time
Issue description

See https://chromium-swarm.appspot.com/oldui/user/tasks?limit=100&sort=created_ts&state=completed_success&task_tag=name%3Atelemetry_unittests

The median is 12 minutes, but sometimes it takes 20, 25, or even 35 minutes. Every time that happens, the try run becomes slow because the suite isn't sharded (and it shouldn't really need to be, since it's usually fast enough). As one example, here's a try run of mine that took 28 minutes: https://chromium-swarm.appspot.com/task?id=3341d3801b538b10&refresh=10&show_raw=1 I've been seeing this often, though.
Dec 27 2016
Dec 27 2016
Dec 27 2016
This is strange; there is some overhead from something other than the tests. Here are the logs of a long run vs. a normal run:

Long (37m 20s): https://chromium-swarm.appspot.com/task?id=335abdea9bd82010&refresh=10&show_raw=1
Normal (15m 43s): https://chromium-swarm.appspot.com/task?id=335aaf9d2ad4cb10&refresh=10&show_raw=1

Using the script in https://github.com/catapult-project/catapult/blob/master/telemetry/list_telemetry_unittests to parse the log output and list all the tests' times, they are mostly equal and take around 15 minutes total (928.2037s for the first run vs. 917.4851s for the second). I suspect the step of fetching all binary files from cloud storage before running the telemetry tests is to blame for the blow-up in time.
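For reference, the kind of per-test accounting described above can be sketched like this. Note the log-line format below is a hypothetical stand-in for illustration; the real list_telemetry_unittests script parses the actual swarming output, which may look different:

```python
import re

# Hypothetical log-line format, e.g. "[  OK  ] suite.testName (1500 ms)".
# This is an assumption for illustration, not the real telemetry log format.
DURATION_RE = re.compile(r"\[\s+(?:OK|FAILED)\s+\]\s+(\S+)\s+\((\d+)\s*ms\)")

def total_test_seconds(log_text):
    """Sum the per-test durations reported in a test runner's log."""
    return sum(int(ms) for _, ms in DURATION_RE.findall(log_text)) / 1000.0

log = """
[       OK ] benchmark_unittest.testFoo (1500 ms)
[  FAILED  ] browser_unittest.testBar (240000 ms)
[       OK ] page_unittest.testBaz (900 ms)
"""
print(total_test_seconds(log))  # 242.4
```

A sum like this only covers time spent inside tests, which is exactly why it can miss overhead such as binary fetches or browser-launch timeouts that happen between tests.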
Dec 27 2016
Oh, never mind: my script did not take into account how long the failed tests took. It turns out that in the "37m 20s" run there are 4 flaky telemetry test failures, and each takes a long time to fail because of the 240s timeout in https://github.com/catapult-project/catapult/blob/3838eb5d96b721896766c557a3d7b514fa427b21/telemetry/telemetry/internal/browser/browser_options.py#L263 @Achuith: since you bumped up that timeout limit for CrOS tests, can you readjust it to a saner value?
Dec 27 2016
Thanks, Ned, for tracking this down :) Test timeouts are something we've had to keep small to avoid flakiness causing big slowdowns. We shouldn't have any test launchers that wait up to 4 minutes; the most we wait is a minute for our other test launchers. I see that Achuith changed it from 2 minutes to 4 minutes. Can we change this to 1 minute, and then the CrOS bots (not on the CQ or main waterfall) can override it?
Dec 28 2016
Sure, I'll take a look
Jan 4 2017
Jan 6 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/64657ddad6ba9f1c432aee54c0282b1c7aca1e73

commit 64657ddad6ba9f1c432aee54c0282b1c7aca1e73
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Fri Jan 06 17:50:13 2017

Roll src/third_party/catapult/ 9e6944a8f..fa1926f93 (23 commits).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/9e6944a8f310..fa1926f937dc

$ git log 9e6944a8f..fa1926f93 --date=short --no-merges --format='%ad %ae %s'
2017-01-06 sullivan Update perf dashboard tooltips to use logdog logs and show status pages.
2017-01-06 charliea Undo unnecessary disables from http://crrev.com/2236493003/
2017-01-05 simonhatch Dashboard - Add some extra logging info for failed auto-bisect.
2017-01-05 nednguyen Revert of Asserting telemetry story display name length for filename creation. (patchset #3 id:40001 of https://codereview.chromium.org/2488743006/ )
2017-01-05 dtu [pinpoint] Fix error message for missing parameters in /isolated.
2017-01-05 benjhayden Update UserModel to an ES6 class.
2017-01-05 benjhayden Configure chai to include stack traces.
2017-01-05 achuith Revert retries for DNS failures.
2017-01-05 benjhayden Finish renaming InteractionRecords to UserExpectations.
2017-01-05 sullivan Add a cached last_row_timestamp to StoppageAlert.
2017-01-05 eakuefner [Dashboard] Factor business logic in AddPointHandler out into AddData method
2017-01-05 simonhatch Dashboard - Bisect output should include percent change if possible.
2017-01-05 jessimb Changing group reports to use an id for multiple keys.
2017-01-05 charliea Replace iterItems() with Object.entries() in a few files
2017-01-05 charliea Use individual power sample for story:power metric
2017-01-05 eakuefner [Dashboard] Eliminate redundancy in AddPointHandler.post
2017-01-05 mikecase Add logcat markers to begin and end of each story.
2017-01-05 charliea Migrate more files from iterItems() to Object methods
2017-01-05 simonhatch Dashboard - Refactor output and clarify bisect failures.
2017-01-04 charliea [catapult android trybot] Make Telemetry tests run on Android
2017-01-04 eyaich Asserting telemetry story display name length for filename creation.
2017-01-04 achuith Browser startup timeout 4 min on ChromeOS, 1 min elsewhere.
2017-01-04 aiolos Unit refactor cl 1: Rename UnitScale to UnitPrefixScale.

BUG=662941,546625,665439,678282,665439,662941,676742

Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md
If the roll is causing failures, see: http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org
Review-Url: https://codereview.chromium.org/2613243002
Cr-Commit-Position: refs/heads/master@{#441970}

[modify] https://crrev.com/64657ddad6ba9f1c432aee54c0282b1c7aca1e73/DEPS
Jan 6 2017
Thank you!
Jan 6 2017
No worries, sorry for the trouble
Jan 12 2017
Issue 679429 has been merged into this issue.
Jan 23 2017
I just had a run take 29 minutes: https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/376250/steps/telemetry_unittests%20%28with%20patch%29/logs/stdio

There are 44 matches for:

BrowserConnectionGoneException: Timed out while waiting 60s for HasBrowserFinishedLaunching.
Jan 23 2017
Jan 23 2017
Looking at https://build.chromium.org/p/tryserver.chromium.linux/stats/linux_chromium_rel_ng, the median time for the telemetry_unittests target is 11.2 minutes and the 90th percentile is 18 minutes. If the expected time to complete this is longer than 5 minutes, it should be sharded so that each shard takes about 5 minutes. However, I'm concerned by the timeouts we're seeing and the resulting half-hour runs they sometimes cause. It seems like this isn't fixed yet.
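The sharding rule of thumb above amounts to a simple ceiling division. A sketch only: shards_needed and the even-split assumption are mine, not an actual piece of the swarming tooling:

```python
import math

def shards_needed(expected_minutes, target_minutes=5):
    """Number of swarming shards so each takes about target_minutes,
    assuming the suite splits evenly across shards."""
    return max(1, math.ceil(expected_minutes / target_minutes))

print(shards_needed(11.2))  # 3 shards for the median run
print(shards_needed(18))    # 4 shards for the 90th-percentile run
```

In practice shards rarely split perfectly evenly (one slow test group can dominate a shard), so the real shard count is usually tuned by observing wall-clock times on the bots, as happens later in this thread.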
Jan 23 2017
I can take a look at this one too.
Jan 24 2017
I thought this was a different timeout, but it's the same one I fixed earlier. We could reduce the timeout value further, to say 30 seconds, but I wonder if a better fix is to time out the whole suite once there are more than X errors, or something like that.
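The "time out the suite after X errors" idea could look roughly like this. A sketch only: MAX_LAUNCH_FAILURES, run_groups, and launch_and_run are hypothetical names, not the actual telemetry runner API:

```python
# Abort the whole suite once browser-launch failures exceed a budget,
# instead of paying the full startup timeout for every remaining group.
MAX_LAUNCH_FAILURES = 5  # hypothetical failure budget

def run_groups(groups, launch_and_run):
    """Run each test group; bail out early if too many launches fail.

    launch_and_run(group) -> (ok, result) is a stand-in for whatever
    launches a browser and runs one group of tests.
    """
    failures = 0
    results = []
    for group in groups:
        ok, result = launch_and_run(group)
        results.append(result)
        if not ok:
            failures += 1
            if failures >= MAX_LAUNCH_FAILURES:
                raise RuntimeError(
                    "Aborting suite: %d browser launches failed" % failures)
    return results
```

With a budget like this, a CL that breaks browser startup costs at most MAX_LAUNCH_FAILURES timeouts rather than one per group.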
Jan 24 2017
Why are these timeouts occurring at all normally?
Jan 24 2017
I may be a little off here. There are >1000 telemetry unit tests. These are bundled into groups (not sure how many: 20? 40?). For each group, telemetry launches a browser instance, waits to connect to the browser, and runs the tests in that group. If there's a bug in a CL that causes the browser to fail to launch, crash, or become unreachable by DevTools, we'll time out while waiting for the browser instance every single time, leading to a worst case where you hit this 1-minute timeout 20-40 times. Is that clear?
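The worst case described above is just multiplication: the 20-40 group count is the estimate from this comment, and 60s is the per-launch startup timeout discussed earlier in the thread:

```python
# Back-of-the-envelope worst case for a CL that breaks browser startup:
# every group's browser launch burns the full startup timeout.
STARTUP_TIMEOUT_S = 60  # the 1-minute timeout discussed above

def worst_case_overhead_minutes(num_groups, timeout_s=STARTUP_TIMEOUT_S):
    """Extra wall-clock time if every group's browser launch times out."""
    return num_groups * timeout_s / 60.0

print(worst_case_overhead_minutes(20))  # 20.0 minutes of pure timeout
print(worst_case_overhead_minutes(40))  # 40.0 minutes of pure timeout
```

That matches the symptoms in the thread: 20-40 minutes of added wall-clock time on top of the ~12-minute healthy runtime.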
Jan 24 2017
The linked run in comment 13 was a successful run; there were no failures. It still failed to connect 44 times. Folks: if telemetry_unittests is so unstable that it needs this many slow retries to pass, can we take it off the CQ until this is resolved?
Jan 25 2017
I think this has to do with Telemetry being flaky when run in parallel on Linux (see https://github.com/catapult-project/catapult/issues/3074#issuecomment-266562500). I will disable parallelization for telemetry_unittests for now.
Jan 25 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/027b82c601d86240ce7cdfe1f08b2f73eebbe840

commit 027b82c601d86240ce7cdfe1f08b2f73eebbe840
Author: nednguyen <nednguyen@google.com>
Date: Wed Jan 25 11:58:24 2017

Disable parallelization on telemetry_unittests & set the number of shards to 2 (linux)

It's a known problem that launching browsers in parallel on Linux will cause the test to hang due to deadlock (https://github.com/catapult-project/catapult/issues/3074#issuecomment-266562500), and since each hang test takes 1 min timeout, so we disable telemetry parallelization & shards it on 2 machines to reduce the overall wait time.

BUG=676742
Review-Url: https://codereview.chromium.org/2653163002
Cr-Commit-Position: refs/heads/master@{#445998}

[modify] https://crrev.com/027b82c601d86240ce7cdfe1f08b2f73eebbe840/testing/buildbot/chromium.fyi.json
[modify] https://crrev.com/027b82c601d86240ce7cdfe1f08b2f73eebbe840/testing/buildbot/chromium.linux.json
Jan 25 2017
I hope #22 will stabilize things a bit, but the log still shows that starting the browser in Telemetry is flaky on Linux:

File "/b/s/w/irI7x6YK/third_party/catapult/telemetry/telemetry/internal/backends/chrome/chrome_browser_backend.py", line 154, in _WaitForBrowserToComeUp
  raise exceptions.BrowserConnectionGoneException(self.browser, e)
BrowserConnectionGoneException: Timed out while waiting 60s for HasBrowserFinishedLaunching.
Found Minidump: False

Stack Trace:
********************************************************************************
No crash dump found.
********************************************************************************

Standard output:
********************************************************************************
[1:1:0124/173653.684891:ERROR:memory_mapped_file.cc(52)] Couldn't open /b/s/w/irI7x6YK/out/Release/chrome_200_percent.pak
[1:1:0124/173653.685013:ERROR:data_pack.cc(79)] Failed to mmap datapack
Xlib: extension "RANDR" missing on display ":99".
[3055:3055:0124/173653.718592:ERROR:memory_mapped_file.cc(52)] Couldn't open /b/s/w/irI7x6YK/out/Release/chrome_200_percent.pak
[3055:3055:0124/173653.718657:ERROR:data_pack.cc(79)] Failed to mmap datapack
Xlib: extension "RANDR" missing on display ":99".
[3100:3100:0124/173653.866100:ERROR:sandbox_linux.cc(343)] InitializeSandbox() called with multiple threads in process gpu-process.
********************************************************************************

Not sure whether the errors in the browser log are red herrings or whether we have a problem with xvfb. Adding Ken, who is also investigating recent browser startup flakes in Telemetry.
Jan 25 2017
Regarding logs: you can try running locally to see if these show up. I suspect they are harmless, since if those files were necessary but weren't on the swarming bot, then no test would pass.

Regarding flakiness: I'm surprised to hear that starting the browser is flaky in parallel. We do this for browser_tests and don't see any issues. These shards are using separate profile directories, right?

Regarding sharding: what was the default parallelization count before it was disabled? I'm wondering whether a sharding level of 2 is enough.
Jan 25 2017
To #24:

Logs: thanks for the comments. I will try to reproduce this locally.

Flakiness: the shards are using separate profile directories.

Sharding: it was 1. I wasn't aware of the recommendation that each shard should take about 5 minutes. I can update the number of shards to 4 for the telemetry_unittests suite on all desktop bots.
Jan 25 2017
Re sharding: Mac/Windows are not running with jobs==1, so do they need 4 shards?
Jan 26 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e3737854d8bf3f7d81ae7e5d852b0c914541935f

commit e3737854d8bf3f7d81ae7e5d852b0c914541935f
Author: nednguyen <nednguyen@google.com>
Date: Thu Jan 26 00:52:24 2017

telemetry_unittests: set number of shards to be 4 on linux & 2 on other desktop bots

BUG=676742
Review-Url: https://codereview.chromium.org/2655063002
Cr-Commit-Position: refs/heads/master@{#446177}

[modify] https://crrev.com/e3737854d8bf3f7d81ae7e5d852b0c914541935f/testing/buildbot/chromium.fyi.json
[modify] https://crrev.com/e3737854d8bf3f7d81ae7e5d852b0c914541935f/testing/buildbot/chromium.linux.json
Jan 26 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2288cb3fae6baa710b0228f292e3f368c32edca1

commit 2288cb3fae6baa710b0228f292e3f368c32edca1
Author: nednguyen <nednguyen@google.com>
Date: Thu Jan 26 19:36:45 2017

Set the number of shards for telemetry_unittest on Windows & Mac to be 2

BUG=676742
Review-Url: https://codereview.chromium.org/2650943012
Cr-Commit-Position: refs/heads/master@{#446395}

[modify] https://crrev.com/2288cb3fae6baa710b0228f292e3f368c32edca1/testing/buildbot/chromium.fyi.json
[modify] https://crrev.com/2288cb3fae6baa710b0228f292e3f368c32edca1/testing/buildbot/chromium.mac.json
[modify] https://crrev.com/2288cb3fae6baa710b0228f292e3f368c32edca1/testing/buildbot/chromium.win.json
Feb 28 2017
Build-CommitQueue has long been deprecated, but until now it was still offered in auto-complete. Please use the Infra>CQ component for CQ-related issues instead.
Mar 2 2017
Seems like this fixes the problem. Jam: please reopen if you think this is still happening.
Mar 2 2017
Well, this seems like a band-aid to keep the hanging from causing big slowdowns, but shouldn't we fix the hanging as well?
Mar 2 2017
Oh right, I made a separate bug for investigating the flakiness of starting the browser in parallel in Telemetry: issue 697924.
Mar 8 2017
Downgrading this to P2, since the band-aid fix makes this no longer urgent.
Apr 13 2017
Aug 2
Jan 16
Jan 16
Comment 1 by jam@chromium.org, Dec 23 2016