All chromium.* test tasks should have configured hard timeouts
Issue description:

Any chromium builder that schedules test tasks into the pools used by the main precommit and postcommit Chromium builders should have (reasonable) hard timeouts configured, so that if a test gets stuck or slows down significantly, we cap the amount of resources it consumes in the CQ rather than mysteriously running out of capacity. Tests (and builders) that we do not wish to have hard timeouts on should be provisioned into separate pools, so that we can ensure they don't consume high-priority resources. It might arguably be okay to configure them at lower priority in the main pools, as long as the tests still have hard timeouts short enough (< 5 min?) that they cannot materially starve the higher-priority tasks.
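For concreteness, a minimal sketch of what a configured hard timeout could look like in a test spec. The key names below are assumptions modeled on the //testing/buildbot swarming dicts, and 600 s is only an example value, not a decided policy:

  # Hypothetical excerpt from a //testing/buildbot-style test spec
  # (Python-literal .pyl syntax). Values are illustrative only.
  {
    'base_unittests': {
      'swarming': {
        'can_use_on_swarming_builders': True,
        'shards': 4,           # prefer more shards over a longer timeout
        'hard_timeout': 600,   # seconds; kill the task if it runs longer
      },
    },
  }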
May 22 2018
By default, it's 1h: https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/swarming/api.py?l=193&rcl=b68f787b12aa0d8c7e13e33869fe525e4ccbfaf9 - which is, of course, too long. OTOH, making it very short will quickly teach people to shard tests more, creating more, shorter tasks and hitting capacity even harder... How about counting the total swarming time for each build and failing it if it exceeds a certain threshold?
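A minimal sketch of what such a per-build check could do, assuming the recipe has already collected (task name, duration) pairs from swarming; the 4 h budget is a made-up placeholder:

  # Sketch: fail the build if the summed swarming task time exceeds a budget.
  BUILD_SWARMING_BUDGET_SEC = 4 * 60 * 60  # hypothetical per-build budget

  def check_swarming_budget(task_durations):
    # task_durations: list of (task_name, duration_seconds) tuples.
    total = sum(duration for _, duration in task_durations)
    if total > BUILD_SWARMING_BUDGET_SEC:
      raise Exception('build used %.1f h of swarming time (budget: %.1f h)' %
                      (total / 3600.0, BUILD_SWARMING_BUDGET_SEC / 3600.0))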
May 22 2018
Example:

  ./swarming.py query -S chromium-swarm.appspot.com \
      'tasks/list?limit=500&tags=buildername%3Aios-simulator&tags=buildnumber%3A11744' \
      | python -c 'import sys; import json; d = json.load(sys.stdin); print(sum(i["duration"] for i in d["items"]))'
  24569.258626

That's 6.8 hours (!!) of swarming time in 252 separate tasks, most of which are 2-5 min (average ~100 sec per task). Add 1 min of overhead for task scheduling (even if we fix issue 844151), and we're wasting ~30% of capacity.
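The same computation as that one-liner, expanded into a small script that also reports the task count and average duration (it assumes the same 'items'/'duration' fields in the query output):

  import json
  import sys

  # Reads the JSON output of the 'swarming.py query tasks/list' call above
  # from stdin and prints count, total, and average duration.
  items = json.load(sys.stdin)['items']
  durations = [float(i['duration']) for i in items]
  total = sum(durations)
  print('%d tasks, %.1f hours total, %.0f s average per task' %
        (len(durations), total / 3600.0, total / len(durations)))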
May 22 2018
The above can be an additional recipe step that runs after all swarming tasks complete: we can set a swarming budget, report the few slowest tasks, and possibly some stats on runtimes (e.g. too many too-short tasks -> consolidate; too-long-running tasks -> shard; too much time overall -> speed up or remove tests). We can start by creating such a step for reporting only.
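A rough sketch of what the reporting-only version could compute; the thresholds are placeholders, not agreed-upon values, and task_durations would come from the swarming results collected by the recipe:

  TOO_SHORT_SEC = 10       # many of these -> consolidate
  TOO_LONG_SEC = 5 * 60    # any of these -> shard (or speed up / remove)

  def summarize_swarming(task_durations):
    # task_durations: list of (task_name, duration_seconds) tuples.
    total = sum(d for _, d in task_durations)
    return {
        'total_hours': total / 3600.0,
        'slowest': sorted(task_durations, key=lambda t: t[1], reverse=True)[:5],
        'consolidate_candidates': [n for n, d in task_durations if d < TOO_SHORT_SEC],
        'shard_candidates': [n for n, d in task_durations if d > TOO_LONG_SEC],
    }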
May 22 2018
Creating more, shorter tasks should not be a big hit on capacity as long as O(overhead) << O(task). I think we expect per-task overhead to be O(10 s) on a successful task, and longer on a failed task (since we'll reboot afterwards); O(1 min) would be way too high. I also believe that most of the tasks are well under 1 min but most of the time is spent in the big tasks that aren't, though these numbers might vary by platform.

Counting swarming time by build doesn't really solve a different problem that I'd also like to solve: catching significant increases in step times. If a step unexpectedly goes from 1 s to 10 s, that's probably a bug worth alerting on even if the total increase in the build time is immaterial. [That use case wasn't mentioned in the description, which is my fault. Possibly that should be a separate bug.]

We want tasks (in the CQ and main builders, at least) to be much shorter by default, generally < 5 min (with more shards as necessary to meet that), in order to minimize the time a build spends waiting for tests to complete. That suggests we should have a hard timeout of ~10 min or so by default.

From my previous investigation into step times, what we really want is a concept like the "small", "medium", "large", and "huge" test sizes in Blaze/Bazel. Bazel uses 1/5/15/60 min, but I don't think those are the right values for us; something like 10 s, 60 s, 5 min (1 shard), and 5 min times an explicit number of shards would be better.
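To make the size-class idea concrete, here is a sketch under one reading of the "5 min times an explicit number of shards" bucket (per-shard timeout stays at 5 min, the shard count must be declared); the class names follow the Bazel analogy above and the values are the ones floated in this comment, not a decided policy:

  # Hypothetical per-shard hard timeouts by test size class (seconds).
  SIZE_HARD_TIMEOUT_SEC = {
      'small': 10,
      'medium': 60,
      'large': 5 * 60,    # single shard
      'huge': 5 * 60,     # per shard, with an explicit shard count
  }

  def swarming_params(size, shards=1):
    if size == 'huge' and shards <= 1:
      raise ValueError('huge tests must declare an explicit shard count')
    return {'hard_timeout': SIZE_HARD_TIMEOUT_SEC[size], 'shards': shards}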
May 23 2018
See https://crbug.com/845637#c7 for the actual data - maybe it's due to issue 844151, but at this point the overhead is indeed close to O(1 min).
May 23 2018
For step time monitoring, do you mean actual steps in the build (like compile), or the individual swarming task durations? I'm guessing both are important, but both also have pretty high variance in general, so we'd need some aggregation across builds before we can meaningfully detect a regression. It's probably worth a separate bug. Perhaps watching the total swarming time is also a separate issue, though IMHO it's likely to catch capacity problems more reliably. For instance, a single task time limit won't catch adding 100 more swarmed tests. And this can easily happen, e.g. for ios, as they test on a Cartesian product of <os version> x <architecture> x <tests>.
May 23 2018
For this particular bug, I mean monitoring swarming task durations. Good point about not catching newly added steps. However, those at least require code review by ops people or other OWNERS who should know what they're doing.
May 23 2018
Overhead is not counted as part of the task duration. I wrote a small tool to get a rough range from recent data as part of issue 402454, but I didn't roll the new values. :/ https://cs.chromium.org/chromium/src/testing/buildbot/timeouts.py
May 23 2018
maruel: Thanks for the pointer! This may indeed come in handy.

dpranke: re #c8: at least in the case of ios, I know for a fact that they often add tests and configurations without realizing the capacity impact. And since the configs are in src, Ops don't always review them. I suspect the same is true for the rest of chromium. Even Ops people often have only a vague idea of the capacity impact of any particular change.

Anyways, I guess I'm trying to say that just limiting the task timeout will constrain it from above, but we also need to constrain it from below (due to overhead) and to constrain the overall capacity taken by each build. Gotta plug all the holes, otherwise engineers will find a way past the constraints :)
May 23 2018
> dpranke: re #c8: at least in the case of ios, I know for a fact that they
> often add tests and configurations without realizing the capacity impact.

Yeah, unfortunately we all do that too often, and it's not realistic to ask devs to even try, given the current tooling we could point them at.

> And since the configs are in src, Ops don't always review them.
> I suspect the same is true for the rest of chromium.

Ops people (specifically CCI people) should be reviewing the changes to //ios/build/bots. We do so for //testing/buildbot/* already.

> we also need to constrain it from below (due to overhead)

I agree. I also filed bug 845646 for per-build timeouts, but that won't catch stuff being run in parallel, so you're likely right that we need monitoring on aggregate totals as well.
May 23 2018
I have a proposal then: how about I add a step to the chromium recipe (and possibly the ios one) that, for now, just prints some swarming statistics? If it proves useful and sufficiently stable over time, we can make it fail the build when some stats are out of whack. WDYT?
May 23 2018
I'm not quite sure what you have in mind, but I'm happy to look at a CL if you want to work on this approach. Or write up what you have in mind in more detail here, in a doc, or wherever.
May 23 2018
Filed issue 846024 for my proposal.
Aug 29
Is this still relevant? Sergey/Dirk? Note that this is a blocking bug for cit-pm-84. Thanks!
Aug 30
Yes, this is still relevant.