ASyncHWTest stages not running after failed HWTest stages
Issue description: Here is R71-11111.0.0 on grunt: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=2989878 Note it has some ASyncHWTest stages kicked off ("[bvt-cq] [grunt]", "[bvt-perbuild] [aleena]", etc.). Here is R71-11112.0.0 on grunt: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=2990551 No ASyncHWTest stages are getting kicked off.
Oct 2
This is happening across builds. It looks like ASyncHWTest stages just stopped getting kicked off; I assume some regression was introduced in cbuildbot.
Oct 2
Using grunt-release as an investigative source is problematic because grunt-release hasn't had a successful build since 11111.0, so all of the missing ASyncHWTest stages are expected on failed builds. Investigating across other builds.
Oct 2
Here's the last successful fizz-release R71-11117.0.0 (also unibuild) and it has ASync results: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933876149885847408
Oct 2
And scarlet-release just now @ R71-11120.0.0: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933784982830710416 I think everything is working as intended; to the extent that things aren't running, it's because a build is failing. AFAICT this is WAI. Does anyone disagree?
Oct 2
Jason, that sounds reasonable, but there was some doubt as to whether that was the logic or not. Can you link to where that logic is implemented for future reference?
Oct 2
We might want bvt-cq to get kicked off even if the build fails, particularly for unified build boards, where the probability of any one model failing is higher. I guess the logic as written assumes that if any of the inline tests fail, the build is bad. I am not sure that holds true in our current situation; most of the time we can get tests to pass on rerun (flakes, lab timeouts, etc.). If we keep it this way, we might start regressing in bvt-cq but not notice for several days, and we will have dug ourselves into a deeper hole. If we do want this sort of logic (inline test failures mean no asynchronous tests), it may be best implemented per model, so that even if the build fails because model X failed bvt-inline, model Y, which passed the inline tests, still gets bvt-cq kicked off.
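To make the per-model idea concrete, here is a minimal Python sketch; the function name and the results mapping are hypothetical, not existing cbuildbot APIs:
# Hypothetical per-model gate: instead of gating bvt-cq on the whole
# build, schedule it for each model whose own inline results passed.
def models_for_async_kickoff(inline_results_by_model):
    """inline_results_by_model maps model name -> True if bvt-inline passed.

    Returns the models that should still get bvt-cq scheduled, even
    when a sibling model failed and the overall build is red.
    """
    return sorted(model for model, passed in inline_results_by_model.items()
                  if passed)

# Example: careena failing bvt-inline should not starve grunt of bvt-cq.
print(models_for_async_kickoff({'grunt': True, 'careena': False}))
# -> ['grunt']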
Oct 3
Sorry for the delay. Here's the logic: http://cs/chromeos_public/chromite/cbuildbot/builders/simple_builders.py?l=123&rcl=8775060263d67635cb7c004f5cceeec68043fcf6 On the second loop through that block, the remaining suites are non-blocking, so the ASyncHWTest stages are scheduled all at once, after the blocking HWTests have already finished. I think that this was, in part, to prevent over-subscription of the hardware pools, but I'm not sure. Anyway, yes, it dates to roughly mid-2017. So what do we want to do with this? Is there a feature request here?
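Roughly, as I read that block (a sketch only; run_suite_and_wait and schedule_async_stage are illustrative stubs, not the real chromite calls):
def run_suite_and_wait(suite):
    print('running %(suite)s and waiting' % suite)
    return True  # stub: pretend every suite passes

def schedule_async_stage(suite):
    print('scheduling ASyncHWTest stage for %(suite)s' % suite)

def run_hw_tests(hw_tests):
    # Pass 1: synchronous suites, in config order. A failed blocking
    # suite stops everything scheduled after it, async suites included.
    # (This elides the detail that non-blocking sync suites in a phase
    # run in parallel and are joined at the next blocking suite.)
    for suite in hw_tests:
        if suite['async']:
            continue
        if not run_suite_and_wait(suite) and suite['blocking']:
            return  # nothing after this point runs
    # Pass 2: only reached when the blocking suites passed; all the
    # ASyncHWTest stages get scheduled at once, fire-and-forget.
    for suite in hw_tests:
        if suite['async']:
            schedule_async_stage(suite)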
Oct 3
The hardware pool needs to be sized for all suites to run, as release builders are clocked on a fixed schedule. So the protection here is not against over-subscription but (on paper) against installing too many bad images. Considering that bvt-arc and bvt-inline can already "burn" the whole pool:cq/bvt, there isn't much protection one way or another.
I think the simplest feature request here is for unibuild to still kick off all async suites as long as HWTEST_INSTALLER_SUITE passed (and ignore all the other blocking suites). Or, if extra sanity is wanted, kick off all async suites once HWTEST_INSTALLER_SUITE plus one other suite passed. Right now, insisting that all blocking suites pass before kicking off the async suites is asking too much: as the number of models in a board grows, the probability of async kickoff goes down.
--
1) I did a bit of archaeology on bvt-cq. It is both blocking and async. This combination is contradictory/undefined. (At least right now async overrides blocking.) I think I keep forgetting/discovering this "peculiarity" from time to time. A good reason not to add tests to bvt-cq but only to bvt-inline. (And another reason why nobody noticed that bvt-cq was not run for a few weeks.)
2) "Blocking" seems to be used in too many ways:
a) blocking on the cq/pfq
b) blocking async suites on the release builder
c) blocking suite_scheduler suites based on release build status
My mental model:
a) cq/pfq would only progress if all of bvt-inline/bvt-arc etc. pass
b) release builders always kick off async suites
c) suite_scheduler only uses green release builds (passing bvt-*, possibly with extra constraints)
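To pin down the requested gate, a hedged sketch; suite_results is a hypothetical mapping of blocking-suite name to pass/fail, and 'bvt-installer' stands in for whatever HWTEST_INSTALLER_SUITE resolves to:
HWTEST_INSTALLER_SUITE = 'bvt-installer'  # stand-in value

def should_kick_off_async(suite_results, require_one_extra=False):
    """Kick off async suites whenever the installer suite passed.

    With require_one_extra=True, also require at least one other
    blocking suite to have passed (the "extra sanity" variant).
    """
    if not suite_results.get(HWTEST_INSTALLER_SUITE, False):
        return False
    if require_one_extra:
        return any(passed for name, passed in suite_results.items()
                   if name != HWTEST_INSTALLER_SUITE)
    return True

# Even with bvt-inline red (e.g. one flaky model), async still runs:
print(should_kick_off_async({'bvt-installer': True, 'bvt-inline': False}))
# -> True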
Oct 3
I think I understand the grunt case now
---
blocking: Setting this to true requires that this suite must PASS for suites
scheduled after it to run. This also means any suites that are
scheduled before a blocking one are also blocking ones scheduled
after. This should be used when you want some suites to block
whether or not others should run e.g. only run longer-running
suites if some core ones pass first.
Note, if you want multiple suites to block other suites but run
in parallel, you should only mark the last one scheduled as
blocking (it effectively serves as a thread/process join).
async: Fire-and-forget suite.
---
The unexpected thing is the serialization caused by "blocking". In other words, order matters a lot.
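To illustrate the serialization, a small sketch (illustrative only, not chromite code) that splits the ordered hw_tests list into the phases the blocking flags imply:
def split_into_phases(hw_tests):
    # Suites within a phase may run in parallel; the next phase starts
    # only after the phase's trailing blocking suite (the join point)
    # has passed.
    phases, current = [], []
    for suite in hw_tests:
        current.append(suite['suite'])
        if suite['blocking']:
            phases.append(current)  # blocking suite closes the phase
            current = []
    if current:
        phases.append(current)      # trailing (here: async) suites
    return phases

# Applied to the grunt-release list in the next comment, this gives:
#   [['sanity'],
#    ['bvt-inline', 'bvt-arc', 'bvt-installer'],
#    ['bvt-cq', 'bvt-perbuild']]
# so reordering the entries changes which suites gate which.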
Oct 3
Simplified builder config:
"grunt-release": {
"boards": [
"grunt"
],
"hw_tests": [
"{"async": false, "blocking": true, "suite": "sanity",
"{"async": false, "blocking": false, "suite": "bvt-inline", "priority": "Build"
"{"async": false, "blocking": false, "suite": "bvt-arc",
"{"async": false, "blocking": true, "suite": "bvt-installer",
"{"async": true, "blocking": false, "suite": "bvt-cq",
"{"async": true, "blocking": false, "suite": "bvt-perbuild", "priority": "PostBuild"
[...]
Translated:
1) Run and wait for sanity to finish.
2) If it passes, run and wait for bvt-inline, bvt-arc, bvt-installer to finish.
3) If those pass, start kicking off the async suites.
I bet the intention for the release builder was:
1) Run sanity + bvt-installer in parallel and wait for them to finish, to test whether the build is DOA or kills DUTs.
2) If they pass, run all of bvt-inline, bvt-arc, bvt-cq, bvt-tast synchronously. (Use this status on dashboards and in suite_scheduler.)
3) Kick off all the async suites without waiting, but ideally such that they don't compete with 2). This is achieved by using a lower priority/different pool.
So what we need to do here is reorder the JSON (and I think we should fix bvt-cq to be sync and give sanity priority CQ):
"grunt-release": {
"boards": [
"grunt"
],
"hw_tests": [
1) "{"async": false, "blocking": false, "suite": "sanity", "priority": "CQ" (fix)
"{"async": false, "blocking": true, "suite": "bvt-installer", "priority": "CQ"
2) "{"async": false, "blocking": false, "suite": "bvt-inline", "priority": "Build"
"{"async": false, "blocking": false, "suite": "bvt-arc",
"{"async": false, "blocking": false, "suite": "bvt-cq",
"{"async": false, "blocking": false, "suite": "bvt-tast",
3) "{"async": true, "blocking": false, "suite": "bvt-perbuild", "priority": "PostBuild"
[more asyn...]
Oct 3
(Notice we could first run sanity and then block on the result, then run bvt-installer and block on that result. But I think we should only block once, either after sanity or after sanity+installer.)
Oct 3
+Don for comment on 14/15.
Oct 11
@Don to reply to #14/15.
Oct 11
ihf@ seems right to me, with two notes:
A) I'm not at all sure what the semantics should be if both "blocking" and "async" are set. It would make sense to me to add something to chromeos_config_unittest to make sure we don't do that.
B) If the sanity suite passes, why block the async tests on bvt-installer? If the lab team agrees, let's remove that gate.
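Something along these lines for note A, as a sketch only; site_config stands in for the real parsed builder configs (the actual chromeos_config structures differ), so a toy example is inlined to make it self-contained:
import unittest

# Toy stand-in for the parsed builder configs; the bvt-cq entry mirrors
# the contradictory blocking+async combination found in comment 11.
site_config = {
    'grunt-release': {
        'hw_tests': [
            {'suite': 'sanity', 'blocking': True, 'async': False},
            {'suite': 'bvt-cq', 'blocking': True, 'async': True},  # bad
        ],
    },
}

class HWTestFlagsTest(unittest.TestCase):

    def test_blocking_and_async_never_both_set(self):
        for name, config in site_config.items():
            for suite in config.get('hw_tests', []):
                self.assertFalse(
                    suite['blocking'] and suite['async'],
                    '%s: suite %r sets both "blocking" and "async"; '
                    'the combined semantics are undefined'
                    % (name, suite['suite']))

if __name__ == '__main__':
    unittest.main()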
Oct 12
I guess the lab team is not CCed? Doing that now. Even with that gate removed, I feel unibuild with several models would still be very conservative. See issue 894746, where grunt is stable but careena is not. Now, as then, important careena provision failures would still prevent grunt from running.
Nov 8
Looks like the infra team came to the same conclusion today in issue 901871.