
Issue 891467

Issue metadata

Status: Duplicate
Merged: issue 901871
Owner:
Closed: Nov 8
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Feature




ASyncHWTest stages not running after failed HWTest stages

Project Member Reported by djkurtz@google.com, Oct 2

Issue description

Here is R71-11111.0.0 on grunt:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=2989878

Note it has some ASyncHWTest stages kicked off ("[bvt-cq] [grunt]", "[bvt-perbuild] [aleena]", etc.).

Here is R71-11112.0.0 on grunt:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=2990551

No ASyncHWTest stages are getting kicked off.
 
Cc: dgagnon@chromium.org djmm@chromium.org kbleicher@chromium.org
Labels: M-71
Owner: jclinton@chromium.org
This is across builds.  It looks like ASyncHWTest stages just stopped getting kicked off.  Assuming there was some regression introduced in CBuildBot.
Cc: ihf@chromium.org
Status: Started (was: Assigned)
Using grunt-release as an investigative source is problematic because grunt-release hasn't had a successful build since 11111.0, so all of the missing ASyncHWTest stages are expected on failed builds. Investigating across other builds.
Here's the last successful fizz-release R71-11117.0.0 (also unibuild) and it has ASync results: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933876149885847408
Summary: ASyncHWTest stages not running on failed builds (was: ASyncHWTest stages not running since R71-11112.0.0)
And scarlet-release just now @ R71-11120.0.0 https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8933784982830710416

I think everything is working as intended and to the extent that things aren't running it's because a build is failing. AFAICT, this is WAI.

Does anyone disagree?
Cc: bhthompson@chromium.org
Jason,
that sounds reasonable, but there was some doubt as to whether that was the logic or not.  Can you link to where that logic is implemented for future reference?
We might want bvt-cq to get kicked off even if the build fails, particularly for unified build boards where the probability of any one model failing is higher. 

I guess the logic as written assumes that if any of the inline tests fail, the build is bad. I am not sure that holds true in our current situation; most of the time we can get tests to pass on rerun (flakes/lab timeouts/etc.). If we keep it this way, we might start regressing in bvt-cq but not notice for several days, and we will have dug ourselves into a deeper hole.

If we do want this sort of logic (inline test failures mean no asynchronous tests), it may be best implemented per model, such that even if the build fails because model X failed bvt-inline, model Y that passed the inline tests gets bvt-cq kicked off.
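A minimal sketch of that per-model idea, assuming a per-model results dict (hypothetical helper, not existing cbuildbot behavior):

    # Hypothetical per-model gating (illustration only): only models whose
    # blocking inline suites passed get the async suites kicked off, instead
    # of gating on the whole build.

    def models_for_async_suites(inline_results):
        """inline_results: dict mapping model name -> True if its blocking suites passed."""
        return [model for model, passed in inline_results.items() if passed]

    # Example: {"grunt": True, "careena": False} -> async suites still run for
    # "grunt" even though the build as a whole failed because of "careena".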
Labels: -Type-Bug Type-Feature
Summary: ASyncHWTest stages not running after failed HWTest stages (was: ASyncHWTest stages not running on failed builds)
Sorry for the delay. Here's the logic:
http://cs/chromeos_public/chromite/cbuildbot/builders/simple_builders.py?l=123&rcl=8775060263d67635cb7c004f5cceeec68043fcf6

On the second loop through that block, the suites are non-blocking, so the ASyncHWTest stages are subsequently scheduled all at once. That's after the blocking HWTests have already finished. I think this was, in part, to prevent over-subscription of the hardware pools, but I'm not sure. Anyway, yes, it dates to roughly mid-2017.
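For reference, here is a rough sketch of that behavior, assuming the suites are walked in config order (an illustration only, not the actual simple_builders.py code):

    # Illustrative sketch: a failed blocking suite prevents every suite
    # scheduled after it (including the async ones) from being kicked off,
    # while async suites are fire-and-forget.

    def run_hw_test_stages(hw_test_configs, run_suite, kick_off_async):
        for cfg in hw_test_configs:
            if cfg["async"]:
                kick_off_async(cfg["suite"])   # scheduled, result never waited on
                continue
            passed = run_suite(cfg["suite"])   # blocks until the suite finishes
            if cfg["blocking"] and not passed:
                return False                   # nothing scheduled after this runs
        return True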

So what do we want to do with this? Is there a feature request here?

Cc: derat@chromium.org
The hardware pool needs to be sized for all suites to run, as release builders are clocked on a fixed schedule. So the protection here is not for over-subscription, but (on paper) to not install too many bad images. Considering that bvt-arc and bvt-inline already can "burn" the whole pool:cq/bvt there isn't much protection one way or another.

I think the simplest feature request here is for unibuild to still kick off all async suites as long as HWTEST_INSTALLER_SUITE passed (and ignore all the other blocking suites). Or, if extra sanity is wanted, kick off all async suites if HWTEST_INSTALLER_SUITE plus one other suite passed. Right now, insisting that all blocking suites pass before kicking off the async suites is asking for too much (as the number of models in a board grows, the probability of async kickoff goes down).
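A minimal sketch of that relaxed gate, assuming a per-suite results dict and that HWTEST_INSTALLER_SUITE maps to "bvt-installer" (the helper and the constant value here are assumptions, not existing chromite code):

    # Hypothetical: kick off async suites whenever the installer suite passed,
    # optionally requiring one more blocking suite to pass for extra sanity.
    HWTEST_INSTALLER_SUITE = "bvt-installer"  # assumed value, for illustration

    def should_kick_off_async(suite_results, require_one_extra=False):
        """suite_results: dict mapping blocking suite name -> True if it passed."""
        if not suite_results.get(HWTEST_INSTALLER_SUITE, False):
            return False
        if require_one_extra:
            return any(passed for suite, passed in suite_results.items()
                       if suite != HWTEST_INSTALLER_SUITE)
        return True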

--

1) I did a bit of archaeology on bvt-cq. It is both blocking and async. This combination is contradictory/undefined. (At least right now async overrides blocking.) I keep forgetting and rediscovering this "peculiarity" from time to time. It is a good reason not to add tests to bvt-cq but only to bvt-inline. (And another reason why nobody noticed that bvt-cq was not run for a few weeks.)

2) "Blocking" seems to be used in too many ways:
a) blocking on the cq/pfq
b) blocking async suites on the release builder
c) blocking suite_scheduler suites based on release build status

My mental model:
a) cq/pfq would only progress if all of bvt-inline/bvt-arc etc. pass
b) release builders always kick off async suites
c) suite_scheduler only uses green release builds (passing bvt-*, possibly extra constraints)
I think I understand the grunt case now

---
    blocking: Setting this to true requires that this suite must PASS for suites
              scheduled after it to run. This also means any suites that are
              scheduled before a blocking one are also blocking ones scheduled
              after. This should be used when you want some suites to block
              whether or not others should run e.g. only run longer-running
              suites if some core ones pass first.

              Note, if you want multiple suites to block other suites but run
              in parallel, you should only mark the last one scheduled as
              blocking (it effectively serves as a thread/process join).
    async: Fire-and-forget suite.
---

The unexpected thing is the serialization of the "blocking". In other words order matters a lot.


Simplified builder config:

    "grunt-release": {
        "boards": [
            "grunt"
        ],
        "hw_tests": [
            "{"async": false, "blocking": true,  "suite": "sanity", 
            "{"async": false, "blocking": false, "suite": "bvt-inline",   "priority": "Build"
            "{"async": false, "blocking": false, "suite": "bvt-arc",
            "{"async": false, "blocking": true,  "suite": "bvt-installer",
            "{"async": true,  "blocking": false, "suite": "bvt-cq",
            "{"async": true,  "blocking": false, "suite": "bvt-perbuild", "priority": "PostBuild"
         [...]

Translated:
1) Run and wait for sanity to finish.
2) If it passes, run and wait for bvt-inline, bvt-arc, and bvt-installer to finish.
3) If those pass, start kicking off the async suites.

I bet the intention for the release builder was:
1) Run sanity + bvt-installer in parallel, wait for them to finish, and use them to test whether the build is DOA/kills DUTs.
2) If they pass, run all of bvt-inline, bvt-arc, bvt-cq, bvt-tast synchronously. (Use the status of this on dashboards and in suite_scheduler.)
3) Kick off all async suites without waiting, but ideally in a way such that they don't compete with 2). This is achieved by using a lower priority/different pool.

So what we need to do here is reorder the JSON (and I think we should fix bvt-cq to be sync and give sanity priority CQ):

    "grunt-release": {
        "boards": [
            "grunt"
        ],
        "hw_tests": [
1)          "{"async": false, "blocking": false, "suite": "sanity",        "priority": "CQ" (fix)
            "{"async": false, "blocking": true,  "suite": "bvt-installer", "priority": "CQ"

2)          "{"async": false, "blocking": false, "suite": "bvt-inline",    "priority": "Build"
            "{"async": false, "blocking": false, "suite": "bvt-arc",
            "{"async": false, "blocking": false, "suite": "bvt-cq",
            "{"async": false, "blocking": false, "suite": "bvt-tast",

3)          "{"async": true,  "blocking": false, "suite": "bvt-perbuild",  "priority": "PostBuild"
             [more asyn...]

(Notice we could first run sanity and then block on the result, then run bvt-installer and block on that result. But I think we should only block once, either after sanity or after sanity+installer.)
Cc: dgarr...@chromium.org
+Don for comment on 14/15.
Labels: -Pri-1 -M-71 Pri-2
Owner: ----
Status: Available (was: Started)
Cc: hidehiko@chromium.org
Owner: dgarr...@chromium.org
@Don to reply to #14/15.
Cc: yshaul@chromium.org
ihf@ seems right to me. With two notes:

A) I'm not at all sure what the semantics should be if both "blocking" and "async" are set. It would make sense to me to add something to chromeos_config_unittest to make sure we don't do that (a sketch of such a check is below).

B) If the sanity suite passes, why block the async tests on bvt-installer? If the lab team agrees, let's remove that gate.
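A minimal sketch of the consistency check suggested in (A), assuming the builder configs are available as dicts with an "hw_tests" list (hypothetical structure, not the actual chromeos_config_unittest code):

    # Illustrative check: no hw_tests entry should set both blocking=True and
    # async=True, since that combination is contradictory/undefined.
    def assert_no_blocking_async_conflict(builder_configs):
        for builder_name, config in builder_configs.items():
            for hw_test in config.get("hw_tests", []):
                assert not (hw_test.get("blocking") and hw_test.get("async")), (
                    "%s: suite %s is both blocking and async"
                    % (builder_name, hw_test.get("suite")))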

Cc: haoweiw@chromium.org johndhong@chromium.org
I guess the lab team is not CCed? Doing that.

Even with that gate removed, I feel unibuild with several models would still be very conservative. See issue 894746, where grunt is stable but careena is not. Now, as then, important careena provision failures would still prevent grunt's suites from running.
Cc: stagenut@chromium.org
Components: Infra>Client>ChromeOS>Test
Mergedinto: 901871
Status: Duplicate (was: Available)
Looks like the infra team today came to the same conclusion in issue 901871.
