CI considers lack of test results a passing BVT |
|||||||||||||||
Issue descriptionIf a bad CL affects the test reporting mechanism, CI won't fail the run. Given a lab provisioned system, CI should report BVT failure if test results aren't received. See: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8941334989769717664 Related: https://crbug.com/867624 This caused a major test escape in Dru, and has in the past caused a number of issues where low lab provisions allowed devices to silently stop running BVT while still reporting pass.
,
Jul 25
Agree, this is WAI.
,
Jul 25
This bug is specifically that bvt-cq should not pass. Moving forward with an actual release isn't part of CI. This Dru brakage should have been caught by CI.
,
Jul 25
But bvt-cq is asynchronous, the build does not wait for it in any builder, by design. The builder only waits for (and can thus fail due to) bvt-inline. It does appear something is funky here though, looking at https://luci-logdog.appspot.com/v/?s=chromeos/buildbucket/cr-buildbucket.appspot.com/8941334989769717664/+/steps/HWTest__bvt-inline___dru_/0/stdout we see a bunch of messages like: ... cheets_StartAndroid.stress.0 [ INFO ] cheets_StartAndroid.stress.0 TEST_NA: Skipping: test not supported on this board/pool. platform_DMVerityCorruption [ INFO ] platform_DMVerityCorruption TEST_NA: Skipping: test not supported on this board/pool. platform_DMVerityBitCorruption [ INFO ] platform_DMVerityBitCorruption TEST_NA: Skipping: test not supported on this board/pool. provision_AutoUpdate.double [ INFO ] provision_AutoUpdate.double TEST_NA: Skipping: test not supported on this board/pool. ... It appears tests are somehow disabled here? If we intentionally turned off a test for a board, the build should not fail because of it IMO, but if this was configured incorrectly that is another matter.
,
Jul 25
Yes, sorry I meant bvt-inline
,
Jul 25
https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/cros/dynamic_suite/suite.py?rcl=e31aa16360ae28bc9398c38e2cb1607db0e83b03&l=1022 suggests that this error means that we don't think the board exists...
,
Jul 25
Re: #4, I'm guessing that it's trying to look up "dru" configuration, when the config is stored under "scarlet". I believe there was a similar issue when reef was renamed into electro/basking, how did that work?
,
Jul 25
Yea it is borked, the builder tried to run: ... 21:49:22: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/t/cbuildbot-tmpji1MI3/tmpgo2Fml/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name scarlet-release/R69-10865.0.0-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:Build' '--tags=suite:bvt-inline' '--tags=build:scarlet-release/R69-10865.0.0' '--tags=task_name:scarlet-release/R69-10865.0.0-bvt-inline' '--tags=board:scarlet' -- /usr/local/autotest/site_utils/run_suite.py --build scarlet-release/R69-10865.0.0 --board scarlet --model dru --suite_name bvt-inline --pool bvt --file_bugs True --priority Build --timeout_mins 180 --retry True --max_retries 5 --minimum_duts 4 --suite_min_duts 6 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 85339587L, 'cidb_build_id': 2738118, 'datastore_parent_key': ('Build', 2738118, 'BuildStage', 85339587L)}" -c ... Emphasis on "--model dru". So lab board name being broken not only does not report results into Goldeneye, but the builder side does not try to test the right thing either :-/. This is probably an independent bug of the GE side bug for 'lab board name', and may be an feature request to have 'lab model name'...
,
Jul 25
Okay, that's a different request than the one you filed; updated description. This is self-service, though: https://cros-goldeneye.corp.google.com/chromeos/console/listModel If that's what you want, just follow the example set by other projects and set 'dru' 'CQ Tests' to 'yes' just like Astronaut was done for Coral. This will be replicated via Config Updater in a few hours and you can check it tomorrow in CQ runs. If you don't have permission, assign back to me and I'll do it.
,
Jul 25
I am wondering if we should deprecate 'lab board name' in Goldeneye, and just change the model labels on these systems (e.g. all scarlets become dru, and all reefs become electro, regardless of whatever they may physically be). Then from here on out, we only allow the real model names to be deployed in the lab, and all is well.
,
Jul 25
+jrbarnette, akeshet Scarlet IS configured to run HWTests, it just fails to run them because it tries to run them on model:dru, which does not exist in the lab. This is an identical state to reef. Following up on comment 10, what happens if we just go in and change the labels, will anything else explode in the lab?
,
Jul 25
> Scarlet IS configured to run HWTests, it just fails to run them because it tries to run them on model:dru, which does not exist in the lab. This is an identical state to reef. Is this in reference to the CQ (= paladin builders), or in general? In the CQ, my reading is that scarlet IS NOT configured to run hwtest. See http://cs/chromeos_public/chromite/config/chromeos_config.py?l=2647&rcl=d81364a25e1e01848eddbd45a2e5fb2b9598bbb8
,
Jul 25
There are two separate issues; Aviv is working on the model label problem. This is how we indicate to test on Astronaut for Coral: http://cs/chromeos_public/chromite/config/config_dump.json?l=16905&rcl=d81364a25e1e01848eddbd45a2e5fb2b9598bbb8 This is, conversely, what we are incorrectly passing for Scarlet: http://cs/chromeos_public/chromite/config/config_dump.json?l=61984&rcl=d81364a25e1e01848eddbd45a2e5fb2b9598bbb8 . Note that all of the models information is missing. We must configure things in GoldeneEye correctly for there to be any chance of the unibuild logic to work correctly. Updating the configuration as I indicated in #9 will fix this and it will impact the way that runs are scheduled.
,
Jul 25
Re comment 12, this is just running tests in general from the release builders, the CQ configuration is not defined in Goldeneye AFAIK. Re comment 13, it does appear the paladin config is missing the model info, but the release builder does have it, maybe this is fixed by enabling CQ tests in GE for the model in question, but that is something we can do after we get basic release tests going IMO.
,
Jul 25
No, the issue is that this failure escaped in the first place, and we should close the test escape that allowed it. If a system that had bvt-inline configured can't run to completion, CI should mark it as a fail generally. This bug has nothing to do with scarlet, other than that's where it was noticed. If there's a lab-board name issue please file a separate bug for that.
,
Jul 25
We have a lot of boards that don't immediately have lab deployments though, I am afraid that if we did as requested, anything new would fail for the first few months until a lab deployment was done. I think what we want is a latched mechanism, if there were tests coming out of the lab, we expect them indefinitely but I am not sure how feasible that is to implement.
,
Jul 25
I believe it's possible to enable BVT per model, correct? We should just not enable BVT until devices are present.
,
Jul 25
As Bernie said, we're not going to implement what you're requesting because that's not compatible with board bring-up; this is the way that things have worked for years and there's no pressing need to implement a new feature here particularly given all of the other things that we have to do. Closing this WontFix. Feel free to reopen and assign to yourself if you want to use this bug to track configuring Scarlet to run HW test stages in CQ. Board labeling is already tracked in issue 867624.
,
Jul 26
After some offline discussion, I think it could be acceptable to have the builder fail if we expected to run tests (per the GE check-boxes for the model). If we do so, this implies we must have discipline about turning on the tests, Nick had suggested this could be part of the deployment process or another bug project tracker puts up, which I think could work. This does mean there is more room for human error, but it would cover cases where DUT pools disappeared for some reason. I think a potentially compelling case today would be that if the lab was still there such that we can talk to it at all, but all the DUTs somehow disappeared, the CQ is likely to still pass (IIUC). As such I think this can still be a valid request, though it seems more like a feature and is not a P1, so adjusting to match. I presume this is not directly Jason's work, but this can probably be triaged out like any other CI feature. This should be blocked on implementation of proper process to ensure the tests get turned on when DUTs are deployed.
,
Jul 26
FWIS. for new devices, this is somewhat related to the discussion we had last week (aka lab entry criteria). The overall plan is to link device phases (*VT) with test required to pass (e.g. new board will need basic lab provisioning passing while other test will be off then additional test must pass/be enabled before xVT phases. We are planning to leverage Project Tracker and setup "burning test pool" for this.
,
Jul 26
We aren't going to do this until at least 2020 so I'm closing this WontFix since the world will likely be completely different then.
,
Jul 26
Josafat and I talked briefly about this issue, seems like its closely related to provisioning (see his comment #20), so we are planning to discuss it at the follow-up meeting (next week).
,
Jul 26
,
Sep 17
No action for a long time and I don't think this is related to provisioning so I'm closing this per the reason in #21. Please reopen if there's some clarification as to how this is related to provisioning.
,
Sep 25
,
Sep 25
Please do not reopen bugs in others' components. You might escalate to Bhaskar if you disagree with the conclusion in #21. Or, reopen if you have some clarification as to how this is related to provisioning. If it is provisioning related, this needs to move to a component related to provisioning such as Infra > Client > ChromeOS > Test.
,
Sep 28
This is needed to prevent test escapes like we encountered on scarlet, and blocks any unibuild migration, including V1->v2.
,
Sep 28
This is part of our long term test planning and monitoring plans. Ultimately, we're going to have clear test plans that must be fulfilled to progress in releases. That's where the monitoring will occur. We're not going to try and retrofit this into the current system. This has nothing to do with v1/v2 migration since the equivalence testing already exists.
,
Sep 28
,
Oct 3
The symptom of this bug is that tests aren't run at all on certain config changes. So it's not immediately clear how adding more tests would address the issue. |
|||||||||||||||
►
Sign in to add a comment |
|||||||||||||||
Comment 1 by bhthompson@google.com
, Jul 25