Add VM test coverage for more kernels
Reported by
jrbarnette@chromium.org,
Aug 9 2017
Issue description
Currently, we only perform VM testing for betty, which tests only a single (recent) kernel version. Recently, a CL broke the CQ for an extended time because it used a kernel feature present in v4.6, but not in various earlier versions such as 3.8 or 3.10 (see bug 753838). We should create configs that will allow VM testing against older kernel versions, and then get those configs into the pre-CQ, so that problems like this will be caught earlier, and more reliably.
,
Aug 9 2017
what would be the candidate build configs for the vmtests?
,
Aug 9 2017
> what would be the candidate build configs for the vmtests?
Candidate configs don't exist. This bug is a request to create them.
,
Aug 9 2017
FWIW there was a conscious decision a few months ago to only have 1 target for VMTests (see discussion in Issue 710629) because the tradeoff between the cost of supporting more VM-specific codepaths for various kernels and the benefit of the extra test coverage didn't look good.
,
Aug 9 2017
> [ ... ] and the benefit of the extra test coverage didn't look good.
To be clear, the lack of the extra test coverage contributed directly to an 18 hour CQ outage. Reading comments on the problem CL, it sounds like this isn't even the first time that this particular kernel difference has caused trouble. That feels like enough additional evidence to justify revisiting the cost/benefit analysis.
,
Aug 14 2017
,
Aug 15 2017
Adding folks from issue 710629 here so we have a common audience. Aviv, as discussed, can we dig up some data that helps justify adding VM test coverage for additional kernels? The implementation would likely entail adding per-kernel betty-like board overlays instead of going back to per-hardware-board VM tests, as was the case in the past.
,
Aug 15 2017
We have occasional outages that better PreCQ kernel coverage would catch. My gut feeling is that they are rare, but expensive to recover from (2-3 days of CQ outage), where rare means every 2-3 months. There have also been some disruptions where kernel bugs were flaky, but might have passed PreCQ anyway. What frequency of outage would be enough to justify the work? If the answer is "one a year", don't bother investigating, just do the work. If the answer is "one a week", just skip the work.
,
Aug 15 2017
I estimate about 1 major outage per month that would be caught by broader vmtest. Looking at recent summary emails for bad CLs that I believe are vmtest-catchable...
august: https://chromium-review.googlesource.com/c/602882 cryptohome change that breaks on some kernels. 8+ cq failures. See Issue 753838.
july: https://chromium-review.googlesource.com/c/565632 bad minijail uprev, broke hwtest. seems vm-catchable. 4 cq failures, 80 false rejections.
june: https://chromium-review.googlesource.com/c/421984 bad kernel 3.18 CL breaks boot on several boards. see Issue 731253. (not certain if vm-catchable though)
june: https://chromium-review.googlesource.com/c/514822 openssh uprev breaks sshd on DUTs, causing hwtest to break.
,
Aug 15 2017
For what it's worth, the last (June) openssh issue was an intermittent failure and was caught by betty once.
,
Aug 15 2017
and https://chromium-review.googlesource.com/c/421984 is not catchable by VMs
,
May 10 2018
,
May 11 2018
Comment 1 by snanda@chromium.org
, Aug 9 2017