cave-paladin / caroline-paladin died. All DUTs failed provision
Issue description: Probably just a bad CL. Filing this bug for tracking purposes. Will close once I have an owner.
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15142
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1615179
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1615158
,
Jun 23 2017
The only even remotely blameable CL that the CQ didn't submit in that run was: https://chromium-review.googlesource.com/c/535837/
So, if this was a bad CL, it's now in the tree. OTOH, I'm not yet convinced this isn't an infra issue.
,
Jun 23 2017
cyan did the same: https://viceroy.corp.google.com/chromeos/suite_details?build_id=1615141
wolf-tot failed too, but that's a random flake, not a principled failure like this one. Note that a lot of these DUTs are now out of commission, putting the next CQ run at risk as well.
+ a few arc++-knowledgeable people
,
Jun 23 2017
Ignoring all CQ smarts, the only CLs that could be relevant are:
https://chromium-review.googlesource.com/c/544801/
https://chromium-review.googlesource.com/c/535837/
coreboot doesn't actually get flashed in the build, IIRC, so this one is likely irrelevant: https://chromium-review.googlesource.com/c/538892/
,
Jun 23 2017
I went through some of the successful repairs; they were all USB installs. Example: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row8-rack6-host1/775492-repair/20172306120434/
That means we have no logs from the failed provision attempts...
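(For anyone else digging: those repair results live in the chromeos-autotest-results GCS bucket, so listing them from a workstation should look something like the following. This is just a sketch; it assumes you have gsutil installed and read access to the bucket.)

  # List the repair logs for the host referenced above (path taken from the URL).
  gsutil ls gs://chromeos-autotest-results/hosts/chromeos2-row8-rack6-host1/775492-repair/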
,
Jun 23 2017
+ sheriffs. My current reading is:
- There was a bad CL that made DUTs fail to boot (or their network interfaces never came up).
- Many of the cave, cyan, and caroline DUTs are out of service.
- The bug has made it into ToT.
,
Jun 23 2017
This looks more like some sort of kernel failure, not ARC++ related.
My leading candidate would be this CL:
https://chromium-review.googlesource.com/#/c/544801/
I see that the CL has been committed, but I wonder whether our
new history-sensitive commit logic has done the wrong thing? If it has,
we're in for a very rough ride.
,
Jun 23 2017
Regarding the failure symptom, here's history on a caroline DUT:
chromeos2-row4-rack11-host19
2017-06-23 12:56:46 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host19/814436-repair/
2017-06-23 12:50:12 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host19/814417-provision/
2017-06-23 12:04:23 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host19/814252-repair/
2017-06-23 11:24:06 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host19/814082-provision/
Looking at the status.log for one of the repair jobs, you can see that
the DUT went offline and stayed offline until servo re-imaged it from
USB during repair. In other words, on installing the new build, the DUT
crashed and burned, and couldn't even roll back.
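(A quick way to confirm the offline symptom directly, as a sketch: check whether the DUT answers ping/ssh at all after a failed provision. The .cros DNS suffix and root ssh access are assumptions about the lab setup, not something verified here.)

  DUT=chromeos2-row4-rack11-host19.cros   # hostname from the history above; domain suffix assumed
  ping -c 3 "$DUT" || echo "DUT not pingable"
  ssh -o ConnectTimeout=10 root@"$DUT" cat /etc/lsb-release || echo "ssh failed"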
,
Jun 23 2017
Regarding the suspect CL:
https://chromium-review.googlesource.com/#/c/544801/
The CL bumps the kernel eclass, but doesn't bump kernel ebuild
rev numbers. The last time we had that kind of bug, we got exactly
this kind of failure.
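(For context, the usual manual workaround in that situation is to revbump the affected ebuilds in the same CL so the builders treat the package as changed. A minimal sketch; the file names and -r numbers below are hypothetical and the real stable ebuild names would need to be checked in chromiumos-overlay.)

  # In the chromiumos-overlay checkout; names/revisions are made up for illustration.
  git mv sys-kernel/chromeos-kernel-4_4/chromeos-kernel-4_4-4.4-r100.ebuild \
         sys-kernel/chromeos-kernel-4_4/chromeos-kernel-4_4-4.4-r101.ebuild
  git commit -m "chromeos-kernel-4_4: revbump to pick up eclass change"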
,
Jun 23 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/884a1db67342b6f3e2242851af712bf98d85eadc

commit 884a1db67342b6f3e2242851af712bf98d85eadc
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Jun 23 20:54:24 2017

Revert "cros-kernel2.eclass: remove loader_kernel_ramfs"

This reverts commit 0d7dfa0e45e283dc57cce238037926ab0b82e81f.

Reason for revert: A completely speculative revert for crbug.com/736473
Really, I'm out of ideas. Accept my apologies.

Original change's description:
> cros-kernel2.eclass: remove loader_kernel_ramfs
>
> loader_kernel is no longer used, so remove references to
> USE=loader_kernel_ramfs
>
> BUG=chromium:736054
> TEST=emerge-x86-generic chromeos-kernel-4_4
> TEST=trybots
>
> Change-Id: Ida540fe85705b2867abf88d8f50e66cc8f2271fc
> Reviewed-on: https://chromium-review.googlesource.com/544801
> Commit-Ready: Drew Davenport <ddavenport@chromium.org>
> Tested-by: Drew Davenport <ddavenport@chromium.org>
> Reviewed-by: Mike Frysinger <vapier@chromium.org>

BUG=chromium:736473
Change-Id: I599365bab97aca6511ee799d360ec418a45a3061
Reviewed-on: https://chromium-review.googlesource.com/546875
Reviewed-by: Drew Davenport <ddavenport@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/884a1db67342b6f3e2242851af712bf98d85eadc/eclass/cros-kernel2.eclass
,
Jun 23 2017
Re #7, yeah, that's my conclusion as well. I've reverted the only candidate we have: https://chromium-review.googlesource.com/c/546875
Too late for the current CQ run, though.
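(For the record, the revert itself is mechanical; locally it comes down to something like the following in the chromiumos-overlay checkout, using the original commit hash from the revert message above. The branch name is made up.)

  repo start speculative-revert .   # hypothetical branch name
  git revert 0d7dfa0e45e283dc57cce238037926ab0b82e81f
  repo upload --current-branch .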
,
Jun 23 2017
> I've reverted the only candidate we have.
> https://chromium-review.googlesource.com/c/546875

That CL has to be it:
* The failure symptom indicates a kernel crash.
* VMTest failed on cyan, adding more evidence of a kernel bug.
* The three failing boards are all sufficiently similar that we should expect a kernel problem to affect them equally.
* The blamelist for the failed boards included only that one change. EC changes don't affect CQ runs, and the coreboot change wasn't even relevant to these boards.
* The CL has at least one bug that could cause the symptom.
,
Jun 23 2017
Killed the affected CQ builds for the current run because I expect them to fail and break all the DUTs again :/ Also killed the CQ master, since the slaves hadn't gotten far enough for the master to submit anything at all. The tree is throttled for the next run so we can gently check whether my revert was enough.
,
Jun 23 2017
Some evidence that this is fixed: A recent pre-cq run on cyan passed VMTest (while earlier runs weren't passing): https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/pre_cq/builds/40742
,
Jun 23 2017
History-aware submit is not really at fault here. The real problem is the usual issue of .eclass changes not triggering uprevs of the affected ebuilds (Issue 220902). The kernel eclass CL in question would have been submitted on its first CQ attempt if not for unrelated flake: https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/15136
,
Jun 23 2017
Re #c15, I think I saw some repos complaining during the commit/upload hook that an ebuild had to be manually uprev-ed. Can the same be done for eclass changes here?
,
Jun 24 2017
i have a hard time believing that CL:544801 is a problem. nowhere do we set USE=loader_kernel ... it was added as part of some brillo work, but never deployed for any boards.

wrt upload hooks, it is possible to add something for specific eclasses. or we could look at extending the existing subdir-to-uprev logic to include the eclasses, as we've had a long-standing bug/request for that.
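(One possible shape for such a check, purely as a sketch and independent of the real repohooks plumbing: warn at upload time when a commit touches an eclass but no ebuild, so the author remembers to revbump consumers manually. The file patterns here are made up.)

  # Hypothetical stand-alone pre-upload check, run from the overlay root.
  changed=$(git diff --name-only HEAD~1..HEAD)
  if echo "$changed" | grep -q '^eclass/.*\.eclass$' && \
     ! echo "$changed" | grep -q '\.ebuild$'; then
    echo "warning: eclass changed with no ebuild revbump in the same commit" >&2
  fi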
,
Jun 24 2017
Reverting that CL was the only action I took between these builds.
fail: https://uberchromegw.corp.google.com/i/chromeos/builders/cave-paladin/builds/675
(I interrupted the next one.)
pass: https://uberchromegw.corp.google.com/i/chromeos/builders/cave-paladin/builds/677
,
Jun 24 2017
i still think heisenbug ;)
,
Jun 24 2017
Hard to believe this was a flake. The build immediately killed 24 different DUTs from 3 boards. So, when re-landing that CL, be extra careful :)
,
Jun 24 2017
Can someone (i.e., the sheriffs) manually test the change (or the bad build) on the afflicted devices to try to figure out what's going on?
,
Jun 27 2017
What can be done to verify whether this CL was actually the problem before I try to reland it? As was mentioned in #c18, this change removed an unused flag, so it should not have affected anything.
,
Jun 27 2017
jrbarnette pointed out that this CL went to the CQ three times. Focusing on caroline:
The first time, we didn't get far enough to reach HWTest: https://uberchromegw.corp.google.com/i/chromeos/builders/caroline-paladin/builds/335
The second time, we actually _passed_ HWTest. But note that this CQ run included a bunch of other kernel changes that would have forced a kernel ebuild uprev: https://uberchromegw.corp.google.com/i/chromeos/builders/caroline-paladin/builds/336
The third time, there were no other kernel changes, and HWTest ended up killing DUTs.
So the theory that this was caused by the eclass being updated without a kernel uprev stands (although I can't explain why that would be the case).
,
Jun 29 2017
I wanted to verify this change before attempting to send it to the CQ again.
Sending it to a trybot unchanged resulted in failure, which is expected under the assumption that this change was problematic: https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/3288
I made some trivial changes to the kernel .ebuild files to ensure they get uprevved and tried again; however, this resulted in failure as well: https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/3307
I also did a remote try against ToT, but this also failed: https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/3308
Am I just doing this wrong? Here's how I kicked off the trybots:
cbuildbot --remote --hwtest --novmtest {cave,caroline}-paladin
,
Jun 29 2017
By and large, the results show that all the tests passed, but the suite failed. I haven't dug into the root causes of the suite failures, but they look infrastructure-related. For this CL, the only test that matters is provision_AutoUpdate.double, in the 'bvt-inline' suite. By the numbers:

build - https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/3288
suite - http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=125380256
test - http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=125380263

build - https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/3307
suite - http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=125765991
test - http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=125766074

Against ToT:
build - https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/3308
suite - http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=125766197
test - http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=125766210
,
Jun 29 2017
(Thanks for showing me how to dig into those results further.)
Ignoring the infra issues, given that the tests passed, does that mean that:
1) The original CL wasn't the root cause of the breakage, or
2) Running the hwtests on a trybot isn't sufficient to demonstrate that the follow-up CL is safe,
or something else?
Does it make sense that (2) would be the case? Is there something else I could try?
,
Jun 29 2017
> Ignoring the infra issues, given that the tests passed, does that mean that:
> 1) The original CL wasn't the root cause of the breakage
> or
> 2) Running the hwtests on a trybot isn't sufficient to demonstrate the follow up CL is safe.
>
> or something else?

The CLs that were tested included code changes to the kernel ebuilds that would force ebuild uprevs. The theory is that the original CL caused failures only when applied without a proper uprev, so the passing result with those changes is consistent with the theory. It also says that (regardless of the cause) the modified CL (with the kernel ebuild changes) should be safe to submit.