Issue 736473

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 0
Type: Bug




cave-paladin / caroline-paladin died. All DUTs failed provision

Project Member Reported by pprabhu@chromium.org, Jun 23 2017

Issue description

Labels: -Pri-3 Pri-0
Actually, this is a real problem because the bad CL will be retried on its own, since provision failures are blamed on infra.
We need to find the bad CL and kick it out.
The only even remotely blameable CL that the CQ didn't submit in that run was: https://chromium-review.googlesource.com/c/535837/

So, if this was a bad CL, it's now in the tree.
OTOH, I'm not yet convinced this isn't an infra issue.
Cc: davidri...@chromium.org jrbarnette@chromium.org ihf@chromium.org akes...@chromium.org
Owner: pprabhu@chromium.org
Status: Started (was: Untriaged)
cyan did the same:
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1615141

wolf-tot failed, but that's a random flake, not a principled failure like this one.

Note that a lot of these DUTs are now out of commission, risking the next CQ run as well.

+ a few ARC++-knowledgeable people.
Ignoring all CQ smarts.
The only CLs that could be relevant:
https://chromium-review.googlesource.com/c/544801/
https://chromium-review.googlesource.com/c/535837/

coreboot doesn't actually get flashed in the build, iirc:
https://chromium-review.googlesource.com/c/538892/

I went through some of the successful repairs.
They were all USB installs.
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row8-rack6-host1/775492-repair/20172306120434/

That means no logs from the failed provision...
Cc: kirtika@google.com pmalani@chromium.org
+ sheriffs.

My current reading is:
- There was a bad CL that makes DUTs fail to boot (or their network interfaces don't come up).
- Many of the cave, cyan, and caroline DUTs are out of service.
- The bug has made it into ToT.
This looks more like some sort of kernel failure, not ARC++ related.

My leading candidate would be this CL:
    https://chromium-review.googlesource.com/#/c/544801/

I see that the CL has been committed, but I wonder whether our new
history-sensitive commit logic has done the wrong thing.  If it has,
we're in for a very rough ride.

Regarding the failure symptom, here's history on a caroline DUT:
chromeos2-row4-rack11-host19
    2017-06-23 12:56:46  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host19/814436-repair/
    2017-06-23 12:50:12  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host19/814417-provision/
    2017-06-23 12:04:23  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host19/814252-repair/
    2017-06-23 11:24:06  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row4-rack11-host19/814082-provision/

Looking in the status.log for one of the repair jobs, you see that
the DUT was offline, and stayed offline until repaired when servo
re-imaged the DUT from USB.  In other words, on installing the new
build, the DUT crashed and burned, and couldn't even roll back.

Regarding the suspect CL:
    https://chromium-review.googlesource.com/#/c/544801/

The CL changes the kernel eclass, but doesn't bump the kernel ebuild
revision numbers.  The last time we had that kind of bug, we got exactly
this kind of failure.
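
For concreteness, the missing step would look roughly like the sketch
below.  This is only illustrative (the overlay path and package globs are
assumptions, not commands actually run for this incident): the cros-workon
kernel ebuilds need some change of their own so that the builders uprev
and rebuild them alongside the eclass change.

    cd ~/chromiumos/src/third_party/chromiumos-overlay
    # A comment-only edit is enough for the uprev machinery to treat a
    # package as modified and generate a new stable revision.
    for f in sys-kernel/chromeos-kernel-*/chromeos-kernel-*-9999.ebuild; do
      echo "# Trivial change to force an uprev for the cros-kernel2.eclass change." >> "${f}"
    done
    git commit -a -m "sys-kernel: force kernel uprev for eclass change"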

Project Member Comment 10 by bugdroid1@chromium.org, Jun 23 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/884a1db67342b6f3e2242851af712bf98d85eadc

commit 884a1db67342b6f3e2242851af712bf98d85eadc
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Jun 23 20:54:24 2017

Revert "cros-kernel2.eclass: remove loader_kernel_ramfs"

This reverts commit 0d7dfa0e45e283dc57cce238037926ab0b82e81f.

Reason for revert: A completely speculative revert for  crbug.com/736473 
Really, I'm out of ideas. Accept my apologies.

Original change's description:
> cros-kernel2.eclass: remove loader_kernel_ramfs
> 
> loader_kernel is no longer used, so remove references to
> USE=loader_kernel_ramfs
> 
> BUG= chromium:736054 
> TEST=emerge-x86-generic chromeos-kernel-4_4
> TEST=trybots
> 
> Change-Id: Ida540fe85705b2867abf88d8f50e66cc8f2271fc
> Reviewed-on: https://chromium-review.googlesource.com/544801
> Commit-Ready: Drew Davenport <ddavenport@chromium.org>
> Tested-by: Drew Davenport <ddavenport@chromium.org>
> Reviewed-by: Mike Frysinger <vapier@chromium.org>

BUG= chromium:736473 

Change-Id: I599365bab97aca6511ee799d360ec418a45a3061
Reviewed-on: https://chromium-review.googlesource.com/546875
Reviewed-by: Drew Davenport <ddavenport@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/884a1db67342b6f3e2242851af712bf98d85eadc/eclass/cros-kernel2.eclass

Re #7, yeah that's my conclusion as well.

I've reverted the only candidate we have.
https://chromium-review.googlesource.com/c/546875

Too late for the current CQ run though.
> I've reverted the only candidate we have.
> https://chromium-review.googlesource.com/c/546875

That CL has to be it:
  * The failure symptom indicates a kernel crash.
  * VMTest failed on cyan, adding more evidence of a kernel bug.
  * The three failing boards are all sufficiently similar
    that we should expect a kernel problem could affect them
    equally.
  * The blamelist for the failed boards included only that one
    change.  EC changes don't affect CQ runs, and the coreboot
    change wasn't even relevant to the boards.
  * The CL has at least one bug that could cause the symptom.

Killed the affected CQ builds for the current run because I expect them to fail and break all the DUTs again :/

Also killed the CQ master because the slaves hadn't gotten far enough for the master to submit anything at all.

Tree throttled for the next run to gently check if my revert was enough.
Some evidence that this is fixed:

A recent pre-cq run on cyan passed VMTest (while earlier runs weren't passing):
https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/pre_cq/builds/40742
History-aware submit is not really at fault here. The real problem is the usual issue with .eclass changes not triggering uprevs of affected ebuilds. Issue 220902
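
For reference, enumerating the ebuilds affected by an eclass change is straightforward; a sketch, assuming the usual chromiumos-overlay checkout path:

    # List ebuilds that inherit cros-kernel2 and would therefore need an
    # uprev when the eclass changes.
    grep -rl --include='*.ebuild' '^inherit .*cros-kernel2' \
      ~/chromiumos/src/third_party/chromiumos-overlay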

The kernel eclass CL in question would have been submitted on its first try if not for an unrelated flake:

https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/15136

Cc: vapier@chromium.org

Comment 17 by kirtika@google.com, Jun 23 2017

Re #c15, I think I saw some repos complaining during the commit/upload hook that the ebuild had to be manually uprevved. Can the same be done here?

I have a hard time believing that CL:544801 is the problem.  Nowhere do we set USE=loader_kernel ... it was added as part of some Brillo work, but never deployed for any boards.

WRT upload hooks, it is possible to add something for specific eclasses.  Or we could look at extending the existing subdir-to-uprev logic to include the eclasses, as we've had a long-standing bug/request for that.
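
A rough sketch of what such a pre-upload check could do (purely illustrative, not the actual repohooks implementation):

    # Warn when a commit touches an eclass but no ebuild, since the
    # builders won't uprev or rebuild anything for an eclass-only change.
    changed="$(git diff --name-only HEAD~1..HEAD)"
    if echo "${changed}" | grep -q '^eclass/.*\.eclass$' && \
       ! echo "${changed}" | grep -q '\.ebuild$'; then
      echo "WARNING: eclass modified without any ebuild change;" \
           "dependent packages will not be rebuilt." >&2
    fi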
Status: Fixed (was: Started)
Reverting that CL was the only action I took between the failing and passing builds (I interrupted the build in between):
fail: https://uberchromegw.corp.google.com/i/chromeos/builders/cave-paladin/builds/675
pass: https://uberchromegw.corp.google.com/i/chromeos/builders/cave-paladin/builds/677
I still think it's a heisenbug ;)
Hard to believe this was a flake. The build immediately killed 24 different DUTs from 3 boards.

So, when re-landing that CL, be extra careful :)
Can someone (i.e. the sheriffs) manually test the change (or the bad build) on the afflicted devices to try to figure out what's going on?
What can be done to verify whether this CL was actually the problem before I try to reland it? As was mentioned in #c18, this change removed an unused flag, so it should not have affected anything.
jrbarnette pointed out that this CL went to the CQ 3 times.
Focusing on caroline: the first time, we didn't get far enough to reach HWTest:
https://uberchromegw.corp.google.com/i/chromeos/builders/caroline-paladin/builds/335

The second time, we actually _passed_ HWTest. But note that this CQ run had a bunch of other kernel changes which would have forced a kernel ebuild uprev:
https://uberchromegw.corp.google.com/i/chromeos/builders/caroline-paladin/builds/336

The third time, there were no other kernel changes and HWTest ended up killing DUTs.

So the theory that this was caused by the eclass being updated without a kernel uprev stands (although I can't explain why that would be the case).
I wanted to verify this change before attempting to send it to the CQ again. Sending it to a trybot unchanged resulted in failure, which is expected under the assumption this change was problematic:
https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/3288

I made some trivial changes to the kernel .ebuild files to ensure they get uprevved, and tried again. However, this resulted in failure as well:
https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/3307

I also did a remote try against ToT, but this also failed:
https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/3308

Am I just doing this wrong?

Here's how I kicked off the trybots:
cbuildbot --remote --hwtest --novmtest {cave,caroline}-paladin

(Thanks for showing me how to dig into those results further.)

Ignoring the infra issues, given that the tests passed, does that mean that:
1) The original CL wasn't the root cause of the breakage
or
2) Running the hwtests on a trybot isn't sufficient to demonstrate the follow-up CL is safe.

or something else?

Does it make sense that (2) would be the case? Is there something else I could try?
> Ignoring the infra issues, given that the tests passed, does that mean that:
> 1) The original CL wasn't the root cause of the breakage
> or
> 2) Running the hwtests on a trybot isn't sufficient to demonstrate the follow-up CL is safe.
>
> or something else?

The CLs that were tested included code changes to kernel ebuilds
that would force ebuild uprevs.  The theory is that the original
CL caused failures when applied without a proper uprev.  The passing
result with those changes is consistent with the theory.  It also
says that (regardless of cause) the modified CL (with the kernel
ebuild changes) should be safe to submit.
