
Issue 819292

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: May 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocked on:
issue 829665




CQ slaves use incorrect stale chroot when a CQ run fails partially. Should all clobber chroot instead.

Project Member Reported by pprabhu@chromium.org, Mar 6 2018

Issue description

These two CLs together move a file between packages: https://chrome-internal-review.googlesource.com/c/chromeos/cheets-scripts/+/572280
https://chromium-review.googlesource.com/c/chromiumos/platform2/+/923617

They were tried by the CQ run: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/17926

The CQ did not pass (due to another failure).
The *next* CQ run did not pick up these CLs, but failed because of file collisions that are due to these CLs:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/17927

The run after that recovered on its own (almost surely because the failure in BuildPackagesStage caused the chroots on the builders to be recreated).

To avoid breaking more CQ runs, those CLs should not be retried in the CQ until we understand why this happened.
 

Comment 1 by nya@chromium.org, Mar 7 2018

Cc: vapier@chromium.org pprabhu@chromium.org
If I understand correctly, this was caused in the following way:

1. The CLs are picked up in master-paladin/17926.
2. The CLs are tested on the slave builders, e.g. caroline-paladin/2813, eve-arcnext-paladin/156, etc.
3. Some slave builders succeed (e.g. caroline-paladin/2813), others fail (e.g. eve-arcnext-paladin/156), so the CLs are not submitted.
4. In the next run, the slave builders that succeeded do an incremental build, so their chroots still contain packages built from the not-yet-submitted CLs.
5. Since the new ebuilds are not written to handle this situation, the build fails.

I believe the fundamental flaw here is that slave builders do incremental builds on top of packages built from not-yet-committed changes. IIUC, a slave builder does an incremental build whenever its own previous run was green; it does not consider the status of the master builder. As long as we keep this behavior, the same failures will continue to happen.
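
To make the sequence in #1 concrete, here is a minimal toy model, not the actual builder code. The `Slave` class, its `prepare`/`build` methods and `stale_cls` are hypothetical names invented for illustration; the only behavior taken from the analysis is that a slave reuses its chroot whenever its own previous run was green.

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    """Hypothetical model of one CQ slave builder and its chroot."""
    name: str
    # CLs whose packages are currently installed in the chroot.
    chroot_cls: set = field(default_factory=set)
    last_local_green: bool = False

    def prepare(self):
        # Behavior described in #1: the decision depends only on the
        # slave's own previous run, never on the master's result.
        if not self.last_local_green:
            self.chroot_cls.clear()  # clobber the chroot

    def build(self, cls_under_test, passes):
        # Building installs packages from the CLs under test.
        self.chroot_cls |= set(cls_under_test)
        self.last_local_green = passes


def stale_cls(slave, submitted):
    """CLs installed in the chroot that never landed in the tree."""
    return slave.chroot_cls - set(submitted)
```

In this model, a slave that went green in a run whose master failed starts the next run with the unsubmitted CLs still installed, while a slave that went red starts clean. That matches the observation that only some builders hit the file collisions.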

Cc: hidehiko@chromium.org akes...@chromium.org
Owner: pprabhu@chromium.org
Thank you nya-san.

So it sounds like this is a generic build infra issue rather than something caused by my CL itself.
pprabhu@, could you unblock my CLs by removing the -1 Verified?
ping?
OK, I buy #1. Thanks for that analysis. Makes sense to me.

dgarrett@ tells me there is an old bug about this, I'm trying to find it to attach here.

Sending these CLs back to the CQ is not an option, since it can fail in the same way again. IMHO, chumping after enough sanity testing might be the lesser of the two evils here.
Cc: -pprabhu@chromium.org nxia@chromium.org
+nxia, can you find the bug about inconsistent clobber on slaves?
I can't find it right now, and Don tells me you had it for a while.
Also vapier@: can you confirm whether our reading of this situation is correct? This isn't just about a missing dependency blocker, is it? (These are both cros-workon packages, so I'm guessing the answer is no.)

Comment 7 by nxia@chromium.org, Mar 8 2018

Cc: dgarr...@chromium.org bmgordon@chromium.org
What's the inconsistent clobber?

I thought the CQ only used the incremental chroot when the last CQ master was green. Was that changed?
I think the analysis in #1 is correct.  When I turn on snapshots, I'm planning to tie them to the master-paladin status instead of the local status, but I think all the builders currently look at their local status.
I believe they do look at local status, and that this is a long standing bug.
Cc: jclinton@chromium.org
Owner: ----
Status: Untriaged (was: Assigned)
Summary: CQ slaves use incorrect stale chroot when a CQ run fails partially. Should all clobber chroot instead. (was: Moving a file from cheets-scripts to platform2 cause the *next* CQ run to fail:)
I've removed my verified bit from those CLs, and my recommendation is to chump :'(

Making this bug the more general ask, because I can't find the old bug.

+jclinton, FYI.
Owner: dgarr...@chromium.org
Status: Assigned (was: Untriaged)
dgarrett to own or delegate.  It's a bug, it is significant, but probably not urgent from the sound of things.  I have no idea how much a fix would cost, but dgarrett might.

Comment 12 by nya@chromium.org, Mar 12 2018

Do you know which code clobbers the chroot when the last build was red? Can we make it consider the master status, for example by querying CIDB?

Comment 13 by nya@chromium.org, Mar 12 2018

Cc: nya@chromium.org
FYI, re #4: the CLs have now been chumped:
CL:923617, CL:923926, CL:*572280, CL:*572281.
Owner: jclinton@chromium.org
I will try to take a look at this next week.
I recommend having the master pass "--clobber" on the command line to the slaves when scheduling them. That puts ownership of the decision in the right place.
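
A rough sketch of what that suggestion could look like on the master side. Only the `--clobber` flag name comes from the comment above; `slave_args`, `schedule` and the shape of the argument list are assumptions for illustration and do not reflect the real scheduling code.

```python
def slave_args(build_config, previous_master_passed):
    """Return the argument list the master would hand to one slave.

    `build_config` and the list layout are illustrative only.
    """
    args = [build_config]
    if not previous_master_passed:
        # The master knows the CLs from the last run were not submitted, so
        # any chroot that built them is stale, even on slaves that went green.
        args.append("--clobber")
    return args


def schedule(slaves, previous_master_passed):
    """Map each slave config name to the arguments it is launched with."""
    return {s: slave_args(s, previous_master_passed) for s in slaves}
```

The point of this design is that the slave no longer has to infer anything: the master, which is the only party that knows whether the previous CLs were submitted, makes the decision once for every slave.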
Owner: bmgordon@chromium.org
Status: Started (was: Assigned)
I've been discussing offline with people.  We're going to add a --revert-chroot option that is less expensive than a full clobber.  I'll take this bug since I've already been looking at it.
Blockedon: 829665
Status: Fixed (was: Started)
Now that crrev.com/c/1037193 is in, slaves will check the previous master's status before reusing a chroot.  The scenario in this bug should no longer be possible.
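
For readers of this bug, the slave-side check can be sketched roughly as follows. This is not the code from the CL. `ChrootAction` and `choose_chroot_action` are hypothetical names, and whether the slave's local status is still consulted alongside the master's is an assumption made here for illustration.

```python
from enum import Enum


class ChrootAction(Enum):
    REUSE = "reuse"  # keep the existing chroot and build incrementally
    RESET = "reset"  # discard stale state (by clobbering or reverting)


def choose_chroot_action(previous_master_passed, previous_local_passed):
    """Decide what a slave does with its chroot at the start of a run.

    Assumption: the chroot is reused only if both the previous master run
    and the slave's own previous run were green.
    """
    if previous_master_passed and previous_local_passed:
        return ChrootAction.REUSE
    # A red master means the CLs built last time were never submitted,
    # so packages in the chroot may not match the tree.
    return ChrootAction.RESET
```

Under this rule, the scenario from the description (slave green, master red) resolves to `RESET`, so the next run does not see packages from unsubmitted CLs.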
Cc: jrbarnette@chromium.org
Issue 796653 has been merged into this issue.
