CQ slaves reuse a stale chroot when a CQ run partially fails; all slaves should clobber the chroot instead.
Issue description:

These two CLs together move a file between packages:
https://chrome-internal-review.googlesource.com/c/chromeos/cheets-scripts/+/572280
https://chromium-review.googlesource.com/c/chromiumos/platform2/+/923617

They were tried by this CQ run, which did not pass (due to another failure):
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/17926

The *next* CQ run did not pick up these CLs, but it failed because of file collisions caused by these CLs:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/17927

The run after that recovered on its own (almost certainly because the failure in BuildPackagesStage caused the chroots on the builders to be recreated).

To avoid breaking more CQ runs, these CLs should not be retried in the CQ until we understand why this happened.
,
Mar 7 2018
Thank you, nya-san. It sounds like this is a generic build-infra issue rather than something caused by my CL itself. pprabhu@, could you unblock my CLs by removing the Verified -1?
,
Mar 8 2018
ping?
,
Mar 8 2018
OK, I buy #1. Thanks for that analysis; it makes sense to me. dgarrett@ tells me there is an old bug about this, and I'm trying to find it to attach here. Sending these CLs back to the CQ is not an option, since it could fail in the same way again. IMHO, chumping them after enough sanity testing might be the lesser of the two evils here.
,
Mar 8 2018
+nxia, can you find the bug about inconsistent clobber on slaves? I can't find it right now, and Don tells me you had it for a while.
,
Mar 8 2018
Also, vapier@: can you confirm whether our reading of this situation is correct? This isn't just about a missing dependency blocker, is it? (These are both cros-workon packages, so I'm guessing the answer is no.)
,
Mar 8 2018
What's the inconsistent clobber? I thought the CQ only reused the incremental chroot when the last CQ-master run was green; was that changed?
,
Mar 8 2018
I think the analysis in #1 is correct. When I turn on snapshots, I'm planning to tie them to the master-paladin status instead of the local status, but I think all the builders currently look at their local status.
,
Mar 8 2018
I believe they do look at local status, and that this is a long-standing bug.
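Roughly, the decision each slave makes today seems to be the following (a minimal sketch with illustrative names, not the actual chromite code):

def should_reuse_chroot(builder_name):
    # Each slave consults only its *own* previous run.
    last_local_status = get_last_local_build_status(builder_name)  # illustrative helper
    return last_local_status == 'pass'

So a slave whose last run passed locally keeps its chroot even when the overall master-paladin run failed and the tried CLs were rejected, which is exactly how the chroot ends up with state from CLs that are no longer in the run.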
,
Mar 8 2018
I've removed my Verified bit from those CLs, and the recommendation is to chump them :'( I'm making this bug the more general ask, because I can't find the old bug. +jclinton, FYI.
,
Mar 9 2018
dgarrett to own or delegate. It's a bug, and it's significant, but probably not urgent from the sound of things. I have no idea how much a fix would cost, but dgarrett might.
,
Mar 12 2018
Do you know which code clobbers the chroot when the last build was red? Can we make it consider the master's status, for example by querying CIDB?
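Something along these lines could work if that code had a CIDB handle at that point (sketch only; the query shape and status value here are assumptions, not the real schema):

def previous_master_was_green(db, master_config='master-paladin'):
    # Look up the most recent master-paladin build in CIDB.
    builds = db.GetBuildHistory(master_config, num_results=1)
    return bool(builds) and builds[0]['status'] == 'pass'

def should_clobber_chroot(db, last_local_status):
    # Clobber unless both the local run and the previous master run passed.
    return last_local_status != 'pass' or not previous_master_was_green(db)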
,
Mar 13 2018
FYI: re #4, the CLs have now been chumped: CL:923617, CL:923926, CL:*572280, CL:*572281.
,
Mar 13 2018
I will try to take a look at this next week.
,
Mar 14 2018
I recommend having the master pass "--clobber" on the command line to the slaves when scheduling them. That puts ownership of the decision in the right place.
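Roughly like this when the master assembles the slaves' command lines (illustrative plumbing, not the actual scheduling code):

def slave_cbuildbot_args(base_args, previous_master_was_green):
    # The master owns the decision and forwards it to every slave via the
    # existing --clobber flag, instead of each slave guessing from local history.
    args = list(base_args)
    if not previous_master_was_green:
        args.append('--clobber')  # force all slaves to start from a fresh chroot
    return args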
,
Mar 15 2018
I've been discussing this offline with people. We're going to add a --revert-chroot option that is less expensive than a full clobber. I'll take this bug, since I've already been looking at it.
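The rough idea, as a sketch (the snapshot helpers below are placeholders, not existing chromite functions):

def prepare_chroot(chroot, revert_chroot=False):
    if not chroot_may_be_stale(chroot):
        return  # safe to reuse as-is
    if revert_chroot and has_clean_snapshot(chroot):
        restore_snapshot(chroot)   # cheap: roll back to a known-good snapshot
    else:
        delete_chroot(chroot)      # expensive: full clobber and rebuild
        create_chroot(chroot)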
,
May 4 2018
Now that crrev.com/c/1037193 is in, slaves will check the previous master's status before reusing a chroot. The scenario in this bug should no longer be possible.