New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 703819 link

Starred by 5 users

Issue metadata

Status: Archived
Owner:
Last visit > 30 days ago
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

Successful CQ run self destructed, and reported failure.

Project Member Reported by dgarr...@chromium.org, Mar 21 2017

Issue description

In this CQ run:

https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/13993


The master decided to abort after it decided it had enough information to submit all pending CLs without waiting for remaining slaves.

09:39:34: INFO: will_submit set contains 1 changes: [CL:456636]
might_submit set contains 0 changes: []
will_not_submit set contains 0 changes: []

This is good and correct, EXCEPT, the master also reported that it failed. I think the master should have reported success in this case.
 

Comment 1 by nxia@chromium.org, Mar 21 2017

Cc: akes...@chromium.org
This also happened a couple times during the weekend. The CQ-master submitted all CLs and marked the build as 'Failure'. Although marking the build as "fail" doesn't hurt, I have no problem with marking the the CQ as "Success". If no objections, I'll make the change. + akeshet@ ?
This presents an interesting corner case.

We want to publish prebuilts, if we can. That is done by CQ master, and it's done only if CQCompletion passed (I believe).

The old rationale for this was: prebuilts will be accurate if and only if all of the CLs that were part of the run were actually submitted.

I'm tempted to say that in the case of will_submit={everything} we should just terminate CQCompletion early with Success, and allow the prebuilts to be published.

I think we need to consider the case where will_submit={everything} even though an important slave failed. In that case, do we want to publish prebuilts? I think the answer is still yes (whether or not we do doesn't seem like it should affect whether that builder is broken).

If the answer to above question is No, then we may have to actually add some strange logic such that if will_submit={everything} we actually force ourselves to wait for all slaves, so we can know whether to publish prebuilts.


But anyway, I think the simpler solution of will_submit={everything} -> terminate with success -> publish prebuilts is probably right.
Oh, I didn't think about prebuilts.

Regarding "submit={everything} even though an important slave failed"....

If our logic to detect which builders matter is perfect, then it's safe to publish even if that builder failed.

If our logic is imperfect and we publish an uprev without a prebuilt, things aren't horrible.

Anyone building that ebuild will compile the package locally, and the difference is invisible other than the time taken. Also, the missing prebuilt will be uploaded automatically by future CQ runs.

The pathological case is that we continuously "succeed early" killing off the same slave repeatedly before it gets to publish. I'm not currently worried about that case since most builds contain CLs a that require all slaves to pass.
Also, yes, I strongly agree we should publish uprevs/prebuilts on early success. I'm more worried about the uprevs than the prebuilts.
Also, correct my if I'm wrong, but I think if the following sequence of events happens, then prebuilts are still ok.

1) master-paladin reached will_submit={everything} even though foo-paladin is still running.
2) master-paladin publishes prebuilts
3) foo-paladin finishes successfully and uploads prebuilts.


I claim that we have still reached a state where foo-paladin's prebuilts are published. Because I think master-paladin is only updating some global manifest-version pointers to the latest manifest-version to try to pull prebuilts from, and doesn't have board level granularity. At least on the CQ. Is that right?
Ah, but what if foo-paladin is aborted before it can upload; then indeed foo is without prebuilts.

And in theory this is actually a downward spiral, because that means future foo-paladin runs are slower, so foo is more likely to be the long pole.

Still, that downward spiral only lasts as long as the CQ is testing changes that are irrelevant to foo. Once a foo-related change comes by, we get back to prebuilts for everything. So I'm still not concerned.

Comment 7 by autumn@chromium.org, Mar 21 2017

Labels: -current-issue

Comment 8 by nxia@chromium.org, Mar 22 2017

Just to make sure the even though foo-paladin didn't upload any prebuilts because it's irrelevant to any CLs and got canceled, the prebuilts uploaded by the master are still valid?

If there's no problem, I'll start the change.

Comment 9 by nxia@chromium.org, Mar 22 2017

Status: Started (was: Untriaged)
I don't think the master uploads any prebuilts, but it does upload the ebuild uprevs (not quite the same thing).

I think it's correct (or at least correct enough) to ignore the prebuilts that ignored slaves would have uploaded.

Comment 11 by nxia@chromium.org, Mar 22 2017

Status: Unconfirmed (was: Started)
the master-CQ uprevs and markes the prebuilts as stable? does it matter if some slaves didn't upload their prebuilts? 

Comment 12 by nxia@chromium.org, Mar 22 2017

Cc: dgarr...@chromium.org
If someone didn't upload them, they won't exist, but later builds will fill in the missing prebuilts, and anyone trying to use them will just compile the ebuilds from source. Slower, but it yields the same result.

Comment 14 by nxia@chromium.org, Mar 22 2017

I think my question would be if some slaves didn't upload prebuilts, should the master-paladin run publish uprevs?
I believe that it will. The ebuild uprevs are recalculated by the master after it decides which CLs to submit, and which to not submit.

Though, I'm not sure if the ebuild uprevs are submitted in the same changes that mark the prebuilts as stable. I assume so.

Comment 16 by nxia@chromium.org, Mar 22 2017

A success master-paladin runs PublishUprevChanges, like 

https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fmaster-paladin%2F14041%2F%2B%2Frecipes%2Fsteps%2FPublishUprevChanges%2F0%2Fstdout

it runs cros_mark_as_stable for all boards? do we know what happens if some board-paladins didn't finish?
No, I'm not certain that it runs, but cros_mark_as_stable isn't board specific, so if it runs, all boards are covered.

Comment 18 by nxia@chromium.org, Mar 22 2017

Per the discussion with dgarrett@ offline, the problem remained is when a slave-CQ (say foo-paladin) keeps being the slowest slave and getting cancelled by its master because the master keeps successfully submitting CLs without foo-paladin, we may end up with a big delay in prebuilts. In the origin system, the big delay also happens when the CQ keeps submitting partial CLs without publishing uprevs, but self-destructed CQ may increase the frequency of this kind of problem.

we decide to keep eyes on the tree and see whether this can become a real issue for us. Will leave the bug open for now. 
But #18 will only keep happening until a CL comes along that cares about foo-paladin, at which point we will wait for that. The builder only becomes slow when we don't care about it (because no CLs do for a long time)
Cc: philipchen@google.com pprabhu@chromium.org shchen@google.com itspeter@google.com
 Issue 707696  has been merged into this issue.

Comment 22 by nxia@chromium.org, Apr 10 2017

If the irrelevant slaves were aborted before they uploaded prebuilts, the master can't safely publish uprevs for all slaves. There might be missing binhost package errors  crbug.com/710036 . we can:

1) leave the CQ-master as 'fail' when self-destruction happens; or 
2) mark the the CQ-master as 'pass' but don't run PublishUprev; or
3) do not perform self-destruction in this case, wait for all slaves to complete; or
4) is there a way to only publish uprevs for relevant slave paladins? 



2) Should be easy and safe, unless we abort early *too* regularly.
3) Easy and safest, but slows down the CQ for this case.
4) If we have a robust mapping between slaves and packages files, this is
   possible. But it's the most complicated to implement.
Cc: davidjames@chromium.org
1) might be ok with this though it makes our numbers look worse than they are
2) also ok with this, though it runs the risk that we may run for a long time without ever publishing (and without ever noticing
3) probably my preferred solution. Fits the "CQ tries as hard as it can to do as much as it can, but gives up quickly on impossible things." philosophy.
4) davidjames@ thoughts on if this is possible? My gut instinct is no.
In order to publish uprevs, we need prebuilts from slaves to exist, and I would argue that every passing run should include prebuilts. It is also desirable to have empty runs that include prebuilts (so that you can update prebuilts w.r.t. chumped changes and Chrome PFQ changes).

I suggest we wait for prebuilts to be uploaded on all builders before self-destructing -- this is required any time you are planning on passing a run. If you're failing, you don't actually need to wait for prebuilts, but regardless prebuilt upload is pretty fast so I don't see it as a huge difference if we just move the destruction to happen a bit later.
What David is suggesting requires that the master watches which stages have completed. We should be able to do this via CIDB.
Yes, the stage information is already in cidb.

So, if the master has determined that it is going to self-destruct but wants to pass, we should add a wait step where we ensure that for all important slaves, "UploadPrebuilts" stage has completed with status PASS.

Comment 28 by nxia@chromium.org, Apr 11 2017

Checking the "UploadPrebuilts" stage should be feasible. "UploadPrebuilts" happens after buildPackages & buildImage stages, the master can wait for that. 

Comment 29 by nxia@chromium.org, Jun 2 2017

Cc: nxia@chromium.org
 Issue 728320  has been merged into this issue.

Comment 30 by nxia@chromium.org, Jun 2 2017

 Issue 727369  has been merged into this issue.

Comment 31 by nxia@chromium.org, Jun 5 2017

Labels: Chase-Pending

Comment 32 by aut...@google.com, Jun 5 2017

Labels: -Chase-Pending Hotlist-Fixit
Project Member

Comment 33 by bugdroid1@chromium.org, Jun 30 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/d4e8a4c3ae1ff79808b13655171984569fa71460

commit d4e8a4c3ae1ff79808b13655171984569fa71460
Author: Ningning Xia <nxia@chromium.org>
Date: Fri Jun 30 19:51:25 2017

Wait slaves to pass UploadPrebuilts when deciding self-destruction.

Before self-destructing on a success build (which will submit all picked
up changes), the master should wait for all important slaves to pass
UploadPrebuiltsStage in order to publish prebuilts for all slaves.

BUG= chromium:703819 
TEST=unit_tests

Change-Id: Iad9fe2fb0b3c6475ba1a7013b87736b471715d77
Reviewed-on: https://chromium-review.googlesource.com/544788
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/d4e8a4c3ae1ff79808b13655171984569fa71460/cbuildbot/build_status_unittest.py
[modify] https://crrev.com/d4e8a4c3ae1ff79808b13655171984569fa71460/cbuildbot/relevant_changes.py
[modify] https://crrev.com/d4e8a4c3ae1ff79808b13655171984569fa71460/lib/constants.py
[modify] https://crrev.com/d4e8a4c3ae1ff79808b13655171984569fa71460/cbuildbot/build_status.py
[modify] https://crrev.com/d4e8a4c3ae1ff79808b13655171984569fa71460/cbuildbot/relevant_changes_unittest.py

Project Member

Comment 34 by bugdroid1@chromium.org, Jun 30 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/09fa8173a5b6cd51df0cfa3fc93a27c14d68f1c3

commit 09fa8173a5b6cd51df0cfa3fc93a27c14d68f1c3
Author: Ningning Xia <nxia@chromium.org>
Date: Fri Jun 30 19:51:25 2017

Consider a successful and self-destructed CQ as success.

If a CQ destructed itself earily without failures and all important
slaves passed the desired stages, consider this CQ as a success and
publish uprevs for this build.

BUG= chromium:703819 
TEST=unit_tests

Change-Id: I6af7a88fc8308bde2d374f0f470b52f2d9831b1b
Reviewed-on: https://chromium-review.googlesource.com/547046
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/09fa8173a5b6cd51df0cfa3fc93a27c14d68f1c3/cbuildbot/stages/completion_stages_unittest.py
[modify] https://crrev.com/09fa8173a5b6cd51df0cfa3fc93a27c14d68f1c3/cbuildbot/stages/completion_stages.py

Fixed?

Comment 36 by nxia@chromium.org, Jul 5 2017

not actually. looking into two issues:

1. some builds don't even run UploadPrebuilts stage (e.g. falco-full-compile-paladin, nyan-full-compile-paladin, etc.)

2. some builds already passed the stage but not noticed by the master, should be a bug and I'm trying to enable some logs. 

Comment 37 by nxia@chromium.org, Jul 5 2017

No.2 in #36 is false alarm, misled by the wrong log, sent out CL:559955.

Will fix No.1 in #36.
Project Member

Comment 38 by bugdroid1@chromium.org, Jul 7 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/60cf737c197796291f598f8698a98f6b16ee502f

commit 60cf737c197796291f598f8698a98f6b16ee502f
Author: Ningning Xia <nxia@chromium.org>
Date: Fri Jul 07 20:15:46 2017

Fix misleading logs in relevant_changes.

BUG= chromium:703819 
TEST=unit_tests

Change-Id: Idb4e11cabbddd4d78cd49d7b7b50c2f5ab2df179
Reviewed-on: https://chromium-review.googlesource.com/559955
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/60cf737c197796291f598f8698a98f6b16ee502f/cbuildbot/relevant_changes.py

Project Member

Comment 39 by bugdroid1@chromium.org, Jul 7 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/115aaa4094766b5d4378b153502e8eca13824b2d

commit 115aaa4094766b5d4378b153502e8eca13824b2d
Author: Ningning Xia <nxia@chromium.org>
Date: Fri Jul 07 20:15:46 2017

Do not check UploadPrebuiltsStage status of compilecheck builds.

Compilecheck builds don't run UploadPrebuiltsStage. Master builds don't
need to check UploadPrebuiltsStage status of Compilecheck builds to
decide whether or not to publish prebuilts.

BUG= chromium:703819 
TEST=unit_tests

Change-Id: I7f95c48d9f0eb9e85e5cecf1c12259e2001f537d
Reviewed-on: https://chromium-review.googlesource.com/559998
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/115aaa4094766b5d4378b153502e8eca13824b2d/cbuildbot/relevant_changes_unittest.py
[modify] https://crrev.com/115aaa4094766b5d4378b153502e8eca13824b2d/cbuildbot/relevant_changes.py

Comment 40 by nxia@chromium.org, Jul 10 2017

Status: Fixed (was: Unconfirmed)
master/15329 self-destructed and successfully published uprevs:

https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15329


Comment 41 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)

Sign in to add a comment