Successful CQ run self-destructed and reported failure.
Issue description

In this CQ run: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/13993

The master decided to abort after determining it had enough information to submit all pending CLs without waiting for the remaining slaves:

09:39:34: INFO: will_submit set contains 1 changes: [CL:456636]
might_submit set contains 0 changes: []
will_not_submit set contains 0 changes: []

This is good and correct, EXCEPT that the master also reported that it failed. I think the master should have reported success in this case.
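For reference, a minimal Python sketch (the variable names mirror the log above, but the code is mine, not chromite's) of why the master could stop waiting: nothing is left in might_submit, so results from the remaining slaves cannot change which CLs get submitted.

  # Illustrative only: mirrors the log excerpt above, not the chromite code.
  will_submit = {'CL:456636'}   # CLs the master already knows it will submit
  might_submit = set()          # CLs still waiting on slave results
  will_not_submit = set()       # CLs the master already knows it won't submit

  # With might_submit empty, every CL is classified, so the remaining slaves
  # cannot change the outcome and the master can stop waiting.
  can_finish_early = not might_submit
  print('can finish early:', can_finish_early)  # True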
Mar 21 2017
This presents an interesting corner case.
We want to publish prebuilts if we can. That is done by the CQ master, and it's done only if CQCompletion passed (I believe).
The old rationale for this was: prebuilts will be accurate if and only if all of the CLs that were part of the run were actually submitted.
I'm tempted to say that in the case of will_submit={everything} we should just terminate CQCompletion early with Success, and allow the prebuilts to be published.
I think we need to consider the case where will_submit={everything} even though an important slave failed. In that case, do we want to publish prebuilts? I think the answer is still yes (whether or not we publish doesn't seem like it should affect whether that builder is broken).
If the answer to the above question is no, then we may have to add some strange logic such that if will_submit={everything} we force ourselves to wait for all slaves, so we know whether to publish prebuilts.
But anyway, I think the simpler solution of will_submit={everything} -> terminate with success -> publish prebuilts is probably right.
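A hedged sketch of that simpler option (the function and its return values are hypothetical, not chromite's completion-stage API): when nothing is undecided and nothing is being rejected, report completion as a success so the publish step still runs.

  # Hypothetical helper, illustrative only.
  def completion_status(will_submit, might_submit, will_not_submit):
      if not might_submit and not will_not_submit:
          # will_submit == {everything}: terminate CQCompletion early with
          # success so that prebuilts/uprevs still get published.
          return 'PASS'
      return 'PENDING'  # otherwise keep waiting for slaves as usual

  assert completion_status({'CL:456636'}, set(), set()) == 'PASS'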
Mar 21 2017
Oh, I didn't think about prebuilts.
Regarding "will_submit={everything} even though an important slave failed"...
If our logic to detect which builders matter is perfect, then it's safe to publish even if that builder failed.
If our logic is imperfect and we publish an uprev without a prebuilt, things aren't horrible.
Anyone building that ebuild will compile the package locally, and the difference is invisible other than the time taken. Also, the missing prebuilt will be uploaded automatically by future CQ runs.
The pathological case is that we continuously "succeed early", killing off the same slave repeatedly before it gets to publish. I'm not currently worried about that case, since most builds contain CLs that require all slaves to pass.
Mar 21 2017
Also, yes, I strongly agree we should publish uprevs/prebuilts on early success. I'm more worried about the uprevs than the prebuilts.
Mar 21 2017
Also, correct me if I'm wrong, but I think if the following sequence of events happens, then prebuilts are still OK.
1) master-paladin reached will_submit={everything} even though foo-paladin is still running.
2) master-paladin publishes prebuilts
3) foo-paladin finishes successfully and uploads prebuilts.
I claim that we have still reached a state where foo-paladin's prebuilts are published, because I think master-paladin only updates some global manifest-version pointers (to the latest manifest-version to pull prebuilts from) and doesn't have board-level granularity, at least on the CQ. Is that right?
Mar 21 2017
Ah, but what if foo-paladin is aborted before it can upload; then indeed foo is without prebuilts. And in theory this is actually a downward spiral, because that means future foo-paladin runs are slower, so foo is more likely to be the long pole. Still, that downward spiral only lasts as long as the CQ is testing changes that are irrelevant to foo. Once a foo-related change comes by, we get back to prebuilts for everything. So I'm still not concerned.
Mar 22 2017
Just to make sure: even though foo-paladin didn't upload any prebuilts (because it's irrelevant to the CLs and got canceled), the prebuilts uploaded by the master are still valid? If there's no problem, I'll start the change.
Mar 22 2017
I don't think the master uploads any prebuilts, but it does upload the ebuild uprevs (not quite the same thing). I think it's correct (or at least correct enough) to ignore the prebuilts that ignored slaves would have uploaded.
Mar 22 2017
The master-CQ uprevs and marks the prebuilts as stable? Does it matter if some slaves didn't upload their prebuilts?
Mar 22 2017
If someone didn't upload them, they won't exist, but later builds will fill in the missing prebuilts, and anyone trying to use them will just compile the ebuilds from source. Slower, but it yields the same result.
Mar 22 2017
I think my question would be: if some slaves didn't upload prebuilts, should the master-paladin run publish uprevs?
Mar 22 2017
I believe that it will. The ebuild uprevs are recalculated by the master after it decides which CLs to submit and which not to submit. Though, I'm not sure whether the ebuild uprevs are submitted in the same changes that mark the prebuilts as stable; I assume so.
Mar 22 2017
A successful master-paladin run executes PublishUprevChanges, e.g.
https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fmaster-paladin%2F14041%2F%2B%2Frecipes%2Fsteps%2FPublishUprevChanges%2F0%2Fstdout
Does it run cros_mark_as_stable for all boards? Do we know what happens if some board-paladins didn't finish?
Mar 22 2017
No, I'm not certain that it runs, but cros_mark_as_stable isn't board specific, so if it runs, all boards are covered.
Mar 22 2017
Per the offline discussion with dgarrett@, the remaining problem is: when a slave CQ (say foo-paladin) keeps being the slowest slave and keeps getting cancelled by its master, because the master keeps successfully submitting CLs without foo-paladin, we may end up with a big delay in prebuilts. In the original system, a big delay also happens when the CQ keeps submitting partial sets of CLs without publishing uprevs, but a self-destructing CQ may increase the frequency of this kind of problem. We decided to keep an eye on the tree and see whether this becomes a real issue for us. Will leave the bug open for now.
Mar 23 2017
But #18 will only keep happening until a CL comes along that cares about foo-paladin, at which point we will wait for it. The builder only becomes slow when we don't care about it (because no CLs have cared about it for a long time).
Apr 4 2017
Issue 707696 has been merged into this issue.
Apr 10 2017
If the irrelevant slaves were aborted before they uploaded prebuilts, the master can't safely publish uprevs for all slaves; there might be missing binhost package errors (crbug.com/710036). We can:
1) leave the CQ master as 'fail' when self-destruction happens; or
2) mark the CQ master as 'pass' but not run PublishUprev; or
3) not perform self-destruction in this case and wait for all slaves to complete; or
4) is there a way to publish uprevs only for the relevant slave paladins?
Apr 10 2017
2) Should be easy and safe, unless we abort early *too* regularly.
3) Easy and safest, but slows down the CQ for this case.
4) If we have a robust mapping between slaves and packages files, this is possible. But it's the most complicated to implement.
Apr 10 2017
1) Might be OK with this, though it makes our numbers look worse than they are.
2) Also OK with this, though it runs the risk that we run for a long time without ever publishing (and without ever noticing).
3) Probably my preferred solution. Fits the "CQ tries as hard as it can to do as much as it can, but gives up quickly on impossible things" philosophy.
4) davidjames@: thoughts on whether this is possible? My gut instinct is no.
Apr 10 2017
In order to publish uprevs, we need prebuilts from slaves to exist, and I would argue that every passing run should include prebuilts. It is also desirable to have empty runs that include prebuilts (so that you can update prebuilts w.r.t. chumped changes and Chrome PFQ changes). I suggest we wait for prebuilts to be uploaded on all builders before self-destructing -- this is required any time you are planning on passing a run. If you're failing, you don't actually need to wait for prebuilts, but regardless prebuilt upload is pretty fast so I don't see it as a huge difference if we just move the destruction to happen a bit later.
Apr 10 2017
What David is suggesting requires that the master watch which stages have completed. We should be able to do this via CIDB.
Apr 10 2017
Yes, the stage information is already in CIDB. So, if the master has determined that it is going to self-destruct but wants to pass, we should add a wait step to ensure that the "UploadPrebuilts" stage has completed with status PASS for all important slaves.
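A rough sketch of that wait step (the stage-status lookup below is a stand-in, not the actual CIDB API): before a self-destructing master reports PASS, poll until every important slave shows a passing UploadPrebuilts stage, or give up after a timeout.

  import time

  def all_prebuilts_uploaded(get_stage_status, important_slaves,
                             stage='UploadPrebuilts'):
      # get_stage_status(build, stage) is an assumed helper returning
      # 'pass', 'fail', or None (stage not finished yet).
      return all(get_stage_status(b, stage) == 'pass'
                 for b in important_slaves)

  def wait_for_prebuilts(get_stage_status, important_slaves,
                         timeout_sec=3600, poll_sec=60):
      deadline = time.time() + timeout_sec
      while time.time() < deadline:
          if all_prebuilts_uploaded(get_stage_status, important_slaves):
              return True   # safe to self-destruct with a passing status
          time.sleep(poll_sec)
      return False          # timed out; don't report PASS or publish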
Apr 11 2017
Checking the "UploadPrebuilts" stage should be feasible. "UploadPrebuilts" happens after the BuildPackages and BuildImage stages, so the master can wait for it.
Jun 2 2017
Issue 727369 has been merged into this issue.
Jun 30 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/d4e8a4c3ae1ff79808b13655171984569fa71460

commit d4e8a4c3ae1ff79808b13655171984569fa71460
Author: Ningning Xia <nxia@chromium.org>
Date: Fri Jun 30 19:51:25 2017

Wait slaves to pass UploadPrebuilts when deciding self-destruction.

Before self-destructing on a success build (which will submit all picked up changes), the master should wait for all important slaves to pass UploadPrebuiltsStage in order to publish prebuilts for all slaves.

BUG=chromium:703819
TEST=unit_tests

Change-Id: Iad9fe2fb0b3c6475ba1a7013b87736b471715d77
Reviewed-on: https://chromium-review.googlesource.com/544788
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/d4e8a4c3ae1ff79808b13655171984569fa71460/cbuildbot/build_status_unittest.py
[modify] https://crrev.com/d4e8a4c3ae1ff79808b13655171984569fa71460/cbuildbot/relevant_changes.py
[modify] https://crrev.com/d4e8a4c3ae1ff79808b13655171984569fa71460/lib/constants.py
[modify] https://crrev.com/d4e8a4c3ae1ff79808b13655171984569fa71460/cbuildbot/build_status.py
[modify] https://crrev.com/d4e8a4c3ae1ff79808b13655171984569fa71460/cbuildbot/relevant_changes_unittest.py
Jun 30 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/09fa8173a5b6cd51df0cfa3fc93a27c14d68f1c3

commit 09fa8173a5b6cd51df0cfa3fc93a27c14d68f1c3
Author: Ningning Xia <nxia@chromium.org>
Date: Fri Jun 30 19:51:25 2017

Consider a successful and self-destructed CQ as success.

If a CQ destructed itself early without failures and all important slaves passed the desired stages, consider this CQ as a success and publish uprevs for this build.

BUG=chromium:703819
TEST=unit_tests

Change-Id: I6af7a88fc8308bde2d374f0f470b52f2d9831b1b
Reviewed-on: https://chromium-review.googlesource.com/547046
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/09fa8173a5b6cd51df0cfa3fc93a27c14d68f1c3/cbuildbot/stages/completion_stages_unittest.py
[modify] https://crrev.com/09fa8173a5b6cd51df0cfa3fc93a27c14d68f1c3/cbuildbot/stages/completion_stages.py
Jul 5 2017
Fixed?
Jul 5 2017
Not actually. Looking into two issues:
1. Some builds don't even run the UploadPrebuilts stage (e.g. falco-full-compile-paladin, nyan-full-compile-paladin, etc.); see the sketch below.
2. Some builds already passed the stage but the master didn't notice; that should be a bug, and I'm trying to enable some logs.
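A rough sketch of excluding such builders from the wait, per issue 1 (the set and function below are illustrative; the actual fix keys off the compilecheck build type in chromite's config):

  # Builders that never run an UploadPrebuilts stage (illustrative list).
  NO_PREBUILT_SLAVES = frozenset([
      'falco-full-compile-paladin',
      'nyan-full-compile-paladin',
  ])

  def slaves_to_wait_on(important_slaves):
      """Only wait on slaves that actually upload prebuilts."""
      return [s for s in important_slaves if s not in NO_PREBUILT_SLAVES]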
Jul 5 2017
No. 2 in #36 is a false alarm; I was misled by the wrong log and sent out CL:559955. Will fix No. 1 in #36.
Jul 7 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/60cf737c197796291f598f8698a98f6b16ee502f

commit 60cf737c197796291f598f8698a98f6b16ee502f
Author: Ningning Xia <nxia@chromium.org>
Date: Fri Jul 07 20:15:46 2017

Fix misleading logs in relevant_changes.

BUG=chromium:703819
TEST=unit_tests

Change-Id: Idb4e11cabbddd4d78cd49d7b7b50c2f5ab2df179
Reviewed-on: https://chromium-review.googlesource.com/559955
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/60cf737c197796291f598f8698a98f6b16ee502f/cbuildbot/relevant_changes.py
Jul 7 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/115aaa4094766b5d4378b153502e8eca13824b2d

commit 115aaa4094766b5d4378b153502e8eca13824b2d
Author: Ningning Xia <nxia@chromium.org>
Date: Fri Jul 07 20:15:46 2017

Do not check UploadPrebuiltsStage status of compilecheck builds.

Compilecheck builds don't run UploadPrebuiltsStage. Master builds don't need to check UploadPrebuiltsStage status of Compilecheck builds to decide whether or not to publish prebuilts.

BUG=chromium:703819
TEST=unit_tests

Change-Id: I7f95c48d9f0eb9e85e5cecf1c12259e2001f537d
Reviewed-on: https://chromium-review.googlesource.com/559998
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/115aaa4094766b5d4378b153502e8eca13824b2d/cbuildbot/relevant_changes_unittest.py
[modify] https://crrev.com/115aaa4094766b5d4378b153502e8eca13824b2d/cbuildbot/relevant_changes.py
Jul 10 2017
master/15329 self-destructed and successfully published uprevs: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15329