CQ self-destruct cancels irrelevant slaves, but then reports these slaves as failed. |
||||||||||
Issue descriptionWhen CQ destructs with success and all slaves have passed the UploadPrebuilts stage, the CQ should report success and publish uprevis. The fix for crbug.com/753189 helps to abort all running slaves when self-destruction is triggered, but this introduces another issue: when the running slaves are aborted by self-destruction, they report failures to the CIDB. when the CQ master triages slave failures, it thinks the slaves have failures and so report the CQ as a failed CQ. Examples: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16523 https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16524 https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16526
,
Oct 9 2017
1) yes. we insert messages into the buildMessageTable. 2) that's the current design, after the wait-for-all-slaves loop finishes, the master fetches the statuses of slaves, triages failures and prints failure summary/links (some slaves failed but didn't affect the CLs, the CQ should still show the failures)
,
Oct 9 2017
,
Oct 12 2017
I can take this bug as jkop@ is fixing the logging error.
,
Oct 13 2017
,
Oct 19 2017
,
Oct 23 2017
,
Oct 23 2017
,
Oct 23 2017
,
Oct 24 2017
What is the ETA on this? And is there an easier workaround we can land first. This is causing havoc on the CLs' messaging: e.g.: This build failed only in falco-full-compile: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16696 But that spammed the CLs with some 57 failed slaves: https://chromium-review.googlesource.com/c/chromiumos/third_party/portage_tool/+/559225 I think that would push the developers to simply CQ:+1 their CL, thinking this widespread failure can't be their fault (that CL was retried and killed 2 CQ runs overnight). Also, this is polluting the goldeneye and other dashboards: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildHealthDashboard
,
Oct 24 2017
,
Oct 24 2017
Issue 777829 has been merged into this issue.
,
Oct 24 2017
Aha. I spoke with nxia@ the other day and tried to steer away from the fix she had in flight, toward one based on splitting up CQCompletion stage. My thinking in part was that this didn't seem like an urgent problem. I didn't understand the degree to which it is spamming developer CLs and likely making them CQ+1 in error. Can we temporarily fix it, slightly hackily, but just silencing all "cbuildbot failed in" messages?
,
Oct 24 2017
jkop@ is fixing the logging at https://chromium-review.googlesource.com/c/chromiumos/chromite/+/692980
,
Oct 24 2017
My fix won't fix the issue mentioned in #10. My fix is to make a CQ green if it destructed with success, not destructed with slave failures.
,
Nov 1 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/a0a90f97a4bd9988fc8d518f98b55040a0556610 commit a0a90f97a4bd9988fc8d518f98b55040a0556610 Author: Ningning Xia <nxia@chromium.org> Date: Wed Nov 01 23:47:15 2017 completion_stages: handle IsFailureFatal for self-destructed master. If a CQ master build self-destructs, it aborts the running slaves and inserts the slaves aborted by self-destruction into buildMessageTable. The aborted slaves may insert exceptions (caused by the abort) into the failureTable. When the CQ master collects the failures of the builds (master + slaves) and analyzes the failures are fatal, it should ignore the failures from the slaves which were aborted by the master. 1) If a master build self-destructs with success, only reports fatal if the master build itself reports failures. 2) If a master build self-destructs without success, only reports fatal if the failed and not-aborted-by-self-destruction builds are not sanity checker builds. BUG= chromium:772943 TEST=unit_tests Change-Id: I5a64cc68efc16e7ffeabdd35e1addf5bca30330d Reviewed-on: https://chromium-review.googlesource.com/726540 Commit-Ready: Ningning Xia <nxia@chromium.org> Tested-by: Ningning Xia <nxia@chromium.org> Reviewed-by: Paul Hobbs <phobbs@google.com> [modify] https://crrev.com/a0a90f97a4bd9988fc8d518f98b55040a0556610/lib/cidb.py [modify] https://crrev.com/a0a90f97a4bd9988fc8d518f98b55040a0556610/cbuildbot/stages/completion_stages.py [modify] https://crrev.com/a0a90f97a4bd9988fc8d518f98b55040a0556610/cbuildbot/stages/completion_stages_unittest.py [modify] https://crrev.com/a0a90f97a4bd9988fc8d518f98b55040a0556610/lib/fake_cidb.py [modify] https://crrev.com/a0a90f97a4bd9988fc8d518f98b55040a0556610/lib/builder_status_lib_unittest.py [modify] https://crrev.com/a0a90f97a4bd9988fc8d518f98b55040a0556610/lib/builder_status_lib.py [modify] https://crrev.com/a0a90f97a4bd9988fc8d518f98b55040a0556610/lib/cidb_integration_test.py
,
Nov 3 2017
fixed. see https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16809
,
Jan 22 2018
,
Jan 23 2018
|
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by dgarr...@chromium.org
, Oct 9 2017