New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 772943 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Nov 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

CQ self-destruct cancels irrelevant slaves, but then reports these slaves as failed.

Project Member Reported by nxia@chromium.org, Oct 9 2017

Issue description

When CQ destructs with success and all slaves have passed the UploadPrebuilts stage, the CQ should report success and publish uprevis. The fix for  crbug.com/753189  helps to abort all running slaves when self-destruction is triggered, but this introduces another issue: when the running slaves are aborted by self-destruction, they report failures to the CIDB. when the CQ master triages slave failures, it thinks the slaves have failures and so report the CQ as a failed CQ. 


Examples:

https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16523

https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16524

https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16526
 
Interesting.

Two questions:

1) When we abort a slave builder, can we pass it an abort reason?
2) Why does the master check CIDB status of the slaves after CQ completion
   stage has declared success?

Comment 2 by nxia@chromium.org, Oct 9 2017

1) yes. we insert messages into the buildMessageTable.
2) that's the current design, after the wait-for-all-slaves loop finishes, the master fetches the statuses of slaves, triages failures and prints failure summary/links (some slaves failed but didn't affect the CLs, the CQ should still show the failures) 
Cc: jclinton@chromium.org

Comment 4 by nxia@chromium.org, Oct 12 2017

Owner: nxia@chromium.org
I can take this bug as jkop@ is fixing the logging error.

Comment 5 by nxia@chromium.org, Oct 13 2017

Cc: furquan@chromium.org
 Issue 774671  has been merged into this issue.

Comment 6 by nxia@chromium.org, Oct 19 2017

Status: Started (was: Untriaged)
Labels: Chase-Pending
Labels: -Chase-Pending
What is the ETA on this? And is there an easier workaround we can land first.

This is causing havoc on the CLs' messaging:
e.g.: This build failed only in falco-full-compile:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16696

But that spammed the CLs with some 57 failed slaves: 
https://chromium-review.googlesource.com/c/chromiumos/third_party/portage_tool/+/559225

I think that would push the developers to simply CQ:+1 their CL, thinking this widespread failure can't be their fault (that CL was retried and killed 2 CQ runs overnight).

Also, this is polluting the goldeneye and other dashboards: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildHealthDashboard
Summary: CQ self-destruct cancels irrelevant slaves, but then reports these slaves as failed. (was: CQ should self-destruct with success but reported as a failed run.)
Issue 777829 has been merged into this issue.
Aha. I spoke with nxia@ the other day and tried to steer away from the fix she had in flight, toward one based on splitting up CQCompletion stage.

My thinking in part was that this didn't seem like an urgent problem. I didn't understand the degree to which it is spamming developer CLs and likely making them CQ+1 in error.

Can we temporarily fix it, slightly hackily, but just silencing all "cbuildbot failed in" messages?

Comment 15 by nxia@chromium.org, Oct 24 2017

My fix won't fix the issue mentioned in #10. My fix is to make a CQ green if it destructed with success, not destructed with slave failures.
Project Member

Comment 16 by bugdroid1@chromium.org, Nov 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/a0a90f97a4bd9988fc8d518f98b55040a0556610

commit a0a90f97a4bd9988fc8d518f98b55040a0556610
Author: Ningning Xia <nxia@chromium.org>
Date: Wed Nov 01 23:47:15 2017

completion_stages: handle IsFailureFatal for self-destructed master.

If a CQ master build self-destructs, it aborts the running slaves and
inserts the slaves aborted by self-destruction into buildMessageTable.
The aborted slaves may insert exceptions (caused by the abort) into the
failureTable. When the CQ master collects the failures of the builds
(master + slaves) and analyzes the failures are fatal, it should ignore
the failures from the slaves which were aborted by the master.
1) If a master build self-destructs with success, only reports fatal if
the master build itself reports failures.
2) If a master build self-destructs without success, only reports fatal
if the failed and not-aborted-by-self-destruction builds are not sanity
checker builds.

BUG= chromium:772943 
TEST=unit_tests

Change-Id: I5a64cc68efc16e7ffeabdd35e1addf5bca30330d
Reviewed-on: https://chromium-review.googlesource.com/726540
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/a0a90f97a4bd9988fc8d518f98b55040a0556610/lib/cidb.py
[modify] https://crrev.com/a0a90f97a4bd9988fc8d518f98b55040a0556610/cbuildbot/stages/completion_stages.py
[modify] https://crrev.com/a0a90f97a4bd9988fc8d518f98b55040a0556610/cbuildbot/stages/completion_stages_unittest.py
[modify] https://crrev.com/a0a90f97a4bd9988fc8d518f98b55040a0556610/lib/fake_cidb.py
[modify] https://crrev.com/a0a90f97a4bd9988fc8d518f98b55040a0556610/lib/builder_status_lib_unittest.py
[modify] https://crrev.com/a0a90f97a4bd9988fc8d518f98b55040a0556610/lib/builder_status_lib.py
[modify] https://crrev.com/a0a90f97a4bd9988fc8d518f98b55040a0556610/lib/cidb_integration_test.py

Comment 18 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)

Comment 19 by dchan@chromium.org, Jan 23 2018

Status: Fixed (was: Archived)

Sign in to add a comment