
Issue 703874 link


Issue metadata

Status: Archived
Owner:
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Feature




CQ self-destructs but notifies sheriffs of "infra failures"

Project Member Reported by norvez@chromium.org, Mar 21 2017

Issue description


This run https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14031 correctly self-destructed because of a bad CL https://chromium-review.googlesource.com/c/455296/

But the email that went out to the sheriffs was titled "master-paladin infra failures" and the content:
"
master-paladin has encountered infra failures:

x86-mario-paladin timed out

beaglebone-paladin timed out

wolf-paladin timed out

veyron_rialto-paladin timed out


<snip lots of "timed out"> failures
"

Is it possible to change the email subject, since it wasn't actually an infra failure? Something like "CQ self-destructed"?

The notification email should also not list the various paladins as "timed out" since they did not actually time out. Could it instead reproduce the output of the CommitQueueCompletion stage?


This would help differentiate "usual" failures from actual massive infra failures such as https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14016


 

Comment 1 by nxia@chromium.org, Mar 22 2017

Status: Started (was: Unconfirmed)
It's probably valid to dup this together with the bug I filed yesterday.

Comment 3 by nxia@chromium.org, Mar 22 2017

This is a different bug. Master build 14016 did have a failed slave build and should be marked as 'failed'. norvez@ was asking to change the content of the email alerts to reflect the failure reasons more accurately.

Comment 4 by norvez@chromium.org, Mar 22 2017

See attached, the email that was sent out.

It didn't mention the failed slave "x86-generic-paladin: The BuildPackages stage failed: Packages failed in ./build_packages: chromeos-base/container_utils" that is present in the  CommitQueueCompletion output https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14031

It would be good if that was mentioned in the email, since this is the most useful info to diagnose the problem.
original_msg (1).txt

Comment 5 by nxia@chromium.org, Mar 22 2017

Re #4:

PackageBuildFailure is not an infra_type failure, which is why it's not printed. I think this is by design and not a change introduced by the self-destructed CQ.

You can look into the master build url and find out what slaves failed.  

Comment 6 by nxia@chromium.org, Mar 22 2017

Infra alerts are only sent to deputies/sheriffs if a CQ run has failed and has infra failures. Deputies/sheriffs should still check the waterfall page for other failures (bad CLs, build failures, etc.).

The change I will make is to explicitly mention 'self-destruction' in the alert message when the CQ has infra failures and there are still running slaves when the self-destruction happens.

Comment 7 by norvez@chromium.org, Mar 22 2017

Makes sense, but then I don't understand why the alert was sent in that particular case.
Afaict there was no infra failure, simply a bad CL that made BuildPackages fail, so the CQ correctly decided to self-destruct and stop the slaves that were still running.

Comment 8 by nxia@chromium.org, Mar 22 2017

Re #7: yes, that's an issue I need to fix. I'm uploading a change to only send out alerts when there are infra failures.
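
For readers following along, the rule described in comments 5 to 8 (alert only on infra-type failures, never on a plain package build failure) could be sketched roughly as below. This is an illustration only, not the chromite code: FailureCategory, SlaveFailure, infra_failures and should_send_infra_alert are hypothetical names made up for this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class FailureCategory(Enum):
    """Hypothetical failure categories; not chromite's real types."""
    INFRA = "infra"
    BUILD = "build"  # e.g. a package that fails to build because of a bad CL


@dataclass
class SlaveFailure:
    """One failure reported by a slave build (hypothetical structure)."""
    builder: str
    category: FailureCategory
    message: str


def infra_failures(failures: List[SlaveFailure]) -> List[SlaveFailure]:
    """Keep only the failures that are classified as infra failures."""
    return [f for f in failures if f.category is FailureCategory.INFRA]


def should_send_infra_alert(failures: List[SlaveFailure]) -> bool:
    """Alert the sheriffs only if at least one failure is an infra failure."""
    return bool(infra_failures(failures))


# The situation from this bug: a single BuildPackages failure, no infra issue.
bad_cl_run = [
    SlaveFailure("x86-generic-paladin", FailureCategory.BUILD,
                 "The BuildPackages stage failed"),
]
print(should_send_infra_alert(bad_cl_run))
```

Under these assumptions, a run like build 14031 would not trigger an infra alert, because its only failure is a build failure.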

Comment 9 by bugdroid1@chromium.org, Mar 23 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/0f38dc31c589a01f4af9ff3aa9ae64c4c819236d

commit 0f38dc31c589a01f4af9ff3aa9ae64c4c819236d
Author: Ningning Xia <nxia@chromium.org>
Date: Thu Mar 23 03:05:56 2017

Include self-destruction information in the infra alerts.

If self-destructed is True, the alert sent by the CQ-master should
explicitly state the CQ was destructed and list the builds which were
still running/waiting-to-start when the master was destructed.

BUG= chromium:703874 
TEST=unit_tests

Change-Id: I448bc35811b6540425fff0d0f095ace55ca8d98f
Reviewed-on: https://chromium-review.googlesource.com/458519
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Nicolas Norvez <norvez@chromium.org>

[modify] https://crrev.com/0f38dc31c589a01f4af9ff3aa9ae64c4c819236d/cbuildbot/stages/completion_stages_unittest.py
[modify] https://crrev.com/0f38dc31c589a01f4af9ff3aa9ae64c4c819236d/cbuildbot/stages/completion_stages.py
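
As a rough illustration of what the commit message describes (state that the CQ was destructed and list the builds that were still running or waiting to start), the alert body could be assembled along these lines. This is a sketch under assumptions, not the code from completion_stages.py: build_infra_alert and its parameters are hypothetical, the wording of the message is invented, and "some-paladin" is a placeholder builder name.

```python
from typing import Iterable


def build_infra_alert(master: str,
                      infra_failure_msgs: Iterable[str],
                      self_destructed: bool = False,
                      unfinished_builds: Iterable[str] = ()) -> str:
    """Return the alert body (hypothetical helper, not the chromite API)."""
    lines = ["%s has encountered infra failures:" % master, ""]
    lines.extend(infra_failure_msgs)
    if self_destructed:
        # Say explicitly that the master stopped early, so the unfinished
        # builds are not mistaken for genuine timeouts.
        lines.append("")
        lines.append("%s self-destructed. Builds still running or waiting "
                     "to start at that point:" % master)
        lines.extend(sorted(unfinished_builds))
    return "\n".join(lines)


print(build_infra_alert(
    "master-paladin",
    ["some-paladin: infra failure (placeholder message)"],
    self_destructed=True,
    unfinished_builds=["wolf-paladin", "beaglebone-paladin"],
))
```

The unfinished builds are listed in their own section instead of being reported as "timed out", which is the distinction requested in the issue description.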

Comment 10 by nxia@chromium.org, Mar 23 2017

Status: Fixed (was: Started)

Comment 11 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 13 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)
