CQ self-destructs but notifies sheriffs of "infra failures"
Issue description

This run https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14031 correctly self-destructed because of a bad CL: https://chromium-review.googlesource.com/c/455296/

But the email that went out to the sheriffs was titled "master-paladin infra failures", with the content:

"
master-paladin has encountered infra failures:
x86-mario-paladin timed out
beaglebone-paladin timed out
wolf-paladin timed out
veyron_rialto-paladin timed out
<snip lots of "timed out"> failures
"

Is it possible to change the email subject, since this wasn't actually an infra failure? Something like "CQ self destructed"?

The notification email should also not list the various paladins as "timed out", since they did not actually time out. Could it instead reproduce the output of the CommitQueueCompletion stage? This would help differentiate "usual" failures from actual massive infra failures such as https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14016
Mar 22 2017
It's probably valid to dup this together with the bug I filed yesterday.
Mar 22 2017
This is a different bug. Master build 14016 did have a failed slave build and should be marked as 'failed'. norvez@ was asking to change the content of the email alerts so that they reflect the failure reasons more accurately.
Mar 22 2017
See attached, the email that was sent out. It didn't mention the failed slave "x86-generic-paladin: The BuildPackages stage failed: Packages failed in ./build_packages: chromeos-base/container_utils" that is present in the CommitQueueCompletion output https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14031 It would be good if that was mentioned in the email, since this is the most useful info to diagnose the problem.
Mar 22 2017
Re #4: PackageBuildFailure is not an infra_type failure, which is why it's not printed. I think this is by design and not a change introduced by the self-destructed CQ. You can look at the master build URL to find out which slaves failed.
Mar 22 2017
The infra alerts will only be sent to deputies/sheriffs if a CQ run has failed and has infra failures. Deputies/sheriffs should still look at the waterfall page for other failures (bad CLs, build failures, etc.). The change I will make is to explicitly mention 'self-destruction' in the alert message when the CQ has infra failures and there are still running slaves when the self-destruction happens.
Mar 22 2017
Makes sense, but then I don't understand why the alert was sent in that particular case. AFAICT there was no infra failure, simply a bad CL that made BuildPackages fail, so the CQ correctly decided to self-destruct and stop the slaves that were still running.
Mar 22 2017
Re #7: yes, that's an issue I need to fix. I'm uploading a change to only send out alerts when there are infra failures.
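For readers following along, the gating discussed in this comment could look roughly like the sketch below. It is only an illustration of the idea, not the chromite code: `SlaveFailure`, its `is_infra` flag, `infra_failures`, and `should_send_infra_alert` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SlaveFailure:
    """One failed slave build (hypothetical record, not a chromite type)."""
    builder: str
    reason: str
    is_infra: bool  # Assumed flag: True only for infra-type failures.


def infra_failures(failures: Iterable[SlaveFailure]) -> List[SlaveFailure]:
    """Keep only the failures classified as infra failures."""
    return [f for f in failures if f.is_infra]


def should_send_infra_alert(master_failed: bool,
                            failures: Iterable[SlaveFailure]) -> bool:
    """Alert only when the master failed AND at least one failure is infra."""
    return master_failed and len(infra_failures(failures)) > 0
```

Under this sketch, a run whose only failure is a package build failure caused by a bad CL (as in build 14031) would not trigger an infra alert, even though the master failed and self-destructed.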
Mar 23 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/0f38dc31c589a01f4af9ff3aa9ae64c4c819236d

commit 0f38dc31c589a01f4af9ff3aa9ae64c4c819236d
Author: Ningning Xia <nxia@chromium.org>
Date: Thu Mar 23 03:05:56 2017

Include self-destruction information in the infra alerts.

If self-destructed is True, the alert sent by the CQ-master should explicitly state the CQ was destructed and list the builds which were still running/waiting-to-start when the master was destructed.

BUG= chromium:703874
TEST=unit_tests

Change-Id: I448bc35811b6540425fff0d0f095ace55ca8d98f
Reviewed-on: https://chromium-review.googlesource.com/458519
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Nicolas Norvez <norvez@chromium.org>

[modify] https://crrev.com/0f38dc31c589a01f4af9ff3aa9ae64c4c819236d/cbuildbot/stages/completion_stages_unittest.py
[modify] https://crrev.com/0f38dc31c589a01f4af9ff3aa9ae64c4c819236d/cbuildbot/stages/completion_stages.py
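The behavior described in the commit message could be sketched as follows. This is an illustration under stated assumptions, not the code from completion_stages.py: the function name `build_alert_message`, its parameters, and the exact message wording are all hypothetical.

```python
from typing import List, Sequence


def build_alert_message(master: str,
                        infra_failed: Sequence[str],
                        self_destructed: bool,
                        unfinished: Sequence[str]) -> str:
    """Compose an infra alert body (hypothetical wording and signature).

    infra_failed: slaves that hit infra failures.
    unfinished: slaves still running or waiting to start when the
        master self-destructed; only reported if self_destructed is True.
    """
    lines: List[str] = [f"{master} has encountered infra failures:"]
    lines.extend(f"  {name}" for name in infra_failed)
    if self_destructed:
        # State the self-destruction explicitly instead of reporting
        # the unfinished slaves as "timed out".
        lines.append(
            f"{master} self-destructed; these builds were still "
            "running or waiting to start:")
        lines.extend(f"  {name}" for name in unfinished)
    return "\n".join(lines)
```

For example, `build_alert_message("master-paladin", ["wolf-paladin"], True, ["x86-mario-paladin"])` would list x86-mario-paladin under the self-destruction note rather than as a timeout.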
Mar 23 2017
May 30 2017
Aug 1 2017
Jan 22 2018
Comment 1 by nxia@chromium.org, Mar 22 2017