investigate our monitoring/alerting for increased rates of build failures
Issue description
A bad patch got through the CQ (https://chromium-review.googlesource.com/556394, #484302) and caused all of the tryjobs to fail because of bug 73641. While we have alerting on the total number of failed builds, it clearly isn't tuned well enough: we didn't get any alerts for this in a timely manner, which led to heavily backlogged bots in the CQ and the outage discussed in bug 739556. We should tune these alerts and make sure we catch issues much earlier.
,
Jul 8 2017
,
Jul 12 2017
,
Jul 14 2017
When you say we have alerting on the total number of failed builds, what are you referring to? I don't see anything quite like that in buildbot_alerts or the playbook, though we do alert on infra failures.
,
Jul 14 2017
I was thinking of the OverallBuildInfraFailuresAlert, as you guessed. That said, we should probably also have alerts for when there are too many failed builds, period.
,
Jul 14 2017
Agreed, just wanted to make sure I understand the description.
,
Jun 2 2018
Friendly ping. This is a blocking bug on cit-pm-55. Please update need and priority accordingly.
,
Aug 29
The groundwork is being laid in issue 873754; we can add this alert in addition to the infra failure alerts. The graphs already exist in http://vi/auto/prod:chrome-ops-client-infra/chrome_client/builds, though in absolute counts. I imagine a graph of the percentage of failed builds would be more useful, since it would make diagnosis easier.
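To illustrate the percentage idea, here is a minimal sketch of a sliding-window failure-rate check. It is not how the existing alerts are implemented; the names `Build` and `FailureRateMonitor`, and the default window, threshold, and minimum build count, are all hypothetical. It assumes builds are recorded in order of completion time.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Build:
    finished_at: float  # Unix timestamp in seconds (assumed field)
    failed: bool


class FailureRateMonitor:
    """Tracks the fraction of failed builds over a sliding time window."""

    def __init__(self, window_seconds=3600.0, threshold=0.5, min_builds=20):
        # All three defaults are placeholders, not tuned values.
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_builds = min_builds
        self._builds = deque()

    def record(self, build):
        """Adds a finished build and drops builds older than the window."""
        self._builds.append(build)
        cutoff = build.finished_at - self.window_seconds
        while self._builds and self._builds[0].finished_at < cutoff:
            self._builds.popleft()

    def failure_rate(self):
        """Returns the failed fraction in the window, or 0.0 if it is empty."""
        if not self._builds:
            return 0.0
        failed = sum(1 for b in self._builds if b.failed)
        return failed / len(self._builds)

    def should_alert(self):
        """Alerts only when enough builds exist to make the rate meaningful."""
        return (len(self._builds) >= self.min_builds
                and self.failure_rate() >= self.threshold)
```

The minimum build count is there so that a single failed build on a quiet builder does not read as a 100% failure rate.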
,
Oct 22
Thought of the day: the situation described in #0 sounds more like a sheriffing issue, since it's a src-side CL that broke builds. I'm not sure whether it also broke CI builds, which sheriffs would have noticed. These days we also have pending-builds alerts for the Chromium CQ, so we would have received an alert if the CQ got overloaded. Aside from that, I wonder whether, in general, there is any meaningful action a trooper can take when lots of builds start failing, other than pinging a sheriff. If not, maybe we need to alert sheriffs directly?
,
Nov 6
,
Nov 7
Per #9 - should we consider this bug closed?
Comment 1 by dpranke@chromium.org
, Jul 8 2017