
Issue 740303

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 1
Type: Bug

Blocked on:
issue 873754


Participants' hotlists:
chrome-client-infra-monitoring



investigate our monitoring/alerting for increased rates of build failures

Project Member Reported by dpranke@chromium.org, Jul 8 2017

Issue description

A bad patch got through the CQ (https://chromium-review.googlesource.com/556394 , #484302) and caused all of the tryjobs to fail because of bug 73641. We have alerting on the total number of failed builds, but it clearly isn't tuned well enough: we didn't get any alerts for this in a timely manner, which led to heavily backlogged bots in the CQ and the outage discussed in bug 739556.

We should tune these alerts and make sure we catch issues like this much earlier.


 
Labels: cit-pm-57
When you say we have alerting on the total number of failed builds, what are you referring to? I don't see anything quite like that in buildbot_alerts or the playbook, though we do alert on infra failures.
I was thinking of the OverallBuildInfraFailuresAlert, as you guessed.

Though, we should probably have alerts for when there are too many failed builds, period, as well.
Agreed, just wanted to make sure I understand the description.

Comment 7 by efoo@chromium.org, Jun 2 2018

Friendly ping. This is a blocking bug on cit-pm-55. Please update need and priority accordingly. 
Blockedon: 873754
Components: -Infra>Client -Infra>Monitoring Infra>Client>Chrome
The groundwork is being laid in issue 873754; we can add this alert in addition to the infra failure alerts. The graphs already exist at http://vi/auto/prod:chrome-ops-client-infra/chrome_client/builds , though in absolute counts. I imagine a graph of the percentage of failed builds would be more useful, for easier diagnosis.
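To make the percentage idea concrete, here is a minimal sketch of a failure-rate check over a time window. Everything in it is an assumption for illustration: the `BuildWindow` type, the function names, the 50% threshold, and the minimum build count are hypothetical and do not reflect the actual alert configuration or any existing monitoring API.

```python
from dataclasses import dataclass


@dataclass
class BuildWindow:
    """Hypothetical build counts observed during one time window."""
    total: int
    failed: int


def failure_percentage(window: BuildWindow) -> float:
    """Return failed builds as a percentage of all builds in the window."""
    if window.total == 0:
        # No builds ran, so there is no failure rate to report.
        return 0.0
    return 100.0 * window.failed / window.total


def should_alert(window: BuildWindow,
                 threshold_pct: float = 50.0,
                 min_builds: int = 20) -> bool:
    """Alert only when the rate is high AND enough builds ran to trust it.

    Both default values are placeholders, not tuned numbers.
    """
    if window.total < min_builds:
        return False
    return failure_percentage(window) >= threshold_pct


if __name__ == "__main__":
    busy = BuildWindow(total=200, failed=150)
    quiet = BuildWindow(total=4, failed=4)
    print(failure_percentage(busy), should_alert(busy))
    print(failure_percentage(quiet), should_alert(quiet))
```

The minimum build count is there so that a handful of failures during a quiet period does not page anyone; a ratio alone would treat 4 failures out of 4 builds the same as 200 out of 200.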
Thought of the day: the situation described in #0 sounds more like a sheriffing issue, since it's a src-side CL that broke builds. I'm not sure if it also broke CI builds for sheriffs to notice though.

These days we also have pending-build alerts for the Chromium CQ, and we'd receive an alert if the CQ got overloaded.

Aside from that, I wonder whether, in general, there is any meaningful action a trooper can take when lots of builds start failing, other than pinging a sheriff. If not, maybe we need to alert sheriffs directly?
Cc: bpastene@chromium.org
Issue 674683 has been merged into this issue.
Per #9 - should we consider this bug closed?
