
Issue 658344


Issue metadata

Status: Archived
Owner:
Closed: Jan 2017
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 2
Type: Feature

Blocked on:
issue 649391
issue 666960

Blocking:
issue 649391




Set up more aggressive alerting for builder infra failures

Project Member Reported by katthomas@chromium.org, Oct 21 2016

Issue description

There are two types of infra failures we are concerned about here: flakey failures that happen occasionally, and outage-magnitude failures. In my experience thus far, the latter happen when one of our dependencies is failing in some way. There is related work to ensure that we fail gracefully in those cases.

In the case of flakey failures, we eventually want to automatically create bugs with the Infra>Trooper label, grouping multiple failures of the same type on the same bug.

That would generate a large number of alerts at the moment, so we could start with a simple email alert sent to chrome-troopers-alerts+staging@ while we’re working on burning down the failure rate.
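
To illustrate the grouping idea (multiple failures of the same type collapsed into one bug or one email), here is a rough sketch. It is not an existing tool: `InfraFailure`, `group_by_type`, `format_alert`, and the failure-type strings are all hypothetical names made up for illustration, and how failures are actually classified is left open.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class InfraFailure:
    """One infra failure on one build (all fields are hypothetical)."""
    builder: str       # builder the failure occurred on
    failure_type: str  # assumed classification label for the failure
    build_url: str     # link to the failing build


def group_by_type(failures):
    """Group failures so that each failure type maps to a single alert."""
    groups = defaultdict(list)
    for failure in failures:
        groups[failure.failure_type].append(failure)
    return dict(groups)


def format_alert(failure_type, failures):
    """Render one plain-text alert body covering every failure of a type."""
    lines = [f"[infra-failure] {failure_type}: {len(failures)} occurrence(s)"]
    lines += [f"  {f.builder}: {f.build_url}" for f in failures]
    return "\n".join(lines)
```

With something like this, one email (or later, one bug) would be produced per key of `group_by_type(...)` rather than per failing build, which should keep the volume manageable while the rate is being burned down.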

For outage-magnitude failures, we care about the impact on CQ cycle time, so this directly relates to the work that the crossover team will be doing this quarter on the speed and reliability of the CQ. We can link the bug for the CQ monitoring/alerting once we have that.


 
I am not sure the Infra>Trooper label is going to help us. Troopers will likely ignore it because it's not an outage, and a long-term fix can easily take longer than the typical 2-day trooper shift. IMHO, we need to contact service owners and ask them to fix their code; if there is no owner, we need to find one (ask benhenry@ about that).
Cc: -katthomas@chromium.org
Labels: -Pri-1 Pri-2
Owner: katthomas@chromium.org
Status: Assigned (was: Untriaged)
Blockedon: 666960
Status: Archived (was: Assigned)
I'm closing this bug. crbug.com/666960 took care of alerting on "outage-magnitude" failures. For flakey failures, we'll be exploring different options for surfacing them, which will be tracked elsewhere.
Labels: cit-pm-5
