
Issue 682250

Starred by 1 user

Issue metadata

Status: Archived
Owner: ----
Closed: Aug 2
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocked on:
issue 682248
issue 682251
issue 682253




Surface CQ infra failures in SOM

Project Member Reported by katthomas@chromium.org, Jan 18 2017

Issue description

This is an umbrella bug. We want to be able to use SOM for triaging infra failures. See go/burn-cit for details.
 
Blockedon: 682251
Labels: Infra-Failures
Blockedon: 682253
Labels: Milestone-Workflow
Adding a milestone for SoM triaging. :)
Cc: katthomas@chromium.org
Owner: jparent@chromium.org
Sean - Katie has a pretty full plate with other CQ reliability work, is this something you think you could take on?
I'll take another look at the proposal doc, but IIRC it involves scanning builds on trybots, which SoM doesn't currently do. A lot of the failure grouping logic deals with revision ranges, which might also need to be modified. It could be a big chunk of work.
Labels: Hotlist-Infra-Failures
Cc: jparent@chromium.org
Owner: ----
Status: Available (was: Assigned)
Components: -Infra>Monitoring
Has any further thought gone into this, or has anything changed since the Jan update?
Haven't revisited this since Jan. 

Is SoM really the best place to show this information though? Sheriffs aren't responsible for CQ infra failures, and Troopers don't typically spend much, if any, time on SoM during their shifts.
That is kinda my point :)

We either need to move toward Troopers using SoM more (wouldn't moving the alert bugs into Monorail, and into a queue exposed there, help with that?) or this really should be closed. At a minimum, it's a P1 with no action in months, hence my ping.
Labels: -Pri-1 Pri-2
Lowering pri since this isn't super pressing and needs some clarification before we act on it anyway.
Components: -Infra>CQ
Is this still something we're looking to do? If so, I think it might make more sense to show these failures in go/trooper-queue 

(If this is no longer relevant, we should archive this issue and related issues)
I think it still makes sense to do something along these lines. I'm not sure where the logic belongs though. If the trooper queue is in Monorail, perhaps there? 

Would it be possible to write a cron job in monorail that does the following:
* Query BQ to get a list of infra failures
* Retrieve logs for steps with infra failures from logdog
* Calculate how similar the log is to the logs for open infra failure bugs. If the similarity to an existing bug exceeds a threshold, add the failure to that bug as an additional example. Otherwise, create a new bug.
* Automatically adjust priority based on how many examples of the failure have been seen.
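
A rough sketch of the last two steps, just to make the idea concrete. This is not a proposal for the real implementation. Everything here is an assumption: the `Bug` class, `triage`, `priority_for`, the 0.8 threshold, and the priority cutoffs are made-up names and values, and log similarity is approximated with Python's `difflib.SequenceMatcher`. The BQ query and logdog fetch are left out; logs are passed in as plain strings.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher

# Assumed value; a real threshold would need tuning against actual logs.
SIMILARITY_THRESHOLD = 0.8


@dataclass
class Bug:
    """Hypothetical stand-in for an open infra failure bug."""
    representative_log: str
    examples: list = field(default_factory=list)
    priority: int = 3


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two log strings."""
    return SequenceMatcher(None, a, b).ratio()


def priority_for(example_count: int) -> int:
    """Map the number of examples seen to a priority (cutoffs are assumed)."""
    if example_count >= 10:
        return 0
    if example_count >= 5:
        return 1
    if example_count >= 2:
        return 2
    return 3


def triage(failure_log: str, open_bugs: list,
           threshold: float = SIMILARITY_THRESHOLD) -> Bug:
    """Attach the failure to the most similar open bug, or file a new one."""
    best = max(
        open_bugs,
        key=lambda bug: similarity(failure_log, bug.representative_log),
        default=None,
    )
    if best is not None and similarity(
            failure_log, best.representative_log) >= threshold:
        # Similar enough: record this failure as another example.
        best.examples.append(failure_log)
    else:
        # Nothing similar enough: open a new bug seeded with this log.
        best = Bug(representative_log=failure_log, examples=[failure_log])
        open_bugs.append(best)
    # Priority rises as more examples of the same failure accumulate.
    best.priority = priority_for(len(best.examples))
    return best
```

Calling `triage(log, open_bugs)` once per failure returned by the query would either grow an existing bug's example list or append a new `Bug` to `open_bugs`. Comparing against a single representative log per bug keeps the sketch small; comparing against every example on the bug would be another option.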
Thanks for the feedback! Agreed that Monorail would probably be the place to surface these, though I think Monorail might not be the best place for such a cron job to live.

I'd want to look into whether we could use Alert Manager or maybe other monitoring tools to create these failure alerts. 
Status: Archived (was: Available)
