Surface CQ infra failures in SOM
Issue description: This is an umbrella bug. We want to be able to use SOM for triaging infra failures. See go/burn-cit for details.
,
Jan 18 2017
,
Jan 18 2017
,
Jan 18 2017
Adding a milestone for SoM triaging. :)
,
Jan 25 2017
Sean - Katie has a pretty full plate with other CQ reliability work, is this something you think you could take on?
,
Jan 25 2017
I'll take another look at the proposal doc, but IIRC it involves scanning builds on trybots which SoM doesn't currently do. A lot of the failure grouping logic deals with revision ranges, which also might need to be modified. It could be a big chunk of work.
,
Jan 25 2017
,
Feb 6 2017
,
Mar 15 2017
,
Jun 9 2017
Has any further thought gone into this, or has anything changed since the Jan update?
,
Jun 9 2017
Haven't revisited this since Jan. Is SoM really the best place to show this information though? Sheriffs aren't responsible for CQ infra failures, and Troopers don't typically spend much if any time on SoM during their shifts.
,
Jun 9 2017
That is kinda my point :) We either need to be moving more towards Troopers using SoM (wouldn't moving the alert bugs into Monorail, and into a queue exposed there, help with that?) or this really should be closed. At a minimum, it's a P1 with no action in months, hence my pinging.
,
Jun 10 2017
Lowering pri since this isn't super pressing and needs some clarification before we act on it anyway.
,
Aug 18 2017
,
Dec 14 2017
Is this still something we're looking to do? If so, I think it might make more sense to show these failures in go/trooper-queue. (If this is no longer relevant, we should archive this issue and related issues.)
,
Dec 15 2017
I think it still makes sense to do something along these lines. I'm not sure where the logic belongs though. If the trooper queue is in Monorail, perhaps there? Would it be possible to write a cron job in Monorail that does the following:
* Queries BQ to get a list of infra failures.
* Retrieves logs for steps with infra failures from logdog.
* Calculates how similar each log is to the logs for open infra failure bugs. If the similarity exceeds a threshold, adds the failure to the matching bug as an additional example. If it does not, creates a new bug.
* Automatically adjusts priority based on how many examples of the failure have been seen.
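
To make the proposal concrete, here is a rough Python sketch of the grouping and priority steps only. It assumes the BQ query and the logdog retrieval have already produced each failing step's log as a plain string, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; the thread does not say which measure would be used. The `Bug` class, the `triage` and `priority_for` helpers, the 0.8 threshold, and the priority cutoffs are all hypothetical, not existing Monorail or SoM code.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher

# Hypothetical cutoff; a real value would need tuning against real logs.
SIMILARITY_THRESHOLD = 0.8


@dataclass
class Bug:
    """Hypothetical stand-in for an open infra failure bug."""
    representative_log: str
    examples: list = field(default_factory=list)
    priority: int = 3


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two log texts."""
    return SequenceMatcher(None, a, b).ratio()


def priority_for(example_count: int) -> int:
    """Map the number of examples seen to a priority (cutoffs are assumptions)."""
    if example_count >= 10:
        return 0
    if example_count >= 5:
        return 1
    if example_count >= 2:
        return 2
    return 3


def triage(log: str, open_bugs: list, threshold: float = SIMILARITY_THRESHOLD) -> Bug:
    """Attach the log to the most similar open bug, or open a new bug."""
    best = max(
        open_bugs,
        key=lambda bug: similarity(log, bug.representative_log),
        default=None,
    )
    if best is not None and similarity(log, best.representative_log) > threshold:
        # Similar enough: record it as another example of the same failure.
        best.examples.append(log)
    else:
        # Nothing similar enough: create a new bug seeded with this log.
        best = Bug(representative_log=log, examples=[log])
        open_bugs.append(best)
    # Re-derive priority from how many examples the bug now has.
    best.priority = priority_for(len(best.examples))
    return best
```

Calling `triage(log, open_bugs)` once per failing log from a periodic job would either grow an existing bug's example list or append a new `Bug` to `open_bugs`. Comparing whole logs pairwise is slow for large logs, so a real job would probably compare only the failing step's tail or a normalized signature.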
,
Dec 15 2017
Thanks for the feedback! Agree that Monorail would probably be the place to surface these, though I think Monorail might not be the best place for such a cron job to live. I'd want to look into whether we could use Alert Manager or maybe other monitoring tools to create these failure alerts.
,
Aug 2
Comment 1 by katthomas@chromium.org, Jan 18 2017