Issue 835335

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




automatic triage of infra issues on chromium.perf bots

Project Member Reported by eyaich@chromium.org, Apr 20 2018

Issue description

Infra failures on the perf waterfall are not currently getting triaged automatically.  

We were under the impression that when a bot goes down it is automatically handled by infra troopers, but maybe this is not the case for perf?

I am attaching a view from SOM. The Win 10 High-DPI Perf bots were down for almost a week before we filed a bug for them; until then the failure was ignored because the view implied it was already being addressed.

Is this something that we can investigate?  What is the process for getting these automatically triaged?
 
Screenshot from 2018-04-19 12-07-49.png
Owner: benhenry@chromium.org
Status: Assigned (was: Untriaged)
I'll take a look.
Infra monitoring systems (rather than SoM) should probably catch these and alert the troopers on call.

Comment 3 by efoo@chromium.org, Apr 20 2018

Can someone share an example of the bug filed and the automatic triage mentioned? I may be mistaken, but I have not personally seen any of these triaged automatically to troopers before. AFAIK, this would be something new.
Sheriff-o-Matic never auto-filed bugs based on these alerts. The intention back in the day was for sheriffs to look at the trooper page in Sheriff-o-Matic and see these alerts, but the trooper page never gained significant adoption and was ultimately yet another page for troopers to look at. 

I agree with Sean that, long-term, we should replace these infra failures in Sheriff-o-Matic with monitoring that pages or files tickets to troopers based on these issues. The one use case that might get lost with that approach is that sheriffs wouldn't get notified about infra failures that might affect them in the same way, but we could probably expose that information to sheriffs in a better way.

Comment 5 by eyaich@chromium.org, Apr 23 2018

Thank you for the clarification. 

In the short term can we change the wording from "Troopers should be working on them.  Click here to hide the alerts" to something like "Infra failures are handled by troopers.  Please file a bug at go/bug-a-trooper". 

Also, it would be helpful to keep them open by default; currently they are grouped and collapsed.

We are hoping to ramp up new sheriffs this quarter so the less confusion the better.
Cc: pschmidt@chromium.org
My understanding was that Infra and labs had some sort of pipeline set up so that offline devices would automatically be filed as a ticket for the Labs team. Maybe this is only some subset of Infra problems, though.

Adding pschmidt@, because I know that I've heard him talk about this ticketing system before. Does anyone know for what subset of problems we should expect these tickets to automatically be filed?
Cc: -pschmidt@chromium.org bpastene@chromium.org
pschmidt is no longer actively on the project.

#5: looks like the wording got updated to "There are 3 infra failures currently affecting your tree. Contact a trooper for help with these." w/ a link to go/bug-a-trooper.

#6: that's the device ticket filer, and yes, it's limited to devices.
Ah, sorry, I didn't realize. I guess my question then is: what exactly is a "device failure" under the current definition as perceived by the device ticket filer? It'd be helpful to know the criteria in order to know whether something is being addressed automatically.

Comment 9 by eyaich@chromium.org, May 31 2018

I wanted to clarify how infra issues get surfaced for desktop bots now that soft device affinity has launched. We are no longer surfacing alerts in SOM for these bots, since we silently find live ones instead.

If you look at this page, we have had Linux bots down for weeks: https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&c=os&c=task&c=status&c=pool&f=status%3Adead&f=pool%3Achrome.tests.perf&l=100&q=pool%3Achrome.tests.perf&s=id%3Aasc

John, I think when you say it's limited to devices you mean Android (which is how perf Android devices are handled as well). How do troopers discover dead desktop bots right now? Is there any kind of monitoring/alerting set up on those swarming pools to bring them back up?
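
To make the question concrete, here is a minimal sketch of the kind of check such monitoring might run: flag bots in a pool that have been dead longer than some threshold. Everything here is an assumption for illustration: the `BotRecord` type, the `find_long_dead_bots` helper, the one-day threshold, and the bot names are all hypothetical, and this does not call or describe the actual Swarming API. Only the pool name `chrome.tests.perf` comes from the link above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class BotRecord:
    """Hypothetical snapshot of one bot; not the real Swarming schema."""
    bot_id: str
    pool: str
    is_dead: bool
    last_seen: datetime


def find_long_dead_bots(bots, pool, now, max_downtime=timedelta(days=1)):
    """Return dead bots in `pool` not seen for longer than `max_downtime`,
    ordered so the bot that has been down longest comes first."""
    stale = [
        b for b in bots
        if b.pool == pool and b.is_dead and now - b.last_seen > max_downtime
    ]
    return sorted(stale, key=lambda b: b.last_seen)


# Made-up sample data, only to exercise the helper.
now = datetime(2018, 5, 31, tzinfo=timezone.utc)
bots = [
    BotRecord("linux-perf-1", "chrome.tests.perf", True,
              datetime(2018, 5, 10, tzinfo=timezone.utc)),
    BotRecord("linux-perf-2", "chrome.tests.perf", False,
              datetime(2018, 5, 31, tzinfo=timezone.utc)),
    BotRecord("linux-perf-3", "chrome.tests.perf", True,
              datetime(2018, 5, 30, 18, tzinfo=timezone.utc)),
    BotRecord("other-pool-1", "some.other.pool", True,
              datetime(2018, 5, 1, tzinfo=timezone.utc)),
]

stale_ids = [b.bot_id for b in find_long_dead_bots(bots, "chrome.tests.perf", now)]
print(stale_ids)
```

A check along these lines could page or file a ticket for each flagged bot, which would cover the desktop case that no longer surfaces in SOM.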


Owner: seanmccullough@chromium.org
Handing this off to Sean who has more context.
Project Member

Comment 11 by bugdroid1@chromium.org, Jul 7

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2e87ba2f2c31d81aedd027b2f68a7f2014410904

commit 2e87ba2f2c31d81aedd027b2f68a7f2014410904
Author: Emily Hanley <eyaich@google.com>
Date: Sat Jul 07 12:06:06 2018

Adding documentation with links for infra failures.

Bug: 835335
Change-Id: Ibecab130001d9603189a0dfeb0ad5f4a2b68c683
Reviewed-on: https://chromium-review.googlesource.com/1082602
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#573162}
[modify] https://crrev.com/2e87ba2f2c31d81aedd027b2f68a7f2014410904/docs/speed/bot_health_sheriffing/main.md
