automatic triage of infra issues on chromium.perf bots
Issue description
Infra failures on the perf waterfall are not currently getting triaged automatically. We were under the impression that when a bot goes down it is automatically handled by infra troopers, but maybe this is not the case for perf? I am attaching a view from SOM. The Win 10 High-DPI Perf bots were down for almost a week before we filed a bug for it, but it was ignored given the view that it was already being addressed. Is this something that we can investigate? What is the process for getting these automatically triaged?
,
Apr 20 2018
Infra monitoring systems (rather than SoM) should probably catch and alert these to troopers on call.
,
Apr 20 2018
Can someone share an example of the bug filed and the automatic triage mentioned? I may be mistaken, but I have not personally seen any of these triaged automatically to troopers before. AFAIK, this would be something new.
,
Apr 20 2018
Sheriff-o-Matic never auto-filed bugs based on these alerts. The intention back in the day was for sheriffs to look at the trooper page in Sheriff-o-Matic and see these alerts, but the trooper page never gained significant adoption and was ultimately yet another page for troopers to look at. I agree with Sean that, long-term, we should replace these infra failures in Sheriff-o-Matic with monitoring that pages or files tickets to troopers based on these issues. The one use case that might get lost with that approach is that sheriffs wouldn't get notified about infra failures that might affect them in the same way, but we could probably expose that information to sheriffs in a better way.
,
Apr 23 2018
Thank you for the clarification. In the short term can we change the wording from "Troopers should be working on them. Click here to hide the alerts" to something like "Infra failures are handled by troopers. Please file a bug at go/bug-a-trooper". Also, it would be helpful to keep them open by default as they are currently grouped and closed. We are hoping to ramp up new sheriffs this quarter so the less confusion the better.
,
May 4 2018
My understanding was that Infra and labs had some sort of pipeline set up so that offline devices would automatically be filed as a ticket for the Labs team. Maybe this is only some subset of Infra problems, though. Adding pschmidt@, because I know that I've heard him talk about this ticketing system before. Does anyone know for what subset of problems we should expect these tickets to automatically be filed?
,
May 4 2018
pschmidt is no longer actively on the project. #5: looks like the wording got updated to "There are 3 infra failures currently affecting your tree. Contact a trooper for help with these." w/ a link to go/bug-a-trooper. #6: that's the device ticket filer, and yes, it's limited to devices.
,
May 4 2018
Ah, sorry, I didn't realize. I guess my question then is: what exactly is a "device failure" under the current definition as perceived by the device ticket filer? It'd be helpful to know the criteria in order to know whether something is being addressed automatically.
,
May 31 2018
I wanted to clarify how infra issues get surfaced for desktop bots now that soft device affinity is launched. We are no longer surfacing alerts in SOM for these bots, since we silently find alive ones. If you look at this page, we have had Linux bots down for weeks:
https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&c=os&c=task&c=status&c=pool&f=status%3Adead&f=pool%3Achrome.tests.perf&l=100&q=pool%3Achrome.tests.perf&s=id%3Aasc
John, I think when you say it's limited to devices you mean Android (which is how perf Android devices are handled as well). How do troopers discover dead desktop bots right now? Is there any kind of monitoring/alerting set up on those swarming pools to bring them back up?
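For readers who do not want to pick the filters out of that long link by eye, here is a minimal Python sketch that decodes its query string with the standard library. It only parses the URL quoted in the comment; it does not contact Swarming. Reading the repeated `f` parameters as "filters" is an assumption based on their values, and `botlist_filters` is a hypothetical helper name, not part of any Chromium tool.

```python
from urllib.parse import urlsplit, parse_qs

# The bot list link from the comment above, copied verbatim.
BOTLIST_URL = (
    "https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status"
    "&c=os&c=task&c=status&c=pool&f=status%3Adead&f=pool%3Achrome.tests.perf"
    "&l=100&q=pool%3Achrome.tests.perf&s=id%3Aasc"
)


def botlist_filters(url):
    """Return the key:value pairs carried by the repeated `f` parameters.

    Assumption: each `f` value has the form "key:value" once URL-decoded.
    """
    query = parse_qs(urlsplit(url).query)  # parse_qs also decodes %3A to ":"
    filters = {}
    for item in query.get("f", []):
        key, _, value = item.partition(":")
        filters[key] = value
    return filters


print(botlist_filters(BOTLIST_URL))
```

Under that reading, the link selects bots whose status is `dead` in the `chrome.tests.perf` pool, which matches how the comment describes the page.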
,
Jun 15 2018
Handing this off to Sean who has more context.
,
Jul 7
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/2e87ba2f2c31d81aedd027b2f68a7f2014410904

commit 2e87ba2f2c31d81aedd027b2f68a7f2014410904
Author: Emily Hanley <eyaich@google.com>
Date: Sat Jul 07 12:06:06 2018

Adding documentation with links for infra failures.

Bug: 835335
Change-Id: Ibecab130001d9603189a0dfeb0ad5f4a2b68c683
Reviewed-on: https://chromium-review.googlesource.com/1082602
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#573162}

[modify] https://crrev.com/2e87ba2f2c31d81aedd027b2f68a7f2014410904/docs/speed/bot_health_sheriffing/main.md
Comment 1 by benhenry@chromium.org, Apr 20 2018
Status: Assigned (was: Untriaged)