Add more metrics to Sheriff-o-Matic
Issue description
Follow-up to some comments from the FindIt survey: https://docs.google.com/document/d/1NiIiGgLMvVdCDz4SA3ljOkIwjR9R07EXaNMAiDiUkYA/edit
We should record metrics on things like the average time required to fix alerts. I don't think we currently have much in place for this sort of thing, but it would probably be useful to help improve sheriffing.
,
Nov 12 2016
I have a plan for Findit to track this for compile failures.

1. Add two metadata fields to the commit message of the reverting CL, and ask sheriffs to fill in the build URLs when reverting a CL:
   - REVERTED-COMMIT=git-hash-of-reverted-commit
   - BUILD-URLs=comma-separated-urls-to-the-builds-with-failure-as-reason-to-revert
   IMO, it is fair to ask sheriffs to fill in the build URLs, because they help the CL owner investigate the failure. I have seen quite a few cases where the CL owner asks for a link to the failures. This could be done with changes to the one-click revert button on Rietveld, and by adding a git_revert to depot_tools similar to "git cl ...".
2. Have a cron job track compile tree closures in http://chromium-status.appspot.com/, and get the commit time of the reverting CL in the first green build. We could track compile failures from data in Findit or SoM too, but a tree closure is usually the first occurrence of a given failure.
3. Calculate the difference between the commit time and the failure-occurred time as the time to fix the failure. (We could add another field, REVERT-TIME, if a reverting CL goes through the CQ.)

This general idea might work for test failures too. We don't track test failures on chromium-status any more, but Findit and SoM have that data.

Any pitfalls here?

(A long while ago, I tried to compute a signature for a commit based on its diff, and to match signatures between commits in the first red build and the first following green build to identify the reverting/reverted commits. But I don't think it would work for all cases. Another approach I tried was to guess the revert based on the "> Cr-Commit-Position: refs/heads/master@{#419454}" line in the reverting CL, but this could be fragile and doesn't work for reverting CLs created by "git revert".)
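For illustration, here is a minimal Python sketch of steps 1 and 3. The footer names REVERTED-COMMIT and BUILD-URLs come from the proposal above; the exact parsing rules (one `KEY=value` footer per line), the function names, and the example URLs are assumptions and not part of any existing Findit or depot_tools code.

```python
import re
from datetime import datetime, timezone

# Footer names follow the proposal; the one-footer-per-line format is assumed.
FOOTER_RE = re.compile(r"^(REVERTED-COMMIT|BUILD-URLs)=(.*)$", re.MULTILINE)


def parse_revert_footers(commit_message):
    """Return (reverted_commit_hash_or_None, list_of_build_urls)."""
    fields = {key: value.strip() for key, value in FOOTER_RE.findall(commit_message)}
    urls = [u.strip() for u in fields.get("BUILD-URLs", "").split(",") if u.strip()]
    return fields.get("REVERTED-COMMIT"), urls


def time_to_fix(failure_occurred, revert_committed):
    """Step 3: commit time of the reverting CL minus the failure time."""
    return revert_committed - failure_occurred


# Hypothetical reverting commit message and timestamps.
message = (
    'Revert "Add foo"\n\n'
    "REVERTED-COMMIT=abc123\n"
    "BUILD-URLs=https://build.example/1, https://build.example/2\n"
)
reverted, build_urls = parse_revert_footers(message)
elapsed = time_to_fix(
    datetime(2016, 11, 12, 10, 0, tzinfo=timezone.utc),
    datetime(2016, 11, 12, 10, 45, tzinfo=timezone.utc),
)
```

A reverting CL without the footers yields `(None, [])`, so a cron job could count how often sheriffs leave the fields empty.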
,
Nov 14 2016
I think it would be a good idea to try to add more linkages between reverts and the failures that led to them. However, if we expect sheriffs (or other people) to do this manually, we'll get inconsistent results at best. If we can add something like a "revert" button to sheriff-o-matic that fills this in automatically, that would improve things quite a bit.
,
Nov 14 2016
We do have plans to add a revert button to SoM. https://bugs.chromium.org/p/chromium/issues/detail?id=401879
,
Nov 22 2016
,
Feb 1 2017
,
Feb 2 2017
Trying to provide a more detailed summary of which metrics we should implement. Maybe we can split these up into separate bugs if needed?

Metrics based on the original FindIt survey info:
* Average time before an alert is triaged
* Average time before a sheriff starts dealing with an alert (maybe measured by something like when they click on links?)
* Average time until a culprit is found (this would need some sort of reverting integration, I think)

These were metrics requested by a Perf sheriff:
* How many alerts I triaged during my shift
* How many alerts per bug
* How many new bugs seen
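As a rough sketch of how a few of these could be computed, assuming each alert is available as a record with hypothetical fields `created_at`, `triaged_at`, `triaged_by`, and `bug_id` (these names are made up for illustration and are not the sheriff-o-matic schema):

```python
from collections import Counter
from datetime import datetime, timedelta


def average_time_to_triage(alerts):
    """Mean of (triaged_at - created_at) over triaged alerts, or None."""
    deltas = [a["triaged_at"] - a["created_at"] for a in alerts if a.get("triaged_at")]
    if not deltas:
        return None
    return sum(deltas, timedelta()) / len(deltas)


def alerts_per_bug(alerts):
    """Count alerts attached to each bug id; alerts without a bug are skipped."""
    return Counter(a["bug_id"] for a in alerts if a.get("bug_id") is not None)


def triaged_during_shift(alerts, sheriff, shift_start, shift_end):
    """Number of alerts the given sheriff triaged in [shift_start, shift_end)."""
    return sum(
        1
        for a in alerts
        if a.get("triaged_by") == sheriff
        and a.get("triaged_at") is not None
        and shift_start <= a["triaged_at"] < shift_end
    )


# Hypothetical sample data.
t0 = datetime(2017, 2, 2, 9, 0)
sample_alerts = [
    {"created_at": t0, "triaged_at": t0 + timedelta(minutes=10),
     "triaged_by": "sheriff@example.com", "bug_id": 1},
    {"created_at": t0, "triaged_at": t0 + timedelta(minutes=30),
     "triaged_by": "sheriff@example.com", "bug_id": 1},
    {"created_at": t0, "triaged_at": None, "triaged_by": None, "bug_id": None},
]
```

Untriaged alerts are left out of the average here; whether they should instead count up to "now" is an open design choice.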
,
Apr 25 2017
On the Findit side, I've added some metrics on the number of tree closures, reverts, flakes, etc., based on tree status in chromium-status.appspot.com: https://findit-for-me.appspot.com/waterfall/auto-revert-metrics
But I haven't yet implemented what I mentioned in comment #2 above.
,
Dec 14 2017
We have implemented BigQuery event tables for sheriff-o-matic that can probably answer these questions now, e.g. https://bigquery.cloud.google.com/table/sheriff-o-matic:events.alerts
,
Dec 17
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Dec 17
Bumping up the priority, as this is good timing for OKR planning.
Comment 1 by seanmccullough@google.com
, Nov 12 2016