Issue 664645 link

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: ----



Add more metrics to Sheriff-o-Matic

Project Member Reported by zhangtiff@chromium.org, Nov 11 2016

Issue description

Followup to some comments from the FindIt survey: https://docs.google.com/document/d/1NiIiGgLMvVdCDz4SA3ljOkIwjR9R07EXaNMAiDiUkYA/edit  

We should record metrics such as the average time required to fix alerts. I don't think we currently have much in place for this sort of thing, but it would probably help improve sheriffing.
 
Cc: dpranke@chromium.org
+dpranke who asked about some metrics like this yesterday

Comment 2 by st...@chromium.org, Nov 12 2016

Cc: chanli@chromium.org lijeffrey@chromium.org
I have a plan for Findit to track this for compile failures.

1. Add two metadata fields to the commit message of the reverting CL, and ask sheriffs to fill in the build URLs when reverting a CL.
   - REVERTED-COMMIT=git-hash-of-reverted-commit
   - BUILD-URLs=comma-separated-urls-to-the-builds-with-failure-as-reason-to-revert

   IMO, it is fair to ask sheriffs to fill in the build URLs, because they help the CL owner investigate the failure. I have seen quite a few cases where the CL owner asks for links to the failures.

   This could be done with changes to the one-click revert button on Rietveld, and adding a git_revert to depot_tools similar to "git cl ...".

2. Have a cron job to track compile tree closures in http://chromium-status.appspot.com/, and get the commit time of the reverting CL in the first-green build.
   We could track compile failures from data in Findit or SoM too. But tree closure is usually the first occurrence of the same failure.

3. Calculate the diff of commit-time and failure-occurred-time as the time to fix failures. (We could add another field REVERT-TIME if a reverting CL is going through CQ)
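
A minimal sketch of steps 1 and 3 above, assuming the two footers appear one per line as `NAME=value` in the reverting CL's commit message. The footer names come from the proposal; the helper names `parse_revert_footers` and `time_to_fix` are hypothetical, and the tree-closure lookup from step 2 is not modeled (the failure time is simply passed in).

```python
import re
from datetime import datetime, timezone

# Footer names are taken from the proposal; the exact syntax is an assumption.
FOOTER_RE = re.compile(r"^(REVERTED-COMMIT|BUILD-URLs)=(.*)$", re.MULTILINE)


def parse_revert_footers(commit_message):
    """Return (reverted_commit, build_urls) found in a reverting CL's message."""
    fields = {name: value.strip() for name, value in FOOTER_RE.findall(commit_message)}
    reverted = fields.get("REVERTED-COMMIT") or None
    # BUILD-URLs is a comma-separated list; drop empty entries.
    urls = [u.strip() for u in fields.get("BUILD-URLs", "").split(",") if u.strip()]
    return reverted, urls


def time_to_fix(failure_occurred, revert_committed):
    """Step 3: commit time of the reverting CL minus the failure-occurred time."""
    return revert_committed - failure_occurred


# Illustrative data only; the hash and URLs are made up.
message = (
    "Revert of Foo\n\n"
    "REVERTED-COMMIT=abc123\n"
    "BUILD-URLs=https://build.example/1, https://build.example/2\n"
)
reverted, urls = parse_revert_footers(message)
elapsed = time_to_fix(
    datetime(2016, 11, 11, 10, 0, tzinfo=timezone.utc),
    datetime(2016, 11, 11, 10, 45, tzinfo=timezone.utc),
)
```

If a footer is missing, `parse_revert_footers` returns `None` or an empty list for it, so callers can tell which reverts were filled in by hand and which were not.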


This general idea might work for test failures too.
We don't track test failures on chromium-status any more; but Findit and SoM have that data.


Any pitfalls here?

(A long while ago, I tried to compute a signature for each commit based on its diff, and to match signatures between commits in the first red build and the first following green build to identify the reverting/reverted commits. But I don't think it would work in all cases. Another approach I tried was to guess the revert from the "> Cr-Commit-Position: refs/heads/master@{#419454}" line quoted in the reverting CL, but this could be fragile and doesn't work for reverting CLs created with "git revert".)
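
As a rough illustration of the second approach, here is a sketch that looks for a quoted `Cr-Commit-Position` line in a reverting CL's description. The function name `guess_reverted_position` is hypothetical, and the pattern assumes the quoted line has exactly the form shown in the example above; it returns nothing for a plain "git revert" message, which is the fragility mentioned.

```python
import re

# Matches a quoted footer such as:
#   > Cr-Commit-Position: refs/heads/master@{#419454}
QUOTED_POSITION_RE = re.compile(
    r"^>\s*Cr-Commit-Position:\s*(refs/\S+)@\{#(\d+)\}\s*$", re.MULTILINE
)


def guess_reverted_position(revert_message):
    """Return (ref, position) of the reverted commit, or None if not quoted."""
    match = QUOTED_POSITION_RE.search(revert_message)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# Illustrative messages only.
quoted = (
    "Revert of Foo\n\n"
    "> Original description\n"
    "> Cr-Commit-Position: refs/heads/master@{#419454}\n"
)
plain_git_revert = 'Revert "Foo"\n\nThis reverts commit abc123.\n'
```
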
I think it would be a good idea to try and add more linkages between reverts and the failures that led to them. However, I think if we expect sheriffs (or other people) to do this manually we'll get inconsistent results at best.

I think if we can do something like add a "revert" button to sheriff-o-matic that can fill this stuff in automatically that would improve things quite a bit, though.
We do have plans to add a revert button to SoM. https://bugs.chromium.org/p/chromium/issues/detail?id=401879

Comment 5 by aga...@chromium.org, Nov 22 2016

Labels: -Infra-DX
Status: Available (was: Untriaged)
Trying to provide a more detailed summary of which metrics we should implement. Maybe we can split these up into separate bugs if needed?

Metrics based on the original FindIt survey info: 
* Average time before alert is triaged
* Average time before a sheriff starts dealing with an alert (maybe measured by something like when they first click on its links?)
* Average time until a culprit is found (this would need some sort of reverting integration, I think) 

These were metrics requested by a Perf sheriff: 
* How many alerts I triaged during my shift
* How many alerts per bug
* How many new bugs seen 
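
A minimal sketch of how a few of these could be computed from per-alert records. The `Alert` record and all of its field names are assumptions made for illustration, not the actual Sheriff-o-Matic data model.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alert:
    """Hypothetical alert record; field names are assumptions."""
    key: str
    created: datetime
    triaged: Optional[datetime] = None
    triaged_by: Optional[str] = None
    bug_id: Optional[int] = None


def average_time_to_triage(alerts):
    """Mean of (triaged - created) over triaged alerts, or None if there are none."""
    deltas = [a.triaged - a.created for a in alerts if a.triaged is not None]
    if not deltas:
        return None
    return sum(deltas, timedelta()) / len(deltas)


def alerts_triaged_by(alerts, sheriff, shift_start, shift_end):
    """Count alerts the given sheriff triaged within [shift_start, shift_end)."""
    return sum(
        1
        for a in alerts
        if a.triaged_by == sheriff
        and a.triaged is not None
        and shift_start <= a.triaged < shift_end
    )


def alerts_per_bug(alerts):
    """Map each bug id to the number of alerts attached to it."""
    return Counter(a.bug_id for a in alerts if a.bug_id is not None)
```

Untriaged alerts are left out of the average rather than counted as zero, so a backlog of untouched alerts does not make the triage time look better than it is.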

Comment 8 by st...@chromium.org, Apr 25 2017

On the Findit side, I've added some metrics on the number of tree closures, reverts, flakes, etc., based on the tree status in chromium-status.appspot.com.
https://findit-for-me.appspot.com/waterfall/auto-revert-metrics


But I haven't yet implemented what I described in comment #2 above.
Components: -Infra>Sheriffing>SheriffOMatic
We have implemented BigQuery event tables for Sheriff-o-Matic that can probably answer these questions now.

ex: https://bigquery.cloud.google.com/table/sheriff-o-matic:events.alerts


Project Member

Comment 10 by sheriffbot@chromium.org, Dec 17

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Pri-2 Pri-1
Status: Available (was: Untriaged)
Bumping up the priority, as now is a good time for OKR planning.