
Issue 784529

Starred by 7 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: 2018-02-19
OS: ----
Pri: 1
Type: Bug


Participants' hotlists:
Tiff-List



Sheriff-o-matic periodically fails to fetch all annotations (groupings, bugs, etc)

Project Member Reported by charliea@chromium.org, Nov 13 2017

Issue description

I've had a few strange grouping experiences with Sheriff-o-matic on the perf waterfall. SoM will periodically lose all groupings, rapidly shooting up the number of consistent failures (in my case, from 25 failures to 84 failures). Each failure that would normally be grouped appears in its own row. Then, a few minutes later, order is restored, and all of the groupings reappear. 

I've confirmed that reloading the page during this period of ungrouping doesn't help. Any idea what's up? It's extremely hard to triage during these grouping outages.
 
Groupings are currently stored as annotations, which are fetched separately, so my first thought here is that the annotations request is failing somehow.
 Issue 787592  has been merged into this issue.
 Issue 791646  has been merged into this issue.
 Issue 791818  has been merged into this issue.
Labels: Milestone-UX
Summary: Sheriff-o-matic fails to fetch annotations sometimes (was: Sheriff-o-matic loses groupings for a brief period)
 Issue 777815  has been merged into this issue.
Labels: -Pri-1 Pri-2
Status: Available (was: Untriaged)
We should add some logging to detect annotation fetch failures on the client. 
Labels: Milestone-Polish
 Issue 813092  has been merged into this issue.
Labels: -Pri-2 -Milestone-UX -Milestone-Polish Milestone-Reliability Pri-1
Owner: zhangtiff@chromium.org
Status: Assigned (was: Available)
Summary: Sheriff-o-matic periodically fails to fetch all annotations (groupings, bugs, etc) (was: Sheriff-o-matic fails to fetch annotations sometimes)
Raising the priority on this since I think this happens pretty frequently and is painful for users. 

I believe this mostly happens on the Perf tree and is largely because of the high volume of alerts. 

Comment 12 by carl...@google.com, Feb 16 2018

Re #11: 
> I believe this mostly happens on the Perf tree and is largely because of the high volume of alerts. 

Just to point out that this isn't exclusively the case: the duplicate issue 787592 that I reported was not related to perf bots.
Thanks for bringing that up, carlosk. https://sheriff-o-matic.appspot.com/api/v1/annotations/ does seem to currently return the annotations for every tree, so you're right, and I think this would probably affect all trees at the same time. 

So, I looked through the SoM logs and I am seeing several entries for requests to api/v1/annotations that error with "Exceeded soft private memory limit of 512 MB with 518 MB after servicing 747 requests total" 

So I think this is a result of the fact that annotations (not alerts) are a really large payload that is constantly being reloaded. 

I believe splitting annotations by tree should help with this (and is probably something we should have done a while ago). I will try to have a CL in by early next week. 
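As a rough illustration of the per-tree split proposed above, here is a minimal Go sketch. It is not the code from the eventual CL: the `Annotation` type, the `/api/v1/annotations/<tree>` path shape, the helper names, and the example tree names are all assumptions, and the filtering is done in memory rather than as a datastore query.

```go
package main

import (
	"fmt"
	"strings"
)

// Annotation is a simplified, hypothetical stand-in for SoM's annotation entity.
type Annotation struct {
	Key  string   // alert key or group ID the annotation is attached to
	Tree string   // tree the annotation belongs to (example values below)
	Bugs []string // bug IDs linked by a sheriff
}

// treeFromPath extracts the tree name from a path of the assumed form
// /api/v1/annotations/<tree>. It reports false when no single tree is named.
func treeFromPath(path string) (string, bool) {
	const prefix = "/api/v1/annotations/"
	if !strings.HasPrefix(path, prefix) {
		return "", false
	}
	tree := strings.Trim(strings.TrimPrefix(path, prefix), "/")
	if tree == "" || strings.Contains(tree, "/") {
		return "", false
	}
	return tree, true
}

// annotationsForTree returns only the annotations tagged with the given tree,
// so a response carries one tree's payload instead of every tree's.
func annotationsForTree(all []Annotation, tree string) []Annotation {
	var out []Annotation
	for _, a := range all {
		if a.Tree == tree {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	all := []Annotation{
		{Key: "alert-1", Tree: "chromium", Bugs: []string{"784529"}},
		{Key: "alert-2", Tree: "chromium.perf"},
		{Key: "alert-3", Tree: "chromium.perf"},
	}
	if tree, ok := treeFromPath("/api/v1/annotations/chromium.perf"); ok {
		fmt.Println(tree, len(annotationsForTree(all, tree)))
	}
}
```

In a real handler the filter would more likely be pushed into the datastore query, so that annotations for other trees are never loaded into instance memory at all.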
NextAction: 2018-02-19
The NextAction date has arrived: 2018-02-19
Project Member

Comment 16 by bugdroid1@chromium.org, Feb 22 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/a6a8d9426e919da295459b82b206fe7481fbfbf1

commit a6a8d9426e919da295459b82b206fe7481fbfbf1
Author: Tiff Zhang <zhangtiff@google.com>
Date: Thu Feb 22 22:58:01 2018

SoM: Split annotations by tree.


Bug:784529
Bug:809803
Bug:809805
Change-Id: Iec72def3121e3c308dc297cd675a891710d44813
Reviewed-on: https://chromium-review.googlesource.com/930329
Commit-Queue: Tiffany Zhang <zhangtiff@chromium.org>
Reviewed-by: Sean McCullough <seanmccullough@chromium.org>

[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/som/handler/analyze_test.go
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/main.go
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/elements/som-annotations/som-annotations.js
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/som/handler/annotations.go
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/som/model/model.go
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/elements/som-rev-range/som-rev-range.js
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/elements/som-drawer/som-drawer.js
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/som/handler/analyze.go
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/som/handler/main_test.go
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/test/som-drawer-test.html
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/elements/som-drawer/som-drawer.html
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/test/som-annotations-test.html
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/test/som-rev-range-test.html

Just a reminder that this is still happening, and is still pretty time-consuming when sheriffing.

It's going on right now, and I'm just kind of sitting around for a few minutes until it resolves itself.

See attached waiting.png.
waiting.png
I believe the fix for this hasn't been deployed yet. 

@Sean: Should we do a deployment today? This CL will end up "clearing" the old annotations unless we decide to write a migration or something similar.
Yep we're due for a push today anyways. Will that CL clear out *all* annotations?
Yup. The annotations will still be around, but they won't have data on the tree they came from attached to them, so the frontends for the trees won't be able to find the annotations anymore. Automated groups would be regenerated after a few minutes by the cron but user annotations would be "lost".

I could add a temporary change that makes the frontend query for both old annotations and new annotations, but that would temporarily make the problem in this bug worse rather than better. 

Or maybe we could somehow attach tree names to existing annotations based on alert data? 
Could we look at what masters/builders are identified in alert objects, and determine the appropriate tree using a backwards lookup from gatekeeper config?

Not sure how to deploy that (one time request handler? task queue worker?)
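The backwards lookup suggested here could look roughly like the Go sketch below. The real gatekeeper config format is not shown in this thread, so the tree-to-masters map is a hypothetical simplification, and the function names and master names are made up for illustration. A master that appears under more than one tree is treated as ambiguous rather than guessed at.

```go
package main

import "fmt"

// invertTrees turns a tree -> masters map (a hypothetical simplification of
// the gatekeeper config) into a master -> trees map.
func invertTrees(treeMasters map[string][]string) map[string][]string {
	out := make(map[string][]string)
	for tree, masters := range treeMasters {
		for _, m := range masters {
			out[m] = append(out[m], tree)
		}
	}
	return out
}

// treeForMaster returns a tree only when the master maps to exactly one tree.
// Unknown or ambiguous masters report false so a caller can skip them.
func treeForMaster(masterTrees map[string][]string, master string) (string, bool) {
	trees := masterTrees[master]
	if len(trees) != 1 {
		return "", false
	}
	return trees[0], true
}

func main() {
	// Example data only; "chromium.shared" is an invented master name.
	lookup := invertTrees(map[string][]string{
		"chromium":      {"chromium.linux", "chromium.shared"},
		"chromium.perf": {"chromium.perf", "chromium.shared"},
	})
	for _, m := range []string{"chromium.linux", "chromium.shared", "unknown"} {
		tree, ok := treeForMaster(lookup, m)
		fmt.Printf("%s -> %q (resolved: %v)\n", m, tree, ok)
	}
}
```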
The alert model contains the tree information, so we could look up the alert keys attached to the annotations and then attach the tree name from the alerts through that. 

To deploy, it looks like a one-time Task Queue job is a common approach for Datastore migrations: https://cloud.google.com/appengine/articles/update_schema#updating-existing-entities
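The core of such a migration, attaching a tree name to each existing annotation via its alert key, might be sketched in Go as follows. This is an in-memory illustration under stated assumptions: the `Annotation` type, the one-key-per-annotation model, and the `backfillTrees` helper are hypothetical, and the datastore reads and writes and the task queue plumbing are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Annotation is a simplified, hypothetical stand-in for SoM's annotation entity.
type Annotation struct {
	Key  string // key of the alert this annotation is attached to
	Tree string // empty for annotations written before the per-tree split
}

// backfillTrees sets Tree on every annotation that lacks one, using a lookup
// from alert key to tree name. It returns the number of annotations updated
// and the sorted keys of annotations whose alert could not be found.
func backfillTrees(anns []Annotation, alertTree map[string]string) (int, []string) {
	updated := 0
	var orphaned []string
	for i := range anns {
		if anns[i].Tree != "" {
			continue // already migrated, which keeps the job safe to re-run
		}
		tree, ok := alertTree[anns[i].Key]
		if !ok {
			orphaned = append(orphaned, anns[i].Key)
			continue
		}
		anns[i].Tree = tree
		updated++
	}
	sort.Strings(orphaned)
	return updated, orphaned
}

func main() {
	anns := []Annotation{
		{Key: "alert-1"},
		{Key: "alert-2", Tree: "chromium"},
		{Key: "alert-gone"},
	}
	// Example alert data; in practice this would come from the alert entities.
	alertTree := map[string]string{"alert-1": "chromium.perf", "alert-2": "chromium"}
	updated, orphaned := backfillTrees(anns, alertTree)
	fmt.Println("updated:", updated, "orphaned:", orphaned)
}
```

Skipping annotations that already have a tree means a partially completed run can simply be retried, and the list of orphaned keys shows which user annotations would still need manual attention.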
Okay that sounds like a plan that wouldn't be too disruptive for our users. 

Having to re-link bugs, losing comments and snoozing etc would create a lot of extra work for them.
Holding off on today's push since that would effectively make all of the user annotations disappear.
Project Member

Comment 25 by bugdroid1@chromium.org, Mar 15 2018

Note that this seems to be happening again

The first screenshot is from about two minutes ago. (~200 failures)

The second is from now (~20 failures)

 
Screen Shot 2018-04-13 at 2.46.34 PM.png
Cc: nednguyen@chromium.org
This is happening again and is incredibly disruptive to perf sheriffing. Every time that it happens, it literally makes me sit and wait several minutes until I can resume sheriffing again.
QiiWcBxub9S.png
s/perf sheriffing/bot health sheriffing in that last comment
 Issue 866343  has been merged into this issue.
 Issue 907836  has been merged into this issue.
