Sheriff-o-matic periodically fails to fetch all annotations (groupings, bugs, etc)
Issue description
I've had a few strange grouping experiences with Sheriff-o-matic on the perf waterfall. SoM will periodically lose all groupings, which rapidly inflates the number of consistent failures (in my case, from 25 failures to 84 failures). Each failure that would normally be grouped appears in its own row. Then, a few minutes later, order is restored, and all of the groupings reappear. I've confirmed that reloading the page during this period of ungrouping doesn't help. Any idea what's up? It's extremely hard to triage during these grouping outages.
,
Dec 5 2017
Issue 787592 has been merged into this issue.
,
Dec 5 2017
Issue 791646 has been merged into this issue.
,
Dec 5 2017
Issue 791818 has been merged into this issue.
,
Dec 6 2017
,
Dec 14 2017
,
Dec 14 2017
Issue 777815 has been merged into this issue.
,
Dec 14 2017
We should add some logging to detect annotation fetch failures on the client.
,
Feb 15 2018
,
Feb 16 2018
Issue 813092 has been merged into this issue.
,
Feb 16 2018
Raising the priority on this since I think this happens pretty frequently and is painful for users. I believe this mostly happens on the Perf tree and is largely because of the high volume of alerts.
,
Feb 16 2018
Re #11:
> I believe this mostly happens on the Perf tree and is largely because of the high volume of alerts.
Just to bring attention to that not being exclusively the case. The duplicate issue 787592 that I reported was not related to perf bots.
,
Feb 16 2018
Thanks for bringing that up, carlosk. https://sheriff-o-matic.appspot.com/api/v1/annotations/ does seem to currently return the annotations for every tree, so you're right, and I think this would probably affect all trees at the same time.

So, I looked through the SoM logs and I am seeing several entries for requests to api/v1/annotations that error with "Exceeded soft private memory limit of 512 MB with 518 MB after servicing 747 requests total".

So I think this is a result of the fact that annotations (not alerts) are a really large payload that is constantly being reloaded. I believe splitting annotations by tree should help with this (and is probably something we should have done a while ago). I will try to have a CL in by early next week.
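For illustration only, here is a minimal Go sketch of the "split annotations by tree" idea: instead of serializing every annotation for every tree in one response, the handler returns only those tagged with the requested tree. The `Annotation` struct, its `Tree` field, the function name, and the tree names used in `main` are all hypothetical and are not taken from the actual CL. A real implementation would presumably filter in the datastore query rather than in memory.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Annotation is a hypothetical, simplified stand-in for an SoM annotation.
// The Tree field is the assumed addition that makes per-tree fetches possible.
type Annotation struct {
	Tree     string   `json:"tree"`
	AlertKey string   `json:"alert_key"`
	Bugs     []string `json:"bugs,omitempty"`
}

// annotationsForTree returns only the annotations that belong to the given
// tree, preserving their original order. It never returns nil, so the JSON
// encoding of an empty result is "[]" rather than "null".
func annotationsForTree(all []Annotation, tree string) []Annotation {
	out := []Annotation{}
	for _, a := range all {
		if a.Tree == tree {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	// Made-up sample data; tree names and keys are placeholders.
	all := []Annotation{
		{Tree: "tree-a", AlertKey: "alert-1", Bugs: []string{"bug-1"}},
		{Tree: "tree-b", AlertKey: "alert-2"},
		{Tree: "tree-a", AlertKey: "alert-3"},
	}

	// The per-tree payload is a subset of the full one, which is the point:
	// each request serializes less data.
	body, err := json.Marshal(annotationsForTree(all, "tree-a"))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

The trade-off is that existing annotations need a tree value before a per-tree lookup can find them.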
,
Feb 16 2018
,
Feb 19 2018
The NextAction date has arrived: 2018-02-19
,
Feb 22 2018
The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/a6a8d9426e919da295459b82b206fe7481fbfbf1

commit a6a8d9426e919da295459b82b206fe7481fbfbf1
Author: Tiff Zhang <zhangtiff@google.com>
Date: Thu Feb 22 22:58:01 2018

SoM: Split annotations by tree.

Bug:784529
Bug:809803
Bug:809805
Change-Id: Iec72def3121e3c308dc297cd675a891710d44813
Reviewed-on: https://chromium-review.googlesource.com/930329
Commit-Queue: Tiffany Zhang <zhangtiff@chromium.org>
Reviewed-by: Sean McCullough <seanmccullough@chromium.org>

[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/som/handler/analyze_test.go
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/main.go
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/elements/som-annotations/som-annotations.js
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/som/handler/annotations.go
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/som/model/model.go
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/elements/som-rev-range/som-rev-range.js
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/elements/som-drawer/som-drawer.js
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/som/handler/analyze.go
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/som/handler/main_test.go
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/test/som-drawer-test.html
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/elements/som-drawer/som-drawer.html
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/test/som-annotations-test.html
[modify] https://crrev.com/a6a8d9426e919da295459b82b206fe7481fbfbf1/go/src/infra/appengine/sheriff-o-matic/frontend/test/som-rev-range-test.html
,
Mar 13 2018
Just a reminder that this is still happening, and is still pretty time consuming when sheriffing. It's going on right now, and I'm just kind of sitting around for a few minutes until it resolves itself. See attached waiting.png.
,
Mar 13 2018
I believe the fix for this hasn't been deployed yet. @Sean: Should we do a deployment today? This CL will end up "clearing" the old annotations unless we decide to make a migration or something of the like.
,
Mar 13 2018
Yep, we're due for a push today anyways. Will that CL clear out *all* annotations?
,
Mar 13 2018
Yup. The annotations will still be around, but they won't have data on the tree they came from attached to them, so the frontends for the trees won't be able to find the annotations anymore. Automated groups would be regenerated after a few minutes by the cron but user annotations would be "lost". I could add a temporary change that makes the frontend query for both old annotations and new annotations, but that would temporarily make the problem in this bug worse rather than better. Or maybe we could somehow attach tree names to existing annotations based on alert data?
,
Mar 13 2018
Could we look at what masters/builders are identified in alert objects, and determine the appropriate tree using a backwards lookup from gatekeeper config? Not sure how to deploy that (one time request handler? task queue worker?)
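To make the "backwards lookup" concrete, here is a rough Go sketch that inverts a tree-to-masters mapping into a master-to-trees index. The input shape (`map[string][]string` keyed by tree name) is an assumption made for the example and is not the real gatekeeper config format. The master and tree names in `main` are placeholders.

```go
package main

import (
	"fmt"
	"sort"
)

// treesByMaster inverts a hypothetical "tree -> masters" config into a
// "master -> trees" index. A master can appear under more than one tree,
// so each value is a sorted list rather than a single tree name.
func treesByMaster(mastersByTree map[string][]string) map[string][]string {
	index := make(map[string][]string)
	for tree, masters := range mastersByTree {
		for _, master := range masters {
			index[master] = append(index[master], tree)
		}
	}
	// Map iteration order is random in Go, so sort for stable output.
	for _, trees := range index {
		sort.Strings(trees)
	}
	return index
}

func main() {
	// Placeholder config; names are invented for illustration.
	config := map[string][]string{
		"tree-a": {"master.one", "master.shared"},
		"tree-b": {"master.two", "master.shared"},
	}

	index := treesByMaster(config)
	for _, master := range []string{"master.one", "master.shared", "master.unknown"} {
		trees, ok := index[master]
		if !ok {
			fmt.Printf("%s: no tree found\n", master)
			continue
		}
		fmt.Printf("%s: %v\n", master, trees)
	}
}
```

A master that maps to several trees is ambiguous, so a migration built on this lookup would need a rule for that case.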
,
Mar 13 2018
The alert model contains the tree information, so we could look up the alert keys attached to the annotations and then attach the tree name from the alerts through that. To deploy, it looks like a one-time task queue job is a thing people do for Datastore migrations: https://cloud.google.com/appengine/articles/update_schema#updating-existing-entities
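A minimal in-memory sketch of that migration idea, in Go: build an index from alert key to tree, then fill in the tree on every annotation that lacks one. The `Alert` and `Annotation` types, their fields, and `attachTrees` are hypothetical simplifications and do not reflect the real SoM models or the migration CL. A real migration would read and write datastore entities in batches from a task queue handler.

```go
package main

import "fmt"

// Alert and Annotation are hypothetical, pared-down models. The only
// assumption that matters is that an annotation references an alert by key
// and that the alert records which tree it belongs to.
type Alert struct {
	Key  string
	Tree string
}

type Annotation struct {
	AlertKey string
	Tree     string // empty for annotations written before the split
}

// attachTrees sets Tree on each annotation that does not have one yet, using
// the tree of the alert it references. It returns how many annotations were
// updated and the alert keys of annotations whose alert could not be found.
// Annotations that already have a tree are skipped, so rerunning is harmless.
func attachTrees(anns []Annotation, alerts []Alert) (migrated int, orphaned []string) {
	treeByKey := make(map[string]string, len(alerts))
	for _, al := range alerts {
		treeByKey[al.Key] = al.Tree
	}
	for i := range anns {
		if anns[i].Tree != "" {
			continue
		}
		tree, ok := treeByKey[anns[i].AlertKey]
		if !ok {
			orphaned = append(orphaned, anns[i].AlertKey)
			continue
		}
		anns[i].Tree = tree
		migrated++
	}
	return migrated, orphaned
}

func main() {
	// Placeholder data for illustration.
	alerts := []Alert{{Key: "alert-1", Tree: "tree-a"}, {Key: "alert-2", Tree: "tree-b"}}
	anns := []Annotation{
		{AlertKey: "alert-1"},
		{AlertKey: "alert-2", Tree: "tree-b"}, // already migrated
		{AlertKey: "alert-gone"},              // alert no longer exists
	}

	migrated, orphaned := attachTrees(anns, alerts)
	fmt.Println("migrated:", migrated, "orphaned:", orphaned)
}
```

Annotations whose alert has since disappeared cannot be assigned a tree this way, so the sketch reports them instead of guessing.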
,
Mar 13 2018
Okay, that sounds like a plan that wouldn't be too disruptive for our users. Having to re-link bugs, losing comments and snoozing, etc. would create a lot of extra work for them.
,
Mar 14 2018
Holding off on today's push since that would effectively make all of the user annotations disappear.
,
Mar 15 2018
The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/52fb6b3c3877ff3bf194b0e99e0c485037860ba2

commit 52fb6b3c3877ff3bf194b0e99e0c485037860ba2
Author: Tiff Zhang <zhangtiff@google.com>
Date: Thu Mar 15 02:30:29 2018

SoM: create annotation migration.

Bug:784529
Change-Id: Id1b812287641f986f9f08b5fc1ce4febd6099f1b
Reviewed-on: https://chromium-review.googlesource.com/961801
Commit-Queue: Tiffany Zhang <zhangtiff@chromium.org>
Reviewed-by: Sean McCullough <seanmccullough@chromium.org>

[modify] https://crrev.com/52fb6b3c3877ff3bf194b0e99e0c485037860ba2/go/src/infra/appengine/sheriff-o-matic/backend/app.yaml
[modify] https://crrev.com/52fb6b3c3877ff3bf194b0e99e0c485037860ba2/go/src/infra/appengine/sheriff-o-matic/som/handler/analyze_test.go
[add] https://crrev.com/52fb6b3c3877ff3bf194b0e99e0c485037860ba2/go/src/infra/appengine/sheriff-o-matic/som/handler/migrations_test.go
[modify] https://crrev.com/52fb6b3c3877ff3bf194b0e99e0c485037860ba2/go/src/infra/appengine/sheriff-o-matic/backend/main.go
[modify] https://crrev.com/52fb6b3c3877ff3bf194b0e99e0c485037860ba2/go/src/infra/appengine/sheriff-o-matic/som/client/testresults_test.go
[add] https://crrev.com/52fb6b3c3877ff3bf194b0e99e0c485037860ba2/go/src/infra/appengine/sheriff-o-matic/som/handler/migrations.go
[modify] https://crrev.com/52fb6b3c3877ff3bf194b0e99e0c485037860ba2/go/src/infra/appengine/sheriff-o-matic/frontend/queue.yaml
,
Apr 13 2018
Note that this seems to be happening again. The first screenshot is from about two minutes ago (~200 failures). The second is from now (~20 failures).
,
Jun 11 2018
This is happening again and is incredibly disruptive to perf sheriffing. Every time that it happens, it literally makes me sit and wait several minutes until I can resume sheriffing again.
,
Jun 11 2018
s/perf sheriffing/bot health sheriffing in that last comment
,
Aug 2
Issue 866343 has been merged into this issue.
,
Dec 6
Issue 907836 has been merged into this issue.
Comment 1 by zhangtiff@chromium.org
, Nov 13 2017