Issue 637006 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




handle infra failure alerts better

Project Member Reported by seanmccullough@google.com, Aug 11 2016

Issue description

[from cit-sheriffing thread]

The right way to address infra failures in SoM might be:

- (re)Add a trooper-oriented tab that shows infra failures (perhaps grouped by trees, but in a single list). 

- Then for sheriff tabs, add some indication of the *existence* of infra failures. It shouldn't crowd out the list of alerts that they can actually act on though. Perhaps a single "There are X infra failures now affecting your tree. Troopers should be working on them. Click here for more details" expando somewhere conspicuous but not obstructive. 

- And no "Show infra failures" checkbox.

Infra failures affect two different groups of oncallers who have somewhat overlapping responsibilities and needs for awareness, but different methods and abilities to fix things. A single checkbox is probably just too binary for this problem.

 
Project Member

Comment 1 by sheriffbot@chromium.org, Aug 12 2016

Labels: Hotlist-Google
Labels: Type-Bug
Labels: -Milestone-SoMNGFollowUp Milestone-Workflow
 Issue 653239  has been merged into this issue.
I am interested in working on this, though I have been wondering about the best way to pull all infra failures from the different trees. Should we just pull the alerts JSON for every tree and then parse through it on the client side? Or should we potentially change the way things are stored in the backend so this can be filtered from that end?
That's a good question. We don't currently parse the alerts json on the SoM server side. It just stores the alerts as raw json data, so we don't have an easy way to construct an infra-only alerts feed on the server today.

Short term, it might make more sense for the client to grab all of the trees' alerts when rendering /trooper, and filter to just show the infra failures by tree. 

Longer term, it would be better to issue one single request from the client for all infra failures across trees, but that'll require a lot more work on the backend.

Owner: zhangtiff@chromium.org
I am interested in working on this. 
Awesome! How about a short design doc and some mockups of the UI?

Use those to hash out answers to some open questions:
- whether to do the infra alerts filtering on the server or on the client
- how the UI will change for sheriffs and troopers (For sheriffs: just link to infra alerts on the trooper page, or do progressive disclosure in-page? For troopers: visually group infra alerts by tree on the trooper page, or show one unified list ordered by severity, number of builders affected, etc.?)


Sure, writing a mini design doc sounds good to me. I have a few rough initial ideas that I'll post here if that's alright: 

- Implement the Trooper tab on the frontend the same way as a new tree named "Trooper" and take advantage of things like playbook linking and a Chrome Infra logo for the tree (and maybe infra-status.appspot.com added to the list of status apps?). 
- The bug queue view is reused as the trooper queue for the trooper tab. It is adjusted to sort bugs by priority and maybe display a bit of extra info, but otherwise needs few frontend changes. The main change will be in the backend, which, when a request is made for the "trooper" bug queue, will look up the trooper-queue query on the Monorail API to get a list of bugs.
- Infra failures are shown in the Trooper tab and sorted by the tree they came from. 
- Other trees will still show infra failures.  

It would not be difficult to have the frontend just pull all alerts from every tree and filter for only infra failures. However, I think this is a feature we are not in a hurry to get out, so it would probably be better to try to do it the "good" way.

How feasible/desirable would it be to restructure the Datastore storage for alerts to actually format the data based on the JSON format? If we did that, we would be able to look up alerts by whether they are infra failures. 
Regarding the server-side construction of the feed:

Restructuring the JSON format to fit into the datastore is going to be a lot of work. It's necessary and desirable, but we shouldn't block this work on that.

A third option would be to have the server pull all of the current trees' alerts from the datastore, parse the json blobs on the server and then filter for infra failures before returning the infra failure feed.

You should be able to query datastore for all trees' alerts and, for each, parse AlertsJSON.Content into an infra/monitoring/messages.AlertsSummary struct.
Then filter out the infra failures from there.


Project Member

Comment 12 by bugdroid1@chromium.org, Nov 1 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra.git/+/3e2fb96235a9498bcaef9e287c8d2751c418422b

commit 3e2fb96235a9498bcaef9e287c8d2751c418422b
Author: Stephen Martinis <martiniss@chromium.org>
Date: Mon Oct 31 23:25:00 2016

Update relnotes for SOM

BUG= 655234 , 655286 , 637006 

Change-Id: I914008d6cb9e0c684ceaa53552b57e4700149f38
Reviewed-on: https://chromium-review.googlesource.com/405827
Reviewed-by: Tiffany Zhang <zhangtiff@chromium.org>
Reviewed-by: Sean McCullough <seanmccullough@chromium.org>
Commit-Queue: Stephen Martinis <martiniss@chromium.org>

[modify] https://crrev.com/3e2fb96235a9498bcaef9e287c8d2751c418422b/go/src/infra/appengine/sheriff-o-matic/RELNOTES.md

Status: Started (was: Available)
Project Member

Comment 16 by bugdroid1@chromium.org, Nov 24 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra.git/+/1da8369a532291f8bcb2fbec66c0f314643d3ad8

commit 1da8369a532291f8bcb2fbec66c0f314643d3ad8
Author: Tiffany Zhang <zhangtiff@google.com>
Date: Wed Nov 23 22:47:47 2016

SoM: Fix trooper bug query.


BUG= 637006 

Change-Id: Ife105d84669efd92d69ded3e30ff78470e26e6bd
Reviewed-on: https://chromium-review.googlesource.com/414291
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Reviewed-by: Sean McCullough <seanmccullough@chromium.org>
Commit-Queue: Tiffany Zhang <zhangtiff@chromium.org>

[modify] https://crrev.com/1da8369a532291f8bcb2fbec66c0f314643d3ad8/go/src/infra/appengine/sheriff-o-matic/som/bugqueue.go

Project Member

Comment 19 by bugdroid1@chromium.org, Jan 27 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra.git/+/8f84797d9922a3d6f2253f4c1bc36993847d39f1

commit 8f84797d9922a3d6f2253f4c1bc36993847d39f1
Author: Tiff Zhang <zhangtiff@google.com>
Date: Fri Jan 27 22:17:41 2017

SoM: Add bug priority category headers + more whitespace adjustments.

BUG= 637006 
BUG= 669254 

Change-Id: Ie9ec0a48fc4a3c973c908dc0f36e18dd7fcd36f5
Reviewed-on: https://chromium-review.googlesource.com/433819
Commit-Queue: Tiffany Zhang <zhangtiff@chromium.org>
Reviewed-by: Sean McCullough <seanmccullough@chromium.org>

[modify] https://crrev.com/8f84797d9922a3d6f2253f4c1bc36993847d39f1/go/src/infra/appengine/sheriff-o-matic/elements/som-alert-item/som-alert-item.html
[modify] https://crrev.com/8f84797d9922a3d6f2253f4c1bc36993847d39f1/go/src/infra/appengine/sheriff-o-matic/elements/som-app/som-app.html
[modify] https://crrev.com/8f84797d9922a3d6f2253f4c1bc36993847d39f1/go/src/infra/appengine/sheriff-o-matic/elements/som-bug-queue/som-bug-queue.html
[modify] https://crrev.com/8f84797d9922a3d6f2253f4c1bc36993847d39f1/go/src/infra/appengine/sheriff-o-matic/elements/som-bug-queue/som-bug-queue.js
[modify] https://crrev.com/8f84797d9922a3d6f2253f4c1bc36993847d39f1/go/src/infra/appengine/sheriff-o-matic/elements/som-extension-build-failure/som-extension-build-failure.html
[modify] https://crrev.com/8f84797d9922a3d6f2253f4c1bc36993847d39f1/go/src/infra/appengine/sheriff-o-matic/elements/som-swarming-bots/som-swarming-bots.html
[modify] https://crrev.com/8f84797d9922a3d6f2253f4c1bc36993847d39f1/go/src/infra/appengine/sheriff-o-matic/test/som-bug-queue-test.html

Status: Fixed (was: Started)
I think I should go ahead and close this since we have a trooper page and working on this bug is mostly just adding polish. Some other related requests are better covered in more specific bugs. 
Project Member

Comment 21 by bugdroid1@chromium.org, Mar 8 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/8f84797d9922a3d6f2253f4c1bc36993847d39f1

commit 8f84797d9922a3d6f2253f4c1bc36993847d39f1
Author: Tiff Zhang <zhangtiff@google.com>
Date: Fri Jan 27 22:29:45 2017

SoM: Add bug priority category headers + more whitespace adjustments.

BUG= 637006 
BUG= 669254 

Change-Id: Ie9ec0a48fc4a3c973c908dc0f36e18dd7fcd36f5
Reviewed-on: https://chromium-review.googlesource.com/433819
Commit-Queue: Tiffany Zhang <zhangtiff@chromium.org>
Reviewed-by: Sean McCullough <seanmccullough@chromium.org>

[modify] https://crrev.com/8f84797d9922a3d6f2253f4c1bc36993847d39f1/go/src/infra/appengine/sheriff-o-matic/elements/som-alert-item/som-alert-item.html
[modify] https://crrev.com/8f84797d9922a3d6f2253f4c1bc36993847d39f1/go/src/infra/appengine/sheriff-o-matic/elements/som-bug-queue/som-bug-queue.js
[modify] https://crrev.com/8f84797d9922a3d6f2253f4c1bc36993847d39f1/go/src/infra/appengine/sheriff-o-matic/elements/som-app/som-app.html
[modify] https://crrev.com/8f84797d9922a3d6f2253f4c1bc36993847d39f1/go/src/infra/appengine/sheriff-o-matic/elements/som-extension-build-failure/som-extension-build-failure.html
[modify] https://crrev.com/8f84797d9922a3d6f2253f4c1bc36993847d39f1/go/src/infra/appengine/sheriff-o-matic/elements/som-swarming-bots/som-swarming-bots.html
[modify] https://crrev.com/8f84797d9922a3d6f2253f4c1bc36993847d39f1/go/src/infra/appengine/sheriff-o-matic/test/som-bug-queue-test.html
[modify] https://crrev.com/8f84797d9922a3d6f2253f4c1bc36993847d39f1/go/src/infra/appengine/sheriff-o-matic/elements/som-bug-queue/som-bug-queue.html
