
Issue 734618


Issue metadata

Status: Duplicate
Owner:
Closed: Jun 2017
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




Add viceroy alert for when a production shard is out.

Project Member Reported by pprabhu@chromium.org, Jun 19 2017

Issue description

Believe it or not, Ganeti servers fail surprisingly often.

Look at the number of "oops we rebooted your server" announcements here: https://groups.google.com/a/google.com/forum/#!searchin/mdb.chromeos-build-deputy/has$20been$20restarted%7Csort:date

Also, these two partial outages of class I service were detected accidentally. Both were caused by some Ganeti servers we own being down for extended periods of time:

https://bugs.chromium.org/p/chromium/issues/detail?id=734307
https://bugs.chromium.org/p/chromium/issues/detail?id=734438
 
This is vague. Lots of things are "important" lab servers.

I think  Issue 734307  is better scoped and more actionable, at least for a particular variety of failure here.
Re #1:  Issue 734307  is about monitoring and alerting on the swarming server used for chromeos-proxy.

This bug is about alerting directly on Ganeti server outages, for which we already have sysmon-generated metrics. We are clearly not paying attention to the emails we get from Ganeti because they tend to be noisy: they tell us that a server restarted, but we only care about a server staying down for an extended period of time.
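To illustrate the restart-versus-extended-outage distinction, here is a rough sketch of the check such an alert could perform. It is not viceroy or sysmon code: the sample format (timestamp plus an up/down flag per host), the function names, and the 30-minute default threshold are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical sample shape: (unix timestamp, host was up at that time).
Sample = Tuple[float, bool]


def down_duration(samples: List[Sample], now: float) -> float:
    """Return how long the host has been continuously down as of `now`.

    Returns 0.0 if the most recent sample shows the host up, or if there
    are no samples at all.
    """
    down_since = None
    for ts, is_up in sorted(samples):
        if is_up:
            # Any "up" sample ends the current outage, so a quick
            # restart never accumulates a long down time.
            down_since = None
        elif down_since is None:
            down_since = ts
    return 0.0 if down_since is None else now - down_since


def hosts_to_alert(
    history: Dict[str, List[Sample]],
    now: float,
    threshold_s: float = 1800.0,  # assumed threshold, not from this bug
) -> List[str]:
    """Return hosts that have stayed down for at least `threshold_s`."""
    return sorted(
        host
        for host, samples in history.items()
        if down_duration(samples, now) >= threshold_s
    )
```

A host that restarts and comes back contributes nothing here, which is the noise the Ganeti emails currently generate; only a host that stays down past the threshold is reported.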

By important servers, I mean the result of:

$ atest server list -s primary
$ atest shard list
We may want to exclude the devserver and crashserver roles from 'atest server list' to start with, because one of them being down is not worth an alert. All other roles do not have redundancy in the lab.
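A minimal sketch of that role filter follows. It deliberately does not parse atest output, since the output format is not shown in this bug; it assumes the server list has already been turned into (hostname, roles) pairs. The function name and the rule "keep a server if it has at least one non-excluded role" are assumptions, and the role names in the usage note other than devserver and crashserver are placeholders.

```python
from typing import FrozenSet, Iterable, List, Sequence, Tuple

# Roles named in this bug as not worth an alert on their own.
REDUNDANT_ROLES: FrozenSet[str] = frozenset({"devserver", "crashserver"})


def alertable_servers(
    servers: Iterable[Tuple[str, Sequence[str]]],
    excluded_roles: FrozenSet[str] = REDUNDANT_ROLES,
) -> List[str]:
    """Return hostnames that carry at least one role outside `excluded_roles`.

    `servers` is an assumed, pre-parsed form of the server list:
    (hostname, roles) pairs.
    """
    keep = []
    for hostname, roles in servers:
        # A server with only excluded roles (or no roles) is skipped.
        if set(roles) - excluded_roles:
            keep.append(hostname)
    return keep
```

For example, with placeholder data, `alertable_servers([("s1", ["devserver"]), ("s2", ["scheduler"])])` keeps only `s2`.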
Owner: pho...@chromium.org
Status: Assigned (was: Untriaged)
Summary: Add viceroy alert for when a production shard is out. (was: Add viceroy alert for any important lab servers being down)
I had a chat IRL with akeshet@

We don't want to alert based on machines.
What we care about is services, so re-scoping this to be:

- add an alert when a production shard is down

phobbs@ claims he has a bug assigned to him which basically wants this.
- dedup this bug to that
- add "Chase-Pending" to it.
Status: WontFix (was: Assigned)

Comment 6 by pho...@chromium.org, Jun 26 2017

Mergedinto: 713318
Status: Duplicate (was: WontFix)
