Add viceroy alert for when a production shard is out.
Issue description

Believe it or not, Ganeti servers fail surprisingly often. Look at the number of "oops we rebooted your server" announcements here:
https://groups.google.com/a/google.com/forum/#!searchin/mdb.chromeos-build-deputy/has$20been$20restarted%7Csort:date

Also, these two partial outages of class I service were detected accidentally. Both were caused by some Ganeti servers we own being down for extended periods of time:
https://bugs.chromium.org/p/chromium/issues/detail?id=734307
https://bugs.chromium.org/p/chromium/issues/detail?id=734438
,
Jun 19 2017
Re #1: Issue 734307 is about monitoring / alerting the swarming server used for chromeos-proxy. This bug is about alerting directly on Ganeti server outages (we already have sysmon-generated metrics for them). We are clearly not paying attention to the emails that we get from Ganeti because they tend to be noisy: they tell us that a server restarted, but we only care about a server staying down for extended periods of time.

By important servers, I mean the result of:
$ atest server list -s primary
$ atest shard list
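To illustrate the "staying down for extended periods" distinction, here is a minimal sketch of the filtering logic. It is not the viceroy or sysmon implementation; the function name, the input shape (hostname mapped to the time it was first seen down) and the 30-minute threshold are all assumptions made up for this example.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: how long a server must stay down before we alert.
DOWN_THRESHOLD = timedelta(minutes=30)


def servers_to_alert(down_since, now, threshold=DOWN_THRESHOLD):
    """Return hostnames that have been down for at least `threshold`.

    `down_since` maps hostname -> datetime when the host was first observed
    down, or None if the host is currently up. A short restart never reaches
    the threshold, so it produces no alert.
    """
    return sorted(
        host
        for host, since in down_since.items()
        if since is not None and now - since >= threshold
    )


if __name__ == "__main__":
    now = datetime(2017, 6, 19, 12, 0)
    state = {
        "shard-a": now - timedelta(hours=2),    # down for a long time
        "shard-b": now - timedelta(minutes=5),  # probably just a restart
        "shard-c": None,                        # up
    }
    print(servers_to_alert(state, now))
```

With this shape, the noisy "server restarted" case is dropped and only hosts that stay down past the threshold are reported.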
,
Jun 19 2017
We may want to exclude the devserver and crashserver roles from 'atest server list' to start with, because one of them being down is not worth an alert. All other roles do not have redundancy in the lab.
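A rough sketch of that exclusion, assuming the server list has already been parsed into (hostname, roles) pairs. The parsing step and the data shape are hypothetical, and the hostnames and non-excluded role names in the example are placeholders; only the two excluded role names come from this thread.

```python
# Roles that are not worth an alert, per the comment above.
EXCLUDED_ROLES = frozenset({"devserver", "crashserver"})


def alertable_servers(servers, excluded_roles=EXCLUDED_ROLES):
    """Return hostnames that have at least one role outside `excluded_roles`.

    `servers` is an iterable of (hostname, roles) pairs. This is an assumed
    shape, not the actual output format of 'atest server list'. A host with
    no roles at all is treated as not alertable.
    """
    return [
        host
        for host, roles in servers
        if set(roles) - excluded_roles
    ]


if __name__ == "__main__":
    example = [
        ("host1", ["devserver"]),
        ("host2", ["scheduler"]),
        ("host3", ["devserver", "drone"]),
        ("host4", ["crashserver"]),
    ]
    print(alertable_servers(example))
```

A host that carries an excluded role alongside another role is still kept, since the other role has no redundancy.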
,
Jun 19 2017
I had a chat IRL with akeshet@. We don't want to alert based on machines. What we care about is services, so re-scoping this to be:
- add an alert when a production shard is down

phobbs@ claims he has a bug assigned to him which basically wants this.
- dedup this bug to that
- add "Chase-Pending" to it.
,
Jun 22 2017
,
Jun 26 2017
Comment 1 by akes...@chromium.org
, Jun 19 2017