add swarming_proxy handling time alert; add swarming server health to swarming dashboard
Issue description
http://shortn/_J0i41gIK9W The fact that the onset was precisely 6pm makes me suspect it was a planned outage, but I see no outage announcement in my email. maruel@, was there a planned downtime? (P1 justification: preventative against future P0 outages)
,
Jun 19 2017
Thanks Marc-Antoine! (also, you're making me use my high school French education in reading that :) ) pprabhu, does this correspond to a known ganeti outage?
,
Jun 19 2017
My inbox has no fewer than 4 announcements of server failures
during the time frame in question, all attributed to various
unexpected failures. These are the servers I saw mentioned:
chromeos-server51.cbf.corp.google.com (twice)
chromeos-server85.cbf.corp.google.com
chromeos-server22.cbf.corp.google.com
The problems were ultimately attributed to "an apparent power
issue in CBF."
,
Jun 19 2017
Looks like this incident is past? Filed issue 734618 for mitigation action for future incidents.
,
Jun 19 2017
Reopening; I don't think that bug is as well scoped as this one. This one is specifically about the swarming outage, which is probably telling us we want (a) to add swarming proxy server health info to the swarming dashboard on viceroy, and (b) an alert about swarming proxy performance.
,
Jun 19 2017
justification: swarming_proxy outages cause class 1 service (CQ) outages.
,
Jun 19 2017
,
Jun 19 2017
I don't usually like co-opting an incident bug for the follow-up work. But in this case, the follow-up is clear enough, and the incident is clearly over. I basically don't want the two mixing.
,
Jun 19 2017
,
Jun 20 2017
Issue 720005 has been merged into this issue.
,
Jun 20 2017
Issue 720026 has been merged into this issue.
,
Jun 20 2017
,
Jun 26 2017
Dashboard CL ready, alert CL in progress. Will be fixed by EoW - P
,
Jun 29 2017
,
Jun 30 2017
Both CLs landed.
Comment 1 by mar...@chromium.org
, Jun 17 2017