
Issue 734307 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




add swarming_proxy handling time alert; add swarming server health to swarming dashboard

Project Member Reported by akes...@chromium.org, Jun 17 2017

Issue description

http://shortn/_J0i41gIK9W

The fact that the onset was precisely 6pm makes me suspect it was a planned outage, but I see no outage announcement in my email. maruel@, was there a planned downtime?


(P1 justification: preventative against future P0 outages)
 

Comment 1 by mar...@chromium.org, Jun 17 2017

Looks like it's your VMs that went offline, i.e. ganeti downtime.

The server saw increased request load, likely because a large number of clients were waiting for tasks:
https://screenshot.googleplex.com/ZScHT8Yo6r3

Here's how it looks on an individual bot:
https://screenshot.googleplex.com/OLZ2JtFff83
Owner: pprabhu@chromium.org
Thanks Marc-Antoine! (Also, you're making me use my high school French education in reading that :) )

pprabhu, does this correspond to a known ganeti outage?
My inbox has no fewer than 4 announcements of server failures
during the time frame in question, all attributed to various
unexpected failures.  These are the servers I saw mentioned:
    chromeos-server51.cbf.corp.google.com (twice)
    chromeos-server85.cbf.corp.google.com
    chromeos-server22.cbf.corp.google.com

The problems were ultimately attributed to "an apparent power
issue in CBF."

Status: Fixed (was: Untriaged)
Looks like this incident is past?

Filed issue 734618 for mitigation action for future incidents.
Status: Available (was: Fixed)
Reopening; I don't think that bug is as well scoped as this one. This one is specifically about the swarming outage, which is probably telling us we want:

a) to add swarming proxy server health info to the swarming dashboard on viceroy
b) an alert about swarming proxy performance
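
For illustration only, item (b) could look roughly like the sketch below: fire when the median handling time over a sliding window of recent requests exceeds a threshold. The class name, the threshold, and the window size are all hypothetical; this is not the actual alert CL, and the real alerting stack is not shown in this bug.

```python
from collections import deque
from statistics import median


class HandlingTimeAlert:
    """Hypothetical alert: median handling time over a sliding window vs. a threshold."""

    def __init__(self, threshold_secs, window_size):
        self.threshold_secs = threshold_secs
        # deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window_size)

    def record(self, handling_secs):
        """Record the handling time (in seconds) of one proxied request."""
        self.samples.append(handling_secs)

    def firing(self):
        """Return True when the window is full and its median exceeds the threshold."""
        # Wait for a full window so a single slow request cannot fire the alert.
        if len(self.samples) < self.samples.maxlen:
            return False
        return median(self.samples) > self.threshold_secs


alert = HandlingTimeAlert(threshold_secs=2.0, window_size=3)
for secs in (0.5, 0.6, 10.0):
    alert.record(secs)
fired_on_single_spike = alert.firing()

for secs in (9.0, 8.0):
    alert.record(secs)
fired_on_sustained_slowness = alert.firing()
```

Using the median rather than the mean means one slow request in an otherwise healthy window does not fire the alert, while sustained slowness like the multi-hour outage here would.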
Labels: Chase-Pending
Summary: add swarming_proxy handling time alert; add swarming server health to swarming dashboard (was: chromeos-proxy was down for ~5 hours)
justification: swarming_proxy outages cause class 1 service (CQ) outages.
Owner: ----
I don't usually like co-opting an incident bug for the follow-up work.
But in this case, the follow up is clear enough, and the incident is clearly over. I basically don't want the two mixing.
Labels: -Chase-Pending Chase
Owner: pprabhu@chromium.org
Issue 720005 has been merged into this issue.
Issue 720026 has been merged into this issue.
Status: Assigned (was: Available)
Dashboard CL ready, alert CL in progress. Will be fixed by EoW - P
Status: Started (was: Assigned)
Dashboard CL landed: cr/160295750

Alert CL in-flight: cr/160469985
Status: Fixed (was: Started)
Both CLs landed.
