
Issue 666894

Starred by 2 users

Issue metadata

Status: Duplicate
Merged: issue 673205
Owner: ----
Closed: Jan 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 2
Type: Bug




CQ monitoring failed to detect a badly overloaded builder

Project Member Reported by dpranke@chromium.org, Nov 18 2016

Issue description

This week we had a patch land that caused bug 665246. 

The root cause was that we shifted a bunch of load to a builder that only had one machine connected to it. Why that happened and how to fix it are separate issues, but notably, we first noticed the problem after a developer filed a bug about it. I believe the problem had been happening for several days at that point.

We should've caught this with monitoring.

Specifically, you could probably detect this one from growing CQ cycle times, but it would've been a lot easier to detect if we were monitoring pending build queues on builders. Do we have such monitoring, or can we get it in place?
 
We have the monitoring, but not alerts: 

https://viceroy.corp.google.com/chrome_infra/Buildbot/per_builder?builder=linux_trusty_blink_rel&duration=30d&master=master.tryserver.blink&refresh=-1

I'm not sure what the right path forward is. It seems like CQ time is too noisy to alert on, especially now that this builder is getting more traffic. Looking at the past month, I would have suggested alerting on 2hr average for 90th percentile cycle time > 75 minutes. But, that alert would be firing right now. And maybe it should be? To be honest, I'm not sure.
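For concreteness, the alert rule suggested above could be sketched roughly like this. This is a hypothetical illustration, not actual monitoring code: the windowing, sampling, and function names are all assumed; only the 2-hour average, 90th percentile, and 75-minute threshold come from the comment.

```python
from statistics import quantiles

# Threshold from the comment above: alert when the 2hr average of the
# 90th percentile CQ cycle time exceeds 75 minutes.
THRESHOLD_MINUTES = 75

def p90(samples):
    """90th percentile of a list of cycle times, in minutes."""
    # quantiles(..., n=10) returns the 9 decile cut points; index 8
    # is the 90th percentile.
    return quantiles(samples, n=10)[8]

def should_alert(windows):
    """windows: per-interval cycle-time samples covering the last 2 hours.

    Average the per-interval 90th percentiles and compare against the
    threshold. (Hypothetical aggregation; a real system might compute
    the percentile over the pooled 2hr sample instead.)
    """
    avg_p90 = sum(p90(w) for w in windows) / len(windows)
    return avg_p90 > THRESHOLD_MINUTES
```

The comment's worry still applies: on a noisy series this rule would have been firing at the time of writing, so the threshold would need tuning against historical data.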

I hesitate to alert on pending builds, because what if that ceases to be a good proxy for developer pain? That said, we could do it as a bandaid, if that seems like the best path forward at the moment.
(Is this a load issue?  That is, do we need to increase capacity for this pool?)
Maybe a good approach is to monitor the time from "job requested" to "job completed", rather than "job started" to "job completed", since the former would capture the delay due to the overloaded builder in addition to the normal build time?
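The distinction proposed above can be made concrete with a small sketch. The `Job` record and field names here are assumptions for illustration; the point is that "started to completed" hides queue delay entirely, while "requested to completed" includes it.

```python
from dataclasses import dataclass

@dataclass
class Job:
    # Timestamps in seconds since epoch (hypothetical fields).
    requested: float   # when the CQ asked for the build
    started: float     # when a builder machine actually picked it up
    completed: float   # when the build finished

def build_time(job: Job) -> float:
    """What 'started to completed' measures: build duration only."""
    return job.completed - job.started

def total_time(job: Job) -> float:
    """The proposed metric: includes time spent pending in the queue."""
    return job.completed - job.requested

def queue_delay(job: Job) -> float:
    """Pending time; this is what blows up on an overloaded builder."""
    return job.started - job.requested
```

For example, a job that waits an hour in the queue and then builds for ten minutes looks healthy under `build_time` but clearly unhealthy under `total_time`.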

There are currently six machines in the linux_trusty_blink_rel pool, which is probably enough but doesn't give us a lot of breathing room. The outage happened when we had one slave, which was very much not enough :).
Mergedinto: 673205
Status: Duplicate (was: Available)
I'm going to dup this into bug 673205.
