CQ monitoring failed to detect a badly overloaded builder
Issue description

This week we had a patch land that caused bug 665246. The root cause was that we shifted a bunch of load to a builder that only had one machine connected to it. Why that happened and how to fix it are separate issues, but we also only noticed this after a developer filed a bug about it. I believe the problem had been happening for several days at that point. We should've caught this with monitoring. Specifically, you could probably detect this one from CQ cycle times growing, but it would probably have been a lot easier to detect if we were monitoring pending build queues on builders. Do we have such monitoring, or can we get it in place?
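As a rough illustration of the pending-queue check asked about above, here is a minimal sketch. It is not an existing CQ or Buildbot API: `BuilderStatus`, `overloaded_builders`, and the `max_pending_per_machine` threshold are hypothetical names, and the numbers in the usage example are made up. It assumes some monitoring source can report, per builder, how many builds are pending and how many machines are connected.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class BuilderStatus:
    """Hypothetical snapshot of one builder; not a real Buildbot type."""
    name: str
    pending_builds: int
    connected_machines: int


def overloaded_builders(
    statuses: Iterable[BuilderStatus],
    max_pending_per_machine: float = 5.0,  # assumed threshold, tune per pool
) -> List[str]:
    """Return names of builders whose pending queue is too deep for their pool."""
    flagged = []
    for status in statuses:
        if status.pending_builds == 0:
            continue  # nothing is waiting, so nothing to alert on
        if status.connected_machines == 0:
            # Work is queued but no machine can pick it up.
            flagged.append(status.name)
            continue
        ratio = status.pending_builds / status.connected_machines
        if ratio > max_pending_per_machine:
            flagged.append(status.name)
    return flagged


if __name__ == "__main__":
    # Illustrative numbers only.
    snapshot = [
        BuilderStatus("linux_trusty_blink_rel", pending_builds=40, connected_machines=1),
        BuilderStatus("some_other_builder", pending_builds=6, connected_machines=6),
    ]
    print(overloaded_builders(snapshot))
```

Normalizing by connected machines rather than alerting on raw queue length means a one-machine builder is flagged at a much shallower queue than a large pool would be.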
Nov 21 2016
(Is this a load issue? That is, do we need to increase capacity for this pool?)
Nov 22 2016
Maybe a good approach would be to monitor the time from "job requested" to "job completed", rather than "job started" to "job completed", since the former would capture the delay due to the overloaded builder plus normal build time issues? There are currently six machines in the linux_trusty_blink_rel pool, which is probably enough but doesn't give us a lot of breathing room. The outage happened when we had one slave, which was very much not enough :).
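To make the difference between the two measurements concrete, here is a small sketch. It assumes each build record carries three timestamps (requested, started, completed). `BuildRecord` and `latency_summary` are hypothetical names for illustration, not part of any existing CQ tooling, and the timestamps are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, Sequence


@dataclass
class BuildRecord:
    """Hypothetical record of one build's lifecycle timestamps."""
    requested: datetime
    started: datetime
    completed: datetime


def latency_summary(records: Sequence[BuildRecord]) -> Dict[str, float]:
    """Median queue wait, run time, and end-to-end time, all in seconds."""
    queue = [(r.started - r.requested).total_seconds() for r in records]
    run = [(r.completed - r.started).total_seconds() for r in records]
    total = [(r.completed - r.requested).total_seconds() for r in records]
    return {
        "median_queue_s": median(queue),
        "median_run_s": median(run),
        "median_total_s": median(total),
    }


if __name__ == "__main__":
    base = datetime(2016, 11, 21, 12, 0)  # illustrative timestamp only
    records = [
        # Waited 90 minutes for a machine, then built for 30 minutes.
        BuildRecord(base, base + timedelta(minutes=90), base + timedelta(minutes=120)),
        # Waited 30 minutes, then built for 30 minutes.
        BuildRecord(base, base + timedelta(minutes=30), base + timedelta(minutes=60)),
    ]
    print(latency_summary(records))
```

In this made-up data the run time stays flat at 30 minutes while the end-to-end time is dominated by queueing, which is the signal a "started to completed" metric would miss.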
Jan 16 2017
Comment 1 by katthomas@chromium.org, Nov 21 2016