Issue 662468

Starred by 2 users

Issue metadata

Status: Duplicate
Merged: issue 547690
Owner: ----
Closed: Dec 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 2
Type: Bug



Add monitoring around the number of queued changelists to build

Project Member Reported by charliea@chromium.org, Nov 4 2016

Issue description

Yesterday I tried to assess whether a specific CL reduced YouTube power consumption but discovered that the CL (submitted a day prior) still hadn't been tested on the perf bots. A little more digging showed that the root cause was that the Mac builders were something like 50-75 CLs behind, whereas a normal builder might be 3-5 CLs behind. The even rootier cause was that there were only 7 Mac buildslaves, whereas there were something like 20 Linux and Windows buildslaves.

After asking about this on the speed infra chat, martiniss@ was very helpful in getting more builders online by the end of the day. The builder is now catching up.

I'm thankful that the problem was resolved so quickly, but am wondering what next steps we can take to ensure that this doesn't happen again, as it's definitely disruptive and makes perfbot health sheriffing more difficult. Would it be possible to add monitoring on the builder, along the lines of:

# Estimated hours to drain the queue; assumes time_to_build_each_cl is in hours.
if (time_to_build_each_cl * cls_to_build / number_of_buildslaves) > 2:
  alert('Builder %s is behind' % platform)

It seems like, without this type of monitoring in place, it's a near inevitability that it'll happen again.
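For illustration, the check proposed above could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the `BuilderQueue` type, the function names, and the `alert` callback are hypothetical and do not correspond to any existing infra API, and the 0.5 hours per build in the sample data is a made-up figure. The pending-CL and buildslave counts are loosely based on the numbers quoted in the description.

```python
from dataclasses import dataclass

# Threshold from the proposal: alert when the queue would take more than
# 2 hours to drain at the current capacity.
MAX_DRAIN_HOURS = 2.0


@dataclass
class BuilderQueue:
    """Hypothetical snapshot of one platform's build queue."""
    platform: str
    pending_cls: int        # CLs waiting to be built
    avg_build_hours: float  # assumed average hours to build one CL
    buildslaves: int        # machines available to this builder


def estimated_drain_hours(q):
    """Hours to clear the queue, assuming builds spread evenly over slaves."""
    if q.buildslaves <= 0:
        # No capacity at all: the queue never drains.
        return float("inf")
    return q.avg_build_hours * q.pending_cls / q.buildslaves


def is_behind(q, max_hours=MAX_DRAIN_HOURS):
    return estimated_drain_hours(q) > max_hours


def check(queues, alert=print):
    """Call alert() for each lagging builder and return their platforms."""
    behind = []
    for q in queues:
        if is_behind(q):
            alert("Builder %s is behind: ~%.1f h to drain queue"
                  % (q.platform, estimated_drain_hours(q)))
            behind.append(q.platform)
    return behind


# Illustrative data only; avg_build_hours is an assumption.
sample = [
    BuilderQueue("mac", pending_cls=60, avg_build_hours=0.5, buildslaves=7),
    BuilderQueue("linux", pending_cls=4, avg_build_hours=0.5, buildslaves=20),
]
lagging = check(sample)
```

With these sample numbers the Mac queue would need about 4.3 hours to drain and trips the alert, while the Linux queue needs about 0.1 hours and does not. A real alert would presumably also need some smoothing so that a short burst of CLs does not page anyone.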


 
Cc: sergeybe...@chromium.org stip@chromium.org
Pending build time sounds like a good alert. 

stip@, have we tried this before? I feel like we did, and the alerts weren't useful? Or something?
/bump
Mergedinto: 547690
Status: Duplicate (was: Untriaged)
The most rootiest cause here is capacity monitoring. We do have a staging alert for high pending times (I can't find a bug for it), but it's not ready for prime time yet.

There is another bug for high pending queues: issue 547690. And then there is issue 645323 for monitoring the actual capacity. Neither is being actively worked on, as far as I can tell, but we are approaching a point where this is becoming fairly critical.
