
Issue 703478

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 710627




linux_layout_tests_slimming_paint_v2 is under bot capacity

Project Member Reported by pdr@chromium.org, Mar 21 2017

Issue description

I'm seeing this error somewhat frequently (more than once a day):
linux_layout_tests_slimming_paint_v2 on master.tryserver.chromium.linux (JOB_TIMED_OUT, no build URL)

It looks like these failures are causing patches to be rejected from the CQ flakily.

Recent examples:
https://codereview.chromium.org/2758683002/
https://codereview.chromium.org/2758203003/
https://codereview.chromium.org/2760063003/

Builder:
https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_layout_tests_slimming_paint_v2

Background:
linux_layout_tests_slimming_paint_v2 is a special CQ bot that runs webkit layout tests with --enable-slimming-paint-v2 and is triggered for paint-related changes.

(I removed RVG because this can all be public)
 
Cc: sergeybe...@chromium.org
Components: -Infra Infra>CQ Infra>Platform>Buildbucket
CQ logs indicate that CQ tried to schedule linux_layout_tests_slimming_paint_v2 build at 17:25:34.124 PDT (Mar 20)

However, buildbucket only shows a single build for that CL that happened later (and succeeded): https://cros-build-status.appspot.com/builds/query?bucket=master.tryserver.chromium.linux&tag=builder:linux_layout_tests_slimming_paint_v2&tag=buildset:patch/rietveld/codereview.chromium.org/2758683002/20001&max=50

I suspect something broke in CQ-to-Buildbucket connection, or the build must have gotten lost somehow.

Adding both CQ and buildbucket teams to investigate.

Comment 2 by pdr@chromium.org, Mar 21 2017

Cc: wangxianzhu@chromium.org
Removing from the trooper queue.
Cc: no...@chromium.org tandrii@chromium.org
Labels: -Infra-Troopers

Comment 6 by no...@chromium.org, Mar 21 2017

The green bubble in the CL is
https://cr-buildbucket.appspot.com/_ah/api/buildbucket/v1/builds/8984556824149165552
It was created at 17:25 PDT, so it corresponds to the gray bubble in
https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2758683002/20001 at 17:25 PDT.
It is one build.

From the buildbucket log:
1	2017-03-21 00:25:34 UTC	Build 8984556824149165552 was created by user:5071639625-1lppvbtck1morgivc6sq4dul7klu27sd@developer.gserviceaccount.com	 
2	2017-03-21 00:27:20 UTC	Build 8984556824149165552 was leased by user:446450136466-u4o8pcvmvt31t2os2d8of23egso76scf@developer.gserviceaccount.com	 
3	2017-03-21 02:27:41 UTC	Build 8984556824149165552 was started. URL: http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_layout_tests_slimming_paint_v2/builds/3361	 
4	2017-03-21 02:48:42 UTC	Build 8984556824149165552 was completed. Status: COMPLETED. Result: SUCCESS

I see nothing wrong on the buildbucket side.

CQ team, what does "JOB_TIMED_OUT, no build URL" mean?

> However, buildbucket only shows a single build for that CL that happened later
I don't think it happened later.

The answer is in nodir's quoted log right above:

The build was scheduled by CQ at 00:25 UTC, but it only started (and got a URL; I think that's where "no build URL" comes from) at 02:27 UTC. **That's 2 hours later**. CQ doesn't wait that long, hence the error.

So, capacity is clearly lacking here. 
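
As a rough check of the timing described above, here is a minimal Python sketch that computes how long the build waited before starting. The timestamps are copied from the buildbucket log quoted earlier; the helper name `parse_utc` and the variable names are illustrative only and are not part of any CQ or buildbucket API.

```python
from datetime import datetime, timezone

# Timestamp layout used in the buildbucket log lines quoted above.
LOG_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_utc(stamp):
    """Parse a 'YYYY-MM-DD HH:MM:SS' log timestamp as UTC (hypothetical helper)."""
    return datetime.strptime(stamp, LOG_FORMAT).replace(tzinfo=timezone.utc)


# Values copied from log entries 1, 3 and 4 for build 8984556824149165552.
created = parse_utc("2017-03-21 00:25:34")
started = parse_utc("2017-03-21 02:27:41")
completed = parse_utc("2017-03-21 02:48:42")

queue_delay = started - created   # time spent waiting for a free bot
run_time = completed - started    # time the build actually ran

print("queue delay:", queue_delay)
print("run time:   ", run_time)
```

The wait is roughly six times the run time, which is consistent with a single bot working through a backlog rather than with a lost build.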

Comment 8 by no...@chromium.org, Mar 21 2017

Cc: qyears...@chromium.org
Components: -Infra>CQ -Infra>Platform>Buildbucket Infra
Status: Available (was: Untriaged)
Summary: linux_layout_tests_slimming_paint_v2 is under bot capacity (was: linux_layout_tests_slimming_paint_v2 failures: JOB_TIMED_OUT, no build URL)
no wonder, there is only one bot

I am removing the CQ and Buildbucket components because this is not a bug in either of them. I don't know who owns this builder, though qyearsley set it up. Whoever owns or cares about the builder should consider adding more bots. Please consider cc'ing people who are familiar with the builder.

Comment 9 by pdr@chromium.org, Mar 21 2017

Thanks for looking into it.

@qyearsley, we originally used one bot and manually added the spv2 trybot. This was working so well that we started auto-triggering the trybot on changes in certain directories. May we have one or two additional builders?
Please estimate the capacity properly and request machines from Infra>Labs.
To estimate the capacity, estimate the peak number of CLs per hour (N) that would trigger this builder (e.g. for all of Chromium it's ~60) and check the average runtime (T) in hours. The required capacity is then at least N*T bots. We aim for no more than 75% load at peak on every pool, so scale that up by roughly 1/0.75 and make it 1.3*N*T.

Here's the estimation if this were a full Chromium builder (internal only - sorry):

http://vi/chrome_infra/Buildbot/per_pool?running_builders=&estimated_builders=linux_layout_tests_slimming_paint_v2&duration=1d&refresh=-1&pool=master.tryserver.chromium.linux%3Alinux_layout_tests_slimming_paint_v2&cq_project=chromium&est_master=master.tryserver.chromium.linux

The above console shows ~20min cycle time (0.33h), and http://shortn/_oRY2SoiHqv suggests that you trigger at most 4-5 builds per hour. At this rate, it seems 3 bots should handle the load: ceil(1.3*0.33*5) = 3.

This estimate is based purely on historical usage, so if you know your usage will be different please factor it in. Thanks!
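
The back-of-the-envelope estimate above can be written as a small Python sketch. The function name `required_bots` and its parameters are made up for illustration; the 1.3 headroom factor, the ~0.33h cycle time, and the 4-5 builds per hour are the figures quoted in this comment.

```python
import math


def required_bots(peak_builds_per_hour, avg_runtime_hours, headroom=1.3):
    """Estimate the bot count as ceil(headroom * N * T).

    N is the peak number of triggering CLs per hour, T is the average build
    runtime in hours, and headroom=1.3 keeps peak load near 75% of the pool.
    """
    return math.ceil(headroom * peak_builds_per_hour * avg_runtime_hours)


# Figures from this thread: ~20 min cycle time (0.33h), at most 4-5 builds/hour.
print(required_bots(5, 0.33))
print(required_bots(4, 0.33))
```

With 5 builds per hour the product is 2.145, which rounds up to the 3 bots suggested above. The result only reflects historical trigger rates, so a wider trigger set would need a larger N.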

Comment 11 by pdr@chromium.org, Mar 21 2017

That viceroy link is pretty fancy. I don't expect our usage to go any higher in the near-term and agree with your analysis; 3 bots sounds great.
Components: -Infra Infra>Platform

Comment 13 by pdr@chromium.org, Mar 23 2017

Labels: -Pri-3 Pri-2
Now that we understand the bug, do you mind if I raise the priority? I have a stable blocker patch that will likely miss the branch today because the spv2 bot is under capacity :( (https://codereview.chromium.org/2767343003)
Pinging this -- what do we need to do next to "request machines from Infra>Labs"?

I would like to add changes under cc/ to run this bot as well, which will require this capacity increase.

Background: I just spent several days tracking down a regression ( http://crbug.com/707281 ) that would have been prevented if changes under cc/... were running this bot on the CQ.

Thanks for insight on next steps.

Comment 15 by no...@chromium.org, Mar 31 2017

Components: -Infra>Platform Infra
-Infra>Platform because it is not a platform issue
+Infra to triage
This was in Infra back in c#8 already. Should we put it in Infra > Labs?
Cc: friedman@chromium.org estaab@chromium.org
Ping -- how do we request machines from Infra > Labs?
Components: -Infra Infra>Labs
Yes, adding Infra>Labs adds the ticket to the Labs rotation. They can provide you with more machines.
Owner: vhang@chromium.org
Assigning to the person I am currently discussing this work with.
Owner: friedman@chromium.org
Status: Assigned (was: Available)
Chatted with wkorman@ offline. He'll need 2 more GCE instances that look like slave775-c4.
Status: Fixed (was: Assigned)
slave1404-c4 slave1405-c4

Comment 22 by pdr@chromium.org, Apr 11 2017

Status: Assigned (was: Fixed)
Did the new slaves go away or fail to get added? I still see a single slave:
https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_layout_tests_slimming_paint_v2
(slave775-c4)

Reopening because we just had Nico complain about their patch getting held up by the spv2 trybot.
Owner: dpranke@chromium.org
It looks like labs gave us two slaves, but I see nothing to indicate that the slaves were actually added to the builder. Though I suspect that 3 builders is still a bit low.

Comment 24 by pdr@chromium.org, Apr 11 2017

In comment 10 we back-of-the-enveloped the capacity and 3 seemed reasonable. This bot just runs for a subset of blink changes.
We are in discussion to add it to run on cc/... changes as well if that affects things:

https://groups.google.com/a/chromium.org/d/msg/graphics-dev/wuY9tp3CkLI/ujEL24BcBQAJ
Blocking: 710627

Comment 27 by bugdroid1@chromium.org, Apr 14 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/931ed421b527d703565f3e09c50b6a013efe36d0

commit 931ed421b527d703565f3e09c50b6a013efe36d0
Author: Dirk Pranke <dpranke@chromium.org>
Date: Fri Apr 14 00:03:54 2017

Add two more bots to the slimming paint builder on tryserver.chromium.linux.

R=qyearsley@chromium.org
BUG= 703478 

Change-Id: If57756b3adc0f5444f41fa70f2281a577cc10ee8
Reviewed-on: https://chromium-review.googlesource.com/474968
Commit-Queue: Dirk Pranke <dpranke@chromium.org>
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/931ed421b527d703565f3e09c50b6a013efe36d0/masters/master.tryserver.chromium.linux/slaves.cfg

Status: Fixed (was: Assigned)

Comment 29 by pdr@chromium.org, Apr 19 2017

Status: Verified (was: Fixed)
Hey Dirk, I haven't seen any grey spv2 trybots recently, and the bot seems a lot faster overall. Thank you for your help! It makes our team's day-to-day development faster.

Comment 30 by no...@chromium.org, Apr 19 2017

Cc: -no...@chromium.org
Amazing how having more than one of something might make a difference :).
