WPT Import: mac*_blink_rel trybots need more shards
Issue description
(Part of the KR to reduce WPT import latency in Q4.) Right now, when we have new WPT changes to import, we spend a large amount of time on the first stage, where we run the changes through the *_blink_rel trybots. Surprisingly, the Mac bots tend to take longer than the Windows ones. For example, mac10.12_blink_rel takes around 40min to run the layout tests. If one or more tests fail, it runs everything again without the imported changes, so that's another 40min of running tests before we even get to the point where the import process rebaselines the tests and runs the CQ. The non-Mac trybots aren't the fastest either, but they tend to spend more time on other steps (bot_update and archive_webkit_tests_results are very slow on Windows, for example).
,
Oct 12 2017
Ah, sorry for the delay! Yep, the procedure is:
1. File an issue with component Infra>Labs to request new slaves. Filed bug 774161 for this.
2. Commit a change to master.tryserver.blink/slaves.cfg in the build repo (https://cs.chromium.org/chromium/build/masters/master.tryserver.blink/slaves.cfg) to add those slaves to the relevant pools.
3. File an issue for a master restart, following https://g.co/bugatrooper.
,
Oct 16 2017
Got it, thanks!
,
Oct 16 2017
,
Oct 16 2017
CL for all except 10.11 retina: https://crrev.com/c/721626
,
Oct 16 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/tools/build/+/5415e5e60aa335ed9433c8e3de9e1c1ea90373da
commit 5415e5e60aa335ed9433c8e3de9e1c1ea90373da
Author: Quinten Yearsley <qyearsley@chromium.org>
Date: Mon Oct 16 23:17:13 2017

Add slaves to mac10.{10,11,12}_blink_rel

Bug: 772335
Change-Id: I69c9fa15063dff2eb6bc7c187ed80e1130a27863
Reviewed-on: https://chromium-review.googlesource.com/721626
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Commit-Queue: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/5415e5e60aa335ed9433c8e3de9e1c1ea90373da/masters/master.tryserver.blink/slaves.cfg
,
Oct 24 2017
Looking at issue 774161, it looks like we're still waiting for the Retina machine(s). Should we wait for all slaves before asking for the master to be restarted or can we do that now for the existing ones?
,
Oct 24 2017
We can do that now with existing ones; filed bug 777887 for restart today.
,
Oct 25 2017
Update: now the non-retina mac bots have 3 slaves each.
,
Oct 25 2017
Thank you. Does that also mean the layout tests will run with more shards or does it only mean we have more slaves available to pick up new builds?
,
Oct 25 2017
It just means more slaves are available to pick up new builds (reducing scheduled/pending time, but not build run time). The number of shards for layout tests is a separate setting.
,
Oct 25 2017
Ah, maybe I misinterpreted this bug from the start! Is time waiting to start a build an issue, or just run time? Increasing the number of shards for layout tests should be a change to //testing/buildbot/chromium.webkit.json.
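As a rough illustration of what such a change amounts to, here is a minimal Python sketch that bumps a shard count in a JSON structure. The builder name, the test entry, and the exact key layout ("isolated_scripts", "swarming", "shards") are assumptions made for the example and are not copied from the real chromium.webkit.json; check the actual file before relying on them.

```python
import json

# Hypothetical config fragment. The builder name and key layout are
# illustrative assumptions, not the real contents of chromium.webkit.json.
EXAMPLE_CONFIG = """
{
  "Example Mac Builder": {
    "isolated_scripts": [
      {
        "name": "webkit_layout_tests",
        "swarming": {
          "can_use_on_swarming_builders": true,
          "shards": 2
        }
      }
    ]
  }
}
"""


def set_shards(config, builder, test_name, shards):
    """Set the shard count for one test on one builder.

    Returns the number of test entries that were updated.
    """
    updated = 0
    for test in config.get(builder, {}).get("isolated_scripts", []):
        if test.get("name") == test_name:
            test.setdefault("swarming", {})["shards"] = shards
            updated += 1
    return updated


config = json.loads(EXAMPLE_CONFIG)
changed = set_shards(config, "Example Mac Builder", "webkit_layout_tests", 4)
print(changed)
print(json.dumps(config, indent=2, sort_keys=True))
```

In practice the edit is a one-line change to the checked-in JSON file sent through normal code review; the script only shows which field is involved.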
,
Oct 25 2017
I see. What's the process for increasing the number of shards? In addition to having few slaves, the time it takes to run the layout tests is quite long as well (as I said in the bug description, it can take 40*2 + epsilon minutes for mac10.12_blink_rel to run before we even get to the stage of rebaselining and triggering the CQ bots).
,
Oct 25 2017
How does one pick the right value for the builders in chromium.webkit.json? The slowest Mac bots all have |shards| set to 2. Should I raise it to 5? 10?
,
Oct 25 2017
I believe that 2 was just an initial value to see whether it works OK (relevant CL: https://chromium-review.googlesource.com/c/chromium/src/+/616483). I think the setting is flexible to some extent, but the best number probably depends on the number of swarming bots available per platform, as well as the total cumulative test time and the desired clock time. For linux, it's set to 6, but there are a lot more linux swarming bots available to run tasks (https://chromium-swarm.appspot.com/botlist?c=id&c=task&c=status&f=os%3AUbuntu-14.04&l=100&s=os%3Aasc). 6 would probably be acceptable, but then you'd more often run out of swarming bots and the jobs would be waiting on that anyway, perhaps. Maybe 4 would be OK?
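The trade-off described above can be sketched as a small back-of-the-envelope calculation. This is only an illustration: the function name, the idea of dividing the bot pool by the number of concurrent builds, and the example bot counts are assumptions, and the 80 minutes of cumulative test time assumes the 40-minute run at 2 shards splits perfectly.

```python
import math


def suggest_shards(cumulative_test_minutes, target_minutes,
                   bots_available, builds_in_flight=1):
    """Suggest a shard count for a test step.

    Picks enough shards to reach the target wall-clock time, but never more
    than each concurrent build's share of the swarming bot pool, since extra
    shards would only sit waiting for a free bot.
    """
    if target_minutes <= 0:
        raise ValueError("target_minutes must be positive")
    if bots_available < 1 or builds_in_flight < 1:
        raise ValueError("need at least one bot and one build")
    wanted = math.ceil(cumulative_test_minutes / target_minutes)
    per_build_cap = max(1, bots_available // builds_in_flight)
    return max(1, min(wanted, per_build_cap))


# 2 shards x ~40 min gives ~80 shard-minutes of tests (idealized).
# The bot pool size (12) and concurrent builds (3) are made-up numbers.
print(suggest_shards(80, 20, bots_available=12, builds_in_flight=3))  # 4
```

With a larger pool the same target could justify more shards; with a smaller one the cap dominates and adding shards mostly moves the wait from run time to pending time.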
,
Oct 25 2017
My access to the swarming pages is rather limited, so I'll trust you and send a CL bumping the number to 4 :)
,
Oct 28 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/23c1566fcff5ea60c0d3fc2de833e4af09ebaaa0
commit 23c1566fcff5ea60c0d3fc2de833e4af09ebaaa0
Author: Quinten Yearsley <qyearsley@chromium.org>
Date: Sat Oct 28 10:25:08 2017

Increase shard count for Mac Blink builders

Reason: Currently of the blink_rel try bots, the mac bots are the slowest.

Bug: 772335
Change-Id: I087fbd652c6e11d2d884905859f13e1601d0dbe5
Reviewed-on: https://chromium-review.googlesource.com/737415
Reviewed-by: Raphael Kubo da Costa (rakuco) <raphael.kubo.da.costa@intel.com>
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Commit-Queue: Raphael Kubo da Costa (rakuco) <raphael.kubo.da.costa@intel.com>
Cr-Commit-Position: refs/heads/master@{#512397}

[modify] https://crrev.com/23c1566fcff5ea60c0d3fc2de833e4af09ebaaa0/testing/buildbot/chromium.webkit.json
,
Oct 30 2017
Update: Now mac10.{10,11,12}_blink_rel are using 4 shards for layout test runs, and the step duration seems to be closer to 25 minutes, e.g.:
https://build.chromium.org/p/tryserver.blink/builders/mac10.12_blink_rel/builds/2371
mac10.11_retina_blink_rel is still not using swarming, I think, and is still slower, e.g.:
https://build.chromium.org/p/tryserver.blink/builders/mac10.11_retina_blink_rel/builds/4579
Still left to do:
If mac10.11_retina_blink_rel is now the bottleneck, then that platform should also use swarming and should have enough shards.
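One way to read the two timings reported in this thread (about 40 minutes at 2 shards, about 25 minutes at 4 shards) is with a simple model where step time is a fixed overhead plus divisible work. The model itself is an assumption, and both timings are approximate, so the fitted numbers below are only indicative.

```python
def fit_overhead_model(shards_a, minutes_a, shards_b, minutes_b):
    """Fit minutes = overhead + work / shards to two observations.

    Returns (overhead_minutes, work_shard_minutes).
    """
    if shards_a == shards_b:
        raise ValueError("need two different shard counts")
    work = (minutes_a - minutes_b) / (1.0 / shards_a - 1.0 / shards_b)
    overhead = minutes_a - work / shards_a
    return overhead, work


def predict_minutes(overhead, work, shards):
    """Predicted step duration for a given shard count under the same model."""
    return overhead + work / shards


# Approximate figures from this thread: ~40 min at 2 shards, ~25 min at 4.
overhead, work = fit_overhead_model(2, 40, 4, 25)
print(overhead, work)
print(predict_minutes(overhead, work, 6))
```

Under this model a noticeable share of the step is overhead that does not shrink with sharding, which would explain why doubling the shard count did not halve the duration, and suggests diminishing returns from adding many more shards.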
,
Nov 4 2017
Right now, a job needs to wait 3~4hrs to get a chance to run on mac10.{10,11,12}_blink_rel. Perhaps we still need more shards...
,
Nov 4 2017
I am planning to add more bots, which will help. We can probably add more shards for the tests as well.
,
Nov 7 2017
BTW, I've noticed that each of the mac10.{10,11_retina,12}_blink_rel builders has one bot that has been disconnected for at least a few days, which brings the number of available slaves back down to 2.
,
Nov 8 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/tools/build/+/82538a9a747bb968b4b0b5443e894c46fb196a64
commit 82538a9a747bb968b4b0b5443e894c46fb196a64
Author: Dirk Pranke <dpranke@chromium.org>
Date: Wed Nov 08 02:59:22 2017

Re-add mac bots to tryserver.blink.

Now that the six bots have been upgraded from 10.10 and 10.11 to 10.12, we can re-add them to the pool giving us hopefully plenty of capacity.

TBR=qyearsley@chromium.org
BUG=772335, 780950
Change-Id: Iaec59adc18c850179b58973a57445a7ffb80ac95
Reviewed-on: https://chromium-review.googlesource.com/757981
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Commit-Queue: Dirk Pranke <dpranke@chromium.org>

[modify] https://crrev.com/82538a9a747bb968b4b0b5443e894c46fb196a64/masters/master.tryserver.blink/slaves.cfg
,
Nov 8 2017
We should have enough capacity now (buildbot-bot-wise). Let me know if you see pending builds with any frequency going forward. We can also add more shards to the tests to decrease cycle time if need be.
,
Nov 8 2017
Actually, they might not be fully provisioned yet, but we're working on it. See crbug.com/780950 for additional details.
Comment 1 by raphael....@intel.com
, Oct 12 2017