Huge pending queue spikes on ios-simulator try bot
Issue description
For this patch https://codereview.chromium.org/2787633002/, in https://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/builds/182738, of all the tests run with the patch, only "ios_net_unittests (ipad air 2 ios 10.0)" failed, due to a Swarming task timeout. During the retry, however, all the tests were rerun without the patch, including "components_unittests (iphone 5 ios 9.0)", which succeeded with the patch but failed without it. In my understanding, only the failed tests should be retried without the patch, even if rerunning the others is quick because of cached results on Swarming. I might be missing some context on why all tests are rerun without the patch, but at least in this case the rerun of "components_unittests (iphone 5 ios 9.0)" gave a false signal.
,
Mar 30 2017
+ Sana, since she reverted a CL to help the net unittest flake.
,
Mar 30 2017
Try jobs on this trybot are queuing up (https://screenshot.googleplex.com/08wK2PC2cA3.png). 10 out of 28 slaves are retrying all test steps instead of just the failed one: https://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator
,
Mar 30 2017
I believe this change to the ios/try recipe introduced the retry of tests: https://chromium-review.googlesource.com/c/452808/ Assigning to smut@. Is there a quick fix, or should we revert first?
,
Mar 30 2017
I think we should revert the change now and fix later.
,
Mar 30 2017
,
Mar 30 2017
The revert is in the CQ: https://chromium-review.googlesource.com/c/463847
,
Mar 30 2017
Revert landed; lowering the priority to P1.
,
Mar 31 2017
The bot still has many pending builds: https://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator Maybe the revert didn't help?
,
Mar 31 2017
Though a few builds are still taking 45+ minutes (when they have to recompile everything), many are taking 5-8 minutes, which isn't too bad. Maybe we just need more capacity here.
,
Mar 31 2017
Orange is pending builds: https://screenshot.googleplex.com/pHiREc5n2cs.png First major spike was the 22nd, around 5pm pacific.
,
Mar 31 2017
Er, 21st not 22nd.
,
Mar 31 2017
Issue 706673 has been merged into this issue.
,
Mar 31 2017
I was going to suggest temporarily swiping capacity from the other iOS try bots that don't run tests or anything and have much shorter cycle times, but for some reason ios-device (compile-only) spiked to 26 pending on 3/28 at 3pm (it has 16 configured slaves). Still, in the last 30 days the more typical peak for ios-device is 8 concurrent builds. 26 was likely an anomaly. I think there was an issue with buildbucket scheduling builds at that time? So I think we can discount that spike, which means we could take 6 VMs from ios-device and still have 10, which should be enough. ios-device-xcode-clang also usually peaks at 8-9 judging by the last 30 days, and ios-simulator-xcode-clang generally peaks at 8-9 based on the last 30 days, however in the last 3 days it jumped to 12 concurrent builds twice. Both are compile-only. I suspect we could steal 6 VMs each from ios-device and ios-device-xcode-clang, and maybe 4 from ios-simulator-xcode-clang, which would increase ios-simulator's capacity by about half. In the sgirt term this would hopefully help the throughput on ios-simulator while we figure out what the problem is.
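The headroom reasoning in this proposal can be checked with a short sketch. The peak figures are taken from the comment above (using the upper end of each stated range, and the recent 12 for ios-simulator-xcode-clang). The configured counts of 16 for the two xcode-clang bots and 30 for ios-simulator are not stated in this comment; they are assumed from the "before" values in the slaves.cfg commit quoted later in this thread. The table layout and names are hypothetical.

```python
# Hypothetical plan table:
# builder -> (configured slaves, peak concurrent builds, slaves to take)
PLAN = {
    "ios-device": (16, 8, 6),
    "ios-device-xcode-clang": (16, 9, 6),
    "ios-simulator-xcode-clang": (16, 12, 4),
}
IOS_SIMULATOR_CONFIGURED = 30  # assumed from the later slaves.cfg commit


def remaining_slaves(plan):
    """Slaves left on each donor builder after the proposed move."""
    return {
        builder: configured - taken
        for builder, (configured, _peak, taken) in plan.items()
    }


remaining = remaining_slaves(PLAN)
# True when every donor builder can still cover its observed peak.
covers_peak = all(remaining[b] >= peak for b, (_c, peak, _t) in PLAN.items())
gained = sum(taken for (_c, _p, taken) in PLAN.values())
increase_pct = 100 * gained / IOS_SIMULATOR_CONFIGURED

print(remaining, covers_peak, gained, round(increase_pct))
```

With these numbers each donor keeps at least its observed peak, and ios-simulator gains 16 slaves, which matches the "about half" estimate.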
,
Mar 31 2017
s/sgirt/short/
I forgot about ios-simulator-eg, which typically seems to peak at 6 but has spiked to 10 pending. It's not even part of the CQ, so we can definitely steal some VMs from it.
,
Mar 31 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/cdc4eaec35e92e088e9ff010c3c8a966809e8eeb

commit cdc4eaec35e92e088e9ff010c3c8a966809e8eeb
Author: smut <smut@google.com>
Date: Fri Mar 31 03:25:52 2017

Shuffle iOS try slaves to give more capacity to ios-simulator

ios-device: 16 -> 10
ios-device-xcode-clang: 16 -> 10
ios-simulator: 30 -> 52
ios-simulator-cronet: 2 -> 2
ios-simulator-eg: 16 -> 10
ios-simulator-xcode-clang: 16 -> 12

BUG= 706653
Change-Id: I0446d3542c2b6abf8e8aeb52d658a76bf9e6200f
Reviewed-on: https://chromium-review.googlesource.com/464466
Reviewed-by: smut <smut@chromium.org>

[modify] https://crrev.com/cdc4eaec35e92e088e9ff010c3c8a966809e8eeb/masters/master.tryserver.chromium.mac/slaves.cfg
,
Mar 31 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/master-manager/+/0c8bb20d15b9f4754807177482ef97c13e72500f commit 0c8bb20d15b9f4754807177482ef97c13e72500f Author: smut <smut@google.com> Date: Fri Mar 31 03:30:52 2017
,
Mar 31 2017
Added 22 slaves, increasing capacity by 73%. It's possible that more capacity may be needed on Swarming to handle the task load of 52 concurrent ios-simulator builds.
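As a sanity check on the figures above, the before and after counts from the slaves.cfg commit message can be tallied in a few lines. This is a minimal sketch; the dictionary names are chosen for illustration and do not reflect how slaves.cfg itself is structured.

```python
# Slave counts per builder, copied from the commit message in this thread.
before = {
    "ios-device": 16,
    "ios-device-xcode-clang": 16,
    "ios-simulator": 30,
    "ios-simulator-cronet": 2,
    "ios-simulator-eg": 16,
    "ios-simulator-xcode-clang": 16,
}
after = {
    "ios-device": 10,
    "ios-device-xcode-clang": 10,
    "ios-simulator": 52,
    "ios-simulator-cronet": 2,
    "ios-simulator-eg": 10,
    "ios-simulator-xcode-clang": 12,
}

# Per-builder change; negative values are slaves given up.
delta = {builder: after[builder] - before[builder] for builder in before}

# The shuffle only moves slaves around, so the pool size should not change.
pool_unchanged = sum(before.values()) == sum(after.values())

added = delta["ios-simulator"]
increase_pct = 100 * added / before["ios-simulator"]

print(delta, pool_unchanged, added, round(increase_pct))
```

The tally shows the 22 slaves added to ios-simulator are exactly offset by the reductions on the four donor builders, and 22 out of 30 rounds to the 73% quoted above.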
,
Mar 31 2017
,
Mar 31 2017
Now more than 50% of trybots on ios-simulator fail base_unittests: https://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator?numbuilds=200 (looks like a recent development; might be unrelated to the changes in this bug, but it's related to the broader "ios-simulator has been very unreliable the last few days" theme)
,
Mar 31 2017
Those base_unittests failures were due to a recipe change, which has been reverted: https://chromium.googlesource.com/chromium/tools/build/+/328305ad806f3bc88fe190f754463ee007d0c040
,
Mar 31 2017
I'm not seeing any pending queues on the graphs today.
,
Apr 4 2017
No pending queues seen since the capacity increase.
,
Apr 4 2017
,
Jun 23 2017
Comment 1 by st...@chromium.org
, Mar 30 2017