Swarming Ubuntu-12.04 on linux_chromium_rel_ng is expiring
Issue description
https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/354831
https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/354808
https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/354792
https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/354809
The failed swarming tasks just show state "EXPIRED".
,
Dec 13 2016
Example: https://chromium-swarm.appspot.com/task?id=330e6b1650139c10&refresh=10&show_raw=1
Requests are coming in as 12.04, but most of the fleet was changed to 14.04. Do these need to be updated to Ubuntu-14.04? This was tracked as issue 664294.
There's a large number of bots on both fleets, so I'm quite surprised that this is a problem.
Ubuntu-14.04: 825 bots, 116 busy. https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=os%3AUbuntu-14.04&l=100&s=id%3Aasc
Ubuntu-12.04: 809 bots, 672 busy. https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=os%3AUbuntu-12.04&l=100&s=id%3Aasc
(Bots may look instantaneously idle while fetching another task, which happens more often when running quick tasks.)
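For a rough sense of how loaded each fleet is, the bot counts quoted above can be turned into busy fractions. This is only a sketch of the arithmetic: the `fleets` dictionary and the `busy_fraction` helper are illustrative names, not part of any Swarming API, and the numbers are a single snapshot.

```python
# Bot counts quoted in the comment above: (total bots, busy bots).
# These are a point-in-time snapshot, not live data.
fleets = {
    "Ubuntu-14.04": (825, 116),
    "Ubuntu-12.04": (809, 672),
}


def busy_fraction(total, busy):
    """Return the share of bots that were busy at the time of the snapshot."""
    return busy / total


for os_name, (total, busy) in fleets.items():
    print(f"{os_name}: {busy_fraction(total, busy):.0%} busy, {total - busy} idle")
```

By this measure the 12.04 fleet was roughly 83% busy against roughly 14% for 14.04, keeping in mind that some of the bots counted as idle may only be between tasks.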
,
Dec 13 2016
The tasks seem to take a long time. From a casual look, I'm seeing many tasks with >10m runtime, which is much too high.
,
Dec 13 2016
+Andrii FYI
,
Dec 13 2016
It's unclear to me what we need to do to fix this: does Ned need to update his recipe to request a less specific Ubuntu version, or to update to 14.04? Do we need to expire tasks more quickly? Ned, are you completely blocked, and should this be a P0?
,
Dec 13 2016
To #5: I am filing this bug as the Chromium sheriff for the 12/13 & 12/14 shift. This does not block my work.
,
Dec 13 2016
Whatever you decide, let me know when to update MP config. It's still 80% Precise, 20% Trusty.
,
Dec 16 2016
I'm completely at a loss for what to do here... is there a relevant entry in the playbook that I'm missing?
,
Dec 16 2016
It sounds like smut may be the one who knows what to do? Tentatively assigning, if that's actually the case.
,
Dec 16 2016
I'm not sure what I should do. All I see from the Machine Provider graphs is that the 12.04 VMs are available. Right now 737 of them are busy so it seems like they're running tasks. Are we suggesting that 800 12.04 VMs is not enough? This is actually up from ~600 that we had before Machine Provider was supplying them.
,
Dec 16 2016
Here's a recent expiry: https://chromium-swarm.appspot.com/task?id=33226c4962815110&refresh=10&show_raw=1
Created 12/16/2016, 1:47:34 PM (PST)
Abandoned 12/16/2016, 2:48:14 PM (PST)
At the time, some of the MP VMs were being refreshed (refreshing VMs is staggered across several hours), but there wasn't a significant drop in the number of available VMs, and most of the refreshed VMs connected and started running tasks right away. e.g.: https://chromium-swarm.appspot.com/bot?id=gce-precise-c4413e93-0b6h&selected=1&show_all_events=true&sort_stats=total%3Adesc
Leased at 1:56:01 PM (PST), connected at 1:56:44 PM (PST), requested task at 1:57:48 PM (PST).
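The timestamps above can be checked with a few lines of Python. This is a sketch only: the `FMT` string and the `elapsed` helper are illustrative, and the dates for the bot events are assumed to be 12/16/2016 since the comment gives only times of day.

```python
from datetime import datetime

# Timestamp format as displayed in the task page, without the "(PST)" suffix.
FMT = "%m/%d/%Y, %I:%M:%S %p"


def elapsed(start, end):
    """Return the timedelta between two timestamps written in FMT."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Task lifetime, from the expired task quoted above.
created = "12/16/2016, 1:47:34 PM"
abandoned = "12/16/2016, 2:48:14 PM"

# Bot events for the refreshed VM (date assumed to match the task's date).
leased = "12/16/2016, 1:56:01 PM"
connected = "12/16/2016, 1:56:44 PM"
requested = "12/16/2016, 1:57:48 PM"

print("pending before abandonment:", elapsed(created, abandoned))
print("lease to connect:", elapsed(leased, connected))
print("connect to first task request:", elapsed(connected, requested))
```

The task sat pending for a little over an hour before being abandoned, while the refreshed VM went from lease to its first task request in under two minutes. The configured expiration value is not stated in this thread.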
,
Dec 17 2016
Ubuntu-12.04 tasks are expiring at this very moment, even though Machine Provider is currently supplying 797 gce-precise VMs. 720 of them are running tasks, and the rest seem connected, so they are probably instantaneously idle and about to reap a task, as maruel said in #2. Since all MP VMs are there and tasks are still expiring, I can only conclude we need to provide even more than 800 Precise VMs. Is it possible there was a massive surge of Ubuntu-12.04 tasks caused by the long pending queues on tryserver.chromium.linux? Maybe that's why we suddenly don't have enough?
,
Dec 17 2016
I think there was some kind of unusual spike in the number of tasks. CCing Mike, I hope he knows more. If so, I don't think adding more VMs is a good move.
,
Dec 20 2016
We need to figure out what increased the load, since I suspect it did increase. We'd save a lot of money if we did the migration to 14.04 ASAP, so we wouldn't have two parallel fleets.
,
Dec 20 2016
,
Dec 27 2016
Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable? If a fix is in active development, please set the status to Started. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Dec 27 2016
Seems like this no longer happens.
,
Jun 23 2017
Comment 1 by nedngu...@google.com
, Dec 13 2016