Some builder in goma.latest is pending for several hours
Issue description
,
Oct 17
Not sure yet, but http://shortn/_Qmx57CewP9 indeed shows a long pending time for this builder, from Oct 16 17:33 PDT until 23:30 (almost 6 hours), followed by another 1.2h of pending time, presumably for the next build. It has all cleared now.
,
Oct 17
During the 6-hour waiting time, this build was pending and then expired (presumably after the 6h timeout): https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Builder%20Goma%20Latest%20Client/9007 The next build https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Builder%20Goma%20Latest%20Client/9008 was again pending for 80+ min, then got picked up and finished normally. This explains the pending graph. So it seems there were no swarming bots available for ~7.5h during that time.
,
Oct 17
The graph for the pool http://shortn/_GTc4kWCt0z suggests that the pool itself was fine (4 bots). Note that the "builder" dimension is set to "WinMSVC64 Goma Latest Client" in the graph because the pool is shared by a whole bunch of builders (I didn't even know we support this type of sharing! Or maybe we don't yet?), and the metric cannot handle a list of dimensions well, so it picks a random value from the list. In any case, it's the correct pool. Another curiosity: an example bot https://chromium-swarm.appspot.com/bot?id=gce-trusty-e833d7b0-us-west1-b-cb3p&sort_stats=total%3Adesc starts its lifetime with a task at 1:18am. This corresponds to the time the pool recovered. Maybe Machine Provider killed the bots and spawned them back? It's odd though that the capacity graph still showed available bots while the builder couldn't find them.
,
Oct 17
> It's odd though that the capacity graph still showed available bots while the builder couldn't find them.

Yeah, I looked briefly last night but couldn't figure out why it wouldn't run them. It said "1 pending task, 4 bots could run this task".
,
Oct 17
#5 - wow, so swarming did see the 4 bots, but didn't schedule? That's odd. +maruel@ - any ideas? And +smut@ FYI, because these are MP bots - maybe there is something special about them? Here's a more direct graph: http://shortn/_9HbNcUmoS9 It shows that the bots were sourced from 3 GCE zones: us-west1-b was available throughout, while us-east1-b and us-west1-c were removed around 1:09am and replaced by more bots in us-west1-b. This roughly correlates with the recovery time, but I'm still not sure what may have caused the outage. If we don't figure anything out soon, I'm tempted to mark this bug as WontFix, since everything has returned to normal now.
,
Oct 17
That's still the early_release_secs issue we discussed two weeks ago in Foundation. Problem: instead of extending a bot's lease (currently 24) for as long as it is still running a task, Swarming refuses to hand out the task if hard_timeout (10h here) is higher than the amount of time remaining in the lease. Ref: https://chrome-internal.googlesource.com/infradata/config.git/+/master/configs/chromium-swarm/bots.cfg#6850
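A minimal sketch of the rule described above, for illustration only. This is not Swarming's actual code: the function names are hypothetical, and the example assumes the lease value "24" means 24 hours.

```python
from datetime import timedelta


def can_hand_out_task(lease_remaining: timedelta, hard_timeout: timedelta) -> bool:
    """Hypothetical check: a leased bot may take a task only if the task's
    hard timeout is not higher than the time remaining in the lease."""
    return hard_timeout <= lease_remaining


def eligible_window(lease_duration: timedelta, hard_timeout: timedelta) -> timedelta:
    """Portion of a lease (counted from its start) during which the bot can
    still accept a task with the given hard timeout."""
    return max(lease_duration - hard_timeout, timedelta(0))


# Assumption: the lease is 24 hours; hard_timeout is 10h as stated in the comment.
lease = timedelta(hours=24)
hard_timeout = timedelta(hours=10)
window = eligible_window(lease, hard_timeout)
```

Under these assumptions, each bot can accept such a task only during the first 14 hours of its lease. If all bots in the pool are past that point at the same time, they still appear as available capacity while the task stays pending, which would match the symptom reported in this thread.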
,
Oct 17
Thanks, maruel@! In this case, I'm releasing the bug into the Foundation trooper queue, as I'm not sure much can be done on the CCI side. Thanks!
,
Oct 18
Let me help the Goma folks: https://chrome-internal-review.googlesource.com/c/infradata/config/+/700008
,
Oct 18
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/fc2af377d6b8917e3b113332f5ff632aeb93e07a

commit fc2af377d6b8917e3b113332f5ff632aeb93e07a
Author: Andrii Shyshkalov <tandrii@chromium.org>
Date: Thu Oct 18 01:12:07 2018
,
Oct 18
,
Oct 18
Thank you!
Comment 1 by tikuta@chromium.org
, Oct 17