
Issue 896121

Starred by 3 users

Issue metadata

Status: Verified
Owner:
Closed: Oct 18
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




Some builder in goma.latest is pending for several hours

Project Member Reported by tikuta@chromium.org, Oct 17

Issue description

Labels: -Pri-3 Pri-2
Owner: sergeybe...@chromium.org
Status: Assigned (was: Untriaged)
Not sure yet, but http://shortn/_Qmx57CewP9 indeed shows a long pending time for this builder, from Oct 16 17:33 PDT until 23:30 (almost 6 hours), followed by another 1.2h of pending time, presumably for the next build. It has all cleared now.
During the 6 hour waiting time, this build was pending and then expired (presumably, after the timeout of 6h): https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Builder%20Goma%20Latest%20Client/9007

The next build https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20Builder%20Goma%20Latest%20Client/9008 was again pending for 80+ min, and got picked up and finished normally. This explains the pending graph.

So, it seems there were no swarming bots available for ~7.5h during that time.
The graph for the pool http://shortn/_GTc4kWCt0z suggests that the pool itself was fine (4 bots). Note that the "builder" dimension is set to "WinMSVC64 Goma Latest Client" in the graph because the pool is shared by a whole bunch of builders (I didn't even know we support this type of sharing! Or maybe we don't yet?), and the metric cannot handle a list of dimensions well, so it picks a random value from the list. In any case, it's the correct pool.

Another curiosity: an example bot
https://chromium-swarm.appspot.com/bot?id=gce-trusty-e833d7b0-us-west1-b-cb3p&sort_stats=total%3Adesc starts its lifetime with a task at 1:18am. This corresponds to the time the pool recovered. Maybe Machine Provider killed the bots and spawned them back? 

It's odd though that the capacity graph still showed available bots while the builder couldn't find them. 
> It's odd though that the capacity graph still showed available bots while the builder couldn't find them. 

Yeah, I looked briefly last night but couldn't figure out why it wouldn't run them.  It said "1 pending task, 4 bots could run this task".
Cc: mar...@chromium.org s...@google.com
#5 - wow, so swarming did see the 4 bots, but didn't schedule? That's odd.
+maruel@ - any ideas?
And +smut@ FYI, because these are MP bots - maybe there is something special about them?

Here's a more direct graph: http://shortn/_9HbNcUmoS9
It shows that the bots were sourced from 3 GCE zones: us-west1-b was available throughout, while us-east1-b and us-west1-c were removed around 1:09am and replaced by more bots in us-west1-b. This approximately correlates with the recovery time, but I'm still not sure what may have caused the outage.

If we don't figure anything out soon, I'm tempted to mark this bug as WontFix, since everything has returned to normal now.
Cc: vadimsh@chromium.org estaab@chromium.org tandrii@chromium.org
That's still the early_release_secs issue we discussed two weeks ago in Foundation.

Problem:
Instead of extending a bot's lease (currently 24) for as long as it is still running a task, Swarming refuses to hand out the task if hard_timeout (10h here) is higher than the time remaining in the lease.
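
A minimal sketch of the check described above, to make the failure mode concrete. This is not Swarming's actual code: the `Bot` class, the `can_take_task` function, and the bot names are hypothetical, and the lease length is assumed to be 24 hours since the comment gives no unit.

```python
from dataclasses import dataclass

HOUR = 3600  # seconds


@dataclass
class Bot:
    """Hypothetical model of a leased Machine Provider bot."""
    bot_id: str
    lease_remaining_secs: float  # time left before the lease expires


def can_take_task(bot: Bot, hard_timeout_secs: float) -> bool:
    """Return True if the task's hard timeout fits in the remaining lease.

    Models the behavior described in the comment: the lease is never
    extended, so a task whose hard_timeout is higher than the remaining
    lease time is refused, even though the bot is idle.
    """
    return hard_timeout_secs <= bot.lease_remaining_secs


hard_timeout = 10 * HOUR  # 10h, as stated for this builder

# Assumption: leases last 24 hours.
fresh = Bot("fresh-bot", lease_remaining_secs=24 * HOUR)
aged = Bot("aged-bot", lease_remaining_secs=9 * HOUR)

print(can_take_task(fresh, hard_timeout))  # True
print(can_take_task(aged, hard_timeout))   # False
```

Under these assumptions, a bot with less than 10h left on its lease still counts as available capacity but never receives this builder's tasks, which would match the "1 pending task, 4 bots could run this task" observation.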

Ref:
https://chrome-internal.googlesource.com/infradata/config.git/+/master/configs/chromium-swarm/bots.cfg#6850
Cc: sergeybe...@chromium.org
Components: -Infra>Client>Chrome Infra>Platform>Swarming
Labels: -Infra-Troopers Foundation-Troopers
Owner: ----
Status: Untriaged (was: Assigned)
Thanks, maruel@!

In this case, I'm releasing the bug into the Foundation trooper queue, as I'm not sure much can be done on the CCI side. Thanks!
Owner: tandrii@chromium.org
Status: Assigned (was: Untriaged)
Let me help the Goma folks: https://chrome-internal-review.googlesource.com/c/infradata/config/+/700008
Project Member

Comment 10 by bugdroid1@chromium.org, Oct 18

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/fc2af377d6b8917e3b113332f5ff632aeb93e07a

commit fc2af377d6b8917e3b113332f5ff632aeb93e07a
Author: Andrii Shyshkalov <tandrii@chromium.org>
Date: Thu Oct 18 01:12:07 2018

Status: Fixed (was: Assigned)
Status: Verified (was: Fixed)
Thank you!
