
Issue 870369


Issue metadata

Status: Fixed
Owner:
Closed: Aug 2
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 1
Type: Bug




ANGLE standalone windows slaves all offline

Project Member Reported by jmad...@chromium.org, Aug 2

Issue description

Unsure when or why this happened, but all try and CI jobs are failing:

https://ci.chromium.org/p/angle/g/ci/console

https://ci.chromium.org/p/angle/g/try/builders

All the bots seem to be offline. Example task:

https://chromium-swarm.appspot.com/task?id=3f111f9dc1fe1410&refresh=10&show_raw=1&wide_logs=true

Labs team / troopers, can you see if these bots can be restarted? Did they somehow get reassigned to other builders?

Marking Pri-0 as it is blocking ANGLE CQ. It is only ANGLE, not Chromium, but seems pretty serious.
 
The Swarming slaves that are offline seem to be:

swarm80-c4	Offline	2018-07-29 1:26 PM (PDT)
swarm81-c4	Offline	2018-07-29 1:17 PM (PDT)
swarm82-c4	Offline	2018-07-29 1:39 PM (PDT)
swarm83-c4	Offline	2018-07-29 1:12 PM (PDT)
swarm84-c4	Offline	2018-07-29 1:31 PM (PDT)
swarm85-c4	Offline	2018-07-29 1:42 PM (PDT)

Owner: d...@chromium.org
Status: Assigned (was: Untriaged)
Ken, where did you pull that list from? I am not sure, but I think it might be more than that. We have six per config, plus CI testers, and all four configs seem broken in CI and Try.
I clicked the drop-down arrow next to "6 bots" near the top of one of the bots' pages:
https://ci.chromium.org/p/angle/builders/luci.angle.ci/win-clang-x86-dbg

All of those offline bots are sharing the same 6 slaves.

To see the association between slaves and bots look here:
https://chrome-internal.googlesource.com/infradata/config/+/master/configs/chromium-swarm/bots.cfg

There is more configuration information, in cr-buildbucket.cfg and luci-scheduler.cfg, here:
https://chromium.googlesource.com/chromium/src/+/master/infra/config/global/

Looks like the puppet certs expired on the bots in #1:

root@puppetm:/var/log/puppet# grep expire puppetmaster* | egrep swarm8[0-5]-c4 | cut -d : -f 2- | cut -d ' ' -f 8- | sort -u
(warning): Certificate 'swarm80-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:40:44GMT
(warning): Certificate 'swarm81-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:41:08GMT
(warning): Certificate 'swarm82-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:41:13GMT
(warning): Certificate 'swarm83-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:42:28GMT
(warning): Certificate 'swarm84-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:42:45GMT

I've respawned them and they're reconnecting.
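
The warnings quoted above only say when each certificate will expire, so a quick way to see which ones are already past due is to compare that timestamp against a cutoff. The following is a minimal sketch, not the command that was run in this thread: `expired_certs` is a hypothetical helper, and it assumes the log lines have exactly the shape shown above (certificate name in single quotes, timestamp as the last word on the line).

```shell
# Print the certificate names whose expiry is at or before a cutoff.
# stdin: puppetmaster-style warning lines (assumed format, see above).
# $1:    cutoff in the same form as the log, e.g. 2018-08-02T00:00:00GMT.
expired_certs() {
  awk -v cutoff="$1" -F"'" '
    /will expire on/ {
      # The expiry timestamp is the last whitespace-separated word.
      n = split($0, words, " ")
      expiry = words[n]
      # Fixed-width timestamps order correctly when compared as strings;
      # appending "" forces a string comparison.
      if ((expiry "") <= (cutoff "")) print $2
    }'
}
```

As a usage example, the grep pipeline above could be piped into `expired_certs 2018-08-02T00:00:00GMT`; with the five warnings quoted, every host would be listed because all of them expired on 2018-07-29.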

I also see the following swarming bots offline in luci.angle.try:

$ swarming bots -S chromium-swarm.appspot.com -d pool luci.angle.try --dead-only -b
swarm52-c4
swarm53-c4
swarm54-c4
swarm55-c4
swarm56-c4
swarm57-c4
swarm58-c4
swarm59-c4
swarm60-c4
swarm61-c4
swarm86-c4
swarm87-c4
swarm88-c4
swarm89-c4
swarm90-c4
swarm91-c4
swarm92-c4

Some of these are suffering from a similar issue:

root@puppetm:/var/log/puppet# grep expire puppetmaster* | egrep swarm'(5[2-9]|6[0-1]|8[6-9]|9[0-2])'-c4 | cut -d : -f 2- | cut -d ' ' -f 8- | sort -u
(warning): Certificate 'swarm52-c4.c.chromecompute.google.com.internal' will expire on 2018-08-01T21:08:03GMT
(warning): Certificate 'swarm54-c4.c.chromecompute.google.com.internal' will expire on 2018-08-01T21:08:21GMT
(warning): Certificate 'swarm57-c4.c.chromecompute.google.com.internal' will expire on 2018-08-01T21:09:43GMT
(warning): Certificate 'swarm58-c4.c.chromecompute.google.com.internal' will expire on 2018-08-01T21:09:59GMT
(warning): Certificate 'swarm86-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:44:09GMT
(warning): Certificate 'swarm87-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:44:44GMT
(warning): Certificate 'swarm89-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:45:46GMT
(warning): Certificate 'swarm90-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:46:36GMT
(warning): Certificate 'swarm92-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:48:20GMT
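
Only some of the dead bots show up in the certificate warnings, so it can help to list the ones that do not, since those may need a different fix. The sketch below is one way to do that comparison. `dead_without_cert_warning` is a hypothetical helper, and the two input files (a saved copy of the dead-bot list and of the warning lines) are assumptions for illustration, not files that exist on the puppet master.

```shell
# Print dead bots that have no certificate-expiry warning.
# $1: file with one short bot name per line (e.g. swarm52-c4)
# $2: file with puppetmaster warning lines in the format quoted above
dead_without_cert_warning() {
  awk -F"'" '
    FILENAME == ARGV[1] {
      # First file: warning lines. Field 2 is the quoted certificate name;
      # keep only the part before the first dot as the short bot name.
      if (/will expire on/) { split($2, parts, "."); warned[parts[1]] = 1 }
      next
    }
    # Second file: dead bots. Print the ones never seen in a warning.
    $0 != "" && !($0 in warned)
  ' "$2" "$1"
}
```

Run against the two lists quoted above, this would print swarm53-c4, swarm55-c4, swarm56-c4, swarm59-c4, swarm60-c4, swarm61-c4, swarm88-c4 and swarm91-c4, which are the dead bots with no matching expiry warning.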

Thanks Bryce for finding those – could you fix those too under this same report?

Labels: -Pri-0 Pri-1
Ok, after troubleshooting some puppet issues, bots in #5 are now finally back.

Downgrading to P1 for now.

kbr/jmadill - are there other resources I need to check out for this bug?
Is swarm62-c4 another one of these, or shall I file a separate bug for it?
The affected bot is
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win10%20FYI%20Exp%20Release%20%28Intel%20HD%20630%29
swarm62-c4 is now back
Status: Fixed (was: Assigned)
Looks like they're all back now. Thanks.
