ANGLE standalone windows slaves all offline
Issue description

I'm unsure when or why this happened, but all try and CI jobs are failing:
https://ci.chromium.org/p/angle/g/ci/console
https://ci.chromium.org/p/angle/g/try/builders

All the bots seem to be offline. Example task:
https://chromium-swarm.appspot.com/task?id=3f111f9dc1fe1410&refresh=10&show_raw=1&wide_logs=true

Labs team / troopers, can you see if these bots can be restarted? Did they somehow get reassigned to other builders?

Marking Pri-0 as it is blocking the ANGLE CQ. It affects only ANGLE, not Chromium, but seems pretty serious.
Aug 2
Aug 2
Ken, where did you pull that list from? I am not sure, but I think it might be more than that. We have six per config, plus CI testers, and all four configs seem broken in CI and Try.
Aug 2
I clicked the drop-down arrow next to "6 bots" near the top of one of the bots' pages:
https://ci.chromium.org/p/angle/builders/luci.angle.ci/win-clang-x86-dbg

All of those offline bots are sharing the same 6 slaves. To see the association between slaves and bots, look here:
https://chrome-internal.googlesource.com/infradata/config/+/master/configs/chromium-swarm/bots.cfg

There is more configuration information in cr-buildbucket.cfg and luci-scheduler.cfg here:
https://chromium.googlesource.com/chromium/src/+/master/infra/config/global/
Aug 2
Looks like the puppet certs expired on the bots in #1:

root@puppetm:/var/log/puppet# grep expire puppetmaster* | egrep swarm8[0-5]-c4 | cut -d : -f 2- | cut -d ' ' -f 8- | sort -u
(warning): Certificate 'swarm80-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:40:44GMT
(warning): Certificate 'swarm81-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:41:08GMT
(warning): Certificate 'swarm82-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:41:13GMT
(warning): Certificate 'swarm83-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:42:28GMT
(warning): Certificate 'swarm84-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:42:45GMT

I've respawned them and they've reconnected.

I also see the following swarming bots offline in luci.angle.try:

$ swarming bots -S chromium-swarm.appspot.com -d pool luci.angle.try --dead-only -b
swarm52-c4
swarm53-c4
swarm54-c4
swarm55-c4
swarm56-c4
swarm57-c4
swarm58-c4
swarm59-c4
swarm60-c4
swarm61-c4
swarm86-c4
swarm87-c4
swarm88-c4
swarm89-c4
swarm90-c4
swarm91-c4
swarm92-c4

Some of these are suffering from a similar issue:

root@puppetm:/var/log/puppet# grep expire puppetmaster* | egrep swarm'(5[2-9]|6[0-1]|8[6-9]|9[0-2])'-c4 | cut -d : -f 2- | cut -d ' ' -f 8- | sort -u
(warning): Certificate 'swarm52-c4.c.chromecompute.google.com.internal' will expire on 2018-08-01T21:08:03GMT
(warning): Certificate 'swarm54-c4.c.chromecompute.google.com.internal' will expire on 2018-08-01T21:08:21GMT
(warning): Certificate 'swarm57-c4.c.chromecompute.google.com.internal' will expire on 2018-08-01T21:09:43GMT
(warning): Certificate 'swarm58-c4.c.chromecompute.google.com.internal' will expire on 2018-08-01T21:09:59GMT
(warning): Certificate 'swarm86-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:44:09GMT
(warning): Certificate 'swarm87-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:44:44GMT
(warning): Certificate 'swarm89-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:45:46GMT
(warning): Certificate 'swarm90-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:46:36GMT
(warning): Certificate 'swarm92-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:48:20GMT
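For anyone wanting to repeat this cross-check without eyeballing the two lists, here is a rough Python sketch. It assumes the warning lines have exactly the shape shown in the log output above; the function names (parse_expiries, split_dead_bots) and the reference time are made up for illustration and are not part of any puppet or swarming tooling.

```python
import re
from datetime import datetime, timezone

# Matches the warning text as it appears in the log excerpts in this comment.
WARNING_RE = re.compile(
    r"Certificate '(?P<host>[^']+)' will expire on "
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})GMT"
)


def parse_expiries(log_text):
    """Return {short bot name: expiry datetime (UTC)} for each warning found."""
    expiries = {}
    for match in WARNING_RE.finditer(log_text):
        bot = match.group("host").split(".", 1)[0]  # e.g. "swarm52-c4"
        expiry = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
        expiries[bot] = expiry.replace(tzinfo=timezone.utc)
    return expiries


def split_dead_bots(dead_bots, expiries, now):
    """Split dead bots into (cert already expired, no expired cert on record)."""
    expired = [b for b in dead_bots if b in expiries and expiries[b] <= now]
    other = [b for b in dead_bots if b not in expired]
    return expired, other


if __name__ == "__main__":
    sample_log = """
    (warning): Certificate 'swarm52-c4.c.chromecompute.google.com.internal' will expire on 2018-08-01T21:08:03GMT
    (warning): Certificate 'swarm86-c4.c.chromecompute.google.com.internal' will expire on 2018-07-29T19:44:09GMT
    """
    dead = ["swarm52-c4", "swarm53-c4", "swarm86-c4"]
    # Hypothetical reference time; the thread only gives the date "Aug 2".
    now = datetime(2018, 8, 2, 12, 0, 0, tzinfo=timezone.utc)
    expired, other = split_dead_bots(dead, parse_expiries(sample_log), now)
    print("expired cert:", expired)
    print("no expired cert on record:", other)
```

Bots that land in the second list are the ones that are offline without a matching cert warning, so they probably need a separate look.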
Aug 2
Thanks Bryce for finding those – could you fix those too under this same report?
Aug 2
Ok, after troubleshooting some puppet issues, bots in #5 are now finally back. Downgrading to P1 for now. kbr/jmadill - are there other resources I need to check out for this bug?
Aug 2
Is swarm62-c4 another one of these, or shall I file a separate bug for it? The affected bot is https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win10%20FYI%20Exp%20Release%20%28Intel%20HD%20630%29
Aug 2
swarm62-c4 is now back
Aug 2
Looks like they're all back now. Thanks.
Comment 1 by kbr@chromium.org, Aug 2