Issue 821975

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 1
Type: Bug-Regression

Blocking:
issue 815092


Windows GPU FYI builder slaves missing or offline

Project Member Reported by geoffl...@chromium.org, Mar 14 2018

Issue description

The Builders for the LUCI Windows GPU FYI waterfall appear to have gone offline or are missing in swarming.

The bots are:
swarm749-c4
swarm750-c4
swarm751-c4
swarm752-c4
swarm753-c4
swarm754-c4
swarm755-c4
swarm756-c4
swarm757-c4
swarm758-c4
swarm759-c4
swarm760-c4
swarm761-c4
swarm762-c4
swarm763-c4
swarm764-c4
swarm765-c4
swarm766-c4
swarm767-c4
swarm768-c4

Example builder page:
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/GPU%20FYI%20Win%20Builder

Example offline bot:
https://chromium-swarm.appspot.com/bot?id=swarm749-c4
 

Comment 1 by kbr@chromium.org, Mar 14 2018

Blocking: 815092
Cc: estaab@chromium.org hinoka@chromium.org
Labels: -Type-Bug Infra-Troopers Type-Bug-Regression
Infra folks / troopers: this is blocking our LUCI migration; can someone please take a look quickly? Thanks.

Comment 2 by hinoka@chromium.org, Mar 14 2018

Logged onto one:

[I2018-03-14T20:05:40.454000 2800 2820 chromebuild-startup:78] Calling: 'C:\\Program Files\\OpenSSH-Win64\\ssh-keygen.exe -f C:\\Program Files\\OpenSSH-Win64\\ssh_host_rsa_key -P  -t rsa'
[I2018-03-14T20:05:40.454000 2800 2820 chromebuild-startup:79]    in C:\Windows\system32
[D2018-03-14T20:05:41.110000 2800 2820 chromebuild-startup:100] Generating public/private rsa key pair.
[D2018-03-14T20:05:41.110000 2800 2820 chromebuild-startup:100] C:\Program Files\OpenSSH-Win64\ssh_host_rsa_key already exists.

I can respawn the bots for now, but I feel like OpenSSH is causing more harm than good, and we should disable it for now and fix it later (since we can still RDP into the bots).
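For anyone triaging similar startup logs, here is a minimal Python sketch that parses lines like the ones quoted above and flags the "already exists" failure from ssh-keygen. The field layout (level letter, timestamp, two numeric IDs, then module:line) is an assumption inferred from these four lines only, and `parse_line` and `key_already_exists` are hypothetical helper names, not part of chromebuild-startup.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the quoted log lines only:
#   [<level><timestamp> <id> <id> <module>:<line>] <message>
# The two numeric fields are treated as process and thread IDs (a guess).
LOG_LINE = re.compile(
    r"^\[(?P<level>[A-Z])"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) (?P<tid>\d+) "
    r"(?P<module>[\w-]+):(?P<lineno>\d+)\] "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%Y-%m-%dT%H:%M:%S.%f"
    )
    return record


def key_already_exists(lines):
    """True if any parsed message reports that the host key file already exists."""
    for line in lines:
        record = parse_line(line)
        if record and record["message"].rstrip().endswith("already exists."):
            return True
    return False


# Two of the lines quoted in this comment, copied verbatim.
SAMPLE = [
    r"[D2018-03-14T20:05:41.110000 2800 2820 chromebuild-startup:100] Generating public/private rsa key pair.",
    r"[D2018-03-14T20:05:41.110000 2800 2820 chromebuild-startup:100] C:\Program Files\OpenSSH-Win64\ssh_host_rsa_key already exists.",
]

if __name__ == "__main__":
    print(key_already_exists(SAMPLE))
```

A check like this only tells you that key generation hit an existing file; as later comments show, that may not be what is actually keeping a given bot offline.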

Comment 3 by kbr@chromium.org, Mar 14 2018

hinoka@: any recommendations you have for getting these bots back online sooner rather than later are welcome. Thanks.

Comment 4 by hinoka@chromium.org, Mar 14 2018

I respawned them all but that didn't seem to help.  It's still getting stuck on SSH key regeneration.  I'm wondering how these worked previously and if we need to push a new image.

Comment 5 by hinoka@chromium.org, Mar 14 2018

That might've been a red herring.  I logged onto a different bot and it's stuck at startup because infra-python is missing.  infra-python is installed by Puppet.

Comment 6 by hinoka@chromium.org, Mar 14 2018

I've been informed that there was a CIPD issue that was causing 403s to be served.  That might've been related.

Comment 7 by hinoka@chromium.org, Mar 14 2018

The CIPD issue should be fixed.  I'll try respawning again.

Comment 8 by hinoka@chromium.org, Mar 14 2018

Owner: hinoka@chromium.org
Status: Fixed (was: Available)
I spot checked 5 of them and they all look online.  I'll mark this as fixed.  Root cause was CIPD being broken due to Google Storage giving us bad signed URLs, and therefore the instances could not finish bootstrap.
Hm... I'm not sure if that's the /root/ cause. That error only cropped up in the last hour or so.

Comment 10 by kbr@chromium.org, Mar 14 2018

Status: Started (was: Fixed)
hinoka@: thanks for your quick action. At least one of these slaves, swarm768-c4, still seems missing:
https://chromium-swarm.appspot.com/bot?id=swarm768-c4&sort_stats=total%3Adesc

Can you please take another look?

Respawning... in the meantime, filed crbug.com/822032
Status: Fixed (was: Started)
It took 4 respawns... but it's back online now.  We really need openssh to be less broken.

Comment 13 by kbr@chromium.org, Mar 14 2018

Jeez. Thanks Ryan.

Status: Available (was: Fixed)
These bots all appear to be offline again.  Can someone take a look?
Owner: ----
Owner: tandrii@chromium.org
Status: Assigned (was: Available)
wtf. Let's see...

Comment 17 by efoo@chromium.org, Mar 19 2018

I would suggest a postmortem on this when this is resolved. Andrii can you take the lead on drafting one? 

Comment 18 by efoo@chromium.org, Mar 19 2018

Labels: cit-pm-75

Comment 19 by efoo@chromium.org, Mar 19 2018

Labels: cit-pm
Status: Started (was: Assigned)
efoo@: there are two distinct causes here, each handled in a different bug (see Ryan's reply above):
 issue 822032
 issue 819355, which results in Win bots getting auto-updated and then getting stuck on the privacy options screen



Yep, this is issue 819355 again. Unfortunately the only thing I can do is respawn them.

efoo@ cit-pm-75 should be written for issue 819355, but first it has to be resolved.
$ ./ccompute ri swarm7{49..68}-c4
4 bots came back...
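The shell brace expansion `swarm7{49..68}-c4` in the command above covers the same 20 bots listed in the issue description. As a rough sketch for checking them by hand, the Python below reproduces that expansion and builds the Swarming bot page URL for each one, using the URL pattern from the "Example offline bot" link. `expand_bots` and `bot_url` are hypothetical helper names; nothing here calls `ccompute` or the Swarming API.

```python
from urllib.parse import urlencode

# Base of the bot page URL, taken from the example link in the issue description.
SWARMING_BOT_PAGE = "https://chromium-swarm.appspot.com/bot"


def expand_bots(first=749, last=768):
    """Mimic the shell expansion swarm7{49..68}-c4 (both ends inclusive)."""
    return [f"swarm{n}-c4" for n in range(first, last + 1)]


def bot_url(bot_id):
    """Build the bot page URL for a single bot ID."""
    return f"{SWARMING_BOT_PAGE}?{urlencode({'id': bot_id})}"


if __name__ == "__main__":
    for bot in expand_bots():
        print(bot, bot_url(bot))
```

Opening each printed URL is a manual spot check, similar to the one described in Comment 8.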
Status: Fixed (was: Started)
All back. However, not for long. If they go offline again (and I bet they will), please poke issue 819355
Thanks for following up, will keep future requests in that bug.

Comment 26 by efoo@chromium.org, Mar 20 2018

To clarify, my request is to create a postmortem on the flaky OpenSSH issues detailed here. This is separate from the auto-update security updates.

Pinging smut@ and vadmsh@ directly.