Windows GPU FYI builder slaves missing or offline
Issue description
The builders for the LUCI Windows GPU FYI waterfall appear to have gone offline or are missing in swarming. The bots are: swarm749-c4 swarm750-c4 swarm751-c4 swarm752-c4 swarm753-c4 swarm754-c4 swarm755-c4 swarm756-c4 swarm757-c4 swarm758-c4 swarm759-c4 swarm760-c4 swarm761-c4 swarm762-c4 swarm763-c4 swarm764-c4 swarm765-c4 swarm766-c4 swarm767-c4 swarm768-c4
Example builder page: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/GPU%20FYI%20Win%20Builder
Example offline bot: https://chromium-swarm.appspot.com/bot?id=swarm749-c4
,
Mar 14 2018
Logged onto one:
[I2018-03-14T20:05:40.454000 2800 2820 chromebuild-startup:78] Calling: 'C:\\Program Files\\OpenSSH-Win64\\ssh-keygen.exe -f C:\\Program Files\\OpenSSH-Win64\\ssh_host_rsa_key -P -t rsa'
[I2018-03-14T20:05:40.454000 2800 2820 chromebuild-startup:79] in C:\Windows\system32
[D2018-03-14T20:05:41.110000 2800 2820 chromebuild-startup:100] Generating public/private rsa key pair.
[D2018-03-14T20:05:41.110000 2800 2820 chromebuild-startup:100] C:\Program Files\OpenSSH-Win64\ssh_host_rsa_key already exists.
I can respawn the bots for now, but I feel like OpenSSH is doing more harm than good, and we should disable it for now and fix it later (since we can still RDP into the bot).
,
Mar 14 2018
Any recommendations you have hinoka@ for getting these bots back online sooner rather than later are welcome. Thanks.
,
Mar 14 2018
I respawned them all but that didn't seem to help. It's still getting stuck on SSH key regeneration. I'm wondering how these worked previously and if we need to push a new image.
,
Mar 14 2018
That might've been a red herring. I logged onto a different bot and it's stuck at starting up due to infra-python not being around. infra-python is installed by puppet.
,
Mar 14 2018
I've been informed that there was a CIPD issue that was causing 403s to be served. That might've been related.
,
Mar 14 2018
The CIPD issue should be fixed. I'll try respawning again.
,
Mar 14 2018
I spot checked 5 of them and they all look online. I'll mark this as fixed. Root cause was CIPD being broken due to Google Storage giving us bad signed URLs, and therefore the instances could not finish bootstrap.
,
Mar 14 2018
Hm... I'm not sure if that's the /root/ cause. That error only cropped up in the last hour or so.
,
Mar 14 2018
hinoka@: thanks for your quick action. At least one of these slaves, swarm768-c4, still seems missing: https://chromium-swarm.appspot.com/bot?id=swarm768-c4&sort_stats=total%3Adesc Can you please take another look?
,
Mar 14 2018
Respawning... in the meantime, filed crbug.com/822032
,
Mar 14 2018
It took 4 respawns... but it's back online now. We really need openssh to be less broken.
,
Mar 14 2018
Jeez. Thanks Ryan.
,
Mar 19 2018
These bots all appear to be offline again. Can someone take a look?
,
Mar 19 2018
wtf. Let's see...
,
Mar 19 2018
I would suggest a postmortem on this when this is resolved. Andrii can you take the lead on drafting one?
,
Mar 20 2018
efoo@: there are two distinct causes here, both of which are handled in different bugs (see Ryan's reply above): issue 822032, and issue 819355, which results in Win bots getting auto-updated and getting stuck on the privacy options screen.
,
Mar 20 2018
Yep, this is issue 819355 again. Unfortunately the only thing I can do is respawn them. efoo@ cit-pm-75 should be written for issue 819355, but first it has to be resolved.
,
Mar 20 2018
$ ./ccompute ri swarm7{49..68}-c4
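For readers unfamiliar with the syntax: `swarm7{49..68}-c4` is bash brace expansion, so the command above is passed 20 hostnames, swarm749-c4 through swarm768-c4, matching the list in the issue description. A minimal sketch that prints the same list with a portable loop, without invoking any tooling (the `list_bots` name is hypothetical and is not part of `ccompute`):

```shell
# Hypothetical helper: print the hostnames that swarm7{49..68}-c4 expands to,
# one per line, using a plain POSIX loop instead of bash brace expansion.
list_bots() {
  n=749
  while [ "$n" -le 768 ]; do
    echo "swarm${n}-c4"
    n=$((n + 1))
  done
}

list_bots
```

In bash, `echo swarm7{49..68}-c4` prints the same names on a single line, which is a quick way to check a range before handing it to a respawn command.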
,
Mar 20 2018
4 bots came back...
,
Mar 20 2018
All back. However, not for long. If they go offline again (and I bet they will), please poke issue 819355
,
Mar 20 2018
Thanks for following up, will keep future requests in that bug.
,
Mar 20 2018
To clarify, my request is to create a postmortem on the flaky OpenSSH issues detailed here. This is separate from the auto-update security updates. Pinging smut@ and vadmsh@ directly.
Comment 1 by kbr@chromium.org, Mar 14 2018
Cc: estaab@chromium.org hinoka@chromium.org
Labels: -Type-Bug Infra-Troopers Type-Bug-Regression