
Issue 899908


Issue metadata

Status: Fixed
Owner:
Closed: Oct 30
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 1
Type: Bug




Windows bots falling offline

Project Member Reported by martiniss@chromium.org, Oct 29

Issue description

https://viceroy.corp.google.com/auto/prod:chrome-ops-client-infra/chrome_client/bots?borg_user=&project=chromium&refresh=-1&duration=1d&bucket=&os=Windows.*&pool=Chrome&groupby=builder&utc_end=1540845047 shows this. 

win10_chromium_x64_rel_ng has long pending times, I think as a result of this.

I took a glance at https://chrome-internal.googlesource.com/infradata/config.git and didn't see anything obvious.

This will break the CQ bot very soon.
 
Yep that looks very very wrong to me.
http://shortn/_fHxctUMX7E is a graph that shows this better.
I don't see many dead bots. I would guess that these bots are just being deleted and bots aren't being brought up to replace them?
Cc: s...@google.com
I don't see anything errorlike in the mp logs, but this might be a logic bug where it thinks it needs to decommission everything, somehow...
Project Member

Comment 6 by bugdroid1@chromium.org, Oct 29

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/60b53b5d929f5c3d5059aee2d800b7a7177596aa

commit 60b53b5d929f5c3d5059aee2d800b7a7177596aa
Author: Stephen Martinis <martiniss@chromium.org>
Date: Mon Oct 29 20:52:53 2018

Remove win10_chromium_x64_rel_ng from the CQ

This bot is totally broken at the moment. Remove it so it doesn't block
CQ.

TBR=jbudorick
NOTRY=true

Bug:  899908 
Change-Id: Ie6eca4c1bf8a67821e5a6570a59255fc6a87f54d
Reviewed-on: https://chromium-review.googlesource.com/c/1306396
Commit-Queue: Stephen Martinis <martiniss@chromium.org>
Reviewed-by: John Budorick <jbudorick@chromium.org>
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Cr-Commit-Position: refs/heads/master@{#603611}
[modify] https://crrev.com/60b53b5d929f5c3d5059aee2d800b7a7177596aa/infra/config/branch/cq.cfg

Looks like the bots can't talk to swarming.

I ssh-ed onto a bot and saw this in the logs:

7152 2018-10-29 21:00:43.213 E: Request to https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake failed with HTTP status code 403: 403 Client Error: Forbidden for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake
7152 2018-10-29 21:00:43.213 E: Failed to contact for handshake, retrying in 300 sec...
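
To check whether other bots are hitting the same failure, a log scan along these lines might help. This is only a sketch: the regex is inferred from the two log lines quoted above, and `find_handshake_failures` is a hypothetical helper, not part of the Swarming bot code.

```python
import re

# Line format assumed from the sample above:
#   "<pid> <date> <time> E: Request to <url> failed with HTTP status code <code>: ..."
HANDSHAKE_FAILURE = re.compile(
    r"^(?P<pid>\d+) (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) E: "
    r"Request to (?P<url>\S+/bot/handshake) "
    r"failed with HTTP status code (?P<status>\d{3})"
)


def find_handshake_failures(lines):
    """Return (timestamp, status_code) for each failed handshake request."""
    failures = []
    for line in lines:
        match = HANDSHAKE_FAILURE.match(line.strip())
        if match:
            failures.append((match.group("ts"), int(match.group("status"))))
    return failures


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python scan_log.py < bot_log.txt
    for ts, status in find_handshake_failures(sys.stdin):
        print(ts, status)
```

Lines that only report the retry ("Failed to contact for handshake, retrying in 300 sec...") are deliberately ignored, so each failed request is counted once.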
vadimsh found the culprit (https://chrome-internal.googlesource.com/infra/puppet/+/41a3a0a6bfad8658f477fd89af343d2c45c35d63), which results in:


Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class ::chrome_infra::setup::windows::wsus for win10-727f49a0-us-west1-b-fp6j.c.chromecompute.google.com.internal on node win10-727f49a0-us-west1-b-fp6j.c.chromecompute.google.com.internal
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. - connect(2)
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. - connect(2)
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

This leaves the bot without credentials, which would explain the 403 on the Swarming handshake.
He's reverting the change.
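
For reference, the Puppet error above names both the missing class and the affected node, so agent output collected from several bots could be summarized roughly like this. It is a sketch under the assumption that every such error follows the wording of the one message quoted above; `missing_puppet_classes` is a hypothetical helper.

```python
import re

# Pattern assumed from the single "Could not find class" message quoted above.
MISSING_CLASS = re.compile(
    r"Could not find class (?P<cls>\S+) for (?P<node>\S+) on node"
)


def missing_puppet_classes(agent_output):
    """Map each missing Puppet class name to the set of nodes reporting it."""
    result = {}
    for line in agent_output.splitlines():
        match = MISSING_CLASS.search(line)
        if match:
            result.setdefault(match.group("cls"), set()).add(match.group("node"))
    return result
```

A single class missing on many nodes at once would point at a shared config change rather than a per-machine problem.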
Project Member

Comment 10 by bugdroid1@chromium.org, Oct 29

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/puppet/+/88c74012c667beb956197de129eaa8c2af5ae3b2

commit 88c74012c667beb956197de129eaa8c2af5ae3b2
Author: Vadim Shtayura <vadimsh@google.com>
Date: Mon Oct 29 21:08:55 2018

Sorry guys. I did not know your puppet configs would call our WSUS module, so I wrote the CL assuming it was only being called by our systems.
A few machines have come online; I'll continue monitoring.
Didn't we use to have a 403 alert for swarming? And we removed it because we got a lot of false positives on it? Or am I misremembering?
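
One way such an alert could be made less prone to false positives is to fire only when a sizable fraction of bots report 403s for several polling intervals in a row. The sketch below is purely illustrative: `should_alert`, its thresholds, and the input shape are assumptions, not how any existing Swarming alert works.

```python
def should_alert(samples, min_fraction=0.2, min_consecutive=3):
    """Decide whether to alert on handshake 403s.

    samples: list of (bots_with_403, total_bots) tuples, one per polling
    interval, oldest first. Alert only if each of the most recent
    `min_consecutive` intervals has at least `min_fraction` of bots failing.
    """
    if len(samples) < min_consecutive:
        return False
    recent = samples[-min_consecutive:]
    return all(
        total > 0 and failing / total >= min_fraction
        for failing, total in recent
    )
```

A brief blip on a handful of bots would not trip this, while a pool-wide credential failure like this one would.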
Labels: -Pri-0 Pri-1
The bots seem to be recovering. I'm going to keep the bot off of the CQ until it's cleared out the pending task list.
Issue 899992 has been merged into this issue.
Issue 899917 has been merged into this issue.
Project Member

Comment 19 by bugdroid1@chromium.org, Oct 29

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6fff7d80b11eebe78ec0b2f74c699278d07efe64

commit 6fff7d80b11eebe78ec0b2f74c699278d07efe64
Author: Stephen Martinis <martiniss@chromium.org>
Date: Mon Oct 29 23:36:58 2018

Revert "Remove win10_chromium_x64_rel_ng from the CQ"

This reverts commit 60b53b5d929f5c3d5059aee2d800b7a7177596aa.

Reason for revert: Bot is drained, can re-add to CQ. 

Original change's description:
> Remove win10_chromium_x64_rel_ng from the CQ
> 
> This bot is totally broken at the moment. Remove it so it doesn't block
> CQ.
> 
> TBR=jbudorick
> NOTRY=true
> 
> Bug:  899908 
> Change-Id: Ie6eca4c1bf8a67821e5a6570a59255fc6a87f54d
> Reviewed-on: https://chromium-review.googlesource.com/c/1306396
> Commit-Queue: Stephen Martinis <martiniss@chromium.org>
> Reviewed-by: John Budorick <jbudorick@chromium.org>
> Reviewed-by: Stephen Martinis <martiniss@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#603611}

TBR=martiniss@chromium.org,jbudorick@chromium.org

Change-Id: I55af5ee2c7d1255f10de817a4f9a548793989a21
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug:  899908 
Reviewed-on: https://chromium-review.googlesource.com/c/1306659
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Commit-Queue: Stephen Martinis <martiniss@chromium.org>
Cr-Commit-Position: refs/heads/master@{#603689}
[modify] https://crrev.com/6fff7d80b11eebe78ec0b2f74c699278d07efe64/infra/config/branch/cq.cfg

Issue 900019 has been merged into this issue.
Status: Fixed (was: Started)
This outage is over. See PM in #16 for more info.
