Windows bots falling offline
Issue description
https://viceroy.corp.google.com/auto/prod:chrome-ops-client-infra/chrome_client/bots?borg_user=&project=chromium&refresh=-1&duration=1d&bucket=&os=Windows.*&pool=Chrome&groupby=builder&utc_end=1540845047 shows this. win10_chromium_x64_rel_ng has long pending times, I think as a result of this. I took a glance at https://chrome-internal.googlesource.com/infradata/config.git and didn't see anything obvious. This will break the CQ bot very soon.
,
Oct 29
http://shortn/_fHxctUMX7E is a graph that shows this better.
,
Oct 29
Not seeing many dead bots on swarming though? https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=status%3Adead&f=pool%3AChrome&f=os%3AWindows&l=100&q=os%3Awind&s=id%3Aasc
,
Oct 29
I don't see many dead bots. I would guess that these bots are just being deleted and bots aren't being brought up to replace them?
,
Oct 29
I don't see anything errorlike in the mp logs, but this might be a logic bug where it thinks it needs to decommission everything, somehow...
,
Oct 29
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/60b53b5d929f5c3d5059aee2d800b7a7177596aa

commit 60b53b5d929f5c3d5059aee2d800b7a7177596aa
Author: Stephen Martinis <martiniss@chromium.org>
Date: Mon Oct 29 20:52:53 2018

Remove win10_chromium_x64_rel_ng from the CQ

This bot is totally broken at the moment. Remove it so it doesn't block CQ.

TBR=jbudorick
NOTRY=true

Bug: 899908
Change-Id: Ie6eca4c1bf8a67821e5a6570a59255fc6a87f54d
Reviewed-on: https://chromium-review.googlesource.com/c/1306396
Commit-Queue: Stephen Martinis <martiniss@chromium.org>
Reviewed-by: John Budorick <jbudorick@chromium.org>
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Cr-Commit-Position: refs/heads/master@{#603611}

[modify] https://crrev.com/60b53b5d929f5c3d5059aee2d800b7a7177596aa/infra/config/branch/cq.cfg
,
Oct 29
Looks like the bots can't talk to swarming. I ssh-ed onto a bot and saw this in the logs:
7152 2018-10-29 21:00:43.213 E: Request to https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake failed with HTTP status code 403: 403 Client Error: Forbidden for url: https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake
7152 2018-10-29 21:00:43.213 E: Failed to contact for handshake, retrying in 300 sec...
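For anyone checking other bots for the same symptom, here is a minimal sketch that pulls failed handshake requests out of a bot log. The regex is written against the two log lines quoted above only; the function name and the assumption that other bots log in the same format are mine, not something confirmed in this thread.

```python
import re

# Pattern based solely on the log line quoted in this comment (assumed format).
HANDSHAKE_FAILURE = re.compile(
    r"Request to (?P<url>\S+/bot/handshake) "
    r"failed with HTTP status code (?P<status>\d{3})"
)


def find_handshake_failures(log_text):
    """Return a list of (url, status_code) for each failed handshake request."""
    return [
        (m.group("url"), int(m.group("status")))
        for m in HANDSHAKE_FAILURE.finditer(log_text)
    ]


# Sample taken from the log excerpt above.
sample_log = (
    "7152 2018-10-29 21:00:43.213 E: Request to "
    "https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake "
    "failed with HTTP status code 403: 403 Client Error: Forbidden for url: "
    "https://chromium-swarm.appspot.com/swarming/api/v1/bot/handshake\n"
    "7152 2018-10-29 21:00:43.213 E: Failed to contact for handshake, "
    "retrying in 300 sec...\n"
)

failures = find_handshake_failures(sample_log)
```

A 403 here points at missing or rejected credentials rather than a dead machine, which would fit with the bots not showing up as dead in the swarming bot list.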
,
Oct 29
vadimsh found the culprit (https://chrome-internal.googlesource.com/infra/puppet/+/41a3a0a6bfad8658f477fd89af343d2c45c35d63), which results in:
Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class ::chrome_infra::setup::windows::wsus for win10-727f49a0-us-west1-b-fp6j.c.chromecompute.google.com.internal on node win10-727f49a0-us-west1-b-fp6j.c.chromecompute.google.com.internal
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. - connect(2)
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. - connect(2)
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
As a result, the bot does not have credentials.
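A rough sketch for spotting this failure mode in Puppet agent output: it extracts the class name and node from any "Could not find class" error. The pattern is derived only from the error text quoted above, and the helper name is hypothetical.

```python
import re

# Pattern based solely on the Puppet error quoted in this comment (assumed format).
MISSING_CLASS = re.compile(
    r"Could not find class (?P<cls>\S+) for (?P<node>\S+)"
)


def missing_puppet_classes(agent_output):
    """Return a sorted list of (class_name, node) pairs the server could not resolve."""
    return sorted(
        {(m.group("cls"), m.group("node")) for m in MISSING_CLASS.finditer(agent_output)}
    )


# Sample taken from the agent output above.
sample_output = (
    "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: "
    "Could not find class ::chrome_infra::setup::windows::wsus for "
    "win10-727f49a0-us-west1-b-fp6j.c.chromecompute.google.com.internal on node "
    "win10-727f49a0-us-west1-b-fp6j.c.chromecompute.google.com.internal\n"
    "Warning: Not using cache on failed catalog\n"
    "Error: Could not retrieve catalog; skipping run\n"
)

missing = missing_puppet_classes(sample_output)
```

Running something like this across a sample of agent logs would show whether every affected node is failing on the same class or on several.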
,
Oct 29
He's reverting the change.
,
Oct 29
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infra/puppet/+/88c74012c667beb956197de129eaa8c2af5ae3b2

commit 88c74012c667beb956197de129eaa8c2af5ae3b2
Author: Vadim Shtayura <vadimsh@google.com>
Date: Mon Oct 29 21:08:55 2018
,
Oct 29
Sorry guys. I did not know your puppet configs would call our WSUS module, so I wrote the CL assuming it was only being called by our systems.
,
Oct 29
A few machines have come online; I'll continue monitoring.
,
Oct 29
Didn't we use to have a 403 alert for swarming? And we removed it because we got a lot of false positives on it? Or am I misremembering?
,
Oct 29
The bots seem to be recovering. I'm going to keep the bot off of the CQ until it's cleared out the pending task list.
,
Oct 29
Started a postmortem: https://docs.google.com/document/d/1d2uLciu7JISrfCHf5KoTYYIROP4RfEbJDDgtXGQWAjc/edit
,
Oct 29
Issue 899992 has been merged into this issue.
,
Oct 29
Issue 899917 has been merged into this issue.
,
Oct 29
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/6fff7d80b11eebe78ec0b2f74c699278d07efe64

commit 6fff7d80b11eebe78ec0b2f74c699278d07efe64
Author: Stephen Martinis <martiniss@chromium.org>
Date: Mon Oct 29 23:36:58 2018

Revert "Remove win10_chromium_x64_rel_ng from the CQ"

This reverts commit 60b53b5d929f5c3d5059aee2d800b7a7177596aa.

Reason for revert: Bot is drained, can re-add to CQ.

Original change's description:
> Remove win10_chromium_x64_rel_ng from the CQ
>
> This bot is totally broken at the moment. Remove it so it doesn't block
> CQ.
>
> TBR=jbudorick
> NOTRY=true
>
> Bug: 899908
> Change-Id: Ie6eca4c1bf8a67821e5a6570a59255fc6a87f54d
> Reviewed-on: https://chromium-review.googlesource.com/c/1306396
> Commit-Queue: Stephen Martinis <martiniss@chromium.org>
> Reviewed-by: John Budorick <jbudorick@chromium.org>
> Reviewed-by: Stephen Martinis <martiniss@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#603611}

TBR=martiniss@chromium.org,jbudorick@chromium.org

Change-Id: I55af5ee2c7d1255f10de817a4f9a548793989a21
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug: 899908
Reviewed-on: https://chromium-review.googlesource.com/c/1306659
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Commit-Queue: Stephen Martinis <martiniss@chromium.org>
Cr-Commit-Position: refs/heads/master@{#603689}

[modify] https://crrev.com/6fff7d80b11eebe78ec0b2f74c699278d07efe64/infra/config/branch/cq.cfg
,
Oct 29
Issue 900019 has been merged into this issue.
,
Oct 30
This outage is over. See PM in #16 for more info.
Comment 1 by iannu...@google.com
, Oct 29