
Issue 864638

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 864555




Swarming bot: be more resilient against metadata.google.internal flakiness

Project Member Reported by mar...@chromium.org, Jul 17

Issue description

Occasionally (not often, but it happens), the metadata.google.internal server is down for a brief moment. If a bot is starting up when that happens, the bot ends up in a state where it hard fails, as in issue 864555. Basically, the bot fails to get the GCE service account, so the handshake fails. The handshake code path doesn't refresh its authentication credentials when retrying.

Action items:
- is_gce() and get_metadata_uncached() should retry, at least for a brief moment, since DNS itself may be down.
- When the handshake fails, it should refresh the credentials via get_authentication_headers() before retrying.
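
A rough sketch of what the two items above could look like, not the actual swarming_bot code. retry_call() and handshake_with_refresh() are hypothetical helpers. get_headers stands in for get_authentication_headers(), and post_handshake is an assumed callable that sends the handshake and returns the response, or None on failure. The delays and attempt counts are placeholders.

```python
import time


def retry_call(fn, attempts=5, initial_delay=0.5, max_delay=4.0,
               retry_on=(IOError,), sleep=time.sleep):
  """Calls fn() until it succeeds or the attempts are exhausted.

  Hypothetical helper: could wrap the metadata fetch so that a brief
  metadata.google.internal (or DNS) outage is ridden out.
  """
  delay = initial_delay
  for attempt in range(attempts):
    try:
      return fn()
    except retry_on:
      if attempt == attempts - 1:
        # Out of attempts: let the caller see the last error.
        raise
      sleep(delay)
      # Exponential backoff, capped so startup is not delayed for long.
      delay = min(delay * 2, max_delay)


def handshake_with_refresh(get_headers, post_handshake, attempts=3,
                           sleep=time.sleep):
  """Retries the handshake, fetching fresh auth headers on every attempt.

  get_headers and post_handshake are assumed callables (see the note
  above); the real handshake code path may be shaped differently.
  """
  for attempt in range(attempts):
    # Re-read the credentials each time instead of reusing stale ones.
    headers = get_headers()
    resp = post_handshake(headers)
    if resp is not None:
      return resp
    if attempt < attempts - 1:
      sleep(1)
  return None
```

The sleep parameter is injectable only so the retry logic can be exercised without real delays.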

Ref:
https://cs.chromium.org/chromium/infra/luci/appengine/swarming/swarming_bot/api/platforms/gce.py
https://cs.chromium.org/chromium/infra/luci/appengine/swarming/swarming_bot/bot_code/bot_main.py
 
