Swarming bot: be more resilient against metadata.google.internal flakiness

Issue description

Sometimes (not often, but it happens), the metadata.google.internal server is down for a brief moment. When that happens while a bot is starting, the bot ends up in a state where it hard fails, as in issue 864555. Basically, it fails to get the GCE service account, so it fails to handshake. The handshake code path doesn't refresh its authentication credentials when retrying.

AI:
- is_gce() and get_metadata_uncached() should be tweaked to retry, at least for a brief moment, as the DNS itself may be down.
- During handshake, the bot should refresh the credentials via get_authentication_headers() when the handshake fails.

Ref:
https://cs.chromium.org/chromium/infra/luci/appengine/swarming/swarming_bot/api/platforms/gce.py
https://cs.chromium.org/chromium/infra/luci/appengine/swarming/swarming_bot/bot_code/bot_main.py
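
A minimal sketch of the two action items, in Python 3. This is an illustration under assumptions, not the actual code in gce.py or bot_main.py: retry_transient() and handshake_with_refresh() are hypothetical names, and the callables passed in only stand in for get_metadata_uncached() and get_authentication_headers(), whose real signatures are not shown in this issue.

```python
import time


def retry_transient(fn, attempts=5, initial_delay=0.1, max_delay=2.0,
                    sleep=time.sleep):
    """Calls fn() until it succeeds or the attempts run out.

    Hypothetical helper. Only OSError is retried; in Python 3 this covers
    DNS resolution failures (socket.gaierror) and connection errors.
    The last failure is re-raised to the caller.
    """
    delay = initial_delay
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            if attempt == attempts - 1:
                raise
            sleep(delay)
            # Exponential backoff, capped so startup is only briefly delayed.
            delay = min(delay * 2, max_delay)


def handshake_with_refresh(get_headers, do_handshake, attempts=3, delay=1.0,
                           sleep=time.sleep):
    """Retries the handshake, fetching fresh credentials on every attempt.

    Hypothetical helper. get_headers stands in for
    get_authentication_headers(); do_handshake(headers) is assumed to return
    None on failure. Returns the handshake response, or None if all attempts
    failed.
    """
    for attempt in range(attempts):
        try:
            # Refresh here instead of reusing headers from a failed attempt.
            headers = get_headers()
        except OSError:
            headers = None
        if headers is not None:
            response = do_handshake(headers)
            if response is not None:
                return response
        if attempt < attempts - 1:
            sleep(delay)
    return None
```

The point of the second helper is that credentials are fetched inside the retry loop, so a metadata server outage during the first attempt does not poison the later ones.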
Comment 1 by jchin...@chromium.org, Jul 19