Issue 851212

Starred by 5 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Quarantine swarming bot on too many repeated bot deaths

Project Member Reported by jbudorick@chromium.org, Jun 9 2018

Issue description

build16-m9 spent most of this afternoon repeatedly dying because it couldn't run cipd. (See e.g. https://isolateserver.appspot.com/restricted/ereporter2/errors/5137238572662784 or https://chromium-swarm.appspot.com/task?id=3dfa09b537dca710)

These fast failures caused the bot to single-handedly kill ios-simulator on both the CQ and CI for most of the afternoon. Swarming *should* be quarantining bots in cases like this (see COUNT_BOT_DIED in bot_config.py), but that doesn't appear to have happened here.
 
This check in bot_config is ineffective because the bot reboots after failures, which resets the count held in memory. The count would have to be stored to disk to survive a reboot.
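A minimal sketch of what storing the count on disk could look like. Everything here is an assumption for illustration: the state file location, the threshold value, and the function names are hypothetical and are not taken from the actual bot_config.py.

```python
import json
import os
import tempfile

# Hypothetical values; the real threshold and file location are not given in this issue.
MAX_BOT_DEATHS = 5
STATE_FILE = os.path.join(tempfile.gettempdir(), "bot_died_count.json")


def load_death_count(path=STATE_FILE):
    """Returns the persisted death count, or 0 if the file is missing or unreadable."""
    try:
        with open(path) as f:
            return int(json.load(f)["count"])
    except (OSError, ValueError, KeyError, TypeError):
        return 0


def save_death_count(count, path=STATE_FILE):
    """Writes the count via a temp file and rename so a crash mid-write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"count": count}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def record_bot_died(path=STATE_FILE):
    """Increments the persisted count; returns True once the bot should self-quarantine."""
    count = load_death_count(path) + 1
    save_death_count(count, path)
    return count >= MAX_BOT_DEATHS


def record_task_success(path=STATE_FILE):
    """Resets the count after a healthy task so old deaths do not accumulate forever."""
    save_death_count(0, path)
```

Because the count lives in a file rather than in process memory, it is still there when the bot process starts again after a reboot. Whether a success should reset the count, or whether deaths should expire by age instead, is a policy choice this sketch does not settle.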

Comment 2 by no...@chromium.org, Jun 16 2018

Labels: -Pri-2 Pri-1
Status: Available (was: Untriaged)
This seems important given the cascading effect?

Comment 3 by mar...@chromium.org, Jun 20 2018

The real fix is to serialize this information to disk instead of using a global variable.
Issue 901216 has been merged into this issue.
Issue 757932 has been merged into this issue.
One thing you may want is issue 757931.
Huge +1 to server-side quarantining of a dying bot. Relying on a bot's self-diagnosis and self-quarantine is inherently unreliable when the bot is already broken.
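A rough sketch of how a server-side check might work, assuming the server is told about each bot death. The `DeathTracker` class, its thresholds, and its method names are hypothetical and do not reflect the Swarming server's actual code or API.

```python
import collections
import time

# Hypothetical policy: quarantine after 3 deaths within one hour.
MAX_DEATHS = 3
WINDOW_SECONDS = 3600


class DeathTracker:
    """Tracks recent deaths per bot on the server and quarantines repeat offenders."""

    def __init__(self, max_deaths=MAX_DEATHS, window_seconds=WINDOW_SECONDS):
        self._max_deaths = max_deaths
        self._window = window_seconds
        self._deaths = collections.defaultdict(collections.deque)
        self._quarantined = set()

    def report_death(self, bot_id, now=None):
        """Records one death; returns True if the bot is quarantined afterwards."""
        now = time.time() if now is None else now
        deaths = self._deaths[bot_id]
        deaths.append(now)
        # Drop deaths that fell out of the sliding window.
        while deaths and now - deaths[0] > self._window:
            deaths.popleft()
        if len(deaths) >= self._max_deaths:
            self._quarantined.add(bot_id)
        return bot_id in self._quarantined

    def is_quarantined(self, bot_id):
        """The scheduler would consult this before handing the bot a new task."""
        return bot_id in self._quarantined


tracker = DeathTracker()
for _ in range(MAX_DEATHS):
    quarantined = tracker.report_death("build16-m9")
```

The state lives outside the failing bot, so a reboot or a broken bot process cannot erase it. A real server would also need persistent storage and a way to lift the quarantine, both of which are omitted here.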
