Quarantine swarming bot on too many repeated bot deaths
Issue description
build16-m9 spent most of this afternoon repeatedly dying because it couldn't run cipd (see e.g. https://isolateserver.appspot.com/restricted/ereporter2/errors/5137238572662784 or https://chromium-swarm.appspot.com/task?id=3dfa09b537dca710). These fast failures caused the bot to single-handedly kill ios-simulator on both the CQ and CI for most of the afternoon. Swarming *should* be quarantining bots in cases like this (see COUNT_BOT_DIED in bot_config.py), but that doesn't appear to have happened here.
Jun 16 2018
This seems important given the cascading effect?
Jun 20 2018
The real fix is to serialize this information to disk instead of using a global variable.
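A minimal sketch of what serializing the count to disk could look like, assuming the counter is a single integer of consecutive deaths. The class name, the JSON layout, and the threshold parameter are all hypothetical and are not taken from bot_config.py; the point is only that a count held in a file survives a process restart, while a global variable does not.

```python
import json
import os
import tempfile


class PersistentDeathCounter:
    """Hypothetical helper: keeps a consecutive-death count in a JSON file."""

    def __init__(self, path, threshold):
        self._path = path
        self._threshold = threshold

    def _read(self):
        try:
            with open(self._path) as f:
                return int(json.load(f)["count"])
        except (OSError, ValueError, KeyError, TypeError):
            # Missing or corrupt state is treated as "no deaths recorded".
            return 0

    def _write(self, count):
        # Write to a temp file, then rename, so a crash mid-write cannot
        # leave a truncated state file behind.
        directory = os.path.dirname(os.path.abspath(self._path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"count": count}, f)
        os.replace(tmp, self._path)

    def record_death(self):
        """Increments the stored count and returns the new value."""
        count = self._read() + 1
        self._write(count)
        return count

    def record_success(self):
        """Resets the count after a task completes normally."""
        self._write(0)

    def should_quarantine(self):
        return self._read() >= self._threshold
```

Because every call re-reads the file, a freshly started bot process sees the deaths recorded by the process that crashed before it.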
Nov 6
Issue 901216 has been merged into this issue.
Nov 6
Issue 757932 has been merged into this issue.
Nov 6
One thing you may want is issue 757931.
Nov 6
Huge +1 to server-side quarantining of a dying bot. Relying on a bot's self-diagnosis and self-quarantine is inherently unreliable when the bot is already broken.
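To illustrate the server-side idea, here is a rough sketch in which the server itself counts consecutive deaths per bot and stops handing out tasks once a limit is reached. Everything here is an assumption for illustration (the class, its method names, and the in-memory storage); it is not the Swarming server's actual API, and a real server would keep this state in its datastore.

```python
from collections import defaultdict


class ServerSideQuarantine:
    """Hypothetical server-side tracker of consecutive bot deaths."""

    def __init__(self, max_consecutive_deaths):
        self._max = max_consecutive_deaths
        self._deaths = defaultdict(int)  # bot_id -> consecutive deaths
        self._quarantined = set()

    def on_task_result(self, bot_id, bot_died):
        """Records the outcome of one task that ran on bot_id."""
        if bot_died:
            self._deaths[bot_id] += 1
            if self._deaths[bot_id] >= self._max:
                self._quarantined.add(bot_id)
        else:
            # Any normal completion resets the streak.
            self._deaths[bot_id] = 0

    def can_assign(self, bot_id):
        """Returns False once the bot has been quarantined."""
        return bot_id not in self._quarantined
```

Since the decision is made from task outcomes the server already observes, it does not depend on the broken bot reporting anything about itself.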
Comment 1 by mar...@chromium.org, Jun 9 2018