
Issue 719312 link

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Feature

Blocked on:
issue 757931
issue 839415




Add way to quarantine a bot from "Swarming Bot Page"

Project Member Reported by tansell@chromium.org, May 8 2017

Issue description

On the page at https://chromium-swarm.appspot.com/bot?id=vm91-m4&sort_stats=total%3Adesc I can "restart a bot". 

It would also be good if I could quarantine a bot using the UI.

Then, when I find bots that are causing failures like in https://bugs.chromium.org/p/chromium/issues/detail?id=718707, I can just quarantine them myself.
 
I'm slightly confused. When you say "restart a bot", what do you mean? I only see a magic button for "gracefully shutdown a bot".

Regarding quarantining a bot, is the goal to shut it down such that it doesn't run any more tasks?  If so, is simply shutting down the bot insufficient?

Also, if you quarantine a bot, you'll probably want a message to explain to others why you have quarantined it.

Finally, once the bot has been repaired, how do you want to un-quarantine it?
The goal is to prevent the bot from accepting any more tasks until it has been manually fixed.

Shutting down a bot doesn't prevent it from just starting back up again with the same misconfiguration. I believe machine provider shuts down and starts up bots regularly?

A message for the quarantine would be good.
Cc: kjlubick@chromium.org
Owner: ----
Status: Available (was: Assigned)
> Shutting down a bot doesn't prevent it from just starting back up again with the same misconfiguration. I believe machine provider shuts down and starts up bots regularly?

I was unaware of that.  I now see how that would not be what you want.

Our Skia bot_config has a bit of logic that tells the bot to shut down if it gets two BOT_DIED results in a row. Since we don't have MP (machine provider), a human is required to fix and restart the bots. The logic works by writing a few files to the bot's local disk to "remember" what happened, since Swarming tries to be stateless.
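
As a rough illustration of that idea (not the actual Skia bot_config), the "remember on local disk" logic could look something like the sketch below. The file name `bot_died_count.json`, the function `record_task_result`, and the way the result string is passed in are all hypothetical; only the "two BOT_DIED in a row" threshold comes from the description above.

```python
import json
import os

# Hypothetical file name; the real bot_config may use different files.
STATE_FILE = "bot_died_count.json"
# Threshold taken from the description: two BOT_DIED results in a row.
MAX_CONSECUTIVE_BOT_DIED = 2


def record_task_result(state_dir, result):
    """Persist the consecutive BOT_DIED count; return True if the bot should shut down."""
    path = os.path.join(state_dir, STATE_FILE)
    count = 0
    if os.path.exists(path):
        with open(path) as f:
            count = json.load(f).get("count", 0)
    # Any result other than BOT_DIED resets the streak.
    count = count + 1 if result == "BOT_DIED" else 0
    with open(path, "w") as f:
        json.dump({"count": count}, f)
    return count >= MAX_CONSECUTIVE_BOT_DIED
```

Because the count lives in a file rather than in Swarming, it survives a bot restart, which is the point of writing it to disk.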

I can envision at least part of a system that writes the message to ~/manual_quarantined or something.  The bot can see this and know to be quarantined.
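
A minimal sketch of the marker-file part, assuming the quarantine reason is simply the contents of `~/manual_quarantined`. The helper names are made up for illustration, and how the bot would report this state back to Swarming is deliberately left out.

```python
import os

# Path suggested above; everything else in this sketch is hypothetical.
MARKER = os.path.join(os.path.expanduser("~"), "manual_quarantined")


def set_quarantine(reason, marker_path=MARKER):
    """Write the marker file; its contents explain why the bot is quarantined."""
    with open(marker_path, "w") as f:
        f.write(reason)


def quarantine_message(marker_path=MARKER):
    """Return the quarantine reason, or None if the bot is not quarantined."""
    try:
        with open(marker_path) as f:
            return f.read().strip() or "quarantined (no reason given)"
    except FileNotFoundError:
        return None


def clear_quarantine(marker_path=MARKER):
    """Remove the marker so the bot can accept tasks again."""
    try:
        os.remove(marker_path)
    except FileNotFoundError:
        pass
```

The bot would check `quarantine_message()` before accepting work. Removing the file is the un-quarantine step, which is the piece that still needs an API/UI (or CLI) story.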

However, I'm a bit hazy on the details of how this state would be removed from the API/UI. M-A would have better ideas on that part.
I care more about putting bots into quarantine than about getting them back out at the moment. I would be happy with a CLI method for putting them back into the pool.

Getting them out of the pool quickly is something anyone should be able to do, while putting them back in is something only troopers should really be doing.

maruel@ - Thoughts?
Labels: -Type-Bug Type-Feature
I'm fine with the idea; I had filed https://github.com/luci/luci-py/issues/123 a long time ago. I had even started a branch locally. I just never made this a priority.
Summary: Add way to quarantine a bot from "Swarming Bot Page" (was: Add way to qurantine a bot from "Swarming Bot Page")
Components: -Infra>Platform>Swarming Infra>Platform>Swarming>WebUI

Comment 8 by mar...@chromium.org, Aug 22 2017

Blockedon: 757931
Blockedon: 839415
