
Issue 741240

Starred by 2 users

Issue metadata

Status: Duplicate
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 1
Type: Bug




Look into being able to reimage / repair machines faster through machine provider

Project Member Reported by dpranke@chromium.org, Jul 12 2017

Issue description

During today's outage (http://o/e/m254945d72000001c), we pushed a bad GCE image to machine provider. We were able to revert the change, but without further action it would have taken up to 24 hours for every bad bot to be recycled.

We should figure out how to recycle machines faster. If the recycle interval is configurable, we should document how to change it. We should also measure what reimaging throughput we can achieve with GCE.
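One rough way to frame the throughput question is a back-of-the-envelope estimate. The sketch below assumes a steady reimage rate. The function names and the numbers in the usage note are placeholders for illustration, not measurements from this outage or from GCE.

```python
def hours_to_recycle(bot_count: int, reimages_per_hour: float) -> float:
    """Hours needed to reimage every bot, assuming a steady rate.

    Both inputs are hypothetical; the real rate would have to be measured.
    """
    if bot_count < 0:
        raise ValueError("bot_count must not be negative")
    if reimages_per_hour <= 0:
        raise ValueError("reimages_per_hour must be positive")
    return bot_count / reimages_per_hour


def required_rate(bot_count: int, target_hours: float) -> float:
    """Reimages per hour needed to recycle the whole fleet within target_hours."""
    if target_hours <= 0:
        raise ValueError("target_hours must be positive")
    return bot_count / target_hours
```

For example, with a made-up fleet of 1000 bots, hours_to_recycle(1000, 50) gives 20.0 hours, and required_rate(1000, 2) shows that 500 reimages per hour would be needed to finish within 2 hours.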

We may also need to figure out the best way to cancel existing leases or delete damaged bots as part of this (in addition to quarantining them).
 
Labels: cit-pm-57

Comment 2 by estaab@chromium.org, Jul 31 2017

Owner: s...@google.com
Passing this to smut@ for thoughts.

It would be nice to have a tool that quarantines bots with bad images at the swarming level and then revokes their leases at the MP level afterwards.
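A minimal sketch of the ordering such a tool might follow: quarantine every affected bot first, then revoke leases. The callables image_of, quarantine, and revoke_lease are hypothetical stand-ins passed in by the caller. They are not real swarming or machine provider APIs.

```python
from typing import Callable, Iterable, List


def recycle_bad_bots(
    bot_ids: Iterable[str],
    bad_image: str,
    image_of: Callable[[str], str],
    quarantine: Callable[[str], None],
    revoke_lease: Callable[[str], None],
) -> List[str]:
    """Quarantine all bots running bad_image, then revoke their leases.

    All three callables are hypothetical hooks supplied by the caller.
    """
    # Select only the bots that are running the bad image.
    affected = [bot_id for bot_id in bot_ids if image_of(bot_id) == bad_image]

    # Phase 1: quarantine first so no new tasks land on a bad bot.
    for bot_id in affected:
        quarantine(bot_id)

    # Phase 2: revoke leases only after every affected bot is quarantined.
    for bot_id in affected:
        revoke_lease(bot_id)

    return affected
```

Running both phases over the full list, rather than handling one bot at a time, means no bad bot keeps accepting work while leases elsewhere are being revoked.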

Comment 3 by s...@google.com, Mar 15 2018

Mergedinto: 820646
Status: Duplicate (was: Assigned)
Owner: smut@chromium.org
Cc: -smut@chromium.org
