In the outage today (http://o/e/m254945d72000001c) we pushed a bad GCE image to Machine Provider. We were able to revert the change, but without further action it could have taken up to 24 hours before we could be sure that every bad bot had been recycled.
We should figure out how to recycle machines faster. If the recycle interval is configurable, we should document how to change it. We should also figure out what reimaging throughput we can actually achieve on GCE.
We may also need to figure out the best way to cancel existing leases or delete damaged bots as part of this (in addition to quarantining them).
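To frame the throughput question, here is a rough back-of-the-envelope sketch. The function names and every number in it are hypothetical; nothing here is measured from GCE or taken from Machine Provider's configuration. It only relates fleet size, sustained reimage rate, and time to a full recycle.

```python
import math


def hours_to_recycle(fleet_size: int, reimages_per_hour: float) -> float:
    """Hours needed to reimage every bot at a sustained rate (hypothetical model)."""
    if fleet_size < 0:
        raise ValueError("fleet_size must be non-negative")
    if reimages_per_hour <= 0:
        raise ValueError("reimages_per_hour must be positive")
    return fleet_size / reimages_per_hour


def required_throughput(fleet_size: int, deadline_hours: float) -> int:
    """Minimum whole number of reimages per hour to finish within the deadline."""
    if deadline_hours <= 0:
        raise ValueError("deadline_hours must be positive")
    return math.ceil(fleet_size / deadline_hours)


if __name__ == "__main__":
    # Illustrative numbers only: a 1200-bot fleet is an assumption.
    print(hours_to_recycle(1200, 100))
    print(required_throughput(1200, 24))
```

Once we have a measured reimage rate, plugging it in would tell us whether a full recycle in, say, an hour is realistic or whether we also need to cancel leases or delete bots directly.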
Comment 1 by dpranke@chromium.org, Jul 12 2017