
Issue 905323

Starred by 1 user

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Feature




LUCI Scheduler Abort took 25 minutes to take effect.

Project Member Reported by bhthompson@google.com, Nov 14

Issue description

With the previous waterfall the team had the ability to actively halt any builder in progress, and the halt was immediate.

Today one can halt a builder, but only after navigating through several layers of UI: first legoland, then milo, then the actual task. This is not intuitive for new team members (IMO this should have been surfaced in legoland as part of the swarming migration). And when one does try to halt the builder, it takes over 20 minutes for the halt to actually happen.

For example, when cancelling https://chrome-swarming.appspot.com/task?id=41249bc12b882110&refresh=10&show_raw=1&wide_logs=true the cancel button was pressed around 9:15 am Pacific, but the builder did not reach a killed state that let the master 71 builder restart until 9:40.

What can we do to get back the ability to halt builders in seconds?
 
I was expecting the Abort button to take effect within a few seconds. That specific case is worth investigating.

However, the abort button aborts the master, not the child builds. The children will be aborted when a new master build starts, but will keep running (and consuming resources) until then.
Summary: LUCI Scheduler Abort took 25 minutes to take effect. (was: Provide a manual method to immediately halt swarming builders)
I was thinking that we had to stop these like we would on the waterfall, i.e. we stop all the child builds, then the master stops itself and starts the next build.

If the proper procedure is to abort the master and let the new master come through and restart any child builds, and that is reasonably fast, then maybe this bug is a non-issue?
Yep, we should generally only start or abort master builds now. Starting children directly isn't supported, and will probably become more and more problematic over time.

Re-running a release build for a specific board is possible, but it will overwrite any existing build artifacts from previous runs, and it is still not really supported.

What happens if we run a production trybot against a release branch for a specific board?

Will it still generate a new version like it used to, or will this break things?
