Masters not killing in-progress slaves on branch builds.
Issue description

These two builds are for master-release on the release-R69-10895.B branch. I would have expected the second build to kill the slaves still running from the first build, to avoid corruption or excessive use of build resources (both builders and lab DUTs), but that mechanism doesn't seem to have worked correctly.

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8940063994328518608
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8940058619896203856
Jul 25
It looks like the query for previous builds did not correctly locate the in-progress build:

21:33:57: INFO: Cancelling obsolete slave builds.
21:33:57: INFO: Get service account /creds/service_accounts/service-account-chromeos.json
21:33:57: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/.cache/cipd/40506ccc2cd82978530da01fbf9a64c1e7d5d463 ensure -root /b/swarming/w/ir/cache/cbuild/repository/.cache/cipd/packages/infra/tools/luci-auth/linux-amd64 -list /b/swarming/w/ir/tmp/t/cbuildbot-tmpyCZuDY/tmpMZo6e9
21:33:59: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/.cache/cipd/packages/infra/tools/luci-auth/linux-amd64/luci-auth token '-service-account-json=/creds/service_accounts/service-account-chromeos.json'
21:33:59: INFO: Found Previous Master builds: 8940127873549370544, 8940173528155015088

The previous build had an id of "8940058619896203856".
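For reference, here is a rough sketch of what the "Cancelling obsolete slave builds" step is presumably meant to do. This is not the actual chromite code; the buildbucket client object and its search_builds / cancel_build / parent_id usage are hypothetical stand-ins for illustration only.

# Sketch only: bb_client and its methods are assumed helpers, not the real
# chromite or Buildbucket API.
def cancel_obsolete_slave_builds(bb_client, master_builder, current_build_id):
    """Cancel slaves that belong to earlier, still-running master builds."""
    # Find previous builds of this master builder that have not completed yet.
    # The bug reported here is that this query returned 8940127873549370544 and
    # 8940173528155015088 but missed the in-progress 8940058619896203856.
    previous_masters = [
        b for b in bb_client.search_builds(builder=master_builder,
                                           status='STARTED')
        if b.id != current_build_id
    ]
    for master in previous_masters:
        # Cancel every slave the old master scheduled, then the master itself.
        for slave in bb_client.search_builds(parent_id=master.id):
            bb_client.cancel_build(slave.id, reason='Obsoleted by newer master')
        bb_client.cancel_build(master.id, reason='Obsoleted by newer master')

If the STARTED filter (or its equivalent) drops builds that are still running, the new master never sees them, which would explain the log above.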
Jul 25
FWIW, I think allowing both to build in parallel would be a good thing. The lab DUT loading can be an issue, but I thought one of the benefits of swarming was that we would be able to build more in parallel, since resources aren't as directly tied up. What sort of corruption would we expect? I believe we have a race condition if they start too close to the same time, when the builders make git commits to update versions and such, but if spaced sufficiently apart (say, an hour) I would think they should be OK.
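To make that race concrete, here is a minimal illustration (in Python) of the kind of non-atomic version bump being described. The file name and git workflow are assumptions for the example; chromite's real version/uprev handling is more involved.

# Illustrative only: VERSION_FILE and this workflow are assumptions.
import subprocess

VERSION_FILE = 'chromeos_version.sh'  # hypothetical path

def bump_and_push_version(repo_dir):
    """Read-modify-write the version file, then push the commit.

    If two masters run this close together, both can read the same starting
    version; the loser of the race then either fails to push (non-fast-forward)
    or, after a careless rebase, clobbers the other master's bump.
    """
    subprocess.check_call(['git', '-C', repo_dir, 'pull', '--ff-only'])
    path = f'{repo_dir}/{VERSION_FILE}'
    with open(path) as f:
        version = int(f.read().strip())
    with open(path, 'w') as f:
        f.write(f'{version + 1}\n')
    subprocess.check_call(['git', '-C', repo_dir, 'commit', '-am',
                           f'Bump version to {version + 1}'])
    # Everything between the pull above and this push is the race window.
    subprocess.check_call(['git', '-C', repo_dir, 'push'])

Spacing the two masters an hour apart keeps their race windows from overlapping, which matches the suggestion above.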
Jul 25
For some master/slave groups (the CQ / PFQs), this would be really bad, since it can result in uprevs being corrupted. For the release builds, the version / manifest updates may or may not be unsafe, but spaced apart they should be fine (as you say). The problem with release builds is running out of builders / DUTs. They are pooled with all the other prod builds, and so can impact the TOT PFQs, informationals, canaries, etc. We don't yet have very good management of that, so sudden spikes are problematic. I expect to have better management over time, but today we have trouble even knowing whether we ran out of builders. PS: The abort button in the scheduler UI, next to the Run button, should generally be safe to use.
Jul 25
OK, next time we can try to abort the build in progress. I think some of this will be clearer once we have these represented in Legoland and can easily see the state of all the individual builds, not just the master. I was hesitant to just abort the master, since historically stopping the master had no impact on the slaves (maybe that is different now?) and so would not have helped; and without a good way of seeing all the slaves (maybe looking through the master logs to find buildbucket ids?), it was not clear we could have stopped them effectively.
Jul 25
Well, true. Aborting the master is supposed to be sufficient, because the start of the next master should kill all of the slaves for you. But this bug gets in the way.
Comment 1 by dgarr...@chromium.org, Jul 25