
Issue 920453 link

Starred by 1 user

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Feature




Provide tooling to rebuild a board (or boards) that failed non-deterministically during a release build

Project Member Reported by djmm@google.com, Jan 10

Issue description

Builds often fail on https://cros-goldeneye.corp.google.com/chromeos/console/listBuild#%2F due to infra or other causes that are not true build failures.

It would be good to have some way to fill in those gaps.

First choice would be for the builders to understand heuristically or otherwise when non-deterministic failures occur and retry the failed board(s).

Second choice would be providing the ability to rebuild from that tag and fill in the failed builds on existing build records.  Ideally it would keep the original CrOS version that failed (a pure retry); failing that, it could emulate (or auto-create) a 'branch' by incrementing the .0 if necessary.
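The branch-style version bump mentioned above (a failed .0 followed by a .1 build) can be sketched as follows; the three-component version format and helper name are assumptions for illustration:

```python
# Hypothetical sketch of the ".0 -> .1" bump: since a version cannot be
# rebuilt in place today, a new version is minted by incrementing the
# last component of the version string.
def bump_patch(version: str) -> str:
    parts = version.split(".")
    parts[-1] = str(int(parts[-1]) + 1)  # e.g. 11316.46.0 -> 11316.46.1
    return ".".join(parts)
```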
 
Labels: -Restrict-View-Google -Type-Bug Type-Feature
Are you interested in release builds?  CQ?  Something else?  All of them?
I think "regular" means "release" in this case. The goal would be to find a way to regenerate artifacts for a particular version number when it failed to build for some reason outside of its own contents (e.g. some git flake or whatnot).
Summary: Provide tooling to rebuild a board (or boards) that failed non-deterministically during a release build (was: Provide tooling to rebuild a board (or boards) that failed non-deterministically during a regular build)

Comment 4 by djmm@google.com, Jan 16 (6 days ago)

Cc: dgarr...@chromium.org
Example of this:
* crbug/917099 causing failures that killed release build images for a week+
* The only choice is a heavyweight branch operation followed by a .1 build
* All of that to effectively rebuild what was a simple transient build failure

Comment 5 by dgarr...@chromium.org, Jan 16 (6 days ago)

The primary question is what to do with release artifacts from the initial build that is being re-run.

Various side effects that become relevant:
* GS build artifact files (especially metadata.json, which is imported by GE)
* CIDB entries
* TKO test entries (including downstream Stainless and other values)

The GS build artifacts are a particular problem (in both chromeos-image-archive, and chromeos-releases), because for release artifacts they are named based on the version being built. If you rebuild the same version number, you will overwrite existing artifacts.
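The overwrite hazard can be illustrated with a small sketch. The bucket name comes from the comment above, but the exact path layout and helper are assumptions for illustration:

```python
# Sketch of why same-version rebuilds collide: release artifact paths are
# keyed by the version string, so a re-run maps to the same object names
# and silently overwrites the first run's files.
def artifact_path(bucket: str, board: str, version: str, name: str) -> str:
    return f"gs://{bucket}/{board}-release/{version}/{name}"

first = artifact_path("chromeos-image-archive", "eve", "R72-11316.0.0", "metadata.json")
rerun = artifact_path("chromeos-image-archive", "eve", "R72-11316.0.0", "metadata.json")
assert first == rerun  # identical key: the rebuild clobbers the original
```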

A) Should we remove existing files first, so that you don't end up with artifacts from two different builds?

B) What about downstream data consumers (like GE) that assume some files (like metadata.json) are write-once, or that don't understand a single board can be re-run and thus might display data from the wrong run, or conflate data from the two runs?

If we change the version number, most of those problems go away.

What I suggest is a way to invoke "master-release" (or its future replacement) with a whitelist of children to include. That way you'll do an all-new build (with a new version number), but only for the needed boards instead of everything.

When rebuilding that way, you WILL pick up new changes merged to the relevant branch.
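The whitelist idea above can be sketched minimally; the builder names and selection helper here are hypothetical, not part of any existing tool:

```python
# Minimal sketch: re-run the release master with a whitelist of child
# builders so only the failed boards are rescheduled.
ALL_CHILDREN = ["eve-release", "kevin-release", "caroline-release"]

def select_children(whitelist=None):
    """With no whitelist, schedule every child; otherwise only the listed ones."""
    if whitelist is None:
        return list(ALL_CHILDREN)
    chosen = set(whitelist)
    return [child for child in ALL_CHILDREN if child in chosen]
```

Note the trade-off called out above: because this is an all-new build with a new version number, it also picks up any changes merged to the branch since the failed run.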

Comment 6 by bhthompson@google.com, Jan 16 (6 days ago)

The problem with building a new set is that they will have new changes in them, which may require retesting.

I think nuking existing (e.g. partially created) artifacts for a case like this is probably the more correct thing to do, but that might have ramifications I can't think of offhand.

Today we do this by making a new branch, and building that branch, which gets us a new version number and binary with no changes, but this is a bit heavy.

Comment 7 by dgarr...@chromium.org, Jan 16 (6 days ago)

Cc: evanhernandez@chromium.org
The problem with nuking is that you have to nuke the downstream consumers, not just the files themselves.

And maybe some tools make (reasonable) assumptions about the child build being triggered by the same master as the other child builds.

Maybe the stabilize branch won't be as big a deal once the new "cros branch" tool is ready?

Comment 8 by dgarr...@chromium.org, Jan 16 (6 days ago)

Also... if you are on a release branch, can't you know in a robust way if there are any new changes in the branch? In theory, don't TPMs have to approve all merges to the branch?

Comment 9 by bhthompson@google.com, Jan 16 (6 days ago)

We can know if there are changes, but removing them is also awkward.

While we review all the changes, we are not the ones actually performing the merges, so the merges may happen at varying times. Also, we bring in Chrome on a daily basis, and Chrome has its own merge process that might bring in other stuff.

Maybe a rebuild tool could incorporate the new supported branch utility to automatically determine whether branching is safe and possible (no conflicting branch exists), then cut the branch and trigger a build. It still feels awkward that we cannot retry in place, though...

Comment 10 by dgarr...@chromium.org, Jan 16 (6 days ago)

Cc: athilenius@chromium.org la...@chromium.org
I understand.

When release builds are migrated to recipes, we could redo how artifacts are published so that the release number isn't a key part of the path. Doing that (and keeping this use case in mind when updating downstream tooling) could make a simple rebuild much more straightforward and robust.
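A sketch of what that path change might look like, assuming artifacts were keyed by a unique build id instead of the version; all names here (bucket, helper, build ids) are hypothetical:

```python
# Sketch: re-runs land under fresh per-build paths, and a small index maps
# version -> most recent successful build id so downstream tools can still
# find "the" artifacts for a given version.
def record_build(index: dict, version: str, build_id: str) -> str:
    index[version] = build_id            # latest successful run wins
    return f"gs://example-bucket/builds/{build_id}/"

index = {}
record_build(index, "R72-11316.0.0", "build-001")  # first run's artifacts stay put
record_build(index, "R72-11316.0.0", "build-002")  # re-run publishes elsewhere
assert index["R72-11316.0.0"] == "build-002"
```

The cost, as noted in the comment, is that a human can no longer guess an artifact path from the version alone.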

That's longer term, and it DOES make manual exploration of build artifacts much harder.

Comment 11 by djmm@google.com, Jan 16 (6 days ago)

I do not profess to know the inner workings of the current build system, but are the artifacts not atomically published only when everything passes?

Or do we publish as we go, so that a failure along the way leaves us in a committed state in terms of that particular build?

Comment 12 by dgarr...@chromium.org, Jan 16 (6 days ago)

We publish as we go, and on failure, you are left in an undefined state that may have interacted with more systems.

Comment 13 by djmm@google.com, Jan 16 (6 days ago)

Can we make building and artifact publishing an atomic operation?  That's probably a bigger-scoped problem; we could file another issue and block this one on it.  Certainly, having a truly atomic result would give all sorts of wins across infra.
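One conventional pattern for approximating atomic publishing, sketched under the assumption that consumers agree to ignore any run without a completion marker (the marker name and dict-as-store are illustrative only):

```python
# Upload every artifact under the run's prefix first, and write a
# COMPLETE marker only as the final step; readers that filter on the
# marker never observe a half-published build.
def publish(store: dict, prefix: str, artifacts: dict) -> None:
    for name, data in artifacts.items():
        store[f"{prefix}/{name}"] = data
    store[f"{prefix}/COMPLETE"] = b""    # written last, after all artifacts

def completed_runs(store: dict) -> set:
    return {key.rsplit("/", 1)[0] for key in store if key.endswith("/COMPLETE")}
```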

Comment 14 by dgarr...@chromium.org, Jan 16 (6 days ago)

That's a much, much larger change than it sounds like.

Artifacts are published to chromeos-image-archive to help with GE communication mid-build, and to allow lab testing (the lab has to download the images to test against from somewhere).

Artifacts are uploaded to chromeos-releases to allow image signing, and payload generation (which happens in-place).

The stages doing those uploads are running in parallel with other stages for performance reasons, and those other stages are often still generating things.

Comment 15 by dgarr...@chromium.org, Jan 16 (6 days ago)

I believe the eventual plan is to break up our builds into more distinct units (build / test / release generation) which would make distinct uploads for each piece much more straight forward, but that's pretty far down the road.

Comment 16 by djmm@google.com, Jan 16 (6 days ago)

I kinda figured that was the case, but I have to ask!  Thank you for detailing how these builds-in-progress are being used.  If atomic publishing of results is not a realistic goal in the short term, burning a build number isn't the end of the world, though it does make things a bit more complex.  

If there were a way for an automatic fail/retry mechanism to use the new 'cros branch' tooling to Just Make An Image, that would be excellent.
At the end of the day, we just want a complete set of release builds, avoiding the seemingly endless problems around transient failures: tracking them down, missing schedules, and having a human trigger a rebuild of the same.

Comment 17 by djmm@google.com, Jan 17 (5 days ago)

Another data point: release announcements will mention two or more versions of ChromeOS in public blog posts, and the obvious question might be, why are there multiple versions released?  What's different between .0 and .1?

The answer is, of course, nothing.  We just need to rebuild due to transient failures, and we have no way of retrying/rebuilding the original version.
