
Issue 657224 link

Starred by 4 users

Issue metadata

Status: Duplicate
Merged: issue 829522
Owner:
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug

Blocked on:
issue 660059
issue 671379
issue 684887
issue 709287




Need a robust way to invoke pooled builders for real production builds.

Project Member Reported by hungte@chromium.org, Oct 19 2016

Issue description

The original story is "Factory branches have long latency before a new build can be started."

What steps will reproduce the problem?
(1) Commit a change to factory branches, or manually kickoff a build

What is the expected output?
The build process should start soon, and after a normal build time we should be able to see the output.

What do you see instead?
Some factory branches (for instance, strago, rambi, ...) are shared by multiple boards (15~20), and whenever one of the boards submits a change in its own board overlay, the builder tries to build new images for ALL boards, which exhausts all buildbot resources allocated for factory branches.

In the end, we usually need to wait for hours, even several days, before the expected build can be processed -- and probably 90% of buildbot resources are wasted creating images that most boards don't care about (since the build was kicked off by some private changes that only affect one board).

This is especially a problem when a project build is ongoing and the team needs a hotfix build to come out within a few hours.

A related discussion started while we were doing Ryu builds, when we needed new official kernel builds very often.

dnj@ and bhthompson@ at that time recommended using trybots:
 ./cbuildbot/cbuildbot --remote --buildbot --branch=${BRANCH_TO_BUILD} ${BUILD_CONFIGS}

However, issues were found at that time:
 - version number of new kernel didn't increase
 - no new entries on Golden Eye.

Attaching discussion at that time:

---

Bernie>

it seems to have built a 7265.57 which is one newer than what is in Goldeneye, so it is uprevving the build revision. The uprev step shows it marked a new chromeos-kernel-3_18 as stable, so it is trying to uprev it at least. It is possible this is a byproduct of some other bug like https://code.google.com/p/chromium/issues/detail?id=536857 ?

The results exist in the right place, it uploaded 'signed' artifacts to gs://chromeos-releases/canary-channel/smaug/7265.57.0/ but I am not sure why Goldeneye did not pick them up.

I have an example of this working off of a release branch at chromiumos.tryserver/builders/release/builds/863 for the Cyan FSI for comparison if that helps, this one was detected by Goldeneye.

hungte>

No it doesn't.

In the end I still had to wait for the factory branch builder to finish the build, which generated the real 7265.57 and updated Goldeneye.

Seems like I can build the smaug-release multiple times but it just won't increase version. :(

Bernie>

Weird, some of the uprev dependencies on the branch builders are a mystery to me, I think there may be some races/bugs.

In any case for Smaug this should not be an issue for much longer I hope. 

---

The story of Ryu ends. However, we're still seeing similar problems for other projects. What we need for Chromebooks includes:

 - We need factory_install image to be signed.
 - We need a way for partners to download the files. Currently the partner-eng team has set up a service on CPFE that does a refcopy from gs://chromeos-release to CPFE's own bucket and allows partners to download, but it does not support trybot.
 - We need a way to have a "formal release". Trybots do not preserve manifest info so it's hard to track what has been changed and what changes were merged in the build.

According to Bernie and dnj, --buildbot should do the right thing. Quoting Bernie's words:
Bernie> "When you run a trybot with --buildbot it should act just like the official builder, generating a new Chrome OS version, signing, CPFE publication, etc. We occasionally use this to fill in holes in releases as described in [wiki]/chronos-download/pmo/cros-releasetasks/stabilize

dnj> For all practical purposes, a trybot run *can* be the same as any other waterfall run. It's possible that tryjob default flags are disabling uprevving; we might have to find a way to not send those flags.

What I don't know is why the case linked earlier did not appear to properly uprev the kernel, and my assumption is that something about how the factory builders expect things to happen with the pre-flight builder was not being met by the trybot. It is quite possible that the trybot with --buildbot works more correctly on some branches than others."

Alternative solutions I've proposed include
 - Boards sharing the same branch should be allocated to the same build machine only, so changes in other branches will have a better chance to get a build started
 - Or should we reserve one machine for "builds that are manually kicked off"?
 - (Bernie) Leveraging GCE based builders, and long term we might even have them allocated dynamically
 - (Bernie) Having a manual only builder waiting is reasonable, but the thinking was that the trybot would be able to serve such a purpose.

dgarrett>
Currently, we modify builds on the trybot waterfall to be a bit different, on the assumption that they are used to try things out, not for production work. For example, they always get invoked with --debug.
It's starting to sound like we need a robust way to invoke pooled builders for real production builds.
If someone can please file a bug to me, I'll try to come up with a clean suggestion for how to do it. (Josafat also asked for this yesterday).
 

Comment 1 Deleted

Cc: akes...@chromium.org aaboagye@chromium.org nxia@chromium.org chingcodes@chromium.org
Yeah, we seem to have a general need here.

Floating one idea for how to solve this:

1) Use a different commandline to invoke "production" build tryjobs, so we can avoid the tryjob munge.
2) Stock up on tryjob builders, with emphasis on using GCE bots.
3) Make sure production tryjob masters correctly spawn all necessary slaves on the tryjob waterfall (via buildbucket).

Notes:

A) This is looking a lot like a swarming system with a general pool of builders.
B) It would be a lot easier if none of these jobs ran VMTests, or if GCE supported VMTests.
C) It would be a lot cheaper if we could spin the slaves up and down based on need.

Comment 3 by d...@chromium.org, Oct 19 2016

Do we need master/slave builders for this particular task? It looks like the factory builders in question here are finite and known in advance. In that case, dispatching a handful of targeted tryjobs should be easy.

One problem with the system that you're proposing is that it would require changes to cbuildbot, which would then have to be ported to all branches requiring this support. That seems very non-trivial at first glance.
Notes:

D) Except for UI issues, we could run all builds here, other than ChromeOS and ChromiumOS waterfalls.
E) With UI solved, and strong build slave affinity (for performance) we could run all of our builds this way.

Sorry, I've got a similar request from Josafat for special branch builds, and master/slave support would make that easier. So far, using the release waterfall hasn't been working well for that.

You make a good point about branches though.

Comment 6 by hungte@chromium.org, Oct 19 2016

> It would be a lot easier if non of these jobs ran VMTests, or if GCE supported VMTests.

 At least factory branch builds don't need VMTests :)

Comment 7 by hungte@chromium.org, Oct 27 2016

Do need a factory build today so I tried again:

 ./bin/cbuildbot -b factory-gru-8652.B --buildbot kevin-factory --remote
 
Let's see how it works this time.

Comment 8 by hungte@chromium.org, Oct 27 2016

... ok it failed.

 /b/build/slave/etc/build/chromite/third_party/google/protobuf/__init__.py:37: UserWarning: Module simplejson was already imported from /b/build/third_party/simplejson/__init__.pyc, but /usr/local/lib/python2.7/dist-packages is being added to sys.path
  __import__('pkg_resources').declare_namespace(__name__)
00:43:33: ERROR: No such configuraton target: "kevin-factory".

The 'kevin-factory' is a config that only lives in the factory branch (factory-gru-8652.B). How do I tell trybot to use the chromite there?
I am afraid for this to work ToT chromite has to have some version of the config :(. 

So we could land such configs in ToT I guess?

Comment 10 by d...@chromium.org, Oct 27 2016

Nope, you can definitely build a branch; you just have to supply cbuildbot with the "-b" option:

-b BRANCH, --branch=BRANCH
                        The manifest branch to test.  The branch to check the
                        buildroot out to.
IIRC, it will check out the branch version of chromite for actual execution, but if a version of the config does not exist on ToT, it will fail to get to the point it checks out the branched version of chromite.

Comment 12 by d...@chromium.org, Oct 27 2016

Hmm, I think this very issue came up recently in the last month or two and was fixed on ToT.

Comment 13 by d...@chromium.org, Oct 27 2016

Nevermind, I was thinking of flags that were not known to ToT but are known to previous versions.

Comment 14 by d...@chromium.org, Oct 27 2016

Blockedon: 660059

Comment 15 by d...@chromium.org, Oct 27 2016

nxia@, another option would be to have cbuildbot, when executed with "--remote", specify the "cbb_branch" property to BuildBucket, causing the initial Chromite checkout to happen at that branch instead of ToT. WDYT?
No, it's an ongoing issue that I'd love to fix. But it's hard, since we use the build config to decide how to sync. Since you have to sync before you get the branched code.... yeah.
> So we could land such configs in ToT I guess?

 That means we'll have to land a bunch of factory configs, from every factory branch; or we'll need to add each factory config that "we may want" into ToT...

Comment 18 by d...@chromium.org, Oct 28 2016

I think the blocking bug or #15 are clearly the way to go. Either check out the branch you want, same as the branch builder, or enable ToT's cbuildbot to bootstrap a config it doesn't immediately own. From comments in the blocking bug, dgarrett@ doesn't think the latter is trivial, so I recommend #15.
I think my pre-bootstrap suggestion in crbug.com/660059#c4 is the best approach for solving the branch/config problem.

Comment 20 by d...@chromium.org, Oct 28 2016

The pre-bootstrap suggestion is (I think) what I am advocating for in #15, basically. I think the simplest way to do this would be to have "remote_try.py" forward the "cbb_branch" parameter when scheduling through BuildBucket. Something like: https://chromium-review.googlesource.com/405013
I was thinking of putting a script in chromite that is invoked instead of cbuildbot, with the same arguments that we currently pass to cbuildbot.

It could then discover the branch (if specified), and any CLs to patch in (if specified), or even add the patching feature over time without affecting the API.

My secret agenda is to split our bootstrap out into a different script to make it easier to follow, but that's unlikely to ever really happen.

Comment 22 by nxia@chromium.org, Nov 1 2016

dnj's proposal will work on the new branches. we would need to resolve crbug.com/660059 first to use dnj's proposal on the old branches (which don't have buildbucket supported). 

If the pre-bootstrap script is the solution we want for crbug.com/660059, we can consolidate the logic and let pre-bootstrap handle the branch for us. 
So.... which way do you want to move forward with this?

Comment 24 by nxia@chromium.org, Nov 1 2016

The pre-bootstrap script adds complexity, but lgtm. It involves changes in the chromite recipe: 1) always check out the ToT chromite branch, 2) invoke pre-bootstrap instead of the chromite/bin/cbuildbot script. dnj@, if there are no other corner cases we missed, shall we move on with the pre-bootstrap script dgarrett@ proposed?
Strawman implementation: https://chromium-review.googlesource.com/#/c/406568/

Whoever we want can take it over and improve it as needed.
Blockedon: 671379
Project Member

Comment 27 by bugdroid1@chromium.org, Jan 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/c4114cc39bea9b9a6cf04bea91f24851ee20414f

commit c4114cc39bea9b9a6cf04bea91f24851ee20414f
Author: Don Garrett <dgarrett@google.com>
Date: Wed Nov 02 03:04:06 2016

bootstrap: Create script to launch cbuildbot on branch.

This script is invoked with the normal cbuildbot command line options,
but if cbuildbot is passed a branch to use, it will checkout chromite
on the branch before launching cbuildbot, otherwise, it runs cbuildbot
on a clean master branch.

This allows us to move from TOT to any branch invisibly.

One caveat, when building with a pinned manifest, it's still important
to specify the chromite branch on the cbuildbot command line.

BUG= chromium:657224 
TEST=bootstrap --buildbot --debug --buildroot <root> master-paladin
     bootstrap --buildbot --debug --buildroot <root> --branch release-R56-9000.B master-paladin
     bootstrap foo; echo $? (to verify non-zero exit code returned)

Change-Id: Idf8c6e4267c3ff9f4db425705a9c98676bf1d758
Reviewed-on: https://chromium-review.googlesource.com/406568
Commit-Ready: Don Garrett <dgarrett@chromium.org>
Tested-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[add] https://crrev.com/c4114cc39bea9b9a6cf04bea91f24851ee20414f/scripts/bootstrap_unittest.py
[add] https://crrev.com/c4114cc39bea9b9a6cf04bea91f24851ee20414f/scripts/bootstrap
[add] https://crrev.com/c4114cc39bea9b9a6cf04bea91f24851ee20414f/scripts/bootstrap.py
[add] https://crrev.com/c4114cc39bea9b9a6cf04bea91f24851ee20414f/scripts/bootstrap_unittest
[modify] https://crrev.com/c4114cc39bea9b9a6cf04bea91f24851ee20414f/lib/git.py
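
The branch selection described in the commit message above can be sketched roughly as follows. This is only an illustration, not the contents of scripts/bootstrap.py: the helper names `find_branch` and `plan_launch`, the `git checkout` step, and the `./bin/cbuildbot` path are assumptions chosen for the example. The only behavior taken from the commit message is "use the branch passed to cbuildbot if there is one, otherwise use master, then run cbuildbot with the original arguments".

```python
import argparse


def find_branch(argv, default="master"):
    """Return the branch named by -b/--branch, or `default` if none is given.

    Hypothetical helper: only the branch option is parsed here; every other
    cbuildbot option is ignored and left for cbuildbot itself to interpret.
    """
    # allow_abbrev=False so an unrelated long option is never treated as a
    # shortened form of --branch.
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("-b", "--branch", default=default)
    known, _unparsed = parser.parse_known_args(argv)
    return known.branch


def plan_launch(argv):
    """Return the commands a bootstrap wrapper might run, in order.

    Both commands are assumptions for illustration; nothing is executed.
    """
    branch = find_branch(argv)
    return [
        ["git", "checkout", branch],       # put chromite on the requested branch
        ["./bin/cbuildbot"] + list(argv),  # then hand over the unchanged arguments
    ]


if __name__ == "__main__":
    args = ["--buildbot", "--debug", "--branch", "release-R56-9000.B", "master-paladin"]
    for command in plan_launch(args):
        print(" ".join(command))
```

Passing the argument list through unchanged is what lets the wrapper stand in for cbuildbot without the caller noticing the difference.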


Tried today.

 $ cd ~/chromiumos/chromite/bin
 $ git checkout -b cros/factory-reef-8811.B # so reef-factory is available

 $ ./cbuildbot --remote -b factory-reef-8811.B reef-factory
 https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/etc/builds/1029/steps/cbuildbot%20%5Breef-factory%5D/logs/stdio
  ERROR: No such configuraton target: "reef-factory".
 
 $ ./cbuildbot --remote --buildbot -b factory-reef-8811.B reef-factory
 https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/etc/builds/1030/steps/cbuildbot%20%5Breef-factory%5D/logs/stdio
  ERROR: No such configuraton target: "reef-factory".

Do I need to add any special params so buildbot would know that it needs to launch on branches? Or do I need to cherry-pick 406568 to branches that I want to build?
The bootstrap script has landed, but the builders aren't using it yet. This is a buildbot recipe change. After that's in place and working, this bug will be solved.
Also, this is a significant change, so there is every chance of issues when it goes into production.
Status: Started (was: Assigned)
Blockedon: 684887
Blockedon: 709287
We need a reef factory build today and found that the factory builders are again occupied by rambi and strago family projects because of one change (hwid?) in the rambi / strago branch.

If enabling trybot is still a long way off, can we get some quick hacks, for example "builds using the same branch should be grouped onto the same builder"?
Meanwhile, is there a way to prevent builders from starting when they only see changes in platform/chromeos-hwid?

I think most old builders started only because of changes in the HWID repo, which is really not a reason to kick off a new full build...
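
The filter being asked for could look something like the sketch below. It is not an existing buildbot or chromite feature: `IGNORED_PROJECTS` and `should_trigger_build` are hypothetical names, and the only project path taken from this thread is platform/chromeos-hwid.

```python
# Projects whose changes alone should not start a full factory build.
# Hypothetical setting; only the HWID repo is mentioned in this thread.
IGNORED_PROJECTS = frozenset({"platform/chromeos-hwid"})


def should_trigger_build(changed_projects, ignored=IGNORED_PROJECTS):
    """Return True if at least one changed project is outside the ignore list.

    An empty change list returns False; manually kicked-off builds are
    assumed to bypass this check entirely.
    """
    return bool(set(changed_projects) - set(ignored))


if __name__ == "__main__":
    print(should_trigger_build(["platform/chromeos-hwid"]))
    print(should_trigger_build(["platform/chromeos-hwid", "overlays/board-overlays"]))
```

The second project path in the usage lines is only a placeholder for "some other repository".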
We're working towards "swarming" builds which will basically turn all builds into tryjobs that run against a single, very large pool of builders. The idea is that the huge unused capacity of our idle builders will allow us to absorb spikes of load much more easily.

However, that won't be ready until the end of Q2 at the earliest.


In the meantime, it looks like factory builds tend to take 1-2 hours. Since most of our families aren't very big, I wouldn't expect scheduling a new build to cause more than an hour or two of delay.

Is that enough to be a real problem?
Re#39

  Reef factory preflight: 44min
  Reef factory: 3 hrs, 48 mins

  Strago factory preflight: 1hr 26 mins
  Strago factory celes: 2 hrs, 48 mins, 47 secs

So factory full build takes 3~4 hours.

 Strago family: 14 builders
 Rambi family: 9+13 builders (two branches)
 Veyron: 10 builders

There are 6 builders, so if two CLs are merged and trigger one rambi branch plus strago to start,
that will take (13+14)*3.5 / 6 = 15.75, i.e. at least 15 hours, to finish one round.
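
The estimate above can be checked with a few lines of Python. The inputs come from this comment (13 rambi builds on one branch, 14 strago builds, 6 builders); 3.5 hours is the midpoint of the 3~4 hour range quoted above, and the function names are just for illustration. The second function is an added assumption: it models builds running in whole "waves" of at most 6, which gives a slightly larger number than the evenly divided lower bound.

```python
import math


def ideal_round_hours(num_builds, hours_per_build, num_builders):
    """Lower bound: assumes the work divides perfectly evenly across builders."""
    return num_builds * hours_per_build / num_builders


def wave_round_hours(num_builds, hours_per_build, num_builders):
    """Builds run in whole waves of at most num_builders at a time."""
    return math.ceil(num_builds / num_builders) * hours_per_build


if __name__ == "__main__":
    builds = 13 + 14  # one rambi branch plus strago
    print(ideal_round_hours(builds, 3.5, 6))  # 15.75
    print(wave_round_hours(builds, 3.5, 6))   # 17.5
```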

Maybe we should consider changing the old, huge families (strago, rambi, veyron) to rebuild only on explicit demand.

That seems reasonable to me, and easily done.
Status: Available (was: Started)
Mergedinto: 829522
Status: Duplicate (was: Available)
This will be handled by the transition to swarming for this waterfall.
