Need a robust way to invoke pooled builders for real production builds.
Issue description
The original story is "Factory branches have long latency before a new build can be started."
What steps will reproduce the problem?
(1) Commit a change to factory branches, or manually kick off a build
What is the expected output?
The build process should start soon, and after a normal build time we should be able to see the output.
What do you see instead?
Some factory branches (for instance strago, rambi, ...) are shared by multiple boards (15~20), and whenever one of those boards submits a change in its own board overlay, the builder tries to build new images for ALL boards, which exhausts all the buildbot resources allocated for factory branches.
In the end, we usually need to wait for hours, or even several days, before the expected build can be processed, and probably 90% of the buildbot time is wasted creating images that most boards don't care about (since the build was kicked off by some private change that only affects one board).
This is especially a problem when a project has an ongoing build, needs a hotfix, and expects the build to come out in a few hours.
A related discussion started when we were doing Ryu builds and needed new official kernel builds very often.
dnj@ and bhthompson@ at that time recommended using trybots:
./cbuildbot/cbuildbot --remote --buildbot --branch=${BRANCH_TO_BUILD} ${BUILD_CONFIGS}
However, issues were found at the time:
- the version number of the new kernel didn't increase
- no new entries on Goldeneye.
Attaching discussion at that time:
---
Bernie>
it seems to have built a 7265.57 which is one newer than what is in Goldeneye, so it is uprevving the build revision. The uprev step shows it marked a new chromeos-kernel-3_18 as stable, so it is trying to uprev it at least. It is possible this is a byproduct of some other bug like https://code.google.com/p/chromium/issues/detail?id=536857 ?
The results exist in the right place, it uploaded 'signed' artifacts to gs://chromeos-releases/canary-channel/smaug/7265.57.0/ but I am not sure why Goldeneye did not pick them up.
I have an example of this working off of a release branch at chromiumos.tryserver/builders/release/builds/863 for the Cyan FSI for comparison if that helps, this one was detected by Goldeneye.
hungte>
No, it doesn't.
In the end I still had to wait for the factory branch builder to finish the build, which generated the real 7265.57 and updated Goldeneye.
It seems I can build smaug-release multiple times, but it just won't increase the version. :(
Bernie>
Weird, some of the uprev dependencies on the branch builders are a mystery to me, I think there may be some races/bugs.
In any case for Smaug this should not be an issue for much longer I hope.
---
The story of Ryu ends there. However, we're still seeing similar problems for other projects. What we need for Chromebooks includes:
- We need the factory_install image to be signed.
- We need a way for partners to download the files. Currently the partner-eng team has set up a service on CPFE that does a refcopy from gs://chromeos-release to CPFE's own bucket and allows partners to download, but it does not support trybot builds.
- We need a way to have a "formal release". Trybots do not preserve manifest info, so it's hard to track what has changed and which changes were merged into the build.
According to Bernie and dnj, running cbuildbot with --buildbot should do the right thing. Quoting Bernie's words:
Bernie> "When you run a trybot with --buildbot it should act just like the official builder, generating a new Chrome OS version, signing, CPFE publication, etc. We use occasionally this to fill in holes in releases as described in [wiki]/chronos-download/pmo/cros-releasetasks/stabilize
dnj> For all practical purposes, a trybot run *can* be the same as any other waterfall run. It's possible that tryjob default flags are disabling uprevving; we might have to find a way to not send those flags.
What I don't know is why the case linked earlier did not appear to properly uprev the kernel, and my assumption is that something about how the factory builders expect things to happen with the pre-flight builder was not being met by the trybot. It is quite possible that the trybot with --buildbot works more correctly on some branches than others."
Alternative solutions I've proposed include:
- Boards sharing the same branch should be allocated to the same build machine only, so changes in other branches will have a better chance of getting a build started.
- Or should we reserve one machine for builds that are manually kicked off?
- (Bernie) Leveraging GCE based builders, and long term we might even have them allocated dynamically
- (Bernie) Having a manual only builder waiting is reasonable, but the thinking was that the trybot would be able to serve such a purpose.
dgarrett>
Currently, we modify builds on the trybot waterfall to be a bit different, on the assumption that they are used to try things out, not for production work. For example, they always get invoked with --debug.
It's starting to sound like we need a robust way to invoke pooled builders for real production builds.
If someone can please file a bug and assign it to me, I'll try to come up with a clean suggestion for how to do it. (Josafat also asked for this yesterday.)
Oct 19 2016
Yeah, we seem to have a general need here. Floating one idea for how to solve this:
1) Use a different command line to invoke "production" build tryjobs, so we can avoid the tryjob munge.
2) Stock up on tryjob builders, with emphasis on using GCE bots.
3) Make sure production tryjob masters correctly spawn all necessary slaves on the tryjob waterfall (via buildbucket).
Notes:
A) This is looking a lot like a swarming system with a general pool of builders.
B) It would be a lot easier if none of these jobs ran VMTests, or if GCE supported VMTests.
C) It would be a lot cheaper if we could spin the slaves up and down based on need.
Oct 19 2016
Do we need master/slave builders for this particular task? It looks like the factory builders in question here are finite and known in advance. In that case, dispatching a handful of targeted tryjobs should be easy. One problem with the system that you're proposing is that it would require changes to cbuildbot, which would then have to be ported to all branches requiring this support. That seems very non-trivial at first glance.
Oct 19 2016
Notes:
D) Except for UI issues, we could run all builds here, other than the ChromeOS and ChromiumOS waterfalls.
E) With UI solved, and strong build slave affinity (for performance), we could run all of our builds this way.
Oct 19 2016
Sorry, I've got a similar request from Josafat for special branch builds, and master/slave support would make that easier. So far, using the release waterfall hasn't been working well for that. You make a good point about branches, though.
Oct 19 2016
> It would be a lot easier if none of these jobs ran VMTests, or if GCE supported VMTests.
At least factory branch builds don't need VMTests :)
Oct 27 2016
I do need a factory build today, so I tried again:
./bin/cbuildbot -b factory-gru-8652.B --buildbot kevin-factory --remote
Let's see how it works this time.
Oct 27 2016
... ok it failed.
/b/build/slave/etc/build/chromite/third_party/google/protobuf/__init__.py:37: UserWarning: Module simplejson was already imported from /b/build/third_party/simplejson/__init__.pyc, but /usr/local/lib/python2.7/dist-packages is being added to sys.path
__import__('pkg_resources').declare_namespace(__name__)
00:43:33: ERROR: No such configuraton target: "kevin-factory".
The 'kevin-factory' config only lives in the factory branch (factory-gru-8652.B). How do I tell the trybot to use the chromite there?
Oct 27 2016
I am afraid that for this to work, ToT chromite has to have some version of the config :( So we could land such configs in ToT, I guess?
Oct 27 2016
Nope, you can definitely build a branch; you just have to supply cbuildbot with the "-b" option:
-b BRANCH, --branch=BRANCH
The manifest branch to test. The branch to check the
buildroot out to.
Oct 27 2016
IIRC, it will check out the branch version of chromite for actual execution, but if a version of the config does not exist on ToT, it fails before it gets to the point where it checks out the branched version of chromite.
Oct 27 2016
Hmm, I think this very issue came up recently in the last month or two and was fixed on ToT.
Oct 27 2016
Nevermind, I was thinking of flags that were not known to ToT but are known to previous versions.
Oct 27 2016
nxia@, another option would be to have cbuildbot, when executed with "--remote", specify the "cbb_branch" property to BuildBucket, causing the initial Chromite checkout to happen at that branch instead of ToT. WDYT?
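For illustration only, such a property might be carried in the scheduling request roughly as follows (a hypothetical sketch; the request shape, bucket name, and the property names besides cbb_branch are my assumptions, not the actual remote_try.py payload):

import json

def build_put_request(builder, branch, bucket='master.chromiumos.tryserver'):
    # Sketch: pass the branch through as a build property so the initial
    # chromite checkout can happen on that branch instead of ToT.
    parameters = {
        'builder_name': builder,
        'properties': {
            'cbb_branch': branch,   # e.g. 'factory-gru-8652.B'
            'cbb_config': builder,  # e.g. 'kevin-factory'
        },
    }
    return {
        'bucket': bucket,
        'parameters_json': json.dumps(parameters),
    }

# Example:
#   build_put_request('kevin-factory', 'factory-gru-8652.B')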
Oct 27 2016
No, it's an ongoing issue that I'd love to fix. But it's hard, since we use the build config to decide how to sync. Since you have to sync before you get the branched code.... yeah.
Oct 27 2016
> So we could land such configs in ToT I guess?
That means we'll have to land a bunch of factory configs from every factory branch, or we'll need to add each factory config that "we may want" into ToT...
Oct 28 2016
I think the blocking bug or #15 are clearly the way to go. Either check out the branch you want, same as the branch builder, or enable ToT's cbuildbot to bootstrap a config it doesn't immediately own. From comments in the blocking bug, dgarrett@ doesn't think the latter is trivial, so I recommend #15.
Oct 28 2016
I think my pre-bootstrap suggestion in crbug.com/660059#c4 is the best approach for solving the branch/config problem.
Oct 28 2016
The pre-bootstrap suggestion is (I think) what I am advocating for in #15, basically. I think the simplest way to do this would be to have "remote_try.py" forward the "cbb_branch" parameter when scheduling through BuildBucket. Something like: https://chromium-review.googlesource.com/405013
Oct 28 2016
I was thinking of putting a script in chromite that is invoked instead of cbuildbot, with the same arguments that we currently pass to cbuildbot. It could then discover the branch (if specified) and any CLs to patch in (if specified), or even add the patching feature over time without affecting the API. My secret agenda is to split our bootstrap out into a different script to make it easier to follow, but that's unlikely to ever really happen.
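To make that concrete, a rough, hypothetical sketch of such a wrapper (this is not the eventual bootstrap script; the paths, argument handling, and the 'cros' remote name are assumptions, and re-executing from the branched checkout is glossed over):

#!/usr/bin/env python
# Hypothetical pre-bootstrap sketch: accept the same command line as
# cbuildbot; if a branch is given, check chromite out at that branch before
# handing off, so branch-only configs (e.g. kevin-factory) can resolve.
import os
import subprocess
import sys


def parse_branch(argv):
    # Pull the branch out of cbuildbot-style arguments, if present.
    for i, arg in enumerate(argv):
        if arg in ('-b', '--branch') and i + 1 < len(argv):
            return argv[i + 1]
        if arg.startswith('--branch='):
            return arg.split('=', 1)[1]
    return None


def main(argv):
    # Assumption: this script lives one level below the chromite root.
    chromite_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    branch = parse_branch(argv)
    if branch:
        # Assumption: the branch exists on the 'cros' remote of this checkout.
        subprocess.check_call(
            ['git', 'checkout', 'cros/%s' % branch], cwd=chromite_dir)
    # Hand off to the (possibly now branched) cbuildbot with the same args.
    cbuildbot = os.path.join(chromite_dir, 'bin', 'cbuildbot')
    return subprocess.call([cbuildbot] + argv)


if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))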
Nov 1 2016
dnj's proposal will work on the new branches. We would need to resolve crbug.com/660059 first to use dnj's proposal on the old branches (which don't have buildbucket support). If the pre-bootstrap script is the solution we want for crbug.com/660059, we can consolidate the logic and let pre-bootstrap handle the branch for us.
Nov 1 2016
So.... which way do you want to move forward with this?
Nov 1 2016
The pre-bootstrap script adds complexity, but lgtm. It involves changes in the chromite recipe:
1) always check out the ToT chromite branch
2) invoke pre-bootstrap instead of the chromite/bin/cbuildbot script.
dnj@, if there are no other corner cases we missed, shall we move on with the pre-bootstrap script dgarrett@ proposed?
Nov 2 2016
Strawman implementation: https://chromium-review.googlesource.com/#/c/406568/ Whoever we want can take it over and improve it as needed.
Jan 6 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/c4114cc39bea9b9a6cf04bea91f24851ee20414f

commit c4114cc39bea9b9a6cf04bea91f24851ee20414f
Author: Don Garrett <dgarrett@google.com>
Date: Wed Nov 02 03:04:06 2016

bootstrap: Create script to launch cbuildbot on branch.

This script is invoked with the normal cbuildbot command line options, but if cbuildbot is passed a branch to use, it will checkout chromite on the branch before launching cbuildbot, otherwise, it runs cbuildbot on a clean master branch. This allows us to move from TOT to any branch invisibly.

One caveat, when building with a pinned manifest, it's still important to specify the chromite branch on the cbuildbot command line.

BUG=chromium:657224
TEST=bootstrap --buildbot --debug --buildroot <root> master-paladin
     bootstrap --buildbot --debug --buildroot <root> --branch release-R56-9000.B master-paladin
     bootstrap foo; echo $? (to verify non-zero exit code returned)

Change-Id: Idf8c6e4267c3ff9f4db425705a9c98676bf1d758
Reviewed-on: https://chromium-review.googlesource.com/406568
Commit-Ready: Don Garrett <dgarrett@chromium.org>
Tested-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[add] https://crrev.com/c4114cc39bea9b9a6cf04bea91f24851ee20414f/scripts/bootstrap_unittest.py
[add] https://crrev.com/c4114cc39bea9b9a6cf04bea91f24851ee20414f/scripts/bootstrap
[add] https://crrev.com/c4114cc39bea9b9a6cf04bea91f24851ee20414f/scripts/bootstrap.py
[add] https://crrev.com/c4114cc39bea9b9a6cf04bea91f24851ee20414f/scripts/bootstrap_unittest
[modify] https://crrev.com/c4114cc39bea9b9a6cf04bea91f24851ee20414f/lib/git.py
Jan 16 2017
Tried today.
$ cd ~/chromiumos/chromite/bin
$ git checkout -b cros/factory-reef-8811.B # so reef-factory is available
$ ./cbuildbot --remote -b factory-reef-8811.B reef-factory
https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/etc/builds/1029/steps/cbuildbot%20%5Breef-factory%5D/logs/stdio
ERROR: No such configuraton target: "reef-factory".
$ ./cbuildbot --remote --buildbot -b factory-reef-8811.B reef-factory
https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/etc/builds/1030/steps/cbuildbot%20%5Breef-factory%5D/logs/stdio
ERROR: No such configuraton target: "reef-factory".
Do I need to add any special params so buildbot would know that it needs to launch on branches? Or do I need to cherry-pick 406568 to the branches that I want to build?
Jan 17 2017
The bootstrap script has landed, but the builders aren't using it yet. This is a buildbot recipe change. After that's in place and working, this bug will be solved.
Jan 17 2017
Also, this is a significant change, so there is every chance of issues when it goes into production.
Apr 26 2017
We need a reef factory build today, and found that the factory builders are again occupied by rambi and strago family projects because of one change (hwid?) in a rambi / strago branch. If enabling trybots for this is still a long way off, can we get some quick hacks, for example "builds using the same branch should be grouped onto the same builder"?
Apr 26 2017
Meanwhile, is there a way to prevent builders from starting when they only see changes in platform/chromeos-hwid? I think most of the old builders started only because of changes in the HWID repo, which really shouldn't kick off a new full build...
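For illustration, something along these lines at the scheduler level could express that filter, assuming a stock buildbot setup (the scheduler name, branch, and exact repository path are assumptions, not our actual master config):

from buildbot.changes.filter import ChangeFilter
from buildbot.schedulers.basic import SingleBranchScheduler

def non_hwid_is_important(change):
    # Treat a change as build-worthy only if it is NOT from the HWID repo,
    # so HWID-only commits stop triggering full factory builds.
    return 'chromeos-hwid' not in (change.repository or '')

factory_scheduler = SingleBranchScheduler(
    name='factory-reef-8811.B',
    change_filter=ChangeFilter(branch='factory-reef-8811.B'),
    fileIsImportant=non_hwid_is_important,
    treeStableTimer=5 * 60,  # seconds
    builderNames=['reef-factory'],
)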
Apr 26 2017
We're working towards "swarming" builds which will basically turn all builds into tryjobs that run against a single, very large pool of builders. The idea is that the huge unused capacity of our idle builders will allow us to absorb spikes of load much more easily. However, that won't be ready until the end of Q2 at the earliest.
Apr 26 2017
In the meantime, it looks like factory builds tend to take 1-2 hours, and since most of our families aren't very big, I wouldn't expect scheduling a new build to cause more than an hour or two of delay. Is that enough to be a real problem?
Apr 27 2017
Re #39:
Reef factory preflight: 44 mins
Reef factory: 3 hrs 48 mins
Strago factory preflight: 1 hr 26 mins
Strago factory celes: 2 hrs 48 mins 47 secs
So a full factory build takes 3~4 hours.
Strago family: 14 builders
Rambi family: 9+13 builders (two branches)
Veyron: 10 builders
There are 6 builder machines, so if two CLs are merged and trigger one of the rambi branches plus strago to start, that will take (13+14)*3.5 / 6 = a minimum of ~15 hours to finish one round.
Maybe we should consider changing the old huge families (strago, rambi, veyron) to rebuild only on explicit demand.
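Spelling out that arithmetic (the ~3.5-hour average and the assumption that the 6 machines do nothing else are simplifications):

# Back-of-the-envelope backlog estimate for one rambi + strago round.
strago_builds = 14
rambi_builds = 13          # the larger of rambi's two branches
hours_per_build = 3.5      # rough average of the full build times above
pool_size = 6              # builder machines shared by the factory branches

total_build_hours = (strago_builds + rambi_builds) * hours_per_build
wall_clock_hours = total_build_hours / pool_size
print(wall_clock_hours)    # 94.5 / 6 = 15.75, i.e. roughly 15+ hours per round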
Apr 27 2017
That seems reasonable to me, and easily done.
Apr 5 2018
This will be handled by the transition to swarming for this waterfall.