Issue 659864

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 752196



revive cache generation in PublishUprevChanges stage

Project Member Reported by vapier@chromium.org, Oct 27 2016

Issue description

when the metadata cache files get out of date, emerge can get very slow processing updates because it runs `ebuild ... depend` on every ebuild, and currently does so in serial.
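
to make the cost concrete: with a stale cache, portage ends up doing the equivalent of the following, one ebuild at a time (the loop and overlay path here are purely illustrative, not what portage literally runs internally):

for e in src/third_party/chromiumos-overlay/some-category/*/*.ebuild; do
  ebuild "${e}" depend   # re-sources the ebuild just to extract its dependency metadata
done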

looks like these were the last times metadata/ updates were posted:

portage-stable/:
commit f21c9ee135e8b46016083f644c75ff115db6446b
Author:     chrome-bot <chrome-bot@chromium.org>
AuthorDate: Tue Oct 27 22:47:52 2015 -0700
Commit:     chrome-bot <chrome-bot@chromium.org>
CommitDate: Tue Oct 27 22:47:52 2015 -0700

chromeos-overlay/:
commit e9328b9242d3aff4102ccd7c3ff5431a6970e80c
Author:     chrome-bot <chrome-bot@chromium.org>
AuthorDate: Fri Oct 30 16:20:26 2015 -0700
Commit:     chrome-bot <chrome-bot@chromium.org>
CommitDate: Fri Oct 30 16:20:26 2015 -0700

chromiumos-overlay/:
commit ab478da4b8c943be5d1491918c74395b0798158d
Author:     chrome-bot <chrome-bot@chromium.org>
AuthorDate: Fri Oct 30 16:20:35 2015 -0700
Commit:     chrome-bot <chrome-bot@chromium.org>
CommitDate: Fri Oct 30 16:20:35 2015 -0700

the commits around that time look highly suspicious:
https://chromium.googlesource.com/chromiumos/chromite/+log/master/?s=09edad6e69fb57097299a2d346f916983a8c7ab6

09edad6e69fb 2015-10-30 19:13:49 -0700 Cleanup and reapply uprevs in PublishUprev in CQ.
1e46be8f4940 2015-10-30 13:46:02 -0700 cros_mark_as_stable: Checkout before syncing
faef721b371d 2015-10-25 08:59:58 -0700 Surface bugs or CLs responsible for build failures.

 
Cc: aaboagye@chromium.org
Cc: chingcodes@chromium.org
Owner: akes...@chromium.org
@ Aviv - I think we need an owner for this area. 

@ Chris - you've been asking about package management, wanna take a look? 
I wouldn't mind, have a few commitments that I need to meet first.
Labels: -Pri-1 Pri-2
Owner: chingcodes@chromium.org
Agreed, seems like a good area for Chris. This is important but not super urgent, considering it's been going on for a year.
Cc: -davidjames@chromium.org
Cc: snanda@chromium.org
Owner: ----
+snanda

I think this got dropped; however, it would make a good candidate for the latency task force.
Owner: snanda@chromium.org
snanda to find owner

Comment 8 by ihf@chromium.org, May 31 2017

Cc: ihf@chromium.org
Mike, does this slow down the official builders - and any idea by how much?

Comment 9 by vapier@chromium.org, May 31 2017

it slows down all emerge commands on bots and dev systems

let's look at the following command from a cold disk cache.  note that multiple emerge commands get run on a system from a clean install, so the overall times will be worse than this, though not linearly so.
$ emerge-$BOARD -epv virtual/target-os virtual/target-os-dev virtual/target-os-test

with an up-to-date cache, this takes ~16sec.  with the cache as it exists in the tree now, this takes almost 5min.

Comment 10 by ihf@chromium.org, May 31 2017

5 minutes is very large (it might get multiplied by up to 12 CQ runs if it happens every time). I went to look at a paladin build, but the one I looked at was incremental, hence "just" 37 seconds?
https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fveyron_mighty-paladin%2F5175%2F%2B%2Frecipes%2Fsteps%2FBuildPackages%2F0%2Fstdout
"Deps calculated in 0m37.4s"

But the log shows other startup overhead measured in minutes. And of course the real expense comes when there are failures and we need to do clean builds, which makes everything extra slow. So an argument could be made to attack the long pole even if the average is not so bad?
for incremental builds, it won't be as bad every time because they would have done the cache generation work themselves locally and portage saves that work automatically.  so each successive emerge command will get better as it incrementally fills out its local cache.  at that point, local cache invalidation would happen when an ebuild updates, or when an eclass changes (which would cascade to all ebuilds using it).

release bots start with a new sdk every time and thus an empty cache.

the other reason it's so painful is that portage will generate caches in serial.  egencache can do it in parallel.  there is probably opportunity here to improve portage too ...
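
for comparison, the parallel regen is roughly this (the repo name is a guess here; use whatever the overlay's profiles/repo_name says):

$ egencache --update --repo=chromiumos --jobs=$(nproc)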

it is hard to quantify overall perf hit here, but i think it does add up.
Our paladin builders wipe the chroot after every failed run, which clears this cache (or is it in the .cache directory?). That means improvements will have the most impact during repeated failures/outages.
the cache lives inside the chroot under /var/cache/edb/dep/, so every time the sdk gets recreated, the cache gets nuked.  i don't think we want to try to move the location of that.
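
(for anyone poking at this locally, these inspection commands inside the chroot show how big that cache is, and removing it reproduces the cold-cache timings above -- nothing chromite-specific here:)

$ du -sh /var/cache/edb/dep/
$ sudo rm -rf /var/cache/edb/dep/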

poking around cbuildbot a bit, it looks like we could mitigate things on the bot side via the RegenPortageCacheStage ... but it's skipped in all the configs i checked, presumably because it generates git commits.
Would it be clean/correct to regen and commit every time we commit ebuild uprevs?
we used to do exactly that in the cros_mark_as_stable script, but it was refactored significantly here:
  https://chromium-review.googlesource.com/275403
Cc: hidehiko@chromium.org
Re #13, could you comment on what the RegenPortageCache stage does, and how much time it would save us?
Blocking: 752196
i covered perf issues in #11

RegenPortageCache takes care of actually running the egencache tools.  i don't recall how the actual git commits were bundled up and pushed out.  prob should be left to the CQ master as part of its final uprev phase.
RegenPortageCache is already running on master-paladin. Is that not sufficient, or is it shortcutting whatever is the needed behavior? https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/15741

If it needs to run on the slaves and generate commits that get submitted there, then that would be problematic.
we want it running on all of them, but only the master should be pushing its commits.  it looks like the "push" part has been lost.

my guess is that the PublishUprevChanges stage is throwing away previous work, and we aren't re-running the generation stage there.  so maybe it's as simple as updating that stage to also re-run cache generation.
Labels: -Pri-2 Pri-1
Summary: revive cache generation in PublishUprevChanges stage (was: buildbots have stopped updating ebuild caches which slows all emerge commands down)
makes sense, mostly (but why "we want it running on all of them, but only the master should be pushing its commits."? what benefits do the slaves get from running it if only the master is pushing the results?)

Upgrading to P1 because I think this could have a large impact on CQ time (~several minutes)
> (but why "we want it running on all of them, but only the master should be pushing its commits."? what benefits do the slaves get from running it if only the master is pushing the results?)

the slaves are most likely going to use the caches when they build packages.  the dedicated cache generation step in chromite is much much faster than the automatic portage one -- we explicitly run it with multiple processors whereas portage will only utilize a single cpu.  arguably we should improve portage, but we might also be able to run the regen stage in parallel with other stages if we thought hard about it.

in the case where there's nothing to be done, the regen stage should be pretty quick since it's just comparing md5 hashes of ebuilds/eclasses.
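
(the hashes in question live right in the md5-cache entries themselves; the category/package below is just a placeholder:)

$ grep -E '^(_md5_|_eclasses_)=' metadata/md5-cache/some-category/some-pkg-1.0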
Owner: ddavenp...@chromium.org
Drew expressed some interest in this.
Status: Started (was: Untriaged)
Looks like #21 is what's happening.


Old behaviour:
- cros_mark_as_stable regenerated the portage cache
New behaviour:
- cros_mark_as_stable just does uprev, and RegenPortageCache just regenerates portage cache

The PublishUprevChanges step reverts all local changes and re-syncs before re-applying uprevs and then committing them. To re-apply uprevs, cros_mark_as_stable gets called, which no longer regenerates the portage cache.

The most straightforward thing to do here is to ensure PublishUprevChanges also does the same work that RegenPortageCache does. Perhaps PublishUprevChanges should also be renamed? It also updates binhost conf and will be updating the cache (once again) as well.
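
For reference, the command-level equivalent of the missing work is roughly the following (repo name, overlay path, and commit message are illustrative; the real implementation lives in chromite, not a shell script):

$ cd src/third_party/chromiumos-overlay
$ egencache --update --repo=chromiumos --jobs=$(nproc)
$ git add metadata/md5-cache
$ git commit -m "Regenerate portage md5 cache"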
"ProcessUprevChanges" ? :)  feel free to offer any other bikeshed names.

we should also make sure the stage's docstring enumerates everything it does, as that docstring shows up at the very top of the cbuildbot logs; it helps people see that overview up front.
Well, it may be that PublishUprevChanges is already descriptive enough for those who've spent longer on the project than I have :) PublishPackageChanges? PublishBuildArtifacts? None of these feel much better, though, and renaming stuff in python can be a bit of a pain. After further consideration I'd lean towards leaving it as-is unless there's something that's clearly better.

I'll be sure to update the docstring accordingly, either way.

Project Member Comment 29 by bugdroid1@chromium.org, Sep 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/5f3bfa02d3cd56136a8fde7a595194c8ccecb865

commit 5f3bfa02d3cd56136a8fde7a595194c8ccecb865
Author: Drew Davenport <ddavenport@chromium.org>
Date: Fri Sep 01 23:41:57 2017

cbuildbot: Restore portage cache regeneration

Also regenerate portage cache when reapplying the uprev in
PublishUprevChanges step after reverting all local changes.

BUG=chromium:659864
TEST=FEATURES=test emerge chromite
TEST=cbuildbot --remote --buildbot --debug master-paladin

Change-Id: Ie40a20ca4ad1753c9515b2000e4703f9d22e96fa
Reviewed-on: https://chromium-review.googlesource.com/644307
Commit-Ready: Drew Davenport <ddavenport@chromium.org>
Tested-by: Drew Davenport <ddavenport@chromium.org>
Reviewed-by: Drew Davenport <ddavenport@chromium.org>

[modify] https://crrev.com/5f3bfa02d3cd56136a8fde7a595194c8ccecb865/cbuildbot/stages/build_stages.py
[modify] https://crrev.com/5f3bfa02d3cd56136a8fde7a595194c8ccecb865/cbuildbot/stages/completion_stages.py
[modify] https://crrev.com/5f3bfa02d3cd56136a8fde7a595194c8ccecb865/cbuildbot/commands.py

md5 caches seem to be getting updated as expected.

After removing /var/cache/edb/dep:
> time emerge-$BOARD -epv virtual/target-os virtual/target-os-dev virtual/target-os-test

> real	0m35.169s
> user	0m19.192s
> sys	0m4.076s


Status: Fixed (was: Started)

Comment 32 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)

Comment 33 by dchan@chromium.org, Jan 23 2018

Status: Fixed (was: Archived)
