revive cache generation in PublishUprevChanges stage
Issue description
Comment 1 by dgarr...@chromium.org, Oct 27 2016

when the metadata cache files get out of date, emerge can get very slow processing updates because it runs `ebuild ... depend` on every ebuild, and currently does so in serial.

looks like the last time metadata/ updates were posted:

portage-stable/:
commit f21c9ee135e8b46016083f644c75ff115db6446b
Author: chrome-bot <chrome-bot@chromium.org>
AuthorDate: Tue Oct 27 22:47:52 2015 -0700
Commit: chrome-bot <chrome-bot@chromium.org>
CommitDate: Tue Oct 27 22:47:52 2015 -0700

chromeos-overlay/:
commit e9328b9242d3aff4102ccd7c3ff5431a6970e80c
Author: chrome-bot <chrome-bot@chromium.org>
AuthorDate: Fri Oct 30 16:20:26 2015 -0700
Commit: chrome-bot <chrome-bot@chromium.org>
CommitDate: Fri Oct 30 16:20:26 2015 -0700

chromiumos-overlay/:
commit ab478da4b8c943be5d1491918c74395b0798158d
Author: chrome-bot <chrome-bot@chromium.org>
AuthorDate: Fri Oct 30 16:20:35 2015 -0700
Commit: chrome-bot <chrome-bot@chromium.org>
CommitDate: Fri Oct 30 16:20:35 2015 -0700

the commits around that time look highly suspicious:
https://chromium.googlesource.com/chromiumos/chromite/+log/master/?s=09edad6e69fb57097299a2d346f916983a8c7ab6

09edad6e69fb 2015-10-30 19:13:49 -0700 Cleanup and reapply uprevs in PublishUprev in CQ.
1e46be8f4940 2015-10-30 13:46:02 -0700 cros_mark_as_stable: Checkout before syncing
faef721b371d 2015-10-25 08:59:58 -0700 Surface bugs or CLs responsible for build failures.
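For context, a minimal sketch of the serial path being described (the ebuild path and package name below are illustrative; emerge runs the equivalent internally, once per out-of-date ebuild):

$ ebuild .../chromiumos-overlay/sys-apps/some-pkg/some-pkg-1.0.ebuild depend

each call regenerates the dependency metadata for exactly one ebuild, one at a time, which is where the multi-minute emerge startup comes from when the shipped metadata/ cache is stale.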
,
Nov 1 2016
@Aviv - I think we need an owner for this area. @Chris - you've been asking about package management, wanna take a look?
,
Nov 2 2016
I wouldn't mind, but I have a few commitments that I need to meet first.
,
Nov 2 2016
Agreed, seems like a good area for Chris. This is important but not super urgent, considering it's been going on for a year.
,
Dec 22 2016
,
Apr 13 2017
+snanda I think this got dropped, but it would make a good candidate for the latency task force.
,
Apr 19 2017
snanda to find owner
,
May 31 2017
Mike, does this slow down the official builders - and any idea by how much?
,
May 31 2017
it slows down all emerge commands on bots and dev systems. let's look at the following command from a cold disk cache. keep in mind there are multiple emerge commands used on a system from a clean install, so the overall times will be worse than this, but not linearly so.

$ emerge-$BOARD -epv virtual/target-os virtual/target-os-dev virtual/target-os-test

with an up-to-date cache, this takes ~16sec. with the cache as it exists in the tree now, this takes almost 5min.
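For anyone reproducing the comparison, a rough sketch from inside the chroot (the /var/cache/edb/dep path is the local dep cache discussed further down the thread; only clear it on a throwaway chroot):

$ sudo rm -rf /var/cache/edb/dep   # drop the local dep cache so the next emerge is "cold"
$ time emerge-$BOARD -epv virtual/target-os virtual/target-os-dev virtual/target-os-test
$ time emerge-$BOARD -epv virtual/target-os virtual/target-os-dev virtual/target-os-test   # warm: portage saved its `ebuild ... depend` results on the first pass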
,
May 31 2017
5 minutes is very large (it might get multiplied by up to 12 CQ runs if it happens every time). I went to a paladin build; the one I looked at was incremental, hence "just" 37 seconds?

https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fveyron_mighty-paladin%2F5175%2F%2B%2Frecipes%2Fsteps%2FBuildPackages%2F0%2Fstdout
"Deps calculated in 0m37.4s"

But the log shows other startup overhead in the minutes. And of course the real expense is when there are failures and we need to do clean builds, which makes everything extra slow. So an argument could be made to attack the long pole even if the average is not so bad?
,
May 31 2017
for incremental builds, it won't be as bad every time because they would have done the cache generation work themselves locally and portage saves that work automatically. so each successive emerge command will get better as it incrementally fills out its local cache. at that point, local cache invalidation would happen when an ebuild updates, or when an eclass changes (which would cascade to all ebuilds using it).

release bots start with a new sdk every time and thus an empty cache.

the other reason it's so painful is that portage will generate caches in serial. egencache can do it in parallel. there is probably opportunity here to improve portage too ...

it is hard to quantify overall perf hit here, but i think it does add up.
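As a reference point, the parallel regeneration amounts to something like the following, assuming the overlay is registered as a portage repo (the repo name here is illustrative):

$ egencache --update --repo=chromiumos --jobs=$(nproc)   # fan metadata generation out across all CPUs

whereas the implicit path inside emerge walks the stale ebuilds serially on a single cpu.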
,
May 31 2017
Our paladin builders wipe the chroot after every failed run, which clears this cache (or is it in the .cache directory?). That means improvements will have the most impact during repeated failures/outages.
,
May 31 2017
the cache lives inside the chroot under /var/cache/edb/dep/, so every time the sdk gets recreated, the cache gets nuked. i don't think we want to try to move the location of that.

poking around cbuildbot a bit, it looks like we could mitigate things on the bot side via the RegenPortageCacheStage ... but it's skipped in all the configs i checked, presumably because it generates git commits.
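For the curious, a quick way to see the state of that local cache from inside the chroot:

$ ls /var/cache/edb/dep/                    # one directory tree per overlay
$ find /var/cache/edb/dep -type f | wc -l   # rough count of cached ebuild entries

since it lives inside the sdk, recreating the chroot resets that count to zero.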
,
May 31 2017
Would it be clean/correct to regen and commit every time we commit ebuild uprevs?
,
May 31 2017
we used to do exactly that in the cros_mark_as_stable script, but it was refactored significantly here: https://chromium-review.googlesource.com/275403
,
Jun 20 2017
,
Aug 9 2017
Re #13, could you comment on what the RegenPortageCache stage does, and how much time it would save us?
,
Aug 9 2017
,
Aug 9 2017
i covered perf issues in #11. RegenPortageCache takes care of actually running the egencache tools. i don't recall how the actual git commits were bundled up and pushed out. prob should be left to the CQ master as part of its final uprev phase.
,
Aug 15 2017
RegenPortageCache is already running on master-paladin. Is that not sufficient, or is it shortcutting whatever the needed behavior is?

https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/15741

If it needs to run on the slaves and generate commits that get submitted there, then that would be problematic.
,
Aug 15 2017
we want it running on all of them, but only the master should be pushing its commits. it looks like the "push" part has been lost. my guess is that the PublishUprevChanges stage is throwing away previous work, and we aren't re-running the generation stage there. so maybe it's as simple as updating that stage to also re-run cache generation.
,
Aug 15 2017
makes sense, mostly (but why "we want it running on all of them, but only the master should be pushing its commits"? what benefits do the slaves get from running it if only the master is pushing the results?)

Upgrading to P1 because I think this could have a large impact on CQ time (~several minutes).
,
Aug 15 2017
> (but why "we want it running on all of them, but only the master should be pushing its commits"? what benefits do the slaves get from running it if only the master is pushing the results?)

the slaves are most likely going to use the caches when they build packages. the dedicated cache generation step in chromite is much much faster than the automatic portage one -- we explicitly run it with multiple processors whereas portage will only utilize a single cpu. arguably we should improve portage, but we might also be able to run the regen stage in parallel with other stages if we thought hard about it.

in the case where there's nothing to be done, the regen stage should be pretty quick since it's just comparing md5 hashes of ebuilds/eclasses.
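For reference, this is roughly what those hash keys look like in a generated md5-cache entry (the path and values below are illustrative):

$ grep -E '^(_md5_|_eclasses_)=' .../chromiumos-overlay/metadata/md5-cache/sys-apps/some-pkg-1.0
_md5_=<md5 of the ebuild file itself>
_eclasses_=<name/md5 pairs for every inherited eclass>

if neither the ebuild hash nor any eclass hash has changed, the regen pass has nothing to rewrite for that entry, so a no-op run stays cheap.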
,
Aug 17 2017
Drew expressed some interest in this.
,
Aug 29 2017
,
Aug 30 2017
Looks like #21 is what's happening.

Old behaviour:
- cros_mark_as_stable regenerated the portage cache

New behaviour:
- cros_mark_as_stable just does the uprev, and RegenPortageCache regenerates the portage cache

The PublishUprevChanges step reverts all local changes and re-syncs before re-applying uprevs and then committing them. To re-apply uprevs, cros_mark_as_stable gets called, which no longer regenerates the portage cache.

The most straightforward thing to do here is to ensure PublishUprevChanges also does the same work that RegenPortageCache does. Perhaps PublishUprevChanges should also be renamed? It also updates binhost conf and will be updating the cache (once again) as well.
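To sketch the intended ordering (the commands below are illustrative stand-ins for what the chromite stage drives, not the literal implementation):

# conceptually, inside PublishUprevChanges:
$ <revert local changes and re-sync the overlays>
$ <re-apply uprevs via cros_mark_as_stable>
$ egencache --update --jobs=N ...   # the piece being restored: regenerate the md5 cache
$ <commit and push the uprevs together with the refreshed cache>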
,
Aug 30 2017
"ProcessUprevChanges" ? :) feel free to offer any other bikeshed names. we should also make sure the stage's docstring enumerates everything it does as that docstring shows up in cbuildbot logs at the very top, so it helps people to see that overview up front.
,
Aug 30 2017
Well, it may be that PublishUprevChanges is already descriptive enough for those who've spent longer on the project than I have :) PublishPackageChanges? PublishBuildArtifacts? None of these feel much better, though, and renaming stuff in python can be a bit of a pain. After further consideration I'd lean towards leaving it as-is unless there's something clearly better. I'll be sure to update the docstring accordingly, either way.
,
Sep 1 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/5f3bfa02d3cd56136a8fde7a595194c8ccecb865

commit 5f3bfa02d3cd56136a8fde7a595194c8ccecb865
Author: Drew Davenport <ddavenport@chromium.org>
Date: Fri Sep 01 23:41:57 2017

cbuildbot: Restore portage cache regeneration

Also regenerate portage cache when reapplying the uprev in PublishUprevChanges step after reverting all local changes.

BUG=chromium:659864
TEST=FEATURES=test emerge chromite
TEST=cbuildbot --remote --buildbot --debug master-paladin

Change-Id: Ie40a20ca4ad1753c9515b2000e4703f9d22e96fa
Reviewed-on: https://chromium-review.googlesource.com/644307
Commit-Ready: Drew Davenport <ddavenport@chromium.org>
Tested-by: Drew Davenport <ddavenport@chromium.org>
Reviewed-by: Drew Davenport <ddavenport@chromium.org>

[modify] https://crrev.com/5f3bfa02d3cd56136a8fde7a595194c8ccecb865/cbuildbot/stages/build_stages.py
[modify] https://crrev.com/5f3bfa02d3cd56136a8fde7a595194c8ccecb865/cbuildbot/stages/completion_stages.py
[modify] https://crrev.com/5f3bfa02d3cd56136a8fde7a595194c8ccecb865/cbuildbot/commands.py
,
Sep 5 2017
md5 caches seem to be getting updated as expected. After removing /var/cache/edb/dep:

> time emerge-$BOARD -epv virtual/target-os virtual/target-os-dev virtual/target-os-test
> real 0m35.169s
> user 0m19.192s
> sys 0m4.076s