Grunt paladin build times getting worse and worse
Issue description
http://shortn/_NRFrvl9QrI Something about the Grunt packages and build configuration is negatively affecting everyone else: a CQ run can't complete until the Grunt build completes, and Grunt keeps getting slower while other builds hold constant or get faster. If this gets much worse, we'll have to remove it from the important list until it's brought back down. Assigning to Grunt eng lead, sjg.
,
Aug 17
https://luci-milo.appspot.com/buildbot/chromeos/grunt-paladin/ Shows build duration going from 1h39 4 days ago to 3h12 this morning.
,
Aug 17
I do see this, but not why it might cause such a large delay:
WARNING: The following packages failed once or more, but succeeded upon retry. This might indicate incorrect dependencies.
sys-boot/amd-firmware-0.0.1-r66
Merge complete
,
Aug 17
There's much longer historical data in the graph in the report: http://shortn/_NRFrvl9QrI There's a per-package build times report on the individual builds: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/emergePackageDetails?cidbBuildId=2854221
,
Aug 17
Completed chromeos-base/chromeos-ec-0.0.1-r5033 (in 40m6.7s) That seems like a very long time. Assigning to Ed in case something has changed with the EC.
,
Aug 17
Also, from Jason's second link, llvm is being built. Is that expected? Adding Ben.
,
Aug 17
Perhaps a silly question: what unit is the y axis in http://shortn/_NRFrvl9QrI ? How does that graph relate to "run duration" here: https://luci-milo.appspot.com/buildbot/chromeos/grunt-paladin/#
,
Aug 17
It looks to me like it's rebuilding too many packages. For example, this build: https://luci-milo.appspot.com/buildbot/chromeos/grunt-paladin/2594 only used 1 binpkg and rebuilt krb5 (changed in July), ghostscript-gpl (changed in April), llvm (changed in July), arc-llvm (changed 9 days ago), etc. I suspect the "rebuild too many packages" part is widespread, but we're seeing it worst on grunt because we have several large packages that take a long time to build.
,
Aug 17
Looking at the graphs though, I don't see any regression in other paladins.
,
Aug 17
Top 6 slowest packages (https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/emergePackageDetails?cidbBuildId=2854221):
llvm
chromeos-kernel-4_14
mit-krb5
chromeos-ec
arc-llvm
autotest-deps-ltp
chromeos-ec does seem odd to me: I don't know why it is taking that long.
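A ranking like the one above can be pulled straight from the build log. The following is a minimal sketch, assuming the log contains lines in the "Completed <package> (in XmY.Ys)" form quoted in this thread. The function name `slowest_packages` is hypothetical, and the handling of sub-minute durations without an "m" part is an assumption about the log format.

```python
import re

# Matches emerge log lines such as:
#   Completed chromeos-base/chromeos-ec-0.0.1-r5033 (in 40m6.7s)
# The minutes part is treated as optional (assumption: short builds omit it).
COMPLETED_RE = re.compile(r"Completed (\S+) \(in (?:(\d+)m)?([\d.]+)s\)")


def slowest_packages(log_text, top=6):
    """Return (package, seconds) pairs sorted by build time, slowest first."""
    timings = []
    for match in COMPLETED_RE.finditer(log_text):
        package, minutes, seconds = match.groups()
        total = int(minutes or 0) * 60 + float(seconds)
        timings.append((package, total))
    timings.sort(key=lambda item: item[1], reverse=True)
    return timings[:top]


# Two log lines quoted elsewhere in this thread, used as sample input.
sample_log = """\
Completed chromeos-base/chromeos-ec-0.0.1-r5033 (in 40m6.7s)
Completed sys-devel/llvm-6.0.0-r2 (in 48m19.4s)
"""

for package, seconds in slowest_packages(sample_log):
    print(f"{seconds / 60:6.1f} min  {package}")
```

Feeding it the full BuildPackages stdout for a run should give the same kind of list as the GoldenEye per-package report, without needing access to that page.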
,
Aug 17
I don't understand why the toolchain (llvm) and other toolchain-related pieces (gcc-libs) get built. Perhaps the toolchain people need to look at it? +cc llozano
,
Aug 17
llvm isn't being built for the toolchain, it's being built for the runtime libs (the JIT and such w/graphics). that's WAI.
,
Aug 17
The llvm here is the one used by mesa, not the host toolchain's llvm. Regardless of that, why is it being rebuilt? Shouldn't prebuilts be getting used?
,
Aug 20
Answering my own question from #7: The Panopticon chart shows duration of BuildPackages in seconds. grunt-paladin is currently at 6815 seconds, which matches the 1 hrs 54 mins for BuildPackages here: https://luci-milo.appspot.com/buildbot/chromeos/grunt-paladin/2612
emerge-grunt chromeos-ec builds 8 EC images (RO/RW, grunt/careena/aleena/liara). This takes 2 minutes on my workstation, using 56 cores. Could lack of available cores on grunt-paladin explain taking 20 times longer?
,
Aug 21
It's building on a machine with 32 cores and 208G of RAM.
,
Aug 22
I compared package build times between octopus-paladin and grunt-paladin for these builds:
https://luci-milo.appspot.com/buildbot/chromeos/octopus-paladin/1456 BuildPackages 1 hrs 16 mins
https://luci-milo.appspot.com/buildbot/chromeos/grunt-paladin/2617 BuildPackages 1 hrs 55 mins
These packages built for both, but take about twice as long for grunt:
10 26 +15 sys-block/parted-3.1-r1
13 29 +15 app-text/ghostscript-gpl-9.19-r12
 2 20 +17 sys-libs/libcxxabi-7.0.0-r4
22 40 +18 chromeos-base/chromeos-ec-0.0.1-r5045
16 36 +19 chromeos-base/autotest-deps-ltp-0.20150119-r17
24 45 +20 app-crypt/mit-krb5-1.15.1
21 42 +21 sys-boot/coreboot-0.0.1-r2695
20 46 +25 sys-kernel/chromeos-kernel-4_14-4.14.64-r453
These packages are not built for octopus:
   29 +29 sys-devel/arc-llvm-6.0.0-r3
   49 +49 sys-devel/llvm-6.0.0-r2
(columns are times in minutes, for octopus / grunt / difference)
Maybe building llvm and arc-llvm is what slows down the other packages?
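For anyone wanting to repeat this comparison on other builds, here is a rough sketch of how such a table could be produced from two per-package timing maps. The function `compare_builds` and the dictionary layout are assumptions for illustration. The sample values are whole minutes copied from the table above; the deltas in the thread appear to have been computed before rounding, so some rows can differ by a minute from what this prints.

```python
def compare_builds(baseline, candidate):
    """Return rows (baseline_min, candidate_min, delta, package), smallest delta first.

    A package missing from the baseline gets baseline_min = None and its
    full candidate time as the delta.
    """
    rows = []
    for package, cand_minutes in candidate.items():
        base_minutes = baseline.get(package)  # None if not built in baseline
        delta = cand_minutes - (base_minutes or 0)
        rows.append((base_minutes, cand_minutes, delta, package))
    rows.sort(key=lambda row: row[2])
    return rows


# Whole-minute timings taken from the octopus / grunt comparison.
octopus = {
    "chromeos-base/chromeos-ec-0.0.1-r5045": 22,
    "sys-boot/coreboot-0.0.1-r2695": 21,
}
grunt = {
    "chromeos-base/chromeos-ec-0.0.1-r5045": 40,
    "sys-boot/coreboot-0.0.1-r2695": 42,
    "sys-devel/llvm-6.0.0-r2": 49,
}

for base, cand, delta, package in compare_builds(octopus, grunt):
    base_text = "" if base is None else str(base)
    print(f"{base_text:>3} {cand:>3} {delta:+4d} {package}")
```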
,
Aug 22
That sounds believable to me. When building on my workstation, arc-llvm slows everything down noticeably and eventually hangs the whole thing for a couple of minutes of heavy I/O. It still seems like the solution to the general problem is that we shouldn't be rebuilding all these packages. arc-llvm, for example, only depends on three ebuilds and none of them have changed since at least July. Why aren't we finding a binpkg for it?
,
Aug 23
My belief is that some of the recent changes to how build_packages (or parallel_emerge) computes reverse dependencies are causing the build issue. Prebuilts are there, but because of the aggressive reverse deps calculations, portage ends up rebuilding practically every package.
,
Aug 23
The current status is that all packages are being rebuilt except Chrome. For the record, I am pretty sure that reverting https://chromium-review.googlesource.com/c/chromiumos/platform/crosutils/+/1015529 will fix the problem with prebuilts and will reduce the time spent in build_packages. The reverse deps computation introduced by the above CL is too aggressive: it makes portage reject all prebuilts and rebuild the packages instead.
,
Aug 23
https://chromium-review.googlesource.com/c/chromiumos/platform/crosutils/+/1015529 is issue 864309 and is unrelated to this bug: let's not conflate them. Grunt saw massive regressions while all other paladins held constant or got better. See the graphs above.
,
Aug 25
Current theory is that grunt-paladin is slower than other paladins because it has to build llvm and arc-llvm packages.
,
Aug 26
That's been true for a while, though. What changed to make Grunt get worse around Aug 3? http://shortn/_YaldHjNqap
,
Aug 27
I compared these 2 grunt-paladin builds:
#2481 2018-07-31 7:30 AM (MDT) BuildPackages = 1 hrs 29 mins = 5340 secs
#2491 2018-08-02 12:24 AM (MDT) BuildPackages = 1 hrs 52 mins = 6720 secs
I don't see any obvious problem / explanation for the 23 min increase.
The packages with the largest increase in time are:
1 9 + 8 chromeos-base/tpm_manager-0.0.1
2 9 + 8 chromeos-base/chaps-0.0.1
9 17 + 8 media-libs/arc-mesa-18.2.0_pre
1 9 + 8 chromeos-base/autotest-tests-p2p-0.0.1
1 10 + 9 chromeos-base/autotest-server-tests-tast-0.0.1
9 + 9 chromeos-base/drivefs-0.0.1
3 12 + 9 net-wireless/bluez-5.44
3 12 + 9 sys-libs/libcxxabi-4.0.0
4 14 +10 sys-apps/usbguard-20180726
4 13 +10 media-sound/adhd-0.0.1
14 +14 chromeos-base/drivefs-google3-0.0.1
9 23 +14 sys-boot/coreboot-0.0.1
2 24 +22 app-crypt/nss-3.30.2
(columns are times in minutes, for #2481 / #2491 / difference)
,
Aug 27
Also, why is it rebuilding llvm and arc-llvm? Shouldn't they always use prebuilts? It seems wasteful.
,
Aug 27
Rebuilding llvm and arc-llvm is issue 864309
,
Aug 28
We deployed our revdep rebuild changes yesterday for issue 864309, so we should gather new data on the grunt paladin to see if its times come back down.
,
Aug 29
Grunt's gotten worse again: http://shortn/_FIpaKTLSKF At this point, we don't really have any other choice but to remove Grunt from the important status in the CQ.
,
Aug 29
That seems bad to me. IMO build_packages builds FAR too much stuff now. I worry that the cure is worse than the disease.
,
Aug 30
CL removing Grunt was rejected because Ben is investigating this now. We need a solution this week or I will have to remove it on Monday.
,
Aug 30
Monday is a holiday. Let's let the investigation run its course.
,
Aug 30
It's been two weeks since we asked for some investigation. This is hurting everyone else. We need an answer soon or to take action. We can't wait much longer.
,
Aug 30
The current theory is that arc-llvm is probably what's making grunt worse than the other boards. It's a large package that only grunt uses and it also slows down everything else while it's building. A couple of things we're investigating now:
1. arc-llvm is huge. We're going to look for unused compiler features that can be turned off to shrink the package.
2. arc-llvm is getting rebuilt on nearly every run even though binpkgs are available. The revdeps logic doesn't look like the problem, since it's showing up in the initial to-build list. I'm looking into why emerge isn't picking up the binpkg.
The broader problem is that binpkgs are only updated on a green CQ run. That means that currently they're a week old, so everything touched in the last week and everything that transitively depends on any of those is being rebuilt on every CQ run. This is affecting every board as well as developer chroots. We might want to talk about doing an "empty" CQ run from time to time so we don't build up these massive piles of packages that have to be rebuilt every time.
,
Aug 31
Ben, thank you for the helpful info. I'm pleased to hear about #2 as that is my experience too. Fixing that would surely help.
,
Aug 31
tl;dr:
1. grunt includes two large ebuilds that are needed for its hardware (llvm and arc-llvm).
2. CQ runs for all boards are slower than usual right now because of outdated binpkgs.
3. grunt is being hit harder than the rest because llvm and arc-llvm are triggered by #2.
I expect grunt to always be on the slower end because of #1, but it should be relatively closer to other boards when binpkgs are up to date. crrev.com/c/1198723 is in the CQ now to try to make arc-llvm faster, and djkurtz@ is looking at potential speedups for llvm.
Full version:
I did some comparisons of the grunt package lists to make sure we aren't pulling in anything unwanted. For grunt compared to other arcnext boards:
* grunt pulls in cros-camera with about 15 dependencies. These take ~20 minutes to build for the whole set. These are required for HALv3 support in ARC++ P and are going to show up on all new boards soon.
* grunt pulls in arc-llvm, llvm, arc-libelf, and libelf. These are required for amdgpu support. The llvm packages take 30-45 minutes each because they're big. To make matters worse, they cause everything else to build more slowly.
* grunt pulls in unibuild config support (cros-config, etc).
* grunt pulls in various AMD firmware support packages. These are comparable to the firmware for other boards.
I don't see anything in the list that isn't a reasonable requirement for grunt. None of those should take any particularly long time to install from binpkgs, so that leaves us with the question of why we are building all these packages every time.
I initially thought that maybe binpkgs weren't working. After looking into it, that doesn't seem to be the case. Binpkgs are used when available, but we haven't had a green CQ run since 8/23, so there aren't any new binpkgs in the last 8 days. Since then, we've uprevved several low-level packages: bzip2, openssl, portage, python, chromeos-kernel-4_14, chromeos-ec, and samba (among others, of course). Because build_packages passes --rebuild-if-unbuilt to emerge and does the rev deps calculation, everything in the transitive dependencies of these has a forced rebuild. Together, the packages I listed explain why at least the top 15 longest-building ebuilds are all being rebuilt.
The other part of this is that arc-llvm and llvm are slow to compile. If we can speed them up, the impact of the rest of this becomes relatively less bad. crrev.com/c/1198723 shrinks the arc-llvm package; tryjobs suggest it might shave off 10-20 minutes from grunt's BuildPackages stage. djkurtz@ and I have been trying out some possible changes to llvm to get a similar savings there. That will still leave grunt on the slower end, but it might be a few minutes worse than other boards instead of 20-30 minutes worse.
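To illustrate why a handful of uprevved low-level packages can force so many rebuilds, here is a small sketch of the transitive reverse-dependency effect described above. It is a simplified model, not portage's or build_packages' actual algorithm. The function `forced_rebuilds` and the package names in the sample graph are hypothetical placeholders, not real ebuild dependencies.

```python
from collections import deque


def forced_rebuilds(depends_on, changed):
    """Return the changed packages plus everything that transitively depends on them.

    depends_on maps each package to the list of packages it depends on.
    """
    # Invert the graph: dependency -> set of packages that depend on it.
    reverse = {}
    for package, deps in depends_on.items():
        for dep in deps:
            reverse.setdefault(dep, set()).add(package)

    # Breadth-first walk over reverse edges, starting from the changed set.
    rebuild = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in reverse.get(current, ()):
            if dependent not in rebuild:
                rebuild.add(dependent)
                queue.append(dependent)
    return rebuild


# Hypothetical graph: app-x -> lib-y -> lib-z, and app-w -> lib-v.
graph = {
    "app-x": ["lib-y"],
    "lib-y": ["lib-z"],
    "app-w": ["lib-v"],
    "lib-z": [],
    "lib-v": [],
}

# Changing only the lowest-level library forces its whole chain to rebuild.
print(sorted(forced_rebuilds(graph, ["lib-z"])))
```

In this toy graph a single change to lib-z pulls in lib-y and app-x, while the unrelated app-w chain can still come from binpkgs. The lower a changed package sits in the real dependency graph, the larger the forced set becomes.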
,
Aug 31
Thanks for the update and investigation. Jason, what are your thoughts based on this?
,
Aug 31
Regarding the llvm package, I'm just wondering which tools are being built in /build/<board>/usr/lib/llvm/bin/ and whether mesa needs all of them. Anything that is not needed by mesa can be dropped by tuning llvm's CMake options (https://llvm.org/docs/CMake.html) or by calling ninja just for specific tools, e.g. ninja llc opt <more tools>.
,
Aug 31
arc-llvm doesn't build any of the tools. We're hoping to turn them off for llvm as well.
,
Aug 31
This is a great outcome. Thanks for looking at the LLVM build times and the build targets; those improvements will be applicable to other Mesa drivers in the future. The bit about cros-camera and that it's about to be enabled for all boards is also an extremely useful insight, since it means that those additional build deps are about to impact everyone and we can build that understanding into how we look at long-tail builders over the next few months. For now, let's consider this mitigated as soon as the LLVM optimizations land.
,
Sep 1
@#40 - yeah, good idea about disabling building unneeded LLVM tools. As Ben mentioned, our plan is to only enable building llvm_config for the host, to match arc-llvm. Here is an attempt to do so:
https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/1200450
Before we can commit this CL, we'd like to confirm the theories:
(1) for grunt we don't need any other llvm tools
(2) betty is the only other device that builds llvm-6.0
(3) betty doesn't need tools built, either
Here are two tryjobs to see if the grunt & betty paladin builds will work with this patch:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8936626475134735680
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8936626472952746912
,
Sep 5
Bugdroid is busted. The CL from #43 [0] landed in 11037.0.0.
[0] http://crrev.com/c/1200450
The grunt-paladin CQ run that landed the CL [1] showed:
Completed sys-devel/llvm-6.0.0-r3 (in 33m49.1s)
INFO : Elapsed time (build_packages): 83m40s
Compared to the last run without this change [2]:
Completed sys-devel/llvm-6.0.0-r2 (in 48m19.4s)
INFO : Elapsed time (build_packages): 88m2s
[1] https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?builderName=grunt-paladin&buildNumber=2710
[2] https://luci-logdog.appspot.com/logs/chromeos/bb/chromeos/grunt-paladin/2708/+/recipes/steps/BuildPackages/0/stdout
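To put numbers on the before/after comparison, here is a short sketch that converts the quoted durations to seconds and reports the saving. The helper names `to_seconds` and `saving` are hypothetical; the duration strings are the ones quoted in this comment.

```python
import re


def to_seconds(duration):
    """Convert a string like '48m19.4s' or '83m40s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?([\d.]+)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + float(seconds)


def saving(before, after):
    """Return (seconds saved, percent saved) going from before to after."""
    before_s, after_s = to_seconds(before), to_seconds(after)
    return before_s - after_s, 100.0 * (before_s - after_s) / before_s


# Durations from builds 2708 (before) and 2710 (after).
for label, before, after in [
    ("llvm", "48m19.4s", "33m49.1s"),
    ("build_packages", "88m2s", "83m40s"),
]:
    saved, percent = saving(before, after)
    print(f"{label}: {saved / 60:.1f} min saved ({percent:.0f}%)")
```

By this arithmetic the llvm package itself got about 14.5 minutes faster, while the overall build_packages time dropped by a little over 4 minutes between these two runs.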
,
Sep 5
Approving merge for M70 Chrome OS
,
Sep 5
FYI: Merge-Request is for these two CLs: arc-llvm: https://chromium-review.googlesource.com/1198723 (with original BUG=b:112313068) llvm: https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/1200450
,
Sep 5
Comment 1 by jclinton@chromium.org
, Aug 17