BuildPackages [afdo_use] takes > 6 hours to complete (sometimes)
|||||||||||||||||||||
Issue description:
Build packages is completing, but consistently taking over 6 hours to do so:
https://uberchromegw.corp.google.com/i/chromeos/builders/terra-release/builds/10
This is enough to cause the entire build to time out, as well as being a general problem. There may be a few other boards with the same issue... looking.
,
May 10 2016
,
May 10 2016
I pulled the logs from the cited terra build, and did a quick bit of processing to find the longest-running ebuilds. Here's what I got (the first number is minutes, the second is seconds):

 61 22.5  sys-kernel/chromeos-kernel-3_18-3.18-r1247
 62 16.6  x11-libs/libX11-1.6.0-r1
 66 58.9  media-libs/libmtp-0.0.1-r21
 70 14.1  chromeos-base/autotest-0.0.2-r7243
 76 55.0  sys-fs/e2fsprogs-1.42.12
 87 47.6  dev-libs/glib-2.34.3-r4
 89 11.0  net-dns/c-ares-1.7.5-r1
 90 28.1  chromeos-base/chromite-0.0.2-r2372
 93 40.6  sys-devel/binutils-2.25.51-r57
 98 29.7  net-misc/openssh-6.6.1_p1-r2
109 38.5  sys-libs/gcc-libs-4.9.2-r32
119 43.6  sci-libs/fftw-3.2.2
122  1.6  chromeos-base/autotest-deps-ltp-0.20150119-r12
143 47.3  chromeos-base/chromeos-chrome-52.0.2727.0_rc-r1
175 39.2  sys-devel/gdb-7.10.20160211

It's a bit odd that it took more than an hour to build the kernel; that feels like a system that's way overloaded. More disturbing: why are we building gdb? And at 2:55:39, that's an awful lot of potentially wasted processing...
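For reference, a minimal sketch of the kind of quick log processing described above, assuming the build_packages stdio log has been saved locally as build_packages.log (a hypothetical name) and that the per-package lines have the form "Completed <category/package> (in 175m39.2s)":

  # Pull out per-package build times and list the 15 slowest (minutes, seconds, package).
  grep -o 'Completed [^ ]* (in [0-9]*m[0-9.]*s)' build_packages.log \
    | sed -E 's/Completed ([^ ]+) \(in ([0-9]+)m([0-9.]+)s\)/\2 \3 \1/' \
    | sort -n -k1,1 -k2,2 \
    | tail -n 15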
,
May 10 2016
,
May 10 2016
We are seeing the same problem on heli-release. Randall, can you find an owner for this? Maybe someone who knows something about the boards in question?
,
May 10 2016
stevenh, agnescheng, vineeths: Y'all are listed as contacts for these boards. Can you point us to the platform engineers for those boards?
,
May 10 2016
Kefka is a standard Braswell "Strago" platform. ttsang@ is the SIE for kefka.
,
May 10 2016
Heli is based on the Rambi platform. bhthompson@ may be able to help.
,
May 10 2016
Kefka is a minimal board; it has less in the overlay than strago. It looks like something happened around May 7 to May 9: a few other builders are failing (taking longer than 7 hrs), lulu/cave/banon/celes/cyan/squawks/ultima/wizpig, in addition to the ones above. They are different boards (bdw/bsw/bt).
,
May 10 2016
The gdb package is brought in when one builds a test image, so the package is expected to be there, but it is not expected to take 2 hours to build. I just tried a local build with it and it built in way less than that. I don't think this is specific to these boards, but rather to some situation on the build hardware?
,
May 10 2016
If you build locally, don't forget to use "--nousepkg". The release builders don't use binary prebuilts; instead, they build everything from scratch. I started a heli build locally earlier, but it's still running.
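For concreteness, a minimal sketch of what that from-source local build might look like, assuming the standard in-chroot wrappers (heli is just an example board):

  # Full from-source build, matching what the release builders do:
  ./build_packages --board=heli --nousepkg
  # Or rebuild a single suspect package without binary packages:
  emerge-heli --getbinpkg=n sys-devel/gdb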
,
May 10 2016
These builders were reconfigured (ungrouped) on the 6th and the 7th, in a way that would make these failures much more apparent. They were also moved to new build machines, most of which are GCE based. I'm looking to see if all of the problem boards are GCE based.
,
May 10 2016
And looking... lulu/cave don't have release builders at all.

banon    duration: 4:08:33  median 4:56:24  mean 4:00:41  min 1:51:49  max 6:11:23
celes    duration: 2:05:10  median 2:31:17  mean 3:19:25  min 1:55:43  max 5:31:15  (also has a build with 7+ hours that was interrupted)
cyan     duration: 4:07:09  median 4:58:13  mean 4:14:15  min 1:54:21  max 5:26:02
heli     duration: 4:05:55  median 5:54:20  mean 4:47:12  min 1:54:20  max 6:20:13
kefka    duration: 3:03:22  median 2:22:45  mean 2:48:23  min 1:56:22  max 5:22:01
squawks  duration: 4:07:09  median 4:58:13  mean 4:14:15  min 1:54:21  max 5:26:02
terra    duration: 4:07:09  median 4:58:13  mean 4:14:15  min 1:54:21  max 5:26:02
ultima   duration: 4:07:09  median 4:58:13  mean 4:14:15  min 1:54:21  max 5:26:02
wizpig   duration: 4:24:19  median 5:53:56  mean 4:47:07  min 1:59:36  max 6:43:45

The stats across all release builders since the reorg:

BuildPackages:            duration: median 0:29:50  mean 0:47:00  min 0:22:37  max 2:24:51
BuildPackages [afdo_use]: duration: median 2:24:42  mean 3:13:25  min 1:33:07  max 7:16:16
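For reference, a sketch of how stats like these could be reproduced, assuming the BuildPackages durations have been scraped into a hypothetical durations.txt (one H:MM:SS value per line) and that gawk is available (asort() is a gawk extension):

  awk -F: '
    { secs = $1*3600 + $2*60 + $3; d[NR] = secs; sum += secs }
    END {
      n = asort(d)                                    # sort values ascending (gawk-only)
      median = (n % 2) ? d[(n + 1) / 2] : (d[n/2] + d[n/2 + 1]) / 2
      printf "median %s  mean %s  min %s  max %s\n",
             hms(median), hms(sum / n), hms(d[1]), hms(d[n])
    }
    function hms(s) { return sprintf("%d:%02d:%02d", s/3600, (s%3600)/60, s%60) }
  ' durations.txt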
,
May 10 2016
All of those builders are GCE builders, but we added over 80 of them, and not all have this problem. What I am seeing is that this is tied to AFDO use, but I don't see why it would be so variable. What does AFDO do differently?
,
May 10 2016
I used `emerge-heli --getbinpkg=n gdb` and it still built quickly (and appears to be performing compilation).
,
May 10 2016
AFDO looks like some kind of build optimization; I am not very familiar with it.
,
May 10 2016
I'm not either, but I do know that the output is a lot larger. Could this be a disk IO performance issue on the new builders?
,
May 11 2016
,
May 11 2016
,
May 11 2016
Increasing priority; this seems to be the root cause of multiple timeouts affecting multiple builds, e.g.:
https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_spring-release/builds/1345/steps/HWTest%20%5Bbvt-inline%5D/logs/stdio
https://uberchromegw.corp.google.com/i/chromeos/builders/banjo-release/builds/10/steps/Signing/logs/stdio
https://uberchromegw.corp.google.com/i/chromeos/builders/glimmer-release/builds/291/steps/Paygen/logs/stdio
,
May 11 2016
Adding someone from the toolchain team to see if they have input.
,
May 11 2016
PS: It's possible that the GCE builders have decreased (or inconsistent) drive IO performance... I'm wondering if that could explain the variability?
,
May 11 2016
From jrbarnette: The failure in the -294 build on squawks looks a lot like bug 610536. I discovered that the logs for build_packages now include the output from build processes every 60 minutes or so. That means it's possible to see output snapshots from the longest-running builds. Most interesting (to me) was the output from the gdb build: it showed the system spent more than an hour in "configure"(!). The configure script shouldn't really take that long. At all. If I had to guess at this point, I'd say the builder system is overloaded, possibly from memory pressure due to high levels of parallelism.
,
May 11 2016
The GCE builders may have reduced memory compared to a physical builder. If so, memory pressure could explain this. It's also a problem that can be solved relatively easily. I'll compare the available RAM in the two classes of builders.
,
May 11 2016
Either of those could be the case; significant changes in memory bandwidth or storage bandwidth would impact build times. Is it safe to assume we always get the same number of available cores/threads in these instances? How consistent do we expect the GCE instances to be?
,
May 11 2016
,
May 11 2016
We have dedicated CPUs and RAM, but I don't know how the virtual disks work. Even though the memory is consistent, thrashing could still explain the variability, since its behavior can be somewhat non-deterministic. I'll increase the memory on one of the problematic builders, and check the disk usage graphs for anything educational.
,
May 11 2016
,
May 12 2016
I'm confident that memory isn't the problem. The GCE instances have 230G of RAM, but our physical builders only have 128G. On the other hand, looking at the history for cros-beefy37, I don't see exactly what I'd expect to see if disk IO were the issue.

Sample build: https://uberchromegw.corp.google.com/i/chromeos/builders/heli-release/builds/10
Builder stats can be seen here: https://pantheon.corp.google.com/compute/instancesDetail/zones/us-east1-b/instances/cros-beefy37-c2?project=chromeos-bot&graph=GCE_CPU&duration=P4D

Picking a build that spent 6 hours 20 minutes in build packages (more than 3x a build that used 1 hour 55 minutes): it started May 9th at 7:40 PM and finished May 10th at 3:24 AM. The disk IO in bytes per second looks very similar to most other builds (including passing builds). The max IO rate is about 74 MB/second for only a brief time, with some passing builds reaching about 84 MB/second in similar brief spikes. However, the number of disk operations appears to be lower than in later builds that passed. CPU load on the machine seems near max for the entire time build packages is running, very similar to other builds, except that the CPU stays pegged for much longer. If IO were the problem, I'd expect CPU load to go down while blocked on IO.
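If we want to confirm the CPU-vs-IO picture directly on a suspect builder rather than from the graphs, here is a sketch of what could be run during a slow BuildPackages window, assuming the standard sysstat tools are installed:

  iostat -x 10      # per-device utilization and wait times; high await/%util would implicate disk
  mpstat -P ALL 10  # per-CPU %usr/%sys/%iowait/%steal; %steal is worth watching on a VM
  vmstat 10         # run-queue length, blocked tasks, and swap activity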
,
May 12 2016
To me, that looks like more work is being done for some reason during the slow build_packages, and disk operations are reduced because it's taking longer to finish each part of the compile.
,
May 12 2016
That makes it sound like a pure CPU problem. Is it possible we get fewer virtual CPUs somehow in these cases, or that they slow down? In such a VM setup, is it possible it exposes 64 threads but is not really giving us full time on them (or that we have fewer physical cores allocated)?
,
May 12 2016
Each CPU is supposed to be a dedicated core (we pay more for that). Passing to someone with AFDO experience for suggestions.
,
May 12 2016
AFDO only applies to the Chrome build. No other packages are built with AFDO. The Chrome build is taking 2 hours... I'm not sure whether that is very high, but I believe it is not. What is very strange is this:

Completed sys-devel/gdb-7.10.20160211 (in 175m39.2s)

I don't understand why building GDB would take 3 hours (more than Chrome) and, furthermore, why GDB is being built at all. GDB should be installed as part of setup_board... is there some package that has a bad dependency on GDB? Asking Yunlian to take a look.
,
May 12 2016
Also, I don't understand the numbers from #13. How can it be that build_packages takes less than 30 mins? The whole build of Chrome takes more than that, and release builders rebuild everything. There is no way it will take less than 30 mins.
,
May 12 2016
,
May 12 2016
GDB gets built because we build a test image as part of the artifacts, and test images have GDB as a dependency. Why it takes 3 hours to build GDB is the million dollar question.
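A hedged sketch for tracking down what pulls gdb onto the test image, assuming the board-wrapped portage/gentoolkit tools in the chroot and assuming virtual/target-os-test is the test-image meta-package (heli is just an example board):

  # Reverse dependencies of gdb in the board sysroot:
  equery-heli depends sys-devel/gdb
  # Or list everything a test-image build would pull in and look for gdb:
  emerge-heli --pretend --emptytree virtual/target-os-test | grep gdb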
,
May 12 2016
I started a local cbuildbot for terra-release on my machine (Z620, 32 cores) last night. The times for chrome and gdb are:

Completed chromeos-base/chromeos-chrome-52.0.2733.0_rc-r1 (in 132m12.8s)

which is NOT very different from what we see in the GCE instance. For GDB:

Pending 175/601, Building 21/21, [Time 04:11:36 | Elapsed 65m55.2s | Load 74.79 82.04 62.62]
Completed chromeos-base/autotest-tests-p2p-0.0.1-r2037 (in 1m19.9s)
Still building openssh-6.6.1_p1-r2 (16m45.5s). Logs in /tmp/openssh-6.6.1_p1-r2-6_UjOw
Completed chromeos-base/autotest-server-tests-shill-0.0.1-r2031 (in 1m14.4s)
Pending 175/601, Building 19/19, [Time 04:11:50 | Elapsed 66m9.6s | Load 69.22 80.48 62.43]
Completed sys-devel/gdb-7.10.20160211 (in 49m38.7s)

which is pretty high (49 mins), but it is way lower than the GCE instance (3 hours). You can see that the machine is quite busy when GDB finishes building. So, maybe we have too much parallelism in the build, and the GCE instance does not handle that amount of parallelism well? On the GCE instance:

Pending 62/601, Building 9/9, [Time 06:09:12 | Elapsed 219m2.0s | Load 80.91 98.7 103.48]
Pending 62/601, Building 9/9, [Time 06:10:12 | Elapsed 220m2.9s | Load 86.68 97.82 102.92]
Completed chromeos-base/cryptohome-0.0.1-r1197 (in 1m6.1s)
Still building common-assets-0.0.1-r37 (2m2.4s). Logs in /tmp/common-assets-0.0.1-r37-x3vWUS
Started chromeos-base/chromeos-login-0.0.2-r2306 (logged in /tmp/chromeos-login-0.0.2-r2306-8iMdRj)
Completed sys-devel/gdb-7.10.20160211 (in 175m39.2s)

You can see the load is pretty high there too. A few minutes before that, the load reaches 128:

Pending 72/601, Building 11/11, [Time 06:05:52 | Elapsed 215m42.0s | Load 128.99 109.76 106.99]

When GDB completes, chromeos-chrome has not even started, so AFDO has nothing to do with this:

Pending 24/601, Building 6/6, [Time 06:44:02 | Elapsed 253m52.2s | Load 31.53 49.31 63.56]
Started chromeos-base/chromeos-chrome-52.0.2727.0_rc-r1 (logged in /tmp/chromeos-chrome-52.0.2727.0_rc-r1-4GuTJZ)

Can we limit the maximum parallelism in the build for GCE instances?
,
May 12 2016
I'll note that the long gdb build time seems to be a hallmark of this problem. I downloaded build logs from a successful build (with a faster run time), and in that case the gdb build time was much shorter. My guess would be that gdb is a victim rather than a cause of the problem: Something that commonly runs in parallel with the gdb build sucks up all the time, and causes trouble.
,
May 12 2016
The GCE instances used are 64 cores, so 128 doesn't seem that high. Before the builders were ungrouped, loads between 200-300 were normal. As for the 30 minute build: that value is the shortest seen on any release builder, including some that don't build Chrome at all.
,
May 12 2016
Hmm... are we using AFDO for EVERY package? Could we disable it for test-image-only packages like GDB? Or perhaps only enable it for Chrome and a few select others? I don't know enough about AFDO to know how much sense this makes. That might also help with some space issues that have been seen.
,
May 12 2016
As I said in #33, AFDO only applies to the build of Chrome (chromeos-base/chromeos-chrome), and, as noted in #40, chromeos-chrome has not even started building while the build of GDB is going on. I agree with jrbarnette in #38: it seems to me the gdb build is the victim and not the cause. I think neither Yunlian nor I would be a good owner for this problem. Someone with more knowledge about GCE instances should own this.
,
May 12 2016
Sorry, I must have missed #33. Thanks for the input.
,
May 12 2016
Handling of parallelism on the machine depends on other factors besides the number of cores, so it may be worth doing experiments where we limit the number of parallel builds and see if it behaves better. I have done this in the past by setting the following variables:

EMERGE_DEFAULT_OPTS="--jobs=12 --load-average=10"
MAKEOPTS="-j12 --load-average=10"

Use values that make sense for that machine. For example, use a load average that does not exceed the number of cores.
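A minimal sketch of that experiment on an affected builder (the values are illustrative, not tuned; whether build_packages picks these up from the environment or needs them in the sysroot's make.conf is an assumption worth verifying):

  # Cap emerge job count and make's parallelism, then rebuild from source and compare times.
  export EMERGE_DEFAULT_OPTS="--jobs=12 --load-average=10"
  export MAKEOPTS="-j12 --load-average=10"
  ./build_packages --board=terra --nousepkg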
,
May 12 2016
Just making notes for myself:
cros-beefy2-c2 has performed well consistently: https://uberchromegw.corp.google.com/i/chromeos/builders/rambi-release
cros-beefy61-c2 has varied over time: https://uberchromegw.corp.google.com/i/chromeos/builders/squawks-release
,
May 13 2016
I sent email to bigcluster-users@ and am starting to get some answers. Looks like the servers are spending a lot of time inside the kernel during bad performance windows.... more digging/answers to come tomorrow.
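If we want to see where that kernel time is going ourselves, a sketch of a system-wide profile during a bad window, assuming perf is installed on the builder and we have root:

  perf top -g                      # live view; kernel symbols dominating would match the report
  perf record -a -g -- sleep 60    # or capture 60 seconds system-wide...
  perf report                      # ...and browse the profile afterwards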
,
May 13 2016
The GCE people ARE finding unusual CPU scheduling on these builders and are looking into it further. Interesting note.... they think we might be tickling a bad behavior because we use so LITTLE memory during the build.
,
May 13 2016
As a workaround, the GCE people are suggesting that we try using 32-core builders. I'll switch a few builders (it's fairly painless to do for a small number of GCE instances) to 32 cores, and we can see what happens to performance. If it works well enough, we can switch over all the release builders.
,
May 16 2016
The GCE team has advised that 32-core builders *probably* won't be as prone to this issue. I switched nyan-release (cros-beefy3-c2) to 32-core builders to see what performance is like. It's still finishing in time, but is about an hour slower overall.
,
May 16 2016
My current proposal is that we switch all release builders to 32 cores. This will involve the loss of at least 1 release build (if things go well), and up to a full day's worth if things go badly. The hack that makes this feasible before crbug.com/609886 is fixed is to take an existing builder, convert it into an image, and then use that image.
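A sketch of that image-based hack with gcloud, using placeholder image/instance names and assuming access to the chromeos-bot project (the source instance is stopped before imaging its disk):

  gcloud compute instances stop cros-beefy37-c2 --zone us-east1-b
  gcloud compute images create cros-beefy37-image \
      --source-disk cros-beefy37-c2 --source-disk-zone us-east1-b
  gcloud compute instances create cros-beefy37-32core \
      --image cros-beefy37-image --machine-type n1-standard-32 --zone us-east1-b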
,
May 16 2016
Looking for TPM feedback before killing builds like that.
,
May 16 2016
We can probably do this tomorrow, assuming you mean the ToT release builders (and not the release-branch builders, which seem to be bare metal)? The release schedule is at https://sites.google.com/a/google.com/chromeos/resources/engineering/weekly-release-plan for reference. Since we are planning to try to do an R52 dev release tomorrow, which will come off of ToT, we probably want these builders living overnight; assuming we get a good R52 in the morning, we should be OK to skip a build after that.
,
May 16 2016
Sounds like a plan.
,
May 17 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/infra/infra_internal.git/+/c608f42c551f16cf5221777caab8cdc89ffe46e3

commit c608f42c551f16cf5221777caab8cdc89ffe46e3
Author: dgarrett <dgarrett@google.com>
Date: Tue May 17 00:23:36 2016
,
May 19 2016
Issue 613052 has been merged into this issue.
,
May 19 2016
PS: I was asked on IRC to wait until Monday for the change. I'm going to interpret that as "after Friday", and might do it over the weekend.
,
May 20 2016
Issue 613615 has been merged into this issue.
,
May 20 2016
,
May 22 2016
I have switched all GCE release builders to 32-core instances, which MAY be a workaround for this problem. I'm expecting it to slow normal compiles by about an hour (which should be acceptable), but it will hopefully avoid the abnormal performance problems. The GCE team also has some ideas to try, and I am working with them to see if they help.
,
May 25 2016
Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable? If a fix is in active development, please set the status to Started. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
May 25 2016
I'm calling this fixed, for now.
,
Jun 27 2016
Closing... please feel free to reopen if it's not fixed.
,
Jun 27 2016
,
Oct 14 2016
This issue seems to be returning. See parrot-release build 3127. Lowering to 1 because it's only affecting canaries, not the CQ. It may have been affecting canaries for a while; I will try to dig out exact numbers.
,
Oct 14 2016
I'd probably file new bugs here, but reference this, and b/28764849. It might not be the same issue.
,
Oct 14 2016
This is 100% not the same bug. The problematic builders that I see are physical.
https://uberchromegw.corp.google.com/i/chromeos/builders/parrot-release
https://uberchromegw.corp.google.com/i/chromeos/builders/samus-release
,
Oct 14 2016
,
Oct 14 2016
Thank you!