
Issue 610536

Starred by 5 users

Issue metadata

Status: Verified
Owner:
Closed: Oct 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




BuildPackages [afdo_use] takes > 6 hours to complete (sometimes)

Project Member Reported by dgarr...@chromium.org, May 10 2016

Issue description

BuildPackages is completing, but consistently taking over 6 hours to do so.

https://uberchromegw.corp.google.com/i/chromeos/builders/terra-release/builds/10

This is enough to cause the entire build to time out, as well as being a general problem. There may be a few other boards with the same issue... looking.
 
kefka-release is having similar problems.

https://uberchromegw.corp.google.com/i/chromeos/builders/kefka-release
Summary: terra/kefka BuildPackages takes > 6 hours to complete. (was: terra BuildPackages takes > 6 hours to complete.)
I pulled the logs from the cited terra build, and did a quick bit of
processing to find the longest-running ebuilds.  Here's what
I got:
61 22.5 sys-kernel/chromeos-kernel-3_18-3.18-r1247
62 16.6 x11-libs/libX11-1.6.0-r1
66 58.9 media-libs/libmtp-0.0.1-r21
70 14.1 chromeos-base/autotest-0.0.2-r7243
76 55.0 sys-fs/e2fsprogs-1.42.12
87 47.6 dev-libs/glib-2.34.3-r4
89 11.0 net-dns/c-ares-1.7.5-r1
90 28.1 chromeos-base/chromite-0.0.2-r2372
93 40.6 sys-devel/binutils-2.25.51-r57
98 29.7 net-misc/openssh-6.6.1_p1-r2
109 38.5 sys-libs/gcc-libs-4.9.2-r32
119 43.6 sci-libs/fftw-3.2.2
122 1.6 chromeos-base/autotest-deps-ltp-0.20150119-r12
143 47.3 chromeos-base/chromeos-chrome-52.0.2727.0_rc-r1
175 39.2 sys-devel/gdb-7.10.20160211

The first number is minutes, the second is seconds.
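
For anyone who wants to repeat that processing: the thread doesn't show the exact commands, so this is only a rough sketch, and build_packages.log is a hypothetical local copy of the builder log. It pulls the per-package times from the "Completed <pkg> (in XXmYY.Zs)" lines and prints them as "minutes seconds package", slowest last:

  grep -o 'Completed [^ ]* (in [0-9]*m[0-9.]*s)' build_packages.log \
    | sed 's/Completed \(.*\) (in \([0-9]*\)m\([0-9.]*\)s)/\2 \3 \1/' \
    | sort -n \
    | tail -n 15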

It's a bit odd that it took more than an hour to build the kernel;
that feels like a system that's way overloaded.

More disturbing:  Why are we building gdb?  And at 2:55:39, that's
an awful lot of potentially wasted processing...

Labels: -Pri-2 Pri-1
Summary: terra/kefka/heli BuildPackages takes > 6 hours to complete. (was: terra/kefka BuildPackages takes > 6 hours to complete.)
We are seeing the same problem on heli-release.

Randall, can you find an owner for this? Maybe someone who knows something about the boards in question?
Cc: stevenh@chromium.org vineeths@chromium.org agnescheng@chromium.org
stevenh, agnescheng, vineeths:

Y'all are listed as contacts for these boards.  Can you point us to the platform engineers for those boards?

Comment 7 by stevenh@google.com, May 10 2016

Kefka is a standard Braswell "Strago" platform. ttsang@ is the SIE for kefka.
Cc: bhthompson@chromium.org
Heli is based on the Rambi platform. bhthompson@ may be able to help. 

Comment 9 by tts...@google.com, May 10 2016

Kefka is a minimal board; it has less in its overlay than strago.

Looks like something happened around May 7 to May 9; a few other builders are failing (taking longer than 7 hours): lulu/cave/banon/celes/cyan/squawks/ultima/wizpig, in addition to the ones above. They are different boards (bdw/bsw/bt).
The gdb package is brought in when one builds a test image, so the package is expected to be there, but it is not expected to take 2 hours to build. I just tried a local build of it and it finished in far less time than that. I don't think this is specific to these boards, but rather some situation on the build hardware?
If you build locally, don't forget to use "--nousepkg". The release builders don't use binary prebuilts, but instead build everything from scratch.
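
For reference, the kind of local from-source run being described looks roughly like this (run inside the Chrome OS chroot; "heli" is just an example board):

  # --nousepkg skips binary prebuilts so everything is compiled from source,
  # matching what the release builders do.
  ./build_packages --board=heli --nousepkg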

I started a heli build locally before, but it's still running.
These builders were reconfigured (ungrouped) on the 6th and the 7th in a way that would make these failures much more apparent. They were also moved to new build machines, most of which are GCE based. I'm looking to see whether all of the problem boards are GCE based.
And looking....  lulu/cave don't have release builders at all.

banon    duration: 4:08:33 median 4:56:24 mean 4:00:41 min 1:51:49 max 6:11:23
celes    duration: 2:05:10 median 2:31:17 mean 3:19:25 min 1:55:43 max 5:31:15
  (also has a build with 7+ hours that was interrupted)
cyan     duration: 4:07:09 median 4:58:13 mean 4:14:15 min 1:54:21 max 5:26:02
heli     duration: 4:05:55 median 5:54:20 mean 4:47:12 min 1:54:20 max 6:20:13
kefka    duration: 3:03:22 median 2:22:45 mean 2:48:23 min 1:56:22 max 5:22:01
squawks  duration: 4:07:09 median 4:58:13 mean 4:14:15 min 1:54:21 max 5:26:02
terra    duration: 4:07:09 median 4:58:13 mean 4:14:15 min 1:54:21 max 5:26:02
ultima   duration: 4:07:09 median 4:58:13 mean 4:14:15 min 1:54:21 max 5:26:02
wizpig   duration: 4:24:19 median 5:53:56 mean 4:47:07 min 1:59:36 max 6:43:45
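
(The thread doesn't show how these summaries were gathered. For reference, a minimal GNU awk sketch that turns a list of H:MM:SS BuildPackages durations, one per line in a hypothetical durations.txt, into the same kind of min/median/mean/max summary, printed in seconds:)

  awk -F: '{ s = $1*3600 + $2*60 + $3; a[NR] = s; sum += s }
           END {
             n = asort(a)                      # asort() is GNU awk only
             med = (n % 2) ? a[(n + 1) / 2] : (a[n/2] + a[n/2 + 1]) / 2
             printf "min %ds median %ds mean %ds max %ds\n", a[1], med, sum/n, a[n]
           }' durations.txt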


The stats across all release builders since the reorg....

BuildPackages:
  duration:  median 0:29:50 mean 0:47:00 min 0:22:37 max 2:24:51

BuildPackages [afdo_use]:
  duration:  median 2:24:42 mean 3:13:25 min 1:33:07 max 7:16:16


All of those builders are GCE builders, but we added over 80 of them, and not all have this problem.

What I am seeing is that this is tied to AFDO use, but I don't see why it would be so variable.

What does AFDO do differently?
I used `emerge-heli --getbinpkg=n gdb` and it still built quickly (and appears to be performing compilation). 
AFDO looks like some kind of build optimization; I am not very familiar with it.
Neither am I, but I do know that the output is a lot larger. Could this be a disk IO performance issue on the new builders?
Summary: BuildPackages [afdo_use] takes > 6 hours to compelete (sometimes) (was: terra/kefka/heli BuildPackages takes > 6 hours to compelete.)
Cc: akes...@chromium.org dgarr...@chromium.org
 Issue 610878  has been merged into this issue.
Cc: cmt...@chromium.org
Adding someone from the toolchain team to see if they have input.
PS: It's possible that the GCE builders have decreased (or inconsistent) drive IO performance... I'm wondering if that could explain the variability?
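
(Not something from this thread, but one quick way to spot-check raw write throughput on a suspect builder; the test file name and size are arbitrary:)

  # Write 4 GB with direct IO (bypassing the page cache); GNU dd reports the
  # throughput when it finishes. Remove the throwaway file afterwards.
  dd if=/dev/zero of=./ddtest bs=1M count=4096 oflag=direct
  rm -f ./ddtest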
Cc: jrbarnette@chromium.org
From jrbarnette:

The failure in the -294 build on squawks looks a lot like  bug 610536 .

I discovered that the logs for build_packages now include the output
from build processes every 60 minutes or so.  That means it's possible
to see output snapshots from the longest running builds.  Most interesting
(to me) was the output from the gdb build.  That showed the system spent
more than an hour in "configure"(!).  The configure script shouldn't
really take that long.  At all.

If I had to guess at this point, I'd say the builder system is overloaded,
possibly from memory pressure from high levels of parallelism.
Owner: dgarr...@chromium.org
The GCE builders may have reduced memory compared to a physical builder. If so, memory pressure could explain this.

It's also a problem that can be solved relatively easily. I'll compare available ram in the two classes of builders.
Either of those could be the case; significant changes in memory bandwidth or storage bandwidth would impact build times. Is it safe to assume we always get the same number of available cores/threads in these instances?

How consistent do we expect the GCE instances to be?
Labels: -current-issue
We have dedicated CPUs and ram, but I don't know how the virtual disks work.

Even though the memory is consistent, thrashing could still explain the variability, since its behavior can be somewhat non-deterministic.

I'll increase the memory on one of the problematic builders, and check disk usage graphs for anything educational.
Cc: dhadd...@chromium.org kathrelk...@chromium.org
 Issue 610442  has been merged into this issue.
I'm confident that memory isn't the problem. The GCE instances have 230 GB of RAM, while our physical builders only have 128 GB.

On the other hand, looking at the history for cros-beefy37, I don't see exactly what I'd expect to see if disk IO is the issue.

Sample build:
  https://uberchromegw.corp.google.com/i/chromeos/builders/heli-release/builds/10

Builder stats can be seen here:
  https://pantheon.corp.google.com/compute/instancesDetail/zones/us-east1-b/instances/cros-beefy37-c2?project=chromeos-bot&graph=GCE_CPU&duration=P4D

Picking a build that spent 6 hours 20 minutes in build packages (more than 3x a build that used 1 hour 55 minutes).

It started May 9th at 7:40 PM, and finished May 10th at 3:24 AM.

The disk IO in bytes per second looks very similar to most other builds (including passing builds). The max IO rate is about 74 MB/second for only a brief time, with some passing builds reaching about 84 MB/second in similar brief spikes. However, the number of disk operations appears to be lower than in later builds that passed.

CPU load on the machine stays near max for the entire time build packages is running. Very similar to other builds, except that the CPU stays pegged for much longer. If IO were the problem, I'd expect CPU load to drop while blocked on IO.
To me, that looks like more work is being done for some reason during the slow build_packages, and disk operations are reduced because it's taking longer to finish each part of the compile.
That makes it sound like a pure CPU problem. Is it possible we somehow get fewer virtual CPUs in these cases, or that they slow down?

In such a VM setup, is it possible that the instance exposes 64 threads but is not really giving us full time on them (or that we have fewer physical cores allocated)?
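
A quick way to sanity-check both questions on a builder (illustrative only, not something captured in this thread):

  # How many CPUs does the guest actually see?
  nproc
  # Sample system stats every 5 seconds for 30 seconds; a persistently nonzero
  # "st" (steal) column means the hypervisor is withholding CPU time, and high
  # "si"/"so" would indicate swapping.
  vmstat 5 6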
Owner: llozano@chromium.org
Each CPU is supposed to be a dedicated core (we pay more for that).

Passing to someone with AFDO experience for suggestions.
Owner: yunlian@chromium.org
AFDO only applies to the Chrome build. No other packages are built with AFDO. The Chrome build is taking about 2 hours, which I don't believe is unusually high.
What is very strange is this:
Completed sys-devel/gdb-7.10.20160211 (in 175m39.2s)

I don't understand why building GDB would take 3 hours (more than Chrome), nor why GDB is being built at all. GDB should be installed as part of setup_board...
Is there some package that has a bad dependency on GDB?
Asking Yunlian to take a look.



Also, I don't understand the numbers from #13. How can a build_packages take less than 30 minutes? The full build of Chrome alone takes more than that, and release builders rebuild everything. There is no way it should take less than 30 minutes.
Labels: Build-Toolchain
GDB gets built because we build a test image as part of the artifacts, and test images have GDB as a dependency.

Why it takes 3 hours to build GDB is the million dollar question.
I started a local cbuildbot for terra-release on my machine (Z620, 32 cores) last night.
The times for Chrome and GDB are:

Completed chromeos-base/chromeos-chrome-52.0.2733.0_rc-r1 (in 132m12.8s)
which is NOT very different from what we see in the GCE instance

and for GDB:

Pending 175/601, Building 21/21, [Time 04:11:36 | Elapsed 65m55.2s | Load 74.79 82.04 62.62]
Completed chromeos-base/autotest-tests-p2p-0.0.1-r2037 (in 1m19.9s)
Still building openssh-6.6.1_p1-r2 (16m45.5s). Logs in /tmp/openssh-6.6.1_p1-r2-6_UjOw
Completed chromeos-base/autotest-server-tests-shill-0.0.1-r2031 (in 1m14.4s)
Pending 175/601, Building 19/19, [Time 04:11:50 | Elapsed 66m9.6s | Load 69.22 80.48 62.43]
Completed sys-devel/gdb-7.10.20160211 (in 49m38.7s)

Which is pretty high (49 minutes), but way lower than on the GCE instance (3 hours).
You can see that the machine is quite busy when GDB finishes building.
So maybe we have too much parallelism in the build, and the GCE instances don't handle that amount of parallelism well?

on the GCE instance:

Pending 62/601, Building 9/9, [Time 06:09:12 | Elapsed 219m2.0s | Load 80.91 98.7 103.48]
Pending 62/601, Building 9/9, [Time 06:10:12 | Elapsed 220m2.9s | Load 86.68 97.82 102.92]
Completed chromeos-base/cryptohome-0.0.1-r1197 (in 1m6.1s)
Still building common-assets-0.0.1-r37 (2m2.4s). Logs in /tmp/common-assets-0.0.1-r37-x3vWUS
Started chromeos-base/chromeos-login-0.0.2-r2306 (logged in /tmp/chromeos-login-0.0.2-r2306-8iMdRj)
Completed sys-devel/gdb-7.10.20160211 (in 175m39.2s)

You can see the load is pretty high here too. A few minutes earlier, the load reaches almost 129:
Pending 72/601, Building 11/11, [Time 06:05:52 | Elapsed 215m42.0s | Load 128.99 109.76 106.99]

When GDB completes, chromeos-chrome has not even started. So, AFDO has nothing to do with this:

Pending 24/601, Building 6/6, [Time 06:44:02 | Elapsed 253m52.2s | Load 31.53 49.31 63.56]
Started chromeos-base/chromeos-chrome-52.0.2727.0_rc-r1 (logged in /tmp/chromeos-chrome-52.0.2727.0_rc-r1-4GuTJZ)

Can we limit the maximum parallelism of the build on GCE instances?




I'll note that the long gdb build time seems to be a
hallmark of this problem.  I downloaded build logs from a
successful build (with a faster run time), and in that case
the gdb build time was much shorter.

My guess would be that gdb is a victim rather than a cause of
the problem:  Something that commonly runs in parallel with
the gdb build sucks up all the time, and causes trouble.

The GCE instances used are 64 cores, so 128 doesn't seem that high. Before the builders were ungrouped, loads between 200-300 were normal.

The 30-minute build... that value is the shortest seen on any release builder, including some that don't build Chrome at all.

Hmm... are we using AFDO for EVERY package? Could we disable it for test-image-only packages like GDB? Or perhaps only enable it for Chrome and a few select others? I don't know enough about AFDO to know how much sense this makes.

That might also help with some space issues that have been seen.
As I said in #33, AFDO only applies to the build of Chrome (chromeos-base/chromeos-chrome) and, as noted in #40, chromeos-chrome has not even started building when the build of GDB is going on. 

I will agree with jrbarnette in #38. It seems to me the gdb build is the victim and not the cause.

I don't think Yunlian or I are good owners for this problem. Someone with more knowledge about GCE instances should own this.

Owner: dgarr...@chromium.org
Sorry, I must have missed #33. Thanks for the input.
Handling of parallelism on the machine depends on other factors besides the number of cores, so it may be worth running experiments where we limit the number of parallel builds and see if it behaves better.

I have done this in the past by setting the following variables

EMERGE_DEFAULT_OPTS="--jobs=12 --load-average=10" MAKEOPTS="-j12 --load-average=10"

Use values that make sense for that machine; for example, use a load average that does not exceed the number of cores.
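
As a concrete sketch of that suggestion (values copied from above and not tuned; the board name is just an example):

  # Cap emerge and make parallelism before running build_packages.
  export EMERGE_DEFAULT_OPTS="--jobs=12 --load-average=10"
  export MAKEOPTS="-j12 --load-average=10"
  ./build_packages --board=terra --nousepkg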
Just making notes for myself:

cros-beefy2-c2 has performed well consistently:
  https://uberchromegw.corp.google.com/i/chromeos/builders/rambi-release

cros-beefy61-c2 has varied over time:
  https://uberchromegw.corp.google.com/i/chromeos/builders/squawks-release
I sent email to bigcluster-users@ and am starting to get some answers. Looks like the servers are spending a lot of time inside the kernel during bad performance windows.... more digging/answers to come tomorrow.
The GCE people ARE finding unusual CPU scheduling on these builders and are looking into it further. Interesting note.... they think we might be tickling a bad behavior because we use so LITTLE memory during the build.
As a work around, the GCE people are suggesting that we try using 32 core builders. I'll switch a few builders (it's fairly painless to do for a small number of GCE instances) to 32 core, and we can see what happens to performance. If it works well enough, we can switch over all the release builders.

The GCE team has advised that 32 core builders *probably* won't be as prone to this issue.

I switched nyan-release (cros-beefy3-c2) to a 32-core instance to see what performance is like. It's still finishing in time, but is about an hour slower overall.

My current proposal is that we switch all release builders to 32 core.

This will involve the loss of at least one release build (if things go well), and up to a full day's worth if things go badly.

The hack that makes this feasible before  crbug.com/609886  is fixed is to take an existing builder, convert it into an image, and then use that image.
Looking for TPM feedback before killing builds like that.
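
For reference, the GCE side of that plan looks roughly like the following gcloud sketch (instance name, zone, machine type, and image name are placeholders, not commands taken from this thread):

  # Resize an existing builder to a 32-core machine type (it must be stopped).
  gcloud compute instances stop cros-beefy3-c2 --zone us-east1-b
  gcloud compute instances set-machine-type cros-beefy3-c2 \
      --zone us-east1-b --machine-type n1-standard-32
  gcloud compute instances start cros-beefy3-c2 --zone us-east1-b

  # Or capture an existing builder's boot disk as a reusable image.
  gcloud compute images create cros-builder-image \
      --source-disk cros-beefy3-c2 --source-disk-zone us-east1-b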
Cc: gkihumba@chromium.org
We can probably do this tomorrow, assuming you mean the ToT release builders (and not the release-branch builders, which seem to be bare metal)?

The release schedule is at https://sites.google.com/a/google.com/chromeos/resources/engineering/weekly-release-plan for reference. Since we are planning to try an R52 dev release tomorrow, which will come off of ToT, we probably want these builders living overnight; assuming we get a good R52 in the morning, we should be OK to skip a build after that.
Sounds like a plan.
Project Member

Comment 53 by bugdroid1@chromium.org, May 17 2016

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal.git/+/c608f42c551f16cf5221777caab8cdc89ffe46e3

commit c608f42c551f16cf5221777caab8cdc89ffe46e3
Author: dgarrett <dgarrett@google.com>
Date: Tue May 17 00:23:36 2016

Issue 613052 has been merged into this issue.
PS: I was asked on IRC to wait until Monday for the change. I'm going to interpret that as "after Friday", and might do it over the weekend.
 Issue 613615  has been merged into this issue.
Cc: -vineeths@chromium.org
I have switched all GCE release builders to 32 core instances, which MAY be a workaround for this problem. I'm expecting it to slow normal compiles by about an hour (which should be acceptable), but will hopefully avoid the abnormal performance problems.

The GCE team also has some ideas to try, and I am working with them to see if they help.
Project Member

Comment 59 by sheriffbot@chromium.org, May 25 2016

Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable?

If a fix is in active development, please set the status to Started.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Status: Fixed (was: Assigned)
I'm calling this fixed, for now.
Closing... please feel free to reopen if it's not fixed.
Status: Verified (was: Fixed)
Labels: -Pri-0 iptaskforce Pri-1
Status: Untriaged (was: Verified)
This issue seems to be returning. See parrot-release build 3127. Lowering to Pri-1 because it's only affecting canaries, not the CQ. It may have been affecting canaries for a while; I will try to dig out exact numbers.
I'd probably file new bugs for this, referencing this issue and b/28764849. It might not be the same issue.
Status: Verified (was: Untriaged)
Filed crbug.com/656205 to cover the new problem.
Thank you!
