
Issue 896037

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Task

Blocked on:
issue 618475
issue 898996
issue 901020
issue 279618
issue 728971
issue 846666
issue 896377




PFQ run time is slow / long, ~5 hours

Project Member Reported by steve...@chromium.org, Oct 16

Issue description

New tracking issue for slow PFQ builds.

PFQ runs are taking 200-350 minutes (roughly 3.5-6 hours). This has been a problem in the past; it was improved (to ~2.5 hours at one point, IIRC) and now seems to be bad again.

Looking at a recent PFQ run will show a graph of times, e.g.
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932812303067472048

One of the slower builders is cyan, e.g.
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932633415418580080

The bulk of the time is in BuildPackages (~2 hours) and HWTest (~2 hours combined), with SetupBoard and SyncChrome being the next largest contributors.

BuildPackages is mostly chromeos-chrome (~1 hour) followed by arc-mesa (30 mins):
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/emergePackageDetails?cidbBuildId=3033109

goma appears to be enabled for building chromeos-chrome.

HWTest has a large gap (unclear why) and lots of time spent provisioning. Possibly not enough DUTs?
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?suiteId=248089505
 


 
Cc: -dpranke@google.com dpranke@chromium.org bpastene@chromium.org
Cc: akes...@chromium.org
Cc: bjanakiraman@chromium.org
Blockedon: 728971 279618 618475 846666
Linking issues that are related, not necessarily strictly blocking this.

Components: Infra>Client>ChromeOS>Test
Labels: -Pri-1 Pri-2
Owner: akes...@chromium.org
Status: Assigned (was: Untriaged)
Based on meeting notes, we're going to look at hardware resources to see whether we can add more devices.
Some reasons why long PFQ build times cause problems:
1. Fixes merged to the Chrome branch or CrOS ToT take longer to get picked up.
2. Since the PFQ runs less frequently, flaky test failures (which are often difficult to identify) have a greater impact.
3. Longer waits for critical bug fixes make release management more challenging.

It would be great if we could audit the HWTest performance of all of the PFQ builders for cases where increasing the number of DUTs would measurably improve build times. The information is available through Legoland, but it is rather tedious to dig through.
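To make the audit less tedious, something like the following could work; this is just a rough sketch assuming the per-builder HWTest timings can first be exported to a CSV (the file name, column names, and threshold below are hypothetical, not an actual Legoland export):

# Rough sketch only: aggregate exported HWTest timings per builder and flag
# builders where most of the suite time is provisioning/waiting, which is the
# usual symptom of too few DUTs. The CSV layout below is hypothetical.
import csv
from collections import defaultdict

def flag_dut_starved_builders(path, wait_ratio_threshold=0.5):
    totals = defaultdict(lambda: {'total': 0.0, 'waiting': 0.0})
    with open(path) as f:
        for row in csv.DictReader(f):
            t = totals[row['builder']]
            t['total'] += float(row['suite_minutes'])
            t['waiting'] += float(row['provision_and_wait_minutes'])
    for builder, t in sorted(totals.items()):
        ratio = t['waiting'] / t['total'] if t['total'] else 0.0
        if ratio >= wait_ratio_threshold:
            print('%s: %.0f%% of %d suite-minutes spent provisioning/waiting'
                  % (builder, 100 * ratio, t['total']))

if __name__ == '__main__':
    flag_dut_starved_builders('pfq_hwtest_timings.csv')  # hypothetical export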

I'm curious why the chromeos-chrome build takes so long. There's nothing indicative in the build logs from the build linked in #0 except for continuous "Still building chromeos-chrome" statements:
https://logs.chromium.org/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8932811342918765840/+/steps/BuildPackages/0/stdout

The logs point me to "Logs in /tmp/chromeos-chrome-72.0.3579.0_rc-r1-4f1tTv". Are those accessible at all? Any idea where I could look at them? (Somewhere in GS maybe?)
Cc: tikuta@chromium.org
#8,
Building chrome itself is not so slow.
It appears to finish in less than 10 minutes according to ninja's build trace.
https://chromium-build-stats.appspot.com/ninja_log/2018/10/13/swarm-cros-377/ninja_log.chrome-bot.swarm-cros-377.20181013-001712.26591.gz/trace.html
(Obtained from Report -> TestSimpleChromeWorkflow Goma ninja_log -> trace viewer.)

It is very strange, but something in the chromeos-chrome package build is time-consuming.


stevenjb@, if building arc-mesa, gdb, mit-krb5, etc. is always slow, we can investigate whether those packages can be built with goma.

Cc: hidehiko@chromium.org
+hidehiko@ who enabled goma on PFQ in issue 279618.
Re #9: Ahh, I see the links now, thanks. Though the one you're looking at appears to be from the "TestSimpleChromeWorkflow" step, which isn't the actual chromeos-chrome build. It's likely a subsequent rebuild using the chrome-cros-sdk to verify it works.

If you look at the "BuildPackages Goma ninja_log", that'll show you the ninja trace for the 1hr long chrome build:
https://chromium-build-stats.appspot.com/ninja_log/2018/10/13/swarm-cros-377/ninja_log.chrome-bot.swarm-cros-377.20181012-233637.26591.gz/trace.html

It appears to have spent 30min linking chrome? Yikes... not sure how we can optimize that.
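For what it's worth, the slow steps can also be pulled out of the downloaded .ninja_log directly rather than via the trace viewer; a minimal sketch (the local file name is just an example):

# Rough sketch: parse a downloaded .ninja_log (v5 format: start_ms, end_ms,
# mtime, output, cmd_hash, tab-separated) and print the slowest build steps,
# e.g. to confirm that the chrome link step dominates.
def slowest_steps(path, top=10):
    steps = []
    with open(path) as f:
        for line in f:
            if line.startswith('#'):
                continue  # skip the "# ninja log v5" header
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 5:
                continue
            start_ms, end_ms, _mtime, output, _hash = fields[:5]
            steps.append((int(end_ms) - int(start_ms), output))
    return sorted(steps, reverse=True)[:top]

if __name__ == '__main__':
    for duration_ms, output in slowest_steps('ninja_log.chrome-bot.swarm-cros-377'):
        print('%8.1f min  %s' % (duration_ms / 60000.0, output))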
Cc: llozano@chromium.org
30-minute link times suggest to me that a lot of swapping might be going on. How much memory do the builders have? As I recall, a laptop with 16 GB is barely adequate to link chrome in anything resembling a reasonable amount of time.

I also believe that we are doing some fairly intensive optimization when building in the chroot, which might show up during the link stage.

In response to #12, the standard builder config is n1-highmem-32 (32 vCPUs, 208 GB memory).

-- Mike
Owner: ----
Is this bug about long hwtest, or other long stages? It seems like the discussion is mostly about non-hwtest issues. I can investigate HWTest times separately.
Blockedon: 896377
Re #13: Those are indeed the specs of the machine that ran the build in question:
https://chrome-swarming.appspot.com/bot?id=swarm-cros-377

Do cros bots have machine metrics in monarch that track cpu/mem/disk usage over time? Something like sysmon? It would be interesting to see what spikes during the chromeos-chrome build.
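If nothing like sysmon is wired up for these bots, even a crude sampler started alongside the build would answer the swapping question. A minimal sketch, assuming psutil is available on the bot (the output path is arbitrary):

# Crude sampler, not existing tooling: log available memory and swap usage
# every few seconds while BuildPackages runs, to see whether the long chrome
# link is actually swapping.
import time
import psutil

with open('/tmp/mem_samples.log', 'w') as log:
    while True:  # run until killed
        vm = psutil.virtual_memory()
        swap = psutil.swap_memory()
        log.write('%d avail_gb=%.1f swap_used_gb=%.1f\n'
                  % (time.time(), vm.available / 2**30, swap.used / 2**30))
        log.flush()
        time.sleep(5)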
Cc: yunlian@chromium.org manojgupta@chromium.org
#9, Ah thanks.
If the link time is that long, I guess it's doing heavy link-time optimization.
https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/master/chromeos-base/chromeos-chrome/chromeos-chrome-9999.ebuild#67

yunlian, any thought?

Status: Untriaged (was: Assigned)
Moving back to Untriaged since this has no owner (akeshet@ is assigned to blocked-on issue 896377). I would really appreciate it if someone could take ownership of the overall health of the PFQ; for now, at least, that means helping ensure the 'Blocked on' issues get some love and that the end result is a more reasonable 2-3 hour cycle time for the PFQ.

https://logs.chromium.org/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8932811342918765840/+/steps/BuildPackages/0/stdout

The BuildPackages stage takes 2 hours, but the Chrome build takes only 1 hour; the rest of the time (1 hour) is spent rebuilding other packages for various reasons.
IIRC, ThinLTO adds around 20 mins to the Chrome build, so even if we remove it, it would be hard to save more than 20 mins unless the other package rebuilds could also be reduced.
I thought we used goma, so chromeos-chrome builds should be faster (even if you take 20 mins for thin-LTO). Is this expected?
Owner: malaykeshav@chromium.org
Status: Assigned (was: Untriaged)
Gardeners own the PFQ and should feel encouraged to ask the owners of the blocking bugs for status updates on a regular basis. Assigning to the current Gardener. Feel free to hot-potato this meta bug between Gardener shifts.
Cc: amoylan@chromium.org
Labels: -Type-Bug Type-Task
Blockedon: 898996
thinlto is used by devs/cq/etc... as well, so turning it off just for the Chrome PFQ won't help.  we'd have to turn it off everywhere and then only leave it on for release builds.

i've split build_packages slowness into issue 898996.
From f2f discussion with the goma team, given the way thinLTO is architected, goma's caching is not going to give any benefits, so we will be stuck with the 1 hr+ build for chromeos-chrome as long as we use thinLTO (perhaps a little better if we use a different parallelization factor).

Why are we doing thinLTO builds in the PFQ? If this is to catch compiler issues before the release, can we not do this in a different builder that can be off the critical path of the PFQ?
We are working on the following, which will improve ThinLTO build times:
- Moving to LLD, which will allow us to enable a mechanism for ThinLTO caching (not the same cache as Goma, and it will only help incremental builds). We are working on a bug that shows up with LLD and has proven very difficult to triage. After we move to LLD we need to test the caching mechanism, so we will maybe have this ready in 3 or 4 weeks.
- Building the compiler with PGO + ThinLTO, which has been shown to improve build times by roughly 15%. This is planned for this quarter.
- Enabling Goma for ThinLTO backend executions. We have been in contact with the Goma team about this. They added some functionality to Goma so that we can do this, but the changes in the compiler tooling still need to be worked on. It is not clear how much this will help with the build time. This will probably be done next quarter.

IMO, if ThinLTO build times are a problem, we could enable ThinLTO only on the release builders. I suggested this before, but some of us want to make sure that the PFQ builders are as close as possible to the release builders (an understandable point of view). As a reference point, for Chrome validation on Linux, ThinLTO is only enabled on the release builders (AFAIK).

 
I don't think pushing complexity onto the release builders is a good idea, as it will require gardening a second time: first on the PFQ and then again making sure that the release builders are not failing. I think in the ideal case the release builders would not build Chrome at all (but use the binaries from the PFQ/waterfall).
I'm not convinced we should be doing thinLTO builds on PFQ, but release builders shouldn't be the first time we do that either. 

Is it not possible to do thinLTO build validation on a separate track that doesn't hold up the PFQ? 

Cc: martis@chromium.org
Owner: jdufault@chromium.org
re: #29, what is the difference between doing it in a separate builder compared to just doing it in the release builders?
Release builders are sheriffed too.
Re: #31, so that release builders are not the first time we see thinLTO problems, and the toolchain team has more time to fix issues with thinLTO.
I am just trying to run with the idea.

The time between PFQ and release build is too short to take corrective action.

The only way to give the toolchain team advance notice is to add ThinLTO (or a full toolchain build) to the informational Chromium waterfall (like what the Chrome gardeners are watching in addition to the PFQ). But because Chrome is supposed to rev once a day, the toolchain team would get << 24h advance notice ahead of the gardener/release builder problems. If a possible few hours of early notice would make a difference to the toolchain team, then we should certainly consider adding such builders.

Thinking about it, I doubt there would still be true early notice of unexpected failures, as the ThinLTO (or full toolchain) build takes a long time. But maybe one could *interleave* hourly informational builders on the Chromium waterfall, especially to verify/test fixes?

I think the best approach would be to have only one fully featured build, straight on the Chromium waterfall or PFQ, creating all chromeos-chrome binaries. But maybe the fully featured informational builder could morph into this over time? What we really should not do is introduce the (N+1)th divergent way of building Chrome.

I guess if we really want to give my team time to fix any issues with ThinLTO, then the builder testing it should be a "blocking" builder, like the PFQ builders. Otherwise, we would be in a very urgent rush to fix any ThinLTO issue because Chrome would already have been delivered into Chrome OS.
I think the build time issue will improve with the solutions mentioned in #27, in particular the first one, since the PFQ is an incremental build. Can we wait for that and see how it goes?


Blockedon: 901020
I have created a separate issue 901020 for tracking chromeos-chrome build slowness so that we can move discussion of that part of the problem there and keep this as an overall tracking issue.

A recent cyan build took ~4 hours:
- 40 minutes of initialization, checkout, and sync
- 70 minutes in BuildPackages (primarily chromeos-chrome)
- 35 minutes in HWTest (sanity), which appears to block other HWTest runs
- 100 minutes in HWTest (bvt-arc), which appears to include 40 minutes with no results; the DUTs appear busy the whole time, so that is unclear. This is tracked in issue 896377.

Cc: sa...@chromium.org
Owner: wutao@chromium.org
Owner: cindyb@chromium.org
Status: Closed (was: Assigned)
Assigning to cindyb as TPM to track PFQ progress.
Status: Assigned (was: Closed)
Labels: PFQ-performance-tracking
Looking at recent master PFQ runs, they take a long time (up to 380 mins) when there is a failure. However, if a run is green, it only takes up to 96 mins.

https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=master-chromium-pfq&buildBranch=master
That's not quite correct:
1. Whenever a chrome branch tag is created, the PFQ is run.
2a. If there are new chrome changes, *or chromeos changes that may impact the chrome build*, chrome is rebuilt and all tests are run. This takes ~5 hours.
2b. If chrome does not need to be rebuilt *on any board*, the PFQ exits early. It takes ~90 minutes to determine this.

The PFQ appears to have been failing in tests lately, which is why it seems to take longer when it fails, but there was a successful run on Tuesday that took 289 minutes:
2018-11-06 13:25	2018-11-06 17:50	289.0

I've always felt that we are overly conservative in determining when to run the PFQ and that 2b could be sped up considerably, but it's complicated and relatively low on the priority list.
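For anyone new to this thread, the gist of 1/2a/2b in pseudo-Python (an illustration of the behavior described above, not the actual cbuildbot logic):

# Illustration only -- not the real cbuildbot code paths.
def run_pfq(chrome_branch_tag_created, new_chrome_changes,
            cros_changes_affecting_chrome):
    if not chrome_branch_tag_created:
        return 'not triggered'
    if new_chrome_changes or cros_changes_affecting_chrome:
        # Rebuild chrome on every board and run all tests: ~5 hours.
        return 'full run'
    # Determining that no board needs a chrome rebuild still takes ~90 minutes.
    return 'early exit'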

Thanks stevenjb@.

Another thing I noticed is that sometimes the OAuth token TTL expires. This also took one hour of retrying.

========
Will return from run_suite with status: OK
03:35:23: INFO: Re-run swarming_cmd to avoid buildbot salency check.
03:35:23: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/t/cbuildbot-tmpGfTtti/tmpX5r97V/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name peach_pit-chrome-pfq/R72-11241.0.0-rc2-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:PFQ' '--tags=suite:bvt-inline' '--tags=build:peach_pit-chrome-pfq/R72-11241.0.0-rc2' '--tags=task_name:peach_pit-chrome-pfq/R72-11241.0.0-rc2-bvt-inline' '--tags=board:peach_pit' -- /usr/local/autotest/site_utils/run_suite.py --build peach_pit-chrome-pfq/R72-11241.0.0-rc2 --board peach_pit --suite_name bvt-inline --pool bvt --file_bugs True --priority PFQ --timeout_mins 180 --retry True --max_retries 5 --minimum_duts 4 --suite_min_duts 3 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 98522302L, 'cidb_build_id': 3114487, 'datastore_parent_key': ('Build', 3114487, 'BuildStage', 98522302L)}" -m 256052517
03:35:49: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
03:35:49: INFO: Refreshing access_token
04:20:30: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
04:20:30: INFO: Refreshing access_token
04:31:12: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
04:31:12: INFO: Refreshing access_token
04:35:54: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
04:35:54: INFO: Refreshing access_token


https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8930436898582927280/+/steps/HWTest__bvt-inline_/0/stdout
OAuth token refreshing is unrelated; it's purely a diagnostic message for overhead going on in the background while waiting for something else. The thing that took an hour in that log sample was the Swarming job that was scheduled above it.
Cc: updowndota@chromium.org weidongg@chromium.org
Cc: -martis@chromium.org
Cc: agawronska@chromium.org
Cc: -weidongg@chromium.org
Cc: -agawronska@chromium.org
