
Issue 593172


Issue metadata

Status: WontFix
Owner: ----
Closed: May 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocking:
issue 591630
issue 611139




Consider moving the TestSimpleChromeWorkflow stage from PFQ builders

Project Member Reported by steve...@chromium.org, Mar 8 2016

Issue description

Currently we run cros chrome-sdk on every PFQ builder.

The run time for the step averages ~1 hour. Net time might be less, but it still almost certainly adds a significant additional run time to the PFQ builders.

We also have chrome-sdk coverage on the following builders (all of which perform incremental chrome builds in the cros chrome-sdk environment):

* chromium tree:
** ChromiumOS x86-generic Compile
** ChromiumOS amd64-generic Compile
** ChromiumOS daisy Compile
* chromeos.chrome tree:
** amd64-generic Trusty (Informational)

Advantages:
* Reduce build time of PFQ builders.
** This would help identify failures sooner, and potentially reduce the impact of flakiness.
* Potentially reduce load-based flakiness due to slow builders without increasing the maximum allowed build time (see issue 593089)

Disadvantages:
* Would lose coverage of non-incremental SimpleChrome builds
* Would lose coverage for builds based on prebuilts for a specific board type

Options:
* Do nothing - rely on other changes to make the PFQ more robust
* Add additional informational chromeos.chrome builders to cover each board type
* Add additional separate PFQ builders to build SimpleChrome in parallel.

 
Labels: -Pri-2 -gardner gardener Pri-1
I separated out comment #9 from issue 591630 so that we can discuss this proposal separately here.

I was motivated to write this up as I wait impatiently for the latest PFQ run to complete.

Question: Does the cros chrome-sdk build use anything from the PFQ builder, i.e. the just-built image? If so, adding separate PFQ builders wouldn't achieve the current results. (Which may be fine. My personal leaning is that these could be informational: breaking SimpleChrome blocks developers, which is bad, but not, I think, as bad as blocking Chrome uprevs for Chrome OS, and we really need to pay more attention to the chromeos.chrome informational builders anyway.)


Also: there is the potential for a two-for-one win here if we can develop a cros chrome-sdk recipe that can also run a basic VMTest or two (issue 591623). With a handful of "simple-chrome-vmtest" builders we can add simple chrome and VMTest coverage with a fast turnaround before the PFQ.


Also note: the cros chrome-sdk builds in the PFQ do not appear to take advantage of goma.

That all makes sense and sounds good to me. The only thing I don't know is whether that stage, on the pfq, publishes an artifact that developers rely on.

davidjames@ care to comment?
Do we know if the chromium tree builders (ChromiumOS x86-generic Compile, et al) are informational or tree closers?

Let's also note that in practice we haven't seen any advantage from this step - it seems rare that the simple chrome flow breaks but BuildPackages succeeds on the PFQ. At least I've never seen this happen. I believe previous simple chrome breakages were due to cros changes, not chromium changes. Someone correct me if I'm wrong.
Cc: dgarr...@chromium.org
ChromeSDK doesn't produce artifacts in the PFQ but it does produce artifacts in release builders. We have definitely seen cases before where new versions of Chrome broke the SimpleChrome stage (e.g. Chrome started using new packages but those packages weren't installed on ChromeOS builders). I agree it's relatively rare so far though.

In the future, if you add more complexity to SimpleChrome (e.g. running VMTest from SimpleChrome), you'll want to test that too somehow.

On the few ChromeSDK runs I looked at, ChromeSDK wasn't the bottleneck, so removing it will likely have little impact (<10m) on the build speed. Take an example:

https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-chrome-pfq/builds/8372 -- bottleneck is that it timed out in AFDODataGenerate.

https://uberchromegw.corp.google.com/i/chromeos/builders/peppy-chrome-pfq/builds/1781 -- bottleneck was flake in VMTest (resolved by retry). This might actually be helped by removing ChromeSDK stage. Lumpy was only ~2 minutes behind it though so net speed up is ~2 minutes.

https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_skate-chrome-pfq/builds/2136 -- bottleneck was HWTest

HWTest seems to be the thing to speed up here. HWTest runs on a different machine so getting rid of ChromeSDK has little or no impact performance-wise on PFQ builders.
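
The "net speed up is ~2 minutes" reasoning above can be sketched as follows. This is a minimal illustration, assuming the overall run finishes only when the slowest builder finishes; the function name `net_speedup` and all finish times are made up for the example and are not taken from the linked builds.

```python
def net_speedup(finish_minutes, builder, saved_minutes):
    """Minutes by which the whole run finishes earlier if `builder` saves `saved_minutes`.

    Assumption: the overall run completes when the slowest builder completes.
    """
    before = max(finish_minutes.values())
    adjusted = dict(finish_minutes)
    adjusted[builder] = finish_minutes[builder] - saved_minutes
    after = max(adjusted.values())
    return before - after


# Hypothetical finish times in minutes (illustrative only).
finish = {"peppy": 240, "lumpy": 238, "daisy_skate": 200}

# Saving 40 minutes on the slowest builder only helps until the
# next-slowest builder becomes the bottleneck.
print(net_speedup(finish, "peppy", 40))
# Saving time on a builder that is not the bottleneck changes nothing.
print(net_speedup(finish, "daisy_skate", 40))
```

Under these assumed numbers, the gain is capped by the gap between the slowest and the next-slowest builder, however much time the stage itself took.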

If you wanted to get rid of ChromeSDK to improve performance, removing it would have the most benefit on the release-group builders. We could consider getting rid of it there and restructuring how ChromeSDK works (e.g. just build ChromeSDK images in the Chrome PFQ and forget about building it in release builders). That might be a more productive direction for speeding things up. Because ChromeSDK is a bottleneck on those builders, you could see some large speedups in the release group builders from doing this (~1hr).
(Keep in mind though, my suggestion isn't as easy as setting chrome_sdk=False on release builders -- you'd also need to change how the artifact generation works. But might be worth it since release build times are a focus right now.)
Ah, I was assuming that the delta time would be greater. I suppose it might
be on builders that don't run HWTests? (I tried to figure out what the
actual time impact would be and failed - I think my brain is too crowded
right now).

Getting some working SimpleChrome + VMTest subset seems like a higher
priority right now all things considered, but we'll keep this one on the
back burner.

Thanks for the detailed info!
We run many stages in parallel, so figuring out the impact on total build time is rarely easy.
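
To illustrate why the impact is hard to read off from a single stage's run time, here is a rough sketch that models a build as stages with dependencies and computes the wall-clock time as the longest path. The stage names come from this thread, but the durations, the dependency graph, and the helpers `total_build_time` and `without` are assumptions made up for the example, not the real cbuildbot scheduling.

```python
from functools import lru_cache


def total_build_time(durations, deps):
    """Wall-clock minutes if each stage starts as soon as all its deps finish."""

    @lru_cache(maxsize=None)
    def finish(stage):
        # A stage with no dependencies starts at time 0.
        start = max((finish(d) for d in deps.get(stage, ())), default=0)
        return start + durations[stage]

    return max(finish(stage) for stage in durations)


def without(durations, deps, removed):
    """Return copies of durations/deps with one stage removed."""
    new_durations = {s: t for s, t in durations.items() if s != removed}
    new_deps = {
        s: [d for d in ds if d != removed]
        for s, ds in deps.items()
        if s != removed
    }
    return new_durations, new_deps


# Hypothetical durations in minutes and a hypothetical dependency graph.
durations = {
    "BuildPackages": 90,
    "TestSimpleChromeWorkflow": 40,
    "VMTest": 30,
    "HWTest": 80,
}
deps = {
    "TestSimpleChromeWorkflow": ["BuildPackages"],
    "VMTest": ["BuildPackages"],
    "HWTest": ["BuildPackages"],
}

print(total_build_time(durations, deps))
print(total_build_time(*without(durations, deps, "TestSimpleChromeWorkflow")))
print(total_build_time(*without(durations, deps, "HWTest")))
```

In this toy model, removing a 40-minute stage that runs alongside a longer one leaves the total unchanged, while on a builder without the longer stage the same removal would shorten the run.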
Labels: -gardener Hotlist-CrOSGardener
Labels: Hotlist-CrOS-Gardener
Labels: -Pri-1 -Hotlist-CrOSGardener Pri-2
Summary: Consider moving the SimpleChromeWokflow stage from PFQ builders (was: Consider eliminating ChromeSDK (SimpleChrome) step from PFQ builders)
After having become all too familiar with simple chrome recently, I have learned the following:

The PFQ runs SimpleChromeWorkflow using the --chroot flag, pointing to a chroot environment based on ToT. (The details are complicated, but the code for this stage is here: https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/stages/chrome_stages.py?q=SimpleChromeWorkflowStage&dr=CSs&l=151)

As far as I am aware, this is the only builder that builds using SimpleChrome with the latest chromeos.

At some point I would like to move this to a separate builder (e.g. alongside the pfq-informational builders), because having so many different stages in the PFQ builders leads to a lot of confusion. For now however I think we should leave this as-is.

Components: Build
Cc: -davidjames@chromium.org
Cc: derat@chromium.org
Blocking: 611139
Summary: Consider moving the SimpleChromeWorkflow stage from PFQ builders (was: Consider moving the SimpleChromeWokflow stage from PFQ builders)
For searching purposes, the step that fails is called TestSimpleChromeWorkflow on the builders.
Picking a random recent build, it's adding about 40 minutes to the build. You can see the timelines linked under the Report stage of the waterfall.

https://storage.cloud.google.com/chromeos-image-archive/nyan-chrome-pfq/R60-9503.0.0-rc2/timeline-stages.html
Cc: steve...@chromium.org
Owner: ----
Status: Available (was: Assigned)
Summary: Consider moving the TestSimpleChromeWorkflow stage from PFQ builders (was: Consider moving the SimpleChromeWorkflow stage from PFQ builders)
That's a builder that does not run HW Tests. The idea is that TestSimpleChromeWorkflow runs in parallel to HWTests and does not actually increase the overall time for a PFQ run to complete, e.g.

https://storage.cloud.google.com/chromeos-image-archive/veyron_minnie-chrome-pfq/R60-9503.0.0-rc2/timeline-stages.html

That said, I would be an advocate of moving this to a separate dedicated informational builder:
1. Identifying failures before they get to the PFQ is always good.
2. The odds of this failing in a board-specific way are pretty low. We should probably test amd64 and arm, but that should suffice.
3. The actions when this breaks are pretty much independent of any other PFQ breakage.
4. Reducing the complexity of the master PFQ is generally a good idea.

I am marking this as Available and removing myself as owner, but I would be more than happy to help someone work on this.

Doing this would also make it easier to diagnose when an issue really is with TestSimpleChromeWorkflow vs. something else, like issue 720075.

Does it need to be tested in a board-specific way? Or would it be good enough to have a single SimpleChromeWorkflow builder?

For example, would it make sense to have an additional PFQ builder that ONLY runs the SimpleChromeWorkflow test? Or just fully remove that test from the PFQ and depend on informational builders only?

I'm fine either way, just trying to present options.
Breaking SimpleChrome is bad, but does not impact the CQ or Canary builders so I don't think this should be part of the PFQ.

We already have chrome-tot-chromeos-amd64-generic on the chromeos.chrome waterfall which builds an internal simple chrome with ToT chrome.

I would propose adding chrome-tot-chromeos-arm-generic so that we at least also have ARM coverage, and removing the stage (or at least disabling it) from the PFQ builders.


That doesn't sound too hard.
Don - do you know if Google cloud has a plan to support KVM for ARM? We could perhaps consider running one test in a non-accelerated VM.
KVM for ARM should work today, since it's pure simulation and doesn't need kernel support.

I'm not sure of the status of ChromeOS inside an ARM VM. In theory it should work, but didn't. There was an effort to fix it and I don't know what came out of that.
It's very slow as expected. 
Project Member

Comment 29 by sheriffbot@chromium.org, May 10 2018

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot

Comment 30 by derat@chromium.org, May 10 2018

Is this still worth doing? I just glanced at two PFQ builds, and in both cases TestSimpleChromeWorkflow took 11-12 minutes and ran in parallel with other stages that took longer:

https://luci-milo.appspot.com/buildbot/chromeos/caroline-chrome-pfq/1973
https://luci-milo.appspot.com/buildbot/chromeos/peppy-chrome-pfq/5506
I think it should be ok to mark this WontFix.
Status: WontFix (was: Untriaged)
