Running perf benchmarks on Chromium CQ
Issue description

This is a meta bug for running perf benchmarks on the Chromium waterfall for smoke testing. My rough idea of how this should work:

1) We only run benchmarks that are already merged together (see issue 575762 & bit.ly/why-merge-benchmarks) on the Chromium waterfall.
2) Benchmarks are run with "--storyset-repeat=1" since this is just a smoke test.
3) Each benchmark is sharded across multiple devices so that its runtime is roughly 5 minutes (see the sketch after this description). No need for device-affinity sharding.

*Android benchmarks will be blocked on single-device Android swarming being deployed everywhere.

Dirk: are you ok with this overall plan?
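A rough sketch of the sharding arithmetic in point 3 (the function name and all numbers here are illustrative assumptions, not part of the plan):

    # Pick enough swarming shards that each shard finishes in
    # about five minutes. Illustrative only.
    import math

    TARGET_SHARD_MINUTES = 5

    def shard_count(serial_runtime_minutes):
        """Number of shards needed so each shard runs ~5 minutes."""
        return max(1, math.ceil(serial_runtime_minutes / TARGET_SHARD_MINUTES))

    print(shard_count(40))  # a 40-minute benchmark -> 8 shards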
Jul 6 2017
@martiniss can help answer the machine-time question. As for the number of swarmed tests, I expect about 20 at most (once we merge the benchmarks together).
Jul 6 2017
I have a table of our total test time per bot (it's an internal link). Currently we aren't running our Linux bot because of issues with the bots, but our Windows bots run an average of ~21 hours of testing across 5 shards.
Jul 6 2017
Is that 21 machine-hours * 8 CPUs/machine == 168 CPU-hours, spread across 40 CPUs? Or 21 * 5 * 8 = 840 CPU-hours? And do we have any idea how much faster things will be with the reduced number of benchmarks and --storyset-repeat=1?
Jul 7 2017
Dirk: we don't run perf tests in parallel, so it's more like 21 machine-hours * 1 CPU per machine = 21 CPU-hours. Multiple cores may make each test run a bit faster, but I expect only marginally. With "--storyset-repeat=1" each story runs once instead of three times, so I expect roughly 21/3 = 7 CPU-hours.
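Worked through, that estimate looks like the sketch below; the factor-of-three repeat reduction is the assumption the 21/3 figure implies, not a measured number:

    # The 21 hours comes from the thread; the factor-of-3 repeat
    # reduction is an assumption, not a measured value.
    machine_hours = 21       # ~21 hours of testing across the Windows shards
    cores_used = 1           # Telemetry runs stories serially on one core
    repeat_reduction = 3     # three story repeats cut down to one
    cpu_hours = machine_hours * cores_used / repeat_reduction
    print(cpu_hours)         # 7.0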
Jul 7 2017
+smut, maruel (FYI). Good point, running the perf tests as perf tests in parallel would just be asking for pain :). Do you know how hard it would be to run them in parallel for functional test purposes? It'd be interesting to see whether that exposed more flakiness or not.

If we don't run them in parallel, we'd probably need to make some other adjustments; the swarming bots in the CQ pools are normally 8-core VMs, so only using one core out of 8 would not be a great use of resources. Swarming fully supports smaller VMs, but we don't provision a lot of them. We actually have this problem with other test steps as well (lots of them don't actually need 8 cores), so if it's hard to parallelize things, maybe this is the excuse we need to handle provisioning smarter.
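A minimal sketch of what "run the stories in parallel for functional purposes" could look like; this is not Telemetry's actual API, and run_story and the story names are hypothetical stand-ins:

    # Run stories in parallel purely for functional coverage.
    # NOT Telemetry's actual API: run_story and STORIES are
    # hypothetical stand-ins.
    from concurrent.futures import ProcessPoolExecutor

    def run_story(story_name):
        # Placeholder: launch the browser, run one story, return pass/fail.
        return True

    STORIES = ["load:news:example", "load:search:example"]

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=8) as pool:
            results = dict(zip(STORIES, pool.map(run_story, STORIES)))
            print(results)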
Jul 7 2017
I really worry about running these integration perf tests in parallel. Some tests involve the GPU, which I think could behave weirdly in parallel. +Ken for opinions.
Jul 7 2017
The bots on the CQ are VMs, so they won't have real GPUs. We should be sure that that's not going to be a problem one way or another.
Jul 7 2017
Oh, for now I only expect to run these benchmarks on the Chromium waterfall, not the CQ. We would probably run only the system health benchmark on the CQ. It would be great if infra abstracted the "multi core" question away from the test through something like a VM or Docker instance. Implementing parallel test runs for Telemetry benchmarks would be a bit challenging for us :P
Jul 7 2017
We try to avoid running tests only on the waterfall and not on the CQ; that causes a lot of confusion for people, so we only do it when the tests are too expensive to run, and I don't think this fits that category. Even on the waterfall, I'd rather run these under swarming on VMs if we can; I would like to avoid having to run tests on bare metal wherever possible. Running on single-core VMs should work just fine in swarming. It's likely just a matter of configuring Machine Provider to spin up some and then setting the right swarming dimensions in the //testing/buildbot configs, so if doing this in Telemetry is hard (and adds no real value), we can tackle it that way instead.
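For illustration, a //testing/buildbot-style entry with single-core dimensions might look like the sketch below (shown as a Python dict; the real files are JSON). All names and values are assumptions, not the actual bot config:

    # Hypothetical //testing/buildbot-style entry for a sharded,
    # single-core benchmark smoke test. Illustrative only.
    benchmark_smoke_test = {
        "name": "benchmark_smoke_tests",
        "isolate_name": "telemetry_perf_tests",
        "args": ["--storyset-repeat=1"],
        "swarming": {
            "can_use_on_swarming_builders": True,
            "shards": 4,
            "dimension_sets": [
                # Pinning the standard "cores" dimension to 1 is what
                # would route these tasks to the smaller VMs.
                {"pool": "Chrome", "cores": "1", "os": "Ubuntu-14.04"},
            ],
        },
    }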
Jul 7 2017
Thanks Dirk. So it seems the direction would be: 1) we run benchmark smoke tests on the CQ with "--pageset-repeat=1"; 2) we pick swarming dimensions that select single-core VMs to make better use of hardware resources. I'll check with Ken whether the desktop GPU tests are also currently configured this way.
Jul 7 2017
The GPU bots are probably conceptually inefficient, but the screenshot mechanisms we use aren't guaranteed to work if there are overlapping windows. So we don't run the Telemetry-based tests in parallel on the GPU bots, though we do shard them.
Jul 7 2017
#12: Ken, what about GPU tests running on the CQ? Do you use GPU bots there, or just the general Chromium bots with fewer cores?
Jul 7 2017
GPU bots, same as on the waterfall.
Jul 17 2017
Updated the title according to Dirk's assessment in #10.
Nov 20
Issue 840427 has been merged into this issue.