
Issue 753495

Starred by 1 user

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocking:
issue 756989




Disable parallel execution of Telemetry tests (telemetry_unittests & telemetry_perf_unittests) to reduce flakiness & simplify its core infra

Project Member Reported by nedngu...@google.com, Aug 8 2017

Issue description

A while ago, we had a code yellow on CQ cycle time (there was no swarming at the time), which led us to integrate Telemetry with typ for parallel test execution.

However, because many Telemetry tests are integration tests that interact with the OS, the browser, and so on, parallelizing test execution creates many problems. For example:
1) Fetching files in parallel sometimes causes deadlock (issue 643320).
2) Startup tracing tests were failing because they were run in parallel.
3) Some pytrace event tests were failing in Telemetry tests because pytrace is supposed to capture events across processes.

The most recent issue we hit is that switching the wprgo integration to always install the root certificate (for HTTPS connections) doesn't work well in Telemetry tests. The OS either has the root certificate installed or it doesn't, so parallel execution creates a race condition: one process tries to install the root cert while another tries to remove it.
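To make the root cert race concrete, here is a minimal sketch. It does not touch a real certificate store or wprgo; `FakeCertStore`, `run_serial`, and `run_interleaved` are hypothetical names, and the interleaving is hard-coded to show one ordering that two parallel workers could hit.

```python
class FakeCertStore:
    """Stand-in for machine-wide state: the root cert is installed or not."""

    def __init__(self):
        self.installed = False

    def install(self):
        self.installed = True

    def uninstall(self):
        self.installed = False


def run_serial(store):
    """Two tests run one after the other; each sees the cert it installed."""
    results = []
    for _ in range(2):
        store.install()                   # test setup
        results.append(store.installed)   # test body needs the cert
        store.uninstall()                 # test teardown
    return results


def run_interleaved(store):
    """One possible ordering when workers A and B share the same store."""
    store.install()            # A: setup
    store.install()            # B: setup
    a_ok = store.installed     # A: body sees the cert
    store.uninstall()          # A: teardown removes the shared cert
    b_ok = store.installed     # B: body runs after the cert is gone
    store.uninstall()          # B: teardown
    return [a_ok, b_ok]
```

With a fresh store, the serial run gives both tests a cert, while the interleaved run leaves worker B without one.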

Now that we have swarming infra that enables parallelizing test execution at the machine level, I propose that we use more machines (VMs) to run sharded Telemetry tests serially.

To utilize the test machines' CPU cores, we can pick machines with fewer cores for running Telemetry tests.

Note that this doesn't mean we stop integrating Telemetry with typ. It just means we set the number of parallel jobs to 1 by default (https://github.com/catapult-project/catapult/blob/master/telemetry/telemetry/testing/run_tests.py#L165).
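A rough sketch of what "jobs defaults to 1" could look like in a runner's argument parsing. This is not the code in run_tests.py; `build_parser` and `resolve_jobs` are hypothetical helpers, and the `-j/--jobs` flag name is assumed here for illustration.

```python
import argparse


def build_parser():
    """Hypothetical runner options; only the jobs flag is modelled."""
    parser = argparse.ArgumentParser(description="illustrative test runner")
    parser.add_argument(
        "-j", "--jobs", type=int, default=1,
        help="number of tests to run in parallel (default: 1, i.e. serial)")
    return parser


def resolve_jobs(argv):
    """Return the job count, rejecting values below 1."""
    args = build_parser().parse_args(argv)
    if args.jobs < 1:
        raise ValueError("--jobs must be at least 1")
    return args.jobs
```

Callers that still want parallelism can opt in explicitly, e.g. `resolve_jobs(["--jobs", "4"])`, while an empty argument list runs serially.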
 
Blocking: 730036
Noting a couple of things I mentioned to nednguyen@ off-bug ...

Currently most of our VMs are 8-core machines, and running tests in parallel ensures that all 8 cores are being used. If we don't run things in parallel, then the total utilization of the machine is probably significantly lower (50%? 25%?).

As a result, we probably don't want to use 8-core machines for this, and would be better off replacing 1 8-core VM with 2 4-core VMs.

In addition, if we increase the number of VMs we use to run tests, that means we increase the overall load on swarming, the total aggregate disk space we need (e.g., doubling the number of copies of the isolates), and the total network bandwidth we use to move those isolates around.
On Linux, with everything running in GCE, this is probably not an issue, but it's probably something more to consider on Windows and particularly Mac as we can't as easily add disk space or memory to the latter.
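A back-of-the-envelope sketch of the tradeoff described above. The numbers are placeholders, not measurements: it assumes a serial Telemetry run keeps about 2 cores busy (browser plus test process) and that each VM holds its own copy of a 5 GB isolate. Both function names are made up for this example.

```python
def utilization(busy_cores, total_cores):
    """Fraction of a VM's cores kept busy, capped at 1.0."""
    return min(busy_cores, total_cores) / total_cores


def aggregate_isolate_gb(num_vms, isolate_gb):
    """Total disk used if every VM stores its own copy of the isolate."""
    return num_vms * isolate_gb


# Assumed figures for illustration only.
BUSY_CORES = 2
ISOLATE_GB = 5

one_8_core = (utilization(BUSY_CORES, 8), aggregate_isolate_gb(1, ISOLATE_GB))
two_4_core = (utilization(BUSY_CORES, 4), aggregate_isolate_gb(2, ISOLATE_GB))
```

Under these assumptions, swapping one 8-core VM for two 4-core VMs doubles per-VM utilization (0.25 to 0.5) and also doubles the aggregate isolate footprint (5 GB to 10 GB), which is the cost that would need to be measured on Windows and Mac.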

All that said, I'm not opposed to the idea of sharding via VM rather than via worker -- there are the obvious advantages mentioned -- but we'd have to do tests to evaluate the tradeoffs.
Blocking: 756989
Blocking: -730036
Cc: rnep...@chromium.org jam@chromium.org perezju@chromium.org
 Issue 697924  has been merged into this issue.
Owner: nedngu...@google.com
Status: Started (was: Untriaged)
Dirk, I looked at all the builders that currently run telemetry_perf_unittests & telemetry_unittests. They are:

'x86 Cloud Tester'
'Android N5X Swarm Builder'
'Linux ChromiumOS Tests (1)'
'Chromium Mac 10.11'
'Chromium Mac 10.13'
'ClangToTMac tester'
'Mojo ChromiumOS'
'Out of Process Profiling Mac'
'Linux Tests'
'Mac10.10 Tests'
'Mac10.11 Tests'
'Mac10.12 Tests'
'Linux Tests SANDBOX'
'Win 7 Tests x64 (1)'
'Win7 Tests (1)'
'Linux - Future'

So what should be the next step here? Do I need to pick a new builder, or do I just need to tweak the swarming dimensions of telemetry_perf_unittests to pick the VMs?
Also, this is the list of builders running 'telemetry_unittests'. I would expect it to be the same as the telemetry_perf_unittests list (posted in #6), but apparently it's not.

'Linux ChromiumOS Tests (1)'
'Chromium Mac 10.11'
'Chromium Mac 10.13'
'ClangToTMac tester'
'Mojo ChromiumOS'
'Out of Process Profiling Mac'
'Linux Tests'
'Linux Tests (dbg)(1)'
'Linux Tests (dbg)(1)(32)'
'Mac10.10 Tests'
'Mac10.11 Tests'
'Mac10.12 Tests'
'Mac10.9 Tests'
'Mac10.9 Tests (dbg)'
'Linux Tests SANDBOX'
'Win 7 Tests x64 (1)'
'Win7 Tests (1)'
'Win7 Tests (dbg)(1)'
'Linux - Future'
'Linux - Future (dbg)'
The next step is probably to pick a linux configuration and tweak the swarming args for it to run more shards and pass a flag in to not run tests in parallel.
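As a sketch of "more shards, no in-process parallelism": each swarming task would run only its own slice of the test list, one test at a time. The round-robin split below is an assumption made for illustration, not how Telemetry or swarming actually assigns tests, and `shard_tests` is a hypothetical helper.

```python
def shard_tests(test_names, shard_index, total_shards):
    """Return the tests for one shard using a round-robin split.

    Sorting first makes the assignment independent of discovery order,
    so every shard computes the same partition.
    """
    if total_shards < 1:
        raise ValueError("total_shards must be at least 1")
    if not 0 <= shard_index < total_shards:
        raise ValueError("shard_index must be in [0, total_shards)")
    return sorted(test_names)[shard_index::total_shards]


def run_shard_serially(test_names, shard_index, total_shards, run_one):
    """Run this shard's tests one at a time; run_one is supplied by the caller."""
    return [run_one(name) for name in shard_tests(test_names, shard_index, total_shards)]
```

A split like this balances shards by test count only. If a few tests dominate the runtime, shards can still finish at very different times, so a real setup may want to weight by recorded durations.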
Thanks. I will pick the "Linux Tests" builder. Does that sound good to you?
"Linux Tests" is the wrong one to pick, since that's mirrored by linux_chromium_rel_ng in the CQ :).

Maybe try "Linux Tests (dbg)(1)" instead?
https://chromium-review.googlesource.com/c/chromium/src/+/703490 to add "Linux Tests (dbg)(1)" is being blocked because:
1) The speedometer2 smoke test is failing on linux_chromium_dbg_ng (which is "Linux Tests (dbg)(1)")
2) A v8 runtime stats smoke test is failing
3) The shards are very unbalanced.

+Dirk: do you think (3) would be a blocker, given that this doesn't affect machine utilization & we don't enable the test on the CQ?
As I just commented on the bug, it looks like the tests are running 10x slower in debug, and that concerns me. I don't think we should turn them on if that's the case, and either we need to figure out what's going on there, or try a non-debug bot, e.g., "Linux Tests SANDBOX" or one of the Mac10.10 or 10.11 bots.
I will try Mac10.11 bots next
Project Member

Comment 14 by bugdroid1@chromium.org, Nov 7 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/41539671102ef65148f07d72ddb89b3be24ec787

commit 41539671102ef65148f07d72ddb89b3be24ec787
Author: Ned Nguyen <nednguyen@google.com>
Date: Tue Nov 07 00:38:25 2017

Disable parallel processing for telemetry_perf_unittests & telemetry_unittest on "Mac 10.12 Test" 

Bug: 753495
Change-Id: Id04b687da4d22d8bbde18bb4a733b8af4968de04
Reviewed-on: https://chromium-review.googlesource.com/703490
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Cr-Commit-Position: refs/heads/master@{#514322}
[modify] https://crrev.com/41539671102ef65148f07d72ddb89b3be24ec787/testing/buildbot/chromium.mac.json

Mac 10.12 has been running quite stably, so I will disable parallelization on all Mac bots as the next step.
Project Member

Comment 16 by bugdroid1@chromium.org, Nov 14 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/9761a8796edbff9bb8917d7af9ce617c533ebcbd

commit 9761a8796edbff9bb8917d7af9ce617c533ebcbd
Author: Ned Nguyen <nednguyen@google.com>
Date: Tue Nov 14 00:23:18 2017

Disable parallelization of telemetry_perf_unittest on all Mac bots

Bug: 753495
Change-Id: I0121d82ce25bc037c20914b74714aec076aa8fe3
Reviewed-on: https://chromium-review.googlesource.com/763873
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#516112}
[modify] https://crrev.com/9761a8796edbff9bb8917d7af9ce617c533ebcbd/testing/buildbot/chromium.mac.json

Project Member

Comment 18 by bugdroid1@chromium.org, Jan 4 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7ce3ea4c5672848136feabdc47913667dc5a95d7

commit 7ce3ea4c5672848136feabdc47913667dc5a95d7
Author: Ned Nguyen <nednguyen@google.com>
Date: Thu Jan 04 17:43:52 2018

Remove unnecessary modifications to telemetry_unittest suite

Bug: 753495
Change-Id: I6e23e7b536b52c05070661869fb67ea8d04bc9d0
Reviewed-on: https://chromium-review.googlesource.com/850713
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#527026}
[modify] https://crrev.com/7ce3ea4c5672848136feabdc47913667dc5a95d7/testing/buildbot/test_suite_exceptions.pyl

Project Member

Comment 19 by bugdroid1@chromium.org, Jan 11 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/35d625f2f714b857cf08f256de23f366fa46948d

commit 35d625f2f714b857cf08f256de23f366fa46948d
Author: nednguyen <nednguyen@google.com>
Date: Thu Jan 11 19:45:30 2018

Disable telemetry_perf_unittest' parallelization on Linux Tests dbg & Linux tests Sandbox

Bug: 753495
Change-Id: Iff43849de0708d62d51960627346439d525f05a2
Reviewed-on: https://chromium-review.googlesource.com/861430
Reviewed-by: John Budorick <jbudorick@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#528712}
[modify] https://crrev.com/35d625f2f714b857cf08f256de23f366fa46948d/testing/buildbot/chromium.linux.json
[modify] https://crrev.com/35d625f2f714b857cf08f256de23f366fa46948d/testing/buildbot/chromium.sandbox.json
[modify] https://crrev.com/35d625f2f714b857cf08f256de23f366fa46948d/testing/buildbot/test_suite_exceptions.pyl

Project Member

Comment 20 by bugdroid1@chromium.org, Jan 17 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e6e2604e09e660b44ef4e2e396c78730b7da777c

commit e6e2604e09e660b44ef4e2e396c78730b7da777c
Author: Ned Nguyen <nednguyen@google.com>
Date: Wed Jan 17 20:42:48 2018

Disable Telemetry test parallelization on Linux Tests

Bug: 753495
Change-Id: I1780ff5394793a5e7a9e08ff873c1e67a70a809e
Reviewed-on: https://chromium-review.googlesource.com/868643
Reviewed-by: John Budorick <jbudorick@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#529876}
[modify] https://crrev.com/e6e2604e09e660b44ef4e2e396c78730b7da777c/testing/buildbot/chromium.linux.json
[modify] https://crrev.com/e6e2604e09e660b44ef4e2e396c78730b7da777c/testing/buildbot/test_suite_exceptions.pyl

Owner: nednguyen@chromium.org
Components: Test>Telemetry
Components: -Speed>Telemetry
