Disable parallel execution of Telemetry tests (telemetry_unittests & telemetry_perf_unittests) to reduce flakiness & simplify its core infra
Issue description
A while ago, we had a code yellow on CQ cycle time (there was no swarming at the time), which led us to integrate Telemetry with typ for parallel test execution. However, because many Telemetry tests are integration tests that interact with the OS, the browser, and so on, parallelizing test execution creates tons of problems. E.g.:
1) Fetching files in parallel sometimes causes deadlock (issue 643320).
2) Startup tracing tests were failing because they were run in parallel.
3) Some pytrace event tests were failing in Telemetry tests because pytrace is supposed to be covered across processes.
The most recent issue is that switching the wprgo integration to always install the root certificate (for HTTPS connections) doesn't work well in Telemetry tests. The OS either has the root certificate installed or it doesn't, so parallel execution creates a race condition: one process tries to install the root cert while another tries to remove it.
Now that we have swarming infra that enables parallelizing test execution at the machine level, I propose that we use more machines (VMs) to run sharded Telemetry tests in serial. To keep the test machines' CPU cores utilized, we can pick machines with few cores for running Telemetry tests.
Note that this doesn't mean we stop integrating Telemetry with typ. It just means we set the number of parallel jobs to always be 1 by default (https://github.com/catapult-project/catapult/blob/master/telemetry/telemetry/testing/run_tests.py#L165)
,
Aug 9 2017
Noting a couple of things I mentioned to nednguyen@ off-bug ...
Currently most of our VMs are 8-core machines, and running tests in parallel ensures that all 8 cores are being used. If we don't run things in parallel, then the total utilization of the machine is probably significantly lower (50%? 25%?). As a result, we probably don't want to use 8-core machines for this, and would be better off replacing 1 8-core VM with 2 4-core VMs.
In addition, if we increase the number of VMs we use to run tests, that means we increase the overall load on swarming, the total aggregate disk space we need (e.g., doubling the number of copies of the isolates), and the total network bandwidth we use to move those isolates around. On Linux, with everything running in GCE, this is probably not an issue, but it's probably something more to consider on Windows and particularly Mac, as we can't as easily add disk space or memory to the latter.
All that said, I'm not opposed to the idea of sharding via VM rather than via worker -- there are the obvious advantages mentioned -- but we'd have to do tests to evaluate the tradeoffs.
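The tradeoff above can be put into rough numbers. This is back-of-the-envelope arithmetic only: it assumes a serial run keeps the same absolute number of cores busy regardless of VM size, and the function names and the 2 GB isolate size in the usage note are made up for illustration, not measurements.

```python
def utilization_on_smaller_vm(util_on_big, big_cores, small_cores):
    """Estimate utilization after moving a serial workload to a smaller VM.

    Assumes the workload keeps the same number of cores busy, capped at 100%.
    """
    busy_cores = util_on_big * big_cores
    return min(1.0, busy_cores / small_cores)


def aggregate_isolate_disk(num_vms, isolate_size_gb):
    """Total disk used across VMs if each VM holds one copy of the isolates."""
    return num_vms * isolate_size_gb
```

With the guessed 25% utilization on an 8-core VM, `utilization_on_smaller_vm(0.25, 8, 4)` gives 0.5, and the 50% guess gives 1.0. Going from 1 VM to 2 VMs doubles the isolate copies: `aggregate_isolate_disk(2, 2.0)` is 4.0 versus 2.0 for a single VM.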
,
Aug 18 2017
,
Sep 5 2017
,
Oct 5 2017
Issue 697924 has been merged into this issue.
,
Oct 5 2017
Dirk, I looked at all the builders that currently run telemetry_perf_unittests & telemetry_unittests. They are:
'x86 Cloud Tester'
'Android N5X Swarm Builder'
'Linux ChromiumOS Tests (1)'
'Chromium Mac 10.11'
'Chromium Mac 10.13'
'ClangToTMac tester'
'Mojo ChromiumOS'
'Out of Process Profiling Mac'
'Linux Tests'
'Mac10.10 Tests'
'Mac10.11 Tests'
'Mac10.12 Tests'
'Linux Tests SANDBOX'
'Win 7 Tests x64 (1)'
'Win7 Tests (1)'
'Linux - Future'
So what should be the next step here? Do I need to pick a new builder, or do I just need to tweak the swarming dimensions of telemetry_perf_unittest to pick the VMs?
,
Oct 5 2017
Also, this is the list of builders running 'telemetry_unittest'. I would expect it to be the same as for telemetry_perf_unittests (posted in #6), but apparently it's not.
'Linux ChromiumOS Tests (1)'
'Chromium Mac 10.11'
'Chromium Mac 10.13'
'ClangToTMac tester'
'Mojo ChromiumOS'
'Out of Process Profiling Mac'
'Linux Tests'
'Linux Tests (dbg)(1)'
'Linux Tests (dbg)(1)(32)'
'Mac10.10 Tests'
'Mac10.11 Tests'
'Mac10.12 Tests'
'Mac10.9 Tests'
'Mac10.9 Tests (dbg)'
'Linux Tests SANDBOX'
'Win 7 Tests x64 (1)'
'Win7 Tests (1)'
'Win7 Tests (dbg)(1)'
'Linux - Future'
'Linux - Future (dbg)'
,
Oct 5 2017
The next step is probably to pick a Linux configuration and tweak its swarming args so that it runs more shards and passes in a flag to not run tests in parallel.
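As a rough sketch of that tweak, the Python below rewrites one test-suite entry to use more shards and a single job. Assumptions: the entry shape loosely mirrors the isolated-script entries in the testing/buildbot JSON files, the serial flag is spelled `--jobs=1` (typ's jobs option), and `make_serial` and the shard count of 4 are hypothetical, not taken from the actual change.

```python
import copy


def make_serial(entry, shards):
    """Return a copy of a test-suite entry that runs serially on more shards.

    Any existing --jobs argument is dropped and replaced by --jobs=1 (assumed
    flag spelling); the swarming shard count is set to `shards`.
    """
    updated = copy.deepcopy(entry)
    args = [a for a in updated.get("args", []) if not a.startswith("--jobs")]
    args.append("--jobs=1")
    updated["args"] = args
    updated.setdefault("swarming", {})["shards"] = shards
    return updated


# Hypothetical entry, for illustration only.
EXAMPLE_ENTRY = {
    "isolate_name": "telemetry_perf_unittests",
    "name": "telemetry_perf_unittests",
    "args": ["-v"],
    "swarming": {"can_use_on_swarming_builders": True, "shards": 1},
}

SERIAL_ENTRY = make_serial(EXAMPLE_ENTRY, shards=4)
```

The original entry is left untouched, which makes it easy to diff the two configurations before committing a change to one builder.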
,
Oct 5 2017
Thanks. I will pick the "Linux Tests" builder. Does that sound good to you?
,
Oct 5 2017
"Linux Tests" is the wrong one to pick, since that's mirrored by linux_chromium_rel_ng in the CQ :). Maybe try "Linux Tests (dbg)(1)" instead?
,
Oct 9 2017
https://chromium-review.googlesource.com/c/chromium/src/+/703490 to add "Linux Tests (dbg)(1)" is being blocked because:
1) The speedometer2 smoke test is failing on linux_chromium_dbg_ng (which is "Linux Tests (dbg)(1)").
2) A v8 runtime stats smoke test is failing.
3) The shards are very unbalanced.
+Dirk: do you think (3) would be a blocker, given that it doesn't affect machine utilization & we don't enable the test on the CQ?
,
Oct 9 2017
As I just commented on the bug, it looks like the tests are running 10x slower in debug, and that concerns me. I don't think we should turn them on if that's the case, and either we need to figure out what's going on there, or try a non-debug bot, e.g., "Linux Tests SANDBOX" or one of the Mac10.10 or 10.11 bots.
,
Oct 9 2017
I will try Mac10.11 bots next
,
Nov 7 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/41539671102ef65148f07d72ddb89b3be24ec787

commit 41539671102ef65148f07d72ddb89b3be24ec787
Author: Ned Nguyen <nednguyen@google.com>
Date: Tue Nov 07 00:38:25 2017

Disable parallel processing for telemetry_perf_unittests & telemetry_unittest on "Mac 10.12 Test"

Bug: 753495
Change-Id: Id04b687da4d22d8bbde18bb4a733b8af4968de04
Reviewed-on: https://chromium-review.googlesource.com/703490
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Cr-Commit-Position: refs/heads/master@{#514322}

[modify] https://crrev.com/41539671102ef65148f07d72ddb89b3be24ec787/testing/buildbot/chromium.mac.json
,
Nov 10 2017
Mac 10.12 has been running quite stably, so I will disable parallelization on all Mac bots as the next step.
,
Nov 14 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/9761a8796edbff9bb8917d7af9ce617c533ebcbd

commit 9761a8796edbff9bb8917d7af9ce617c533ebcbd
Author: Ned Nguyen <nednguyen@google.com>
Date: Tue Nov 14 00:23:18 2017

Disable parallelization of telemetry_perf_unittest on all Mac bots

Bug: 753495
Change-Id: I0121d82ce25bc037c20914b74714aec076aa8fe3
Reviewed-on: https://chromium-review.googlesource.com/763873
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#516112}

[modify] https://crrev.com/9761a8796edbff9bb8917d7af9ce617c533ebcbd/testing/buildbot/chromium.mac.json
,
Dec 7 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/ad33ed3adb4ffc05903c1d4095ff264579583437

commit ad33ed3adb4ffc05903c1d4095ff264579583437
Author: Ned Nguyen <nednguyen@google.com>
Date: Thu Dec 07 19:55:21 2017

Disable parallelism on telemetry_unittests everywhere

Bug: 753495, 662541
Change-Id: I5c819540d9a34a0b9e1e679a16cd4059f097ceee
Reviewed-on: https://chromium-review.googlesource.com/814935
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#522516}

[modify] https://crrev.com/ad33ed3adb4ffc05903c1d4095ff264579583437/testing/buildbot/chromium.mac.json
[modify] https://crrev.com/ad33ed3adb4ffc05903c1d4095ff264579583437/testing/buildbot/chromium.win.json
[modify] https://crrev.com/ad33ed3adb4ffc05903c1d4095ff264579583437/testing/buildbot/test_suite_exceptions.pyl
[modify] https://crrev.com/ad33ed3adb4ffc05903c1d4095ff264579583437/testing/buildbot/test_suites.pyl
,
Jan 4 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/7ce3ea4c5672848136feabdc47913667dc5a95d7

commit 7ce3ea4c5672848136feabdc47913667dc5a95d7
Author: Ned Nguyen <nednguyen@google.com>
Date: Thu Jan 04 17:43:52 2018

Remove unnecessary modifications to telemetry_unittest suite

Bug: 753495
Change-Id: I6e23e7b536b52c05070661869fb67ea8d04bc9d0
Reviewed-on: https://chromium-review.googlesource.com/850713
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#527026}

[modify] https://crrev.com/7ce3ea4c5672848136feabdc47913667dc5a95d7/testing/buildbot/test_suite_exceptions.pyl
,
Jan 11 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/35d625f2f714b857cf08f256de23f366fa46948d

commit 35d625f2f714b857cf08f256de23f366fa46948d
Author: nednguyen <nednguyen@google.com>
Date: Thu Jan 11 19:45:30 2018

Disable telemetry_perf_unittest' parallelization on Linux Tests dbg & Linux tests Sandbox

Bug: 753495
Change-Id: Iff43849de0708d62d51960627346439d525f05a2
Reviewed-on: https://chromium-review.googlesource.com/861430
Reviewed-by: John Budorick <jbudorick@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#528712}

[modify] https://crrev.com/35d625f2f714b857cf08f256de23f366fa46948d/testing/buildbot/chromium.linux.json
[modify] https://crrev.com/35d625f2f714b857cf08f256de23f366fa46948d/testing/buildbot/chromium.sandbox.json
[modify] https://crrev.com/35d625f2f714b857cf08f256de23f366fa46948d/testing/buildbot/test_suite_exceptions.pyl
,
Jan 17 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/e6e2604e09e660b44ef4e2e396c78730b7da777c

commit e6e2604e09e660b44ef4e2e396c78730b7da777c
Author: Ned Nguyen <nednguyen@google.com>
Date: Wed Jan 17 20:42:48 2018

Disable Telemetry test parallelization on Linux Tests

Bug: 753495
Change-Id: I1780ff5394793a5e7a9e08ff873c1e67a70a809e
Reviewed-on: https://chromium-review.googlesource.com/868643
Reviewed-by: John Budorick <jbudorick@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#529876}

[modify] https://crrev.com/e6e2604e09e660b44ef4e2e396c78730b7da777c/testing/buildbot/chromium.linux.json
[modify] https://crrev.com/e6e2604e09e660b44ef4e2e396c78730b7da777c/testing/buildbot/test_suite_exceptions.pyl
,
Aug 2
,
Jan 16
,
Jan 16
Comment 1 by xunji...@chromium.org
, Aug 8 2017