race condition when running telemetry after CrOS VM starts up |
|||||||
Issue descriptionThere's a small race condition when running telemetry in a cros VM. If you run tools/perf/run_tests right after spawning a vm, it'll error out with "Connection timed out". For example running this command fails: `./third_party/chromite/bin/cros_vm --start && ./tools/perf/run_tests --browser=cros-chrome --remote=127.0.0.1 --remote-ssh-port=9222` But if you wait a bit after launching the VM, it'll work fine: `./third_party/chromite/bin/cros_vm --start && sleep 60 && ./tools/perf/run_tests --browser=cros-chrome --remote=127.0.0.1 --remote-ssh-port=9222` I think the ssh server in the VM hasn't fully started up before cros_vm exits. A possible solution is to bump the "ConnectTimeout" value here: https://codesearch.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/core/cros_interface.py?rcl=02bb4b7da4dbea61d40d2289bfb9d294616f76f3&l=107
,
May 3 2018
,
May 4 2018
The following revision refers to this bug: https://chromium.googlesource.com/catapult/+/53d9b3ced38904718a0b3eb9471a47215e1e168c commit 53d9b3ced38904718a0b3eb9471a47215e1e168c Author: Ben Pastene <bpastene@chromium.org> Date: Fri May 04 17:12:59 2018 telemetry: Increase default ssh connection timeout on CrOS. When targeting a VM, it'll timeout if the VM just spawned. Bug: chromium:839187 Test: ./tools/perf/run_telemetry_tests --browser=cros-chrome --remote=127.0.0.1 --remote-ssh-port=9222 telemetry.core.cros_interface_unittest Change-Id: I5f8316c43ef6906c0facc8783bf3e1e84e9167aa Reviewed-on: https://chromium-review.googlesource.com/1043452 Commit-Queue: Ben Pastene <bpastene@chromium.org> Reviewed-by: Achuith Bhandarkar <achuith@chromium.org> [modify] https://crrev.com/53d9b3ced38904718a0b3eb9471a47215e1e168c/telemetry/telemetry/core/cros_interface.py [modify] https://crrev.com/53d9b3ced38904718a0b3eb9471a47215e1e168c/telemetry/telemetry/core/cros_interface_unittest.py
,
May 4 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/69239a6864233f9cff1c7b7fafffed089e7d4ab0 commit 69239a6864233f9cff1c7b7fafffed089e7d4ab0 Author: catapult-chromium-autoroll <catapult-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com> Date: Fri May 04 18:34:49 2018 Roll src/third_party/catapult/ f0ae6bb58..53d9b3ced (1 commit) https://chromium.googlesource.com/catapult.git/+log/f0ae6bb58a45..53d9b3ced389 $ git log f0ae6bb58..53d9b3ced --date=short --no-merges --format='%ad %ae %s' 2018-05-04 bpastene telemetry: Increase default ssh connection timeout on CrOS. Created with: roll-dep src/third_party/catapult BUG= chromium:839187 The AutoRoll server is located here: https://catapult-roll.skia.org Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary. TBR=sullivan@chromium.org Change-Id: If84e0cb1aab60edcb0122f2ab550741a3e01e7ea Reviewed-on: https://chromium-review.googlesource.com/1044645 Commit-Queue: catapult-chromium-autoroll <catapult-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com> Reviewed-by: catapult-chromium-autoroll <catapult-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com> Cr-Commit-Position: refs/heads/master@{#556128} [modify] https://crrev.com/69239a6864233f9cff1c7b7fafffed089e7d4ab0/DEPS
,
May 7 2018
Bumping the initial connection timeout to 60s certainly helped, but it seems the race condition still exists: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel-vm-tests Looks like about ~5% of tasks still hit "got Connection timed out during banner exchange" https://chromium-swarm.appspot.com/task?id=3d5439cb7b99a410 https://chromium-swarm.appspot.com/task?id=3d5380838fbeb610 https://chromium-swarm.appspot.com/task?id=3d530d4ceefa0710 I wonder what else we can do to smooth that over... Maybe adding a retry via ConnectionAttempts? I'll look into that.
,
May 8 2018
Sorry for not chiming in earlier; I was on vacation last week. cros_run_vm_test has to deal with the same issue, and it uses WaitForBoot: https://cs.chromium.org/chromium/src/third_party/chromite/scripts/cros_vm.py?l=387 Are you literally just running: `./third_party/chromite/bin/cros_vm --start && ./tools/perf/run_tests --browser=cros-chrome --remote=127.0.0.1 --remote-ssh-port=9222` I can add an option to cros_run_vm_test to run perf tests, so you can just do: ./third_party/chromite/bin/cros_run_vm_tests --perf-tests It would be pretty easy to add this.
,
May 8 2018
,
May 8 2018
Ahh, that makes sense. Reusing that bit of logic sgtm. What would "--perf-tests" control? A hard-coded invocation of tools/perf/run_tests? I worry that we might have similar use-cases for the VMs down the road (ie: a command/test that starts off on the host rather than inside the VM) What if we instead add a "--host-cmd" option cros_run_vm_test? It could work the same as "--cmd" but instead of running the cmd in the vm, it runs it on the host. That would cover telemetry here and any other similar use in the future. WDYT?
,
May 8 2018
Sure, that makes sense. I can look into adding --host-cmd
,
May 8 2018
FYI, you can also run perf tests from within the VM: https://chromium.googlesource.com/chromiumos/docs/+/master/cros_vm.md#run-telemetry-unit-tests localhost ~ # python /usr/local/telemetry/src/tools/perf/run_tests cros_run_vm_test has an option --catapult-tests that runs telemetry unittests in a similar fashion. This is the 'local' mode on chromeos, and you're pursuing the 'server' mode. I think the server mode is probably better; we have to fetch archives from google storage for some perf tests, and need proper .boto permissions which is probably easier on the dev server rather than the VM.
,
May 8 2018
I don't know if you're aware of this, but the chromeos team maintains an informational bot that runs both telemetry unit tests and perf tests: https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=amd64-generic-telemetry&buildBranch=master It does by wrapping the telemetry test runner with an autotest: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/site_tests/telemetry_UnitTests/telemetry_UnitTests.py which calls into a script called RunChromeOSTests: https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/testing/run_chromeos_tests.py This is generally maintained by the ChromeOS gardening effort, so I wouldn't be too surprised if most/all tests pass. One more thing I'd like to point out is that in addition to all unit tests and perf tests, we also run a small subset of unit tests (telemetry's so-called BrowserTests) in the incognito browser (--browser=cros-chrome-guest). We can probably drop this since I don't think anyone is tracking incognito browser performance (though we probably should be - it can regress). This does open another gap - which is ensuring that incognito mode continues working functionally. I can probably modify vm_sanity to cover this.
,
May 8 2018
Oh, interesting. I wasn't aware of the run_chromeos_tests.py script. You'll have to excuse my ignorance in these parts regarding telemetry, I'm not too familiar with it. I just know that we run telemetry tests on the CQ against all of our platforms (windows, android, mac, etc.) and it's all done via the run_tests script in chromium src: https://codesearch.chromium.org/chromium/src/tools/perf/run_tests (Which looks like to calls down to this other run_tests script: https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/testing/run_tests.py) I was assuming that that would be true for chromeos as well (+ a couple options like --remote and --remote-ssh-port). But maybe it's not? I'll need some guidance on exactly what tests to run/how to run them. Are there any specific benchmarks we should be testing? (+Dirk who was asking similar questions lately.) The current set seems pretty limited: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/chromeos-amd64-generic-rel-vm-tests/2039 (ie, skipping 169 and only running 88) And that's using --browser=cros-chrome, not cros-chrome-guest like you mentioned. As for local vs server mode: correct me if I'm wrong, but the local mode seems to use a catapult/telemetry checkout in the VM itself, which is baked into the vm image. If that's the case, I don't think we'd have much value in that. On chromium's CQ, we'd be a lot closer to upstream changes of telemetry and are more interested in testing it at head than a blessed version that came packaged with the VM.
,
May 8 2018
,
May 8 2018
I assume this effort is mostly to run unit tests and perf unit tests in the VMs? I don't think we would want to run benchmarks on the VMs because the results probably wouldn't be very meaningful. I guess they could help catch some regressions. Looking at a recent run, I see 89 passed and 168 skipped, so this seems correct for perf tests. Google storage bucket with test results: https://pantheon.corp.google.com/storage/browser/chromeos-image-archive/amd64-generic-telemetry/R68-10658.0.0-b2551730/vm_test_results_1/telemetry_unit_server/test_harness/all/SimpleTestVerify/1_autotest_tests/ The log file for the perf test run: https://storage.cloud.google.com/chromeos-image-archive/amd64-generic-telemetry/R68-10658.0.0-b2551730/vm_test_results_1/telemetry_unit_server/test_harness/all/SimpleTestVerify/1_autotest_tests/results-3-telemetry_UnitTestsServer_perf/telemetry_UnitTestsServer.perf/telemetry_UnitTests/debug/telemetry_UnitTests.DEBUG?_ga=2.180384615.-1779913377.1517273826 Are you also planning to run telemetry unittests? There are 1038 that run, with 126 skips. Here's the log file (there are 6 failing tests which have been recently disabled): https://storage.cloud.google.com/chromeos-image-archive/amd64-generic-telemetry/R68-10658.0.0-b2551730/vm_test_results_1/telemetry_unit_server/test_harness/all/SimpleTestVerify/1_autotest_tests/results-1-telemetry_UnitTestsServer/telemetry_UnitTestsServer.user/telemetry_UnitTests/debug/telemetry_UnitTests.DEBUG?_ga=2.251344710.-1779913377.1517273826
,
May 8 2018
IIUC telemetry_perf_unittests runs the benchmarks but ignores the results. I think it's supposed to be a smoke test to ensure that the benchmarks won't crash when you want to run them for real, which is why it's suitable for running in VMs. But I'll need someone familiar with things to confirm that... And I'm planning on adding telemetry_unittests as well; I just started with the telemetry_perf_unittests suite since I assumed that one would have more technical blockers.
,
May 8 2018
To answer your previous question, local mode runs with the telemetry checkout in the VM. It works on our FYI bot since we build the image from TOT ChromeOS and TOT Chrome, but you don't want to do this. The other option would be to update the checkout located at /usr/local/telemetry on the VM, but you don't want to do this either. So server mode should be the way forward. Yup, the actual benchmarks would use the run_benchmark script, and you would want to run those on the actual kevins or whatever device(s) you settle on. From the perf linux bot, the invocation looks something like this: Command: /b/s/w/ir/.swarming_module_cache/vpython/73deba/bin/python ../../testing/scripts/run_performance_tests.py ../../tools/perf/run_benchmark -v --browser=release --upload-results --isolated-script-test-output=/b/s/w/io36nKZO/output.json --isolated-script-test-perf-output=/b/s/w/io36nKZO/perftest-output.json I expect that running benchmarks and collecting results on cros devices will be a pretty significant undertaking. The VMs are perfect for the perf and telemetry unit tests. I think the number of perf tests is actually small? I tried looking for where we run them on linux, but couldn't find it.
,
May 8 2018
Gotcha, thanks for the pointers Achuith. Sounds like server mode is what we want. I'll add the rest of telemetry's unittests on the bot while we work on adding that --host-cmd option (or something like it)
,
May 9 2018
Sounds like there is already agreement here. But just to confirm: > IIUC telemetry_perf_unittests runs the benchmarks but ignores the results. I think it's supposed to be a smoke test to ensure that the benchmarks won't crash when you want to run them for real Yes, that part is correct.
,
May 17 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/016a7aee48cd40630a6b06177592388550fb38ec commit 016a7aee48cd40630a6b06177592388550fb38ec Author: Ben Pastene <bpastene@chromium.org> Date: Thu May 17 01:22:46 2018 Use "cros_run_vm_test --host-cmd" for host-side cros VM tests. Replaces our use of cros_vm directly. Needed because going through cros_run_vm_test.py gives us smarter VM-bootup logic. cros_vm.py lacks this is and is rather dumb. Also bump timeout of cros-vm bot's timeout on telemetry unittests by a few min. (Added in this CL 'cause all of this is aimed at getting telemetry_perf_unittests stable.) Bug: 839187 Change-Id: I3fe6a8ec52d4a6f47b8b1ef475390346aa920ec3 Reviewed-on: https://chromium-review.googlesource.com/1062590 Reviewed-by: Dirk Pranke <dpranke@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> Commit-Queue: Ben Pastene <bpastene@chromium.org> Cr-Commit-Position: refs/heads/master@{#559382} [modify] https://crrev.com/016a7aee48cd40630a6b06177592388550fb38ec/build/chromeos/run_vm_test.py [modify] https://crrev.com/016a7aee48cd40630a6b06177592388550fb38ec/testing/buildbot/chromium.fyi.json [modify] https://crrev.com/016a7aee48cd40630a6b06177592388550fb38ec/testing/buildbot/test_suites.pyl
,
May 17 2018
This is pretty much fixed. After switching to use the "--host-cmd" option (thnx achuith), t_p_u hasn't run into any connection timeout problems on the fyi bot. Though there does seem to be a problem where telemetry just hangs forever and eventually times out: https://chromium-swarm.appspot.com/task?id=3d88390620b29810 https://chromium-swarm.appspot.com/task?id=3d87572b417efb10 Probably another issue. I'll look into it when I get a chance. |
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by bpastene@chromium.org
, May 3 2018Status: Assigned (was: Untriaged)