Process-checking Tast tests flakily fail on VM builders when run just after booting |
||
Issue descriptionI'm trying to run Tast tests on VM paladin builders for issue 885016 . In tryjobs that make use of cros_vm.py ( issue 891928 ), I've noticed failures in the platform.CheckProcesses and security.SELinuxProcessContext tests due to missing processes. See http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8932102744351452208, for instance: 2018/10/20 19:12:55 platform.CheckProcesses [ FAIL ] metrics_daemon not running ... 2018/10/20 19:12:55 security.SELinuxProcessContext [ FAIL ] Found 0 process(es) for test case {0 /usr/bin/metrics_daemon u:r:cros_metrics_daemon:s0 1}; require at least 1 metrics_daemon.conf contains the following: start on stopped crash-boot-collect stop on stopping system-services respawn crash-boot-collect.conf contains this: start on stopped crash-reporter and started system-services ... exec /sbin/crash_reporter --boot_collect I think this makes it basically arbitrary when metrics_daemon gets started, so this probably isn't something that cros_vm.py could handle automatically. In http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8932170984258645072, security.SELinuxProcessContext also failed due to cryptohomed not running: 2018/10/20 01:03:07 [01:03:06.511] Found 0 process(es) for test case {0 /usr/sbin/cryptohomed u:r:cros_cryptohomed:s0 1}: [] 2018/10/20 01:03:07 [01:03:06.511] Error at selinux_process_context.go:91: Found 0 process(es) for test case {0 /usr/sbin/cryptohomed u:r:cros_cryptohomed:s0 1}; require at least 1 2018/10/20 01:03:07 [01:03:06.511] Stack trace: Found 0 process(es) for test case {0 /usr/sbin/cryptohomed u:r:cros_cryptohomed:s0 1}; require at least 1 at chromiumos/tast/local/bundles/cros/security.SELinuxProcessContext (selinux_process_context.go:91) at chromiumos/tast/testing.(*Test).Run.func4 (test.go:189) at chromiumos/tast/testing.runAndRecover.func1 (stage.go:68) at runtime.goexit (asm_amd64.s:2361) I likely need to update the tests that depend on these processes to wait for them.
,
Oct 22
Is platform_CheckProcesses run in a VM with arbitrarily heavy load on the host? Do you know what the load is for this test? Maybe the timeout should be multiplied by the load average. Other possible fixes: 1. change the metrics_daemon to start earlier, then wait for the boot-collect event before sending the boot-collect stats (assuming that's why it waits); 2. making the boot-collect process send its stats independently; 3. making the test wait for the upstart event for all daemons of interest.
,
Oct 22
Per nya@'s suggestion on https://crrev.com/c/1293072, I'll use this to track adding something wider-reaching to try to make us wait until the system is "stable" (however we want to define that) before running tests.
,
Oct 22
#2: No idea how I'd even check the host system's load, but I fear that trying to accomodate high loads by increasing timeouts would just result in more flaky failures.
,
Oct 23
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/a490e3d92053cc938b6dbb2de0cbd2dd6dbbb1f4 commit a490e3d92053cc938b6dbb2de0cbd2dd6dbbb1f4 Author: Daniel Erat <derat@chromium.org> Date: Tue Oct 23 06:48:38 2018 tast-tests: Make process tests wait for system-services. Update the platform.CheckProcesses and security.SELinuxProcessContext tests to wait for the system-services Upstart job to be running. This hopefully prevents failures when tests run immediately after the system boots, which can happen on VM builders. Also make platform.CheckProcesses wait 10 seconds rather than 5 for Upstart jobs, and stop checking metrics_daemon since it's apparently sometimes only started about a minute after the system boots. Add a few more services to the wait list as well. Finally, update security.SELinuxProcessContext to permit non-critical processes which may get restarted by other tests to not be running and to not log multiline messages (which make test logs difficult to read). BUG= chromium:897521 TEST=manual: ran tests Change-Id: I6d4478571d7d1b714f0d31134675c8be7972b3ed Reviewed-on: https://chromium-review.googlesource.com/1293072 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Dan Erat <derat@chromium.org> [modify] https://crrev.com/a490e3d92053cc938b6dbb2de0cbd2dd6dbbb1f4/src/chromiumos/tast/local/bundles/cros/security/selinux_process_context.go [modify] https://crrev.com/a490e3d92053cc938b6dbb2de0cbd2dd6dbbb1f4/src/chromiumos/tast/local/bundles/cros/platform/check_processes.go [modify] https://crrev.com/a490e3d92053cc938b6dbb2de0cbd2dd6dbbb1f4/src/chromiumos/tast/local/bundles/cros/security/selinux/processes_helper.go [modify] https://crrev.com/a490e3d92053cc938b6dbb2de0cbd2dd6dbbb1f4/src/chromiumos/tast/local/upstart/upstart.go
,
Oct 23
One way to deal with timeouts under high load would be to timeout on CPU time (i.e. user+system) instead of wall time. You'd still probably want a (much higher) wall time timeout in case you end up hung on IO.
,
Oct 25
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast/+/6b4fe52543e99f44f774b8df5770e7603f9a7b45 commit 6b4fe52543e99f44f774b8df5770e7603f9a7b45 Author: Daniel Erat <derat@chromium.org> Date: Thu Oct 25 18:35:48 2018 tast: Make bundle.Local support waiting for DUT readiness. Make the bundle.Local function take an optional func that is executed to wait for the DUT to become ready for testing. This provides a mechanism that the local "cros" bundle can use to defer testing until critical Chrome OS system daemons are running. BUG= chromium:897521 TEST=added unit tests; also manually verified (in conjunction with tast-tests change) that ready function runs and that testing doesn't start if debugd is stopped CQ-DEEND=Ieb8536ad3adbe389a80614014772cf1e8d34db69 Change-Id: Ice0c81fb887bd73af5c5541158865c2871cea09b Reviewed-on: https://chromium-review.googlesource.com/1298595 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Shuhei Takahashi <nya@chromium.org> Reviewed-by: Hidehiko Abe <hidehiko@chromium.org> [modify] https://crrev.com/6b4fe52543e99f44f774b8df5770e7603f9a7b45/src/chromiumos/tast/bundle/local_test.go [modify] https://crrev.com/6b4fe52543e99f44f774b8df5770e7603f9a7b45/src/chromiumos/tast/bundle/local.go [modify] https://crrev.com/6b4fe52543e99f44f774b8df5770e7603f9a7b45/src/chromiumos/tast/bundle/bundle.go [modify] https://crrev.com/6b4fe52543e99f44f774b8df5770e7603f9a7b45/src/chromiumos/tast/runner/runner_test.go
,
Oct 25
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/ac0af4d310c708b1fdf5e7e59701db1a4999859f commit ac0af4d310c708b1fdf5e7e59701db1a4999859f Author: Daniel Erat <derat@chromium.org> Date: Thu Oct 25 18:35:48 2018 tast-tests: Make local cros bundle wait for DUT to be ready. Make the local "cros" test bundle wait for the system-services, debugd, and shill Upstart jobs to be running before tests are run. BUG= chromium:897521 TEST=manual: verified that "tast run" waits for a stopped debugd job before starting tests CQ-DEPEND=Ice0c81fb887bd73af5c5541158865c2871cea09b Change-Id: Ieb8536ad3adbe389a80614014772cf1e8d34db69 Reviewed-on: https://chromium-review.googlesource.com/1298263 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Shuhei Takahashi <nya@chromium.org> [modify] https://crrev.com/ac0af4d310c708b1fdf5e7e59701db1a4999859f/src/chromiumos/tast/local/bundles/cros/security/selinux_process_context.go [modify] https://crrev.com/ac0af4d310c708b1fdf5e7e59701db1a4999859f/src/chromiumos/tast/local/bundles/cros/platform/check_processes.go [modify] https://crrev.com/ac0af4d310c708b1fdf5e7e59701db1a4999859f/src/chromiumos/tast/local/bundles/cros/main.go [modify] https://crrev.com/ac0af4d310c708b1fdf5e7e59701db1a4999859f/src/chromiumos/tast/local/upstart/upstart.go
,
Oct 29
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast/+/3bde681d74e8dea8e5644db4222259f740bc5bd4 commit 3bde681d74e8dea8e5644db4222259f740bc5bd4 Author: Daniel Erat <derat@chromium.org> Date: Mon Oct 29 18:21:36 2018 tast: Protect concurrent logging calls in bundle funcs. Make the logging function passed to runConfig.preRunFunc and runConfig.postRunFunc safe to call from multiple goroutines. BUG= chromium:897521 TEST=logging still works Change-Id: I19a0a225a59c464666b921c331e83ea168f1122d Reviewed-on: https://chromium-review.googlesource.com/1303283 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Shuhei Takahashi <nya@chromium.org> Reviewed-by: Hidehiko Abe <hidehiko@chromium.org> [modify] https://crrev.com/3bde681d74e8dea8e5644db4222259f740bc5bd4/src/chromiumos/tast/bundle/bundle.go
,
Oct 29
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/2461a42835d9804dd4b101e9084a59eb8f8da045 commit 2461a42835d9804dd4b101e9084a59eb8f8da045 Author: Daniel Erat <derat@chromium.org> Date: Mon Oct 29 18:21:36 2018 tast-tests: Update platform.CheckProcesses to wait for PIDs. Update the platform.CheckProcesses test to wait up to 30 seconds for the expected processes to show up instead of waiting for a few Upstart jobs. Also make it check for metrics_daemon. Also move upstart.WaitForJobsRunning into the local "cros" bundle's ready function since that's the only caller, and make the bundle wait for metrics_daemon. BUG= chromium:897521 TEST=ran it with and without required processes running CQ-DEPEND=I19a0a225a59c464666b921c331e83ea168f1122d Change-Id: Ief0f9f6c0e2bf7082c2a208fe1c4203bee454e46 Reviewed-on: https://chromium-review.googlesource.com/1303284 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Shuhei Takahashi <nya@chromium.org> Reviewed-by: Hidehiko Abe <hidehiko@chromium.org> [modify] https://crrev.com/2461a42835d9804dd4b101e9084a59eb8f8da045/src/chromiumos/tast/local/bundles/cros/platform/check_processes.go [modify] https://crrev.com/2461a42835d9804dd4b101e9084a59eb8f8da045/src/chromiumos/tast/local/bundles/cros/main.go [modify] https://crrev.com/2461a42835d9804dd4b101e9084a59eb8f8da045/src/chromiumos/tast/local/upstart/upstart.go
,
Oct 29
|
||
►
Sign in to add a comment |
||
Comment 1 by derat@chromium.org
, Oct 22