New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 897521 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Oct 29
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 885016



Sign in to add a comment

Process-checking Tast tests flakily fail on VM builders when run just after booting

Project Member Reported by derat@chromium.org, Oct 21

Issue description

I'm trying to run Tast tests on VM paladin builders for  issue 885016 . In tryjobs that make use of cros_vm.py ( issue 891928 ), I've noticed failures in the platform.CheckProcesses and security.SELinuxProcessContext tests due to missing processes.

See http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8932102744351452208, for instance:

2018/10/20 19:12:55 platform.CheckProcesses         [ FAIL ] metrics_daemon not running
...
2018/10/20 19:12:55 security.SELinuxProcessContext  [ FAIL ] Found 0 process(es) for test case {0 /usr/bin/metrics_daemon u:r:cros_metrics_daemon:s0 1}; require at least 1

metrics_daemon.conf contains the following:

start on stopped crash-boot-collect
stop on stopping system-services
respawn

crash-boot-collect.conf contains this:

start on stopped crash-reporter and started system-services
...
exec /sbin/crash_reporter --boot_collect

I think this makes it basically arbitrary when metrics_daemon gets started, so this probably isn't something that cros_vm.py could handle automatically.

In http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8932170984258645072, security.SELinuxProcessContext also failed due to cryptohomed not running:

2018/10/20 01:03:07 [01:03:06.511] Found 0 process(es) for test case {0 /usr/sbin/cryptohomed u:r:cros_cryptohomed:s0 1}: []
2018/10/20 01:03:07 [01:03:06.511] Error at selinux_process_context.go:91: Found 0 process(es) for test case {0 /usr/sbin/cryptohomed u:r:cros_cryptohomed:s0 1}; require at least 1
2018/10/20 01:03:07 [01:03:06.511] Stack trace:
Found 0 process(es) for test case {0 /usr/sbin/cryptohomed u:r:cros_cryptohomed:s0 1}; require at least 1
	at chromiumos/tast/local/bundles/cros/security.SELinuxProcessContext (selinux_process_context.go:91)
	at chromiumos/tast/testing.(*Test).Run.func4 (test.go:189)
	at chromiumos/tast/testing.runAndRecover.func1 (stage.go:68)
	at runtime.goexit (asm_amd64.s:2361)

I likely need to update the tests that depend on these processes to wait for them.
 
Cc: semenzato@chromium.org
I tried making platform.CheckProcesses wait up to 10 seconds for the metrics_daemon Upstart job to be running, but that looks like it's still not enough. In my tryjob at http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8932021952879138320:

16:18:47  suite runs cros_run_vm_test
16:19:33  tast process is started
16:19:37  platform.CheckProcesses starts
16:19:46  test fails because metrics_daemon isn't running
16:20:02  metrics_daemon finally logs some stuff in /var/log/messages

That feels like it's way too slow. I'm going to update platform.CheckProcesses stop verifying that metrics_daemon is running. I'm happy to re-add it if metrics_daemon is updated to start synchronously with system-services in the future.
Is platform_CheckProcesses run in a VM with arbitrarily heavy load on the host?  Do you know what the load is for this test?  Maybe the timeout should be multiplied by the load average.

Other possible fixes: 

1. change the metrics_daemon to start earlier, then wait for the boot-collect event before sending the boot-collect stats (assuming that's why it waits);

2. making the boot-collect process send its stats independently;

3. making the test wait for the upstart event for all daemons of interest.

Per nya@'s suggestion on https://crrev.com/c/1293072, I'll use this to track adding something wider-reaching to try to make us wait until the system is "stable" (however we want to define that) before running tests.
#2: No idea how I'd even check the host system's load, but I fear that trying to accomodate high loads by increasing timeouts would just result in more flaky failures.
Project Member

Comment 5 by bugdroid1@chromium.org, Oct 23

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/a490e3d92053cc938b6dbb2de0cbd2dd6dbbb1f4

commit a490e3d92053cc938b6dbb2de0cbd2dd6dbbb1f4
Author: Daniel Erat <derat@chromium.org>
Date: Tue Oct 23 06:48:38 2018

tast-tests: Make process tests wait for system-services.

Update the platform.CheckProcesses and
security.SELinuxProcessContext tests to wait for
the system-services Upstart job to be running. This
hopefully prevents failures when tests run immediately after
the system boots, which can happen on VM builders.

Also make platform.CheckProcesses wait 10 seconds rather
than 5 for Upstart jobs, and stop checking metrics_daemon
since it's apparently sometimes only started about a minute
after the system boots. Add a few more services to the wait
list as well.

Finally, update security.SELinuxProcessContext to permit
non-critical processes which may get restarted by other
tests to not be running and to not log multiline messages
(which make test logs difficult to read).

BUG= chromium:897521 
TEST=manual: ran tests

Change-Id: I6d4478571d7d1b714f0d31134675c8be7972b3ed
Reviewed-on: https://chromium-review.googlesource.com/1293072
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/a490e3d92053cc938b6dbb2de0cbd2dd6dbbb1f4/src/chromiumos/tast/local/bundles/cros/security/selinux_process_context.go
[modify] https://crrev.com/a490e3d92053cc938b6dbb2de0cbd2dd6dbbb1f4/src/chromiumos/tast/local/bundles/cros/platform/check_processes.go
[modify] https://crrev.com/a490e3d92053cc938b6dbb2de0cbd2dd6dbbb1f4/src/chromiumos/tast/local/bundles/cros/security/selinux/processes_helper.go
[modify] https://crrev.com/a490e3d92053cc938b6dbb2de0cbd2dd6dbbb1f4/src/chromiumos/tast/local/upstart/upstart.go

One way to deal with timeouts under high load would be to timeout on CPU time (i.e. user+system) instead of wall time. You'd still probably want a (much higher) wall time timeout in case you end up hung on IO.
Project Member

Comment 7 by bugdroid1@chromium.org, Oct 25

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast/+/6b4fe52543e99f44f774b8df5770e7603f9a7b45

commit 6b4fe52543e99f44f774b8df5770e7603f9a7b45
Author: Daniel Erat <derat@chromium.org>
Date: Thu Oct 25 18:35:48 2018

tast: Make bundle.Local support waiting for DUT readiness.

Make the bundle.Local function take an optional func that is
executed to wait for the DUT to become ready for testing.
This provides a mechanism that the local "cros" bundle can
use to defer testing until critical Chrome OS system daemons
are running.

BUG= chromium:897521 
TEST=added unit tests; also manually verified (in
     conjunction with tast-tests change) that ready function
     runs and that testing doesn't start if debugd is
     stopped
CQ-DEEND=Ieb8536ad3adbe389a80614014772cf1e8d34db69

Change-Id: Ice0c81fb887bd73af5c5541158865c2871cea09b
Reviewed-on: https://chromium-review.googlesource.com/1298595
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Shuhei Takahashi <nya@chromium.org>
Reviewed-by: Hidehiko Abe <hidehiko@chromium.org>

[modify] https://crrev.com/6b4fe52543e99f44f774b8df5770e7603f9a7b45/src/chromiumos/tast/bundle/local_test.go
[modify] https://crrev.com/6b4fe52543e99f44f774b8df5770e7603f9a7b45/src/chromiumos/tast/bundle/local.go
[modify] https://crrev.com/6b4fe52543e99f44f774b8df5770e7603f9a7b45/src/chromiumos/tast/bundle/bundle.go
[modify] https://crrev.com/6b4fe52543e99f44f774b8df5770e7603f9a7b45/src/chromiumos/tast/runner/runner_test.go

Project Member

Comment 8 by bugdroid1@chromium.org, Oct 25

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/ac0af4d310c708b1fdf5e7e59701db1a4999859f

commit ac0af4d310c708b1fdf5e7e59701db1a4999859f
Author: Daniel Erat <derat@chromium.org>
Date: Thu Oct 25 18:35:48 2018

tast-tests: Make local cros bundle wait for DUT to be ready.

Make the local "cros" test bundle wait for the
system-services, debugd, and shill Upstart jobs to be
running before tests are run.

BUG= chromium:897521 
TEST=manual: verified that "tast run" waits for a stopped
     debugd job before starting tests
CQ-DEPEND=Ice0c81fb887bd73af5c5541158865c2871cea09b

Change-Id: Ieb8536ad3adbe389a80614014772cf1e8d34db69
Reviewed-on: https://chromium-review.googlesource.com/1298263
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Shuhei Takahashi <nya@chromium.org>

[modify] https://crrev.com/ac0af4d310c708b1fdf5e7e59701db1a4999859f/src/chromiumos/tast/local/bundles/cros/security/selinux_process_context.go
[modify] https://crrev.com/ac0af4d310c708b1fdf5e7e59701db1a4999859f/src/chromiumos/tast/local/bundles/cros/platform/check_processes.go
[modify] https://crrev.com/ac0af4d310c708b1fdf5e7e59701db1a4999859f/src/chromiumos/tast/local/bundles/cros/main.go
[modify] https://crrev.com/ac0af4d310c708b1fdf5e7e59701db1a4999859f/src/chromiumos/tast/local/upstart/upstart.go

Project Member

Comment 9 by bugdroid1@chromium.org, Oct 29

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast/+/3bde681d74e8dea8e5644db4222259f740bc5bd4

commit 3bde681d74e8dea8e5644db4222259f740bc5bd4
Author: Daniel Erat <derat@chromium.org>
Date: Mon Oct 29 18:21:36 2018

tast: Protect concurrent logging calls in bundle funcs.

Make the logging function passed to runConfig.preRunFunc and
runConfig.postRunFunc safe to call from multiple goroutines.

BUG= chromium:897521 
TEST=logging still works

Change-Id: I19a0a225a59c464666b921c331e83ea168f1122d
Reviewed-on: https://chromium-review.googlesource.com/1303283
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Shuhei Takahashi <nya@chromium.org>
Reviewed-by: Hidehiko Abe <hidehiko@chromium.org>

[modify] https://crrev.com/3bde681d74e8dea8e5644db4222259f740bc5bd4/src/chromiumos/tast/bundle/bundle.go

Project Member

Comment 10 by bugdroid1@chromium.org, Oct 29

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/2461a42835d9804dd4b101e9084a59eb8f8da045

commit 2461a42835d9804dd4b101e9084a59eb8f8da045
Author: Daniel Erat <derat@chromium.org>
Date: Mon Oct 29 18:21:36 2018

tast-tests: Update platform.CheckProcesses to wait for PIDs.

Update the platform.CheckProcesses test to wait up to 30
seconds for the expected processes to show up instead of
waiting for a few Upstart jobs. Also make it check for
metrics_daemon.

Also move upstart.WaitForJobsRunning into the local "cros"
bundle's ready function since that's the only caller, and
make the bundle wait for metrics_daemon.

BUG= chromium:897521 
TEST=ran it with and without required processes running
CQ-DEPEND=I19a0a225a59c464666b921c331e83ea168f1122d

Change-Id: Ief0f9f6c0e2bf7082c2a208fe1c4203bee454e46
Reviewed-on: https://chromium-review.googlesource.com/1303284
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Shuhei Takahashi <nya@chromium.org>
Reviewed-by: Hidehiko Abe <hidehiko@chromium.org>

[modify] https://crrev.com/2461a42835d9804dd4b101e9084a59eb8f8da045/src/chromiumos/tast/local/bundles/cros/platform/check_processes.go
[modify] https://crrev.com/2461a42835d9804dd4b101e9084a59eb8f8da045/src/chromiumos/tast/local/bundles/cros/main.go
[modify] https://crrev.com/2461a42835d9804dd4b101e9084a59eb8f8da045/src/chromiumos/tast/local/upstart/upstart.go

Status: Fixed (was: Started)

Sign in to add a comment