New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 855200 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jul 10
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

platform.CheckProcesses test sometimes fails due to missing powerd process

Project Member Reported by derat@chromium.org, Jun 21 2018

Issue description

I've seen occasional failures in the platform.CheckProcesses Tast test due to a powerd process not being found. I suspect that some earlier job may be leaving the DUT in a bad state (I believe that various power_* Autotest tests stop powerd to test various things).

One failure was on chromeos6-row3-rack14-host15; see https://stainless.corp.google.com/browse/chromeos-autotest-results/210242963-chromeos-test/ :

2018/06/20 12:01:50 Started test platform.CheckProcesses
2018/06/20 12:01:50 [12:01:50.112] conntrackd is running
2018/06/20 12:01:50 [12:01:50.112] dbus-daemon is running
2018/06/20 12:01:50 [12:01:50.112] debugd is running
2018/06/20 12:01:50 [12:01:50.112] metrics_daemon is running
2018/06/20 12:01:50 [12:01:50.112] Error at check_processes.go:59: powerd not running
2018/06/20 12:01:50 [12:01:50.112] Stack trace:
chromiumos/tast/local/bundles/cros/platform.CheckProcesses(0xc4203924e0)
	/build/sand/tmp/portage/chromeos-base/tast-local-tests-cros-0.0.1-r83/work/tast-local-tests-cros-0.0.1/src/chromiumos/tast/local/bundles/cros/platform/check_processes.go:59 +0x40e
chromiumos/tast/testing.(*Test).Run.func2(0xc4203924e0, 0xc42006c0e0, 0xc420128b40)
	/build/sand/tmp/portage/chromeos-base/tast-local-tests-cros-0.0.1-r83/work/tast-local-tests-cros-0.0.1/tast-base/src/chromiumos/tast/testing/test.go:99 +0x66
created by chromiumos/tast/testing.(*Test).Run
	/build/sand/tmp/portage/chromeos-base/tast-local-tests-cros-0.0.1-r83/work/tast-local-tests-cros-0.0.1/tast-base/src/chromiumos/tast/testing/test.go:87 +0xb8
2018/06/20 12:01:50 [12:01:50.112] shill is running
2018/06/20 12:01:50 [12:01:50.113] udevd is running
2018/06/20 12:01:50 [12:01:50.113] tlsdated is running
2018/06/20 12:01:50 [12:01:50.114] update_engine is running
2018/06/20 12:01:50 [12:01:50.114] wpa_supplicant is running
2018/06/20 12:01:50 Completed test platform.CheckProcesses in 52ms with 1 error(s)

The test run collected a powerd log file that was started after the power.Reboot test rebooted the system later on in the run:

2018/06/20 12:04:32 Started test power.Reboot
2018/06/20 12:04:32 [12:04:32.243] Rebooting DUT
2018/06/20 12:04:33 [12:04:33.307] Waiting for DUT to become unreachable
2018/06/20 12:04:35 [12:04:35.348] DUT became unreachable (as expected)
2018/06/20 12:04:35 [12:04:35.348] Reconnecting to DUT
2018/06/20 12:04:46 [12:04:46.574] Reconnected to DUT
2018/06/20 12:04:46 Completed test power.Reboot in 14.45s with 0 error(s)

[0620/120439:INFO:main.cc(308)] vcsid 0.0.2-r2763-8ae6dcca2b7464a828d7601b5e844eb953b1d52e
[0620/120439:INFO:main.cc(314)] System uptime: 5s
[0620/120439:INFO:udev.cc(106)] Watching FD 15 for udev events
...

No earlier powerd log file got updated, suggesting that the process truly wasn't running.

The other failure was (coincidentally?) also on a different sand DUT, chromeos6-row3-rack17-host9; see https://stainless.corp.google.com/browse/chromeos-autotest-results/210309297-chromeos-test/ .

The cautotest UI is slow and horrible, and I'm having a hard time finding likely culprits. I'm worried that we'll always have some amount of flakiness in tests like this one due to the DUT starting out in a bad state. In this particular case, making the test into a remote test so it can reboot the DUT before checking processes would probably help, but that would make the test slower.

I can make the test collect more info (e.g. powerd.PREVIOUS and /var/log/messages) to make it easier to see when powerd was killed to help track down the bad test (assuming that that's the problem), but this problem will probably recur in the future.

An easy workaround for this particular issue would probably to make the test start the powerd job if it isn't already running. That may mask powerd crashes, though.
 

Comment 1 by ihf@chromium.org, Jun 21 2018

Suggestions to isolate tast from autotest:
1) Make the first test in tast a test that reboots the test, if you have one.
2) Reboot the DUT at start of tast (not as a test).
3) Put "TIME = 'LONG'" in the tast control files. That will currently schedule tast to run before shorter tests. Usually it should be the first test after provisioning. But as provisions can get recycled across different suites, no hard guarantees here about getting a clean machine. Just a reduction in noise.

One could also argue that platform.CheckProcesses is a bad test, because it assumes state that it did not setup itself, and hence is not robust.

If you want you could have tests declare the likelihood that they change system state/rely on system state and run them in a particular order. Of course Chromium testing just randomizes test order and tries to find test pairs that impact each other. That might be a long term goal but may be too much at this stage.

Comment 2 by nya@chromium.org, Jun 22 2018

In 210242963-chromeos-test logs, I saw these lines before platform.CheckProcesses() was run:

2018-06-20T17:54:13.358101+00:00 INFO btdispatch[2204]: Power manager becomes not available
2018-06-20T17:54:13.359262+00:00 WARNING trunksd[1863]: PowerManager service lost.
2018-06-20T17:54:13.360275+00:00 WARNING kernel: [  844.575957] init: powerd main process (1097) killed by TERM signal
https://storage.cloud.google.com/chromeos-autotest-results/210242963-chromeos-test/chromeos6-row3-rack14-host15/tast/sysinfo/reboot_current/messages

So I think Dan is right, powerd was not really running.


This is actually catching a real issue: powerd should be running always.


But I'll also note that, in the above case, the test ran just after arc.IntentForward, meaning that it ran during user session. If the test order changes, it can run on login screen. I think that's not great.


As a test writer, I want the DUT to be consistently "initial" state (= on the login screen) when a tast test case starts. Something we might be able to do for example:
- Require all test cases to restore the initial state when it finishes (pass or fail). For login tests, they should logout at the end.
- When a test case timeouts, Tast would forcibly restore the initial state by some means, e.g. rebooting DUT or forcing logout, before running the next test.

Comment 3 by ihf@chromium.org, Jun 22 2018

I disagree that the DUT should be on the login screen at the begging of a test. This is very inefficient. Already logged in is much faster for Chrome based tests. And for ARC++ tests one would want to have a running Android, if possible. Indeed CTS tests try to leave the DUT in the best state for the next CTS test (e.g. logged in + Android running).

A test should not be expected to be cleaning up. Cleaning up after oneself is a social construct that never works as it encourages freeloaders which don't do this and punishes tests that do. A test should be required to declare/request/set the initial state it requires.

Comment 4 by derat@chromium.org, Jun 22 2018

I don't think that we should expect to be at the login screen when tests start either, and Tast's code for interacting with Chrome already handles getting the UI into the appropriate state. This is pretty much mandatory, since different tests may want to start Chrome with different flags, and Tast also can't expect whatever Autotest test ran earlier (if any even did) to e.g. install the extension that it uses.

If there's a desire for platform.CheckProcesses to run consistently with regard to whether a user is logged in or not, then it should just restart the ui job or log in or whatever at the beginning.

There's clearly a line to be drawn, e.g. if a test hoses the system by deleting random files from the system or writing garbage to /home/chronos/Local State or something similar, then there's not going to be much other tests can do to recover.

Here, it may just be the case that platform.CheckProcesses doesn't really make sense as a local test, and that it should reboot the DUT first to make it more likely that the system is in a reasonable state.

Comment 5 by nya@chromium.org, Jun 22 2018

That's a fair argument that we want to reuse the state as far as possible to speed up.

However I'm not sure if we can support both "reuse state as far as possible" and "do not expect tests to clean up".

Comment 6 by derat@chromium.org, Jun 22 2018

In general, I don't think we should reuse state. I think that tests shouldn't mess things up (which includes not leaving usually-running daemons stopped), but I don't think we should leave the UI in a particular state so other tests don't need to do setup -- that's likely to be fragile and to cause hard-to-debug failures.

If we have tests that do similar things and need similar setup, I'd rather we just combined them into a single Tast test. Remember that it's possible to report multiple failures. Then you end up with hopefully-repeatable behavior.

Comment 7 by ihf@chromium.org, Jun 22 2018

One can only reuse the state of the DUT if not cleaning up. Call it the lazy cleanup.

Now I am not expecting tast to become a smart scheduler and figure out how to best chain cleanup dependencies. But maybe one can ask users hooking up tast tests to order them in a way such that state gets reused? And maybe have ready defined barrier commands inserted between groups of tests, which can reboot a DUT, stop Chrome or services, start Chrome or services (with special options), login Chrome, start Android, start adb/cleanup adb etc?

All this may look ugly to a purist (and we need ideas how to distribute over multiple DUTs - maybe at special barriers?), but practically speaking this can reduce at the same time runtime and noise levels.
Cc: jrbarnette@chromium.org
https://crrev.com/c/1121681 for  issue 859212  should help in cases where the powerd Upstart job has just been restarted by another test but isn't up yet.

I've still seen at least one case where the powerd Upstart job wasn't running at all, though: http://stainless/browse/chromeos-autotest-results/213682240-chromeos-test/

2018/07/02 17:46:14 Started test platform.CheckProcesses
2018/07/02 17:46:19 [17:46:19.569] Error at check_processes.go:49: powerd job not running
...
2018/07/02 17:46:19 [17:46:19.613] conntrackd is running
2018/07/02 17:46:19 [17:46:19.613] dbus-daemon is running
2018/07/02 17:46:19 [17:46:19.613] debugd is running
2018/07/02 17:46:19 [17:46:19.613] metrics_daemon is running
2018/07/02 17:46:19 [17:46:19.613] Error at check_processes.go:88: powerd not running

When I look at the full messages file collected by Autotest, I think I can see the point where powerd was stopped:

2018-07-02T23:10:53.616593+00:00 NOTICE root[32649]: autotest runtest graphics_dEQP
2018-07-02T23:10:54.811542+00:00 INFO kernel: [ 5537.671619] input: Emulated Keyboard as /devices/virtual/input/input11
2018-07-02T23:10:55.568532+00:00 WARNING kernel: [ 5538.428406] init: arc-kmsg-logger main process (32272) killed by TERM signal
2018-07-02T23:10:55.597895+00:00 INFO session_manager[31993]: [INFO:browser_job.cc(165)] Terminating process: exiting cleanly
2018-07-02T23:10:55.597922+00:00 INFO session_manager[31993]: [INFO:system_utils_impl.cc(94)] Sending 15 to 32012 as 1000
2018-07-02T23:10:56.012184+00:00 INFO session_manager[31993]: [INFO:session_manager_service.cc(480)] SessionManagerService quitting run loop
2018-07-02T23:10:56.012345+00:00 INFO session_manager[31993]: [INFO:session_manager_service.cc(201)] SessionManagerService exiting
2018-07-02T23:10:56.027577+00:00 INFO session_manager[31993]: [INFO:android_oci_wrapper.cc(253)] Cleaning up container 32307
2018-07-02T23:10:56.042404+00:00 INFO session_manager[31993]: [INFO:session_manager_impl.cc(1694)] Android Container with pid 32307 stopped
2018-07-02T23:10:56.430380+00:00 INFO btdispatch[2173]: Power manager becomes not available
2018-07-02T23:10:56.430609+00:00 WARNING trunksd[1891]: PowerManager service lost.
2018-07-02T23:10:56.430900+00:00 WARNING kernel: [ 5539.290959] init: powerd main process (1140) killed by TERM signal

These entries are in GMT (the first Tast mention in messages is at 2018-07-03T00:43:19), which would place them at 16:10 in relation to the Tast logs.

graphics_dEQP looks like it's trying to do the right thing:

    def initialize(self):
        ...
        self._services = service_stopper.ServiceStopper(['ui', 'powerd'])
        ...
        self._services.stop_services()

    def cleanup(self):
        if self._services:
            self._services.restore_services()

But I think that http://cautotest/afe/#tab_id=view_job&object_id=213680843 is the job corresponding to this, and it was aborted.

cleanup_services() in server/hosts/cros_host.py restarts the ui job. Should I make it also restart the powerd job?
> cleanup_services() in server/hosts/cros_host.py restarts the ui job.
> Should I make it also restart the powerd job?

That's not completely crazy.  The purpose of cleanup() is to allow
papering over simple problems like what's described here.

In an ideal world, we'd have slightly more sophisticated handling.
In some cases, if 'powerd' isn't running, it could represent a
condition that shouldn't be ignored.  In other cases, papering over
a dead 'powerd' is exactly the right thing to do.  Distinguishing
the two isn't impossible, but it requires more work to be done
successfully.

Also in an ideal world, in addition to 'powerd', we'd have consistent,
well-thought-through checking and recovery for multiple important
jobs, and not just 'ui' and 'powerd'.

Project Member

Comment 10 by bugdroid1@chromium.org, Jul 7

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/4b5f7e0335090c89be9223e410de50c2f3f04315

commit 4b5f7e0335090c89be9223e410de50c2f3f04315
Author: Daniel Erat <derat@chromium.org>
Date: Sat Jul 07 16:29:57 2018

autotest: Update CrosHost.cleanup_services to start powerd.

Make CrosHost.cleanup_services() start the powerd Upstart
job if it's installed but not running. This is intended to
get the DUT into a stable state when a test aborts after
attempting to temporarily stop powerd.

BUG= chromium:855200 
TEST=none

Change-Id: Ieb4a5c19eb5a7e78959f84dd2abe6b850721dc15
Reviewed-on: https://chromium-review.googlesource.com/1126573
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/4b5f7e0335090c89be9223e410de50c2f3f04315/server/hosts/cros_host.py

Status: Fixed (was: Assigned)
I don't see any failures in this test yesterday.

Sign in to add a comment