platform.CheckProcesses test sometimes fails due to missing powerd process |
|||
Issue descriptionI've seen occasional failures in the platform.CheckProcesses Tast test due to a powerd process not being found. I suspect that some earlier job may be leaving the DUT in a bad state (I believe that various power_* Autotest tests stop powerd to test various things). One failure was on chromeos6-row3-rack14-host15; see https://stainless.corp.google.com/browse/chromeos-autotest-results/210242963-chromeos-test/ : 2018/06/20 12:01:50 Started test platform.CheckProcesses 2018/06/20 12:01:50 [12:01:50.112] conntrackd is running 2018/06/20 12:01:50 [12:01:50.112] dbus-daemon is running 2018/06/20 12:01:50 [12:01:50.112] debugd is running 2018/06/20 12:01:50 [12:01:50.112] metrics_daemon is running 2018/06/20 12:01:50 [12:01:50.112] Error at check_processes.go:59: powerd not running 2018/06/20 12:01:50 [12:01:50.112] Stack trace: chromiumos/tast/local/bundles/cros/platform.CheckProcesses(0xc4203924e0) /build/sand/tmp/portage/chromeos-base/tast-local-tests-cros-0.0.1-r83/work/tast-local-tests-cros-0.0.1/src/chromiumos/tast/local/bundles/cros/platform/check_processes.go:59 +0x40e chromiumos/tast/testing.(*Test).Run.func2(0xc4203924e0, 0xc42006c0e0, 0xc420128b40) /build/sand/tmp/portage/chromeos-base/tast-local-tests-cros-0.0.1-r83/work/tast-local-tests-cros-0.0.1/tast-base/src/chromiumos/tast/testing/test.go:99 +0x66 created by chromiumos/tast/testing.(*Test).Run /build/sand/tmp/portage/chromeos-base/tast-local-tests-cros-0.0.1-r83/work/tast-local-tests-cros-0.0.1/tast-base/src/chromiumos/tast/testing/test.go:87 +0xb8 2018/06/20 12:01:50 [12:01:50.112] shill is running 2018/06/20 12:01:50 [12:01:50.113] udevd is running 2018/06/20 12:01:50 [12:01:50.113] tlsdated is running 2018/06/20 12:01:50 [12:01:50.114] update_engine is running 2018/06/20 12:01:50 [12:01:50.114] wpa_supplicant is running 2018/06/20 12:01:50 Completed test platform.CheckProcesses in 52ms with 1 error(s) The test run collected a powerd log file that was started after the power.Reboot test rebooted the system later on in the run: 2018/06/20 12:04:32 Started test power.Reboot 2018/06/20 12:04:32 [12:04:32.243] Rebooting DUT 2018/06/20 12:04:33 [12:04:33.307] Waiting for DUT to become unreachable 2018/06/20 12:04:35 [12:04:35.348] DUT became unreachable (as expected) 2018/06/20 12:04:35 [12:04:35.348] Reconnecting to DUT 2018/06/20 12:04:46 [12:04:46.574] Reconnected to DUT 2018/06/20 12:04:46 Completed test power.Reboot in 14.45s with 0 error(s) [0620/120439:INFO:main.cc(308)] vcsid 0.0.2-r2763-8ae6dcca2b7464a828d7601b5e844eb953b1d52e [0620/120439:INFO:main.cc(314)] System uptime: 5s [0620/120439:INFO:udev.cc(106)] Watching FD 15 for udev events ... No earlier powerd log file got updated, suggesting that the process truly wasn't running. The other failure was (coincidentally?) also on a different sand DUT, chromeos6-row3-rack17-host9; see https://stainless.corp.google.com/browse/chromeos-autotest-results/210309297-chromeos-test/ . The cautotest UI is slow and horrible, and I'm having a hard time finding likely culprits. I'm worried that we'll always have some amount of flakiness in tests like this one due to the DUT starting out in a bad state. In this particular case, making the test into a remote test so it can reboot the DUT before checking processes would probably help, but that would make the test slower. I can make the test collect more info (e.g. powerd.PREVIOUS and /var/log/messages) to make it easier to see when powerd was killed to help track down the bad test (assuming that that's the problem), but this problem will probably recur in the future. An easy workaround for this particular issue would probably to make the test start the powerd job if it isn't already running. That may mask powerd crashes, though.
,
Jun 22 2018
In 210242963-chromeos-test logs, I saw these lines before platform.CheckProcesses() was run: 2018-06-20T17:54:13.358101+00:00 INFO btdispatch[2204]: Power manager becomes not available 2018-06-20T17:54:13.359262+00:00 WARNING trunksd[1863]: PowerManager service lost. 2018-06-20T17:54:13.360275+00:00 WARNING kernel: [ 844.575957] init: powerd main process (1097) killed by TERM signal https://storage.cloud.google.com/chromeos-autotest-results/210242963-chromeos-test/chromeos6-row3-rack14-host15/tast/sysinfo/reboot_current/messages So I think Dan is right, powerd was not really running. This is actually catching a real issue: powerd should be running always. But I'll also note that, in the above case, the test ran just after arc.IntentForward, meaning that it ran during user session. If the test order changes, it can run on login screen. I think that's not great. As a test writer, I want the DUT to be consistently "initial" state (= on the login screen) when a tast test case starts. Something we might be able to do for example: - Require all test cases to restore the initial state when it finishes (pass or fail). For login tests, they should logout at the end. - When a test case timeouts, Tast would forcibly restore the initial state by some means, e.g. rebooting DUT or forcing logout, before running the next test.
,
Jun 22 2018
I disagree that the DUT should be on the login screen at the begging of a test. This is very inefficient. Already logged in is much faster for Chrome based tests. And for ARC++ tests one would want to have a running Android, if possible. Indeed CTS tests try to leave the DUT in the best state for the next CTS test (e.g. logged in + Android running). A test should not be expected to be cleaning up. Cleaning up after oneself is a social construct that never works as it encourages freeloaders which don't do this and punishes tests that do. A test should be required to declare/request/set the initial state it requires.
,
Jun 22 2018
I don't think that we should expect to be at the login screen when tests start either, and Tast's code for interacting with Chrome already handles getting the UI into the appropriate state. This is pretty much mandatory, since different tests may want to start Chrome with different flags, and Tast also can't expect whatever Autotest test ran earlier (if any even did) to e.g. install the extension that it uses. If there's a desire for platform.CheckProcesses to run consistently with regard to whether a user is logged in or not, then it should just restart the ui job or log in or whatever at the beginning. There's clearly a line to be drawn, e.g. if a test hoses the system by deleting random files from the system or writing garbage to /home/chronos/Local State or something similar, then there's not going to be much other tests can do to recover. Here, it may just be the case that platform.CheckProcesses doesn't really make sense as a local test, and that it should reboot the DUT first to make it more likely that the system is in a reasonable state.
,
Jun 22 2018
That's a fair argument that we want to reuse the state as far as possible to speed up. However I'm not sure if we can support both "reuse state as far as possible" and "do not expect tests to clean up".
,
Jun 22 2018
In general, I don't think we should reuse state. I think that tests shouldn't mess things up (which includes not leaving usually-running daemons stopped), but I don't think we should leave the UI in a particular state so other tests don't need to do setup -- that's likely to be fragile and to cause hard-to-debug failures. If we have tests that do similar things and need similar setup, I'd rather we just combined them into a single Tast test. Remember that it's possible to report multiple failures. Then you end up with hopefully-repeatable behavior.
,
Jun 22 2018
One can only reuse the state of the DUT if not cleaning up. Call it the lazy cleanup. Now I am not expecting tast to become a smart scheduler and figure out how to best chain cleanup dependencies. But maybe one can ask users hooking up tast tests to order them in a way such that state gets reused? And maybe have ready defined barrier commands inserted between groups of tests, which can reboot a DUT, stop Chrome or services, start Chrome or services (with special options), login Chrome, start Android, start adb/cleanup adb etc? All this may look ugly to a purist (and we need ideas how to distribute over multiple DUTs - maybe at special barriers?), but practically speaking this can reduce at the same time runtime and noise levels.
,
Jul 3
https://crrev.com/c/1121681 for issue 859212 should help in cases where the powerd Upstart job has just been restarted by another test but isn't up yet. I've still seen at least one case where the powerd Upstart job wasn't running at all, though: http://stainless/browse/chromeos-autotest-results/213682240-chromeos-test/ 2018/07/02 17:46:14 Started test platform.CheckProcesses 2018/07/02 17:46:19 [17:46:19.569] Error at check_processes.go:49: powerd job not running ... 2018/07/02 17:46:19 [17:46:19.613] conntrackd is running 2018/07/02 17:46:19 [17:46:19.613] dbus-daemon is running 2018/07/02 17:46:19 [17:46:19.613] debugd is running 2018/07/02 17:46:19 [17:46:19.613] metrics_daemon is running 2018/07/02 17:46:19 [17:46:19.613] Error at check_processes.go:88: powerd not running When I look at the full messages file collected by Autotest, I think I can see the point where powerd was stopped: 2018-07-02T23:10:53.616593+00:00 NOTICE root[32649]: autotest runtest graphics_dEQP 2018-07-02T23:10:54.811542+00:00 INFO kernel: [ 5537.671619] input: Emulated Keyboard as /devices/virtual/input/input11 2018-07-02T23:10:55.568532+00:00 WARNING kernel: [ 5538.428406] init: arc-kmsg-logger main process (32272) killed by TERM signal 2018-07-02T23:10:55.597895+00:00 INFO session_manager[31993]: [INFO:browser_job.cc(165)] Terminating process: exiting cleanly 2018-07-02T23:10:55.597922+00:00 INFO session_manager[31993]: [INFO:system_utils_impl.cc(94)] Sending 15 to 32012 as 1000 2018-07-02T23:10:56.012184+00:00 INFO session_manager[31993]: [INFO:session_manager_service.cc(480)] SessionManagerService quitting run loop 2018-07-02T23:10:56.012345+00:00 INFO session_manager[31993]: [INFO:session_manager_service.cc(201)] SessionManagerService exiting 2018-07-02T23:10:56.027577+00:00 INFO session_manager[31993]: [INFO:android_oci_wrapper.cc(253)] Cleaning up container 32307 2018-07-02T23:10:56.042404+00:00 INFO session_manager[31993]: [INFO:session_manager_impl.cc(1694)] Android Container with pid 32307 stopped 2018-07-02T23:10:56.430380+00:00 INFO btdispatch[2173]: Power manager becomes not available 2018-07-02T23:10:56.430609+00:00 WARNING trunksd[1891]: PowerManager service lost. 2018-07-02T23:10:56.430900+00:00 WARNING kernel: [ 5539.290959] init: powerd main process (1140) killed by TERM signal These entries are in GMT (the first Tast mention in messages is at 2018-07-03T00:43:19), which would place them at 16:10 in relation to the Tast logs. graphics_dEQP looks like it's trying to do the right thing: def initialize(self): ... self._services = service_stopper.ServiceStopper(['ui', 'powerd']) ... self._services.stop_services() def cleanup(self): if self._services: self._services.restore_services() But I think that http://cautotest/afe/#tab_id=view_job&object_id=213680843 is the job corresponding to this, and it was aborted. cleanup_services() in server/hosts/cros_host.py restarts the ui job. Should I make it also restart the powerd job?
,
Jul 3
> cleanup_services() in server/hosts/cros_host.py restarts the ui job. > Should I make it also restart the powerd job? That's not completely crazy. The purpose of cleanup() is to allow papering over simple problems like what's described here. In an ideal world, we'd have slightly more sophisticated handling. In some cases, if 'powerd' isn't running, it could represent a condition that shouldn't be ignored. In other cases, papering over a dead 'powerd' is exactly the right thing to do. Distinguishing the two isn't impossible, but it requires more work to be done successfully. Also in an ideal world, in addition to 'powerd', we'd have consistent, well-thought-through checking and recovery for multiple important jobs, and not just 'ui' and 'powerd'.
,
Jul 7
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/4b5f7e0335090c89be9223e410de50c2f3f04315 commit 4b5f7e0335090c89be9223e410de50c2f3f04315 Author: Daniel Erat <derat@chromium.org> Date: Sat Jul 07 16:29:57 2018 autotest: Update CrosHost.cleanup_services to start powerd. Make CrosHost.cleanup_services() start the powerd Upstart job if it's installed but not running. This is intended to get the DUT into a stable state when a test aborts after attempting to temporarily stop powerd. BUG= chromium:855200 TEST=none Change-Id: Ieb4a5c19eb5a7e78959f84dd2abe6b850721dc15 Reviewed-on: https://chromium-review.googlesource.com/1126573 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Dan Erat <derat@chromium.org> [modify] https://crrev.com/4b5f7e0335090c89be9223e410de50c2f3f04315/server/hosts/cros_host.py
,
Jul 10
I don't see any failures in this test yesterday. |
|||
►
Sign in to add a comment |
|||
Comment 1 by ihf@chromium.org
, Jun 21 2018