Skylab: dut_state is ready after a test failure?
Issue description

During investigation of the reef-paladin failure, I found that we usually get two failures in a row, e.g. for this bot:
https://chrome-swarming.appspot.com/bot?id=chromeos-skylab-bot-cafaa769-50cb-4a7f-99c3-12dd99fd7f9a&sort_stats=total%3Adesc
https://screenshot.googleplex.com/vcCqQNVskLd
https://screenshot.googleplex.com/iAGEoQERr3v

The first failure is always a real test failure. The second fails because the DUT is not in good condition. Every task requires dut_state='ready'. If a DUT fails one test, how can it accept the next one immediately without first taking a FleetAdminTask?
Aug 22
In the most recent example on that bot, the first test failed, then the second job ran reset, and reset failed because the DUT was in a bad state. That's the expected sequence of events, but I don't know why the DUT was bad.
08/21 22:31:36.178 DEBUG| utils:0286| [stderr] which: no python in (/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin)
08/21 22:31:36.277 ERROR| repair:0354| Failed: Python on the host is installed and working
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 351, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 262, in verify
    raise hosts.AutoservVerifyError(message)
AutoservVerifyError: Python is missing; may be caused by powerwash
08/21 22:31:40.665 DEBUG| repair:0111| The following dependencies failed:
08/21 22:31:40.665 DEBUG| repair:0113| The host's TPM is available and working
08/21 22:31:40.665 DEBUG| repair:0113| Python on the host is installed and working
08/21 22:31:40.666 ERROR| reset:0037| Reset failed due to Exception.
Traceback (most recent call last):
  File "/usr/local/autotest/server/control_segments/reset", line 33, in reset
    target.verify()
  File "/usr/local/autotest/server/hosts/cros_host.py", line 1164, in verify
    self._repair_strategy.verify(self)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 756, in verify
    self._verify_root._verify_host(host, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 348, in _verify_host
    self._verify_dependencies(host, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 226, in _verify_dependencies
    self._verify_list(host, self._dependency_list, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 216, in _verify_list
    raise AutoservVerifyDependencyError(self, failures)
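
For readers unfamiliar with this kind of log, the traceback suggests a tree of verifiers in which a node fails if any verifier it depends on fails. The following is only a rough sketch of that pattern, not the actual Autotest code: the classes Verifier, VerifyError and DependencyError, and the dict used as a stand-in host, are all hypothetical names introduced for illustration.

```python
class VerifyError(Exception):
    """A single verifier's own check failed (hypothetical stand-in)."""


class DependencyError(Exception):
    """One or more verifiers that a node depends on failed (hypothetical)."""

    def __init__(self, node, failures):
        super().__init__("dependencies failed for: %s" % node.description)
        self.failures = failures


class Verifier:
    """A check on a host that runs only after its dependencies pass."""

    def __init__(self, description, check, dependencies=()):
        self.description = description
        self._check = check
        self._dependencies = list(dependencies)

    def verify_host(self, host):
        # Run every dependency and collect the descriptions of those that fail.
        failures = []
        for dep in self._dependencies:
            try:
                dep.verify_host(host)
            except (VerifyError, DependencyError):
                failures.append(dep.description)
        if failures:
            raise DependencyError(self, failures)
        # Dependencies passed; now run this node's own check.
        if not self._check(host):
            raise VerifyError(self.description)


# Assumed toy host state: both checks fail, as in the log above.
host = {"tpm_ok": False, "python_ok": False}

tpm = Verifier("The host's TPM is available and working",
               lambda h: h["tpm_ok"])
python = Verifier("Python on the host is installed and working",
                  lambda h: h["python_ok"])
root = Verifier("root", lambda h: True, dependencies=[tpm, python])

try:
    root.verify_host(host)
    failed = []
except DependencyError as e:
    failed = e.failures

print("The following dependencies failed:")
for description in failed:
    print("  " + description)
```

In this sketch the root node never runs its own check, because the failing dependencies raise first. That mirrors how reset aborts in the log once the dependency list reports failures.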
Aug 22
It doesn't sound right that a failed test causes the following test to fail. It looks like the test and the reset are combined and share the failure, which is then reported as a test failure.
Aug 22
The failed test itself is not causing the following test to fail. Either the DUT is bad, or the test not only fails but also breaks the DUT.

> It looks like the test and the reset are combined and share the failure, which is then reported as a test failure.

That's the same behavior as Autotest (prejob task failure -> test job failure). We can change it after Skylab rolls out completely, but I don't think changing that is in scope for Skylab.
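
To make the "prejob task failure -> test job failure" behavior concrete, here is a minimal sketch of how two consecutive failures can appear on one bot. It is an illustration under assumptions, not Skylab or Autotest code: JobResult, run_job, the dict-based DUT and the helper functions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobResult:
    status: str        # "PASSED" or "FAILED"
    failed_stage: str  # "", "reset", or "test"


def run_job(dut, reset, test):
    """Run the prejob reset, then the test; either failure fails the job."""
    if not reset(dut):
        # The test never runs, but the job is still reported as failed.
        return JobResult("FAILED", "reset")
    if not test(dut):
        return JobResult("FAILED", "test")
    return JobResult("PASSED", "")


def reset(dut):
    # Assumed toy reset: succeeds only if the DUT is healthy.
    return dut["healthy"]


def breaking_test(dut):
    # A test that fails and also leaves the DUT in a bad state.
    dut["healthy"] = False
    return False


def passing_test(dut):
    return True


dut = {"healthy": True}
first = run_job(dut, reset, breaking_test)   # real test failure
second = run_job(dut, reset, passing_test)   # fails in reset, before the test
print(first)
print(second)
```

Both jobs end up with status FAILED, but only the first one failed in the test stage. Recording the failing stage separately, as the hypothetical failed_stage field does here, is one way the two cases could be told apart.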
Comment 1 by ayatane@chromium.org, Aug 22