Issue 876762

Issue metadata

Status: WontFix
Owner:
Closed: Sep 11
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: ----



Skylab: dut_state is ready after a test failure?

Project Member Reported by xixuan@chromium.org, Aug 22

Issue description

While investigating the reef-paladin failure, I found that we usually get 2 failures in a row, e.g. for this bot:

https://chrome-swarming.appspot.com/bot?id=chromeos-skylab-bot-cafaa769-50cb-4a7f-99c3-12dd99fd7f9a&sort_stats=total%3Adesc

https://screenshot.googleplex.com/vcCqQNVskLd
https://screenshot.googleplex.com/iAGEoQERr3v

The first failure is always a real test failure.
The second failure occurs because the DUT is not in good condition.

For every task, we require dut_state='ready'. If a DUT fails one test, how can it accept the next one immediately without first taking a FleetAdminTask?

 
I'll double check, but I don't think failing a test causes a DUT to require repair.
In the most recent example on that bot, the first test failed, then the second job ran reset, and reset failed because the DUT was in a bad state.  That's the expected sequence of events, but I don't know why the DUT was bad.

08/21 22:31:36.178 DEBUG|             utils:0286| [stderr] which: no python in (/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin)
08/21 22:31:36.277 ERROR|            repair:0354| Failed: Python on the host is installed and working
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 351, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 262, in verify
    raise hosts.AutoservVerifyError(message)
AutoservVerifyError: Python is missing; may be caused by powerwash

08/21 22:31:40.665 DEBUG|            repair:0111| The following dependencies failed:
08/21 22:31:40.665 DEBUG|            repair:0113|     The host's TPM is available and working
08/21 22:31:40.665 DEBUG|            repair:0113|     Python on the host is installed and working
08/21 22:31:40.666 ERROR|             reset:0037| Reset failed due to Exception.
Traceback (most recent call last):
  File "/usr/local/autotest/server/control_segments/reset", line 33, in reset
    target.verify()
  File "/usr/local/autotest/server/hosts/cros_host.py", line 1164, in verify
    self._repair_strategy.verify(self)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 756, in verify
    self._verify_root._verify_host(host, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 348, in _verify_host
    self._verify_dependencies(host, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 226, in _verify_dependencies
    self._verify_list(host, self._dependency_list, silent)
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 216, in _verify_list
    raise AutoservVerifyDependencyError(self, failures)
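
The traceback above shows a root verifier collecting its failed dependencies and raising a single dependency error for them. A minimal sketch of that pattern, assuming nothing about the real Autotest code beyond what the log shows: the class names `Verifier`, `VerifyError`, and `DependencyError`, the dictionary used as a host, and the flat two-dependency layout are all hypothetical stand-ins, not the actual `repair.py` API.

```python
class VerifyError(Exception):
    """Hypothetical stand-in for a single failed check."""


class DependencyError(Exception):
    """Hypothetical stand-in raised when one or more dependencies fail."""

    def __init__(self, failures):
        super().__init__("dependencies failed: " + ", ".join(sorted(failures)))
        self.failures = set(failures)


class Verifier:
    """One check plus the checks it depends on (illustrative only)."""

    def __init__(self, description, check, dependencies=()):
        self.description = description
        self._check = check
        self._dependencies = list(dependencies)

    def verify(self, host):
        # Run every dependency first and collect all failures, rather
        # than stopping at the first one.
        failures = set()
        for dep in self._dependencies:
            try:
                dep.verify(host)
            except DependencyError as e:
                failures |= e.failures
            except VerifyError:
                failures.add(dep.description)
        if failures:
            raise DependencyError(failures)
        # Only run this verifier's own check once its dependencies pass.
        if not self._check(host):
            raise VerifyError(self.description)


def failed_dependencies(host):
    """Return the sorted descriptions of failed dependencies, if any."""
    python_ok = Verifier("Python on the host is installed and working",
                         lambda h: h["python"])
    tpm_ok = Verifier("The host's TPM is available and working",
                      lambda h: h["tpm"])
    root = Verifier("root", lambda h: True, [tpm_ok, python_ok])
    try:
        root.verify(host)
    except DependencyError as e:
        return sorted(e.failures)
    return []
```

With a host where both checks fail, `failed_dependencies({"python": False, "tpm": False})` returns both descriptions, mirroring the "The following dependencies failed" lines in the log.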

It doesn't sound right that a failed test causes the following test to fail. It looks like the test and reset are combined and share the failure, which is reported as a test failure.
The failed test itself is not causing the following test to fail.  Either the DUT is bad or the test not only fails but also breaks the DUT.

>It looks like the test and reset are combined and share the failure, which is reported as a test failure.

That's the same behavior as Autotest (prejob task failure -> test job failure).  We can change it after Skylab rolls out completely, but I don't think changing that is in scope for Skylab.
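
To illustrate the "prejob task failure -> test job failure" behavior described above, here is a rough sketch. The function `run_job` and the shape of its result are hypothetical and not taken from Autotest or Skylab; the sketch only shows how a single job status can hide which stage actually failed.

```python
def run_job(prejob, test):
    """Run a prejob task (e.g. reset), then the test, under one job status.

    Hypothetical helper: if the prejob raises, the job is reported as
    FAILED even though the test itself never ran.
    """
    try:
        prejob()
    except Exception as e:
        return {"status": "FAILED", "stage": "prejob",
                "reason": str(e), "test_ran": False}
    try:
        test()
    except Exception as e:
        return {"status": "FAILED", "stage": "test",
                "reason": str(e), "test_ran": True}
    return {"status": "PASSED", "stage": None,
            "reason": "", "test_ran": True}


def broken_reset():
    # Stands in for a reset that fails because the DUT is in a bad state.
    raise RuntimeError("Reset failed due to Exception.")


def passing_test():
    pass


result = run_job(broken_reset, passing_test)
```

Here `result["status"]` is `"FAILED"` while `result["test_ran"]` is `False`, which is the kind of second failure seen on the bot: the job fails, but the failing stage is the reset rather than the test.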
Owner: ayatane@chromium.org
Status: WontFix (was: Untriaged)