Issue 753407

Issue metadata

Status: Fixed
Owner: ----
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



DUT repair should reboot more aggressively.

Project Member Reported by dgarr...@chromium.org, Aug 8 2017

Issue description

We've had several DUTs get into a bad state this week in which they fail every test but pass repair.

The belief is that rebooting them will fix the issue.

This has killed multiple CQ runs for both sentry and wolf.

   https://crbug.com/753120 
   https://crbug.com/753221 
 
I'm proposing that repair start rebooting DUTs, either every time or after a fail/repair streak of some length.
 
Note: the DUTs in question fail provision (not tests) and then pass repair, every time.

If we understand why they are failing provision, maybe we can test for that during repair.

Alternatively, we can start forcing provision during repair for DUTs that have failed more than X provisions in a row.
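
A minimal sketch of the "more than X provisions in a row" check, assuming the provision outcomes for a DUT are available as a list of booleans (oldest first, True meaning provision passed). The function name and the history format are hypothetical and are not part of autotest:

```python
def should_force_provision(history, threshold):
    """Decide whether repair should force a provision.

    history:   provision outcomes, oldest first; True means provision passed.
               (Hypothetical format, chosen only for this sketch.)
    threshold: the "X" above; force provision once the DUT has failed
               more than this many provisions in a row.
    """
    streak = 0
    # Walk backwards from the most recent provision and count failures
    # until the first success ends the streak.
    for passed in reversed(history):
        if passed:
            break
        streak += 1
    return streak > threshold
```

For example, with a threshold of 2, a DUT whose last three provisions all failed would be forced through provision, while one with only two consecutive failures would not.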
> If we understand why they are failing provision, maybe we can test for that during repair.

In the case of bug 753120, the DUT passes repair because it's accessible
via ssh from servers outside of the lab.  It fails provision because it's
inaccessible from the devservers.  I'm leery of running a test that complicated
every time we verify (meaning every reset task), and that's the only
option available for such a test without redesigning repair.

The following patch reboots the DUT after every repair:

diff --git a/server/hosts/cros_host.py b/server/hosts/cros_host.py
index 2d9ccd6b8..8a60aea2f 100644
--- a/server/hosts/cros_host.py
+++ b/server/hosts/cros_host.py
@@ -1218,6 +1218,7 @@ class CrosHost(abstract_ssh.AbstractSSHHost):
         repair steps needed to get the DUT working.
         """
         self._repair_strategy.repair(self)
+        self.reboot()
 
 
     def close(self):

Project Member

Comment 3 by bugdroid1@chromium.org, Aug 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/bfab867c63817406cc6b44529d531d9c83345548

commit bfab867c63817406cc6b44529d531d9c83345548
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Aug 10 23:55:52 2017

[autotest] Force reboot with every repair.

Sometimes, DUTs get stuck in a provision/repair cycle caused by a
mysterious problem with certain USB ethernet dongles.  Rebooting the
DUT is sufficient to clear the problem.

This adds a reboot call to the repair code, so as to break the cycle.

BUG=chromium:753407
TEST=None

Change-Id: I0b5c82a185719bc8b8ddf9e6b7ed7cb314617cb9
Reviewed-on: https://chromium-review.googlesource.com/611240
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Commit-Queue: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>

[modify] https://crrev.com/bfab867c63817406cc6b44529d531d9c83345548/server/hosts/cros_host.py

Let's hold this open for a bit, since the strategy of "reboot every
time" could be overaggressive.

An alternative to consider would be something like this:
  * Before repair starts, remember the local time, then run repair.
  * After repair completes successfully, check elapsed time, and check
    uptime on the DUT.
  * Only reboot if the DUT's uptime is greater than the elapsed time.

That would avoid rebooting DUTs that went through any repair step that
already forced a reboot.
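
A rough sketch of that check, assuming uptime is read from /proc/uptime on the DUT. The three callables (run_repair, read_proc_uptime, reboot) are hypothetical stand-ins for whatever the host object provides; this is not the code that landed:

```python
import time


def parse_uptime(proc_uptime_text):
    """Return seconds since boot from the contents of /proc/uptime.

    The first whitespace-separated field of /proc/uptime is the uptime
    in seconds.
    """
    return float(proc_uptime_text.split()[0])


def needs_reboot(uptime_seconds, elapsed_seconds):
    """True if the DUT has been up longer than repair has been running,
    i.e. no repair step rebooted it."""
    return uptime_seconds > elapsed_seconds


def repair_then_maybe_reboot(run_repair, read_proc_uptime, reboot,
                             clock=time.monotonic):
    """Run repair, then reboot only if repair did not already reboot.

    All three callables are hypothetical hooks for this sketch:
      run_repair()        runs the repair steps, raising on failure
      read_proc_uptime()  returns the text of /proc/uptime from the DUT
      reboot()            reboots the DUT
    Returns True if a reboot was issued.
    """
    start = clock()
    run_repair()  # if this raises, repair failed and we never reboot
    elapsed = clock() - start
    uptime = parse_uptime(read_proc_uptime())
    if needs_reboot(uptime, elapsed):
        reboot()
        return True
    return False
```

The elapsed time is measured on the server side with a monotonic clock, so the comparison does not depend on the DUT's wall clock being correct.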

Status: Fixed (was: Untriaged)
Labels: -Pri-3 Pri-1
This was retroactively a P1.
Labels: -Chase-Pending Chase
