DUT repair should reboot more aggressively.
Issue description
We've had several DUTs get into a bad state this week in which they fail every test but pass repair. The belief is that rebooting them will fix the issue. This has killed multiple CQ runs for both sentry and wolf.
https://crbug.com/753120
https://crbug.com/753221
I'm proposing that repair start rebooting DUTs, either every time or after a fail/repair streak of some length.
Aug 8 2017
> If we understand why they are failing provision, maybe we can test for that during repair.
In the case of bug 753120, the DUT passes repair because it's accessible via ssh from servers outside of the lab. It fails provision because it's inaccessible from the devservers. I'm leery of making a test that complicated every time we verify (meaning every reset task), and that's the only available option for such a test without redesigning repair.
The following patch will apply reboot on every repair:
diff --git a/server/hosts/cros_host.py b/server/hosts/cros_host.py
index 2d9ccd6b8..8a60aea2f 100644
--- a/server/hosts/cros_host.py
+++ b/server/hosts/cros_host.py
@@ -1218,6 +1218,7 @@ class CrosHost(abstract_ssh.AbstractSSHHost):
         repair steps needed to get the DUT working.
         """
         self._repair_strategy.repair(self)
+        self.reboot()
     def close(self):
Aug 10 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/bfab867c63817406cc6b44529d531d9c83345548
commit bfab867c63817406cc6b44529d531d9c83345548
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Aug 10 23:55:52 2017
[autotest] Force reboot with every repair.
Sometimes, DUTs get stuck in a provision/repair cycle caused by a mysterious problem with certain USB ethernet dongles. Rebooting the DUT is sufficient to clear the problem.
This adds a reboot call to the repair code, so as to break the cycle.
BUG=chromium:753407
TEST=None
Change-Id: I0b5c82a185719bc8b8ddf9e6b7ed7cb314617cb9
Reviewed-on: https://chromium-review.googlesource.com/611240
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Commit-Queue: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
[modify] https://crrev.com/bfab867c63817406cc6b44529d531d9c83345548/server/hosts/cros_host.py
Aug 11 2017
Let's hold this open for a bit, since the strategy of "reboot every
time" could be overaggressive.
An alternative to consider would be something like this:
* Before repair starts, remember the local time, then run repair.
* After repair completes successfully, check elapsed time, and check
uptime on the DUT.
* Only reboot if the DUT's uptime is greater than the elapsed time.
That would avoid rebooting DUTs that went through any repair step that
already forced a reboot.
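A minimal sketch of the check described above, not the actual autotest code. The function names and the `run_repair`, `read_uptime`, and `reboot` callables are hypothetical stand-ins for whatever the host class provides. The only assumption about the DUT is that uptime can be read from `/proc/uptime`, whose first field is seconds since boot.

```python
import time


def parse_uptime(proc_uptime_text):
    """Return seconds since boot from the contents of /proc/uptime."""
    # /proc/uptime holds two numbers; the first is uptime in seconds.
    return float(proc_uptime_text.split()[0])


def needs_reboot(uptime_seconds, elapsed_seconds):
    """True if the DUT has been up longer than the repair took.

    In that case no repair step rebooted the DUT, so a reboot is still due.
    """
    return uptime_seconds > elapsed_seconds


def repair_then_maybe_reboot(run_repair, read_uptime, reboot,
                             clock=time.monotonic):
    """Run repair, then reboot only if repair did not already reboot.

    All three callables are hypothetical hooks supplied by the caller:
      run_repair()  -- runs the repair strategy, raises on failure
      read_uptime() -- returns the DUT's uptime in seconds
      reboot()      -- reboots the DUT
    Returns True if a reboot was issued.
    """
    start = clock()
    run_repair()
    elapsed = clock() - start
    if needs_reboot(read_uptime(), elapsed):
        reboot()
        return True
    return False
```

If repair raises, the reboot is skipped, which matches the "after repair completes successfully" condition. The comparison uses the server's clock for elapsed time and the DUT's own uptime, so the two clocks never need to agree on wall time.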
Aug 14 2017
Aug 15 2017
This was retroactively made a P1.
Aug 15 2017
Comment 1 by dgarr...@chromium.org, Aug 8 2017