
Issue 663963


Issue metadata

Status: WontFix
Owner: ----
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Feature

Blocked on:
issue 653402




DUT verify/repair should mark DUT as bad based on repeated failed provisions. (was: DUT verify/repair should reimage, based on repeated failed provisions.)

Project Member Reported by dgarr...@chromium.org, Nov 10 2016

Issue description

We have examples of DUTs that fail to provision, get verified (pass), fail to provision, get verified (pass), and so on.

The most recent cause is understood and being worked.

However, we should have a more general solution.

I suggest that the verify/repair code looks at history, and forces a reimage as part of the repair when this pattern is seen. This could be by an update to the 'stable' version, or via servo install. If we are unable to perform that update, mark the DUT as bad.
 
Cc: aaboagye@chromium.org
A repair job includes re-imaging the DUT after exhausting other options. But I don't think that actually helps, because in order to run the test, you need that DUT provisioned with the build under test.

I see three general causes for a provision failure:

a) The build is bad.
b) Infrastructure issues.
c) There's something physically wrong with the DUT.

For a), you just have to check that the build succeeded on another DUT. For c), I would expect the verify job to catch this case. b) is a little harder because it could be flake for a number of reasons, but a retry may succeed (or not).

What about this:

- attempt to provision DUT A
- if it fails, attempt repair
- if repair fails, mark the DUT as "Repair Failed" (already happens today) and fail the suite. Don't provision any other DUTs with this build. (the build may have killed the DUT)
- if repair succeeds, reattempt provision on the same DUT (maybe it was flake)
- if provision fails again, attempt repair, but regardless of the repair status, don't try the build on this DUT anymore.
- check to see if any other DUTs succeeded in provisioning this build.
- if no other DUTs attempted, select a DUT B to attempt to provision.
- if provision succeeded for DUT B, maybe alert someone to take a look at why DUT A has issues.
- if DUT B fails to provision as well, attempt repair and just fail the test suite.

I think this gives the build a sufficient "chance" to provision a DUT without possibly wiping out many DUTs. Failing the suite also has the side effect that it might be faster than hoping, after each test, that the next provision might work. But there are probably aspects that I failed to consider, like: what happens if one of the tests exercises the DUT in a way that brings it down in the middle of the suite?
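
A minimal sketch in Python of the flow proposed above. It assumes a hypothetical Dut object with provision(), repair(), mark_repair_failed(), and blacklist_build() methods; none of these names are real autotest APIs, they exist only to illustrate the control flow.

class SuiteFailure(Exception):
    """Raised when the suite should stop scheduling this build."""


def try_build_on_dut(dut, build):
    """Provision -> repair -> re-provision on one DUT.

    Returns True if the DUT ends up running `build`.
    """
    if dut.provision(build):
        return True
    if not dut.repair():
        dut.mark_repair_failed()  # already happens today
        # The build may have killed the DUT: stop provisioning others.
        raise SuiteFailure('repair failed on %s' % dut)
    # Repair succeeded; the first failure may have been flake, so retry.
    if dut.provision(build):
        return True
    dut.repair()                 # best effort; regardless of outcome,
    dut.blacklist_build(build)   # don't try this build on this DUT again
    return False


def pick_dut_for_build(build, dut_a, dut_b, notify):
    """Give the build at most two DUTs before failing the suite."""
    if try_build_on_dut(dut_a, build):
        return dut_a
    if try_build_on_dut(dut_b, build):
        notify('%s provisions %s but %s does not' % (dut_b, build, dut_a))
        return dut_b
    raise SuiteFailure('two DUTs failed to provision %s' % build)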

Comment 2 by ihf@chromium.org, Nov 10 2016

Some slightly different take:
a) One should check that it succeeded on *most* other DUTs (not just one).
c) Verify doesn't really catch a subtly bad DUT. One needs statistics for that across multiple images and DUTs.

Let's back up a little to get the big picture:
1) If an image has passed BVT master it is by definition good.
2) If an image has passed BVT on stable it is by definition double plus good and ready to ship to users.
3) If an image is ready to ship to users it can be used to check if a DUT is good or bad.
4) If an image is ready to ship to users (or even better, has been shipped to users), we can use it hard without fear of bricking the lab to check if DUTs are bad.
5) If we were to brick the lab with such an image, then that would be preferable to bricking a million users.

We are not operating in a vacuum here. There is plenty of state we can use to gain confidence.

In particular, I object to failing a non-BVT suite too easily due to provision issues.

Comment 3 by ihf@chromium.org, Nov 10 2016

Summary: DUT verify/repair should mark DUT as bad based on repeated failed provisions. (was: DUT verify/repair should reimage, based on repeated failed provisions.)
That said, this issue really is about finding DUTs that are misbehaving over long periods of time and preventing them from using too many resources and killing too many jobs. This doesn't have to be done continuously. A cron job could check the DB twice a day and fish for DUTs that failed everything in the last 24 hours for instance. This cron job could send an alert if it saw excessive badness in the lab, mark the DUTs and automatically schedule a thorough hardware check (say running BVT a few times) using a known good stable build.
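
A rough sketch of what that twice-daily cron check could look like. The query_tko() helper and the table/column names are assumptions rather than the actual TKO schema; the point is only the shape of the "failed everything in the last 24 hours" query.

import datetime

FAIL_WINDOW = datetime.timedelta(hours=24)

def find_suspect_duts(query_tko, now=None):
    """Return hostnames whose every test in the last 24 hours failed."""
    now = now or datetime.datetime.utcnow()
    since = now - FAIL_WINDOW
    rows = query_tko(
        'SELECT hostname, '
        '       SUM(status = "GOOD") AS passes, '
        '       COUNT(*) AS total '
        'FROM test_view_2 '
        'WHERE started_time > %s '
        'GROUP BY hostname', (since,))
    return [r['hostname'] for r in rows
            if r['total'] > 0 and r['passes'] == 0]
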
I'll throw out another case to add to Allen's list....

d) Verify is incorrectly deciding that a DUT is good.

In the case that inspired this, we had an infrastructure failure (stateful.tgz mismatch), that would block nearly all updates on that DUT from ever succeeding, but verify was incorrectly marking the DUT as good. An install from USB via servo was the only way to really "fix" the DUT automatically.

So... I want to make sure that the repair does include an attempt to install the stable version (via USB, if possible) even if the verify steps are reporting the DUT as good.

That shouldn't be needed if other steps are working perfectly, but makes us more robust if they are imperfect.
Current verify/repair code already checks for provision failures,
and will force re-install if provision fails.

Please post the history of a failed DUT:  I'd like to see more
evidence that the proposed solution matches the actual problem.
Status: Available (was: Untriaged)
Summary: Detect DUTs trapped in a repair loop (was: DUT verify/repair should mark DUT as bad based on repeated failed provisions.)
OK.  After some discussion, we're a) clarifying the symptom we
want to correct, and b) proposing a strawman for a fix.

Regarding clarifying the symptom:  From time to time, DUTs do
get stuck in a loop where some condition makes the DUT unable
to pass a special task (principally, but not always, a Provision
task), yet the subsequent Repair task finds nothing wrong.  This
leads to a DUT that can't run tests, but isn't flagged for action.
We want to detect and surface such conditions.

The proposed fix is to add a verifier that tries to detect this.
The verifier would look for a sequence consisting entirely of
failed special tasks followed by successful repair, without running
any intervening tests.  The pattern would have to be seen to repeat
some number of times (call it N).  The verifier would be a trigger for
all three re-installing repair actions.

Regarding the value of N, we need N > 2; I'd recommend N == 6:
  * N > 1 is mandatory.  Otherwise, every successful repair would
    trigger a failure in the verifier.
  * N > 2 is better; I believe there are enough cases where a bad
    build (or a devserver problem) could cause two successive
    provision jobs to fail that the verifier would fail too often
    to be useful.
  * N == 6 is probably enough.  6 Provision failures represent
    more than an hour's worth of failures.
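
A rough illustration of the proposed pattern match (not the real verify/repair code): it assumes the DUT's recent history is available as an oldest-to-newest list of (task_type, success) tuples covering special tasks and tests.

REPAIR_LOOP_THRESHOLD = 6  # the "N" discussed above

def stuck_in_repair_loop(history, threshold=REPAIR_LOOP_THRESHOLD):
    """Return True if the tail of `history` is `threshold` repetitions of:
    failed special task (e.g. Provision) -> successful Repair, with no
    test runs in between."""
    cycles = 0
    i = len(history) - 1
    while i >= 1:
        task, ok = history[i]
        prev_task, prev_ok = history[i - 1]
        if task == 'Repair' and ok and prev_task != 'Test' and not prev_ok:
            cycles += 1
            i -= 2
        else:
            break
    return cycles >= threshold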

Excellent, just remember that the bigger N is, the more tests we fail for reasons unrelated to the tests.

Comment 8 by ihf@chromium.org, Nov 15 2016

Richard, if we only look at one DUT at a time I agree that N=4..6. But you are not describing how bad DUTs get terminated. In your proposal bad DUTs can still cycle and destroy jobs forever, so there needs to be further escalation to resolve this:

a) a DUT which can be fixed using the available repair actions (the case which Richard describes)
b) a DUT which has not been fixed by previous full repair actions
   - we should remove the DUT from the pool (termination!)
   - notify the deputy
   - run a self-check (say, run the old/stable BVT suite corresponding to the servo image to see if the DUT still passes that suite)
c) if we see several DUTs in the lab removed as in state b) with the same image, then the image they last ran must be bad
   - notify deputy
   - jobs with that image can probably be aborted across the lab (might be hard)
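
A hedged sketch of the escalation in b) and c). The helpers (remove_from_pool, notify_deputy, schedule_bvt_selfcheck), the `removed` registry, and the threshold are all hypothetical, not existing lab infrastructure.

from collections import Counter

BAD_IMAGE_DUT_COUNT = 3   # arbitrary threshold for "several DUTs"
removed = []              # (hostname, last_image) for DUTs pulled from pools

def escalate_unrepairable_dut(hostname, last_image, remove_from_pool,
                              notify_deputy, schedule_bvt_selfcheck):
    """Case b): full repair did not fix the DUT."""
    remove_from_pool(hostname)
    notify_deputy('removed %s after repeated failed repairs' % hostname)
    schedule_bvt_selfcheck(hostname)  # e.g. stable BVT off the servo image
    removed.append((hostname, last_image))

    # Case c): several removed DUTs last ran the same image.
    counts = Counter(image for _, image in removed)
    if counts[last_image] >= BAD_IMAGE_DUT_COUNT:
        notify_deputy('image %s suspected bad: %d DUTs removed'
                      % (last_image, counts[last_image]))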


> [ ... ] But you are not describing how bad DUTs get terminated.

This bug isn't about how bad DUTs get terminated.  It's about how we
detect a specific failure mode.  A DUT stuck in a repair loop does not
necessarily (or even probably) have bad hardware.  These problems are
routinely caused by software.

What we do to detect and surface DUTs with bad hardware needs to be
the subject of a separate bug.


> [ ... ] In your proposal bad DUTs can still cycle and destroy
> jobs forever, so there needs to be further escalation to resolve this:

That's not how the proposal works.  If a DUT is determined to be in a
repair cycle, the verifier will force re-install.  If the DUT can't
install the repair image it will fail repair, which will force manual
intervention.  If the problem is that a DUT can install the repair
image but can't install some recent test image, the problem isn't
likely to be with the DUT.

Comment 10 by ihf@chromium.org, Nov 15 2016

Summary: DUT verify/repair should mark DUT as bad based on repeated failed provisions. (was: DUT verify/repair should reimage, based on repeated failed provisions.) (was: Detect DUTs trapped in a repair loop)
Then let's update the summary again.

To be plainly clear, my request as a user is protection from DUTs killing jobs forever. How you get there I don't care, but let's stick to my request.

Comment 11 by ihf@chromium.org, Nov 15 2016

Blockedon: 653402
DUTs that are in a bad state and don't do work concern me little as such, as long as they don't impact production jobs (we have enough DUTs in the lab).

The problem is that DUTs in a bad state are destroying production jobs (issue 653402). If you don't want to terminate DUTs in a bad state that keep destroying jobs on every failure, then please fix issue 653402 first.
Labels: -current-issue
Labels: Type-Feature
Owner: ----
Status: WontFix (was: Available)
This bug seems hopelessly muddled.  Here's the original ask:

====
We have examples of DUTs that fail to provision, get verified (pass), fail to provision, get verified (pass), and so on.

The most recent cause is understood and being worked.

However, we should have a more general solution.

I suggest that the verify/repair code looks at history, and forces a reimage as part of the repair when this pattern is seen. This could be by an update to the 'stable' version, or via servo install. If we are unable to perform that update, mark the DUT as bad.
====

That request, as phrased, describes the current actual behavior of
the system, and no one has demonstrated an example of that
intended behavior not working.

So, I'm closing this as WontFix.

If somebody's got a current example of the originally described
problem occurring, it's time to file a new bug, and show the goods.

If somebody's got a different ask, that should have been a new
bug ab initio.  It's not too late to file that bug now.
