DUT verify/repair should mark DUT as bad based on repeated failed provisions. (was: DUT verify/repair should reimage, based on repeated failed provisions.)
Issue description:
We have examples of DUTs that fail to provision, get verified (pass), fail to provision, get verified (pass), and so on. The most recent cause is understood and being worked. However, we should have a more general solution. I suggest that the verify/repair code looks at history, and forces a reimage as part of the repair when this pattern is seen. This could be by an update to the 'stable' version, or via servo install. If we are unable to perform that update, mark the DUT as bad.
Nov 10 2016
A slightly different take:
a) One should check that it succeeded on *most* other DUTs (not just one).
c) Verify doesn't really catch a subtly bad DUT. One needs statistics for that, across multiple images and DUTs.
Let's back up a little to get the big picture:
1) If an image has passed BVT master, it is by definition good.
2) If an image has passed BVT on stable, it is by definition double plus good and ready to ship to users.
3) If an image is ready to ship to users, it can be used to check whether a DUT is good or bad.
4) If an image is ready to ship to users (or even better, has been shipped to users), we can use it hard, without fear of bricking the lab, to check whether DUTs are bad.
5) If we were to brick the lab with such an image, that would still be preferable to bricking a million users.
We are not operating in a vacuum here. There is plenty of state we can use to gain confidence. In particular, I object to failing a non-BVT suite too easily due to provision issues.
Nov 10 2016
That said, this issue really is about finding DUTs that are misbehaving over long periods of time and preventing them from using too many resources and killing too many jobs. This doesn't have to be done continuously. A cron job could check the DB twice a day and, for instance, fish for DUTs that failed everything in the last 24 hours. This cron job could send an alert if it saw excessive badness in the lab, mark the DUTs, and automatically schedule a thorough hardware check (say, running BVT a few times) using a known good stable build.
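To make that idea concrete, here is a minimal sketch of such a periodic check. The fetch_test_results(since) helper, the alert threshold, and the return convention are all assumptions for illustration; the real cron job would query the actual results database and feed the lab's existing alerting and DUT-marking machinery, which are not shown here.

    import collections
    import datetime

    ALERT_THRESHOLD = 10  # illustrative: require some minimum activity before flagging a DUT

    def find_all_failing_duts(fetch_test_results, window_hours=24):
        """Return hostnames whose every recorded run in the window failed.

        fetch_test_results(since) is a hypothetical helper standing in for
        the real results-DB query; it is assumed to yield (hostname, passed)
        tuples for runs newer than `since`.
        """
        since = datetime.datetime.utcnow() - datetime.timedelta(hours=window_hours)
        counts = collections.defaultdict(lambda: [0, 0])   # hostname -> [total, failed]
        for hostname, passed in fetch_test_results(since):
            counts[hostname][0] += 1
            if not passed:
                counts[hostname][1] += 1
        return sorted(host for host, (total, failed) in counts.items()
                      if total >= ALERT_THRESHOLD and failed == total)

Anything this returns could then be alerted on, marked, and queued for the thorough hardware check described above.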
Nov 10 2016
I'll throw out another case to add to Allen's list: D) Verify is incorrectly deciding that a DUT is good. In the case that inspired this, we had an infrastructure failure (a stateful.tgz mismatch) that would block nearly all updates on that DUT from ever succeeding, but verify was incorrectly marking the DUT as good. An install from USB via servo was the only way to really "fix" the DUT automatically. So I want to make sure that the repair does include an attempt to install the stable version (via USB, if possible) even if the verify steps are reporting the DUT as good. That shouldn't be needed if the other steps are working perfectly, but it makes us more robust if they are imperfect.
Nov 10 2016
Current verify/repair code already checks for provision failures and will force a re-install if provision fails. Please post the history of a failed DUT: I'd like to see more evidence that the proposed solution matches the actual problem.
Nov 15 2016
OK. After some discussion, we're a) clarifying the symptom we
want to correct, and b) proposing a strawman for a fix.
Regarding clarifying the symptom: From time to time, DUTs do
get stuck in a loop where some condition makes the DUT unable
to pass a special task (principally, but not always, a Provision
task), yet the subsequent Repair task finds nothing wrong. This
leads to a DUT that can't run tests, but isn't flagged for action.
We want to detect and surface such conditions.
The proposed fix is to add a verifier that tries to detect this.
The verifier would look for a sequence consisting entirely of
failed special tasks followed by successful repair, without running
any intervening tests. The pattern would have to be seen to repeat
some number of times (call it N). The verifier would be a trigger
for all three re-installing repair actions.
Regarding the value of N, we need N > 2; I'd recommend N == 6:
* N > 1 is mandatory. Otherwise, every successful repair would
trigger a failure in the verifier.
* N > 2 is better; I believe there are enough cases where a bad
build (or a devserver problem) could cause two successive
provision jobs to fail that the verifier would fail too often
to be useful.
* N == 6 is probably enough. 6 Provision failures represents
more than an hour's worth of failures.
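A rough sketch of the proposed history check, under stated assumptions: the verifier is handed a newest-first list of history entries, each with hypothetical `name` and `success` attributes, and the `Event` record and threshold are illustrative only. The real verifier would plug into the repair framework's actual task-history API, which is not shown here.

    import collections

    Event = collections.namedtuple('Event', ['name', 'success'])  # illustrative history record

    REPAIR_LOOP_THRESHOLD = 6  # the value of N discussed above

    def in_repair_loop(history, threshold=REPAIR_LOOP_THRESHOLD):
        """Detect a DUT cycling between failed special tasks and 'successful' repairs.

        `history` is assumed to be newest first.  Returns True if the newest
        entries are at least `threshold` repetitions of (successful Repair
        preceded by a failed non-Repair special task), with no test runs
        in between.
        """
        cycles = 0
        i = 0
        while i + 1 < len(history):
            newer, older = history[i], history[i + 1]
            if newer.name == 'Test' or older.name == 'Test':
                break  # an intervening test run breaks the pattern
            if (newer.name == 'Repair' and newer.success
                    and older.name != 'Repair' and not older.success):
                cycles += 1
                i += 2
            else:
                break
        return cycles >= threshold

With these assumed records, in_repair_loop([Event('Repair', True), Event('Provision', False)] * 6) returns True, while any intervening test run or a successful provision breaks the pattern; a True result would trigger the re-installing repair actions.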
Nov 15 2016
Excellent; just remember that the bigger N is, the more tests we fail for reasons unrelated to the tests themselves.
Nov 15 2016
Richard, if we only look at one DUT at a time, I agree that N = 4..6. But you are not describing how bad DUTs get terminated. In your proposal bad DUTs can still cycle and destroy jobs forever, so there needs to be further escalation to resolve this:
a) A DUT which can be fixed using the available repair actions (the case which Richard describes).
b) A DUT which has not been fixed by previous full repair actions: we should remove the DUT from the pool (termination!), notify the deputy, and run a self check (say, run the old/stable bvt suite corresponding to the servo image to see if the DUT still passes that suite).
c) If we see several DUTs in the lab removed as in state b) with the same image, then the image they ran last must be bad: notify the deputy; jobs with that image can probably be aborted across the lab (might be hard).
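A hypothetical sketch of that escalation, just to make the proposed policy concrete. The hook callables (remove_from_pool, notify_deputy, schedule_selfcheck), the state names, and the thresholds are illustrative stand-ins, not existing lab APIs.

    def escalate_bad_dut(dut, fixed_by_repair, prior_full_repairs,
                         remove_from_pool, notify_deputy, schedule_selfcheck):
        """Illustrative escalation for a DUT that keeps failing jobs.

        The callables are hypothetical hooks; whatever the lab actually
        provides for pool management and alerting would go here.
        """
        if fixed_by_repair:
            return 'repaired'          # case a): normal repair actions sufficed
        if prior_full_repairs >= 1:
            # case b): a full repair has already been tried and did not stick
            remove_from_pool(dut)      # stop it from taking (and killing) more jobs
            notify_deputy('DUT %s pulled after repeated failed repairs' % dut)
            schedule_selfcheck(dut)    # e.g. run a known-good stable BVT suite
            return 'quarantined'
        # Case c) -- several DUTs quarantined while running the same image --
        # would be a separate lab-wide periodic check, not a per-DUT decision.
        return 'retry_full_repair'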
Nov 15 2016
> [ ... ] But you are not describing how bad DUTs get terminated.
This bug isn't about how bad DUTs get terminated. It's about how we detect a specific failure mode. A DUT stuck in a repair loop does not necessarily (or even probably) have bad hardware. These problems are routinely caused by software. What we do to detect and surface DUTs with bad hardware needs to be the subject of a separate bug.
> [ ... ] In your proposal bad DUTs can still cycle and destroy jobs forever, so there needs to be further escalation to resolve this:
That's not how the proposal works. If a DUT is determined to be in a repair cycle, the verifier will force a re-install. If the DUT can't install the repair image, it will fail repair, which will force manual intervention. If the problem is that a DUT can install the repair image but can't install some recent test image, the problem isn't likely to be with the DUT.
Nov 15 2016
Then let's update the summary again. To be plainly clear, my request as a user is protection from DUTs killing jobs forever. How you get there I don't care, but let's stick to my request.
Nov 15 2016
DUTs that are in a bad state and don't do work concern me little as such, as long as they don't impact production jobs (we have enough DUTs in the lab). The problem is that DUTs in a bad state are destroying production jobs (issue 653402). If you don't want to terminate DUTs in a bad state that keep destroying jobs on every failure, then please fix issue 653402 first.
Nov 15 2016
Jun 20 2017
Jun 30 2017
Jun 30 2017
This bug seems hopelessly muddled. Here's the original ask:
====
We have examples of DUTs that fail to provision, get verified (pass), fail to provision, get verified (pass), and so on. The most recent cause is understood and being worked. However, we should have a more general solution. I suggest that the verify/repair code looks at history, and forces a reimage as part of the repair when this pattern is seen. This could be by an update to the 'stable' version, or via servo install. If we are unable to perform that update, mark the DUT as bad.
====
That request, as phrased, describes current actual behavior of the system, and no one has demonstrated an example of that intended behavior not working. So, I'm closing this as WontFix. If somebody's got a current example of the originally described problem occurring, it's time to file a new bug, and show the goods. If somebody's got a different ask, that should have been a new bug ab initio. It's not too late to file that bug now.