wizpigs in the lab are on repair/verify failure loop
Issue description: There are a number of wizpigs in the lab which appear to be repeatedly failing the repair/verify cycle. The first three I checked from the suites pool are exhibiting this issue: chromeos2-row8-rack8-host10, chromeos2-row8-rack8-host13, chromeos2-row8-rack8-host14.
Oct 2 2017
Won't they potentially pick out-of-band test suites? Also, if there is some systemic issue with the software, the product team should be involved to identify and fix these issues.
Oct 2 2017
Still working on the push, but the one I looked at is failing USB repair; I didn't understand why at first glance. And no: since they are in the repair-failed state, they won't be used for any tests.
Oct 3 2017
This likely should be WontFix:
* When devices fail repair, the system periodically verifies the devices automatically, so the "loop" behavior described is WAI (working as intended); see the sketch below.
* As dgarrett@ noted, devices that fail repair aren't used for testing, and so don't cause test failures.
* There are automated processes for identifying failed devices and requesting manual fixes. The CrOS Infra team doesn't get involved unless the volume of failures reduces supply so much that tests can't run.
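For illustration, the repair-failed/re-verify cycle described above behaves roughly like the sketch below. This is a simplified, hypothetical sketch and not the actual Autotest scheduler code; try_verify, try_repair, and the re-check interval are placeholder names and values.

    # Hypothetical sketch of the "repair failed -> periodic re-verify" loop
    # described above. NOT the actual Autotest scheduler code; try_verify,
    # try_repair, and REVERIFY_INTERVAL_SEC are placeholders.
    import time

    REVERIFY_INTERVAL_SEC = 60 * 60  # assumed re-check interval

    def periodic_reverify(host, try_verify, try_repair):
        """Re-check a repair-failed host until verify passes again."""
        while True:
            if try_verify(host):
                print(f"{host}: verify passed; host returns to the usable pool")
                return
            if try_repair(host) and try_verify(host):
                print(f"{host}: repair succeeded; host returns to the usable pool")
                return
            # Host stays in Repair Failed; it is excluded from scheduling,
            # so it cannot cause test failures while in this state.
            print(f"{host}: still failing verify/repair; will retry later")
            time.sleep(REVERIFY_INTERVAL_SEC)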
Oct 3 2017
We've got 3 out of 3 devices (of the ones I checked) in this state on a board that is experiencing 3-4% provision failures (last time I checked), which is much higher than other boards and also mirrors a board from the same family that is causing issues (cyan). From looking at the host list, I'm pretty sure the actual count is probably at least 6-8 wizpigs.

We want to get to the bottom of these provisioning failures, which are causing CQ failures, and understanding these repair/verify failures would be helpful to that end. Alternatively, since we have a lot of wizpigs, if they were up and running and I could run tests on 10 devices at once instead of 2, it would potentially make reproducing issue 639301 much easier.
Oct 3 2017
PS: The shard for wizpig was having DB corruption issues yesterday that are believed to be fixed. I haven't yet rechecked the devices to see what state they are in, since I'm still trying to get the software push to work.
Oct 3 2017
dgarrett: What was the bug for the shard corruption issue? Wizpig is on chromeos-server104, but I don't see a recent bug mentioning that server.
Oct 3 2017
It was the skunk1 issue that Aviv worked on yesterday. I don't know if there was a bug, only that he declared victory (I was working on skunk-1 at the same time, which was very confusing).
Oct 3 2017
Or do I have it confused? There were many shard issues yesterday, all fixed to the best of my knowledge.
Oct 3 2017
Perhaps #7/#9 refer to issue 770865 "shard_client crashlooping on chromeos-skunk-1 | board:auron_paine outage"
Oct 3 2017
Issue 771257 may be related, since it shows two wizpigs continuously failing in pool:cq, and this does block the master-paladin. Host chromeos6-row2-rack20-host20 continuously fails in Verify/Repair; host chromeos6-row2-rack20-host4 fails in Provision.
Oct 14 2017
Transferring outstanding deputy bugs.