veyron_minnie-tot-chrome-pfq-informational failing due to DUT failures
Issue description

The following hosts are failing repeatedly on veyron_minnie-tot-chrome-pfq-informational:

host: chromeos4-row9-rack10-host6, status: Repair Failed, locked: False diagnosis: Failed repair
host: chromeos4-row9-rack9-host17, status: Repairing, locked: False diagnosis: Failed repair
host: chromeos2-row6-rack2-host3, status: Repair Failed, locked: True diagnosis: Failed repair
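For anyone triaging a report like the one in the issue description, the host entries follow a regular shape ("host: ..., status: ..., locked: ... diagnosis: ..."). Below is a minimal Python sketch that splits such a report into records. The regular expression is inferred only from the text quoted in this issue, and `HostStatus` and `parse_host_report` are hypothetical names for illustration, not part of autotest.

```python
import re
from typing import List, NamedTuple


class HostStatus(NamedTuple):
    host: str
    status: str
    locked: bool
    diagnosis: str


# Pattern inferred from the report text quoted in this issue (an assumption,
# not taken from the tool that produces the report).
_HOST_RE = re.compile(
    r"host:\s*(?P<host>\S+),\s*"
    r"status:\s*(?P<status>.+?),\s*"
    r"locked:\s*(?P<locked>True|False)\s+"
    r"diagnosis:\s*(?P<diagnosis>.+?)(?=\s+host:|\s*$)",
    re.DOTALL,
)


def parse_host_report(text: str) -> List[HostStatus]:
    """Split a flattened or multi-line host report into one record per host."""
    return [
        HostStatus(
            host=m["host"],
            status=m["status"].strip(),
            locked=(m["locked"] == "True"),
            diagnosis=m["diagnosis"].strip(),
        )
        for m in _HOST_RE.finditer(text)
    ]


report = (
    "host: chromeos4-row9-rack10-host6, status: Repair Failed, locked: False "
    "diagnosis: Failed repair "
    "host: chromeos4-row9-rack9-host17, status: Repairing, locked: False "
    "diagnosis: Failed repair "
    "host: chromeos2-row6-rack2-host3, status: Repair Failed, locked: True "
    "diagnosis: Failed repair"
)
hosts = parse_host_report(report)
locked_hosts = [h.host for h in hosts if h.locked]
```

With the records split out, listing only locked hosts or only hosts in "Repair Failed" becomes a one-line filter, which makes it easier to cross-check the report against the lab's view of each DUT.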
,
Sep 21
These two haven't run anything, even repair, since late August. Not sure why. They are both locked (but I don't believe they need to be):

chromeos4-row9-rack9-host17
chromeos2-row6-rack2-host3

I'm going to unlock them both and manually request a repair.

The chromeos4-row9-rack9-host17 repair is failing and likely needs investigation from lab techs; will file.
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row9-rack9-host17/2226275-repair/
,
Sep 21
https://b.corp.google.com/issues/116343878 filed for chromeos4-row9-rack9-host17
,
Sep 21
I locked those two because they were repeatedly failing and that is what I was told to do.
,
Sep 21
I may have given the wrong advice. Looking at their history, the repeated failures are from late August. I assumed you were referring to recent failures. I don't see any repair attempts since those late-August failures. The devices have been in "Repair Failed" state since then, which means they should not get used by tests anyway, and shouldn't have been causing any build failures. Do you have a counterexample -- a build more recent than Sept 1 that failed due to one of those devices?
,
Sep 21
I think the issue is that pool:continuous, which is what tot-chrome-pfq-informational uses, has insufficient DUTs, and the suites pool has no spares.

    akeshet@akeshet:~$ balance_pool continuous veyron_minnie
    veyron_minnie continuous pool: Target of 5 is above minimum.
    Balancing ['model:veyron_minnie'] continuous pool:
    Total 5 DUTs, 2 working, 3 broken, 0 reserved.
    Target is 5 working DUTs; grow pool by 3 DUTs.
    ['model:veyron_minnie'] suites pool has 0 spares available for balancing pool continuous
    ERROR: Not enough spares: need 3, only have 0.
    ERROR: ['model:veyron_minnie'] continuous pool: Refusing to act on pool with 3 broken DUTs.
    ERROR: Please investigate this model to for a bug
    ERROR: that is bricking devices. Once you have finished your
    ERROR: investigation, you can force a rebalance with
    ERROR: --force-rebalance
    Transferring 0 DUTs from continuous to suites.
    Transferring 0 DUTs from suites to continuous.

Will add those other 2 DUTs to the repair ticket if automated repair fails for them. I don't know why they haven't been running repair jobs on their own.
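The arithmetic behind the balance_pool output above can be sketched in a few lines of Python. This is only an illustration of the numbers in that output, not the real balance_pool implementation: `PoolState`, `plan_growth`, and the `max_broken` threshold are hypothetical, and the actual rule for when the tool refuses to act is not shown in this thread.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PoolState:
    working: int
    broken: int
    reserved: int = 0

    @property
    def total(self) -> int:
        return self.working + self.broken + self.reserved


def plan_growth(
    pool: PoolState,
    target_working: int,
    spares_available: int,
    max_broken: int = 2,  # hypothetical threshold, chosen for illustration only
) -> Tuple[int, int, List[str]]:
    """Return (DUTs needed, DUTs that would be transferred, problems found)."""
    needed = max(0, target_working - pool.working)
    problems = []
    if needed > spares_available:
        problems.append(
            f"Not enough spares: need {needed}, only have {spares_available}."
        )
    if pool.broken > max_broken:
        problems.append(
            f"Refusing to act on pool with {pool.broken} broken DUTs."
        )
    # Assumption: this sketch transfers nothing when any problem is reported.
    transferable = 0 if problems else needed
    return needed, transferable, problems


# Numbers from the output above: 5 DUTs total, 2 working, 3 broken, 0 spares.
continuous = PoolState(working=2, broken=3)
needed, transferable, problems = plan_growth(
    continuous, target_working=5, spares_available=0
)
```

With the numbers from this issue, the pool needs 3 more working DUTs and can get none, so the fix has to come from repairing the broken DUTs or adding spares rather than from rebalancing.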
,
Sep 21
So, it certainly sounds like we need to add some more DUTs.

It also looks like I was confused and that the DUTs that failed to repair are not the problem DUTs.

In the most recent veyron_minnie-tot-chrome-pfq-informational build:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8934774175930066496

The HWTest failure reason was:
desktopui_KillRestart.session: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.

Following the link to cautotest for that test:
http://cautotest-prod/afe/#tab_id=view_job&object_id=240394605

shows that the DUT was:
chromeos4-row9-rack10-host7

And the message for *that* DUT in 'stdout' for the HWTest stage was:
host: chromeos4-row9-rack10-host7, status: Running, locked: False diagnosis: Working

Which is apparently incorrect. That host failed at least once before recently, and so has chromeos4-row9-rack10-host9, FWIW.
Comment 1 by steve...@chromium.org
, Sep 21