
Issue 888108

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Nov 2
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 884901




veyron_minnie-tot-chrome-pfq-informational failing due to DUT failures

Project Member Reported by steve...@chromium.org, Sep 21

Issue description

The following hosts are failing repeatedly on veyron_minnie-tot-chrome-pfq-informational:

  host: chromeos4-row9-rack10-host6, status: Repair Failed, locked: False diagnosis: Failed repair
  
  host: chromeos4-row9-rack9-host17, status: Repairing, locked: False diagnosis: Failed repair

  host: chromeos2-row6-rack2-host3, status: Repair Failed, locked: True diagnosis: Failed repair
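
For triage, status lines like the ones above can be filtered mechanically. The following is a minimal Python sketch that assumes the exact line format quoted in this report; the pattern and the function name parse_host_line are hypothetical and not part of autotest.

```python
import re

# Pattern inferred from the status lines quoted in this issue (an assumption,
# not a documented format).
HOST_LINE = re.compile(
    r"host:\s*(?P<host>\S+),\s*status:\s*(?P<status>.+?),\s*"
    r"locked:\s*(?P<locked>True|False)\s+diagnosis:\s*(?P<diagnosis>.+)"
)


def parse_host_line(line):
    """Parse one status line into a dict, or return None if it does not match."""
    match = HOST_LINE.search(line)
    if match is None:
        return None
    return {
        "host": match.group("host"),
        "status": match.group("status"),
        "locked": match.group("locked") == "True",
        "diagnosis": match.group("diagnosis").strip(),
    }


lines = [
    "  host: chromeos4-row9-rack10-host6, status: Repair Failed, locked: False diagnosis: Failed repair",
    "  host: chromeos4-row9-rack9-host17, status: Repairing, locked: False diagnosis: Failed repair",
    "  host: chromeos2-row6-rack2-host3, status: Repair Failed, locked: True diagnosis: Failed repair",
]

# Collect the hosts that are currently locked.
locked_hosts = [h["host"] for h in map(parse_host_line, lines) if h and h["locked"]]
print(locked_hosts)
```

Parsing the lines this way makes it easy to separate locked hosts from those merely in a failed-repair state.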

 
Blockedon: 884901
See also  issue 884901 

These two haven't run anything, even repair, since late August. Not sure why. They are both locked (but I don't believe they need to be).

chromeos4-row9-rack9-host17
chromeos2-row6-rack2-host3

I'm going to unlock them both and manually request a repair.

chromeos4-row9-rack9-host17 repair is failing and likely needs investigation from lab techs; I will file a ticket. Repair results: https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row9-rack9-host17/2226275-repair/


https://b.corp.google.com/issues/116343878 filed for chromeos4-row9-rack9-host17
I locked those two because they were repeatedly failing and that is what I was told to do.

I may have given the wrong advice. Looking at their history, the repeated failures are from late August. I assumed you were referring to recent failures.

I don't see any repair attempts since those late-August failures. The devices have been in "Repair Failed" state since then, which means they should not get used by tests anyway, and shouldn't have been causing any build failures. Do you have a counterexample -- a build more recent than Sept 1 that failed due to one of those devices?
I think the issue is that pool:continuous, which is what tot-chrome-pfq-informational uses, has insufficient DUTs, and the suites pool has no spares.

akeshet@akeshet:~$ balance_pool continuous veyron_minnie
veyron_minnie continuous pool: Target of 5 is above minimum.

Balancing ['model:veyron_minnie'] continuous pool:
Total 5 DUTs, 2 working, 3 broken, 0 reserved.
Target is 5 working DUTs; grow pool by 3 DUTs.
['model:veyron_minnie'] suites pool has 0 spares available for balancing pool continuous
ERROR: Not enough spares: need 3, only have 0.
ERROR: ['model:veyron_minnie'] continuous pool: Refusing to act on pool with 3 broken DUTs.
ERROR: Please investigate this model to for a bug 
ERROR: that is bricking devices. Once you have finished your 
ERROR: investigation, you can force a rebalance with 
ERROR: --force-rebalance
Transferring 0 DUTs from continuous to suites.
Transferring 0 DUTs from suites to continuous.
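
The arithmetic in the output above can be restated as a small Python sketch. It only illustrates the numbers reported here and is not the actual balance_pool implementation; the function name spare_shortfall and its parameters are hypothetical.

```python
def spare_shortfall(target_working, working, spares_available):
    """Return (needed, shortfall) for a pool.

    needed    -- how many working DUTs must be added to reach the target
    shortfall -- how many of those cannot be covered by available spares
    """
    needed = max(target_working - working, 0)
    shortfall = max(needed - spares_available, 0)
    return needed, shortfall


# Numbers taken from the veyron_minnie continuous pool output in this issue:
# target of 5 working DUTs, 2 currently working, 0 spares in the suites pool.
needed, shortfall = spare_shortfall(target_working=5, working=2, spares_available=0)
print(f"need {needed}, short by {shortfall}")
```

With these inputs the pool needs 3 more working DUTs and none can come from spares, which matches the "need 3, only have 0" error.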


Will add those other 2 DUTs to the repair ticket if automated repair fails for them. I don't know why they haven't been running repair jobs on their own.
So, it certainly sounds like we need to add some more DUTs.

It also looks like I was confused and that the DUTs that failed to repair are not the problem DUTs.

In the most recent veyron_minnie-tot-chrome-pfq-informational build:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8934774175930066496

The HWTest failure reason was:
desktopui_KillRestart.session: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.

Following the link to cautotest for that test:

http://cautotest-prod/afe/#tab_id=view_job&object_id=240394605

Shows that the DUT was: chromeos4-row9-rack10-host7

And the message for *that* DUT in 'stdout' for the HWTest stage was:

host: chromeos4-row9-rack10-host7, status: Running, locked: False diagnosis: Working

Which is apparently incorrect.

That host has failed at least once before recently, and so has chromeos4-row9-rack10-host9, FWIW.





Cc: bhthompson@google.com
 Issue 884901  has been merged into this issue.
Owner: akes...@chromium.org
Status: Assigned (was: Untriaged)
Cc: -bhthompson@google.com bhthomp...@google.com.minch
Cc: -bhthomp...@google.com.minch bhthompson@google.com minch@chromium.org
Status: WontFix (was: Assigned)
