
Issue 913747

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug-Regression




Not enough Nocturne DUTs in CQ

Reported by matth...@chromium.org (Project Member), Dec 11

Issue description

CrOS CQ failure for Nocturne due to not enough DUTs.  AIUI, it would be unacceptable to mark Nocturne as "experimental" at this point, and I don't want to do that, so I'm marking this as P1.

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8927494632336267008

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8927493315372556784

https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8927493315372556784/+/steps/HWTest__bvt-inline_/0/stdout

TestLabException: Not enough DUTs for requirements: ('board:nocturne', 'pool:cq'); required: 4, found: 3
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/run_suite.py", line 2000, in _run_task
    return _run_suite(options)
  File "/usr/local/autotest/site_utils/run_suite.py", line 1738, in _run_suite
    options.skip_duts_check)
  File "/usr/local/autotest/site_utils/diagnosis_utils.py", line 320, in check_dut_availability
    hosts=hosts)
NotEnoughDutsError: Not enough DUTs for requirements: ('board:nocturne', 'pool:cq'); required: 4, found: 3
Will return from run_suite with status: INFRA_FAILURE
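
For context, the gate raising this error is simple counting against the pool. A minimal sketch of the kind of check that check_dut_availability performs (illustrative only, not the actual autotest code; the names and signature here are assumptions):

class NotEnoughDutsError(Exception):
    """Raised when a pool cannot satisfy a suite's DUT requirement."""

def check_dut_availability(hosts, required, labels):
    # hosts: list of (hostname, is_working) pairs for the matching pool.
    # required: minimum working DUTs the suite needs (4 for this CQ run).
    # labels: requirement labels, e.g. ('board:nocturne', 'pool:cq').
    found = sum(1 for _, is_working in hosts if is_working)
    if found < required:
        raise NotEnoughDutsError(
            'Not enough DUTs for requirements: %r; required: %d, found: %d'
            % (labels, required, found))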
 
Components: -Infra>Client>ChromeOS>CI Infra>Client>ChromeOS>Test
Owner: akes...@chromium.org
Status: Assigned (was: Untriaged)
At one point we increased the number of devices in the lab.
https://buganizer.corp.google.com/issues/116396627


I currently see no issue. Could this have been either a temporary issue for nocturne or a bad build?

$ dut-status -b nocturne -p cq
hostname                       S   last checked         URL
chromeos6-row5-rack16-host1    OK  2018-12-10 15:24:42  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host1/1435368-provision/
chromeos6-row5-rack16-host9    OK  2018-12-10 15:27:59  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host9/1435375-repair/
chromeos6-row5-rack16-host15   OK  2018-12-09 17:18:47  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host15/1432355-repair/
chromeos6-row5-rack19-host3    OK  2018-12-10 15:33:51  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack19-host3/1435388-reset/
chromeos6-row5-rack16-host21   OK  2018-12-10 16:06:59  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host21/1435441-repair/

$ balance_pool cq nocturne
nocturne cq pool: Target of 5 is above minimum.
Transferring 0 DUTs from cq to suites.
Transferring 0 DUTs from suites to cq.

akeshet, can you look up the past 1d history of the number of Nocturne DUTs in CQ?

If I am understanding https://issuetracker.google.com/116396627#comment18 correctly, there should be 8 DUTs in the Nocturne CQ pool. Do you have visibility into why we have fewer right now, i.e. what happened to the other 3?
Cc: xixuan@chromium.org
https://crbug.com/917053 may be an underlying cause.
Cc: akes...@chromium.org pprabhu@chromium.org
Labels: OS-Chrome
Owner: ayatane@chromium.org
https://crbug.com/917053 should be fixed with the revert.

Can we please rebalance the pools? Otherwise we'll need to keep nocturne-paladin as experimental...

https://logs.chromium.org/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8924763399688775616/+/steps/HWTest__provision_/0/stdout
    NotEnoughDutsError: Not enough DUTs for requirements: ('board:nocturne', 'pool:cq'); required: 4, found: 3

$ ./dut_status.py -b nocturne -p cq
hostname                       S   last checked         URL
chromeos6-row5-rack16-host1    NO  2019-01-09 18:40:20  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host1/1525203-repair/
chromeos6-row5-rack16-host15   OK  2019-01-08 21:31:13  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host15/1522159-repair/
chromeos6-row5-rack18-host11   OK  2019-01-09 14:54:57  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack18-host11/1524622-verify/
chromeos6-row5-rack16-host21   OK  2019-01-08 21:30:30  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host21/1522157-reset/

$ ./balance_pools.py --dry-run cq nocturne
nocturne cq pool: Target of 4 is above minimum.

# Balancing ['model:nocturne'] cq pool:
# Total 4 DUTs, 3 working, 1 broken, 0 reserved.
# Target is 4 working DUTs; grow pool by 1 DUTs.
# ['model:nocturne'] suites pool has 0 spares available for balancing pool cq
ERROR: Not enough spares: need 1, only have 0.
# Transferring 0 DUTs from cq to suites.
# Transferring 0 DUTs from suites to cq.

Sounds like we don't have enough good devices in the lab :-(
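
The arithmetic in that dry run is worth spelling out: the balancer wants a target number of working DUTs in cq and can only pull from working spares in suites. A rough sketch of that calculation (a hypothetical function, not the real balance_pools.py):

def plan_transfer(target, working, spares_available):
    # target: desired working DUTs in cq (4 here).
    # working: currently working DUTs in cq (3 here).
    # spares_available: working spares in the suites pool (0 here).
    deficit = max(0, target - working)         # "grow pool by 1"
    transfer = min(deficit, spares_available)  # capped at 0 available
    if transfer < deficit:
        print('ERROR: Not enough spares: need %d, only have %d.'
              % (deficit, spares_available))
    return transfer

With 3 working, 1 broken, and no spares, the only way out is repairing the broken DUT or adding hardware.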

Comment 9 Deleted

Oh, the repair job on chromeos6-row5-rack16-host1 is failing precisely because of that kvm issue...

https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host1/1525203-repair/

Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 356, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 416, in verify
    raise hosts.AutoservVerifyError('/dev/kvm is missing')
AutoservVerifyError: /dev/kvm is missing
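
That verifier amounts to a presence check on the device node. Roughly (a sketch assuming autotest's host.run() interface; the real check lives in server/hosts/cros_repair.py):

class AutoservVerifyError(Exception):
    """Raised when a host fails a repair-time verification step."""

def verify_kvm(host):
    # Fails verification when the DUT kernel has not created /dev/kvm,
    # e.g. because the kvm module did not load on the flashed build.
    result = host.run('test -e /dev/kvm', ignore_status=True)
    if result.exit_status != 0:
        raise AutoservVerifyError('/dev/kvm is missing')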
I'm going to try to recover that host manually, since I can ssh into it.

1. Lock host
2. cros flash chromeos6-row5-rack16-host1.cros xbuddy://remote/nocturne-release/R72-11316.69.0
3. Unlock host
4. Press reverify

Seems to have worked:

$ ./dut_status.py -b nocturne -p cq
hostname                       S   last checked         URL
chromeos6-row5-rack16-host1    OK  2019-01-09 20:47:03  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host1/1525464-repair/
chromeos6-row5-rack16-host15   OK  2019-01-08 21:31:13  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host15/1522159-repair/
chromeos6-row5-rack18-host11   OK  2019-01-09 14:54:57  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack18-host11/1524622-verify/
chromeos6-row5-rack16-host21   OK  2019-01-08 21:30:30  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host21/1522157-reset/

We can probably recover the broken ones in pool:suites in a similar way.
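
If we go that route, the manual steps above could be batched along these lines (a rough sketch; the host list is a placeholder to fill in from dut_status.py, and the atest locking flags should be double-checked against atest host mod --help):

import subprocess

IMAGE = 'xbuddy://remote/nocturne-release/R72-11316.69.0'
BROKEN_HOSTS = ['chromeos6-rowX-rackY-hostZ.cros']  # placeholder list

def run(*cmd):
    # Echo then execute; raises CalledProcessError on failure so we stop
    # rather than unlock a host that did not flash cleanly.
    print(' '.join(cmd))
    subprocess.check_call(cmd)

for host in BROKEN_HOSTS:
    run('atest', 'host', 'mod', '--lock', host)   # may also need a lock reason
    run('cros', 'flash', host, IMAGE)
    run('atest', 'host', 'mod', '--unlock', host)
    # then press "reverify" in the AFE UI, as in the manual steps above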

Comment 13 by dchan@chromium.org, Jan 18

Will the fix stick? That is, is this related to the KB fw update?
