Not enough Nocturne DUTs in CQ
Issue description

CrOS CQ failure for Nocturne due to not enough DUTs. AIUI, it would be unacceptable to mark Nocturne as "experimental" at this point, and I don't want to do that, so I'm marking this as P1.

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8927494632336267008
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8927493315372556784
https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8927493315372556784/+/steps/HWTest__bvt-inline_/0/stdout

TestLabException: Not enough DUTs for requirements: ('board:nocturne', 'pool:cq'); required: 4, found: 3
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/run_suite.py", line 2000, in _run_task
    return _run_suite(options)
  File "/usr/local/autotest/site_utils/run_suite.py", line 1738, in _run_suite
    options.skip_duts_check)
  File "/usr/local/autotest/site_utils/diagnosis_utils.py", line 320, in check_dut_availability
    hosts=hosts)
NotEnoughDutsError: Not enough DUTs for requirements: ('board:nocturne', 'pool:cq'); required: 4, found: 3
Will return from run_suite with status: INFRA_FAILURE
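For reference, this comes from run_suite's pre-flight DUT availability check in diagnosis_utils.py. A minimal sketch of the idea (not the actual autotest code; only the exception name and the message format are taken from the traceback above, and the host-status field is an assumption):

    # Sketch only -- not the real check_dut_availability implementation.
    class NotEnoughDutsError(Exception):
        """Raised when a pool cannot supply the DUTs a suite run requires."""


    def check_dut_availability(hosts, requirements, required):
        """Return the working hosts, or raise if there are fewer than required.

        hosts: host records already filtered to the board/pool in `requirements`.
        requirements: e.g. ('board:nocturne', 'pool:cq'), used only in the message.
        required: minimum number of working DUTs the suite needs.
        """
        # 'Ready' as the working state is an assumption for illustration.
        working = [h for h in hosts if h.get('status') == 'Ready']
        if len(working) < required:
            raise NotEnoughDutsError(
                "Not enough DUTs for requirements: %r; required: %d, found: %d"
                % (requirements, required, len(working)))
        return working

So the build fails fast before scheduling the suite whenever the cq pool drops below the required count (4 here).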
Dec 11
At one point we increased the number of devices in the lab. https://buganizer.corp.google.com/issues/116396627
Dec 11
I currently see no issue. Could this have been either a temporary issue for nocturne, or a bad build?

$ dut-status -b nocturne -p cq
hostname                      S   last checked         URL
chromeos6-row5-rack16-host1   OK  2018-12-10 15:24:42  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host1/1435368-provision/
chromeos6-row5-rack16-host9   OK  2018-12-10 15:27:59  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host9/1435375-repair/
chromeos6-row5-rack16-host15  OK  2018-12-09 17:18:47  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host15/1432355-repair/
chromeos6-row5-rack19-host3   OK  2018-12-10 15:33:51  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack19-host3/1435388-reset/
chromeos6-row5-rack16-host21  OK  2018-12-10 16:06:59  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host21/1435441-repair/

$ balance_pool cq nocturne
nocturne cq pool: Target of 5 is above minimum.
Transferring 0 DUTs from cq to suites.
Transferring 0 DUTs from suites to cq.
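If it helps to spot-check this again later, here is a small sketch that counts the OK hosts from dut-status output (it assumes dut-status is on PATH and prints one host per line with the status in the second column, as in the paste above; that column layout is an assumption):

    # Sketch: count working nocturne CQ DUTs by parsing dut-status output.
    import subprocess

    def working_duts(board='nocturne', pool='cq'):
        out = subprocess.check_output(
            ['dut-status', '-b', board, '-p', pool], text=True)
        # Skip the header row, keep hosts whose status column reads OK.
        rows = [line.split() for line in out.splitlines()[1:] if line.strip()]
        return [r[0] for r in rows if len(r) > 1 and r[1] == 'OK']

    if __name__ == '__main__':
        hosts = working_duts()
        print('%d working DUTs: %s' % (len(hosts), ', '.join(hosts)))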
Dec 11
akeshet, can you look up the past 1d history of number of Nocturne DUTs in CQ? If I am understanding https://issuetracker.google.com/116396627#comment18 correctly there should be 8 DUTs in the Nocturne CQ pool. Do you have visibility into why we have less right now, i.e. what happened to the other 3?
Dec 13
We've had 5 devices in the pool since ~Nov 29. Not sure what caused the shuffle at that time. https://viceroy.corp.google.com/chromeos/dut_utilization?board=nocturne&duration=30d&is_locked=False&mdb_role=chrome-infra&model=&pool=managed%3Acq&refresh=-1&scheduler_host=cros-full-0036&sentinel_host=chromeos-server156&staging_master=chromeos-staging-master2&status=Running&topstreams=5
Dec 20
Jan 10
https://crbug.com/917053 should be fixed with the revert. Can we please rebalance the pools? Else we'll need to keep nocturne-paladin as experimental...

https://logs.chromium.org/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8924763399688775616/+/steps/HWTest__provision_/0/stdout
NotEnoughDutsError: Not enough DUTs for requirements: ('board:nocturne', 'pool:cq'); required: 4, found: 3
Jan 10
$ ./dut_status.py -b nocturne -p cq
hostname                      S   last checked         URL
chromeos6-row5-rack16-host1   NO  2019-01-09 18:40:20  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host1/1525203-repair/
chromeos6-row5-rack16-host15  OK  2019-01-08 21:31:13  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host15/1522159-repair/
chromeos6-row5-rack18-host11  OK  2019-01-09 14:54:57  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack18-host11/1524622-verify/
chromeos6-row5-rack16-host21  OK  2019-01-08 21:30:30  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host21/1522157-reset/

$ ./balance_pools.py --dry-run cq nocturne
nocturne cq pool: Target of 4 is above minimum.
# Balancing ['model:nocturne'] cq pool:
# Total 4 DUTs, 3 working, 1 broken, 0 reserved.
# Target is 4 working DUTs; grow pool by 1 DUTs.
# ['model:nocturne'] suites pool has 0 spares available for balancing pool cq
ERROR: Not enough spares: need 1, only have 0.
# Transferring 0 DUTs from cq to suites.
# Transferring 0 DUTs from suites to cq.

Sounds like we don't have enough good devices in the lab :-(
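The dry run boils down to this decision; a hedged sketch with the numbers taken from the output above (illustrative only, not balance_pools.py's actual logic):

    # Sketch of the balancing decision reflected in the dry run above.
    def plan_rebalance(target, working, spares_in_suites):
        """Return how many DUTs to pull from suites into cq, or raise."""
        shortfall = max(0, target - working)
        if shortfall > spares_in_suites:
            raise RuntimeError('Not enough spares: need %d, only have %d'
                               % (shortfall, spares_in_suites))
        return shortfall

    # With the numbers from the dry run (target 4, 3 working, 0 spares):
    # plan_rebalance(4, 3, 0) raises "Not enough spares: need 1, only have 0".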
Jan 10
Oh, the repair job on chromeos6-row5-rack16-host1 is failing precisely because of that kvm issue...

https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host1/1525203-repair/

Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 356, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 416, in verify
    raise hosts.AutoservVerifyError('/dev/kvm is missing')
AutoservVerifyError: /dev/kvm is missing
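Conceptually that verifier is tiny; a sketch of the idea (AutoservVerifyError and the message come from the traceback; host.path_exists() is an assumed host-interface method used purely for illustration, not necessarily what cros_repair.py calls):

    # Sketch of a /dev/kvm verifier like the one failing above.
    class AutoservVerifyError(Exception):
        pass


    def verify_kvm(host):
        """Fail DUT verification when the kernel does not expose /dev/kvm."""
        if not host.path_exists('/dev/kvm'):
            raise AutoservVerifyError('/dev/kvm is missing')

So until the image on the DUT exposes /dev/kvm again, repair keeps marking the host broken and the cq pool stays one short.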
Jan 10
I'm going to try to recover that host manually, since I can ssh into it:

1. Lock host
2. cros flash chromeos6-row5-rack16-host1.cros xbuddy://remote/nocturne-release/R72-11316.69.0
3. Unlock host
4. Press reverify
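If more hosts need the same treatment, the flash step could be wrapped; a sketch assuming lock/unlock/reverify stay manual in the lab UI and only the cros flash invocation (copied from step 2) is scripted:

    # Sketch: re-image a reachable but broken DUT with a known-good release build.
    import subprocess

    # Image string copied from step 2 above.
    IMAGE = 'xbuddy://remote/nocturne-release/R72-11316.69.0'

    def reflash(host):
        """Run `cros flash` against the host's lab DNS name."""
        subprocess.check_call(['cros', 'flash', host + '.cros', IMAGE])

    if __name__ == '__main__':
        reflash('chromeos6-row5-rack16-host1')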
Jan 10
Seems to have worked:

$ ./dut_status.py -b nocturne -p cq
hostname                      S   last checked         URL
chromeos6-row5-rack16-host1   OK  2019-01-09 20:47:03  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host1/1525464-repair/
chromeos6-row5-rack16-host15  OK  2019-01-08 21:31:13  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host15/1522159-repair/
chromeos6-row5-rack18-host11  OK  2019-01-09 14:54:57  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack18-host11/1524622-verify/
chromeos6-row5-rack16-host21  OK  2019-01-08 21:30:30  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row5-rack16-host21/1522157-reset/

We can probably recover the broken ones in pool:suites in a similar way.
Jan 18
Will the fix stick? That is, is this related to the KB fw update?
Comment 1 by jclinton@chromium.org, Dec 11
Owner: akes...@chromium.org
Status: Assigned (was: Untriaged)