veyron_rialto: Rialto misconfigured to run two ARC suites
Issue description
veyron_rialto has been failing since 06/30 (https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_rialto-release/builds/2332) with the following failures:
At the ASyncHWTest phase: ControlFileNotFound: Failed to get control file for veyron_rialto-release/R69-10832.0.0 (devserver: 100.108.133.193) (error: No control file for test_suites/control.arc-cts-qual)
At the HWTest phase: NotEnoughDutsError: Not enough DUTs for board: veyron_rialto, pool: bvt; required: 4, found: 1
,
Jul 3
$ balance-pool bvt veyron_rialto
veyron_rialto bvt pool: Target of 6 is above minimum.
Balancing ['model:veyron_rialto'] bvt pool:
Total 6 DUTs, 1 working, 5 broken, 0 reserved.
Target is 6 working DUTs; grow pool by 5 DUTs.
['model:veyron_rialto'] suites pool has 2 spares available for balancing pool bvt
ERROR: Not enough spares: need 5, only have 2.
ERROR: ['model:veyron_rialto'] bvt pool: Refusing to act on pool with 5 broken DUTs.
ERROR: Please investigate this model to for a bug
ERROR: that is bricking devices. Once you have finished your
ERROR: investigation, you can force a rebalance with
ERROR: --force-rebalance
Transferring 0 DUTs from bvt to suites.
Transferring 0 DUTs from suites to bvt.
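The arithmetic behind that output can be sketched roughly as follows. This is only an illustration of the numbers in the log, not the actual balance-pool implementation: `PoolState`, `plan_rebalance`, and the `max_broken` threshold are names and values made up for this example.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    total: int
    working: int
    broken: int
    spares: int  # working DUTs available in the spare ("suites") pool


def plan_rebalance(pool, target, max_broken=2):
    """Return (needed, errors) for growing a pool to `target` working DUTs.

    `max_broken` is a hypothetical safety threshold chosen for this sketch;
    the real tool's threshold is not stated in this bug.
    """
    errors = []
    # How many working DUTs must be added to reach the target.
    needed = max(target - pool.working, 0)
    if needed > pool.spares:
        errors.append(
            "Not enough spares: need %d, only have %d." % (needed, pool.spares))
    if pool.broken > max_broken:
        errors.append(
            "Refusing to act on pool with %d broken DUTs." % pool.broken)
    return needed, errors


# Numbers taken from the output above: 6 DUTs, 1 working, 5 broken, 2 spares.
state = PoolState(total=6, working=1, broken=5, spares=2)
needed, errors = plan_rebalance(state, target=6)
print("grow pool by %d DUTs" % needed)
for message in errors:
    print("ERROR: " + message)
```

With 1 working DUT and a target of 6, the pool needs 5 more, which exceeds the 2 available spares, so no DUTs are transferred.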
,
Jul 3
Filed http://b/111123164 for not enough duts error.
,
Jul 3
For the Control file not found error:
Triggered task: veyron_rialto-release/R69-10832.0.0-arc-cts-qual
chromeos-golo-server1-121: 3e6d010132b8d810 3
Autotest instance created: cautotest-prod
06-30-2018 [21:52:38] Submitted create_suite_job rpc
Error Message: ControlFileNotFound: Failed to get control file for veyron_rialto-release/R69-10832.0.0 (devserver: 100.108.133.193) (error: No control file for test_suites/control.arc-cts-qual)
Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 1172, in replacement
    return func(**kwargs)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 1970, in create_suite_job
    test_source_build, ds, suite_name)
  File "/usr/local/autotest/server/cros/dynamic_suite/suite_common.py", line 165, in get_control_file_by_build
    (build, devserver_name, e))
ControlFileNotFound: Failed to get control file for veyron_rialto-release/R69-10832.0.0 (devserver: 100.108.133.193) (error: No control file for test_suites/control.arc-cts-qual)
Oddly, the devserver in the error, 100.108.133.193, has the hostname chromeos-gt-devserver12. I don't understand why this one was chosen.
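The lookup that fails in the traceback above can be pictured with a small sketch. It assumes only what the error message shows, namely that a suite named `arc-cts-qual` maps to the path `test_suites/control.arc-cts-qual`. The function names and the set of available files are hypothetical and do not reproduce the autotest code.

```python
def control_file_path(suite_name):
    """Build the control file path for a suite, following the pattern in the error."""
    return "test_suites/control." + suite_name


def find_control_file(suite_name, available_paths):
    """Return the suite's control file path, or raise if the build lacks it.

    `available_paths` stands in for whatever the devserver reports for a
    build; it is an assumption of this sketch.
    """
    path = control_file_path(suite_name)
    if path not in available_paths:
        raise LookupError("No control file for %s" % path)
    return path


# Hypothetical contents of a build that ships no ARC suites.
available = {"test_suites/control.bvt-inline"}

try:
    find_control_file("arc-cts-qual", available)
except LookupError as error:
    print(error)
```

If the build never shipped the ARC control files, the lookup would fail the same way on any devserver.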
,
Jul 3
The same error is observed on stout-release builder.
Error Message: ControlFileNotFound: Failed to get control file for stout-release/R69-10837.0.0 (devserver: 100.108.133.192) (error: No control file for test_suites/control.arc-cts-qual)
Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 1172, in replacement
    return func(**kwargs)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 1970, in create_suite_job
    test_source_build, ds, suite_name)
  File "/usr/local/autotest/server/cros/dynamic_suite/suite_common.py", line 165, in get_control_file_by_build
    (build, devserver_name, e))
ControlFileNotFound: Failed to get control file for stout-release/R69-10837.0.0 (devserver: 100.108.133.192) (error: No control file for test_suites/control.arc-cts-qual)
,
Jul 4
Re comment #5, please ignore that comment as stout has been EOL'ed since 6/30/2018.
,
Jul 4
CC'ing the ARC constable for the ASyncHWTest issue. Can you please confirm whether veyron_rialto supports ARC++ suites? Our guess is that this error is expected.
,
Jul 4
veyron_rialto is a very special board and has never supported ARC.
$ portageq-veyron_rialto envvar USE | grep arc; echo $?
1
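The exit status of 1 comes from grep and means no line matched, i.e. no USE flag containing "arc" is set for the board. The same check can be sketched in Python as below; the sample USE strings are invented for illustration and are not the real flags of any board.

```python
def has_use_flag(use_value, flag):
    """Return True if `flag` appears as a whole token in a USE string."""
    return flag in use_value.split()


def grep_like(use_value, needle):
    """Substring match, closer to what `grep arc` does on the USE line."""
    return needle in use_value


# Hypothetical USE values, for illustration only.
use_without_arc = "arm cros-debug kernel-4_4"
use_with_arc = "arm arc cros-debug"

print(grep_like(use_without_arc, "arc"))
print(has_use_flag(use_with_arc, "arc"))
```

The token-based check is stricter than grep: a flag such as `search` would satisfy the substring match but not the whole-token match.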
,
Jul 4
So arc-[c|g]ts-qual is expected to fail here. But I don't see a problem, as it doesn't seem to be blocking the release.
,
Jul 4
,
Jul 6
My guess is that this is a build or test configuration issue: we shouldn't run the async HW test for this board.
,
Jul 6
Niranjan, can you please get a couple more devices over to the test lab so they have some more spares in the pool? That will help improve stability a bit. They can also take one V2 so we can ensure that's also working.
,
Jul 10
Chatted with Richard (cc'ed). The root cause appears to be https://bugs.chromium.org/p/chromium/issues/detail?id=854404&desc=2. After tackling that one, we will see whether this problem still persists.
,
Jul 10
There are two bugs:
1) The rialto release builder is configured to run two ARC suites.
Rialto doesn't support ARC, and shouldn't run the suites.
2) Rialto DUTs go offline, and servo fails to repair them. That's left
the test pool with no working DUTs.
Problem 2) is covered by bug 854404. So, we shouldn't talk about it here
any further.
This bug should be about problem 1). For that, I note two things:
* The fix must be made to Chromite, so it's a CI (not Test) problem.
I expect the Rialto team should make the change in consultation
with a CI expert.
* It's not clear that the ARC suite failures are actually harming
anything; it's already been noted that this bug isn't blocking
releases. So, although we should fix this, it may be that we
should downgrade to P3.
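The intent of the fix for problem 1) can be sketched as a simple filter. This is not Chromite code and does not use its configuration API: `ARC_SUITES` and `suites_for_board` are hypothetical names, and the suite names are taken from the arc-[c|g]ts-qual and bvt-inline mentions in this bug.

```python
# The two ARC suites named in this bug.
ARC_SUITES = frozenset(["arc-cts-qual", "arc-gts-qual"])


def suites_for_board(configured_suites, supports_arc):
    """Return the suites a builder should schedule for a board.

    Boards without ARC support get the ARC suites dropped; everything
    else is kept in its original order.
    """
    if supports_arc:
        return list(configured_suites)
    return [suite for suite in configured_suites if suite not in ARC_SUITES]


# Hypothetical suite list for a release builder.
configured = ["bvt-inline", "arc-cts-qual", "arc-gts-qual"]
print(suites_for_board(configured, supports_arc=False))
```

For a board like Rialto, only the non-ARC suites would remain scheduled.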
,
Jul 12
Updating the summary to reflect the split of the issue into multiple bugs and the direct requirements for this request. -- Mike
,
Jul 12
Comment 1 by gu...@chromium.org, Jul 3
For the NotEnoughDutsError, I tried to balance the pool, but it failed because there are not enough spares.
Triggered task: veyron_rialto-release/R69-10841.0.0-bvt-inline
chromeos-golo-server5-201: 3e7937b0d7e7e310 3
Autotest instance created: cautotest-prod
TestLabException: Not enough DUTs for board: veyron_rialto, pool: bvt; required: 4, found: 1
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/run_suite.py", line 1990, in _run_task
    return _run_suite(options)
  File "/usr/local/autotest/site_utils/run_suite.py", line 1726, in _run_suite
    options.skip_duts_check)
  File "/usr/local/autotest/site_utils/diagnosis_utils.py", line 330, in check_dut_availability
    hosts=hosts)
NotEnoughDutsError: Not enough DUTs for board: veyron_rialto, pool: bvt; required: 4, found: 1
Will return from run_suite with status: INFRA_FAILURE
I will file another ticket asking the lab team to fix this.
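The availability check that produces this error amounts to comparing the number of usable DUTs against the suite's minimum. A minimal stand-in is sketched below; the exception class here is a local definition for the example, and `ensure_enough_duts` with its signature is hypothetical rather than the function in diagnosis_utils.py.

```python
class NotEnoughDutsError(Exception):
    """Local stand-in for the error name seen in the log above."""


def ensure_enough_duts(board, pool, required, found):
    """Raise if fewer than `required` DUTs are available in the pool."""
    if found < required:
        raise NotEnoughDutsError(
            "Not enough DUTs for board: %s, pool: %s; required: %d, found: %d"
            % (board, pool, required, found))


# Values from the failure above: 4 DUTs required, 1 found.
try:
    ensure_enough_duts("veyron_rialto", "bvt", required=4, found=1)
except NotEnoughDutsError as error:
    print(error)
```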