
Issue 859966

Starred by 1 user

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




veyron_rialto: Rialto misconfigured to run two ARC suites

Project Member Reported by mcchou@chromium.org, Jul 3

Issue description

veyron_rialto has been failing since 06/30 (https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_rialto-release/builds/2332) with the following failures:

At ASyncHWTest phase,
ControlFileNotFound: Failed to get control file for veyron_rialto-release/R69-10832.0.0 (devserver: 100.108.133.193) (error: No control file for test_suites/control.arc-cts-qual)

At HWTest phase,
NotEnoughDutsError: Not enough DUTs for board: veyron_rialto, pool: bvt; required: 4, found: 1
 
For the NotEnoughDutsError, I tried to balance the pool, but that failed because there are not enough spares.
Triggered task: veyron_rialto-release/R69-10841.0.0-bvt-inline
chromeos-golo-server5-201: 3e7937b0d7e7e310 3
  Autotest instance created: cautotest-prod
  TestLabException: Not enough DUTs for board: veyron_rialto, pool: bvt; required: 4, found: 1
  Traceback (most recent call last):
    File "/usr/local/autotest/site_utils/run_suite.py", line 1990, in _run_task
      return _run_suite(options)
    File "/usr/local/autotest/site_utils/run_suite.py", line 1726, in _run_suite
      options.skip_duts_check)
    File "/usr/local/autotest/site_utils/diagnosis_utils.py", line 330, in check_dut_availability
      hosts=hosts)
  NotEnoughDutsError: Not enough DUTs for board: veyron_rialto, pool: bvt; required: 4, found: 1
  Will return from run_suite with status: INFRA_FAILURE

I will file another ticket asking the lab team to fix this.
$ balance-pool bvt veyron_rialto
veyron_rialto bvt pool: Target of 6 is above minimum.

Balancing ['model:veyron_rialto'] bvt pool:
Total 6 DUTs, 1 working, 5 broken, 0 reserved.
Target is 6 working DUTs; grow pool by 5 DUTs.
['model:veyron_rialto'] suites pool has 2 spares available for balancing pool bvt
ERROR: Not enough spares: need 5, only have 2.
ERROR: ['model:veyron_rialto'] bvt pool: Refusing to act on pool with 5 broken DUTs.
ERROR: Please investigate this model to for a bug 
ERROR: that is bricking devices. Once you have finished your 
ERROR: investigation, you can force a rebalance with 
ERROR: --force-rebalance
Transferring 0 DUTs from bvt to suites.
Transferring 0 DUTs from suites to bvt.
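
The arithmetic behind the "Not enough spares" message above can be sketched in a few lines. This is only an illustration of the numbers in the output, not the balance-pool implementation: the function name `plan_pool_growth` and its parameters are hypothetical, and the separate check that refuses to act on a pool with many broken DUTs is not modeled because its threshold is not shown here.

```python
def plan_pool_growth(total, working, spares, target=None):
    """Return (needed, can_balance) for a pool.

    Hypothetical helper: mirrors only the arithmetic visible in the
    balance-pool output, where the target defaults to the pool size.
    """
    if target is None:
        target = total
    # DUTs that must be swapped in to reach the target of working DUTs.
    needed = max(target - working, 0)
    # Balancing is only possible if the spare pool can cover the gap.
    return needed, spares >= needed


# Numbers taken from the output above: 6 DUTs, 1 working, 2 spares.
needed, can_balance = plan_pool_growth(total=6, working=1, spares=2)
print(needed, can_balance)  # 5 False
```

With 6 DUTs in the pool and only 1 working, 5 replacements are needed, which is more than the 2 spares available in the suites pool.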

Filed http://b/111123164 for the NotEnoughDutsError.
For the ControlFileNotFound error:
Triggered task: veyron_rialto-release/R69-10832.0.0-arc-cts-qual
chromeos-golo-server1-121: 3e6d010132b8d810 3
  Autotest instance created: cautotest-prod
  06-30-2018 [21:52:38] Submitted create_suite_job rpc
  Error Message: ControlFileNotFound: Failed to get control file for veyron_rialto-release/R69-10832.0.0 (devserver: 100.108.133.193) (error: No control file for test_suites/control.arc-cts-qual)
  Traceback (most recent call last):
    File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
      results['result'] = self.invokeServiceEndpoint(meth, args)
    File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
      return meth(*args)
    File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
      return f(*args, **keyword_args)
    File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 1172, in replacement
      return func(**kwargs)
    File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 1970, in create_suite_job
      test_source_build, ds, suite_name)
    File "/usr/local/autotest/server/cros/dynamic_suite/suite_common.py", line 165, in get_control_file_by_build
      (build, devserver_name, e))
  ControlFileNotFound: Failed to get control file for veyron_rialto-release/R69-10832.0.0 (devserver: 100.108.133.193) (error: No control file for test_suites/control.arc-cts-qual)

Oddly, the devserver used here, 100.108.133.193, has the hostname chromeos-gt-devserver12. I don't understand why this one was chosen.
The same error is observed on the stout-release builder.

Error Message: ControlFileNotFound: Failed to get control file for stout-release/R69-10837.0.0 (devserver: 100.108.133.192) (error: No control file for test_suites/control.arc-cts-qual)
Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 1172, in replacement
    return func(**kwargs)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 1970, in create_suite_job
    test_source_build, ds, suite_name)
  File "/usr/local/autotest/server/cros/dynamic_suite/suite_common.py", line 165, in get_control_file_by_build
    (build, devserver_name, e))
ControlFileNotFound: Failed to get control file for stout-release/R69-10837.0.0 (devserver: 100.108.133.192) (error: No control file for test_suites/control.arc-cts-qual)
Re comment #5, please ignore that comment as stout has been EOL'ed since 6/30/2018.

Comment 7 Deleted

CC'ing the ARC constable for the ASyncHWTest issue. Can you please confirm whether veyron_rialto supports ARC++ suites? Our guess is that this error is expected.
veyron_rialto is a very special board and has never supported ARC.

$ portageq-veyron_rialto envvar USE | grep arc; echo $?
1

So arc-[c|g]ts-qual is expected to fail here.
But I don't see a problem, as it does not seem to be blocking the release.
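
The USE-flag check above can also be expressed as a small helper, for example when scripting the same check across several boards. This is a minimal sketch: `has_use_flag` is a hypothetical name, the function works on a USE string that has already been obtained (it does not call portageq itself), and the sample flag strings below are made up for illustration.

```python
def has_use_flag(use_value, flag):
    """Return True if `flag` is one of the whitespace-separated USE flags.

    Hypothetical helper. Unlike `grep arc`, which matches substrings,
    this compares whole tokens only.
    """
    return flag in use_value.split()


# Made-up USE strings, for illustration only.
print(has_use_flag("alsa bluetooth kernel_linux", "arc"))  # False
print(has_use_flag("alsa arc bluetooth", "arc"))           # True
```

Matching whole tokens is slightly stricter than the grep: a flag that merely contains the letters "arc" would not count as ARC support.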
Cc: skylarc@chromium.org drustsm...@google.com rialto-eng@google.com englab-sys-cros@google.com
Owner: ----
Status: Untriaged (was: Assigned)
My guess is that this is a build or test configuration issue: we shouldn't run the async HW test for this board.
Cc: kumarniranjan@chromium.org
Niranjan, can you please get a couple more devices over to the test lab so they have some more spares in the pool? That will help improve stability a bit.

You can also take one V2 so we can make sure that it is working too.
Cc: jrbarnette@chromium.org
Chatted with Richard (cc'ed). The root cause appears to be https://bugs.chromium.org/p/chromium/issues/detail?id=854404&desc=2

After tackling that one, we will see whether this problem persists.
Components: -Infra>Client>ChromeOS>Test Infra>Client>ChromeOS>CI
There are two bugs:
 1) The rialto release builder is configured to run two ARC suites.
    Rialto doesn't support ARC, and shouldn't run the suites.
 2) Rialto DUTs go offline, and servo fails to repair them.  That's left
    the test pool with no working DUTs.

Problem 2) is covered by bug 854404.  So, we shouldn't talk about it here
any further.

This bug should be about problem 1).  For that, I note two things:
  * The fix must be made to Chromite, so it's a CI (not Test) problem.
    I expect the Rialto team should make the change in consultation
    with a CI expert.
  * It's not clear that the ARC suite failures are actually harming
    anything; it's already been noted that this bug isn't blocking
    releases.  So, although we should fix this, it may be that we
    should downgrade to P3.
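
The shape of the fix for problem 1) can be sketched as a filter over the suites a builder schedules. This is a hedged illustration only and not Chromite's configuration API: `suites_for_board`, `ARC_SUITE_PREFIX`, and the `supports_arc` flag are hypothetical names, and the assumption that ARC suites can be recognized by an "arc-" name prefix comes only from the suite names mentioned in this bug.

```python
# Assumption: ARC suites are identifiable by this name prefix
# (e.g. arc-cts-qual, arc-gts-qual).
ARC_SUITE_PREFIX = "arc-"


def suites_for_board(suites, supports_arc):
    """Return the suites a board should run.

    Hypothetical helper: drops ARC suites for boards without ARC
    support and leaves the list unchanged otherwise.
    """
    if supports_arc:
        return list(suites)
    return [s for s in suites if not s.startswith(ARC_SUITE_PREFIX)]


configured = ["bvt-inline", "arc-cts-qual", "arc-gts-qual"]
print(suites_for_board(configured, supports_arc=False))  # ['bvt-inline']
```

The real change would live wherever the release builder's suite list is defined; the sketch only shows the intended outcome, which is that a board without ARC never schedules the two ARC suites.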

Updating the summary to reflect the split of the issue into multiple bugs and the direct requirements for this request.  

-- Mike
Summary: veyron_rialto: Rialto misconfigured to run two ARC suites (was: veyron_rialto: Failed at ASyncHWTest phase (ControlFileNotFound) and HWTest phase (NotEnoughDutsError))
