
Issue 851758


Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




eve-tot-chrome-pfq-informational failing in HWTest[provision] (not enough DUTs)

Reported by steve...@chromium.org (Project Member), Jun 12 2018

Issue description

Example failure:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8943984642511046416

Snippet:
15:28:12: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/cbuildbot-tmps16zTy/tmpdauCov/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388-provision --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:PFQ' '--tags=suite:provision' '--tags=build:eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388' '--tags=task_name:eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388-provision' '--tags=board:eve' -- /usr/local/autotest/site_utils/run_suite.py --build eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388 --board eve --suite_name provision --pool continuous --file_bugs True --priority PFQ --timeout_mins 180 --retry False --minimum_duts 1 --suite_args "{u'num_required': 1}" --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 82523468L, 'cidb_build_id': 2655388, 'datastore_parent_key': ('Build', 2655388, 'BuildStage', 82523468L)}" -c
15:28:17: WARNING: Exception is not retriable return code: 3; command: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/cbuildbot-tmps16zTy/tmpdauCov/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388-provision --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:PFQ' '--tags=suite:provision' '--tags=build:eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388' '--tags=task_name:eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388-provision' '--tags=board:eve' -- /usr/local/autotest/site_utils/run_suite.py --build eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388 --board eve --suite_name provision --pool continuous --file_bugs True --priority PFQ --timeout_mins 180 --retry False --minimum_duts 1 --suite_args "{u'num_required': 1}" --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 82523468L, 'cidb_build_id': 2655388, 'datastore_parent_key': ('Build', 2655388, 'BuildStage', 82523468L)}" -c
Triggered task: eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388-provision
chromeos-golo-server5-260: 3e09c84b5a2a4a10 3
  Autotest instance created: cautotest-prod
  TestLabException: Not enough DUTs for board: eve, pool: continuous; required: 1, found: 0
  Traceback (most recent call last):
    File "/usr/local/autotest/site_utils/run_suite.py", line 1990, in _run_task
      return _run_suite(options)
    File "/usr/local/autotest/site_utils/run_suite.py", line 1726, in _run_suite
      options.skip_duts_check)
    File "/usr/local/autotest/site_utils/diagnosis_utils.py", line 330, in check_dut_availability
      hosts=hosts)
  NotEnoughDutsError: Not enough DUTs for board: eve, pool: continuous; required: 1, found: 0
  Will return from run_suite with status: INFRA_FAILURE
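
For anyone else reading this: the check that failed is check_dut_availability() in diagnosis_utils.py, which counts usable hosts for the (board, pool) slice and bails out with NotEnoughDutsError before the suite even starts. A rough sketch of the idea in plain Python (hypothetical and simplified, not the real autotest code; the real check also honors options.skip_duts_check and does more bookkeeping):

    # Hypothetical simplification of the availability check in the
    # traceback above; the real implementation lives in
    # site_utils/diagnosis_utils.py.
    class NotEnoughDutsError(Exception):
        """Raised when the lab cannot satisfy a suite's DUT requirement."""

    def check_dut_availability(board, pool, minimum_duts, skip_duts_check, hosts):
        """Fail fast when the (board, pool) slice of the lab is too small."""
        if skip_duts_check or minimum_duts <= 0:
            return
        available = [h for h in hosts if h.get('status') == 'Ready']
        if len(available) < minimum_duts:
            raise NotEnoughDutsError(
                'Not enough DUTs for board: %s, pool: %s; required: %d, found: %d'
                % (board, pool, minimum_duts, len(available)))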

 
Components: -Infra>Labs
Components: Infra>Client>ChromeOS>Test
eve-tot-chrome-pfq-informational is running tests now, but they do not seem to be passing.

Failures appear to follow one of two patterns:

"Suite timed out" (even though all tests appear to have passed)

or multiple tests failing with:

"ABORT: Timed out, did not run."

Looking at "GE Suite Details" from recent builds:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?cidbBuildId=2661200
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?cidbBuildId=2660653
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?cidbBuildId=2659667

There only appear to be two builders (the names, annoyingly, are not copyable): 
chromeos6-row4-rack11-host17
chromeos6-row4-rack11-host19


> There only appear to be two builders (the names, annoyingly, are not copyable): 
> chromeos6-row4-rack11-host17
> chromeos6-row4-rack11-host19

Those aren't builders; those are the DUTs.  And the fact that there are
only two of them is no doubt contributing to the timeout symptom.
It would be easy to add more DUTs (I'd recommend a total of 4); there
are seven working spares to be tapped:
    $ dut-status -m eve -p suites -w | wc -l
    7

Yes, sorry, I meant DUTs. SGTM.

I did some more digging, and it looks like all of the recent failures were indeed because the HWTest stage(s) were exceeding the 4-hour (!) timeout, presumably because of a lack of DUTs.

It's not especially clear from the summary message 'Suite Job: ABORT' that this is the case. I filed issue 852821 to address this.

I noticed that we are passing --minimum_duts 1 to the suite, which seems incorrect, since clearly 2 DUTs are insufficient to complete the tests within the specified timeout.


It seems like we should probably do an audit of the pfq-informational builders (and maybe others?) and ensure that --minimum_duts is set correctly for the various HWTest suites (we apparently need 3, maybe 4?); see the sketch below.
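
To make the audit concrete, here is the kind of sweep I have in mind, sketched with stand-in dicts rather than the real chromite HWTestConfig objects (the builder name, suite name, and threshold below are just placeholders):

    # Hypothetical audit: flag suites whose configured minimum_duts looks
    # too low to finish inside the HWTest timeout.  The configs here are
    # stand-in dicts, not the real chromite config objects.
    REQUIRED_DUTS = 4  # assumption, based on the discussion in this bug

    def audit_minimum_duts(builder_suites):
        """builder_suites: {builder_name: [{'suite': ..., 'minimum_duts': ...}]}."""
        problems = []
        for builder, suites in sorted(builder_suites.items()):
            for cfg in suites:
                duts = cfg.get('minimum_duts', 0)
                if duts < REQUIRED_DUTS:
                    problems.append((builder, cfg['suite'], duts))
        return problems

    example = {
        'eve-tot-chrome-pfq-informational': [
            {'suite': 'provision', 'minimum_duts': 1},
        ],
    }
    for builder, suite, duts in audit_minimum_duts(example):
        print('%s / %s: minimum_duts=%d (< %d)' % (builder, suite, duts, REQUIRED_DUTS))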

Owner: jkop@chromium.org
> It seems like we should probably do an audit of the pfq-informational
> builders (and maybe others?) and ensure that --minimum_duts is set
> correctly for the various HWTest suites (we apparently need 3, maybe 4?).

For this case, I think "minimum_duts" was set correctly.  What went wrong
was that we allowed the pool to become too small.

That said, we have no mechanism to cross-check the load generated by
builders against the supply in the lab.  The closest we've come is a
feature request that would allow a human to compare the two: that's
bug 717574.
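
Just to make the idea concrete: the cross-check itself would be trivial once the two sides exist as data, something like the sketch below (illustrative only; the demand side would have to be derived from the builder configs, and the supply side from the lab inventory, e.g. dut-status output):

    # Illustrative only: compare per-(board, pool) DUT demand implied by
    # the builder configs against the DUTs actually present in the lab.
    def find_underprovisioned(demand, supply):
        """demand, supply: {(board, pool): count}.  Returns the shortfalls."""
        shortfalls = {}
        for key, needed in demand.items():
            have = supply.get(key, 0)
            if have < needed:
                shortfalls[key] = (needed, have)
        return shortfalls

    demand = {('eve', 'continuous'): 4}  # e.g. summed over suites' minimum_duts
    supply = {('eve', 'continuous'): 2}  # e.g. parsed from lab inventory
    for (board, pool), (needed, have) in find_underprovisioned(demand, supply).items():
        print('board %s, pool %s: need %d, have %d' % (board, pool, needed, have))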

I think that for this bug, the thing to do now is to grow the eve
'continuous' pool to at least 4 DUTs.  The systemic problems that
allowed this to happen should be tracked in separate bugs.

Passing this on to this week's deputy.

Comment 9 by jkop@chromium.org, Jun 21 2018

Status: Fixed (was: Assigned)
