eve-tot-chrome-pfq-informational failing in HWTest[provision] (not enough DUTs)
Issue description

Example failure: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8943984642511046416

Snippet:

15:28:12: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/cbuildbot-tmps16zTy/tmpdauCov/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388-provision --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:PFQ' '--tags=suite:provision' '--tags=build:eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388' '--tags=task_name:eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388-provision' '--tags=board:eve' -- /usr/local/autotest/site_utils/run_suite.py --build eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388 --board eve --suite_name provision --pool continuous --file_bugs True --priority PFQ --timeout_mins 180 --retry False --minimum_duts 1 --suite_args "{u'num_required': 1}" --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 82523468L, 'cidb_build_id': 2655388, 'datastore_parent_key': ('Build', 2655388, 'BuildStage', 82523468L)}" -c
15:28:17: WARNING: Exception is not retriable return code: 3; command: /b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /b/swarming/w/ir/tmp/cbuildbot-tmps16zTy/tmpdauCov/temp_summary.json --print-status-updates --timeout 14400 --raw-cmd --task-name eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388-provision --dimension os Ubuntu-14.04 --dimension pool default --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:PFQ' '--tags=suite:provision' '--tags=build:eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388' '--tags=task_name:eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388-provision' '--tags=board:eve' -- /usr/local/autotest/site_utils/run_suite.py --build eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388 --board eve --suite_name provision --pool continuous --file_bugs True --priority PFQ --timeout_mins 180 --retry False --minimum_duts 1 --suite_args "{u'num_required': 1}" --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 82523468L, 'cidb_build_id': 2655388, 'datastore_parent_key': ('Build', 2655388, 'BuildStage', 82523468L)}" -c
Triggered task: eve-tot-chrome-pfq-informational/R69-10773.0.0-b2655388-provision
chromeos-golo-server5-260: 3e09c84b5a2a4a10 3
Autotest instance created: cautotest-prod
TestLabException: Not enough DUTs for board: eve, pool: continuous; required: 1, found: 0
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/run_suite.py", line 1990, in _run_task
    return _run_suite(options)
  File "/usr/local/autotest/site_utils/run_suite.py", line 1726, in _run_suite
    options.skip_duts_check)
  File "/usr/local/autotest/site_utils/diagnosis_utils.py", line 330, in check_dut_availability
    hosts=hosts)
NotEnoughDutsError: Not enough DUTs for board: eve, pool: continuous; required: 1, found: 0
Will return from run_suite with status: INFRA_FAILURE
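For readers unfamiliar with the check that produced this error, here is a minimal sketch of what such a pre-flight test might look like. It is not the real code from diagnosis_utils.py: the function name and exception name are borrowed from the traceback, but the signature, the host records (dicts with 'board', 'pool' and 'working' keys), and the exception class defined here are all assumptions made for illustration.

```python
class NotEnoughDutsError(Exception):
    """Hypothetical stand-in for the exception named in the traceback."""


def check_dut_availability(board, pool, minimum_duts, hosts):
    """Return the usable hosts, or raise if there are fewer than minimum_duts.

    hosts is assumed to be a list of dicts with the keys
    'board', 'pool' and 'working' (an illustrative layout only).
    """
    available = [
        h for h in hosts
        if h["board"] == board and h["pool"] == pool and h["working"]
    ]
    if len(available) < minimum_duts:
        raise NotEnoughDutsError(
            "Not enough DUTs for board: %s, pool: %s; required: %d, found: %d"
            % (board, pool, minimum_duts, len(available))
        )
    return available
```

With no working eve hosts in the continuous pool and minimum_duts set to 1, this sketch raises with the same message text seen in the log. The suite never starts, which is why the stage is reported as an infrastructure failure and not a test failure.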
,
Jun 12 2018
,
Jun 13 2018
eve-tot-chrome-pfq-informational is running tests now, but they do not seem to be passing. Failures appear to follow one of two patterns: either "Suite timed out" (even though all tests appear to have passed), or multiple tests failing with "ABORT: Timed out, did not run."

Looking at "GE Suite Details" from recent builds:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?cidbBuildId=2661200
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?cidbBuildId=2660653
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?cidbBuildId=2659667

There only appear to be two builders (the names, annoyingly, are not copyable):
chromeos6-row4-rack11-host17
chromeos6-row4-rack11-host19
,
Jun 13 2018
> There only appear to be two builders (the names, annoyingly, are not copyable):
> chromeos6-row4-rack11-host17
> chromeos6-row4-rack11-host19
Those aren't builders; those are the DUTs. And the fact that there's
only two of them is no doubt contributing to the timeout symptom.
It would be easy to add more DUTs (I'd recommend a total of 4); there
are seven working spares to be tapped:
$ dut-status -m eve -p suites -w | wc -l
7
,
Jun 13 2018
Yes, sorry, I meant DUT. SGTM.
,
Jun 14 2018
I did some more digging and it looks like all of the recent failures were indeed because the HWTest stage(s) are exceeding the 4 hour (!) timeout, presumably because of a lack of DUTs. It's not especially clear that this is the case from the summary message 'Suit Job: ABORT'. I filed issue 852821 to address this. I noticed that we are passing '--minimum_duts', '1' to the suite, which seems incorrect since clearly 2 are insufficient to complete the tests in the specified timeout.
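To illustrate why a pool can satisfy --minimum_duts and still blow through the stage timeout, here is a back-of-the-envelope estimate. Everything in it is hypothetical: the function names, the assumption that all tests take the same time and spread evenly across DUTs, and the example figures (60 tests at 10 minutes each) are made up for illustration and are not measurements from this builder. Only the 240-minute limit corresponds to the 4-hour timeout mentioned above.

```python
import math


def estimated_suite_minutes(num_tests, minutes_per_test, num_duts):
    """Rough wall-clock estimate: equal-length tests, evenly spread over DUTs."""
    if num_duts < 1:
        raise ValueError("need at least one DUT")
    # Each "round" runs one test on every DUT in parallel.
    rounds = math.ceil(num_tests / num_duts)
    return rounds * minutes_per_test


def min_duts_for_timeout(num_tests, minutes_per_test, timeout_minutes):
    """Smallest pool size that finishes within the timeout, or None if impossible."""
    tests_per_dut = timeout_minutes // minutes_per_test
    if tests_per_dut < 1:
        return None  # a single test already exceeds the timeout
    return math.ceil(num_tests / tests_per_dut)


# Hypothetical workload: 60 tests of 10 minutes each, 240-minute timeout.
print(estimated_suite_minutes(60, 10, 2))   # 300, over the limit
print(estimated_suite_minutes(60, 10, 4))   # 150, within the limit
print(min_duts_for_timeout(60, 10, 240))    # 3
```

Under these made-up numbers a two-DUT pool passes a minimum_duts check of 1 yet cannot finish in time, which matches the symptom described here: the availability check and the capacity needed to meet the timeout are two different questions.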
,
Jun 18 2018
It seems like we should probably do an audit of the pfq-informational builders (and maybe others?) and ensure that --minimum_duts is set correctly for the various HWTest suites (we apparently need 3, maybe 4?).
,
Jun 18 2018
> It seems like we should probably do an audit of the pfq-informational
> builders (and maybe others?) and ensure that --minimum_duts is set
> correctly for the various HWTest suites (we apparently need 3, maybe 4?).

For this case, I think "minimum_duts" was set correctly. What went wrong was that we allowed the pool to become too small.

That said, we have no mechanism to cross-check the load generated by builders against the supply in the lab. The closest we've come is a feature request that would allow a human to examine the two: that's bug 717574.

I think that for this bug, the thing to do now is to grow the eve 'continuous' pool to at least 4 DUTs. The systemic problems that allowed this to happen should be different bugs.

Passing this on to this week's deputy.
,
Jun 21 2018
Comment 1 by friedman@chromium.org
, Jun 12 2018