
Issue 848312

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
Last visit > 30 days ago
Closed: Jun 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug




peach_pit-chrome-pfq fails due to "Not enough DUTs for board"

Project Member Reported by xiy...@chromium.org, May 31 2018

Issue description

peach_pit-chrome-pfq fails due to "Not enough DUTs for board"
https://luci-milo.appspot.com/buildbot/chromeos/peach_pit-chrome-pfq/5655

Triggered task: peach_pit-chrome-pfq/R69-10738.0.0-rc3-provision
chromeos-golo-server2-251: 3dce9a2ede514110 3
  Autotest instance created: cautotest-prod
  TestLabException: Not enough DUTs for board: peach_pit, pool: bvt; required: 4, found: 3
  Traceback (most recent call last):
    File "/usr/local/autotest/site_utils/run_suite.py", line 1990, in _run_task
      return _run_suite(options)
    File "/usr/local/autotest/site_utils/run_suite.py", line 1726, in _run_suite
      options.skip_duts_check)
    File "/usr/local/autotest/site_utils/diagnosis_utils.py", line 330, in check_dut_availability
      hosts=hosts)
  NotEnoughDutsError: Not enough DUTs for board: peach_pit, pool: bvt; required: 4, found: 3
  Will return from run_suite with status: INFRA_FAILURE

 
There's no shortage right now:  DUTs are actively testing.

$ dut-status -m peach_pit -p bvt
hostname                       S   last checked         URL
chromeos6-row2-rack10-host12   OK  2018-05-31 09:03:09  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack10-host12/1188846-reset/
chromeos6-row2-rack10-host6    OK  2018-05-31 09:02:40  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack10-host6/1188844-reset/
chromeos6-row2-rack10-host20   OK  2018-05-31 09:02:54  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack10-host20/1188845-reset/
chromeos6-row2-rack11-host6    OK  2018-05-31 09:03:51  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host6/1188849-reset/
chromeos6-row2-rack11-host13   OK  2018-05-31 09:03:36  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host13/1188847-reset/
chromeos6-row2-rack11-host20   OK  2018-05-31 09:03:39  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host20/1188848-reset/
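
(For the record, a listing like this can be counted mechanically.  The
sketch below is only an illustration against the column layout shown
above, not part of the lab tooling.)

    import subprocess

    def count_ok_duts(board, pool):
        """Count hosts that dut-status reports as OK right now."""
        out = subprocess.check_output(
            ['dut-status', '-m', board, '-p', pool]).decode()
        rows = out.splitlines()[1:]              # skip the header row
        return sum(1 for row in rows
                   if len(row.split()) >= 2 and row.split()[1] == 'OK')

    print(count_ok_duts('peach_pit', 'bvt'))     # 6 for the listing above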

I'm looking into what happened.

There was a temporary testing glitch caused by a bug in Chrome OS:

$ dut-status -m peach_pit -p bvt -u '2018-05-31 04:40:00' -f -d 2 | grep repair
    2018-05-31 03:33:42  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack10-host6/1187498-repair/
    2018-05-31 03:35:30  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack10-host20/1187505-repair/
    2018-05-31 03:43:19  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host6/1187535-repair/
    2018-05-31 03:45:23  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host13/1187541-repair/
    2018-05-31 03:37:43  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack11-host20/1187518-repair/

At least three DUTs were actively repairing after a failure when the
PFQ requested its test suite.  DUTs that are repairing are considered
unavailable, which is what caused the error.
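
Roughly, the availability check behaves like the sketch below.  The
names and fields here are illustrative, not the actual diagnosis_utils
code; the key point is that a host that is mid-repair doesn't count:

    class NotEnoughDutsError(Exception):
        """Raised when a suite cannot get the DUTs it needs."""

    def check_dut_availability(board, pool, hosts, required):
        """Fail the suite up front if too few hosts are schedulable."""
        # A host that is mid-repair is not 'Ready', so it is excluded
        # even though it may come back a few minutes later.
        available = [h for h in hosts if h.status == 'Ready']
        if len(available) < required:
            raise NotEnoughDutsError(
                'Not enough DUTs for board: %s, pool: %s; '
                'required: %d, found: %d'
                % (board, pool, required, len(available)))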

These are the test jobs that failed:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=204247244
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=204260738
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=204260741
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=204260735
    https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=204247254

In all cases, there was a large number of Chrome crashes.  The crashes
then forced repair, for one of two reasons (sketched below):
  * Most of the DUTs ran out of disk space to hold any more crashes.
  * Presumably on all of the DUTs, Chrome didn't stay up.
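
Roughly, those repair triggers amount to checks like these.  This is
illustrative only; the path and threshold are assumptions, not the real
verifier code:

    import os
    import subprocess

    MIN_FREE_BYTES = 100 * 1024 * 1024       # assumed threshold

    def dut_needs_repair(crash_dir='/var/spool/crash'):
        """True if the device should be sent to repair."""
        stat = os.statvfs(crash_dir)
        if stat.f_bavail * stat.f_frsize < MIN_FREE_BYTES:
            return True                      # crash dumps have filled the disk
        if subprocess.call(['pgrep', '-x', 'chrome'],
                           stdout=subprocess.DEVNULL) != 0:
            return True                      # Chrome didn't stay up
        return False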

The crashes that led to the repairs were caused by a known bug in
the R68 beta branch for peach_pit; a fix needs to be cherry-picked.

The bug behind the crashes, and hence this failure, is bug 845429.

Status: WontFix (was: Assigned)
I'm not prepared to suggest that we change the check for "are there
enough DUTs".  And, absent making that check smarter, the only fix I
see is to cherry-pick the fix to whatever is making peach_pit fail in
Beta.  This isn't the bug for cherry-picking the fix, so we're done here.
