
Issue 666351

Starred by 3 users

Issue metadata

Status: Duplicate
Merged: issue 666414
Owner: ----
Closed: Nov 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



devserver load following lab downtime causes test failures.

Project Member Reported by skau@chromium.org, Nov 17 2016

Issue description

Comment 1 by skau@chromium.org, Nov 17 2016

chromeos-server31-14: 328b564278d78610 3
  Autotest instance: cautotest
  11-17-2016 [05:40:48] Created suite job: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=86130546
  Unhandled run_suite exception: Timeout occurred- waited 1800 seconds.
  Traceback (most recent call last):
    File "/usr/local/autotest/site_utils/run_suite.py", line 1787, in main
      code, output_dict = main_without_exception_handling(options)
    File "/usr/local/autotest/site_utils/run_suite.py", line 1641, in main_without_exception_handling
      return _handle_job_wait(afe, job_id, options, job_timer, is_real_time)
    File "/usr/local/autotest/site_utils/run_suite.py", line 1659, in _handle_job_wait
      while not afe.get_jobs(id=job_id, finished=True):
    File "/usr/local/autotest/server/frontend.py", line 579, in get_jobs
      jobs_data = self.run('get_jobs', **dargs)
    File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 111, in run
      self, call, **dargs)
    File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 114, in GenericRetry
      time.sleep(sleep_time)
    File "/usr/local/autotest/site-packages/chromite/lib/timeout_util.py", line 62, in kill_us
      raise TimeoutError(error_message % {'time': max_run_time})
  TimeoutError: Timeout occurred- waited 1800 seconds.
  Will return from run_suite with status: INFRA_FAILURE
cwd=None
06:13:14: ERROR: wait_cmd has lab failures: cwd=None.
Exception will be raised in the next json_dump run.

Comment 2 by skau@chromium.org, Nov 17 2016

daisy_skate

chromeos-server22-180: 328b5a3695c9b810 3
  Autotest instance: cautotest
  Unhandled run_suite exception: Timeout occurred- waited 1800 seconds.
  Traceback (most recent call last):
    File "/usr/local/autotest/site_utils/run_suite.py", line 1787, in main
      code, output_dict = main_without_exception_handling(options)
    File "/usr/local/autotest/site_utils/run_suite.py", line 1601, in main_without_exception_handling
      options.skip_duts_check)
    File "/usr/local/autotest/site_utils/diagnosis_utils.py", line 306, in check_dut_availability
      multiple_labels=('pool:%s' % pool, 'board:%s' % board))
    File "/usr/local/autotest/server/frontend.py", line 510, in get_hosts
      hosts = self.run('get_hosts', **query_args)
    File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 111, in run
      self, call, **dargs)
    File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 114, in GenericRetry
      time.sleep(sleep_time)
    File "/usr/local/autotest/site-packages/chromite/lib/timeout_util.py", line 62, in kill_us
      raise TimeoutError(error_message % {'time': max_run_time})
  TimeoutError: Timeout occurred- waited 1800 seconds.
  Will return from run_suite with status: INFRA_FAILURE

Comment 3 by skau@chromium.org, Nov 17 2016

Triggered task: elm-paladin/R56-8998.0.0-rc1-bvt-inline
Waiting for results from the following shards: 0
Waiting for results from the following shards: 0
chromeos-server31-93: 328b873c311b1510 3
  Autotest instance: cautotest
  Unhandled run_suite exception: Timeout occurred- waited 1800 seconds.
  Traceback (most recent call last):
    File "/usr/local/autotest/site_utils/run_suite.py", line 1787, in main
      code, output_dict = main_without_exception_handling(options)
    File "/usr/local/autotest/site_utils/run_suite.py", line 1601, in main_without_exception_handling
      options.skip_duts_check)
    File "/usr/local/autotest/site_utils/diagnosis_utils.py", line 306, in check_dut_availability
      multiple_labels=('pool:%s' % pool, 'board:%s' % board))
    File "/usr/local/autotest/server/frontend.py", line 510, in get_hosts
      hosts = self.run('get_hosts', **query_args)
    File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 111, in run
      self, call, **dargs)
    File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 114, in GenericRetry
      time.sleep(sleep_time)
    File "/usr/local/autotest/site-packages/chromite/lib/timeout_util.py", line 62, in kill_us
      raise TimeoutError(error_message % {'time': max_run_time})
  TimeoutError: Timeout occurred- waited 1800 seconds.
  Will return from run_suite with status: INFRA_FAILURE

Comment 4 by xixuan@chromium.org, Nov 17 2016

Command /b/cbuild/internal_master/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com ...
returns error code 3, which is identified as an INFRA_FAILURE. (cbuildbot/swarming_lib.py)

Since the detailed error isn't logged, I don't know what 'returncode=3' means. Is the proxy server down, behaving abnormally, or too busy?
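
To make a bare 'returncode=3' easier to diagnose, a wrapper could keep the tail of stderr next to the return code. The following is only a minimal sketch, not what cbuildbot/swarming_lib.py does: `run_and_report`, the `OK` and `UNKNOWN` labels, and the five-line tail are hypothetical; only the mapping of 3 to INFRA_FAILURE comes from this comment.

```python
import subprocess

# From the comment above: return code 3 is treated as INFRA_FAILURE.
# No other code-to-status mappings are assumed here.
INFRA_FAILURE_CODE = 3


def run_and_report(cmd):
    """Run cmd; return its return code, a status label, and the stderr tail."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True)
    if proc.returncode == 0:
        status = 'OK'  # hypothetical label
    elif proc.returncode == INFRA_FAILURE_CODE:
        status = 'INFRA_FAILURE'
    else:
        status = 'UNKNOWN'  # hypothetical label
    return {
        'returncode': proc.returncode,
        'status': status,
        # Keep only the last few lines so the log stays readable.
        'stderr_tail': proc.stderr.strip().splitlines()[-5:],
    }
```

Logging the returned dict would at least show whatever the failing command wrote to stderr before it exited.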

Comment 5 by ntang@google.com, Nov 17 2016

I think the original round of failures (at about 5:02am), caused by this run_suite timeout, has passed. All builders are at least able to execute the tests now.

A new round of failures (at about 7:33am) is related to ssp in autotest (could not download tar) and is logged in crbug/666372. There is speculation that the dev_server is overloaded. We need to see whether it recovers in the next round.

Comment 6 by xixuan@chromium.org, Nov 17 2016

What's the run_suite timeout? I thought it was 90 minutes (--timeout_mins 90), but it seems the command returns returncode=3 after about 30 minutes.
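
One possible reading of the tracebacks above: the 1800 seconds is a retry budget around each AFE RPC (get_jobs, get_hosts), separate from the 90-minute suite timeout, so a persistently unreachable frontend ends the run after about 30 minutes. The sketch below illustrates that pattern under stated assumptions. It is not chromite's retry_util or timeout_util code (the traceback suggests the real timeout fires from a handler while the retry loop sleeps); `retry_with_deadline`, `RpcTimeoutError`, the injected `clock` and `sleep` parameters, and the choice of `ConnectionError` are hypothetical.

```python
import time


class RpcTimeoutError(Exception):
    """Hypothetical stand-in for the TimeoutError seen in the traceback."""


def retry_with_deadline(call, max_run_time, sleep_time,
                        clock=time.monotonic, sleep=time.sleep):
    """Retry call() until it succeeds or max_run_time seconds have elapsed."""
    deadline = clock() + max_run_time
    while True:
        try:
            return call()
        except ConnectionError:
            # Give up once another sleep would overshoot the budget.
            if clock() + sleep_time > deadline:
                raise RpcTimeoutError(
                    'Timeout occurred- waited %(time)s seconds.'
                    % {'time': max_run_time})
            sleep(sleep_time)
```

With `max_run_time=1800`, a call that never succeeds raises after 30 minutes regardless of any longer suite-level timeout, which would match the behavior asked about here.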
Labels: -Pri-0 Pri-2
Summary: devserver load following lab downtime causes test failures. (was: paladins failing on HWTest infrastructure issues)
afaict, the devserver load is just fallout from the lab downtime.
We don't handle it gracefully, but there is nothing I can do about it right now. Things seem to have recovered on their own for now.
Mergedinto: 666414
Status: Duplicate (was: Untriaged)
