
Issue 835944


Issue metadata

Status: Assigned
Owner:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




Make lab inventory runs more robust to transient shard failures

Reported by jrbarnette@chromium.org, Apr 23 2018

Issue description

This morning's lab inventory run failed.  The relevant error messages
are further down.  The key feature of the failure is that the RPC to
the shard timed out while gathering status for one particular model
(reks).  That timeout from a single shard invalidated the entire run.
We've become more reliant on the inventory runs producing data (see
bugs 835941 and 804625), so inventory runs need to tolerate individual
failures of this sort.

The inventory code should probably do something along these lines to
be more robust to failure:
  * If an RPC fails, mark the model of the failed DUT as having no data.
  * Continue processing for other models.
  * Report models with no data in the inventory e-mail.
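
A minimal sketch of the three steps above, not the actual lab_inventory.py code.  The names ShardRpcError, gather_model_counts, count_broken, and format_no_data_report are all hypothetical; the real code would need to catch whatever exception the RPC layer actually raises (the timeout in the traceback below, for instance) at the point where per-model counts are computed.

```python
import logging


class ShardRpcError(Exception):
    """Hypothetical stand-in for an RPC failure such as a shard timeout."""


def gather_model_counts(models, count_broken):
    """Collect a broken-DUT count per model, tolerating per-model failures.

    `count_broken` is a hypothetical callable that takes a model name and
    returns its broken-DUT count, raising ShardRpcError if the RPC fails.
    Returns (counts, no_data): a dict of model -> count for the models
    that succeeded, and a sorted list of models whose RPC failed.
    """
    counts = {}
    no_data = []
    for model in models:
        try:
            counts[model] = count_broken(model)
        except ShardRpcError as e:
            # Mark the model as having no data and keep going.
            logging.warning('No data for %s: %s', model, e)
            no_data.append(model)
    return counts, sorted(no_data)


def format_no_data_report(no_data):
    """Return the e-mail section listing models with no data."""
    if not no_data:
        return ''
    lines = ['Models with no data (RPC failures):']
    lines.extend('    %s' % model for model in no_data)
    return '\n'.join(lines) + '\n'
```

Catching only a specific RPC error type, rather than every exception, keeps genuine bugs in the inventory code visible instead of silently reporting them as missing data.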

Here are the relevant errors from this morning's failure:
====
2018-04-23 06:38:08 | DEBUG      | Listing failed DUTs for reks
2018-04-23 07:08:08 | DEBUG      | ending retries with error: <class 'chromite.lib.timeout_util.TimeoutError'>(Timeout occurred- waited 1800.0 seconds.)
2018-04-23 07:08:08 | ERROR      | Unexpected exception: Timeout occurred- waited 1800.0 seconds.
Traceback (most recent call last):
  File "site_utils/lab_inventory.py", line 1383, in main
    _perform_inventory_reports(arguments)
  File "site_utils/lab_inventory.py", line 1197, in _perform_inventory_reports
    _perform_model_inventory(arguments, inventory, timestamp)
  File "site_utils/lab_inventory.py", line 969, in _perform_model_inventory
    inventory, arguments.recommend) + '\n\n\n'
  File "site_utils/lab_inventory.py", line 669, in _generate_repair_recommendation
    if counts.get_broken() != 0:
  File "site_utils/lab_inventory.py", line 359, in get_broken
    return self._count_pool(_HostSetInventory.get_broken, pool)
  File "site_utils/lab_inventory.py", line 304, in _count_pool
    self._histories_by_pool.values()])
  File "site_utils/lab_inventory.py", line 228, in get_broken
    return len(self.get_broken_list())
  File "site_utils/lab_inventory.py", line 222, in get_broken_list
    if h.last_diagnosis()[0] == status_history.BROKEN]
  File "/usr/local/autotest/server/lib/status_history.py", line 658, in last_diagnosis
    self._init_status_task()
  File "/usr/local/autotest/server/lib/status_history.py", line 587, in _init_status_task
    self._afe, self._host.id, self.end_time)
  File "/usr/local/autotest/server/lib/status_history.py", line 285, in get_status_task
    task = afe.get_host_status_task(host_id, query_end)
  File "/usr/local/autotest/server/frontend.py", line 660, in get_host_status_task
    host_id=host_id, end_time=end_time)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 131, in run
    self, call, **dargs)
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 244, in GenericRetry
    return _run()
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 177, in _Wrapper
    ret = func(*args, **kwargs)
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 243, in _run
    return functor(*args, **kwargs)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 94, in _run
    return super(RetryingAFE, self).run(call, **dargs)
  File "/usr/local/autotest/server/frontend.py", line 108, in run
    result = utils.strip_unicode(rpc_call(**dargs))
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 136, in __call__
    postdata, min_rpc_timeout)
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 162, in _raw_http_request
    return urllib2.urlopen(request).read()
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1187, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1089, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 444, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 400, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/local/autotest/site-packages/chromite/lib/timeout_util.py", line 88, in kill_us
    raise TimeoutError(error_message % {'time': max_run_time})
TimeoutError: Timeout occurred- waited 1800.0 seconds.


 
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>Test
Owner: jrbarnette@chromium.org
Owner: ----
Possibly mine, but the assignment needs ratification.

Owner: jrbarnette@chromium.org
Status: Assigned (was: Untriaged)
jrbarnette to decide on next action, e.g. to either clarify or pass on parts of the work to others.  The bug triager (me) can't really decide on an action for this.
