Make lab inventory runs more robust to transient shard failures
Reported by
jrbarnette@chromium.org,
Apr 23 2018
Issue description

This morning's lab inventory run failed. The relevant error messages are further down. The key feature of the failure is that while gathering status for one particular model (reks), the RPC to the shard timed out. That timeout from a single shard invalidated the entire run.

We've become more reliant on the inventory runs producing data (see bugs 835941 and 804625), so inventory runs need to tolerate individual failures of this sort.

The inventory code should probably do something along these lines to be more robust to failure:
  * If an RPC fails, mark the model of the failed DUT as having no data.
  * Continue processing for other models.
  * Report models with no data in the inventory e-mail.

Here are the relevant errors from this morning's failure:
====
2018-04-23 06:38:08 | DEBUG | Listing failed DUTs for reks
2018-04-23 07:08:08 | DEBUG | ending retries with error: <class 'chromite.lib.timeout_util.TimeoutError'>(Timeout occurred- waited 1800.0 seconds.)
2018-04-23 07:08:08 | ERROR | Unexpected exception: Timeout occurred- waited 1800.0 seconds.
Traceback (most recent call last):
  File "site_utils/lab_inventory.py", line 1383, in main
    _perform_inventory_reports(arguments)
  File "site_utils/lab_inventory.py", line 1197, in _perform_inventory_reports
    _perform_model_inventory(arguments, inventory, timestamp)
  File "site_utils/lab_inventory.py", line 969, in _perform_model_inventory
    inventory, arguments.recommend) + '\n\n\n'
  File "site_utils/lab_inventory.py", line 669, in _generate_repair_recommendation
    if counts.get_broken() != 0:
  File "site_utils/lab_inventory.py", line 359, in get_broken
    return self._count_pool(_HostSetInventory.get_broken, pool)
  File "site_utils/lab_inventory.py", line 304, in _count_pool
    self._histories_by_pool.values()])
  File "site_utils/lab_inventory.py", line 228, in get_broken
    return len(self.get_broken_list())
  File "site_utils/lab_inventory.py", line 222, in get_broken_list
    if h.last_diagnosis()[0] == status_history.BROKEN]
  File "/usr/local/autotest/server/lib/status_history.py", line 658, in last_diagnosis
    self._init_status_task()
  File "/usr/local/autotest/server/lib/status_history.py", line 587, in _init_status_task
    self._afe, self._host.id, self.end_time)
  File "/usr/local/autotest/server/lib/status_history.py", line 285, in get_status_task
    task = afe.get_host_status_task(host_id, query_end)
  File "/usr/local/autotest/server/frontend.py", line 660, in get_host_status_task
    host_id=host_id, end_time=end_time)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 131, in run
    self, call, **dargs)
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 244, in GenericRetry
    return _run()
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 177, in _Wrapper
    ret = func(*args, **kwargs)
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 243, in _run
    return functor(*args, **kwargs)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 94, in _run
    return super(RetryingAFE, self).run(call, **dargs)
  File "/usr/local/autotest/server/frontend.py", line 108, in run
    result = utils.strip_unicode(rpc_call(**dargs))
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 136, in __call__
    postdata, min_rpc_timeout)
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 162, in _raw_http_request
    return urllib2.urlopen(request).read()
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1187, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1089, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 444, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 400, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/local/autotest/site-packages/chromite/lib/timeout_util.py", line 88, in kill_us
    raise TimeoutError(error_message % {'time': max_run_time})
TimeoutError: Timeout occurred- waited 1800.0 seconds.
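For illustration only, the three steps proposed in the description (mark the model as having no data, continue with the other models, report the gaps in the e-mail) might look roughly like the sketch below. This is not the actual lab_inventory.py code: `RpcTimeoutError`, `collect_model_counts`, `format_report`, and the `count_broken` callback are all hypothetical names standing in for whatever the real inventory code and `chromite.lib.timeout_util.TimeoutError` provide.

```python
class RpcTimeoutError(Exception):
    """Hypothetical stand-in for the timeout raised by a shard RPC."""


def collect_model_counts(models, count_broken):
    """Gather a broken-DUT count per model, tolerating per-model RPC failures.

    `count_broken` is an assumed callback that takes a model name and
    returns an int, or raises RpcTimeoutError if the shard RPC fails.
    Returns (counts, no_data): a dict of model -> count, and a list of
    models whose data could not be gathered.
    """
    counts = {}
    no_data = []
    for model in models:
        try:
            counts[model] = count_broken(model)
        except RpcTimeoutError:
            # One failed shard only costs this model's data, not the run.
            no_data.append(model)
    return counts, no_data


def format_report(counts, no_data):
    """Render a plain-text report body, listing models with no data last."""
    lines = ['%-10s %d' % (model, n) for model, n in sorted(counts.items())]
    if no_data:
        lines.append('Models with no data: ' + ', '.join(sorted(no_data)))
    return '\n'.join(lines)
```

With something like this, a timeout while listing failed DUTs for reks would leave reks in the "no data" list while the remaining models are still counted and reported. Whether to catch only timeouts or any RPC failure is a design choice the sketch leaves open.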
Apr 30 2018
Possibly mine, but the assignment needs ratification.
May 8 2018
jrbarnette to decide on next action, e.g. to either clarify or pass on parts of the work to others. The bug triager (me) can't really decide on an action for this.
Comment 1 by dgarr...@chromium.org, Apr 26 2018
Owner: jrbarnette@chromium.org