CQ failed due to RPC layer timeouts
Issue description

This happened on a few builds at around the same time:
https://luci-milo.appspot.com/buildbot/chromeos/kevin-paladin/2475
https://viceroy.corp.google.com/chromeos/suite_details?job_id=143718941
https://luci-milo.appspot.com/buildbot/chromeos/veyron_mighty-paladin/6668
https://viceroy.corp.google.com/chromeos/suite_details?job_id=143717777

Digging into one of the failure logs (a repair failure at http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row6-rack3-host8/1793509-repair/ ), I see RPC timeouts at the start of the job:

09/20 23:18:17.689 DEBUG| retry_util:0201| ending retries with error: <class 'chromite.lib.timeout_util.TimeoutError'>(Timeout occurred- waited 300.0 seconds.)
09/20 23:18:17.692 ERROR| autoserv:0762| Uncaught Exception, exit_code = 1.
Traceback (most recent call last):
  File "/usr/local/autotest/server/autoserv", line 754, in main
    use_ssp)
  File "/usr/local/autotest/server/autoserv", line 471, in run_autoserv
    test_retry, **kwargs)
  File "/usr/local/autotest/server/server_job.py", line 363, in __init__
    self._connection_pool)
  File "/usr/local/autotest/server/server_job.py", line 129, in get_machine_dicts
    afe_host = _create_afe_host(machine)
  File "/usr/local/autotest/server/server_job.py", line 1529, in _create_afe_host
    hosts = afe.get_hosts(hostname=hostname)
  File "/usr/local/autotest/server/frontend.py", line 527, in get_hosts
    hosts = self.run('get_hosts', **query_args)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 126, in run
    self, call, **dargs)
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 243, in GenericRetry
    return _run()
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 176, in _Wrapper
    ret = func(*args, **kwargs)
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 242, in _run
    return functor(*args, **kwargs)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 89, in _run
    return super(RetryingAFE, self).run(call, **dargs)
  File "/usr/local/autotest/server/frontend.py", line 107, in run
    result = utils.strip_unicode(rpc_call(**dargs))
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 117, in __call__
    respdata = urllib2.urlopen(request).read()
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1187, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1089, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 444, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 400, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/local/autotest/site-packages/chromite/lib/timeout_util.py", line 87, in kill_us
    raise TimeoutError(error_message % {'time': max_run_time})

Trying to see if this correlates with other outages or metrics-visible issues at that time.
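For readers unfamiliar with the pattern in the log, the retry wrapper gives up once an overall 300-second budget expires. The sketch below illustrates that general idea only. It is not chromite's `retry_util` or `timeout_util` code, which, judging by the `kill_us` frame raised from inside the socket read, appears to interrupt the blocked call rather than merely stop retrying. The names `retry_until_deadline` and `DeadlineExceeded` are hypothetical.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the overall retry budget is used up (hypothetical name)."""


def retry_until_deadline(func, timeout_s, delay_s=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Call func() until it succeeds or the time budget runs out.

    Unlike a timer-based timeout, this cannot interrupt a call that is
    already blocked; it only stops scheduling further attempts.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            return func()
        except Exception as err:  # a real caller would catch narrower errors
            # Give up if the next attempt would start after the deadline.
            if clock() + delay_s > deadline:
                raise DeadlineExceeded(
                    'gave up after %d attempts, waited %.1f seconds'
                    % (attempts, timeout_s)) from err
            sleep(delay_s)
```

A caller would wrap the flaky call, for example `retry_until_deadline(lambda: afe.get_hosts(hostname=hostname), timeout_s=300)`, where the 300-second budget is taken from the log message and the rest is an assumption about how one might wire it up.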
Sep 22 2017
This correlates well with a number of shard client tick rates being low: https://viceroy.corp.google.com/chromeos/deputy-view?duration=6h&utc_end=1505987276#_VG_lnuPnWCa
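As a quick sanity check on the overlap, the dashboard window can be compared with the failure timestamp from the log. The sketch below decodes `duration=6h` and `utc_end=1505987276` from the link above. It assumes the autoserv log timestamp (09/20 23:18:17) is in US Pacific daylight time (UTC-7) and that the year is 2017; the log itself states neither.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the deputy-view URL query string.
window_end = datetime.fromtimestamp(1505987276, tz=timezone.utc)
window_start = window_end - timedelta(hours=6)

# ASSUMPTION: log timestamps are PDT (UTC-7) and the year is 2017.
pdt = timezone(timedelta(hours=-7))
failure = datetime(2017, 9, 20, 23, 18, 17, tzinfo=pdt)

# True if the failure falls inside the dashboard's six-hour window.
in_window = window_start <= failure <= window_end
print(window_start.isoformat(), window_end.isoformat(), in_window)
```

Under those assumptions the window runs from 03:47:56 to 09:47:56 UTC on Sep 21 2017, and the failure at 06:18:17 UTC falls inside it.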
Sep 22 2017
A spike of 5XXs is also visible. https://viceroy.corp.google.com/chromeos/afe_rpc?duration=1d&utc_end=1506040326.29
May 17 2018
Comment 1 by akes...@chromium.org, Sep 22 2017