Issue 767681

Issue metadata

Status: Archived
Owner: ----
Closed: May 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 723645



CQ failed due to RPC layer timeouts

Project Member Reported by akes...@chromium.org, Sep 22 2017

Issue description

This happened on a few builds at around the same time:

https://luci-milo.appspot.com/buildbot/chromeos/kevin-paladin/2475
https://viceroy.corp.google.com/chromeos/suite_details?job_id=143718941

https://luci-milo.appspot.com/buildbot/chromeos/veyron_mighty-paladin/6668
https://viceroy.corp.google.com/chromeos/suite_details?job_id=143717777

Digging into one of the failure logs (a repair failure at http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row6-rack3-host8/1793509-repair/ ), I see RPC timeouts at the start of the job. The AFE get_hosts call blocked waiting for a response until the 300-second retry timeout fired:

09/20 23:18:17.689 DEBUG|        retry_util:0201| ending retries with error: <class 'chromite.lib.timeout_util.TimeoutError'>(Timeout occurred- waited 300.0 seconds.)
09/20 23:18:17.692 ERROR|          autoserv:0762| Uncaught Exception, exit_code = 1.
Traceback (most recent call last):
  File "/usr/local/autotest/server/autoserv", line 754, in main
    use_ssp)
  File "/usr/local/autotest/server/autoserv", line 471, in run_autoserv
    test_retry, **kwargs)
  File "/usr/local/autotest/server/server_job.py", line 363, in __init__
    self._connection_pool)
  File "/usr/local/autotest/server/server_job.py", line 129, in get_machine_dicts
    afe_host = _create_afe_host(machine)
  File "/usr/local/autotest/server/server_job.py", line 1529, in _create_afe_host
    hosts = afe.get_hosts(hostname=hostname)
  File "/usr/local/autotest/server/frontend.py", line 527, in get_hosts
    hosts = self.run('get_hosts', **query_args)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 126, in run
    self, call, **dargs)
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 243, in GenericRetry
    return _run()
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 176, in _Wrapper
    ret = func(*args, **kwargs)
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 242, in _run
    return functor(*args, **kwargs)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 89, in _run
    return super(RetryingAFE, self).run(call, **dargs)
  File "/usr/local/autotest/server/frontend.py", line 107, in run
    result = utils.strip_unicode(rpc_call(**dargs))
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 117, in __call__
    respdata = urllib2.urlopen(request).read()
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1187, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1089, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 444, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 400, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/local/autotest/site-packages/chromite/lib/timeout_util.py", line 87, in kill_us
    raise TimeoutError(error_message % {'time': max_run_time})
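
For context on the pattern in the traceback above: the get_hosts RPC is retried under an overall time budget, and the job fails once that budget (300 seconds here) runs out. The following is a minimal sketch of a deadline-bounded retry loop, not the chromite implementation. The names retry_with_deadline and DeadlineExceeded, the delay_s parameter, and the injectable clock/sleep arguments are all hypothetical. Note one difference: this sketch only checks the deadline between attempts, whereas the final frame in the traceback (timeout_util.kill_us inside a socket recv) suggests the real timeout interrupts a call that is already blocked.

```python
import time


class DeadlineExceeded(Exception):
    """Hypothetical stand-in for a timeout error (not chromite's TimeoutError)."""


def retry_with_deadline(func, timeout_s, delay_s=1.0, retry_on=(OSError,),
                        clock=time.monotonic, sleep=time.sleep):
    """Call func until it succeeds or the overall time budget is used up.

    This only checks the deadline between attempts; it cannot interrupt an
    attempt that is blocked, for example in a socket read.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            return func()
        except retry_on as e:
            # Give up if waiting for another attempt would overrun the budget.
            if clock() + delay_s > deadline:
                raise DeadlineExceeded(
                    'Timeout occurred - waited %.1f seconds (%d attempts).'
                    % (timeout_s, attempts)) from e
            sleep(delay_s)
```

Passing clock and sleep as arguments is a design choice for this sketch only: it lets the loop be exercised with a fake clock instead of actually waiting out the timeout.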


Trying to see if this correlates with other outages or metrics-visible issues at that time.
 
Blocking: 723645
This looks like a specific instance of Issue 723645.
This correlates well with a number of shard clients showing low tick rates: https://viceroy.corp.google.com/chromeos/deputy-view?duration=6h&utc_end=1505987276#_VG_lnuPnWCa


Status: Archived (was: Untriaged)
