Issue 684020

Issue metadata

Status: Archived
Owner: ----
Closed: Jul 3
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug



balance_pool: susceptible to flaky URLError

Project Member Reported by pprabhu@chromium.org, Jan 23 2017

Issue description

balance_pool log for 2017-01-23 has the following exception:

Traceback (most recent call last):
  File "site_utils/balance_pools.py", line 599, in <module>
    main(sys.argv)
  File "site_utils/balance_pools.py", line 593, in main
    parallel.RunTasksInProcessPool(balancer, board_info, processes=8)
  File "/usr/local/autotest/site-packages/chromite/lib/parallel.py", line 809, in RunTasksInProcessPool
    queue.put((idx, input_args))
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/autotest/site-packages/chromite/lib/parallel.py", line 750, in BackgroundTaskRunner
    queue.put(_AllTasksComplete())
  File "/usr/local/autotest/site-packages/chromite/lib/parallel.py", line 750, in BackgroundTaskRunner
    queue.put(_AllTasksComplete())
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/autotest/site-packages/chromite/lib/parallel.py", line 561, in ParallelTasks
    raise BackgroundFailure(exc_infos=errors)
chromite.lib.parallel.BackgroundFailure: <class 'urllib2.URLError'>: <urlopen error [Errno 110] Connection timed out>
Traceback (most recent call last):
  File "/usr/local/autotest/site-packages/chromite/lib/parallel.py", line 602, in TaskRunner
    task(*x, **task_kwargs)
  File "/usr/local/autotest/site-packages/chromite/lib/parallel.py", line 800, in <lambda>
    fn = lambda idx, task_args: out_queue.put((idx, task(*task_args)))
  File "site_utils/balance_pools.py", line 562, in balancer
    _balance_board(arguments, afe, board, pool, start_time, end_time)
  File "site_utils/balance_pools.py", line 329, in _balance_board
    start_time, end_time)
  File "site_utils/balance_pools.py", line 179, in __init__
    self.total_hosts = self._get_hosts(afe, start_time, end_time)
  File "site_utils/balance_pools.py", line 195, in _get_hosts
    diag = h.last_diagnosis()[0]
  File "/usr/local/autotest/server/lib/status_history.py", line 573, in last_diagnosis
    self._init_status_task()
  File "/usr/local/autotest/server/lib/status_history.py", line 502, in _init_status_task
    self._afe, self._host.id, self.end_time)
  File "/usr/local/autotest/server/lib/status_history.py", line 235, in get_status_task
    task = afe.get_host_status_task(host_id, query_end)
  File "/usr/local/autotest/server/frontend.py", line 648, in get_host_status_task
    host_id=host_id, end_time=end_time)
  File "/usr/local/autotest/server/frontend.py", line 104, in run
    result = utils.strip_unicode(rpc_call(**dargs))
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 114, in __call__
    respdata = urllib2.urlopen(request).read()
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 110] Connection timed out>


This caused the script to exit early and not balance the easy pools for me, so I complain. ;)
 
The key line in the traceback is this one:
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 114, in __call__
    respdata = urllib2.urlopen(request).read()

That is, the call to urlopen was made on behalf of an RPC call to
cautotest.  So, the timeout happened because cautotest was slow to
respond to that particular RPC call.

Looking at the traceback, it seems we're using server.frontend.AFE,
rather than the RetryingAFE class.  Retrying in this case _probably_
would make things better.
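
For illustration only, here is a minimal sketch of the retry idea. It is not the RetryingAFE implementation, whose interface isn't shown in this bug. The helper name call_with_retries and its parameters are hypothetical, and the sketch uses Python 3's urllib.error.URLError, whereas the traceback above comes from Python 2.7's urllib2.

```python
import time
import urllib.error


def call_with_retries(rpc, attempts=3, delay_secs=1.0, sleep=time.sleep):
    """Call rpc(), retrying on URLError, for at most `attempts` calls in total.

    Hypothetical helper: `rpc` is any zero-argument callable that performs
    one RPC, e.g. a lambda wrapping the status-task query.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return rpc()
        except urllib.error.URLError:
            if attempt == attempts:
                raise  # Out of attempts: let the last error propagate.
            sleep(delay_secs * attempt)  # Linear backoff before the next try.


if __name__ == "__main__":
    calls = []

    def flaky_rpc():
        # Simulated RPC: times out twice, then succeeds.
        calls.append(1)
        if len(calls) < 3:
            raise urllib.error.URLError("[Errno 110] Connection timed out")
        return "status task"

    print(call_with_retries(flaky_rpc, delay_secs=0.0))
```

With something like this around the RPC, a single slow response from cautotest would cost one board a short delay instead of aborting the whole run. Only URLError is retried, so other failures still surface immediately.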

Comment 2 by autumn@chromium.org, Jan 24 2017

Labels: -current-issue
Project Member

Comment 3 by sheriffbot@chromium.org, Feb 12 2018

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available. If you change it back, also remove the "Hotlist-Recharge-Cold" label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Cc: jrbarnette@chromium.org
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>Test
Labels: -Pri-2 -Hotlist-Recharge-Cold Pri-3
Owner: ----
Status: Available (was: Untriaged)
Status: Archived (was: Available)
