New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 805707 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

Lab inventory runs are lousy with 500 errors

Reported by jrbarnette@chromium.org, Jan 24 2018

Issue description

Checking the logs for the lab-inventory runs, many runs
have been failing with 500 errors.  Here's a typical sample:

2018-01-24 14:00:07 | DEBUG      | Starting lab inventory for 2018-01-24.14
[ ... ]
2018-01-24 14:06:14 | DEBUG      | Counting 45 DUTS for board veyron_minnie
2018-01-24 14:06:21 | DEBUG      | ending retries with error: <class 'autotest_lib.frontend.afe.json_rpc.proxy.JSONRPCException'>(HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 1498, in get_host_status_task
    host_id=host_id, end_time=end_time)
  File "/usr/local/autotest/server/frontend.py", line 108, in run
    result = utils.strip_unicode(rpc_call(**dargs))
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 136, in __call__
    postdata, min_rpc_timeout)
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 162, in _raw_http_request
    return urllib2.urlopen(request).read()
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
)
2018-01-24 14:06:21 | ERROR      | Unexpected exception: HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 1498, in get_host_status_task
    host_id=host_id, end_time=end_time)
  File "/usr/local/autotest/server/frontend.py", line 108, in run
    result = utils.strip_unicode(rpc_call(**dargs))
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 136, in __call__
    postdata, min_rpc_timeout)
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 162, in _raw_http_request
    return urllib2.urlopen(request).read()
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
Traceback (most recent call last):
  File "site_utils/lab_inventory.py", line 1366, in main
    _perform_inventory_reports(arguments)
  File "site_utils/lab_inventory.py", line 1187, in _perform_inventory_reports
    _perform_board_inventory(arguments, inventory, timestamp)
  File "site_utils/lab_inventory.py", line 1018, in _perform_board_inventory
    board_message = _generate_board_inventory_message(inventory)
  File "site_utils/lab_inventory.py", line 786, in _generate_board_inventory_message
    counts.get_spares_buffer(),
  File "site_utils/lab_inventory.py", line 404, in get_spares_buffer
    return self.get_total(SPARE_POOL) - self.get_broken()
  File "site_utils/lab_inventory.py", line 357, in get_broken
    return self._count_pool(_CachedHostJobHistories.get_broken, pool)
  File "site_utils/lab_inventory.py", line 302, in _count_pool
    self._histories_by_pool.values()])
  File "site_utils/lab_inventory.py", line 225, in get_broken
    return len(self.get_broken_list())
  File "site_utils/lab_inventory.py", line 219, in get_broken_list
    if h.last_diagnosis()[0] == status_history.BROKEN]
  File "/usr/local/autotest/server/lib/status_history.py", line 651, in last_diagnosis
    self._init_status_task()
  File "/usr/local/autotest/server/lib/status_history.py", line 580, in _init_status_task
    self._afe, self._host.id, self.end_time)
  File "/usr/local/autotest/server/lib/status_history.py", line 285, in get_status_task
    task = afe.get_host_status_task(host_id, query_end)
  File "/usr/local/autotest/server/frontend.py", line 672, in get_host_status_task
    host_id=host_id, end_time=end_time)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 131, in run
    self, call, **dargs)
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 244, in GenericRetry
    return _run()
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 177, in _Wrapper
    ret = func(*args, **kwargs)
  File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 243, in _run
    return functor(*args, **kwargs)
  File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 94, in _run
    return super(RetryingAFE, self).run(call, **dargs)
  File "/usr/local/autotest/server/frontend.py", line 108, in run
    result = utils.strip_unicode(rpc_call(**dargs))
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 143, in __call__
    raise BuildException(resp['error'])
JSONRPCException: HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 1498, in get_host_status_task
    host_id=host_id, end_time=end_time)
  File "/usr/local/autotest/server/frontend.py", line 108, in run
    result = utils.strip_unicode(rpc_call(**dargs))
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 136, in __call__
    postdata, min_rpc_timeout)
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 162, in _raw_http_request
    return urllib2.urlopen(request).read()
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: INTERNAL SERVER ERROR

I'm reasonably certain is that what this says is that the shard
(in this case, for veyron_minnie) is where the 500 originated.
As for _why_ the shard complained, I don't know.

Logs show failures at the following times:
    2018-01-12 14:29:27
    2018-01-16 14:07:20
    2018-01-17 14:27:55
    2018-01-18 14:22:29
    2018-01-19 14:03:03
    2018-01-22 06:03:48
    2018-01-23 06:11:33
    2018-01-23 14:50:03
    2018-01-24 14:06:21

Additionally, the balance_pool command that runs prior to
the morning pool inventory run is sometimes struck by the
failure as well.  The logs call out these dates:
    2018-01-09
    2018-01-12
    2018-01-18
    2018-01-19
    2018-01-23
    2018-01-24

 
5XX rates have been spiking on various servers

https://viceroy.corp.google.com/chromeos/afe_rpc?duration=8d

Will hold open for this investigation.
Status: Fixed (was: Assigned)
presumed due to db issues on shards. Worst offending shards are turned down. Presumed fixed.

Comment 3 by jkop@chromium.org, Jan 29 2018

Cc: -jkop@chromium.org akes...@chromium.org
Owner: jkop@chromium.org
Status: Assigned (was: Fixed)
I'm getting the same errors on balance_pool runs today.

Traceback (most recent call last):
  File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/site-packages/chromite/lib/parallel.py", line 603, in TaskRunner
    task(*x, **task_kwargs)
  File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/site-packages/chromite/lib/parallel.py", line 801, in <lambda>
    fn = lambda idx, task_args: out_queue.put((idx, task(*task_args)))
  File "/usr/local/bin/balance_pool", line 705, in balancer
    start_time, end_time)
  File "/usr/local/bin/balance_pool", line 369, in _balance_board_or_model
    spare_pool = _DUTPool(afe, arguments.spare, labels, start_time, end_time)
  File "/usr/local/bin/balance_pool", line 186, in __init__
    self.total_hosts = self._get_hosts(afe, start_time, end_time)
  File "/usr/local/bin/balance_pool", line 199, in _get_hosts
    diag = h.last_diagnosis()[0]
  File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/server/lib/status_history.py", line 651, in last_diagnosis
    self._init_status_task()
  File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/server/lib/status_history.py", line 580, in _init_status_task
    self._afe, self._host.id, self.end_time)
  File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/server/lib/status_history.py", line 285, in get_status_task
    task = afe.get_host_status_task(host_id, query_end)
  File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/server/frontend.py", line 672, in get_host_status_task
    host_id=host_id, end_time=end_time)
  File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/server/frontend.py", line 108, in run
    result = utils.strip_unicode(rpc_call(**dargs))
  File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/frontend/afe/json_rpc/proxy.py", line 143, in __call__
    raise BuildException(resp['error'])
JSONRPCException: HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 1498, in get_host_status_task
    host_id=host_id, end_time=end_time)
  File "/usr/local/autotest/server/frontend.py", line 108, in run
    result = utils.strip_unicode(rpc_call(**dargs))
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 136, in __call__
    postdata, min_rpc_timeout)
  File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 162, in _raw_http_request
    return urllib2.urlopen(request).read()
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: INTERNAL SERVER ERROR

Comment 4 by jkop@chromium.org, Jan 29 2018

Cc: shuqianz@chromium.org

Comment 5 by jkop@chromium.org, Jan 29 2018

Labels: -Pri-1 Pri-2
balance_pool succeeded when I went back to investigate further. Until anything else goes wrong, demoting.
Labels: -Pri-2 Pri-1
This bug wasn't fixed on Friday when it was closed.  The
morning inventory run failed both during pool balancing, and
again gathering inventory status.  That's the problem we need
to fix, and it's still just as serious as when this bug was
filed.  I think it also still qualifies as P1; this problem
probably isn't affecting only lab inventory.

Digging into today's failure.

First, 'lab_inventory.log' around the failure:
2018-01-29 06:04:02 | DEBUG      | Listing failed DUTs for buddy
2018-01-29 06:04:02 | DEBUG      | ending retries with error: <class 'autotest_lib.frontend.afe.json_rpc.proxy.JSONRPCException'>(HTTPError: HTTP Error 500: INTERNAL SERVER ERROR

So, it was the 'buddy' shard that failed:
    $ atest shard list | grep buddy
    193  chromeos-server153.cbf.corp.google.com       board:buddy

Looking at the dashboard for that shard, there was a 2-hour CPU
utilization excursion from about 4:45AM to about 6:45AM, which
includes the time of the failure:
    http://shortn/_LgSYlfD3CN

So, if we explain that excursion, and I think we can explain the
failure.

Comment 7 by jkop@chromium.org, Jan 31 2018

The server that acted up yesterday has been retired, and it hasn't recurred today. Lab inventory also seemed fine today.

Comment 8 by jkop@chromium.org, Jan 31 2018

Status: Fixed (was: Assigned)

Sign in to add a comment