Lab inventory runs are lousy with 500 errors
Reported by jrbarnette@chromium.org, Jan 24 2018
Issue description
Checking the logs for the lab-inventory runs shows that many
runs have been failing with 500 errors. Here's a typical sample:
2018-01-24 14:00:07 | DEBUG | Starting lab inventory for 2018-01-24.14
[ ... ]
2018-01-24 14:06:14 | DEBUG | Counting 45 DUTS for board veyron_minnie
2018-01-24 14:06:21 | DEBUG | ending retries with error: <class 'autotest_lib.frontend.afe.json_rpc.proxy.JSONRPCException'>(HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
Traceback (most recent call last):
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
results['result'] = self.invokeServiceEndpoint(meth, args)
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
return meth(*args)
File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
return f(*args, **keyword_args)
File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 1498, in get_host_status_task
host_id=host_id, end_time=end_time)
File "/usr/local/autotest/server/frontend.py", line 108, in run
result = utils.strip_unicode(rpc_call(**dargs))
File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 136, in __call__
postdata, min_rpc_timeout)
File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 162, in _raw_http_request
return urllib2.urlopen(request).read()
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
)
2018-01-24 14:06:21 | ERROR | Unexpected exception: HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
Traceback (most recent call last):
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
results['result'] = self.invokeServiceEndpoint(meth, args)
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
return meth(*args)
File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
return f(*args, **keyword_args)
File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 1498, in get_host_status_task
host_id=host_id, end_time=end_time)
File "/usr/local/autotest/server/frontend.py", line 108, in run
result = utils.strip_unicode(rpc_call(**dargs))
File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 136, in __call__
postdata, min_rpc_timeout)
File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 162, in _raw_http_request
return urllib2.urlopen(request).read()
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
Traceback (most recent call last):
File "site_utils/lab_inventory.py", line 1366, in main
_perform_inventory_reports(arguments)
File "site_utils/lab_inventory.py", line 1187, in _perform_inventory_reports
_perform_board_inventory(arguments, inventory, timestamp)
File "site_utils/lab_inventory.py", line 1018, in _perform_board_inventory
board_message = _generate_board_inventory_message(inventory)
File "site_utils/lab_inventory.py", line 786, in _generate_board_inventory_message
counts.get_spares_buffer(),
File "site_utils/lab_inventory.py", line 404, in get_spares_buffer
return self.get_total(SPARE_POOL) - self.get_broken()
File "site_utils/lab_inventory.py", line 357, in get_broken
return self._count_pool(_CachedHostJobHistories.get_broken, pool)
File "site_utils/lab_inventory.py", line 302, in _count_pool
self._histories_by_pool.values()])
File "site_utils/lab_inventory.py", line 225, in get_broken
return len(self.get_broken_list())
File "site_utils/lab_inventory.py", line 219, in get_broken_list
if h.last_diagnosis()[0] == status_history.BROKEN]
File "/usr/local/autotest/server/lib/status_history.py", line 651, in last_diagnosis
self._init_status_task()
File "/usr/local/autotest/server/lib/status_history.py", line 580, in _init_status_task
self._afe, self._host.id, self.end_time)
File "/usr/local/autotest/server/lib/status_history.py", line 285, in get_status_task
task = afe.get_host_status_task(host_id, query_end)
File "/usr/local/autotest/server/frontend.py", line 672, in get_host_status_task
host_id=host_id, end_time=end_time)
File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 131, in run
self, call, **dargs)
File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 244, in GenericRetry
return _run()
File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 177, in _Wrapper
ret = func(*args, **kwargs)
File "/usr/local/autotest/site-packages/chromite/lib/retry_util.py", line 243, in _run
return functor(*args, **kwargs)
File "/usr/local/autotest/server/cros/dynamic_suite/frontend_wrappers.py", line 94, in _run
return super(RetryingAFE, self).run(call, **dargs)
File "/usr/local/autotest/server/frontend.py", line 108, in run
result = utils.strip_unicode(rpc_call(**dargs))
File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 143, in __call__
raise BuildException(resp['error'])
JSONRPCException: HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
Traceback (most recent call last):
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
results['result'] = self.invokeServiceEndpoint(meth, args)
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
return meth(*args)
File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
return f(*args, **keyword_args)
File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 1498, in get_host_status_task
host_id=host_id, end_time=end_time)
File "/usr/local/autotest/server/frontend.py", line 108, in run
result = utils.strip_unicode(rpc_call(**dargs))
File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 136, in __call__
postdata, min_rpc_timeout)
File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 162, in _raw_http_request
return urllib2.urlopen(request).read()
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
I'm reasonably certain that what this says is that the shard
(in this case, the one for veyron_minnie) is where the 500 originated.
As for _why_ the shard complained, I don't know.
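One way to confirm that the 500 really originates on the shard (rather than
on the main AFE that proxies the call) would be to issue the same
get_host_status_task RPC directly against the shard's RPC endpoint. A minimal
sketch, assuming the stock frontend.AFE client seen in the traceback above
accepts a server= argument and that end_time is epoch seconds as in
status_history; the shard hostname and host id below are placeholders, not
values from this bug:

    # Sketch: issue the same RPC directly against the shard to see whether
    # the 500 reproduces there.  Run from inside the autotest tree so that
    # 'import common' sets up the autotest_lib import path.
    import time

    import common
    from autotest_lib.server import frontend

    SHARD = 'chromeos-serverNNN.cbf.corp.google.com'  # hypothetical shard hostname
    HOST_ID = 12345                                   # hypothetical DUT host id

    # frontend.AFE(server=...) points the JSON-RPC client at the shard itself,
    # bypassing the main AFE that normally forwards get_host_status_task.
    afe = frontend.AFE(server=SHARD)
    # end_time assumed to be epoch seconds, matching status_history's usage.
    print(afe.get_host_status_task(HOST_ID, int(time.time())))

If that call returns the same 500 from the shard but succeeds against the
main AFE for the same host, that pins the problem on the shard.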
Logs show failures at the following times:
2018-01-12 14:29:27
2018-01-16 14:07:20
2018-01-17 14:27:55
2018-01-18 14:22:29
2018-01-19 14:03:03
2018-01-22 06:03:48
2018-01-23 06:11:33
2018-01-23 14:50:03
2018-01-24 14:06:21
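These times can be pulled out of the lab_inventory logs by scanning for the
"ending retries with error" lines quoted above; a rough sketch, in which the
log location is an assumption rather than a known path:

    # Sketch: scan lab_inventory logs for the failure lines quoted above.
    import glob
    import re

    FAILURE_RE = re.compile(
        r'^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d) \| DEBUG \| ending retries with error')

    # Assumed log location; adjust to wherever the lab_inventory logs live.
    for path in sorted(glob.glob('/usr/local/autotest/logs/lab_inventory.*')):
        with open(path) as log:
            for line in log:
                match = FAILURE_RE.match(line)
                if match:
                    print(match.group(1))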
Additionally, the balance_pool command that runs prior to
the morning pool inventory run is sometimes struck by the
failure as well. The logs call out these dates:
2018-01-09
2018-01-12
2018-01-18
2018-01-19
2018-01-23
2018-01-24
Jan 27 2018
Presumed due to DB issues on shards. The worst-offending shards have been turned down. Presumed fixed.
Jan 29 2018
I'm getting the same errors on balance_pool runs today.
Traceback (most recent call last):
File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/site-packages/chromite/lib/parallel.py", line 603, in TaskRunner
task(*x, **task_kwargs)
File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/site-packages/chromite/lib/parallel.py", line 801, in <lambda>
fn = lambda idx, task_args: out_queue.put((idx, task(*task_args)))
File "/usr/local/bin/balance_pool", line 705, in balancer
start_time, end_time)
File "/usr/local/bin/balance_pool", line 369, in _balance_board_or_model
spare_pool = _DUTPool(afe, arguments.spare, labels, start_time, end_time)
File "/usr/local/bin/balance_pool", line 186, in __init__
self.total_hosts = self._get_hosts(afe, start_time, end_time)
File "/usr/local/bin/balance_pool", line 199, in _get_hosts
diag = h.last_diagnosis()[0]
File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/server/lib/status_history.py", line 651, in last_diagnosis
self._init_status_task()
File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/server/lib/status_history.py", line 580, in _init_status_task
self._afe, self._host.id, self.end_time)
File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/server/lib/status_history.py", line 285, in get_status_task
task = afe.get_host_status_task(host_id, query_end)
File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/server/frontend.py", line 672, in get_host_status_task
host_id=host_id, end_time=end_time)
File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/server/frontend.py", line 108, in run
result = utils.strip_unicode(rpc_call(**dargs))
File "/usr/local/google/home/jkop/shadow_cros/src/third_party/autotest/files/frontend/afe/json_rpc/proxy.py", line 143, in __call__
raise BuildException(resp['error'])
JSONRPCException: HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
Traceback (most recent call last):
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
results['result'] = self.invokeServiceEndpoint(meth, args)
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
return meth(*args)
File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
return f(*args, **keyword_args)
File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 1498, in get_host_status_task
host_id=host_id, end_time=end_time)
File "/usr/local/autotest/server/frontend.py", line 108, in run
result = utils.strip_unicode(rpc_call(**dargs))
File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 136, in __call__
postdata, min_rpc_timeout)
File "/usr/local/autotest/frontend/afe/json_rpc/proxy.py", line 162, in _raw_http_request
return urllib2.urlopen(request).read()
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
Jan 29 2018
balance_pool succeeded when I went back to investigate further. Until anything else goes wrong, demoting.
Jan 30 2018
This bug wasn't fixed on Friday when it was closed. The
morning inventory run failed both during pool balancing and
again while gathering inventory status. That's the problem we
need to fix, and it's still just as serious as when this bug
was filed. I think it also still qualifies as P1; this problem
probably isn't affecting only lab inventory.
Digging into today's failure.
First, 'lab_inventory.log' around the failure:
2018-01-29 06:04:02 | DEBUG | Listing failed DUTs for buddy
2018-01-29 06:04:02 | DEBUG | ending retries with error: <class 'autotest_lib.frontend.afe.json_rpc.proxy.JSONRPCException'>(HTTPError: HTTP Error 500: INTERNAL SERVER ERROR
So, it was the 'buddy' shard that failed:
$ atest shard list | grep buddy
193 chromeos-server153.cbf.corp.google.com board:buddy
The dashboard for that shard shows a 2-hour CPU utilization
excursion from about 4:45AM to about 6:45AM, which includes
the time of the failure:
http://shortn/_LgSYlfD3CN
So, if we can explain that excursion, I think we can explain the
failure.
Jan 31 2018
The server that acted up yesterday has been retired, and it hasn't recurred today. Lab inventory also seemed fine today.