Crash collection can cause repair to fail
Reported by jrbarnette@chromium.org, Feb 10 2017
Issue description
This repair job:
http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row7-rack7-host13/59883785-repair/
The job failed because of an exception raised from
get_crashinfo() in server/crashcollect.py. The code
path indicates that the repair itself was actually
successful, but a bug in the collection code caused
a false failure. See bug 691119.
Probably, exceptions of this sort shouldn't be allowed
to cause repair failures.
Here's the relevant source:
class ServoResetRepair(hosts.RepairAction):
    # ...
    def repair(self, host):
        # ...
        if host.wait_up(host.BOOT_TIMEOUT):
            # Collect logs once we regain ssh access before clobbering them.
            local_log_dir = crashcollect.get_crashinfo_dir(host, 'after_reset')
            host.collect_logs('/var/log', local_log_dir, ignore_errors=True)
            # Collect crash info.
            crashcollect.get_crashinfo(host, None)
            return
        # ...
The call to get_crashinfo() indicates that host.wait_up()
returned true, meaning the DUT was working.
Here's the traceback:
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 447, in _repair_host
    self.repair(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 282, in repair
    crashcollect.get_crashinfo(host, None)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/crashcollect.py", line 161, in get_crashinfo
    get_crashdumps(host, test_start_time)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/crashcollect.py", line 148, in get_crashdumps
    get_site_crashdumps(host, test_start_time)
  File "/usr/local/autotest/server/site_crashcollect.py", line 264, in get_site_crashdumps
    minidumps = find_and_generate_minidump_stacktraces(host_resultdir)
  File "/usr/local/autotest/server/site_crashcollect.py", line 147, in find_and_generate_minidump_stacktraces
    generate_stacktrace_for_file(file, host_resultdir)
  File "/usr/local/autotest/server/site_crashcollect.py", line 109, in generate_stacktrace_for_file
    crashserver_name = _resolve_crashserver()
  File "/usr/local/autotest/server/site_crashcollect.py", line 53, in _resolve_crashserver
    fields={'crash_server': crashserver_name})
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 105, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 86, in AddToQueueIfPresent
    return fn(*args, **kwargs)
TypeError: Counter() got an unexpected keyword argument 'fields'
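The last frame points to a keyword-argument mismatch: the caller passes fields=... to a callable that does not declare that parameter. Here is a minimal sketch of that failure mode. The Counter below is a hypothetical stand-in, not chromite's actual implementation (whose signature is not shown in this report), and the metric name and field value are made up for illustration.

```python
def Counter(name):
    """Hypothetical stand-in: accepts a metric name but no 'fields' keyword."""
    return name


def resolve_crashserver_sketch():
    """Call Counter() with a keyword it does not declare; return the error text."""
    try:
        # 'example_metric' and the fields dict are illustrative values only.
        Counter('example_metric', fields={'crash_server': 'example'})
    except TypeError as e:
        return str(e)
    return None


message = resolve_crashserver_sketch()
print(message)
```

Because nothing between this call and repair() catches the TypeError, it propagates all the way up and the repair action is recorded as failed.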
Feb 15 2018
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. If you change it back, also remove the "Hotlist-Recharge-Cold" label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
May 16 2018
This was fixed with a separate CL for a possibly related (duplicate?)
bug. Here's the relevant code now:
def _check_reset_success(self, host):
    """Check whether reset succeeded, and gather logs if possible."""
    if host.wait_up(host.BOOT_TIMEOUT):
        try:
            # Collect logs once we regain ssh access before
            # clobbering them.
            self._collect_logs(host)
        except Exception:
            # If the DUT is up, we want to declare success, even if
            # log gathering fails for some reason. So, if there's
            # a failure, just log it and move on.
            logging.exception('Unexpected failure in log '
                              'collection during %s.',
                              self.tag)
        return
    raise hosts.AutoservRepairError(
            'Host %s is still offline after %s.' %
            (host.hostname, self.tag))
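The behavior of this pattern can be checked in isolation. The following is a minimal sketch, not the autotest code: FakeHost, RepairError, check_reset_success, and broken_collector are all hypothetical stand-ins for illustration. It shows that a collector that raises no longer turns a successful reset into a failure, while an offline host still does.

```python
import logging


class RepairError(Exception):
    """Hypothetical stand-in for hosts.AutoservRepairError."""


class FakeHost:
    """Hypothetical host stub; is_up controls what wait_up() reports."""
    BOOT_TIMEOUT = 1
    hostname = 'fake-host'

    def __init__(self, is_up):
        self._is_up = is_up

    def wait_up(self, timeout):
        return self._is_up


def check_reset_success(host, collect_logs, tag='reset'):
    """Return normally when the host is up, even if log collection raises."""
    if host.wait_up(host.BOOT_TIMEOUT):
        try:
            collect_logs(host)
        except Exception:
            # Best effort only: record the failure, keep the repair result.
            logging.exception(
                'Unexpected failure in log collection during %s.', tag)
        return
    raise RepairError(
        'Host %s is still offline after %s.' % (host.hostname, tag))


def broken_collector(host):
    """Simulate the TypeError from the traceback in this report."""
    raise TypeError("Counter() got an unexpected keyword argument 'fields'")


# Host is up, collector fails: the error is logged and the call returns.
check_reset_success(FakeHost(True), broken_collector)
```

Catching Exception broadly is deliberate in this sketch: log collection is treated as optional, so any failure there is logged with its traceback and otherwise ignored.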
Comment 1 by aut...@google.com, Feb 15 2017