Issue 691123

Issue metadata

Status: Fixed
Owner: ----
Closed: May 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Crash collection can cause repair to fail

Reported by jrbarnette@chromium.org, Feb 10 2017

Issue description

This repair job:
    http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row7-rack7-host13/59883785-repair/

The job failed because of an exception raised from
get_crashinfo() in server/crashcollect.py.  The code
path indicates that the repair itself succeeded, but
a bug in the collection code caused a false failure.
See bug 691119.

Exceptions of this sort probably shouldn't be allowed
to cause repair failures.

Here's the relevant source:
class ServoResetRepair(hosts.RepairAction):
    # ...
    def repair(self, host):
        # ...
        if host.wait_up(host.BOOT_TIMEOUT):
            # Collect logs once we regain ssh access before clobbering them.
            local_log_dir = crashcollect.get_crashinfo_dir(host, 'after_reset')
            host.collect_logs('/var/log', local_log_dir, ignore_errors=True)
            # Collect crash info.
            crashcollect.get_crashinfo(host, None)
            return
        # ...

Because get_crashinfo() is only called after host.wait_up()
returns true, the fact that it ran means the DUT was up.

Here's the traceback:
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 447, in _repair_host
    self.repair(host)
  File "/usr/local/autotest/server/hosts/cros_repair.py", line 282, in repair
    crashcollect.get_crashinfo(host, None)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/crashcollect.py", line 161, in get_crashinfo
    get_crashdumps(host, test_start_time)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/crashcollect.py", line 148, in get_crashdumps
    get_site_crashdumps(host, test_start_time)
  File "/usr/local/autotest/server/site_crashcollect.py", line 264, in get_site_crashdumps
    minidumps = find_and_generate_minidump_stacktraces(host_resultdir)
  File "/usr/local/autotest/server/site_crashcollect.py", line 147, in find_and_generate_minidump_stacktraces
    generate_stacktrace_for_file(file, host_resultdir)
  File "/usr/local/autotest/server/site_crashcollect.py", line 109, in generate_stacktrace_for_file
    crashserver_name = _resolve_crashserver()
  File "/usr/local/autotest/server/site_crashcollect.py", line 53, in _resolve_crashserver
    fields={'crash_server': crashserver_name})
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 105, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 86, in AddToQueueIfPresent
    return fn(*args, **kwargs)
TypeError: Counter() got an unexpected keyword argument 'fields'
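
The last frame suggests a signature mismatch: something named Counter was called with a fields= keyword it does not accept. The real chromite code is not shown in this report, so the following is only a minimal sketch of that failure mode. Counter and resolve_crashserver_sketch here are hypothetical stand-ins, and the metric name is made up.

```python
def Counter(name):
    """Hypothetical stand-in that, like the failing callable, takes no fields=."""
    return name


def resolve_crashserver_sketch():
    # Mirrors the shape of the failing call: an extra fields= keyword
    # is passed to a callable that does not declare it.
    return Counter('hypothetical/metric',
                   fields={'crash_server': 'example-server'})


try:
    resolve_crashserver_sketch()
    error_message = None
except TypeError as e:
    # The interpreter rejects the call before the function body runs.
    error_message = str(e)

print(error_message)
```

The point is that the error comes from the metrics call itself, not from anything the DUT did.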
 

Comment 1 by aut...@google.com, Feb 15 2017

Labels: Hotlist-Fixit
Marking as FixIt

Comment 2 by sheriffbot@chromium.org, Feb 15 2018

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available. If you change it back, also remove the "Hotlist-Recharge-Cold" label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Hotlist-Recharge-Cold
Status: Available (was: Untriaged)
Status: Fixed (was: Available)
This was fixed with a separate CL for a possibly related (duplicate?)
bug.  Here's the relevant code now:

    def _check_reset_success(self, host):
        """Check whether reset succeeded, and gather logs if possible."""
        if host.wait_up(host.BOOT_TIMEOUT):
            try:
                # Collect logs once we regain ssh access before
                # clobbering them.
                self._collect_logs(host)
            except Exception:
                # If the DUT is up, we want to declare success, even if
                # log gathering fails for some reason.  So, if there's
                # a failure, just log it and move on.
                logging.exception('Unexpected failure in log '
                                  'collection during %s.',
                                  self.tag)
            return
        raise hosts.AutoservRepairError(
                'Host %s is still offline after %s.' %
                (host.hostname, self.tag))
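
For anyone who wants to try the behavior outside of autotest, here is a self-contained sketch of the same pattern. FakeHost, RepairError, and broken_collect_logs are hypothetical stubs invented for illustration; only the control flow follows the code above.

```python
import logging


class RepairError(Exception):
    """Hypothetical stand-in for hosts.AutoservRepairError."""


class FakeHost(object):
    """Hypothetical host stub with only the attributes used below."""
    BOOT_TIMEOUT = 1
    hostname = 'fake-dut'

    def __init__(self, is_up):
        self._is_up = is_up

    def wait_up(self, timeout):
        return self._is_up


def broken_collect_logs(host):
    # Simulates the collection bug from the traceback in the description.
    raise TypeError("Counter() got an unexpected keyword argument 'fields'")


def check_reset_success(host, collect_logs, tag='reset'):
    """Succeed if the host is up, even when log collection fails."""
    if host.wait_up(host.BOOT_TIMEOUT):
        try:
            collect_logs(host)
        except Exception:
            # Log the failure and move on; the DUT being up is what counts.
            logging.exception(
                'Unexpected failure in log collection during %s.', tag)
        return
    raise RepairError(
        'Host %s is still offline after %s.' % (host.hostname, tag))


# DUT is up but collection fails: no exception escapes.
check_reset_success(FakeHost(is_up=True), broken_collect_logs)

# DUT is down: the repair action still fails as before.
try:
    check_reset_success(FakeHost(is_up=False), broken_collect_logs)
    offline_raised = False
except RepairError:
    offline_raised = True
```

Catching Exception broadly is deliberate in this sketch: log gathering is best-effort, and the only condition that should fail the repair is the host staying offline.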