Issue 734764

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug


Repair failure on nyan_kitty

Reported by jrbarnette@chromium.org, Jun 19 2017

Issue description

As a consequence of bug 734731, three boards' worth of CQ DUTs
went into repair.  Of the three boards, one of them, nyan_kitty,
failed repair on all DUTs.

Logs for an instance of a failed repair are here:
    https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row13-rack8-host4/3600789-repair/

The status.log file is attached; it shows problems with the
DUT's servo.  However, _the servo failures didn't cause the
repair failures_.  The log shows that DUT repair never ran.

The attached autoserv.DEBUG file has the log of why repair failed
to run on the DUT.  Essentially, the DUT was complaining about
"Read-only file system".  As best I can tell, that error was causing
host construction to fail.  That mustn't be allowed.  Host object
construction isn't allowed to fail, ever.

Attachments:
    status.log (2.3 KB)
    autoserv.DEBUG (5.5 KB)

It should be noted that the failure to repair led immediately
to this CQ run failure:

master-paladin:
    https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15077

nyan_kitty-paladin:
    https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/1996

Cc: pprabhu@chromium.org
Owner: jrbarnette@chromium.org
Status: Assigned (was: Available)
Our current guess is:
- repair tries to create a host object
- somehow this requires the DUT to have a writable filesystem, which it doesn't have
- so host creation fails
- repair fails

AI: jrbarnette@ to sorta-verify the theory above and re-summarize the bug with the actual root cause.
Digging into the logs, this is the trace of the specific failure:

06/16 14:07:10.686 INFO | site_crashcollect:0215| There are no orphaned crashes; deleting /usr/local/autotest/results/hosts/chromeos4-row13-rack8-host4/3600789-repair/20171606140359/crashinfo.chromeos4-row13-rack8-host4
06/16 14:07:10.688 ERROR|            repair:0037| Repair failed due to Exception.
Traceback (most recent call last):
  File "/usr/local/autotest/server/control_segments/repair", line 30, in repair
    crashcollect.get_crashinfo(target, None)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 306, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/crashcollect.py", line 164, in get_crashinfo
    get_crashdumps(host, test_start_time)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 306, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/crashcollect.py", line 151, in get_crashdumps
    get_site_crashdumps(host, test_start_time)
  File "/usr/local/autotest/server/site_crashcollect.py", line 266, in get_site_crashdumps
    orphans = fetch_orphaned_crashdumps(host, infodir)
  File "/usr/local/autotest/server/site_crashcollect.py", line 216, in fetch_orphaned_crashdumps
    os.rmdir(infodir)

This is the relevant source code:
def fetch_orphaned_crashdumps(host, infodir):
    # ...
    try:
        # ...
    finally:
        # Delete infodir if we have no orphans
        if not orphans:
            logging.info('There are no orphaned crashes; deleting %s', infodir)
            os.rmdir(infodir)
    return orphans

The call to `os.rmdir()` failed, and there was nothing to catch
the exception.
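
The failure mode can be reproduced in isolation. The following is a minimal sketch, not the autotest code: `fetch_orphans_sketch` and `demo` are hypothetical names, and a non-empty directory is used only as a convenient way to make `os.rmdir()` raise. The logs above don't show why the real call failed.

```python
import logging
import os
import shutil
import tempfile


def fetch_orphans_sketch(infodir):
    """Hypothetical stand-in for fetch_orphaned_crashdumps()."""
    orphans = []
    try:
        pass  # the real function gathers crash dumps here
    finally:
        # Delete infodir if we have no orphans
        if not orphans:
            logging.info('There are no orphaned crashes; deleting %s', infodir)
            # Nothing here catches an OSError, so it escapes the function.
            os.rmdir(infodir)
    return orphans


def demo():
    """Return True if the rmdir failure propagates to the caller."""
    infodir = tempfile.mkdtemp()
    # Assumption for the demo: leave a file behind so that rmdir fails.
    with open(os.path.join(infodir, 'leftover'), 'w') as f:
        f.write('x')
    try:
        fetch_orphans_sketch(infodir)
    except OSError:
        return True
    finally:
        shutil.rmtree(infodir, ignore_errors=True)
    return False
```

Because the `os.rmdir()` call sits in a `finally:` clause with no handler around it, the exception travels up through every caller, which matches the traceback ending in the repair control segment.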


The fix should be to catch, log, and then ignore exceptions coming
from the code block in server/control_segments/repair:
        if isinstance(target, hosts.CrosHost):
            # Collect logs before the repair, as it might destroy all
            # useful logs.
            local_log_dir = os.path.join(job.resultdir, hostname,
                                         'before_repair')
            target.collect_logs('/var/log', local_log_dir, ignore_errors=True)
            # Collect crash info.
            crashcollect.get_crashinfo(target, None)
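
A rough sketch of the "catch, log, ignore" idea, for illustration only. This is not the change that landed. `run_ignoring_errors`, `failing_crash_collection`, and `repair_sketch` are hypothetical names, and the raised "Read-only file system" error is an assumed stand-in for whatever the collection step might throw.

```python
import errno
import logging


def run_ignoring_errors(step, *args, **kwargs):
    """Run one best-effort step; log and discard any exception it raises.

    Returns True if the step succeeded, False if it raised.
    """
    try:
        step(*args, **kwargs)
    except Exception:
        # logging.exception() records the message plus the traceback.
        logging.exception('Log collection failed; continuing with repair')
        return False
    return True


def failing_crash_collection(target):
    """Hypothetical stand-in for crashcollect.get_crashinfo(target, None)."""
    raise OSError(errno.EROFS, 'Read-only file system')


def repair_sketch(target):
    """Hypothetical repair flow: collect logs if possible, then repair."""
    collected = run_ignoring_errors(failing_crash_collection, target)
    repaired = True  # placeholder for the real repair step
    return collected, repaired
```

With this shape, a failure in log collection is still visible in the logs, but the repair step runs regardless.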

Status: Started (was: Assigned)
Can you upload that CL and send it to me for review then?
> Can you upload that CL and send it to me for review then?

First write, then test, then upload for review.  All sometime
today, Lord willin', and the crick don't rise.

Project Member

Comment 8 by bugdroid1@chromium.org, Jun 21 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3c6d0c84c4fcd2df23c3f356c815c6cb7517d022

commit 3c6d0c84c4fcd2df23c3f356c815c6cb7517d022
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Jun 21 00:24:17 2017

[autotest] Don't allow crash collection to fail repair.

The standard repair task control file attempts to collect crash log
data.  If the collection attempt failed with an exception, the
entire repair task would fail.  That's bad, since repair is more
important than the logs.

This fixes the code to log and discard all exceptions coming from
crash log collection during repair.

BUG= chromium:734764 
TEST=run repair in a local instance

Change-Id: Ibe6eb2b22bede9b16a715359b1805b37e6e3214a
Reviewed-on: https://chromium-review.googlesource.com/540749
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/3c6d0c84c4fcd2df23c3f356c815c6cb7517d022/server/control_segments/repair

Status: Fixed (was: Started)
Fixed, but we need a push to prod.

Cc: jrbarnette@chromium.org
Issue 686791 has been merged into this issue.