Repair failure on nyan_kitty
Reported by
jrbarnette@chromium.org,
Jun 19 2017
Issue description
As a consequence of bug 734731, three boards' worth of CQ DUTs went into repair. Of the three boards, one of them, nyan_kitty, failed repair on all DUTs.
Logs for one instance of a failed repair are here:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row13-rack8-host4/3600789-repair/
The status.log file is attached; it shows problems with the DUT's servo. However, _the servo failures didn't cause the repair failures_. The log shows that DUT repair never ran.
The attached autoserv.DEBUG file logs why repair failed to run on the DUT. Essentially, the DUT was complaining about "Read-only file system". As best I can tell, that error was causing host construction to fail.
That mustn't be allowed. Host object construction isn't allowed to fail, ever.
,
Jun 19 2017
,
Jun 19 2017
Our current guess is:
- repair tries to create a host object
- somehow this requires the DUT to have a writable filesystem, which fails
- so host creation fails
- repair fails
AI: jrbarnette@ to sorta-verify the theory above and re-summarize the bug with the actual root cause.
,
Jun 20 2017
Digging into the logs, this is the trace of the specific failure:
06/16 14:07:10.686 INFO | site_crashcollect:0215| There are no orphaned crashes; deleting /usr/local/autotest/results/hosts/chromeos4-row13-rack8-host4/3600789-repair/20171606140359/crashinfo.chromeos4-row13-rack8-host4
06/16 14:07:10.688 ERROR| repair:0037| Repair failed due to Exception.
Traceback (most recent call last):
  File "/usr/local/autotest/server/control_segments/repair", line 30, in repair
    crashcollect.get_crashinfo(target, None)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 306, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/crashcollect.py", line 164, in get_crashinfo
    get_crashdumps(host, test_start_time)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 306, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/crashcollect.py", line 151, in get_crashdumps
    get_site_crashdumps(host, test_start_time)
  File "/usr/local/autotest/server/site_crashcollect.py", line 266, in get_site_crashdumps
    orphans = fetch_orphaned_crashdumps(host, infodir)
  File "/usr/local/autotest/server/site_crashcollect.py", line 216, in fetch_orphaned_crashdumps
    os.rmdir(infodir)
This is the relevant source code:
def fetch_orphaned_crashdumps(host, infodir):
    # ...
    try:
        # ...
    finally:
        # Delete infodir if we have no orphans
        if not orphans:
            logging.info('There are no orphaned crashes; deleting %s', infodir)
            os.rmdir(infodir)
    return orphans
The call to `os.rmdir()` failed, and there was nothing to catch
the exception.
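To illustrate the failure mode, here is a minimal standalone sketch (not the autotest code). It assumes only the shape of the excerpt above: cleanup runs in a `finally` block, so an error raised by `os.rmdir()` escapes to the caller. The function names are hypothetical, and a missing directory stands in for the read-only filesystem, since both surface as `OSError`.

```python
import os
import shutil
import tempfile


def cleanup_in_finally(infodir):
    """Hypothetical stand-in with the same shape as the excerpt above."""
    orphans = []
    try:
        pass  # stand-in for the elided collection logic
    finally:
        # Nothing guards this call, so an OSError raised here
        # propagates out of the function to whoever called it.
        if not orphans:
            os.rmdir(infodir)
    return orphans


def finally_error_escapes():
    """Return True if the cleanup error reaches the caller."""
    parent = tempfile.mkdtemp()
    # Assumption: a path that was never created makes os.rmdir() fail,
    # standing in for the "Read-only file system" error on the real host.
    missing = os.path.join(parent, 'crashinfo.example')
    try:
        cleanup_in_finally(missing)
    except OSError:
        return True
    finally:
        shutil.rmtree(parent, ignore_errors=True)
    return False
```

Calling `finally_error_escapes()` shows the error reaching the caller, which is the same path the traceback takes up to the repair control segment.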
The fix should be to catch, log, and then ignore exceptions coming
from the code block in server/control_segments/repair:
    if isinstance(target, hosts.CrosHost):
        # Collect logs before the repair, as it might destroy all
        # useful logs.
        local_log_dir = os.path.join(job.resultdir, hostname,
                                     'before_repair')
        target.collect_logs('/var/log', local_log_dir, ignore_errors=True)
        # Collect crash info.
        crashcollect.get_crashinfo(target, None)
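A rough sketch of what "catch, log, and then ignore" could look like, written as a standalone helper so it can be run on its own. This is not the actual change; `run_best_effort` and `failing_collection` are hypothetical names, and the simulated error only imitates the "Read-only file system" failure.

```python
import errno
import logging


def run_best_effort(description, fn, *args, **kwargs):
    """Call fn; log and discard any Exception so the caller can continue.

    Hypothetical helper, not part of autotest.
    """
    try:
        return fn(*args, **kwargs)
    except Exception:
        # logging.exception records the message plus the full traceback.
        logging.exception('%s failed; continuing with repair', description)
        return None


def failing_collection():
    """Simulated log collection that fails like the DUT in this bug."""
    raise OSError(errno.EROFS, 'Read-only file system')
```

In the repair control segment, the idea would be to route the pre-repair collection through such a guard, e.g. `run_best_effort('Crash collection', crashcollect.get_crashinfo, target, None)`, so that a collection failure is logged but repair still runs.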
,
Jun 20 2017
Can you upload that CL and send it to me for review then?
,
Jun 20 2017
> Can you upload that CL and send it to me for review then?
First write, then test, then upload for review. All sometime today, Lord willin', and the crick don't rise.
,
Jun 20 2017
,
Jun 21 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3c6d0c84c4fcd2df23c3f356c815c6cb7517d022
commit 3c6d0c84c4fcd2df23c3f356c815c6cb7517d022
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Jun 21 00:24:17 2017
[autotest] Don't allow crash collection to fail repair.
The standard repair task control file attempts to collect crash log data. If the collection attempt failed with an exception, the entire repair task would fail. That's bad, since repair is more important than the logs.
This fixes the code to log and discard all exceptions coming from crash log collection during repair.
BUG=chromium:734764
TEST=run repair in a local instance
Change-Id: Ibe6eb2b22bede9b16a715359b1805b37e6e3214a
Reviewed-on: https://chromium-review.googlesource.com/540749
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
[modify] https://crrev.com/3c6d0c84c4fcd2df23c3f356c815c6cb7517d022/server/control_segments/repair
,
Jun 21 2017
Fixed, but we need a push to prod.
,
Jul 21 2017
Comment 1 by jrbarnette@chromium.org
, Jun 19 2017
It should be noted that the failure to repair led immediately to this CQ run failure:
master-paladin: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15077
nyan_kitty-paladin: https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/1996