
Issue 684017


Issue metadata

Status: Archived
Owner:
Closed: Mar 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




crash_collection exception: fails to remove directory on drone

Project Member Reported by pprabhu@chromium.org, Jan 23 2017

Issue description

Example: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row4-rack10-host15/2388271-repair/20172001164321/debug/

Exception:
Traceback (most recent call last):
  File "/usr/local/autotest/server/control_segments/repair", line 30, in repair
    crashcollect.get_crashinfo(target, None)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/crashcollect.py", line 161, in get_crashinfo
    get_crashdumps(host, test_start_time)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/crashcollect.py", line 148, in get_crashdumps
    get_site_crashdumps(host, test_start_time)
  File "/usr/local/autotest/server/site_crashcollect.py", line 263, in get_site_crashdumps
    orphans = fetch_orphaned_crashdumps(host, infodir)
  File "/usr/local/autotest/server/site_crashcollect.py", line 213, in fetch_orphaned_crashdumps
    os.rmdir(infodir)
OSError: [Errno 39] Directory not empty: '/usr/local/autotest/results/hosts/chromeos4-row4-rack10-host15/2388271-repair/20172001164321/crashinfo.chromeos4-row4-rack10-host15'


I don't know why the directory on the drone was not empty, but this should not fail the repair job and take the DUT out of commission.
 
Owner: pprabhu@chromium.org
Status: Started (was: Available)
So the problem is twofold:

(1) For some reason, the crash directory on the DUT is read-only. This means that we copy the crash dump off the DUT, but then fail to rm it from the DUT:

01/20 16:43:46.975 ERROR|        base_utils:0280| [stderr] rm: cannot remove '/var/spool/crash/keygen.20170120.154032.11449.core': Read-only file system

(2) The collection code then decides that we failed to fetch any crashes, and tries to delete the local target directory. But we do have a crash dump in there, so the local rmdir fails with the "Directory not empty" error in the traceback above.

This results in us failing the repair for the DUT.
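
One possible way to avoid (2) is to make the local cleanup tolerant of a non-empty directory. The sketch below is only an illustration: remove_dir_if_empty is a hypothetical helper, not an existing function in site_crashcollect.py, and it assumes that leaving the directory in place is acceptable when it still holds a crash dump.

```python
import errno
import logging
import os


def remove_dir_if_empty(path):
    """Remove path if it is an empty directory.

    Hypothetical helper (not part of autotest). Returns True if the
    directory was removed, False if it was left in place because it
    still has contents or no longer exists.
    """
    try:
        os.rmdir(path)
        return True
    except OSError as e:
        # ENOTEMPTY is the "[Errno 39] Directory not empty" case on Linux;
        # some platforms report EEXIST for the same condition.
        if e.errno in (errno.ENOTEMPTY, errno.EEXIST, errno.ENOENT):
            logging.warning('Leaving %s in place: %s', path, e)
            return False
        # Anything else (e.g. permission problems) is still surfaced.
        raise
```

With something like this in place of the bare os.rmdir(infodir) call, a crash dump that was copied but could not be deleted from the DUT would simply stay in the results directory.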
A third problem:

(3) Failing to collect logs, for whatever reason, should not fail the repair job before we even get to repairing the DUT.

The status.log shows that we never even ran repair on the DUT (which would have fixed this problem by rebooting the DUT):

START	----	repair	timestamp=1484959411	localtime=Jan 20 16:43:31	
	GOOD	----	verify.ssh	timestamp=1484959413	localtime=Jan 20 16:43:33	
	GOOD	----	verify.brd_config	timestamp=1484959414	localtime=Jan 20 16:43:34	
	GOOD	----	verify.ser_config	timestamp=1484959414	localtime=Jan 20 16:43:34	
	GOOD	----	verify.job	timestamp=1484959415	localtime=Jan 20 16:43:35	
	GOOD	----	verify.servod	timestamp=1484959418	localtime=Jan 20 16:43:38	
	GOOD	----	verify.pwr_button	timestamp=1484959418	localtime=Jan 20 16:43:38	
	GOOD	----	verify.lid_open	timestamp=1484959418	localtime=Jan 20 16:43:38	
	GOOD	----	verify.update	timestamp=1484959422	localtime=Jan 20 16:43:42	
	GOOD	----	verify.PASS	timestamp=1484959422	localtime=Jan 20 16:43:42	
END FAIL	----	repair	timestamp=1484959426	localtime=Jan 20 16:43:46	
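
For (3), a minimal sketch of treating log collection as best effort, so that an exception there does not end the repair job before repair runs. collect_logs_best_effort is a hypothetical wrapper, not existing autotest code; it assumes that logging the failure and continuing is the desired behavior.

```python
import logging


def collect_logs_best_effort(collect_fn, *args, **kwargs):
    """Run collect_fn, logging any exception instead of propagating it.

    Hypothetical wrapper (not part of autotest). Returns True if
    collection succeeded, False if it raised.
    """
    try:
        collect_fn(*args, **kwargs)
        return True
    except Exception:
        # Record the full traceback for debugging, but let the caller
        # carry on with repair.
        logging.exception('Log collection failed; continuing with repair')
        return False
```

In the repair control segment, the call from the traceback could then look like collect_logs_best_effort(crashcollect.get_crashinfo, target, None), followed by the actual repair step.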

Comment 4 by autumn@chromium.org, Jan 24 2017

Labels: -current-issue
Status: Archived (was: Started)
Bulk closing Infra>Client>ChromeOS issues untouched in over a year.
