crash_collection exception: fails to remove directory on drone
Issue description

Example: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row4-rack10-host15/2388271-repair/20172001164321/debug/

Exception:

Traceback (most recent call last):
  File "/usr/local/autotest/server/control_segments/repair", line 30, in repair
    crashcollect.get_crashinfo(target, None)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/crashcollect.py", line 161, in get_crashinfo
    get_crashdumps(host, test_start_time)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 274, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/crashcollect.py", line 148, in get_crashdumps
    get_site_crashdumps(host, test_start_time)
  File "/usr/local/autotest/server/site_crashcollect.py", line 263, in get_site_crashdumps
    orphans = fetch_orphaned_crashdumps(host, infodir)
  File "/usr/local/autotest/server/site_crashcollect.py", line 213, in fetch_orphaned_crashdumps
    os.rmdir(infodir)
OSError: [Errno 39] Directory not empty: '/usr/local/autotest/results/hosts/chromeos4-row4-rack10-host15/2388271-repair/20172001164321/crashinfo.chromeos4-row4-rack10-host15'

I have no idea why the drone's directory was not empty, but this shouldn't fail repair, causing the DUT to go out of commission.
Jan 23 2017
So the problem is twofold:

(1) For some reason, the crash directory on the DUT is read-only. This means that we copy the crash out, but then fail to rm the crash from the DUT:

01/20 16:43:46.975 ERROR| base_utils:0280| [stderr] rm: cannot remove '/var/spool/crash/keygen.20170120.154032.11449.core': Read-only file system

(2) The collection code then decides that we failed to fetch the crashes, and tries to delete the local target directory. But we do have a crashdump in there, so the local rmdir fails with the OSError shown in the traceback. This results in us failing the repair for the DUT.
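One way to avoid the failure in (2) is to treat a non-empty local directory as a normal outcome rather than an error. The following is only a minimal sketch of that idea, not the actual autotest fix: the helper name `remove_dir_if_empty` is hypothetical, and it assumes the caller is happy to keep the directory when it still holds a crashdump.

```python
import errno
import os
import tempfile


def remove_dir_if_empty(path):
    """Hypothetical helper: remove `path` only if it is empty.

    Returns True if the directory was removed, False if it still has
    contents. Any other OSError (e.g. permission denied) is re-raised.
    """
    try:
        os.rmdir(path)
    except OSError as e:
        # ENOTEMPTY is errno 39 on Linux; some platforms report EEXIST.
        if e.errno in (errno.ENOTEMPTY, errno.EEXIST):
            return False
        raise
    return True


# Usage: a directory that still holds a crashdump is left in place.
demo_dir = tempfile.mkdtemp()
dump_path = os.path.join(demo_dir, 'keygen.core')
with open(dump_path, 'w') as f:
    f.write('fake crashdump')
kept = not remove_dir_if_empty(demo_dir)

# Once the dump is gone, the same call removes the directory.
os.remove(dump_path)
removed = remove_dir_if_empty(demo_dir)
```

With this shape, a crashdump that was copied out but could not be deleted from the DUT would simply stay in the results directory instead of raising.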
Jan 23 2017
A third problem:

(3) Failing to collect logs (for whatever reason) should not fail repair before we can even get to repairing the DUT. The status.log shows that we never even ran repair on the DUT (which would have fixed this problem by rebooting the DUT):

START ---- repair timestamp=1484959411 localtime=Jan 20 16:43:31
GOOD ---- verify.ssh timestamp=1484959413 localtime=Jan 20 16:43:33
GOOD ---- verify.brd_config timestamp=1484959414 localtime=Jan 20 16:43:34
GOOD ---- verify.ser_config timestamp=1484959414 localtime=Jan 20 16:43:34
GOOD ---- verify.job timestamp=1484959415 localtime=Jan 20 16:43:35
GOOD ---- verify.servod timestamp=1484959418 localtime=Jan 20 16:43:38
GOOD ---- verify.pwr_button timestamp=1484959418 localtime=Jan 20 16:43:38
GOOD ---- verify.lid_open timestamp=1484959418 localtime=Jan 20 16:43:38
GOOD ---- verify.update timestamp=1484959422 localtime=Jan 20 16:43:42
GOOD ---- verify.PASS timestamp=1484959422 localtime=Jan 20 16:43:42
END FAIL ---- repair timestamp=1484959426 localtime=Jan 20 16:43:46
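The point in (3) could be addressed by running log collection as a best-effort step, so that an exception is logged and repair still proceeds. This is a rough sketch under assumptions: `run_best_effort`, `fake_collect_crashinfo` and `fake_repair` are hypothetical names for illustration, not functions in autotest.

```python
import logging


def run_best_effort(step, *args, **kwargs):
    """Hypothetical wrapper: run `step`, log any exception, never raise.

    Returns True if the step succeeded and False if it raised.
    """
    try:
        step(*args, **kwargs)
    except Exception:
        name = getattr(step, '__name__', repr(step))
        logging.exception('Best-effort step %s failed; continuing', name)
        return False
    return True


# Stand-ins for the real collection and repair steps (assumptions).
def fake_collect_crashinfo(host):
    raise OSError(39, 'Directory not empty')


repaired_hosts = []


def fake_repair(host):
    repaired_hosts.append(host)


# Collection fails, but repair still runs afterwards.
collected = run_best_effort(fake_collect_crashinfo, 'chromeos4-row4-rack10-host15')
fake_repair('chromeos4-row4-rack10-host15')
```

Catching a broad `Exception` is deliberate here, since the goal is that no collection failure can stop repair; a real change might narrow the set of exceptions or record the failure in the job status.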
Jan 24 2017
Mar 16 2018
Bulk closing Infra>Client>ChromeOS issues untouched in over a year.
Comment 1 by pprabhu@chromium.org, Jan 23 2017