gs_offloader getting root owned test results. |
|||||||||||||||
Issue description
On: chromeos-skunk1.mtv
This is happening repeatedly.
DEBUG:root:Running 'gsutil cp -eR 138505728-chromeos-test/chromeos4-row8-rack10-host18/cheets_CTS_N.CtsJankDeviceTestCases/results/android-cts/2017.08.30_15.29.05.zip gs://chromeos-cts-apfe/sentry-release/R61-9765.48.0/138249548/cheets_CTS_N.CtsJankDeviceTestCases/138505728-chromeos-test_2017.08.30_15.29.05/'
DEBUG:root:Upload 138505728-chromeos-test/chromeos4-row8-rack10-host18/cheets_CTS_N.CtsJankDeviceTestCases/results/android-cts/2017.08.30_15.29.05.zip to gs://chromeos-cts-apfe/sentry-release/R61-9765.48.0/138249548/cheets_CTS_N.CtsJankDeviceTestCases/138505728-chromeos-test_2017.08.30_15.29.05/
ERROR:root:ERROR uploading test results 138505728-chromeos-test/chromeos4-row8-rack10-host18/cheets_CTS_N.CtsJankDeviceTestCases/results/android-cts/2017.08.30_15.29.05 to GS: [Errno 13] Permission denied: '138505728-chromeos-test/chromeos4-row8-rack10-host18/cheets_CTS_N.CtsJankDeviceTestCases/results/android-cts/2017.08.30_15.29.05/test_result.xml.gz'
INFO:root:Waiting for ts_mon flushing process to finish...
INFO:root:Finished waiting for ts_mon process.
Traceback (most recent call last):
File "/usr/local/autotest/site_utils/gs_offloader.py", line 1146, in <module>
main()
File "/usr/local/autotest/site_utils/gs_offloader.py", line 1100, in main
offloader.offload_once()
File "/usr/local/autotest/site_utils/gs_offloader.py", line 914, in offload_once
_enqueue_offload(job, queue, self._upload_age_limit)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/local/autotest/site-packages/chromite/lib/parallel.py", line 750, in BackgroundTaskRunner
queue.put(_AllTasksComplete())
File "/usr/local/autotest/site-packages/chromite/lib/parallel.py", line 750, in BackgroundTaskRunner
queue.put(_AllTasksComplete())
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/local/autotest/site-packages/chromite/lib/parallel.py", line 561, in ParallelTasks
raise BackgroundFailure(exc_infos=errors)
chromite.lib.parallel.BackgroundFailure: <type 'exceptions.OSError'>: [Errno 13] Permission denied: '138505726-chromeos-test/chromeos4-row8-rack9-host15/cheets_CTS_N.CtsNdefTestCases/cheets_StartAndroid/keyval'
Traceback (most recent call last):
File "/usr/local/autotest/site-packages/chromite/lib/parallel.py", line 602, in TaskRunner
task(*x, **task_kwargs)
File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 482, in wrapper
return fn(*args, **kwargs)
File "/usr/local/autotest/site_utils/gs_offloader.py", line 561, in offload
self._offload(dir_entry, dest_path, stdout_file, stderr_file)
File "/usr/local/autotest/site_utils/gs_offloader.py", line 612, in _offload
limit_file_count(dir_entry)
File "/usr/local/autotest/site_utils/gs_offloader.py", line 339, in limit_file_count
_make_into_tarball(folder)
File "/usr/local/autotest/site_utils/gs_offloader.py", line 358, in _make_into_tarball
shutil.rmtree(dirpath)
File "/usr/lib/python2.7/shutil.py", line 247, in rmtree
rmtree(fullname, ignore_errors, onerror)
File "/usr/lib/python2.7/shutil.py", line 252, in rmtree
onerror(os.remove, fullname, sys.exc_info())
File "/usr/lib/python2.7/shutil.py", line 250, in rmtree
os.remove(fullname)
OSError: [Errno 13] Permission denied: '138505726-chromeos-test/chromeos4-row8-rack9-host15/cheets_CTS_N.CtsNdefTestCases/cheets_StartAndroid/keyval'
<type 'exceptions.OSError'>: [Errno 13] Permission denied: '138505551-chromeos-test/chromeos4-row8-rack9-host18/cheets_CTS_N.CtsMediaTestCases/dir_summary_1504124345.json'
Traceback (most recent call last):
File "/usr/local/autotest/site-packages/chromite/lib/parallel.py", line 602, in TaskRunner
task(*x, **task_kwargs)
File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 482, in wrapper
return fn(*args, **kwargs)
File "/usr/local/autotest/site_utils/gs_offloader.py", line 561, in offload
,
Oct 2 2017
The call to remove the files does not have root permissions, and fails. https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/site_utils/gs_offloader.py?rcl=894f66426521ad2e3d10c51f299e847c452bd000&l=358 This causes gs_offloader to crash in a tight loop.
,
Oct 2 2017
As a short term workaround, I ran: sudo chown -R chromeos-test.eng /usr/local/autotest/results/
,
Oct 2 2017
,
Oct 2 2017
We really need an owner for this, and I suspect it's related to how CTS test results make it onto the shard.
,
Oct 2 2017
Passing to ihf@ based on the assumption that he can find the right owner. I'm bouncing between Pri-0 and Pri-1. This will break the lab badly if not addressed really soon.
,
Oct 2 2017
"Aug 30" That timestamp seems to be odd. Is the test run by SSP? Could be an SSP bug.
,
Oct 2 2017
Oh... I missed that about the timestamp. Maybe we've had this problem for a long time. Have we been running for a month with gs_offloader crash looping? And the push to prod broken?
,
Oct 2 2017
There are some old result files on server100. Presumably it tarballs the results freshly each time. The question, is this still a problem, or was it an intermittent old problem that is fixed now but not yet cleaned up? ihf@chromeos-server100:/usr/local/autotest/results/130930876-chromeos-test/chromeos6-row1-rack22-host9$ ls -alh total 1.8M drwxr-xr-x 8 chromeos-test eng 4.0K Jul 28 02:52 . drwxr-xr-x 3 chromeos-test eng 4.0K Jul 27 17:32 .. -rw-r--r-- 1 chromeos-test eng 6 Jul 27 17:32 .autoserv_execute drwxr-xr-x 7 root root 4.0K Jul 27 17:45 cheets_CTS_N.CtsHardwareTestCases -rw-r--r-- 1 chromeos-test eng 1.6M Oct 2 15:00 cheets_CTS_N.CtsHardwareTestCases.tgz -rw-r--r-- 1 chromeos-test eng 1.9K Jul 27 21:51 .collect_crashinfo.log -rw-r--r-- 1 root root 5 Jul 27 17:33 .container_autoserv_execute -rw-r--r-- 1 root root 610 Jul 27 17:52 control -rw-r--r-- 1 root root 1.1K Jul 27 17:33 control.srv drwxr-xr-x 2 chromeos-test eng 4.0K Jul 27 17:52 debug -rw-r--r-- 1 root root 21K Jul 27 17:34 dir_summary_1501202045.json -rw-r--r-- 1 root root 21K Jul 27 17:44 dir_summary_1501202661.json drwxr-xr-x 2 chromeos-test eng 4.0K Jul 27 17:33 host_info_store drwxr-xr-x 2 chromeos-test eng 4.0K Jul 27 17:32 host_keyvals -rw-r--r-- 1 chromeos-test eng 158 Jul 27 21:50 job_failure -rw-r--r-- 1 root root 71K Jul 27 17:52 job_report.html -rw-r--r-- 1 chromeos-test eng 435 Jul 27 21:50 keyval drwxr-xr-x 3 chromeos-test eng 4.0K Jul 27 17:32 ssp_logs -rw-r--r-- 1 root root 204 Jul 27 17:52 status -rw-r--r-- 1 root root 1.8K Jul 27 17:54 status.log drwxr-xr-x 3 chromeos-test eng 4.0K Jul 27 17:52 sysinfo -rw-r--r-- 1 root root 6 Jul 27 17:52 uncollected_logs
,
Oct 2 2017
If this is only about cleaning up old errors, it's not so critical. Also, ihf demonstrated to me that it's not CTS specific.
,
Oct 2 2017
If this is only about cleaning up old errors, it's not so critical. Also, ihf demonstrated to me that it's not CTS specific.
,
Oct 2 2017
Unfortunately I found some from today on a different server ihf@chromeos-server97:/usr/local/autotest/results/146485812-chromeos-test/chromeos2-row3-rack2-host5$ ls -alh total 144K drwxr-xr-x 8 chromeos-test eng 4.0K Oct 2 15:12 . drwxr-xr-x 3 chromeos-test eng 4.0K Oct 2 15:09 .. -rw-r--r-- 1 chromeos-test eng 1.1K Oct 2 15:09 attach.497 -rw-r--r-- 1 chromeos-test eng 6 Oct 2 15:09 .autoserv_execute drwxr-xr-x 7 root root 4.0K Oct 2 15:13 cheets_CTS_N.CtsAccelerationTestCases -rw-r--r-- 1 root root 5 Oct 2 15:10 .container_autoserv_execute -rw-r--r-- 1 root root 574 Oct 2 15:12 control -rw-r--r-- 1 root root 1.1K Oct 2 15:11 control.srv drwxr-xr-x 2 chromeos-test eng 4.0K Oct 2 15:12 debug -rw-r--r-- 1 root root 3.3K Oct 2 15:11 dir_summary_1506982307.json -rw-r--r-- 1 root root 3.2K Oct 2 15:12 dir_summary_1506982340.json drwxr-xr-x 2 chromeos-test eng 4.0K Oct 2 15:10 host_info_store drwxr-xr-x 2 chromeos-test eng 4.0K Oct 2 15:09 host_keyvals -rw-r--r-- 1 root root 67K Oct 2 15:12 job_report.html -rw-r--r-- 1 chromeos-test eng 440 Oct 2 15:09 keyval drwxr-xr-x 3 chromeos-test eng 4.0K Oct 2 15:09 ssp_logs -rw-r--r-- 1 root root 215 Oct 2 15:12 status -rw-r--r-- 1 root root 696 Oct 2 15:14 status.log drwxr-xr-x 5 chromeos-test eng 4.0K Oct 2 15:12 sysinfo -rw-r--r-- 1 root root 6 Oct 2 15:13 uncollected_logs
,
Oct 2 2017
I think it affects any server test, below power_RPMTest.RPM_ON1 and network_WiFi_SuspendStress.stress_WPA2 ./146489604-chromeos-test/chromeos6-row4-rack3-host8/: total 144 -rw-r--r-- 1 chromeos-test eng 3116 Oct 2 15:15 attach.557 -rw-r--r-- 1 root root 573 Oct 2 15:18 control -rw-r--r-- 1 root root 3116 Oct 2 15:16 control.srv drwxr-xr-x 2 chromeos-test eng 4096 Oct 2 15:18 debug -rw-r--r-- 1 root root 4227 Oct 2 15:17 dir_summary_1506982636.json -rw-r--r-- 1 root root 4227 Oct 2 15:17 dir_summary_1506982650.json -rw-r--r-- 1 root root 4228 Oct 2 15:18 dir_summary_1506982704.json drwxr-xr-x 2 chromeos-test eng 4096 Oct 2 15:16 host_info_store drwxr-xr-x 2 chromeos-test eng 4096 Oct 2 15:15 host_keyvals -rw-r--r-- 1 root root 67527 Oct 2 15:18 job_report.html -rw-r--r-- 1 chromeos-test eng 369 Oct 2 15:15 keyval drwxr-xr-x 7 root root 4096 Oct 2 15:18 power_RPMTest.RPM_ON1 drwxr-xr-x 3 chromeos-test eng 4096 Oct 2 15:15 ssp_logs -rw-r--r-- 1 root root 214 Oct 2 15:18 status -rw-r--r-- 1 root root 741 Oct 2 15:18 status.log drwxr-xr-x 5 chromeos-test eng 4096 Oct 2 15:18 sysinfo -rw-r--r-- 1 root root 6 Oct 2 15:18 uncollected_logs ./146419961-chromeos-test/chromeos9-row4-rack2-host3/: total 128 -rw-r--r-- 1 chromeos-test eng 1650 Oct 2 10:23 attach.149427 -rw-r--r-- 1 root root 574 Oct 2 10:25 control -rw-r--r-- 1 root root 1650 Oct 2 10:24 control.srv drwxr-xr-x 2 chromeos-test eng 4096 Oct 2 10:25 debug -rw-r--r-- 1 root root 3542 Oct 2 10:24 dir_summary_1506965096.json -rw-r--r-- 1 root root 3422 Oct 2 10:25 dir_summary_1506965150.json drwxr-xr-x 2 chromeos-test eng 4096 Oct 2 10:24 host_info_store drwxr-xr-x 2 chromeos-test eng 4096 Oct 2 10:23 host_keyvals -rw-r--r-- 1 root root 69425 Oct 2 10:25 job_report.html -rw-r--r-- 1 chromeos-test eng 390 Oct 2 10:23 keyval drwxr-xr-x 5 root root 4096 Oct 2 10:24 network_WiFi_SuspendStress.stress_WPA2 drwxr-xr-x 3 chromeos-test eng 4096 Oct 2 10:23 ssp_logs -rw-r--r-- 1 root root 215 Oct 2 10:25 status -rw-r--r-- 1 root root 558 Oct 2 10:25 status.log drwxr-xr-x 5 chromeos-test eng 4096 Oct 2 10:25 sysinfo -rw-r--r-- 1 root root 6 Oct 2 10:25 uncollected_logs
,
Oct 2 2017
How is /usr/local/autotest/results populated? Could this be somethink like a new/wrong set of options passed to rsync?
,
Oct 2 2017
It almost seems that autoserv failed to change the result file's ownership back to chromeos-test after ssp test finishes. Can we run some test to check the permission issue?
,
Oct 2 2017
Please.
,
Oct 2 2017
"pending jobs" metric seems to have noticed this problem https://viceroy.corp.google.com/chromeos/gs_offloader?duration=8d&utc_end=1506983741.7
,
Oct 2 2017
(actually, the timeline doesn't make sense for #17, I think that is noticing different problems)
,
Oct 2 2017
Yes, the whole destruction sequence is missing (including the chown/chgrp) tail ssp_logs/debug/autoserv.DEBUG 10/02 15:16:24.288 DEBUG| utils:0212| Running 'sudo lxc-attach -P /usr/local/autotest/containers -n test_146489604_1506982506_24048 -- bash -c "/usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/146489604-chromeos-test -m chromeos6-row4-rack3-host8 -u chromeos-test -l 'hana-release/R63-9994.0.0/power_daily/Power daily tests' -s --lab True -P /usr/local/autotest/results/146489604-chromeos-test -n /usr/local/autotest/drone_tmp/attach.557 --verify_job_repo_url --use-existing-results --pidfile-label container_autoserv"'
,
Oct 2 2017
Seeing this on chromeos-server102 as well.
<type 'exceptions.OSError'>: [Errno 13] Permission denied: '142156391-chromeos-test/chromeos4-row6-rack8-host5/cheets_CTS_N.CtsAppSecurityHostTestCases/profiling'
Traceback (most recent call last):
File "/usr/local/autotest/site-packages/chromite/lib/parallel.py", line 602, in TaskRunner
task(*x, **task_kwargs)
File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 482, in wrapper
return fn(*args, **kwargs)
File "/usr/local/autotest/site_utils/gs_offloader.py", line 561, in offload
self._offload(dir_entry, dest_path, stdout_file, stderr_file)
File "/usr/local/autotest/site_utils/gs_offloader.py", line 612, in _offload
limit_file_count(dir_entry)
File "/usr/local/autotest/site_utils/gs_offloader.py", line 339, in limit_file_count
_make_into_tarball(folder)
File "/usr/local/autotest/site_utils/gs_offloader.py", line 358, in _make_into_tarball
shutil.rmtree(dirpath)
File "/usr/lib/python2.7/shutil.py", line 247, in rmtree
rmtree(fullname, ignore_errors, onerror)
File "/usr/lib/python2.7/shutil.py", line 256, in rmtree
onerror(os.rmdir, path, sys.exc_info())
File "/usr/lib/python2.7/shutil.py", line 254, in rmtree
os.rmdir(path)
OSError: [Errno 13] Permission denied: '142156391-chromeos-test/chromeos4-row6-rack8-host5/cheets_CTS_N.CtsAppSecurityHostTestCases/profiling'
,
Oct 2 2017
Oh, wait. The two jobs above were still running. Of course the files are still root while running and likely destruction sequence works fine too. So, maybe all that needs to be done here is to run another chown/chgrp before the gs_offload to be more robust to jobs that get killed in the middle (server could go down or many other things happen)? https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/autoserv?q=chown+autotest+package:%5Echromeos_public$&l=236
,
Oct 2 2017
Okay, this makes more sense. Would updating permissions for all of /usr/local/autotest/results break in-progress tests?
,
Oct 2 2017
Probably not (famous last words), but you could try with one of the tests that are still in progress and see if it ends up passing normally.
,
Oct 3 2017
By manually fixing permissions on all problematic servers, I was able to get the push to work, but I think we need the recommended fixes above.
,
Oct 3 2017
I was thinking about this some more. One idea was using root for gs_offloader.py, but this is a bad idea. But we need a permanent way to clean things up on disk. So if after running gs_offloader immediately after chown/chgrp fails with OSError we probably should move the directory out of the way for later examination (e.g. "sudo mv result_dir /usr/local/autotest/results_offload_error/")? We should see if disk usage grows over time and be able to examine it.
,
Oct 3 2017
One thing we can do is to add some metrics for such os_error, and stop gs_offloader from retrying it (moving it out of results_dir is one way to do it). The metrics data we need is: hostname (included in metrics by default) job id is_ssp (by checking if there exists ssp_logs directory)
,
Oct 9 2017
dshi suggests making gs_offloader run `sudo chown` on files
,
Oct 9 2017
Issue 773093 has been merged into this issue.
,
Oct 15 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ebcd873b64591ddb6f624f34a93d2a15c497cf2c commit ebcd873b64591ddb6f624f34a93d2a15c497cf2c Author: Dan Shi <dshi@google.com> Date: Sat Oct 14 02:43:09 2017 [autotest] Try to fix the file permission error in gs_offloader BUG= chromium:770868 TEST=unittest Change-Id: I7a7230868553db08bc209aa22cfaa7bf424ce621 Reviewed-on: https://chromium-review.googlesource.com/707816 Commit-Ready: Dan Shi <dshi@google.com> Tested-by: Dan Shi <dshi@google.com> Reviewed-by: Ilja H. Friedel <ihf@chromium.org> Reviewed-by: Don Garrett <dgarrett@chromium.org> [modify] https://crrev.com/ebcd873b64591ddb6f624f34a93d2a15c497cf2c/site_utils/gs_offloader.py
,
Oct 16 2017
Should be fixed by the next lab push. If the metrics are in place now, we probably has some more gs_offloaders that were crash looping that should stop.
,
Oct 18 2017
,
Oct 31 2017
Crashlooping observed on chromeos-server104. I'm going to blow away these results shortly. 2017-10-31 07:57:14,232 - DEBUG - Start of offload cycle - found 24 new jobs 2017-10-31 07:57:14,237 - INFO - Attempting refresh to obtain initial access_token 2017-10-31 07:57:14,278 - INFO - Refreshing access_token 2017-10-31 07:57:15,630 - DEBUG - Removing symlink hosts/chromeos1-row1-rack1-host3/983481-repair/20173110075411/chromeos1-row1-rack1-host3/before_repair/log/vmlog/vmlog.LATEST 2017-10-31 07:57:15,631 - DEBUG - Creating marker hosts/chromeos1-row1-rack1-host3/983481-repair/20173110075411/chromeos1-row1-rack1-host3/before_repair/log/vmlog/vmlog.LATEST 2017-10-31 07:57:15,632 - DEBUG - Removing symlink hosts/chromeos1-row1-rack1-host3/983481-repair/20173110075411/chromeos1-row1-rack1-host3/before_repair/log/vmlog/vmlog.PREVIOUS 2017-10-31 07:57:15,632 - DEBUG - Creating marker hosts/chromeos1-row1-rack1-host3/983481-repair/20173110075411/chromeos1-row1-rack1-host3/before_repair/log/vmlog/vmlog.PREVIOUS 2017-10-31 07:57:15,650 - DEBUG - Removing symlink hosts/chromeos1-row1-rack1-host3/983481-repair/20173110075411/chromeos1-row1-rack1-host3/before_repair/log/power_manager/powerd.PREVIOUS 2017-10-31 07:57:15,651 - DEBUG - Creating marker hosts/chromeos1-row1-rack1-host3/983481-repair/20173110075411/chromeos1-row1-rack1-host3/before_repair/log/power_manager/powerd.PREVIOUS 2017-10-31 07:57:15,651 - DEBUG - Removing symlink hosts/chromeos1-row1-rack1-host3/983481-repair/20173110075411/chromeos1-row1-rack1-host3/before_repair/log/power_manager/powerd.LATEST 2017-10-31 07:57:15,651 - DEBUG - Creating marker hosts/chromeos1-row1-rack1-host3/983481-repair/20173110075411/chromeos1-row1-rack1-host3/before_repair/log/power_manager/powerd.LATEST 2017-10-31 07:57:15,652 - DEBUG - Removing symlink hosts/chromeos1-row1-rack1-host3/983481-repair/20173110075411/chromeos1-row1-rack1-host3/before_repair/log/ui/ui.LATEST 2017-10-31 07:57:15,652 - DEBUG - Creating marker hosts/chromeos1-row1-rack1-host3/983481-repair/20173110075411/chromeos1-row1-rack1-host3/before_repair/log/ui/ui.LATEST 2017-10-31 07:57:22,169 - INFO - Trying to correct file permission of hosts/chromeos2-row8-rack1-host9/492124-repair. 2017-10-31 07:57:33,115 - INFO - Waiting for ts_mon flushing process to finish... 2017-10-31 07:58:14,484 - INFO - Finished waiting for ts_mon process.
,
Oct 31 2017
chromeos-test@chromeos-server104:/usr/local/autotest/results/152949833-chromeos-test/chromeos1-grover-host9$ ls -lh total 8.0K drwxr-xr-x 5 root root 4.0K Oct 31 10:15 network_WiFi_AttenuatedPerf.ht40_ch001 drwxr-xr-x 4 chromeos-test eng 4.0K Oct 31 10:46 sysinfo
,
Oct 31 2017
Silly idea, make /usr/local/autotest/results setgid and make chromeos-test umask 002. All new files and directories will be owned by eng group and chromeos-test can delete them.
,
Oct 31 2017
dshi landed code in gs_offloader to fix this. I'd like to know why it didn't work. That log even shows the "Trying to correct" log message, but not the "Failed to modify permission" message. Search for correct_results_folder_permission
,
Oct 31 2017
I can't find any logs indicating that gs_offloader is crashing due to file permission issue. The job id in #33 is for a job still running: http://chromeos-server104.mtv.corp.google.com/afe/#tab_id=view_job&object_id=152949833
,
Nov 6 2017
No files left over to debug. Trail is dead currently.
,
Mar 31 2018
Bulk closing old unconfirmed issues. |
|||||||||||||||
►
Sign in to add a comment |
|||||||||||||||
Comment 1 by dgarr...@chromium.org
, Oct 2 2017