job_aborter on cros-full-0033 is crash looping
Reported by jrbarnette@chromium.org, Aug 31
Issue description
job_aborter on cros-full-0033.mtv is stuck in a crash loop:
chromeos-test@cros-full-0033:/usr/local/autotest$ status job_aborter ; sleep 5 ; status job_aborter
job_aborter start/running, process 235485
job_aborter start/post-stop, process 236357
Initially, the error prompting the crash looked like this:
job_aborter: 2018-08-31 07:41:24,908:DEBUG:job_aborter:_main_loop:66:Tick
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 203, in <module>
    main(sys.argv[1:])
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 51, in main
    _main_loop(jobdir=args.jobdir)
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 68, in _main_loop
    _main_loop_body(metrics, jobdir)
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 75, in _main_loop_body
    lease.id: lease for lease in leasing.leases_iter(jobdir)
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 76, in <dictcomp>
    if not lease.expired()
  File "/usr/local/autotest/venv/lucifer/leasing.py", line 96, in expired
    return not _fcntl_locked(self._entry.path)
  File "/usr/local/autotest/venv/lucifer/leasing.py", line 157, in _fcntl_locked
    fd = os.open(path, os.O_WRONLY)
OSError: [Errno 13] Permission denied: '/usr/local/autotest/leases/232343460'
That file was a bit odd:
chromeos-test@cros-full-0033:~$ ls -l /usr/local/autotest/leases/232343460
-rw-r--r-- 1 chromeos-hwtest-corp-role primarygroup 98561 Aug 30 10:05 /usr/local/autotest/leases/232343460
chromeos-test@cros-full-0033:~$ file /usr/local/autotest/leases/232343460
/usr/local/autotest/leases/232343460: Git index, version 2, 934 entries
I moved the file to /usr/local/autotest/tmp/232343460. Afterward,
the failures started looking like this:
job_aborter: 2018-08-31 09:38:27,870:DEBUG:job_aborter:_main_loop:66:Tick
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 203, in <module>
    main(sys.argv[1:])
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 51, in main
    _main_loop(jobdir=args.jobdir)
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 68, in _main_loop
    _main_loop_body(metrics, jobdir)
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 75, in _main_loop_body
    lease.id: lease for lease in leasing.leases_iter(jobdir)
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 76, in <dictcomp>
    if not lease.expired()
  File "/usr/local/autotest/venv/lucifer/leasing.py", line 96, in expired
    return not _fcntl_locked(self._entry.path)
  File "/usr/local/autotest/venv/lucifer/leasing.py", line 157, in _fcntl_locked
    fd = os.open(path, os.O_WRONLY)
OSError: [Errno 6] No such device or address: '/usr/local/autotest/leases/232335143'
So...
chromeos-test@cros-full-0033:/usr/local/autotest$ ls -l /usr/local/autotest/leases/232335143
srwxr-xr-x 2 chromeos-test eng 0 Aug 31 09:39 /usr/local/autotest/leases/232335143
That looks more like normal leases content. So I removed it:
chromeos-test@cros-full-0033:/usr/local/autotest$ rm /usr/local/autotest/leases/232335143
Now, maybe, it's getting better. Certainly, it hasn't failed for
the past minute or two.
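For anyone repeating the diagnosis, the manual `ls -l` checks above could be scripted roughly as follows. This is only a sketch: `describe_entries` is a hypothetical helper, not part of lucifer, and it assumes nothing about which entry types the leases directory is supposed to contain. It just reports the type and owner of each entry so that oddities (a foreign-owned regular file, a socket) stand out.

```python
import os
import stat


def describe_entries(dirpath):
    """Return a list of (name, kind, uid) for each entry in dirpath.

    Hypothetical diagnostic helper; uses lstat so symlinks are
    reported as themselves rather than followed.
    """
    results = []
    for name in sorted(os.listdir(dirpath)):
        st = os.lstat(os.path.join(dirpath, name))
        if stat.S_ISSOCK(st.st_mode):
            kind = 'socket'
        elif stat.S_ISREG(st.st_mode):
            kind = 'regular file'
        elif stat.S_ISLNK(st.st_mode):
            kind = 'symlink'
        else:
            kind = 'other'
        results.append((name, kind, st.st_uid))
    return results


if __name__ == '__main__':
    # Path taken from the report above.
    for name, kind, uid in describe_entries('/usr/local/autotest/leases'):
        note = '' if uid == os.getuid() else '  (owned by a different uid)'
        print('%s: %s, uid %d%s' % (name, kind, uid, note))
```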
We need to do two things:
* Explain what happened here.
* Change job_aborter to ignore or purge problem content of this sort.
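For the second item, one possible shape of a more tolerant lock check is sketched below. It is an assumption-laden illustration, not the actual lucifer code: `lease_is_locked` is a hypothetical stand-in for `_fcntl_locked`, the use of `fcntl.lockf` with a non-blocking exclusive lock is a guess at how the check works (only the `os.open(path, os.O_WRONLY)` call appears in the tracebacks), and treating an unopenable entry as "not locked" is just one policy choice.

```python
import errno
import fcntl
import os


def lease_is_locked(path):
    """Return True if another process appears to hold a lock on path.

    Hypothetical replacement for the lock check: an OSError from
    open() is reported and treated as "not locked" rather than being
    allowed to crash the caller's main loop.
    """
    try:
        fd = os.open(path, os.O_WRONLY)
    except OSError as e:
        # Seen in this report: EACCES (file owned by another user)
        # and ENXIO (path is a Unix socket). ENOENT is also possible
        # if the entry vanishes between listing and opening.
        code = errno.errorcode.get(e.errno, str(e.errno))
        print('ignoring lease entry %s: %s' % (path, code))
        return False
    try:
        # Try to take the lock without blocking; failure means some
        # other process already holds it.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError):
        return True
    else:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

With this policy, a problem entry looks like an expired lease to the caller. Whether such entries should then be purged, skipped silently, or reported through metrics is a separate decision that this sketch does not settle.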
Sep 1
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/02a72448cbfc21b6b3ef3a798abc7ec2c6cdd979

commit 02a72448cbfc21b6b3ef3a798abc7ec2c6cdd979
Author: Allen Li <ayatane@chromium.org>
Date: Sat Sep 01 00:00:06 2018

autotest: Catch exceptions in job_aborter lock check

BUG=chromium:879623
TEST=None

Change-Id: I18652ab394dc2e40a60f5964abb0c66d58c43d0b
Reviewed-on: https://chromium-review.googlesource.com/1200246
Commit-Ready: Allen Li <ayatane@chromium.org>
Tested-by: Allen Li <ayatane@chromium.org>
Reviewed-by: Congbin Guo <guocb@chromium.org>

[modify] https://crrev.com/02a72448cbfc21b6b3ef3a798abc7ec2c6cdd979/venv/lucifer/leasing.py
,
Oct 2
Comment 1 by ayatane@chromium.org, Aug 31