
Issue 879623


Issue metadata

Status: Verified
Owner:
Closed: Oct 2
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




job_aborter on cros-full-0033 is crash looping

Reported by jrbarnette@chromium.org, Aug 31

Issue description

job_aborter on cros-full-0033.mtv is stuck in a crash loop:
    chromeos-test@cros-full-0033:/usr/local/autotest$ status job_aborter ; sleep 5 ; status job_aborter
    job_aborter start/running, process 235485
    job_aborter start/post-stop, process 236357

Initially, the error prompting the crash looked like this:
job_aborter: 2018-08-31 07:41:24,908:DEBUG:job_aborter:_main_loop:66:Tick
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 203, in <module>
    main(sys.argv[1:])
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 51, in main
    _main_loop(jobdir=args.jobdir)
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 68, in _main_loop
    _main_loop_body(metrics, jobdir)
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 75, in _main_loop_body
    lease.id: lease for lease in leasing.leases_iter(jobdir)
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 76, in <dictcomp>
    if not lease.expired()
  File "/usr/local/autotest/venv/lucifer/leasing.py", line 96, in expired
    return not _fcntl_locked(self._entry.path)
  File "/usr/local/autotest/venv/lucifer/leasing.py", line 157, in _fcntl_locked
    fd = os.open(path, os.O_WRONLY)
OSError: [Errno 13] Permission denied: '/usr/local/autotest/leases/232343460'

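That file was odd: it was owned by a different user, not writable by
chromeos-test, and its content was a Git index rather than a lease: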
    chromeos-test@cros-full-0033:~$ ls -l /usr/local/autotest/leases/232343460
    -rw-r--r-- 1 chromeos-hwtest-corp-role primarygroup 98561 Aug 30 10:05 /usr/local/autotest/leases/232343460
    chromeos-test@cros-full-0033:~$ file /usr/local/autotest/leases/232343460
    /usr/local/autotest/leases/232343460: Git index, version 2, 934 entries

I moved the file to /usr/local/autotest/tmp/232343460.  Afterward,
the failures started looking like this:
job_aborter: 2018-08-31 09:38:27,870:DEBUG:job_aborter:_main_loop:66:Tick
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 203, in <module>
    main(sys.argv[1:])
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 51, in main
    _main_loop(jobdir=args.jobdir)
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 68, in _main_loop
    _main_loop_body(metrics, jobdir)
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 75, in _main_loop_body
    lease.id: lease for lease in leasing.leases_iter(jobdir)
  File "/usr/local/autotest/venv/lucifer/cmd/job_aborter.py", line 76, in <dictcomp>
    if not lease.expired()
  File "/usr/local/autotest/venv/lucifer/leasing.py", line 96, in expired
    return not _fcntl_locked(self._entry.path)
  File "/usr/local/autotest/venv/lucifer/leasing.py", line 157, in _fcntl_locked
    fd = os.open(path, os.O_WRONLY)
OSError: [Errno 6] No such device or address: '/usr/local/autotest/leases/232335143'

So I looked at that path:
    chromeos-test@cros-full-0033:/usr/local/autotest$ ls -l /usr/local/autotest/leases/232335143
    srwxr-xr-x 2 chromeos-test eng 0 Aug 31 09:39 /usr/local/autotest/leases/232335143

This one is a socket, which looks more normal for the leases content.
I removed it anyway:
    chromeos-test@cros-full-0033:/usr/local/autotest$ rm /usr/local/autotest/leases/232335143

Now job_aborter seems to be recovering.  At least, it hasn't crashed
in the past minute or two.
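
The second failure, "[Errno 6] No such device or address", is consistent with how Linux handles open(2) on a UNIX domain socket: the call fails with ENXIO. The following is a minimal sketch that reproduces this, assuming Linux and Python 3 (the tracebacks above are from Python 2.7). The function name and the temporary path are hypothetical and are not part of lucifer.

```python
import errno
import os
import socket
import tempfile


def open_errno_for_socket():
    """Bind a UNIX socket at a temp path, then try os.open() on it.

    Returns the errno raised by os.open(), or None if the open succeeded.
    """
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, 'lease.sock')  # hypothetical path
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        # bind() creates a socket-type file (the leading 's' in ls -l).
        sock.bind(path)
        try:
            fd = os.open(path, os.O_WRONLY)
        except OSError as e:
            return e.errno
        os.close(fd)
        return None
    finally:
        sock.close()
        os.unlink(path)
        os.rmdir(tmpdir)
```

On Linux, comparing the return value against errno.ENXIO should show a match, which is the same error number (6) reported in the traceback.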

We need to do two things:
  * Explain what happened here.
  * Change job_aborter to ignore or purge problem content of this sort.
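
One way the second item could be approached is sketched below. This is an illustration under stated assumptions, not the actual lucifer code: the real body of _fcntl_locked is not shown in this report, so the lock probe here (a non-blocking fcntl.lockf) is a guess at its general shape, and the function name fcntl_locked_or_none is hypothetical. The idea is to return a third value, None, when the lock state cannot be determined, so that the caller can skip or purge the entry instead of crashing.

```python
import errno
import fcntl
import logging
import os

logger = logging.getLogger(__name__)


def fcntl_locked_or_none(path):
    """Probe whether another process holds an fcntl lock on path.

    Returns True if the file is locked by another process, False if it
    is not, and None if the file cannot be opened at all (for example
    EACCES for a file owned by another user, or ENXIO for a socket).
    """
    try:
        fd = os.open(path, os.O_WRONLY)
    except OSError as e:
        # Assumption: an unopenable entry is logged and reported as
        # "unknown" rather than allowed to kill the main loop.
        logger.warning('Cannot check lease %s: %s', path, e)
        return None
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except (IOError, OSError) as e:
            if e.errno in (errno.EACCES, errno.EAGAIN):
                return True  # someone else holds the lock
            raise
        # We got the lock, so nobody else held it.  Closing the fd
        # below releases it again.
        return False
    finally:
        os.close(fd)
```

A caller could then treat None as "neither live nor safely expired" and leave the entry alone, or move it aside for inspection. Which of those is right for job_aborter is a policy decision this sketch does not settle.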

 

Comment 2 by bugdroid1@chromium.org, Sep 1

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/02a72448cbfc21b6b3ef3a798abc7ec2c6cdd979

commit 02a72448cbfc21b6b3ef3a798abc7ec2c6cdd979
Author: Allen Li <ayatane@chromium.org>
Date: Sat Sep 01 00:00:06 2018

autotest: Catch exceptions in job_aborter lock check

BUG= chromium:879623 
TEST=None

Change-Id: I18652ab394dc2e40a60f5964abb0c66d58c43d0b
Reviewed-on: https://chromium-review.googlesource.com/1200246
Commit-Ready: Allen Li <ayatane@chromium.org>
Tested-by: Allen Li <ayatane@chromium.org>
Reviewed-by: Congbin Guo <guocb@chromium.org>

[modify] https://crrev.com/02a72448cbfc21b6b3ef3a798abc7ec2c6cdd979/venv/lucifer/leasing.py

Status: Verified (was: Started)
