New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 800657 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2018
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: Bug

Restricted
  • Only users with EditIssue permission may comment.



Sign in to add a comment

CTS tests failing for Lulu (making lulu-release red) with Unhandled IOError: [Errno 2] No such file or directory: '.../adb'

Project Member Reported by kinaba@chromium.org, Jan 10 2018

Issue description

https://uberchromegw.corp.google.com/i/chromeos/builders/lulu-release

https://stainless.corp.google.com/search?view=list&first_date=20180108&last_date=20180110&test=cheets_%5BCG%5DTS&board=%5Elulu%24&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true

All recent jobs are failing with
> Unhandled IOError: [Errno 2] No such file or directory: '/usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb'


+Ilja, something might be going wrong on the zip caching logic.
CC: ARC constables (linben@, hidehiko@)
 

Comment 1 by ihf@chromium.org, Jan 10 2018

I manually cleared out the cache directory. I think it should self-correct now on the next run.

While it is possible that this was the zip logic, the problem here is better error detection/recovery. I'll look at it tomorrow.

Comment 3 by kinaba@chromium.org, Jan 12 2018

To kill my time reading some logs around the time of the incident.

It looks that multiple jobs are running the same critical section.
While some job is downloading a file, other job thinks the cache is dirty and deleting the directory that resulting in an unpredictable state (directory exists but a file missing.)

I guess it is either due to the force break of the lock... or worse, can the lockfile module work correctly over containers?
Lockfile looks using PID as the uniquifier key, but if PIDs are namespaced they may conflict.




http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=168741996
01/09 06:41:08.026 ERROR|     tradefed_test:0284| Permanent lock failure. Trying to break lock.



http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=168824004
01/09 06:43:09.690 INFO |     tradefed_test:0269| Waiting for cache lock...
01/09 06:43:28.843 INFO |     tradefed_test:0293| Acquired cache lock after 5 attempts.
01/09 06:43:31.959 INFO |     tradefed_test:0779| Current cache size=5851404871 of /usr/local/autotest/results/shared/cache/cache.
01/09 06:43:31.959 INFO |     tradefed_test:0298| Released cache lock.
01/09 06:43:31.960 INFO |     tradefed_test:0269| Waiting for cache lock...
01/09 06:43:31.960 INFO |     tradefed_test:0293| Acquired cache lock after 1 attempts.
01/09 06:43:31.961 INFO |     tradefed_test:0814| Skipping download of gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb, reusing content of /usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b.
01/09 06:43:33.700 INFO |     tradefed_test:0298| Released cache lock.
01/09 06:43:33.700 INFO |     tradefed_test:0269| Waiting for cache lock...
01/09 06:43:33.718 INFO |     tradefed_test:0293| Acquired cache lock after 1 attempts.
01/09 06:43:33.719 INFO |     tradefed_test:0814| Skipping download of gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-static_sdk_tools/3544738/aapt, reusing content of /usr/local/autotest/results/shared/cache/cache/4bd092f015ba402ee0579894be1cb6d5.
01/09 06:43:35.823 INFO |     tradefed_test:0298| Released cache lock.
01/09 06:43:35.823 INFO |     tradefed_test:0269| Waiting for cache lock...
01/09 06:43:35.823 INFO |     tradefed_test:0293| Acquired cache lock after 1 attempts.
01/09 06:43:35.836 INFO |     tradefed_test:0834| Host chromeos4-row6-rack3-host3 not in lab. Downloading gs://chromeos-arc-images/cts/bundle/N/android-cts-7.1_r12-linux_x86-x86.zip directly to /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip.
01/09 06:43:35.837 DEBUG|             utils:0212| Running 'gsutil cp 'gs://chromeos-arc-images/cts/bundle/N/android-cts-7.1_r12-linux_x86-x86.zip' '/usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip''
01/09 06:44:34.856 INFO |     tradefed_test:0717| Begin unzip /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip
01/09 06:44:34.857 DEBUG|             utils:0212| Running 'unzip -d '/usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/tmpnkKHhA' '/usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip''
01/09 06:46:29.591 ERROR|     tradefed_test:0721| Failed unzip, cleaning up.
01/09 06:46:29.592 INFO |     tradefed_test:0298| Released cache lock.
01/09 06:46:29.592 WARNI|              test:0637| The test failed with the following exception
checkdir error:  cannot create /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/tmpnkKHhA/android-cts
                 No such file or directory
                 unable to process android-cts/testcases/CtsMediaStressTestCases.apk.



http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=168824029
01/09 06:43:52.856 INFO |     tradefed_test:0269| Waiting for cache lock...
01/09 06:45:35.086 INFO |     tradefed_test:0293| Acquired cache lock after 8 attempts.
01/09 06:45:52.161 INFO |     tradefed_test:0779| Current cache size=6927811322 of /usr/local/autotest/results/shared/cache/cache.
01/09 06:45:52.162 INFO |     tradefed_test:0781| Found dirty cache.
01/09 06:45:52.162 WARNI|     tradefed_test:0784| Cleaning download cache.
01/09 06:45:59.535 INFO |     tradefed_test:0298| Released cache lock.
01/09 06:45:59.535 INFO |     tradefed_test:0269| Waiting for cache lock...
01/09 06:45:59.536 INFO |     tradefed_test:0293| Acquired cache lock after 1 attempts.
01/09 06:45:59.542 INFO |     tradefed_test:0834| Host chromeos4-row6-rack2-host5 not in lab. Downloading gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb directly to /usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb.
01/09 06:45:59.542 DEBUG|             utils:0212| Running 'gsutil cp 'gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb' '/usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb''
01/09 06:46:17.273 INFO |     tradefed_test:0298| Released cache lock.
01/09 06:46:17.298 INFO |     tradefed_test:0269| Waiting for cache lock...
01/09 06:46:17.299 INFO |     tradefed_test:0293| Acquired cache lock after 1 attempts.
01/09 06:46:17.300 INFO |     tradefed_test:0834| Host chromeos4-row6-rack2-host5 not in lab. Downloading gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-static_sdk_tools/3544738/aapt directly to /usr/local/autotest/results/shared/cache/cache/4bd092f015ba402ee0579894be1cb6d5/aapt.
01/09 06:46:17.300 DEBUG|             utils:0212| Running 'gsutil cp 'gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-static_sdk_tools/3544738/aapt' '/usr/local/autotest/results/shared/cache/cache/4bd092f015ba402ee0579894be1cb6d5/aapt''
01/09 06:46:25.031 INFO |     tradefed_test:0298| Released cache lock.
01/09 06:46:25.032 INFO |     tradefed_test:0269| Waiting for cache lock...
01/09 06:46:25.033 INFO |     tradefed_test:0293| Acquired cache lock after 1 attempts.
01/09 06:46:25.034 INFO |     tradefed_test:0814| Skipping download of gs://chromeos-arc-images/cts/bundle/N/android-cts-7.1_r12-linux_x86-x86.zip, reusing content of /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf.
01/09 06:46:25.035 INFO |     tradefed_test:0717| Begin unzip /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip
01/09 06:46:25.035 DEBUG|             utils:0212| Running 'unzip -d '/usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/tmp3upOGt' '/usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip''
01/09 06:46:25.543 ERROR|     tradefed_test:0721| Failed unzip, cleaning up.
01/09 06:46:25.564 INFO |     tradefed_test:0298| Released cache lock.
01/09 06:46:25.565 WARNI|              test:0637| The test failed with the following exception
unzip:  cannot find or open /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip



http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=168742666
01/09 06:46:25.040 INFO |     tradefed_test:0269| Waiting for cache lock...
01/09 06:46:26.682 INFO |     tradefed_test:0293| Acquired cache lock after 6 attempts.
01/09 06:46:26.694 INFO |     tradefed_test:0779| Current cache size=292756120 of /usr/local/autotest/results/shared/cache/cache.
01/09 06:46:26.694 INFO |     tradefed_test:0781| Found dirty cache.
01/09 06:46:26.695 WARNI|     tradefed_test:0784| Cleaning download cache.
01/09 06:46:26.724 INFO |     tradefed_test:0298| Released cache lock.
01/09 06:46:26.725 INFO |     tradefed_test:0269| Waiting for cache lock...
01/09 06:46:26.725 INFO |     tradefed_test:0293| Acquired cache lock after 1 attempts.
01/09 06:46:26.726 INFO |     tradefed_test:0834| Host chromeos2-row5-rack10-host7 not in lab. Downloading gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb directly to /usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb.
01/09 06:46:26.726 DEBUG|             utils:0212| Running 'gsutil cp 'gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb' '/usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb''
01/09 06:46:38.095 INFO |     tradefed_test:0298| Released cache lock.
01/09 06:46:38.095 WARNI|              test:0637| The test failed with the following exception
IOError: [Errno 2] No such file or directory: '/usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb'



http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=168742630
01/09 06:46:27.643 INFO |     tradefed_test:0269| Waiting for cache lock...
01/09 06:46:31.662 INFO |     tradefed_test:0293| Acquired cache lock after 6 attempts.
01/09 06:46:31.663 INFO |     tradefed_test:0779| Current cache size=0 of /usr/local/autotest/results/shared/cache/cache.
01/09 06:46:31.663 INFO |     tradefed_test:0781| Found dirty cache.
01/09 06:46:31.664 WARNI|     tradefed_test:0784| Cleaning download cache.
01/09 06:46:31.678 INFO |     tradefed_test:0298| Released cache lock.
01/09 06:46:31.679 INFO |     tradefed_test:0269| Waiting for cache lock...
01/09 06:46:31.679 INFO |     tradefed_test:0293| Acquired cache lock after 1 attempts.
01/09 06:46:31.681 INFO |     tradefed_test:0834| Host chromeos2-row5-rack10-host11 not in lab. Downloading gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb directly to /usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb.
01/09 06:46:31.681 DEBUG|             utils:0212| Running 'gsutil cp 'gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb' '/usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb''
01/09 06:46:38.194 WARNI|              test:0637| The test failed with the following exception
NotLocked: /usr/local/autotest/results/shared/cache/lock is not locked



http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=168742601
01/09 06:46:20.218 INFO |     tradefed_test:0269| Waiting for cache lock...
01/09 06:46:42.070 INFO |     tradefed_test:0293| Acquired cache lock after 6 attempts.
01/09 06:46:42.070 INFO |     tradefed_test:0779| Current cache size=0 of /usr/local/autotest/results/shared/cache/cache.
01/09 06:46:42.071 INFO |     tradefed_test:0298| Released cache lock.
01/09 06:46:42.071 INFO |     tradefed_test:0269| Waiting for cache lock...
01/09 06:46:42.072 INFO |     tradefed_test:0293| Acquired cache lock after 1 attempts.
01/09 06:46:42.072 INFO |     tradefed_test:0814| Skipping download of gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb, reusing content of /usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b.
01/09 06:46:42.073 INFO |     tradefed_test:0298| Released cache lock.
01/09 06:46:42.073 WARNI|              test:0637| The test failed with the following exception
IOError: [Errno 2] No such file or directory: '/usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb'

Comment 4 by ihf@chromium.org, Jan 12 2018

While it uses PID on my dev machine
-rw-r--r--  2 ihf eng    0 Jan 11 23:42 ql.mtv.corp.google.com.MainThread-216514

It doesn't seem to from the lxc instance on chromeos-server27
-rw-r--r-- 1 root          root    0 Jan 10 07:04 test-chromeos2-row5-rack10-host3-d3095740.18542093970780390141302
-rw-r--r-- 1 root          root    0 Jan 10 23:10 test-chromeos2-row5-rack10-host9-ad634740.18612093970780390141302
-rw-r--r-- 1 root          root    0 Jan 10 08:53 test-chromeos2-row5-rack9-host11-3041b740.18572093970780390141302
-rw-r--r-- 1 root          root    0 Jan 10 23:08 test-chromeos4-row6-rack2-host19-4c9f740.18602093970780390141302
-rw-r--r-- 1 root          root    0 Jan 10 10:03 test-chromeos4-row6-rack2-host19-c4ca6740.18812093970780390141302
-rw-r--r-- 1 root          root    0 Jan 10 10:04 test-chromeos4-row6-rack3-host1-46848740.18792093970780390141302
-rw-r--r-- 1 root          root    0 Jan 10 08:07 test-chromeos4-row6-rack3-host3-e7c3740.19132093970780390141302
-rw-r--r-- 1 root          root    0 Jan 10 09:24 test-chromeos4-row6-rack3-host5-d3f4d740.18742093970780390141302

But there is something very fishy, with so many old lock attempts accumulating. 

Comment 5 Deleted

Comment 6 by ihf@chromium.org, Jan 12 2018

Labels: Restrict-AddIssueComment-EditIssue

Comment 7 by kinaba@chromium.org, Jan 12 2018

"%s%s.%s%s" % (self.hostname, self.tname, self.pid, hash(self.path)))
was the format for the recent lockfile module,
and since hash('/usr/local/autotest/results/shared/cache/lock')==2093970780390141302,
the 4 digits around 1800~1900 after the dot looks to be the PID. But anyway, it looks the host name part was enough for our case?


The lock file looks deleted only when it releases, so probably all the lock-related aborts will accumulate.
https://stainless.corp.google.com/search?view=list&first_date=20180101&last_date=20180131&test=cheets_%5BCG%5DTS&board=lulu&status=FAIL&status=ERROR&reason=lock&exclude_cts=false&exclude_not_run=true&exclude_non_release=false&exclude_au=false&exclude_acts=false&exclude_retried=false&exclude_non_production=false
Cc: phshah@chromium.org uekawa@chromium.org
CC: constables in this week.
Some logs from the runs near the failure point (I don't have a nice way to query so this is a result of manual pickup).
Again, while one job is running, another job is entering the locked section and removing the directory. (but it is outside the scope of forced lock break, which is more strange.)

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/169690088-chromeos-test/chromeos4-row6-rack3-host9/debug/
01/12 14:13:16.138 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:13:16.487 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:13:17.807 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:13:19.324 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:13:22.659 INFO |     tradefed_test:0273| Acquired cache lock after 4 attempts.
01/12 14:13:28.467 INFO |     tradefed_test:0759| Current cache size=6659480335 of /usr/local/autotest/results/shared/cache/cache.
01/12 14:13:28.467 INFO |     tradefed_test:0278| Released cache lock.
01/12 14:13:28.468 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:13:28.468 INFO |     tradefed_test:0273| Acquired cache lock after 1 attempts.
01/12 14:13:28.468 INFO |     tradefed_test:0794| Skipping download of gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb, reusing content of /usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b.
01/12 14:13:31.369 INFO |     tradefed_test:0278| Released cache lock.
01/12 14:13:31.370 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:13:31.370 INFO |     tradefed_test:0273| Acquired cache lock after 1 attempts.
01/12 14:13:31.370 INFO |     tradefed_test:0794| Skipping download of gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-static_sdk_tools/3544738/aapt, reusing content of /usr/local/autotest/results/shared/cache/cache/4bd092f015ba402ee0579894be1cb6d5.
01/12 14:13:32.815 INFO |     tradefed_test:0278| Released cache lock.
01/12 14:13:32.815 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:13:32.816 INFO |     tradefed_test:0273| Acquired cache lock after 1 attempts.
01/12 14:13:32.816 INFO |     tradefed_test:0794| Skipping download of gs://chromeos-arc-images/cts/bundle/N/android-cts-7.1_r12-linux_x86-x86.zip, reusing content of /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf.
01/12 14:13:32.816 INFO |     tradefed_test:0694| Skipping unzip /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip, reusing content of /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86
01/12 14:13:32.817 INFO |     tradefed_test:0872| Copying /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86 to instance /tmp/autotest-tradefed-install_Z2Ukbx/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86
01/12 14:15:10.864 INFO |     tradefed_test:0278| Released cache lock.

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/169400378-chromeos-test/chromeos2-row5-rack9-host7/debug/
01/12 14:12:30.302 ERROR|     tradefed_test:1219| Failed to run tradefed! Cleaning up now.
...
01/12 14:12:36.786 INFO |     tradefed_test:0269| Waiting for cache lock...
01/12 14:12:38.757 INFO |     tradefed_test:0269| Waiting for cache lock...
01/12 14:12:41.509 INFO |     tradefed_test:0269| Waiting for cache lock...
01/12 14:12:48.346 INFO |     tradefed_test:0269| Waiting for cache lock...
01/12 14:13:00.563 INFO |     tradefed_test:0269| Waiting for cache lock...
01/12 14:13:31.022 INFO |     tradefed_test:0269| Waiting for cache lock...
01/12 14:14:11.014 INFO |     tradefed_test:0269| Waiting for cache lock...
01/12 14:15:11.277 INFO |     tradefed_test:0293| Acquired cache lock after 7 attempts.
01/12 14:15:12.750 INFO |     tradefed_test:0779| Current cache size=6659480335 of /usr/local/autotest/results/shared/cache/cache.
01/12 14:15:12.751 WARNI|     tradefed_test:0784| Cleaning download cache.
01/12 14:15:24.044 INFO |     tradefed_test:0298| Released cache lock.

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/169688900-chromeos-test/chromeos2-row5-rack9-host11/debug/
01/12 12:51:19.454 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 12:51:20.860 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 12:51:24.482 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 12:51:31.119 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 12:51:44.318 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 12:52:02.065 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 12:52:22.176 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 12:54:28.982 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 12:54:59.348 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 13:01:00.168 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 13:06:24.496 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 13:30:02.870 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:15:52.891 ERROR|     tradefed_test:0264| Permanent lock failure. Trying to break lock.

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/169690143-chromeos-test/chromeos4-row6-rack3-host7/debug/
01/12 14:13:49.394 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:13:49.650 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:13:52.716 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:13:53.630 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:13:54.916 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:14:17.443 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:14:29.354 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:15:14.054 INFO |     tradefed_test:0273| Acquired cache lock after 7 attempts.
01/12 14:15:14.936 INFO |     tradefed_test:0759| Current cache size=6405676608 of /usr/local/autotest/results/shared/cache/cache.
01/12 14:15:15.063 INFO |     tradefed_test:0278| Released cache lock.
01/12 14:15:15.064 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:15:15.064 INFO |     tradefed_test:0273| Acquired cache lock after 1 attempts.
01/12 14:15:15.065 INFO |     tradefed_test:0794| Skipping download of gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb, reusing content of /usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b.
01/12 14:15:17.553 INFO |     tradefed_test:0278| Released cache lock.
01/12 14:15:17.553 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:15:17.553 INFO |     tradefed_test:0273| Acquired cache lock after 1 attempts.
01/12 14:15:17.554 INFO |     tradefed_test:0814| Host chromeos4-row6-rack3-host7 not in lab. Downloading gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-static_sdk_tools/3544738/aapt directly to /usr/local/autotest/results/shared/cache/cache/4bd092f015ba402ee0579894be1cb6d5/aapt.
01/12 14:15:17.554 DEBUG|             utils:0212| Running 'gsutil cp 'gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-static_sdk_tools/3544738/aapt' '/usr/local/autotest/results/shared/cache/cache/4bd092f015ba402ee0579894be1cb6d5/aapt''
01/12 14:15:36.618 INFO |     tradefed_test:0278| Released cache lock.
01/12 14:15:36.619 INFO |     tradefed_test:0249| Waiting for cache lock...
01/12 14:15:36.619 INFO |     tradefed_test:0273| Acquired cache lock after 1 attempts.
01/12 14:15:36.620 INFO |     tradefed_test:0814| Host chromeos4-row6-rack3-host7 not in lab. Downloading gs://chromeos-arc-images/cts/bundle/N/android-cts-7.1_r12-linux_x86-x86.zip directly to /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip.
01/12 14:15:36.620 DEBUG|             utils:0212| Running 'gsutil cp 'gs://chromeos-arc-images/cts/bundle/N/android-cts-7.1_r12-linux_x86-x86.zip' '/usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip''
01/12 14:15:59.683 INFO |     tradefed_test:0278| Released cache lock.
01/12 14:15:59.684 WARNI|              test:0637| The test failed with the following exception
gsutil cp ... failed
OSError: No such file or directory.

Comment 11 by ihf@chromium.org, Jan 17 2018

Labels: M-65
Status: Started (was: Assigned)
The Lulu problem reappeared and I cleaned it up once more. I also rebooted the server27 to clean up old processes. (gs_offloader and ssh) Generally when operating the server is a little overloaded, so I guess more likely for the unexpected to happen.
https://viceroy.corp.google.com/chromeos/machines?hostname=chromeos-server27&duration=8d&mdb_role=chrome-infra

The underlying problem is that
1) processes can (and are at too high rates) getting killed while obtaining a file lock
2) recovery from this bad state in a future job is fragile

But there must be more here, as the cache ended up in a semi-broken state again.


Comment 12 by ihf@chromium.org, Jan 17 2018

To answer comment 3, I think the locks in principle work across containers (based on my observation), yet your logs indicate to some degree the opposite. I suspect cleanup. I mean, when everything goes well we write 2 directories once a month, and maybe delete everything before that. I think I got too aggressive with the cleanup (bad state) and now the cache thrashes too often. The last corruption was probably cause by the 7.1_R13 rev yesterday, but lock cleanups could do the same. Still learning.
Cc: dhadd...@chromium.org sdantul...@chromium.org mkarkada@chromium.org
Labels: M-64
M64 GTS testing is blocked with this issue.

https://sponge.corp.google.com/target?id=11307b4a-6a93-4c35-80d7-b13d9a8ee308&target=lulu-release%2FR64-10176.51.0%2Farc-gts%2Fcheets_GTS.5.1_r2.GtsTvBugReportTestCases&searchFor=GTS+R64-10176.51.0+lulu
Labels: ReleaseBlock-Stable

Comment 16 by ihf@chromium.org, Jan 19 2018

Both cs26 and cs27 are currently in a very sad state with super high load and leaked process counts, see  issue 803566 .

Comment 17 by ihf@chromium.org, Jan 19 2018

Ok, now I understand. We have an unrelated process leak. At some point these little quarter servers run out of memory (or get so slow that nothing progresses, same idea). Kernel selects a victim to free up memory. Kills java which was waiting for lock or worse doing file operations.

That is the world we live in. We absolutely cannot rely (much) on the locks.
Project Member

Comment 18 by bugdroid1@chromium.org, Jan 19 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a62cc8f967c51de3c373601e0819eb871db152c0

commit a62cc8f967c51de3c373601e0819eb871db152c0
Author: Kazuhiro Inaba <kinaba@chromium.org>
Date: Fri Jan 19 07:00:06 2018

cheets_CTS/GTS: Mitigation for a broken download cache.

This is a tentative workaround until we find a fundamental way to avoid
breaking the cache. When a cache entry with empty content (which should
happen normally) is found, the file is not considered as cached and
re-downloaded.

BUG= chromium:800657 
TEST=run cheets_CTS_N after manually deleting the cached adb binary.

Change-Id: I6a53c3f5f875bc5ebf5f68d195c68ff5509c71a9
Reviewed-on: https://chromium-review.googlesource.com/875525
Tested-by: Kazuhiro Inaba <kinaba@chromium.org>
Trybot-Ready: Kazuhiro Inaba <kinaba@chromium.org>
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>

[modify] https://crrev.com/a62cc8f967c51de3c373601e0819eb871db152c0/server/cros/tradefed_test.py

To tree/branch watchers:
#18 is expected to recover the broken shard to a working state, but it does not fix the root cause. (= some runs will still fail on heavy-loaded machines, but if we schedule a manual retry, it has a chance to pass.)



Re #15 (the case of Setzer): as far as I examined, the failure started from two `gsutil cp remote/adb local/adb`
command run in parallel (for some reason, possibly due to the forced break_lock(), but I'm not sure because I could not find the corresponding job that broke the lock.)

A job that finished downloading first will succeed, but the second one fails, and clears out the downloaded adb file.
Hence the empty cache directory remains. What's bad is that the first successful runs commits validate_download_cache().
As a results, even the second job aborted in the middle for invalidate- and validate- cache, the final state is marked as valid.
What we can learn from this is, it's not only the clean-cache task that broke the cache. Other parts need some care as well.
Random thoughts dump (before going into the weekend):

- If possible I think it will be preferable to switch to a better locking mechanism that won't leaved locked even when the process aborted (like using flock()? not sure.)

- If it is difficult, we have to live in the world that aborted job may leave a lock forever, and hence we have to force break the lock sometimes, and hence we have to live with multiple jobs entering the critical section at the same time.

One way to work around the situation is to make the invalidate- and validate-download-cache pair to be the inner lock. That is, instead of the current invalidate-cache = (if not exist: mkdir)
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/master/server/cros/tradefed_test.py#747,
let it invalidate-cache = (mkdir, if it fails delete all the cache somehow quasi-atomically, by renaming and then rmtree) for the initialization of the test and invalidate-cache = (mkdir, if it fails then the test immediately fails) for the later steps.

Then each single unfortunate job may fail but the errors won't persist longer.
> I think I got too aggressive with the cleanup (bad state) and now the cache thrashes too often.

I also agree on this. ./cts_tradefed command failure always cleans up the cache now, but according to the recent test failure logs, most of the case this code path is triggered by test timeout. (For instance, currently Elm/Hana has a failing test that takes very long to fail and thus triggers timeout. Hence I guess the cache is wiped out everyday.
Project Member

Comment 22 by bugdroid1@chromium.org, Jan 20 2018

Labels: merge-merged-release-R64-10176.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ecbcafe4c5e36aa4d01b483919b3f3a750b8766b

commit ecbcafe4c5e36aa4d01b483919b3f3a750b8766b
Author: Kazuhiro Inaba <kinaba@chromium.org>
Date: Sat Jan 20 00:16:37 2018

cheets_CTS/GTS: Mitigation for a broken download cache.

This is a tentative workaround until we find a fundamental way to avoid
breaking the cache. When a cache entry with empty content (which should
happen normally) is found, the file is not considered as cached and
re-downloaded.

BUG= chromium:800657 
TEST=run cheets_CTS_N after manually deleting the cached adb binary.

Change-Id: I6a53c3f5f875bc5ebf5f68d195c68ff5509c71a9
Reviewed-on: https://chromium-review.googlesource.com/875525
Tested-by: Kazuhiro Inaba <kinaba@chromium.org>
Trybot-Ready: Kazuhiro Inaba <kinaba@chromium.org>
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
(cherry picked from commit a62cc8f967c51de3c373601e0819eb871db152c0)
Reviewed-on: https://chromium-review.googlesource.com/877359
Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org>

[modify] https://crrev.com/ecbcafe4c5e36aa4d01b483919b3f3a750b8766b/server/cros/tradefed_test.py

Labels: -ReleaseBlock-Stable -M-64 -merge-merged-release-R64-10176.B
removing RBS. at least the bots will self-recover from missing adb now.
Labels: -M-65 M-64 ReleaseBlock-Stable
ummmmmm, "Error: Could not find any tests in module." error, which shares the same cause as this bug but #18 cannot recover is coming up:

https://stainless.corp.google.com/search?view=list&first_date=20180108&last_date=20180122&test=cheets_%5BCG%5DTS&board=setzer&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true
(unzipped cts-tradefed is partially deleted and hence no apks found.)

Comment 25 by ihf@chromium.org, Jan 22 2018

Cc: akes...@chromium.org
Aviv and I will move lulu/setzer to new servers this week (unrelated reasons). It may be the best shot at resolving this issue short term.

Comment 26 by ihf@chromium.org, Jan 23 2018

I forked the FileLock sources and I am adjusting them for our needs. Don't see another way.

Comment 27 by ihf@chromium.org, Jan 23 2018

I think I found a bug in the FileLock implementation. It would explain the extra cascading failures. Basically

    def release(self):
        if not self.is_locked():

should be 
    def release(self):
        if not self.i_am_locking():

SQLiteFileLock gets this right (but I don't want to use it, hence another argument to fork).
Project Member

Comment 28 by bugdroid1@chromium.org, Jan 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/711fcdbf9ff510671ecad159bfa1ce2dcdad1d27

commit 711fcdbf9ff510671ecad159bfa1ce2dcdad1d27
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Thu Jan 25 00:54:26 2018

Unmodified import of lockfile.py version 0.8.

pip show lockfile
---
Name: lockfile
Version: 0.8
Location: /usr/lib/python2.7/dist-packages

1) I checked upstream are newer versions, but the bug I am after is not fixed.
2) Notice the chroot uses
portage-stable/dev-python/lockfile/lockfile-0.9.1.ebuild
3) I am not sure what the base image uses for SSP.

With this we get all places to the same implementation.

BUG= chromium:800657 
TEST=None.

Change-Id: I7f7f107b1377cee0c838710cfbc614218e269f5b
Reviewed-on: https://chromium-review.googlesource.com/880484
Tested-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org>

[add] https://crrev.com/711fcdbf9ff510671ecad159bfa1ce2dcdad1d27/server/cros/lockfile.py

Project Member

Comment 29 by bugdroid1@chromium.org, Jan 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/397eea2ea75a3dbf5486e5e72bc8f0c339d6ec8c

commit 397eea2ea75a3dbf5486e5e72bc8f0c339d6ec8c
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Thu Jan 25 00:55:19 2018

lockfile.py: clean up for tradefed_test needs.

1) Only keep link based lockfile code.
2) Fix release(). Using is_locked() is wrong when locks are
   broken. Instead use i_am_locking().
3) Modify code to add a way to query age of existing lock.
4) Minor cleanup to functionality.
   - new style class
   - less unlock error cases

BUG= chromium:800657 
TEST=pylint

Change-Id: I883cdf276e165fdd9ffb3d127d434ac35e5f503f
Reviewed-on: https://chromium-review.googlesource.com/880485
Tested-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org>

[modify] https://crrev.com/397eea2ea75a3dbf5486e5e72bc8f0c339d6ec8c/server/cros/lockfile.py

Project Member

Comment 30 by bugdroid1@chromium.org, Jan 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3dfa037a616567643b4dfe49bf9bd530e8d62c24

commit 3dfa037a616567643b4dfe49bf9bd530e8d62c24
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Thu Jan 25 08:15:55 2018

tradefed_test: switch to use forked lockfile.

The forked lockfile.py fixes a bug. It also adds functionality which is
not used yet.

BUG= chromium:800657 
TEST=test_that <ip> cheets_CTS_N.7.1_r13.arm.CtsDramTestCases

Change-Id: I02665bce52bc06f44b931ae30e74e0c8743f9344
Reviewed-on: https://chromium-review.googlesource.com/882134
Commit-Ready: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org>

[modify] https://crrev.com/3dfa037a616567643b4dfe49bf9bd530e8d62c24/server/cros/tradefed_test.py

Changes get merged?   Need to verify and/or remove RBS for M64

Comment 32 by ihf@chromium.org, Jan 30 2018

Labels: -M-64 M-66
Project Member

Comment 33 by bugdroid1@chromium.org, Feb 4 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a4327fc988683ef89a9987febed7784e3da020aa

commit a4327fc988683ef89a9987febed7784e3da020aa
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Sun Feb 04 22:26:45 2018

tradefed_test: split code into multiple files.

No functional changes. Just moving code around to get shorter files.

BUG= chromium:800657 
TEST=Ran a CTS/GTS test locally.

Change-Id: Ib9a9e9a770cc7a5cb5a3fefe0c6a5dbb061020c4
Reviewed-on: https://chromium-review.googlesource.com/894994
Commit-Ready: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>

[add] https://crrev.com/a4327fc988683ef89a9987febed7784e3da020aa/server/cros/tradefed_chromelogin.py
[add] https://crrev.com/a4327fc988683ef89a9987febed7784e3da020aa/server/cros/tradefed_utils.py
[add] https://crrev.com/a4327fc988683ef89a9987febed7784e3da020aa/server/cros/tradefed_constants.py
[modify] https://crrev.com/a4327fc988683ef89a9987febed7784e3da020aa/server/cros/tradefed_test.py

Comment 34 by ihf@chromium.org, Apr 5 2018

Status: Fixed (was: Started)
I think the change in #30 fixed most problems. I have still a half refactored change to improve things, but it has been rotting and needs serious rebase. The urgent part is fixed.
Labels: Merge-TBD
[Auto-generated comment by a script] We noticed that this issue is targeted for M-66; it appears the fix may have landed after branch point, meaning a merge might be required. Please confirm if a merge is required here - if so add Merge-Request-66 label, otherwise remove Merge-TBD label. Thanks.
Project Member

Comment 36 by bugdroid1@chromium.org, Apr 25 2018

Labels: merge-merged-release-R65-10323.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/864fe282023073a7ce8a2d170d15fbbbee00b685

commit 864fe282023073a7ce8a2d170d15fbbbee00b685
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Wed Apr 25 02:51:31 2018

Unmodified import of lockfile.py version 0.8.

pip show lockfile
---
Name: lockfile
Version: 0.8
Location: /usr/lib/python2.7/dist-packages

1) I checked upstream are newer versions, but the bug I am after is not fixed.
2) Notice the chroot uses
portage-stable/dev-python/lockfile/lockfile-0.9.1.ebuild
3) I am not sure what the base image uses for SSP.

With this we get all places to the same implementation.

BUG= chromium:800657 
TEST=None.

Change-Id: I7f7f107b1377cee0c838710cfbc614218e269f5b
Reviewed-on: https://chromium-review.googlesource.com/880484
Tested-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org>
(cherry picked from commit 711fcdbf9ff510671ecad159bfa1ce2dcdad1d27)
Reviewed-on: https://chromium-review.googlesource.com/1027193
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>

[add] https://crrev.com/864fe282023073a7ce8a2d170d15fbbbee00b685/server/cros/lockfile.py

Project Member

Comment 37 by bugdroid1@chromium.org, Apr 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ee0df91ccf5989a837d68d40b1edca3ad4d17ee2

commit ee0df91ccf5989a837d68d40b1edca3ad4d17ee2
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Wed Apr 25 02:52:11 2018

tradefed_test: switch to use forked lockfile.

The forked lockfile.py fixes a bug. It also adds functionality which is
not used yet.

BUG= chromium:800657 
TEST=test_that <ip> cheets_CTS_N.7.1_r13.arm.CtsDramTestCases

Change-Id: I02665bce52bc06f44b931ae30e74e0c8743f9344
Reviewed-on: https://chromium-review.googlesource.com/882134
Commit-Ready: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org>
(cherry picked from commit 3dfa037a616567643b4dfe49bf9bd530e8d62c24)
Reviewed-on: https://chromium-review.googlesource.com/1027194
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>

[modify] https://crrev.com/ee0df91ccf5989a837d68d40b1edca3ad4d17ee2/server/cros/tradefed_test.py

Project Member

Comment 38 by bugdroid1@chromium.org, Apr 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/329b643cc236e68e1a59d8a46a852ee66235ee27

commit 329b643cc236e68e1a59d8a46a852ee66235ee27
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Wed Apr 25 02:52:46 2018

tradefed_test: split code into multiple files.

No functional changes. Just moving code around to get shorter files.

BUG= chromium:800657 
TEST=Ran a CTS/GTS test locally.

Change-Id: Ib9a9e9a770cc7a5cb5a3fefe0c6a5dbb061020c4
Reviewed-on: https://chromium-review.googlesource.com/894994
Commit-Ready: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
(cherry picked from commit a4327fc988683ef89a9987febed7784e3da020aa)
Reviewed-on: https://chromium-review.googlesource.com/1027195

[add] https://crrev.com/329b643cc236e68e1a59d8a46a852ee66235ee27/server/cros/tradefed_chromelogin.py
[add] https://crrev.com/329b643cc236e68e1a59d8a46a852ee66235ee27/server/cros/tradefed_utils.py
[add] https://crrev.com/329b643cc236e68e1a59d8a46a852ee66235ee27/server/cros/tradefed_constants.py
[modify] https://crrev.com/329b643cc236e68e1a59d8a46a852ee66235ee27/server/cros/tradefed_test.py

Project Member

Comment 39 by sheriffbot@chromium.org, Jun 6 2018

Labels: -Merge-TBD

Sign in to add a comment