CTS tests failing for Lulu (making lulu-release red) with Unhandled IOError: [Errno 2] No such file or directory: '.../adb' |
||||||||||||||||
Issue descriptionhttps://uberchromegw.corp.google.com/i/chromeos/builders/lulu-release https://stainless.corp.google.com/search?view=list&first_date=20180108&last_date=20180110&test=cheets_%5BCG%5DTS&board=%5Elulu%24&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true All recent jobs are failing with > Unhandled IOError: [Errno 2] No such file or directory: '/usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb' +Ilja, something might be going wrong on the zip caching logic. CC: ARC constables (linben@, hidehiko@)
,
Jan 11 2018
,
Jan 12 2018
To kill my time reading some logs around the time of the incident. It looks that multiple jobs are running the same critical section. While some job is downloading a file, other job thinks the cache is dirty and deleting the directory that resulting in an unpredictable state (directory exists but a file missing.) I guess it is either due to the force break of the lock... or worse, can the lockfile module work correctly over containers? Lockfile looks using PID as the uniquifier key, but if PIDs are namespaced they may conflict. http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=168741996 01/09 06:41:08.026 ERROR| tradefed_test:0284| Permanent lock failure. Trying to break lock. http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=168824004 01/09 06:43:09.690 INFO | tradefed_test:0269| Waiting for cache lock... 01/09 06:43:28.843 INFO | tradefed_test:0293| Acquired cache lock after 5 attempts. 01/09 06:43:31.959 INFO | tradefed_test:0779| Current cache size=5851404871 of /usr/local/autotest/results/shared/cache/cache. 01/09 06:43:31.959 INFO | tradefed_test:0298| Released cache lock. 01/09 06:43:31.960 INFO | tradefed_test:0269| Waiting for cache lock... 01/09 06:43:31.960 INFO | tradefed_test:0293| Acquired cache lock after 1 attempts. 01/09 06:43:31.961 INFO | tradefed_test:0814| Skipping download of gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb, reusing content of /usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b. 01/09 06:43:33.700 INFO | tradefed_test:0298| Released cache lock. 01/09 06:43:33.700 INFO | tradefed_test:0269| Waiting for cache lock... 01/09 06:43:33.718 INFO | tradefed_test:0293| Acquired cache lock after 1 attempts. 01/09 06:43:33.719 INFO | tradefed_test:0814| Skipping download of gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-static_sdk_tools/3544738/aapt, reusing content of /usr/local/autotest/results/shared/cache/cache/4bd092f015ba402ee0579894be1cb6d5. 01/09 06:43:35.823 INFO | tradefed_test:0298| Released cache lock. 01/09 06:43:35.823 INFO | tradefed_test:0269| Waiting for cache lock... 01/09 06:43:35.823 INFO | tradefed_test:0293| Acquired cache lock after 1 attempts. 01/09 06:43:35.836 INFO | tradefed_test:0834| Host chromeos4-row6-rack3-host3 not in lab. Downloading gs://chromeos-arc-images/cts/bundle/N/android-cts-7.1_r12-linux_x86-x86.zip directly to /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip. 01/09 06:43:35.837 DEBUG| utils:0212| Running 'gsutil cp 'gs://chromeos-arc-images/cts/bundle/N/android-cts-7.1_r12-linux_x86-x86.zip' '/usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip'' 01/09 06:44:34.856 INFO | tradefed_test:0717| Begin unzip /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip 01/09 06:44:34.857 DEBUG| utils:0212| Running 'unzip -d '/usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/tmpnkKHhA' '/usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip'' 01/09 06:46:29.591 ERROR| tradefed_test:0721| Failed unzip, cleaning up. 01/09 06:46:29.592 INFO | tradefed_test:0298| Released cache lock. 01/09 06:46:29.592 WARNI| test:0637| The test failed with the following exception checkdir error: cannot create /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/tmpnkKHhA/android-cts No such file or directory unable to process android-cts/testcases/CtsMediaStressTestCases.apk. http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=168824029 01/09 06:43:52.856 INFO | tradefed_test:0269| Waiting for cache lock... 01/09 06:45:35.086 INFO | tradefed_test:0293| Acquired cache lock after 8 attempts. 01/09 06:45:52.161 INFO | tradefed_test:0779| Current cache size=6927811322 of /usr/local/autotest/results/shared/cache/cache. 01/09 06:45:52.162 INFO | tradefed_test:0781| Found dirty cache. 01/09 06:45:52.162 WARNI| tradefed_test:0784| Cleaning download cache. 01/09 06:45:59.535 INFO | tradefed_test:0298| Released cache lock. 01/09 06:45:59.535 INFO | tradefed_test:0269| Waiting for cache lock... 01/09 06:45:59.536 INFO | tradefed_test:0293| Acquired cache lock after 1 attempts. 01/09 06:45:59.542 INFO | tradefed_test:0834| Host chromeos4-row6-rack2-host5 not in lab. Downloading gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb directly to /usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb. 01/09 06:45:59.542 DEBUG| utils:0212| Running 'gsutil cp 'gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb' '/usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb'' 01/09 06:46:17.273 INFO | tradefed_test:0298| Released cache lock. 01/09 06:46:17.298 INFO | tradefed_test:0269| Waiting for cache lock... 01/09 06:46:17.299 INFO | tradefed_test:0293| Acquired cache lock after 1 attempts. 01/09 06:46:17.300 INFO | tradefed_test:0834| Host chromeos4-row6-rack2-host5 not in lab. Downloading gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-static_sdk_tools/3544738/aapt directly to /usr/local/autotest/results/shared/cache/cache/4bd092f015ba402ee0579894be1cb6d5/aapt. 01/09 06:46:17.300 DEBUG| utils:0212| Running 'gsutil cp 'gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-static_sdk_tools/3544738/aapt' '/usr/local/autotest/results/shared/cache/cache/4bd092f015ba402ee0579894be1cb6d5/aapt'' 01/09 06:46:25.031 INFO | tradefed_test:0298| Released cache lock. 01/09 06:46:25.032 INFO | tradefed_test:0269| Waiting for cache lock... 01/09 06:46:25.033 INFO | tradefed_test:0293| Acquired cache lock after 1 attempts. 01/09 06:46:25.034 INFO | tradefed_test:0814| Skipping download of gs://chromeos-arc-images/cts/bundle/N/android-cts-7.1_r12-linux_x86-x86.zip, reusing content of /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf. 01/09 06:46:25.035 INFO | tradefed_test:0717| Begin unzip /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip 01/09 06:46:25.035 DEBUG| utils:0212| Running 'unzip -d '/usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/tmp3upOGt' '/usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip'' 01/09 06:46:25.543 ERROR| tradefed_test:0721| Failed unzip, cleaning up. 01/09 06:46:25.564 INFO | tradefed_test:0298| Released cache lock. 01/09 06:46:25.565 WARNI| test:0637| The test failed with the following exception unzip: cannot find or open /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=168742666 01/09 06:46:25.040 INFO | tradefed_test:0269| Waiting for cache lock... 01/09 06:46:26.682 INFO | tradefed_test:0293| Acquired cache lock after 6 attempts. 01/09 06:46:26.694 INFO | tradefed_test:0779| Current cache size=292756120 of /usr/local/autotest/results/shared/cache/cache. 01/09 06:46:26.694 INFO | tradefed_test:0781| Found dirty cache. 01/09 06:46:26.695 WARNI| tradefed_test:0784| Cleaning download cache. 01/09 06:46:26.724 INFO | tradefed_test:0298| Released cache lock. 01/09 06:46:26.725 INFO | tradefed_test:0269| Waiting for cache lock... 01/09 06:46:26.725 INFO | tradefed_test:0293| Acquired cache lock after 1 attempts. 01/09 06:46:26.726 INFO | tradefed_test:0834| Host chromeos2-row5-rack10-host7 not in lab. Downloading gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb directly to /usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb. 01/09 06:46:26.726 DEBUG| utils:0212| Running 'gsutil cp 'gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb' '/usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb'' 01/09 06:46:38.095 INFO | tradefed_test:0298| Released cache lock. 01/09 06:46:38.095 WARNI| test:0637| The test failed with the following exception IOError: [Errno 2] No such file or directory: '/usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb' http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=168742630 01/09 06:46:27.643 INFO | tradefed_test:0269| Waiting for cache lock... 01/09 06:46:31.662 INFO | tradefed_test:0293| Acquired cache lock after 6 attempts. 01/09 06:46:31.663 INFO | tradefed_test:0779| Current cache size=0 of /usr/local/autotest/results/shared/cache/cache. 01/09 06:46:31.663 INFO | tradefed_test:0781| Found dirty cache. 01/09 06:46:31.664 WARNI| tradefed_test:0784| Cleaning download cache. 01/09 06:46:31.678 INFO | tradefed_test:0298| Released cache lock. 01/09 06:46:31.679 INFO | tradefed_test:0269| Waiting for cache lock... 01/09 06:46:31.679 INFO | tradefed_test:0293| Acquired cache lock after 1 attempts. 01/09 06:46:31.681 INFO | tradefed_test:0834| Host chromeos2-row5-rack10-host11 not in lab. Downloading gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb directly to /usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb. 01/09 06:46:31.681 DEBUG| utils:0212| Running 'gsutil cp 'gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb' '/usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb'' 01/09 06:46:38.194 WARNI| test:0637| The test failed with the following exception NotLocked: /usr/local/autotest/results/shared/cache/lock is not locked http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=168742601 01/09 06:46:20.218 INFO | tradefed_test:0269| Waiting for cache lock... 01/09 06:46:42.070 INFO | tradefed_test:0293| Acquired cache lock after 6 attempts. 01/09 06:46:42.070 INFO | tradefed_test:0779| Current cache size=0 of /usr/local/autotest/results/shared/cache/cache. 01/09 06:46:42.071 INFO | tradefed_test:0298| Released cache lock. 01/09 06:46:42.071 INFO | tradefed_test:0269| Waiting for cache lock... 01/09 06:46:42.072 INFO | tradefed_test:0293| Acquired cache lock after 1 attempts. 01/09 06:46:42.072 INFO | tradefed_test:0814| Skipping download of gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb, reusing content of /usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b. 01/09 06:46:42.073 INFO | tradefed_test:0298| Released cache lock. 01/09 06:46:42.073 WARNI| test:0637| The test failed with the following exception IOError: [Errno 2] No such file or directory: '/usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b/adb'
,
Jan 12 2018
While it uses PID on my dev machine -rw-r--r-- 2 ihf eng 0 Jan 11 23:42 ql.mtv.corp.google.com.MainThread-216514 It doesn't seem to from the lxc instance on chromeos-server27 -rw-r--r-- 1 root root 0 Jan 10 07:04 test-chromeos2-row5-rack10-host3-d3095740.18542093970780390141302 -rw-r--r-- 1 root root 0 Jan 10 23:10 test-chromeos2-row5-rack10-host9-ad634740.18612093970780390141302 -rw-r--r-- 1 root root 0 Jan 10 08:53 test-chromeos2-row5-rack9-host11-3041b740.18572093970780390141302 -rw-r--r-- 1 root root 0 Jan 10 23:08 test-chromeos4-row6-rack2-host19-4c9f740.18602093970780390141302 -rw-r--r-- 1 root root 0 Jan 10 10:03 test-chromeos4-row6-rack2-host19-c4ca6740.18812093970780390141302 -rw-r--r-- 1 root root 0 Jan 10 10:04 test-chromeos4-row6-rack3-host1-46848740.18792093970780390141302 -rw-r--r-- 1 root root 0 Jan 10 08:07 test-chromeos4-row6-rack3-host3-e7c3740.19132093970780390141302 -rw-r--r-- 1 root root 0 Jan 10 09:24 test-chromeos4-row6-rack3-host5-d3f4d740.18742093970780390141302 But there is something very fishy, with so many old lock attempts accumulating.
,
Jan 12 2018
,
Jan 12 2018
"%s%s.%s%s" % (self.hostname, self.tname, self.pid, hash(self.path)))
was the format for the recent lockfile module,
and since hash('/usr/local/autotest/results/shared/cache/lock')==2093970780390141302,
the 4 digits around 1800~1900 after the dot looks to be the PID. But anyway, it looks the host name part was enough for our case?
The lock file looks deleted only when it releases, so probably all the lock-related aborts will accumulate.
https://stainless.corp.google.com/search?view=list&first_date=20180101&last_date=20180131&test=cheets_%5BCG%5DTS&board=lulu&status=FAIL&status=ERROR&reason=lock&exclude_cts=false&exclude_not_run=true&exclude_non_release=false&exclude_au=false&exclude_acts=false&exclude_retried=false&exclude_non_production=false
,
Jan 15 2018
Happening again for Lulu bots. https://stainless.corp.google.com/search?view=list&first_date=20180110&last_date=20180131&test=cheets_%5BCG%5DTS&board=lulu&status=GOOD&status=WARN&status=FAIL&status=ERROR&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true
,
Jan 15 2018
CC: constables in this week.
,
Jan 15 2018
Some logs from the runs near the failure point (I don't have a nice way to query so this is a result of manual pickup). Again, while one job is running, another job is entering the locked section and removing the directory. (but it is outside the scope of forced lock break, which is more strange.) https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/169690088-chromeos-test/chromeos4-row6-rack3-host9/debug/ 01/12 14:13:16.138 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:13:16.487 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:13:17.807 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:13:19.324 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:13:22.659 INFO | tradefed_test:0273| Acquired cache lock after 4 attempts. 01/12 14:13:28.467 INFO | tradefed_test:0759| Current cache size=6659480335 of /usr/local/autotest/results/shared/cache/cache. 01/12 14:13:28.467 INFO | tradefed_test:0278| Released cache lock. 01/12 14:13:28.468 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:13:28.468 INFO | tradefed_test:0273| Acquired cache lock after 1 attempts. 01/12 14:13:28.468 INFO | tradefed_test:0794| Skipping download of gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb, reusing content of /usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b. 01/12 14:13:31.369 INFO | tradefed_test:0278| Released cache lock. 01/12 14:13:31.370 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:13:31.370 INFO | tradefed_test:0273| Acquired cache lock after 1 attempts. 01/12 14:13:31.370 INFO | tradefed_test:0794| Skipping download of gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-static_sdk_tools/3544738/aapt, reusing content of /usr/local/autotest/results/shared/cache/cache/4bd092f015ba402ee0579894be1cb6d5. 01/12 14:13:32.815 INFO | tradefed_test:0278| Released cache lock. 01/12 14:13:32.815 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:13:32.816 INFO | tradefed_test:0273| Acquired cache lock after 1 attempts. 01/12 14:13:32.816 INFO | tradefed_test:0794| Skipping download of gs://chromeos-arc-images/cts/bundle/N/android-cts-7.1_r12-linux_x86-x86.zip, reusing content of /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf. 01/12 14:13:32.816 INFO | tradefed_test:0694| Skipping unzip /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip, reusing content of /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86 01/12 14:13:32.817 INFO | tradefed_test:0872| Copying /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86 to instance /tmp/autotest-tradefed-install_Z2Ukbx/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86 01/12 14:15:10.864 INFO | tradefed_test:0278| Released cache lock. https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/169400378-chromeos-test/chromeos2-row5-rack9-host7/debug/ 01/12 14:12:30.302 ERROR| tradefed_test:1219| Failed to run tradefed! Cleaning up now. ... 01/12 14:12:36.786 INFO | tradefed_test:0269| Waiting for cache lock... 01/12 14:12:38.757 INFO | tradefed_test:0269| Waiting for cache lock... 01/12 14:12:41.509 INFO | tradefed_test:0269| Waiting for cache lock... 01/12 14:12:48.346 INFO | tradefed_test:0269| Waiting for cache lock... 01/12 14:13:00.563 INFO | tradefed_test:0269| Waiting for cache lock... 01/12 14:13:31.022 INFO | tradefed_test:0269| Waiting for cache lock... 01/12 14:14:11.014 INFO | tradefed_test:0269| Waiting for cache lock... 01/12 14:15:11.277 INFO | tradefed_test:0293| Acquired cache lock after 7 attempts. 01/12 14:15:12.750 INFO | tradefed_test:0779| Current cache size=6659480335 of /usr/local/autotest/results/shared/cache/cache. 01/12 14:15:12.751 WARNI| tradefed_test:0784| Cleaning download cache. 01/12 14:15:24.044 INFO | tradefed_test:0298| Released cache lock. https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/169688900-chromeos-test/chromeos2-row5-rack9-host11/debug/ 01/12 12:51:19.454 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 12:51:20.860 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 12:51:24.482 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 12:51:31.119 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 12:51:44.318 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 12:52:02.065 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 12:52:22.176 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 12:54:28.982 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 12:54:59.348 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 13:01:00.168 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 13:06:24.496 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 13:30:02.870 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:15:52.891 ERROR| tradefed_test:0264| Permanent lock failure. Trying to break lock. https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/169690143-chromeos-test/chromeos4-row6-rack3-host7/debug/ 01/12 14:13:49.394 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:13:49.650 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:13:52.716 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:13:53.630 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:13:54.916 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:14:17.443 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:14:29.354 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:15:14.054 INFO | tradefed_test:0273| Acquired cache lock after 7 attempts. 01/12 14:15:14.936 INFO | tradefed_test:0759| Current cache size=6405676608 of /usr/local/autotest/results/shared/cache/cache. 01/12 14:15:15.063 INFO | tradefed_test:0278| Released cache lock. 01/12 14:15:15.064 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:15:15.064 INFO | tradefed_test:0273| Acquired cache lock after 1 attempts. 01/12 14:15:15.065 INFO | tradefed_test:0794| Skipping download of gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-cheets_arm-user/3544738/adb, reusing content of /usr/local/autotest/results/shared/cache/cache/e7435caad94ac50c52577017e421de5b. 01/12 14:15:17.553 INFO | tradefed_test:0278| Released cache lock. 01/12 14:15:17.553 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:15:17.553 INFO | tradefed_test:0273| Acquired cache lock after 1 attempts. 01/12 14:15:17.554 INFO | tradefed_test:0814| Host chromeos4-row6-rack3-host7 not in lab. Downloading gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-static_sdk_tools/3544738/aapt directly to /usr/local/autotest/results/shared/cache/cache/4bd092f015ba402ee0579894be1cb6d5/aapt. 01/12 14:15:17.554 DEBUG| utils:0212| Running 'gsutil cp 'gs://chromeos-arc-images/builds/git_nyc-mr1-arc-linux-static_sdk_tools/3544738/aapt' '/usr/local/autotest/results/shared/cache/cache/4bd092f015ba402ee0579894be1cb6d5/aapt'' 01/12 14:15:36.618 INFO | tradefed_test:0278| Released cache lock. 01/12 14:15:36.619 INFO | tradefed_test:0249| Waiting for cache lock... 01/12 14:15:36.619 INFO | tradefed_test:0273| Acquired cache lock after 1 attempts. 01/12 14:15:36.620 INFO | tradefed_test:0814| Host chromeos4-row6-rack3-host7 not in lab. Downloading gs://chromeos-arc-images/cts/bundle/N/android-cts-7.1_r12-linux_x86-x86.zip directly to /usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip. 01/12 14:15:36.620 DEBUG| utils:0212| Running 'gsutil cp 'gs://chromeos-arc-images/cts/bundle/N/android-cts-7.1_r12-linux_x86-x86.zip' '/usr/local/autotest/results/shared/cache/cache/a56265ecc844a842418ade3f859e36cf/android-cts-7.1_r12-linux_x86-x86.zip'' 01/12 14:15:59.683 INFO | tradefed_test:0278| Released cache lock. 01/12 14:15:59.684 WARNI| test:0637| The test failed with the following exception gsutil cp ... failed OSError: No such file or directory.
,
Jan 17 2018
The Lulu problem reappeared and I cleaned it up once more. I also rebooted the server27 to clean up old processes. (gs_offloader and ssh) Generally when operating the server is a little overloaded, so I guess more likely for the unexpected to happen. https://viceroy.corp.google.com/chromeos/machines?hostname=chromeos-server27&duration=8d&mdb_role=chrome-infra The underlying problem is that 1) processes can (and are at too high rates) getting killed while obtaining a file lock 2) recovery from this bad state in a future job is fragile But there must be more here, as the cache ended up in a semi-broken state again.
,
Jan 17 2018
To answer comment 3, I think the locks in principle work across containers (based on my observation), yet your logs indicate to some degree the opposite. I suspect cleanup. I mean, when everything goes well we write 2 directories once a month, and maybe delete everything before that. I think I got too aggressive with the cleanup (bad state) and now the cache thrashes too often. The last corruption was probably cause by the 7.1_R13 rev yesterday, but lock cleanups could do the same. Still learning.
,
Jan 18 2018
M64 GTS testing is blocked with this issue. https://sponge.corp.google.com/target?id=11307b4a-6a93-4c35-80d7-b13d9a8ee308&target=lulu-release%2FR64-10176.51.0%2Farc-gts%2Fcheets_GTS.5.1_r2.GtsTvBugReportTestCases&searchFor=GTS+R64-10176.51.0+lulu
,
Jan 18 2018
,
Jan 19 2018
Setzer / chromeos-server26 is now hitting this failure... https://stainless.corp.google.com/search?view=list&first_date=20180115&last_date=20180131&test=cheets_%5BCG%5DTS&status=FAIL&status=ERROR&status=ABORT&reason=Unhandled+IOError%3A&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true
,
Jan 19 2018
Both cs26 and cs27 are currently in a very sad state with super high load and leaked process counts, see issue 803566 .
,
Jan 19 2018
Ok, now I understand. We have an unrelated process leak. At some point these little quarter servers run out of memory (or get so slow that nothing progresses, same idea). Kernel selects a victim to free up memory. Kills java which was waiting for lock or worse doing file operations. That is the world we live in. We absolutely cannot rely (much) on the locks.
,
Jan 19 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a62cc8f967c51de3c373601e0819eb871db152c0 commit a62cc8f967c51de3c373601e0819eb871db152c0 Author: Kazuhiro Inaba <kinaba@chromium.org> Date: Fri Jan 19 07:00:06 2018 cheets_CTS/GTS: Mitigation for a broken download cache. This is a tentative workaround until we find a fundamental way to avoid breaking the cache. When a cache entry with empty content (which should happen normally) is found, the file is not considered as cached and re-downloaded. BUG= chromium:800657 TEST=run cheets_CTS_N after manually deleting the cached adb binary. Change-Id: I6a53c3f5f875bc5ebf5f68d195c68ff5509c71a9 Reviewed-on: https://chromium-review.googlesource.com/875525 Tested-by: Kazuhiro Inaba <kinaba@chromium.org> Trybot-Ready: Kazuhiro Inaba <kinaba@chromium.org> Reviewed-by: Ilja H. Friedel <ihf@chromium.org> [modify] https://crrev.com/a62cc8f967c51de3c373601e0819eb871db152c0/server/cros/tradefed_test.py
,
Jan 19 2018
To tree/branch watchers: #18 is expected to recover the broken shard to a working state, but it does not fix the root cause. (= some runs will still fail on heavy-loaded machines, but if we schedule a manual retry, it has a chance to pass.) Re #15 (the case of Setzer): as far as I examined, the failure started from two `gsutil cp remote/adb local/adb` command run in parallel (for some reason, possibly due to the forced break_lock(), but I'm not sure because I could not find the corresponding job that broke the lock.) A job that finished downloading first will succeed, but the second one fails, and clears out the downloaded adb file. Hence the empty cache directory remains. What's bad is that the first successful runs commits validate_download_cache(). As a results, even the second job aborted in the middle for invalidate- and validate- cache, the final state is marked as valid. What we can learn from this is, it's not only the clean-cache task that broke the cache. Other parts need some care as well.
,
Jan 19 2018
Random thoughts dump (before going into the weekend): - If possible I think it will be preferable to switch to a better locking mechanism that won't leaved locked even when the process aborted (like using flock()? not sure.) - If it is difficult, we have to live in the world that aborted job may leave a lock forever, and hence we have to force break the lock sometimes, and hence we have to live with multiple jobs entering the critical section at the same time. One way to work around the situation is to make the invalidate- and validate-download-cache pair to be the inner lock. That is, instead of the current invalidate-cache = (if not exist: mkdir) https://chromium.googlesource.com/chromiumos/third_party/autotest/+/master/server/cros/tradefed_test.py#747, let it invalidate-cache = (mkdir, if it fails delete all the cache somehow quasi-atomically, by renaming and then rmtree) for the initialization of the test and invalidate-cache = (mkdir, if it fails then the test immediately fails) for the later steps. Then each single unfortunate job may fail but the errors won't persist longer.
,
Jan 19 2018
> I think I got too aggressive with the cleanup (bad state) and now the cache thrashes too often. I also agree on this. ./cts_tradefed command failure always cleans up the cache now, but according to the recent test failure logs, most of the case this code path is triggered by test timeout. (For instance, currently Elm/Hana has a failing test that takes very long to fail and thus triggers timeout. Hence I guess the cache is wiped out everyday.
,
Jan 20 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ecbcafe4c5e36aa4d01b483919b3f3a750b8766b commit ecbcafe4c5e36aa4d01b483919b3f3a750b8766b Author: Kazuhiro Inaba <kinaba@chromium.org> Date: Sat Jan 20 00:16:37 2018 cheets_CTS/GTS: Mitigation for a broken download cache. This is a tentative workaround until we find a fundamental way to avoid breaking the cache. When a cache entry with empty content (which should happen normally) is found, the file is not considered as cached and re-downloaded. BUG= chromium:800657 TEST=run cheets_CTS_N after manually deleting the cached adb binary. Change-Id: I6a53c3f5f875bc5ebf5f68d195c68ff5509c71a9 Reviewed-on: https://chromium-review.googlesource.com/875525 Tested-by: Kazuhiro Inaba <kinaba@chromium.org> Trybot-Ready: Kazuhiro Inaba <kinaba@chromium.org> Reviewed-by: Ilja H. Friedel <ihf@chromium.org> (cherry picked from commit a62cc8f967c51de3c373601e0819eb871db152c0) Reviewed-on: https://chromium-review.googlesource.com/877359 Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org> [modify] https://crrev.com/ecbcafe4c5e36aa4d01b483919b3f3a750b8766b/server/cros/tradefed_test.py
,
Jan 20 2018
removing RBS. at least the bots will self-recover from missing adb now.
,
Jan 22 2018
ummmmmm, "Error: Could not find any tests in module." error, which shares the same cause as this bug but #18 cannot recover is coming up: https://stainless.corp.google.com/search?view=list&first_date=20180108&last_date=20180122&test=cheets_%5BCG%5DTS&board=setzer&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true (unzipped cts-tradefed is partially deleted and hence no apks found.)
,
Jan 22 2018
Aviv and I will move lulu/setzer to new servers this week (unrelated reasons). It may be the best shot at resolving this issue short term.
,
Jan 23 2018
I forked the FileLock sources and I am adjusting them for our needs. Don't see another way.
,
Jan 23 2018
I think I found a bug in the FileLock implementation. It would explain the extra cascading failures. Basically
def release(self):
if not self.is_locked():
should be
def release(self):
if not self.i_am_locking():
SQLiteFileLock gets this right (but I don't want to use it, hence another argument to fork).
,
Jan 25 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/711fcdbf9ff510671ecad159bfa1ce2dcdad1d27 commit 711fcdbf9ff510671ecad159bfa1ce2dcdad1d27 Author: Ilja H. Friedel <ihf@chromium.org> Date: Thu Jan 25 00:54:26 2018 Unmodified import of lockfile.py version 0.8. pip show lockfile --- Name: lockfile Version: 0.8 Location: /usr/lib/python2.7/dist-packages 1) I checked upstream are newer versions, but the bug I am after is not fixed. 2) Notice the chroot uses portage-stable/dev-python/lockfile/lockfile-0.9.1.ebuild 3) I am not sure what the base image uses for SSP. With this we get all places to the same implementation. BUG= chromium:800657 TEST=None. Change-Id: I7f7f107b1377cee0c838710cfbc614218e269f5b Reviewed-on: https://chromium-review.googlesource.com/880484 Tested-by: Ilja H. Friedel <ihf@chromium.org> Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org> [add] https://crrev.com/711fcdbf9ff510671ecad159bfa1ce2dcdad1d27/server/cros/lockfile.py
,
Jan 25 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/397eea2ea75a3dbf5486e5e72bc8f0c339d6ec8c commit 397eea2ea75a3dbf5486e5e72bc8f0c339d6ec8c Author: Ilja H. Friedel <ihf@chromium.org> Date: Thu Jan 25 00:55:19 2018 lockfile.py: clean up for tradefed_test needs. 1) Only keep link based lockfile code. 2) Fix release(). Using is_locked() is wrong when locks are broken. Instead use i_am_locking(). 3) Modify code to add a way to query age of existing lock. 4) Minor cleanup to functionality. - new style class - less unlock error cases BUG= chromium:800657 TEST=pylint Change-Id: I883cdf276e165fdd9ffb3d127d434ac35e5f503f Reviewed-on: https://chromium-review.googlesource.com/880485 Tested-by: Ilja H. Friedel <ihf@chromium.org> Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org> [modify] https://crrev.com/397eea2ea75a3dbf5486e5e72bc8f0c339d6ec8c/server/cros/lockfile.py
,
Jan 25 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3dfa037a616567643b4dfe49bf9bd530e8d62c24 commit 3dfa037a616567643b4dfe49bf9bd530e8d62c24 Author: Ilja H. Friedel <ihf@chromium.org> Date: Thu Jan 25 08:15:55 2018 tradefed_test: switch to use forked lockfile. The forked lockfile.py fixes a bug. It also adds functionality which is not used yet. BUG= chromium:800657 TEST=test_that <ip> cheets_CTS_N.7.1_r13.arm.CtsDramTestCases Change-Id: I02665bce52bc06f44b931ae30e74e0c8743f9344 Reviewed-on: https://chromium-review.googlesource.com/882134 Commit-Ready: Ilja H. Friedel <ihf@chromium.org> Tested-by: Ilja H. Friedel <ihf@chromium.org> Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org> [modify] https://crrev.com/3dfa037a616567643b4dfe49bf9bd530e8d62c24/server/cros/tradefed_test.py
,
Jan 29 2018
Changes get merged? Need to verify and/or remove RBS for M64
,
Jan 30 2018
,
Feb 4 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a4327fc988683ef89a9987febed7784e3da020aa commit a4327fc988683ef89a9987febed7784e3da020aa Author: Ilja H. Friedel <ihf@chromium.org> Date: Sun Feb 04 22:26:45 2018 tradefed_test: split code into multiple files. No functional changes. Just moving code around to get shorter files. BUG= chromium:800657 TEST=Ran a CTS/GTS test locally. Change-Id: Ib9a9e9a770cc7a5cb5a3fefe0c6a5dbb061020c4 Reviewed-on: https://chromium-review.googlesource.com/894994 Commit-Ready: Ilja H. Friedel <ihf@chromium.org> Tested-by: Ilja H. Friedel <ihf@chromium.org> Reviewed-by: Ilja H. Friedel <ihf@chromium.org> [add] https://crrev.com/a4327fc988683ef89a9987febed7784e3da020aa/server/cros/tradefed_chromelogin.py [add] https://crrev.com/a4327fc988683ef89a9987febed7784e3da020aa/server/cros/tradefed_utils.py [add] https://crrev.com/a4327fc988683ef89a9987febed7784e3da020aa/server/cros/tradefed_constants.py [modify] https://crrev.com/a4327fc988683ef89a9987febed7784e3da020aa/server/cros/tradefed_test.py
,
Apr 5 2018
I think the change in #30 fixed most problems. I have still a half refactored change to improve things, but it has been rotting and needs serious rebase. The urgent part is fixed.
,
Apr 5 2018
[Auto-generated comment by a script] We noticed that this issue is targeted for M-66; it appears the fix may have landed after branch point, meaning a merge might be required. Please confirm if a merge is required here - if so add Merge-Request-66 label, otherwise remove Merge-TBD label. Thanks.
,
Apr 25 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/864fe282023073a7ce8a2d170d15fbbbee00b685 commit 864fe282023073a7ce8a2d170d15fbbbee00b685 Author: Ilja H. Friedel <ihf@chromium.org> Date: Wed Apr 25 02:51:31 2018 Unmodified import of lockfile.py version 0.8. pip show lockfile --- Name: lockfile Version: 0.8 Location: /usr/lib/python2.7/dist-packages 1) I checked upstream are newer versions, but the bug I am after is not fixed. 2) Notice the chroot uses portage-stable/dev-python/lockfile/lockfile-0.9.1.ebuild 3) I am not sure what the base image uses for SSP. With this we get all places to the same implementation. BUG= chromium:800657 TEST=None. Change-Id: I7f7f107b1377cee0c838710cfbc614218e269f5b Reviewed-on: https://chromium-review.googlesource.com/880484 Tested-by: Ilja H. Friedel <ihf@chromium.org> Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org> (cherry picked from commit 711fcdbf9ff510671ecad159bfa1ce2dcdad1d27) Reviewed-on: https://chromium-review.googlesource.com/1027193 Reviewed-by: Ilja H. Friedel <ihf@chromium.org> [add] https://crrev.com/864fe282023073a7ce8a2d170d15fbbbee00b685/server/cros/lockfile.py
,
Apr 25 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ee0df91ccf5989a837d68d40b1edca3ad4d17ee2 commit ee0df91ccf5989a837d68d40b1edca3ad4d17ee2 Author: Ilja H. Friedel <ihf@chromium.org> Date: Wed Apr 25 02:52:11 2018 tradefed_test: switch to use forked lockfile. The forked lockfile.py fixes a bug. It also adds functionality which is not used yet. BUG= chromium:800657 TEST=test_that <ip> cheets_CTS_N.7.1_r13.arm.CtsDramTestCases Change-Id: I02665bce52bc06f44b931ae30e74e0c8743f9344 Reviewed-on: https://chromium-review.googlesource.com/882134 Commit-Ready: Ilja H. Friedel <ihf@chromium.org> Tested-by: Ilja H. Friedel <ihf@chromium.org> Reviewed-by: Kazuhiro Inaba <kinaba@chromium.org> (cherry picked from commit 3dfa037a616567643b4dfe49bf9bd530e8d62c24) Reviewed-on: https://chromium-review.googlesource.com/1027194 Reviewed-by: Ilja H. Friedel <ihf@chromium.org> [modify] https://crrev.com/ee0df91ccf5989a837d68d40b1edca3ad4d17ee2/server/cros/tradefed_test.py
,
Apr 25 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/329b643cc236e68e1a59d8a46a852ee66235ee27 commit 329b643cc236e68e1a59d8a46a852ee66235ee27 Author: Ilja H. Friedel <ihf@chromium.org> Date: Wed Apr 25 02:52:46 2018 tradefed_test: split code into multiple files. No functional changes. Just moving code around to get shorter files. BUG= chromium:800657 TEST=Ran a CTS/GTS test locally. Change-Id: Ib9a9e9a770cc7a5cb5a3fefe0c6a5dbb061020c4 Reviewed-on: https://chromium-review.googlesource.com/894994 Commit-Ready: Ilja H. Friedel <ihf@chromium.org> Tested-by: Ilja H. Friedel <ihf@chromium.org> Reviewed-by: Ilja H. Friedel <ihf@chromium.org> (cherry picked from commit a4327fc988683ef89a9987febed7784e3da020aa) Reviewed-on: https://chromium-review.googlesource.com/1027195 [add] https://crrev.com/329b643cc236e68e1a59d8a46a852ee66235ee27/server/cros/tradefed_chromelogin.py [add] https://crrev.com/329b643cc236e68e1a59d8a46a852ee66235ee27/server/cros/tradefed_utils.py [add] https://crrev.com/329b643cc236e68e1a59d8a46a852ee66235ee27/server/cros/tradefed_constants.py [modify] https://crrev.com/329b643cc236e68e1a59d8a46a852ee66235ee27/server/cros/tradefed_test.py
,
Jun 6 2018
|
||||||||||||||||
►
Sign in to add a comment |
||||||||||||||||
Comment 1 by ihf@chromium.org
, Jan 10 2018