chromite: workqueue/tasks_unittest are flaky |
||||||
Issue descriptionWe had two builds in a row when we observed failures in chromite UnitTests: Build: https://luci-milo.appspot.com/buildbot/chromeos/poppy-release/656 (passed after a retry) Log: https://uberchromegw.corp.google.com/i/chromeos/builders/poppy-release/builds/656/steps/UnitTest/logs/stdio Failure from the first try: chromite-0.0.2-r3272: ====================================================================== chromite-0.0.2-r3272: FAIL: testReapAllInOrder (chromite.lib.workqueue.tasks_unittest.ProcessPoolTaskManagerTests) chromite-0.0.2-r3272: ---------------------------------------------------------------------- chromite-0.0.2-r3272: Traceback (most recent call last): chromite-0.0.2-r3272: File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 158, in testReapAllInOrder chromite-0.0.2-r3272: self._ReapAll(True) chromite-0.0.2-r3272: File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 118, in _ReapAll chromite-0.0.2-r3272: self._StartAll() chromite-0.0.2-r3272: File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 105, in _StartAll chromite-0.0.2-r3272: self._StartTask(rqid) chromite-0.0.2-r3272: File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 76, in _StartTask chromite-0.0.2-r3272: self.assertTrue(os.path.exists(rqfile)) chromite-0.0.2-r3272: AssertionError: False is not true chromite-0.0.2-r3272: chromite-0.0.2-r3272: ====================================================================== chromite-0.0.2-r3272: FAIL: testRunThenTerminate (chromite.lib.workqueue.tasks_unittest.ProcessPoolTaskManagerTests) chromite-0.0.2-r3272: ---------------------------------------------------------------------- chromite-0.0.2-r3272: Traceback (most recent call last): chromite-0.0.2-r3272: File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 207, in testRunThenTerminate chromite-0.0.2-r3272: self._StartTask(self._REQUEST_IDS[0]) chromite-0.0.2-r3272: File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 76, in _StartTask chromite-0.0.2-r3272: self.assertTrue(os.path.exists(rqfile)) chromite-0.0.2-r3272: AssertionError: False is not true Build: https://luci-milo.appspot.com/buildbot/chromeos/poppy-release/657 (failed) Log: https://uberchromegw.corp.google.com/i/chromeos/builders/poppy-release/builds/657/steps/UnitTest/logs/stdio Failure from the first try: chromite-0.0.2-r3272: ====================================================================== chromite-0.0.2-r3272: FAIL: testReapAllInOrder (chromite.lib.workqueue.tasks_unittest.ProcessPoolTaskManagerTests) chromite-0.0.2-r3272: ---------------------------------------------------------------------- chromite-0.0.2-r3272: Traceback (most recent call last): chromite-0.0.2-r3272: File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 158, in testReapAllInOrder chromite-0.0.2-r3272: self._ReapAll(True) chromite-0.0.2-r3272: File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 118, in _ReapAll chromite-0.0.2-r3272: self._StartAll() chromite-0.0.2-r3272: File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 105, in _StartAll chromite-0.0.2-r3272: self._StartTask(rqid) chromite-0.0.2-r3272: File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 76, in _StartTask chromite-0.0.2-r3272: self.assertTrue(os.path.exists(rqfile)) chromite-0.0.2-r3272: AssertionError: False is not true chromite-0.0.2-r3272: chromite-0.0.2-r3272: ====================================================================== chromite-0.0.2-r3272: FAIL: testTerminateAllReversed (chromite.lib.workqueue.tasks_unittest.ProcessPoolTaskManagerTests) chromite-0.0.2-r3272: ---------------------------------------------------------------------- chromite-0.0.2-r3272: Traceback (most recent call last): chromite-0.0.2-r3272: File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 178, in testTerminateAllReversed chromite-0.0.2-r3272: self._TerminateAll(False) chromite-0.0.2-r3272: File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 132, in _TerminateAll chromite-0.0.2-r3272: self._StartAll() chromite-0.0.2-r3272: File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 105, in _StartAll chromite-0.0.2-r3272: self._StartTask(rqid) chromite-0.0.2-r3272: File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 76, in _StartTask chromite-0.0.2-r3272: self.assertTrue(os.path.exists(rqfile)) chromite-0.0.2-r3272: AssertionError: False is not true Failure from the second try: chromite-0.0.2-r3272: 20:23:57: INFO: Finished [198/198] lib/workqueue/tasks_unittest (0:01:19.262856) chromite-0.0.2-r3272: chromite-0.0.2-r3272: 20:23:57: ERROR: ### LOG: scripts/extract_gerrit_history_unittest chromite-0.0.2-r3272: Traceback (most recent call last): chromite-0.0.2-r3272: File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main chromite-0.0.2-r3272: "__main__", fname, loader, pkg_name) chromite-0.0.2-r3272: File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code chromite-0.0.2-r3272: exec code in run_globals chromite-0.0.2-r3272: File "/mnt/host/source/infra_virtualenv/venv/cros_venv/scripts/create_venv.py", line 46, in <module> chromite-0.0.2-r3272: main() chromite-0.0.2-r3272: File "/mnt/host/source/infra_virtualenv/venv/cros_venv/scripts/create_venv.py", line 35, in main chromite-0.0.2-r3272: print(venv.ensure()) chromite-0.0.2-r3272: File "/mnt/host/source/infra_virtualenv/venv/cros_venv/venvlib.py", line 173, in ensure chromite-0.0.2-r3272: with flock.FileLock(self._paths.lockfile): chromite-0.0.2-r3272: File "/mnt/host/source/infra_virtualenv/venv/cros_venv/flock.py", line 69, in __enter__ chromite-0.0.2-r3272: self._acquire_lock() chromite-0.0.2-r3272: File "/mnt/host/source/infra_virtualenv/venv/cros_venv/flock.py", line 78, in _acquire_lock chromite-0.0.2-r3272: self._clean_stale_lock() chromite-0.0.2-r3272: File "/mnt/host/source/infra_virtualenv/venv/cros_venv/flock.py", line 107, in _clean_stale_lock chromite-0.0.2-r3272: self._clean_orphaned_lock() chromite-0.0.2-r3272: File "/mnt/host/source/infra_virtualenv/venv/cros_venv/flock.py", line 117, in _clean_orphaned_lock chromite-0.0.2-r3272: if self._is_lock_orphaned(): chromite-0.0.2-r3272: File "/mnt/host/source/infra_virtualenv/venv/cros_venv/flock.py", line 129, in _is_lock_orphaned chromite-0.0.2-r3272: and not _is_pid_running(self._get_lock_pid())) chromite-0.0.2-r3272: File "/mnt/host/source/infra_virtualenv/venv/cros_venv/flock.py", line 133, in _get_lock_pid chromite-0.0.2-r3272: return _procdir.getpid(os.readlink(self._lockfile)) chromite-0.0.2-r3272: OSError: [Errno 2] No such file or directory: '/home/chrome-bot/.cache/cros_venv/venv-2.7.10-e5e7c29bac8f6c49d86dffd36c05d6fa/change.lock' chromite-0.0.2-r3272: Traceback (most recent call last): chromite-0.0.2-r3272: File "scripts/extract_gerrit_history_unittest", line 93, in <module> chromite-0.0.2-r3272: main() chromite-0.0.2-r3272: File "scripts/extract_gerrit_history_unittest", line 31, in main chromite-0.0.2-r3272: venvdir = _CreateVenv() chromite-0.0.2-r3272: File "scripts/extract_gerrit_history_unittest", line 39, in _CreateVenv chromite-0.0.2-r3272: _REQUIREMENTS, chromite-0.0.2-r3272: File "/usr/lib64/python2.7/subprocess.py", line 573, in check_output chromite-0.0.2-r3272: raise CalledProcessError(retcode, cmd, output=output) chromite-0.0.2-r3272: subprocess.CalledProcessError: Command '['/mnt/host/source/chromite/../infra_virtualenv/bin/create_venv', '/mnt/host/source/chromite/venv/requirements.txt']' returned non-zero exit status 1 chromite-0.0.2-r3272: Please take a look.
,
Aug 16 2017
<sigh> The tests have some necessary timing dependencies. System load can cause some operations to run slow, which will cause test failures. Very likely, that's what's going on, in which case the fix is to increase the timeout. Or maybe, just disable the tests until we sort out something that's less load-sensitive.
,
Aug 16 2017
imo, generally speaking, timeouts should be high/extreme by default. after all, correctly working code isn't going to hit the timeout (they're going to finish+exit quickly), which means the timeout is only going to hit when there's a bug, or when the load is high. we shouldn't try to use low timeouts just to cater to the "bad" codepaths. if it isn't too complicated, maybe a middle ground would be to update cros_test_lib with a constant/func that scales the timeouts used by tests. if the command is run interactively (e.g. by a dev running the test), then we don't scale it very high. but otherwise (e.g. ebuild or a bot), then we scale it pretty high. can use os.isatty(sys.stdin.fileno()) to autodetect.
,
Aug 16 2017
,
Aug 22 2017
Issue 757866 has been merged into this issue.
,
Aug 24 2017
Another flake in a run I happened to be looking at today: https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fcoral-release%2F271%2F%2B%2Frecipes%2Fsteps%2FUnitTest%2F0%2Fstdout Fortunately, it passed on the second build/test.
,
Aug 24 2017
FWIW, the fix is approved, and waiting for the CQ:
https://chromium-review.googlesource.com/#/c/chromiumos/chromite/+/617375/
,
Aug 25 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/42baf760c17c7ac329787f9155bc26562b439705 commit 42baf760c17c7ac329787f9155bc26562b439705 Author: Richard Barnette <jrbarnette@chromium.org> Date: Fri Aug 25 03:41:13 2017 Make the workqueue task unit tests less flaky. The workqueue test unit tests have to test an assertion that an asynchronous call (via StartTask) starts _eventually_. The underlying Python+Linux combination provides no hard real time guarantee on how long "eventually" will be, so we have to pick a generous timeout, and cross our fingers. During release builds, the builder system can be heavily loaded, increasing task start time enough that that the originally selected "generous" timeout wasn't always long enough, causing the test to be flaky. This increases the timeout dramatically. It further changes the wait strategy to prevent the test time for success from blowing up on an unloaded system. BUG= chromium:755914 TEST=run the unit tests; also test with _START_LATENCY temporarily 0 Change-Id: Id25f1214f2cae502adec672ba50f88515dac25c9 Reviewed-on: https://chromium-review.googlesource.com/617375 Commit-Ready: Richard Barnette <jrbarnette@chromium.org> Tested-by: Richard Barnette <jrbarnette@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> [modify] https://crrev.com/42baf760c17c7ac329787f9155bc26562b439705/lib/workqueue/tasks_unittest.py
,
Aug 25 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/974d461851daf1bff41b57a52f03f13b7479887b commit 974d461851daf1bff41b57a52f03f13b7479887b Author: Richard Barnette <jrbarnette@chromium.org> Date: Fri Aug 25 03:41:13 2017 workqueue: in ProcessPoolTaskManager use terminate(), not close(). In lib.workqueue.task.ProcessPoolTaskManager, calling the Close() method can hang forever if a child process is hung. This method is only called from an environment where we intend to terminate the current process, so waiting indefinitely (or even timing out) isn't a good choice. This changes the Close() method to use Pool.terminate() instead of Pool.close(), to force immediate termination. BUG= chromium:755914 TEST=Use altered unit tests that fail and leave hung children behind Change-Id: I3edb184230b670ec67efbea5ac042bf11ed30c92 Reviewed-on: https://chromium-review.googlesource.com/630577 Commit-Ready: Richard Barnette <jrbarnette@chromium.org> Tested-by: Richard Barnette <jrbarnette@chromium.org> Reviewed-by: Don Garrett <dgarrett@chromium.org> [modify] https://crrev.com/974d461851daf1bff41b57a52f03f13b7479887b/lib/workqueue/tasks.py
,
Aug 25 2017
,
Jan 22 2018
|
||||||
►
Sign in to add a comment |
||||||
Comment 1 by vapier@chromium.org
, Aug 16 2017Owner: jrbarnette@chromium.org
Summary: chromite: workqueue/tasks_unittest are flaky (was: poppy-release: chromite UnitTests flaky?)