Project: chromium Issues People Development process History Sign in
New issue
Advanced search Search tips
Starred by 2 users
Status: Fixed
Owner:
Closed: Aug 25
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment
chromite: workqueue/tasks_unittest are flaky
Project Member Reported by tfiga@chromium.org, Aug 16 Back to list
We had two builds in a row when we observed failures in chromite UnitTests:

Build: https://luci-milo.appspot.com/buildbot/chromeos/poppy-release/656
(passed after a retry)
Log: https://uberchromegw.corp.google.com/i/chromeos/builders/poppy-release/builds/656/steps/UnitTest/logs/stdio

Failure from the first try:

chromite-0.0.2-r3272: ======================================================================
chromite-0.0.2-r3272: FAIL: testReapAllInOrder (chromite.lib.workqueue.tasks_unittest.ProcessPoolTaskManagerTests)
chromite-0.0.2-r3272: ----------------------------------------------------------------------
chromite-0.0.2-r3272: Traceback (most recent call last):
chromite-0.0.2-r3272:   File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 158, in testReapAllInOrder
chromite-0.0.2-r3272:     self._ReapAll(True)
chromite-0.0.2-r3272:   File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 118, in _ReapAll
chromite-0.0.2-r3272:     self._StartAll()
chromite-0.0.2-r3272:   File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 105, in _StartAll
chromite-0.0.2-r3272:     self._StartTask(rqid)
chromite-0.0.2-r3272:   File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 76, in _StartTask
chromite-0.0.2-r3272:     self.assertTrue(os.path.exists(rqfile))
chromite-0.0.2-r3272: AssertionError: False is not true
chromite-0.0.2-r3272: 
chromite-0.0.2-r3272: ======================================================================
chromite-0.0.2-r3272: FAIL: testRunThenTerminate (chromite.lib.workqueue.tasks_unittest.ProcessPoolTaskManagerTests)
chromite-0.0.2-r3272: ----------------------------------------------------------------------
chromite-0.0.2-r3272: Traceback (most recent call last):
chromite-0.0.2-r3272:   File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 207, in testRunThenTerminate
chromite-0.0.2-r3272:     self._StartTask(self._REQUEST_IDS[0])
chromite-0.0.2-r3272:   File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 76, in _StartTask
chromite-0.0.2-r3272:     self.assertTrue(os.path.exists(rqfile))
chromite-0.0.2-r3272: AssertionError: False is not true

Build: https://luci-milo.appspot.com/buildbot/chromeos/poppy-release/657
(failed)
Log: https://uberchromegw.corp.google.com/i/chromeos/builders/poppy-release/builds/657/steps/UnitTest/logs/stdio

Failure from the first try:

chromite-0.0.2-r3272: ======================================================================
chromite-0.0.2-r3272: FAIL: testReapAllInOrder (chromite.lib.workqueue.tasks_unittest.ProcessPoolTaskManagerTests)
chromite-0.0.2-r3272: ----------------------------------------------------------------------
chromite-0.0.2-r3272: Traceback (most recent call last):
chromite-0.0.2-r3272:   File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 158, in testReapAllInOrder
chromite-0.0.2-r3272:     self._ReapAll(True)
chromite-0.0.2-r3272:   File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 118, in _ReapAll
chromite-0.0.2-r3272:     self._StartAll()
chromite-0.0.2-r3272:   File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 105, in _StartAll
chromite-0.0.2-r3272:     self._StartTask(rqid)
chromite-0.0.2-r3272:   File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 76, in _StartTask
chromite-0.0.2-r3272:     self.assertTrue(os.path.exists(rqfile))
chromite-0.0.2-r3272: AssertionError: False is not true
chromite-0.0.2-r3272: 
chromite-0.0.2-r3272: ======================================================================
chromite-0.0.2-r3272: FAIL: testTerminateAllReversed (chromite.lib.workqueue.tasks_unittest.ProcessPoolTaskManagerTests)
chromite-0.0.2-r3272: ----------------------------------------------------------------------
chromite-0.0.2-r3272: Traceback (most recent call last):
chromite-0.0.2-r3272:   File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 178, in testTerminateAllReversed
chromite-0.0.2-r3272:     self._TerminateAll(False)
chromite-0.0.2-r3272:   File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 132, in _TerminateAll
chromite-0.0.2-r3272:     self._StartAll()
chromite-0.0.2-r3272:   File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 105, in _StartAll
chromite-0.0.2-r3272:     self._StartTask(rqid)
chromite-0.0.2-r3272:   File "/mnt/host/source/chromite/lib/workqueue/tasks_unittest.py", line 76, in _StartTask
chromite-0.0.2-r3272:     self.assertTrue(os.path.exists(rqfile))
chromite-0.0.2-r3272: AssertionError: False is not true

Failure from the second try:

chromite-0.0.2-r3272: 20:23:57: INFO: Finished [198/198] lib/workqueue/tasks_unittest (0:01:19.262856)
chromite-0.0.2-r3272: 
chromite-0.0.2-r3272: 20:23:57: ERROR: ### LOG: scripts/extract_gerrit_history_unittest
chromite-0.0.2-r3272: Traceback (most recent call last):
chromite-0.0.2-r3272:   File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
chromite-0.0.2-r3272:     "__main__", fname, loader, pkg_name)
chromite-0.0.2-r3272:   File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
chromite-0.0.2-r3272:     exec code in run_globals
chromite-0.0.2-r3272:   File "/mnt/host/source/infra_virtualenv/venv/cros_venv/scripts/create_venv.py", line 46, in <module>
chromite-0.0.2-r3272:     main()
chromite-0.0.2-r3272:   File "/mnt/host/source/infra_virtualenv/venv/cros_venv/scripts/create_venv.py", line 35, in main
chromite-0.0.2-r3272:     print(venv.ensure())
chromite-0.0.2-r3272:   File "/mnt/host/source/infra_virtualenv/venv/cros_venv/venvlib.py", line 173, in ensure
chromite-0.0.2-r3272:     with flock.FileLock(self._paths.lockfile):
chromite-0.0.2-r3272:   File "/mnt/host/source/infra_virtualenv/venv/cros_venv/flock.py", line 69, in __enter__
chromite-0.0.2-r3272:     self._acquire_lock()
chromite-0.0.2-r3272:   File "/mnt/host/source/infra_virtualenv/venv/cros_venv/flock.py", line 78, in _acquire_lock
chromite-0.0.2-r3272:     self._clean_stale_lock()
chromite-0.0.2-r3272:   File "/mnt/host/source/infra_virtualenv/venv/cros_venv/flock.py", line 107, in _clean_stale_lock
chromite-0.0.2-r3272:     self._clean_orphaned_lock()
chromite-0.0.2-r3272:   File "/mnt/host/source/infra_virtualenv/venv/cros_venv/flock.py", line 117, in _clean_orphaned_lock
chromite-0.0.2-r3272:     if self._is_lock_orphaned():
chromite-0.0.2-r3272:   File "/mnt/host/source/infra_virtualenv/venv/cros_venv/flock.py", line 129, in _is_lock_orphaned
chromite-0.0.2-r3272:     and not _is_pid_running(self._get_lock_pid()))
chromite-0.0.2-r3272:   File "/mnt/host/source/infra_virtualenv/venv/cros_venv/flock.py", line 133, in _get_lock_pid
chromite-0.0.2-r3272:     return _procdir.getpid(os.readlink(self._lockfile))
chromite-0.0.2-r3272: OSError: [Errno 2] No such file or directory: '/home/chrome-bot/.cache/cros_venv/venv-2.7.10-e5e7c29bac8f6c49d86dffd36c05d6fa/change.lock'
chromite-0.0.2-r3272: Traceback (most recent call last):
chromite-0.0.2-r3272:   File "scripts/extract_gerrit_history_unittest", line 93, in <module>
chromite-0.0.2-r3272:     main()
chromite-0.0.2-r3272:   File "scripts/extract_gerrit_history_unittest", line 31, in main
chromite-0.0.2-r3272:     venvdir = _CreateVenv()
chromite-0.0.2-r3272:   File "scripts/extract_gerrit_history_unittest", line 39, in _CreateVenv
chromite-0.0.2-r3272:     _REQUIREMENTS,
chromite-0.0.2-r3272:   File "/usr/lib64/python2.7/subprocess.py", line 573, in check_output
chromite-0.0.2-r3272:     raise CalledProcessError(retcode, cmd, output=output)
chromite-0.0.2-r3272: subprocess.CalledProcessError: Command '['/mnt/host/source/chromite/../infra_virtualenv/bin/create_venv', '/mnt/host/source/chromite/venv/requirements.txt']' returned non-zero exit status 1
chromite-0.0.2-r3272: 

Please take a look.

 
Labels: OS-Chrome
Owner: jrbarnette@chromium.org
Summary: chromite: workqueue/tasks_unittest are flaky (was: poppy-release: chromite UnitTests flaky?)
lets keep this one for workqueue/tasks_unittest.  i've forked the venv failure into  issue 756001  as i've seen that one myself on my local system.
Status: Assigned
<sigh> The tests have some necessary timing dependencies.  System load
can cause some operations to run slow, which will cause test failures.

Very likely, that's what's going on, in which case the fix is to increase
the timeout.  Or maybe, just disable the tests until we sort out something
that's less load-sensitive.

imo, generally speaking, timeouts should be high/extreme by default.  after all, correctly working code isn't going to hit the timeout (they're going to finish+exit quickly), which means the timeout is only going to hit when there's a bug, or when the load is high.  we shouldn't try to use low timeouts just to cater to the "bad" codepaths.

if it isn't too complicated, maybe a middle ground would be to update cros_test_lib with a constant/func that scales the timeouts used by tests.  if the command is run interactively (e.g. by a dev running the test), then we don't scale it very high.  but otherwise (e.g. ebuild or a bot), then we scale it pretty high.

can use os.isatty(sys.stdin.fileno()) to autodetect.
Status: Started
Cc: pprabhu@chromium.org apronin@chromium.org jrbarnette@chromium.org
 Issue 757866  has been merged into this issue.
Another flake in a run I happened to be looking at today:

https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fcoral-release%2F271%2F%2B%2Frecipes%2Fsteps%2FUnitTest%2F0%2Fstdout

Fortunately, it passed on the second build/test.
FWIW, the fix is approved, and waiting for the CQ:
    https://chromium-review.googlesource.com/#/c/chromiumos/chromite/+/617375/

Project Member Comment 8 by bugdroid1@chromium.org, Aug 25
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/42baf760c17c7ac329787f9155bc26562b439705

commit 42baf760c17c7ac329787f9155bc26562b439705
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Fri Aug 25 03:41:13 2017

Make the workqueue task unit tests less flaky.

The workqueue test unit tests have to test an assertion that an
asynchronous call (via StartTask) starts _eventually_.  The
underlying Python+Linux combination provides no hard real time
guarantee on how long "eventually" will be, so we have to pick a
generous timeout, and cross our fingers.

During release builds, the builder system can be heavily loaded,
increasing task start time enough that that the originally selected
"generous" timeout wasn't always long enough, causing the test to be
flaky.

This increases the timeout dramatically.  It further changes the
wait strategy to prevent the test time for success from blowing up
on an unloaded system.

BUG= chromium:755914 
TEST=run the unit tests; also test with _START_LATENCY temporarily 0

Change-Id: Id25f1214f2cae502adec672ba50f88515dac25c9
Reviewed-on: https://chromium-review.googlesource.com/617375
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/42baf760c17c7ac329787f9155bc26562b439705/lib/workqueue/tasks_unittest.py

Project Member Comment 9 by bugdroid1@chromium.org, Aug 25
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/974d461851daf1bff41b57a52f03f13b7479887b

commit 974d461851daf1bff41b57a52f03f13b7479887b
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Fri Aug 25 03:41:13 2017

workqueue: in ProcessPoolTaskManager use terminate(), not close().

In lib.workqueue.task.ProcessPoolTaskManager, calling the Close()
method can hang forever if a child process is hung.  This method is
only called from an environment where we intend to terminate the
current process, so waiting indefinitely (or even timing out) isn't
a good choice.

This changes the Close() method to use Pool.terminate() instead of
Pool.close(), to force immediate termination.

BUG= chromium:755914 
TEST=Use altered unit tests that fail and leave hung children behind

Change-Id: I3edb184230b670ec67efbea5ac042bf11ed30c92
Reviewed-on: https://chromium-review.googlesource.com/630577
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/974d461851daf1bff41b57a52f03f13b7479887b/lib/workqueue/tasks.py

Status: Fixed
Sign in to add a comment