Aborting a payload copy may aggravate client timeouts
Reported by
jrbarnette@chromium.org,
Sep 20 2017
|
|||
Issue description
When provisioning calls the payload copying workqueue
for a throttled copy, it can time out, and then abort
its request. If the copy is actually running (as opposed
to merely queued and waiting to run), the abort includes
calling this code in tasks.ProcessPoolTaskManager:
def TerminateTask(self, request_id):
del self._pending_results[request_id]
... that is, the method merely removes the record of the
running task without actually stopping it. This change
means that subsequent queries to `HasCapacity()` will
return true, allowing `ProcessRequests()` to add additional
tasks.
We're using `multiprocessing.Pool` to keep track of the work,
so the extra tasks would queue up inside the Pool object,
taking up time, but not bandwidth. The net effect is that some
copy tasks would show excessively long run times, rather than
long wait times. This could also aggravate wait times, and
cause more client timeouts through this sequence:
* Request A times out while running, and the client requests
abort.
* This bug allows a new request B to be started before A
terminates. The request will be accounted for as running,
even though it gets no processing time.
* If request B times out, the task for B _will_still_run_.
It may not even start processing until after being aborted.
One effect of clients aborting work should be that when wait
times are long, the system self-regulates by eliminating work.
But, in the scenario above, work that should have been stopped
instead plows forward in stealth mode without reducing system
load.
,
Sep 20 2017
,
Oct 15 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/fc019898a274c3086109ca7fa67e82b1da87616a commit fc019898a274c3086109ca7fa67e82b1da87616a Author: Richard Barnette <jrbarnette@chromium.org> Date: Fri Oct 13 21:44:54 2017 [workqueue] Handle aborted tasks properly When a client aborts a running task, we can't account for it as increased capacity until the running task actually terminates. multiprocessing.Pool doesn't allow premature termination, so we have to wait for the task to complete normally before recording an increase in capacity. BUG= chromium:767211 TEST=unit tests Change-Id: Ie332770e117cd1c690c7c2838d2d75fc65c17098 Reviewed-on: https://chromium-review.googlesource.com/693379 Commit-Ready: Richard Barnette <jrbarnette@chromium.org> Tested-by: Richard Barnette <jrbarnette@chromium.org> Reviewed-by: Don Garrett <dgarrett@chromium.org> [modify] https://crrev.com/fc019898a274c3086109ca7fa67e82b1da87616a/lib/workqueue/tasks_unittest.py [modify] https://crrev.com/fc019898a274c3086109ca7fa67e82b1da87616a/lib/workqueue/tasks.py
,
Oct 16 2017
|
|||
►
Sign in to add a comment |
|||
Comment 1 by jrbarnette@chromium.org
, Sep 20 2017Owner: jrbarnette@chromium.org
Status: Started (was: Untriaged)