
Issue 767211

Starred by 1 user

Issue metadata

Status: Fixed
Owner: jrbarnette@chromium.org
Closed: Oct 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking: issue 760652




Aborting a payload copy may aggravate client timeouts

Reported by jrbarnette@chromium.org, Sep 20 2017

Issue description

When provisioning calls the payload copying workqueue
for a throttled copy, it can time out, and then abort
its request.  If the copy is actually running (as opposed
to merely queued and waiting to run), the abort includes
calling this code in tasks.ProcessPoolTaskManager:
  def TerminateTask(self, request_id):
    del self._pending_results[request_id]


... that is, the method merely removes the record of the
running task without actually stopping it.  Because the record
is gone, subsequent calls to `HasCapacity()` will return true,
allowing `ProcessRequests()` to add additional tasks.
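
A minimal standalone sketch of that bookkeeping flaw (the class
name, attributes, and capacity limit here are illustrative
assumptions, not the actual chromite code):

  class _SketchTaskManager(object):
      """Illustrative only; not the real ProcessPoolTaskManager."""

      def __init__(self, max_tasks):
          self._max_tasks = max_tasks
          self._pending_results = {}    # request_id -> AsyncResult

      def HasCapacity(self):
          # Counts only *recorded* tasks.  If an aborted task was
          # dropped from the dict while still running, this
          # over-reports the available capacity.
          return len(self._pending_results) < self._max_tasks

      def TerminateTask(self, request_id):
          # Forgets the record, but the pool worker keeps running
          # the task -- nothing actually stops the copy.
          del self._pending_results[request_id]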

We're using `multiprocessing.Pool` to keep track of the work,
so the extra tasks would queue up inside the Pool object,
spending time waiting but consuming no bandwidth (a standalone
illustration of this queuing follows the list below).  The net
effect is that some copy tasks would show excessively long run
times, rather than long wait times.  This could also aggravate
wait times, and cause more client timeouts through this
sequence:
  * Request A times out while running, and the client requests
    abort.
  * This bug allows a new request B to be started before A
    terminates.  The request will be accounted for as running,
    even though it gets no processing time.
  * If request B times out, the task for B _will_still_run_.
    It may not even start processing until after being aborted.
One effect of clients aborting work should be that when wait
times are long, the system self-regulates by eliminating work.
But, in the scenario above, work that should have been stopped
instead plows forward in stealth mode without reducing system
load.
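
The queuing behavior inside `multiprocessing.Pool` can be seen
with a small standalone script (purely an illustration, not
chromite code): a second task submitted to a single-worker pool
sits in the pool's internal queue until the first finishes, so
its apparent latency includes time spent waiting rather than
processing.

  import multiprocessing
  import time

  def slow_copy(label):
      time.sleep(2)                 # stand-in for a throttled scp
      return label

  if __name__ == '__main__':
      pool = multiprocessing.Pool(processes=1)
      start = time.time()
      result_a = pool.apply_async(slow_copy, ('A',))
      result_b = pool.apply_async(slow_copy, ('B',))  # queues behind A
      result_b.get()
      # B processed for ~2s but was visible for ~4s, because it sat in
      # the pool's queue while A held the only worker.
      print('B visible for %.1fs' % (time.time() - start))
      pool.close()
      pool.join()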

 
Components: Infra>Client>ChromeOS
Owner: jrbarnette@chromium.org
Status: Started (was: Untriaged)
Fixing this is a bit of a nuisance:  multiprocessing.Pool has
no provision for terminating a task started with `apply_async()`,
so unless we use some different facility for managing the work,
we have to let copies finish on their own.

Even if there were some alternative facility, the work we really
want to terminate is in a subprocess running 'scp'.  So, we'd still
have to do work to make it possible to cleanly terminate invocations
of `RemoteDevice.CopyToDevice()`.  Mostly, that seems not worth it.
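
If we did decide to make the copy abortable, the helper would
have to hold onto the underlying 'scp' process so an abort can
signal it.  A rough sketch of what that might look like (the
`AbortableCopy` class and its interface are hypothetical, not
an existing chromite API):

  import subprocess

  class AbortableCopy(object):
      """Hypothetical wrapper; RemoteDevice.CopyToDevice() exposes
      no such hook today."""

      def __init__(self, scp_args):
          self._proc = subprocess.Popen(scp_args)

      def Wait(self):
          return self._proc.wait()

      def Abort(self):
          # Stop the copy mid-flight; the caller is still on the hook
          # for cleaning up any partially transferred payload.
          if self._proc.poll() is None:
              self._proc.terminate()
              self._proc.wait()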

So, the obvious fix is, after a task is aborted, to have
`ProcessPoolTaskManager` carry the running task until it sees a
completion.  That's a bit awkward, but it would at least allow
for reliably terminating any aborted work that hasn't actually
started processing.
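
A sketch of that approach, again with illustrative names rather
than the committed chromite code: mark aborted requests instead
of deleting them, keep counting them against capacity, and drop
them only once the pool reports completion.

  class _SketchTaskManager(object):
      """Illustrative sketch of the 'carry until completion' idea."""

      def __init__(self, max_tasks):
          self._max_tasks = max_tasks
          self._pending_results = {}    # request_id -> AsyncResult
          self._aborted = set()

      def HasCapacity(self):
          # Aborted-but-unfinished tasks still count against capacity.
          return len(self._pending_results) < self._max_tasks

      def TerminateTask(self, request_id):
          # Remember the abort; reclaim capacity only on completion.
          self._aborted.add(request_id)

      def Reap(self):
          # Yield results for finished tasks; silently drop finished
          # tasks that were aborted while queued or running.
          for request_id in list(self._pending_results):
              result = self._pending_results[request_id]
              if result.ready():
                  del self._pending_results[request_id]
                  if request_id in self._aborted:
                      self._aborted.discard(request_id)
                  else:
                      yield request_id, result.get()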

Blocking: 760652
Comment 3 by bugdroid1@chromium.org (Project Member), Oct 15 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/fc019898a274c3086109ca7fa67e82b1da87616a

commit fc019898a274c3086109ca7fa67e82b1da87616a
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Fri Oct 13 21:44:54 2017

[workqueue] Handle aborted tasks properly

When a client aborts a running task, we can't account for it as
increased capacity until the running task actually terminates.
multiprocessing.Pool doesn't allow premature termination, so we
have to wait for the task to complete normally before recording
an increase in capacity.

BUG= chromium:767211 
TEST=unit tests

Change-Id: Ie332770e117cd1c690c7c2838d2d75fc65c17098
Reviewed-on: https://chromium-review.googlesource.com/693379
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/fc019898a274c3086109ca7fa67e82b1da87616a/lib/workqueue/tasks_unittest.py
[modify] https://crrev.com/fc019898a274c3086109ca7fa67e82b1da87616a/lib/workqueue/tasks.py

Status: Fixed (was: Started)
