
Issue 751788

Issue metadata

Status: WontFix
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Feature

Blocked on:
issue 757001
issue 759154
issue 760652




Provide a way to rate-limit provisioning bandwidth on devservers

Reported by jrbarnette@chromium.org, Aug 2 2017

Issue description

We need a way to control/limit the system bandwidth
consumption from provisioning jobs on devservers.

The design proposal is here:
    https://docs.google.com/document/d/1jxZOZsu3h7aMIjgQa-WFaexlVSEvCz5KMs6LOnzLP4c/edit

The essence of the proposal is that when we copy payload
files from the devserver to the DUT, the copy operations
should be throttled so that the number of simultaneously
running operations can't exceed system bandwidth.  The
proposed implementation is to forward all payload copy
operations to a centralized service that will perform the
copies in FIFO order, subject to the throttling constraint.
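
To illustrate the throttling idea described above, here is a minimal
sketch in Python.  It is not the proposed implementation: the class
name `FifoThrottle` and its methods are hypothetical, and a fixed cap
on simultaneous copies stands in for the system bandwidth constraint.

```python
import collections


class FifoThrottle:
    """Admit queued requests in FIFO order, at most `max_running` at once.

    Hypothetical illustration only; not taken from the design doc.
    """

    def __init__(self, max_running):
        self.max_running = max_running
        self.pending = collections.deque()  # requests waiting their turn
        self.running = set()                # requests currently copying

    def submit(self, request_id):
        """Queue a copy request behind all earlier requests."""
        self.pending.append(request_id)

    def start_ready(self):
        """Start queued requests while capacity remains; return those started."""
        started = []
        while self.pending and len(self.running) < self.max_running:
            request_id = self.pending.popleft()
            self.running.add(request_id)
            started.append(request_id)
        return started

    def finish(self, request_id):
        """Record that a copy completed, freeing one slot."""
        self.running.discard(request_id)
```

With a cap of 2, submitting three requests starts the first two
immediately; the third starts only after one of them finishes.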

 
Project Member

Comment 1 by bugdroid1@chromium.org, Aug 9 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/8dd08b9364e237d0b320a9aeea83522e4e7ba7ed

commit 8dd08b9364e237d0b320a9aeea83522e4e7ba7ed
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Aug 09 18:47:56 2017

Add the basic service API for provisioning throttling.

This adds the core API for making payload copy requests to
a centralized server able to schedule those requests according
to load.  The API consists of a single class that provides
both the client and server side support:
  * Three client-side methods provide for making requests,
    waiting for request completion, and aborting requests.
  * A server-side method provides a processing loop that
    receives requests, and schedules them with the task
    manager.

BUG= chromium:751788 
TEST=Unit tests

Change-Id: I2910224ad15dc04f03394117c10d219eb6287c7e
Reviewed-on: https://chromium-review.googlesource.com/598632
Commit-Ready: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[add] https://crrev.com/8dd08b9364e237d0b320a9aeea83522e4e7ba7ed/lib/workqueue/service_unittest.py
[add] https://crrev.com/8dd08b9364e237d0b320a9aeea83522e4e7ba7ed/lib/workqueue/service.py
[add] https://crrev.com/8dd08b9364e237d0b320a9aeea83522e4e7ba7ed/lib/workqueue/service_unittest
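
For readers without the source handy, the shape of the API described in
the commit message above (one class, with client-side methods to
request, wait, and abort, plus a server-side processing loop) might look
roughly like the sketch below.  This is an assumption-laden
illustration, not the code in `service.py`: the class
`InMemoryWorkQueue` and all of its method names are hypothetical, state
is kept in memory, and the handler runs inline instead of being
scheduled with a task manager.

```python
import collections
import itertools


class InMemoryWorkQueue:
    """Hypothetical stand-in: one class with client and server methods."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._requested = collections.deque()  # FIFO of request ids
        self._args = {}                        # request id -> arguments
        self._results = {}                     # request id -> result
        self._aborted = set()

    # Client side.
    def enqueue(self, args):
        """Make a request; return an id for later result() or abort()."""
        request_id = next(self._ids)
        self._args[request_id] = args
        self._requested.append(request_id)
        return request_id

    def result(self, request_id):
        """Return the result if complete, else None.

        A real client would block or poll here; this sketch only checks.
        """
        return self._results.get(request_id)

    def abort(self, request_id):
        """Ask the server to skip a request it has not yet processed."""
        self._aborted.add(request_id)

    # Server side.
    def process_once(self, handler):
        """Handle every queued request in FIFO order, skipping aborted ones."""
        while self._requested:
            request_id = self._requested.popleft()
            args = self._args.pop(request_id)
            if request_id in self._aborted:
                self._aborted.discard(request_id)
                continue
            self._results[request_id] = handler(args)
```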

Project Member

Comment 2 by bugdroid1@chromium.org, Aug 9 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/3ad08e286f75508ae53f3d4ad41937e902b3a3cc

commit 3ad08e286f75508ae53f3d4ad41937e902b3a3cc
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Aug 09 18:47:56 2017

Add a workqueue task manager for provisioning throttling.

This adds early code to support provisioning throttling.  The
code provides a task manager that is responsible for managing
a set of independently running processes that in turn are
responsible for actually performing copies of payload files
during provisioning.

BUG= chromium:751788 
TEST=Unit tests

Change-Id: I600490978491f42546d9d73c1731c5507e673a35
Reviewed-on: https://chromium-review.googlesource.com/598730
Commit-Ready: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[add] https://crrev.com/3ad08e286f75508ae53f3d4ad41937e902b3a3cc/lib/workqueue/tasks_unittest
[add] https://crrev.com/3ad08e286f75508ae53f3d4ad41937e902b3a3cc/lib/workqueue/tasks_unittest.py
[add] https://crrev.com/3ad08e286f75508ae53f3d4ad41937e902b3a3cc/lib/workqueue/tasks.py
[add] https://crrev.com/3ad08e286f75508ae53f3d4ad41937e902b3a3cc/lib/workqueue/__init__.py
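
As a rough picture of what "managing a set of independently running
processes" can involve, here is a small sketch built on the standard
`subprocess` module.  It is not the code in `tasks.py`: the class
`ProcessTaskManager`, its methods, and the `run_demo` helper are all
hypothetical, and a trivial Python child process stands in for a
payload copy.

```python
import subprocess
import sys
import time


class ProcessTaskManager:
    """Hypothetical manager for a capped set of child processes."""

    def __init__(self, max_tasks):
        self.max_tasks = max_tasks
        self._procs = {}  # task id -> subprocess.Popen

    def has_capacity(self):
        """Return True if another task may be started."""
        return len(self._procs) < self.max_tasks

    def start(self, task_id, argv):
        """Launch `argv` as an independent process tracked under `task_id`."""
        if not self.has_capacity():
            raise RuntimeError('task manager is at capacity')
        self._procs[task_id] = subprocess.Popen(argv)

    def reap(self):
        """Return {task_id: exit_code} for tasks that finished; forget them."""
        done = {}
        for task_id, proc in list(self._procs.items()):
            code = proc.poll()
            if code is not None:
                done[task_id] = code
                del self._procs[task_id]
        return done

    def terminate(self, task_id):
        """Stop a running task, if it is still being tracked."""
        proc = self._procs.pop(task_id, None)
        if proc is not None:
            proc.terminate()
            proc.wait()


def run_demo():
    """Run one no-op child process and return its reaped exit status."""
    manager = ProcessTaskManager(max_tasks=1)
    manager.start('copy-1', [sys.executable, '-c', 'pass'])
    finished = {}
    while not finished:
        time.sleep(0.05)  # poll at a modest interval
        finished.update(manager.reap())
    return finished
```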

Project Member

Comment 3 by bugdroid1@chromium.org, Aug 9 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/c83431272ef200b70c2aeaeaa02a35e4151ef9f0

commit c83431272ef200b70c2aeaeaa02a35e4151ef9f0
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Aug 09 18:47:56 2017

Add the `workqueue_server` command.

This adds a new daemon to implement the server side of provisioning
throttling.  The daemon takes payload copy requests, and executes them
in order.  Copy requests are rate-limited according to a command line
option.

BUG= chromium:751788 
TEST=run the daemon; test calls via python CLI.

Change-Id: I80600c9d59a898fe176845c99b8df00bdbacb624
Reviewed-on: https://chromium-review.googlesource.com/604941
Commit-Ready: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[add] https://crrev.com/c83431272ef200b70c2aeaeaa02a35e4151ef9f0/lib/workqueue/workqueue_server.py
[add] https://crrev.com/c83431272ef200b70c2aeaeaa02a35e4151ef9f0/lib/workqueue/throttle.py
[add] https://crrev.com/c83431272ef200b70c2aeaeaa02a35e4151ef9f0/lib/workqueue/copy_handler.py
[add] https://crrev.com/c83431272ef200b70c2aeaeaa02a35e4151ef9f0/lib/workqueue/workqueue_server

Project Member

Comment 4 by bugdroid1@chromium.org, Aug 11 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/42cb6c90e4d5a7a958b6ee536d82f8b8adff4de4

commit 42cb6c90e4d5a7a958b6ee536d82f8b8adff4de4
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Fri Aug 11 17:35:53 2017

Add support for making throttled payload copy calls.

This adds support in the provisioning code flow for requesting
payload file copies via the throttling work queue.  The new code
is disabled, and will be enabled for production with a separate
change.

BUG= chromium:751788 
TEST=run provisioning with a local autotest instance

Change-Id: I89907d2f6817aca461298f88353a6810ec99460a
Reviewed-on: https://chromium-review.googlesource.com/606693
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/42cb6c90e4d5a7a958b6ee536d82f8b8adff4de4/lib/remote_access.py
[modify] https://crrev.com/42cb6c90e4d5a7a958b6ee536d82f8b8adff4de4/lib/auto_updater.py

Plan following the commits above:

 1. Commit two CLs, one for logging, one for metrics:
     * Logging changes: crosreview.com/611159
     * Metrics changes: in progress; to be uploaded.
 2. Get an upstart job for the service reviewed and committed:
     * crosreview.com/i/427388
 3. Deploy all the changes, and confirm that the service is able
    to perform in prod the way it did in testing.
 4. Commit the CL to enable using the service:
     * crosreview.com/606694
 5. Roll out the final change, and monitor the outcome.  Adjust
    and/or revert as necessary.

Project Member

Comment 6 by bugdroid1@chromium.org, Aug 14 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/9b938ecc325996375e7429dd4121eb550b2be9a3

commit 9b938ecc325996375e7429dd4121eb550b2be9a3
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Mon Aug 14 21:57:27 2017

Fix the "Throttled copy" log message.

The message text was associating the hostname with the source
path, when the host is really part of the destination.

BUG= chromium:751788 
TEST=run provisioning with a local autotest instance.  _Look at the logs_.

Change-Id: Ie1f472f9a0bb3018484d275549070113ca8b5f69
Reviewed-on: https://chromium-review.googlesource.com/612446
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/9b938ecc325996375e7429dd4121eb550b2be9a3/lib/remote_access.py

Project Member

Comment 7 by bugdroid1@chromium.org, Aug 15 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/3db8994dfb61498bf6baf89477df36b975c4bb47

commit 3db8994dfb61498bf6baf89477df36b975c4bb47
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Tue Aug 15 03:37:55 2017

Add debug logging to the workqueue server code.

This adds some basic logging to the WorkQueueServer class so that
request traffic and progress can be extracted after the fact.

BUG= chromium:751788 
TEST=Unit tests; manually test the daemon with --debug

Change-Id: Iee2b905ddebd0b52eb021446fe57a76404d307fd
Reviewed-on: https://chromium-review.googlesource.com/611159
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/3db8994dfb61498bf6baf89477df36b975c4bb47/lib/workqueue/service.py

Update on the plan/status.

1. These CLs need review/approval:
    crosreview.com/614305 (Metrics for the service)
    crosreview.com/i/427388 (The upstart job)

2. Commit the metrics CL, and push all existing chromite code
   to the devservers.

3. Commit the upstart job, and see that puppet rolls it out.

4. Run sanity checks on the new service, and the upstart job.
   In particular, see that metrics are coming out.

5. Commit this CL:
    crosreview.com/606694 (Enable chromite to use the service)

6. Push the changes to the devservers.

7. Monitor.


Project Member

Comment 9 by bugdroid1@chromium.org, Aug 17 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/a69561f1b823af32a41f6a106a5d999b031a1bfb

commit a69561f1b823af32a41f6a106a5d999b031a1bfb
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Aug 17 18:31:50 2017

Add metrics reporting to the workqueue server code.

This adds Monarch metrics to the provisioning workqueue:
  * A counter to track polling ticks.
  * A gauge to track uncompleted requests.
  * Time distributions for time spent in various states.
  * Counters to track request receipt and final disposition.

BUG= chromium:751788 
TEST=unit tests

Change-Id: I4a9cd88d3c9200b0ddcd47bb2bc366b04bdaa120
Reviewed-on: https://chromium-review.googlesource.com/614305
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/a69561f1b823af32a41f6a106a5d999b031a1bfb/lib/workqueue/service.py
[modify] https://crrev.com/a69561f1b823af32a41f6a106a5d999b031a1bfb/lib/workqueue/workqueue_server.py

Blockedon: 756581
Blockedon: -756581 757001
Waiting for another push, to see if this time it all works.

Project Member

Comment 12 by bugdroid1@chromium.org, Aug 25 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/c84b909754fe9a91722f5620365a96d4545c0743

commit c84b909754fe9a91722f5620365a96d4545c0743
Author: Richard Barnette <jrbarnette@google.com>
Date: Fri Aug 25 18:03:38 2017

Blockedon: 759154
We're nearly done with this step:
> 4. Run sanity checks on the new service, and the upstart job.
>    In particular, see that metrics are coming out.

The service is up and running on every devserver save one.
Tomorrow, we'll have sufficient data to declare that metrics
are there (or not).

Assuming all goes well, we go live shortly after that with
these two steps:

> 5. Commit this CL:
>     crosreview.com/606694 (Enable chromite to use the service)
>
> 6. Push the changes to the devservers.

Project Member

Comment 15 by bugdroid1@chromium.org, Aug 29 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/c3bf4aafb197e65f00f193628c5c0f8d20ce3059

commit c3bf4aafb197e65f00f193628c5c0f8d20ce3059
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Tue Aug 29 18:24:36 2017

Enable throttled payload copies during provisioning.

This change enables payload copy throttling in the provisioning code
flow.

BUG= chromium:751788 
TEST=run provisioning in a local autotest instance using the workqueue server

Change-Id: I4baecb7ec20c284db0bfc507b7318a6197bd9d88
Reviewed-on: https://chromium-review.googlesource.com/606694
Trybot-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/c3bf4aafb197e65f00f193628c5c0f8d20ce3059/lib/auto_updater.py

The change to enable throttling went live just before 14:00
yesterday (Tuesday), and was reverted at 16:00 because of
a dramatic spike in provision failures.  The cause of the
spike is under investigation.

Blockedon: 760652
Cc: johndhong@chromium.org
Labels: -Pri-1 Pri-2
Back burner, for now at least.

Arguably this becomes a moot point once I finish adding servers with 2x10Gb in 2018?
Status: WontFix (was: Started)
This is no longer worth the trouble required.

Currently, provision failure rates aren't serious enough to be
driving overall failures, and we've got much more powerful devservers
to protect ourselves from unruly load.
