Provide a way to rate-limit provisioning bandwidth on devservers
Reported by
jrbarnette@chromium.org,
Aug 2 2017
Issue description
We need a way to control/limit the system bandwidth
consumption from provisioning jobs on devservers.
The design proposal is here:
https://docs.google.com/document/d/1jxZOZsu3h7aMIjgQa-WFaexlVSEvCz5KMs6LOnzLP4c/edit
The essence of the proposal is that when we copy payload
files from the devserver to the DUT, the copy operations
should be throttled so that the number of simultaneously
running operations can't exceed system bandwidth. The
proposed implementation is to forward all payload copy
operations to a centralized service that will perform the
copies in FIFO order, subject to the throttling constraint.
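As a rough illustration of the proposal (not the chromite implementation), FIFO ordering with a cap on concurrently running copies might be sketched as below. The class name `ThrottledCopyQueue`, its methods, and the `max_running` parameter are hypothetical, chosen only for this example. The sketch tracks which requests may run and leaves the actual file copy to the caller.

```python
import collections


class ThrottledCopyQueue(object):
    """FIFO queue that caps how many copy requests run at once.

    Hypothetical sketch; none of these names come from chromite.
    """

    def __init__(self, max_running):
        if max_running < 1:
            raise ValueError('max_running must be at least 1')
        self._max_running = max_running
        self._pending = collections.deque()  # waiting requests, oldest first
        self._running = set()                # requests currently copying

    def submit(self, request_id):
        """Queue a request; return the requests that may start now."""
        self._pending.append(request_id)
        return self._start_ready()

    def complete(self, request_id):
        """Mark a running request finished; return newly started requests."""
        self._running.remove(request_id)
        return self._start_ready()

    def _start_ready(self):
        """Move pending requests to running, in FIFO order, up to the cap."""
        started = []
        while self._pending and len(self._running) < self._max_running:
            request_id = self._pending.popleft()
            self._running.add(request_id)
            started.append(request_id)
        return started

    @property
    def running(self):
        return frozenset(self._running)

    @property
    def pending(self):
        return list(self._pending)
```

With `max_running=2`, a third and fourth `submit()` return an empty list and wait; each `complete()` then releases the oldest waiting request first.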
,
Aug 9 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/3ad08e286f75508ae53f3d4ad41937e902b3a3cc
commit 3ad08e286f75508ae53f3d4ad41937e902b3a3cc
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Aug 09 18:47:56 2017
Add a workqueue task manager for provisioning throttling.
This adds early code to support provisioning throttling. The code provides a task manager that is responsible for managing a set of independently running processes that in turn are responsible for actually performing copies of payload files during provisioning.
BUG= chromium:751788
TEST=Unit tests
Change-Id: I600490978491f42546d9d73c1731c5507e673a35
Reviewed-on: https://chromium-review.googlesource.com/598730
Commit-Ready: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>
[add] https://crrev.com/3ad08e286f75508ae53f3d4ad41937e902b3a3cc/lib/workqueue/tasks_unittest
[add] https://crrev.com/3ad08e286f75508ae53f3d4ad41937e902b3a3cc/lib/workqueue/tasks_unittest.py
[add] https://crrev.com/3ad08e286f75508ae53f3d4ad41937e902b3a3cc/lib/workqueue/tasks.py
[add] https://crrev.com/3ad08e286f75508ae53f3d4ad41937e902b3a3cc/lib/workqueue/__init__.py
,
Aug 9 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/c83431272ef200b70c2aeaeaa02a35e4151ef9f0
commit c83431272ef200b70c2aeaeaa02a35e4151ef9f0
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Aug 09 18:47:56 2017
Add the `workqueue_server` command.
This adds a new daemon to implement the server side of provisioning throttling. The daemon takes payload copy requests, and executes them in order. Copy requests are rate-limited according to a command line option.
BUG= chromium:751788
TEST=run the daemon; test calls via python CLI.
Change-Id: I80600c9d59a898fe176845c99b8df00bdbacb624
Reviewed-on: https://chromium-review.googlesource.com/604941
Commit-Ready: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
[add] https://crrev.com/c83431272ef200b70c2aeaeaa02a35e4151ef9f0/lib/workqueue/workqueue_server.py
[add] https://crrev.com/c83431272ef200b70c2aeaeaa02a35e4151ef9f0/lib/workqueue/throttle.py
[add] https://crrev.com/c83431272ef200b70c2aeaeaa02a35e4151ef9f0/lib/workqueue/copy_handler.py
[add] https://crrev.com/c83431272ef200b70c2aeaeaa02a35e4151ef9f0/lib/workqueue/workqueue_server
,
Aug 11 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/42cb6c90e4d5a7a958b6ee536d82f8b8adff4de4
commit 42cb6c90e4d5a7a958b6ee536d82f8b8adff4de4
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Fri Aug 11 17:35:53 2017
Add support for making throttled payload copy calls.
This adds support in the provisioning code flow for requesting payload file copies via the throttling work queue. The new code is disabled, and will be enabled for production with a separate change.
BUG= chromium:751788
TEST=run provisioning with a local autotest instance
Change-Id: I89907d2f6817aca461298f88353a6810ec99460a
Reviewed-on: https://chromium-review.googlesource.com/606693
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Richard Barnette <jrbarnette@google.com>
[modify] https://crrev.com/42cb6c90e4d5a7a958b6ee536d82f8b8adff4de4/lib/remote_access.py
[modify] https://crrev.com/42cb6c90e4d5a7a958b6ee536d82f8b8adff4de4/lib/auto_updater.py
,
Aug 14 2017
Plan following the commits above:
1. Commit two CLs, one for logging, one for metrics:
   * Logging changes: crosreview.com/611159
   * Metrics changes in progress; to be uploaded.
2. Get an upstart job for the service reviewed and committed:
   * crosreview.com/i/427388
3. Deploy all the changes, and confirm that the service is able
   to perform in prod the way it did in testing.
4. Commit the CL to enable using the service:
   * crosreview.com/606694
5. Roll out the final change, and monitor the outcome. Adjust
   and/or revert as necessary.
,
Aug 14 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/9b938ecc325996375e7429dd4121eb550b2be9a3
commit 9b938ecc325996375e7429dd4121eb550b2be9a3
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Mon Aug 14 21:57:27 2017
Fix the "Throttled copy" log message.
The message text was associating the hostname with the source path, when the host is really part of the destination.
BUG= chromium:751788
TEST=run provisioning with a local autotest instance. _Look at the logs_.
Change-Id: Ie1f472f9a0bb3018484d275549070113ca8b5f69
Reviewed-on: https://chromium-review.googlesource.com/612446
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
[modify] https://crrev.com/9b938ecc325996375e7429dd4121eb550b2be9a3/lib/remote_access.py
,
Aug 15 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/3db8994dfb61498bf6baf89477df36b975c4bb47
commit 3db8994dfb61498bf6baf89477df36b975c4bb47
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Tue Aug 15 03:37:55 2017
Add debug logging to the workqueue server code.
This adds some basic logging to the WorkQueueServer class so that request traffic and progress can be extracted after the fact.
BUG= chromium:751788
TEST=Unit tests; manually test the daemon with --debug
Change-Id: Iee2b905ddebd0b52eb021446fe57a76404d307fd
Reviewed-on: https://chromium-review.googlesource.com/611159
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
[modify] https://crrev.com/3db8994dfb61498bf6baf89477df36b975c4bb47/lib/workqueue/service.py
,
Aug 16 2017
Update on the plan/status.
1. These CLs need review/approval:
crosreview.com/614305 (Metrics for the service)
crosreview.com/i/427388 (The upstart job)
2. Commit the metrics CL, and push all existing chromite code
to the devservers.
3. Commit the upstart job, and see that puppet rolls it out.
4. Run sanity checks on the new service, and the upstart job.
In particular, see that metrics are coming out.
5. Commit this CL:
crosreview.com/606694 (Enable chromite to use the service)
6. Push the changes to the devservers.
7. Monitor.
,
Aug 17 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/a69561f1b823af32a41f6a106a5d999b031a1bfb
commit a69561f1b823af32a41f6a106a5d999b031a1bfb
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Aug 17 18:31:50 2017
Add metrics reporting to the workqueue server code.
This adds Monarch metrics to the provisioning workqueue:
* A counter to track polling ticks.
* A gauge to track uncompleted requests.
* Time distributions for time spent in various states.
* Counters to track request receipt and final disposition.
BUG= chromium:751788
TEST=unit tests
Change-Id: I4a9cd88d3c9200b0ddcd47bb2bc366b04bdaa120
Reviewed-on: https://chromium-review.googlesource.com/614305
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
[modify] https://crrev.com/a69561f1b823af32a41f6a106a5d999b031a1bfb/lib/workqueue/service.py
[modify] https://crrev.com/a69561f1b823af32a41f6a106a5d999b031a1bfb/lib/workqueue/workqueue_server.py
,
Aug 17 2017
,
Aug 25 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/c84b909754fe9a91722f5620365a96d4545c0743 commit c84b909754fe9a91722f5620365a96d4545c0743 Author: Richard Barnette <jrbarnette@google.com> Date: Fri Aug 25 18:03:38 2017
,
Aug 25 2017
,
Aug 29 2017
We're nearly done with this step:
> 4. Run sanity checks on the new service, and the upstart job.
>    In particular, see that metrics are coming out.
The service is up and running on every devserver save one.
Tomorrow, we'll have sufficient data to declare that metrics
are there (or not).
Assuming all goes well, we go live shortly after that with
these two steps:
> 5. Commit this CL:
>    crosreview.com/606694 (Enable chromite to use the service)
>
> 6. Push the changes to the devservers.
,
Aug 29 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/c3bf4aafb197e65f00f193628c5c0f8d20ce3059
commit c3bf4aafb197e65f00f193628c5c0f8d20ce3059
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Tue Aug 29 18:24:36 2017
Enable throttled payload copies during provisioning.
This change enables payload copy throttling in the provisioning code flow.
BUG= chromium:751788
TEST=run provisioning in a local autotest instance using the workqueue server
Change-Id: I4baecb7ec20c284db0bfc507b7318a6197bd9d88
Reviewed-on: https://chromium-review.googlesource.com/606694
Trybot-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
[modify] https://crrev.com/c3bf4aafb197e65f00f193628c5c0f8d20ce3059/lib/auto_updater.py
,
Aug 30 2017
Yesterday's change went live just before 14:00 (Tuesday), and was reverted at 16:00 because of a dramatic spike in provision failures. The cause of the spike is under investigation.
,
Aug 30 2017
,
Dec 12 2017
,
Jan 30 2018
Back burner, for now at least.
,
Jan 30 2018
Arguably a moot point once I finish adding servers with 2x10Gb in 2018?
,
Jun 26 2018
This is no longer worth the trouble required. Currently, provision failure rates aren't serious enough to be driving overall failures, and we've got much more powerful devservers to protect ourselves from unruly load.
Comment 1 by bugdroid1@chromium.org
, Aug 9 2017