pool auto-balancer should be metrics-instrumented |
||||||||
Issue descriptionauto-balancer should send metrics that allow us to determine: - # of duts shuffled around (per board) - # of boards/pools with insufficient spare capacity - # of boards/pools which were quarantined and on which shuffling was skipped
,
Aug 21 2017
Provisionally assigned. Remove owner, set status -> Untriaged to put back in triage queue.
,
Aug 23 2017
,
Aug 29 2017
I can't proceed safely until issue 759862 is fixed; repo hooks fail.
,
Sep 5 2017
,
Oct 6 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/c6e483e1d707f473d54b85e6f4531b300b3f6b29 commit c6e483e1d707f473d54b85e6f4531b300b3f6b29 Author: Jacob Kopczynski <jkop@chromium.org> Date: Fri Oct 06 04:24:42 2017 Add metrics to balancer, remove board indices. Boolean metrics for boards and pools depleted of capacity and for pools which are quarantined. Counter metric for moved DUTs parameterized by board, source pool, and target pool. No longer assign indices to boards while balancing them. BUG= chromium:746067 TEST=ran 'balance_pool --all-boards suites' Change-Id: Ic42ddd99b45b7ee5d6cbf40dde5b566dcce9f828 Reviewed-on: https://chromium-review.googlesource.com/639692 Commit-Ready: Jacob Kopczynski <jkop@chromium.org> Tested-by: Jacob Kopczynski <jkop@chromium.org> Reviewed-by: Paul Hobbs <phobbs@google.com> [modify] https://crrev.com/c6e483e1d707f473d54b85e6f4531b300b3f6b29/site_utils/balance_pools.py
,
Oct 6 2017
I just ran balance pool to test out the behavior. I got this:
$ balance_pool bvt stumpy
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): metadata.google.internal
INFO:root:Configuration file does not exist, ignoring: /etc/chrome-infra/ts-mon.json
ERROR:root:ts_mon monitoring is disabled because the endpoint provided is invalid or not supported:
NOTICE:root:ts_mon was set up.
Balancing stumpy bvt pool:
Total 7 DUTs, 6 working, 1 broken, 0 reserved.
Target is 7 working DUTs; grow pool by 1 DUTs.
stumpy bvt pool has 0 spares available.
ERROR: Not enough spares: need 1, only have 0.
'balance_pool' is primarily a command line tool for use
by humans; its use in a cron job is secondary. The extra
noise from ts_mon mustn't be there. Moreover, we don't want
metrics uploaded from any code flow/system other than the
specific cron job context.
I mentioned it in code review comments; I'll mention it here:
The metrics uploading behavior needs to be conditioned on the
presence of a specific command-line option. Then, that option
needs to be added to the cron job invocation for automatic balancing.
When the option isn't present on the command line, there should be
no attempt to upload metrics.
,
Oct 6 2017
I was on the fence on this question, but I'm beginning to agree with Richard. Let's add an optional --production flag to balance_pool, and have it only set up metrics if that flag is provided. (Should be a simple matter of only creating the SetupTsMonGlobalState context in --production; none of the metric sending lines need conditionals, the wrapper lib will silently ignore those if ts_mon isn't set up).
,
Oct 9 2017
I'm not clear on why human-initiated balancing triggering metrics is a bad thing. Please explain.
,
Oct 9 2017
1. We can't reliably know that people running this script locally even have the right metrics credentials on their machine. So we'll have a mix of some users that send metrics and some that don't. 2. It wasn't clear at the project outset, but I argue that the point of this metric is to monitor the health of our autobalancing service, so we know how much work it is doing and whether it is going down. 3. If we want a metric that tell us globally how often boards are getting shuffled around, then due to #1 the metric you are adding cannot reliably be that, and we'll need to figure out some other way to measure that or solve #1.
,
Oct 11 2017
> 3. If we want a metric that tell us globally how often boards are getting shuffled around, Also, such a metric is suspect anyway, since we can run 'balance_pools' for reasons that have nothing to do with dealing with broken devices. Most importantly, we use it to adjust pool sizes, and create new pool allocations.
,
Oct 27 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/f8d90a8e09373c0d5950faae157540d71606f524 commit f8d90a8e09373c0d5950faae157540d71606f524 Author: Jacob Kopczynski <jkop@google.com> Date: Fri Oct 27 01:09:24 2017 Make pool metrics disabled except as flagged Add --production flag, intended for use only by auto-balancer, which enables metrics. Without the flag, hide the metrics. BUG= chromium:746067 TEST=ran 'balance_pool --all-boards suites' Change-Id: I6bdc1efbde1c0ab3e04e3bcafb0f575cffff270b Reviewed-on: https://chromium-review.googlesource.com/710518 Commit-Ready: Jacob Kopczynski <jkop@chromium.org> Tested-by: Jacob Kopczynski <jkop@chromium.org> Reviewed-by: Jacob Kopczynski <jkop@chromium.org> [modify] https://crrev.com/f8d90a8e09373c0d5950faae157540d71606f524/site_utils/balance_pools.py [modify] https://crrev.com/f8d90a8e09373c0d5950faae157540d71606f524/contrib/run-pool-inventory
,
Oct 30 2017
,
Nov 8 2017
,
Nov 10 2017
While investigating crbug.com/782479 , discovered that two of the three metrics created have submitted no data so far; chromeos/autotest/balance_pools/exhausted_pools and chromeos/autotest/balance_pools/duts_moved. Currently investigating why.
,
Nov 29 2017
They were not submitting data because they missed the push. I probably should have verified that sooner. |
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by akes...@chromium.org
, Aug 15 2017