New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 746067 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Nov 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 782479



Sign in to add a comment

pool auto-balancer should be metrics-instrumented

Project Member Reported by akes...@chromium.org, Jul 18 2017

Issue description

auto-balancer should send metrics that allow us to determine:
 - # of duts shuffled around (per board)
 - # of boards/pools with insufficient spare capacity
 - # of boards/pools which were quarantined and on which shuffling was skipped
 
Cc: jkop@chromium.org
+jkop candidate starter bug, to get familiar with our metrics/monitoring system. This code lives in the autotest code base, the tool is called site_utils/balance_pools.py
Owner: jkop@chromium.org
Status: Assigned (was: Untriaged)
Provisionally assigned. Remove owner, set status -> Untriaged to put back in triage queue.

Comment 3 by jkop@chromium.org, Aug 23 2017

Status: Started (was: Assigned)

Comment 4 Deleted

Comment 5 by jkop@chromium.org, Aug 29 2017

I can't proceed safely until issue 759862 is fixed; repo hooks fail.

Comment 6 by jkop@chromium.org, Sep 5 2017

Blockedon: 762149

Comment 7 by jkop@chromium.org, Oct 2 2017

Blockedon: -762149 -759862
Project Member

Comment 8 by bugdroid1@chromium.org, Oct 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/c6e483e1d707f473d54b85e6f4531b300b3f6b29

commit c6e483e1d707f473d54b85e6f4531b300b3f6b29
Author: Jacob Kopczynski <jkop@chromium.org>
Date: Fri Oct 06 04:24:42 2017

Add metrics to balancer, remove board indices.

Boolean metrics for boards and pools depleted of capacity and for pools which
 are quarantined.
Counter metric for moved DUTs parameterized by board, source pool, and target
 pool.
No longer assign indices to boards while balancing them.

BUG= chromium:746067 
TEST=ran 'balance_pool --all-boards suites'

Change-Id: Ic42ddd99b45b7ee5d6cbf40dde5b566dcce9f828
Reviewed-on: https://chromium-review.googlesource.com/639692
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/c6e483e1d707f473d54b85e6f4531b300b3f6b29/site_utils/balance_pools.py

I just ran balance pool to test out the behavior.  I got this:
    $ balance_pool bvt stumpy
    INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): metadata.google.internal
    INFO:root:Configuration file does not exist, ignoring: /etc/chrome-infra/ts-mon.json
    ERROR:root:ts_mon monitoring is disabled because the endpoint provided is invalid or not supported:
    NOTICE:root:ts_mon was set up.

    Balancing stumpy bvt pool:
    Total 7 DUTs, 6 working, 1 broken, 0 reserved.
    Target is 7 working DUTs; grow pool by 1 DUTs.
    stumpy bvt pool has 0 spares available.
    ERROR: Not enough spares: need 1, only have 0.

'balance_pool' is primarily a command line tool for use
by humans; its use in a cron job is secondary.  The extra
noise from ts_mon mustn't be there.  Moreover, we don't want
metrics uploaded from any code flow/system other than the
specific cron job context.

I mentioned it in code review comments; I'll mention it here:
The metrics uploading behavior needs to be conditioned on the
presence of a specific command-line option.  Then, that option
needs to be added to the cron job invocation for automatic balancing.
When the option isn't present on the command line, there should be
no attempt to upload metrics.

I was on the fence on this question, but I'm beginning to agree with Richard.

Let's add an optional --production flag to balance_pool, and have it only set up metrics if that flag is provided.

(Should be a simple matter of only creating the SetupTsMonGlobalState context in --production; none of the metric sending lines need conditionals, the wrapper lib will silently ignore those if ts_mon isn't set up).

Comment 11 by jkop@chromium.org, Oct 9 2017

I'm not clear on why human-initiated balancing triggering metrics is a bad thing. Please explain.

1. We can't reliably know that people running this script locally even have the right metrics credentials on their machine. So we'll have a mix of some users that send metrics and some that don't.

2. It wasn't clear at the project outset, but I argue that the point of this metric is to monitor the health of our autobalancing service, so we know how much work it is doing and whether it is going down.

3. If we want a metric that tell us globally how often boards are getting shuffled around, then due to #1 the metric you are adding cannot reliably be that, and we'll need to figure out some other way to measure that or solve #1.
> 3. If we want a metric that tell us globally how often boards are getting shuffled around,

Also, such a metric is suspect anyway, since we can run 'balance_pools'
for reasons that have nothing to do with dealing with broken devices.
Most importantly, we use it to adjust pool sizes, and create new pool
allocations.

Project Member

Comment 14 by bugdroid1@chromium.org, Oct 27 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/f8d90a8e09373c0d5950faae157540d71606f524

commit f8d90a8e09373c0d5950faae157540d71606f524
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri Oct 27 01:09:24 2017

Make pool metrics disabled except as flagged

Add --production flag, intended for use only by auto-balancer, which
 enables metrics. Without the flag, hide the metrics.

BUG= chromium:746067 
TEST=ran 'balance_pool --all-boards suites'

Change-Id: I6bdc1efbde1c0ab3e04e3bcafb0f575cffff270b
Reviewed-on: https://chromium-review.googlesource.com/710518
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Jacob Kopczynski <jkop@chromium.org>

[modify] https://crrev.com/f8d90a8e09373c0d5950faae157540d71606f524/site_utils/balance_pools.py
[modify] https://crrev.com/f8d90a8e09373c0d5950faae157540d71606f524/contrib/run-pool-inventory

Comment 15 by jkop@chromium.org, Oct 30 2017

Status: Fixed (was: Started)

Comment 16 by jkop@chromium.org, Nov 8 2017

Blocking: 782479

Comment 17 by jkop@chromium.org, Nov 10 2017

Status: Started (was: Fixed)
While investigating  crbug.com/782479 , discovered that two of the three metrics created have submitted no data so far; chromeos/autotest/balance_pools/exhausted_pools and chromeos/autotest/balance_pools/duts_moved.

Currently investigating why.

Comment 18 by jkop@chromium.org, Nov 29 2017

Status: Fixed (was: Started)
They were not submitting data because they missed the push. I probably should have verified that sooner.

Sign in to add a comment