New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 829124 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

alert in case of failing or missing balance_pool runs

Reported by jrbarnette@chromium.org, Apr 4 2018

Issue description

The automated balance_pool run seems to fail mostly or always
with the following error:
Traceback (most recent call last):
  File "site_utils/balance_pools.py", line 677, in <module>
    main(sys.argv)
  File "site_utils/balance_pools.py", line 663, in main
    balancer_targets = infer_balancer_targets(afe, arguments, pools)
  File "site_utils/balance_pools.py", line 610, in infer_balancer_targets
    for model in inventory.get_models(pool):
AttributeError: '_LabInventory' object has no attribute 'get_models'

The problem is caused by this code in balance_pool:
    def infer_balancer_targets(afe, arguments, pools):
        # ...
                    for model in inventory.get_models(pool):
                        labels = labellib.LabelsMapping()
                        labels['model'] = model
                        balancer_targets.append((pool, labels.getlabels()))
        # ...

The for-loop should be like this:
    for model in inventory.get_pool_models(pool):
        # ...

 
Project Member

Comment 2 by bugdroid1@chromium.org, Apr 5 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/685ac85e94f1b5c0efa0ada667fbe41891284773

commit 685ac85e94f1b5c0efa0ada667fbe41891284773
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Apr 05 00:32:56 2018

[autotest] Fix balance_pools as used by inventory.

The particular combination of balance_pools command line options
used by the daily inventory balancing was consitently failing
because of a call to a non-existent method.  This fixes the
offending call to use the correct method name.

BUG= chromium:829124 
TEST=run the command as for the full inventory; see it pass

Change-Id: I177fb8cb8432f333e44e314a48bfa3c5136bdedd
Reviewed-on: https://chromium-review.googlesource.com/996652
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Trybot-Ready: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/685ac85e94f1b5c0efa0ada667fbe41891284773/site_utils/balance_pools.py

Labels: Chase-Pending
Owner: ----
Status: Available (was: Started)
+Chase-Pending, for post-mortem on why we didn't have an alert
for the failures.

Cc: jrbarnette@chromium.org
 Issue 824895  has been merged into this issue.
Labels: -Chase-Pending Chase
Owner: akes...@chromium.org
Status: Assigned (was: Available)
alert on metrics missing , then alert on metrics indicating bad behavior
Did nothing, will look this week.
Issue 829011 has been merged into this issue.
Cc: pprabhu@chromium.org
None of the balance pool metrics appear to be working. Perhaps because they had field specs changed by https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/762597/5/site_utils/balance_pools.py#255

Plan of attack

1) delete in monarch the existing balance_pool metrics, so that monarch can re-learn the correct new field schema.

2) add a success_counter metric for this script, to prove that it is even able to emit metrics at all
Summary: alert in case of failing or missing balance_pool runs (was: Nightly automated balance_pool run fails)
Cc: jkop@chromium.org
Status: Started (was: Assigned)
Issue 807326 has been merged into this issue.
Re 8 I no longer suspect the field spec change, and I see via monarch_tool that metrics seem like they have the correct spec defitinitions.

I think the problem is that balance_pool is multiprocess but wasn't using indirect=True metrics.
Project Member

Comment 14 by bugdroid1@chromium.org, Apr 19 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/2cc427df297dd31d2e8562180d625400e62038f0

commit 2cc427df297dd31d2e8562180d625400e62038f0
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu Apr 19 19:46:07 2018

autotest: use indirect metrics in balance_pool

balance_pool uses multiple processes, so it needs to use indirect=True.

Also, get rid a metrics.Flush call that I suspect is unneeded, or will
be with follow-up work.

BUG= chromium:829124 
TEST=None

Change-Id: Icc16a7ba06c492e757630a5a0d88deb40827f70f
Reviewed-on: https://chromium-review.googlesource.com/1017814
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Jacob Kopczynski <jkop@chromium.org>

[modify] https://crrev.com/2cc427df297dd31d2e8562180d625400e62038f0/site_utils/balance_pools.py

Project Member

Comment 15 by bugdroid1@chromium.org, Apr 19 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/259a65078706e67e6f96420ea1551b92aa8fb600

commit 259a65078706e67e6f96420ea1551b92aa8fb600
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu Apr 19 19:46:09 2018

autotest: balance_pools: add a SuccessCounter

BUG= chromium:829124 
TEST=None

Change-Id: I36cff63f4f16b98b08d328e9e2a07b9e959b366e
Reviewed-on: https://chromium-review.googlesource.com/1017815
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Jacob Kopczynski <jkop@chromium.org>

[modify] https://crrev.com/259a65078706e67e6f96420ea1551b92aa8fb600/site_utils/balance_pools.py

Confirmed via the newly added runs metric that metrics are now working from balance_pools http://shortn/_0RM12sVqly

(I ran it a few times manually against irrelvant pools just to force that counter up. In reality I expect it to count about 1x every 24h).
Project Member

Comment 18 by bugdroid1@chromium.org, Apr 20 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/d1304821c104bb3c825a8a6d19a19602b7aa760f

commit d1304821c104bb3c825a8a6d19a19602b7aa760f
Author: Aviv Keshet <akeshet@chromium.org>
Date: Fri Apr 20 00:08:42 2018

Status: Fixed (was: Started)

Comment 20 by jkop@chromium.org, Jun 2 2018

Cc: -jkop@chromium.org akes...@chromium.org
Labels: -Chase Chase-Pending
Owner: jkop@chromium.org
Status: Started (was: Fixed)
This alert was demoted as it is flaky, regularly firing despite viceroy and pcon showing nothing wrong. The source is a difference in data source access; monarch alerts never look at bigtable, and the metric is only retained in memory for 12h. So despite the 26h time-window it was scoped to, it effectively was only checking whether balance_pools was run in the last 12 hours, which is only true about half the time.

Proposed fix: Get this metric added to the chrome-infra-monitoring-alerts policy: pcon/chrome-infra/settings/collection?module=chrome-infra-monitoring-alerts&policy=2d@1m
Hmmm...  Comment #20 and  bug 847911  are the same.  So...  Either
we send this back to fixed, and run with the other bug, or we
close the other as a duplicate of this.
Cc: jkop@chromium.org cra...@chromium.org
 Issue 847911  has been merged into this issue.
I don't have much knowledge on the proposed solution in #20. Does doing that get us the longer retention we need?

Comment 24 by jkop@chromium.org, Jun 4 2018

I am not _certain_, but I am reasonably confident that it does. I can reach out to someone in ChromeOps who manages retention policies to confirm.

My understanding is that it could be added to that policy, which stores 1m-interval data for the last two days in memory.

Caveats:

I am not sure it is possible/permissible to add our metrics to this policy, which is otherwise owned by CrOps.
I am not sure whether a metric can be part of two policies; if not, then we will need a new policy, because the one I linked does not maintain any data at all past that two-day window.
I could be misunderstanding something else and not have noticed; I have not worked with this before.

Labels: -Chase-Pending Chase
Status: Assigned (was: Started)
^ under investigation

also: consider running b_p 4hourly
Talking with Richard, it seems like there's some convoluted schedule that all these inventory admin scripts run on. Without going into all that here, I think we should reduce the alerting window to 12 hours, and we should run the script we're alerting on at least 3 times in that window. This allows for some robustness to one off failures. We should not alert on a script not running if it's expected to run exactly once in the alert window (too brittle).

Comment 27 by jkop@chromium.org, Jun 5 2018

Re #26: I agree. I'm going to make the several scripts all run balance_pool rather than just the daily one which sends the pool status email, and then unify/dedupe the scripts as a non-Chase issue.
Project Member

Comment 28 by bugdroid1@chromium.org, Jun 8 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a085a19309e991e325b54689a8c6fb0d3b2200b7

commit a085a19309e991e325b54689a8c6fb0d3b2200b7
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri Jun 08 06:33:49 2018

Make inventory management scripts balance pools

Currently only run-pool-inventory balances pools, once per 24h
The underlying calls for the other two varieties are otherwise the same
except for what email reporting they do.

Both model inventory and loop detection scripts will now balance pools
before gathering data.

BUG= chromium:829124 
TEST=None

Change-Id: I28824caace6738822a4e0be22f0375f4ad4e5ac1
Reviewed-on: https://chromium-review.googlesource.com/1087365
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/a085a19309e991e325b54689a8c6fb0d3b2200b7/contrib/run-model-inventory
[modify] https://crrev.com/a085a19309e991e325b54689a8c6fb0d3b2200b7/contrib/run-loop-detection
[modify] https://crrev.com/a085a19309e991e325b54689a8c6fb0d3b2200b7/contrib/run-pool-inventory

Project Member

Comment 29 by bugdroid1@chromium.org, Jun 11 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/fd95a1f823a78fd2f1c82bdc397a62d1263d3dc4

commit fd95a1f823a78fd2f1c82bdc397a62d1263d3dc4
Author: Jacob Kopczynski <jkop@google.com>
Date: Mon Jun 11 17:49:10 2018

Status: Fixed (was: Assigned)
Cron jobs are running more often now, alerting thresholds may be tweaked now that we have more runs.

Marking this bug fixed, please Verify no false positives over this week.

Comment 31 by jkop@chromium.org, Jun 11 2018

 Issue 851620  has been merged into this issue.
Status: Assigned (was: Fixed)
This alert fired once again at 2:34am today.  Re-opening bug.
 
We might be at the point where it would be better to close this
permanently, and open a new bug over how to reduce the spamminess
of the alert...

I just went to check the balance_pool logs on cautotest-prod.
ITOT the logs get written into files with names like
"balance_pool.log.2018-06-12", and overwritten at every run.
So, only the last run of the day is recorded, and if an earlier
run fails (or just doesn't run), we have no easy way to find
out what happened.

Comment 35 by jkop@chromium.org, Jun 12 2018

Status: Started (was: Assigned)
That might be fair. The fix has passed the CQ, but isn't in prod yet because test push has been failing.

The log name can be fixed easily.

Comment 36 by jkop@chromium.org, Jun 14 2018

Status: Fixed (was: Started)
Fix pushed, it's working.
Project Member

Comment 37 by bugdroid1@chromium.org, Jul 12

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/5bfd2fa6531e8f6132eafac862d6ef11eaa3baf3

commit 5bfd2fa6531e8f6132eafac862d6ef11eaa3baf3
Author: Jacob Kopczynski <jkop@google.com>
Date: Thu Jul 12 01:48:51 2018

Refactor run-*-inventory to streamline them

Nearly all of run-loop-detection and run-(model/pool)-inventory were
identical, differing only in which args were passed. Factor them out
into the parameterized call, and thin scripts to specify the args.

BUG= chromium:829124 
TEST=None

Change-Id: I930bc78cb92d44be2d66be41887e689c420a9beb
Reviewed-on: https://chromium-review.googlesource.com/1089724
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/5bfd2fa6531e8f6132eafac862d6ef11eaa3baf3/contrib/run-model-inventory
[add] https://crrev.com/5bfd2fa6531e8f6132eafac862d6ef11eaa3baf3/contrib/run-inventory
[modify] https://crrev.com/5bfd2fa6531e8f6132eafac862d6ef11eaa3baf3/contrib/run-loop-detection
[modify] https://crrev.com/5bfd2fa6531e8f6132eafac862d6ef11eaa3baf3/contrib/run-pool-inventory

Sign in to add a comment