alert in case of failing or missing balance_pool runs
Reported by
jrbarnette@chromium.org,
Apr 4 2018
|
||||||||||||||
Issue description
The automated balance_pool run seems to fail mostly or always
with the following error:
Traceback (most recent call last):
File "site_utils/balance_pools.py", line 677, in <module>
main(sys.argv)
File "site_utils/balance_pools.py", line 663, in main
balancer_targets = infer_balancer_targets(afe, arguments, pools)
File "site_utils/balance_pools.py", line 610, in infer_balancer_targets
for model in inventory.get_models(pool):
AttributeError: '_LabInventory' object has no attribute 'get_models'
The problem is caused by this code in balance_pool:
def infer_balancer_targets(afe, arguments, pools):
# ...
for model in inventory.get_models(pool):
labels = labellib.LabelsMapping()
labels['model'] = model
balancer_targets.append((pool, labels.getlabels()))
# ...
The for-loop should be like this:
for model in inventory.get_pool_models(pool):
# ...
,
Apr 5 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/685ac85e94f1b5c0efa0ada667fbe41891284773 commit 685ac85e94f1b5c0efa0ada667fbe41891284773 Author: Richard Barnette <jrbarnette@chromium.org> Date: Thu Apr 05 00:32:56 2018 [autotest] Fix balance_pools as used by inventory. The particular combination of balance_pools command line options used by the daily inventory balancing was consitently failing because of a call to a non-existent method. This fixes the offending call to use the correct method name. BUG= chromium:829124 TEST=run the command as for the full inventory; see it pass Change-Id: I177fb8cb8432f333e44e314a48bfa3c5136bdedd Reviewed-on: https://chromium-review.googlesource.com/996652 Tested-by: Richard Barnette <jrbarnette@chromium.org> Trybot-Ready: Richard Barnette <jrbarnette@chromium.org> Reviewed-by: Aviv Keshet <akeshet@chromium.org> [modify] https://crrev.com/685ac85e94f1b5c0efa0ada667fbe41891284773/site_utils/balance_pools.py
,
Apr 5 2018
+Chase-Pending, for post-mortem on why we didn't have an alert for the failures.
,
Apr 9 2018
,
Apr 9 2018
alert on metrics missing , then alert on metrics indicating bad behavior
,
Apr 16 2018
Did nothing, will look this week.
,
Apr 16 2018
Issue 829011 has been merged into this issue.
,
Apr 18 2018
None of the balance pool metrics appear to be working. Perhaps because they had field specs changed by https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/762597/5/site_utils/balance_pools.py#255 Plan of attack 1) delete in monarch the existing balance_pool metrics, so that monarch can re-learn the correct new field schema. 2) add a success_counter metric for this script, to prove that it is even able to emit metrics at all
,
Apr 18 2018
,
Apr 18 2018
,
Apr 18 2018
,
Apr 18 2018
Issue 807326 has been merged into this issue.
,
Apr 18 2018
Re 8 I no longer suspect the field spec change, and I see via monarch_tool that metrics seem like they have the correct spec defitinitions. I think the problem is that balance_pool is multiprocess but wasn't using indirect=True metrics.
,
Apr 19 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/2cc427df297dd31d2e8562180d625400e62038f0 commit 2cc427df297dd31d2e8562180d625400e62038f0 Author: Aviv Keshet <akeshet@chromium.org> Date: Thu Apr 19 19:46:07 2018 autotest: use indirect metrics in balance_pool balance_pool uses multiple processes, so it needs to use indirect=True. Also, get rid a metrics.Flush call that I suspect is unneeded, or will be with follow-up work. BUG= chromium:829124 TEST=None Change-Id: Icc16a7ba06c492e757630a5a0d88deb40827f70f Reviewed-on: https://chromium-review.googlesource.com/1017814 Commit-Ready: Aviv Keshet <akeshet@chromium.org> Tested-by: Aviv Keshet <akeshet@chromium.org> Reviewed-by: Jacob Kopczynski <jkop@chromium.org> [modify] https://crrev.com/2cc427df297dd31d2e8562180d625400e62038f0/site_utils/balance_pools.py
,
Apr 19 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/259a65078706e67e6f96420ea1551b92aa8fb600 commit 259a65078706e67e6f96420ea1551b92aa8fb600 Author: Aviv Keshet <akeshet@chromium.org> Date: Thu Apr 19 19:46:09 2018 autotest: balance_pools: add a SuccessCounter BUG= chromium:829124 TEST=None Change-Id: I36cff63f4f16b98b08d328e9e2a07b9e959b366e Reviewed-on: https://chromium-review.googlesource.com/1017815 Commit-Ready: Aviv Keshet <akeshet@chromium.org> Tested-by: Aviv Keshet <akeshet@chromium.org> Reviewed-by: Jacob Kopczynski <jkop@chromium.org> [modify] https://crrev.com/259a65078706e67e6f96420ea1551b92aa8fb600/site_utils/balance_pools.py
,
Apr 19 2018
Confirmed via the newly added runs metric that metrics are now working from balance_pools http://shortn/_0RM12sVqly (I ran it a few times manually against irrelvant pools just to force that counter up. In reality I expect it to count about 1x every 24h).
,
Apr 19 2018
,
Apr 20 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/d1304821c104bb3c825a8a6d19a19602b7aa760f commit d1304821c104bb3c825a8a6d19a19602b7aa760f Author: Aviv Keshet <akeshet@chromium.org> Date: Fri Apr 20 00:08:42 2018
,
Apr 23 2018
,
Jun 2 2018
This alert was demoted as it is flaky, regularly firing despite viceroy and pcon showing nothing wrong. The source is a difference in data source access; monarch alerts never look at bigtable, and the metric is only retained in memory for 12h. So despite the 26h time-window it was scoped to, it effectively was only checking whether balance_pools was run in the last 12 hours, which is only true about half the time. Proposed fix: Get this metric added to the chrome-infra-monitoring-alerts policy: pcon/chrome-infra/settings/collection?module=chrome-infra-monitoring-alerts&policy=2d@1m
,
Jun 2 2018
Hmmm... Comment #20 and bug 847911 are the same. So... Either we send this back to fixed, and run with the other bug, or we close the other as a duplicate of this.
,
Jun 4 2018
,
Jun 4 2018
I don't have much knowledge on the proposed solution in #20. Does doing that get us the longer retention we need?
,
Jun 4 2018
I am not _certain_, but I am reasonably confident that it does. I can reach out to someone in ChromeOps who manages retention policies to confirm. My understanding is that it could be added to that policy, which stores 1m-interval data for the last two days in memory. Caveats: I am not sure it is possible/permissible to add our metrics to this policy, which is otherwise owned by CrOps. I am not sure whether a metric can be part of two policies; if not, then we will need a new policy, because the one I linked does not maintain any data at all past that two-day window. I could be misunderstanding something else and not have noticed; I have not worked with this before.
,
Jun 4 2018
^ under investigation also: consider running b_p 4hourly
,
Jun 4 2018
Talking with Richard, it seems like there's some convoluted schedule that all these inventory admin scripts run on. Without going into all that here, I think we should reduce the alerting window to 12 hours, and we should run the script we're alerting on at least 3 times in that window. This allows for some robustness to one off failures. We should not alert on a script not running if it's expected to run exactly once in the alert window (too brittle).
,
Jun 5 2018
Re #26: I agree. I'm going to make the several scripts all run balance_pool rather than just the daily one which sends the pool status email, and then unify/dedupe the scripts as a non-Chase issue.
,
Jun 8 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a085a19309e991e325b54689a8c6fb0d3b2200b7 commit a085a19309e991e325b54689a8c6fb0d3b2200b7 Author: Jacob Kopczynski <jkop@google.com> Date: Fri Jun 08 06:33:49 2018 Make inventory management scripts balance pools Currently only run-pool-inventory balances pools, once per 24h The underlying calls for the other two varieties are otherwise the same except for what email reporting they do. Both model inventory and loop detection scripts will now balance pools before gathering data. BUG= chromium:829124 TEST=None Change-Id: I28824caace6738822a4e0be22f0375f4ad4e5ac1 Reviewed-on: https://chromium-review.googlesource.com/1087365 Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com> Tested-by: Jacob Kopczynski <jkop@chromium.org> Reviewed-by: Richard Barnette <jrbarnette@google.com> [modify] https://crrev.com/a085a19309e991e325b54689a8c6fb0d3b2200b7/contrib/run-model-inventory [modify] https://crrev.com/a085a19309e991e325b54689a8c6fb0d3b2200b7/contrib/run-loop-detection [modify] https://crrev.com/a085a19309e991e325b54689a8c6fb0d3b2200b7/contrib/run-pool-inventory
,
Jun 11 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/fd95a1f823a78fd2f1c82bdc397a62d1263d3dc4 commit fd95a1f823a78fd2f1c82bdc397a62d1263d3dc4 Author: Jacob Kopczynski <jkop@google.com> Date: Mon Jun 11 17:49:10 2018
,
Jun 11 2018
Cron jobs are running more often now, alerting thresholds may be tweaked now that we have more runs. Marking this bug fixed, please Verify no false positives over this week.
,
Jun 11 2018
Issue 851620 has been merged into this issue.
,
Jun 12 2018
,
Jun 12 2018
This alert fired once again at 2:34am today. Re-opening bug.
,
Jun 12 2018
We might be at the point where it would be better to close this permanently, and open a new bug over how to reduce the spamminess of the alert... I just went to check the balance_pool logs on cautotest-prod. ITOT the logs get written into files with names like "balance_pool.log.2018-06-12", and overwritten at every run. So, only the last run of the day is recorded, and if an earlier run fails (or just doesn't run), we have no easy way to find out what happened.
,
Jun 12 2018
That might be fair. The fix has passed the CQ, but isn't in prod yet because test push has been failing. The log name can be fixed easily.
,
Jun 14 2018
Fix pushed, it's working.
,
Jul 12
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/5bfd2fa6531e8f6132eafac862d6ef11eaa3baf3 commit 5bfd2fa6531e8f6132eafac862d6ef11eaa3baf3 Author: Jacob Kopczynski <jkop@google.com> Date: Thu Jul 12 01:48:51 2018 Refactor run-*-inventory to streamline them Nearly all of run-loop-detection and run-(model/pool)-inventory were identical, differing only in which args were passed. Factor them out into the parameterized call, and thin scripts to specify the args. BUG= chromium:829124 TEST=None Change-Id: I930bc78cb92d44be2d66be41887e689c420a9beb Reviewed-on: https://chromium-review.googlesource.com/1089724 Commit-Ready: Jacob Kopczynski <jkop@chromium.org> Tested-by: Jacob Kopczynski <jkop@chromium.org> Reviewed-by: Richard Barnette <jrbarnette@google.com> [modify] https://crrev.com/5bfd2fa6531e8f6132eafac862d6ef11eaa3baf3/contrib/run-model-inventory [add] https://crrev.com/5bfd2fa6531e8f6132eafac862d6ef11eaa3baf3/contrib/run-inventory [modify] https://crrev.com/5bfd2fa6531e8f6132eafac862d6ef11eaa3baf3/contrib/run-loop-detection [modify] https://crrev.com/5bfd2fa6531e8f6132eafac862d6ef11eaa3baf3/contrib/run-pool-inventory |
||||||||||||||
►
Sign in to add a comment |
||||||||||||||
Comment 1 by jrbarnette@chromium.org
, Apr 4 2018