add monarch metrics / instrumentation about repair actions
Issue description

Per discussion in the meeting, we should count how often the various repair strategies run, with what success, and over which boards and pools.
Oct 19 2017
An alternative to adding code to RepairAction._repair_host() is to put metrics generation for repair actions in RepairStrategy.repair(), around this point:

    try:
        ra._repair_host(host, silent)
    except Exception as e:
        # all logging and exception handling was done at
        # lower levels
        pass
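
A minimal sketch of that alternative, assuming the chromite.lib.metrics wrapper that autotest uses for Monarch reporting. The metric name, the 'action_tag'/'success' field names, and the _actions attribute are illustrative assumptions, not names taken from the actual code:

    # Hypothetical sketch only: count every repair action attempt, and
    # whether it raised, from inside RepairStrategy.repair().
    from chromite.lib import metrics

    _REPAIR_ACTION_COUNT = metrics.Counter(
        'chromeos/autotest/repair/action_attempts')


    class RepairStrategy(object):
        # ... existing construction and verifier plumbing elided ...

        def repair(self, host, silent=False):
            for ra in self._actions:   # assumed list of RepairAction objects
                success = True
                try:
                    ra._repair_host(host, silent)
                except Exception:
                    # All logging and exception handling was done at
                    # lower levels; only the outcome is recorded here.
                    success = False
                _REPAIR_ACTION_COUNT.increment(fields={
                    'action_tag': ra.tag,   # assumes each action has a tag
                    'success': success,
                })

Instrumenting here keeps the change in one place, but only RepairAction._repair_host() can tell the different failure forms apart, so a richer metric would still need the approach described in comment 1.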
Oct 25 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/1cc22ea2a84653ddd9e4b8923480d8bd169e835c

commit 1cc22ea2a84653ddd9e4b8923480d8bd169e835c
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Oct 25 00:36:34 2017

    [autotest] Add metrics for repair.

    This adds metrics to track the outcomes of repair actions.

    BUG= chromium:776505
    TEST=unit tests

    Change-Id: If5a92a67671d4d36a6ae0fa4224930a3df3eaf39
    Reviewed-on: https://chromium-review.googlesource.com/729211
    Commit-Ready: Richard Barnette <jrbarnette@google.com>
    Tested-by: Richard Barnette <jrbarnette@chromium.org>
    Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/1cc22ea2a84653ddd9e4b8923480d8bd169e835c/client/common_lib/hosts/repair.py
[modify] https://crrev.com/1cc22ea2a84653ddd9e4b8923480d8bd169e835c/client/common_lib/hosts/repair_unittest.py
Dec 13 2017
What's the status of this?
Dec 13 2017
> What's the status of this?

The code is in, and we have data. I don't know how to construct useful monarch queries that will answer questions.
Jan 25 2018
I'm inclined to declare victory here. The metrics are in; we'll figure out what to do with them later.
Comment 1 by jrbarnette@chromium.org, Oct 19 2017

It should be a straightforward change. The relevant code is the RepairAction._repair_host() method in client/common_lib/hosts/repair.py:

    # ...
    try:
        self.repair(host)
    except Exception as e:
        # ...

There are a handful of paths through the method after the call to `self.repair()`, handling three different forms of failure and one form of success. It seems it would be sufficient to publish a boolean metric at each of the exit points.

Assuming we publish just the single boolean metric (or a counter with a boolean field), that would allow answering questions like "what's the success rate of repair action A?" and "how many times did we invoke repair action A?" If we want to ask questions like "what percentage of repair tasks invoke repair action A?", we'd also need to add a counter for overall repair calls in RepairStrategy.repair(); a sketch of both is below.

If we want to ask questions like "why do we invoke repair action A?" or "why does repair action A fail?", we might also need to instrument verifiers. That could be trickier, so I'd say let's not do that for a first cut.
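
A sketch of that first cut, again assuming the chromite.lib.metrics wrapper. The metric names, the field names, and the .tag attribute are placeholders for illustration; the change committed on Oct 25 2017 may differ in these details:

    # Hypothetical sketch only: a counter with a boolean 'success' field
    # published at the exit of RepairAction._repair_host(), plus an
    # overall counter in RepairStrategy.repair() to act as a denominator.
    from chromite.lib import metrics

    _ACTION_RESULT = metrics.Counter(
        'chromeos/autotest/repair/repair_action')
    _REPAIR_CALLS = metrics.Counter(
        'chromeos/autotest/repair/repair_calls')


    class RepairAction(object):
        # ... existing trigger/dependency plumbing elided ...

        def _repair_host(self, host, silent):
            success = False
            try:
                self.repair(host)
                # ... the method's existing post-repair checks would run
                # here; the real code distinguishes three failure forms,
                # all of which leave success == False ...
                success = True
            finally:
                # One sample per attempt, recorded on every exit path.
                _ACTION_RESULT.increment(fields={
                    'action_tag': self.tag,   # assumed tag attribute
                    'success': success,
                })


    class RepairStrategy(object):
        # ... existing construction elided ...

        def repair(self, host, silent=False):
            # Denominator for "what percentage of repair tasks invoke
            # repair action A?"
            _REPAIR_CALLS.increment()
            # ... existing verify-and-repair loop elided ...

With the boolean field, the success rate of action A is the fraction of samples with success == True for that action_tag, and the separate repair_calls counter provides the denominator for the percentage-of-tasks question.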