Issue 776505

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



add monarch metrics / instrumentation about repair actions

Project Member Reported by akes...@chromium.org, Oct 19 2017

Issue description

Per discussion in the meeting, we should count how often the various repair strategies run, and with what success, across which boards and pools.
 
Summary: add monarch metrics / instrumentation about repair actions (was: add monarch metrics / instrumentation about repair strategies)
It should be a straightforward change.

The relevant code is the RepairAction._repair_host() method in
client/common_lib/hosts/repair.py:

            # ...
            try:
                self.repair(host)
            except Exception as e:
                # ...

There are a handful of paths through the method after the call to
`self.repair()`: three different forms of failure and one form of
success.  It seems it would be sufficient to publish a boolean
metric at each of the exit points.

Assuming we publish just the single boolean metric (or a counter
with a boolean field), that would allow answering questions like
"what's the success rate of repair action A?" and "how many times
did we invoke repair action A?"
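
A rough sketch of the "counter with a boolean field" idea, to make the
exit points concrete.  This is not the real `_repair_host()` code and not
the Monarch API: `repair_action_counts`, `record_repair_action()`,
`run_repair_action()`, and `success_rate()` are hypothetical names, the
counter is a plain in-memory stand-in, and only two illustrative failure
exits are shown rather than the three in the actual method.

```python
from collections import Counter

# Hypothetical in-memory stand-in for a metrics backend (not Monarch).
# Keys are (action_tag, success) pairs, i.e. a counter with a boolean field.
repair_action_counts = Counter()


def record_repair_action(action_tag, success):
    """Count one repair action outcome."""
    repair_action_counts[(action_tag, bool(success))] += 1


def run_repair_action(action_tag, repair, verify_after, host):
    """Call repair(host) and record exactly one outcome per exit point."""
    try:
        repair(host)
    except Exception:
        # Exit point: the repair call itself raised.
        record_repair_action(action_tag, False)
        raise
    if not verify_after(host):
        # Exit point: repair ran, but the host still fails verification.
        record_repair_action(action_tag, False)
        return False
    # Exit point: the one form of success.
    record_repair_action(action_tag, True)
    return True


def success_rate(action_tag):
    """Fraction of recorded invocations that succeeded, or None if none."""
    ok = repair_action_counts[(action_tag, True)]
    failed = repair_action_counts[(action_tag, False)]
    total = ok + failed
    return ok / total if total else None
```

With this shape, "how many times did we invoke repair action A?" is the
sum of the true and false counts for A, and the success rate is the true
count divided by that sum.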

If we want to ask questions like "what percentage of repair tasks
invoke repair action A?", we'd need to add a counter for overall
repair calls, in RepairStrategy.repair(), too.
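
To illustrate why the overall counter is needed as a denominator, here
is a small sketch.  `RepairStats` and its methods are hypothetical, and
the counts live in memory rather than in any real metrics service.

```python
from collections import Counter


class RepairStats(object):
    """Hypothetical in-memory tallies standing in for published metrics."""

    def __init__(self):
        self.repair_tasks = 0            # one per overall repair call
        self.tasks_invoking = Counter()  # per action: tasks that invoked it

    def record_task(self, invoked_action_tags):
        """Record one repair task and the set of actions it invoked."""
        self.repair_tasks += 1
        # Use a set so an action counts at most once per task.
        for tag in set(invoked_action_tags):
            self.tasks_invoking[tag] += 1

    def percent_of_tasks_invoking(self, tag):
        """Percentage of repair tasks that invoked `tag`, or None."""
        if self.repair_tasks == 0:
            return None
        return 100.0 * self.tasks_invoking[tag] / self.repair_tasks


stats = RepairStats()
stats.record_task(['A'])
stats.record_task([])
stats.record_task(['B'])
stats.record_task(['B', 'B'])
```

Without the per-task count there is nothing to divide by: the per-action
counter alone only says how often A ran, not how often it could have run.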

If we want to ask questions like "why do we invoke repair action A?"
or "why does repair action A fail?", we might also need to instrument
verifiers.  That could be trickier, so I'd say let's not do that for
a first cut.

An alternative to adding code to RepairAction._repair_host() is
to put metrics generation for repair actions in RepairStrategy.repair()
around this point:

            try:
                ra._repair_host(host, silent)
            except Exception as e:
                # all logging and exception handling was done at
                # lower levels
                pass
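
A sketch of that alternative, under the assumption that
`_repair_host()` returning normally can be treated as success and any
exception as failure.  `FakeRepairAction`, its `tag` attribute,
`action_outcomes`, and `repair_with_metrics()` are hypothetical
stand-ins, not the real classes.

```python
from collections import Counter

# Hypothetical in-memory stand-in for a metrics backend.
action_outcomes = Counter()


class FakeRepairAction(object):
    """Minimal stand-in for a repair action; `tag` is an assumed label."""

    def __init__(self, tag, should_fail):
        self.tag = tag
        self._should_fail = should_fail

    def _repair_host(self, host, silent):
        if self._should_fail:
            raise RuntimeError('%s failed on %s' % (self.tag, host))


def repair_with_metrics(repair_actions, host, silent=False):
    """Run each action, recording one outcome per action at the call site."""
    for ra in repair_actions:
        try:
            ra._repair_host(host, silent)
        except Exception:
            # Logging and exception handling are left to the lower
            # level; only the outcome is counted here.
            action_outcomes[(ra.tag, False)] += 1
        else:
            action_outcomes[(ra.tag, True)] += 1
```

The trade-off is that the call site only sees "raised" versus "returned",
so it cannot tell the different failure paths apart the way
instrumentation inside `_repair_host()` could.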

Project Member

Comment 3 by bugdroid1@chromium.org, Oct 25 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/1cc22ea2a84653ddd9e4b8923480d8bd169e835c

commit 1cc22ea2a84653ddd9e4b8923480d8bd169e835c
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Oct 25 00:36:34 2017

[autotest] Add metrics for repair.

This adds metrics to track the outcomes of repair actions.

BUG= chromium:776505 
TEST=unit tests

Change-Id: If5a92a67671d4d36a6ae0fa4224930a3df3eaf39
Reviewed-on: https://chromium-review.googlesource.com/729211
Commit-Ready: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/1cc22ea2a84653ddd9e4b8923480d8bd169e835c/client/common_lib/hosts/repair.py
[modify] https://crrev.com/1cc22ea2a84653ddd9e4b8923480d8bd169e835c/client/common_lib/hosts/repair_unittest.py

What's the status of this?

> What's the status of this?

The code is in, and we have data.  I don't know how to construct
useful monarch queries that will answer questions.
I'm inclined to declare victory here.  The metrics are in;
we'll figure out what to do with them later.

Status: Fixed (was: Assigned)
