New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 775199 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Mar 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

Add administrative (c.f. sentinel) service to track short term DUT-history triggered conditions

Project Member Reported by pprabhu@chromium.org, Oct 16 2017

Issue description

We do not currently correlate dut-history across (more than 1) jobs at all. This is very important to detect conditions like verify-repair loops.

The specific ask here is: Detect verify-repair loops (3 instances) per-dut and create a monarch metric for it. We can then dashboard/alert from monarch.

-----
Design direction:

This should be an upstart controlled process running on sentinel server role.
- Once this goes live, we'd like to add more metrics based on short dut-history, so keep it generic.
- This process can run slowly, over the course of the day, so as to not overload anything.
- This process should hit readonly replicas of databases
 
Labels: -Pri-3 Pri-1
Labels: -Chase-Pending Chase
Owner: jrbarnette@chromium.org
Status: Assigned (was: Untriaged)
Richard prefers this to be in the inventory script.
> Richard prefers this to be in the inventory script.

More to the point:  Looking over the technical details, the feature
requested must share code with the inventory script (most especially,
the logic for finding/selecting/excluding the DUTs that matter to the
inventory).

I've done some examination of the performance implications of this
feature as well.

Gathering status on the current managed inventory requires 20 minutes
when run on cautotest-prod.  The feature requested here also requires
a full status check of the managed inventory.  If the actual checking
and reporting requested here is part of the existing lab inventory run
jobs, the extra queries in theory should be free, since the lab inventory
scripts cache the status query results.  Unfortunately, looking at the
requirements and implementation details, I think the extra queries are
_not_ free, and could be significantly expensive.

The objective is to detect repair loops, where a DUT successfully
repairs, only to immediately fail its next special task (typically,
provisioning).  The result is a DUT that looks to be working, but in
reality does no work.  Detecting that condition requires searching
a DUT's recent history looking for non-repair special tasks.  A DUT
is looping if none of the most recent tasks have succeeded, but
every repair is passing.

The performance issue is that gathering recent history for a DUT is
more expensive than gathering just current status.  In a small sample
test, I found gathering 24 hours of history is ~10 times more
expensive.  Even gathering only 8 hours of history is ~5 times more
expensive.  That means a naive addition to the lab inventory to look
at 8 hours of history would add 90 minutes to the inventory run.

Options:
  * Live with inventory runs that take over 2 hours to complete.
  * Run the check for repair loops in a separate job; my expectation
    is that the status caching in the standard run doesn't change
    the cost of gathering full history, so there's no performance
    advantage to sharing the run.
  * Search less than the full inventory, e.g. focus on only the CQ
    pool.
  * The current history query API only supports a time range.  But,
    a time range isn't the most convenient parameter in this case:
    we really want the last N special tasks.  So, we could add an
    RPC call that only queries for those N special tasks.  The cost
    of the query is roughly proportional to the number of tasks
    returned, so if N tasks is smaller than the typical number of
    tasks in (say) 8 hours, then the query will be faster.
  * I think the reason that the cost is proportional to the size of
    the result is that job results include the test control file,
    which can be quite large.  If there were an API option to not
    return the control file, the API call would likely go faster.
  * I also think that the result size issue with control files is
    specific to test jobs, and not special tasks, but for diagnosis,
    only special tasks matter (mostly).  So, an API option not to
    include test jobs could make things faster.

The RPC interface changes above would also make the dut-status
command faster, so although they're more expensive, they pay other
dividends.  Also, the changes above aren't the last word in how to
make dut-status faster, and anything that makes dut-status faster
will help here.

I've written a prototype, and tested it out.  The performance
issue cited above is real:  My first run took nearly 4 hours;
that includes ~20 minutes for status gathering, so the net
additional cost of detecting loops with a naive history search
is over 3.5 hours.

However, the naive search can be skipped for most DUTs, since
for a DUT stuck in a repair loop, the final status will be a
successful repair task.  Checking final status is free (it's
what we pay for with the 20 minutes status gathering), and a
test of the optimization did repair loop detection in ~30 minutes,
meaning ~10 minutes net additional cost.  That's tolerable.

There's likely some tuning needed to prevent false positives:
A simple "3 repairs in a row" check turned up DUTs with a history
of failures followed by repairs, but the DUTs typically went back
to successfully running tests afterwards.  Incidents like that
probably shouldn't trigger alerts.
Richard has something, and will investigate them this week.
Regarding "something", I have crosreview.com/737070/

There's review to get through, and probably some tweaking, but the
code works well enough to find DUTs stuck in various kinds of loops:
test runs have been discovering ongoing outages with alarming
reliability.

Project Member

Comment 7 by bugdroid1@chromium.org, Dec 2 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/cf5d8346c98566f33923ac946845f6e1c7a4eb06

commit cf5d8346c98566f33923ac946845f6e1c7a4eb06
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Sat Dec 02 04:32:41 2017

[autotest] Report repair loops in inventory runs.

This add an option to the lab inventory script to detect and
report DUTs stuck in repair loops.

BUG= chromium:775199 
TEST=run the script locally with --debug

Change-Id: I2f7972054e632906207279c08a32bd691864eb56
Reviewed-on: https://chromium-review.googlesource.com/737070
Commit-Ready: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/cf5d8346c98566f33923ac946845f6e1c7a4eb06/site_utils/lab_inventory.py
[modify] https://crrev.com/cf5d8346c98566f33923ac946845f6e1c7a4eb06/site_utils/lab_inventory_unittest.py

Richard to update the cronjob to invoke this script to generate the metrics.
I think maybe this is the right starting point:
    https://chromium-review.googlesource.com/#/c/chromiumos/third_party/autotest/+/833288/

Richard may have lost comments CL, but 1 CL forthcoming. +cronjob; deputy this week, expected next week.
on the list for this week
Project Member

Comment 12 by bugdroid1@chromium.org, Jan 20 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/45ac033514addbe5759ecc8d92e0fa4e67f68ed1

commit 45ac033514addbe5759ecc8d92e0fa4e67f68ed1
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Sat Jan 20 01:18:10 2018

[autotest] Generate repair loop metrics.

This updates the "board" and "pool" inventory runs to include
generating repair loop detection.  Additionally, it adds a
"run-loop-detection" script to run inventory to detect repair
loops without doing any other inventory checks.

BUG= chromium:775199 
TEST=Run the script using the debug options.

Change-Id: I971d94ce7f119f4f658d3f189721c661ffd36293
Reviewed-on: https://chromium-review.googlesource.com/833199
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/45ac033514addbe5759ecc8d92e0fa4e67f68ed1/contrib/inventory_options
[add] https://crrev.com/45ac033514addbe5759ecc8d92e0fa4e67f68ed1/contrib/run-loop-detection
[modify] https://crrev.com/45ac033514addbe5759ecc8d92e0fa4e67f68ed1/contrib/run-pool-inventory

Push to prod then a cronjob change to start running this.
Project Member

Comment 14 by bugdroid1@chromium.org, Jan 25 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/4573e1229dd2b26b1227d4f0ab1c5bd2f9150a13

commit 4573e1229dd2b26b1227d4f0ab1c5bd2f9150a13
Author: Richard Barnette <jrbarnette@google.com>
Date: Thu Jan 25 02:17:24 2018

Richard to create dashboard.
ping
I just did a spot-check in Panopticon.  The new metric is supposed
to be "chromeos/autotest/inventory/repair_loops", but apparently,
that metric has never been reported.

The inventory script has definitely been finding looping DUTs, so
that means the metrics reporting in the inventory script isn't
working.  This is the first time we've tried to actually add
reporting to that script, so that's not really surprising.

I expect I can create a dashboard for the metric, but without data,
I can't create a dashboard that _works_...  So, we have to figure
out why the metrics aren't there.

Per discussion offline, this is broken because it doesn't do one of two things:

- Either set indirect=True
- Or call chromite.lib.metrics.Flush() manually.

Arguably we should change the context manager that SetupTsMonGlobalState() returns to call .Flush() on __exit__ so that we can avoid this little pitfall by default.

Project Member

Comment 19 by bugdroid1@chromium.org, Feb 8 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/8840588deca7ef5b8bc10099527d42374fc85719

commit 8840588deca7ef5b8bc10099527d42374fc85719
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Feb 08 00:31:05 2018

[autotest] Fix lab_inventory to properly report its metrics.

The lab_inventory script has been failing to report the metric for
DUTs in repair loops.  This adds a call to `metrics.Flush()` to make
sure that the metrics get reported.

BUG= chromium:775199 
TEST=None

Change-Id: Ia09c0e06c5c1b013af470cdf4ce01418e50ee617
Reviewed-on: https://chromium-review.googlesource.com/907213
Reviewed-by: Paul Hobbs <phobbs@google.com>
Tested-by: Richard Barnette <jrbarnette@chromium.org>

[modify] https://crrev.com/8840588deca7ef5b8bc10099527d42374fc85719/site_utils/lab_inventory.py

Project Member

Comment 20 by bugdroid1@chromium.org, Feb 8 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/88b948904cb027d738e5265cd483fb71ead2a0bb

commit 88b948904cb027d738e5265cd483fb71ead2a0bb
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Thu Feb 08 00:31:18 2018

[autotest] Add an option for debugging metrics in lab_inventory.

It's possible to do better for debugging the behavior of the
metrics library if we throw the right options at the setup
function.  So, add an option to make it possible.

BUG= chromium:775199 
TEST=Run the command with the option

Change-Id: I16d94161d79fb8ca22ea6aa157341ea536994c4e
Reviewed-on: https://chromium-review.googlesource.com/907182
Reviewed-by: Paul Hobbs <phobbs@google.com>
Tested-by: Richard Barnette <jrbarnette@chromium.org>

[modify] https://crrev.com/88b948904cb027d738e5265cd483fb71ead2a0bb/site_utils/lab_inventory.py

Metric fixed should be pushed to prod as of this morning.
Is tracking of DUTs that have remained locked for a long time within scope for this bug?
Issue 807125 has been merged into this issue.
> Is tracking of DUTs that have remained locked for a long time within scope for this bug?

Arguably, no.

In detail:
  * We currently do detect DUTs that are working but don't run
    tests.  That includes DUTs that are locked for a long time,
    but also includes other kinds of problems.  DUTs in this
    state are reported as idle in the inventory e-mail.
  * Although there's a case that idle DUTs could/should be reported
    via Monarch, similar to how we report looping DUTs, idle DUTs
    are qualitatively different from looping DUTs.  They need to be
    reported separately because the threshold for "there's a problem"
    needs to be different and the procedure for dealing with the
    problems needs to be different.

Re comment #24, idle DUTs and looping DUTs are also qualitatively
different because they're defined differently.

A looping DUT is defined as a DUT that runs no tests,
but passes repair special tasks, and fails all other tasks.

In contrast, an idle DUT runs nothing at all, including
special tasks, even after work is scheduled for the DUT.

Project Member

Comment 26 by bugdroid1@chromium.org, Feb 24 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/0a71602d4571f30d43f502963b6901625e90a2eb

commit 0a71602d4571f30d43f502963b6901625e90a2eb
Author: Richard Barnette <jrbarnette@google.com>
Date: Sat Feb 24 00:44:47 2018

Project Member

Comment 27 by bugdroid1@chromium.org, Feb 24 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/08aae248df06367ebc6356f7fb779141b2c5fa18

commit 08aae248df06367ebc6356f7fb779141b2c5fa18
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Sat Feb 24 07:13:05 2018

[autotest] Fix metrics initialization for lab inventory.

The setup for metrics generation in lab inventory run was using
`short_lived=True,`.  ITOT that we prefer for that parameter to be
false.

BUG= chromium:775199 
TEST=None

Change-Id: Ie4c0ffde6a506ffd183991f009e99108894d4b7c
Reviewed-on: https://chromium-review.googlesource.com/935834
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/08aae248df06367ebc6356f7fb779141b2c5fa18/site_utils/lab_inventory.py

Code changes in progress in cider; should be able to confirm via test env that data is reflected there.
Project Member

Comment 29 by bugdroid1@chromium.org, Feb 27 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/54150304b9637704c8a65d4a3e2c12e99c46f41b

commit 54150304b9637704c8a65d4a3e2c12e99c46f41b
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Tue Feb 27 03:28:49 2018

[autotest] Fix lab inventory for "detect repair loops".

When not invoked for debug, the lab inventory script requires at
least one option that specifies one of the available inventory
reports.  The new repair loop detection is an inventory report, but
the check for "at least one report" didn't recognize it as such.
The result was that inventory runs requesting "detect repair loops
only" would fail.

This updates the command line parsing to do the right thing, and
updates the unit tests to test against that case.

BUG= chromium:775199 
TEST=run unit tests, run the cron job script manually

Change-Id: I5eb664a2901e3c351fd2efce0d5dd1dfe48175c2
Reviewed-on: https://chromium-review.googlesource.com/937782
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/54150304b9637704c8a65d4a3e2c12e99c46f41b/site_utils/lab_inventory.py
[modify] https://crrev.com/54150304b9637704c8a65d4a3e2c12e99c46f41b/site_utils/lab_inventory_unittest.py

We have a dashboard:
    https://viceroy.corp.google.com/chromeos/repair_loops?duration=8d

Primitive, but real data on real failures.

Status: Fixed (was: Assigned)

Sign in to add a comment