Add administrative (c.f. sentinel) service to track short term DUT-history triggered conditions |
|||
Issue descriptionWe do not currently correlate dut-history across (more than 1) jobs at all. This is very important to detect conditions like verify-repair loops. The specific ask here is: Detect verify-repair loops (3 instances) per-dut and create a monarch metric for it. We can then dashboard/alert from monarch. ----- Design direction: This should be an upstart controlled process running on sentinel server role. - Once this goes live, we'd like to add more metrics based on short dut-history, so keep it generic. - This process can run slowly, over the course of the day, so as to not overload anything. - This process should hit readonly replicas of databases
,
Oct 16 2017
Richard prefers this to be in the inventory script.
,
Oct 19 2017
> Richard prefers this to be in the inventory script.
More to the point: Looking over the technical details, the feature
requested must share code with the inventory script (most especially,
the logic for finding/selecting/excluding the DUTs that matter to the
inventory).
I've done some examination of the performance implications of this
feature as well.
Gathering status on the current managed inventory requires 20 minutes
when run on cautotest-prod. The feature requested here also requires
a full status check of the managed inventory. If the actual checking
and reporting requested here is part of the existing lab inventory run
jobs, the extra queries in theory should be free, since the lab inventory
scripts cache the status query results. Unfortunately, looking at the
requirements and implementation details, I think the extra queries are
_not_ free, and could be significantly expensive.
The objective is to detect repair loops, where a DUT successfully
repairs, only to immediately fail its next special task (typically,
provisioning). The result is a DUT that looks to be working, but in
reality does no work. Detecting that condition requires searching
a DUT's recent history looking for non-repair special tasks. A DUT
is looping if none of the most recent tasks have succeeded, but
every repair is passing.
The performance issue is that gathering recent history for a DUT is
more expensive than gathering just current status. In a small sample
test, I found gathering 24 hours of history is ~10 times more
expensive. Even gathering only 8 hours of history is ~5 times more
expensive. That means a naive addition to the lab inventory to look
at 8 hours of history would add 90 minutes to the inventory run.
Options:
* Live with inventory runs that take over 2 hours to complete.
* Run the check for repair loops in a separate job; my expectation
is that the status caching in the standard run doesn't change
the cost of gathering full history, so there's no performance
advantage to sharing the run.
* Search less than the full inventory, e.g. focus on only the CQ
pool.
* The current history query API only supports a time range. But,
a time range isn't the most convenient parameter in this case:
we really want the last N special tasks. So, we could add an
RPC call that only queries for those N special tasks. The cost
of the query is roughly proportional to the number of tasks
returned, so if N tasks is smaller than the typical number of
tasks in (say) 8 hours, then the query will be faster.
* I think the reason that the cost is proportional to the size of
the result is that job results include the test control file,
which can be quite large. If there were an API option to not
return the control file, the API call would likely go faster.
* I also think that the result size issue with control files is
specific to test jobs, and not special tasks, but for diagnosis,
only special tasks matter (mostly). So, an API option not to
include test jobs could make things faster.
The RPC interface changes above would also make the dut-status
command faster, so although they're more expensive, they pay other
dividends. Also, the changes above aren't the last word in how to
make dut-status faster, and anything that makes dut-status faster
will help here.
,
Oct 21 2017
I've written a prototype, and tested it out. The performance issue cited above is real: My first run took nearly 4 hours; that includes ~20 minutes for status gathering, so the net additional cost of detecting loops with a naive history search is over 3.5 hours. However, the naive search can be skipped for most DUTs, since for a DUT stuck in a repair loop, the final status will be a successful repair task. Checking final status is free (it's what we pay for with the 20 minutes status gathering), and a test of the optimization did repair loop detection in ~30 minutes, meaning ~10 minutes net additional cost. That's tolerable. There's likely some tuning needed to prevent false positives: A simple "3 repairs in a row" check turned up DUTs with a history of failures followed by repairs, but the DUTs typically went back to successfully running tests afterwards. Incidents like that probably shouldn't trigger alerts.
,
Nov 27 2017
Richard has something, and will investigate them this week.
,
Nov 28 2017
Regarding "something", I have crosreview.com/737070/ There's review to get through, and probably some tweaking, but the code works well enough to find DUTs stuck in various kinds of loops: test runs have been discovering ongoing outages with alarming reliability.
,
Dec 2 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/cf5d8346c98566f33923ac946845f6e1c7a4eb06 commit cf5d8346c98566f33923ac946845f6e1c7a4eb06 Author: Richard Barnette <jrbarnette@chromium.org> Date: Sat Dec 02 04:32:41 2017 [autotest] Report repair loops in inventory runs. This add an option to the lab inventory script to detect and report DUTs stuck in repair loops. BUG= chromium:775199 TEST=run the script locally with --debug Change-Id: I2f7972054e632906207279c08a32bd691864eb56 Reviewed-on: https://chromium-review.googlesource.com/737070 Commit-Ready: Richard Barnette <jrbarnette@google.com> Tested-by: Richard Barnette <jrbarnette@google.com> Reviewed-by: Richard Barnette <jrbarnette@google.com> [modify] https://crrev.com/cf5d8346c98566f33923ac946845f6e1c7a4eb06/site_utils/lab_inventory.py [modify] https://crrev.com/cf5d8346c98566f33923ac946845f6e1c7a4eb06/site_utils/lab_inventory_unittest.py
,
Dec 18 2017
Richard to update the cronjob to invoke this script to generate the metrics.
,
Dec 19 2017
I think maybe this is the right starting point:
https://chromium-review.googlesource.com/#/c/chromiumos/third_party/autotest/+/833288/
,
Jan 8 2018
Richard may have lost comments CL, but 1 CL forthcoming. +cronjob; deputy this week, expected next week.
,
Jan 16 2018
on the list for this week
,
Jan 20 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/45ac033514addbe5759ecc8d92e0fa4e67f68ed1 commit 45ac033514addbe5759ecc8d92e0fa4e67f68ed1 Author: Richard Barnette <jrbarnette@chromium.org> Date: Sat Jan 20 01:18:10 2018 [autotest] Generate repair loop metrics. This updates the "board" and "pool" inventory runs to include generating repair loop detection. Additionally, it adds a "run-loop-detection" script to run inventory to detect repair loops without doing any other inventory checks. BUG= chromium:775199 TEST=Run the script using the debug options. Change-Id: I971d94ce7f119f4f658d3f189721c661ffd36293 Reviewed-on: https://chromium-review.googlesource.com/833199 Commit-Ready: Richard Barnette <jrbarnette@chromium.org> Tested-by: Richard Barnette <jrbarnette@chromium.org> Reviewed-by: Don Garrett <dgarrett@chromium.org> [modify] https://crrev.com/45ac033514addbe5759ecc8d92e0fa4e67f68ed1/contrib/inventory_options [add] https://crrev.com/45ac033514addbe5759ecc8d92e0fa4e67f68ed1/contrib/run-loop-detection [modify] https://crrev.com/45ac033514addbe5759ecc8d92e0fa4e67f68ed1/contrib/run-pool-inventory
,
Jan 22 2018
Push to prod then a cronjob change to start running this.
,
Jan 25 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/4573e1229dd2b26b1227d4f0ab1c5bd2f9150a13 commit 4573e1229dd2b26b1227d4f0ab1c5bd2f9150a13 Author: Richard Barnette <jrbarnette@google.com> Date: Thu Jan 25 02:17:24 2018
,
Jan 29 2018
Richard to create dashboard.
,
Feb 5 2018
ping
,
Feb 7 2018
I just did a spot-check in Panopticon. The new metric is supposed to be "chromeos/autotest/inventory/repair_loops", but apparently, that metric has never been reported. The inventory script has definitely been finding looping DUTs, so that means the metrics reporting in the inventory script isn't working. This is the first time we've tried to actually add reporting to that script, so that's not really surprising. I expect I can create a dashboard for the metric, but without data, I can't create a dashboard that _works_... So, we have to figure out why the metrics aren't there.
,
Feb 7 2018
Per discussion offline, this is broken because it doesn't do one of two things: - Either set indirect=True - Or call chromite.lib.metrics.Flush() manually. Arguably we should change the context manager that SetupTsMonGlobalState() returns to call .Flush() on __exit__ so that we can avoid this little pitfall by default.
,
Feb 8 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/8840588deca7ef5b8bc10099527d42374fc85719 commit 8840588deca7ef5b8bc10099527d42374fc85719 Author: Richard Barnette <jrbarnette@chromium.org> Date: Thu Feb 08 00:31:05 2018 [autotest] Fix lab_inventory to properly report its metrics. The lab_inventory script has been failing to report the metric for DUTs in repair loops. This adds a call to `metrics.Flush()` to make sure that the metrics get reported. BUG= chromium:775199 TEST=None Change-Id: Ia09c0e06c5c1b013af470cdf4ce01418e50ee617 Reviewed-on: https://chromium-review.googlesource.com/907213 Reviewed-by: Paul Hobbs <phobbs@google.com> Tested-by: Richard Barnette <jrbarnette@chromium.org> [modify] https://crrev.com/8840588deca7ef5b8bc10099527d42374fc85719/site_utils/lab_inventory.py
,
Feb 8 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/88b948904cb027d738e5265cd483fb71ead2a0bb commit 88b948904cb027d738e5265cd483fb71ead2a0bb Author: Richard Barnette <jrbarnette@chromium.org> Date: Thu Feb 08 00:31:18 2018 [autotest] Add an option for debugging metrics in lab_inventory. It's possible to do better for debugging the behavior of the metrics library if we throw the right options at the setup function. So, add an option to make it possible. BUG= chromium:775199 TEST=Run the command with the option Change-Id: I16d94161d79fb8ca22ea6aa157341ea536994c4e Reviewed-on: https://chromium-review.googlesource.com/907182 Reviewed-by: Paul Hobbs <phobbs@google.com> Tested-by: Richard Barnette <jrbarnette@chromium.org> [modify] https://crrev.com/88b948904cb027d738e5265cd483fb71ead2a0bb/site_utils/lab_inventory.py
,
Feb 12 2018
Metric fixed should be pushed to prod as of this morning.
,
Feb 23 2018
Is tracking of DUTs that have remained locked for a long time within scope for this bug?
,
Feb 23 2018
Issue 807125 has been merged into this issue.
,
Feb 23 2018
> Is tracking of DUTs that have remained locked for a long time within scope for this bug?
Arguably, no.
In detail:
* We currently do detect DUTs that are working but don't run
tests. That includes DUTs that are locked for a long time,
but also includes other kinds of problems. DUTs in this
state are reported as idle in the inventory e-mail.
* Although there's a case that idle DUTs could/should be reported
via Monarch, similar to how we report looping DUTs, idle DUTs
are qualitatively different from looping DUTs. They need to be
reported separately because the threshold for "there's a problem"
needs to be different and the procedure for dealing with the
problems needs to be different.
,
Feb 23 2018
Re comment #24, idle DUTs and looping DUTs are also qualitatively different because they're defined differently. A looping DUT is defined as a DUT that runs no tests, but passes repair special tasks, and fails all other tasks. In contrast, an idle DUT runs nothing at all, including special tasks, even after work is scheduled for the DUT.
,
Feb 24 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/0a71602d4571f30d43f502963b6901625e90a2eb commit 0a71602d4571f30d43f502963b6901625e90a2eb Author: Richard Barnette <jrbarnette@google.com> Date: Sat Feb 24 00:44:47 2018
,
Feb 24 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/08aae248df06367ebc6356f7fb779141b2c5fa18 commit 08aae248df06367ebc6356f7fb779141b2c5fa18 Author: Richard Barnette <jrbarnette@chromium.org> Date: Sat Feb 24 07:13:05 2018 [autotest] Fix metrics initialization for lab inventory. The setup for metrics generation in lab inventory run was using `short_lived=True,`. ITOT that we prefer for that parameter to be false. BUG= chromium:775199 TEST=None Change-Id: Ie4c0ffde6a506ffd183991f009e99108894d4b7c Reviewed-on: https://chromium-review.googlesource.com/935834 Commit-Ready: Richard Barnette <jrbarnette@chromium.org> Tested-by: Richard Barnette <jrbarnette@chromium.org> Reviewed-by: Paul Hobbs <phobbs@google.com> [modify] https://crrev.com/08aae248df06367ebc6356f7fb779141b2c5fa18/site_utils/lab_inventory.py
,
Feb 26 2018
Code changes in progress in cider; should be able to confirm via test env that data is reflected there.
,
Feb 27 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/54150304b9637704c8a65d4a3e2c12e99c46f41b commit 54150304b9637704c8a65d4a3e2c12e99c46f41b Author: Richard Barnette <jrbarnette@chromium.org> Date: Tue Feb 27 03:28:49 2018 [autotest] Fix lab inventory for "detect repair loops". When not invoked for debug, the lab inventory script requires at least one option that specifies one of the available inventory reports. The new repair loop detection is an inventory report, but the check for "at least one report" didn't recognize it as such. The result was that inventory runs requesting "detect repair loops only" would fail. This updates the command line parsing to do the right thing, and updates the unit tests to test against that case. BUG= chromium:775199 TEST=run unit tests, run the cron job script manually Change-Id: I5eb664a2901e3c351fd2efce0d5dd1dfe48175c2 Reviewed-on: https://chromium-review.googlesource.com/937782 Commit-Ready: Richard Barnette <jrbarnette@chromium.org> Tested-by: Richard Barnette <jrbarnette@chromium.org> Reviewed-by: Paul Hobbs <phobbs@google.com> [modify] https://crrev.com/54150304b9637704c8a65d4a3e2c12e99c46f41b/site_utils/lab_inventory.py [modify] https://crrev.com/54150304b9637704c8a65d4a3e2c12e99c46f41b/site_utils/lab_inventory_unittest.py
,
Feb 28 2018
We have a dashboard:
https://viceroy.corp.google.com/chromeos/repair_loops?duration=8d
Primitive, but real data on real failures.
,
Mar 5 2018
|
|||
►
Sign in to add a comment |
|||
Comment 1 by pprabhu@chromium.org
, Oct 16 2017