Add metrics for servo_repair failure modes
Issue description:
We're currently completely out of daisy_skate spares in the lab. I went through most of the repair_failed DUTs and found the following reasons for servo repair failing (in order of incidence):
https://bugs.chromium.org/p/chromium/issues/detail?id=735086
https://bugs.chromium.org/p/chromium/issues/detail?id=665232
https://bugs.chromium.org/p/chromium/issues/detail?id=735115
https://bugs.chromium.org/p/chromium/issues/detail?id=735119
Jun 20 2017
Another failure mode: https://bugs.chromium.org/p/chromium/issues/detail?id=735132
Jun 20 2017
Another failure mode: https://bugs.chromium.org/p/chromium/issues/detail?id=735144
Jun 20 2017
Another failure mode: https://bugs.chromium.org/p/chromium/issues/detail?id=735156
Jun 20 2017
The ask here is for an easy detection mechanism for all these failure modes (a hedged sketch of the metric emission follows below):
- Add metrics to servo_host usage so that each of these failure modes emits a metric.
- Add a viceroy graph for servo_repair stats.

We _never_ want servo repair failing. If servo repair fails, it's a problem with the infra, not the product. So we need a way to clearly notice when it fails (maybe even an alert eventually).
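To make the ask concrete, here is a minimal sketch of what per-failure-mode metric emission could look like. It assumes the ts_mon-backed chromite.lib.metrics wrapper that autotest already uses elsewhere; the metric name, helper function, and field name are illustrative assumptions, not a landed schema.

# Sketch only: assumes chromite's ts_mon metrics wrapper.
# Metric name, helper name, and field name are hypothetical.
from chromite.lib import metrics

_SERVO_REPAIR_FAILURES = metrics.Counter(
        'chromeos/autotest/servo/repair_failure_count')


def report_servo_repair_failure(failure_mode):
    """Emit one counter tick for a known servo repair failure mode.

    failure_mode is a short enum-like string, one per failure mode
    collected on this bug (e.g. 'lid_open_stuck', 'pwr_button_stuck').
    """
    _SERVO_REPAIR_FAILURES.increment(fields={'failure_mode': failure_mode})

With one counter and a failure_mode field, a single viceroy graph can break the count down per mode, which is what makes a rate jump in any one mode visible.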
Jun 22 2017
I'm tired of making new bugs for each servo failure. Here's another failure mode (from autoserv repair logs):
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos6-row4-rack2-host2/533156-repair/20172206112219/

06/22 11:25:21.087 DEBUG| servo:0225| Servo initialized, version is servo_v4
06/22 11:25:21.087 INFO | server_job:0199| GOOD ---- verify.servod timestamp=1498155921 localtime=Jun 22 11:25:21
06/22 11:25:21.088 INFO | repair:0327| Verifying this condition: pwr_button control is normal
06/22 11:25:21.096 INFO | server_job:0199| GOOD ---- verify.pwr_button timestamp=1498155921 localtime=Jun 22 11:25:21
06/22 11:25:21.096 INFO | repair:0327| Verifying this condition: lid_open control is normal
06/22 11:25:21.104 ERROR| repair:0332| Failed: lid_open control is normal
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/servo_repair.py", line 252, in verify
    'Check lid switch: lid_open is %s' % lid_open)
AutoservVerifyError: Check lid switch: lid_open is no
06/22 11:25:21.105 INFO | server_job:0199| FAIL ---- verify.lid_open timestamp=1498155921 localtime=Jun 22 11:25:21
Check lid switch: lid_open is no

Goes downhill from there.
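For reference, the failing check above is a Verifier in the repair framework. Below is a rough reconstruction of the lid-switch verify step implied by the traceback; the real code lives in server/hosts/servo_repair.py, and the class name here is hypothetical.

# Rough reconstruction from the traceback above; the class name is made up.
from autotest_lib.client.common_lib import hosts


class _LidVerifier(hosts.Verifier):
    """Check that the lid_open servo control reports an open lid."""

    def verify(self, host):
        lid_open = host.get_servo().get('lid_open')
        if lid_open != 'yes' and lid_open != 'not_applicable':
            raise hosts.AutoservVerifyError(
                    'Check lid switch: lid_open is %s' % lid_open)

    @property
    def description(self):
        return 'lid_open control is normal'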
Jun 23 2017
Anecdotal evidence that servo_repair is doing pretty badly in the lab: one CTS run took out all DUTs for cyan, caroline, and cave, and only half of the DUTs recovered. Spot-checking some of the repair failures shows several servo failures:
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1615141
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1615179
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1615158
This means the next CQ run is slated to fail even without the bad CL, unless the deputy intervenes.
Jun 26 2017
Justify Chase-Pending.
Jun 8 2018
Hi, this bug has not been updated recently. Please acknowledge the bug and provide status within two weeks (6/22/2018), or the bug will be closed. Thank you.
Sep 7
Why Chase-Pending: Today, I found all but one of the skylab staging DUTs in repair_failed, blocking staging, because none of their servos were working.

servo is a critical piece of infrastructure that makes it possible to keep DUTs healthy at the scale we do. With servo down, various DUT failures go unnoticed or unrecoverable until they cause an outage. Since servo is necessary for a (all?) P0 service we are responsible for, we need dashboards and alerts around servo health in the lab.

I've heard an argument in the past along these lines: "The way to repair a DUT is to repair the servo and then have it repair the DUT. So if a DUT is working, so is the servo." That is well and good in theory, but I do not have confidence it is true in practice. There are a large number of servo failure modes (some collected on this bug), and we have no visibility into how frequently any of them occur, nor any way for the deputy to be notified when servo failure rates jump. This has made recovery difficult in several past outages.
Nov 5
Digging into the verifier code to understand how to implement this.
Nov 8
As an example: several of our clapper DUTs in pool CQ are currently dead, but there is no way to know what the different failure modes are. I'm annotating below why servo could not repair each DUT.

pprabhu@pprabhu:files$ dut-status -m clapper -p bvt
hostname                     S   last checked         URL
chromeos4-row4-rack3-host4   NO  2018-11-08 07:37:13  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row4-rack3-host4/1569054-repair/
--> servo works; servo usb works; chromeos-install via usb failed during image install [1]
chromeos4-row4-rack4-host1   NO  2018-11-08 07:42:01  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row4-rack4-host1/1569076-repair/
--> servo broken (detail: 'pwr_button' is stuck)
chromeos4-row4-rack4-host3   NO  2018-11-08 07:42:02  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row4-rack4-host3/1569077-repair/
--> servo works; servo usb works; chromeos-install flashes DUT, but DUT fails to reboot
chromeos4-row4-rack3-host19  NO  2018-11-08 07:41:02  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row4-rack3-host19/1569068-repair/
--> servo broken (detail: Check lid switch: lid_open is no)
chromeos4-row4-rack4-host11  NO  2018-10-24 10:39:24  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row4-rack4-host11/1495748-repair/
--> servo broken (detail: No answer to ping from chromeos4-row4-rack4-host11-servo)
chromeos4-row4-rack3-host10  NO  2018-11-08 07:38:11  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row4-rack3-host10/1569058-repair/
--> servo works; servo usb works; chromeos-install via usb failed during image install [same as [1]]

As this shows, there are 5 different failure modes among the 6 DUTs here, and I had to spend 20 minutes digging through logs to get this information. It is simply not enough to say "DUT is in REPAIR_FAILED -> servo is not working"; there are many ways servo fails, and it is important to know fleet-wide which failure modes are common.

Hopefully this gives a concrete example of the kind of data / dashboard I'm requesting via this bug. I'd like to have the various servo repair failure modes sliced along the DUT-pool and DUT-model dimensions (see the sketch below).
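As a hedged sketch of the slicing requested above, the same counter style from the earlier comment can carry the extra dimensions as fields; the field names below (failure_mode, board, model, pool) are assumptions chosen to match the dimensions listed, not the schema that eventually landed.

# Sketch: one counter, sliced along the requested dimensions.
# Field names are illustrative assumptions.
from chromite.lib import metrics

_REPAIR_FAILURES = metrics.Counter(
        'chromeos/autotest/servo/repair_failure_count')


def record_failure(failure_mode, board, model, pool):
    # e.g. record_failure('usb_install_failed', 'clapper', 'clapper', 'cq')
    _REPAIR_FAILURES.increment(fields={
            'failure_mode': failure_mode,
            'board': board,
            'model': model,
            'pool': pool,
    })

Slicing by fields rather than minting one metric per failure mode keeps a single graph that a deputy can filter per pool or per model.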
Nov 26
Discussing how to handle multiple failure modes within a single repair job.
Dec 4
After discussing with Prathmesh, we want more comprehensive metrics, which requires changes at a lower level, in the general repair code. Once this is done, we'll also get cros repair metrics, since the change will be made at the base level. A sketch of the idea follows.
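A sketch of what "done at the base level" could mean: a hook in the generic _verify_host path of client/common_lib/hosts/repair.py, so servo_repair and cros_repair verifiers both report through one code path. The metric name, the trimmed-down class, and the exact hook placement are assumptions, not the landed CL.

# Hypothetical base-level hook; metric name and placement are assumptions.
from chromite.lib import metrics

_VERIFY_RESULTS = metrics.Counter(
        'chromeos/autotest/repair/verify_result')


class Verifier(object):
    """Trimmed-down stand-in for the repair framework's Verifier."""

    def __init__(self, tag):
        self.tag = tag  # e.g. 'lid_open', 'pwr_button', 'servod'

    def verify(self, host):
        raise NotImplementedError  # subclasses implement the real check

    def _verify_host(self, host):
        try:
            self.verify(host)
            result = 'pass'
        except Exception:
            result = 'fail'
            raise
        finally:
            # One tick per verifier run, whether it passed or failed, so
            # every Verifier subclass (servo or cros) reports for free.
            _VERIFY_RESULTS.increment(fields={
                    'verifier': self.tag,
                    'result': result,
            })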
Dec 10
Under review; aiming to land tomorrow.
Dec 12
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/67538914aba9e3f09bcf60334f4a4e78f999f6c9

commit 67538914aba9e3f09bcf60334f4a4e78f999f6c9
Author: Garry Wang <xianuowang@chromium.org>
Date: Wed Dec 12 03:33:55 2018

    autotest: add metrics for servo_repair failure modes

    BUG=chromium:735121
    TEST=None

    Change-Id: I2f20a8df68c25cf1d00605ad147f100dcdb23798
    Reviewed-on: https://chromium-review.googlesource.com/1366915
    Commit-Ready: Garry Wang <xianuowang@chromium.org>
    Tested-by: Garry Wang <xianuowang@chromium.org>
    Reviewed-by: Garry Wang <xianuowang@chromium.org>

[modify] https://crrev.com/67538914aba9e3f09bcf60334f4a4e78f999f6c9/server/hosts/cros_repair.py
[modify] https://crrev.com/67538914aba9e3f09bcf60334f4a4e78f999f6c9/server/hosts/servo_repair.py
[modify] https://crrev.com/67538914aba9e3f09bcf60334f4a4e78f999f6c9/client/common_lib/hosts/repair.py
[modify] https://crrev.com/67538914aba9e3f09bcf60334f4a4e78f999f6c9/client/common_lib/hosts/repair_unittest.py
Dec 14
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/c10552c0d96e5092ac61506f42841b36750174d1

commit c10552c0d96e5092ac61506f42841b36750174d1
Author: Garry Wang <xianuowang@chromium.org>
Date: Fri Dec 14 03:28:04 2018

    autotest: filter out unwanted hostname for repair metrics

    BUG=chromium:735121
    TEST=None

    Change-Id: Ic3dfa1961bf12a71fde051d1658cfc0c1607927c
    Reviewed-on: https://chromium-review.googlesource.com/1374438
    Commit-Ready: Garry Wang <xianuowang@chromium.org>
    Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
    Tested-by: Garry Wang <xianuowang@chromium.org>
    Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/c10552c0d96e5092ac61506f42841b36750174d1/client/common_lib/hosts/repair.py
Dec 17
Awaiting push.