
Issue 735121

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Last visit 18 days ago
Closed: Jan 7
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocked on:
issue 735132
issue 765686
issue 665232
issue 735086
issue 735115
issue 735119
issue 735144
issue 735156




Add metrics for servo_repair failure modes

Project Member Reported by pprabhu@chromium.org, Jun 20 2017

Issue description

We're currently completely out of daisy_skate spares in the lab.
I went through most of the repair_failed DUTs and found the following reasons for servo repair failing:

(In order of incidence)

https://bugs.chromium.org/p/chromium/issues/detail?id=735086
https://bugs.chromium.org/p/chromium/issues/detail?id=665232
https://bugs.chromium.org/p/chromium/issues/detail?id=735115
https://bugs.chromium.org/p/chromium/issues/detail?id=735119
 
Blockedon: 735115 665232 735119 735086
Blockedon: 735132
Summary: servo repair is unreliable (hitting daisy_skate particularly hard?) (was: daisy_skate servo repair is unreliable)
Another failure mode: https://bugs.chromium.org/p/chromium/issues/detail?id=735132
Blockedon: 735144
Another failure mode: https://bugs.chromium.org/p/chromium/issues/detail?id=735144
Blockedon: 735156
Another failure mode: https://bugs.chromium.org/p/chromium/issues/detail?id=735156
Status: Available (was: Untriaged)
Summary: Add metrics for servo_repair failure modes (was: servo repair is unreliable (hitting daisy_skate particularly hard?))
The ask here is for an easy detection mechanism for all these failure modes.

- Add metrics to servo_host usage so that each of these failure modes emits a metric.
- Add a viceroy graph for servo_repair stats.

We _never_ want servo repair to fail. If servo repair fails, it's a problem with the infra, not the product. So we need a way to clearly notice when this fails (maybe even an alert eventually).
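
A minimal sketch of what the first bullet could look like, with a hypothetical in-memory counter standing in for whatever real metrics backend gets used; the function and label names are illustrative only:

# Hypothetical sketch only: the names and the in-memory counter below stand in
# for whatever real metrics backend (and metric names) actually get used.
import collections

_failure_counts = collections.Counter()

def report_servo_repair_failure(hostname, failure_mode):
    """Record one servo repair failure, keyed by failure mode.

    failure_mode is a short, stable label such as 'lid_open_stuck' or
    'servod_unreachable', one per failure mode collected on this bug.
    A real implementation would send this to the metrics backend so a
    viceroy graph can chart failure rates over time.
    """
    _failure_counts[(hostname, failure_mode)] += 1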
Cc: akes...@chromium.org
I'm tired of making new bugs for each servo failure.
Here's another failure mode (from autoserv repair logs):

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos6-row4-rack2-host2/533156-repair/20172206112219/

06/22 11:25:21.087 DEBUG|             servo:0225| Servo initialized, version is servo_v4
06/22 11:25:21.087 INFO |        server_job:0199| 	GOOD	----	verify.servod	timestamp=1498155921	localtime=Jun 22 11:25:21	
06/22 11:25:21.088 INFO |            repair:0327| Verifying this condition: pwr_button control is normal
06/22 11:25:21.096 INFO |        server_job:0199| 	GOOD	----	verify.pwr_button	timestamp=1498155921	localtime=Jun 22 11:25:21	
06/22 11:25:21.096 INFO |            repair:0327| Verifying this condition: lid_open control is normal
06/22 11:25:21.104 ERROR|            repair:0332| Failed: lid_open control is normal
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/servo_repair.py", line 252, in verify
    'Check lid switch: lid_open is %s' % lid_open)
AutoservVerifyError: Check lid switch: lid_open is no
06/22 11:25:21.105 INFO |        server_job:0199| 	FAIL	----	verify.lid_open	timestamp=1498155921	localtime=Jun 22 11:25:21	Check lid switch: lid_open is no


Goes downhill from there.
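
For context, the failure above comes out of the repair framework's verifier pattern: repair.py runs each verifier's verify(), and a raised exception marks the condition as failed, with the message becoming the logged failure detail. A simplified, self-contained sketch of that pattern (not the real autotest classes; the accessor name is made up):

# Simplified sketch of the verifier pattern implied by the traceback above.
# Not the real autotest code: class and accessor names here are made up.

class AutoservVerifyError(Exception):
    """Raised when a verifier check fails; the message becomes the failure
    detail that repair.py logs (e.g. 'Check lid switch: lid_open is no')."""

class LidOpenVerifier(object):
    """Stand-in for the lid_open check in server/hosts/servo_repair.py."""

    def verify(self, host):
        # 'get_servo_control' is a hypothetical accessor for a servod control.
        lid_open = host.get_servo_control('lid_open')
        if lid_open != 'yes':
            raise AutoservVerifyError(
                'Check lid switch: lid_open is %s' % lid_open)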
Anecdotal evidence that servo_repair is doing pretty badly in the lab:

One CTS run took out all DUTs for cyan, caroline, cave. Only half of the DUTs recovered. Spot-checking some of the repair failures shows some servo failures:

https://viceroy.corp.google.com/chromeos/suite_details?build_id=1615141
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1615179
https://viceroy.corp.google.com/chromeos/suite_details?build_id=1615158

This means that the next CQ run is slated to fail even without the bad CL unless the deputy intervenes.

Owner: pprabhu@chromium.org
Status: Assigned (was: Available)
Justify Chase-Pending.
Blockedon: 765686
Hi, this bug has not been updated recently. Please acknowledge the bug and provide status within two weeks (6/22/2018), or the bug will be closed. Thank you.
Owner: ----
Status: Available (was: Assigned)
Labels: Chase-Pending
Why Chase-Pending: Today, I found all but one of the skylab staging DUTs in repair_failed, blocking staging because none of their servos were working.

servo is a critical piece of infrastructure that makes it possible to keep DUTs healthy at the scale we do. With servo down, various DUT failures go unnoticed or unrecovered until they cause an outage.

Since servo is necessary for a (all?) P0 service we are responsible for, we need dashboards and alerts around servo health in the lab. 

I've heard the argument in the past along these lines: "The way to repair a DUT is to repair the servo then have it repair the DUT. So if a DUT is working, so is the servo."

That is well and good in theory, but I do not have confidence it is true in practice. There are a large number of servo failure modes (some collected on this bug), and we have no visibility into how frequently any of them occur, nor any way for the deputy to be notified if servo failure rates jump. This has made recovery difficult in several past outages.
Labels: -Chase-Pending Chase
Owner: xianuowang@chromium.org
Status: Assigned (was: Available)
digging into the verifier code to understand how to implement this.
As an example: several of our clapper DUTs in pool CQ are currently dead, but there is no way to know what the different failure modes are. I'm annotating below why servo could not repair the DUT.



pprabhu@pprabhu:files$ dut-status -m clapper -p bvt
hostname                       S   last checked         URL
chromeos4-row4-rack3-host4     NO  2018-11-08 07:37:13  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row4-rack3-host4/1569054-repair/
--> servo works; servo usb works; chromeos-install via usb failed during image install

chromeos4-row4-rack4-host1     NO  2018-11-08 07:42:01  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row4-rack4-host1/1569076-repair/
--> servo broken (detail: 'pwr_button' is stuck)

chromeos4-row4-rack4-host3     NO  2018-11-08 07:42:02  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row4-rack4-host3/1569077-repair/
--> servo works; servo usb works; chromeos-install flashes DUT, but DUT fails to reboot

chromeos4-row4-rack3-host19    NO  2018-11-08 07:41:02  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row4-rack3-host19/1569068-repair/
--> servo broken; (detail: Check lid switch: lid_open is no)

chromeos4-row4-rack4-host11    NO  2018-10-24 10:39:24  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row4-rack4-host11/1495748-repair/
--> servo broken; (detail: No answer to ping from chromeos4-row4-rack4-host11-servo)

chromeos4-row4-rack3-host10    NO  2018-11-08 07:38:11  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row4-rack3-host10/1569058-repair/
--> servo works; servo usb works; chromeos-install via usb failed during image install [same as [1]]

As this shows, there are 5 different failure modes among the 6 DUTs here, and I had to spend 20 minutes digging through logs to get this information. It is simply not enough to say "DUT is in REPAIR_FAILED, therefore servo is not working": there are many ways servo fails, and it is important to know fleet-wide which failure modes are common.

Hopefully, this helps give a concrete example of what kind of data / dashboard I'm requesting via this bug.
I'd like to have the various servo repair failure modes reported, with the data sliced along the DUT-pool and DUT-model dimensions.
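
One way to get labels suitable for that kind of slicing is to map failure detail strings onto a small set of stable failure-mode names. A hypothetical sketch, with the substrings and labels lifted from the annotated dut-status listing above rather than from any actual implementation:

# Hypothetical classification sketch; substrings and labels are taken from the
# annotated dut-status listing above, not from the actual implementation.

_FAILURE_MODE_PATTERNS = (
    ('Check lid switch', 'lid_open_stuck'),
    ("'pwr_button' is stuck", 'pwr_button_stuck'),
    ('No answer to ping', 'servo_host_unreachable'),
    ('failed during image install', 'usb_install_failed'),
    ('fails to reboot', 'reboot_after_install_failed'),
)

def classify_failure(detail):
    """Map a raw repair failure detail string to a stable failure-mode label."""
    for substring, label in _FAILURE_MODE_PATTERNS:
        if substring in detail:
            return label
    return 'unknown'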
Discussing how to handle multiple failure modes within a single repair job.
After discussing with Prathmesh, we want more comprehensive metrics, which requires changes at a lower level in the general repair code. Once this is done, we'll also have cros repair metrics, since the change will be made at the base level.
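
A rough sketch of what "at the base level" could mean: wrap the generic verifier invocation (the _verify_host path in client/common_lib/hosts/repair.py shown in the earlier traceback) so every verifier failure, servo or cros, reports a metric with pool/model fields. Everything below is a placeholder, not the landed CL:

# Hypothetical sketch of a base-level hook; report_metric(), the metric names,
# and the host attributes used as fields are placeholders, not the landed CL.

def verify_with_metrics(verifier, host, report_metric):
    """Run one verifier and report its outcome, regardless of which repair
    flow (servo or cros) invoked it."""
    fields = {
        'verifier': verifier.__class__.__name__,
        'model': getattr(host, 'model', 'unknown'),
        'pool': getattr(host, 'pool', 'unknown'),
    }
    try:
        verifier.verify(host)
    except Exception as e:
        report_metric('repair/verify_failed', dict(fields, reason=str(e)))
        raise
    report_metric('repair/verify_passed', fields)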
Under review; aiming to land tomorrow.
Comment 21 by bugdroid1@chromium.org (Project Member), Dec 14

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/c10552c0d96e5092ac61506f42841b36750174d1

commit c10552c0d96e5092ac61506f42841b36750174d1
Author: Garry Wang <xianuowang@chromium.org>
Date: Fri Dec 14 03:28:04 2018

autotest: filter out unwanted hostname for repair metrics

BUG= chromium:735121 
TEST=None

Change-Id: Ic3dfa1961bf12a71fde051d1658cfc0c1607927c
Reviewed-on: https://chromium-review.googlesource.com/1374438
Commit-Ready: Garry Wang <xianuowang@chromium.org>
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Garry Wang <xianuowang@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/c10552c0d96e5092ac61506f42841b36750174d1/client/common_lib/hosts/repair.py

awaiting push
Status: Fixed (was: Assigned)
