Need a better process for tracking/reporting servo failures
Reported by
jrbarnette@chromium.org,
Jun 2 2017
Issue description
This is a follow-up to bug 722961; details of the original failure
are described there.
Recently, 4 cave DUTs went offline, and subsequently failed repair.
Repair failed because of two apparently distinct servo failures.
Nominally, our process for this failure works this way:
* The deputy, after confirming that it's safe, balances pools to
get the failed DUTs out of the critical pool.
* At some point, the failed DUTs come up in the regular repair
list for englab-sys-cros@
* The responsible tech sees that servo failed, troubleshoots and
fixes the servo, then uses the servo to repair the DUT.
However, we have no feedback in the process to know if it's working
as intended, or tools to help improve the process:
* We don't have easily accessible history of the servo to determine
when it went bad.
* The tech's repair actions don't produce an easily accessible
history, so we don't know what sorts of manual actions are common
and therefore how we might improve diagnosis or check for errors
more proactively.
* We don't know how hard it is for techs to troubleshoot these
problems, or what tool improvements would help them.
We gather much of the data required already. What's needed are steps
like the following:
* Index special task results stored in gs://chromeos-autotest-results.
* Include machine-readable special task summaries (cf. bug 708312).
* Make key work in deployment_test and repair_test go through a
special task.
* Create tools to use the data implied by the above changes.
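As a rough illustration of the first two steps, here is a minimal sketch of indexing machine-readable summaries by host. It is not the autotest implementation: the summary fields (`task_type`, `hostname`, `success`, `failed_verifiers`, `finished`) and the helpers `build_index` and `first_failure` are hypothetical, and the summaries are assumed to have already been fetched from the results bucket as JSON strings.

```python
import json
from collections import defaultdict

# Hypothetical machine-readable summaries, one JSON document per special
# task. Field names and values are assumptions for illustration only.
SUMMARIES = [
    '{"task_type": "repair", "hostname": "dut-1", "success": false,'
    ' "failed_verifiers": ["servod"], "finished": "2017-06-01T10:00:00"}',
    '{"task_type": "verify", "hostname": "dut-1", "success": true,'
    ' "failed_verifiers": [], "finished": "2017-05-30T09:00:00"}',
    '{"task_type": "provision", "hostname": "dut-2", "success": true,'
    ' "failed_verifiers": [], "finished": "2017-06-01T11:00:00"}',
]


def build_index(raw_summaries):
    """Group parsed summaries by hostname, oldest task first."""
    index = defaultdict(list)
    for raw in raw_summaries:
        summary = json.loads(raw)
        index[summary["hostname"]].append(summary)
    for tasks in index.values():
        # ISO 8601 timestamps in one format sort correctly as strings.
        tasks.sort(key=lambda s: s["finished"])
    return dict(index)


def first_failure(index, hostname, verifier):
    """Return the finish time of the earliest task where `verifier` failed."""
    for summary in index.get(hostname, []):
        if verifier in summary["failed_verifiers"]:
            return summary["finished"]
    return None


index = build_index(SUMMARIES)
print(first_failure(index, "dut-1", "servod"))
```

With an index like this, a question such as "when did this host's servo first fail?" becomes a lookup rather than a manual search through task logs.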
,
Jun 5 2017
,
Jun 5 2017
Next steps: investigate, in order to scope this down to work that's small enough to use for chase-pending.
,
Jun 5 2017
What I actually want is a report of servos that aren't working, which is populated BEFORE they fail during a repair. Reasonable signals that a servo is broken:
1) It can't be pinged.
2) It doesn't respond to SSH.
3) It self-reports failure. Servod not running?
4) It fails occasional USB stick testing when idle.
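A minimal sketch of such a report, covering only the first two signals, might look like the following. Everything here is an assumption for illustration: the function names are hypothetical, "responds to SSH" is approximated by a TCP connection to port 22, the `ping` flags are the Linux ones, and the servod and USB checks are left out because they depend on lab tooling not described in this bug.

```python
import socket
import subprocess


def can_ping(host):
    """Return True if a single ICMP echo succeeds (Linux `ping` flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_servo(host, ping=can_ping, ssh=lambda h: port_open(h, 22)):
    """Return the list of failed signals for one servo host."""
    failures = []
    if not ping(host):
        failures.append("ping")
    if not ssh(host):
        failures.append("ssh")
    return failures


def report(hosts, **probes):
    """Map each unhealthy servo host to its failed signals."""
    results = {host: check_servo(host, **probes) for host in hosts}
    return {host: failed for host, failed in results.items() if failed}
```

The probes are passed in as parameters so the report logic can be exercised without touching the network, for example `report(hosts, ping=fake_ping, ssh=fake_ssh)`.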
,
Jun 5 2017
> What I actually want is a report of servos that aren't working,
> which is populated BEFORE they fail during a repair.
By itself, that data isn't enough. ATM, there are ~250 broken DUTs. No broken servo with a working DUT will get (or deserve) attention until that backlog is controlled. More broadly, any improvement meant to make it easier to find and deal with servo problems must also make it at least as easy to find and deal with broken DUTs.
Moreover, the steps suggested in the bug description are pre-requisite to producing the suggested report. Most especially: Three of the four items in the suggested report are already gathered in provision jobs. (The fourth could be added, although we'd want to change the repair framework to be able to distinguish different levels of "broken".)
The problem is finding relevant provision tasks; hence the recommendation for indexing special task logs, and providing machine-readable results.
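To make the idea of different levels of "broken" concrete, here is one possible sketch. It does not describe the existing repair framework: the `ServoHealth` levels, the signal names, and the severity assigned to each signal are all hypothetical choices made for illustration.

```python
from enum import IntEnum


class ServoHealth(IntEnum):
    """Hypothetical health levels, ordered from best to worst."""
    OK = 0
    DEGRADED = 1  # assumed: some servo functions may still work
    BROKEN = 2    # assumed: servo cannot be used for repair at all


# Assumed mapping from a failed signal to its severity.
SEVERITY = {
    "ping": ServoHealth.BROKEN,
    "ssh": ServoHealth.BROKEN,
    "servod": ServoHealth.BROKEN,
    "usb": ServoHealth.DEGRADED,
}


def classify(failed_signals):
    """Return the worst severity among the failed signals, or OK if none."""
    return max((SEVERITY[s] for s in failed_signals), default=ServoHealth.OK)
```

Under these assumptions, a servo that fails only the USB check is reported as degraded rather than broken, so it can be prioritized separately from servos that are unreachable.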
,
Mar 31 2018
Comment 1 by dgarr...@chromium.org
, Jun 5 2017