
Issue 729177


Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Feature




Need a better process for tracking/reporting servo failures

Reported by jrbarnette@chromium.org, Jun 2 2017

Issue description

This is a follow-up to bug 722961; details of the original failure
are described there.

Recently, 4 cave DUTs went offline, and subsequently failed repair.
Repair failed because of two apparently distinct servo failures.

Nominally, our process for this failure works this way:
  * The deputy, after confirming that it's safe, balances pools to
    get the failed DUTs out of the critical pool.
  * At some point, the failed DUTs come up in the regular repair
    list for englab-sys-cros@
  * The responsible tech sees that servo failed, troubleshoots and
    fixes the servo, then uses the servo to repair the DUT.

However, we have no feedback in the process to tell us whether it's
working as intended, and no tools to help improve the process:
  * We don't have easily accessible history of the servo to determine
    when it went bad.
  * The tech's repair actions don't produce an easily accessible
    history, so we don't know what sorts of manual actions are common
    and therefore how we might improve diagnosis or check for errors
    more proactively.
  * We don't know how hard it is for techs to troubleshoot these
    problems, or what tool improvements would help them.

We already gather much of the required data.  What's needed are steps
like the following:
  * Index special task results stored in gs://chromeos-autotest-results.
  * Include machine-readable special task summaries (cf. bug 708312).
  * Route key work in deployment_test and repair_test through a
    special task.
  * Create tools to use the data implied by the above changes.
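To make the indexing step concrete, here is a minimal sketch of what a tool consuming the proposed machine-readable summaries might look like. The JSON record layout (`hostname`, `task`, `status`, `time` fields) is an assumption for illustration; bug 708312 would define the actual schema, and real summaries would be read out of gs://chromeos-autotest-results rather than passed in-memory.

```python
import json
from collections import defaultdict

def index_summaries(summaries):
    """Build a per-hostname index of special task results.

    `summaries` is an iterable of JSON strings, each describing one
    special task with a hypothetical schema, e.g.:
        {"hostname": ..., "task": "Repair", "status": "FAIL", "time": ...}
    Returns {hostname: [records sorted by time]}, which makes questions
    like "when did this servo go bad?" a simple scan.
    """
    index = defaultdict(list)
    for raw in summaries:
        rec = json.loads(raw)
        index[rec["hostname"]].append(rec)
    for recs in index.values():
        recs.sort(key=lambda r: r["time"])
    return dict(index)
```

With such an index, finding the first failing task for a host is a one-line search over its sorted record list.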

Labels: Chase-Pending
Labels: -ImpactsCQ

Comment 3 by aut...@google.com, Jun 5 2017

Labels: -Chase-Pending OKR
Next steps: investigate to scope the work down to something small enough to use for Chase-Pending.
What I actually want is a report of servos that aren't working, which is populated BEFORE they fail during a repair.

Reasonable signals that a servo is broken:
 1) It can't be pinged.
 2) It doesn't respond to SSH.
 3) It self-reports failure (e.g., servod not running).
 4) It fails occasional USB stick testing while idle.
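The first three signals above could be polled by a small health checker. This is a hypothetical sketch, not the lab's actual repair framework: the `ping`/`ssh` probes shell out to standard tools, and the check list is injectable so higher-level code (or a test) can supply its own probes.

```python
import subprocess

def ping_ok(host, timeout=2):
    """Signal 1: the servo host answers an ICMP ping."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def ssh_ok(host, timeout=5):
    """Signal 2: the servo host accepts an SSH connection."""
    return subprocess.run(
        ["ssh", "-o", f"ConnectTimeout={timeout}", host, "true"],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def diagnose(host, checks):
    """Run ordered (name, check_fn) pairs against `host`.

    Returns the name of the first failing signal, or None if all
    checks pass -- i.e., the servo looks healthy.
    """
    for name, check in checks:
        if not check(host):
            return name
    return None
```

Ordering the checks cheapest-first (ping, then SSH, then a servod query) means the report can say not just "broken" but how far into the stack the servo got before failing.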

> What I actually want is a report of servos that aren't working,
> which is populated BEFORE they fail during a repair.

By itself, that data isn't enough.  At the moment, there are ~250
broken DUTs.  No broken servo with a working DUT will get (or deserve)
attention until that backlog is under control.  More broadly, any
improvement meant to make it easier to find and deal with servo
problems must also make it at least as easy to find and deal with
broken DUTs.

Moreover, the steps suggested in the bug description are prerequisites
to producing the suggested report.  Most especially: three of the four
items in the suggested report are already gathered by provision jobs.
(The fourth could be added, although we'd want to change the repair
framework to be able to distinguish different levels of "broken".)
The problem is finding relevant provision tasks; hence the recommendation
for indexing special task logs, and providing machine-readable results.

Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>Test
Labels: -OKR
