New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 842467 link

Starred by 4 users

Issue metadata

Status: Assigned
Owner:
OOO
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocked on:
issue 842468


Participants' hotlists:
SIE-infra-request


Sign in to add a comment

labstation servod should restart reliably

Project Member Reported by nsanders@chromium.org, May 12 2018

Issue description

servod on labstations don't restart reliably on failure, and can remain wedged across multiple runs.

It should:
* restart if USB connection is lost
* restart if dut-controls fail
* reboot servo v4 and servo micro if servod fails

* throttle and surface actionable infra error if fail repeatedly.
 
Labels: Infra-ChromeOS
Is this a proposal to change Autotest code for talking to
servo, or to change servod code in hdctools?

Restarting servod on various failures is already part of the ServoHost
repair code in Autotest.

> * reboot servo v4 and servo micro if servod fails

Because a servo V4 labstation serves more than one DUT, this can't
be done unconditionally, but should only be done once all of the
attached DUTs can be made idle.

Rebooting servos doesn't require rebooting the labstation (actually there's a separate bug that rebooting the labstation doesn't reboot the servos) 

The proposal here is to add some init and monitoring code to the upstart script that runs and restarts servod, if I understand how servod gets run correctly.

From looking at the Dru rack state, ServoHost may be allowing dead servods to persist, it's worth reviewing that code. It may be if we go over it together we can figure it out? 
> Rebooting servos doesn't require rebooting the labstation [ .. ]

Ah.  Adding that to the servod upstart job could make sense.  Is it something
we can do while servod is running?  If so, this might be something that should
be part of the 'hwinit()' call.  If we make it part of 'hwinit()', that would
obviate upstart changes.


> From looking at the Dru rack state, ServoHost may be allowing dead servods to persist  [ ... ]

Fixing problems by rebooting servo V4 isn't something that the current
system knows how to do.  The code for detecting problems will restart
servod if the servo fails various verification checks.  Without more
data, I can't say whether to deal with existing dead servos we need to
add verification checks, or add servo V4 reboot, or both.

If we can reboot servo V4 as part of 'hwinit()', I think it would obviate
any work here, too.

Unfortunately it can't be part of hwinit since servod only understands persistent USB connections, so this would need to be in the upstart script or autotest.

I think the various verification checks may need to be updated, is there a link to what these checks are? 
> [ ... ] it can't be part of hwinit since servod only understands persistent USB connections [ ... ]

For my own benefit, I'll want to understand better what this means.

> I think the various verification checks may need to be updated, is there a link to what these checks are? 

https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/hosts/servo_repair.py?l=352

Executive summary:
'servo_ssh' - Test that the servo host answers to 'ssh'.
'update' - Test that software on the host is up-to-date.  If out-of-date,
    normally the verifier triggers update rather than failing.
'brd_config' - Test that the BOARD setting for servod matches the DUT's
    database entry.
'ser_config' - Test that the SERIAL setting for the servod job matches the
    database.
'job' - Test that the servod upstart job is running.
'servod' - Establish an ssh tunnel connection to servod (if needed, which
    it always is in the test lab), and confirm that hwinit() returns
    success.
'pwr_button' - Test that the 'pwr_button' control is readable.
'lid_open' - Test that the 'lid_open' control is readable, if the DUT
    has a lid.

Components: Infra>Client>ChromeOS>Test
Labels: OS-Chrome
Issue 842821 has been merged into this issue.
Owner: jrbarnette@chromium.org
Status: Assigned (was: Untriaged)
jrbarnette to figure out what needs to be done
Blockedon: 842468
Owner: nsanders@chromium.org
nsanders@ knows better than I what changes might be necessary.

One question to help motivate:
    Is there any condition that might require reboot of servo V4
    (as opposed to labstation reboot) that can't be detected by
    trying to read 'pwr_button'?

If the answer is "no", then I believe this bug can be closed in favor
of  bug 842468 .  If the answer is "yes" then we need to address  bug 842468 
first.  After that, it should be enough to add a Servo verifier that can
detect when servo V4 reboot is needed, and add the new verifier to the
triggers for the _RestartServod repair action.

Cc: englab-sys-cros@google.com
Yes, I think reading 'pwr_button" isn't sufficient to check servo functionality. Lets meet up and discuss.


Cc: gu...@chromium.org
Owner: gu...@chromium.org
https://chromium-review.googlesource.com/c/chromiumos/third_party/hdctools/+/1068154 has some attempt at this. I don't have any way to test it though, so reassign to guocb@.
Labels: labstation
Labels: -Pri-1 Pri-2
Part of 2019 planning. Updating priority.

Sign in to add a comment