labstation servod should restart reliably |
||||||||
Issue descriptionservod on labstations don't restart reliably on failure, and can remain wedged across multiple runs. It should: * restart if USB connection is lost * restart if dut-controls fail * reboot servo v4 and servo micro if servod fails * throttle and surface actionable infra error if fail repeatedly.
,
May 14 2018
Is this a proposal to change Autotest code for talking to servo, or to change servod code in hdctools? Restarting servod on various failures is already part of the ServoHost repair code in Autotest. > * reboot servo v4 and servo micro if servod fails Because a servo V4 labstation serves more than one DUT, this can't be done unconditionally, but should only be done once all of the attached DUTs can be made idle.
,
May 14 2018
Rebooting servos doesn't require rebooting the labstation (actually there's a separate bug that rebooting the labstation doesn't reboot the servos) The proposal here is to add some init and monitoring code to the upstart script that runs and restarts servod, if I understand how servod gets run correctly. From looking at the Dru rack state, ServoHost may be allowing dead servods to persist, it's worth reviewing that code. It may be if we go over it together we can figure it out?
,
May 14 2018
> Rebooting servos doesn't require rebooting the labstation [ .. ] Ah. Adding that to the servod upstart job could make sense. Is it something we can do while servod is running? If so, this might be something that should be part of the 'hwinit()' call. If we make it part of 'hwinit()', that would obviate upstart changes. > From looking at the Dru rack state, ServoHost may be allowing dead servods to persist [ ... ] Fixing problems by rebooting servo V4 isn't something that the current system knows how to do. The code for detecting problems will restart servod if the servo fails various verification checks. Without more data, I can't say whether to deal with existing dead servos we need to add verification checks, or add servo V4 reboot, or both. If we can reboot servo V4 as part of 'hwinit()', I think it would obviate any work here, too.
,
May 14 2018
Unfortunately it can't be part of hwinit since servod only understands persistent USB connections, so this would need to be in the upstart script or autotest. I think the various verification checks may need to be updated, is there a link to what these checks are?
,
May 14 2018
> [ ... ] it can't be part of hwinit since servod only understands persistent USB connections [ ... ] For my own benefit, I'll want to understand better what this means. > I think the various verification checks may need to be updated, is there a link to what these checks are? https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/hosts/servo_repair.py?l=352 Executive summary: 'servo_ssh' - Test that the servo host answers to 'ssh'. 'update' - Test that software on the host is up-to-date. If out-of-date, normally the verifier triggers update rather than failing. 'brd_config' - Test that the BOARD setting for servod matches the DUT's database entry. 'ser_config' - Test that the SERIAL setting for the servod job matches the database. 'job' - Test that the servod upstart job is running. 'servod' - Establish an ssh tunnel connection to servod (if needed, which it always is in the test lab), and confirm that hwinit() returns success. 'pwr_button' - Test that the 'pwr_button' control is readable. 'lid_open' - Test that the 'lid_open' control is readable, if the DUT has a lid.
,
May 14 2018
,
May 14 2018
Issue 842821 has been merged into this issue.
,
May 14 2018
jrbarnette to figure out what needs to be done
,
May 14 2018
nsanders@ knows better than I what changes might be necessary.
One question to help motivate:
Is there any condition that might require reboot of servo V4
(as opposed to labstation reboot) that can't be detected by
trying to read 'pwr_button'?
If the answer is "no", then I believe this bug can be closed in favor
of bug 842468 . If the answer is "yes" then we need to address bug 842468
first. After that, it should be enough to add a Servo verifier that can
detect when servo V4 reboot is needed, and add the new verifier to the
triggers for the _RestartServod repair action.
,
May 15 2018
,
May 16 2018
Yes, I think reading 'pwr_button" isn't sufficient to check servo functionality. Lets meet up and discuss.
,
May 29 2018
https://chromium-review.googlesource.com/c/chromiumos/third_party/hdctools/+/1068154 has some attempt at this. I don't have any way to test it though, so reassign to guocb@.
,
Jun 26 2018
,
Oct 12
Part of 2019 planning. Updating priority. |
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by nsanders@chromium.org
, May 12 2018