New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 832957 link

Starred by 2 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug


Participants' hotlists:
SIE-infra-request


Sign in to add a comment

CCD: servod disconnect and reboot on cr50 update

Project Member Reported by nsanders@chromium.org, Apr 13 2018

Issue description

If cr50 is updated by an AU or recovery, servod will disconnect and the DUT will reboot. 

Lab infra should support this usecase. It's likely this will require changes to both CCD and servo_micro/v3 flows. 

cr50 can be updated in recovery, chromeos_install, AU.
 
Components: Infra>Client>ChromeOS
> If cr50 is updated [ ... ], servod will disconnect and the DUT will reboot.

Ugh.  Is it really necessary for servod to disconnect?

In any event, I think that the servo upstart job uses "respawn", so
of servod dies, the system will automatically restart it.  So, the
servo host will maintain service.  I think the remote client's
connection to servod should be safe, too.  The RPC calls to servod
are stateless, so that part is safe.  The connection is handled over
an ssh tunnel, but I _think_ the tunnel won't be affected by a
servod restart.

As for the DUT reboot:  Rebooting the DUT is a natural part of
the update process, but it has to be driven by Autotest.  If the
DUT reboots spontaneously during AU, it could be interpreted as a
crash.  I'm reluctant to fix that, because distinguishing "exepected
reboot because of cr50 update" from "kernel crash due to a bug in the
new build" is likely to prove to be fragile in practice.

However, I have some sense that the reboot triggered by cr50 update
is actually something that happens during early boot the first time
after installing the new build, before we start services like ssh.
If that's the case, we're OK.  That reboot will be perceived simply as
"DUT was slow to reboot after installation."  As long as a long boot
doesn't trigger a timeout, it won't be considered an error.

cr50 is acting as the servo in this case, so when it's updated, it's the equivalent of unplugging and repluggung servo's USB.

cr50's security model involves cold reset of ec and ap on reboot/au. This is usually deferred until a reboot happens naturally, after au is applied.

A likely effect of this would be that powe_state: off/on will report fail since servod will disconnect on the indicated reboot. 

I think you are correct in that ssh won't be affected.
You can use power_state:ccd_reset after the update to reinitialize all cr50 the usb endpoints in servod.

SSH will be fine.
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>Test
Status: Assigned (was: Untriaged)
Owner: mruthven@chromium.org
OK.  It's not clear to me what the visible impact of a cr50 update
is, or whether that impact would belong under the rubric "bug".

IIUC, the cr50 update is part of the AU process, so when we run AU
(which is periodically), the cr50 will update during the flow.  Only
the DUT knows when this happens, and the only visible impacts are
 1) The reboot after the update will take longer than normal.
 2) Some servod impact that I haven't yet grasped.

Can someone characterize what servod will do after the cr50 update?
What I want to hear is that servod will fail, and shut down.  If that's
so, than that's good enough for now.  Eventually, we may want to be
smarter, but I think we'll need some practical experience to discover
what "smarter" really means.  If the response from servod is somewhat
less graceful that "shut down", then we'll probably need changes...

... passing to mruthven@ to answer the question posed on comment #5.

The servo impact is that for ccd, servo uses the cr50 usb endpoints. When cr50 reboots the usb resets and servod can't communicate with the consoles anymore. 

Servod doesn't automatically recover from cr50 resets at all. All of the controls/consoles just stop responding. It doesn't crash. 

if you run dut-control power_state:ccd_reset, usb will be reinitialized.
> Servod doesn't automatically recover from cr50 resets at all.
> All of the controls/consoles just stop responding. It doesn't crash. 

If the controls stop responding, does reading them at least return
failure?  In particular, what if you try to read the 'power_button'
control after a cr50 reset?

Also, in the 'Servod' class, there's a method named 'hwinit()'.  The
source is in servo/servo_server.py.  That method is designed to reinitialize
servo to a known, sane state.  Is there any reason that the 'hwinit()'
function couldn't include a call to 'self.reinitialize()'?  If it did, then
the Autotest lab would be guaranteed to recover from a cr50 reset eventually,
since we call 'hwinit()' quite regularly.

Finally, if you stop and restart servod, will that also fix anything hanging
over from a cr50 reset?

I think servod fails with 'Timeout waiting for response'


I think hwinit could be modified to run reinitialize and it should work.


Yeah restarting servod will reinitialize usb, so it fixes it.
> I think hwinit could be modified to run reinitialize and it should work.

Let's start with that:  No matter what else we need to do to
deal with this problem, making hwinit() fix the problem is
going to be a required part of the solution.

Once we've made that change, I'd recommend we close this bug, then
wait and see.  The change would mean that the lab will revert to
working order in ordinary operation, regardless of the particulars
of how cr50 updates manifest.  It's possible (even likely) that cr50
updates will be rare enough that we won't need to do more than that.

Labels: labstation
Owner: jrbarnette@chromium.org
+jrbarnette to update status
Cc: englab-sys-cros@google.com
Cc: stagenut@chromium.org
Labels: -Pri-1 Pri-2
Part of 2019 planning. Updating priority.
Cc: matth...@chromium.org
Owner: ----
Status: Available (was: Assigned)

Sign in to add a comment