
Issue 832955

Starred by 2 users

Issue metadata

Status: Assigned
Owner: gu...@chromium.org (OOO)
Cc: gu...@chromium.org
Components: Infra>Client>ChromeOS>Test
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug


Participants' hotlists:
SIE-infra-request



CCD: Ethernet disconnect on EC update

Reported by nsanders@chromium.org (Project Member), Apr 13 2018

Issue description

On EC fw update, the PD state is reset. 

This must cause a USB disconnect in the lab, and therefore will interrupt the ssh session. 

We'll need autotest to support this use case.
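
Roughly, the support needed is a detach-and-wait pattern: launch the updater detached from the ssh session, then wait for the DUT to come back instead of treating the disconnect as a failure. Sketch only, assuming an autotest-style host object with run()/wait_down()/wait_up(); the function name, log path, and timeouts below are made up, not the actual fix:

    # Sketch only: assumes an autotest-style host object providing run(),
    # wait_down(), and wait_up(); names and timeouts are placeholders.
    def firmware_update_tolerating_disconnect(host, timeout=900):
        # Detach the updater from the ssh session, since the PD reset during
        # the EC update is expected to drop the Ethernet-over-USB link.
        host.run('nohup chromeos-firmwareupdate --mode=autoupdate '
                 '</dev/null >/var/log/fw_update.log 2>&1 &')
        # Treat the disconnect as expected: wait out the bounce, then wait
        # for the DUT to answer again before declaring success or failure.
        host.wait_down(timeout=60)
        host.wait_up(timeout=timeout)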
 
Components: Infra>Client>ChromeOS>Test
> On EC fw update, the PD state is reset. 
>
> This must cause a USB disconnect in the lab, and therefore will interrupt the ssh session. 

Oy.

OK.  I believe that this will only happen when we run
"chromeos-firmwareupdate", right?  The places where that happens
in the source can be counted on one hand (even my grandfather's
hand, and he was missing two fingers...)  So, if we have to adjust
how we update firmware in the lab, it ain't a lot of work.  That
doesn't mean it ain't a nuisance, it just means the nuisance can
be contained.

Status: Assigned (was: Untriaged)
Labels: -Pri-1 Pri-3
Owner: ----
Status: Available (was: Assigned)
I'm not clear what the fix is, and I suspect (but I don't know) that
the problem is low-impact.  Most likely, what we'll see is
  * During the firmware update verifier, the "chromeos-firmwareupdate"
    command will fail.
  * The verifier will fail because of the command failure.
  * We'll kick the device to repair, which will discover a) that the
    firmware is now up-to-date, and b) nothing else is wrong either.
  * The DUT will go back into service after a hiatus of maybe 15 minutes.

I think these disruptions will be rare enough to justify calling this "P3"
until proven otherwise.

Labels: labstation
Cc: gu...@chromium.org
Owner: gu...@chromium.org
> I'm not clear what the fix is, and I suspect (but I don't know) that
> the problem is low-impact.  Most likely, what we'll see is

I spoke at length with nsanders@ on this topic.  From the discussion,
it appears that after this failure, the DUT's USB adapter will be unavailable
until it's reset.  That will happen during repair; the DUT itself may also do
it from check_ethernet.hook.  Also, IIUC, the network failure will happen
during firmware update, before we reboot the DUT.
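
For illustration only, the DUT-side recovery described above amounts to something like the following.  This is NOT the real check_ethernet.hook; the interface name and gateway address are assumptions.

    import subprocess

    # Illustration of the DUT-side recovery described above; NOT the real
    # check_ethernet.hook.  Interface name and gateway are assumptions.
    ETH_IFACE = 'eth0'
    GATEWAY = '192.168.0.1'

    def network_is_up():
        # Treat a successful ping of the lab gateway as "network is up".
        return subprocess.call(['ping', '-c', '1', '-W', '2', GATEWAY]) == 0

    def recover_network():
        if network_is_up():
            return
        # First try bouncing the (USB) Ethernet interface.
        subprocess.call(['ip', 'link', 'set', ETH_IFACE, 'down'])
        subprocess.call(['ip', 'link', 'set', ETH_IFACE, 'up'])
        if network_is_up():
            return
        # Last resort, as described above: reboot the DUT.
        subprocess.call(['reboot'])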

So, revising, I think this is the most likely scenario:
  * During the firmware update verifier, the "chromeos-firmwareupdate"
    command will fail, and the device will go offline.
  * The verifier will fail because of the command failure, and because
    the DUT is now offline.
  * We'll kick the device to repair, which will discover that the DUT is
    offline, and hit it with several different forms of reset.  If any
    of these work, the DUT will have rebooted, and will then pass the
    firmware verifier.

Total outage from the event should still be on the order of 15 minutes.

An alternative scenario would be this:
  * During the firmware update verifier, the "chromeos-firmwareupdate"
    command will fail, and the device will go offline.
  * The verifier will fail because of the command failure, and because
    the DUT is now offline.
  * Repair fails because of a servo problem.
  * check_ethernet.hook on the DUT discovers that the network is down,
    and resets the USB.  If that fails, it eventually reboots.  All of
    this should happen within the space of about 15-30 minutes.
  * Eventually, we come back to the DUT to retry the repair, and find that
    the DUT is working.

In this alternative case, the time of the outage is likely the time between
repair attempts.
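
For concreteness, both scenarios assume roughly this verify-then-repair shape.  Every name here is hypothetical; this is not the actual autotest code.

    # Schematic of the verify/repair flow described above; every name here
    # is hypothetical and this is not the actual autotest implementation.
    class VerifyError(Exception):
        pass

    def verify_firmware(host):
        # The verifier runs the updater; a nonzero exit or an offline DUT
        # means verification failed and the DUT is kicked to repair.
        result = host.run('chromeos-firmwareupdate --mode=autoupdate',
                          ignore_status=True)
        if result.exit_status != 0 or not host.is_up():
            raise VerifyError('updater failed or DUT went offline')

    def repair(host):
        # Hit the offline DUT with several forms of reset; the first one
        # that brings it back should also leave the firmware up to date.
        for reset in (host.servo_reset, host.power_cycle, host.usb_reset):
            reset()
            if host.wait_up(timeout=300):
                return verify_firmware(host)
        raise VerifyError('DUT still offline after all reset attempts')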

This problem is likely to be periodically disruptive, especially for EVT and
DVT units, when firmware updates are most common.  However, the problem isn't
fundamentally fatal.  The worst case frequency should be once a week, shortly
after the automated firmware update.  In practice, I'm assuming most models
won't update more frequently than once a month at their peak.

This problem will also affect the automated deployment script.  When new
DUTs arrive from the factory, there's a special flow used.  Roughly, it
goes like this:
  * With the DUT in dev mode, use servo to boot from USB via ctrl+U on
    the keyboard.
  * Run these commands:
        flashrom -p host --wp-disable
        flashrom -p ec --wp-disable
        chromeos-firmwareupdate --mode=factory
        /usr/share/vboot/bin/set_gbb_flags.sh 0
        crossystem disable_dev_request=1
        halt

The point of the sequence is to get the DUTs back into verified mode,
but with dev-signed firmware.  If the DUT were to go offline at the
firmware update, the sequence would break.

The sequence above is _already_ known to be fragile.  It's common that
we have problems with the "boot in dev mode using ctrl+U".  So, the
standard deployment procedure allows for manually installing the firmware,
and skipping the automated procedure above.  But, in the presence of CCD,
the procedure above could turn out to be irretrievably broken.
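
If we do keep the automated flow, the firmware-update step would need the same detach-and-wait treatment described earlier.  A sketch (the DUT name, log path, and timeouts are made up; this is not the actual deployment code):

    import subprocess
    import time

    # Sketch only: make the factory firmware-update step tolerate the
    # expected CCD Ethernet drop.  Not the actual deployment script.
    def run_on_dut(dut, cmd):
        return subprocess.call(['ssh', '-o', 'ConnectTimeout=10', dut, cmd])

    def wait_for_dut(dut, timeout=900):
        deadline = time.time() + timeout
        while time.time() < deadline:
            if run_on_dut(dut, 'true') == 0:
                return True
            time.sleep(15)
        return False

    def factory_firmware_update(dut):
        run_on_dut(dut, 'flashrom -p host --wp-disable')
        run_on_dut(dut, 'flashrom -p ec --wp-disable')
        # The EC update resets PD state, so the ssh link is expected to
        # drop here; detach the updater and poll for the DUT to return.
        run_on_dut(dut, 'nohup chromeos-firmwareupdate --mode=factory '
                        '</dev/null >/var/log/factory_fw.log 2>&1 &')
        if not wait_for_dut(dut):
            raise RuntimeError('%s did not come back after firmware update'
                               % dut)
        run_on_dut(dut, '/usr/share/vboot/bin/set_gbb_flags.sh 0')
        run_on_dut(dut, 'crossystem disable_dev_request=1')
        run_on_dut(dut, 'halt')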

Status: Assigned (was: Available)
Part of 2019 planning.
