CCD: Ethernet disconnect on EC update
Issue description
On an EC firmware update, the PD state is reset. This must cause a USB disconnect in the lab, and therefore will interrupt the ssh session. We'll need autotest to support this use case.
,
May 9 2018
,
May 10 2018
I'm not clear what the fix is, and I suspect (but I don't know) that
the problem is low-impact. Most likely, what we'll see is:
* During the firmware update verifier, the "chromeos-firmwareupdate"
command will fail.
* The verifier will fail because of the command failure.
* We'll kick the device to repair, which will discover a) that the
firmware is now up-to-date, and b) nothing else is wrong either.
* The DUT will go back into service, after a hiatus of maybe 15 minutes.
I think these disruptions will be rare enough to justify calling this "P3"
until proven otherwise.
,
Jun 26 2018
,
Jun 26 2018
,
Jul 26
> I'm not clear what the fix is, and I suspect (but I don't know) that
> the problem is low-impact. Most likely, what we'll see is
I spoke at length with nsanders@ on this topic. From the discussion,
it appears that after this failure, the DUT's USB adapter will be unavailable
until it's reset. That will happen during repair; the DUT itself may also do
it from check_ethernet.hook. Also, IIUC, the network failure will happen
during firmware update, before we reboot the DUT.
So, revising, I think this is the most likely scenario:
* During the firmware update verifier, the "chromeos-firmwareupdate"
command will fail, and the device will go offline.
* The verifier will fail because of the command failure, and because
the DUT is now offline.
* We'll kick the device to repair, which will discover that the DUT is
offline, and hit it with several different forms of reset. If any
of these work, the DUT will have rebooted, and will then pass the
firmware verifier.
Total outage from the event should still be on the order of 15 minutes.
An alternative scenario would be this:
* During the firmware update verifier, the "chromeos-firmwareupdate"
command will fail, and the device will go offline.
* The verifier will fail because of the command failure, and because
the DUT is now offline.
* Repair fails because of a servo problem.
* check_ethernet.hook on the DUT discovers that the network is down,
and resets the USB. If that fails, it eventually reboots. All of
this should happen within the space of about 15-30 minutes.
* Eventually, we come back to the DUT to retry the repair, and find that
the DUT is working.
In this alternative case, the time of the outage is likely the time between
repair attempts.
This problem is likely to be periodically disruptive, especially for EVT and
DVT units, when firmware updates are most common. However, the problem isn't
fundamentally fatal. The worst case frequency should be once a week, shortly
after the automated firmware update. In practice, I'm assuming most models
won't update more frequently than once a month at their peak.
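For illustration, here is a minimal Python sketch of how a verifier could tolerate the disconnect in the scenarios above: treat a dropped connection during the update as expected, wait for the DUT to come back, and only then judge the result. The function name, the three callables, and the use of Python's built-in ConnectionError are all assumptions made for this sketch; none of them are existing autotest APIs. The 30-minute default timeout follows the 15-30 minute recovery window estimated above.

```python
import time


def update_tolerating_disconnect(run_update, is_reachable, firmware_is_current,
                                 timeout_s=30 * 60, poll_s=30,
                                 sleep=time.sleep, clock=time.monotonic):
    """Run a firmware update and survive an expected network drop.

    All three callables are hypothetical hooks supplied by the caller:
      run_update()          runs the update on the DUT; assumed to raise
                            ConnectionError if the ssh session drops.
      is_reachable()        returns True once the DUT answers again.
      firmware_is_current() returns True if the expected firmware is installed.
    """
    try:
        run_update()
    except ConnectionError:
        # The drop is expected when the EC update resets the PD state.
        # Wait for the DUT to come back before deciding anything.
        deadline = clock() + timeout_s
        while not is_reachable():
            if clock() >= deadline:
                return False  # still offline; leave the DUT to repair
            sleep(poll_s)
    # Judge success by the installed firmware, not by the command's exit.
    return firmware_is_current()
```

The point of the design is that success is decided by the firmware that ends up installed, so a lost ssh session alone would not fail the verifier.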
,
Jul 26
This problem will also affect the automated deployment script. When new
DUTs arrive from the factory, there's a special flow used. Roughly, it
goes like this:
* With the DUT in dev mode, use servo to boot from USB via ctrl+U on
the keyboard.
* Run these commands:
    flashrom -p host --wp-disable
    flashrom -p ec --wp-disable
    chromeos-firmwareupdate --mode=factory
    /usr/share/vboot/bin/set_gbb_flags.sh 0
    crossystem disable_dev_request=1
    halt
The point of the sequence is to get the DUTs back into verified mode,
but with dev-signed firmware. If the DUT were to go offline at the
firmware update, the sequence would break.
The sequence above is _already_ known to be fragile. It's common that
we have problems with the "boot in dev mode using ctrl+U". So, the
standard deployment procedure allows for manually installing the firmware,
and skipping the automated procedure above. But, in the presence of CCD,
the procedure above could turn out to be irretrievably broken.
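To make the failure mode concrete, here is a rough Python sketch of the command sequence above, run step by step so that a dropped connection is reported against the step where it happened. The commands are copied from the list above; the function name, the run_on_dut callable, and the use of ConnectionError to signal a lost session are assumptions for illustration, not part of the actual deployment script.

```python
# Commands copied verbatim from the deployment flow described above.
DEPLOY_COMMANDS = [
    "flashrom -p host --wp-disable",
    "flashrom -p ec --wp-disable",
    "chromeos-firmwareupdate --mode=factory",
    "/usr/share/vboot/bin/set_gbb_flags.sh 0",
    "crossystem disable_dev_request=1",
    "halt",
]


def run_deploy_sequence(run_on_dut, commands=DEPLOY_COMMANDS):
    """Run each command in order on the DUT.

    run_on_dut(command) is a hypothetical hook that executes one command
    over ssh and is assumed to raise ConnectionError if the session drops.

    Returns (completed, failed): the commands that finished, and the
    command during which the connection was lost (None if all ran).
    """
    completed = []
    for command in commands:
        try:
            run_on_dut(command)
        except ConnectionError:
            # Everything after this step is skipped, which is how the
            # sequence breaks if the DUT goes offline mid-update.
            return completed, command
        completed.append(command)
    return completed, None
```

If the DUT drops off the network during chromeos-firmwareupdate, this returns the two flashrom commands as completed and names the update as the failed step, so the last three commands never run.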
,
Aug 2
,
Oct 12
Part of 2019 planning.
Comment 1 by jrbarnette@chromium.org
, Apr 17 2018