New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 678675 link

Starred by 3 users

Issue metadata

Status: Duplicate
Merged: issue 787159
Owner:
Last visit > 30 days ago
Closed: Jun 2018
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug



Sign in to add a comment

Add EC feature for generic retry mechanism

Project Member Reported by diand...@chromium.org, Jan 5 2017

Issue description

We always end up with problems where EC comms are only 99.99% reliable and then we end up spending tons of time trying to get that last .01%.

I wonder if we should add add some stuff to the EC protocol that would allow reliable retries in the case of failure.  Errors that affect reliability suddenly become much less of an issue.

To implement this, you need:
* A sequence number that increases with each command (doesn't need to be very big, could even be 1 bit big).
* Enough RAM on the EC to store the biggest possible response.


The idea is that if the EC gets a command and sees that the sequence number matches the last sequence number that it got, it will immediately return the last response rather than doing any additional work.


Doing this will allow even commands with a "side effect" to be retried on communication failure.

---

As an example using a "1 bit" sequence number:

AP: get_next_event (seq #0)
EC: gets the event, caches the result, sends result back to AP
AP: get_next_event (seq #1)
EC: gets the event, caches the result, sends result back to AP
AP: get_next_event (seq #1)
EC: same seq number as last time ==> retry; send back cache
AP: get_next_event (seq #0)
EC: gets the event, caches the result, sends result back to AP


 
There are a few unused bits in the request/response headers which could be used for this.

This would also be a great time to make a few other changes to the packet headers:

1) Separate header CRC vs. body CRC.  Right now, you can't know if you have the packet length correct until you've read the entire packet with the unverified length.

2) Actually use a CRC instead of a checksum.  The design doc for host command v3 said CRC, but somehow we (I) implemented a checksum.  Which has caused problems in at least one case where 0x00 bytes were getting inserted into a packet.
Owner: rspangler@chromium.org
Status: Started (was: Untriaged)
I'll propose a design for this.
Cc: amstan@chromium.org
Working on a design doc.

We'll need to add host command protocol V4 to do this, because existing protocol request/response handlers are paranoid about verifying the reserved bytes == 0.

To keep the init sequence simple, we'll keep supporting command protocol V3, at least for EC_CMD_GET_PROTOCOL_INFO.  Once an AP driver determines via that command that V4 is supported, it can decide to start sending V4 packets.

The EC changes are fairly straightforward and well-contained.  

The AP changes can be phased in over time, since it'll be a while before we can drop V3 support in the EC (just as it was a while before we could drop V2 support).

Design doc in progress here: https://docs.google.com/a/google.com/document/d/1AcH2DH9lzceRncpNKj32l0W1ueaN2EXfQZWfSRRbyfY/edit?usp=sharing

Not ready for review yet, but want to make sure it's linked in the bug so we don't lose it.
I'm curious if this is still planned?
I'd still like to do it, if we can find resources to do so.
This came up recently where retries would have solved the problem.  <https://patchwork.kernel.org/patch/9996149/>.  I'm sure it will come up again...
Mergedinto: 787159
Status: Duplicate (was: Started)

Sign in to add a comment