New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 775520 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Oct 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 1
Type: Bug



Sign in to add a comment

BattOr agent failing due to "too many start bytes" in data frame, but doesn't retry

Project Member Reported by charliea@chromium.org, Oct 17 2017

Issue description

I stumbled across this doing local testing on my Macbook Pro, and I'll upload the serial log in a comment following this one.

The really strange thing to me is that the agent doesn't seem to be retrying the download, which I'd expect it to. I know that the flush/retry logic is in this version of the binary because if I search for "(flush)" in the serial log, something comes up.

This appears to be strictly a client-side bug, and I wonder if it's some sort of bug with the state machine that initiates the retry.
 

Comment 1 by charliea@google.com, Oct 17 2017

Attached is the serial log for the failure.
tmpDuHnDg
24.2 MB Download
Cc: charliea@chromium.org ashleymarie@chromium.org
 Issue 773374  has been merged into this issue.
After adding a lot more logging and reproducing this locally, I believe that the issue is that there are instances where the timeout isn't canceled after a read fails and we retry the command. The rough order of events is:

1) We receive a bad data frame (or some bad message) and we call RetryCommand() (http://bit.ly/2xNjxxA). RetryCommand in turn posts a delayed task to restart the command in two seconds (http://bit.ly/2xO8hkc).

2) We immediately return from OnMessageRead() (http://bit.ly/2xO8hkc), which causes us to not hit the line of code after the switch that cancels the timeout function (http://bit.ly/2kWNgOk) that was posted just after the last read was initiated.

3) If the timeout function is set to run before the retrying command is set to run, OnActionTimeout() runs (http://bit.ly/2xOxOKo), resulting in the overall command failing with BATTOR_ERROR_TIMEOUT.

The solution here should just be to cancel the timeout_callback_ as soon as any message is read. After all, the purpose of the timout callback is to prevent the situation where no response ever comes, and in this case a response has come, it's just not a good one.

I have this fixed locally and am running a stress test to verify that it looks alright. Assuming that it does, I'll send out the CL to fix it.
Project Member

Comment 5 by bugdroid1@chromium.org, Oct 18 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/67b7c1a18515f034ba788e055f8218c3676a8cf2

commit 67b7c1a18515f034ba788e055f8218c3676a8cf2
Author: Charlie Andrews <charliea@google.com>
Date: Wed Oct 18 20:47:26 2017

Cancel the BattOr agent action timeout as soon as a message is read

Before this change, the timeout callback was only cancelled if the
message was successfuly read. This caused problems when the message
*wasn't* successfully read: the action would be scheduled to be retried
in two seconds, but before that two seconds elapsed, the timeout would
trigger and we'd say that the action failed due to a timeout, without
ever having retried it.

Cancelling the timeout unconditionally should be the right thing to do
here: after all, the timeout is intended to prevent nothing from
happening for an extended period of time and, in the case that a
bad message was read, something *did* happen, it just wasn't good.

Bug:  775520 
Change-Id: I9ac7c4cfb9f12e50af07a1f8011a2495447a2f40
Reviewed-on: https://chromium-review.googlesource.com/726304
Reviewed-by: rnephew <rnephew@chromium.org>
Commit-Queue: Charlie Andrews <charliea@chromium.org>
Cr-Commit-Position: refs/heads/master@{#509863}
[modify] https://crrev.com/67b7c1a18515f034ba788e055f8218c3676a8cf2/tools/battor_agent/battor_agent.cc

Status: Fixed (was: Assigned)

Sign in to add a comment