New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 726087 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: May 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

BattOr agent fails on Windows with "read failed due to the message containing an irrecoverable error: 2" or "3"

Project Member Reported by charliea@chromium.org, May 24 2017

Issue description

This started happening after https://codereview.chromium.org/2826913002/. It doesn't seem to happen on other platforms.

Irrecoverable error 2 denotes that the message is missing a start byte. I expect that this has something to do with flushing not working as expected.

Putting this as a P1 because it's preventing us from rereleasing the BattOr agent.
 
Upon investigating further, it looks like there are also some "irrecoverable error: 3"s in there, which denotes INVALID_MESSAGE_TYPE. My theory is still the same for this being related to bad flushing though.
Summary: BattOr agent fails on Windows with "read failed due to the message containing an irrecoverable error: 2" or "3" (was: BattOr agent fails on Windows with "read failed due to the message containing an irrecoverable error: 2")
Charlie are you taking a look at this? I can get a windows build env up and running if not.
I'm guessing he is out for the day since its 7 there, but from the sounds of it he plans to pick it up again tomorrow?
Yea, I'm planning to dig into this by EOD today.
For some reason, I'm seeing a whole bunch of reinitializing of the serial connection, even though I'm seeing valid communication between the host and the BattOr. Still investigating.
Hmmm, yea, this is weird. I added some logging in the destructor, and it doesn't seem like it's being called. However, I'm seeing lots of "Serial connection open finished with success: 1", which is in the OnOpened() callback. Why would Open() be called again and again, without ever destroying the BattOrConnectionImpl?

Adding logging in BattOrConnectionImpl::Open(), it looks like that's being called over and over, even after successful reads. There's only one call to that function, and it's on BattOrAgent::BeginConnect(). This means that the connection is reinitializing, even when the last read was successful. Given that the CL aschulman@ submitted was about restarting the command when something went wrong, my best guess at this point is that something is restarting the command even when things *don't* go wrong. Maybe we're forgetting to cancel a callback somewhere?
Nope, my logic is wrong. It looks like in my log where I just do "StartTracing, StopTracing", there's only one instance where the serial connection reopens after a successful read, and it looks like that's because it automatically reopens when you start a new command. All of the other times, it reopens after a failed read, which is where we need to keep looking.
So my latest hypothesis is that we aren't waiting long enough after requesting the BattOr to send samples back to actually see the samples and, as a result, are restarting. 

Looking at the logs, it seems to corroborate this fact: after reopening the serial connection, the first thing we see is a whole bunch of bytes come back on the wire that aren't an ack to the stream samples request. This also confirms something that we've suspected before: just calling Flush() on the IO handler isn't sufficient to actually flush the serial buffer, and you still need to pull all possible bytes off of the wire.

All of this together explains why not waiting enough for the samples to be sent back might result in later read errors.

One thing that I'm confused about is why the bytes that we're seeing start with:

  0x02 0x01 0x30 0x02 0x00 0xee 0x02 0x01 0x30...

I definitely don't think that sample streams start with 0x02. I think that they should start with 0x00 (BATTOR_CONTROL_BYTE_START) 0x05 (BATTOR_MESSAGE_TYPE_SAMPLES). Hmmm...
Okay, I think I have a general idea of what's happening. When a read failed, we used to retry that read up to 20 times at 1 millisecond delays. This acted as a sort of built-in accommodation for the fact that the BattOr isn't always able to send data as fast as we expect it, and we'd basically allow 20 * (1ms + amount of time required for a read to time out) for the BattOr to send us the data.

Aaron's patch changed it so that we're a lot more lenient about retrying the command as a whole, but made it stricter about not allowing failures at any point during the command. However, because the BattOr isn't able to respond fast enough when we request data, we're basically guaranteed to need to be too eager to ask the BattOr for data at some point, leading the command to fail. Combine this with the fact that we're not doing a sufficient flush of the serial buffer, leaving old sample data on the wire, and we get the situation we have now. I think that the serial connection might be a little slower on Windows, which is why we're seeing the problem here and not elsewhere.

The fix to this should just be to add back in the retry for individual reads in addition to the overall command retry. With both of these, I think that there should be sufficient cushion to get these commands working the vast majority of the time.
Thanks for investigating this Charlie. I agree that your solution may work, but I think we should take a quick look at the windows serial code to see why the reads appear to be timing out (and thus we don't get call back for them). I know that Mac/linix serial in chrome is implemented with libevent which doesn't have a timeout for it's reads (you just register a callback for the serial file descriptor to be called whenever there is data ready to read. I suspect on Windows the serial code is not using libevent, and therefore there may be a timeout for each read. 

In summary the fix may just be changing a timeout in the serial option structure that gives the reads more time (maybe even infinite like on *nix).
charliea@. I think I found the reason why the reads are failing. I have confirmed that the Windows code is not using libevent, so that's likely why we are seeing different behavior on Windows vs. *nix. Can you post a serial log from a windows client so I can take a look and see that it is the problem that I suspect?

My hunch is that it has something to do with this: https://cs.chromium.org/chromium/src/device/serial/serial_io_handler_win.cc?l=254
The windows serial code sets a timeout of 0 for all reads. That means ReadFile will always return immediately. I'm not sure yet how this impacts the WaitCommEvent() call that happens in ReadImpl(), but the serial log should tell us.
Here's the serial log from my most recent run. Note that I added a little extra logging in here versus what's checked in by always logging every byte that comes in.
log.txt
55.9 KB View Download
Project Member

Comment 14 by bugdroid1@chromium.org, May 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e08d9d7cc11ca18f127f900c0e76499fba723c99

commit e08d9d7cc11ca18f127f900c0e76499fba723c99
Author: charliea <charliea@chromium.org>
Date: Fri May 26 22:48:02 2017

Retry BattOr connection serial reads when zero bytes are read

The BattOr connection will now continue to make read attempts until the
timeout is imposed by the BattOrAgent by canceling the read.

This fixes a Windows bug in which the BattOr agent would request the
samples from the BattOr, immediately request a read, receive zero bytes
back, fail, and restart StopTracing.

TBR=aschulman@chromium.org
BUG= 726087 

Review-Url: https://codereview.chromium.org/2907793003
Cr-Commit-Position: refs/heads/master@{#475169}

[modify] https://crrev.com/e08d9d7cc11ca18f127f900c0e76499fba723c99/tools/battor_agent/battor_connection_impl.cc

Randy you should be able to commit your CL now updating the binaries after you push the latest tree to the battor_agent builders. This should resolve the windows smoke test failure.
Status: Fixed (was: Assigned)
Going to close this as fixed. Hopefully it stays that way.

Sign in to add a comment