New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 757508 link

Starred by 5 users

Issue metadata

Status: Fixed
Owner:
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 1
Type: Bug



Sign in to add a comment

BattOr agent isn't fully flushing the wire after receiving a corrupted byte during download

Project Member Reported by charliea@chromium.org, Aug 21 2017

Issue description

It's inevitable that, when streaming back the lots and lots of bytes in a BattOr trace, a byte will get corrupted once in a while, resulting in a failed download. This doesn't happen very often, but it's necessary to gracefully handle the scenario when it does happen.

Attached is a serial log from exactly such a case (corrupted_download_log.txt): the BattOr tracing proceeds as expected until we receive a corrupted byte.

However, despite the retry logic that we have, we fail to handle this case appropriately.

Because we receive the byte in the middle of the download, we should continue pulling bytes off of the wire until the wire goes quiet, then retry the entire download command.

However, what we actually do is immediately retry the command. We start out by flushing the wire, but because the flush of the wire is sometimes inadequate (and because it's possible the BattOr is still sending more bytes on the wire), this is insufficient to guarantee that our message requesting another download of the trace will be responded to with an ack. Each time that the download command fails, restarts, sends a download start request, and reads in hope of finding an ack, it instead pulls off more bytes from that original trace with the corrupted message.

This happened during battor.steady_state of this Mac 10.11 run (https://uberchromegw.corp.google.com/i/chromium.perf/builders/Mac%20Air%2010.11%20Perf/builds/1169).

Starting with the very next benchmark (battor.steady_state.reference), the BattOr was unresponsive. I think that this unresponsiveness is a bug in itself: something about repeatedly sending the BattOr requests to download the last trace makes the BattOr permanently go into "download trace mode", which makes sense logically given the fact that the light stays on as red when the BattOr is unresponsive, indicating that the BattOr is permanently stuck in "download trace" mode.

I want to do two things with this:

1) Locally modify the BattOr agent to be able to replicate the hang that we're seeing by going through the normal init, start tracing, and record clock sync marker routines before starting to spam "request download" commands at 2s intervals - exactly what the BattOr agent is currently doing.

2) Fix the BattOr agent so that it does a "slow flush" of the wire before running any command. This will involve repeatedly pulling bytes from the wire until we hit some ~10ms interval during which we receive no additional bytes on the wire, at which point we can safely assume that any sending has stopped.
 
subsequent_run_log.txt
7.2 KB View Download
corrupted_download_log.txt
114 KB View Download
I'm actually not 100% sure that this explanation is correct now, but it appears that there is likely something going on with incorrect flushing.

The order of things happening in the corrupted download log is:

====== StartTracing
Open connection, connection successfully opened
Send init, init ack received
Set gain, set gain ack received
Start sampling, start sampling ack received
(33s pause while the benchmark runs)

====== RecordClockSyncMarker
Request sample count, sample count received

====== StopTracing
Request EEPROM, EEPROM received
Read SD UART
Received ~20 valid data frames from the BattOr
????????????
(wait 2s)
====== StopTracing retry
Request EEPROM
85 more bytes read
Read failed due to missing start byte
(wait 2s)
====== StopTracing retry
Request EEPROM
191 more bytes read
Read failed due to missing start byte
(wait 2s)
====== StopTracing retry
Request EEPROM
TIMEOUT
(wait 2s)
====== StopTracing retry
Request EEPROM
TIMEOUT
...
(eventually ends after 10 total StopTracing retries)

Because we have the printed contents of the EEPROM from the first, successful time that we requested it at the beginning of StopTracing, we can compare the contents there to the bytes that we read later on when we attempted to retry StartTracing. This makes it clear: the bytes that we're receiving when we read 85, then 191 more bytes when requesting the EEPROM in retry attempts are not actually valid EEPROM but rather more data frames as the BattOr continues to stream back samples, even after our previous StopTracing failed. This gives further credence to the idea that we should be waiting for all data frame bytes to arrive to the host computer before initiating a retry attempt of StopTracing.

There are two other interesting questions about this log, though. First, if the last thing that the BattOr agent saw was a valid data frame, then why did it retry StopTracing? Looking at the code, there seems to only be one reason this could happen: the message was indeed valid (w/ a correct start and end byte), but that message wasn't a valid data frame (i.e. ParseSampleFrame here http://bit.ly/2x7LZpO returned false). This could happen for a number of reasons - an out of order sequence number or the frame size in the header not matching the actual number of bytes we see, to name a few. Whatever it was, it seems to happen rarely, and may just be from a corrupted byte on the wire.

The second interesting question is why the BattOr went unresponsive. I think to both us and Mellow, this is the more interesting question. However, it seems like this log gives us a valuable clue: an EEPROM request was sent while the BattOr samples were still streaming back, and this caused the BattOr to hang indefinitely.

I'm going to try and rig a BattOr agent that does just this and see if I can reproduce the problem.
SUCCESS! The problem seems to lie with requesting the EEPROM while samples are still streaming back. By modifying the BattOr agent locally to start streaming back and then spam the BattOr with 1000 EEPROM requests, I was able to get this to reproduce 100% of the time.

I got in touch with Aaron and he's going to look into what might be causing the BattOr firmware to enter an infinite loop when an EEPROM request is received while streaming back samples. I'm going to work on modifying the BattOr agent so that it waits for quiescence on the wire before retrying the StopTracing command.
(I sent Aaron a hacked Mac version of the BattOr agent binary that reproduces this every time.)
Great work, Charlie! Is it possible to write integration test for this? S.t like a fake battor service that communicates with the battor agent?
I'm not sure, but I suspect that an integration test would be really really difficult. In order to do one, we'd really have to stub out the entire backend in order to simulate a BattOr, which would be significant work. I do plan on adding unit tests to make sure that the agent-side fix for this works, though, and that we don't spam EEPROM requests while samples are still streaming back. A true integration test for the firmware side, though, would involve communicating with a real BattOr, which would be problematic because it would actually cause the BattOr to hang indefinitely.
When I say about fake battor service, I would expect something that allows emulating a virtual battor at OS level, and you shouldn't need to do much work in C++ battor code.

A search shows https://stackoverflow.com/questions/52187/virtual-serial-port-for-linux which may be relevant. Anyway, just my 2 cents as I am not an expert in this area :-)
Project Member

Comment 7 by bugdroid1@chromium.org, Aug 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/62da4c88e85fa365dbc8bc100dafb36c4088e9db

commit 62da4c88e85fa365dbc8bc100dafb36c4088e9db
Author: Charlie Andrews <charliea@google.com>
Date: Tue Aug 22 21:35:55 2017

Rename "ReadByte" functions to "ReadBytesForMessage" in BattOrConnection

This is in preparation to add a second type of byte reading function,
where we continuously read and throw away bytes until the serial
connection reaches quiescence.

Bug:  757508 
Change-Id: I2a7ec8194186527f3ee2beaf25a749017ba1bf36
Reviewed-on: https://chromium-review.googlesource.com/627185
Reviewed-by: rnephew <rnephew@chromium.org>
Commit-Queue: Charlie Andrews <charliea@chromium.org>
Cr-Commit-Position: refs/heads/master@{#496451}
[modify] https://crrev.com/62da4c88e85fa365dbc8bc100dafb36c4088e9db/tools/battor_agent/battor_connection_impl.cc
[modify] https://crrev.com/62da4c88e85fa365dbc8bc100dafb36c4088e9db/tools/battor_agent/battor_connection_impl.h

ok i have identified the bug in the battor firmware that is making it unresponsive in this error (blinking yellow LED, doesn't reply to commands). essentially the code that parses incoming serial commands on the battor was not properly handling parsing errors. when a corrupted frame exists in the battor's buffer, it was just leaving that frame in the battor's internal buffer and continually failed to parse it no matter what new data would arrive as the frame was retransmitted.

this issue came up because we did not have a test case for receiving broken frames in the battor firmware. I will be pushing the fix as well the testing code to the firmware later today.
Project Member

Comment 9 by bugdroid1@chromium.org, Aug 24 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/02fdd57963abe238fac1160fa3be85e0dcdba7e9

commit 02fdd57963abe238fac1160fa3be85e0dcdba7e9
Author: Charlie Andrews <charliea@chromium.org>
Date: Thu Aug 24 02:25:09 2017

Add "force error" methods to the TestSerialIoHandler

Without these methods, it's impossible to test error handling for things like timeouts.

Bug:  757508 
Change-Id: I236a5b4d71acb1e8d53e5b986562de088cbf0aba
Reviewed-on: https://chromium-review.googlesource.com/629336
Commit-Queue: Charlie Andrews <charliea@chromium.org>
Reviewed-by: Reilly Grant <reillyg@chromium.org>
Cr-Commit-Position: refs/heads/master@{#496923}
[modify] https://crrev.com/02fdd57963abe238fac1160fa3be85e0dcdba7e9/device/serial/test_serial_io_handler.cc
[modify] https://crrev.com/02fdd57963abe238fac1160fa3be85e0dcdba7e9/device/serial/test_serial_io_handler.h
[modify] https://crrev.com/02fdd57963abe238fac1160fa3be85e0dcdba7e9/tools/battor_agent/battor_connection_impl_unittest.cc

Project Member

Comment 10 by bugdroid1@chromium.org, Aug 25 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2ee4901c879a19d872751a84bb7567030e996782

commit 2ee4901c879a19d872751a84bb7567030e996782
Author: Charlie Andrews <charliea@chromium.org>
Date: Fri Aug 25 16:12:25 2017

Make the Flush() command for BattOrConnection wait for quiescence

This command is currently unused, but will soon be used after failed reads in
order to ensure that the BattOr's response to a previous command doesn't
get interpreted as a response to a retried command. This was happening
frequently when a data frame got corrupted, the BattOr agent retried the
StopTracing command, and the data frames that were still on the wire
got interpreted as a corrupted EEPROM from the StopTracing retry.

Bug:  757508 
Change-Id: I3714cce8b6b2e897ab8d9e0c0d558e002fe17c72
Reviewed-on: https://chromium-review.googlesource.com/629436
Reviewed-by: Charlie Andrews <charliea@chromium.org>
Reviewed-by: rnephew <rnephew@chromium.org>
Commit-Queue: Charlie Andrews <charliea@chromium.org>
Cr-Commit-Position: refs/heads/master@{#497428}
[modify] https://crrev.com/2ee4901c879a19d872751a84bb7567030e996782/tools/battor_agent/battor_agent.cc
[modify] https://crrev.com/2ee4901c879a19d872751a84bb7567030e996782/tools/battor_agent/battor_agent.h
[modify] https://crrev.com/2ee4901c879a19d872751a84bb7567030e996782/tools/battor_agent/battor_connection.h
[modify] https://crrev.com/2ee4901c879a19d872751a84bb7567030e996782/tools/battor_agent/battor_connection_impl.cc
[modify] https://crrev.com/2ee4901c879a19d872751a84bb7567030e996782/tools/battor_agent/battor_connection_impl.h
[modify] https://crrev.com/2ee4901c879a19d872751a84bb7567030e996782/tools/battor_agent/battor_connection_impl_unittest.cc

Project Member

Comment 11 by bugdroid1@chromium.org, Aug 29 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4e648692b5a9cdf6853ad942b417775cbab0feb9

commit 4e648692b5a9cdf6853ad942b417775cbab0feb9
Author: Charlie Andrews <charliea@chromium.org>
Date: Tue Aug 29 18:06:54 2017

Makes the BattOr connection's Flush() a part of the Open() call

There aren't (and likely won't be in the future) times when we want to
open the connection without flushing the existing data on it. My
original rationale for making Flush() a separate call was that there may
be times when we want to Flush(), even when we don't want to reopen the
connection. This may still be true, but exposing a separate Flush()
command in the future would be trivial. Integrating Flush() into Open()
reduces the likelihood that some code path mistakenly opens the
connection but forgets to Flush() the connection afterwards.

TBR=rnephew@chromium.org

Bug:  757508 
Change-Id: Iab6a11a9ee79b0efacf811b29615017f231b0b00
Reviewed-on: https://chromium-review.googlesource.com/638414
Commit-Queue: Charlie Andrews <charliea@chromium.org>
Reviewed-by: Charlie Andrews <charliea@chromium.org>
Cr-Commit-Position: refs/heads/master@{#498166}
[modify] https://crrev.com/4e648692b5a9cdf6853ad942b417775cbab0feb9/tools/battor_agent/battor_agent.cc
[modify] https://crrev.com/4e648692b5a9cdf6853ad942b417775cbab0feb9/tools/battor_agent/battor_agent.h
[modify] https://crrev.com/4e648692b5a9cdf6853ad942b417775cbab0feb9/tools/battor_agent/battor_connection.h
[modify] https://crrev.com/4e648692b5a9cdf6853ad942b417775cbab0feb9/tools/battor_agent/battor_connection_impl.cc
[modify] https://crrev.com/4e648692b5a9cdf6853ad942b417775cbab0feb9/tools/battor_agent/battor_connection_impl.h
[modify] https://crrev.com/4e648692b5a9cdf6853ad942b417775cbab0feb9/tools/battor_agent/battor_connection_impl_unittest.cc

I have attached some logs where the latest commit appears to fail with a TIMEOUT during the Flush().
battor_output-trace1.txt
1.8 KB View Download
battor_serial-trace1.txt
5.0 MB View Download
Here is another set of logs where the latest commit appears to lead to a CONNECTION FAILED on the first retry of the INIT message.


battor_output-trace14.txt
1.7 KB View Download
battor_serial-trace14.txt
1.6 KB View Download
For comment 12, the problem is that, from the BattOr agent's perspective, the whole Flush() happens within the ten second timeout allocated for opening the connection. I think we need to move the 10s connection timeout into the battor_connection_impl.cc and remove any connection timeout from the BattOr agent level: the problem is that the BattOr agent just can't know how long it will take to flush whatever's left on the wire. 

When I recorded a 180s trace, the download took 55s. This means that, even for just a 60s trace (which would take about 18s to download), if there's a corrupted byte at the beginning of that download that triggers a flush when the connection is reopened, there's a good likelihood that opening the connection is going to take longer than 10s.

I'll send out a CL to fix this now.
Cc: charliea@chromium.org
 Issue 754746  has been merged into this issue.
Cc: jo...@chromium.org pschmidt@chromium.org
 Issue 737176  has been merged into this issue.
Project Member

Comment 17 by bugdroid1@chromium.org, Sep 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/73a3e594ada44c03b4bd0e011c142f9d0b284dbb

commit 73a3e594ada44c03b4bd0e011c142f9d0b284dbb
Author: Charlie Andrews <charliea@chromium.org>
Date: Fri Sep 01 20:42:34 2017

Move the connection timeout from the BattOrAgent to the BattOrConnection

With the introduction of the slow flush into opening the connection,
it's possible for opening the connection to take a very long time from
the BattOrAgent's point of view. For example, if the user records a
180s trace, that trace may take a minute to stream back. If a corrupted
data frame occurs at the beginning of that stream, the connection will
be reopoened after a 2s delay, resulting in 53s of useful flushing
before the connection will successfully open. As long as this flushing
finds bytes, Open() shouldn't time out.

We avoid this by moving the timeout into BattOrConnection, where we
can verify that flushing is continuing to find bytes.

TBR=rnephew@chromium.org

Bug:  757508 
Change-Id: I33bc2e2c145ec1f8803eda4b4c14b35d699206a9
Reviewed-on: https://chromium-review.googlesource.com/644136
Reviewed-by: Charlie Andrews <charliea@chromium.org>
Commit-Queue: Charlie Andrews <charliea@chromium.org>
Cr-Commit-Position: refs/heads/master@{#499284}
[modify] https://crrev.com/73a3e594ada44c03b4bd0e011c142f9d0b284dbb/tools/battor_agent/battor_agent.cc
[modify] https://crrev.com/73a3e594ada44c03b4bd0e011c142f9d0b284dbb/tools/battor_agent/battor_connection_impl.cc
[modify] https://crrev.com/73a3e594ada44c03b4bd0e011c142f9d0b284dbb/tools/battor_agent/battor_connection_impl.h

Project Member

Comment 18 by bugdroid1@chromium.org, Sep 2 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d33a9557ccd125e36a6aa3b269a6f93424256eb0

commit d33a9557ccd125e36a6aa3b269a6f93424256eb0
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Sat Sep 02 02:20:04 2017

Roll src/third_party/catapult/ 017fd5cf4..e29053eb7 (11 commits)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/017fd5cf4ccb..e29053eb7dfc

$ git log 017fd5cf4..e29053eb7 --date=short --no-merges --format='%ad %ae %s'
2017-09-01 dtu [pinpoint] Handle missing chartjson files.
2017-09-01 simonhatch Pinpoint - Update job dialog with latest names + tir_label.
2017-09-01 tiborg [vr] Add metric to measure frame cycle times
2017-09-01 charliea Move the benchmark health report's deps into a requirements.txt file
2017-09-01 kristipark [Telemetry] Removed legacy blob storage metrics
2017-09-01 kraynov Android systrace: Plot used memory from /proc/meminfo.
2017-09-01 benjhayden Clean up histogram-set-json-format.md
2017-09-01 benjhayden Serialize traces when --output-format=html.
2017-09-01 charliea Upload a new version of the BattOr firmware
2017-09-01 charliea Make BattOrWrapper StopTracing check for 'Done.'
2017-09-01 djordje.golubovic.imgtec Update the update_v8 script to support MIPS

Created with:
  roll-dep src/third_party/catapult
BUG= 726906 , 717515 , 757508 


Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls


CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: I22caf15b3a034b5b0779e40dd967f468e401c892
Reviewed-on: https://chromium-review.googlesource.com/647655
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#499376}
[modify] https://crrev.com/d33a9557ccd125e36a6aa3b269a6f93424256eb0/DEPS

Project Member

Comment 19 by bugdroid1@chromium.org, Sep 5 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/54ca27592159e6b7d65f5d0f03126c2a6f36185c

commit 54ca27592159e6b7d65f5d0f03126c2a6f36185c
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Tue Sep 05 23:07:32 2017

Roll src/third_party/catapult/ 4ab7a181d..3a5d44da0 (5 commits)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/4ab7a181db1c..3a5d44da0799

$ git log 4ab7a181d..3a5d44da0 --date=short --no-merges --format='%ad %ae %s'
2017-09-05 dtu [pinpoint] Adjust chart bounds and pulse size.
2017-09-05 dtu [pinpoint] Show values as a histogram.
2017-09-05 charliea Roll the BattOr agent binary to include new flushing logic
2017-09-05 benjhayden Make 'html' telemetry's default output format.
2017-09-05 dtu [pinpoint] Fix Jobs table initial sorting.

Created with:
  roll-dep src/third_party/catapult
BUG= 757508 


Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls


CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: I53a97ab2c4d7ac7c2f5f27a4b60817be82dddd02
Reviewed-on: https://chromium-review.googlesource.com/651273
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#499784}
[modify] https://crrev.com/54ca27592159e6b7d65f5d0f03126c2a6f36185c/DEPS

Status: Fixed (was: Assigned)

Sign in to add a comment