BattOr agent isn't fully flushing the wire after receiving a corrupted byte during download |
||
Issue descriptionIt's inevitable that, when streaming back the lots and lots of bytes in a BattOr trace, a byte will get corrupted once in a while, resulting in a failed download. This doesn't happen very often, but it's necessary to gracefully handle the scenario when it does happen. Attached is a serial log from exactly such a case (corrupted_download_log.txt): the BattOr tracing proceeds as expected until we receive a corrupted byte. However, despite the retry logic that we have, we fail to handle this case appropriately. Because we receive the byte in the middle of the download, we should continue pulling bytes off of the wire until the wire goes quiet, then retry the entire download command. However, what we actually do is immediately retry the command. We start out by flushing the wire, but because the flush of the wire is sometimes inadequate (and because it's possible the BattOr is still sending more bytes on the wire), this is insufficient to guarantee that our message requesting another download of the trace will be responded to with an ack. Each time that the download command fails, restarts, sends a download start request, and reads in hope of finding an ack, it instead pulls off more bytes from that original trace with the corrupted message. This happened during battor.steady_state of this Mac 10.11 run (https://uberchromegw.corp.google.com/i/chromium.perf/builders/Mac%20Air%2010.11%20Perf/builds/1169). Starting with the very next benchmark (battor.steady_state.reference), the BattOr was unresponsive. I think that this unresponsiveness is a bug in itself: something about repeatedly sending the BattOr requests to download the last trace makes the BattOr permanently go into "download trace mode", which makes sense logically given the fact that the light stays on as red when the BattOr is unresponsive, indicating that the BattOr is permanently stuck in "download trace" mode. I want to do two things with this: 1) Locally modify the BattOr agent to be able to replicate the hang that we're seeing by going through the normal init, start tracing, and record clock sync marker routines before starting to spam "request download" commands at 2s intervals - exactly what the BattOr agent is currently doing. 2) Fix the BattOr agent so that it does a "slow flush" of the wire before running any command. This will involve repeatedly pulling bytes from the wire until we hit some ~10ms interval during which we receive no additional bytes on the wire, at which point we can safely assume that any sending has stopped.
,
Aug 22 2017
SUCCESS! The problem seems to lie with requesting the EEPROM while samples are still streaming back. By modifying the BattOr agent locally to start streaming back and then spam the BattOr with 1000 EEPROM requests, I was able to get this to reproduce 100% of the time. I got in touch with Aaron and he's going to look into what might be causing the BattOr firmware to enter an infinite loop when an EEPROM request is received while streaming back samples. I'm going to work on modifying the BattOr agent so that it waits for quiescence on the wire before retrying the StopTracing command.
,
Aug 22 2017
(I sent Aaron a hacked Mac version of the BattOr agent binary that reproduces this every time.)
,
Aug 22 2017
Great work, Charlie! Is it possible to write integration test for this? S.t like a fake battor service that communicates with the battor agent?
,
Aug 22 2017
I'm not sure, but I suspect that an integration test would be really really difficult. In order to do one, we'd really have to stub out the entire backend in order to simulate a BattOr, which would be significant work. I do plan on adding unit tests to make sure that the agent-side fix for this works, though, and that we don't spam EEPROM requests while samples are still streaming back. A true integration test for the firmware side, though, would involve communicating with a real BattOr, which would be problematic because it would actually cause the BattOr to hang indefinitely.
,
Aug 22 2017
When I say about fake battor service, I would expect something that allows emulating a virtual battor at OS level, and you shouldn't need to do much work in C++ battor code. A search shows https://stackoverflow.com/questions/52187/virtual-serial-port-for-linux which may be relevant. Anyway, just my 2 cents as I am not an expert in this area :-)
,
Aug 22 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/62da4c88e85fa365dbc8bc100dafb36c4088e9db commit 62da4c88e85fa365dbc8bc100dafb36c4088e9db Author: Charlie Andrews <charliea@google.com> Date: Tue Aug 22 21:35:55 2017 Rename "ReadByte" functions to "ReadBytesForMessage" in BattOrConnection This is in preparation to add a second type of byte reading function, where we continuously read and throw away bytes until the serial connection reaches quiescence. Bug: 757508 Change-Id: I2a7ec8194186527f3ee2beaf25a749017ba1bf36 Reviewed-on: https://chromium-review.googlesource.com/627185 Reviewed-by: rnephew <rnephew@chromium.org> Commit-Queue: Charlie Andrews <charliea@chromium.org> Cr-Commit-Position: refs/heads/master@{#496451} [modify] https://crrev.com/62da4c88e85fa365dbc8bc100dafb36c4088e9db/tools/battor_agent/battor_connection_impl.cc [modify] https://crrev.com/62da4c88e85fa365dbc8bc100dafb36c4088e9db/tools/battor_agent/battor_connection_impl.h
,
Aug 23 2017
ok i have identified the bug in the battor firmware that is making it unresponsive in this error (blinking yellow LED, doesn't reply to commands). essentially the code that parses incoming serial commands on the battor was not properly handling parsing errors. when a corrupted frame exists in the battor's buffer, it was just leaving that frame in the battor's internal buffer and continually failed to parse it no matter what new data would arrive as the frame was retransmitted. this issue came up because we did not have a test case for receiving broken frames in the battor firmware. I will be pushing the fix as well the testing code to the firmware later today.
,
Aug 24 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/02fdd57963abe238fac1160fa3be85e0dcdba7e9 commit 02fdd57963abe238fac1160fa3be85e0dcdba7e9 Author: Charlie Andrews <charliea@chromium.org> Date: Thu Aug 24 02:25:09 2017 Add "force error" methods to the TestSerialIoHandler Without these methods, it's impossible to test error handling for things like timeouts. Bug: 757508 Change-Id: I236a5b4d71acb1e8d53e5b986562de088cbf0aba Reviewed-on: https://chromium-review.googlesource.com/629336 Commit-Queue: Charlie Andrews <charliea@chromium.org> Reviewed-by: Reilly Grant <reillyg@chromium.org> Cr-Commit-Position: refs/heads/master@{#496923} [modify] https://crrev.com/02fdd57963abe238fac1160fa3be85e0dcdba7e9/device/serial/test_serial_io_handler.cc [modify] https://crrev.com/02fdd57963abe238fac1160fa3be85e0dcdba7e9/device/serial/test_serial_io_handler.h [modify] https://crrev.com/02fdd57963abe238fac1160fa3be85e0dcdba7e9/tools/battor_agent/battor_connection_impl_unittest.cc
,
Aug 25 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2ee4901c879a19d872751a84bb7567030e996782 commit 2ee4901c879a19d872751a84bb7567030e996782 Author: Charlie Andrews <charliea@chromium.org> Date: Fri Aug 25 16:12:25 2017 Make the Flush() command for BattOrConnection wait for quiescence This command is currently unused, but will soon be used after failed reads in order to ensure that the BattOr's response to a previous command doesn't get interpreted as a response to a retried command. This was happening frequently when a data frame got corrupted, the BattOr agent retried the StopTracing command, and the data frames that were still on the wire got interpreted as a corrupted EEPROM from the StopTracing retry. Bug: 757508 Change-Id: I3714cce8b6b2e897ab8d9e0c0d558e002fe17c72 Reviewed-on: https://chromium-review.googlesource.com/629436 Reviewed-by: Charlie Andrews <charliea@chromium.org> Reviewed-by: rnephew <rnephew@chromium.org> Commit-Queue: Charlie Andrews <charliea@chromium.org> Cr-Commit-Position: refs/heads/master@{#497428} [modify] https://crrev.com/2ee4901c879a19d872751a84bb7567030e996782/tools/battor_agent/battor_agent.cc [modify] https://crrev.com/2ee4901c879a19d872751a84bb7567030e996782/tools/battor_agent/battor_agent.h [modify] https://crrev.com/2ee4901c879a19d872751a84bb7567030e996782/tools/battor_agent/battor_connection.h [modify] https://crrev.com/2ee4901c879a19d872751a84bb7567030e996782/tools/battor_agent/battor_connection_impl.cc [modify] https://crrev.com/2ee4901c879a19d872751a84bb7567030e996782/tools/battor_agent/battor_connection_impl.h [modify] https://crrev.com/2ee4901c879a19d872751a84bb7567030e996782/tools/battor_agent/battor_connection_impl_unittest.cc
,
Aug 29 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/4e648692b5a9cdf6853ad942b417775cbab0feb9 commit 4e648692b5a9cdf6853ad942b417775cbab0feb9 Author: Charlie Andrews <charliea@chromium.org> Date: Tue Aug 29 18:06:54 2017 Makes the BattOr connection's Flush() a part of the Open() call There aren't (and likely won't be in the future) times when we want to open the connection without flushing the existing data on it. My original rationale for making Flush() a separate call was that there may be times when we want to Flush(), even when we don't want to reopen the connection. This may still be true, but exposing a separate Flush() command in the future would be trivial. Integrating Flush() into Open() reduces the likelihood that some code path mistakenly opens the connection but forgets to Flush() the connection afterwards. TBR=rnephew@chromium.org Bug: 757508 Change-Id: Iab6a11a9ee79b0efacf811b29615017f231b0b00 Reviewed-on: https://chromium-review.googlesource.com/638414 Commit-Queue: Charlie Andrews <charliea@chromium.org> Reviewed-by: Charlie Andrews <charliea@chromium.org> Cr-Commit-Position: refs/heads/master@{#498166} [modify] https://crrev.com/4e648692b5a9cdf6853ad942b417775cbab0feb9/tools/battor_agent/battor_agent.cc [modify] https://crrev.com/4e648692b5a9cdf6853ad942b417775cbab0feb9/tools/battor_agent/battor_agent.h [modify] https://crrev.com/4e648692b5a9cdf6853ad942b417775cbab0feb9/tools/battor_agent/battor_connection.h [modify] https://crrev.com/4e648692b5a9cdf6853ad942b417775cbab0feb9/tools/battor_agent/battor_connection_impl.cc [modify] https://crrev.com/4e648692b5a9cdf6853ad942b417775cbab0feb9/tools/battor_agent/battor_connection_impl.h [modify] https://crrev.com/4e648692b5a9cdf6853ad942b417775cbab0feb9/tools/battor_agent/battor_connection_impl_unittest.cc
,
Aug 29 2017
I have attached some logs where the latest commit appears to fail with a TIMEOUT during the Flush().
,
Aug 29 2017
Here is another set of logs where the latest commit appears to lead to a CONNECTION FAILED on the first retry of the INIT message.
,
Aug 30 2017
For comment 12, the problem is that, from the BattOr agent's perspective, the whole Flush() happens within the ten second timeout allocated for opening the connection. I think we need to move the 10s connection timeout into the battor_connection_impl.cc and remove any connection timeout from the BattOr agent level: the problem is that the BattOr agent just can't know how long it will take to flush whatever's left on the wire. When I recorded a 180s trace, the download took 55s. This means that, even for just a 60s trace (which would take about 18s to download), if there's a corrupted byte at the beginning of that download that triggers a flush when the connection is reopened, there's a good likelihood that opening the connection is going to take longer than 10s. I'll send out a CL to fix this now.
,
Aug 30 2017
,
Aug 30 2017
,
Sep 1 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/73a3e594ada44c03b4bd0e011c142f9d0b284dbb commit 73a3e594ada44c03b4bd0e011c142f9d0b284dbb Author: Charlie Andrews <charliea@chromium.org> Date: Fri Sep 01 20:42:34 2017 Move the connection timeout from the BattOrAgent to the BattOrConnection With the introduction of the slow flush into opening the connection, it's possible for opening the connection to take a very long time from the BattOrAgent's point of view. For example, if the user records a 180s trace, that trace may take a minute to stream back. If a corrupted data frame occurs at the beginning of that stream, the connection will be reopoened after a 2s delay, resulting in 53s of useful flushing before the connection will successfully open. As long as this flushing finds bytes, Open() shouldn't time out. We avoid this by moving the timeout into BattOrConnection, where we can verify that flushing is continuing to find bytes. TBR=rnephew@chromium.org Bug: 757508 Change-Id: I33bc2e2c145ec1f8803eda4b4c14b35d699206a9 Reviewed-on: https://chromium-review.googlesource.com/644136 Reviewed-by: Charlie Andrews <charliea@chromium.org> Commit-Queue: Charlie Andrews <charliea@chromium.org> Cr-Commit-Position: refs/heads/master@{#499284} [modify] https://crrev.com/73a3e594ada44c03b4bd0e011c142f9d0b284dbb/tools/battor_agent/battor_agent.cc [modify] https://crrev.com/73a3e594ada44c03b4bd0e011c142f9d0b284dbb/tools/battor_agent/battor_connection_impl.cc [modify] https://crrev.com/73a3e594ada44c03b4bd0e011c142f9d0b284dbb/tools/battor_agent/battor_connection_impl.h
,
Sep 2 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/d33a9557ccd125e36a6aa3b269a6f93424256eb0 commit d33a9557ccd125e36a6aa3b269a6f93424256eb0 Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org> Date: Sat Sep 02 02:20:04 2017 Roll src/third_party/catapult/ 017fd5cf4..e29053eb7 (11 commits) https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/017fd5cf4ccb..e29053eb7dfc $ git log 017fd5cf4..e29053eb7 --date=short --no-merges --format='%ad %ae %s' 2017-09-01 dtu [pinpoint] Handle missing chartjson files. 2017-09-01 simonhatch Pinpoint - Update job dialog with latest names + tir_label. 2017-09-01 tiborg [vr] Add metric to measure frame cycle times 2017-09-01 charliea Move the benchmark health report's deps into a requirements.txt file 2017-09-01 kristipark [Telemetry] Removed legacy blob storage metrics 2017-09-01 kraynov Android systrace: Plot used memory from /proc/meminfo. 2017-09-01 benjhayden Clean up histogram-set-json-format.md 2017-09-01 benjhayden Serialize traces when --output-format=html. 2017-09-01 charliea Upload a new version of the BattOr firmware 2017-09-01 charliea Make BattOrWrapper StopTracing check for 'Done.' 2017-09-01 djordje.golubovic.imgtec Update the update_v8 script to support MIPS Created with: roll-dep src/third_party/catapult BUG= 726906 , 717515 , 757508 Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, see: http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel TBR=sullivan@chromium.org Change-Id: I22caf15b3a034b5b0779e40dd967f468e401c892 Reviewed-on: https://chromium-review.googlesource.com/647655 Reviewed-by: <catapult-deps-roller@chromium.org> Commit-Queue: <catapult-deps-roller@chromium.org> Cr-Commit-Position: refs/heads/master@{#499376} [modify] https://crrev.com/d33a9557ccd125e36a6aa3b269a6f93424256eb0/DEPS
,
Sep 5 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/54ca27592159e6b7d65f5d0f03126c2a6f36185c commit 54ca27592159e6b7d65f5d0f03126c2a6f36185c Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org> Date: Tue Sep 05 23:07:32 2017 Roll src/third_party/catapult/ 4ab7a181d..3a5d44da0 (5 commits) https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/4ab7a181db1c..3a5d44da0799 $ git log 4ab7a181d..3a5d44da0 --date=short --no-merges --format='%ad %ae %s' 2017-09-05 dtu [pinpoint] Adjust chart bounds and pulse size. 2017-09-05 dtu [pinpoint] Show values as a histogram. 2017-09-05 charliea Roll the BattOr agent binary to include new flushing logic 2017-09-05 benjhayden Make 'html' telemetry's default output format. 2017-09-05 dtu [pinpoint] Fix Jobs table initial sorting. Created with: roll-dep src/third_party/catapult BUG= 757508 Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, see: http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel TBR=sullivan@chromium.org Change-Id: I53a97ab2c4d7ac7c2f5f27a4b60817be82dddd02 Reviewed-on: https://chromium-review.googlesource.com/651273 Reviewed-by: <catapult-deps-roller@chromium.org> Commit-Queue: <catapult-deps-roller@chromium.org> Cr-Commit-Position: refs/heads/master@{#499784} [modify] https://crrev.com/54ca27592159e6b7d65f5d0f03126c2a6f36185c/DEPS
,
Sep 11 2017
|
||
►
Sign in to add a comment |
||
Comment 1 by charliea@chromium.org
, Aug 22 2017