BattOr needs more resilient retry logic |
||||||||||
Issue descriptionsystem_health.common_desktop failing Builders failed on: - Mac Air 10.11 Perf: https://build.chromium.org/p/chromium.perf/builders/Mac%20Air%2010.11%20Perf Example log: https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FMac_Air_10.11_Perf%2F1484%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.common_desktop_on_Intel_GPU_on_Mac_on_Mac-10.11%2F0%2Fstdout Traceback (most recent call last): File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py", line 106, in _RunStoryAndProcessErrorIfNeeded test.Measure(state.platform, results) File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/web_perf/timeline_based_measurement.py", line 277, in Measure trace_result = platform.tracing_controller.StopTracing() File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/core/tracing_controller.py", line 47, in StopTracing return self._tracing_controller_backend.StopTracing() File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py", line 140, in StopTracing '\n'.join(raised_exception_messages)) TracingException: Exceptions raised when trying to stop tracing: Traceback (most recent call last): File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py", line 118, in StopTracing agent.StopAgentTracing() File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/battor_tracing_agent.py", line 83, in StopAgentTracing self._battor.StopTracing() File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 247, in StopTracing '%s %s' % (self._STOP_TRACING_CMD, self._trace_results_path)) File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 359, in _SendBattOrCommand 'Outputted: %s' % (cmd, status)) BattOrError: BattOr did not complete command 'StopTracing /b/s/w/itWkW9FA/tmpNnbnUS' correctly. Outputted: [1102/094745.770174:FATAL:battor_agent_bin.cc(100)] Fatal error when communicating with the BattOr: TOO MANY COMMAND RETRIES Traceback (most recent call last): File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py", line 128, in StopTracing agent.CollectAgentTraceData(builder) File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/battor_tracing_agent.py", line 112, in CollectAgentTraceData data = self._battor.CollectTraceData(timeout=timeout) File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 262, in CollectTraceData self._stop_tracing_time)) BattOrError: No start or stop time detected when collecting BattOr trace. Start: 1509641162 Stop: None (INFO) 2017-11-02 09:47:55,196 browser.Close:268 Closing browser (pid=17142) ... (INFO) 2017-11-02 09:47:55,674 browser.Close:283 Browser is closed. [ FAILED ] browse:news:nytimes@{'case': 'browse', 'group': 'news'} (120654 ms)
,
Nov 3 2017
,
Nov 3 2017
,
Nov 3 2017
Issue 771928 has been merged into this issue.
,
Nov 3 2017
The problem here is that the stock Mac USB/serial driver says that it supports a feature called "flow control", which is supposed to automatically tell the sender to stop sending bytes when the receiver's buffer fills up. This prevents bytes from being lost due to a buffer overflow. However, the problem is that the Mac driver doesn't actually appear to implement this feature correctly, which means that we are in fact losing bytes when the Mac system slows down for some reason and we're not able to read bytes as fast as the sender can send them. There are two possible paths to take: 1) FTDI, the company that makes our USB/serial adapter, has a signed (Apple certified) serial driver available on their site that may have the flow control support that the Apple driver only claims to have. The problem here is that when I installed this driver, I wasn't able to fully collect a BattOr trace. I didn't have time to investigate much more, though, and it's possible that there is an easily fixed problem. If this driver works better than the stock Mac one, then we could roll this out to our fleet, solving many of our problems. 2) The part of the tracing workflow that we always seem to run into this flow control issue is at StopTracing, when the BattOr needs to download a huge trace, often taking as long as a minute and a half. If anything goes wrong during this minute and a half, we have to restart the _whole_ trace download and, if the computer is still busy, it's likely that we'll just run into the same problem again, dropping bytes in the middle. Eventually, we exhaust all of our download retry attempts and the BattOr download fails. To circumvent this, Aaron Schulman (with Mellow) said that it's fairly trivial to implement a command in the BattOr firmware that allows us to start the download at a given frame, effectively allowing us to avoid restarting the whole download process and losing our progress. The problem with this is that it's somewhat complicated on the BattOr agent side to detect when the bad, old bytes stop and where the good, new bytes begin. I don't think this is impossible, but it is tricky.
,
Nov 3 2017
,
Nov 3 2017
Along the lines of 2, would it be possible to get the data from the BattOr in batches of, say, 1 MiB (or whatever makes sense). If the transmission of one of the batches fails, throw it away and try again from the same starting position. This should also help reduce the stress on the buffer, since we only request a new batch (of some manageable size) after the previous one has been received and recorded successfully.
,
Nov 9 2017
Yep, that type of "stop and wait" protocol is exactly what I asked Aaron to make possible from the BattOr firmware side of things. He's now implemented it in a branch, so it's just a matter of me implementing the BattOr agent side of things.
,
Nov 28 2017
Assigning this to Aaron, who's currently working on implementing this
,
Dec 20 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/a90fb7b9f901ea5639a989a70b3a1ca8a9ac6d49 commit a90fb7b9f901ea5639a989a70b3a1ca8a9ac6d49 Author: Aaron Schulman <aschulman@chromium.org> Date: Wed Dec 20 21:57:55 2017 [BattOr] Fix transmit logic so it properly escapes ESC bytes. This fixes an omission in the logic for transmitting bytes over the serial connection where START and END bytes were escaped, but not ESC bytes. Bug: 780907 Change-Id: I96e799cabf7594b4b6fc7602ceeb3224139779d6 Reviewed-on: https://chromium-review.googlesource.com/802826 Reviewed-by: Charlie Andrews <charliea@chromium.org> Commit-Queue: Aaron Schulman <aschulman@chromium.org> Cr-Commit-Position: refs/heads/master@{#525479} [modify] https://crrev.com/a90fb7b9f901ea5639a989a70b3a1ca8a9ac6d49/tools/battor_agent/battor_connection_impl.cc [modify] https://crrev.com/a90fb7b9f901ea5639a989a70b3a1ca8a9ac6d49/tools/battor_agent/battor_connection_impl_unittest.cc
,
Feb 7 2018
,
Feb 7 2018
Issue 735434 has been merged into this issue.
,
Feb 7 2018
Issue 805520 has been merged into this issue.
,
Feb 9 2018
,
Feb 17 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/c3efcb1b671ea581f2235f371c0048237e7d8dad commit c3efcb1b671ea581f2235f371c0048237e7d8dad Author: Aaron Schulman <aschulman@chromium.org> Date: Sat Feb 17 02:30:39 2018 [BattOr] Request individual sample frames rather than entire buffer. This fixes a long running problem with BattOr on Mac where the StopTracing command fails due to repeated loss of a few bytes. It addresses this issue by switching to requesting each sample frame individually. When bytes are lost and a frame is corrupted, the broken frame will be requested again before downloading the rest of the buffer. Bug: 780907 Change-Id: Ie773405530a76454b30564ec35ceee87a61446da Reviewed-on: https://chromium-review.googlesource.com/908252 Commit-Queue: Aaron Schulman <aschulman@chromium.org> Reviewed-by: Charlie Andrews <charliea@chromium.org> Cr-Commit-Position: refs/heads/master@{#537510} [modify] https://crrev.com/c3efcb1b671ea581f2235f371c0048237e7d8dad/tools/battor_agent/battor_agent.cc [modify] https://crrev.com/c3efcb1b671ea581f2235f371c0048237e7d8dad/tools/battor_agent/battor_agent.h [modify] https://crrev.com/c3efcb1b671ea581f2235f371c0048237e7d8dad/tools/battor_agent/battor_agent_unittest.cc [modify] https://crrev.com/c3efcb1b671ea581f2235f371c0048237e7d8dad/tools/battor_agent/battor_error.cc [modify] https://crrev.com/c3efcb1b671ea581f2235f371c0048237e7d8dad/tools/battor_agent/battor_error.h
,
Feb 23 2018
Assigning this to myself for rollout.
,
Feb 23 2018
📍 Pinpoint job started. https://chromeperf.appspot.com/job/11fada6f840000
,
Feb 23 2018
📍 Pinpoint job started. https://chromeperf.appspot.com/job/14c82720440000
,
Feb 23 2018
📍 Pinpoint job started. https://chromeperf.appspot.com/job/159b942f840000
,
Feb 23 2018
Starting three power.idle_platform perf tests on Android (Linux), Windows, and Mac to test out the new software and firmware
,
Feb 23 2018
📍 Couldn't reproduce a difference. https://chromeperf.appspot.com/job/14c82720440000
,
Feb 23 2018
#21 should have posted https://pinpoint-dot-chromeperf.appspot.com/job/14c82720440000 Unfortunately, looks like swarming failures on both before/after CL. Dave, Simon, should Charlie kick off the job again or does something need to be fixed at the swarming level?
,
Feb 23 2018
This looks like the vpython issue.
,
Feb 23 2018
Wasn't the vpython issue fixed this week? IIUC, this was a perf tryjob run today, shouldn't it run against head?
,
Feb 23 2018
Hmm that's true, although one of the other jobs: https://pinpoint-dot-chromeperf.appspot.com/job/159b942f840000 is running just fine
,
Feb 23 2018
Dave told me that Charlie is asking about the mac bots: https://bugs.chromium.org/p/chromium/issues/detail?id=812693#c11
,
Feb 23 2018
📍 Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/129943ff840000
,
Feb 23 2018
📍 Couldn't reproduce a difference. https://pinpoint-dot-chromeperf.appspot.com/job/159b942f840000
,
Feb 23 2018
📍 Couldn't reproduce a difference. https://pinpoint-dot-chromeperf.appspot.com/job/129943ff840000
,
Feb 24 2018
📍 Couldn't reproduce a difference. https://pinpoint-dot-chromeperf.appspot.com/job/11fada6f840000
,
Feb 26 2018
The following revision refers to this bug: https://chromium.googlesource.com/catapult/+/483cab6366a59bc9805cb5168c5d7f66c1e74bf3 commit 483cab6366a59bc9805cb5168c5d7f66c1e74bf3 Author: Charlie Andrews <charliea@chromium.org> Date: Mon Feb 26 12:45:33 2018 Roll the BattOr firmware and battor agent to the latest version These two need to be rolled together because the new battor_agent retry logic takes advantage of the new way that streaming back samples works. I've tested this via tryjobs on Mac and Windows and locally on my Linux workstation, and everything seems to be working well. Bug: chromium:780907 Change-Id: Idb6be5141387c602d284c9cefb903a2ed62588f2 Reviewed-on: https://chromium-review.googlesource.com/934921 Reviewed-by: Ned Nguyen <nednguyen@google.com> Commit-Queue: Charlie Andrews <charliea@chromium.org> [modify] https://crrev.com/483cab6366a59bc9805cb5168c5d7f66c1e74bf3/common/battor/battor/battor_binary_dependencies.json
,
Feb 26 2018
The following revision refers to this bug: https://chromium.googlesource.com/catapult/+/534191e5068d883cfcbf61085e1dc44f708373ba commit 534191e5068d883cfcbf61085e1dc44f708373ba Author: Charlie Andrews <charliea@chromium.org> Date: Mon Feb 26 13:14:25 2018 Revert "Roll the BattOr firmware and battor agent to the latest version" This reverts commit 483cab6366a59bc9805cb5168c5d7f66c1e74bf3. Reason for revert: BattOr firmware and battor agent fail to prefetch when running Telemetry unit tests. See http://crbug.com/816443 Original change's description: > Roll the BattOr firmware and battor agent to the latest version > > These two need to be rolled together because the new battor_agent > retry logic takes advantage of the new way that streaming back samples > works. > > I've tested this via tryjobs on Mac and Windows and locally on my > Linux workstation, and everything seems to be working well. > > Bug: chromium:780907 > Change-Id: Idb6be5141387c602d284c9cefb903a2ed62588f2 > Reviewed-on: https://chromium-review.googlesource.com/934921 > Reviewed-by: Ned Nguyen <nednguyen@google.com> > Commit-Queue: Charlie Andrews <charliea@chromium.org> TBR=nednguyen@google.com,charliea@chromium.org Change-Id: Id2961ad374d8e810f9a0597dadba5904a4a6789b No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: chromium:780907 Reviewed-on: https://chromium-review.googlesource.com/937322 Reviewed-by: Charlie Andrews <charliea@chromium.org> Commit-Queue: Charlie Andrews <charliea@chromium.org> [modify] https://crrev.com/534191e5068d883cfcbf61085e1dc44f708373ba/common/battor/battor/battor_binary_dependencies.json
,
Feb 27 2018
The following revision refers to this bug: https://chromium.googlesource.com/catapult/+/47ca0db0bd967cf7c6f4f5ebffb47b2bd0422a8b commit 47ca0db0bd967cf7c6f4f5ebffb47b2bd0422a8b Author: Charlie Andrews <charliea@chromium.org> Date: Tue Feb 27 12:29:28 2018 Reland: Roll the new BattOr firmware and battor agent to the latest version TBR=nednguyen@google.com Bug: chromium:780907 Change-Id: Ia5b017433656800c963182f8b8a9b59acefe7a8e Reviewed-on: https://chromium-review.googlesource.com/937323 Commit-Queue: Charlie Andrews <charliea@chromium.org> Reviewed-by: Charlie Andrews <charliea@chromium.org> [modify] https://crrev.com/47ca0db0bd967cf7c6f4f5ebffb47b2bd0422a8b/common/battor/battor/battor_binary_dependencies.json
,
Mar 1 2018
|
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by charliea@chromium.org
, Nov 3 2017