New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 780907 link

Starred by 5 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: ----
Type: ----

Blocked on:
issue 812693



Sign in to add a comment

BattOr needs more resilient retry logic

Project Member Reported by sullivan@chromium.org, Nov 2 2017

Issue description

system_health.common_desktop failing

Builders failed on: 
- Mac Air 10.11 Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Mac%20Air%2010.11%20Perf

Example log:
https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FMac_Air_10.11_Perf%2F1484%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.common_desktop_on_Intel_GPU_on_Mac_on_Mac-10.11%2F0%2Fstdout

Traceback (most recent call last):
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py", line 106, in _RunStoryAndProcessErrorIfNeeded
    test.Measure(state.platform, results)
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/web_perf/timeline_based_measurement.py", line 277, in Measure
    trace_result = platform.tracing_controller.StopTracing()
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/core/tracing_controller.py", line 47, in StopTracing
    return self._tracing_controller_backend.StopTracing()
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py", line 140, in StopTracing
    '\n'.join(raised_exception_messages))
TracingException: Exceptions raised when trying to stop tracing:
Traceback (most recent call last):
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py", line 118, in StopTracing
    agent.StopAgentTracing()
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/battor_tracing_agent.py", line 83, in StopAgentTracing
    self._battor.StopTracing()
  File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 247, in StopTracing
    '%s %s' % (self._STOP_TRACING_CMD, self._trace_results_path))
  File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 359, in _SendBattOrCommand
    'Outputted: %s' % (cmd, status))
BattOrError: BattOr did not complete command 'StopTracing /b/s/w/itWkW9FA/tmpNnbnUS' correctly.
Outputted: [1102/094745.770174:FATAL:battor_agent_bin.cc(100)] Fatal error when communicating with the BattOr: TOO MANY COMMAND RETRIES
Traceback (most recent call last):
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py", line 128, in StopTracing
    agent.CollectAgentTraceData(builder)
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/battor_tracing_agent.py", line 112, in CollectAgentTraceData
    data = self._battor.CollectTraceData(timeout=timeout)
  File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 262, in CollectTraceData
    self._stop_tracing_time))
BattOrError: No start or stop time detected when collecting BattOr trace.
Start: 1509641162 
 Stop: None
(INFO) 2017-11-02 09:47:55,196 browser.Close:268  Closing browser (pid=17142) ...
(INFO) 2017-11-02 09:47:55,674 browser.Close:283  Browser is closed.
[  FAILED  ] browse:news:nytimes@{'case': 'browse', 'group': 'news'} (120654 ms)

 
Summary: Bad Mac serial drivers causing BattOr errors on Macs (was: BattOr errors causing system_health.common_desktop failure on Macbook Air)
Cc: charliea@chromium.org
 Issue 780958  has been merged into this issue.
Cc: perezju@chromium.org
 Issue 764929  has been merged into this issue.
 Issue 771928  has been merged into this issue.
The problem here is that the stock Mac USB/serial driver says that it supports a feature called "flow control", which is supposed to automatically tell the sender to stop sending bytes when the receiver's buffer fills up. This prevents bytes from being lost due to a buffer overflow. However, the problem is that the Mac driver doesn't actually appear to implement this feature correctly, which means that we are in fact losing bytes when the Mac system slows down for some reason and we're not able to read bytes as fast as the sender can send them.

There are two possible paths to take:

1) FTDI, the company that makes our USB/serial adapter, has a signed (Apple certified) serial driver available on their site that may have the flow control support that the Apple driver only claims to have. The problem here is that when I installed this driver, I wasn't able to fully collect a BattOr trace. I didn't have time to investigate much more, though, and it's possible that there is an easily fixed problem. If this driver works better than the stock Mac one, then we could roll this out to our fleet, solving many of our problems.

2) The part of the tracing workflow that we always seem to run into this flow control issue is at StopTracing, when the BattOr needs to download a huge trace, often taking as long as a minute and a half. If anything goes wrong during this minute and a half, we have to restart the _whole_ trace download and, if the computer is still busy, it's likely that we'll just run into the same problem again, dropping bytes in the middle. Eventually, we exhaust all of our download retry attempts and the BattOr download fails. To circumvent this, Aaron Schulman (with Mellow) said that it's fairly trivial to implement a command in the BattOr firmware that allows us to start the download at a given frame, effectively allowing us to avoid restarting the whole download process and losing our progress. The problem with this is that it's somewhat complicated on the BattOr agent side to detect when the bad, old bytes stop and where the good, new bytes begin. I don't think this is impossible, but it is tricky.
Components: Speed>Benchmarks>Waterfall
Labels: Performance-Power
Along the lines of 2, would it be possible to get the data from the BattOr in batches of, say, 1 MiB (or whatever makes sense). If the transmission of one of the batches fails, throw it away and try again from the same starting position. This should also help reduce the stress on the buffer, since we only request a new batch (of some manageable size) after the previous one has been received and recorded successfully.
Yep, that type of "stop and wait" protocol is exactly what I asked Aaron to make possible from the BattOr firmware side of things. He's now implemented it in a branch, so it's just a matter of me implementing the BattOr agent side of things.
Owner: aschulman@chromium.org
Assigning this to Aaron, who's currently working on implementing this
Project Member

Comment 10 by bugdroid1@chromium.org, Dec 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a90fb7b9f901ea5639a989a70b3a1ca8a9ac6d49

commit a90fb7b9f901ea5639a989a70b3a1ca8a9ac6d49
Author: Aaron Schulman <aschulman@chromium.org>
Date: Wed Dec 20 21:57:55 2017

[BattOr] Fix transmit logic so it properly escapes ESC bytes.

This fixes an omission in the logic for transmitting bytes over the
serial connection where START and END bytes were escaped, but not ESC
bytes.

Bug:  780907 
Change-Id: I96e799cabf7594b4b6fc7602ceeb3224139779d6
Reviewed-on: https://chromium-review.googlesource.com/802826
Reviewed-by: Charlie Andrews <charliea@chromium.org>
Commit-Queue: Aaron Schulman <aschulman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#525479}
[modify] https://crrev.com/a90fb7b9f901ea5639a989a70b3a1ca8a9ac6d49/tools/battor_agent/battor_connection_impl.cc
[modify] https://crrev.com/a90fb7b9f901ea5639a989a70b3a1ca8a9ac6d49/tools/battor_agent/battor_connection_impl_unittest.cc

Summary: BattOr needs more resilient retry logic (was: Bad Mac serial drivers causing BattOr errors on Macs)
Cc: nednguyen@chromium.org martiniss@chromium.org aschulman@chromium.org
 Issue 735434  has been merged into this issue.
Cc: jo...@chromium.org linds...@chromium.org eyaich@chromium.org
 Issue 805520  has been merged into this issue.
Cc: xhw...@chromium.org crouleau@chromium.org
 Issue 809207  has been merged into this issue.
Project Member

Comment 15 by bugdroid1@chromium.org, Feb 17 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c3efcb1b671ea581f2235f371c0048237e7d8dad

commit c3efcb1b671ea581f2235f371c0048237e7d8dad
Author: Aaron Schulman <aschulman@chromium.org>
Date: Sat Feb 17 02:30:39 2018

[BattOr] Request individual sample frames rather than entire buffer.

This fixes a long running problem with BattOr on Mac where the
StopTracing command fails due to repeated loss of a few bytes. It
addresses this issue by switching to requesting each sample frame
individually. When bytes are lost and a frame is corrupted, the broken
frame will be requested again before downloading the rest of the buffer.

Bug:  780907 
Change-Id: Ie773405530a76454b30564ec35ceee87a61446da
Reviewed-on: https://chromium-review.googlesource.com/908252
Commit-Queue: Aaron Schulman <aschulman@chromium.org>
Reviewed-by: Charlie Andrews <charliea@chromium.org>
Cr-Commit-Position: refs/heads/master@{#537510}
[modify] https://crrev.com/c3efcb1b671ea581f2235f371c0048237e7d8dad/tools/battor_agent/battor_agent.cc
[modify] https://crrev.com/c3efcb1b671ea581f2235f371c0048237e7d8dad/tools/battor_agent/battor_agent.h
[modify] https://crrev.com/c3efcb1b671ea581f2235f371c0048237e7d8dad/tools/battor_agent/battor_agent_unittest.cc
[modify] https://crrev.com/c3efcb1b671ea581f2235f371c0048237e7d8dad/tools/battor_agent/battor_error.cc
[modify] https://crrev.com/c3efcb1b671ea581f2235f371c0048237e7d8dad/tools/battor_agent/battor_error.h

Owner: charliea@chromium.org
Status: Assigned (was: Available)
Assigning this to myself for rollout.
Project Member

Comment 17 by 42576172...@developer.gserviceaccount.com, Feb 23 2018

📍 Pinpoint job started.
https://chromeperf.appspot.com/job/11fada6f840000
Project Member

Comment 18 by 42576172...@developer.gserviceaccount.com, Feb 23 2018

📍 Pinpoint job started.
https://chromeperf.appspot.com/job/14c82720440000
Project Member

Comment 19 by 42576172...@developer.gserviceaccount.com, Feb 23 2018

📍 Pinpoint job started.
https://chromeperf.appspot.com/job/159b942f840000
Starting three power.idle_platform perf tests on Android (Linux), Windows, and Mac to test out the new software and firmware
Project Member

Comment 21 by 42576172...@developer.gserviceaccount.com, Feb 23 2018

📍 Couldn't reproduce a difference.
https://chromeperf.appspot.com/job/14c82720440000
Cc: simonhatch@chromium.org dtu@chromium.org
#21 should have posted https://pinpoint-dot-chromeperf.appspot.com/job/14c82720440000

Unfortunately, looks like swarming failures on both before/after CL. Dave, Simon, should Charlie kick off the job again or does something need to be fixed at the swarming level?
This looks like the vpython issue.
Wasn't the vpython issue fixed this week? IIUC, this was a perf tryjob run today, shouldn't it run against head?
Hmm that's true, although one of the other jobs: https://pinpoint-dot-chromeperf.appspot.com/job/159b942f840000 is running just fine
Blockedon: 812693
Dave told me that Charlie is asking about the mac bots: https://bugs.chromium.org/p/chromium/issues/detail?id=812693#c11
Project Member

Comment 28 by 42576172...@developer.gserviceaccount.com, Feb 23 2018

📍 Couldn't reproduce a difference.
https://pinpoint-dot-chromeperf.appspot.com/job/159b942f840000
Project Member

Comment 29 by 42576172...@developer.gserviceaccount.com, Feb 23 2018

📍 Couldn't reproduce a difference.
https://pinpoint-dot-chromeperf.appspot.com/job/129943ff840000
Project Member

Comment 30 by 42576172...@developer.gserviceaccount.com, Feb 24 2018

📍 Couldn't reproduce a difference.
https://pinpoint-dot-chromeperf.appspot.com/job/11fada6f840000
Project Member

Comment 31 by bugdroid1@chromium.org, Feb 26 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/483cab6366a59bc9805cb5168c5d7f66c1e74bf3

commit 483cab6366a59bc9805cb5168c5d7f66c1e74bf3
Author: Charlie Andrews <charliea@chromium.org>
Date: Mon Feb 26 12:45:33 2018

Roll the BattOr firmware and battor agent to the latest version

These two need to be rolled together because the new battor_agent
retry logic takes advantage of the new way that streaming back samples
works.

I've tested this via tryjobs on Mac and Windows and locally on my
Linux workstation, and everything seems to be working well.

Bug:  chromium:780907 
Change-Id: Idb6be5141387c602d284c9cefb903a2ed62588f2
Reviewed-on: https://chromium-review.googlesource.com/934921
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Commit-Queue: Charlie Andrews <charliea@chromium.org>

[modify] https://crrev.com/483cab6366a59bc9805cb5168c5d7f66c1e74bf3/common/battor/battor/battor_binary_dependencies.json

Project Member

Comment 32 by bugdroid1@chromium.org, Feb 26 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/534191e5068d883cfcbf61085e1dc44f708373ba

commit 534191e5068d883cfcbf61085e1dc44f708373ba
Author: Charlie Andrews <charliea@chromium.org>
Date: Mon Feb 26 13:14:25 2018

Revert "Roll the BattOr firmware and battor agent to the latest version"

This reverts commit 483cab6366a59bc9805cb5168c5d7f66c1e74bf3.

Reason for revert: BattOr firmware and battor agent fail to prefetch when running Telemetry unit tests. See  http://crbug.com/816443 

Original change's description:
> Roll the BattOr firmware and battor agent to the latest version
> 
> These two need to be rolled together because the new battor_agent
> retry logic takes advantage of the new way that streaming back samples
> works.
> 
> I've tested this via tryjobs on Mac and Windows and locally on my
> Linux workstation, and everything seems to be working well.
> 
> Bug:  chromium:780907 
> Change-Id: Idb6be5141387c602d284c9cefb903a2ed62588f2
> Reviewed-on: https://chromium-review.googlesource.com/934921
> Reviewed-by: Ned Nguyen <nednguyen@google.com>
> Commit-Queue: Charlie Andrews <charliea@chromium.org>

TBR=nednguyen@google.com,charliea@chromium.org

Change-Id: Id2961ad374d8e810f9a0597dadba5904a4a6789b
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug:  chromium:780907 
Reviewed-on: https://chromium-review.googlesource.com/937322
Reviewed-by: Charlie Andrews <charliea@chromium.org>
Commit-Queue: Charlie Andrews <charliea@chromium.org>

[modify] https://crrev.com/534191e5068d883cfcbf61085e1dc44f708373ba/common/battor/battor/battor_binary_dependencies.json

Project Member

Comment 33 by bugdroid1@chromium.org, Feb 27 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/47ca0db0bd967cf7c6f4f5ebffb47b2bd0422a8b

commit 47ca0db0bd967cf7c6f4f5ebffb47b2bd0422a8b
Author: Charlie Andrews <charliea@chromium.org>
Date: Tue Feb 27 12:29:28 2018

Reland: Roll the new BattOr firmware and battor agent to the latest version

TBR=nednguyen@google.com

Bug:  chromium:780907 
Change-Id: Ia5b017433656800c963182f8b8a9b59acefe7a8e
Reviewed-on: https://chromium-review.googlesource.com/937323
Commit-Queue: Charlie Andrews <charliea@chromium.org>
Reviewed-by: Charlie Andrews <charliea@chromium.org>

[modify] https://crrev.com/47ca0db0bd967cf7c6f4f5ebffb47b2bd0422a8b/common/battor/battor/battor_binary_dependencies.json

Status: Fixed (was: Assigned)

Sign in to add a comment