New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 735434 link

Starred by 1 user

Issue metadata

Status: Duplicate
Merged: issue 780907
Owner:
Last visit > 30 days ago
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: 2017-09-18
OS: Mac
Pri: 1
Type: Bug



Sign in to add a comment

BattOr failing during StopTracing due to incorrect sequence numbers

Project Member Reported by perezju@chromium.org, Jun 21 2017

Issue description

The error is:

(ERROR) 2017-06-21 02:26:19,493 battor_wrapper._FlashBattOr:166  Git hash returned from BattOr was not as expected: [0621/022619.493134:FATAL:battor_agent_bin.cc(91)] Fatal error when communicating with the BattOr: TOO MANY COMMAND RETRIES
Traceback (most recent call last):
  File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 153, in _FlashBattOr
    device_git_hash = self.GetFirmwareGitHash()
  File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 383, in GetFirmwareGitHash
    int(self._git_hash, 16)
ValueError: invalid literal for int() with base 16: '[0621/022619.493134:FATAL:battor_agent_bin.cc(91)] Fatal error when communicating with the BattOr: TOO MANY COMMAND RETRIES'
(WARNING) 2017-06-21 02:26:19,495 shared_page_state.DumpStateUponFailure:131  Cannot dump browser state: No browser.
https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FMac_Retina_Perf%2F794%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.common_desktop_on_Intel_GPU_on_Mac_on_Mac-10.11%2F0%2Fstdout

Also, strangely, swarming claims that "shard #0 timed out, took too much time to complete", with a "Total duration: 3708.1s". However the log only shows output from "02:24:34,195" to "2017-06-21 02:26:20,213" (~ 2 mins); where was all the additional time spent?
 
This is still happening consistently. Anybody had a chance to look at this?
Also looks like the issue just started spreading on mac-air-10.11 as of build:
https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Air%2010.11%20Perf/976

Same error:
ValueError: invalid literal for int() with base 16: '[0624/142234.652564:FATAL:battor_agent_bin.cc(91)] Fatal error when communicating with the BattOr: TOO MANY COMMAND RETRIES'

Also test stopped after 2 minutes but swarming thinks it timed out after running for one hour.
Labels: -Pri-2 Pri-1
Sorry Juan - I lost track of this last week while I was in Mountain View. I'm going to put this on my plate again for today.
FWIW, I'm pretty sure this is due to us no longer initializing the BattOr before requesting the firmware hash. However, performing that initialization adds ~3-4s to each story, and Aaron's almost positive that it's not required by the BattOr firmware. Something specific to Mac is causing this logic to fail.
Any updates? This is preventing most of system health common benchmark from running on these mac bots.
Sorry for not following up on this sooner Juan - I'm going to investigate this today and follow up on this bug saying what I find.
This is now happening on both the air-10.11 and retina.

Latest builds with BattOrError:

                      build        BattOrError
config                                 
mac-air-10.11 common   1244  FFFFFF-FFFFFFFFFF
mac-retina common      1100  ----F-FFFFFFFFFFF


Bots:
https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Air%2010.11%20Perf
https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Retina%20Perf
Looking at this now.
Note that the swarming timeout problem is already being tracked here: https://bugs.chromium.org/p/chromium/issues/detail?id=760973
Cc: aschulman@chromium.org
Summary: system_health.common_desktop failing on Mac due to failures during StopTracing (was: system_health.common_desktop failing on mac-retina with battor error (and swarming timeout?))
This actually seems to be a slightly different error than the one with which the bug was originally filed. The call stack that I'm seeing is: 

Traceback (most recent call last):
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py", line 118, in StopTracing
    agent.StopAgentTracing()
  File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/battor_tracing_agent.py", line 83, in StopAgentTracing
    self._battor.StopTracing()
  File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 247, in StopTracing
    '%s %s' % (self._STOP_TRACING_CMD, self._trace_results_path))
  File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 353, in _SendBattOrCommand
    'Outputted: %s' % (cmd, status))
BattOrError: BattOr did not complete command 'StopTracing /b/s/w/itugGz6F/tmpvEXCfH' correctly.
Outputted: [0907/231451.295339:WARNING:battor_agent.cc(88)] Unexpected sequence number: wanted 4238, but got 93261966.

A couple of interesting notes, looking at the serial log (attached):

1. It seems like the flush seems to be working well. There are several instances where streaming back the trace fails, the flush gets rid of a bunch of data, and then the next command is successful. This is exactly what we want to see.

2. The log of events looks like this:

<regular events, including EEPROM reception>
6:11:08.767 Sample received (success)
...
6:11:11.973 Sample received (success)
6:11:13.973 Flush started
6:11:38.604 Flush ended
<EEPROM request and reception>
6:11:38.622 Sample received
6:11:48.982 Sample received
6:11:50.983 Flush started
6:12:08.384 Flush ended
<EEPROM request and reception>
6:12:08.525 Sample received
...
6:12:20.606 Sample received
6:12:22.606 Flush started
6:12:38.164 Flush ended
...

Basically, it looks like the frames that are coming in are valid, but something about them isn't valid so it's initiating retries. Based on the output that we see above ("Unexpected sequence number: wanted 4238, but got 93261966."), my guess is that something's happening that's corrupting the sequence number. This also seems to be corroborated by what happens if you look at the binary representation of 4238 and 93261966 (the expected vs actual sequence numbers):

00000000 00000000 00010000 10001110 (4238)
00000101 10001111 00010000 10001110 (83261966)

All of the lower order bits are the same, but some of the higher order bits in the actual number seem to be corrupted. I'm not quite sure what to make of this.

CCing Aaron. Aaron, do you have any idea what might be causing this? I'm seeing several of the corrupted sequence numbers happen across multiple retries of StopTracing. Based on how long into the streaming back it takes, it doesn't seem likely that it's always the same data frame that's corrupted, although it might be (15s into streaming back vs. 10s and 12s respectively).
I think that the next step here is to move the warning messages that are currently shown in the WARNING output to the serial log, where they probably belong.
Components: Speed>Benchmarks>Waterfall
Owner: aschulman@chromium.org
Summary: BattOr failing during StopTracing due to incorrect sequence numbers (was: system_health.common_desktop failing on Mac due to failures during StopTracing)
Aaron, it looks like the BattOr is sending back data frames with incorrect sequence numbers, causing the BattOr agent to fail. Comment 11 describes the issue more fully. The issue is reproducible by following the instructions on https://codereview.chromium.org/3014563002/: for me, it took about 24h of stress testing to reproduce, but YMMV. 

Could you look into what's happening? Given that the frame sequence number is wrong and that only the higher order bits are incorrect, it seems very likely that this is a firmware problem and not a client-side problem.
NextAction: 2017-09-18
The NextAction date has arrived: 2017-09-18
Cc: charliea@chromium.org
To me it looks like what is happening here is simply bytes are being lost by the Mac because of the known FTDI driver issue. We have found that flow control does not work properly on Mac, and we retry full downloads each time bytes are lost, so when at least one byte is lost in several downloads in a row, the BattOr agent will fail.

The interesting thing here is that you are observing bytes being dropped in the middle of transmitting the sequence number. Sequence numbers are 32-bits transmitted LSB first. In the example you posted it looks like the first two bytes of the sequence number are transmitted properly (the lowest order bytes), then there are quite a few bytes lost, and then you are treating bytes somewhere in the middle of the frame as the last two sequence number bytes.

Since there seems to be no good fix for the Mac FTDI issue and we are seeing a significant number of failures, especially on the slower macbook air, I think we need to go to plan B. I have an idea for a fix that I have been keeping in my back pocket, that I only wanted to implement if we are desperate. The idea is to allow for the host to request retransmission of a specific broken frame. This will increase the probability that we repair lost bytes successfully. The other advantage of this is that it should significantly decrease the run time of benchmarks on the Mac, since many of them perform at least one retransmission now. 

Since we do appear to be without any other options, I will go ahead and implement per-frame retransmission.

I also will move the WARNING message to the serial log.
Project Member

Comment 18 by bugdroid1@chromium.org, Jan 29 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7d22d62dbd262280ee627864adffd7449e9c90c2

commit 7d22d62dbd262280ee627864adffd7449e9c90c2
Author: Aaron Schulman <aschulman@chromium.org>
Date: Mon Jan 29 20:51:30 2018

[BattOr] Make connection Flush()'ing public

In preparation for the frame retransmission fix we need to bring out
the BattOr connection's Flush() out into it's own function so it can
be called directly by the BattOr agent.

In order to achieve this we had to rework many BattOr agent test
cases which did not pass after this change. It turns out they
never should have passed because they did not advance the clock to
complete the command retry timeout after a command failed.

Bug:  735434 
Change-Id: Ie376c91ba47c6e466f57c6acde8a377bd96c209e
Reviewed-on: https://chromium-review.googlesource.com/836850
Commit-Queue: Aaron Schulman <aschulman@chromium.org>
Reviewed-by: Charlie Andrews <charliea@chromium.org>
Cr-Commit-Position: refs/heads/master@{#532579}
[modify] https://crrev.com/7d22d62dbd262280ee627864adffd7449e9c90c2/tools/battor_agent/battor_agent.cc
[modify] https://crrev.com/7d22d62dbd262280ee627864adffd7449e9c90c2/tools/battor_agent/battor_agent.h
[modify] https://crrev.com/7d22d62dbd262280ee627864adffd7449e9c90c2/tools/battor_agent/battor_agent_unittest.cc
[modify] https://crrev.com/7d22d62dbd262280ee627864adffd7449e9c90c2/tools/battor_agent/battor_connection.h
[modify] https://crrev.com/7d22d62dbd262280ee627864adffd7449e9c90c2/tools/battor_agent/battor_connection_impl.cc
[modify] https://crrev.com/7d22d62dbd262280ee627864adffd7449e9c90c2/tools/battor_agent/battor_connection_impl.h
[modify] https://crrev.com/7d22d62dbd262280ee627864adffd7449e9c90c2/tools/battor_agent/battor_connection_impl_unittest.cc

Mergedinto: 780907
Status: Duplicate (was: Assigned)

Sign in to add a comment