Issue metadata
Sign in to add a comment
|
BattOr failing during StopTracing due to incorrect sequence numbers |
||||||||||||||||||||||||
Issue description
The error is:
(ERROR) 2017-06-21 02:26:19,493 battor_wrapper._FlashBattOr:166 Git hash returned from BattOr was not as expected: [0621/022619.493134:FATAL:battor_agent_bin.cc(91)] Fatal error when communicating with the BattOr: TOO MANY COMMAND RETRIES
Traceback (most recent call last):
File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 153, in _FlashBattOr
device_git_hash = self.GetFirmwareGitHash()
File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 383, in GetFirmwareGitHash
int(self._git_hash, 16)
ValueError: invalid literal for int() with base 16: '[0621/022619.493134:FATAL:battor_agent_bin.cc(91)] Fatal error when communicating with the BattOr: TOO MANY COMMAND RETRIES'
(WARNING) 2017-06-21 02:26:19,495 shared_page_state.DumpStateUponFailure:131 Cannot dump browser state: No browser.
https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FMac_Retina_Perf%2F794%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.common_desktop_on_Intel_GPU_on_Mac_on_Mac-10.11%2F0%2Fstdout
Also, strangely, swarming claims that "shard #0 timed out, took too much time to complete", with a "Total duration: 3708.1s". However the log only shows output from "02:24:34,195" to "2017-06-21 02:26:20,213" (~ 2 mins); where was all the additional time spent?
,
Jun 26 2017
Also looks like the issue just started spreading on mac-air-10.11 as of build: https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Air%2010.11%20Perf/976 Same error: ValueError: invalid literal for int() with base 16: '[0624/142234.652564:FATAL:battor_agent_bin.cc(91)] Fatal error when communicating with the BattOr: TOO MANY COMMAND RETRIES' Also test stopped after 2 minutes but swarming thinks it timed out after running for one hour.
,
Jun 26 2017
,
Jun 26 2017
Sorry Juan - I lost track of this last week while I was in Mountain View. I'm going to put this on my plate again for today.
,
Jun 26 2017
FWIW, I'm pretty sure this is due to us no longer initializing the BattOr before requesting the firmware hash. However, performing that initialization adds ~3-4s to each story, and Aaron's almost positive that it's not required by the BattOr firmware. Something specific to Mac is causing this logic to fail.
,
Jul 3 2017
Any updates? This is preventing most of system health common benchmark from running on these mac bots.
,
Jul 5 2017
Sorry for not following up on this sooner Juan - I'm going to investigate this today and follow up on this bug saying what I find.
,
Sep 7 2017
This is now happening on both the air-10.11 and retina.
Latest builds with BattOrError:
build BattOrError
config
mac-air-10.11 common 1244 FFFFFF-FFFFFFFFFF
mac-retina common 1100 ----F-FFFFFFFFFFF
Bots:
https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Air%2010.11%20Perf
https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Retina%20Perf
,
Sep 11 2017
Looking at this now.
,
Sep 11 2017
Note that the swarming timeout problem is already being tracked here: https://bugs.chromium.org/p/chromium/issues/detail?id=760973
,
Sep 11 2017
This actually seems to be a slightly different error than the one with which the bug was originally filed. The call stack that I'm seeing is:
Traceback (most recent call last):
File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py", line 118, in StopTracing
agent.StopAgentTracing()
File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/battor_tracing_agent.py", line 83, in StopAgentTracing
self._battor.StopTracing()
File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 247, in StopTracing
'%s %s' % (self._STOP_TRACING_CMD, self._trace_results_path))
File "/b/s/w/ir/third_party/catapult/common/battor/battor/battor_wrapper.py", line 353, in _SendBattOrCommand
'Outputted: %s' % (cmd, status))
BattOrError: BattOr did not complete command 'StopTracing /b/s/w/itugGz6F/tmpvEXCfH' correctly.
Outputted: [0907/231451.295339:WARNING:battor_agent.cc(88)] Unexpected sequence number: wanted 4238, but got 93261966.
A couple of interesting notes, looking at the serial log (attached):
1. It seems like the flush seems to be working well. There are several instances where streaming back the trace fails, the flush gets rid of a bunch of data, and then the next command is successful. This is exactly what we want to see.
2. The log of events looks like this:
<regular events, including EEPROM reception>
6:11:08.767 Sample received (success)
...
6:11:11.973 Sample received (success)
6:11:13.973 Flush started
6:11:38.604 Flush ended
<EEPROM request and reception>
6:11:38.622 Sample received
6:11:48.982 Sample received
6:11:50.983 Flush started
6:12:08.384 Flush ended
<EEPROM request and reception>
6:12:08.525 Sample received
...
6:12:20.606 Sample received
6:12:22.606 Flush started
6:12:38.164 Flush ended
...
Basically, it looks like the frames that are coming in are valid, but something about them isn't valid so it's initiating retries. Based on the output that we see above ("Unexpected sequence number: wanted 4238, but got 93261966."), my guess is that something's happening that's corrupting the sequence number. This also seems to be corroborated by what happens if you look at the binary representation of 4238 and 93261966 (the expected vs actual sequence numbers):
00000000 00000000 00010000 10001110 (4238)
00000101 10001111 00010000 10001110 (83261966)
All of the lower order bits are the same, but some of the higher order bits in the actual number seem to be corrupted. I'm not quite sure what to make of this.
CCing Aaron. Aaron, do you have any idea what might be causing this? I'm seeing several of the corrupted sequence numbers happen across multiple retries of StopTracing. Based on how long into the streaming back it takes, it doesn't seem likely that it's always the same data frame that's corrupted, although it might be (15s into streaming back vs. 10s and 12s respectively).
,
Sep 12 2017
I think that the next step here is to move the warning messages that are currently shown in the WARNING output to the serial log, where they probably belong.
,
Sep 15 2017
Aaron, it looks like the BattOr is sending back data frames with incorrect sequence numbers, causing the BattOr agent to fail. Comment 11 describes the issue more fully. The issue is reproducible by following the instructions on https://codereview.chromium.org/3014563002/: for me, it took about 24h of stress testing to reproduce, but YMMV. Could you look into what's happening? Given that the frame sequence number is wrong and that only the higher order bits are incorrect, it seems very likely that this is a firmware problem and not a client-side problem.
,
Sep 15 2017
,
Sep 18 2017
The NextAction date has arrived: 2017-09-18
,
Oct 11 2017
To me it looks like what is happening here is simply bytes are being lost by the Mac because of the known FTDI driver issue. We have found that flow control does not work properly on Mac, and we retry full downloads each time bytes are lost, so when at least one byte is lost in several downloads in a row, the BattOr agent will fail. The interesting thing here is that you are observing bytes being dropped in the middle of transmitting the sequence number. Sequence numbers are 32-bits transmitted LSB first. In the example you posted it looks like the first two bytes of the sequence number are transmitted properly (the lowest order bytes), then there are quite a few bytes lost, and then you are treating bytes somewhere in the middle of the frame as the last two sequence number bytes. Since there seems to be no good fix for the Mac FTDI issue and we are seeing a significant number of failures, especially on the slower macbook air, I think we need to go to plan B. I have an idea for a fix that I have been keeping in my back pocket, that I only wanted to implement if we are desperate. The idea is to allow for the host to request retransmission of a specific broken frame. This will increase the probability that we repair lost bytes successfully. The other advantage of this is that it should significantly decrease the run time of benchmarks on the Mac, since many of them perform at least one retransmission now. Since we do appear to be without any other options, I will go ahead and implement per-frame retransmission. I also will move the WARNING message to the serial log.
,
Oct 18 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/da3f2b24e62846c25c5c68883e288b014bff1085 commit da3f2b24e62846c25c5c68883e288b014bff1085 Author: Charlie Andrews <charliea@google.com> Date: Wed Oct 18 16:37:56 2017 Add higher-level information to the serial log via BattOrAgent Bug: 735434 , 775520 Change-Id: Ibab9cd698d0ad74593c008956d0a90fab08fe596 Reviewed-on: https://chromium-review.googlesource.com/723595 Commit-Queue: Charlie Andrews <charliea@chromium.org> Reviewed-by: rnephew <rnephew@chromium.org> Cr-Commit-Position: refs/heads/master@{#509791} [modify] https://crrev.com/da3f2b24e62846c25c5c68883e288b014bff1085/tools/battor_agent/battor_agent.cc [modify] https://crrev.com/da3f2b24e62846c25c5c68883e288b014bff1085/tools/battor_agent/battor_agent.h [modify] https://crrev.com/da3f2b24e62846c25c5c68883e288b014bff1085/tools/battor_agent/battor_agent_unittest.cc [modify] https://crrev.com/da3f2b24e62846c25c5c68883e288b014bff1085/tools/battor_agent/battor_connection.h [modify] https://crrev.com/da3f2b24e62846c25c5c68883e288b014bff1085/tools/battor_agent/battor_connection_impl.h
,
Jan 29 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/7d22d62dbd262280ee627864adffd7449e9c90c2 commit 7d22d62dbd262280ee627864adffd7449e9c90c2 Author: Aaron Schulman <aschulman@chromium.org> Date: Mon Jan 29 20:51:30 2018 [BattOr] Make connection Flush()'ing public In preparation for the frame retransmission fix we need to bring out the BattOr connection's Flush() out into it's own function so it can be called directly by the BattOr agent. In order to achieve this we had to rework many BattOr agent test cases which did not pass after this change. It turns out they never should have passed because they did not advance the clock to complete the command retry timeout after a command failed. Bug: 735434 Change-Id: Ie376c91ba47c6e466f57c6acde8a377bd96c209e Reviewed-on: https://chromium-review.googlesource.com/836850 Commit-Queue: Aaron Schulman <aschulman@chromium.org> Reviewed-by: Charlie Andrews <charliea@chromium.org> Cr-Commit-Position: refs/heads/master@{#532579} [modify] https://crrev.com/7d22d62dbd262280ee627864adffd7449e9c90c2/tools/battor_agent/battor_agent.cc [modify] https://crrev.com/7d22d62dbd262280ee627864adffd7449e9c90c2/tools/battor_agent/battor_agent.h [modify] https://crrev.com/7d22d62dbd262280ee627864adffd7449e9c90c2/tools/battor_agent/battor_agent_unittest.cc [modify] https://crrev.com/7d22d62dbd262280ee627864adffd7449e9c90c2/tools/battor_agent/battor_connection.h [modify] https://crrev.com/7d22d62dbd262280ee627864adffd7449e9c90c2/tools/battor_agent/battor_connection_impl.cc [modify] https://crrev.com/7d22d62dbd262280ee627864adffd7449e9c90c2/tools/battor_agent/battor_connection_impl.h [modify] https://crrev.com/7d22d62dbd262280ee627864adffd7449e9c90c2/tools/battor_agent/battor_connection_impl_unittest.cc
,
Feb 7 2018
|
|||||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||||
Comment 1 by perezju@chromium.org
, Jun 26 2017