New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 634188 link

Starred by 2 users

Issue metadata

Status: Verified
Owner:
Last visit > 30 days ago
Closed: Oct 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 3
Type: Bug

Blocked on:
issue 634278



Sign in to add a comment

battor_agent intermittently fails on Mac

Project Member Reported by aschulman@chromium.org, Aug 3 2016

Issue description

Bytes are lost seemingly randomly and this leads to various battor_agent failures including: "unexpected message", "receive failure", max init retires, and timeout.
 
Blockedon: 634278
Project Member

Comment 2 by bugdroid1@chromium.org, Aug 6 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2e03714057f72f9edaf7f7d4e1799eb1ef08b06a

commit 2e03714057f72f9edaf7f7d4e1799eb1ef08b06a
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Sat Aug 06 02:30:56 2016

Roll src/third_party/catapult/ f09349b28..05755f0b4 (1 commit).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/f09349b28951..05755f0b45da

$ git log f09349b28..05755f0b4 --date=short --no-merges --format='%ad %ae %s'

BUG= 634188 

TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2220743002
Cr-Commit-Position: refs/heads/master@{#410254}

[modify] https://crrev.com/2e03714057f72f9edaf7f7d4e1799eb1ef08b06a/DEPS

The good news is the patches that have landed so far got us to the point where both Mac bots are sometimes green, yay! The dual-gpu Mac is green very often, and the low end one is green not so often

There seem to be a two cases where the runs still fail, and here are the fixes I think we should try):

1) Timeout [dual gpu] - The really interesting thing here is that these timeouts happen both when the agent tries to RecordClockSync and when it tries to StartTracing. The RecordClockSync could be due to lost bytes (since we never retry the RecordClockSync if it fails). For StartTracing, we probably could increase the timeout to greater than 4 seconds. 4 seconds is very conservative and I'm sure sometimes the laptop can go out to lunch and it will take slightly longer than 4s to INIT and start a run. I think 6 seconds is a decent choice to see if it fixes this problem. It also could be that the timeout is due to some other issue

2) Max INIT retries [low end] - This one confuses me, it really should have been fixed by https://codereview.chromium.org/2057973002/. The only thing I can think of is that I've seen the OSX FTDI driver holds on to serial data received after an application close()'es the file descriptor. This means that when we call init, there can still be quite a bit of data queued that needs to be flushed out (oh and by the way flushing the serial device doesn't seem to work unfortunately). I think the fix here is to increase the max size of the CONTROL_ACK reply so the BattOr can clear out any outstanding data faster. This issue could really use some serial logging to help us debug it. Maybe we can avoid the issue of serial logging bogging down the battor_agent and still get a serial log by adding a few levels of logging (1) log all bytes only for control actions (2) log all bytes for all actions. Then we can just run this with log level 1 written to cloud storage to debug the issue.

Surprisingly we are not seeing the UNEXPECTED MESSAGE or RECEIVE ERROR issues anymore that appeared to be due to dropped bytes by the Mac FTDI driver.

charliea@, what do you think?
These issues seem to be mostly resolved enough that we are seeing green for most runs on both the Dual GPU and Low End bots (https://build.chromium.org/p/chromium.perf.fyi/console). I'd like to wait another 24 hours before calling this Fixed to make sure no other issues creep up.

After the new BattOr agent binary rolled, low end still had a significant number of battor_agent TIMEOUTs. I manually tested the BattOr attached to that bot and found that the SD card was faulty (it needed a power cycle in order to work properly again, so likely something is faulty about it). I'm doing some more investigation and I will post my results soon.
Project Member

Comment 5 by bugdroid1@chromium.org, Aug 15 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/91a47c2f20f96a4888a69f8120742159d1493438

commit 91a47c2f20f96a4888a69f8120742159d1493438
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Mon Aug 15 20:43:43 2016

Roll src/third_party/catapult/ 2d5cfe52d..1c6ff37b8 (3 commits).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/2d5cfe52dfef..1c6ff37b8350

$ git log 2d5cfe52d..1c6ff37b8 --date=short --no-merges --format='%ad %ae %s'

BUG= 634188 

TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2241423002
Cr-Commit-Position: refs/heads/master@{#412038}

[modify] https://crrev.com/91a47c2f20f96a4888a69f8120742159d1493438/DEPS

Status: Assigned (was: Untriaged)
Adding a sub-issue to this which just came up on the FYI waterfall. If the chromium process crashes during a benchmark which means the battor_agent dies in the middle of a StopTracing command, then the next time a battor_agent starts and StartTracing is run it has a large number of bytes to dump. This seems to mostly work given that the battor_agent retries the init several times, but for some reason this trace ends with several serial read error 2 (TIMEOUT?) instead of reading a proper init ack.
battor-serial-log-2016-08-18_18-21.txt-23699.txt
18.1 KB View Download
One of the failures we are seeing is due to the firmware halting when the sample buffer overflows while the BattOr is waiting to transmit the sample count for RecordClockSyncMarker. This issue will be handled in https://github.com/aschulm/battor/issues/53.
Cc: te...@mellowresearch.com
Another sub-issue appears to be related to the Mac FTDI driver crashing when the FTDI chip receives bytes after the socket is closed. This crash results in the battor_agent not being able to communicate with the battor until the FTDI kernel driver is reloaded (either with kextunload or by unplugging the USB cable).

The interesting thing is that we somewhat created this problem for ourselves. The battor_agent terminates immediately if it sees a single frame sequence number that is out of order. What we can do instead is to continue reading even if an error has been encountered, then report the error when the battor_agent reads all of the data.

I whipped up a CL that does this and it has been running now for many hours without any problems (before that the driver would crash within an hour).
Status: Verified (was: Assigned)
This issue seems to be under control, so I am going to close it for now. However, there are two CLs that I started that could help us if we have any future issues with this bug: https://codereview.chromium.org/2343973003/ and https://codereview.chromium.org/2217513002/.

Sign in to add a comment