New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 699581 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: May 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 2
Type: Bug

Blocking:
issue 687280
issue 718366



Sign in to add a comment

BattOr failed due to timeout when starting tracing

Project Member Reported by charliea@chromium.org, Mar 8 2017

Issue description

I saw this in a recent bisect that I did on Mac: https://build.chromium.org/p/tryserver.chromium.perf/builders/mac_retina_perf_bisect/builds/1944

Here's the log from the failed run: https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Ftryserver.chromium.perf%2Fmac_retina_perf_bisect%2F1944%2F%2B%2Frecipes%2Fsteps%2FRunning_WITH_patch%2F0%2Fsteps%2FPerformance_Test__With_Patch__1_of_1%2F0%2Fstdout

Probably the most important line in that file is:

[0307/110507.871291:FATAL:battor_agent_bin.cc(89)] Fatal error when communicating with the BattOr: TIMEOUT

testa@, if I helped you understand how to navigate our infrastructure, do you think you could help debug why this run failed?
 
battor-serial-log-2017-03-07_11-05.txt-14056.txt
1.2 KB View Download
Cc: charliea@google.com
Cc: -charliea@google.com
Cc: -testa@google.com testac@google.com
Owner: aschulman@chromium.org
Corrected Chris' @google email. I'm taking a look at the logs I'll let you know my verdict asap.
@rnephew, I will need to see the serial logs before and after this issue in order to debug it. Since I don't have access to the logs yet, do you mind attaching them to this bug?
Status: Assigned (was: Untriaged)
Woops! Sorry about the email mixup. Aaron, they should be attached to the original bug filing (battor-serial-log-2017-03-07_11-05.txt-14056.txt). Do you see them there?
https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FMac_Retina_Perf%2F261%2F%2B%2Frecipes%2Fsteps%2Fbattor.trivial_pages_on_Intel_GPU_on_Mac_on_Mac-10.11%2F0%2Fstdout

(search for

File "/b/s/w/irX9uIWO/third_party/catapult/common/battor/battor/battor_wrapper.py", line 318, in _SendBattOrCommandImpl

)
Blocking: 687280
Thanks. If possible I actually would like the serial logs for the battor_agent runs on this bot proceeding and succeeding this problematic run. Do you mind attaching those logs too?
Hmmm, that's actually not possible right now. We only attach the serial logs on failure, and all of the other user stories in this benchmark passed, so their serial logs weren't uploaded.

I'm going to submit a quick, temporary patch that always uploads serial logs to cloud storage.
Thanks. It's actually useful on its own to know that it succeeded in the test before and after this one that failed.
I think in both cases, it was the last story to be run, so I can only say with confidence that it succeeded in the preceding test, not the next one.

Comment 12 Deleted

Is there any way to know if this trybot ran another BattOr benchmark at some point after this (and if that log succeeded?)
Project Member

Comment 14 by bugdroid1@chromium.org, Mar 15 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/fd3496f6cad18623fde35367afc4c57080e0044e

commit fd3496f6cad18623fde35367afc4c57080e0044e
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Wed Mar 15 21:24:34 2017

Roll src/third_party/catapult/ 4226576b1..76b86f523 (2 commits)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/4226576b1dc7..76b86f523bce

$ git log 4226576b1..76b86f523 --date=short --no-merges --format='%ad %ae %s'
2017-03-15 simonhatch Dashboard - Update bot_whitelist from /change_internal_only
2017-03-15 charliea Always upload BattOr serial logs to cloud storage

Created with:
  roll-dep src/third_party/catapult
BUG= 699581 

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2753923002
Cr-Commit-Position: refs/heads/master@{#457213}

[modify] https://crrev.com/fd3496f6cad18623fde35367afc4c57080e0044e/DEPS

Looks like I screwed something up, and it only uploads the flashing serial log (usually the less useful one). Sending a CL out to fix that now.
Project Member

Comment 16 by bugdroid1@chromium.org, Mar 16 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c08cf6325bde19708568adcf6e92f9103f449c31

commit c08cf6325bde19708568adcf6e92f9103f449c31
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Thu Mar 16 20:25:29 2017

Roll src/third_party/catapult/ faf60eb37..efbc9db2e (1 commit)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/faf60eb37f8b..efbc9db2ef8f

$ git log faf60eb37..efbc9db2e --date=short --no-merges --format='%ad %ae %s'
2017-03-16 charliea Fix bug in always uploading BattOr serial logs

Created with:
  roll-dep src/third_party/catapult
BUG= 699581 

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2748273004
Cr-Commit-Position: refs/heads/master@{#457535}

[modify] https://crrev.com/c08cf6325bde19708568adcf6e92f9103f449c31/DEPS

Cc: -testac@google.com te...@mellowresearch.com
Labels: BattOr
I don't think I have access to the cloud storage bucket with the BattOr traces, but just doing some testing on my own I think I have identified the root cause of this issue.

This appears to be another instance of an issue we have known about for a while, which is that the Mac FTDI driver drops random bytes from the BattOr.

There is a fix I've had in mind for a while, but never got a chance to implement, which is to implement retry in battor_agent for every command (including the download command if bytes are lost during the download). Right now we do have some retry logic, but it's just for the init command.

Charlie, are you ok with us moving to adding retry for all commands?
Checking in on this bug. Charlie, do you agree that we should implement command retry for all commands, including download? I think this would be a good way of dealing with the intermittent lost bytes on macOS.
Sorry for not responding: yep, I think that this is a good idea.
Cc: charliea@chromium.org
I have a CL up for review to swap out all of our earlier control message retry logic with command retry logic (https://codereview.chromium.org/2826913002). It's a doozie. I almost didn't feel comfortable making it into one CL, but it's really hard to seperate out the changes required into smaller byte-sized CLs.

In the ideal case I think charliea@ would review this CL because it touches on quite a bit of his battor_agent code. He also understands what was broken about the old way of retrying control messages and reads. He also knows why this new way is likely to be much much more robust than the old one.

rnephew@, if we want to check to see if this change makes the Mac bots more green we can consider just uploading a new mac binary with this change and seeing what happens until charliea@ gets back to do the formal review of this mega CL.

Thoughts rnephew@?
Blocking: 718366
Are there any updates on this?

It's still happening fairly regularly on the mac-retina perf bot.
Yes the CL to resolve this issue is still under review. We should have a fix committed soon.
Project Member

Comment 24 by bugdroid1@chromium.org, May 18 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047

commit b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047
Author: aschulman <aschulman@chromium.org>
Date: Thu May 18 07:47:04 2017

[BattOr] Retry battor_agent commands when they fail.

Many battor_agent failures on the waterfall could be solved by just
retrying the command (e.g., StartTracing). Previously, the battor_agent
retried reads and a select few BattOr control messages when they failed
(e.g., INIT). This was incomplete, as there were many errors that did
not result in retried BattOr control messages.

This CL replaces this incomplete per-control-message retry logic with
retry logic for entire battor_agent commands. This will make the
battor_agent robust to intermittent failures of any BattOr control
message.

This required fixing the battor_agent overuse of the ambiguous "Unexpected Message" error. It also required replacing the timeout logic.

BUG= 699581 

Review-Url: https://codereview.chromium.org/2826913002
Cr-Commit-Position: refs/heads/master@{#472724}

[modify] https://crrev.com/b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047/tools/battor_agent/battor_agent.cc
[modify] https://crrev.com/b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047/tools/battor_agent/battor_agent.h
[modify] https://crrev.com/b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047/tools/battor_agent/battor_agent_unittest.cc
[modify] https://crrev.com/b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047/tools/battor_agent/battor_connection_impl.cc
[modify] https://crrev.com/b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047/tools/battor_agent/battor_error.cc
[modify] https://crrev.com/b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047/tools/battor_agent/battor_error.h

@rnephew can you push this new version of the battor_agent to the mac bots to see if it resolves the flakiness in long runs such as tough_video_cases?
Project Member

Comment 26 by bugdroid1@chromium.org, May 31 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/910c7d9dcd9571744626dcd19f3ec3dfad2daf5d

commit 910c7d9dcd9571744626dcd19f3ec3dfad2daf5d
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Wed May 31 02:55:07 2017

Roll src/third_party/catapult/ e18c0f9c7..32a3f0b8f (8 commits)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/e18c0f9c7950..32a3f0b8f49b

$ git log e18c0f9c7..32a3f0b8f --date=short --no-merges --format='%ad %ae %s'
2017-05-30 rnephew [BattOr] Update BattOr agent Binary.
2017-05-30 simonhatch Pinpoint - Remove some unneeded files.
2017-05-30 ccraik Switch from printf to echo for tgid fixup
2017-05-30 ccraik Fix viewer html generation on windows
2017-05-30 dtu [pinpoint] Make Change.deps a frozenset.
2017-05-30 jbudorick [devil] Make devil.utils.cmd_helper work on Windows?
2017-05-30 dproy Fix markdown typo in cpu_time tree doc
2017-05-30 phsilva Implement the Ownership Diagnostic Class

Created with:
  roll-dep src/third_party/catapult
BUG= 699581 


Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls


CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: I4502fff3bec2c407c5f82649a68b86d24a35f000
Reviewed-on: https://chromium-review.googlesource.com/518507
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#475759}
[modify] https://crrev.com/910c7d9dcd9571744626dcd19f3ec3dfad2daf5d/DEPS

Status: Fixed (was: Assigned)
Looking at the bots, it doesn't look like we've had any failures due to BattOr failures within the last 24 hours. I'm going to close this, and we can reopen it in the unfortunate (and hopefully unlikely) event that we run into further issues.
Components: Test>Telemetry
Components: -Speed>Telemetry

Sign in to add a comment