BattOr failed due to timeout when starting tracing |
|||||||||||
Issue descriptionI saw this in a recent bisect that I did on Mac: https://build.chromium.org/p/tryserver.chromium.perf/builders/mac_retina_perf_bisect/builds/1944 Here's the log from the failed run: https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Ftryserver.chromium.perf%2Fmac_retina_perf_bisect%2F1944%2F%2B%2Frecipes%2Fsteps%2FRunning_WITH_patch%2F0%2Fsteps%2FPerformance_Test__With_Patch__1_of_1%2F0%2Fstdout Probably the most important line in that file is: [0307/110507.871291:FATAL:battor_agent_bin.cc(89)] Fatal error when communicating with the BattOr: TIMEOUT testa@, if I helped you understand how to navigate our infrastructure, do you think you could help debug why this run failed?
,
Mar 8 2017
,
Mar 15 2017
Corrected Chris' @google email. I'm taking a look at the logs I'll let you know my verdict asap.
,
Mar 15 2017
@rnephew, I will need to see the serial logs before and after this issue in order to debug it. Since I don't have access to the logs yet, do you mind attaching them to this bug?
,
Mar 15 2017
Woops! Sorry about the email mixup. Aaron, they should be attached to the original bug filing (battor-serial-log-2017-03-07_11-05.txt-14056.txt). Do you see them there?
,
Mar 15 2017
https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FMac_Retina_Perf%2F261%2F%2B%2Frecipes%2Fsteps%2Fbattor.trivial_pages_on_Intel_GPU_on_Mac_on_Mac-10.11%2F0%2Fstdout (search for File "/b/s/w/irX9uIWO/third_party/catapult/common/battor/battor/battor_wrapper.py", line 318, in _SendBattOrCommandImpl )
,
Mar 15 2017
,
Mar 15 2017
Thanks. If possible I actually would like the serial logs for the battor_agent runs on this bot proceeding and succeeding this problematic run. Do you mind attaching those logs too?
,
Mar 15 2017
Hmmm, that's actually not possible right now. We only attach the serial logs on failure, and all of the other user stories in this benchmark passed, so their serial logs weren't uploaded. I'm going to submit a quick, temporary patch that always uploads serial logs to cloud storage.
,
Mar 15 2017
Thanks. It's actually useful on its own to know that it succeeded in the test before and after this one that failed.
,
Mar 15 2017
I think in both cases, it was the last story to be run, so I can only say with confidence that it succeeded in the preceding test, not the next one.
,
Mar 15 2017
Is there any way to know if this trybot ran another BattOr benchmark at some point after this (and if that log succeeded?)
,
Mar 15 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/fd3496f6cad18623fde35367afc4c57080e0044e commit fd3496f6cad18623fde35367afc4c57080e0044e Author: catapult-deps-roller <catapult-deps-roller@chromium.org> Date: Wed Mar 15 21:24:34 2017 Roll src/third_party/catapult/ 4226576b1..76b86f523 (2 commits) https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/4226576b1dc7..76b86f523bce $ git log 4226576b1..76b86f523 --date=short --no-merges --format='%ad %ae %s' 2017-03-15 simonhatch Dashboard - Update bot_whitelist from /change_internal_only 2017-03-15 charliea Always upload BattOr serial logs to cloud storage Created with: roll-dep src/third_party/catapult BUG= 699581 Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, see: http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel TBR=catapult-sheriff@chromium.org Review-Url: https://codereview.chromium.org/2753923002 Cr-Commit-Position: refs/heads/master@{#457213} [modify] https://crrev.com/fd3496f6cad18623fde35367afc4c57080e0044e/DEPS
,
Mar 16 2017
Looks like I screwed something up, and it only uploads the flashing serial log (usually the less useful one). Sending a CL out to fix that now.
,
Mar 16 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/c08cf6325bde19708568adcf6e92f9103f449c31 commit c08cf6325bde19708568adcf6e92f9103f449c31 Author: catapult-deps-roller <catapult-deps-roller@chromium.org> Date: Thu Mar 16 20:25:29 2017 Roll src/third_party/catapult/ faf60eb37..efbc9db2e (1 commit) https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/faf60eb37f8b..efbc9db2ef8f $ git log faf60eb37..efbc9db2e --date=short --no-merges --format='%ad %ae %s' 2017-03-16 charliea Fix bug in always uploading BattOr serial logs Created with: roll-dep src/third_party/catapult BUG= 699581 Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, see: http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel TBR=catapult-sheriff@chromium.org Review-Url: https://codereview.chromium.org/2748273004 Cr-Commit-Position: refs/heads/master@{#457535} [modify] https://crrev.com/c08cf6325bde19708568adcf6e92f9103f449c31/DEPS
,
Mar 25 2017
I don't think I have access to the cloud storage bucket with the BattOr traces, but just doing some testing on my own I think I have identified the root cause of this issue. This appears to be another instance of an issue we have known about for a while, which is that the Mac FTDI driver drops random bytes from the BattOr. There is a fix I've had in mind for a while, but never got a chance to implement, which is to implement retry in battor_agent for every command (including the download command if bytes are lost during the download). Right now we do have some retry logic, but it's just for the init command. Charlie, are you ok with us moving to adding retry for all commands?
,
Mar 29 2017
Checking in on this bug. Charlie, do you agree that we should implement command retry for all commands, including download? I think this would be a good way of dealing with the intermittent lost bytes on macOS.
,
Mar 29 2017
Sorry for not responding: yep, I think that this is a good idea.
,
Apr 19 2017
I have a CL up for review to swap out all of our earlier control message retry logic with command retry logic (https://codereview.chromium.org/2826913002). It's a doozie. I almost didn't feel comfortable making it into one CL, but it's really hard to seperate out the changes required into smaller byte-sized CLs. In the ideal case I think charliea@ would review this CL because it touches on quite a bit of his battor_agent code. He also understands what was broken about the old way of retrying control messages and reads. He also knows why this new way is likely to be much much more robust than the old one. rnephew@, if we want to check to see if this change makes the Mac bots more green we can consider just uploading a new mac binary with this change and seeing what happens until charliea@ gets back to do the formal review of this mega CL. Thoughts rnephew@?
,
May 10 2017
,
May 10 2017
Are there any updates on this? It's still happening fairly regularly on the mac-retina perf bot.
,
May 10 2017
Yes the CL to resolve this issue is still under review. We should have a fix committed soon.
,
May 18 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047 commit b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047 Author: aschulman <aschulman@chromium.org> Date: Thu May 18 07:47:04 2017 [BattOr] Retry battor_agent commands when they fail. Many battor_agent failures on the waterfall could be solved by just retrying the command (e.g., StartTracing). Previously, the battor_agent retried reads and a select few BattOr control messages when they failed (e.g., INIT). This was incomplete, as there were many errors that did not result in retried BattOr control messages. This CL replaces this incomplete per-control-message retry logic with retry logic for entire battor_agent commands. This will make the battor_agent robust to intermittent failures of any BattOr control message. This required fixing the battor_agent overuse of the ambiguous "Unexpected Message" error. It also required replacing the timeout logic. BUG= 699581 Review-Url: https://codereview.chromium.org/2826913002 Cr-Commit-Position: refs/heads/master@{#472724} [modify] https://crrev.com/b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047/tools/battor_agent/battor_agent.cc [modify] https://crrev.com/b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047/tools/battor_agent/battor_agent.h [modify] https://crrev.com/b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047/tools/battor_agent/battor_agent_unittest.cc [modify] https://crrev.com/b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047/tools/battor_agent/battor_connection_impl.cc [modify] https://crrev.com/b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047/tools/battor_agent/battor_error.cc [modify] https://crrev.com/b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047/tools/battor_agent/battor_error.h
,
May 18 2017
@rnephew can you push this new version of the battor_agent to the mac bots to see if it resolves the flakiness in long runs such as tough_video_cases?
,
May 31 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/910c7d9dcd9571744626dcd19f3ec3dfad2daf5d commit 910c7d9dcd9571744626dcd19f3ec3dfad2daf5d Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org> Date: Wed May 31 02:55:07 2017 Roll src/third_party/catapult/ e18c0f9c7..32a3f0b8f (8 commits) https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/e18c0f9c7950..32a3f0b8f49b $ git log e18c0f9c7..32a3f0b8f --date=short --no-merges --format='%ad %ae %s' 2017-05-30 rnephew [BattOr] Update BattOr agent Binary. 2017-05-30 simonhatch Pinpoint - Remove some unneeded files. 2017-05-30 ccraik Switch from printf to echo for tgid fixup 2017-05-30 ccraik Fix viewer html generation on windows 2017-05-30 dtu [pinpoint] Make Change.deps a frozenset. 2017-05-30 jbudorick [devil] Make devil.utils.cmd_helper work on Windows? 2017-05-30 dproy Fix markdown typo in cpu_time tree doc 2017-05-30 phsilva Implement the Ownership Diagnostic Class Created with: roll-dep src/third_party/catapult BUG= 699581 Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, see: http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel TBR=sullivan@chromium.org Change-Id: I4502fff3bec2c407c5f82649a68b86d24a35f000 Reviewed-on: https://chromium-review.googlesource.com/518507 Reviewed-by: <catapult-deps-roller@chromium.org> Commit-Queue: <catapult-deps-roller@chromium.org> Cr-Commit-Position: refs/heads/master@{#475759} [modify] https://crrev.com/910c7d9dcd9571744626dcd19f3ec3dfad2daf5d/DEPS
,
May 31 2017
Looking at the bots, it doesn't look like we've had any failures due to BattOr failures within the last 24 hours. I'm going to close this, and we can reopen it in the unfortunate (and hopefully unlikely) event that we run into further issues.
,
Jan 16
,
Jan 16
|
|||||||||||
►
Sign in to add a comment |
|||||||||||
Comment 1 by charliea@chromium.org
, Mar 8 2017