BattOr timing out while awaiting firmware version |
||||||||
Issue descriptionOS: I've only seen this on the Mac Retina bot, but it may also be happening on Windows, too Link to buildbot run: https://build.chromium.org/p/chromium.perf/builders/Mac%20Retina%20Perf/builds/173 Link to STDIO: https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FMac_Retina_Perf%2F173%2F%2B%2Frecipes%2Fsteps%2Fbattor.power_cases_on_Intel_GPU_on_Mac_on_Mac-10.11%2F0%2Fstdout Link to serial log: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/battor-serial-log-2017-01-08_11-37.txt-41332.txt The serial log in its entirety: """ Opening serial connection. Serial connection open finished with success: 1. Bytes sent: 0x00 0x03 0x09 0x02 0x00 0x02 0x00 0x02 0x00 0x02 0x00 0x01. Read requested. Before doing a serial read, checking to see if we already have a complete message in the 'already read' buffer. No complete message found in the 'already read' buffer. Starting read of up to 191 bytes. """ The battor_agent failed in a timeout. This is weird: I've never seen the BattOr have a problem with receiving an initialization ack before. It's like we're sending the initialization message, but the BattOr just isn't responsive. In the next test, battor.power_cases.reference (which ran immediately after this one), the BattOr was working again. This makes me wonder if we should be retrying when this happens.
,
Jan 9 2017
,
Jan 9 2017
Other supporting evidence that I screwed something up: any failure at this step should result in a TOO_MANY_INIT_ATTEMPTS failure, not a TIMEOUT failure. Removing any mention of Mac, because I believe all OSes are susceptible to this failure.
,
Jan 9 2017
I made a mistake early on in this bug: the problem isn't a timeout while awaiting the init ack but a timeout while awaiting the *firmware version ack*. This makes a lot more sense: we currently don't have any retry logic in place for this request. Presumably, requesting the firmware version should have all of the retry logic associated with requesting an initialization of the BattOr.
,
Jan 9 2017
,
Jan 9 2017
,
Jan 9 2017
,
Jan 12 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/fc4c96b992fc256fe2bb8819ba6a600580b25900 commit fc4c96b992fc256fe2bb8819ba6a600580b25900 Author: rnephew <rnephew@chromium.org> Date: Thu Jan 12 19:15:03 2017 [BattOr] Initialize BattOr before getting firmware git hash. BUG= 679375 Review-Url: https://codereview.chromium.org/2622003005 Cr-Commit-Position: refs/heads/master@{#443318} [modify] https://crrev.com/fc4c96b992fc256fe2bb8819ba6a600580b25900/tools/battor_agent/battor_agent.cc [modify] https://crrev.com/fc4c96b992fc256fe2bb8819ba6a600580b25900/tools/battor_agent/battor_agent_unittest.cc
,
Jan 13 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/17b4caf4626ef05da5b86ba8482711ee7cc1178e commit 17b4caf4626ef05da5b86ba8482711ee7cc1178e Author: catapult-deps-roller <catapult-deps-roller@chromium.org> Date: Fri Jan 13 06:21:26 2017 Roll src/third_party/catapult/ fe8a3c8a5..4bd198f6a (2 commits). https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/fe8a3c8a5cdb..4bd198f6a85c $ git log fe8a3c8a5..4bd198f6a --date=short --no-merges --format='%ad %ae %s' 2017-01-12 bpastene devil: Push devil onto the path when running reset_usb.py manually. 2017-01-12 rnephew [BattOr] Update BattOr Agent binary in cloud storage. BUG= 679375 Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, see: http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel TBR=catapult-sheriff@chromium.org Review-Url: https://codereview.chromium.org/2631643002 Cr-Commit-Position: refs/heads/master@{#443508} [modify] https://crrev.com/17b4caf4626ef05da5b86ba8482711ee7cc1178e/DEPS
,
Jan 13 2017
|
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by charliea@chromium.org
, Jan 9 2017Hmmm, now that I think about it more, we already *should* be retrying if the init fails: void BattOrAgent::OnMessageRead(bool success, BattOrMessageType type, std::unique_ptr<vector<char>> bytes) { ... if (!success) { switch (last_action_) { ... case Action::READ_INIT_ACK: if (num_init_attempts_++ < kMaxInitAttempts) { PerformDelayedAction(Action::SEND_INIT, base::TimeDelta::FromMilliseconds(kInitRetryDelayMilliseconds)); } else { CompleteCommand(BATTOR_ERROR_TOO_MANY_INIT_RETRIES); } return; I wonder if I somehow broke this logic with my other changes to the retry logic?