Issue metadata
Sign in to add a comment
|
BattOr RecordClockSyncMarker fails due to timeout |
||||||||||||||||||||
Issue descriptionThe Macs in the perf testbed have had a significant decrease in their failure rate due to the command retry fix in #699581 (yay!). However, as I expected there is still a low failure rate on Mac because I did not add command retrying for RecordClockSync. Here is an example of a recent log that demonstrates this issue: https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FMac_Retina_Perf%2F715%2F%2B%2Frecipes%2Fsteps%2Fmedia.tough_video_cases_tbmv2.reference_on_Intel_GPU_on_Mac_on_Mac-10.11%2F0%2Fstdout The reason for not adding retry to this command is that the timing of the execution of this command is important. If we retry the command inside the battor_agent, the battor_wrapper will not know, and may assume that the clock sync happened earlier than the operation actually completed. I am opening this bug to start the discussion about how we should address this issue since it seems like there are still a few failed traces happening due to failed clock syncs. rnewphew@ based on how the RecordClockSync timestamping is implemented in telemetry, should we implement some sort of clock sync retry in the battor_wrapper, or should we just retry the command in the battor_agent?
,
Jun 1 2017
,
Jun 1 2017
,
Jun 1 2017
I've attached the serial log from the failed clock sync.
,
Jun 1 2017
It looks like the following is happening in the serial log: - (expected) GetFirmwareGitHash fails because INIT caused a BattOr reset - (expected) GetFirmwareGitHash succeeds - (expected) StartTracing sent but fails because INIT caused a BattOr reset - (expected) StartTracing succeeds - (????????) RecordClockSyncMarker fails because READ_SAMPLE_COUNT caused a BattOr reset Aaron, can you think of any reason that the BattOr would be resetting when we tell it to record a clock sync marker? It seems like this is a problem that we want to fix from the firmware side.
,
Jun 1 2017
,
Jun 5 2017
I investigated this a bit more, and it looks like there are two issues here. The first is what Charlie nicely pointed out that there is a reset due to a firmware issue. I have a fix in mind for that, but it will take some time for me to figure out how to do it properly. The second issue is one that I don't know exactly what is going on. Take a look at these two traces: https://storage.cloud.google.com/chrome-telemetry-output/battor-serial-log-2017-06-01_00-15.txt-60405.txt the next run: https://storage.cloud.google.com/chrome-telemetry-output/battor-serial-log-2017-06-01_01-13.txt-12798.txt The first trace ends in a TIMEOUT due to no data received (not even a RESET byte) from the BattOr. The second trace starts with sending an INIT, and interestingly enough, the reply is the clock sync reply from the prior run. The only explanation I have come up with so far is that a timeout happens immediately after the RecordClockSync is sent. Another possibility is that the read call isn't happening properly for some reason in the first run, then it works properly in the 2nd run.
,
Jun 27 2017
The fact that we can't answer the questions that you just posed in the last comment is, in itself, probably a bug. I'm going to make two changes to the serial log: 1) Prefix each entry in the log with a timestamp 2) Add a serial log message for when the read is cancelled due to a timeout. Right now the only way to know that a serial read timed out is by knowing that a "Starting read of up to X bytes." without any subsequent message indicates a timeout.
,
Jun 27 2017
Yes. I think both of these pieces of information would help us debug this issue and others.
,
Jun 27 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/083590ad7a5f0189c27a068b4c222d61785cfb4e commit 083590ad7a5f0189c27a068b4c222d61785cfb4e Author: Charlie Andrews <charliea@google.com> Date: Tue Jun 27 19:18:52 2017 Add timestamps and timeouts to the BattOr serial log Bug: 728496 Change-Id: Iccaf529534958223e7d4df240390f86e9c9ec818 Reviewed-on: https://chromium-review.googlesource.com/550321 Reviewed-by: rnephew <rnephew@chromium.org> Commit-Queue: Charlie Andrews <charliea@chromium.org> Cr-Commit-Position: refs/heads/master@{#482708} [modify] https://crrev.com/083590ad7a5f0189c27a068b4c222d61785cfb4e/tools/battor_agent/battor_connection_impl.cc
,
Jul 10 2017
Charlie, do we need to do something to update the battor_agents on the testbed to have the latest version of the src? I still don't see timestamps in the serial logs. I would do this myself, but I don't know the status of the autobuilders.
,
Jul 10 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/538f75da5a8e54f071ccacbf6d76c14e482686bf commit 538f75da5a8e54f071ccacbf6d76c14e482686bf Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org> Date: Mon Jul 10 15:55:48 2017 Roll src/third_party/catapult/ 9f7e1bcf9..72646aa58 (1 commit) https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/9f7e1bcf98c3..72646aa580ee $ git log 9f7e1bcf9..72646aa58 --date=short --no-merges --format='%ad %ae %s' 2017-07-10 charliea Upload the battor agent binary to the latest version Created with: roll-dep src/third_party/catapult BUG= 728496 Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, see: http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel TBR=sullivan@chromium.org Change-Id: Id17109882ae7c93dc517fda12a3c1bdbe9d7ceab Reviewed-on: https://chromium-review.googlesource.com/565001 Reviewed-by: <catapult-deps-roller@chromium.org> Commit-Queue: <catapult-deps-roller@chromium.org> Cr-Commit-Position: refs/heads/master@{#485282} [modify] https://crrev.com/538f75da5a8e54f071ccacbf6d76c14e482686bf/DEPS
,
Jul 28 2017
Charlie and Randy. I just pushed a firmware fix to the BattOr reset due to RecordClockSync issue. I will put up a patch to upload the new firmware in a moment. https://github.com/aschulm/battor/commit/3157c85d1bcffdc89692b429d05e284724a6908b Let's set a NextAction to Monday to make sure this change has worked.
,
Jul 28 2017
Thanks Aaron! We'll look for your patch.
,
Jul 28 2017
,
Jul 28 2017
@charliea, I just realized I can't upload into the chromium-telemetry bucket anymore. Either you or Randy will have to submit the CL to upload the latest BattOr firmware.
,
Jul 31 2017
The NextAction date has arrived: 2017-07-31
,
Jul 31 2017
CL is in the commit queue!
,
Aug 1 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/5dc438aca416be722ae65ae09d7521c1ec48b4f4 commit 5dc438aca416be722ae65ae09d7521c1ec48b4f4 Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org> Date: Tue Aug 01 08:27:50 2017 Roll src/third_party/catapult/ 9e69c3388..4d5d88c3e (12 commits) https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/9e69c338846a..4d5d88c3e04c $ git log 9e69c3388..4d5d88c3e --date=short --no-merges --format='%ad %ae %s' 2017-07-31 benjhayden Fix object-instance-track heading. 2017-07-31 loloangela Remove pylintrc bad-continuation and fix resulting errors pt.42 2017-07-31 rnephew [Telemetry] Add ability to disable by mac os version number. 2017-07-31 loloangela Fixed errors related to bad-continuation pt.41 2017-07-31 xunjieli [wpr-go] Allow disabling script injection 2017-07-31 loloangela Fixed errors related to bad-continuation pt.40 2017-07-31 benjhayden Fix ObjectSnapshot.userFriendlyName. 2017-07-31 simonhatch Dashboard - Add entities to _INTERNAL_ONLY_KINDS in datastore_hooks 2017-07-31 loloangela Fixed errors related to bad-continuation pt.39 2017-07-31 sullivan Set bug status to 'Assigned' when assigning to culprit author. 2017-07-31 loloangela Fixed errors related to bad-continuation pt.38 2017-07-31 charliea Upload a new version of the BattOr agent firmware Created with: roll-dep src/third_party/catapult BUG=713222, 728496 Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, see: http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel TBR=sullivan@chromium.org Change-Id: If6aca002e8592ed8dc89391f6964933fab5bcea6 Reviewed-on: https://chromium-review.googlesource.com/595226 Reviewed-by: <catapult-deps-roller@chromium.org> Commit-Queue: <catapult-deps-roller@chromium.org> Cr-Commit-Position: refs/heads/master@{#490917} [modify] https://crrev.com/5dc438aca416be722ae65ae09d7521c1ec48b4f4/DEPS
,
Aug 14 2017
Marking this as fixed. I'll reopen if I see the problem crop up again. |
|||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||
Comment 1 by aschulman@chromium.org
, Jun 1 2017