New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 728496 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Aug 2017
Cc:
EstimatedDays: ----
NextAction: 2017-07-31
OS: Mac
Pri: 3
Type: Bug

Blocking:
issue 737176



Sign in to add a comment

BattOr RecordClockSyncMarker fails due to timeout

Project Member Reported by aschulman@chromium.org, Jun 1 2017

Issue description

The Macs in the perf testbed have had a significant decrease in their failure rate due to the command retry fix in #699581 (yay!). However, as I expected there is still a low failure rate on Mac because I did not add command retrying for RecordClockSync. Here is an example of a recent log that demonstrates this issue:

https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FMac_Retina_Perf%2F715%2F%2B%2Frecipes%2Fsteps%2Fmedia.tough_video_cases_tbmv2.reference_on_Intel_GPU_on_Mac_on_Mac-10.11%2F0%2Fstdout

The reason for not adding retry to this command is that the timing of the execution of this command is important. If we retry the command inside the battor_agent, the battor_wrapper will not know, and may assume that the clock sync happened earlier than 
 the operation actually completed.

I am opening this bug to start the discussion about how we should address this issue since it seems like there are still a few failed traces happening due to failed clock syncs.

rnewphew@ based on how the RecordClockSync timestamping is implemented in telemetry, should we implement some sort of clock sync retry in the battor_wrapper, or should we just retry the command in the battor_agent?
 
Description: Show this description
Description: Show this description
Description: Show this description
I've attached the serial log from the failed clock sync.
battor-serial-log-2017-05-31_03-37.txt-97996.txt
3.5 KB View Download
Owner: aschulman@chromium.org
Status: Assigned (was: Untriaged)
It looks like the following is happening in the serial log:

- (expected) GetFirmwareGitHash fails because INIT caused a BattOr reset
- (expected) GetFirmwareGitHash succeeds
- (expected) StartTracing sent but fails because INIT caused a BattOr reset
- (expected) StartTracing succeeds
- (????????) RecordClockSyncMarker fails because READ_SAMPLE_COUNT caused a BattOr reset

Aaron, can you think of any reason that the BattOr would be resetting when we tell it to record a clock sync marker? It seems like this is a problem that we want to fix from the firmware side.
NextAction: 2017-02-06
I investigated this a bit more, and it looks like there are two issues here. The first is what Charlie nicely pointed out that there is a reset due to a firmware issue. I have a fix in mind for that, but it will take some time for me to figure out how to do it properly.

The second issue is one that I don't know exactly what is going on. Take a look at these two traces:

https://storage.cloud.google.com/chrome-telemetry-output/battor-serial-log-2017-06-01_00-15.txt-60405.txt

the next run:

https://storage.cloud.google.com/chrome-telemetry-output/battor-serial-log-2017-06-01_01-13.txt-12798.txt

The first trace ends in a TIMEOUT due to no data received (not even a RESET byte) from the BattOr. The second trace starts with sending an INIT, and interestingly enough, the reply is the clock sync reply from the prior run. The only explanation I have come up with so far is that a timeout happens immediately after the RecordClockSync is sent. Another possibility is that the read call isn't happening properly for some reason in the first run, then it works properly in the 2nd run.
The fact that we can't answer the questions that you just posed in the last comment is, in itself, probably a bug. I'm going to make two changes to the serial log:

1) Prefix each entry in the log with a timestamp
2) Add a serial log message for when the read is cancelled due to a timeout. Right now the only way to know that a serial read timed out is by knowing that a "Starting read of up to X bytes." without any subsequent message indicates a timeout.
Yes. I think both of these pieces of information would help us debug this issue and others.
Project Member

Comment 10 by bugdroid1@chromium.org, Jun 27 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/083590ad7a5f0189c27a068b4c222d61785cfb4e

commit 083590ad7a5f0189c27a068b4c222d61785cfb4e
Author: Charlie Andrews <charliea@google.com>
Date: Tue Jun 27 19:18:52 2017

Add timestamps and timeouts to the BattOr serial log

Bug:  728496 
Change-Id: Iccaf529534958223e7d4df240390f86e9c9ec818
Reviewed-on: https://chromium-review.googlesource.com/550321
Reviewed-by: rnephew <rnephew@chromium.org>
Commit-Queue: Charlie Andrews <charliea@chromium.org>
Cr-Commit-Position: refs/heads/master@{#482708}
[modify] https://crrev.com/083590ad7a5f0189c27a068b4c222d61785cfb4e/tools/battor_agent/battor_connection_impl.cc

Charlie, do we need to do something to update the battor_agents on the testbed to have the latest version of the src? I still don't see timestamps in the serial logs.

I would do this myself, but I don't know the status of the autobuilders.
Project Member

Comment 12 by bugdroid1@chromium.org, Jul 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/538f75da5a8e54f071ccacbf6d76c14e482686bf

commit 538f75da5a8e54f071ccacbf6d76c14e482686bf
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Mon Jul 10 15:55:48 2017

Roll src/third_party/catapult/ 9f7e1bcf9..72646aa58 (1 commit)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/9f7e1bcf98c3..72646aa580ee

$ git log 9f7e1bcf9..72646aa58 --date=short --no-merges --format='%ad %ae %s'
2017-07-10 charliea Upload the battor agent binary to the latest version

Created with:
  roll-dep src/third_party/catapult
BUG= 728496 


Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls


CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: Id17109882ae7c93dc517fda12a3c1bdbe9d7ceab
Reviewed-on: https://chromium-review.googlesource.com/565001
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#485282}
[modify] https://crrev.com/538f75da5a8e54f071ccacbf6d76c14e482686bf/DEPS

NextAction: 2017-07-31
Charlie and Randy. I just pushed a firmware fix to the BattOr reset due to RecordClockSync issue. I will put up a patch to upload the new firmware in a moment.

https://github.com/aschulm/battor/commit/3157c85d1bcffdc89692b429d05e284724a6908b

Let's set a NextAction to Monday to make sure this change has worked.
Thanks Aaron! We'll look for your patch.
Blocking: 737176
@charliea, I just realized I can't upload into the chromium-telemetry bucket anymore. Either you or Randy will have to submit the CL to upload the latest BattOr firmware.
The NextAction date has arrived: 2017-07-31
CL is in the commit queue!
Project Member

Comment 19 by bugdroid1@chromium.org, Aug 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5dc438aca416be722ae65ae09d7521c1ec48b4f4

commit 5dc438aca416be722ae65ae09d7521c1ec48b4f4
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Tue Aug 01 08:27:50 2017

Roll src/third_party/catapult/ 9e69c3388..4d5d88c3e (12 commits)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/9e69c338846a..4d5d88c3e04c

$ git log 9e69c3388..4d5d88c3e --date=short --no-merges --format='%ad %ae %s'
2017-07-31 benjhayden Fix object-instance-track heading.
2017-07-31 loloangela Remove pylintrc bad-continuation and fix resulting errors pt.42
2017-07-31 rnephew [Telemetry] Add ability to disable by mac os version number.
2017-07-31 loloangela Fixed errors related to bad-continuation pt.41
2017-07-31 xunjieli [wpr-go] Allow disabling script injection
2017-07-31 loloangela Fixed errors related to bad-continuation pt.40
2017-07-31 benjhayden Fix ObjectSnapshot.userFriendlyName.
2017-07-31 simonhatch Dashboard - Add entities to _INTERNAL_ONLY_KINDS in datastore_hooks
2017-07-31 loloangela Fixed errors related to bad-continuation pt.39
2017-07-31 sullivan Set bug status to 'Assigned' when assigning to culprit author.
2017-07-31 loloangela Fixed errors related to bad-continuation pt.38
2017-07-31 charliea Upload a new version of the BattOr agent firmware

Created with:
  roll-dep src/third_party/catapult
BUG=713222, 728496 


Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls


CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: If6aca002e8592ed8dc89391f6964933fab5bcea6
Reviewed-on: https://chromium-review.googlesource.com/595226
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#490917}
[modify] https://crrev.com/5dc438aca416be722ae65ae09d7521c1ec48b4f4/DEPS

Status: Fixed (was: Assigned)
Marking this as fixed. I'll reopen if I see the problem crop up again.

Sign in to add a comment