New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 711277 link

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Sign in to add a comment

Start & stop battor tracing takes up to 30% for system health benchmark's total duration

Project Member Reported by nedngu...@google.com, Apr 13 2017

Issue description

In example trace https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_60-2017-04-13_00-57-39-16058.html, start & stop battor takes about 30s.

The total run time of the whole system_health.common_mobile benchmark on nexus5x-webview is 61 minutes, and start/stop collecting battor trace is taking 21 minute (the jump in https://chromeperf.appspot.com/report?masters=ChromiumPerf&bots=android-webview-nexus5X&tests=system_health.common_mobile%2Fbenchmark_duration&checked=benchmark_duration%2Cbenchmark_duration_ref%2Cref&rev=464059), which is 30% of the total benchmark runtime is too much.

Can we make star/stop battor taking less time?
 
Summary: Start & stop battor tracing took up to 30% for system health benchmark's total duration (was: Start & stop battor tracing took up to 30s for system health stories)
Summary: Start & stop battor tracing takes up to 30% for system health benchmark's total duration (was: Start & stop battor tracing took up to 30% for system health benchmark's total duration)
Cc: aschulman@chromium.org testac@google.com oshalev@google.com
Aaron, Chris, Ophira: any suggestions for how we could improve Android BattOr runtime?
Cc: nedngu...@google.com perezju@chromium.org
 Issue 714643  has been merged into this issue.
Owner: aschulman@chromium.org
A few questions to help me investigate this issue. From the trace nednguyen@ linked to, it looks like this "browse:media:youtube" benchmark is running for 1.25 minutes, not including the BattOr start/stop time. Is this correct? this seems fairly long compared to the other page benchmarks that I am familiar with.

Also, is it fair to assume that the idle time before and after the power trace corresponds to the time that the BattOr is starting and stopping its trace? If that is correct, there seems to be 12.5 seconds spent in StartTracing and 20 seconds spent in StopTracing in the trace that you sent.

Can you please provide a link to a run of this benchmark before the BattOr tracing was enabled so I can compare with the trace after the BattOr was enabled? For some reason I am not seeing links to the traces in the chromeperf graph. Even if I did have the traces, I actually can not tell which of the jumps in "benchmark_duration" is related to the BattOr in the chromeperf graph you sent. It seems until 04-11-17 the benchmark was taking 50 minutes, then it dropped to 0.5 minutes for some reason, then it jumped all the way to 67 minutes around 04-12-17, then it jumped even higher to 90 minutes on 04-18-17. Then it jumped back to close to 0 again, and now it's up to somewhere around 73 minutes. Was the BattOr enabled this entire time, or was it enabled when the time jumped to 67 minutes?
Cc: -testac@google.com -oshalev@google.com oph...@mellowresearch.com tes...@mellowresearch.com
I can answer some of the questions.

> it looks like this "browse:media:youtube" benchmark is running for 1.25 minutes

Yes, this is typical for browse:* stories, which not only load the page but also interact with it for a bit (navigating to subpages, clicking on media to play, etc.)

On the jumps around benchmark duration:

A) "04-11-17 the benchmark was taking 50 minutes" - this was before BattOr's
B) "then it dropped to 0.5 minutes for some reason" - this was when BattOr's got installed but not properly configured ( issue 710806 ).
C) "it jumped all the way to 67 minutes around 04-12-17" - this is with BattOr's installed and working.
D) "then it jumped even higher to 90 minutes on 04-18-17" - this is due to new stories being added to the benchmark ( issue 714635 ).

So the difference between A and C, about 17 minutes, we think it's attributable to BattOr's being installed.

The relevant section of the graph is thus:
https://chromeperf.appspot.com/report?sid=2daea37d7898b04952908e49cf9b51b0d99aa1a6f17d62d0b50c5681d00d9cd3&start_rev=463003&end_rev=464667

With that in mind we can grab a trace without BattOr's:
https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_56-2017-04-10_09-45-48-93491.html

And a trace with BattOr's:
https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_60-2017-04-13_00-57-39-16058.html
(Ah btw, to get link to traces from chromeperf one has to pick a metric measured on a single page, not on summary metrics. We know this is not ideal, a bit confusing :-/)
Status: Assigned (was: Untriaged)
I just put a CL up for review that will shave 4 seconds off of each trace (https://codereview.chromium.org/2859353003/).

I'm looking for other opportunities to shave time. However, the bulk of the BattOr time is spent downloading the trace, and that has already been heavily optimized.

Let's see how much this CL improves the benchmark duration metric. If it's not enough, then I'm happy to continue digging.
Cc: -tes...@mellowresearch.com te...@mellowresearch.com
I just noticed another opportunity to speed things up. We don't need to call INIT before we read the GIT HASH (see https://cs.chromium.org/chromium/src/tools/battor_agent/battor_agent.cc?dr=CSs&l=191). This adds another 4 seconds per trace.

Removing this extra INIT and with the RESET optimization described in my prior comment, we should be shaving a total of 8 seconds off of each trace.
Owner: charliea@chromium.org
Good catch! I can go ahead and make that change.
Project Member

Comment 15 by bugdroid1@chromium.org, Jun 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/1689c7e8fdfed4dc543d02e7743a3733be6b19e5

commit 1689c7e8fdfed4dc543d02e7743a3733be6b19e5
Author: Charlie Andrews <charliea@chromium.org>
Date: Thu Jun 08 16:12:03 2017

Don't initialize the BattOr before retrieving the git hash

I think this is safe, but it's not completely clear to me. We only
recently started initializing the BattOr before retrieving the git hash
(in https://crrev.com/2622003005). However, that change was attached
to a bug that had to do with retrying while retrieving the git hash,
which we now do. I can't remember if there was real rationale behind
why we made the change or if we were just trying to see if that'd fix
a problem.

All in all, neither I nor Aaron can see any theoretical reason why we
shouldn't be able to retrieve the git hash before an init, so I think
it's worth trying again now that we have retry logic in place. This
should cut the time to start tracing by 4 seconds, which is significant.

TBR=aschulman@chromium.org

Bug:  711277 
Change-Id: Iaaf2a695551f4abbf1539ef047ebd48c224683c2
Reviewed-on: https://chromium-review.googlesource.com/527327
Commit-Queue: Charlie Andrews <charliea@chromium.org>
Commit-Queue: Aaron Schulman <aschulman@chromium.org>
Reviewed-by: Charlie Andrews <charliea@chromium.org>
Cr-Commit-Position: refs/heads/master@{#477984}
[modify] https://crrev.com/1689c7e8fdfed4dc543d02e7743a3733be6b19e5/tools/battor_agent/battor_agent.cc
[modify] https://crrev.com/1689c7e8fdfed4dc543d02e7743a3733be6b19e5/tools/battor_agent/battor_agent_unittest.cc

Note that although this change was supposed to be safe, it has seemed to introduce many new failures where the Git hash is not retrieved properly. The error is, "Fatal error when communicating with the BattOr: TOO MANY COMMAND RETRIES'" The interesting thing is this issue (like most issues with BattOr) only appears to be happening on Mac. Here is an example of a recent run:

https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FMac_Retina_Perf%2F748%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.common_desktop_on_Intel_GPU_on_Mac_on_Mac-10.11%2F0%2Fstdout

I did some local testing and I am unable to reproduce this issue, so we need to look at the serial traces from the testbed, however unfortunately the testbed doesn't seem to be writing serial logs when there is a failure while checking the Git Hash.

You should also note that for many cases where is a failure checking the Git Hash, the dashboard still indicates that the test passed for some reason.
I'm going to fix the problem by EOD about the BattOr wrapper not uploading the serial log when retrieving the git hash fails and then we can try to better understand why we're seeing TOO_MANY_COMMAND_RETRIES.
Can you also see why these are not showing up as red or yellow failures on buildbot? I want to make sure we know if we fixed the problem when we do.
I'm seeing all recent runs of system_health.common_desktop failing on Mac. Are these logs from system_health.commmon_desktop.reference? If so, in the not so distant past, we made a change that made reference builds always show up in green. The reasoning behind this is that there aren't any changes that we can make to Chromium to fix reference build failures (because it's a fixed version of Chrome from the past), and if it's an infrastructure failure, it's going to show up in the non-reference version anyway. In this case, it seems like that might be working as intended.
ah ok, that's probably why I am confused.
Project Member

Comment 21 by sheriffbot@chromium.org, Jul 20 2017

Labels: Hotlist-Google
Status: WontFix (was: Assigned)
Going to mark this as WontFix - I don't actually think that there's anything further we can do about this. It takes ~55s to download a 180s trace from the BattOr, assuming that everything goes right. That's about a 30% overhead, unfortunately.

Sign in to add a comment