Start & stop battor tracing takes up to 30% for system health benchmark's total duration |
||||||||||
Issue descriptionIn example trace https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_60-2017-04-13_00-57-39-16058.html, start & stop battor takes about 30s. The total run time of the whole system_health.common_mobile benchmark on nexus5x-webview is 61 minutes, and start/stop collecting battor trace is taking 21 minute (the jump in https://chromeperf.appspot.com/report?masters=ChromiumPerf&bots=android-webview-nexus5X&tests=system_health.common_mobile%2Fbenchmark_duration&checked=benchmark_duration%2Cbenchmark_duration_ref%2Cref&rev=464059), which is 30% of the total benchmark runtime is too much. Can we make star/stop battor taking less time?
,
Apr 13 2017
,
Apr 13 2017
,
Apr 13 2017
Aaron, Chris, Ophira: any suggestions for how we could improve Android BattOr runtime?
,
Apr 25 2017
,
Apr 25 2017
A few questions to help me investigate this issue. From the trace nednguyen@ linked to, it looks like this "browse:media:youtube" benchmark is running for 1.25 minutes, not including the BattOr start/stop time. Is this correct? this seems fairly long compared to the other page benchmarks that I am familiar with. Also, is it fair to assume that the idle time before and after the power trace corresponds to the time that the BattOr is starting and stopping its trace? If that is correct, there seems to be 12.5 seconds spent in StartTracing and 20 seconds spent in StopTracing in the trace that you sent. Can you please provide a link to a run of this benchmark before the BattOr tracing was enabled so I can compare with the trace after the BattOr was enabled? For some reason I am not seeing links to the traces in the chromeperf graph. Even if I did have the traces, I actually can not tell which of the jumps in "benchmark_duration" is related to the BattOr in the chromeperf graph you sent. It seems until 04-11-17 the benchmark was taking 50 minutes, then it dropped to 0.5 minutes for some reason, then it jumped all the way to 67 minutes around 04-12-17, then it jumped even higher to 90 minutes on 04-18-17. Then it jumped back to close to 0 again, and now it's up to somewhere around 73 minutes. Was the BattOr enabled this entire time, or was it enabled when the time jumped to 67 minutes?
,
Apr 25 2017
,
Apr 25 2017
I can answer some of the questions. > it looks like this "browse:media:youtube" benchmark is running for 1.25 minutes Yes, this is typical for browse:* stories, which not only load the page but also interact with it for a bit (navigating to subpages, clicking on media to play, etc.) On the jumps around benchmark duration: A) "04-11-17 the benchmark was taking 50 minutes" - this was before BattOr's B) "then it dropped to 0.5 minutes for some reason" - this was when BattOr's got installed but not properly configured ( issue 710806 ). C) "it jumped all the way to 67 minutes around 04-12-17" - this is with BattOr's installed and working. D) "then it jumped even higher to 90 minutes on 04-18-17" - this is due to new stories being added to the benchmark ( issue 714635 ). So the difference between A and C, about 17 minutes, we think it's attributable to BattOr's being installed. The relevant section of the graph is thus: https://chromeperf.appspot.com/report?sid=2daea37d7898b04952908e49cf9b51b0d99aa1a6f17d62d0b50c5681d00d9cd3&start_rev=463003&end_rev=464667 With that in mind we can grab a trace without BattOr's: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_56-2017-04-10_09-45-48-93491.html And a trace with BattOr's: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_60-2017-04-13_00-57-39-16058.html
,
Apr 25 2017
(Ah btw, to get link to traces from chromeperf one has to pick a metric measured on a single page, not on summary metrics. We know this is not ideal, a bit confusing :-/)
,
May 3 2017
,
May 5 2017
I just put a CL up for review that will shave 4 seconds off of each trace (https://codereview.chromium.org/2859353003/). I'm looking for other opportunities to shave time. However, the bulk of the BattOr time is spent downloading the trace, and that has already been heavily optimized. Let's see how much this CL improves the benchmark duration metric. If it's not enough, then I'm happy to continue digging.
,
May 5 2017
,
May 31 2017
I just noticed another opportunity to speed things up. We don't need to call INIT before we read the GIT HASH (see https://cs.chromium.org/chromium/src/tools/battor_agent/battor_agent.cc?dr=CSs&l=191). This adds another 4 seconds per trace. Removing this extra INIT and with the RESET optimization described in my prior comment, we should be shaving a total of 8 seconds off of each trace.
,
May 31 2017
Good catch! I can go ahead and make that change.
,
Jun 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/1689c7e8fdfed4dc543d02e7743a3733be6b19e5 commit 1689c7e8fdfed4dc543d02e7743a3733be6b19e5 Author: Charlie Andrews <charliea@chromium.org> Date: Thu Jun 08 16:12:03 2017 Don't initialize the BattOr before retrieving the git hash I think this is safe, but it's not completely clear to me. We only recently started initializing the BattOr before retrieving the git hash (in https://crrev.com/2622003005). However, that change was attached to a bug that had to do with retrying while retrieving the git hash, which we now do. I can't remember if there was real rationale behind why we made the change or if we were just trying to see if that'd fix a problem. All in all, neither I nor Aaron can see any theoretical reason why we shouldn't be able to retrieve the git hash before an init, so I think it's worth trying again now that we have retry logic in place. This should cut the time to start tracing by 4 seconds, which is significant. TBR=aschulman@chromium.org Bug: 711277 Change-Id: Iaaf2a695551f4abbf1539ef047ebd48c224683c2 Reviewed-on: https://chromium-review.googlesource.com/527327 Commit-Queue: Charlie Andrews <charliea@chromium.org> Commit-Queue: Aaron Schulman <aschulman@chromium.org> Reviewed-by: Charlie Andrews <charliea@chromium.org> Cr-Commit-Position: refs/heads/master@{#477984} [modify] https://crrev.com/1689c7e8fdfed4dc543d02e7743a3733be6b19e5/tools/battor_agent/battor_agent.cc [modify] https://crrev.com/1689c7e8fdfed4dc543d02e7743a3733be6b19e5/tools/battor_agent/battor_agent_unittest.cc
,
Jun 11 2017
Note that although this change was supposed to be safe, it has seemed to introduce many new failures where the Git hash is not retrieved properly. The error is, "Fatal error when communicating with the BattOr: TOO MANY COMMAND RETRIES'" The interesting thing is this issue (like most issues with BattOr) only appears to be happening on Mac. Here is an example of a recent run: https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FMac_Retina_Perf%2F748%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.common_desktop_on_Intel_GPU_on_Mac_on_Mac-10.11%2F0%2Fstdout I did some local testing and I am unable to reproduce this issue, so we need to look at the serial traces from the testbed, however unfortunately the testbed doesn't seem to be writing serial logs when there is a failure while checking the Git Hash. You should also note that for many cases where is a failure checking the Git Hash, the dashboard still indicates that the test passed for some reason.
,
Jun 12 2017
I'm going to fix the problem by EOD about the BattOr wrapper not uploading the serial log when retrieving the git hash fails and then we can try to better understand why we're seeing TOO_MANY_COMMAND_RETRIES.
,
Jun 12 2017
Can you also see why these are not showing up as red or yellow failures on buildbot? I want to make sure we know if we fixed the problem when we do.
,
Jun 12 2017
I'm seeing all recent runs of system_health.common_desktop failing on Mac. Are these logs from system_health.commmon_desktop.reference? If so, in the not so distant past, we made a change that made reference builds always show up in green. The reasoning behind this is that there aren't any changes that we can make to Chromium to fix reference build failures (because it's a fixed version of Chrome from the past), and if it's an infrastructure failure, it's going to show up in the non-reference version anyway. In this case, it seems like that might be working as intended.
,
Jun 12 2017
ah ok, that's probably why I am confused.
,
Jul 20 2017
,
Aug 30 2017
Going to mark this as WontFix - I don't actually think that there's anything further we can do about this. It takes ~55s to download a 180s trace from the BattOr, assuming that everything goes right. That's about a 30% overhead, unfortunately. |
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by nedngu...@google.com
, Apr 13 2017