
Issue 640312

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocked on:
issue 647398

Blocking:
issue 640054




Implement TBMv2 version of thread_times metric.

Project Member Reported by erikc...@chromium.org, Aug 23 2016

Issue description

Ned suggested that I try out the thread_times measurement as a more granular metric for measuring CPU power consumption. See  Issue 640054  for more details.

I did this. The first thing I noticed is that if the user story doesn't have a page interaction, the measurement fails in TimelineController:
"""
    if len(self._smooth_records) == 0:
      raise legacy_page_test.Failure('No interaction record was created.')
"""

 
Cc: charliea@chromium.org tdres...@chromium.org panicker@chromium.org
That is because the measurement was originally created to measure the CPU usage of animations only. To mark animation spans in the browser, the pages in the page_set would use action_runner.CreateInteractionRecord('Scrolling') to mark the region the measurement is supposed to apply to.

In the long run, I think that it probably makes sense to have thread_times as a diagnostic for our other power metrics. That way, thread_times will help give context to changes in the power data (i.e. this regression follows a clear spike in thread times), but we're not paying attention to thread_times changes that don't affect power (which IMO we shouldn't be: at least not in power benchmarks).
What do you mean by "thread_times changes that don't affect power"? 
Sorry: looking back, that wasn't the best explanation.

In TBMv2, there are metrics, which are generally the things that you actually care about going up or down. From the power team's standpoint, power is the thing that we care about going up or down, and is thus the metric that we want to be alerting on.

However, in TBMv2, we also have the option of giving additional context to each data point in the metric through something called a diagnostic. A diagnostic is basically numerical metadata that you can add to a value that helps you better understand why it might have gone up or down.
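A toy sketch of the distinction (the real TBMv2 implementation uses Histogram and Diagnostic classes in catapult's tracing code; the dict shape here is purely illustrative):

```python
# One dashboard data point: the metric is the thing we alert on;
# diagnostics are numerical metadata attached to it for context only.
data_point = {
    'metric': 'average_power',      # what the power team cares about
    'value_mW': 15.0,
    'diagnostics': {
        # Helps explain *why* power moved; a thread_times change that
        # doesn't move power never fires an alert by itself.
        'thread_times_ms': 350.0,
    },
}
```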

What I'm saying is that, at least from the power standpoint, we don't actually care about whether thread_times increases unless it also causes energy consumption to increase. (Note: maybe other benchmarks should care, but the power benchmarks shouldn't.) Instead of treating thread_times as a metric independent of our power metrics, we should attach thread_times as a diagnostic to each power metric, to help provide information on why the power may have increased.

As far as the perf dashboard UI goes, this means that instead of having two separate metrics (thread times and average_power, for example) that you have to drag to the graph in the perf dashboard individually, there'd only be one metric, average_power. When you click on each data point in the average_power graph, you'd get something like:

Revision range: aaaaa-bbbbb
Average power: 15mW
(Thread times: 350ms)

The main reason for doing this is that our power metrics should give an accurate reflection of whether the state of power within Chrome is getting better or worse. If we managed to decrease thread times but increase average power by offloading more work to the graphics card, the power metrics should still treat that as a regression.

A couple of other candidates for diagnostics might be things like:

- GPU time
- Idle wakeups
- Screen brightness
- % of CPU occupied by non-Chrome processes at start of benchmark


Cc: brucedaw...@chromium.org
Thanks for the explanation. I agree we should add CPU time as a diagnostic to average power.

That being said, CPU time might prove worthwhile as a metric, since average_power is computed via BattOr, which measures all processes, whereas CPU time can be scoped to only the processes that we care about. This means that CPU time is much less susceptible to noise from other processes. 

It will be prone to incorrect diagnoses when we upload work from CPU to GPU, but that's a rather exceptional use case. Perhaps average_power should be a diagnostic for CPU time?
It looks like this task mostly involves porting a bunch of logic from tools/perf/metrics/timeline.py into TBMv2. I think this should be as straightforward as adding a new metric to third_party/catapult/tracing/tracing/metrics. I'll give that a shot and see what happens.
Project Member

Comment 7 by bugdroid1@chromium.org, Sep 10 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c4b1e75d9334ca63343d5e44768c246d796c8386

commit c4b1e75d9334ca63343d5e44768c246d796c8386
Author: nednguyen <nednguyen@google.com>
Date: Sat Sep 10 12:00:05 2016

Manually roll src/third_party/catapult/ da6d44e4a..c0b988891 (3 commits).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/da6d44e4a2a5..c0b988891fd1

$ git log da6d44e4a..c0b988891 --date=short --no-merges --format='%ad %ae %s'
2016-09-09 erikchen Add a new cpu time TBMv2 system health metric.
2016-09-09 nednguyen Update the lock operation in cloud_storage with better implementation using py_utils.lock
2016-09-09 rnephew [BattOr] Kill BattOr shell if there is a problem with communicating during clock sync.

BUG= 640312 , 637904 , 645106 , 645720 

TBR=catapult-sheriff@chromium.org, jbudorick@chromium.org

Review-Url: https://codereview.chromium.org/2326063004
Cr-Commit-Position: refs/heads/master@{#417833}

[modify] https://crrev.com/c4b1e75d9334ca63343d5e44768c246d796c8386/DEPS
[modify] https://crrev.com/c4b1e75d9334ca63343d5e44768c246d796c8386/build/android/test_runner.pydeps

nednguyen: I put together a CL to deal with "overhead" slices: https://codereview.chromium.org/2334863002/

However, it looks like no overhead slices are emitted by Chrome tracing. Maybe this should be removed?
https://cs.chromium.org/search/?q=trace_event_overhead&sq=package:chromium&type=cs

https://cs.chromium.org/chromium/src/tools/perf/metrics/timeline.py?q=trace_event_overhead&sq=package:chromium&dr=C&l=99
Summary: Implement TBMv2 version of thread_times metric. (was: Investigate the use of thread_times measurement for Chrome Mac CPU power usage.)
Tried to enable the metric: https://codereview.chromium.org/2338433002/
Running into issues where Telemetry/tracing can't handle the size of the output trace.

> Same issue:
> https://build.chromium.org/p/tryserver.chromium.perf/builders/mac_retina_perf_cq/builds/499/steps/battor.steady_state/logs/stdio
> """
> INFO:root:Trace (TraceDataPart("cpuSnapshots")) of size 25539 bytes saved.
> INFO:root:Trace (TraceDataPart("tabIds")) of size 40 bytes saved.
> INFO:root:Trace (TraceDataPart("telemetry")) of size 22057 bytes saved.
> INFO:root:Trace (TraceDataPart("traceEvents")) of size 330205233 bytes saved.
> [ RUN      ] /tmp/tmp7WE6SW.html
> TraceImportError: Invalid string length
> """


Blockedon: 647398
Project Member

Comment 11 by bugdroid1@chromium.org, Sep 20 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/68b995c4d3eb37217ce143c225a7d2eed81d320a

commit 68b995c4d3eb37217ce143c225a7d2eed81d320a
Author: erikchen <erikchen@chromium.org>
Date: Tue Sep 20 01:23:15 2016

Add cpuTimeMetric to BattOr benchmark.

BUG= 640312 
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.perf:android_s5_perf_cq;master.tryserver.chromium.perf:linux_perf_cq;master.tryserver.chromium.perf:mac_retina_perf_cq;master.tryserver.chromium.perf:winx64_10_perf_cq

Review-Url: https://codereview.chromium.org/2338433002
Cr-Commit-Position: refs/heads/master@{#419624}

[modify] https://crrev.com/68b995c4d3eb37217ce143c225a7d2eed81d320a/tools/perf/benchmarks/battor.py
[modify] https://crrev.com/68b995c4d3eb37217ce143c225a7d2eed81d320a/tools/perf/page_sets/idle_after_loading_stories.py

I have a CL in progress that refactors the cpu time metric, and fixes some small logic issues. https://codereview.chromium.org/2351963006/

There are a couple more changes that need to happen:
  * Only consider slices that come from Chrome processes. Note that chrome_model_helper only knows how to detect Renderer, Browser, and GPU processes.
  * Ignore async slices.
  * For total "trace" duration, only consider slices that come from Chrome processes.
Project Member

Comment 13 by bugdroid1@chromium.org, Sep 24 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/175242c00bc4e762699f2b4fe28e667d672f65a3

commit 175242c00bc4e762699f2b4fe28e667d672f65a3
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Fri Sep 23 23:58:41 2016

Roll src/third_party/catapult/ 7bd10eda4..a8deb272b (1 commit).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/7bd10eda47f1..a8deb272b550

$ git log 7bd10eda4..a8deb272b --date=short --no-merges --format='%ad %ae %s'
2016-09-23 erikchen Add range of interest support to cpuTimeMetric.

BUG= 640312 

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2368723002
Cr-Commit-Position: refs/heads/master@{#420793}

[modify] https://crrev.com/175242c00bc4e762699f2b4fe28e667d672f65a3/DEPS

Cc: primiano@chromium.org benjhayden@chromium.org
My first thought for c#12 was that we could emit an origin:chrome metadata parameter for each Chrome process. But doing some digging around, it looks like we dynamically construct the process object by glomming together slices, and then assign the name field by using the existing *Helper classes and looking for specific signatures.

I chatted with benjhayden@, and he suggested that processes have a parent pid, so I could trace everything back to the root process. This assumes a tree hierarchy, with the browser as root, but seems like it could work. He suggested I cc primiano.
Specifically, I wondered if per-process TracingControllers could log ppid in metadata events.
Reconstructing the process tree would require that all of Chrome's processes run a TracingController, not just the browser and renderer processes, and that all of the per-process metadata events would be preserved into the log instead of rotated out of the ring buffer.
Now that I think about it, though, logging pids is a security no-no, whereas a simple origin:chrome tag would be innocuous. Hm.

What about coming at it from the other direction? Could you filter out events from processes that are known to not be chrome? Are the non-chrome tracing agents more easily enumerable than chrome processes?
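The ppid-based reconstruction discussed above might look something like this (hypothetical metadata shape; real trace metadata events differ, and this assumes every process logged a ppid):

```python
def build_ancestry(metadata_events):
    """Map each pid to its parent pid from logged metadata events.

    metadata_events: iterable of dicts like {'pid': 2, 'ppid': 1}.
    (Hypothetical shape -- illustrative only.)
    """
    return {e['pid']: e['ppid'] for e in metadata_events}

def descends_from(pid, root, ppid_of):
    # Walk parent links upward; True if root is pid or an ancestor of it.
    while pid is not None:
        if pid == root:
            return True
        pid = ppid_of.get(pid)
    return False

# Browser (pid 1) spawned renderers 2 and 3; pid 99 is unrelated.
ppid_of = build_ancestry([{'pid': 2, 'ppid': 1},
                          {'pid': 3, 'ppid': 1},
                          {'pid': 99, 'ppid': 50}])
```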
I would love to see the model_helpers/ get some attention.

Definitely talk to primiano about what TracingController can do to help here. :-)

> Now that I think about it, though, logging pids is a security no-no, whereas a simple origin:chrome tag would be innocuous. Hm.

I think we already do that, no? each individual trace event has the pid.
Definitely supportive to add some metadata to tell deterministically (without magic guesses) which process is which. I guess I just miss a bit of context (didn't have time to read this full bug yet). Can we talk about this in the upcoming PI meeting? (sorry, bit out of bandwidth this week)
Right, the current process's pid is logged, but I thought that it would be bad form to log another process's pid. But you're right: if malicious code is injected, it wouldn't need to look in the trace log, it can just call ppid() itself if it wants. Maybe I'm overcomplicating it; maybe logging ppid and reconstructing the process tree would work.

The Performance Instrumentation and Modelling weekly on Thursday, right? (I think all perf-insights (the old PI) is gone. :-) Sure, I'll talk to fmeawad about putting it on the agenda.
Cc: fmea...@chromium.org
The question I am trying to answer is: What is the CPU usage of Chrome in a trace.

More generally: Chrome does too much [unnecessary] work. I want a metric that can capture the amount of work that Chrome is doing, so that we can measure this and drive it down. If all Chrome work is captured by task schedulers, and we log the cpu time for each top level task, this should be relatively easy to compute. This information is contained in a trace that includes the "toplevel" category, but there are several points that need fixing.

1) Traces can include slices from other sources [systrace, telemetry, etc.]. We don't want to count this. Note that we don't want to count systrace since that would double-count work. 

2) When looking at some real traces, I noticed that there are slices with the "toplevel" label that are nested in other slices. Maybe this isn't the right label to be using. I guess another solution is to add another label to tasks posted by task schedulers, and only count slices that have that special label.

3) I wonder how we measure CPU duration vs slice duration. e.g. if a task spends a long time waiting on disk [but CPU is idle], slice duration should be long and CPU duration should be low. Is this actually the case?
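A minimal sketch of the computation described above (illustrative slice dicts; the real metric operates on catapult's trace model, and the field names here are made up):

```python
def chrome_cpu_usage(slices, trace_duration_ms):
    """Fraction of the trace Chrome spent on the CPU: summed CPU time of
    top-level 'toplevel' task slices, normalized by trace duration."""
    busy_ms = sum(
        s['cpu_duration_ms']
        for s in slices
        if s['category'] == 'toplevel'      # point 1: skip other sources
        and s['is_top_level']               # point 2: skip nested slices
        and 'cpu_duration_ms' in s)         # point 3: CPU time, not wall time
    return busy_ms / trace_duration_ms

slices = [
    {'category': 'toplevel', 'is_top_level': True, 'cpu_duration_ms': 30.0},
    # Nested slice that also carries the 'toplevel' label: ignored.
    {'category': 'toplevel', 'is_top_level': False, 'cpu_duration_ms': 5.0},
    # Slice from another source (e.g. systrace) with no CPU time: ignored.
    {'category': 'kernel', 'is_top_level': True},
]
```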
Re #19, the other tricky thing is dealing with being descheduled. We may want to know CPU duration vs slice duration vs self duration.
#19

1) How would counting systrace slices double-count cpu activity? Is there a bug with cpuSelfTime?

Do we need to change systrace somehow, like make it set the pid field of its events to its own pid, and use a different field to indicate the pid of the target process? I am not very familiar with systrace, but if it's doing something wrong then we can fix it, right?

2) Can you filter by slice title as well as category?

3) I believe so, yes. Slice.cpuDuration and Slice.cpuSelfTime should handle that case, unless you've found clear indication of a specific bug.
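The wall-time vs CPU-time distinction in point 3, as a toy illustration (field names are made up for this example):

```python
# A top-level task that spends most of its wall time blocked on disk:
slice_ = {
    'duration_ms': 100.0,      # wall-clock span: slice start to slice end
    'cpu_duration_ms': 5.0,    # time actually scheduled on a CPU
}
# The 95 ms gap is waiting (disk, descheduling), not CPU work, so a
# CPU-usage metric should sum cpu_duration_ms rather than duration_ms.
waiting_ms = slice_['duration_ms'] - slice_['cpu_duration_ms']
```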
Just chatted with benjhayden, fmeawad, nednguyen.

* Only Chrome ever emits the cpuDuration property, so in the short term, that's an acceptable way of distinguishing whether a slice comes from Chrome.

* For normalization time, we can find the browser process, so we can find the first/last slices there, and that will allow us to not be egregiously off with normalization.
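A sketch of the two bullets above (illustrative data shapes; the real metric walks catapult's trace model):

```python
def is_chrome_slice(s):
    # Short term: only Chrome emits cpuDuration, so its presence is an
    # acceptable signal that a slice came from a Chrome process.
    return 'cpu_duration_ms' in s

def normalization_bounds(processes):
    """Use the browser process's first/last slice times as the
    normalization window, rather than the whole trace's bounds."""
    browser = next(p for p in processes if p['name'] == 'Browser')
    start = min(s['start_ms'] for s in browser['slices'])
    end = max(s['end_ms'] for s in browser['slices'])
    return start, end

processes = [
    {'name': 'Browser',
     'slices': [{'start_ms': 10.0, 'end_ms': 20.0, 'cpu_duration_ms': 5.0},
                {'start_ms': 90.0, 'end_ms': 95.0, 'cpu_duration_ms': 2.0}]},
    {'name': 'systrace',   # non-Chrome: its slices carry no CPU time
     'slices': [{'start_ms': 0.0, 'end_ms': 200.0}]},
]
```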
Status: Fixed (was: Assigned)
I was thinking about the second bullet point in c#22. I no longer think it's a good idea. Doing so assumes that there is a browser process in a system that is otherwise Chrome-agnostic. This has potential failure conditions, just like the existing behavior, which I think is "good enough".

The thread_times metric looks pretty stable, so I'm going to close this bug. Feel free to reopen if you think more discussion is necessary.
Status: Available (was: Fixed)
Erik: I disagree with your assessment in #23. To be precise, the metric here is intended to be "CPU time usage of Chrome". What we are computing at the moment is the total CPU time of everything in the traces, normalized by the trace's total time.

It just happens to approximate the CPU usage well enough because:
1) Our combined trace currently only has CPU time data on Chrome's events.
2) Our combined trace's begin & end times are not too far off from when Chrome actually starts & ends.

We won't know when conditions 1) & 2) will break, so I think the metric's implementation should not be considered complete.
Status: Assigned (was: Available)
I've been doing some CPU usage testing of Chrome and Edge during media playback, and one thing this reminds me of is that CPU usage caused by Chrome is not entirely in chrome.exe.

When playing back h264 video in Chrome on a SurfaceBook, I see three Chrome processes that use 4.65%, 4.40% and 3.25% of CPU time. I also see dwm.exe (desktop compositing) using 0.80% and audiodg.exe (audio mixing) also using 0.80%. The dwm.exe and audiodg.exe CPU time is triggered by Chrome and is therefore a real part of the cost of playing video. I also see a small amount of CPU time in the System process - it is possible for Chrome to trigger significant work there. Sometimes the biggest CPU optimizations actually show up as reductions in these other processes.

In some cases changes in how we handle video, audio, and networking will change the CPU usage in these other processes, and ideally we would account for this.

One way to account for all of these costs would be to monitor idle time instead of Chrome.exe time. This has the advantage of automatically accounting for everything. It also has the disadvantage of being more sensitive to noise caused by random Windows processes that spring to life for random Windows reasons.

Something to consider.
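The two accounting approaches above, sketched with the SurfaceBook numbers from this comment (the idle_pct value is hypothetical):

```python
# Approach 1: sum CPU of the Chrome processes only -- misses work Chrome
# triggers in dwm.exe, audiodg.exe, and the System process.
chrome_only = sum([4.65, 4.40, 3.25])    # the three Chrome processes

# Approach 2: monitor idle time instead -- everything that isn't idle is
# charged to the scenario, so triggered work is captured automatically,
# but so is noise from unrelated Windows processes.
idle_pct = 85.0                          # hypothetical measured reading
via_idle = 100.0 - idle_pct
```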

Status: Fixed (was: Assigned)
CL landed: https://codereview.chromium.org/2446423009/

brucedawson: We do have existing mechanisms to monitor CPU usage of all processes on Mac, although the values aren't summed. I believe the telemetry team is actively working on tracking down and getting rid of random processes that start themselves while we're running tests.

This bug tracks TBMv2 thread_times, which by definition uses data from tracing.
Project Member

Comment 28 by bugdroid1@chromium.org, Nov 1 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/25ae60df508ec4ae305ba3b5610793af69e5a0be

commit 25ae60df508ec4ae305ba3b5610793af69e5a0be
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Tue Nov 01 01:03:35 2016

Roll src/third_party/catapult/ 3c6de14e2..b28a7dab0 (2 commits).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/3c6de14e28c7..b28a7dab0434

$ git log 3c6de14e2..b28a7dab0 --date=short --no-merges --format='%ad %ae %s'
2016-10-31 erikchen Normalize thread_times metric against the bounds of the browser process.
2016-10-31 eakuefner [Telemetry] Add support for macOS Sierra

BUG= 640312 

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2463743004
Cr-Commit-Position: refs/heads/master@{#428893}

[modify] https://crrev.com/25ae60df508ec4ae305ba3b5610793af69e5a0be/DEPS

Labels: -Pri-3 Pri-2
Status: Assigned (was: Fixed)
I was trying to use the thread_times metric on a real trace:
https://bugs.chromium.org/p/chromium/issues/detail?id=646211#c6

The problem I'm having is that the thread_times metric includes threads from the chrome://tracing process, which is very CPU intensive. As a short-term solution, I propose checking the process's labels, i.e.
if (process.labels[0] == "chrome://tracing")

I imagine we'll want a better long term solution. benjhayden@, nednguyen@: does this sound okay in the short term?
Erik: I forgot to mention that we have rendererHelper.isChromeTracingUI method (https://github.com/catapult-project/catapult/blob/master/tracing/tracing/metrics/system_health/loading_metric.html#L426)
Ha. It would appear someone beat me to it:
"""
    this.isChromeTracingUI_ =
        process.labels !== undefined &&
        process.labels.length === 1 &&
        process.labels[0] === 'chrome://tracing';
"""
Project Member

Comment 32 by bugdroid1@chromium.org, Nov 5 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c9da801622a97bfa43a367741e6bf666e39e2b1e

commit c9da801622a97bfa43a367741e6bf666e39e2b1e
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Sat Nov 05 01:54:43 2016

Roll src/third_party/catapult/ 22b8550d6..3660f3bfa (2 commits).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/22b8550d67b9..3660f3bfa20a

$ git log 22b8550d6..3660f3bfa --date=short --no-merges --format='%ad %ae %s'
2016-11-04 erikchen Ignore tracing process for TBMv2 cpu time metric.
2016-11-04 lpy Keep selected field inside diagnostics table for consistency.

BUG= 640312 

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2475373003
Cr-Commit-Position: refs/heads/master@{#430124}

[modify] https://crrev.com/c9da801622a97bfa43a367741e6bf666e39e2b1e/DEPS

Status: Fixed (was: Assigned)
