Issue 616919

Starred by 6 users

Issue metadata

Status: Archived
Owner:
Closed: Feb 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug
cwp




~3% of all CPU cycles spent in i915_gem_execbuffer2 on devices with Braswell CPUs

Project Member Reported by chongjiang@chromium.org, Jun 2 2016

Issue description

Chrome Version: 46-50, stable
Chrome OS Version: various
Chrome OS Platform: cyan, celes, reks, terra, ultima, edgar (Braswell), and possibly chell (Skylake-Y)
Network info: various

Steps To Reproduce:
(1) Have an idle device with at least one tab open, or have a busy device, e.g. playing a 4K video on youtube.com,
(2) Over SSH, run `sudo perf top` or `perf record -g -a`.

Expected Result:
i915_gem_execbuffer2 takes ~0.5% of all CPU cycles (inclusive),
(.*force_wake_mt_get|.*force_wake_get|fw_domains_get) takes ~0% of all CPU cycles.


Actual Result:
i915_gem_execbuffer2 takes ~3% of all CPU cycles (inclusive),
fw_domains_get takes 1.5% of all CPU cycles.
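The grouped figures above can be derived from a flat profile by summing the percentages of every symbol whose name matches the pattern. The following is a minimal sketch, assuming a hypothetical input of (percentage, symbol) pairs; `sum_matching` is a name introduced here, and the sample rows are made up for illustration rather than measured data.

```python
import re

# Pattern quoted from the report: the force-wake helpers under their various names.
FORCE_WAKE = re.compile(r"(.*force_wake_mt_get|.*force_wake_get|fw_domains_get)")


def sum_matching(rows, pattern):
    """Sum the percentages of all (percent, symbol) rows whose symbol fully matches pattern."""
    return sum(pct for pct, sym in rows if pattern.fullmatch(sym))


# Illustrative rows only: the percentages and the exact symbol names are assumptions.
sample = [
    (1.25, "fw_domains_get"),
    (0.25, "gen6_gt_force_wake_get"),
    (0.50, "i915_gem_execbuffer2"),
    (2.00, "__vdso_clock_gettime"),
]

total = sum_matching(sample, FORCE_WAKE)
print(f"force-wake share: {total:.2f}%")
```

The same helper works for any other symbol group, for example `re.compile(r"i915_.*")` for the driver as a whole.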

How frequently does this problem reproduce? (Always, sometimes, hard to reproduce?)
This problem appears consistently within aggregate data collected via ChromeOS-wide-profiling, spanning multiple devices and Chrome stable versions.

What is the impact to the user, and is there a workaround? If so, what is it?
Larger than expected CPU usage. I am unaware of any workaround.

Please provide any additional information below. Attach a screen shot or log if possible.

For comparison, the following list shows the average share of CPU time spent in i915_gem_execbuffer2 for Chrome 50 stable, broken down by uarch:

Skylake-Y: 3%
Airmont: 3.4%
Silvermont: 0.5%
Broadwell: 0.6%
max(all other x86 uarchs): 0.5%

And similarly for (.*force_wake_mt_get|.*force_wake_get|fw_domains_get):

Skylake-Y: 1%
Airmont: 2.2%
Silvermont: 0.01%
Broadwell: 0.02%
max(all other x86 uarchs): 0.01%


Example profiled callgraphs (percentages shown are conditional on i915_gem_execbuffer2), generated on a smaller subset of the data, are also attached.
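Because the callgraph percentages are conditional on i915_gem_execbuffer2, they have to be scaled by that function's own share to be compared with the "all CPU cycles" figures. A minimal sketch of the conversion follows; `absolute_share` is a name introduced here, and the 50% child value is a made-up number, not one read from the attached callgraphs.

```python
def absolute_share(parent_pct, conditional_pct):
    """Convert a percentage conditional on a parent into a percentage of all CPU cycles."""
    return parent_pct * conditional_pct / 100.0


airmont_execbuffer2 = 3.4  # % of all cycles, Airmont figure from the Chrome 50 list
hypothetical_child = 50.0  # % of the parent's samples; an assumed value for illustration

share = absolute_share(airmont_execbuffer2, hypothetical_child)
print(f"{share:.2f}% of all CPU cycles")
```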

Note these are aggregated samples across multiple devices over the last 1.5 months, so the usual caveats of different sample sizes and workloads apply. Nonetheless, the pattern has been consistent since Chrome 46 stable on Airmont (the earliest data we have for that uarch). Skylake-Y has just shown up in Chrome 50 stable and has few reports so far, so its inclusion here may be due to variance alone.
 
airmont.svg (132 KB)
broadwell.svg (154 KB)
silvermont.svg (150 KB)
skylake.svg (145 KB)
Now with more data, the average for Skylake for Chrome 50 stable is 1.5%. This suggests that it too is an outlier, though not to the extent that Airmont is.

Comment 2 by snanda@chromium.org, Jun 22 2016

Cc: snanda@chromium.org
Owner: marc...@chromium.org
marcheu is probably a better owner of this than I am.
Well, I think it is reasonable that the graphics driver consumes 3% of CPU cycles while rendering is going on. I don't think this is something that can be improved directly.
To clarify, the number above is not counting everything within i915, just a particular function that is much hotter in Airmont (and in Skylake, to a lesser extent) than in other uarchs.

That said, the total percentages for i915_.* are:

Airmont: 5%
Skylake: 2.5%
max(all other x86 uarchs): 1.5%

To me, this seems like a significant discrepancy that warrants further investigation.

Comment 5 by gs0...@gmail.com, Jul 5 2016

I assessed this today on the celes and rambi devices I have on hand.

When the system is idle, I see a high share (2~5%) of 'fw_domains_get' on celes but not on rambi, even after aligning rambi to the 3.18 kernel. Conversely, celes shows no significant improvement when I hack its kernel to upstream v4.6.

When the system is busy (following the steps above, I played a 2160p fireplace clip), the share drops, as shown in the attached perf top and top screenshots.

I am also working on issue 611896. I tried to work around the high share of '__vdso_clock_gettime', but that made little difference to the 'fw_domains_get' share in either the idle or the busy case above.

celes_perf_top.png (170 KB)
celes_top.png (175 KB)
fireplace_4k.png (282 KB)

Comment 6 by gs0...@gmail.com, Jul 6 2016

In total I assessed 4 machines: rambi, celes, lulu, and chell. The CPU models are as follows:
BYT: N2840
BSW: N3050
BDW: i3-5005U
SKL: m5-6Y57

Summary:

 - BSW/BDW/SKL support VP9 hardware acceleration according to the 'vainfo' tool; BYT (rambi) does not.
 - Like BSW (celes), SKL (chell) shows a higher share of 'fw_domains_get' in 'perf top' when the system is idle, about 1.5~4%.
 - Comparing BSW and BYT in the busy case: for UHD 2160p VP9 playback, BSW has lower CPU usage than BYT, almost half. For FHD 1080p VP9 playback, both looked smooth and BSW still has lower CPU usage than BYT, about 30% vs 50%. These figures were observed with the 'top' command.
 - UHD 2160p VP9 playback is much more demanding on the BYT/BSW group than on the BDW/SKL group.
I re-examined this on more recent CWP data, and the data is very consistent with before. To reiterate, this appears to be a significant regression, rather than an opportunity to improve rendering performance.

Namely, as of M54, the percentage of all CPU time spent in i915_gem_execbuffer2, broken down by uarch (all boards within a uarch share a kernel version):

Skylake (kernel 3.18): 1.5%
Airmont (kernel 3.18): 4.3%
Haswell (kernel 3.8): 0.6%
Silvermont (kernel 3.10): 0.5%
Broadwell (kernel 3.14): 0.5%
max(all other x86 uarchs): 0.4%

The difference when comparing Broadwell to Skylake, and Silvermont to Airmont, is significant both in absolute terms (1%, 3.8% of all CPU cycles) and in relative terms (3x, 8x regression).
Given that the Airmont and Skylake boards are the only ones on kernel 3.18, this seems to be evidence of a kernel issue.
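The absolute and relative differences quoted above can be reproduced directly from the M54 list. This is a small worked sketch; the `compare` helper is a name introduced here, and the pairing of each newer uarch with its predecessor follows the comparison made in the text.

```python
# Percent of all CPU cycles in i915_gem_execbuffer2 as of M54, copied from the list above.
m54 = {"Skylake": 1.5, "Airmont": 4.3, "Silvermont": 0.5, "Broadwell": 0.5}


def compare(newer, older, data):
    """Return (difference in percentage points, ratio) of newer relative to older."""
    return data[newer] - data[older], data[newer] / data[older]


results = {}
for newer, older in [("Skylake", "Broadwell"), ("Airmont", "Silvermont")]:
    diff, ratio = compare(newer, older, m54)
    results[newer] = (diff, ratio)
    print(f"{newer} vs {older}: +{diff:.1f} points, {ratio:.1f}x")
```

With these inputs the Airmont ratio comes out at about 8.6x, which the text rounds down to 8x.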


I also broke this down further for the 3 Skylake boards:
chell (Skylake-Y): 1.5%
lars (Skylake-U): 1.5%
sentry (Skylake-U): 1.9%

These exhibit similar behavior, despite the much higher resolution on chell. This suggests a cause other than general rendering performance.
Status: Fixed (was: Unconfirmed)
that's probably fixed by https://chromium-review.googlesource.com/#/c/439433/
IIUC, that patch should have landed in M58 beta, with the first push to Airmont boards on 2017-03-21.

Based on the preliminary data CWP has collected on M58 beta so far, I am not seeing any reduction of time spent in i915_gem_execbuffer2 on Airmont boards.

Comment 10 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 11 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 13 by dchan@chromium.org, Oct 14 2017

Status: Archived (was: Fixed)
