~3% of all CPU cycles spent in i915_gem_execbuffer2 on devices with Braswell CPUs
Issue description

Chrome Version: 46-50, stable
Chrome OS Version: various
Chrome OS Platform: cyan, celes, reks, terra, ultima, edgar (Braswell), and possibly chell (Skylake-Y)
Network info: various

Steps To Reproduce:
(1) Have an idle device with at least one tab open, or have a busy device, e.g. playing a 4K video on youtube.com.
(2) Over SSH, run `sudo perf top` or `perf record -g -a`.

Expected Result: i915_gem_execbuffer2 takes ~0.5% of all CPU cycles (inclusive); (.*force_wake_mt_get|.*force_wake_get|fw_domains_get) takes ~0% of all CPU cycles.
Actual Result: i915_gem_execbuffer2 takes ~3% of all CPU cycles (inclusive); fw_domains_get takes 1.5% of all CPU cycles.

How frequently does this problem reproduce? (Always, sometimes, hard to reproduce?)
This problem appears consistently within aggregate data collected via ChromeOS-wide-profiling, spanning multiple devices and Chrome stable versions.

What is the impact to the user, and is there a workaround? If so, what is it?
Larger than expected CPU usage. I am unaware of any workaround.

Please provide any additional information below. Attach a screen shot or log if possible.
For comparison, the following table shows the average CPU time spent in i915_gem_execbuffer2, broken down by uarch, for Chrome 50 stable:

Skylake-Y: 3%
Airmont: 3.4%
Silvermont: 0.5%
Broadwell: 0.6%
max(all other x86 uarchs): 0.5%

And similarly for (.*force_wake_mt_get|.*force_wake_get|fw_domains_get):

Skylake-Y: 1%
Airmont: 2.2%
Silvermont: 0.01%
Broadwell: 0.02%
max(all other x86 uarchs): 0.01%

Example profiled callgraphs (percentages shown are conditional on i915_gem_execbuffer2), generated on a smaller subset of the data, are also attached. Note these are aggregated samples across multiple devices over the last 1.5 months, so the usual caveats of different sample sizes and workloads apply. Nonetheless, the pattern has been consistent since Chrome 46 stable on Airmont (the earliest data we have for that uarch). Skylake-Y has just shown up in Chrome 50 stable and has few reports so far, so its inclusion here may be due to variance alone.
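For reference, the non-interactive variant of step (2) can be scripted roughly like this (a sketch only; the 60-second window and the report flags are arbitrary choices, not part of the measurement methodology above):

  # Sample all CPUs with call graphs for 60 seconds, then list how the
  # recorded cycles distribute over the symbols of interest:
  sudo perf record -g -a -- sleep 60
  sudo perf report --stdio --sort symbol | grep -E 'i915_gem_execbuffer2|force_wake_mt_get|force_wake_get|fw_domains_get'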
Comment 1 by chongjiang@chromium.org, Jun 22 2016
marcheu is probably a better owner of this than I am.
Jun 22 2016
Well, I think it is reasonable for the graphics driver to consume 3% of CPU cycles while rendering is going on. I don't think this is something that can be improved directly.
Jun 23 2016
To clarify, the number above is not counting everything within i915, just a particular function that is much hotter on Airmont (and on Skylake, to a lesser extent) than on other uarchs. That said, the total percentages for i915_.* are:

Airmont: 5%
Skylake: 2.5%
max(all other x86 uarchs): 1.5%

To me, this seems like a significant discrepancy that warrants further investigation.
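A rough way to compute such an aggregate from a recorded profile is sketched below (illustrative only; it assumes perf report's default --stdio column layout, and it is not how the CWP numbers above were actually produced):

  # With --no-children the first column is the self overhead, so
  # summing it over all i915_* symbols approximates the total time
  # spent inside i915 without double counting nested calls:
  sudo perf report --stdio --no-children --sort symbol | awk '/ i915_/ {sum += $1} END {printf "%.1f%%\n", sum}'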
Jul 5 2016
I assessed this today with my celes and rambi devices. When the system is idle, I see a high portion (2~5%) of 'fw_domains_get' on celes but not on rambi, even after moving rambi to a 3.18 kernel; conversely, celes shows no significant improvement when I patch its kernel up to upstream v4.6. When the system is busy (following the steps above, playing a 2160p fireplace clip), the portion drops; see the attached screenshots of perf/top. I am also dealing with issue 611896, where I tried to work around the high portion of '__vdso_clock_gettime', but that made little difference to the 'fw_domains_get' portion in either the idle or the busy case.
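For the busy case, a minimal way to watch the symbol live while the clip plays (a sketch of what I ran; the '/' key is perf top's standard interactive symbol filter):

  # In an SSH session on the device, start the live profile:
  sudo perf top -g
  # Inside perf top, press '/' and type fw_domains_get to filter the
  # symbol list down to the function under investigation.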
Jul 6 2016
Overall I assessed 4 machines, rambi/celes/lulu/chell, with the following CPU models:

BYT: N2840
BSW: N3050
BDW: i3-5005U
SKL: m5-6Y57

Findings:
- BSW/BDW/SKL report VP9 hardware acceleration via the 'vainfo' tool; BYT (rambi) does not.
- Like BSW (celes), SKL (chell) shows a higher portion of 'fw_domains_get' in 'perf top' profiling when the system is idle, around 1.5~4%.
- Comparing BSW to BYT in the busy case (UHD 2160p VP9 playback), BSW has lower CPU usage, almost half of BYT's; in FHD 1080p VP9 playback both look smooth, and BSW still has lower CPU usage than BYT, roughly 30% vs 50%. These figures were observed via the 'top' command.
- UHD 2160p VP9 playback is much more demanding on the BYT/BSW group than on BDW/SKL.
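As a side note, the VP9 capability check mentioned in the first bullet can be done like this (a sketch; the exact profile names printed vary by driver version):

  # List the VA-API profiles the GPU driver advertises and keep the
  # VP9 entries; per the findings above, BSW/BDW/SKL should print VP9
  # entrypoints while BYT prints none:
  vainfo | grep -i vp9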
Dec 17 2016
I re-examined this on more recent CWP data, and the data is very consistent with before. To reiterate, this appears to be a significant regression rather than an opportunity to improve rendering performance. Namely, as of M54, the percentage of all CPU time spent in i915_gem_execbuffer2, broken down by uarch (all boards of a given uarch share a kernel version):

Skylake (kernel 3.18): 1.5%
Airmont (kernel 3.18): 4.3%
Haswell (kernel 3.8): 0.6%
Silvermont (kernel 3.10): 0.5%
Broadwell (kernel 3.14): 0.5%
max(all other x86 uarchs): 0.4%

The difference when comparing Broadwell to Skylake, and Silvermont to Airmont, is significant both in absolute terms (1% and 3.8% of all CPU cycles) and in relative terms (3x and 8x regressions). Given that the Airmont and Skylake boards are the only ones on kernel 3.18, this seems to be evidence of a kernel issue.

I also broke this down further for the 3 Skylake boards:

chell (Skylake-Y): 1.5%
lars (Skylake-U): 1.5%
sentry (Skylake-U): 1.9%

These exhibit similar behavior despite the much higher resolution on chell, which suggests a cause other than general rendering performance.
Feb 8 2017
that's probably fixed by https://chromium-review.googlesource.com/#/c/439433/
Mar 31 2017
IIUC, that patch should have landed in M58 beta, with the first push to Airmont boards on 2017-03-21. Based on the preliminary data CWP has collected on M58 beta so far, I am not seeing any reduction in time spent in i915_gem_execbuffer2 on Airmont boards.
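To rule out stale images as an explanation, one quick sanity check on a device (a sketch, assuming shell access to the DUT; the fields are the standard Chrome OS /etc/lsb-release keys):

  # Confirm the board is actually running the M58 beta image and the
  # kernel that should carry the patch:
  grep -E 'CHROMEOS_RELEASE_(CHROME_MILESTONE|VERSION)' /etc/lsb-release
  uname -r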