
Issue 748603

Starred by 2 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug-Regression




elm: Page cycler regression between 16-17 of March

Reported by matteo.f...@arm.com, Jul 25 2017

Issue description

UserAgent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36
Platform: 9751.0.2017_07_19_1111

Steps to reproduce the problem:
We noticed a regression affecting the page cycler scores (typical_25) when running on the elm chromebook. We first see the regression on a build dated 17th of March 2017, and an initial investigation suggests that it was caused by a fix in the cgroup configuration for elm at https://chromium-review.googlesource.com/450159. This is a good CL, as it fixes a bug in how cpusets are set up in older kernels (e.g. 3.18, which is elm's current kernel). Before this CL landed, elm's cpuset setup failed and Chrome didn't use cpusets at all (e.g. the directory /sys/fs/cgroup/cpuset/chrome didn't even exist). Paradoxically, Chrome achieved better page cycler scores (~10-15% reduction of TTFMP and TTFCP) on elm without this configuration.

The issue is relatively easy to examine by hand:

1. Run the page cycler benchmark manually using a recent ChromeOS test image for elm.

2. Log into the Chromebook as root. "cat /sys/fs/cgroup/cpuset/chrome/urgent/cpus" should show 2-3. Do "echo 0-3 >/sys/fs/cgroup/cpuset/chrome/urgent/cpus".

3. Run the page cycler benchmark again. The page cycler scores (Time To First Meaningful Paint, Time To First Contentful Paint, Time To Onload) should improve by 10-15% on average. In my local runs, I mostly used arstechnica and repeated 3 times. Many other pages are also strongly affected.
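
For convenience, here is a minimal sketch of a helper that applies the wider cpuset and restores the original value afterwards (hypothetical, not part of the original report; it assumes a root shell on the DUT and the sysfs path above):

  #!/usr/bin/env python3
  # Hypothetical helper: temporarily widen chrome/urgent's cpuset while the
  # page cycler is run by hand, then restore the previous restriction.
  import contextlib

  URGENT_CPUS = "/sys/fs/cgroup/cpuset/chrome/urgent/cpus"

  @contextlib.contextmanager
  def urgent_cpus(new_value):
      with open(URGENT_CPUS) as f:
          old_value = f.read().strip()      # e.g. "2-3" on elm
      with open(URGENT_CPUS, "w") as f:
          f.write(new_value)                # e.g. "0-3"
      try:
          yield old_value
      finally:
          with open(URGENT_CPUS, "w") as f:
              f.write(old_value)            # put the original restriction back

  if __name__ == "__main__":
      with urgent_cpus("0-3") as previous:
          print("chrome/urgent/cpus: %s -> 0-3; run the page cycler now" % previous)
          input("press Enter when done to restore the old setting")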

I expect this problem to affect all boards derived from baseboard-oak.

One possible solution could be to use the same cpuset settings for chrome/urgent as used by other big.LITTLE platforms (e.g. kevin).

What is the expected behavior?

What went wrong?
Page cycler scores got worse by 10-15% on average.

Did this work before? N/A 

Chrome version: 61.0.3160.115  Channel: dev
OS Version: 
Flash Version:
 
Cc: pyeh@chromium.org
Labels: Performance

Comment 2 by matteo.f...@arm.com, Aug 31 2017

Following the comment at:

  https://chromium-review.googlesource.com/#/c/chromiumos/overlays/board-overlays/+/575999/

I measured the impact of changing chrome/urgent CPU restrictions when running the falling leaves benchmark at https://webkit.org/blog-files/leaves/ on Elm (2 big + 2 little cores, kernel 3.18) and Kevin (2 big + 2 little cores, kernel 4.4).

The procedure I followed is described below:

  1. I used two ChromiumOS test images I built myself (for Elm and Kevin) and appended the following three lines at the end of /etc/chrome_dev.conf (this requires a RW remount of /):

       --show-fps-counter
       --enable-logging=stderr
       --vmodule=head*=1

     I saved the file and rebooted the chromebook. Doing this I got the FPS meter HUD in the top-right corner of the screen. More importantly, I could get FPS statistics directly in the UI log /var/log/ui/ui.LATEST.

  2. I logged in as Guest (clicked on "Browse as Guest"), I maximized the browser window and navigated to https://webkit.org/blog-files/leaves/. I waited for a few seconds to allow the benchmark to go to a steady state (number of leaves appearing ~= number of leaves disappearing).

  3. I used a Python script to collect CPU frequency measurements and - simultaneously - extract the FPS count from /var/log/ui/ui.LATEST (a rough sketch of such a script is included after this list). The script also changes cpuset restrictions every 20 seconds by echoing the following strings to /sys/fs/cgroup/cpuset/chrome/urgent/cpus:

       - Elm: "1", "2", "2-3", "0-1", "0-3" (0-1 are the little cores and 2-3 are the big).
       - Kevin: "1", "4", "4-5", "0-3", "0-5" (0-3 are the little cores and 4-5 are the big).

     This corresponds to having the urgent tasks running, respectively, on: one little core, one big core, all the big cores, all the little cores, or any core.

     This makes it possible to observe, for different cgroup settings, how the scheduler chooses the big and little cluster frequencies, and what impact this has on the achieved frame-rate.

  4. The data was collected and plotted. The results are attached below.
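
For illustration, a rough sketch of the kind of script used in step 3 follows. This is not the actual script: the cpufreq sysfs nodes, the 4-CPU loop and the FPS log-line format are assumptions, so adapt as needed.

  #!/usr/bin/env python3
  # Rough sketch of the measurement loop described in step 3 (not the original
  # script). Assumed, not verified: frequencies are read from the standard
  # cpufreq sysfs nodes and FPS lines in /var/log/ui/ui.LATEST contain a number
  # followed by "fps". Run as root on the DUT.
  import re, time

  URGENT_CPUS = "/sys/fs/cgroup/cpuset/chrome/urgent/cpus"
  FREQ_NODE = "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq"
  UI_LOG = "/var/log/ui/ui.LATEST"

  CPUSET_SEQUENCE = ["1", "2", "2-3", "0-1", "0-3"]   # elm; use the Kevin list there
  PHASE_SECONDS = 20
  NUM_CPUS = 4
  FPS_RE = re.compile(r"([\d.]+)\s*fps", re.IGNORECASE)  # assumed log format

  def set_urgent_cpus(value):
      with open(URGENT_CPUS, "w") as f:
          f.write(value)

  def read_freqs():
      freqs = []
      for cpu in range(NUM_CPUS):
          with open(FREQ_NODE % cpu) as f:
              freqs.append(int(f.read()))        # kHz
      return freqs

  def main():
      log = open(UI_LOG)
      log.seek(0, 2)                             # only look at new log lines
      start = time.time()
      for cpus in CPUSET_SEQUENCE:
          set_urgent_cpus(cpus)
          phase_end = time.time() + PHASE_SECONDS
          while time.time() < phase_end:
              t = time.time() - start
              freqs = read_freqs()
              fps = [float(m.group(1)) for m in map(FPS_RE.search, log.readlines()) if m]
              print("%.1f s  cpus=%s  freqs(kHz)=%s  fps=%s" % (t, cpus, freqs, fps))
              time.sleep(1)

  if __name__ == "__main__":
      main()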

Below I comment on the results:

  1. On Elm restricting chrome/urgent to one CPU (time t in [0, 40[) causes the governor to choose a considerably higher frequency for that core. In particular, restricting urgent/cpus to one single big core (i.e. doing "echo 2 > chrome/urgent/cpus") gives the lowest amount of dropped frames for the leaves benchmark. Restricting to one single little core also works pretty well. The restrictions 2-3, 0-1 and 0-3 (time t in [40, 100[) cause more occasional dropped frames, but - in general - the FPS values are pretty good for any choice of urgent/cpus.

  2. On Kevin restricting chrome/urgent to a single core also causes the frequency to go up, but only slightly. The effect on the frame-rate is different than on Elm: restricting to one little core causes a persistent drop in frame-rate. Restricting to all little cores also causes dropped frames, but to a lesser extent. Restricting to one or two big cores works well and allows the device to hit 60 FPS with good stability. No restriction also works very well, with 60 FPS hit comfortably.

  3. On Kevin average frequencies are adjusted more quickly, probably due to the different governor and how it is set up. The frequencies also change more dramatically (they "jump" from ~0.4 GHz straight to above 1 GHz, while Elm is happy to operate at 0.7 GHz and seems much less keen to scale the frequency up).

  4. Looking at the energy model for Elm, it seems that big cores at lower frequencies have a lower capacity than little cores at higher frequencies. For the same capacity, little cores are much more energy efficient. It is therefore possible that restricting urgent/cpus to 2-3 (big cores) will lead to lower performance at higher power consumption when the system is lightly loaded.

  5. For Elm, in the case "echo 0-3 > chrome/urgent/cpus" (no restrictions), the plot I attach shows that the scheduler behaves similarly to "echo 0-1 > chrome/urgent/cpus" (the region [60-80[ is similar to [80-100[). I have seen other cases, however, where the scheduler behaves more similarly to "echo 2-3 > ...". I also have seen an instance where the scheduler switched from one behaviour to the other one. In summary, when using 0-3, the "behaviour" of the scheduler has been seen to oscillate between the behaviours 0-1 and 2-3.

  6. The HUD (FPS meter) has been seen to considerably affect the performance of benchmarks. This is definitely true for MotionMark, for example. I will carry out some measurements to see whether this is the case for the CSS leaves benchmark. I am keen to find alternative ways of measuring the FPS, as the HUD is causing me a lot of pain and interfering quite drastically with what it is trying to measure.

In summary, changing Elm's cpuset for chrome/urgent to 0-3 seems not to have a significant impact on the leaves benchmark. We have measured significant improvements on the page cycler workloads. There are indications that speedometer and speedometer2 also benefit from this change. In any case, the achieved frame-rate is heavily impacted by the CPU frequency choices that the governor makes. Elm seems to run light workloads at lower frequencies than Kevin, and this seems to be the main cause of dropped frames.

I ran more benchmarks which I am going to discuss separately in this thread.
freq_and_fps_vs_time-elm.png
freq_and_fps_vs_time-kevin.png
I repeated the measurements of frequency without the FPS HUD (removed the three lines I added to /etc/chrome_dev.conf). The produced plots are a bit rough. They lack the FPS measurements and there is now an empty plot for the FPS (which, admittedly, I could have removed). For Elm the time > 80s region is also lacking the final parts of the lines. This is due to the way the measurements are taken: only changes of frequency are recorded and we don't have any change of frequency in that region apart from a very narrow peak in the big cores frequency at around 84.16 s. Apart from this tiny peak, for t > 80s all cores are running at the lowest frequency on Elm, which is ~0.5 GHz.

These plots show that the HUD is a pretty invasive measuring instrument, as it can affect considerably what it is trying to measure, at least for light workloads such as https://webkit.org/blog-files/leaves/. Still, the observations above are confirmed: Elm runs the leaves workload at the lowest possible frequency, while Kevin runs it at a considerably higher frequency. Dropped frames in this workload seem to be highly correlated with the frequency chosen by the governor.
freq_vs_time-elm-nohud.png
freq_vs_time-kevin-nohud.png
Cc: drinkcat@chromium.org sonnyrao@chromium.org diand...@chromium.org jcliang@chromium.org bccheng@chromium.org dtor@chromium.org
Components: OS>Performance OS>Kernel
Status: Started (was: Unconfirmed)
Summary: elm: Page cycler regression between 16-17 of March (was: Page cycler regression on elm Chromebook between 16-17 of March)

Comment 5 by matteo.f...@arm.com, Sep 28 2017

I ran the patch at

  [1] https://chromium-review.googlesource.com/#/c/chromiumos/overlays/board-overlays/+/575999/

for a few other workloads. The change of chrome/urgent/cpus from 2-3 to 0-3 has generally positive effects on most of the benchmarks we run (page_cycler, loading.desktop, some of blink_perf and MotionMark). There are, however, also some regressions (far fewer, among our selection of benchmarks). The most interesting one I found is in MotionMark:

  http://browserbench.org/MotionMark/developer.html

The peculiarity of this workload is that it runs - by default - in "ramp mode", a mode where complexity is increased until the device fails to hit a target framerate. For example, if the benchmark animates N sprites, then N is increased (and adjusted, it seems, in a bisection-like way) to find the value that achieves the target 50 FPS. This mode is interesting as it requires the governor to dynamically match the CPU frequency to the current load. I found that the "focus" sub-benchmark of MotionMark has anomalously low performance on Elm and worsens further when changing chrome/urgent/cpus from 2-3 to 0-3. Monitoring the core frequencies, I can confirm that this behaviour again comes from the frequency choices made by the interactive governor (switching to the performance governor gives a higher score).

In summary:

  - the issues we see on elm are mainly due to its governor (interactive) and how it is set up.

  - many workloads benefit from [1] (changing chrome/urgent/cpus to 0-3) but some are also impacted negatively. I think this can be explained as follows: setting urgent/cpus to 0-3 gives more freedom to urgent task placement and hence improves performance in some workloads (especially those where the CPU utilisation is high). At the same time, however, it tends to drive the interactive governor to choose lower CPU frequencies (since its choices are based on system load). This penalizes workloads with low CPU utilisation, for which the frequency ends up being kept too low.

  - the interactive governor has a number of knobs that can be tweaked and lead to quite different behaviours. Fine tuning these knobs is not straightforward, as each change is likely to favour some workloads and negatively impact other workloads. There is also a potential impact on power saving which must be assessed.

  - evaluating what is a positive/negative change in a workload is also not straightforward: for some workloads we may prefer performance over power, for others we may do the opposite.

So what's next? I guess it makes sense to make an attempt to tune the interactive governor and see where this takes us.
Elm is on 3.18 and kevin is on 4.4. We have different sets of EAS patches on 3.18 and 4.4, and they use different cpufreq governors (interactive on 3.18, sched on 4.4).

One could try porting all the EAS patches from 4.4 to 3.18 and use sched governor instead of interactive governor. I'm not sure about the amount of effort this requires though.
re #6  -- at one point I thought elm ran on 4.4 -- but it may not anymore
I had a todo to try and do a performance comparison between 3.18 and 4.4 on elm to see how much the 4.4 version of EAS was helping but I got side tracked -- maybe someone else could try Elm on 4.4 and see if it still boots and runs?

Comment 8 by matteo.f...@arm.com, Sep 29 2017

jcliang, it would indeed make sense to have the same kernel/governor running on the two Chromebooks, as this would simplify optimising for both boards. It remains to be seen whether the sched-freq governor (I call it sched below) would perform better than interactive on Elm. I did some experiments on Kevin which suggest that this may be the case:

- I compared three different governors on kevin - sched, performance and interactive - when running MotionMark/focus. Unsurprisingly, in ramp mode, the performance governor is the one which does best, followed by sched (on average 5% worse, up to ~20% worse when considering the minimum score achieved over 24 repetitions), followed by interactive (which sometimes gives an anomalous minimal score of 1.0; I cannot exclude that this is an issue with the benchmark itself). In fixed complexity (complexity=10) the same performance ordering is observed: fps(performance) > fps(sched) > fps(interactive). The difference in FPS obtained by switching between the governors is smaller: ~2% when going from performance to sched and the same when going from sched to interactive (I have performed fewer runs of these, so I have less confidence in these numbers).

- changing chrome/urgent/cpus from 0-5 (the default on Kevin) to 4-5 or anything else always leads to lower performance. So urgent/cpus==0-5 seems the optimal choice for Kevin, at least for the sched governor.

- Note that MotionMark/focus (ramp) achieves a far better score on Kevin than on Elm, but Kevin clearly has some tiling/rasterisation issues (they affect only some frames of the animation and appear/disappear very quickly). I can raise a bug (possibly attaching a screenshot) if you do not expect this behaviour on Kevin.
I attach a couple of plots that are probably worth sharing.

The first plot, titled 'elm/MotionMark/focus', shows results for the Focus suite of the MotionMark benchmark. This is the MotionMark suite which is most negatively impacted by changing chrome/urgent/cpus from 2-3 to 0-3 (as per the patch below).

  [1] https://chromium-review.googlesource.com/#/c/chromiumos/overlays/board-overlays/+/575999/

The second plot, titled 'elm/MotionMark/multiply', shows the results for the Multiply suite, which is the one most positively affected by [1].

THE PLOTS
---------

Both plots are obtained following the same procedure. Telemetry is used to increase the complexity of the considered workload (x-axis) and record the framerate (y-axis) which the device is able to sustain for 30 seconds at that complexity. Each point in the plot represents the average between 10 values. There are 4 different curves in the plot, corresponding to 4 different configurations. The solid lines show results obtained using the interactive governor in Elm (default governor, default configuration), while the dashed lines show results for the performance governor. The dashed lines hence show an (optimal) reference result, as the performance governor just keeps the frequency always set to the maximum possible value for each core (this is the best strategy when power and thermal are neglected). The green curves show results for chrome/urgent/cpus == 2-3, while the black curves show results for chrome/urgent/cpus == 0-3.

DISCUSSION
----------

General observations:

1. At low complexity (e.g. only one sprite to animate) we expect workloads to hit 60 FPS, while at higher complexities (e.g. 1000 sprites to animate) the FPS will decrease - normally in a monotonic fashion. There should always be a complexity value, x0, where this transition occurs. In terms of x0, we expect FPS == 60 for x <= x0, while for x > x0 we expect FPS to gradually approach zero. Note that x0 is what MotionMark tries to determine when running in ramp mode. Here we run it in fixed-complexity mode (we fix the complexity and run for a fixed amount of time). Note also that we expect x0 to be a function of `cat chrome/urgent/cpus`.

2. At high complexity the system should be under high load. We expect any well-behaved governor to set the CPU frequencies to the maximum values in this regime. We therefore expect solid curves to approach their dashed counterpart in the region x/x0 >> 1.

3. Setting chrome/urgent/cpus to 0-3 (rather than 2-3) gives the system more "freedom". This has two competing effects:

  a. More freedom means more cores available for computation. I would expect this to have a positive impact on performance, especially in high-load conditions when CPUs are busy (and hence scheduling "freedom" is more scarce). In other words, I would expect the black curves to lie above the corresponding green curves for x/x0 >> 1.

  b. More freedom means higher probability of making scheduling mistakes (e.g. running critical-path tasks on little cores). I would expect the green curves to lie above the corresponding black curves at x/x0 <~ 1, where scheduling "freedom" is abundant and hence (3a) doesn't matter much.

4. Additionally, setting chrome/urgent/cpus to 0-3 was observed (see previous posts) to lower the CPU frequencies for the interactive governor in low-load conditions. I would therefore expect the black solid curve to lie below all other curves at x/x0 <~ 1.

The expectations above are generally satisfied by the two plots.

Observations for the focus benchmark:

- In the x > x0 region, we can see 2 regimes, probably corresponding to a change of the critical path. For interactive/0-3 we further see a third regime in the region x < 5. This is probably what makes MotionMark/Focus a particularly under-performing suite for [1] on Elm.

Observations for the multiply benchmark:

- There is a gap between the black curves and their green counterparts which persists for complexities of 400-500 and is probably due to (3a) above. Interestingly, the green solid curve (interactive/2-3) struggles to converge even to its green dashed counterpart (performance/2-3). One possible explanation: the 2-3 CPU restriction may keep the little cores idle, which - in turn - keeps load down and induces the interactive governor to use lower frequencies. This hypothesis seems to be confirmed by monitoring the CPU frequencies of the little cores: when running MotionMark/multiply with chrome/urgent/cpus==2-3 the little cores tend to run at lower frequencies, while for chrome/urgent/cpus==0-3 the little cores run almost always at the highest frequency.

CONCLUSION
----------

One possible way forward is: try to understand why using chrome/urgent/cpus==0-3 leads the governor to choose lower frequencies, and explore whether it is possible to fine-tune the governor parameters to counteract this tendency. In fact, many of the interactive governor knobs can be tuned just for particular frequencies. This would then fix issue (4) above.

It would be interesting to repeat this study with the schedfreq governor to see how far it is from the performance governor in the MotionMark suites. Depending on the results, it may make more sense to just switch Elm to a more recent kernel and governor, rather than fine-tune the interactive governor.

fps_vs_complexity-focus.pdf
fps_vs_complexity-multiply.pdf
re #9 -- I was proposing that you try it on 4.4 since it did work there at one point and then you could try the schedfreq governor
Cc: djkurtz@chromium.org
Sonny (#10), thanks for the suggestion. I'll give it a try at some point. First, I'd like to see what happens on Kevin. There I already have the possibility of comparing the 3 governors: perf, interactive, and sched. Preliminary results on Kevin suggest that sched is not always performing better than interactive (MotionMark/Canvas Lines).
re #10 -- ok sure testing on Kevin makes sense if the problem reproduces there -- I thought it might be specific to elm.

I'm also not sure if sched is supposed to always perform better than the others -- there's always that trade-off between performance and power usage which we took into account during tuning.
FYI: the patchset ending at [0] should work with elm on chromeos-4.4:

https://chromium-review.googlesource.com/#/c/chromiumos/third_party/kernel/+/711390
re #7 - testing 4.4 kernel on elm with the sched governor:

I've done some testing with the power_LoadTest workload, and have numbers for 3.18 vs 4.4 on elm. I've done two runs of 3.18, one of 4.4 at boost 10, and one of 4.4 at boost 20.

In summary, with 4.4 (both boost 10 and 20), battery life is unchanged and performance improves. This appears to be because 4.4 spends less time at the lowest frequency, but also less time at the highest frequency (especially on the big cores), which seems to result in better performance-per-watt.

Battery life:
3.18: 797, 810
4.4b10: 805
4.4b20: 809

Page load geomean:
3.18: 4041, 4235
4.4b10: 3778
4.4b20: 3813

Mean CPU frequency in MHz (little; big):
3.18: 632, 647; 896, 862
4.4b10: 684; 762
4.4b20: 849; 880

The attached graph shows proportion of time spent at different frequencies during the power_LoadTest.
cpu frequencies.png
We also see improvements on a range of workloads when testing the 4.4 kernel on elm. In summary (we can provide more data on these, if that's of interest):

motionmark (fixed complexity): 14% improvement in percentage_smooth
page_cycler (cold): 19% reduction in timeToFirstContentfulPaint
page_cycler (warm): 13% reduction in timeToFirstContentfulPaint
top_25_smooth: 5% increase in percentage smooth
speedometer: 1.4% improvement in score
GLAquarium: 0.3% improvement in percentage_smooth

Additionally, the 4.4 kernel mitigates the impact of issue 785930 (the regression in smoothness observed when using WiFi is reduced from around 10-20% to approximately 6%).
Thanks for doing this!  Seems to indicate there's a significant win on page load and smoothness in going to 4.4 for Elm and presumably related devices.
Sorry, the X-axis for the big cores histogram in #15 is mislabelled. I've attached an updated version of the graph (there are no other differences).
cpu frequencies.png
Attached is another microbenchmark that shows an anomaly on Elm vs Kevin.

It is testing page access latency when the working set exceeds installed RAM and hence uses ZRAM.

The command is:
memory-eater.arm --size 2465 --speed --fork --repeat 4 --chunk 500 --wait 0

It will use two processes where each allocates a 2465MB buffer to thrash the memory system.

It reports the number of accessed pages per second.
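
For readers without the binary, a very rough Python analogue of the access pattern described above might look like the sketch below (illustrative only; the real benchmark is the attached memory-eater.arm and its exact behaviour may differ):

  #!/usr/bin/env python3
  # Rough analogue of the described workload: two processes each allocate a
  # large buffer and repeatedly touch pages in 500-page chunks, reporting
  # pages touched per second. Not the actual memory-eater.arm benchmark.
  import multiprocessing, time

  PAGE = 4096
  CHUNK_PAGES = 500

  def worker(size_mb, repeat, result_queue):
      buf = bytearray(size_mb * 1024 * 1024)   # large enough to push the system into zram
      n_pages = len(buf) // PAGE
      touched = 0
      start = time.time()
      for _ in range(repeat):
          for page in range(0, n_pages, CHUNK_PAGES):
              for p in range(page, min(page + CHUNK_PAGES, n_pages)):
                  buf[p * PAGE] = (buf[p * PAGE] + 1) & 0xFF   # fault the page in
                  touched += 1
      result_queue.put(touched / (time.time() - start))

  if __name__ == "__main__":
      q = multiprocessing.Queue()
      procs = [multiprocessing.Process(target=worker, args=(2465, 4, q)) for _ in range(2)]
      for p in procs: p.start()
      for p in procs: p.join()
      print("pages/sec per process:", [q.get() for _ in procs])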

On Kevin, the scores are consistent and higher when using the big cores than when using the little cores. Enabling all 6 cores gives similar results to enabling only the 2 big cores.

On Elm, the score is much lower when all 4 cores are enabled, even lower than when using a single little core. And the best score on Elm is lower than on Kevin, which is unexpected too.
memory-eater.arm
bccheng (#19), do the scores improve when switching to the performance governor? If this is the case, then what you see is probably another manifestation of the interactive governor issue. Otherwise, it may be a separate one.
No, I didn't change the governor setting. I will try performance next.
I changed the governor to performance and the scores become steady and reasonable - using the big cores is faster than using the little cores.

However, the score is about 12% slower than Kevin, and I'm curious to see if Elm w/ 4.4 kernel can gain any performance on ZRAM.
@#22 The chromeos-4.4 kernel does support elm, so if you already have a board=elm setup in your chroot you can try your experiment easily:

emerge-elm --unmerge linux-sources sys-kernel/chromeos-kernel-3_18
USE="-kernel-3_18 kernel-4_4" emerge-elm -j linux-sources sys-kernel/chromeos-kernel-4_4
./update_kernel --board=elm --remote=${DUT_IP}

@22: note that elm has faster cores and better memory bandwidth compared to kevin, but it has fewer cores. Now that we can use all the cores for zram, it's possible the extra 2 little cores explain the 12% difference.
I tried the steps in #23 and Elm achieves a 25% improvement w/ the 4.4 kernel. Here are the results from the ZRAM walker benchmark:

Kevin: 87803 pages/sec
Elm (3.18 kernel): 79455 pages/sec
Elm (4.4 kernel): 99222 pages/sec

I also ran the benchmark for a number of different core configurations on Elm, as well as with the UI turned on and off, and got the attached results.

(I ran each configuration three times; each run yields one value from each of the two processes, so the error bars represent the standard error over six values.)

I guess your 99222 pages/sec corresponds to the (Started, Any) value in my results, so I'm seeing results consistent with yours. However, in cases where the benchmark is restricted to running on the big cores or on a single big core (2 Only), going to 4.4 makes things worse when the UI is running, it would appear.
memory-eater-kernel-change.png
Thanks bccheng, Stephen. Interesting data! Kernel 4.4 seems to have significantly better performance in the most relevant cases (Stopped-Any and Started-Any). I find it interesting that Stopped-only2 (benchmark pinned to big core 2, with the UI stopped) exhibits better performance than the other "stopped" cases, even for schedfreq. The same outcome is confirmed when running on little cores (Stopped-only0 vs Stopped-little-only).


It is pretty evident that this benchmark (and probably the way it interacts with the kernel) is quite sensitive to the core count and configuration.
Thanks for the data. The highest recorded score with only core #2 enabled is interesting. Do you know why that's the case?
The benchmark is under review and the link is:
https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/792713

It uses a pretty straightforward way of accessing 500 pages at a time between two processes. So it only uses a single core from user space, but the kernel might benefit from multiple cores if ZRAM uses them (that seems to be the case, as Doug mentioned in #24).
I'm still looking into this, particularly why we see bimodal results for the (UI stopped, single little core only) case, where sometimes we achieve ~120k, and sometimes ~70k.

Running perf with the benchmark shows a profile like the following for the bad and good cases:

BAD CASE:

# Samples: 373K of event 'cycles:ppp'
# Event count (approx.): 156810127579
#
# Overhead  Command          Shared Object      Symbol                                      
# ........  ...............  .................  ............................................
#
    23.06%  memory-eater.ar  [kernel.kallsyms]  [k] lzo1x_decompress_safe                   
    10.83%  memory-eater.ar  [kernel.kallsyms]  [k] _raw_spin_unlock_irq                    
     4.84%  memory-eater.ar  [kernel.kallsyms]  [k] do_raw_spin_lock                        
     3.74%  memory-eater.ar  [kernel.kallsyms]  [k] __alloc_pages_nodemask                  
     2.78%  memory-eater.ar  [zram]             [k] zram_rw_page        

GOOD CASE:

# Samples: 278K of event 'cycles:ppp'
# Event count (approx.): 116921859339
#
# Overhead  Command          Shared Object      Symbol                                 
# ........  ...............  .................  .......................................
#
    21.04%  memory-eater.ar  [kernel.kallsyms]  [k] lzo1x_decompress_safe              
     9.79%  memory-eater.ar  [kernel.kallsyms]  [k] _raw_spin_unlock_irq               
     4.58%  memory-eater.ar  [kernel.kallsyms]  [k] do_raw_spin_lock                   
     3.67%  memory-eater.ar  [kernel.kallsyms]  [k] __alloc_pages_nodemask             
     2.89%  memory-eater.ar  memory-eater.arm   [.] 0x00000ad0   

The percentage of time spent in the different DSOs changes as follows:


             DSO  Avg % Good   Avg % Bad   Change Good->Bad
-----------------------------------------------------------
          kernel        84.7        86.3               +1.6
            zram         7.7         8.6               +0.9
memory-eater.arm         7.6         5.0               -2.6


I've been trying to get to the bottom of what controls the pages/sec score we see in this benchmark. In some configurations, running the same benchmark multiple times in a row can produce wildly different results, even if:

* the performance governor is used
* the benchmark is pinned to core 0 (LITTLE)
* the kswapd0 thread is pinned to core 3 (big)
* the UI is turned off

Even in this case, the benchmark sometimes scores ~110,000 pages/sec, and sometimes ~67,000. This also happens in the 4.4 kernel.
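
For reference, a sketch of how this pinned configuration could be set up follows (not the exact commands used; the "stop ui" job name and the pgrep/taskset details are assumptions):

  #!/usr/bin/env python3
  # Rough sketch of the pinned setup: performance governor on all cores,
  # kswapd0 pinned to big core 3, UI stopped, benchmark pinned to little
  # core 0. Assumes standard cpufreq sysfs paths and root access.
  import os, subprocess

  def set_governor(governor, num_cpus=4):
      for cpu in range(num_cpus):
          path = "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor" % cpu
          with open(path, "w") as f:
              f.write(governor)

  if __name__ == "__main__":
      set_governor("performance")
      kswapd_pid = int(subprocess.check_output(["pgrep", "-x", "kswapd0"]))
      os.sched_setaffinity(kswapd_pid, {3})      # kswapd0 -> big core 3
      subprocess.check_call(["stop", "ui"])      # turn the UI off (upstart job)
      # launch the benchmark pinned to little core 0
      subprocess.check_call(
          ["taskset", "-c", "0",
           "./memory-eater.arm", "--size", "2465", "--speed",
           "--fork", "--repeat", "4", "--chunk", "500", "--wait", "0"])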

This hot function, lzo1x_decompress_safe, has a callstack like the following:

lzo1x_decompress_safe ([kernel.kallsyms]) 
zcomp_decompress ([zram]) 
zram_decompress_page ([zram]) <---- 
zram_bvec_rw ([zram]) 
zram_rw_page ([zram]) 
bdev_read_page ([kernel.kallsyms]) 
swap_readpage ([kernel.kallsyms]) 
read_swap_cache_async ([kernel.kallsyms]) 
swapin_readahead ([kernel.kallsyms]) 
handle_mm_fault ([kernel.kallsyms]) 
do_page_fault ([kernel.kallsyms]) 
do_mem_abort ([kernel.kallsyms]) 
el0_da ([kernel.kallsyms]) 

Adding tracing code to the 3.18 kernel, I observe that the score we get is directly related to how often we call zram_decompress_page.

Some example values are:
110k pages/sec -> 3703000 calls to zram_decompress_page.
67k pages/sec -> 5903734 calls to zram_decompress_page.

What is causing us to need to decompress 60% more pages only in certain cases is not clear to me.
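
As an alternative to patching the kernel, something like the ftrace function profiler could give the same call counts; a sketch (untested here, assuming CONFIG_FUNCTION_PROFILER is enabled and zram_decompress_page appears in available_filter_functions):

  #!/usr/bin/env python3
  # Count zram_decompress_page() calls during a benchmark run using the
  # ftrace function profiler instead of custom tracing code. Sketch only.
  import glob, subprocess

  TRACING = "/sys/kernel/debug/tracing"

  def write(path, value):
      with open(path, "w") as f:
          f.write(value)

  def profile_calls(symbol, benchmark_cmd):
      write(TRACING + "/set_ftrace_filter", symbol)     # only profile this function
      write(TRACING + "/function_profile_enabled", "1") # reset + start counting
      subprocess.check_call(benchmark_cmd)
      write(TRACING + "/function_profile_enabled", "0")
      hits = 0
      for stat_file in glob.glob(TRACING + "/trace_stat/function*"):
          with open(stat_file) as f:
              for line in f:
                  fields = line.split()
                  if fields and fields[0] == symbol:
                      hits += int(fields[1])            # "Hit" column
      return hits

  if __name__ == "__main__":
      calls = profile_calls(
          "zram_decompress_page",
          ["./memory-eater.arm", "--size", "2465", "--speed",
           "--fork", "--repeat", "4", "--chunk", "500", "--wait", "0"])
      print("zram_decompress_page calls during the run:", calls)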

One other observation was that the location of the buffer doesn't matter: if we modify the benchmark to reuse the buffer, so that once the child process has finished we fork a new one and run the benchmark again, we still see this bimodal behaviour from run to run.

I've also attached updated results for the different core configurations where each configuration was run 10 times, including a boxplot to demonstrate the weird bimodal-ness of the benchmark.

In response to #24, I'm not sure the extra cores matter: if I understand correctly, zram compression only happens in kswapd0, and we only have one kswapd thread, even on Kevin?
memory-eater-results.png
memory-eater-results-boxplot.png
Cc: semenzato@chromium.org
> the same benchmark multiple times in a row can produce wildly different results

Just to make sure, you've ensured that Thermal Throttling isn't a factor, right?

---

> the compressing of zram only happens in kswapd0, and we only have one kswap thread, even on Kevin?

Not totally true, but this is something I've thought a bit about too. Yes, kswapd has only 1 thread. I've wanted kswapd to be able to use more than one thread, but when I looked at trying to actually do that in our kernel it wasn't totally trivial. Other processes can compress to zram, but only one thread preemptively compresses. As I understand it:

1. We reach watermark where we want to start swapping stuff out: one thread (kswapd) kicks off and starts compressing.

2. Other threads keep consuming memory much faster than one thread can compress, so we eventually get to a critical level.

3. When another thread is critical, it will help out kswapd by compressing some memory itself.


In general I've found that the memory manager in most of our kernels is very hard to test and get consistent results from. If things happen to run in a slightly different order, then you can get wildly different results.
> Just to make sure, you've ensured that Thermal Throttling isn't a factor, right?

Running with the configuration above (workload pinned to core 0, kswapd pinned to core 3, performance governor, no UI), ftrace reports no change in frequency, and the reported thermal_temperature doesn't go beyond 51C, but I still saw the two different results within that period, so I don't think so, no.

> Not totally true, but this is something I've thought a bit about too.

Thanks for the info on that. I was interested to see if there was a significant difference in who was doing the compressing between the good and bad cases, so I used perf to sample two ftrace events, one for when zcomp_compress is called and one for when zcomp_decompress is called, but it isn't that illuminating, sadly.

*** In the good case, where we got 108918 pages/sec ************

# Samples: 3M of event 'zram:zcomp_compress'
# Event count (approx.): 3493953
#
# Children      Self       Samples    Pid:Command        
# ........  ........  ............  .....................
#
    99.92%    99.92%       3491184     63:kswapd0        
     0.03%     0.03%          1162   2914:memory-eater.ar
     0.03%     0.03%          1058   2913:memory-eater.ar
     0.01%     0.01%           399   2912:perf           
     0.00%     0.00%           150    157:kworker/u8:5   

# Samples: 3M of event 'zram:zcomp_decompress'
# Event count (approx.): 3374698
#
# Children      Self       Samples    Pid:Command        
# ........  ........  ............  .....................
#
    50.00%    50.00%       1687317   2914:memory-eater.ar
    49.97%    49.97%       1686468   2913:memory-eater.ar
     0.01%     0.01%           183   1076:shill          
     0.00%     0.00%           154    529:dbus-daemon    
     0.00%     0.00%           120    836:powerd         
     0.00%     0.00%            97   1346:avahi-daemon   


*** In the bad case, where we got 75588 pages/sec ************

# Samples: 4M of event 'zram:zcomp_compress'
# Event count (approx.): 4706996
#
# Children      Self       Samples    Pid:Command        
# ........  ........  ............  .....................
#
    99.92%    99.92%       4703066     63:kswapd0        
     0.03%     0.03%          1353   2926:memory-eater.ar
     0.03%     0.03%          1340   2925:memory-eater.ar
     0.02%     0.02%          1100   2924:perf           
     0.00%     0.00%            75    157:kworker/u8:5   
     0.00%     0.00%            62    529:dbus-daemon  

# Samples: 4M of event 'zram:zcomp_decompress'
# Event count (approx.): 4709049
#
# Children      Self       Samples    Pid:Command        
# ........  ........  ............  .....................
#
    50.09%    50.09%       2358538   2926:memory-eater.ar
    49.89%    49.89%       2349505   2925:memory-eater.ar
     0.00%     0.00%           231    529:dbus-daemon    
     0.00%     0.00%           189   1076:shill          
     0.00%     0.00%           150    836:powerd         
     0.00%     0.00%           125   1346:avahi-daemon   
     0.00%     0.00%            69    347:frecon         
     0.00%     0.00%            56    649:wpa_supplicant 
     0.00%     0.00%            55   2274:sshd           
     0.00%     0.00%            41   2269:metrics_daemon 
     0.00%     0.00%            28    584:timberslide    
     0.00%     0.00%            24   1083:dhcpcd         
     0.00%     0.00%            17   2924:perf           
     0.00%     0.00%            15   1634:MountThread    
     0.00%     0.00%             6    852:daisydog   

So there's a general increase in compression and decompression, but most of the compression comes from kswapd0, and there's no other process suddenly forcing compression that wasn't doing so before.

So yes, looks like it may just be that if things run in a slightly different order, then you get wildly different results as you say.

bccheng@ - were you checking that your achieved results were consistent for the different scenarios you were testing?

I guess another thing we can take away from this is that this provides another benefit for switching to 4.4. In the (Started, Any) case in my graph above, we do get a consistently better result when using 4.4.
Labels: -Performance Performance-Loading
Status: Available (was: Started)
This issue has been marked as started, but has no owner. Making available.
