Improve zlib inflate_fast speed by using SSE |
|||||||||||
Issue descriptionzlib inflate_fast() speed on ARM devices was improved in issue 697280 by performing large chunk copies via NEON SIMD code. Similar improvement could be expected by using SSE SIMD on Intel devices. The code from issue 697280 could be modified to add support SSE, and also changed from being an arch-specific contrib to a more general inflate SIMD contrib.
,
Oct 9 2017
Related issue for this "chunk-copy" code is the infback case, issue 769880 .
,
Oct 24 2017
https://bugs.chromium.org/p/chromium/issues/detail?id=697280#c46 reported the decode perf for the ARM chunk copy case. I repeated that experiment for Intel using SSE chunk copy on my macbook pro: MAC BOOK PRO 13" 2016 (TOUCH BAR) LAPTOP * Dual-core Intel Core i7-6567U, 3.3GHz Skylake-U (14nm) png mac-book-pro-sse-chunk-adler https://docs.google.com/spreadsheets/d/1ZGSdjmPKbdvsBlXvUPRaIG4uJ2jyEOljUVf78e-BibQ/edit#gid=595603569 Chunk copy (the blue-line on the graph, label sse chunk + no alder) is an average 17% improvement in PNG decoding rate for the PNG 140 corpus. Adding the adler32 SSE change (the read-line, label sse chunk + ssse3 alder), the PNG decoding improvement is average 37% for the PNG 140 corpus.
,
Oct 25 2017
Patch uploaded https://chromium-review.googlesource.com/c/chromium/src/+/708834 and we can begin review. - solves issue 769880 by using inffastloose.c that was provided by cavalcantii@ in https://gist.github.com/Adenilson/0525bb60f79fcc90b4ccedb778a198d1 which I renamed to inffastchunk.c - defines the int splats needed by SIMD code for both NEON and SSE2 as macros Be advised I will be OOO for the next two weeks, so you won't be hearing from me :)
,
Oct 25 2017
Noel That is pretty cool, I really like the spreadsheets with the the output of image decoding benchmark (do you have a script to import the data into Google docs?). 17% is in line with the NEON results for inffast (i.e. 16% as in https://docs.google.com/spreadsheets/d/1WOWEuAMXrxj608WMuHj-9DKv1wN08eY5hiSMufMrbJw/edit#gid=595603569). I recall that Nigel had a patch optimizing inflate fast for Intel (https://chromium-review.googlesource.com/c/chromium/src/+/601694) that was pending on investigating the reported performance regression on media_perftests* on Intel related to the Adler-32 patch (https://bugs.chromium.org/p/chromium/issues/detail?id=772950). *By the way, did you look into the regression? It is being 16 days already since the report. You also reported that Nigel's patch yielded an average gain of 2% (https://bugs.chromium.org/p/chromium/issues/detail?id=760853#c21). So what is the plan here so we can coordinate the work: land Nigel's patch and then the SIMD inffast for Intel? Something else?
,
Oct 26 2017
In any case, I was planning to refactor the inffast NEON code to allow to easily implement the same optimization for other architectures. I will do that together with the infback bug fix before we implement further optimizations. Then we can land the Intel specific code.
,
Oct 26 2017
,
Nov 21 2017
#5 > That is pretty cool, I really like the spreadsheets with the the output of image decoding benchmark (do you have a script to import the data into Google docs?). No script, sorry. I compute the results locally, storing them in a .cvs file, and then upload them to google spreadsheets. > 17% is in line with the NEON results for inffast (i.e. 16% as in https://docs.google.com/spreadsheets/d/1WOWEuAMXrxj608WMuHj-9DKv1wN08eY5hiSMufMrbJw/edit#gid=595603569). Yeap, about the same for Intel or NEON chunk copy. > I recall that Nigel had a patch optimizing inflate fast for Intel (https://chromium-review.googlesource.com/c/chromium/src/+/601694) that was pending on investigating the reported performance regression on media_perftests* on Intel related to the Adler-32 patch (https://bugs.chromium.org/p/chromium/issues/detail?id=772950). > *By the way, did you look into the regression? It is being 16 days already since the report. I don't tend to deal with regressions while I'm away on vacation :) Anyhow, the particular media test is about audio channel passthrough mixing, and I don't see how PNG encoding or decoding is involved with that at all. > You also reported that Nigel's patch yielded an average gain of 2% (https://bugs.chromium.org/p/chromium/issues/detail?id=760853#c21). > So what is the plan here so we can coordinate the work: land Nigel's patch and then the SIMD inffast for Intel? Something else? I would land the SIMD inffast for Intel first, it has measurements (#3). Nigel's patch could be added afterward, if measurements of it show that it's a progression on the chunk copy and SIMD adler32 changes.
,
Dec 8 2017
The regression for issue 771209 was examined and the adler32 is not even used by the test involved. (Removing comments above about that side issue).
,
Dec 8 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/c293a3255eb27dee8879f85f2c45dedff58e2452 commit c293a3255eb27dee8879f85f2c45dedff58e2452 Author: Noel Gordon <noel@chromium.org> Date: Fri Dec 08 11:39:34 2017 Improve zlib inflate speed by using SSE2 chunk copy Using SSE2 chunk copies improves the decoding rate of the PNG 140 corpus by an average 17%, giving a total 37% performance increase when combined with SIMD adler32 code ( https://crbug.com/772870#c3 for details). Move the arm-specific code back into the main chunk copy code and generalize the SIMD parts of chunkset_core() with inline function calls for ARM, and Intel SSE2 devices. This removes the TODO from arm/chunkcopy_arm.h, and that file can be deleted as a result. Add SSE2 vector load / store SSE helpers for chunkset_core(). The existing NEON load code had alignment issues, as noted in review. Fix that: use unaligned loads in the ARM helper code. Change chunkcopy.h to use __builtin_memcpy if it's available, use zmemcpy otherwise such as on MSVC. Also call x86_check_features() in inflateInit2_() to keep the adler32 SIMD code path enabled. Update BUILD.gn to conditionally compile the SIMD chunk copy code on Intel SSE2 and ARM NEON devices. Update names.h to add the new symbol defined by the inflate chunk copy code path. Code had various comment styles; pick one and use it consistently everywhere. Add inffast_chunk.h TODO(cblume). Bug: 772870 Change-Id: I47004c68ee675acf418825fb0e1f8fa8018d4342 Reviewed-on: https://chromium-review.googlesource.com/708834 Commit-Queue: Noel Gordon <noel@chromium.org> Reviewed-by: Chris Blume <cblume@chromium.org> Cr-Commit-Position: refs/heads/master@{#522764} [modify] https://crrev.com/c293a3255eb27dee8879f85f2c45dedff58e2452/third_party/zlib/BUILD.gn [delete] https://crrev.com/14b857f982900cb6e61cb350342f35a46644f54a/third_party/zlib/contrib/optimizations/arm/chunkcopy_arm.h [modify] https://crrev.com/c293a3255eb27dee8879f85f2c45dedff58e2452/third_party/zlib/contrib/optimizations/chunkcopy.h [rename] https://crrev.com/c293a3255eb27dee8879f85f2c45dedff58e2452/third_party/zlib/contrib/optimizations/inffast_chunk.c [rename] https://crrev.com/c293a3255eb27dee8879f85f2c45dedff58e2452/third_party/zlib/contrib/optimizations/inffast_chunk.h [modify] https://crrev.com/c293a3255eb27dee8879f85f2c45dedff58e2452/third_party/zlib/contrib/optimizations/inflate.c [modify] https://crrev.com/c293a3255eb27dee8879f85f2c45dedff58e2452/third_party/zlib/names.h
,
Dec 13 2017
Per http://bit.ly/2AfaxhG, "This has shown some lovely performance improvements on all the perf bots EXCEPT Android One, where it shows a clear regression [1]. Any ideas why?" We think it's the -O3 here: It gave us wins on all Android bots _except_ the Android One. Backing out the -O3 part for Android of #15 for now to see if the Android One perf bot recovers.
,
Dec 13 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/53edf03e13a1569f146f136c9e43bc6e476bf33e commit 53edf03e13a1569f146f136c9e43bc6e476bf33e Author: Noel Gordon <noel@chromium.org> Date: Wed Dec 13 03:51:05 2017 Use default optimization on Android for zlib inflate chunk copy Speculative fix: back off from using -O3 on Android for the inflate chunk copy code to see if -O3 caused the perf bot regression on the Android One bot, http://bit.ly/2AfaxhG Bug: 772870 Change-Id: I06bd941224dacbc5d7024a65934faaa53e4a4ce5 Reviewed-on: https://chromium-review.googlesource.com/823503 Commit-Queue: Noel Gordon <noel@chromium.org> Reviewed-by: Mike Klein <mtklein@chromium.org> Reviewed-by: Chris Blume <cblume@chromium.org> Reviewed-by: Noel Gordon <noel@chromium.org> Cr-Commit-Position: refs/heads/master@{#523679} [modify] https://crrev.com/53edf03e13a1569f146f136c9e43bc6e476bf33e/third_party/zlib/BUILD.gn
,
Dec 15 2017
,
Dec 20 2017
,
Feb 5 2018
@cblume, I believe there was regression nearby at the time, issue 795180 by ajwong@ and I followed the trail to your discussion with him about it https://chromium-review.googlesource.com/c/chromium/src/+/826255#message-c71b9219715adfc98132d6fc6bc2bc15977d0f00 Reading it, suggest we should try re-enabling -O3 on ARM again? Excepting Android One, every ARM bot really liked the -O3, so let me know.
,
Feb 5 2018
And what I read from that linked discussion was > cblume@ > So I'm sending this email to: > 1.) Let everyone know there is a small regression, but > 2.) it is probably fine, and > 2.) Noel, it might be worth trying -O3 again. I never got mail about it though :/
,
Feb 5 2018
IIRC, the enable frame pointers (EFP) regression happened *after* the fix for Android one regression (i.e. reverting the -O3). Just re-enabling the flag would produce the same results observed before: a nice improvement on higher end devices (with A72 cores) and a regression on lower end devices (probably with A53 cores and some ARMv7 processors). Some issues of disabling the compiler flag *only* for Android One are: a) Android One is something relatively new (even though announced in 2014, it seems the devices are reaching mass market starting in 2017). b) As a result, you have millions of cheap devices mostly in Asian/African/South American markets that still run some version of Android (!= Android One). So a question that should be asked is: should we have a good gain in higher end devices at the cost of a considerable cost for lower end devices? We still have some performance work to complete (e.g. using ARMv8-a crc32 will equally help both A72 as also A53). What would be *ideal* would be if the Google App store distributed *target specific* builds of Chrome. I suggested this in chromium-dev (i.e. have ARMv8 and ARMv7 specific builds) and heard that it would require considerable(?) changes in the app store backend. I wouldn't ask to support this feature (i.e. target specific apks) for everyone, but maybe an exception could be made for Chrome? Just an idea. Maybe similar to Fuchsia and the ARM cpu-features syscall, you guys could ask around and see if that would be feasible (i.e. target specific apks in the app store *only* for chrome)? I have the gut feeling that we could have a nice global gain if the *whole* Chromium was specially built to better use the ARM device hardware resources.
,
Feb 5 2018
Just confirmed what I mentioned before, the EFP regression happened *after* the fix in Android one. The first peak in Android One was thanks to the -O3 and the second was thanks to the EFP: https://chromeperf.appspot.com/report?sid=f77857a3ac817e41f64f9d596aa1bc19b8dfe0f57dfb5610cb5e788c0563a355&start_rev=519183&end_rev=526297 So they are unrelated and we should expect to observe the same behavior as before.
,
Feb 5 2018
As I understand it, separate ARMv7 and ARMv8 builds have been possible in the app store since ARMv8 was introduced, but Chromium just doesn't want to do that. Chrome problem, not Android. I think at one point Chromium did ship a A64 build, or maybe just one phone shipped with it, but later upgraded back to an ARMv7 compatible A32 build.
,
Feb 5 2018
@Mike: so maybe is an issue of just getting buy in from the people involved provided that there are such performance gains?
,
Feb 5 2018
So basically the idea would be to still keep building 32bits binaries (as 64bits binaries are bigger and use more memory), but using proper optimizations flags (e.g. -O3 || -O0), target (e.g. -march=armv7 || -march=armv8), etc and distribute the best match apk for the device. Like I said, just an idea (and would require some hard data to justify the approach).
,
Feb 15 2018
The app store stuff is interesting, but I wasn't asking about that in particular. I was considering, with the zlib optimizations we've added, how we have already given much better performance to Android One, and indeed to all other Android devices we test. So can we take some of the gain we've provided back from Android One (5% loss with -O3) to provide additional gain to all those other Android devices?
,
Mar 19 2018
^^^ polite ping :)
,
Mar 19 2018
,
Mar 19 2018
> So can we take some of the gain we've provided back from Android One (5% loss > with -O3) to provide additional gain to all those other Android devices? Sgtm. #22 sounds like cblume@ supports it.
,
Mar 21 2018
Not actively working on this bug mind: FTR, required work would be to reapply https://chromium-review.googlesource.com/c/chromium/src/+/823503
,
Mar 21 2018
Ahem, undo that change :) to be precise.
,
Mar 21 2018
I can help on this one. The first thing would be to assess what would be the impact on modern and affordable ARMv8 SoCs (e.g. A53, which are in-order), as those are commonly used in Android One cheap devices. If it is not a regression in there, that would justify the possible penalty on older/cheaper ARMv7 devices. We now have zlib_bench for the task (which also runs standalone on Android), which will make things much easier.
,
Mar 21 2018
Ok, I'm still running the benchmarks but the initial data seems to point that: a) It is *not* a perf regression on A53 (actually it may be a small perf gain in most cases, 1 to 2%). b) It is a perf regression in some cases (up to 5%) on ARMv7, but at least it won't be worst than vanilla zlib.
,
Mar 21 2018
,
Mar 22 2018
Plan sgtm, initial results promising and thank you for helping here, Adenlison.
,
Mar 22 2018
,
Mar 22 2018
Quite interesting that it seems to hurt perf on Intel (I guess we better enable the flag only for ARM). a) ToT adenilson@chrome-monster:~/chromium/src/out/Release$ ./zlib_bench gzip ~/temp/snappy/testdata/* /home/adenilson/temp/snappy/testdata/alice29.txt : GZIP: [b 1M] bytes 152089 -> 54397 35.8% comp 18.4 ( 18.5) MB/s uncomp 497.5 (499.2) MB/s /home/adenilson/temp/snappy/testdata/asyoulik.txt : GZIP: [b 1M] bytes 125179 -> 48927 39.1% comp 17.5 ( 17.8) MB/s uncomp 418.7 (428.1) MB/s /home/adenilson/temp/snappy/testdata/baddata1.snappy : GZIP: [b 1M] bytes 27512 -> 22920 83.3% comp 44.2 ( 44.8) MB/s uncomp 215.6 (217.5) MB/s /home/adenilson/temp/snappy/testdata/baddata2.snappy : GZIP: [b 1M] bytes 27483 -> 23000 83.7% comp 44.7 ( 44.9) MB/s uncomp 214.5 (216.9) MB/s /home/adenilson/temp/snappy/testdata/baddata3.snappy : GZIP: [b 1M] bytes 28384 -> 23705 83.5% comp 44.7 ( 44.7) MB/s uncomp 212.4 (212.9) MB/s /home/adenilson/temp/snappy/testdata/fireworks.jpeg : GZIP: [b 1M] bytes 123093 -> 122927 99.9% comp 47.0 ( 48.0) MB/s uncomp 1243.4 (1245.1) MB/s /home/adenilson/temp/snappy/testdata/geo.protodata : GZIP: [b 1M] bytes 118588 -> 15124 12.8% comp 101.4 (103.2) MB/s uncomp 1036.2 (1122.2) MB/s /home/adenilson/temp/snappy/testdata/html : GZIP: [b 1M] bytes 102400 -> 13707 13.4% comp 80.1 ( 80.2) MB/s uncomp 930.5 (978.9) MB/s /home/adenilson/temp/snappy/testdata/html_x_4 : GZIP: [b 1M] bytes 409600 -> 53285 13.0% comp 73.0 ( 73.1) MB/s uncomp 952.3 (977.3) MB/s /home/adenilson/temp/snappy/testdata/kppkn.gtb : GZIP: [b 1M] bytes 184320 -> 38727 21.0% comp 11.9 ( 11.9) MB/s uncomp 659.9 (661.7) MB/s /home/adenilson/temp/snappy/testdata/lcet10.txt : GZIP: [b 1M] bytes 426754 -> 144867 33.9% comp 19.9 ( 19.9) MB/s uncomp 496.8 (511.4) MB/s /home/adenilson/temp/snappy/testdata/paper-100k.pdf : GZIP: [b 1M] bytes 102400 -> 81276 79.4% comp 48.5 ( 48.6) MB/s uncomp 420.1 (429.8) MB/s /home/adenilson/temp/snappy/testdata/plrabn12.txt : GZIP: [b 1M] bytes 481861 -> 195102 40.5% comp 14.9 ( 14.9) MB/s uncomp 424.8 (444.2) MB/s /home/adenilson/temp/snappy/testdata/urls.10K : GZIP: [b 1M] bytes 702087 -> 222358 31.7% comp 40.6 ( 40.7) MB/s uncomp 462.4 (478.5) MB/s b) Compiler flag enabled adenilson@chrome-monster:~/chromium/src/out/Release$ time ./zlib_bench gzip ~/temp/snappy/testdata/* /home/adenilson/temp/snappy/testdata/alice29.txt : GZIP: [b 1M] bytes 152089 -> 54397 35.8% comp 19.2 ( 19.3) MB/s uncomp 483.7 (490.1) MB/s /home/adenilson/temp/snappy/testdata/asyoulik.txt : GZIP: [b 1M] bytes 125179 -> 48927 39.1% comp 17.5 ( 17.5) MB/s uncomp 417.0 (433.1) MB/s /home/adenilson/temp/snappy/testdata/baddata1.snappy : GZIP: [b 1M] bytes 27512 -> 22920 83.3% comp 44.0 ( 44.4) MB/s uncomp 210.8 (212.9) MB/s /home/adenilson/temp/snappy/testdata/baddata2.snappy : GZIP: [b 1M] bytes 27483 -> 23000 83.7% comp 44.1 ( 44.2) MB/s uncomp 213.4 (215.2) MB/s /home/adenilson/temp/snappy/testdata/baddata3.snappy : GZIP: [b 1M] bytes 28384 -> 23705 83.5% comp 44.0 ( 44.0) MB/s uncomp 210.5 (212.1) MB/s /home/adenilson/temp/snappy/testdata/fireworks.jpeg : GZIP: [b 1M] bytes 123093 -> 122927 99.9% comp 46.2 ( 46.8) MB/s uncomp 1196.4 (1196.8) MB/s /home/adenilson/temp/snappy/testdata/geo.protodata : GZIP: [b 1M] bytes 118588 -> 15124 12.8% comp 99.0 (100.8) MB/s uncomp 1044.0 (1056.2) MB/s /home/adenilson/temp/snappy/testdata/html : GZIP: [b 1M] bytes 102400 -> 13707 13.4% comp 75.4 ( 77.7) MB/s uncomp 925.3 (927.0) MB/s /home/adenilson/temp/snappy/testdata/html_x_4 : GZIP: [b 1M] bytes 409600 -> 53285 13.0% comp 71.9 ( 73.0) MB/s uncomp 940.1 (944.2) MB/s /home/adenilson/temp/snappy/testdata/kppkn.gtb : GZIP: [b 1M] bytes 184320 -> 38727 21.0% comp 11.6 ( 11.6) MB/s uncomp 633.8 (658.2) MB/s /home/adenilson/temp/snappy/testdata/lcet10.txt : GZIP: [b 1M] bytes 426754 -> 144867 33.9% comp 18.7 ( 19.3) MB/s uncomp 484.4 (505.8) MB/s /home/adenilson/temp/snappy/testdata/paper-100k.pdf : GZIP: [b 1M] bytes 102400 -> 81276 79.4% comp 48.9 ( 48.9) MB/s uncomp 354.6 (390.0) MB/s /home/adenilson/temp/snappy/testdata/plrabn12.txt : GZIP: [b 1M] bytes 481861 -> 195102 40.5% comp 14.5 ( 14.6) MB/s uncomp 427.9 (444.9) MB/s /home/adenilson/temp/snappy/testdata/urls.10K : GZIP: [b 1M] bytes 702087 -> 222358 31.7% comp 39.2 ( 40.1) MB/s uncomp 452.1 (472.7) MB/s
,
Mar 22 2018
Experiment results are available here: https://docs.google.com/spreadsheets/d/1kLp0UfvCbnAvJUnSQU3b9wMopjqQsfKR0izFMu1Zd0Q/edit?usp=sharing Comments: for little ARMv8 cores seems to be a general small win (the 2 perf regressions are too small and could be within the noise), while for ARMv7 is a loss in 6/14 cases (on Nexus4 it actually pushed the result below vanilla zlib in 1 case). That being said, those losses are in bad compression cases (i.e. baddata.snappy, jpeg, pdf), while we observed a win for html content (1 to 2% improvement on ARMv7). The averages are: a) rk3328 (ARMv8): 1.021337833 b) nexus4 (ARMv7): 0.9986796919 c) nexus6 (ARMv7): 1.007643199 I think is safe to proceed given that we already know that it will be a nice perf gain on big ARMv8 cores (around 8-10%) and the regression on ARMv7 is manageable given the NEON optimizations implemented in inflate_fast.
,
Mar 23 2018
"Quite interesting that it seems to hurt perf on Intel", when measuring on Intel you need to add the official build optimization rule inside every contribution in zlib/BUILD.gn file:
source_set("zlib_some_contribution") {
visibility = [ ":*" ]
if (!is_ios && (current_cpu == "x86" || current_cpu == "x64")) {
...
configs -= [ "//build/config/compiler:default_optimization" ]
configs += [ "//build/config/compiler:optimize_speed" ]
}
}
to measure perf that reflects how our chrome official releases are built (for those builds, the default_optimization is optimize_speed).
Note: "those losses are in bad compression cases (i.e. baddata.snappy, jpeg, pdf)" they these cases are surprisingly common on the web [1].
[1] https://discuss.httparchive.org/t/sites-that-deliver-images-using-gzip-deflate-encoding/220
If we are concerned about ARMv7 regressions, maybe we can control it with a gn arg (I left code review comments about it).
,
Mar 23 2018
> If we are concerned about ARMv7 regressions, maybe we can control it with a gn arg (I left code review comments about it). Pretty sure we still only ship ARMv7 builds to both ARMv7 and ARMv8 devices. If we disable for ARMv7, it'll turn it off on all ARM, right?
,
Mar 26 2018
Right, seems using a gn arg won't help here.
,
Mar 27 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/ffc960a39d26fda610e0023667b2465118b8580a commit ffc960a39d26fda610e0023667b2465118b8580a Author: Adenilson Cavalcanti <cavalcantii@chromium.org> Date: Tue Mar 27 04:50:58 2018 Enable compiler optimization in inflate_fast on ARM Generate optimized code that will help ARMv8 chips (between 2 to 10% perf gain on ARMv8 little/big cores) at the expense of a small perf regression on older chips (i.e. ARMv7). Bug: 772870 Change-Id: Ifc47870724da9790c944f76fedcc225d0a896caf Reviewed-on: https://chromium-review.googlesource.com/976521 Commit-Queue: Adenilson Cavalcanti <cavalcantii@chromium.org> Reviewed-by: Chris Blume <cblume@chromium.org> Reviewed-by: Noel Gordon <noel@chromium.org> Cr-Commit-Position: refs/heads/master@{#546016} [modify] https://crrev.com/ffc960a39d26fda610e0023667b2465118b8580a/third_party/zlib/BUILD.gn
,
Mar 27 2018
,
Mar 28 2018
Now waiting on the Android One perf PNG test bot to complain (as it did before #17).
,
Jun 30 2018
No Android One perf PNG test bot complained. |
|||||||||||
►
Sign in to add a comment |
|||||||||||
Comment 1 by noel@chromium.org
, Oct 9 2017Cc: cavalcantii@chromium.org cblume@chromium.org nigeltao@chromium.org sho...@google.com mtklein@chromium.org
Owner: noel@chromium.org