zlib adler32 speed-ups |
|||||||||
Issue descriptionzlib adler32 speed-ups third/party/zlib is used for PNG decoding, which spends 75% of the decode time in z_inflate_fast, the remaining time is spent in the row write loops in libPNG writing out the decoded pixels (some perf work has already been done there, issue 234071 , issue 258324 ). About 15% to 20% of the z_inflate_fast time is spent computing the adler32 checksum. If we could speed it up, that should translate into a PNG decoding win. On x86/x64 targets, SSE might help. libdeflate has an SSE2 adler32, but it's not quite suitable for PNG decoding, where small workloads (buffers) need special attention to keep things fast. In https://chromium-review.googlesource.com/c/chromium/src/+/611492, NEON is proposed to speed up adler32 on ARM using a "taps" multiplier method for the core loop. I reviewed the adler32 computation, https://en.wikipedia.org/wiki/Adler-32, and wrote out the math to confirm the "taps" method is sound (it is). I reviewed the existing alder32_z https://github.com/madler/zlib/blob/v1.2.11/adler32.c to gain understanding of its implementation, and the importance of the NMAX (5552) parameter used to protect from integer overflow when the adler32 component sums (called Adler A and B) are computed in uint32_t type. A naive adler32 implementation is: uint32_t adler32(uint32_t adler, const unsigned char* buf, size_t len) { /* split Adler-32 into component sums A and B (aka s1 and s2) */ uint32_t s1 = adler & 0xffff; uint32_t s2 = adler >> 16; for (size_t i = 0; i < len; ++i) { s1 += *buf++; s2 += s1; s1 %= 65521U; s2 %= 65521U; } /* return recombined sums */ return s1 | (s2 << 16); } The Adler A and B sums must be computed modulo the adler32 BASE value (65521). This reduction, or modulo, step is expensive, and alder32_z avoids it -- it processes 5552 bytes of input before each reduce. This limits the number of modulo operations required (for speed) _and_ avoids overflowing the uint32_t types used to represent Adler A and B during the computation (for correctness of the reduce step). For small workloads, alder32_z also uses special clauses to minimize computation and simplify the modulo operation for the Adler A sum to a subtraction. Seems sane, and seems doable in SIMD with attention to the above. SSE3 would be a reasonable target for chrome desktop users, in particular, SSSE3 (triple-S-E3) - it has new instructions that could do the taps mult part fast (Adler B) - _mm_sad_epu8() is fast for byte sums (Adler A), http://bit.ly/2wpUOeD We'd have to write an SSSE3 impl from scratch though, one that - should handle small workloads well (the PNG decode case) - should handle large workloads well (the general inflate case) - should work well with issue 760853 - have small code size to preserve instruction cache - integrate easily into zlib as a zlib/contrib - have a build flag enable, and SSSE3 run-time check
,
Sep 6 2017
Next the SSE to meet the goals (small / big workloads, code size< overflow control, etc) and some more benchmarking: % ./bench large random data length 16 MB --test-- alder32 Min time Min Rate Median time Median Rate zlib sse3 8838e342 1143 u-secs 14.00 MB/s 1254 u-secs 12.76 MB/s zlib 8838e342 6937 u-secs 2.31 MB/s 7020 u-secs 2.28 MB/s random data length 8 MB --test-- alder32 Min time Min Rate Median time Median Rate zlib sse3 a286e45a 602 u-secs 13.29 MB/s 664 u-secs 12.05 MB/s zlib a286e45a 3461 u-secs 2.31 MB/s 3564 u-secs 2.24 MB/s random data length 4 MB --test-- alder32 Min time Min Rate Median time Median Rate zlib sse3 f257c4e5 227 u-secs 17.62 MB/s 311 u-secs 12.86 MB/s zlib f257c4e5 1715 u-secs 2.33 MB/s 1735 u-secs 2.31 MB/s random data length 2 MB --test-- alder32 Min time Min Rate Median time Median Rate zlib sse3 e0159951 91 u-secs 21.98 MB/s 92 u-secs 21.74 MB/s zlib e0159951 852 u-secs 2.35 MB/s 855 u-secs 2.34 MB/s random data length 1 MB --test-- alder32 Min time Min Rate Median time Median Rate zlib sse3 6f852dfa 46 u-secs 21.74 MB/s 46 u-secs 21.74 MB/s zlib 6f852dfa 426 u-secs 2.35 MB/s 455 u-secs 2.20 MB/s random data length 524288 bytes --test-- alder32 Min time Min Rate Median time Median Rate zlib sse3 f6a8649a 23 u-secs 21.74 MB/s 23 u-secs 21.74 MB/s zlib f6a8649a 213 u-secs 2.35 MB/s 214 u-secs 2.34 MB/s random data length 262144 bytes --test-- alder32 Min time Min Rate Median time Median Rate zlib sse3 6b5315cf 10 u-secs 25.00 MB/s 11 u-secs 22.73 MB/s zlib 6b5315cf 106 u-secs 2.36 MB/s 113 u-secs 2.21 MB/s random data length 131072 bytes --test-- alder32 Min time Min Rate Median time Median Rate zlib sse3 e3b89979 5 u-secs 25.00 MB/s 6 u-secs 20.83 MB/s zlib e3b89979 53 u-secs 2.36 MB/s 58 u-secs 2.16 MB/s random data length 65536 bytes --test-- alder32 Min time Min Rate Median time Median Rate zlib sse3 390b83f7 2 u-secs 31.25 MB/s 3 u-secs 20.83 MB/s zlib 390b83f7 28 u-secs 2.23 MB/s 28 u-secs 2.23 MB/s random data length 32768 bytes --test-- alder32 Min time Min Rate Median time Median Rate zlib sse3 f544e187 1 u-secs 31.25 MB/s 1 u-secs 31.25 MB/s zlib f544e187 13 u-secs 2.40 MB/s 13 u-secs 2.40 MB/s Next I measured with PNG 140 corpus on 4 machines. I posted all the results on https://bugs.chromium.org/p/chromium/issues/detail?id=760853#c21 seems like a nice win for PNG, about 8-9% on the average across machines, few losses. Some images experience gains >> 15%. On the macbook pro (Core i7), all corpus images improve.
,
Sep 6 2017
Uploaded https://chromium-review.googlesource.com/c/chromium/src/+/653337 with the SSSE3 code ready for zlib if you'd care to review. Remaining things TODO in subsequent patches: - fill out README.chromium - actually call it from adler32_z() - work out how to do the run-time check - the Intel deflate contribution has the basis for the cpu check - zlib/x86.h - but that contribution is being moved or something - is their a bug for this @cblume? - fix the BUILD.gn file
,
Sep 7 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/94b37d283b45036a471e316ffcaf8fe71914a7ac commit 94b37d283b45036a471e316ffcaf8fe71914a7ac Author: Nigel Tao <nigeltao@chromium.org> Date: Thu Sep 07 03:24:45 2017 Run "git cl format" on third_party/zlib. TBR=thakis@chromium.org Bug:762564 Change-Id: I63a809b785a21190b0b8a1744cf073a02beaa239 Reviewed-on: https://chromium-review.googlesource.com/644512 Commit-Queue: Noel Gordon <noel@chromium.org> Reviewed-by: Stuart Langley <slangley@chromium.org> Reviewed-by: Noel Gordon <noel@chromium.org> Cr-Commit-Position: refs/heads/master@{#500206} [modify] https://crrev.com/94b37d283b45036a471e316ffcaf8fe71914a7ac/third_party/zlib/BUILD.gn
,
Sep 11 2017
Across the 140 PNG measurement corpus, the results on different machines: HP Z840 2016 * 16-core Intel Xeon E5-2683 V4 2.1GHz Broadwell-EP (14nm) * Turbo Boost off. win10-z840-slow-fast-zlib-adler-sse https://docs.google.com/spreadsheets/d/1V0DF-NYrKo1xWF88N-5DJGQfUDtmo0VkhHm9uOaQ_7M/edit#gid=1475395966 DELL PRECISION M3800 LAPTOP * Intel Core i7-4712HQ 2.3GHz Haswell-H (22nm) win10-laptop-slow-fast-zlib-adler-sse https://docs.google.com/spreadsheets/d/1eUDlm5GLA7Zd7IfEd9NUaM1wkes-WzvZBI6rKriZskk/edit#gid=1546726966 MAC PRO 2013 (12-CORE) 2013 * 12-core Intel Xeon E5-2697 V2, 2.7GHz Ivy Bridge-EP (22 nm) mac-pro-slow-fast-zlib-adler-ssse3 https://docs.google.com/spreadsheets/d/1YKF2ShSqwvxfVUPlg-_1v0pZxsnqYBqy65aS9CClWGE/edit#gid=595603569 MAC BOOK PRO 13" 2016 (TOUCH BAR) LAPTOP * Dual-core Intel Core i7-6567U, 3.3GHz Skylake-U (14nm) mac-book-pro-slow-fast-zlib-adler-sse https://docs.google.com/spreadsheets/d/1m-d-aF2hHyN6Oi8HHV4Htz_7Xgdh6kEnKnDjKc3WsrU/edit#gid=595603569 Measurement patch https://chromium-review.googlesource.com/c/chromium/src/+/609054 for Nigel's patch (from issue 760853 ). The results of that are graphed as the blue line labelled "fast zlib". The SSE results are graphed as the red-line: decoding rate MB/s with the change, divided by the decoding rate MB/s with no change (decoding rate ratio). The x-axis is the corpus image (sorted in ascending order).
,
Sep 11 2017
> Remaining things TODO in subsequent patches: > - ... > - work out how to do the run-time check > - ... > - is their a bug for this @cblume? You mean a bug for moving Intel's contributions out of top-level? There is no bug. I sent them an email about upstreaming the patch. Apparently it was originally upstreamed via the zlib dev mailing list, but the review didn't really happen and nothing came of it. I'll make a patch for it eventually (moving to contrib/ and making a github pull request rather than the mailing list). Fixing existing complication is a lower priority than introducing further complication.
,
Sep 11 2017
,
Sep 11 2017
@Noel: I just created an issue to track the Intel code refactoring: https://bugs.chromium.org/p/chromium/issues/detail?id=764094 It has also the description of how the folders will be organized (but basically Intel specific code will be hosted in 'contrib/x86').
,
Sep 12 2017
,
Sep 14 2017
,
Sep 15 2017
#6 > You mean a bug for moving Intel's contributions out of top-level? Yeap. Got some answers above, thx. Couple of issues for the work herein [1], one of which was I need modify, and also call that Intel code to runtime detect SSSE3 support for my stuff :) Seems to me we end up having contribs depending on other contribs (and if they were turtles, then it's turtles all the way down). [1] https://chromium-review.googlesource.com/c/chromium/src/+/660019#message-8a7ffab42f74bc0ee9ae22bcf2d030b4df3d2d8e Second thing, copyrights. I'm was using "* For conditions of distribution and use, see copyright notice in zlib.h" Do I need to add anything more that? This came up in [2], so any advice you have would help. [2] https://chromium-review.googlesource.com/c/chromium/src/+/660019/16/third_party/zlib/adler32_simd.c#2
,
Sep 15 2017
Maybe this is a silly idea... Can we modify libpng or zlib to not compute checksums at all while decoding? I'm imagining 4 scenarios: 1) The image has good checksums and we check them. We decode normally. 2) The image has good checksums and we don't check them. We decode normally faster. 3) The image has bad checksums and we check them. We abort the decode. 4) The image has bad checksums and we don't check them. We decode wrong. Surely there's no difference in security between 3) and 4). The checksums are just as untrusted as the data they summarize... if there's an enemy, they're all crafted by that enemy. So I suspect the only difference is between aborting the decode and returning a possibly nonsense decode. Is that difference important?
,
Sep 15 2017
Mike I thought about that and it would be easy to skip all checksums . But on the other hand, if the data has being corrupted while traveling the network, wouldn't that imply in continuing spending CPU to render an image that is going to be displayed garbled? In this scenario, is faster and better for the user to abort any further work.
,
Sep 15 2017
Still, IIRC the PNG spec allows the UA to decide if it is going to start displaying the content as it arrives from the network or to only display the content after having all the bytes and having performed the validations using the checksums.
,
Sep 15 2017
> But on the other hand, if the data has being corrupted while traveling the network, wouldn't that imply in continuing spending CPU to render an image that is going to be displayed garbled?
Yes, it is a waste of time to continue decoding a broken .png, but we don't really care about that. The proportion of .pngs that are broken is vanishingly small. The balance of
time = time(broken)*P(broken) + time(good)*(1 - P(broken))
is going to drop dramatically because P(broken) is tiny.
,
Sep 15 2017
Yes, indeed. But on the other hand that would make Chromium a non-compliant PNG viewer. Not sure if that would break layout tests. :-( And I doubt that we could ever upstream this change.
,
Sep 15 2017
I agree the main issue here would be, does it make things better or worse for Chromium's users? We control Blink layout tests, so we can put that part out of our minds. Have we not already completely given up on upstreaming everything we're doing? Very seriously asking this. I'm currently considering us forked given the lack of interest and interaction we've been able to get from madler.
,
Sep 15 2017
We are talking about quite a few layout tests here (last time I checked, over 40K). Not to mention that some of them come from W3C and all browsers are expected to behave in the same way. Before we reach to this step (i.e. skipping all checksums), I noticed that the bottleneck now seems to be moved, at least on ARM, to the Compositor (CC) image cache. I say we go and optimize our code in CC before going to extremes. :-) Concerning upstreaming, I would say not yet. As an example, we succeeded in upstreaming all ARM patches to zlib-ng 5 months ago (the plus is that they already have all the Intel patches): https://github.com/Dead2/zlib-ng/commit/ec02ecf104e1d3f1836a908a359f20aa93494df5 Even though the primary goal of this investigation is to improve performance, I still have as secondary goal one day to be able to no longer maintain a zlib fork but instead just use one that is 'ready for deploy'. When you got some spare time, I ask you to please check the report I wrote earlier this year: https://goo.gl/ZUoy96
,
Sep 15 2017
Ah, yes, I do agree that finding a new active upstream is probably better than maintaining yet another fork. Surely all browsers cannot be expected to behave the same with malformed inputs?
,
Sep 15 2017
> And I doubt that we could ever upstream this change.
There's not anything that we would need to upstream. One line in Chromium's client code would make this change:
png_set_crc_action(png_, PNG_CRC_QUIET_USE, PNG_CRC_QUIET_USE);
That tells libpng to not calculate CRCs for critical chunks nor ancillary chunks. We already use it when we modify the encoded data so we can use standard libpng to decode fdAT frames for APNG.
For what it's worth, the spec [1] says:
"A PNG decoder should handle errors as follows:
Detect errors as early as possible using the PNG signature bytes and CRCs on each chunk. Decoders should verify that all eight bytes of the PNG signature are correct. A decoder can have additional confidence in the datastream's integrity if the next eight bytes begin an IHDR chunk with the correct chunk length. A CRC should be checked before processing the chunk data. Sometimes this is impractical, for example when a streaming PNG decoder is processing a large IDAT chunk. In this case the CRC should be checked when the end of the chunk is reached."
But then it goes on to say "Errors that have little or no effect on the processing of the image may be ignored".
Note that it says "should" (check the CRC) here, rather than "must". And I think there's a strong case for CRC errors having "little or no effect on the processing of the image".
> I agree the main issue here would be, does it make things better or worse for Chromium's users?
+1. Based on the opening comment in this bug, it sounds like skipping the calculation would be faster.
> We are talking about quite a few layout tests here (last time I checked, over 40K).
Most of them are probably not confirming that we do not show PNGs with bad CRCs.
> Not to mention that some of them come from W3C and all browsers are expected to behave in the same way.
Indeed. The Web Platform Tests (I think [2] is the right place) are used to test all browsers. I don't know whether it covers PNGs with bad CRCs. It would be in the spirit of the Web Platform Tests to convince the other major browsers to agree on skipping the CRC check. Firefox does not appear to skip the CRC check, just based on the fact that I didn't see the above code line in their decoder. No idea whether they've given it much thought.
[1] https://www.w3.org/TR/PNG/
[2] https://github.com/w3c/web-platform-tests
,
Sep 15 2017
Concerning the Compositor, for reference please check the comment #38 where I posted a screenshot of a Chrome trace for a know PNG test case: https://bugs.chromium.org/p/chromium/issues/detail?id=687631#c38
,
Sep 15 2017
So, uh, that's awesome. Not to derail the CLs you two have been working on, but wouldn't we rather land that one liner in Chromium if it really does completely cut the adler32() runtime to zero?
,
Sep 15 2017
@Mike: the behavior of all browsers should ideally be the same, even for broken content as that improves the portability of web content. But that is another can of worms that I would like to avoid discussing (or discuss it over some beers instead). :-) I rather focus in improving all our software stack before going to extremes. @Leo: about the comment "It would be in the spirit of the Web Platform Tests to convince the other major browsers to agree on skipping the CRC check." I've once worked with specs (was part of the W3C CSS WG) and I can tell that is an herculean task to change even non controversial aspects of w3c specs. It is extremely hard to get all the people together and even harder to get them all to agree. Not saying it is impossible, it is just time consuming.
,
Sep 15 2017
> Not to derail the CLs you two have been working on, but wouldn't we rather land > that one liner in Chromium if it really does completely cut the adler32() > runtime to zero? Certainly something I would love to experiment. :-)
,
Sep 15 2017
I think we could easily argue the opposite here too, that it's in browsers' interest to be as incompatible as possible for broken content. This encourages content to not be broken.
,
Sep 15 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/dcf65ae672b4b2e3235310f7ef7415294e75e472 commit dcf65ae672b4b2e3235310f7ef7415294e75e472 Author: Mike Klein <mtklein@chromium.org> Date: Fri Sep 15 22:18:54 2017 Revert "Run "git cl format" on third_party/zlib." This reverts commit 94b37d283b45036a471e316ffcaf8fe71914a7ac. Reason for revert: need to revert earlier third_party/zlib CLs. Original change's description: > Run "git cl format" on third_party/zlib. > > TBR=thakis@chromium.org > Bug:762564 > > Change-Id: I63a809b785a21190b0b8a1744cf073a02beaa239 > Reviewed-on: https://chromium-review.googlesource.com/644512 > Commit-Queue: Noel Gordon <noel@chromium.org> > Reviewed-by: Stuart Langley <slangley@chromium.org> > Reviewed-by: Noel Gordon <noel@chromium.org> > Cr-Commit-Position: refs/heads/master@{#500206} TBR=noel@chromium.org,cavalcantii@chromium.org,cblume@chromium.org,slangley@chromium.org,nigeltao@chromium.org # Not skipping CQ checks because original CL landed > 1 day ago. Bug: 762564 Change-Id: Ie01454d9b4f24da84a6ae9120ff9d73cf20f32b7 Reviewed-on: https://chromium-review.googlesource.com/669539 Reviewed-by: Mike Klein <mtklein@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org> Cr-Commit-Position: refs/heads/master@{#502402} [modify] https://crrev.com/dcf65ae672b4b2e3235310f7ef7415294e75e472/third_party/zlib/BUILD.gn
,
Sep 15 2017
@Leo: not claiming that *not* calculating the checksums is necessarily a bad idea or that we can ever make it *faster* than a 'nop', but it is important to clarify that in PNG the CRC and the Adler-32 checksums are used for different purposes. IIRC, the first is used in the compressed content and the later is used in the uncompressed content. For CRC32 we have this nice instruction on ARMv8 which is 10x faster than the C implementation (with potential to make it even faster using a vmull): https://chromium-review.googlesource.com/c/612629/ It is just too bad that I was told that Chrome is shipped on Android as an ARMv7 binary and apparently it would be impossible to have a custom built of Chrome in the 'Play Store' (or maybe no one ever bothered to look into it?). On the other hand, I wonder about the viability of implementing an optimized CRC using NEON. Chromium's zlib already have an Intel optimized CRC32 using SSE.
,
Sep 15 2017
The funny part is that Intel has a CRC instruction but they use the 'wrong' polynomial (so useless for zlib/libpng).
,
Sep 15 2017
It looks like png_set_crc_action(png_, PNG_CRC_QUIET_USE, PNG_CRC_QUIET_USE) will turn off both CRC32 and adler32 checks as of libpng-1.6.26. Chromium is on libpng-1.6.22.
,
Sep 15 2017
I guess is time for an update? :-)
,
Sep 18 2017
I've been meaning to update anyway. Will prepare a patch.
,
Sep 19 2017
#31> I've been meaning to update anyway. Will prepare a patch. Sounds good. #12 > Maybe this is a silly idea... watk@ mentioned the same idea to me two weeks ago... > Can we modify libpng or zlib to not compute checksums at all while decoding? > I'm imagining 4 scenarios: > ... > So I suspect the only difference is between aborting the decode and returning a possibly nonsense decode. Is that difference important? Maybe, if it happened more often? By skipping adler32, you effectively increase the Pud (probability of undetected error) by some factor that will depend on the quality of the sub-network(s) and transport network(s) (802.x, ether, ATM AAL5, fiber, satellite, etc) connecting the chrome user to the server (see the older study in [1]). [1] http://conferences.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-9-1.pdf PNG was being designed around that time, and the paper recommendations the use of application-level checksums. On the other hand, what checksums are used in JPEG, or WebP? What protects them? > Surely all browsers cannot be expected to behave the same with malformed inputs? True. We render image data as it is being received, and when/where/how we detect errors in the stream can vary b/w vendors, producing different renderings. At least for Chrome, if an image is malformed, it's important that we render it the same way always. Do otherwise, and it can look suspiciously like a sec-issue, and will need investigation, see issue 398235 . #22 > Not to derail the CLs you two have been working on, but wouldn't we rather land that one liner in Chromium if it really does completely cut the adler32() runtime to zero? Fine, given more experimental data on the implications. Right now I'd prefer to make chrome faster while someone gathers said data. Also note this work is not just about PNG - alder32() is used in the network stack, WebSockets, PDF, etc ...
,
Sep 21 2017
... and consider disabling adler32 in those other places as well :). Updating the bench.cpp (attached) I used to vet the current work. Another question about it, how do you run it on an Android device? Well, install the latest Android NDK on your host platform (OSX for me), and also install adb if needed. % pwd ~/android/android-ndk-r15c Build bench.cpp as an arm(64 in my case) executable. % export ndkroot=$(pwd) % ./toolchains/aarch64-linux-android-4.9/prebuilt/darwin-x86_64/bin/aarch64-linux-android-g++ --sysroot=$ndkroot/platforms/android-23/arch-arm64 -pie -fno-exceptions -std=c++14 -O3 -Wall -I $ndkroot/sources/cxx-stl/stlport/stlport $ndkroot/sources/cxx-stl/stlport/libs/arm64-v8a/libstlport_static.a bench.cpp Now send a.out to your Android device, then shell into the device and run it. %adb push a.out /data/local/tmp a.out: 1 file pushed. 4.6 MB/s (17680 bytes in 0.004s) % adb shell dragon:/ # nice -n -20 /data/local/tmp/a.out [large|small|stress|znull] enjoy.
,
Sep 21 2017
1) Android Pixel C tablet, arm8 quad-core Nvidia X1 CPU, 3G RAM dragon:/ # nice -n -20 /data/local/tmp/a.out large random data length 16 MB --test-- adler32 Min time Min Rate Median time Median Rate zlib f045afbe 12442 u-secs 1.29 MB/s 12464 u-secs 1.28 MB/s zlib simd f045afbe 4693 u-secs 3.41 MB/s 4721 u-secs 3.39 MB/s random data length 8 MB --test-- adler32 Min time Min Rate Median time Median Rate zlib 3b2ec782 6231 u-secs 1.28 MB/s 6249 u-secs 1.28 MB/s zlib simd 3b2ec782 2349 u-secs 3.41 MB/s 2363 u-secs 3.39 MB/s random data length 4 MB --test-- adler32 Min time Min Rate Median time Median Rate zlib 51242a71 3118 u-secs 1.28 MB/s 3142 u-secs 1.27 MB/s zlib simd 51242a71 1192 u-secs 3.36 MB/s 1206 u-secs 3.32 MB/s random data length 2 MB --test-- adler32 Min time Min Rate Median time Median Rate zlib e6bbf630 1572 u-secs 1.27 MB/s 1606 u-secs 1.25 MB/s zlib simd e6bbf630 641 u-secs 3.12 MB/s 666 u-secs 3.00 MB/s random data length 1 MB --test-- adler32 Min time Min Rate Median time Median Rate zlib 2557939c 751 u-secs 1.33 MB/s 759 u-secs 1.32 MB/s zlib simd 2557939c 217 u-secs 4.61 MB/s 220 u-secs 4.55 MB/s random data length 524288 bytes --test-- adler32 Min time Min Rate Median time Median Rate zlib b66db687 367 u-secs 1.36 MB/s 369 u-secs 1.36 MB/s zlib simd b66db687 94 u-secs 5.32 MB/s 95 u-secs 5.26 MB/s random data length 262144 bytes --test-- adler32 Min time Min Rate Median time Median Rate zlib 101e4462 184 u-secs 1.36 MB/s 184 u-secs 1.36 MB/s zlib simd 101e4462 47 u-secs 5.32 MB/s 47 u-secs 5.32 MB/s random data length 131072 bytes --test-- adler32 Min time Min Rate Median time Median Rate zlib b4b0b406 92 u-secs 1.36 MB/s 92 u-secs 1.36 MB/s zlib simd b4b0b406 23 u-secs 5.43 MB/s 24 u-secs 5.21 MB/s random data length 65536 bytes --test-- adler32 Min time Min Rate Median time Median Rate zlib b97db333 46 u-secs 1.36 MB/s 46 u-secs 1.36 MB/s zlib simd b97db333 11 u-secs 5.68 MB/s 12 u-secs 5.21 MB/s random data length 32768 bytes --test-- adler32 Min time Min Rate Median time Median Rate zlib 8c08f977 23 u-secs 1.36 MB/s 23 u-secs 1.36 MB/s zlib simd 8c08f977 5 u-secs 6.25 MB/s 6 u-secs 5.21 MB/s 2) Android Nexus 9 tablet, arm8 dual-core Nvidia Tegra K1 CPU, 2G RAM --test-- adler32 Min time Min Rate Median time Median Rate zlib 09ae8dd3 8805 u-secs 1.82 MB/s 8861 u-secs 1.81 MB/s zlib simd 09ae8dd3 4369 u-secs 3.66 MB/s 5248 u-secs 3.05 MB/s random data length 8 MB --test-- adler32 Min time Min Rate Median time Median Rate zlib 359d7f46 4404 u-secs 1.82 MB/s 4426 u-secs 1.81 MB/s zlib simd 359d7f46 2595 u-secs 3.08 MB/s 2624 u-secs 3.05 MB/s random data length 4 MB --test-- adler32 Min time Min Rate Median time Median Rate zlib 1afdbaab 2193 u-secs 1.82 MB/s 2214 u-secs 1.81 MB/s zlib simd 1afdbaab 1295 u-secs 3.09 MB/s 1306 u-secs 3.06 MB/s random data length 2 MB --test-- adler32 Min time Min Rate Median time Median Rate zlib 27ea9bf8 1097 u-secs 1.82 MB/s 1100 u-secs 1.82 MB/s zlib simd 27ea9bf8 551 u-secs 3.63 MB/s 562 u-secs 3.56 MB/s random data length 1 MB --test-- adler32 Min time Min Rate Median time Median Rate zlib a629b1e5 546 u-secs 1.83 MB/s 550 u-secs 1.82 MB/s zlib simd a629b1e5 218 u-secs 4.59 MB/s 222 u-secs 4.50 MB/s random data length 524288 bytes --test-- adler32 Min time Min Rate Median time Median Rate zlib 2528a332 272 u-secs 1.84 MB/s 275 u-secs 1.82 MB/s zlib simd 2528a332 104 u-secs 4.81 MB/s 106 u-secs 4.72 MB/s random data length 262144 bytes --test-- adler32 Min time Min Rate Median time Median Rate zlib 4281da4b 136 u-secs 1.84 MB/s 137 u-secs 1.82 MB/s zlib simd 4281da4b 52 u-secs 4.81 MB/s 53 u-secs 4.72 MB/s random data length 131072 bytes --test-- adler32 Min time Min Rate Median time Median Rate zlib 9310f5a9 68 u-secs 1.84 MB/s 69 u-secs 1.81 MB/s zlib simd 9310f5a9 26 u-secs 4.81 MB/s 27 u-secs 4.63 MB/s random data length 65536 bytes --test-- adler32 Min time Min Rate Median time Median Rate zlib 82d82033 34 u-secs 1.84 MB/s 34 u-secs 1.84 MB/s zlib simd 82d82033 12 u-secs 5.21 MB/s 12 u-secs 5.21 MB/s random data length 32768 bytes --test-- adler32 Min time Min Rate Median time Median Rate zlib f4cf18d7 17 u-secs 1.84 MB/s 17 u-secs 1.84 MB/s zlib simd f4cf18d7 5 u-secs 6.25 MB/s 6 u-secs 5.21 MB/s Not too bad for a tablet, the NEON adler32 is ~3-4x-ish faster.
,
Sep 21 2017
FWIW, I've found that the Denver cores in the K1 perform wildly differently than the rest of the ARMv8 chips in the wild, and changes that improve most cores often hurt it and vice versa. I'd strongly recommend never benchmarking with it... it's an oddball distraction. Now, of course, in this case everything looks just fine eh?
,
Sep 21 2017
Next step is running image_decode_bench on these Android devices. That's maybe unfamiliar to some folks, so here's a script to do it, essentially ... % ninja -C out/Release image_decode_bench_apk to create an org.chromium.native_test, and use the script (attached) to load that apk onto the device and "run it". The adb incantations used to "run it" are not exactly obvious, and there's also no way I know of to nice(1) the command the org.chromium.native_test harness spawns on the device that actually calls image_decode_bench. One consequence is that: other things the Android OS decides to run on the tablet while you are testing can add intermittent variance to the results at times, and cause some corpus images to be marked as perf regressions in a given corpus test run. Where I saw examples that (eg., one image was 75% slower with neon, than without), I re-measured the particular image. The typically outcome (though not always) was there was no regression at all, and I updated the results for that image (it was a few percent faster in fact).
,
Sep 21 2017
> Now, of course, in this case everything looks just fine eh? Ah yeah :), though they are kinda different too under image-decode bench. The K1 results look more like the desktop results (in shape), whereas the X1 CPU (Pixel C tablet) was ... flatter I suppose (not sure why that happens). Anyho, with Battery 100%, WIFI off, BlueTooth off, Screen On Always, power supply wall plugged: 1) Android Pixel C tablet, arm8 quad-core Nvidia X1 CPU, 3G RAM https://docs.google.com/spreadsheets/d/11VrR-ZkkdmUArZWVmOf2oebqEo42euriLWpmHZrlDHg/edit#gid=595603569 +2% win on the average, no so many good wins though. 2) Android Nexus 9 tablet, arm8 dual-core Nvidia Tegra K1 CPU, 2G RAM https://docs.google.com/spreadsheets/d/1m6SptRMg0-eujnWMyZTP8P9jwzteX0kWCgZ0LbxQ8Ew/edit#gid=595603569 +4% win on the average, plenty of good wins.
,
Sep 21 2017
The 3x gain on ARM is in line with the results I observed in January (8 months ago): https://bugs.chromium.org/p/chromium/issues/detail?id=688601 One important thing to notice concerning the variation is that you will have different results if the code is running in a little or a big core (the nice value is no guarantee of core allocation if you have an EAS kernel). I've observed consistent 3x gain in a big core (A72) and 2x gain in a little core (A53). Concerning image decoding speedups, the percentage gains will also be more significant on ARMv7 (up to 18% in some cases) than on ARMv8 (5-7%). Considering that you now have re-validated the approach for ARM, what is left to land this?
,
Sep 25 2017
Seems the image decoding speedups are more modest on ARMv8 (2% over the corpus on the average) based on my measurements, I don't have an ARMv7 to compare. Thanks for explanation re big (A72) or little core (A53). The gain on the bench is closer to 4x now, I compared "bench large" on a Pixel C tablet for your patch https://chromium-review.googlesource.com/c/chromium/src/+/611492/12 and my patch https://chromium-review.googlesource.com/c/chromium/src/+/660019/36 and attached the results. > what is left to land this? Review I suppose, and what to do about the fact that my patch depends on the Intel simd.patch.
,
Sep 25 2017
Latest bench.cpp (attached, see also #33).
,
Sep 26 2017
Put all the results in once place: Results on different machines for the 140 PNG measurement corpus. Desktop Devices =============== HP Z840 2016 * 16-core Intel Xeon E5-2683 V4 2.1GHz Broadwell-EP (14nm) * Turbo Boost off. win10-z840-slow-fast-zlib-adler-sse https://docs.google.com/spreadsheets/d/1V0DF-NYrKo1xWF88N-5DJGQfUDtmo0VkhHm9uOaQ_7M/edit#gid=1475395966 DELL PRECISION M3800 LAPTOP * Intel Core i7-4712HQ 2.3GHz Haswell-H (22nm) win10-laptop-slow-fast-zlib-adler-sse https://docs.google.com/spreadsheets/d/1eUDlm5GLA7Zd7IfEd9NUaM1wkes-WzvZBI6rKriZskk/edit#gid=1546726966 MAC PRO 2013 (12-CORE) 2013 * 12-core Intel Xeon E5-2697 V2, 2.7GHz Ivy Bridge-EP (22 nm) mac-pro-slow-fast-zlib-adler-sse https://docs.google.com/spreadsheets/d/16T4PlAXPqBtlw16JZnhAROk4bGNOC572vtOdJKHAluc/edit#gid=558740138 MAC BOOK PRO 13" 2016 (TOUCH BAR) LAPTOP * Dual-core Intel Core i7-6567U, 3.3GHz Skylake-U (14nm) png mac-book-pro-slow-fast-zlib-adler-ssse3 https://docs.google.com/spreadsheets/d/1YKF2ShSqwvxfVUPlg-_1v0pZxsnqYBqy65aS9CClWGE/edit#gid=595603569 Measurement patch https://chromium-review.googlesource.com/c/chromium/src/+/609054 for Nigel's patch (from issue 760853 ). The results of that are graphed as the blue line labelled "fast zlib". The SSE results are graphed as the red-line: decoding rate MB/s with the change, divided by the decoding rate MB/s with no change (decoding rate ratio). The x-axis is the corpus image (sorted in ascending order). Android Devices =============== ANDROID PIXEL C TABLET * Arm8 quad-core Nvidia X1 CPU, 3G RAM https://docs.google.com/spreadsheets/d/11VrR-ZkkdmUArZWVmOf2oebqEo42euriLWpmHZrlDHg/edit#gid=595603569 ANDROID NEXUS 9 TABLET * Arm8 dual-core Nvidia Tegra K1 CPU, 2G RAM https://docs.google.com/spreadsheets/d/1m6SptRMg0-eujnWMyZTP8P9jwzteX0kWCgZ0LbxQ8Ew/edit#gid=595603569
,
Sep 29 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/09b784fd12f255a9da38107ac6e0386f4dde6d68 commit 09b784fd12f255a9da38107ac6e0386f4dde6d68 Author: Noel Gordon <noel@chromium.org> Date: Fri Sep 29 19:44:25 2017 zlib adler_simd.c Add SSSE3 implementation of the adler32 checksum, suitable for both large workloads, and small workloads commonly seen during PNG image decoding. Add a NEON implementation. Speed is comparable to the serial adler32 computation but near 64 bytes of input data, the SIMD code paths begin to be faster than the serial path: 3x faster at 256 bytes of input data, to ~8x faster for 1M of input data (~4x on ARMv8 NEON). For the PNG 140 image corpus, PNG decoding speed is ~8% faster on average on the desktop machines tested, and ~2% on an ARMv8 Pixel C Android (N) tablet, https://crbug.com/762564#c41 Update x86.{c,h} to runtime detect SSSE3 support and use it to enable the adler32_simd code path and update inflate.c to call x86_check_features(). Update the name mangler file names.h for the new symbols added, add FIXME about simd.patch. Ignore data alignment in the SSSE3 case since unaligned access is no longer penalized on current generation Intel CPU. Use it in the NEON case however to avoid the extra costs of unaligned memory access on ARMv8/v7. NEON credits: the v_s1/s2 vector component accumulate code was provided by Adenilson Cavalcanti. The uint16 column vector sum code is from libdeflate with corrections to process NMAX input bytes which improves performance by 3% for large buffers. Update BUILD.gn to put the code in its own source set, and add it conditionally to the zlib library build rule. On ARM, build the SIMD with max-speed config to produce the smallest code. No change in behavior, covered by many existing tests. Bug: 762564 Change-Id: I14a39940ae113b5a67ba70a99c3741e289b1796b Reviewed-on: https://chromium-review.googlesource.com/660019 Commit-Queue: Chris Blume <cblume@chromium.org> Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org> Reviewed-by: Chris Blume <cblume@chromium.org> Cr-Commit-Position: refs/heads/master@{#505447} [modify] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/BUILD.gn [modify] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/adler32.c [add] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/adler32_simd.c [add] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/adler32_simd.h [modify] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/inflate.c [modify] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/names.h [add] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/patches/0005-adler32-simd.patch [modify] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/x86.c [modify] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/x86.h
,
Sep 29 2017
,
Oct 6 2017
Script for running image_decode_bench on Android devices, added more comments about the steps needed to prepare and run image_decode_bench.
,
Oct 6 2017
Very cool!
,
Oct 9 2017
I also uploaded the WIP patch I used for image decode bench for the decoding rate measurements, and added an encoding rate bench to it as well [1]. [1] https://chromium-review.googlesource.com/c/chromium/src/+/706744 Adler32 affects PNG encoding rate. Out of interest, I measured the 140 PNG corpus by decoding each image, and then PNG encoding the image frame, via ImageDataBuffer which is used for Blink image encoding, in an image encoding bench. MAC BOOK PRO 13" 2016 (TOUCH BAR) LAPTOP * Dual-core Intel Core i7-6567U, 3.3GHz Skylake-U (14nm) png mac-book-pro-encoding-rate-sse-adler https://docs.google.com/spreadsheets/d/1q1Xz55UDOk4EIaqD5j1Chee0tVHe9HjJcD5YIP9a69k/edit#gid=595603569 PNG encoding rate improves with the adler32 ssse3 change by ~15% on the average over the PNG 140 corpus on the test machine.
,
Dec 20 2017
re-opened: minor fix to adler simd code coming. The MSVC release bot spotted:
unsigned n = NMAX / BLOCK_SIZE; /* The NMAX constraint. */
if (n > blocks)
n = blocks; <-- possible precision loss converting from size_t to unsigned
since blocks is size_t. No precision loss occurs since n > blocks. Anyho, making n size_t too should do the trick.
,
Dec 20 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0cb9a22e2fb55e092342192d66f7e33c14432d27 commit 0cb9a22e2fb55e092342192d66f7e33c14432d27 Author: Noel Gordon <noel@chromium.org> Date: Wed Dec 20 07:42:57 2017 zlib adler_simd.c: unsigned cast |blocks| on assignment MSVC noted the unsigned |n| = size_t |blocks| could be a possible loss in precision. No loss in precision occurs since (n > blocks) at this point: |blocks| fits in an unsigned type. To silence compiler warnings, first update BUILD.gn for the adler SIMD code to use chromium compiler:chromium_code rule (more error checking), rather than the permissive "compiler:no_chromium_code" rule. Then cast |blocks| to unsigned on assigment to |n| (this is safe to do as mentioned above). No change in behavior, no new tests. Tbr: cblume@chromium.org Bug: 762564 Change-Id: Ia97120bcca206287fd42b97674f8a6215283e4a5 Reviewed-on: https://chromium-review.googlesource.com/835927 Commit-Queue: Noel Gordon <noel@chromium.org> Reviewed-by: Sam McNally <sammc@chromium.org> Cr-Commit-Position: refs/heads/master@{#525285} [modify] https://crrev.com/0cb9a22e2fb55e092342192d66f7e33c14432d27/third_party/zlib/BUILD.gn [modify] https://crrev.com/0cb9a22e2fb55e092342192d66f7e33c14432d27/third_party/zlib/adler32_simd.c
,
Dec 20 2017
> Anyho, making n size_t too should do the trick. ... or else casting |blocks| to unsigned when assigning to |n|. Bots tell us the latter is acceptable to them, so go with that.
,
Dec 20 2017
,
Feb 7 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/c376937e3709ddef4b97621a58039a21b67faa6e commit c376937e3709ddef4b97621a58039a21b67faa6e Author: Noel Gordon <noel@chromium.org> Date: Wed Feb 07 09:33:01 2018 Remove x86_check_features() from inflate adler32 can be freely called like crc32, per the "zlib.h" docs so move the x86 feature check there, which allows us to remove x86_check_features() from zlib inflate. Note inflate() calls adler32(0, Z_NULL, 0); (once only) before using the zlib adler32() routine. A reading from third_party/libpng png.c 2324: /* Now calculate the adler32 if not done already. */ 2327: adler = adler32(0, NULL, 0); 2328: adler = adler32(adler, profile, length); suggests they also conform to the "zlib.h" doc rules about use of the adler32 routine. Bug: 762564 Change-Id: Id84da0eb52bc10edb541a284eab9aef652ba2c72 Reviewed-on: https://chromium-review.googlesource.com/901043 Commit-Queue: Noel Gordon <noel@chromium.org> Reviewed-by: Mike Klein <mtklein@chromium.org> Cr-Commit-Position: refs/heads/master@{#534964} [modify] https://crrev.com/c376937e3709ddef4b97621a58039a21b67faa6e/third_party/zlib/adler32.c [modify] https://crrev.com/c376937e3709ddef4b97621a58039a21b67faa6e/third_party/zlib/contrib/optimizations/inflate.c [modify] https://crrev.com/c376937e3709ddef4b97621a58039a21b67faa6e/third_party/zlib/inflate.c
,
Feb 10 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9749d6d75c1703adaf154c8bd37c46d13ad4423a commit 9749d6d75c1703adaf154c8bd37c46d13ad4423a Author: Noel Gordon <noel@chromium.org> Date: Sat Feb 10 01:47:09 2018 Delete zlib/patches/0005-adler32-simd.patch Patches are no longer a thing. Remove the patch file for alder32 SIMD change since it mostly touched files that are not part of upstream. One upstream file adler32.c was touched but it has feature guards, so we should be good when merging in any upstream changes. Bug: 762564 Change-Id: Iad8e389667189300fa180755bc2ddc0187138bc0 Reviewed-on: https://chromium-review.googlesource.com/910539 Reviewed-by: Chris Blume <cblume@chromium.org> Commit-Queue: Noel Gordon <noel@chromium.org> Cr-Commit-Position: refs/heads/master@{#535922} [delete] https://crrev.com/21142de6a3d071974cf458b4e4fb63e2b2e1afd6/third_party/zlib/patches/0005-adler32-simd.patch
,
Feb 13 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0342258175d72113ae72c0ce56d79a3784ad2dd0 commit 0342258175d72113ae72c0ce56d79a3784ad2dd0 Author: Noel Gordon <noel@chromium.org> Date: Tue Feb 13 00:46:55 2018 [image_decode_bench] Change BUILD gn to build an executable Currently built as a test fixture, which makes it painful to use when testing on Android: a special script is needed for Android [1], which makes Android a special snowflake when image benching. Simpler worlds can be had even for Android: build/create image decode bench binary executable. Android developers can 'adb push' the binary to their Android test device, and run it from adb shell (and no fancy scripts required). For everyone else (mac/linux...), nothing changes, just run out/Release/image_decode_bench as usual. [1] https://crbug.com/762564#c44 Bug: 601198, 762564 Change-Id: I7c229f686ea5f6672b2d4d972a010f2bfbccd40c Reviewed-on: https://chromium-review.googlesource.com/912988 Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org> Reviewed-by: Leon Scroggins <scroggo@chromium.org> Commit-Queue: Noel Gordon <noel@chromium.org> Cr-Commit-Position: refs/heads/master@{#536249} [modify] https://crrev.com/0342258175d72113ae72c0ce56d79a3784ad2dd0/third_party/WebKit/Source/platform/BUILD.gn |
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by noel@chromium.org
, Sep 6 20177.4 KB
7.4 KB View Download