New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 762564 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Dec 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocked on:
issue 762468

Blocking:
issue 764085



Sign in to add a comment

zlib adler32 speed-ups

Project Member Reported by noel@chromium.org, Sep 6 2017

Issue description

zlib adler32 speed-ups

third/party/zlib is used for PNG decoding, which spends 75% of the decode time in z_inflate_fast, the remaining time is spent in the row write loops in libPNG writing out the decoded pixels (some perf work has already been done there,  issue 234071 ,  issue 258324 ).

About 15% to 20% of the z_inflate_fast time is spent computing the adler32 checksum. If we could speed it up, that should translate into a PNG decoding win.

On x86/x64 targets, SSE might help.  libdeflate has an SSE2 adler32, but it's not quite suitable for PNG decoding, where small workloads (buffers) need special attention to keep things fast.  In https://chromium-review.googlesource.com/c/chromium/src/+/611492, NEON is proposed to speed up adler32 on ARM using a "taps" multiplier method for the core loop.

I reviewed the adler32 computation, https://en.wikipedia.org/wiki/Adler-32, and wrote out the math to confirm the "taps" method is sound (it is).  I reviewed the existing alder32_z https://github.com/madler/zlib/blob/v1.2.11/adler32.c to gain understanding of its implementation, and the importance of the NMAX (5552) parameter used to protect from integer overflow when the adler32 component sums (called Adler A and B) are computed in uint32_t type.  A naive adler32 implementation is:

  uint32_t adler32(uint32_t adler, const unsigned char* buf, size_t len)
  {
    /* split Adler-32 into component sums A and B (aka s1 and s2) */
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = adler >> 16;

    for (size_t i = 0; i < len; ++i) {
      s1 += *buf++;
      s2 += s1;
      s1 %= 65521U;
      s2 %= 65521U;
    }

    /* return recombined sums */
    return s1 | (s2 << 16);
  }

The Adler A and B sums must be computed modulo the adler32 BASE value (65521).  This reduction, or modulo, step is expensive, and alder32_z avoids it -- it processes 5552 bytes of input before each reduce.  This limits the number of modulo operations required (for speed) _and_ avoids overflowing the uint32_t types used to represent Adler A and B during the computation (for correctness of the reduce step).  For small workloads, alder32_z also uses special clauses to minimize computation and simplify the modulo operation for the Adler A sum to a subtraction.

Seems sane, and seems doable in SIMD with attention to the above.  SSE3 would be a reasonable target for chrome desktop users, in particular, SSSE3 (triple-S-E3)

  - it has new instructions that could do the taps mult part fast (Adler B)
  - _mm_sad_epu8() is fast for byte sums (Adler A), http://bit.ly/2wpUOeD

We'd have to write an SSSE3 impl from scratch though, one that

  - should handle small workloads well (the PNG decode case)
  - should handle large workloads well (the general inflate case)
  - should work well with  issue 760853 
  - have small code size to preserve instruction cache
  - integrate easily into zlib as a zlib/contrib
  - have a build flag enable, and SSSE3 run-time check

 

Comment 1 by noel@chromium.org, Sep 6 2017

On my macbook pro 2016, the naive adler32 implementation runs at 0.38 Mb/s for work loads >= 32K for example.  zlib's alder_z() runs at 2 MB/s.  Much faster, so the NMAX processing loop it has really helps ...  I needed a bench marker, I adapted the  one mentioned on https://chromium-review.googlesource.com/c/chromium/src/+/611492

attached for reference:
bench.cpp
7.4 KB View Download

Comment 2 by noel@chromium.org, Sep 6 2017

Next the SSE to meet the goals (small / big workloads, code size< overflow control, etc) and some more benchmarking:

% ./bench large
random data length 16 MB
--test--  alder32   Min time     Min Rate    Median time  Median Rate 
zlib sse3 8838e342   1143 u-secs  14.00 MB/s   1254 u-secs  12.76 MB/s 
zlib      8838e342   6937 u-secs   2.31 MB/s   7020 u-secs   2.28 MB/s 
random data length 8 MB
--test--  alder32   Min time     Min Rate    Median time  Median Rate 
zlib sse3 a286e45a    602 u-secs  13.29 MB/s    664 u-secs  12.05 MB/s 
zlib      a286e45a   3461 u-secs   2.31 MB/s   3564 u-secs   2.24 MB/s 
random data length 4 MB
--test--  alder32   Min time     Min Rate    Median time  Median Rate 
zlib sse3 f257c4e5    227 u-secs  17.62 MB/s    311 u-secs  12.86 MB/s 
zlib      f257c4e5   1715 u-secs   2.33 MB/s   1735 u-secs   2.31 MB/s 
random data length 2 MB
--test--  alder32   Min time     Min Rate    Median time  Median Rate 
zlib sse3 e0159951     91 u-secs  21.98 MB/s     92 u-secs  21.74 MB/s 
zlib      e0159951    852 u-secs   2.35 MB/s    855 u-secs   2.34 MB/s 
random data length 1 MB
--test--  alder32   Min time     Min Rate    Median time  Median Rate 
zlib sse3 6f852dfa     46 u-secs  21.74 MB/s     46 u-secs  21.74 MB/s 
zlib      6f852dfa    426 u-secs   2.35 MB/s    455 u-secs   2.20 MB/s 
random data length 524288 bytes
--test--  alder32   Min time     Min Rate    Median time  Median Rate 
zlib sse3 f6a8649a     23 u-secs  21.74 MB/s     23 u-secs  21.74 MB/s 
zlib      f6a8649a    213 u-secs   2.35 MB/s    214 u-secs   2.34 MB/s 
random data length 262144 bytes
--test--  alder32   Min time     Min Rate    Median time  Median Rate 
zlib sse3 6b5315cf     10 u-secs  25.00 MB/s     11 u-secs  22.73 MB/s 
zlib      6b5315cf    106 u-secs   2.36 MB/s    113 u-secs   2.21 MB/s 
random data length 131072 bytes
--test--  alder32   Min time     Min Rate    Median time  Median Rate 
zlib sse3 e3b89979      5 u-secs  25.00 MB/s      6 u-secs  20.83 MB/s 
zlib      e3b89979     53 u-secs   2.36 MB/s     58 u-secs   2.16 MB/s 
random data length 65536 bytes
--test--  alder32   Min time     Min Rate    Median time  Median Rate 
zlib sse3 390b83f7      2 u-secs  31.25 MB/s      3 u-secs  20.83 MB/s 
zlib      390b83f7     28 u-secs   2.23 MB/s     28 u-secs   2.23 MB/s 
random data length 32768 bytes
--test--  alder32   Min time     Min Rate    Median time  Median Rate 
zlib sse3 f544e187      1 u-secs  31.25 MB/s      1 u-secs  31.25 MB/s 
zlib      f544e187     13 u-secs   2.40 MB/s     13 u-secs   2.40 MB/s 

Next I measured with PNG 140 corpus on 4 machines.  I posted all the results on https://bugs.chromium.org/p/chromium/issues/detail?id=760853#c21 seems like a nice win for PNG, about 8-9% on the average across machines, few losses.  Some images experience gains >> 15%.  On the macbook pro (Core i7), all corpus images improve.

Comment 3 by noel@chromium.org, Sep 6 2017

Owner: noel@chromium.org
Status: Started (was: Untriaged)
Uploaded https://chromium-review.googlesource.com/c/chromium/src/+/653337 with the SSSE3 code ready for zlib if you'd care to review.

Remaining things TODO in subsequent patches:
  - fill out README.chromium
  - actually call it from adler32_z()
  - work out how to do the run-time check
   - the Intel deflate contribution has the basis for the cpu check
     - zlib/x86.h
   - but that contribution is being moved or something
   - is their a bug for this @cblume?
  - fix the BUILD.gn file
Project Member

Comment 4 by bugdroid1@chromium.org, Sep 7 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/94b37d283b45036a471e316ffcaf8fe71914a7ac

commit 94b37d283b45036a471e316ffcaf8fe71914a7ac
Author: Nigel Tao <nigeltao@chromium.org>
Date: Thu Sep 07 03:24:45 2017

Run "git cl format" on third_party/zlib.

TBR=thakis@chromium.org
Bug:762564

Change-Id: I63a809b785a21190b0b8a1744cf073a02beaa239
Reviewed-on: https://chromium-review.googlesource.com/644512
Commit-Queue: Noel Gordon <noel@chromium.org>
Reviewed-by: Stuart Langley <slangley@chromium.org>
Reviewed-by: Noel Gordon <noel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#500206}
[modify] https://crrev.com/94b37d283b45036a471e316ffcaf8fe71914a7ac/third_party/zlib/BUILD.gn

Comment 5 by noel@chromium.org, Sep 11 2017

Cc: mtklein@chromium.org robert.b...@intel.com
Across the 140 PNG measurement corpus, the results on different machines:

HP Z840 2016
 * 16-core Intel Xeon E5-2683 V4 2.1GHz Broadwell-EP (14nm)
 * Turbo Boost off.
win10-z840-slow-fast-zlib-adler-sse
https://docs.google.com/spreadsheets/d/1V0DF-NYrKo1xWF88N-5DJGQfUDtmo0VkhHm9uOaQ_7M/edit#gid=1475395966

DELL PRECISION M3800 LAPTOP
 * Intel Core i7-4712HQ 2.3GHz Haswell-H (22nm)
win10-laptop-slow-fast-zlib-adler-sse
https://docs.google.com/spreadsheets/d/1eUDlm5GLA7Zd7IfEd9NUaM1wkes-WzvZBI6rKriZskk/edit#gid=1546726966

MAC PRO 2013 (12-CORE) 2013
 * 12-core Intel Xeon E5-2697 V2, 2.7GHz Ivy Bridge-EP (22 nm) 
mac-pro-slow-fast-zlib-adler-ssse3
https://docs.google.com/spreadsheets/d/1YKF2ShSqwvxfVUPlg-_1v0pZxsnqYBqy65aS9CClWGE/edit#gid=595603569

MAC BOOK PRO 13" 2016 (TOUCH BAR) LAPTOP
 * Dual-core Intel Core i7-6567U, 3.3GHz Skylake-U (14nm)
mac-book-pro-slow-fast-zlib-adler-sse
https://docs.google.com/spreadsheets/d/1m-d-aF2hHyN6Oi8HHV4Htz_7Xgdh6kEnKnDjKc3WsrU/edit#gid=595603569

Measurement patch https://chromium-review.googlesource.com/c/chromium/src/+/609054
for Nigel's patch (from  issue 760853 ).  The results of that are graphed as the blue line labelled "fast zlib".

The SSE results are graphed as the red-line: decoding rate MB/s with the change, divided by the decoding rate MB/s with no change (decoding rate ratio).  The x-axis is the corpus image (sorted in ascending order).

Comment 6 by cblume@chromium.org, Sep 11 2017

> Remaining things TODO in subsequent patches:
>  - ...
>  - work out how to do the run-time check
>   - ...
>   - is their a bug for this @cblume?

You mean a bug for moving Intel's contributions out of top-level?

There is no bug. I sent them an email about upstreaming the patch. Apparently it was originally upstreamed via the zlib dev mailing list, but the review didn't really happen and nothing came of it. I'll make a patch for it eventually (moving to contrib/ and making a github pull request rather than the mailing list).

Fixing existing complication is a lower priority than introducing further complication.
Blocking: 764085
@Noel: I just created an issue to track the Intel code refactoring: https://bugs.chromium.org/p/chromium/issues/detail?id=764094

It has also the description of how the folders will be organized (but basically Intel specific code will be hosted in 'contrib/x86').

Blockedon: 762468
Cc: cavalcantii@chromium.org

Comment 11 by noel@chromium.org, Sep 15 2017

#6 > You mean a bug for moving Intel's contributions out of top-level?

Yeap.  Got some answers above, thx.  Couple of issues for the work herein [1], one of which was I need modify, and also call that Intel code to runtime detect SSSE3 support for my stuff :)

Seems to me we end up having contribs depending on other contribs (and if they were turtles, then it's turtles all the way down).

[1] https://chromium-review.googlesource.com/c/chromium/src/+/660019#message-8a7ffab42f74bc0ee9ae22bcf2d030b4df3d2d8e

Second thing, copyrights.  I'm was using 

"* For conditions of distribution and use, see copyright notice in zlib.h"

Do I need to add anything more that?  This came up in [2], so any advice you have would help.

[2] https://chromium-review.googlesource.com/c/chromium/src/+/660019/16/third_party/zlib/adler32_simd.c#2
Maybe this is a silly idea...

Can we modify libpng or zlib to not compute checksums at all while decoding?
I'm imagining 4 scenarios:

  1) The image has good checksums and we check them.  We decode normally.
  2) The image has good checksums and we don't check them.  We decode normally faster.
  3) The image has bad checksums and we check them.  We abort the decode.
  4) The image has bad checksums and we don't check them.  We decode wrong.

Surely there's no difference in security between 3) and 4).  The checksums are just as untrusted as the data they summarize... if there's an enemy, they're all crafted by that enemy.

So I suspect the only difference is between aborting the decode and returning a possibly nonsense decode.  Is that difference important?
Mike

I thought about that and it would be easy to skip all checksums .

But on the other hand, if the data has being corrupted while traveling the network, wouldn't that imply in continuing spending CPU to render an image that is going to be displayed garbled?

In this scenario, is faster and better for the user to abort any further work.

Still, IIRC the PNG spec allows the UA to decide if it is going to start displaying the content as it arrives from the network or to only display the content after having all the bytes and having performed the validations using the checksums.
> But on the other hand, if the data has being corrupted while traveling the network, wouldn't that imply in continuing spending CPU to render an image that is going to be displayed garbled?

Yes, it is a waste of time to continue decoding a broken .png, but we don't really care about that.  The proportion of .pngs that are broken is vanishingly small.  The balance of

    time = time(broken)*P(broken) + time(good)*(1 - P(broken))

is going to drop dramatically because P(broken) is tiny.
Yes, indeed.

But on the other hand that would make Chromium a non-compliant PNG viewer. Not sure if that would break layout tests.
:-(

And I doubt that we could ever upstream this change.
I agree the main issue here would be, does it make things better or worse for Chromium's users?

We control Blink layout tests, so we can put that part out of our minds.

Have we not already completely given up on upstreaming everything we're doing?  Very seriously asking this.  I'm currently considering us forked given the lack of interest and interaction we've been able to get from madler.
We are talking about quite a few layout tests here (last time I checked, over 40K). Not to mention that some of them come from W3C and all browsers are expected to behave in the same way.

Before we reach to this step (i.e. skipping all checksums), I noticed that the bottleneck now seems to be moved, at least on ARM, to the Compositor (CC) image cache.

I say we go and optimize our code in CC before going to extremes.
:-)

Concerning upstreaming, I would say not yet. As an example, we succeeded in upstreaming all ARM patches to zlib-ng 5 months ago (the plus is that they already have all the Intel patches):
https://github.com/Dead2/zlib-ng/commit/ec02ecf104e1d3f1836a908a359f20aa93494df5

Even though the primary goal of this investigation is to improve performance, I still have as secondary goal one day to be able to no longer maintain a zlib fork but instead just use one that is 'ready for deploy'.

When you got some spare time, I ask you to please check the report I wrote earlier this year:
https://goo.gl/ZUoy96

Ah, yes, I do agree that finding a new active upstream is probably better than maintaining yet another fork.

Surely all browsers cannot be expected to behave the same with malformed inputs?
> And I doubt that we could ever upstream this change.

There's not anything that we would need to upstream. One line in Chromium's client code would make this change:

    png_set_crc_action(png_, PNG_CRC_QUIET_USE, PNG_CRC_QUIET_USE);

That tells libpng to not calculate CRCs for critical chunks nor ancillary chunks. We already use it when we modify the encoded data so we can use standard libpng to decode fdAT frames for APNG.

For what it's worth, the spec [1] says:

"A PNG decoder should handle errors as follows:

Detect errors as early as possible using the PNG signature bytes and CRCs on each chunk. Decoders should verify that all eight bytes of the PNG signature are correct. A decoder can have additional confidence in the datastream's integrity if the next eight bytes begin an IHDR chunk with the correct chunk length. A CRC should be checked before processing the chunk data. Sometimes this is impractical, for example when a streaming PNG decoder is processing a large IDAT chunk. In this case the CRC should be checked when the end of the chunk is reached."

But then it goes on to say "Errors that have little or no effect on the processing of the image may be ignored".

Note that it says "should" (check the CRC) here, rather than "must". And I think there's a strong case for CRC errors having "little or no effect on the processing of the image".

> I agree the main issue here would be, does it make things better or worse for Chromium's users?

+1. Based on the opening comment in this bug, it sounds like skipping the calculation would be faster.

> We are talking about quite a few layout tests here (last time I checked, over 40K).

Most of them are probably not confirming that we do not show PNGs with bad CRCs.

> Not to mention that some of them come from W3C and all browsers are expected to behave in the same way.

Indeed. The Web Platform Tests (I think [2] is the right place) are used to test all browsers. I don't know whether it covers PNGs with bad CRCs. It would be in the spirit of the Web Platform Tests to convince the other major browsers to agree on skipping the CRC check. Firefox does not appear to skip the CRC check, just based on the fact that I didn't see the above code line in their decoder. No idea whether they've given it much thought.

[1] https://www.w3.org/TR/PNG/
[2] https://github.com/w3c/web-platform-tests
Concerning the Compositor, for reference please check the comment #38 where I posted a screenshot of a Chrome trace for a know PNG test case:
https://bugs.chromium.org/p/chromium/issues/detail?id=687631#c38
So, uh, that's awesome.

Not to derail the CLs you two have been working on, but wouldn't we rather land that one liner in Chromium if it really does completely cut the adler32() runtime to zero?
@Mike: the behavior of all browsers should ideally be the same, even for broken content as that improves the portability of web content. 

But that is another can of worms that I would like to avoid discussing (or discuss it over some beers instead).
:-)

I rather focus in improving all our software stack before going to extremes.

@Leo: about the comment "It would be in the spirit of the Web Platform Tests to convince the other major browsers to agree on skipping the CRC check."

I've once worked with specs (was part of the W3C CSS WG) and I can tell that is an herculean task to change even non controversial aspects of w3c specs. 

It is extremely hard to get all the people together and even harder to get them all to agree.

Not saying it is impossible, it is just time consuming.

> Not to derail the CLs you two have been working on, but wouldn't we rather land > that one liner in Chromium if it really does completely cut the adler32() 
> runtime to zero?

Certainly something I would love to experiment.
:-)

I think we could easily argue the opposite here too, that it's in browsers' interest to be as incompatible as possible for broken content.  This encourages content to not be broken.
Project Member

Comment 26 by bugdroid1@chromium.org, Sep 15 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/dcf65ae672b4b2e3235310f7ef7415294e75e472

commit dcf65ae672b4b2e3235310f7ef7415294e75e472
Author: Mike Klein <mtklein@chromium.org>
Date: Fri Sep 15 22:18:54 2017

Revert "Run "git cl format" on third_party/zlib."

This reverts commit 94b37d283b45036a471e316ffcaf8fe71914a7ac.

Reason for revert: need to revert earlier third_party/zlib CLs.

Original change's description:
> Run "git cl format" on third_party/zlib.
> 
> TBR=thakis@chromium.org
> Bug:762564
> 
> Change-Id: I63a809b785a21190b0b8a1744cf073a02beaa239
> Reviewed-on: https://chromium-review.googlesource.com/644512
> Commit-Queue: Noel Gordon <noel@chromium.org>
> Reviewed-by: Stuart Langley <slangley@chromium.org>
> Reviewed-by: Noel Gordon <noel@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#500206}

TBR=noel@chromium.org,cavalcantii@chromium.org,cblume@chromium.org,slangley@chromium.org,nigeltao@chromium.org

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug:  762564 
Change-Id: Ie01454d9b4f24da84a6ae9120ff9d73cf20f32b7
Reviewed-on: https://chromium-review.googlesource.com/669539
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Mike Klein <mtklein@chromium.org>
Cr-Commit-Position: refs/heads/master@{#502402}
[modify] https://crrev.com/dcf65ae672b4b2e3235310f7ef7415294e75e472/third_party/zlib/BUILD.gn

@Leo: not claiming that *not* calculating the checksums is necessarily a bad idea or that we can ever make it *faster* than a 'nop', but it is important to clarify that in PNG the CRC and the Adler-32 checksums are used for different purposes.

IIRC, the first is used in the compressed content and the later is used in the uncompressed content.

For CRC32 we have this nice instruction on ARMv8 which is 10x faster than the C implementation (with potential to make it even faster using a vmull):
https://chromium-review.googlesource.com/c/612629/

It is just too bad that I was told that Chrome is shipped on Android as an ARMv7 binary and apparently it would be impossible to have a custom built of Chrome in the 'Play Store' (or maybe no one ever bothered to look into it?).

On the other hand, I wonder about the viability of implementing an optimized CRC using NEON.

Chromium's zlib already have an Intel optimized CRC32 using SSE.

The funny part is that Intel has a CRC instruction but they use the 'wrong' polynomial (so useless for zlib/libpng).

It looks like png_set_crc_action(png_, PNG_CRC_QUIET_USE, PNG_CRC_QUIET_USE) will turn off both CRC32 and adler32 checks as of libpng-1.6.26.  Chromium is on libpng-1.6.22.
I guess is time for an update? :-)
I've been meaning to update anyway. Will prepare a patch.

Comment 32 by noel@chromium.org, Sep 19 2017

#31> I've been meaning to update anyway. Will prepare a patch.

Sounds good.

#12 > Maybe this is a silly idea...

watk@ mentioned the same idea to me two weeks ago...

> Can we modify libpng or zlib to not compute checksums at all while decoding?
> I'm imagining 4 scenarios:
> ...

> So I suspect the only difference is between aborting the decode and returning a possibly nonsense decode.  Is that difference important?

Maybe, if it happened more often?  By skipping adler32, you effectively increase the Pud (probability of undetected error) by some factor that will depend on the quality of the sub-network(s) and transport network(s) (802.x, ether, ATM AAL5, fiber, satellite, etc) connecting the chrome user to the server (see the older study in [1]).

[1] http://conferences.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-9-1.pdf

PNG was being designed around that time, and the paper recommendations the use of application-level checksums.  On the other hand, what checksums are used in JPEG, or WebP?  What protects them?

> Surely all browsers cannot be expected to behave the same with malformed inputs?

True.  We render image data as it is being received, and when/where/how we detect errors in the stream can vary b/w vendors, producing different renderings.

At least for Chrome, if an image is malformed, it's important that we render it the same way always.  Do otherwise, and it can look suspiciously like a sec-issue, and will need investigation, see  issue 398235 .

#22 > Not to derail the CLs you two have been working on, but wouldn't we rather land that one liner in Chromium if it really does completely cut the adler32() runtime to zero?

Fine, given more experimental data on the implications.  Right now I'd prefer to make chrome faster while someone gathers said data.  Also note this work is not just about PNG - alder32() is used in the network stack, WebSockets, PDF, etc ...

Comment 33 by noel@chromium.org, Sep 21 2017

... and consider disabling adler32 in those other places as well :).

Updating the bench.cpp (attached) I used to vet the current work.  Another question about it, how do you run it on an Android device?

Well, install the latest Android NDK on your host platform (OSX for me), and also install adb if needed.

% pwd
~/android/android-ndk-r15c

Build bench.cpp as an arm(64 in my case) executable.

% export ndkroot=$(pwd)
% ./toolchains/aarch64-linux-android-4.9/prebuilt/darwin-x86_64/bin/aarch64-linux-android-g++ --sysroot=$ndkroot/platforms/android-23/arch-arm64 -pie -fno-exceptions -std=c++14 -O3 -Wall -I $ndkroot/sources/cxx-stl/stlport/stlport $ndkroot/sources/cxx-stl/stlport/libs/arm64-v8a/libstlport_static.a bench.cpp

Now send a.out to your Android device, then shell into the device and run it.

%adb push a.out /data/local/tmp 
a.out: 1 file pushed. 4.6 MB/s (17680 bytes in 0.004s)

% adb shell
dragon:/ # nice -n -20 /data/local/tmp/a.out [large|small|stress|znull]

enjoy.
bench.cpp
12.0 KB View Download

Comment 34 by noel@chromium.org, Sep 21 2017

1) Android Pixel C tablet, arm8 quad-core Nvidia X1 CPU, 3G RAM

dragon:/ # nice -n -20 /data/local/tmp/a.out large                                                
random data length 16 MB
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      f045afbe  12442 u-secs   1.29 MB/s  12464 u-secs   1.28 MB/s 
zlib simd f045afbe   4693 u-secs   3.41 MB/s   4721 u-secs   3.39 MB/s 
random data length 8 MB
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      3b2ec782   6231 u-secs   1.28 MB/s   6249 u-secs   1.28 MB/s 
zlib simd 3b2ec782   2349 u-secs   3.41 MB/s   2363 u-secs   3.39 MB/s 
random data length 4 MB
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      51242a71   3118 u-secs   1.28 MB/s   3142 u-secs   1.27 MB/s 
zlib simd 51242a71   1192 u-secs   3.36 MB/s   1206 u-secs   3.32 MB/s 
random data length 2 MB
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      e6bbf630   1572 u-secs   1.27 MB/s   1606 u-secs   1.25 MB/s 
zlib simd e6bbf630    641 u-secs   3.12 MB/s    666 u-secs   3.00 MB/s 
random data length 1 MB
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      2557939c    751 u-secs   1.33 MB/s    759 u-secs   1.32 MB/s 
zlib simd 2557939c    217 u-secs   4.61 MB/s    220 u-secs   4.55 MB/s 
random data length 524288 bytes
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      b66db687    367 u-secs   1.36 MB/s    369 u-secs   1.36 MB/s 
zlib simd b66db687     94 u-secs   5.32 MB/s     95 u-secs   5.26 MB/s 
random data length 262144 bytes
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      101e4462    184 u-secs   1.36 MB/s    184 u-secs   1.36 MB/s 
zlib simd 101e4462     47 u-secs   5.32 MB/s     47 u-secs   5.32 MB/s 
random data length 131072 bytes
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      b4b0b406     92 u-secs   1.36 MB/s     92 u-secs   1.36 MB/s 
zlib simd b4b0b406     23 u-secs   5.43 MB/s     24 u-secs   5.21 MB/s 
random data length 65536 bytes
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      b97db333     46 u-secs   1.36 MB/s     46 u-secs   1.36 MB/s 
zlib simd b97db333     11 u-secs   5.68 MB/s     12 u-secs   5.21 MB/s 
random data length 32768 bytes
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      8c08f977     23 u-secs   1.36 MB/s     23 u-secs   1.36 MB/s 
zlib simd 8c08f977      5 u-secs   6.25 MB/s      6 u-secs   5.21 MB/s 


2) Android Nexus 9 tablet, arm8 dual-core Nvidia Tegra K1 CPU, 2G RAM

--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      09ae8dd3   8805 u-secs   1.82 MB/s   8861 u-secs   1.81 MB/s 
zlib simd 09ae8dd3   4369 u-secs   3.66 MB/s   5248 u-secs   3.05 MB/s 
random data length 8 MB
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      359d7f46   4404 u-secs   1.82 MB/s   4426 u-secs   1.81 MB/s 
zlib simd 359d7f46   2595 u-secs   3.08 MB/s   2624 u-secs   3.05 MB/s 
random data length 4 MB
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      1afdbaab   2193 u-secs   1.82 MB/s   2214 u-secs   1.81 MB/s 
zlib simd 1afdbaab   1295 u-secs   3.09 MB/s   1306 u-secs   3.06 MB/s 
random data length 2 MB
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      27ea9bf8   1097 u-secs   1.82 MB/s   1100 u-secs   1.82 MB/s 
zlib simd 27ea9bf8    551 u-secs   3.63 MB/s    562 u-secs   3.56 MB/s 
random data length 1 MB
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      a629b1e5    546 u-secs   1.83 MB/s    550 u-secs   1.82 MB/s 
zlib simd a629b1e5    218 u-secs   4.59 MB/s    222 u-secs   4.50 MB/s 
random data length 524288 bytes
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      2528a332    272 u-secs   1.84 MB/s    275 u-secs   1.82 MB/s 
zlib simd 2528a332    104 u-secs   4.81 MB/s    106 u-secs   4.72 MB/s 
random data length 262144 bytes
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      4281da4b    136 u-secs   1.84 MB/s    137 u-secs   1.82 MB/s 
zlib simd 4281da4b     52 u-secs   4.81 MB/s     53 u-secs   4.72 MB/s 
random data length 131072 bytes
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      9310f5a9     68 u-secs   1.84 MB/s     69 u-secs   1.81 MB/s 
zlib simd 9310f5a9     26 u-secs   4.81 MB/s     27 u-secs   4.63 MB/s 
random data length 65536 bytes
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      82d82033     34 u-secs   1.84 MB/s     34 u-secs   1.84 MB/s 
zlib simd 82d82033     12 u-secs   5.21 MB/s     12 u-secs   5.21 MB/s 
random data length 32768 bytes
--test--  adler32   Min time     Min Rate    Median time  Median Rate 
zlib      f4cf18d7     17 u-secs   1.84 MB/s     17 u-secs   1.84 MB/s 
zlib simd f4cf18d7      5 u-secs   6.25 MB/s      6 u-secs   5.21 MB/s 

Not too bad for a tablet, the NEON adler32 is ~3-4x-ish faster.
FWIW, I've found that the Denver cores in the K1 perform wildly differently than the rest of the ARMv8 chips in the wild, and changes that improve most cores often hurt it and vice versa.  I'd strongly recommend never benchmarking with it... it's an oddball distraction.

Now, of course, in this case everything looks just fine eh?

Comment 36 by noel@chromium.org, Sep 21 2017

Next step is running image_decode_bench on these Android devices.  That's maybe unfamiliar to some folks, so here's a script to do it, essentially ... 

  % ninja -C out/Release image_decode_bench_apk

to create an org.chromium.native_test, and use the script (attached) to load that apk onto the device and "run it".

The adb incantations used to "run it" are not exactly obvious, and there's also no way I know of to nice(1) the command the org.chromium.native_test harness spawns on the device that actually calls image_decode_bench.

One consequence is that: other things the Android OS decides to run on the tablet while you are testing can add intermittent variance to the results at times, and cause some corpus images to be marked as perf regressions in a given corpus test run.

Where I saw examples that (eg., one image was 75% slower with neon, than without), I re-measured the particular image.  The typically outcome (though not always) was there was no regression at all, and I updated the results for that image (it was a few percent faster in fact).

test.sh
1.6 KB View Download

Comment 37 by noel@chromium.org, Sep 21 2017

> Now, of course, in this case everything looks just fine eh?

Ah yeah :), though they are kinda different too under image-decode bench.  The K1 results look more like the desktop results (in shape), whereas the X1 CPU (Pixel C tablet) was ... flatter I suppose (not sure why that happens).

Anyho, with Battery 100%, WIFI off, BlueTooth off, Screen On Always, power supply wall plugged:

1) Android Pixel C tablet, arm8 quad-core Nvidia X1 CPU, 3G RAM
https://docs.google.com/spreadsheets/d/11VrR-ZkkdmUArZWVmOf2oebqEo42euriLWpmHZrlDHg/edit#gid=595603569
  +2% win on the average, no so many good wins though.

2) Android Nexus 9 tablet, arm8 dual-core Nvidia Tegra K1 CPU, 2G RAM
https://docs.google.com/spreadsheets/d/1m6SptRMg0-eujnWMyZTP8P9jwzteX0kWCgZ0LbxQ8Ew/edit#gid=595603569
  +4% win on the average, plenty of good wins.

The 3x gain on ARM is in line with the results I observed in January (8 months ago):
https://bugs.chromium.org/p/chromium/issues/detail?id=688601

One important thing to notice concerning the variation is that you will have different results if the code is running in a little or a big core (the nice value is no guarantee of core allocation if you have an EAS kernel). 

I've observed consistent 3x gain in a big core (A72) and 2x gain in a little core (A53).

Concerning image decoding speedups, the percentage gains will also be more significant on ARMv7 (up to 18% in some cases) than on ARMv8 (5-7%).

Considering that you now have re-validated the approach for ARM, what is left to land this?

Comment 39 by noel@chromium.org, Sep 25 2017

Seems the image decoding speedups are more modest on ARMv8 (2% over the corpus on the average) based on my measurements, I don't have an ARMv7 to compare.

Thanks for explanation re big (A72) or little core (A53).  The gain on the bench is closer to 4x now, I compared "bench large" on a Pixel C tablet for your patch

  https://chromium-review.googlesource.com/c/chromium/src/+/611492/12

and my patch https://chromium-review.googlesource.com/c/chromium/src/+/660019/36 and attached the results.

> what is left to land this?

Review I suppose, and what to do about the fact that my patch depends on the Intel simd.patch.
bench-large-compare-pixel-c.txt
5.0 KB View Download

Comment 40 by noel@chromium.org, Sep 25 2017

Latest bench.cpp (attached, see also #33).
bench.cpp
12.0 KB View Download

Comment 41 by noel@chromium.org, Sep 26 2017

Put all the results in once place:

Results on different machines for the 140 PNG measurement corpus.

Desktop Devices
===============

HP Z840 2016
 * 16-core Intel Xeon E5-2683 V4 2.1GHz Broadwell-EP (14nm)
 * Turbo Boost off.
win10-z840-slow-fast-zlib-adler-sse
https://docs.google.com/spreadsheets/d/1V0DF-NYrKo1xWF88N-5DJGQfUDtmo0VkhHm9uOaQ_7M/edit#gid=1475395966

DELL PRECISION M3800 LAPTOP
 * Intel Core i7-4712HQ 2.3GHz Haswell-H (22nm)
win10-laptop-slow-fast-zlib-adler-sse
https://docs.google.com/spreadsheets/d/1eUDlm5GLA7Zd7IfEd9NUaM1wkes-WzvZBI6rKriZskk/edit#gid=1546726966

MAC PRO 2013 (12-CORE) 2013
 * 12-core Intel Xeon E5-2697 V2, 2.7GHz Ivy Bridge-EP (22 nm) 
mac-pro-slow-fast-zlib-adler-sse
https://docs.google.com/spreadsheets/d/16T4PlAXPqBtlw16JZnhAROk4bGNOC572vtOdJKHAluc/edit#gid=558740138

MAC BOOK PRO 13" 2016 (TOUCH BAR) LAPTOP
 * Dual-core Intel Core i7-6567U, 3.3GHz Skylake-U (14nm)
png mac-book-pro-slow-fast-zlib-adler-ssse3
https://docs.google.com/spreadsheets/d/1YKF2ShSqwvxfVUPlg-_1v0pZxsnqYBqy65aS9CClWGE/edit#gid=595603569

Measurement patch https://chromium-review.googlesource.com/c/chromium/src/+/609054 for Nigel's patch (from  issue 760853 ). The results of that are graphed as the blue line labelled "fast zlib".

The SSE results are graphed as the red-line: decoding rate MB/s with the change, divided by the decoding rate MB/s with no change (decoding rate ratio).  The x-axis is the corpus image (sorted in ascending order).

Android Devices
===============

ANDROID PIXEL C TABLET
 * Arm8 quad-core Nvidia X1 CPU, 3G RAM
https://docs.google.com/spreadsheets/d/11VrR-ZkkdmUArZWVmOf2oebqEo42euriLWpmHZrlDHg/edit#gid=595603569

ANDROID NEXUS 9 TABLET
  * Arm8 dual-core Nvidia Tegra K1 CPU, 2G RAM
https://docs.google.com/spreadsheets/d/1m6SptRMg0-eujnWMyZTP8P9jwzteX0kWCgZ0LbxQ8Ew/edit#gid=595603569

Project Member

Comment 42 by bugdroid1@chromium.org, Sep 29 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/09b784fd12f255a9da38107ac6e0386f4dde6d68

commit 09b784fd12f255a9da38107ac6e0386f4dde6d68
Author: Noel Gordon <noel@chromium.org>
Date: Fri Sep 29 19:44:25 2017

zlib adler_simd.c

Add SSSE3 implementation of the adler32 checksum, suitable for
both large workloads, and small workloads commonly seen during
PNG image decoding. Add a NEON implementation.

Speed is comparable to the serial adler32 computation but near
64 bytes of input data, the SIMD code paths begin to be faster
than the serial path: 3x faster at 256 bytes of input data, to
~8x faster for 1M of input data (~4x on ARMv8 NEON).

For the PNG 140 image corpus, PNG decoding speed is ~8% faster
on average on the desktop machines tested, and ~2% on an ARMv8
Pixel C Android (N) tablet,  https://crbug.com/762564#c41 

Update x86.{c,h} to runtime detect SSSE3 support and use it to
enable the adler32_simd code path and update inflate.c to call
x86_check_features(). Update the name mangler file names.h for
the new symbols added, add FIXME about simd.patch.

Ignore data alignment in the SSSE3 case since unaligned access
is no longer penalized on current generation Intel CPU. Use it
in the NEON case however to avoid the extra costs of unaligned
memory access on ARMv8/v7.

NEON credits: the v_s1/s2 vector component accumulate code was
provided by Adenilson Cavalcanti. The uint16 column vector sum
code is from libdeflate with corrections to process NMAX input
bytes which improves performance by 3% for large buffers.

Update BUILD.gn to put the code in its own source set, and add
it conditionally to the zlib library build rule. On ARM, build
the SIMD with max-speed config to produce the smallest code.

No change in behavior, covered by many existing tests.

Bug:  762564 
Change-Id: I14a39940ae113b5a67ba70a99c3741e289b1796b
Reviewed-on: https://chromium-review.googlesource.com/660019
Commit-Queue: Chris Blume <cblume@chromium.org>
Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org>
Reviewed-by: Chris Blume <cblume@chromium.org>
Cr-Commit-Position: refs/heads/master@{#505447}
[modify] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/BUILD.gn
[modify] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/adler32.c
[add] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/adler32_simd.c
[add] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/adler32_simd.h
[modify] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/inflate.c
[modify] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/names.h
[add] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/patches/0005-adler32-simd.patch
[modify] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/x86.c
[modify] https://crrev.com/09b784fd12f255a9da38107ac6e0386f4dde6d68/third_party/zlib/x86.h

Status: Fixed (was: Started)

Comment 44 by noel@chromium.org, Oct 6 2017

Script for running image_decode_bench on Android devices, added more comments about the steps needed to prepare and run image_decode_bench.
test.sh
1.9 KB View Download
Very cool!

Comment 46 by noel@chromium.org, Oct 9 2017

I also uploaded the WIP patch I used for image decode bench for the decoding rate measurements, and added an encoding rate bench to it as well [1].

[1] https://chromium-review.googlesource.com/c/chromium/src/+/706744

Adler32 affects PNG encoding rate.  Out of interest, I measured the 140 PNG corpus by decoding each image, and then PNG encoding the image frame, via ImageDataBuffer which is used for Blink image encoding, in an image encoding bench.

MAC BOOK PRO 13" 2016 (TOUCH BAR) LAPTOP
 * Dual-core Intel Core i7-6567U, 3.3GHz Skylake-U (14nm)
png mac-book-pro-encoding-rate-sse-adler
https://docs.google.com/spreadsheets/d/1q1Xz55UDOk4EIaqD5j1Chee0tVHe9HjJcD5YIP9a69k/edit#gid=595603569

PNG encoding rate improves with the adler32 ssse3 change by ~15% on the average over the PNG 140 corpus on the test machine.

Comment 47 by noel@chromium.org, Dec 20 2017

Status: Started (was: Fixed)
re-opened: minor fix to adler simd code coming.  The MSVC release bot spotted:

  unsigned n = NMAX / BLOCK_SIZE;  /* The NMAX constraint. */
  if (n > blocks)
      n = blocks;  <-- possible precision loss converting from size_t to unsigned 

since blocks is size_t.  No precision loss occurs since n > blocks.  Anyho, making n size_t too should do the trick.
Project Member

Comment 48 by bugdroid1@chromium.org, Dec 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0cb9a22e2fb55e092342192d66f7e33c14432d27

commit 0cb9a22e2fb55e092342192d66f7e33c14432d27
Author: Noel Gordon <noel@chromium.org>
Date: Wed Dec 20 07:42:57 2017

zlib adler_simd.c: unsigned cast |blocks| on assignment

MSVC noted the unsigned |n| = size_t |blocks| could be a possible
loss in precision. No loss in precision occurs since (n > blocks)
at this point: |blocks| fits in an unsigned type.

To silence compiler warnings, first update BUILD.gn for the adler
SIMD code to use chromium compiler:chromium_code rule (more error
checking), rather than the permissive "compiler:no_chromium_code"
rule. Then cast |blocks| to unsigned on assigment to |n| (this is
safe to do as mentioned above).

No change in behavior, no new tests.

Tbr: cblume@chromium.org
Bug:  762564 
Change-Id: Ia97120bcca206287fd42b97674f8a6215283e4a5
Reviewed-on: https://chromium-review.googlesource.com/835927
Commit-Queue: Noel Gordon <noel@chromium.org>
Reviewed-by: Sam McNally <sammc@chromium.org>
Cr-Commit-Position: refs/heads/master@{#525285}
[modify] https://crrev.com/0cb9a22e2fb55e092342192d66f7e33c14432d27/third_party/zlib/BUILD.gn
[modify] https://crrev.com/0cb9a22e2fb55e092342192d66f7e33c14432d27/third_party/zlib/adler32_simd.c

Comment 49 by noel@chromium.org, Dec 20 2017

 > Anyho, making n size_t too should do the trick.

... or else casting |blocks| to unsigned when assigning to |n|.  Bots tell us the latter is acceptable to them, so go with that.

Comment 50 by noel@chromium.org, Dec 20 2017

Status: Fixed (was: Started)
Project Member

Comment 51 by bugdroid1@chromium.org, Feb 7 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/c376937e3709ddef4b97621a58039a21b67faa6e

commit c376937e3709ddef4b97621a58039a21b67faa6e
Author: Noel Gordon <noel@chromium.org>
Date: Wed Feb 07 09:33:01 2018

Remove x86_check_features() from inflate

adler32 can be freely called like crc32, per the "zlib.h" docs
so move the x86 feature check there, which allows us to remove
x86_check_features() from zlib inflate. Note inflate() calls

  adler32(0, Z_NULL, 0);

(once only) before using the zlib adler32() routine. A reading
from third_party/libpng png.c

  2324:  /* Now calculate the adler32 if not done already. */
  2327:     adler = adler32(0, NULL, 0);
  2328:     adler = adler32(adler, profile, length);

suggests they also conform to the "zlib.h" doc rules about use
of the adler32 routine.

Bug:  762564 
Change-Id: Id84da0eb52bc10edb541a284eab9aef652ba2c72
Reviewed-on: https://chromium-review.googlesource.com/901043
Commit-Queue: Noel Gordon <noel@chromium.org>
Reviewed-by: Mike Klein <mtklein@chromium.org>
Cr-Commit-Position: refs/heads/master@{#534964}
[modify] https://crrev.com/c376937e3709ddef4b97621a58039a21b67faa6e/third_party/zlib/adler32.c
[modify] https://crrev.com/c376937e3709ddef4b97621a58039a21b67faa6e/third_party/zlib/contrib/optimizations/inflate.c
[modify] https://crrev.com/c376937e3709ddef4b97621a58039a21b67faa6e/third_party/zlib/inflate.c

Project Member

Comment 52 by bugdroid1@chromium.org, Feb 10 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/9749d6d75c1703adaf154c8bd37c46d13ad4423a

commit 9749d6d75c1703adaf154c8bd37c46d13ad4423a
Author: Noel Gordon <noel@chromium.org>
Date: Sat Feb 10 01:47:09 2018

Delete zlib/patches/0005-adler32-simd.patch

Patches are no longer a thing. Remove the patch file for alder32 SIMD
change since it mostly touched files that are not part of upstream.

One upstream file adler32.c was touched but it has feature guards, so
we should be good when merging in any upstream changes.

Bug:  762564 
Change-Id: Iad8e389667189300fa180755bc2ddc0187138bc0
Reviewed-on: https://chromium-review.googlesource.com/910539
Reviewed-by: Chris Blume <cblume@chromium.org>
Commit-Queue: Noel Gordon <noel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#535922}
[delete] https://crrev.com/21142de6a3d071974cf458b4e4fb63e2b2e1afd6/third_party/zlib/patches/0005-adler32-simd.patch

Project Member

Comment 53 by bugdroid1@chromium.org, Feb 13 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0342258175d72113ae72c0ce56d79a3784ad2dd0

commit 0342258175d72113ae72c0ce56d79a3784ad2dd0
Author: Noel Gordon <noel@chromium.org>
Date: Tue Feb 13 00:46:55 2018

[image_decode_bench] Change BUILD gn to build an executable

Currently built as a test fixture, which makes it painful to use when
testing on Android: a special script is needed for Android [1], which
makes Android a special snowflake when image benching.

Simpler worlds can be had even for Android: build/create image decode
bench binary executable. Android developers can 'adb push' the binary
to their Android test device, and run it from adb shell (and no fancy
scripts required). For everyone else (mac/linux...), nothing changes,
just run out/Release/image_decode_bench as usual.

[1]  https://crbug.com/762564#c44 

Bug: 601198,  762564 
Change-Id: I7c229f686ea5f6672b2d4d972a010f2bfbccd40c
Reviewed-on: https://chromium-review.googlesource.com/912988
Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org>
Reviewed-by: Leon Scroggins <scroggo@chromium.org>
Commit-Queue: Noel Gordon <noel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#536249}
[modify] https://crrev.com/0342258175d72113ae72c0ce56d79a3784ad2dd0/third_party/WebKit/Source/platform/BUILD.gn

Sign in to add a comment