
Issue 760853

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2018
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocking:
issue 764085




zlib inflate Huffman decoder input reading speed-ups for 64 bit Intel

Project Member Reported by nigeltao@chromium.org, Aug 31 2017

Issue description

https://github.com/madler/zlib/pull/292 is a pull request that speeds up zlib decode by up to 1.5x on the x86_64 architecture.

That patch also affects some other 64-bit architectures: AArch64 and PowerPC64. I don't have direct access to such hardware, but a colleague of mine ran the numbers there and reported a smaller but still consistently positive speedup of around 1.2x on both.
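
For anyone skimming this issue, here is a minimal sketch of the idea behind the patch (not the actual code; all identifiers below are illustrative): vanilla zlib's inflate_fast() tops up its bit accumulator one byte at a time, while the patch refills it from a single unaligned 8-byte little-endian load, so the Huffman decoder returns to the input far less often.

#include <stdint.h>
#include <string.h>

/* Vanilla-style refill: one byte per iteration. */
static void refill_bytewise(const unsigned char **in, uint64_t *hold,
                            unsigned *bits) {
  while (*bits < 48) {
    *hold |= (uint64_t)(*(*in)++) << *bits;
    *bits += 8;
  }
}

/* Wide refill: one 8-byte load (little-endian assumed; a big-endian target
 * would need a byte swap). The caller must guarantee at least 8 readable
 * bytes at *in; memcpy() of a fixed 8 bytes typically compiles to a single
 * load instruction. */
static void refill_wide(const unsigned char **in, uint64_t *hold,
                        unsigned *bits) {
  uint64_t chunk;
  memcpy(&chunk, *in, sizeof(chunk));
  *hold |= chunk << *bits;
  *in += (63 - *bits) >> 3;  /* number of whole bytes actually consumed */
  *bits |= 56;               /* accumulator now holds >= 56 valid bits */
}

The real code also has to guarantee that enough input bytes remain before taking the wide path; this sketch skips that bookkeeping.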

There have been a couple of places where getting this improvement into chromium was discussed:

Code review:
https://chromium-review.googlesource.com/c/chromium/src/+/601694

"Progressive Web Metrics" list discussion re metrics:
https://groups.google.com/a/chromium.org/d/topic/progressive-web-metrics/Z-lWP4pQ43c/discussion

This issue is a tracking issue, and hopefully a central place for discussion.
 
Cc: sho...@google.com cavalcantii@chromium.org cblume@chromium.org noel@chromium.org
On https://chromium-review.googlesource.com/c/chromium/src/+/601694, cavalcantii asks:

> Could you check what the performance improvement is on the same test case
> (https://codepen.io/Savago/pen/JEoYNX)?

I tried 5 reps of hitting the "Run" button before (old) and after (new) my patch. The numbers reported look too noisy to draw any strong conclusion. For example, on the same binary (old), the ratio of best to worst times is 1.75x:

Old
### took 0.01999999999998181 miliseconds.Test results:
### took 0.01999999999998181 miliseconds.Test results:
### took 0.01999999999998181 miliseconds.Test results:
### took 0.024999999999636202 miliseconds.Test results:
### took 0.035000000000081855 miliseconds.Test results:

New
### took 0.01999999999998181 miliseconds.Test results:
### took 0.020000000000436557 miliseconds.Test results:
### took 0.020000000000436557 miliseconds.Test results:
### took 0.024999999999636202 miliseconds.Test results:
### took 0.02500000000009095 miliseconds.Test results:



As an alternative (but with the same https://upload.wikimedia.org/wikipedia/commons/3/3f/ZebraHighRes.png PNG image), I ran chromium's image_decode_bench program (as in, "ninja -C out/release image_decode_bench"), before (old) and after (new) my patch:

$ for i in {1..5}; do ./old ZebraHighRes.png 20; done | sort 
1.234950 0.061748
1.234987 0.061749
1.236219 0.061811
1.236629 0.061831
1.238431 0.061922

$ for i in {1..5}; do ./new ZebraHighRes.png 20; done | sort 
1.145096 0.057255
1.145904 0.057295
1.147598 0.057380
1.173633 0.058682
1.177580 0.058879

For each run, the two numbers are total time and time per repetition (here, 20 reps). In other words, decoding ZebraHighRes.png took 62ms per rep before and 57ms per rep after, or a 1.07x improvement.
On https://chromium-review.googlesource.com/c/chromium/src/+/601694, cavalcantii asks:

> I recommend having a look at blink_platform_unittests (*PNG* and *Png) and
> posting the results in an issue tracking this patch.

blink_platform_unittests seems happy:



$ out/rel/blink_platform_unittests | tail
[2098/2103] PaintControllerUnderInvalidationTest.NoopPairsInSubsequence (0 ms)
[2099/2103] PaintControllerUnderInvalidationTest.ChangeDrawingInSubsequence (43 ms)
[2100/2103] PaintControllerUnderInvalidationTest.MoreDrawingInSubsequence (50 ms)
[2101/2103] PaintControllerUnderInvalidationTest.LessDrawingInSubsequence (44 ms)
[2102/2103] PaintControllerUnderInvalidationTest.ChangeNonCacheableInSubsequence (49 ms)
[2103/2103] PaintControllerUnderInvalidationTest.InvalidationInSubsequence (0 ms)
Retrying 1 test (retry #1)
[2104/2104] MailboxSharedGpuContextTest.MailboxCaching (1 ms)
SUCCESS: all tests passed.
Tests took 0 seconds.



$ out/rel/blink_platform_unittests | grep -i png
[363/2103] BitmapImageTest.pngHasColorProfile (0 ms)
[570/2103] DeferredImageDecoderTestWoPlatform.mixImagesPng (1 ms)
[722/2103] AnimatedPNGTests.sizeTest (1 ms)
[723/2103] AnimatedPNGTests.repetitionCountTest (0 ms)
[724/2103] AnimatedPNGTests.MetaDataTest (0 ms)
[725/2103] AnimatedPNGTests.EmptyFrame (0 ms)
[726/2103] AnimatedPNGTests.ByteByByteSizeAvailable (0 ms)
[727/2103] AnimatedPNGTests.ByteByByteMetaData (0 ms)
[728/2103] AnimatedPNGTests.TestRandomFrameDecode (0 ms)
[729/2103] AnimatedPNGTests.TestDecodeAfterReallocation (1 ms)
[730/2103] AnimatedPNGTests.ProgressiveDecode (0 ms)
[771/2103] AnimatedPNGTests.FailureDuringDecodingInvalidatesDecoder (0 ms)
[772/2103] AnimatedPNGTests.VerifyIENDBeforeIDATInvalidatesDecoder (0 ms)
[773/2103] AnimatedPNGTests.MixedDataChunks (1 ms)
[774/2103] AnimatedPNGTests.VerifyInvalidDisposalAndBlending (0 ms)
[775/2103] AnimatedPNGTests.VerifySuccessfulFirstFrameDecodeAfterLaterFrame (0 ms)
[776/2103] AnimatedPNGTests.DecodeFromIndependentFrame (0 ms)
[777/2103] AnimatedPNGTests.SubsetFromIHDR (1 ms)
[778/2103] StaticPNGTests.repetitionCountTest (0 ms)
[779/2103] StaticPNGTests.sizeTest (0 ms)
[780/2103] StaticPNGTests.MetaDataTest (0 ms)
[811/2103] StaticPNGTests.InvalidIHDRChunk (0 ms)
[812/2103] StaticPNGTests.ProgressiveDecoding (0 ms)
[813/2103] StaticPNGTests.ProgressiveDecodingContinuesAfterFullData (0 ms)
[814/2103] StaticPNGTests.FailureDuringDecodingInvalidatesDecoder (1 ms)
[815/2103] PNGTests.VerifyFrameCompleteBehavior (2 ms)
[816/2103] PNGTests.sizeMayOverflow (0 ms)
[851/2103] AnimatedPNGTests.ParseAndDecodeByteByByte (0 ms)
[852/2103] AnimatedPNGTests.FailureDuringParsing (1 ms)
[853/2103] AnimatedPNGTests.ActlErrors (4 ms)
[854/2103] AnimatedPNGTests.fdatBeforeIdat (1 ms)
[855/2103] AnimatedPNGTests.IdatSizeMismatch (0 ms)
[856/2103] AnimatedPNGTests.VerifyFrameOutsideImageSizeFails (0 ms)
[857/2103] AnimatedPNGTests.ProgressiveDecodingContinuesAfterFullData (0 ms)
[858/2103] AnimatedPNGTests.RandomDecodeAfterClearFrameBufferCache (0 ms)
[859/2103] AnimatedPNGTests.VerifyAlphaBlending (0 ms)
[860/2103] AnimatedPNGTests.FailureMissingIendChunk (0 ms)
[886/2103] ICOImageDecoderTests.errorInPngInIco (0 ms)

Re the https://codepen.io/Savago/pen/JEoYNX benchmark mentioned in comment #2, on closer inspection, the numbers look bogus. Decoding a PNG image of that size shouldn't take less than a *millisecond*. I've attached a screenshot of Chrome stable (version 60) on desktop, x86_64, that shows similar numbers.

My conclusion is that that codepen.io snippet isn't measuring image decode time at all. Maybe it's just some DOM manipulation, but I'm only guessing.

I don't have any clue what the difference is between my Chrome and your (cavalcantii) Chrome such that you were measuring around 100 milliseconds (instead of sub-millisecond) on your own device.
codepen.png
111 KB
@Nigel: the timestamp printed in the webpage reflects the time taken to insert the DOM element into the tree. I use this test to collect Chrome traces, where I later look for the time spent in ImageFrameGenerator::decode.
A way we can improve this test is to request image.png?1, image.png?2, etc.
Otherwise, the second decode might be using the cache.

Until we finish landing the JS decode API, the best JS option is to add an onload() event handler. But it includes downloading :(

For an example of that, I'm forced to use it here:
https://chromium-review.googlesource.com/c/chromium/src/+/646242
So for the Zebra test image, image_decode_bench reported a 7% improvement?

Expect the effect in the web browser (i.e. the perf bot's image_decoding.html?png) to be a bit smaller, as considerable time is spent in the compositor copying/resizing the pixels (I'm planning to investigate how we can speed things up there).

Probably the net effect will be around 5% (assuming the compositor behaves similarly on Intel to what I've observed on Android).

YMMV because Intel has one-copy texture uploads.

Comment 8 Deleted

And seconding what cblume mentioned, you have to make sure you won't be hitting the cache while collecting the trace.
Huh, I'd have thought that you *do* want to hit the cache while collecting the trace. Otherwise, you're going to add network I/O noise to your timings, and network I/O can take much longer than image decoding.

Comment 11 by noel@chromium.org, Sep 1 2017

Having network I/O timing noise in image decoding measurements is not good.

When using tracing, note that image decodes happen off the main thread:
 -- you might get scheduled out on a thread switch

Dunno about the new perf bot thing in #6 for images:
 -- do they decode off the main thread, or not?

Closest thing to a synchronous decode using a web page is to:
 -- img = new Image(), img.onload = function() { draw(this) }, img.src = url;
 -- draw() would draw the image to a <canvas> element
 -- that <canvas> draw causes a sync (main thread) image decode

And with any method used:
 -- can you ever be sure you have received all the compressed image data?

We use image_decode_bench to remove all these variables when we want to assess the performance aspects of any change that directly affects the Blink image decoders or a sub-system they require (e.g., zlib).

re: blink_platform_unittests
 -- no real need to run those manually (unless you broke them)
 -- they are automatically run on try jobs and on the CQ

While I support the use of image_decode_bench, it should be clear that it won't fully reveal the global impact in the browser, because there are other component layers at play.

I'm not deeply familiar with how the Blink and the Compositor image caches work (or how they are integrated). I have the impression that one is a network cache while the second is a texture/image cache.

But it is reasonable to deduce that if you try to insert an image into a webpage and that image is already in the cache, it won't be decoded twice (otherwise, why would you need a cache at all?).

This seems to be confirmed by the trace I'm attaching, where I clicked twice (waiting a couple of seconds between the clicks) on the button in the 'Parrot' test case to include 2 images. You can clearly see in the first image insertion that time is being spent in ImageFrameGenerator::decode, while in the second execution (i.e. hitting the cache) you only see a call into Skia's SkCanvas::drawRect().

Also, as a browser user would notice, the first image inclusion took a little while and the second was instantaneous.

The only scenario where the browser user would have everything in cache is a 'light' refresh of a given page. But the common case is to open a new tab (i.e. probably spawning a new RenderProcess) and do a cold load of a website.

If network 'noise' is a major concern, one alternative is simply to host the test image on a local computer on the same network. That being said, I fail to see how network noise would throw off the time spent *actually* decoding a PNG. As new data arrives through the pipe, you decode it. The time spent decoding is what matters, not the time it took to load the page, and that is what ImageFrameGenerator::decode() is going to show.

Again, this is how I think testing should be done: from the perspective of what a Chrome user will face while using the browser. I'm sharing what I learned in the ARM optimization story so you can benefit from it.

2_parrots.png
233 KB
Not sure if this is the kind of output expected, but here it goes:

(xenial)adenilson@localhost:~/snappy/build$ file snappy_unittest 
snappy_unittest: ELF 64-bit LSB executable, ARM aarch64, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, for GNU/Linux 3.7.0, BuildID[sha1]=dfafcee6f4a0b93059920d0008d072bd5e0f750f, not stripped
(xenial)adenilson@localhost:~/snappy/build$ ldd snappy_unittest 
        linux-vdso.so.1 =>  (0x0000007c222da000)
        libz.so.1 => /usr/local/lib/libz.so.1 (0x0000007c222a8000)
        libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000007c22270000)
        libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000007c220e0000)
        libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000007c220bf000)
        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007c21f78000)
        /lib/ld-linux-aarch64.so.1 (0x0000005eff1bc000)
        libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000007c21eca000)

a) Vanilla 1.2.11
(xenial)adenilson@localhost:~/snappy/build$ taskset -c 4 ./snappy_unittest 
Running microbenchmarks.
WARNING: Compiled with assertions enabled, will be slow.
WARNING: Compiled without optimization, will be slow.
Benchmark            Time(ns)    CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0             506661     506122        245 192.9MB/s  html
BM_UFlat/1            5027760    5030000        100 133.1MB/s  urls
BM_UFlat/2              14299      14297      11890 8.0GB/s  jpg
BM_UFlat/3               1257       1254     156250 152.1MB/s  jpg_200
BM_UFlat/4              69533      69225       2788 1.4GB/s  pdf
BM_UFlat/5            2034730    2030000        100 192.4MB/s  html4
BM_UFlat/6            1868254    1867924        106 77.6MB/s  txt1
BM_UFlat/7            1627352    1631147        122 73.2MB/s  txt2
BM_UFlat/8            5017350    5020000        100 81.1MB/s  txt3
BM_UFlat/9            6999800    7000000        100 65.6MB/s  txt4
BM_UFlat/10            478931     479318        411 235.9MB/s  pb
BM_UFlat/11           1952382    1950980        102 90.1MB/s  gaviota
BM_UIOVec/0            960393     961165        206 101.6MB/s  html
BM_UIOVec/1           8407500    8380000        100 79.9MB/s  urls
BM_UIOVec/2             15434      15467       9310 7.4GB/s  jpg
BM_UIOVec/3              2428       2431      79365 78.4MB/s  jpg_200
BM_UIOVec/4            130191     130193       1444 750.1MB/s  pdf
BM_UValidate/0         209145     209081        947 467.1MB/s  html
BM_UValidate/1        2307110    2310000        100 289.9MB/s  urls
BM_UValidate/2           1010       1009     172413 113.6GB/s  jpg
BM_UValidate/3            683        676     263157 282.0MB/s  jpg_200
BM_UValidate/4          19981      19973       9813 4.8GB/s  pdf
BM_ZFlat/0            1325281    1322147        149 73.9MB/s  html (22.31 %)
BM_ZFlat/1           19445290   19440000        100 34.4MB/s  urls (47.78 %)
BM_ZFlat/2              49976      50086       3474 2.3GB/s  jpg (99.95 %)
BM_ZFlat/3               5795       5800      34305 32.9MB/s  jpg_200 (73.00 %)
BM_ZFlat/4             166427     165907       1097 588.6MB/s  pdf (83.30 %)
BM_ZFlat/5            5699970    5690000        100 68.7MB/s  html4 (22.52 %)
BM_ZFlat/6            4990370    4990000        100 29.1MB/s  txt1 (57.88 %)
BM_ZFlat/7            4404810    4410000        100 27.1MB/s  txt2 (61.91 %)
BM_ZFlat/8           14135230   14130000        100 28.8MB/s  txt3 (54.99 %)
BM_ZFlat/9           18724740   18720000        100 24.5MB/s  txt4 (66.26 %)
BM_ZFlat/10           1281894    1278145        151 88.5MB/s  pb (19.68 %)
BM_ZFlat/11           4231080    4230000        100 41.6MB/s  gaviota (37.72 %)

b) With iffast64 patch (from https://github.com/madler/zlib/pull/292):
(xenial)adenilson@localhost:~/snappy/build$ taskset -c 4 ./snappy_unittest 
Running microbenchmarks.
WARNING: Compiled with assertions enabled, will be slow.
WARNING: Compiled without optimization, will be slow.
Benchmark            Time(ns)    CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0             505374     504043        371 193.7MB/s  html
BM_UFlat/1            4987610    4980000        100 134.5MB/s  urls
BM_UFlat/2              14441      14425      13726 7.9GB/s  jpg
BM_UFlat/3               1248       1244     155038 153.2MB/s  jpg_200
BM_UFlat/4              69954      69439       2693 1.4GB/s  pdf
BM_UFlat/5            2048920    2040000        100 191.5MB/s  html4
BM_UFlat/6            1871066    1876190        105 77.3MB/s  txt1
BM_UFlat/7            1651471    1628099        121 73.3MB/s  txt2
BM_UFlat/8            5016400    5020000        100 81.1MB/s  txt3
BM_UFlat/9            6925810    6920000        100 66.4MB/s  txt4
BM_UFlat/10            479709     479217        409 236.0MB/s  pb
BM_UFlat/11           1973780    1960000        100 89.7MB/s  gaviota
BM_UIOVec/0            970712     960975        205 101.6MB/s  html
BM_UIOVec/1           8317450    8310000        100 80.6MB/s  urls
BM_UIOVec/2             15424      15424      11540 7.4GB/s  jpg
BM_UIOVec/3              2458       2452      72992 77.8MB/s  jpg_200
BM_UIOVec/4            130534     130555       1440 748.0MB/s  pdf
BM_UValidate/0         210539     209033        952 467.2MB/s  html
BM_UValidate/1        2315300    2310000        100 289.9MB/s  urls
BM_UValidate/2           1006       1006     147058 113.9GB/s  jpg
BM_UValidate/3            678        676     263157 282.0MB/s  jpg_200
BM_UValidate/4          20008      20008       9896 4.8GB/s  pdf
BM_ZFlat/0            1327800    1324137        145 73.8MB/s  html (22.31 %)
BM_ZFlat/1           18967000   18940000        100 35.4MB/s  urls (47.78 %)
BM_ZFlat/2              49063      49209       3353 2.3GB/s  jpg (99.95 %)
BM_ZFlat/3               5831       5762      32626 33.1MB/s  jpg_200 (73.00 %)
BM_ZFlat/4             167448     167814       1162 581.9MB/s  pdf (83.30 %)
BM_ZFlat/5            5633460    5630000        100 69.4MB/s  html4 (22.52 %)
BM_ZFlat/6            5068580    5060000        100 28.7MB/s  txt1 (57.88 %)
BM_ZFlat/7            4326500    4330000        100 27.6MB/s  txt2 (61.91 %)
BM_ZFlat/8           13995480   13970000        100 29.1MB/s  txt3 (54.99 %)
BM_ZFlat/9           18611570   18600000        100 24.7MB/s  txt4 (66.26 %)
BM_ZFlat/10           1264190    1263157        152 89.5MB/s  pb (19.68 %)
BM_ZFlat/11           4207690    4210000        100 41.8MB/s  gaviota (37.72 %)
Traces on a Google Pixel for the Zebra parrot test case (it seems that the NEON code is faster on AArch64).
neon_inffast64.png
285 KB
Your snappy_unittest numbers don't really show any change between the two, which looks odd. When I run my numbers, I use the LD_LIBRARY_PATH environment variable to pick the zlib implementation. You're not setting LD_LIBRARY_PATH, so how are you picking which zlib implementation for snappy_unittest to use?

For your Zebra parrot case, can you get numbers for my inffast64 version versus vanilla zlib, not versus your NEON code?
And for Release builds (both zlib + test):
a) vanilla release
(xenial)adenilson@localhost:~/snappy/build$ taskset -c 4 ./snappy_unittest 
Running microbenchmarks.
Benchmark            Time(ns)    CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0              90815      90696       1709 1.1GB/s  html
BM_UFlat/1            1044234    1040000        175 643.8MB/s  urls
BM_UFlat/2              12403      12415      14981 9.2GB/s  jpg
BM_UFlat/3                253        253     666666 752.4MB/s  jpg_200
BM_UFlat/4              17186      17153      10610 5.6GB/s  pdf
BM_UFlat/5             384169     382692        520 1020.7MB/s  html4
BM_UFlat/6             341037     341880        585 424.3MB/s  txt1
BM_UFlat/7             303240     302571        661 394.6MB/s  txt2
BM_UFlat/8             950082     949541        218 428.6MB/s  txt3
BM_UFlat/9            1367939    1364864        148 336.7MB/s  txt4
BM_UFlat/10             90209      90325       2181 1.2GB/s  pb
BM_UFlat/11            342017     342657        572 513.0MB/s  gaviota
BM_UIOVec/0            279080     279266        709 349.7MB/s  html
BM_UIOVec/1           2504980    2500000        100 267.8MB/s  urls
BM_UIOVec/2             12950      12958      15048 8.8GB/s  jpg
BM_UIOVec/3               708        712     198019 267.9MB/s  jpg_200
BM_UIOVec/4             33323      33245       5715 2.9GB/s  pdf
BM_UValidate/0          55428      55415       3573 1.7GB/s  html
BM_UValidate/1         639232     638709        310 1.0GB/s  urls
BM_UValidate/2            231        231     666666 496.3GB/s  jpg
BM_UValidate/3            160        160    1000000 1.2GB/s  jpg_200
BM_UValidate/4           4790       4802      40816 19.9GB/s  pdf
BM_ZFlat/0             216380     216494        873 451.1MB/s  html (22.31 %)
BM_ZFlat/1            3534340    3530000        100 189.7MB/s  urls (47.78 %)
BM_ZFlat/2              21545      21576       8435 5.3GB/s  jpg (99.95 %)
BM_ZFlat/3                936        935     204081 203.8MB/s  jpg_200 (73.00 %)
BM_ZFlat/4              32104      32129       6038 3.0GB/s  pdf (83.30 %)
BM_ZFlat/5            1014164    1010309        194 386.6MB/s  html4 (22.52 %)
BM_ZFlat/6             844189     840796        201 172.5MB/s  txt1 (57.88 %)
BM_ZFlat/7             798351     797520        242 149.7MB/s  txt2 (61.91 %)
BM_ZFlat/8            2640390    2640000        100 154.2MB/s  txt3 (54.99 %)
BM_ZFlat/9            3501170    3500000        100 131.3MB/s  txt4 (66.26 %)
BM_ZFlat/10            207517     208017        923 543.7MB/s  pb (19.68 %)
BM_ZFlat/11            592317     592233        309 296.8MB/s  gaviota (37.72 %)

b) inffast64 release
(xenial)adenilson@localhost:~/snappy/build$ taskset -c 4 ./snappy_unittest 
Running microbenchmarks.
Benchmark            Time(ns)    CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0              90464      90479       1691 1.1GB/s  html
BM_UFlat/1            1094982    1087719        171 615.6MB/s  urls
BM_UFlat/2              13003      12887      15209 8.9GB/s  jpg
BM_UFlat/3                259        259     571428 736.4MB/s  jpg_200
BM_UFlat/4              16731      16647      10632 5.7GB/s  pdf
BM_UFlat/5             384682     382978        517 1020.0MB/s  html4
BM_UFlat/6             345548     343750        576 421.9MB/s  txt1
BM_UFlat/7             303663     304414        657 392.2MB/s  txt2
BM_UFlat/8            1003932    1005181        193 404.9MB/s  txt3
BM_UFlat/9            1319884    1319727        147 348.2MB/s  txt4
BM_UFlat/10             90027      90163       2196 1.2GB/s  pb
BM_UFlat/11            341894     342013        576 514.0MB/s  gaviota
BM_UIOVec/0            278528     278409        704 350.8MB/s  html
BM_UIOVec/1           2495040    2490000        100 268.9MB/s  urls
BM_UIOVec/2             12963      13005      14609 8.8GB/s  jpg
BM_UIOVec/3               714        713     256410 267.2MB/s  jpg_200
BM_UIOVec/4             33207      33234       5717 2.9GB/s  pdf
BM_UValidate/0          55569      55416       3591 1.7GB/s  html
BM_UValidate/1         637785     638977        313 1.0GB/s  urls
BM_UValidate/2            232        232     645161 493.1GB/s  jpg
BM_UValidate/3            160        160     909090 1.2GB/s  jpg_200
BM_UValidate/4           4813       4813      39682 19.8GB/s  pdf
BM_ZFlat/0             215057     215835        922 452.5MB/s  html (22.31 %)
BM_ZFlat/1            3354450    3350000        100 199.9MB/s  urls (47.78 %)
BM_ZFlat/2              22116      22168       8661 5.2GB/s  jpg (99.95 %)
BM_ZFlat/3                940        940     202020 202.8MB/s  jpg_200 (73.00 %)
BM_ZFlat/4              32545      32532       5871 2.9GB/s  pdf (83.30 %)
BM_ZFlat/5             886608     886956        230 440.4MB/s  html4 (22.52 %)
BM_ZFlat/6             876201     876146        218 165.5MB/s  txt1 (57.88 %)
BM_ZFlat/7             783378     781250        256 152.8MB/s  txt2 (61.91 %)
BM_ZFlat/8            2372350    2370000        100 171.7MB/s  txt3 (54.99 %)
BM_ZFlat/9            3404180    3410000        100 134.8MB/s  txt4 (66.26 %)
BM_ZFlat/10            212025     211640        945 534.4MB/s  pb (19.68 %)
BM_ZFlat/11            602060     602409        332 291.8MB/s  gaviota (37.72 %)
@Nigel: I've set it to point to /usr/local/lib (as is visible in 'ldd').

I removed the self-compiled zlib from /usr/local and installed the tested candidate for each run.

I'm afraid that I can't supply traces for vanilla vs. inffast64 because I'm working from home and would need my workstation, which is at the office (not to mention that Monday will be Labor Day in the USA).

I hope the data was helpful.

Kind of an extrapolation, but in the issue (https://bugs.chromium.org/p/chromium/issues/detail?id=697280) I've included a trace of vanilla 32bits@ARM where it would take 140ms for the Zebra parrot test case.

It may be a tad faster @AArch64 (i.e. more registers) but not by that much.

> I removed the self compiled zlib from /usr/local and install the tested candidate for each run.

One detail I've just noticed is that I usually run snappy_unittest with arguments ("snappy_unittest testdata/*") and you are not. I think that you are therefore producing numbers for the *snappy* compression codec, not the *zlib* one, and the snappy implementation is obviously unaffected by applying my patch to the /usr/local zlib.

The output should look something like:

testdata/alice29.txt                     :
ZLIB:   [b 1M] bytes 152089 ->  54404 35.8%  comp  19.6 MB/s  uncomp 258.2 MB/s

and not like:

BM_UFlat/0              90815      90696       1709 1.1GB/s  html
BM_UFlat/1            1044234    1040000        175 643.8MB/s  urls

Note especially the "ZLIB" in the output.



> I'm afraid that I can't supply traces for vanilla X inffast64 because I'm working from home and would need my workstation that is at the office (not to mention that Monday will be labor day in USA).

Well, I'm not in a hurry. I work part time, and this whole zlib optimization thing is mostly unrelated to what I'm meant to be working on.
My previously mentioned https://gist.github.com/nigeltao/a3ef1761ad7339fac70a12a802467d56 might be helpful for making snappy_unittest give you ZLIB numbers.

Comment 21 by noel@chromium.org, Sep 4 2017

That was measuring one image. Across the 140-image PNG corpus, here are the results on different machines:

HP Z840 2016
 * 16-core Intel Xeon E5-2683 V4 2.1GHz Broadwell-EP (14nm)
 * Turbo Boost off.
win10-z840-slow-fast-zlib-adler-sse
https://docs.google.com/spreadsheets/d/1V0DF-NYrKo1xWF88N-5DJGQfUDtmo0VkhHm9uOaQ_7M/edit#gid=1475395966

DELL PRECISION M3800 LAPTOP
 * Intel Core i7-4712HQ 2.3GHz Haswell-H (22nm)
win10-laptop-slow-fast-zlib-adler-sse
https://docs.google.com/spreadsheets/d/1eUDlm5GLA7Zd7IfEd9NUaM1wkes-WzvZBI6rKriZskk/edit#gid=1546726966

MAC PRO 2013 (12-CORE) 2013
 * 12-core Intel Xeon E5-2697 V2, 2.7GHz Ivy Bridge-EP (22 nm) 
mac-pro-slow-fast-zlib-adler-sse
https://docs.google.com/spreadsheets/d/16T4PlAXPqBtlw16JZnhAROk4bGNOC572vtOdJKHAluc/edit#gid=558740138

MAC BOOK PRO 13" 2016 (TOUCH BAR) LAPTOP
 * Dual-core Intel Core i7-6567U, 3.3GHz Skylake-U (14nm)
mac-book-pro-slow-fast-zlib-adler-sse
https://docs.google.com/spreadsheets/d/1m-d-aF2hHyN6Oi8HHV4Htz_7Xgdh6kEnKnDjKc3WsrU/edit#gid=595603569

Measurement patch https://chromium-review.googlesource.com/c/chromium/src/+/609054
which contains Nigel's patch.  The results of the change are graphed as the blue line labelled "fast zlib".

There are wins and losses, which is nothing new: the average speed-up is 2%, and some images have large gains (>> 15%).

> The time spent decoding is what matters, not the time it took to load the page ...

Yes, and that's what image_decode_bench measures.  Network noise, latency, or other browser factors are of interest, but not when measuring image decoder speed.
@Nigel: vanilla zlib code @AArch64 will yield something around 133ms on a Google Pixel for the Zebra parrot test case (IIRC it would take 140ms@AArch32).

Using the patch + environment variable + release build + CLI arguments, I can confirm speed-ups in snappy_unittest on AArch64.

For HTML the speed-up was around 34% (339.6MB/252.9MB) and 8% (599.8MB/552.5MB) for JPEG, tested on a Rockchip RK3399@2GHz, using taskset to bind it to one of the big cores (A72).

Not like the reported numbers for Intel (1.42x or 42%), but still pretty good.

@Noel: that is some pretty amazing data! I see you have been busy.
:-)

It would be helpful to have descriptions of the columns in the tables to better understand the data.

I have a few questions:

a) I'm assuming the 'Chart results' tab shows 1.0 as the expected time for vanilla zlib?

b) There is a red line labeled 'zlib adler sse'. Is it an implementation of the Adler32 checksum (like we did for ARM) using SSE2? If so, it is quite interesting to see that it yields a boost of up to 29% alone in the best case (I've observed close to 18% on ARM, but YMMV).

c) Do you think you could get some data for Intel M3 and M5? They are quite different beasts from the typical Xeon/i7 Intel chip.
While I am excited by these great improvements, Chrome is typically not built for 64-bit. We should be looking at the 32-bit gains as well. Landing & upstreaming both is desirable.
https://chromium-review.googlesource.com/c/chromium/src/+/601694 "Speed up zlib inflate_fast by up to 1.5x on x86_64" should have no effect on 32-bit architectures.

If you want to make similar improvements for 32-bit (e.g. fill the bit accumulator 4 bytes at a time instead of 1 (vanilla zlib) or 8 (with 601694)), I suppose that that's possible, although the relative impact would probably be smaller (simply because 4 is less than 8), and it would make the zlib code even more complicated in terms of juggling multiple #ifdef's or multiple source files (combined with multiple build systems to configure those source files: Chromium's BUILD.gn and upstream's make and cmake).
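
To make the #ifdef concern concrete, here is a purely hypothetical sketch of what that configuration juggling might look like; none of these flag or type names are real zlib or Chromium identifiers.

#include <stdint.h>

#if defined(INFLATE_WIDE_READ_64LE)      /* hypothetical 64-bit flag */
typedef uint64_t inflate_holder_t;       /* refill 8 bytes at a time */
#elif defined(INFLATE_WIDE_READ_32LE)    /* hypothetical 32-bit flag */
typedef uint32_t inflate_holder_t;       /* refill 4 bytes at a time */
#else
typedef unsigned long inflate_holder_t;  /* vanilla: one byte at a time */
#endif

Each branch would need its own refill path plus matching BUILD.gn / make / cmake wiring, which is the maintenance cost being weighed above.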

Long story short, I think that optimizing 32-bit would have to be a separate change from 601694, and to be honest, I have found the code review process for 601694 quite frustrating, for reasons I'll detail in the code review.
In the code review for 601694, I have been asked to host the benchmark data in this linked issue rather than the commit log. Thus:

https://github.com/google/snappy is an unrelated compression format but
its snappy_unittest program is able to benchmark the snappy library and
the zlib library. We can re-purpose it, via LD_LIBRARY_PATH, to measure
different versions of zlib, such as before / after this commit.

snappy_unittest's output looks like:

testdata/alice29.txt                     :
ZLIB:   [b 1M] bytes 152089 ->  54404 35.8%  comp  19.6 MB/s  uncomp
258.2 MB/s
testdata/asyoulik.txt                    :
ZLIB:   [b 1M] bytes 125179 ->  48897 39.1%  comp  17.9 MB/s  uncomp
242.9 MB/s
etc

When running snappy_unittest, make sure that you specify test files on the
command line, such as ("snappy_unittest testdata/*"), and make sure that the
output includes lines that say "ZLIB". You may need to pass a -zlib flag, or
otherwise apply
https://gist.github.com/nigeltao/a3ef1761ad7339fac70a12a802467d56. If you do
not, the snappy_unittest program will be benchmarking the snappy compression
codec instead of the zlib one.

Summarizing the before/after zlib uncompress numbers give:

alice29.txt       uncomp MB/s  before  258.2  after  392.4  ratio 1.52
asyoulik.txt      uncomp MB/s  before  242.9  after  346.6  ratio 1.43
fireworks.jpeg    uncomp MB/s  before 1066.7  after 1194.7  ratio 1.12
geo.protodata     uncomp MB/s  before  565.6  after  756.8  ratio 1.34
html              uncomp MB/s  before  492.1  after  698.8  ratio 1.42
html_x_4          uncomp MB/s  before  502.1  after  697.8  ratio 1.39
kppkn.gtb         uncomp MB/s  before  336.4  after  483.6  ratio 1.44
lcet10.txt        uncomp MB/s  before  265.2  after  409.5  ratio 1.54
paper-100k.pdf    uncomp MB/s  before  264.9  after  331.4  ratio 1.25
plrabn12.txt      uncomp MB/s  before  240.9  after  372.5  ratio 1.55
urls.10K          uncomp MB/s  before  280.7  after  381.3  ratio 1.36

Comment 27 by noel@chromium.org, Sep 8 2017

#24 > While I am excited by these great improvements, Chrome is typically not built for 64-bit. We should be looking at the 32-bit gains as well.

Huh?  The official builds I receive are 64-bit on OSX and WIN per chrome://version
Google Chrome	60.0.3112.113 (Official Build) (64-bit)

#27 oh my gosh, you're right. *smacks forehead* Same on Linux.
Sorry... I've been focusing so much on Chrome for Android, which does 32-bit builds even for 64-bit devices.

Okay, false alarm. Sorry about that.



#25
> although the relative impact would probably be smaller (simply because 4 is less than 8)

Agreed

> be a separate change from 601694

Also agreed
To clarify a point I forgot to mention about #10, in #6 I don't mean the download cache. I mean the decoded image cache.

If you ask Chrome to decode an image and it already has a decoded copy of that image, it can just grab the copy from its cache. There are times when this works and times when it doesn't. For example, if we have rasterized the image and decided we don't need it anymore, we might discard it.

But whatever the case, forcing it to appear like a completely separate image will prevent us from accidentally using a cached decoded image.
Blocking: 764085
Project Member

Comment 31 by bugdroid1@chromium.org, Feb 7 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4982484356f4c052358a1e62daadd98ed5e29100

commit 4982484356f4c052358a1e62daadd98ed5e29100
Author: Noel Gordon <noel@chromium.org>
Date: Wed Feb 07 14:00:22 2018

Resolve chunk copy TODO: use INFLATE_FAST_MIN_INPUT/OUTPUT

Update the chunk copy to use INFLATE_FAST_MIN_{INPUT,OUTPUT} defines. No
change in behavior, no new tests.

After this change, the patch from  crbug.com/764431  (and its .patch file)
can be removed, it's not needed anymore TODO(cblume@ on  bug 760853 ).

Bug:  760853 
Change-Id: I93e4ddf3fd99da472a7571bae7d3b8b6c2efe5b1
Reviewed-on: https://chromium-review.googlesource.com/900885
Commit-Queue: Noel Gordon <noel@chromium.org>
Reviewed-by: Mike Klein <mtklein@chromium.org>
Cr-Commit-Position: refs/heads/master@{#534998}
[modify] https://crrev.com/4982484356f4c052358a1e62daadd98ed5e29100/third_party/zlib/contrib/optimizations/inffast_chunk.c
[modify] https://crrev.com/4982484356f4c052358a1e62daadd98ed5e29100/third_party/zlib/contrib/optimizations/inffast_chunk.h
[modify] https://crrev.com/4982484356f4c052358a1e62daadd98ed5e29100/third_party/zlib/contrib/optimizations/inflate.c

Comment 32 by noel@chromium.org, Feb 7 2018

Cc: -noel@chromium.org nigeltao@chromium.org
Owner: noel@chromium.org
Status: Started (was: Untriaged)
https://bugs.chromium.org/p/chromium/issues/detail?id=798943&desc=5#c14 noted, when landing the zlib_bench tool, that the zlib Huffman decoder on modern CPUs is I/O bound. The chunk copy work improves its output-side write performance. Nigel's idea (this issue) improves its input-side read performance.

And when these are combined with the other work on adler32 SIMD and crc32 SIMD, the results are very good; they all work together well. On my MacBook Pro 2016 (Intel Core i7), for the 140-image PNG corpus:

https://docs.google.com/spreadsheets/d/1v3ECbh190IlQBNG7cNdHFUqnCYJJ4BED5mlrSnwWL1M/edit#gid=595603569
  * blue, chunk copy (avg improv +16%)
  * red, chunk copy + adler32 simd (avg improv +36%)
  * yellow, chunk copy + adler32 simd + sse4.2 crc32 (avg improv +40%)
  * green, chunk copy + adler32 simd + sse4.2 crc32 + wide bits (avg improv +41%)

I called Nigel's method "wide bits", and it adds 1% more on average. That might not seem worth it, until you use zlib_bench to look at zlib decode performance for ZLIB- and GZIP-wrapped DEFLATE streams (viz., the common web browser use-case).

Using the snappy testdata corpus, again on the MacBook Pro 2016, the results are:

ZLIB wrapped DEFLATE stream, snappy corpus:
https://docs.google.com/spreadsheets/d/1RoS6sgAuVjQ0LBIMJIz5D5WwWkRVCb2dQvG3ztTqb5Q/edit#gid=595603569
 * red, chunk copy [deflate] (avg improv +22%)
 * yellow, chunk copy + adler32 simd [deflate] (avg improv +50%)
 * green, chunk copy + adler32 simd + wide bits [deflate] (avg improv +74%)
 --------------------------
 1.96x faster in the median

GZIP wrapped DEFLATE stream, snappy corpus:
https://docs.google.com/spreadsheets/d/15b0-iT0sXB5_d8yN--y48dRhySkeFe3frjWIv7_Vivw/edit#gid=595603569
 * red, chunk copy [gzip] (avg improv +19%)
 * yellow, chunk copy + sse4.2 crc32 [gzip] (avg improv +69%)
 * green, chunk copy + sse4.2 crc32 + wide bits [gzip] (avg improv +96%)
 --------------------------
 2.17x faster in the median

Nice to be getting ~2x more speed on average, and well worth incorporating into the chunk copy input reading code. I'll begin with x86 ...
Project Member

Comment 33 by bugdroid1@chromium.org, Feb 8 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6e212423a214e0e41794e8c9969c2896e2c33121

commit 6e212423a214e0e41794e8c9969c2896e2c33121
Author: Noel Gordon <noel@chromium.org>
Date: Thu Feb 08 13:09:37 2018

Increase inflate speed: read decoder input into a uint64_t

The chunk-copy code contribution deals with writing decoded DEFLATE data
to the output with SIMD methods to increase inflate decode speed. Modern
compilers such as gcc/clang/msvc elide the portable memcpy() calls used,
replacing them with much faster SIMD machine instructions.

Similarly, reading the input data to the DEFLATE decoder with wide, SIMD
methods can also increase decode speed. See  https://crbug.com/760853#c32 
for details; content-encoding: gzip decoding speed improves by 2.17x, in
the median over the snappy testdata corpus, when this method is combined
with the chunk-copy, and the adler32, and crc32 SIMD contributions (this
method improves our current inflate decode speed by 20-30%).

Update the chunk-copy code with a wide input data reader, which consumes
input in 64-bit (8 byte) chunks. Update inflate_fast_chunk_() to use the
wide reader. This feature is supported on little endian machines, and is
enabled with the INFLATE_CHUNK_READ_64LE build flag in BUILD.gn on Intel
CPU only for now.

The wide reader idea is due to nigeltao@chromium.org who did the initial
work. This patch is based on his patch [1]. No change in behavior (other
than more inflate decode speed), so no new tests.

[1] https://chromium-review.googlesource.com/c/chromium/src/+/601694/16

Bug:  760853 
Change-Id: Ia806d9a225737039367e1b803624cd59e286ce51
Reviewed-on: https://chromium-review.googlesource.com/900982
Commit-Queue: Noel Gordon <noel@chromium.org>
Reviewed-by: Mike Klein <mtklein@chromium.org>
Cr-Commit-Position: refs/heads/master@{#535365}
[modify] https://crrev.com/6e212423a214e0e41794e8c9969c2896e2c33121/third_party/zlib/BUILD.gn
[modify] https://crrev.com/6e212423a214e0e41794e8c9969c2896e2c33121/third_party/zlib/contrib/optimizations/chunkcopy.h
[modify] https://crrev.com/6e212423a214e0e41794e8c9969c2896e2c33121/third_party/zlib/contrib/optimizations/inffast_chunk.c
[modify] https://crrev.com/6e212423a214e0e41794e8c9969c2896e2c33121/third_party/zlib/contrib/optimizations/inffast_chunk.h

Project Member

Comment 34 by bugdroid1@chromium.org, Feb 9 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b07c92a12d39e1ac51fe0dab8e4cb98f91f0d213

commit b07c92a12d39e1ac51fe0dab8e4cb98f91f0d213
Author: Noel Gordon <noel@chromium.org>
Date: Fri Feb 09 01:38:56 2018

Add x86 bots to the read64le party

thestig@ reminds me we do have linux x86 bots on the waterfall. Add them
to the read64le party. Minor: run git cl format over BUILD.gn, and add a
line space here and there for readability.

Bug:  760853 
Change-Id: Ia47296a26bff77f9be699e31053d8b94aac583f4
Reviewed-on: https://chromium-review.googlesource.com/910328
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Noel Gordon <noel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#535612}
[modify] https://crrev.com/b07c92a12d39e1ac51fe0dab8e4cb98f91f0d213/third_party/zlib/BUILD.gn

Project Member

Comment 35 by bugdroid1@chromium.org, Feb 12 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b7fef3927da4a703c523e7e8916fd8d4c031a402

commit b7fef3927da4a703c523e7e8916fd8d4c031a402
Author: Noel Gordon <noel@chromium.org>
Date: Mon Feb 12 01:46:01 2018

Revert "Add x86 bots to the read64le party"

This reverts commit b07c92a12d39e1ac51fe0dab8e4cb98f91f0d213.

Reason for revert: the litmus test of running this code on linux x86 bots
worked. However, performance-wise we're not so sure. Back this change out
to restrict it to Intel x64 for now.

Original change's description:
> Add x86 bots to the read64le party
>
> thestig@ reminds me we do have linux x86 bots on the waterfall. Add them
> to the read64le party. Minor: run git cl format over BUILD.gn, and add a
> line space here and there for readability.
>
> Bug:  760853 
> Change-Id: Ia47296a26bff77f9be699e31053d8b94aac583f4
> Reviewed-on: https://chromium-review.googlesource.com/910328
> Reviewed-by: Mike Klein <mtklein@chromium.org>
> Commit-Queue: Noel Gordon <noel@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#535612}

TBR=noel@chromium.org,mtklein@chromium.org,nigeltao@chromium.org

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug:  760853 
Change-Id: If3d24a2cdc6614ff82c781019623108a52dbd16a
Reviewed-on: https://chromium-review.googlesource.com/913029
Commit-Queue: Noel Gordon <noel@chromium.org>
Reviewed-by: Noel Gordon <noel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#536020}
[modify] https://crrev.com/b7fef3927da4a703c523e7e8916fd8d4c031a402/third_party/zlib/BUILD.gn

Comment 36 by noel@chromium.org, Feb 15 2018

I'm not so sure this feature should be enabled on ARM without more investigation. I filed issue 812499 with initial ARM/ARM64 results to cover that work.

Comment 37 by noel@chromium.org, Feb 15 2018

Summary: zlib inflate Huffman decoder input reading speed-ups for 64 bit Intel (was: zlib speed-ups for 64 bit architectures)

Comment 38 by noel@chromium.org, Feb 15 2018

Status: Fixed (was: Started)
Intel work is all done.  Closing.
