zlib inflate Huffman decoder input reading speed-ups for 64 bit Intel
Issue description

https://github.com/madler/zlib/pull/292 is a pull request that speeds up zlib decode by up to 1.5x on the x86_64 architecture. That patch also affects some other 64-bit architectures: AArch64 and PowerPC64. I don't have direct access to such hardware, but a colleague of mine ran the numbers on such hardware, and reported a smaller but still consistently positive speedup of around 1.2x on both.

There have been a couple of places where getting this improvement into Chromium was discussed:

Code review: https://chromium-review.googlesource.com/c/chromium/src/+/601694
"Progressive Web Metrics" list discussion re metrics: https://groups.google.com/a/chromium.org/d/topic/progressive-web-metrics/Z-lWP4pQ43c/discussion

This issue is a tracking issue, and hopefully a central place for discussion.
Aug 31 2017
On https://chromium-review.googlesource.com/c/chromium/src/+/601694, cavalcantii asks:

> Could you check what are the performance improvement on the same test case
> (https://codepen.io/Savago/pen/JEoYNX)?

I tried 5 reps of hitting the "Run" button before (old) and after (new) my patch. The numbers reported look too noisy to draw any strong conclusion. For example, on the same binary (old), the ratio of best to worst times is 1.75x:

Old
### took 0.01999999999998181 miliseconds.Test results:
### took 0.01999999999998181 miliseconds.Test results:
### took 0.01999999999998181 miliseconds.Test results:
### took 0.024999999999636202 miliseconds.Test results:
### took 0.035000000000081855 miliseconds.Test results:

New
### took 0.01999999999998181 miliseconds.Test results:
### took 0.020000000000436557 miliseconds.Test results:
### took 0.020000000000436557 miliseconds.Test results:
### took 0.024999999999636202 miliseconds.Test results:
### took 0.02500000000009095 miliseconds.Test results:

As an alternative (but with the same https://upload.wikimedia.org/wikipedia/commons/3/3f/ZebraHighRes.png PNG image), I ran chromium's image_decode_bench program (as in, "ninja -C out/release image_decode_bench"), before (old) and after (new) my patch:

$ for i in {1..5}; do ./old ZebraHighRes.png 20; done | sort
1.234950 0.061748
1.234987 0.061749
1.236219 0.061811
1.236629 0.061831
1.238431 0.061922

$ for i in {1..5}; do ./new ZebraHighRes.png 20; done | sort
1.145096 0.057255
1.145904 0.057295
1.147598 0.057380
1.173633 0.058682
1.177580 0.058879

For each run, the two numbers are total time and time per repetition (here, 20 reps). In other words, decoding ZebraHighRes.png took 62ms per rep before and 57ms per rep after, or a 1.07x improvement.
Aug 31 2017
On https://chromium-review.googlesource.com/c/chromium/src/+/601694, cavalcantii asks:

> I recommend to have a look on blink_platform_unittests (*PNG* and *Png) and
> post the results in an issue tracking this patch.

blink_platform_unittests seems happy:

$ out/rel/blink_platform_unittests | tail
[2098/2103] PaintControllerUnderInvalidationTest.NoopPairsInSubsequence (0 ms)
[2099/2103] PaintControllerUnderInvalidationTest.ChangeDrawingInSubsequence (43 ms)
[2100/2103] PaintControllerUnderInvalidationTest.MoreDrawingInSubsequence (50 ms)
[2101/2103] PaintControllerUnderInvalidationTest.LessDrawingInSubsequence (44 ms)
[2102/2103] PaintControllerUnderInvalidationTest.ChangeNonCacheableInSubsequence (49 ms)
[2103/2103] PaintControllerUnderInvalidationTest.InvalidationInSubsequence (0 ms)
Retrying 1 test (retry #1)
[2104/2104] MailboxSharedGpuContextTest.MailboxCaching (1 ms)
SUCCESS: all tests passed.
Tests took 0 seconds.

$ out/rel/blink_platform_unittests | grep -i png
[363/2103] BitmapImageTest.pngHasColorProfile (0 ms)
[570/2103] DeferredImageDecoderTestWoPlatform.mixImagesPng (1 ms)
[722/2103] AnimatedPNGTests.sizeTest (1 ms)
[723/2103] AnimatedPNGTests.repetitionCountTest (0 ms)
[724/2103] AnimatedPNGTests.MetaDataTest (0 ms)
[725/2103] AnimatedPNGTests.EmptyFrame (0 ms)
[726/2103] AnimatedPNGTests.ByteByByteSizeAvailable (0 ms)
[727/2103] AnimatedPNGTests.ByteByByteMetaData (0 ms)
[728/2103] AnimatedPNGTests.TestRandomFrameDecode (0 ms)
[729/2103] AnimatedPNGTests.TestDecodeAfterReallocation (1 ms)
[730/2103] AnimatedPNGTests.ProgressiveDecode (0 ms)
[771/2103] AnimatedPNGTests.FailureDuringDecodingInvalidatesDecoder (0 ms)
[772/2103] AnimatedPNGTests.VerifyIENDBeforeIDATInvalidatesDecoder (0 ms)
[773/2103] AnimatedPNGTests.MixedDataChunks (1 ms)
[774/2103] AnimatedPNGTests.VerifyInvalidDisposalAndBlending (0 ms)
[775/2103] AnimatedPNGTests.VerifySuccessfulFirstFrameDecodeAfterLaterFrame (0 ms)
[776/2103] AnimatedPNGTests.DecodeFromIndependentFrame (0 ms)
[777/2103] AnimatedPNGTests.SubsetFromIHDR (1 ms)
[778/2103] StaticPNGTests.repetitionCountTest (0 ms)
[779/2103] StaticPNGTests.sizeTest (0 ms)
[780/2103] StaticPNGTests.MetaDataTest (0 ms)
[811/2103] StaticPNGTests.InvalidIHDRChunk (0 ms)
[812/2103] StaticPNGTests.ProgressiveDecoding (0 ms)
[813/2103] StaticPNGTests.ProgressiveDecodingContinuesAfterFullData (0 ms)
[814/2103] StaticPNGTests.FailureDuringDecodingInvalidatesDecoder (1 ms)
[815/2103] PNGTests.VerifyFrameCompleteBehavior (2 ms)
[816/2103] PNGTests.sizeMayOverflow (0 ms)
[851/2103] AnimatedPNGTests.ParseAndDecodeByteByByte (0 ms)
[852/2103] AnimatedPNGTests.FailureDuringParsing (1 ms)
[853/2103] AnimatedPNGTests.ActlErrors (4 ms)
[854/2103] AnimatedPNGTests.fdatBeforeIdat (1 ms)
[855/2103] AnimatedPNGTests.IdatSizeMismatch (0 ms)
[856/2103] AnimatedPNGTests.VerifyFrameOutsideImageSizeFails (0 ms)
[857/2103] AnimatedPNGTests.ProgressiveDecodingContinuesAfterFullData (0 ms)
[858/2103] AnimatedPNGTests.RandomDecodeAfterClearFrameBufferCache (0 ms)
[859/2103] AnimatedPNGTests.VerifyAlphaBlending (0 ms)
[860/2103] AnimatedPNGTests.FailureMissingIendChunk (0 ms)
[886/2103] ICOImageDecoderTests.errorInPngInIco (0 ms)
Sep 1 2017
Re the https://codepen.io/Savago/pen/JEoYNX benchmark mentioned in comment #2: on closer inspection, the numbers look bogus. Decoding a PNG image of that size shouldn't take less than a *millisecond*. I've attached a screenshot of Chrome stable (version 60) on desktop, x86_64, that shows similar numbers. My conclusion is that that codepen.io snippet isn't measuring image decode time at all. Maybe it's just some DOM manipulation, but I'm only guessing. I don't have any clue what the difference is between my Chrome and your (cavalcantii) Chrome such that you were measuring around 100 milliseconds (instead of sub-millisecond) on your own device.
Sep 1 2017
@Nigel: the timestamp printed in the webpage reflects the time to insert the DOM element into the tree. I use this test to collect Chrome traces, where later I look for the time spent in ImageFrameGenerator::decode.
Sep 1 2017
A way we can improve this test is to request image.png?1, image.png?2, etc. Otherwise, the second decode might be using the cache. Until we finish landing the JS decode API, the best JS option is to add an onload() event handler. But it includes downloading :( For an example of that, I'm forced to use it here: https://chromium-review.googlesource.com/c/chromium/src/+/646242
Sep 1 2017
So for the Zebra test image, image_decode_bench reported a 7% improvement? Expect the effect in the web browser to be a bit smaller (e.g. the perf bot image_decoding.html?png), as there is considerable time spent in the compositor copying/resizing the pixels (I'm planning to investigate how we can speed things up there). Probably the net effect will be around 5% (assuming similar compositor behavior on Intel to what I've observed on Android). YMMV because Intel has one-copy texture uploads.
Sep 1 2017
And seconding what cblume mentioned, you've got to make sure you won't be hitting the cache while collecting the trace.
Sep 1 2017
Huh, I'd have thought that you *do* want to hit the cache while collecting the trace. Otherwise, you're going to add network I/O noise to your timings, and network I/O can take much longer than image decoding.
Sep 1 2017
Having network I/O timing noise in image decoding measurements is not good.
When using tracing, note that image decodes happen off the main thread:
-- you might get scheduled out on a thread switch
Dunno about the new perf bot thing in #6 for images:
-- do they decode off the main thread, or not?
Closest thing to a synchronous decode using a web page is to:
-- img = new Image(), img.onload = function() { draw(this) }, img.src = url;
-- draw() would draw the image to a <canvas> element
-- that <canvas> draw causes a sync (main thread) image decode
And with any method used:
-- can you ever be sure you have received all the compressed image data?
We use image_decode_bench to remove all these variables when we want to assess the performance aspects of any change that directly affects the blink image decoders or a sub-system they require (e.g., zlib).
re: blink_platform_unittests
-- no real need to run those manually (unless you broke them)
-- they are automatically run on try jobs and on the CQ
Sep 1 2017
While I support the use of image_decode_bench, it should be clear it won't fully reveal the global impact in the browser, because there are other component layers at play. I'm not deeply familiar with how the Blink and the Compositor image caches work (or how they are integrated); I have the impression that one is a network cache while the second is a texture/image cache. But it is reasonable to deduce that if you insert an image in a webpage and it is already in the cache, it won't be decoded twice (otherwise, why would you need a cache at all?).

This seems to be confirmed by the trace I'm attaching, where I clicked twice (waiting a couple of seconds between the clicks) on the button in the 'Parrot' test case to include 2 images. You can clearly see that in the first image insertion, time is being spent in ImageFrameGenerator::decode, while in the second execution (i.e. hitting the cache) you only see a call into Skia's SkCanvas::drawRect(). Also, as a browser user would notice, the first image inclusion took a little while and the second was instantaneous.

The only scenario where the browser user would have everything in cache would be a 'light' refresh of a given page. But the common user case is to open a new tab (i.e. probably spawning a new RenderProcess) and do a cold load of a website. If network 'noise' is a major concern, one alternative is simply to host the test image on a local computer on the same network. That being said, I fail to see how network noise would throw off the time spent *actually* decoding a PNG: as new data arrives through the pipe, you decode it. The time spent decoding is what matters, not the time it took to load the page, and that is what ImageFrameGenerator::decode() is going to show.

Again, this is how I see that testing should be done: from the perspective of what a Chrome user will face while using the browser. I'm sharing what I learned in the ARM optimization story so you can profit from it.
Sep 2 2017
Not sure if this is the kind of output expected, but here it goes:

(xenial)adenilson@localhost:~/snappy/build$ file snappy_unittest
snappy_unittest: ELF 64-bit LSB executable, ARM aarch64, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, for GNU/Linux 3.7.0, BuildID[sha1]=dfafcee6f4a0b93059920d0008d072bd5e0f750f, not stripped

(xenial)adenilson@localhost:~/snappy/build$ ldd snappy_unittest
linux-vdso.so.1 => (0x0000007c222da000)
libz.so.1 => /usr/local/lib/libz.so.1 (0x0000007c222a8000)
libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000007c22270000)
libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000007c220e0000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000007c220bf000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007c21f78000)
/lib/ld-linux-aarch64.so.1 (0x0000005eff1bc000)
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000007c21eca000)

a) Vanilla 1.2.11

(xenial)adenilson@localhost:~/snappy/build$ taskset -c 4 ./snappy_unittest
Running microbenchmarks.
WARNING: Compiled with assertions enabled, will be slow.
WARNING: Compiled without optimization, will be slow.
Benchmark Time(ns) CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0 506661 506122 245 192.9MB/s html
BM_UFlat/1 5027760 5030000 100 133.1MB/s urls
BM_UFlat/2 14299 14297 11890 8.0GB/s jpg
BM_UFlat/3 1257 1254 156250 152.1MB/s jpg_200
BM_UFlat/4 69533 69225 2788 1.4GB/s pdf
BM_UFlat/5 2034730 2030000 100 192.4MB/s html4
BM_UFlat/6 1868254 1867924 106 77.6MB/s txt1
BM_UFlat/7 1627352 1631147 122 73.2MB/s txt2
BM_UFlat/8 5017350 5020000 100 81.1MB/s txt3
BM_UFlat/9 6999800 7000000 100 65.6MB/s txt4
BM_UFlat/10 478931 479318 411 235.9MB/s pb
BM_UFlat/11 1952382 1950980 102 90.1MB/s gaviota
BM_UIOVec/0 960393 961165 206 101.6MB/s html
BM_UIOVec/1 8407500 8380000 100 79.9MB/s urls
BM_UIOVec/2 15434 15467 9310 7.4GB/s jpg
BM_UIOVec/3 2428 2431 79365 78.4MB/s jpg_200
BM_UIOVec/4 130191 130193 1444 750.1MB/s pdf
BM_UValidate/0 209145 209081 947 467.1MB/s html
BM_UValidate/1 2307110 2310000 100 289.9MB/s urls
BM_UValidate/2 1010 1009 172413 113.6GB/s jpg
BM_UValidate/3 683 676 263157 282.0MB/s jpg_200
BM_UValidate/4 19981 19973 9813 4.8GB/s pdf
BM_ZFlat/0 1325281 1322147 149 73.9MB/s html (22.31 %)
BM_ZFlat/1 19445290 19440000 100 34.4MB/s urls (47.78 %)
BM_ZFlat/2 49976 50086 3474 2.3GB/s jpg (99.95 %)
BM_ZFlat/3 5795 5800 34305 32.9MB/s jpg_200 (73.00 %)
BM_ZFlat/4 166427 165907 1097 588.6MB/s pdf (83.30 %)
BM_ZFlat/5 5699970 5690000 100 68.7MB/s html4 (22.52 %)
BM_ZFlat/6 4990370 4990000 100 29.1MB/s txt1 (57.88 %)
BM_ZFlat/7 4404810 4410000 100 27.1MB/s txt2 (61.91 %)
BM_ZFlat/8 14135230 14130000 100 28.8MB/s txt3 (54.99 %)
BM_ZFlat/9 18724740 18720000 100 24.5MB/s txt4 (66.26 %)
BM_ZFlat/10 1281894 1278145 151 88.5MB/s pb (19.68 %)
BM_ZFlat/11 4231080 4230000 100 41.6MB/s gaviota (37.72 %)

b) With inffast64 patch (from https://github.com/madler/zlib/pull/292):

(xenial)adenilson@localhost:~/snappy/build$ taskset -c 4 ./snappy_unittest
Running microbenchmarks.
WARNING: Compiled with assertions enabled, will be slow.
WARNING: Compiled without optimization, will be slow.
Benchmark Time(ns) CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0 505374 504043 371 193.7MB/s html
BM_UFlat/1 4987610 4980000 100 134.5MB/s urls
BM_UFlat/2 14441 14425 13726 7.9GB/s jpg
BM_UFlat/3 1248 1244 155038 153.2MB/s jpg_200
BM_UFlat/4 69954 69439 2693 1.4GB/s pdf
BM_UFlat/5 2048920 2040000 100 191.5MB/s html4
BM_UFlat/6 1871066 1876190 105 77.3MB/s txt1
BM_UFlat/7 1651471 1628099 121 73.3MB/s txt2
BM_UFlat/8 5016400 5020000 100 81.1MB/s txt3
BM_UFlat/9 6925810 6920000 100 66.4MB/s txt4
BM_UFlat/10 479709 479217 409 236.0MB/s pb
BM_UFlat/11 1973780 1960000 100 89.7MB/s gaviota
BM_UIOVec/0 970712 960975 205 101.6MB/s html
BM_UIOVec/1 8317450 8310000 100 80.6MB/s urls
BM_UIOVec/2 15424 15424 11540 7.4GB/s jpg
BM_UIOVec/3 2458 2452 72992 77.8MB/s jpg_200
BM_UIOVec/4 130534 130555 1440 748.0MB/s pdf
BM_UValidate/0 210539 209033 952 467.2MB/s html
BM_UValidate/1 2315300 2310000 100 289.9MB/s urls
BM_UValidate/2 1006 1006 147058 113.9GB/s jpg
BM_UValidate/3 678 676 263157 282.0MB/s jpg_200
BM_UValidate/4 20008 20008 9896 4.8GB/s pdf
BM_ZFlat/0 1327800 1324137 145 73.8MB/s html (22.31 %)
BM_ZFlat/1 18967000 18940000 100 35.4MB/s urls (47.78 %)
BM_ZFlat/2 49063 49209 3353 2.3GB/s jpg (99.95 %)
BM_ZFlat/3 5831 5762 32626 33.1MB/s jpg_200 (73.00 %)
BM_ZFlat/4 167448 167814 1162 581.9MB/s pdf (83.30 %)
BM_ZFlat/5 5633460 5630000 100 69.4MB/s html4 (22.52 %)
BM_ZFlat/6 5068580 5060000 100 28.7MB/s txt1 (57.88 %)
BM_ZFlat/7 4326500 4330000 100 27.6MB/s txt2 (61.91 %)
BM_ZFlat/8 13995480 13970000 100 29.1MB/s txt3 (54.99 %)
BM_ZFlat/9 18611570 18600000 100 24.7MB/s txt4 (66.26 %)
BM_ZFlat/10 1264190 1263157 152 89.5MB/s pb (19.68 %)
BM_ZFlat/11 4207690 4210000 100 41.8MB/s gaviota (37.72 %)
Sep 2 2017
Traces on a Google Pixel for the Zebra parrot test case (it seems that the NEON code is faster on AArch64).
Sep 2 2017
Your snappy_unittest numbers don't really show any change between the two, which looks odd. When I run my numbers, I use the LD_LIBRARY_PATH environment variable to pick the zlib implementation. You're not setting LD_LIBRARY_PATH, so how are you picking which zlib implementation snappy_unittest uses? For your Zebra parrot case, can you get numbers for my inffast64 version versus vanilla zlib, not versus your NEON code?
Sep 2 2017
And for Release builds (both zlib + test):

a) vanilla release

(xenial)adenilson@localhost:~/snappy/build$ taskset -c 4 ./snappy_unittest
Running microbenchmarks.
Benchmark Time(ns) CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0 90815 90696 1709 1.1GB/s html
BM_UFlat/1 1044234 1040000 175 643.8MB/s urls
BM_UFlat/2 12403 12415 14981 9.2GB/s jpg
BM_UFlat/3 253 253 666666 752.4MB/s jpg_200
BM_UFlat/4 17186 17153 10610 5.6GB/s pdf
BM_UFlat/5 384169 382692 520 1020.7MB/s html4
BM_UFlat/6 341037 341880 585 424.3MB/s txt1
BM_UFlat/7 303240 302571 661 394.6MB/s txt2
BM_UFlat/8 950082 949541 218 428.6MB/s txt3
BM_UFlat/9 1367939 1364864 148 336.7MB/s txt4
BM_UFlat/10 90209 90325 2181 1.2GB/s pb
BM_UFlat/11 342017 342657 572 513.0MB/s gaviota
BM_UIOVec/0 279080 279266 709 349.7MB/s html
BM_UIOVec/1 2504980 2500000 100 267.8MB/s urls
BM_UIOVec/2 12950 12958 15048 8.8GB/s jpg
BM_UIOVec/3 708 712 198019 267.9MB/s jpg_200
BM_UIOVec/4 33323 33245 5715 2.9GB/s pdf
BM_UValidate/0 55428 55415 3573 1.7GB/s html
BM_UValidate/1 639232 638709 310 1.0GB/s urls
BM_UValidate/2 231 231 666666 496.3GB/s jpg
BM_UValidate/3 160 160 1000000 1.2GB/s jpg_200
BM_UValidate/4 4790 4802 40816 19.9GB/s pdf
BM_ZFlat/0 216380 216494 873 451.1MB/s html (22.31 %)
BM_ZFlat/1 3534340 3530000 100 189.7MB/s urls (47.78 %)
BM_ZFlat/2 21545 21576 8435 5.3GB/s jpg (99.95 %)
BM_ZFlat/3 936 935 204081 203.8MB/s jpg_200 (73.00 %)
BM_ZFlat/4 32104 32129 6038 3.0GB/s pdf (83.30 %)
BM_ZFlat/5 1014164 1010309 194 386.6MB/s html4 (22.52 %)
BM_ZFlat/6 844189 840796 201 172.5MB/s txt1 (57.88 %)
BM_ZFlat/7 798351 797520 242 149.7MB/s txt2 (61.91 %)
BM_ZFlat/8 2640390 2640000 100 154.2MB/s txt3 (54.99 %)
BM_ZFlat/9 3501170 3500000 100 131.3MB/s txt4 (66.26 %)
BM_ZFlat/10 207517 208017 923 543.7MB/s pb (19.68 %)
BM_ZFlat/11 592317 592233 309 296.8MB/s gaviota (37.72 %)

b) inffast64 release

(xenial)adenilson@localhost:~/snappy/build$ taskset -c 4 ./snappy_unittest
Running microbenchmarks.
Benchmark Time(ns) CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0 90464 90479 1691 1.1GB/s html
BM_UFlat/1 1094982 1087719 171 615.6MB/s urls
BM_UFlat/2 13003 12887 15209 8.9GB/s jpg
BM_UFlat/3 259 259 571428 736.4MB/s jpg_200
BM_UFlat/4 16731 16647 10632 5.7GB/s pdf
BM_UFlat/5 384682 382978 517 1020.0MB/s html4
BM_UFlat/6 345548 343750 576 421.9MB/s txt1
BM_UFlat/7 303663 304414 657 392.2MB/s txt2
BM_UFlat/8 1003932 1005181 193 404.9MB/s txt3
BM_UFlat/9 1319884 1319727 147 348.2MB/s txt4
BM_UFlat/10 90027 90163 2196 1.2GB/s pb
BM_UFlat/11 341894 342013 576 514.0MB/s gaviota
BM_UIOVec/0 278528 278409 704 350.8MB/s html
BM_UIOVec/1 2495040 2490000 100 268.9MB/s urls
BM_UIOVec/2 12963 13005 14609 8.8GB/s jpg
BM_UIOVec/3 714 713 256410 267.2MB/s jpg_200
BM_UIOVec/4 33207 33234 5717 2.9GB/s pdf
BM_UValidate/0 55569 55416 3591 1.7GB/s html
BM_UValidate/1 637785 638977 313 1.0GB/s urls
BM_UValidate/2 232 232 645161 493.1GB/s jpg
BM_UValidate/3 160 160 909090 1.2GB/s jpg_200
BM_UValidate/4 4813 4813 39682 19.8GB/s pdf
BM_ZFlat/0 215057 215835 922 452.5MB/s html (22.31 %)
BM_ZFlat/1 3354450 3350000 100 199.9MB/s urls (47.78 %)
BM_ZFlat/2 22116 22168 8661 5.2GB/s jpg (99.95 %)
BM_ZFlat/3 940 940 202020 202.8MB/s jpg_200 (73.00 %)
BM_ZFlat/4 32545 32532 5871 2.9GB/s pdf (83.30 %)
BM_ZFlat/5 886608 886956 230 440.4MB/s html4 (22.52 %)
BM_ZFlat/6 876201 876146 218 165.5MB/s txt1 (57.88 %)
BM_ZFlat/7 783378 781250 256 152.8MB/s txt2 (61.91 %)
BM_ZFlat/8 2372350 2370000 100 171.7MB/s txt3 (54.99 %)
BM_ZFlat/9 3404180 3410000 100 134.8MB/s txt4 (66.26 %)
BM_ZFlat/10 212025 211640 945 534.4MB/s pb (19.68 %)
BM_ZFlat/11 602060 602409 332 291.8MB/s gaviota (37.72 %)
Sep 2 2017
@Nigel: I've set it to point to /usr/local/lib (as is visible in 'ldd'). I removed the self-compiled zlib from /usr/local and installed the tested candidate for each run. I'm afraid that I can't supply traces for vanilla vs. inffast64 because I'm working from home and would need my workstation that is at the office (not to mention that Monday is Labor Day in the USA). I hope the data was helpful.
Sep 2 2017
Kind of an extrapolation, but in the issue (https://bugs.chromium.org/p/chromium/issues/detail?id=697280) I've included a trace of vanilla zlib on 32-bit ARM where it would take 140ms for the Zebra parrot test case. It may be a tad faster on AArch64 (i.e. more registers), but not by that much.
Sep 2 2017
> I removed the self compiled zlib from /usr/local and install the tested candidate for each run.
One detail I've just noticed is that I usually run snappy_unittest with arguments ("snappy_unittest testdata/*") and you are not. I think that you are therefore producing numbers for the *snappy* compression codec, not the *zlib* one, and the snappy implementation is obviously unaffected by applying my patch to the /usr/local zlib.
The output should look something like:
testdata/alice29.txt :
ZLIB: [b 1M] bytes 152089 -> 54404 35.8% comp 19.6 MB/s uncomp 258.2 MB/s
and not like:
BM_UFlat/0 90815 90696 1709 1.1GB/s html
BM_UFlat/1 1044234 1040000 175 643.8MB/s urls
Note especially the "ZLIB" in the output.
> I'm afraid that I can't supply traces for vanilla X inffast64 because I'm working from home and would need my workstation that is at the office (not to mention that Monday will be labor day in USA).
Well, I'm not in a hurry. I work part time, and this whole zlib optimization thing is mostly unrelated to what I'm meant to be working on.
Sep 2 2017
My previously mentioned https://gist.github.com/nigeltao/a3ef1761ad7339fac70a12a802467d56 might be helpful for making snappy_unittest give you ZLIB numbers.
Sep 4 2017
Measuring one image. Across the 140 PNG corpus, the results on different machines:

HP Z840 2016
* 16-core Intel Xeon E5-2683 V4 2.1GHz Broadwell-EP (14nm)
* Turbo Boost off.
win10-z840-slow-fast-zlib-adler-sse
https://docs.google.com/spreadsheets/d/1V0DF-NYrKo1xWF88N-5DJGQfUDtmo0VkhHm9uOaQ_7M/edit#gid=1475395966

DELL PRECISION M3800 LAPTOP
* Intel Core i7-4712HQ 2.3GHz Haswell-H (22nm)
win10-laptop-slow-fast-zlib-adler-sse
https://docs.google.com/spreadsheets/d/1eUDlm5GLA7Zd7IfEd9NUaM1wkes-WzvZBI6rKriZskk/edit#gid=1546726966

MAC PRO 2013 (12-CORE)
* 12-core Intel Xeon E5-2697 V2, 2.7GHz Ivy Bridge-EP (22 nm)
mac-pro-slow-fast-zlib-adler-sse
https://docs.google.com/spreadsheets/d/16T4PlAXPqBtlw16JZnhAROk4bGNOC572vtOdJKHAluc/edit#gid=558740138

MAC BOOK PRO 13" 2016 (TOUCH BAR) LAPTOP
* Dual-core Intel Core i7-6567U, 3.3GHz Skylake-U (14nm)
mac-book-pro-slow-fast-zlib-adler-sse
https://docs.google.com/spreadsheets/d/1m-d-aF2hHyN6Oi8HHV4Htz_7Xgdh6kEnKnDjKc3WsrU/edit#gid=595603569

Measurement patch: https://chromium-review.googlesource.com/c/chromium/src/+/609054, which contains Nigel's patch. The results of the change are graphed as the blue line labelled "fast zlib". There are wins and losses, that's nothing new: the average speed-up is 2%, and some images have large gains >> 15%.

> The time spent decoding is what matters, not the time it took to load the page ...

Yes, and that's what image_decode_bench measures. Network noise, latency, or other browser factors are of interest, but not when measuring image decoder speed.
Sep 6 2017
@Nigel: vanilla zlib code on AArch64 will yield something around 133ms on a Google Pixel for the Zebra parrot test case (IIRC it would take 140ms on AArch32). Using the patch + environment variable + release build + CLI arguments, I can confirm speed-ups in snappy_unittest on AArch64. For HTML the speed-up was around 34% (339.6MB/252.9MB) and for JPEG 8% (599.8MB/552.5MB), tested on a Rockchip RK3399@2GHz, using taskset to bind it to one of the big cores (A72). Not like the reported numbers for Intel (1.42x or 42%), but still pretty good.
Sep 6 2017
@Noel: that is some pretty amazing data! I see you have been busy. :-) It would be helpful to have a description of the columns in the tables to better understand the data. I have a few questions:

a) I'm assuming the 'Chart results' tab shows 1.0 for the expected time of vanilla zlib?

b) There is a red line labeled 'zlib adler sse'. Is it an implementation of the Adler32 checksum (like we did for ARM) using SSE2? If so, it is quite interesting to see that it yields a boost of up to 29% alone in the best case (I've observed close to 18% on ARM, but YMMV).

c) Do you think you could get some data for Intel M3 and M5? They are quite different beasts from the typical Xeon/i7 Intel chip.
Sep 8 2017
While I am excited by these great improvements, Chrome is typically not built for 64-bit. We should be looking at the 32-bit gains as well. Landing & upstreaming both is desirable.
Sep 8 2017
https://chromium-review.googlesource.com/c/chromium/src/+/601694 "Speed up zlib inflate_fast by up to 1.5x on x86_64" should have no effect on 32-bit architectures. If you want to make similar improvements for 32-bit (e.g. fill the bit accumulator 4 bytes at a time instead of 1 (vanilla zlib) or 8 (with 601694)), I suppose that that's possible, although the relative impact would probably be smaller (simply because 4 is less than 8), and it would make the zlib code even more complicated in terms of juggling multiple #ifdef's or multiple source files (combined with multiple build systems to configure those source files: Chromium's BUILD.gn and upstream's make and cmake).

Long story short, I think that optimizing 32-bit would have to be a separate change from 601694, and to be honest, I have found the code review process for 601694 quite frustrating, for reasons I'll detail in the code review.
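To make the "fill the bit accumulator more bytes at a time" idea concrete, here is a rough C sketch contrasting a byte-at-a-time refill with an 8-byte-at-a-time refill. To be clear, this is illustrative only and is not the code in zlib or in 601694; the bitreader struct, the read64le() helper, and the exact bit bookkeeping are assumptions made for the example.

#include <stdint.h>
#include <string.h>

struct bitreader {
    const unsigned char *in; /* next compressed input byte */
    uint64_t hold;           /* bit accumulator, LSB-first as in DEFLATE */
    unsigned bits;           /* number of valid bits currently in hold */
};

/* Unaligned 64-bit load; the memcpy compiles to a single move on x86_64.
   Assumes a little-endian host. */
static uint64_t read64le(const unsigned char *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Byte-at-a-time refill, in the style of stock inflate: it only tops up to
   ~15 bits, so it has to run again before almost every symbol decode. */
static void refill_byte_at_a_time(struct bitreader *br) {
    while (br->bits < 15) {
        br->hold |= (uint64_t)(*br->in++) << br->bits;
        br->bits += 8;
    }
}

/* Wide refill: one 64-bit load leaves 56..63 valid bits in hold, enough to
   decode several Huffman symbols before the next refill. Assumes the caller
   guarantees at least 8 readable bytes at br->in. */
static void refill_wide(struct bitreader *br) {
    br->hold |= read64le(br->in) << br->bits;
    br->in += (63 - br->bits) >> 3;  /* whole new bytes now accounted for */
    br->bits |= 56;
}

Consuming a decoded symbol is then just hold >>= n; bits -= n; the win comes from doing one wide load per several symbols instead of one or more single-byte loads per symbol.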
Sep 8 2017
In the code review for 601694, I have been asked to host the benchmark data in this linked issue rather than the commit log. Thus:

https://github.com/google/snappy is an unrelated compression format, but its snappy_unittest program is able to benchmark the snappy library and the zlib library. We can re-purpose it, via LD_LIBRARY_PATH, to measure different versions of zlib, such as before / after this commit.

snappy_unittest's output looks like:

testdata/alice29.txt :
ZLIB: [b 1M] bytes 152089 -> 54404 35.8% comp 19.6 MB/s uncomp 258.2 MB/s
testdata/asyoulik.txt :
ZLIB: [b 1M] bytes 125179 -> 48897 39.1% comp 17.9 MB/s uncomp 242.9 MB/s
etc

When running snappy_unittest, make sure that you specify test files on the command line ("snappy_unittest testdata/*"), and make sure that the output includes lines that say "ZLIB". You may need to pass a -zlib flag, or otherwise apply https://gist.github.com/nigeltao/a3ef1761ad7339fac70a12a802467d56. If you do not, the snappy_unittest program will be benchmarking the snappy compression codec instead of the zlib one.

Summarizing the before/after zlib uncompress numbers gives:

alice29.txt    uncomp MB/s before 258.2  after 392.4  ratio 1.52
asyoulik.txt   uncomp MB/s before 242.9  after 346.6  ratio 1.43
fireworks.jpeg uncomp MB/s before 1066.7 after 1194.7 ratio 1.12
geo.protodata  uncomp MB/s before 565.6  after 756.8  ratio 1.34
html           uncomp MB/s before 492.1  after 698.8  ratio 1.42
html_x_4       uncomp MB/s before 502.1  after 697.8  ratio 1.39
kppkn.gtb      uncomp MB/s before 336.4  after 483.6  ratio 1.44
lcet10.txt     uncomp MB/s before 265.2  after 409.5  ratio 1.54
paper-100k.pdf uncomp MB/s before 264.9  after 331.4  ratio 1.25
plrabn12.txt   uncomp MB/s before 240.9  after 372.5  ratio 1.55
urls.10K       uncomp MB/s before 280.7  after 381.3  ratio 1.36
Sep 8 2017
#24
> While I am excited by these great improvements, Chrome is typically not built for 64-bit. We should be looking at the 32-bit gains as well.

Huh? The official builds I receive are 64-bit on OSX and WIN per chrome://version:

Google Chrome 60.0.3112.113 (Official Build) (64-bit)
Sep 8 2017
#27 Oh my gosh, you're right. *smacks forehead* Same on Linux. Sorry... been focusing so much on Chrome for Android, which does 32-bit builds even for 64-bit devices. Okay, false alarm. Sorry about that.

#25
> although the relative impact would probably be smaller (simply because 4 is less than 8)

Agreed.

> be a separate change from 601694

Also agreed.
Sep 8 2017
To clarify a point I forgot to mention about #10: in #6 I don't mean the download cache, I mean the decoded image cache. If you ask Chrome to decode an image and it already has a decoded copy of that image, it can just grab the copy from its cache. There are times when this works and times when it doesn't. For example, if we have rastered the image and decided we don't need it anymore, we might discard it. But whatever the case, forcing it to appear like a completely separate image will prevent us from accidentally using a cached decoded image.
Sep 11 2017

Feb 7 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/4982484356f4c052358a1e62daadd98ed5e29100

commit 4982484356f4c052358a1e62daadd98ed5e29100
Author: Noel Gordon <noel@chromium.org>
Date: Wed Feb 07 14:00:22 2018

Resolve chunk copy TODO: use INFLATE_FAST_MIN_INPUT/OUTPUT

Update the chunk copy to use INFLATE_FAST_MIN_{INPUT,OUTPUT} defines. No change in behavior, no new tests. After this change, the patch from crbug.com/764431 (and its .patch file) can be removed, it's not needed anymore TODO(cblume@ on bug 760853).

Bug: 760853
Change-Id: I93e4ddf3fd99da472a7571bae7d3b8b6c2efe5b1
Reviewed-on: https://chromium-review.googlesource.com/900885
Commit-Queue: Noel Gordon <noel@chromium.org>
Reviewed-by: Mike Klein <mtklein@chromium.org>
Cr-Commit-Position: refs/heads/master@{#534998}

[modify] https://crrev.com/4982484356f4c052358a1e62daadd98ed5e29100/third_party/zlib/contrib/optimizations/inffast_chunk.c
[modify] https://crrev.com/4982484356f4c052358a1e62daadd98ed5e29100/third_party/zlib/contrib/optimizations/inffast_chunk.h
[modify] https://crrev.com/4982484356f4c052358a1e62daadd98ed5e29100/third_party/zlib/contrib/optimizations/inflate.c
Feb 7 2018
https://bugs.chromium.org/p/chromium/issues/detail?id=798943&desc=5#c14 noted, when landing the zlib_bench tool, that the zlib Huffman decoder on modern CPUs was I/O bound. The chunk copy work improves its output-side write performance. Nigel's idea in (this issue) improves its input-side read performance. And when these are combined with the other work on adler32 simd and crc32 simd, the results are very good; they all work together well.

On my Macbook Pro 2016 (Intel Core i7), for the 140 PNG image corpus:
https://docs.google.com/spreadsheets/d/1v3ECbh190IlQBNG7cNdHFUqnCYJJ4BED5mlrSnwWL1M/edit#gid=595603569
* blue, chunk copy (avg improv +16%)
* red, chunk copy + adler32 simd (avg improv +36%)
* yellow, chunk copy + adler32 simd + sse4.2 crc32 (avg improv +40%)
* green, chunk copy + adler32 simd + sse4.2 crc32 + wide bits (avg improv +41%)

I called Nigel's method "wide bits", and it adds 1% more on the average. Might not seem worth it, until you use zlib_bench to look at zlib decode performance for ZLIB and GZIP wrapped DEFLATE streams (viz., the common web browser use-case). Using the snappy testdata corpus, again on the Macbook Pro 2016, the results are:

ZLIB wrapped DEFLATE stream, snappy corpus:
https://docs.google.com/spreadsheets/d/1RoS6sgAuVjQ0LBIMJIz5D5WwWkRVCb2dQvG3ztTqb5Q/edit#gid=595603569
* red, chunk copy [deflate] (avg improv +22%)
* yellow, chunk copy + adler32 simd [deflate] (avg improv +50%)
* green, chunk copy + adler32 simd + wide bits [deflate] (avg improv +74%)
-------------------------- 1.96x faster in the median

GZIP wrapped DEFLATE stream, snappy corpus:
https://docs.google.com/spreadsheets/d/15b0-iT0sXB5_d8yN--y48dRhySkeFe3frjWIv7_Vivw/edit#gid=595603569
* red, chunk copy [gzip] (avg improv +19%)
* yellow, chunk copy + sse4.2 crc32 [gzip] (avg improv +69%)
* green, chunk copy + sse4.2 crc32 + wide bits [gzip] (avg improv +96%)
-------------------------- 2.17x faster in the median

Nice to be getting ~2x more speed on the average, and well worth incorporating in the chunk copy input reading code. I'll begin with x86 ...
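For anyone reading along who hasn't looked at the chunk copy work mentioned above, the output-side idea is roughly the following. This is a deliberately simplified, hypothetical sketch, not the code in chunkcopy.h: it ignores overlapping matches (distance smaller than the chunk size) and end-of-buffer limits, both of which the real code has to handle.

#include <stddef.h>
#include <string.h>

/* Copy a DEFLATE match of len bytes from dist bytes behind the output
   pointer, 16 bytes per iteration; compilers turn the fixed-size memcpy
   into a single SIMD load/store pair. Assumes dist >= 16 and a writable
   safety margin of at least 15 bytes past out + len. */
static void copy_match_wide(unsigned char *out, unsigned dist, size_t len) {
    const unsigned char *from = out - dist;
    while (len > 0) {
        memcpy(out, from, 16);
        out += 16;
        from += 16;
        len = (len > 16) ? len - 16 : 0;
    }
}

The input-side "wide bits" reading works on the other end of the decode loop, which is why the two combine well.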
Feb 8 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/6e212423a214e0e41794e8c9969c2896e2c33121

commit 6e212423a214e0e41794e8c9969c2896e2c33121
Author: Noel Gordon <noel@chromium.org>
Date: Thu Feb 08 13:09:37 2018

Increase inflate speed: read decoder input into a uint64_t

The chunk-copy code contribution deals with writing decoded DEFLATE data to the output with SIMD methods to increase inflate decode speed. Modern compilers such as gcc/clang/msvc elide the portable memcpy() calls used, replacing them with much faster SIMD machine instructions. Similarly, reading the input data to the DEFLATE decoder with wide, SIMD methods can also increase decode speed. See https://crbug.com/760853#c32 for details; content-encoding: gzip decoding speed improves by 2.17x, in the median over the snappy testdata corpus, when this method is combined with the chunk-copy, and the adler32, and crc32 SIMD contributions (this method improves our current inflate decode speed by 20-30%).

Update the chunk-copy code with a wide input data reader, which consumes input in 64-bit (8 byte) chunks. Update inflate_fast_chunk_() to use the wide reader. This feature is supported on little endian machines, and is enabled with the INFLATE_CHUNK_READ_64LE build flag in BUILD.gn on Intel CPU only for now.

The wide reader idea is due to nigeltao@chromium.org who did the initial work. This patch is based on his patch [1]. No change in behavior (other than more inflate decode speed), so no new tests.

[1] https://chromium-review.googlesource.com/c/chromium/src/+/601694/16

Bug: 760853
Change-Id: Ia806d9a225737039367e1b803624cd59e286ce51
Reviewed-on: https://chromium-review.googlesource.com/900982
Commit-Queue: Noel Gordon <noel@chromium.org>
Reviewed-by: Mike Klein <mtklein@chromium.org>
Cr-Commit-Position: refs/heads/master@{#535365}

[modify] https://crrev.com/6e212423a214e0e41794e8c9969c2896e2c33121/third_party/zlib/BUILD.gn
[modify] https://crrev.com/6e212423a214e0e41794e8c9969c2896e2c33121/third_party/zlib/contrib/optimizations/chunkcopy.h
[modify] https://crrev.com/6e212423a214e0e41794e8c9969c2896e2c33121/third_party/zlib/contrib/optimizations/inffast_chunk.c
[modify] https://crrev.com/6e212423a214e0e41794e8c9969c2896e2c33121/third_party/zlib/contrib/optimizations/inffast_chunk.h
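A side note on the "little endian machines" / INFLATE_CHUNK_READ_64LE restriction mentioned in the commit message: DEFLATE is an LSB-first bit stream, so after a raw 8-byte load the first input byte must end up in the low bits of the accumulator. The small stand-alone program below (illustrative only, not Chromium code) shows why that comes for free on little-endian hosts, and why a big-endian host would need a byte swap.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Pretend these are the next 8 bytes of a DEFLATE stream. */
    const unsigned char input[8] = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
    uint64_t hold;
    memcpy(&hold, input, sizeof hold); /* raw 64-bit load of the input */
    /* On a little-endian host this prints 0x01: the first stream byte sits in
       the low 8 bits, so masking hold peeks at the next Huffman-coded bits
       directly. On a big-endian host it would print 0x08, so the loaded value
       would need byte swapping, hence the build-time gate. */
    printf("low byte of hold = 0x%02x\n", (unsigned)(hold & 0xff));
    return 0;
}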
Feb 9 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b07c92a12d39e1ac51fe0dab8e4cb98f91f0d213

commit b07c92a12d39e1ac51fe0dab8e4cb98f91f0d213
Author: Noel Gordon <noel@chromium.org>
Date: Fri Feb 09 01:38:56 2018

Add x86 bots to the read64le party

thestig@ reminds me we do have linux x86 bots on the waterfall. Add them to the read64le party. Minor: run git cl format over BUILD.gn, and add a line space here and there for readability.

Bug: 760853
Change-Id: Ia47296a26bff77f9be699e31053d8b94aac583f4
Reviewed-on: https://chromium-review.googlesource.com/910328
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Noel Gordon <noel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#535612}

[modify] https://crrev.com/b07c92a12d39e1ac51fe0dab8e4cb98f91f0d213/third_party/zlib/BUILD.gn
Feb 12 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b7fef3927da4a703c523e7e8916fd8d4c031a402

commit b7fef3927da4a703c523e7e8916fd8d4c031a402
Author: Noel Gordon <noel@chromium.org>
Date: Mon Feb 12 01:46:01 2018

Revert "Add x86 bots to the read64le party"

This reverts commit b07c92a12d39e1ac51fe0dab8e4cb98f91f0d213.

Reason for revert: the litmus test of running this code on linux x86 bots worked. However, performance-wise we're not so sure. Back this change out to restrict it to Intel x64 for now.

Original change's description:
> Add x86 bots to the read64le party
>
> thestig@ reminds me we do have linux x86 bots on the waterfall. Add them
> to the read64le party. Minor: run git cl format over BUILD.gn, and add a
> line space here and there for readability.
>
> Bug: 760853
> Change-Id: Ia47296a26bff77f9be699e31053d8b94aac583f4
> Reviewed-on: https://chromium-review.googlesource.com/910328
> Reviewed-by: Mike Klein <mtklein@chromium.org>
> Commit-Queue: Noel Gordon <noel@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#535612}

TBR=noel@chromium.org,mtklein@chromium.org,nigeltao@chromium.org

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug: 760853
Change-Id: If3d24a2cdc6614ff82c781019623108a52dbd16a
Reviewed-on: https://chromium-review.googlesource.com/913029
Commit-Queue: Noel Gordon <noel@chromium.org>
Reviewed-by: Noel Gordon <noel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#536020}

[modify] https://crrev.com/b7fef3927da4a703c523e7e8916fd8d4c031a402/third_party/zlib/BUILD.gn
Feb 15 2018
I'm not so sure this feature should be enabled on ARM without more investigation. I filed issue 812499 with initial results for ARM/ARM64 to cover that work.
Feb 15 2018

Feb 15 2018
Intel work is all done. Closing.