Update zlib in chromium?
Issue description

IIUC, our current zlib is: https://github.com/madler/zlib + a significant patch for mixed source data compression + some Intel opts from https://github.com/jtkukunas/zlib/

It looks like there are a variety of active zlib projects:
https://github.com/madler/zlib
https://github.com/jtkukunas/zlib
https://github.com/cloudflare/zlib
https://github.com/Dead2/zlib-ng
https://sortix.org/libz/

One drawback of working with a forked zlib is that every time an upstream project makes improvements (ex: jtkukunas has added more Intel opts), we face a big challenge integrating those into Chrome.

Recently, I've spoken with cavalcantii@ and vasileios.laganakos@, who are very enthusiastic about writing Arm NEON opts for zlib. There is potential to make a big impact. For one example, 60-80% of png decode time is spent in zlib, and we anticipate being able to make certain zlib functions 2-3x faster. I think this is great, but I don't really want to encourage any changes that will fork Chrome's zlib further and increase the maintenance burden.

I'm wondering if it makes sense to:
(1) Assess active zlib projects for the best fit for Chrome (assuming license and security reviews)
(2) Switch Chrome to that zlib
(3) Upstream Arm NEON opts and roll the changes into Chrome

The current patch for mixed source data compression is an unknown factor for me: Is this still needed by Chrome? Do any of the active zlib projects have similar code? Would they be receptive to upstreaming this patch?

agl@, gavinp@: I'd be interested to hear if you think this type of project is viable.
Feb 4 2017
Traces of a patched zlib can be found at: https://bugs.chromium.org/p/chromium/issues/detail?id=688601 Performance improvements show good potential.
Feb 6 2017
Finished filing issues with all the aforementioned projects; let's see how responsive they are.
Feb 8 2017
Looks like the mixed source data compression patch has been removed: https://codereview.chromium.org/2669053004

That's good news, it should make this a lot easier :).

agl@: Do you have any objections/concerns about moving forward with this?

cavalcantii@: Have you done any benchmarking (or looked at general project health, licensing, etc.) to determine which zlib might make sense for chromium?
Feb 9 2017
I'm working on it, should have more data later this week. The overall situation is that, of all the zlib projects I contacted, only 2 replied and are interested in having ARM optimizations. Of those, 1 has already started a branch targeting the ARMv8-specific CRC32 instruction (which should help both reading PNGs and decompressing gzipped pages).

I'm currently testing the libraries concerning:
a) Compatibility: does it work with libpng? Will it work with Chromium?
b) Performance: I'm using https://github.com/jsnell/zlib-bench to compare the libraries.

Concerning code health, the following factors are being analyzed:
a) Is it still actively maintained?
b) What is the truck factor (i.e. number of developers)?
c) Is the project responsive to external feedback (i.e. reported issues)?

I will be updating here as I have further information.
Feb 10 2017
Attaching the data collected by zlib-bench.

Hardware:
a) Intel: Xeon E5-2690 @ 2.6GHz
b) ARMv8: Cortex-A72 @ 2.1GHz + Cortex-A53 @ 1.7GHz (MediaTek M8173C)
Feb 10 2017
About zlib-bench: basically it runs operations (compression/decompression) on content (html, images, etc.) using the minigzip binary built from each of a set of zlib projects.
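For readers unfamiliar with the underlying API, the sketch below shows the kind of compress/decompress round trip being timed, using zlib's one-shot convenience functions. This is only an illustration: zlib-bench itself drives each project's minigzip binary rather than calling these functions, and the sample input string here is made up.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void) {
  const char *input = "example web content: <html><body>hello</body></html>";
  uLong in_len = (uLong)strlen(input) + 1;

  /* Compress into a buffer sized by compressBound(). */
  uLongf comp_len = compressBound(in_len);
  Bytef *comp = malloc(comp_len);
  if (compress2(comp, &comp_len, (const Bytef *)input, in_len, 6) != Z_OK)
    return 1;

  /* Decompress and verify the round trip. */
  uLongf out_len = in_len;
  Bytef *out = malloc(out_len);
  if (uncompress(out, &out_len, comp, comp_len) != Z_OK)
    return 1;

  printf("%lu -> %lu -> %lu bytes, match=%d\n", (unsigned long)in_len,
         (unsigned long)comp_len, (unsigned long)out_len,
         memcmp(input, out, in_len) == 0);
  free(comp);
  free(out);
  return 0;
}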
Feb 10 2017
The pretty-formatted output.
Feb 10 2017
I had to change zlib-bench to point to the latest good hashes of each project.
Feb 10 2017
Brief data analysis:

For Intel, Cloudflare behaved pretty well when compared to the canonical zlib (of 20 tests, 18 were faster, thus 18/20). Next, zlib-ng was pretty good (17/20) compared to canonical zlib. The Intel fork was faster in 16/20 tests. When zlib-ng is compared to the Intel fork, they pretty much match in all cases where they are expected to be faster than the canonical zlib. That is because the Intel patches were merged into zlib-ng in the hackandslash5 branch (the one used in this experiment). That being said, in 4 tests zlib-ng was remarkably faster than the Intel fork (decompress jpeg, compress -1 html/jpeg/pngpixels).

For ARM, zlib-ng is faster in some cases (8/20) and slower in others (9/20) than canonical zlib. Cloudflare is faster in almost all cases (19/20). What is potentially interesting for Chromium on ARM is that in zlib-ng both the decompress html and decompress png tests were faster (relative to canonical: 88.33% of the time for html, 89.31% for PNG). Unfortunately, the same wasn't true for decompress JPEG (103.36%). The same trend was observed on Intel (i.e. zlib-ng being faster at decompressing web content: html + png).
Feb 10 2017
I should be posting observations about the code health, etc., of each project tomorrow.
Feb 10 2017
Any insights on which tests are important to Chrome? Certainly the png decoding tests fall into that category. On the other hand, I'm not sure what zlib has to do with jpeg decoding? Also, I'm assuming it's ok if we choose a healthy project that's lacking a little bit on ARM, since we have the right people to make those improvements :).
Feb 10 2017
Concerning tests: I'm open to suggestions on which perf/layout tests we should be measuring that might be affected by changes to zlib. For sure anything using compressed pages and/or PNGs should be affected. Not so sure about JPEG. My primary concern is to avoid perf regressions on all platforms (Intel and ARM) and to streamline the process of letting users benefit from any optimization work we may do for ARM.
Feb 11 2017
The report with the findings, plus analysis concerning code health can be found at: https://docs.google.com/document/d/1XMa2z5LmYn3TLjDVRz-C2VcHBoeWwzCmPQyPrnJsGW8
Feb 11 2017
> Any insights on which tests are important to Chrome? Certainly the png decoding tests fall into that category.

I would say decompressing data is the common workflow of a web browser (html and images).
Feb 11 2017
> On the other hand, I'm not sure what zlib has to do with jpeg decoding?

The benchmark will compress/decompress a JPEG file. :-)
Feb 20 2017
+mark@ I noticed that zlib was recently updated in https://codereview.chromium.org/2690623003. Any thoughts on madler/zlib vs. other open source versions?
Feb 23 2017
@msarett Chromium builds minizip as a static library (https://cs.chromium.org/chromium/src/third_party/zlib/BUILD.gn?l=100). zlib-ng doesn't have it because it was part of the 'contrib' folder in canonical zlib (https://github.com/madler/zlib/tree/master/contrib) and everything there is unmaintained. What exactly is minizip used for in Chromium?
Feb 24 2017
I don't know. It might be unused. Easiest thing to do might be to delete it and see what breaks?
Feb 24 2017
cavalcantii@, can we improve our fuzzer coverage of zlib? https://cs.chromium.org/chromium/src/testing/libfuzzer/fuzzers/zlib_uncompress_fuzzer.cc This is a good change on its own, and IMO necessary to feel comfortable with a new library.
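For context, a minimal sketch of what such a libFuzzer harness can look like, in the spirit of the zlib_uncompress_fuzzer.cc linked above (the actual file may differ; the 100 KiB output cap and the build line are assumptions):

/* Build with, e.g.: clang -fsanitize=fuzzer,address zlib_uncompress_fuzz.c -lz */
#include <stddef.h>
#include <stdint.h>
#include <zlib.h>

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  static uint8_t buffer[100 * 1024];
  uLongf buffer_length = sizeof(buffer);
  /* Feed arbitrary bytes to the one-shot decompressor; the sanitizers flag
   * any out-of-bounds access or undefined behaviour this triggers. */
  uncompress(buffer, &buffer_length, data, size);
  return 0;
}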
Mar 27 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/02177c7ab2602bac1fa9f2dfdc9421b70a4c3f46

commit 02177c7ab2602bac1fa9f2dfdc9421b70a4c3f46
Author: pdfium-deps-roller <pdfium-deps-roller@chromium.org>
Date: Mon Mar 27 17:06:01 2017

Roll src/third_party/pdfium/ 2977e1e34..0d6d1783e (6 commits)
https://pdfium.googlesource.com/pdfium.git/+log/2977e1e342a1..0d6d1783ed96

$ git log 2977e1e34..0d6d1783e --date=short --no-merges --format='%ad %ae %s'
2017-03-24 thestig Remove old test expectations after the Mac 10.12 upgrade.
2017-03-23 adenilson.cavalcanti Update to zlib 1.2.11
2017-03-24 caryclark fix new tab crash in skia clip stack
2017-03-24 tsepez kill another CFX_ArrayTemplate in cfde_txtedtengine.cpp
2017-03-24 tsepez Use std::vector in fxfa/app.
2017-03-24 stephana Ensure empty output directory to avoid duplicate upload

Created with:
roll-dep src/third_party/pdfium

BUG=703912, 687631, 704442

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

TBR=dsinclair@chromium.org

Review-Url: https://codereview.chromium.org/2779673002
Cr-Commit-Position: refs/heads/master@{#459805}

[modify] https://crrev.com/02177c7ab2602bac1fa9f2dfdc9421b70a4c3f46/DEPS
Mar 31 2017
The next step in the plan was to allow PDFium to use Chromium's zlib: https://pdfium-review.googlesource.com/c/3350/
Apr 4 2017
For reference, using all the 3 patches linked to this issue yields a performance boost of around 40% in PNG image decoding speed (using a Nexus 6). The image shows the traces for a test page (http://codepen.io/Savago/pen/VPeQaX) where we compare Chromium m59 vanilla vs. patched (using the 3 optimizations resulting from the PNG investigation: Adler32, inflate_fast, palette). The time spent decoding the image dropped from 116.187ms to 73.844ms (an improvement of around 40%).

It is interesting to see that GPUImageDecodeCache::DecodeImage() now takes longer to execute (94.025ms) than actually decoding the image in ImageFrameGenerator::decode(). Is the image cache compressed? I wonder if we could make it faster?

The Adler32 optimization was submitted to zlib-ng and is awaiting review/merge.
Apr 4 2017
@reveman: any tips about why GPUImageDecodeCache::DecodeImage() may take longer to execute than to decompress a PNG?
Apr 4 2017
I'm not very familiar with GPUImageDecodeCache. vmpstr@, ericrk@?
Apr 4 2017
GPUImageDecodeCache doesn't use any compression of cached items - instead, I think the time you're seeing is due to:
- Allocating the discardable memory to back the image.
- Scaling the image (which is done as a separate step from decoding).
- Determining if the image is codec-backed, which is currently inefficient (tracked by bug 646089).

Not sure we can avoid the cost of the first two parts, but we should be fixing the last issue soon. Hopefully this will close the gap.
Apr 5 2017
Oh holy moly. Thank you so much for bringing this to my attention. I am very interested in this. :D I talked to madler (canonical zlib) long ago and he said he was happy to accept NEON optimizations. He even suggested where I might start to get the biggest bang for my buck. He made it clear that he would accept changes but they would have to go in the contrib folder. Our build would then select which files to include and would opt for the files with the NEON optimizations. Beyond just NEON improvements, I believe there is also room for optimizing cache usage.
Apr 5 2017
@cblume: nice talking to you today on IRC. :-) Concerning canonical zlib, I guess Mark may have had a change of heart as this is still pending a reply after almost 2 months: https://github.com/madler/zlib/issues/216
Apr 5 2017
Really excited about this! 40% PNG decoding improvement is awesome! I think there is a fair amount of verification that still needs to be done, given that zlib-ng is a new library. Can we go ahead and try to check it in? We can start writing fuzzers and kick off the security review process (not sure how this works). Then we'll be able to experimentally swap it for canonical zlib in various projects (ex: libpng).
Apr 5 2017
Re #44 -- I'll try to contact Mark. He may just not get notifications from github or maybe they go to a separate address.

Re #46 -- I'm going to look into writing a fuzzer for zlib and zlib-ng. I was thinking of using one of the fuzzers you or scroggo@ wrote recently as a starting point.
Apr 5 2017
"Re #44 -- I'll try to contact Mark. He may just not get notifications from github or maybe they go to a separate address." Would also be interesting to find out if he is interested in various Intel optimizations from jtkukunas/zlib (some of these have been added to our fork) or other optimized zlibs. "Re #46 -- I'm going to look into writing a fuzzer for zlib and zlib-ng. I was thinking of using one of the fuzzers you or scroggo@ wrote recently as a starting point." Awesome!
Apr 5 2017
Re #48 -- IIRC Mark doesn't want to accept outside contributions anywhere except in the contrib/ folder. Within the contrib/ folder you get your own subfolder and can add whatever you want. So I would imagine his response would be "Sure, knock yourself out." But I'll check with him.
Apr 8 2017
FWIW, zlib's decompression could benefit from some refactoring around the way it handles short-output callers like libpng. Or at least the interaction should be addressed somehow. libpng makes a decompress call per row, and this constrains zlib's output window to the width of the line (plus one byte). This limits the amount of time that can be spent in inflate_fast() compared to the much slower edge-of-output handling case. For example, a paletted image as narrow as 258 pixels will fail to enter inflate_fast() at all (a 512-pixel-wide image would only spend half its time there). The patch in issue 697280 is only a quick tweak to address the inefficiency of byte copies in inflate_fast(). A page full of thumbnails might never see any benefit.
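To make the constraint concrete, here is a rough sketch of a per-row inflate call of the kind libpng makes. This is illustrative only (the function name is made up and it is not libpng's actual code); the caller is assumed to have already fed the compressed IDAT bytes via next_in/avail_in.

#include <stddef.h>
#include <zlib.h>

/* inflate() only takes the inflate_fast() path while at least 258 bytes of
 * output space remain, so a narrow row keeps decompression on the slower
 * general path. */
int inflate_one_row(z_stream *zs, unsigned char *row, size_t rowbytes) {
  zs->next_out = row;
  zs->avail_out = (uInt)(rowbytes + 1); /* one row plus the PNG filter byte */
  return inflate(zs, Z_NO_FLUSH);
}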
Apr 8 2017
I'm reorganizing the issue dependency relationships, as updating the zlib used by Chromium is a self-contained task and should not block the pursuit of further optimizations. For that I created a new issue to serve as a common umbrella for image-handling related optimizations: https://bugs.chromium.org/p/chromium/issues/detail?id=709707
Apr 8 2017
> Really excited about this! 40% PNG decoding improvement is awesome!

We can extract more performance by implementing the ImageFrame::setRGBAPremultiply optimization as well as an optimized CRC32 function (either using the ARMv8 CRC32 instruction or a SIMD version using NEON, similar to what Chromium has for Intel). For CRC32, I tested the first approach and it should be between 5x (A53: 116ms vs. 22ms for a 4Kx4Kx4 buffer) and 10x (A72: 91ms vs. 9ms) faster than the C implementation currently used by zlib. One nice side effect of this optimization is that it should also help when decompressing webpages (i.e. Content-Encoding: gzip).

Anyway, in case it is unclear, the 40% gain was observed with the 3 patches *on top of* the zlib used by Chromium.
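For reference, a minimal sketch of the ARMv8 CRC32 approach using the ACLE intrinsics (the function name is made up, and this is an illustration of the technique rather than the patch that was benchmarked; it assumes a compiler flag such as -march=armv8-a+crc):

#include <arm_acle.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* CRC-32 (the polynomial zlib uses) over a buffer, 8 bytes per step,
 * following zlib's convention of inverting the CRC on entry and exit. */
uint32_t crc32_armv8(uint32_t crc, const unsigned char *buf, size_t len) {
  crc = ~crc;
  while (len >= 8) {
    uint64_t chunk;
    memcpy(&chunk, buf, 8); /* unaligned-safe load */
    crc = __crc32d(crc, chunk);
    buf += 8;
    len -= 8;
  }
  while (len--)
    crc = __crc32b(crc, *buf++);
  return ~crc;
}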
Apr 8 2017
@cavalcantii, It looks to me like the vmull_p64() method of splicing several CRC streams shown in github.com/axboe/fio might still have the potential to offer a performance advantage over zlib-ng's straightline version, even after porting to 32-bit. It's kind of a half-and-half solution. The three patches you mention are adler32 and memcpy for zlib, and expand-palette for libpng, right?
Apr 8 2017
> @cavalcantii, It looks to me like the vmull_p64() method of splicing several ...

Awesome! I will experiment with it and report back the results.

> The three patches you mention are adler32 and memcpy for zlib, and expand-palette for libpng, right?

That is correct.
Aug 24 2017
Re using github.com/jsnell/zlib-bench to compare the libraries: this is a Perl script that exec's an OS-level process for every iteration of the loop, which can be relatively noisy. As per my work-in-progress https://chromium-review.googlesource.com/c/chromium/src/+/601694, an alternative is to use the benchmark program in the github.com/google/snappy repo. From that CL's description:

----
https://github.com/google/snappy is an unrelated compression format, but its snappy_unittest program is able to benchmark the snappy library and the zlib library. We can re-purpose it, via LD_LIBRARY_PATH, to measure different versions of zlib, such as before / after this commit.

snappy_unittest's output looks like:
testdata/alice29.txt : ZLIB: [b 1M] bytes 152089 -> 54404 35.8% comp 19.6 MB/s uncomp 258.2 MB/s
testdata/asyoulik.txt : ZLIB: [b 1M] bytes 125179 -> 48897 39.1% comp 17.9 MB/s uncomp 242.9 MB/s
etc

and summarizing the before/after zlib uncompress numbers gives:
alice29.txt uncomp MB/s before 258.2 after 392.4 ratio 1.52
asyoulik.txt uncomp MB/s before 242.9 after 346.6 ratio 1.43
etc
----
Aug 24 2017
> We can extract more performance by implementing the ImageFrame::setRGBAPremultiply optimization as well as an optimized CRC32 function (either using the ARMv8 CRC32 instruction or a SIMD version using NEON, similar to what Chromium has for Intel).

FWIW, we're working to combine ImageDecoder with Skia's SkCodec, which already has optimizations for its equivalent of setRGBAPremultiply.
Aug 24 2017
@scroggo: thanks for sharing this. That is one of the reasons why I skipped working on that function.
Aug 31 2017
As explained on blink-dev (https://groups.google.com/a/chromium.org/forum/#!searchin/blink-dev/zlib/blink-dev/T3rPfd6G6ps/l8pAk7gnBgAJ), we tried to upstream the optimizations to zlib|ng with varying degrees of success. The path forward is to land the ARM-specific optimizations in Chromium's zlib, and we can revisit the decision in the future.
Sep 4 2017
#61 Are the SIMD optimizations based on the old skia rounding premultiplier, which did the following per pixel component?
unsigned char skia_premult(unsigned char x, unsigned char alpha)
{
const unsigned prod = x * alpha + 128;
return (prod + (prod >> 8)) >> 8;
}
Blink does not use that method, it uses:
enum FractionControl { roundFraction = 257 * 128 };
unsigned char blink_premult(unsigned char x, unsigned char alpha)
{
return ((unsigned short)(x) * alpha + roundFraction) >> 16;
}
https://godbolt.org/g/2WuHAX compares these code snippets at -O3 on clang: the blink method is 3 instructions, the skia is 6 instructions.
I expect an SIMD version of the blink method would be similarly faster compared to an SIMD version of the skia premult.
Sep 5 2017
SIMD operations offer a lot of rounded and accumulating versions of operations compared with scalar opcodes (reflecting the DSP-oriented work that they're meant for). The performance difference tends to be a lot smaller. As I recall the Blink version does work out faster even for NEON, but it's a marginal speed gain and it's not strictly correct.
Sep 9 2017
Not sure what you mean by "not strictly correct", but one of the requirements when building blink_premult was that it should be bit-exact with skia_premult, and per https://bugs.chromium.org/p/chromium/issues/detail?id=234071#c5 they do produce the same output for all inputs via the math, and that was confirmed with the other testing mentioned in that issue.

Another requirement was that the premult should round. As it turns out, that's needed for web-compat, see issue 450189. The Skia scalar code for a rounding premult is SkMulDiv255Round():
https://cs.chromium.org/chromium/src/third_party/skia/include/core/SkMath.h?rcl=3ed4781ee1bd5af9b3ee2623e5356e86ce36e812&l=115

For a non-scalar SSE implementation, ideally skia_premult would be 5 SSE instructions and blink_premult would be 3. Looking at the code, the blink_premult magic constants 257 and 128 for a rounding premult are there in the SSE scale() routine:
https://cs.chromium.org/chromium/src/third_party/skia/src/opts/SkSwizzler_opts.h?type=cs&l=502
and it takes 3 instructions to do the work. Looks good and fast to me (3 vs. 5). The NEON scale() routine uses another approach, but again is a 3-instruction rounding premult, by the looks of it:
https://cs.chromium.org/chromium/src/third_party/skia/src/opts/SkSwizzler_opts.h?rcl=3ed4781ee1bd5af9b3ee2623e5356e86ce36e812&l=196

Any image decoder that supports alpha needs premult. The blink PNG decoder spends 25% of the image decode time of an alpha PNG writing out decoded row pixels. Every pixel must be converted to premultiplied alpha during a row write. Having a fast premult helps.
Sep 9 2017
Ahh, the version you pasted (and used in godbolt) lacked the `alpha * 257` multiplication, and I didn't read it properly and assumed the shift was by 8. The problem for SIMD is that when you include that multiplication (even if you do it outside of the loop) the intermediate arithmetic is 32-bit. Even if the result is trimmed back to 16-bit in the same instruction the longer multiply is an extra cycle (or two; I forget) for NEON. That cycle might be pipelined away, but unfortunately I think NEON lacks the unsigned (a*b)>>16 instruction.
Sep 10 2017
OIC. Yes, the `alpha = A * 257` term is common when premultiplying an RGBA pixel and only needs to be computed once, rather than 3 times, per pixel. Thanks for the NEON details. The lack of a (a*b)>>16 inst. there perhaps explains why the Skia NEON scale() code uses (x+127)/255 instead. // (x+127)/255 == ((x+128)*257)>>16 for 0 <= x <= 255*255.
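The quoted identity is easy to check exhaustively; a quick standalone sketch (not code from either project):

#include <assert.h>
#include <stdio.h>

int main(void) {
  /* (x+127)/255 == ((x+128)*257)>>16 for 0 <= x <= 255*255 */
  for (unsigned x = 0; x <= 255u * 255u; x++)
    assert((x + 127) / 255 == ((x + 128) * 257) >> 16);
  printf("identity holds for 0 <= x <= 255*255\n");
  return 0;
}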
Oct 29
Re Comment 29 by msarett@chromium.org, Feb 24 2017
> cavalcantii@, can we improve our fuzzer coverage of zlib?

zlib-ng has had an oss-fuzz fuzzer with >80% coverage for the past months:
https://github.com/google/oss-fuzz/tree/master/projects/zlib-ng
https://github.com/Dead2/zlib-ng/pull/206
I also have fixed all the bugs in zlib-ng that the sanitizers and fuzzers found.

Re Comment 63 by cavalcantii@chromium.org, Aug 31 2017
> As explained in blink-dev (https://groups.google.com/a/chromium.org/forum/#!searchin/blink-dev/zlib/blink-dev/T3rPfd6G6ps/l8pAk7gnBgAJ), we tried to upstream the optimizations to zlib|ng with varying degrees of success.
>> given that zlib-ng hasn't yet made an official release and its security status are unknown (i.e. new bugs?), it seems a bit too risky to migrate Chromium to it.
> The path forward is to land the ARM specific optimizations in Chromium's zlib and we can revisit the decision in the future.

I think zlib-ng will be able to make its first stable release:
https://github.com/Dead2/zlib-ng/issues/64
and we can now migrate both chromium.org and AOSP to zlib-ng. Let's reopen this issue?
Nov 8
@sebpop: it is really nice to learn that zlib-ng is planning a stable release! Maybe in the future it will become the new 'standard' zlib? :-) Answering the question concerning Chromium: the plan for *now* is to keep maintaining Chromium's fork for the near future, at least as long as madler/zlib is still considered the 'standard' zlib.
Nov 16
I am trying to get the performance improvements into AOSP's zlib, and the AOSP developers said that they will do what chromium does. This is why I am trying to figure out what the chromium project's plans are for zlib. If chromium stays on madler/zlib plus custom patches, I will have to port some of chromium's patches (perf improvements and bug fixes) to AOSP and set up an oss-fuzz fuzzer for aosp/zlib to validate that. It looks to me like a lot of duplicated effort to pull, validate, and maintain chromium's custom patches in AOSP. As it is up to the chromium project to define what 'standard' zlib means, I will follow your recommendations on how to port zlib's performance changes to Android.

PS: Is there a way to add my email to the CC list? I had not seen your reply until now, when I reloaded the page.
Nov 16
sebpop@gmail.com, I've added you to the CC list. I have it on my list of things to do to update AOSP's zlib (I updated it to the version that Chromium is based on, but have not applied any of the patches), but I haven't gotten to it. I'll be happy to review your changes! As stated in comment#70, the current plan is for Chromium to stick with its current version.
Nov 16
@sebpop: I recommend that instead you open a new issue to track any new development. Question: do you have performance numbers of zlib-ng? Preferably using zlib_bench (https://cs.chromium.org/chromium/src/third_party/zlib/contrib/bench/zlib_bench.cc) and the snappy dataset (https://github.com/google/snappy/tree/master/testdata). For integration of the benchmark tool, you can also refer to: https://github.com/Adenilson/zlib/commit/07e2eba07b1b1eef29b84a3af90997903f6ef737
Nov 16
@scroggo I ported some of the performance patches from chromium's zlib to zlib-ng. The intent is to have all the chromium changes impacting performance in zlib-ng. I have not yet ported those patches to aosp's zlib. @cavalcantii do you mean opening a new issue to track the performance of zlib-ng vs. chromium/zlib instead of adding more comments to this ticket? Also thanks for the pointers, I will post the perf numbers for zlib-bench and snappy once I finish reviewing and porting all the performance patches from chromium/zlib to zlib-ng.
Nov 16
Yep, no need to keep posting in an issue that is marked as resolved (feel free to add me to the new issue). I'm not following one thing here: what do you mean you got to port optimizations from Chromium's zlib to ng? How many patches/optimizations are we talking about here?
Nov 16
For example, with the patch from Nigel Tao (https://github.com/Dead2/zlib-ng/pull/224), some bits are not yet in zlib-ng.