zlib crc32 inflate SSE support |
||||
Issue descriptionzlib crc32 inflate SSE support Intel contributed a patch that improved zlib deflate (compression) speed with a stateful SSE4.2 crc32 computation, which is applied to the compressed data as it is copied to the output buffer. The implementation depends on the zlib deflate state. third/party/zlib is also used for PNG decoding, and gzip decoding, and is the common and majority use-case in the browser. This is the zlib inflate (decompression) path, and no accelerated crc32 implementation is available to support it. We should add an SSE-based crc32 implementation to improve PNG and gzip decoding speeds. The following paper examines implementing arbitrary polynomial CRC on Intel devices: * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction" * V. Gopal, E. Ozturk, et al., 2009, http://intel.ly/2ySEwL0 including the CRC32 needed by zlib/gzip, but no code implementation, so again we'd need to do the work to write and test some SSE code, and add it to zlib.
,
Dec 19 2017
Code can be integrated into zlib similar to the alder simd patch ( issue 762564 ), and some measurements on PNG decode performance with the image decode bench: MAC BOOK PRO 13" 2016 (TOUCH BAR) LAPTOP * Dual-core Intel Core i7-6567U, 3.3GHz Skylake-U (14nm) png macbook-pro-chunk-alder-sse42-pclmul-crc https://docs.google.com/spreadsheets/d/1v3ECbh190IlQBNG7cNdHFUqnCYJJ4BED5mlrSnwWL1M/edit#gid=595603569 The results show PNG decoding rate MB/s with the change, divided by the decoding rate MB/s with no change (decoding rate ratio). The x-axis is the corpus image (sorted in ascending order). The series of changes were: blue-line: inflate chunk copy change ( issue 772870 ) avg improv 16% red-line: inflate chunk copy + adler simd change ( issue 762564 ) avg improv 36% yellow-line: inflate chunk copy + adler simd + crc32 in SSE4.2 avg improv 40% Seems like a useful win for PNG decoding.
,
Jan 4 2018
re #1, noted that zlib uses the BYFOUR crc32 (8 table) implementation, which is a bit faster than the single table crc32 code using in crc.cc in #1. Updated the crc.cc bench (attached) to use the BYFOUR 8 table method therefore, and benched again. Results attached: 5x to 25x faster than the zlib BYFOUR crc32, depending on the buffer size.
,
Jan 4 2018
I needed tool to check if these crc32 changes improve perf or not, and indeed the same for the submitted zlib chunk copy, and adler32 SIMD changes [1]. Used the snappy repository testdata files (11 files in all) as a test corpus. MAC BOOK PRO 13" 2016 (TOUCH BAR) LAPTOP * Dual-core Intel Core i7-6567U, 3.3GHz Skylake-U (14nm) macbook-pro-chunk-alder-zlib-inflate-snappy-corpus https://docs.google.com/spreadsheets/d/1RoS6sgAuVjQ0LBIMJIz5D5WwWkRVCb2dQvG3ztTqb5Q/edit#gid=595603569 zlib inflate: chunk copy (red line) + adler32 simd (yellow line), improves zlib-wrapped data decode (aka inflate) perf of 50%. macbook-pro-chunk-crc32-gzip-inflate-snappy-corpus https://docs.google.com/spreadsheets/d/15b0-iT0sXB5_d8yN--y48dRhySkeFe3frjWIv7_Vivw/edit#gid=595603569 gzip inflate: chunk copy (red line) + crc32 simd (yellow line), improves gzip-wrapped data decode (aka inflate) perf by 69%. [1] Filed issue 798943 about adding a tool to measure GZIP/ZLIB stream encode/decode perf on any chrome target.
,
Jan 4 2018
@nigel: note the green-line on those spread sheets in #4, that's what I measured for your current code changes on issue 760853 :), so it might be good to dust that work off and complete it.
,
Jan 4 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/65e2abcb74b1c07fa14f46abaa1fb1717892eec3 commit 65e2abcb74b1c07fa14f46abaa1fb1717892eec3 Author: Noel Gordon <noel@chromium.org> Date: Thu Jan 04 07:30:36 2018 Improve zlib inflate speed by using SSE4.2 crc32 Using an SSE4.2-based crc32 improves the decoding rate of the PNG 140 corpus by 4% average, giving a total 40% performance increase when combined with adler32 SIMD code and inflate chunk copy code, see https://crbug.com/796178#c2 for details. Raw crc32 speed is 5x - 25x faster than the zlib default "BYFOUR" crc32, and gzip- and zlib-wrapped inflate performance improves by 69% and 50% for the snappy corpus ( https://crbug.com/796178#c3 #4 for details). Add crc32 SIMD implementation and update the call-site in crc32.c to use the new crc32 code, using run-time detection of the SSE4.2 and PCLMUL support required by the crc32 SIMD code. Update BUILD.gn to compile the crc32 SIMD code for Intel devices, also update names.h with the new symbol defined by the crc32 SIMD code path. Bug: 796178 Change-Id: I1bb94b47c9a4934eed01ba3d4feda51d67c4bf85 Reviewed-on: https://chromium-review.googlesource.com/833820 Commit-Queue: Noel Gordon <noel@chromium.org> Reviewed-by: Chris Blume <cblume@chromium.org> Cr-Commit-Position: refs/heads/master@{#526935} [modify] https://crrev.com/65e2abcb74b1c07fa14f46abaa1fb1717892eec3/third_party/zlib/BUILD.gn [modify] https://crrev.com/65e2abcb74b1c07fa14f46abaa1fb1717892eec3/third_party/zlib/crc32.c [add] https://crrev.com/65e2abcb74b1c07fa14f46abaa1fb1717892eec3/third_party/zlib/crc32_simd.c [add] https://crrev.com/65e2abcb74b1c07fa14f46abaa1fb1717892eec3/third_party/zlib/crc32_simd.h [modify] https://crrev.com/65e2abcb74b1c07fa14f46abaa1fb1717892eec3/third_party/zlib/names.h
,
Jan 30 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/7f1e10802a20c5dbe737a2261bfcb78cb3baedcf commit 7f1e10802a20c5dbe737a2261bfcb78cb3baedcf Author: Noel Gordon <noel@chromium.org> Date: Tue Jan 30 01:23:09 2018 Remove feature guards from names.h Having guards, per feature, is unneeded since BUILD.gn handles it for us. Provided the new symbols of any feature code are added to names.h, mangling their symbol names works, regardless of whether we have guards in names.h or not. Remove the guards. No change in behavior, no new tests. Bug: 796178 Change-Id: I9441544f58c5db82ba0e42bf755872c4a0273e46 Reviewed-on: https://chromium-review.googlesource.com/891198 Reviewed-by: Mike Klein <mtklein@chromium.org> Commit-Queue: Noel Gordon <noel@chromium.org> Cr-Commit-Position: refs/heads/master@{#532716} [modify] https://crrev.com/7f1e10802a20c5dbe737a2261bfcb78cb3baedcf/third_party/zlib/names.h
,
Jan 30 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/5aa815a73c390fc6a3b5cec6ec8e10079a55741a commit 5aa815a73c390fc6a3b5cec6ec8e10079a55741a Author: Noel Gordon <noel@chromium.org> Date: Tue Jan 30 02:03:17 2018 Move the SSE4.2 crc32 code from crc32() into crc32_z() zlib calls to crc32() redirect into the other public API function crc32_z(). Move the SSE4.2 crc32 code to crc32_z(), so that users of it also benefit from its SSE4.2 SIMD acceleration. No change in behavior, no new tests. Bug: 796178 Change-Id: I0d80a6ebf9c703dc3d67be1d1f3457ebfcb150bd Reviewed-on: https://chromium-review.googlesource.com/890155 Reviewed-by: Chris Blume <cblume@chromium.org> Commit-Queue: Noel Gordon <noel@chromium.org> Cr-Commit-Position: refs/heads/master@{#532728} [modify] https://crrev.com/5aa815a73c390fc6a3b5cec6ec8e10079a55741a/third_party/zlib/crc32.c
,
Feb 1 2018
,
Feb 5 2018
If you looked at the crc.cc code in #3, you'll note I provided an implementation of crc32 using the ARMv8 pmull SIMD instruction.
It's a re-implementation of the SSE code from #5, in NEON, and it's way faster than using NEON _crc32, once the buffer size is larger than 512 bytes (there is a bit more set-up needed to use pmull in this way).
Anyho, a mask operation is needed when folding the 128-bit temporary result down to 64-bits, prior to folding that down to a 32-bit crc32 with Barret reduction. In the NEON pmull code, it reads
/*
* Fold 128-bits to 64-bits.
*/
static uint32_t zalign(16) mask[4] = { ~0u, 0u, ~0u, 0u };
In the x86 code from #6, the mask is defined the opposite order:
x3 = _mm_set_epi32(0, ~0, 0, ~0);
which is a little confusing. A simple fix for the SSE is to define the mask in reverse order to better match the NEON pmull and SEE implementations. That's a simple one-line patch, but I just wanted why that change would make sense.
,
Feb 5 2018
ahem "... just wanted to say why that change would make sense."
,
Feb 5 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/5fdfc741f157923aa5f10415de243ae1c1f9ba72 commit 5fdfc741f157923aa5f10415de243ae1c1f9ba72 Author: Noel Gordon <noel@chromium.org> Date: Mon Feb 05 20:37:08 2018 crc32 x86: Reverse the order of the 128 to 64 bit fold mask Longer explaination added on bug, but to better align the x86 SSE crc32 code with NEON (pmull) implementation of same, and to avoid possible confusion, reverse the sense of the fold mask. No change in behavior, no new tests: reserving the sense does not change the value of the computed crc32. Bug: 796178 Change-Id: I35f772ab4df414d6f5f808d92bc7f896528e07ef Reviewed-on: https://chromium-review.googlesource.com/901223 Reviewed-by: Mike Klein <mtklein@chromium.org> Commit-Queue: Noel Gordon <noel@chromium.org> Cr-Commit-Position: refs/heads/master@{#534488} [modify] https://crrev.com/5fdfc741f157923aa5f10415de243ae1c1f9ba72/third_party/zlib/crc32_simd.c
,
Feb 10 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/14be760d0e6ebb2d30e9c73bbc0de1a07976f656 commit 14be760d0e6ebb2d30e9c73bbc0de1a07976f656 Author: Noel Gordon <noel@chromium.org> Date: Sat Feb 10 03:32:48 2018 Guard the include on the crc32 CRC32_SIMD_SSE42_PCLMUL feature Bug: 796178 Change-Id: I3850e783f79a70ea13d275875fa77ae35faa06ba Reviewed-on: https://chromium-review.googlesource.com/911228 Reviewed-by: Chris Blume <cblume@chromium.org> Commit-Queue: Noel Gordon <noel@chromium.org> Cr-Commit-Position: refs/heads/master@{#535949} [modify] https://crrev.com/14be760d0e6ebb2d30e9c73bbc0de1a07976f656/third_party/zlib/crc32.c
,
Feb 11 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/dba46e17d91a2b1c89bbe9272a203d45a0971c75 commit dba46e17d91a2b1c89bbe9272a203d45a0971c75 Author: Noel Gordon <noel@chromium.org> Date: Sun Feb 11 00:18:26 2018 Add comment about zlib convention for crc32 use Similar to crrev.com/534964, since crc32 can be freely called per the "zlib.h" docs, add comment about zlib crc32 usage conventions viz., calling crc32(0, NULL, 0) before use and how we use that as signal to cache CPU features. Tbr: cavalcantii@chromium.org Bug: 796178 Change-Id: I3c80149ea9c09baeb5a376d54ffeb4e36b12b5ec Reviewed-on: https://chromium-review.googlesource.com/912790 Reviewed-by: Noel Gordon <noel@chromium.org> Commit-Queue: Noel Gordon <noel@chromium.org> Cr-Commit-Position: refs/heads/master@{#535991} [modify] https://crrev.com/dba46e17d91a2b1c89bbe9272a203d45a0971c75/third_party/zlib/crc32.c
,
Feb 13 2018
https://chromium-review.googlesource.com/c/chromium/src/+/902262/10 zlib crc32 is used in so many places in chrome.
,
Feb 15 2018
All work done, closing. |
||||
►
Sign in to add a comment |
||||
Comment 1 by noel@chromium.org
, Dec 19 2017Owner: noel@chromium.org
Status: Started (was: Untriaged)
19.0 KB
19.0 KB View Download
4.6 KB
4.6 KB View Download