New issue
Advanced search Search tips

Issue 796178 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2018
Cc:
EstimatedDays: ----
NextAction: ----
OS: Linux , Windows , Mac
Pri: 3
Type: Bug

Blocked on:
issue 798943



Sign in to add a comment

zlib crc32 inflate SSE support

Project Member Reported by noel@chromium.org, Dec 19 2017

Issue description

zlib crc32 inflate SSE support

Intel contributed a patch that improved zlib deflate (compression) speed with a stateful SSE4.2 crc32 computation, which is applied to the compressed data as it is copied to the output buffer.  The implementation depends on the zlib deflate state.

third/party/zlib is also used for PNG decoding, and gzip decoding, and is the common and majority use-case in the browser.  This is the zlib inflate (decompression) path, and no accelerated crc32 implementation is available to support it.

We should add an SSE-based crc32 implementation to improve PNG and gzip decoding speeds. The following paper examines implementing arbitrary polynomial CRC on Intel devices:

  * "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
  *  V. Gopal, E. Ozturk, et al., 2009, http://intel.ly/2ySEwL0

including the CRC32 needed by zlib/gzip, but no code implementation, so again we'd need to do the work to write and test some SSE code, and add it to zlib.
 

Comment 1 by noel@chromium.org, Dec 19 2017

Labels: OS-Linux OS-Mac OS-Windows
Owner: noel@chromium.org
Status: Started (was: Untriaged)
A bench mark for the new SSE code (requires sse4.2+pclmul, which zlib can run-time detect due to the earlier Intel patch), is attached.

Bench marking on my macbook pro (results attached) shows small buffers at 5 to 10x faster than zlib c-code, quickly rising to 60x faster for 32k buffers.

crc.cc
19.0 KB View Download
macbook-pro-crc32.txt
4.6 KB View Download

Comment 2 by noel@chromium.org, Dec 19 2017

Code can be integrated into zlib similar to the alder simd patch ( issue 762564 ), and some measurements on PNG decode performance with the image decode bench:

MAC BOOK PRO 13" 2016 (TOUCH BAR) LAPTOP
 * Dual-core Intel Core i7-6567U, 3.3GHz Skylake-U (14nm)
png macbook-pro-chunk-alder-sse42-pclmul-crc
https://docs.google.com/spreadsheets/d/1v3ECbh190IlQBNG7cNdHFUqnCYJJ4BED5mlrSnwWL1M/edit#gid=595603569

The results show PNG decoding rate MB/s with the change, divided by the decoding rate MB/s with no change (decoding rate ratio).  The x-axis is the corpus image (sorted in ascending order). The series of changes were:

blue-line: inflate chunk copy change ( issue 772870 )
  avg improv 16%
red-line: inflate chunk copy + adler simd change ( issue 762564 )
  avg improv 36%
yellow-line: inflate chunk copy + adler simd + crc32 in SSE4.2
  avg improv 40%

Seems like a useful win for PNG decoding.

Comment 3 by noel@chromium.org, Jan 4 2018

re #1, noted that zlib uses the BYFOUR crc32 (8 table) implementation, which is a bit faster than the single table crc32 code using in crc.cc in #1.

Updated the crc.cc bench (attached) to use the BYFOUR 8 table method therefore, and benched again.  Results attached: 5x to 25x faster than the zlib BYFOUR crc32, depending on the buffer size.

crc.cc
50.8 KB View Download
macbook-pro-crc32-byfour.txt
5.0 KB Download

Comment 4 by noel@chromium.org, Jan 4 2018

Cc: nigeltao@chromium.org
I needed tool to check if these crc32 changes improve perf or not, and indeed the same for the submitted zlib chunk copy, and adler32 SIMD changes [1].

Used the snappy repository testdata files (11 files in all) as a test corpus.

MAC BOOK PRO 13" 2016 (TOUCH BAR) LAPTOP
 * Dual-core Intel Core i7-6567U, 3.3GHz Skylake-U (14nm)

macbook-pro-chunk-alder-zlib-inflate-snappy-corpus
https://docs.google.com/spreadsheets/d/1RoS6sgAuVjQ0LBIMJIz5D5WwWkRVCb2dQvG3ztTqb5Q/edit#gid=595603569

zlib inflate: chunk copy (red line) + adler32 simd (yellow line), improves zlib-wrapped data decode (aka inflate) perf of 50%.

macbook-pro-chunk-crc32-gzip-inflate-snappy-corpus
https://docs.google.com/spreadsheets/d/15b0-iT0sXB5_d8yN--y48dRhySkeFe3frjWIv7_Vivw/edit#gid=595603569

gzip inflate: chunk copy (red line) + crc32 simd (yellow line), improves gzip-wrapped data decode (aka inflate) perf by 69%.

[1] Filed  issue 798943  about adding a tool to measure GZIP/ZLIB stream encode/decode perf on any chrome target.

Comment 5 by noel@chromium.org, Jan 4 2018

@nigel: note the green-line on those spread sheets in #4, that's what I measured for your current code changes on  issue 760853  :), so it might be good to dust that work off and complete it.


Project Member

Comment 6 by bugdroid1@chromium.org, Jan 4 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/65e2abcb74b1c07fa14f46abaa1fb1717892eec3

commit 65e2abcb74b1c07fa14f46abaa1fb1717892eec3
Author: Noel Gordon <noel@chromium.org>
Date: Thu Jan 04 07:30:36 2018

Improve zlib inflate speed by using SSE4.2 crc32

Using an SSE4.2-based crc32 improves the decoding rate of the PNG
140 corpus by 4% average, giving a total 40% performance increase
when combined with adler32 SIMD code and inflate chunk copy code,
see  https://crbug.com/796178#c2  for details.

Raw crc32 speed is 5x - 25x faster than the zlib default "BYFOUR"
crc32, and gzip- and zlib-wrapped inflate performance improves by
69% and 50% for the snappy corpus ( https://crbug.com/796178#c3  #4
for details).

Add crc32 SIMD implementation and update the call-site in crc32.c
to use the new crc32 code, using run-time detection of the SSE4.2
and PCLMUL support required by the crc32 SIMD code.

Update BUILD.gn to compile the crc32 SIMD code for Intel devices,
also update names.h with the new symbol defined by the crc32 SIMD
code path.

Bug:  796178 
Change-Id: I1bb94b47c9a4934eed01ba3d4feda51d67c4bf85
Reviewed-on: https://chromium-review.googlesource.com/833820
Commit-Queue: Noel Gordon <noel@chromium.org>
Reviewed-by: Chris Blume <cblume@chromium.org>
Cr-Commit-Position: refs/heads/master@{#526935}
[modify] https://crrev.com/65e2abcb74b1c07fa14f46abaa1fb1717892eec3/third_party/zlib/BUILD.gn
[modify] https://crrev.com/65e2abcb74b1c07fa14f46abaa1fb1717892eec3/third_party/zlib/crc32.c
[add] https://crrev.com/65e2abcb74b1c07fa14f46abaa1fb1717892eec3/third_party/zlib/crc32_simd.c
[add] https://crrev.com/65e2abcb74b1c07fa14f46abaa1fb1717892eec3/third_party/zlib/crc32_simd.h
[modify] https://crrev.com/65e2abcb74b1c07fa14f46abaa1fb1717892eec3/third_party/zlib/names.h

Project Member

Comment 7 by bugdroid1@chromium.org, Jan 30 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7f1e10802a20c5dbe737a2261bfcb78cb3baedcf

commit 7f1e10802a20c5dbe737a2261bfcb78cb3baedcf
Author: Noel Gordon <noel@chromium.org>
Date: Tue Jan 30 01:23:09 2018

Remove feature guards from names.h

Having guards, per feature, is unneeded since BUILD.gn handles it
for us. Provided the new symbols of any feature code are added to
names.h, mangling their symbol names works, regardless of whether
we have guards in names.h or not. Remove the guards.

No change in behavior, no new tests.

Bug:  796178 
Change-Id: I9441544f58c5db82ba0e42bf755872c4a0273e46
Reviewed-on: https://chromium-review.googlesource.com/891198
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Noel Gordon <noel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#532716}
[modify] https://crrev.com/7f1e10802a20c5dbe737a2261bfcb78cb3baedcf/third_party/zlib/names.h

Project Member

Comment 8 by bugdroid1@chromium.org, Jan 30 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5aa815a73c390fc6a3b5cec6ec8e10079a55741a

commit 5aa815a73c390fc6a3b5cec6ec8e10079a55741a
Author: Noel Gordon <noel@chromium.org>
Date: Tue Jan 30 02:03:17 2018

Move the SSE4.2 crc32 code from crc32() into crc32_z()

zlib calls to crc32() redirect into the other public API function
crc32_z(). Move the SSE4.2 crc32 code to crc32_z(), so that users
of it also benefit from its SSE4.2 SIMD acceleration.

No change in behavior, no new tests.

Bug:  796178 
Change-Id: I0d80a6ebf9c703dc3d67be1d1f3457ebfcb150bd
Reviewed-on: https://chromium-review.googlesource.com/890155
Reviewed-by: Chris Blume <cblume@chromium.org>
Commit-Queue: Noel Gordon <noel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#532728}
[modify] https://crrev.com/5aa815a73c390fc6a3b5cec6ec8e10079a55741a/third_party/zlib/crc32.c

Comment 9 by noel@chromium.org, Feb 1 2018

Blockedon: 798943

Comment 10 by noel@chromium.org, Feb 5 2018

If you looked at the crc.cc code in #3, you'll note I provided an implementation of crc32 using the ARMv8 pmull SIMD instruction.

It's a re-implementation of the SSE code from #5, in NEON, and it's way faster than using NEON _crc32, once the buffer size is larger than 512 bytes (there is a bit more set-up needed to use pmull in this way).

Anyho, a mask operation is needed when folding the 128-bit temporary result down to 64-bits, prior to folding that down to a 32-bit crc32 with Barret reduction.  In the NEON pmull code, it reads

    /*
     * Fold 128-bits to 64-bits.
     */
    static uint32_t zalign(16) mask[4] = { ~0u, 0u, ~0u, 0u };

In the x86 code from #6, the mask is defined the opposite order:

    x3 = _mm_set_epi32(0, ~0, 0, ~0);

which is a little confusing.  A simple fix for the SSE is to define the mask in reverse order to better match the NEON pmull and SEE implementations.  That's a simple one-line patch, but I just wanted why that change would make sense.

Comment 11 by noel@chromium.org, Feb 5 2018

ahem "... just wanted to say why that change would make sense."
Project Member

Comment 12 by bugdroid1@chromium.org, Feb 5 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5fdfc741f157923aa5f10415de243ae1c1f9ba72

commit 5fdfc741f157923aa5f10415de243ae1c1f9ba72
Author: Noel Gordon <noel@chromium.org>
Date: Mon Feb 05 20:37:08 2018

crc32 x86: Reverse the order of the 128 to 64 bit fold mask

Longer explaination added on bug, but to better align the x86 SSE
crc32 code with NEON (pmull) implementation of same, and to avoid
possible confusion, reverse the sense of the fold mask.

No change in behavior, no new tests: reserving the sense does not
change the value of the computed crc32.

Bug:  796178 
Change-Id: I35f772ab4df414d6f5f808d92bc7f896528e07ef
Reviewed-on: https://chromium-review.googlesource.com/901223
Reviewed-by: Mike Klein <mtklein@chromium.org>
Commit-Queue: Noel Gordon <noel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#534488}
[modify] https://crrev.com/5fdfc741f157923aa5f10415de243ae1c1f9ba72/third_party/zlib/crc32_simd.c

Project Member

Comment 13 by bugdroid1@chromium.org, Feb 10 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/14be760d0e6ebb2d30e9c73bbc0de1a07976f656

commit 14be760d0e6ebb2d30e9c73bbc0de1a07976f656
Author: Noel Gordon <noel@chromium.org>
Date: Sat Feb 10 03:32:48 2018

Guard the include on the crc32 CRC32_SIMD_SSE42_PCLMUL feature

Bug:  796178 
Change-Id: I3850e783f79a70ea13d275875fa77ae35faa06ba
Reviewed-on: https://chromium-review.googlesource.com/911228
Reviewed-by: Chris Blume <cblume@chromium.org>
Commit-Queue: Noel Gordon <noel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#535949}
[modify] https://crrev.com/14be760d0e6ebb2d30e9c73bbc0de1a07976f656/third_party/zlib/crc32.c

Project Member

Comment 14 by bugdroid1@chromium.org, Feb 11 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/dba46e17d91a2b1c89bbe9272a203d45a0971c75

commit dba46e17d91a2b1c89bbe9272a203d45a0971c75
Author: Noel Gordon <noel@chromium.org>
Date: Sun Feb 11 00:18:26 2018

Add comment about zlib convention for crc32 use

Similar to crrev.com/534964, since crc32 can be freely called per
the "zlib.h" docs, add comment about zlib crc32 usage conventions
viz., calling crc32(0, NULL, 0) before use and how we use that as
signal to cache CPU features.

Tbr: cavalcantii@chromium.org
Bug:  796178 
Change-Id: I3c80149ea9c09baeb5a376d54ffeb4e36b12b5ec
Reviewed-on: https://chromium-review.googlesource.com/912790
Reviewed-by: Noel Gordon <noel@chromium.org>
Commit-Queue: Noel Gordon <noel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#535991}
[modify] https://crrev.com/dba46e17d91a2b1c89bbe9272a203d45a0971c75/third_party/zlib/crc32.c

Comment 15 by noel@chromium.org, Feb 13 2018

https://chromium-review.googlesource.com/c/chromium/src/+/902262/10 zlib crc32 is used in so many places in chrome.

Comment 16 by noel@chromium.org, Feb 15 2018

Status: Fixed (was: Started)
All work done, closing.

Sign in to add a comment