Optimize CRC32 for ARM |
||||||||||||||||
Issue descriptionCurrently Chromium's zlib has a SSE2 based implementation of CRC32. This affects performance for both image decompression (PNG) as also in general browsing while accessing websites that serve content using compression (i.e. Content-Encoding: gzip). The idea here is to: a) Implement an optimized CRC32 function using the dedicated instruction available in ARMv8. b) Implement a NEON-ized CRC32 function (this should address ARMv7 devices). For CRC32, I tested the first approach and it should be between 5x (A53: 116ms X 22ms for a 4Kx4Kx4 buffer) to 10x (A72: 91ms x 9ms) faster than the C implementation currently used by zlib.
,
Apr 8 2017
,
Apr 8 2017
,
Apr 12 2017
,
Apr 25 2017
Uploaded the optimization using the CRC32 specific instruction (step 'a') to: https://github.com/madler/zlib/pull/251/commits/5393223b7c459e8e05f511282f9fc45f8cda5dab Waiting for Mark Adler review.
,
Aug 12 2017
Updated patch: https://chromium-review.googlesource.com/c/612629
,
Aug 12 2017
Traces running in a Pixel (Qualcomm 821).
,
Sep 11 2017
,
Oct 26 2017
,
Oct 26 2017
,
Oct 26 2017
,
Nov 28 2017
An update: it may make sense to first add a pure _crc32w version and next one implementation using the new pmull instruction (as the later can offer an even greater performance boost when combined with crc32w).
,
Nov 30 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/35988c821c051a57e30c76f9fcd87b7b677bd9bd commit 35988c821c051a57e30c76f9fcd87b7b677bd9bd Author: Adenilson Cavalcanti <adenilson.cavalcanti@arm.com> Date: Thu Nov 30 01:34:16 2017 Using ARMv8 CRC32 specific instruction CRC32 affects performance for both image decompression (PNG) as also in general browsing while accessing websites that serve content using compression (i.e. Content-Encoding: gzip). This patch implements an optimized CRC32 function using the dedicated instruction available in ARMv8. This instruction is available in new Android devices featuring an ARMv8 SoC, like Nexus 5x and Google Pixel. It should be between 6x (A53: 116ms X 22ms for a 4Kx4Kx4 buffer) to 10x faster (A72: 91ms x 9ms) than the C implementation currently used by zlib. PNG decoding performance gains should be around 5-9%. Finally it also introduces code to perform the ARM CPU features detection using getauxval()@Linux/CrOS or android_getCpuFeatures(). We pre-built and link the CRC32 instruction dependent code but will decide if to use it at run time. If the feature is not supported, we fallback to the C implementation. This approach allows to use the instruction in both 32bits and 64bits builds and works fine either in ARMv7 or ARMv8 processor. I tested the generated Chromium apk in both a ARMv7 (Nexus 4 and 6) and ARMv8 (Nexus 5x and Google Pixel). Change-Id: I069408ebc06c49a3c2be4ba3253319e025ee09d7 Bug: 709716 Reviewed-on: https://chromium-review.googlesource.com/612629 Reviewed-by: Chris Blume <cblume@chromium.org> Commit-Queue: Adenilson Cavalcanti <cavalcantii@chromium.org> Cr-Commit-Position: refs/heads/master@{#520377} [modify] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/BUILD.gn [modify] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/README.chromium [add] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/contrib/optimizations/arm/arm_features.c [add] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/contrib/optimizations/arm/arm_features.h [add] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/contrib/optimizations/arm/armv8_crc32.c [add] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/contrib/optimizations/arm/armv8_crc32.h [modify] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/contrib/optimizations/inflate.c [modify] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/crc32.c [modify] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/deflate.c [modify] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/inflate.c
,
Nov 30 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e7d9a4649bde6f047105d29f0026dd8c3d54143a commit e7d9a4649bde6f047105d29f0026dd8c3d54143a Author: Boris Sazonov <bsazonov@chromium.org> Date: Thu Nov 30 10:01:30 2017 Revert "Using ARMv8 CRC32 specific instruction" This reverts commit 35988c821c051a57e30c76f9fcd87b7b677bd9bd. Reason for revert: broke build ('cpu-features.h' not found) https://uberchromegw.corp.google.com/i/internal.client.clank/builders/x64-builder/builds/13697 Original change's description: > Using ARMv8 CRC32 specific instruction > > CRC32 affects performance for both image decompression (PNG) > as also in general browsing while accessing websites that serve > content using compression (i.e. Content-Encoding: gzip). > > This patch implements an optimized CRC32 function using the > dedicated instruction available in ARMv8. This instruction is available > in new Android devices featuring an ARMv8 SoC, like Nexus 5x and > Google Pixel. > > It should be between 6x (A53: 116ms X 22ms for a 4Kx4Kx4 buffer) to > 10x faster (A72: 91ms x 9ms) than the C implementation currently used > by zlib. > > PNG decoding performance gains should be around 5-9%. > > Finally it also introduces code to perform the ARM CPU features detection > using getauxval()@Linux/CrOS or android_getCpuFeatures(). We pre-built > and link the CRC32 instruction dependent code but will decide if to > use it at run time. > > If the feature is not supported, we fallback to the C implementation. > > This approach allows to use the instruction in both 32bits and 64bits > builds and works fine either in ARMv7 or ARMv8 processor. I tested the > generated Chromium apk in both a ARMv7 (Nexus 4 and 6) and ARMv8 (Nexus 5x and > Google Pixel). > > Change-Id: I069408ebc06c49a3c2be4ba3253319e025ee09d7 > Bug: 709716 > Reviewed-on: https://chromium-review.googlesource.com/612629 > Reviewed-by: Chris Blume <cblume@chromium.org> > Commit-Queue: Adenilson Cavalcanti <cavalcantii@chromium.org> > Cr-Commit-Position: refs/heads/master@{#520377} TBR=agl@chromium.org,noel@chromium.org,cavalcantii@chromium.org,cblume@chromium.org,mtklein@chromium.org,adenilson.cavalcanti@arm.com Change-Id: Ief2c32df5c8a37635f937cd6a671f5574f5a53a3 No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: 709716 Reviewed-on: https://chromium-review.googlesource.com/799930 Reviewed-by: Chris Blume <cblume@chromium.org> Reviewed-by: Boris Sazonov <bsazonov@chromium.org> Commit-Queue: Boris Sazonov <bsazonov@chromium.org> Cr-Commit-Position: refs/heads/master@{#520497} [modify] https://crrev.com/e7d9a4649bde6f047105d29f0026dd8c3d54143a/third_party/zlib/BUILD.gn [modify] https://crrev.com/e7d9a4649bde6f047105d29f0026dd8c3d54143a/third_party/zlib/README.chromium [delete] https://crrev.com/616c8345fbe447467905b69cee8e4cbde8d29ffa/third_party/zlib/contrib/optimizations/arm/arm_features.c [delete] https://crrev.com/616c8345fbe447467905b69cee8e4cbde8d29ffa/third_party/zlib/contrib/optimizations/arm/arm_features.h [delete] https://crrev.com/616c8345fbe447467905b69cee8e4cbde8d29ffa/third_party/zlib/contrib/optimizations/arm/armv8_crc32.c [delete] https://crrev.com/616c8345fbe447467905b69cee8e4cbde8d29ffa/third_party/zlib/contrib/optimizations/arm/armv8_crc32.h [modify] https://crrev.com/e7d9a4649bde6f047105d29f0026dd8c3d54143a/third_party/zlib/contrib/optimizations/inflate.c [modify] https://crrev.com/e7d9a4649bde6f047105d29f0026dd8c3d54143a/third_party/zlib/crc32.c [modify] https://crrev.com/e7d9a4649bde6f047105d29f0026dd8c3d54143a/third_party/zlib/deflate.c [modify] https://crrev.com/e7d9a4649bde6f047105d29f0026dd8c3d54143a/third_party/zlib/inflate.c
,
Dec 11 2017
Results for Doodles and Kodak png corpus (1.0 means the same runtime, < 1.0 means improvement).
,
Dec 12 2017
Bench for crc32 for ARM (attached). Instructions for building it with the Android NDK are given in that file. Bench results suggest the implementation given in review is up to 10x faster than zlib crc32. Also looked at using __crc32d as suggest by mtklein@ in review, and it seems to double the perf compared to the implementation given in review (results attached).
,
Dec 12 2017
Neat! This matches my expectations. Inside #if defined(ALIGN_64), why drop to __crc32w() in the while (len >= 16) loop? Aren't we still 8-byte aligned and can do each of those loops with 2 __crc32d() instead of 4 __crc32w()? (psst.... Min Time == Max Rate :P)
,
Dec 12 2017
Yeah, no real reason, and I suspect we could tune the tail part a bit more. Was mainly looking to see if __crc32d in the core part of the crc32 implementation would give us anything more. It sure seems to.
,
Dec 12 2017
👍
,
Dec 12 2017
> Inside #if defined(ALIGN_64), why drop to __crc32w() in the while (len >= 16) loop? Aren't we still 8-byte aligned ... Yeap, we're still 8-byte aligned so can use __crc32d(). Did a quick look, used one __crc32d() in a while (len >= 8). Results still seem good for small buffers. > (psst.... Min Time == Max Rate :P) Ah yes, fixed :)
,
Dec 12 2017
Pretty awesome results! I will make sure to incorporate this changes in the initial patch. :-)
,
Dec 13 2017
Since the data previously posted covered the aspect of running in 32bits, I decided to validate the approach for 64bits while factoring the big/LITTLE cores (see attached the data) The good news are that the data points to the same direction (i.e. gains in using crc32d). In big cores (A72), the gains start to show up for vectors bigger than 8KB, while in little cores with smaller vectors (i.e. 1.9x faster by using crc32d for 8KB vectors). It was a bit slower for vectors of 1MB size though (again, on little cores). The tested device was a Rockchip RK3399 SoC, featuring a dual-core A72 and quad-core A53.
,
Dec 13 2017
For reference, this is about 32x faster (big core) to 45x faster (little core) than the zlib C function for 8KB vectors.
,
Dec 13 2017
32z faster on big, 45x faster on LITTLE...that's insane. I completely believe it (specialized hardware). But it's still amazing! :D Great work, everyone!
,
Jan 10 2018
Ah yeah, sorry for this, but my bench was against the single table crc32 from zlib. I found / mentioned on issue 796178 that our zlib uses the "BYFOUR" 8-table crc32 code by default, which is faster than the single table crc32c. I update the bench tool to use the BYFOUR 8-table (attached). The intel SIMD crc32 is only 5x - 25x faster than the "BYFOUR" 8-table method, still good though.
,
Jan 10 2018
I used that bench on my Pixel C tablet, compiled with arm32 and arm64 builds. The SIMD crc32 code is the same speed as before, but the BYFOUR 8-table code is faster than the single table crc32. arm64 is up 12x faster than zlib's default BYFOUR crc32 for large buffers, while for arm32, it's up 4x faster. Results attached.
,
Jan 25 2018
Updated image benchmarker numbers.
,
Feb 7 2018
,
Feb 7 2018
,
Feb 14 2018
,
Feb 14 2018
,
Feb 14 2018
,
Feb 15 2018
The patch initially posted for this issue, contained a significant PNG 140 corpus regression. Measured on Pixel C Tablet, aarch32, the yellow-line, labelled "chunk copy + adler32 + crc32d" https://docs.google.com/spreadsheets/d/1cqpMjk_fMNrDHNO-px7t4MIAZWtDYf6g0pYPKKyV3j4/edit#gid=595603569 Ideally it should above the red-line "chunk copy + adler32", but instead was below it most everywhere. We worked through the causes of this regression, and eliminated them. The results for the current patch is the green-line. Much better, and gives a average improv of 1% for PNG decoding. Combined with our other work (chunk copy, adler32 simd) the overall avg. improv for PNG decoding on ARM is 31% for the PNG 140 corpus. So if using this crc32 NEON only adds 1% for PNG decoding, why bother? Well, you'd be missing another important use-case for a web browser, content-encoding: gzip (meaning a GZIP wrapped DEFLATE stream) which is very common, esp. on mobile, and less so, content-encoding: deflate (meaning a zlib wrapped DEFLATE stream). For the snappy/testdata/* corpus using zlib_bench on aarch32, and looking at our combined work on chunk copy, adler32 simd, crc32 simd, we have: pixel-c-chunk-alder-zlib-inflate snappy corpus https://docs.google.com/spreadsheets/d/1dFkY8oxnAQscA8vVMER0B4q580I9GpoVxfqG7RykLWw/edit#gid=595603569 - uses our chunk-copy + adler32 simd optimizations 1.49x improve for "content-encoding: deflate" decode perf in the median pixel-c-chunk-crc32-gzip-inflate snappy corpus https://docs.google.com/spreadsheets/d/1l2wSfsLdAsDDVEZTXhq7hlF9074Us6GH3R7Pru2gd30/edit?pli=1#gid=595603569 - uses our chunk-copy + crc32 simd optimizations 1.56x improve for "content-encoding: gzip" decode perf in the median These are very good wins.
,
Feb 15 2018
The crc.cc from #25 has been updated with a NEON PMULL crc32 implementation. This applies on AARCH64, and needs -march=armv8-a+crc+crypto for the compilers we use. See the notes in crc.cc (attached). This implementation is the same speed as NEON CRC32 for very small buffers, but as the buffer size grows, its speed grows faster, and is generally much faster than NEON CRC32 and moreover, it can sustain top speed for longer in comparison on my Pixel C. For buffer sizes > ~8Meg, speed tails off and PMULL and CRC32 have about the same perf. The break point where it is better to use PMULL rather than CRC32 on ARM64 seems to be around 512 bytes.
,
Feb 16 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/28c9623083688b3a354c33bf77746f4c51f58826 commit 28c9623083688b3a354c33bf77746f4c51f58826 Author: Adenilson Cavalcanti <adenilson.cavalcanti@arm.com> Date: Fri Feb 16 03:41:14 2018 Compute crc32 using ARMv8 specific instruction CRC32 affects performance for both image decoding (PNG) as also in general browsing while accessing websites that serve content using compression (i.e. Content-Encoding: gzip). This patch implements an optimized CRC32 function using the dedicated instruction available in ARMv8a. We only support ARM Little-Endian (LE). This instruction is available in new Android devices featuring an ARMv8 SoC, like Nexus 5x and Google Pixel. It should be between 3x (A72) to 7x faster (A53) than the C implementation currently used by zlib for 8KB vectors. This is performance critical code and can be called with both large (8KB) or small vectors, therefore we must avoid extraneous function calls or branching (otherwise the performance benefits are negated). So the use of 'public' variables to read the CPU features status flags (i.e. arm_cpu_enable_crc32 | pmull). Finally it also introduces code to perform run-time ARM CPU feature detection on the supported platforms: Android and Linux/CrOS. We build and link the CRC32 instruction dependent code, but will decide to use it at run-time if the ARM CPU supports the CRC32 instruction. Otherwise, we fallback to using zlib's default C implementation. This approach allows to use the instruction in both 32bits and 64bits and works fine either in ARMv7 or ARMv8 processor. I tested the generated Chrome apk in both a Nexus 6 (ARMv7) and a Google Pixel (ARMv8). The crc32 function benefited from input from Yang Zang and Mike Klein, while the arm_features benefited from input from Noel Gordon. Bug: 709716 Change-Id: I315c1216f8b3a8d88607630a28737c41f52a2f5d Reviewed-on: https://chromium-review.googlesource.com/801108 Reviewed-by: Chris Blume <cblume@chromium.org> Reviewed-by: Noel Gordon <noel@chromium.org> Commit-Queue: Noel Gordon <noel@chromium.org> Cr-Commit-Position: refs/heads/master@{#537179} [modify] https://crrev.com/28c9623083688b3a354c33bf77746f4c51f58826/third_party/zlib/BUILD.gn [add] https://crrev.com/28c9623083688b3a354c33bf77746f4c51f58826/third_party/zlib/arm_features.c [add] https://crrev.com/28c9623083688b3a354c33bf77746f4c51f58826/third_party/zlib/arm_features.h [modify] https://crrev.com/28c9623083688b3a354c33bf77746f4c51f58826/third_party/zlib/crc32.c [modify] https://crrev.com/28c9623083688b3a354c33bf77746f4c51f58826/third_party/zlib/crc32_simd.c [modify] https://crrev.com/28c9623083688b3a354c33bf77746f4c51f58826/third_party/zlib/crc32_simd.h [modify] https://crrev.com/28c9623083688b3a354c33bf77746f4c51f58826/third_party/zlib/names.h
,
Aug 13
,
Aug 13
,
Aug 13
Opened new issue to address the PMULL optimization in: https://bugs.chromium.org/p/chromium/issues/detail?id=873725
,
Aug 14
SGTM |
||||||||||||||||
►
Sign in to add a comment |
||||||||||||||||
Comment 1 by cavalcantii@chromium.org
, Apr 8 2017