New issue
Advanced search Search tips

Issue 709716 link

Starred by 8 users

Issue metadata

Status: Verified
Owner:
Closed: Aug 13
Cc:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 3
Type: Feature


Sign in to add a comment

Optimize CRC32 for ARM

Project Member Reported by cavalcantii@chromium.org, Apr 8 2017

Issue description

Currently Chromium's zlib has a SSE2 based implementation of CRC32.

This affects performance for both image decompression (PNG) as also in general browsing while accessing websites that serve content using compression (i.e. Content-Encoding: gzip).

The idea here is to:

a) Implement an optimized CRC32 function using the dedicated instruction available in ARMv8.

b) Implement a NEON-ized CRC32 function (this should address ARMv7 devices).

For CRC32, I tested the first approach and it should be between 5x (A53: 116ms X 22ms for a 4Kx4Kx4 buffer) to 10x (A72: 91ms x 9ms) faster than the C implementation currently used by zlib.

 
Blocking: 709707
Owner: cavalcantii@chromium.org
Status: Started (was: Untriaged)
Cc: msarett@chromium.org fmalita@chromium.org
Cc: simon.ho...@arm.com amaury.l...@arm.com
Uploaded the optimization using the CRC32 specific instruction (step 'a') to:
https://github.com/madler/zlib/pull/251/commits/5393223b7c459e8e05f511282f9fc45f8cda5dab

Waiting for Mark Adler review.
Cc: -simon.ho...@arm.com -msarett@chromium.org
Updated patch:
https://chromium-review.googlesource.com/c/612629
Traces running in a Pixel (Qualcomm 821).
crc32_traces.png
253 KB View Download
Blocking: 764085
Cc: -amaury.l...@arm.com scroggo@chromium.org cblume@chromium.org mtklein@chromium.org
Labels: OS-Android
Labels: -Type-Bug Type-Feature
An update: it may make sense to first add a pure _crc32w version and next one implementation using the new pmull instruction (as the later can offer an even greater performance boost when combined with crc32w).
Project Member

Comment 13 by bugdroid1@chromium.org, Nov 30 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/35988c821c051a57e30c76f9fcd87b7b677bd9bd

commit 35988c821c051a57e30c76f9fcd87b7b677bd9bd
Author: Adenilson Cavalcanti <adenilson.cavalcanti@arm.com>
Date: Thu Nov 30 01:34:16 2017

Using ARMv8 CRC32 specific instruction

CRC32 affects performance for both image decompression (PNG)
as also in general browsing while accessing websites that serve
content using compression (i.e. Content-Encoding: gzip).

This patch implements an optimized CRC32 function using the
dedicated instruction available in ARMv8. This instruction is available
in new Android devices featuring an ARMv8 SoC, like Nexus 5x and
Google Pixel.

It should be between 6x (A53: 116ms X 22ms for a 4Kx4Kx4 buffer) to
10x faster (A72: 91ms x 9ms) than the C implementation currently used
by zlib.

PNG decoding performance gains should be around 5-9%.

Finally it also introduces code to perform the ARM CPU features detection
using getauxval()@Linux/CrOS or android_getCpuFeatures(). We pre-built
and link the CRC32 instruction dependent code but will decide if to
use it at run time.

If the feature is not supported, we fallback to the C implementation.

This approach allows to use the instruction in both 32bits and 64bits
builds and works fine either in ARMv7 or ARMv8 processor. I tested the
generated Chromium apk in both a ARMv7 (Nexus 4 and 6) and ARMv8 (Nexus 5x and
Google Pixel).

Change-Id: I069408ebc06c49a3c2be4ba3253319e025ee09d7
Bug:  709716 
Reviewed-on: https://chromium-review.googlesource.com/612629
Reviewed-by: Chris Blume <cblume@chromium.org>
Commit-Queue: Adenilson Cavalcanti <cavalcantii@chromium.org>
Cr-Commit-Position: refs/heads/master@{#520377}
[modify] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/BUILD.gn
[modify] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/README.chromium
[add] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/contrib/optimizations/arm/arm_features.c
[add] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/contrib/optimizations/arm/arm_features.h
[add] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/contrib/optimizations/arm/armv8_crc32.c
[add] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/contrib/optimizations/arm/armv8_crc32.h
[modify] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/contrib/optimizations/inflate.c
[modify] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/crc32.c
[modify] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/deflate.c
[modify] https://crrev.com/35988c821c051a57e30c76f9fcd87b7b677bd9bd/third_party/zlib/inflate.c

Project Member

Comment 14 by bugdroid1@chromium.org, Nov 30 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e7d9a4649bde6f047105d29f0026dd8c3d54143a

commit e7d9a4649bde6f047105d29f0026dd8c3d54143a
Author: Boris Sazonov <bsazonov@chromium.org>
Date: Thu Nov 30 10:01:30 2017

Revert "Using ARMv8 CRC32 specific instruction"

This reverts commit 35988c821c051a57e30c76f9fcd87b7b677bd9bd.

Reason for revert: broke build ('cpu-features.h' not found)
https://uberchromegw.corp.google.com/i/internal.client.clank/builders/x64-builder/builds/13697

Original change's description:
> Using ARMv8 CRC32 specific instruction
> 
> CRC32 affects performance for both image decompression (PNG)
> as also in general browsing while accessing websites that serve
> content using compression (i.e. Content-Encoding: gzip).
> 
> This patch implements an optimized CRC32 function using the
> dedicated instruction available in ARMv8. This instruction is available
> in new Android devices featuring an ARMv8 SoC, like Nexus 5x and
> Google Pixel.
> 
> It should be between 6x (A53: 116ms X 22ms for a 4Kx4Kx4 buffer) to
> 10x faster (A72: 91ms x 9ms) than the C implementation currently used
> by zlib.
> 
> PNG decoding performance gains should be around 5-9%.
> 
> Finally it also introduces code to perform the ARM CPU features detection
> using getauxval()@Linux/CrOS or android_getCpuFeatures(). We pre-built
> and link the CRC32 instruction dependent code but will decide if to
> use it at run time.
> 
> If the feature is not supported, we fallback to the C implementation.
> 
> This approach allows to use the instruction in both 32bits and 64bits
> builds and works fine either in ARMv7 or ARMv8 processor. I tested the
> generated Chromium apk in both a ARMv7 (Nexus 4 and 6) and ARMv8 (Nexus 5x and
> Google Pixel).
> 
> Change-Id: I069408ebc06c49a3c2be4ba3253319e025ee09d7
> Bug:  709716 
> Reviewed-on: https://chromium-review.googlesource.com/612629
> Reviewed-by: Chris Blume <cblume@chromium.org>
> Commit-Queue: Adenilson Cavalcanti <cavalcantii@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#520377}

TBR=agl@chromium.org,noel@chromium.org,cavalcantii@chromium.org,cblume@chromium.org,mtklein@chromium.org,adenilson.cavalcanti@arm.com

Change-Id: Ief2c32df5c8a37635f937cd6a671f5574f5a53a3
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug:  709716 
Reviewed-on: https://chromium-review.googlesource.com/799930
Reviewed-by: Chris Blume <cblume@chromium.org>
Reviewed-by: Boris Sazonov <bsazonov@chromium.org>
Commit-Queue: Boris Sazonov <bsazonov@chromium.org>
Cr-Commit-Position: refs/heads/master@{#520497}
[modify] https://crrev.com/e7d9a4649bde6f047105d29f0026dd8c3d54143a/third_party/zlib/BUILD.gn
[modify] https://crrev.com/e7d9a4649bde6f047105d29f0026dd8c3d54143a/third_party/zlib/README.chromium
[delete] https://crrev.com/616c8345fbe447467905b69cee8e4cbde8d29ffa/third_party/zlib/contrib/optimizations/arm/arm_features.c
[delete] https://crrev.com/616c8345fbe447467905b69cee8e4cbde8d29ffa/third_party/zlib/contrib/optimizations/arm/arm_features.h
[delete] https://crrev.com/616c8345fbe447467905b69cee8e4cbde8d29ffa/third_party/zlib/contrib/optimizations/arm/armv8_crc32.c
[delete] https://crrev.com/616c8345fbe447467905b69cee8e4cbde8d29ffa/third_party/zlib/contrib/optimizations/arm/armv8_crc32.h
[modify] https://crrev.com/e7d9a4649bde6f047105d29f0026dd8c3d54143a/third_party/zlib/contrib/optimizations/inflate.c
[modify] https://crrev.com/e7d9a4649bde6f047105d29f0026dd8c3d54143a/third_party/zlib/crc32.c
[modify] https://crrev.com/e7d9a4649bde6f047105d29f0026dd8c3d54143a/third_party/zlib/deflate.c
[modify] https://crrev.com/e7d9a4649bde6f047105d29f0026dd8c3d54143a/third_party/zlib/inflate.c

Results for Doodles and Kodak png corpus (1.0 means the same runtime, < 1.0 means improvement).
doodles.ods
36.7 KB Download
kodac.ods
33.8 KB Download

Comment 16 by noel@chromium.org, Dec 12 2017

Bench for crc32 for ARM (attached).  Instructions for building it with the Android NDK are given in that file.  

Bench results suggest the implementation given in review is up to 10x faster than zlib crc32.  Also looked at using __crc32d as suggest by mtklein@ in review, and it seems to double the perf compared to the implementation given in review (results attached).
crc.cc
12.0 KB View Download
bench-32-64-aligner-arm-crc.txt
9.4 KB View Download
Neat!  This matches my expectations.

Inside #if defined(ALIGN_64), why drop to __crc32w() in the while (len >= 16) loop?  Aren't we still 8-byte aligned and can do each of those loops with 2 __crc32d() instead of 4 __crc32w()?

(psst....  Min Time == Max Rate  :P)

Comment 18 by noel@chromium.org, Dec 12 2017

Yeah, no real reason, and I suspect we could tune the tail part a bit more.  Was mainly looking to see if __crc32d in the core part of the crc32 implementation would give us anything more.  It sure seems to.
👍

Comment 20 by noel@chromium.org, Dec 12 2017

> Inside #if defined(ALIGN_64), why drop to __crc32w() in the while (len >= 16) loop?  Aren't we still 8-byte aligned ...

Yeap, we're still 8-byte aligned so can use __crc32d().  Did a quick look, used one __crc32d() in a while (len >= 8).  Results still seem good for small buffers.

> (psst....  Min Time == Max Rate  :P)

Ah yes, fixed :)

crc.cc
11.9 KB View Download
Pretty awesome results!

I will make sure to incorporate this changes in the initial patch.
:-)
Since the data previously posted covered the aspect of running in 32bits, I decided to validate the approach for 64bits while factoring the big/LITTLE cores (see attached the data)

The good news are that the data points to the same direction (i.e. gains in using crc32d).

In big cores (A72), the gains start to show up for vectors bigger than 8KB, while in little cores with smaller vectors (i.e. 1.9x faster by using crc32d for 8KB vectors).

It was a bit slower for vectors of 1MB size though (again, on little cores).

The tested device was a Rockchip RK3399 SoC, featuring a dual-core A72 and quad-core A53.
little_crc32w.txt
4.6 KB View Download
little_crc32d.txt
4.6 KB View Download
big_crc32w.txt
4.6 KB View Download
big_crc32d.txt
4.6 KB View Download
For reference, this is about 32x faster (big core) to 45x faster (little core) than the zlib C function for 8KB vectors.

32z faster on big, 45x faster on LITTLE...that's insane.
I completely believe it (specialized hardware). But it's still amazing! :D Great work, everyone!

Comment 25 by noel@chromium.org, Jan 10 2018

Ah yeah, sorry for this, but my bench was against the single table crc32 from zlib.  I found / mentioned on  issue 796178  that our zlib uses the "BYFOUR" 8-table crc32 code by default, which is faster than the single table crc32c.  

I update the bench tool to use the BYFOUR 8-table (attached).  The intel SIMD crc32 is only 5x - 25x faster than the "BYFOUR" 8-table method, still good though.
crc.cc
45.9 KB View Download

Comment 26 by noel@chromium.org, Jan 10 2018

I used that bench on my Pixel C tablet, compiled with arm32 and arm64 builds.  The SIMD crc32 code is the same speed as before, but the BYFOUR 8-table code is faster than the single table crc32.

arm64 is up 12x faster than zlib's default BYFOUR crc32 for large buffers, while for arm32, it's up 4x faster.  Results attached.

crc32d-aarch64.txt
3.3 KB View Download
crc32d-aarch32.txt
3.3 KB View Download
Updated image benchmarker numbers.
140png.ods
65.2 KB Download
kodak.ods
30.5 KB Download
doodles.ods
55.2 KB Download
Blocking: 810125
Blocking: 810127
Blocking: 812394
Blocking: 812395
Blocking: 812400

Comment 33 by noel@chromium.org, Feb 15 2018

The patch initially posted for this issue, contained a significant PNG 140 corpus regression.  Measured on Pixel C Tablet, aarch32, the yellow-line, labelled "chunk copy + adler32 + crc32d"

https://docs.google.com/spreadsheets/d/1cqpMjk_fMNrDHNO-px7t4MIAZWtDYf6g0pYPKKyV3j4/edit#gid=595603569

Ideally it should above the red-line "chunk copy + adler32", but instead was below it most everywhere.

We worked through the causes of this regression, and eliminated them.  The results for the current patch is the green-line.  Much better, and gives a average improv of 1% for PNG decoding.  Combined with our other work (chunk copy, adler32 simd) the overall avg. improv for PNG decoding on ARM is 31% for the PNG 140 corpus.

So if using this crc32 NEON only adds 1% for PNG decoding, why bother?  Well, you'd be missing another important use-case for a web browser, content-encoding: gzip (meaning a GZIP wrapped DEFLATE stream) which is very common, esp. on mobile, and less so, content-encoding: deflate (meaning a zlib wrapped DEFLATE stream).

For the snappy/testdata/* corpus using zlib_bench on aarch32, and looking at our combined work on chunk copy, adler32 simd, crc32 simd, we have:

pixel-c-chunk-alder-zlib-inflate snappy corpus
https://docs.google.com/spreadsheets/d/1dFkY8oxnAQscA8vVMER0B4q580I9GpoVxfqG7RykLWw/edit#gid=595603569
  - uses our chunk-copy + adler32 simd optimizations
  1.49x improve for "content-encoding: deflate" decode perf in the median

pixel-c-chunk-crc32-gzip-inflate snappy corpus
https://docs.google.com/spreadsheets/d/1l2wSfsLdAsDDVEZTXhq7hlF9074Us6GH3R7Pru2gd30/edit?pli=1#gid=595603569
  - uses our chunk-copy + crc32 simd optimizations
  1.56x improve for "content-encoding: gzip" decode perf in the median

These are very good wins.

Comment 34 by noel@chromium.org, Feb 15 2018

The crc.cc from #25 has been updated with a NEON PMULL crc32 implementation.  This applies on AARCH64, and needs -march=armv8-a+crc+crypto for the compilers we use. See the notes in crc.cc (attached).

This implementation is the same speed as NEON CRC32 for very small buffers, but as the buffer size grows, its speed grows faster, and is generally much faster than NEON CRC32 and moreover, it can sustain top speed for longer in comparison on my Pixel C.  For buffer sizes > ~8Meg, speed tails off and PMULL and CRC32 have about the same perf.

The break point where it is better to use PMULL rather than CRC32 on ARM64 seems to be around 512 bytes.
crc.cc
52.0 KB View Download

Comment 35 Deleted

Project Member

Comment 36 by bugdroid1@chromium.org, Feb 16 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/28c9623083688b3a354c33bf77746f4c51f58826

commit 28c9623083688b3a354c33bf77746f4c51f58826
Author: Adenilson Cavalcanti <adenilson.cavalcanti@arm.com>
Date: Fri Feb 16 03:41:14 2018

Compute crc32 using ARMv8 specific instruction

CRC32 affects performance for both image decoding (PNG)
as also in general browsing while accessing websites that serve
content using compression (i.e. Content-Encoding: gzip).

This patch implements an optimized CRC32 function using the
dedicated instruction available in ARMv8a. We only support
ARM Little-Endian (LE).

This instruction is available in new Android devices featuring an
ARMv8 SoC, like Nexus 5x and Google Pixel. It should be between
3x (A72) to 7x faster (A53) than the C implementation currently used
by zlib for 8KB vectors.

This is performance critical code and can be called with both large (8KB)
or small vectors, therefore we must avoid extraneous function calls or
branching (otherwise the performance benefits are negated). So the use
of 'public' variables to read the CPU features status flags
(i.e. arm_cpu_enable_crc32 | pmull).

Finally it also introduces code to perform run-time ARM CPU feature
detection on the supported platforms: Android and Linux/CrOS. We build
and link the CRC32 instruction dependent code, but will decide to use it
at run-time if the ARM CPU supports the CRC32 instruction. Otherwise,
we fallback to using zlib's default C implementation.

This approach allows to use the instruction in both 32bits and 64bits and
works fine either in ARMv7 or ARMv8 processor. I tested the generated
Chrome apk in both a Nexus 6 (ARMv7) and a Google Pixel (ARMv8).

The crc32 function benefited from input from Yang Zang and Mike Klein,
while the arm_features benefited from input from Noel Gordon.

Bug:  709716 
Change-Id: I315c1216f8b3a8d88607630a28737c41f52a2f5d
Reviewed-on: https://chromium-review.googlesource.com/801108
Reviewed-by: Chris Blume <cblume@chromium.org>
Reviewed-by: Noel Gordon <noel@chromium.org>
Commit-Queue: Noel Gordon <noel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#537179}
[modify] https://crrev.com/28c9623083688b3a354c33bf77746f4c51f58826/third_party/zlib/BUILD.gn
[add] https://crrev.com/28c9623083688b3a354c33bf77746f4c51f58826/third_party/zlib/arm_features.c
[add] https://crrev.com/28c9623083688b3a354c33bf77746f4c51f58826/third_party/zlib/arm_features.h
[modify] https://crrev.com/28c9623083688b3a354c33bf77746f4c51f58826/third_party/zlib/crc32.c
[modify] https://crrev.com/28c9623083688b3a354c33bf77746f4c51f58826/third_party/zlib/crc32_simd.c
[modify] https://crrev.com/28c9623083688b3a354c33bf77746f4c51f58826/third_party/zlib/crc32_simd.h
[modify] https://crrev.com/28c9623083688b3a354c33bf77746f4c51f58826/third_party/zlib/names.h

Status: Verified (was: Started)
Blocking: 873725
Opened new issue to address the PMULL optimization in:
https://bugs.chromium.org/p/chromium/issues/detail?id=873725
SGTM

Sign in to add a comment