[zlib][arm] Optimize crc32 for aarch64 using pmul
Issue description
As mentioned in a previous issue (https://bugs.chromium.org/p/chromium/issues/detail?id=709716), there is potential for considerable performance gains on aarch64 by implementing crc32 using the PMULL instruction. A starting point for the implementation is: https://bugs.chromium.org/p/chromium/issues/detail?id=709716#c34
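For context, PMULL performs a polynomial (carry-less) multiplication over GF(2), the primitive that folding-style CRC implementations are built on. The following is only a minimal Python sketch of what such a multiplication computes. It is not the code from the linked comment, and `clmul` is a hypothetical helper name chosen for illustration.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less multiply: like schoolbook binary multiplication,
    but partial products are combined with XOR instead of addition."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # add (XOR) the current shifted copy of a
        a <<= 1           # next partial product
        b >>= 1
    return result


if __name__ == "__main__":
    # (x + 1) * (x + 1) = x^2 + 1 over GF(2), since the 2x term cancels.
    print(bin(clmul(0b11, 0b11)))
```

A CRC is the remainder of a polynomial division over GF(2), so a fast carry-less multiply lets an implementation process wide blocks of input per step instead of one byte at a time.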
Oct 26
Started with the original implementation and collected data on 3 ARM boards to assess the expected performance gains. All data is available at: https://docs.google.com/spreadsheets/d/1JripXj_pEovcPIZ_7AbHgOADPDeMxtC8nMLORZAAOkQ/edit?usp=sharing
Oct 26
At least on the 3 boards tested, a pmull-based crc32 implementation is a performance regression on average (on the other hand, its performance is more stable). Planning to test on Android next, to rule out the OS as a variable.
Oct 29
Repeated the experiment running Android on the rock64 board and also tested on a Google Pixel 1 (Qualcomm 820). The results for rock64 on Android pointed in the same direction as on Linux: an average regression of 13% (it was 14% on Linux). On the other hand, a pmull-based crc32 implementation seems to be faster on the Google Pixel. Updated the spreadsheet with the new data (it would be interesting to test another Qualcomm SoC to verify whether this happens on other devices).
Oct 29
Using zlib_bench to validate performance, it seems that a pmull-based crc32 wouldn't yield gains.
Running:
marlin:/data/local/tmp $ time LD_LIBRARY_PATH=./ ./tot -wrapper gzip -compression 1 snappy/html
snappy/html :
GZIP: [b 1M] bytes 102400 -> 17016 16.6% comp 65.7 ( 67.9) MB/s uncomp 313.1 (314.5) MB/s
0m01.05s real 0m00.86s user 0m00.15s system
marlin:/data/local/tmp $ time LD_LIBRARY_PATH=./ ./tot -wrapper gzip snappy/html
snappy/html :
GZIP: [b 1M] bytes 102400 -> 13711 13.4% comp 26.9 ( 27.1) MB/s uncomp 340.8 (345.1) MB/s
0m02.19s real 0m02.02s user 0m00.12s system
And:
marlin:/data/local/tmp $ time LD_LIBRARY_PATH=./ ./512pmull -wrapper gzip -compression 1 snappy/html
snappy/html :
GZIP: [b 1M] bytes 102400 -> 17016 16.6% comp 65.9 ( 67.1) MB/s uncomp 308.9 (309.6) MB/s
0m00.97s real 0m00.84s user 0m00.09s system
marlin:/data/local/tmp $ time LD_LIBRARY_PATH=./ ./512pmull -wrapper gzip snappy/html
snappy/html :
GZIP: [b 1M] bytes 102400 -> 13711 13.4% comp 25.4 ( 26.1) MB/s uncomp 339.2 (343.8) MB/s
0m02.24s real 0m02.05s user 0m00.14s system
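The differences between the two binaries can be made explicit with a little arithmetic. A minimal sketch, assuming the first MB/s figure of each pair in the output above is the one to compare. The dictionary keys and the `pct_change` helper are illustrative names, not part of zlib_bench.

```python
# Throughput in MB/s, copied from the zlib_bench output above
# (first figure of each comp/uncomp pair).
tot = {
    ("-compression 1", "comp"): 65.7,
    ("-compression 1", "uncomp"): 313.1,
    ("no -compression flag", "comp"): 26.9,
    ("no -compression flag", "uncomp"): 340.8,
}
pmull = {
    ("-compression 1", "comp"): 65.9,
    ("-compression 1", "uncomp"): 308.9,
    ("no -compression flag", "comp"): 25.4,
    ("no -compression flag", "uncomp"): 339.2,
}


def pct_change(base: float, new: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


# Positive values mean the 512pmull binary was faster than tot.
changes = {key: pct_change(tot[key], pmull[key]) for key in tot}

if __name__ == "__main__":
    for (run, direction), delta in changes.items():
        print(f"{run:22s} {direction:7s} {delta:+.2f}%")
```

On these figures, three of the four measurements come out slightly lower for the 512pmull binary and one is marginally higher, which is in line with the observation that the pmull-based crc32 shows no gains in this benchmark.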
Nov 8
All things considered, I think the investigation was still worthwhile. I'm closing the bug since there weren't consistent (non-SoC-dependent) performance gains.
Comment 1 by cavalcantii@chromium.org, Aug 13