
Issue 873725

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
Closed: Nov 8
Cc:
EstimatedDays: ----
NextAction: ----
OS: Linux, Android, Chrome, Fuchsia
Pri: 3
Type: Feature

Blocked on:
issue 709716

Blocking:
issue 764085




[zlib][arm] Optimize crc32 for aarch64 using pmul

Project Member Reported by cavalcantii@chromium.org, Aug 13

Issue description

As mentioned in the previous issue (https://bugs.chromium.org/p/chromium/issues/detail?id=709716), there is potential for considerable performance gains on aarch64 by implementing crc32 with the PMULL instruction.

A starting point for the implementation is:
https://bugs.chromium.org/p/chromium/issues/detail?id=709716#c34
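
For readers unfamiliar with the instruction: PMULL performs a carry-less (polynomial, GF(2)) multiplication, and PMULL-based CRC code uses that primitive to fold several input words per step instead of consuming the input a byte or word at a time. The Python sketch below is not the implementation linked above; it only models the primitive and provides a bit-at-a-time CRC-32 reference that any optimized version would have to match. The names `clmul` and `crc32_bitwise` are made up for illustration.

```python
import zlib


def clmul(a: int, b: int) -> int:
    """Carry-less multiply of two non-negative integers.

    Models what PMULL computes: a product where partial products are
    combined with XOR instead of addition (polynomials over GF(2)).
    """
    result = 0
    while b:
        if b & 1:
            result ^= a  # XOR in the shifted multiplicand, no carries
        a <<= 1
        b >>= 1
    return result


def crc32_bitwise(data: bytes, crc: int = 0) -> int:
    """Bit-at-a-time CRC-32 using the reflected polynomial 0xEDB88320.

    Slow, but useful as a reference when checking a faster variant.
    """
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    sample = b"123456789"
    # The reference must agree with zlib's own crc32.
    print(hex(crc32_bitwise(sample)), hex(zlib.crc32(sample)))
    # (x + 1) * (x + 1) = x^2 + 1 over GF(2), so 0b11 * 0b11 = 0b101.
    print(bin(clmul(0b11, 0b11)))
```

Carry-less multiplication distributes over XOR, which is the property that makes it possible to fold independent chunks of the input and combine them afterwards.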
 
Blockedon: 709716
Blocking: 764085
Status: Assigned (was: Untriaged)
Cc: noel@chromium.org
Status: Started (was: Assigned)
Labels: -Type-Bug OS-Android OS-Chrome OS-Fuchsia OS-Linux Type-Feature
Started with the original implementation and collected data on 3 ARM boards to assess the expected performance gains.

All data is available at:
https://docs.google.com/spreadsheets/d/1JripXj_pEovcPIZ_7AbHgOADPDeMxtC8nMLORZAAOkQ/edit?usp=sharing
At least on the 3 boards tested, using a pmull-based implementation for crc32 is a performance regression on average (on the other hand, its performance is more stable).

Planning to test on Android next to rule out the OS as a variable.
Repeated the experiment running Android on the rock64 board and also tested on a Google Pixel 1 (Qualcomm 820).

The results on rock64@android pointed in the same direction as on Linux: an average regression of 13% (it was 14% on Linux).

On the other hand, it seems that a pmull-based crc32 implementation is faster on a Google Pixel.

Updated the spreadsheet with the new data (it would be interesting to test another Qualcomm SoC to verify whether this happens on other devices).

Using zlib_bench to validate performance, it seems that a pmull-based crc32 wouldn't yield gains.

Running:

marlin:/data/local/tmp $ time LD_LIBRARY_PATH=./ ./tot -wrapper gzip -compression 1 snappy/html                                                       
snappy/html                              :
GZIP: [b 1M] bytes 102400 ->  17016 16.6% comp  65.7 ( 67.9) MB/s uncomp 313.1 (314.5) MB/s
    0m01.05s real     0m00.86s user     0m00.15s system
marlin:/data/local/tmp $ time LD_LIBRARY_PATH=./ ./tot -wrapper gzip snappy/html                                                                      
snappy/html                              :
GZIP: [b 1M] bytes 102400 ->  13711 13.4% comp  26.9 ( 27.1) MB/s uncomp 340.8 (345.1) MB/s
    0m02.19s real     0m02.02s user     0m00.12s system


And:
marlin:/data/local/tmp $ time LD_LIBRARY_PATH=./ ./512pmull -wrapper gzip -compression 1 snappy/html                                                  
snappy/html                              :
GZIP: [b 1M] bytes 102400 ->  17016 16.6% comp  65.9 ( 67.1) MB/s uncomp 308.9 (309.6) MB/s
    0m00.97s real     0m00.84s user     0m00.09s system
marlin:/data/local/tmp $ time LD_LIBRARY_PATH=./ ./512pmull -wrapper gzip snappy/html                                                                 
snappy/html                              :
GZIP: [b 1M] bytes 102400 ->  13711 13.4% comp  25.4 ( 26.1) MB/s uncomp 339.2 (343.8) MB/s
    0m02.24s real     0m02.05s user     0m00.14s system
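
To make the comparison between the two binaries easier to read, here is a small Python sketch that parses the result lines above and reports the relative change of `512pmull` against `tot`. It uses the first figure of each throughput pair and ignores the parenthesized one; the helper names (`parse_throughput`, `pct_change`, `compare`) and the run labels are hypothetical, not part of zlib_bench.

```python
import re

# Matches the throughput part of a zlib_bench result line, e.g.
# "... comp  65.7 ( 67.9) MB/s uncomp 313.1 (314.5) MB/s".
RESULT = re.compile(
    r"\bcomp\s+([\d.]+)\s+\(\s*[\d.]+\)\s+MB/s\s+"
    r"uncomp\s+([\d.]+)\s+\(\s*[\d.]+\)\s+MB/s"
)


def parse_throughput(line: str) -> tuple[float, float]:
    """Return (compress MB/s, uncompress MB/s) from one result line."""
    match = RESULT.search(line)
    if match is None:
        raise ValueError(f"not a zlib_bench result line: {line!r}")
    return float(match.group(1)), float(match.group(2))


def pct_change(base: float, new: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


# Result lines copied from the runs in this thread: (tot, 512pmull).
RUNS = {
    "gzip -compression 1": (
        "GZIP: [b 1M] bytes 102400 ->  17016 16.6% comp  65.7 ( 67.9) MB/s uncomp 313.1 (314.5) MB/s",
        "GZIP: [b 1M] bytes 102400 ->  17016 16.6% comp  65.9 ( 67.1) MB/s uncomp 308.9 (309.6) MB/s",
    ),
    "gzip, no -compression flag": (
        "GZIP: [b 1M] bytes 102400 ->  13711 13.4% comp  26.9 ( 27.1) MB/s uncomp 340.8 (345.1) MB/s",
        "GZIP: [b 1M] bytes 102400 ->  13711 13.4% comp  25.4 ( 26.1) MB/s uncomp 339.2 (343.8) MB/s",
    ),
}


def compare(runs: dict) -> dict:
    """Map each run label to (compress delta %, uncompress delta %)."""
    deltas = {}
    for label, (tot_line, pmull_line) in runs.items():
        tot_comp, tot_uncomp = parse_throughput(tot_line)
        pm_comp, pm_uncomp = parse_throughput(pmull_line)
        deltas[label] = (
            pct_change(tot_comp, pm_comp),
            pct_change(tot_uncomp, pm_uncomp),
        )
    return deltas


if __name__ == "__main__":
    for label, (comp, uncomp) in compare(RUNS).items():
        print(f"{label}: comp {comp:+.1f}%  uncomp {uncomp:+.1f}%")
```

For the figures above this gives roughly +0.3% (compression) and -1.3% (decompression) at level 1, and -5.6% and -0.5% without the flag. These are single runs, so differences of this size may well be within measurement noise.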


Cc: cblume@chromium.org mtklein@chromium.org
Status: WontFix (was: Started)
All things considered, I think it was still worthwhile to do the investigation.

I'm closing the bug since there weren't consistent (non-SoC-dependent) performance gains.

