zlib uses byte copy loops exclusively, even where larger chunks could be much faster
Reported by
simon.ho...@arm.com,
Mar 1 2017
|
|||||||||||||||||
Issue descriptiondeflate_fast() in zlib currently uses byte copy loops during decode; there's a fairly limited set of cases where this is necessary and the rest of the time it would be faster to copy in larger chunks. I already have a patch to use ARM NEON registers for this. For the batch of PNG files that I tested using image_decode_bench on Cortex-A15 I got an overall improvement of ~30% total run-time. That said, pending 687631 it's not clear where this patch belongs.
,
Apr 27 2017
Updated https://codereview.chromium.org/2722063002/ with the rest of the easy changes. Performance is now around 33% improvement for PNG decode (just this patch in isolation). Upstream pull request created at: https://github.com/madler/zlib/pull/256 I don't know what order these are expected to land. If this could go first then I guess I should fix README.chromium.
,
Aug 22 2017
,
Aug 22 2017
Traces comparing vanilla X patched running a Google Pixel (Snapdragon 820) for test case: https://codepen.io/Savago/pen/JEoYNX Improvement of 30% (1 - (98.264/141.509))
,
Aug 22 2017
Rebased at: https://chromium-review.googlesource.com/627098
,
Aug 23 2017
,
Aug 24 2017
Still trying to figure it out how to run image_decode_bench in my Pixel, but blink_platform_unittests are running fine (please see attached output for *PNG and *Png for both vanilla x patched).
,
Aug 24 2017
Drive by comment: in the original post, "deflate_fast" should be "inflate_fast".
,
Aug 28 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/ad8fca9f4ad29ab87e49ec91a3577ef3696b3a84 commit ad8fca9f4ad29ab87e49ec91a3577ef3696b3a84 Author: Adenilson Cavalcanti <adenilson.cavalcanti@arm.com> Date: Mon Aug 28 20:49:26 2017 zlib: inflate using wider loads and stores In inflate_fast() the output pointer always has plenty of room to write. This means that so long as the target is capable, wide un-aligned loads and stores can be used to transfer several bytes at once. When the reference distance is too short simply unroll the data a little to increase the distance. Patch by Simon Hosie. PNG decoding performance gains should be around 30-33%. BUG= 697280 Change-Id: I59854eb25d2b1e43561c8a2afaf9175bf10cf674 Reviewed-on: https://chromium-review.googlesource.com/627098 Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org> Reviewed-by: Mike Klein <mtklein@chromium.org> Commit-Queue: Adenilson Cavalcanti <cavalcantii@chromium.org> Cr-Commit-Position: refs/heads/master@{#497866} [add] https://crrev.com/ad8fca9f4ad29ab87e49ec91a3577ef3696b3a84/third_party/zlib/0003-arm-inffast.patch [modify] https://crrev.com/ad8fca9f4ad29ab87e49ec91a3577ef3696b3a84/third_party/zlib/BUILD.gn [modify] https://crrev.com/ad8fca9f4ad29ab87e49ec91a3577ef3696b3a84/third_party/zlib/README.chromium [add] https://crrev.com/ad8fca9f4ad29ab87e49ec91a3577ef3696b3a84/third_party/zlib/contrib/arm/chunkcopy.h [add] https://crrev.com/ad8fca9f4ad29ab87e49ec91a3577ef3696b3a84/third_party/zlib/contrib/arm/inffast.c [add] https://crrev.com/ad8fca9f4ad29ab87e49ec91a3577ef3696b3a84/third_party/zlib/contrib/arm/inflate.c
,
Aug 28 2017
Let's keep an eye on the bots and monitor any regression. If everything goes well, this should be featured on Chromium M62 (branching on Aug 31th, 2017).
,
Aug 28 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e1f30a329eccf19ce1c8772e873abf88970cb65c commit e1f30a329eccf19ce1c8772e873abf88970cb65c Author: Paul Miller <paulmiller@chromium.org> Date: Mon Aug 28 23:42:13 2017 Revert "zlib: inflate using wider loads and stores" This reverts commit ad8fca9f4ad29ab87e49ec91a3577ef3696b3a84. Reason for revert: broke Android_Cronet_ARMv6_Builder: error: "NEON support not enabled" Original change's description: > zlib: inflate using wider loads and stores > > In inflate_fast() the output pointer always has plenty of room to write. > This means that so long as the target is capable, wide un-aligned > loads and stores can be used to transfer several bytes at once. > > When the reference distance is too short simply unroll the data a > little to increase the distance. Patch by Simon Hosie. > > PNG decoding performance gains should be around 30-33%. > > BUG= 697280 > > Change-Id: I59854eb25d2b1e43561c8a2afaf9175bf10cf674 > Reviewed-on: https://chromium-review.googlesource.com/627098 > Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org> > Reviewed-by: Mike Klein <mtklein@chromium.org> > Commit-Queue: Adenilson Cavalcanti <cavalcantii@chromium.org> > Cr-Commit-Position: refs/heads/master@{#497866} TBR=scroggo@chromium.org,agl@chromium.org,cavalcantii@chromium.org,cblume@chromium.org,mtklein@chromium.org,adenilson.cavalcanti@arm.com Change-Id: I93030617e4ed4feb665de1095c411bce80c531a9 No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: 697280 Reviewed-on: https://chromium-review.googlesource.com/639399 Reviewed-by: Paul Miller <paulmiller@chromium.org> Commit-Queue: Paul Miller <paulmiller@chromium.org> Cr-Commit-Position: refs/heads/master@{#497938} [delete] https://crrev.com/1c85b39e1265cdf81fdcebcaeedfcf5a667ef11a/third_party/zlib/0003-arm-inffast.patch [modify] https://crrev.com/e1f30a329eccf19ce1c8772e873abf88970cb65c/third_party/zlib/BUILD.gn [modify] https://crrev.com/e1f30a329eccf19ce1c8772e873abf88970cb65c/third_party/zlib/README.chromium [delete] https://crrev.com/1c85b39e1265cdf81fdcebcaeedfcf5a667ef11a/third_party/zlib/contrib/arm/chunkcopy.h [delete] https://crrev.com/1c85b39e1265cdf81fdcebcaeedfcf5a667ef11a/third_party/zlib/contrib/arm/inffast.c [delete] https://crrev.com/1c85b39e1265cdf81fdcebcaeedfcf5a667ef11a/third_party/zlib/contrib/arm/inflate.c
,
Aug 29 2017
build log snippets from the offending bot attached
,
Aug 29 2017
@Paul: thanks a lot for detecting early the failing ARMv6 bot and sharing the build logs. I got a few ideas about how to handle this case and will be posting a new patchset tomorrow.
,
Aug 29 2017
,
Aug 29 2017
The fix on the buildsystem landed on: https://chromium-review.googlesource.com/c/chromium/src/+/639931
,
Aug 30 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0397489124ce7e6aced020f8b85f5034c7d5f49b commit 0397489124ce7e6aced020f8b85f5034c7d5f49b Author: Chris Blume <cblume@chromium.org> Date: Wed Aug 30 20:00:08 2017 Reland "zlib: inflate using wider loads and stores" This reverts commit e1f30a329eccf19ce1c8772e873abf88970cb65c. Reason for revert: This patch was originally reverted because we have an ARMv6 build target (part of Cronet) which incorrectly set NEON support in our build files. As a result, the NEON intrinsics were included despite ARMv6 not supporting NEON. The build problem was addressed here: https://chromium-review.googlesource.com/c/chromium/src/+/639931 Now that ARMv6 builds do not set NEON support, these zlib changes (which are only applied if NEON support is set in the build) will not cause errors on ARMv6 targets. BUG= 697280 Original change's description: > zlib: inflate using wider loads and stores > > In inflate_fast() the output pointer always has plenty of room to write. > This means that so long as the target is capable, wide un-aligned > loads and stores can be used to transfer several bytes at once. > > When the reference distance is too short simply unroll the data a > little to increase the distance. Patch by Simon Hosie. > > PNG decoding performance gains should be around 30-33%. > > BUG= 697280 > > Change-Id: I59854eb25d2b1e43561c8a2afaf9175bf10cf674 > Reviewed-on: https://chromium-review.googlesource.com/627098 > Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org> > Reviewed-by: Mike Klein <mtklein@chromium.org> > Commit-Queue: Adenilson Cavalcanti <cavalcantii@chromium.org> > Cr-Commit-Position: refs/heads/master@{#497866} > Change-Id: I0b4cd1a393464c960c6a1e48a022a20340781e75 Reviewed-on: https://chromium-review.googlesource.com/641575 Commit-Queue: Chris Blume <cblume@chromium.org> Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org> Reviewed-by: Leon Scroggins <scroggo@chromium.org> Cr-Commit-Position: refs/heads/master@{#498580} [add] https://crrev.com/0397489124ce7e6aced020f8b85f5034c7d5f49b/third_party/zlib/0003-arm-inffast.patch [modify] https://crrev.com/0397489124ce7e6aced020f8b85f5034c7d5f49b/third_party/zlib/BUILD.gn [modify] https://crrev.com/0397489124ce7e6aced020f8b85f5034c7d5f49b/third_party/zlib/README.chromium [add] https://crrev.com/0397489124ce7e6aced020f8b85f5034c7d5f49b/third_party/zlib/contrib/arm/chunkcopy.h [add] https://crrev.com/0397489124ce7e6aced020f8b85f5034c7d5f49b/third_party/zlib/contrib/arm/inffast.c [add] https://crrev.com/0397489124ce7e6aced020f8b85f5034c7d5f49b/third_party/zlib/contrib/arm/inflate.c
,
Aug 31 2017
Bots are starting to produce results (and they look good): https://chromeperf.appspot.com/report?sid=192bbecaad36946937bc656d2daff9fe831ccae4078ca70dce47c068f80cb714
,
Sep 6 2017
All patches landed and no further regressions, I'm closing the issue.
,
Sep 11 2017
,
Sep 16 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/7a620575b0512b33708e723c260fe66b90b5b103 commit 7a620575b0512b33708e723c260fe66b90b5b103 Author: Mike Klein <mtklein@chromium.org> Date: Sat Sep 16 02:20:04 2017 Revert "Reland "zlib: inflate using wider loads and stores"" This reverts commit 0397489124ce7e6aced020f8b85f5034c7d5f49b. Reason for revert: appears to break Cronet. Original change's description: > Reland "zlib: inflate using wider loads and stores" > > This reverts commit e1f30a329eccf19ce1c8772e873abf88970cb65c. > > Reason for revert: This patch was originally reverted because > we have an ARMv6 build target (part of Cronet) which > incorrectly set NEON support in our build files. As a result, > the NEON intrinsics were included despite ARMv6 not > supporting NEON. > > The build problem was addressed here: > https://chromium-review.googlesource.com/c/chromium/src/+/639931 > > Now that ARMv6 builds do not set NEON support, these zlib > changes (which are only applied if NEON support is set in the > build) will not cause errors on ARMv6 targets. > > BUG= 697280 > > Original change's description: > > zlib: inflate using wider loads and stores > > > > In inflate_fast() the output pointer always has plenty of room to write. > > This means that so long as the target is capable, wide un-aligned > > loads and stores can be used to transfer several bytes at once. > > > > When the reference distance is too short simply unroll the data a > > little to increase the distance. Patch by Simon Hosie. > > > > PNG decoding performance gains should be around 30-33%. > > > > BUG= 697280 > > > > Change-Id: I59854eb25d2b1e43561c8a2afaf9175bf10cf674 > > Reviewed-on: https://chromium-review.googlesource.com/627098 > > Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org> > > Reviewed-by: Mike Klein <mtklein@chromium.org> > > Commit-Queue: Adenilson Cavalcanti <cavalcantii@chromium.org> > > Cr-Commit-Position: refs/heads/master@{#497866} > > > > Change-Id: I0b4cd1a393464c960c6a1e48a022a20340781e75 > Reviewed-on: https://chromium-review.googlesource.com/641575 > Commit-Queue: Chris Blume <cblume@chromium.org> > Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org> > Reviewed-by: Leon Scroggins <scroggo@chromium.org> > Cr-Commit-Position: refs/heads/master@{#498580} TBR=scroggo@chromium.org,agl@chromium.org,cavalcantii@chromium.org,cblume@chromium.org,mtklein@chromium.org,adenilson.cavalcanti@arm.com # Not skipping CQ checks because original CL landed > 1 day ago. Bug: 697280 Change-Id: I200e0e3b9cb9c884acfd6868067464a5f6cef804 Reviewed-on: https://chromium-review.googlesource.com/669259 Reviewed-by: Mike Klein <mtklein@chromium.org> Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org> Commit-Queue: Mike Klein <mtklein@chromium.org> Cr-Commit-Position: refs/heads/master@{#502474} [delete] https://crrev.com/9108eeb3d74cff20b0cf85b28faad013952f7151/third_party/zlib/0003-arm-inffast.patch [modify] https://crrev.com/7a620575b0512b33708e723c260fe66b90b5b103/third_party/zlib/BUILD.gn [modify] https://crrev.com/7a620575b0512b33708e723c260fe66b90b5b103/third_party/zlib/README.chromium [delete] https://crrev.com/9108eeb3d74cff20b0cf85b28faad013952f7151/third_party/zlib/contrib/arm/chunkcopy.h [delete] https://crrev.com/9108eeb3d74cff20b0cf85b28faad013952f7151/third_party/zlib/contrib/arm/inffast.c [delete] https://crrev.com/9108eeb3d74cff20b0cf85b28faad013952f7151/third_party/zlib/contrib/arm/inflate.c
,
Sep 18 2017
Please also make sure to merge the revert into M62 branch. The CL caused data corruption in Cronet (b/65491196). It will affect Chrome as well once M62 rolls out more widely.
,
Sep 21 2017
As per comment 22 we do need to merge the revert into M62.
,
Sep 21 2017
This bug requires manual review: Reverts referenced in bugdroid comments after merge request. Please contact the milestone owner if you have questions. Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), abdulsyed@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Sep 21 2017
I've verified that zlib/contrib/arm/* is used in Android and iOS builds. I'm not sure about Chrome OS.
,
Sep 21 2017
,
Sep 22 2017
There are many CLs attached to this bug. Which ones are you specifically trying to merge into M62?
,
Sep 22 2017
This one: https://chromium-review.googlesource.com/c/chromium/src/+/669259 We (re-)landed a patch. That seemed fine in Chrome. But other apps which use Cronet (built by Chrome) were having problems. So we reverted it. The revert landed. But it did so after M62 had already branched. So M62 didn't get the revert.
,
Sep 25 2017
So your patch is fine and not causing any issues in Chrome. Can this revert in M62 cause any issue in Chrome? Also, this is a P3 bug, does that mean it is not as critical?
,
Sep 25 2017
As far as we know, it is not causing any issues in Chrome. I think we worry that if other apps noticed an issue then perhaps there is an issue in Chrome and we simply haven't seen it yet. I think I see where you are going. Maybe we can let sleeping dogs lie. > Can it cause problems. Anything has the chance to. :D If we haven't witnessed the issue in Chrome, maybe leaving it is the safer option. > P3? The original issue was P3. I don't know if we would use P3 for the revert. I'm okay leaving it in M62 I suppose.
,
Sep 25 2017
Leaving it in M62 will affect Cronet releases (Cronet takes builds from Stable branch). The CL causes a data corruption bug, which is 100% reproducible. Can we merge the revert to M62? I don't think we should leave it as it is. Data corruption issues are hard to track down. I don't think we should wait for M62 to roll out more widely to find out how difficult it'd be.
,
Sep 25 2017
One alternative to avoid the chain of reverts would be to land a patch in M62 branch that would only remove the affected code (e.g. https://chromium-review.googlesource.com/c/chromium/src/+/669681).
,
Sep 26 2017
I'd like to reiterate the importance of merging a fix back to M62. Without this fix Cronet cannot ship the M62 release.
,
Sep 26 2017
I think it is better to land a patch in M62 branch to remove affected code than reverting many cls. If that alternative will fix the issue please go ahead and merge it. Thanks!
,
Sep 28 2017
FWIW we've found that the problem is caused by arm-optimized zlib changing the undefined behavior that was inadvertently used by client app that uses Cronet. Specifically it adds some random data into |output buffer| beyond the |bytes_read| boundary but within the |output_buffer_size| boundary. The application has wrongfully relied on buffer between |bytes_read| and |output_buffer_size| to remain intact. Given that this issue does not appear to affect Chrome or any other clients of Cronet it could be OK to keep optimized code in M62.
,
Sep 28 2017
I think it should also be safe to reland the optimized code in M63. Thanks again to everyone that helped investigate!
,
Sep 28 2017
I'm going to work on re-re-re-landing the optimized code. There are some conflict with other patches we want to land as well. Since M63 branches in two weeks and we want to land other patches first, it may not make it in time for M63. Additionally, there is a bug in inflateBack() which we aren't using in Chrome. That should be fixed as part of the re-re-re-land. But that said, I agree that M62 is fine as-is.
,
Sep 28 2017
,
Sep 29 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/78104f4d73e3bbb4155fa804d00ed66682180556 commit 78104f4d73e3bbb4155fa804d00ed66682180556 Author: Adenilson Cavalcanti <adenilson.cavalcanti@arm.com> Date: Fri Sep 29 03:24:52 2017 zlib: inflate using wider loads and stores In inflate_fast() the output pointer always has plenty of room to write. This means that so long as the target is capable, wide un-aligned loads and stores can be used to transfer several bytes at once. When the reference distance is too short simply unroll the data a little to increase the distance. Patch by Simon Hosie. PNG decoding performance gains should be around 30-33%. This also includes the fix reported in madler/zlib#245. Bug: 697280 Change-Id: I90a9866cc56aa766df5de472cd10c007f4b560d8 Reviewed-on: https://chromium-review.googlesource.com/689961 Reviewed-by: Chris Blume <cblume@chromium.org> Commit-Queue: Adenilson Cavalcanti <cavalcantii@chromium.org> Cr-Commit-Position: refs/heads/master@{#505276} [modify] https://crrev.com/78104f4d73e3bbb4155fa804d00ed66682180556/third_party/zlib/BUILD.gn [add] https://crrev.com/78104f4d73e3bbb4155fa804d00ed66682180556/third_party/zlib/contrib/arm/chunkcopy.h [add] https://crrev.com/78104f4d73e3bbb4155fa804d00ed66682180556/third_party/zlib/contrib/arm/inffast.c [add] https://crrev.com/78104f4d73e3bbb4155fa804d00ed66682180556/third_party/zlib/contrib/arm/inflate.c [add] https://crrev.com/78104f4d73e3bbb4155fa804d00ed66682180556/third_party/zlib/patches/0004-neon-inffast.patch
,
Sep 29 2017
#8 > Still trying to figure it out how to run image_decode_bench in my Pixel, but blink_platform_unittests are running fine. And did we figure it out? All I see is "PNG decoding performance gains should be around 30-33%.", but that seems to be for one image reading #5.
,
Sep 29 2017
I believe that number is my original comment. To measure 30%-33% I built image_decode_bench for Linux and ran it across the test corpus. As I recall I did something like set up a script to pass a number of iterations that would give approximately five seconds of benchmarking timing on each file. Then calculated (1 - patched_time / unpatched_time) * 100% (so patched would have taken about 0.67 to 0.7 of the time that unpatched takes). The figure is, I think, the range of typical values for the average over the whole set (perhaps adding droids.png) tested on a couple of different platforms and before and after perturbing memory a bit. I think. It was a long time ago.
,
Sep 29 2017
#41> "I built image_decode_bench for Linux". Meaning Linux desktop?
,
Sep 29 2017
Well, ARM Linux and Chrome OS. I ran image_decode_bench via ssh and don't recall whether the Linux systems were set up as desktop or not. The numbers in the commit message come from image_decode_bench. If they align with chrome traces then that's everything working as it should.
,
Oct 2 2017
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Oct 2 2017
Removing Merge tag since we don't need to merge the revert any more. (The original code is fine. It was vindicated. We relanded into M63.)
,
Oct 9 2017
#41 "I think. It was a long time ago." :) Also, "the test corpus" makes me think, which corpus? as I don't recall sharing our PNG 140 test corpus with you. Anyways, thanks for running image_decode_bench. FTR, I locally reverted your change, and the adler32 neon change from issue 762564 , and ran image_decode_bench on an Android Pixel C tablet with the PNG 140 corpus to establish baseline results. I then added your change back, then added the adler32 neon change back, and measured both: ANDROID PIXEL C TABLET * Arm8 quad-core Nvidia X1 CPU, 3G RAM https://docs.google.com/spreadsheets/d/1WOWEuAMXrxj608WMuHj-9DKv1wN08eY5hiSMufMrbJw/edit#gid=595603569 Your change (the blue-line on the graph, label neon chunk) is an average 16% improvement in PNG decoding rate for the PNG 140 corpus. When the adler32 neon code change issue 762564 is included (the red-line on the graph, label neon chunk + neon adler) the PNG decoding improvement is average 28% for the PNG 140 corpus, and 50% or more for about 1/5th of the test images. These are good wins ...
,
Oct 9 2017
Remaining work is issue 769880 to fix the infback case (not used by chrome, but good to do for upstreaming this development). Also seems to me that SSE SIMD could be used for this "chunk-copy" approach on Intel devices, I filed issue 772870 about that.
,
Feb 11 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/1dd116fcf2bba4520c2e5603485d95a0141bf51d commit 1dd116fcf2bba4520c2e5603485d95a0141bf51d Author: Noel Gordon <noel@chromium.org> Date: Sun Feb 11 23:31:46 2018 Remove zlib/patches/0004-neon-inffast.patch Patch files are no longer needed per the patch/README guidelines. Bug: 697280 Change-Id: I975a99ac0de50eb7c61a037c458853dc6406ff69 Reviewed-on: https://chromium-review.googlesource.com/912908 Reviewed-by: Adenilson Cavalcanti <cavalcantii@chromium.org> Commit-Queue: Noel Gordon <noel@chromium.org> Cr-Commit-Position: refs/heads/master@{#536012} [delete] https://crrev.com/ead73f584a60e1b228f5842395fd0161969ed904/third_party/zlib/patches/0004-neon-inffast.patch |
|||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||
Comment 1 by simon.ho...@arm.com
, Mar 1 2017