New issue
Advanced search Search tips

Issue 894827 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Dec 10
Cc:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug

Blocking:
issue 758566
issue 913436



Sign in to add a comment

Code ordering decreases speedometer 2.0 performance on Pixel

Project Member Reported by lizeb@chromium.org, Oct 12

Issue description

From tests I did locally, there seems to be a sizable (>=10%) performance regression when using the current orderfile on Pixel.

Reproduced by running speedometer 2.0 https://browserbench.org/Speedometer2.0/
with and without code ordering.
 
Blocking: 758566
The hypothesis is that it's L1/L2 cache misses that are causing issues.

The previous orderfile instrumentation was ordering symbols by (process, time). This implies that some locality is preserved, and orderfile instrumentation was improving speedometer performance.

The hypothesis from that was that the performance improvement came either from fewer TLB misses, and/or L1/L2 cache misses. Separately, it has been observed by skyostil@ that code ordering lowers the number of L1 icache misses by ~half.

However, the new instrumentation with "startup", "common" and "steady-state" sections uses a set() to track symbols, effectively randomizing their location within each section. The hypothesis is then that this would cause cache misses to spike.

From a local speedometer 2.0 run on Pixel 2, with the "simple" (startup-only) orderfile:
- Initial startup-only orderfile: 37.7
- Randomized ordering within the startup-only orderfile: 35.89


I did simple ordering of the system health orderfile which had no improvement on speedometer2 results (on pixel 2). ThinLTO was enabled for all tests. Note that the speedometer2 results vary by ~0.5 between runs.

Initial startup-only orderfile: 37.4
Random startup-only orderfile: 35.9
Original system health orderfile: 35.3
Rank-ordered system health orderfile: 35.1

The rank ordering is done in crrev.com/c/1280262. The idea is that in each profile, we note the rank of that symbol in the order that it appeared in the profile as a number in [0,1), with 0 being the first symbol seen and 1 the last. This rank is averaged over all runs for each offset, and then used to order the offsets in the final orderfile.

I suspect that this simple averaged ranking isn't doing a good job at maintaining function locality, from the point of icache locality we care less about whether something was called towards the beginning or end of an instrumentation run and more about what was called nearby it.
Project Member

Comment 4 by bugdroid1@chromium.org, Oct 15

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/03642a2aea39a94f72995da5a3a922412cdf1ca2

commit 03642a2aea39a94f72995da5a3a922412cdf1ca2
Author: Matthew Cary <mattcary@chromium.org>
Date: Mon Oct 15 09:08:08 2018

Orderfile: disable system health orderfile generation.

The bugs below suggest that under thinLTO, the system health orderfile
regresses memory, as well as performance (the latter probably even
without thinLTO). This CL reverts back to the simpler orderfile.

Bug: 758566,893981, 894827 
Change-Id: I93c898dd7c8d55311ac3aed08815c610573c49c2
Reviewed-on: https://chromium-review.googlesource.com/c/1277804
Reviewed-by: Benoit L <lizeb@chromium.org>
Commit-Queue: Matthew Cary <mattcary@chromium.org>
Cr-Commit-Position: refs/heads/master@{#599592}
[modify] https://crrev.com/03642a2aea39a94f72995da5a3a922412cdf1ca2/tools/cygprofile/orderfile_generator_backend.py

Another observation: LTO is not a major contributor here. All numbers are speedometer2 as above.

startup-only orderfile without lto at all: 38.1
startup-only orderfile instrumented on non-lto, used to build lto: 38.9

sys health no-lto orderfile, on an lto build: 35.6.

So startup-only orderfie is around 38 regardless of LTO, and sys health is 35.5 which is about the same as a random startup-only orderfile.

There might be a performance difference between LTO and not, but I wouldn't bet on it given the variations in speedometer runs I've observed.
I made the orderfile use the base (non-orderfile/by file) ordering for code within startup/common/interactive regions. Below are the results, compared against the simple and randomized simple orderfile. The natural ordering recovers some but not all of the simple orderfile's speedometer performance. 

I wonder if the problem now is icache misses between the common and interactive regions. That might be mitigated by merging the common and interactive regions, although startup performance may still have a problem.

natural system health (lto on)
36.70 36.84

original system health (lto on)
34.95 34.99

simple chrome (lto on)
37.34 36.91

random simple (lto on)
35.18 35.27


After the object-type symbol change (see crrev.com/c/1326486 and crbug.com/893981), using the natural ordering for the system health orderfile, the performance difference is still pronounced on the lower-end N5 (note that I've discovered that speedometer has error bars).

legacy chrome N5
 11.0 10.1 (+/- 0.72)
legacy chrome N5x
 13.8 11.3 +/- 2.3

syshealth chrome N5
 8.79 +/- 0.24 9.20 +/- 0.55
syshealth chrome N5x
 12.8 +/- 3.4 11.1 +/-2.2

Cc: -mattcary@chromium.org lizeb@chromium.org
Owner: mattcary@chromium.org
After crrev.com/c/1329166 (dealing correctly halfword aligned symbols, which LTO seemed to exacerbate), speedometer performance seems to have recovered (although error is clearly large). Note this is *not* using the natural ordering to improve icache; that may not be as much of an issue as we thought.

syshealth
N5
 10.1 +/- .58 9.3 +/- 0.62
N5x
 17.0 +/- 2.7 13.8 +/- 2.8

legacy
N5
 9.8 +/- .6 9.37 +/- .33
N5x
 13.4 +/- 3 15.1 +/-2.8
The difference between the correct halfword alignment is even bigger just comparing across the legacy orderfile. Due to problems with the N5 and N5x devices producing consistent results, here are some on Go and Pixel2. 0811 is before the halfword alignment fix, and 1311 is after that fix.

However, syshealth performance is still not good.

0811-legacy
go 6.985 +- .040 6.974 +- .046
p2 38.19 +-.17 37.89 +- .19

1311-legacy
go 7.565 +- .061 7.519 +-.061
p2 40.8 +- .44 40.7 +-.34

1311-syshealth
go 6.878 +-.034 (only one result as it's getting late)
p2 37.49 +- .22 37.69 +-.23


I'm rolling clank to pick up this change and if it improves memory I'll cherry-pick it as well to the branch, if necessary just the legacy code ordering as syshealth still seems to be worse.
While trying to look at differences between legacy and syshealth performance, I ran an empty orderfile just to see what would happen. Here is the following (note that I re-ran on legacy & syshealth in case there's systematic changes with the phone or the benchmark website).

All these were done with the P2. As a reminder, these are speedometer2 benchmarks

empty
38.7 +-.42 38.53 +-.29
legacy
40.37 +-.31 40.59 +-.34 40.1 +-.2
syshealth
36.58 +-.3 36.51 +-.12

Note that legacy is a significant improvement on empty. This was a little surprising, as legacy is focused on startup and I wouldn't think that the wiki science page exercises much of the javascript hit in speedometer.
Project Member

Comment 11 by bugdroid1@chromium.org, Nov 30

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/91df979561d344ab4304c21a115caa99a07da469

commit 91df979561d344ab4304c21a115caa99a07da469
Author: Matthew Cary <mattcary@chromium.org>
Date: Fri Nov 30 14:35:15 2018

Orderfile: Simple call-graph-based orderfile.

This CL adds simple function call graph clustering. The call graph is
inferred from the function ordering offset list, and the orderfile is
produced by clustering based on this inferred call graph.

In local testing, this change recovers the speedometer performance
lost in the bug. Startup is not affected, and memory shows similar
improvement to the original system_health orderfile.

Bug:  894827 
Change-Id: Iedab652d0f25c9c9ea481c2a75214cac41dee368
Reviewed-on: https://chromium-review.googlesource.com/c/1350876
Commit-Queue: Matthew Cary <mattcary@chromium.org>
Reviewed-by: Benoit L <lizeb@chromium.org>
Reviewed-by: Egor Pasko <pasko@chromium.org>
Cr-Commit-Position: refs/heads/master@{#612644}
[add] https://crrev.com/91df979561d344ab4304c21a115caa99a07da469/tools/cygprofile/cluster.py
[add] https://crrev.com/91df979561d344ab4304c21a115caa99a07da469/tools/cygprofile/cluster_unittest.py
[modify] https://crrev.com/91df979561d344ab4304c21a115caa99a07da469/tools/cygprofile/orderfile_generator_backend.py
[delete] https://crrev.com/0f6f00aca4fb7794ba38a36aaa806f83ca065f5b/tools/cygprofile/phased_orderfile.py
[delete] https://crrev.com/0f6f00aca4fb7794ba38a36aaa806f83ca065f5b/tools/cygprofile/phased_orderfile_unittest.py
[modify] https://crrev.com/91df979561d344ab4304c21a115caa99a07da469/tools/cygprofile/process_profiles.py
[modify] https://crrev.com/91df979561d344ab4304c21a115caa99a07da469/tools/cygprofile/process_profiles_unittest.py
[modify] https://crrev.com/91df979561d344ab4304c21a115caa99a07da469/tools/cygprofile/profile_android_startup.py

Cc: pasko@chromium.org mattcary@chromium.org
 Issue 866884  has been merged into this issue.
Status: Fixed (was: Started)
Blocking: 913436
Project Member

Comment 15 by bugdroid1@chromium.org, Dec 13

Labels: merge-merged-3626
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b2c4a71349008ce7db8116a6c97232d4b0bdcb04

commit b2c4a71349008ce7db8116a6c97232d4b0bdcb04
Author: Matthew Cary <mattcary@chromium.org>
Date: Thu Dec 13 13:53:18 2018

Orderfile: Simple call-graph-based orderfile.

This CL adds simple function call graph clustering. The call graph is
inferred from the function ordering offset list, and the orderfile is
produced by clustering based on this inferred call graph.

In local testing, this change recovers the speedometer performance
lost in the bug. Startup is not affected, and memory shows similar
improvement to the original system_health orderfile.

Bug:  894827 
Change-Id: Iedab652d0f25c9c9ea481c2a75214cac41dee368
Reviewed-on: https://chromium-review.googlesource.com/c/1350876
Commit-Queue: Matthew Cary <mattcary@chromium.org>
Reviewed-by: Benoit L <lizeb@chromium.org>
Reviewed-by: Egor Pasko <pasko@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#612644}(cherry picked from commit 91df979561d344ab4304c21a115caa99a07da469)
Reviewed-on: https://chromium-review.googlesource.com/c/1375724
Reviewed-by: Matthew Cary <mattcary@chromium.org>
Cr-Commit-Position: refs/branch-heads/3626@{#325}
Cr-Branched-From: d897fb137fbaaa9355c0c93124cc048824eb1e65-refs/heads/master@{#612437}
[add] https://crrev.com/b2c4a71349008ce7db8116a6c97232d4b0bdcb04/tools/cygprofile/cluster.py
[add] https://crrev.com/b2c4a71349008ce7db8116a6c97232d4b0bdcb04/tools/cygprofile/cluster_unittest.py
[modify] https://crrev.com/b2c4a71349008ce7db8116a6c97232d4b0bdcb04/tools/cygprofile/orderfile_generator_backend.py
[delete] https://crrev.com/5958ebb18d6c7d40fd4fa8969a287ca3bca1aaa9/tools/cygprofile/phased_orderfile.py
[delete] https://crrev.com/5958ebb18d6c7d40fd4fa8969a287ca3bca1aaa9/tools/cygprofile/phased_orderfile_unittest.py
[modify] https://crrev.com/b2c4a71349008ce7db8116a6c97232d4b0bdcb04/tools/cygprofile/process_profiles.py
[modify] https://crrev.com/b2c4a71349008ce7db8116a6c97232d4b0bdcb04/tools/cygprofile/process_profiles_unittest.py
[modify] https://crrev.com/b2c4a71349008ce7db8116a6c97232d4b0bdcb04/tools/cygprofile/profile_android_startup.py

Labels: CommitLog-Audit-Violation Merge-Without-Approval M-72
Here's a summary of the rules that were executed: 
 - OnlyMergeApprovedChange: Rule Failed -- Revision b2c4a71349008ce7db8116a6c97232d4b0bdcb04 was merged to refs/branch-heads/3626 branch with no merge approval from a TPM! 
Please explain why this change was merged to the branch!
 - AcknowledgeMerge: Notification Required -- 
Labels: Merge-Merged-72-3626
The following revision refers to this bug: 
https://chromium.googlesource.com/chromium/src.git/+/b2c4a71349008ce7db8116a6c97232d4b0bdcb04

Commit: b2c4a71349008ce7db8116a6c97232d4b0bdcb04
Author: mattcary@chromium.org
Commiter: mattcary@chromium.org
Date: 2018-12-13 13:53:18 +0000 UTC

Orderfile: Simple call-graph-based orderfile.

This CL adds simple function call graph clustering. The call graph is
inferred from the function ordering offset list, and the orderfile is
produced by clustering based on this inferred call graph.

In local testing, this change recovers the speedometer performance
lost in the bug. Startup is not affected, and memory shows similar
improvement to the original system_health orderfile.

Bug:  894827 
Change-Id: Iedab652d0f25c9c9ea481c2a75214cac41dee368
Reviewed-on: https://chromium-review.googlesource.com/c/1350876
Commit-Queue: Matthew Cary <mattcary@chromium.org>
Reviewed-by: Benoit L <lizeb@chromium.org>
Reviewed-by: Egor Pasko <pasko@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#612644}(cherry picked from commit 91df979561d344ab4304c21a115caa99a07da469)
Reviewed-on: https://chromium-review.googlesource.com/c/1375724
Reviewed-by: Matthew Cary <mattcary@chromium.org>
Cr-Commit-Position: refs/branch-heads/3626@{#325}
Cr-Branched-From: d897fb137fbaaa9355c0c93124cc048824eb1e65-refs/heads/master@{#612437}
May I pls know why this change at #15 got merged to M72 without merge request and approval?
The merge request was made in crbug.com/913436; the CL was tagged with both bugs.

Labels: -CommitLog-Audit-Violation -Merge-Without-Approval
Removing "Merge-Without-Approval" & "CommitLog-Audit-Violation" per comment #19.

Sign in to add a comment