Code ordering decreases speedometer 2.0 performance on Pixel |
||||||||
Issue descriptionFrom tests I did locally, there seems to be a sizable (>=10%) performance regression when using the current orderfile on Pixel. Reproduced by running speedometer 2.0 https://browserbench.org/Speedometer2.0/ with and without code ordering.
,
Oct 12
The hypothesis is that it's L1/L2 cache misses that are causing issues. The previous orderfile instrumentation was ordering symbols by (process, time). This implies that some locality is preserved, and orderfile instrumentation was improving speedometer performance. The hypothesis from that was that the performance improvement came either from fewer TLB misses, and/or L1/L2 cache misses. Separately, it has been observed by skyostil@ that code ordering lowers the number of L1 icache misses by ~half. However, the new instrumentation with "startup", "common" and "steady-state" sections uses a set() to track symbols, effectively randomizing their location within each section. The hypothesis is then that this would cause cache misses to spike. From a local speedometer 2.0 run on Pixel 2, with the "simple" (startup-only) orderfile: - Initial startup-only orderfile: 37.7 - Randomized ordering within the startup-only orderfile: 35.89
,
Oct 15
I did simple ordering of the system health orderfile which had no improvement on speedometer2 results (on pixel 2). ThinLTO was enabled for all tests. Note that the speedometer2 results vary by ~0.5 between runs. Initial startup-only orderfile: 37.4 Random startup-only orderfile: 35.9 Original system health orderfile: 35.3 Rank-ordered system health orderfile: 35.1 The rank ordering is done in crrev.com/c/1280262. The idea is that in each profile, we note the rank of that symbol in the order that it appeared in the profile as a number in [0,1), with 0 being the first symbol seen and 1 the last. This rank is averaged over all runs for each offset, and then used to order the offsets in the final orderfile. I suspect that this simple averaged ranking isn't doing a good job at maintaining function locality, from the point of icache locality we care less about whether something was called towards the beginning or end of an instrumentation run and more about what was called nearby it.
,
Oct 15
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/03642a2aea39a94f72995da5a3a922412cdf1ca2 commit 03642a2aea39a94f72995da5a3a922412cdf1ca2 Author: Matthew Cary <mattcary@chromium.org> Date: Mon Oct 15 09:08:08 2018 Orderfile: disable system health orderfile generation. The bugs below suggest that under thinLTO, the system health orderfile regresses memory, as well as performance (the latter probably even without thinLTO). This CL reverts back to the simpler orderfile. Bug: 758566,893981, 894827 Change-Id: I93c898dd7c8d55311ac3aed08815c610573c49c2 Reviewed-on: https://chromium-review.googlesource.com/c/1277804 Reviewed-by: Benoit L <lizeb@chromium.org> Commit-Queue: Matthew Cary <mattcary@chromium.org> Cr-Commit-Position: refs/heads/master@{#599592} [modify] https://crrev.com/03642a2aea39a94f72995da5a3a922412cdf1ca2/tools/cygprofile/orderfile_generator_backend.py
,
Oct 15
Another observation: LTO is not a major contributor here. All numbers are speedometer2 as above. startup-only orderfile without lto at all: 38.1 startup-only orderfile instrumented on non-lto, used to build lto: 38.9 sys health no-lto orderfile, on an lto build: 35.6. So startup-only orderfie is around 38 regardless of LTO, and sys health is 35.5 which is about the same as a random startup-only orderfile. There might be a performance difference between LTO and not, but I wouldn't bet on it given the variations in speedometer runs I've observed.
,
Oct 19
I made the orderfile use the base (non-orderfile/by file) ordering for code within startup/common/interactive regions. Below are the results, compared against the simple and randomized simple orderfile. The natural ordering recovers some but not all of the simple orderfile's speedometer performance. I wonder if the problem now is icache misses between the common and interactive regions. That might be mitigated by merging the common and interactive regions, although startup performance may still have a problem. natural system health (lto on) 36.70 36.84 original system health (lto on) 34.95 34.99 simple chrome (lto on) 37.34 36.91 random simple (lto on) 35.18 35.27
,
Nov 8
After the object-type symbol change (see crrev.com/c/1326486 and crbug.com/893981), using the natural ordering for the system health orderfile, the performance difference is still pronounced on the lower-end N5 (note that I've discovered that speedometer has error bars). legacy chrome N5 11.0 10.1 (+/- 0.72) legacy chrome N5x 13.8 11.3 +/- 2.3 syshealth chrome N5 8.79 +/- 0.24 9.20 +/- 0.55 syshealth chrome N5x 12.8 +/- 3.4 11.1 +/-2.2
,
Nov 12
After crrev.com/c/1329166 (dealing correctly halfword aligned symbols, which LTO seemed to exacerbate), speedometer performance seems to have recovered (although error is clearly large). Note this is *not* using the natural ordering to improve icache; that may not be as much of an issue as we thought. syshealth N5 10.1 +/- .58 9.3 +/- 0.62 N5x 17.0 +/- 2.7 13.8 +/- 2.8 legacy N5 9.8 +/- .6 9.37 +/- .33 N5x 13.4 +/- 3 15.1 +/-2.8
,
Nov 14
The difference between the correct halfword alignment is even bigger just comparing across the legacy orderfile. Due to problems with the N5 and N5x devices producing consistent results, here are some on Go and Pixel2. 0811 is before the halfword alignment fix, and 1311 is after that fix. However, syshealth performance is still not good. 0811-legacy go 6.985 +- .040 6.974 +- .046 p2 38.19 +-.17 37.89 +- .19 1311-legacy go 7.565 +- .061 7.519 +-.061 p2 40.8 +- .44 40.7 +-.34 1311-syshealth go 6.878 +-.034 (only one result as it's getting late) p2 37.49 +- .22 37.69 +-.23 I'm rolling clank to pick up this change and if it improves memory I'll cherry-pick it as well to the branch, if necessary just the legacy code ordering as syshealth still seems to be worse.
,
Nov 19
While trying to look at differences between legacy and syshealth performance, I ran an empty orderfile just to see what would happen. Here is the following (note that I re-ran on legacy & syshealth in case there's systematic changes with the phone or the benchmark website). All these were done with the P2. As a reminder, these are speedometer2 benchmarks empty 38.7 +-.42 38.53 +-.29 legacy 40.37 +-.31 40.59 +-.34 40.1 +-.2 syshealth 36.58 +-.3 36.51 +-.12 Note that legacy is a significant improvement on empty. This was a little surprising, as legacy is focused on startup and I wouldn't think that the wiki science page exercises much of the javascript hit in speedometer.
,
Nov 30
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/91df979561d344ab4304c21a115caa99a07da469 commit 91df979561d344ab4304c21a115caa99a07da469 Author: Matthew Cary <mattcary@chromium.org> Date: Fri Nov 30 14:35:15 2018 Orderfile: Simple call-graph-based orderfile. This CL adds simple function call graph clustering. The call graph is inferred from the function ordering offset list, and the orderfile is produced by clustering based on this inferred call graph. In local testing, this change recovers the speedometer performance lost in the bug. Startup is not affected, and memory shows similar improvement to the original system_health orderfile. Bug: 894827 Change-Id: Iedab652d0f25c9c9ea481c2a75214cac41dee368 Reviewed-on: https://chromium-review.googlesource.com/c/1350876 Commit-Queue: Matthew Cary <mattcary@chromium.org> Reviewed-by: Benoit L <lizeb@chromium.org> Reviewed-by: Egor Pasko <pasko@chromium.org> Cr-Commit-Position: refs/heads/master@{#612644} [add] https://crrev.com/91df979561d344ab4304c21a115caa99a07da469/tools/cygprofile/cluster.py [add] https://crrev.com/91df979561d344ab4304c21a115caa99a07da469/tools/cygprofile/cluster_unittest.py [modify] https://crrev.com/91df979561d344ab4304c21a115caa99a07da469/tools/cygprofile/orderfile_generator_backend.py [delete] https://crrev.com/0f6f00aca4fb7794ba38a36aaa806f83ca065f5b/tools/cygprofile/phased_orderfile.py [delete] https://crrev.com/0f6f00aca4fb7794ba38a36aaa806f83ca065f5b/tools/cygprofile/phased_orderfile_unittest.py [modify] https://crrev.com/91df979561d344ab4304c21a115caa99a07da469/tools/cygprofile/process_profiles.py [modify] https://crrev.com/91df979561d344ab4304c21a115caa99a07da469/tools/cygprofile/process_profiles_unittest.py [modify] https://crrev.com/91df979561d344ab4304c21a115caa99a07da469/tools/cygprofile/profile_android_startup.py
,
Dec 3
,
Dec 10
,
Dec 10
,
Dec 13
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b2c4a71349008ce7db8116a6c97232d4b0bdcb04 commit b2c4a71349008ce7db8116a6c97232d4b0bdcb04 Author: Matthew Cary <mattcary@chromium.org> Date: Thu Dec 13 13:53:18 2018 Orderfile: Simple call-graph-based orderfile. This CL adds simple function call graph clustering. The call graph is inferred from the function ordering offset list, and the orderfile is produced by clustering based on this inferred call graph. In local testing, this change recovers the speedometer performance lost in the bug. Startup is not affected, and memory shows similar improvement to the original system_health orderfile. Bug: 894827 Change-Id: Iedab652d0f25c9c9ea481c2a75214cac41dee368 Reviewed-on: https://chromium-review.googlesource.com/c/1350876 Commit-Queue: Matthew Cary <mattcary@chromium.org> Reviewed-by: Benoit L <lizeb@chromium.org> Reviewed-by: Egor Pasko <pasko@chromium.org> Cr-Original-Commit-Position: refs/heads/master@{#612644}(cherry picked from commit 91df979561d344ab4304c21a115caa99a07da469) Reviewed-on: https://chromium-review.googlesource.com/c/1375724 Reviewed-by: Matthew Cary <mattcary@chromium.org> Cr-Commit-Position: refs/branch-heads/3626@{#325} Cr-Branched-From: d897fb137fbaaa9355c0c93124cc048824eb1e65-refs/heads/master@{#612437} [add] https://crrev.com/b2c4a71349008ce7db8116a6c97232d4b0bdcb04/tools/cygprofile/cluster.py [add] https://crrev.com/b2c4a71349008ce7db8116a6c97232d4b0bdcb04/tools/cygprofile/cluster_unittest.py [modify] https://crrev.com/b2c4a71349008ce7db8116a6c97232d4b0bdcb04/tools/cygprofile/orderfile_generator_backend.py [delete] https://crrev.com/5958ebb18d6c7d40fd4fa8969a287ca3bca1aaa9/tools/cygprofile/phased_orderfile.py [delete] https://crrev.com/5958ebb18d6c7d40fd4fa8969a287ca3bca1aaa9/tools/cygprofile/phased_orderfile_unittest.py [modify] https://crrev.com/b2c4a71349008ce7db8116a6c97232d4b0bdcb04/tools/cygprofile/process_profiles.py [modify] https://crrev.com/b2c4a71349008ce7db8116a6c97232d4b0bdcb04/tools/cygprofile/process_profiles_unittest.py [modify] https://crrev.com/b2c4a71349008ce7db8116a6c97232d4b0bdcb04/tools/cygprofile/profile_android_startup.py
,
Dec 19
Here's a summary of the rules that were executed: - OnlyMergeApprovedChange: Rule Failed -- Revision b2c4a71349008ce7db8116a6c97232d4b0bdcb04 was merged to refs/branch-heads/3626 branch with no merge approval from a TPM! Please explain why this change was merged to the branch! - AcknowledgeMerge: Notification Required --
,
Dec 19
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b2c4a71349008ce7db8116a6c97232d4b0bdcb04 Commit: b2c4a71349008ce7db8116a6c97232d4b0bdcb04 Author: mattcary@chromium.org Commiter: mattcary@chromium.org Date: 2018-12-13 13:53:18 +0000 UTC Orderfile: Simple call-graph-based orderfile. This CL adds simple function call graph clustering. The call graph is inferred from the function ordering offset list, and the orderfile is produced by clustering based on this inferred call graph. In local testing, this change recovers the speedometer performance lost in the bug. Startup is not affected, and memory shows similar improvement to the original system_health orderfile. Bug: 894827 Change-Id: Iedab652d0f25c9c9ea481c2a75214cac41dee368 Reviewed-on: https://chromium-review.googlesource.com/c/1350876 Commit-Queue: Matthew Cary <mattcary@chromium.org> Reviewed-by: Benoit L <lizeb@chromium.org> Reviewed-by: Egor Pasko <pasko@chromium.org> Cr-Original-Commit-Position: refs/heads/master@{#612644}(cherry picked from commit 91df979561d344ab4304c21a115caa99a07da469) Reviewed-on: https://chromium-review.googlesource.com/c/1375724 Reviewed-by: Matthew Cary <mattcary@chromium.org> Cr-Commit-Position: refs/branch-heads/3626@{#325} Cr-Branched-From: d897fb137fbaaa9355c0c93124cc048824eb1e65-refs/heads/master@{#612437}
,
Dec 20
May I pls know why this change at #15 got merged to M72 without merge request and approval?
,
Dec 21
The merge request was made in crbug.com/913436; the CL was tagged with both bugs.
,
Dec 21
Removing "Merge-Without-Approval" & "CommitLog-Audit-Violation" per comment #19. |
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by pasko@chromium.org
, Oct 12