Issue 882290


Issue metadata

Status: Assigned
Owner:
Cc:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 3
Type: Feature



Enable native function layout optimization for Android Chrome on x86

Project Member Reported by hong.zh...@intel.com, Sep 10

Issue description

Native function layout optimization can reduce the stall cycles caused by instruction cache misses and instruction TLB misses by improving native code locality. We ran some experiments to evaluate its impact for x86 Android Chromium on ChromeOS ARC++, and observed that Speedometer2 and CanvasMark performance can be improved by 1.4% and 5.8% respectively, without binary size or peak memory impact. Details are in https://docs.google.com/document/d/1YeOZ44YkI0tfIEgIn-OMCH_4W-xJgi7Rc5r98OTVovQ
This issue is to track the optimization.
 
Cc: -g...@google.com g...@chromium.org
Labels: Type-Feature
Owner: g...@chromium.org
Status: Assigned (was: Unconfirmed)
Thanks for looking into this!

I'll take it for the moment.

It'll take some chatting with our Android friends, since we already do similar things for ARM (with the goal of reducing memory footprint and improving startup, so it's unclear if it's ideal to use the same layout techniques on both platforms); I'll try to figure out who to tag and such shortly.
Cc: pasko@chromium.org
One of the Android friends here ..

We only recently started caring about Speedometer2 when doing code ordering. The main goal being not to regress it :) Improving the score with better code ordering is getting more interesting, but primarily we would like to not regress memory and startup.

Can I haz access to the doc?
Hi pasko@, thanks for the information. Access is granted, please review.
Hi pasko@, we evaluated the function layout optimization's impact on WebView and updated the data in the doc. From the data we can see the optimization improves WebView performance by more than 1% on Speedometer2 and CanvasMark without obvious binary size or memory consumption impact. Do you think function layout optimization for Android Chromium/WebView is doable? If we need to do more evaluation, please feel free to tell us. Thanks!
Cc: lizeb@chromium.org mattcary@chromium.org benhenry@chromium.org
Thanks for the experimentation! Really interesting and quite promising!

I left a few comments in the document, but I prefer to put high-level conclusions and thoughts here in the bug.

I think Speedometer2 and memory consumption (the amount of code resident in memory) are reasonable targets for x86 ARC++ WebView. On the other hand, CanvasMark is in a questionable niche: there are many microbenchmarks like that, profiling all of them would be a burden, and they may hurt one another such that the end result is less beneficial to each. Also there is less clarity on how microbenchmarks affect user experience.

I did not check --call-graph-ordering-file before, thanks for experimenting! Can you share some details on how you acquired cgprofile.txt? Is it derived from LBR?

FYI: With -Wl,-section-ordering-file piped from tools/cygprofile/orderfile_generator_backend.py we are also improving Speedometer2 scores for ARM. I think it would be easy to add support for x86, if we have hardware to profile on. Have you considered comparing the performance gains?
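To make the difference between the two inputs concrete: a call-graph ordering file holds weighted caller/callee edges that the linker clusters itself, while a symbol ordering file is just a flat list of symbols in the desired layout order. The sketch below is only meant to illustrate the two formats (it assumes the usual "caller callee weight" line format, it is not lld's actual clustering pass, and orderfile.txt is a made-up name):

    # Illustration only: naively flatten a call-graph profile ("caller callee
    # weight" per line) into a symbol ordering file (one symbol per line) by
    # visiting the heaviest edges first. lld's real pass does proper cluster
    # merging instead of this greedy walk.
    def call_graph_to_symbol_order(cg_lines):
        edges = []
        for line in cg_lines:
            caller, callee, weight = line.split()
            edges.append((int(weight), caller, callee))
        order, seen = [], set()
        for _, caller, callee in sorted(edges, reverse=True):
            for sym in (caller, callee):
                if sym not in seen:
                    seen.add(sym)
                    order.append(sym)
        return order

    # Hypothetical file names:
    with open('cgprofile.txt') as f:
        symbol_order = call_graph_to_symbol_order(f)
    with open('orderfile.txt', 'w') as f:
        f.write('\n'.join(symbol_order) + '\n')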

The most important piece to figure out is how we are going to maintain the profile. We have a bot that re-benchmarks everything a couple of times per day, uploads profiles and updates them in the internal repository. This bot requires some non-trivial maintenance at times, we probably would not go through this exercise for 1% improvements on WebView/ARC++. I am also subjective here, a wider consensus is needed (gbiv@, lizeb@, mattcary@, benhenry@ are the other stakeholders, I could be also forgetting someone .. sorry in advance).

Finally: can a copy of the document about the experiment be made public? It would help with reaching broader consensus. I can create a copy for you attached to the @chromium.org organization, if you want.
Note that the very naive function layout from orderfile_generator got a ~10% Speedometer2 improvement on Android ARM devices.

+1 to the difficulties of profile maintenance, depending on how complicated generating the profile is. But it's manageable especially if it's based on known chromium technologies like telemetry.

+1 also for a public doc, I've already requested access :)
From offline conversation with lizeb: applying the ARM profile to x86 is easy.

On the other hand, there is no regular benchmark coverage for it, so even if we improve it now it may rot, and, as we know with orderfile, may actually make things worse than no artificial ordering at all.
FWIW, my plan going forward with this was to apply the CrOS AFDO profiles here. We've been working with Intel on how to do this and on potential tweaks to our profiling pipeline that'd be needed, and it all sounds pretty promising.

The concern for me was lack of substantial testing on x86 Android for Chrome. (so +100 pasko@)

As luck would have it, CrOS also wants to turn this all on for Chrome/CrOS. If we use the same bits as them (they're swapping to lld very soon, we're sourcing our profiles from them, etc), I personally think that we only need to track the perf on one of the two platforms to be confident that this is a perf win on both. Naturally, better testing would be better, but that's always the case.

If we also want to apply this to the ARM side, that story changes quite a bit, since Android has an amazing testing story on ARM. :) But if we're talking x86-only, ...
Hi all, thanks for your comments.

For CanvasMark, let me clarify that we mainly want to improve Canvas performance because it is an important element in HTML5. Because CanvasMark is a popular Canvas benchmark, we take it as an indicator for evaluating the optimization. If you have better candidates, we can also try them :)

For cgprofile.txt generation: it is generated from LBR sampling with the instructions below.
1. Collect LBR sampling
    perf record -ag -e r20c4 -c 500000 -b
2. Build create_lld_profile tool (https://docs.google.com/document/d/10iVjnIrcTd8V5X42Z2Ot3XTlqOqqVpOTgnXs9-esNM4/edit#bookmark=id.kzg7r8vpclch)
    2.1 Clone the git repo from GitHub.
        git clone https://github.com/pdeng6/autofdo.git
        cd autofdo
        git checkout lld-cgprofile
        // This repo is forked from https://github.com/pcc/autofdo and contains two extra fixes. Once these fixes are submitted, pcc's repo will work too.
    2.2 Build the project
        ./configure
        make
3. Use create_lld_profile to generate cgprofile.txt
    ./create_lld_profile -binary chrome perf.data
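
As a quick sanity check before handing cgprofile.txt to the linker, something like the sketch below can print the heaviest caller/callee edges (it assumes one "caller callee weight" triple per line, which may differ from what create_lld_profile actually emits):

    # Print the ten heaviest caller/callee edges from cgprofile.txt.
    import collections

    edges = collections.Counter()
    with open('cgprofile.txt') as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue
            caller, callee, weight = parts
            edges[(caller, callee)] += int(weight)

    for (caller, callee), weight in edges.most_common(10):
        print(weight, caller, '->', callee)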

For your similar optimization on ARM, could you share more details? We would like to try it on the x86 platform.

For document sharing, it is OK to share it publicly. We (the Intel domain) don't have permission to do that ourselves, so thanks in advance.
The current optimization on ARM is only function ordering via lld. Note that we haven't tested it with the ThinLTO optimizations that were recently turned on.

The profiling and orderfile generation script can be found in tools/cygprofile/orderfile_generator_backend.py. This is what is used to create the downstream (clank) orderfile; the only things missing from the public directory are the git logistics and the uploading of the orderfile to cloud storage. The orderfiles generated by running orderfile_generator_backend.py are identical to what we currently use downstream.

We are in the process of launching orderfiles generated with --system-health-profiling, which improve memory usage by 2-4 MB and Speedometer2 by 1-2% on top of the 7-10% seen with the current orderfile.

The profile is generated by instrumenting the binary using code in base/android/orderfile. It's currently arm-only for reasons discussed below, but the code will compile and work correctly.

What won't compile on x86 is base/android/library_loader/anchor_functions.cc, because we use an asm() intrinsic with junk values to create unique functions. These functions are only used for madvise'ing text and you can probably just ignore them, or use the appropriate data() intrinsic (I don't remember offhand the right compiler intrinsics).

What is more of a problem for x86 is the post-processing of the instrumentation. The instrumentation dumps a list of offsets of functions in the order they were first called. Our post-processing assumes things like functions being at least 4 bytes long (which is true on arm, even with thumb instructions, because of the instrumentation). This may not be true on x86 with near relative jumps, so be careful. Also, we assume functions are aligned; functions with odd offsets refer to thumb instructions, which our processing accounts for (see tools/cygprofile/process_profiles.py).
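
A simplified sketch of the ARM-specific handling described above (the real logic lives in tools/cygprofile/process_profiles.py; the function name and inputs below are made up for illustration):

    def offsets_to_symbols(offsets, symbol_by_offset):
        # offsets: raw function offsets dumped by the instrumentation, in
        # first-call order. symbol_by_offset: dict of function start -> name.
        symbols, seen = [], set()
        for offset in offsets:
            # On ARM, an odd offset marks a thumb function; clear the low bit
            # to recover the real function start. This assumption does not
            # carry over to x86, where function starts need not be aligned.
            if offset & 1:
                offset &= ~1
            name = symbol_by_offset.get(offset)
            if name and name not in seen:
                seen.add(name)
                symbols.append(name)
        return symbols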

There may be other assumptions of that sort buried in there, so caveat emptor :)

Let me know if you have other questions.
Status: WontFix (was: Assigned)
Based on a few internal discussions, I think it would be fair to say that we are currently not prioritizing WebView+x86+ARC++ highly enough to have continuous benchmarking on chromeperf.appspot.com or elsewhere on the Google side.

Without this continuous benchmarking, even if Intel committed to continuous benchmarking and code ordering generation, we don't see how it would be feasible for us in Chrome to integrate it into our build.
Why is benchmarking on CrOS insufficient if we use identical profiles and identical linkers/etc?
Status: Assigned (was: WontFix)
So we chatted offline, and yeah, pasko's right: it's not feasible at the moment to spin up a special orderfile pipeline, in large part because it'd be necessary to track the performance granted by said pipeline. Full support for that has quite a high cost.

That said, CrOS is looking into pursuing this for the host. As long as we do the bit-for-bit same thing that CrOS is doing, since we're running on literally the same hardware/etc., any perf issues that arise as a result of this will very likely be caught by CrOS' testing infra.

Hence, we don't feel that a full testing pipeline/etc. would be required in that case. So, if copying what CrOS does proves to be helpful and low-maintenance, I believe we're receptive to doing so.

Reopening until CrOS actually has orderfile goodies so we can evaluate them on x86/Android. :)
