Monorail Project: v8 Issues People Development process History Sign in
New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.
Issue 2935 Poor performance in Ember app Discourse
Starred by 137 users Project Member Reported by paulir...@chromium.org, Oct 16 2013 Back to list
Status: Assigned
Owner:
Cc:
Components:
HW: All
OS: All
Priority: High
Type: FeatureRequest


Sign in to add a comment
Chrome on Android is performing poorly in the Discourse app
Thread:  http://meta.discourse.org/t/why-is-discourse-so-slow-on-android/8823
Video: http://www.youtube.com/watch?v=Jd7i2NeZARU 

All we've been able to determine is that Chrome is spending all of its time in v8.callFunction.

In the following zip are recordings for two actions:
* loading a new topic
* going back to home view

The exported recordings:
* DevTools timeline
* CPU profile 
* tracing trace

https://dl.dropboxusercontent.com/u/39519/temp/discourse-profiles.zip

To test: 

A running instance of Discourse is at http://meta.discourse.org - to see the slowness I suggest entering a topic with at least 20 posts in it.
The entire source is available at https://github.com/discourse/discourse - if you want to run it in development mode, there is a VM to make that easy. 
 
Project Member Comment 1 by yu...@chromium.org, Oct 16 2013
Cc: yu...@chromium.org loislo@chromium.org
Project Member Comment 2 by a...@chromium.org, Oct 16 2013
Cc: a...@chromium.org
Project Member Comment 3 by loislo@chromium.org, Oct 16 2013
Cc: u...@chromium.org yangguo@chromium.org
Owner: danno@chromium.org
I saw nothing interesting in the CPU profile.

v8 logs from android.

The test site was http://bbs.boingboing.net
logcat.v8
82.1 KB View Download
Project Member Comment 4 by yu...@chromium.org, Oct 17 2013
Cc: jochen@chromium.org
Project Member Comment 5 by danno@chromium.org, Oct 17 2013
Owner: bmeu...@chromium.org
Project Member Comment 6 by bmeu...@chromium.org, Nov 12 2013
Cc: -u...@chromium.org bmeu...@chromium.org
Owner: u...@chromium.org
Does not seem to be reproducible on ARM Chromebook. Maybe Android/Clank related. Ulan can you provide more insight on this?
Project Member Comment 7 by rmcilroy@chromium.org, Nov 12 2013
I took a timeline trace of http://meta.discourse.org - the attached files show the time during which I selected a topic from the menu and experienced the loading pause. There does seem to be a lot of recompile async / recompile sync going on in this trace - maybe the code is triggering a lot of deopts somehow?
ember-timer-events.png
67.4 KB View Download
v8-ember.log
5.2 MB Download
Comment 8 by gart...@gmail.com, Nov 12 2013
Here's a simplified example to trace: http://emberjs.jsbin.com/oBaQARI/1
The last item taking ~3 sec to render on Android.
Project Member Comment 9 by yangguo@chromium.org, Nov 12 2013
The plot does not suggest that this has anything to do with deopts. In fact, deopts rarely cause any execution pauses. It does cause slow execution though, since we resort to unoptimized code.

What I can tell from the plot is that at the time of the pause, both the optimizing compiler thread and the main thread are compiling something for an equally long time, so it's probably the same function they are compiling. The optimizing compiler thread is doing the regular recompilation, while the main thread - after waiting for a short while for the compiler thread - decides to take the matter in its own hands by OSR. This OSR is what is blocking the main thread. The reason compilation, regardless of OSR or regular one, is taking so long, can have several reasons:

- The function to be compiled is simply very large.
- There is a bug in the optimizing compiler, causing it to waste time.
- The function source is structured in a particular way that causes one of the optimizing compiler's phase to run into a worse-case behavior, taking a lot of time.

Giving that OSR is disabled for very large functions, I would put my bet on the latter two causes. You could do this to verify and debug:

- Run chrome with the js-flag --nouse-osr or (the experimental) --concurrent-osr. This way, OSR either is disabled or runs on the concurrent thread, hence not blocking any longer.
- Run chrome with the js-flag --trace-opt to find out which function is the one taking that long, while having concurrent recompilation disabled.
- Then use --hydrogen-filter to single that function out for optimization. Add --hydrogen-stats to see which phase is taking up all that time.

Note that due to issue 2991, I don't think Chrome on Android is actually running a concurrent compiler thread. I haven't had the chance to verify this yet. This doesn't change anything I wrote though.

Project Member Comment 10 by rmcilroy@chromium.org, Nov 13 2013
I ran a trace with the simplified example in #8 (attached).  From this trace, it looks like the recompile (sync/async) is a reasonable overhead (16% or so), but I'm not sure this is the cause of the ~3 second pause in rendering the page - code execution is taking a much larger amount (67.5%) of the trace time.  Running with --nouse-osr or --concurrent-osr didn't seem to make much impact on the trace or time taken to render.

In any case, after using --trace-opt, most functions were in the sub-ms range to compile, with only a couple reaching tens of ms. One of the longest functions to compile is the "get" function.  --hydrogen-stats results for this below:  

I/v8      (20623): [marking 
I/v8      (20623): 0x2aaa4611 <JS Function get (SharedFunctionInfo 0x2aa43b99)>
I/v8      (20623):  for recompilation, reason: hot and stable
I/v8      (20623): , ICs with typeinfo: 28/29 (96%)
I/v8      (20623): ]
I/v8      (20623): [optimizing 
I/v8      (20623): 0x2aaa4611 <JS Function get (SharedFunctionInfo 0x2aa43b99)>
I/v8      (20623):  - took 15.839, 71.991, 6.440 ms]
I/v8      (20623): Timing results:
I/v8      (20623):                 H_Block building
I/v8      (20623):    21.904 ms / 14.2 % 
I/v8      (20623):    1024364 bytes / 28.7 %
I/v8      (20623):                 H_Block ordering
I/v8      (20623):     0.947 ms /  0.6 % 
I/v8      (20623):      39052 bytes /  1.1 %
I/v8      (20623):              H_Assign dominators
I/v8      (20623):     0.307 ms /  0.2 % 
I/v8      (20623):         72 bytes /  0.0 %
I/v8      (20623):  H_Environment liveness analysis
I/v8      (20623):     5.833 ms /  3.8 % 
I/v8      (20623):      54812 bytes /  1.5 %
I/v8      (20623):      H_Redundant phi elimination
I/v8      (20623):     1.223 ms /  0.8 % 
I/v8      (20623):       4628 bytes /  0.1 %
I/v8      (20623):         H_Mark unrechable blocks
I/v8      (20623):     0.670 ms /  0.4 % 
I/v8      (20623):          0 bytes /  0.0 %
I/v8      (20623):          H_Dead code elimination
I/v8      (20623):     3.781 ms /  2.5 % 
I/v8      (20623):      18936 bytes /  0.5 %
I/v8      (20623):                H_Escape analysis
I/v8      (20623):     0.789 ms /  0.5 % 
I/v8      (20623):       4628 bytes /  0.1 %
I/v8      (20623):          H_Infer representations
I/v8      (20623):     1.409 ms /  0.9 % 
I/v8      (20623):       8440 bytes /  0.2 %
I/v8      (20623):      H_Merge removable simulates
I/v8      (20623):     1.190 ms /  0.8 % 
I/v8      (20623):        968 bytes /  0.0 %
I/v8      (20623):   H_Mark deoptimize on undefined
I/v8      (20623):     0.306 ms /  0.2 % 
I/v8      (20623):       7744 bytes /  0.2 %
I/v8      (20623):         H_Representation changes
I/v8      (20623):     2.439 ms /  1.6 % 
I/v8      (20623):      48248 bytes /  1.4 %
I/v8      (20623):                H_Inferring types
I/v8      (20623):     0.609 ms /  0.4 % 
I/v8      (20623):       4852 bytes /  0.1 %
I/v8      (20623): H_Compute safe UInt32 operations
I/v8      (20623):     0.243 ms /  0.2 % 
I/v8      (20623):       1936 bytes /  0.1 %
I/v8      (20623):                   H_Canonicalize
I/v8      (20623):     1.252 ms /  0.8 % 
I/v8      (20623):          0 bytes /  0.0 %
I/v8      (20623):         H_Global value numbering
I/v8      (20623):     6.589 ms /  4.3 % 
I/v8      (20623):     614908 bytes / 17.2 %
I/v8      (20623):                 H_Range analysis
I/v8      (20623):     1.706 ms /  1.1 % 
I/v8      (20623):      50264 bytes /  1.4 %
I/v8      (20623): H_Compute change undefined to nan
I/v8      (20623):     0.702 ms /  0.5 % 
I/v8      (20623):          0 bytes /  0.0 %
I/v8      (20623):      H_Compute minus zero checks
I/v8      (20623):     2.443 ms /  1.6 % 
I/v8      (20623):        984 bytes /  0.0 %
I/v8      (20623):        H_Stack check elimination
I/v8      (20623):     0.243 ms /  0.2 % 
I/v8      (20623):          0 bytes /  0.0 %
I/v8      (20623):      H_Bounds checks elimination
I/v8      (20623):     1.035 ms /  0.7 % 
I/v8      (20623):      30324 bytes /  0.9 %
I/v8      (20623):     H_Dehoist index computations
I/v8      (20623):     1.894 ms /  1.2 % 
I/v8      (20623):          0 bytes /  0.0 %
I/v8      (20623):          H_Restore actual values
I/v8      (20623):     0.431 ms /  0.3 % 
I/v8      (20623):          0 bytes /  0.0 %
I/v8      (20623):                 L_Building chunk
I/v8      (20623):     8.030 ms /  5.2 % 
I/v8      (20623):     674816 bytes / 18.9 %
I/v8      (20623):           L_Register constraints
I/v8      (20623):     4.728 ms /  3.1 % 
I/v8      (20623):     179508 bytes /  5.0 %
I/v8      (20623):                   L_Resolve phis
I/v8      (20623):     0.977 ms /  0.6 % 
I/v8      (20623):      21332 bytes /  0.6 %
I/v8      (20623):              L_Build live ranges
I/v8      (20623):    15.684 ms / 10.2 % 
I/v8      (20623):     695804 bytes / 19.5 %
I/v8      (20623):     L_Allocate general registers
I/v8      (20623):    40.037 ms / 26.0 % 
I/v8      (20623):      12940 bytes /  0.4 %
I/v8      (20623):      L_Allocate double registers
I/v8      (20623):     0.731 ms /  0.5 % 
I/v8      (20623):          0 bytes /  0.0 %
I/v8      (20623):          L_Populate pointer maps
I/v8      (20623):     1.221 ms /  0.8 % 
I/v8      (20623):       5572 bytes /  0.2 %
I/v8      (20623):                 L_Connect ranges
I/v8      (20623):     0.549 ms /  0.4 % 
I/v8      (20623):       1936 bytes /  0.1 %
I/v8      (20623):           L_Resolve control flow
I/v8      (20623):    10.618 ms /  6.9 % 
I/v8      (20623):         72 bytes /  0.0 %
I/v8      (20623):              L_Mark empty blocks
I/v8      (20623):     1.160 ms /  0.8 % 
I/v8      (20623):          0 bytes /  0.0 %
I/v8      (20623):                Z_Code generation
I/v8      (20623):    12.058 ms /  7.8 % 
I/v8      (20623):      59692 bytes /  1.7 %
I/v8      (20623): -------------------------------------------------------------------------------
I/v8      (20623):                     Create graph   15.839 ms / 16.8 % 
I/v8      (20623):                   Optimize graph   71.991 ms / 76.4 % 
I/v8      (20623):        Generate and install code    6.440 ms /  6.8 % 
I/v8      (20623): -------------------------------------------------------------------------------
I/v8      (20623):                            Total   94.270 ms (386.4 times slower than full code gen)
I/v8      (20623):            Average per kB source  103.023 ms           3806.395 kB allocated


I'm not sure what is normal on these stats, does anything stick out to you Yang?
ember-timer-events.png
63.2 KB View Download
Project Member Comment 11 by yangguo@chromium.org, Nov 14 2013
The proportions seem sane. Though live range analysis and register allocation seem to take awfully long time. And being 386.4 times slower than full code gen also seems odd (I'm used to values between 10-30).
For reference, a year later the numbers continue to be quite poor for Android Users. In this same time frame iOS users have seen really nice performance improvements.

Obviously ember can (and continuous) to improve in performance, but the disparity between high-end Android devices and iOS devices seems to be growing at an alarming rate.

some comparisons from the following discussion: http://discuss.emberjs.com/t/why-is-ember-3x-5x-slower-on-android/6577

iPad Air 2 -- 224 ms
iPhone 6+ -- 252 ms
iPhone 5s -- 300 ms
Galaxy S5 -- 1297 ms
Nexus 5 -- 1426 ms
Nexus 7 -- 1930 ms

Surface Pro 3 core i5
IE 11 (x64, touch) -- 613 ms
Firefox latest -- 312 ms
Chrome latest -- 260 ms

Desktop Core 4470
IE 11 x32 -- 390 ms
Firefox -- 245 ms
Chrome -- 181 ms
To be clear, the numbers above are for Android Chrome, not the old Android Browser.

We really seem to have a serious problem here. If you look in the linked thread, the conclusion the Discourse and Ember developers seem to be coming to is that iOS is eating our lunch on mobile performance in real-world apps, e.g.

> it really seems to me that JS is just piss poor in general on that platform.

> Android users legitimately have a poorer web experience.

> It is pretty funny although the v8 team despises micro-benchmarks, they are almost always the winners in such benchmarks. But when it comes to macro benchmarks (real world usage) JSC/Safari is really doing much better lately.

This is hard to believe given all the work we've been doing on mobile performance over the last year, but we have here at least one popular large website where the data is showing us to be inferior to iOS.

(Although, from the links, I haven't seen any evidence this is a V8 issue instead of e.g. a Blink issue. Perhaps this needs to be moved to the Chromium bug tracker.)
Here's the benchmark in question:

http://samsaffron.github.io/ember-performance/?ember=1.8.0

Sample results were posted above (two posts up) by Stefan.

We'd LOVE the Google team to help us with this. Our Android users are really suffering, and we still don't know why, even after a year of looking at this. 
Domenic if you check the dates on my original topic

https://meta.discourse.org/t/why-is-discourse-so-slow-on-android/8823

That's August 2nd, 2013. The first version of Chrome to ship with Blink was July 10th(ish) 2013. We were tracking this performance problem many months before that.

So this performance problem predates the Blink fork.. we were originally comparing two WebKit browsers with different JS runtimes. I believe that's still the problem, though if someone can provide evidence that Blink is the culprit, I'd love to see that evidence!
Which of those benchmarks are you running? There seems to a bunch of DOM related things in those tests which are not just JS.
@esphren the Complex List benchmark seems to produce the most accurate app results. It does include a lot of rendering -- if that is the problem we would like to know who to approach with that data.
Project Member Comment 18 by vegorov@chromium.org, Oct 28 2014
fwiw Safari 7.1 on my macbook seems to be twice as fast on the Complex List too. So there is something definitely going south here. 

If you look at the cpu profile it will be all full of yellow exclamation marks "optimized too many times", which is a bad sign.

I took a quick look a deopts at it seems that optimizations really never stabilize - I might be wrong but it seems like we are hitting the same issue I already reported for angular: first several times function deopts due to polymorphism as we inline more and more alternatives, then it stabilizes - but at this point GC comes, collects maps and optimized code and type feedback are thrown away due to "weak-code" dependency and the loop starts from scratch. 

These functions that don't stabilize are relatively big (400 smth blocks) so not being able to stabilize optimizations for it is a bad sign and an obvious potential performance drain.
Fantastic, if this is an issue for Angular as well perhaps a fix here will help everyone (and greatly improve many benchmarks as well).
Project Member Comment 20 by u...@chromium.org, Oct 29 2014
Status: Assigned
This slipped through. Started looking into it.
@Ulan 

keep in mind both the Ember team and Discourse team are open to improving our code base so it is friendlier to v8 if needed, just let us know what needs doing. 


Yes absolutely, we (ember.js) will gladly work to continue to improve the situation from our side. If you have actionable suggestions don't hesitate to reach out.

We have been slowly making progress on the perf side of things, still mountains of work ahead. If in your testing you notice common and solvable patterns please don't hesitate to reach out and share potential solutions.
Project Member Comment 23 by u...@chromium.org, Oct 30 2014
Blockedon: v8:3663 v8:3664 v8:3665
I looked at performance on MacBook and on Nexus7. Besides weak references in optimized code (described by Slava in comment #18), I found two more sources of poor performance:
- clearing of polymorphic ICs at each GC,
- code deoptimization caused by allocation site pretenuring.

I opened three issues to track each of these sources (v8:3663, v8:3664, v8:3665).

MacOS results:
- Chrome with default flags: 203.53ms
- Chrome without weak references, IC clearing, allocation site pretenuring [1]: 150.51ms
- Safari: 95.2ms

Nexus7 results:
- Chrome with default flags: 1845.93ms
- Chrome without weak references, IC clearing, allocation site pretenuring [1]: 1426.85ms

[1]: with the following flags: --js-flags="--noweak-embedded-maps-in-optimized-code --nocleanup-code-caches-at-gc --noallocation-site-pretenuring"

Note that there is still unexplained time difference on MacOS and even more on Nexus7. I am investigating this.

sam.saffron, stefan.penner: thanks! I am sure there is a way to make it friendly to the optimizing compiler. I'll get in touch once I have deeper understanding of Ember code. In the meantime I'd like to remove DOM from the equation and get standalone test that can be run in d8 (v8 shell). Do you think it would possible without investing huge effort?
@ulan, we have been slowly working towards allowing larger chunks of ember to run in the d8 shell, but still haven't made much process on this task.

What I have been doing (and it is terrible), is extracting problematic parts of ember and removing/patching until it works in d8. *shudder*

If there are parts you know you want to run in D8, I may be able to help with the leg work for you...
Project Member Comment 25 by u...@chromium.org, Nov 12 2014
I updated v8:3664 with a prototype fix, which reduces number of deoptimizations caused by weak references. To see noticeable performance improvement we need to also fix v8:3663, otherwise there are many deoptimizations caused by high degree of polymorphism. Not clearing ICs would allow generated code to reach the slow generic case faster and thus avoid deoptimizations. This however is a big project (several months) as it requires support for weak references in polymorphic ICs to avoid memory leaks.

In case anyone is wondering why there is high degree of polymorphism. Stefan and I discussed this offline. Here is what we found out in “Complex List” benchmark:

EachHelper generates new short living classes for each render. Stack trace:
  makeCtor (ember.1.8.0.js:33674)
  extend (ember.1.8.0.js:34036)
  collectionHelper (ember.1.8.0.js:8319)
  eachHelper (ember.1.8.0.js:8912)
  (anonymous function) (VM1610:61)
  __exports__.template (handlebars-v1.3.0.js:436)
  CoreView.extend.render (ember.1.8.0.js:40441)
  EmberRenderer_createElement (ember.1.8.0.js:37691)
  Renderer_renderTree (ember.1.8.0.js:10337)
V8 creates a hidden class for each of these Ember classes, which leads to high polymorphism in rendering and view  functions. When these short-living classes die, they cause deoptimization in all code that was optimized for them.

The hot Class() function is polymorphic in initProperties:
https://github.com/emberjs/ember.js/blob/master/packages/ember-runtime/lib/system/core_object.js#L115-L117
Since initProperties can have arbitrary shape, V8 deoptimizes this function a lot.

The meta objects are initialized inconsistently and form complex prototype chains, which results in many hidden classes:
https://github.com/emberjs/ember.js/blob/master/packages/ember-metal/lib/utils.js#L311-L313

In these cases, if instead of tailoring optimized code for a small set of hidden classes, V8 would emit slow generic code, that would give better performance by avoiding costly optimizations and deoptimizations. I think this also explains why performance difference on Android is bigger than performance difference on desktop. Since Android devices are slower than desktop, the relative cost of optimization/deoptimization on Android is higher.

We have ideas on how to handle these cases better in V8, but there is no quick fix.
Yup, thank to Ulan, I have been doing some experiments to remove some of these problem spots but while maintaining backwards compatibility. 

We will work on improving ember's side of things. Unfortunately a large amount of the excess map allocations are do to our malleable object model. I believe we can make improvements, but likely won't be able to remove these problematic areas entirely without breaking semver (and causing much community pain). I would like to push for reducing the flexibility of our object model anyways, good perf #'s will hopefully compensate future upgrade pains.

As I make progress I'll report back here. :)
Project Member Comment 27 by u...@chromium.org, Jan 16 2015
Update: issue 3663 was fixed. 

I re-ran the "Complex List" benchmark (http://samsaffron.github.io/ember-performance/?ember=1.8.0) in the MacOS Canary (41.0.2259.0) and Android ContentShell (at revision 310819):

Chrome 39 on MacOS: 200ms
Chrome 41 on MacOS: 158ms

Chrome 39 on Nexus 7: 1810ms
TOT Android ContentShell on Nexus 7: 1392ms

Note that this is almost the performance improvement that was predicted in comment #23.

Wow that's a big improvement: 21% faster on Chrome / Mac and 23% faster on Android. Well done, and I hope the pace continues!
I've been working on improving Ember's performance on this benchmark and just noticed that Canary was significantly faster than the latest stable release. Then the email notification came in from this issue and I found out why :)

Thanks!
Correction on the speedup percentages calculated by jatw...@codinghorror.com. The percentages given reflect how much less time is used, which is the inverse of how much faster the code now runs. That is, the code now uses 21% less time on Chrome/Mac, and 23% less time on Android.

That means that it is 26.6% faster on Chrome/Mac, and 30% faster on Android. (200/158 - 1, and 1810/1392 - 1).
Project Member Comment 31 by mvstan...@chromium.org, Jan 28 2015
Cc: mvstan...@chromium.org
Project Member Comment 32 by bmeu...@chromium.org, Jan 4 2016
Owner: bmeu...@chromium.org
Project Member Comment 33 by yu...@chromium.org, Jan 10 2016
Cc: -yu...@chromium.org
Project Member Comment 34 by danno@chromium.org, May 9 2016
Cc: cbruni@chromium.org
Project Member Comment 35 by bmeu...@chromium.org, May 17 2016
Owner: verwa...@chromium.org
Assigning to startup team for further investigation.
Project Member Comment 36 by bmeu...@chromium.org, Oct 8
Blockedon: 5495 5456
Owner: ishell@chromium.org
Project Member Comment 37 by bmeu...@chromium.org, Oct 9
Cc: vegorov@chromium.org stefan.p...@gmail.com kris.sel...@gmail.com
Components: Runtime GC Compiler
Labels: Performance TurboFan HelpWanted Crankshaft HW-All OS-All Priority-High Type-FeatureRequest
We've been hit hard by this problem, is there any news on the subject?
Project Member Comment 39 by bmeu...@chromium.org, Nov 17
mvstanton@ and ishell@ are working on this.
Sign in to add a comment