
Issue 863099

Starred by 4 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




Speed: Slow performance on older Atom processors, slow microbenchmarks

Reported by rrwinter...@gmail.com, Jul 12

Issue description

Chrome on older Atom processors performs poorly compared with Safari on iPads running iOS. Attached are a test case that reproduces the issue and the performance numbers we gathered. Most of the time appears to be spent in the interaction with Blink and in messaging. The code is fairly branchy, and a fair number of branch mispredicts and cache misses contribute to Chrome running slower on the older Atom processors. Still, the gap between Chrome and Safari looks mostly like a browser software architecture issue, with the DOM dynamically creating a significant number of objects. We are looking for comments on this test case and on potential optimizations.
 
PerfTestOrig.html (4.2 KB)
domtesting.pdf (556 KB)
Components: Blink>JavaScript>Runtime
Summary: Speed: Slow performance on older Atom processors, slow microbenchmarks (was: Speed:)
Given that you are running Safari on an Apple A8X CPU and Chrome (for Windows) on an Intel Atom E3845, how are you deriving an expectation for what the speed ratio "should" be?

P.S. Some of the text in the PDF appears to actually be bitmap data so I can't copy/paste it as text, which made writing this up trickier.

I'm not sure we have enough Atom customers to care about Atom performance. The fact that test 1:

	function test1(){
		var x = 0;
		for(var i = 0, ii = 1000000; i < ii; i++){
			x = x + 1;
		}
	}

is so much faster on Safari on the same hardware might be of interest, but maybe not - optimizing for micro-benchmarks is generally not as useful as optimizing for real-world scenarios. In particular, a sufficiently smart compiler could figure out that that function does nothing and could turn it into a NOP, but that would not help real web pages.
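To make the dead-code concern concrete, here is a minimal sketch of a variant whose result is actually consumed, so the loop has an observable effect. The names `test1Observable` and `checksum` are hypothetical and not part of the attached test case. An optimizer could still reduce such a loop to a closed form, so this only removes the simplest way for the benchmark to measure nothing.

```js
// Hypothetical variant of test1: it returns its result, so the loop
// cannot simply be discarded as code without effect.
function test1Observable(ii) {
  var x = 0;
  for (var i = 0; i < ii; i++) {
    x = x + 1;
  }
  return x;
}

// Accumulate the results across runs and use the sum afterwards.
var checksum = 0;
for (var run = 0; run < 10; run++) {
  checksum += test1Observable(1000000);
}
console.log("checksum: " + checksum);
```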

This feels like multiple bugs squished into one:
Atom is slower than A8X - I'm not sure if that matters due to the different CPU types and the small number of Atoms out there
test1() is slower in Chrome than Safari on the same hardware - not a realistic test
test4() is slower in Chrome than Safari on the same hardware - is it a realistic test?

Yes, it is not a real-world test scenario. As a little background: we provide HTML5-based HMIs (human-machine interfaces) for operating machines. These HMIs can be complex JavaScript-driven web applications. Our users complain about a poor user experience, with long-running script execution on Atom CPUs related to heavy DOM manipulation. We created this reduced benchmark to establish that the issue lies in the execution time of JavaScript code and is not a graphics issue.
Even with browser benchmarks such as JetStream and Speedometer 2.0 we see poor performance (see attachment). We compared against a test device (iPad Air 2) with a similar CPU speed and number of cores. Even though the devices differ in CPU architecture, OS and browser engine, we had expected similar values. In fact, the iPad is 2-3 times faster than the Atom devices. To compare the browser engines, we also tested a Mac Mini with Safari (macOS) and Chrome (Windows), but we saw no difference in performance. Furthermore, we tested the Atom devices with different operating systems (Windows, Linux and FreeBSD) without any significant difference. We therefore concluded that the issue is related to the Atom CPU.
We acknowledge that the Intel Atom is not a consumer CPU. But for IoT and industrial automation devices it will remain important for the next several years, and we see the same issue on the next Atom generation (Apollo Lake).
Do you see any Atom-specific optimizations (e.g. browser flags, compiler switches), now or in the future?
Thanks in advance,
Sven
Results_Speedometer_Jetstream.xls (26.5 KB)
Cc: bmeu...@chromium.org jarin@chromium.org
Could you provide the numbers you measured on the Apollo Lake system? I am going to look for general tests that can run on Apollo Lake, Windows and the A8X on iOS as a sanity check on the 2x-3x figure for general CPU performance. We know that the tests I ran on the BayTrail system had a lot of branch mispredicts, which improved significantly on Apollo Lake in my initial testing. We need to be careful about what we are comparing.
Labels: Pri-2
Status: Available (was: Unconfirmed)
We run each test ten times and report the mean values. Our JavaScript microbenchmark (see above) uses 100/1000 iterations, Speedometer 10 and JetStream 3.
I have updated the attachment with the results of our JavaScript microbenchmark and added the results for the Apollo Lake CPU (Intel Atom E3950). Unfortunately, we still see that the iPad is 2-3 times faster when we compare the benchmarks.
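The "run each test several times and report the mean" procedure can be sketched as follows. This is only an illustration under assumptions, not the code from the attached benchmark: `meanRuntimeMs` and its optional `now` clock parameter are hypothetical names, and `Date.now` is used as the default clock.

```js
// Hypothetical helper: run fn `runs` times and return the mean wall-clock
// time per run in milliseconds. `now` is an optional clock function.
function meanRuntimeMs(fn, runs, now) {
  now = now || Date.now;
  var total = 0;
  for (var r = 0; r < runs; r++) {
    var start = now();
    fn();
    total += now() - start;
  }
  return total / runs;
}

// Usage: mean over ten runs of a simple counting loop.
var meanMs = meanRuntimeMs(function () {
  var x = 0;
  for (var i = 0; i < 1000000; i++) {
    x = x + 1;
  }
  return x;
}, 10);
console.log("mean: " + meanMs + " ms");
```

Passing the clock as a parameter keeps the helper independent of the timer in use, so a higher-resolution timer can be substituted where one is available.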

Results_Speedometer_Jetstream_MiniJS_Benchmark.xls (36.5 KB)
Cc: -jarin@chromium.org mvstan...@chromium.org danno@chromium.org
Components: -Blink>JavaScript>Runtime Blink>JavaScript>Compiler
Labels: Arch-All
Owner: jarin@chromium.org
Status: Assigned (was: Available)
We took the test1 case for a ride with an isolated test case:

```js
function test1() {
  var x = 0;
  for(var i = 0, ii = 1000000; i < ii; i++) {
    x = x + 1;
  }
}

var start = Date.now();
for (var i = 0; i < 1000; ++i) test1();
var end = Date.now();
print("Time: " + (end - start) + " ms.");
```

Running this on my MBP I see the following results with JSC (from Technology Preview) and V8 (from last week):

======================
Time (d8):  664 ms.
Time (jsc): 278 ms.
======================

So JSC is significantly faster than V8 even on the same (beefy) processor. The reason is that DFG already generates a super tight loop consisting of only 4 instructions (FTL generates the same loop code later):

===========================
loop: inc %eax
      inc %ecx
      cmp $0xf4240, %ecx
      jl loop
===========================

Contrast that with the best code that TurboFan can generate here:

==============================================================
loop:      cmp rax,0xf4240
           jnc loop_done
           cmpq rsp,[r13+0xe88]
           jna stack_overflow
           leal rdx,[rax+0x1]
           movq rcx,rbx
           addl rcx,0x1
           jo deoptimize
           movq rax,rdx
           movq rbx,rcx
           jmp loop
loop_done:
==============================================================

Here we have a stack check inside, and for the x increment we do overflow checks. We also need to move stuff around because of the frame states that need access to the previous values.
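As an aside on why an overflow check on the increment is needed at all: JavaScript numbers do not wrap at 32 bits, so code that adds with a 32-bit instruction has to handle the case where the result leaves the int32 range. A minimal sketch of the language-level behavior, independent of any engine internals (the variable names are illustrative only):

```js
var INT32_MAX = 2147483647;

// Plain JavaScript addition keeps the exact value beyond the int32 range.
var exact = INT32_MAX + 1;

// Explicit truncation to int32 wraps around to the most negative value.
var wrapped = (INT32_MAX + 1) | 0;

console.log(exact, wrapped); // 2147483648 -2147483648
```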
Small correction: It's not the code that DFG generates, but the FTL code (the --dumpDFGDisassembly output is misleading).
Changing the test code to something like this

```js
function test1(ii) {
  var x = 0;
  for(var i = 0; i < ii; i++) {
    x = x + 1;
  }
}

var start = Date.now();
for (var i = 0; i < 1000; ++i) test1(1000000);
var end = Date.now();
print("Time: " + (end - start) + " ms.");
```

where TurboFan no longer sees a constant for the upper bound, the execution time in V8 goes up, whereas FTL still generates the 4 instruction loop:

======================
Time (d8):  791 ms.
Time (jsc): 277 ms.
======================
Cc: sigurds@chromium.org
Thanks for the feedback. I have tested with a constant as upper bound, but I see no significant improvement on my laptop (Core I7) or on our controller (Atom).

======================
Time (upper bound = 1000000) = 0.844ms (Core I7), 2.911ms (Atom)
Time (upper bound = ii, ii = 1000000) = 0.838ms (Core I7), 2.901ms (Atom)
======================

Nevertheless, test2 and test3 are more relevant for our use case of lots of DOM manipulations.

// -------------------------------------------
function test2(){
    var ii = 10000;
    for(var i = 0; i < ii; i++){
        document.body.appendChild(document.createElement('div'));
    }
}
// -------------------------------------------
function test3(){
    var div = document.createElement('div');
    var count = 0;

    while(count<10000){
        div.appendChild(document.createElement('div'));
        count++;
    }
    document.body.appendChild(div);
}

Please investigate this in your test environment. Thanks in advance.
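For anyone reproducing this outside the attached page, the two DOM patterns can be timed side by side with a small sketch like the one below. It is not the attached benchmark: `appendToLiveTree`, `appendDetachedSubtree`, `timeMs` and the `doc` parameter are hypothetical names, introduced so that the functions do not depend on a global `document`.

```js
// Pattern of test2: append each new element directly to the body.
function appendToLiveTree(doc, n) {
  for (var i = 0; i < n; i++) {
    doc.body.appendChild(doc.createElement('div'));
  }
}

// Pattern of test3: build a detached subtree first, then attach it once.
function appendDetachedSubtree(doc, n) {
  var container = doc.createElement('div');
  for (var i = 0; i < n; i++) {
    container.appendChild(doc.createElement('div'));
  }
  doc.body.appendChild(container);
}

// Wall-clock time of a single call, in milliseconds.
function timeMs(fn) {
  var start = Date.now();
  fn();
  return Date.now() - start;
}

// Only run the measurement where a DOM is available (i.e. in a browser).
if (typeof document !== 'undefined') {
  console.log("live tree: " + timeMs(function () { appendToLiveTree(document, 10000); }) + " ms");
  console.log("detached:  " + timeMs(function () { appendDetachedSubtree(document, 10000); }) + " ms");
}
```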

P.S.: In general we are satisfied with the browser performance on common consumer CPUs (Intel Core I3/5/7, ...). But we are interested in whether the browser performance can be improved on a less beefy Atom CPU.
Results_Speedometer_Jetstream_MiniJS_Benchmark.xls (36.5 KB)
PerfTestOrigV2.html (4.4 KB)
