Speed: Slow performance on older Atom processors, slow microbenchmarks
Reported by
rrwinter...@gmail.com,
Jul 12
Issue description
There is a performance issue with Chrome on older Atom processors in comparison to Safari on iPads (iOS). Attached are a test case that reproduces the problem and the performance numbers gathered so far. Most of the time appears to be spent in the interaction with Blink and in messaging. The code is fairly branchy, and a fair number of branch mispredicts and cache misses contribute to Chrome running slower on the older Atom processors. Mostly, though, it looks like a difference in browser software architecture between Chrome and Safari, with the DOM dynamically creating a significant number of objects. Looking for comments on this test case and potential optimizations.
,
Jul 17
Yes, it is not a real test scenario. As a little background, we provide HTML5-based HMIs (human-machine interfaces) to operate machines. These HMIs can be complex JavaScript-driven web applications. Our users complain of a poor user experience, with long-running script execution on Atom CPUs related to lots of DOM manipulations. We created this reduced, simple benchmark to establish that the issue is related to the execution time of JavaScript code and not to a graphical issue. Even with browser benchmarks like JetStream and Speedometer 2.0, we see poor performance (see attachment). We compared it with a test device (iPad Air 2) with a similar CPU speed and number of cores. Even though the CPUs are based on different architectures, operating systems and browser engines, we had expected similar values. In fact, the iPad is 2-3 times faster than the Atom devices. We tested a Mac Mini with Safari (macOS) and Chrome (Windows) to compare the browser engines, but saw no difference in performance. Furthermore, we tested the Atom devices with different operating systems (Windows, Linux and FreeBSD) without any significant difference. We therefore came to the conclusion that the issue is related to the Atom CPU. We acknowledge that the Intel Atom is not a consumer CPU. But for IoT and industrial automation devices it will remain important for the next few years, and we see the same issue with the next Atom generation (Apollo Lake). Do you see any specific optimizations for Atom CPUs (e.g. browser flags, compiler switches), now or in the future? Thanks in advance, Sven
,
Jul 26
,
Jul 26
Could you provide the numbers you measured on the Apollo Lake system? I am going to look for general tests that can run on Apollo Lake, Windows and the A8X on iOS as a sanity check on the 2x-3x number for general CPU performance. We know that the tests I ran on the BayTrail system had a lot of branch mispredicts, which have improved significantly on Apollo Lake in my initial testing. We need to be careful about what we are comparing.
,
Jul 30
,
Jul 30
We run each test ten times and provide the mean values. Our JavaScript microbenchmark (see above) has 100/1000 iterations, Speedometer 10 and JetStream 3. I have updated the results for our JavaScript microbenchmark and added the results for the Apollo Lake CPU (Intel Atom E3950) to the attachment. Unfortunately, we still see that the iPad is 2-3 times faster when we compare the benchmarks.
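For readers who want to reproduce the "ten runs, mean value" methodology, here is a minimal sketch of such a harness. The helper name meanTimeMs and the choice of 100 iterations per run are assumptions made for this illustration; the actual harness used for the attached numbers is not shown in this thread.

```javascript
// Hypothetical helper (not from the attached test case): runs `fn`
// `iterationsPerRun` times per run, repeats that for `runs` runs,
// and returns the mean wall-clock time of one run in milliseconds.
function meanTimeMs(fn, runs, iterationsPerRun) {
  var total = 0;
  for (var r = 0; r < runs; r++) {
    var start = Date.now();
    for (var i = 0; i < iterationsPerRun; i++) fn();
    total += Date.now() - start;
  }
  return total / runs;
}

// Same loop body as test1 from the report.
function test1() {
  var x = 0;
  for (var i = 0, ii = 1000000; i < ii; i++) {
    x = x + 1;
  }
}

// Ten runs, as described above; 100 iterations per run is an assumption.
var mean = meanTimeMs(test1, 10, 100);
console.log("Mean over 10 runs: " + mean.toFixed(1) + " ms");
```

Date.now() only has millisecond resolution, so very short runs should use more iterations per run rather than fewer.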
,
Jul 31
We took the test1 case for a ride with an isolated test case:
```js
function test1() {
  var x = 0;
  for (var i = 0, ii = 1000000; i < ii; i++) {
    x = x + 1;
  }
}
var start = Date.now();
for (var i = 0; i < 1000; ++i) test1();
var end = Date.now();
print("Time: " + (end - start) + " ms.");
```
Running this on my MBP I see the following results with JSC (from Technology Preview) and V8 (from last week):
======================
Time (d8): 664 ms.
Time (jsc): 278 ms.
======================
So JSC is significantly faster than V8 even on the same (beefy) processor. The reason is that DFG already generates a super tight loop, which consists of only 4 instructions (FTL generates the same loop code later):
===========================
loop: inc %eax
inc %ecx
cmp $0xf4240, %ecx
jl loop
===========================
Contrast that with the best code that TurboFan can generate here:
==============================================================
loop: cmp rax,0xf4240
jnc loop_done
cmpq rsp,[r13+0xe88]
jna stack_overflow
leal rdx,[rax+0x1]
movq rcx,rbx
addl rcx,0x1
jo deoptimize
movq rax,rdx
movq rbx,rcx
jmp loop
loop_done:
==============================================================
Here we have a stack check inside, and for the x increment we do overflow checks. We also need to move stuff around because of the frame states that need access to the previous values.
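As a hedged illustration of why the increment needs an overflow check at all: JavaScript numbers are doubles, so x + 1 must never wrap around the way a raw 32-bit add does. The sketch below contrasts the two results. The names INT32_MAX and increment are made up for this example, and reading the "jo deoptimize" branch as the guard for exactly this case is an interpretation of the disassembly above.

```javascript
// Largest value a signed 32-bit integer can hold (2^31 - 1).
var INT32_MAX = 2147483647;

// Same operation as the loop body: x = x + 1.
function increment(x) {
  return x + 1;
}

// What JavaScript semantics require: the exact result, as a double.
var exact = increment(INT32_MAX);

// What a plain 32-bit machine add would produce: the value wraps around.
// "| 0" truncates to a signed 32-bit integer, emulating the wrap.
var wrapped = (INT32_MAX + 1) | 0;

console.log("exact:   " + exact);   // 2147483648
console.log("wrapped: " + wrapped); // -2147483648
```

Optimized code that keeps x in a 32-bit register therefore has to either prove the add cannot overflow or check for it and bail out to a slower representation.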
,
Jul 31
Small correction: It's not the code that DFG generates, but the FTL code (the --dumpDFGDisassembly output is misleading).
,
Jul 31
Changing the test code to something like this
```js
function test1(ii) {
  var x = 0;
  for (var i = 0; i < ii; i++) {
    x = x + 1;
  }
}
var start = Date.now();
for (var i = 0; i < 1000; ++i) test1(1000000);
var end = Date.now();
print("Time: " + (end - start) + " ms.");
```
where TurboFan no longer sees a constant for the upper bound, the execution time in V8 goes up, whereas FTL still generates the 4 instruction loop:
======================
Time (d8): 791 ms.
Time (jsc): 277 ms.
======================
,
Jul 31
,
Aug 10
Thanks for the feedback. I have tested with a constant as the upper bound, but I see no significant improvement on my laptop (Core i7) or on our controller (Atom).
======================
Time (upper bound = 1000000) = 0.844ms (Core I7), 2.911ms (Atom)
Time (upper bound = ii, ii = 1000000) = 0.838ms (Core I7), 2.901ms (Atom)
======================
Nevertheless, test2 and test3 are more relevant for our use case of lots of DOM manipulations.
```js
// -------------------------------------------
function test2() {
  var ii = 10000;
  for (var i = 0; i < ii; i++) {
    document.body.appendChild(document.createElement('div'));
  }
}
// -------------------------------------------
function test3() {
  var div = document.createElement('div');
  var count = 0;
  while (count < 10000) {
    div.appendChild(document.createElement('div'));
    count++;
  }
  document.body.appendChild(div);
}
```
Please investigate this in your test environment. Thanks in advance.
P.S.: In general we are satisfied with the browser performance on common consumer CPUs (Intel Core i3/i5/i7, ...). But we are interested in whether the browser performance can be improved on a less beefy Atom CPU.
Comment 1 by brucedaw...@chromium.org
, Jul 12
Summary: Speed: Slow performance on older Atom processors, slow microbenchmarks (was: Speed:)
Given that you are running Safari on an Apple A8X CPU and Chrome (for Windows) on an Intel Atom E3845 how are you deriving an expectation for what the speed ratio "should" be?
P.S. Some of the text in the PDF appears to actually be bitmap data so I can't copy/paste it as text, which made writing this up trickier.
I'm not sure we have enough Atom customers to care about Atom performance.
The fact that test 1:
```js
function test1() {
  var x = 0;
  for (var i = 0, ii = 1000000; i < ii; i++) {
    x = x + 1;
  }
}
```
is so much faster on Safari on the same hardware might be of interest, but maybe not - optimizing for micro-benchmarks is generally not as useful as optimizing for real-world scenarios. In particular, a sufficiently smart compiler could figure out that that function does nothing and could turn it into a NOP, but that would not help real web pages.
This feels like multiple bugs squished in to one:
Atom is slower than A8X - I'm not sure if that matters due to the different CPU types and the small number of Atoms out there
test1() is slower in Chrome than Safari on the same hardware - not a realistic test
test4() is slower in Chrome than Safari on the same hardware - is it a realistic test?
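To make the NOP concern concrete, here is a minimal sketch of one common way to make such a micro-benchmark harder to optimize away: return the computed value and fold it into a result that is printed at the end. The names test1WithResult and sink are hypothetical, and this is not a guarantee: a compiler could still replace the loop with its closed-form result.

```javascript
// Variant of test1 that returns its result, so the loop has an
// observable effect and cannot simply be dropped as dead code.
function test1WithResult() {
  var x = 0;
  for (var i = 0, ii = 1000000; i < ii; i++) {
    x = x + 1;
  }
  return x;
}

// Accumulate every result into `sink` and print it, so the calls
// themselves are also observable from outside the timed region.
var sink = 0;
var start = Date.now();
for (var i = 0; i < 100; ++i) sink += test1WithResult();
var end = Date.now();
console.log("Time: " + (end - start) + " ms, checksum: " + sink);
```

The printed checksum also doubles as a cheap correctness check when comparing engines.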