
Issue 704966

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
OOO until 2019-02-10
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 3
Type: Bug

Blocked on:
issue v8:4826




Spread/apply slower than intuitively should be?

Project Member Reported by fs...@chromium.org, Mar 24 2017

Issue description

I've crafted this jsperf:
https://jsperf.com/minarray/1

of ways of doing Math.min in an array.
My intuition in general would be "less JS" => "more C++" => "faster". And yet...

I understand that the reduce() example would be slower, due to more function calls, fine.
But I don't get why apply() and the spread syntax should be slower than a while/for loop.

Could we do something about this? <3
 
Cc: hablich@chromium.org
Status: WontFix (was: Untriaged)
I can't repro this on Chrome 58 beta. With the FullCodeGen+CrankShaft pipeline all except reduce are on par. With Ignition+TF the while loop is 30% faster than the rest.

What browser were you using?

Comment 2 by fs...@chromium.org, Mar 30 2017

Testing in Chrome 58.0.3029 / Mac OS X 10.12.3. Do I need to enable a flag to get FullCodeGen+CrankShaft?

The while loop is way faster than everything else:
for is 30% slower
spread is 42% slower
apply is 47% slower
reduce is 92% slower

My point was that this is somewhat counterintuitive, and that spread/apply should be way closer, if not faster.

Comment 3 by fs...@chromium.org, Mar 30 2017

Status: Available (was: WontFix)
With the "Experimental JavaScript Compilation Pipeline" flag enabled, the numbers are even worse:

The baseline for while is identical (less than 5% difference) and:
for is 32% slower
spread is 44% slower
apply is 64% slower
reduce is 92% slower


Cc: bmeu...@chromium.org petermarshall@chromium.org
Components: -Blink>JavaScript Blink>JavaScript>Compiler
Owner: petermarshall@chromium.org
Status: Assigned (was: Available)
Peter, please take a look.
It looks like jsperf.com is down. What was the general gist of the benchmark code? I'll see if I can reproduce similar results locally.

Comment 7 by fs...@chromium.org, Apr 10 2017

The Google cache has a copy of it:
https://webcache.googleusercontent.com/search?q=cache:mqQYYZvH5N0J:https://jsperf.com/minarray+&cd=2&hl=en&ct=clnk&gl=ca

But for future reference, copying and pasting here:

var arr = [], maxLimit = 1000000000;
for (var i = 0; i < 1000; i++) {
  arr.push(Math.floor(Math.random() * maxLimit));
}

and then trying different ways of finding the min of the array.

function withApply() {
  let min = Math.min.apply(Math, arr);
  return min;
}

function withLoop() {
  let min = arr[0];
  for (let i = 1; i < arr.length; ++i) {
    if (arr[i] < min) min = arr[i];
  }
  return min;
}

function withSpread() {
  let min = Math.min(...arr);
  return min;
}

function withWhile() {
  let len = arr.length, min = Infinity;
  while (len--) {
    if (arr[len] < min) min = arr[len];
  }
  return min;
}

function withReduce() {
  let min = arr.reduce((p, v) => ( p < v ? p : v ));
  return min;
}

Apply and spread are slower basically because we push the array onto the stack. Spread actually used to be much, much slower because the spec says that it must use the full iteration protocol, which has a lot of JS-observable side-effects. 

In simple cases (like in this benchmark) we can prove that actually performing the iteration won't be observable, so we just push the array onto the stack and call into Math.min, as if we literally called it with 1000 arguments.

Given that the operation we are performing here (min) is so simple, the time taken to prove that the iteration isn't needed and push the array onto the stack is pretty large, and clearly noticeable.

There isn't a lot we can do about the time taken to push the array. We did briefly consider keeping the array on the heap and passing a pointer to it in a special parameter-passing mode - we don't actually need to duplicate the entire array in this case. However, while Math.min doesn't modify the input array, we would need to prove that this is true for each function we wanted to use a spread parameter with.

Basically, this is exactly what the handwritten for/while implementations are doing - we know that we aren't modifying the array, so we don't make a copy. 

Another issue with the spread/apply approach is that you will run into a stack overflow at various different array sizes depending on the implementation - again a consequence of pushing the arguments to the stack. This stems from the fact that Math.min takes varargs parameters and then iterates over the arguments object. This makes sense for a small number of handwritten parameters, but when combined with spreads, produces the problems above. 

To summarize: It could be possible for us to avoid pushing the array to the stack in the spread/apply case. This would give a nice speedup, and remove the stack overflow problem. It might be a fair bit of work, though. Other JS engines (I think) implement spread/apply in a similar way to V8 currently, so you wouldn't be able to rely on not running into a stack overflow, or hitting a performance cliff.

Comment 9 by fs...@chromium.org, Apr 11 2017

Peter, thanks for taking the time to look into this and answer it. :)

2 small things:

1. I re-did the benchmark on Firefox and Safari. Although on average we have way better numbers, a couple of things popped out: on both of them their for loop is two times faster than our while loop (and everything else is abysmally slower). Also, on Firefox, their reduce is as fast as our spread. I know it's only tangential to this request, but I think it may be worth taking a look at this.

2. Is there a reason for the for-loop to be 30%-ish slower than the while-loop for us? Is it because of arr.length?

And now to the meat of the discussion:

Just to be clear, when you say "proving the iteration won't be observable", you just mean making sure that the function (Math.min) won't change the 'arguments' object, right?

If that's the case, I wonder if we could do something about it. If we had a way to mark functions as "const arguments", then we could pass a pointer. We could start by marking C++ functions as "const arguments", which would bring a nice speedup across the board for spread. Then, as a second step, we could worry about how to mark JS functions as const arguments, but that could be an independent step, no?
Cc: caitp@chromium.org
The for version is slower because you are using for(let...) instead of for(var...), which requires more complex scoping that is not well optimized at this point (see https://twitter.com/bmeurer/status/850776357151944704 for example). caitp@chromium.org is working on a quickfix for the frontend, for the common case.

Comment 11 by caitp@chromium.org, Apr 11 2017

> caitp@chromium.org is working on a quickfix for the frontend, for the common case.

Adam has asked me not to continue with it, saying we should wait until FCG+Crankshaft are removed (or at least until lexical variables force the I+TF pipeline), so that quick fix may not be available for a few milestones.
Cc: adamk@chromium.org
Hm, interesting. At least after the M59 branch cut it should be fine, since M60 will very likely not even include Crankshaft, and FCG will be around for asm.js only, which doesn't need lexical variables.

Comment 13 by adamk@chromium.org, Apr 11 2017

After more discussion on IRC, I think there's a slimmed-down version of caitp's patch that would be reasonable to land before Crankshaft is gone.
1. Interesting that reduce is faster on Firefox - we could look into that.

2. When I ran these locally using our JSPerfTest runner, for and while were pretty much exactly equal - not sure where that discrepancy came from. Otherwise the relative performance of everything was pretty much the same.

When I say "proving the iteration won't be observable", what I'm talking about is iteration using Symbol.iterator, like with a for-of loop. Calls with spread parameters don't just index into the array from 0-->length, they use iterators - see the spec here for the entry point: https://tc39.github.io/ecma262/#sec-argument-lists-runtime-semantics-argumentlistevaluation. There are a whole bunch of events here that are observable via monkey-patching, getters/setters, etc.

In most simple cases though, iterating over an array and simply indexing into it from 0-->length will produce the same result, so we have an optimization that realizes when this is the case and skips the iteration. The tracking bug for that work is here: https://bugs.chromium.org/p/v8/issues/detail?id=5511.

My point being that calls with spread parameters have quite different semantics to the handwritten for/while loop versions, although in this case they do the exact same thing.

It might be possible to do something with builtin functions like max/min where we know they don't modify arguments, but the question is, is that an important use case? Often max/min are given as examples of how to use spread arguments, but doing it with varargs seems kind of strange. Normally you would make a function that just operates on the array, rather than iterating over arguments. Or use something like reduce, which can be done nicely in-line.

Another idea was to pass a pointer to the array, and then use something like copy-on-write, where we assume the function won't modify arguments, and pay the performance penalty in the case that it does. What isn't clear to me is whether this sort of thing would cause a slowdown for other functions that don't use spreads - we would have to insert at least one check/branch somewhere when accessing the arguments. That would probably cause noticeable slowdown elsewhere.

So basically the stage we got to with this was that we have some ideas about how we could make spread calls like this a bit faster, but we haven't fully explored the design, and perhaps more importantly, aren't sure that the use cases it would cover are really important. Also, I don't see a way spread/apply would be faster than the for/while versions; even if something was done to elide the stack-pushing, it would probably be roughly the same speed in the end.

Comment 15 by fs...@chromium.org, Apr 12 2017

Fixing the let binding thing is a great idea! :)

Given https://codereview.chromium.org/2465253011 this means that Math.min(...x) should not go through an iterator but use normal indexing, is that correct? So the cost we are paying is only the copy, not the iteration. And yes, I can see how, even if we remove the copy, we would be close to the same speed (but wouldn't we save all the function calls to Math.min?)

I understand your point about this not being a common use case. But your argument almost goes to the point that the "functional" operators (map/filter/reduce) should be at least as fast as the loops (and they are the slowest solution right now), though I understand optimizing them is way more complicated. Is there anything we can do for those?

I think there's a discrepancy between what you guys know to be true and what a reasonable JS programmer would expect. Which boils down to: spread performance is, right now, bad (and who knows what's happening with 'apply'? I'm assuming also a copy, but what else?).

Anecdotally, I've been showing this jsperf around and have yet to meet a Chrome/Blink/JS developer who correctly guesses the performance order there. Most of them say "spread and reduce are the fastest". They assume that 'reduce' is syntactic sugar for a faster while loop. They assume that 'spread' would never copy the array, because JS functions never do. I know optimization is not intuitive. But still, I think it should be a factor when deciding where to put optimization effort. To make things that look like they should be faster actually faster.

All that said, if there's a tracking bug for spread performance, I feel that particular case is more than covered. I trust you guys will prioritize and work on it as necessary. :)
Cc: danno@chromium.org
There's https://docs.google.com/document/d/1EA9EbfnydAmmU_lM8R_uEMQ-U_v4l9zulePSBkeYWmY/edit?pli=1#heading=h.n8mbljw3mpuo and the tracking bug is at https://bugs.chromium.org/p/v8/issues/detail?id=5511.

Adding danno@ for reduce performance.

I totally get the problem with the non-intuitive performance, and I'd be more than happy if we can address that, and thereby help to make performance more predictable. Maybe there's still something we can do for spread calls in TurboFan.
Blockedon: v8:4826
I think there's one thing we should probably address, which is having a fast-path for double arrays for apply and spread. For example consider this simple test case:

===============================================================
var arr = [], maxLimit = 1000000000;
for (var i = 0; i < 1000; i++) {
  arr.push(Math.floor(Math.random() * maxLimit));
}

function withApply() {
  let min = Math.min.apply(Math, arr);
  return min;
}

function withLoop() {
  let min = arr[0];
  for (let i = 1; i < arr.length; ++i) {
    if (arr[i] < min) min = arr[i];
  }
  return min;
}

function withSpread() {
  let min = Math.min(...arr);
  return min;
}

function withWhile() {
  let len = arr.length, min = Infinity;
  while (len--) {
    if (arr[len] < min) min = arr[len];
  }
  return min;
}

function withReduce() {
  let min = arr.reduce((p, v) => ( p < v ? p : v ));
  return min;
}

const N = 100000;
const F = [withApply, withLoop, withSpread, withWhile, withReduce];

function test(f, n) {
  var o;
  for (var i = 0; i < n; ++i) o = f();
  return o;
}

// Warmup
for (var i = 0; i < 10; ++i) {
  for (var f of F) test(f, 50);
}

// Measure
for (var f of F) {
  var start = Date.now();
  var o = test(f, N);
  var end = Date.now();
  print(f.name + ": " + (end - start) + "ms.");
}
===============================================================

In this case arr has elements kind FAST_ELEMENTS, which is properly optimized for both apply and spread, and thus the results are not too bad (modulo the reduce case, but that's known):

===============================================================
withApply: 252ms.
withLoop: 191ms.
withSpread: 251ms.
withWhile: 191ms.
withReduce: 1375ms.
===============================================================

However, turning arr into a FAST_DOUBLE_ELEMENTS array, we see a very, very significant performance difference:

===============================================================
var arr = [0.1], maxLimit = 1000000000;
for (var i = 0; i < 1000; i++) {
  arr.push(Math.floor(Math.random() * maxLimit));
}

function withApply() {
  let min = Math.min.apply(Math, arr);
  return min;
}

function withLoop() {
  let min = arr[0];
  for (let i = 1; i < arr.length; ++i) {
    if (arr[i] < min) min = arr[i];
  }
  return min;
}

function withSpread() {
  let min = Math.min(...arr);
  return min;
}

function withWhile() {
  let len = arr.length, min = Infinity;
  while (len--) {
    if (arr[len] < min) min = arr[len];
  }
  return min;
}

function withReduce() {
  let min = arr.reduce((p, v) => ( p < v ? p : v ));
  return min;
}

const N = 100000;
const F = [withApply, withLoop, withSpread, withWhile, withReduce];

function test(f, n) {
  var o;
  for (var i = 0; i < n; ++i) o = f();
  return o;
}

// Warmup
for (var i = 0; i < 10; ++i) {
  for (var f of F) test(f, 50);
}

// Measure
for (var f of F) {
  var start = Date.now();
  var o = test(f, N);
  var end = Date.now();
  print(f.name + ": " + (end - start) + "ms.");
}
===============================================================

Look at the numbers again:

===============================================================
withApply: 1108ms.
withLoop: 270ms.
withSpread: 2938ms.
withWhile: 170ms.
withReduce: 1670ms.
===============================================================

I don't know what's up with the withLoop test, but for withApply and withSpread, it's clearly lacking the fast-path for FAST_DOUBLE_ELEMENTS, see v8:4826.
Cc: yangguo@chromium.org
Project Member Comment 20 by bugdroid1@chromium.org, Jun 13 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/5b427ad2d1e2c21a1a2ffafeaebb1593c70fc5b5

commit 5b427ad2d1e2c21a1a2ffafeaebb1593c70fc5b5
Author: Peter Marshall <petermarshall@chromium.org>
Date: Tue Jun 13 10:24:35 2017

[builtins] Add a fast-path for Apply with double elements.

Double element types were much slower than Smi/Object previously.
We can box each double in a HeapNumber and push them into a new
FixedArray to save going into the runtime.

Bug:  v8:4826 , chromium:704966
Change-Id: I7f15d0d636a52760daefed722265c696c1ebb13e
Reviewed-on: https://chromium-review.googlesource.com/531004
Commit-Queue: Peter Marshall <petermarshall@chromium.org>
Reviewed-by: Camillo Bruni <cbruni@chromium.org>
Cr-Commit-Position: refs/heads/master@{#45897}
[modify] https://crrev.com/5b427ad2d1e2c21a1a2ffafeaebb1593c70fc5b5/src/builtins/builtins-call-gen.cc
[modify] https://crrev.com/5b427ad2d1e2c21a1a2ffafeaebb1593c70fc5b5/test/mjsunit/apply.js
