Performance of generator functions for iteration
Reported by mic.bes...@gmail.com, Dec 21 2017
Issue description
Version: 6.5.88
OS: Linux
Architecture: x64
What steps will reproduce the problem?
```js
'use strict';
class Standard {
  constructor(x, y) {
    this.x = x;
    this.y = y;
  }
  computeSum() {
    let sum = 0;
    for (let x = 0; x < this.x; x++) {
      for (let y = 0; y < this.y; y++) {
        sum += x * y;
      }
    }
    return sum;
  }
}
class Iterator {
  constructor(x, y) {
    this.x = x;
    this.y = y;
  }
  computeSum() {
    let sum = 0;
    for (const {x, y} of this.keys()) {
      sum += x * y;
    }
    return sum;
  }
  *keys() {
    for (let x = 0; x < this.x; x++) {
      for (let y = 0; y < this.y; y++) {
        yield {x, y};
      }
    }
  }
}
const standard = new Standard(10, 15);
const iterator = new Iterator(10, 15);
// warm up
for (let i = 0; i < 1000; i++) {
  standard.computeSum();
  iterator.computeSum();
}
// check result
if (standard.computeSum() !== 4725) throw new Error('wrong value: ' + standard.computeSum());
if (iterator.computeSum() !== 4725) throw new Error('wrong value: ' + iterator.computeSum());
test('standard', standard);
test('iterator', iterator);
function test(name, fun) {
  console.time(name);
  for (let i = 0; i < 1e6; i++) {
    fun.computeSum();
  }
  console.timeEnd(name);
}
```
What is the expected output?
Similar time to run the "standard" and "iterator" tests.
The objects yielded from the "keys" generator do not escape, so they could presumably be optimized away.
What do you see instead?
The "iterator" test is ~25x slower.
Numbers on my machine:
standard: 142.695ms
iterator: 3500.634ms
Dec 22 2017
I think it's fair to compare iterator approaches to simple loops, so I merged all of them into one test case:
```js
'use strict';
class Standard {
  constructor(x, y) {
    this.x = x;
    this.y = y;
  }
  computeSum() {
    let sum = 0;
    for (let x = 0; x < this.x; x++) {
      for (let y = 0; y < this.y; y++) {
        sum += x * y;
      }
    }
    return sum;
  }
}
class RangeIterator {
  constructor(xEnd, yEnd) {
    this.xCurrent = 0;
    this.yCurrent = 0;
    this.xEnd = xEnd;
    this.yEnd = yEnd;
  }
  [Symbol.iterator]() {
    return this;
  }
  next() {
    let x = this.xCurrent;
    let y = this.yCurrent;
    let done = false;
    if (x < this.xEnd) {
      if (y < this.yEnd - 1) {
        this.yCurrent = y + 1;
      } else {
        this.yCurrent = 0;
        this.xCurrent = x + 1;
      }
    } else {
      done = true;
    }
    return {value: {x, y}, done};
  }
}
class Iterator {
  constructor(x, y) {
    this.x = x;
    this.y = y;
  }
  computeSum() {
    let sum = 0;
    for (const {x, y} of this.keys()) {
      sum += x * y;
    }
    return sum;
  }
  keys() {
    return new RangeIterator(this.x, this.y);
  }
}
class Generator {
  constructor(x, y) {
    this.x = x;
    this.y = y;
  }
  computeSum() {
    let sum = 0;
    for (const {x, y} of this.keys()) {
      sum += x * y;
    }
    return sum;
  }
  * keys() {
    for (let x = 0; x < this.x; x++) {
      for (let y = 0; y < this.y; y++) {
        yield {x, y};
      }
    }
  }
}
const x = 10;
const y = 15;
const standard = new Standard(x, y);
const iterator = new Iterator(x, y);
const generator = new Generator(x, y);
// warm up
for (let i = 0; i < 1000; i++) {
  standard.computeSum();
  iterator.computeSum();
  generator.computeSum();
}
// check result
if (standard.computeSum() !== 4725)
  throw new Error('wrong value: ' + standard.computeSum());
if (iterator.computeSum() !== 4725)
  throw new Error('wrong value: ' + iterator.computeSum());
if (generator.computeSum() !== 4725)
  throw new Error('wrong value: ' + generator.computeSum());
test('standard', standard);
test('iterator', iterator);
test('generator', generator);
function test(name, fun) {
  console.time(name);
  for (let i = 0; i < 1e6; i++) {
    fun.computeSum();
  }
  console.timeEnd(name);
}
```
Running this with V8 ToT I get the following results:
```
$ out/Release/d8 bench-targos.js
console.timeEnd: standard, 238.206000
console.timeEnd: iterator, 592.631000
console.timeEnd: generator, 3820.113000
```
So the (hand-optimized) RangeIterator is ~2.5x slower than the hand-written loops, and the generator is A LOT slower than that, about 6x slower than the iterator. That seems to be partly due to the lack of inlining, but also due to the generator's state management.
There's probably some overlap with work that has to be done for async functions here.
Dec 22 2017
IIRC bytecode liveness analysis doesn’t work correctly with Suspend nodes, so even when TF is involved, they likely save more info than needed on each yield. I believe that’s still an issue, maybe relatively low-hanging fruit to start improving this.
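To make the liveness point concrete, here is a small stand-alone JavaScript sketch (the function and variable names are invented for illustration; this is not V8-internal code):

```js
// Hypothetical example, not from the benchmark above: at the suspend point,
// only `i` and `limit` are still needed after resumption, so ideally only
// those two values would be saved into the generator's register file.
// A temporary that is dead after the yield (like `doubled`) should not
// need to be stored, but imprecise liveness across the suspend would save it anyway.
function* counter(limit) {
  for (let i = 0; i < limit; i++) {
    const doubled = i * 2;  // consumed by the yield, dead afterwards
    yield doubled;          // suspend point: live set is {i, limit}
  }
}

// usage: [...counter(3)] produces 0, 2, 4
```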
Dec 27 2017
That sounds like a good starting point! Do you wanna prototype something here?
Dec 27 2017
I had a go at it, but so far my naive efforts have only managed to shave about 100ms off a 3500ms microbenchmark, nothing significant. Also, it's unclear (to me) how to materialize the full generator register file if a deopt occurs; that might need more expertise to get right.
Dec 27 2017
I was discussing this with jarin@ earlier this month: maybe we need to look into inlining the generator at the Generator.prototype.next call site, that is, inline both the next builtin and the actual call to the code object (which is done via the ResumeGenerator builtin). I think inlining accounts for most of the performance in the iterator case. WDYT?
Dec 27 2017
I can definitely do some work to inline Generator.prototype.next. About inlining the code object, it seems a bit limited: If the generator function is very large and doesn't fit into the inlining heuristics, is there anything we can do to improve things in that case?
Dec 27 2017
If the generator function is big, then there's not much we can do. G.p.next can essentially resume after every yield (incl. the initial yield), so the majority of the code is needed (except for the generator object allocation). We could recognize the special generator inlining (which needs special treatment anyways) and subtract the header bytecodes.
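A small made-up example of what "resume after every yield" means at the language level (nothing here is specific to the benchmark above):

```js
// Hypothetical example: the same gen.next() call site can resume execution
// at the start of the body, after the first yield, or after the second yield,
// depending on how far the generator has advanced, so an inlined
// Generator.prototype.next generally needs all of the code between those
// suspend points.
function* gen() {
  const a = 1;
  yield a;          // suspend point 1
  const b = a + 1;
  yield b;          // suspend point 2
  return a + b;
}

const g = gen();
g.next(); // runs up to suspend point 1:  {value: 1, done: false}
g.next(); // resumes after suspend point 1: {value: 2, done: false}
g.next(); // resumes after suspend point 2: {value: 3, done: true}
```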
Dec 28 2017
Over the past two days, I worked on adding support for inlining the resume functions (caveat: it doesn't inline ResumeGeneratorTrampoline; I'm not sure I can do the fill-params-with-holes thing during the inlining phase). The fix also involved getting rid of the %_CreateIteratorResultObject() calls from Ignition for Yield, which should help a bit with escape analysis before the actual code can be inlined (the resume functions generate the result object instead). So far, this has made a roughly 1000ms difference on the benchmark, so it's an improvement, but there's still more to go.
Dec 29 2017
1000ms sounds like a good start already; that brings us from 6.5x slower to only 4.7x slower! :-) The parameters are part of the FrameState during inlining; we should be able to construct them with holes. I think the key is really to be able to inline through the ResumeGeneratorTrampoline. Maybe we need a matching JSResumeGenerator operator that we can handle in the JSInlining class, plus some prototyping to explore the space.

Are you sure that you can move the result object creation to the resume builtin? Is that spec compliant? I thought that the yield* and JSIterResultObject identity semantics require us to not create those objects outside of the generators.
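For reference, the identity concern is observable from plain JavaScript; here is a hedged, self-contained example (names invented, essentially the same shape as the eshost test further down in this thread):

```js
// Hypothetical example: per the current spec, the object returned by
// delegate.next() is handed straight through copy().next() during yield*,
// so any extra properties (and the object identity itself) are visible
// to the caller. Re-allocating the result object in the resume path
// would drop `extra`.
function* copy(iterable) {
  yield* iterable;
}

const delegate = {
  [Symbol.iterator]() {
    return {
      next() {
        return {value: 1, done: false, extra: 123};
      }
    };
  }
};

console.log(copy(delegate).next().extra); // 123 under the spec's semantics
```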
Dec 29 2017
Yes, it doesn’t really work in the yield* case, at least not compliantly
Dec 29 2017
Wasn't it the case that JSC doesn't implement this compliantly either? Do you know about their implementation? Does it make sense to propose a spec change if this helps performance-wise?
Dec 29 2017
Yes, Safari Tech Preview 46 is still getting this wrong. Whether it makes sense or not, let's see how much of a difference that escape analysis boost makes. If it's significant without also inlining ResumeGeneratorTrampoline and the generator code itself, then there's a case for it, especially for large generator bodies.
Dec 29 2017
I've made a version which doesn't do anything to the baseline, and the benefits are a bit more muted (running about 3200ms vs 3500ms, as opposed to 2500ms vs 3500ms). Of course, this would score a lot better if ResumeGeneratorTrampoline and the generator body were also inlined.
Dec 30 2017
Looks like V8, SpiderMonkey and ChakraCore all implement the spec's semantics for yield* where the identity of the iterator result is preserved. Test at the end of this post.
Probably a little unlikely, but I could imagine someone depending on this in an environment where JSC isn't typically used (e.g., Node), so I wouldn't be 100% sure there are no compatibility concerns. I'm not sure how many people tend to add other properties to their iteration results, but if you're manually creating and using an iterator, with a yield* in the middle, I can understand that you might take advantage of this working.
For a spec change to be worth the churn for programmers and implementers, it'd be good to have a strong understanding of why this is more optimizable. I'd imagine that being more optimizable may have been a reason why the current spec *does* pass along the identity: for a simple implementation, the current spec avoids an allocation.
Really naive suggestion that probably won't work: Would it be possible to implement yield* spec-compliantly together with this optimization by using the continuation to figure out whether a yield* is currently executing, and refrain from doing the wrapping if so?
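A very rough source-level sketch of that suggestion (the names `isDelegating` and `resumeAndMaybeWrap` are invented; V8 would do this in its resume machinery, not in JavaScript):

```js
// Hypothetical sketch of the suggestion above, not V8 code: only wrap the
// resumed value into a fresh {value, done} object when the generator is NOT
// currently delegating via yield*, so that the delegate's result object keeps
// its identity on that path while the plain-yield path can still be optimized.
function resumeAndMaybeWrap(generatorState, resumedValue, isDone) {
  if (generatorState.isDelegating) {
    // yield* path: the delegate already produced a result object,
    // forward it unchanged to preserve identity.
    return resumedValue;
  }
  // plain yield path: safe to allocate (and later scalar-replace) the result.
  return {value: resumedValue, done: isDone};
}
```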
----
Quick and ugly test that I used:
```
$ eshost -e "(function() { function* copy(iterable) { yield* iterable } let obj = { [Symbol.iterator]() { return { next() { return { value: 1, done: false, other: 123 } } } } }; let myCopy = copy(obj); return myCopy.next().other })()"
#### jsc
undefined
#### d8
123
#### chakracore
123
#### spidermonkey
123
```
Dec 30 2017
The benefit is that it allows eliminating the allocation of the iterator result, and eliminating loads of the value/done properties, in more cases. Once the whole generator body can be inlined, we could do this anyway in many cases, but it's constrained by the generator body fitting into the inlining heuristics. That said, I'm unsure how common non-inlineable generators would be in practice.
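For illustration, a sketch of what the for-of loop in the merged benchmark above roughly desugars to, showing where the allocation and the loads in question live (variable names invented):

```js
// Rough desugaring of `for (const {x, y} of generator.keys())` from the
// benchmark. If the {value, done} object returned by next() provably does
// not escape, its allocation can be scalar-replaced and the .done/.value
// loads can be eliminated; if it escapes (or might), none of that applies.
const it = generator.keys()[Symbol.iterator]();
let sum = 0;
while (true) {
  const res = it.next();     // allocation candidate for scalar replacement
  if (res.done) break;       // load of .done
  const {x, y} = res.value;  // load of .value, then loads of .x and .y
  sum += x * y;
}
```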
Dec 31 2017
> If the generator function is big, then there's not much we can do. G.p.next can essentially resume after every yield (incl. the initial yield), so the majority of the code is needed (except for the generator object allocation). We could recognize the special generator inlining (which needs special treatment anyways) and subtract the header bytecodes.

I've been thinking about this a bit more. You want to support inlining the "regular call" case (which allocates the generator, processes formal parameters, and returns the generator object), and you also want to support inlining the generator body itself.

In the current design, this basically means you need to find the "branch-if-undefined-otherwise-jump-to-the-current-suspend" code, find the initial yield/return, and, for optimizing the general case, only include the code up to the offset of the initial yield/return, without including the branch-if-undefined and the jump table. In practice, you need to create a special "view" of the bytecode array which does this translation live, or create a modified bytecode array including only those bytes, with jump distances rearranged. It seems like a lot of work.

What would you think about adopting the JSC model instead, where the "wrapper" and "body" each get their own bytecode and are optimized independently of each other? I think it would allow simplifying or eliminating the ResumeGeneratorTrampoline and just doing a normal Call instead (passing the resume mode and suspend point on the stack, and avoiding doing weird things to make accessing formal parameters work). This would avoid the need for a ResumeGenerator operator and make it simpler to inline both cases (at least, that's how it seems), without having to do any craziness to keep BytecodeGraphBuilder doing the right things for both types of call. I'm not sure how hard it would be to make this work (for example, how would the generator body JSFunction be associated with the "wrapper" JSFunction? Would it live in the wrapper function's constant pool?), but it seems doable, and it would make it a lot easier to get the inlining of the generator body that we want.
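To make the wrapper/body split concrete, here is a hedged JavaScript-level sketch of the idea; all names are invented, and the actual proposal is about splitting the bytecode and turning the resume into an ordinary call, not about rewriting user code:

```js
// Hypothetical sketch of a wrapper/body split for the benchmark's keys()
// generator. The "wrapper" only allocates the generator's state object and
// returns the iterator; the "body" is an ordinary function that resumes from
// the saved state. Each piece has its own code, so each can be inlined on
// its own by the usual machinery.
function makeKeys(self) {            // wrapper: allocation only
  const state = {x: 0, y: 0};
  return {
    [Symbol.iterator]() { return this; },
    next() { return keysBody(self, state); }
  };
}

function keysBody(self, state) {     // body: resume logic, a normal call
  if (state.x >= self.x) {
    return {value: undefined, done: true};
  }
  const result = {value: {x: state.x, y: state.y}, done: false};
  if (++state.y >= self.y) {         // advance to the next suspend point
    state.y = 0;
    state.x++;
  }
  return result;
}
```

As a usage check, `for (const {x, y} of makeKeys({x: 10, y: 15}))` iterates the same 150 pairs as the generator version of keys() in the benchmark.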
Jan 2 2018
> The benefit is that it allows eliminating the allocation of the iterator result, and eliminating loads of the value/done properties, in more cases.

One thing that we'd need to explain to the committee is why it's not feasible to do the optimization anyway in the way I described in the fourth paragraph of https://bugs.chromium.org/p/v8/issues/detail?id=7236#c16, or one of the other strategies being discussed here.
Jan 2 2018
I'm sure it's possible, but I don't know if it's worth it: if the body isn't inlined, we can't tell whether the yield* path is ever taken, so both cases would need handling; load elimination and scalar replacement then work on one path but not the other, which probably precludes them in the common case, or else produces much worse code to handle both cases. I could be wrong, but I think we get the best results without that.
Jan 2 2018
Ah, my phone omitted some characters typing that last message. Basically, right now, an object _always_ escapes when generator resume methods are called, which eliminates the chance for load elimination/scalar replacement and for avoiding the allocation. The spec-noncompliant change makes it so that an object _never_ escapes, so load elimination/scalar replacement are always possible even if the generator body isn't inlined.

The idea you present means that an object conditionally escapes or doesn't escape, and in cases where the generator body isn't inlined, you can't statically know which (unless we had information about the number of yield* suspends in the CallIC or something, only for generator resume calls, but we don't). So since you can't statically know, you either always emit the slow case, or you emit a combination of both cases. I'm not sure that the combination of both cases would actually help performance, and in a lot of cases it would produce dead codepaths that aren't known-dead at compile time and would be hard to eliminate later on, with no information telling us that those paths are dead.

Anyways, I'm sure there's something we could do, but I don't really think this "feature" (of returning the same iterator result object for yield*) is worth that much in practice, especially when not all engines are implementing it.
Jan 3 2018
> Anyways, I'm sure there's something we could do, but I don't really think this "feature" (of returning the same iterator result object for yield*) is worth that much in practice, especially when not all engines are implementing it.

I think so too. I think the point here is: if the committee wants ideal performance for generators across the board, the spec needs to be changed. Otherwise, if performance is secondary and the benefits of the current semantics outweigh the performance concern, leave it as is.

> I've been thinking about this a bit more. You want to support inlining the "regular call" case (which allocates the generator, processes formal parameters, and returns the generator object), and you also want to support inlining the generator body itself.

> In the current design, this basically means you need to find the "branch-if-undefined-otherwise-jump-to-the-current-suspend" code, find the initial yield/return, and, for optimizing the general case, only include the code up to the offset of the initial yield/return, without including the branch-if-undefined and the jump table.

We do already inline the regular call, and TurboFan itself strips off the parts that are concerned with the actual generator logic (because it can constant-fold the undefined comparison). It's really only about inlining through Generator.prototype.next now.
Jan 3 2018
OK, the reasoning here seems pretty concrete--if there were two paths generated at the callsite, there would have to be an extra conditional and probably some dead code, just for preserving identity in the yield* case, and this would largely negate the benefit of the optimization. Is that a correct summary? Do you want me to propose this change to the committee?
Jan 4 2018
I don't think it's even feasible to implement such a strategy in the optimizing compiler, as you need to be really careful to get this right with deoptimization and all ways in which identity can be observed.
Jan 4 2018
OK, see https://github.com/tc39/ecma262/issues/1058 for asking TC39 about this issue.
Jan 4 2018
The NextAction date has arrived: 2018-01-04
Comment 1 by caitp@chromium.org
Dec 21 2017
A more fair comparison:
```js
class Standard {
  constructor(x, y) {
    this.x = x;
    this.y = y;
  }
  [Symbol.iterator]() {
    return {
      collection: this,
      x: 0,
      y: 0,
      [Symbol.iterator]() {
        return this;
      },
      next() {
        let done = this.x == this.collection.x;
        let value;
        if (!done) {
          value = { x: this.x, y: this.y };
          if (++this.y == this.collection.y) {
            this.y = 0;
            ++this.x;
          }
        }
        return { value, done };
      }
    };
  }
  computeSum() {
    let sum = 0;
    for (const {x, y} of this) {
      sum += x * y;
    }
    return sum;
  }
}
class Iterator {
  constructor(x, y) {
    this.x = x;
    this.y = y;
  }
  computeSum() {
    let sum = 0;
    for (const {x, y} of this.keys()) {
      sum += x * y;
    }
    return sum;
  }
  *keys() {
    for (let x = 0; x < this.x; x++) {
      for (let y = 0; y < this.y; y++) {
        yield {x, y};
      }
    }
  }
}
const standard = new Standard(10, 15);
const iterator = new Iterator(10, 15);
// warm up
for (let i = 0; i < 1000; i++) {
  standard.computeSum();
  iterator.computeSum();
}
// check result
if (standard.computeSum() !== 4725) throw new Error('wrong value: ' + standard.computeSum());
if (iterator.computeSum() !== 4725) throw new Error('wrong value: ' + iterator.computeSum());
test('standard', standard);
test('iterator', iterator);
function test(name, fun) {
  console.time(name);
  for (let i = 0; i < 1e6; i++) {
    fun.computeSum();
  }
  console.timeEnd(name);
}
```
But yes, the non-generator case still wins by a wide margin.