0.5-1.5MiB v8:effective_size regression in memory.top_10_mobile at 455708:455733
Issue description: See the link to graphs below.
Mar 10 2017
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8985502771799706512
Mar 10 2017
=== Auto-CCing suspected CL author hablich@chromium.org ===
Hi hablich@chromium.org, the bisect results pointed to your CL, please take a look at the results.

=== BISECT JOB RESULTS ===
Perf regression found with culprit

Suspected Commit
  Author : Michael Hablich
  Commit : 7c3936c67a6361371d9a00d83e2c03ecc603ce76
  Date   : Thu Mar 09 08:05:42 2017
  Subject: Version 5.9.34.1 (Turn on I+TF)

Bisect Details
  Configuration: android_nexus5X_perf_bisect
  Benchmark    : memory.top_10_mobile
  Metric       : memory:chrome:all_processes:reported_by_chrome:v8:effective_size_avg/background/after_http_search_yahoo_com_search__ylt_p_google
  Change       : 12.38% | 4550900.0 -> 5114272.0

  Revision                        Result               N
  chromium@455707                 4550900 +- 11381.5   6   good
  chromium@455714                 4560879 +- 29805.2   6   good
  chromium@455717                 4552148 +- 15163.5   6   good
  chromium@455719                 4556781 +- 26170.3   6   good
  chromium@455719,v8@7c3936c67a   5109885 +- 7867.48   6   bad  <--
  chromium@455719,v8@fbffc377e3   5111431 +- 10408.3   6   bad
  chromium@455720                 5112832 +- 10491.5   6   bad
  chromium@455733                 5114272 +- 10137.5   6   bad

Please refer to the following doc on diagnosing memory regressions:
https://chromium.googlesource.com/chromium/src/+/master/docs/memory-infra/memory_benchmarks.md

To Run This Test
src/tools/perf/run_benchmark -v --browser=android-chromium --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests memory.top_10_mobile

Debug Info
https://chromeperf.appspot.com/buildbucket_job_status/8985502771799706512
Is this bisect wrong? https://chromeperf.appspot.com/bad_bisect?try_job_id=4974572799000576

Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq for more information addressing perf regression bugs. For feedback, file a bug with component Speed>Bisection. Thank you!
Mar 13 2017
This happened after the switch to I+TF. Average memory consumption increased.
Mar 13 2017
This only seems to be happening when the pages are in the background. Yandex seems to improve somewhat. The other thing I notice is that all the pages seem to have very similar memory values, which is very surprising - is this expected? Adding Juan in case he has any input on the benchmark. https://chromeperf.appspot.com/report?sid=b52c971e8e8a133b7e857d0050a044cd1946bea57f7071742d0560782ded42bb&rev=455729
Mar 13 2017
It looks like malloc memory increases. That could be the zone memory. Is TF using more zone memory than CS?
Mar 13 2017
Could this be a memory leak? I've noted that on android-nexus5X running after_http_search_yahoo_com the regressions are:
- memory.top_10_mobile: +550KiB (restarts the browser between page-set repeats)
- memory.top_10_mobile_stress: +1.5MiB (does not restart the browser)
Similarly, other regressions on the same device/page are proportionally larger on _stress compared to the regular benchmark. Also note, when trying to reproduce these regressions: do not use --story-filter, as running the whole suite seems important to trigger the regression.
Mar 13 2017
Re #7 - Only the final two regressions are a malloc increase; the other two are v8:effective_size. Is that not JS heap, or does it include malloc?
Mar 13 2017
PSA: Please ignore the non-memory related alerts. Autobisect running amok: https://github.com/catapult-project/catapult/issues/3372
Mar 13 2017
As discussed: analyze the trade-off space.
Mar 15 2017
v8:effective_size includes both JS and non-JS memory. This is why I think it is zone memory.
Mar 15 2017
Yeah I had a look at this with primiano today - it is allocated_by_malloc:peak_size_avg which is increasing (which gets aggregated into v8_effective_size:peak_size_avg), so this is purely down to an increase in the peak zone size. Most pages don't see a change but some (e.g., Yahoo) see a very large increase (~1.5MB -> 6MB) - I'm going to dig into this locally and see if there is anything we can do.
Mar 20 2017
I've been looking at Yahoo. There is only one function which gets optimized, and TF takes about 6MB of zone memory to do so. The function's JS source is:
getRequires: function(T) {
if (!T) {
return c
}
if (T._parsed) {
return T.expanded || c
}
var N, J, M, F, D, V, B = this.testresults, W = T.name, C, U = p[W] && p[W].details, P, K, E, G, Q, H, A, R, S, z, I = T.lang || T.intl, O = this.moduleInfo, L = d.Features && d.Features.tests.load, x, y;
if (T.temp && U) {
Q = T;
T = this.addModule(U, W);
T.group = Q.group;
T.pkg = Q.pkg;
delete T.expanded
}
y = !((!this.lang || T.langCache === this.lang) && (T.skinCache === this.skin.defaultSkin));
if (T.expanded && !y) {
return T.expanded
}
P = [];
x = {};
G = this.filterRequires(T.requires);
if (T.lang) {
P.unshift("intl");
G.unshift("intl");
I = true
}
H = this.filterRequires(T.optional);
T._parsed = true;
T.langCache = this.lang;
T.skinCache = this.skin.defaultSkin;
for (N = 0; N < G.length; N++) {
if (!x[G[N]]) {
P.push(G[N]);
x[G[N]] = true;
J = this.getModule(G[N]);
if (J) {
F = this.getRequires(J);
I = I || (J.expanded_map && (w in J.expanded_map));
for (M = 0; M < F.length; M++) {
P.push(F[M])
}
}
}
}
G = this.filterRequires(T.supersedes);
if (G) {
for (N = 0; N < G.length; N++) {
if (!x[G[N]]) {
if (T.submodules) {
P.push(G[N])
}
x[G[N]] = true;
J = this.getModule(G[N]);
if (J) {
F = this.getRequires(J);
I = I || (J.expanded_map && (w in J.expanded_map));
for (M = 0; M < F.length; M++) {
P.push(F[M])
}
}
}
}
}
if (H && this.loadOptional) {
for (N = 0; N < H.length; N++) {
if (!x[H[N]]) {
P.push(H[N]);
x[H[N]] = true;
J = O[H[N]];
if (J) {
F = this.getRequires(J);
I = I || (J.expanded_map && (w in J.expanded_map));
for (M = 0; M < F.length; M++) {
P.push(F[M])
}
}
}
}
}
C = this.conditions[W];
if (C) {
T._parsed = false;
if (B && L) {
s(B, function(X, Z) {
var Y = L[Z].name;
if (!x[Y] && L[Z].trigger === W) {
if (X && L[Z]) {
x[Y] = true;
P.push(Y)
}
}
})
} else {
for (N in C) {
if (C.hasOwnProperty(N)) {
if (!x[N]) {
E = C[N];
K = E && ((!E.ua && !E.test) || (E.ua && d.UA[E.ua]) || (E.test && E.test(d, G)));
if (K) {
x[N] = true;
P.push(N);
J = this.getModule(N);
if (J) {
F = this.getRequires(J);
for (M = 0; M < F.length; M++) {
P.push(F[M])
}
}
}
}
}
}
}
}
if (T.skinnable) {
R = this.skin.overrides;
for (N in YUI.Env.aliases) {
if (YUI.Env.aliases.hasOwnProperty(N)) {
if (d.Array.indexOf(YUI.Env.aliases[N], W) > -1) {
S = N
}
}
}
if (R && (R[W] || (S && R[S]))) {
z = W;
if (R[S]) {
z = S
}
for (N = 0; N < R[z].length; N++) {
A = this._addSkin(R[z][N], W);
if (!this.isCSSLoaded(A, this._boot)) {
P.push(A)
}
}
} else {
A = this._addSkin(this.skin.defaultSkin, W);
if (!this.isCSSLoaded(A, this._boot)) {
P.push(A)
}
}
}
T._parsed = false;
if (I) {
if (T.lang && !T.langPack && d.Intl) {
V = d.Intl.lookupBestLang(this.lang || v, T.lang);
D = this.getLangPackName(V, W);
if (D) {
P.unshift(D)
}
}
P.unshift(w)
}
T.expanded_map = n.hash(P);
T.expanded = e.keys(T.expanded_map);
return T.expanded
},
Mar 20 2017
Here are numbers for the different phases of the compilation (showing the increase in nodes and memory usage between phases):

Phase                                 Nodes   Main Zone (KB)   Temp Zone (KB)   All Zones (KB)
loop assignment analysis                  0      0.8046875        0              1112.140625
graph builder                          2518    585.1015625      125.2578125      159.359375
inlining                               1631    392.203125        42.875            0
early graph trimming                      0      0                33.40625          0
typer                                    10     22.3828125        46                0
typed lowering                          208     59.5859375        16.5859375        0
concurrent optimization preparation       3      0.28125           0                0
loop peeling                            276     72.2421875       240.34375       1187.507813
load elimination                          3      0.2578125       503.28125        221.3828125
escape analysis                           1      0.1171875       187.28125          0
simplified lowering                     161     56.4140625       293.515625        16.0703125
generic lowering                        233    102.3671875        24.7578125        0
early optimization                        2      0.21875         175.0859375        0
effect linearization                   3631    778.8359375       720.328125       520.203125
dead code elimination                     2      0.125            41.2734375        0
store-store elimination                   0      0               193.921875         0
control flow optimization                 0      0                28.3046875        0
memory optimization                     161    140.28125         113.34375        162.2421875
late optimization                        21      1.96875         244.859375        59.7578125
late graph trimming                       0      0                73.21875          0
scheduling                              604    621.3046875      1534.375         1639.53125
select instructions                       0      0               242.6328125      295.7578125

It looks like the most expensive phases are what we would expect: loop peeling, effect linearization, scheduling and instruction selection. I tried turning off loop peeling and it had no real effect on total peak usage (~100KB), since the later phases just increased the peak by as much as the loop peeling phase would have otherwise.
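If anyone wants to reproduce a rough per-phase breakdown locally without the custom instrumentation used above, I believe a plain d8 run with --turbo-stats prints per-phase timing and zone allocation statistics at exit (the exact columns will differ from the table above; the build path and script name below are only placeholders):

out/Release/d8 --turbo-stats yahoo_getrequires.js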
Mar 20 2017
+TF folks
Mar 20 2017
We were chatting about it here and had these thoughts:
* Scheduling - we think node splitting is what introduces so many new nodes (604). Can we try turning it off? There is a fear this increases live ranges, and it may hurt asm.js (lua). Where is the 1500KB in the schedule temp zone coming from? That is really big.
* Effect control linearization - we can change nodes in place, but this would only help by a fractional amount. Counting the number of nodes we actually lower would help decide whether changing nodes in place is worth doing there.
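Note for local experiments: node splitting should be controllable via the --turbo-splitting flag (flag name assumed from V8's defaults at the time, so double-check it), e.g. something along these lines, with the build path and script name again being placeholders:

out/Release/d8 --no-turbo-splitting --turbo-stats yahoo_getrequires.js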
Mar 20 2017
I did a rough break-down of the types which are allocated in the zones. For the main graph zone (after compile is complete) the percentage of each type is:

Nodes                    40.59396662
ZoneContainers           28.81027521
OutOfLineNodeInputs       9.667737001
BasicBlock                9.023607072
Operator                  7.144412442
UnknownArray              1.039510494
StateValueTypes           0.3384645067
CallDescriptor            0.335157036
RangeType                 0.2783787881
StateValueNodeKey         0.180808401
Signature                 0.1256838885
UnionType                 0.03638217825
SourcePositionWrapper     0.03224783981
HeapConstantType          0.03031848187
Schedule                  0.008268676874
FrameStateInfo            0.00330747075
TupleType                 0.0022049805
SourcePosTable            0.001929357937
unknown                   2.347339553

For Scheduling, the temp zone is almost entirely (99.34%) zone containers of some type. I'll try to drill down into what containers are taking the space tomorrow.
Mar 21 2017
The following revision refers to this bug:
https://chromium.googlesource.com/v8/v8.git/+/f9c4085904ec8579549df3ada44b0cc9bfe47acf

commit f9c4085904ec8579549df3ada44b0cc9bfe47acf
Author: Ross McIlroy <rmcilroy@chromium.org>
Date: Tue Mar 21 08:15:10 2017

[TurboFan] Use temporary zone for effect linearization schedule.

Also move phi NodeVector in TryCloneBranch to temporary zone.

BUG=chromium:700364
Change-Id: Id19d51dae63ed5a6f5dccbba77a19b3663fd325e
Reviewed-on: https://chromium-review.googlesource.com/456285
Reviewed-by: Ross McIlroy <rmcilroy@chromium.org>
Reviewed-by: Jaroslav Sevcik <jarin@chromium.org>
Commit-Queue: Ross McIlroy <rmcilroy@chromium.org>
Cr-Commit-Position: refs/heads/master@{#43962}

[modify] https://crrev.com/f9c4085904ec8579549df3ada44b0cc9bfe47acf/src/compiler/effect-control-linearizer.cc
[modify] https://crrev.com/f9c4085904ec8579549df3ada44b0cc9bfe47acf/src/compiler/pipeline.cc
[modify] https://crrev.com/f9c4085904ec8579549df3ada44b0cc9bfe47acf/src/compiler/scheduler.cc
[modify] https://crrev.com/f9c4085904ec8579549df3ada44b0cc9bfe47acf/src/compiler/scheduler.h
Mar 21 2017
Re #19: Turning off node splitting does avoid creating the extra 604 nodes in the schedule phase. It reduces the scheduling temp zone usage by around 300KB and the final graph zone by ~150KB, but doesn't have much impact on the peak zone memory usage for some reason (I need to look into this; it might be due to the way zone memory is allocated in exponentially increasing chunks).

Rough breakdown of the scheduler's temporary zone (with node splitting) by percentage of total zone use. It is dominated by the control equivalence data and the scheduler's node data. There is also reasonable usage by queues, which might be better allocated as non-zone memory.

ControlEquivalenceNodeData       29.69873692
SchedulerNodeData                25.45606021
ScheduleEarlyNodeQueue            9.017782757
ZoneVectorOther                   8.990490965
SchedulerNodes                    8.561004338
ScheduleLateMarkingQueue          4.271883708
ZoneStack                         4.7530811
SchedulerQueue                    3.01071562
SchedulerRootNodes                1.960699819
ZoneMapOther                      1.910425465
ZoneQueueOther                    1.440240551
UnknownArray                      0.5592423415
ControlEquivalenceBracketList     0.2413169008
SpecialRPOLoops                   0.02872820248
SpecialRPOBackEdges               0.02968580923
OutgoingBlocks                    0.02394016873
ScheduleLateMarked                0.009097264118
unknown                           0.03686785984

I'll have a look and see if there is anything we can do here.
Mar 22 2017
I mentioned this offline to Ross, but just so that we have something in writing too: ZoneQueues currently have a terrible implementation, which is that they are backed by a std::deque. std::deque allocates chunks for its data, and when a chunk is empty it deallocates it -- for Zone allocators, this effectively means growing memory use unboundedly as the queue is used, as we keep allocating new chunks for the back of the queue while losing the chunks for the front. ZoneStack is almost as bad, as every time popping something off the stack crosses a chunk boundary, we deallocate that chunk and lose that memory. In practice, I don't know how often we're hitting these issues, but potentially we should consider using free-lists of chunks, or circular queues, or just not use Zone collections here at all.
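To make the failure mode concrete, here is a toy sketch of it in plain C++. This is not V8's ZoneAllocator; ToyZone below is just a stand-in with the one property that matters here, namely that freeing is a no-op until the whole zone goes away:

#include <cstddef>
#include <deque>
#include <iostream>
#include <new>

// Toy stand-in for a Zone: bump-style accounting, and Free() does nothing
// because zone memory is only reclaimed when the whole zone dies.
struct ToyZone {
  std::size_t allocated = 0;
  void* Allocate(std::size_t bytes) {
    allocated += bytes;
    return ::operator new(bytes);
  }
  void Free(void*, std::size_t) {}  // intentionally a no-op
};

// Minimal allocator adapter over ToyZone, roughly what a zone allocator does.
template <typename T>
struct ToyZoneAllocator {
  using value_type = T;
  explicit ToyZoneAllocator(ToyZone* z) : zone(z) {}
  template <typename U>
  ToyZoneAllocator(const ToyZoneAllocator<U>& other) : zone(other.zone) {}
  T* allocate(std::size_t n) {
    return static_cast<T*>(zone->Allocate(n * sizeof(T)));
  }
  void deallocate(T* p, std::size_t n) { zone->Free(p, n * sizeof(T)); }
  ToyZone* zone;
};
template <typename T, typename U>
bool operator==(const ToyZoneAllocator<T>& a, const ToyZoneAllocator<U>& b) {
  return a.zone == b.zone;
}
template <typename T, typename U>
bool operator!=(const ToyZoneAllocator<T>& a, const ToyZoneAllocator<U>& b) {
  return !(a == b);
}

int main() {
  ToyZone zone;
  ToyZoneAllocator<int> alloc(&zone);
  // A "ZoneQueue": std::deque on top of zone memory. The deque allocates fresh
  // chunks for the back of the queue and "frees" chunks from the front, so the
  // zone grows with the number of elements ever pushed, not with the queue length.
  std::deque<int, ToyZoneAllocator<int>> queue(alloc);
  for (int i = 0; i < 1000000; ++i) {
    queue.push_back(i);
    queue.pop_front();  // queue stays ~1 element long the whole time
  }
  // Prints multiple megabytes on a typical standard-library implementation.
  std::cout << "zone bytes requested: " << zone.allocated << "\n";
}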
Mar 24 2017
Status update: the bad peak malloc memory regression has now recovered [1]. It recovered with Jaro's CL https://codereview.chromium.org/2766783002 "Reset the runtime profiler ticks for bytecode if IC state changes.", which means that we no longer optimize the function in #16 at all. That said, there is still a lot of scope for reducing TurboFan's memory usage (some of which is covered by my in-flight CLs), so I'm going to split the TurboFan memory reduction out into a separate bug (https://bugs.chromium.org/p/v8/issues/detail?id=6150).

As for the remaining regressions [2], these are in the V8 heap, so they may just be due to GC timing moving about. They are also much smaller regressions. I'll have a dig into them and see if they are anything to worry about.

[1] https://chromeperf.appspot.com/report?sid=069d5e2c6e73bea7b6f3732d78327a0dbdebbf3c35d4245e85d5b8c3c4a61fe0
[2] https://chromeperf.appspot.com/group_report?keys=agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgIDg1MaDjQoM,agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgIDg1JaBgggM,agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgIDglOf-9wgM,agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgIDg1Jr7tQkM,agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgIDg1Lz1ogoM,agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgIDg1KOEqAkM,agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgIDg1Ka6qwkM,agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgIDglMOl0wkM,agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgIDg1NWApgsM
Mar 24 2017
OK, looking into the v8_effective_size regressions, I don't think this is a problem; I think it is entirely GC-timing related. The overall V8 memory usage when the page is in the foreground goes down significantly on all devices with I+TF (~12-16% reduction). When the page is in the background, I+TF generally reduces memory on most of the devices, however on the Nexus5X there is a slight (~2%) increase in memory usage. Most likely this is because the foreground usage is lower, so a GC that would have fired pre-I+TF when the page goes into the background no longer gets triggered.

[1] https://chromeperf.appspot.com/report?sid=b52defde4691964dc5b6e72000b6c5fd4bb0ee181158f6473dc23efbfaba2420
Mar 24 2017
Thanks for the thorough analysis and hunting down the "culprit".
Mar 27 2017
The following revision refers to this bug:
https://chromium.googlesource.com/v8/v8.git/+/e73bde18dc51e79640a1c0e6926b518c8c39398a

commit e73bde18dc51e79640a1c0e6926b518c8c39398a
Author: Ross McIlroy <rmcilroy@chromium.org>
Date: Mon Mar 27 14:14:14 2017

[TurboFan] Reduce memory usage of ControlEquivalence.

The ControlEquivalence phase was taking a lot of memory by allocating a large data structure for every node, even if the nodes were dead or wouldn't participate in the control equivalence algorithm. Instead allocate the data on demand, and use the presence of the data as the flag for whether the node participates in the algorithm. Also remove the DFS number field as it was unused.

This reduces the amount of memory used for a 10,000 node graph in the linked bug from ~450KB to ~70KB. It also seems to reduce scheduling time by around 10% for local runs of Octane.

BUG=chromium:700364
Change-Id: Iedfdf4dff0a01463c5b6471513e6b69ef010b02d
Reviewed-on: https://chromium-review.googlesource.com/458219
Reviewed-by: Michael Starzinger <mstarzinger@chromium.org>
Commit-Queue: Ross McIlroy <rmcilroy@chromium.org>
Cr-Commit-Position: refs/heads/master@{#44150}

[modify] https://crrev.com/e73bde18dc51e79640a1c0e6926b518c8c39398a/src/compiler/control-equivalence.cc
[modify] https://crrev.com/e73bde18dc51e79640a1c0e6926b518c8c39398a/src/compiler/control-equivalence.h
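To make the on-demand idea in that CL concrete, here is a rough sketch of the pattern it describes (plain C++ with unique_ptr standing in for zone allocation; the names are made up and this is not the actual ControlEquivalence code):

#include <cstddef>
#include <memory>
#include <vector>

// Per-node algorithm state; allocated only for nodes that are actually visited.
struct PerNodeData {
  int klass = 0;  // placeholder for the real per-node state
};

class LazyPerNodeData {
 public:
  explicit LazyPerNodeData(std::size_t node_count) : data_(node_count) {}

  // The presence of the data doubles as the "participates in the algorithm" flag.
  bool Participates(std::size_t node_id) const { return data_[node_id] != nullptr; }

  PerNodeData& GetOrCreate(std::size_t node_id) {
    if (data_[node_id] == nullptr) {
      data_[node_id] = std::make_unique<PerNodeData>();  // allocated on demand
    }
    return *data_[node_id];
  }

 private:
  // One (possibly empty) slot per node; dead nodes never pay for a full record.
  std::vector<std::unique_ptr<PerNodeData>> data_;
};

int main() {
  LazyPerNodeData data(10000);     // a 10,000-node graph, as in the CL description
  data.GetOrCreate(42).klass = 1;  // only node 42 pays for a record
  return data.Participates(42) && !data.Participates(7) ? 0 : 1;
}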
Mar 27 2017
The following revision refers to this bug:
https://chromium.googlesource.com/v8/v8.git/+/a059e87eed5adab44e228478665dc1a8a136d4b7

commit a059e87eed5adab44e228478665dc1a8a136d4b7
Author: Ross McIlroy <rmcilroy@chromium.org>
Date: Mon Mar 27 14:33:47 2017

[TurboFan] Lazily allocate scheduled_nodes vectors since most remain empty.

The scheduled_nodes_ vector is used to maintain a per-block list of non-fixed nodes. For most blocks this list remains empty, so lazily initialize it instead of pre-allocating to save memory. Also pre-reserve an extra 10% of blocks to avoid reallocating space in the vector when fusing floating control creates new basic blocks.

BUG=chromium:700364
Change-Id: I9876e6a42bc90c9bff5838620365c18609ed1ee9
Reviewed-on: https://chromium-review.googlesource.com/458919
Reviewed-by: Michael Starzinger <mstarzinger@chromium.org>
Commit-Queue: Ross McIlroy <rmcilroy@chromium.org>
Cr-Commit-Position: refs/heads/master@{#44152}

[modify] https://crrev.com/a059e87eed5adab44e228478665dc1a8a136d4b7/src/compiler/scheduler.cc
[modify] https://crrev.com/a059e87eed5adab44e228478665dc1a8a136d4b7/src/compiler/scheduler.h
Mar 27 2017
The following revision refers to this bug:
https://chromium.googlesource.com/v8/v8.git/+/bdb4a8d33dbd99e48bb54594f7e96eb0313a0643

commit bdb4a8d33dbd99e48bb54594f7e96eb0313a0643
Author: Ross McIlroy <rmcilroy@chromium.org>
Date: Mon Mar 27 14:50:06 2017

[TurboFan] Reserve space in scheduler node data for split nodes.

When node splitting is enabled, new nodes can be created during scheduling. The Scheduler::node_data_ and Schedule::nodeid_to_block_ zone vectors reserve enough space for the node count before splitting, however they will have to reallocate space when node splitting occurs. The vectors double in size by default, meaning the peak zone usage is 3x the required amount for these vectors as soon as a single node is split. Avoid this in the common case by reserving 10% extra space for split nodes. The value 10% was chosen since it covers 98.7% of the optimized functions in Octane.

BUG=chromium:700364
Change-Id: Ibabd8d04cffd1eb08cc3b8a12b76892208ef3288
Reviewed-on: https://chromium-review.googlesource.com/458425
Commit-Queue: Ross McIlroy <rmcilroy@chromium.org>
Reviewed-by: Michael Starzinger <mstarzinger@chromium.org>
Cr-Commit-Position: refs/heads/master@{#44153}

[modify] https://crrev.com/bdb4a8d33dbd99e48bb54594f7e96eb0313a0643/src/compiler/scheduler.cc
[modify] https://crrev.com/bdb4a8d33dbd99e48bb54594f7e96eb0313a0643/src/compiler/scheduler.h
Mar 27 2017
The following revision refers to this bug:
https://chromium.googlesource.com/v8/v8.git/+/b90a20b2c70540188b1b1b44ea70c51af1badaa1

commit b90a20b2c70540188b1b1b44ea70c51af1badaa1
Author: Ross McIlroy <rmcilroy@chromium.org>
Date: Mon Mar 27 15:10:42 2017

Add RecyclingZoneAllocator for ZoneDeque.

A std::deque interacts badly with zone memory in that it allocates chunks of memory for the back of the queue and frees memory from the front of the queue. As such we never reuse zone memory for the queue. Implement a very simple RecyclingZoneAllocator which keeps a single block of memory from deallocation that can be reused on allocation. Also clean up zone-allocator a bit and make it use proper Chromium coding style.

BUG=chromium:700364
Change-Id: I19330a8a9ec6d75fe18d8168d41f1a12030a6c4d
Reviewed-on: https://chromium-review.googlesource.com/458916
Reviewed-by: Michael Starzinger <mstarzinger@chromium.org>
Reviewed-by: Ross McIlroy <rmcilroy@chromium.org>
Commit-Queue: Ross McIlroy <rmcilroy@chromium.org>
Cr-Commit-Position: refs/heads/master@{#44154}

[modify] https://crrev.com/b90a20b2c70540188b1b1b44ea70c51af1badaa1/src/compiler/instruction.h
[modify] https://crrev.com/b90a20b2c70540188b1b1b44ea70c51af1badaa1/src/objects.cc
[modify] https://crrev.com/b90a20b2c70540188b1b1b44ea70c51af1badaa1/src/zone/zone-allocator.h
[modify] https://crrev.com/b90a20b2c70540188b1b1b44ea70c51af1badaa1/src/zone/zone-containers.h
[modify] https://crrev.com/b90a20b2c70540188b1b1b44ea70c51af1badaa1/test/unittests/BUILD.gn
[modify] https://crrev.com/b90a20b2c70540188b1b1b44ea70c51af1badaa1/test/unittests/unittests.gyp
[add] https://crrev.com/b90a20b2c70540188b1b1b44ea70c51af1badaa1/test/unittests/zone/zone-allocator-unittest.cc
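For illustration, here is a toy version of the recycling idea from this CL, driving the same kind of std::deque churn as the sketch in the earlier ZoneQueue comment. Again, this is not the actual V8 code (RecyclingZoneAllocator in src/zone/zone-allocator.h is the real thing), and the per-copy free-block state is simplified:

#include <cstddef>
#include <deque>
#include <iostream>
#include <new>

// Toy zone again: it only ever grows.
struct ToyZone {
  std::size_t allocated = 0;
  void* Allocate(std::size_t bytes) {
    allocated += bytes;
    return ::operator new(bytes);
  }
};

// Recycling sketch: remember the single most recently freed block and hand it
// back to the next allocation that fits, so the deque's chunk churn stops
// consuming fresh zone memory.
template <typename T>
struct ToyRecyclingZoneAllocator {
  using value_type = T;
  explicit ToyRecyclingZoneAllocator(ToyZone* z) : zone(z) {}
  template <typename U>
  ToyRecyclingZoneAllocator(const ToyRecyclingZoneAllocator<U>& other) : zone(other.zone) {}

  T* allocate(std::size_t n) {
    if (free_block != nullptr && free_elements >= n) {  // reuse the kept block
      T* result = static_cast<T*>(free_block);
      free_block = nullptr;
      free_elements = 0;
      return result;
    }
    return static_cast<T*>(zone->Allocate(n * sizeof(T)));
  }
  void deallocate(T* p, std::size_t n) {
    if (n > free_elements) {  // keep only the largest freed block
      free_block = p;
      free_elements = n;
    }
  }

  ToyZone* zone;
  void* free_block = nullptr;
  std::size_t free_elements = 0;
};
template <typename T, typename U>
bool operator==(const ToyRecyclingZoneAllocator<T>& a, const ToyRecyclingZoneAllocator<U>& b) {
  return a.zone == b.zone;
}
template <typename T, typename U>
bool operator!=(const ToyRecyclingZoneAllocator<T>& a, const ToyRecyclingZoneAllocator<U>& b) {
  return !(a == b);
}

int main() {
  ToyZone zone;
  ToyRecyclingZoneAllocator<int> alloc(&zone);
  std::deque<int, ToyRecyclingZoneAllocator<int>> queue(alloc);
  for (int i = 0; i < 1000000; ++i) {
    queue.push_back(i);
    queue.pop_front();  // the chunk freed at the front gets recycled at the back
  }
  // On a typical implementation this stays small (a few chunks) instead of
  // growing with the number of elements ever pushed.
  std::cout << "zone bytes requested: " << zone.allocated << "\n";
}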