Discussion to stop alerting on effective_size for v8. |
|||
Issue descriptioneffective_size measures total committed memory. This is highly affected by fragmentation. It also has noise, and tends to increment by large chunks. In this crbug, reillyg@ spent a significant amount of time trying to understand a 300KB regression. After looking at traces, we determined that his actual CL only bumped allocated_objects_size by 20KB, an expected increase. https://bugs.chromium.org/p/chromium/issues/detail?id=729205#c12 In this crbug, v8 effective_size regressed by 500KB. Note that 478208 has multiple runs, which have noise of 494303 https://bugs.chromium.org/p/chromium/issues/detail?id=732849#c3 Looking at the traces for 478210 and 478211 [supposedly 478211 regressed] https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_0-2017-06-13_11-09-30-74055.html https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_0-2017-06-13_11-02-58-13055.html We see that effective_size varies by 500KB, but allocated_objects_size varies by 1.4KB!! I'm not saying that effective_size isn't important. It is a "true" measurement of memory impact of v8. But it's noisy in the worst way [deterministically non-deterministic. Small behavior changes can push v8 over a boundary, causing it to allocate [but not free] additional blocks, fragmentation, etc.]. =============================================================== The examples I give above are specifically for v8. I expect they apply to all allocators [e.g. malloc, partition_alloc, etc].
,
Jun 26 2017
Your analysis is spot on. Any change can push V8 over the 512K boundary of a committed page and will trigger these sort of alerts for "effective_size" Now, these alerts have helped us in the past though as changes often have unintended side effects (snapshot, a lot more or less code depending on inlining) that would otherwise go unnoticed. It also helps us look again when we just slowly regress allocated_objects_size. I.e., it at least gives an alert when a lot of people start adding new sets of 20K live objects. I agree that for most parts, if allocated_objects_size doesn't change significantly, the metric is mainly an FYI.
,
Jun 26 2017
Hmm Issue 729205 is quite unfortunate, but at the same time I a bit worried of just stop alerting on effective_size for v8. as mlippautz@ says that helped to unveil a good number of v8 regressions in the past, which wouldn't show up at all using just allocated_object_size, because AOS effectively hides away any GC behavior. > the metric is mainly an FYI. Well, yes if you see this from the viewpoint of a chrome, non-v8, developer. But I don't fully agree with this if you see this from the viewpoint of the end user. The effective size of the v8 heap is what the end user device experiences. I think the best way forward to make a decision here is: - look at the history of the last ~30 alerts (or just jumps) in v8 effective size - look at how many of them had a matching jump/alert in allocated_object_size - look at how many cases of [effective_size jumped but not allocated_object_size] were actual bugs vs false positives The data is there. I just don't know how easily diggable that is. Without looking at data, my fear is that we would just hide completely any sort of regression coming from v8 GCs, which would make me quite nervous.
,
Jun 26 2017
Actually, giving a 2nd look to Issue 729205, there were other alerts on runtime regressions on v8-gc-latency-mark-compactor_sum. Which suggests that whatever change caused that, it was causing more thrashing on the GC (Which would explain both the runtime and commited heap size variation). If this is the case, I am not even sure this is a "false positive". Also, by looking even, more, I think that Issue 729205 has another problem: there are two overlapping regressions, the bisect identified only one and blamed all the metrics on that, and it has been erroneously marked as WontFix. The v8-gc-latency-mark-compactor_sum regressions didn't recover, despite the fact that the WebUSB was reverted back to experimental. Also, the memory metrics that have been deemed as "recovered", didn't recover correspondingly to the WebUSB revert (#480063) but did recover in 479371 - 479466. Going back to this original bug, I think that there is something else going on with effective_size, IMHO not fully related with WebUSB. I will reopen that bug there, as I think that is a case of bisect going wrong because of double-regressions in range. If this is confirmed, that sounds a reason for actually keeping the alert :)
,
Jun 26 2017
Not sure if the link to Issue 732849 was intentional (it doesn't seem to be about v8) or not, but that turned out to be an actual regression (now duped into Issue 732294 ).
,
Jun 26 2017
Had a long discussion with primiano about alerting on effective_size vs. allocated_objects_size for both v8, and other allocators [partition_alloc, malloc, etc.] ================================================= Definitions, for others to follow along. effective_size: committed size [represents total non-discardable memory in use by allocator, including fragmentation, mostly empty blocks, etc.] allocated_objects_size: sum of size of allocated objects. ================================================== Here's our rundown of all scenarios: Scenario 1) CL regresses both allocated_objects_size and effective_size. In this case, the marginal utility of alerting on effective_size is 0, since the alert on allocated_objects_size would have bisected to the appropriate CL. Scenario 2) CL regresses allocated_objects_size but not effective_size. This can happen if a CL happens to fill up more most-empty-blocks. In this case, firing an alert is still correct, the fact that effective_size did not regress is most-likely context dependent. Scenario 3) CL regresses effective_size but not allocated_objects_size. Scenario 3a) Due to sharp-cutoffs caused by allocator [e.g. allocating in blocks], minimal timing/behavior changes can cause semi-stable changes in effective_size. primiano@ and I looked at several examples of this in the perf dashboard. Also see crbugs linked in my opening post. Scenario 3b) Regression in implementation of allocator itself [e.g. GC, fragmentations, etc.] Scenario 3c) A CL makes fragmentation worse [e.g. allocating, then freeing many, but not all objects]. ================================================== (3a) causes false positives, which wastes developer time. (3c) is a scenario we'd like to support, but there's currently no good way to distinguish (3a) from (3c). (3b) is most relevant for v8. For Desktop OSes, the marginal utility of catching regressions of (3c) is less than the marginal cost of dealing with false positives from (3a). On Android, System Health Council is a forcing function, and we *absolutely* cannot regress effective_size. ================================================== Action Items, short term: On desktop, turn off alerts for <allocator>:effective_size and replace them with alerts for <allocator>:allocated_objects_size. On Android, keep on alerts for <allocator>:effective_size, but also turn on alerts for <allocator>:allocated_objects_size. Action Items, medium term: Assuming our bisect system is able to differentiate between scenarios 1, 2, and 3... If there is a scenario 3 regression: Bisects for v8:effective_size that single out a v8 roll should be filed against the v8 team [Likely 3b]. Other bisects should indicate the two potential scenarios [3a] and [3c].
,
Jun 27 2017
We greatly rely on effective size alerts when optimizing basic memory infrastructure. In fact, we also had OKRs that were fragmentation related and depended on both (allocated and effective). If something regresses there, we won't catch it anymore. Also, the longer running gmails apps together with effective size are used to check the behavior of our memory reducer. Until some uber system (however it is called) is actually ready and in action I don't see how we can get away without alerting for those. All those cases probably fall into 3b). Are they not worthy of alerting anymore? What is the problem with WontFix-ing minor regressions on this metric?
,
Jun 27 2017
erikchen@, please do not remove effective size alerts for v8. We have different GC heuristics for desktop and Android. Some changes can regress desktop without regressing Android. > For Desktop OSes, the marginal utility of catching regressions of (3c) is less than the marginal cost of dealing with false positives from (3a). This is actually (3b + 3c) vs 3a.
,
Apr 9 2018
See also issue 830517 for where the metric was super flaky. Having flaky metrics on perf monitoring undermines the perf bugs since people will stop paying attention to them once they realize they're often noise. Can you take a look at issue 830517 and check what went awry there?
,
Apr 9 2018
"you" being ulan, sorry for not mentioning in the previous comment.
,
Apr 9 2018
Replied in issue 830517 .
,
Apr 9 2018
+perezju, do you still have that dashboard handy with the statistics of noise, alerts and wontfixes for each metric (specifically the one in question here)?
,
Apr 10 2018
It was not a "continuous" dashboard, so the analysis exists but the data we used at the time will probably not reflect the current situation. That's, however, one of the things I'm working on now, to productionize the data gathering/analysis scripts so we can answer these kinds of questions repeatedly and more easily. Stay tuned. |
|||
►
Sign in to add a comment |
|||
Comment 1 by erikc...@chromium.org
, Jun 23 2017