Investigate memory regression caused by turning on the DispatchSafetyNetCheckOffThread feature in M64
Issue description:

Memory.Browser.PrivateMemoryFootprint is showing significant regressions in the 0.2-0.3 MB range with the SubresourceFilter experiment turned on. We should get to the bottom of this regression by M66. One possibility I considered is that we were indexing our 2 MB ruleset (which happens either after startup or when the component updater updates) during memory-infra dumps, but I don't think that's the case (we wouldn't see such significant results). Additionally, that indexing should only happen once (when the component updates).
Jan 30 2018
Alternatively, this could just be a red herring, and some other histogram would have triggered the growth later on.
Jan 30 2018
Yikes, if the map is indeed growing by 2 MB from a single extra histogram, that sounds like an inappropriate growth rate for a map of this size. Perhaps we should choose a different container, following the guidance at https://chromium.googlesource.com/chromium/src/+show/master/base/containers/README.md?
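For intuition on why a single insertion could cause a large step (a standalone sketch, not the actual histograms code): hash-based containers grow by rehashing into a bucket array roughly twice the previous size, so one insert past the load-factor threshold makes memory grow in big jumps rather than smoothly.

```cpp
// Standalone illustration (not Chromium code): watch the bucket array of
// a std::unordered_map roughly double each time an insert crosses the
// load-factor threshold.
#include <cstdio>
#include <string>
#include <unordered_map>

int main() {
  std::unordered_map<std::string, int> histograms;
  std::size_t last_buckets = histograms.bucket_count();
  for (int i = 0; i < 5000; ++i) {
    histograms["Histogram." + std::to_string(i)] = i;
    if (histograms.bucket_count() != last_buckets) {
      // Each jump is roughly a doubling of the bucket array.
      std::printf("size=%zu buckets: %zu -> %zu\n", histograms.size(),
                  last_buckets, histograms.bucket_count());
      last_buckets = histograms.bucket_count();
    }
  }
  return 0;
}
```

If the recorder's map were hash-based and already large, one extra histogram could plausibly trigger a step like that.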
Jan 30 2018
I think the use of this container might be recent, from a change by fdegros.
Jan 30 2018
Yes, potentially, although it depends on the perf characteristics of the histograms code. I'm planning on adding some local instrumentation to the histograms code to see how big this buffer grows in practice, to evaluate whether my hypothesis holds water. I'll report back with what I see. Semi-comedy option: maybe we should log a histogram with the size of the histogram buffer??
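A local-only instrumentation patch for that could look roughly like this (names are illustrative; the real map and its lock live in base/metrics/statistics_recorder.cc):

```cpp
// Hypothetical local-only instrumentation (illustrative names; not meant
// to be committed): log the recorder's size after each registration so
// we can watch how large it grows during normal browsing.
#include "base/logging.h"

void LogHistogramCountLocked(size_t histogram_count) {
  // Only log when crossing a new doubling threshold, to keep the
  // output readable.
  static size_t next_log_at = 1;
  if (histogram_count >= next_log_at) {
    LOG(ERROR) << "StatisticsRecorder now holds " << histogram_count
               << " histograms";
    next_log_at *= 2;
  }
}
```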
Jan 30 2018
Just to clarify, we are seeing this regression in M64, so if the code changed after that, we'll need to take it into account (at least for this bug).
Jan 31 2018
I just took a look at the StatisticsRecorder on M64, and it looks like we're using std::map in that version. To my knowledge, std::map does not have the sort of memory boundary conditions that std::unordered_map does, since it is implemented as a red-black tree and allocates per node rather than rehashing. So I think this regression is not related to histograms code. I did add some instrumentation to ToT and found that the map does not typically grow past ~3,000 entries with light usage over 10 minutes. It is strange that I saw a 2 MB allocation in that trace; a memory-infra dump of the same run showed RegisterOrDeleteDuplicate being responsible for only ~40 KB of allocations, far less than 2 MB.
Feb 1 2018
Thanks for chasing that up. Is the problem happening only with M64, or do you also observe it with HEAD? There used to be a problem with RegisterOrDeleteDuplicate: it used to leak memory if it was called repeatedly before StatisticsRecorder was initialized. This has been recently fixed. Not sure this is related to the problem you are observing, but it might be useful to keep in mind.
Feb 1 2018
Update: at this point I'm fairly sure the regression is coming from an experiment that was turned on in the same config. When SubresourceFilter launched to 100% yesterday, we kept a small Safe Browsing optimization (DispatchSafetyNetCheckOffThread) in a 5% holdback, and that holdback is showing the same memory regression.
https://uma.googleplex.com/p/chrome/variations/?sid=5722dc9c6d342f3ab4e1d488aa069ae8
Changing components accordingly. The only thing I can think of is that this experiment adds a new Java HandlerThread, which I thought should take only 8-32 KB of memory. I'll continue to investigate today.
Feb 2 2018
Quick update: it looks like UMA is reporting that this is almost entirely due to malloc (i.e. Memory.Experimental.Browser2.Malloc). This is evidence that the regression is not caused by something on the Java heap, which this experiment also touches. Will run some memory-infra today to try to repro.
Feb 2 2018
It doesn't look like this is SafeBrowsing-related after all. csharrison@ -- can you please confirm, and if that's the case, remove the SafeBrowsing component? Thanks.
Feb 2 2018
It is due to SafeBrowsing; in particular, it is due to our SafetyNet experimental code. I haven't figured out the root cause, but I'm wondering if we're increasing peak memory by queuing up many tasks (under some bad conditions) on the background blocking task scheduler queue.
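For context, the dispatch in question has roughly this shape (an illustrative sketch using the 2018-era TaskScheduler API, not the exact Safe Browsing code; StartURLCheck is a hypothetical stand-in):

```cpp
// Rough shape of the experiment's dispatch (illustrative): each URL
// check is posted to a single MayBlock sequence instead of running on
// the IO thread, so a burst of checks can pile up behind slow blocking
// work on that sequence.
#include "base/bind.h"
#include "base/sequenced_task_runner.h"
#include "base/task_scheduler/post_task.h"
#include "url/gurl.h"

void StartURLCheck(const GURL& url);  // Hypothetical stand-in.

void DispatchSafetyNetCheckOffThread(const GURL& url) {
  // USER_VISIBLE + MayBlock matches the TaskLatency histogram suffix in
  // the dashboards discussed below.
  static scoped_refptr<base::SequencedTaskRunner> task_runner =
      base::CreateSequencedTaskRunnerWithTraits(
          {base::MayBlock(), base::TaskPriority::USER_VISIBLE});
  task_runner->PostTask(FROM_HERE, base::BindOnce(&StartURLCheck, url));
}
```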
Feb 5 2018
From https://uma.googleplex.com/p/chrome/variations/?sid=8bcf639fd043943fd990aa4b7ff428cb#TaskScheduler.TaskLatencyMicroseconds.Browser.UserVisibleTaskPriority_MayBlock it looks like you increase the workload by 26%, all on a single sequence! So yes, there's a major backlog on your sequence in some scenarios. Is cancellation an option? The good thing is: its being expensive to run shows how valuable moving this off the IO thread is :)!
Feb 5 2018
This feature has security implications, so it would be unwise to drop requests on the floor with cancellation. I think it may make more sense to keep track of the queue length and just execute the task directly on the IO thread if we're over the limit. Note also that we share this thread pool with the system DNS resolver, which posts blocking tasks to this queue and could potentially slow us down quite a bit. It does seem like the regression is mitigated when Async DNS is turned on:
AsyncDns disabled: https://uma.googleplex.com/p/chrome/variations/?sid=6cdee82656f1e50ad2882235d01b1d21
AsyncDns enabled: https://uma.googleplex.com/p/chrome/variations/?sid=d3a971ce4e2faa05c5ba00cd5a0ac1d7
The TaskScheduler.TaskLatencyMicroseconds.Browser.UserVisibleTaskPriority_MayBlock metrics also look a lot better with AsyncDns on, although you need to look at newer data than the links above to get builds with that metric. AsyncDns should launch on stable in M65 for Android M and above, and in M66 for everyone else.
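A sketch of what that over-limit fallback could look like (hypothetical names and threshold; a real version would pick the limit from data):

```cpp
// Sketch of the proposed mitigation (hypothetical names): track the
// backlog on the background sequence, and if it is too deep, run the
// check synchronously on the IO thread instead of dropping it.
#include <atomic>
#include "base/bind.h"
#include "base/callback.h"
#include "base/task_scheduler/post_task.h"

namespace {
std::atomic<int> g_queued_checks{0};
constexpr int kMaxQueuedChecks = 100;  // Illustrative limit.

void RunAndDecrement(base::OnceClosure check) {
  std::move(check).Run();
  g_queued_checks--;
}
}  // namespace

// Called on the IO thread.
void DispatchCheck(base::OnceClosure check) {
  if (g_queued_checks.load() >= kMaxQueuedChecks) {
    // Backlog too deep: fall back to the pre-experiment IO-thread path
    // rather than dropping a security-relevant check on the floor.
    std::move(check).Run();
    return;
  }
  g_queued_checks++;
  base::PostTaskWithTraits(
      FROM_HERE, {base::MayBlock(), base::TaskPriority::USER_VISIBLE},
      base::BindOnce(&RunAndDecrement, std::move(check)));
}
```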
Feb 6 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/47ee2b19edebc6e061841b03c9a16b64af83585f

commit 47ee2b19edebc6e061841b03c9a16b64af83585f
Author: Charles Harrison <csharrison@chromium.org>
Date: Tue Feb 06 01:57:18 2018

SafeBrowsingApiHandlerBridge: Remove callback copy

This CL does two things:
1. Makes the completion callback a OnceCallback.
2. Allocates a single callback on the heap, rather than binding one with the call to Core::StartURLCheck and then copying one on the heap.

This shouldn't affect memory use, but it does make the memory usage of the callback more apparent, and avoids copying it around.

Bug: 806981
Change-Id: I8830f5e4e8379ebfd4d67c5333877d5532e12428
Reviewed-on: https://chromium-review.googlesource.com/902862
Reviewed-by: Varun Khaneja <vakh@chromium.org>
Commit-Queue: Charlie Harrison <csharrison@chromium.org>
Cr-Commit-Position: refs/heads/master@{#534581}

[modify] https://crrev.com/47ee2b19edebc6e061841b03c9a16b64af83585f/components/safe_browsing/android/remote_database_manager.cc
[modify] https://crrev.com/47ee2b19edebc6e061841b03c9a16b64af83585f/components/safe_browsing/android/remote_database_manager_unittest.cc
[modify] https://crrev.com/47ee2b19edebc6e061841b03c9a16b64af83585f/components/safe_browsing/android/safe_browsing_api_handler.h
[modify] https://crrev.com/47ee2b19edebc6e061841b03c9a16b64af83585f/components/safe_browsing/android/safe_browsing_api_handler_bridge.cc
[modify] https://crrev.com/47ee2b19edebc6e061841b03c9a16b64af83585f/components/safe_browsing/android/safe_browsing_api_handler_bridge.h
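The pattern this CL moves to, roughly (an illustrative sketch; the real code is in safe_browsing_api_handler_bridge.cc and goes through JNI):

```cpp
// Sketch of the CL's pattern (illustrative names): the completion
// callback becomes a OnceCallback, is heap-allocated exactly once, and
// ownership is round-tripped through the Java side as an opaque integer
// handle instead of being copied.
#include <cstdint>
#include <memory>
#include "base/callback.h"

using URLCheckCallback = base::OnceCallback<void(bool /*is_safe*/)>;

// Before calling into Java: release ownership into an opaque handle.
intptr_t ToHandle(std::unique_ptr<URLCheckCallback> callback) {
  return reinterpret_cast<intptr_t>(callback.release());
}

// When Java reports completion: re-take ownership and run exactly once.
void OnURLCheckDone(intptr_t callback_handle, bool is_safe) {
  std::unique_ptr<URLCheckCallback> callback(
      reinterpret_cast<URLCheckCallback*>(callback_handle));
  std::move(*callback).Run(is_safe);
}
```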
Feb 10 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/3b1f30b790683bb0b0cddf06d3ca2c06127c978a

commit 3b1f30b790683bb0b0cddf06d3ca2c06127c978a
Author: Charles Harrison <csharrison@chromium.org>
Date: Sat Feb 10 04:27:48 2018

Safe Browsing: Add ScopedBlockingCall(MAY_BLOCK) before remote call

The CL also adds a microsecond histogram to the call, to see how often we'd get above the 10ms threshold in practice. A similar histogram exists (SubresourceFilter.SafeBrowsing.CheckDispatchTime), but it is only called for SubresourceFilter safe browsing checks on main frame navigations.

Bug: 806981
Change-Id: If63590a4f543a1e65760ee5c7f35d8bc773cef10
Reviewed-on: https://chromium-review.googlesource.com/911750
Commit-Queue: Charlie Harrison <csharrison@chromium.org>
Reviewed-by: Varun Khaneja <vakh@chromium.org>
Reviewed-by: Robert Kaplow <rkaplow@chromium.org>
Reviewed-by: Gabriel Charette <gab@chromium.org>
Cr-Commit-Position: refs/heads/master@{#535954}

[modify] https://crrev.com/3b1f30b790683bb0b0cddf06d3ca2c06127c978a/components/safe_browsing/android/safe_browsing_api_handler_bridge.cc
[modify] https://crrev.com/3b1f30b790683bb0b0cddf06d3ca2c06127c978a/tools/metrics/histograms/histograms.xml
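The shape of that change, approximately (an illustrative sketch; CallIntoGmsCore and the histogram name are stand-ins, and the 2018-era ScopedBlockingCall constructor took only the blocking type):

```cpp
// Approximate shape of the CL (illustrative): annotate the remote call
// as possibly blocking, and time it with a microseconds histogram to see
// how often it exceeds the ~10 ms threshold in practice.
#include "base/metrics/histogram_macros.h"
#include "base/threading/scoped_blocking_call.h"
#include "base/time/time.h"

void CallIntoGmsCore();  // Hypothetical stand-in for the remote call.

void RunRemoteCheck() {
  // Tell the scheduler this scope may block so it can compensate with
  // more workers.
  base::ScopedBlockingCall scoped_blocking_call(
      base::BlockingType::MAY_BLOCK);
  const base::TimeTicks start = base::TimeTicks::Now();
  CallIntoGmsCore();
  UMA_HISTOGRAM_CUSTOM_MICROSECONDS_TIMES(
      "SafeBrowsing.RemoteCheckDispatchTime",  // Illustrative name.
      base::TimeTicks::Now() - start,
      base::TimeDelta::FromMicroseconds(1),
      base::TimeDelta::FromSeconds(1), 50);
}
```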
Feb 22 2018
A bit more stable data shows that we are also "regressing" renderer memory. I think this is more evidence that faster resource loading can cause odd side effects on memory; there is just no way this experiment is directly influencing renderer memory.
https://uma.googleplex.com/p/chrome/variations/?sid=9494f4d4764574915ec61c30ec5b9f0b
At the 75th and 95th percentiles, the dashboard shows a 100-200 KB regression.
Feb 26 2018
Removing the RB-Stable label for a few reasons:
1. Not reproducible in the lab.
2. Already landed some mitigations for queuing latency and queue size (though data is hard to track on Canary/Dev due to SimpleCache priority experiments). AsyncDns is landing in M65, which should further reduce this queue's size and memory.
3. Data indicates that this could be caused by making the IO thread faster, changing global browser behavior. Supporting evidence:
- Significant increase in the number of subresource loads, and a >7% reduction in subresource loading times.
- Less time waiting to dispatch more requests means more time with more responses and other data in memory.
- Significant increase in the number of foreground page loads (e.g. TTFCP).
- Significant increase in renderer memory, despite this being a simple browser-side change.
4. Direct memory use of the feature is transient, for resource loading only.
+sullivan FYI. I'll leave the bug open and continue to investigate; I just don't think it should block the M66 release.
May 11 2018
This experiment has been turned down in favor of threading on the Java side: https://chromium-review.googlesource.com/1050804
It still might be nice to have additional tools in place for dealing with PostTask overhead, but marking this one as WontFix.