
Issue 760578

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Feb 2018
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocking:
issue 751972




Turn on the OOP HP shim for memory perf tests.

Project Member Reported by erikc...@chromium.org, Aug 30 2017

Issue description

The previous shim [--enable-heap-profiling=native] was too slow and caused timeouts.

The new shim --memlog=browser should be faster, both at profiling and dumping. Remaining steps:

  1) Making sure that we use the same GUID when we dump into the trace, which allows us to coalesce metrics across detailed dumps.
  2) Emitting the shim_allocated_objects_size and shim_allocated_objects_count metrics (a sketch follows below).
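
As a rough illustration of step (2): the two shim metrics would be attached to the malloc allocator dump of each profiled process via the usual base::trace_event calls. This is only a sketch; EmitShimStats and the way the two values are obtained are assumptions, not code from this bug.
"""
#include "base/trace_event/memory_allocator_dump.h"
#include "base/trace_event/process_memory_dump.h"

// Sketch only: attaches the shim counters to the "malloc" allocator dump.
void EmitShimStats(base::trace_event::ProcessMemoryDump* pmd,
                   uint64_t shim_allocated_objects_size,
                   uint64_t shim_allocated_objects_count) {
  base::trace_event::MemoryAllocatorDump* dump =
      pmd->GetOrCreateAllocatorDump("malloc");
  dump->AddScalar("shim_allocated_objects_size",
                  base::trace_event::MemoryAllocatorDump::kUnitsBytes,
                  shim_allocated_objects_size);
  dump->AddScalar("shim_allocated_objects_count",
                  base::trace_event::MemoryAllocatorDump::kUnitsObjects,
                  shim_allocated_objects_count);
}
"""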
 
Cc: perezju@chromium.org
Cc: lalitm@chromium.org ssid@chromium.org
Hi all.

Coming back to this. The new flag to turn on OOP HP is "--memlog=all". NHP has been turned off on the telemetry waterfall due to slowness/flakiness; see crbug.com/757847. The benefit of turning on some form of HP is that we get malloc:shim_allocated_objects_size, which is ~10x less noisy than the malloc stats we get from querying malloc directly.

The following modifications are necessary to turn it on:
1) tools/perf/benchmarks/memory.py
"""
  def SetExtraBrowserOptions(self, options):
    super(MemoryBenchmarkTrivialSitesDesktop, self).SetExtraBrowserOptions(
        options)
    options.AppendExtraBrowserArgs([
      '--memlog=all',
    ])
"""

2) telemetry/telemetry/internal/backends/chrome_inspector/tracing_backend.py
"""
  def DumpMemory(self, timeout=None):
...
    time.sleep(5)
    result = response['result']
    return result['dumpGuid'] if result['success'] else None
"""

The sleep is necessary because memory_instrumentation *thinks* that the dump has finished, but OOP HP still needs some time to serialize and flush the dump. This in turn is because MDP::OnMemoryDump() expects the data to be synchronously returned/serialized, whereas OOP HP is asynchronous. 

I've attached the resulting html object. There are two issues:
1) shim_allocated_objects_size does not show up in the tracing UI under the malloc MDP.
2) shim_allocated_objects_size is not extracted by the catapult measurement system.

I've confirmed that the trace contains shim_allocated_objects_size. Here's what I think is happening:
  * Browser Malloc MDP emits a Malloc allocator dictionary that has some stats. It does not have shim_allocated_objects_size.
  * Later, the OOP HP MDP emits a malloc allocator dictionary [using the same pid and guid], which contains different object/count/size stats, as well as the additional shim_allocated_objects_size fields.
    * See ProfilingProcessHost::OnDumpProcessesForTracingCallback for the emission logic.
    * See json_exporter.cc:WriteAllocatorsSummary() for the allocator dictionary emitted by the OOP HP MDP.
    * See MallocDumpProvider::OnMemoryDump for the allocator dictionary emitted by the malloc MDP.

I bet that the UI/extractor looks for the first instance of the malloc allocator dictionary and uses the info there.

Next steps:
  1) I think the best solution to this problem is for the Malloc MDP of a process to do nothing if the process is being OOP-HP-profiled (sketched below).
  2) It would be nice if the MDP::OnMemoryDump interface allowed async responses. That way OOP HP could asynchronously serialize all relevant information to the trace log before memory_instrumentation thinks that the "dump" is complete.
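
A minimal sketch of what (1) could look like, assuming a hypothetical IsOopHeapProfilingEnabled() check (the actual condition and where it lives are not specified here):
"""
#include "base/trace_event/malloc_dump_provider.h"

namespace base {
namespace trace_event {

bool MallocDumpProvider::OnMemoryDump(const MemoryDumpArgs& args,
                                      ProcessMemoryDump* pmd) {
  // If the OOP heap profiler is attached to this process, emit nothing here
  // so that the OOP HP MDP is the only writer of the "malloc" allocator
  // dictionary for this pid/guid. IsOopHeapProfilingEnabled() is a
  // placeholder, not an existing helper.
  if (IsOopHeapProfilingEnabled())
    return true;  // Report success without emitting anything.
  // ... existing in-process malloc accounting stays as-is ...
  return true;
}

}  // namespace trace_event
}  // namespace base
"""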


telemetry_oophp_trace.html.gz
976 KB
(1) has been implemented. I'm going to work on (2) next.

Comment 4 by hjd@chromium.org, Jan 11 2018

Cc: hjd@chromium.org
> MDP::OnMemoryDump interface allowed async responses

Hmm, that sounds tricky. Can you put together a short doc with a plan? Mostly in your interest, to avoid spending too much time on this.
I think it's quite hard to change the current API without changing hundreds of call sites and adding quite some risk of new thread races. I'm also not sure if/how you can minimize thread hops if you don't control them anymore.

A bunch of things to keep in mind:
1) We guarantee linearization on the PMD passed in, i.e. while OnMemoryDump is being called on any task runner, we guarantee that it won't concurrently be called on another one.

2) We guarantee to call MDPs on their own thread, to avoid having them deal with threads and locks. Having locks in all the MDPs means dooming ourselves to chasing bugs across all the various MDP sources.

3) What happens if an async MDP doesn't reply? I don't like being in a state where an MDP can compromise the full dumps by forgetting to dump. Shutdown in particular would become hell, because task runners get stopped in random order and the various MDP code will very likely not deal with that (nor do we want to replicate some sort of careful code in 69 places, the current number of MDPs, not counting tests). How will an MDP know the right task runner on which to post the callback?

4) A few MDPs rely on having a shared PMD that they read from and write into, mostly to check whether something else already created the same keys before overriding them (without relying on relative ordering, though).

In summary, I am supportive of an *extra* async API, but would prefer that it remains something we can keep under control (very likely just for the HP) and not the default. My gut feeling is that it could be too much of a footgun.
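
For reference, the kind of *extra*, opt-in async API being discussed might look roughly like this. It is only a sketch of the shape, not a proposed patch, and all of the names below are made up; the default synchronous OnMemoryDump() would stay untouched.
"""
#include "base/callback.h"
#include "base/trace_event/memory_dump_request_args.h"
#include "base/trace_event/process_memory_dump.h"

// Hypothetical opt-in interface; only providers that explicitly opt in
// (e.g. the OOP heap profiler) would get the callback-based variant.
class AsyncMemoryDumpProvider {
 public:
  using DoneCallback = base::OnceCallback<void(bool success)>;

  virtual ~AsyncMemoryDumpProvider() = default;

  // Must invoke |done| exactly once, on the task runner the provider was
  // registered with; otherwise the coordinator has to time the provider out
  // (point 3 above).
  virtual void OnMemoryDumpAsync(
      const base::trace_event::MemoryDumpArgs& args,
      base::trace_event::ProcessMemoryDump* pmd,
      DoneCallback done) = 0;
};
"""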


Owner: erikc...@chromium.org
Status: Assigned (was: Untriaged)
Okay, I've looked at some alternative solutions. I will summarize my findings.

Problem we're trying to solve:
When OOP HP is enabled, the memory_instrumentation service needs to know when the heap dump has been added to the trace.

There are three places we can add a hook:
  1) MemoryDumpManager
  2) CoordinatorImpl
  3) ClientProcessImpl


(1) seems like the semantically correct location. We are, after all, adding a memory dump to the trace. 
  1a) My original proposal was to simplify the implementation by having a single controlling thread, with a lock that controls the relevant data structure, and dispatching "work" to other threads. This easily allows OnMemoryDump to become asynchronous [or if we want, to have a cousin API OnMemoryDumpAsync]. The cost is a potential doubling of thread hops, since we now always need to hop back to the controlling thread.
  1b) Primiano's suggestion is to add an async mechanism to the existing implementation. Certainly doable, but it adds complexity.

Note: (1a) and (1b) are not mutually exclusive. 

(2) is the next most logical location. We could add an async PendingResponse with type kMemlog. The problem is layering: all the OOP HP code lives in chrome/, and the relevant dumping code in ProfilingProcessHost isn't exposed as part of a service. While CoordinatorImpl could talk directly to the profiling service to get the heap dump, there are some hooks in ProfilingProcessHost [for tests, and to stop tracing] that we can't skip.
  2a) Since CoordinatorImpl is a singleton, we could just have ProfilingProcessHost directly communicate with it, skipping the mojo layer. This has the unfortunate side effect of implicitly tying the class to the same process as ProfilingProcessHost. I don't like this approach.
  2b) The alternative is to create another Service to communicate with ProfilingProcessHost. Seems like way too much complexity.

(3) is the least semantically correct solution. Every process has a ClientProcessImpl, which currently invokes the MDM. We could add a mechanism for ProfilingProcessHost to talk to ClientProcessImpl [again, running into base/ vs. chrome/ layering issues].


I'm probably going to go with (1b). If MemoryDumpManager were less convoluted, it would be the obviously correct choice. As it is...this will be a fun ordeal. I still think that we've prematurely optimized [we get to skip...30? 90? thread hops during a memory-infra trace], at the cost of pretty confusing code. 


Comment 7 by ssid@chromium.org, Jan 31 2018

MemoryDumpManager does not "add dumps to the trace". All MemoryDumpManager does is invoke dump providers in-process. I would be against adding inter-process communication and waiting to MemoryDumpManager's providers, which the in-process dump does not care about. I feel we should use the already existing async behavior in CoordinatorImpl instead of making MDM async.

I think we generally deal with layering issues by adding delegate implementations. Why can't we just have a memlog delegate alongside ClientProcessImpl? Only the process that is capable of collecting the dump (the browser) would implement the delegate and add dumps for all processes. This is just like how the OS dumps are implemented, except that the OS dump provider doesn't need a delegate and can be implemented in place.

Another solution would be to implement memory_instrumentation::ClientProcess in ProfilingProcessHost. Since the profiling process host is not a simple dump provider, it can be treated as a process of its own, one that adds dumps for other processes. RequestChromeDump could create the memlog dump and add it to the PMD. It would also be easier to move this host to a different process in the future if needed, and that process would not need a MemoryDumpManager.

If we are implementing the memlog dump as a dump provider, then it should be implemented only in the browser process, and the browser process can add dumps on behalf of other processes. Adding a dump provider in each process would cost 2 extra round trips for the heap dump data and 3 extra IPC calls per process: the renderer asks the browser for the dump, the browser sends the whole dump to the renderer, and the renderer sends it back to the browser. MemoryDumpManager already does not collect only in-process dumps, so it is fine to collect dumps for other processes. The dump provider should be implemented only in the browser process, not in all processes.
I would also like to understand why Primiano suggested adding an async mechanism rather than using the service. Should we set up a VC?

The point of reducing thread hops is to get a faster memory snapshot. The idea was to get a snapshot where the values can be compared and added up to a total.
Primiano and I discussed this out of band; we think the correct solution is to:

1) Make the memory_instrumentation service responsible for adding OOP HP heap dumps to the trace. This can easily be accomplished by passing it a ProfilingService interface ptr (rough sketch below).
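
For illustration only, the resulting flow could look roughly like the sketch below. CoordinatorImpl and the DumpProcessesForTracing concept come from the discussion above; the member and method names are otherwise placeholders, not the real mojom signatures.
"""
// Sketch only: heap_profiler_ stands for the interface ptr that the profiling
// service hands to memory_instrumentation at startup.
void CoordinatorImpl::RequestHeapDumpForTracing(uint64_t dump_guid) {
  if (!heap_profiler_)  // Only bound when --memlog is enabled.
    return;
  heap_profiler_->DumpProcessesForTracing(
      dump_guid,
      base::BindOnce(&CoordinatorImpl::OnHeapDumpForTracingDone,
                     weak_ptr_factory_.GetWeakPtr(), dump_guid));
}

void CoordinatorImpl::OnHeapDumpForTracingDone(uint64_t dump_guid,
                                               bool success) {
  // The heap dump is now in the trace, so the coordinator can mark the
  // corresponding global dump as complete; external consumers such as
  // telemetry no longer need to sleep() and guess.
}
"""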

Comment 9 by bugdroid1@chromium.org, Feb 21 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/67fff826658bea7a7f36dc7ea4c5dbcb27438636

commit 67fff826658bea7a7f36dc7ea4c5dbcb27438636
Author: erikchen <erikchen@chromium.org>
Date: Wed Feb 21 18:53:43 2018

Make memory_instrumentation service responsible for adding heap dumps.

Prior to this CL, the ProfilingProcessHost was a MemoryDumpProvider, and
asynchronously added heap dumps to the trace. The problem is that there is no
callback to inform the MemoryDumpManager that the heap dump has been added. This in
turn means that external consumers [e.g. telemetry tests] have no way to know
when a heap dump has been added.

In this CL, the profiling service now implements
memory_instrumentation::HeapProfiler, which memory_instrumentation calls to
obtain heap dumps. ProfilingProcessHost is still responsible for starting a
trace, but now requests memory_instrumentation to perform a heap dump.

When the ProfilingProcessHost starts the ProfilingService, it passes a
memory_instrumentation::HeapProfiler interface ptr to the memory_instrumentation
service, which is what allows the latter to know to fetch heap dumps during
memory infra traces.

This CL should have no visible effect on consumers of heap dumps, as evidenced
by the minimal changes needed to ProfilingTestDriver to get all e2e tests to
pass.

Change-Id: Ib9165b90e447d4b9261eb65e929caf367ed43e7b
Bug:  760578 
Reviewed-on: https://chromium-review.googlesource.com/899963
Commit-Queue: Erik Chen <erikchen@chromium.org>
Reviewed-by: Daniel Cheng <dcheng@chromium.org>
Reviewed-by: Primiano Tucci <primiano@chromium.org>
Cr-Commit-Position: refs/heads/master@{#538167}
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/chrome/browser/chrome_content_browser_manifest_overlay.json
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/chrome/browser/profiling_host/profiling_process_host.cc
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/chrome/browser/profiling_host/profiling_process_host.h
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/chrome/browser/profiling_host/profiling_test_driver.cc
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/chrome/common/profiling/profiling_service.mojom
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/chrome/profiling/memlog_connection_manager.cc
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/chrome/profiling/memlog_connection_manager.h
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/chrome/profiling/profiling_manifest.json
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/chrome/profiling/profiling_service.cc
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/chrome/profiling/profiling_service.h
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/services/resource_coordinator/manifest.json
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/services/resource_coordinator/memory_instrumentation/coordinator_impl.cc
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/services/resource_coordinator/memory_instrumentation/coordinator_impl.h
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/services/resource_coordinator/memory_instrumentation/queued_request.h
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/services/resource_coordinator/public/cpp/memory_instrumentation/tracing_integration_unittest.cc
[modify] https://crrev.com/67fff826658bea7a7f36dc7ea4c5dbcb27438636/services/resource_coordinator/public/mojom/memory_instrumentation/memory_instrumentation.mojom

Status: WontFix (was: Assigned)
I ran some tests locally with OOP HP enabled for memory benchmarks.

The shim provides noisier and less complete results than allocated_objects_size. It looks like, for reasons unknown, the previous source of noise (the timing of the malloc MDP relative to other MDPs) is no longer an issue. I suspect that we were previously alerting on and looking at effective_size, which is much noisier than allocated_objects_size [as well as shim_allocated_objects_size].

I'm going to abandon the CL that would have added this:
https://chromium-review.googlesource.com/c/chromium/src/+/930041/3/tools/perf/benchmarks/memory.py

Note: The work that went into making this possible did not go to waste. Big wins include:

1) Overall cleaner architecture
2) Memlog tests run significantly faster. Previously, memory-infra would trigger a global dump, which would trigger an asynchronous heap dump, which would request a vm regions dump, which would have to wait for the first global dump to finish. This dependency has now been removed [since vm region dumps can now execute in parallel with global dumps].

results.html.zip
780 KB
