New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 770151 link

Starred by 2 users

Issue metadata

Status: Assigned
Owner:
Last visit > 30 days ago
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Mac
Pri: 1
Type: Bug



Sign in to add a comment

Memory-infra doesn't emit renderer stats on linux-dev

Project Member Reported by primiano@chromium.org, Sep 29 2017

Issue description

See trace in Issue 770096 .
The browser process has all MDPs but the renderer processes are empty.
lalitm@ managed to repro on linux-dev. It seems this bug repros consistently for the first dump and then everything works fine for the other ones.

Something in the service is broken and we end up losing all MDPs coming from all renderers (But curiously not from the browser process)

I tried on my mac canary, and it is also broken, although less visibly: only some (2/10) renderer rows are blank.

:(

+perezju as this might cause flakes on telemetry metrics
 
Owner: lalitm@chromium.org
Status: Assigned (was: Untriaged)
lalitm@ is looking at linux right now.
Is this only for periodic dumps? Or also affects explicitly triggered ones?

Comment 3 by lalitm@chromium.org, Sep 29 2017

Cc: oysteine@chromium.org
This also affects explicitly triggered ones yes.

+oysteine: for reasons below

I performed a Git bisect and found that the culprit to be the commit https://chromium-review.googlesource.com/c/chromium/src/+/611188.

I'm taking a look to figure out how this commit is affecting things but the suspicion primiano@ and I have is some sort of race condition between the tracing being enabled and memory dump being requested and actioned by renderer processes. This is especially true as IsMemoryInfraTracingEnabled returns false in TracingObserver::ShouldAddToTrace on the first memory dump for all processes except the browser.

Comment 4 by lalitm@chromium.org, Sep 29 2017

A correction to my previous comment: I do not believe this affects periodic dumps. My apologies.

I've made more progress in terms of finding the issue but the parts of the codebase I'm looking at seem complex and critical enough that I'd appreciate some input on how best this issue could be fixed.

In essence, the issue stems from the fact that a global dump is requested before the child processes are fully initialised. Printing out a stack trace solved the issue which suggested a race condition.

After painstaking debugging, I found the race as follows (to the best of my understanding):

Begin: In TracingControllerImpl::StartAgentTracing, StartAgentTracing is called which calls SetEnabledOnBackgroundThread on a background thread.

Racing thread/process 1.
SetEnabledOnBackgroundThread
-> enables TraceLog on browser process
-> TracingObserver::OnTraceLogEnabled is called
-> MemoryDumpManager::SetupForTracing is called
-> MemoryDumpScheduler::Start
-> scheduling is started on a background thread
-> global dump is requested

Racing thread/process 2.
SetEnabledOnBackgroundThread posts onto UI thread
-> TracingControllerImpl::OnStartAgentTracingAcked
-> TraceMessageFilter::SendBeginTracing
-> IPC to every child process
-> ChildTraceMessageFilter::OnBeginTracing
-> enables TraceLog on every process

With the CL included, process 1 always seems to win the race which means that we request a global dump before the child processes are able to enable trace logging. Without the CL process 2 wins the race but primiano@ suggested this was still flaky.

I am rather surprised that using the FILE thread is enough to suppress this race but apparently it has largely worked to date!

Comment 5 by lalitm@chromium.org, Sep 29 2017

A correction to my correction: I do not believe this affects explicit dumps. My apologies x2 :P

Comment 6 by ssid@chromium.org, Oct 2 2017

Cc: primiano@chromium.org
So, this problem used to exist in the first version and we worked around it by requesting the first dump in periodic scheduler after the 1st time period rather than 0th tick. I think somewhere later we decided that the time period should be 10secs and we cannot really wait for 10s for the first dump, so we trigger the first dump at 0th tick which brings the issue of child process not getting TraceLog enabled.
This is a problem with the tracing using IPC and memory-infra using mojo. I think it used to be the case that they were both IPC and this happened much lesser.

Whateverz about the history. A solution would be to start periodic timer after StartTracingDoneCallback(), which should be called only after all processes have ack-ed StartTracing message. But we have an issue 709524 where the callback is called after sending IPC rather than after receiving an ack. This should be fixed when moving tracing to mojo.

I do not have a good solution for this. What we can do here?
1. Add delay the first dump by 200ms.
2. This could be fixed by tracing servicification. The periodic timer to be started by the StartTracingDoneCallback() rather than OnTraceLogEnabled(). Currently we are triggering global dumps based on in-process enabling which is wrong.
We could add an observer to TracingControllerImpl that notifies when OnStartTracingDone() and OnEndtracingtriggered(). We implement a memory-infra observer which starts and stops periodic scheduler based on this.
3. Add new mojo messages in memory-infra that says MDM enabled for tracing in each process. Once we receive acks from all processes, the periodic dumps can be started.

From thinking about tracing and related bugs with tracing maybe we should not couple memory-infra and tracing too much, just implement new messages on memory-instrumentation. I'd go with 3 here.
Can we do 1 as stopgap and land it asap to avoid having this bug in the stable (branch point is coming), and then figure out a better solution?
Ok going to send a CL for 1 your way.
Project Member

Comment 9 by bugdroid1@chromium.org, Oct 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/37c9c1405efbd2457ef5b687cb8974894cb79f4b

commit 37c9c1405efbd2457ef5b687cb8974894cb79f4b
Author: Lalit Maganti <lalitm@chromium.org>
Date: Fri Oct 06 16:27:15 2017

memory-infra: delay first periodic dump by 200 ms

This is a temporary hack until a more correct solution is implemented.
This is so that the upcoming stable does not have the bug specified.

Bug: 770151
Change-Id: I7e4f5c7ad5c41620e3ed667af059498aaa9ad041
Reviewed-on: https://chromium-review.googlesource.com/703234
Commit-Queue: Lalit Maganti <lalitm@chromium.org>
Reviewed-by: Siddhartha S <ssid@chromium.org>
Reviewed-by: Primiano Tucci <primiano@chromium.org>
Cr-Commit-Position: refs/heads/master@{#507092}
[modify] https://crrev.com/37c9c1405efbd2457ef5b687cb8974894cb79f4b/base/trace_event/memory_dump_scheduler.cc

Sign in to add a comment