New issue
Advanced search Search tips
Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Feb 11
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Sign in to add a comment
link

Issue 911839: Slow reports rely on the order of adding tracing agents and start/stop tracing signals

Reported by chiniforooshan@chromium.org, Dec 4 Project Member

Issue description

Summary of chat with etienneb@:

We have the following pattern in the browser process for slow reports:

TracingControllerImpl::AddAgents()
TracingControllerImpl::StartTracing()
TraceLog::SetEnabled()
TracingControllerImpl::StopTracing()

It is possible that the trace event agent of the browser process is registered by the tracing service after the stop tracing command is processed and so the tracing service does not send any start/stop tracing signal to the browser process. This is problematic, e.g. because the browser process has explicitly started tracing and then it won't stop tracing and send back data.

The root cause is that agent registration and start/stop tracing commands are sent using two different mojo pipes for security: everyone should be able to register themselves as agents, but only privileged clients should be able to start/stop tracing. The ordering of messages is guaranteed only within a pipe, not between the two.

One possible fix is that the tracing service can send an ack after registering an agent. Then, the client can enforce the order if it wishes to.

oysteine@, WDYT?
 

Comment 1 by etienneb@chromium.org, Dec 5

This is the pattern used to synchronize browser tests.
There is an "observer" that observe the tracelog OnTracingEnabled/OnTracingDisabld and calls a QuitClosure on a RunLoop (used as a synchronization mechanism).

The following snippet shows the sequence of actions. The flakiness come from the fact that we can't rely on the observer OnTracingEnabled and OnTracingDisabled to be called. The AddAgent may not had completed the task when calling StartTracing / StopTracing. Their message are sent using different queues.

There is currently no way to be sure AddAgent is completed.

BrowserProcess:
  MainThread:
    AddAgent( tracelog );
  
    tracelog->AddEnableObserver( observer );
 
    tracing_controler->StartTracing(...);  // No reply callback
    tracing_controler->StopTracing(...);  // No reply callback
 
 
    // observer: would you assume it will get OnStateEnabled and OnStateDisabled ?

Comment 2 by etienneb@chromium.org, Dec 5

I added a thread::Sleep(2 seconds) into the TraceController::AddAgents and the flaky tests were stable.

This is obviously not the right fix.

Comment 3 by oysteine@chromium.org, Dec 5

Hmm. It's tricky, there's a bit of an inherent conflict between the async Mojo interfaces and the few use-cases that are dependent on sync and direct access to the TraceLog.

I think this case only applies to tests though, right? Could we have a WaitForAgentAdded(pid) callback which is on the same pipe as the Start/Stop tracing calls?

Comment 4 by oysteine@chromium.org, Dec 5

WaitForAgentAdded(pid, callback) that is

Comment 5 by chiniforooshan@chromium.org, Dec 7

I had a chat with oysteine@ and, to us, it looked like that if the tracing coordinator sends StopAndFlush signal to agents that register after tracing is already stopped, we will cover all use-cases that happen in the code.

So, e.g. if we have

RegisterAgent()
TraceLog::SetEnabled()
StartTracing()
StopTracing()

TraceLog::SetDisabled() will be called, even if the RegisterAgent() is registered after StopTracing() is processed.

Note: if an agent wants to ensure that it is registered before calling StopTracing(), it can wait to receive the StartTracing signal from the tracing coordinator.

I'll send two CLs:

1- Now that I'm touching this part of the code, I'll send a first CL to clean up some of the coordinator/agent logic that is not used: there is no need to send mojo messages back and forth between the coordinator and agents just for issuing clock sync markers. Agents can issue those markers when they receive the StopAndFlush signal.

2- Coordinator send StopAndFlush to late-registering agents.

Comment 6 by chiniforooshan@chromium.org, Dec 7

Status: Started (was: Untriaged)

Comment 7 by bugdroid1@chromium.org, Dec 14

Project Member
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f527d2ad94e4c7f319b981705989610c5c4d428e

commit f527d2ad94e4c7f319b981705989610c5c4d428e
Author: Ehsan Chiniforooshan <chiniforooshan@chromium.org>
Date: Fri Dec 14 20:40:45 2018

Tracing: send stop tracing to late agents

To make the status of tracing agents and the tracing coordinator
consistent, we make the tracing coordinator ALWAYS send the last
start/stop tracing signal to agents when they register. This way, agents
can make sure their tracing status is the same as the coordinator
status.

If an agent registers after StopAndFlush is received by the
coordinator and the coordinator has already gathered trace data from all
agents with that label, we send a no-op recorder to the agent and
discard its data (we just want to let it know tracing is stopped). If
the coordinator is still gathering data, then an actual recorder is sent
and the agents data is collected.

Bug:  911839 
Change-Id: If43ae0490fd5286b62a6ae2a9b235dea16275ff4
Reviewed-on: https://chromium-review.googlesource.com/c/1372719
Commit-Queue: Ehsan Chiniforooshan <chiniforooshan@chromium.org>
Reviewed-by: oysteine <oysteine@chromium.org>
Cr-Commit-Position: refs/heads/master@{#616811}
[modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/content/browser/tracing/cast_tracing_agent.cc
[modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/content/browser/tracing/cros_tracing_agent.cc
[modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/content/browser/tracing/tracing_controller_impl.cc
[modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/services/tracing/agent_registry.cc
[modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/services/tracing/agent_registry.h
[modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/services/tracing/agent_registry_unittest.cc
[modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/services/tracing/coordinator.cc
[modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/services/tracing/coordinator.h
[modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/services/tracing/coordinator_unittest.cc

Comment 8 by etienneb@chromium.org, Feb 11

Status: Fixed (was: Started)

Sign in to add a comment