Slow reports rely on the order of adding tracing agents and start/stop tracing signals |
||
Issue descriptionSummary of chat with etienneb@: We have the following pattern in the browser process for slow reports: TracingControllerImpl::AddAgents() TracingControllerImpl::StartTracing() TraceLog::SetEnabled() TracingControllerImpl::StopTracing() It is possible that the trace event agent of the browser process is registered by the tracing service after the stop tracing command is processed and so the tracing service does not send any start/stop tracing signal to the browser process. This is problematic, e.g. because the browser process has explicitly started tracing and then it won't stop tracing and send back data. The root cause is that agent registration and start/stop tracing commands are sent using two different mojo pipes for security: everyone should be able to register themselves as agents, but only privileged clients should be able to start/stop tracing. The ordering of messages is guaranteed only within a pipe, not between the two. One possible fix is that the tracing service can send an ack after registering an agent. Then, the client can enforce the order if it wishes to. oysteine@, WDYT?
,
Dec 5
I added a thread::Sleep(2 seconds) into the TraceController::AddAgents and the flaky tests were stable. This is obviously not the right fix.
,
Dec 5
Hmm. It's tricky, there's a bit of an inherent conflict between the async Mojo interfaces and the few use-cases that are dependent on sync and direct access to the TraceLog. I think this case only applies to tests though, right? Could we have a WaitForAgentAdded(pid) callback which is on the same pipe as the Start/Stop tracing calls?
,
Dec 5
WaitForAgentAdded(pid, callback) that is
,
Dec 7
I had a chat with oysteine@ and, to us, it looked like that if the tracing coordinator sends StopAndFlush signal to agents that register after tracing is already stopped, we will cover all use-cases that happen in the code. So, e.g. if we have RegisterAgent() TraceLog::SetEnabled() StartTracing() StopTracing() TraceLog::SetDisabled() will be called, even if the RegisterAgent() is registered after StopTracing() is processed. Note: if an agent wants to ensure that it is registered before calling StopTracing(), it can wait to receive the StartTracing signal from the tracing coordinator. I'll send two CLs: 1- Now that I'm touching this part of the code, I'll send a first CL to clean up some of the coordinator/agent logic that is not used: there is no need to send mojo messages back and forth between the coordinator and agents just for issuing clock sync markers. Agents can issue those markers when they receive the StopAndFlush signal. 2- Coordinator send StopAndFlush to late-registering agents.
,
Dec 7
,
Dec 14
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f527d2ad94e4c7f319b981705989610c5c4d428e commit f527d2ad94e4c7f319b981705989610c5c4d428e Author: Ehsan Chiniforooshan <chiniforooshan@chromium.org> Date: Fri Dec 14 20:40:45 2018 Tracing: send stop tracing to late agents To make the status of tracing agents and the tracing coordinator consistent, we make the tracing coordinator ALWAYS send the last start/stop tracing signal to agents when they register. This way, agents can make sure their tracing status is the same as the coordinator status. If an agent registers after StopAndFlush is received by the coordinator and the coordinator has already gathered trace data from all agents with that label, we send a no-op recorder to the agent and discard its data (we just want to let it know tracing is stopped). If the coordinator is still gathering data, then an actual recorder is sent and the agents data is collected. Bug: 911839 Change-Id: If43ae0490fd5286b62a6ae2a9b235dea16275ff4 Reviewed-on: https://chromium-review.googlesource.com/c/1372719 Commit-Queue: Ehsan Chiniforooshan <chiniforooshan@chromium.org> Reviewed-by: oysteine <oysteine@chromium.org> Cr-Commit-Position: refs/heads/master@{#616811} [modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/content/browser/tracing/cast_tracing_agent.cc [modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/content/browser/tracing/cros_tracing_agent.cc [modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/content/browser/tracing/tracing_controller_impl.cc [modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/services/tracing/agent_registry.cc [modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/services/tracing/agent_registry.h [modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/services/tracing/agent_registry_unittest.cc [modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/services/tracing/coordinator.cc [modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/services/tracing/coordinator.h [modify] https://crrev.com/f527d2ad94e4c7f319b981705989610c5c4d428e/services/tracing/coordinator_unittest.cc |
||
►
Sign in to add a comment |
||
Comment 1 by etienneb@chromium.org
, Dec 5This is the pattern used to synchronize browser tests. There is an "observer" that observe the tracelog OnTracingEnabled/OnTracingDisabld and calls a QuitClosure on a RunLoop (used as a synchronization mechanism). The following snippet shows the sequence of actions. The flakiness come from the fact that we can't rely on the observer OnTracingEnabled and OnTracingDisabled to be called. The AddAgent may not had completed the task when calling StartTracing / StopTracing. Their message are sent using different queues. There is currently no way to be sure AddAgent is completed. BrowserProcess: MainThread: AddAgent( tracelog ); tracelog->AddEnableObserver( observer ); tracing_controler->StartTracing(...); // No reply callback tracing_controler->StopTracing(...); // No reply callback // observer: would you assume it will get OnStateEnabled and OnStateDisabled ?