
Issue 766507

Starred by 2 users

Issue metadata

Status: WontFix
Owner: ----
Closed: May 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux, Android, Windows, iOS, Chrome, Mac, Fuchsia
Pri: 3
Type: Bug




Performance of tracing might degrade as more single-thread task runners move to schedulers

Project Member Reported by primiano@chromium.org, Sep 19 2017

Issue description

Tracing has a fastpath where it uses thread-local event buffers to avoid locking to inject events in the buffer. (see TraceLog::InitializeThreadLocalEventBufferIfSupported())

The optimization works as follows (sketched below):
 - If we are on a MessageLoop (i.e. MessageLoop::current() != nullptr), initialize a new thread-local buffer in a TLS slot.
 - If not, use a fallback event buffer that takes a lock for every event.
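
A minimal sketch of that decision, assuming simplified names (the real code lives in trace_log.cc; the TLS-slot and buffer identifiers below are illustrative, not the actual ones):

  // Illustrative sketch only, not the actual trace_log.cc implementation.
  #include "base/message_loop/message_loop.h"
  #include "base/threading/thread_local.h"

  class ThreadLocalEventBuffer {};  // Per-thread event storage, no locking.

  base::ThreadLocalPointer<ThreadLocalEventBuffer>& GetTlsBuffer() {
    static base::ThreadLocalPointer<ThreadLocalEventBuffer> tls_buffer;
    return tls_buffer;
  }

  void InitializeThreadLocalEventBufferIfSupported() {
    // Fastpath only when the thread has a MessageLoop: its DestructionObserver
    // is what lets us flush the buffer when the thread goes away.
    if (base::MessageLoop::current() && !GetTlsBuffer().Get())
      GetTlsBuffer().Set(new ThreadLocalEventBuffer());
    // Threads without a MessageLoop fall back to the shared event buffer,
    // which takes a lock on every trace event.
  }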

Now, my understanding (is this correct?) is that the only things that concretely have a MessageLoop are STTRs (SingleThreadTaskRunners).
My understanding is also that, as part of the blink and base schedulers, we are moving away from STTRs in favor of fancier pooled task runners.

My main question:
is MessageLoop::current() going to return nullptr for the new fancier task runners? Or are they, under the hood, still backed by a MessageLoop?

If the above is true (i.e. MessageLoop::current() will return nullptr for most new task runners in Chrome), the performance of tracing will degrade over time, as we will end up hitting the fallback (locked event buffers) in the non-STTR cases.


 
Status: Available (was: Unconfirmed)
[mac bug triage]

Marking as available to get it out of the mac bug triage unconfirmed/untriaged queues - apologies for noise.
> MessageLoop::current() going to return nullptr for the new fancier task runners? Or are they, under the hood, still backed by a MessageLoop

The Task Scheduler STTRs are not, under the hood, backed by a MessageLoop. MessageLoop::current() should return nullptr for these.

To make sure I understand this correctly: this relies on DestructionObservers to flush the thread-local buffer when the MessageLoop is being destroyed. I presume you would like something similar in the new scheduler?
#2: Yep that matches my understanding; something to trigger the TraceLog::ThreadLocalEventBuffer::WillDestroyCurrentMessageLoop call.
Components: Internals>TaskScheduler
One thing to keep in mind is whether we want something per scheduler TaskRunner or just per scheduler thread. I think the latter.

If you think about the details, we don't really need per-task-runner tracing. That might end up just creating a lot of unnecessary tracing buffers, which would be overkill if we end up with hundreds of task runners in Chrome (IIRC one of the goals of the TaskScheduler is: "task runners have little cost, just create your own task runner").

IMHO what we really need is for tracing to stay thread-local, regardless of how many task runners the scheduler puts on one thread. The reason for that is avoiding locking while tracing, which is something that threads themselves guarantee. I know that task runner sequences would also guarantee that, but IMHO that's too fine-grained for our purposes.

So I think what would be nice to have from the scheduler is:
1) An on_before_thread_exit callback for each scheduler-managed thread it creates, so we get an opportunity to drain the TLS buffers before the scheduler thread (which doesn't have a MessageLoop) goes away.
2) A backdoor PostTaskOnThisThread_IDontCareWhichSequence way to post a task to a given thread managed by the task scheduler, to force flushing the buffer when we stop tracing (a hypothetical shape for both hooks is sketched below).
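
To make the ask concrete, something along these lines would do; all names below are invented for illustration, nothing like this exists in the scheduler API today:

  // Hypothetical scheduler-side hooks (names are made up for this sketch).
  #include "base/callback.h"
  #include "base/threading/platform_thread.h"

  class SchedulerThreadHooks {
   public:
    virtual ~SchedulerThreadHooks() = default;

    // 1) Called on each scheduler-managed thread just before it exits, so
    //    tracing can drain that thread's TLS buffer into the main buffer.
    virtual void OnBeforeThreadExit() = 0;

    // 2) Backdoor: run |task| on the scheduler thread identified by |tid|,
    //    regardless of which sequences are currently mapped onto it, e.g. to
    //    force a buffer flush when tracing stops.
    virtual void PostTaskOnThisThread_IDontCareWhichSequence(
        base::PlatformThreadId tid,
        base::OnceClosure task) = 0;
  };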

Comment 6 by gab@chromium.org, Oct 10 2017

I think it would be nice to have the ability to expand to a sequence view (but I understand this might be technically infeasible buffering-wise -- FWIW we have SequenceLocalStorage as an equivalent of ThreadLocalStorage).

Or at the very least to highlight a specific sequence's tasks across threads (today we include the sequence ID in the metadata but it's not even searchable -- an interaction bug between tracing data bundles and trace search IMO).

What is a "sequence view"?

> Or at the very least to highlight a specific sequence's tasks across threads

Not sure I get this.
What I was trying to say in #5 is that at the end of the day we don't care about sequences in tracing; we still just care about threads, and it would be nice to have a way to post things and get notifications about the underlying threads, regardless of the sequences that run on them.

Comment 8 by gab@google.com, Oct 10 2017

A "sequence view" is the name I picked for a view of chrome://tracing that would have sequences instead of threads for the vertical axis.

> at the end of the day we don't care about sequences in tracing

I disagree. Sequences are effectively "chrome managed threads". Saying we don't care about sequences is like saying "we don't care about threads, we just care about CPUs".
#8: I think there are two separate issues here. One is the actual trace data and what gets logged as the "tid" equivalent; in that case, logging "sequence views" makes perfect sense, as you mention.

I think Primiano's point is more about TraceLog's internals: we have one buffer per thread to avoid any locking; it's not related to what actually gets logged. In that sense, there's no point in the TraceLog internals caring about sequences.

Comment 10 by gab@google.com, Oct 10 2017

Oh, I see. Well MessageLoop::current() won't work on TaskScheduler threads for sure (returns null). But ThreadLocalStorage still does (and perhaps we should use SequenceLocalStorage :)).

MessageLoop::current() is generally a poor signal (as it relies on impl details) and is bound to die.
The core of the issue here is that tracing would like to flush its events at thread shutdown, to avoid the cost of synchronizing on each tracing call.

A thread-local buffer holds the events and, at thread shutdown, flushes them.

If destruction observation is really unpalatable, then an alternative approach (feel free to tell me I'm wrong) is to have a centralized object farm out buffers that are referenced through TLS. Each thread writes trace events to those buffers (and avoids the need to synchronize).

At main thread shutdown, the main thread iterates through these buffers and flushes them. Would that provide equivalent functionality?

Comment 12 by gab@chromium.org, Oct 11 2017

For destruction observers we can hook a custom ThreadLocalStorage TLSDestructorFunc instead of using MessageLoop destruction observers.

Or, again, we can consider using SequenceLocalStorage (which can also implicitly be used to observe the end of a sequence).

Yeah what oysteine@ said in #9

>> at the end of the day we don't care about sequences in tracing
> I disagree.

Ahh hold on, let me clarify this a bit: there is an overloading of "tracing" here (TraceViewer vs the TRACE_EVENT macros implementation). We definitely care about sequences in TraceViewer, and I agree with the "sequence view" idea (funny, I just had a similar thought about that before reading this bug).
What I was trying to say above is that, as far as the *implementation* of the TRACE_EVENT macros is concerned, we don't care about sequences for the mere purpose of logging events into a TLS buffer.
The only point of the tracing TLS buffer is to log events very quickly without any locks (except when the local buffer is full and we need to exchange it for a new one). Which, I understand, is something that could be guaranteed even with sequence-local-storage, but IMHO that is just overkill (see below).

> Or, again, we can consider using SequenceLocalStorage 

My main fear is that if we have 100 sequences and 10 threads, we end up with 100 local buffers, while we really just needed 10. That would cause extra memory usage, fragment the buffers more, and reduce their cache locality.


> At main thread shutdown, the main thread iterates through these buffers and flushes them. Would that provide equivalent functionality?

So there are two distinct needs here:

1) When the user presses "stop tracing", we PostTask to all registered MessageLoops to tell each TLS buffer to return its content to the main buffer. Then (i.e. after all PostTasks have replied, or timed out) we serialize the main buffer on some bg thread (sketched below, after item 2).

2) If a thread goes away while tracing is running and before it's stopped (question: are the underlying scheduler threads long-lived or can they die at any point in time?), we want to prematurely return the TLS buffer into the main buffer, so that when we stop tracing the trace of the now-dead thread is still there.
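
A heavily simplified sketch of need 1 above; the helper and function names here are illustrative, not the real TraceLog code:

  // Illustrative sketch of the stop-tracing flush; helper names are made up.
  #include <vector>
  #include "base/bind.h"
  #include "base/single_thread_task_runner.h"

  // Runs on the traced thread: moves its TLS buffer back into the main
  // buffer and signals the flusher that one more reply has arrived.
  void ReturnTlsBufferToMainBuffer();

  void FlushAllRegisteredThreads(
      const std::vector<scoped_refptr<base::SingleThreadTaskRunner>>& loops) {
    for (const auto& task_runner : loops) {
      task_runner->PostTask(FROM_HERE,
                            base::BindOnce(&ReturnTlsBufferToMainBuffer));
    }
    // The caller then waits for the replies (bailing out after a timeout)
    // and serializes the main buffer on a background thread.
  }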

> For destruction observers we can hook a custom ThreadLocalStorage TLSDestructorFunc instead of using MessageLoop destruction observers.

That would make sense, I think.
The only thing we'd still be missing is some sort of backdoor like PostTaskOnThreadX_IDontCareWhichSpecificSequence().

Comment 14 by gab@chromium.org, Oct 11 2017

Got it.

> can scheduler threads go away
Yes. TLSDestructorFunc solves that though.

> The only thing that we'd be still missing is some sort of backdoor like PostTaskOnThreadX_IDontCareWhichSpecificSequence()
Right, but in practice what you want is more something like TaskScheduler::RunThisTaskOnAllThreads(base::RepeatingClosure task)?

We could probably give you a specialized/dedicated hook for that.

Then a task scheduler thread would be guaranteed when tracing completes to either hit TLSDestructorFunc (or have hit it already) or run |task|.
> Right, but in practice what you want is more something like TaskScheduler::RunThisTaskOnAllThreads(base::RepeatingClosure task)?

That would work, yes. As long as we receive the thread id back, or something similar, to do timeout detection (we give up waiting for a thread that might be hung after some seconds, at the cost of losing its local trace buffer).

> We could probably give you a specialized/dedicated hook for that.
> Then a task scheduler thread would be guaranteed when tracing completes to either hit TLSDestructorFunc (or have hit it already) or run |task|.

Yup that would be great, thanks.

Comment 16 by gab@chromium.org, Oct 11 2017

Re. timeout detection: can't you do global timeout detection (i.e. stop waiting after X seconds for all threads)? Why do you need individual thread IDs? To report which ones are hung? A thread ID isn't really meaningful in the scheduler anyway -- it's just a worker...
> Re. timeout detection: can't you do global timeout detection
Ah, you mean based just on comparing the initial count with the reply count.
There is nothing specific that comes to mind, but I'd have to look at the impl at this point to check that I'm not missing something else.
Without looking into the details, I think it should work as long as the scheduler, upon RunThisTaskOnAllThreads, returns a size_t that tells us how many replies we should expect in the best case (i.e. how many threads it is going to post onto).
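
So the hook could look roughly like this; the name and signature are hypothetical, just to pin down the contract:

  // Hypothetical hook plus count-based timeout detection (names invented;
  // not an existing base::TaskScheduler API).
  #include <cstddef>
  #include "base/callback.h"

  // Posts |task| once to every worker thread the scheduler currently owns
  // and returns how many threads it posted to, i.e. the maximum number of
  // replies the caller should expect.
  size_t RunThisTaskOnAllSchedulerThreads(base::RepeatingClosure task);

Tracing would then count flush replies against that number and stop waiting after a global timeout, without needing individual thread IDs.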
> So there are two distinct needs here:
> 1) When the user presses "stop tracing", we PostTask to all registered MessageLoops to tell each TLS buffer to return its content to the main buffer. Then (i.e. after all PostTasks have replied, or timed out) we serialize the main buffer on some bg thread.

How about this scheme then:
Global Object
  - One lock to synchronize the list of buffers
  - Owns all tracing buffers

One Buffer
  - Has its own lock for reading/writing
  - Is only contended during flushing
  - Is not contended if no flushing is occurring

TLS
  - Points to the global object owned buffer

Since there is no contention, I'd expect the lock acquisition and release to be quick. At flush (either when requested or at shutdown), the global object is the only one that will contend for these locks, but we're flushing anyway, so that might be okay.

Would this be sufficient to maintain high throughput per thread?
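
A sketch of the scheme above, with all type names invented for illustration:

  // Simplified sketch of the proposed scheme (all names are illustrative).
  #include <memory>
  #include <vector>
  #include "base/synchronization/lock.h"
  #include "base/threading/thread_local.h"

  struct PerThreadBuffer {
    base::Lock lock;  // Only contended while a flush is in progress.
    // ... event storage ...
  };

  class BufferRegistry {
   public:
    PerThreadBuffer* CreateBufferForCurrentThread() {
      auto buffer = std::make_unique<PerThreadBuffer>();
      PerThreadBuffer* raw = buffer.get();
      base::AutoLock auto_lock(registry_lock_);  // Guards only the list.
      buffers_.push_back(std::move(buffer));
      tls_buffer_.Set(raw);
      return raw;
    }

    void FlushAll() {
      // Flush/shutdown path: the only place that contends with the writing
      // threads on the per-buffer locks.
      base::AutoLock auto_lock(registry_lock_);
      for (auto& buffer : buffers_) {
        base::AutoLock buffer_lock(buffer->lock);
        // ... drain |buffer| into the main trace buffer ...
      }
    }

   private:
    base::Lock registry_lock_;
    std::vector<std::unique_ptr<PerThreadBuffer>> buffers_;
    base::ThreadLocalPointer<PerThreadBuffer> tls_buffer_;
  };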
> Since there is no contention, I'd expect the lock acquisition and release to be quick.
No, that would still be a couple of atomic operations for each trace event, which is way too invasive on some archs (e.g., ARM).

> At flush (either when requested or at shutdown), the global object is the only one who will content with these locks, but we're flushing away so that might be okay.

This will stall all the other threads at flush. Which means that at the first occurrence of a task depending on the execution of another task (e.g. via a waitable event or a nested message loop), we'll just deadlock.

Also, I'm generally quite wary of changing the threading model of tracing unless we have a strong need. Anecdotally, each change to it was followed by weeks spent on bugs some months later.

Comment 20 by gab@chromium.org, Apr 12 2018

The TaskScheduler migration is complete. Is performance on ARM a problem today? If not we should close this.

Comment 21 by gab@chromium.org, May 3 2018

Status: WontFix (was: Available)
WontFixing this given no response to #20.

@primiano: I assume TaskScheduler's use case will be considered critical in the new tracing design.
Re #21: we are in the process of swapping the internals of tracing with Perfetto (tracker label: perfetto). We'll rethink what we need here soon (and eventually reopen this). It's fine to WontFix right now.

>I assume TaskScheduler's use case will be considered critical
WDYM? From the UI viewpoint (showing sequences) or what?

Comment 23 by gab@chromium.org, May 3 2018


>>I assume TaskScheduler's use case will be considered critical
>WDYM? From the UI viewpoint (showing sequences) or what?

Sure, but I meant more w.r.t. this bug, i.e. don't depend on threads running MessageLoops, and support threads coming and going (required for long traces, as the scheduler recycles unused threads after some time (≈60s)).
TaskScheduler's threads should have a thread-local buffer for tracing events. Before exiting (which can happen after ~60s of inactivity), they should gracefully tear down this buffer in the same way that a MessageLoop's buffer is torn down when the MessageLoop is destroyed (https://cs.chromium.org/chromium/src/base/trace_event/trace_log.cc?l=292&rcl=734de4c06e9357f1adffc0e524eeb4aaa676a628).

Comment 25 by gab@chromium.org, May 9 2018

Using base::ThreadLocalStorage::Slot would work (instead of assuming a MessageLoop on the thread and tying the buffer's lifetime to it): both MessageLoop-based threads and SchedulerWorkers will result in the Slot's DestructorFunc being invoked when the thread dies.
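
A minimal sketch of that, with the buffer type and flush details elided:

  // Sketch: tie the tracing buffer's lifetime to a TLS destructor instead
  // of a MessageLoop::DestructionObserver. Buffer type name is illustrative.
  #include "base/threading/thread_local_storage.h"

  class ThreadLocalEventBuffer {};  // Per-thread event storage (elided).

  void FlushAndDeleteBuffer(void* buffer) {
    // Runs on the owning thread as it exits, for both MessageLoop-based
    // threads and SchedulerWorkers; drain the events, then free the buffer.
    delete static_cast<ThreadLocalEventBuffer*>(buffer);
  }

  base::ThreadLocalStorage::Slot& GetTracingTlsSlot() {
    static base::ThreadLocalStorage::Slot slot(&FlushAndDeleteBuffer);
    return slot;
  }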

Something to keep in mind for next-gen tracing IMO as it wants to support super long traces (in which case SchedulerWorkers coming in and out of life will be more prevalent).
Yep that's what we're doing, see the early version here: https://cs.chromium.org/chromium/src/services/tracing/public/cpp/perfetto/trace_event_data_source.cc?q=trace_event_data_sour&sq=package:chromium&l=244

There are still unresolved issues around cleaning up TLS buffers/objects after tracing is finished (rather than just when threads die), but Perfetto-based tracing doesn't require the PostTask'ed Flush() calls that the TraceLog does.
Yup, I think that cleanup on destruction is sorted with the at-thread-exit hook.

> Perfetto-based tracing doesn't require the PostTask'ed Flush() calls that the TraceLog does

Well, let's say that it's "less needed" than before, but it would still be desirable, e.g., to guarantee that traces are flushed to disk every X seconds to get data back in the case of a crash (or just to reduce the windowing of the UI when searching for events).
I think the only thing that would be nice here is if the scheduler had a special backdoor of the form BroadcastFunctionToAllSchedulerThreads(FunctionPtr), which invokes FunctionPtr on each thread. We could use that in turn to do the timed Flush().
