Performance of tracing might degrade as more single-thread task runners move to schedulers
Issue description

Tracing has a fastpath where it uses thread-local event buffers to avoid taking a lock when injecting events into the buffer (see TraceLog::InitializeThreadLocalEventBufferIfSupported()). The optimization works as follows:
- If we are on a MessageLoop (i.e. MessageLoop::current() != nullptr), initialize a new thread-local buffer in a TLS slot.
- If not, use a fallback event buffer that takes a lock for every event.

Now, my understanding (is this correct?) is that the only things that concretely have a MessageLoop are STTRs (SingleThreadTaskRunners). My understanding is also that, as part of the blink and base schedulers, we are moving away from STTRs in favor of fancier pooled task runners.

My main question: is MessageLoop::current() going to return nullptr for the new fancier task runners? Or are they, under the hood, still backed by a MessageLoop?

If the former is true (i.e. MessageLoop::current() will return nullptr for most new task runners in Chrome), the performance of tracing will degrade over time, as we will end up hitting the fallback (locked event buffers) in the non-STTR cases.
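To make the two paths concrete, here is a simplified, standalone sketch; all names below are hypothetical stand-ins, the real logic lives in base/trace_event/trace_log.cc.

#include <mutex>
#include <vector>

struct TraceEvent {};  // stand-in for the real trace event type

// Fallback path: one shared buffer, one lock acquisition per event.
class SharedEventBuffer {
 public:
  void AddEvent(const TraceEvent& event) {
    std::lock_guard<std::mutex> hold(lock_);
    events_.push_back(event);
  }

 private:
  std::mutex lock_;
  std::vector<TraceEvent> events_;
};

// Fastpath: a buffer owned by a single thread and written without any
// locking; it is handed back to the shared buffer wholesale when the thread
// goes away. Knowing when that happens for scheduler threads is exactly the
// problem raised in this bug.
class ThreadLocalEventBuffer {
 public:
  void AddEvent(const TraceEvent& event) { events_.push_back(event); }

 private:
  std::vector<TraceEvent> events_;
};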
Comment 2, Oct 9 2017
> is MessageLoop::current() going to return nullptr for the new fancier task runners? Or are they, under the hood, still backed by a MessageLoop?

The Task Scheduler STTRs are not backed by a MessageLoop under the hood; MessageLoop::current() should return nullptr for these.

To make sure I understand this correctly: the current mechanism relies on DestructionObservers to flush the thread-local buffer when the MessageLoop is getting destroyed. I presume you would like something similar in the new scheduler?
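For reference, the existing hookup has roughly this shape (a simplified sketch, not the exact TraceLog code; FlushToMainBuffer() is a made-up name):

#include "base/message_loop/message_loop.h"

// The per-thread buffer registers itself as a MessageLoop::DestructionObserver
// so it can hand its events back to the main buffer before the loop (and the
// thread's lock-free fastpath) goes away.
class ThreadLocalEventBuffer : public base::MessageLoop::DestructionObserver {
 public:
  ThreadLocalEventBuffer() {
    base::MessageLoop::current()->AddDestructionObserver(this);
  }

  // base::MessageLoop::DestructionObserver:
  void WillDestroyCurrentMessageLoop() override {
    FlushToMainBuffer();  // drain the events while the thread is still alive
    base::MessageLoop::current()->RemoveDestructionObserver(this);
    delete this;
  }

 private:
  void FlushToMainBuffer() {}  // hypothetical: returns events to TraceLog
};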
Comment 3, Oct 10 2017
#2: Yep that matches my understanding; something to trigger the TraceLog::ThreadLocalEventBuffer::WillDestroyCurrentMessageLoop call.
Comment 4, Oct 10 2017
Comment 5, Oct 10 2017
One thing to keep in mind is whether we want something per scheduler TaskRunner or just per scheduler thread. I think the latter.

If you think about the details, we don't really need per-task-runner tracing. That would just create a lot of unnecessary tracing buffers, which is overkill if we end up with hundreds of task runners in Chrome (IIRC one of the goals of the TaskScheduler is "task runners have little cost, just create your own task runner"). IMHO what we really need is for tracing to stay thread-local, regardless of how many task runners the scheduler puts on one thread. The reason is to avoid locking while tracing, which is something that threads themselves guarantee. I know that task runner sequences would also guarantee that, but IMHO that's too fine-grained for our purposes.

So what would be nice to have from the scheduler is:
1) An on_before_thread_exit callback for each scheduler-managed thread it creates, so we get an opportunity to drain the TLS buffers before the scheduler thread (which doesn't have a MessageLoop) goes away.
2) A backdoor PostTaskOnThisThread_IDontCareWhichSequence way to post a task to a given thread managed by the task scheduler, to force flushing the buffer when we stop tracing.
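To make the two requests concrete, something with roughly this shape would work (purely hypothetical names and signatures; nothing like this exists in the scheduler today):

#include "base/callback.h"
#include "base/threading/platform_thread.h"

// Hypothetical scheduler-side hooks corresponding to requests (1) and (2).
class SchedulerTracingHooks {
 public:
  // (1) |callback| runs on each scheduler-managed thread right before that
  //     thread exits, so tracing can drain the thread's TLS buffer.
  virtual void SetOnBeforeThreadExitCallback(
      base::RepeatingClosure callback) = 0;

  // (2) Backdoor: run |task| on the scheduler-managed thread identified by
  //     |thread_id|, regardless of which sequences are bound to it, e.g. to
  //     force a buffer flush when tracing stops.
  virtual void PostTaskOnThread(base::PlatformThreadId thread_id,
                                base::OnceClosure task) = 0;

 protected:
  virtual ~SchedulerTracingHooks() = default;
};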
Comment 6, Oct 10 2017
I think it would be nice to have the ability to expand to a sequence view (but I understand this might be technically infeasible buffering-wise -- FWIW we have SequenceLocalStorage as an equivalent to ThreadLocalStorage). Or at the very least to highlight a specific sequence's tasks across threads (today we include the sequence ID in the metadata but it's not even searchable -- an interaction bug between tracing data bundles and trace search IMO).
Comment 7, Oct 10 2017
What is a "a sequence view" ? > Or at the very least to highlight a specific sequence's tasks across across threads Not sure I get this. What I was trying to say in #5 is that at the end of the day we don't care about sequences in tracing, we still just care about threads and would be nice to have a way to post things and get notification about the underlying threads, regardless of sequences that run onto them.
Comment 8, Oct 10 2017
A "sequence view" is the name I picked for a view of chrome://tracing that would have sequences instead of threads for the vertical axis. > at the end of the day we don't care about sequences in tracing I disagree. Sequences are effectively "chrome managed threads". Saying we don't care about sequences is like saying "we don't care about threads, we just care about CPUs".
Comment 9, Oct 10 2017
#8: I think there are two separate issues here. One is the actual trace data and what gets logged as the "tid" equivalent; in that case logging "sequence views" makes perfect sense, as you mention. I think Primiano's point is more about TraceLog's internals: we have one buffer per thread to avoid any locking, and that's not related to what actually gets logged. In this sense, there's no point in the TraceLog internals caring about sequences.
Comment 10, Oct 10 2017
Oh, I see. Well, MessageLoop::current() won't work on TaskScheduler threads for sure (it returns null). But ThreadLocalStorage still does (and perhaps we should use SequenceLocalStorage :)). MessageLoop::current() is generally a poor signal (as it relies on impl details) and is bound to die.
Comment 11, Oct 10 2017
The core of the issue here is that tracing would like to avoid the cost of synchronizing on each tracing call, so a thread-local buffer holds the events and, at thread shutdown, flushes them.

If destruction observation is really unpalatable, then an alternative approach (feel free to tell me I'm wrong) is to have a centralized object farm out buffers that are referenced from TLS. Each thread writes trace events to its own buffer (and avoids the need to synchronize). At main thread shutdown, the main thread iterates through these buffers and flushes them. Would that provide equivalent functionality?
Comment 12, Oct 11 2017
For destruction observation, we can hook a custom ThreadLocalStorage TLSDestructorFunc instead of using MessageLoop destruction observers. Or, again, we can consider using SequenceLocalStorage (which can also implicitly be used to observe the end of a sequence).
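Concretely, for tracing's purposes the TLSDestructorFunc hookup would look roughly like this (a sketch; ThreadLocalEventBuffer and FlushToMainBuffer() are hypothetical stand-ins):

#include "base/threading/thread_local_storage.h"

namespace {

class ThreadLocalEventBuffer {
 public:
  void FlushToMainBuffer() {}  // hypothetical: returns events to the main buffer
};

// Runs at thread exit on every thread that ever set the slot, whether or not
// the thread had a MessageLoop (i.e. also for scheduler worker threads).
void OnThreadExit(void* value) {
  ThreadLocalEventBuffer* buffer = static_cast<ThreadLocalEventBuffer*>(value);
  buffer->FlushToMainBuffer();
  delete buffer;
}

// One process-wide slot, constructed with the destructor function.
base::ThreadLocalStorage::Slot g_event_buffer_slot(&OnThreadExit);

// Called lazily from the tracing macros on the first event of each thread.
void InitializeThreadLocalEventBufferIfNeeded() {
  if (!g_event_buffer_slot.Get())
    g_event_buffer_slot.Set(new ThreadLocalEventBuffer());
}

}  // namespace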
Comment 13, Oct 11 2017
Yeah, what oysteine@ said in #9.

>> at the end of the day we don't care about sequences in tracing
> I disagree.

Ahh, hold on, let me clarify this a bit; there is an overload of "tracing" here (TraceViewer vs. the TRACE_EVENT macros implementation). We definitely care about sequences in TraceViewer, and I agree with the "sequence view" idea (funny, I just had a similar thought about that before reading this bug). What I was trying to say above is that, as far as the *implementation* of the TRACE_EVENT macros is concerned, we don't care about sequences for the mere purpose of logging events into a TLS buffer. The only point of the tracing TLS buffer is to log events very quickly without any locks (modulo when the local buffer is full and we need to exchange it for a new one). Which, I understand, is something that could be guaranteed even with sequence-local storage, but IMHO that is just overkill (see below).

> Or, again, we can consider using SequenceLocalStorage

My main fear is that if we have 100 sequences and 10 threads, we end up with 100 local buffers, while we really just needed 10. That will cause extra memory usage, fragment the buffers more, and reduce their cache locality.

> At main thread shutdown, the main thread iterates through these buffers and flushes them. Would that provide equivalent functionality?

So there are two distinct needs here:
1) When the user presses "stop tracing", we PostTask to all registered MessageLoops to tell each TLS buffer to return its content to the main buffer. Then (i.e. after all PostTasks have replied, modulo timeouts) we serialize the main buffer on some background thread.
2) If a thread goes away while tracing is running and before it's stopped (Question: are the underlying scheduler threads long-lived or can they die at any point in time?), we want to prematurely return the TLS buffer to the main buffer, so that when we stop tracing the trace of the now-dead thread is still there.

> For destruction observation, we can hook a custom ThreadLocalStorage TLSDestructorFunc instead of using MessageLoop destruction observers.

That would make sense, I think. The only thing we'd still be missing is some sort of backdoor like PostTaskOnThreadX_IDontCareWhichSpecificSequence()
Comment 14, Oct 11 2017
Got it.

> can scheduler threads go away

Yes. TLSDestructorFunc solves that, though.

> The only thing we'd still be missing is some sort of backdoor like PostTaskOnThreadX_IDontCareWhichSpecificSequence()

Right, but in practice what you want is something more like TaskScheduler::RunThisTaskOnAllThreads(base::RepeatingClosure task)? We could probably give you a specialized/dedicated hook for that. Then, when tracing completes, a task scheduler thread would be guaranteed to either hit TLSDestructorFunc (or have hit it already) or run |task|.
Comment 15, Oct 11 2017
> Right, but in practice what you want is something more like TaskScheduler::RunThisTaskOnAllThreads(base::RepeatingClosure task)?

That would work, yes. As long as we receive the thread ID back, or something similar, to do timeout detection (we give up waiting for a thread that might be hung after some seconds, at the cost of losing its local trace buffer).

> We could probably give you a specialized/dedicated hook for that.
> Then, when tracing completes, a task scheduler thread would be guaranteed to either hit TLSDestructorFunc (or have hit it already) or run |task|.

Yup, that would be great, thanks.
Comment 16, Oct 11 2017
Re. timeout detection: can't you do global timeout detection (i.e. stop waiting after X seconds for all threads)? Why do you need individual thread IDs? To report which ones are hung? A thread ID isn't really meaningful in the scheduler anyway -- it's just a worker...
Comment 17, Oct 11 2017
> Re. timeout detection: can't you do global timeout detection

Ah, you mean based just on the initial count vs. the reply count. Nothing specific comes to mind, but I'd have to look at the impl at this point to check that I'm not missing something else. Without looking into the details, I think it should work as long as RunThisTaskOnAllThreads returns a size_t that tells us how many replies we should expect in the best case (i.e. how many threads it is going to post onto).
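Putting this sub-thread together, the hook would have roughly this shape; everything below is a hypothetical sketch, no such API exists in the scheduler:

#include <cstddef>

#include "base/bind.h"
#include "base/callback.h"

// Hypothetical hook: runs |task| once on every worker thread the scheduler
// currently owns and returns how many threads it posted to, so the caller
// knows how many replies to expect before its global timeout fires.
class SchedulerBroadcastHook {
 public:
  virtual size_t RunThisTaskOnAllThreads(base::RepeatingClosure task) = 0;

 protected:
  virtual ~SchedulerBroadcastHook() = default;
};

// Caller side (tracing): post the flush closure everywhere, then wait for
// |expected| replies with one global timeout instead of per-thread IDs.
void FlushAllSchedulerThreadBuffers(SchedulerBroadcastHook* scheduler) {
  size_t expected = scheduler->RunThisTaskOnAllThreads(
      base::BindRepeating([] { /* return this thread's TLS buffer */ }));
  // ... wait until |expected| buffers have come back, or give up after a few
  //     seconds (losing the buffers of any hung thread) ...
}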
Comment 18, Oct 12 2017
> So there are two distinct needs here:
> 1) When the user presses "stop tracing", we PostTask to all registered MessageLoops to tell each TLS buffer to return its content to the main buffer. Then (i.e. after all PostTasks have replied, modulo timeouts) we serialize the main buffer on some background thread.

How about this scheme then:

Global object
- One lock to synchronize the list of buffers
- Owns all tracing buffers

One buffer
- Has its own lock for reading/writing
- Is only contended during flushing
- Is not contended if no flushing is occurring

TLS
- Points to the buffer owned by the global object

Since there is no contention, I'd expect the lock acquisition and release to be quick. At flush (either when requested or at shutdown), the global object is the only one who will contend for these locks, but we're flushing anyway, so that might be okay. Would this be sufficient to maintain high throughput per thread?
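In code, that would be roughly the following (hypothetical types, just to show where the locks sit):

#include <memory>
#include <vector>

#include "base/synchronization/lock.h"

struct TraceEvent {};  // stand-in for the real event type

// One buffer per thread, each with its own lock. Outside of a flush the lock
// is uncontended, so acquisition/release should be cheap.
class PerThreadTraceBuffer {
 public:
  void AddEvent(const TraceEvent& event) {
    base::AutoLock hold(lock_);
    events_.push_back(event);
  }

  void Flush(std::vector<TraceEvent>* out) {
    base::AutoLock hold(lock_);
    out->insert(out->end(), events_.begin(), events_.end());
    events_.clear();
  }

 private:
  base::Lock lock_;
  std::vector<TraceEvent> events_;
};

// The global object: owns all buffers, with one lock guarding the list. Each
// thread's TLS just caches a raw pointer to its own buffer.
class TraceBufferRegistry {
 public:
  PerThreadTraceBuffer* CreateBufferForCurrentThread() {
    base::AutoLock hold(list_lock_);
    buffers_.push_back(std::make_unique<PerThreadTraceBuffer>());
    return buffers_.back().get();
  }

  // Called when a flush is requested or at shutdown.
  void FlushAll(std::vector<TraceEvent>* out) {
    base::AutoLock hold(list_lock_);
    for (auto& buffer : buffers_)
      buffer->Flush(out);
  }

 private:
  base::Lock list_lock_;
  std::vector<std::unique_ptr<PerThreadTraceBuffer>> buffers_;
};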
Comment 19, Oct 13 2017
> Since there is no contention, I'd expect the lock acquisition and release to be quick.

No, that would still be a couple of atomic operations for each trace event, which is way too invasive on some archs (e.g. ARM).

> At flush (either when requested or at shutdown), the global object is the only one who will contend for these locks, but we're flushing anyway, so that might be okay.

This will stall all the other threads at flush. Which means that at the first occurrence of a task depending on the execution of another task (e.g. via a waitable event or a nested message loop) we'll just deadlock. Also, I'm generally quite wary of changing the threading model of tracing unless we have a strong need. Anecdotally, each change there was followed by weeks spent on bugs some months later.
Comment 20, Apr 12 2018
The TaskScheduler migration is complete. Is performance on ARM a problem today? If not, we should close this.
Comment 21, May 3 2018
WontFixing this given no response to #20. @primiano: I assume TaskScheduler's use case will be considered critical in the new tracing design.
Comment 22, May 3 2018
Re #21: we are in the process of swapping the internals of tracing with Perfetto (tracker label:perfetto). We'll rethink what we need here soon (and eventually reopen this). It's fine to wontfix right now.

> I assume TaskScheduler's use case will be considered critical

WDYM? From the UI viewpoint (showing sequences) or what?
Comment 23, May 3 2018
>> I assume TaskScheduler's use case will be considered critical
> WDYM? From the UI viewpoint (showing sequences) or what?

Sure, but I meant more w.r.t. this bug, i.e. don't depend on threads running MessageLoops, and support threads coming and going (required for long traces, as the scheduler recycles unused threads after some time (≈60s)).
Comment 24, May 9 2018
TaskScheduler's threads should have a thread-local buffer for tracing events. Before exiting (which can happen after ~60s of inactivity), they should gracefully tear down this buffer in the same way that a MessageLoop thread's buffer is torn down when the MessageLoop is destroyed (https://cs.chromium.org/chromium/src/base/trace_event/trace_log.cc?l=292&rcl=734de4c06e9357f1adffc0e524eeb4aaa676a628).
Comment 25, May 9 2018
Using base::ThreadLocalStorage::Slot would work (instead of assuming a MessageLoop on the thread and tying the buffer's lifetime to it): both MessageLoop-based threads and SchedulerWorkers will result in the slot's DestructorFunc being invoked when the thread dies. Something to keep in mind for next-gen tracing IMO, as it wants to support super long traces (in which case SchedulerWorkers coming in and out of life will be more prevalent).
Comment 26, May 9 2018
Yep, that's what we're doing; see the early version here: https://cs.chromium.org/chromium/src/services/tracing/public/cpp/perfetto/trace_event_data_source.cc?q=trace_event_data_sour&sq=package:chromium&l=244

There are still unresolved issues around cleaning up TLS buffers/objects after tracing is finished (rather than just when threads die), but Perfetto-based tracing doesn't require the PostTask'ed Flush() calls that the TraceLog does.
Comment 27, May 9 2018
Yup, I think cleanup on destruction is sorted with the at-thread-exit hook.

> Perfetto-based tracing doesn't require the PostTask'ed Flush() calls that the TraceLog does

Well, let's say it's "less needed" than before, but it would still be desirable, e.g. to guarantee that traces are flushed to disk every X seconds so we get data back in the case of a crash (or just to reduce the windowing of the UI when searching for events). I think the only thing that would be nice here is if the scheduler had a special backdoor of the form BroadcastFunctionToAllSchedulerThreads(FunctionPtr), which invokes FunctionPtr on each thread. We could use that in turn to do the timed Flush().
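Roughly like this; everything below is a hypothetical sketch (the backdoor does not exist, and the per-thread flush body is deliberately left as a comment rather than guessing at the Perfetto data source internals):

// Hypothetical shape of the backdoor and of the timed flush that would use it.
using FunctionPtr = void (*)();

// Would be provided by the scheduler: invokes |fn| once on each
// scheduler-managed thread that is currently alive.
void BroadcastFunctionToAllSchedulerThreads(FunctionPtr fn);

// What tracing would pass in: flush this thread's trace writer so its data
// reaches the tracing service (and eventually disk) even if the process
// crashes later.
void FlushCurrentThreadTraceWriter() {
  // e.g. hand the thread-local Perfetto trace writer's pending data back to
  // the service here (hypothetical; depends on the data source internals).
}

// Invoked every X seconds while tracing is active.
void TimedFlush() {
  BroadcastFunctionToAllSchedulerThreads(&FlushCurrentThreadTraceWriter);
}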