TaskScheduler.TaskLatencyMicroseconds.* should be based on an independent heartbeat task
Issue description

While investigating issue 806981, we realized that basing TaskScheduler.TaskLatencyMicroseconds.* on reports from each individual task in the scheduler is problematic. If a single sequence accrues a large backlog, its tasks' latency will be high even though the pool's latency stays under control. If that sequence's workload is large enough, this can look like a regression in the high percentiles of the metric when it is merely a slow sequence with many tasks (not a slow pool). To fix this, we should base such metrics on a heartbeat task that is independent of any specific sequence.
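To make the failure mode concrete, here is a minimal stand-alone sketch (plain C++ with illustrative names, not Chromium's base/ implementation) of per-task latency reporting and of how a single backlogged sequence inflates the histogram:

#include <chrono>
#include <cstdio>
#include <functional>
#include <vector>

using Clock = std::chrono::steady_clock;

struct PendingTask {
  std::function<void()> work;
  Clock::time_point post_time;  // Stamped when the task is posted.
};

// Runs the task and reports its queueing latency, mimicking how every task
// currently feeds TaskScheduler.TaskLatencyMicroseconds.*.
void RunAndReportLatency(const PendingTask& task) {
  const auto latency = std::chrono::duration_cast<std::chrono::microseconds>(
      Clock::now() - task.post_time);
  std::printf("TaskLatencyMicroseconds: %lld\n",
              static_cast<long long>(latency.count()));
  task.work();
}

int main() {
  // A single backlogged sequence: 1000 tasks posted at once. Each one reports
  // the time it spent waiting behind its siblings, so the high percentiles of
  // the histogram grow even if the pool could start an unrelated task
  // immediately.
  std::vector<PendingTask> backlog;
  for (int i = 0; i < 1000; ++i)
    backlog.push_back({[] { /* sequence work */ }, Clock::now()});
  for (const auto& task : backlog)
    RunAndReportLatency(task);
}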
May 11 2018
Trying to figure out what a good metric is here. For context: another problem I found with the current approach of sampling each task for TaskScheduler.TaskLatencyMicroseconds.* is that it makes UserBlocking look slower than UserVisible (which is impossible by design). What I think is happening is that UserBlocking tasks tend to be associated with busier periods, so those tasks overall have a higher latency (but a UserVisible task running in the same period would have an even higher one).

I'm thinking about adding TaskScheduler.HeartbeatLatencyMicroseconds.*: i.e. the latency of each bucket, sampled at random intervals. That might be too distant from reality (i.e. land during uninteresting idle periods), but then that's what EQT does, I think... Perhaps looking at the long tail is sufficient to capture what happens when sampling lands in a non-idle period (at least it's well distributed and doesn't penalize UserBlocking tasks for being associated with busy periods).

Or we could force observation during a non-idle period, e.g. set a bit at random heartbeat intervals but only launch latency probes after receiving the next UserBlocking task (while the bit is set). But then, contrary to random sampling, this may associate too much with a given event type and miss a whole class of latency bugs. Overall, random sampling is safer? So long as we know that we only care about the long tail?

I think we should keep TaskScheduler.TaskLatencyMicroseconds.* as well, but document its caveats better. Its raw count has been useful in the past to see when workloads increased dramatically, and it still captures well whether we have a reasonable SLA on all tasks.

@tdresser: thoughts?
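A minimal sketch of the heartbeat idea described above (illustrative names and plain C++ stand-ins, not the actual scheduler API): at each heartbeat interval, post a no-op probe per priority bucket and record how long it waits before running.

#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>

using Clock = std::chrono::steady_clock;

// Stand-in for posting a task to the pool being measured; the real code
// would go through the scheduler rather than spawn a thread.
void PostToPool(std::function<void()> task) {
  std::thread(std::move(task)).detach();
}

// Posts a no-op probe and records how long it waited. Because the probe
// belongs to no sequence, a single backlogged sequence cannot inflate the
// sample; only the pool's own responsiveness matters.
void PostHeartbeatProbe(const char* bucket) {
  const auto post_time = Clock::now();
  PostToPool([post_time, bucket] {
    const auto latency = std::chrono::duration_cast<std::chrono::microseconds>(
        Clock::now() - post_time);
    std::printf("HeartbeatLatencyMicroseconds.%s: %lld\n", bucket,
                static_cast<long long>(latency.count()));
  });
}

int main() {
  // One probe per priority bucket at each (random) heartbeat interval.
  PostHeartbeatProbe("UserBlocking");
  PostHeartbeatProbe("UserVisible");
  PostHeartbeatProbe("Background");
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
}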
May 11 2018
TaskScheduler.HeartbeatLatencyMicroseconds.*, similar to EQT, makes sense. It could be recorded every 1 second, just like EQT. We could add suffixes to help put the value in context:
- Is a page loading? (can be provided by TabManager)
- Was there an input event recently?
- Are we during startup?
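A hypothetical sketch of the suffix selection suggested above; the predicate names and suffix strings are assumptions, not Chromium's actual histogram suffixes.

#include <cstdio>
#include <string>

// Returns the histogram suffix for the current browser context. The
// predicates would come from TabManager, input tracking, and startup state.
std::string HeartbeatHistogramSuffix(bool page_loading,
                                     bool recent_input,
                                     bool during_startup) {
  if (during_startup) return ".Startup";
  if (recent_input) return ".RecentInput";
  if (page_loading) return ".PageLoading";
  return ".Default";
}

int main() {
  std::printf("TaskScheduler.HeartbeatLatencyMicroseconds%s\n",
              HeartbeatHistogramSuffix(true, false, false).c_str());
}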
May 11 2018
Every 1 second feels aggressive; I was considering every 5 minutes... Sampling-wise, I don't think it adds a bias, and I doubt we need more samples?
May 14 2018
Sounds like the problem is that we're normalizing by the task count, whereas we want to normalize by time somehow? We can accomplish that by sampling at some time interval instead of sampling per task. There are other approaches, though. If we're already recording this for every task, can we just clone EQT?
May 16 2018
Currently we're not normalizing; we just report the latency of every single task. Instead, the proposal is to report latency on a heartbeat, independent of running tasks. This is indeed very similar to EQT (though it doesn't involve a reply). I added you to https://chromium-review.googlesource.com/c/chromium/src/+/1059459
May 18 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/3fb9e4fe6f5e51574d0647ee0adc0462da885c61

commit 3fb9e4fe6f5e51574d0647ee0adc0462da885c61
Author: Gabriel Charette <gab@chromium.org>
Date: Fri May 18 21:34:43 2018

[TaskScheduler] Introduce HeartbeatLatencyMicroseconds metric.

Put the timer on the service thread to avoid depending on the main thread (want to record the metric even if -- especially if -- the main thread is too busy for its timers to fire).

Move the impl to service_thread.cc to be able to use post_task.h and test the full stack (also allows documenting "service thread" in its new header, which was never a well documented concept of task scheduler's internals).

Kept reporting logic in task_tracker.cc to centralize traits-based metrics reporting logic.

R=fdoray@chromium.org, jwd@chromium.org

Bug: 810746
Change-Id: Ie866d521c734bc63941836319d6c1258253cb8c5
Reviewed-on: https://chromium-review.googlesource.com/1059459
Commit-Queue: Gabriel Charette <gab@chromium.org>
Reviewed-by: Jesse Doherty <jwd@chromium.org>
Reviewed-by: Timothy Dresser <tdresser@chromium.org>
Reviewed-by: François Doray <fdoray@chromium.org>
Cr-Commit-Position: refs/heads/master@{#560052}

[modify] https://crrev.com/3fb9e4fe6f5e51574d0647ee0adc0462da885c61/base/BUILD.gn
[add] https://crrev.com/3fb9e4fe6f5e51574d0647ee0adc0462da885c61/base/task_scheduler/service_thread.cc
[add] https://crrev.com/3fb9e4fe6f5e51574d0647ee0adc0462da885c61/base/task_scheduler/service_thread.h
[add] https://crrev.com/3fb9e4fe6f5e51574d0647ee0adc0462da885c61/base/task_scheduler/service_thread_unittest.cc
[modify] https://crrev.com/3fb9e4fe6f5e51574d0647ee0adc0462da885c61/base/task_scheduler/task_scheduler_impl.cc
[modify] https://crrev.com/3fb9e4fe6f5e51574d0647ee0adc0462da885c61/base/task_scheduler/task_scheduler_impl.h
[modify] https://crrev.com/3fb9e4fe6f5e51574d0647ee0adc0462da885c61/base/task_scheduler/task_tracker.cc
[modify] https://crrev.com/3fb9e4fe6f5e51574d0647ee0adc0462da885c61/base/task_scheduler/task_tracker.h
[modify] https://crrev.com/3fb9e4fe6f5e51574d0647ee0adc0462da885c61/base/task_scheduler/task_tracker_unittest.cc
[modify] https://crrev.com/3fb9e4fe6f5e51574d0647ee0adc0462da885c61/tools/metrics/histograms/histograms.xml
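A rough sketch of the shape of this change, using standard C++ stand-ins (std::thread instead of the actual service thread and base timer APIs); all names and the interval are illustrative, not the landed implementation.

#include <atomic>
#include <chrono>
#include <thread>

// A repeating timer that lives on its own "service" thread, so the latency
// probe is posted even when the main thread is too busy for its own timers
// to fire.
class HeartbeatTimer {
 public:
  // `post_probe` is whatever posts the latency probes to the pools (see the
  // earlier sketch); `interval` here is illustrative only.
  template <typename Callback>
  void Start(Callback post_probe, std::chrono::seconds interval) {
    running_ = true;
    service_thread_ = std::thread([this, post_probe, interval] {
      while (running_) {
        std::this_thread::sleep_for(interval);
        if (running_) post_probe();
      }
    });
  }

  void Stop() {
    running_ = false;
    if (service_thread_.joinable()) service_thread_.join();
  }

 private:
  std::atomic<bool> running_{false};
  std::thread service_thread_;
};

int main() {
  HeartbeatTimer timer;
  timer.Start([] { /* post one latency probe per traits combination */ },
              std::chrono::seconds(1));
  std::this_thread::sleep_for(std::chrono::seconds(3));
  timer.Stop();
}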
May 28 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/7e7dd55cce56f643d5945f5a2dbd42d475c73e1b

commit 7e7dd55cce56f643d5945f5a2dbd42d475c73e1b
Author: Gabriel Charette <gab@chromium.org>
Date: Mon May 28 13:39:21 2018

[TaskScheduler] Slower latency heartbeat

The heartbeat is causing minor CPU usage regressions (crbug.com/845919) and we're getting more reports than we need. Let's start with one report per hour per user and increase later if we find that's not enough.

Previously we relied on the heartbeat interval being fast enough (less than 30 seconds) so that the threads it used were not recycled (and in turn we also kept 2 or 3 threads alive per pool instead of 1 because of these heartbeats). Now we only perform a heartbeat with one set of random traits at a time so that, if idle, we use the existing idle thread and do not cause further churn.

R=fdoray@chromium.org

Bug: 845919, 810746
Change-Id: I2bd5cab89cbdafa22ccbd31ab49d7c3c9a4fc53b
Reviewed-on: https://chromium-review.googlesource.com/1073569
Reviewed-by: François Doray <fdoray@chromium.org>
Commit-Queue: Gabriel Charette <gab@chromium.org>
Cr-Commit-Position: refs/heads/master@{#562248}

[modify] https://crrev.com/7e7dd55cce56f643d5945f5a2dbd42d475c73e1b/base/task_scheduler/service_thread.cc
[modify] https://crrev.com/7e7dd55cce56f643d5945f5a2dbd42d475c73e1b/base/task_scheduler/service_thread.h
[modify] https://crrev.com/7e7dd55cce56f643d5945f5a2dbd42d475c73e1b/base/task_scheduler/service_thread_unittest.cc
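A sketch of the behaviour this follow-up describes, with assumed names and constants (the enum and interval below are illustrative, not the actual base/ definitions): pick one traits combination at random per heartbeat and probe roughly once an hour.

#include <chrono>
#include <random>

enum class TaskPriority { BACKGROUND, USER_VISIBLE, USER_BLOCKING };

// Roughly one report per hour per user, as described in the commit message.
constexpr std::chrono::minutes kHeartbeatInterval{60};

// Picks a single priority at random for this heartbeat, so an idle pool can
// service the probe with its one idle thread instead of waking several.
TaskPriority PickRandomHeartbeatPriority() {
  static std::mt19937 rng{std::random_device{}()};
  std::uniform_int_distribution<int> dist(0, 2);
  return static_cast<TaskPriority>(dist(rng));
}

int main() {
  const TaskPriority priority = PickRandomHeartbeatPriority();
  (void)priority;  // The real code would post one probe with these traits
                   // every kHeartbeatInterval.
}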
Comment 1 by gab@chromium.org, Apr 12 2018
Status: Available (was: Untriaged)