Stalled lock posting tasks |
|||||||||||
Issue descriptionWe are seeing locks stalled on the browser main thread for multiple seconds. Slow-reports are getting traces when jank are detected on the browser main thread. These following reports contains example of locks stalled on the main threads. a854c224a7a58095 78ef666bef5a2b23 15ec44ddde951fcc cfe9f3adf788db45 1f7b05c7e6ea7cd6 2bd2d71673e650b6 e626a80082921fb7 b28d0d7489f6827e Example of stalled stackframes: base::internal::LockImpl::Lock() base::MessageLoop::Controller::DidQueueTask(bool) base::internal::MessageLoopTaskRunner::AddToIncomingQueue(base::Location const &,base::OnceCallback<void >,base::TimeDelta,base::Nestable) base::internal::MessageLoopTaskRunner::PostDelayedTask(base::Location const &,base::OnceCallback<void >,base::TimeDelta) base::internal::TimerBase::PostNewScheduledTask(base::TimeDelta) base::RepeatingTimer::RunUserTask() base::internal::BaseTimerTaskInternal::Run() base::debug::TaskAnnotator::RunTask(char const *,base::PendingTask *) base::MessageLoop::RunTask(base::PendingTask *) base::MessageLoop::DoDelayedWork(base::TimeTicks *) base::MessagePumpForUI::DoRunLoop() base::MessagePumpWin::Run(base::MessagePump::Delegate *) base::RunLoop::Run() ChromeBrowserMainParts::MainMessageLoopRun(int *) content::BrowserMainLoop::RunMainMessageLoopParts() content::BrowserMainRunnerImpl::Run() content::BrowserMain(content::MainFunctionParams const &) content::RunBrowserProcessMain(content::MainFunctionParams const &,content::ContentMainDelegate *) content::ContentMainRunnerImpl::Run(bool) service_manager::Main(service_manager::MainParams const &) content::ContentMain(content::ContentMainParams const &) ChromeMain MainDllLoader::Launch(HINSTANCE__ *,base::TimeTicks) wWinMain __scrt_common_main_seh kernel32.dll ntdll.dll base::internal::LockImpl::Lock() base::MessageLoop::Controller::DidQueueTask(bool) base::internal::MessageLoopTaskRunner::AddToIncomingQueue(base::Location const &,base::OnceCallback<void >,base::TimeDelta,base::Nestable) base::internal::MessageLoopTaskRunner::PostDelayedTask(base::Location const &,base::OnceCallback<void >,base::TimeDelta) viz::DelayBasedTimeSource::PostNextTickTask(base::TimeTicks) viz::DelayBasedTimeSource::OnTimerTick() base::debug::TaskAnnotator::RunTask(char const *,base::PendingTask *) base::MessageLoop::RunTask(base::PendingTask *) base::MessageLoop::DoDelayedWork(base::TimeTicks *) base::MessagePumpForUI::DoRunLoop() base::MessagePumpWin::Run(base::MessagePump::Delegate *) base::RunLoop::Run() ChromeBrowserMainParts::MainMessageLoopRun(int *) content::BrowserMainLoop::RunMainMessageLoopParts() content::BrowserMainRunnerImpl::Run() content::BrowserMain(content::MainFunctionParams const &) content::RunBrowserProcessMain(content::MainFunctionParams const &,content::ContentMainDelegate *) content::ContentMainRunnerImpl::Run(bool) service_manager::Main(service_manager::MainParams const &) content::ContentMain(content::ContentMainParams const &) ChromeMain MainDllLoader::Launch(HINSTANCE__ *,base::TimeTicks) wWinMain __scrt_common_main_seh kernel32.dll ntdll.dll base::internal::LockImpl::Lock() base::MessageLoop::Controller::DidQueueTask(bool) base::internal::MessageLoopTaskRunner::AddToIncomingQueue(base::Location const &,base::OnceCallback<void >,base::TimeDelta,base::Nestable) base::internal::MessageLoopTaskRunner::PostDelayedTask(base::Location const &,base::OnceCallback<void >,base::TimeDelta) base::internal::TimerBase::PostNewScheduledTask(base::TimeDelta) base::RepeatingTimer::RunUserTask() base::internal::BaseTimerTaskInternal::Run() base::debug::TaskAnnotator::RunTask(char const *,base::PendingTask *) base::MessageLoop::RunTask(base::PendingTask *) base::MessageLoop::DoDelayedWork(base::TimeTicks *) base::MessagePumpForUI::DoRunLoop() base::MessagePumpWin::Run(base::MessagePump::Delegate *) base::RunLoop::Run() ChromeBrowserMainParts::MainMessageLoopRun(int *) content::BrowserMainLoop::RunMainMessageLoopParts() content::BrowserMainRunnerImpl::Run() content::BrowserMain(content::MainFunctionParams const &) content::RunBrowserProcessMain(content::MainFunctionParams const &,content::ContentMainDelegate *) content::ContentMainRunnerImpl::Run(bool) service_manager::Main(service_manager::MainParams const &) content::ContentMain(content::ContentMainParams const &) ChromeMain MainDllLoader::Launch(HINSTANCE__ *,base::TimeTicks) wWinMain __scrt_common_main_seh BaseThreadInitThunk RtlUserThreadStart
,
Oct 1
30 seconds on a single lock acquisition, that's crazy beans =S Let's see if it happens on other threads as well (e.g. IO and renderer main).
,
Oct 1
FWIW, this is posting to the MessageLoop, not "TaskScheduler". MessageLoop is about to be replaced by SequenceManager, I don't expect that to fix the problem but it will alter the stack traces we encounter.
,
Oct 2
We can find some stackframe for other threads by looking to crashes when the main thread hang (or I/O thread). As an example, the following report "89cced08d550f415" shows the main thread to be blocked: base::internal::LockImpl::Lock() base::MessageLoop::Controller::DidQueueTask(bool) base::internal::MessageLoopTaskRunner::AddToIncomingQueue [...] While a background thread is owning the lock and trying to signal. PostMessageW <<-- OS base::MessagePumpForUI::ScheduleWork() base::MessageLoop::Controller::DidQueueTask(bool) The OS call to PostMessageW is probably aware of thread priority and won't let the background thread executed until any higher level thread is completed. (minus the random scheduling to fix priority inversion). This seems to point to a classic priority inversion problem.
,
Oct 2
More examples: 01ac2410b9bd8f5d 0eb245b0cdc8a8ef 11931c79f6ba9819
,
Oct 2
Thanks etienneb for clearly demonstrating that we can get lock priority inversions on Win. I've long thought this could be a problem in particular when mixed with priority boosting (which seems to be at fault for releasing the lock owner's quanta in the majority of cases). The architectural solution is to have staging queues to post across priorities But short term how about we experiment with temporarily disabling background priority on Win (i.e. return false from LockImpl's Supports...() method). That would let us know whether it fixes the cases etienneb is seeing. We know priorities are a win and Francois showed that using an even more aggressive bg priority is beneficial but a short term experiment would further motivate the need for an advanced fix (as opposed to dropping thread priorities wholesale). And the more advanced solution would enable priorities on all platforms, woohoo!
,
Oct 3
,
Oct 13
Will take this one on. I'm pretty sure it's the root cause of long-puzzling issue 874237.
,
Oct 13
,
Oct 17
Here's a set of 22982 crash reports that are likely associated with this : https://crash.corp.google.com/browse?q=product_name%3D%27Chrome%27+AND+EXISTS+%28SELECT+1+FROM+UNNEST%28thread%29+CROSS+JOIN+UNNEST%28StackTrace.StackFrame%29+WHERE+FunctionName%3D%27base%3A%3Ainternal%3A%3ALockImpl%3A%3ALock%28%29%27%29+AND+EXISTS+%28SELECT+1+FROM+UNNEST%28thread%29+CROSS+JOIN+UNNEST%28StackTrace.StackFrame%29+WHERE+FunctionName%3D%27base%3A%3AMessageLoop%3A%3AController%3A%3ADidQueueTask%28bool%29%27%29 Here's a subset of 5672 of those crash reports which are definitely associated with this : https://crash.corp.google.com/browse?q=product_name%3D%27Chrome%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name+LIKE+%27%25+hang%5D%25%27+AND+EXISTS+%28SELECT+1+FROM+UNNEST%28thread%29+CROSS+JOIN+UNNEST%28StackTrace.StackFrame%29+WHERE+FunctionName%3D%27base%3A%3Ainternal%3A%3ALockImpl%3A%3ALock%28%29%27%29+AND+EXISTS+%28SELECT+1+FROM+UNNEST%28thread%29+CROSS+JOIN+UNNEST%28StackTrace.StackFrame%29+WHERE+FunctionName%3D%27base%3A%3AMessageLoop%3A%3AController%3A%3ADidQueueTask%28bool%29%27%29#samplereports
,
Oct 18
Issue 896177 has been merged into this issue.
,
Oct 22
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f7da13a28c3799819bd39c4d95dc02a805b27e4f commit f7da13a28c3799819bd39c4d95dc02a805b27e4f Author: Gabriel Charette <gab@chromium.org> Date: Mon Oct 22 16:07:49 2018 [MessageLoop] Lock-free ScheduleWork() scheme The Lock is causing hangs because of priority inversion mixed with priority boosting (ScheduleWork() tends to boost the destination thread which may deschedule the posting thread; if the posting thread is a background thread this boost-induded-desched-while-holding-lock can cause a livelock). See https://crbug.com/890978#c10 for example crashes catching this. The Lock was only necessary for startup/shutdown and is being replaced by a lock-free atomic scheme in this CL. MessagePump::ScheduleWork() itself was already thread-safe (but the Android impl did unnecessarily check a non-atomic bool) This adds a WaitableEvent in ~MessageLoop(); hence the requirement for a wait-allowance in net's EmbeddedTestServer. TBR=zhongyi@chromium.org (embedded_test_server.cc side-effects) Bug: 890978 , 874237 Change-Id: I0916e5a99035a935b0a23a770af256f334e78c43 Reviewed-on: https://chromium-review.googlesource.com/c/1278631 Commit-Queue: Gabriel Charette <gab@chromium.org> Reviewed-by: François Doray <fdoray@chromium.org> Cr-Commit-Position: refs/heads/master@{#601600} [modify] https://crrev.com/f7da13a28c3799819bd39c4d95dc02a805b27e4f/base/message_loop/message_loop.cc [modify] https://crrev.com/f7da13a28c3799819bd39c4d95dc02a805b27e4f/base/message_loop/message_loop.h [modify] https://crrev.com/f7da13a28c3799819bd39c4d95dc02a805b27e4f/base/message_loop/message_pump.h [modify] https://crrev.com/f7da13a28c3799819bd39c4d95dc02a805b27e4f/base/message_loop/message_pump_android.cc [modify] https://crrev.com/f7da13a28c3799819bd39c4d95dc02a805b27e4f/net/test/embedded_test_server/embedded_test_server.cc
,
Oct 22
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/243f975c09f36ff7e90f43a530e64b31aac641e9 commit 243f975c09f36ff7e90f43a530e64b31aac641e9 Author: Yi Gu <yigu@chromium.org> Date: Mon Oct 22 21:34:13 2018 Revert "[MessageLoop] Lock-free ScheduleWork() scheme" This reverts commit f7da13a28c3799819bd39c4d95dc02a805b27e4f. Reason for revert: As per comments by Findit Original change's description: > [MessageLoop] Lock-free ScheduleWork() scheme > > The Lock is causing hangs because of priority inversion > mixed with priority boosting (ScheduleWork() tends to > boost the destination thread which may deschedule the > posting thread; if the posting thread is a background > thread this boost-induded-desched-while-holding-lock > can cause a livelock). See https://crbug.com/890978#c10 > for example crashes catching this. > > The Lock was only necessary for startup/shutdown and is > being replaced by a lock-free atomic scheme in this CL. > > MessagePump::ScheduleWork() itself was already thread-safe > (but the Android impl did unnecessarily check a non-atomic bool) > > This adds a WaitableEvent in ~MessageLoop(); hence the requirement > for a wait-allowance in net's EmbeddedTestServer. > > TBR=zhongyi@chromium.org (embedded_test_server.cc side-effects) > > Bug: 890978 , 874237 > Change-Id: I0916e5a99035a935b0a23a770af256f334e78c43 > Reviewed-on: https://chromium-review.googlesource.com/c/1278631 > Commit-Queue: Gabriel Charette <gab@chromium.org> > Reviewed-by: François Doray <fdoray@chromium.org> > Cr-Commit-Position: refs/heads/master@{#601600} TBR=gab@chromium.org,fdoray@chromium.org,zhongyi@chromium.org Change-Id: I521ef6ba2758f84d13a4f98d65cb41b276cb115e No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: 890978 , 874237 Reviewed-on: https://chromium-review.googlesource.com/c/1294717 Reviewed-by: Yi Gu <yigu@chromium.org> Commit-Queue: Yi Gu <yigu@chromium.org> Cr-Commit-Position: refs/heads/master@{#601731} [modify] https://crrev.com/243f975c09f36ff7e90f43a530e64b31aac641e9/base/message_loop/message_loop.cc [modify] https://crrev.com/243f975c09f36ff7e90f43a530e64b31aac641e9/base/message_loop/message_loop.h [modify] https://crrev.com/243f975c09f36ff7e90f43a530e64b31aac641e9/base/message_loop/message_pump.h [modify] https://crrev.com/243f975c09f36ff7e90f43a530e64b31aac641e9/base/message_loop/message_pump_android.cc [modify] https://crrev.com/243f975c09f36ff7e90f43a530e64b31aac641e9/net/test/embedded_test_server/embedded_test_server.cc
,
Oct 23
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/c8a1dc548deacfd28e32e5f1c54b02ec59dfc911 commit c8a1dc548deacfd28e32e5f1c54b02ec59dfc911 Author: Gabriel Charette <gab@chromium.org> Date: Tue Oct 23 23:10:55 2018 Reland "[MessageLoop] Lock-free ScheduleWork() scheme" This is a reland of f7da13a28c3799819bd39c4d95dc02a805b27e4f (original change in PS1) The original change was missing acquire ordering in DisconnectFromParent(). This is necessary in order for the disconnecting thread to see all memory side-effects previously made by other threads (or some side-effects of message_loop_->ScheduleWork() could racily come in after ~MessageLoop()). Also removed the DCHECK that |operations_state_ == kDisconnectedBit| at the end of DisconnectFromParent() as it was incorrect. A racy BeforeOperation() call can make it 1->3 (and no-op) after DisconnectFromParent() made it 0->1. And lastly, added a scoped allowance to always allow the very fast wait instead of requiring callers to know about this implementation detail of MessageLoop (and reverted changes to //net). Original change's description: > [MessageLoop] Lock-free ScheduleWork() scheme > > The Lock is causing hangs because of priority inversion > mixed with priority boosting (ScheduleWork() tends to > boost the destination thread which may deschedule the > posting thread; if the posting thread is a background > thread this boost-induded-desched-while-holding-lock > can cause a livelock). See https://crbug.com/890978#c10 > for example crashes catching this. > > The Lock was only necessary for startup/shutdown and is > being replaced by a lock-free atomic scheme in this CL. > > MessagePump::ScheduleWork() itself was already thread-safe > (but the Android impl did unnecessarily check a non-atomic bool) > > This adds a WaitableEvent in ~MessageLoop(); hence the requirement > for a wait-allowance in net's EmbeddedTestServer. > > TBR=zhongyi@chromium.org (embedded_test_server.cc side-effects) > > Bug: 890978 , 874237 > Change-Id: I0916e5a99035a935b0a23a770af256f334e78c43 > Reviewed-on: https://chromium-review.googlesource.com/c/1278631 > Commit-Queue: Gabriel Charette <gab@chromium.org> > Reviewed-by: François Doray <fdoray@chromium.org> > Cr-Commit-Position: refs/heads/master@{#601600} Bug: 890978 , 874237, 897925 Change-Id: I17c515f9a3169bbdfc303a4b259f34097e31730d Reviewed-on: https://chromium-review.googlesource.com/c/1297129 Reviewed-by: François Doray <fdoray@chromium.org> Commit-Queue: Gabriel Charette <gab@chromium.org> Cr-Commit-Position: refs/heads/master@{#602166} [modify] https://crrev.com/c8a1dc548deacfd28e32e5f1c54b02ec59dfc911/base/message_loop/message_loop.cc [modify] https://crrev.com/c8a1dc548deacfd28e32e5f1c54b02ec59dfc911/base/message_loop/message_loop.h [modify] https://crrev.com/c8a1dc548deacfd28e32e5f1c54b02ec59dfc911/base/message_loop/message_pump.h [modify] https://crrev.com/c8a1dc548deacfd28e32e5f1c54b02ec59dfc911/base/message_loop/message_pump_android.cc [modify] https://crrev.com/c8a1dc548deacfd28e32e5f1c54b02ec59dfc911/base/threading/thread_restrictions.h
,
Nov 1
Fixed, documented effect on SchedulerWork() @ https://crbug.com/874237#c15 @etienneb : any visible effects on slow reports? I'll reland the bug behind a holdback experiment so we can see the impact on Browser.Responsiveness.JankyIntervalsPerThirtySeconds (the metric unfortunately changed in the same revision this bug was fixed..).
,
Nov 1
I no longer see it. That's great. Other janks disappear at the same time. I'm having hard time make precise metric. The sampling profiler should solve this.
,
Nov 5
,
Nov 8
On BrowserThread::UI, that improvement reduced LockImpl::Lock overhead on the main thread by 66% at runtime (https://uma.googleplex.com/p/chrome/callstacks?sid=2272e8f261252fc2f5e75502b96d45ca) and by 85% on startup (https://uma.googleplex.com/p/chrome/callstacks?sid=987a001cf2533ab9fe77ef7d0ceb9f9d) Looks like SequenceManager (Blink threads only for now) has the same issue (issue 901345), will need to be fixed before using it on BrowserThreads. It's not as bad their because renderers don't use background threads as much and thus suffer from priority inversions induced by this less.
,
Nov 8
And from a manual analysis of stack sampled metrics performed by wittman@, this removed 71ms from startup at the median. W00t!
,
Nov 12
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/1b876dd0841d6a1a44a5fbdadff2191012a1e0f1 commit 1b876dd0841d6a1a44a5fbdadff2191012a1e0f1 Author: Francois Doray <fdoray@chromium.org> Date: Mon Nov 12 17:09:37 2018 Base: Add feature to disable thread priorities. This will allow us to assess the impact of using thread priorities. The impact can be both positive (unimportant work gets out of the way) and negative (priority inversions when sharing locks across threads with different priorities). Note: This CL has extra complexity because we don't have guarantees that the FeatureList is initialized before threads are started in unit tests https://crbug.com/846380. Bug: 872820, 890978 , 902441, 846380 Change-Id: I8887c7b9e0eb77f7c11aa6e2be8af1ea7150b491 Reviewed-on: https://chromium-review.googlesource.com/c/1318686 Commit-Queue: François Doray <fdoray@chromium.org> Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Reviewed-by: Gabriel Charette <gab@chromium.org> Cr-Commit-Position: refs/heads/master@{#607276} [modify] https://crrev.com/1b876dd0841d6a1a44a5fbdadff2191012a1e0f1/base/BUILD.gn [modify] https://crrev.com/1b876dd0841d6a1a44a5fbdadff2191012a1e0f1/base/task/task_scheduler/task_scheduler_impl.cc [add] https://crrev.com/1b876dd0841d6a1a44a5fbdadff2191012a1e0f1/base/threading/platform_thread.cc [modify] https://crrev.com/1b876dd0841d6a1a44a5fbdadff2191012a1e0f1/base/threading/platform_thread.h [modify] https://crrev.com/1b876dd0841d6a1a44a5fbdadff2191012a1e0f1/base/threading/platform_thread_fuchsia.cc [modify] https://crrev.com/1b876dd0841d6a1a44a5fbdadff2191012a1e0f1/base/threading/platform_thread_mac.mm [modify] https://crrev.com/1b876dd0841d6a1a44a5fbdadff2191012a1e0f1/base/threading/platform_thread_posix.cc [modify] https://crrev.com/1b876dd0841d6a1a44a5fbdadff2191012a1e0f1/base/threading/platform_thread_win.cc
,
Nov 12
,
Nov 12
,
Nov 21
Reopening for the retroactive experiment (https://chromium-review.googlesource.com/c/chromium/src/+/1347585); otherwise Finch presubmits complain that I'm using a closed bug for tracking...
,
Dec 3
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9339569fe638b187b8e691f1b90e8e93ce4b2006 commit 9339569fe638b187b8e691f1b90e8e93ce4b2006 Author: Chris Hamilton <chrisha@chromium.org> Date: Mon Dec 03 18:06:52 2018 [MessageLoop] Experimentally reintroduce the ScheduleWork + Lock bug to verify new metrics We want to cross-check that we can see the impact of this fix on the metrics we want to track long term. The fix unfortunately landed at the same time as a tweak to these metrics and as such we missed that opportunity the first time around. This experiment will only be run for two days on Windows Canary. This is a continuation of this CL by gab@: https://chromium-review.googlesource.com/c/chromium/src/+/1347585/4 BUG: 890978 Change-Id: I8543d398ea804344dd160066c3f8f1638c7f5770 Reviewed-on: https://chromium-review.googlesource.com/c/1358800 Reviewed-by: François Doray <fdoray@chromium.org> Commit-Queue: Chris Hamilton <chrisha@chromium.org> Cr-Commit-Position: refs/heads/master@{#613156} [modify] https://crrev.com/9339569fe638b187b8e691f1b90e8e93ce4b2006/base/message_loop/message_loop_impl.cc
,
Jan 2
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/53bb3aeb50d3cea9e3148e66cfa3a4cbffb432cb commit 53bb3aeb50d3cea9e3148e66cfa3a4cbffb432cb Author: Gabriel Charette <gab@chromium.org> Date: Wed Jan 02 16:41:08 2019 Revert "[MessageLoop] Experimentally reintroduce the ScheduleWork + Lock bug to verify new metrics" This reverts commit 9339569fe638b187b8e691f1b90e8e93ce4b2006. Reason for revert: Experiment is complete. Original change's description: > [MessageLoop] Experimentally reintroduce the ScheduleWork + Lock bug to verify new metrics > > We want to cross-check that we can see the impact of this fix on the > metrics we want to track long term. The fix unfortunately landed at > the same time as a tweak to these metrics and as such we missed that > opportunity the first time around. > > This experiment will only be run for two days on Windows Canary. > > This is a continuation of this CL by gab@: > > https://chromium-review.googlesource.com/c/chromium/src/+/1347585/4 > > BUG: 890978 > Change-Id: I8543d398ea804344dd160066c3f8f1638c7f5770 > Reviewed-on: https://chromium-review.googlesource.com/c/1358800 > Reviewed-by: François Doray <fdoray@chromium.org> > Commit-Queue: Chris Hamilton <chrisha@chromium.org> > Cr-Commit-Position: refs/heads/master@{#613156} TBR=chrisha@chromium.org,fdoray@chromium.org # Not skipping CQ checks because original CL landed > 1 day ago. Change-Id: Iddb585d7cb69b7ecfb80208c575f541702f649bb Reviewed-on: https://chromium-review.googlesource.com/c/1392735 Reviewed-by: Gabriel Charette <gab@chromium.org> Commit-Queue: Gabriel Charette <gab@chromium.org> Cr-Commit-Position: refs/heads/master@{#619419} [modify] https://crrev.com/53bb3aeb50d3cea9e3148e66cfa3a4cbffb432cb/base/message_loop/message_loop_impl.cc
,
Jan 2
|
|||||||||||
►
Sign in to add a comment |
|||||||||||
Comment 1 by etienneb@chromium.org
, Oct 162.4 KB
62.4 KB View Download