Time spent waiting on the BrowserThreadGlobals lock in mojo::SimpleWatcher::Context::Notify() during browser process IO thread startup on 64-bit Chrome on Windows regressed by 37ms in the 67.0.3365.0 Canary release. This now accounts for 1.1% of non-idle startup execution time on the thread.
No changes were detected to the functions on the mojo::SimpleWatcher::Context::Notify() call stack in this release, so the immediate cause of this regression was likely some change indirectly affecting this code.
Execution profile difference of Canary 67.0.3364.0 vs. 67.0.3365.0: https://uma.googleplex.com/p/chrome/callstacks?sid=de499992cc86c8cb5d8b3f6ecf6995a5
A similar but smaller regression is also seen on Mac:
https://uma.googleplex.com/p/chrome/callstacks?sid=85d272e1199d51429e67f6195791b99f
The regression range is correct; it's derived from measuring this code in the wild on 67.0.3364.0 and 67.0.3365.0 and computing the difference in execution time between the two releases. Looking at the profiles for 67.0.3366.0 and 67.0.3367.0, the regression persists in those subsequent releases.
It's possible that the cause of this contention is from somewhere else and we're just seeing it most significantly on this call stack for some reason. We do see smaller regressions in calls to content::BrowserThread::GetCurrentThreadIdentifier() on the IO[1] and UI[2] threads due to contention on the same lock, for example.
Perhaps gab@ has some idea what might have increased contention on this lock.
[1] https://uma.googleplex.com/p/chrome/callstacks?sid=bce7adc9cc6ad83817686ea22fbe2aa1
[2] https://uma.googleplex.com/p/chrome/callstacks?sid=c5df0c34cccfffbb3fe7221230605bf6
One other thought: this could be the result of some change that caused this code path to be executed much more frequently. Are there any CLs in the regression range that could have done that?
It's not the first time that content::BrowserThread::CurrentlyOn() shows up in profiling reports (UMA profiler or local ETW traces).
Irrespective of what caused this specific regression, I believe this code should be made more efficient. Perhaps we could ensure that BrowserThreadGlobals is initialized while Chrome is still single-threaded, and make most accesses lock-free.
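To illustrate the idea, here is a minimal self-contained sketch (hypothetical names, not the actual content::BrowserThreadGlobals code): each BrowserThread records its id in an atomic slot while Chrome is still single-threaded, after which CurrentlyOn()-style checks are plain atomic loads with no lock.

#include <atomic>
#include <thread>

// Hypothetical sketch only; names and layout do not match the real
// content::BrowserThreadGlobals.
enum BrowserThreadID { UI = 0, IO = 1, ID_COUNT = 2 };

// One slot per BrowserThread. Written during single-threaded startup and
// only read afterwards, so no lock is required.
std::atomic<std::thread::id> g_browser_thread_ids[ID_COUNT];

// Called on each BrowserThread before any other thread can query it.
void RegisterCurrentThreadAs(BrowserThreadID id) {
  g_browser_thread_ids[id].store(std::this_thread::get_id(),
                                 std::memory_order_release);
}

// Lock-free equivalent of a BrowserThread::CurrentlyOn() check.
bool CurrentlyOnSketch(BrowserThreadID id) {
  return g_browser_thread_ids[id].load(std::memory_order_acquire) ==
         std::this_thread::get_id();
}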
I agree that the locking scheme here is overkill now that we only have UI/IO BrowserThreads; all it's there for now is shutdown, and shutdown is much simpler than it used to be.
Thanks for filing, Mike, love the sampling profiler data once again!
I'll fix this.
Note, however, that posting a task will also require a lock, so fixing this might just push the problem a few stack frames down to the next lock. But at least it will be a per-task-runner lock rather than a global lock. The sampling profiler will let us know if this is a problem then :)
As for why it regressed in M67, this might just be that more and more tasks are posted from TaskScheduler. Since the logic below (wrongly) assumes that any thread that isn't a BrowserThread doesn't outlive BrowserThreads, it now grabs the lock on PostTask from all threads except when posting from UI->IO:
bool target_thread_outlives_current =
    GetCurrentThreadIdentifier(&current_thread) &&
    current_thread >= identifier;
This is overly conservative nowadays.
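For illustration, a small self-contained sketch (hypothetical names, not the real content::BrowserThreadImpl code) of why that check sends every non-BrowserThread poster down the locked path: GetCurrentThreadIdentifier() fails on any thread that isn't a BrowserThread, so the && short-circuits to false for every TaskScheduler worker regardless of the >= comparison.

// Hypothetical sketch; names are illustrative only.
enum BrowserThreadID { UI = 0, IO = 1 };

// Models GetCurrentThreadIdentifier() as seen from a TaskScheduler worker
// (or any other non-BrowserThread): it simply fails.
bool GetCurrentThreadIdentifierSketch(BrowserThreadID*) {
  return false;
}

// Returns true when posting to |identifier| must take the global lock.
bool PostRequiresGlobalLock(BrowserThreadID identifier) {
  BrowserThreadID current_thread;
  bool target_thread_outlives_current =
      GetCurrentThreadIdentifierSketch(&current_thread) &&
      current_thread >= identifier;
  // Always false for a non-BrowserThread poster, so the locked path is taken
  // no matter which BrowserThread the task targets.
  return !target_thread_outlives_current;
}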
Removed BrowserThreadGlobals::lock_ in r546212 (first live in 67.0.3382.0).
Comparison of time spent in LockImpl::Lock on BrowserThread::IO during browser startup:
67.0.3381.0 : https://uma.googleplex.com/p/chrome/callstacks?sid=4123b774cdae0360a93ddeb4f00e0c01
67.0.3382.0 : https://uma.googleplex.com/p/chrome/callstacks?sid=c076a5e584d5d22700b7241705e0d1d7
Summary:
67.0.3381.0:
- 3.722% time spent in LockImpl::Lock
  - Major contributor = mojo::SimpleWatcher::Context::Notify
    - 1.112% time spent locking BrowserThreadGlobals::lock_
    - 0.2057% time spent locking IncomingTaskQueue::incoming_queue_lock_
  - Other major contributors are stuck on IncomingTaskQueue::incoming_queue_lock_
- 1.238% spent in content::BrowserThread::CurrentlyOn (all on BrowserThreadGlobals::lock_)

67.0.3382.0:
- 3.390% time spent in LockImpl::Lock
  - Major contributor = mojo::SimpleWatcher::Context::Notify
    - 0% time spent locking BrowserThreadGlobals::lock_
    - 1.216% time spent locking IncomingTaskQueue::incoming_queue_lock_
  - Other major contributors are stuck on IncomingTaskQueue::incoming_queue_lock_
- 0% spent in content::BrowserThread::CurrentlyOn (the method still exists but is now blazing fast)
So this fix highlights even more strongly that the real issue is crbug.com/786597. The vast majority of the locking time is spent waiting for incoming_queue_lock_, most likely because the lock is held too long while managing the delayed task queue, which is slow.
Getting rid of BrowserThreadGlobals::lock_ was a first step to healing the situation and sped up BrowserThread::CurrentlyOn() from an average of 34.3ms to an instant 0ms call. For PostTask, however, that lock merely acted as a bottleneck in front of IncomingTaskQueue::incoming_queue_lock_: relieving it means we reach the next lock faster, but we don't acquire it any faster.
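As a generic illustration of the kind of problem described above (hypothetical names, not the actual IncomingTaskQueue code): if the shared lock is held only long enough to append to a flat incoming queue, the O(log n) delayed-task bookkeeping can happen on the consuming thread without blocking other posters.

#include <chrono>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

using Clock = std::chrono::steady_clock;

struct DelayedTask {
  Clock::time_point run_time;
  std::function<void()> task;
  bool operator>(const DelayedTask& other) const {
    return run_time > other.run_time;
  }
};

// Illustrative sketch only. Posters append to a cheap flat queue under the
// shared lock; the owning thread later moves entries into its private
// priority queue without holding the shared lock.
class IncomingQueueSketch {
 public:
  void PostDelayedTask(std::function<void()> task, Clock::duration delay) {
    std::lock_guard<std::mutex> lock(incoming_lock_);
    incoming_.push_back({Clock::now() + delay, std::move(task)});  // O(1)
  }

  // Runs on the consuming thread only, so |delayed_| needs no lock.
  void ReloadAndSchedule() {
    std::vector<DelayedTask> batch;
    {
      std::lock_guard<std::mutex> lock(incoming_lock_);
      batch.swap(incoming_);  // shared lock held only for the swap
    }
    for (auto& entry : batch)
      delayed_.push(std::move(entry));  // O(log n), but uncontended
  }

 private:
  std::mutex incoming_lock_;
  std::vector<DelayedTask> incoming_;
  std::priority_queue<DelayedTask, std::vector<DelayedTask>,
                      std::greater<DelayedTask>> delayed_;
};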
I'll be attacking issue 786597 next to heal that situation.
PS: Thanks once again to the Sampling Profiler for this awesome data :)!
Turns out this was a prereq to issue 890978 (which wasn't known at the time).
Hence iOS is still affected by issue 890978 because of this lock.