New issue
Advanced search Search tips

Issue 804034 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 1
Type: Bug



Sign in to add a comment

41ms startup performance regression due to base::DefaultSingletonTraits<>::New()

Project Member Reported by wittman@chromium.org, Jan 19 2018

Issue description

Data from the UMA Sampling Profiler shows that newly added calls to base::DefaultSingletonTraits<>::New() regressed browser process UI thread startup on 64-bit Chrome on Windows by 41ms. This occurred in the 65.0.3324.0 Canary release.

The responsible CL appears to be https://chromium.googlesource.com/chromium/src.git/+/f297f018bb66706c8d4ff456bd5868f31985206d

Profile difference of Canary 65.0.3323.0 vs. 65.0.3324.0: https://uma.googleplex.com/p/chrome/callstacks?sid=240f6edeb08ab15efd8cd70123627e9f

 
Components: -UI>Browser>Profiles -Platform>Extensions
Components: Internals>TaskScheduler

Comment 3 by gab@chromium.org, Jan 22 2018

Cc: fdoray@chromium.org thakis@chromium.org dcheng@chromium.org brucedaw...@chromium.org
Interesting, looks like a ~0.5ms regression per call to Traits::New(). The main difference is that it is now templated out into GetOrCreateLazyPointer's CreatorFunc. Perhaps that affects inlining or code/cache locality..?

Doesn't appear like out-of-lining NeedsLazyInstance()/CompleteLazyInstance() is at fault here (would have otherwise been my first guess for that CL but it isn't in the faulty stacks).

@wittman: when you say "by 41ms", what's the scale? What's the timeline? i.e. median time-to-first-paint?

@brucedawson/thakis as Windows/clang experts : ideas on why this CL would affect the performance of Traits::New()?

+dcheng FYI: look at all the Singletons initialized on startup =S!

Comment 4 by thakis@chromium.org, Jan 22 2018

To save others the context-building: The change at hand is https://chromium-review.googlesource.com/c/chromium/src/+/868013 , the related diffs are mostly 

https://chromium-review.googlesource.com/c/chromium/src/+/868013/7/base/memory/singleton.h
https://chromium-review.googlesource.com/c/chromium/src/+/868013/7/base/lazy_instance_helpers.h

From a cursory glance it looks like GetOrCreateLazyPointer() (which replaces the old Singleton code) does a call to out-of-line NeedsLazyInstance() while the Singleton happy-path code was all without jumps to other functions. If that reading is correct, I bet that's the cause.

Comment 5 by gab@chromium.org, Jan 22 2018

@thakis : that was my intuition too but as I stated in #3, that doesn't appear to be the case from the stacks (the time spent in the regression diff is in Traits::New() -- which is still inline in GetOrCreateLazyPointer() -- it's not in GetOrCreateLazyPointer() itself nor in Needs/CompleteLazyInstance() methods).

Comment 6 by gab@chromium.org, Jan 22 2018

I'm building official builds w/ and w/o this change to see if it affected code size (because CreatorFunc is templated in GetOrCreateLazyPointer()). Perhaps that would have thrown off code locality...

Seems like a 0.5ms regression per call is way too high for being caused by one more (not even virtual) method on the stack.

Comment 7 by thakis@chromium.org, Jan 22 2018

If it's during startup, the call can cause the executable to be paged in in a different order, so I can imagine it. I don't know how reliable the stacks are, so I'm not sure if I should believe them 100% :-)

Maybe we just need to update the order files (https://chromium.googlesource.com/chromium/src/+/master/docs/win_order_files.md)

Comment 8 by gab@chromium.org, Jan 22 2018

Labels: Needs-Feedback
AFAIK we page-in all of chrome.dll on startup before doing anything, so it'd be more about code locality than page faults I think.

Looks like real-world startup (multiple seconds, not ms) is not really affected though : https://uma.googleplex.com/p/chrome/timeline_v2/?sid=2d382b85b2fbe17f9b9ddccd317fec60

Will wait for wittman@ to let us know what to read from this.

Comment 9 by thakis@chromium.org, Jan 22 2018

I do know that we had uma-level chrome startup regressions with clang builds before we did the order files, so they do...something. I'd be somewhat surprised if it wasn't related to paging in, but I don't know the details of Windows's loader, so I wouldn't be too surprised if I were to be surprised here :-P

Comment 10 by gab@chromium.org, Jan 22 2018

So, here's a code size experiment:

Without mixed impls (before this CL):
 43,219,456 chrome.dll
 63,614,976 chrome_child.dll
    445,952 chrome_elf.dll
    538,624 chrome_watcher.dll
	
With mixed impls (r529885):
  43,228,672 chrome.dll
  63,616,000 chrome_child.dll
     445,952 chrome_elf.dll
     538,624 chrome_watcher.dll

So we get a significant amount more code (~9KB). I'm thinking this is the template specialization.

@fdoray: You added the CreatorFunc template param, if you want to figure out a clever way to un-templatize this, it's all yours!
@gab: 41ms is the additional time taken due to the new invocations of base::DefaultSingletonTraits<>::New() during the first 30 seconds of browser thread UI startup, almost all of which are pre-message loop start. This increase doesn't appear to have been offset by decreases elsewhere.

The issue may also be affecting singletons in other threads and processes where we don't have profiler coverage, so the actual impact is likely somewhat higher than 41ms.

@thakis: The profiles are quite reliable as an indicator of average execution time. :) We've validated them against timeline data where they're covering the same execution and they match closely to the 99% trimmed mean.

For this case though I don't think we should realistically expect to discern an impact in higher-level metrics. The noise floor on those is simply way too high to reliably detect an effect of this size.
Labels: -Needs-Feedback
I took a look at the distribution of stack samples within the base::DefaultSingletonTraits<HistoryServiceFactory>::New() call (https://uma.googleplex.com/p/chrome/callstacks?sid=d18c27009bd57ad848e5aee293e453ae).

From this it looks like:
1. The average execution time is almost entirely due to the contribution of a subset of startups seeing small amounts of execution time in this function.
2. The end of the long tail (0.01%) of startups are seeing >100ms of execution time in this function.

(2) likely indicates the effects of paging in pathological cases (e.g. over a network drive).

(1) is a weak signal due to the 10Hz sampling resolution, but could be due to paging being experienced by a fraction of startups getting averaged over all of them.
Additional observations:

98% of the sampled stacks in base::DefaultSingletonTraits<>::New() are at the start of the function. The remaining 2% are at the new Type() line.

The locality of these functions in the executable looks pretty terrible. Of 150 instantiations, 144 are on different 4k pages spread throughout. (See attached plot.) The calling code is consistently on the same page.
defaultsingletontraits-locality.png
34.9 KB View Download

Comment 14 by gab@chromium.org, Jan 24 2018

Wow, thanks for catching this.

I played with the template logic and recovered 3X what the blamed CL had induced in code bloat :) : https://chromium-review.googlesource.com/c/chromium/src/+/883364

Eager to see how that fares in the wild :)

Comment 15 by gab@chromium.org, Jan 24 2018

Status: Started (was: Assigned)
Project Member

Comment 16 by bugdroid1@chromium.org, Jan 26 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d55aad5050e41df78ef55a3746f26967da79ebf9

commit d55aad5050e41df78ef55a3746f26967da79ebf9
Author: Gabriel Charette <gab@chromium.org>
Date: Fri Jan 26 18:07:51 2018

Tweak GetOrCreateLazyPointer's template logic to enable more code folding.

This CL saves 27KB in chrome.dll and 5KB in chrome_child.dll.

https://chromium-review.googlesource.com/868013 had regressed size by
9KB and was caught as a performance regression. This CL should recover
this 3X.

Also removed an extra atomic operation in the simple case (used to
atomic-load for check and atomic-load again for return, now only
atomic-loads once).

Also moved GetOrCreateLazyPointer() to a subtle:: namespace (now that
calling it is tricky), also resolving recurring confusion about whether
Needs/CompleteLazyInstance() are invoked outside of it (they're
not, they're strictly helpers).

R=fdoray@chromium.org, thakis@chromium.org

Bug:  804034 
Change-Id: I8de94fe742a8f9a68d4a66e0202a6fb276843af9
Reviewed-on: https://chromium-review.googlesource.com/883364
Reviewed-by: François Doray <fdoray@chromium.org>
Reviewed-by: Nico Weber <thakis@chromium.org>
Commit-Queue: Gabriel Charette <gab@chromium.org>
Cr-Commit-Position: refs/heads/master@{#532013}
[modify] https://crrev.com/d55aad5050e41df78ef55a3746f26967da79ebf9/base/lazy_instance.h
[modify] https://crrev.com/d55aad5050e41df78ef55a3746f26967da79ebf9/base/lazy_instance_helpers.h
[modify] https://crrev.com/d55aad5050e41df78ef55a3746f26967da79ebf9/base/memory/singleton.h
[modify] https://crrev.com/d55aad5050e41df78ef55a3746f26967da79ebf9/base/task_scheduler/lazy_task_runner.cc
[modify] https://crrev.com/d55aad5050e41df78ef55a3746f26967da79ebf9/base/task_scheduler/lazy_task_runner.h

Comment 17 by gab@chromium.org, Jan 28 2018

@wittman: this has now shipped in 66.0.3333.0. Will some automated system notify this bug of the delta once processed?

Thanks!
Cc: isherman@chromium.org
Eventually yes. We're currently working on automating the process. +isherman@ as fyi for the posting-on-regression-fix automation case.

The 66.0.3333.0 profile is now available and shows this change produced a 70ms improvement in startup time:
https://uma.googleplex.com/p/chrome/callstacks?sid=6ed6676d97b2f12fa413ec48893d4d0d

Comment 19 by gab@chromium.org, Jan 29 2018

Status: Fixed (was: Started)
Wooohooo! A 1% startup gain :)!

Not 3X but still recovered more than we'd lost (guessing because while this change helps all LazyInstances -- which there are more than Singletons -- Singletons are more on the critical startup path whereas LazyInstances come in all over the place later).

Sign in to add a comment