Assess performance impact of WeakReference::Flag::flag_ switch to Atomic32 from bool |
||||
Issue descriptionAs part of implementing WeakPtr::MaybeValid() to support TaskScheduler optimization, the WeakReference::Flag::flag_ was switched from a bool to an Atomic32, to allow thread-safe access. Using Atomic32 imposes constraints on the query & clear operations, impacting the processor's ability to pipeline & reorder them. These overheads will affect performance not only for the MaybeValid() API, but for IsValid() and Invalidate(), even though those APIs should always be called on the same sequence. We could potentially revert to using a bool rather than an atomic, since: 1. Flag::flag_ will only ever transition from true to false, never back to true. 2. MaybeValid() is permitted to return true even if the Flag was invalidated on another thread, since the check is inherently racey in that case. We'd expect that reverting to a bool will impact effectiveness of MaybeValid() optimization in TaskScheduler, since MaybyValid() may continue to return true for some arbitrary time[1] after the WeakPtr was actually invalidated. [1] Basically until the affected cache line is flushed, or a memory barrier on it is reached.
,
Oct 3
Re #1: This bug was filed for us to review the performance difference between the bool and AtomicFlag implementations of Flag, given tapted@'s observation that it _does_ affect performance of his new ObserverList perf-tests. You're right, I should have left out speculation as to possible workarounds. :) The point is that the current implementation, using AtomicFlag, is expressing stronger requirements (the side-effects visibility you refer to above) in its use of intrinsics than we actually have for IsValid(), so we should assess impact and consider a specialized implementation that removes that overhead.
,
Oct 3
Well, I don't know, I'm not convinced that this overhead is relevant in the grand scheme of things. i.e. gaining 2X in the nanoseconds space will probably not result in user visible gains. Being able to cancel tasks across threads more important IMO. And while memory_order_relaxed may make sense for Callback::MaybeValid(), I'm not sure it always does for WeakPtr::MaybeValid() (i.e. like AtomicFlag it's not impossible that a thread would make conclusions about other memory's state based on seeing invalidity.. I guess we could document not to do that).
,
Oct 8
Drive-by here. If there is an observable perf impact by someone, then it probably warrants some amount of analysis. Also, given how low this particular library function is, I'm actually leery of ignoring perf concerns. @tapted: do you mind posting some more info on your perf observations? Looking around online, I think *think* it's safe to use memory_order_relaxed on the set, and only force an ordering operation on the read. Another thought is, why are we using an Atomic32 as opposed to one sized to the natural machine word? Is it a space/perf tradeoff?
,
Oct 8
There is some context at https://bugs.chromium.org/p/chromium/issues/detail?id=888973#c7 and an observer_list_perftest.cc in base_perftests that landed in r596492 . (It has two categories of tests; one using ObserverList<CheckedObserver>, and one using ObserverList<T>::Unchecked). https://chromium-review.googlesource.com/c/chromium/src/+/1256483 is a change that switches the atomic in WeakPtr to a regular bool patching that in should almost halve the runtime of the CheckedObserver test in base_perftests for a Release build (i.e. no DCHECKs). Specifically, ~7ns of the 13ns overhead seems to come from this atomic.
,
Oct 8
Gotcha. Hmm... is there an easy way to test the perf with and std::atomic<uint32_t> with the memory_order_relaxed on the invalidate? Could also try it with std::atomic<uint_fast32_t> (which will probably be 64-bit on x64). An unprotected bool being read across 2 threads is unfortunately pretty clearly undefined behavior. As long as we're expecting to set on one thread and read on another, I think we definitely need an atomic.
,
Oct 9
std::atomic<uint_fast32_t> seems to work great.. Indistinguishable from bool. (assuming I'm using it right) (Note this isn't the machine that I measured on earlier, but a z620 which gave more variance. And of course the perf test is only testing single-threaded performance.) base::AtomicFlag *RESULT ObserverListPerfTest_0.CheckedObserver: NotifyPerformance= 1.07837 ns/observe *RESULT ObserverListPerfTest_1.CheckedObserver: NotifyPerformance= 11.38143 ns/observe *RESULT ObserverListPerfTest_2.CheckedObserver: NotifyPerformance= 12.208300122083001 ns/observe *RESULT ObserverListPerfTest_4.CheckedObserver: NotifyPerformance= 12.71018 ns/observe *RESULT ObserverListPerfTest_8.CheckedObserver: NotifyPerformance= 13.206940132069402 ns/observe *RESULT ObserverListPerfTest_16.CheckedObserver: NotifyPerformance= 16.704772672763628 ns/observe *RESULT ObserverListPerfTest_32.CheckedObserver: NotifyPerformance= 14.498880144988801 ns/observe *RESULT ObserverListPerfTest_64.CheckedObserver: NotifyPerformance= 14.28565499997925 ns/observe *RESULT ObserverListPerfTest_128.CheckedObserver: NotifyPerformance= 14.272784700968241 ns/observe bool -> https://chromium-review.googlesource.com/c/chromium/src/+/1256483/1 *RESULT ObserverListPerfTest_0.CheckedObserver: NotifyPerformance= 1.09659 ns/observe *RESULT ObserverListPerfTest_1.CheckedObserver: NotifyPerformance= 8.26794 ns/observe *RESULT ObserverListPerfTest_2.CheckedObserver: NotifyPerformance= 8.5807400858074 ns/observe *RESULT ObserverListPerfTest_4.CheckedObserver: NotifyPerformance= 8.91779 ns/observe *RESULT ObserverListPerfTest_8.CheckedObserver: NotifyPerformance= 9.164380091643801 ns/observe *RESULT ObserverListPerfTest_16.CheckedObserver: NotifyPerformance= 10.168401626944261 ns/observe *RESULT ObserverListPerfTest_32.CheckedObserver: NotifyPerformance= 9.904310099043101 ns/observe *RESULT ObserverListPerfTest_64.CheckedObserver: NotifyPerformance= 9.653553378743682 ns/observe *RESULT ObserverListPerfTest_128.CheckedObserver: NotifyPerformance= 9.561449848293345 ns/observe std::atomic --> https://chromium-review.googlesource.com/c/chromium/src/+/1256483/3 *RESULT ObserverListPerfTest_0.CheckedObserver: NotifyPerformance= 1.07829 ns/observe *RESULT ObserverListPerfTest_1.CheckedObserver: NotifyPerformance= 8.06526 ns/observe *RESULT ObserverListPerfTest_2.CheckedObserver: NotifyPerformance= 8.5054300850543 ns/observe *RESULT ObserverListPerfTest_4.CheckedObserver: NotifyPerformance= 8.79823 ns/observe *RESULT ObserverListPerfTest_8.CheckedObserver: NotifyPerformance= 9.038330090383301 ns/observe *RESULT ObserverListPerfTest_16.CheckedObserver: NotifyPerformance= 9.636551541848247 ns/observe *RESULT ObserverListPerfTest_32.CheckedObserver: NotifyPerformance= 9.661830096618301 ns/observe *RESULT ObserverListPerfTest_64.CheckedObserver: NotifyPerformance= 9.576283351699173 ns/observe *RESULT ObserverListPerfTest_128.CheckedObserver: NotifyPerformance= 9.40713968935388 ns/observe
,
Oct 9
Thanks! You're using the atomic "correctly" for implementing a relaxed-store, sequential-read atomic. Unfortunately part of the discussion we've been having in #2, #3, and #4 is whether or not relaxed store is good enough. That's fascinating to see the cost is mostly in the store since the load() on your atomic is still using std::memory_order_seq_cst (the default), which IIRC on x86/x64 platforms is equiv to AcquireLoad() is doing since I don't think the x86 memory model has acquire/release semantics. This raises one more question: would this still be as fast if we use std::atomic<int32_t>? If so, then we know the slow-down is from the memory ordering and not from making the field word sized. I'm 98% certain that's actually true, but better to test...
,
Oct 11
removing std::memory_order_relaxed on the store doesn't affect this perf -- WeakReference::Flag::Invalidate() is only called rarely I think (and possibly not at all in this test). So... we could probably use something more expensive there.
std::atomic<int32_t> also doesn't measurably regress the performance. nor does std::atomic<int8_t>, (nor std::atomic<bool>).
I thought "Perhaps std::atomic just allows clang to use better compiler intrinsics?" but it looks like this machine (linux) is using atomicops_internals_portable.h, which casts the address to `volatile std::atomic<Atomic32>*`.
And, sure enough changing the declaration from
std::atomic<uint_fast8_t> valid_{true};
to
volatile std::atomic<uint_fast8_t> valid_{true};
now regresses perf back to the Atomic32 version.
,
Oct 11
Thinking aloud.. is it possible that casting to `volatile` pointers is defeating any kind of "relaxation" we're attempting when choosing to call a `nobarrier` flavour over a `barrier` flavour of the functions in atomicops.h? (the standard provides some guarantees for using volatile in signal handlers too, maybe that's affecting loads as well as stores). Also we're getting into platform-specifics.. #c7,#c9 are linux. https://crbug.com/888973#c7 is mac. Haven't tested Windows.
,
Oct 11
I toyed with shredding `volatile` from the atomicops headers. That alone didn't seem to have impact. But that *and* inlining AtomicFlag::IsSet() in the header took performance pretty close. see https://chromium-review.googlesource.com/c/chromium/src/+/1278405 (note _just_ inlining AtomicFlag::IsSet() also did not have impact). Shredding `volatile` is probably the right fix.. or just deprecating atomicops.h in favour of std::atomic. So with that in mind, I think a good action to take here is to migrate AtomicFlag to std::atomic. --> https://chromium-review.googlesource.com/c/chromium/src/+/1278425 The reinterpret-casting from int32_t to std::atomic in atomicops_internals_portable.h is also a little spooky. I guess we already compile with -fno-strict-aliasing. But it's possible `aliasing + volatile` is confounding the compiler optimizations further. atomicops_internals_portable.h even says it's meant to be "transitional" -- is there already a plan to phase it out and replace with std::atomic?
,
Oct 12
First off, just need to say: Niiiiiiice analysis. Also, since you're doing so much of the work, assigning the bug to you so you have credit. Feel free to unassign or toss to someone else if that somehow becomes necessary. Due to this thread, I actually picked up https://bugs.chromium.org/p/chromium/issues/detail?id=559247 but it's gonna be weeks/months before I get to trying it. Agreed that the ultimate goal is to just use the c++11 atomics everywhere if possible. If you felt like jumping into that too, feel free to grab the ball. Otherwise it's on my (ever growing) queue of items to get to at some point. The use of volatile in combination with std::atomic<> is suspect and removing it does look completely right to me. Regarding compiler optimization, volatile is a different dimension of restrictions not related to memory fences and atomicity. If I just dump the generated code for AtomicFlag, I see no distinction between having the member variable marked volatile or not (https://godbolt.org/z/LsVtE_), so I'm betting it has to be interacting with the calling function code. All that being said, I like the change you've proposed. An going to tag gab on the review to see what he things.
,
Oct 12
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/d64bfc8e330fea105de74137e6e176763dc546a4 commit d64bfc8e330fea105de74137e6e176763dc546a4 Author: Trent Apted <tapted@chromium.org> Date: Fri Oct 12 06:34:45 2018 Migrate base::AtomicFlag to std::atomic atomicops.h uses `volatile` which inhibits some compiler optimisations without contributing to threading correctness. Define AtomicFlag::IsSet() in the header, rather than out-of-line. Without this, WeakPtr overheads in ObserverListPerfTest_*.CheckedObserver in base_perftests are up to 70% greater. Bug: 891536 Change-Id: Ibc733042bf0027738b49c70ccbbcefc4cde3289d Reviewed-on: https://chromium-review.googlesource.com/c/1278425 Commit-Queue: Trent Apted <tapted@chromium.org> Reviewed-by: Albert J. Wong <ajwong@chromium.org> Reviewed-by: Gabriel Charette <gab@chromium.org> Cr-Commit-Position: refs/heads/master@{#599118} [modify] https://crrev.com/d64bfc8e330fea105de74137e6e176763dc546a4/base/synchronization/atomic_flag.cc [modify] https://crrev.com/d64bfc8e330fea105de74137e6e176763dc546a4/base/synchronization/atomic_flag.h
,
Oct 12
Patch-set from comment #13 looks to be a substantial improvement for Intel! Have you run those perf-tests on ARM at all?
,
Oct 15
On a Nexus 7, seems to shave off 14 ~ 25 nanoseconds. 10%ish. baseline (Unchecked) 0 43.46 1 71.557 2 86.697 4 77.233 8 77.719 16 74.340 32 73.239 64 74.047 128 74.124 before (Checked) 0. 38.45 1. 204.45 2. 246.92 4. 279.80 8. 300.29 16. 314.56 32. 323.02 64. 324.93 128. 328.38 after (Checked+WeakPtr::IsValid inlined). 0. 40.79 1. 190.16 2. 260.24 4. 273.30 8. 288.18 16. 296.60 32. 296.95 64. 298.60 128. 308.66 |
||||
►
Sign in to add a comment |
||||
Comment 1 by gab@chromium.org
, Oct 3