ScopedBlockingCall spinning indefinitely on multiple threads in LazyInstance::Get
Issue description

Chrome Version: 63.0.3239.84
OS: Windows Server 2012 R2 Datacenter

I have an ETW trace from a customer that shows Chrome consuming 100% of CPU time on a four-core machine, indefinitely. The trace shows that eight different threads in the browser process are all spinning continuously on this stack:

|- ntdll.dll!RtlUserThreadStart
|  kernel32.dll!BaseThreadInitThunk
|  chrome.dll!base::`anonymous namespace'::ThreadFunc
|  chrome.dll!base::internal::SchedulerWorker::Thread::ThreadMain
|  chrome.dll!base::internal::SchedulerWorker::Delegate::WaitForWork
|  chrome.dll!base::WaitableEvent::TimedWait
|  chrome.dll!base::`anonymous namespace'::WaitUntil
|  chrome.dll!base::ScopedBlockingCall::ScopedBlockingCall
|  chrome.dll!base::LazyInstance<base::ThreadLocalPointer<base::ScopedBlockingCall>, ...>>::Get
|  KernelBase.dll!Sleep

All of the samples are in the ::Get function or farther down. This looks extremely serious and I believe it renders Chrome unusable for these customers.
,
Dec 22 2017
I think that we need to initialize all of these worker-thread TLS variables in the main thread so that they don't race on allocating the initial TLS structures.
I hit the initialization loop here:
> base.dll!base::internal::NeedsLazyInstance(int * state) Line 32 C++
base.dll!base::internal::GetOrCreateLazyPointer<`lambda at ../..\base/lazy_instance.h:195:9'>(int * state, const base::LazyInstance<base::ThreadLocalBoolean,base::internal::LeakyLazyInstanceTraits<base::ThreadLocalBoolean> >::Pointer::<unnamed-tag> & creator_func, void(*)(void *) destructor, void * destructor_arg) Line 154 C++
base.dll!base::LazyInstance<base::ThreadLocalBoolean,base::internal::LeakyLazyInstanceTraits<base::ThreadLocalBoolean> >::Pointer() Line 193 C++
base.dll!base::LazyInstance<base::ThreadLocalBoolean,base::internal::LeakyLazyInstanceTraits<base::ThreadLocalBoolean> >::Get() Line 184 C++
base.dll!base::internal::AssertBaseSyncPrimitivesAllowed() Line 105 C++
base.dll!base::WaitableEvent::TimedWait(const base::TimeDelta & wait_delta) Line 111 C++
base.dll!base::internal::SchedulerWorker::Delegate::WaitForWork(base::WaitableEvent * wake_up_event) Line 171 C++
base.dll!base::internal::SchedulerWorker::Thread::ThreadMain() Line 55 C++
base.dll!base::`anonymous namespace'::ThreadFunc(void * params) Line 91 C++
which corresponds to this:
void AssertBaseSyncPrimitivesAllowed() {
  DCHECK(!g_base_sync_primitives_disallowed.Get().Get())
      << "Waiting on a //base sync primitive is not allowed on this thread to "
         "prevent jank and deadlock. If waiting on a //base sync primitive is "
         "unavoidable, do it within the scope of a "
         "ScopedAllowBaseSyncPrimitives. If in a test, "
         "use ScopedAllowBaseSyncPrimitivesForTesting.";
}
Maybe we can assert that the lazy initialization of these pointers is not done on worker threads, if we can't easily fix the lazy initialization to avoid busy waiting. In addition to g_base_sync_primitives_disallowed that would include tls_last_scoped_blocking_call and tls_blocking_observer and ???
,
Dec 22 2017
The assertion is a DCHECK. Is the customer running with DCHECKs enabled?
,
Dec 22 2017
Re #3: DCHECK() is guaranteed to reference the |condition| even in non-DCHECK builds; IIUC this means it must still be executed unless the compiler can verify that it has no side-effects?
,
Dec 22 2017
We rely on the dead-code analyzer in the optimizer to get rid of the code in the DCHECK |condition|:

#define EAT_STREAM_PARAMETERS \
  true ? (void)0              \
       : ::logging::LogMessageVoidify() & (*::logging::g_swallow_stream)

When !DCHECK_IS_ON():

#define DCHECK(condition) EAT_STREAM_PARAMETERS << !(condition)
That's not to say shards of the DCHECK haven't slipped through to production.
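To illustrate the mechanism in isolation, here is a minimal sketch with simplified stand-ins for the real //base/logging machinery (the names Voidify, g_swallow_stream, ExpensiveCheck are illustrative, not the real ones):

#include <ostream>

// Simplified stand-ins for the real LogMessageVoidify / g_swallow_stream.
struct Voidify {
  void operator&(std::ostream&) {}  // Lower precedence than <<, higher than ?:
};
std::ostream* g_swallow_stream = nullptr;  // Never dereferenced: the branch below is never taken.

#define EAT_STREAM_PARAMETERS \
  true ? (void)0 : Voidify() & (*g_swallow_stream)

// When !DCHECK_IS_ON(): |condition| stays syntactically referenced (so it must
// compile), but it sits in the never-executed arm of a constant-true ternary;
// the optimizer is then relied on to strip the generated code for it.
#define DCHECK(condition) EAT_STREAM_PARAMETERS << !(condition)

bool ExpensiveCheck() { return true; }

void Example() {
  DCHECK(ExpensiveCheck()) << "never evaluated at runtime in this build";
}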
,
Dec 22 2017
> Most of the busy threads are priority 8. I can't tell if there is a low-priority thread that never gets to run. Doesn't Windows do random priority boosting of lower-priority threads to prevent this sort of thing? https://msdn.microsoft.com/en-us/library/windows/desktop/ms684831(v=vs.85).aspx
,
Dec 22 2017
Thanks for the great research so far, Bruce! Some digging through the Internets shows that a key difference between client and server editions is that the default scheduling scheme on server is to optimize for background tasks. Could this be hurting our chances of untying the deadlock knot? https://blogs.msdn.microsoft.com/larryosterman/2009/01/08/why-do-people-think-that-a-server-sku-works-well-as-a-general-purpose-operating-system/ https://blogs.technet.microsoft.com/askperf/2014/01/23/the-case-of-above-normal-priority/
,
Dec 27 2017
Regarding comments 3-5: The g_base_sync_primitives_disallowed loop was on a local debug build. I have no reason to believe it is a customer-facing issue. It still seems sloppy to have a busy loop spinning forever - I've seen busy loops go bad too many times - but it is not immediately correlated with the customer issue.

Regarding comment 6: The OS does do some priority boosting of starved threads, but there are some fairly tight constraints on how much it will boost priorities, so there is no guarantee that this will work. We certainly don't want to depend on this as a general-purpose way of breaking out of CPU spins because it can easily take several scheduling quanta before the priority adjusts. If we're willing to wait that long then we might as well do a Sleep(1) and break out of the loop even faster (and with less waiting). There was a bug a few months ago where some terminal server customers were frequently hitting a priority inversion: two audio threads (real-time priority) were trying to acquire a PartitionAlloc lock, which is a pure spin lock, and if it was ever held by a normal-priority thread then (on two-core systems) the spin would run forever. TL;DR - Sleep(0) is not a synchronization primitive.

Regarding comment 7: The scheduling differences could be causing this to happen more frequently. I still don't have a mental model for how it can happen, but the indefinite spinning is definitely happening.

The more I think about this the more I think that the best solution is to identify these problematic LazyInstance objects and ensure that they are initialized in the main thread before the workers get started. The alternative of fixing LazyInstance objects so they don't spin-wait is tempting but likely to have unintended consequences.
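To isolate the Sleep(0) point above, a minimal illustrative sketch (not code from Chromium, just the raw Win32 hazard):

#include <windows.h>

#include <atomic>

// Sleep(0) only yields to ready threads of equal or higher priority, so on a
// saturated machine the lower-priority initializer below may never be
// scheduled and the waiter spins forever at 100% CPU.
std::atomic<bool> g_initialized{false};

void LowPriorityInitializer() {
  ::SetThreadPriority(::GetCurrentThread(), THREAD_PRIORITY_LOWEST);
  // ...one-time initialization work...
  g_initialized.store(true, std::memory_order_release);
}

void NormalPriorityWaiter() {
  while (!g_initialized.load(std::memory_order_acquire))
    ::Sleep(0);  // Does not let the lower-priority initializer run.
}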
,
Dec 28 2017
I received another customer trace which shows more unlimited spinning in LazyInstance::Get, but on different objects and call stacks. Ten threads are spinning on this call stack:

|- ntdll.dll!RtlUserThreadStart
|  kernel32.dll!BaseThreadInitThunk
|  chrome.dll!base::`anonymous namespace'::ThreadFunc
|  chrome.dll!base::internal::SchedulerWorker::Thread::ThreadMain
|  chrome.dll!base::internal::SchedulerWorkerPoolImpl::SchedulerWorkerDelegateImpl::OnMainEntry
|  chrome.dll!base::LazyInstance<base::ThreadLocalPointer<base::internal::SchedulerWorkerPool const >,base::internal::LeakyLazyInstanceTraits<base::ThreadLocalPointer<base::internal::SchedulerWorkerPool const > > >::Get
|  |- KernelBase.dll!SleepEx

It looks like OnMainEntry() calls outer_->BindToCurrentThread() which calls tls_current_worker_pool.Get().Set(this); which triggers the issue.

Meanwhile another thread is spinning here:

|- ntdll.dll!RtlUserThreadStart
|  kernel32.dll!BaseThreadInitThunk
|  chrome.dll!base::`anonymous namespace'::ThreadFunc
|  chrome.dll!base::Thread::ThreadMain
|  chrome.dll!content::BrowserThreadImpl::Run
|  chrome.dll!content::BrowserThreadImpl::IOThreadRun
|  chrome.dll!base::Thread::Run
|  chrome.dll!base::MessagePumpWin::Run
|  chrome.dll!base::MessagePumpForIO::DoRunLoop
|  chrome.dll!base::MessageLoop::DoWork
|  chrome.dll!base::MessageLoop::DeferOrRunPendingTask
|  chrome.dll!base::MessageLoop::RunTask
|  chrome.dll!base::debug::TaskAnnotator::RunTask
|  chrome.dll!base::DeletePointer<google_apis::BatchableDelegate>
|  chrome.dll!ProfileImplIOData::`scalar deleting destructor'
|  chrome.dll!ProfileIOData::~ProfileIOData
|  chrome.dll!content::NetworkContext::`scalar deleting destructor'
|  chrome.dll!content::NetworkContext::~NetworkContext
|  chrome.dll!net::`anonymous namespace'::ContainerURLRequestContext::`scalar deleting destructor'
|  chrome.dll!net::`anonymous namespace'::ContainerURLRequestContext::~ContainerURLRequestContext
|  chrome.dll!net::URLRequestContextStorage::~URLRequestContextStorage
|  chrome.dll!net::ProxyService::`scalar deleting destructor'
|  chrome.dll!net::ProxyService::~ProxyService
|  chrome.dll!net::DhcpProxyScriptFetcherWin::`scalar deleting destructor'
|  chrome.dll!net::DhcpProxyScriptFetcherWin::~DhcpProxyScriptFetcherWin
|  chrome.dll!base::SequencedWorkerPool::OnDestruct
|  chrome.dll!base::SequencedWorkerPool::Inner::RunsTasksOnCurrentThread
|  chrome.dll!base::internal::SchedulerParallelTaskRunner::RunsTasksInCurrentSequence
|  chrome.dll!base::LazyInstance<base::ThreadLocalPointer<base::internal::SchedulerWorkerPool const >,base::internal::LeakyLazyInstanceTraits<base::ThreadLocalPointer<base::internal::SchedulerWorkerPool const > > >::Get

It is presumably trying to initialize the same object. Note, however, that it is talking to the scheduler in an object's destructor. Presumably (I don't know for sure) this indicates that this is not the first time that the SequencedWorkerPool has been used, which makes the spinning more mysterious. Or, at least, it suggests that the LazyInstance pointer is in a genuinely bad state rather than just hitting a priority inversion during initialization. Does this make sense?

Addendum to my comment on thread priorities and different scheduler behavior on Windows Server: I just realized that all of the threads are maintaining the same thread priority over many seconds and thousands of context switches. This suggests that the server SKU doesn't dynamically adjust priorities - developers are assumed to be careful. This makes spin locks more dangerous than ever.
What code in this area has changed for M63?
,
Jan 2 2018
Could someone from the cc'ed devs please take a look into this issue, as per comment #9? It is marked as a beta blocker. Thanks!
,
Jan 2 2018
I think it is safe to make this a stable blocker instead of a beta blocker. Thanks!
,
Jan 4 2018
Re the destructor point in comment #9: If the LazyInstance isn't yet initialized, because everyone working with it is spinning, then presumably we could also end up spinning in the BrowserThreadImpl stack that you mention? But it is also possible for the BrowserThreadImpl to have reached the point of trying to check whether the SchedulerWorkerPool is bound to the current thread _before_ any scheduler worker thread has actually reached its ThreadMain() and bound itself, I think, if the worker pool threads take a while to be scheduled, e.g. if the system is under high load?
,
Jan 4 2018
We're hitting this bug across our hosted RDS server estate and there doesn't appear to be a pattern regarding the load on the server at the time. It only appears to be affecting Chrome on RDS servers (a mixture of 2008R2/2012R2/2016) although we may not be noticing it locally.
,
Jan 4 2018
(back! and catching up)

@bruce: are you able to tell thread priorities from the traces? Or are we speculating? If not, we could add a trick similar to BrowserThreadImpl::FooThreadRun(), which forces a stack frame that does nothing else but easily highlight the properties of a worker thread in a minidump.

As far as #9 is concerned: priority inversion is a possibility indeed. SchedulerWorkerPool::BindToCurrentThread() is the first code to touch the LazyInstance and it's possible it'd first be hit from a background thread (that's a problem!)

Re. destructors on the stack: it could merely mean that the user tried to close Chrome before grabbing this stack and the IO thread is hung trying to exit (it happens to be touching the LazyInstance in what should be a very quick call to confirm it's not on a worker thread).

All of this pretty much comes down to the TaskScheduler's assumption that Windows can deal well with resolving priority inversions... On OSes which can deal with priority inversions, TaskScheduler uses actual thread priorities; on other OSes it merely makes an effort to push BACKGROUND tasks out of the way, but when they do run, they do so at normal thread priority. The bit that flips this on and off is currently encoded in Lock::HandlesMultipleThreadPriorities(). We can try setting this to false on OS_WIN and see whether it resolves this issue.

As far as M63 is concerned: the main difference IMO is that M63 is the milestone where the migration to TaskScheduler was completed, increasing the likelihood of such races with the increased workload.

Another thought: maybe Windows is actually good at handling priority inversion, but since atomic-acquire + yield isn't seen as an operation that is blocking on another thread, the thread in charge of resolving the issue isn't being promoted... (Lock + CV sounds like it could work better for this; only locking when the atomic compare fails so it's a one-time init cost only when there's an actual init race... better than a spin lock..?)

I can take a stab at the LazyInstance tweak. How can we verify that it works? @bruce: do you have customers with a regular repro?

If this doesn't pan out then I'm afraid we'll have to consider OS_WIN as yet another platform that poorly supports priority inversions (POSIX has this great PTHREAD_PRIO_INHERIT flag to avoid specifically this issue but it's broken on all platforms but Mac at the moment... sigh!).

@fdoray: that makes your work to better support BACKGROUND tasks on platforms without thread priority support even more important!
,
Jan 4 2018
The ETW trace shows the priorities for all of the threads that are scheduled during the trace. A priority inversion implies a low-priority thread that was never scheduled and therefore doesn't show up in the trace. So, a priority inversion is not proven or disproven.

Windows has a couple of ways of dealing with priority inversions, neither perfect. One is that when an event is signaled, the thread that is woken up is given a brief priority boost, to increase the odds that it gets to run. The other is that if a thread runs for a long time then its priority decays, and if a thread doesn't run for a while then its priority is boosted (IIRC). However the amount of the decay/boost is quite limited, so this will not avoid inversions between threads with widely varying priorities. And, it looks like this second mechanism does not apply on server SKUs.

> TaskScheduler's assumption that Windows can deal well with resolving priority inversions

Why do we have this assumption? Why can't we write code that is immune to priority inversions? Even if Windows can handle priority inversions it will often only do it after tens of ms of busy waiting, which is horrible.

I think you misunderstood my comment about destructors on the stack. My understanding (correct me if I'm wrong) is that the spinning happens on the first allocation of a LazyInstance pointer and it will only happen once per pointer - not once-per-thread-per-pointer. Is that correct? Further, my assumption is that a destructor cannot be the first task scheduled, and therefore we should already have been past the spinning-possibility stage.

Understanding the problem is crucial for making the right fix:
a) If it's a priority inversion then switching to Sleep(1) will fix it.
b) If it's an infinite loop caused by a logic bug then we need to fix that.

pastarmovj@ can arrange access to test machines.
,
Jan 4 2018
>> TaskScheduler's assumption that Windows can deal well with resolving priority inversions
> Why do we have this assumption?

We made this assumption where we thought it was correct, as it allowed us to use background-priority threads for TaskPriority::BACKGROUND tasks, letting the kernel move them out of the way if they were running when something more important came in (otherwise we can only attempt not to run too many of them, but the kernel is unaware of their priority while running).

> Why can't we write code that is immune to priority inversions? Even if Windows can handle priority inversions it will often only do it after tens of ms of busy waiting, which is horrible.

We definitely try as hard as we can to not unnecessarily depend on priority inversion resolving, but one place where we have no choice is the PostTask()/GetWork() logic. At some point a normal-priority thread will want to post background work which will have to be retrieved by a background thread. This obviously involves a lock and therefore has potential for priority inversions. We hadn't thought about LazyInstance/Singleton's use of a sometimes-contended spin lock for initialization.

> my assumption is that a destructor cannot be the first task scheduled, and therefore we should already have been past the spinning-possibility stage.

That's what I was trying to articulate: it *can*. Because this is the IO thread, it has nothing to do with the SchedulerWorkerPool, so it may very well have executed many tasks before this point. It just happens to be touching the SchedulerWorkerPool's LazyInstance now because deep in the stack something wants to know whether it's running on a worker thread (we know it is not). (This is indeed once-per-pointer, not once-per-thread-per-pointer; the "per thread" bit here is merely the fact that this is a LazyInstance<ThreadLocalPointer>, but that's irrelevant to the reason the LazyInstance getter is hanging.)

Priority inversion sounds likely because ThreadLocalPointer should otherwise be a very fast constructor and the only reason to hang in the constructor IMO is to be descheduled while in it...
,
Jan 4 2018
@bruce: are you positive that Windows doesn't consider wait-chains for priority boosts? I assumed it did (since it is aware of wait chains for other things AFAIU and I don't see why it wouldn't take advantage of it for this...) and was going to find a way to have a lock+CV here to wait on creation but if you don't think Windows will take advantage of this wait chain, then maybe Sleep(1) is just as good... Thanks, Gab
,
Jan 4 2018
> @bruce: are you positive that Windows doesn't consider wait-chains for priority boosts?

I'm not positive. But, before going that direction I think we need to be positive that Windows *does* consider wait-chains for priority boosts *on server SKUs*, and I suspect it does not.

What about this suggestion (from comment #2): "Maybe we can assert that the lazy initialization of these pointers is not done on worker threads, if we can't easily fix the lazy initialization to avoid busy waiting. In addition to g_base_sync_primitives_disallowed that would include tls_last_scoped_blocking_call and tls_blocking_observer and ???"

That is, if this problem can only happen on the first use of a lazy instance pointer, let's make sure that that first use is on the main thread. This fix is trivial, and it seems like it must work.
,
Jan 4 2018
You mean adding a DCHECK_NE(ThreadPriority::BACKGROUND, PlatformThread::GetCurrentThreadPriority()); in the creation part of GetOrCreateLazyPointer()? I guess the inversion can happen at other levels but BACKGROUND is the one we're seeing now so perhaps that's ok. This check also depends on the code running in dev builds and the race typically happening in a way that it is hit. It also prevents usage of LazyInstance strictly amongst BACKGROUND tasks... I assume the problematic races are only a problem on startup in practice where many of these LazyInstances/Singletons are racily initialized. Maybe the solution is to just hold a bit on the creation of the background threads (or Sleep(5s) at the top of their ThreadMain()...). WDYT?
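A rough sketch of where such a check could live (the helper name is hypothetical, not actual Chromium code; it assumes base's PlatformThread and logging APIs):

#include "base/logging.h"
#include "base/threading/platform_thread.h"

namespace base {
namespace internal {

// Hypothetical helper: call from the branch of GetOrCreateLazyPointer() that
// wins the creation race, so dev builds flag any LazyInstance that is first
// constructed on a background-priority thread (a priority-inversion hazard if
// foreground threads then spin waiting for the "created" state). This mirrors
// the DCHECK_NE suggested above.
inline void AssertNotCreatingLazyInstanceAtBackgroundPriority() {
  DCHECK(PlatformThread::GetCurrentThreadPriority() !=
         ThreadPriority::BACKGROUND)
      << "LazyInstance first constructed on a background-priority thread";
}

}  // namespace internal
}  // namespace base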
,
Jan 4 2018
I meant more like CHECK(NotOnWorkerThread()). But that would be the enforcement part. The actual fix would be to explicitly call GetOrCreateLazyPointer() on each of the lazy pointers that the worker threads use on the *main* thread. That is, make the lazy initialization not actually lazy.
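A minimal sketch of that warm-up idea (hypothetical names; a real change would touch the actual TLS LazyInstances listed in comment #2):

#include "base/lazy_instance.h"
#include "base/threading/thread_local.h"

namespace {

// Example TLS slot of the kind discussed above (hypothetical name).
base::LazyInstance<base::ThreadLocalBoolean>::Leaky g_example_tls_flag =
    LAZY_INSTANCE_INITIALIZER;

}  // namespace

// Call on the main thread before any SchedulerWorker threads are created:
// Get() forces the one-time creation through GetOrCreateLazyPointer(), so
// worker threads that touch the same LazyInstance later can never race on
// (and spin waiting for) its construction.
void WarmUpSchedulerTlsOnMainThread() {
  g_example_tls_flag.Get();
}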
,
Jan 4 2018
Can we ask someone able to repro this hang to attach WinDbg and grab a dump so we can see what it is that is causing the LazyInstance<ThreadLocalPointer> construction to take so long? As regards fixes, I think something in LazyInstance that protects against priority inversion (e.g. via sleep(1ms) as Bruce suggested) would be preferable to a one-off fix for the SchedulerWorker.
,
Jan 4 2018
Alternate way of saving a crash dump: https://docs.microsoft.com/en-us/sysinternals/downloads/procdump
,
Jan 5 2018
The problem of initializing a LazyInstance pointer exactly once is identical to the problem of doing thread-safe initialization of a static local. The algorithms for implementing this C++11 feature were designed in large part by Mike Burrows of Google so I asked him for advice. Here is his response:
"""
Below is what I might have written, though it's untested,
so I can't guarantee it.
I might have suggested using a condition variable wait
for the wait, but that would be a little awkward
because you have no safe static initializer for mutexes
and condition variables.
I copied the atexit call, but I don't understand how you're using it.
It's not safe to destruct something at exit that another thread could be using;
you might crash or deadlock by touching the destructed object rather than exit.
Also, if you're exiting, freeing the memory is unnecessary as the whole
address space is about to be blown away.
But perhaps you're using that mechanism to flush I/O
streams rather than actually destructing objects.
template <typename CreatorFunc>
void* GetOrCreateLazyPointer(subtle::AtomicWord* state,
                             const CreatorFunc& creator_func,
                             void (*destructor)(void*),
                             void* destructor_arg) {
  subtle::AtomicWord value = subtle::Acquire_Load(state);
  if (value == 0 || value == kLazyInstanceStateCreating) {  // not yet done
    if (subtle::NoBarrier_CompareAndSwap(state, 0,
                                         kLazyInstanceStateCreating) == 0) {
      // This thread won the right to initialize.
      value = reinterpret_cast<subtle::AtomicWord>(creator_func());
      subtle::Release_Store(state, value);
      if (destructor != NULL) {
        AtExitManager::RegisterCallback(destructor, destructor_arg);
      }
    } else {  // This thread is waiting for another to initialize.
      int64_t i = 0;
      do {  // Spin a bit, then yield, then sleep a few ms at a time.
        if (i == 100) {
          PlatformThread::YieldCurrentThread();
        } else if (i > 110) {
          PlatformThread::Sleep(TimeDelta::FromMilliseconds(10));
        } else if (i > 100) {
          PlatformThread::Sleep(TimeDelta::FromMilliseconds(2));
        }
        value = subtle::Acquire_Load(state);
        i++;
      } while (value == 0 || value == kLazyInstanceStateCreating);
    }
  }
  return reinterpret_cast<void*>(value);
}
"""
There are a lot of subtleties in any lock-free algorithm which is why I'm including his entire function, but I will point out that his recommended solution to the busy-wait problem is to cap the maximum number of spins and then fall back to Sleep(1) and then Sleep(10). I'm not convinced that the Sleep(10) fallback is necessary, but falling back to Sleep(1) after "a while" should definitely solve this bug.
The maximum amount of busy waiting could also be time based - call YieldCurrentThread for a maximum of 1 ms for instance - rather than count based.
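A time-based variant could look roughly like this (a sketch under those assumptions, using base's TimeTicks and PlatformThread APIs; the function name and the atomic flag are illustrative, not the code that later landed):

#include "base/threading/platform_thread.h"
#include "base/time/time.h"

#include <atomic>

// Wait used by the threads that lost the creation race: spin/yield for at
// most ~1 ms, then fall back to 1 ms sleeps so a lower-priority creator
// thread gets CPU time even on schedulers that never boost starved threads.
void WaitForCreation(const std::atomic<bool>& created) {
  const base::TimeTicks start = base::TimeTicks::Now();
  while (!created.load(std::memory_order_acquire)) {
    if (base::TimeTicks::Now() - start < base::TimeDelta::FromMilliseconds(1)) {
      base::PlatformThread::YieldCurrentThread();
    } else {
      base::PlatformThread::Sleep(base::TimeDelta::FromMilliseconds(1));
    }
  }
}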
In a follow-up email he said:
"""
I might have used a mutex and condition variable, but
in Chrome the mutex and condition variable have no static initializers.
And that means you need the equivalent of pthread_once
(InitOnceExecuteOnce on Win32)
to initialize them, but you're trying to implement pthread_once, which
is why it gets tricky.
"""
So...
Using InitOnceExecuteOnce would be the most elegant (but non-portable) solution. Falling back to Sleep(1) after a ms or so of spinning is the simplest solution.
Make it so?
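For reference, the InitOnceExecuteOnce approach would look roughly like this (a sketch against the documented Win32 API; the instance and function names are placeholders, and this is not what was ultimately implemented):

#include <windows.h>

namespace {

// One-time state; INIT_ONCE_STATIC_INIT is all-zero, so no static initializer.
INIT_ONCE g_init_once = INIT_ONCE_STATIC_INIT;
void* g_instance = nullptr;

BOOL CALLBACK CreateInstance(PINIT_ONCE /*init_once*/,
                             PVOID /*parameter*/,
                             PVOID* /*context*/) {
  g_instance = new int(42);  // Stand-in for the real creator_func().
  return TRUE;               // Report success; waiting threads are released.
}

}  // namespace

void* GetInstance() {
  // Losing threads block on a kernel-managed wait here instead of spinning,
  // so a lower-priority creator thread can still make progress.
  InitOnceExecuteOnce(&g_init_once, CreateInstance, nullptr, nullptr);
  return g_instance;
}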
,
Jan 5 2018
Sounds good, I came to the same conclusion last night.

To use a lock + condition variable we need a Lock. But since we want LazyInstance to be trivially constructible, we need Lock to be trivially constructible (which it isn't). We could also have used a WaitableEvent, but they also aren't trivially constructible. Without trivial construction we incur a static initializer on each LazyInstance (bad!). We could also use a global Lock for all LazyInstances, but once again it can't be initialized in lazy_instance.cc without incurring a static initializer (the solution to this is usually to put it in a LazyInstance, but here we have a chicken-and-egg problem..!).

Sleep(1) after 1ms of spinning sounds reasonable. But... that still doesn't guarantee the background thread will be scheduled (other foreground threads could be preempting it too)... definitely better than spinning unnecessarily though, and perhaps sufficient to solve this issue in practice. I'd even be inclined to always Sleep(1) as these are one-time initializations and a few ms here and there on startup won't change a thing IMO.

Note: we should also update the same logic in singleton.cc (we should de-dupe these really...)
,
Jan 5 2018
All yours Bruce, thanks!
,
Jan 6 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/cdf5faebbb47d9041b74288f8206cfeb84333a27 commit cdf5faebbb47d9041b74288f8206cfeb84333a27 Author: Bruce Dawson <brucedawson@chromium.org> Date: Sat Jan 06 00:16:37 2018 Fall back to a non-busy wait in NeedsLazyInstance In some cases (servers SKUs of Windows when we get unlucky with the timing?) we can end up with eight Chrome worker threads all spinning madly in NeedsLazyInstance, waiting for another thread to finish initializing the object. If the other thread is running at lower priority then this busy waiting ensures that it never gets a chance to run. Falling back to sleeping for 1 ms at a time will avoid this CPU starvation. Bug: 797129 Change-Id: I0e0541c9b84f6b808f11669d83ce6b19b41bb249 Reviewed-on: https://chromium-review.googlesource.com/852441 Reviewed-by: Wez <wez@chromium.org> Reviewed-by: Gabriel Charette <gab@chromium.org> Commit-Queue: Bruce Dawson <brucedawson@chromium.org> Cr-Commit-Position: refs/heads/master@{#527445} [modify] https://crrev.com/cdf5faebbb47d9041b74288f8206cfeb84333a27/base/lazy_instance.cc
,
Jan 8 2018
The fix is in the canary builds. It would be helpful to know whether it fixes the bug. If it doesn't then the hang will still happen but it won't be a 100% CPU-usage hang and that will indicate that we have misunderstood this bug. If some of the reporters could try canary builds of Chrome that would be appreciated. If the fix works I can try to get it ported to M64 (currently it is in M65 only).
,
Jan 10 2018
Thanks Bruce. Regarding the discussion about the scheduler's assumption that occasional priority inversion will eventually get resolved on Windows (and hence accepting that trade-off, which permits the use of true background threads in the first place): do you think we should do something further on trunk? Too bad there isn't a way to trigger PTHREAD_PRIO_INHERIT-like behavior :(

A) Bump thread priority in Lock/LazyInstance/Singleton upon acquiring? (seems overkill for all instances which aren't shared between background/normal-priority threads...)
B) Have a Windows-specific scoped object which remembers the thread id of the owner and has other threads that come in later and block simulate PTHREAD_PRIO_INHERIT (i.e. bump the priority of the owning thread while waiting if it is lower than theirs). Use that in Lock, Singleton, and LazyInstance.
C) Don't use background-priority threads on Windows... :(

I personally prefer (B) and am willing to implement it if you think it makes sense.
,
Jan 10 2018
Also, I just realized that with constexpr we can do much better in most cases. LazyInstance was often used to avoid a static initializer in globals. Most of these types can be trivially default constructed and making the constructor constexpr avoids the need to even use a LazyInstance to avoid the static initializer :). This should be the preferred solution in C++11 to avoid static initializers. I'll update the LazyInstance documentation.
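A minimal illustration of that constexpr alternative (hypothetical type, not from the codebase):

// If the wrapped type's default constructor is constexpr, a plain global is
// constant-initialized and produces no static (dynamic) initializer, so no
// LazyInstance - and therefore no creation race - is needed at all.
class Counters {
 public:
  constexpr Counters() = default;
  void Increment() { ++count_; }
  int count() const { return count_; }

 private:
  int count_ = 0;
};

Counters g_counters;  // Constant-initialized before main(); no lazy creation path.

(As the revert later in this thread shows, MSVC at the time did not always honor constant initialization for such constructors, but that is the idea.)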
,
Jan 15 2018
I'm struggling to deploy to our hosted platforms and get some useful feedback given the difference between normal Chrome and the Canary build, but I've been experiencing the same issues on a Windows 10 client locally. Using M63 it appears to hit the bug reasonably reliably (especially if I open/close Chrome a few times); using Canary I've not seen the bug appear yet. This is a local (non-virtualised) machine rather than a server, so the bug also affects Windows client OSes, but it is using the same M63 Enterprise deployment via SCCM as the hosted platforms.
,
Jan 15 2018
I think we should strongly consider merging to M64 (getting very late for M63 I believe).
,
Jan 15 2018
This bug requires manual review: We are only 7 days from stable. Please contact the milestone owner if you have questions. Owners: cmasso@(Android), cmasso@(iOS), kbleicher@(ChromeOS), abdulsyed@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Jan 16 2018
Thanks Bruce and Gab! We are 1 week away from M64 stable. Why is this change critical for M64 vs waiting until M65? Seems like this fix is tested in Canary, but can you please confirm that this is a safe merge overall?
,
Jan 16 2018
I believe that the fix is quite safe - it is unlikely to cause problems. And, based on the feedback from yesterday it looks like it fixes the issue, so merging to M64 would get the fix to our enterprise customers one release earlier.
,
Jan 16 2018
Agreed with Bruce, r527445 is a very safe merge and the issue affects a large amount of customers. I am further proving the utility of this change in an upcoming test @ https://chromium-review.googlesource.com/c/chromium/src/+/866871 (no need to wait for it to land to merge).
,
Jan 16 2018
Approving merge for M64. Branch:3282
,
Jan 16 2018
We're not planning any further M63 releases. Please merge the change to M64 branch 3282 ASAP as it is approved at #39. Thank you.
,
Jan 17 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/238d8bdbc36b73a99fbf8d72c2e02abc1dbb278c commit 238d8bdbc36b73a99fbf8d72c2e02abc1dbb278c Author: Bruce Dawson <brucedawson@chromium.org> Date: Wed Jan 17 00:26:21 2018 Fall back to a non-busy wait in NeedsLazyInstance Merging to M64 per approval on bug. In some cases (servers SKUs of Windows when we get unlucky with the timing?) we can end up with eight Chrome worker threads all spinning madly in NeedsLazyInstance, waiting for another thread to finish initializing the object. If the other thread is running at lower priority then this busy waiting ensures that it never gets a chance to run. Falling back to sleeping for 1 ms at a time will avoid this CPU starvation. Bug: 797129 Change-Id: I0e0541c9b84f6b808f11669d83ce6b19b41bb249 Reviewed-on: https://chromium-review.googlesource.com/852441 Reviewed-by: Wez <wez@chromium.org> Reviewed-by: Gabriel Charette <gab@chromium.org> Commit-Queue: Bruce Dawson <brucedawson@chromium.org> Cr-Original-Commit-Position: refs/heads/master@{#527445}(cherry picked from commit cdf5faebbb47d9041b74288f8206cfeb84333a27) Reviewed-on: https://chromium-review.googlesource.com/868649 Reviewed-by: Bruce Dawson <brucedawson@chromium.org> Cr-Commit-Position: refs/branch-heads/3282@{#520} Cr-Branched-From: 5fdc0fab22ce7efd32532ee989b223fa12f8171e-refs/heads/master@{#520840} [modify] https://crrev.com/238d8bdbc36b73a99fbf8d72c2e02abc1dbb278c/base/lazy_instance.cc
,
Jan 17 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f297f018bb66706c8d4ff456bd5868f31985206d commit f297f018bb66706c8d4ff456bd5868f31985206d Author: Gabriel Charette <gab@chromium.org> Date: Wed Jan 17 20:59:07 2018 Make Singleton and LazyInstance use the same logic for lazy creation. They had more or less duplicated code for a while. Creation logic for LazyInstance was extracted to base::internal methods a little while back for LazyTaskRunner. It can now also be used by Singleton to avoid code duplication. Note: because of code duplication, until this CL: Singleton still didn't benefit from the recent priority inversion mitigation added to LazyInstance in r527445. Bug: 797129 Change-Id: I4cfe0d94bce282901ec2471212e38b8e298d9f81 Reviewed-on: https://chromium-review.googlesource.com/868013 Reviewed-by: François Doray <fdoray@chromium.org> Reviewed-by: Daniel Cheng <dcheng@chromium.org> Commit-Queue: Gabriel Charette <gab@chromium.org> Cr-Commit-Position: refs/heads/master@{#529885} [modify] https://crrev.com/f297f018bb66706c8d4ff456bd5868f31985206d/base/BUILD.gn [modify] https://crrev.com/f297f018bb66706c8d4ff456bd5868f31985206d/base/lazy_instance.h [rename] https://crrev.com/f297f018bb66706c8d4ff456bd5868f31985206d/base/lazy_instance_helpers.cc [add] https://crrev.com/f297f018bb66706c8d4ff456bd5868f31985206d/base/lazy_instance_helpers.h [delete] https://crrev.com/7672d23c1f6492e77fd88be14c37f6fd16726ebb/base/memory/singleton.cc [modify] https://crrev.com/f297f018bb66706c8d4ff456bd5868f31985206d/base/memory/singleton.h [modify] https://crrev.com/f297f018bb66706c8d4ff456bd5868f31985206d/base/memory/singleton_unittest.cc [modify] https://crrev.com/f297f018bb66706c8d4ff456bd5868f31985206d/base/task_scheduler/lazy_task_runner.h
,
Jan 17 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/70b37302be21346d0b8315d254939ae81105da28 commit 70b37302be21346d0b8315d254939ae81105da28 Author: Gabriel Charette <gab@chromium.org> Date: Wed Jan 17 21:00:18 2018 Add LazyInstanceTest.PriorityInversionAtInitializationResolves This test ensures that LazyInstance initialization isn't blocked by priority inversions when mixed-priority threads try to initialize the same LazyInstance object. That is, test to make sure r527445 works. On my Z840, this test takes 5 to 10 seconds to complete without r527445 but ~30ms with it :). An explicit 5s timeout was added to this test to make the regression case indeed fail (instead of relying on the longer test timeout under which it's mostly fine). Bug: 799853 , 797129 Change-Id: I3ff59b92f0b4c15574f091549b3cfe8eadd9ae6f Reviewed-on: https://chromium-review.googlesource.com/866871 Reviewed-by: François Doray <fdoray@chromium.org> Reviewed-by: Bruce Dawson <brucedawson@chromium.org> Commit-Queue: Gabriel Charette <gab@chromium.org> Cr-Commit-Position: refs/heads/master@{#529886} [modify] https://crrev.com/70b37302be21346d0b8315d254939ae81105da28/base/lazy_instance_unittest.cc
,
Jan 18 2018
I believe we can call this fixed. @users: please let us know if you still encounter this problem after 64.0.3282.100.
,
Jan 18 2018
Unless fdoray@ wanted to use this bug to remove LazyInstance usage from ScopedBlockingCall altogether? I also have ideas to use a WaitableEvent in LazyInstance (with a trick to make it POD but not leak). But I don't think it blocks this bug per se, just another mitigation.
,
Jan 19 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/ddb183db518bcfdd867b2188c30781fd8d375a06 commit ddb183db518bcfdd867b2188c30781fd8d375a06 Author: Gabriel Charette <gab@chromium.org> Date: Fri Jan 19 10:19:52 2018 Make LazyInstance's constructor constexpr. This allows hiding the buffer into private state, this will allow adding more POD members as a follow-up. Believe it or not, there was a user of this private state up until very recently..! https://chromium-review.googlesource.com/c/chromium/src/+/850294 This also allows removing the need for the LAZY_INSTANCE_INITIALIZER construction hack (will follow-up with a mass removal CL). -Wglobal-constructors (https://chromium-review.googlesource.com/c/chromium/src/+/866738) in //base guarantees that this doesn't result in global constructors. Bug: 797129 Change-Id: Id891d051180b18155f0e23c0ec55cb0f4d3018ac Reviewed-on: https://chromium-review.googlesource.com/861426 Commit-Queue: Gabriel Charette <gab@chromium.org> Reviewed-by: Daniel Cheng <dcheng@chromium.org> Cr-Commit-Position: refs/heads/master@{#530469} [modify] https://crrev.com/ddb183db518bcfdd867b2188c30781fd8d375a06/base/lazy_instance.h [modify] https://crrev.com/ddb183db518bcfdd867b2188c30781fd8d375a06/base/lazy_instance_unittest.cc
,
Jan 23 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/94a0386f1e38a19d3890018a907167429a3b07a1 commit 94a0386f1e38a19d3890018a907167429a3b07a1 Author: Francois Doray <fdoray@chromium.org> Date: Tue Jan 23 14:18:46 2018 Use TimeTicks instead of Time in base::internal::NeedsLazyInstance. Time should be used instead of TimeTicks to measure elapsed time within the lifetime of a process https://docs.google.com/document/d/1ypBZPZnshJ612FWAjkkpSBQiII7vYEEDsRkDOkVjQFw/edit?usp=sharing Bug: 797129 Change-Id: I722bfedd9c1f27546c0a257fe991f53a5b42f54c Reviewed-on: https://chromium-review.googlesource.com/878941 Reviewed-by: Gabriel Charette <gab@chromium.org> Commit-Queue: Gabriel Charette <gab@chromium.org> Cr-Commit-Position: refs/heads/master@{#531227} [modify] https://crrev.com/94a0386f1e38a19d3890018a907167429a3b07a1/base/lazy_instance_helpers.cc
,
Jan 26 2018
We deployed Chrome M64 across our estate last night (Via MSI) and the only issues we've had today are with the odd server that didn't get updated automatically so it's looking good. We've gone from approx ~50 instances of high CPU in stuck Chrome processes per day to only a few today and all of those have been traced to still running M63.
,
Jan 26 2018
Re #48: Great - thanks for the update. :)
,
Feb 1 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/6c58b870ad54c73c124b20a1bdb7f4eb727403f8 commit 6c58b870ad54c73c124b20a1bdb7f4eb727403f8 Author: Robert Liao <robliao@chromium.org> Date: Thu Feb 01 04:47:01 2018 Revert "Make LazyInstance's constructor constexpr." This reverts commit ddb183db518bcfdd867b2188c30781fd8d375a06. Reason for revert: MSVC generates dynamic initializers for constexpr constructors with non-constexpr variables. It's not clear if the C++ spec allows for this. Original change's description: > Make LazyInstance's constructor constexpr. > > This allows hiding the buffer into private state, this will allow adding more POD > members as a follow-up. > Believe it or not, there was a user of this private state up until very recently..! > https://chromium-review.googlesource.com/c/chromium/src/+/850294 > > This also allows removing the need for the LAZY_INSTANCE_INITIALIZER > construction hack (will follow-up with a mass removal CL). > > -Wglobal-constructors (https://chromium-review.googlesource.com/c/chromium/src/+/866738) > in //base guarantees that this doesn't result in global constructors. > > Bug: 797129 > Change-Id: Id891d051180b18155f0e23c0ec55cb0f4d3018ac > Reviewed-on: https://chromium-review.googlesource.com/861426 > Commit-Queue: Gabriel Charette <gab@chromium.org> > Reviewed-by: Daniel Cheng <dcheng@chromium.org> > Cr-Commit-Position: refs/heads/master@{#530469} TBR=dcheng@chromium.org,gab@chromium.org,fdoray@chromium.org # Not skipping CQ checks because original CL landed > 1 day ago. Bug: 797129 , 807491 Change-Id: I3bac97268584295fdd39563428c7f7dbe9ae8560 Reviewed-on: https://chromium-review.googlesource.com/896603 Reviewed-by: Robert Liao <robliao@chromium.org> Reviewed-by: Daniel Cheng <dcheng@chromium.org> Commit-Queue: Robert Liao <robliao@chromium.org> Cr-Commit-Position: refs/heads/master@{#533564} [modify] https://crrev.com/6c58b870ad54c73c124b20a1bdb7f4eb727403f8/base/lazy_instance.h [modify] https://crrev.com/6c58b870ad54c73c124b20a1bdb7f4eb727403f8/base/lazy_instance_unittest.cc |