GpuCrash.GPUProcessCrashesExactlyOnce failing with FATAL:command_buffer_proxy_impl.h(163): Check failed: lockless_thread_checker_.CalledOnValidThread(). |
|||||||
Issue descriptionSeen on a couple of tryjob runs: https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/221584 https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/223064 Stack trace: [1:1:0503/144454:FATAL:command_buffer_proxy_impl.h(163)] Check failed: lockless_thread_checker_.CalledOnValidThread(). #0 0x7f591a06f33e base::debug::StackTrace::StackTrace() #1 0x7f591a0868bb logging::LogMessage::~LogMessage() #2 0x7f5916732ce0 gpu::CommandBufferProxyImpl::SetGpuControlClient() #3 0x7f5918bc253e gpu::gles2::GLES2Implementation::~GLES2Implementation() #4 0x7f5918bc2b59 gpu::gles2::GLES2Implementation::~GLES2Implementation() #5 0x7f591709f8ff content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer() #6 0x7f591709fa29 content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer() #7 0x7f5919d7a389 content::CompositorOutputSurface::~CompositorOutputSurface() #8 0x7f5916586b48 cc::LayerTreeHost::DidFailToInitializeOutputSurface() #9 0x7f59165bfce2 cc::ProxyMain::DidInitializeOutputSurface() #10 0x7f59165ce217 _ZN4base8internal7InvokerINS_13IndexSequenceIJLm0ELm1ELm2EEEENS0_9BindStateINS0_15RunnableAdapterIMN2cc9ProxyMainEFvbRKNS6_20RendererCapabilitiesEEEEFvPS7_bSA_EJRNS_7WeakPtrIS7_EERbSA_EEENS0_12InvokeHelperILb1EvSD_EEFvvEE3RunEPNS0_13BindStateBaseE #11 0x7f591a07008c base::debug::TaskAnnotator::RunTask() #12 0x7f5919cf0687 scheduler::TaskQueueManager::ProcessTaskFromWorkQueue() #13 0x7f5919cef315 scheduler::TaskQueueManager::DoWork() #14 0x7f5919cf1614 _ZN4base8internal7InvokerINS_13IndexSequenceIJLm0ELm1ELm2EEEENS0_9BindStateINS0_15RunnableAdapterIMN9scheduler16TaskQueueManagerEFvNS_9TimeTicksEbEEEFvPS7_S8_bEJNS_7WeakPtrIS7_EERS8_bEEENS0_12InvokeHelperILb1EvSB_EEFvvEE3RunEPNS0_13BindStateBaseE #15 0x7f591a07008c base::debug::TaskAnnotator::RunTask() #16 0x7f591a08dd55 base::MessageLoop::RunTask() #17 0x7f591a08e098 base::MessageLoop::DeferOrRunPendingTask() #18 0x7f591a08e3bb base::MessageLoop::DoWork() #19 0x7f591a08fcaf base::MessagePumpDefault::Run() #20 0x7f591a08d881 base::MessageLoop::RunHandler() #21 0x7f591a0b0960 base::RunLoop::Run() #22 0x7f591a08cb80 base::MessageLoop::Run() #23 0x7f5919e2a2be content::RendererMain() #24 0x7f591a04619b content::RunZygote() #25 0x7f591a046732 content::RunNamedProcessTypeMain() #26 0x7f591a047173 content::ContentMainRunnerImpl::Run() #27 0x7f591a045d70 content::ContentMain() #28 0x7f591524e29b ChromeMain #29 0x7f590e139ec5 __libc_start_main #30 0x7f591524e171 <unknown> The failure's intermittent and reported originally in Issue 608923 . Dana, do you think you could help fix the apparent race condition here? It's urgent as it's affecting CQ jobs. Thanks.
,
May 4 2016
This is being seen on a lot of tryjob runs at this point: https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyKgsSBUZsYWtlIh9jb250ZXh0X2xvc3RfdGVzdHMgKHdpdGggcGF0Y2gpDA Recent failures: https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/216146 https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/223406 https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/223383 https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/216098 https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/216088 https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/216091 https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/216074 https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/221893 https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/223212 sclittle@ marked this test flaky on Linux and Mac in https://codereview.chromium.org/1947673003 but this is not a good workaround. Secondly, it would need to be done on Windows as well. I'm upgrading this to P0. Dana, will you be able to look into this today? If not, please tell me so another owner can be found.
,
May 4 2016
Yep looking now
,
May 4 2016
Hm ok so..
#2 0x7f5916732ce0 gpu::CommandBufferProxyImpl::SetGpuControlClient()
#3 0x7f5918bc253e gpu::gles2::GLES2Implementation::~GLES2Implementation()
#4 0x7f5918bc2b59 gpu::gles2::GLES2Implementation::~GLES2Implementation()
#5 0x7f591709f8ff content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer()
#6 0x7f591709fa29 content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer()
#7 0x7f5919d7a389 content::CompositorOutputSurface::~CompositorOutputSurface()
This stack says ContextProvider is being destroyed.
SetGpuControlClient calls CheckLock() on CBPI. CheckLock() is
void CheckLock() const {
if (lock_)
lock_->AssertAcquired();
}
So we have a lock_ on the CBPI here. Well, in the ContextProvider destructor we do:
if (bind_succeeded_) {
// Clear the lock to avoid DCHECKs that the lock is being held during
// shutdown.
command_buffer_->SetLock(nullptr);
}
So, if we did a bind (which creates the command_buffer_), we will set the lock to null. SetLock is literally "lock_ = lock;", nothing fancy.
This implies that bind_succeeded_ is not true, but we have a command_buffer_, which means we started to do BindToCurrentThread but failed at some point along the way.
However for the lock to be non-null, we SetLock(&context_lock_) in ContextProviderCommandBuffer::SetupLock, which DCHECKS
DCHECK(bind_succeeded_);
So.. bind did succeed, but didn't when we get to the destructor? Can't be. Other possibilities..
- We're calling SetupLock when BindToCurrentThread returned false (I don't think we do this), and we're failing to hit that DCHECK. What?
- Lock::AssertAcquired happens but DCHECK doesn't - that's out it's implemented by DCHECK.
- Something else is setting a non-null Lock on CBPI?
Hmm..
,
May 4 2016
,
May 4 2016
> - Something else is setting a non-null Lock on CBPI? Code search is ruling this one out. (Also seems like this doesn't need to be part of the GpuControl interface.)
,
May 4 2016
> - We're calling SetupLock when BindToCurrentThread returned false Ruled out also. Mystery deepens.
,
May 4 2016
OOOOOOOOOOOH my checkout is old. CheckLock has changed recently to:
void CheckLock() {
if (lock_) {
lock_->AssertAcquired();
} else {
DCHECK(lockless_thread_checker_.CalledOnValidThread());
}
}
And it's the DCHECK that is failing, which simply means that we're destroying GLES2Impl on a different thread than we bound/used it.
This is allowed by ContextProviderCommandBuffer though, so I think the DCHECK is wrong.
,
May 4 2016
+dyen who added the dcheck
,
May 4 2016
Maybe we can reset the lockless_thread_checker_ when resetting the lock to null?
,
May 4 2016
Specifically, the context provider does this:
ContextProviderCommandBuffer::~ContextProviderCommandBuffer() {
DCHECK(main_thread_checker_.CalledOnValidThread() ||
context_thread_checker_.CalledOnValidThread());
main_thread_checker_ is the thread where the context was created. Contexts that are passed across threads are the worker and the compositor contexts.
The worker context is bound to the main (create) thread, and has a ref owned there in RenderThreadImpl.
Bug https://bugs.chromium.org/p/chromium/issues/detail?id=463288 is interesting here, the compositor may outlive the RenderThreadImpl, which means the context perhaps is being destroyed on the compositor thread?
The compositor context is bound to the compositor thread, but is owned by LayerTreeHost on the main thread. This is something I'd like to undo now, and just own and destroy it on the compositor thread. But when the renderer compositor shuts down, it would destroy the compositor thread refs to the context provider before killing its LayerTreeHost-owned output surface. This would pass the DCHECKs in ~ContextProviderCommandBuffer, but I think it would hit the DCHECK on the CommandBufferProxyImpl via ~GLES2Implementation. If it's this though, I'm not sure why it would be flaky.
,
May 4 2016
Oh, this is in cc::LayerTreeHost::DidFailToInitializeOutputSurface(), which drops the current_output_surface_ deleting the compositor context on the main thread. Still not sure why this wouldn't happen on shutdown too. > Maybe we can reset the lockless_thread_checker_ when resetting the lock to null? Hm! Ya that's a good thought.
,
May 4 2016
mmh, actually, I don't think it's ok to SetGpuControlClient on a different thread than the one that runs the callbacks, ever.
,
May 4 2016
Hmm, unless it's lost I guess. But complexity begets complexity. The right way to fix this, is I think to pass ownership of the OutputSurface to the compositor thread. It's not the simplest change, but maybe I should just do that right now.
,
May 4 2016
,
May 4 2016
Theory: We don't see this on shutdown because fast shutdown, so we don't actually destroy cc. And it's pretty hard to destroy cc without destroying the renderer process I guess on desktop.
,
May 4 2016
I can reproduce something like this. It's doing Finish instead, so it's slightly sooner.
[1:1:0504/155841:FATAL:command_buffer_proxy_impl.h(163)] Check failed: lockless_thread_checker_.CalledOnValidThread().
#0 0x7f42cf115c1e base::debug::StackTrace::StackTrace()
#1 0x7f42cf13600b logging::LogMessage::~LogMessage()
#2 0x7f42cc21cbb9 gpu::CommandBufferProxyImpl::Flush()
#3 0x7f42cc158a87 gpu::CommandBufferHelper::Finish()
#4 0x7f42c52d2bd6 gpu::gles2::GLES2Implementation::WaitForCmd()
#5 0x7f42c52d26b2 gpu::gles2::GLES2Implementation::~GLES2Implementation()
#6 0x7f42c52d2dc9 gpu::gles2::GLES2Implementation::~GLES2Implementation()
#7 0x7f42cc9827bf content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer()
#8 0x7f42cc9828e9 content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer()
#9 0x7f42cd2ee9b9 content::CompositorOutputSurface::~CompositorOutputSurface()
By running chrome with the following diff:
diff --git a/cc/output/output_surface.cc b/cc/output/output_surface.cc
index 30864f1..a78f934 100644
--- a/cc/output/output_surface.cc
+++ b/cc/output/output_surface.cc
@@ -207,6 +207,9 @@ bool OutputSurface::BindToClient(OutputSurfaceClient* client) {
if (context_provider_.get()) {
success = context_provider_->BindToCurrentThread();
+ static int i = 0;
+ if (++i == 1)
+ success = false;
if (success) {
context_provider_->SetLostContextCallback(base::Bind(
&OutputSurface::DidLoseOutputSurface, base::Unretained(this)));
,
May 4 2016
Or this hits it exactly:
diff --git a/content/common/gpu/client/context_provider_command_buffer.cc b/content/common/gpu/client/context_provider_command_buffer.cc
index 7c5e7fb..9efa734 100644
--- a/content/common/gpu/client/context_provider_command_buffer.cc
+++ b/content/common/gpu/client/context_provider_command_buffer.cc
@@ -205,6 +205,10 @@ bool ContextProviderCommandBuffer::BindToCurrentThread() {
if (share_group && share_group->IsLost())
return false;
+ static int i = 0;
+ if (++i == 2)
+ return false;
+
shared_providers_->list.push_back(this);
}
set_bind_failed.Reset();
[1:1:0504/160320:FATAL:command_buffer_proxy_impl.h(163)] Check failed: lockless_thread_checker_.CalledOnValidThread().
#0 0x7fead1b22c1e base::debug::StackTrace::StackTrace()
#1 0x7fead1b4300b logging::LogMessage::~LogMessage()
#2 0x7feacec2b5a0 gpu::CommandBufferProxyImpl::SetGpuControlClient()
#3 0x7feac7cdf7e1 gpu::gles2::GLES2Implementation::~GLES2Implementation()
#4 0x7feac7cdfdc9 gpu::gles2::GLES2Implementation::~GLES2Implementation()
#5 0x7feacf38f7bf content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer()
#6 0x7feacf38f8e9 content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer()
#7 0x7feacfcfb9e9 content::CompositorOutputSurface::~CompositorOutputSurface()
It needs 2 cuz the worker is made and bound first.
,
May 4 2016
Super awesome, thank you for tracking this down!
,
May 5 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0795777fbf9349b39b0c66704e5ce7633add15b8 commit 0795777fbf9349b39b0c66704e5ce7633add15b8 Author: kbr <kbr@chromium.org> Date: Thu May 05 00:24:09 2016 Skip GpuCrash.GPUProcessCrashesExactlyOnce on all platforms. The assertion failure in Issue 608946 is firing frequently on the commit queue, and the flaky retry mechanism isn't working for this failure mode. BUG= 608923 , 608946 CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel TBR=danakj@chromium.org, sclittle@chromium.org NOTRY=true Review-Url: https://codereview.chromium.org/1947303002 Cr-Commit-Position: refs/heads/master@{#391712} [modify] https://crrev.com/0795777fbf9349b39b0c66704e5ce7633add15b8/content/test/gpu/gpu_tests/context_lost_expectations.py
,
May 6 2016
Still important to fix soon, but not P0 since not blocking the CQ any more. Let's re-enable the test above once the crash appears to be fixed.
,
May 6 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/cbd17b97df7b1d54eae6ceb1a4d5a75d60d61245 commit cbd17b97df7b1d54eae6ceb1a4d5a75d60d61245 Author: danakj <danakj@chromium.org> Date: Fri May 06 21:52:05 2016 Detach the output surface if bind fails. This destroys the compositor context provider on the correct thread. R=enne, piman BUG= 608946 CQ_INCLUDE_TRYBOTS=tryserver.blink:linux_blink_rel Review-Url: https://codereview.chromium.org/1949753007 Cr-Commit-Position: refs/heads/master@{#392168} [modify] https://crrev.com/cbd17b97df7b1d54eae6ceb1a4d5a75d60d61245/cc/output/output_surface.cc [modify] https://crrev.com/cbd17b97df7b1d54eae6ceb1a4d5a75d60d61245/content/renderer/gpu/compositor_output_surface.cc
,
May 6 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/a76f4435305e2b1a7fa0c8a5e0c5ae9cdde9755c commit a76f4435305e2b1a7fa0c8a5e0c5ae9cdde9755c Author: danakj <danakj@chromium.org> Date: Fri May 06 22:57:36 2016 Revert of Skip GpuCrash.GPUProcessCrashesExactlyOnce on all platforms. (patchset #1 id:1 of https://codereview.chromium.org/1947303002/ ) Reason for revert: Temporary fix here: https://codereview.chromium.org/1949753007/ More work to come. Original issue's description: > Skip GpuCrash.GPUProcessCrashesExactlyOnce on all platforms. > > The assertion failure in Issue 608946 is firing frequently on the > commit queue, and the flaky retry mechanism isn't working for this > failure mode. > > BUG= 608923 , 608946 > CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel > TBR=danakj@chromium.org, sclittle@chromium.org > NOTRY=true > > Committed: https://crrev.com/0795777fbf9349b39b0c66704e5ce7633add15b8 > Cr-Commit-Position: refs/heads/master@{#391712} TBR=sclittle@chromium.org,kbr@chromium.org # Not skipping CQ checks because original CL landed more than 1 days ago. BUG= 608923 , 608946 Review-Url: https://codereview.chromium.org/1956703007 Cr-Commit-Position: refs/heads/master@{#392193} [modify] https://crrev.com/a76f4435305e2b1a7fa0c8a5e0c5ae9cdde9755c/content/test/gpu/gpu_tests/context_lost_expectations.py
,
May 10 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0b4b94e39d65563ab62d634f02e2fb7a207b587f commit 0b4b94e39d65563ab62d634f02e2fb7a207b587f Author: danakj <danakj@chromium.org> Date: Tue May 10 22:33:01 2016 Split the media context from the compositor worker context. Give the media system its own ContextProviderCommandBuffer, so that we can move ownership of the compositor raster worker contex to the compositor thread along with the output surface. R=piman BUG= 608946 Review-Url: https://codereview.chromium.org/1960803002 Cr-Commit-Position: refs/heads/master@{#392750} [modify] https://crrev.com/0b4b94e39d65563ab62d634f02e2fb7a207b587f/content/common/gpu/client/command_buffer_metrics.cc [modify] https://crrev.com/0b4b94e39d65563ab62d634f02e2fb7a207b587f/content/common/gpu/client/command_buffer_metrics.h [modify] https://crrev.com/0b4b94e39d65563ab62d634f02e2fb7a207b587f/content/common/gpu_process_launch_causes.h [modify] https://crrev.com/0b4b94e39d65563ab62d634f02e2fb7a207b587f/content/renderer/render_thread_impl.cc [modify] https://crrev.com/0b4b94e39d65563ab62d634f02e2fb7a207b587f/content/renderer/render_thread_impl.h [modify] https://crrev.com/0b4b94e39d65563ab62d634f02e2fb7a207b587f/content/renderer/render_widget.cc [modify] https://crrev.com/0b4b94e39d65563ab62d634f02e2fb7a207b587f/tools/metrics/histograms/histograms.xml
,
May 11 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0dd9e1efed6a509dad5b7438cbfc8c015b193238 commit 0dd9e1efed6a509dad5b7438cbfc8c015b193238 Author: danakj <danakj@chromium.org> Date: Wed May 11 22:15:09 2016 Give the media context small memory limits (same as compositor). This makes a static gpu::SharedMemoryLimits::ForMailboxContext() method so that RenderThreadImpl and RenderWidget can share the same limits for the media context and the compositor context which both use the context for creating textures and mailboxing them. R=piman@chromium.org BUG= 608946 CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel Review-Url: https://codereview.chromium.org/1969163002 Cr-Commit-Position: refs/heads/master@{#393086} [modify] https://crrev.com/0dd9e1efed6a509dad5b7438cbfc8c015b193238/content/renderer/render_thread_impl.cc [modify] https://crrev.com/0dd9e1efed6a509dad5b7438cbfc8c015b193238/content/renderer/render_widget.cc [modify] https://crrev.com/0dd9e1efed6a509dad5b7438cbfc8c015b193238/gpu/command_buffer/client/shared_memory_limits.h
,
May 12 2016
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by kbr@chromium.org
, May 3 2016