New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 608946 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: May 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 1
Type: Bug

Blocking:
issue 608923



Sign in to add a comment

GpuCrash.GPUProcessCrashesExactlyOnce failing with FATAL:command_buffer_proxy_impl.h(163): Check failed: lockless_thread_checker_.CalledOnValidThread().

Project Member Reported by kbr@chromium.org, May 3 2016

Issue description

Seen on a couple of tryjob runs:

https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/221584
https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/223064

Stack trace:

[1:1:0503/144454:FATAL:command_buffer_proxy_impl.h(163)] Check failed: lockless_thread_checker_.CalledOnValidThread(). 
#0 0x7f591a06f33e base::debug::StackTrace::StackTrace()
#1 0x7f591a0868bb logging::LogMessage::~LogMessage()
#2 0x7f5916732ce0 gpu::CommandBufferProxyImpl::SetGpuControlClient()
#3 0x7f5918bc253e gpu::gles2::GLES2Implementation::~GLES2Implementation()
#4 0x7f5918bc2b59 gpu::gles2::GLES2Implementation::~GLES2Implementation()
#5 0x7f591709f8ff content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer()
#6 0x7f591709fa29 content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer()
#7 0x7f5919d7a389 content::CompositorOutputSurface::~CompositorOutputSurface()
#8 0x7f5916586b48 cc::LayerTreeHost::DidFailToInitializeOutputSurface()
#9 0x7f59165bfce2 cc::ProxyMain::DidInitializeOutputSurface()
#10 0x7f59165ce217 _ZN4base8internal7InvokerINS_13IndexSequenceIJLm0ELm1ELm2EEEENS0_9BindStateINS0_15RunnableAdapterIMN2cc9ProxyMainEFvbRKNS6_20RendererCapabilitiesEEEEFvPS7_bSA_EJRNS_7WeakPtrIS7_EERbSA_EEENS0_12InvokeHelperILb1EvSD_EEFvvEE3RunEPNS0_13BindStateBaseE
#11 0x7f591a07008c base::debug::TaskAnnotator::RunTask()
#12 0x7f5919cf0687 scheduler::TaskQueueManager::ProcessTaskFromWorkQueue()
#13 0x7f5919cef315 scheduler::TaskQueueManager::DoWork()
#14 0x7f5919cf1614 _ZN4base8internal7InvokerINS_13IndexSequenceIJLm0ELm1ELm2EEEENS0_9BindStateINS0_15RunnableAdapterIMN9scheduler16TaskQueueManagerEFvNS_9TimeTicksEbEEEFvPS7_S8_bEJNS_7WeakPtrIS7_EERS8_bEEENS0_12InvokeHelperILb1EvSB_EEFvvEE3RunEPNS0_13BindStateBaseE
#15 0x7f591a07008c base::debug::TaskAnnotator::RunTask()
#16 0x7f591a08dd55 base::MessageLoop::RunTask()
#17 0x7f591a08e098 base::MessageLoop::DeferOrRunPendingTask()
#18 0x7f591a08e3bb base::MessageLoop::DoWork()
#19 0x7f591a08fcaf base::MessagePumpDefault::Run()
#20 0x7f591a08d881 base::MessageLoop::RunHandler()
#21 0x7f591a0b0960 base::RunLoop::Run()
#22 0x7f591a08cb80 base::MessageLoop::Run()
#23 0x7f5919e2a2be content::RendererMain()
#24 0x7f591a04619b content::RunZygote()
#25 0x7f591a046732 content::RunNamedProcessTypeMain()
#26 0x7f591a047173 content::ContentMainRunnerImpl::Run()
#27 0x7f591a045d70 content::ContentMain()
#28 0x7f591524e29b ChromeMain
#29 0x7f590e139ec5 __libc_start_main
#30 0x7f591524e171 <unknown>


The failure's intermittent and reported originally in  Issue 608923 .

Dana, do you think you could help fix the apparent race condition here? It's urgent as it's affecting CQ jobs. Thanks.

 
stdout-1.txt
110 KB View Download
stdout-2.txt
133 KB View Download

Comment 1 by kbr@chromium.org, May 3 2016

Components: Internals>GPU>Testing
Yep looking now
Hm ok so..

#2 0x7f5916732ce0 gpu::CommandBufferProxyImpl::SetGpuControlClient()
#3 0x7f5918bc253e gpu::gles2::GLES2Implementation::~GLES2Implementation()
#4 0x7f5918bc2b59 gpu::gles2::GLES2Implementation::~GLES2Implementation()
#5 0x7f591709f8ff content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer()
#6 0x7f591709fa29 content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer()
#7 0x7f5919d7a389 content::CompositorOutputSurface::~CompositorOutputSurface()

This stack says ContextProvider is being destroyed.

SetGpuControlClient calls CheckLock() on CBPI. CheckLock() is

  void CheckLock() const {
    if (lock_)
      lock_->AssertAcquired();
  }

So we have a lock_ on the CBPI here. Well, in the ContextProvider destructor we do:
  if (bind_succeeded_) {
    // Clear the lock to avoid DCHECKs that the lock is being held during                                                                                  
    // shutdown.                                                                                                                                           
    command_buffer_->SetLock(nullptr);
  }

So, if we did a bind (which creates the command_buffer_), we will set the lock to null. SetLock is literally "lock_ = lock;", nothing fancy.

This implies that bind_succeeded_ is not true, but we have a command_buffer_, which means we started to do BindToCurrentThread but failed at some point along the way.

However for the lock to be non-null, we SetLock(&context_lock_) in ContextProviderCommandBuffer::SetupLock, which DCHECKS
  DCHECK(bind_succeeded_);

So.. bind did succeed, but didn't when we get to the destructor? Can't be. Other possibilities..

- We're calling SetupLock when BindToCurrentThread returned false (I don't think we do this), and we're failing to hit that DCHECK. What?
- Lock::AssertAcquired happens but DCHECK doesn't - that's out it's implemented by DCHECK.
- Something else is setting a non-null Lock on CBPI?
Hmm..
Cc: piman@chromium.org
> - Something else is setting a non-null Lock on CBPI?

Code search is ruling this one out. (Also seems like this doesn't need to be part of the GpuControl interface.)
> - We're calling SetupLock when BindToCurrentThread returned false

Ruled out also. Mystery deepens.
OOOOOOOOOOOH my checkout is old. CheckLock has changed recently to:

  void CheckLock() {
    if (lock_) {
      lock_->AssertAcquired();
    } else {
      DCHECK(lockless_thread_checker_.CalledOnValidThread());
    }
  }

And it's the DCHECK that is failing, which simply means that we're destroying GLES2Impl on a different thread than we bound/used it.

This is allowed by ContextProviderCommandBuffer though, so I think the DCHECK is wrong.
Cc: dyen@chromium.org
+dyen who added the dcheck
Maybe we can reset the lockless_thread_checker_ when resetting the lock to null?
Specifically, the context provider does this:

ContextProviderCommandBuffer::~ContextProviderCommandBuffer() {
  DCHECK(main_thread_checker_.CalledOnValidThread() ||
         context_thread_checker_.CalledOnValidThread());
  
main_thread_checker_ is the thread where the context was created. Contexts that are passed across threads are the worker and the compositor contexts.

The worker context is bound to the main (create) thread, and has a ref owned there in RenderThreadImpl.

Bug https://bugs.chromium.org/p/chromium/issues/detail?id=463288 is interesting here, the compositor may outlive the RenderThreadImpl, which means the context perhaps is being destroyed on the compositor thread?

The compositor context is bound to the compositor thread, but is owned by LayerTreeHost on the main thread. This is something I'd like to undo now, and just own and destroy it on the compositor thread. But when the renderer compositor shuts down, it would destroy the compositor thread refs to the context provider before killing its LayerTreeHost-owned output surface. This would pass the DCHECKs in ~ContextProviderCommandBuffer, but I think it would hit the DCHECK on the CommandBufferProxyImpl via ~GLES2Implementation. If it's this though, I'm not sure why it would be flaky.
Oh, this is in cc::LayerTreeHost::DidFailToInitializeOutputSurface(), which drops the current_output_surface_ deleting the compositor context on the main thread. Still not sure why this wouldn't happen on shutdown too.

> Maybe we can reset the lockless_thread_checker_ when resetting the lock to null?

Hm! Ya that's a good thought.
mmh, actually, I don't think it's ok to SetGpuControlClient on a different thread than the one that runs the callbacks, ever.
Hmm, unless it's lost I guess. But complexity begets complexity.

The right way to fix this, is I think to pass ownership of the OutputSurface to the compositor thread. It's not the simplest change, but maybe I should just do that right now.
Cc: enne@chromium.org siev...@chromium.org
Theory: We don't see this on shutdown because fast shutdown, so we don't actually destroy cc. And it's pretty hard to destroy cc without destroying the renderer process I guess on desktop.
I can reproduce something like this. It's doing Finish instead, so it's slightly sooner.

[1:1:0504/155841:FATAL:command_buffer_proxy_impl.h(163)] Check failed: lockless_thread_checker_.CalledOnValidThread(). 
#0 0x7f42cf115c1e base::debug::StackTrace::StackTrace()
#1 0x7f42cf13600b logging::LogMessage::~LogMessage()
#2 0x7f42cc21cbb9 gpu::CommandBufferProxyImpl::Flush()
#3 0x7f42cc158a87 gpu::CommandBufferHelper::Finish()
#4 0x7f42c52d2bd6 gpu::gles2::GLES2Implementation::WaitForCmd()
#5 0x7f42c52d26b2 gpu::gles2::GLES2Implementation::~GLES2Implementation()
#6 0x7f42c52d2dc9 gpu::gles2::GLES2Implementation::~GLES2Implementation()
#7 0x7f42cc9827bf content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer()
#8 0x7f42cc9828e9 content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer()
#9 0x7f42cd2ee9b9 content::CompositorOutputSurface::~CompositorOutputSurface()


By running chrome with the following diff:


diff --git a/cc/output/output_surface.cc b/cc/output/output_surface.cc
index 30864f1..a78f934 100644
--- a/cc/output/output_surface.cc
+++ b/cc/output/output_surface.cc
@@ -207,6 +207,9 @@ bool OutputSurface::BindToClient(OutputSurfaceClient* client) {
 
   if (context_provider_.get()) {
     success = context_provider_->BindToCurrentThread();
+    static int i = 0;
+    if (++i == 1)
+      success = false;
     if (success) {
       context_provider_->SetLostContextCallback(base::Bind(
           &OutputSurface::DidLoseOutputSurface, base::Unretained(this)));

Or this hits it exactly:

diff --git a/content/common/gpu/client/context_provider_command_buffer.cc b/content/common/gpu/client/context_provider_command_buffer.cc
index 7c5e7fb..9efa734 100644
--- a/content/common/gpu/client/context_provider_command_buffer.cc
+++ b/content/common/gpu/client/context_provider_command_buffer.cc
@@ -205,6 +205,10 @@ bool ContextProviderCommandBuffer::BindToCurrentThread() {
     if (share_group && share_group->IsLost())
       return false;
 
+    static int i = 0;
+    if (++i == 2)
+      return false;
+
     shared_providers_->list.push_back(this);
   }
   set_bind_failed.Reset();



[1:1:0504/160320:FATAL:command_buffer_proxy_impl.h(163)] Check failed: lockless_thread_checker_.CalledOnValidThread(). 
#0 0x7fead1b22c1e base::debug::StackTrace::StackTrace()
#1 0x7fead1b4300b logging::LogMessage::~LogMessage()
#2 0x7feacec2b5a0 gpu::CommandBufferProxyImpl::SetGpuControlClient()
#3 0x7feac7cdf7e1 gpu::gles2::GLES2Implementation::~GLES2Implementation()
#4 0x7feac7cdfdc9 gpu::gles2::GLES2Implementation::~GLES2Implementation()
#5 0x7feacf38f7bf content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer()
#6 0x7feacf38f8e9 content::ContextProviderCommandBuffer::~ContextProviderCommandBuffer()
#7 0x7feacfcfb9e9 content::CompositorOutputSurface::~CompositorOutputSurface()


It needs 2 cuz the worker is made and bound first.

Comment 19 by kbr@chromium.org, May 4 2016

Super awesome, thank you for tracking this down!

Project Member

Comment 20 by bugdroid1@chromium.org, May 5 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0795777fbf9349b39b0c66704e5ce7633add15b8

commit 0795777fbf9349b39b0c66704e5ce7633add15b8
Author: kbr <kbr@chromium.org>
Date: Thu May 05 00:24:09 2016

Skip GpuCrash.GPUProcessCrashesExactlyOnce on all platforms.

The assertion failure in  Issue 608946  is firing frequently on the
commit queue, and the flaky retry mechanism isn't working for this
failure mode.

BUG= 608923 ,  608946 
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel
TBR=danakj@chromium.org, sclittle@chromium.org
NOTRY=true

Review-Url: https://codereview.chromium.org/1947303002
Cr-Commit-Position: refs/heads/master@{#391712}

[modify] https://crrev.com/0795777fbf9349b39b0c66704e5ce7633add15b8/content/test/gpu/gpu_tests/context_lost_expectations.py

Comment 21 by kbr@chromium.org, May 6 2016

Labels: -Pri-0 Pri-1
Still important to fix soon, but not P0 since not blocking the CQ any more.

Let's re-enable the test above once the crash appears to be fixed.

Project Member

Comment 22 by bugdroid1@chromium.org, May 6 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/cbd17b97df7b1d54eae6ceb1a4d5a75d60d61245

commit cbd17b97df7b1d54eae6ceb1a4d5a75d60d61245
Author: danakj <danakj@chromium.org>
Date: Fri May 06 21:52:05 2016

Detach the output surface if bind fails.

This destroys the compositor context provider on the correct thread.

R=enne, piman
BUG= 608946 
CQ_INCLUDE_TRYBOTS=tryserver.blink:linux_blink_rel

Review-Url: https://codereview.chromium.org/1949753007
Cr-Commit-Position: refs/heads/master@{#392168}

[modify] https://crrev.com/cbd17b97df7b1d54eae6ceb1a4d5a75d60d61245/cc/output/output_surface.cc
[modify] https://crrev.com/cbd17b97df7b1d54eae6ceb1a4d5a75d60d61245/content/renderer/gpu/compositor_output_surface.cc

Project Member

Comment 23 by bugdroid1@chromium.org, May 6 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a76f4435305e2b1a7fa0c8a5e0c5ae9cdde9755c

commit a76f4435305e2b1a7fa0c8a5e0c5ae9cdde9755c
Author: danakj <danakj@chromium.org>
Date: Fri May 06 22:57:36 2016

Revert of Skip GpuCrash.GPUProcessCrashesExactlyOnce on all platforms. (patchset #1 id:1 of https://codereview.chromium.org/1947303002/ )

Reason for revert:
Temporary fix here: https://codereview.chromium.org/1949753007/

More work to come.

Original issue's description:
> Skip GpuCrash.GPUProcessCrashesExactlyOnce on all platforms.
>
> The assertion failure in  Issue 608946  is firing frequently on the
> commit queue, and the flaky retry mechanism isn't working for this
> failure mode.
>
> BUG= 608923 ,  608946 
> CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel
> TBR=danakj@chromium.org, sclittle@chromium.org
> NOTRY=true
>
> Committed: https://crrev.com/0795777fbf9349b39b0c66704e5ce7633add15b8
> Cr-Commit-Position: refs/heads/master@{#391712}

TBR=sclittle@chromium.org,kbr@chromium.org
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG= 608923 ,  608946 

Review-Url: https://codereview.chromium.org/1956703007
Cr-Commit-Position: refs/heads/master@{#392193}

[modify] https://crrev.com/a76f4435305e2b1a7fa0c8a5e0c5ae9cdde9755c/content/test/gpu/gpu_tests/context_lost_expectations.py

Project Member

Comment 25 by bugdroid1@chromium.org, May 11 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0dd9e1efed6a509dad5b7438cbfc8c015b193238

commit 0dd9e1efed6a509dad5b7438cbfc8c015b193238
Author: danakj <danakj@chromium.org>
Date: Wed May 11 22:15:09 2016

Give the media context small memory limits (same as compositor).

This makes a static gpu::SharedMemoryLimits::ForMailboxContext() method
so that RenderThreadImpl and RenderWidget can share the same limits for
the media context and the compositor context which both use the context
for creating textures and mailboxing them.

R=piman@chromium.org
BUG= 608946 
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_optional_gpu_tests_rel;tryserver.chromium.mac:mac_optional_gpu_tests_rel;tryserver.chromium.win:win_optional_gpu_tests_rel

Review-Url: https://codereview.chromium.org/1969163002
Cr-Commit-Position: refs/heads/master@{#393086}

[modify] https://crrev.com/0dd9e1efed6a509dad5b7438cbfc8c015b193238/content/renderer/render_thread_impl.cc
[modify] https://crrev.com/0dd9e1efed6a509dad5b7438cbfc8c015b193238/content/renderer/render_widget.cc
[modify] https://crrev.com/0dd9e1efed6a509dad5b7438cbfc8c015b193238/gpu/command_buffer/client/shared_memory_limits.h

Status: Fixed (was: Assigned)

Sign in to add a comment