New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 882103 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Oct 1
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 2
Type: Bug

Blocked on:
issue 858907

Blocking:
issue 878258



Sign in to add a comment

"gpu_tests.context_lost_integration_test.ContextLostIntegrationTest.ContextLost_WebGLBlockedAfterJSNavigation" is flaky

Project Member Reported by chromium...@appspot.gserviceaccount.com, Sep 8

Issue description

"gpu_tests.context_lost_integration_test.ContextLostIntegrationTest.ContextLost_WebGLBlockedAfterJSNavigation" is flaky.

This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label.

We have detected 3 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNydwsSBUZsYWtlImxncHVfdGVzdHMuY29udGV4dF9sb3N0X2ludGVncmF0aW9uX3Rlc3QuQ29udGV4dExvc3RJbnRlZ3JhdGlvblRlc3QuQ29udGV4dExvc3RfV2ViR0xCbG9ja2VkQWZ0ZXJKU05hdmlnYXRpb24M.

Flaky tests should be disabled within 30 minutes unless culprit CL is found and reverted. Please see more details here: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriffing-bug-queues#triaging-auto-filed-flakiness-bugs

This flaky test/step was previously tracked in  issue 832886 .
 
Detected 3 new flakes for test/step "gpu_tests.context_lost_integration_test.ContextLostIntegrationTest.ContextLost_WebGLBlockedAfterJSNavigation". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNydwsSBUZsYWtlImxncHVfdGVzdHMuY29udGV4dF9sb3N0X2ludGVncmF0aW9uX3Rlc3QuQ29udGV4dExvc3RJbnRlZ3JhdGlvblRlc3QuQ29udGV4dExvc3RfV2ViR0xCbG9ja2VkQWZ0ZXJKU05hdmlnYXRpb24M. This message was posted automatically by the chromium-try-flakes app.
Owner: kbr@chromium.org
Status: Assigned (was: Untriaged)
kbr: I think this is the issue that https://chromium.googlesource.com/chromium/src/+/3689f3e57b85ca683cc6b61841715fbd0c6af559 was trying to fix, right? It seems there's still a problem. :(
Blockedon: 858907
Components: Blink>WebGL
Cc: fsam...@chromium.org
Components: Internals>Compositing
Labels: OS-Android
Owner: enne@chromium.org
Yes, that's the bug that the previous fix was intended to address.

This failure:
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android-marshmallow-arm64-rel/81637

has the following stack trace which looks like the cause:

  	signal 6 (SIGABRT), code -6 in tid 8347 (chromium.chrome)
  	pid: 8347, tid: 8347, name: chromium.chrome  >>> org.chromium.chrome <<<
  	signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr --------
  	[FATAL:compositor_impl_android.cc(1133)] Fatal error making Gpu context
  	
  	Stack Trace:
  	  RELADDR   FUNCTION
  	  0000000003d224a3  logging::LogMessage::~LogMessage()
  	  0000000002a0c3c7  content::CompositorImpl::OnGpuChannelEstablished(scoped_refptr<gpu::GpuChannelHost>)
  	  0000000002a0e36f  void base::internal::FunctorTraits<void (content::CompositorImpl::*)(scoped_refptr<gpu::GpuChannelHost>), void>::Invoke<void (content::CompositorImpl::*)(scoped_refptr<gpu::GpuChannelHost>), base::WeakPtr<content::CompositorImpl>, scoped_refptr<gpu::GpuChannelHost> >(void (content::CompositorImpl::*)(scoped_refptr<gpu::GpuChannelHost>), base::WeakPtr<content::CompositorImpl>&&, scoped_refptr<gpu::GpuChannelHost>&&)
  	  0000000001d329e3  base::OnceCallback<void (mojo::StructPtr<chrome::mojom::AvailableOfflineContentSummary>)>::Run(mojo::StructPtr<chrome::mojom::AvailableOfflineContentSummary>)
  	  000000000006a144  <UNKNOWN>
  	  00000000000678d4  <UNKNOWN>
  	  0000000000023838  <UNKNOWN>
  	  000000000001dfd8  <UNKNOWN>
  	  0000000003dc6484  base::debug::BreakDebugger()
  	  0000000003d228ac  logging::LogMessage::~LogMessage()
  	  0000000002a0c3c4  content::CompositorImpl::OnGpuChannelEstablished(scoped_refptr<gpu::GpuChannelHost>)
  	  0000000002a0e36c  void base::internal::FunctorTraits<void (content::CompositorImpl::*)(scoped_refptr<gpu::GpuChannelHost>), void>::Invoke<void (content::CompositorImpl::*)(scoped_refptr<gpu::GpuChannelHost>), base::WeakPtr<content::CompositorImpl>, scoped_refptr<gpu::GpuChannelHost> >(void (content::CompositorImpl::*)(scoped_refptr<gpu::GpuChannelHost>), base::WeakPtr<content::CompositorImpl>&&, scoped_refptr<gpu::GpuChannelHost>&&)
  	  0000000001d329e0  base::OnceCallback<void (mojo::StructPtr<chrome::mojom::AvailableOfflineContentSummary>)>::Run(mojo::StructPtr<chrome::mojom::AvailableOfflineContentSummary>)
  	  00000000027a37fc  content::BrowserGpuChannelHostFactory::EstablishGpuChannel(base::OnceCallback<void (scoped_refptr<gpu::GpuChannelHost>)>)
  	  0000000002a0bc80  content::CompositorImpl::HandlePendingLayerTreeFrameSinkRequest()
  	  0000000002a0bef4  content::CompositorImpl::RequestNewLayerTreeFrameSink()
  	  0000000004ac8084  cc::SingleThreadProxy::RequestNewLayerTreeFrameSink()
  	  0000000003d52074  base::internal::Invoker<base::internal::BindState<void (base::RunLoop::*)(), base::WeakPtr<base::RunLoop> >, void ()>::Run(base::internal::BindStateBase*)
  	  0000000003d12470  base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask*)
  	  0000000003d2d05c  base::MessageLoop::RunTask(base::PendingTask*)
  	  0000000003d2d394  base::MessageLoop::DoWork()
  	  0000000003d307d8  base::MessagePumpForUI::OnNonDelayedLooperCallback()
  	  0000000003d30468  base::(anonymous namespace)::NonDelayedLooperCallback(int, int, void*)

I know that Chrome on Android requires a GPU process – but this code appears fragile in the face of GPU process crashes. Can it be made more robust by retrying the GPU process connection? enne, would that be a workable / worthwhile solution?

I'll suppress this failure on Android in the meantime. (The flakes are Android-specific.)

Cc: kbr@chromium.org ericrk@chromium.org
Owner: piman@chromium.org
This looks different than the normal process for gpu process crashing repeatedly.  GpuProcessHost calls into FallBackToNextGpuMode in https://cs.chromium.org/chromium/src/content/browser/gpu/gpu_data_manager_impl_private.cc?q=fallbacktonextgpumode&sq=package:chromium&g=0&l=860 when the retry limit is hit, and crashes on Android.

This looks like the gpu channel itself is failing to be created? Is this a possible result from the context lost test, where the gpu process goes away mid-creation and we crash here?

+piman
I think it's a possible result from this test. There might be a race condition where the test's renderer process causes the GPU process to crash as another part of Chromium is establishing a GPU channel.

Can we add a retry here, or make this subject to the GPU process crash count limit?

Project Member

Comment 8 by bugdroid1@chromium.org, Sep 10

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2614a046aeb1935f31b11491fe7e98c1df61733e

commit 2614a046aeb1935f31b11491fe7e98c1df61733e
Author: Kenneth Russell <kbr@chromium.org>
Date: Mon Sep 10 20:31:44 2018

Suppress ContextLost_WebGLBlockedAfterJSNavigation flake on Android.

GPU process bringup is flaky while the tests are crashing the GPU
process. Similar suppressions may ultimately be needed for other GPU
process crash tests on Android.

Bug:  882103 
No-Try: True
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: I4326e81a9f8e64066b1ac271fc87e9afa97a8681
Tbr: enne@chromium.org
Reviewed-on: https://chromium-review.googlesource.com/1217403
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Commit-Queue: Kenneth Russell <kbr@chromium.org>
Cr-Commit-Position: refs/heads/master@{#590042}
[modify] https://crrev.com/2614a046aeb1935f31b11491fe7e98c1df61733e/content/test/gpu/gpu_tests/context_lost_expectations.py

Labels: -Pri-1 Pri-2
Downgrading to P2; hopefully that suppression prevents further flakes while we decide whether to improve BrowserGpuChannelHostFactory::EstablishGpuChannel or CompositorImpl::OnGpuChannelEstablished in this area. If we could do a check in EstablishGpuChannel which would indicate whether the call to ContextProviderCommandBuffer::BindToCurrentThread() is likely to fail later, we could throw away and attempt to re-establish the GPU channel.

Cc: danakj@chromium.org
Components: Internals>GPU>Internals
There is already retry logic to BrowserGpuChannelHostFactory, see https://cs.chromium.org/chromium/src/content/browser/gpu/browser_gpu_channel_host_factory.cc?dr=CSs&q=BrowserGpuChanne&sq=package:chromium&g=0&l=156

Basically, if we failed to create the channel and the GPU process is dead (we would necessarily know by then), we'll retry once with a new GPU process. If it failed again with a new GPU process, we assume there is a problem with the GPU process, and we give up altogether. Either way, that's not in cause at this point, because the stack shows that EstablishGpuChannel calls the callback directly, which means that it has a channel that is thinks is valid.

CompositorImpl::OnGpuChannelEstablished will only LOG(FATAL) if the ContextProviderCommandBuffer initialization returns a fatal error, which is typically either:
- user error (asking for invalid config)
- client-side issues (e.g. unable to allocate shared memory)

Trying to connect to a lost channel (therefore failing IPCs) would give to a transient error, which should cause a retry in the compositor logic.

Indeed I see a fatal error, which is logged:

09-10 10:46:21.803  8347  8347 E chromium: [ERROR:command_buffer_proxy_impl.cc(100)] ContextResult::kFatalFailure: Shared memory region is not valid

Looking at the code, this comes from https://cs.chromium.org/chromium/src/gpu/ipc/client/command_buffer_proxy_impl.cc?sq=package:chromium&dr=CSs&g=0&l=100 which means GpuChannel::ShareToGpuProcess fails.

Aaaand, indeed the code at https://cs.chromium.org/chromium/src/gpu/ipc/client/gpu_channel_host.cc?sq=package:chromium&dr=CSs&g=0&l=216 has this IsLost test, which is historical (sharing with a dead process used to cause leaks on Windows, but not any more) and always bothered me... And which cause this case to be treated as a fatal failure. I think we should just remove it. If not, at least treat the IsLost case as a transient failure so that we can go through the retry flow.

This should be an easy change if anyone wants to take a look, otherwise I'll eventually get to it.
Project Member

Comment 12 by bugdroid1@chromium.org, Sep 12

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/037da945faf417a7d4bf374da804eead8ffd4ec8

commit 037da945faf417a7d4bf374da804eead8ffd4ec8
Author: danakj <danakj@chromium.org>
Date: Wed Sep 12 21:09:42 2018

Mark GpuChannelHost::ShareToGpuProcess as a transient failure.

Currently it is marked as fatal because historically sharing with a
dead process caused a memory leak on Windows (see  crbug.com/882103#c10 ).
This is no longer the case, and the failure is not a permanent error
which is unrecoverable or user error, so it should rightfully be marked
as transient. This allows us to create a new context if the gpu process
crashed/was lost while setting things up and the ShareToGpuProcess()
returns a null handle.

R=piman@chromium.org

Bug:  882103 
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: I9426837eea9ebf3f56f6b2534eec9322ed2ee6c8
Reviewed-on: https://chromium-review.googlesource.com/1219992
Commit-Queue: danakj <danakj@chromium.org>
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Reviewed-by: Antoine Labour <piman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#590820}
[modify] https://crrev.com/037da945faf417a7d4bf374da804eead8ffd4ec8/gpu/ipc/client/command_buffer_proxy_impl.cc

TODO here is to have ShareToGpuProcess() cause a Fatal error if we're out of file descriptors and thus won't be able to make shared memory even if we retry.
Project Member

Comment 14 by bugdroid1@chromium.org, Sep 14

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/59b641213f41f0b2e0b8af564ebfe0daf3c5c005

commit 59b641213f41f0b2e0b8af564ebfe0daf3c5c005
Author: Kenneth Russell <kbr@chromium.org>
Date: Fri Sep 14 01:51:29 2018

Unmark ContextLost_WebGLBlockedAfterJSNavigation as flaky.

The underlying failure has been fixed.

Tbr: danakj@chromium.org
Bug:  882103 
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: I195c7a76e020542667d92a642e4631d78bf1c7a3
Reviewed-on: https://chromium-review.googlesource.com/1226460
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Commit-Queue: Kenneth Russell <kbr@chromium.org>
Cr-Commit-Position: refs/heads/master@{#591258}
[modify] https://crrev.com/59b641213f41f0b2e0b8af564ebfe0daf3c5c005/content/test/gpu/gpu_tests/context_lost_expectations.py

Blockedon: 878258
Labels: -Sheriff-Chromium
is this fixed?
Blockedon: -878258
Blocking: 878258
Status: Fixed (was: Assigned)
Yes, I think this specific failure is fixed. Turning around the blocking relationship with another bug. Thanks piman for fixing this.

Sign in to add a comment