Issue 905519 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 19
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 1
Type: Bug

Blocked on:
issue 906166

Blocking:
issue 870116
issue 891059



Flaky OpenGL Error in SharedImageStub Initialization causing random WebGL conformance test failures

Project Member Reported by enga@chromium.org, Nov 15

Issue description

Seen in this tryjob: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_optional_gpu_tests_rel/11658

This shard: https://chromium-swarm.appspot.com/task?id=4129608531e1bd10&refresh=10&show_raw=1

GL_INVALID_OPERATION was seen in FeatureInfo initialization triggered by SharedImageStub during a WebGL conformance tests roll.
It's not clear whether the failure is reliable, but the crash occurred while attempting to run gpu_tests.webgl_conformance_integration_test.WebGLConformanceIntegrationTest.WebglConformance_conformance2_context_methods_2
https://cs.chromium.org/chromium/src/third_party/webgl/src/sdk/tests/conformance2/context/methods-2.html?q=methods-2.html&sq=package:chromium&dr

[20239:20239:1113/205142.267906:FATAL:feature_info.cc(1733)] Check failed: ::gl::g_current_gl_context_tls->Get()->Api->glGetErrorFn() == static_cast<GLuint>(0x0) (1282 vs. 0)
#0 0x7f981cd7ae6f base::debug::StackTrace::StackTrace()
#1 0x7f981ccc47cb logging::LogMessage::~LogMessage()
#2 0x7f981e405409 gpu::gles2::FeatureInfo::InitializeFloatAndHalfFloatFeatures()
#3 0x7f981e3fec78 gpu::gles2::FeatureInfo::InitializeFeatures()
#4 0x7f981e3f778d gpu::SharedImageBackingFactoryGLTexture::SharedImageBackingFactoryGLTexture()
#5 0x7f981e3f6537 gpu::SharedImageFactory::SharedImageFactory()
#6 0x7f981e6161a5 gpu::SharedImageStub::MakeContextCurrentAndCreateFactory()
#7 0x7f981e61659a gpu::SharedImageStub::OnCreateSharedImage()
#8 0x7f981e61644c _ZN3IPC8MessageTI36GpuChannelMsg_CreateSharedImage_MetaNSt3__15tupleIJ38GpuChannelMsg_CreateSharedImage_ParamsEEEvE8DispatchIN3gpu15SharedImageStubES9_vMS9_FvRKS4_EEEbPKNS_7MessageEPT_PT0_PT1_T2_
#9 0x7f981e61633d gpu::SharedImageStub::OnMessageReceived()
#10 0x7f981e60424f IPC::MessageRouter::RouteMessage()
#11 0x7f981e602a61 gpu::GpuChannel::HandleMessageHelper()
#12 0x7f981e60033f gpu::GpuChannel::HandleMessage()
#13 0x7f9819c1e18d _ZN4base8internal7InvokerINS0_9BindStateIMN3net14MDnsClientImpl4CoreEFvRKNSt3__14pairINS6_12basic_stringIcNS6_11char_traitsIcEENS6_9allocatorIcEEEEtEEEJNS_7WeakPtrIS5_EESE_EEEFvvEE3RunEPNS0_13BindStateBaseE
#14 0x7f981e178b93 gpu::Scheduler::RunNextTask()
#15 0x7f9819c06194 _ZN4base8internal7InvokerINS0_9BindStateIMN3net16HostResolverImpl8ProcTaskEFvvEJNS_7WeakPtrIS5_EEEEEFvvEE7RunOnceEPNS0_13BindStateBaseE
#16 0x7f981cccda32 base::debug::TaskAnnotator::RunTask()
#17 0x7f981cccceaf base::MessageLoopImpl::RunTask()
#18 0x7f981cccd452 base::MessageLoopImpl::DoWork()
#19 0x7f981cccff0f base::(anonymous namespace)::WorkSourceDispatch()
#20 0x7f981469ae04 g_main_context_dispatch
#21 0x7f981469b048 <unknown>
#22 0x7f981469b0ec g_main_context_iteration
#23 0x7f981cccfcc2 base::MessagePumpGlib::Run()
#24 0x7f981cccc981 base::MessageLoopImpl::Run()
#25 0x7f981ccf6ce6 base::RunLoop::Run()
#26 0x7f9821354a9a content::GpuMain()
#27 0x7f981c7fc907 content::ContentMainRunnerImpl::Run()
#28 0x7f981c82f30a service_manager::Main()
#29 0x7f981c7fac01 content::ContentMain()
#30 0x7f98198ad1b3 ChromeMain
#31 0x7f98106aef45 __libc_start_main
#32 0x7f98198ad02a _start
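
For reference, the "1282 vs. 0" in the check failure above is the decimal value of the GL error that glGetError returned. A minimal Python sketch of the decoding, using the standard OpenGL error constants (the helper name gl_error_name is made up for illustration and is not part of Chromium):

```python
# Standard OpenGL error enum values.
GL_ERROR_NAMES = {
    0x0000: "GL_NO_ERROR",
    0x0500: "GL_INVALID_ENUM",
    0x0501: "GL_INVALID_VALUE",
    0x0502: "GL_INVALID_OPERATION",
    0x0503: "GL_STACK_OVERFLOW",
    0x0504: "GL_STACK_UNDERFLOW",
    0x0505: "GL_OUT_OF_MEMORY",
    0x0506: "GL_INVALID_FRAMEBUFFER_OPERATION",
}


def gl_error_name(code):
    """Map a numeric glGetError() result to its enum name."""
    return GL_ERROR_NAMES.get(code, "unknown (0x%04X)" % code)


print(gl_error_name(1282))  # GL_INVALID_OPERATION
```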


Traceback (most recent call last):
  _RunGpuTest at content/test/gpu/gpu_tests/gpu_integration_test.py:155
    self.RunActualGpuTest(url, *args)
  RunActualGpuTest at content/test/gpu/gpu_tests/webgl_conformance_integration_test.py:190
    getattr(self, test_name)(test_path, *args[1:])
  _RunConformanceTest at content/test/gpu/gpu_tests/webgl_conformance_integration_test.py:210
    self._CheckTestCompletion()
  _CheckTestCompletion at content/test/gpu/gpu_tests/webgl_conformance_integration_test.py:204
    self._WebGLTestMessages(self.tab))
  fail at .swarming_module/lib/python2.7/unittest/case.py:410
    raise self.failureException(msg)
AssertionError: GPU process crashed during test.

Locals:
  msg : u'GPU process crashed during test.\n'


Marking P1 because this is a test reliability issue.
Going to rerun this to see if it's flaky.
 
Owner: piman@chromium.org
Summary: Flaky OpenGL Error in SharedImageStub Initialization causing random WebGL conformance test failures (was: Flaky(?) OpenGL Error in SharedImageStub Initialization)
This is happening frequently on some of the main GPU bot configurations, and affecting random tests. See:
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20FYI%20Release%20%28NVIDIA%29

and in particular:

https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20FYI%20Release%20%28NVIDIA%29/6358

https://chromium-swarm.appspot.com/task?id=4132ab9cf22ac610&refresh=10&show_raw=1

This is urgent to fix. Fortunately it doesn't seem to be affecting one of the main trybots:
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_chromium_rel_ng?limit=200

piman, could you please find an owner for this?

Blocking: 870116 891059
So, I'm almost certain that this is picking up a GL error from something else, which is probably why it's random.
Blocking: 902406
Agreed. It's difficult to understand where the error might be coming from.

https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20FYI%20Release%20%28NVIDIA%29/6358 is another failure of this mode.

Issue 902406 is tracking general failures on https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20FYI%20Release%20%28NVIDIA%29?limit=200 . The bot's in bad shape right now, probably due to 2 or 3 root causes, and we urgently need to get it back to a green state.

Sending https://chromium-review.googlesource.com/c/chromium/src/+/1338881/ to trybots to see if it picks up anything.
CL in #6 has not yet been able to trigger the crash (neither the added DCHECK nor the original one) on the trybots. Trying to run things locally, but it hasn't repro'ed yet.
Blockedon: 906166
Status: Assigned (was: Untriaged)
GPU Triage: Marking assigned, as piman@ appears to be looking at this.
Issue 906212 has been merged into this issue.
Cc: rjkroege@chromium.org
Labels: Hotlist-PixelWrangler
Interesting. 
906212 is about WebglConformance_conformance2_context_methods_2 failing, whereas here I've been mostly seeing WebglConformance_conformance2_extensions_webgl_multiview failing.

But very interestingly, in either case they follow WebglConformance_conformance2_buffers_buffer_copying_restrictions so the fault is most likely with that particular one.
Although, in both cases it's also the second test, so I wonder if it could be a red herring.
I think I have a theory, which is about how we initialize FeatureInfo in shared contexts: https://cs.chromium.org/chromium/src/gpu/command_buffer/service/feature_info.cc?q=FeatureInfo::InitializeFeatures&sq=package:chromium&g=0&l=440

It's possible that the current client-side context doesn't expose ES3, but the underlying context has it and is shared with an ES3 client. In that case we should still reset PBOs.
I'm hoping that https://chromium-review.googlesource.com/c/chromium/src/+/1340973/ will fix this. I confirmed that WebglConformance_conformance2_buffers_buffer_copying_restrictions does leave GL_PIXEL_UNPACK_BUFFER bound. Interestingly I can't repro the crash locally because I have GL_NV_pixel_buffer_object (but the bots don't seem to have it). But even when I comment out that part, I don't get a crash, I believe because it just depends on the timing of the SharedImageFactory lazy init. But if I check the current binding when making the context current in SharedImageStub ([1], which is where we might be doing a lazy init if it wasn't otherwise done), I do get it to assert, so that gives me confidence.

[1] https://cs.chromium.org/chromium/src/gpu/ipc/service/shared_image_stub.cc?type=cs&q=SharedImageStub&sq=package:chromium&g=0&l=195
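
To illustrate the theory above, here is a toy Python model of the state leak. It is only a sketch, not Chromium code: FakeSharedGLState, make_leaked_state and probe_features are hypothetical names, and it assumes that feature initialization uploads a small probe texture. It relies on one GL rule: while a buffer is bound to GL_PIXEL_UNPACK_BUFFER, texture uploads read from that buffer, and reading past its end generates GL_INVALID_OPERATION.

```python
GL_NO_ERROR = 0
GL_INVALID_OPERATION = 0x0502


class FakeSharedGLState:
    """Toy stand-in for driver state shared by several decoders."""

    def __init__(self):
        self.pixel_unpack_buffer = 0  # 0 means no PBO is bound
        self.buffer_sizes = {}        # buffer id -> size in bytes
        self._error = GL_NO_ERROR

    def bind_pixel_unpack_buffer(self, buffer_id):
        self.pixel_unpack_buffer = buffer_id

    def tex_image_2d(self, width, height, bytes_per_pixel):
        # With a PBO bound, the upload sources its pixels from the buffer.
        if self.pixel_unpack_buffer != 0:
            needed = width * height * bytes_per_pixel
            available = self.buffer_sizes.get(self.pixel_unpack_buffer, 0)
            if needed > available and self._error == GL_NO_ERROR:
                self._error = GL_INVALID_OPERATION

    def get_error(self):
        # Like glGetError: return the recorded error and clear it.
        error, self._error = self._error, GL_NO_ERROR
        return error


def make_leaked_state():
    """State as an ES3 client might leave it: a tiny PBO still bound."""
    gl = FakeSharedGLState()
    gl.buffer_sizes[1] = 4
    gl.bind_pixel_unpack_buffer(1)
    return gl


def probe_features(gl, reset_unpack_buffer):
    """Hypothetical analogue of the probing done during feature init."""
    if reset_unpack_buffer and gl.pixel_unpack_buffer != 0:
        gl.bind_pixel_unpack_buffer(0)
    gl.tex_image_2d(16, 16, 16)  # assumed probe: 16x16 RGBA float texture
    return gl.get_error()


print(probe_features(make_leaked_state(), reset_unpack_buffer=False))  # 1282
print(probe_features(make_leaked_state(), reset_unpack_buffer=True))   # 0
```

In this model the error shows up only when the lazy init happens to run while the other client's binding is still in place, which would be consistent with the failures looking random.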
Let's remember to revert this suppression of WebglConformance_conformance2_context_methods_2, https://chromium-review.googlesource.com/c/1340948 , after the fix lands. Thanks!

Project Member

Comment 18 by bugdroid1@chromium.org, Nov 17

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ae9d975aa70128c5fce24c912ae2c213eeca6e9b

commit ae9d975aa70128c5fce24c912ae2c213eeca6e9b
Author: Antoine Labour <piman@chromium.org>
Date: Sat Nov 17 00:47:45 2018

Reset unpack buffer in FeatureInfo if PBOs are supported by the driver

We need to reset GL_PIXEL_UNPACK_BUFFER in FeatureInfo initialization if it is
non-0. We should always do that regardless of whether or not the current decoder
exposes them, because with shared contexts it is possible that another decoder
uses ES3 whereas the current one doesn't.

Bug:  905519 
Change-Id: I2c2f457fcdd76ca4812ac8d71c8f77d694506a57
Reviewed-on: https://chromium-review.googlesource.com/c/1340973
Commit-Queue: Antoine Labour <piman@chromium.org>
Commit-Queue: Zhenyao Mo <zmo@chromium.org>
Reviewed-by: Zhenyao Mo <zmo@chromium.org>
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Cr-Commit-Position: refs/heads/master@{#609050}
[modify] https://crrev.com/ae9d975aa70128c5fce24c912ae2c213eeca6e9b/gpu/command_buffer/service/feature_info.cc
[modify] https://crrev.com/ae9d975aa70128c5fce24c912ae2c213eeca6e9b/gpu/command_buffer/service/test_helper.cc

Let's make sure we have one clean run, then revert the suppression.
Project Member

Comment 20 by bugdroid1@chromium.org, Nov 19

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/86cc80e3c988388b691a5458ad83287e25afcda5

commit 86cc80e3c988388b691a5458ad83287e25afcda5
Author: Antoine Labour <piman@chromium.org>
Date: Mon Nov 19 18:25:36 2018

Revert "Suppress failling WebGL test on Linux Nvidia"

This reverts commit e73c82bc92e89f4e41e80c15e60ac5524722eeda.

Reason for revert: source problem looks fixed.

Original change's description:
> Suppress failling WebGL test on Linux Nvidia
> 
> WebglConformance_conformance2_context_methods_2 is failing sometimes on Linux Nvidia with
> a GPU crash. Mark the test as flaky.
> 
> BUG= 906212 
> TBR=kbr@chromium.org
> 
> Change-Id: I86bea7e01fde6b26b25d7fa27c8a9f41e5be9df7
> Reviewed-on: https://chromium-review.googlesource.com/c/1340948
> Reviewed-by: Robert Kroeger <rjkroege@chromium.org>
> Commit-Queue: Robert Kroeger <rjkroege@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#609037}

TBR=rjkroege@chromium.org,kbr@chromium.org

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug:  906212 ,  905519 
Change-Id: Ie5918918c972ac81fd0d48e9ac285c0a705a91d8
Reviewed-on: https://chromium-review.googlesource.com/c/1342500
Reviewed-by: Antoine Labour <piman@chromium.org>
Reviewed-by: Robert Kroeger <rjkroege@chromium.org>
Commit-Queue: Antoine Labour <piman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#609357}
[modify] https://crrev.com/86cc80e3c988388b691a5458ad83287e25afcda5/content/test/gpu/gpu_tests/webgl_conformance_expectations.py

Status: Fixed (was: Assigned)
Marking fixed. There are other issues on the bots, but they seem unrelated to this.
Blocking: -902406
Thank you Antoine for getting to the bottom of this thorny problem!