Flaky OpenGL Error in SharedImageStub Initialization causing random WebGL conformance test failures
Issue description

Seen in this tryjob: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_optional_gpu_tests_rel/11658
This shard: https://chromium-swarm.appspot.com/task?id=4129608531e1bd10&refresh=10&show_raw=1

GL_INVALID_OPERATION was seen in FeatureInfo initialization triggered by SharedImageStub during a WebGL conformance tests roll. It's not clear whether the failure is reliable, but the crash occurred while attempting to run gpu_tests.webgl_conformance_integration_test.WebGLConformanceIntegrationTest.WebglConformance_conformance2_context_methods_2
https://cs.chromium.org/chromium/src/third_party/webgl/src/sdk/tests/conformance2/context/methods-2.html?q=methods-2.html&sq=package:chromium&dr

[20239:20239:1113/205142.267906:FATAL:feature_info.cc(1733)] Check failed: ::gl::g_current_gl_context_tls->Get()->Api->glGetErrorFn() == static_cast<GLuint>(0x0) (1282 vs. 0)
#0 0x7f981cd7ae6f base::debug::StackTrace::StackTrace()
#1 0x7f981ccc47cb logging::LogMessage::~LogMessage()
#2 0x7f981e405409 gpu::gles2::FeatureInfo::InitializeFloatAndHalfFloatFeatures()
#3 0x7f981e3fec78 gpu::gles2::FeatureInfo::InitializeFeatures()
#4 0x7f981e3f778d gpu::SharedImageBackingFactoryGLTexture::SharedImageBackingFactoryGLTexture()
#5 0x7f981e3f6537 gpu::SharedImageFactory::SharedImageFactory()
#6 0x7f981e6161a5 gpu::SharedImageStub::MakeContextCurrentAndCreateFactory()
#7 0x7f981e61659a gpu::SharedImageStub::OnCreateSharedImage()
#8 0x7f981e61644c _ZN3IPC8MessageTI36GpuChannelMsg_CreateSharedImage_MetaNSt3__15tupleIJ38GpuChannelMsg_CreateSharedImage_ParamsEEEvE8DispatchIN3gpu15SharedImageStubES9_vMS9_FvRKS4_EEEbPKNS_7MessageEPT_PT0_PT1_T2_
#9 0x7f981e61633d gpu::SharedImageStub::OnMessageReceived()
#10 0x7f981e60424f IPC::MessageRouter::RouteMessage()
#11 0x7f981e602a61 gpu::GpuChannel::HandleMessageHelper()
#12 0x7f981e60033f gpu::GpuChannel::HandleMessage()
#13 0x7f9819c1e18d _ZN4base8internal7InvokerINS0_9BindStateIMN3net14MDnsClientImpl4CoreEFvRKNSt3__14pairINS6_12basic_stringIcNS6_11char_traitsIcEENS6_9allocatorIcEEEEtEEEJNS_7WeakPtrIS5_EESE_EEEFvvEE3RunEPNS0_13BindStateBaseE
#14 0x7f981e178b93 gpu::Scheduler::RunNextTask()
#15 0x7f9819c06194 _ZN4base8internal7InvokerINS0_9BindStateIMN3net16HostResolverImpl8ProcTaskEFvvEJNS_7WeakPtrIS5_EEEEEFvvEE7RunOnceEPNS0_13BindStateBaseE
#16 0x7f981cccda32 base::debug::TaskAnnotator::RunTask()
#17 0x7f981cccceaf base::MessageLoopImpl::RunTask()
#18 0x7f981cccd452 base::MessageLoopImpl::DoWork()
#19 0x7f981cccff0f base::(anonymous namespace)::WorkSourceDispatch()
#20 0x7f981469ae04 g_main_context_dispatch
#21 0x7f981469b048 <unknown>
#22 0x7f981469b0ec g_main_context_iteration
#23 0x7f981cccfcc2 base::MessagePumpGlib::Run()
#24 0x7f981cccc981 base::MessageLoopImpl::Run()
#25 0x7f981ccf6ce6 base::RunLoop::Run()
#26 0x7f9821354a9a content::GpuMain()
#27 0x7f981c7fc907 content::ContentMainRunnerImpl::Run()
#28 0x7f981c82f30a service_manager::Main()
#29 0x7f981c7fac01 content::ContentMain()
#30 0x7f98198ad1b3 ChromeMain
#31 0x7f98106aef45 __libc_start_main
#32 0x7f98198ad02a _start

Traceback (most recent call last):
  _RunGpuTest at content/test/gpu/gpu_tests/gpu_integration_test.py:155
    self.RunActualGpuTest(url, *args)
  RunActualGpuTest at content/test/gpu/gpu_tests/webgl_conformance_integration_test.py:190
    getattr(self, test_name)(test_path, *args[1:])
  _RunConformanceTest at content/test/gpu/gpu_tests/webgl_conformance_integration_test.py:210
    self._CheckTestCompletion()
  _CheckTestCompletion at content/test/gpu/gpu_tests/webgl_conformance_integration_test.py:204
    self._WebGLTestMessages(self.tab))
  fail at .swarming_module/lib/python2.7/unittest/case.py:410
    raise self.failureException(msg)
AssertionError: GPU process crashed during test.

Locals:
  msg : u'GPU process crashed during test.\n'

Marking P1 because this is a test reliability issue. Going to rerun this to see if it's flaky.
,
Nov 16
So, I'm almost certain that this is picking up a GL error from something else, which is probably why it's random.
,
Nov 16
Agreed. It's difficult to understand where the error might be coming from. https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20FYI%20Release%20%28NVIDIA%29/6358 is another failure of this kind. Issue 902406 is tracking general failures on https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20FYI%20Release%20%28NVIDIA%29?limit=200 . The bot's in bad shape right now, probably due to 2 or 3 root causes, and we urgently need to get it back to a green state.
,
Nov 16
Sending https://chromium-review.googlesource.com/c/chromium/src/+/1338881/ to trybots to see if it picks up anything.
,
Nov 16
CL in #6 has not yet been able to trigger the crash (neither the added DCHECK nor the original one) on the trybots. Trying to run things locally, but it hasn't repro'ed yet.
,
Nov 16
First failure of this type that I could find is https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Linux%20FYI%20Release%20%28NVIDIA%29/6194
,
Nov 16
GPU Triage: Marking assigned, as piman@ appears to be looking at this.
,
Nov 16
Issue 906212 has been merged into this issue.
,
Nov 16
Interesting. 906212 is about WebglConformance_conformance2_context_methods_2 failing, whereas here I've been mostly seeing WebglConformance_conformance2_extensions_webgl_multiview failing. But very interestingly, in either case they follow WebglConformance_conformance2_buffers_buffer_copying_restrictions so the fault is most likely with that particular one.
,
Nov 16
Although, in both cases it's also the second test, so I wonder if it could be a red herring.
,
Nov 16
I think I have a theory, which is about how we initialize FeatureInfo in shared contexts: https://cs.chromium.org/chromium/src/gpu/command_buffer/service/feature_info.cc?q=FeatureInfo::InitializeFeatures&sq=package:chromium&g=0&l=440 It's possible that the current client-side context doesn't expose ES3, but the underlying context has it and is shared with an ES3 client. In that case we should still reset PBOs.
,
Nov 16
I'm hoping that https://chromium-review.googlesource.com/c/chromium/src/+/1340973/ will fix this. I confirmed that WebglConformance_conformance2_buffers_buffer_copying_restrictions does leave GL_PIXEL_UNPACK_BUFFER bound. Interestingly, I can't repro the crash locally because I have GL_NV_pixel_buffer_object (but bots don't seem to have it). But even when I comment out that part, I don't get a crash, I believe because it just depends on the timing of the SharedImageFactory lazy init. But if I check the current binding when making the context current in SharedImageStub ([1], which is where we might be doing a lazy init if it wasn't otherwise done), I do get it to assert, so that gives me confidence.
[1] https://cs.chromium.org/chromium/src/gpu/ipc/service/shared_image_stub.cc?type=cs&q=SharedImageStub&sq=package:chromium&g=0&l=195
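To make the failure mode in the comment above concrete, here is a minimal C++ sketch of why a leftover GL_PIXEL_UNPACK_BUFFER binding can turn an otherwise valid texture upload into GL_INVALID_OPERATION. It is a toy model, not Chromium or driver code: FakeGL, ProbeTextureUpload, the 16x16 probe size and the 64-byte buffer are all assumptions made for illustration. The only GL behaviour it relies on is that, with a buffer bound to the unpack target, the data argument of a texture upload is treated as an offset into that buffer, and reading past the end of the buffer is an error.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of the GL state involved. Not Chromium or driver code.
constexpr unsigned kGLNoError = 0;
constexpr unsigned kGLInvalidOperation = 0x0502;  // 1282, as in the crash log

struct FakeGL {
  bool unpack_buffer_bound = false;
  std::size_t unpack_buffer_size = 0;
  unsigned error = kGLNoError;

  // Models binding a buffer of the given size to GL_PIXEL_UNPACK_BUFFER.
  void BindUnpackBuffer(std::size_t size_in_bytes) {
    assert(size_in_bytes > 0);
    unpack_buffer_bound = true;
    unpack_buffer_size = size_in_bytes;
  }

  // Models binding 0 to GL_PIXEL_UNPACK_BUFFER.
  void UnbindUnpackBuffer() {
    unpack_buffer_bound = false;
    unpack_buffer_size = 0;
  }

  // Models an RGBA8 texture upload. With an unpack buffer bound, `offset` is
  // a byte offset into that buffer, and a read past its end records
  // GL_INVALID_OPERATION.
  void TexImage2D(std::size_t width, std::size_t height, std::size_t offset) {
    if (!unpack_buffer_bound) return;  // Client-memory path: fine in this model.
    const std::size_t needed = width * height * 4;
    if (offset > unpack_buffer_size || needed > unpack_buffer_size - offset) {
      if (error == kGLNoError) error = kGLInvalidOperation;
    }
  }

  // Like glGetError: returns the recorded error and clears it.
  unsigned GetError() {
    const unsigned result = error;
    error = kGLNoError;
    return result;
  }
};

// Hypothetical stand-in for a feature probe: upload a small texture, then
// check for errors, which is the kind of check that failed in the log.
unsigned ProbeTextureUpload(FakeGL& gl) {
  gl.TexImage2D(16, 16, /*offset=*/0);
  return gl.GetError();
}
```

In this model, ProbeTextureUpload on a fresh FakeGL returns 0. After BindUnpackBuffer(64), which stands in for an earlier test leaving a small buffer bound, the same probe returns 1282. Whether the real crash shows up would then depend on when the probe runs relative to the test that left the binding, which is consistent with the timing dependence described above.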
,
Nov 17
Let's remember to revert this suppression of WebglConformance_conformance2_context_methods_2, https://chromium-review.googlesource.com/c/1340948 , after the fix lands. Thanks!
,
Nov 17
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/ae9d975aa70128c5fce24c912ae2c213eeca6e9b

commit ae9d975aa70128c5fce24c912ae2c213eeca6e9b
Author: Antoine Labour <piman@chromium.org>
Date: Sat Nov 17 00:47:45 2018

Reset unpack buffer in FeatureInfo if PBOs are supported by the driver

We need to reset GL_PIXEL_UNPACK_BUFFER in FeatureInfo initialization if it is non-0. We should always do that regardless of whether or not the current decoder exposes them, because with shared contexts it is possible that another decoder uses ES3 whereas the current one doesn't.

Bug: 905519
Change-Id: I2c2f457fcdd76ca4812ac8d71c8f77d694506a57
Reviewed-on: https://chromium-review.googlesource.com/c/1340973
Commit-Queue: Antoine Labour <piman@chromium.org>
Commit-Queue: Zhenyao Mo <zmo@chromium.org>
Reviewed-by: Zhenyao Mo <zmo@chromium.org>
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Cr-Commit-Position: refs/heads/master@{#609050}

[modify] https://crrev.com/ae9d975aa70128c5fce24c912ae2c213eeca6e9b/gpu/command_buffer/service/feature_info.cc
[modify] https://crrev.com/ae9d975aa70128c5fce24c912ae2c213eeca6e9b/gpu/command_buffer/service/test_helper.cc
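The idea in the commit message can be sketched in C++ as a scoped guard whose gate is the driver's PBO support rather than what the current decoder exposes. This is an illustration under stated assumptions, not the code that landed: FakeGL, ScopedUnpackBufferReset, BindingSeenByProbe and both gate functions are hypothetical names, the pre-fix gate is inferred from the discussion in this thread, and restoring the binding on scope exit is an assumption.

```cpp
#include <cassert>

// Toy model of the shared GL state. Not Chromium code; names are hypothetical.
struct FakeGL {
  bool driver_supports_pbo = false;  // Underlying context has ES3 or a PBO extension.
  unsigned unpack_buffer = 0;        // Current GL_PIXEL_UNPACK_BUFFER binding.
};

// Clears a non-zero unpack buffer binding for the lifetime of the guard and
// restores it afterwards, so probing does not disturb the other decoder.
class ScopedUnpackBufferReset {
 public:
  ScopedUnpackBufferReset(FakeGL& gl, bool enabled) : gl_(gl) {
    if (enabled && gl_.unpack_buffer != 0) {
      saved_ = gl_.unpack_buffer;
      gl_.unpack_buffer = 0;
    }
  }
  ~ScopedUnpackBufferReset() {
    if (saved_ != 0) gl_.unpack_buffer = saved_;
  }
  ScopedUnpackBufferReset(const ScopedUnpackBufferReset&) = delete;
  ScopedUnpackBufferReset& operator=(const ScopedUnpackBufferReset&) = delete;

 private:
  FakeGL& gl_;
  unsigned saved_ = 0;
};

// Returns the binding that a feature probe would run against.
unsigned BindingSeenByProbe(FakeGL& gl, bool reset_enabled) {
  ScopedUnpackBufferReset guard(gl, reset_enabled);
  // With the reset enabled, the probe must never see a leftover binding.
  assert(!reset_enabled || gl.unpack_buffer == 0);
  return gl.unpack_buffer;
}

// Assumed pre-fix gate: reset only when the current decoder exposes ES3.
bool ResetGateBeforeFix(const FakeGL& gl, bool decoder_exposes_es3) {
  return gl.driver_supports_pbo && decoder_exposes_es3;
}

// Gate as described by the commit message: driver support alone decides.
bool ResetGateAfterFix(const FakeGL& gl, bool /*decoder_exposes_es3*/) {
  return gl.driver_supports_pbo;
}
```

Take a FakeGL with driver_supports_pbo set to true and unpack_buffer set to 7, standing in for a binding left by an ES3 decoder that shares the context. For a decoder that does not expose ES3, the assumed pre-fix gate lets the probe see binding 7, while the post-fix gate lets it see 0. In both cases the binding is 7 again once the guard goes out of scope.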
,
Nov 17
Let's make sure we have one clean run, then revert the suppression.
,
Nov 19
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/86cc80e3c988388b691a5458ad83287e25afcda5

commit 86cc80e3c988388b691a5458ad83287e25afcda5
Author: Antoine Labour <piman@chromium.org>
Date: Mon Nov 19 18:25:36 2018

Revert "Suppress failling WebGL test on Linux Nvidia"

This reverts commit e73c82bc92e89f4e41e80c15e60ac5524722eeda.

Reason for revert: source problem looks fixed.

Original change's description:
> Suppress failling WebGL test on Linux Nvidia
>
> WebglConformance_conformance2_context_methods_2 is failing sometimes on Linux Nvidia with
> a GPU crash. Mark the test as flaky.
>
> BUG= 906212
> TBR=kbr@chromium.org
>
> Change-Id: I86bea7e01fde6b26b25d7fa27c8a9f41e5be9df7
> Reviewed-on: https://chromium-review.googlesource.com/c/1340948
> Reviewed-by: Robert Kroeger <rjkroege@chromium.org>
> Commit-Queue: Robert Kroeger <rjkroege@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#609037}

TBR=rjkroege@chromium.org,kbr@chromium.org

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug: 906212 , 905519
Change-Id: Ie5918918c972ac81fd0d48e9ac285c0a705a91d8
Reviewed-on: https://chromium-review.googlesource.com/c/1342500
Reviewed-by: Antoine Labour <piman@chromium.org>
Reviewed-by: Robert Kroeger <rjkroege@chromium.org>
Commit-Queue: Antoine Labour <piman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#609357}

[modify] https://crrev.com/86cc80e3c988388b691a5458ad83287e25afcda5/content/test/gpu/gpu_tests/webgl_conformance_expectations.py
,
Nov 19
Marking fixed. There are other issues on the bots, but they seem unrelated to this.
,
Nov 29
Thank you Antoine for getting to the bottom of this thorny problem!
Comment 1 by kbr@chromium.org, Nov 16
Summary: Flaky OpenGL Error in SharedImageStub Initialization causing random WebGL conformance test failures (was: Flaky(?) OpenGL Error in SharedImageStub Initialization)