"gpu_tests.context_lost_integration_test.ContextLostIntegrationTest.ContextLost_WebGLBlockedAfterJSNavigation" is flaky |
Issue description"gpu_tests.context_lost_integration_test.ContextLostIntegrationTest.ContextLost_WebGLBlockedAfterJSNavigation" is flaky. This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label. We have detected 3 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNydwsSBUZsYWtlImxncHVfdGVzdHMuY29udGV4dF9sb3N0X2ludGVncmF0aW9uX3Rlc3QuQ29udGV4dExvc3RJbnRlZ3JhdGlvblRlc3QuQ29udGV4dExvc3RfV2ViR0xCbG9ja2VkQWZ0ZXJKU05hdmlnYXRpb24M. Flaky tests should be disabled within 30 minutes unless culprit CL is found and reverted. Please see more details here: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriffing-bug-queues#triaging-auto-filed-flakiness-bugs This flaky test/step was previously tracked in issue 832886 .
Sep 10
kbr: I think this is the issue that https://chromium.googlesource.com/chromium/src/+/3689f3e57b85ca683cc6b61841715fbd0c6af559 was trying to fix, right? It seems there's still a problem. :(
Sep 10
Yes, that's the bug that the previous fix was intended to address. This failure: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android-marshmallow-arm64-rel/81637 has the following stack trace, which looks like the cause:

signal 6 (SIGABRT), code -6 in tid 8347 (chromium.chrome)
pid: 8347, tid: 8347, name: chromium.chrome >>> org.chromium.chrome <<<
signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr --------
[FATAL:compositor_impl_android.cc(1133)] Fatal error making Gpu context

Stack Trace:
RELADDR FUNCTION
0000000003d224a3 logging::LogMessage::~LogMessage()
0000000002a0c3c7 content::CompositorImpl::OnGpuChannelEstablished(scoped_refptr<gpu::GpuChannelHost>)
0000000002a0e36f void base::internal::FunctorTraits<void (content::CompositorImpl::*)(scoped_refptr<gpu::GpuChannelHost>), void>::Invoke<void (content::CompositorImpl::*)(scoped_refptr<gpu::GpuChannelHost>), base::WeakPtr<content::CompositorImpl>, scoped_refptr<gpu::GpuChannelHost> >(void (content::CompositorImpl::*)(scoped_refptr<gpu::GpuChannelHost>), base::WeakPtr<content::CompositorImpl>&&, scoped_refptr<gpu::GpuChannelHost>&&)
0000000001d329e3 base::OnceCallback<void (mojo::StructPtr<chrome::mojom::AvailableOfflineContentSummary>)>::Run(mojo::StructPtr<chrome::mojom::AvailableOfflineContentSummary>)
000000000006a144 <UNKNOWN>
00000000000678d4 <UNKNOWN>
0000000000023838 <UNKNOWN>
000000000001dfd8 <UNKNOWN>
0000000003dc6484 base::debug::BreakDebugger()
0000000003d228ac logging::LogMessage::~LogMessage()
0000000002a0c3c4 content::CompositorImpl::OnGpuChannelEstablished(scoped_refptr<gpu::GpuChannelHost>)
0000000002a0e36c void base::internal::FunctorTraits<void (content::CompositorImpl::*)(scoped_refptr<gpu::GpuChannelHost>), void>::Invoke<void (content::CompositorImpl::*)(scoped_refptr<gpu::GpuChannelHost>), base::WeakPtr<content::CompositorImpl>, scoped_refptr<gpu::GpuChannelHost> >(void (content::CompositorImpl::*)(scoped_refptr<gpu::GpuChannelHost>), base::WeakPtr<content::CompositorImpl>&&, scoped_refptr<gpu::GpuChannelHost>&&)
0000000001d329e0 base::OnceCallback<void (mojo::StructPtr<chrome::mojom::AvailableOfflineContentSummary>)>::Run(mojo::StructPtr<chrome::mojom::AvailableOfflineContentSummary>)
00000000027a37fc content::BrowserGpuChannelHostFactory::EstablishGpuChannel(base::OnceCallback<void (scoped_refptr<gpu::GpuChannelHost>)>)
0000000002a0bc80 content::CompositorImpl::HandlePendingLayerTreeFrameSinkRequest()
0000000002a0bef4 content::CompositorImpl::RequestNewLayerTreeFrameSink()
0000000004ac8084 cc::SingleThreadProxy::RequestNewLayerTreeFrameSink()
0000000003d52074 base::internal::Invoker<base::internal::BindState<void (base::RunLoop::*)(), base::WeakPtr<base::RunLoop> >, void ()>::Run(base::internal::BindStateBase*)
0000000003d12470 base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask*)
0000000003d2d05c base::MessageLoop::RunTask(base::PendingTask*)
0000000003d2d394 base::MessageLoop::DoWork()
0000000003d307d8 base::MessagePumpForUI::OnNonDelayedLooperCallback()
0000000003d30468 base::(anonymous namespace)::NonDelayedLooperCallback(int, int, void*)

I know that Chrome on Android requires a GPU process, but this code appears fragile in the face of GPU process crashes. Can it be made more robust by retrying the GPU process connection? enne, would that be a workable / worthwhile solution?

I'll suppress this failure on Android in the meantime. (The flakes are Android-specific.)
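For reference, a minimal, self-contained sketch of what the crashing path above appears to do. Everything here is a stand-in (ContextResult, BindContextToChannel, the channel_lost flag), not the real content/ or gpu/ classes, and the control flow is an assumption based on the LOG(FATAL) message and the stack trace, not the actual implementation.

// Simplified model of the failing path; all names are stubs.
#include <cstdio>
#include <cstdlib>

enum class ContextResult { kSuccess, kTransientFailure, kFatalFailure };

// Stand-in for binding a context provider to the freshly established channel.
ContextResult BindContextToChannel(bool channel_lost) {
  // If the GPU process died while the channel was being set up (which is what
  // the context-lost tests provoke), the bind fails; whether that failure is
  // reported as transient or fatal is the crux of this bug.
  return channel_lost ? ContextResult::kFatalFailure : ContextResult::kSuccess;
}

// Stand-in for CompositorImpl::OnGpuChannelEstablished().
void OnGpuChannelEstablished(bool channel_lost) {
  ContextResult result = BindContextToChannel(channel_lost);
  if (result == ContextResult::kFatalFailure) {
    // Mirrors the LOG(FATAL) in compositor_impl_android.cc: the browser
    // process aborts, which is the SIGABRT in the stack trace above.
    std::fprintf(stderr, "Fatal error making Gpu context\n");
    std::abort();
  }
  if (result == ContextResult::kTransientFailure) {
    // A transient result would instead lead to re-requesting the channel.
    std::puts("transient failure: retry GPU channel establishment");
    return;
  }
  std::puts("context bound successfully");
}

int main() {
  OnGpuChannelEstablished(/*channel_lost=*/false);  // normal startup
  OnGpuChannelEstablished(/*channel_lost=*/true);   // aborts, as in this bug
  return 0;
}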
Sep 10
This looks different from the normal flow for the GPU process crashing repeatedly. GpuProcessHost calls into FallBackToNextGpuMode in https://cs.chromium.org/chromium/src/content/browser/gpu/gpu_data_manager_impl_private.cc?q=fallbacktonextgpumode&sq=package:chromium&g=0&l=860 when the retry limit is hit, and that crashes on Android. Here it looks like the GPU channel itself is failing to be created. Is this a possible result of the context lost test, where the GPU process goes away mid-creation and we crash here?

+piman
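As a rough illustration of that "normal" repeated-crash flow, a small self-contained sketch; the crash limit, globals, and function bodies are simplified assumptions, not the real GpuProcessHost / GpuDataManagerImplPrivate code.

#include <cstdio>
#include <cstdlib>

// Assumed retry limit for illustration only; the real constant is not
// reproduced here.
constexpr int kMaxGpuProcessCrashes = 3;
constexpr bool kIsAndroid = true;

int g_gpu_crash_count = 0;

// Stand-in for GpuDataManagerImplPrivate::FallBackToNextGpuMode().
void FallBackToNextGpuMode() {
  if (kIsAndroid) {
    // There is no software-compositing fallback on Android, so hitting the
    // retry limit is fatal for the browser process.
    std::fprintf(stderr, "GPU process unusable, no fallback on Android\n");
    std::abort();
  }
  std::puts("falling back to software compositing");
}

// Stand-in for the browser noticing a GPU process crash.
void OnGpuProcessCrashed() {
  if (++g_gpu_crash_count >= kMaxGpuProcessCrashes)
    FallBackToNextGpuMode();
  else
    std::puts("relaunching GPU process");
}

int main() {
  for (int i = 0; i < kMaxGpuProcessCrashes; ++i)
    OnGpuProcessCrashed();
  return 0;
}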
Sep 10
I think it's a possible result from this test. There might be a race condition where the test's renderer process causes the GPU process to crash as another part of Chromium is establishing a GPU channel. Can we add a retry here, or make this subject to the GPU process crash count limit?
Sep 10
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2614a046aeb1935f31b11491fe7e98c1df61733e

commit 2614a046aeb1935f31b11491fe7e98c1df61733e
Author: Kenneth Russell <kbr@chromium.org>
Date: Mon Sep 10 20:31:44 2018

Suppress ContextLost_WebGLBlockedAfterJSNavigation flake on Android.

GPU process bringup is flaky while the tests are crashing the GPU process. Similar suppressions may ultimately be needed for other GPU process crash tests on Android.

Bug: 882103
No-Try: True
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: I4326e81a9f8e64066b1ac271fc87e9afa97a8681
Tbr: enne@chromium.org
Reviewed-on: https://chromium-review.googlesource.com/1217403
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Commit-Queue: Kenneth Russell <kbr@chromium.org>
Cr-Commit-Position: refs/heads/master@{#590042}

[modify] https://crrev.com/2614a046aeb1935f31b11491fe7e98c1df61733e/content/test/gpu/gpu_tests/context_lost_expectations.py
Sep 10
Downgrading to P2; hopefully that suppression prevents further flakes while we decide whether to improve BrowserGpuChannelHostFactory::EstablishGpuChannel or CompositorImpl::OnGpuChannelEstablished in this area. If we could add a check in EstablishGpuChannel indicating whether the later call to ContextProviderCommandBuffer::BindToCurrentThread() is likely to fail, we could throw away the channel and attempt to re-establish it.
Sep 11
There is already retry logic in BrowserGpuChannelHostFactory; see https://cs.chromium.org/chromium/src/content/browser/gpu/browser_gpu_channel_host_factory.cc?dr=CSs&q=BrowserGpuChanne&sq=package:chromium&g=0&l=156

Basically, if we failed to create the channel and the GPU process is dead (we would necessarily know by then), we'll retry once with a new GPU process. If it fails again with a new GPU process, we assume there is a problem with the GPU process, and we give up altogether.

Either way, that's not the cause at this point, because the stack shows that EstablishGpuChannel calls the callback directly, which means that it has a channel that it thinks is valid. CompositorImpl::OnGpuChannelEstablished will only LOG(FATAL) if the ContextProviderCommandBuffer initialization returns a fatal error, which is typically either:
- user error (asking for an invalid config)
- a client-side issue (e.g. unable to allocate shared memory)

Trying to connect to a lost channel (and therefore failing IPCs) would give a transient error, which should cause a retry in the compositor logic.

Indeed I see a fatal error, which is logged:

09-10 10:46:21.803 8347 8347 E chromium: [ERROR:command_buffer_proxy_impl.cc(100)] ContextResult::kFatalFailure: Shared memory region is not valid

Looking at the code, this comes from https://cs.chromium.org/chromium/src/gpu/ipc/client/command_buffer_proxy_impl.cc?sq=package:chromium&dr=CSs&g=0&l=100 which means GpuChannel::ShareToGpuProcess fails. Aaaand, indeed, the code at https://cs.chromium.org/chromium/src/gpu/ipc/client/gpu_channel_host.cc?sq=package:chromium&dr=CSs&g=0&l=216 has this IsLost test, which is historical (sharing with a dead process used to cause leaks on Windows, but not any more) and has always bothered me... and which causes this case to be treated as a fatal failure.

I think we should just remove it. If not, we should at least treat the IsLost case as a transient failure so that we can go through the retry flow. This should be an easy change if anyone wants to take a look; otherwise I'll eventually get to it.
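To make the suggested change concrete, a minimal standalone sketch; the types are stubs (not the real gpu/ipc classes), the member names are assumptions, and it only illustrates reporting the lost-channel case as a transient failure instead of a fatal one.

#include <cstdio>

enum class ContextResult { kSuccess, kTransientFailure, kFatalFailure };

struct SharedMemoryRegion {
  bool valid = false;
};

struct GpuChannelHost {
  bool lost = false;
  // Stand-in for GpuChannelHost::ShareToGpuProcess(): hands back an invalid
  // region if the channel has already been lost.
  SharedMemoryRegion ShareToGpuProcess(SharedMemoryRegion region) const {
    return lost ? SharedMemoryRegion{} : region;
  }
};

// Stand-in for the check in the command buffer setup that logs
// "ContextResult::kFatalFailure: Shared memory region is not valid".
ContextResult InitializeCommandBuffer(const GpuChannelHost& channel) {
  SharedMemoryRegion shared =
      channel.ShareToGpuProcess(SharedMemoryRegion{true});
  if (!shared.valid) {
    std::puts("Shared memory region is not valid");
    // Before: ContextResult::kFatalFailure, which aborts the browser on
    // Android. Proposed: a lost channel is recoverable, so report a transient
    // failure and let the existing retry flow establish a new channel.
    return ContextResult::kTransientFailure;
  }
  return ContextResult::kSuccess;
}

int main() {
  GpuChannelHost lost_channel;
  lost_channel.lost = true;
  ContextResult result = InitializeCommandBuffer(lost_channel);
  std::printf("result is transient: %d\n",
              result == ContextResult::kTransientFailure);
  return 0;
}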
Sep 12
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/037da945faf417a7d4bf374da804eead8ffd4ec8

commit 037da945faf417a7d4bf374da804eead8ffd4ec8
Author: danakj <danakj@chromium.org>
Date: Wed Sep 12 21:09:42 2018

Mark GpuChannelHost::ShareToGpuProcess as a transient failure.

Currently it is marked as fatal because historically sharing with a dead process caused a memory leak on Windows (see crbug.com/882103#c10). This is no longer the case, and the failure is not a permanent error which is unrecoverable or user error, so it should rightfully be marked as transient.

This allows us to create a new context if the gpu process crashed/was lost while setting things up and the ShareToGpuProcess() returns a null handle.

R=piman@chromium.org

Bug: 882103
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: I9426837eea9ebf3f56f6b2534eec9322ed2ee6c8
Reviewed-on: https://chromium-review.googlesource.com/1219992
Commit-Queue: danakj <danakj@chromium.org>
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Reviewed-by: Antoine Labour <piman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#590820}

[modify] https://crrev.com/037da945faf417a7d4bf374da804eead8ffd4ec8/gpu/ipc/client/command_buffer_proxy_impl.cc
Sep 12
The TODO here is to have ShareToGpuProcess() cause a fatal error if we're out of file descriptors and thus won't be able to make shared memory even if we retry.
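A hypothetical illustration of that distinction; the ShareResult type, the allocation_failed flag, and ClassifyShareFailure are invented for this sketch and are not the real API.

#include <cstdio>

enum class ContextResult { kSuccess, kTransientFailure, kFatalFailure };

struct ShareResult {
  bool valid = false;
  // e.g. shared-memory creation failed locally because the process is out of
  // file descriptors; retrying with a new GPU channel cannot fix that.
  bool allocation_failed = false;
};

ContextResult ClassifyShareFailure(const ShareResult& result) {
  if (result.valid)
    return ContextResult::kSuccess;
  if (result.allocation_failed)
    return ContextResult::kFatalFailure;  // out of fds: retrying won't help
  // Otherwise the channel went away mid-setup; a fresh channel may succeed.
  return ContextResult::kTransientFailure;
}

int main() {
  ShareResult out_of_fds;
  out_of_fds.allocation_failed = true;
  ShareResult lost_channel;  // neither valid nor allocation_failed
  std::printf("out of fds -> fatal: %d\n",
              ClassifyShareFailure(out_of_fds) == ContextResult::kFatalFailure);
  std::printf("lost channel -> transient: %d\n",
              ClassifyShareFailure(lost_channel) ==
                  ContextResult::kTransientFailure);
  return 0;
}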
Sep 14
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/59b641213f41f0b2e0b8af564ebfe0daf3c5c005

commit 59b641213f41f0b2e0b8af564ebfe0daf3c5c005
Author: Kenneth Russell <kbr@chromium.org>
Date: Fri Sep 14 01:51:29 2018

Unmark ContextLost_WebGLBlockedAfterJSNavigation as flaky.

The underlying failure has been fixed.

Tbr: danakj@chromium.org
Bug: 882103
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: I195c7a76e020542667d92a642e4631d78bf1c7a3
Reviewed-on: https://chromium-review.googlesource.com/1226460
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Commit-Queue: Kenneth Russell <kbr@chromium.org>
Cr-Commit-Position: refs/heads/master@{#591258}

[modify] https://crrev.com/59b641213f41f0b2e0b8af564ebfe0daf3c5c005/content/test/gpu/gpu_tests/context_lost_expectations.py
Sep 26
Is this fixed?