New issue
Advanced search Search tips

Issue 799002 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

--in-process-gpu can cause "loop to OOM"-hang

Reported by hu...@vewd.com, Jan 4 2018

Issue description

Chrome Version: 63

What steps will reproduce the problem?
1. Apply attached patch to simulate the "bad" timing.
2. $ content_shell --in-process-gpu 
3. Close the window by clicking its X (CTRL+C doesn't reproduce the bug).

What is the expected result?
Browser closes.

What happens instead?
Browse gets stuck. htop shows that it keeps allocating more and more memory. After a minute it will get killed by the Linux OOM killer.

Background:
At Vewd (downstream chromium) this OOM happens randomly on automated tests that quickly spawns and shuts down a --in-process-gpu browser.
 
0001-Demo-of-loop-until-OOM-hang.patch
2.0 KB Download
Project Member

Comment 1 by bugdroid1@chromium.org, Jan 24 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/94ec4045f6d1f18b5c724c43efd0c0f7fc58c535

commit 94ec4045f6d1f18b5c724c43efd0c0f7fc58c535
Author: Hugo Holgersson <hugoh@vewd.com>
Date: Wed Jan 24 12:20:38 2018

Fix OOM crash when the in-process GPU thread starts slowly

Background:
Sometimes (depending on the execution order of browser's
GPU vs IO thread), the IO thread is shutting down while
the GPU thread starts up.

Problem:
BrowserIO can end up in an endless loop where it tries to
create GPU channels => Linux OOM:

SendOutstandingReplies()'s first while-loop never exits.

~GpuProcessHost
 GpuProcessHost::SendOutstandingReplies
  BrowserGpuChannelHostFactory::EstablishRequest::OnEstablishedOnIO
   BrowserGpuChannelHostFactory::EstablishRequest::EstablishOnIO
    GpuProcessHost::EstablishGpuChannel // Queues a channel_request.
     => SendOutstandingReplies sees another request => endless loop.

This happens because of OnEstablishedOnIO's retry-logic.

Solution:
(1) Run EstablishOnIO async. That gives
    SendOutstandingReplies a chance to exit its loop.
(2) Synchronously Stop() the GPU thread before we destruct
    its dependencies (here: tell it to stop starting).

Q: What happens if we don't stop the GPU thread
   before we destruct BrowserChildProcessHostImpl?
A: We get another race (and even a crash if we use
   the bug description's timing):

[15658:15698:FATAL:outgoing_broker_client_invitation.cc(52)]
0 0x7fe6a522e12d base::debug::StackTrace::StackTrace()
1 0x7fe6a522c02c base::debug::StackTrace::StackTrace()
2 0x7fe6a52c51ea logging::LogMessage::~LogMessage()
3 0x7fe69ef8262b mojo::edk::OutgoingBrokerClientInvitation
                 ::ExtractInProcessMessagePipe()
4 0x7fe6a60592d8 content::ChildThreadImpl::Init()
5 0x7fe6a605c032 content::ChildThreadImpl::ChildThreadImpl()
6 0x7fe6a5eb94cf content::GpuChildThread::GpuChildThread()
7 0x7fe6a5eb9e47 content::GpuChildThread::GpuChildThread()
8 0x7fe6a5ee0b23 content::InProcessGpuThread::Init()
9 0x7fe6a54c5b47 base::Thread::ThreadMain()
10 0x7fe6a54a4161 base::(anonymous namespace)::ThreadFunc()
11 0x7fe6ac4d56ba start_thread
12 0x7fe691ab63dd clone

Manual test: Follow the bug description's steps.
Automated test: Run content_browsertests.

Bug:  799002 
Change-Id: Ic3e7234fbcf401f7788181eb0b558a238aebf3bc
Reviewed-on: https://chromium-review.googlesource.com/850358
Commit-Queue: Hugo Holgersson <hugoh@vewd.com>
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Reviewed-by: Antoine Labour <piman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#531505}
[modify] https://crrev.com/94ec4045f6d1f18b5c724c43efd0c0f7fc58c535/content/browser/gpu/DEPS
[modify] https://crrev.com/94ec4045f6d1f18b5c724c43efd0c0f7fc58c535/content/browser/gpu/browser_gpu_channel_host_factory.cc
[modify] https://crrev.com/94ec4045f6d1f18b5c724c43efd0c0f7fc58c535/content/browser/gpu/gpu_process_host.cc
[modify] https://crrev.com/94ec4045f6d1f18b5c724c43efd0c0f7fc58c535/content/browser/gpu/gpu_process_host.h
[add] https://crrev.com/94ec4045f6d1f18b5c724c43efd0c0f7fc58c535/content/browser/gpu/in_process_gpu_thread_browsertests.cc
[modify] https://crrev.com/94ec4045f6d1f18b5c724c43efd0c0f7fc58c535/content/test/BUILD.gn
[modify] https://crrev.com/94ec4045f6d1f18b5c724c43efd0c0f7fc58c535/testing/buildbot/chromium.chromiumos.json
[modify] https://crrev.com/94ec4045f6d1f18b5c724c43efd0c0f7fc58c535/testing/buildbot/chromium.fyi.json
[modify] https://crrev.com/94ec4045f6d1f18b5c724c43efd0c0f7fc58c535/testing/buildbot/chromium.memory.json
[modify] https://crrev.com/94ec4045f6d1f18b5c724c43efd0c0f7fc58c535/testing/buildbot/filters/BUILD.gn
[modify] https://crrev.com/94ec4045f6d1f18b5c724c43efd0c0f7fc58c535/testing/buildbot/filters/mojo.fyi.viz.content_browsertests.filter
[add] https://crrev.com/94ec4045f6d1f18b5c724c43efd0c0f7fc58c535/testing/buildbot/filters/mus.content_browsertests.filter
[modify] https://crrev.com/94ec4045f6d1f18b5c724c43efd0c0f7fc58c535/testing/buildbot/filters/viz.content_browsertests.filter
[modify] https://crrev.com/94ec4045f6d1f18b5c724c43efd0c0f7fc58c535/testing/buildbot/test_suites.pyl

Project Member

Comment 2 by bugdroid1@chromium.org, Jan 24 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4ce9b19bec721fedcdcd06f69829207979ac85b8

commit 4ce9b19bec721fedcdcd06f69829207979ac85b8
Author: Elly Fong-Jones <ellyjones@chromium.org>
Date: Wed Jan 24 14:45:08 2018

Revert "Fix OOM crash when the in-process GPU thread starts slowly"

This reverts commit 94ec4045f6d1f18b5c724c43efd0c0f7fc58c535.

Reason for revert:

Sheriffs think this introduced test crashes in content_browsertests and other browser test targets on debug builds. Every test in these targets crashes instantly like:

[ RUN      ] AccessibilityActionBrowserTest.IncrementDecrementActions
[1/1146] AccessibilityActionBrowserTest.IncrementDecrementActions (CRASHED)
[ RUN      ] AccessibilityActionBrowserTest.CanvasGetImageScale
[2/1146] AccessibilityActionBrowserTest.CanvasGetImageScale (CRASHED)

This is a speculative revert as we haven't yet reproed locally.

Original change's description:
> Fix OOM crash when the in-process GPU thread starts slowly
> 
> Background:
> Sometimes (depending on the execution order of browser's
> GPU vs IO thread), the IO thread is shutting down while
> the GPU thread starts up.
> 
> Problem:
> BrowserIO can end up in an endless loop where it tries to
> create GPU channels => Linux OOM:
> 
> SendOutstandingReplies()'s first while-loop never exits.
> 
> ~GpuProcessHost
>  GpuProcessHost::SendOutstandingReplies
>   BrowserGpuChannelHostFactory::EstablishRequest::OnEstablishedOnIO
>    BrowserGpuChannelHostFactory::EstablishRequest::EstablishOnIO
>     GpuProcessHost::EstablishGpuChannel // Queues a channel_request.
>      => SendOutstandingReplies sees another request => endless loop.
> 
> This happens because of OnEstablishedOnIO's retry-logic.
> 
> Solution:
> (1) Run EstablishOnIO async. That gives
>     SendOutstandingReplies a chance to exit its loop.
> (2) Synchronously Stop() the GPU thread before we destruct
>     its dependencies (here: tell it to stop starting).
> 
> Q: What happens if we don't stop the GPU thread
>    before we destruct BrowserChildProcessHostImpl?
> A: We get another race (and even a crash if we use
>    the bug description's timing):
> 
> [15658:15698:FATAL:outgoing_broker_client_invitation.cc(52)]
> 0 0x7fe6a522e12d base::debug::StackTrace::StackTrace()
> 1 0x7fe6a522c02c base::debug::StackTrace::StackTrace()
> 2 0x7fe6a52c51ea logging::LogMessage::~LogMessage()
> 3 0x7fe69ef8262b mojo::edk::OutgoingBrokerClientInvitation
>                  ::ExtractInProcessMessagePipe()
> 4 0x7fe6a60592d8 content::ChildThreadImpl::Init()
> 5 0x7fe6a605c032 content::ChildThreadImpl::ChildThreadImpl()
> 6 0x7fe6a5eb94cf content::GpuChildThread::GpuChildThread()
> 7 0x7fe6a5eb9e47 content::GpuChildThread::GpuChildThread()
> 8 0x7fe6a5ee0b23 content::InProcessGpuThread::Init()
> 9 0x7fe6a54c5b47 base::Thread::ThreadMain()
> 10 0x7fe6a54a4161 base::(anonymous namespace)::ThreadFunc()
> 11 0x7fe6ac4d56ba start_thread
> 12 0x7fe691ab63dd clone
> 
> Manual test: Follow the bug description's steps.
> Automated test: Run content_browsertests.
> 
> Bug:  799002 
> Change-Id: Ic3e7234fbcf401f7788181eb0b558a238aebf3bc
> Reviewed-on: https://chromium-review.googlesource.com/850358
> Commit-Queue: Hugo Holgersson <hugoh@vewd.com>
> Reviewed-by: Dirk Pranke <dpranke@chromium.org>
> Reviewed-by: Antoine Labour <piman@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#531505}

TBR=boliu@chromium.org,dpranke@chromium.org,piman@chromium.org,hugoh@vewd.com

Change-Id: Id9a9aedaa03a2ade4477f5a9662050e1f2137cd3
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug:  799002 
Reviewed-on: https://chromium-review.googlesource.com/883223
Reviewed-by: Elly Fong-Jones <ellyjones@chromium.org>
Commit-Queue: Elly Fong-Jones <ellyjones@chromium.org>
Cr-Commit-Position: refs/heads/master@{#531527}
[modify] https://crrev.com/4ce9b19bec721fedcdcd06f69829207979ac85b8/content/browser/gpu/DEPS
[modify] https://crrev.com/4ce9b19bec721fedcdcd06f69829207979ac85b8/content/browser/gpu/browser_gpu_channel_host_factory.cc
[modify] https://crrev.com/4ce9b19bec721fedcdcd06f69829207979ac85b8/content/browser/gpu/gpu_process_host.cc
[modify] https://crrev.com/4ce9b19bec721fedcdcd06f69829207979ac85b8/content/browser/gpu/gpu_process_host.h
[delete] https://crrev.com/84d1bb7b2829ea3750e5ebb80e46989853aeb4c3/content/browser/gpu/in_process_gpu_thread_browsertests.cc
[modify] https://crrev.com/4ce9b19bec721fedcdcd06f69829207979ac85b8/content/test/BUILD.gn
[modify] https://crrev.com/4ce9b19bec721fedcdcd06f69829207979ac85b8/testing/buildbot/chromium.chromiumos.json
[modify] https://crrev.com/4ce9b19bec721fedcdcd06f69829207979ac85b8/testing/buildbot/chromium.fyi.json
[modify] https://crrev.com/4ce9b19bec721fedcdcd06f69829207979ac85b8/testing/buildbot/chromium.memory.json
[modify] https://crrev.com/4ce9b19bec721fedcdcd06f69829207979ac85b8/testing/buildbot/filters/BUILD.gn
[modify] https://crrev.com/4ce9b19bec721fedcdcd06f69829207979ac85b8/testing/buildbot/filters/mojo.fyi.viz.content_browsertests.filter
[delete] https://crrev.com/84d1bb7b2829ea3750e5ebb80e46989853aeb4c3/testing/buildbot/filters/mus.content_browsertests.filter
[modify] https://crrev.com/4ce9b19bec721fedcdcd06f69829207979ac85b8/testing/buildbot/filters/viz.content_browsertests.filter
[modify] https://crrev.com/4ce9b19bec721fedcdcd06f69829207979ac85b8/testing/buildbot/test_suites.pyl

Project Member

Comment 3 by bugdroid1@chromium.org, Jan 26 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f5f8cf9fd888af13b6ca20af7d4d100d3c354efa

commit f5f8cf9fd888af13b6ca20af7d4d100d3c354efa
Author: Hugo Holgersson <hugoh@vewd.com>
Date: Fri Jan 26 12:27:08 2018

Reland: Fix OOM crash when the in-process GPU thread starts slowly

This CL got reverted because Linux debug builds crashed
when running browser tests. I reproduced the crashes locally
and fixed them with:

 -    "//content/public/gpu:gpu_sources",
 +    "//content/public/gpu",

> Background:
> Sometimes (depending on the execution order of browser's
> GPU vs IO thread), the IO thread is shutting down while
> the GPU thread starts up.
>
> Problem:
> BrowserIO can end up in an endless loop where it tries to
> create GPU channels => Linux OOM:
>
> SendOutstandingReplies()'s first while-loop never exits.
>
> ~GpuProcessHost
>  GpuProcessHost::SendOutstandingReplies
>   BrowserGpuChannelHostFactory::EstablishRequest::OnEstablishedOnIO
>    BrowserGpuChannelHostFactory::EstablishRequest::EstablishOnIO
>     GpuProcessHost::EstablishGpuChannel // Queues a channel_request.
>      => SendOutstandingReplies sees another request => endless loop.
>
> This happens because of OnEstablishedOnIO's retry-logic.
>
> Solution:
> (1) Run EstablishOnIO async. That gives
>     SendOutstandingReplies a chance to exit its loop.
> (2) Synchronously Stop() the GPU thread before we destruct
>     its dependencies (here: tell it to stop starting).
>
> Q: What happens if we don't stop the GPU thread
>    before we destruct BrowserChildProcessHostImpl?
> A: We get another race (and even a crash if we use
>    the bug description's timing):
>
> [15658:15698:FATAL:outgoing_broker_client_invitation.cc(52)]
> 0 0x7fe6a522e12d base::debug::StackTrace::StackTrace()
> 1 0x7fe6a522c02c base::debug::StackTrace::StackTrace()
> 2 0x7fe6a52c51ea logging::LogMessage::~LogMessage()
> 3 0x7fe69ef8262b mojo::edk::OutgoingBrokerClientInvitation
>                  ::ExtractInProcessMessagePipe()
> 4 0x7fe6a60592d8 content::ChildThreadImpl::Init()
> 5 0x7fe6a605c032 content::ChildThreadImpl::ChildThreadImpl()
> 6 0x7fe6a5eb94cf content::GpuChildThread::GpuChildThread()
> 7 0x7fe6a5eb9e47 content::GpuChildThread::GpuChildThread()
> 8 0x7fe6a5ee0b23 content::InProcessGpuThread::Init()
> 9 0x7fe6a54c5b47 base::Thread::ThreadMain()
> 10 0x7fe6a54a4161 base::(anonymous namespace)::ThreadFunc()
> 11 0x7fe6ac4d56ba start_thread
> 12 0x7fe691ab63dd clone
>
> Manual test: Follow the bug description's steps.
> Automated test: Run content_browsertests.

Bug:  799002 
Change-Id: I1d61475b35abf11c98c6a6a21be6987f6eebdb0f
Reviewed-on: https://chromium-review.googlesource.com/886483
Reviewed-by: Antoine Labour <piman@chromium.org>
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Commit-Queue: Hugo Holgersson <hugoh@vewd.com>
Cr-Commit-Position: refs/heads/master@{#531989}
[modify] https://crrev.com/f5f8cf9fd888af13b6ca20af7d4d100d3c354efa/content/browser/gpu/DEPS
[modify] https://crrev.com/f5f8cf9fd888af13b6ca20af7d4d100d3c354efa/content/browser/gpu/browser_gpu_channel_host_factory.cc
[modify] https://crrev.com/f5f8cf9fd888af13b6ca20af7d4d100d3c354efa/content/browser/gpu/gpu_process_host.cc
[modify] https://crrev.com/f5f8cf9fd888af13b6ca20af7d4d100d3c354efa/content/browser/gpu/gpu_process_host.h
[add] https://crrev.com/f5f8cf9fd888af13b6ca20af7d4d100d3c354efa/content/browser/gpu/in_process_gpu_thread_browsertests.cc
[modify] https://crrev.com/f5f8cf9fd888af13b6ca20af7d4d100d3c354efa/content/test/BUILD.gn
[modify] https://crrev.com/f5f8cf9fd888af13b6ca20af7d4d100d3c354efa/testing/buildbot/chromium.chromiumos.json
[modify] https://crrev.com/f5f8cf9fd888af13b6ca20af7d4d100d3c354efa/testing/buildbot/chromium.fyi.json
[modify] https://crrev.com/f5f8cf9fd888af13b6ca20af7d4d100d3c354efa/testing/buildbot/chromium.memory.json
[modify] https://crrev.com/f5f8cf9fd888af13b6ca20af7d4d100d3c354efa/testing/buildbot/filters/BUILD.gn
[modify] https://crrev.com/f5f8cf9fd888af13b6ca20af7d4d100d3c354efa/testing/buildbot/filters/mojo.fyi.viz.content_browsertests.filter
[add] https://crrev.com/f5f8cf9fd888af13b6ca20af7d4d100d3c354efa/testing/buildbot/filters/mus.content_browsertests.filter
[modify] https://crrev.com/f5f8cf9fd888af13b6ca20af7d4d100d3c354efa/testing/buildbot/filters/viz.content_browsertests.filter
[modify] https://crrev.com/f5f8cf9fd888af13b6ca20af7d4d100d3c354efa/testing/buildbot/test_suites.pyl

Comment 4 by hu...@vewd.com, Jan 29 2018

Status: Fixed (was: Started)

Sign in to add a comment