New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 768253 link

Starred by 5 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 2
Type: Bug

Blocked on:
issue 566273
issue 163696
issue 783512
issue 783913
issue 784700
issue 786435
issue 787647



Sign in to add a comment

Renderer startup blocked on browser UI thread

Project Member Reported by dskiba@chromium.org, Sep 25 2017

Issue description

I was investigating why RenderThreadImpl::Create sometimes takes long time on 512 Android Go devices and found that renderer is blocked on browser UI thread.

See attached broker-77ms.trace where Broker::Broker() call blocked renderer for 77ms.

As I understand the following happens:

1. When renderer service is connected, ChildProcessLauncherHelper::PostLaunchOnLauncherThread is called (ChildProcessLauncher.onServiceConnected -> ChildProcessLauncherHelper.nativeOnChildProcessStarted -> PostLaunchOnLauncherThread)

2. PostLaunchOnLauncherThread() posts a task (PostLaunchOnClientThread) to client_thread_id_ (the UI thread)

3. PostLaunchOnClientThread() calls child_process_launcher_->Notify()

4. ChildProcessLauncher::Notify() calls invitation->Send()

5. OutgoingBrokerClientInvitation::Send() eventually calls NodeController::SendBrokerClientInvitation()

6. SendBrokerClientInvitation() posts a task (SendBrokerClientInvitationOnIOThread) to the IO thread

7. SendBrokerClientInvitationOnIOThread() actually sends to invitation

8. Renderer's Broker::Broker() call (called from InitializeMojoIPCChannel()) unblocks

The bottleneck here is step 3, which depends on the load of the UI thread. For example in broker-77ms.trace it waited for GpuChannelHost::Send() to complete.

Sometimes we're lucky - in broker-0ms.trace renderer startup managed to complete earlier, so we posted PostLaunchOnClientThread() task before GpuChannelHost::Send().

See the screenshots - first brown bar is PostLaunchOnLauncherThread(), the last two brown bars are Core::SendBrokerClientInvitation() and BrokerHost::SendChannel().
 
broker-77ms.png
60.1 KB View Download
broker-0ms.png
54.3 KB View Download
broker-77ms.html.gz
1.2 MB Download
broker-0ms.html.gz
1.2 MB Download

Comment 1 by dskiba@chromium.org, Sep 27 2017

I tried to hack things to send invitation from the launcher thread instead: https://chromium-review.googlesource.com/c/chromium/src/+/687821

...only to discover that there are more synchronization points. See the attached trace where I modified ChromeBrowserInitializer to run all startup tasks synchronously, instead of running then in individual UI thread tasks. With this change we start loading a page ~1s earlier, but we then spend ~460ms of that time waiting for something. And then we spend ~280ms in RenderThreadImpl::EstablishGpuChannelSync(), presumably waiting for GpuChannelHost::CreateViewCommandBuffer() to be ran on browser UI thread.

waiting-for-ui.png
67.1 KB View Download
invitation_synctasks.html.gz
1.2 MB Download

Comment 2 by dskiba@chromium.org, Sep 27 2017

Cc: haraken@chromium.org kinuko@chromium.org ksakamoto@chromium.org jam@chromium.org mariakho...@chromium.org
Labels: Performance-Loading LowMemory
Folks, what do you think - is it possible / feasible to get rid of these synchronization points during renderer startup?
Cc: hua...@chromium.org
Owner: dskiba@chromium.org
Status: Assigned (was: Untriaged)

Comment 5 by pasko@chromium.org, Oct 6 2017

Labels: -Pri-3 Pri-2
dskiba: awesome findings!

+1 to the question of feasibility of removing these synchronization points. Potentially worth 200ms wins or so seems worth looking into! P++

Comment 6 by kinuko@chromium.org, Oct 10 2017

Cc: mattcary@chromium.org vmi...@chromium.org
+vmiura@ to see if there's a chance to do more work before we block on RenderThreadImpl::EstablishGpuChannelSync()

Also +mattcary@ to see if we can extend spare renderer process experiment to cover this. Matt, do you think we can capture the same/similar trace with your experiment turned on and see what changes?

Comment 7 by vmi...@chromium.org, Oct 10 2017

Cc: danakj@chromium.org piman@chromium.org
Components: Internals>Compositing
+piman@ +danakj@.  This is something we've discussed in the past.

ProxyMain::SetDeferCommits is set until some time after the channel and contexts are established.  There's an opportunity to make these async.

Comment 8 by kinuko@chromium.org, Oct 10 2017

Cc: tzik@chromium.org
/cc +tzik@, who has looked into the similar issue around GPU channel initialization before.
A spare renderer could move up the renderer initialization by ~200ms (looking at the timing of the GetProcess call from the browser), but we'd have to have a good trigger point in order to spin up the spare renderer.

From the context I'm not sure of where to find such a trigger point. Any suggestions?
MainFrames used to depend on the context capabilities but after https://codereview.chromium.org/2267993002 that stopped being needed, so it should be possible. Scheduler doesn't ask for a main frame without a context either I believe, unless that's already changed.
mattcary@ note that in broker-77 case renderer was started at the beginning of the trace (ChildProcessServiceImpl.bind was called at ~1 second mark). We can start navigation earlier, but that doesn't buy us much renderer waits for the browser at so many points, see for example #1.

What happens is there are some heavy tasks happening on the browser UI thread (like ChromeTabbedActivity.finishNativeInitialization taking ~400ms - this is 512 Go device), so we either need to untie renderer startup from the UI thread, or somehow defer heavy UI tasks until renderer is sufficiently initialized. The second option feels like a hack though.

We already know about some of the sync points, like broker invitation, or EstablishGpuChannelSync.

But there are more. See attached screenshot of the renderer startup, where there in a gap of ~380ms just ~17ms of work happens. Does anyone has any idea of what renderer waits for?
idle-380ms.html.gz
1.3 MB Download
idle-380ms.png
31.9 KB View Download

Comment 12 by piman@chromium.org, Oct 10 2017

RenderThreadImpl::EstablishGpuChannelSync() does not interact with the browser UI thread, it is handled on the IO thread.

Comment 13 by piman@chromium.org, Oct 10 2017

It's not a low-hanging fruit to move the GpuChannelHost and CommandBufferProxyImpl off the main thread (and possibly asynchronous), but it is possible - keeping in mind that there are places that expect this to be synchronous that are exposed to the web: e.g. if you create a WebGL context, whether it is successful or not depends on GpuChannelHost+CommandBufferProxyImpl creation, and that is testable (and tested) by applications. This would still be a win for pages that don't need one of these things, and/or if those things are not created in the loading critical path.

The way I would approach the problem:
1- decouple the LayerTreeFrameSink creation from the main thread - i.e. provide a LayerTreeFrameSinkFactory to LayerTreeHost, that would be expected to be used on the compositor thread, and that could, in the short term, do the round trip to the main thread (instead of the existing LayerTreeHostClient::RequestNewLayerTreeFrameSink logic). It could be shared across LTH's
2- move the RenderThreadImpl::EstablishGpuChannelSync logic to the IO thread (async, but while still providing a way to synchronize from a client thread)
3- add an async version of CommandBufferProxyImpl::Create
4- make the LayerTreeFrameSinkFactory talk directly from the compositor thread to the GpuChannelHost logic on the IO thread
Some bug references for similar thoughts in the area and past work:

> 1- decouple the LayerTreeFrameSink creation from the main thread - i.e. provide a LayerTreeFrameSinkFactory to LayerTreeHost, that would be expected to be used on the compositor thread, and that could, in the short term, do the round trip to the main thread (instead of the existing LayerTreeHostClient::RequestNewLayerTreeFrameSink logic). It could be shared across LTH's

This is related to https://bugs.chromium.org/p/chromium/issues/detail?id=566273

I also attempted and abandoned (higher priority things to do) something similar in the past, in order to share the compositor context between compositor instances. In that case I was still using the main thread, but moving creation of the ContextProvider (the refcounted object) to the compositor thread: https://bugs.chromium.org/p/chromium/issues/detail?id=487471 with some very detailed comments in there.

> 2- move the RenderThreadImpl::EstablishGpuChannelSync logic to the IO thread (async, but while still providing a way to synchronize from a client thread)

This is discussed more specifically in https://bugs.chromium.org/p/chromium/issues/detail?id=566273
> keeping in mind that there are places that expect this to be synchronous that
> are exposed to the web: e.g. if you create a WebGL context, whether it is
> successful or not depends on GpuChannelHost+CommandBufferProxyImpl creation,
> and that is testable (and tested) by applications. This would still be a win
> for pages that don't need one of these things, and/or if those things are not
> created in the loading critical path.

So, creating a WebGL context during page load also seems to be one of the things that can negatively impact our page load times.

For example pages are sometimes making contexts for feature detection, and I measured a 0.5 second delay on MBP.  (Part of this is due to also triggering integrated/discrete GPU switch which we want to fix anyway.)

This seems tough to make non-blocking.
@11

There are 100-200ms of android system process startup for the renderer process between the process being requested from the browser and the renderer initializing. The spare renderer idea is to create the process proactively, in this case before the navigation starts.

That may make the renderer more likely to win the race as it does in the fast path.

This works well in, eg, the case of the omnibox, where we can start the process on focus, which is well before a navigation happens and yet we have high confidence that a navigation will occur.

It's not clear to me that such a trigger point exists in this case. I also agree that removing the synchronization points is going to be a more sure way to fix this problem.

However, Kinkuo asked about the spare renderer, so I wanted to make sure we considered the option.
> For example pages are sometimes making contexts for feature detection, and I measured a 0.5 second delay on MBP.
> 
> This seems tough to make non-blocking.

Does it need a GL context to do that? Can we just pass GpuInfo or something to the renderer when we start it up, since we presumably already know this information in the browser?
> Does it need a GL context to do that? Can we just pass GpuInfo or something to the
> renderer when we start it up, since we presumably already know this information in
> the browser?

In theory we could assume WebGL would succeed based on GpuInfo, create the context asynchronously, and then lose context later if it failed.

Assuming we think that's OK to do, I think it may still take significant work to get all the info the app may query into GpuInfo, and to make the separation between sync and async WebGL APIs.

> sync and async WebGL APIs.

By sync I don't mean we sync with the GPU, but that we have a valid GLES2 context.

Comment 20 by piman@chromium.org, Oct 11 2017

Cc: zmo@chromium.org kbr@chromium.org
> Does it need a GL context to do that? Can we just pass GpuInfo or something to the renderer when we start it up, since we presumably already know this information in the browser?

We don't, since the information requires a round-trip to the GPU process. We could maybe do this dance (assume success, lose context if not) after we got a GpuChannelHost. That said, I believe some of the first things we will do is grab extensions which will be a round-trip again. Not sure if we could skip that.

Comment 21 by kbr@chromium.org, Oct 11 2017

The feature detection scripts we've seen which use WebGL instantiate a context, query and enable an extension at the WebGL level, and make another query. It's not feasible to mock out all of this so that we avoid doing the real context creation under the hood.

So uh.. in addressing https://bugs.chromium.org/p/chromium/issues/detail?id=772574 I am adding another thing where the renderer can block on the browser.

Basically when a compositor client (canvas for example) wants to figure out if it should be in gpu or software mode, it needs to know what mode the compositor will be in. Right now it's local and not well synchronized. I'm adding an async message on RenderThreadImpl startup to the browser, and when canvas tries to make a context it would first wait for the reply to that message if it didn't arrive yet, in order to know its mode.

Is this going to be a problem?

I have this going to the browser UI thread and back, I could in theory put this on the browser IO thread, would it be worth it?
I mean, the good news is that the request is initialized at construction time of RenderThreadImpl, whereas EstablishGpuChannelSync is started at the time canvas decides it wants a context.

Comment 24 by piman@chromium.org, Oct 11 2017

@#22: Browser IO would indeed be better, because there are cases where we block the UI thread waiting for a renderer frame (e.g. resize), so it sounds like this could cause a soft deadlock.
Actually if the RenderProcessHostImpl has a CompositingModeWatcher, then when it runs the process it can pass --disable-gpu-compositing if it's disabled already, and the renderer can just assume Gpu compositing if that isn't present, until it gets notified otherwise - no blocking needed?

Comment 26 by zmo@chromium.org, Oct 12 2017

But that soon will change. Soon compositing can be disabled by two situations:
1) --disable-gpu-compositing is passed from browser to renderer
2) GpuHostChannel::gpu_feature_info_::status_values[GPU_FEATURE_TYPE_GPU_COMPOSITING] == kFeatureStatusBlacklisted. This has to wait until GpuHostChannel is established.

Right now these two bits are baked into one bit (--disable-gpu-compositing), but it's flaky because you can get different value before/after GPU process launch.

So the refactoring I am doing is decouple it into two bits, one from browser to render, one from GPU process to renderer.
I think that's ok, what I'm proposing is that something in the viz process would listen to what you're doing, and would tell the (browser and transitively and directly) renderer if it changed (ie the race you mention), based on the GpuChannelHost in the viz process.

Comment 28 by piman@chromium.org, Oct 31 2017

Something else that shouldn't be too hard to do is to start requesting the GPU channel from the browser side as soon as we're creating the RenderProcessHostImpl instead of waiting for the renderer to do so. We can cache the channel handle in the GpuClient if it comes back before the renderer requests it, in which case we can return faster from the sync IPC.
Note: We should avoid that if the |never_visible| flag is going to be true for the RenderWidget though.
Project Member

Comment 30 by bugdroid1@chromium.org, Nov 9 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/1b7dbaf86705c2a6b3393b9cb22aae41660ec4ca

commit 1b7dbaf86705c2a6b3393b9cb22aae41660ec4ca
Author: Dmitry Skiba <dskiba@chromium.org>
Date: Thu Nov 09 19:32:02 2017

Send child process Mojo invitation from the launcher thread.

Child process is blocked until it receives the invitation, but sending
it requires a hop to the client thread. Renderers are started from the
UI thread on Android, and the UI thread can be busy with other tasks,
delaying renderer initialization.

This CL separates sending Mojo invitation from other child process
notification tasks, and sends the invitation from the launcher thread.

Bug: 768253
Change-Id: Id0c7397b239241b1693e2f9c108c7caba6d910c5
Reviewed-on: https://chromium-review.googlesource.com/687821
Commit-Queue: Dmitry Skiba <dskiba@chromium.org>
Reviewed-by: Ken Rockot <rockot@chromium.org>
Reviewed-by: Bo <boliu@chromium.org>
Cr-Commit-Position: refs/heads/master@{#515240}
[modify] https://crrev.com/1b7dbaf86705c2a6b3393b9cb22aae41660ec4ca/content/browser/child_process_launcher.cc
[modify] https://crrev.com/1b7dbaf86705c2a6b3393b9cb22aae41660ec4ca/content/browser/child_process_launcher.h
[modify] https://crrev.com/1b7dbaf86705c2a6b3393b9cb22aae41660ec4ca/content/browser/child_process_launcher_helper.cc
[modify] https://crrev.com/1b7dbaf86705c2a6b3393b9cb22aae41660ec4ca/content/browser/child_process_launcher_helper.h

Comment 31 by zmo@chromium.org, Nov 10 2017

Blockedon: 783512

Comment 32 by piman@chromium.org, Nov 10 2017

Blockedon: 566273
Blockedon: 783913
Blockedon: 784700
Blockedon: 786435
Labels: -Performance-Loading Performance-Startup
Cc: sadrul@chromium.org
Blockedon: 787647
Owner: ----
Status: Available (was: Assigned)
Cc: mheikal@chromium.org skyos...@chromium.org
Do we know whether this is still a common source of latency? I know mheikal did some work to kick off renderer process even earlier, that may mean this is hiding some of the potential gains.
Blockedon: 163696
Cc: alexclarke@chromium.org
Components: Blink>Scheduling

Sign in to add a comment