Renderer startup blocked on browser UI thread |
||||||||||||||||||||
Issue descriptionI was investigating why RenderThreadImpl::Create sometimes takes long time on 512 Android Go devices and found that renderer is blocked on browser UI thread. See attached broker-77ms.trace where Broker::Broker() call blocked renderer for 77ms. As I understand the following happens: 1. When renderer service is connected, ChildProcessLauncherHelper::PostLaunchOnLauncherThread is called (ChildProcessLauncher.onServiceConnected -> ChildProcessLauncherHelper.nativeOnChildProcessStarted -> PostLaunchOnLauncherThread) 2. PostLaunchOnLauncherThread() posts a task (PostLaunchOnClientThread) to client_thread_id_ (the UI thread) 3. PostLaunchOnClientThread() calls child_process_launcher_->Notify() 4. ChildProcessLauncher::Notify() calls invitation->Send() 5. OutgoingBrokerClientInvitation::Send() eventually calls NodeController::SendBrokerClientInvitation() 6. SendBrokerClientInvitation() posts a task (SendBrokerClientInvitationOnIOThread) to the IO thread 7. SendBrokerClientInvitationOnIOThread() actually sends to invitation 8. Renderer's Broker::Broker() call (called from InitializeMojoIPCChannel()) unblocks The bottleneck here is step 3, which depends on the load of the UI thread. For example in broker-77ms.trace it waited for GpuChannelHost::Send() to complete. Sometimes we're lucky - in broker-0ms.trace renderer startup managed to complete earlier, so we posted PostLaunchOnClientThread() task before GpuChannelHost::Send(). See the screenshots - first brown bar is PostLaunchOnLauncherThread(), the last two brown bars are Core::SendBrokerClientInvitation() and BrokerHost::SendChannel().
,
Sep 27 2017
Folks, what do you think - is it possible / feasible to get rid of these synchronization points during renderer startup?
,
Sep 27 2017
,
Oct 6 2017
,
Oct 6 2017
dskiba: awesome findings! +1 to the question of feasibility of removing these synchronization points. Potentially worth 200ms wins or so seems worth looking into! P++
,
Oct 10 2017
+vmiura@ to see if there's a chance to do more work before we block on RenderThreadImpl::EstablishGpuChannelSync() Also +mattcary@ to see if we can extend spare renderer process experiment to cover this. Matt, do you think we can capture the same/similar trace with your experiment turned on and see what changes?
,
Oct 10 2017
+piman@ +danakj@. This is something we've discussed in the past. ProxyMain::SetDeferCommits is set until some time after the channel and contexts are established. There's an opportunity to make these async.
,
Oct 10 2017
/cc +tzik@, who has looked into the similar issue around GPU channel initialization before.
,
Oct 10 2017
A spare renderer could move up the renderer initialization by ~200ms (looking at the timing of the GetProcess call from the browser), but we'd have to have a good trigger point in order to spin up the spare renderer. From the context I'm not sure of where to find such a trigger point. Any suggestions?
,
Oct 10 2017
MainFrames used to depend on the context capabilities but after https://codereview.chromium.org/2267993002 that stopped being needed, so it should be possible. Scheduler doesn't ask for a main frame without a context either I believe, unless that's already changed.
,
Oct 10 2017
mattcary@ note that in broker-77 case renderer was started at the beginning of the trace (ChildProcessServiceImpl.bind was called at ~1 second mark). We can start navigation earlier, but that doesn't buy us much renderer waits for the browser at so many points, see for example #1. What happens is there are some heavy tasks happening on the browser UI thread (like ChromeTabbedActivity.finishNativeInitialization taking ~400ms - this is 512 Go device), so we either need to untie renderer startup from the UI thread, or somehow defer heavy UI tasks until renderer is sufficiently initialized. The second option feels like a hack though. We already know about some of the sync points, like broker invitation, or EstablishGpuChannelSync. But there are more. See attached screenshot of the renderer startup, where there in a gap of ~380ms just ~17ms of work happens. Does anyone has any idea of what renderer waits for?
,
Oct 10 2017
RenderThreadImpl::EstablishGpuChannelSync() does not interact with the browser UI thread, it is handled on the IO thread.
,
Oct 10 2017
It's not a low-hanging fruit to move the GpuChannelHost and CommandBufferProxyImpl off the main thread (and possibly asynchronous), but it is possible - keeping in mind that there are places that expect this to be synchronous that are exposed to the web: e.g. if you create a WebGL context, whether it is successful or not depends on GpuChannelHost+CommandBufferProxyImpl creation, and that is testable (and tested) by applications. This would still be a win for pages that don't need one of these things, and/or if those things are not created in the loading critical path. The way I would approach the problem: 1- decouple the LayerTreeFrameSink creation from the main thread - i.e. provide a LayerTreeFrameSinkFactory to LayerTreeHost, that would be expected to be used on the compositor thread, and that could, in the short term, do the round trip to the main thread (instead of the existing LayerTreeHostClient::RequestNewLayerTreeFrameSink logic). It could be shared across LTH's 2- move the RenderThreadImpl::EstablishGpuChannelSync logic to the IO thread (async, but while still providing a way to synchronize from a client thread) 3- add an async version of CommandBufferProxyImpl::Create 4- make the LayerTreeFrameSinkFactory talk directly from the compositor thread to the GpuChannelHost logic on the IO thread
,
Oct 10 2017
Some bug references for similar thoughts in the area and past work: > 1- decouple the LayerTreeFrameSink creation from the main thread - i.e. provide a LayerTreeFrameSinkFactory to LayerTreeHost, that would be expected to be used on the compositor thread, and that could, in the short term, do the round trip to the main thread (instead of the existing LayerTreeHostClient::RequestNewLayerTreeFrameSink logic). It could be shared across LTH's This is related to https://bugs.chromium.org/p/chromium/issues/detail?id=566273 I also attempted and abandoned (higher priority things to do) something similar in the past, in order to share the compositor context between compositor instances. In that case I was still using the main thread, but moving creation of the ContextProvider (the refcounted object) to the compositor thread: https://bugs.chromium.org/p/chromium/issues/detail?id=487471 with some very detailed comments in there. > 2- move the RenderThreadImpl::EstablishGpuChannelSync logic to the IO thread (async, but while still providing a way to synchronize from a client thread) This is discussed more specifically in https://bugs.chromium.org/p/chromium/issues/detail?id=566273
,
Oct 11 2017
> keeping in mind that there are places that expect this to be synchronous that > are exposed to the web: e.g. if you create a WebGL context, whether it is > successful or not depends on GpuChannelHost+CommandBufferProxyImpl creation, > and that is testable (and tested) by applications. This would still be a win > for pages that don't need one of these things, and/or if those things are not > created in the loading critical path. So, creating a WebGL context during page load also seems to be one of the things that can negatively impact our page load times. For example pages are sometimes making contexts for feature detection, and I measured a 0.5 second delay on MBP. (Part of this is due to also triggering integrated/discrete GPU switch which we want to fix anyway.) This seems tough to make non-blocking.
,
Oct 11 2017
@11 There are 100-200ms of android system process startup for the renderer process between the process being requested from the browser and the renderer initializing. The spare renderer idea is to create the process proactively, in this case before the navigation starts. That may make the renderer more likely to win the race as it does in the fast path. This works well in, eg, the case of the omnibox, where we can start the process on focus, which is well before a navigation happens and yet we have high confidence that a navigation will occur. It's not clear to me that such a trigger point exists in this case. I also agree that removing the synchronization points is going to be a more sure way to fix this problem. However, Kinkuo asked about the spare renderer, so I wanted to make sure we considered the option.
,
Oct 11 2017
> For example pages are sometimes making contexts for feature detection, and I measured a 0.5 second delay on MBP. > > This seems tough to make non-blocking. Does it need a GL context to do that? Can we just pass GpuInfo or something to the renderer when we start it up, since we presumably already know this information in the browser?
,
Oct 11 2017
> Does it need a GL context to do that? Can we just pass GpuInfo or something to the > renderer when we start it up, since we presumably already know this information in > the browser? In theory we could assume WebGL would succeed based on GpuInfo, create the context asynchronously, and then lose context later if it failed. Assuming we think that's OK to do, I think it may still take significant work to get all the info the app may query into GpuInfo, and to make the separation between sync and async WebGL APIs.
,
Oct 11 2017
> sync and async WebGL APIs. By sync I don't mean we sync with the GPU, but that we have a valid GLES2 context.
,
Oct 11 2017
> Does it need a GL context to do that? Can we just pass GpuInfo or something to the renderer when we start it up, since we presumably already know this information in the browser? We don't, since the information requires a round-trip to the GPU process. We could maybe do this dance (assume success, lose context if not) after we got a GpuChannelHost. That said, I believe some of the first things we will do is grab extensions which will be a round-trip again. Not sure if we could skip that.
,
Oct 11 2017
The feature detection scripts we've seen which use WebGL instantiate a context, query and enable an extension at the WebGL level, and make another query. It's not feasible to mock out all of this so that we avoid doing the real context creation under the hood.
,
Oct 11 2017
So uh.. in addressing https://bugs.chromium.org/p/chromium/issues/detail?id=772574 I am adding another thing where the renderer can block on the browser. Basically when a compositor client (canvas for example) wants to figure out if it should be in gpu or software mode, it needs to know what mode the compositor will be in. Right now it's local and not well synchronized. I'm adding an async message on RenderThreadImpl startup to the browser, and when canvas tries to make a context it would first wait for the reply to that message if it didn't arrive yet, in order to know its mode. Is this going to be a problem? I have this going to the browser UI thread and back, I could in theory put this on the browser IO thread, would it be worth it?
,
Oct 11 2017
I mean, the good news is that the request is initialized at construction time of RenderThreadImpl, whereas EstablishGpuChannelSync is started at the time canvas decides it wants a context.
,
Oct 11 2017
@#22: Browser IO would indeed be better, because there are cases where we block the UI thread waiting for a renderer frame (e.g. resize), so it sounds like this could cause a soft deadlock.
,
Oct 12 2017
Actually if the RenderProcessHostImpl has a CompositingModeWatcher, then when it runs the process it can pass --disable-gpu-compositing if it's disabled already, and the renderer can just assume Gpu compositing if that isn't present, until it gets notified otherwise - no blocking needed?
,
Oct 12 2017
But that soon will change. Soon compositing can be disabled by two situations: 1) --disable-gpu-compositing is passed from browser to renderer 2) GpuHostChannel::gpu_feature_info_::status_values[GPU_FEATURE_TYPE_GPU_COMPOSITING] == kFeatureStatusBlacklisted. This has to wait until GpuHostChannel is established. Right now these two bits are baked into one bit (--disable-gpu-compositing), but it's flaky because you can get different value before/after GPU process launch. So the refactoring I am doing is decouple it into two bits, one from browser to render, one from GPU process to renderer.
,
Oct 12 2017
I think that's ok, what I'm proposing is that something in the viz process would listen to what you're doing, and would tell the (browser and transitively and directly) renderer if it changed (ie the race you mention), based on the GpuChannelHost in the viz process.
,
Oct 31 2017
Something else that shouldn't be too hard to do is to start requesting the GPU channel from the browser side as soon as we're creating the RenderProcessHostImpl instead of waiting for the renderer to do so. We can cache the channel handle in the GpuClient if it comes back before the renderer requests it, in which case we can return faster from the sync IPC.
,
Oct 31 2017
Note: We should avoid that if the |never_visible| flag is going to be true for the RenderWidget though.
,
Nov 9 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/1b7dbaf86705c2a6b3393b9cb22aae41660ec4ca commit 1b7dbaf86705c2a6b3393b9cb22aae41660ec4ca Author: Dmitry Skiba <dskiba@chromium.org> Date: Thu Nov 09 19:32:02 2017 Send child process Mojo invitation from the launcher thread. Child process is blocked until it receives the invitation, but sending it requires a hop to the client thread. Renderers are started from the UI thread on Android, and the UI thread can be busy with other tasks, delaying renderer initialization. This CL separates sending Mojo invitation from other child process notification tasks, and sends the invitation from the launcher thread. Bug: 768253 Change-Id: Id0c7397b239241b1693e2f9c108c7caba6d910c5 Reviewed-on: https://chromium-review.googlesource.com/687821 Commit-Queue: Dmitry Skiba <dskiba@chromium.org> Reviewed-by: Ken Rockot <rockot@chromium.org> Reviewed-by: Bo <boliu@chromium.org> Cr-Commit-Position: refs/heads/master@{#515240} [modify] https://crrev.com/1b7dbaf86705c2a6b3393b9cb22aae41660ec4ca/content/browser/child_process_launcher.cc [modify] https://crrev.com/1b7dbaf86705c2a6b3393b9cb22aae41660ec4ca/content/browser/child_process_launcher.h [modify] https://crrev.com/1b7dbaf86705c2a6b3393b9cb22aae41660ec4ca/content/browser/child_process_launcher_helper.cc [modify] https://crrev.com/1b7dbaf86705c2a6b3393b9cb22aae41660ec4ca/content/browser/child_process_launcher_helper.h
,
Nov 10 2017
,
Nov 10 2017
,
Nov 13 2017
,
Nov 14 2017
,
Nov 17 2017
,
Nov 17 2017
,
Nov 19 2017
,
Nov 22 2017
,
May 22 2018
,
Jun 6 2018
Do we know whether this is still a common source of latency? I know mheikal did some work to kick off renderer process even earlier, that may mean this is hiding some of the potential gains.
,
Aug 16
|
||||||||||||||||||||
►
Sign in to add a comment |
||||||||||||||||||||
Comment 1 by dskiba@chromium.org
, Sep 27 201767.1 KB
67.1 KB View Download
1.2 MB
1.2 MB Download