We need to verify that the InProcessCommandBuffer used for DisplayOutputSurface is performant and correct. It doesn't use the gpu::Scheduler currently.
I tested a WIP CL (http://crrev.com/c/782666) on a GPU-heavy telemetry workload (smoothness.gpu_rasterization.tough_scrolling_cases) on a device similar to an Intel-GPU CrOS device. There was virtually no difference (a slight regression, but that could be noise).
Landing this WIP CL would require additional work because it does not cover Android WebView, which also uses InProcCmdBuffer. AWV is unable to schedule its GPU work via gpu::Scheduler, so we'd have to fork the code paths, which would add complexity and a small amount of risk. I will defer this work until we get benchmarks that prove its utility.
I had a quick look at your CL and found a couple of issues:
1. The command buffer doesn't check Scheduler::ShouldYield. For the IPC command buffer we do this in GpuCommandBufferStub::OnCommandBatchProcessed. If the command buffer is paused, GpuChannel::HandleMessage calls ContinueTask to retry the current task (rough sketch after this list).
2. I don't think this matters a lot, but to match IPC command buffer behavior we should queue up the pending waits only for flushes, not for every task.
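To illustrate point 1, here's the rough shape of the yield check I mean, as a standalone sketch (the scheduler and command buffer types below are stand-ins I made up, not the real Chromium classes or signatures):

#include <functional>

// Stand-in for the scheduler surface the in-process path would talk to.
struct SchedulerStandIn {
  // Returns true when a higher priority sequence wants the GPU thread.
  bool ShouldYield() { return false; }
  // Re-queues the remainder of the current task to run later.
  void ContinueTask(std::function<void()> task) { /* enqueue |task| */ }
};

// Stand-in for the command buffer / decoder side.
struct CommandBufferStandIn {
  bool HasMoreWork() const { return false; }
  void ProcessCommandBatch() { /* execute a bounded slice of commands */ }
};

// Analogue of what the IPC path does in
// GpuCommandBufferStub::OnCommandBatchProcessed: after each batch, check
// ShouldYield and, if asked to yield, hand the rest of the flush back via
// ContinueTask instead of running to completion.
void FlushSketch(SchedulerStandIn* scheduler, CommandBufferStandIn* cmd_buf) {
  while (cmd_buf->HasMoreWork()) {
    cmd_buf->ProcessCommandBatch();
    if (scheduler->ShouldYield()) {
      scheduler->ContinueTask(
          [scheduler, cmd_buf] { FlushSketch(scheduler, cmd_buf); });
      return;  // Yield the GPU thread; the continuation resumes the flush.
    }
  }
}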
Re: webview, we could extend the InProcessCommandBuffer::Service interface to look like the scheduler's interface and split the task queue logic out into a webview-specific implementation.
The main methods we need to forward are:
1. ScheduleTask
2. ContinueTask
3. ShouldYield
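Something along these lines is what I'm picturing for that surface (just a sketch; the class name and the use of std::function are mine, not the real interface):

#include <functional>

// Hypothetical task-scheduling surface on InProcessCommandBuffer::Service.
// The viz/OOP-D implementation would forward these to gpu::Scheduler; a
// webview implementation would back them with its own task queue.
class TaskSchedulingSurface {
 public:
  virtual ~TaskSchedulingSurface() = default;

  // Enqueue a task on this client's sequence.
  virtual void ScheduleTask(std::function<void()> task) = 0;

  // Re-queue the remainder of a task that yielded partway through.
  virtual void ContinueTask(std::function<void()> task) = 0;

  // True if the currently running task should yield to higher priority work.
  virtual bool ShouldYield() = 0;
};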
Thanks Sunny. I reran the tests with your suggested changes on a couple of hardware configurations (Haswell i5 and Celeron). There was no difference across 5 runs of smoothness.gpu_rasterization.tough_scrolling_cases.
I'm closing this bug as WontFix. Feel free to reopen if anyone disagrees.
I think the reason this isn't producing improved performance is that we lack the appropriate metrics (UI metrics) to measure the impact. I'd like to keep this bug open until sadrul@ and his team produce some UI telemetry metrics.
Apologies for the wall of text. I just wanted to pass on what I've learned to Sunny. Feel free to ignore my design suggestions. I had broadened the scope a bit to make InProcCmdBuffer more readable.
InProcCmdBuffer has 3 major uses: Android WebView, various tests (gpu_unittests, cc_unittests, gl_tests), and now OOP-D. The hack I coded only works for OOP-D; it breaks the other two cases.
Current state: for Android WebView, tasks are queued in DeferredGpuCommandService and executed there when HWUI calls into it. For everything else, InProcCmdBuffer tasks are queued in InProcCmdBuffer::tasks_queue_ and periodically executed via tasks posted to the GpuMain thread (via GpuInProcessThreadService). Which mode InProcCmdBuffer uses depends on BlockThreadOnWaitSyncToken.
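In sketch form (simplified, not the actual code), the dispatch is roughly:

#include <functional>
#include <queue>

// Stand-in queues: the real ones live in DeferredGpuCommandService and
// InProcessCommandBuffer respectively.
std::queue<std::function<void()>> webview_deferred_queue;  // drained when HWUI calls in
std::queue<std::function<void()>> gpu_thread_queue;        // drained on the GpuMain thread

void PostDrainToGpuThread() { /* post a task that drains gpu_thread_queue */ }

// Rough sketch of the mode split described above.
void QueueTaskSketch(bool block_thread_on_wait_sync_token,
                     std::function<void()> task) {
  if (block_thread_on_wait_sync_token) {
    webview_deferred_queue.push(std::move(task));  // Android WebView path
  } else {
    gpu_thread_queue.push(std::move(task));        // tests / OOP-D path
    PostDrainToGpuThread();
  }
}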
I think the desired end state would be for tests and OOP-D to use gpu::Scheduler, rather than maintaining a third special case just for tests.
It would be a nice code cleanup to move some of the special-case logic for AWV and OOP-D out of InProcCmdBuffer, especially since the gpu::Scheduler integration increases the differences between the code paths. I had intended to refactor CommandBufferTaskExecutor to this end.
AFAICT, CBTE currently does two things:
(1) It holds some external dependencies shared among many InProcCmdBuffers (OOP-D has one per top-level window). These should be ref-counted to simplify lifetime.
(2) It also has hooks into task processing. These don't need to be shared or ref-counted.
My plan was to split CBTE into these two pieces, where (1) would remain ref-counted and (2) would not; (2) may hold a ref to (1). I was going to push the gpu::Scheduler logic into (2).
This would make (2) a more specific interface. It would include methods like this:
void QueueTask(bool out_of_order,
               bool wait_for_pending_syncs,
               base::OnceClosure task);
void ReleaseFenceSync(uint64_t release);
bool ShouldYield();
void WaitSyncTokenHint(const SyncToken& sync_token);
bool OnWaitSyncToken(const SyncToken& sync_token);
void OnWaitSyncTokenCompleted(const SyncToken& sync_token);
void OnDescheduleUntilFinished();
void OnRescheduleAfterFinished();
Re the last two methods, I did some digging and it turns out they exist just to support WebGL on OSX (https://cs.chromium.org/chromium/src/gpu/GLES2/extensions/CHROMIUM/CHROMIUM_deschedule.txt). The extension gives the client a call that looks a lot like glFinish, but lets the backend process commands from other streams instead of blocking on a real glFinish. I believe the test coverage for this is very poor, so some care is probably needed in the implementation.
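For reference, my reading of the spec is that client usage looks roughly like this (hedged: I'm going off the extension text, and the comments describe the intended behavior rather than anything I've verified):

// ... issue the WebGL context's GL work ...
// ... then, instead of a real glFinish():
glDescheduleUntilFinishedCHROMIUM();
// To the caller this behaves like glFinish(), but the service is allowed to
// deschedule this context and run other streams' commands until the
// associated fence completes.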
I was hoping to remove QueueRepeatableTask. This is a bit of a hack for when FlushOnGpuThread gets descheduled: in that case, ProcessTasksOnGpuThread will not remove FlushOnGpuThread from tasks_queue_ and will re-execute that callback again later (https://cs.chromium.org/chromium/src/gpu/ipc/in_process_command_buffer.cc?rcl=3d4a43a77622b48ac32a73aa0d73ae52d1e74792&l=659). But I'm not 100% sure this is easy to do.
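For anyone else reading, the behavior I'd like to remove looks roughly like this (standalone sketch, not the real ProcessTasksOnGpuThread):

#include <deque>
#include <functional>

struct PendingTask {
  std::function<void()> callback;
  bool repeatable = false;  // set by QueueRepeatableTask in the real code
};

std::deque<PendingTask> task_queue;
bool flush_was_descheduled = false;  // set if the flush yielded partway

// Sketch of the special case: a repeatable task (FlushOnGpuThread) is left
// at the head of the queue if it got descheduled, so the same callback runs
// again on the next pump instead of being popped.
void ProcessOneTask() {
  if (task_queue.empty())
    return;
  PendingTask& task = task_queue.front();
  task.callback();
  if (task.repeatable && flush_was_descheduled)
    return;  // keep it queued; it will re-run later
  task_queue.pop_front();
}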
A lower-effort alternative to refactoring would be to keep the code closer to the hack I had coded (https://chromium-review.googlesource.com/c/chromium/src/+/782666), but pass a gpu::Scheduler in for tests through GpuInProcessThreadHolder, which is test-only code.
Thanks for these suggestions. The minimal task scheduling interface you've identified is exactly what I had in mind.
To start, I'm probably going to put the above interface in CBTE. We can split it up later as needed. Idle task (ScheduleDelayedWork) scheduling will require some work. I've been meaning to move idle tasks to the scheduler, since the IPC command buffers already look at scheduler state indirectly (via the order numbers) to run idle work anyway. Maybe I'll do that first so that GpuInProcessThreadService doesn't have to deal with it.
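Until that happens, delayed work can keep going through the task runner; roughly like this (a sketch with hypothetical method and member names; only PostDelayedTask, BindOnce, FROM_HERE, and TimeDelta are real base APIs here):

// |task_runner_|, |gpu_thread_weak_ptr_| and PerformDelayedWorkOnGpuThread
// are illustrative names, not necessarily the real members.
task_runner_->PostDelayedTask(
    FROM_HERE,
    base::BindOnce(&InProcessCommandBuffer::PerformDelayedWorkOnGpuThread,
                   gpu_thread_weak_ptr_),
    base::TimeDelta::FromMilliseconds(2));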
Re: DescheduleUntilFinished, since that's only needed for WebGL we can choose not to implement it for InProcessCommandBuffer. piman@, WDYT? We can make FeatureInfo only initialize the extension for WebGL, if it doesn't already.
Some observations:
1) Porting idle work scheduling to the scheduler isn't blocking this. We can continue using PostDelayedTask on the task runner for those tasks.
2) DescheduleUntilFinished is already broken because we never schedule polling work when the decoder HasPollingWork, which is where the DescheduleUntilFinished fences are checked. (We only schedule pending queries and other idle work.)
3) We cannot share a single sequence for all clients (InProcCmdBuf) of the CBTE because the sequence can be descheduled by any one of them, so one window waiting on a renderer sync token could block other windows.
4) I'd probably split CBTE into two interfaces:
InProcessCommandBufferService:
InProcessCBTE CreateTaskExecutor();
<everything currently in CBTE except ScheduleTask stuff>
InProcessCommandBufferTaskExecutor:
void SetEnabled(); // only affects immediate (non-delayed) tasks, ignored by webview
void ScheduleTask(base::OnceClosure);
void ContinueTask(base::OnceClosure); // ignored by webview
void ScheduleDelayedWork(base::OnceClosure);
void WaitSyncTokenHint(const SyncToken&);
InProcCmdBuf will use InProcCmdBufService to create a task executor and hold on to it. For viz, the task executor will have its own sequence id from the scheduler, and just forward tasks to the scheduler.
For webview, the task executor will own its own sync point order data, but the tasks will be enqueued in a global queue owned by the service. The service will need to handle order number processing when running tasks (two calls, basically). The other sync token stuff will be NOTREACHED in this case since BlockThreadOnWaitSyncToken is true.
Tests can have a different implementation of the task executor if desired. It might be simpler to just make tests use the scheduler.
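To make the viz side concrete, the forwarding could look roughly like this (a sketch against simplified stand-ins; the real gpu::Scheduler API and the executor interface above will differ in the details):

#include <functional>

// Simplified stand-in for gpu::Scheduler.
struct SchedulerStub {
  int CreateSequence() { return next_sequence_id_++; }
  void ScheduleTask(int sequence_id, std::function<void()> task) {}
  void ContinueTask(int sequence_id, std::function<void()> task) {}
  bool ShouldYield(int sequence_id) { return false; }
  int next_sequence_id_ = 0;
};

// Sketch of the viz-side InProcessCommandBufferTaskExecutor: one scheduler
// sequence per executor (i.e. per InProcCmdBuf, so one window can't stall
// another), with calls forwarded straight to the scheduler. Delayed work
// stays on the task runner for now, per observation 1) above.
class VizTaskExecutorSketch {
 public:
  explicit VizTaskExecutorSketch(SchedulerStub* scheduler)
      : scheduler_(scheduler), sequence_id_(scheduler->CreateSequence()) {}

  void ScheduleTask(std::function<void()> task) {
    scheduler_->ScheduleTask(sequence_id_, std::move(task));
  }
  void ContinueTask(std::function<void()> task) {
    scheduler_->ContinueTask(sequence_id_, std::move(task));
  }
  bool ShouldYield() { return scheduler_->ShouldYield(sequence_id_); }

 private:
  SchedulerStub* scheduler_;
  const int sequence_id_;
};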