
Issue 754436

Starred by 3 users

Issue metadata

Status: WontFix
Owner: ----
Closed: Nov 2
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 3
Type: Bug

Blocking:
issue 590342




Try decoding frames in bursts

Project Member Reported by jbau...@chromium.org, Aug 10 2017

Issue description

Currently, after every frame is displayed, its picture buffer is returned to the VDA, and that triggers the DXVAVDA to decode the next frame. That means the video decoder has to wake up at the video framerate, which may be inefficient.

It would be possible to decode a new frame once every 2 or 3 picture buffers are returned, which might help with performance.
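
A minimal sketch of that idea, assuming a hypothetical helper named BurstTrigger that is not part of DXVAVideoDecodeAccelerator: it counts returned picture buffers and only reports work once a full burst has accumulated.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not Chromium code): counts returned picture buffers
// and tells the caller when a decode burst should start.
class BurstTrigger {
 public:
  explicit BurstTrigger(std::size_t burst_size) : burst_size_(burst_size) {
    assert(burst_size_ > 0);
  }

  // Call once per returned picture buffer. Returns the number of frames to
  // decode now: 0 while still accumulating, burst_size once enough buffers
  // have come back.
  std::size_t OnPictureBufferReturned() {
    if (++returned_ < burst_size_)
      return 0;
    returned_ = 0;
    return burst_size_;
  }

 private:
  const std::size_t burst_size_;
  std::size_t returned_ = 0;
};
```

With a burst size of 3, the decoder would be woken on every third returned buffer instead of on every one.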


In theory it would be possible to have a swapchain with 4 or more buffers, present into them in a burst, and schedule each swap to go onscreen 1, 2, 3 or more vsyncs in the future. I think Edge does this sometimes, but it would be hard to get Chrome to do this, because the renderer doesn't really have a way to schedule frames into the distant future.
 
Cc: stanisc@chromium.org
I am going to look at this.
There's a risk that you will starve the renderer with the queue size that we have (4 frames). You'd want to keep a queue with a min of 4 and a max of 8, or something similar. It might work with 2 and 4, but the benefits are obviously reduced.
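
One way to read the min/max suggestion is as a pair of watermarks. The sketch below assumes a hypothetical WatermarkPolicy class (not an existing Chromium type): the decoder stays idle while more than `low` frames are queued, and refills up to `high` in a single burst once the queue drops to `low`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical policy object (not Chromium code). The values 4 and 8 from
// the comment above would be passed as low and high.
class WatermarkPolicy {
 public:
  WatermarkPolicy(std::size_t low, std::size_t high) : low_(low), high_(high) {
    assert(low_ < high_);
  }

  // Given the number of decoded frames currently queued for the renderer,
  // returns how many frames to decode now.
  std::size_t FramesToDecode(std::size_t queued) const {
    if (queued > low_)
      return 0;             // Enough frames queued; let the decoder sleep.
    return high_ - queued;  // Refill to the high watermark in one burst.
  }

 private:
  const std::size_t low_;
  const std::size_t high_;
};
```

With low = 4 and high = 8, the renderer always has at least 4 frames to draw from, and the decoder wakes up roughly once per 4 displayed frames.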
Status: Started (was: Available)
I am still learning the code and trying to understand where the throttling of decoding is done in DXVAVideoDecodeAccelerator.

It looks like DXVAVideoDecodeAccelerator::DecodeInternal does the actual work only once in every 5 calls. The code does not explain the reasoning behind this logic.

There is a condition at the top of the function that makes it skip the work:

  if (OutputSamplesPresent() || !pending_input_buffers_.empty()) {
    pending_input_buffers_.push_back(sample);
    return;
  }

The typical pattern looks like:
#  False 0
    True 0
    True 1
    True 2
    True 3
where True/False is the value returned by OutputSamplesPresent() and the number is the number of items in pending_input_buffers_.
The pipeline is that the input buffer is decoded into an output sample, which is copied into a picture buffer.

There can be at most 1 output sample at a time, and at most kNumPictureBuffers picture buffers at a time. If all the picture buffers fill up, then the output sample will fill up, then pending input buffers will be stored until the output sample can be copied into an empty picture buffer.

So in the steady state decoding is triggered by when the client returns picture buffers so they can be reused.
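
The behaviour described above can be sketched as a toy model. This is a simplification under stated assumptions, not the real DXVAVideoDecodeAccelerator: PipelineModel and its members are hypothetical names, the copy into a picture buffer is treated as instantaneous, and there is at most one output sample.

```cpp
#include <cassert>
#include <cstddef>

// Toy model (not Chromium code) of: input buffer -> output sample (max 1)
// -> picture buffer (max num_picture_buffers).
struct PipelineModel {
  std::size_t free_picture_buffers;
  std::size_t pending_inputs = 0;
  std::size_t decodes = 0;
  bool output_sample_present = false;

  explicit PipelineModel(std::size_t num_picture_buffers)
      : free_picture_buffers(num_picture_buffers) {
    assert(num_picture_buffers > 0);
  }

  // A new input buffer arrives from the client.
  void QueueInput() {
    ++pending_inputs;
    Pump();
  }

  // The client returns a picture buffer after displaying it.
  void ReusePictureBuffer() {
    ++free_picture_buffers;
    Pump();
  }

  // Makes as much progress as the two limits allow.
  void Pump() {
    for (;;) {
      if (output_sample_present && free_picture_buffers > 0) {
        // Copy the output sample into a free picture buffer.
        --free_picture_buffers;
        output_sample_present = false;
      } else if (!output_sample_present && pending_inputs > 0) {
        // Decode one pending input into the single output sample slot.
        --pending_inputs;
        output_sample_present = true;
        ++decodes;
      } else {
        break;  // Blocked until a buffer is returned or input arrives.
      }
    }
  }
};
```

In this model, once all picture buffers and the output sample are occupied, further inputs only accumulate in the pending list, and each ReusePictureBuffer() call unblocks exactly one more decode.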

Comment 5 Deleted

John, thank you for your comments!

If there can be at most 1 output sample at a time, why is there a list for output samples?

I went through the entire pipeline, but it is still unclear where I need to tweak the code to make it possible to decode in bursts. Any suggestions?

My impression is that the one-output-sample limitation is unnecessary. What I see happening is that there are several unused picture buffers but DecodeInternal still stops after producing just one output sample.
In theory there could be more output samples, but we've limited it to 1 to avoid unnecessary memory usage and too much pipelining. Otherwise it would be possible to buffer up a ton of samples that will only be used far in the future. Some MFTs can also deadlock if too many output samples are outstanding at once.

I think the output sample is released after the copy from it to the picture buffer is finished. This also avoids unnecessary pipelining and GPU contention from too many copies happening at once. However this should happen very fast (on Windows 10) or the copy could be skipped completely on some drivers. So I don't know why more decodes aren't happening.
> I think the output sample is released after the copy from it to the picture buffer is finished

It turns out I was wrong. It appears that in steady state at most one picture buffer is available, and only at the beginning of each frame, after ReusePictureBuffer is called.

The implementation attempts to process more pending samples and release the output buffer sooner, but it can't, because at that time there are no available picture buffers.

What I will try as an experiment is to not process pending samples on ReusePictureBuffer() unless there are at least 2 picture buffers available.
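
To illustrate the effect of such a threshold, here is a rough sketch assuming a toy model in which the client returns exactly one picture buffer per displayed frame. CountDecoderWakeups and min_available are hypothetical names, not part of the real code.

```cpp
#include <cassert>
#include <cstddef>

// Toy model (not Chromium code): counts how often pending samples would be
// processed if processing is deferred until at least min_available picture
// buffers have been returned.
std::size_t CountDecoderWakeups(std::size_t frames, std::size_t min_available) {
  assert(min_available > 0);
  std::size_t available = 0;
  std::size_t wakeups = 0;
  for (std::size_t i = 0; i < frames; ++i) {
    ++available;  // One ReusePictureBuffer() call per displayed frame.
    if (available < min_available)
      continue;   // Defer: not enough free picture buffers yet.
    ++wakeups;    // Process pending samples, one per available buffer.
    available = 0;
  }
  return wakeups;
}
```

Under these assumptions, a threshold of 2 halves the number of times pending samples are processed compared with a threshold of 1. It says nothing about thread hops elsewhere in the pipeline.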

The idea mentioned in #8 does work.
I'll measure the power difference tomorrow, but I doubt it will save much power: there is still too much jumping between the decoder thread and the main thread.
decoding_tasks.PNG
Owner: stanisc@chromium.org
Cc: -stanisc@chromium.org
Owner: ----
Status: Available (was: Started)
I am moving to another team and won't be able to work on this.

According to my measurements there may be a very small power improvement when decoding two frames at a time, about 0.1 W, but the results weren't consistent.

Here is the current prototype:
https://chromium-review.googlesource.com/c/chromium/src/+/716707

Project Member

Comment 12 by sheriffbot@chromium.org, Nov 2

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Cc: tmathmeyer@chromium.org liber...@chromium.org
Status: WontFix (was: Untriaged)
