Try decoding frames in bursts
Issue description

Currently, after every frame is displayed its picture buffer is returned to the VDA, and that triggers the DXVAVDA to decode the next frame. That means the video decoder has to wake up at the video frame rate, which may be inefficient. It would be possible to decode a new frame only once every 2 or 3 picture buffers are returned, which might help with performance. In theory it would be possible to have a swap chain with 4 or more buffers, present into them in a burst, and then schedule each swap to onscreen for 1, 2, 3 or more vsyncs in the future. I think Edge does this sometimes, but it would be hard to get Chrome to do this, because the renderer doesn't really have a way to schedule frames into the distant future.
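To make the first idea concrete, here is a minimal sketch of a burst trigger: instead of kicking the decoder on every returned picture buffer, accumulate returns and wake the decoder once per N returns. All names here are illustrative, not actual Chromium identifiers.

```cpp
#include <cassert>

// Hypothetical sketch: accumulate returned picture buffers and only wake the
// decoder once kBurstSize buffers are free, decoding that many frames in one
// burst. This trades a little latency for fewer decoder-thread wakeups.
class BurstScheduler {
 public:
  explicit BurstScheduler(int burst_size) : burst_size_(burst_size) {}

  // Called when the client returns a picture buffer. Returns the number of
  // frames to decode now: 0 while accumulating, burst_size_ once the burst
  // threshold is reached.
  int OnPictureBufferReused() {
    if (++free_buffers_ < burst_size_)
      return 0;  // Stay asleep; not enough free buffers for a full burst.
    free_buffers_ = 0;
    return burst_size_;  // Wake the decoder once and decode a burst.
  }

 private:
  const int burst_size_;
  int free_buffers_ = 0;
};
```

With burst_size = 2 the decoder wakes at half the display rate, which is the whole point of the proposal.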
Comment 1 by stanisc@chromium.org, Sep 25 2017
There's a risk that you will starve the renderer with the queue size that we have (4 frames). You'd want to do something like keep a queue with a min of 4 and a max of 8, or something similar. It might work with a min of 2 and a max of 4, but the benefits are obviously reduced.
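The min/max suggestion above is a classic low/high watermark with hysteresis; a small sketch of the policy, with illustrative names (not real Chromium classes):

```cpp
#include <cassert>
#include <cstddef>

// Hysteresis policy sketched from the comment above: the decoder wakes when
// the queue of ready frames drains to |min_frames| and keeps decoding until
// it refills to |max_frames|, then sleeps again. It therefore decodes in
// bursts of (max - min) frames instead of waking at the display rate, and the
// renderer always has at least |min_frames| queued.
struct BurstPolicy {
  size_t min_frames;   // wake the decoder at or below this level
  size_t max_frames;   // sleep again once the queue reaches this level
  bool decoding = false;

  // Called whenever the queue length changes; returns whether the decoder
  // should be running right now.
  bool Update(size_t queued) {
    if (queued <= min_frames)
      decoding = true;   // queue drained: start a burst
    else if (queued >= max_frames)
      decoding = false;  // queue full: end the burst
    return decoding;     // in the hysteresis band, keep the current state
  }
};
```

With min = 4 and max = 8, the renderer never sees fewer than 4 queued frames, which addresses the starvation concern.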
Oct 10 2017
I am still learning the code and trying to understand where the throttling of decoding is done in DXVAVideoDecodeAccelerator.
It looks like DXVAVideoDecodeAccelerator::DecodeInternal does the actual work only once in every five times it is called, and there is no explanation in the code for this logic.
There is a condition at the top of the function that makes it skip the work:

if (OutputSamplesPresent() || !pending_input_buffers_.empty()) {
  pending_input_buffers_.push_back(sample);
  return;
}

The typical pattern looks like:

False 0
True 0
True 1
True 2
True 3

where True/False is the value returned by OutputSamplesPresent() and the number is the number of items in pending_input_buffers_.
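A toy model of that gating condition reproduces the observed pattern: the first call finds no outstanding output sample and does real work; the next four find the output sample still outstanding and just queue their input. The model below is illustrative only (it assumes the output sample stays outstanding across the subsequent calls).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of the condition at the top of DecodeInternal: real work happens
// only when no output sample is outstanding and nothing is already queued;
// otherwise the input is deferred into pending_input_buffers.
struct DecoderModel {
  bool output_sample_present = false;
  std::vector<int> pending_input_buffers;

  // Returns "<OutputSamplesPresent()> <pending size>" as observed on entry,
  // mimicking the values logged in the comment above.
  std::string Decode(int sample) {
    std::string observed = (output_sample_present ? "True " : "False ") +
                           std::to_string(pending_input_buffers.size());
    if (output_sample_present || !pending_input_buffers.empty()) {
      pending_input_buffers.push_back(sample);  // defer; no work done
      return observed;
    }
    output_sample_present = true;  // real decode produces an output sample
    return observed;
  }
};
```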
Oct 10 2017
The pipeline is that the input buffer is decoded into an output sample, which is copied into a picture buffer. There can be at most 1 output sample at a time, and at most kNumPictureBuffers picture buffers at a time. If all the picture buffers fill up, then the output sample will fill up, and then pending input buffers will be stored until the output sample can be copied into an empty picture buffer. So in the steady state, decoding is triggered when the client returns picture buffers so they can be reused.
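That three-stage backpressure can be modeled in a few lines. This is a sketch of the description above, not the real implementation; the constant value and names are illustrative.

```cpp
#include <cassert>
#include <deque>

// Toy model of the pipeline: input buffer -> single output sample -> one of
// kNumPictureBuffers picture buffers. When all picture buffers are in use the
// output sample stays occupied, and further inputs back up in the pending
// queue until the client returns a picture buffer.
struct PipelineModel {
  static constexpr int kNumPictureBuffers = 4;

  bool output_sample_full = false;
  int used_picture_buffers = 0;
  std::deque<int> pending_inputs;

  void Decode(int input) {
    pending_inputs.push_back(input);
    Drain();
  }

  // Client returned a picture buffer; in the steady state this is the only
  // event that unblocks further decoding.
  void ReusePictureBuffer() {
    --used_picture_buffers;
    Drain();
  }

  void Drain() {
    while (true) {
      // Copy the outstanding output sample into a free picture buffer.
      if (output_sample_full && used_picture_buffers < kNumPictureBuffers) {
        output_sample_full = false;
        ++used_picture_buffers;
      }
      // Decode one pending input into the now-free output sample slot.
      if (!output_sample_full && !pending_inputs.empty()) {
        pending_inputs.pop_front();
        output_sample_full = true;
        continue;
      }
      break;
    }
  }
};
```

Feeding it more inputs than buffers shows the steady state: every ReusePictureBuffer() unblocks exactly one decode, which is the per-frame wakeup this bug wants to batch.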
Oct 10 2017
John, thank you for your comments! If there can be at most 1 output sample at a time, why is there a list for output samples? I went through and analyzed the entire pipeline, but it is still unclear where I need to tweak the code to make it possible to decode in bursts. Any suggestions? My impression is that the one-output-sample limitation is unnecessary. What I see happening is that there are several unused picture buffers, but DecodeInternal still stops after producing just one output sample.
Oct 10 2017
In theory there could be more output samples, but we've limited it to 1 to avoid unnecessary memory usage and too much pipelining. Otherwise it would be possible to buffer up a ton of samples that will only be used far in the future. Some MFTs can also deadlock if too many output samples are outstanding at once. I think the output sample is released after the copy from it to the picture buffer is finished; this also avoids unnecessary pipelining and GPU contention from too many copies happening at once. However, the copy should happen very fast (on Windows 10), or could be skipped completely on some drivers. So I don't know why more decodes aren't happening.
Oct 11 2017
> I think the output sample is released after the copy from it to the picture buffer is finished

It turns out I was wrong. It appears that in the steady state at most one picture buffer is available, and only at the beginning of each frame, right after ReusePictureBuffer() is called. The implementation attempts to process more pending samples and release the output buffer sooner, but it can't, because at that time there are no available picture buffers. As an experiment, I will try not processing pending samples on ReusePictureBuffer() unless there are at least 2 picture buffers available.
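The experiment described above can be sketched as follows. The threshold and all names are hypothetical, not the actual patch; the linked prototype CL is the authoritative version.

```cpp
#include <cassert>

// Hypothetical sketch of the experiment: on ReusePictureBuffer(), defer
// processing of pending samples until at least kMinFreeBuffers picture
// buffers are free, then drain them in one burst (one sample per free
// buffer), halving the number of decoder wakeups.
class DeferredDrainPolicy {
 public:
  static constexpr int kMinFreeBuffers = 2;

  // Returns the number of pending samples to process now (a burst), or 0 to
  // keep sleeping and accumulate more free buffers.
  int OnReusePictureBuffer(int pending_samples) {
    ++free_buffers_;
    if (free_buffers_ < kMinFreeBuffers)
      return 0;  // Not enough free buffers yet; skip this wakeup.
    int burst =
        pending_samples < free_buffers_ ? pending_samples : free_buffers_;
    free_buffers_ -= burst;
    return burst;  // Drain up to one pending sample per free buffer.
  }

 private:
  int free_buffers_ = 0;
};
```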
Oct 11 2017
The idea mentioned in #8 does work. I'll measure the power difference tomorrow; however, I doubt that it would save much power - there is still too much jumping between the decoder thread and the main thread.
Nov 1 2017
I am moving to another team and won't be able to work on this. According to my measurements there may be a very small power improvement when decoding two frames at a time - about 0.1 W - but the results weren't consistent. Here is the current prototype: https://chromium-review.googlesource.com/c/chromium/src/+/716707
Nov 2
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot