Issue 92388 Texture Upload in Chrome is slow
Starred by 18 users Reported by toj...@gmail.com, Aug 10 2011
Status: Fixed
Owner:
Closed: Nov 2011
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Restricted
  • Only users with Commit permission may comment.



Chrome Version       : 14.0.835.29 dev
URLs (if applicable) : http://jsperf.com/webgl-teximage2d-vs-texsubimage2d/2

Other browsers tested:
    Safari 5: OK
    Firefox 7: OK

What steps will reproduce the problem?
1. Push an image to video memory via gl.texImage2D or gl.texSubImage2D

What is the expected result?
Performance of this feature should be fast enough to allow uploading a medium-sized texture (1024x1024) without severely disrupting a realtime application. 

What happens instead?
Safari and Firefox both perform reasonably fast in this scenario, but Chrome lags far behind both in terms of upload speed. Calling gl.texImage2D on a 1024x1024 texture currently blocks the main thread for ~50ms.

 
Mergedinto: 91208
Status: Duplicate
Comment 2 by enne@chromium.org, Aug 11 2011
Cc: vangelis@chromium.org apatrick@chromium.org gman@chromium.org kbr@chromium.org nduca@chromium.org
Labels: -Area-Undefined Feature-GPU-WebGL
Mergedinto:
Status: Unconfirmed
I'm not sure this is a duplicate.  This bug concerns uploads via image and not uploads via video.
Comment 3 by nduca@chromium.org, Aug 11 2011
Agreed. I think that our slowness is worse once we start rendering because the current compositor scheduler causes the gpu to spend most of its time blocked on vsync. :'(
Comment 5 by enne@chromium.org, Aug 11 2011
Cc: enne@chromium.org zmo@chromium.org
Comment 6 by nduca@chromium.org, Aug 11 2011
Studied this and found that it is not GPU upload performance related. In fact, the actual GL call that uploads this texture takes 0.016ms.

The cost here is as follows:
 GraphicsContext3D::extractImageData                    :      6.675ms
 DecodeAlphaNotPremultiplied                            :      5.254ms
 GraphicsContext3D::packPixels                          :      1.345ms
 CommandBufferProxy::FlushSync                          :      0.836ms

Basically, we're doing a TON of work doing texture conversion.
Comment 7 by nduca@chromium.org, Aug 11 2011
The flushsync cost is closer to 2ms. My bad.


slow_texture_uploads.json
8.9 MB View Download
slow_texture_uploads.png
71.3 KB View Download
Comment 8 by gman@chromium.org, Aug 12 2011
Status: Available
So there are several issues

#1) The transfer buffer is only 1 MB by default, and because a few bytes are used at the front, the largest texture you can upload without stalling is something like 256x255. Upload 2 textures and you'd get a stall

#2) Even if we make the transfer buffer bigger we still have to do a copy

These are security issues. We take security seriously, and to do that we can't let the process running JavaScript have direct access to the GPU. So uploads of textures are slower for us. Rendering in general should be faster, though. Most apps don't need to upload textures every frame.
Does this mean that the biggest POT texture that can be copied in a single run is 256x128? That seems bad. I don't see how this particular choice of buffer size can be considered a security issue (or maybe your "these are" means something else).

Nonetheless, doesn't this only impact the FlushSync cost (i.e. around 13% of the time according to Comment #6/7)? Or is this something else?

Afaik, Tojiro's app doesn't upload textures every frame, but it reasonably expects that when it does upload a new texture, it won't take forever. That seems a very reasonable expectation.
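The transfer-buffer limit gman describes can be sketched with a little arithmetic. The 1 MB buffer size comes from his comment; the 64-byte header and the helper below are purely illustrative assumptions, not the actual Chrome implementation:

```javascript
// Largest power-of-two square texture that fits in one transfer buffer.
// bufferBytes and headerBytes are illustrative, not Chrome's exact values.
function largestPotSquare(bufferBytes, headerBytes, bytesPerPixel) {
  const usable = bufferBytes - headerBytes;
  let side = 1;
  // Grow the square while the next power-of-two size still fits.
  while ((side * 2) * (side * 2) * bytesPerPixel <= usable) {
    side *= 2;
  }
  return side;
}

// With a 1 MiB buffer, a small header, and 4-byte RGBA pixels:
console.log(largestPotSquare(1 << 20, 64, 4)); // 256
```

Note that a 512x512 RGBA texture is exactly 1 MiB, so any header overhead at all pushes the largest fitting POT square down to 256x256; anything bigger has to be split across transfers, which stalls.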
Comment 10 by nduca@chromium.org, Aug 12 2011
If we can figure out why the conversion is happening all the time, we can get a huge perf win back, at least on this benchmark as well as any other examples that get tripped up on format conversions. At that point, you'll still have the perf issues described by gman --- however, at that point, the upload cost will be closer to 2ms per upload rather than 14.

To clarify, addressing the actual FlushSync hitch ("the biggest texture that can be copied is <1mb") is something we need to do. It's just hard work -- specifically, we need to build a more gooder ;) memory manager that will provide additional upload space for an app that needs it, while also controlling it enough that it scales back when backgrounded and prevents a runaway app from draining the system's shared memory resources.

Net/net, we agree: it's slow. We need to make it faster. There are some easy things to do first, then some hard things. :)
I'm doing a code search for DecodeAlphaNotPremultiplied and I get no hits. Nat, any idea where that trace entry comes from?

Comment 12 by nduca@chromium.org, Aug 12 2011
It's a scope inside GraphicsContext3D::getImageData, platform/graphics/skia/GraphicsContext3DSkia.cpp:59


Comment 13 by toj...@gmail.com, Aug 12 2011
In response to Fernando's comment: The RAGE app mentioned in the blog post probably does, on average, somewhere around 5-6 texSubImage2D calls per second, but I do force it to only allow one texSubImage2D call per frame. This is, admittedly, unusually high for anything outside of maybe video processing, but still something that should be reasonable. I know that maintaining a solid 60fps in those circumstances is likely a stretch, but I've also seen Safari and Firefox maintain a steady 58+ fps on the same demo, whereas Chrome can struggle to stay in the 40s depending on the scene.

I wish I could post a live demo for you guys to do some more in-depth benchmarking, but I can't put the RAGE resource files on a live server. :(
Comment 14 by pya...@gmail.com, Aug 12 2011
I have made the exact same observation about speed as Tojiro, with the video stuff. The performance issues are closely related.

For me, doing live video to texture stuff, and for Tojiro and his usage, this is an extremely big deal.

Getting your performance in line with at least Firefox is mandatory. Getting it in line with actual machine capabilities is highly recommended.
RE 'Most apps don't need to upload textures every frame.' >
There are a ton of valid reasons to be frequently uploading textures, not just limited to games. Photo/video editors, browsing UIs (stores, content libraries, search results, etc), mapping, data visualization, etc. Slow texture uploads kill almost all of these scenarios, or severely limit their real world implementations. In most of these cases, getting faster texture upload is actually more important than a faster frame time - being locked at 30fps but never dropping a frame is almost always a much better user experience than 60fps with dropped frames every second or two.
Comment 16 by kbr@chromium.org, Aug 12 2011
Cc: jbau...@chromium.org
I agree that there is no good reason for Chrome to be substantially slower for texture uploads than other browsers, security issues notwithstanding. Any mandatory memcpy into shared memory should not be the bottleneck from the above measurements. I will look into this as soon as possible, unless someone else gets to it first.

Doing gl.pixelStorei(gl.UNPACK_PREMULTIPLY_ALPHA_WEBGL, true); in the test case would help us avoid having to decode the image twice. Still not sure why GraphicsContext3D::extractImageData is so expensive, though.
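For context, this is what premultiplication does to a single pixel; a plain-JS sketch of the math, not WebKit's actual decode path:

```javascript
// Premultiplied alpha: each color channel is scaled by alpha/255.
// Browsers normally bake this into decoded images, which is why asking
// WebGL for non-premultiplied data forces a second decode of the image.
function premultiply(r, g, b, a) {
  const f = a / 255;
  return [Math.round(r * f), Math.round(g * f), Math.round(b * f), a];
}

console.log(premultiply(255, 128, 0, 128)); // [128, 64, 0, 128]
```

Because the scaling is lossy (channels are rounded), the original non-premultiplied values can't simply be recovered afterwards, hence the redecode.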
Comment 18 by nduca@chromium.org, Aug 12 2011
For this...


Title     : GraphicsContext3D::getImageData
Start     : 3521.813 ms
Duration  : 24.232 ms
Args      :
 id       : 0
 extra    : skiaImage=0x7fd43c072d20 ignoreGammaAndColorProfile=0 hasAlpha=1 premultiplyAlpha=0 image->data=0x7fd43bf6cc80


Title     : PackPixels
Start     : 3546.061 ms
Duration  : 3.787 ms
Args      :
 id       : 0
 extra    : sourceDataFormat=9, w=1024, h=1024, sua=0, destinationFormat=1908 destinationType=1401 aop=0
Comment 19 by zmo@google.com, Aug 13 2011
GraphicsContext3D::extractImageData is expensive because of the re-decoding.  With gl.pixelStorei(gl.UNPACK_PREMULTIPLY_ALPHA_WEBGL, true), this should be gone.

PackPixels could also be optimized in certain situations, like no conversion/no padding needed, then it's just a memcpy instead of row-by-row or pixel-by-pixel packing.
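A sketch of the fast path being suggested here; the function name and signature are illustrative, not the real packPixels code:

```javascript
// When rows are already tightly packed and no format conversion is needed,
// the whole image can be copied in one shot instead of row by row.
function packRows(src, dst, srcStride, rowBytes, rows) {
  if (srcStride === rowBytes) {
    dst.set(src.subarray(0, rowBytes * rows)); // one memcpy-style copy
    return;
  }
  for (let row = 0; row < rows; row++) {       // row-by-row when padded
    dst.set(src.subarray(row * srcStride, row * srcStride + rowBytes),
            row * rowBytes);
  }
}
```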
So we're redecoding because the image isn't supposed to have premultiplied alpha. It might be best to let the application hint that the image will never be used outside of GL, so we wouldn't decode it the first time. We could also premultiply the image only on demand, so we wouldn't have to decode it twice. However, that might cause delays in pages that use images normally.

Then we convert BGRA8 to RGBA8 to load into the texture. It looks like we do two passes over the data for that - once to convert BGRA8 to RGBA8 and the second to copy the RGBA8 intermediate buffer into the final buffer. The second part of that could be eliminated completely, perhaps by adding back the templated conversion stuff. The first part could be accelerated with SSE2 and/or rotl, or eliminated completely when on top of ANGLE. I expect BGRA8->RGBA8 to be pretty common, so it might be worth some extra optimization effort.
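On little-endian machines, the single-pass BGRA8-to-RGBA8 conversion described above can be sketched as follows (a scalar sketch; an SSE2 version would shuffle several pixels per instruction):

```javascript
// In memory, BGRA bytes [B,G,R,A] read as the little-endian word 0xAARRGGBB;
// RGBA bytes [R,G,B,A] read as 0xAABBGGRR. Swapping the low and third bytes
// converts one to the other in place, in a single pass over the data.
function bgraToRgba(pixels) {
  for (let i = 0; i < pixels.length; i++) {
    const p = pixels[i];
    pixels[i] = (p & 0xff00ff00) |    // keep A and G in place
                ((p >>> 16) & 0xff) | // move R down
                ((p & 0xff) << 16);   // move B up
  }
  return pixels;
}

// Bytes [0x11, 0x22, 0x33, 0x44] (B,G,R,A) become [0x33, 0x22, 0x11, 0x44].
console.log(bgraToRgba(new Uint32Array([0x44332211]))[0].toString(16)); // "44112233"
```

Doing the swizzle while writing directly into the destination buffer is what removes the second pass over the data.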
Comment 21 by pya...@gmail.com, Aug 13 2011
BGRA8 is supported since:
- OpenGL 1.1
- DirectX 6

The OpenGL ES 1.0 spec states that: "The RGB component ordering is always RGB or RGBA rather than BGRA since there is no real perceived advantage to using BGRA."

No perceived advantage, such as, oh, I don't know, faster texture uploads?
Comment 22 by gman@chromium.org, Aug 13 2011
There's a lot of things we could do.

*) As far as I know the browser only uses pre-multiplied alpha, so it makes no sense that we are converting?

*) If the user asks for RGBA but is loading an RGB image (jpeg) no conversion is necessary?
(Actually that won't work, because they could try to load a real RGBA image into another mip. We should probably add a conformance test that no one is making that optimization.)

*) If the img is stored as BGRA and the user asks for RGBA we can just upload as BGRA (GL_EXT_BGRA is exposed, or can be, to the WebGL impl)

*) If the img has alpha and the user asks for RGB we can upload and clear out the alpha on the GPU?

*) Add extensions to do the conversions in the GPU process?

Just throwing out ideas.
Yeah, the browser only uses premultiplied alpha normally, so in this case we need to redecode the image so we can get a copy that's never been premultiplied. With jpegs or (on skia) pngs that are completely opaque, we recognize that fact and don't worry about redecoding or (un)premultiplying.

Would it be legal to upload a BGRA into an image with an RGBA internal format? Or would we have to change both?

Clearing out the alpha on the gpu would probably be more expensive than just replacing some memcpy somewhere with an efficient SSE2 copy that does the right thing. However, there are so many possible image and texture formats that no matter which method we choose we'll have to pick the most common ones and prioritize those.
Owner: jbau...@chromium.org
Status: Started
I'm working on a WebKit patch that speeds up the BGRA->RGBA conversion in webkit a lot. There are still some performance problems when we have to redecode the image, and the FlushSync when the transfer buffer runs out of space causes the renderer to always wait for the GPU process to finish processing everything, even with a really huge buffer.
Comment 25 by gman@google.com, Aug 25 2011
Given that, if we are running on top of OpenGL or ANGLE (vs OpenGL ES) -- in other words, on Linux, Mac, and Windows but not ChromeOS -- the driver will do the conversion for us, why don't we just add a flag to let that happen automatically on the GPU side?


It doesn't look like ANGLE will do that quite yet, but it would be trivial to add an extension to make it work. Adding a fix to WebKit is a bit easier than plumbing that extension through, and it gets some similar benefits (it can also help in cases that ANGLE doesn't handle yet, like with premultiplication), so I'm doing that first.
Looks like for JPEGs the biggest issue is how slow WebCore::JPEGImageDecoder::outputScanlines is. That really dominates everything except the actual image decoding.
Labels: WebKit-ID-59670
WebKit bug 59670 deals with the outputScanlines issue.
Project Member Comment 29 by bugdroid1@chromium.org, Nov 28 2011
Labels: -WebKit-ID-59670 WebKit-ID-59670-NEW WebKit-Rev-100220 WebKit-Rev-100264 WebKit-Rev-100252
https://bugs.webkit.org/show_bug.cgi?id=59670
http://trac.webkit.org/changeset/100220
http://trac.webkit.org/changeset/100264
http://trac.webkit.org/changeset/100252
Project Member Comment 30 by bugdroid1@chromium.org, Nov 28 2011
Labels: -WebKit-ID-59670-NEW WebKit-ID-59670-RESOLVED WebKit-Rev-101286
https://bugs.webkit.org/show_bug.cgi?id=59670
http://trac.webkit.org/changeset/101286
Status: Fixed
Chrome is still a bit slower than Firefox by default, but with gl.UNPACK_PREMULTIPLY_ALPHA_WEBGL (or probably with opaque images) it's the same speed. I think this has improved enough for now.
Project Member Comment 32 by bugdroid1@chromium.org, Oct 13 2012
Labels: Restrict-AddIssueComment-Commit
This issue has been closed for some time. No one will pay attention to new comments.
If you are seeing this bug or have new data, please click New Issue to start a new bug.
Project Member Comment 33 by bugdroid1@chromium.org, Mar 11 2013
Labels: -Feature-GPU-WebGL Cr-Internals-GPU-WebGL
Project Member Comment 34 by bugdroid1@chromium.org, Apr 10 2013
Labels: -Cr-Internals-GPU-WebGL Cr-Blink-WebGL