Starred by 18 users

Issue metadata

Status: Fixed
Owner: (not on Chrome anymore)
Closed: Nov 2011
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Restricted
  • Only users with Commit permission may comment.




Issue 92388: Texture Upload in Chrome is slow

Reported by toj...@gmail.com, Aug 10 2011

Issue description

Chrome Version       : 14.0.835.29 dev
URLs (if applicable) : http://jsperf.com/webgl-teximage2d-vs-texsubimage2d/2

Other browsers tested:
    Safari 5: OK
    Firefox 7: OK

What steps will reproduce the problem?
1. Push an image to video memory via gl.texImage2D or gl.texSubImage2D

What is the expected result?
Performance of this feature should be fast enough to allow uploading a medium-sized texture (1024x1024) without severely disrupting a realtime application. 

What happens instead?
Safari and Firefox both perform reasonably fast in this scenario, but Chrome lags far behind both in terms of upload speed. Calling gl.texImage2D on a 1024x1024 texture currently blocks the main thread for ~50ms.
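
For reference, the upload pattern being timed is roughly the following (a minimal sketch; the canvas, image, and timing code are illustrative, not copied from the jsperf page):

  var gl = canvas.getContext('experimental-webgl');
  var tex = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_2D, tex);

  var start = Date.now();
  // Full upload of a 1024x1024 <img>:
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, image);
  // ...or updating an existing texture in place:
  // gl.texSubImage2D(gl.TEXTURE_2D, 0, 0, 0, gl.RGBA, gl.UNSIGNED_BYTE, image);
  console.log('upload blocked the main thread for ' + (Date.now() - start) + ' ms');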
 

Comment 1 by paulir...@chromium.org, Aug 10 2011

Mergedinto: 91208
Status: Duplicate

Comment 2 by enne@chromium.org, Aug 11 2011

Cc: vangelis@chromium.org apatrick@chromium.org gman@chromium.org kbr@chromium.org nduca@chromium.org
Labels: -Area-Undefined Feature-GPU-WebGL
Mergedinto:
Status: Unconfirmed
I'm not sure this is a duplicate.  This bug concerns uploads via image and not uploads via video.

Comment 3 by nduca@chromium.org, Aug 11 2011

Agreed. I think that our slowness is worse once we start rendering because the current compositor scheduler causes the gpu to spend most of its time blocked on vsync. :'(

Comment 5 by enne@chromium.org, Aug 11 2011

Cc: enne@chromium.org zmo@chromium.org

Comment 6 by nduca@chromium.org, Aug 11 2011

Studied this and found that this is not gpu upload performance related. In fact, the actual gl call that uploads this texture takes 0.016ms.

The cost here is as follows:
 GraphicsContext3D::extractImageData                    :      6.675ms
 DecodeAlphaNotPremultiplied                            :      5.254ms
 GraphicsContext3D::packPixels                          :      1.345ms
 CommandBufferProxy::FlushSync                          :      0.836ms

Basically, we're doing a TON of work doing texture conversion.

Comment 7 by nduca@chromium.org, Aug 11 2011

The flushsync cost is closer to 2ms. My bad.
slow_texture_uploads.json (8.9 MB)
slow_texture_uploads.png (71.3 KB)

Comment 8 by gman@chromium.org, Aug 12 2011

Status: Available
So there are several issues:

#1) The transfer buffer is only 1 MB by default, and because a few bytes are used at the front, the largest texture you can upload without stalling is something like 256x255. Upload 2 textures and you'd get a stall.

#2) Even if we make the transfer buffer bigger, we still have to do a copy.

These are security issues. We take security seriously, and to do that we can't let the process running JavaScript have direct access to the GPU. So texture uploads are slower for us. Rendering in general should be faster, though. Most apps don't need to upload textures every frame.
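
To put rough numbers on #1 (assuming 4 bytes per pixel, i.e. an RGBA8 source, which matches what this benchmark uploads):

  // Rough arithmetic, assuming 4 bytes per pixel (RGBA8):
  var textureBytes   = 1024 * 1024 * 4;  // 4,194,304 bytes, ~4 MB
  var transferBuffer = 1024 * 1024;      // default transfer buffer, ~1 MB
  // A 1024x1024 upload is ~4x the buffer, so it can't be handed to the
  // GPU process in one chunk without stalling.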

Comment 9 by fernando...@gmail.com, Aug 12 2011

Does this mean that the biggest POT texture that can be copied in a single run is 256x128? That seems bad. I don't see how this particular choice of buffer size can be considered a security issue (or maybe your "these are" means something else).

Nonetheless, doesn't this only impact the FlushSync cost (i.e. around 13% of the time according to Comment #6/7)? Or is this something else?

Afaik, Tojiro's app doesn't upload textures every frame, but it reasonably expects that when it does upload a new texture, it won't take forever. It seems a very reasonable application.

Comment 10 by nduca@chromium.org, Aug 12 2011

If we can figure out why the conversion is happening all the time, we can get a huge perf win back, at least on this benchmark as well as on any other examples that get tripped up on format conversions. You'll still have the perf issues described by gman, but at that point the upload cost will be closer to 2ms per upload rather than 14.

To clarify, addressing the actual FlushSync hitch ("the biggest texture that can be copied is <1mb") is something we need to do. It's just hard work: specifically, we need to build a more gooder ;) memory manager that provides additional upload space for an app that needs it, while also controlling it enough that it scales back when the app is backgrounded and prevents a runaway app from draining the system's shared memory resources.

Net/net, we agree, it's slow. We need to make it faster. There are some easy things to do first, then some hard things. :)

Comment 11 by vangelis@google.com, Aug 12 2011

I'm doing a code search for DecodeAlphaNotPremultiplied and I get no hits.  Nat, any idea where that trace entry comes from?

Comment 12 by nduca@chromium.org, Aug 12 2011

It's a scope inside GraphicsContext3D::getImageData, platform/graphics/skia/GraphicsContext3DSkia.cpp:59

Comment 13 by toj...@gmail.com, Aug 12 2011

In response to Fernando's comment: The RAGE app mentioned in the blog post probably does, on average, somewhere around 5-6 texSubImage2D calls per second, but I do force it to allow only one texSubImage2D call per frame. This is, admittedly, unusually high for anything outside of maybe video processing, but still something that should be reasonable. I know that maintaining a solid 60fps in those circumstances is likely a stretch, but I've also seen Safari and Firefox maintain a steady 58+ fps on the same demo, whereas Chrome can struggle to stay in the 40s depending on the scene.
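
Roughly, that one-upload-per-frame throttle looks like this (a simplified sketch, not the actual demo code):

  var pendingUploads = [];

  function queueTextureUpload(tex, image) {
    pendingUploads.push({tex: tex, image: image});
  }

  function drawFrame() {
    // Allow at most one texSubImage2D per frame to spread out the cost.
    if (pendingUploads.length) {
      var upload = pendingUploads.shift();
      gl.bindTexture(gl.TEXTURE_2D, upload.tex);
      gl.texSubImage2D(gl.TEXTURE_2D, 0, 0, 0,
                       gl.RGBA, gl.UNSIGNED_BYTE, upload.image);
    }
    // ...draw the scene...
    window.requestAnimationFrame(drawFrame);
  }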

I wish I could post a live demo for you guys to do some more in-depth benchmarking, but I can't put the RAGE resource files on a live server. :(

Comment 14 by pya...@gmail.com, Aug 12 2011

I have made the exact same observation about speed as Tojiro, with the video stuff. The performance issues are very related.

For me, doing live video to texture stuff, and for Tojiro and his usage, this is an extremely big deal.

Getting your performance in line with at least Firefox is mandatory. Getting it in line with actual machine capabilities is highly recommended.

Comment 15 by benvanik@google.com, Aug 12 2011

RE: 'Most apps don't need to upload textures every frame.'
There are a ton of valid reasons to be frequently uploading textures, not just limited to games. Photo/video editors, browsing UIs (stores, content libraries, search results, etc), mapping, data visualization, etc. Slow texture uploads kill almost all of these scenarios, or severely limit their real world implementations. In most of these cases, getting faster texture upload is actually more important than a faster frame time - being locked at 30fps but never dropping a frame is almost always a much better user experience than 60fps with dropped frames every second or two.

Comment 16 by kbr@chromium.org, Aug 12 2011

Cc: jbau...@chromium.org
I agree that there is no good reason for Chrome to be substantially slower for texture uploads than other browsers, security issues notwithstanding. Any mandatory memcpy into shared memory should not be the bottleneck from the above measurements. I will look into this as soon as possible, unless someone else gets to it first.

Comment 17 by jbau...@chromium.org, Aug 12 2011

Doing gl.pixelStorei(gl.UNPACK_PREMULTIPLY_ALPHA_WEBGL, true); in the test case would help us avoid having to decode the image twice. Still not sure why GraphicsContext3D::extractImageData is so expensive, though.
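
I.e. something like this before the upload (sketch):

  // Asking WebGL for premultiplied alpha matches what the browser already
  // keeps for normal rendering, so the image doesn't have to be decoded a
  // second time just to get un-premultiplied data.
  gl.pixelStorei(gl.UNPACK_PREMULTIPLY_ALPHA_WEBGL, true);
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, image);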

Comment 18 by nduca@chromium.org, Aug 12 2011

For this...


Title     : GraphicsContext3D::getImageData
Start     : 3521.813 ms
Duration  : 24.232 ms
Args      :
 id       : 0
 extra    : skiaImage=0x7fd43c072d20 ignoreGammaAndColorProfile=0 hasAlpha=1 premultiplyAlpha=0 image->data=0x7fd43bf6cc80


Title     : PackPixels
Start     : 3546.061 ms
Duration  : 3.787 ms
Args      :
 id       : 0
 extra    : sourceDataFormat=9, w=1024, h=1024, sua=0, destinationFormat=1908 destinationType=1401 aop=0

Comment 19 by zmo@google.com, Aug 13 2011

GraphicsContext3D::extractImageData is expensive because of the re-decoding.  With gl.pixelStorei(gl.UNPACK_PREMULTIPLY_ALPHA_WEBGL, true), this should be gone.

PackPixels could also be optimized in certain situations, like no conversion/no padding needed, then it's just a memcpy instead of row-by-row or pixel-by-pixel packing.

Comment 20 by jbau...@chromium.org, Aug 13 2011

So we're redecoding because the image is not supposed to have premultiplied alpha. It might be best to let the application hint that the image will never be used outside of GL, so we wouldn't decode it the first time. We could also only premultiply the image on demand, so we wouldn't have to decode it twice. However, that might cause delays in pages that use images normally.

Then we convert BGRA8 to RGBA8 to load into the texture. It looks like we do two passes over the data for that: once to convert BGRA8 to RGBA8, and a second time to copy the RGBA8 intermediate buffer into the final buffer. The second part could be eliminated completely, perhaps by adding back the templated conversion stuff. The first part could be accelerated with SSE2 and/or rotl, or eliminated completely when running on top of ANGLE. I expect BGRA8->RGBA8 to be pretty common, so it might be worth some extra optimization effort.

Comment 21 by pya...@gmail.com, Aug 13 2011

BGRA8 is supported since:
- OpenGL 1.1
- DirectX 6

The OpenGL ES 1.0 spec states that: "The RGB component ordering is always RGB or RGBA rather than BGRA since there is no real perceived advantage to using BGRA."

No perceived advantage, such as, oh, I don't know, faster texture uploads?

Comment 22 by gman@chromium.org, Aug 13 2011

There's a lot of things we could do.

*) As far as I know the browser only uses pre-multiplied alpha so it makes no sense we are converting?

*) If the user asks for RGBA but is loading an RGB image (jpeg), no conversion is necessary?
(Actually that won't work because they could try to load a real RGBA into another mip. We should probably add a conformance test that no one is making that optimization.)

*) If the img is stored as BGRA and the user asks for RGBA, we can just upload as BGRA (GL_EXT_BGRA is exposed, or can be, to the WebGL impl).

*) If the img has alpha and the user asks for RGB we can upload and clear out the alpha on the GPU?

*) Add extensions to do the conversions in the GPU process?

Just throwing out ideas.

Comment 23 by jbau...@chromium.org, Aug 13 2011

Yeah, the browser only uses premultiplied alpha normally, so in this case we need to redecode the image so we can get a copy that's never been premultiplied. With jpegs or (on skia) pngs that are completely opaque, we recognize that fact and don't worry about redecoding or (un)premultiplying.

Would it be legal to upload a BGRA into an image with an RGBA internal format? Or would we have to change both?

Clearing out the alpha on the gpu would probably be more expensive than just replacing some memcpy somewhere with an efficient SSE2 copy that does the right thing. However, there are so many possible image and texture formats that no matter which method we choose we'll have to pick the most common ones and prioritize those.

Comment 24 by jbau...@chromium.org, Aug 25 2011

Owner: jbau...@chromium.org
Status: Started
I'm working on a WebKit patch that speeds up the BGRA->RGBA conversion a lot. There are still some performance problems when we have to redecode the image, and the FlushSync when the transfer buffer runs out of space causes the renderer to always wait for the GPU process to finish processing everything, even with a really huge buffer.

Comment 25 by gman@google.com, Aug 25 2011

Given that, if we are running on top of OpenGL or ANGLE (vs. OpenGL ES), in other words on Linux, Mac, and Windows but not ChromeOS, the driver will do the conversion for us, why don't we just add a flag to let that happen automatically on the GPU side?

Comment 26 by jbau...@chromium.org, Aug 25 2011

It doesn't look like ANGLE will do that quite yet, but it would be trivial to add an extension to make it work. Adding a fix to WebKit is a bit easier than plumbing that extension through, and it gets some similar benefits (it can also help in cases that ANGLE doesn't handle yet, like with premultiplication), so I'm doing that first.

Comment 27 by jbau...@chromium.org, Sep 2 2011

Looks like for JPEGs the biggest issue is how slow WebCore::JPEGImageDecoder::outputScanlines is. That really dominates everything except the actual image decoding.

Comment 28 by jbau...@chromium.org, Nov 23 2011

Labels: WebKit-ID-59670
WebKit bug 59670 deals with the outputScanlines issue.

Comment 29 by bugdroid1@chromium.org, Nov 28 2011

Project Member

Comment 30 by bugdroid1@chromium.org, Nov 28 2011

Project Member
Labels: -WebKit-ID-59670-NEW WebKit-ID-59670-RESOLVED WebKit-Rev-101286
https://bugs.webkit.org/show_bug.cgi?id=59670
http://trac.webkit.org/changeset/101286

Comment 31 by jbau...@chromium.org, Nov 28 2011

Status: Fixed
Chrome is still a bit slower than Firefox by default, but with gl.UNPACK_PREMULTIPLY_ALPHA_WEBGL (or probably with opaque images) it's the same speed. I think this has improved enough for now.

Comment 32 by bugdroid1@chromium.org, Oct 13 2012

Project Member
Labels: Restrict-AddIssueComment-Commit
This issue has been closed for some time. No one will pay attention to new comments.
If you are seeing this bug or have new data, please click New Issue to start a new bug.

Comment 33 by bugdroid1@chromium.org, Mar 11 2013

Project Member
Labels: -Feature-GPU-WebGL Cr-Internals-GPU-WebGL

Comment 34 by bugdroid1@chromium.org, Apr 10 2013

Project Member
Labels: -Cr-Internals-GPU-WebGL Cr-Blink-WebGL
