New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 898030 link

Starred by 3 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Windows
Pri: 3
Type: Feature



Sign in to add a comment

webgl2-compute security

Project Member Reported by jie.a.c...@intel.com, Oct 23

Issue description

    webgl2-compute exposes a set of new low-level API based on OpenGL ES 3.1. It's necessary to make sure it strictly conforms to the principles of web platform. Some initial thoughts on this:
    Images, atomic counters, and shader storage buffer are the main new features which webgl2-compute brings. They all operates with gpu-memory based objects. So we firstly must pay attention to the out-of-bounds issue. The KHR_robust_buffer_access_behavior extension seems to be what we should rely on. It can ensure that reads from unbound buffers return zero and writes are discarded. This is perfect for atomic counter buffers and shader storage buffers. We may check the presence of this extension before creating webgl2-compute context. Per my observation, it's available at least on ANGLE-d3d11, Linux Intel/Nvidia. But image load/store is not mentioned in the extension, while ES31 spec says "Invalid image loads will return a vector where the value of R, G, and B components is 0 and the value of the A component is undefined. Invalid image stores will have no effect.". We still need to figure out how to handle the undefined A component.

Secondly the shared variables used in compute shader may have the vulnerability of the uninitialized memory access. The ELSL 3.1 spec says "Variables declared as shared may not have initializers and their contents are undefined at the beginning of shader execution.". ANGLE has an ANGLE_robust_resource_initialization extension, but shared variables are yet to be covered by it. Probably we can do it by injecting some initialization code for shared variables during ELSL translation.



 
Status: Assigned (was: Untriaged)
Thanks for pointing out these issues Jie. Assigning this issue to you.

Labels: -Type-Bug Type-Feature
Status: Started (was: Assigned)
An image UB to be addressed:
"If a shader reads from an image unit with a texture bound as WRITE_ONLY, or writes to an image unit with a texture bound as READ_ONLY, the results of that shader operation are undefined and may lead to application termination."
A note regarding this: "Probably we can do it by injecting some initialization code for shared variables during ELSL translation."

The initialization code must be mindful of group size and initialize arrays in parallel, while excluding threads that generate out-of-bound indices. The performance would otherwise be degraded.

If possible, there should be a check, at least for a common pattern, to make sure this initialization is not added to a shader that already includes an initialization.
Thanks for the comment. Yes, there might be performance penalty. Alternatively we can change the spec prohibiting the use of uninitialized shared variables.
Cc: kbr@chromium.org geoffl...@chromium.org cwallez@chromium.org jmad...@chromium.org
Shared variables share storage between all work items in a local work group. According to the GLSL ES 3.1 and later, shared variables can be used without initialization -- “Variables declared as shared may not have initializers and their contents are undefined at the beginning of shader execution. “.  This would be a big security problem for webgl2-compute, as a malicious website may have a chance to access these undefined values which may contain the data not owned by the website. So we’d better modify the webgl2-compute spec to constrain such usage. Probably we can:
1. prohibit read-before-write use for shared variables. For example, in the shader below, both variable ‘a’ and ‘b’ are uninitialized. ‘a’ is legal, but ‘b’ is illegal.

shared uint a;
shared uint b;
main()
{
    a = 100u;
    imageStore(image, coord0, uvec4(a));
    imageStore(image, coord1, uvec4(b));
}

But read-before-write may be hard to detect, especially for arrays.

shared uint a[1024];
main()
{
    a[0] = 100;
    imageStore(image, coord1, uvec4(a[1]));
}

In this example, the array ‘a’, as a whole, doesn't violate the read-before-write rule. But it’s still not secure, as a[0], and a[1] are not same element. We have to track the read-write dependency by element for such cases. Moreover if the index of array element is computed at run-time, this can be much more difficult to detect.

@Corentin, do you think this is doable in ANGLE compiler?

2. Clear uninitialized shared variables to zero. That is to say “Variables declared as shared may not have initializers and their contents are all zero at the beginning of shader execution.”.  To clear the variables, we can:

2a. Inject initialization code at the beginning of ‘main’:

shared uint a;
shared uint b;
main()
{
    // Initialize share variables.
   a = 0;
   b = 0;

   a = 100u;
   imageStore(image, coord0, uvec4(a));
   imageStore(image, coord1, uvec4(b));
}

This may incur performance penalty if the variable is a big size array, as we may not be able to use all available threads to clear the array in parallel.

2b. Propose a new GL_KHR_robust_shared_variable_initialization extension to be implemented in GPU drivers, which clears all such shared variables to zero silently for us. GPU drivers have more information about shared variables, they may do such clears more efficiently. But there will be no shared variables in webgl2-compute to use, before the extension is widely supported by most GPU drivers.

3. Require shared variables must be always declared with an initializer, unless the GL_KHR_robust_shared_variable_initialization is present.

shared uint a = 100u; // okay
shared uint b; // error if no GL_KHR_robust_shared_variable_initialization
main()
{
    imageStore(image, coord0, uvec4(a));
    imageStore(image, coord1, uvec4(b));
}

The main advantage of solution is that we may have a change to use shared variables without GL_KHR_robust_shared_variable_initialization. The disadvantages are:
a. ES31 spec is ambiguous about shared variable initializer. The wording ‘may not’ may mean ‘you can’t do that’. It may also mean that ‘you can do that, but you are not required to do that’. We may first need to file a bug to khronous asking for a clarification. It seems most GPU drivers take the second interpretation, according to https://chromium-review.googlesource.com/c/angle/angle/+/1301313.

b. For big size arrays, it’s not convenient to use initializer.

share uint a[4] = {0u, 0u, 0u, 0u}; // okay
share uint a[1024] = {0u, 0u, 0u, 0u, 0u …  // my god, that’s a lot!

The ELSL 3.1 spec requires “There must be exactly the same number of arguments as the size of the array being constructed.”

We may extend ANGLE to support supplying 1 same initial value for all elements, like ‘share uint a[1024] = { 0u };’, and expand it for all elements during translating. But that may still result in lengthy shader text, and compilation time.


Thanks for the investigation, we've been thinking about this problem for WebGPU as well.

Solution 1 would be very hard to implement and in general knowing if a variable will be written or not is equivalent to the halting problem. The best we could do is an approximation that would both require a huge amount of complex code to do the analysis, and the validation rules would be just as hard to write in the spec.

Solution 3 would work but we don't want to expose whether GL_KHR_robust_shared_variable_initialization to web pages because it is an implementation detail of the browsers.

Solution 2b means going through the specification process in Khronos and will take a lot of time and it isn't even clear if IHVs would accept the extension.

Really I don't think we have a choice and we have to go with 2a. This is what we're thinking of going for with WebGPU SPIR-V too. It does add some GPU cost to shaders but it's all ALU operations and doesn't do accesses to main GPU memory (unless your shader spills in which case you have other perf problems) so it should be cheap compared to memory operations. Also if we notice a perf issue, we'd be able to initialize arrays in parallel since with have multiple threads in a warp.
I don't think that 2a is correct.  There is no guarantee that each invocation within the work group is executed in parallel so one invocation may already already be writing to the shared value while other threads are still initializing it.  For this approach to work, we would need to insert a memoryBarrierShared() after the initialization.
Project Member

Comment 9 by bugdroid1@chromium.org, Oct 30

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0927ce032bd03dcf0369a7c351d55c5a4a9f91ed

commit 0927ce032bd03dcf0369a7c351d55c5a4a9f91ed
Author: jchen10 <jie.a.chen@intel.com>
Date: Tue Oct 30 01:43:22 2018

Enforce KHR_robust_buffer_access_behavior for webgl2-compute

This checks the KHR_robust_buffer_access_behavior extension in
the passthrough-cmd-decoder when creating webgl2-compute context.

Bug: 898030
Change-Id: I1deb3308c852054a9ae29cd620231a212b82e16e
Reviewed-on: https://chromium-review.googlesource.com/c/1297735
Reviewed-by: Geoff Lang <geofflang@chromium.org>
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Commit-Queue: Jie A Chen <jie.a.chen@intel.com>
Cr-Commit-Position: refs/heads/master@{#603727}
[modify] https://crrev.com/0927ce032bd03dcf0369a7c351d55c5a4a9f91ed/gpu/command_buffer/service/feature_info.cc
[modify] https://crrev.com/0927ce032bd03dcf0369a7c351d55c5a4a9f91ed/gpu/command_buffer/service/feature_info.h
[modify] https://crrev.com/0927ce032bd03dcf0369a7c351d55c5a4a9f91ed/gpu/command_buffer/service/gles2_cmd_decoder_passthrough.cc

Currently most popular javascript AI libraries have already implemented the AI algorithms using fragment shader. The shared variable is one of major features of compute shader which outperform fragment shader. For better adoption of webgl2-compute, we have to be cautious about the performance of share variables in our implementation. I did a further investigation this week before rushing to any proposal.

1. Regarding whether initializer is allowed for shared variables in the GL/ES spec, I filed an issue(https://github.com/KhronosGroup/GLSL/issues/34#) to khronos. The feedback opted for "No" briefly.
2. I consulted some graphics driver experts internally on whether drivers can clear shared variables essentially faster than applications. They only expect a slight improvement. So it seems proposing a new GL_KHR_robust_shared_variable_initialization extension is unnecessary.
3.  I wrote a perf test in https://chromium-review.googlesource.com/c/angle/angle/+/1301313. The baseline workload is just to store values to an image. Furthermore it measures the 2 added workloads: a shared variable array with initializer, and a same variable array cleared by assigning zeros to it. The perf data were collected at https://docs.google.com/spreadsheets/d/13-gD1-zTFcgj5ZE8plONlafFROQdxOZoUGD-xE_YY2k/edit#gid=0. From the data, we can find that:
    a. For d3d11 backend, the overhead of using initializer is almost negligible regardless of GPU vendors. Whereas using assignment to clear is not.
    b. For gl backend regardless of Windows or Linux, Intel performs better using initilaizer, while Nvidia does better using assignment.

According to all these information, my thoughts are:
    I) the webgl2-compute spec should claim that all shared variables are cleared to zero by default.
    II) No new driver extension is going to be proposed.
    III) For d3d11 backend, we inject initializer to clear shared variables.
    IV) For gl backend, we can inject either initializer or assignment to clear shared variables for various GPU vendors.

The most challenging work for using assignment to clear probably is load-balance. For some complex variables like below one:
        struct T {
            uint t1;
            uint t2[64];
            uint t3[8][8];
        };
        struct S {
            uint s1;
            T s2[256];
            uint s3[128];
        };
        shared S data[16];

it may be hard to divide the clear sub-tasks equally to each thread in the group.
Thanks for the investigation!

I agree that we should try to ensure that the variables are initialized to 0 by default and not expose an extension.

I don't think we should try to divide up initialization by thread at first, especially if we will need if-statements.  At best, if there are arrays that need to be looped over we could have each thread handle different array indices using gl_LocalInvocationIndex.
I took a look at your tests. I see multiple issues with the way you initialize the shared variable array:

First of all, you have:

    layout(local_size_x=4, local_size_y=4) in;

I.e. your workgroup size is 4x4x1, and you do:

    glDispatchCompute(8, 8, 8);

Which means you have an 8x8x8 group invocation (i.e. 32x32x8 invocation ids). You don't use gl_GlobalInvocationID.z, so you are doing everything 8 times! I presume you meant to do glDispatchCompute(8, 8, 1);

---

Second, your initialization goes way beyond the array (csAssignmentSource):

    shared float data[256];

    void main() {
        uint i = (gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * 4u) * 16u;
        uint end = i + 16u;
        for (; i < end; i++)
        {
            data[i] = 0.0f;
        }
    }

For invocation id (7,7,0), you are writing to the array in indices [560, 576).

The other shader (csInitializerSource2) also has a similar problem:

    uint i = (gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * 4u) * 16u;
    float value = data[i];

For invocation id (7, 7, 0), you are reading from index 560.

There is another problem here!  The number of invocations is 32x32x1 (8x8x1 groups, 4x4x1 threads each) which means 1024 invocations. Your shared array is of size 256, so this is still overflowing the array (similarly, it's writing to the image at coordinates that are beyond its size).  You have a few options. Either do range check in the shader, which in this simplistic case doesn't make sense.  Or fix the dispatch to match the image and shared array size, i.e. glDispatch(4, 4, 1);.  Or change the image and shared array size to match the dispatch, i.e. 32x32 image and array of 1024 values.  I would go with the later just to avoid having two 4x4x1 sizes (for dispatch and local size) which could be confusing.  From here on, I will assume the later.

I have other comments about the first shader below, but for the second shader, what you probably want then is:

    uint i = gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * 32u;
    float value = data[i];

I also have a question regarding this which is at the bottom of this post.

---

Regarding the initialization loop, first of all, let's assume you have `i` calculated correctly somehow.  Look at the first thread group spanning invocation indices (0,0,0) through (3,3,0).  The threads in this group calculate i as 0, 16, 32, etc. This means that your loop results in the following simultanous index accesses:

- First iteration: 0,16,32,...
- Second iteration: 1,17,33,...
- Third iteration: 2,18,34,...
- ...

However, the local data storage (AMD term for where shared variables are stored) is optimized for the lanes accessing sequential address. This is acheived by having a number of banks (say 32) and the address diverted to bank "address%32". Your current algorithm causes a lot of bank conflicts, which slows the initialization down.  The most efficient access pattern to clear the shared array (still assuming 4x4x1 work group size) would be simultaneous writes to indices:

- 0,1,2,...
- 16,17,18,...
- 32,33,34,...
- ...

All that said, in this particular case, you don't even need a loop, as the shared array size (1024) is the same as the invocation size (8x8x1 groups, i.e. 32x32x1 invocations).  So all you need is:

    uint i = gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * 32u;
    data[i] = 0;

Note that the above is still not perfect, as it would produce indices 0,1,2,3,32,33,34,35,64,65,66,67,96,97,98,99 for the first work group for example.  An even better way to do it would be this:

    uint flattened_group_id = gl_WorkGroupID.x + gl_WorkGroupID.y * 8;
    uint flattened_local_id = gl_LocalInvocationID.x + gl_LocalInvocationID.y * 4;
    uint i = flattened_group_id * 16 + flattened_local_id;
    data[i] = 0;

---

Now if you had more invocations than your array size, you would need to do range check before initializing the array.  What if you had fewer invocations than your array size?  In that case you would indeed need a loop.  With the bank-conflict issue I mentioned above, the best way to do that would be:

    shared float data[N];

    uint flattened_group_id = gl_WorkGroupID.x + gl_WorkGroupID.y * 8;
    uint flattened_local_id = gl_LocalInvocationID.x + gl_LocalInvocationID.y * 4;
    uint i = flattened_group_id * 16 + flattened_local_id;

    for (; i < N; i += 1024)
        data[i] = 0;

The loop above covers all cases of N being smaller, equal or larger than invocation size as well as whether the invocation size divides N or not.

---

I also have a question.  When you fix up a shader, you have no idea how many invocations it will end up receiving.  How are you going to address that?  Unless you do the fixup at every dispatch call?

Ignore my last question, didn't know about gl_NumWorkGroups.
Excellent analysis syoussefi@! Thank you for looking at this!

@syoussefi, Thank you for reviewing my tests. Your analysis is quite helpful. I find a stupid mistake in my tests. When I was writing gl_GlobalInvocationID, gl_LocalInvocationID was actually in my mind. BTW, you can comment directly in the CL for convenience.
The perf data updated. It now reveals same except that initializer is better than assignment for Nvidia also. But for Nvidia Linux GL driver, initializer is  not well implemented. The test runs into compilation error if the array size is bigger than 512.
For how to optimize the assignment initialization, I think there will be a lot of details to discuss. We can discuss it later when implementing it.
Cc: yunchao...@intel.com
Glad I could help. Changing to gl_LocalInvocationID made most things clearer. I added a comment on one remaining issue.
Project Member

Comment 19 by bugdroid1@chromium.org, Nov 13

The following revision refers to this bug:
  https://chromium.googlesource.com/angle/angle/+/cd47a379f9606a2ff47c047acabaeea73eeeaaef

commit cd47a379f9606a2ff47c047acabaeea73eeeaaef
Author: jchen10 <jie.a.chen@intel.com>
Date: Tue Nov 13 02:03:08 2018

Add SH_INIT_SHARED_VARIABLES flag

This option is used to initialize shared variables to zero at the
beginning of shader execution to avoid compute shaders being able to
read undefined values that could be coming from another webpage or
application.
It's implemented by declaring variables with initial value for HLSL.
For GLSL, it's not allowed to use declaraction initializer for shared
variables, so we need to explicitly assign them to zero at the
beginning of main(). This implementation is only for HLSL.

Bug: chromium:898030

Change-Id: Ic5906500bf4a35cd9a071923f82f32c5e2991be3
Reviewed-on: https://chromium-review.googlesource.com/c/1330310
Commit-Queue: Jie A Chen <jie.a.chen@intel.com>
Reviewed-by: Jamie Madill <jmadill@chromium.org>

[modify] https://crrev.com/cd47a379f9606a2ff47c047acabaeea73eeeaaef/src/tests/gl_tests/ComputeShaderTest.cpp
[modify] https://crrev.com/cd47a379f9606a2ff47c047acabaeea73eeeaaef/src/libANGLE/Shader.cpp
[modify] https://crrev.com/cd47a379f9606a2ff47c047acabaeea73eeeaaef/include/GLSLANG/ShaderLang.h
[modify] https://crrev.com/cd47a379f9606a2ff47c047acabaeea73eeeaaef/src/compiler/translator/OutputHLSL.cpp

Project Member

Comment 20 by bugdroid1@chromium.org, Nov 13

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/9e449edb49bf9919fae21c11caf53926753e85d0

commit 9e449edb49bf9919fae21c11caf53926753e85d0
Author: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com>
Date: Tue Nov 13 05:42:28 2018

Roll src/third_party/angle 0b18583285cb..cd47a379f960 (5 commits)

https://chromium.googlesource.com/angle/angle.git/+log/0b18583285cb..cd47a379f960


git log 0b18583285cb..cd47a379f960 --date=short --no-merges --format='%ad %ae %s'
2018-11-13 jie.a.chen@intel.com Add SH_INIT_SHARED_VARIABLES flag
2018-11-12 timvp@google.com Fix failing dEQP EGL tests.
2018-11-12 jmadill@chromium.org Whitespace change to cq.cfg.
2018-11-12 jmadill@chromium.org Add missing refs.cfg.
2018-11-12 jmadill@chromium.org Whitespace change to cq.cfg.


Created with:
  gclient setdep -r src/third_party/angle@cd47a379f960

The AutoRoll server is located here: https://autoroll.skia.org/r/angle-chromium-autoroll

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.

CQ_INCLUDE_TRYBOTS=luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel

BUG=chromium:898030, chromium:833848 , chromium:833848 , chromium:833848 
TBR=cwallez@chromium.org

Change-Id: I5b67d20c08fdeb6614bca9c7e0f451988f836d27
Reviewed-on: https://chromium-review.googlesource.com/c/1333054
Reviewed-by: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com>
Commit-Queue: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#607502}
[modify] https://crrev.com/9e449edb49bf9919fae21c11caf53926753e85d0/DEPS

Project Member

Comment 21 by bugdroid1@chromium.org, Nov 22

The following revision refers to this bug:
  https://chromium.googlesource.com/angle/angle/+/efe061bd4f9de0f8da3135bcabff7f40e05e91f7

commit efe061bd4f9de0f8da3135bcabff7f40e05e91f7
Author: jchen10 <jie.a.chen@intel.com>
Date: Thu Nov 22 01:57:56 2018

Optimize HLSL zero initializer

Currently we initialize a variable using zero initializer. Take the
below variable for example:
    uint var[4];
We translate it to:
    uint var[4] = { 0, 0, 0, 0};
If the array size is large, we have to use very long zero initializer.
The problem is that it's very slow for D3D drivers to compile.

This CL uses the 'static' trick below to solve the problem:
    static uint _ANGLE_ZEROS_[256];
    ...
    uint var[516] = {_ANGLE_ZEROS_, _ANGLE_ZEROS_, 0, 0, 0, 0};

For 'static', if the declaration does not include an initializer, the
value is set to zero.
https://docs.microsoft.com/en-us/windows/desktop/direct3dhlsl/dx-graphics-hlsl-variable-syntax

Bug: chromium:898030

Change-Id: Ia3f6574b5ddaffa94bf971140eba95835ee105ee
Reviewed-on: https://chromium-review.googlesource.com/c/1332805
Reviewed-by: Shahbaz Youssefi <syoussefi@chromium.org>
Reviewed-by: Geoff Lang <geofflang@chromium.org>
Commit-Queue: Jie A Chen <jie.a.chen@intel.com>

[modify] https://crrev.com/efe061bd4f9de0f8da3135bcabff7f40e05e91f7/src/compiler/translator/OutputHLSL.h
[modify] https://crrev.com/efe061bd4f9de0f8da3135bcabff7f40e05e91f7/src/compiler/translator/OutputHLSL.cpp

Project Member

Comment 22 by bugdroid1@chromium.org, Nov 22

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0e26cf6d17cdcebe352f8ec53a10e3d14517f677

commit 0e26cf6d17cdcebe352f8ec53a10e3d14517f677
Author: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com>
Date: Thu Nov 22 11:18:41 2018

Roll src/third_party/angle 9f379f493cd7..efe061bd4f9d (3 commits)

https://chromium.googlesource.com/angle/angle.git/+log/9f379f493cd7..efe061bd4f9d


git log 9f379f493cd7..efe061bd4f9d --date=short --no-merges --format='%ad %ae %s'
2018-11-22 jie.a.chen@intel.com Optimize HLSL zero initializer
2018-11-22 jmadill@chromium.org Don't use gl::Error in validation.
2018-11-21 jmadill@chromium.org Vulkan: Cache pipelines with Shader Programs.


Created with:
  gclient setdep -r src/third_party/angle@efe061bd4f9d

The AutoRoll server is located here: https://autoroll.skia.org/r/angle-chromium-autoroll

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.

CQ_INCLUDE_TRYBOTS=luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel

BUG=chromium:898030
TBR=geofflang@chromium.org

Change-Id: I125a89563af417fed5d630b2be87b29d98f396f2
Reviewed-on: https://chromium-review.googlesource.com/c/1347083
Reviewed-by: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com>
Commit-Queue: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#610377}
[modify] https://crrev.com/0e26cf6d17cdcebe352f8ec53a10e3d14517f677/DEPS

Sign in to add a comment