webgl2-compute security |
||||
Issue description
webgl2-compute exposes a set of new low-level API based on OpenGL ES 3.1. It's necessary to make sure it strictly conforms to the principles of web platform. Some initial thoughts on this:
Images, atomic counters, and shader storage buffer are the main new features which webgl2-compute brings. They all operates with gpu-memory based objects. So we firstly must pay attention to the out-of-bounds issue. The KHR_robust_buffer_access_behavior extension seems to be what we should rely on. It can ensure that reads from unbound buffers return zero and writes are discarded. This is perfect for atomic counter buffers and shader storage buffers. We may check the presence of this extension before creating webgl2-compute context. Per my observation, it's available at least on ANGLE-d3d11, Linux Intel/Nvidia. But image load/store is not mentioned in the extension, while ES31 spec says "Invalid image loads will return a vector where the value of R, G, and B components is 0 and the value of the A component is undefined. Invalid image stores will have no effect.". We still need to figure out how to handle the undefined A component.
Secondly the shared variables used in compute shader may have the vulnerability of the uninitialized memory access. The ELSL 3.1 spec says "Variables declared as shared may not have initializers and their contents are undefined at the beginning of shader execution.". ANGLE has an ANGLE_robust_resource_initialization extension, but shared variables are yet to be covered by it. Probably we can do it by injecting some initialization code for shared variables during ELSL translation.
,
Oct 24
,
Oct 24
An image UB to be addressed: "If a shader reads from an image unit with a texture bound as WRITE_ONLY, or writes to an image unit with a texture bound as READ_ONLY, the results of that shader operation are undefined and may lead to application termination."
,
Oct 24
A note regarding this: "Probably we can do it by injecting some initialization code for shared variables during ELSL translation." The initialization code must be mindful of group size and initialize arrays in parallel, while excluding threads that generate out-of-bound indices. The performance would otherwise be degraded. If possible, there should be a check, at least for a common pattern, to make sure this initialization is not added to a shader that already includes an initialization.
,
Oct 25
Thanks for the comment. Yes, there might be performance penalty. Alternatively we can change the spec prohibiting the use of uninitialized shared variables.
,
Oct 29
Shared variables share storage between all work items in a local work group. According to the GLSL ES 3.1 and later, shared variables can be used without initialization -- “Variables declared as shared may not have initializers and their contents are undefined at the beginning of shader execution. “. This would be a big security problem for webgl2-compute, as a malicious website may have a chance to access these undefined values which may contain the data not owned by the website. So we’d better modify the webgl2-compute spec to constrain such usage. Probably we can:
1. prohibit read-before-write use for shared variables. For example, in the shader below, both variable ‘a’ and ‘b’ are uninitialized. ‘a’ is legal, but ‘b’ is illegal.
shared uint a;
shared uint b;
main()
{
a = 100u;
imageStore(image, coord0, uvec4(a));
imageStore(image, coord1, uvec4(b));
}
But read-before-write may be hard to detect, especially for arrays.
shared uint a[1024];
main()
{
a[0] = 100;
imageStore(image, coord1, uvec4(a[1]));
}
In this example, the array ‘a’, as a whole, doesn't violate the read-before-write rule. But it’s still not secure, as a[0], and a[1] are not same element. We have to track the read-write dependency by element for such cases. Moreover if the index of array element is computed at run-time, this can be much more difficult to detect.
@Corentin, do you think this is doable in ANGLE compiler?
2. Clear uninitialized shared variables to zero. That is to say “Variables declared as shared may not have initializers and their contents are all zero at the beginning of shader execution.”. To clear the variables, we can:
2a. Inject initialization code at the beginning of ‘main’:
shared uint a;
shared uint b;
main()
{
// Initialize share variables.
a = 0;
b = 0;
a = 100u;
imageStore(image, coord0, uvec4(a));
imageStore(image, coord1, uvec4(b));
}
This may incur performance penalty if the variable is a big size array, as we may not be able to use all available threads to clear the array in parallel.
2b. Propose a new GL_KHR_robust_shared_variable_initialization extension to be implemented in GPU drivers, which clears all such shared variables to zero silently for us. GPU drivers have more information about shared variables, they may do such clears more efficiently. But there will be no shared variables in webgl2-compute to use, before the extension is widely supported by most GPU drivers.
3. Require shared variables must be always declared with an initializer, unless the GL_KHR_robust_shared_variable_initialization is present.
shared uint a = 100u; // okay
shared uint b; // error if no GL_KHR_robust_shared_variable_initialization
main()
{
imageStore(image, coord0, uvec4(a));
imageStore(image, coord1, uvec4(b));
}
The main advantage of solution is that we may have a change to use shared variables without GL_KHR_robust_shared_variable_initialization. The disadvantages are:
a. ES31 spec is ambiguous about shared variable initializer. The wording ‘may not’ may mean ‘you can’t do that’. It may also mean that ‘you can do that, but you are not required to do that’. We may first need to file a bug to khronous asking for a clarification. It seems most GPU drivers take the second interpretation, according to https://chromium-review.googlesource.com/c/angle/angle/+/1301313.
b. For big size arrays, it’s not convenient to use initializer.
share uint a[4] = {0u, 0u, 0u, 0u}; // okay
share uint a[1024] = {0u, 0u, 0u, 0u, 0u … // my god, that’s a lot!
The ELSL 3.1 spec requires “There must be exactly the same number of arguments as the size of the array being constructed.”
We may extend ANGLE to support supplying 1 same initial value for all elements, like ‘share uint a[1024] = { 0u };’, and expand it for all elements during translating. But that may still result in lengthy shader text, and compilation time.
,
Oct 29
Thanks for the investigation, we've been thinking about this problem for WebGPU as well. Solution 1 would be very hard to implement and in general knowing if a variable will be written or not is equivalent to the halting problem. The best we could do is an approximation that would both require a huge amount of complex code to do the analysis, and the validation rules would be just as hard to write in the spec. Solution 3 would work but we don't want to expose whether GL_KHR_robust_shared_variable_initialization to web pages because it is an implementation detail of the browsers. Solution 2b means going through the specification process in Khronos and will take a lot of time and it isn't even clear if IHVs would accept the extension. Really I don't think we have a choice and we have to go with 2a. This is what we're thinking of going for with WebGPU SPIR-V too. It does add some GPU cost to shaders but it's all ALU operations and doesn't do accesses to main GPU memory (unless your shader spills in which case you have other perf problems) so it should be cheap compared to memory operations. Also if we notice a perf issue, we'd be able to initialize arrays in parallel since with have multiple threads in a warp.
,
Oct 29
I don't think that 2a is correct. There is no guarantee that each invocation within the work group is executed in parallel so one invocation may already already be writing to the shared value while other threads are still initializing it. For this approach to work, we would need to insert a memoryBarrierShared() after the initialization.
,
Oct 30
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0927ce032bd03dcf0369a7c351d55c5a4a9f91ed commit 0927ce032bd03dcf0369a7c351d55c5a4a9f91ed Author: jchen10 <jie.a.chen@intel.com> Date: Tue Oct 30 01:43:22 2018 Enforce KHR_robust_buffer_access_behavior for webgl2-compute This checks the KHR_robust_buffer_access_behavior extension in the passthrough-cmd-decoder when creating webgl2-compute context. Bug: 898030 Change-Id: I1deb3308c852054a9ae29cd620231a212b82e16e Reviewed-on: https://chromium-review.googlesource.com/c/1297735 Reviewed-by: Geoff Lang <geofflang@chromium.org> Reviewed-by: Kenneth Russell <kbr@chromium.org> Commit-Queue: Jie A Chen <jie.a.chen@intel.com> Cr-Commit-Position: refs/heads/master@{#603727} [modify] https://crrev.com/0927ce032bd03dcf0369a7c351d55c5a4a9f91ed/gpu/command_buffer/service/feature_info.cc [modify] https://crrev.com/0927ce032bd03dcf0369a7c351d55c5a4a9f91ed/gpu/command_buffer/service/feature_info.h [modify] https://crrev.com/0927ce032bd03dcf0369a7c351d55c5a4a9f91ed/gpu/command_buffer/service/gles2_cmd_decoder_passthrough.cc
,
Nov 2
Currently most popular javascript AI libraries have already implemented the AI algorithms using fragment shader. The shared variable is one of major features of compute shader which outperform fragment shader. For better adoption of webgl2-compute, we have to be cautious about the performance of share variables in our implementation. I did a further investigation this week before rushing to any proposal. 1. Regarding whether initializer is allowed for shared variables in the GL/ES spec, I filed an issue(https://github.com/KhronosGroup/GLSL/issues/34#) to khronos. The feedback opted for "No" briefly. 2. I consulted some graphics driver experts internally on whether drivers can clear shared variables essentially faster than applications. They only expect a slight improvement. So it seems proposing a new GL_KHR_robust_shared_variable_initialization extension is unnecessary. 3. I wrote a perf test in https://chromium-review.googlesource.com/c/angle/angle/+/1301313. The baseline workload is just to store values to an image. Furthermore it measures the 2 added workloads: a shared variable array with initializer, and a same variable array cleared by assigning zeros to it. The perf data were collected at https://docs.google.com/spreadsheets/d/13-gD1-zTFcgj5ZE8plONlafFROQdxOZoUGD-xE_YY2k/edit#gid=0. From the data, we can find that: a. For d3d11 backend, the overhead of using initializer is almost negligible regardless of GPU vendors. Whereas using assignment to clear is not. b. For gl backend regardless of Windows or Linux, Intel performs better using initilaizer, while Nvidia does better using assignment. According to all these information, my thoughts are: I) the webgl2-compute spec should claim that all shared variables are cleared to zero by default. II) No new driver extension is going to be proposed. III) For d3d11 backend, we inject initializer to clear shared variables. IV) For gl backend, we can inject either initializer or assignment to clear shared variables for various GPU vendors. The most challenging work for using assignment to clear probably is load-balance. For some complex variables like below one: struct T { uint t1; uint t2[64]; uint t3[8][8]; }; struct S { uint s1; T s2[256]; uint s3[128]; }; shared S data[16]; it may be hard to divide the clear sub-tasks equally to each thread in the group.
,
Nov 2
Thanks for the investigation! I agree that we should try to ensure that the variables are initialized to 0 by default and not expose an extension. I don't think we should try to divide up initialization by thread at first, especially if we will need if-statements. At best, if there are arrays that need to be looped over we could have each thread handle different array indices using gl_LocalInvocationIndex.
,
Nov 2
I took a look at your tests. I see multiple issues with the way you initialize the shared variable array:
First of all, you have:
layout(local_size_x=4, local_size_y=4) in;
I.e. your workgroup size is 4x4x1, and you do:
glDispatchCompute(8, 8, 8);
Which means you have an 8x8x8 group invocation (i.e. 32x32x8 invocation ids). You don't use gl_GlobalInvocationID.z, so you are doing everything 8 times! I presume you meant to do glDispatchCompute(8, 8, 1);
---
Second, your initialization goes way beyond the array (csAssignmentSource):
shared float data[256];
void main() {
uint i = (gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * 4u) * 16u;
uint end = i + 16u;
for (; i < end; i++)
{
data[i] = 0.0f;
}
}
For invocation id (7,7,0), you are writing to the array in indices [560, 576).
The other shader (csInitializerSource2) also has a similar problem:
uint i = (gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * 4u) * 16u;
float value = data[i];
For invocation id (7, 7, 0), you are reading from index 560.
There is another problem here! The number of invocations is 32x32x1 (8x8x1 groups, 4x4x1 threads each) which means 1024 invocations. Your shared array is of size 256, so this is still overflowing the array (similarly, it's writing to the image at coordinates that are beyond its size). You have a few options. Either do range check in the shader, which in this simplistic case doesn't make sense. Or fix the dispatch to match the image and shared array size, i.e. glDispatch(4, 4, 1);. Or change the image and shared array size to match the dispatch, i.e. 32x32 image and array of 1024 values. I would go with the later just to avoid having two 4x4x1 sizes (for dispatch and local size) which could be confusing. From here on, I will assume the later.
I have other comments about the first shader below, but for the second shader, what you probably want then is:
uint i = gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * 32u;
float value = data[i];
I also have a question regarding this which is at the bottom of this post.
---
Regarding the initialization loop, first of all, let's assume you have `i` calculated correctly somehow. Look at the first thread group spanning invocation indices (0,0,0) through (3,3,0). The threads in this group calculate i as 0, 16, 32, etc. This means that your loop results in the following simultanous index accesses:
- First iteration: 0,16,32,...
- Second iteration: 1,17,33,...
- Third iteration: 2,18,34,...
- ...
However, the local data storage (AMD term for where shared variables are stored) is optimized for the lanes accessing sequential address. This is acheived by having a number of banks (say 32) and the address diverted to bank "address%32". Your current algorithm causes a lot of bank conflicts, which slows the initialization down. The most efficient access pattern to clear the shared array (still assuming 4x4x1 work group size) would be simultaneous writes to indices:
- 0,1,2,...
- 16,17,18,...
- 32,33,34,...
- ...
All that said, in this particular case, you don't even need a loop, as the shared array size (1024) is the same as the invocation size (8x8x1 groups, i.e. 32x32x1 invocations). So all you need is:
uint i = gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * 32u;
data[i] = 0;
Note that the above is still not perfect, as it would produce indices 0,1,2,3,32,33,34,35,64,65,66,67,96,97,98,99 for the first work group for example. An even better way to do it would be this:
uint flattened_group_id = gl_WorkGroupID.x + gl_WorkGroupID.y * 8;
uint flattened_local_id = gl_LocalInvocationID.x + gl_LocalInvocationID.y * 4;
uint i = flattened_group_id * 16 + flattened_local_id;
data[i] = 0;
---
Now if you had more invocations than your array size, you would need to do range check before initializing the array. What if you had fewer invocations than your array size? In that case you would indeed need a loop. With the bank-conflict issue I mentioned above, the best way to do that would be:
shared float data[N];
uint flattened_group_id = gl_WorkGroupID.x + gl_WorkGroupID.y * 8;
uint flattened_local_id = gl_LocalInvocationID.x + gl_LocalInvocationID.y * 4;
uint i = flattened_group_id * 16 + flattened_local_id;
for (; i < N; i += 1024)
data[i] = 0;
The loop above covers all cases of N being smaller, equal or larger than invocation size as well as whether the invocation size divides N or not.
---
I also have a question. When you fix up a shader, you have no idea how many invocations it will end up receiving. How are you going to address that? Unless you do the fixup at every dispatch call?
,
Nov 2
Ignore my last question, didn't know about gl_NumWorkGroups.
,
Nov 2
Excellent analysis syoussefi@! Thank you for looking at this!
,
Nov 3
@syoussefi, Thank you for reviewing my tests. Your analysis is quite helpful. I find a stupid mistake in my tests. When I was writing gl_GlobalInvocationID, gl_LocalInvocationID was actually in my mind. BTW, you can comment directly in the CL for convenience.
,
Nov 3
The perf data updated. It now reveals same except that initializer is better than assignment for Nvidia also. But for Nvidia Linux GL driver, initializer is not well implemented. The test runs into compilation error if the array size is bigger than 512. For how to optimize the assignment initialization, I think there will be a lot of details to discuss. We can discuss it later when implementing it.
,
Nov 3
,
Nov 5
Glad I could help. Changing to gl_LocalInvocationID made most things clearer. I added a comment on one remaining issue.
,
Nov 13
The following revision refers to this bug: https://chromium.googlesource.com/angle/angle/+/cd47a379f9606a2ff47c047acabaeea73eeeaaef commit cd47a379f9606a2ff47c047acabaeea73eeeaaef Author: jchen10 <jie.a.chen@intel.com> Date: Tue Nov 13 02:03:08 2018 Add SH_INIT_SHARED_VARIABLES flag This option is used to initialize shared variables to zero at the beginning of shader execution to avoid compute shaders being able to read undefined values that could be coming from another webpage or application. It's implemented by declaring variables with initial value for HLSL. For GLSL, it's not allowed to use declaraction initializer for shared variables, so we need to explicitly assign them to zero at the beginning of main(). This implementation is only for HLSL. Bug: chromium:898030 Change-Id: Ic5906500bf4a35cd9a071923f82f32c5e2991be3 Reviewed-on: https://chromium-review.googlesource.com/c/1330310 Commit-Queue: Jie A Chen <jie.a.chen@intel.com> Reviewed-by: Jamie Madill <jmadill@chromium.org> [modify] https://crrev.com/cd47a379f9606a2ff47c047acabaeea73eeeaaef/src/tests/gl_tests/ComputeShaderTest.cpp [modify] https://crrev.com/cd47a379f9606a2ff47c047acabaeea73eeeaaef/src/libANGLE/Shader.cpp [modify] https://crrev.com/cd47a379f9606a2ff47c047acabaeea73eeeaaef/include/GLSLANG/ShaderLang.h [modify] https://crrev.com/cd47a379f9606a2ff47c047acabaeea73eeeaaef/src/compiler/translator/OutputHLSL.cpp
,
Nov 13
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9e449edb49bf9919fae21c11caf53926753e85d0 commit 9e449edb49bf9919fae21c11caf53926753e85d0 Author: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Date: Tue Nov 13 05:42:28 2018 Roll src/third_party/angle 0b18583285cb..cd47a379f960 (5 commits) https://chromium.googlesource.com/angle/angle.git/+log/0b18583285cb..cd47a379f960 git log 0b18583285cb..cd47a379f960 --date=short --no-merges --format='%ad %ae %s' 2018-11-13 jie.a.chen@intel.com Add SH_INIT_SHARED_VARIABLES flag 2018-11-12 timvp@google.com Fix failing dEQP EGL tests. 2018-11-12 jmadill@chromium.org Whitespace change to cq.cfg. 2018-11-12 jmadill@chromium.org Add missing refs.cfg. 2018-11-12 jmadill@chromium.org Whitespace change to cq.cfg. Created with: gclient setdep -r src/third_party/angle@cd47a379f960 The AutoRoll server is located here: https://autoroll.skia.org/r/angle-chromium-autoroll Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary. CQ_INCLUDE_TRYBOTS=luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel BUG=chromium:898030, chromium:833848 , chromium:833848 , chromium:833848 TBR=cwallez@chromium.org Change-Id: I5b67d20c08fdeb6614bca9c7e0f451988f836d27 Reviewed-on: https://chromium-review.googlesource.com/c/1333054 Reviewed-by: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Commit-Queue: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Cr-Commit-Position: refs/heads/master@{#607502} [modify] https://crrev.com/9e449edb49bf9919fae21c11caf53926753e85d0/DEPS
,
Nov 22
The following revision refers to this bug: https://chromium.googlesource.com/angle/angle/+/efe061bd4f9de0f8da3135bcabff7f40e05e91f7 commit efe061bd4f9de0f8da3135bcabff7f40e05e91f7 Author: jchen10 <jie.a.chen@intel.com> Date: Thu Nov 22 01:57:56 2018 Optimize HLSL zero initializer Currently we initialize a variable using zero initializer. Take the below variable for example: uint var[4]; We translate it to: uint var[4] = { 0, 0, 0, 0}; If the array size is large, we have to use very long zero initializer. The problem is that it's very slow for D3D drivers to compile. This CL uses the 'static' trick below to solve the problem: static uint _ANGLE_ZEROS_[256]; ... uint var[516] = {_ANGLE_ZEROS_, _ANGLE_ZEROS_, 0, 0, 0, 0}; For 'static', if the declaration does not include an initializer, the value is set to zero. https://docs.microsoft.com/en-us/windows/desktop/direct3dhlsl/dx-graphics-hlsl-variable-syntax Bug: chromium:898030 Change-Id: Ia3f6574b5ddaffa94bf971140eba95835ee105ee Reviewed-on: https://chromium-review.googlesource.com/c/1332805 Reviewed-by: Shahbaz Youssefi <syoussefi@chromium.org> Reviewed-by: Geoff Lang <geofflang@chromium.org> Commit-Queue: Jie A Chen <jie.a.chen@intel.com> [modify] https://crrev.com/efe061bd4f9de0f8da3135bcabff7f40e05e91f7/src/compiler/translator/OutputHLSL.h [modify] https://crrev.com/efe061bd4f9de0f8da3135bcabff7f40e05e91f7/src/compiler/translator/OutputHLSL.cpp
,
Nov 22
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0e26cf6d17cdcebe352f8ec53a10e3d14517f677 commit 0e26cf6d17cdcebe352f8ec53a10e3d14517f677 Author: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Date: Thu Nov 22 11:18:41 2018 Roll src/third_party/angle 9f379f493cd7..efe061bd4f9d (3 commits) https://chromium.googlesource.com/angle/angle.git/+log/9f379f493cd7..efe061bd4f9d git log 9f379f493cd7..efe061bd4f9d --date=short --no-merges --format='%ad %ae %s' 2018-11-22 jie.a.chen@intel.com Optimize HLSL zero initializer 2018-11-22 jmadill@chromium.org Don't use gl::Error in validation. 2018-11-21 jmadill@chromium.org Vulkan: Cache pipelines with Shader Programs. Created with: gclient setdep -r src/third_party/angle@efe061bd4f9d The AutoRoll server is located here: https://autoroll.skia.org/r/angle-chromium-autoroll Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary. CQ_INCLUDE_TRYBOTS=luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel BUG=chromium:898030 TBR=geofflang@chromium.org Change-Id: I125a89563af417fed5d630b2be87b29d98f396f2 Reviewed-on: https://chromium-review.googlesource.com/c/1347083 Reviewed-by: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Commit-Queue: chromium-autoroll <chromium-autoroll@skia-public.iam.gserviceaccount.com> Cr-Commit-Position: refs/heads/master@{#610377} [modify] https://crrev.com/0e26cf6d17cdcebe352f8ec53a10e3d14517f677/DEPS |
||||
►
Sign in to add a comment |
||||
Comment 1 by kbr@chromium.org
, Oct 23