Change scaling from MakeHalfFloat to using GPU to do it |
|||
Issue description
MakeHalfFloat takes shorts which are 9, 10, or 12 bits.
To adjust for the various number of bits, the values are multiplied to normalize the range to 0 to 1.0f;
This takes a multiply.
If using the hardware float to half float conversion, it would be faster to not do this multiply, and leave the values in the original range.
e.g. 12 bits is 0.0f to 4095.0f
Both Intel and Arm have an instruction for float to half float.
This is the current gcc AVX2 code that uses vcvtps2ph
void HalfFloatRow_F16C(const uint16* src, uint16* dst, float scale, int width) {
asm volatile (
"vbroadcastss %3, %%ymm4 \n"
// 16 pixel loop.
LABELALIGN
"1: \n"
"vpmovzxwd " MEMACCESS(0) ",%%ymm2 \n" // 8 shorts -> 8 ints
"vpmovzxwd " MEMACCESS2(0x10,0) ",%%ymm3 \n" // 8 more
"lea " MEMLEA(0x20,0) ",%0 \n"
"vcvtdq2ps %%ymm2,%%ymm2 \n"
"vcvtdq2ps %%ymm3,%%ymm3 \n"
"vmulps %%ymm2,%%ymm4,%%ymm2 \n"
"vmulps %%ymm3,%%ymm4,%%ymm3 \n"
"vcvtps2ph $3, %%ymm2, %%xmm2 \n"
"vcvtps2ph $3, %%ymm3, %%xmm3 \n"
"vmovdqu %%xmm2," MEMACCESS(1) " \n"
"vmovdqu %%xmm3," MEMACCESS2(0x10,1) " \n"
"lea " MEMLEA(0x20,1) ",%1 \n"
"sub $0x10,%2 \n"
"jg 1b \n"
"vzeroupper \n"
: "+r"(src), // %0
"+r"(dst), // %1
"+r"(width) // %2
: "x"(scale) // %3
: "memory", "cc",
"xmm2", "xmm3", "xmm4"
);
}
The vmulps could be removed if scale is 1.0:
void HalfFloatRow_F16C(const uint16* src, uint16* dst, int width) {
asm volatile (
// 16 pixel loop.
LABELALIGN
"1: \n"
"vpmovzxwd " MEMACCESS(0) ",%%ymm2 \n" // 8 shorts -> 8 ints
"vpmovzxwd " MEMACCESS2(0x10,0) ",%%ymm3 \n" // 8 more
"lea " MEMLEA(0x20,0) ",%0 \n"
"vcvtdq2ps %%ymm2,%%ymm2 \n"
"vcvtdq2ps %%ymm3,%%ymm3 \n"
"vcvtps2ph $3, %%ymm2, %%xmm2 \n"
"vcvtps2ph $3, %%ymm3, %%xmm3 \n"
"vmovdqu %%xmm2," MEMACCESS(1) " \n"
"vmovdqu %%xmm3," MEMACCESS2(0x10,1) " \n"
"lea " MEMLEA(0x20,1) ",%1 \n"
"sub $0x10,%2 \n"
"jg 1b \n"
"vzeroupper \n"
: "+r"(src), // %0
"+r"(dst), // %1
"+r"(width) // %2
:
: "memory", "cc",
"xmm2", "xmm3"
);
}
Currently when AVX2 or NEON hardware instruction is unavailable, a multiply by a magic number is used.
void HalfFloatRow_C(const uint16* src, uint16* dst, float scale, int width) {
int i;
float mult = 1.9259299444e-34f * scale;
for (i = 0; i < width; ++i) {
float value = src[i] * mult;
dst[i] = (uint16)((*(uint32_t*)&value) >> 13);
}
}
If the scale is a power of 2 this is ok. e.g. 1/4096. But if scale is not power of 2, the value produced may round the mantissa differently than if the hardware instruction is used on the original value. So the results dont match.
Also if the scale is 1/65536, the values are so small the 5 bit halffloat exponent cant represent them, causing denormals.
This hurts performance
TestHalfFloatPlane_Opt (464 ms) is for 12 bits (1/4096)
TestHalfFloatPlane_denormal (24864 ms) is 16 bits (1/65536)
Consider using the existing functions but pass in 1.0f as scale value. The SIMD could check for 1 and use a specialized version of the code.
,
Oct 21 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/98626da3db8d1dec43d259d93f54b5adc9c500c2 commit 98626da3db8d1dec43d259d93f54b5adc9c500c2 Author: hubbe <hubbe@chromium.org> Date: Fri Oct 21 20:58:45 2016 Change half-float conversion to use 1.0 multiplier Half-float conversion is more efficient on some platforms if the multiplier used is 1.0. Also clean up the half-float conversion code and extend testing up to 16 bits. BUG= 657627 CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_precise_blink_rel Review-Url: https://chromiumcodereview.appspot.com/2444463002 Cr-Commit-Position: refs/heads/master@{#426885} [modify] https://crrev.com/98626da3db8d1dec43d259d93f54b5adc9c500c2/cc/resources/video_resource_updater.cc [modify] https://crrev.com/98626da3db8d1dec43d259d93f54b5adc9c500c2/cc/resources/video_resource_updater.h [modify] https://crrev.com/98626da3db8d1dec43d259d93f54b5adc9c500c2/cc/resources/video_resource_updater_unittest.cc
,
Oct 26 2016
,
Nov 15 2016
[bulk-edit : please ignore if not applicable] Could you please set the correct milestone for this issue?
,
Nov 24 2016
|
|||
►
Sign in to add a comment |
|||
Comment 1 by hubbe@chromium.org
, Oct 20 2016