New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 657627 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Oct 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Sign in to add a comment

Change scaling from MakeHalfFloat to using GPU to do it

Project Member Reported by fbarchard@chromium.org, Oct 19 2016

Issue description

MakeHalfFloat takes shorts which are 9, 10, or 12 bits.
To adjust for the various number of bits, the values are multiplied to normalize the range to 0 to 1.0f;
This takes a multiply.
If using the hardware float to half float conversion, it would be faster to not do this multiply, and leave the values in the original range.
e.g. 12 bits is 0.0f to 4095.0f
Both Intel and Arm have an instruction for float to half float.

This is the current gcc AVX2 code that uses vcvtps2ph

void HalfFloatRow_F16C(const uint16* src, uint16* dst, float scale, int width) {
  asm volatile (
   "vbroadcastss  %3, %%ymm4                  \n"

    // 16 pixel loop.
    LABELALIGN
  "1:                                          \n"
    "vpmovzxwd   " MEMACCESS(0) ",%%ymm2       \n"  // 8 shorts -> 8 ints
    "vpmovzxwd   " MEMACCESS2(0x10,0) ",%%ymm3 \n"  // 8 more
    "lea         " MEMLEA(0x20,0) ",%0         \n"
    "vcvtdq2ps   %%ymm2,%%ymm2                 \n"
    "vcvtdq2ps   %%ymm3,%%ymm3                 \n"
    "vmulps      %%ymm2,%%ymm4,%%ymm2          \n"
    "vmulps      %%ymm3,%%ymm4,%%ymm3          \n"
    "vcvtps2ph   $3, %%ymm2, %%xmm2            \n"
    "vcvtps2ph   $3, %%ymm3, %%xmm3            \n"
    "vmovdqu     %%xmm2," MEMACCESS(1) "       \n"
    "vmovdqu     %%xmm3," MEMACCESS2(0x10,1) " \n"
    "lea         " MEMLEA(0x20,1) ",%1         \n"
    "sub         $0x10,%2                      \n"
    "jg          1b                            \n"

    "vzeroupper                                \n"
  : "+r"(src),   // %0
    "+r"(dst),   // %1
    "+r"(width)  // %2
  : "x"(scale)   // %3
  : "memory", "cc",
    "xmm2", "xmm3", "xmm4"
  );
}

The vmulps could be removed if scale is 1.0:

void HalfFloatRow_F16C(const uint16* src, uint16* dst, int width) {
  asm volatile (
    // 16 pixel loop.
    LABELALIGN
  "1:                                          \n"
    "vpmovzxwd   " MEMACCESS(0) ",%%ymm2       \n"  // 8 shorts -> 8 ints
    "vpmovzxwd   " MEMACCESS2(0x10,0) ",%%ymm3 \n"  // 8 more
    "lea         " MEMLEA(0x20,0) ",%0         \n"
    "vcvtdq2ps   %%ymm2,%%ymm2                 \n"
    "vcvtdq2ps   %%ymm3,%%ymm3                 \n"
    "vcvtps2ph   $3, %%ymm2, %%xmm2            \n"
    "vcvtps2ph   $3, %%ymm3, %%xmm3            \n"
    "vmovdqu     %%xmm2," MEMACCESS(1) "       \n"
    "vmovdqu     %%xmm3," MEMACCESS2(0x10,1) " \n"
    "lea         " MEMLEA(0x20,1) ",%1         \n"
    "sub         $0x10,%2                      \n"
    "jg          1b                            \n"

    "vzeroupper                                \n"
  : "+r"(src),   // %0
    "+r"(dst),   // %1
    "+r"(width)  // %2
  :
  : "memory", "cc",
    "xmm2", "xmm3"
  );
}


Currently when AVX2 or NEON hardware instruction is unavailable, a multiply by a magic number is used.

void HalfFloatRow_C(const uint16* src, uint16* dst, float scale, int width) {
  int i;
  float mult = 1.9259299444e-34f * scale;
  for (i = 0; i < width; ++i) {
    float value = src[i] * mult;
    dst[i] = (uint16)((*(uint32_t*)&value) >> 13);
  }
}

If the scale is a power of 2 this is ok. e.g. 1/4096.  But if scale is not power of 2, the value produced may round the mantissa differently than if the hardware instruction is used on the original value.  So the results dont match.
Also if the scale is 1/65536, the values are so small the 5 bit halffloat exponent cant represent them, causing denormals.
This hurts performance
TestHalfFloatPlane_Opt (464 ms) is for 12 bits (1/4096)
TestHalfFloatPlane_denormal (24864 ms) is 16 bits (1/65536)

Consider using the existing functions but pass in 1.0f as scale value.  The SIMD could check for 1 and use a specialized version of the code.



 

Comment 1 by hubbe@chromium.org, Oct 20 2016

Status: Assigned (was: Untriaged)
I'm going to write a CL that uses 1.0f as multiplier, UNLESS bits is 16. Since 0xFFFF will overflow a half-float.  I'll probably just use a multiplier of 0.5 when bits is 16 , as that will avoid overflows and denormalized floats.

Project Member

Comment 2 by bugdroid1@chromium.org, Oct 21 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/98626da3db8d1dec43d259d93f54b5adc9c500c2

commit 98626da3db8d1dec43d259d93f54b5adc9c500c2
Author: hubbe <hubbe@chromium.org>
Date: Fri Oct 21 20:58:45 2016

Change half-float conversion to use 1.0 multiplier

Half-float conversion is more efficient on some platforms if the
multiplier used is 1.0. Also clean up the half-float conversion
code and extend testing up to 16 bits.

BUG= 657627 
CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_precise_blink_rel

Review-Url: https://chromiumcodereview.appspot.com/2444463002
Cr-Commit-Position: refs/heads/master@{#426885}

[modify] https://crrev.com/98626da3db8d1dec43d259d93f54b5adc9c500c2/cc/resources/video_resource_updater.cc
[modify] https://crrev.com/98626da3db8d1dec43d259d93f54b5adc9c500c2/cc/resources/video_resource_updater.h
[modify] https://crrev.com/98626da3db8d1dec43d259d93f54b5adc9c500c2/cc/resources/video_resource_updater_unittest.cc

Comment 3 by hubbe@chromium.org, Oct 26 2016

Status: Fixed (was: Assigned)
[bulk-edit : please ignore if not applicable]

Could you please set the correct milestone for this issue?
Labels: M-56

Sign in to add a comment