
Issue 727518

Starred by 4 users

Issue metadata

Status: WontFix
Owner:
Closed: Sep 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug-Regression




16.6%-18.6% regression in blink_perf.shadow_dom at 475081:475224

Project Member Reported by toyoshim@chromium.org, May 30 2017

Issue description

See the link to graphs below.
 
All graphs for this bug:
  https://chromeperf.appspot.com/group_report?bug_id=727518

Original alerts at time of bug-filing:
  https://chromeperf.appspot.com/group_report?keys=agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgIDght2wqwkM,agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgIDg-u_o4woM


Bot(s) for this bug's original alert(s):

chromium-rel-win7-x64-dual
chromium-rel-win8-dual

Project Member Comment 3 by 42576172...@developer.gserviceaccount.com, May 30 2017

Cc: sebmarchand@chromium.org
Owner: sebmarchand@chromium.org

=== Auto-CCing suspected CL author sebmarchand@chromium.org ===

Hi sebmarchand@chromium.org, the bisect results pointed to your CL, please take a look at the
results.


=== BISECT JOB RESULTS ===
Perf regression found with culprit

Suspected Commit
  Author : sebmarchand
  Commit : 833aa96095e1f827ffc05af8bf545a094ffa6e35
  Date   : Sat May 27 01:34:31 2017
  Subject: Changing default Windows compiler to VS2017

Bisect Details
  Configuration: win_x64_perf_bisect
  Benchmark    : blink_perf.shadow_dom
  Metric       : v1-distribution/v1-distribution
  Change       : 20.23% | 4629.45015557 -> 3692.9303174

Revision             Result                  N
chromium@475080      4629.45 +- 60.7016      6      good
chromium@475149      4809.39 +- 30.5617      6      good
chromium@475183      4629.53 +- 83.9541      6      good
chromium@475200      4715.38 +- 63.7297      6      good
chromium@475209      4537.62 +- 150.152      6      good
chromium@475211      4679.88 +- 48.1666      6      good
chromium@475212      3733.32 +- 44.3159      6      bad       <--
chromium@475213      3814.69 +- 69.0268      6      bad
chromium@475217      3692.93 +- 45.6001      6      bad

To Run This Test
  src/tools/perf/run_benchmark -v --browser=release_x64 --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests blink_perf.shadow_dom

Debug Info
  https://chromeperf.appspot.com/buildbucket_job_status/8978194145412518320

Is this bisect wrong?
  https://chromeperf.appspot.com/bad_bisect?try_job_id=5906646003351552


| O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq
|  X  | for more information addressing perf regression bugs. For feedback,
| / \ | file a bug with component Speed>Bisection.  Thank you!

Project Member Comment 4 by 42576172...@developer.gserviceaccount.com, May 30 2017

 Issue 727519  has been merged into this issue.
Owner: brucedaw...@chromium.org
A performance regression has been discovered during the temporary switch to VS2017.

Project Member Comment 6 by 42576172...@developer.gserviceaccount.com, May 30 2017

Cc: liberato@google.com
 Issue 727758  has been merged into this issue.

Project Member Comment 7 by 42576172...@developer.gserviceaccount.com, May 30 2017

 Issue 727759  has been merged into this issue.

Project Member Comment 8 by 42576172...@developer.gserviceaccount.com, May 31 2017

Cc: sullivan@chromium.org peria@chromium.org dtu@chromium.org
 Issue 727434  has been merged into this issue.
A number of tests showed clear performance regressions that all appear to be aligned with the time that VS 2017 was the default compiler:

ChromiumPerf/chromium-rel-win10/media_perftests / sinc_resampler_convolve / unoptimized_aligned
ChromiumPerf/chromium-rel-win7-gpu-ati/media_perftests / sinc_resampler_convolve / unoptimized_aligned
ChromiumPerf/chromium-rel-win7-x64-dual/blink_perf.shadow_dom / v1-distribution
ChromiumPerf/chromium-rel-win8-dual/blink_perf.shadow_dom / v1-distribution
ChromiumPerf/chromium-rel-win7-dual/blink_perf.bindings / node-type

This test was also listed on the graphs and has a 'regression marker' but its graph doesn't actually show a regression.
ChromiumPerf/chromium-rel-win10/media_perftests / vector_math_ewma_and_max_power / unoptimized

The regression after the marker on that last graph throws off the scale; if you zoom in you can see a small regression:
Screen Shot 2017-05-31 at 5.30.13 PM.png (65.8 KB)
I can't reproduce the regression on my machine, but I can see the likely cause. Apologies for the wall of assembly language, but if you know where to look it makes the difference obvious. VC++ 2015 generates movaps whenever it is allowed to; this instruction assumes 16-byte-aligned memory operands. While the tables at http://www.agner.org/optimize/instruction_tables.pdf suggest that movaps/movups have identical latency and throughput, movups consumes more ops in some situations: five versus two for r,m operations on the AMD K7, and four versus one on the Pentium 4. My understanding is that the instructions are supposed to be identical on modern CPUs, but I'm still trying to figure out what counts as 'modern'.

Note that the difference exists in both the aligned and unaligned cases because even the unaligned case does aligned loads from some parallel arrays - two of the three _mm_load*_ps operations in the unaligned case are aligned loads (_mm_load_ps).
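To make that pattern concrete outside the disassembly, here is a minimal self-contained sketch of the same intrinsics structure. This is a simplified stand-in, not the shipping Convolve_SSE (the real function's tail differs); kKernelSize = 32 matches the eight 4-float iterations visible in the listings below (mov eax,8).

```cpp
#include <cstdint>
#include <xmmintrin.h>

// Simplified stand-in for SincResampler::Convolve_SSE. k1 and k2 are
// assumed 16-byte aligned; only input_ptr may be misaligned.
static const int kKernelSize = 32;

float ConvolveSketch(const float* input_ptr, const float* k1, const float* k2,
                     double kernel_interpolation_factor) {
  __m128 m_sums1 = _mm_setzero_ps();
  __m128 m_sums2 = _mm_setzero_ps();
  if (reinterpret_cast<uintptr_t>(input_ptr) & 0x0F) {
    for (int i = 0; i < kKernelSize; i += 4) {
      // Unaligned branch: one unaligned load for the input...
      __m128 m_input = _mm_loadu_ps(input_ptr + i);
      // ...but the two kernel loads are still aligned (_mm_load_ps, which
      // VS 2015 lowers to movaps), so the compiler change shows up here too.
      m_sums1 = _mm_add_ps(m_sums1, _mm_mul_ps(m_input, _mm_load_ps(k1 + i)));
      m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_load_ps(k2 + i)));
    }
  } else {
    for (int i = 0; i < kKernelSize; i += 4) {
      // Aligned branch: all three loads can be movaps under VS 2015.
      __m128 m_input = _mm_load_ps(input_ptr + i);
      m_sums1 = _mm_add_ps(m_sums1, _mm_mul_ps(m_input, _mm_load_ps(k1 + i)));
      m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_load_ps(k2 + i)));
    }
  }
  // Blend the two sums by the interpolation factor and reduce horizontally.
  __m128 blended = _mm_add_ps(
      _mm_mul_ps(m_sums1, _mm_set_ps1(
          static_cast<float>(1.0 - kernel_interpolation_factor))),
      _mm_mul_ps(m_sums2, _mm_set_ps1(
          static_cast<float>(kernel_interpolation_factor))));
  float out[4];
  _mm_storeu_ps(out, blended);
  return out[0] + out[1] + out[2] + out[3];
}
```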

VC++ 2015:
#if defined(ARCH_CPU_X86_FAMILY)
float SincResampler::Convolve_SSE(const float* input_ptr, const float* k1,
                                  const float* k2,
                                  double kernel_interpolation_factor) {
00007FF6D01EDE20  xorps       xmm4,xmm4  
  __m128 m_input;
  __m128 m_sums1 = _mm_setzero_ps();
  __m128 m_sums2 = _mm_setzero_ps();

  // Based on |input_ptr| alignment, we need to use loadu or load.  Unrolling
  // these loops hurt performance in local testing.
  if (reinterpret_cast<uintptr_t>(input_ptr) & 0x0F) {
    for (int i = 0; i < kKernelSize; i += 4) {
00007FF6D01EDE23  mov         eax,8  
00007FF6D01EDE28  movaps      xmm5,xmm4  
00007FF6D01EDE2B  test        cl,0Fh  
00007FF6D01EDE2E  je          media::SincResampler::Convolve_SSE+3Ah (07FF6D01EDE5Ah)  
00007FF6D01EDE30  sub         rcx,rdx  
00007FF6D01EDE33  sub         r8,rdx  
      m_input = _mm_loadu_ps(input_ptr + i);
00007FF6D01EDE36  movups      xmm1,xmmword ptr [rcx+rdx]  
      m_sums1 = _mm_add_ps(m_sums1, _mm_mul_ps(m_input, _mm_load_ps(k1 + i)));
00007FF6D01EDE3A  movaps      xmm0,xmmword ptr [rdx]  
00007FF6D01EDE3D  mulps       xmm0,xmm1  
00007FF6D01EDE40  addps       xmm4,xmm0  
      m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_load_ps(k2 + i)));
00007FF6D01EDE43  movaps      xmm0,xmmword ptr [r8+rdx]  
00007FF6D01EDE48  mulps       xmm0,xmm1  
00007FF6D01EDE4B  add         rdx,10h  
00007FF6D01EDE4F  addps       xmm5,xmm0  
00007FF6D01EDE52  sub         rax,1  
00007FF6D01EDE56  jne         media::SincResampler::Convolve_SSE+16h (07FF6D01EDE36h)  
    }
  } else {
00007FF6D01EDE58  jmp         media::SincResampler::Convolve_SSE+60h (07FF6D01EDE80h)  
    for (int i = 0; i < kKernelSize; i += 4) {
00007FF6D01EDE5A  sub         rcx,rdx  
00007FF6D01EDE5D  sub         r8,rdx  
      m_input = _mm_load_ps(input_ptr + i);
      m_sums1 = _mm_add_ps(m_sums1, _mm_mul_ps(m_input, _mm_load_ps(k1 + i)));
00007FF6D01EDE60  movaps      xmm0,xmmword ptr [rdx]  
00007FF6D01EDE63  mulps       xmm0,xmmword ptr [rdx+rcx]  
00007FF6D01EDE67  addps       xmm4,xmm0  
      m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_load_ps(k2 + i)));
00007FF6D01EDE6A  movaps      xmm0,xmmword ptr [rdx+r8]  
00007FF6D01EDE6F  mulps       xmm0,xmmword ptr [rdx+rcx]  
00007FF6D01EDE73  add         rdx,10h  
00007FF6D01EDE77  addps       xmm5,xmm0  
00007FF6D01EDE7A  sub         rax,1  
      m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_load_ps(k2 + i)));
00007FF6D01EDE7E  jne         media::SincResampler::Convolve_SSE+40h (07FF6D01EDE60h)  



VC++ 2017:
#if defined(ARCH_CPU_X86_FAMILY)
float SincResampler::Convolve_SSE(const float* input_ptr, const float* k1,
                                  const float* k2,
                                  double kernel_interpolation_factor) {
00007FF7BFD82AA3  xorps       xmm5,xmm5  
  __m128 m_input;
  __m128 m_sums1 = _mm_setzero_ps();
  __m128 m_sums2 = _mm_setzero_ps();

  // Based on |input_ptr| alignment, we need to use loadu or load.  Unrolling
  // these loops hurt performance in local testing.
  if (reinterpret_cast<uintptr_t>(input_ptr) & 0x0F) {
    for (int i = 0; i < kKernelSize; i += 4) {
00007FF7BFD82AA6  mov         eax,8  
00007FF7BFD82AAB  test        cl,0Fh  
00007FF7BFD82AAE  je          media::SincResampler::Convolve_SSE+3Ah (07FF7BFD82ADAh)  
00007FF7BFD82AB0  sub         rcx,rdx  
00007FF7BFD82AB3  sub         r8,rdx  
      m_input = _mm_loadu_ps(input_ptr + i);
00007FF7BFD82AB6  movups      xmm1,xmmword ptr [rcx+rdx]  
      m_sums1 = _mm_add_ps(m_sums1, _mm_mul_ps(m_input, _mm_load_ps(k1 + i)));
00007FF7BFD82ABA  movups      xmm0,xmmword ptr [rdx]  
00007FF7BFD82ABD  mulps       xmm0,xmm1  
00007FF7BFD82AC0  addps       xmm4,xmm0  
      m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_load_ps(k2 + i)));
00007FF7BFD82AC3  movups      xmm0,xmmword ptr [r8+rdx]  
00007FF7BFD82AC8  mulps       xmm0,xmm1  
00007FF7BFD82ACB  add         rdx,10h  
00007FF7BFD82ACF  addps       xmm5,xmm0  
00007FF7BFD82AD2  sub         rax,1  
00007FF7BFD82AD6  jne         media::SincResampler::Convolve_SSE+16h (07FF7BFD82AB6h)  
    }
  } else {
00007FF7BFD82AD8  jmp         media::SincResampler::Convolve_SSE+60h (07FF7BFD82B00h)  
    for (int i = 0; i < kKernelSize; i += 4) {
00007FF7BFD82ADA  sub         rcx,rdx  
00007FF7BFD82ADD  sub         r8,rdx  
      m_input = _mm_load_ps(input_ptr + i);
      m_sums1 = _mm_add_ps(m_sums1, _mm_mul_ps(m_input, _mm_load_ps(k1 + i)));
00007FF7BFD82AE0  movups      xmm0,xmmword ptr [rdx]  
00007FF7BFD82AE3  mulps       xmm0,xmmword ptr [rdx+rcx]  
00007FF7BFD82AE7  addps       xmm4,xmm0  
      m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_load_ps(k2 + i)));
00007FF7BFD82AEA  movups      xmm0,xmmword ptr [rdx+r8]  
00007FF7BFD82AEF  mulps       xmm0,xmmword ptr [rdx+rcx]  
00007FF7BFD82AF3  add         rdx,10h  
00007FF7BFD82AF7  addps       xmm5,xmm0  
00007FF7BFD82AFA  sub         rax,1  
00007FF7BFD82AFE  jne         media::SincResampler::Convolve_SSE+40h (07FF7BFD82AE0h)  

What sort of machines are we running these tests on? movups/movaps is supposed to have been irrelevant for a long time, and is on my machine. It would be useful to know what hardware the tests run on, and what percentage of our customers have affected machines.

The VS tracking bug for this is:
https://developercommunity.visualstudio.com/content/problem/19160/regression-from-vs-2015-in-ssseavx-instructions-ge.html
Status: Started (was: Untriaged)
For completeness, here is the clang-cl code, whose performance is basically identical to that of VS 2017. clang-cl uses movaps but schedules the instructions differently, and its loop is less aligned? CPUs are weird. And maybe this isn't the code that runs on the bots.

#if defined(ARCH_CPU_X86_FAMILY)
float SincResampler::Convolve_SSE(const float* input_ptr, const float* k1,
                                  const float* k2,
                                  double kernel_interpolation_factor) {
00007FF7351D6380  xorps       xmm0,xmm0  
00007FF7351D6383  mov         rax,0FFFFFFFFFFFFFFFCh  
00007FF7351D638A  xorps       xmm1,xmm1  
  __m128 m_input;
  __m128 m_sums1 = _mm_setzero_ps();
  __m128 m_sums2 = _mm_setzero_ps();

  // Based on |input_ptr| alignment, we need to use loadu or load.  Unrolling
  // these loops hurt performance in local testing.
  if (reinterpret_cast<uintptr_t>(input_ptr) & 0x0F) {
00007FF7351D638D  test        cl,0Fh  
00007FF7351D6390  je          media::SincResampler::Convolve_SSE+37h (07FF7351D63B7h)  
      m_input = _mm_loadu_ps(input_ptr + i);
00007FF7351D6392  movups      xmm2,xmmword ptr [rcx+rax*4+10h]  
00007FF7351D6397  movaps      xmm4,xmmword ptr [rdx+rax*4+10h]  
      m_sums1 = _mm_add_ps(m_sums1, _mm_mul_ps(m_input, _mm_load_ps(k1 + i)));
00007FF7351D639C  mulps       xmm4,xmm2  
00007FF7351D639F  addps       xmm1,xmm4  
      m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_load_ps(k2 + i)));
00007FF7351D63A2  mulps       xmm2,xmmword ptr [r8+rax*4+10h]  
00007FF7351D63A8  addps       xmm0,xmm2  
    for (int i = 0; i < kKernelSize; i += 4) {
00007FF7351D63AB  add         rax,4  
00007FF7351D63AF  cmp         rax,1Ch  
00007FF7351D63B3  jl          media::SincResampler::Convolve_SSE+12h (07FF7351D6392h)  
00007FF7351D63B5  jmp         media::SincResampler::Convolve_SSE+5Ah (07FF7351D63DAh)  
      m_input = _mm_load_ps(input_ptr + i);
00007FF7351D63B7  movaps      xmm2,xmmword ptr [rcx+rax*4+10h]  
00007FF7351D63BC  movaps      xmm4,xmmword ptr [rdx+rax*4+10h]  
      m_sums1 = _mm_add_ps(m_sums1, _mm_mul_ps(m_input, _mm_load_ps(k1 + i)));
00007FF7351D63C1  mulps       xmm4,xmm2  
00007FF7351D63C4  addps       xmm1,xmm4  
      m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_load_ps(k2 + i)));
00007FF7351D63C7  mulps       xmm2,xmmword ptr [r8+rax*4+10h]  
00007FF7351D63CD  addps       xmm0,xmm2  
    }
  } else {
    for (int i = 0; i < kKernelSize; i += 4) {
00007FF7351D63D0  add         rax,4  
00007FF7351D63D4  cmp         rax,1Ch  
00007FF7351D63D8  jl          media::SincResampler::Convolve_SSE+37h (07FF7351D63B7h)  

Re "modern":

Nehalem was the first CPU to make unaligned movups the same speed as movaps. Previously, on the Core architecture, even when the memory was aligned, movups took 5 cycles vs. 1 for movaps. Now both take 1 cycle for aligned memory, or for reads that don't span a cache line; there is still a penalty for a cache-line split or page split. The aligned encoding can also be smaller, which lets the front end go faster.

If you know for sure the memory is aligned, use movaps / _mm_load_ps, e.g. for a constant. Also consider using a memory argument directly:
mulps       xmm4, xmmword ptr [rcx+rax*4+10h]
In general, smaller code with fewer instructions goes faster on modern CPUs.

Otherwise, use unaligned memory operations: change _mm_load_ps to _mm_loadu_ps and remove the if() statement. This is true on ARM as well. When you allocate memory or declare constants, align them, but in code use unaligned loads/stores.
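A minimal sketch of that last piece of advice: align the storage at declaration time, keep the load code alignment-agnostic (buffer and function names here are hypothetical, not from the Chromium source):

```cpp
#include <xmmintrin.h>

// Storage is 16-byte aligned at declaration time...
alignas(16) static float g_buffer[64];

// ...but the code uses only unaligned loads, so no aligned/unaligned branch
// is needed. On Nehalem and later, movups on an address that happens to be
// aligned runs at the same speed as movaps would.
float SumSquares(const float* p, int n) {
  __m128 acc = _mm_setzero_ps();
  for (int i = 0; i < n; i += 4) {
    __m128 v = _mm_loadu_ps(p + i);  // safe for any alignment
    acc = _mm_add_ps(acc, _mm_mul_ps(v, v));
  }
  float out[4];
  _mm_storeu_ps(out, acc);
  return out[0] + out[1] + out[2] + out[3];
}
```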


> If you know for sure the memory is aligned, use movaps / _mm_load_ps.

Unfortunately with VS 2017 intrinsics we cannot use movaps. The compiler will turn _mm_load_ps into movups, unilaterally.
I'm sorry I missed the question in #12 earlier! These machines are PowerEdge R210 II.

Is this bug still being worked on?
The bug is still of interest because we want to understand the implications of this compiler change and why it causes so large a performance difference. The switch to clang means that this probably ultimately doesn't matter; even without the switch to clang, this bug would probably end up closed as won't-fix because it's not a huge regression, it is counteracted by gains elsewhere, it should only affect old CPUs, and it's not clear that we can do anything about it (other than removing some code paths which the compiler change makes irrelevant).

Question: it looks like PowerEdge R210 II is a family of machines and this regression is very sensitive to what generation of CPU the tests are being run on. Do you know the precise model numbers?

Cc: martiniss@chromium.org
+martiniss: do you know how to get the exact model number of PowerEdge R210 II for Win 7 x64 Perf?
https://chromium-swarm.appspot.com/bot?id=build104-m1&show_state=true&sort_stats=total%3Adesc is the bot that triggered this alert, from what I can tell.

  "cpu_name": "Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz" is in the bot state, if that's helpful. There's also some other stuff in the bot state json on that page, if you find any of that useful.

If you want the model number of it, you'd probably need to ask someone from labs; I don't know windows well enough to do that. If you still need more info, I'll route this to the appropriate person.
I can get you any of the information listed in this class: https://msdn.microsoft.com/en-us/library/aa394102(v=vs.85).aspx

The "Model" field is apparently "PowerEdge R220", which confuses me....
The cpu_name data is exactly what I needed - thanks. Looking that up on Intel's ark.intel.com site shows it is "Products formerly Haswell" with a launch date of Q2'13.

The claim (in comment #14, which agrees with what Intel/AMD told MSFT) is that starting with Nehalem CPUs there was no advantage to using movaps instead of movups. Nehalem CPUs were released in 2008. Haswell CPUs came out in 2013 and are two microarchitectural generations ahead of Nehalem.

When I get a chance I will investigate more - it seems very surprising.

Cc: brucedaw...@chromium.org dalecur...@chromium.org
 Issue 759693  has been merged into this issue.

Comment 23 by dtu@chromium.org, Sep 1 2017

Cc: -dtu@chromium.org
Cc: johnchen@chromium.org
 Issue 768394  has been merged into this issue.
This was reported again ( crbug.com/768394 ) on the most recent switch so I investigated again, running this command on an optimized 64-bit build:

> media_perftests.exe --single-process-tests --gtest_filter=SincResamplerPerfTest.*

For unoptimized_aligned and optimized_aligned on my Z840 workstation I see results of:
VS 2015: ~34,000 and ~105,000
VS 2017: ~30,000 and ~95,000 or 105,000

The results for unoptimized_aligned are 100% consistently a regression. The results for optimized_aligned are bi-modal for VS 2017.

Here are some typical raw results:

> out\release_64_2015\media_perftests.exe --single-process-tests --gtest_filter=SincResamplerPerfTest.*
...
*RESULT sinc_resampler_convolve: unoptimized_aligned= 34335.04688107301 runs/ms
*RESULT sinc_resampler_convolve: optimized_aligned= 106067.48437625954 runs/ms
*RESULT sinc_resampler_convolve: optimized_unaligned= 98363.04225086117 runs/ms

> out\release_64\media_perftests.exe --single-process-tests --gtest_filter=SincResamplerPerfTest.*
...
*RESULT sinc_resampler_convolve: unoptimized_aligned= 29902.911227825498 runs/ms
*RESULT sinc_resampler_convolve: optimized_aligned= 94703.785499714 runs/ms
*RESULT sinc_resampler_convolve: optimized_unaligned= 95439.70024298948 runs/ms

So, this is reproducible. Previously I had trouble seeing the differences. I'm not sure if my previous difficulties were because I was using a different CPU, due to user error, due to the bimodal nature of the optimized_aligned test, or because whether there is a regression on my CPU depends on instruction alignment or other pseudo-random details.

I've attached the disassembly of media_perftests.exe!media::SincResampler::Convolve_SSE from a VS 2015 and VS 2017 build. The build settings I used were:

is_component_build = false
is_debug = false
target_cpu = "x64"
enable_nacl = false

There is one instruction that is different in the function prologue and then four instances of movaps being changed to movups. This is the same difference that was identified originally when I commented on this VS bug:
https://developercommunity.visualstudio.com/content/problem/19160/regression-from-vs-2015-in-ssseavx-instructions-ge.html

The conclusion at that point was that the change was intentional and agreed upon as appropriate by Intel, AMD, and Microsoft, and that it should cause only insignificant performance regressions. Specifically:

"When we made the change in VS2015, we knew the performance impact on old machines. But for modern cpus since Nehalem which was released almost 9 years ago, the use of movups vs. movaps has almost no performance impact. We have confirmed this with Intel team and AMD team."

Apparently this is not true for certain types of well-tuned machine code. I am running a relatively modern CPU, an Intel Xeon CPU E5-2690 v3 @ 2.60 GHz (2 processors):
https://ark.intel.com/products/81713/Intel-Xeon-Processor-E5-2690-v3-30M-Cache-2_60-GHz

This processor was released in Q3 of 2014, long after Nehalem. Here's some more information:

>wmic cpu get caption
Caption
Intel64 Family 6 Model 63 Stepping 2
Intel64 Family 6 Model 63 Stepping 2


So... I disagree with this change in code-gen for VS 2017. It can hurt performance (occasionally by non-trivial amounts) and it hides certain types of alignment bugs instead of letting the code crash. Thus, it is like removing asserts, with the added risk of harming performance. I believe that this doesn't affect most code, but I still disagree.
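One way to keep the fail-fast behavior that movaps used to provide is an explicit debug-mode alignment check; this is a hypothetical wrapper sketched for illustration, not something the Chromium code does:

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>

// With movaps codegen, a misaligned pointer faulted immediately; with movups
// it silently works. This wrapper restores fail-fast behavior in debug
// builds, independent of which instruction the compiler chooses to emit.
inline __m128 LoadAlignedChecked(const float* p) {
  assert((reinterpret_cast<uintptr_t>(p) & 0x0F) == 0 &&
         "misaligned pointer passed to an aligned load");
  return _mm_load_ps(p);
}
```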

vs2015.txt (1.5 KB)
vs2017.txt (1.5 KB)
BTW, the trick to reproducing this is to create two output directories with the same gn args (see the previous comment). Then, follow these steps:

set GYP_MSVS_VERSION=2015
gclient runhooks
gn gen out\release_64_2015
ninja -C out\release_64_2015 media_perftests

set GYP_MSVS_VERSION=2017
gclient runhooks
gn gen out\release_64_2017
ninja -C out\release_64_2017 media_perftests

It is *crucial* to invoke "gclient runhooks" and "gn gen" after changing GYP_MSVS_VERSION in order to be certain that the new compiler is invoked. Running "gn gen" will latch the currently selected compiler so it is also crucial to not accidentally cause a re-generation of .ninja files at the wrong time. Adding this code to the start of RunConvolveBenchmark helps avoid errors:

static void RunConvolveBenchmark(
    SincResampler* resampler,
    float (*convolve_fn)(const float*, const float*, const float*, double),
    bool aligned,
    const std::string& trace_name) {
#if _MSC_VER >= 1910
  printf("VS 2017\n");
#elif _MSC_VER == 1900
  printf("VS 2015\n");
#else
  #error
#endif


This gives the typical results shown below. Note that while testing this I hit one VS 2015 run that was slow on the optimized_aligned case - its bimodality is annoying. ASLR-related, perhaps? However, unoptimized_aligned continues to be a very reliable compiler detector.

>out\release_64_2015\media_perftests.exe --single-process-tests --gtest_filter=SincResamplerPerfTest.*
...
VS 2015
*RESULT sinc_resampler_convolve: unoptimized_aligned= 34670.5053850229 runs/ms
VS 2015
*RESULT sinc_resampler_convolve: optimized_aligned= 105073.58303019604 runs/ms
VS 2015
*RESULT sinc_resampler_convolve: optimized_unaligned= 89338.83898818404 runs/ms


>out\release_64\media_perftests.exe --single-process-tests --gtest_filter=SincResamplerPerfTest.*
...
VS 2017
*RESULT sinc_resampler_convolve: unoptimized_aligned= 30512.642913591248 runs/ms
VS 2017
*RESULT sinc_resampler_convolve: optimized_aligned= 104551.76567021864 runs/ms
VS 2017
*RESULT sinc_resampler_convolve: optimized_unaligned= 97245.61519521085 runs/ms

After a bit more analysis I realized that the function with the consistent regression is the unoptimized SSE function. Examination of the Convolve_C assembly language showed an extra integer add instruction in the inner loop (and some rearrangement of instructions). The extra add is *probably* the cause of the reproducible regression.
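For reference, the scalar function is roughly of this shape (a hypothetical simplification written from the description above, not the exact Chromium Convolve_C source); the inner loop is only a handful of instructions, so one extra integer add per iteration is enough to show up in a benchmark that does nothing else:

```cpp
// Hypothetical simplification of a scalar convolve kernel in the style of
// SincResampler::Convolve_C. One shared index drives three arrays; a
// compiler that instead maintains separate pointers can add an extra
// integer increment to this very tight loop.
float ConvolveC(const float* input_ptr, const float* k1, const float* k2,
                double kernel_interpolation_factor) {
  float sum1 = 0.0f;
  float sum2 = 0.0f;
  for (int i = 0; i < 32; ++i) {
    sum1 += input_ptr[i] * k1[i];
    sum2 += input_ptr[i] * k2[i];
  }
  // Linearly interpolate between the two kernel results.
  return static_cast<float>((1.0 - kernel_interpolation_factor) * sum1 +
                            kernel_interpolation_factor * sum2);
}
```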

Since the optimized (_SSE) functions do not have a consistent regression I'm going to close this.

I've attached updated disassemblies that make the differences easy to see.

vs2015.txt (2.2 KB)
vs2017.txt (2.2 KB)
Status: WontFix (was: Started)
The performance regression is unfortunate but not crucial and not systemic. I'm going to close this as WontFix.
Very interesting analysis. Thanks Bruce! I agree, the non-optimized variant is just for comparison purposes so we don't care about regressions there.
Great!

This bug originally also covered blink_perf.shadow_dom, which I don't have a good understanding of. If that test regresses on this most recent switch then the bug will end up assigned to me and I will investigate it separately.
