16.6%-18.6% regression in blink_perf.shadow_dom at 475081:475224
Issue description: See the link to graphs below.
May 30 2017
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8978194145412518320
May 30 2017
=== Auto-CCing suspected CL author sebmarchand@chromium.org ===
Hi sebmarchand@chromium.org, the bisect results pointed to your CL, please take a look at the results.

=== BISECT JOB RESULTS ===
Perf regression found with culprit

Suspected Commit
Author  : sebmarchand
Commit  : 833aa96095e1f827ffc05af8bf545a094ffa6e35
Date    : Sat May 27 01:34:31 2017
Subject : Changing default Windows compiler to VS2017

Bisect Details
Configuration: win_x64_perf_bisect
Benchmark    : blink_perf.shadow_dom
Metric       : v1-distribution/v1-distribution
Change       : 20.23% | 4629.45015557 -> 3692.9303174

Revision         Result               N
chromium@475080  4629.45 +- 60.7016   6  good
chromium@475149  4809.39 +- 30.5617   6  good
chromium@475183  4629.53 +- 83.9541   6  good
chromium@475200  4715.38 +- 63.7297   6  good
chromium@475209  4537.62 +- 150.152   6  good
chromium@475211  4679.88 +- 48.1666   6  good
chromium@475212  3733.32 +- 44.3159   6  bad  <--
chromium@475213  3814.69 +- 69.0268   6  bad
chromium@475217  3692.93 +- 45.6001   6  bad

To Run This Test
src/tools/perf/run_benchmark -v --browser=release_x64 --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests blink_perf.shadow_dom

Debug Info
https://chromeperf.appspot.com/buildbucket_job_status/8978194145412518320
Is this bisect wrong? https://chromeperf.appspot.com/bad_bisect?try_job_id=5906646003351552

Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq for more information on addressing perf regression bugs. For feedback, file a bug with component Speed>Bisection. Thank you!
May 30 2017
Issue 727519 has been merged into this issue.
May 30 2017
A performance regression has been discovered during the temporary switch to VS2017.
May 30 2017
Issue 727759 has been merged into this issue.
May 31 2017
Issue 727434 has been merged into this issue.
May 31 2017
A number of tests showed clear performance regressions that all appear to be aligned with the time that VS 2017 became the default compiler:

ChromiumPerf/chromium-rel-win10/media_perftests / sinc_resampler_convolve / unoptimized_aligned
ChromiumPerf/chromium-rel-win7-gpu-ati/media_perftests / sinc_resampler_convolve / unoptimized_aligned
ChromiumPerf/chromium-rel-win7-x64-dual/blink_perf.shadow_dom / v1-distribution
ChromiumPerf/chromium-rel-win8-dual/blink_perf.shadow_dom / v1-distribution
ChromiumPerf/chromium-rel-win7-dual/blink_perf.bindings / node-type

This test was also listed on the graphs and has a 'regression marker', but its graph doesn't actually show a regression:

ChromiumPerf/chromium-rel-win10/media_perftests / vector_math_ewma_and_max_power / unoptimized
May 31 2017
The regression after the marker on that last graph throws off the scale; if you zoom in you can see a small regression:
Jun 20 2017
I can't reproduce the regression on my machine, but I can see the likely cause. Apologies for the wall of assembly language, but if you know where to look it makes the difference obvious.

VC++ 2015 generates movaps whenever it is allowed to. This instruction assumes 16-byte aligned memory operands. While the tables at http://www.agner.org/optimize/instruction_tables.pdf suggest that movaps/movups have identical latency and throughput, movups consumes more ops in some situations - five versus two for r,m operations on the AMD K7, and four versus one on the Pentium 4. My understanding is that the instructions are supposed to be identical on modern CPUs, but I'm still trying to figure out what counts as 'modern'.

Note that the difference exists in both the aligned and unaligned cases because even the unaligned case does aligned loads from some parallel arrays - two of the three _mm_load*_ps operations in the unaligned case are aligned loads (_mm_load_ps).

VC++ 2015:

#if defined(ARCH_CPU_X86_FAMILY)
float SincResampler::Convolve_SSE(const float* input_ptr, const float* k1,
const float* k2,
double kernel_interpolation_factor) {
00007FF6D01EDE20 xorps xmm4,xmm4
__m128 m_input;
__m128 m_sums1 = _mm_setzero_ps();
__m128 m_sums2 = _mm_setzero_ps();
// Based on |input_ptr| alignment, we need to use loadu or load. Unrolling
// these loops hurt performance in local testing.
if (reinterpret_cast<uintptr_t>(input_ptr) & 0x0F) {
for (int i = 0; i < kKernelSize; i += 4) {
00007FF6D01EDE23 mov eax,8
00007FF6D01EDE28 movaps xmm5,xmm4
00007FF6D01EDE2B test cl,0Fh
00007FF6D01EDE2E je media::SincResampler::Convolve_SSE+3Ah (07FF6D01EDE5Ah)
00007FF6D01EDE30 sub rcx,rdx
00007FF6D01EDE33 sub r8,rdx
m_input = _mm_loadu_ps(input_ptr + i);
00007FF6D01EDE36 movups xmm1,xmmword ptr [rcx+rdx]
m_sums1 = _mm_add_ps(m_sums1, _mm_mul_ps(m_input, _mm_load_ps(k1 + i)));
00007FF6D01EDE3A movaps xmm0,xmmword ptr [rdx]
00007FF6D01EDE3D mulps xmm0,xmm1
00007FF6D01EDE40 addps xmm4,xmm0
m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_load_ps(k2 + i)));
00007FF6D01EDE43 movaps xmm0,xmmword ptr [r8+rdx]
00007FF6D01EDE48 mulps xmm0,xmm1
00007FF6D01EDE4B add rdx,10h
00007FF6D01EDE4F addps xmm5,xmm0
00007FF6D01EDE52 sub rax,1
00007FF6D01EDE56 jne media::SincResampler::Convolve_SSE+16h (07FF6D01EDE36h)
}
} else {
00007FF6D01EDE58 jmp media::SincResampler::Convolve_SSE+60h (07FF6D01EDE80h)
for (int i = 0; i < kKernelSize; i += 4) {
00007FF6D01EDE5A sub rcx,rdx
00007FF6D01EDE5D sub r8,rdx
m_input = _mm_load_ps(input_ptr + i);
m_sums1 = _mm_add_ps(m_sums1, _mm_mul_ps(m_input, _mm_load_ps(k1 + i)));
00007FF6D01EDE60 movaps xmm0,xmmword ptr [rdx]
00007FF6D01EDE63 mulps xmm0,xmmword ptr [rdx+rcx]
00007FF6D01EDE67 addps xmm4,xmm0
m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_load_ps(k2 + i)));
00007FF6D01EDE6A movaps xmm0,xmmword ptr [rdx+r8]
00007FF6D01EDE6F mulps xmm0,xmmword ptr [rdx+rcx]
00007FF6D01EDE73 add rdx,10h
00007FF6D01EDE77 addps xmm5,xmm0
00007FF6D01EDE7A sub rax,1
00007FF6D01EDE7E jne media::SincResampler::Convolve_SSE+40h (07FF6D01EDE60h)

VC++ 2017:

#if defined(ARCH_CPU_X86_FAMILY)
float SincResampler::Convolve_SSE(const float* input_ptr, const float* k1,
const float* k2,
double kernel_interpolation_factor) {
00007FF7BFD82AA3 xorps xmm5,xmm5
__m128 m_input;
__m128 m_sums1 = _mm_setzero_ps();
__m128 m_sums2 = _mm_setzero_ps();
// Based on |input_ptr| alignment, we need to use loadu or load. Unrolling
// these loops hurt performance in local testing.
if (reinterpret_cast<uintptr_t>(input_ptr) & 0x0F) {
for (int i = 0; i < kKernelSize; i += 4) {
00007FF7BFD82AA6 mov eax,8
00007FF7BFD82AAB test cl,0Fh
00007FF7BFD82AAE je media::SincResampler::Convolve_SSE+3Ah (07FF7BFD82ADAh)
00007FF7BFD82AB0 sub rcx,rdx
00007FF7BFD82AB3 sub r8,rdx
m_input = _mm_loadu_ps(input_ptr + i);
00007FF7BFD82AB6 movups xmm1,xmmword ptr [rcx+rdx]
m_sums1 = _mm_add_ps(m_sums1, _mm_mul_ps(m_input, _mm_load_ps(k1 + i)));
00007FF7BFD82ABA movups xmm0,xmmword ptr [rdx]
00007FF7BFD82ABD mulps xmm0,xmm1
00007FF7BFD82AC0 addps xmm4,xmm0
m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_load_ps(k2 + i)));
00007FF7BFD82AC3 movups xmm0,xmmword ptr [r8+rdx]
00007FF7BFD82AC8 mulps xmm0,xmm1
00007FF7BFD82ACB add rdx,10h
00007FF7BFD82ACF addps xmm5,xmm0
00007FF7BFD82AD2 sub rax,1
00007FF7BFD82AD6 jne media::SincResampler::Convolve_SSE+16h (07FF7BFD82AB6h)
}
} else {
00007FF7BFD82AD8 jmp media::SincResampler::Convolve_SSE+60h (07FF7BFD82B00h)
for (int i = 0; i < kKernelSize; i += 4) {
00007FF7BFD82ADA sub rcx,rdx
00007FF7BFD82ADD sub r8,rdx
m_input = _mm_load_ps(input_ptr + i);
m_sums1 = _mm_add_ps(m_sums1, _mm_mul_ps(m_input, _mm_load_ps(k1 + i)));
00007FF7BFD82AE0 movups xmm0,xmmword ptr [rdx]
00007FF7BFD82AE3 mulps xmm0,xmmword ptr [rdx+rcx]
00007FF7BFD82AE7 addps xmm4,xmm0
m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_load_ps(k2 + i)));
00007FF7BFD82AEA movups xmm0,xmmword ptr [rdx+r8]
00007FF7BFD82AEF mulps xmm0,xmmword ptr [rdx+rcx]
00007FF7BFD82AF3 add rdx,10h
00007FF7BFD82AF7 addps xmm5,xmm0
00007FF7BFD82AFA sub rax,1
00007FF7BFD82AFE jne media::SincResampler::Convolve_SSE+40h (07FF7BFD82AE0h)
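To see the codegen difference in isolation, here is a minimal sketch (my own illustration, not code from the tree; sum4 is a made-up helper). VS 2015 compiles the _mm_load_ps below to movaps, while VS 2017 compiles it to movups, even though the intrinsic requires a 16-byte-aligned pointer:

#include <xmmintrin.h>

// Horizontally sums four floats from a 16-byte-aligned pointer.
float sum4(const float* p) {
  __m128 v = _mm_load_ps(p);  // aligned-load intrinsic
  // Add the high pair onto the low pair, then the two remaining lanes.
  __m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v));
  t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));
  return _mm_cvtss_f32(t);
}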
Jun 21 2017
What sort of machines are we running these tests on? The movaps/movups distinction is supposed to have been performance-irrelevant for a long time, and it is on my machine. It would be useful to know what hardware we run the tests on, and what percentage of our customers have affected machines. The VS tracking bug for this is: https://developercommunity.visualstudio.com/content/problem/19160/regression-from-vs-2015-in-ssseavx-instructions-ge.html
Jun 23 2017
For completeness, here is the clang-cl code, whose performance is basically identical to that of VS 2017. clang-cl uses movaps but schedules the instructions differently, and its loop is less aligned? CPUs are weird. And maybe this isn't the code that runs on the bots.
#if defined(ARCH_CPU_X86_FAMILY)
float SincResampler::Convolve_SSE(const float* input_ptr, const float* k1,
const float* k2,
double kernel_interpolation_factor) {
00007FF7351D6380 xorps xmm0,xmm0
00007FF7351D6383 mov rax,0FFFFFFFFFFFFFFFCh
00007FF7351D638A xorps xmm1,xmm1
__m128 m_input;
__m128 m_sums1 = _mm_setzero_ps();
__m128 m_sums2 = _mm_setzero_ps();
// Based on |input_ptr| alignment, we need to use loadu or load. Unrolling
// these loops hurt performance in local testing.
if (reinterpret_cast<uintptr_t>(input_ptr) & 0x0F) {
00007FF7351D638D test cl,0Fh
00007FF7351D6390 je media::SincResampler::Convolve_SSE+37h (07FF7351D63B7h)
m_input = _mm_loadu_ps(input_ptr + i);
00007FF7351D6392 movups xmm2,xmmword ptr [rcx+rax*4+10h]
00007FF7351D6397 movaps xmm4,xmmword ptr [rdx+rax*4+10h]
m_sums1 = _mm_add_ps(m_sums1, _mm_mul_ps(m_input, _mm_load_ps(k1 + i)));
00007FF7351D639C mulps xmm4,xmm2
00007FF7351D639F addps xmm1,xmm4
m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_load_ps(k2 + i)));
00007FF7351D63A2 mulps xmm2,xmmword ptr [r8+rax*4+10h]
00007FF7351D63A8 addps xmm0,xmm2
for (int i = 0; i < kKernelSize; i += 4) {
00007FF7351D63AB add rax,4
00007FF7351D63AF cmp rax,1Ch
00007FF7351D63B3 jl media::SincResampler::Convolve_SSE+12h (07FF7351D6392h)
00007FF7351D63B5 jmp media::SincResampler::Convolve_SSE+5Ah (07FF7351D63DAh)
m_input = _mm_load_ps(input_ptr + i);
00007FF7351D63B7 movaps xmm2,xmmword ptr [rcx+rax*4+10h]
00007FF7351D63BC movaps xmm4,xmmword ptr [rdx+rax*4+10h]
m_sums1 = _mm_add_ps(m_sums1, _mm_mul_ps(m_input, _mm_load_ps(k1 + i)));
00007FF7351D63C1 mulps xmm4,xmm2
00007FF7351D63C4 addps xmm1,xmm4
m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_load_ps(k2 + i)));
00007FF7351D63C7 mulps xmm2,xmmword ptr [r8+rax*4+10h]
00007FF7351D63CD addps xmm0,xmm2
}
} else {
for (int i = 0; i < kKernelSize; i += 4) {
00007FF7351D63D0 add rax,4
00007FF7351D63D4 cmp rax,1Ch
00007FF7351D63D8 jl media::SincResampler::Convolve_SSE+37h (07FF7351D63B7h)
Jun 28 2017
Re 'modern': Nehalem was the first CPU to make unaligned movups the same speed as movaps. Previously, on the Core architecture, movups was 5 cycles vs. 1 for movaps even if the memory was aligned. Now they're both 1 cycle for aligned memory, or for reads that don't span a cache line; there is still a penalty for a cache-line split or page split. The encoding for the aligned form can be smaller, and therefore the front end goes faster.

If you know for sure the memory is aligned (e.g. a constant), use movaps / _mm_load_ps. Also consider directly using a memory argument:

mulps xmm4, xmmword ptr [rcx+rax*4+10h]

In general, smaller code / fewer instructions goes faster on modern CPUs. Otherwise, use unaligned loads: change _mm_load_ps to _mm_loadu_ps and remove the if() statement. This is true on ARM as well. When you allocate memory / declare constants, align them; but in code, use unaligned loads/stores. A sketch of what this would look like for Convolve_SSE follows.
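Applied to the function above, the suggested simplification would look roughly like this. This is a sketch only, not the shipped code: it assumes kKernelSize = 32 (consistent with the 8-iteration, stride-4 loop in the disassembly), and the interpolation/horizontal-sum tail is reconstructed to match the intrinsics shown earlier:

float Convolve_SSE(const float* input_ptr, const float* k1, const float* k2,
                   double kernel_interpolation_factor) {
  __m128 m_sums1 = _mm_setzero_ps();
  __m128 m_sums2 = _mm_setzero_ps();
  // Unaligned loads everywhere; no alignment branch. On Nehalem and later
  // this costs the same as movaps when the data happens to be aligned, and
  // the compiler can fold the k1/k2 loads into mulps memory operands.
  for (int i = 0; i < kKernelSize; i += 4) {
    const __m128 m_input = _mm_loadu_ps(input_ptr + i);
    m_sums1 = _mm_add_ps(m_sums1, _mm_mul_ps(m_input, _mm_loadu_ps(k1 + i)));
    m_sums2 = _mm_add_ps(m_sums2, _mm_mul_ps(m_input, _mm_loadu_ps(k2 + i)));
  }
  // Linearly interpolate between the two convolutions.
  m_sums1 = _mm_mul_ps(m_sums1,
      _mm_set_ps1(static_cast<float>(1.0 - kernel_interpolation_factor)));
  m_sums2 = _mm_mul_ps(m_sums2,
      _mm_set_ps1(static_cast<float>(kernel_interpolation_factor)));
  m_sums1 = _mm_add_ps(m_sums1, m_sums2);
  // Horizontal sum of the four lanes.
  m_sums2 = _mm_add_ps(_mm_movehl_ps(m_sums1, m_sums1), m_sums1);
  float result;
  _mm_store_ss(&result,
               _mm_add_ss(m_sums2, _mm_shuffle_ps(m_sums2, m_sums2, 1)));
  return result;
}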
Jun 28 2017
> If you know for sure the memory is aligned, use movaps / _mm_load_ps. Unfortunately with VS 2017 intrinsics we cannot use movaps. The compiler will turn _mm_load_ps into movups, unilaterally.
Aug 21 2017
I'm sorry I missed the question in #12 earlier! These machines are PowerEdge R210 II. Is this bug still being worked on?
Aug 22 2017
The bug is still of interest because we want to understand the implications of this compiler change, and why it causes so large a performance difference. The switch to clang means that this probably ultimately doesn't matter. Even without the switch to clang, this bug would probably end up closed as won't-fix: it's not a huge regression, it is counteracted by gains elsewhere, it should only affect old CPUs, and it's not clear that we can do anything about it (other than remove some code paths which the compiler change makes irrelevant).

Question: it looks like PowerEdge R210 II is a family of machines, and this regression is very sensitive to which generation of CPU the tests are being run on. Do you know the precise model numbers?
Aug 22 2017
+martiniss: do you know how to get the exact model number of PowerEdge R210 II for Win 7 x64 Perf?
Aug 22 2017
https://chromium-swarm.appspot.com/bot?id=build104-m1&show_state=true&sort_stats=total%3Adesc is the bot that triggered this alert, from what I can tell. "cpu_name": "Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz" is in the bot state, if that's helpful. There's also some other stuff in the bot state json on that page, if you find any of that useful. If you want the model number of it, you'd probably need to ask someone from labs; I don't know windows well enough to do that. If you still need more info, I'll route this to the appropriate person.
Aug 22 2017
I can get you any of the information listed in this class: https://msdn.microsoft.com/en-us/library/aa394102(v=vs.85).aspx The "Model" field is apparently "PowerEdge R220", which confuses me....
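For reference, the same WMI data can be queried locally from a command prompt (wmic wraps the Win32 classes linked above):

> wmic computersystem get manufacturer,model
> wmic cpu get name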
Aug 22 2017
The cpu_name data is exactly what I needed - thanks. Looking that up on Intel's ark.intel.com site shows it is "Products formerly Haswell" with a launch date of Q2'13. The claim (in comment #14, which agrees with what Intel/AMD told MSFT) is that starting with Nehalem CPUs there was no advantage to using movaps instead of movups. Nehalem CPUs were released in 2008. Haswell CPUs came out in 2013 and are two microarchitectural generations ahead of Nehalem. When I get a chance I will investigate more - it seems very surprising.
Sep 25 2017
This was reported again (crbug.com/768394) on the most recent switch, so I investigated again, running this command on an optimized 64-bit build:

> media_perftests.exe --single-process-tests --gtest_filter=SincResamplerPerfTest.*

For unoptimized_aligned and optimized_aligned on my Z840 workstation I see results of:

VS 2015: ~34,000 and ~105,000
VS 2017: ~30,000 and ~95,000 or 105,000

The results for unoptimized_aligned are 100% consistently a regression. The results for optimized_aligned are bi-modal for VS 2017. Here are some typical raw results:

> out\release_64_2015\media_perftests.exe --single-process-tests --gtest_filter=SincResamplerPerfTest.*
...
*RESULT sinc_resampler_convolve: unoptimized_aligned= 34335.04688107301 runs/ms
*RESULT sinc_resampler_convolve: optimized_aligned= 106067.48437625954 runs/ms
*RESULT sinc_resampler_convolve: optimized_unaligned= 98363.04225086117 runs/ms

> out\release_64\media_perftests.exe --single-process-tests --gtest_filter=SincResamplerPerfTest.*
...
*RESULT sinc_resampler_convolve: unoptimized_aligned= 29902.911227825498 runs/ms
*RESULT sinc_resampler_convolve: optimized_aligned= 94703.785499714 runs/ms
*RESULT sinc_resampler_convolve: optimized_unaligned= 95439.70024298948 runs/ms

So, this is reproducible. Previously I had trouble seeing the differences. I'm not sure if my previous difficulties were because I was using a different CPU, or due to user error, or due to the bimodal nature of the unoptimized_aligned test, or because whether or not there is a regression on my CPU depends on instruction alignment or other pseudo-random details.

I've attached the disassembly of media_perftests.exe!media::SincResampler::Convolve_SSE from a VS 2015 and a VS 2017 build. The build settings I used were:

is_component_build = false
is_debug = false
target_cpu = "x64"
enable_nacl = false

There is one instruction that is different in the function prologue, and then four instances of movaps being changed to movups. This is the same difference that was identified originally when I commented on this VS bug:

https://developercommunity.visualstudio.com/content/problem/19160/regression-from-vs-2015-in-ssseavx-instructions-ge.html

The conclusion at that point was that the change was intentional and agreed upon as appropriate by Intel, AMD, and Microsoft. It was decided that it should cause insignificant performance regressions. Specifically:

"When we made the change in VS2015, we knew the performance impact on old machines. But for modern cpus since Nehalem which was released almost 9 years ago, the use of movups vs. movaps has almost no performance impact. We have confirmed this with Intel team and AMD team."

Apparently this is not true for certain types of well-tuned machine code. I am running a relatively modern CPU, an Intel Xeon CPU E5-2690 v3 @ 2.60 GHz (2 processors):

https://ark.intel.com/products/81713/Intel-Xeon-Processor-E5-2690-v3-30M-Cache-2_60-GHz

This processor was released in Q3 of 2014, long after Nehalem. Here's some more information:

> wmic cpu get caption
Caption
Intel64 Family 6 Model 63 Stepping 2
Intel64 Family 6 Model 63 Stepping 2

So... I disagree with this change in code-gen for VS 2017. It can hurt performance (occasionally by non-trivial amounts) and it hides certain types of alignment bugs instead of letting the code crash. Thus, it is like removing asserts, with a risk of harming performance by doing so. I believe that this doesn't affect most code, but I still disagree.
Sep 25 2017
BTW, the trick to reproducing this is to create two output directories with the same gn args (see the previous comment). Then, follow these steps:
set GYP_MSVS_VERSION=2015
gclient runhooks
gn gen out\release_64_2015
ninja -C out\release_64_2015 media_perftests
set GYP_MSVS_VERSION=2017
gclient runhooks
gn gen out\release_64_2017
ninja -C out\release_64_2017 media_perftests
It is *crucial* to invoke "gclient runhooks" and "gn gen" after changing GYP_MSVS_VERSION in order to be certain that the new compiler is invoked. Running "gn gen" will latch the currently selected compiler so it is also crucial to not accidentally cause a re-generation of .ninja files at the wrong time. Adding this code to the start of RunConvolveBenchmark helps avoid errors:
static void RunConvolveBenchmark(
SincResampler* resampler,
float (*convolve_fn)(const float*, const float*, const float*, double),
bool aligned,
const std::string& trace_name) {
#if _MSC_VER >= 1910
printf("VS 2017\n");
#elif _MSC_VER == 1900
printf("VS 2015\n");
#else
#error
#endif
This gives the typical results shown below. Note that while testing this I hit one VS 2015 run that was slow on the optimized_aligned case - its bimodality is annoying. ASLR related, perhaps? However, unoptimized_aligned continues to be a very reliable compiler detector.
>out\release_64_2015\media_perftests.exe --single-process-tests --gtest_filter=SincResamplerPerfTest.*
...
VS 2015
*RESULT sinc_resampler_convolve: unoptimized_aligned= 34670.5053850229 runs/ms
VS 2015
*RESULT sinc_resampler_convolve: optimized_aligned= 105073.58303019604 runs/ms
VS 2015
*RESULT sinc_resampler_convolve: optimized_unaligned= 89338.83898818404 runs/ms
>out\release_64\media_perftests.exe --single-process-tests --gtest_filter=SincResamplerPerfTest.*
...
VS 2017
*RESULT sinc_resampler_convolve: unoptimized_aligned= 30512.642913591248 runs/ms
VS 2017
*RESULT sinc_resampler_convolve: optimized_aligned= 104551.76567021864 runs/ms
VS 2017
*RESULT sinc_resampler_convolve: optimized_unaligned= 97245.61519521085 runs/ms
Sep 25 2017
After a bit more analysis I realized that the function with the consistent regression is the unoptimized SSE function. Examination of the Convolve_C assembly language showed an extra integer add instruction in the inner loop (and some rearrangement of instructions). The extra add is *probably* the cause of the reproducible regression. Since the optimized (_SSE) functions do not have a consistent regression I'm going to close this. I've attached updated disassemblies that make the differences easy to see.
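For reference, the scalar function in question is roughly this (paraphrased from SincResampler, not a verbatim copy); the extra integer add shows up in how VS 2017 schedules the pointer arithmetic of this inner loop:

float Convolve_C(const float* input_ptr, const float* k1, const float* k2,
                 double kernel_interpolation_factor) {
  float sum1 = 0;
  float sum2 = 0;
  // Generate a single output sample from two kernels.
  int n = kKernelSize;
  while (n--) {
    sum1 += *input_ptr * *k1++;
    sum2 += *input_ptr++ * *k2++;
  }
  // Linearly interpolate the two "convolutions".
  return static_cast<float>((1.0 - kernel_interpolation_factor) * sum1 +
                            kernel_interpolation_factor * sum2);
}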
Sep 25 2017
The performance regression is unfortunate but not crucial and not systemic. I'm going to close this as WontFix.
Sep 25 2017
Very interesting analysis. Thanks Bruce! I agree, the non-optimized variant is just for comparison purposes so we don't care about regressions there.
Sep 25 2017
Great! This bug originally also covered blink_perf.shadow_dom, which I don't have a good understanding of. If that test regresses on this most recent switch then the bug will end up assigned to me and I will investigate it separately.