Issue metadata
Sign in to add a comment
|
11.7%-12.6% regression in blink_perf.layout at 415551:415625 |
||||||||||||||||||||||
Issue descriptionSee the link to graphs below.
,
Sep 2 2016
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/9002613176851193344
,
Sep 2 2016
=== Auto-CCing suspected CL author thakis@chromium.org === Hi thakis@chromium.org, the bisect results pointed to your CL below as possibly causing a regression. Please have a look at this info and see whether your CL be related. ===== BISECT JOB RESULTS ===== Status: completed ===== SUSPECTED CL(s) ===== Subject : Roll clang 278861:280106. Author : thakis Commit description: * win: Members of base classes now should show up in debugger. * win: Debugger shouldn't show funny highlights anymore due to debug info no longer including column information. (we still force this on if sanitizers are used, mostly for clusterfuzz. maybe we want to make this toggleable independent of sanitizers at some point) * win: -Wextern-initializer no longer warns on midl-generated code * win: clang-cl now accepts /source-encoding:utf-8 and friends (utf-8 was the source enconding in clang-cl before already, but now we don't warn on an explicit flag requesting this) * all platforms: Three plugin checks are now on-by-default, remove flags for these (see https://codereview.chromium.org/2267713003 https://codereview.chromium.org/2268203002 https://codereview.chromium.org/2265093002 ) * win: clang-cl's /Brepro now does what it's supposed to do * win: clang-cl now emits absolute paths in diagnostics, by popular request. Ran `tools/clang/scripts/upload_revision.py 280106`. BUG= 640254 , 637456 , 636109 , 636091 , 636099 Review-Url: https://codereview.chromium.org/2292173002 Cr-Commit-Position: refs/heads/master@{#415563} Commit : a033f395bf2547cf4764f77cc9c86d08f3e22c23 Date : Wed Aug 31 05:18:36 2016 ===== TESTED REVISIONS ===== Revision Mean Std Dev N Good? chromium@415558 405.005 3.38111 5 good chromium@415561 409.007 5.02107 5 good chromium@415562 403.783 5.02506 5 good chromium@415563 454.608 3.2199 5 bad <-- chromium@415567 452.615 3.96062 5 bad chromium@415576 449.699 2.33349 5 bad Bisect job ran on: linux_perf_bisect Bug ID: 643724 Test Command: src/tools/perf/run_benchmark -v --browser=release --output-format=chartjson --upload-results --also-run-disabled-tests blink_perf.layout Test Metric: large-table-with-collapsed-borders-and-colspans-wider-than-table/large-table-with-collapsed-borders-and-colspans-wider-than-table Relative Change: 11.04% Score: 99.9 Buildbot stdio: http://build.chromium.org/p/tryserver.chromium.perf/builders/linux_perf_bisect/builds/6690 Job details: https://chromeperf.appspot.com/buildbucket_job_status/9002613176851193344 Not what you expected? We'll investigate and get back to you! https://chromeperf.appspot.com/bad_bisect?try_job_id=5824572824223744 | O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq | X | for more information addressing perf regression bugs. For feedback, | / \ | file a bug with component Tests>AutoBisect. Thank you!
,
Sep 3 2016
,
Sep 8 2016
cfi folks, isn't large-table-with-collapsed-borders-and-colspans-wider-than-table a test you looked at it some detail? Any idea what might've regressed here?
,
Sep 8 2016
That particular microbenchmark is super noisy. See ~20% improvement when Clang toolchain was updated on August 20: https://chromeperf.appspot.com/report?sid=223e6e3fe5a736ca07effd16fab7c27a77db7c94042adc819b692c8aac732a60 This commit range: https://chromium.googlesource.com/chromium/src/+log/74765e5a1cda876448a6b44c23c3a9e20e59f53d%5E..3829c6ad5aeef24992be742340df28c324da9e61?pretty=fuller
,
Sep 8 2016
The improvement was observed on both Mac (~20%) and Linux (~4%). It feels that this time the regression is just cancelling those speedups.
,
Sep 8 2016
You are right about the two metrics on Mac, I will remove them from the bug now. The regression on Linux still needs some attention: the current level (~450) is significantly higher than Aug-19 (pre-improvement) level (~425).
,
Sep 8 2016
I will take a look into this today.
,
Sep 8 2016
From what I can see, the benchmark lives on the edge: there's a good state and there's a bad state. Probably, it almost fits a cache (like, L1), may be it's about memory alignment, may be something else. Multiple factors may put this benchmark on either side: LTO, CFI, specific Clang revision, different hardware families (like, Haswell vs Sandy Bridge), etc. I would probably ignore it unless it's 40%-50%.
,
Sep 8 2016
,
Sep 8 2016
10-20% seems big enough you should be able to measure this, what does a profiler tell you? Where is time spent differently between the two builds? What is different about the assembly output?
,
Sep 8 2016
I see blink::LayoutTable::cell{Above,After,Before,Below}, as well as blink::LayoutTableCell::computeCollapsedEndBorder in the top of the profile.
What I have not yet done, and probably should is to take an older toolchain, build the same Chrome revision with it, and see if the profile looks any different.
Taking the bug back until this data is collected.
,
Sep 9 2016
The renderer profiles for old (faster) and new (slower) Clang versions attached. Top entries are inlined: old clang: 9.84% 2184 blink::LayoutTable::cellAbove(blink::LayoutTableCell const*) const 9.33% 2058 blink::LayoutTable::cellBelow(blink::LayoutTableCell const*) const 6.70% 1481 blink::LayoutTable::cellAfter(blink::LayoutTableCell const*) const 5.91% 1312 blink::LayoutTableCell::hasStartBorderAdjoiningTable() const 5.84% 1299 blink::LayoutTableCell::computeCollapsedEndBorder(blink::IncludeBorderColorOrNot) const 5.80% 1291 blink::LayoutTable::cellBefore(blink::LayoutTableCell const*) const 5.17% 1154 blink::LayoutTableCell::hasEndBorderAdjoiningTable() const 2.19% 484 blink::LayoutTableCell::computeCollapsedAfterBorder(blink::IncludeBorderColorOrNot) const 2.13% 483 blink::NormalPage::sweep() 2.11% 468 blink::LayoutTableCell::computeCollapsedBeforeBorder(blink::IncludeBorderColorOrNot) const 1.58% 374 void blink::Element::traceImpl<blink::InlinedGlobalMarkingVisitor>(blink::InlinedGlobalMarkingVisitor) new clang: 11.22% 2712 blink::LayoutTable::cellAbove(blink::LayoutTableCell const*) const 10.55% 2555 blink::LayoutTable::cellBelow(blink::LayoutTableCell const*) const 7.28% 1765 blink::LayoutTable::cellAfter(blink::LayoutTableCell const*) const 6.60% 1596 blink::LayoutTable::cellBefore(blink::LayoutTableCell const*) const 6.23% 1508 blink::LayoutTableCell::hasStartBorderAdjoiningTable() const 6.03% 1457 blink::LayoutTableCell::computeCollapsedEndBorder(blink::IncludeBorderColorOrNot) const 6.01% 1455 blink::LayoutTableCell::hasEndBorderAdjoiningTable() const 2.40% 586 blink::LayoutTableCell::computeCollapsedAfterBorder(blink::IncludeBorderColorOrNot) const 1.84% 448 blink::LayoutTableCell::computeCollapsedBeforeBorder(blink::IncludeBorderColorOrNot) const 1.74% 449 blink::NormalPage::sweep() 1.18% 315 void blink::Element::traceImpl<blink::InlinedGlobalMarkingVisitor>(blink::InlinedGlobalMarkingVisitor) Looking inside those methods.
,
Sep 9 2016
Interesting. The hot place in blink::LayoutTable::cellAbove:
old clang:
4.76 │a0: mov (%r10,%rdx,4),%ecx
0.27 │ lea -0x1(%rcx,%rdi,1),%ebx
2.56 │ cmp %esi,%ebx
│ ↓ jae b8
39.42 │ add %edi,%ecx
3.98 │ inc %rdx
0.09 │ cmp %r9,%rdx
0.55 │ mov %ecx,%edi
37.96 │ ↑ jb a0
new clang:
6.19 │ a0: mov %edx,%ecx
6.34 │ mov (%r10,%rcx,4),%ecx
10.91 │ lea -0x1(%rcx,%rdi,1),%ebx
18.95 │ cmp %esi,%ebx
│ ↓ jae b9
12.24 │ add %edi,%ecx
6.16 │ inc %edx
7.15 │ cmp %r9d,%edx
9.48 │ mov %ecx,%edi
12.54 │ ↑ jb a0
1) This extra 'mov %edx,%ecx' looks unnecessary
2) In the new Clang it uses 32-bit arithmetic instead of 64-bit. It's hard to say, if that's a problem other being the reason for 1)
Nico, Hans, would you count this as a codegen issue?
I will now look at Clang ToT to see, if this is already fixed.
,
Sep 9 2016
The ToT Clang generates the same slow code as the "new" clang. The loop in the question is https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/layout/LayoutTable.h?type=cs&q=m_effectiveColumns.effectiveColumn..span&sq=package:chromium&l=269 for (unsigned c = 0; effectiveColumn < numColumns && c + m_effectiveColumns[effectiveColumn].span - 1 < absoluteColumnIndex; ++effectiveColumn) c += m_effectiveColumns[effectiveColumn].span;
,
Sep 9 2016
I was able to reproduce that in with a standalone C++ program (attached): The ToT Clang, r280929: $ time clang++ -o loop loop.cc -fuse-ld=gold -O2 -flto && ./loop ... 00000000004006c0 <_Z31absoluteColumnToEffectiveColumnj>: 4006c0: 48 8b 0d 81 19 00 00 mov 0x1981(%rip),%rcx # 402048 <m_effectiveColumns+0x8> 4006c7: 4c 8b 05 72 19 00 00 mov 0x1972(%rip),%r8 # 402040 <m_effectiveColumns> 4006ce: 4c 29 c1 sub %r8,%rcx 4006d1: 48 c1 e9 02 shr $0x2,%rcx 4006d5: 31 c0 xor %eax,%eax 4006d7: 85 c9 test %ecx,%ecx 4006d9: 74 1e je 4006f9 <_Z31absoluteColumnToEffectiveColumnj+0x39> 4006db: 31 f6 xor %esi,%esi 4006dd: 31 c0 xor %eax,%eax 4006df: 90 nop 4006e0: 89 c7 mov %eax,%edi # <== unnecessary mov 4006e2: 41 8b 3c b8 mov (%r8,%rdi,4),%edi 4006e6: 8d 54 37 ff lea -0x1(%rdi,%rsi,1),%edx 4006ea: 83 fa 0a cmp $0xa,%edx 4006ed: 73 0a jae 4006f9 <_Z31absoluteColumnToEffectiveColumnj+0x39> 4006ef: 01 f7 add %esi,%edi 4006f1: ff c0 inc %eax 4006f3: 39 c8 cmp %ecx,%eax 4006f5: 89 fe mov %edi,%esi 4006f7: 72 e7 jb 4006e0 <_Z31absoluteColumnToEffectiveColumnj+0x20> 4006f9: c3 retq 4006fa: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) The "old" Clang, r278861 00000000004006e0 <_Z31absoluteColumnToEffectiveColumnj>: 4006e0: 48 8b 15 61 19 00 00 mov 0x1961(%rip),%rdx # 402048 <m_effectiveColumns+0x8> 4006e7: 4c 8b 05 52 19 00 00 mov 0x1952(%rip),%r8 # 402040 <m_effectiveColumns> 4006ee: 4c 29 c2 sub %r8,%rdx 4006f1: 48 c1 ea 02 shr $0x2,%rdx 4006f5: 31 c0 xor %eax,%eax 4006f7: 85 d2 test %edx,%edx 4006f9: 74 2e je 400729 <_Z31absoluteColumnToEffectiveColumnj+0x49> 4006fb: 89 d2 mov %edx,%edx 4006fd: 31 c0 xor %eax,%eax 4006ff: 31 f6 xor %esi,%esi 400701: 66 66 66 66 66 66 2e data32 data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1) 400708: 0f 1f 84 00 00 00 00 40070f: 00 400710: 41 8b 3c 80 mov (%r8,%rax,4),%edi # <=== Everything is good here 400714: 8d 4c 37 ff lea -0x1(%rdi,%rsi,1),%ecx 400718: 83 f9 0a cmp $0xa,%ecx 40071b: 73 0c jae 400729 <_Z31absoluteColumnToEffectiveColumnj+0x49> 40071d: 01 f7 add %esi,%edi 40071f: 48 ff c0 inc %rax 400722: 48 39 d0 cmp %rdx,%rax 400725: 89 fe mov %edi,%esi 400727: 72 e7 jb 400710 <_Z31absoluteColumnToEffectiveColumnj+0x30> 400729: c3 retq 40072a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) Hans, Nico: shall I file an LLVM bug for that?
,
Sep 9 2016
The "new" Clang, r280106 (the current version used by Chrome): 00000000004006e0 <_Z31absoluteColumnToEffectiveColumnj>: 4006e0: 48 8b 0d 61 19 00 00 mov 0x1961(%rip),%rcx # 402048 <m_effectiveColumns+0x8> 4006e7: 4c 8b 05 52 19 00 00 mov 0x1952(%rip),%r8 # 402040 <m_effectiveColumns> 4006ee: 4c 29 c1 sub %r8,%rcx 4006f1: 48 c1 e9 02 shr $0x2,%rcx 4006f5: 31 c0 xor %eax,%eax 4006f7: 85 c9 test %ecx,%ecx 4006f9: 74 1e je 400719 <_Z31absoluteColumnToEffectiveColumnj+0x39> 4006fb: 31 f6 xor %esi,%esi 4006fd: 31 c0 xor %eax,%eax 4006ff: 90 nop 400700: 89 c7 mov %eax,%edi # <== unnecessary mov 400702: 41 8b 3c b8 mov (%r8,%rdi,4),%edi 400706: 8d 54 37 ff lea -0x1(%rdi,%rsi,1),%edx 40070a: 83 fa 0a cmp $0xa,%edx 40070d: 73 0a jae 400719 <_Z31absoluteColumnToEffectiveColumnj+0x39> 40070f: 01 f7 add %esi,%edi 400711: ff c0 inc %eax 400713: 39 c8 cmp %ecx,%eax 400715: 89 fe mov %edi,%esi 400717: 72 e7 jb 400700 <_Z31absoluteColumnToEffectiveColumnj+0x20> 400719: c3 retq 40071a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
,
Sep 9 2016
On the related note, I have made a simple optimization that speeds up the microbenchmark approximately by 2x: https://codereview.chromium.org/2325953002/. The measurements are attached. Basically, instead of having a single bool that keeps track if all columns have colspan == 1, it stores the number of first columns for which we know colspan == 1. The CL is obviously not yet ready for submit, as there's one important downside: instead of using 1 bit of RAM, we use more. If it's of a concern, it's possible to get away with as few as 4-12 bits, depending on the additional tricks. It's also not totally clear how often do such tables occur in the real world. That I would leave for the owners to decide.
,
Sep 9 2016
Thanks for investigating! 1. If new clang produces slower code than old clang, that seems like a regression we should report upstream. With your reduced repro, bisecting this should be fairly quick, right? 2. If there's some change that makes that code faster and less reliant on specific optimizations, that seems like something we should do too.
,
Sep 9 2016
I have sent the CL for a review: https://codereview.chromium.org/2325953002/ Next step: file an LLVM bug and run a bisect.
,
Sep 9 2016
Filed https://llvm.org/bugs/show_bug.cgi?id=30339 Running a bisect.
,
Sep 9 2016
Bisect identified the culprit (see the LLVM bug).
,
Sep 9 2016
Thanks! Looks like the commit disabled a new optimization that was enabled shortly before the previous clang roll, and that isn't quite ready yet apparently. So the roll this bug here is about just brought us back to the level we were at before the roll before that, and a future roll will hopefully bring the optimization in again when it's on again upstream.
,
Sep 9 2016
We might consider to enable this optimization with the current toolchain by passing -mllvm,cvp-dont-process-adds=false as the measurements.
,
Sep 9 2016
sorry, the reply was truncated: as the perf measurements show that Chrome is faster that way.
,
Sep 9 2016
I don't think we should tweak -mllvm flags in general. https://fun.irq.dk/funroll-loops.org/ etc :-) (mllvm flags are not a stable interface, we should trust clang to have good defaults, this is only a known win for a single synthetic benchmark, ...)
,
Sep 9 2016
Okay. In this case, I propose the interested parties to review https://codereview.chromium.org/2325953002/ and consider the incident as resolved.
,
Sep 9 2016
Yup, I agree. Thanks again!
,
Sep 10 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/60f47383d88f942923b5ff963023d4547397acc2 commit 60f47383d88f942923b5ff963023d4547397acc2 Author: krasin <krasin@chromium.org> Date: Sat Sep 10 10:35:07 2016 LayoutTable: gracefully degrade in the presence of collspan > 1. Currently, there's a boolean field that allows to take a fast path when mapping absolute columns to effective columns (and back), but the fast path is completely abandoned, if a single call with colspan > 1 presented. This change allows gracefully degrade the performance of the fast path by maintaining the number of first cells with colspan = 1, which allows to skip them, even if some cells have colspan > 1. This change speeds up large-table-with-collapsed-borders-and-colspans-wider-than-table microbenchmark by approximately 2x, see: https://storage.googleapis.com/cfi-stats/2016-09-09/results.html BUG= 643724 Review-Url: https://codereview.chromium.org/2325953002 Cr-Commit-Position: refs/heads/master@{#417829} [modify] https://crrev.com/60f47383d88f942923b5ff963023d4547397acc2/third_party/WebKit/Source/core/layout/LayoutTable.cpp [modify] https://crrev.com/60f47383d88f942923b5ff963023d4547397acc2/third_party/WebKit/Source/core/layout/LayoutTable.h
,
Sep 10 2016
The graphs confirm the improvement: https://chromeperf.appspot.com/report?sid=b55430b4c7fad00e3638b11ac242173118f167328f812ca9bc63bb41bc0cb6e3 Linux: 450 -> 200 Mac Retina: 520 -> 238 Mac 11: 628 -> ??? (the graph line dropped much below the visible area) Marking this issue as fixed. We'll still eventually get additional improvements in codegen and that unnessary mov will likely go away, but it's not critical anymore. |
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by mustaq@chromium.org
, Sep 2 2016