New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 643724 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Sep 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug-Regression



Sign in to add a comment

11.7%-12.6% regression in blink_perf.layout at 415551:415625

Project Member Reported by mustaq@chromium.org, Sep 2 2016

Issue description

See the link to graphs below.
 
All graphs for this bug:
  https://chromeperf.appspot.com/group_report?bug_id=643724

Original alerts at time of bug-filing:
  https://chromeperf.appspot.com/group_report?keys=agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgICgoaTR_ggM,agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgICg4aPftAoM,agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgICg4ceFtAoM


Bot(s) for this bug's original alert(s):

chromium-rel-mac-retina
chromium-rel-mac10
linux-release
Cc: thakis@chromium.org
Owner: thakis@chromium.org

=== Auto-CCing suspected CL author thakis@chromium.org ===

Hi thakis@chromium.org, the bisect results pointed to your CL below as possibly
causing a regression. Please have a look at this info and see whether
your CL be related.


===== BISECT JOB RESULTS =====
Status: completed


===== SUSPECTED CL(s) =====
Subject : Roll clang 278861:280106.
Author  : thakis
Commit description:
  
* win: Members of base classes now should show up in debugger.
* win: Debugger shouldn't show funny highlights anymore due to
  debug info no longer including column information.  (we still
  force this on if sanitizers are used, mostly for clusterfuzz.
  maybe we want to make this toggleable independent of sanitizers
  at some point)
* win: -Wextern-initializer no longer warns on midl-generated code
* win: clang-cl now accepts /source-encoding:utf-8 and friends
  (utf-8 was the source enconding in clang-cl before already, but
  now we don't warn on an explicit flag requesting this)
* all platforms: Three plugin checks are now on-by-default,
  remove flags for these (see
    https://codereview.chromium.org/2267713003
    https://codereview.chromium.org/2268203002
    https://codereview.chromium.org/2265093002
  )
* win: clang-cl's /Brepro now does what it's supposed to do
* win: clang-cl now emits absolute paths in diagnostics, by
  popular request.

Ran `tools/clang/scripts/upload_revision.py 280106`.

BUG= 640254 , 637456 , 636109 , 636091 , 636099 

Review-Url: https://codereview.chromium.org/2292173002
Cr-Commit-Position: refs/heads/master@{#415563}
Commit  : a033f395bf2547cf4764f77cc9c86d08f3e22c23
Date    : Wed Aug 31 05:18:36 2016


===== TESTED REVISIONS =====
Revision         Mean     Std Dev  N  Good?
chromium@415558  405.005  3.38111  5  good
chromium@415561  409.007  5.02107  5  good
chromium@415562  403.783  5.02506  5  good
chromium@415563  454.608  3.2199   5  bad    <--
chromium@415567  452.615  3.96062  5  bad
chromium@415576  449.699  2.33349  5  bad

Bisect job ran on: linux_perf_bisect
Bug ID: 643724

Test Command: src/tools/perf/run_benchmark -v --browser=release --output-format=chartjson --upload-results --also-run-disabled-tests blink_perf.layout
Test Metric: large-table-with-collapsed-borders-and-colspans-wider-than-table/large-table-with-collapsed-borders-and-colspans-wider-than-table
Relative Change: 11.04%
Score: 99.9

Buildbot stdio: http://build.chromium.org/p/tryserver.chromium.perf/builders/linux_perf_bisect/builds/6690
Job details: https://chromeperf.appspot.com/buildbucket_job_status/9002613176851193344


Not what you expected? We'll investigate and get back to you!
  https://chromeperf.appspot.com/bad_bisect?try_job_id=5824572824223744

| O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq
|  X  | for more information addressing perf regression bugs. For feedback,
| / \ | file a bug with component Tests>AutoBisect.  Thank you!
Cc: h...@chromium.org
Cc: p...@chromium.org krasin@chromium.org
cfi folks, isn't  large-table-with-collapsed-borders-and-colspans-wider-than-table a test you looked at it some detail? Any idea what might've regressed here?
The improvement was observed on both Mac (~20%) and Linux (~4%). It feels that this time the regression is just cancelling those speedups. 
You are right about the two metrics on Mac, I will remove them from the bug now.

The regression on Linux still needs some attention: the current level (~450) is significantly higher than Aug-19 (pre-improvement) level (~425).
Owner: krasin@chromium.org
I will take a look into this today.
From what I can see, the benchmark lives on the edge: there's a good state and there's a bad state. Probably, it almost fits a cache (like, L1), may be it's about memory alignment, may be something else.

Multiple factors may put this benchmark on either side: LTO, CFI, specific Clang revision, different hardware families (like, Haswell vs Sandy Bridge), etc. I would probably ignore it unless it's 40%-50%.

Owner: esprehn@chromium.org
Cc: esprehn@chromium.org jyasskin@chromium.org
Components: Blink>Layout
Owner: e...@chromium.org
10-20% seems big enough you should be able to measure this, what does a profiler tell you? Where is time spent differently between the two builds? What is different about the assembly output?
Owner: krasin@chromium.org
I see blink::LayoutTable::cell{Above,After,Before,Below}, as well as blink::LayoutTableCell::computeCollapsedEndBorder in the top of the profile.

What I have not yet done, and probably should is to take an older toolchain, build the same Chrome revision with it, and see if the profile looks any different.

Taking the bug back until this data is collected.
The renderer profiles for old (faster) and new (slower) Clang versions attached.
Top entries are inlined:

old clang:

9.84%          2184   blink::LayoutTable::cellAbove(blink::LayoutTableCell const*) const
9.33%          2058   blink::LayoutTable::cellBelow(blink::LayoutTableCell const*) const 
6.70%          1481   blink::LayoutTable::cellAfter(blink::LayoutTableCell const*) const 
5.91%          1312   blink::LayoutTableCell::hasStartBorderAdjoiningTable() const
5.84%          1299   blink::LayoutTableCell::computeCollapsedEndBorder(blink::IncludeBorderColorOrNot) const
5.80%          1291   blink::LayoutTable::cellBefore(blink::LayoutTableCell const*) const
5.17%          1154   blink::LayoutTableCell::hasEndBorderAdjoiningTable() const
2.19%           484   blink::LayoutTableCell::computeCollapsedAfterBorder(blink::IncludeBorderColorOrNot) const
2.13%           483   blink::NormalPage::sweep()
2.11%           468   blink::LayoutTableCell::computeCollapsedBeforeBorder(blink::IncludeBorderColorOrNot) const
1.58%           374   void blink::Element::traceImpl<blink::InlinedGlobalMarkingVisitor>(blink::InlinedGlobalMarkingVisitor)

new clang:

11.22%         2712   blink::LayoutTable::cellAbove(blink::LayoutTableCell const*) const
10.55%         2555   blink::LayoutTable::cellBelow(blink::LayoutTableCell const*) const
7.28%          1765   blink::LayoutTable::cellAfter(blink::LayoutTableCell const*) const
6.60%          1596   blink::LayoutTable::cellBefore(blink::LayoutTableCell const*) const
6.23%          1508   blink::LayoutTableCell::hasStartBorderAdjoiningTable() const
6.03%          1457   blink::LayoutTableCell::computeCollapsedEndBorder(blink::IncludeBorderColorOrNot) const
6.01%          1455   blink::LayoutTableCell::hasEndBorderAdjoiningTable() const
2.40%           586   blink::LayoutTableCell::computeCollapsedAfterBorder(blink::IncludeBorderColorOrNot) const
1.84%           448   blink::LayoutTableCell::computeCollapsedBeforeBorder(blink::IncludeBorderColorOrNot) const
1.74%           449   blink::NormalPage::sweep()
1.18%           315   void blink::Element::traceImpl<blink::InlinedGlobalMarkingVisitor>(blink::InlinedGlobalMarkingVisitor)

Looking inside those methods.
new-clang-renderer.txt
2.3 MB View Download
old-clang-renderer.txt
2.3 MB View Download
Interesting. The hot place in blink::LayoutTable::cellAbove:

old clang:

  4.76 │a0:   mov    (%r10,%rdx,4),%ecx
  0.27 │      lea    -0x1(%rcx,%rdi,1),%ebx

  2.56 │      cmp    %esi,%ebx
       │    ↓ jae    b8
 39.42 │      add    %edi,%ecx

  3.98 │      inc    %rdx
  0.09 │      cmp    %r9,%rdx
  0.55 │      mov    %ecx,%edi
 37.96 │    ↑ jb     a0


new clang:

  6.19 │ a0:   mov    %edx,%ecx
  6.34 │       mov    (%r10,%rcx,4),%ecx
 10.91 │       lea    -0x1(%rcx,%rdi,1),%ebx

 18.95 │       cmp    %esi,%ebx
       │     ↓ jae    b9
 12.24 │       add    %edi,%ecx

  6.16 │       inc    %edx
  7.15 │       cmp    %r9d,%edx
  9.48 │       mov    %ecx,%edi
 12.54 │     ↑ jb     a0

1) This extra 'mov    %edx,%ecx' looks unnecessary
2) In the new Clang it uses 32-bit arithmetic instead of 64-bit. It's hard to say, if that's a problem other being the reason for 1)

Nico, Hans, would you count this as a codegen issue?

I will now look at Clang ToT to see, if this is already fixed.
The ToT Clang generates the same slow code as the "new" clang. The loop in the question is

https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/layout/LayoutTable.h?type=cs&q=m_effectiveColumns.effectiveColumn..span&sq=package:chromium&l=269

for (unsigned c = 0; effectiveColumn < numColumns && c + m_effectiveColumns[effectiveColumn].span - 1 < absoluteColumnIndex; ++effectiveColumn)
  c += m_effectiveColumns[effectiveColumn].span;

I was able to reproduce that in with a standalone C++ program (attached):

The ToT Clang, r280929:

$ time clang++ -o loop loop.cc -fuse-ld=gold -O2  -flto && ./loop 

...
00000000004006c0 <_Z31absoluteColumnToEffectiveColumnj>:
  4006c0:       48 8b 0d 81 19 00 00    mov    0x1981(%rip),%rcx        # 402048 <m_effectiveColumns+0x8>
  4006c7:       4c 8b 05 72 19 00 00    mov    0x1972(%rip),%r8        # 402040 <m_effectiveColumns>
  4006ce:       4c 29 c1                sub    %r8,%rcx
  4006d1:       48 c1 e9 02             shr    $0x2,%rcx
  4006d5:       31 c0                   xor    %eax,%eax
  4006d7:       85 c9                   test   %ecx,%ecx
  4006d9:       74 1e                   je     4006f9 <_Z31absoluteColumnToEffectiveColumnj+0x39>
  4006db:       31 f6                   xor    %esi,%esi
  4006dd:       31 c0                   xor    %eax,%eax
  4006df:       90                      nop
  4006e0:       89 c7                   mov    %eax,%edi # <== unnecessary mov
  4006e2:       41 8b 3c b8             mov    (%r8,%rdi,4),%edi
  4006e6:       8d 54 37 ff             lea    -0x1(%rdi,%rsi,1),%edx
  4006ea:       83 fa 0a                cmp    $0xa,%edx
  4006ed:       73 0a                   jae    4006f9 <_Z31absoluteColumnToEffectiveColumnj+0x39>
  4006ef:       01 f7                   add    %esi,%edi
  4006f1:       ff c0                   inc    %eax
  4006f3:       39 c8                   cmp    %ecx,%eax
  4006f5:       89 fe                   mov    %edi,%esi
  4006f7:       72 e7                   jb     4006e0 <_Z31absoluteColumnToEffectiveColumnj+0x20>
  4006f9:       c3                      retq   
  4006fa:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)


The "old" Clang, r278861

00000000004006e0 <_Z31absoluteColumnToEffectiveColumnj>:
  4006e0:       48 8b 15 61 19 00 00    mov    0x1961(%rip),%rdx        # 402048 <m_effectiveColumns+0x8>
  4006e7:       4c 8b 05 52 19 00 00    mov    0x1952(%rip),%r8        # 402040 <m_effectiveColumns>
  4006ee:       4c 29 c2                sub    %r8,%rdx
  4006f1:       48 c1 ea 02             shr    $0x2,%rdx
  4006f5:       31 c0                   xor    %eax,%eax
  4006f7:       85 d2                   test   %edx,%edx
  4006f9:       74 2e                   je     400729 <_Z31absoluteColumnToEffectiveColumnj+0x49>
  4006fb:       89 d2                   mov    %edx,%edx
  4006fd:       31 c0                   xor    %eax,%eax
  4006ff:       31 f6                   xor    %esi,%esi
  400701:       66 66 66 66 66 66 2e    data32 data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
  400708:       0f 1f 84 00 00 00 00 
  40070f:       00 
  400710:       41 8b 3c 80             mov    (%r8,%rax,4),%edi # <=== Everything is good here
  400714:       8d 4c 37 ff             lea    -0x1(%rdi,%rsi,1),%ecx
  400718:       83 f9 0a                cmp    $0xa,%ecx
  40071b:       73 0c                   jae    400729 <_Z31absoluteColumnToEffectiveColumnj+0x49>
  40071d:       01 f7                   add    %esi,%edi
  40071f:       48 ff c0                inc    %rax
  400722:       48 39 d0                cmp    %rdx,%rax
  400725:       89 fe                   mov    %edi,%esi
  400727:       72 e7                   jb     400710 <_Z31absoluteColumnToEffectiveColumnj+0x30>
  400729:       c3                      retq   
  40072a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

Hans, Nico: shall I file an LLVM bug for that?
loop.cc
827 bytes View Download
The "new" Clang, r280106 (the current version used by Chrome):

00000000004006e0 <_Z31absoluteColumnToEffectiveColumnj>:
  4006e0:       48 8b 0d 61 19 00 00    mov    0x1961(%rip),%rcx        # 402048 <m_effectiveColumns+0x8>
  4006e7:       4c 8b 05 52 19 00 00    mov    0x1952(%rip),%r8        # 402040 <m_effectiveColumns>
  4006ee:       4c 29 c1                sub    %r8,%rcx
  4006f1:       48 c1 e9 02             shr    $0x2,%rcx
  4006f5:       31 c0                   xor    %eax,%eax
  4006f7:       85 c9                   test   %ecx,%ecx
  4006f9:       74 1e                   je     400719 <_Z31absoluteColumnToEffectiveColumnj+0x39>
  4006fb:       31 f6                   xor    %esi,%esi
  4006fd:       31 c0                   xor    %eax,%eax
  4006ff:       90                      nop
  400700:       89 c7                   mov    %eax,%edi # <== unnecessary mov
  400702:       41 8b 3c b8             mov    (%r8,%rdi,4),%edi
  400706:       8d 54 37 ff             lea    -0x1(%rdi,%rsi,1),%edx
  40070a:       83 fa 0a                cmp    $0xa,%edx
  40070d:       73 0a                   jae    400719 <_Z31absoluteColumnToEffectiveColumnj+0x39>
  40070f:       01 f7                   add    %esi,%edi
  400711:       ff c0                   inc    %eax
  400713:       39 c8                   cmp    %ecx,%eax
  400715:       89 fe                   mov    %edi,%esi
  400717:       72 e7                   jb     400700 <_Z31absoluteColumnToEffectiveColumnj+0x20>
  400719:       c3                      retq   
  40071a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

On the related note, I have made a simple optimization that speeds up the microbenchmark approximately by 2x: https://codereview.chromium.org/2325953002/. The measurements are attached.

Basically, instead of having a single bool that keeps track if all columns have colspan == 1, it stores the number of first columns for which we know colspan == 1.

The CL is obviously not yet ready for submit, as there's one important downside: instead of using 1 bit of RAM, we use more. If it's of a concern, it's possible to get away with as few as 4-12 bits, depending on the additional tricks.

It's also not totally clear how often do such tables occur in the real world. That I would leave for the owners to decide.

results.html
124 KB View Download
Thanks for investigating!

1. If new clang produces slower code than old clang, that seems like a regression we should report upstream. With your reduced repro, bisecting this should be fairly quick, right?

2. If there's some change that makes that code faster and less reliant on specific optimizations, that seems like something we should do too.
I have sent the CL for a review: https://codereview.chromium.org/2325953002/
Next step: file an LLVM bug and run a bisect.
Filed https://llvm.org/bugs/show_bug.cgi?id=30339
Running a bisect.
Bisect identified the culprit (see the LLVM bug).

Thanks!

Looks like the commit disabled a new optimization that was enabled shortly before the previous clang roll, and that isn't quite ready yet apparently. So the roll this bug here is about just brought us back to the level we were at before the roll before that, and a future roll will hopefully bring the optimization in again when it's on again upstream.
We might consider to enable this optimization with the current toolchain by passing -mllvm,cvp-dont-process-adds=false as the measurements.
sorry, the reply was truncated: as the perf measurements show that Chrome is faster that way.
I don't think we should tweak -mllvm flags in general. https://fun.irq.dk/funroll-loops.org/ etc :-) (mllvm flags are not a stable interface, we should trust clang to have good defaults, this is only a known win for a single synthetic benchmark, ...)
Okay. In this case, I propose the interested parties to review https://codereview.chromium.org/2325953002/ and consider the incident as resolved.
Yup, I agree. Thanks again!
Project Member

Comment 30 by bugdroid1@chromium.org, Sep 10 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/60f47383d88f942923b5ff963023d4547397acc2

commit 60f47383d88f942923b5ff963023d4547397acc2
Author: krasin <krasin@chromium.org>
Date: Sat Sep 10 10:35:07 2016

LayoutTable: gracefully degrade in the presence of collspan > 1.

Currently, there's a boolean field that allows to take a fast path
when mapping absolute columns to effective columns (and back), but
the fast path is completely abandoned, if a single call with
colspan > 1 presented.

This change allows gracefully degrade the performance of the fast
path by maintaining the number of first cells with colspan = 1,
which allows to skip them, even if some cells have colspan > 1.

This change speeds up large-table-with-collapsed-borders-and-colspans-wider-than-table
microbenchmark by approximately 2x, see:
https://storage.googleapis.com/cfi-stats/2016-09-09/results.html

BUG= 643724 

Review-Url: https://codereview.chromium.org/2325953002
Cr-Commit-Position: refs/heads/master@{#417829}

[modify] https://crrev.com/60f47383d88f942923b5ff963023d4547397acc2/third_party/WebKit/Source/core/layout/LayoutTable.cpp
[modify] https://crrev.com/60f47383d88f942923b5ff963023d4547397acc2/third_party/WebKit/Source/core/layout/LayoutTable.h

Status: Fixed (was: Assigned)
The graphs confirm the improvement: https://chromeperf.appspot.com/report?sid=b55430b4c7fad00e3638b11ac242173118f167328f812ca9bc63bb41bc0cb6e3

Linux: 450 -> 200
Mac Retina: 520 -> 238
Mac 11: 628 -> ??? (the graph line dropped much below the visible area)

Marking this issue as fixed. We'll still eventually get additional improvements in codegen and that unnessary mov will likely go away, but it's not critical anymore.

Sign in to add a comment