New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 634139 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Sep 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

CFI is 5% slower on large-table-with-collapsed-borders-and-colspans-wider-than-table.html

Project Member Reported by krasin@chromium.org, Aug 3 2016

Issue description

(Elliott: just an fyi)

In the last (July 2016) attempt to launch CFI vcall on Linux, we realized that many blink layout regressed by ~3.5% and one of them by ~5%: blink_perf.layout.large-table-with-collapsed-borders-and-colspans-wider-than-table.html

The attached results (quite noisy) demonstrate that the slowdown is not due to a memory load for complex bitsets, but due to the most basic range check in CFI. To be more specific: the html has 4 columns:

off: that's the official Chrome build with Clang ToT with all default flags (including LTO and devirtualization)
cfi0: CFI-vcall with -lowertypetests-bitsets-level=0: no bitsets are checked; just a range check for vtables (already better than the CFI in Microsoft Edge)
cfi1: CFI-vcall with -lowertypetests-bitsets-level=1: only bitsets of size <= 64 are checked; no memory load is generated.
cfi2: CFI-vcall with -lowertypetests-bitsets-level=2: precise CFI with all bitsets, even those requiring a memory load.

While the results are somewhat noisy (I followed the best practices from https://crbug.com/629487, but they didn't give me much better than a default settings, probably due to a different machine, Z840 vs Z620), it's still visible that even cfi0 regressed the benchmark by solid 5%.

The next step is to take a closer look at which virtual call sites contribute the most, and think if we can devirtualize them.

 
results.html
132 KB View Download
Description: Show this description
While I am on it, below are the sizes for each of the variant.

krasin@krasin2:~/chr28$ ~/chr23/build/scripts/slave/chromium/sizes.py --target off-tot --platform linux
RESULT chrome: chrome= 1493224256 bytes
RESULT chrome-stripped: stripped= 105415440 bytes
RESULT chrome-text: text= 100189187 bytes
RESULT chrome-data: data= 5221080 bytes
RESULT chrome-bss: bss= 1714616 bytes
RESULT chrome-si: initializers= 7 files
RESULT chrome-textrel: textrel= 0 relocs
RESULT nacl_helper: nacl_helper= 66023744 bytes
RESULT nacl_helper-stripped: stripped= 2128920 bytes
RESULT nacl_helper-text: text= 2091269 bytes
RESULT nacl_helper-data: data= 30408 bytes
RESULT nacl_helper-bss: bss= 201744 bytes
RESULT nacl_helper-si: initializers= 6 files
RESULT nacl_helper-textrel: textrel= 0 relocs
RESULT nacl_helper_bootstrap: nacl_helper_bootstrap= 33342 bytes
RESULT nacl_helper_bootstrap-stripped: stripped= 8848 bytes
RESULT nacl_helper_bootstrap-text: text= 5710 bytes
RESULT nacl_helper_bootstrap-data: data= 40 bytes
RESULT nacl_helper_bootstrap-bss: bss= 4113 bytes
RESULT nacl_helper_bootstrap-si: initializers= 0 files
RESULT nacl_helper_bootstrap-textrel: textrel= 0 relocs
RESULT nacl_irt_x86_64.nexe: nacl_irt_x86_64.nexe= 3737136 bytes
RESULT resources.pak: resources.pak= 17724496 bytes
RESULT totals-bss: bss= 1920473 bytes
RESULT totals-data: data= 5251528 bytes
RESULT totals-initializers: initializers= 13 files
RESULT totals-size: size= 1580742974 bytes
RESULT totals-stripped: stripped= 107553208 bytes
RESULT totals-text: text= 102286166 bytes
RESULT totals-textrel: textrel= 0 relocs


krasin@krasin2:~/chr28$ ~/chr23/build/scripts/slave/chromium/sizes.py --target cfi-tot-0 --platform linux                                                                                                         
RESULT chrome: chrome= 1493736296 bytes
RESULT chrome-stripped: stripped= 108344080 bytes
RESULT chrome-text: text= 102461868 bytes
RESULT chrome-data: data= 5874712 bytes
RESULT chrome-bss: bss= 1714552 bytes
RESULT chrome-si: initializers= 7 files
RESULT chrome-textrel: textrel= 0 relocs
RESULT nacl_helper: nacl_helper= 66131144 bytes
RESULT nacl_helper-stripped: stripped= 2153496 bytes
RESULT nacl_helper-text: text= 2112798 bytes
RESULT nacl_helper-data: data= 33528 bytes
RESULT nacl_helper-bss: bss= 201728 bytes
RESULT nacl_helper-si: initializers= 6 files
RESULT nacl_helper-textrel: textrel= 0 relocs
RESULT nacl_helper_bootstrap: nacl_helper_bootstrap= 33342 bytes
RESULT nacl_helper_bootstrap-stripped: stripped= 8848 bytes
RESULT nacl_helper_bootstrap-text: text= 5710 bytes
RESULT nacl_helper_bootstrap-data: data= 40 bytes
RESULT nacl_helper_bootstrap-bss: bss= 4113 bytes
RESULT nacl_helper_bootstrap-si: initializers= 0 files
RESULT nacl_helper_bootstrap-textrel: textrel= 0 relocs
RESULT nacl_irt_x86_64.nexe: nacl_irt_x86_64.nexe= 3737136 bytes
RESULT resources.pak: resources.pak= 17724496 bytes
RESULT totals-bss: bss= 1920393 bytes
RESULT totals-data: data= 5908280 bytes
RESULT totals-initializers: initializers= 13 files
RESULT totals-size: size= 1581362414 bytes
RESULT totals-stripped: stripped= 110506424 bytes
RESULT totals-text: text= 104580376 bytes
RESULT totals-textrel: textrel= 0 relocs


krasin@krasin2:~/chr28$ ~/chr23/build/scripts/slave/chromium/sizes.py --target cfi-tot-1 --platform linux                                                                                                         
RESULT chrome: chrome= 1495970872 bytes
RESULT chrome-stripped: stripped= 110285584 bytes
RESULT chrome-text: text= 104403564 bytes
RESULT chrome-data: data= 5874728 bytes
RESULT chrome-bss: bss= 1714552 bytes
RESULT chrome-si: initializers= 7 files
RESULT chrome-textrel: textrel= 0 relocs
RESULT nacl_helper: nacl_helper= 66131912 bytes
RESULT nacl_helper-stripped: stripped= 2153496 bytes
RESULT nacl_helper-text: text= 2116382 bytes
RESULT nacl_helper-data: data= 33528 bytes
RESULT nacl_helper-bss: bss= 201760 bytes
RESULT nacl_helper-si: initializers= 6 files
RESULT nacl_helper-textrel: textrel= 0 relocs
RESULT nacl_helper_bootstrap: nacl_helper_bootstrap= 33342 bytes
RESULT nacl_helper_bootstrap-stripped: stripped= 8848 bytes
RESULT nacl_helper_bootstrap-text: text= 5710 bytes
RESULT nacl_helper_bootstrap-data: data= 40 bytes
RESULT nacl_helper_bootstrap-bss: bss= 4113 bytes
RESULT nacl_helper_bootstrap-si: initializers= 0 files
RESULT nacl_helper_bootstrap-textrel: textrel= 0 relocs
RESULT nacl_irt_x86_64.nexe: nacl_irt_x86_64.nexe= 3737136 bytes
RESULT resources.pak: resources.pak= 17724496 bytes
RESULT totals-bss: bss= 1920425 bytes
RESULT totals-data: data= 5908296 bytes
RESULT totals-initializers: initializers= 13 files
RESULT totals-size: size= 1583597758 bytes
RESULT totals-stripped: stripped= 112447928 bytes
RESULT totals-text: text= 106525656 bytes
RESULT totals-textrel: textrel= 0 relocs


krasin@krasin2:~/chr28$ ~/chr23/build/scripts/slave/chromium/sizes.py --target cfi-tot-2 --platform linux                                                                                                         
RESULT chrome: chrome= 1489119792 bytes
RESULT chrome-stripped: stripped= 110531344 bytes
RESULT chrome-text: text= 104652404 bytes
RESULT chrome-data: data= 5874736 bytes
RESULT chrome-bss: bss= 1714584 bytes
RESULT chrome-si: initializers= 7 files
RESULT chrome-textrel: textrel= 0 relocs
RESULT nacl_helper: nacl_helper= 66137320 bytes
RESULT nacl_helper-stripped: stripped= 2157592 bytes
RESULT nacl_helper-text: text= 2118366 bytes
RESULT nacl_helper-data: data= 33528 bytes
RESULT nacl_helper-bss: bss= 201744 bytes
RESULT nacl_helper-si: initializers= 6 files
RESULT nacl_helper-textrel: textrel= 0 relocs
RESULT nacl_helper_bootstrap: nacl_helper_bootstrap= 33342 bytes
RESULT nacl_helper_bootstrap-stripped: stripped= 8848 bytes
RESULT nacl_helper_bootstrap-text: text= 5710 bytes
RESULT nacl_helper_bootstrap-data: data= 40 bytes
RESULT nacl_helper_bootstrap-bss: bss= 4113 bytes
RESULT nacl_helper_bootstrap-si: initializers= 0 files
RESULT nacl_helper_bootstrap-textrel: textrel= 0 relocs
RESULT nacl_irt_x86_64.nexe: nacl_irt_x86_64.nexe= 3737136 bytes
RESULT resources.pak: resources.pak= 17724496 bytes
RESULT totals-bss: bss= 1920441 bytes
RESULT totals-data: data= 5908304 bytes
RESULT totals-initializers: initializers= 13 files
RESULT totals-size: size= 1576752086 bytes
RESULT totals-stripped: stripped= 112697784 bytes
RESULT totals-text: text= 106776480 bytes
RESULT totals-textrel: textrel= 0 relocs

Just chrome.stripped:

off-tot:   105415440 / +0.0%
cfi-tot-0: 108344080 / +2.78%
cfi-tot-1: 110285584 / +4.62%
cfi-tot-2: 110531344 / +4.85%

Which means that bitsets themselves are insignificant in bloating the binary. It's mostly the inline checks.
Here is the list of hot virtual call sites during the benchmark retrieved using the instructions here: https://bugs.chromium.org/p/chromium/issues/detail?id=580389#c27

To put it into the perspective, the numbers correspond to ~6 seconds of runtime.

./out/cfi-stats-0/../../third_party/WebKit/Source/core/paint/BoxClipper.cpp:38 _ZN5blink10BoxClipperC1ERKNS_9LayoutBoxERKNS_9PaintInfoERKNS_11LayoutPointENS_20ContentsClipBehaviorE cfi-vcall 1664136
./out/cfi-stats-0/../../third_party/WebKit/Source/core/paint/BlockPainter.cpp:55 _ZN5blink12BlockPainter5paintERKNS_9PaintInfoERKNS_11LayoutPointE cfi-vcall 1664073
./out/cfi-stats-0/../../third_party/WebKit/Source/core/paint/BlockPainter.cpp:181 _ZN5blink12BlockPainter11paintObjectERKNS_9PaintInfoERKNS_11LayoutPointE cfi-vcall 1664040
./out/cfi-stats-0/../../third_party/WebKit/Source/core/paint/BlockPainter.cpp:227 _ZNK5blink12BlockPainter19intersectsPaintRectERKNS_9PaintInfoERKNS_11LayoutPointE cfi-vcall 2496093
./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutObject.cpp:939 _ZNK5blink12LayoutObject15containingBlockEv cfi-vcall 1440331
./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:448 _ZNK5blink9LayoutBox12clientHeightEv cfi-vcall 2360160
./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:448 _ZNK5blink9LayoutBox12clientHeightEv cfi-vcall 2360160
./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:4457 _ZNK5blink9LayoutBox14noOverflowRectEv cfi-vcall 1560209
./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:4457 _ZNK5blink9LayoutBox14noOverflowRectEv cfi-vcall 1560209
./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:4458 _ZNK5blink9LayoutBox14noOverflowRectEv cfi-vcall 1560209
./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:4459 _ZNK5blink9LayoutBox14noOverflowRectEv cfi-vcall 1560209
./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:4460 _ZNK5blink9LayoutBox14noOverflowRectEv cfi-vcall 1560209
./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:4472 _ZNK5blink9LayoutBox14noOverflowRectEv cfi-vcall 1560209
./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutObject.h:522 _ZNK5blink11LayoutBlock27createsNewFormattingContextEv cfi-vcall 1680124
./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutObject.h:512 _ZN5blink22PaintInvalidationState23updateForNormalChildrenEv cfi-vcall 1323501
./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/PaintInvalidationState.cpp:336 _ZNK5blink22PaintInvalidationState43computePositionFromPaintInvalidationBackingEv cfi-vcall 1763551
./out/cfi-stats-0/../../third_party/WebKit/Source/core/page/FrameTree.cpp:115 _ZNK5blink9FrameTree6parentEv cfi-vcall 1443208
./out/cfi-stats-0/../../third_party/WebKit/Source/core/dom/Node.h:255 _ZN5blink17FlatTreeTraversal14traverseParentERKNS_4NodeEPNS_26LayoutTreeBuilderTraversal13ParentDetailsE cfi-vcall 1441302
./out/cfi-stats-0/../../third_party/WebKit/Source/platform/heap/TraceTraits.h:112 _ZN5blink4Node9traceImplINS_27InlinedGlobalMarkingVisitorEEEvT_ cfi-vcall 5414898
./out/cfi-stats-0/../../third_party/WebKit/Source/platform/heap/TraceTraits.h:75 _ZN5blink18AdjustAndMarkTraitINS_20HTMLTableCellElementELb0EE4markIPNS_7VisitorEEEvT_PKS1_ cfi-vcall 4413645
./out/cfi-stats-0/../../third_party/WebKit/Source/core/events/EventTarget.h:245 _ZN5blink21MajorGCWrapperVisitor21VisitPersistentHandleEPN2v810PersistentINS1_5ValueENS1_27NonCopyablePersistentTraitsIS3_EEEEt cfi-vcall 3682251
./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.h:420 _ZNK5blink9LayoutBox13contentHeightEv cfi-vcall 1880096
./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.h:420 _ZNK5blink9LayoutBox13contentHeightEv cfi-vcall 1880096
./out/cfi-stats-0/../../third_party/WebKit/Source/web/WebLocalFrameImpl.cpp:1592 _ZN5blink17WebLocalFrameImpl9fromFrameERNS_10LocalFrameE cfi-vcall 2164437
./out/cfi-stats-0/../../third_party/WebKit/Source/platform/heap/TraceTraits.h:206 _ZN5blink10TraceTraitINS_4NodeEE5traceEPNS_7VisitorEPv cfi-vcall 3196481
./out/cfi-stats-0/../../third_party/skia/src/core/SkSmallAllocator.h:127 _ZN16SkSmallAllocatorILj3ELm3332EE8DestroyTI23SkARGB32_Opaque_BlitterEEvPv cfi-vcall 1401075
./out/cfi-stats-0/../../third_party/skia/src/core/SkBitmapDevice.cpp:149 _ZN14SkBitmapDevice14onAccessPixelsEP8SkPixmap cfi-vcall 1417873
./out/cfi-stats-0/../../third_party/skia/src/core/SkPixelRef.cpp:278 _ZN10SkPixelRef19notifyPixelsChangedEv cfi-vcall 1417887
./out/cfi-stats-0/../../third_party/skia/src/core/SkDevice.cpp:270 _ZN12SkBaseDevice12accessPixelsEP8SkPixmap cfi-vcall 1417873
./out/cfi-stats-0/../../third_party/skia/src/core/SkCanvas.cpp:1929 _ZN8SkCanvas8drawRectERK6SkRectRK7SkPaint cfi-vcall 3605896
./out/cfi-stats-0/../../third_party/skia/src/core/SkCanvas.cpp:2158 _ZN8SkCanvas10onDrawRectERK6SkRectRK7SkPaint cfi-vcall 1417816
./out/cfi-stats-0/../../third_party/skia/src/core/SkCanvas.cpp:2978 _ZN8SkCanvas11drawPictureEPK9SkPicturePK8SkMatrixPK7SkPaint cfi-vcall 1193633
./out/cfi-stats-0/../../third_party/skia/src/core/SkCanvas.cpp:2982 _ZN8SkCanvas11drawPictureEPK9SkPicturePK8SkMatrixPK7SkPaint cfi-vcall 1193593
./out/cfi-stats-0/../../third_party/skia/src/core/SkCanvas.cpp:332 _ZN8SkCanvas10onDrawRectERK6SkRectRK7SkPaint cfi-vcall 1417873
./out/cfi-stats-0/../../v8/src/global-handles.cc:677 _ZN2v88internal13GlobalHandles16IterateWeakRootsEPNS0_13ObjectVisitorE cfi-vcall 3444609
./out/cfi-stats-0/../../v8/src/global-handles.cc:823 _ZN2v88internal13GlobalHandles19IterateObjectGroupsEPNS0_13ObjectVisitorEPFbPNS0_4HeapEPPNS0_6ObjectEE cfi-vcall 3660256
./out/cfi-stats-0/../../v8/src/global-handles.cc:1152 _ZN2v88internal13GlobalHandles15IterateAllRootsEPNS0_13ObjectVisitorE cfi-vcall 3455264
./out/cfi-stats-0/../../v8/src/global-handles.cc:1161 _ZN2v88internal13GlobalHandles27IterateAllRootsWithClassIdsEPNS0_13ObjectVisitorE cfi-vcall 8085953
./out/cfi-stats-0/../../v8/src/api.cc:8048 _ZN2v814VisitorAdapter22VisitEmbedderReferenceEPPNS_8internal6ObjectEt cfi-vcall 8107932

Comment 5 by kcc@chromium.org, Aug 4 2016

The above stats show the call sites. Can it be extended to also show the call targets? 
It can be, but not easily.

But we can get a pretty solid feel looking at the code:
https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/layout/LayoutBox.cpp?q=third_party/WebKit/Source/core/layout/LayoutBox.cpp:4457&sq=package:chromium&l=4457

    LayoutUnit left(borderLeft() + (shouldPlaceBlockDirectionScrollbarOnLogicalLeft() ? scrollBarWidth : 0));
    LayoutUnit top(borderTop());
    LayoutUnit right(borderRight());
    LayoutUnit bottom(borderBottom());

Each of these methods are virtual and they look like:

virtual int borderTop() const { return style()->borderTopWidth(); }
int borderTop() const override
    {
        if (style()->isHorizontalWritingMode())
            return style()->isFlippedBlocksWritingMode() ? borderAfter() : borderBefore();
        return style()->isLeftToRightDirection() ? borderStart() : borderEnd();
    }
int LayoutTableCell::borderTop() const
{
    return table()->collapseBorders() ? borderHalfTop(false) : LayoutBlockFlow::borderTop();
}

virtual int borderBottom() const { return style()->borderBottomWidth(); }
virtual int borderLeft() const { return style()->borderLeftWidth(); }
virtual int borderRight() const { return style()->borderRightWidth(); }

virtual bool shouldPlaceBlockDirectionScrollbarOnLogicalLeft() const { return style()->shouldPlaceBlockDirectionScrollbarOnLogicalLeft(); }
bool shouldPlaceBlockDirectionScrollbarOnLogicalLeft() const override { return false; }

and so on. Basically, it's not just a few hot virtual call sites, but a whole tree and we see its root. Leaves and branches don't show up in the profile, because there're quite a lot of different options.

At the same time, the implementation of these methods is mostly trivial. While these are not virtual constants, there might be some optimizations applicable. Don't know yet.
Writing down one possible optimization, which I have not yet thought through, just to avoid forgetting about it.

Consider the following code (two_checks.cc):

#include <stdio.h>

class Base {
 public:
  virtual void Do() {
    printf("Base::Do\n");
  }

  virtual void Boo() {
    printf("Base::Boo\n");
  }
};

class A : public Base {
 public:
  void Do() override {
    printf("A::Do\n");
  }

  void Boo() override {
    printf("A::Boo\n");
  }
};

__attribute__((noinline)) void printBase(Base* ptr) {
  ptr->Do();
  ptr->Boo();
}


int main() {
  Base base;
  A a;
  printBase(&base);
  printBase(&a);

  return 0;
}

Consider printBase: CFI will generate two identical checks for the same vtable:

define internal void @_Z9printBaseP4Base(%class.Base* %ptr) #0 !dbg !6 {
entry:
  %ptr.addr = alloca %class.Base*, align 8
  store %class.Base* %ptr, %class.Base** %ptr.addr, align 8
  call void @llvm.dbg.declare(metadata %class.Base** %ptr.addr, metadata !23, metadata !24), !dbg !25
  %0 = load %class.Base*, %class.Base** %ptr.addr, align 8, !dbg !26
  %1 = bitcast %class.Base* %0 to void (%class.Base*)***, !dbg !27
  %vtable = load void (%class.Base*)**, void (%class.Base*)*** %1, align 8, !dbg !27
  %2 = bitcast void (%class.Base*)** %vtable to i8*, !dbg !27, !nosanitize !2
  %3 = ptrtoint i8* %2 to i64, !dbg !27
  %4 = sub i64 %3, add (i64 ptrtoint ({ [4 x i8*], [0 x i8], [4 x i8*] }* @0 to i64), i64 16), !dbg !27
  %5 = lshr i64 %4, 5, !dbg !27
  %6 = shl i64 %4, 59, !dbg !27
  %7 = or i64 %5, %6, !dbg !27
  %8 = icmp ult i64 %7, 2, !dbg !27
  br i1 %8, label %cont, label %trap, !dbg !27, !nosanitize !2

trap:                                             ; preds = %entry
  call void @llvm.trap() #2, !dbg !28, !nosanitize !2
  unreachable, !dbg !28, !nosanitize !2
cont:                                             ; preds = %entry
  %vfn = getelementptr inbounds void (%class.Base*)*, void (%class.Base*)** %vtable, i64 0, !dbg !30
  %9 = load void (%class.Base*)*, void (%class.Base*)** %vfn, align 8, !dbg !30
  call void %9(%class.Base* %0), !dbg !30
  %10 = load %class.Base*, %class.Base** %ptr.addr, align 8, !dbg !32
  %11 = bitcast %class.Base* %10 to void (%class.Base*)***, !dbg !33
  %vtable1 = load void (%class.Base*)**, void (%class.Base*)*** %11, align 8, !dbg !33
  %12 = bitcast void (%class.Base*)** %vtable1 to i8*, !dbg !33, !nosanitize !2
  %13 = ptrtoint i8* %12 to i64, !dbg !33
  %14 = sub i64 %13, add (i64 ptrtoint ({ [4 x i8*], [0 x i8], [4 x i8*] }* @0 to i64), i64 16), !dbg !33
  %15 = lshr i64 %14, 5, !dbg !33
  %16 = shl i64 %14, 59, !dbg !33
  %17 = or i64 %15, %16, !dbg !33
  %18 = icmp ult i64 %17, 2, !dbg !33
  br i1 %18, label %cont3, label %trap2, !dbg !33, !nosanitize !2

trap2:                                            ; preds = %cont
  call void @llvm.trap() #2, !dbg !34, !nosanitize !2
  unreachable, !dbg !34, !nosanitize !2

cont3:                                            ; preds = %cont
  %vfn4 = getelementptr inbounds void (%class.Base*)*, void (%class.Base*)** %vtable1, i64 1, !dbg !35
  %19 = load void (%class.Base*)*, void (%class.Base*)** %vfn4, align 8, !dbg !35
  call void %19(%class.Base* %10), !dbg !35
  ret void, !dbg !36
}

This is stupid, and we should do better. I am yet to think through, but this is one of the obvious places, where we can significantly reduce impact without sacrificing anything.
For the reference, the command line I used to compile the file above:
clang++ -fuse-ld=lld -o lala two_checks.cc -std=c++11 -flto -fvisibility=hidden -fsanitize=cfi-vcall -Wl,--lto-O1  -g -Wl,-mllvm,-lowertypetests-bitsets-level=0 -Wl,-save-temps && ./lala

two_checks.cc is also attached to the comment.
two_checks.cc
463 bytes View Download
If implemented, this optimization should reduce the CFI penalty in LayoutBox::noOverflowRect 
https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/layout/LayoutBox.cpp?q=third_party/WebKit/Source/core/layout/LayoutBox.cpp:4457&sq=package:chromium&l=4446 by about 6x.

I can't claim that the whole large-table-with-collapsed-borders-and-colspans-wider-than-table.html will become 6x faster, as there're other hot parts too, but it seems like a worthwhile optimization.

At first, I am going to fake it, just to measure the impact generating at most 1 CFI check per function. It will provide an upper bound for the speedup, and if the microbenchmark still regressed by 5%, it's not the most important thing to do now.
I am going to add an XFAIL test to compiler-rt: https://reviews.llvm.org/D23151 and then think of an optimization.

An important thing to consider is a risk of something being corrupted between the check and the call. Since CFI only checks vtable pointer, and we don't guard against stack modifications (that's a job for SafeStack), it should be possible to store vtable pointer in alloca and then reuse.
On the other hand, if we do reuse vtable, and the object got destructed between the first and the second call, we'll not catch that. So, the optimization needs to be carefully considered before I implemented it.
Another example of a very hot place where we unnecessarily put additional code is https://cs.chromium.org/chromium/src/v8/src/global-handles.cc?sq=package:chromium&rcl=1470309818&l=1158

void GlobalHandles::IterateAllRootsWithClassIds(ObjectVisitor* v) {
  for (NodeIterator it(this); !it.done(); it.Advance()) {
    if (it.node()->IsRetainer() && it.node()->has_wrapper_class_id()) {
      v->VisitEmbedderReference(it.node()->location(),
                                it.node()->wrapper_class_id());
    }
  }
}

VisitEmbedderReference is called 8 million times during the benchmark, but we can make a check and out of the loop and reduce the impact at least 8x. That will require to remember the pointer to the actual VisitEmbedderReference making it an indirect call instead of a virtual one. That again is based on the assumption that we can trust local variables (read: stack).
But we probably already trust stack in CFI, as the return addresses are taken from there.

Comment 14 by kcc@chromium.org, Aug 4 2016

The optimization in #12 will make as rely on secure stack *more* than we do now.
Besides, this is not a CFI-specific optimization, it should or should not happen
regardless of CFI. 

One more thing to look at: full devirtualization. 
I think in at least one of your examples the bitset contained just 2 ones.
In this case we can replace a vcall with 

       if (vptr == A::vptr) direct-call A::f
  else if (vptr == B::vptr) direct-call B::f
  else trap 

AFIACT, this common compiler optimization is not yet implemented in LLVM
(not totally sure)
GlobalHandles::IterateObjectGroups is even better example as the CFI check lies within a nested loop:
https://cs.chromium.org/chromium/src/v8/src/global-handles.cc?sq=package:chromium&rcl=1470309818&l=792

for (int i = 0; i < object_groups_.length(); i++) {
  ...
  for (size_t j = 0; j < entry->length; ++j) {
    ...
    v->VisitPointer(&object);
...


>The optimization in #12 will make as rely on secure stack *more* than we do now.
Correct. But since we can't have a CFI check within a loop and not slowdown Blink, it's the least evil we can do. The alternative is to blacklist such hot functions altogether, which is obviously worse.

>Besides, this is not a CFI-specific optimization, it should or should not happen regardless of CFI.

Yes, but it probably makes less sense to do if CFI is not enabled, as the virtual calls are cheaper. But I agree that if we go with this optimization, it will live outside of the CFI code.

>One more thing to look at: full devirtualization. 
>I think in at least one of your examples the bitset contained just 2 ones.
>In this case we can replace a vcall with 
>
>      if (vptr == A::vptr) direct-call A::f
>  else if (vptr == B::vptr) direct-call B::f
>  else trap 
This is still too expensive for hot loops like we see above, but it probably makes sense to implement this optimization anyway, as it will speed up other places, where there's no loop.



Comment 17 by kcc@chromium.org, Aug 4 2016

Note that hoisting vptr load (and check) out of the loop may be hindered by aliasing. 

complete devirtualization can in some cases be combined with hoisting vptr load,
and yet again this is not CFI-specific and is pretty hard. 
I have updated the stats (attached). They now cover everything, demangled, sorted by frequency and have links to the code search.

"View" won't work. Download + Open will.
cfi-vcall-aug-4.html
1.1 MB View Download
Note to myself: check if blink::Node::getNodeType gets devirtualized. It should, but the stats from #18 it does not.
Attached is the list of devirtualized methods in Chrome. This is tested with Clang ToT and the list (surprisingly) much larger than it was measured previously:

https://bugs.chromium.org/p/chromium/issues/detail?id=580389#c25

The number of devirtualized call sites: 53276 (or 95109, if we don't discard dups inlined in different places)
The number of devirtualized methods: 25101

Note: I am not sure about the originally reported number (~400). I am not sure about the currently reported numbers (but more sure, since it generated this nice list).

I will double check it tomorrow, now just leaving as a reference. According to the list, getNoteType methods have been devirtualized:
devirtualized blink::CDATASection::getNodeType() const
devirtualized blink::DocumentType::getNodeType() const
devirtualized blink::DocumentFragment::getNodeType() const
devirtualized blink::ProcessingInstruction::getNodeType() const
devirtualized blink::Attr::getNodeType() const
devirtualized blink::Text::getNodeType() const
devirtualized blink::Comment::getNodeType() const
devirtualized blink::Element::getNodeType() const
devirtualized blink::Document::getNodeType() const

It seems that sanitizer stats report without taking devirtualization into the account, as the instrumentation happens on the Clang level, and we don't disable it after devirtualizing stuff.

It probably means that the profile in #8 has quite a few "dead" entries.

Again, I'll verify this tomorrow, and I would ask for someone else to audit my findings.
chrome.devirtualized.methods.txt
2.0 MB View Download
I have confirmed that cfi-vcall sanstats are shown even for devirtualized calls. It's good news, but now I have to figure out how not to show them, so that we can see the real picture.
I've made some work on improving the precision for devirtualization stats (all committed to LLVM trunk, but not yet rolled into Chrome). The new numbers:

Number of devirtualized call sites: 51491
Number of devirtualized methods: 23149

An HTML with all the devirtualized methods, their call sites and reasons to be devirtualized is attached. I had to omit call sites for methods devirtualized for having a single-impl, because there're too many of them. Also, some of the call sites are estimated, as I had to limit the amount of remarks coming out of the compiler.

devirt-methods.html
12.4 MB View Download
Note: in this revision I don't see getNodeType methods, but I am currently too tired to investigate. Will continue tomorrow.
I have now collected more comprehensive stats which can see how devirtualization affected the dynamic number of virtual calls. top(50) entries with original count (cfi-vcall sanstat) and real number of checks (I added into LowerTypeTests):

                                                              Location     Orig    Real
                                                    v8/src/api.cc:8021  8034573    same
                                         v8/src/global-handles.cc:1161  8011676    same
                           third_party/skia/src/core/SkCanvas.cpp:1931  6933401    same
              third_party/WebKit/Source/core/layout/LayoutObject.h:506  5933625      18 partially
             third_party/WebKit/Source/platform/heap/TraceTraits.h:112  5418269    same
               third_party/WebKit/Source/core/layout/LayoutBox.cpp:451  4720320    same
              third_party/WebKit/Source/platform/heap/TraceTraits.h:75  4387552    same
              third_party/WebKit/Source/core/layout/LayoutObject.h:516  3840941       0 devirtualized
                 third_party/WebKit/Source/core/layout/LayoutBox.h:414  3760192    same
               third_party/WebKit/Source/core/events/EventTarget.h:250  3650165    same
                                          v8/src/global-handles.cc:823  3627244    same
                                         v8/src/global-handles.cc:1152  3423384    same
                                          v8/src/global-handles.cc:677  3412722    same
  third_party/WebKit/Source/core/layout/PaintInvalidationState.cpp:350  3366571       0 devirtualized
             third_party/WebKit/Source/platform/heap/TraceTraits.h:206  3228728    same
              third_party/WebKit/Source/core/layout/LayoutBox.cpp:4317  3120418    same
                         third_party/WebKit/Source/core/dom/Node.h:254  2883947    same
             third_party/WebKit/Source/core/layout/LayoutObject.h:1218  2880494 1440247 partially
             third_party/WebKit/Source/core/paint/BlockPainter.cpp:227  2496093    same
              third_party/WebKit/Source/web/WebLocalFrameImpl.cpp:1618  2164454       0 devirtualized
                 third_party/WebKit/Source/core/layout/LayoutBox.h:343  1760072    same
                third_party/WebKit/Source/core/paint/BoxClipper.cpp:38  1664136    same
              third_party/WebKit/Source/core/paint/BlockPainter.cpp:55  1664073    same
             third_party/WebKit/Source/core/paint/BlockPainter.cpp:181  1664040       0 devirtualized
              third_party/WebKit/Source/core/layout/LayoutObject.h:250  1563573    same
              third_party/WebKit/Source/core/layout/LayoutBox.cpp:4320  1560209    same
              third_party/WebKit/Source/core/layout/LayoutBox.cpp:4332  1560209    same
              third_party/WebKit/Source/core/layout/LayoutBox.cpp:4319  1560209    same
              third_party/WebKit/Source/core/layout/LayoutBox.cpp:4318  1560209    same
                 third_party/WebKit/Source/core/page/FrameTree.cpp:115  1443240    same
            third_party/WebKit/Source/core/layout/LayoutObject.cpp:923  1440331       0 devirtualized
                          third_party/skia/src/core/SkPixelRef.cpp:278  1417885       0 devirtualized
                      third_party/skia/src/core/SkBitmapDevice.cpp:149  1417871       0 devirtualized
                            third_party/skia/src/core/SkDevice.cpp:281  1417871    same
                            third_party/skia/src/core/SkCanvas.cpp:334  1417871       0 devirtualized
                           third_party/skia/src/core/SkCanvas.cpp:2182  1417816    same
                      third_party/skia/src/core/SkSmallAllocator.h:127  1401146      14 partially
                 third_party/WebKit/Source/core/layout/LayoutBox.h:413  1360602    same
               third_party/WebKit/Source/core/layout/LayoutBox.cpp:446  1360602    same
              third_party/WebKit/Source/core/layout/LayoutObject.h:517  1323431       0 devirtualized
    third_party/WebKit/Source/platform/graphics/GraphicsLayer.cpp:1110  1281300  640650 partially
                           third_party/skia/src/core/SkCanvas.cpp:1845  1194167       0 devirtualized
                                  cc/playback/display_item_list.cc:141  1193879    same
                                cc/playback/drawing_display_item.cc:83  1193879    same
                           third_party/skia/src/core/SkCanvas.cpp:3033  1193507    same
                         third_party/skia/src/core/SkBigPicture.cpp:32  1193436       0 devirtualized
                           third_party/skia/src/core/SkCanvas.cpp:3037  1193278    same
                           third_party/skia/src/core/SkCanvas.cpp:3044  1193241    same
                           third_party/skia/src/core/SkCanvas.cpp:3056  1193241    same
                           third_party/skia/src/core/SkCanvas.cpp:3057  1193241    same

As we can see, a reasonable number of top offenders have been devirtualized completely or partially. I will now proceed with generating the list of methods which we need to disable CFI on and then will file a bug to implement LLVM optimizations which will reduce the CFI overhead on those methods.
Project Member

Comment 25 by bugdroid1@chromium.org, Aug 19 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4e341f5368a22ceeab4d5133f03ec6a71729737f

commit 4e341f5368a22ceeab4d5133f03ec6a71729737f
Author: krasin <krasin@google.com>
Date: Fri Aug 19 00:07:44 2016

Add 10 Skia methods to CFI blacklist for perf reasons.

While they have not been observed to slow down real-world use cases,
some blink_layout microbenchmarks feel better with these methods
disabled. In order to be concervative at the launch time, lift
the CFI defense for these methods.

BUG= 634139 

Review-Url: https://codereview.chromium.org/2259143002
Cr-Commit-Position: refs/heads/master@{#412979}

[modify] https://crrev.com/4e341f5368a22ceeab4d5133f03ec6a71729737f/tools/cfi/blacklist.txt

Project Member

Comment 26 by bugdroid1@chromium.org, Aug 19 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8cf09e2797b11ca54ea9162a29353bfba07742a2

commit 8cf09e2797b11ca54ea9162a29353bfba07742a2
Author: krasin <krasin@google.com>
Date: Fri Aug 19 19:21:15 2016

Extend CFI blacklist to include methods with lots of CFI checks.

The longer term solution is to implement optimizations in Clang
to avoid generating unnecessary checks.

BUG= 634139 

Review-Url: https://codereview.chromium.org/2251213007
Cr-Commit-Position: refs/heads/master@{#413208}

[modify] https://crrev.com/8cf09e2797b11ca54ea9162a29353bfba07742a2/tools/cfi/blacklist.txt

Project Member

Comment 27 by bugdroid1@chromium.org, Aug 20 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/06739364e2093d1dd489d2a232edac0402354f1d

commit 06739364e2093d1dd489d2a232edac0402354f1d
Author: krasin <krasin@google.com>
Date: Sat Aug 20 09:36:39 2016

Fix patterns in CFI blacklist.

The comparison happens on mangled names. Of course, there's no
double colon is there.

BUG= 634139 
TBR=kcc@chromium.org

Review-Url: https://codereview.chromium.org/2267543002
Cr-Commit-Position: refs/heads/master@{#413329}

[modify] https://crrev.com/06739364e2093d1dd489d2a232edac0402354f1d/tools/cfi/blacklist.txt

Project Member

Comment 28 by bugdroid1@chromium.org, Aug 21 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/903d551350749e6dbd1ffbadee169c4d97a8c53b

commit 903d551350749e6dbd1ffbadee169c4d97a8c53b
Author: krasin <krasin@chromium.org>
Date: Sun Aug 21 01:35:21 2016

Fix CFI blacklist typos and patterns.

BUG= 634139 
TBR=kcc@chromium.org

Review-Url: https://codereview.chromium.org/2265763002
Cr-Commit-Position: refs/heads/master@{#413359}

[modify] https://crrev.com/903d551350749e6dbd1ffbadee169c4d97a8c53b/tools/cfi/blacklist.txt

Project Member

Comment 29 by bugdroid1@chromium.org, Aug 21 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d8b951fa7f4f2cf466523d26a4a0254e206e36a0

commit d8b951fa7f4f2cf466523d26a4a0254e206e36a0
Author: krasin <krasin@chromium.org>
Date: Sun Aug 21 03:29:10 2016

Add more functions into CFI blacklist.

BUG= 634139 
TBR=kcc@chromium.org

Review-Url: https://codereview.chromium.org/2265773002
Cr-Commit-Position: refs/heads/master@{#413364}

[modify] https://crrev.com/d8b951fa7f4f2cf466523d26a4a0254e206e36a0/tools/cfi/blacklist.txt

Project Member

Comment 30 by bugdroid1@chromium.org, Aug 21 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/fcf451ffc147375610a1647ae8726f13b7e8d26a

commit fcf451ffc147375610a1647ae8726f13b7e8d26a
Author: krasin <krasin@chromium.org>
Date: Sun Aug 21 06:26:33 2016

CFI blacklist: a few new entries; a bit more sorted.

BUG= 634139 
TBR=kcc@chromium.org

Review-Url: https://codereview.chromium.org/2265783002
Cr-Commit-Position: refs/heads/master@{#413370}

[modify] https://crrev.com/fcf451ffc147375610a1647ae8726f13b7e8d26a/tools/cfi/blacklist.txt

Project Member

Comment 31 by bugdroid1@chromium.org, Aug 27 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/825ce48757eb1ec0428fe573262614635d25cb7f

commit 825ce48757eb1ec0428fe573262614635d25cb7f
Author: krasin <krasin@google.com>
Date: Sat Aug 27 11:01:11 2016

Disable CFI on a few src/cc methods for perf reasons.

While we have not observed CFI to slowdown real-world use cases,
there are a few blink_perf microbenchmarks, which are somewhat affected
by the change. This change removes CFI protection from the methods
which contribute the most to the slowdown.

Eventually, when a proposed optimization in Clang is implemented
(https://crbug.com/638056), these attributes would be possible to remove
from all/most of the methods.

BUG=638056, 634139 
CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_precise_blink_rel

Review-Url: https://codereview.chromium.org/2257323002
Cr-Commit-Position: refs/heads/master@{#414916}

[modify] https://crrev.com/825ce48757eb1ec0428fe573262614635d25cb7f/base/compiler_specific.h
[modify] https://crrev.com/825ce48757eb1ec0428fe573262614635d25cb7f/cc/base/contiguous_container.h
[modify] https://crrev.com/825ce48757eb1ec0428fe573262614635d25cb7f/cc/blink/web_layer_impl.cc
[modify] https://crrev.com/825ce48757eb1ec0428fe573262614635d25cb7f/cc/playback/display_item_list.cc
[modify] https://crrev.com/825ce48757eb1ec0428fe573262614635d25cb7f/cc/playback/drawing_display_item.cc
[modify] https://crrev.com/825ce48757eb1ec0428fe573262614635d25cb7f/cc/trees/layer_tree_host.cc

Comment 32 by p...@chromium.org, Sep 6 2016

> The attached results (quite noisy) demonstrate that the slowdown is not due to a memory load for complex bitsets, but due to the most basic range check in CFI.

Your result doesn't really explain why vtable interleaving (which avoids bitsets) was able to report a lower performance overhead than us. (I also continue to be skeptical of any result obtained by tweaking CPU powersave/governor settings.) 

Have you looked at any of the other benchmarks besides large-table-with-collapsed-borders-and-colspans-wider-than-table.html ?

I would also be curious about the delta in .rodata between 1 and 2. Last I measured the bitset size was on the order of 1MB.
>Have you looked at any of the other benchmarks besides large-table-with-collapsed-borders-and-colspans-wider-than-table.html ?
Yes, and there was some slowdown due to the memory load. Unfortunately, almost no paper trails left. The best I can refer to now is https://storage.googleapis.com/cfi-stats/2016-08-15/results.html which missed cfi-2 column.

Note that the launched variant includes all bitsets. In the end, blacklisting the places where we do CFI checks in a loop proved to be more important.
Status: Fixed (was: Untriaged)

Sign in to add a comment