CFI is 5% slower on large-table-with-collapsed-borders-and-colspans-wider-than-table.html |
||
Issue description(Elliott: just an fyi) In the last (July 2016) attempt to launch CFI vcall on Linux, we realized that many blink layout regressed by ~3.5% and one of them by ~5%: blink_perf.layout.large-table-with-collapsed-borders-and-colspans-wider-than-table.html The attached results (quite noisy) demonstrate that the slowdown is not due to a memory load for complex bitsets, but due to the most basic range check in CFI. To be more specific: the html has 4 columns: off: that's the official Chrome build with Clang ToT with all default flags (including LTO and devirtualization) cfi0: CFI-vcall with -lowertypetests-bitsets-level=0: no bitsets are checked; just a range check for vtables (already better than the CFI in Microsoft Edge) cfi1: CFI-vcall with -lowertypetests-bitsets-level=1: only bitsets of size <= 64 are checked; no memory load is generated. cfi2: CFI-vcall with -lowertypetests-bitsets-level=2: precise CFI with all bitsets, even those requiring a memory load. While the results are somewhat noisy (I followed the best practices from https://crbug.com/629487, but they didn't give me much better than a default settings, probably due to a different machine, Z840 vs Z620), it's still visible that even cfi0 regressed the benchmark by solid 5%. The next step is to take a closer look at which virtual call sites contribute the most, and think if we can devirtualize them.
,
Aug 3 2016
While I am on it, below are the sizes for each of the variant. krasin@krasin2:~/chr28$ ~/chr23/build/scripts/slave/chromium/sizes.py --target off-tot --platform linux RESULT chrome: chrome= 1493224256 bytes RESULT chrome-stripped: stripped= 105415440 bytes RESULT chrome-text: text= 100189187 bytes RESULT chrome-data: data= 5221080 bytes RESULT chrome-bss: bss= 1714616 bytes RESULT chrome-si: initializers= 7 files RESULT chrome-textrel: textrel= 0 relocs RESULT nacl_helper: nacl_helper= 66023744 bytes RESULT nacl_helper-stripped: stripped= 2128920 bytes RESULT nacl_helper-text: text= 2091269 bytes RESULT nacl_helper-data: data= 30408 bytes RESULT nacl_helper-bss: bss= 201744 bytes RESULT nacl_helper-si: initializers= 6 files RESULT nacl_helper-textrel: textrel= 0 relocs RESULT nacl_helper_bootstrap: nacl_helper_bootstrap= 33342 bytes RESULT nacl_helper_bootstrap-stripped: stripped= 8848 bytes RESULT nacl_helper_bootstrap-text: text= 5710 bytes RESULT nacl_helper_bootstrap-data: data= 40 bytes RESULT nacl_helper_bootstrap-bss: bss= 4113 bytes RESULT nacl_helper_bootstrap-si: initializers= 0 files RESULT nacl_helper_bootstrap-textrel: textrel= 0 relocs RESULT nacl_irt_x86_64.nexe: nacl_irt_x86_64.nexe= 3737136 bytes RESULT resources.pak: resources.pak= 17724496 bytes RESULT totals-bss: bss= 1920473 bytes RESULT totals-data: data= 5251528 bytes RESULT totals-initializers: initializers= 13 files RESULT totals-size: size= 1580742974 bytes RESULT totals-stripped: stripped= 107553208 bytes RESULT totals-text: text= 102286166 bytes RESULT totals-textrel: textrel= 0 relocs krasin@krasin2:~/chr28$ ~/chr23/build/scripts/slave/chromium/sizes.py --target cfi-tot-0 --platform linux RESULT chrome: chrome= 1493736296 bytes RESULT chrome-stripped: stripped= 108344080 bytes RESULT chrome-text: text= 102461868 bytes RESULT chrome-data: data= 5874712 bytes RESULT chrome-bss: bss= 1714552 bytes RESULT chrome-si: initializers= 7 files RESULT chrome-textrel: textrel= 0 relocs RESULT nacl_helper: nacl_helper= 66131144 bytes RESULT nacl_helper-stripped: stripped= 2153496 bytes RESULT nacl_helper-text: text= 2112798 bytes RESULT nacl_helper-data: data= 33528 bytes RESULT nacl_helper-bss: bss= 201728 bytes RESULT nacl_helper-si: initializers= 6 files RESULT nacl_helper-textrel: textrel= 0 relocs RESULT nacl_helper_bootstrap: nacl_helper_bootstrap= 33342 bytes RESULT nacl_helper_bootstrap-stripped: stripped= 8848 bytes RESULT nacl_helper_bootstrap-text: text= 5710 bytes RESULT nacl_helper_bootstrap-data: data= 40 bytes RESULT nacl_helper_bootstrap-bss: bss= 4113 bytes RESULT nacl_helper_bootstrap-si: initializers= 0 files RESULT nacl_helper_bootstrap-textrel: textrel= 0 relocs RESULT nacl_irt_x86_64.nexe: nacl_irt_x86_64.nexe= 3737136 bytes RESULT resources.pak: resources.pak= 17724496 bytes RESULT totals-bss: bss= 1920393 bytes RESULT totals-data: data= 5908280 bytes RESULT totals-initializers: initializers= 13 files RESULT totals-size: size= 1581362414 bytes RESULT totals-stripped: stripped= 110506424 bytes RESULT totals-text: text= 104580376 bytes RESULT totals-textrel: textrel= 0 relocs krasin@krasin2:~/chr28$ ~/chr23/build/scripts/slave/chromium/sizes.py --target cfi-tot-1 --platform linux RESULT chrome: chrome= 1495970872 bytes RESULT chrome-stripped: stripped= 110285584 bytes RESULT chrome-text: text= 104403564 bytes RESULT chrome-data: data= 5874728 bytes RESULT chrome-bss: bss= 1714552 bytes RESULT chrome-si: initializers= 7 files RESULT chrome-textrel: textrel= 0 relocs RESULT nacl_helper: nacl_helper= 66131912 bytes RESULT nacl_helper-stripped: stripped= 2153496 bytes RESULT nacl_helper-text: text= 2116382 bytes RESULT nacl_helper-data: data= 33528 bytes RESULT nacl_helper-bss: bss= 201760 bytes RESULT nacl_helper-si: initializers= 6 files RESULT nacl_helper-textrel: textrel= 0 relocs RESULT nacl_helper_bootstrap: nacl_helper_bootstrap= 33342 bytes RESULT nacl_helper_bootstrap-stripped: stripped= 8848 bytes RESULT nacl_helper_bootstrap-text: text= 5710 bytes RESULT nacl_helper_bootstrap-data: data= 40 bytes RESULT nacl_helper_bootstrap-bss: bss= 4113 bytes RESULT nacl_helper_bootstrap-si: initializers= 0 files RESULT nacl_helper_bootstrap-textrel: textrel= 0 relocs RESULT nacl_irt_x86_64.nexe: nacl_irt_x86_64.nexe= 3737136 bytes RESULT resources.pak: resources.pak= 17724496 bytes RESULT totals-bss: bss= 1920425 bytes RESULT totals-data: data= 5908296 bytes RESULT totals-initializers: initializers= 13 files RESULT totals-size: size= 1583597758 bytes RESULT totals-stripped: stripped= 112447928 bytes RESULT totals-text: text= 106525656 bytes RESULT totals-textrel: textrel= 0 relocs krasin@krasin2:~/chr28$ ~/chr23/build/scripts/slave/chromium/sizes.py --target cfi-tot-2 --platform linux RESULT chrome: chrome= 1489119792 bytes RESULT chrome-stripped: stripped= 110531344 bytes RESULT chrome-text: text= 104652404 bytes RESULT chrome-data: data= 5874736 bytes RESULT chrome-bss: bss= 1714584 bytes RESULT chrome-si: initializers= 7 files RESULT chrome-textrel: textrel= 0 relocs RESULT nacl_helper: nacl_helper= 66137320 bytes RESULT nacl_helper-stripped: stripped= 2157592 bytes RESULT nacl_helper-text: text= 2118366 bytes RESULT nacl_helper-data: data= 33528 bytes RESULT nacl_helper-bss: bss= 201744 bytes RESULT nacl_helper-si: initializers= 6 files RESULT nacl_helper-textrel: textrel= 0 relocs RESULT nacl_helper_bootstrap: nacl_helper_bootstrap= 33342 bytes RESULT nacl_helper_bootstrap-stripped: stripped= 8848 bytes RESULT nacl_helper_bootstrap-text: text= 5710 bytes RESULT nacl_helper_bootstrap-data: data= 40 bytes RESULT nacl_helper_bootstrap-bss: bss= 4113 bytes RESULT nacl_helper_bootstrap-si: initializers= 0 files RESULT nacl_helper_bootstrap-textrel: textrel= 0 relocs RESULT nacl_irt_x86_64.nexe: nacl_irt_x86_64.nexe= 3737136 bytes RESULT resources.pak: resources.pak= 17724496 bytes RESULT totals-bss: bss= 1920441 bytes RESULT totals-data: data= 5908304 bytes RESULT totals-initializers: initializers= 13 files RESULT totals-size: size= 1576752086 bytes RESULT totals-stripped: stripped= 112697784 bytes RESULT totals-text: text= 106776480 bytes RESULT totals-textrel: textrel= 0 relocs
,
Aug 3 2016
Just chrome.stripped: off-tot: 105415440 / +0.0% cfi-tot-0: 108344080 / +2.78% cfi-tot-1: 110285584 / +4.62% cfi-tot-2: 110531344 / +4.85% Which means that bitsets themselves are insignificant in bloating the binary. It's mostly the inline checks.
,
Aug 4 2016
Here is the list of hot virtual call sites during the benchmark retrieved using the instructions here: https://bugs.chromium.org/p/chromium/issues/detail?id=580389#c27 To put it into the perspective, the numbers correspond to ~6 seconds of runtime. ./out/cfi-stats-0/../../third_party/WebKit/Source/core/paint/BoxClipper.cpp:38 _ZN5blink10BoxClipperC1ERKNS_9LayoutBoxERKNS_9PaintInfoERKNS_11LayoutPointENS_20ContentsClipBehaviorE cfi-vcall 1664136 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/paint/BlockPainter.cpp:55 _ZN5blink12BlockPainter5paintERKNS_9PaintInfoERKNS_11LayoutPointE cfi-vcall 1664073 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/paint/BlockPainter.cpp:181 _ZN5blink12BlockPainter11paintObjectERKNS_9PaintInfoERKNS_11LayoutPointE cfi-vcall 1664040 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/paint/BlockPainter.cpp:227 _ZNK5blink12BlockPainter19intersectsPaintRectERKNS_9PaintInfoERKNS_11LayoutPointE cfi-vcall 2496093 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutObject.cpp:939 _ZNK5blink12LayoutObject15containingBlockEv cfi-vcall 1440331 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:448 _ZNK5blink9LayoutBox12clientHeightEv cfi-vcall 2360160 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:448 _ZNK5blink9LayoutBox12clientHeightEv cfi-vcall 2360160 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:4457 _ZNK5blink9LayoutBox14noOverflowRectEv cfi-vcall 1560209 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:4457 _ZNK5blink9LayoutBox14noOverflowRectEv cfi-vcall 1560209 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:4458 _ZNK5blink9LayoutBox14noOverflowRectEv cfi-vcall 1560209 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:4459 _ZNK5blink9LayoutBox14noOverflowRectEv cfi-vcall 1560209 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:4460 _ZNK5blink9LayoutBox14noOverflowRectEv cfi-vcall 1560209 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.cpp:4472 _ZNK5blink9LayoutBox14noOverflowRectEv cfi-vcall 1560209 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutObject.h:522 _ZNK5blink11LayoutBlock27createsNewFormattingContextEv cfi-vcall 1680124 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutObject.h:512 _ZN5blink22PaintInvalidationState23updateForNormalChildrenEv cfi-vcall 1323501 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/PaintInvalidationState.cpp:336 _ZNK5blink22PaintInvalidationState43computePositionFromPaintInvalidationBackingEv cfi-vcall 1763551 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/page/FrameTree.cpp:115 _ZNK5blink9FrameTree6parentEv cfi-vcall 1443208 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/dom/Node.h:255 _ZN5blink17FlatTreeTraversal14traverseParentERKNS_4NodeEPNS_26LayoutTreeBuilderTraversal13ParentDetailsE cfi-vcall 1441302 ./out/cfi-stats-0/../../third_party/WebKit/Source/platform/heap/TraceTraits.h:112 _ZN5blink4Node9traceImplINS_27InlinedGlobalMarkingVisitorEEEvT_ cfi-vcall 5414898 ./out/cfi-stats-0/../../third_party/WebKit/Source/platform/heap/TraceTraits.h:75 _ZN5blink18AdjustAndMarkTraitINS_20HTMLTableCellElementELb0EE4markIPNS_7VisitorEEEvT_PKS1_ cfi-vcall 4413645 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/events/EventTarget.h:245 _ZN5blink21MajorGCWrapperVisitor21VisitPersistentHandleEPN2v810PersistentINS1_5ValueENS1_27NonCopyablePersistentTraitsIS3_EEEEt cfi-vcall 3682251 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.h:420 _ZNK5blink9LayoutBox13contentHeightEv cfi-vcall 1880096 ./out/cfi-stats-0/../../third_party/WebKit/Source/core/layout/LayoutBox.h:420 _ZNK5blink9LayoutBox13contentHeightEv cfi-vcall 1880096 ./out/cfi-stats-0/../../third_party/WebKit/Source/web/WebLocalFrameImpl.cpp:1592 _ZN5blink17WebLocalFrameImpl9fromFrameERNS_10LocalFrameE cfi-vcall 2164437 ./out/cfi-stats-0/../../third_party/WebKit/Source/platform/heap/TraceTraits.h:206 _ZN5blink10TraceTraitINS_4NodeEE5traceEPNS_7VisitorEPv cfi-vcall 3196481 ./out/cfi-stats-0/../../third_party/skia/src/core/SkSmallAllocator.h:127 _ZN16SkSmallAllocatorILj3ELm3332EE8DestroyTI23SkARGB32_Opaque_BlitterEEvPv cfi-vcall 1401075 ./out/cfi-stats-0/../../third_party/skia/src/core/SkBitmapDevice.cpp:149 _ZN14SkBitmapDevice14onAccessPixelsEP8SkPixmap cfi-vcall 1417873 ./out/cfi-stats-0/../../third_party/skia/src/core/SkPixelRef.cpp:278 _ZN10SkPixelRef19notifyPixelsChangedEv cfi-vcall 1417887 ./out/cfi-stats-0/../../third_party/skia/src/core/SkDevice.cpp:270 _ZN12SkBaseDevice12accessPixelsEP8SkPixmap cfi-vcall 1417873 ./out/cfi-stats-0/../../third_party/skia/src/core/SkCanvas.cpp:1929 _ZN8SkCanvas8drawRectERK6SkRectRK7SkPaint cfi-vcall 3605896 ./out/cfi-stats-0/../../third_party/skia/src/core/SkCanvas.cpp:2158 _ZN8SkCanvas10onDrawRectERK6SkRectRK7SkPaint cfi-vcall 1417816 ./out/cfi-stats-0/../../third_party/skia/src/core/SkCanvas.cpp:2978 _ZN8SkCanvas11drawPictureEPK9SkPicturePK8SkMatrixPK7SkPaint cfi-vcall 1193633 ./out/cfi-stats-0/../../third_party/skia/src/core/SkCanvas.cpp:2982 _ZN8SkCanvas11drawPictureEPK9SkPicturePK8SkMatrixPK7SkPaint cfi-vcall 1193593 ./out/cfi-stats-0/../../third_party/skia/src/core/SkCanvas.cpp:332 _ZN8SkCanvas10onDrawRectERK6SkRectRK7SkPaint cfi-vcall 1417873 ./out/cfi-stats-0/../../v8/src/global-handles.cc:677 _ZN2v88internal13GlobalHandles16IterateWeakRootsEPNS0_13ObjectVisitorE cfi-vcall 3444609 ./out/cfi-stats-0/../../v8/src/global-handles.cc:823 _ZN2v88internal13GlobalHandles19IterateObjectGroupsEPNS0_13ObjectVisitorEPFbPNS0_4HeapEPPNS0_6ObjectEE cfi-vcall 3660256 ./out/cfi-stats-0/../../v8/src/global-handles.cc:1152 _ZN2v88internal13GlobalHandles15IterateAllRootsEPNS0_13ObjectVisitorE cfi-vcall 3455264 ./out/cfi-stats-0/../../v8/src/global-handles.cc:1161 _ZN2v88internal13GlobalHandles27IterateAllRootsWithClassIdsEPNS0_13ObjectVisitorE cfi-vcall 8085953 ./out/cfi-stats-0/../../v8/src/api.cc:8048 _ZN2v814VisitorAdapter22VisitEmbedderReferenceEPPNS_8internal6ObjectEt cfi-vcall 8107932
,
Aug 4 2016
The above stats show the call sites. Can it be extended to also show the call targets?
,
Aug 4 2016
It can be, but not easily. But we can get a pretty solid feel looking at the code: https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/layout/LayoutBox.cpp?q=third_party/WebKit/Source/core/layout/LayoutBox.cpp:4457&sq=package:chromium&l=4457 LayoutUnit left(borderLeft() + (shouldPlaceBlockDirectionScrollbarOnLogicalLeft() ? scrollBarWidth : 0)); LayoutUnit top(borderTop()); LayoutUnit right(borderRight()); LayoutUnit bottom(borderBottom()); Each of these methods are virtual and they look like: virtual int borderTop() const { return style()->borderTopWidth(); } int borderTop() const override { if (style()->isHorizontalWritingMode()) return style()->isFlippedBlocksWritingMode() ? borderAfter() : borderBefore(); return style()->isLeftToRightDirection() ? borderStart() : borderEnd(); } int LayoutTableCell::borderTop() const { return table()->collapseBorders() ? borderHalfTop(false) : LayoutBlockFlow::borderTop(); } virtual int borderBottom() const { return style()->borderBottomWidth(); } virtual int borderLeft() const { return style()->borderLeftWidth(); } virtual int borderRight() const { return style()->borderRightWidth(); } virtual bool shouldPlaceBlockDirectionScrollbarOnLogicalLeft() const { return style()->shouldPlaceBlockDirectionScrollbarOnLogicalLeft(); } bool shouldPlaceBlockDirectionScrollbarOnLogicalLeft() const override { return false; } and so on. Basically, it's not just a few hot virtual call sites, but a whole tree and we see its root. Leaves and branches don't show up in the profile, because there're quite a lot of different options. At the same time, the implementation of these methods is mostly trivial. While these are not virtual constants, there might be some optimizations applicable. Don't know yet.
,
Aug 4 2016
Writing down one possible optimization, which I have not yet thought through, just to avoid forgetting about it.
Consider the following code (two_checks.cc):
#include <stdio.h>
class Base {
public:
virtual void Do() {
printf("Base::Do\n");
}
virtual void Boo() {
printf("Base::Boo\n");
}
};
class A : public Base {
public:
void Do() override {
printf("A::Do\n");
}
void Boo() override {
printf("A::Boo\n");
}
};
__attribute__((noinline)) void printBase(Base* ptr) {
ptr->Do();
ptr->Boo();
}
int main() {
Base base;
A a;
printBase(&base);
printBase(&a);
return 0;
}
Consider printBase: CFI will generate two identical checks for the same vtable:
define internal void @_Z9printBaseP4Base(%class.Base* %ptr) #0 !dbg !6 {
entry:
%ptr.addr = alloca %class.Base*, align 8
store %class.Base* %ptr, %class.Base** %ptr.addr, align 8
call void @llvm.dbg.declare(metadata %class.Base** %ptr.addr, metadata !23, metadata !24), !dbg !25
%0 = load %class.Base*, %class.Base** %ptr.addr, align 8, !dbg !26
%1 = bitcast %class.Base* %0 to void (%class.Base*)***, !dbg !27
%vtable = load void (%class.Base*)**, void (%class.Base*)*** %1, align 8, !dbg !27
%2 = bitcast void (%class.Base*)** %vtable to i8*, !dbg !27, !nosanitize !2
%3 = ptrtoint i8* %2 to i64, !dbg !27
%4 = sub i64 %3, add (i64 ptrtoint ({ [4 x i8*], [0 x i8], [4 x i8*] }* @0 to i64), i64 16), !dbg !27
%5 = lshr i64 %4, 5, !dbg !27
%6 = shl i64 %4, 59, !dbg !27
%7 = or i64 %5, %6, !dbg !27
%8 = icmp ult i64 %7, 2, !dbg !27
br i1 %8, label %cont, label %trap, !dbg !27, !nosanitize !2
trap: ; preds = %entry
call void @llvm.trap() #2, !dbg !28, !nosanitize !2
unreachable, !dbg !28, !nosanitize !2
cont: ; preds = %entry
%vfn = getelementptr inbounds void (%class.Base*)*, void (%class.Base*)** %vtable, i64 0, !dbg !30
%9 = load void (%class.Base*)*, void (%class.Base*)** %vfn, align 8, !dbg !30
call void %9(%class.Base* %0), !dbg !30
%10 = load %class.Base*, %class.Base** %ptr.addr, align 8, !dbg !32
%11 = bitcast %class.Base* %10 to void (%class.Base*)***, !dbg !33
%vtable1 = load void (%class.Base*)**, void (%class.Base*)*** %11, align 8, !dbg !33
%12 = bitcast void (%class.Base*)** %vtable1 to i8*, !dbg !33, !nosanitize !2
%13 = ptrtoint i8* %12 to i64, !dbg !33
%14 = sub i64 %13, add (i64 ptrtoint ({ [4 x i8*], [0 x i8], [4 x i8*] }* @0 to i64), i64 16), !dbg !33
%15 = lshr i64 %14, 5, !dbg !33
%16 = shl i64 %14, 59, !dbg !33
%17 = or i64 %15, %16, !dbg !33
%18 = icmp ult i64 %17, 2, !dbg !33
br i1 %18, label %cont3, label %trap2, !dbg !33, !nosanitize !2
trap2: ; preds = %cont
call void @llvm.trap() #2, !dbg !34, !nosanitize !2
unreachable, !dbg !34, !nosanitize !2
cont3: ; preds = %cont
%vfn4 = getelementptr inbounds void (%class.Base*)*, void (%class.Base*)** %vtable1, i64 1, !dbg !35
%19 = load void (%class.Base*)*, void (%class.Base*)** %vfn4, align 8, !dbg !35
call void %19(%class.Base* %10), !dbg !35
ret void, !dbg !36
}
This is stupid, and we should do better. I am yet to think through, but this is one of the obvious places, where we can significantly reduce impact without sacrificing anything.
,
Aug 4 2016
For the reference, the command line I used to compile the file above: clang++ -fuse-ld=lld -o lala two_checks.cc -std=c++11 -flto -fvisibility=hidden -fsanitize=cfi-vcall -Wl,--lto-O1 -g -Wl,-mllvm,-lowertypetests-bitsets-level=0 -Wl,-save-temps && ./lala two_checks.cc is also attached to the comment.
,
Aug 4 2016
If implemented, this optimization should reduce the CFI penalty in LayoutBox::noOverflowRect https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/layout/LayoutBox.cpp?q=third_party/WebKit/Source/core/layout/LayoutBox.cpp:4457&sq=package:chromium&l=4446 by about 6x. I can't claim that the whole large-table-with-collapsed-borders-and-colspans-wider-than-table.html will become 6x faster, as there're other hot parts too, but it seems like a worthwhile optimization. At first, I am going to fake it, just to measure the impact generating at most 1 CFI check per function. It will provide an upper bound for the speedup, and if the microbenchmark still regressed by 5%, it's not the most important thing to do now.
,
Aug 4 2016
I am going to add an XFAIL test to compiler-rt: https://reviews.llvm.org/D23151 and then think of an optimization. An important thing to consider is a risk of something being corrupted between the check and the call. Since CFI only checks vtable pointer, and we don't guard against stack modifications (that's a job for SafeStack), it should be possible to store vtable pointer in alloca and then reuse.
,
Aug 4 2016
On the other hand, if we do reuse vtable, and the object got destructed between the first and the second call, we'll not catch that. So, the optimization needs to be carefully considered before I implemented it.
,
Aug 4 2016
Another example of a very hot place where we unnecessarily put additional code is https://cs.chromium.org/chromium/src/v8/src/global-handles.cc?sq=package:chromium&rcl=1470309818&l=1158 void GlobalHandles::IterateAllRootsWithClassIds(ObjectVisitor* v) { for (NodeIterator it(this); !it.done(); it.Advance()) { if (it.node()->IsRetainer() && it.node()->has_wrapper_class_id()) { v->VisitEmbedderReference(it.node()->location(), it.node()->wrapper_class_id()); } } } VisitEmbedderReference is called 8 million times during the benchmark, but we can make a check and out of the loop and reduce the impact at least 8x. That will require to remember the pointer to the actual VisitEmbedderReference making it an indirect call instead of a virtual one. That again is based on the assumption that we can trust local variables (read: stack).
,
Aug 4 2016
But we probably already trust stack in CFI, as the return addresses are taken from there.
,
Aug 4 2016
The optimization in #12 will make as rely on secure stack *more* than we do now.
Besides, this is not a CFI-specific optimization, it should or should not happen
regardless of CFI.
One more thing to look at: full devirtualization.
I think in at least one of your examples the bitset contained just 2 ones.
In this case we can replace a vcall with
if (vptr == A::vptr) direct-call A::f
else if (vptr == B::vptr) direct-call B::f
else trap
AFIACT, this common compiler optimization is not yet implemented in LLVM
(not totally sure)
,
Aug 4 2016
GlobalHandles::IterateObjectGroups is even better example as the CFI check lies within a nested loop: https://cs.chromium.org/chromium/src/v8/src/global-handles.cc?sq=package:chromium&rcl=1470309818&l=792 for (int i = 0; i < object_groups_.length(); i++) { ... for (size_t j = 0; j < entry->length; ++j) { ... v->VisitPointer(&object); ...
,
Aug 4 2016
>The optimization in #12 will make as rely on secure stack *more* than we do now. Correct. But since we can't have a CFI check within a loop and not slowdown Blink, it's the least evil we can do. The alternative is to blacklist such hot functions altogether, which is obviously worse. >Besides, this is not a CFI-specific optimization, it should or should not happen regardless of CFI. Yes, but it probably makes less sense to do if CFI is not enabled, as the virtual calls are cheaper. But I agree that if we go with this optimization, it will live outside of the CFI code. >One more thing to look at: full devirtualization. >I think in at least one of your examples the bitset contained just 2 ones. >In this case we can replace a vcall with > > if (vptr == A::vptr) direct-call A::f > else if (vptr == B::vptr) direct-call B::f > else trap This is still too expensive for hot loops like we see above, but it probably makes sense to implement this optimization anyway, as it will speed up other places, where there's no loop.
,
Aug 4 2016
Note that hoisting vptr load (and check) out of the loop may be hindered by aliasing. complete devirtualization can in some cases be combined with hoisting vptr load, and yet again this is not CFI-specific and is pretty hard.
,
Aug 5 2016
I have updated the stats (attached). They now cover everything, demangled, sorted by frequency and have links to the code search. "View" won't work. Download + Open will.
,
Aug 5 2016
Note to myself: check if blink::Node::getNodeType gets devirtualized. It should, but the stats from #18 it does not.
,
Aug 5 2016
Attached is the list of devirtualized methods in Chrome. This is tested with Clang ToT and the list (surprisingly) much larger than it was measured previously: https://bugs.chromium.org/p/chromium/issues/detail?id=580389#c25 The number of devirtualized call sites: 53276 (or 95109, if we don't discard dups inlined in different places) The number of devirtualized methods: 25101 Note: I am not sure about the originally reported number (~400). I am not sure about the currently reported numbers (but more sure, since it generated this nice list). I will double check it tomorrow, now just leaving as a reference. According to the list, getNoteType methods have been devirtualized: devirtualized blink::CDATASection::getNodeType() const devirtualized blink::DocumentType::getNodeType() const devirtualized blink::DocumentFragment::getNodeType() const devirtualized blink::ProcessingInstruction::getNodeType() const devirtualized blink::Attr::getNodeType() const devirtualized blink::Text::getNodeType() const devirtualized blink::Comment::getNodeType() const devirtualized blink::Element::getNodeType() const devirtualized blink::Document::getNodeType() const It seems that sanitizer stats report without taking devirtualization into the account, as the instrumentation happens on the Clang level, and we don't disable it after devirtualizing stuff. It probably means that the profile in #8 has quite a few "dead" entries. Again, I'll verify this tomorrow, and I would ask for someone else to audit my findings.
,
Aug 5 2016
I have confirmed that cfi-vcall sanstats are shown even for devirtualized calls. It's good news, but now I have to figure out how not to show them, so that we can see the real picture.
,
Aug 12 2016
I've made some work on improving the precision for devirtualization stats (all committed to LLVM trunk, but not yet rolled into Chrome). The new numbers: Number of devirtualized call sites: 51491 Number of devirtualized methods: 23149 An HTML with all the devirtualized methods, their call sites and reasons to be devirtualized is attached. I had to omit call sites for methods devirtualized for having a single-impl, because there're too many of them. Also, some of the call sites are estimated, as I had to limit the amount of remarks coming out of the compiler.
,
Aug 12 2016
Note: in this revision I don't see getNodeType methods, but I am currently too tired to investigate. Will continue tomorrow.
,
Aug 12 2016
I have now collected more comprehensive stats which can see how devirtualization affected the dynamic number of virtual calls. top(50) entries with original count (cfi-vcall sanstat) and real number of checks (I added into LowerTypeTests):
Location Orig Real
v8/src/api.cc:8021 8034573 same
v8/src/global-handles.cc:1161 8011676 same
third_party/skia/src/core/SkCanvas.cpp:1931 6933401 same
third_party/WebKit/Source/core/layout/LayoutObject.h:506 5933625 18 partially
third_party/WebKit/Source/platform/heap/TraceTraits.h:112 5418269 same
third_party/WebKit/Source/core/layout/LayoutBox.cpp:451 4720320 same
third_party/WebKit/Source/platform/heap/TraceTraits.h:75 4387552 same
third_party/WebKit/Source/core/layout/LayoutObject.h:516 3840941 0 devirtualized
third_party/WebKit/Source/core/layout/LayoutBox.h:414 3760192 same
third_party/WebKit/Source/core/events/EventTarget.h:250 3650165 same
v8/src/global-handles.cc:823 3627244 same
v8/src/global-handles.cc:1152 3423384 same
v8/src/global-handles.cc:677 3412722 same
third_party/WebKit/Source/core/layout/PaintInvalidationState.cpp:350 3366571 0 devirtualized
third_party/WebKit/Source/platform/heap/TraceTraits.h:206 3228728 same
third_party/WebKit/Source/core/layout/LayoutBox.cpp:4317 3120418 same
third_party/WebKit/Source/core/dom/Node.h:254 2883947 same
third_party/WebKit/Source/core/layout/LayoutObject.h:1218 2880494 1440247 partially
third_party/WebKit/Source/core/paint/BlockPainter.cpp:227 2496093 same
third_party/WebKit/Source/web/WebLocalFrameImpl.cpp:1618 2164454 0 devirtualized
third_party/WebKit/Source/core/layout/LayoutBox.h:343 1760072 same
third_party/WebKit/Source/core/paint/BoxClipper.cpp:38 1664136 same
third_party/WebKit/Source/core/paint/BlockPainter.cpp:55 1664073 same
third_party/WebKit/Source/core/paint/BlockPainter.cpp:181 1664040 0 devirtualized
third_party/WebKit/Source/core/layout/LayoutObject.h:250 1563573 same
third_party/WebKit/Source/core/layout/LayoutBox.cpp:4320 1560209 same
third_party/WebKit/Source/core/layout/LayoutBox.cpp:4332 1560209 same
third_party/WebKit/Source/core/layout/LayoutBox.cpp:4319 1560209 same
third_party/WebKit/Source/core/layout/LayoutBox.cpp:4318 1560209 same
third_party/WebKit/Source/core/page/FrameTree.cpp:115 1443240 same
third_party/WebKit/Source/core/layout/LayoutObject.cpp:923 1440331 0 devirtualized
third_party/skia/src/core/SkPixelRef.cpp:278 1417885 0 devirtualized
third_party/skia/src/core/SkBitmapDevice.cpp:149 1417871 0 devirtualized
third_party/skia/src/core/SkDevice.cpp:281 1417871 same
third_party/skia/src/core/SkCanvas.cpp:334 1417871 0 devirtualized
third_party/skia/src/core/SkCanvas.cpp:2182 1417816 same
third_party/skia/src/core/SkSmallAllocator.h:127 1401146 14 partially
third_party/WebKit/Source/core/layout/LayoutBox.h:413 1360602 same
third_party/WebKit/Source/core/layout/LayoutBox.cpp:446 1360602 same
third_party/WebKit/Source/core/layout/LayoutObject.h:517 1323431 0 devirtualized
third_party/WebKit/Source/platform/graphics/GraphicsLayer.cpp:1110 1281300 640650 partially
third_party/skia/src/core/SkCanvas.cpp:1845 1194167 0 devirtualized
cc/playback/display_item_list.cc:141 1193879 same
cc/playback/drawing_display_item.cc:83 1193879 same
third_party/skia/src/core/SkCanvas.cpp:3033 1193507 same
third_party/skia/src/core/SkBigPicture.cpp:32 1193436 0 devirtualized
third_party/skia/src/core/SkCanvas.cpp:3037 1193278 same
third_party/skia/src/core/SkCanvas.cpp:3044 1193241 same
third_party/skia/src/core/SkCanvas.cpp:3056 1193241 same
third_party/skia/src/core/SkCanvas.cpp:3057 1193241 same
As we can see, a reasonable number of top offenders have been devirtualized completely or partially. I will now proceed with generating the list of methods which we need to disable CFI on and then will file a bug to implement LLVM optimizations which will reduce the CFI overhead on those methods.
,
Aug 19 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/4e341f5368a22ceeab4d5133f03ec6a71729737f commit 4e341f5368a22ceeab4d5133f03ec6a71729737f Author: krasin <krasin@google.com> Date: Fri Aug 19 00:07:44 2016 Add 10 Skia methods to CFI blacklist for perf reasons. While they have not been observed to slow down real-world use cases, some blink_layout microbenchmarks feel better with these methods disabled. In order to be concervative at the launch time, lift the CFI defense for these methods. BUG= 634139 Review-Url: https://codereview.chromium.org/2259143002 Cr-Commit-Position: refs/heads/master@{#412979} [modify] https://crrev.com/4e341f5368a22ceeab4d5133f03ec6a71729737f/tools/cfi/blacklist.txt
,
Aug 19 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/8cf09e2797b11ca54ea9162a29353bfba07742a2 commit 8cf09e2797b11ca54ea9162a29353bfba07742a2 Author: krasin <krasin@google.com> Date: Fri Aug 19 19:21:15 2016 Extend CFI blacklist to include methods with lots of CFI checks. The longer term solution is to implement optimizations in Clang to avoid generating unnecessary checks. BUG= 634139 Review-Url: https://codereview.chromium.org/2251213007 Cr-Commit-Position: refs/heads/master@{#413208} [modify] https://crrev.com/8cf09e2797b11ca54ea9162a29353bfba07742a2/tools/cfi/blacklist.txt
,
Aug 20 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/06739364e2093d1dd489d2a232edac0402354f1d commit 06739364e2093d1dd489d2a232edac0402354f1d Author: krasin <krasin@google.com> Date: Sat Aug 20 09:36:39 2016 Fix patterns in CFI blacklist. The comparison happens on mangled names. Of course, there's no double colon is there. BUG= 634139 TBR=kcc@chromium.org Review-Url: https://codereview.chromium.org/2267543002 Cr-Commit-Position: refs/heads/master@{#413329} [modify] https://crrev.com/06739364e2093d1dd489d2a232edac0402354f1d/tools/cfi/blacklist.txt
,
Aug 21 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/903d551350749e6dbd1ffbadee169c4d97a8c53b commit 903d551350749e6dbd1ffbadee169c4d97a8c53b Author: krasin <krasin@chromium.org> Date: Sun Aug 21 01:35:21 2016 Fix CFI blacklist typos and patterns. BUG= 634139 TBR=kcc@chromium.org Review-Url: https://codereview.chromium.org/2265763002 Cr-Commit-Position: refs/heads/master@{#413359} [modify] https://crrev.com/903d551350749e6dbd1ffbadee169c4d97a8c53b/tools/cfi/blacklist.txt
,
Aug 21 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/d8b951fa7f4f2cf466523d26a4a0254e206e36a0 commit d8b951fa7f4f2cf466523d26a4a0254e206e36a0 Author: krasin <krasin@chromium.org> Date: Sun Aug 21 03:29:10 2016 Add more functions into CFI blacklist. BUG= 634139 TBR=kcc@chromium.org Review-Url: https://codereview.chromium.org/2265773002 Cr-Commit-Position: refs/heads/master@{#413364} [modify] https://crrev.com/d8b951fa7f4f2cf466523d26a4a0254e206e36a0/tools/cfi/blacklist.txt
,
Aug 21 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/fcf451ffc147375610a1647ae8726f13b7e8d26a commit fcf451ffc147375610a1647ae8726f13b7e8d26a Author: krasin <krasin@chromium.org> Date: Sun Aug 21 06:26:33 2016 CFI blacklist: a few new entries; a bit more sorted. BUG= 634139 TBR=kcc@chromium.org Review-Url: https://codereview.chromium.org/2265783002 Cr-Commit-Position: refs/heads/master@{#413370} [modify] https://crrev.com/fcf451ffc147375610a1647ae8726f13b7e8d26a/tools/cfi/blacklist.txt
,
Aug 27 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/825ce48757eb1ec0428fe573262614635d25cb7f commit 825ce48757eb1ec0428fe573262614635d25cb7f Author: krasin <krasin@google.com> Date: Sat Aug 27 11:01:11 2016 Disable CFI on a few src/cc methods for perf reasons. While we have not observed CFI to slowdown real-world use cases, there are a few blink_perf microbenchmarks, which are somewhat affected by the change. This change removes CFI protection from the methods which contribute the most to the slowdown. Eventually, when a proposed optimization in Clang is implemented (https://crbug.com/638056), these attributes would be possible to remove from all/most of the methods. BUG=638056, 634139 CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_precise_blink_rel Review-Url: https://codereview.chromium.org/2257323002 Cr-Commit-Position: refs/heads/master@{#414916} [modify] https://crrev.com/825ce48757eb1ec0428fe573262614635d25cb7f/base/compiler_specific.h [modify] https://crrev.com/825ce48757eb1ec0428fe573262614635d25cb7f/cc/base/contiguous_container.h [modify] https://crrev.com/825ce48757eb1ec0428fe573262614635d25cb7f/cc/blink/web_layer_impl.cc [modify] https://crrev.com/825ce48757eb1ec0428fe573262614635d25cb7f/cc/playback/display_item_list.cc [modify] https://crrev.com/825ce48757eb1ec0428fe573262614635d25cb7f/cc/playback/drawing_display_item.cc [modify] https://crrev.com/825ce48757eb1ec0428fe573262614635d25cb7f/cc/trees/layer_tree_host.cc
,
Sep 6 2016
> The attached results (quite noisy) demonstrate that the slowdown is not due to a memory load for complex bitsets, but due to the most basic range check in CFI. Your result doesn't really explain why vtable interleaving (which avoids bitsets) was able to report a lower performance overhead than us. (I also continue to be skeptical of any result obtained by tweaking CPU powersave/governor settings.) Have you looked at any of the other benchmarks besides large-table-with-collapsed-borders-and-colspans-wider-than-table.html ? I would also be curious about the delta in .rodata between 1 and 2. Last I measured the bitset size was on the order of 1MB.
,
Sep 6 2016
>Have you looked at any of the other benchmarks besides large-table-with-collapsed-borders-and-colspans-wider-than-table.html ? Yes, and there was some slowdown due to the memory load. Unfortunately, almost no paper trails left. The best I can refer to now is https://storage.googleapis.com/cfi-stats/2016-08-15/results.html which missed cfi-2 column. Note that the launched variant includes all bitsets. In the end, blacklisting the places where we do CFI checks in a loop proved to be more important.
,
Sep 7 2016
|
||
►
Sign in to add a comment |
||
Comment 1 by krasin@chromium.org
, Aug 3 2016