
Issue 672158


Issue metadata

Status: Assigned
Owner:
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




ThinLTO uses 3x more user time to link cc_unittests

Reported by krasin@chromium.org (Project Member), Dec 7 2016

Issue description

I've measured the performance of lld vs. gold and ThinLTO vs. FullLTO. I used 8 jobs in the ThinLTO case, and my benchmark was linking cc_unittests. LLVM r288833.

The numbers are bad:

ThinLTO + LLD: real 96s, user 698s.
ThinLTO + Gold: real 96s, user 692s.
FullLTO + LLD: 219s
FullLTO + Gold: 208s

As we can see, LLD is slightly slower than Gold (bad, needs to be fixed, but not the end of the world). We also see that ThinLTO uses as much as 3.5x more cycles to link the binary. And that *is* bad.

GN flags for ThinLTO + LLD:

gn gen out/thin-lto-lld-tot '--args=is_chrome_branded=true is_official_build=true is_clang=true is_component_build=false is_debug=false allow_posix_link_time_opt=true is_cfi=false use_thin_lto=true symbol_level=1 clang_use_chrome_plugins=false use_lld=true clang_base_path="/usr/local/google/home/krasin/src/llvm.org/release-build"' --check


GN flags for FullLTO + LLD:

gn gen out/lto-lld-tot '--args=is_chrome_branded=true is_official_build=true is_clang=true is_component_build=false is_debug=false allow_posix_link_time_opt=true is_cfi=false symbol_level=1 clang_use_chrome_plugins=false use_lld=true clang_base_path="/usr/local/google/home/krasin/src/llvm.org/release-build"' --check

Perf profile for lld + thinlto:

+   2.79%  ld.lld  libc-2.19.so         [.] _int_malloc
+   2.14%  ld.lld  lld                  [.] llvm::PMDataManager::findAnalysisPass(void const*, bool)
+   1.94%  ld.lld  libc-2.19.so         [.] _int_free
+   1.77%  ld.lld  libc-2.19.so         [.] malloc
+   1.40%  ld.lld  lld                  [.] computeKnownBits(llvm::Value const*, llvm::APInt&, llvm::APInt&, unsigned int, (anonymous namespace)::Query const&)
+   1.35%  ld.lld  [kernel.kallsyms]    [k] 0xffffffff8105144a
+   0.98%  ld.lld  lld                  [.] llvm::Use::getImpliedUser() const
+   0.93%  ld.lld  lld                  [.] llvm::ValueHandleBase::AddToUseList()
+   0.92%  ld.lld  libc-2.19.so         [.] __memcmp_sse4_1
+   0.86%  ld.lld  lld                  [.] llvm::Value::getValueName() const
+   0.85%  ld.lld  libc-2.19.so         [.] malloc_consolidate
+   0.78%  ld.lld  lld                  [.] llvm::SmallPtrSetImplBase::FindBucketFor(void const*) const
+   0.65%  ld.lld  lld                  [.] llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults&, llvm::CodeGenOpt::Level)
+   0.65%  ld.lld  lld                  [.] llvm::AttributeSet::getAttributes(unsigned int) const
+   0.64%  ld.lld  lld                  [.] llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int)
+   0.58%  ld.lld  lld                  [.] llvm::BitstreamCursor::readRecord(unsigned int, llvm::SmallVectorImpl<unsigned long>&, llvm::StringRef*)
+   0.54%  ld.lld  lld                  [.] llvm::BasicBlock::getTerminator()
+   0.44%  ld.lld  lld                  [.] llvm::Value::stripPointerCasts()
+   0.42%  ld.lld  lld                  [.] llvm::PMDataManager::removeNotPreservedAnalysis(llvm::Pass*)
+   0.42%  ld.lld  lld                  [.] llvm::PointerMayBeCaptured(llvm::Value const*, llvm::CaptureTracker*)
+   0.42%  ld.lld  lld                  [.] llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const
+   0.41%  ld.lld  lld                  [.] llvm::ValueHandleBase::RemoveFromUseList()
+   0.39%  ld.lld  lld                  [.] llvm::sys::MemoryFence()
+   0.38%  ld.lld  lld                  [.] llvm::PMTopLevelManager::findAnalysisPass(void const*)
+   0.37%  ld.lld  lld                  [.] llvm::SimplifyInstruction(llvm::Instruction*, llvm::DataLayout const&, llvm::TargetLibraryInfo const*, llvm::DominatorTree const*, llvm::AssumptionCache*)
+   0.37%  ld.lld  lld                  [.] llvm::AttributeSet::hasAttribute(unsigned int, llvm::Attribute::AttrKind) const
+   0.36%  ld.lld  libc-2.19.so         [.] free
+   0.34%  ld.lld  lld                  [.] llvm::DenseMapBase<llvm::DenseMap<llvm::BasicBlock*, llvm::DominatorTreeBase<llvm::BasicBlock>::InfoRec, llvm::DenseMapInfo<llvm::BasicBlock*>, llvm::detail::DenseMap
+   0.34%  ld.lld  lld                  [.] llvm::FoldingSetNodeID::AddInteger(unsigned int)
+   0.33%  ld.lld  lld                  [.] llvm::Value::assertModuleIsMaterialized() const
+   0.33%  ld.lld  lld                  [.] llvm::TargetLibraryInfoImpl::getLibFunc(llvm::StringRef, llvm::LibFunc::Func&) const
+   0.33%  ld.lld  lld                  [.] llvm::Value::getContext() const
+   0.31%  ld.lld  lld                  [.] llvm::Type::getScalarType() const
+   0.31%  ld.lld  lld                  [.] bool llvm::DenseMapBase<llvm::DenseMap<llvm::Instruction*, unsigned int, llvm::DenseMapInfo<llvm::Instruction*>, llvm::detail::DenseMapPair<llvm::Instruction*, unsign
+   0.31%  ld.lld  lld                  [.] llvm::BranchInst::getSuccessorV(unsigned int) const
+   0.31%  ld.lld  lld                  [.] llvm::GetUnderlyingObject(llvm::Value*, llvm::DataLayout const&, unsigned int)
+   0.31%  ld.lld  lld                  [.] llvm::InstCombiner::run()
+   0.30%  ld.lld  lld                  [.] llvm::StringMapImpl::FindKey(llvm::StringRef) const
+   0.30%  ld.lld  lld                  [.] AddReachableCodeToWorklist(llvm::BasicBlock*, llvm::DataLayout const&, llvm::SmallPtrSetImpl<llvm::BasicBlock*>&, llvm::InstCombineWorklist&, llvm::TargetLibraryInfo
+   0.30%  ld.lld  lld                  [.] llvm::ReplaceableMetadataImpl::getOrCreate(llvm::Metadata&)
+   0.29%  ld.lld  lld                  [.] llvm::ConstantFoldInstruction(llvm::Instruction*, llvm::DataLayout const&, llvm::TargetLibraryInfo const*)
+   0.29%  ld.lld  lld                  [.] (anonymous namespace)::DAGCombiner::AddToWorklist(llvm::SDNode*)
+   0.29%  ld.lld  lld                  [.] llvm::APInt::operator&(llvm::APInt const&) const
+   0.29%  ld.lld  lld                  [.] llvm::Use::getUser() const
+   0.28%  ld.lld  lld                  [.] std::enable_if<llvm::hashing::detail::is_hashable_data<unsigned int const>::value, llvm::hash_code>::type llvm::hashing::detail::hash_combine_range_impl<unsigned int
+   0.27%  ld.lld  libpthread-2.19.so   [.] pthread_rwlock_rdlock
+   0.27%  ld.lld  lld                  [.] llvm::DataLayout::getAlignment(llvm::Type*, bool) const
+   0.26%  ld.lld  lld                  [.] llvm::DAGTypeLegalizer::run()
+   0.26%  ld.lld  lld                  [.] llvm::StringMapImpl::LookupBucketFor(llvm::StringRef)
+   0.26%  ld.lld  lld                  [.] llvm::Value::getName() const
+   0.26%  ld.lld  lld                  [.] llvm::ConstantRange::ConstantRange(unsigned int, bool)
+   0.26%  ld.lld  lld                  [.] llvm::MetadataTracking::track(void*, llvm::Metadata&, llvm::PointerUnion<llvm::MetadataAsValue*, llvm::Metadata*>)
+   0.25%  ld.lld  lld                  [.] (anonymous namespace)::SelectionDAGLegalize::LegalizeOp(llvm::SDNode*)
+   0.25%  ld.lld  libstdc++.so.6.0.19  [.] operator new(unsigned long)
+   0.24%  ld.lld  lld                  [.] llvm::ValueHandleBase::AddToExistingUseList(llvm::ValueHandleBase**)
+   0.24%  ld.lld  lld                  [.] llvm::InstCombiner::SimplifyDemandedUseBits(llvm::Value*, llvm::APInt, llvm::APInt&, llvm::APInt&, unsigned int, llvm::Instruction*)

 

Comment 1 by p...@chromium.org, Dec 7 2016

What did the numbers look like before the LLVM change that broke the chromium build?
https://docs.google.com/spreadsheets/d/18vi9p8ffIYNVPTyxtJwr-YrP4WJRbaQr_2nZ1AKKBs4/edit#gid=0

on 8 jobs, it was 617.55. So, it was already worse than FullLTO, but became even slower.
Perf profile for lld + FullLTO:

+   7.24%  ld.lld  libc-2.19.so         [.] _int_malloc
+   4.14%  ld.lld  [kernel.kallsyms]    [k] 0xffffffff8105144a
+   1.61%  ld.lld  libc-2.19.so         [.] malloc_consolidate
+   1.40%  ld.lld  libc-2.19.so         [.] _int_free
+   1.35%  ld.lld  lld                  [.] llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults&, llvm::CodeGenOpt::Level)
+   1.33%  ld.lld  lld                  [.] llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int)
+   1.14%  ld.lld  lld                  [.] llvm::PMDataManager::findAnalysisPass(void const*, bool)
+   1.04%  ld.lld  lld                  [.] llvm::SmallPtrSetImplBase::FindBucketFor(void const*) const
+   0.75%  ld.lld  lld                  [.] llvm::BitstreamCursor::readRecord(unsigned int, llvm::SmallVectorImpl<unsigned long>&, llvm::StringRef*)
+   0.71%  ld.lld  libc-2.19.so         [.] memset
+   0.70%  ld.lld  libc-2.19.so         [.] malloc
+   0.62%  ld.lld  lld                  [.] llvm::MCExpr::evaluateAsRelocatableImpl(llvm::MCValue&, llvm::MCAssembler const*, llvm::MCAsmLayout const*, llvm::MCFixup const*, llvm::DenseMap<llvm::MCSection const*
+   0.62%  ld.lld  lld                  [.] llvm::sys::MemoryFence()
+   0.60%  ld.lld  lld                  [.] llvm::Use::getImpliedUser() const
+   0.58%  ld.lld  lld                  [.] (anonymous namespace)::DAGCombiner::AddToWorklist(llvm::SDNode*)
+   0.57%  ld.lld  lld                  [.] llvm::DAGTypeLegalizer::run()
+   0.56%  ld.lld  libc-2.19.so         [.] free
+   0.54%  ld.lld  lld                  [.] (anonymous namespace)::SelectionDAGLegalize::LegalizeOp(llvm::SDNode*)
+   0.51%  ld.lld  lld                  [.] llvm::ScheduleDAGSDNodes::BuildSchedUnits()
+   0.49%  ld.lld  lld                  [.] llvm::FoldingSetNodeID::AddInteger(unsigned int)
+   0.43%  ld.lld  lld                  [.] llvm::DILocation::getImpl(llvm::LLVMContext&, unsigned int, unsigned int, llvm::Metadata*, llvm::Metadata*, llvm::Metadata::StorageType, bool)
+   0.43%  ld.lld  libc-2.19.so         [.] __memcmp_sse4_1
+   0.43%  ld.lld  lld                  [.] llvm::StringMapImpl::LookupBucketFor(llvm::StringRef)
+   0.43%  ld.lld  lld                  [.] bool llvm::DenseMapBase<llvm::DenseMap<llvm::DILocation*, llvm::detail::DenseSetEmpty, llvm::MDNodeInfo<llvm::DILocation>, llvm::detail::DenseSetPair<llvm::DILocation*
+   0.42%  ld.lld  lld                  [.] llvm::calculateDbgValueHistory(llvm::MachineFunction const*, llvm::TargetRegisterInfo const*, llvm::DbgValueHistoryMap&)
+   0.39%  ld.lld  lld                  [.] std::enable_if<llvm::hashing::detail::is_hashable_data<unsigned int const>::value, llvm::hash_code>::type llvm::hashing::detail::hash_combine_range_impl<unsigned int c
+   0.39%  ld.lld  lld                  [.] llvm::PMDataManager::removeNotPreservedAnalysis(llvm::Pass*)
+   0.38%  ld.lld  lld                  [.] llvm::FoldingSetImpl::FindNodeOrInsertPos(llvm::FoldingSetNodeID const&, void*&)
+   0.38%  ld.lld  lld                  [.] llvm::ScheduleDAGSDNodes::AddSchedEdges()
+   0.37%  ld.lld  lld                  [.] llvm::TargetLoweringBase::getTypeConversion(llvm::LLVMContext&, llvm::EVT) const
+   0.34%  ld.lld  lld                  [.] (anonymous namespace)::DeadMachineInstructionElim::runOnMachineFunction(llvm::MachineFunction&)
+   0.34%  ld.lld  lld                  [.] llvm::SelectionDAG::AssignTopologicalOrder()
+   0.32%  ld.lld  lld                  [.] llvm::BasicBlock::getTerminator()
+   0.32%  ld.lld  lld                  [.] llvm::Attribute::hasAttribute(llvm::StringRef) const
+   0.32%  ld.lld  lld                  [.] llvm::PMDataManager::initializeAnalysisImpl(llvm::Pass*)
+   0.31%  ld.lld  libc-2.19.so         [.] __memcpy_sse2_unaligned
+   0.31%  ld.lld  lld                  [.] llvm::LiveVariables::HandlePhysRegKill(unsigned int, llvm::MachineInstr*)
+   0.31%  ld.lld  lld                  [.] llvm::MachineInstr::addOperand(llvm::MachineFunction&, llvm::MachineOperand const&)
+   0.31%  ld.lld  lld                  [.] llvm::MCAsmLayout::ensureValid(llvm::MCFragment const*) const
+   0.31%  ld.lld  lld                  [.] (anonymous namespace)::DAGCombiner::combine(llvm::SDNode*)
+   0.30%  ld.lld  lld                  [.] llvm::Attribute::isStringAttribute() const
+   0.30%  ld.lld  lld                  [.] llvm::X86InstrInfo::AnalyzeBranchImpl(llvm::MachineBasicBlock&, llvm::MachineBasicBlock*&, llvm::MachineBasicBlock*&, llvm::SmallVectorImpl<llvm::MachineOperand>&, llv
+   0.30%  ld.lld  lld                  [.] (anonymous namespace)::Verifier::visitFunction(llvm::Function const&)
+   0.30%  ld.lld  lld                  [.] (anonymous namespace)::Mapper::mapSimpleMetadata(llvm::Metadata const*)
+   0.28%  ld.lld  lld                  [.] llvm::BasicBlock::dropAllReferences()
+   0.28%  ld.lld  lld                  [.] llvm::SmallPtrSetImplBase::insert_imp_big(void const*)
+   0.28%  ld.lld  lld                  [.] llvm::SelectionDAG::Legalize()
+   0.27%  ld.lld  lld                  [.] llvm::MCAssembler::layoutSectionOnce(llvm::MCAsmLayout&, llvm::MCSection&)
+   0.27%  ld.lld  lld                  [.] llvm::TargetRegisterInfo::getMinimalPhysRegClass(unsigned int, llvm::MVT) const
+   0.27%  ld.lld  lld                  [.] llvm::ReplaceableMetadataImpl::getOrCreate(llvm::Metadata&)
+   0.27%  ld.lld  lld                  [.] llvm::SelectionDAG::createOperands(llvm::SDNode*, llvm::ArrayRef<llvm::SDValue>)
+   0.27%  ld.lld  lld                  [.] llvm::Value::assertModuleIsMaterialized() const
+   0.25%  ld.lld  lld                  [.] llvm::GlobalDCEPass::GlobalIsNeeded(llvm::GlobalValue*)
+   0.25%  ld.lld  lld                  [.] llvm::PMTopLevelManager::findAnalysisPass(void const*)
+   0.25%  ld.lld  lld                  [.] llvm::AttributeImpl::hasAttribute(llvm::StringRef) const
+   0.25%  ld.lld  libstdc++.so.6.0.19  [.] operator new(unsigned long)
+   0.24%  ld.lld  lld                  [.] llvm::ValueHandleBase::AddToUseList()
+   0.24%  ld.lld  lld                  [.] llvm::AttributeSet::getAttributes(unsigned int) const
+   0.24%  ld.lld  lld                  [.] llvm::MachineFunctionPass::runOnFunction(llvm::Function&)

Comment 4 by p...@chromium.org, Dec 7 2016

Fundamentally, ThinLTO is going to use more cycles overall because it does more work than regular LTO.

I think that for things like chrome where we have a large number of test binaries that share code we will want to find some way to share work between the test binaries, perhaps by using a cache.
That would be a nice feature.

Still, I am sure we can make ThinLTO faster even without changing fundamentals (like introducing the cache). In particular, it's unclear what made ThinLTO performance regress by ~15% over the last 2 months or so.

I plan to spend some time with the profiler and try to find shallow optimizations. At its current speed, ThinLTO is barely useful for Chrome. Yesterday, I tested it with "ninja -C <...> All" and it was less than 2x faster than FullLTO.
Per Peter's suggestion, I have run a few experiments with the ThinLTO cache.

- every Chrome sync adds ~300 MB to the cache, and it grows uncontrollably. That's just for cc_unittests; I expect it to be around 2GB for All.
- a hot relink (link, rm, link) is very fast: real 7s, user 7s. But that also assumes the ThinLTO cache is in the system disk cache. Still, very fast.
- a minor sync (a few commits) invalidates ~35% of the cache's value.

I plan to check how well it works for building All, where we actually have many shared modules. The numbers I'll get this way will be optimistic, as we don't currently have the CFI and WholeProgramDevirt passes.
Haven't looked at this in detail yet, but a couple of quick questions. For the ninja -C All test you did where ThinLTO was about 2X faster, was that including the unit tests, or specifically what beyond the chromium binary link? I would expect it to have an even bigger advantage. Was this with debug info? There is a known issue there where we are importing too much and it slows the link down quite a bit.

Yeah, we need to implement cache pruning (gold and lld), as is done for ld64.

Comment 8 Deleted

>Haven't looked at this in detail yet, but a couple of quick questions. For the ninja -C All test you did where ThinLTO was about 2X faster, was that including the unit tests, or specifically what beyond the chromium binary link?
I am sorry if I was unclear. Building all targets with ThinLTO was not faster at all (maybe even slower).

A single target (cc_unittests) was about 2x faster in user time, but used 2.5x more cycles.

I am now testing ThinLTO + All + the ThinLTO cache in the hope that this will be faster.

> I would expect it to have an even bigger advantage. Was this with debug info? There is a known issue there where we are importing too much and it slows the link down quite a bit.
There's some debug info, but not very much.

>Yeah, we need to implement cache pruning (gold and lld), as is done for ld64.
Probably.
So, ThinLTO + All + ThinLTO cache failed with

../../third_party/binutils/Linux_x64/Release/bin/ld.gold: error: /usr/local/google/home/krasin/chr32/thinlto-cache-dir/7057D28CFAD05722CE58CC783733362725B3FF78: requires dynamic R_X86_64_PC32 reloc against '_ZNSs7reserveEm' which may overflow at runtime; recompile with -fPIC
../../third_party/binutils/Linux_x64/Release/bin/ld.gold: error: /usr/local/google/home/krasin/chr32/thinlto-cache-dir/59C97A616090CA240D95A659674C3158D77FEC46: requires dynamic R_X86_64_PC32 reloc against 'memmove' which may overflow at runtime; recompile with -fPIC

I will clean the cache and try again.

Comment 11 by p...@chromium.org, Dec 7 2016

Interesting. Here's what I suspect the problem is: some Chrome binaries are built with -pie (i.e. chrome itself), some are built with -shared, and some are built with neither (i.e. the test binaries). The gold plugin passes the value of the -pie/-shared flags in the LTO configuration, which lib/LTO uses to determine whether the intermediate object files should be built in PIC mode, but we don't currently include that information in the hash. Presumably one of the non-PIC links added an object file to the cache which was reused by one of the PIC links.

To solve this problem I think we will need to add most of the LTO configuration to the hash.
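For illustration, here is a minimal sketch of that idea (hypothetical names, not Peter's actual patch and not the LLVM API): fold the code-generation-relevant parts of the LTO configuration, such as the relocation model implied by -pie/-shared, into the cache key alongside the module hash, so that a non-PIC cache entry can never be reused by a PIC link.

#include <cstddef>
#include <functional>
#include <string>

// Illustrative subset of the LTO configuration that affects code generation.
struct LTOConfigSubset {
  int relocModel;        // e.g. non-PIC vs. PIC, derived from -pie/-shared
  int codeModel;
  unsigned optLevel;
  std::string cpu;
  std::string targetTriple;
};

std::string computeCacheKey(const std::string &moduleSummaryHash,
                            const LTOConfigSubset &conf) {
  std::size_t h = std::hash<std::string>{}(moduleSummaryHash);
  auto mix = [&h](std::size_t v) {
    // boost::hash_combine-style mixing of one more value into the key.
    h ^= v + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
  };
  mix(std::hash<int>{}(conf.relocModel));
  mix(std::hash<int>{}(conf.codeModel));
  mix(std::hash<unsigned>{}(conf.optLevel));
  mix(std::hash<std::string>{}(conf.cpu));
  mix(std::hash<std::string>{}(conf.targetTriple));
  return moduleSummaryHash + "-" + std::to_string(h);
}

With this shape of key, the same module hashed under a PIC and a non-PIC configuration produces two distinct cache entries instead of colliding.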
Hah! After cleaning up the cache and trying again, I get the same issue (it complains about slightly different symbols).

Looks like the hash does not include the presence of some important flag (like -fPIC). I will try to make a standalone reproducer and file an LLVM issue about that.

>To solve this problem I think we will need to add most of the LTO configuration to the hash.
Yes, that's what I suspected in #12 (sent at the same time as your reply, #11)

Comment 14 by p...@chromium.org, Dec 7 2016

Does the attached patch fix the problem?
0001-LTO-Hash-the-parts-of-the-LTO-configuration-that-aff.patch
2.7 KB Download
Thank you. The patch indeed helped:

$ time -p ninja -C out/thin-lto-gold-tot/ All
real 4210.45
user 145285.95
sys 4875.74

The size of cache directory is 13GB (I started with an empty one)

And I like the numbers: 1h10m is wonderful.

Comment 16 by p...@chromium.org, Dec 8 2016

Great! I'll go ahead and send the patch for review then.
Thank you! Once the patch is in, I will enable it on the https://build.chromium.org/p/chromium.fyi/builders/ThinLTO Linux ToT bot and store the cache in /tmp, so that the cache is gone after each reboot. And the bot reboots on every build. That assumes the bot has enough space in /tmp.

In the longer term, we'll need a retention policy for the cache. Possibly, we could reuse the heuristics from ccache (https://ccache.samba.org/), which does a very similar thing.

Comment 19 by p...@chromium.org, Dec 8 2016

http://llvm-cs.pcc.me.uk/include/llvm/LTO/legacy/ThinLTOCodeGenerator.h#91

The legacy ThinLTO implementation has a few more caching knobs, but it isn't clear that we'll want exactly this.
On Wed, Dec 7, 2016 at 2:08 PM, kra… via monorail <monorail+v2.1114011977@chromium.org> wrote:

Ok, this doesn't really surprise me - it is doing more work but is faster due to the parallelization of the backends. For smaller unittest targets the advantage is also likely not as noticeable as for a large target like the Chromium binary itself.

Yes, with caching it should be much faster than full LTO. I think that's what your later email indicates, but I'm having trouble finding all the numbers to compare in the email thread.
We'll get exact numbers for caching once I have enabled it on the bot.

As for more work being done with ThinLTO: is it really 2.5x more work? My weak mental model of ThinLTO does not suggest a source of such a big overhead.
That's a good point - that's much more than I would expect. When we measured clang Full LTO vs ThinLTO with 1 thread, ThinLTO had a 25% overhead: half due to the additional redundant work from importing, and half because ThinLTO has a more aggressive optimization pipeline.

So yes, 2.5x is surprising. Comparing the two profiles you sent, nothing huge stands out. computeKnownBits doesn't show up in the FullLTO profile but is near the top of the ThinLTO profile, and even there it's only 1.4%. Overall, the profiles look pretty flat.
Yes, so far it's a mystery. I will try to get the perf numbers from a different perspective, like per pass / per object, but I am still thinking about what exactly I should measure.

Anyway, I'll update this bug with some more measurements today. If you have any ideas about what's worth looking at, please post them here.
On Thu, Dec 8, 2016 at 9:49 AM, kra… via monorail <monorail+v2.1114011977@chromium.org> wrote:

Were you able to identify anything? I thought of this bug when reading this thread:
http://lists.llvm.org/pipermail/llvm-dev/2016-December/108342.html
Note the compile-time regression, in particular related to computeKnownBits, which shows up in the profile here.

Teresa
Hi Teresa,

I didn't touch it last week, so I don't have new information to share. Most likely, I will be returning to this after vacation (so, mid-January).

krasin
Ok, sounds good. I wanted to record a link to that thread here since it seems relevant.

Teresa
That makes sense. Thanks, Teresa!

Comment 28 by p...@chromium.org, Mar 15 2017

Caching support was added to lld in r296702 via the --thinlto-cache-dir flag, and all known cache-related bugs observed when self-hosting LLVM have been fixed (r296907 and r297853), so it may be worth trying to enable the cache on the bot now.
Previously, the issue was that with no retention policy, the size of the cache directory grows indefinitely (and fast). While it makes sense to enable the cache on the bot (after it goes green), it's not possible to deploy it to bots which don't reboot after each build.

Comment 30 by p...@chromium.org, Mar 15 2017

Perhaps we can add a build step that clears the cache directory? That probably makes sense anyway for the ToT bots for which I'd expect the entire cache to be invalidated for every run (because the compiler version has changed).
ToT bots reboot on every build, so /tmp will be okay.
I also don't want to add a build step, because we will need to add it to every single bot, and there are too many.

Also, requiring the cache directory to be cleared often will create issues with building Chrome with ThinLTO on developer workstations. The primary victims will be us, but other Chrome developers will be affected as well, assuming ThinLTO makes it to the official build.

I would really like to have a sane retention policy implemented. ccache did this, and it created zero issues for one of my previous teams when we used it.

Comment 32 by p...@chromium.org, Mar 15 2017

It seems that the LRU policy used on Mac has been abstracted out (http://llvm-cs.pcc.me.uk/lib/Support/CachePruning.cpp), so it shouldn't be too hard to reuse it in the new LTO API.

Do you know what policy ccache uses? Is it LRU based?
// When "max files" or "max cache size" is reached, one of the 16 cache                                                                                                                                    
// subdirectories is cleaned up. When doing so, files are deleted (in LRU                                                                                                                                  
// order) until the levels are below limit_multiple.
So, yes, LRU + a heuristic to decrease the freezing time.
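As a rough illustration of such a policy (a simplified sketch assuming C++17 <filesystem>; not the actual ccache or llvm::pruneCache code), pruning in LRU order until the cache is back under a size limit could look like this:

#include <algorithm>
#include <cstdint>
#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

// Delete the least-recently-written cache files until the directory is back
// under maxBytes. Uses mtime as a cheap stand-in for "last used".
void pruneLRU(const fs::path &cacheDir, std::uintmax_t maxBytes) {
  struct Entry {
    fs::path path;
    fs::file_time_type mtime;
    std::uintmax_t size;
  };
  std::vector<Entry> entries;
  std::uintmax_t total = 0;
  for (const auto &de : fs::directory_iterator(cacheDir)) {
    if (!de.is_regular_file())
      continue;
    entries.push_back({de.path(), de.last_write_time(), de.file_size()});
    total += entries.back().size;
  }
  // Oldest first, so the least recently used entries are deleted first.
  std::sort(entries.begin(), entries.end(),
            [](const Entry &a, const Entry &b) { return a.mtime < b.mtime; });
  for (const auto &e : entries) {
    if (total <= maxBytes)
      break;
    fs::remove(e.path);
    total -= e.size;
  }
}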

Comment 35 by p...@chromium.org, Mar 15 2017

Okay, it sounds like CachePruning is pretty much what we need then. I'll get started on that.
Thank you!
Today, I tried to build content_browsertests with the ThinLTO cache enabled and located on /tmp. It quickly saturated my /tmp and failed with an I/O error. That leaves the out/ directory as the only suitable location for the cache.
> Today, I tried to build content_browsertests with the ThinLTO cache enabled and located on /tmp. It quickly saturated my /tmp and failed with an I/O error. That leaves the out/ directory as the only suitable location for the cache.

/tmp on our workstations tends to be very small. For many things I need to set my TMPDIR to another directory.
In my experience, the only thing that tends to not fit into my /tmp is ThinLTO. Even though other things exist which might require a larger tmp dir, I expect this to be a major frustration point for Chromium devs if the cache is placed into $TMPDIR. We should avoid doing that.
Peter,

apparently, the ThinLTO cache can get corrupted:

[553/776] SOLINK ./libppapi_tests.so
FAILED: libppapi_tests.so libppapi_tests.so.TOC 
python "/usr/local/google/home/krasin/chr33/src/build/toolchain/gcc_solink_wrapper.py" --readelf="readelf" --nm="nm" --sofile="./libppapi_tests.so" --tocfile="./libppapi_tests.so.TOC" --output="./libppapi_tests.so"  -- ../../../../src/llvm.org/release-build/bin/clang++ -shared -Wl,--fatal-warnings -fPIC -Wl,-z,noexecstack -Wl,-z,now -Wl,-z,relro -Wl,-z,defs -Wl,--no-as-needed -lpthread -Wl,--as-needed -fuse-ld=lld -Wl,--icf=all -flto=thin -Wl,--thinlto-jobs=8 -Wl,--thinlto-cache-dir=/usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/thin-lto-cache -Wl,-plugin-opt,-function-sections -m64 -pthread -Werror -Wl,-O1 -Wl,--gc-sections --sysroot=../../build/linux/debian_wheezy_amd64-sysroot -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/lib/x86_64-linux-gnu -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/lib/x86_64-linux-gnu -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/x86_64-linux-gnu -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/x86_64-linux-gnu -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6 -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6 -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib -Wl,--export-dynamic -o "./libppapi_tests.so" -Wl,-soname="libppapi_tests.so" @"./libppapi_tests.so.rsp"
/usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/../../../../src/llvm.org/release-build/bin/ld.lld: error: /usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/thin-lto-cache/251E3ED652E3886EDAC9A851FEDCA559A9A91A15: invalid data encoding
clang-5.0: error: linker command failed with exit code 1 (use -v to see invocation)
[554/776] SOLINK_MODULE ./libblink_deprecated_test_plugin.so
FAILED: libblink_deprecated_test_plugin.so 
../../../../src/llvm.org/release-build/bin/clang++ -shared -Wl,--fatal-warnings -fPIC -Wl,-z,noexecstack -Wl,-z,now -Wl,-z,relro -Wl,-z,defs -Wl,--no-as-needed -lpthread -Wl,--as-needed -fuse-ld=lld -Wl,--icf=all -flto=thin -Wl,--thinlto-jobs=8 -Wl,--thinlto-cache-dir=/usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/thin-lto-cache -Wl,-plugin-opt,-function-sections -m64 -pthread -Werror -Wl,-O1 -Wl,--gc-sections --sysroot=../../build/linux/debian_wheezy_amd64-sysroot -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/lib/x86_64-linux-gnu -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/lib/x86_64-linux-gnu -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/x86_64-linux-gnu -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/x86_64-linux-gnu -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6 -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6 -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib -Wl,--export-dynamic -o "./libblink_deprecated_test_plugin.so" -Wl,-soname="libblink_deprecated_test_plugin.so" @"./libblink_deprecated_test_plugin.so.rsp"
/usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/../../../../src/llvm.org/release-build/bin/ld.lld: error: /usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/thin-lto-cache/02B862BCF8D42E108D51169F799A29D08DFFA4B0: invalid data encoding
clang-5.0: error: linker command failed with exit code 1 (use -v to see invocation)
[555/776] SOLINK_MODULE ./libblink_test_plugin.so
FAILED: libblink_test_plugin.so 
../../../../src/llvm.org/release-build/bin/clang++ -shared -Wl,--fatal-warnings -fPIC -Wl,-z,noexecstack -Wl,-z,now -Wl,-z,relro -Wl,-z,defs -Wl,--no-as-needed -lpthread -Wl,--as-needed -fuse-ld=lld -Wl,--icf=all -flto=thin -Wl,--thinlto-jobs=8 -Wl,--thinlto-cache-dir=/usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/thin-lto-cache -Wl,-plugin-opt,-function-sections -m64 -pthread -Werror -Wl,-O1 -Wl,--gc-sections --sysroot=../../build/linux/debian_wheezy_amd64-sysroot -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/lib/x86_64-linux-gnu -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/lib/x86_64-linux-gnu -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/x86_64-linux-gnu -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/x86_64-linux-gnu -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6 -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6 -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib -Wl,--export-dynamic -o "./libblink_test_plugin.so" -Wl,-soname="libblink_test_plugin.so" @"./libblink_test_plugin.so.rsp"
/usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/../../../../src/llvm.org/release-build/bin/ld.lld: error: /usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/thin-lto-cache/30E3EF66245C0348C48E787788EBAAC08A72474F: invalid data encoding
clang-5.0: error: linker command failed with exit code 1 (use -v to see invocation)

A restart of the linker fixed the issue, but something has definitely collided on that run.

Comment 41 by p...@chromium.org, Mar 16 2017

My suspicion is that one of the instances of the linker followed this code path: http://llvm-cs.pcc.me.uk/lib/LTO/Caching.cpp#28
and another instance saw a partially written file.

I think the fix is to do what is written in the FIXME.
Yes, the FIXME looks very relevant.
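For context, a common fix for this kind of race (and, as I understand it, roughly what the FIXME suggests) is to write each cache entry to a temporary file and atomically rename it into place, so a reader never observes a partially written entry. A minimal sketch under those assumptions (POSIX-only because of getpid(); hypothetical names, not the actual lld/LLVM code):

#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <unistd.h>  // getpid()

namespace fs = std::filesystem;

// Write the entry to a temporary file in the cache directory, then rename it
// into place. rename() within one filesystem is atomic, so a concurrent
// reader either sees the complete entry or no entry at all.
bool writeCacheEntryAtomically(const fs::path &cacheDir, const std::string &key,
                               const std::string &contents) {
  fs::path tmp = cacheDir / ("tmp." + key + "." + std::to_string(getpid()));
  {
    std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
    if (!out)
      return false;
    out.write(contents.data(), static_cast<std::streamsize>(contents.size()));
    if (!out)
      return false;
  }  // stream closed (and flushed) before the rename
  std::error_code ec;
  fs::rename(tmp, cacheDir / key, ec);  // atomically replaces any existing file
  return !ec;
}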

Comment 43 by p...@chromium.org, Mar 17 2017

A fix for what I believe to be this race condition landed in r297970, and for another potential race condition in r298020. Cache pruning support also landed in lld in r298036.

As far as I know those are all the LLVM-side changes we need.
Thanks!

I have taken a look at llvm::pruneCache, and I don't see any safeguards which would prevent us from deleting user files in the case of an incorrectly specified cache directory.

I would like it to only look at filenames with a ThinLTO/LLVM-specific prefix, or to have some other kind of safeguard. Otherwise, bad things will happen.

Comment 45 by p...@chromium.org, Mar 17 2017

Seems reasonable enough, but do we really need this feature to launch? I would imagine that the cache directory would be specified by the build system, so users wouldn't be at risk of specifying an incorrect cache directory.

If we did implement this, here's how I imagine that it would work:
1) when we initialize the cache, we create a directory for it. If we successfully created the directory, we also create a timestamp file ("llvmcache.timestamp") in the directory.
2) in pruneCache() we test for the presence of the timestamp file, and refuse to prune the cache if it is missing.
It's backwards compatible with existing caches because they are already creating the timestamp file in pruneCache(). So we can implement this feature at any time we want later.
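To make the scheme above concrete, here is a minimal sketch of steps 1) and 2), assuming C++17 <filesystem>; it is illustrative only, not the actual pruneCache() code:

#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Step 1: when initializing the cache, create the directory and drop a
// marker/timestamp file ("llvmcache.timestamp") into it.
void initCacheDir(const fs::path &cacheDir) {
  fs::create_directories(cacheDir);
  std::ofstream(cacheDir / "llvmcache.timestamp");  // creates the marker file
}

// Step 2: the pruner refuses to touch any directory that lacks the marker,
// so a mistyped path (e.g. a user's home directory) is never cleaned out.
bool mayPrune(const fs::path &cacheDir) {
  return fs::exists(cacheDir / "llvmcache.timestamp");
}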

Comment 46 by p...@chromium.org, Mar 17 2017

Actually this was easy enough to implement, so I did it: https://reviews.llvm.org/D31096
>Seems reasonable enough, but do we really need this feature to launch?
Yes. Because if we fail even for one user, it won't be pretty. See https://devtalk.nvidia.com/default/topic/492860/cuda-toolkit-4-0-13-deleted-usr-bin-on-installation/ and https://github.com/MrMEEE/bumblebee-Old-and-abbandoned/issues/123 for prior art (NVIDIA installer deleted /usr for lots of users)

>1) when we initialize the cache, we create a directory for it. If we successfully created the directory, we also create a timestamp file ("llvmcache.timestamp") in the directory.
>2) in pruneCache() we test for the presence of the timestamp file, and refuse to prune the cache if it is missing.

This approach can be simplified, with some failure modes implicitly removed, if we require the basename of the cache directory to have a specific name (such as .thinlto_cache). Alternatively, create such a directory within the specified location and store all cache files under it. In that case, deleting the .thinlto_cache directory will delete the files, and user files are not expected to be inside that directory anyway.

Also, we should only create files with a special prefix, only delete files with that prefix, and never follow symlinks. That part is optional, but it's almost a no-op anyway. (A rough sketch of these checks follows below.)

I don't quite care about the existing caches, as ThinLTO is not yet widely used.
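For illustration, a sketch of what those safeguards could look like; the directory basename (.thinlto_cache) and file prefix (llvmcache-) are assumed conventions for this example, not necessarily what LLVM adopted:

#include <filesystem>
#include <string>

namespace fs = std::filesystem;

constexpr const char *kCacheDirName = ".thinlto_cache";  // assumed convention
constexpr const char *kCachePrefix = "llvmcache-";       // assumed convention

bool isSafeCacheDir(const fs::path &dir) {
  return dir.filename() == kCacheDirName;
}

void prunePrefixedFilesOnly(const fs::path &dir) {
  if (!isSafeCacheDir(dir))
    return;  // refuse to touch anything that doesn't look like our cache
  for (const auto &de : fs::directory_iterator(dir)) {
    const std::string name = de.path().filename().string();
    // Never follow symlinks, and only remove regular files we could have created.
    if (de.is_symlink() || !de.is_regular_file())
      continue;
    if (name.rfind(kCachePrefix, 0) == 0)  // name starts with the cache prefix
      fs::remove(de.path());
  }
}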
re #46: sorry, I think it's a fragile approach.

Comment 49 by p...@chromium.org, Mar 20 2017

We settled on a cache pruning approach which landed in r298271.
Thanks!
I am currently working on enabling the cache on the CFI ThinLTO Linux ToT bot.
Comment 52 by bugdroid1@chromium.org (Project Member), Mar 21 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/33d8074e449407ac84ad0e6f3ea7ab3dbda19f28

commit 33d8074e449407ac84ad0e6f3ea7ab3dbda19f28
Author: krasin <krasin@chromium.org>
Date: Tue Mar 21 15:18:13 2017

Enable ThinLTO cache.

This allows some codegen work to be reused when linking multiple targets with an intersecting set of object files, resulting in a reduction of overall build time.

This creates a thinlto-cache directory inside src/out/<gn-config-name>/ and keeps a lot of llvmcache* files there.

BUG=672158

Review-Url: https://codereview.chromium.org/2761893002
Cr-Commit-Position: refs/heads/master@{#458418}

[modify] https://crrev.com/33d8074e449407ac84ad0e6f3ea7ab3dbda19f28/build/config/compiler/BUILD.gn

ThinLTO has been working without issues so far. The 'CFI ThinLTO Linux ToT' bot run is now 4h20m instead of 6h20m:
https://build.chromium.org/p/chromium.fyi/builders/CFI%20ThinLTO%20Linux%20ToT/builds/14
https://build.chromium.org/p/chromium.fyi/builders/CFI%20ThinLTO%20Linux%20ToT/builds/13

Note: the ToT bots don't currently have threads enabled, as that would require reworking the update.py script. I plan to do that, but don't see it as an immediate need. In fact, I would prefer to wait until the Clang roll (https://crbug.com/703833) and only then dive into that.
Status: Assigned (was: Untriaged)
