ThinLTO uses 3x more user time to link cc_unittests
Issue description

I've measured the performance of lld vs. gold and ThinLTO vs. FullLTO. I used 8 jobs in the ThinLTO case, and my benchmark was to link cc_unittests. LLVM r288833. The numbers are bad:

ThinLTO + LLD: real 96s, user 698s
ThinLTO + Gold: real 96s, user 692s
FullLTO + LLD: 219s
FullLTO + Gold: 208s

As we can see, LLD is slightly slower than Gold (bad, needs to be fixed, but not the end of the world). We also see that ThinLTO uses as much as 3.5x more cycles to link the binary. And that *is* bad.

GN flags for ThinLTO + LLD:

gn gen out/thin-lto-lld-tot '--args=is_chrome_branded=true is_official_build=true is_clang=true is_component_build=false is_debug=false allow_posix_link_time_opt=true is_cfi=false use_thin_lto=true symbol_level=1 clang_use_chrome_plugins=false use_lld=true clang_base_path="/usr/local/google/home/krasin/src/llvm.org/release-build"' --check

GN flags for FullLTO + LLD:

gn gen out/lto-lld-tot '--args=is_chrome_branded=true is_official_build=true is_clang=true is_component_build=false is_debug=false allow_posix_link_time_opt=true is_cfi=false symbol_level=1 clang_use_chrome_plugins=false use_lld=true clang_base_path="/usr/local/google/home/krasin/src/llvm.org/release-build"' --check

Perf profile for lld + ThinLTO:

+ 2.79% ld.lld libc-2.19.so [.] _int_malloc
+ 2.14% ld.lld lld [.] llvm::PMDataManager::findAnalysisPass(void const*, bool)
+ 1.94% ld.lld libc-2.19.so [.] _int_free
+ 1.77% ld.lld libc-2.19.so [.] malloc
+ 1.40% ld.lld lld [.] computeKnownBits(llvm::Value const*, llvm::APInt&, llvm::APInt&, unsigned int, (anonymous namespace)::Query const&)
+ 1.35% ld.lld [kernel.kallsyms] [k] 0xffffffff8105144a
+ 0.98% ld.lld lld [.] llvm::Use::getImpliedUser() const
+ 0.93% ld.lld lld [.] llvm::ValueHandleBase::AddToUseList()
+ 0.92% ld.lld libc-2.19.so [.] __memcmp_sse4_1
+ 0.86% ld.lld lld [.] llvm::Value::getValueName() const
+ 0.85% ld.lld libc-2.19.so [.] malloc_consolidate
+ 0.78% ld.lld lld [.] llvm::SmallPtrSetImplBase::FindBucketFor(void const*) const
+ 0.65% ld.lld lld [.] llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults&, llvm::CodeGenOpt::Level)
+ 0.65% ld.lld lld [.] llvm::AttributeSet::getAttributes(unsigned int) const
+ 0.64% ld.lld lld [.] llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int)
+ 0.58% ld.lld lld [.] llvm::BitstreamCursor::readRecord(unsigned int, llvm::SmallVectorImpl<unsigned long>&, llvm::StringRef*)
+ 0.54% ld.lld lld [.] llvm::BasicBlock::getTerminator()
+ 0.44% ld.lld lld [.] llvm::Value::stripPointerCasts()
+ 0.42% ld.lld lld [.] llvm::PMDataManager::removeNotPreservedAnalysis(llvm::Pass*)
+ 0.42% ld.lld lld [.] llvm::PointerMayBeCaptured(llvm::Value const*, llvm::CaptureTracker*)
+ 0.42% ld.lld lld [.] llvm::DataLayout::getTypeSizeInBits(llvm::Type*) const
+ 0.41% ld.lld lld [.] llvm::ValueHandleBase::RemoveFromUseList()
+ 0.39% ld.lld lld [.] llvm::sys::MemoryFence()
+ 0.38% ld.lld lld [.] llvm::PMTopLevelManager::findAnalysisPass(void const*)
+ 0.37% ld.lld lld [.] llvm::SimplifyInstruction(llvm::Instruction*, llvm::DataLayout const&, llvm::TargetLibraryInfo const*, llvm::DominatorTree const*, llvm::AssumptionCache*)
+ 0.37% ld.lld lld [.] llvm::AttributeSet::hasAttribute(unsigned int, llvm::Attribute::AttrKind) const
+ 0.36% ld.lld libc-2.19.so [.] free
+ 0.34% ld.lld lld [.] llvm::DenseMapBase<llvm::DenseMap<llvm::BasicBlock*, llvm::DominatorTreeBase<llvm::BasicBlock>::InfoRec, llvm::DenseMapInfo<llvm::BasicBlock*>, llvm::detail::DenseMap
+ 0.34% ld.lld lld [.] llvm::FoldingSetNodeID::AddInteger(unsigned int)
+ 0.33% ld.lld lld [.] llvm::Value::assertModuleIsMaterialized() const
+ 0.33% ld.lld lld [.] llvm::TargetLibraryInfoImpl::getLibFunc(llvm::StringRef, llvm::LibFunc::Func&) const
+ 0.33% ld.lld lld [.] llvm::Value::getContext() const
+ 0.31% ld.lld lld [.] llvm::Type::getScalarType() const
+ 0.31% ld.lld lld [.] bool llvm::DenseMapBase<llvm::DenseMap<llvm::Instruction*, unsigned int, llvm::DenseMapInfo<llvm::Instruction*>, llvm::detail::DenseMapPair<llvm::Instruction*, unsign
+ 0.31% ld.lld lld [.] llvm::BranchInst::getSuccessorV(unsigned int) const
+ 0.31% ld.lld lld [.] llvm::GetUnderlyingObject(llvm::Value*, llvm::DataLayout const&, unsigned int)
+ 0.31% ld.lld lld [.] llvm::InstCombiner::run()
+ 0.30% ld.lld lld [.] llvm::StringMapImpl::FindKey(llvm::StringRef) const
+ 0.30% ld.lld lld [.] AddReachableCodeToWorklist(llvm::BasicBlock*, llvm::DataLayout const&, llvm::SmallPtrSetImpl<llvm::BasicBlock*>&, llvm::InstCombineWorklist&, llvm::TargetLibraryInfo
+ 0.30% ld.lld lld [.] llvm::ReplaceableMetadataImpl::getOrCreate(llvm::Metadata&)
+ 0.29% ld.lld lld [.] llvm::ConstantFoldInstruction(llvm::Instruction*, llvm::DataLayout const&, llvm::TargetLibraryInfo const*)
+ 0.29% ld.lld lld [.] (anonymous namespace)::DAGCombiner::AddToWorklist(llvm::SDNode*)
+ 0.29% ld.lld lld [.] llvm::APInt::operator&(llvm::APInt const&) const
+ 0.29% ld.lld lld [.] llvm::Use::getUser() const
+ 0.28% ld.lld lld [.] std::enable_if<llvm::hashing::detail::is_hashable_data<unsigned int const>::value, llvm::hash_code>::type llvm::hashing::detail::hash_combine_range_impl<unsigned int
+ 0.27% ld.lld libpthread-2.19.so [.] pthread_rwlock_rdlock
+ 0.27% ld.lld lld [.] llvm::DataLayout::getAlignment(llvm::Type*, bool) const
+ 0.26% ld.lld lld [.] llvm::DAGTypeLegalizer::run()
+ 0.26% ld.lld lld [.] llvm::StringMapImpl::LookupBucketFor(llvm::StringRef)
+ 0.26% ld.lld lld [.] llvm::Value::getName() const
+ 0.26% ld.lld lld [.] llvm::ConstantRange::ConstantRange(unsigned int, bool)
+ 0.26% ld.lld lld [.] llvm::MetadataTracking::track(void*, llvm::Metadata&, llvm::PointerUnion<llvm::MetadataAsValue*, llvm::Metadata*>)
+ 0.25% ld.lld lld [.] (anonymous namespace)::SelectionDAGLegalize::LegalizeOp(llvm::SDNode*)
+ 0.25% ld.lld libstdc++.so.6.0.19 [.] operator new(unsigned long)
+ 0.24% ld.lld lld [.] llvm::ValueHandleBase::AddToExistingUseList(llvm::ValueHandleBase**)
+ 0.24% ld.lld lld [.] llvm::InstCombiner::SimplifyDemandedUseBits(llvm::Value*, llvm::APInt, llvm::APInt&, llvm::APInt&, unsigned int, llvm::Instruction*)
,
Dec 7 2016
Per https://docs.google.com/spreadsheets/d/18vi9p8ffIYNVPTyxtJwr-YrP4WJRbaQr_2nZ1AKKBs4/edit#gid=0 , on 8 jobs it was 617.55. So it was already worse than FullLTO, but has become even slower.
,
Dec 7 2016
Perf profile for lld + FullLTO:

+ 7.24% ld.lld libc-2.19.so [.] _int_malloc
+ 4.14% ld.lld [kernel.kallsyms] [k] 0xffffffff8105144a
+ 1.61% ld.lld libc-2.19.so [.] malloc_consolidate
+ 1.40% ld.lld libc-2.19.so [.] _int_free
+ 1.35% ld.lld lld [.] llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults&, llvm::CodeGenOpt::Level)
+ 1.33% ld.lld lld [.] llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int)
+ 1.14% ld.lld lld [.] llvm::PMDataManager::findAnalysisPass(void const*, bool)
+ 1.04% ld.lld lld [.] llvm::SmallPtrSetImplBase::FindBucketFor(void const*) const
+ 0.75% ld.lld lld [.] llvm::BitstreamCursor::readRecord(unsigned int, llvm::SmallVectorImpl<unsigned long>&, llvm::StringRef*)
+ 0.71% ld.lld libc-2.19.so [.] memset
+ 0.70% ld.lld libc-2.19.so [.] malloc
+ 0.62% ld.lld lld [.] llvm::MCExpr::evaluateAsRelocatableImpl(llvm::MCValue&, llvm::MCAssembler const*, llvm::MCAsmLayout const*, llvm::MCFixup const*, llvm::DenseMap<llvm::MCSection const*
+ 0.62% ld.lld lld [.] llvm::sys::MemoryFence()
+ 0.60% ld.lld lld [.] llvm::Use::getImpliedUser() const
+ 0.58% ld.lld lld [.] (anonymous namespace)::DAGCombiner::AddToWorklist(llvm::SDNode*)
+ 0.57% ld.lld lld [.] llvm::DAGTypeLegalizer::run()
+ 0.56% ld.lld libc-2.19.so [.] free
+ 0.54% ld.lld lld [.] (anonymous namespace)::SelectionDAGLegalize::LegalizeOp(llvm::SDNode*)
+ 0.51% ld.lld lld [.] llvm::ScheduleDAGSDNodes::BuildSchedUnits()
+ 0.49% ld.lld lld [.] llvm::FoldingSetNodeID::AddInteger(unsigned int)
+ 0.43% ld.lld lld [.] llvm::DILocation::getImpl(llvm::LLVMContext&, unsigned int, unsigned int, llvm::Metadata*, llvm::Metadata*, llvm::Metadata::StorageType, bool)
+ 0.43% ld.lld libc-2.19.so [.] __memcmp_sse4_1
+ 0.43% ld.lld lld [.] llvm::StringMapImpl::LookupBucketFor(llvm::StringRef)
+ 0.43% ld.lld lld [.] bool llvm::DenseMapBase<llvm::DenseMap<llvm::DILocation*, llvm::detail::DenseSetEmpty, llvm::MDNodeInfo<llvm::DILocation>, llvm::detail::DenseSetPair<llvm::DILocation*
+ 0.42% ld.lld lld [.] llvm::calculateDbgValueHistory(llvm::MachineFunction const*, llvm::TargetRegisterInfo const*, llvm::DbgValueHistoryMap&)
+ 0.39% ld.lld lld [.] std::enable_if<llvm::hashing::detail::is_hashable_data<unsigned int const>::value, llvm::hash_code>::type llvm::hashing::detail::hash_combine_range_impl<unsigned int c
+ 0.39% ld.lld lld [.] llvm::PMDataManager::removeNotPreservedAnalysis(llvm::Pass*)
+ 0.38% ld.lld lld [.] llvm::FoldingSetImpl::FindNodeOrInsertPos(llvm::FoldingSetNodeID const&, void*&)
+ 0.38% ld.lld lld [.] llvm::ScheduleDAGSDNodes::AddSchedEdges()
+ 0.37% ld.lld lld [.] llvm::TargetLoweringBase::getTypeConversion(llvm::LLVMContext&, llvm::EVT) const
+ 0.34% ld.lld lld [.] (anonymous namespace)::DeadMachineInstructionElim::runOnMachineFunction(llvm::MachineFunction&)
+ 0.34% ld.lld lld [.] llvm::SelectionDAG::AssignTopologicalOrder()
+ 0.32% ld.lld lld [.] llvm::BasicBlock::getTerminator()
+ 0.32% ld.lld lld [.] llvm::Attribute::hasAttribute(llvm::StringRef) const
+ 0.32% ld.lld lld [.] llvm::PMDataManager::initializeAnalysisImpl(llvm::Pass*)
+ 0.31% ld.lld libc-2.19.so [.] __memcpy_sse2_unaligned
+ 0.31% ld.lld lld [.] llvm::LiveVariables::HandlePhysRegKill(unsigned int, llvm::MachineInstr*)
+ 0.31% ld.lld lld [.] llvm::MachineInstr::addOperand(llvm::MachineFunction&, llvm::MachineOperand const&)
+ 0.31% ld.lld lld [.] llvm::MCAsmLayout::ensureValid(llvm::MCFragment const*) const
+ 0.31% ld.lld lld [.] (anonymous namespace)::DAGCombiner::combine(llvm::SDNode*)
+ 0.30% ld.lld lld [.] llvm::Attribute::isStringAttribute() const
+ 0.30% ld.lld lld [.] llvm::X86InstrInfo::AnalyzeBranchImpl(llvm::MachineBasicBlock&, llvm::MachineBasicBlock*&, llvm::MachineBasicBlock*&, llvm::SmallVectorImpl<llvm::MachineOperand>&, llv
+ 0.30% ld.lld lld [.] (anonymous namespace)::Verifier::visitFunction(llvm::Function const&)
+ 0.30% ld.lld lld [.] (anonymous namespace)::Mapper::mapSimpleMetadata(llvm::Metadata const*)
+ 0.28% ld.lld lld [.] llvm::BasicBlock::dropAllReferences()
+ 0.28% ld.lld lld [.] llvm::SmallPtrSetImplBase::insert_imp_big(void const*)
+ 0.28% ld.lld lld [.] llvm::SelectionDAG::Legalize()
+ 0.27% ld.lld lld [.] llvm::MCAssembler::layoutSectionOnce(llvm::MCAsmLayout&, llvm::MCSection&)
+ 0.27% ld.lld lld [.] llvm::TargetRegisterInfo::getMinimalPhysRegClass(unsigned int, llvm::MVT) const
+ 0.27% ld.lld lld [.] llvm::ReplaceableMetadataImpl::getOrCreate(llvm::Metadata&)
+ 0.27% ld.lld lld [.] llvm::SelectionDAG::createOperands(llvm::SDNode*, llvm::ArrayRef<llvm::SDValue>)
+ 0.27% ld.lld lld [.] llvm::Value::assertModuleIsMaterialized() const
+ 0.25% ld.lld lld [.] llvm::GlobalDCEPass::GlobalIsNeeded(llvm::GlobalValue*)
+ 0.25% ld.lld lld [.] llvm::PMTopLevelManager::findAnalysisPass(void const*)
+ 0.25% ld.lld lld [.] llvm::AttributeImpl::hasAttribute(llvm::StringRef) const
+ 0.25% ld.lld libstdc++.so.6.0.19 [.] operator new(unsigned long)
+ 0.24% ld.lld lld [.] llvm::ValueHandleBase::AddToUseList()
+ 0.24% ld.lld lld [.] llvm::AttributeSet::getAttributes(unsigned int) const
+ 0.24% ld.lld lld [.] llvm::MachineFunctionPass::runOnFunction(llvm::Function&)
,
Dec 7 2016
Fundamentally, ThinLTO is going to use more cycles overall because it does more work than regular LTO. I think that for things like Chrome, where we have a large number of test binaries that share code, we will want to find some way to share work between the test binaries, perhaps by using a cache.
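For illustration, here is a minimal sketch of what such a shared cache could look like: a content-addressed store keyed by a hash of everything that feeds a ThinLTO backend compile (the module, its imports, the options), with the generated native object as the value. The class, names, and hash function are hypothetical, not the lld/gold-plugin implementation.

// Hypothetical content-addressed cache for ThinLTO backend results.
// Key = hash of the backend's inputs; value = the generated object file.
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <optional>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

// FNV-1a over a byte string; a real cache would use a cryptographic hash
// of the module, its imports, and the codegen options.
static uint64_t fnv1a(const std::string &Data) {
  uint64_t H = 1469598103934665603ULL;
  for (unsigned char C : Data) {
    H ^= C;
    H *= 1099511628211ULL;
  }
  return H;
}

class ObjectCache {
  fs::path Dir;

public:
  explicit ObjectCache(fs::path CacheDir) : Dir(std::move(CacheDir)) {
    fs::create_directories(Dir);
  }

  // File name for a given key, e.g. "llvmcache-1a2b3c...".
  fs::path pathFor(uint64_t Key) const {
    std::ostringstream Name;
    Name << "llvmcache-" << std::hex << Key;
    return Dir / Name.str();
  }

  // Returns the cached object file if this exact input was codegen'ed before.
  std::optional<fs::path> lookup(uint64_t Key) const {
    fs::path P = pathFor(Key);
    if (fs::exists(P))
      return P;
    return std::nullopt;
  }

  // Stores a freshly generated object file under its key.
  void store(uint64_t Key, const std::string &ObjectBytes) const {
    std::ofstream(pathFor(Key), std::ios::binary) << ObjectBytes;
  }
};

With something like this, two test binaries that link the same module with the same imports and options hit the same cache entry and skip the duplicate codegen.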
,
Dec 7 2016
That would be a nice feature. Still, I am sure we can make ThinLTO faster even without changing the fundamentals (such as introducing a cache). In particular, it's unclear what made ThinLTO performance regress by ~15% over the last two months or so. I plan to spend some time with the profiler and try to find shallow optimizations. At its current speed, ThinLTO is barely useful for Chrome. Yesterday, I tested it with "ninja -C <...> All" and it was less than 2x faster than FullLTO.
,
Dec 7 2016
Per Peter's suggestion, I have run a few experiments with the ThinLTO cache:

- Every Chrome sync adds ~300 MB to the cache, and it grows uncontrollably. That's just for cc_unittests; I expect it to be around 2 GB for All.
- A hot relink (link, rm, link) is very fast: real 7s, user 7s. That also assumes the ThinLTO cache is in the system disk cache. Still, very fast.
- A minor sync (a few commits) eliminates 35% of the cache's value.

I plan to check how well it works for building All, where we actually have many of the same modules. The numbers I'll get this way will be optimistic, as we don't currently have the CFI and WholeProgramDevirt passes.
,
Dec 7 2016
Haven't looked at this in detail yet, but a couple of quick questions. For the "ninja -C All" test you did where ThinLTO was about 2x faster: did that include the unit tests, or specifically what beyond the chromium binary link? I would expect it to have an even bigger advantage. Was this with debug info? There is a known issue there where we are importing too much, and it slows the link down quite a bit. Yeah, we need to implement cache pruning (gold and lld), as is done for ld64.
,
Dec 7 2016
> Haven't looked at this in detail yet, but a couple of quick questions. For the ninja -C All test you did where ThinLTO was about 2X faster, was that including the unit tests, or specifically what beyond the chromium binary link?

I am sorry if I was unclear. All targets with ThinLTO were not faster at all (maybe even slower). A single target (cc_unittests) was about 2x faster in user time, but used 2.5x more cycles. I am now testing ThinLTO + All + ThinLTO cache in the hope that this will be faster.

> I would expect it to have an even bigger advantage. Was this with debug info? There is a known issue there where we are importing too much and it slows the link down quite a bit.

There's some debug info, but not very much.

> Yeah, we need to implement cache pruning (gold and lld), like is done for ld64.

Probably.
,
Dec 7 2016
So, ThinLTO + All + ThinLTO cache failed with:

../../third_party/binutils/Linux_x64/Release/bin/ld.gold: error: /usr/local/google/home/krasin/chr32/thinlto-cache-dir/7057D28CFAD05722CE58CC783733362725B3FF78: requires dynamic R_X86_64_PC32 reloc against '_ZNSs7reserveEm' which may overflow at runtime; recompile with -fPIC
../../third_party/binutils/Linux_x64/Release/bin/ld.gold: error: /usr/local/google/home/krasin/chr32/thinlto-cache-dir/59C97A616090CA240D95A659674C3158D77FEC46: requires dynamic R_X86_64_PC32 reloc against 'memmove' which may overflow at runtime; recompile with -fPIC

I will clean the cache and try again.
,
Dec 7 2016
Interesting. Here's what I suspect the problem is: some Chrome binaries are built with -pie (i.e. chrome itself), some are built with -shared, and some are built with neither (i.e. the test binaries). The gold plugin passes the value of the -pie/-shared flags in the LTO configuration, which lib/LTO uses to determine whether the intermediate object files should be built in PIC mode, but we don't currently include that information in the hash. Presumably one of the non-PIC links added an object file to the cache which was reused by one of the PIC links. To solve this problem I think we will need to add most of the LTO configuration to the hash.
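A hedged sketch of the kind of change being described: fold the configuration bits that affect codegen (PIC mode, optimization level, target, CPU) into the cache key alongside the module hash. The struct and field names below are illustrative stand-ins, not the actual llvm::lto::Config members.

// Illustrative only: mix the relevant pieces of the LTO configuration into
// the cache key so that, e.g., a PIC and a non-PIC link of the same module
// can never share a cache entry.
#include <cstdint>
#include <string>

struct LinkConfig {          // hypothetical stand-in for the LTO config
  bool PIC = false;          // -fPIC / -pie vs. a non-PIC link
  unsigned OptLevel = 2;     // optimization level used for the backends
  std::string TargetTriple;  // target the objects are generated for
  std::string CPU;           // -mcpu, if any
};

static uint64_t hashString(const std::string &S) {
  uint64_t H = 1469598103934665603ULL;  // FNV-1a, purely for illustration
  for (unsigned char C : S) { H ^= C; H *= 1099511628211ULL; }
  return H;
}

static uint64_t hashCombine(uint64_t Seed, uint64_t V) {
  // Boost-style mixer; any decent combiner works for the sketch.
  return Seed ^ (V + 0x9e3779b97f4a7c15ULL + (Seed << 6) + (Seed >> 2));
}

// The module hash alone is not enough: two links can feed identical bitcode
// through different configurations and must get different keys.
uint64_t computeCacheKey(uint64_t ModuleHash, const LinkConfig &Conf) {
  uint64_t Key = ModuleHash;
  Key = hashCombine(Key, Conf.PIC ? 1 : 0);
  Key = hashCombine(Key, Conf.OptLevel);
  Key = hashCombine(Key, hashString(Conf.TargetTriple));
  Key = hashCombine(Key, hashString(Conf.CPU));
  return Key;
}

With this, a non-PIC link and a PIC link of the same bitcode compute different keys and can no longer pick up each other's cached objects.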
,
Dec 7 2016
Hah! After cleaning up the cache and trying again, I get the same issue (it complains about slightly different symbols). It looks like the hash does not include the presence of some important flag (like -fPIC). I will try to make a standalone reproducer and file an LLVM issue about it.
,
Dec 7 2016
> To solve this problem I think we will need to add most of the LTO configuration to the hash.

Yes, that's what I suspected in #12 (sent at the same time as your reply, #11).
,
Dec 7 2016
Does the attached patch fix the problem?
,
Dec 8 2016
Thank you. The patch indeed helped:

$ time -p ninja -C out/thin-lto-gold-tot/ All
real 4210.45
user 145285.95
sys 4875.74

The size of the cache directory is 13 GB (I started with an empty one). And I like the numbers: 1h10m is wonderful.
,
Dec 8 2016
Great! I'll go ahead and send the patch for review then.
,
Dec 8 2016
Thank you! Once the patch is in, I will enable it on the https://build.chromium.org/p/chromium.fyi/builders/ThinLTO Linux ToT bot and store the cache in /tmp, so that the cache is gone after each reboot (and the bot reboots on every build). That assumes the bot has enough space on /tmp. In the longer term, we'll need a retention policy for the cache. Possibly, we could reuse the heuristics from ccache (https://ccache.samba.org/), which solves a very similar problem.
,
Dec 8 2016
The legacy ThinLTO implementation has a few more caching knobs (http://llvm-cs.pcc.me.uk/include/llvm/LTO/legacy/ThinLTOCodeGenerator.h#91), but it isn't clear that we'll want exactly those.
,
Dec 8 2016
Ok, this doesn't really surprise me - it is doing more work but is faster due to the parallelization of the backends. For smaller unittest targets the advantage is also likely not as noticeable as for a large target like the Chromium binary itself. Yes, with caching it should be much faster than full LTO. I think that's what your later email indicates, but I'm having trouble finding all the numbers to compare in the email thread.
,
Dec 8 2016
We'll get exact numbers for caching once I have enabled it on the bot. As for more work being done with ThinLTO: is it really 2.5x more work? My weak mental model of ThinLTO does not suggest a source of such a big overhead.
,
Dec 8 2016
That's a good point - that's much more than I would expect. When we measured clang Full LTO vs ThinLTO with 1 thread, ThinLTO had a 25% overhead: half due to the redundant work caused by importing, half because ThinLTO has a more aggressive optimization pipeline. So yes, 2.5x is surprising. Comparing the two profiles you sent, nothing huge stands out. computeKnownBits doesn't show up in the FullLTO profile but is near the top of the ThinLTO profile, yet it's only 1.4%. Overall, the profiles look pretty flat.
,
Dec 8 2016
Yes, so far it's a mystery. I will try to get the perf numbers from a different perspective, such as per pass / per object, but I am still thinking about what exactly I should measure. Anyway, I'll update this bug with some more measurements today. If you have any ideas about what's worth looking at, please post them here.
,
Dec 20 2016
Were you able to identify anything? I thought of this bug while reading this thread: http://lists.llvm.org/pipermail/llvm-dev/2016-December/108342.html Note the compile-time regression, in particular related to computeKnownBits, which shows up in the profile here. Teresa
,
Dec 20 2016
Hi Teresa, I didn't touch it last week, so I don't have new information to share. Most likely, I will be returning to this after vacation (so, mid-January). krasin
,
Dec 20 2016
Ok sounds good, I wanted to record a link to that thread here since it seems relevant. Teresa
,
Dec 20 2016
That makes sense. Thanks, Teresa!
,
Mar 15 2017
Previously, the issue was that, with no retention policy, the size of the cache directory grows indefinitely (and fast). While it makes sense to enable the cache on the bot (after it goes green), it's not possible to deploy it to the bots that don't reboot after each build.
,
Mar 15 2017
Perhaps we can add a build step that clears the cache directory? That probably makes sense anyway for the ToT bots, for which I'd expect the entire cache to be invalidated on every run (because the compiler version has changed).
,
Mar 15 2017
ToT bots reboot on every build, so /tmp will be okay. I also don't want to add a build step, because we would need to add it to every single bot, and there are too many. Also, requiring developers to clear the cache directory frequently will create issues with building Chrome with ThinLTO on developer workstations. The primary victims will be us, but other Chrome developers will be affected as well, assuming ThinLTO makes it into the official build. I would really like to have a sane retention policy implemented. ccache did it, and it created zero issues for one of my previous teams when we used it.
,
Mar 15 2017
It seems that the LRU policy used on Mac has been abstracted out (http://llvm-cs.pcc.me.uk/lib/Support/CachePruning.cpp), so it shouldn't be too hard to reuse it in the new LTO API. Do you know what policy ccache uses? Is it LRU-based?
,
Mar 15 2017
// When "max files" or "max cache size" is reached, one of the 16 cache // subdirectories is cleaned up. When doing so, files are deleted (in LRU // order) until the levels are below limit_multiple.
,
Mar 15 2017
So, yes, LRU + a heuristic to decrease the freezing time.
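For reference, a rough sketch of that style of LRU pruning (this is not the actual llvm::pruneCache implementation in lib/Support/CachePruning.cpp; it only illustrates the policy being discussed):

// Delete the least recently used cache files until the directory is back
// under a size budget.
#include <algorithm>
#include <cstdint>
#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

void pruneCacheLRU(const fs::path &CacheDir, std::uintmax_t MaxBytes) {
  struct Entry {
    fs::path Path;
    fs::file_time_type MTime;  // last write time as a recency proxy
    std::uintmax_t Size;
  };
  std::vector<Entry> Entries;
  std::uintmax_t Total = 0;

  for (const auto &E : fs::directory_iterator(CacheDir)) {
    if (!E.is_regular_file())
      continue;
    Entries.push_back({E.path(), E.last_write_time(), E.file_size()});
    Total += Entries.back().Size;
  }

  // Least recently written first, so the coldest entries go away first.
  std::sort(Entries.begin(), Entries.end(),
            [](const Entry &A, const Entry &B) { return A.MTime < B.MTime; });

  for (const Entry &E : Entries) {
    if (Total <= MaxBytes)
      break;
    Total -= E.Size;
    fs::remove(E.Path);
  }
}

ccache's extra twist, per the comment quoted above, is to split the cache into 16 subdirectories and prune only one of them at a time, which keeps any single prune pass short.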
,
Mar 15 2017
Okay, it sounds like CachePruning is pretty much what we need then. I'll get started on that.
,
Mar 15 2017
Thank you!
,
Mar 16 2017
Today, I tried to build content_browsertests with the ThinLTO cache enabled and located on /tmp. It quickly saturated my /tmp and failed with an I/O error. That makes the out/ directory the only suitable location for the cache.
,
Mar 16 2017
> Today, I tried to build content_browsertests with the ThinLTO cache enabled and located on /tmp. It quickly saturated my /tmp and failed with an I/O error. That makes the out/ directory the only suitable location for the cache.

/tmp on our workstations tends to be very small. For many things I need to set my TMPDIR to another directory.
,
Mar 16 2017
In my experience, the only thing that tends not to fit into my /tmp is ThinLTO. Even though other things exist that might require a larger tmp dir, I expect this to be a major frustration point for Chromium devs if the cache is placed into $TMPDIR. We should avoid doing that.
,
Mar 16 2017
Peter, apparently, the ThinLTO cache can get corrupted:

[553/776] SOLINK ./libppapi_tests.so
FAILED: libppapi_tests.so libppapi_tests.so.TOC
python "/usr/local/google/home/krasin/chr33/src/build/toolchain/gcc_solink_wrapper.py" --readelf="readelf" --nm="nm" --sofile="./libppapi_tests.so" --tocfile="./libppapi_tests.so.TOC" --output="./libppapi_tests.so" -- ../../../../src/llvm.org/release-build/bin/clang++ -shared -Wl,--fatal-warnings -fPIC -Wl,-z,noexecstack -Wl,-z,now -Wl,-z,relro -Wl,-z,defs -Wl,--no-as-needed -lpthread -Wl,--as-needed -fuse-ld=lld -Wl,--icf=all -flto=thin -Wl,--thinlto-jobs=8 -Wl,--thinlto-cache-dir=/usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/thin-lto-cache -Wl,-plugin-opt,-function-sections -m64 -pthread -Werror -Wl,-O1 -Wl,--gc-sections --sysroot=../../build/linux/debian_wheezy_amd64-sysroot -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/lib/x86_64-linux-gnu -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/lib/x86_64-linux-gnu -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/x86_64-linux-gnu -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/x86_64-linux-gnu -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6 -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6 -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib -Wl,--export-dynamic -o "./libppapi_tests.so" -Wl,-soname="libppapi_tests.so" @"./libppapi_tests.so.rsp"
/usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/../../../../src/llvm.org/release-build/bin/ld.lld: error: /usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/thin-lto-cache/251E3ED652E3886EDAC9A851FEDCA559A9A91A15: invalid data encoding
clang-5.0: error: linker command failed with exit code 1 (use -v to see invocation)

[554/776] SOLINK_MODULE ./libblink_deprecated_test_plugin.so
FAILED: libblink_deprecated_test_plugin.so
../../../../src/llvm.org/release-build/bin/clang++ -shared -Wl,--fatal-warnings -fPIC -Wl,-z,noexecstack -Wl,-z,now -Wl,-z,relro -Wl,-z,defs -Wl,--no-as-needed -lpthread -Wl,--as-needed -fuse-ld=lld -Wl,--icf=all -flto=thin -Wl,--thinlto-jobs=8 -Wl,--thinlto-cache-dir=/usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/thin-lto-cache -Wl,-plugin-opt,-function-sections -m64 -pthread -Werror -Wl,-O1 -Wl,--gc-sections --sysroot=../../build/linux/debian_wheezy_amd64-sysroot -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/lib/x86_64-linux-gnu -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/lib/x86_64-linux-gnu -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/x86_64-linux-gnu -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/x86_64-linux-gnu -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6 -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6 -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib -Wl,--export-dynamic -o "./libblink_deprecated_test_plugin.so" -Wl,-soname="libblink_deprecated_test_plugin.so" @"./libblink_deprecated_test_plugin.so.rsp"
/usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/../../../../src/llvm.org/release-build/bin/ld.lld: error: /usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/thin-lto-cache/02B862BCF8D42E108D51169F799A29D08DFFA4B0: invalid data encoding
clang-5.0: error: linker command failed with exit code 1 (use -v to see invocation)

[555/776] SOLINK_MODULE ./libblink_test_plugin.so
FAILED: libblink_test_plugin.so
../../../../src/llvm.org/release-build/bin/clang++ -shared -Wl,--fatal-warnings -fPIC -Wl,-z,noexecstack -Wl,-z,now -Wl,-z,relro -Wl,-z,defs -Wl,--no-as-needed -lpthread -Wl,--as-needed -fuse-ld=lld -Wl,--icf=all -flto=thin -Wl,--thinlto-jobs=8 -Wl,--thinlto-cache-dir=/usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/thin-lto-cache -Wl,-plugin-opt,-function-sections -m64 -pthread -Werror -Wl,-O1 -Wl,--gc-sections --sysroot=../../build/linux/debian_wheezy_amd64-sysroot -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/lib/x86_64-linux-gnu -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/lib/x86_64-linux-gnu -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/x86_64-linux-gnu -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/x86_64-linux-gnu -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6 -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib/gcc/x86_64-linux-gnu/4.6 -L/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib -Wl,-rpath-link=/usr/local/google/home/krasin/chr33/src/build/linux/debian_wheezy_amd64-sysroot/usr/lib -Wl,--export-dynamic -o "./libblink_test_plugin.so" -Wl,-soname="libblink_test_plugin.so" @"./libblink_test_plugin.so.rsp"
/usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/../../../../src/llvm.org/release-build/bin/ld.lld: error: /usr/local/google/home/krasin/chr33/src/out/thin-lto-tot/thin-lto-cache/30E3EF66245C0348C48E787788EBAAC08A72474F: invalid data encoding
clang-5.0: error: linker command failed with exit code 1 (use -v to see invocation)

A restart of the linker fixed the issue, but something has definitely collided on that run.
,
Mar 16 2017
My suspicion is that one of the instances of the linker followed this code path: http://llvm-cs.pcc.me.uk/lib/LTO/Caching.cpp#28 and another instance saw a partially written file. I think the fix is to do what is written in the FIXME.
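A minimal sketch of what the FIXME suggests: write the cache entry to a temporary file in the same directory and atomically rename it into place, so a concurrent link either sees a complete entry or a cache miss, never a partially written file. Names are illustrative, not the actual lib/LTO/Caching.cpp code.

#include <filesystem>
#include <fstream>
#include <sstream>
#include <string>
#include <thread>

namespace fs = std::filesystem;

void addCacheEntry(const fs::path &CacheDir, const std::string &Key,
                   const std::string &ObjectBytes) {
  // Unique temp name in the *same* directory so the rename stays on one
  // filesystem (rename is only atomic within a filesystem).
  std::ostringstream Tmp;
  Tmp << "tmp-" << std::this_thread::get_id() << "-" << Key;
  fs::path TmpPath = CacheDir / Tmp.str();
  fs::path FinalPath = CacheDir / Key;

  {
    std::ofstream Out(TmpPath, std::ios::binary);
    Out << ObjectBytes;
  }  // Out is flushed and closed before the rename.

  // On POSIX this maps to rename(2), which replaces FinalPath atomically, so
  // readers never observe a half-written cache entry.
  fs::rename(TmpPath, FinalPath);
}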
,
Mar 16 2017
Yes, the FIXME looks very relevant.
,
Mar 17 2017
Thanks! I have taken a look at llvm::pruneCache, and I don't see any safeguards that would prevent us from deleting user files in the case of an incorrectly specified cache directory. I would like it to only look at filenames with a ThinLTO / LLVM-specific prefix, or to have some other kind of safeguard. Otherwise, bad things will happen.
,
Mar 17 2017
Seems reasonable enough, but do we really need this feature to launch? I would imagine that the cache directory would be specified by the build system, so users wouldn't be at risk of specifying an incorrect cache directory.
If we did implement this, here's how I imagine that it would work:
1) when we initialize the cache, we create a directory for it. If we successfully created the directory, we also create a timestamp file ("llvmcache.timestamp") in the directory.
2) in pruneCache() we test for the presence of the timestamp file, and refuse to prune the cache if it is missing.
It's backwards compatible with existing caches because they are already creating the timestamp file in pruneCache(). So we can implement this feature at any time we want later.
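A small sketch of that safeguard, with illustrative names (the actual change is the review mentioned in the next comment):

#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

static const char *MarkerName = "llvmcache.timestamp";

// Called when the cache is initialized: only drop the marker if we are the
// ones who created the directory, so a pre-existing user directory never
// becomes prunable by accident.
void initCacheDir(const fs::path &CacheDir) {
  if (fs::create_directories(CacheDir))
    std::ofstream(CacheDir / MarkerName) << "";  // contents don't matter
}

// Called before pruning: refuse to touch a directory without the marker,
// e.g. if the user accidentally pointed the cache at their home directory.
bool safeToPrune(const fs::path &CacheDir) {
  return fs::exists(CacheDir / MarkerName);
}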
,
Mar 17 2017
Actually this was easy enough to implement, so I did it: https://reviews.llvm.org/D31096
,
Mar 17 2017
> Seems reasonable enough, but do we really need this feature to launch?

Yes. Because if we fail even for one user, it won't be pretty. See https://devtalk.nvidia.com/default/topic/492860/cuda-toolkit-4-0-13-deleted-usr-bin-on-installation/ and https://github.com/MrMEEE/bumblebee-Old-and-abbandoned/issues/123 for prior art (an NVIDIA installer deleted /usr for lots of users).

> 1) when we initialize the cache, we create a directory for it. If we successfully created the directory, we also create a timestamp file ("llvmcache.timestamp") in the directory.
> 2) in pruneCache() we test for the presence of the timestamp file, and refuse to prune the cache if it is missing.

This approach can be simplified, with some failure modes implicitly removed, if we always require the basename of the directory to have a specific name (such as .thinlto_cache). Alternatively, create such a directory within the specified location and store all cache files under it; in this case, deleting the .thinlto_cache directory deletes the files, and user files are not expected to be inside it anyway. Also, we should only create files with a special prefix, only delete those, and never follow symlinks. That part is optional, but it's almost a no-op anyway. I don't care much about the existing caches, as ThinLTO is not yet widely used.
,
Mar 17 2017
re #46: sorry, I think it's a fragile approach.
,
Mar 20 2017
We settled on a cache pruning approach which landed in r298271.
,
Mar 20 2017
Thanks! I am currently working on enabling the cache on the CFI ThinLTO Linux ToT bot.
,
Mar 21 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/33d8074e449407ac84ad0e6f3ea7ab3dbda19f28

commit 33d8074e449407ac84ad0e6f3ea7ab3dbda19f28
Author: krasin <krasin@chromium.org>
Date: Tue Mar 21 15:18:13 2017

Enable ThinLTO cache. This allows reusing some codegen work when linking multiple targets with an intersecting set of object files, reducing the overall build time. It creates a thinlto-cache directory inside src/out/<gn-config-name>/ and keeps a lot of llvmcache* files there.

BUG=672158
Review-Url: https://codereview.chromium.org/2761893002
Cr-Commit-Position: refs/heads/master@{#458418}

[modify] https://crrev.com/33d8074e449407ac84ad0e6f3ea7ab3dbda19f28/build/config/compiler/BUILD.gn
,
Mar 22 2017
ThinLTO has been working without an issue so far. The 'CFI ThinLTO Linux ToT' bot run is now 4h20m instead of 6h20m: https://build.chromium.org/p/chromium.fyi/builders/CFI%20ThinLTO%20Linux%20ToT/builds/14 https://build.chromium.org/p/chromium.fyi/builders/CFI%20ThinLTO%20Linux%20ToT/builds/13 Note: the ToT bots don't currently have threads enabled, as that will require reworking the update.py script. I plan to do that, but don't see it as an immediate need. In fact, I would prefer to wait until the Clang roll ( https://crbug.com/703833 ) and only then dive into that.