[afdo] investigate the gap of CWP and Benchmark based profiles
Issue description: There are gaps between CWP-based and benchmark-based profiles on some benchmarks, such as speedometer2. Although the benchmark profiles are overfitted to speedometer2, we are still interested in understanding the gaps. This may also give us some hints on reducing CPU cycles in the field. Some conjectures: 1. We suspect that profiles from Atom boards are inferior to those from Core boards. This is a tracking bug for the investigation.
,
Jul 28
CWP profiles from big cores don't do well on silvermont, either. On speedometer2, the performance is the same as with silvermont profiles.
,
Aug 4
,
Aug 4
,
Aug 4
,
Aug 5
Luis asked whether the CWP profile simply doesn't work well on ToT. Unfortunately, that doesn't seem to be the case. The following comparison was done with code base 68.0.3440.89. "silvermont" uses profiles collected on 68.0.3440.76 in the field, and "benchmark" uses profiles collected on 68.0.3440.89 (same as the code base) using benchmarks. The images (3440.76 and 3440.89) used to collect the profiles differ by 10 days, which is small considering that it's the end of beta. The 5.7% gap is still there.
Benchmark: speedometer2; Iterations: 5
                     silvermont (pass:5 fail:0)        benchmark (pass:5 fail:0)
Keys                 Amean     StdDev  StdDev/Mean     Amean     StdDev  StdDev/Mean  GmeanSpeedup  p-value
Total__summary (ms)  18873.41  81.12   0.4%            17863.51  37.11   0.2%         +5.7%         0.00
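As a sanity check on the table, the +5.7% figure can be reproduced from the two Amean values. This is a minimal sketch that assumes the speedup is the ratio of mean run times (lower is better); with a single key, a geometric mean of ratios reduces to that one ratio. The function name `speedup_percent` is made up for illustration and is not part of the benchmarking tool.

```python
def speedup_percent(baseline_ms: float, candidate_ms: float) -> float:
    """Percent by which candidate is faster than baseline (times in ms, lower is better)."""
    return (baseline_ms / candidate_ms - 1.0) * 100.0


# Amean values of Total__summary (ms) from the table above.
silvermont_ms = 18873.41
benchmark_ms = 17863.51

gap = speedup_percent(silvermont_ms, benchmark_ms)
print(f"{gap:+.1f}%")  # prints "+5.7%"
```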
,
Aug 6
Just had a discussion with AutoFDO owner wmi@ and here is a list of checks: 1. Compare the perf counters (cycle counts, instruction counts, cache misses, etc.) of the images built with the two profiles, to see whether the gap is caused by inlining differences. 2. Compare the hot functions from the two profiles and calculate the distance between them. If they are very different, then the benchmark profile is probably overfitted to speedometer, and we may just ignore the regression. The difficulty is how to define "difference" and what threshold of regression is acceptable.
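One possible way to define the "distance" in check 2 is an overlap metric over per-function sample shares. The sketch below is an assumption for illustration, not something proposed in the discussion: each profile is reduced to a hypothetical mapping from function name to sample count, the counts are normalized, and the distance is 1 minus the summed per-function minimum. The toy profiles are invented, not real data.

```python
from typing import Dict


def normalize(samples: Dict[str, int]) -> Dict[str, float]:
    """Convert raw sample counts into fractions of the profile total."""
    total = sum(samples.values())
    if total == 0:
        raise ValueError("profile has no samples")
    return {name: count / total for name, count in samples.items()}


def overlap_distance(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Return 0.0 for identical distributions and 1.0 for disjoint ones."""
    pa, pb = normalize(a), normalize(b)
    overlap = sum(min(pa.get(n, 0.0), pb.get(n, 0.0)) for n in pa.keys() | pb.keys())
    return 1.0 - overlap


# Invented toy profiles: half of the samples land in a shared function.
profile_a = {"f": 50, "g": 50}
profile_b = {"f": 50, "h": 50}
print(overlap_distance(profile_a, profile_b))
```

A metric like this ignores inlining context and call edges, so it only answers whether the same functions are hot; choosing an acceptable threshold is still the open question raised above.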
,
Aug 7
Interestingly, the cycle profiles of speedometer2 collected on the two resulting images are very different. To make sure I didn't make a stupid mistake, I flashed, rebooted and ran more than 5 times, and the profiles look pretty stable. `perf report` checks the build id, so it can't be picking up the wrong DSOs.
Profile on the image built with the benchmark profile:
percentage #sample process function name
1.67% 3011 chrome [.] v8::internal::compiler::CodeGenerator::AssembleArchInstruction
0.55% 983 TaskSchedulerFo [.] v8::internal::Factory::NewRawOneByteString
0.54% 975 chrome [.] v8::internal::compiler::RepresentationSelector::VisitNode
0.50% 899 chrome [.] mojo::edk::WatcherSet::NotifyState
0.44% 783 TaskSchedulerFo [.] v8::internal::JSObject::SetNormalizedProperty
0.43% 765 chrome [.] extensions::(anonymous namespace)::GetOrCreateOrNull
0.38% 681 chrome [.] v8::internal::StringReplaceGlobalRegExpWithString
0.37% 667 TaskSchedulerFo [.] v8::internal::ConcurrentMarking::Run
0.36% 641 chrome [.] v8::internal::ParserBase<v8::internal::PreParser>::ParseStatement
0.33% 598 chrome [.] v8::internal::ParserBase<v8::internal::PreParser>::ParsePropertyName
Profile on the image built with the CWP profile:
2.15% 3977 chrome [.] vp8_decode_mb_row_no_filter
1.14% 2121 TaskSchedulerFo [.] v8::internal::ParserBase<v8::internal::Parser>::ParsePrimaryExpression
0.63% 1171 chrome [.] blink::HTMLTokenizer::NextToken
0.55% 1025 chrome [.] v8::internal::ConcurrentMarking::Run
0.40% 744 TaskSchedulerFo [.] v8::internal::KeywordOrIdentifierToken
0.40% 735 chrome [.] quant_all_bands
0.36% 660 chrome [.] v8::internal::Scanner::Scan
0.35% 654 chrome [.] v8::internal::V8HeapExplorer::GetStrongGcSubrootName
0.33% 616 chrome [.] v8::internal::ParserBase<v8::internal::Parser>::ParsePrimaryExpression
0.32% 591 chrome [.] v8::internal::ParserBase<v8::internal::Parser>::ParseClassLiteral
Apparently there is something wrong with the profile on the image built with the CWP profile: how would vp8 be used in speedometer? And although the profile on the image built with the benchmark profile looks reasonable, I don't know whether it is real, either.
I'm not sure whether this is a problem in the debug symbols or in perf. However, if the CWP profiles also suffer from the same thing, that may explain why there is a huge regression.
,
Aug 7
Why do you think something is wrong in the profile of image by CWP? The CWP profile is with data from the field. Inlining decisions are likely different and what you see are leaf non-inlined functions.
,
Aug 7
It would be useful to see the profiles under pprof using flame graphs. That would show inlined frames, and it may be even possible to compute a diff of the profiles.
,
Aug 7
These profiles were collected while running speedometer2. Under what circumstances would vp8 code be used?
,
Aug 7
I don't understand your question in c#11. I thought the goal was to compare profiles collected while running speedometer2.
,
Aug 7
My point is that vp8 shouldn't even appear in the profile, yet it is the hottest one.
BTW, expanding the inlined functions with `perf report --inline`:
2.15% 3977 chrome [.] vp8_decode_mb_row_no_filter
vp8_decode_mb_row_no_filter (inline)
decode_mb_row_no_filter (inline)
inter_predict (inline)
,
Aug 7
Oh, vp8? I thought it was v8. Yeah, I don't know if speedometer2 has any video encoding or decoding. If not, then it looks like a problem. It may be a symbolization issue.
,
Aug 7
Hi Ting-Yuan, I assume you convert perf.data to an autofdo profile when you get the benchmark-based profile. If you can run the good and bad binaries, generate perf.data files based on LBR events for them, and convert those perf.data files into autofdo profiles, we can do some performance analysis based on the autofdo profiles using the option --dump_for_analysis of /google/data/ro/teams/autofdo/dump_llvm_prof. You can find more information about using that tool to analyze performance based on autofdo profile data here: https://critique.corp.google.com/#review/192699117 Although the tool is not very mature, in some cases it can make the performance analysis a lot easier because it removes inlining differences. With that tool, it will be easy to find out whether vp8_decode_mb_row_no_filter is used in the binary built with the CWP-based profile. Thanks, Wei.
,
Aug 7
:) I'm trying to collect the LBR profiles to see if create_llvm_prof also has the problem.
,
Aug 8
It could be a bug in perf, although I'm still not sure whether afdo is affected. I tried: 1. Using pprof to dump the trace. vp8_decode_mb_row_no_filter doesn't show up. 2. Collecting the LBR profile on celes (the kernel doesn't support it on goldmont/reef). vp8_* doesn't show up in the result. They do appear in `perf report`, although with lower frequency. 3. Dumping the raw trace with `perf report -D` and verifying the addresses with nm and addr2line. None of the samples fall in vp8_decode_mb_row_no_filter according to nm. To be precise, `perf report` is inconsistent with pprof, create_llvm_prof and addr2line/nm. Some of them (or the debug info) must be wrong. BTW, addr2line gave me lots of this error: x86_64-cros-linux-gnu-addr2line: Dwarf Error: mangled line number section (bad file number).
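The nm cross-check in step 3 can be sketched as a lookup of each sample address against the sorted text symbols. This is only an illustration under stated assumptions: the input is text in the default three-column `nm` layout (hex address, type letter, name), only `T`/`t` symbols are considered, symbol sizes are ignored, and the addresses and the names `func_a`/`func_b` below are invented rather than taken from the chrome binary.

```python
import bisect
from typing import List, Optional, Tuple


def parse_nm(text: str) -> List[Tuple[int, str]]:
    """Collect (address, name) pairs for text-section symbols, sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) != 3:
            continue  # e.g. undefined symbols have no address column
        addr, kind, name = parts
        if kind in ("T", "t"):
            symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def lookup(symbols: List[Tuple[int, str]], address: int) -> Optional[str]:
    """Return the last symbol starting at or before address, or None if there is none."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    return symbols[index][1] if index >= 0 else None


# Invented symbol table, for illustration only.
NM_OUTPUT = """\
0000000000001000 T func_a
0000000000001040 T vp8_decode_mb_row_no_filter
0000000000001200 t func_b
                 U malloc
"""

table = parse_nm(NM_OUTPUT)
print(lookup(table, 0x1010))
```

Because sizes are ignored, an address past the end of the last function is still attributed to it; a stricter check would also use the symbol sizes.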
,
Aug 8
On the other hand, the images on which we collect benchmark profiles differ from official builds in that a few features are disabled: afdo, thinlto, hugepage, cfi. It may be worth collecting a benchmark profile on an official release build and comparing it with the current benchmark profile.
,
Aug 10
,
Aug 10
,
Aug 11
Alright, I tried with a benchmark profile collected on an image with afdo, thinlto, hugepage, debug fission disabled, and the performance isn't affected. We are one step closer to declaring speedometer2 a bad benchmark.
,
Aug 11
s/disabled/enabled/
,
Sep 3
,
Dec 5
Just to update this: we are investigating merging cwp + benchmark profiles so that we get the improved speedometer scores while keeping cwp goodness.
Comment 1 by bugdroid1@chromium.org
, Jul 10