
Issue 860029 link

Starred by 2 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug
Build-Toolchain




[afdo] investigate the gap of CWP and Benchmark based profiles

Project Member Reported by laszio@chromium.org, Jul 3

Issue description

There is a gap between CWP and benchmark-based profiles on some benchmarks, such as speedometer2. Although the benchmark profiles are overfitted to speedometer2, we are still interested in understanding the gap. This may also give us some hints on reducing CPU cycles in the field.

Some conjectures...
1. We suspect that profiles from ATOM boards are inferior to those from Core boards. This is a tracking bug for the investigation.
 
Project Member

Comment 1 by bugdroid1@chromium.org, Jul 10

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/cce271fd4fd3c207b1ea94876a465625fd4ea4c3

commit cce271fd4fd3c207b1ea94876a465625fd4ea4c3
Author: Ting-Yuan Huang <laszio@chromium.org>
Date: Tue Jul 10 03:11:14 2018

afdo: switch profile from silvermont to broadwell on candy

We suspect that profiles from Core boards are noticeably better than
profiles from ATOM boards. This is an experiment to verify the guess.

BUG=chromium:860029
TEST=candy tryjob uses broadwell profile
CQ-DEPEND=CL:1125208

Change-Id: Ic12dfc98a970232bb81c23de72dbed7a4605bc62
Reviewed-on: https://chromium-review.googlesource.com/1125217
Commit-Ready: Ting-Yuan Huang <laszio@chromium.org>
Tested-by: Ting-Yuan Huang <laszio@chromium.org>
Reviewed-by: Manoj Gupta <manojgupta@chromium.org>
Reviewed-by: Gabriel Marin <gmx@chromium.org>

[modify] https://crrev.com/cce271fd4fd3c207b1ea94876a465625fd4ea4c3/overlay-candy/make.conf

CWP profiles from big cores don't do well on silvermont, either. On speedometer2, the performance is the same as with silvermont profiles.
Summary: [afdo] investigate the gap of CWP and Benchmark based profiles (was: [afdo] Verify profile performance across different uarchs)
Cc: llozano@chromium.org
Luis asked whether the CWP profile simply doesn't work well on ToT. Unfortunately, that doesn't seem to be the case. The following comparison was done with code base 68.0.3440.89. "silvermont" uses profiles collected on 68.0.3440.76 in the field, and "benchmark" uses profiles collected on 68.0.3440.89 (same as the code base) using benchmarks. The images (3440.76 and 3440.89) used to collect profiles differ by 10 days, which is small considering that it's the end of beta. The 5.7% gap is still there.

Benchmark:  speedometer2;  Iterations: 5
                keys silvermont  (pass:5 fail:0)                   benchmark  (pass:5 fail:0)
                Keys   Amean  StdDev StdDev/Mean   Amean  StdDev StdDev/Mean GmeanSpeedup p-value
Total__summary (ms)  18873.41  81.12        0.4% 17863.51  37.11        0.2%        +5.7%    0.00
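
As a sanity check, the 5.7% figure can be approximately reproduced from the two Amean values in the table. This is only a sketch: the GmeanSpeedup column is presumably derived from geometric means of the per-iteration results, which are not shown here, so the arithmetic means are used as an approximation. The function name `speedup_percent` is made up for illustration.

```python
# Arithmetic means (ms) copied from the Total__summary row above.
silvermont_ms = 18873.41  # image built with the field (CWP) "silvermont" profile
benchmark_ms = 17863.51   # image built with the benchmark profile


def speedup_percent(baseline_ms: float, candidate_ms: float) -> float:
    """Return how much faster the candidate is than the baseline, in percent.

    Lower time is better, so the speedup is baseline / candidate - 1.
    """
    return (baseline_ms / candidate_ms - 1.0) * 100.0


gap = speedup_percent(silvermont_ms, benchmark_ms)
print(f"{gap:+.1f}%")  # prints "+5.7%"
```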

Just had a discussion with AutoFDO owner wmi@ and here is a list of checks:

1. Compare the perf counters (cycle counts, instruction counts, cache misses, etc.) of the images built with those two profiles, to see whether the gap is caused by inlining differences.

2. Compare the hot functions from those two profiles and calculate the distance between them. If they are very different, then the benchmark profile is probably overfitted to speedometer, and we may just ignore the regression. The difficulty is how to define "difference" and the acceptable threshold of regression.
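
One possible way to define the "distance" in item 2 is sketched below. It is an assumption for illustration, not something agreed on in this bug: each profile is normalized to per-function sample fractions, the overlap is the sum of the per-function minimum of the two fractions, and the distance is one minus the overlap. The function names and sample counts are hypothetical.

```python
from typing import Dict


def normalize(samples: Dict[str, int]) -> Dict[str, float]:
    """Convert raw per-function sample counts into fractions that sum to 1."""
    total = sum(samples.values())
    if total == 0:
        raise ValueError("profile has no samples")
    return {name: count / total for name, count in samples.items()}


def overlap(profile_a: Dict[str, int], profile_b: Dict[str, int]) -> float:
    """Sum, over functions present in both profiles, of the smaller fraction.

    1.0 means identical distributions; 0.0 means no hot function in common.
    """
    a = normalize(profile_a)
    b = normalize(profile_b)
    return sum(min(a[name], b[name]) for name in a.keys() & b.keys())


def distance(profile_a: Dict[str, int], profile_b: Dict[str, int]) -> float:
    """Distance between two profiles, in the range [0, 1]."""
    return 1.0 - overlap(profile_a, profile_b)


# Hypothetical sample counts, not taken from any real profile.
field_profile = {"f": 50, "g": 30, "h": 20}
bench_profile = {"f": 10, "g": 30, "x": 60}
print(f"distance = {distance(field_profile, bench_profile):.2f}")
```

A metric like this ignores inlining differences between the two binaries, so it is best applied to profiles where inlined frames have already been attributed consistently. The acceptable threshold is still an open question.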
Cc: wmi@google.com
Interestingly, the cycle profiles of speedometer2 collected on the two resulting images are very different. To make sure I didn't make a stupid mistake, I flashed, rebooted and ran more than 5 times, and the profiles look pretty stable. `perf report` checks the build ID, so it can't be using the wrong DSOs.

profile on image by benchmark profile:
percentage       #sample  process              function name
     1.67%          3011  chrome           [.] v8::internal::compiler::CodeGenerator::AssembleArchInstruction
     0.55%           983  TaskSchedulerFo  [.] v8::internal::Factory::NewRawOneByteString
     0.54%           975  chrome           [.] v8::internal::compiler::RepresentationSelector::VisitNode
     0.50%           899  chrome           [.] mojo::edk::WatcherSet::NotifyState
     0.44%           783  TaskSchedulerFo  [.] v8::internal::JSObject::SetNormalizedProperty
     0.43%           765  chrome           [.] extensions::(anonymous namespace)::GetOrCreateOrNull
     0.38%           681  chrome           [.] v8::internal::StringReplaceGlobalRegExpWithString
     0.37%           667  TaskSchedulerFo  [.] v8::internal::ConcurrentMarking::Run
     0.36%           641  chrome           [.] v8::internal::ParserBase<v8::internal::PreParser>::ParseStatement
     0.33%           598  chrome           [.] v8::internal::ParserBase<v8::internal::PreParser>::ParsePropertyName



profile on image by CWP:
     2.15%          3977  chrome           [.] vp8_decode_mb_row_no_filter 
     1.14%          2121  TaskSchedulerFo  [.] v8::internal::ParserBase<v8::internal::Parser>::ParsePrimaryExpression
     0.63%          1171  chrome           [.] blink::HTMLTokenizer::NextToken
     0.55%          1025  chrome           [.] v8::internal::ConcurrentMarking::Run
     0.40%           744  TaskSchedulerFo  [.] v8::internal::KeywordOrIdentifierToken
     0.40%           735  chrome           [.] quant_all_bands
     0.36%           660  chrome           [.] v8::internal::Scanner::Scan
     0.35%           654  chrome           [.] v8::internal::V8HeapExplorer::GetStrongGcSubrootName
     0.33%           616  chrome           [.] v8::internal::ParserBase<v8::internal::Parser>::ParsePrimaryExpression
     0.32%           591  chrome           [.] v8::internal::ParserBase<v8::internal::Parser>::ParseClassLiteral

Apparently something is wrong with the profile on the image built with the CWP profile: how would vp8 be used in speedometer? Although the profile on the image built with the benchmark profile looks reasonable, I don't know whether it is real, either.

Not sure if this is a problem in the debug symbols or in perf. However, if CWP profiles also suffer from the same thing, that may explain why there is a huge regression.
Why do you think something is wrong in the profile of image by CWP?
The CWP profile is with data from the field. Inlining decisions are likely different and what you see are leaf non-inlined functions.
It would be useful to see the profiles under pprof using flame graphs.
That would show inlined frames, and it may be even possible to compute a diff of the profiles.

These profiles were collected while running speedometer2. Under what circumstances would vp8 code be used?
I don't understand your question in c#11.
I thought the goal was to compare profiles collected while running speedometer2.
My point is that vp8 shouldn't even appear in the profile, yet it is the hottest one. 

BTW, expanding the inlined functions with perf report --inline:

     2.15%          3977  chrome           [.] vp8_decode_mb_row_no_filter
            vp8_decode_mb_row_no_filter (inline)
            decode_mb_row_no_filter (inline)
            inter_predict (inline)

Oh, vp8? I thought it was v8.

Yeah, I don't know if speedometer2 has any video encoding or decoding.
If not, then it looks like a problem. It may be a symbolization issue.
Hi Ting-Yuan,

I assume you convert perf.data to an autofdo profile when you generate the benchmark-based profile. If you can run the good and bad binaries, generate perf.data files based on the LBR event for them, and convert the perf.data files into autofdo profiles, we can do some performance analysis based on the autofdo profiles using the --dump_for_analysis option of /google/data/ro/teams/autofdo/dump_llvm_prof.

You can find more information about using that tool to analyze performance based on autofdo profile data here: https://critique.corp.google.com/#review/192699117
Although the tool is not very mature, in some cases it can make performance analysis a lot easier because it removes inlining differences.

By using that tool, it will be easy to find out whether vp8_decode_mb_row_no_filter is used in the binary built with the CWP-based profile.

Thanks,
Wei.
:) I'm trying to collect the LBR profiles to see if create_llvm_prof also has the problem.
It could be a bug in perf, although I'm still not sure whether afdo is affected. I tried:

1. Using pprof to dump the trace. vp8_decode_mb_row_no_filter doesn't show up.
2. Collecting the LBR profile on celes (the kernel doesn't support it on goldmont/reef). vp8_* doesn't show up in it. They do appear in `perf report`, although with lower frequency.
3. Dumping the raw trace with `perf report -D` and verifying the addresses with nm and addr2line. None of the samples fall in vp8_decode_mb_row_no_filter according to nm.

To be precise, `perf report` is inconsistent with pprof, create_llvm_prof and addr2line/nm. Some of them (or the debug info) must be wrong.
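
The nm cross-check in step 3 can be sketched roughly as follows. This is a minimal illustration, not the script actually used: it assumes nm's default "address type name" output, the symbol names and addresses are made up, and each sample is attributed to the nearest preceding text symbol because symbol sizes are not available in this format.

```python
import bisect
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical nm output; real input would come from running nm on the binary.
NM_OUTPUT = """\
0000000000001000 T func_a
0000000000001400 T func_b
0000000000002000 t func_c
                 U some_undefined_symbol
"""


def parse_nm(text: str) -> List[Tuple[int, str]]:
    """Return (start_address, name) pairs for text symbols, sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) != 3:
            continue  # undefined symbols have no address column
        addr, kind, name = parts
        if kind.lower() != "t":
            continue  # keep only symbols in the text (code) section
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols


def symbolize(symbols: List[Tuple[int, str]], address: int) -> Optional[str]:
    """Map an address to the closest symbol starting at or before it."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    return symbols[index][1] if index >= 0 else None


symbols = parse_nm(NM_OUTPUT)
# Hypothetical sample addresses, e.g. extracted from a raw trace dump.
sample_addresses = [0x1004, 0x13FF, 0x1400, 0x2010, 0x0800]
counts = Counter(symbolize(symbols, addr) for addr in sample_addresses)
for name, count in counts.most_common():
    print(name, count)
```

If a function such as vp8_decode_mb_row_no_filter never appears in the resulting counts while `perf report` lists it as the hottest symbol, the discrepancy points at symbolization rather than at the samples themselves.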

BTW, addr2line gave me lots of this error:
x86_64-cros-linux-gnu-addr2line: Dwarf Error: mangled line number section (bad file number).
On the other hand, the images on which we collect benchmark profiles differ from official builds in that a few features are disabled: afdo, thinlto, hugepage, cfi.

It may be worth collecting a benchmark profile on an official release build and comparing that with the current benchmark profile.
Cc: tianyou...@intel.com pan.d...@intel.com
Cc: g...@chromium.org
Alright, I tried a benchmark profile collected on an image with afdo, thinlto, hugepage, debug fission disabled, and the performance isn't affected. We are one step closer to declaring speedometer2 a bad benchmark.
s/disabled/enabled/
Owner: manojgupta@chromium.org
Just to update this, we are investigating merging cwp + benchmark profiles so that we get the improved speedometer scores while keeping cwp goodness.
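
The thread does not say how the merge would be done. One simple possibility, shown here purely as an assumption, is a weighted sum of the normalized per-function sample fractions of the two profiles. The function names, weights and sample counts below are hypothetical.

```python
from typing import Dict


def to_fractions(samples: Dict[str, int]) -> Dict[str, float]:
    """Normalize raw sample counts so the profile sums to 1."""
    total = sum(samples.values())
    if total == 0:
        raise ValueError("profile has no samples")
    return {name: count / total for name, count in samples.items()}


def merge_profiles(
    cwp: Dict[str, int], benchmark: Dict[str, int], benchmark_weight: float
) -> Dict[str, float]:
    """Blend two profiles; benchmark_weight is the share given to the benchmark."""
    if not 0.0 <= benchmark_weight <= 1.0:
        raise ValueError("benchmark_weight must be within [0, 1]")
    cwp_frac = to_fractions(cwp)
    bench_frac = to_fractions(benchmark)
    merged = {}
    for name in cwp_frac.keys() | bench_frac.keys():
        merged[name] = (
            (1.0 - benchmark_weight) * cwp_frac.get(name, 0.0)
            + benchmark_weight * bench_frac.get(name, 0.0)
        )
    return merged


# Hypothetical counts: the field profile dominates, the benchmark adds "c".
merged = merge_profiles({"a": 80, "b": 20}, {"a": 10, "c": 90}, benchmark_weight=0.25)
for name in sorted(merged):
    print(name, round(merged[name], 3))
```

Normalizing first keeps the much larger field profile from drowning out the benchmark profile regardless of how many raw samples each one contains.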
