Confusing data on rendering benchmark
Issue description

Trace file: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/accu_weather_2018_2018-09-28_01-35-58_76056.html

1. In the 'Metrics' tab, select 'renderingMetric' from the dropdown up above.
2. Click on the 'input_event_latency_tbmv2' metric to look at the histogram.

The data claims:

  count 13
  max 285.208 ms
  min 82.208 ms
  std 66.100 ms

However, the associated chart shows all 13 data points at 50.0 ms. This is pretty confusing. How should I interpret this data?

If I rerun the metric on this trace file (with the 90/95 percentile computation [1]), the metric reports 50 ms at both the 90th and 95th percentile, while the average is still ~181 ms. This is also very confusing to me. Any explanations of these results?

[1] https://chromium-review.googlesource.com/c/catapult/+/1274369
Oct 15
https://cs.chromium.org/chromium/src/third_party/catapult/tracing/tracing/metrics/rendering/latency.html?l=105&rcl=5aac72d05c7ed1238c420660d0786d98da9d73da

The histogram is created with bins from 0 to 50, so anything above 50 lands in the overflow bin.
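To spell out why the two views disagree, here is a minimal sketch (plain Python, not actual catapult code; the sample values are made up): summary stats like min/max/avg/std are computed from the raw samples, while the bar chart and percentiles only see which bin each sample landed in.

import statistics

samples_ms = [82.2, 110.0, 150.0, 285.2]  # hypothetical samples, all above 50 ms
bin_max_ms = 50.0                         # top bin boundary from latency.html

# Exact statistics, computed from the raw samples (what min/max/avg/std report).
print(min(samples_ms), max(samples_ms), statistics.mean(samples_ms))

# Binned view: every sample falls into the overflow bin [50, inf), so any
# percentile read off the bins comes out at ~50 ms, matching the 90/95%ile above.
overflow = [s for s in samples_ms if s >= bin_max_ms]
print('%d/%d samples are in the overflow bin' % (len(overflow), len(samples_ms)))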
Oct 15
Oh wow. Thanks dproy@! That explains what I am seeing. I guess we need to update the bin-buckets. Would it be possible to generate some sort of alert from telemetry if the overflow bucket is very large (perhaps optionally per-metric if we don't want to do this for all benchmarks/metrics)?
Oct 15
#4: That seems like a good idea. Maybe we can add an assertion that the overflow bucket cannot hold more than 10% of the total number of samples. It may break some benchmarks/metrics, but that's the good kind of breakage, imo.
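A rough sketch of what such a check could look like (hypothetical helper, not an existing catapult API; the 10% threshold is the one proposed above):

def check_overflow(name, bin_counts, threshold=0.10):
  """Complain if the last (overflow) bin holds more than `threshold` of all samples."""
  total = sum(bin_counts)
  overflow_fraction = bin_counts[-1] / total if total else 0.0
  if overflow_fraction > threshold:
    # Could be a hard assertion, or just a visible (non-fatal) warning.
    raise AssertionError(
        '%s: %.0f%% of samples are in the overflow bin; consider '
        'widening the bin boundaries.' % (name, 100 * overflow_fraction))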
Oct 15
What if the bins are fine for the majority of stories and the proposed assertion fires on just a few stories? Won't we make the histograms less useful if we broaden the bins? I wish we didn't have to hard-code bin boundaries and they were instead automatically computed for each benchmark run. I know that may be non-trivial :)
Oct 15
There are a number of metrics where we don't specify the bins, I believe (e.g. avg_surface_fps), and I think that works OK. But I think having fixed bins is useful, so that the bins don't move around between runs. If the bins remain stable, then the metric at different percentiles would also be more stable. We would likely need to reevaluate the bins if we feel the metric has moved enough to warrant a change there. The assertions for a large overflow bucket don't need to be fatal, as long as they are visible. But perhaps looking at the metrics at higher percentiles would also take care of that.
Oct 15
#7: sadrul@, in my experience with benchmarking infra, non-fatal reports about malformed state are rarely noticed by anyone :-/
Oct 15
#7: I think not specifying the boundaries just uses the default boundaries. It doesn't try to dynamically fit boundaries to the data: https://cs.chromium.org/chromium/src/third_party/catapult/tracing/tracing/value/histogram.py?l=875
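For illustration, dynamically fitting boundaries to the data (as wished for above) could look roughly like the following sketch; this is purely illustrative, not something histogram.py does, since it falls back to fixed defaults instead:

def fit_linear_boundaries(samples, num_bins=20):
  # Spread num_bins equal-width bins across the observed range of the data.
  lo, hi = min(samples), max(samples)
  width = (hi - lo) / num_bins or 1.0
  return [lo + i * width for i in range(num_bins + 1)]

# e.g. boundaries spanning ~82..285 ms instead of the hard-coded 0..50 ms:
print(fit_linear_boundaries([82.2, 110.0, 150.0, 285.2]))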
Oct 15
#9: Oh, hm. Thanks for pointing that out. We might want to look at the rest of the metrics and see if specifying bin-buckets for those would be useful.

#8: Good point (albeit unfortunate). I'd be curious to try it out and see how often they fire.
Oct 16
If you really want, you can force results.html and the Metrics side panel to dynamically fit bins to the data by setting binBoundaries: tr.v.HistogramBinBoundaries.SINGULAR. It already does this automatically for data that was not produced by TBM2 metrics, such as blink_perf.

However, there are several benefits to specifying static bin boundaries: stable percentiles, as you mention, but also a more stable and informative user experience. Metric authors can use static bin boundaries to communicate to users which values are relatively high or low. They also give users a chance to become familiar enough with the boundaries that they don't need to read the y-axes carefully every single time they open a Histogram bar chart.

When reviewing TBM2 metrics, I let authors use either the default boundaries or custom boundaries, because they are more familiar with the expected range and shape of the data than I am (linear, exponential, or more complex like memoryMetric), and it's easy to change bin boundaries without impacting anything except possibly percentile statistics.

If you run into cases where most of the data is in an overflow or underflow bin, then I recommend running a colab study to determine better bin boundaries. Basically: fetch a large number of Histograms from the chromeperf dashboard, collect all their sampleValues, set the new max boundary to the top 1%-5% of that, the new min boundary to the bottom 1%-5%, and eyeball the distribution to decide whether to use linear, exponential, or complex boundaries in between. Perezju@ and others have done a fair amount of colab work with the chromeperf dashboard, so there is plenty of help to be found. You can also just run the benchmark a few times with SINGULAR bins and go the "small data" route.

I'll go ahead and close this since it doesn't look like there's any work to be done, but please feel free to IM or schedule a VC. I'm always happy to chat about this kind of stuff. :-)
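For reference, the boundary-picking step of the colab study described above could look roughly like this (a sketch only; it assumes the Histogram dicts have already been fetched from the dashboard, which is omitted here):

import numpy as np

def suggest_min_max(histograms, low_pct=5, high_pct=95):
  # Pool sampleValues from many Histogram dicts and pick new min/max bin boundaries.
  values = np.concatenate([h['sampleValues'] for h in histograms])
  new_min, new_max = np.percentile(values, [low_pct, high_pct])
  return new_min, new_max

# Then eyeball the pooled distribution to decide between linear, exponential,
# or more complex boundaries in between.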
Oct 16
Post facto +Simon, +Sean FYI. TLDR: metrics are hard!
Oct 17
> If you run into cases where most of the data is in an overflow or underflow bin, then I recommend running a colab study to determine better bin boundaries. Basically, fetch a large number of Histograms from the chromeperf dashboard, collect all their sampleValues, set the new max boundary to the top 1%-5% of that, the new min boundary to the bottom 1%-5%, and eyeball the distribution to decide whether to use linear, exponential, or complex boundaries in between. Perezju@ and others have done a fair amount of colab work with the chromeperf dashboard, so there is plenty of help to be found.

Nice! Thanks! I will set aside a day (probably Friday) to do this for all the rendering metrics, since the current TBM2 boundaries are based on just a few local runs :)
Oct 17
Thank you! I have filed https://bugs.chromium.org/p/chromium/issues/detail?id=896103 for fixing the buckets.

I downloaded some trace files from the dashboard (using the timeseries2 API) and processed them locally to generate some results: http://springfield.wat.corp.google.com/stuff/fetch-history/all.html

It's kind of interesting, although I haven't really had any new insights yet. I still haven't processed all the data, though.