Improve FrameMetrics Efficiency
Issue description

FrameMetrics relies heavily on ConvertableToTraceFormat/TracedValue, so its cost increased from ~0.1 to ~3.9 after switching to Perfetto. Study the bottleneck of FrameMetrics reporting and improve its efficiency.
Comment 1 by tdres...@chromium.org, Sep 5

Comment 2, Sep 6
This was discovered by @vmiura. FrameMetrics also causes a 10-15% rendering regression on lower-end Android devices. I would like to use this bug to track improvements on those as well.
Comment 3, Sep 6
Comment 4, Sep 6
The cost of ConvertableToTraceFormat is the same currently; Perfetto just moves the cost to be synchronous on the trace event call, rather than asynchronous later. FrameMetrics is a pretty extreme case, though Perfetto will enable us to decrease the cost of TracedValue significantly by writing directly into a zero-alloc proto instead of the allocation-heavy Pickle-then-JSON path that's currently used.

Note that the microbenchmark which regressed was a pretty specific case: "thread_total_all_cpu_time_per_frame" on the wikipedia_2018 pageset. I don't think we saw much elsewhere. Does anyone have any details on exactly how this is calculated and what it means / how much of an impact it has?
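For context, the reporting pattern under discussion looks roughly like the following. This is a minimal sketch, not the actual frame_metrics.cc code; the function, argument names, and values are illustrative, and header paths vary across Chromium versions (older trees have TracedValue in base/trace_event/trace_event_argument.h):

```cpp
#include <memory>
#include <utility>

#include "base/trace_event/trace_event.h"
#include "base/trace_event/traced_value.h"

// Emits one trace event per frame with a dictionary argument. With legacy
// tracing, the TracedValue is pickled now and converted to JSON at flush;
// with the Perfetto backend (as of this discussion), the JSON conversion
// happens synchronously inside the macro call.
void ReportFrameStats(double latency_mean, double latency_rmse) {
  auto value = std::make_unique<base::trace_event::TracedValue>();
  value->BeginDictionary("latency");
  value->SetDouble("mean", latency_mean);
  value->SetDouble("rmse", latency_rmse);
  value->EndDictionary();
  TRACE_EVENT_INSTANT1("benchmark", "FrameMetrics", TRACE_EVENT_SCOPE_THREAD,
                       "stats", std::move(value));
}
```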
Comment 5, Sep 6
> The cost of ConvertableToTraceFormat is the same currently, Perfetto just moves the cost to be synchronous on the trace event call, rather than asynchronous later.

Wait... does this mean that calling TRACE_EVENT macros will cause strings, too, to be copied right away with the Perfetto-backed impl, e.g. for event names, categories, etc.? If so, that looks scary! Weren't TRACE_EVENT macros deliberately implemented in a way that delays these costs to flush()? Increasing the cost of the macros will limit their usefulness a lot, IMO.
Comment 6, Sep 6
#5: That's a different discussion than this issue, but: the cost of copying the strings barely even registers in the profiles; they're a small part of the cost of each trace event (ping me if you want to have a look). For the four days we had Perfetto on by default, this was the only regression, and there were other similar microbenchmark improvements: https://chromeperf.appspot.com/group_report?bug_id=880314 (regression caused by the revert). I suspect a lot of the latter improvements are due to the current tracing needing to grab a lock for every trace event emitted by a thread which doesn't have a MessageLoop (i.e. all of the sequenced worker pool worker threads).
Comment 7, Sep 6
Interesting! So the cost is considerable only if we pass large args to the macro. Thank you for the explanation.

About the thread_total_all_cpu_time_per_frame regression: it is CPU usage per animation frame, in other words CPU usage per second divided by the frame rate. Since the frame rate seems to be NOT affected by your change, the regression is really about increased CPU/second. It's not only on the wikipedia_2018 page; it happened on other pages too, e.g. amazon_2018. And it looks like it happens across browser and renderer processes, and in both main threads and compositor threads. Here are CPU usage graphs for all these cases that show the regression: https://chromeperf.appspot.com/report?sid=8e43ef3cb70fc5d34d705d6ff0f04577cae8ffc61b0186b69b6584acef08054b&start_rev=585171&end_rev=588430 The way it's computed is explained in src/third_party/catapult/tracing/tracing/metrics/rendering/cpu_utilization.html.
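A minimal sketch of how a per-frame CPU metric of this kind can be computed (illustrative only; the real implementation lives in cpu_utilization.html and aggregates CPU time across all threads of all Chrome processes):

```cpp
#include <vector>

// Illustrative: given per-thread CPU times (in ms) accumulated while the
// story ran, and the number of frames produced in that interval, compute
// CPU time per frame. A rise in this metric with an unchanged frame rate
// means CPU/second went up, which is the point comment 7 is making.
double CpuTimePerFrameMs(const std::vector<double>& thread_cpu_times_ms,
                         int frame_count) {
  double total_ms = 0;
  for (double t : thread_cpu_times_ms)
    total_ms += t;
  return frame_count > 0 ? total_ms / frame_count : 0;
}
```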
Comment 8, Sep 7
> Note that the microbenchmark which regressed was a pretty specific case: "thread_total_all_cpu_time_per_frame" on the wikipedia_2018 pageset. I don't think we saw much elsewhere. Does anyone have any details on exactly how this is calculated and what it means / how much of an impact it has?

It is not only one pageset; it impacted CPU time on just about all pages that the GPU team measures. On tests that are close to the edge, the overhead was large enough to reduce the frame rate of the tests, which is very undesirable.

> The cost of ConvertableToTraceFormat is the same currently, Perfetto just moves the cost to be synchronous on the trace event call, rather than asynchronous later.

How can we confirm this? thread_total_all_cpu_time_per_frame measures CPU time on all threads in Chrome processes. Do you know why we wouldn't be seeing the overhead on any thread, if it's taking the same amount of time currently?
Comment 9, Sep 7
Comment 10, Sep 7
The reason Perfetto is slow for frame metrics: https://cs.chromium.org/chromium/src/ui/latency/frame_metrics.cc?sq=package:chromium&dr=CSs&g=0&l=453

There are two problems with FrameMetrics:
1. It's slow with Perfetto on.
2. It's slow on some Android devices (https://bugs.chromium.org/p/chromium/issues/detail?id=876709 and https://bugs.chromium.org/p/chromium/issues/detail?id=875984).

I will re-assign this bug to @oysteine and create a new bug to track how to improve FrameMetrics performance on Android.
Comment 11, Sep 7
> It is not only one pageset, it impacted CPU time on just about all pages that the GPU team measures. On tests that are close to the edge, the overhead was large enough to reduce the frame rate of the tests which is very undesirable.

Ah, I thought it was both of the changes combined that were large enough to impact actual frame rates? I didn't see any reports from the ranges where Perfetto was enabled but the FrameMetrics change wasn't.

> How can we confirm this? The thread_total_all_cpu_time_per_frame measures CPU time on all threads in Chrome processes. Do you know why we wouldn't be seeing the overhead on any thread, if it's taking the same amount of time currently?

The metric just adds together the CPU time of the top-level slices in the trace while the story is running. With the old tracing, the ConvertableToTraceFormat cost happened after tracing was disabled, so it wouldn't show up. I'm doing another pass of full pprof profile runs now and can send you the results once I have them.
Comment 12, Sep 7
Hey, small question for everybody, I got a bit lost in the thread. My understanding is that switching to Perfetto moves around the time when CTTF args are expanded, and that skews metrics. The thing I haven't understood yet is whether:

A) microbenchmarks move up/down by a constant offset (per benchmark), but no other side effects are caused; or
B) this is so bad that it affects metrics in a non-constant / unpredictable way (e.g. FPS drops).

If B, I see the problem and we should make CTTF fast again. If A, is this really a problem at all? It's not going to impact users / Chrome; it's just tracing changing its side effects, and we should take that as the new baseline. This case would be no different from cases in the past where changes to tracing itself caused memory metrics to bump, which we solved by just acknowledging it and taking the new value as the new baseline.
Comment 13, Sep 7
My thoughts:

#11: I don't think we can say that because Perfetto is just shifting json/protobuf serialization costs from flush() to trace macros, we are fine. Trace flushing usually happens once, when we don't care much about performance, but trace macros are all over the code, in places where we usually do care about performance (that's typically why those parts are instrumented). The cost shift does matter if it's significant; e.g. it can prevent us from instrumenting the code the way the FrameMetrics CL is trying to do. If the implementation has to be the way it is now, can we estimate trace macro costs with --enable-perfetto and document them in the code? Something like: the cost of a trace macro with arg size smaller than x bytes is minor, i.e. comparable to the cost of just generating the timestamp for that event. For macros that have args larger than x, we should be careful (and file bugs for existing large trace events so the owners can decide whether to keep the current instrumentation, make it lighter, throttle it, get rid of it, ...).

#12: I think #8 means that B happened with the FrameMetrics CL. I agree that B is worse than A, but is A really not an issue? If we were slowing down the execution uniformly, I would understand it. However, we are slowing down random tasks by random amounts (depending on how heavily they're instrumented), which can have non-trivial interactions with deadlines, scheduling, .... Then enabling tracing to measure performance may not measure the real-life performance. This is of course only if the serialization cost is significant.
Comment 14, Sep 7
+1 to chiniforooshan's points. Both types of skew are not great. If it's (A) and it's within a reasonable amount, such as when we have to sample CPU time and save a bit of data, it's a cost we have to take. Even then it's something that skews tracing today, and we have to be careful about it. Current tracing with GPU benchmarking categories adds about 1 ms/frame. Some of it is not spread evenly because of the density of trace events, such as around tiny tasks.

In this case Perfetto's overhead was type (B). The impact looked large, to the point of affecting FPS, and uneven, in that code that saves more trace data got significantly more skewed. The change added over 8 ms/frame to total system CPU, whereas the total time of the workload plus current overhead [on wikipedia_2018] was 12 ms/frame. I don't think any of us wants perf tools to take anywhere near 66% of the workload under measurement.

Btw, the breakdown of CPU overhead with Perfetto on was about 50/50 between processing now happening inline inside trace events, and work now happening on IO threads or workers while we measure. The time inside trace events is worst because it lengthens critical paths, but the other time is also not ideal, as it adds extra system load that may de-schedule threads or affect thermals.

Also, this is not just about benchmarking; we also use tracing all the time for in-browser performance analysis, often with more categories enabled. Ideally those traces are not skewed much by tracing overhead either. My thought is that it's better for us and other Perfetto users to keep that in mind, and maybe follow the old tracing design model for flushing.
Comment 15, Sep 10
Ah, if we are talking about 66%, that's definitely no good.

> maybe follow the old tracing design model for flushing.

That causes too many problems: it requires one trace buffer per process, can't persist traces across navigation or crash, and can't bound the memory size of the trace buffer. Anyway, I don't think this is the real problem; let me explain.

Today, in legacy tracing, when you use TracedValue (as in the case of FrameMetrics) the following happens.

In the fast path:
- You build up the TracedValue.
- That, under the hood, serializes everything into a base::Pickle as you call BeginDictionary() / SetString(k, v) / ...
- That gets std::move()d next to the trace event in the buffer.

When the trace is flushed:
- The Pickle is deserialized and re-serialized as JSON.

Today, with the Perfetto cmdline switch, we are in this weird intermediate state, and this is likely what's hurting us here (to be confirmed):
- You build up the TracedValue.
- That, under the hood, serializes everything into a base::Pickle.
- The TracedValue is immediately serialized as JSON into the Perfetto buffer.

So now you have a base::Pickle serialization, a base::Pickle deserialization, and a JSON serialization in the fast path (!). This is a sad intermediate state due to the fact that we temporarily have to maintain both legacy and Perfetto tracing. I think we need to do better here even in the intermediate phase (more details below).

The long-term plan (once we get rid of legacy tracing) is to do neither of these, and to transform TracedValue into a shim proxy class that directly writes protobufs into the Perfetto buffer as you go (so those BeginDictionary(), SetStringValue(k, v) calls really become AppendKeyToProto(...) under the hood). The cost of directly writing a proto into the Perfetto shared buffer is going to be no worse than what base::TracedValue does today (if not better, given that IIRC TracedValue today can hit std::vector expansions). See the sketch below.

Proposed intermediate solution: we could add some code to TracedValue such that, under Perfetto mode, we bypass the Pickle generation and directly generate a protobuf in a dedicated buffer owned by the TracedValue (we should also do something similar to short-string optimization). When we then std::move() the object to the buffer, within TraceLog we'd just memcpy the already-serialized proto. This is still a tradeoff (there is a memcpy() in the middle) but, if multi-serialization is what's hurting us, it is very likely to make the problem go away (my theory/hope is that proto generation using protozero plus a memcpy might be as fast as TracedValue's base::Pickle).
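To illustrate the shape of that plan, here is a hedged sketch of such a shim. All names (ProtoTracedValue, AppendVarint, the field numbers) are hypothetical; a real implementation would use Perfetto's protozero writers over the shared buffer rather than this hand-rolled wire-format encoding:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical shim: instead of pickling values and re-serializing them as
// JSON later, each setter appends directly to a proto wire-format buffer
// that can be memcpy'd (or, with protozero, written in place) into the
// trace buffer.
class ProtoTracedValue {
 public:
  void SetString(const std::string& key, const std::string& value) {
    // Made-up field numbers; a real schema would define these.
    AppendLengthDelimited(/*field=*/1, key);
    AppendLengthDelimited(/*field=*/2, value);
  }

  const std::vector<uint8_t>& bytes() const { return buffer_; }

 private:
  void AppendLengthDelimited(uint32_t field, const std::string& payload) {
    AppendVarint((field << 3) | 2);  // Wire type 2 = length-delimited.
    AppendVarint(payload.size());
    buffer_.insert(buffer_.end(), payload.begin(), payload.end());
  }

  void AppendVarint(uint64_t v) {
    while (v >= 0x80) {
      buffer_.push_back(static_cast<uint8_t>(v | 0x80));
      v >>= 7;
    }
    buffer_.push_back(static_cast<uint8_t>(v));
  }

  std::vector<uint8_t> buffer_;
};
```

A single memcpy of bytes() into the trace buffer would then replace the Pickle-serialize / Pickle-deserialize / JSON-serialize chain described above.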
Comment 16, Sep 10
#15: I think you are talking about arguments of type TracedValue, not the more general ConvertableToTraceFormat type, which is the one used in the FrameMetrics CL. In the case of ConvertableToTraceFormat, if the implementation of AppendAsTraceFormat chooses to serialize into a base::Pickle, that happens at trace flushing, not on the fast path. E.g. the following happens when we are flushing traces: https://cs.chromium.org/chromium/src/ui/latency/frame_metrics.cc?sq=package:chromium&dr=CSs&g=0&l=406

So, for example, when you construct the CTTF arg on the fast path, you can just pass a few basic params and/or long-lived pointers to data structures, and compute the more expensive derived parameters and serialize them inside AppendAsTraceFormat, when flushing traces. A sketch of this pattern follows.
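A minimal sketch of the lazy pattern described above, assuming the base::trace_event::ConvertableToTraceFormat interface; FrameStatsSnapshot and its fields are illustrative, not the actual FrameMetrics code:

```cpp
#include <memory>
#include <string>

#include "base/strings/stringprintf.h"
#include "base/trace_event/trace_event.h"

// Stores only cheap-to-capture values at trace-event time; the expensive
// derived computation and string serialization happen in
// AppendAsTraceFormat(), which legacy tracing calls at flush time.
class FrameStatsSnapshot
    : public base::trace_event::ConvertableToTraceFormat {
 public:
  FrameStatsSnapshot(double latency_sum_us, int frame_count)
      : latency_sum_us_(latency_sum_us), frame_count_(frame_count) {}

  void AppendAsTraceFormat(std::string* out) const override {
    // Derived value computed lazily, off the fast path (under legacy tracing).
    double mean_us = frame_count_ ? latency_sum_us_ / frame_count_ : 0;
    base::StringAppendF(out, "{\"mean_latency_us\":%f,\"frames\":%d}",
                        mean_us, frame_count_);
  }

 private:
  const double latency_sum_us_;
  const int frame_count_;
};
```

Note that with the Perfetto backend, as discussed above, AppendAsTraceFormat runs synchronously at event time, which is exactly why this pattern stopped being cheap.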
Comment 17, Sep 10
#14: Can you confirm that the Perfetto change *without* the FrameMetrics change caused an FPS regression? I haven't been able to see that with local testing.
Comment 18, Sep 20
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/33d38175ff521d9b4046418b5601cf52b5e8b804

commit 33d38175ff521d9b4046418b5601cf52b5e8b804
Author: yiyix <yiyix@chromium.org>
Date: Thu Sep 20 22:05:53 2018

FrameMetrics: Reduce FrameMetrics reporting frequency

Enabling FrameMetrics caused a 5-20% regression on rendering (ref: https://bugs.chromium.org/p/chromium/issues/detail?id=876709). FrameMetrics reports back to tracing for each CompositorFrame submission with the following data: https://docs.google.com/document/d/1P1OKg3s5V2TwE8nTxLay01rn6JmpLgKDY00myyDuINs/edit?usp=sharing Reporting a stats summary for each frame is not very interesting, so I reduced the frequency of FrameMetrics reporting to once per second.

Bug: 881863, 880560
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;master.tryserver.blink:linux_trusty_blink_rel
Change-Id: I96ce7d5bed824aa8d86fa00311136362081df019
Reviewed-on: https://chromium-review.googlesource.com/1226012
Reviewed-by: Sadrul Chowdhury <sadrul@chromium.org>
Commit-Queue: Yi Xu <yiyix@chromium.org>
Cr-Commit-Position: refs/heads/master@{#592966}

[modify] https://crrev.com/33d38175ff521d9b4046418b5601cf52b5e8b804/ui/latency/frame_metrics.cc
[modify] https://crrev.com/33d38175ff521d9b4046418b5601cf52b5e8b804/ui/latency/frame_metrics_unittest.cc
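A hedged sketch of the kind of throttling this CL describes; the class and method names are illustrative, and the real change lives in ui/latency/frame_metrics.cc:

```cpp
#include "base/time/time.h"

// Illustrative throttle: instead of reporting a stats summary to tracing on
// every CompositorFrame submission, only report if at least one second has
// passed since the last report.
class FrameMetricsReporter {
 public:
  void OnFrameSubmitted(base::TimeTicks now) {
    if (last_report_.is_null() ||
        now - last_report_ >= base::TimeDelta::FromSeconds(1)) {
      ReportToTracing();  // Emits the (expensive) trace event argument.
      last_report_ = now;
    }
  }

 private:
  void ReportToTracing() { /* build the TracedValue and TRACE_EVENT here */ }

  base::TimeTicks last_report_;
};
```

This reduces the per-frame serialization cost to roughly once per second regardless of the frame rate.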
Comment 19, Oct 4
Comment 20, Nov 30
With http://crrev.com/612360 and http://crrev.com/1263354 performance looks like it's back to normal.
Comment 21, Nov 30
(Second one should be http://crrev.com/603768)