Telemetry: Measurements for loading metrics have empty values. |
|||||||||||||||||||||
Issue descriptionI ran tools/perf/run_benchmark run loading.desktop --browser=canary --reset-results --story-filter="AirBnB|REI" which generated the attached results file. I can see the following issues:- - Many measurements are empty, particularly those for cold cache temperatures. - Also, I have seen with other runs that measurements for the cold cache temperature can be faster than those of warm, for metrics like timeToOnload which probably doesn't make sense. - The documentation at https://github.com/catapult-project/catapult/blob/master/docs/metrics-results-ui.md doesn't detail what can cause the metrics to be empty, and whether this is to be expected. - Also, averages (and other stats) for metrics with empty values should indicate that the average is not over all the cases. (since this makes comparison of averages between different benchmark runs incorrect). I had reported the same issue earlier as well - https://groups.google.com/a/chromium.org/forum/#!topic/telemetry/ehNixItBW00 but it seems it's still not fixed. I verified that I had the fix patched in.
,
Feb 12 2018
Hi Karan, what is the root cause that you find that caused this? Also feel free to send out CL to me & I can route to other reviewers if needed.
,
Feb 12 2018
I am not really sure. I was just using telemetry for some local perf testing and trying to use the loading.desktop benchmark when I encountered the inconsistency. I am not really familiar with the telemetry code, so won't be able to own this. It would be nice if someone working on telemetry can take a look, especially whether the benchmarks are being recorded correctly.
,
Feb 12 2018
I reassign this to kouhei@, owner of the loading benchmark, to investigate.
,
Feb 22 2018
,
Feb 22 2018
Looks like the traces are missing the main frame's navigationStart events.
,
Feb 26 2018
ned, ssid: Would you help me on this? I'm currently stuck, and would want your help to work on this. TL;DR: An early trace event from renderer process, which is critical for computing those metrics, are missing. As in c#6, the bug is due to TRACE_EVENT logged on: https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/loader/DocumentLoadTiming.cpp?type=cs&l=113 missing from resulting traces. IIUC, this is caused from the following reasons: - Recently, PlzNavigate was enabled by default, which means that the event will happen a lot earlier than before. - At some point, start-up tracing from telemetry was changed to be configured via a json "config file". - The renderer process fails to pick up the json config file, since it doesn't have access to file system: -- The renderer process command-line has "--trace-config-file=/tmp/tmpSjTvdN/chrome-trace-config.json", but since the renderer process doesn't have access to file system, EnableStartTracingIfNeeded won't work: -- https://cs.chromium.org/chromium/src/components/tracing/common/trace_startup.cc?type=cs&q=EnableStartupTracingIfNeeded&sq=package:chromium&l=17 - The renderer process picks up the right filter config later, but at that time it is too late.
,
Feb 26 2018
+other folks: who can help with this tracing bug?
,
Feb 26 2018
I think +dproy@ has been digging into this.
,
Feb 26 2018
+chiniforooshan@ might be able to help here. I do not know what the right fix for this is yet, but I can confirm that the problem persists even when we start tracing through the chrome://tracing UI. A minimal test case is this: 1. Open a fresh chrome instance 2. Start tracing through chrome://tracing. Record 'blink.user_timing'. 3. Open a new tab, and navigate to www.google.com 4. Stop tracing. 5. Search for navigationStart in the recorded trace and see it's not there. If we now record a second trace and reload the google tab, we will now get the navigationStart event. It seems when we create a fresh renderer process, the initialization callback for the tracing coordinator[1] happens after SetNavigationStart[2] is called. We either need a way to start tracing immediate after the new renderer process is created, or have some sort of callback to emit the navigation start event retroactively when tracing actually starts. [1] https://cs.chromium.org/chromium/src/services/resource_coordinator/tracing/coordinator.cc?l=329-330&rcl=cb95a65980b204a38515c813f70da866e0328bee [2] https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/loader/DocumentLoadTiming.cpp?l=113-114&rcl=cb95a65980b204a38515c813f70da866e0328bee
,
Feb 26 2018
Issue 792497 has been merged into this issue.
,
Feb 26 2018
,
Feb 26 2018
,
Feb 26 2018
This is also a problem for extending runtimeStatsTotal metric to measure the startup time. We want to use the LoadExpectations to measure the startup time of the web pages. However on some pages (for ex: cnn on cold cache in v8.runtime_stats.top_25) the initial navigationStart is not reported and hence we don't have a LoadExpectation on that page.
,
Feb 26 2018
If a child is launched while tracing is enabled, the start tracing command is posted on the main thread as soon as the child gets a connection to the service manager. If we need earlier trace events, I think the launcher should append startup tracing switch to child processes command-line if the child is launched while tracing is enabled. I'll write a CL to test this approach. Let me know if you see any issues with this.
,
Feb 26 2018
Sounds good to me - thanks for digging in!
,
Feb 26 2018
I think some benchmarks rely on trace config file which contains more details about configuration than what can be specified in trace-startup. trace-startup only takes categories as argument. You might need to make sure you enable this passing flag via command line only for benchmarks that do not have special config set in the json. I would guess these are only memory-infra benchmarks. But I don't know what other benchmarks rely on it now. Also a possible solution might be to pass the entire config file through command line arg. The renderer process is unable read the trace config file due to sandbox restrictions. I think on desktop platforms the sandbox restrictions are put in place only after PreSandboxStartup() is done. So, it is possible that the renderer is be able to read the file before that. On Android, the renderer can read the file at any time (so shouldn't matter). So, it might be possible to read the file and store contents and turn on tracing in EnableStartupTracing(). It might also be possible to enable tracing before sandbox startup (not sure what might break).
,
Feb 28 2018
To #117: ~70% Telemetry benchmarks now use the trace-startup file since the TBM2 architecture doesn't allow hook to enabling tracing on the browser directly.
,
Mar 2 2018
This is blocking some loading metric correlation work which is pretty high priority - happy to provide details if you want. Upgrading to P1.
,
Mar 2 2018
,
Mar 2 2018
Ehsan - any thoughts on who should own this?
,
Mar 2 2018
I won't be able to spend much time on this, in the short term (not at least next week) due to other priorities. Is there a particular CL that has caused this issue that we can revert? +oysteine@ in case he is free.
,
Mar 2 2018
This was due to PlzNavigate, which, uh, we won't be reverting. :)
,
Mar 2 2018
,
Mar 2 2018
+Pasko: this would affect the startup benchmark as well.
,
Mar 7 2018
+clamy +arthursonzogni re tdresser: > This was due to PlzNavigate, which, uh, we won't be reverting. :) In what way is it due to PlzNavigate? Is the startup tracing working in renderers on desktop or that needs to be fixed too?
,
Mar 7 2018
Skimming through this, yes with PlzNavigate you won't get this trace event https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/loader/DocumentLoadTiming.cpp?type=cs&l=113 before commit time. Could you use this trace event from the browser process instead: https://cs.chromium.org/chromium/src/content/browser/frame_host/navigation_request.cc?type=cs&q=navigationrequest&l=466. This is called for all navigations when they start (aka after BeforeUnload)?
,
Mar 7 2018
Sorry for the confusion. While PlzNavigate made this issue always repro, it is the not one to blame. The root cause is that we are missing traces at renderer process start-up time. >c#27 Since there will be multiple top-level frame navigations and sub-frame navigations, this is quite complicated. It will be non-trivial, since we are currently using LocalFrame* to tie all DocumentLoadTimings for a particular navigation. (which is a hack, happy to discuss alternatives on this too)
,
Mar 7 2018
Is there a value (or a pointer) we can use as a UUID from both browser navreq and blink core/loader?
,
Mar 7 2018
c#28: Do you know what change caused missing renderer startup traces? Or, they were always missing in telemetry traces but now are needed due to PlzNavigate?
,
Mar 7 2018
The latter. They were always missing in telemetry traces but now are needed due to PlzNavigate? My guess is that the change to pass the trace categories in JSON is the direct cause of the issue.
,
Mar 20 2018
,
Mar 20 2018
I wrote a hack to test if startup tracing fixes this issue: crrev.com/c/972145 I get the attached result.html. kouhei@, karandeepb@, could you please check if it's OK or it has the same issue? (I see many 0ms numbers; I'm not sure if it's normal). Thanks.
,
Mar 21 2018
Re c#33, this seems to populate more values (particularly for cold cache temperature). Can you run these metrics for more stories and verify whether metrics like timeToOnLoad are always populated. The metric owners should be able to help verify if the results look sane.
,
Mar 21 2018
results.html from #33 LGTM.
,
Mar 22 2018
timeToFirstMeaningfulPaint entries are empty in the results.html from c#33. Is this expected for these pages?
,
Mar 22 2018
Good point, that is a bit suspicious. Any chance we can get some traces from these runs?
,
Mar 22 2018
My guess is the missing FMP is due to Issue 824761 . Working on a fix.
,
Mar 22 2018
#34: I ran with more stories and verified that timeToOnLoad is populated for them (result is attached). #37: A couple of traces are attached.
,
Mar 26 2018
I can still reproduce the problem using the steps in c#10 after using the patch. Do we have to specifically handle the case of tracing started through the UI?
,
Mar 26 2018
The patch fixes issues with --trace-startup and --trace-config-file, which should be enough for Telemetry. You won't get early traces when tracing is started through the UI (we never had this feature AFAIK). I'd suggest to open a new feature request bug for it if you need it. It should not be difficult to support (a few more command line flags?); but I'm going on a vacation this week :)
,
Mar 27 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f192e8a25e1937336d9cc16f9273b1d13b99530a commit f192e8a25e1937336d9cc16f9273b1d13b99530a Author: Ehsan Chiniforooshan <chiniforooshan@chromium.org> Date: Tue Mar 27 03:18:20 2018 Fix startup tracing in renderers Startup tracing is broken for several reasons: 1. When renderers are forked from a Zygote process, --trace-startup flags are ignored. 2. In a non-zygote mode, when --trace-config-file is used, it is ignored since sandboxed renderers do not have access to file system. 3. Another bug is that --trace-config-file is always passed to renderer processes, even if they are created after tracing is stopped, e.g. via DevTools. 4. Startup tracing flags are not passed along to the ppapi plugin and utility processes. This CL fixes 1, 3, and 4 and partially fixes 2 by passing categories and record mode of a trace config file using flags to sandboxed renderer processes so that they do not need to ready these information from a file. Obviously, this does not work for complex trace config files, but should be enough to resolve existing telemetry issues (please see discussions in crbug.com/809833 ). Bug: 809833 Change-Id: I6723664fd477117262763476c0a86fd6881b94a7 Reviewed-on: https://chromium-review.googlesource.com/972145 Reviewed-by: John Abd-El-Malek <jam@chromium.org> Reviewed-by: oysteine <oysteine@chromium.org> Commit-Queue: Ehsan Chiniforooshan <chiniforooshan@chromium.org> Cr-Commit-Position: refs/heads/master@{#545950} [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/base/trace_event/trace_config.cc [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/base/trace_event/trace_config.h [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/chrome/browser/tracing/background_tracing_field_trial_unittest.cc [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/components/tracing/common/trace_config_file.cc [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/components/tracing/common/trace_config_file.h [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/components/tracing/common/trace_startup.cc [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/components/tracing/common/trace_startup.h [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/components/tracing/common/tracing_switches.cc [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/components/tracing/common/tracing_switches.h [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/app/android/library_loader_hooks.cc [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/app/content_main_runner.cc [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/browser_child_process_host_impl.cc [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/browser_child_process_host_impl.h [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/browser_main_loop.cc [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/browser_main_runner.cc [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/devtools/protocol/tracing_handler.cc [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/gpu/gpu_process_host.cc [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/ppapi_plugin_process_host.cc [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/renderer_host/render_process_host_impl.cc [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/tracing/tracing_controller_impl.cc [modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/utility_process_host.cc
,
Mar 27 2018
,
Mar 27 2018
Ehsan: do you think it's possible to add some telemetry integration test coverage here to prevent this bug from happening again?
,
Mar 27 2018
I think there may be a larger story around telemetry integration testing. Deep has been doing some brainstorming on this. If there's something small we can do to handle this specific case, that sounds great, but we'll be kicking off a conversation on this more holistically in the not-too-distant future. |
|||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||
Comment 1 by karandeepb@chromium.org
, Feb 8 2018