New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 809833 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 688342
issue 809503



Sign in to add a comment

Telemetry: Measurements for loading metrics have empty values.

Project Member Reported by karandeepb@chromium.org, Feb 7 2018

Issue description

I ran

tools/perf/run_benchmark run loading.desktop   --browser=canary --reset-results --story-filter="AirBnB|REI"

which generated the attached results file.

I can see the following issues:-

- Many measurements are empty, particularly those for cold cache temperatures.
- Also, I have seen with other runs that measurements for the cold cache temperature can be faster than those of warm, for metrics like timeToOnload which probably doesn't make sense.
- The documentation at https://github.com/catapult-project/catapult/blob/master/docs/metrics-results-ui.md doesn't detail what can cause the metrics to be empty, and whether this is to be expected. 
- Also, averages (and other stats) for metrics with empty values should indicate that the average is not over all the cases. (since this makes comparison of averages between different benchmark runs incorrect).

I had reported the same issue earlier as well - https://groups.google.com/a/chromium.org/forum/#!topic/telemetry/ehNixItBW00 but it seems it's still not fixed. I verified that I had the fix patched in.


 
results.html
1.3 MB View Download
Hi, can this be followed up on? I was looking to do some perf testing for an extension API I was working on, and was looking to use telemetry (locally) for it. 
Components: -Tests>Telemetry Speed>Benchmarks
Hi Karan, what is the root cause that you find that caused this?

Also feel free to send out CL to me & I can route to other reviewers if needed.
I am not really sure. I was just using telemetry for some local perf testing and trying to use the loading.desktop benchmark when I encountered the inconsistency. I am not really familiar with the telemetry code, so won't be able to own this. It would be nice if someone working on telemetry can take a look, especially whether the benchmarks are being recorded correctly.
Owner: kouhei@chromium.org
Status: Assigned (was: Untriaged)
I reassign this to kouhei@, owner of the loading benchmark, to investigate.

Comment 5 by kouhei@chromium.org, Feb 22 2018

Status: Started (was: Assigned)

Comment 6 by kouhei@chromium.org, Feb 22 2018

Looks like the traces are missing the main frame's navigationStart events.

Comment 7 by kouhei@chromium.org, Feb 26 2018

Owner: ssid@chromium.org
Status: Assigned (was: Started)
ned, ssid: Would you help me on this? I'm currently stuck, and would want your help to work on this.

TL;DR: An early trace event from renderer process, which is critical for computing those metrics, are missing.

As in c#6, the bug is due to TRACE_EVENT logged on:
https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/loader/DocumentLoadTiming.cpp?type=cs&l=113
missing from resulting traces.

IIUC, this is caused from the following reasons:
- Recently, PlzNavigate was enabled by default, which means that the event will happen a lot earlier than before.
- At some point, start-up tracing from telemetry was changed to be configured via a json "config file".
- The renderer process fails to pick up the json config file, since it doesn't have access to file system:
-- The renderer process command-line has "--trace-config-file=/tmp/tmpSjTvdN/chrome-trace-config.json", but since the renderer process doesn't have access to file system, EnableStartTracingIfNeeded won't work:
-- https://cs.chromium.org/chromium/src/components/tracing/common/trace_startup.cc?type=cs&q=EnableStartupTracingIfNeeded&sq=package:chromium&l=17
- The renderer process picks up the right filter config later, but at that time it is too late.
Cc: primiano@chromium.org sullivan@chromium.org tdres...@chromium.org
Components: Speed>Tracing
+other folks: who can help with this tracing bug?
Cc: dproy@chromium.org
I think +dproy@ has been digging into this.

Comment 10 by dproy@chromium.org, Feb 26 2018

+chiniforooshan@ might be able to help here.

I do not know what the right fix for this is yet, but I can confirm that the problem persists even when we start tracing through the chrome://tracing UI. A minimal test case is this: 
1. Open a fresh chrome instance
2. Start tracing through chrome://tracing. Record 'blink.user_timing'. 
3. Open a new tab, and navigate to www.google.com
4. Stop tracing. 
5. Search for navigationStart in the recorded trace and see it's not there. 

If we now record a second trace and reload the google tab, we will now get the navigationStart event. 

It seems when we create a fresh renderer process, the initialization callback for the tracing coordinator[1] happens after SetNavigationStart[2] is called. We either need a way to start tracing immediate after the new renderer process is created, or have some sort of callback to emit the navigation start event retroactively when tracing actually starts.

[1] https://cs.chromium.org/chromium/src/services/resource_coordinator/tracing/coordinator.cc?l=329-330&rcl=cb95a65980b204a38515c813f70da866e0328bee

[2] https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/loader/DocumentLoadTiming.cpp?l=113-114&rcl=cb95a65980b204a38515c813f70da866e0328bee

Comment 11 by dproy@chromium.org, Feb 26 2018

Cc: fmea...@chromium.org benjhayden@chromium.org mythria@chromium.org
 Issue 792497  has been merged into this issue.

Comment 12 by dproy@chromium.org, Feb 26 2018

Cc: chiniforooshan@chromium.org
Blocking: 688342
This is also a problem for extending runtimeStatsTotal metric to measure the startup time. We want to use the LoadExpectations to measure the startup time of the web pages. However on some pages (for ex: cnn on cold cache in v8.runtime_stats.top_25) the initial navigationStart is not reported and hence we don't have a LoadExpectation on that page. 
Owner: chiniforooshan@chromium.org
Status: Started (was: Assigned)
If a child is launched while tracing is enabled, the start tracing command is posted on the main thread as soon as the child gets a connection to the service manager. If we need earlier trace events, I think the launcher should append startup tracing switch to child processes command-line if the child is launched while tracing is enabled.

I'll write a CL to test this approach. Let me know if you see any issues with this.
Sounds good to me - thanks for digging in!

Comment 17 by ssid@chromium.org, Feb 26 2018

I think some benchmarks rely on trace config file which contains more details about configuration than what can be specified in trace-startup. trace-startup only takes categories as argument. You might need to make sure you enable this passing flag via command line only for benchmarks that do not have special config set in the json. I would guess these are only memory-infra benchmarks. But I don't know what other benchmarks rely on it now.
Also a possible solution might be to pass the entire config file through command line arg.

The renderer process is unable read the trace config file due to sandbox restrictions. I think on desktop platforms the sandbox restrictions are put in place only after PreSandboxStartup() is done. So, it is possible that the renderer is be able to read the file before that. On Android, the renderer can read the file at any time (so shouldn't matter). So, it might be possible to read the file and store contents and turn on tracing in EnableStartupTracing(). It might also be possible to enable tracing before sandbox startup (not sure what might break).
Cc: perezju@chromium.org
To #117: ~70% Telemetry benchmarks now use the trace-startup file since the TBM2 architecture doesn't allow hook to enabling tracing on the browser directly.
Cc: vinamrata@chromium.org
Labels: -Pri-2 Pri-1
This is blocking some loading metric correlation work which is pretty high priority - happy to provide details if you want.

Upgrading to P1.
Owner: ----
Status: Available (was: Started)
Ehsan - any thoughts on who should own this?
Cc: -perezju@chromium.org oysteine@chromium.org
I won't be able to spend much time on this, in the short term (not at least next week) due to other priorities. Is there a particular CL that has caused this issue that we can revert?

+oysteine@ in case he is free.
This was due to PlzNavigate, which, uh, we won't be reverting. :)
Cc: perezju@chromium.org
Cc: pasko@chromium.org
+Pasko: this would affect the startup benchmark as well. 
Cc: arthurso...@chromium.org clamy@chromium.org
+clamy +arthursonzogni

re tdresser:
> This was due to PlzNavigate, which, uh, we won't be reverting. :)

In what way is it due to PlzNavigate? Is the startup tracing working in renderers on desktop or that needs to be fixed too?
Skimming through this, yes with PlzNavigate you won't get this trace event https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/loader/DocumentLoadTiming.cpp?type=cs&l=113 before commit time. Could you use this trace event from the browser process instead: https://cs.chromium.org/chromium/src/content/browser/frame_host/navigation_request.cc?type=cs&q=navigationrequest&l=466. This is called for all navigations when they start (aka after BeforeUnload)?
Sorry for the confusion. While PlzNavigate made this issue always repro, it is the not one to blame. The root cause is that we are missing traces at renderer process start-up time. 

>c#27
Since there will be multiple top-level frame navigations and sub-frame navigations, this is quite complicated.
It will be non-trivial, since we are currently using LocalFrame* to tie all DocumentLoadTimings for a particular navigation. (which is a hack, happy to discuss alternatives on this too)
Is there a value (or a pointer) we can use as a UUID from both browser navreq and blink core/loader?
c#28: Do you know what change caused missing renderer startup traces? Or, they were always missing in telemetry traces but now are needed due to PlzNavigate?
The latter. They were always missing in telemetry traces but now are needed due to PlzNavigate?

My guess is that the change to pass the trace categories in JSON is the direct cause of the issue.
Blocking: 809503
Owner: chiniforooshan@chromium.org
Status: Started (was: Available)
I wrote a hack to test if startup tracing fixes this issue: crrev.com/c/972145

I get the attached result.html. kouhei@, karandeepb@, could you please check if it's OK or it has the same issue? (I see many 0ms numbers; I'm not sure if it's normal).

Thanks.
results.html
1.4 MB View Download
Re c#33, this seems to populate more values (particularly for cold cache temperature). Can you run these metrics for more stories and verify whether metrics like timeToOnLoad are always populated. 

The metric owners should be able to help verify if the results look sane.
results.html from #33 LGTM.
timeToFirstMeaningfulPaint entries are empty in the results.html from c#33. Is this expected for these pages?
Good point, that is a bit suspicious.

Any chance we can get some traces from these runs?

Comment 38 by dproy@chromium.org, Mar 22 2018

My guess is the missing FMP is due to  Issue 824761 . Working on a fix. 
#34: I ran with more stories and verified that timeToOnLoad is populated for them (result is attached).

#37: A couple of traces are attached.
results.html
1.4 MB View Download
Baidu_2018-03-22_11-20-46_64568.html
3.9 MB View Download
Economist_2018-03-22_11-23-19_63779.html
6.4 MB View Download

Comment 40 by dproy@chromium.org, Mar 26 2018

I can still reproduce the problem using the steps in c#10 after using the patch. Do we have to specifically handle the case of tracing started through the UI? 
The patch fixes issues with --trace-startup and --trace-config-file, which should be enough for Telemetry. You won't get early traces when tracing is started through the UI (we never had this feature AFAIK).

I'd suggest to open a new feature request bug for it if you need it. It should not be difficult to support (a few more command line flags?); but I'm going on a vacation this week :)
Project Member

Comment 42 by bugdroid1@chromium.org, Mar 27 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f192e8a25e1937336d9cc16f9273b1d13b99530a

commit f192e8a25e1937336d9cc16f9273b1d13b99530a
Author: Ehsan Chiniforooshan <chiniforooshan@chromium.org>
Date: Tue Mar 27 03:18:20 2018

Fix startup tracing in renderers

Startup tracing is broken for several reasons:

1. When renderers are forked from a Zygote process, --trace-startup
   flags are ignored.

2. In a non-zygote mode, when --trace-config-file is used, it is
   ignored since sandboxed renderers do not have access to file
   system.

3. Another bug is that --trace-config-file is always passed to
   renderer processes, even if they are created after tracing is
   stopped, e.g. via DevTools.

4. Startup tracing flags are not passed along to the ppapi plugin
   and utility processes.

This CL fixes 1, 3, and 4 and partially fixes 2 by passing categories
and record mode of a trace config file using flags to sandboxed
renderer processes so that they do not need to ready these information
from a file. Obviously, this does not work for complex trace config
files, but should be enough to resolve existing telemetry issues
(please see discussions in  crbug.com/809833 ).

Bug:  809833 
Change-Id: I6723664fd477117262763476c0a86fd6881b94a7
Reviewed-on: https://chromium-review.googlesource.com/972145
Reviewed-by: John Abd-El-Malek <jam@chromium.org>
Reviewed-by: oysteine <oysteine@chromium.org>
Commit-Queue: Ehsan Chiniforooshan <chiniforooshan@chromium.org>
Cr-Commit-Position: refs/heads/master@{#545950}
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/base/trace_event/trace_config.cc
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/base/trace_event/trace_config.h
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/chrome/browser/tracing/background_tracing_field_trial_unittest.cc
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/components/tracing/common/trace_config_file.cc
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/components/tracing/common/trace_config_file.h
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/components/tracing/common/trace_startup.cc
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/components/tracing/common/trace_startup.h
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/components/tracing/common/tracing_switches.cc
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/components/tracing/common/tracing_switches.h
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/app/android/library_loader_hooks.cc
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/app/content_main_runner.cc
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/browser_child_process_host_impl.cc
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/browser_child_process_host_impl.h
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/browser_main_loop.cc
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/browser_main_runner.cc
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/devtools/protocol/tracing_handler.cc
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/gpu/gpu_process_host.cc
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/ppapi_plugin_process_host.cc
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/renderer_host/render_process_host_impl.cc
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/tracing/tracing_controller_impl.cc
[modify] https://crrev.com/f192e8a25e1937336d9cc16f9273b1d13b99530a/content/browser/utility_process_host.cc

Status: Fixed (was: Started)
Ehsan: do you think it's possible to add some telemetry integration test coverage here to prevent this bug from happening again?
I think there may be a larger story around telemetry integration testing. Deep has been doing some brainstorming on this. If there's something small we can do to handle this specific case, that sounds great, but we'll be kicking off a conversation on this more holistically in the not-too-distant future.

Sign in to add a comment