103.5% regression in smoothness.sync_scroll.key_mobile_sites_smooth at 487936:488051
Issue description: See the link to graphs below.
Jul 27 2017
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8972959419037463728
Jul 27 2017
=== Auto-CCing suspected CL author bokan@chromium.org ===
Hi bokan@chromium.org, the bisect results pointed to your CL, please take a look at the results.

=== BISECT JOB RESULTS ===
Perf regression found with culprit

Suspected Commit
  Author  : David Bokan
  Commit  : 6933fe8a1b1d14e0d36a69cb27ef026217595d95
  Date    : Wed Jul 19 22:39:54 2017
  Subject : Remove inert-visual-viewport flag (enabled by default)

Bisect Details
  Configuration : android_nexus6_perf_bisect
  Benchmark     : smoothness.sync_scroll.key_mobile_sites_smooth
  Metric        : first_gesture_scroll_update_latency/Wikipedia (1 tab) - delayed scroll start
  Change        : 85.87% | 10.2928333333 -> 19.1313333333

  Revision          Result               N
  chromium@487935   10.2928 +- 5.43773   6   good
  chromium@487993   10.0962 +- 4.49807   6   good
  chromium@488008   11.0813 +- 7.99643   6   good
  chromium@488015   9.27817 +- 3.79522   6   good
  chromium@488017   10.4223 +- 9.39564   9   good
  chromium@488018   17.4914 +- 8.23915   9   bad   <--
  chromium@488019   17.6368 +- 8.91969   9   bad
  chromium@488022   18.4819 +- 8.36105   9   bad
  chromium@488051   19.1313 +- 2.59009   6   bad

To Run This Test
  src/tools/perf/run_benchmark -v --browser=android-chromium --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=Wikipedia..1.tab....delayed.scroll.start smoothness.sync_scroll.key_mobile_sites_smooth

More information on addressing performance regressions: http://g.co/ChromePerformanceRegressions
Debug information about this bisect: https://chromeperf.appspot.com/buildbucket_job_status/8972959419037463728
For feedback, file a bug with component Speed>Bisection
Jul 27 2017
Repro'd locally, taking a look today.
Jul 27 2017
Tim: I'm having a hard time understanding where the numbers are coming from here; I'm wondering if you could help shed some light. I've attached screenshots of the trace (from the bots) for 488051 (bad - 18.865ms) and 487935 (good - 9.791ms). To me it seems that the first GSU latency is equal between the two, and it's the following GSUs that are taking longer in the bad trace. Am I looking at the wrong thing?

Also, without getting more trace data it'll be hard to figure out what the difference is. I've tried adding --profiler=trace to run_benchmark, but that throws an exception (presumably because tracing is already being run to collect this data). I've noticed I can add categories using --extra-chrome-categories, but I can't find where/if the tracing data is being saved, so I can't get at it. Do you know how this is supposed to work? Any debugging tips would be much appreciated. Thanks.
Jul 27 2017
Yeah, getting a trace and the metrics in the same run will be pretty necessary here. It's annoying, but here's what I do:

  tools/perf/run_benchmark --browser=android-chrome-dev smoothness.top_25_smooth --output-format=json --output-dir=~/foo

Then, in timeline_based_measurement.py (third_party/catapult), in Options.__init__, set overhead_level = DEBUG_OVERHEAD_LEVEL.
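To spell out that second step, the edit looks roughly like the snippet below. This is only a sketch: the real Options.__init__ in third_party/catapult/telemetry/telemetry/web_perf/timeline_based_measurement.py has a fuller signature, and the constant values shown here are assumed rather than copied from the file.

  # timeline_based_measurement.py (sketch, not the real file contents)
  LOW_OVERHEAD_LEVEL = 'low-overhead'      # assumed constant names/values
  DEBUG_OVERHEAD_LEVEL = 'debug-overhead'

  class Options(object):
    def __init__(self, overhead_level=LOW_OVERHEAD_LEVEL):
      # Local debugging hack: force the debug overhead level so the benchmark
      # run records a full trace alongside the computed metrics.
      overhead_level = DEBUG_OVERHEAD_LEVEL
      self._overhead_level = overhead_level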
Jul 27 2017
Once you've got the traces, can you attach them on the bug?
Jul 27 2017
Awesome, thanks Tim - traces attached. Locally it seems that the difference on the first GSU latency is more stark so maybe I just got unlucky with traces from the dashboard. Still not sure why the numbers in the metric are detached from the length of the first GSU latency though... Can't see yet what difference my change made but I'll dig in a little deeper on Monday (I'm off tomorrow).
Jul 28 2017
The way the trace event is drawn doesn't quite match the way we compute the duration for the metric, if I recall correctly. Code for computing this metric is here: https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/web_perf/metrics/rendering_stats.py?type=cs&q=gesture_scr+file:%5C.py&sq=package:chromium&l=30

The end timestamp is INPUT_EVENT_GPU_SWAP_BUFFER_COMPONENT. The start timestamp is INPUT_EVENT_LATENCY_ORIGINAL_COMPONENT.

  Good: 164322450046 - 164322437000 = 13046
  Bad:  164057305358 - 164057284000 = 21358

That's about the same as the difference in slice duration... It would be worth checking which delta between components is causing the change.
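To make that arithmetic concrete, the computation has this shape (an illustrative standalone sketch, not the actual rendering_stats.py code; the dict-of-timestamps representation is assumed):

  # Gesture-scroll-update latency as computed above: GPU swap timestamp minus
  # original input timestamp. Timestamps are in microseconds.
  ORIGINAL = 'INPUT_EVENT_LATENCY_ORIGINAL_COMPONENT'
  GPU_SWAP = 'INPUT_EVENT_GPU_SWAP_BUFFER_COMPONENT'

  def gsu_latency_ms(components):
    """components maps latency component name -> timestamp in microseconds."""
    return (components[GPU_SWAP] - components[ORIGINAL]) / 1000.0

  good = gsu_latency_ms({ORIGINAL: 164322437000, GPU_SWAP: 164322450046})
  bad = gsu_latency_ms({ORIGINAL: 164057284000, GPU_SWAP: 164057305358})
  # good ~= 13.0 ms, bad ~= 21.4 ms, matching the delta quoted above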
Aug 1 2017
Ok, I've figured out what's going on. This is indeed caused by innerWidth/innerHeight reporting the layout viewport size rather than the visual viewport size.

In the Wikipedia case, the page initially loads into a larger viewport. You can verify this because the page is initially able to zoom out. As it loads, the minimum-scale eventually becomes 1, so the layout viewport becomes device-width (this can happen either because content is removed so it no longer overflows device-width, or because minimum-scale is set explicitly in the viewport meta tag; I suspect the latter here).

What appears to be happening is that the page checks innerWidth/innerHeight during loading and uses that to decide whether the user is on a tablet, showing a different UI if so (I think just that the headings are all loaded expanded). Since we now report the layout viewport for innerWidth/innerHeight, the page sees the intermittent large layout viewport and thinks it's on a tablet. This caused more work, and the buffer swaps on the GPU got pushed back (the traces show that in "tablet mode" the swaps were happening halfway into the next frame, rather than right at the beginning), so the GSU doesn't get shown until the next frame, hence the large delay. I suspect the page has since been fixed, as I don't see the same problem when I visit the live site today.

I think the fix here is to re-record the WPR? Tim, do you know what the process is for updating these?

There's a bit of a compat concern here, since this could come up on other pages, but I expect it not to be too bad: we've already shipped this for a milestone and haven't seen it in the wild. Hopefully this kind of viewport trickery is rare, since it relies on timing and there are much better ways to do this (e.g. media queries).
Aug 2 2017
Docs on updating wpr recordings are here: https://www.chromium.org/developers/telemetry/record_a_page_set. Your analysis SGTM, rerecording is the right approach.
Aug 3 2017
Ok, so I created a new recording and, though there's no difference based on inert anymore, it seems that the page now has some heavier JS on it... the first GSU latency went way up (90-130ms), since we're inside a long-running JS function (trace attached) when the first GSU comes in. This has made the metric much noisier. Additionally, there's a hiccup every ~700ms as a JS timer fires, which will probably reduce reliability on the other metrics. Given that, it seems like a bad trade to me, so I'm inclined to just leave the current recording and use the new number as the baseline. WDYT?
Aug 3 2017
+Ned I'd be inclined to perform the update. If the web gets worse, we still want our benchmarks to reflect the current state of the web. We have lots of noise vs representativeness tradeoffs, and much of the time I side with getting rid of noise, but in this case, I think representativeness should win out. Ned, what do you think?
Aug 3 2017
+1 to perform the update. Though please do it with the --use-wpr-go flag (we are migrating to wprgo - bit.ly/wpr-go-project). I agree that representativeness should win out, as this affects how we optimize Chrome code. We'd be better off addressing noise later by improving the way we compute metrics, tuning CPU cores, repeating the test more times...
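For reference, the invocation being suggested would look something like this (the story-set name is a guess for illustration only; the record_a_page_set doc linked above has the authoritative steps):

  tools/perf/record_wpr --browser=android-chromium --use-wpr-go key_mobile_sites_page_set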
Aug 3 2017
Yeah, I thought about that but wasn't sure where we draw the line, so I figured you guys might have more insight. I'm off tomorrow, but I'll update the recording after the (long) weekend. Thanks for the help!
Aug 15 2017
Hey Ned, I tried using --use-wpr-go but it says that's already the default (perhaps it's been flipped in the meantime). However, I'm running into issue 755654. It looks like it doesn't like the fact that the existing archives are .wpr while it's trying to save a .wprgo. How should I proceed?
Aug 15 2017
I answered in issue 755654
Aug 23 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/2331c0f6e3a7a6df0883486ac70a04f2bb539e8c

commit 2331c0f6e3a7a6df0883486ac70a04f2bb539e8c
Author: David Bokan <bokan@chromium.org>
Date: Wed Aug 23 13:45:31 2017

Update Wikipedia story in key_mobile_sites

The recording of this was old and causing the site to "misload" when the inert-visual-viewport feature was turned on. The recording has been uploaded to cloud storage. This patch simply updates the hash to match the uploaded file.

Bug: 749357
Change-Id: I2a877d5f45eab5e7597f3389dda833dc27c2b72d
Reviewed-on: https://chromium-review.googlesource.com/621295
Commit-Queue: David Bokan <bokan@chromium.org>
Reviewed-by: Ned Nguyen (ooo til 08/30) <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#496671}

[modify] https://crrev.com/2331c0f6e3a7a6df0883486ac70a04f2bb539e8c/tools/perf/page_sets/data/key_mobile_sites_009.wprgo.sha1
Aug 25 2017
Ok, I re-recorded the benchmark; it looks like the first_gesture_scroll_update latency came back to normal, though so far it seems a tad higher than before my patch. That's expected since, as I mentioned above, the re-recorded page seems to be doing more work than before: https://chromeperf.appspot.com/report?sid=430e14d43cffd9e26a403cf9c3ffae7d243ad454b1d99303b148a71b6378c6ed&start_rev=358213&end_rev=497340

I've checked some of the other tests; there's a little movement, both better and worse, due to the re-recording, but overall it seems fine to me. I'm going to close this bug.