Summary: 46.2% regression in system_health.common_desktop load:energy_sum at 453302:453462 (was: 46.2% regression in system_health.common_desktop at 453302:453462)
=== Auto-CCing suspected CL author rogerm@chromium.org ===
Hi rogerm@chromium.org, the bisect results pointed to your CL, please take a look at the results.
=== BISECT JOB RESULTS ===
Perf regression found with culprit
Suspected Commit
Author : rogerm
Commit : 2233a4a7cb12ac4acdebd33491b528f8fb670f02
Date : Mon Feb 27 23:11:16 2017
Subject: [translate] Add translate ranker model loader.
Bisect Details
Configuration: winx64_high_dpi_perf_bisect
Benchmark : system_health.common_desktop
Metric : load:energy_sum/load_games/load_games_alphabetty
Change : 29.88% | 12.6947926833 -> 16.4879936732
Revision Result N
chromium@453301 12.6948 +- 5.88553 9 good
chromium@453342 11.8021 +- 2.63423 6 good
chromium@453362 11.9451 +- 2.79266 6 good
chromium@453372 12.0077 +- 2.64958 6 good
chromium@453377 11.8436 +- 2.02372 6 good
chromium@453378 17.7408 +- 4.55889 6 bad <--
chromium@453379 15.56 +- 4.40185 6 bad
chromium@453380 17.2375 +- 0.913405 6 bad
chromium@453382 17.2664 +- 0.868382 6 bad
chromium@453462 16.488 +- 4.12111 9 bad
To Run This Test
src/tools/perf/run_benchmark -v --browser=release_x64 --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=load.games.alphabetty system_health.common_desktop
Debug Info
https://chromeperf.appspot.com/buildbucket_job_status/8986279781382214048
Is this bisect wrong?
https://chromeperf.appspot.com/bad_bisect?try_job_id=5570720567918592
| O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq
| X | for more information addressing perf regression bugs. For feedback,
| / \ | file a bug with component Speed>Bisection. Thank you!
Trying to convince myself that this is a real regression and not just an instance of the first contentful paint metric acting funny before I go and ruin someone's day tomorrow.
I took and attached screenshots with the load sections of the trace highlighted before and after the regression. The start of the load is when Telemetry tells the page to navigate and the end of the load is the first contentful paint.
In both cases, there's a burst of activity on the renderer thread after the load finishes. This is a signal to me that first contentful paint correctly identifies when there's content available to paint. In before_regression.png, that load duration is 0.85 seconds. In after_regression.png, it is 1.57 seconds. That's a pretty serious difference.
I had a quick chat with Annie.
Philippe, please revert the Finch config that turns this on. We can look at changing the post task to a delayed post task to better ensure that the model load occurs outside the TTFCP.
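For illustration only, a minimal sketch of the suggested change, assuming the loader is currently kicked off with a plain PostTask; the LoadTranslateRankerModel name and the delay value are hypothetical, not the actual translate code:

    #include "base/bind.h"
    #include "base/location.h"
    #include "base/threading/thread_task_runner_handle.h"
    #include "base/time/time.h"

    // Hypothetical entry point for the (potentially expensive) model load.
    void LoadTranslateRankerModel() { /* parse and cache the model */ }

    void ScheduleModelLoad() {
      // Instead of PostTask(...), delay the work so it is less likely to land
      // inside the TTFCP window. The 10-second delay is illustrative only.
      base::ThreadTaskRunnerHandle::Get()->PostDelayedTask(
          FROM_HERE, base::Bind(&LoadTranslateRankerModel),
          base::TimeDelta::FromSeconds(10));
    }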
We have a variety of loading metrics, and pushing out work so that it happens after FCP is just going to push back later loading metrics.
Is there a design doc for this change? When is the model actually needed?
The magnitude of the regression is much bigger than I would have anticipated given what this is doing.
Given that there's a power regression here as well as a page load time regression, it seems likely that rescheduling this work won't be enough to make this tradeoff worthwhile - we may need to do less work.
Have you done any profiling of this feature to figure out where the time is going?
rogerm@, who knows more about this, is OOO for a couple of weeks. I'll try to investigate what could cause this regression. In the meantime, I am preparing a revert of the revert, with a gate that controls the ModelLoader with Finch. This Finch experiment is not active by default (and is not active in Beta or Stable), so it should not cause any regression.
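Roughly what such a Finch gate could look like, as a sketch; the feature name and LoadTranslateRankerModel are made up for illustration and are not the actual translate ranker code:

    #include "base/feature_list.h"

    // Hypothetical feature, disabled by default so neither the perfbots nor end
    // users run the loader until the Finch experiment turns it on.
    const base::Feature kTranslateRankerModelLoading{
        "TranslateRankerModelLoading", base::FEATURE_DISABLED_BY_DEFAULT};

    void LoadTranslateRankerModel();  // Hypothetical loader entry point.

    void MaybeStartModelLoad() {
      if (!base::FeatureList::IsEnabled(kTranslateRankerModelLoading))
        return;
      LoadTranslateRankerModel();
    }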
The graphs are jumping back up after your CL landed.
Is this experiment enabled in https://cs.chromium.org/chromium/src/testing/variations/fieldtrial_testing_config.json ?
That is the file the perfbots use for testing. The intent is to test the experiments we plan to launch. If we don't plan to launch this soon, it should be removed from the file. If it's not in the file, there is a problem with the way you've put the code behind an experiment.
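For reference, entries in that file map a study name to the group the bots should enable; a sketch of the general shape with illustrative study/feature names (the real entry and the exact schema may differ):

    {
      "TranslateRankerModelLoading": [
        {
          "group_name": "Enabled",
          "enable_features": ["TranslateRankerModelLoading"]
        }
      ]
    }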
It looks like the regression is still there. My guess is that the problem happens when building the TranslateRanker. This is not an easy fix. I will revert the CL again. Sorry about that.
After the second revert, the graphs are still indicating an increase in TTFCP. If the CL was really the culprit, we should see the graph go back down.
Which branch are these tests run on? If they are run from a branch, can we merge 34f5f9f1efc1283d1bb7c284ab0798ac842265fd into it and confirm that the regression has been fixed?
In general, I'd like to understand better how these tests are run. Is there documentation you could point me to? Can we run them locally? It would be useful to run these locally to test a fix before committing.
This is really, really strange:
https://chromeperf.appspot.com/group_report?bug_id=697665
There is a clear jump at the original CL, r453378
There is a clear dip at the revert, r454263
There is a clear jump at the reland, r454433
There is no dip at r454595, which removes the experiment from the testing configs.
There is no dip at r455280, which reverts entirely.
I'm going to try and kick off another bisect in the reland range, but I can't understand how the tests don't go back down.
We are running tests on ToT, not on a release branch.
You can use the command in the bisect to run the tests from chromium src dir:
src/tools/perf/run_benchmark -v --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=load.games.alphabetty system_health.common_desktop
=== Auto-CCing suspected CL author hamelphi@chromium.org ===
Hi hamelphi@chromium.org, the bisect results pointed to your CL, please take a look at the results.
=== BISECT JOB RESULTS ===
Perf regression found with culprit
Suspected Commit
Author : hamelphi
Commit : d005809c83c0ad0b9cbbda6b10b7ed2c3dcd5120
Date : Thu Mar 02 23:38:24 2017
Subject: [Translate] Add translate ranker model loader.
Bisect Details
Configuration: winx64_10_perf_bisect
Benchmark : page_cycler_v2.intl_hi_ru
Metric : timeToFirstContentfulPaint_avg/pcv1-cold/http___ru.wikipedia.org_
Change : 137.83% | 255.971666664 -> 608.765833338
Revision Result N
chromium@454356 255.972 +- 13.8194 6 good
chromium@454413 243.983 +- 34.8348 6 good
chromium@454427 251.271 +- 12.9978 6 good
chromium@454431 286.335 +- 185.577 6 good
chromium@454432 251.532 +- 7.89917 6 good
chromium@454433 605.756 +- 6.571 6 bad <--
chromium@454434 608.404 +- 6.69118 6 bad
chromium@454441 609.951 +- 19.8206 6 bad
chromium@454469 608.766 +- 10.4639 6 bad
To Run This Test
src/tools/perf/run_benchmark -v --browser=release_x64 --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=http...ru.wikipedia.org. page_cycler_v2.intl_hi_ru
Debug Info
https://chromeperf.appspot.com/buildbucket_job_status/8985468822100023264
Is this bisect wrong?
https://chromeperf.appspot.com/bad_bisect?try_job_id=5189764988272640
| O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq
| X | for more information addressing perf regression bugs. For feedback,
| / \ | file a bug with component Speed>Bisection. Thank you!
=== BISECT JOB RESULTS ===
Perf regression found with culprit
Suspected Commit
Author : hamelphi
Commit : d005809c83c0ad0b9cbbda6b10b7ed2c3dcd5120
Date : Thu Mar 02 23:38:24 2017
Subject: [Translate] Add translate ranker model loader.
Bisect Details
Configuration: winx64_10_perf_bisect
Benchmark : page_cycler_v2.intl_hi_ru
Metric : timeToFirstContentfulPaint_avg/pcv1-cold/http___ru.wikipedia.org_
Change : 130.80% | 259.846500009 -> 599.71966667
Revision Result N
chromium@454356 259.847 +- 15.6452 6 good
chromium@454413 252.567 +- 10.3668 6 good
chromium@454427 251.525 +- 5.20526 6 good
chromium@454431 263.62 +- 58.1582 6 good
chromium@454432 250.682 +- 3.7059 6 good
chromium@454433 607.104 +- 19.9101 6 bad <--
chromium@454434 606.65 +- 10.5471 6 bad
chromium@454441 605.064 +- 14.4439 6 bad
chromium@454469 599.72 +- 42.4354 6 bad
To Run This Test
src/tools/perf/run_benchmark -v --browser=release_x64 --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=http...ru.wikipedia.org. page_cycler_v2.intl_hi_ru
Debug Info
https://chromeperf.appspot.com/buildbucket_job_status/8985467431747597600
Is this bisect wrong?
https://chromeperf.appspot.com/bad_bisect?try_job_id=5607496057618432
| O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq
| X | for more information addressing perf regression bugs. For feedback,
| / \ | file a bug with component Speed>Bisection. Thank you!
This is really bizarre. Here's an annotated version of the power graph for load_yahoo vs. load_yahoo_ref: https://chromeperf.appspot.com/report?sid=a9ede378b715e9a6c0b091bddd446b613b250c15af6f5e76c943ee28a89cff46
This CL is clearly responsible: there's a clear regression when it lands, an improvement when it's removed, and another regression when it relands. However, as we're seeing elsewhere, there's no improvement when it gets reverted the second time. I checked the revision log for that second revert and confirmed that the CL is in the revision log as we'd expect it to be.
A couple of hypotheses that might explain what we're seeing:
1) The second revert is somehow asymmetrical to the second landing. hamelphi@, were there any merge conflicts that you needed to resolve manually?
2) The CL triggers some pathological case in the load pipeline. While this CL was in the tree the second time, another CL started triggering that same pathological case, so when this CL was reverted, Chrome continued to demonstrate that pathological behavior.
I'm going to check #1 first, as it seems easier to verify. If that doesn't show anything, I'll move on to #2.
The CL includes a big file (histograms.xml), which prevented me from just hitting the "Revert patchset" button. I had to do a manual revert. I do not recall having to resolve any merge conflicts, but I did apply a few automatic merges while trying to land the CL.
One other change that got into the mix is 7a9fa89774a486bb9661acbb1d6c124e312a6b9e, which removed the TranslateRanker Finch experiments from field trials. It was submitted between the re-commit and the second revert.
We could revert this CL too and see if it changes anything. I'd like to test it locally, but I can't seem to run the benchmark locally; I get the following error:
CloudStorageError: ServiceException: 401 Anonymous users does not have storage.objects.get access to object chrome-partner-telemetry/d402a17e3c6de27aacc43997b935b4cff1452fcd.
Any clues?
I also wonder if there could have been a second unrelated regression of a similar magnitude that's confusing this result?
For example in the screenshot in comment 51, the recent graph levels are higher than they were with the original regression.
(One way to confirm this would be to revert the revert and see if it goes even higher - which would be the case if the two regressions are independent and cumulative. If not, then there's something else at play.)
I tried reverting the Finch CL (putting the TranslateRanker experiments back in field trials) and had surprising results. It looks like it would resolve the regression.
Here are the results from the trybot:
https://goo.gl/RU89hV
Just to be sure, I ran the same thing with a trivial change on ToT (adding a comment line):
https://goo.gl/5Hb8mb
In short, here are sample TTFCP values for both runs:
ToT (finch_revert): ~300ms
Patch (finch_revert): ~140ms
ToT (trivial_change): ~150ms
Patch (trivial_change): ~140ms
I don't know if timings are supposed to be consistent for ToT between runs, but they are way off in this case. I am not sure what to make of this, but if we don't care about consistency between runs, it looks as though adding back the experiment drives TTFCP down considerably.
This is a bit surprising, since enabling the experiments should strictly add more work. I don't see how enabling it could impact this metric positively.
Maybe I am misinterpreting the results? Or maybe I am misunderstanding what having the experiment in fieldtrials means?
On the other hand, it explains the regression we have seen. The CL changes behavior associated with the experiment, which can have the same effect as disabling the experiment in some way.
So, it looks as though adding back the experiment will fix the graph, but it also means that the high TTFCP is what users are getting on Stable.
Can someone with a better understanding of field trials and benchmarks explain what we are seeing here?
Would it be possible to go back and see the impact CL 9b3a19f7382b2b6db1f9bf1db63ceeee115bade4 had on the metrics? This is the CL that originally added the TranslateRanker experiments to field trials.
And now that I think about it, it may make sense that TranslateRankerEnforcement reduces TTFCP. If the bot is actually fetching a valid TranslateRanker model and the model decides not to show the translate UI, we are skipping a bunch of work in this case.
So, if this is true, we are not actually seeing a regression; we are seeing the loss of an improvement that TranslateRanker would bring in this case if we turned it on. Does that make sense?
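A sketch of the decision path being hypothesized here, with entirely made-up names (not the actual translate code): when the ranker model is present and vetoes the offer, the expensive Windows UI work never runs, which would make the "enabled" configuration look faster.

    // All names below are hypothetical stubs for illustration.
    bool RankerSaysOffer() { return false; }  // Consults the downloaded model.
    void ShowTranslateBubbleUi() {}           // Windows UI work suspected to be slow.

    void MaybeOfferTranslation(bool ranker_enforcement_enabled) {
      if (ranker_enforcement_enabled && !RankerSaysOffer())
        return;  // Ranker vetoes the offer: the expensive UI path is skipped.
      ShowTranslateBubbleUi();
    }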
Re #59: Looks like these metrics only go back to Dec 1 2016, a few months after the experiment was turned on on the bots.
Re #60: Can anyone run locally on Mac or Windows to see if there are differences with the translate UI on/off?
+groby@
Rachel, the TL;DR is that we are seeing a big discrepancy in time to first contentful paint on Windows with and without the ranker. It looks as though having the ranker enabled makes it faster. My hypothesis is that this points to some problem in the infobar UI, and having the ranker enabled skips the problematic part when running the benchmark. Do you have any insights on what could cause such an issue?
I cannot run locally on Mac or Windows, but I ran a trybot on Mac: https://goo.gl/yc6z7H
It doesn't seem to be affecting mac users. My guess is there is some bug with the translate UI on Windows (the infobar UI).
I launched another trybot on Windows that turns off ShowTranslateUI, but still goes through the Ranker Logic to see if we still see the same behavior. Will update when I get the results.
Here is the link to the trybot that disables ShowTranslateUI (see https://codereview.chromium.org/2743373003/):
https://goo.gl/7XhchI
There is still a difference between ToT and Patch, so there may be more to the problem, but it is not as bad as before:
ToT: ~190 ms
Patch (enable Ranker, disable ShowTranslateUI): ~140 ms
So, from what I understand, it does not look like the original CL is really causing a regression. Turning off TranslateRankerEnforcement does, but it is already turned off on Beta and Stable.
There is still an underlying issue with the translate UI on Windows that we need to figure out, and I think we should continue investigating, but it looks to me like it is decoupled from the model loader that the original CL implements.
So, I believe it would be safe to resubmit the CL. Would you agree? It would allow us to move forward with the Ranker project, and potentially improve the metrics on Beta/Stable once we re-enable the experiment.
Sorry, I'm a little slow on understanding how things are gated. When you say you want to resubmit the CL, you're pretty sure it will be off on the perfbots and off by default for end users? It is okay to resubmit in my opinion.
Are you the right owner for investigating the translateUI+windows issues?
Yes, we will make sure everything is gated by experiments. Until we add back the TranslateRanker experiments in field trials, it should not impact perfbots. And until we get the green light to turn on experiments in beta/stable, it won't affect users.
I am not the right owner for translateUI+windows issues. I believe groby@ would know who should own this.
Re c#66 - why do we think there's a translateUI+windows regression NOT based on TranslateRanker if that's the only work being done in TranslateUI right now?
Re c#67: I'm OOO until 4/2, so haven't read the full thread - but if we suspect it's really an underlying issue with Views, maybe pkasting@ would know somebody?
Re c#62: Windows doesn't use Infobars, it uses bubbles. So looking at infobars is the wrong path.
If you have a Linux build, it should show the *exact* same behavior, since that path is shared on both platforms - @hamelphi: Can you try investigating there?
Should I expect [ git cl try -b winx64_high_dpi_perf_bisect ] to work, or is there another prescribed way to repro/test this?
I'm seeing "Recipe failed. Reason: Could not load config file. Double check your changes to config files for syntax errors." for this, with no config file changes.
Are you sure you want the perf_bisect? I suppose bisect is the binary search to find a culprit CL. If you just want to benchmark your CL, you probably want to run one of the benchmarks listed by
'tools/perf/run_benchmark try --help'
I guess the one you are looking for is winx64-high-dpi.
There's a follow-up CL related to the change at issue here (to fix a race condition on shutdown).
Otherwise, there's nothing left to do on this.