A zero-to-nonzero regression in process_delta for media.tough_video_cases at 456690:456834
Issue description: Many media tests are triggering this alert across many Windows bots. Will file separate bugs for other regressed metrics in the same range.
Mar 16 2017
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8984947852434499664
Mar 16 2017
The time_to_play metric also regressed in this range; tracked in Issue 702332.
Mar 17 2017
=== Auto-CCing suspected CL author sugoi@chromium.org ===
Hi sugoi@chromium.org, the bisect results pointed to your CL, please take a look at the results.

=== BISECT JOB RESULTS ===
Perf regression found with culprit

Suspected Commit
  Author : sugoi
  Commit : 842efc6922b0cd66febd6f36f04318ebf396c184
  Date   : Tue Mar 14 18:10:39 2017
  Subject: Moving SwiftShader from component to bundled library

Bisect Details
  Configuration: win_8_perf_bisect
  Benchmark    : media.tough_video_cases
  Metric       : processes_delta/crowd.ogg

  Revision          Result        N
  chromium@456705   0.0 +- 0.0    6   good
  chromium@456738   0.0 +- 0.0    6   good
  chromium@456754   0.0 +- 0.0    6   good
  chromium@456762   0.0 +- 0.0    6   good
  chromium@456763   0.0 +- 0.0    6   good
  chromium@456764   1.0 +- 0.0    6   bad   <--
  chromium@456766   1.0 +- 0.0    6   bad
  chromium@456770   1.0 +- 0.0    6   bad
  chromium@456834   1.0 +- 0.0    6   bad

To Run This Test
  src/tools/perf/run_benchmark -v --browser=release --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests media.tough_video_cases

Debug Info
  https://chromeperf.appspot.com/buildbucket_job_status/8984947852434499664

Is this bisect wrong? https://chromeperf.appspot.com/bad_bisect?try_job_id=6131462270615552

Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq for more information on addressing perf regression bugs. For feedback, file a bug with component Speed>Bisection. Thank you!
Mar 20 2017
Explanation: This test is measuring something called process_delta. The way this works is that after Chrome is started, but before the test actually starts playing the audio, the test harness takes a snapshot of the number of processes that are running. After the test is over and Chrome has stopped playing audio, the test takes another snapshot. The delta is the number of processes running after minus the number of processes running before. The idea is that we want to make sure we don't accidentally leave processes running that are no longer needed. This is at least my guess about how this works. Maybe johnchen@ or chcunningham@ knows more about the code?
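To make the measurement concrete, the logic described above boils down to something like the following sketch. This is illustrative only, not the actual telemetry code; measure_processes_delta and count_chrome_processes are hypothetical names.

  # Illustrative sketch of the processes_delta idea -- not the actual harness code.
  # count_chrome_processes is a hypothetical callable that returns the current
  # number of processes belonging to the Chrome instance under test.
  def measure_processes_delta(play_media, count_chrome_processes):
      before = count_chrome_processes()   # snapshot after Chrome starts, before playback
      play_media()                        # run the media test case
      after = count_chrome_processes()    # snapshot after playback has stopped
      return after - before               # > 0 means processes were left running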
Mar 21 2017
The explanation by crouleau@ covers the essentials of how process_delta is measured. The test harness walks the process tree to count all the child processes of the main Chrome process, both before and after the audio is played. Any change in the number of child processes is reported as process_delta.
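For reference, a harness can obtain that count by walking the process tree, for example with psutil. This is a sketch under the assumption that the browser's PID is known; it is not the code the perf harness actually uses.

  # Sketch of counting a browser's child processes with psutil -- illustrative only.
  import psutil

  def count_child_processes(browser_pid):
      browser = psutil.Process(browser_pid)
      # children(recursive=True) walks the whole process tree under the browser.
      return len(browser.children(recursive=True))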
Mar 21 2017
Ah, ok, I understand now. Generally, when we have a valid GPU, Chrome fires up the GPU process right away in order to use accelerated features, and it's alive both at the beginning and at the end of the test (so no delta). Before the SwiftShader change, the GPU process was never getting fired up on the bots. Now, what seems to happen is that Chrome starts but does not start the GPU process since there's no GPU (it's a VM). Something in the test then triggers the GPU process to start (using SwiftShader). At the end, the GPU process is still alive and the process delta is 1. So, AFAIK, it can be either:
1) There's a bug and the GPU process is started for no reason (not likely, but possible).
2) The test could do something to try to start the GPU process before counting the number of processes initially.
3) The test could ignore the GPU process in its process count, if that's possible (sketched below).
I'll investigate point 1.
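If option 3 turned out to be the way to go, the count could exclude the GPU process by inspecting each child's command line, since Chrome's GPU process carries the --type=gpu-process switch. A hypothetical sketch (again with psutil; not the harness's real code):

  # Hypothetical sketch of option 3: exclude the GPU process from the count by
  # filtering on Chrome's --type=gpu-process command-line switch.
  import psutil

  def count_non_gpu_children(browser_pid):
      count = 0
      for child in psutil.Process(browser_pid).children(recursive=True):
          try:
              if '--type=gpu-process' in child.cmdline():
                  continue                # skip the (SwiftShader) GPU process
          except (psutil.NoSuchProcess, psutil.AccessDenied):
              continue                    # child exited or is inaccessible
          count += 1
      return count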
Mar 21 2017
Thanks for taking a look! One thing to keep in mind is that this CL has been marked as causing a regression for multiple metrics:
- vm_working_set, Issue 702337
- time_to_play, Issue 702332
These other metrics were tracked in separate bugs, but the bisect tool automatically marked those as duplicates of this issue, and the Monorail UI doesn't do a great job of surfacing that. I suspect it has also regressed vm_shared_dirty_delta, Issue 702002 (though bisects have not found a culprit yet).
Mar 21 2017
It is very unlikely that it caused Issue 702002, since the CL was a Windows-only change and that issue applies to Android bots.
Mar 21 2017
Ack, ignore Issue 702002.
Mar 21 2017
I still haven't figured out exactly where the issue comes from; it should be fairly straightforward to find. Essentially, SwiftShader gets enabled in GpuDataManagerImplPrivate::EnableSwiftShaderIfNecessary(), which in turn causes GpuDataManagerImplPrivate::GpuAccessAllowed() to return true. This then allows the creation of a GPU process. All of that is normal. But video decoding shouldn't be trying to use the GPU at this point, because all GPU features, including GPU_FEATURE_TYPE_ACCELERATED_VIDEO_DECODE, are blacklisted. I understand the processes_delta issue, but I don't yet get why there are performance regressions.
Mar 29 2017
I think these regressions are just the new expected behavior. For example:

- The 'processes_delta' change: Before the CL which caused this regression, we were in one of 2 cases:
  - A valid GPU is present by default.
  - No valid GPU is detected, and the GPU process will not get created even if a GPU channel is requested.
  After the CL, the behavior is: no GPU process is present by default, but requesting a GPU channel creates one with SwiftShader.

- The 'time_to_play' regression: For the same reason as the previous regression, a GPU process is created the first time a video plays (a one-time cost) if no GPU process exists yet. At that point, one is created, and this is not done at zero cost. It's just that the cost used to have already been paid at this point.

Explanation: When pressing "play" on a video, RenderFrameImpl explicitly requests a GPU channel (which ends up creating the GPU process) by calling RenderThreadImpl::SharedMainThreadContextProvider() here:
https://cs.chromium.org/chromium/src/content/renderer/render_frame_impl.cc?l=2835
This could be fixed by having GetSharedMainThreadContext3D() receive a RenderThreadImpl* instead of the "provider" argument, and calling SharedMainThreadContextProvider() inside GetSharedMainThreadContext3D() rather than at the line mentioned above. Unfortunately, this would not be enough. A few calls later, the creation of gpu_factories in RenderThreadImpl::GetGpuFactories() also establishes a GPU channel here:
https://cs.chromium.org/chromium/src/content/renderer/render_thread_impl.cc?l=1383
This happens through the creation of the media player, following these steps:
https://cs.chromium.org/chromium/src/content/renderer/render_frame_impl.cc?l=2937
https://cs.chromium.org/chromium/src/media/blink/webmediaplayer_impl.cc?l=191
https://cs.chromium.org/chromium/src/media/blink/webmediaplayer_impl.cc?l=1727
https://cs.chromium.org/chromium/src/media/renderers/default_renderer_factory.cc?l=111
So unless someone who knows more than me about this code refactors these classes to use the GPU channel as more of an on-demand service, rather than something that's always created when media starts playing, these regressions are going to stay.

Note that these are not new behaviors. These are behaviors that existed before but were untested by the bots. Reverting the CL that caused the regressions would simply re-hide these problems, so I don't think that should be considered as a possibility.
Apr 19 2017
+Dale FYI. This is the cause of:
https://chromeperf.appspot.com/report?sid=b4460f36ca346674a4cdde4eaf2ad34dd6eea8d33ed3d2c7a232e182f0497880&start_rev=452825&end_rev=465537
Is there any further action needed?
Apr 19 2017
That's a pretty huge regression in time to playback. It seems like we should defer GPU process startup by following the suggestions in c#13, i.e. don't trigger any GPU-related behavior if 1) we don't need it yet, and 2) there are no GMBs or GVDs. These changes would seem to be beneficial outside of tests in the normal case too. +sandersd@, want to take a look at this?
Apr 19 2017
In my humble opinion, ~25% is not that big of a regression, and before September 2016 the time to playback was higher for everybody. Note that this only affects the tests and about 10% of users: those who have a blacklisted GPU. That said, leaving things as-is might hide further regressions or reduce future improvements (or at least give that impression), so I'd still recommend not launching the GPU process until you have to and know the GPU can be used.
Apr 19 2017
Time to play is like search speed in terms of $$$/ms due to abandonment, so if 10% of users on YouTube/Vimeo/Facebook/etc. are taking XXX more ms to start playback, that's definitely something we want to address. I agree that it was higher before, so we don't need to revert this or anything like that; we just need to look into making the suggested improvements.
Sep 6
This has been open for a while, and I don't think there's anything we can do on the SwiftShader side, so I'm assigning this to dalecurtis@ to route it to someone on the Media team, or to simply close it if it won't be addressed.
Sep 6
Graphs won't even load anymore, so it's hard to say what's happened since. The ones that do load show original levels, so this seems closable.