CPU adaptation regression M48-M50 as well as video quality issues at Spotify
Issue description: Copying over buganizer bugs b/29441650 and b/29285690.

The following data from a set of test runs across milestones M48 to M52 shows a big jump in CPU utilization. The data was collected on a Panther i7 with a C920 and one Jabra speaker, with 10 participants (4 harmony console 720p + 6 hotrod devices) in the session. Data recording time was 20 mins.
Graph: https://docs.google.com/a/google.com/spreadsheets/d/1IKpG5IIvo4O8OfH0pPTnqNEH6_7DnORy7Bl26yHwdHo/pubchart?oid=2029770307&format=interactive

Feedback reports:
M48 - https://feedback.corp.google.com/product/208/neutron?lView=rd&lReport=9897636834
M49 - https://feedback.corp.google.com/product/208/neutron?lView=rd&lReport=9895671702
M50 - https://feedback.corp.google.com/product/208/neutron?lView=rd&lReport=9897653546
M51 - https://feedback.corp.google.com/product/208/neutron?lView=rd&lReport=9898863470
M52 - https://feedback.corp.google.com/product/208/neutron?lView=rd&lReport=9895591538

Spotify has also complained about poor video quality from CFM, though we have not been able to consistently reproduce this issue in-house unless we open the settings dialog, in which case the issue is visible while the call is ongoing. More details here: https://buganizer.corp.google.com/u/0/issues/29285690#comment22

toprice@ did some log analysis on rtc_call_perf for Hotrods and found a fairly convincing graph showing the regression is widespread and started around May 9th: http://shortn/_MblvOzQWCQ
,
Jun 23 2016
,
Jun 23 2016
,
Jul 5 2016
,
Jul 5 2016
tnakamura@: Would you assign this to someone on the WebRTC team? Thanks.
,
Jul 6 2016
tnakamura@ is OOO. dtosic@ would you be able to help?
,
Jul 7 2016
Dug through the various builds and found one build with an obvious jump in resource consumption. Take this with a grain of salt: these are dev builds, and finding one spike in resource consumption does not mean we've found all of them. Here is the list of the builds I tested: http://docs/spreadsheets/d/1aP_2C9vl_DEjWvXTZatj6lU8xNvxgwEkcj8KlhZL7hw/edit#gid=0, and these two are interesting:
01-29 2:02 7875.0.0 50.0.2633.3 - CPU limited (5 bots, 1 web client, 1 cfm). Rapid camera movement triggers the issue, OR adding 2 more bots to the meeting. CPU utilization is noticeably higher than in the case below.
01-28 18:02 7874.0.0 50.0.2632.0 - works fine. Same setup; could not repro the issue even with 9 participants (as opposed to above).
There are no more available builds in between.
As for the setup itself: set up a meeting with 1 cfm and 6 remote participants. The cfm and 1 participant focus each other (both initially send HD, everything connected over ETH); the other ones just add load (filmstrip videos).
Note that this investigation is also tracked in http://b/29285690.
,
Jul 8 2016
These are the diffs:
https://crosland.corp.google.com/log/7874.0.0..7875.0.0
https://chromium.googlesource.com/chromium/src/+log/50.0.2632.0..50.0.2633.3?pretty=fuller&n=10000
There is no visible regression on the video playback side, so I'm not sure whether this regression is on the encode side or in something else.
https://chromeperf.appspot.com/report?sid=770fd82b9087b24cd03db03acaa5e917ff52b3de77e8c28fb4df8a5b266e3e95&start_rev=23110410681203400&end_rev=27870000855600000
video_HangoutHardwarePerf doesn't run on Panther as it doesn't have hw decoding.
jansson@, could you help with finding an owner in case this regression is coming from WebRTC?
,
Jul 11 2016
[triage]: going to assume for now the regression is due to video. Åsa, Magnus, any idea where to take this next?
,
Jul 12 2016
,
Jul 12 2016
This is the same issue as http://b/29285690, just logged in crbug. As per the last comments there, we think we have found one possible cause of the regression, but since we are not familiar with building ChromeOS we would need help from someone to build a version with the fix proposed in comment #36.
,
Jul 13 2016
This CL might be the possible cause of this regression: https://chromium-review.googlesource.com/#/c/320797/
In https://b.corp.google.com/u/0/issues/29285690#comment36, yunqingwang@ suggests modifying the # of threads in https://cs.corp.google.com/piper///depot/google3/third_party/webrtc/files/stable/webrtc/modules/video_coding/codecs/vp8/vp8_impl.cc to see if it fixes the issue. It seems we need to build a custom Chrome with this change and package it with ChromeOS, but we are not familiar with how to do this. Can someone help get us a ChromeOS build for the panther board with the suggested change?
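For illustration only, here is a minimal sketch of the kind of change being suggested, i.e. capping the VP8 encoder thread count when few cores are available. The function name, signature and thresholds below are assumptions for the example, not the actual vp8_impl.cc code:

    // Sketch only: fall back to a single encoder thread on machines without
    // enough spare cores, since thread-sync overhead can outweigh the gain.
    // Names and thresholds are assumptions, not the real vp8_impl.cc logic.
    #include <algorithm>

    int NumberOfEncoderThreads(int width, int height, int number_of_cores) {
      if (number_of_cores <= 2) {
        return 1;  // proposed direction: stay single-threaded on small machines
      }
      if (width * height >= 1280 * 720) {
        // Larger resolutions can use more threads, but leave one core free.
        return std::min(3, number_of_cores - 1);
      }
      return 1;
    }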
,
Jul 13 2016
Thank you. We can definitely help with building a custom Chrome to test, and I should be able to deploy a custom build with the fix onto your device directly if you can provide a CL to be tested.
,
Jul 14 2016
Hi Pawel,
It seems the most likely candidate for any threading issues is https://chromium-review.googlesource.com/#/c/320797/
Any chance you can revert this single patch in a custom build for these guys? Just in case you aren't familiar with git, you can download the patch from the attached link and apply it in reverse by using this:
    git apply -3 -R threading-change.patch
Jim
,
Jul 14 2016
,
Jul 14 2016
We had a discussion (jimbankoski@, yaowu@, marpan@) and agreed that we should test 2 things to see how performance changes and get the issue fixed quickly. One is to revert the patch as Jim suggested; this is a quick solution if it works. The other is to use a single-threaded encoder when there are not enough cores available.
,
Jul 14 2016
Here is the CL that reverts the above-mentioned patch: https://chromium-review.googlesource.com/#/c/360630/1
Marco verified that the CL applied fine in a webrtc checkout. Here are the commands he used:
    $ cd third_party/libvpx/source/libvpx
    $ git fetch https://chromium.googlesource.com/webm/libvpx refs/changes/30/360630/1 && git cherry-pick FETCH_HEAD
and then re-build:
    $ ninja -C out/Release
,
Jul 19 2016
,
Jul 19 2016
,
Jul 19 2016
,
Jul 19 2016
If the trylocking spinloop is the cause of the performance regression, VP9 might be in a similar state, as it uses the same for (i = 0; i < 4k; ++i) trylock(); mechanism before locking as the above patch. This might hit the system more severely when it's loaded, but that's only speculative. We're still struggling with deploying a ChromeOS build, so we'll try reverting and building locally to verify (me/dtosic@), but it looks like a good candidate to me. FTR: https://bugs.chromium.org/p/chromium/issues/detail?id=158922#c57 mentions possible issues here, and it looks like an overkill way to implement atomic reads/writes.
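For context, a minimal sketch of the spin-then-lock pattern described above (not the actual libvpx source; the spin budget and helper name are assumptions). Each failed try burns CPU, and on a loaded machine the lock holder is often descheduled, so a waiter can spin through the whole budget before it finally blocks:

    #include <pthread.h>

    // Sketch of the "spin ~4k times on trylock, then block" pattern. On an
    // idle machine the trylock usually succeeds quickly; on a loaded machine
    // the waiter may spin through all iterations, wasting CPU.
    void acquire_with_spin(pthread_mutex_t* mutex) {
      const int kMaxSpins = 4000;  // roughly the "4k" mentioned above
      for (int i = 0; i < kMaxSpins; ++i) {
        if (pthread_mutex_trylock(mutex) == 0) {
          return;  // got the lock while spinning
        }
      }
      pthread_mutex_lock(mutex);  // give up spinning and block
    }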
,
Jul 20 2016
,
Jul 20 2016
This shouldn't go unfixed for another release. Ideally we should merge a fix to 52. Expecting the hotrod team to debug a CPU utilization issue doesn't seem reasonable. Do we have a repro case in MTV? posciak@ is the current owner but is pretty heavily loaded with other things too. How can we move this forward?
,
Jul 20 2016
In MTV, we have been testing different builds trying to bisect when the issue started. So far the indication is that the last M48 stable (7647.84.0 / 48.0.2564.116) that was pushed to production works as expected, with no drops in HD quality even with 10 participants. The issue starts appearing with the very early M49 build carrying the first 49 Chrome (49.0.2567.0), though sometimes, especially after a fresh re-image/reboot, this build seems to work fine; the issue becomes more apparent in test runs after the device has been up for a while. We have been recording CPU utilization for some of the test runs but have not done any other CPU profiling. Spreadsheet with test results: https://docs.google.com/spreadsheets/d/1udHHREr_WSWkjR-jDWLi9hPrqYPOS0mFp4Gd28v4T9k/edit#gid=0
,
Jul 20 2016
,
Jul 20 2016
@24: where is the CPU utilization? I don't see it in this spreadsheet...
,
Jul 20 2016
Links are in the 'links' column for some of the test runs, though I don't know how useful this data really is without corresponding HD vs. non-HD data at a given time. If this CPU usage data is useful, I can record it for each additional test run we do. Let me know.
,
Jul 20 2016
@27: ah ok, thanks. What is the "memory" column, is it memory usage or memory bandwidth? How was it measured?
,
Jul 21 2016
harpreet@: did you have a chance to test the revert from #17 I deployed to your device? pbos@: we are happy to help with building and/or deploying remotely to your test devices as needed.
,
Jul 21 2016
posciak@ - Yes, the issue is still reproducible with the revert.
,
Jul 21 2016
I see, thanks. (Also removing myself as an owner since I'm not leading the work on this)
,
Jul 21 2016
@28: marcheu@ - it is the percentage of memory used, collected using get_mem_total and get_mem_free in https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/bin/site_utils.py?l=841
Here is the sample code used:

    def _memory_usage(self):
        """Returns total used memory in %."""
        total_memory = site_utils.get_mem_total()
        return (total_memory - site_utils.get_mem_free()) * 100 / total_memory
,
Jul 21 2016
Wrt. #23: Who could take over in the PST time zone now? Anyone who can help out, please sync with harpreet@
,
Jul 21 2016
,
Jul 21 2016
People are treating this as P0. That should be reflected in the bug.
,
Jul 21 2016
Assigning to marcheu@ as he is helping further debug the issue from CrOS side.
,
Jul 22 2016
Some more data comparing 48 vs 54 with CPU frequency min/max set to 2.1GHz: https://docs.google.com/spreadsheets/d/1E0cTR21L59VM9NdLDlBQzq8fTYU5XupQlwbp_sleIWc/edit#gid=0
Brief background on how the CPU usage data was collected:
M54
  Run 1 - 4 participant session, to make sure the video sent from CFM stayed in HD for the whole 5 min duration of the test run
  Run 2 - 7 participant session, to make sure the video sent from CFM stayed in nonHD for the whole 5 min duration of the test run
M48
  Run 1 - 7 participant session, streamed HD video from CFM for the whole 5 min duration of the test run
Also uploaded the CPU profiling data from these runs to https://pantheon.corp.google.com/storage/browser/chromiumos-test-logs/bugfiles/cr/622510/?pli=1
,
Jul 23 2016
Tested various M51 builds with 8 participants (6 bots, 1 web client and 1 cfm) and captured CPU utilization. Frequency was pinned to 2.1GHz for each build prior to the test run / data capture. CPU usage data is not drastically different across builds, but it does seem to fluctuate more starting with the 8150.0.0 / 51.0.2699.0 build. Usage curves for builds prior to this are smoother and within a narrower percentage band. https://docs.google.com/a/google.com/spreadsheets/d/13SQEWYTCe5ikVl5mxL_eBv2dZYTZp5iMv8KUoK49yCk/pubchart?oid=12017749&format=interactive
dtosic@ - Here is more info about the builds tested; maybe you can use the session ID to take a look at the JMI logs: https://docs.google.com/spreadsheets/d/13SQEWYTCe5ikVl5mxL_eBv2dZYTZp5iMv8KUoK49yCk/edit#gid=0
,
Jul 25 2016
The fluctuations might also be caused by CPU adaptation, so we need to look into the JMI stats. @toprice - can you pull some graphs out of the logs?
,
Jul 25 2016
Pulled Hangouts quality stats for harpreet's test runs and put them in http://shortn/_0QSOVYS6qx In short, the adaptation rate certainly gets worse across the builds, with no obvious single step, but a single run for each isn't much data to go on to identify steps.
,
Jul 26 2016
Harpreet ran through a couple of builds on M51 (https://docs.google.com/spreadsheets/d/13SQEWYTCe5ikVl5mxL_eBv2dZYTZp5iMv8KUoK49yCk/edit#gid=213532321) and identified a possible larger regression between 8140 and 8142. I compared the 2 builds (each with approx 7 runs of 15-20 minutes); results: http://docs/spreadsheets/d/13SQEWYTCe5ikVl5mxL_eBv2dZYTZp5iMv8KUoK49yCk/edit#gid=556873349. There is a strong indication that something broke in this range, and looking at the Chrome version there has been only one version update (from 8141 to 8142): https://chromium.googlesource.com/chromium/src/+log/51.0.2696.0..51.0.2697.0?pretty=fuller&n=10000 It contains a libvpx push https://chromium.googlesource.com/webm/libvpx.git/+log/a6246927767f..904cb53302df that includes the updates to the skin model, which Stephane suspected as a possible reason for the degradation (if I understood him correctly). I see this is marked as targeted for M53 - we're already 4 weeks into the regression (since 51) and I don't think we have the luxury of waiting another 6 weeks until M53 gets out. Can we make sure we get a fix into M52 once we have it (them)?
,
Jul 26 2016
Given we have not yet identified a fix, it is too early to commit to merging to 52. If it is possible/safe we will do it. FWIW it's critical that we create appropriate tests to ensure these sorts of regressions are caught immediately rather than creating a scramble and potential for unsafe merges when they are discovered weeks later. Who should own creating such tests?
,
Jul 26 2016
We are looking into ways to set up a stress test using the autotest infra and harmony bots to generate load. I'll create a separate bug for this with more details. Besides this integration test, folks should also add unit tests where appropriate.
,
Jul 26 2016
Absolutely! But unit tests wouldn't have caught a regression of this sort.
,
Jul 26 2016
Regarding #41: There should not be any changes in vp8 from the libvpx roll mentioned in #41. There is one vp8 CL in that list that updates the new skin model (model#1) but it does not switch to that model. The old model (model#0) is still used in M51. The switch to the new skin model was done later in M52. The rest of the CLs in that list are vp9 changes.
,
Jul 26 2016
@45: yes, the skin model did regress, but when it was switched on through the #define, not from the earlier CLs.
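(Purely illustrative sketch of what a #define-gated model switch looks like; the macro, function names and threshold values below are invented for the example and are not the libvpx identifiers.)

    // Compile-time switch between two skin-detection models, similar in
    // spirit to the #define mentioned above. All names and values are made up.
    #define USE_SKIN_MODEL_1 0  // 0 = old model (model#0), 1 = new model (model#1)

    static int SkinModel0(int cb, int cr) { return cb > 100 && cb < 130 && cr > 135 && cr < 175; }
    static int SkinModel1(int cb, int cr) { return cb > 95 && cb < 135 && cr > 130 && cr < 180; }

    int IsSkinPixel(int cb, int cr) {
    #if USE_SKIN_MODEL_1
      return SkinModel1(cb, cr);  // new model, only switched on later (M52)
    #else
      return SkinModel0(cb, cr);  // old model, still in use in M51
    #endif
    }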
,
Jul 26 2016
(which means we need to hunt for another regression in that range)
,
Jul 27 2016
About tests... We do have performance tests for webrtc (check the webrtc_perf_test suite at chromeperf.appspot.com). I've looked at them and I can't see any regression that matches this timeline. We don't have a cros machine there though, and from what I hear adding one is not trivial. Those are just webrtc standalone, though. There are some Chrome browser tests as well, but of limited scope. If we can add more automated full-stack Hangouts tests, that would be awesome. We also have some automatic alerts for UMA and Hangouts stats, but that coverage should be extended. It's interesting that this regression basically isn't visible in the stats on any other platform, including Linux.
,
Jul 31 2016
Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable? If a fix is in active development, please set the status to Started. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Aug 2 2016
Can you try https://codereview.webrtc.org/2204533004 ? This fixes it for me
,
Aug 2 2016
Re #50: That code path shouldn't be active for Hangouts (or simulcast) at all; quality-based downscaling should only kick in for single-ssrc streams, and it has nothing to do with CPU adaptation.
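(For clarity, an illustrative sketch of the expected gating; the names below are assumptions, not the actual WebRTC code. Quality-based downscaling should be restricted to single-ssrc streams, while CPU adaptation is a separate mechanism.)

    // Illustrative only: quality scaling is expected to be disabled for
    // simulcast (multi-ssrc) sends such as Hangouts; CPU adaptation is a
    // separate path driven by encode-time overuse, not by this check.
    struct SendStreamConfig {
      int num_simulcast_streams;  // > 1 for a simulcast send
    };

    bool QualityScalingAllowed(const SendStreamConfig& config) {
      return config.num_simulcast_streams <= 1;  // single-ssrc only
    }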
,
Aug 2 2016
sprang@: If #50 actually fixes the issue, can you help look into why this code path is active at all?
,
Aug 2 2016
We'll do further testing tomorrow, with some custom builds. I'll add some logging just to verify that this code path isn't incorrectly used. I'd hold off on trying to revert this for now.
,
Aug 4 2016
sprang@ any updates on the testing that you were working on?
,
Aug 5 2016
,
Aug 8 2016
Over the last few days I have tested the latest build from each milestone starting with 49. Performed 7 test runs on each build/milestone tested (except M49) with 6 harmony bots, 1 CFM and 1 web client; all participants were muted, with the CFM showing the WC video as the active stream and vice versa. Video from the bots was not active on the WC or CFM. For each test run the following data was collected:
- CPU usage
- Average adaptation
- HD send ratio
Based on the graph below, CPU adaptation in M49 is at 0%; it jumped to 2.6% in M50, followed by a big jump to 10.1% in M51, then dropped down to 7.8% in M52 and further down to 4.1% in M53 and 3.1% in M54. M51 looks like the worst milestone for HD send ratio / CPU adaptation, and it gradually starts improving with M52, M53 and M54.
Graph of mean adaptation, CPU usage and HD send ratio across 49, 50, 51, 52, 53, 54: https://docs.google.com/a/google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/pubchart?oid=1538501226&format=interactive
Here is the raw data from all the test runs: https://docs.google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/edit#gid=740558155
CPU usage graphs:
M49: https://docs.google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/edit#gid=967424089
M50: https://docs.google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/edit#gid=524686887
M51: https://docs.google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/edit#gid=817180553
M52: https://docs.google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/edit#gid=1960836693
M53: https://docs.google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/edit#gid=1991803458
M54: https://docs.google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/edit#gid=881790984
,
Aug 15 2016
Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable? If a fix is in active development, please set the status to Started. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Aug 15 2016
The main investigation currently takes place in the internal tracker: http://b/29285690
,
Aug 15 2016
marcheu@, do you have cycles to look into this? This is currently an RBS.
,
Aug 16 2016
,
Aug 17 2016
posciak@ is this something you can help investigate?
,
Aug 17 2016
ketakid@ - this is being actively investigated and updates are added to http://b/29285690
mnilsson@ / abodenha@ - Does it make sense to close this bug and continue tracking in http://b/29285690?
,
Aug 18 2016
If that's where the people working on this are most easily able to make progress, fine. If there are particular CLs we want to land in Chrome or Chrome OS as a result we can open separate issues here for them.
,
Mar 18 2017
Activating. Please assign to the right owner and the appropriate priority.
,
Mar 22 2017
Comment 1 by harpreet@chromium.org
, Jun 22 2016