
CPU adaptation regression M48--M50 as well as video quality issues at Spotify

Project Member Reported by harpreet@chromium.org, Jun 22 2016

Issue description

Copying over buganizer bugs b/29441650 and b/29285690

The following data from a set of test runs across milestones M48 to M52 shows a big jump in CPU utilization. The data was collected from a Panther i7 with a C920 and 1 Jabra speaker, with 10 participants (4 Harmony console 720p + 6 hotrod devices) in the session. Data recording time was 20 mins.

Graph: https://docs.google.com/a/google.com/spreadsheets/d/1IKpG5IIvo4O8OfH0pPTnqNEH6_7DnORy7Bl26yHwdHo/pubchart?oid=2029770307&format=interactive

Feedback reports
M48 - https://feedback.corp.google.com/product/208/neutron?lView=rd&lReport=9897636834
M49 - https://feedback.corp.google.com/product/208/neutron?lView=rd&lReport=9895671702
M50 - https://feedback.corp.google.com/product/208/neutron?lView=rd&lReport=9897653546
M51 - https://feedback.corp.google.com/product/208/neutron?lView=rd&lReport=9898863470
M52 - https://feedback.corp.google.com/product/208/neutron?lView=rd&lReport=9895591538



Spotify has also complained about poor video quality from the CFM, though we have not been able to reproduce this issue consistently in-house unless we open the settings dialog and keep it visible while the call is ongoing. More details here: https://buganizer.corp.google.com/u/0/issues/29285690#comment22


toprice@ did some log analysis on rtc_call_perf for Hotrods and found a fairly convincing graph showing the regression is widespread, starting around May 9th: http://shortn/_MblvOzQWCQ

 
Let me know if more information is needed.
Labels: VideoShortList
Status: Assigned (was: Untriaged)
Cc: tnakamura@chromium.org
Components: -UI>Shell>Kiosk Blink>WebRTC
Status: Untriaged (was: Assigned)
Owner: tnakamura@chromium.org
tnakamura@ Would you assign this to someone in WebRTC team? Thanks.
Cc: vsu...@chromium.org
Owner: dtosic@chromium.org
tnakamura@ is OOO.

dtosic@ would you be able to help? 
Owner: rohi...@chromium.org
Dug through the various builds and found one build that has an obvious jump in resource consumption. Take this with a grain of salt: these are dev builds, and finding one spike in resource consumption does not mean we've found all of them. Here is the list of builds I tested: http://docs/spreadsheets/d/1aP_2C9vl_DEjWvXTZatj6lU8xNvxgwEkcj8KlhZL7hw/edit#gid=0, and these two are interesting:

        01-29 2:02      7875.0.0        50.0.2633.3     - CPU limited (5 bots, 1 web client, 1 cfm); rapid camera movement triggers the issue, OR adding 2 more bots to the meeting. CPU utilization is noticeably higher than in the case below.
        01-28 18:02     7874.0.0        50.0.2632.0     - works fine; same setup, could not repro the issue even with 9 participants (as opposed to above).

There are no more available builds in between. As for the setup itself:

Set up a meeting with 1 CFM and 6 remote participants. The CFM and 1 participant focus on each other (both initially send HD, everything connected over ETH); the other participants just add load (filmstrip videos).

Note that this investigation is also tracked in http://b/29285690.
Cc: jansson@chromium.org
Owner: ----
These are diffs:

https://crosland.corp.google.com/log/7874.0.0..7875.0.0
https://chromium.googlesource.com/chromium/src/+log/50.0.2632.0..50.0.2633.3?pretty=fuller&n=10000

There is no visible regression on the video playback side, so it's not clear whether this regression is on the encode side or elsewhere.

https://chromeperf.appspot.com/report?sid=770fd82b9087b24cd03db03acaa5e917ff52b3de77e8c28fb4df8a5b266e3e95&start_rev=23110410681203400&end_rev=27870000855600000

video_HangoutHardwarePerf doesn't run on Panther as it doesn't have hw decoding.

jansson@ could you help with finding an owner in case this regression is coming from WebRTC?
Cc: asapersson@chromium.org sprang@chromium.org mflodman@chromium.org
[triage]: going to assume for now the regression is due to video. Åsa, Magnus, any idea where to take this next?
Owner: mnilsson@chromium.org
Mats. Is this related to http://b/29285690?
Owner: wuchengli@chromium.org
This is the same issue as http://b/29285690, just logged in crbug.

As per the last comments there, we think we have found one possible cause for the regression, but since we are not familiar with building ChromeOS we would need help from someone to build a version with the fix proposed in comment #36.
Cc: yunqingwang@chromium.org
This CL might be the possible cause for this regression https://chromium-review.googlesource.com/#/c/320797/

Here https://b.corp.google.com/u/0/issues/29285690#comment36 yunqingwang@ is suggesting to modify the # of threads in https://cs.corp.google.com/piper///depot/google3/third_party/webrtc/files/stable/webrtc/modules/video_coding/codecs/vp8/vp8_impl.cc and see if that fixes the issue.


It seems that we need to build a custom Chrome with this change and package it with ChromeOS, but we are not familiar with how to do this. Can someone help get us a ChromeOS build for the panther board with the suggested change?

Thank you, we can definitely help with building a custom Chrome to test and I should be able to deploy a custom build with a fix onto your device directly if you'd be able to provide a CL to be tested.
Hi Pawel, 

It seems the most likely candidate for any threading issues is 

https://chromium-review.googlesource.com/#/c/320797/

Any chance you can revert this single patch in a custom build for these guys?

Just in case you aren't familiar with git: you can download the patch from the attached link and apply it in reverse using:

git apply -3 -R threading-change.patch

Jim 

threading-change.patch
17.8 KB Download
Cc: marpan@google.com yunqingwang@google.com yaowu@google.com jimbankoski@google.com
We had a discussion (jimbankoski@, yaowu@, marpan@), and agreed that we should test 2 things to see how performance changes and get the issue fixed quickly. One is to revert the patch as Jim suggested, which is a quick solution if it works. The other is to use a single-threaded encoder when there are not enough cores available.
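
A rough sketch of what that second option (single-threaded encoding on low-core machines) could look like. This is only an illustration of the decision logic, written in Python for brevity; the function name and thresholds here are made up, and the real logic would live in the C++ code in vp8_impl.cc:

    import os

    def num_encoder_threads(width, height):
        """Illustrative only: pick an encoder thread count from the core count."""
        cores = os.cpu_count() or 1
        if cores <= 2:
            return 1  # not enough cores: fall back to single-threaded encode
        if width * height >= 1280 * 720:
            return min(cores - 1, 4)  # leave headroom for capture/render/network
        return 2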

Here is the CL that reverts the above-mentioned patch.
https://chromium-review.googlesource.com/#/c/360630/1

Marco verified that the CL applied fine in a webrtc checkout. Here are the commands he used.
$ cd third_party/libvpx/source/libvpx
$ git fetch https://chromium.googlesource.com/webm/libvpx refs/changes/30/360630/1 && git cherry-pick FETCH_HEAD

and then re-build
$ ninja -C out/Release

Owner: posciak@chromium.org
Status: Assigned (was: Untriaged)

Comment 19 by pbos@chromium.org, Jul 19 2016

Cc: pbos@chromium.org
Cc: glider@chromium.org

Comment 21 by pbos@chromium.org, Jul 19 2016

If the trylocking spinloop is the cause of the performance regression, VP9 might be in a similar state, as it uses the same for(i = 0; i < 4k; ++i) trylock(); mechanism before locking as the above patch. This might hit the system more severely when it's loaded, but that's only speculative.
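
For reference, the trylock-then-lock pattern being described looks roughly like this (a minimal Python illustration of the mechanism only; the actual code in question is C inside libvpx):

    import threading

    def spin_then_lock(lock: threading.Lock) -> None:
        """Poll the lock ~4000 times, then fall back to a blocking acquire."""
        for _ in range(4000):              # the "for(i = 0; i < 4k; ++i) trylock();" loop
            if lock.acquire(blocking=False):
                return                     # acquired while spinning
        lock.acquire()                     # give up spinning and block

The point is that the polling loop itself burns CPU, which hurts most on a system that is already loaded.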

We're still struggling with deploying a ChromeOS build; once that works we'll try reverting and building locally to verify (me/dtosic@), but it looks like a good candidate to me.

FTR: https://bugs.chromium.org/p/chromium/issues/detail?id=158922#c57 mentions possible issues here, and it looks like an overkill way to implement atomic read/writes.

Comment 22 by roy...@google.com, Jul 20 2016

Labels: Hotlist-Enterprise
Cc: marc...@chromium.org
Labels: ReleaseBlock-Stable M-53
This shouldn't slip into another release. Ideally we should merge a fix to 52.

Expecting the hotrod team to debug a CPU utilization issue doesn't seem reasonable.

Do we have a repro case in MTV?  posciak@ is the current owner but is pretty heavily loaded with other things too.

How can we move this forward?
In MTV, we have been testing different builds trying to bisect when the issue started. So far the indication is that the last M48 stable pushed to production (7647.84.0 / 48.0.2564.116) works as expected, with no drops in HD quality even with 10 participants. The issue starts appearing with the very early M49 build with the first 49 Chrome (49.0.2567.0), though sometimes, especially after a fresh re-image/reboot, this build seems to work fine; the issue becomes more apparent in test runs after the device has been up for a while.

We have been recording CPU utilization as well for some of the test runs but have not done any other CPU profiling.

Spreadsheet with test results: https://docs.google.com/spreadsheets/d/1udHHREr_WSWkjR-jDWLi9hPrqYPOS0mFp4Gd28v4T9k/edit#gid=0
Cc: vidster@chromium.org
@24: where is the CPU utilization? I don't see it in this spreadsheet...
Links are in the 'links' column for some of the test runs, though I don't know how useful this data really is without corresponding HD vs. non-HD data at a given time.

If this CPU usage data is useful, I can record it for each additional test run we do. Let me know.
@27: ah ok, thanks. What is the "memory" column, is it memory usage or memory bandwidth? How was it measured?
harpreet@: did you have a chance to test the revert from #17 I deployed to your device?

pbos@: we are happy to help with building and/or deploying remotely to your test devices as needed.
posciak@ - Yes, the issue is still reproducible with the revert.
Status: Available (was: Assigned)
I see, thanks.

(Also removing myself as an owner since I'm not leading the work on this)
@28: marcheu@ - it is the percentage of memory used collected using get_mem_total and get_mem_free in https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/bin/site_utils.py?l=841

Here is the sample code used:

    # Assumes site_utils is importable per the autotest convention, e.g.:
    #   from autotest_lib.client.bin import site_utils
    def _memory_usage(self):
        """Returns total used memory in %."""
        total_memory = site_utils.get_mem_total()
        # used% = (total - free) / total * 100
        return (total_memory - site_utils.get_mem_free()) * 100 / total_memory

Comment 33 by blum@chromium.org, Jul 21 2016

Cc: bccheng@chromium.org blum@chromium.org
Labels: -Hotlist-Enterprise -VideoShortList Hotlist-enterprise videoshortlist
Wrt. #23: Who could take over in the PST time zone now? Anyone who can help out, please sync with harpreet@.
Cc: puneetster@chromium.org
Labels: -Pri-1 Pri-0
People are treating this as P0. That should be reflected in the bug.
Cc: posciak@chromium.org
Owner: marc...@chromium.org
Status: Assigned (was: Available)
Assigning to marcheu@ as he is helping further debug the issue from CrOS side.
Some more data comparing 48 vs 54 with CPU frequency min/max set to 2.1GHz.

https://docs.google.com/spreadsheets/d/1E0cTR21L59VM9NdLDlBQzq8fTYU5XupQlwbp_sleIWc/edit#gid=0


Brief background on how the CPU usage data was collected:
M54
Run 1 - 4 participant session to make sure the video sent from CFM stayed in HD for the whole 5 min duration of the test run
Run 2 - 7 participant session to make sure the video sent from CFM stayed in nonHD for the whole 5 min duration of the test run

M48
Run 1 - 7 participant session streamed HD video from CFM for the whole 5 min duration of the test run


Also uploaded the CPU profiling data from these runs to https://pantheon.corp.google.com/storage/browser/chromiumos-test-logs/bugfiles/cr/622510/?pli=1
Tested various M51 builds with 8 participants (6 bots, 1 web client and 1 cfm) and captured CPU utilization. Frequency was pinned to 2.1GHz for each build prior to the test run / data capture.

CPU usage data is not drastically different across builds but it does seem to fluctuate more starting with 8150.0.0 / 51.0.2699.0 build. Usage curves for builds prior to this are smoother and within a narrower percentage band.

https://docs.google.com/a/google.com/spreadsheets/d/13SQEWYTCe5ikVl5mxL_eBv2dZYTZp5iMv8KUoK49yCk/pubchart?oid=12017749&format=interactive



dtosic@ - Here is more info about the builds tested, maybe you can use the session ID to take a look at the JMI logs.  https://docs.google.com/spreadsheets/d/13SQEWYTCe5ikVl5mxL_eBv2dZYTZp5iMv8KUoK49yCk/edit#gid=0


The fluctuations might also be caused by CPU adaptation, so we need to look into the JMI stats.

@toprice - can you pull some graphs out of the logs?

Comment 40 by toprice@google.com, Jul 25 2016

Pulled Hangouts quality stats for harpreet's test runs and put them in http://shortn/_0QSOVYS6qx

In short, the adaptation rate certainly gets worse with the later builds, with no obvious single step, but a single run for each isn't much data to go on to identify steps.
Harpreet ran through a couple of builds on M51 (https://docs.google.com/spreadsheets/d/13SQEWYTCe5ikVl5mxL_eBv2dZYTZp5iMv8KUoK49yCk/edit#gid=213532321) and identified a possible larger regression between 8140->8142. I compared the 2 builds (each with approx 7 runs of 15-20 minutes); results: http://docs/spreadsheets/d/13SQEWYTCe5ikVl5mxL_eBv2dZYTZp5iMv8KUoK49yCk/edit#gid=556873349

There is a strong indication that something broke in this range, and looking at the Chrome version there has been only one version update (from 8141 to 8142):
https://chromium.googlesource.com/chromium/src/+log/51.0.2696.0..51.0.2697.0?pretty=fuller&n=10000 

It contains a libvpx push https://chromium.googlesource.com/webm/libvpx.git/+log/a6246927767f..904cb53302df that includes the updates to the skin model, which Stephane was suspecting as a possible reason for degradation (if I understood him correctly).

I see this is marked as targeted for M53 - we're already 4 weeks into the regression (since 51) and I don't think we have the luxury of waiting another 6 weeks until M53 gets out. Can we make sure we get a fix into M52 once we have it (them)?
Given we have not yet identified a fix, it is too early to commit to merging to 52. If it is possible/safe we will do it.

FWIW it's critical that we create appropriate tests to ensure these sorts of regressions are caught immediately rather than creating a scramble and potential for unsafe merges when they are discovered weeks later. Who should own creating such tests?
We are looking into ways to set up a stress test using the autotest infra, utilizing harmony bots to generate load; a rough sketch of the measurement side is below. I'll create a separate bug for this with more details.
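
Purely as a sketch of the measurement half of such a stress test (the autotest wiring, call setup, and harmony-bot load generation are omitted; the /proc/stat sampling shown here is plain Linux, not an existing autotest helper):

    import time

    def _read_proc_stat():
        """Returns (idle, total) jiffies from the aggregate cpu line."""
        with open('/proc/stat') as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        return fields[3], sum(fields)

    def average_cpu_usage(duration_s=60, interval_s=5):
        """Samples overall CPU utilization (%) while the call is running."""
        samples = []
        idle_prev, total_prev = _read_proc_stat()
        for _ in range(duration_s // interval_s):
            time.sleep(interval_s)
            idle, total = _read_proc_stat()
            busy = (total - total_prev) - (idle - idle_prev)
            samples.append(100.0 * busy / (total - total_prev))
            idle_prev, total_prev = idle, total
        return sum(samples) / len(samples)

A test like this could then fail the run if the average exceeds a per-milestone budget, which would have flagged a jump of the size seen between M48 and M51.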

Besides this integration test, folks should also add unit tests where appropriate.
Absolutely! But unit tests wouldn't have caught a regression of this sort.
Regarding #41:
There should not be any changes in vp8 from the libvpx roll mentioned in #41. There is one vp8 CL in that list that updates the new skin model (model#1) but it does not switch to that model. The old model (model#0) is still used in M51.
The switch to the new skin model was done later in M52. 
The rest of the CLs in that list are vp9 changes.
@45: yes, the skin model did regress, but when it was switched on through the #define, not from the earlier CLs.
(which means we need to hunt for another regression in that range)

Comment 48 by sprang@google.com, Jul 27 2016

About tests...

We do have performance tests for webrtc (check the webrtc_perf_test suite at chromeperf.appspot.com). I've looked at them and I can't see any regression that matches this timeline. We don't have a CrOS machine there though, and from what I hear adding one is not trivial.

Those are just webrtc standalone though. There are some Chrome browser tests as well, but of limited scope. If we can add more automated full-stack Hangouts tests, that would be awesome.

We also have some automatic alerts for UMA and Hangouts stats, but that coverage should be extended. It's interesting that this regression basically isn't visible in the stats on any other platform, including linux.
Project Member

Comment 49 by sheriffbot@chromium.org, Jul 31 2016

Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable?

If a fix is in active development, please set the status to Started.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Can you try https://codereview.webrtc.org/2204533004 ? This fixes it for me

Comment 51 by pbos@chromium.org, Aug 2 2016

Re #50: That code path shouldn't be active for Hangouts (or simulcast) at all; quality-based downscaling should only kick in for single-ssrc streams and has nothing to do with CPU adaptation.

Comment 52 by pbos@chromium.org, Aug 2 2016

sprang@: If #50 actually fixes the issue, can you help look into why this code path is active at all?
We'll do further testing tomorrow, with some custom builds.
I'll add some logging just to verify that this code path isn't incorrectly used.
I'd hold off trying to revert this for now.
sprang@ any updates on the testing that you were working on?
Cc: afakhry@chromium.org
Over the last few days I have tested the latest build from each milestone starting with 49. I performed 7 test runs on each build/milestone tested (except M49) with 6 harmony bots, 1 CFM and 1 web client, with all participants muted, the CFM showing the WC video as the active stream and vice versa. Video from the bots was not active on the WC or CFM.

For each test run following data was collected:
- CPU usage
- Average adaptation
- HD send ratio

Based on the graph below, CPU adaptation in M49 is at 0%; it jumped to 2.6% in M50, followed by a big jump to 10.1% in M51, then dropped to 7.8% in M52 and further down to 4.1% in M53 and 3.1% in M54. M51 looks like the worst milestone for HD send ratio / CPU adaptation, and it gradually starts improving with M52, M53 and M54.

Graph of mean adaptation, CPU usage and hd send ratio across 49, 50, 51, 52, 53, 54: https://docs.google.com/a/google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/pubchart?oid=1538501226&format=interactive

 
Here is the raw data from all the test runs: https://docs.google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/edit#gid=740558155


CPU usage graphs:
M49: https://docs.google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/edit#gid=967424089
M50: https://docs.google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/edit#gid=524686887
M51: https://docs.google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/edit#gid=817180553
M52: https://docs.google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/edit#gid=1960836693
M53: https://docs.google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/edit#gid=1991803458
M54: https://docs.google.com/spreadsheets/d/1iVPDcLWPy5xO-ALGP_D-2NkO3jz4QqkRQhXVnqBIXdE/edit#gid=881790984

Project Member

Comment 57 by sheriffbot@chromium.org, Aug 15 2016

Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable?

If a fix is in active development, please set the status to Started.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
The main investigation currently takes place in the internal tracker: http://b/29285690
marcheu@ do you have cycles to look into this? This is currently an RBS.
Labels: -videoshortlist
posciak@ is this something you can help investigate?
ketakid@ - this is being actively investigated and updates are added to http://b/29285690

mnilsson@ / abodenha@ - Does it make sense to close this bug and continue tracking in http://b/29285690?
Status: Archived (was: Assigned)
If that's where the people working on this are most easily able to make progress, fine.  If there are particular CLs we want to land in Chrome or Chrome OS as a result we can open separate issues here for them.

Comment 64 by ketakid@google.com, Mar 18 2017

Labels: Pri-3
Status: Available (was: Archived)
Activating. Please assign to the right owner and the appropriate priority.
Status: Archived (was: Available)
