Issue metadata
Sign in to add a comment
|
662.5%-1881.2% regression in multiple Default*Apm tests at 12187:12187 |
||||||||||||||||||||||
Issue descriptionSee the link to graphs below.
,
Apr 1 2016
r12187 == https://codereview.webrtc.org/1844583003, so I'm sending this to aluebs@ (And FYI - while looking at this, I noticed that there's some inconsistency in the test names, filed that as bug 599953 )
,
Apr 1 2016
This is weird, since that CL was supposed to be fix for this regression: https://bugs.chromium.org/p/chromium/issues/detail?id=596079 And I corroborated locally on linux that it improved the performance. But I will have another look.
,
Apr 2 2016
I can reproduce it locally. As soon as I add any condition on public_submodules_ to rev_analysis_needed() performance drops drastically, which makes no sense, because this CL only restricts the cases where the splitting needs to be done. And this can be seen by all the other metrics that improved with this CL. I tried seeing where the time was spent using a profiling tool, but there the render doesn't seem to take as long as the test results suggest. So my best guess now is that the access to public_submodules_ in the render thread makes it slower somehow, although there are plenty of accesses in other render methods. Anyway, I am assigning this to peah, who designed the new threading model of the APM and might be able to shed some light on this. Please feel free to reassign back to me if you find it is not thread-related.
,
Apr 4 2016
,
Apr 5 2016
The problem is that the is_enabled() methods in the echo_cancellation, echo_control_mobile and gain_control submodules all take the capture side lock. Therefore, when call to the is_enabled() methods were added in the rev_analysis_needed() method, the render side of APM is forced to wait on the capture side. This is a true bug in the CL https://codereview.webrtc.org/1844583003 that I missed when I reviewed it. Unfortunately, there is no straightforward way around this that I can see. The is_enabled methods must take the capture lock and then we immediately run into this problem. There is a long-term solution to address this that we are working on that removes the submodules from the public APM api but that will take time.
,
Apr 5 2016
Reassigning back to aluebs
,
Apr 5 2016
For example what I can't understand is that enabled_ is accessed in ProcessRenderAudio(), which is called from the render side, only capturing crit_render_. While the is_enabled() only accesses the same variable, but needs to capture crit_capture_? Maybe I am missing something, but this seems weird to me.
,
Apr 5 2016
Reassigning back to peah to clarify my doubts. If the crit_* are working as intended, please feel free to reassign to me and I can revert the CL that introduced this, but that is going to bring back some other regressions that were fixed by that CL: https://bugs.chromium.org/p/chromium/issues/detail?id=596079
,
Apr 5 2016
The two-lock design was necessary in order to be able to allow the render and capture side of APM to be running concurrently with the current API design. Since there are shared resources these must be protected with a lock in order to be concurrently accessed, and if only one lock used, APM would basically be running in a single-threaded manner. And each of the render and capture side needs separate locks as there is nothing preventing multi-threaded access to either of them. To allow non-blocking reading of the shared resources, the scheme was set up such that the resources are only written while holding both the capture and render locks, and read while holding any of the locks. In order to prioritize the render-side processing (which have more critical real-time constraints) the render lock always needs to be aquired before the capture lock. If the locks are acquired in the other way, deadlocks may occur. What happens here, is that the is_enabled call is used, which takes the capture lock, as it is a capture-side API call. Furthermore, it cannot take the render lock as it may be called while holding the capture lock, which would cause a deadlock. Therefore, when called from the render side, that already holds the render lock, the call to is_enabled causes the render side both to hold the render and capture locks and for this to be possible it will need to wait for the capture side to finish. This basically causes the APM to be running in a single-threaded manner. So to answer the comment in #8: The reason that enabled_ can be accessed in ProcessRenderAudio while only holding the render lock is that it is a read-only access and in a render-side API call. And the reason that is_enabled() can be accessed in is_enabled() is that it is a read-only access in a capture-side API call. That was a long explanation, but it is not a simple thing to explain. This design will gradually be refactored as we evolve the APM API.
,
Apr 5 2016
I created a CL https://codereview.webrtc.org/1859243002/ that should solve this problem by adding another submodule API call that provides render side read access to enabled_ . It complicates the code, however, so it is not clear that that is a good solution to this issue.
,
Apr 5 2016
Thank you for doing that CL. Maybe it is a silly question, but why do we need an accesser that captures each crit_* when it is read-only? Can't we have one that doesn't capture any?
,
Apr 5 2016
That is a very valid question. The problem is that it is read-only only within that method. Other methods may modify it, and to ensure that this is not happening while reading the variable a lock is needed. Furthermore, due to the APM API design, all capture-side-only methods (that are never accessed from the render-side) must acquire the lock because there is nothing in the API preventing the capture-side methods to be accessed from different threads. The same is true for the render-side-only methods.
,
Apr 5 2016
I see, that makes sense. Thank you for clarifying.
,
Apr 6 2016
The CL that addresses the issue is being landed. aluebs@: could you please verify that the regression cease after the CL has been successfully landed?
,
Apr 6 2016
,
Apr 6 2016
It seem to be fix some of the regressions, but not all of them. Did you test it locally before submitting? You could have a look at the "original alerts" link above to see which metrics improved and which didn't change.
,
Apr 8 2016
Apparently when I checked the metrics hadn't been updated yet. Now all the graphs seem to be back to normal. I am closing this as fixed, but please feel free to re-open if you have any additional concerns. |
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by tnakamura@chromium.org
, Apr 1 2016