MSE/EME causing 'waiting' events
Reported by
g.marcil...@gmail.com,
Sep 3
|
|||||||||||
Issue descriptionUserAgent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 Steps to reproduce the problem: Hi, We have a Javascript music library built using the MSE (and EME). We are seeing many 'waiting' events being fire with the readyState 2 which results in a brief interruption of audio. We've also recently added EME support and we are seeing x2 rebuffers. For context, the current implementation uses 3 audio elements that we recycle. After a new track begins, we clear the buffer from the unused audio element and start prefetching the next track. We currently buffer up to 4mn ahead and we do not remove data from the buffer manually. We let the browser handle it automatically. (which could be a problem according to this: https://bugs.chromium.org/p/chromium/issues/detail?id=854449&can=1&q=mse%20waiting&colspec=ID%20Pri%20M%20Stars%20ReleaseBlock%20Component%20Status%20Owner%20Summary%20OS%20Modified) In order to prevent this poor user experience, I would like to understand what is causing those `waiting` events and what could cause a 2x increase by now use the EME. Do you have any recommendations? What is the expected behavior? What went wrong? x2 increase of rebuffers after enabling EME Did this work before? N/A Does this work in other browsers? N/A Chrome version: 68.0.3440.106 Channel: stable OS Version: OS X 10.13.6 Flash Version:
Showing comments 55 - 154
of 154
Older ›
,
Sep 19
The data shared in #53 is based on real customers usage (not including your fix). At the moment we are not able to break down this data per Chrome version.
,
Sep 19
100ms is definitely too low; you'll want to target at least 1-2s buffer. So if you're seeing playbacks regularly approaching that distance, the waiting is just due to not buffering enough. I'm not sure how you're currently managing your buffering, but you probably want to attach to timeupdate events and ensure the buffering level never falls below N seconds. The background ones have enough data, but as mentioned once you receive the waiting the buffered state may already be out of date. This may be particularly true in background tabs where JS may be delayed. Did the data provided by the recommendation in c#36 reveal anything interesting?
,
Sep 19
100ms indicates that customers are having rebuffers because of a slow connection and are unable to download fast enough the next segment. P90 data shows there is +100 seconds of audio to play when a rebuffer occurs. I will test tomorrow c#36 and the M71 build.
,
Sep 19
Gotcha, are you able to filter out cases where you know you don't have the data? The platform wide smooth rate for audio-only MSE content is 95% for 0 rebuffers and 99% for 0-1 rebuffers. I.e., 95% of elements experience 0 rebuffers and 99% of elements experience at most 1 rebuffer. Do you have a sense of what your smooth rate is? That will help us determine the severity of your problem.
,
Sep 20
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e130ef9c8ac16f65d4dc5cff95e26118a4e39438 commit e130ef9c8ac16f65d4dc5cff95e26118a4e39438 Author: Dale Curtis <dalecurtis@chromium.org> Date: Thu Sep 20 00:17:13 2018 Add better tracing for renderers and decoder selection. Renderers are core to our entire pipeline, yet trace events are missing! So lets add a few that allow useful diagnosis when looking at trace reports. Specifically this adds traces for the following: - (AudioRenderer|VideoRenderer|RendererImpl)::Initialize() [async] - (AudioRenderer|VideoRenderer|RendererImpl)::Flush() [async] - DecoderSelector::SelectDecoder() [async] - RendererImpl::SetCdm - RendererImpl::StartPlayingFrom - RendererImpl::SetPlaybackRate - RendererImpl::OnBufferingStateChange - RendererImpl::StartPlayback - RendererImpl::PausePlayback - RendererImpl::OnRendererEnded - RendererImpl::OnError - RendererImpl::OnSelectedVideoTracksChanged - RendererImpl::OnEnabledAudioTracksChanged This also removes usage of ".is_null()" on callbacks in favor of the more modern built in boolean test. BUG=879970 TEST=none R=xhwang Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel Change-Id: I79792d93b6369cfe4fe32f322f176b68acc92ee8 Reviewed-on: https://chromium-review.googlesource.com/1232688 Commit-Queue: Dale Curtis <dalecurtis@chromium.org> Reviewed-by: Xiaohan Wang <xhwang@chromium.org> Cr-Commit-Position: refs/heads/master@{#592622} [modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/base/demuxer_stream.cc [modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/base/demuxer_stream.h [modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/base/stream_parser_buffer.cc [modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/filters/decoder_selector.cc [modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/filters/decoder_selector.h [modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/renderers/audio_renderer_impl.cc [modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/renderers/audio_renderer_impl.h [modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/renderers/renderer_impl.cc [modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/renderers/renderer_impl.h [modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/renderers/video_renderer_impl.cc [modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/renderers/video_renderer_impl.h
,
Sep 20
Hi, Adding additional logs as recommended in c#36 did not result in more useful information. Additionally, I am still seeing rebuffers using Canary. I also recorded a trace using Canary.
,
Sep 21
It looks like that trace is using the dev tools slowdown function? Is that the case? I'm more interested in traces from when you're actually hitting issues. I think with dev tools you're just running out of buffer, but I'll add some more traces for the demuxer to get more information.
,
Sep 21
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/43a86d270102c4cb45cbe8ab6e4d4b209e184a65 commit 43a86d270102c4cb45cbe8ab6e4d4b209e184a65 Author: Dale Curtis <dalecurtis@chromium.org> Date: Fri Sep 21 23:24:23 2018 Add demuxer level traces for Initialize, Seek, Read. This is another core component of media, we should have trace events to help diagnose issues in the field. This adds the following traces: - (Chunk|FFmpeg)Demuxer::Initialize - (Chunk|FFmpeg)Demuxer::Seek - (AudioBuffer|VideoFrame)Stream::(Decode|Read|ReadFromDemuxerStream) - VideoFrameStream::Prepare (for audio too, but never prepares) This also renames the DecoderStream<AUDIO>, DecoderStream<VIDEO> traces to use the normal names AudioBufferStream, VideoFrameStream. BUG=879970 TEST=none R=wolenetz, xhwang Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel Change-Id: Ie9ca7474bc5684117820b219012f0455d044eabc Reviewed-on: https://chromium-review.googlesource.com/1237573 Commit-Queue: Dale Curtis <dalecurtis@chromium.org> Reviewed-by: Xiaohan Wang <xhwang@chromium.org> Cr-Commit-Position: refs/heads/master@{#593374} [modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/base/demuxer_stream.cc [modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/base/demuxer_stream.h [modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/filters/chunk_demuxer.cc [modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/filters/chunk_demuxer.h [modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/filters/decoder_stream.cc [modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/filters/decoder_stream.h [modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/filters/ffmpeg_demuxer.cc [modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/filters/ffmpeg_demuxer.h
,
Sep 24
@g.marcilhacy can you collect a trace with the most recent canary? Preferably without using dev tools to slowdown, but we have enough trace logging in there to confirm any issues with the dev tools forced slow down too if you want to provide one of those too.
,
Sep 24
I kinda need to have the console open in order to the see the rebuffer being logged. Is that fine if I do not use the Performance throttling?
,
Sep 24
Yeah that should be fine. You can also record one with performance throttling and I can tell you for sure now if it's just running out of buffer. But most helpful would be one that's actually a glitch w/o performance throttling.
,
Sep 24
Attached. The devtools were open on the console tab and I was not using cpu throttling. You'll find 1 rebuffer ~5 second before the end of the trace. Version: Version 71.0.3560.0 (Official Build) canary (64-bit)
,
Sep 25
Okay yes, the underflow is due to the decoder not being able to read from the demuxer. The trace shows Chrome attempting to read from MSE, but MSE not having any buffered data to provide. The read remains unsatisfied until appendBuffer() is called ~3,652.085 ms later. The previous append is of size 328386 and turns into ~430 buffers which if AAC, should turn into ~10s of data, but we seem to run out ~800ms later... So I think we do have enough data, which seems to imply that something is blocking mse on the main thread for some time. There is some DOM timer running during this underflow which executes for 3 secs -- DOM timers comes from setInterval/setTimeout. Can you instrument your setInterval/setTimeout calls to see if any take longer than > 1 second to execute? In chrome://tracing I don't seem to have a way to get any more detail. You might get more if you record a performance trace through dev tools: https://developers.google.com/web/tools/chrome-devtools/evaluate-performance/reference
,
Sep 25
Thanks for the explanation. So you're saying a long running task (setTimeout) on the main thread is the source of this rebuffer. Do you have any theories on why those rebuffers are exacerbate with DRM?
,
Sep 25
Yes that seems to be the case. I'm still investigating the trace. Since for plain MSE I wouldn't expect a really long main thread task to glitch if there's enough data available, since we decode+read on another thread. That said, DRM content requires more hops to the main thread to talk to the site through EME, so if you have a regular task blocking the main thread I'd expect it to glitch more. That's possibly what's happening here, but I'm still looking more closely.
,
Sep 26
Do you have any documentations on how to use the tracing tool? Where do you see in the trace a 3 secs long DOM timer?
,
Sep 26
I have a performance profile with a rebuffer caused by a position change in the song. Right before the waiting event being emitted, I see a timeupdate event which probably emits the new position in the song. I also know that when the rebuffer occured, ~14s of audio was still available. Is it because the CDM has not decrypted this part of the buffer even if the audio has been available? The waiting event is emitted at the ~35 second mark in the profile.
,
Sep 26
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/ca79b4a569b34fa3080d83b27ddaf6311b695c9d commit ca79b4a569b34fa3080d83b27ddaf6311b695c9d Author: Dale Curtis <dalecurtis@chromium.org> Date: Wed Sep 26 18:26:29 2018 Expand traces for decrypt-only and decrypt+video_decoder. DecryptingDemuxerStream had no traces and the DecyrptingVideoDecoder had some old-style, incomplete traces. This adds traces to the former and expands the latter. Specifically we now have: - DecryptingDemuxerStream::DecryptPendingBuffer [async] - DecryptingDemuxerStream::WaitingForDecryptionKey [async] - DecryptingVideoDecoder::DecodePendingBuffer [async] - DecryptingVideoDecoder::WaitingForDecryptionKey [async] The DVD traces used to use a trace_id and provide a buffer size, but this isn't really necessary since the start trace is tagged correctly just by using |this| and providing the timestamp. BUG=879970 TEST=passes cq Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel Change-Id: If21288e6c47414ec038f034bba381fcebf5ca99b Reviewed-on: https://chromium-review.googlesource.com/1244164 Reviewed-by: Xiaohan Wang <xhwang@chromium.org> Commit-Queue: Dale Curtis <dalecurtis@chromium.org> Cr-Commit-Position: refs/heads/master@{#594393} [modify] https://crrev.com/ca79b4a569b34fa3080d83b27ddaf6311b695c9d/media/base/decryptor.cc [modify] https://crrev.com/ca79b4a569b34fa3080d83b27ddaf6311b695c9d/media/base/decryptor.h [modify] https://crrev.com/ca79b4a569b34fa3080d83b27ddaf6311b695c9d/media/filters/decoder_selector.cc [modify] https://crrev.com/ca79b4a569b34fa3080d83b27ddaf6311b695c9d/media/filters/decrypting_demuxer_stream.cc [modify] https://crrev.com/ca79b4a569b34fa3080d83b27ddaf6311b695c9d/media/filters/decrypting_demuxer_stream.h [modify] https://crrev.com/ca79b4a569b34fa3080d83b27ddaf6311b695c9d/media/filters/decrypting_video_decoder.cc [modify] https://crrev.com/ca79b4a569b34fa3080d83b27ddaf6311b695c9d/media/filters/decrypting_video_decoder.h
,
Sep 26
If you look in the trace there is a point where the audio output device stops rendering audio shortly after the AudioBufferStream::Read stalls for > 200ms. There's another long trace in there which is annotated with TimerBase and DOMTimer. I'll take a screenshot in a bit and look at the dev tools trace you provided. Between meetings at the moment.
,
Sep 26
Did you mean to attach that performance profile? It's not for a music site at all, but some sort of WebGL looking demo.
,
Sep 26
Derp, that's my fault. Loaded the wrong trace.
,
Sep 26
Your performance trace has huge blocking timer tasks. Do you know what those are doing? I can't tell just from the trace since the code is obfuscated.
,
Sep 26
I assume most of them are UI updates. We use a UI library that's not performant when building complex UI applications... However, most of the blocking tasks are toward the beginning of the Profile, while the waiting event occurred toward the end of the Profile.
,
Sep 26
Thanks, right before the waiting event (@35696ms in trace) there's a 500ms "composite layers" call. I don't think that should show up as a DOMTimerFired, but maybe it does. In either case I'm still unsure how that's actually preventing us from reading buffers from MSE. Tomorrow I'll simulate main thread contention with full buffers to see if I can figure out if it's possible to glitch MSE that way. -scheduling, +Internals>Compositing folks to see if there's a way to dig into what "Composite Layers" means more.
,
Sep 26
Okay, thanks for digging into this issue :)
,
Sep 27
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/aa81f1c12f03cd3fdecc04ac61964d7b63b6d25e commit aa81f1c12f03cd3fdecc04ac61964d7b63b6d25e Author: Dale Curtis <dalecurtis@chromium.org> Date: Thu Sep 27 01:38:11 2018 Add decode status to DecoderStream::Decode ending trace. Was left off in previous CLs. BUG=879970 TEST=passes cq Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel Change-Id: I3dd5288bf60381e42a8ab08d054bdf60e4e96f96 Reviewed-on: https://chromium-review.googlesource.com/1244613 Commit-Queue: Dale Curtis <dalecurtis@chromium.org> Reviewed-by: Xiaohan Wang <xhwang@chromium.org> Cr-Commit-Position: refs/heads/master@{#594556} [modify] https://crrev.com/aa81f1c12f03cd3fdecc04ac61964d7b63b6d25e/media/base/decode_status.cc [modify] https://crrev.com/aa81f1c12f03cd3fdecc04ac61964d7b63b6d25e/media/base/decode_status.h [modify] https://crrev.com/aa81f1c12f03cd3fdecc04ac61964d7b63b6d25e/media/filters/decoder_stream.cc [modify] https://crrev.com/aa81f1c12f03cd3fdecc04ac61964d7b63b6d25e/media/filters/decoder_stream.h
,
Sep 27
I wasn't able to simulate any glitches with 2s main thread blocks except for when the block prevented sourceBuffer.append() from providing enough data. That's what I was expecting, except during MSE operations media doesn't care about main thread contention. Can you grab another trace from canary now that I've added a few EME traces? This will tell us if we ran into a need_key event.
,
Sep 27
Yes I can. Can you share the script you wrote to simulate the issue?
,
Sep 27
Not a script, but if you have your own chromium build:
dalecurtis@xorax /d/code/chrome/src $ git diff
diff --git a/media/blink/webmediaplayer_impl.cc b/media/blink/webmediaplayer_impl.cc
index 253f3fda0a0d..c11af50043f5 100644
--- a/media/blink/webmediaplayer_impl.cc
+++ b/media/blink/webmediaplayer_impl.cc
@@ -2822,6 +2822,8 @@ WebMediaPlayerImpl::UpdatePlayState_ComputePlayState(bool is_remote,
void WebMediaPlayerImpl::ReportMemoryUsage() {
DCHECK(main_task_runner_->BelongsToCurrentThread());
+ TRACE_EVENT0("media", "BlockingMainThread");
+ base::PlatformThread::Sleep(base::TimeDelta::FromMilliseconds(2000));
// About base::Unretained() usage below: We destroy |demuxer_| on the main
// thread. Before that, however, ~WebMediaPlayerImpl() posts a task to the
diff --git a/media/renderers/audio_renderer_impl.cc b/media/renderers/audio_renderer_impl.cc
index 79c34da2ab0f..335ca2978fec 100644
--- a/media/renderers/audio_renderer_impl.cc
+++ b/media/renderers/audio_renderer_impl.cc
@@ -1042,13 +1042,13 @@ int AudioRendererImpl::Render(base::TimeDelta delay,
frames_after_end_of_stream = frames_requested;
} else if (state_ == kPlaying &&
buffering_state_ != BUFFERING_HAVE_NOTHING) {
- algorithm_->IncreaseQueueCapacity();
+ // algorithm_->IncreaseQueueCapacity();
SetBufferingState_Locked(BUFFERING_HAVE_NOTHING);
}
} else if (frames_written < frames_requested && !received_end_of_stream_) {
// If we only partially filled the request and should have more data, go
// ahead and increase queue capacity to try and meet the next request.
- algorithm_->IncreaseQueueCapacity();
+ // algorithm_->IncreaseQueueCapacity();
}
audio_clock_->WroteAudio(frames_written + frames_after_end_of_stream,
I tested with this site:
http://storage.googleapis.com/dalecurtis/mse.html?src=buck2.mp3&type=mp3
,
Sep 27
Here's one. I don't think you'll see the first rebuffer because I think I was not recording yet, however you should see all the other ones. FYI, they all occurred when I opened a new tab.
,
Sep 27
Mostly these are just calls taking forever to complete, probably because the tab gets backgrounded at some point -- for some reason the renderer is listed as silent. It shouldn't have that status if audio is _audible_ (i.e., volume must be > 0). Do you see the speaker icon on the tab when it is in the background? The foreground waiting events are just running out of buffer.
,
Sep 27
Yes it was muted :/ Trying again...
,
Sep 27
This time the audio was not muted and the window was not in the background. The rebuffer happened at 14s.
,
Sep 27
Is this still in a VM? How many cores is your VM configured for? The trace shows hitting another huge main thread block and what appears to be Chrome's media thread being descheduled. Nothing happens on that thread for 450ms -- then all the queued tasks suddenly execute in flurry.
,
Sep 28
Can you grab another trace with the categories "toplevel.flow" and "system_stats" enabled? They'll help point out any system level issues.
,
Sep 28
Is it possible to decouple the reading the MSE buffer from the main thread? This trace and previous trace are using a Windows 10 VM with 1 core and 8gb ram. The rebbufer happen 32s in the track.
,
Sep 28
I also have a recording of myself reproducing the previous trace. I don't know if that's useful in anyway or not.
,
Sep 28
Reading of the MSE buffer is decoupled from the main thread, that's the test I did in c#81..#83 and confirmed that it is working as expected. The issue here is that a single core will cause the media thread to be descheduled when there's significant work happening elsewhere. You'll need to solve your UI library issues to prevent this from happening. Even though our media threads are at a higher priority they're at the mercy of the OS for complex tasks when you only have a single core. Single cores are pretty rare, are you able to reproduce any issues if you give your VM two cores?
,
Oct 3
So far I have not been able to repro rebuffers with a 2 core VM on a continuous stream like I was on a one core. However, I am able to have rebuffers when I seek past a buffered range. What appears to happen is when I seek to a new position, I append the right 10s chunk, playback resumes but stops briefly right after. What I guess is happening is the CDM was unable to decrypted the audio fast enough. Is this something you can confirm by looking at the trace? Time: 93.669297 Rebuffer duration: 245ms Buffered: 90.000544, 100.008343 From looking at customer logs it shows that they have rebuffers on continous streams, they do not seek, or skip, etc. I do not know right now how what I just reproduced can map to what customers are doing... :/ Lastly, if the MSE buffer is decoupled from the main thread i'm not sure that solving our UI issue would prevent rebuffers on a 2 core machine. Do you agree?
,
Oct 3
Depends on what else they're doing. If the other core is busy and your UI library spikes at an inopportune time it'll end up competing for resources with the media thread. Do you know how your rebuffer rate compares across the clients vis a vis the metrics I provided in c#58? You're always going to see some rebuffers, the question is just how frequent they are.
,
Oct 3
Not sure what's blocking in the CDM process, however best practice is to append the bytes at the location the user wants to seek to, _then_ issue the seek to the video element. When you do it before appending the bytes, you'll end up with xhwang@ can you take a look at the trace in c#93? The cdm is blocking doing something that seems to happen between decrypt calls. Do you have any idea what that might be? It doesn't seem to be a waiting for key event since that'd show up in the trace. @g.marcihacy: Can you enable the mojom category so we can see what mojo messages are being sent when this hang occurs?
,
Oct 4
@dalecur... Is this kind of CPU/thread contention we are experiencing unusual with regards to CDN usage? Am curious about other resolutions that may have worked in other instances. Since we are playing music - a relatively low bandwidth use case - I imagine other scenarios like this for video would have experienced something similar.
,
Oct 4
Err just realized I didn't finish my sentence in c#95; that should read "you'll end up with the MediaSource blocking on receiving the data and resuming playback the instant 200ms is available, which may be before you've actually buffered enough. Even if you append > 200ms, not all of it may be in the MediaSource upon resume." Yes, per the metrics I quoted in c#58, the experience here is unusual. It's not clear to me that the glitching issue is CPU usage related in the > 1 core case. There does seem to be some block happening in the CDM process after the seek, for which I'm not sure what'a going on yet. Do you have users reporting issues or just telemetry data? For telemetry I'd track if the volume is > 0 (otherwise tab priority is lowered), if the tab is in the background (just for correlation), and whether you see large jumps in Date.now() (which corresponds with suspend resume events). All of these are adverse conditions which would show up in telemetry but for which users would probably not report anything.
,
Oct 5
It's entirely via telemetry that we are seeing the rebuffers. Doesn't look like tab mute status is exposed without extensions though. We can track our own app's volume and correlate rebuffers.
,
Oct 5
Dale, is it possible to detect programmatically if a tab has been muted?
,
Oct 5
No but that usage is rare. You just need to check element.volume.
,
Oct 8
I think it's time that we hold a live troubleshooting session with technical folks to root cause this issue. We are going on a month of troubleshooting and we need to find a solution that allows us to move forward. @dalecur...What's the best way to hold this?
,
Oct 10
We're happy to have a VC to discuss more. You can e-mail me at the @google.com equivalent of my username. That said, at this point I think I've provided some pretty concrete advice: - fix your ui library; it's blocking your appends. - check your telemetry for correlation with is_background or volume==0. From our side: - The trace you sent does show a block in the CDM which I'm unsure of the cause. xhwang@ needs to take a look at the trace in c#93. - Otherwise we haven't been able to reproduce issues. - You haven't expressed how far off the stats you are from the ones I quoted in c#58, so I'm unsure how to quantify your problem. =>xhwang so he doesn't miss this. Assign back to me once you've taken a look.
,
Oct 12
Sorry for the delay. I checked the trace and saw the delay at Start 20,719.822 ms Wall Duration 554.984 ms CPU Duration 226.529 ms I have no idea what the that task is. I'll add more traces in CdmAdapter, which runs in the CDM/utility process.
,
Oct 16
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/def776e75cfc3b34930e83402a7bb2c186bc4eb2 commit def776e75cfc3b34930e83402a7bb2c186bc4eb2 Author: Xiaohan Wang <xhwang@chromium.org> Date: Tue Oct 16 04:10:26 2018 media: Add more trace events in CdmAdapter Add TRACE_EVENTs for all methods that call into the CDM so that we can monitor CDM's performance more closely. Bug: 879970 Test: Manually tested and checked traces. Change-Id: I61c063ad656b852463195b2cea93f91aee8d91ba Reviewed-on: https://chromium-review.googlesource.com/c/1281042 Reviewed-by: Frank Liberato <liberato@chromium.org> Commit-Queue: Xiaohan Wang <xhwang@chromium.org> Cr-Commit-Position: refs/heads/master@{#599844} [modify] https://crrev.com/def776e75cfc3b34930e83402a7bb2c186bc4eb2/media/cdm/cdm_adapter.cc
,
Oct 16
g.marcilhacy: I added more tracing in CdmAdapter which would help us understand what's going on with the long blocking task in the CDM/utility process. Could you please run the test again with tomorrow's Canary build? For now my guess is that those tasks are CreateSessionAndGenerateRequest() and UpdateSession(), which we know are slower. We can take another look to see whether we can improve the performance there. At the same time, the JS player should try to trigger the license exchange as early as possible to avoid resource contention during playback, or stall due to waiting for key.
,
Oct 18
Here is a new trace using the latest version of Canary. FYI, when Chrome is the only application running I cannot repro any rebuffers. For this trace I was playing the same DRM content on Firefox and I was also playing a youtube video on Edge. (Firefox did not have a rebuffer). Details: Time: 47s Ranges: [0, 54] Duration: 200ms
,
Oct 18
I don't think that trace caught the rebuffer. Or at least I don't see one in there. Is this still on Mac?
,
Oct 18
Here is another. This one had 2 rebuffers within the same track.
,
Oct 18
No huge blocks showing on the CDM thread, there's some sort of timer expiring, while the media thread waits for a frame, but xhwang@ will have to take a look at that.
,
Oct 18
Timer expiring is just the way CDM does some periodic routine work. To be fair, Firefox is also using Widevine CDM to decrypt audio :) So I don't think it's the CDM to blame. It seems to me this is just a performance issue? We can spend some time improving the overall throughput of our pipeline, e.g. buffer more audio in advance (see issue 739797 ), or support parallel decryption/decoding (see issue 721863 ). But on lower end devices with multiple other programs running you can always push the browser to the limit. So the question is where do we want to cut the line and how much effort do we want to spend for these cases. Dalecurtis: One question. In the render process and CDM process, these are a lot of tasks with a huge difference between the Wall Duration and CPU Duration, e.g. 152ms vs 5ms. I don't see it it in the browser process, especially on the audio thread. Is this a sign that the render/CDM process has lower priority and don't have enough CPU time allocated? Title CdmAdapter::Decrypt Start 23,515.761 ms Wall Duration 153.792 ms CPU Duration 5.089 ms
,
Oct 18
The traces are for Windows so marking this bug as Windows too. @g.marcilhacy do you have any traces of this happening on macOS? I'm not sure if that's the cause, in the trace above you can see the render thread still getting a time slice which is the same priority as our media/cdm threads. Certainly the audio threads are created with base::ThreadPriority::REALTIME_AUDIO. Though I don't remember if that does anything in the renderer process due to the sandbox. Our media thread is created with base::ThreadPriority::NORMAL since we generally expect a NORMAL priority thread to receive a time slice within the 200ms range of our audio buffer. We could try marking the media thread ::DISPLAY priority (I don't think it should be realtime given the buffer sizes). Note: NORMAL == DISPLAY on macOS. I'm not sure the media thread is the issue though. So probably we'd need the CDM process to get a similar priority. Do you know if anything else runs in the utility process besides the mojo cdm? Would this show up in the trace?
,
Oct 18
For the idea of support parallel decryption to improve throughput, I filed issue 896931. The utility process is dedicated to the CDM. I don't expect other non-CDM heavy tasks to run there.
,
Oct 18
Since there's not a lot going on in the utility process I wonder if it's getting descheduled by the OS at NORMAL priority. @g.marcilhacy: Can you open process explorer and bump the priority of the CDM process and see if you still get glitches?
,
Oct 19
What is the name of the process? I cannot find it.
,
Oct 19
You'll need to open the Chrome task manager and look for a utility process with the label named "Content Protection Module" and find it's id, then look up that id in process explorer -- sorry I thought it'd be labeled in PE.
,
Oct 19
To be more accurate, look for "Utility: Content Decryption Module Service".
,
Oct 19
I can indeed see the priority is set to NORMAl, however I can't change the priority of the process.
,
Oct 19
Are you using this? https://docs.microsoft.com/en-us/sysinternals/downloads/process-explorer It's been a while, but I thought you could using that.
,
Oct 19
I was able to repro again with the priority set to HIGH on the CDM process.
,
Oct 19
Thanks, that probably means the receive side isn't having issues -- I haven't seen any send side issues in the traces, but maybe there's something that isn't showing up in the trace. What are you doing when the underflow happens?
,
Oct 19
+rockot,mojo for any tracing advice which might help us figure out if this is due to a send side delay. Are there any trace categories besides mojom which would help diagnose if our mojo calls from the renderer process to the cdm utility process are being delayed?
,
Oct 19
,
Oct 19
What do you mean by delayed, exactly? Where would you expect to find sources of delay on the sender side? Just due to underlying socket congestion?
,
Oct 19
I mean we're seeing a situation where a Decrypt() message should be showing up in the utility process, but does not appear for > 200ms. We've bumped the priority of the utility process to high and still see the issue, so it seems like the message isn't leaving the renderer process in a timely fashion. We know the method: https://cs.chromium.org/chromium/src/media/mojo/interfaces/decryptor.mojom?l=33 ... has been invoked due to the trace. So my speculation is that there's contention somewhere in the sending process. E.g., could the IO/sending thread be blocked and thus the outbound message is delayed? I can't remember if the renderer process has a dedicated io thread for sending these message or if it's a task scheduler pool.
,
Oct 19
The renderer does have a dedicated IO thread, but it's not relevant when sending messages on normal Mojo interfaces (of which Decryptor is one) because calls on Mojo interfaces do the IPC immediately from the calling thread. The only time the IO thread is relevant when sending on normal interfaces is when the write to the underlying socket fails. At that point we queue the message and will send it out ASAP from the IO thread once the socket becomes writable again. To be clear, there is a single shared socket between any pair of communicating processes, and this is multiplexed for all interfaces going between those processes. The only way to see this kind of behavior is if: a) the renderer is hammering your service process with lots of IPCs (not necessarily on the same Mojo interface) b) the service process is simply not reading IPCs as fast as they're being written by that renderer
,
Oct 19
Hmm, there shouldn't be any other services hosted in that utility process AFAIK per c#112. Decrypt() calls are 1 at a time, so it shouldn't be backing up -- especially not when the utility process is marked as a high priority process. Specifically in the trace in c#108 it looks like the message is sent to the utility process, ~230ms later it's received and immediately completed, then ~590ms later it's delivered back to the renderer process. During this time the high priority audio threads are running on schedule. @g.marcilhacy can you try the process priority trick with both the Amazon Music renderer tab as well as the CDM process and see if you still get glitches? It's possible this doesn't help if the competition is within the processes though. We may need to test a canary build where we set DISPLAY priority for media+cdm threads.
,
Oct 19
I don't mean other services running in the process, I mean other connected interfaces. Surely there are other interfaces between the cdm service and its clients. Is it possible that a bug is making one of them spammy?
,
Oct 19
... though I don't know if DISPLAY priority will even help there are operations on the compositor thread which also look blocked and it should be at display priority. E.g., Scheduler::OnBeginImplFrameDeadline is running for ~500ms in the trace.
,
Oct 19
I'm not sure what you mean. Can you elaborate? There's not a lot of calls in/out of the CDM once playback starts aside from the regular cadence of decoding packets. Certainly I'm not sure how that wouldn't show up in the trace.
,
Oct 19
I wouldn't expect thread priority on the client side to matter much, given your data. If you're actually measuring 230ms between the Decrypt() call and its receipt in the service, either the socket is backlogged or the service is taking a long time to dispatch the message. On that note, one other thought - the message may actually arrive in the service process very quickly. *Incoming* IPCs wake up the IO thread and get scheduled for dispatch on the destination binding's SequencedTaskRunner ASAP. If that sequence (e.g. main thread of the service process?) is busy doing other stuff, you're not going to see the message dispatched right away.
,
Oct 19
I mean there are several Mojo interfaces that comprise the total IPC surface of the cdm service. At a glance I see CdmService, CdmFactory, and ContentDecryptionModule, in addition to Decyptor. Of course it's just a wild guess, but it's important to recognize that Decrpytor being slow doesn't necessarily mean Decryptor would be the offending interface pipe in the event of IPC spam.
,
Oct 19
Ah, none of those should be active at that point. Since we have the "mojom" category enabled, they should show up in the trace if that was the case too. The decrypt is happening on the main ChildThread which is just a RunLoop. It's just sitting there idle per the trace above. Since it's a MessageLoop we'd see a task tracking entry if it was busy I think.
,
Oct 22
@g.marcilhacy can you provide some more detail on exactly what you're doing when the glitch happens? Are you active within the VM or doing something else? I want to rule out a state where the VM is getting suspended by the host OS.
,
Oct 22
I am actively using the VM. I just repro again using the following steps: - Play on music.amazon.com using Chrome - play a youtube video on Edge - A rebuffer is emitted as soon as the video starts
,
Oct 22
Thanks. I wonder if Edge is using a high priority playback thread which is stealing from Chrome threads which are below real-time audio priority. Does anything else cause the issue to repro?
,
Oct 22
Hmm, during playback Edge content process spawns a bunch of threads it seems be using for media playback at pri 18/18 which is real time priority. If these are loaded that would explain why no thread in Chrome but our real time priority one gets a timeslice.
,
Oct 22
I was not able to repro in the past ~10mn without Edge being open.
,
Oct 22
Okay, I wonder if for audible audio cases we should bump the media thread to real-time priority like our audio thread. +brucedawson for his thoughts since Edge seems to already be doing this.
,
Oct 23
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e72bb599577ae34097d6df704a1b72a6c87524cb commit e72bb599577ae34097d6df704a1b72a6c87524cb Author: Xiaohan Wang <xhwang@chromium.org> Date: Tue Oct 23 23:10:10 2018 media: Increase the default audio buffer size for encrypted streams This partially reverts commit 251e255f303f954692cd7e2134ea54b37a9774ff. After mojo CDM was enabled by default, with the performance gain, we changed the audio buffer size back to the original value at 200ms, which is the same as the size as clear playback. Recently, we found cases where we still have performance issues for encrypted audio. Given the fact that users are super sensitive to audio glitches, and encrypted playback always has worse performance than clear playback, it makes sense to use a larger buffer size. This CL changes the audio buffer size back to 500ms. It may increase the start-to-play time slightly, but should help reduce rebuffers during playback. Bug: 718161 ,879970 Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel Change-Id: I41909d3fc3356df119b21fe720ebe8cafad206a5 Reviewed-on: https://chromium-review.googlesource.com/c/1297329 Reviewed-by: Dale Curtis <dalecurtis@chromium.org> Commit-Queue: Xiaohan Wang <xhwang@chromium.org> Cr-Commit-Position: refs/heads/master@{#602165} [modify] https://crrev.com/e72bb599577ae34097d6df704a1b72a6c87524cb/media/filters/audio_renderer_algorithm.cc
,
Oct 23
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e72bb599577ae34097d6df704a1b72a6c87524cb commit e72bb599577ae34097d6df704a1b72a6c87524cb Author: Xiaohan Wang <xhwang@chromium.org> Date: Tue Oct 23 23:10:10 2018 media: Increase the default audio buffer size for encrypted streams This partially reverts commit 251e255f303f954692cd7e2134ea54b37a9774ff. After mojo CDM was enabled by default, with the performance gain, we changed the audio buffer size back to the original value at 200ms, which is the same as the size as clear playback. Recently, we found cases where we still have performance issues for encrypted audio. Given the fact that users are super sensitive to audio glitches, and encrypted playback always has worse performance than clear playback, it makes sense to use a larger buffer size. This CL changes the audio buffer size back to 500ms. It may increase the start-to-play time slightly, but should help reduce rebuffers during playback. Bug: 718161 ,879970 Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel Change-Id: I41909d3fc3356df119b21fe720ebe8cafad206a5 Reviewed-on: https://chromium-review.googlesource.com/c/1297329 Reviewed-by: Dale Curtis <dalecurtis@chromium.org> Commit-Queue: Xiaohan Wang <xhwang@chromium.org> Cr-Commit-Position: refs/heads/master@{#602165} [modify] https://crrev.com/e72bb599577ae34097d6df704a1b72a6c87524cb/media/filters/audio_renderer_algorithm.cc
,
Oct 25
g.marcilhacy: The change that increases audio buffer size (#140) is available in 72.0.3590.0 and later, which is today's Canary on Windows and Mac. Could you please give it a try?
,
Oct 26
I felt probably only the audio rendering pipeline should be REALTIME_AUDIO. Audio/video "decoding" should probably not be since we can always decode early and keep a buffer of decoded frames. It's sad that Edge is doing this. But I don't feel we should respond by increasing our thread priorities. FWIW, this is an interesting related article: https://blog.codinghorror.com/thread-priorities-are-evil/ But it may still make sense to bump media threads' priority to DISPLAY. I experimented with it at https://chromium-review.googlesource.com/c/chromium/src/+/1301116
,
Oct 26
I'm at least 90% sure that the code linked to from the codinghorror site has a busy wait in it. One of the normal priority threads ends up spinning while waiting for something from a low priority thread and that causes the whole thing to take forever. The busy wait isn't in the visible code so it must be in the CLR, and indeed I have seem some busy waits there, and in other Microsoft runtimes. I can't think of anything else that could explain the problem. And, crucially, the busy wait is the problem, not the priorities. Busy waits can cause stalls and brief hangs even without changing thread priorities, it's just that the hangs are much shorter so we don't notice them. Therefore the thread priority changes are sort of a good thing because they expose the otherwise invisible micro-hangs due to busy waits. Therefore I think the real advice is: 1) Don't do busy waits. Just don't. Indefinite spin locks are a bad idea. 2) Only run your threads at high priority if the genuinely need to be more responsive than most *and* they absolutely never do busy waits *and* they are guaranteed to consume a small percentage of total CPU time. We've been bitten by this in the past because we had busy waits (in allocators, IIRC, and in some static initialization code) and the fix was to stop doing the busy waits. If we don't run our audio threads at higher priority then on a 100% busy system (and this can happen for seconds at a time during normal navigation) our audio will almost unavoidably cut out. So, I don't think running the audio pipeline at REALTIME_AUDIO priority is a bad idea, as long as we follow the rules.
,
Oct 29
I don't think we can commit to following the rules on these threads unfortunately. We have no idea what the software audio/video decoders might be doing on this thread under the hood. I'm pretty sure ffmpeg has some spin loops in there within it's threaded decoding model. We might be able to split just audio decoding off onto a thread which we could follow the rules on, since there is no multi-threaded software decoding. We've talked about doing this in the past -- it's not too difficult, but may have unconsidered performance impacts with adding a thread hop for audio decoding. Possibly those are all made up by the priority increase.
,
Oct 29
Oh, I guess splitting out the audio wouldn't work, since the return of the decoded samples to the normal priority media thread would still be delayed. Maybe we'd have to instead move off video decoding to its own thread -- media thread is a bit of a catch all though, so I'm not sure what other chrome pieces like capture and so forth might be doing.
,
Oct 29
> I'm pretty sure ffmpeg has some spin loops in there Any chance we can get those removed? Spin loops have bitten us in the past, especially for enterprise customers who often run Chrome in VMs with a low CPU count (often two, I believe). All it takes is a priority inversion with one high-priority spin-loop per core and Chrome will lockup forever.
,
Oct 29
Okay all the ones I remember actually seem gone. They're all while() { pthread_cond_Wait(); } now.
https://cs.chromium.org/chromium/src/third_party/ffmpeg/libavcodec/pthread_frame.c
https://cs.chromium.org/chromium/src/third_party/ffmpeg/libavcodec/pthread_slice.c
https://cs.chromium.org/chromium/src/third_party/ffmpeg/libavcodec/vp8.c
Though they are spawning there own threads for this which would not be at the same priority as the chrome media thread.
,
Nov 8
Hi Dale and Xiaohan, In the call on 10/23, you had mentioned the increased buffer size (change #140) would be available in Beta 71, but it looks like it was made to 72.0.3590.0 Canary, which wouldn't reach Beta until mid-late December. Is there a plan to move it into the 71 Beta? Thanks, Evan
,
Nov 8
Per our metrics it seems to have a small increase in MTBR (bad) and a small increase in smooth rate (good) which are aligned with increasing the buffer size. So I think merging to 71 sounds good: https://uma.googleplex.com/p/chrome/timeline_v2/?sid=ec1a68b146d0558cb8976ce4274e7f89 The thread priority thing is still stuck on https://chromium-review.googlesource.com/c/chromium/src/+/931763
,
Nov 8
One thing we could do is have more recall in our media pipeline for the optimal per-machine audio buffer size. I.e., instead of always starting from 200ms, we could keep that in a per process static and adjust it as necessary when rebuffers are seen. Probably we wouldn't want to keep increasing this on network underflows, just decoding underflow.
,
Nov 8
> Probably we wouldn't want to keep increasing this on network underflows, just decoding underflow. Time to dust off this CL? https://chromium-review.googlesource.com/c/chromium/src/+/1105394
,
Nov 12
Request to merge the CL in #140 (https://chromium-review.googlesource.com/c/1297329) to M71. It's only changing audio buffer size and the risk should be very low. dalecurtis@ also verified the results in #149.
,
Nov 12
This bug requires manual review: M71 has already been promoted to the beta branch, so this requires manual review Please contact the milestone owner if you have questions. Owners: benmason@(Android), kariahda@(iOS), kbleicher@(ChromeOS), govind@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Nov 12
This is regressed in M68 and we're very close to M71 stable promotion. Let's wait until M72.
Showing comments 55 - 154
of 154
Older ›
|
|||||||||||
►
Sign in to add a comment |
|||||||||||