New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 879970 link

Starred by 4 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug



Sign in to add a comment

MSE/EME causing 'waiting' events

Reported by g.marcil...@gmail.com, Sep 3

Issue description

UserAgent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36

Steps to reproduce the problem:
Hi,

We have a Javascript music library built using the MSE (and EME). We are seeing many 'waiting' events being fire with the readyState 2 which results in a brief interruption of audio. We've also recently added EME support and we are seeing x2 rebuffers.

For context, the current implementation uses 3 audio elements that we recycle. After a new track begins, we clear the buffer from the unused audio element and start prefetching the next track. We currently buffer up to 4mn ahead and we do not remove data from the buffer manually. We let the browser handle it automatically. (which could be a problem according to this: https://bugs.chromium.org/p/chromium/issues/detail?id=854449&can=1&q=mse%20waiting&colspec=ID%20Pri%20M%20Stars%20ReleaseBlock%20Component%20Status%20Owner%20Summary%20OS%20Modified)

In order to prevent this poor user experience, I would like to understand what is causing those `waiting` events and what could cause a 2x increase by now use the EME. Do you have any recommendations?

What is the expected behavior?

What went wrong?
x2 increase of rebuffers after enabling EME

Did this work before? N/A 

Does this work in other browsers? N/A

Chrome version: 68.0.3440.106  Channel: stable
OS Version: OS X 10.13.6
Flash Version:
 
Showing comments 55 - 154 of 154 Older
The data shared in #53 is based on real customers usage (not including your fix). At the moment we are not able to break down this data per Chrome version.
100ms is definitely too low; you'll want to target at least 1-2s buffer. So if you're seeing playbacks regularly approaching that distance, the waiting is just due to not buffering enough. I'm not sure how you're currently managing your buffering, but you probably want to attach to timeupdate events and ensure the buffering level never falls below N seconds. 

The background ones have enough data, but as mentioned once you receive the waiting the buffered state may already be out of date. This may be particularly true in background tabs where JS may be delayed. Did the data provided by the recommendation in c#36 reveal anything interesting?
100ms indicates that customers are having rebuffers because of a slow connection and are unable to download fast enough the next segment. P90 data shows there is +100 seconds of audio to play when a rebuffer occurs.

I will test tomorrow c#36 and the M71 build.
Gotcha, are you able to filter out cases where you know you don't have the data?

The platform wide smooth rate for audio-only MSE content is 95% for 0 rebuffers and 99% for 0-1 rebuffers. I.e., 95% of elements experience 0 rebuffers and 99% of elements experience at most 1 rebuffer. Do you have a sense of what your smooth rate is? That will help us determine the severity of your problem. 
Project Member

Comment 59 by bugdroid1@chromium.org, Sep 20

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e130ef9c8ac16f65d4dc5cff95e26118a4e39438

commit e130ef9c8ac16f65d4dc5cff95e26118a4e39438
Author: Dale Curtis <dalecurtis@chromium.org>
Date: Thu Sep 20 00:17:13 2018

Add better tracing for renderers and decoder selection.

Renderers are core to our entire pipeline, yet trace events are
missing! So lets add a few that allow useful diagnosis when looking
at trace reports.

Specifically this adds traces for the following:
- (AudioRenderer|VideoRenderer|RendererImpl)::Initialize() [async]
- (AudioRenderer|VideoRenderer|RendererImpl)::Flush() [async]
- DecoderSelector::SelectDecoder() [async]
- RendererImpl::SetCdm
- RendererImpl::StartPlayingFrom
- RendererImpl::SetPlaybackRate
- RendererImpl::OnBufferingStateChange
- RendererImpl::StartPlayback
- RendererImpl::PausePlayback
- RendererImpl::OnRendererEnded
- RendererImpl::OnError
- RendererImpl::OnSelectedVideoTracksChanged
- RendererImpl::OnEnabledAudioTracksChanged

This also removes usage of ".is_null()" on callbacks in favor of
the more modern built in boolean test.

BUG=879970
TEST=none
R=xhwang

Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: I79792d93b6369cfe4fe32f322f176b68acc92ee8
Reviewed-on: https://chromium-review.googlesource.com/1232688
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
Reviewed-by: Xiaohan Wang <xhwang@chromium.org>
Cr-Commit-Position: refs/heads/master@{#592622}
[modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/base/demuxer_stream.cc
[modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/base/demuxer_stream.h
[modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/base/stream_parser_buffer.cc
[modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/filters/decoder_selector.cc
[modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/filters/decoder_selector.h
[modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/renderers/audio_renderer_impl.cc
[modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/renderers/audio_renderer_impl.h
[modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/renderers/renderer_impl.cc
[modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/renderers/renderer_impl.h
[modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/renderers/video_renderer_impl.cc
[modify] https://crrev.com/e130ef9c8ac16f65d4dc5cff95e26118a4e39438/media/renderers/video_renderer_impl.h

Hi,

Adding additional logs as recommended in c#36 did not result in more useful information. Additionally, I am still seeing rebuffers using Canary.

I also recorded a trace using Canary.
trace_4.json.gz
14.0 MB Download
It looks like that trace is using the dev tools slowdown function? Is that the case? I'm more interested in traces from when you're actually hitting issues.

I think with dev tools you're just running out of buffer, but I'll add some more traces for the demuxer to get more information.
Project Member

Comment 62 by bugdroid1@chromium.org, Sep 21

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/43a86d270102c4cb45cbe8ab6e4d4b209e184a65

commit 43a86d270102c4cb45cbe8ab6e4d4b209e184a65
Author: Dale Curtis <dalecurtis@chromium.org>
Date: Fri Sep 21 23:24:23 2018

Add demuxer level traces for Initialize, Seek, Read.

This is another core component of media, we should have trace events
to help diagnose issues in the field. This adds the following traces:
- (Chunk|FFmpeg)Demuxer::Initialize
- (Chunk|FFmpeg)Demuxer::Seek
- (AudioBuffer|VideoFrame)Stream::(Decode|Read|ReadFromDemuxerStream)
- VideoFrameStream::Prepare (for audio too, but never prepares)

This also renames the DecoderStream<AUDIO>, DecoderStream<VIDEO> traces
to use the normal names AudioBufferStream, VideoFrameStream.

BUG=879970
TEST=none
R=wolenetz, xhwang

Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: Ie9ca7474bc5684117820b219012f0455d044eabc
Reviewed-on: https://chromium-review.googlesource.com/1237573
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
Reviewed-by: Xiaohan Wang <xhwang@chromium.org>
Cr-Commit-Position: refs/heads/master@{#593374}
[modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/base/demuxer_stream.cc
[modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/base/demuxer_stream.h
[modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/filters/chunk_demuxer.cc
[modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/filters/chunk_demuxer.h
[modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/filters/decoder_stream.cc
[modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/filters/decoder_stream.h
[modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/filters/ffmpeg_demuxer.cc
[modify] https://crrev.com/43a86d270102c4cb45cbe8ab6e4d4b209e184a65/media/filters/ffmpeg_demuxer.h

Owner: dalecur...@chromium.org
@g.marcilhacy can you collect a trace with the most recent canary? Preferably without using dev tools to slowdown, but we have enough trace logging in there to confirm any issues with the dev tools forced slow down too if you want to provide one of those too.


I kinda need to have the console open in order to the see the rebuffer being logged. Is that fine if I do not use the Performance throttling?
Yeah that should be fine. You can also record one with performance throttling and I can tell you for sure now if it's just running out of buffer. But most helpful would be one that's actually a glitch w/o performance throttling.
Attached. The devtools were open on the console tab and I was not using cpu throttling.
You'll find 1 rebuffer ~5 second before the end of the trace.


Version: Version 71.0.3560.0 (Official Build) canary (64-bit)

trace_5.json.gz
5.0 MB Download
Okay yes, the underflow is due to the decoder not being able to read from the demuxer. The trace shows Chrome attempting to read from MSE, but MSE not having any buffered data to provide. The read remains unsatisfied until appendBuffer() is called ~3,652.085 ms later.

The previous append is of size 328386 and turns into ~430 buffers which if AAC, should turn into ~10s of data, but we seem to run out ~800ms later... So I think we do have enough data, which seems to imply that something is blocking mse on the main thread for some time.

There is some DOM timer running during this underflow which executes for 3 secs -- DOM timers comes from setInterval/setTimeout. Can you instrument your setInterval/setTimeout calls to see if any take longer than > 1 second to execute? In chrome://tracing I don't seem to have a way to get any more detail. You might get more if you record a performance trace through dev tools:

https://developers.google.com/web/tools/chrome-devtools/evaluate-performance/reference
Thanks for the explanation. So you're saying a long running task (setTimeout) on the main thread is the source of this rebuffer. 

Do you have any theories on why those rebuffers are exacerbate with DRM?
Yes that seems to be the case. I'm still investigating the trace. Since for plain MSE I wouldn't expect a really long main thread task to glitch if there's enough data available, since we decode+read on another thread.

That said, DRM content requires more hops to the main thread to talk to the site through EME, so if you have a regular task blocking the main thread I'd expect it to glitch more. That's possibly what's happening here, but I'm still looking more closely.
Do you have any documentations on how to use the tracing tool? Where do you see in the trace a 3 secs long DOM timer?


I have a performance profile with a rebuffer caused by a position change in the song. Right before the waiting event being emitted, I see a timeupdate event which probably emits the new position in the song. I also know that when the rebuffer occured, ~14s of audio was still available.

Is it because the CDM has not decrypted this part of the buffer even if the audio has been available?

The waiting event is emitted at the ~35 second mark in the profile.
Profile-20180926T104718.json
31.0 MB Download
Project Member

Comment 72 by bugdroid1@chromium.org, Sep 26

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ca79b4a569b34fa3080d83b27ddaf6311b695c9d

commit ca79b4a569b34fa3080d83b27ddaf6311b695c9d
Author: Dale Curtis <dalecurtis@chromium.org>
Date: Wed Sep 26 18:26:29 2018

Expand traces for decrypt-only and decrypt+video_decoder.

DecryptingDemuxerStream had no traces and the DecyrptingVideoDecoder
had some old-style, incomplete traces. This adds traces to the
former and expands the latter. Specifically we now have:
- DecryptingDemuxerStream::DecryptPendingBuffer [async]
- DecryptingDemuxerStream::WaitingForDecryptionKey [async]
- DecryptingVideoDecoder::DecodePendingBuffer [async]
- DecryptingVideoDecoder::WaitingForDecryptionKey [async]

The DVD traces used to use a trace_id and provide a buffer size,
but this isn't really necessary since the start trace is tagged
correctly just by using |this| and providing the timestamp.

BUG=879970
TEST=passes cq

Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: If21288e6c47414ec038f034bba381fcebf5ca99b
Reviewed-on: https://chromium-review.googlesource.com/1244164
Reviewed-by: Xiaohan Wang <xhwang@chromium.org>
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
Cr-Commit-Position: refs/heads/master@{#594393}
[modify] https://crrev.com/ca79b4a569b34fa3080d83b27ddaf6311b695c9d/media/base/decryptor.cc
[modify] https://crrev.com/ca79b4a569b34fa3080d83b27ddaf6311b695c9d/media/base/decryptor.h
[modify] https://crrev.com/ca79b4a569b34fa3080d83b27ddaf6311b695c9d/media/filters/decoder_selector.cc
[modify] https://crrev.com/ca79b4a569b34fa3080d83b27ddaf6311b695c9d/media/filters/decrypting_demuxer_stream.cc
[modify] https://crrev.com/ca79b4a569b34fa3080d83b27ddaf6311b695c9d/media/filters/decrypting_demuxer_stream.h
[modify] https://crrev.com/ca79b4a569b34fa3080d83b27ddaf6311b695c9d/media/filters/decrypting_video_decoder.cc
[modify] https://crrev.com/ca79b4a569b34fa3080d83b27ddaf6311b695c9d/media/filters/decrypting_video_decoder.h

If you look in the trace there is a point where the audio output device stops rendering audio shortly after the AudioBufferStream::Read stalls for > 200ms. There's another long trace in there which is annotated with TimerBase and DOMTimer. I'll take a screenshot in a bit and look at the dev tools trace you provided. Between meetings at the moment.
Did you mean to attach that performance profile? It's not for a music site at all, but some sort of WebGL looking demo.
Derp, that's my fault. Loaded the wrong trace.
Your performance trace has huge blocking timer tasks. Do you know what those are doing? I can't tell just from the trace since the code is obfuscated.
I assume most of them are UI updates. We use a UI library that's not performant when building complex UI applications... However, most of the blocking tasks are toward the beginning of the Profile, while the waiting event occurred toward the end of the Profile.
Components: -Blink>Scheduling Internals>Compositing
Thanks, right before the waiting event (@35696ms in trace) there's a 500ms "composite layers" call. I don't think that should show up as a DOMTimerFired, but maybe it does. In either case I'm still unsure how that's actually preventing us from reading buffers from MSE. Tomorrow I'll simulate main thread contention with full buffers to see if I can figure out if it's possible to glitch MSE that way.

-scheduling, +Internals>Compositing folks to see if there's a way to dig into what "Composite Layers" means more.
Okay, thanks for digging into this issue :)
Project Member

Comment 80 by bugdroid1@chromium.org, Sep 27

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/aa81f1c12f03cd3fdecc04ac61964d7b63b6d25e

commit aa81f1c12f03cd3fdecc04ac61964d7b63b6d25e
Author: Dale Curtis <dalecurtis@chromium.org>
Date: Thu Sep 27 01:38:11 2018

Add decode status to DecoderStream::Decode ending trace.

Was left off in previous CLs.

BUG=879970
TEST=passes cq

Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: I3dd5288bf60381e42a8ab08d054bdf60e4e96f96
Reviewed-on: https://chromium-review.googlesource.com/1244613
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
Reviewed-by: Xiaohan Wang <xhwang@chromium.org>
Cr-Commit-Position: refs/heads/master@{#594556}
[modify] https://crrev.com/aa81f1c12f03cd3fdecc04ac61964d7b63b6d25e/media/base/decode_status.cc
[modify] https://crrev.com/aa81f1c12f03cd3fdecc04ac61964d7b63b6d25e/media/base/decode_status.h
[modify] https://crrev.com/aa81f1c12f03cd3fdecc04ac61964d7b63b6d25e/media/filters/decoder_stream.cc
[modify] https://crrev.com/aa81f1c12f03cd3fdecc04ac61964d7b63b6d25e/media/filters/decoder_stream.h

I wasn't able to simulate any glitches with 2s main thread blocks except for when the block prevented sourceBuffer.append() from providing enough data. That's what I was expecting, except during MSE operations media doesn't care about main thread contention.

Can you grab another trace from canary now that I've added a few EME traces? This will tell us if we ran into a need_key event.
Yes I can. Can you share the script you wrote to simulate the issue?
Not a script, but if you have your own chromium build:

dalecurtis@xorax /d/code/chrome/src $ git diff
diff --git a/media/blink/webmediaplayer_impl.cc b/media/blink/webmediaplayer_impl.cc
index 253f3fda0a0d..c11af50043f5 100644
--- a/media/blink/webmediaplayer_impl.cc
+++ b/media/blink/webmediaplayer_impl.cc
@@ -2822,6 +2822,8 @@ WebMediaPlayerImpl::UpdatePlayState_ComputePlayState(bool is_remote,
 
 void WebMediaPlayerImpl::ReportMemoryUsage() {
   DCHECK(main_task_runner_->BelongsToCurrentThread());
+  TRACE_EVENT0("media", "BlockingMainThread");
+  base::PlatformThread::Sleep(base::TimeDelta::FromMilliseconds(2000));
 
   // About base::Unretained() usage below: We destroy |demuxer_| on the main
   // thread.  Before that, however, ~WebMediaPlayerImpl() posts a task to the
diff --git a/media/renderers/audio_renderer_impl.cc b/media/renderers/audio_renderer_impl.cc
index 79c34da2ab0f..335ca2978fec 100644
--- a/media/renderers/audio_renderer_impl.cc
+++ b/media/renderers/audio_renderer_impl.cc
@@ -1042,13 +1042,13 @@ int AudioRendererImpl::Render(base::TimeDelta delay,
         frames_after_end_of_stream = frames_requested;
       } else if (state_ == kPlaying &&
                  buffering_state_ != BUFFERING_HAVE_NOTHING) {
-        algorithm_->IncreaseQueueCapacity();
+        // algorithm_->IncreaseQueueCapacity();
         SetBufferingState_Locked(BUFFERING_HAVE_NOTHING);
       }
     } else if (frames_written < frames_requested && !received_end_of_stream_) {
       // If we only partially filled the request and should have more data, go
       // ahead and increase queue capacity to try and meet the next request.
-      algorithm_->IncreaseQueueCapacity();
+      // algorithm_->IncreaseQueueCapacity();
     }
 
     audio_clock_->WroteAudio(frames_written + frames_after_end_of_stream,

I tested with this site:
http://storage.googleapis.com/dalecurtis/mse.html?src=buck2.mp3&type=mp3
Here's one. I don't think you'll see the first rebuffer because I think I was not recording yet, however you should see all the other ones. FYI, they all occurred when I opened a new tab.
Capture.PNG
25.1 KB View Download
trace_10.json.gz
8.9 MB Download
Mostly these are just calls taking forever to complete, probably because the tab gets backgrounded at some point -- for some reason the renderer is listed as silent. It shouldn't have that status if audio is _audible_ (i.e., volume must be > 0). Do you see the speaker icon on the tab when it is in the background?

The foreground waiting events are just running out of buffer.
Yes it was muted :/ Trying again...
This time the audio was not muted and the window was not in the background. The rebuffer happened at 14s.
trace_11.json.gz
4.8 MB Download
Is this still in a VM? How many cores is your VM configured for?

The trace shows hitting another huge main thread block and what appears to be Chrome's media thread being descheduled. Nothing happens on that thread for 450ms -- then all the queued tasks suddenly execute in flurry.
Can you grab another trace with the categories "toplevel.flow" and "system_stats" enabled? They'll help point out any system level issues.
Is it possible to decouple the reading the MSE buffer from the main thread?

This trace and previous trace are using a Windows 10 VM with 1 core and 8gb ram.
The rebbufer happen 32s in the track.
trace_11 (1).json.gz
3.2 MB Download
I also have a recording of myself reproducing the previous trace. I don't know if that's useful in anyway or not.
recording.mov
27.4 MB Download
Reading of the MSE buffer is decoupled from the main thread, that's the test I did in c#81..#83 and confirmed that it is working as expected. The issue here is that a single core will cause the media thread to be descheduled when there's significant work happening elsewhere. You'll need to solve your UI library issues to prevent this from happening. Even though our media threads are at a higher priority they're at the mercy of the OS for complex tasks when you only have a single core.

Single cores are pretty rare, are you able to reproduce any issues if you give your VM two cores?
So far I have not been able to repro rebuffers with a 2 core VM on a continuous stream like I was on a one core.
However, I am able to have rebuffers when I seek past a buffered range. What appears to happen is when I seek to a new position, I append the right 10s chunk, playback resumes but stops briefly right after. What I guess is happening is the CDM was unable to decrypted the audio fast enough. Is this something you can confirm by looking at the trace?

Time:  93.669297
Rebuffer duration: 245ms
Buffered: 90.000544, 100.008343


From looking at customer logs it shows that they have rebuffers on continous streams, they do not seek, or skip, etc. I do not know right now how what I just reproduced can map to what customers are doing... :/

Lastly, if the MSE buffer is decoupled from the main thread i'm not sure that solving our UI issue would prevent rebuffers on a 2 core machine. Do you agree?
trace_22.json.gz
9.1 MB Download
Depends on what else they're doing. If the other core is busy and your UI library spikes at an inopportune time it'll end up competing for resources with the media thread. Do you know how your rebuffer rate compares across the clients vis a vis the metrics I provided in c#58? You're always going to see some rebuffers, the question is just how frequent they are.
Not sure what's blocking in the CDM process, however best practice is to append the bytes at the location the user wants to seek to, _then_ issue the seek to the video element. When you do it before appending the bytes, you'll end up with 

xhwang@ can you take a look at the trace in c#93? The cdm is blocking doing something that seems to happen between decrypt calls. Do you have any idea what that might be? It doesn't seem to be a waiting for key event since that'd show up in the trace. 

@g.marcihacy: Can you enable the mojom category so we can see what mojo messages are being sent when this hang occurs?
@dalecur...   Is this kind of CPU/thread contention we are experiencing unusual with regards to CDN usage? Am curious about other resolutions that may have worked in other instances. Since we are playing music - a relatively low bandwidth use case - I imagine other scenarios like this for video would have experienced something similar. 
Err just realized I didn't finish my sentence in c#95; that should read "you'll end up with the MediaSource blocking on receiving the data and resuming playback the instant 200ms is available, which may be before you've actually buffered enough. Even if you append > 200ms, not all of it may be in the MediaSource upon resume."

Yes, per the metrics I quoted in c#58, the experience here is unusual. It's not clear to me that the glitching issue is CPU usage related in the > 1 core case. There does seem to be some block happening in the CDM process after the seek, for which I'm not sure what'a going on yet.

Do you have users reporting issues or just telemetry data? For telemetry I'd track if the volume is > 0 (otherwise tab priority is lowered), if the tab is in the background (just for correlation), and whether you see large jumps in Date.now() (which corresponds with suspend resume events). All of these are adverse conditions which would show up in telemetry but for which users would probably not report anything. 
It's entirely via telemetry that we are seeing the rebuffers. Doesn't look like tab mute status is exposed without extensions though. We can track our own app's volume and correlate rebuffers. 
Dale, is it possible to detect programmatically if a tab has been muted?
No but that usage is rare. You just need to check element.volume.
I think it's time that we hold a live troubleshooting session with technical folks to root cause this issue. We are going on a month of troubleshooting and we need to find a solution that allows us to move forward.

@dalecur...What's the best way to hold this? 
Owner: xhw...@chromium.org
We're happy to have a VC to discuss more. You can e-mail me at the @google.com equivalent of my username.

That said, at this point I think I've provided some pretty concrete advice:
- fix your ui library; it's blocking your appends.
- check your telemetry for correlation with is_background or volume==0.

From our side:
- The trace you sent does show a block in the CDM which I'm unsure of the cause. xhwang@ needs to take a look at the trace in c#93.
- Otherwise we haven't been able to reproduce issues.
- You haven't expressed how far off the stats you are from the ones I quoted in c#58, so I'm unsure how to quantify your problem.

=>xhwang so he doesn't miss this. Assign back to me once you've taken a look.
Sorry for the delay. I checked the trace and saw the delay at

Start	20,719.822 ms
Wall Duration	554.984 ms
CPU Duration	226.529 ms

I have no idea what the that task is. I'll add more traces in CdmAdapter, which runs in the CDM/utility process.
Project Member

Comment 104 by bugdroid1@chromium.org, Oct 16

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/def776e75cfc3b34930e83402a7bb2c186bc4eb2

commit def776e75cfc3b34930e83402a7bb2c186bc4eb2
Author: Xiaohan Wang <xhwang@chromium.org>
Date: Tue Oct 16 04:10:26 2018

media: Add more trace events in CdmAdapter

Add TRACE_EVENTs for all methods that call into the CDM so that we can
monitor CDM's performance more closely.

Bug: 879970
Test: Manually tested and checked traces.
Change-Id: I61c063ad656b852463195b2cea93f91aee8d91ba
Reviewed-on: https://chromium-review.googlesource.com/c/1281042
Reviewed-by: Frank Liberato <liberato@chromium.org>
Commit-Queue: Xiaohan Wang <xhwang@chromium.org>
Cr-Commit-Position: refs/heads/master@{#599844}
[modify] https://crrev.com/def776e75cfc3b34930e83402a7bb2c186bc4eb2/media/cdm/cdm_adapter.cc

g.marcilhacy: I added more tracing in CdmAdapter which would help us understand what's going on with the long blocking task in the CDM/utility process. Could you please run the test again with tomorrow's Canary build?

For now my guess is that those tasks are CreateSessionAndGenerateRequest() and UpdateSession(), which we know are slower. We can take another look to see whether we can improve the performance there. At the same time, the JS player should try to trigger the license exchange as early as possible to avoid resource contention during playback, or stall due to waiting for key.
Here is a new trace using the latest version of Canary.
FYI, when Chrome is the only application running I cannot repro any rebuffers. For this trace I was playing the same DRM content on Firefox and I was also playing a youtube video on Edge. (Firefox did not have a rebuffer).

Details:
Time: 47s
Ranges: [0, 54]
Duration: 200ms

trace_1018.json.gz
8.5 MB Download
I don't think that trace caught the rebuffer. Or at least I don't see one in there. Is this still on Mac?
Here is another. This one had 2 rebuffers within the same track.
trace_10182.json.gz
13.3 MB Download
Capture.PNG
24.2 KB View Download
No huge blocks showing on the CDM thread, there's some sort of timer expiring, while the media thread waits for a frame, but xhwang@ will have to take a look at that.
Timer expiring is just the way CDM does some periodic routine work. To be fair, Firefox is also using Widevine CDM to decrypt audio :) So I don't think it's the CDM to blame.

It seems to me this is just a performance issue? We can spend some time improving the overall throughput of our pipeline, e.g. buffer more audio in advance (see  issue 739797 ), or support parallel decryption/decoding (see  issue 721863 ). But on lower end devices with multiple other programs running you can always push the browser to the limit. So the question is where do we want to cut the line and how much effort do we want to spend for these cases.

Dalecurtis: One question. In the render process and CDM process, these are a lot of tasks with a huge difference between the Wall Duration and CPU Duration, e.g. 152ms vs 5ms. I don't see it it in the browser process, especially on the audio thread. Is this a sign that the render/CDM process has lower priority and don't have enough CPU time allocated?

Title	
CdmAdapter::Decrypt
Start	23,515.761 ms
Wall Duration	153.792 ms
CPU Duration	5.089 ms
Labels: -OS-Mac OS-Windows
The traces are for Windows so marking this bug as Windows too. @g.marcilhacy do you have any traces of this happening on macOS?

I'm not sure if that's the cause, in the trace above you can see the render thread still getting a time slice which is the same priority as our media/cdm threads. Certainly the audio threads are created with base::ThreadPriority::REALTIME_AUDIO. Though I don't remember if that does anything in the renderer process due to the sandbox. 

Our media thread is created with base::ThreadPriority::NORMAL since we generally expect a NORMAL priority thread to receive a time slice within the 200ms range of our audio buffer. We could try marking the media thread ::DISPLAY priority (I don't think it should be realtime given the buffer sizes). Note: NORMAL == DISPLAY on macOS.

I'm not sure the media thread is the issue though. So probably we'd need the CDM process to get a similar priority. Do you know if anything else runs in the utility process besides the mojo cdm? Would this show up in the trace?
For the idea of support parallel decryption to improve throughput, I filed issue 896931.

The utility process is dedicated to the CDM. I don't expect other non-CDM heavy tasks to run there.
Since there's not a lot going on in the utility process I wonder if it's getting descheduled by the OS at NORMAL priority.

@g.marcilhacy: Can you open process explorer and bump the priority of the CDM process and see if you still get glitches?
What is the name of the process? I cannot find it.
You'll need to open the Chrome task manager and look for a utility process with the label named "Content Protection Module" and find it's id, then look up that id in process explorer -- sorry I thought it'd be labeled in PE.
To be more accurate, look for "Utility: Content Decryption Module Service".
I can indeed see the priority is set to NORMAl, however I can't change the priority of the process.
Are you using this? https://docs.microsoft.com/en-us/sysinternals/downloads/process-explorer

It's been a while, but I thought you could using that.
I was able to repro again with the priority set to HIGH on the CDM process.
Thanks, that probably means the receive side isn't having issues -- I haven't seen any send side issues in the traces, but maybe there's something that isn't showing up in the trace.

What are you doing when the underflow happens?
Cc: roc...@chromium.org
Components: -Internals>Compositing Internals>Mojo
+rockot,mojo for any tracing advice which might help us figure out if this is due to a send side delay. Are there any trace categories besides mojom which would help diagnose if our mojo calls from the renderer process to the cdm utility process are being delayed?
Cc: -roc...@chromium.org rockot@google.com
What do you mean by delayed, exactly? Where would you expect to find sources of delay on the sender side? Just due to underlying socket congestion?
I mean we're seeing a situation where a Decrypt() message should be showing up in the utility process, but does not appear for > 200ms. We've bumped the priority of the utility process to high and still see the issue, so it seems like the message isn't leaving the renderer process in a timely fashion. We know the method:

https://cs.chromium.org/chromium/src/media/mojo/interfaces/decryptor.mojom?l=33

... has been invoked due to the trace. So my speculation is that there's contention somewhere in the sending process. E.g., could the IO/sending thread be blocked and thus the outbound message is delayed? I can't remember if the renderer process has a dedicated io thread for sending these message or if it's a task scheduler pool.
The renderer does have a dedicated IO thread, but it's not relevant when sending messages on normal Mojo interfaces (of which Decryptor is one) because calls on Mojo interfaces do the IPC immediately from the calling thread.

The only time the IO thread is relevant when sending on normal interfaces is when the write to the underlying socket fails. At that point we queue the message and will send it out ASAP from the IO thread once the socket becomes writable again.

To be clear, there is a single shared socket between any pair of communicating processes, and this is multiplexed for all interfaces going between those processes.

The only way to see this kind of behavior is if:

a) the renderer is hammering your service process with lots of IPCs (not necessarily on the same Mojo interface)
b) the service process is simply not reading IPCs as fast as they're being written by that renderer

Hmm, there shouldn't be any other services hosted in that utility process AFAIK per c#112. Decrypt() calls are 1 at a time, so it shouldn't be backing up -- especially not when the utility process is marked as a high priority process.

Specifically in the trace in c#108 it looks like the message is sent to the utility process, ~230ms later it's received and immediately completed, then ~590ms later it's delivered back to the renderer process. During this time the high priority audio threads are running on schedule.

@g.marcilhacy can you try the process priority trick with both the Amazon Music renderer tab as well as the CDM process and see if you still get glitches? It's possible this doesn't help if the competition is within the processes though.

We may need to test a canary build where we set DISPLAY priority for media+cdm threads.
I don't mean other services running in the process, I mean other connected interfaces. Surely there are other interfaces between the cdm service and its clients. Is it possible that a bug is making one of them spammy?
... though I don't know if DISPLAY priority will even help there are operations on the compositor thread which also look blocked and it should be at display priority. E.g., Scheduler::OnBeginImplFrameDeadline is running for ~500ms in the trace.
I'm not sure what you mean. Can you elaborate? There's not a lot of calls in/out of the CDM once playback starts aside from the regular cadence of decoding packets. Certainly I'm not sure how that wouldn't show up in the trace.
I wouldn't expect thread priority on the client side to matter much, given your data. If you're actually measuring 230ms between the Decrypt() call and its receipt in the service, either the socket is backlogged or the service is taking a long time to dispatch the message.

On that note, one other thought - the message may actually arrive in the service process very quickly. *Incoming* IPCs wake up the IO thread and get scheduled for dispatch on the destination binding's SequencedTaskRunner ASAP. If that sequence (e.g. main thread of the service process?) is busy doing other stuff, you're not going to see the message dispatched right away.
I mean there are several Mojo interfaces that comprise the total IPC surface of the cdm service. At a glance I see CdmService, CdmFactory, and ContentDecryptionModule, in addition to Decyptor. Of course it's just a wild guess, but it's important to recognize that Decrpytor being slow doesn't necessarily mean Decryptor would be the offending interface pipe in the event of IPC spam.
Ah, none of those should be active at that point. Since we have the "mojom" category enabled, they should show up in the trace if that was the case too.

The decrypt is happening on the main ChildThread which is just a RunLoop. It's just sitting there idle per the trace above. Since it's a MessageLoop we'd see a task tracking entry if it was busy I think.
@g.marcilhacy can you provide some more detail on exactly what you're doing when the glitch happens? Are you active within the VM or doing something else? I want to rule out a state where the VM is getting suspended by the host OS.
I am actively using the VM.
I just repro again using the following steps:
- Play on music.amazon.com using Chrome
- play a youtube video on Edge
- A rebuffer is emitted as soon as the video starts
Thanks. I wonder if Edge is using a high priority playback thread which is stealing from Chrome threads which are below real-time audio priority. Does anything else cause the issue to repro?
Hmm, during playback Edge content process spawns a bunch of threads it seems be using for media playback at pri 18/18 which is real time priority. If these are loaded that would explain why no thread in Chrome but our real time priority one gets a timeslice.
I was not able to repro in the past ~10mn without Edge being open.
Cc: brucedaw...@chromium.org
Okay, I wonder if for audible audio cases we should bump the media thread to real-time priority like our audio thread. +brucedawson for his thoughts since Edge seems to already be doing this.
Project Member

Comment 139 by bugdroid1@chromium.org, Oct 23

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e72bb599577ae34097d6df704a1b72a6c87524cb

commit e72bb599577ae34097d6df704a1b72a6c87524cb
Author: Xiaohan Wang <xhwang@chromium.org>
Date: Tue Oct 23 23:10:10 2018

media: Increase the default audio buffer size for encrypted streams

This partially reverts commit 251e255f303f954692cd7e2134ea54b37a9774ff.

After mojo CDM was enabled by default, with the performance gain, we
changed the audio buffer size back to the original value at 200ms,
which is the same as the size as clear playback.

Recently, we found cases where we still have performance issues for
encrypted audio. Given the fact that users are super sensitive to audio
glitches, and encrypted playback always has worse performance than clear
playback, it makes sense to use a larger buffer size.

This CL changes the audio buffer size back to 500ms. It may increase
the start-to-play time slightly, but should help reduce rebuffers during
playback.

Bug:  718161 ,879970
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: I41909d3fc3356df119b21fe720ebe8cafad206a5
Reviewed-on: https://chromium-review.googlesource.com/c/1297329
Reviewed-by: Dale Curtis <dalecurtis@chromium.org>
Commit-Queue: Xiaohan Wang <xhwang@chromium.org>
Cr-Commit-Position: refs/heads/master@{#602165}
[modify] https://crrev.com/e72bb599577ae34097d6df704a1b72a6c87524cb/media/filters/audio_renderer_algorithm.cc

Project Member

Comment 140 by bugdroid1@chromium.org, Oct 23

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e72bb599577ae34097d6df704a1b72a6c87524cb

commit e72bb599577ae34097d6df704a1b72a6c87524cb
Author: Xiaohan Wang <xhwang@chromium.org>
Date: Tue Oct 23 23:10:10 2018

media: Increase the default audio buffer size for encrypted streams

This partially reverts commit 251e255f303f954692cd7e2134ea54b37a9774ff.

After mojo CDM was enabled by default, with the performance gain, we
changed the audio buffer size back to the original value at 200ms,
which is the same as the size as clear playback.

Recently, we found cases where we still have performance issues for
encrypted audio. Given the fact that users are super sensitive to audio
glitches, and encrypted playback always has worse performance than clear
playback, it makes sense to use a larger buffer size.

This CL changes the audio buffer size back to 500ms. It may increase
the start-to-play time slightly, but should help reduce rebuffers during
playback.

Bug:  718161 ,879970
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: I41909d3fc3356df119b21fe720ebe8cafad206a5
Reviewed-on: https://chromium-review.googlesource.com/c/1297329
Reviewed-by: Dale Curtis <dalecurtis@chromium.org>
Commit-Queue: Xiaohan Wang <xhwang@chromium.org>
Cr-Commit-Position: refs/heads/master@{#602165}
[modify] https://crrev.com/e72bb599577ae34097d6df704a1b72a6c87524cb/media/filters/audio_renderer_algorithm.cc

g.marcilhacy: The change that increases audio buffer size (#140) is available in 72.0.3590.0 and later, which is today's Canary on Windows and Mac. Could you please give it a try?
I felt probably only the audio rendering pipeline should be REALTIME_AUDIO. Audio/video "decoding" should probably not be since we can always decode early and keep a buffer of decoded frames.

It's sad that Edge is doing this. But I don't feel we should respond by increasing our thread priorities. FWIW, this is an interesting related article:
https://blog.codinghorror.com/thread-priorities-are-evil/

But it may still make sense to bump media threads' priority to DISPLAY. I  experimented with it at
https://chromium-review.googlesource.com/c/chromium/src/+/1301116
I'm at least 90% sure that the code linked to from the codinghorror site has a busy wait in it. One of the normal priority threads ends up spinning while waiting for something from a low priority thread and that causes the whole thing to take forever. The busy wait isn't in the visible code so it must be in the CLR, and indeed I have seem some busy waits there, and in other Microsoft runtimes. I can't think of anything else that could explain the problem. And, crucially, the busy wait is the problem, not the priorities.

Busy waits can cause stalls and brief hangs even without changing thread priorities, it's just that the hangs are much shorter so we don't notice them. Therefore the thread priority changes are sort of a good thing because they expose the otherwise invisible micro-hangs due to busy waits.

Therefore I think the real advice is:
1) Don't do busy waits. Just don't. Indefinite spin locks are a bad idea.
2) Only run your threads at high priority if the genuinely need to be more responsive than most *and* they absolutely never do busy waits *and* they are guaranteed to consume a small percentage of total CPU time.

We've been bitten by this in the past because we had busy waits (in allocators, IIRC, and in some static initialization code) and the fix was to stop doing the busy waits.

If we don't run our audio threads at higher priority then on a 100% busy system (and this can happen for seconds at a time during normal navigation) our audio will almost unavoidably cut out. So, I don't think running the audio pipeline at REALTIME_AUDIO priority is a bad idea, as long as we follow the rules.

I don't think we can commit to following the rules on these threads unfortunately. We have no idea what the software audio/video decoders might be doing on this thread under the hood. I'm pretty sure ffmpeg has some spin loops in there within it's threaded decoding model.

We might be able to split just audio decoding off onto a thread which we could follow the rules on, since there is no multi-threaded software decoding. We've talked about doing this in the past -- it's not too difficult, but may have unconsidered performance impacts with adding a thread hop for audio decoding. Possibly those are all made up by the priority increase.
Oh, I guess splitting out the audio wouldn't work, since the return of the decoded samples to the normal priority media thread would still be delayed. Maybe we'd have to instead move off video decoding to its own thread -- media thread is a bit of a catch all though, so I'm not sure what other chrome pieces like capture and so forth might be doing.
> I'm pretty sure ffmpeg has some spin loops in there

Any chance we can get those removed? Spin loops have bitten us in the past, especially for enterprise customers who often run Chrome in VMs with a low CPU count (often two, I believe). All it takes is a priority inversion with one high-priority spin-loop per core and Chrome will lockup forever.

Okay all the ones I remember actually seem gone. They're all while() { pthread_cond_Wait(); } now.
https://cs.chromium.org/chromium/src/third_party/ffmpeg/libavcodec/pthread_frame.c
https://cs.chromium.org/chromium/src/third_party/ffmpeg/libavcodec/pthread_slice.c
https://cs.chromium.org/chromium/src/third_party/ffmpeg/libavcodec/vp8.c

Though they are spawning there own threads for this which would not be at the same priority as the chrome media thread.
Hi Dale and Xiaohan,

In the call on 10/23, you had mentioned the increased buffer size (change #140) would be available in Beta 71, but it looks like it was made to 72.0.3590.0 Canary, which wouldn't reach Beta until mid-late December. Is there a plan to move it into the 71 Beta?

Thanks,
Evan
Per our metrics it seems to have a small increase in MTBR (bad) and a small increase in smooth rate (good) which are aligned with increasing the buffer size. So I think merging to 71 sounds good:

https://uma.googleplex.com/p/chrome/timeline_v2/?sid=ec1a68b146d0558cb8976ce4274e7f89

The thread priority thing is still stuck on https://chromium-review.googlesource.com/c/chromium/src/+/931763
One thing we could do is have more recall in our media pipeline for the optimal per-machine audio buffer size. I.e., instead of always starting from 200ms, we could keep that in a per process static and adjust it as necessary when rebuffers are seen. Probably we wouldn't want to keep increasing this on network underflows, just decoding underflow.
> Probably we wouldn't want to keep increasing this on network underflows, just decoding underflow.

Time to dust off this CL? https://chromium-review.googlesource.com/c/chromium/src/+/1105394
Labels: Merge-Request-71
Request to merge the CL in #140 (https://chromium-review.googlesource.com/c/1297329) to M71. It's only changing audio buffer size and the risk should be very low. dalecurtis@ also verified the results in #149.
Project Member

Comment 153 by sheriffbot@chromium.org, Nov 12

Labels: -Merge-Request-71 Hotlist-Merge-Review Merge-Review-71
This bug requires manual review: M71 has already been promoted to the beta branch, so this requires manual review
Please contact the milestone owner if you have questions.
Owners: benmason@(Android), kariahda@(iOS), kbleicher@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Review-71 Merge-Rejected-71
This is regressed in M68 and we're very close to M71 stable promotion. Let's wait until M72. 
Showing comments 55 - 154 of 154 Older

Sign in to add a comment