New issue
Advanced search Search tips

Issue 827450 link

Starred by 8 users

Issue metadata

Status: Verified
Owner:
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Android , Windows , iOS , Chrome , Mac , Fuchsia
Pri: 1
Type: Bug



Sign in to add a comment

RTCRtpSender/Receiver, WebRtcMediaStreamAdapterMap and AdapterRef have broken threading semantics.

Project Member Reported by w...@chromium.org, Mar 30 2018

Issue description

This test crashed on the Fuchsia/x64 FYI bot in https://build.chromium.org/p/chromium.fyi/builders/Fuchsia/builds/15446:

[00369.438] 01206.01268> <== fatal exception: process content_unittests__exec[229603] thread MockPCFactory WebRtc Signaling [230493]
[00369.438] 01206.01268> <== fatal page fault, PC at 0x8322c14c04
[00369.438] 01206.01268>  CS:                   0 RIP:       0x8322c14c04 EFL:            0x10202 CR2:               0x10
[00369.438] 01206.01268>  RAX:                  0 RBX:     0x7214f8604230 RCX: 0x2feda5114e0739af RDX:     0x2523e01d3cb0
[00369.439] 01206.01268>  RSI:     0x5bfe2f37d000 RDI:               0x2f RBP:     0x2e5399286bf8 RSP:     0x2e5399286a90
[00369.439] 01206.01268>   R8: 0xffffffffffffffff  R9:     0x5c0af294dd80 R10:             0x1000 R11:        0x800000000
[00369.440] 01206.01268>  R12:     0x5c877f45c370 R13:     0x2e5399286d40 R14:     0x5bfe2f3770f0 R15:     0x2e5399286ac8
[00369.440] 01206.01268>  errc:               0x4
[00369.440] 01206.01268> bottom of user stack:
[00369.441] 01206.01268> 0x00002e5399286a90: 00000000 00000000 99286ac8 00002e53 |.........j(.S...|
[00369.455] 01206.01268> 0x00002e5399286aa0: 2f3770f0 00005bfe 22c14e73 00000083 |.p7/.[..sN."....|
[00369.455] 01206.01268> 0x00002e5399286ab0: 2f3770c0 00005bfe 23caf306 00000083 |.p7/.[.....#....|
[00369.455] 01206.01268> 0x00002e5399286ac0: 7f532ad0 00005c87 f8604230 00007214 |.*S..\..0B`..r..|
[00369.455] 01206.01268> 0x00002e5399286ad0: 4e0739af 2feda511 2f3770c0 00005bfe |.9.N.../.p7/.[..|
[00369.460] 01206.01268> 0x00002e5399286ae0: 7f536f00 00005c87 7f45c360 00005c87 |.oS..\..`.E..\..|
[00369.460] 01206.01268> 0x00002e5399286af0: 99286bf8 00002e53 23cb13b1 00000083 |.k(.S......#....|
[00369.460] 01206.01268> 0x00002e5399286b00: 7f532b60 00005c87 00000005 2feda511 |`+S..\........./|
[00369.461] 01206.01268> 0x00002e5399286b10: 4e0739af 2feda511 7f536f00 00005c87 |.9.N.../.oS..\..|
[00369.524] 01206.01268> 0x00002e5399286b20: 23a7418c 00005c34 2f377000 00005bfe |.A.#4\...p7/.[..|
[00369.524] 01206.01268> 0x00002e5399286b30: 00000000 00000000 23c97be5 00000083 |.........{.#....|
[00369.525] 01206.01268> 0x00002e5399286b40: 243a3b10 00000083 2f377000 00005bfe |.;:$.....p7/.[..|
[00369.525] 01206.01268> 0x00002e5399286b50: 23a7418c 00005c34 1dbf49e8 00000083 |.A.#4\...I......|
[00369.525] 01206.01268> 0x00002e5399286b60: 23ff1d50 00000083 23c97c89 00000083 |P..#.....|.#....|
[00369.526] 01206.01268> 0x00002e5399286b70: 7f4f34c0 00005c87 208891ef 00000083 |.4O..\..... ....|
[00369.526] 01206.01268> 0x00002e5399286b80: 99286d40 00002e53 21f11002 00000083 |@m(.S......!....|
[00369.530] 01206.01268> arch: x86_64
[00369.772] 01206.01268> dso: id=615ca57b6e6f2fbe540e425d5343f9761999b360 base=0x7cc933f6d000 name=libasync-default.so
[00369.773] 01206.01268> dso: id=a544e7483e97161e58387de2a4570a936e907aae base=0x77747f066000 name=libfdio.so
[00369.773] 01206.01268> dso: id=5efedc7801edd06c base=0x76807dddc000 name=libunwind.so.1
[00369.784] 01206.01268> dso: id=8eb4f1a404049d1d7aea82f90090697f327c8bf8 base=0x6befa5558000 name=liblaunchpad.so
[00369.796] 01206.01268> dso: id=c401e71426b591bdbf3504aa2b5839aeb48326a1 base=0x640a1288b000 name=<vDSO>
[00369.796] 01206.01268> dso: id=f15b94abfcbdd5d69fe3e872968ed2cee5162d07 base=0x62c0f1908000 name=libc.so
[00369.796] 01206.01268> dso: id=a2eac7bf15ca918d base=0x5ceb45352000 name=libfxl.so
[00369.796] 01206.01268> dso: id=401a597024c762aa base=0x5a40f64f5000 name=libosmesa.so
[00369.796] 01206.01268> dso: id=b61656ed713ac11d04067c3dc1c4fb0e52eb9b86 base=0x4e84456e9000 name=libsyslog.so
[00369.796] 01206.01268> dso: id=595637609bf55a50 base=0x31121eb84000 name=libc++.so.2
[00369.797] 01206.01268> dso: id=6841a18eeeebb2b7 base=0x2ba94946c000 name=libc++abi.so.1
[00369.797] 01206.01268> dso: id=d97edb5ce3c0772b3d0c746d419339f4188838f2 base=0x1b814f79a000 name=libtrace-engine.so
[00369.811] 01206.01268> dso: id=347d69f7bac1291b base=0x1b6cecc9a000 name=libfxl_logging.so
[00369.811] 01206.01268> dso: id=a332afe8df5fffff base=0x6612ce58000 name=libmedia_client.so
[00369.824] 01206.01268> dso: id=97a0dc303188df85 base=0xc5f8608000 name=libfsl.so
[00369.825] 01206.01268> dso: id=7f9e3e76ab9cfab6 base=0x831db24000 name=app:content_unittests__exec
#01: blink::PersistentBase<blink::MediaStreamSource, (blink::WeaknessPersistentConfiguration)0, (blink::CrossThreadnessPersistentConfiguration)0>::Assign(blink::MediaStreamSource*) at ??:?
#02: blink::WebMediaStreamTrack::Source() const at ??:?
#03: content::WebRtcMediaStreamTrackAdapter::Dispose() at ??:?
#04: content::WebRtcMediaStreamTrackAdapterMap::AdapterRef::~AdapterRef() at ??:?
#05: content::RTCRtpSender::RTCRtpSenderInternal::~RTCRtpSenderInternal() at ??:?
#06: content::RTCRtpSender::RTCRtpSenderInternal::~RTCRtpSenderInternal() at ??:?
#07: base::internal::BindState<void (media::AudioOutputController::*)(std::__1::unique_ptr<media::AudioBus, std::__1::default_delete<media::AudioBus> >, base::TimeTicks), scoped_refptr<media::AudioOutputController>, std::__1::unique_ptr<media::AudioBus, std::__1::default_delete<media::AudioBus> >, base::TimeTicks>::Destroy(base::internal::BindStateBase const*) at ??:?
#08: base::debug::TaskAnnotator::RunTask(char const*, base::PendingTask*) at ??:?
#09: base::MessageLoop::RunTask(base::PendingTask*) at ??:?
#10: base::MessageLoop::DoWork() at ??:?
#11: base::MessagePumpDefault::Run(base::MessagePump::Delegate*) at ??:?
#12: base::RunLoop::Run() at ??:?
#13: base::Thread::ThreadMain() at ??:?
#14: base::(anonymous namespace)::ThreadFunc(void*) at ??:?
#15: pc 0x62c0f1918e36 sp 0x2e5399286fe0 (libc.so,0x10e36)
#16: pc 0x62c0f198e599 sp 0x2e5399286ff0 (libc.so,0x86599)
#17: pc 0 sp 0x2e5399287000
For graphics-related bugs, please copy/paste the contents of the about:gpu
page at the end of this report.

Note that the Fuchsia OS and emulation environment have unusual thread-scheduling, so this may be a latent cross-platform race-condition.
 
Project Member

Comment 1 by bugdroid1@chromium.org, Mar 30 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/96d90f7a00a0bdd43be13a5446c6e4727ca53fb6

commit 96d90f7a00a0bdd43be13a5446c6e4727ca53fb6
Author: Wez <wez@chromium.org>
Date: Fri Mar 30 02:58:03 2018

Disable newly-added RTCRtpReceiverTest.GetStats, which is flaky.

TBR: hbos
Bug:  827450 ,  680172 
Change-Id: I2d351e517dae3b011bd477c9482e756344df4fe2
Reviewed-on: https://chromium-review.googlesource.com/987394
Reviewed-by: Wez <wez@chromium.org>
Commit-Queue: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#547089}
[modify] https://crrev.com/96d90f7a00a0bdd43be13a5446c6e4727ca53fb6/content/renderer/media/webrtc/rtc_rtp_receiver_unittest.cc

Comment 2 by w...@chromium.org, Mar 30 2018

Status: Fixed (was: Assigned)

Comment 3 by w...@chromium.org, Mar 30 2018

Status: Assigned (was: Fixed)

Comment 4 by hbos@chromium.org, Apr 3 2018

Labels: -Pri-1 Pri-3
Probably the same cause as  https://crbug.com/812296 .
All of these RTCRtpSenderTest/RTCRtpReceiverTest are flaky, I think it has to do with races where a thread is being shut down before it some other thread is finished, so that thread does PostTask and it doesn't actually post the task and the arguments are not moved to the other thread but destroyed on the current, wrong thread.

I added a WebRtcMediaStreamTrackAdapterMap stress test too that was flaking, probably same cause:
https://chromium-review.googlesource.com/c/chromium/src/+/983553

There's this assumption about if we just run all the posted tasks before we quit we should be good but maybe one of the posted tasks posts another task.

The annoying thing about these flakes is I've never been able to reproduce them locally and I don't understand how they can occur given my runlooping.

All of the code is covered by integrationtests though so it's not high priority to fix this.

Comment 5 by hbos@chromium.org, Apr 3 2018

We should consider adding a callback for all of these things. But I would prefer to have a repro case to test against before I change this stuff around, the WebRtcMediaStream[Track]AdapterMaps have caused many problems before.

Comment 6 by w...@chromium.org, Apr 10 2018

Labels: -Pri-3 Pri-1
RTCRtpSenderTest.GetStats flaked in https://ci.chromium.org/buildbot/chromium.fyi/Fuchsia/15840

Looking at the crash dump, I notice that there is an "AdapterRef" on the stack, which appears to be some sort of ref-counting hack. Looking at the entries further up the stack, I see that that is owned by the RTCRtpSenderInternal (which is RefCountedThreadSafe) and that that is being released by the media::AudioOutputManager (also RefCountedThreadSafe).

It looks like we have a thread-affine class (blink::WebMediaStreamTrack::Source()?) being invoked from the wrong thread as the result of multiple layers of multi-threaded ref-counting.

This seems extremely unlikely to be a test-specific issue.

Comment 7 by w...@chromium.org, Apr 10 2018

Looking at the try flake for RTCRtpReceiverTest.GetStats in https://ci.chromium.org/buildbot/tryserver.chromium.android/linux_android_rel_ng/517991 I see this crash:

261.765s run_tests_on_device(03ab182d003b9ebf)  [ RUN      ] RTCRtpReceiverTest.GetStats
I  261.765s run_tests_on_device(03ab182d003b9ebf)  [FATAL:webrtc_media_stream_track_adapter_map.cc(23)] Check failed: map_->main_thread_->BelongsToCurrentThread(). 
...

which appears to confirm that we're crashing due to broken threading.

Comment 8 by hbos@chromium.org, Apr 10 2018

The adapters are a bit horrible, they own stuff that is created on one thread but have to be destroyed on another thread. So when ref counted was added instead of using scoped_refptr which is very implicit we made a separate class, AdapterRef, which makes the ownership of references more explicit. For instance std::move() is used in PostTasks which explicitly passes ownership to avoid accidentally retain a reference on the thread that is calling PostTask to transfer ownership.

Aaaannyway, AdapterRef has to be destroyed on the main thread, that ensures that any last reference to this horrible adapter triggers the destruction on the correct thread. Generally speaking AdapterRefs should thus only be owned by objects who live on the main thread.

If there is something on the signaling thread keeping a sender (and its AdapterRefs) alive on the signaling thread that is a bug. Need to investigate if this is a threading error on the part of the tests or if it is real.

One of the reasons is I've brushed this off as "testing-only problem" is I've seen these problems before and they have only occurred in these unittests while we have integrationttests and production code that never hits this. And it's a race that I've never been able to repro locally, even after thousands of test runs.

Agreed this should be investigated further if time allows.

Comment 9 by w...@chromium.org, Apr 10 2018

Labels: -M-67 M-68
Re #8: Even if it is a test-only issue, it makes these tests useless across a swathe of platforms.

The good news is that the flake repros fairly often under Fuschsia for the GetStats tests, so there's a good chance we can repro it with just a large enough --gtest_repeats.  The timing differences that make things flake more often under Fuchsia are also reasonably well understood (essentially unexpectedly long delays at unusual points during execution; these flakes can usually be made 100% failures by addition of a Sleep() somewhere specific :).

Comment 10 by w...@chromium.org, Apr 10 2018

Re the thread-affine teardown: RefCountedThreadSafe<> supports deletion traits which you can provide to have it teardown on a particular thread, rather than on whichever thread happens to hold the final reference. So long as you aren't tearing down the target thread before the last reference to the object, that should work as you intend.

Is the idea simply that AdapterRef is effectively a non-copyable scoped_refptr<>, in effect, so that you have to explicitly be dispensed a new reference, and otherwise they can only be passed via std::move()?

Comment 11 by w...@chromium.org, Apr 10 2018

Re #8: (/me hits Send too soon..) I think the problem here is that you have this special AdapterRef thing that is intended to ensure correct threading by being move-only, but that doesn't work because the object holding the AdapterRef is potentially released on the wrong thread, since it is RefCountedThreadSafe.
Project Member

Comment 12 by bugdroid1@chromium.org, Apr 10 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/9ea3e3cf0148a0685ebf8f7af4979e31735d94e5

commit 9ea3e3cf0148a0685ebf8f7af4979e31735d94e5
Author: Wez <wez@chromium.org>
Date: Tue Apr 10 22:29:58 2018

Disable or filter some more tests which often flake under Fuchsia.

- Disable ProcessUtilTest.LaunchWithHandleTransfer under Debug builds,
  since launching the child process is currently too time consuming.
- Modifies that test to distinguish the child process failing to send
  data from the timeout case.
- Disable SecurityTest.NewOverflow, which fails under Fuchsia/Debug
  builds due to the SDK exporting the wrong new[].
- Disable RTCRtpSenderTest.GetStats, which is part of a suite of tests
  which appear to be inherently flaky, but fail most often under Fuchsia.
- Filter DataPipeTest.SendConsumerAndCloseProducer

Bug:  793412 ,  828229 ,  827450 , 831024
Change-Id: Icb655ca91881c3ff7059f333f0e22d9bf2d92468
Reviewed-on: https://chromium-review.googlesource.com/1003996
Reviewed-by: Daniel Cheng <dcheng@chromium.org>
Reviewed-by: Yuri Wiitala <miu@chromium.org>
Reviewed-by: Scott Graham <scottmg@chromium.org>
Commit-Queue: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#549663}
[modify] https://crrev.com/9ea3e3cf0148a0685ebf8f7af4979e31735d94e5/base/process/process_util_unittest.cc
[modify] https://crrev.com/9ea3e3cf0148a0685ebf8f7af4979e31735d94e5/base/security_unittest.cc
[modify] https://crrev.com/9ea3e3cf0148a0685ebf8f7af4979e31735d94e5/content/renderer/media/webrtc/rtc_rtp_sender_unittest.cc
[modify] https://crrev.com/9ea3e3cf0148a0685ebf8f7af4979e31735d94e5/testing/buildbot/filters/fuchsia.mojo_unittests.filter

Comment 13 by w...@chromium.org, Apr 11 2018

Components: Tests>Flaky
Labels: OS-Android OS-Chrome OS-Linux OS-Mac OS-Windows
It is possible to repro the GetStats test failures trivially by adding:

--- a/content/renderer/media/webrtc/rtc_rtp_sender.cc
+++ b/content/renderer/media/webrtc/rtc_rtp_sender.cc
@@ -187,6 +187,7 @@ class RTCRtpSender::RTCRtpSenderInternal
     native_peer_connection_->GetStats(webrtc_sender_,
                                       RTCStatsCollectorCallbackImpl::Create(
                                           main_thread_, std::move(callback)));
+    base::PlatformThread::Sleep(base::TimeDelta::FromSeconds(2));
   }

This causes the GetStatsOnSignalingThread call not to return until after the main thread has dropped its last reference to the RTCRtpSenderInternal, causing the scoped_refptr<RTCRtpSenderInternal> that was Bind()ed to |this| to be the final reference. The class is then released on the signaling thread, implicitly resulting in the AdapterRef being released there as well.

Presumably similar problems apply to the other test flakes.

Reviewing the various classes involved in the call stack in more detail, it seems that:
1. Most of them are used on multiple threads, rather than being split e.g. into separate objects for the main and signaling thread parts.
2. Most of them are ref-counted objects; it's not immediately clear whether that is necessary, or just a work-around for the objects existing across multiple threads.
3. Most of these objects have the expectation of being torn-down on the main thread, but do nothing to guarantee that.
4. Many of these threading expectations are not DCHECKed.

Comment 14 by w...@chromium.org, Apr 11 2018

Cc: sergeyu@chromium.org m...@chromium.org
Adding some content/renderer/media OWNERS for visibility.

Comment 15 by w...@chromium.org, Apr 11 2018

Components: -Blink>MemoryAllocator>GarbageCollection -Blink>WebRTC Blink>WebRTC>PeerConnection

Comment 16 by w...@chromium.org, Apr 11 2018

Labels: Test-Disabled

Comment 17 by w...@chromium.org, Apr 11 2018

Summary: RTCRtpSender/Receiver, WebRtcMediaStreamAdapterMap and AdapterRef have broken threading semantics. (was: RTCRtpReceiverTest.GetStats failure on Fuchsia/x64 FYI bot)

Comment 18 by w...@chromium.org, Apr 11 2018

Cc: hbos@chromium.org
 Issue 812296  has been merged into this issue.

Comment 19 by w...@chromium.org, Apr 11 2018

Labels: Stability-Crash

Comment 20 by hbos@chromium.org, Apr 12 2018

Status: Started (was: Assigned)
#13: Thank you for that snippet. This is a real problem.

The async operations are a case where a reference is given to the webrtc signaling thread and thus there are no guarantees that the sender is destroyed on the main thread. Any operation that jumps to the signaling thread does jump back to the main thread per design, but GetStats is an example where, mistakenly, nothing is holding on to a sender reference during the operation to protect it until we are back on the main thread, meaning the signaling thread's reference may very well be the last.

This can easily be addressed by adding a reference that is posted back and forth until the operation is complete on the main thread, and an easy fix I can implement.

But this mix of reference counting of objects used on multiple threads but expected to be destroyed on one thread is a headache. We may want to redesign these classes or have the destructor of the ref counted object post to the main thread to do its destruction logic, that way you don't have to care about which thread held the last reference.

Comment 21 by hbos@chromium.org, Apr 12 2018

Even if BlahOnSignalingThread does PostTask to main thread with a reference to |this| it is racy, since that task could be executed before the end of BlahOnSignalingThread.

This code needs to be updated such that either threading assumptions are checked or cleanup is safe regardless of which thread held the last reference.

Comment 22 by w...@chromium.org, Apr 12 2018

Re #20: If an object runs on a particular Sequence, and async work can/should safely be "dropped" if that object is destroyed, then ref-counting isn't required; you can instead PostTaskAndReply(...) and Bind() a WeakPtr to the object into the reply Callback. If the object is gone by the time the reply arrives, it will be silently skipped.

Re #21: RefCountedThreadSafe<> is parameterized with Traits which include allowing a customized Destruct(const T*) behaviour. We have several ref-counted classes which implement that to DeleteSoon() the object on its "home" TaskRunner when the last reference is dropped, regardless of where that happens.

You listed a set of constraints that should be properly implemented, and checked, in https://bugs.chromium.org/p/chromium/issues/detail?id=812296#c14. We should probably implement those to address the immediate issues, but we should also clean up this code to remove any unnecessary ref-counting, split up multi-threaded classes into single-threaded sub-components, etc. e.g. Using RefCountedThreadSafe::Traits and WeakPtr I expect we can get rid of the need for AdapterRef.

I'm happy to help clean this up, but I'm unfamiliar with this code, so I'm unsure of which ref-counting is required and which was added just to to cope with multi-threading.

Comment 23 by hbos@chromium.org, Apr 13 2018

NextAction: 2018-04-17
Let me get back to this on Monday or Tuesday and I can do immediate fixes and get back here to talk about design improvements. I think PostTaskAndReply and DeleteSoon are good options.

One of the reasons for having AdapterRef though is we have blink-layer objects (blink::MediaStream, blink::MediaStreamTrack) and webrtc-layer objects (webrtc::MediaStreamInterface, webrtc::MediaStreamTrackInterface) that are supposed to map 1:1, but by design does not have the same life-time.

You can create a track and never use it in combination with WebRTC (e.g. capture camera and show it in a <video> tag). If you do want to attach a track to an RTCPeerConnection though we need to create content and webrtc glue to identify this track in objects that third_party/webrtc understands. These classes are reference counted and whenever an API is called (blink->webrtc) or a callback is invoke (webrtc->blink) when we see one of these objects we need to "convert" it to the other layer's representation of that object. This is why we have WebRtcMediaStream[Track]AdapterMap.

This is where "GetOrCreate" comes to play, we only create an adapter if we haven't already created an adapter for that object. And reference counting comes to play with the idea that we want to free these resources if they're not used, but there may be multiple usages.

For example, you might have two receivers referencing the same stream. If one of the receivers are removed from the peer connection, you only clean up the stream adapter if there are no other receivers that reference that stream. The same goes for senders referencing and tracks and streams.

Ideally we'd get rid of the content layer altogether, but in the meantime, a content layer sender is needed to as the glue between blink::RTCRtpSender and webrtc::RtpSenderInterface. Because both the blink and the content layer needs to reference senders, this is reference counted.

The split between content::RTCRtpSender not being ref counted and RTCRtpSenderInternal being ref counted was a work-around for not being able to use ref counted interfaces across blink and content using a pattern I've seen elsewhere (WebFoo and Foo, e.g. multiple WebMediaStreamTracks can be copied by value all representing the same MediaStreamTrack). If a blink interface implemented in content can be reference counted the internal class could be merged into content::RTCRtpSender but it does not change the overall design.

Comment 24 by hbos@chromium.org, Apr 13 2018

With DeleteSoon type of logic adapters could use "normal" ref counting and then it doesn't matter where they are destroyed, the cleanup will occur on the right thread. This would be a good thing.

Testing needs to allow threads to finish up any DeleteSoon logic is executed before the thread shuts down.

Comment 25 by hbos@chromium.org, Apr 13 2018

Also note that when the last AdapterRef is destroyed, the entry in the adapter map of that adapter is removed, which involves a mutex.

Comment 26 by w...@chromium.org, Apr 16 2018

Re #23: IIUC the reason for the ref-counting is that the objects on the Blink side may be multiply-referenced, and therefore the corresponding webrtc-side adapters must support multiple references to a single adapter.  Are we guaranteed that GetOrCreate() will always be called from the main thread?

I don't understand your point re content::RTCRtpSender.  Are you saying that the blink interface is single-owner, but multiple of those objects may refer to the same underlying RTCRtpSender, hence the need for ref-counting? The issue I see with the ref-counting is that you're both using it to allow multiple RTCRtpSenders to co-own a single RTCRtpSenderInternal, _and_ when posting tasks to be executed on the signaling thread. You could split the *Internal would be split into two parts; a ref-counted main-thread part, and a single-owner signaling thread part. However, just fixing the destruct traits to teardown on the main thread, and being explicit about which members are safe to access where is probably sufficient.

Re #25: If we fix things such that the adaptors are always inserted/fetched/removed from the AdapterMap from the main thread then we can have the adapter remove itself on destruction, without the need for a mutex, surely?
The NextAction date has arrived: 2018-04-17

Comment 28 by hbos@chromium.org, Apr 17 2018

Adapters fall into two categories, "local" adapters which are ones created locally e.g. by opening the camera, and "remote" adapters which are created by webrtc by the remote endpoint (i.e. another peer's local track and stream being sent over the network, received as a remote track and stream).

For local adapters, GetOrCreate() is always invoked on the main thread (this is guaranteed by calling places and is DCHECKed). They are meant to always be destroyed on the main thread due to destroying resources held by the adapter (this is DCHECKed but in race conditions this DCHECK may actually be hit as per this bug).

For remote adapters, GetOrCreate() is always invoked on the webrtc signaling thread (this is guaranteed by calling places and is DCHECKed). They too are always meant to be destroyed on the main thread due to resources held by the adapter (this is DCHECKed but in race conditions this DCHECK may actually be hit as per this bug).

These threading restrictions were imposed by the underlying code that the track adapters are essentially just wrapping around, refactoring and abstracting away. The functionality provided by the maps is reference counting and "getting if already exist, otherwise create a new one", but apparently DCHECK is not sufficient to ensure references are handled correctly.

In both cases, it is possible to Get() an adapter ref for an existing adapter on any thread, but then it will not be created if it does not exist. It is debatable whether this API is needed.

Both local and remote adapters share the same map and lock, though it would be entirely possible to split local and remote adapters into separate maps with separate locks (or possibly getting rid of the lock if we don't need to access the map on multiple threads after the split).

Currently mutex is needed because...
- GetOrCreate() happens on multiple threads causing insertions into a shared map (local: main thread, remote: webrtc signaling thread).
- Get() happens on multiple threads and needs to sync up with any previously created adapters on the other thread.

Comment 29 by hbos@chromium.org, Apr 17 2018

Fixing destruction traits sounds good to me.

It would also be good to be able to have some sort of callback for testing to ensure the tests don't tear down the threads before all destruction logic has fully executed.

Comment 30 by hbos@chromium.org, Apr 17 2018

NextAction: ----

Comment 31 by hbos@chromium.org, Apr 17 2018

The use of locks in the map have been a concern for deadlocks in the past, but those deadlock issues have been fixed.

The adapter map unittests are not flaky, and use of std::unique_ptr for AdapterRef make it relatively safe to jump between threads. E.g. if you're on the signaling thread and you do PostTask, you have to std::move() the adapter ref so there is no concern if the posted task is executed before the current function finishes. Passing around adapter refs is rather deterministic.

As for immediate concerns, enabling the failing tests and merging a fix for M67...

The concern is when an adapter ref is held by an object whose destruction logic is not deterministic.
This is a problem with RTCRtpSenderInternal and RTCRtpReceiverInternal since these use normal ref counting that can be destroyed on any thread, and when it performs operations with PostTask you don't know if that task or the current task will be the last to drop its reference. This class has nothing that guarantees the threading assumption that it would be destroyed on the main thread.

For the immediate fix I will add destructor traits to RTCRtpSenderInternal and RTCRtpReceiverInternal.

Comment 32 by hbos@chromium.org, Apr 17 2018

RTCRtpSenderInternal destructor traits CL: https://chromium-review.googlesource.com/c/chromium/src/+/1015002

RTCRtpReceiverInternal destructor traits CL: https://chromium-review.googlesource.com/c/chromium/src/+/1013492

Comment 33 by hbos@chromium.org, Apr 17 2018

Labels: FoundIn-67

Comment 34 by hbos@chromium.org, Apr 17 2018

Re#31: Actually there is a similar PostTask problem with ~WebRtcMediaStreamTrackAdapter DCHECKing that we are on the main thread even though the remote audio track case has Dispose logic for PostTasking to signaling thread and back, but this DCHECK should perhaps be removed just like the ~WebRtcMediaStreamTrackAdapterMap DCHECK.

Comment 35 by hbos@chromium.org, Apr 17 2018

I'll remove the invalid thread checks for the map and adapter while adding a stress test in this CL: https://chromium-review.googlesource.com/c/chromium/src/+/983553/
Rest assure the adapter's Dispose() function still has a thread check and the map is thread safe and already DCHECK it is empty upon destruction so this should be safe.
Project Member

Comment 36 by bugdroid1@chromium.org, Apr 17 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6bfff67c0102ca4d037a2eec2b75034d099ef08a

commit 6bfff67c0102ca4d037a2eec2b75034d099ef08a
Author: Henrik Boström <hbos@chromium.org>
Date: Tue Apr 17 19:00:30 2018

Track adapter map stress test added and invalid thread checks removed.

WebRtcMediaStreamTrackAdapterMapStressTest added due to
https://crbug.com/813574.

This continuously creates and destroys local and remote track adapters
concurrently in an attempt to cause a deadlock if possible (hopefully
not).

This also removes some invalid thread checks at the destructor of
~WebRtcMediaStreamTrackAdapter and ~WebRtcMediaStreamAdapterMap. The
first of which made the stress test flaky ( https://crbug.com/826365 ).

The adapter's DCHECK was invalid because the remote audio track's
Dispose (which properly DCHECKs that it is invoked on the main thread)
can jump to the signaling thread and back with a reference to the
adapter. It is possible for the signaling thread's task to be the last
reference to the adapter.

Similarly, WebRtcMediaStreamAdapterMap is reference counted and tasks
referencing on different threads could make which thread references it
last racy. We don't care which thread was the last reference as long as
it is empty (which it is if and only if all adapters have been
destroyed on the appropriate threads).

Bug:  827450 
Change-Id: I5f819948fc2148e810f343e027fa8011b03d2335
Reviewed-on: https://chromium-review.googlesource.com/983553
Reviewed-by: Guido Urdaneta <guidou@chromium.org>
Commit-Queue: Henrik Boström <hbos@chromium.org>
Cr-Commit-Position: refs/heads/master@{#551413}
[modify] https://crrev.com/6bfff67c0102ca4d037a2eec2b75034d099ef08a/content/renderer/media/webrtc/webrtc_media_stream_adapter_map.cc
[modify] https://crrev.com/6bfff67c0102ca4d037a2eec2b75034d099ef08a/content/renderer/media/webrtc/webrtc_media_stream_track_adapter.cc
[modify] https://crrev.com/6bfff67c0102ca4d037a2eec2b75034d099ef08a/content/renderer/media/webrtc/webrtc_media_stream_track_adapter_map_unittest.cc

Project Member

Comment 37 by bugdroid1@chromium.org, Apr 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a43fb027d85525a5f1038df421ca5643f5778fca

commit a43fb027d85525a5f1038df421ca5643f5778fca
Author: Henrik Boström <hbos@chromium.org>
Date: Wed Apr 25 14:53:24 2018

RTCRtpSenderInternal destructor traits ensuring delete on main thread.

The internal class owns AdapterRefs which have to be destructed on the
main thread. As such, the last reference to the internal class had to
be freed on the main thread. In flaky cases, when tasks were posted to
the signaling thread, an operation performed and a task posted back to
the main thread - the main thread would finish before the signaling
thread task and the last reference to the internal class would be
released on the signaling thread, causing DCHECK crashes at
~AdapterRef for being on the wrong thread.

With destructor traits, if we are not already on the main thread we
post to the main thread and delete RTCRtpSenderInternal there.

The TearDown of RTCRtpSenderTest is updated to ensure that any pending
tasks get a chance to execute, in case the signaling thread was not
finished yet or else the destructor posted to the main thread does not
get a chance to execute and the test would flakily leak.

Before this CL: Flake's symptoms could be reproduced by adding a thread
sleep at RTCRtpSenderInternal::GetStatsOnSignalingThread. After this CL:
Unable to repro flake. The same applies for the other flaking tests
based on replaceTrack().

Bug:  827450 
Change-Id: Ib594a53042e441e591ccc2c87ae8012bcb4ec75e
Reviewed-on: https://chromium-review.googlesource.com/1015002
Commit-Queue: Henrik Boström <hbos@chromium.org>
Reviewed-by: Wez <wez@chromium.org>
Reviewed-by: Guido Urdaneta <guidou@chromium.org>
Cr-Commit-Position: refs/heads/master@{#553550}
[modify] https://crrev.com/a43fb027d85525a5f1038df421ca5643f5778fca/content/renderer/media/webrtc/rtc_rtp_sender.cc
[modify] https://crrev.com/a43fb027d85525a5f1038df421ca5643f5778fca/content/renderer/media/webrtc/rtc_rtp_sender_unittest.cc

Project Member

Comment 38 by bugdroid1@chromium.org, Apr 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0ec0e885a2fecd965eba2e084f4e55fe002dae81

commit 0ec0e885a2fecd965eba2e084f4e55fe002dae81
Author: Henrik Boström <hbos@chromium.org>
Date: Wed Apr 25 18:41:39 2018

RTCRtpReceiverInternal destructor traits ensuring delete on main thread.

The internal class owns AdapterRefs which have to be destructed on the
main thread. As such, the last reference to the internal class had to
be freed on the main thread. In flaky cases, when tasks were posted to
the signaling thread, an operation performed and a task posted back to
the main thread - the main thread would finish before the signaling
thread task and the last reference to the internal class would be
released on the signaling thread, causing DCHECK crashes at
~AdapterRef for being on the wrong thread.

With destructor traits, if we are not already on the main thread we
post to the main thread and delete RTCRtpReceiverInternal there.

The TearDown of RTCRtpReceiverTest is updated to ensure that any pending
tasks get a chance to execute, in case the signaling thread was not
finished yet or else the destructor posted to the main thread does not
get a chance to execute and the test would flakily leak.

Before this CL: Flake's symptoms could be reproduced by adding a thread
sleep at RTCRtpReceiverInternal::GetStatsOnSignalingThread. After this CL:
Unable to repro flake.

Bug:  827450 
Change-Id: Icdfdd3e22c3e86308ee7911e67eff43590fde581
Reviewed-on: https://chromium-review.googlesource.com/1013492
Commit-Queue: Henrik Boström <hbos@chromium.org>
Reviewed-by: Wez <wez@chromium.org>
Reviewed-by: Guido Urdaneta <guidou@chromium.org>
Cr-Commit-Position: refs/heads/master@{#553675}
[modify] https://crrev.com/0ec0e885a2fecd965eba2e084f4e55fe002dae81/content/renderer/media/webrtc/rtc_rtp_receiver.cc
[modify] https://crrev.com/0ec0e885a2fecd965eba2e084f4e55fe002dae81/content/renderer/media/webrtc/rtc_rtp_receiver.h
[modify] https://crrev.com/0ec0e885a2fecd965eba2e084f4e55fe002dae81/content/renderer/media/webrtc/rtc_rtp_receiver_unittest.cc
[modify] https://crrev.com/0ec0e885a2fecd965eba2e084f4e55fe002dae81/content/renderer/media/webrtc/rtc_rtp_sender.cc
[modify] https://crrev.com/0ec0e885a2fecd965eba2e084f4e55fe002dae81/content/renderer/media/webrtc/rtc_rtp_sender.h

Comment 39 by hbos@chromium.org, Apr 30 2018

Labels: Merge-Request-67
a43fb027d85525a5f1038df421ca5643f5778fca and 0ec0e885a2fecd965eba2e084f4e55fe002dae81 have baked for 4-5 days now, requesting them be merged into M67 to ensure destructor is not executed on wrong thread if unlucky with GC (not very likely, but can happen).

E.g.
  let sender = pc.addTrack(track);
  sender.getStats().then(doSomething);
  ...
  pc.close() or pc.removeTrack(sender);
  pc = sender = null;
  // Race: last reference of content layer sender could be the task executing
  // getStats() on the signaling thread if GC happens of blink layer sender
  // while operation in flight.

Comment 40 by hbos@chromium.org, Apr 30 2018

TL;DR: There are some reference counted objects (RTCRtpSenderInternal, RTCRtpReceiverInternal) which own object that have to be destroyed on the main thread, but there were nothing insuring this happened in racy situations.

The CLs requested to be merged are low risk, adding destructor traits to these objects that simply jump to the main thread if not already there.
Project Member

Comment 41 by sheriffbot@chromium.org, Apr 30 2018

Labels: -Merge-Request-67 Merge-Review-67 Hotlist-Merge-Review
This bug requires manual review: Less than 25 days to go before AppStore submit on M67
Please contact the milestone owner if you have questions.
Owners: cmasso@(Android), cmasso@(iOS), kbleicher@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Review-67 Merge-Approved-67
Approving merge to M67 branch 3396 for CLs listed at #39 (a43fb027d85525a5f1038df421ca5643f5778fca and 0ec0e885a2fecd965eba2e084f4e55fe002dae81 ) based on comments #39 and #40. Please merge ASAP so we can pick them up for this week M67 beta release. 
Project Member

Comment 43 by bugdroid1@chromium.org, Apr 30 2018

Labels: -merge-approved-67 merge-merged-3396
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e1efd9340f94ef022bee476b1b2e5cad18578240

commit e1efd9340f94ef022bee476b1b2e5cad18578240
Author: Henrik Boström <hbos@chromium.org>
Date: Mon Apr 30 18:47:56 2018

RTCRtpSenderInternal destructor traits ensuring delete on main thread.

The internal class owns AdapterRefs which have to be destructed on the
main thread. As such, the last reference to the internal class had to
be freed on the main thread. In flaky cases, when tasks were posted to
the signaling thread, an operation performed and a task posted back to
the main thread - the main thread would finish before the signaling
thread task and the last reference to the internal class would be
released on the signaling thread, causing DCHECK crashes at
~AdapterRef for being on the wrong thread.

With destructor traits, if we are not already on the main thread we
post to the main thread and delete RTCRtpSenderInternal there.

The TearDown of RTCRtpSenderTest is updated to ensure that any pending
tasks get a chance to execute, in case the signaling thread was not
finished yet or else the destructor posted to the main thread does not
get a chance to execute and the test would flakily leak.

Before this CL: Flake's symptoms could be reproduced by adding a thread
sleep at RTCRtpSenderInternal::GetStatsOnSignalingThread. After this CL:
Unable to repro flake. The same applies for the other flaking tests
based on replaceTrack().

TBR=hbos@chromium.org

(cherry picked from commit a43fb027d85525a5f1038df421ca5643f5778fca)

Bug:  827450 
Change-Id: Ib594a53042e441e591ccc2c87ae8012bcb4ec75e
Reviewed-on: https://chromium-review.googlesource.com/1015002
Commit-Queue: Henrik Boström <hbos@chromium.org>
Reviewed-by: Wez <wez@chromium.org>
Reviewed-by: Guido Urdaneta <guidou@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#553550}
Reviewed-on: https://chromium-review.googlesource.com/1035483
Reviewed-by: Henrik Boström <hbos@chromium.org>
Cr-Commit-Position: refs/branch-heads/3396@{#390}
Cr-Branched-From: 9ef2aa869bc7bc0c089e255d698cca6e47d6b038-refs/heads/master@{#550428}
[modify] https://crrev.com/e1efd9340f94ef022bee476b1b2e5cad18578240/content/renderer/media/webrtc/rtc_rtp_sender.cc
[modify] https://crrev.com/e1efd9340f94ef022bee476b1b2e5cad18578240/content/renderer/media/webrtc/rtc_rtp_sender_unittest.cc

Project Member

Comment 44 by bugdroid1@chromium.org, Apr 30 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8bc0a55c623e97d8c2e4722836d22101938a1117

commit 8bc0a55c623e97d8c2e4722836d22101938a1117
Author: Henrik Boström <hbos@chromium.org>
Date: Mon Apr 30 19:48:13 2018

RTCRtpReceiverInternal destructor traits ensuring delete on main thread.

The internal class owns AdapterRefs which have to be destructed on the
main thread. As such, the last reference to the internal class had to
be freed on the main thread. In flaky cases, when tasks were posted to
the signaling thread, an operation performed and a task posted back to
the main thread - the main thread would finish before the signaling
thread task and the last reference to the internal class would be
released on the signaling thread, causing DCHECK crashes at
~AdapterRef for being on the wrong thread.

With destructor traits, if we are not already on the main thread we
post to the main thread and delete RTCRtpReceiverInternal there.

The TearDown of RTCRtpReceiverTest is updated to ensure that any pending
tasks get a chance to execute, in case the signaling thread was not
finished yet or else the destructor posted to the main thread does not
get a chance to execute and the test would flakily leak.

Before this CL: Flake's symptoms could be reproduced by adding a thread
sleep at RTCRtpReceiverInternal::GetStatsOnSignalingThread. After this CL:
Unable to repro flake.

TBR=hbos@chromium.org

(cherry picked from commit 0ec0e885a2fecd965eba2e084f4e55fe002dae81)

Bug:  827450 
Change-Id: Icdfdd3e22c3e86308ee7911e67eff43590fde581
Reviewed-on: https://chromium-review.googlesource.com/1013492
Commit-Queue: Henrik Boström <hbos@chromium.org>
Reviewed-by: Wez <wez@chromium.org>
Reviewed-by: Guido Urdaneta <guidou@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#553675}
Reviewed-on: https://chromium-review.googlesource.com/1035882
Reviewed-by: Henrik Boström <hbos@chromium.org>
Cr-Commit-Position: refs/branch-heads/3396@{#395}
Cr-Branched-From: 9ef2aa869bc7bc0c089e255d698cca6e47d6b038-refs/heads/master@{#550428}
[modify] https://crrev.com/8bc0a55c623e97d8c2e4722836d22101938a1117/content/renderer/media/webrtc/rtc_rtp_receiver.cc
[modify] https://crrev.com/8bc0a55c623e97d8c2e4722836d22101938a1117/content/renderer/media/webrtc/rtc_rtp_receiver.h
[modify] https://crrev.com/8bc0a55c623e97d8c2e4722836d22101938a1117/content/renderer/media/webrtc/rtc_rtp_receiver_unittest.cc
[modify] https://crrev.com/8bc0a55c623e97d8c2e4722836d22101938a1117/content/renderer/media/webrtc/rtc_rtp_sender.cc
[modify] https://crrev.com/8bc0a55c623e97d8c2e4722836d22101938a1117/content/renderer/media/webrtc/rtc_rtp_sender.h

Comment 45 by hbos@chromium.org, Apr 30 2018

Status: Verified (was: Started)
The adapter map design is worth revisiting, and likely will have to when we're removing the content/renderer/media/webrtc/ layer (https://docs.google.com/document/d/14fsyfCRV_w2VcHDSnAXnhkB4DE83P-u_ewp6bg-HblU/edit?usp=sharing). Another time, another bug. This one is fixed and verified.
Project Member

Comment 46 by bugdroid1@chromium.org, May 2 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e3e32d3dc594a93c2445033e7740cef1ceb9aa96

commit e3e32d3dc594a93c2445033e7740cef1ceb9aa96
Author: Wez <wez@chromium.org>
Date: Wed May 02 11:47:02 2018

Unfilter RTCRtpSenderTest.ReplaceTrack* tests under Fuchsia.

These tests were filtered due to threading issues in the production
code leading to crash flakes.

Bug:  813795 ,  827450 
Change-Id: I1c5f69a3d26e23eec5e25c391bfe20422252c360
Reviewed-on: https://chromium-review.googlesource.com/1039068
Reviewed-by: Henrik Boström <hbos@chromium.org>
Commit-Queue: Wez <wez@chromium.org>
Cr-Commit-Position: refs/heads/master@{#555353}
[modify] https://crrev.com/e3e32d3dc594a93c2445033e7740cef1ceb9aa96/testing/buildbot/filters/fuchsia.content_unittests.filter

Sign in to add a comment