
Issue 850907

Starred by 1 user

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: 2019-01-23
OS: Linux, Chrome, Mac
Pri: 2
Type: Bug




Test page hangs and unusable upon clicking hangup button

Project Member Reported by vasanthakumar@chromium.org, Jun 8 2018

Issue description

Chrome Version: 68.0.3440.15
OS: Gnawty

What steps will reproduce the problem?
(1) Attach a camera (I used a Logitech Brio 4K Stream Edition; other recent cameras also work for this).
(2) Go to Chrome://settings/content/camera/ and select the 4K camera.
(3) Go to https://webrtc.github.io/samples/src/content/peerconnection/constraints/
(4) Set the constraints as in the picture attached to this ticket.
(5) Click the Get media button, then the Connect button.
(6) Wait 10-20 seconds and press the Hang up button.

What is the expected result?
Test page hangs gracefully.

What happens instead?
The Chrome tab becomes unresponsive. We need to close the tab and open a new one, or wait until it recovers. Other tabs are not affected.

Reproducible?
100%

Does this reproduce with other cameras?
Yes. I reproduced this issue with a Logitech HD Webcam C310.

Risk?
This can happen with any camera that supports those resolutions.

Crash Report?
https://listnr.corp.google.com/product/208/report/85488679481?dateRange=All

 
Resolutions as in this picture.

Screenshot 2018-06-07 at 3.29.08 PM.png (127 KB)
Components: Blink>GetUserMedia>Webcam
Labels: Needs-Feedback
Can we get a crash report instead of the CrOS feedback report? (or can I find it from the feedback?)
Labels: -Needs-Feedback
Hi Niklase, here is the crash report ID. Note that this page hangs and most of the time no crash report is created. The following ID is the only crash report I have gotten so far.

ee0733c392b5d8c3
Labels: OS-Mac
This issue is reproducible on Mac as well:
68.0.3440.25, MacBook Pro 15".
Local Crash ID 38f78686-d93c-47c1-a33d-4999b906afa6
Crash report captured on Tuesday, June 19, 2018 at 11:44:41 AM (upload requested by user, not yet uploaded)

[Mac report]
Cc: -niklase@chromium.org
Owner: niklase@chromium.org
Cc: niklase@chromium.org
Owner: chfremer@chromium.org
Status: Assigned (was: Untriaged)
The crash report does not seem to contain any useful information or call stack. I will have to try to repro locally.
Tested with the repro steps from the report on a fizz board and Chrome 68.0.3440.15. Unable to repro. Will try on MacOS next, and maybe try to get a Gnawty as well.

Note: The camera I used for testing is a Logitech BRIO (046d:085e).
Also no repro on Macbook Air (Early 2015) using same camera and Chrome 68.0.3440.42.

How important is waiting 10-20 seconds in step 6? Does it not happen if you wait a shorter time?
Hi, I was able to reproduce it every time within that 10-20 second window. If we hang up quickly, it is not reproducible.

I can retry again and update here soon. Thank you!
Tested but no repro on the following config:
Board: guado
Chrome Version: 68.0.3440.15
ChromeOS Version: 10718.13.0
Camera: Logitech BRIO (046d:085e)
This issue is reproducible on the Gnawty. I can see a delay in the video feed after pressing the Connect button.

If we press Hang up after 10-20 seconds, the page becomes unresponsive. This could be device-specific.

Attached a few logs: https://pantheon.corp.google.com/storage/browser/chromiumos-test-logs/bugfiles/cr/850907/july25/

If you need any specific logs please let me know. 
Summary: Test page hangs and unusable upon clicking hangup button (was: Test page crashes upon clicking hangup button)
I was able to repro just now on a Gnawty on 68.0.3440.70 beta. 
Will start investigating from here.
I found that this does not repro very consistently. It sometimes takes me quite a few attempts to get it to hang.

The hang does reproduce on 70.0.3501, 10908.0.0 as well.

When the hang happens, the UI logs keep printing the following two lines in a loop:
[1:17:0725/144221.325670:ERROR:thread_wrapper.cc(300)] NOTREACHED() hit.
[1:17:0725/144221.325937:ERROR:thread_wrapper.cc(291)] NOTREACHED() hit.
Cc: sergeyu@chromium.org
+ cc sergeyu@ as OWNER of jingle/glue/thread_wrapper.cc
I reproduced the hang using a custom build with dcheck_always_on = true. Here is the call stack for the NOTREACHED():

[1:17:0725/155254.969131:FATAL:thread_wrapper.cc(300)] Check failed: false. 
#0 0x5b3912b4415c base::debug::StackTrace::StackTrace()
#1 0x5b3912a9c4a0 logging::LogMessage::~LogMessage()
#2 0x5b3917551d39 jingle_glue::JingleThreadWrapper::Get()
#3 0x5b3913e01708 rtc::Thread::ProcessMessages()
#4 0x5b391751a58e webrtc::RTCStatsCollector::WaitForPendingRequest()
#5 0x5b3917504b30 webrtc::PeerConnection::Close()
#6 0x5b3917220c85 webrtc::internal::SynchronousMethodCall::OnMessage()
#7 0x5b39175511ba jingle_glue::JingleThreadWrapper::Dispatch()
#8 0x5b3917551b49 jingle_glue::JingleThreadWrapper::RunTask()
#9 0x5b390fa95428 _ZN4base8internal7InvokerINS0_9BindStateIMN3net14MDnsConnectionEFviEJNS_7WeakPtrIS4_EEiEEEFvvEE3RunEPNS0_13BindStateBaseE
#10 0x5b3912b66e1b base::debug::TaskAnnotator::RunTask()
#11 0x5b3912aa4d25 base::MessageLoop::RunTask()
#12 0x5b3912aa5137 base::MessageLoop::DeferOrRunPendingTask()
#13 0x5b3912aa53b8 base::MessageLoop::DoWork()
#14 0x5b3912aa77d6 base::MessagePumpDefault::Run()
#15 0x5b3912aa4691 base::MessageLoop::Run()
#16 0x5b3912acf606 base::RunLoop::Run()
#17 0x5b3912b150da base::Thread::Run()
#18 0x5b3912b1542e base::Thread::ThreadMain()
#19 0x5b3912b5613f base::(anonymous namespace)::ThreadFunc()
#20 0x7e13abff22b8 <unknown>
#21 0x7e13ab4a6fad clone
Cc: emir...@chromium.org
Labels: M-70
Owner: hbos@chromium.org
Based on the call stack from #22, I think it'd be more efficient for someone with more knowledge about webrtc::PeerConnection to take a look at this.

hbos@: Could you take a look or triage further?
Status: Started (was: Assigned)
The stack trace makes it look like it's related to stats, but then I don't see the relevance of the 4K camera and constraints, and the crash occurring inside thread_wrapper.cc is just confusing.

I'll take a look and see if I can repro, but I'm OOO starting tomorrow and most of August.

cc guidou@ because of constraints.
webrtc::RTCStatsCollector::WaitForPendingRequest(), which is invoked from Close(), says that if a stats request is currently pending, we process messages (tasks posted to the thread) in a loop until the pending stats request has finished. This is necessary because GetStats() uses resources that the PeerConnection is trying to destroy in Close(), so Close() must wait for any pending request to finish before proceeding to free those resources. This is done as a blocking "process messages in a loop" because Close() has to be a synchronous operation.
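A minimal sketch of that wait, pieced together from the description above and the counter name mentioned in comment #31 (simplified; not the exact Chromium source in rtcstatscollector.cc):

  // Sketch only: |num_pending_partial_reports_| is the pending-request
  // counter referenced in comment #31 below.
  void RTCStatsCollector::WaitForPendingRequest() {
    if (num_pending_partial_reports_) {
      // Block the signaling thread, pumping its message queue in a loop
      // so the tasks that complete the pending stats request can run.
      while (num_pending_partial_reports_) {
        rtc::Thread::Current()->ProcessMessages(/*cms=*/0);
      }
    }
  }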

This works on all other platforms, but specifically on Mac, rtc::Thread::ProcessMessages turns out to have a special implementation. That implementation leads to jingle_glue::JingleThreadWrapper::Get(), which is implemented as "NOTREACHED(); return false;".

This means that...
- In debug (or dcheck_always_on = true) builds, NOTREACHED() causes a crash.
- In release builds, Get() does nothing, so ProcessMessages() does nothing, and we never process the pending stats requests. We're stuck in an infinite loop, blocking the signaling thread, which in turn blocks the main thread trying to perform RTCPeerConnection.close().
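In other words, the stub behaves roughly like this (a sketch based on the "NOTREACHED(); return false;" body described above and the Get() signature from thread_wrapper.h):

  bool JingleThreadWrapper::Get(rtc::Message* message,
                                int delay_ms,
                                bool process_io) {
    NOTREACHED();  // Debug / dcheck_always_on builds crash here.
    return false;  // Release builds: no message is ever returned, so no
                   // task runs and the wait loop above spins forever.
  }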

There seem to be different thread implementations, because in third_party/webrtc I cannot reproduce this inside unittests. But in a chromium build this is reproducible 100% of the time if close() is invoked while the getStats() request is still pending; just click this codepen: https://codepen.io/anon/pen/jpZZGb?editors=0010.

As for the bug description, I have no idea what the relevance of the 4K camera is. It could increase the workload of the system, and perhaps increase the risk that a getStats() request is still pending when close() is called, but it sounds like that should be flaky. The codepen is written such that close() is performed while the request is pending, guaranteeing that this is reproduced 100% of the time.
I don't have access to a 4K camera, but assuming this is the same bug, the codepen is enough to repro. It's reproducible on latest chromium on Mac, but not on latest on Linux.
sergeyu@ why is Get() not supported?

  // The following methods are not supported. They are overridden just to
  // ensure that they are not called (each of them contains NOTREACHED
  // in the body). Some of these methods can be implemented if it
  // becomes necessary to use libjingle code that calls them.
  void Quit() override;
  bool IsQuitting() override;
  void Restart() override;
  bool Get(rtc::Message* message, int delay_ms, bool process_io) override;
  bool Peek(rtc::Message* message, int delay_ms) override;
  void PostAt(const rtc::Location& posted_from,
              uint32_t timestamp,
              rtc::MessageHandler* handler,
              uint32_t id,
              rtc::MessageData* data) override;
  void ReceiveSends() override;
  int GetDelay() override;

https://cs.chromium.org/chromium/src/jingle/glue/thread_wrapper.h?type=cs&sq=package:chromium&g=0&l=93

On Mac we end up at NOTREACHED(). I'm not sure why this doesn't happen on other platforms; it looks like both the Mac and non-Mac implementations of ProcessMessages perform Get(), and all platforms use the thread wrapper...
https://cs.chromium.org/chromium/src/content/renderer/media/webrtc/peer_connection_dependency_factory.cc?type=cs&sq=package:chromium&g=0&l=233
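For reference, rtc::Thread::ProcessMessages has roughly this shape (a simplified sketch; the real implementation in rtc_base/thread.cc also tracks elapsed time), which is why stubbing out Get() makes the whole call a no-op:

  bool Thread::ProcessMessages(int cms) {
    while (true) {
      Message msg;
      if (!Get(&msg, cms))     // Stubbed out by JingleThreadWrapper...
        return !IsQuitting();  // ...so we return without running tasks.
      Dispatch(&msg);
    }
  }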
Cc: hbos@chromium.org
Owner: sergeyu@chromium.org
sergeyu@ can you take a look while I'm OOO? I'm back on Monday but I'll only be in the office a few days before I'm OOO again.
Thanks hbos@ for getting the analysis done so quickly and the clear description of the issue in #26. 

I tried running the codepen on the gnawty Chromebook, and I can confirm that it triggers the issue as well. But it does not happen 100% of the time. Could it be that the getStats() request sometimes finishes before the close() is performed? Anyway, re-running the script a couple of times eventually triggers the issue.

Agreed that this has nothing to do with the 4K camera or resolutions specifically. They probably just created favorable conditions for the issue to happen.

Thank you for also looking into this.

Correction: I said that my codepen made it 100% reproducible, but this is not the case: because JavaScript and the signaling thread are two different threads, sometimes the getStats() request finishes before close().

However, if you comment out "if (num_pending_partial_reports_)" at https://cs.chromium.org/chromium/src/third_party/webrtc/pc/rtcstatscollector.cc?q=rtcstatscollector.cc&sq=package:chromium&dr&l=817 then you always end up processing messages, which should be harmless, but it crashes.
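For clarity, the modification looks roughly like this (same simplified sketch as before, not the exact source):

  void RTCStatsCollector::WaitForPendingRequest() {
    // if (num_pending_partial_reports_)   <-- commented out: messages
    // are now processed on every Close(), hitting the stubbed-out Get()
    // on every platform regardless of getStats() timing.
    rtc::Thread::Current()->ProcessMessages(/*cms=*/0);
  }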
Labels: OS-Linux
It looks like this was not reproducible on my Linux machine because it was too fast. With the modification from #31 this is reproducible on any platform, because they all use the libjingle wrapper.
Cc: hta@chromium.org
CC hta@ bug related to stats
Labels: -M-68 -M-70 Target-71 M-71 FoundIn-68
Cc: jansson@chromium.org
This issue reproduces on Swanky in 70.0.3538.16 / 11021.13.0.
Owner: hbos@chromium.org
Sorry about the delay.
I don't think it's feasible to implement Get() in JingleThreadWrapper: it is supposed to be blocking, which means it would also block any tasks scheduled on chromium message loops, and we don't want to do that.
The issue here is that we have a blocking WaitForPendingRequest() that has to be called before the StatsCollector is destroyed. We should avoid blocking one thread on an activity that happens on another thread. It should be possible to avoid blocking code in the PeerConnection destruction sequence. I don't see any reason the StatsCollector really needs to be ref-counted: it posts tasks to some threads, and it just needs to be able to cancel those tasks when it's destroyed. AFAICT there is an rtc::BaseCancelableTask class that should make it easy to create and post cancelable tasks.
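A hypothetical sketch of that idea (the names below are illustrative, not the rtc::BaseCancelableTask API, and it assumes a PostTask-style method on rtc::Thread): the cancellation flag is written in the destructor and read in the reply callback, both on the signaling thread, so teardown needs no blocking wait.

  class StatsCollector {
   public:
    ~StatsCollector() { *alive_ = false; }  // Cancels pending replies.

    void GetStatsAsync(rtc::Thread* network_thread) {
      rtc::Thread* signaling_thread = rtc::Thread::Current();
      std::shared_ptr<bool> alive = alive_;
      StatsCollector* self = this;
      network_thread->PostTask(RTC_FROM_HERE, [signaling_thread, alive, self] {
        // ... gather stats on the network thread ...
        signaling_thread->PostTask(RTC_FROM_HERE, [alive, self] {
          if (!*alive)
            return;              // Collector destroyed; drop the reply.
          self->OnStatsReady();  // Hypothetical completion handler.
        });
      });
    }

   private:
    void OnStatsReady() {}
    std::shared_ptr<bool> alive_ = std::make_shared<bool>(true);
  };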

Comment 39 by hbos@chromium.org

NextAction: 2019-01-23
I agree that "we should avoid blocking one thread on an activity that happens on another thread"; unfortunately, WebRTC is full of this blocking thread-invoke design pattern (https://cs.chromium.org/chromium/src/third_party/webrtc/rtc_base/thread.h?sq=package:chromium&dr&g=0&l=187). The WebRTC signaling thread is, on a regular basis, blocked on the worker thread and the network thread. In fact, the legacy getStats API is still a blocking call, while the "new" getStats API (this bug) lets its tasks run without blocking. But here close() needs to synchronously destroy the resources used by those tasks.
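For context, a single use of that pattern looks like this (a sketch of rtc::Thread::Invoke usage from the thread.h link above; SendPendingPackets() is a made-up example function):

  // Invoke blocks the calling thread until |worker_thread| has executed
  // the functor, then returns its result.
  int sent = worker_thread->Invoke<int>(
      RTC_FROM_HERE, [] { return SendPendingPackets(); });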

Something like rtc::BaseCancelableTask might be a step in the right direction, but the blocking design pattern is much more widespread than this particular issue.

I'll take a look when I have more time.
