Test page hangs and becomes unusable upon clicking the Hang up button
Issue description

Chrome Version: 68.0.3440.15
OS: Gnawty

What steps will reproduce the problem?
(1) Attach a camera (I used a Brio 4K Stream Edition; any other recent camera also works for this).
(2) Go to chrome://settings/content/camera/ and select the 4K camera.
(3) Go to https://webrtc.github.io/samples/src/content/peerconnection/constraints/
(4) Set the constraints similar to the picture attached to this ticket.
(5) Click the Get media button, then the Connect button.
(6) Wait 10-20 seconds and press the Hang up button.

What is the expected result?
The test page hangs up gracefully.

What happens instead?
The Chrome tab becomes unresponsive. We need to close the tab and open a new one, or wait until it recovers. This does not affect any other tabs.

Reproducible? 100%.

Does this reproduce with other cameras? Yes, I reproduced this issue with a Logitech HD Webcam C310.

Risk? This is possible for all cameras that support those resolutions.

Crash Report? https://listnr.corp.google.com/product/208/report/85488679481?dateRange=All
,
Jun 8 2018
Resolutions as in this picture.
,
Jun 8 2018
,
Jun 15 2018
Can we get a crash report instead of the CrOS feedback report? (or can I find it from the feedback?)
,
Jun 18 2018
Hi Niklase, here is the crash report ID. Note that the page hangs and most of the time no crash report is created. The following ID is the only crash report I have gotten so far: ee0733c392b5d8c3
,
Jun 19 2018
This issue is reproducible on Mac as well: 68.0.3440.25, MacBook Pro 15".
,
Jun 19 2018
Local Crash ID 38f78686-d93c-47c1-a33d-4999b906afa6 Crash report captured on Tuesday, June 19, 2018 at 11:44:41 AM (upload requested by user, not yet uploaded) [Mac report]
,
Jun 21 2018
,
Jun 21 2018
,
Jul 2
,
Jul 9
The crash report does not seem to contain any useful information or call stack. I will have to try to repro locally.
,
Jul 9
Tested with the repro steps from the report on a fizz board and Chrome 68.0.3440.15. Unable to repro. Will try on MacOS next, and maybe try to get a Gnawty as well. Note: The camera I used for testing is a Logitech BRIO (046d:085e).
,
Jul 9
Also no repro on a MacBook Air (Early 2015) using the same camera and Chrome 68.0.3440.42. How important is waiting 10-20 seconds in step 6? Does it not happen if the wait is shorter?
,
Jul 23
Hi, I was able to reproduce it every time within that 10-20 second window. If we hang up quickly, it is not reproducible. I can retry and update here soon. Thank you!
,
Jul 23
Tested but no repro on the following config:
Board: guado
Chrome Version: 68.0.3440.15
ChromeOS Version: 10718.13.0
Camera: Logitech BRIO (046d:085e)
,
Jul 24
This issue is reproducible on the Gnawty. I can see a delay in the video feed after pressing the Connect button. If we press Hang up after 10-20 seconds, the page is unresponsive. This could be device specific. I attached a few logs: https://pantheon.corp.google.com/storage/browser/chromiumos-test-logs/bugfiles/cr/850907/july25/ If you need any specific logs, please let me know.
,
Jul 24
,
Jul 24
,
Jul 24
I was able to repro just now on a Gnawty on 68.0.3440.70 beta. Will start investigating from here.
,
Jul 25
I found that this does not repro very consistently. It sometimes takes me quite a few attempts to get it to hang. The hang does reproduce on 70.0.3501, 10908.0.0 as well. When the hang happens, the UI logs keep printing out the following two lines in a loop:
[1:17:0725/144221.325670:ERROR:thread_wrapper.cc(300)] NOTREACHED() hit.
[1:17:0725/144221.325937:ERROR:thread_wrapper.cc(291)] NOTREACHED() hit.
,
Jul 25
+ cc sergeyu@ as OWNER of jingle/glue/thread_wrapper.cc
,
Jul 25
I replicated the same using a custom build with dcheck_always_on = true. Here is the call stack for the NOTREACHED():

[1:17:0725/155254.969131:FATAL:thread_wrapper.cc(300)] Check failed: false.
#0 0x5b3912b4415c base::debug::StackTrace::StackTrace()
#1 0x5b3912a9c4a0 logging::LogMessage::~LogMessage()
#2 0x5b3917551d39 jingle_glue::JingleThreadWrapper::Get()
#3 0x5b3913e01708 rtc::Thread::ProcessMessages()
#4 0x5b391751a58e webrtc::RTCStatsCollector::WaitForPendingRequest()
#5 0x5b3917504b30 webrtc::PeerConnection::Close()
#6 0x5b3917220c85 webrtc::internal::SynchronousMethodCall::OnMessage()
#7 0x5b39175511ba jingle_glue::JingleThreadWrapper::Dispatch()
#8 0x5b3917551b49 jingle_glue::JingleThreadWrapper::RunTask()
#9 0x5b390fa95428 _ZN4base8internal7InvokerINS0_9BindStateIMN3net14MDnsConnectionEFviEJNS_7WeakPtrIS4_EEiEEEFvvEE3RunEPNS0_13BindStateBaseE
#10 0x5b3912b66e1b base::debug::TaskAnnotator::RunTask()
#11 0x5b3912aa4d25 base::MessageLoop::RunTask()
#12 0x5b3912aa5137 base::MessageLoop::DeferOrRunPendingTask()
#13 0x5b3912aa53b8 base::MessageLoop::DoWork()
#14 0x5b3912aa77d6 base::MessagePumpDefault::Run()
#15 0x5b3912aa4691 base::MessageLoop::Run()
#16 0x5b3912acf606 base::RunLoop::Run()
#17 0x5b3912b150da base::Thread::Run()
#18 0x5b3912b1542e base::Thread::ThreadMain()
#19 0x5b3912b5613f base::(anonymous namespace)::ThreadFunc()
#20 0x7e13abff22b8 <unknown>
#21 0x7e13ab4a6fad clone
,
Jul 25
,
Jul 25
Based on the call stack from #22, I think it'd be more efficient for someone with more knowledge about webrtc::PeerConnection to take a look at this. hbos@: Could you take a look or triage further?
,
Jul 31
The stack trace makes it look like it's related to stats, but then I don't see the relevance of the 4k camera and constraints, and the crash occurring inside of thread_wrapper.cc is just confusing. I'll take a look and see if I can repro, but I'm OOO starting tomorrow and most of August. cc guidou@ because of constraints.
,
Jul 31
webrtc::RTCStatsCollector::WaitForPendingRequest(), which is invoked from Close(), says that if a stats request is currently pending we process messages (tasks posted to the thread) in a loop until the pending stats request has finished. This is necessary because GetStats() uses resources that the PeerConnection is trying to destroy in Close(), so Close() must wait for any pending request to finish before proceeding to free its resources. And this is done in a blocking "process messages in a loop" fashion because Close() has to be a synchronous operation.

This works on all other platforms, but specifically on Mac, rtc::Thread::ProcessMessages, it turns out, has a special implementation. This implementation leads to jingle_glue::JingleThreadWrapper::Get(), which is implemented as "NOTREACHED(); return false;". This means that...
- In debug (or dcheck_always_on = true) builds, NOTREACHED() causes a crash.
- In release builds, Get() does nothing, meaning ProcessMessages() does nothing, meaning we never process the pending stats requests. We're stuck in an infinite loop, blocking the signaling thread, which is blocking the main thread trying to perform RTCPeerConnection.close().

There seem to be different implementations of the threads, because in third_party/webrtc I cannot reproduce this inside of unit tests. But in a Chromium build this is reproducible 100% of the time if close() is invoked while the getStats() request is still pending; just click this codepen: https://codepen.io/anon/pen/jpZZGb?editors=0010.

As for the bug description, I have no idea what the relevance of the 4K camera is. It could increase the workload of the system, and perhaps increase the risk that a getStats() request is still pending when close() is called, but it sounds like that should be flaky. The codepen is written such that close() is performed while the request is pending, guaranteeing that this is reproduced 100% of the time.
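To make the failure mode concrete, here is a simplified sketch of the blocking wait described above. This is not the actual Chromium source: the class name and signaling_thread_ are illustrative, and only num_pending_partial_reports_ is taken from the real rtcstatscollector.cc.

// Simplified, illustrative sketch of the blocking wait in RTCStatsCollector.
#include "rtc_base/thread.h"  // rtc::Thread

class StatsCollectorSketch {
 public:
  // Called from PeerConnection::Close() on the signaling thread. Close() has
  // to be synchronous, so it spins the signaling thread's own message loop
  // until the outstanding getStats() request has delivered its report.
  void WaitForPendingRequest() {
    if (num_pending_partial_reports_ > 0) {
      while (num_pending_partial_reports_ > 0) {
        // In Chromium the signaling thread is a JingleThreadWrapper whose
        // Get() override is just "NOTREACHED(); return false;". So this call
        // crashes in DCHECK builds; in release builds it processes nothing,
        // the count never drops, and the loop never exits: the signaling
        // thread (and the renderer main thread waiting on close()) hangs.
        signaling_thread_->ProcessMessages(/*cmsLoop=*/0);
      }
    }
  }

 private:
  rtc::Thread* signaling_thread_ = nullptr;
  int num_pending_partial_reports_ = 0;
};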
,
Jul 31
I don't have access to a 4K camera, but assuming this is the same bug the codepen is enough to repro. It's reproducible on latest chromium on Mac, but not on latest on Linux.
,
Jul 31
sergeyu@ why is Get() not supported?
// Following methods are not supported.They are overriden just to
// ensure that they are not called (each of them contain NOTREACHED
// in the body). Some of this methods can be implemented if it
// becomes neccessary to use libjingle code that calls them.
void Quit() override;
bool IsQuitting() override;
void Restart() override;
bool Get(rtc::Message* message, int delay_ms, bool process_io) override;
bool Peek(rtc::Message* message, int delay_ms) override;
void PostAt(const rtc::Location& posted_from,
            uint32_t timestamp,
            rtc::MessageHandler* handler,
            uint32_t id,
            rtc::MessageData* data) override;
void ReceiveSends() override;
int GetDelay() override;
https://cs.chromium.org/chromium/src/jingle/glue/thread_wrapper.h?type=cs&sq=package:chromium&g=0&l=93
On Mac we end up at NOTREACHED(). I'm not sure why this doesn't happen on other platforms; it looks like both the Mac and non-Mac implementations of ProcessMessages perform Get(), and all platforms use the thread wrapper...
https://cs.chromium.org/chromium/src/content/renderer/media/webrtc/peer_connection_dependency_factory.cc?type=cs&sq=package:chromium&g=0&l=233
,
Jul 31
sergeyu@ can you take a look while I'm OOO? I'm back on Monday but I'll only be in the office a few days before I'm OOO again.
,
Jul 31
Thanks hbos@ for getting the analysis done so quickly and for the clear description of the issue in #26. I tried running the codepen on the Gnawty Chromebook, and I can confirm that it triggers the issue as well, but it does not happen 100% of the time. Could it be that the getStats() request sometimes finishes before the close() is performed? Anyway, when re-running the script a couple of times, it eventually triggers the issue. Agreed that this has nothing to do with the 4K camera or resolutions specifically; they probably just created favorable conditions for the issue to happen.
,
Jul 31
Thank you for also looking into this. Correction: I said that my codepen made it 100% reproducible, but this is not the case; because JavaScript and the signaling thread are two different threads, the getStats() request sometimes finishes before close(). However, if you comment out "if (num_pending_partial_reports_)" at https://cs.chromium.org/chromium/src/third_party/webrtc/pc/rtcstatscollector.cc?q=rtcstatscollector.cc&sq=package:chromium&dr&l=817 then you will always end up processing messages, which should be harmless, but it crashes.
,
Jul 31
It looks like this was not reproducible on my Linux machine because it was too fast. With the modification from #31 this is reproducible on any platform, because they all use the libjingle wrapper.
,
Jul 31
CC hta@ bug related to stats
,
Aug 7
,
Aug 30
,
Sep 18
,
Sep 20
This issue is reproduced on Swanky on 70.0.3538.16 / 11021.13.0.
,
Jan 15
Sorry about the delay. I don't think it's feasible to implement Get() in JingleThreadWrapper: it is supposed to be blocking, which means it would also block any tasks scheduled on Chromium message loops, and we don't want to do that.

The issue here is that we have a blocking WaitForPendingRequest() that has to be called before the StatsCollector is destroyed. We should avoid blocking one thread on an activity that happens on another thread. It should be possible to avoid blocking code in the PeerConnection destruction sequence. I don't see any reason the StatsCollector really needs to be ref-counted. It posts tasks to some threads, and it just needs to be able to cancel these tasks when it's destroyed. AFAICT there is an rtc::BaseCancelableTask class that should make it easy to create and post cancelable tasks.
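A minimal plain-C++ sketch of that idea follows. It deliberately does not use rtc::BaseCancelableTask itself (its exact API is not shown in this thread); all names below are illustrative. The point is only that posted work checks a shared flag before touching collector state, so destroying the collector sets the flag instead of blocking.

#include <atomic>
#include <functional>
#include <memory>

// Illustrative cancelable-task helper (not WebRTC API): wrapped work becomes
// a no-op once cancelled, so teardown never has to wait for it.
class CancelableTaskFactory {
 public:
  ~CancelableTaskFactory() { Cancel(); }

  // Wrap work so that, once cancelled, it does nothing instead of
  // dereferencing a destroyed stats collector.
  std::function<void()> Wrap(std::function<void()> work) {
    auto flag = cancelled_;
    return [flag, work = std::move(work)]() {
      if (!flag->load()) {
        work();
      }
    };
  }

  void Cancel() { cancelled_->store(true); }

 private:
  // shared_ptr so tasks still queued on the network/worker thread can
  // outlive the factory safely.
  std::shared_ptr<std::atomic<bool>> cancelled_ =
      std::make_shared<std::atomic<bool>>(false);
};

With something along these lines, the blocking WaitForPendingRequest() could in principle go away: Close() would cancel and return immediately, and any in-flight partial-report task would simply do nothing.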
,
Yesterday (25 hours ago)
I agree that "we should avoid blocking one thread on an activity that happens on another thread"; unfortunately, WebRTC is full of this blocking thread-Invoke design pattern (https://cs.chromium.org/chromium/src/third_party/webrtc/rtc_base/thread.h?sq=package:chromium&dr&g=0&l=187). The WebRTC signaling thread is, on a regular basis, blocked on the worker thread and the network thread. In fact, the legacy getStats API is still a blocking call, but the "new" getStats API (this bug) lets its tasks run non-blockingly. Here, however, close() needs to synchronously destroy the resources used by those tasks. Something like rtc::BaseCancelableTask might be a step in the right direction, but the blocking design pattern is much more widespread than this particular issue. I'll take a look when I have more time.
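For reference, a hedged sketch of the blocking thread-Invoke pattern referenced by that thread.h link; the signature is from memory and may differ between WebRTC revisions, but the shape is: run a functor on another thread and block the caller until it returns.

#include "rtc_base/location.h"  // RTC_FROM_HERE
#include "rtc_base/thread.h"    // rtc::Thread

// Sketch: the signaling thread synchronously tears down worker-thread state.
// Invoke() blocks the calling thread until the lambda has run on
// worker_thread, so anything slow or stuck on the worker thread stalls the
// signaling thread (and, transitively, a renderer waiting on close()).
void TearDownOnWorker(rtc::Thread* worker_thread) {
  worker_thread->Invoke<void>(RTC_FROM_HERE, []() {
    // Destroy resources that live on the worker thread.
  });
}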