Chrome crashes frequently during WebRTC sessions
Reported by gbrownew...@gmail.com, Aug 11 2017
Issue description

UserAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36

Example URL:

Steps to reproduce the problem:
1. WebRTC session VP8/Opus
2. Eventually Chrome crashes, sometimes 10 seconds in, sometimes 10 minutes. Seems to happen quicker if there is some packet loss.

What is the expected behavior?
Chrome doesn't crash.

What went wrong?
Some machines get these crashes and others never do. Those that crash will crash quite frequently. All crashes show similar entries at the end of the debug log:

[42688:5708:0811/102015.306:WARNING:audio_sync_reader.cc(202)] AudioSyncReader::Read timed out, audio glitch count=1
[42688:5708:0811/102015.326:WARNING:audio_sync_reader.cc(202)] AudioSyncReader::Read timed out, audio glitch count=2
[42688:5708:0811/102015.347:WARNING:audio_sync_reader.cc(202)] AudioSyncReader::Read timed out, audio glitch count=3
[42688:5708:0811/102015.367:WARNING:audio_sync_reader.cc(202)] AudioSyncReader::Read timed out, audio glitch count=4
[42688:5708:0811/102015.387:WARNING:audio_sync_reader.cc(202)] AudioSyncReader::Read timed out, audio glitch count=5
[42688:5708:0811/102015.407:WARNING:audio_sync_reader.cc(202)] AudioSyncReader::Read timed out, audio glitch count=6
[42688:5708:0811/102015.421:WARNING:audio_sync_reader.cc(202)] AudioSyncReader::Read timed out, audio glitch count=7
[42688:5708:0811/102015.421:WARNING:audio_sync_reader.cc(185)] ASR: No room in socket buffer.: The pipe is being closed. (0xE8)
[42688:5708:0811/102015.421:WARNING:audio_sync_reader.cc(202)] AudioSyncReader::Read timed out, audio glitch count=8
[42688:23408:0811/102015.429:INFO:user_input_monitor_win.cc(157)] RegisterRawInputDevices() failed for RIDEV_REMOVE: The parameter is incorrect. (0x57)

Did this work before? N/A
Is it a problem with Flash or HTML5? HTML5
Does this work in other browsers? No - Chrome 59 win and mac, Firefox 54 win and mac

Chrome version: 60.0.3112.90  Channel: stable
OS Version: 10.0
Flash Version:
Contents of chrome://gpu:
Aug 11 2017
Sir, please forgive my ignorance. What is a "test case"? I will provide anything I can. I can provide pcaps of the session, though the media will be SRTP encrypted so not likely to be helpful. I can also provide hex dumps of the entire sessions or media streams after decryption. I can provide anything you need.

For context, Chrome is one end of a session with our MCU. The video stream is libvpx encoded and the audio stream is libopus encoded. I am in the process of moving everything to the newest versions of both libraries. The current system displaying the problem is using versions of those 2 libraries dated March 12, 1017.

Can you confirm that the crash is likely related to the audio stream being sent to Chrome, as the logs seem to suggest? Opus is negotiated in the SDP with the offer from Chrome...

a=rtpmap:111 opus/48000/2
a=rtcp-fb:111 transport-cc
a=fmtp:111 minptime=10;useinbandfec=1

...and the answer...

a=rtpmap:111 opus/48000/2
a=fmtp:111 minptime=20;useinbandfec=1

Forcing PCMU seems to eliminate the problem, but PCMU is not an acceptable quality. I really must have Opus; nothing else is good enough.
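(For reference, one way to force PCMU, shown here only as a rough sketch and not necessarily how our MCU actually does it, is to answer with PCMU alone in the audio m-line, assuming payload type 0 was offered:

m=audio 9 UDP/TLS/RTP/SAVPF 0
a=rtpmap:0 PCMU/8000

With no other codec left in the answer, the session falls back to PCMU.)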
Aug 11 2017
2017, of course, not 1017 :)
Aug 14 2017
Can you provide crash IDs for the problematic sessions? You can find them in chrome://crashes
Aug 14 2017
These 2 are from 60.0.3112.90 running on a macbook air. Captured Friday but uploaded today.

98b46cc5-1917-4a6f-a5b9-859032090251
fb22325e-0316-4dea-92d6-c61e1d38a08a

I will gather the others quickly.
Aug 14 2017
Thank you for providing more feedback. Adding requester "guidou@chromium.org" to the cc list and removing "Needs-Feedback" label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Aug 14 2017
Those crash IDs look like local crash IDs. Please send the crash IDs after they are reported as uploaded in chrome://crashes
Aug 14 2017
The first two are the two I posted a few minutes ago. I see they get another ID after uploading, so I included them again with the uploaded crash ID.

MacBook Air 60.0.3112.90
4596ecd1da92f427 (Local Crash ID: 98b46cc5-1917-4a6f-a5b9-859032090251)
3ea212111b2a9844 (Local Crash ID: fb22325e-0316-4dea-92d6-c61e1d38a08a)

Win10 60.0.3112.90
b364dec74f5153db (de919c9c-47ae-403e-9c1e-d4e59b09a68c)
dd52025d0ca1c145 (0f932296-c5a3-4b61-a031-64bfd7afb3b1)
bac2e74df40bd5e9 (b74f1c66-5797-418f-bc5b-976d88246a9e)
e729ccc9d68f9c29 (1e2c7bb3-550c-463c-9f26-9b1eb5e1697e)
dff2273bf172d3c1 (985c91af-5810-4dc8-8fdb-2a8885b2d837)
7d8733e7fca9e774 (241064a2-9450-45d9-b53c-6bc600b1e5de)

I have several more on the way. Do you need just the uploaded crash ID, or both values?
Aug 14 2017
Thank you for providing more feedback. Adding requester "guidou@chromium.org" to the cc list and removing "Needs-Feedback" label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Aug 14 2017
gbrownewell@: we need only the uploaded crash IDs.
Aug 14 2017
Another from Mac 60.0.3112.90: 0f13de102ada78ae
Aug 15 2017
Stefan, can you take a look at this? All the crashes reported by gbrownewell@ point to something related to NTP/time issues, with webrtc::RemoteNtpTimeEstimator::Estimate. Since the problems apparently started with Chrome 60, the most suspect CL is https://codereview.webrtc.org/2963133003/
Aug 15 2017
In the comment above I meant "with webrtc::RemoteNtpTimeEstimator::Estimate appearing frequently in the reports".
Aug 16 2017
I've been recreating this as often as possible, trying to find a pattern. While it is certainly inconsistent, the crash seems to happen much more often right after the user begins talking after a period of silence. Also, I can upload more crash reports and IDs if that is useful. Should I keep generating them and provide the IDs here?
Aug 16 2017
I think new reports will look similar to the existing ones, and the ones you have already uploaded point to a specific part of the code. Maybe it's better to wait to see if holmer@ needs more feedback from you.
Aug 17 2017
Guidou, what makes you think it started with M60? I seem to be able to find crashes at least as early as M58. I don't find many examples though, not in M60 or earlier.

gbrownewell, you seem to be able to consistently reproduce this. Could you share your repro steps? It may be possible to record an unencrypted pcap if you start Chrome with the flag --disable-webrtc-encryption. We might be able to use that to reproduce the problem, but it could also be a race condition causing this.
Aug 17 2017
Note that --disable-webrtc-encryption only works on Chrome Canary and Dev channels.
Aug 17 2017
holmer@: what makes me think it started with M60 is that I understood gbrownewell to mean that it is easy to reproduce the crash on 60 but not on 59 (or Firefox). gbrownewell@, can you confirm this?
Aug 17 2017
Sirs, I am being told that it has been happening since version 58, though I do not have access to any of the crash information for those incidents. Also, we have seen Firefox crash rarely, but I can't confirm it's related.

I can reproduce it frequently because there are a couple of machines here that are prone to it. The vast majority never experience the issue. One of them is a new MacBook Pro and the other is a Win10 laptop. Strangely, the Win10 machine is a model that we have dozens of, and the others don't seem affected to the same degree, with the users reporting that they may have seen "Aw, Snap" once or twice in the last 6 months.

I will work on getting that pcap.
Aug 17 2017
gbrownewell@: In order to rule out that crashes are more frequent in M60 and rule out https://codereview.webrtc.org/2963133003/ as the culprit, can you reproduce the crashes in M59 on the problematic machines as easily as with M60?
Aug 17 2017
gbrownewell@: we have found several other similar crash reports from 59 and 58, so no need for you to try to reproduce with older versions. Marking this bug as Untriaged again while we continue to try to find the cause. Also, changing priority to 2 since it is not a regression in 60, but at least as old as 58.
Aug 18 2017
holmer@ will continue the investigation since he is more familiar with that part of the code.
Aug 18 2017
"AW.Snap!" will pop-up when received a MMS during play a streaming video. Android platform.
Aug 22 2017
Getting an unencrypted packet capture is proving quite challenging, primarily because I do not have full-time access to the problematic machines. I expect to have one of them in my possession soon. Any good news to report?
Aug 23 2017
This issue shows the OS as "Windows", but I can assure you it is happening on Mac; in fact, it may be more prevalent on Mac. I have 1 Windows machine that is problematic, but now that I have the whole company on the lookout for these, it seems to be worse on Mac.

Here are 3 crashes from today, minutes apart, from one of the problematic Macs...

0241b8c77acd933d
c98a897875074bb2
77a3014ba17541a5
Aug 23 2017
Another set of crashes from the same session on a different Mac. I'm being told that one of the participants had over 5% packet loss on the audio and video streams.

30ffb499165a9154
9953924fc5e37d50
14a1a327cd0298ff

Don't know if that's significant, but these two Macs crashed 6 times in about 1 minute during that period of high packet loss.
Aug 24 2017
Hi, to debug this, we'd need to be able to reproduce. Do you think we could have one or all of a packet dump, rtc_event_log, or a client test account so we could receive streams from the MCU?
Aug 24 2017
I can set up a test account this morning. It will not take long. Once complete, how can I provide you the details to access it? I see this issue is now restricted, is it safe to post the information here?
Aug 24 2017
Please send account details via mail to holmer@google.com. This issue may be opened some time after the bugs are fixed.

In the meantime, we've found that one of the recorded crashes included a stack trace clearly identifying an infinite recursion. Fix at https://codereview.webrtc.org/3004553002/. The bug is triggered if UlpFEC is used to protect media packets including the RED encapsulation (when Chrome generates FEC packets, FEC is applied to the media packets as they appear *before* RED encapsulation).
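To illustrate the shape of the fix, here is a minimal C++ sketch with hypothetical names (RecoveredPacket, FecReceiverSketch, and so on are not the real WebRTC classes, and this is not the actual ulpfec_receiver_impl.cc code): the idea is to mark a recovered packet as already returned before handing it back to the packet pipeline, so a recovered packet that itself carries the RED payload type cannot be routed back into FEC recovery indefinitely.

#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

struct RecoveredPacket {
  bool returned = false;      // Already handed back to the packet pipeline?
  std::vector<uint8_t> data;  // Recovered payload (possibly RED-encapsulated).
};

class FecReceiverSketch {
 public:
  using Callback = std::function<void(FecReceiverSketch&, RecoveredPacket&)>;
  explicit FecReceiverSketch(Callback on_recovered)
      : on_recovered_(std::move(on_recovered)) {}

  void ProcessRecovered(std::vector<RecoveredPacket>& packets) {
    for (RecoveredPacket& p : packets) {
      if (p.returned)
        continue;  // Already delivered, so any re-entry terminates here.
      // The essence of the fix: set the flag *before* running the callback.
      // If the callback re-enters this receiver (for example because the
      // recovered payload carries a RED header and is routed back into FEC
      // handling), it finds the flag already set and stops recursing.
      p.returned = true;
      on_recovered_(*this, p);
    }
  }

 private:
  Callback on_recovered_;
};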
Aug 25 2017
The following revision refers to this bug:
https://chromium.googlesource.com/external/webrtc.git/+/41476e014c8364adc15b90238d54a8aef91d7f56

commit 41476e014c8364adc15b90238d54a8aef91d7f56
Author: nisse <nisse@webrtc.org>
Date: Fri Aug 25 16:08:44 2017

When Ulpfec recovers a packet, set |returned| flag earlier.
This avoids infinite recursion in case the recovered packet carries a RED header.

BUG= chromium:754748
Review-Url: https://codereview.webrtc.org/3004553002
Cr-Commit-Position: refs/heads/master@{#19525}

[modify] https://crrev.com/41476e014c8364adc15b90238d54a8aef91d7f56/webrtc/modules/rtp_rtcp/source/ulpfec_receiver_impl.cc
Aug 28 2017
I was trying to write a test case for FEC on top of RED when I realized that maybe that's not possible. Consider the receive side. When we get a media packet, we want to pass it to the FEC machinery because it may be part of a FEC block. But then we'd need to know whether the FEC machinery should see the RED packet or the decapsulated media packet, because if we don't do it in exactly the same way as on the sending side, reconstructed packets will get garbled.

Looking at the spec, RFC 5109, section 10.3 (example) and section 14.2 (more normative, but unclear if it really describes our case) seem to be the ones that provide some clues on how FEC is supposed to work. The latter says: "The FEC MUST protect only the main codec, with the payload of FEC engine coming from virtual RTP packets created from the main codec data." I don't find any definition of "virtual RTP packet", but my best guess is that it means media packets without the RED encapsulation. Do you agree? Then fixing the processing in your MCU to do FEC before RED is essential to get FEC to work correctly.

When my fix from Friday gets into Chrome Canary (hopefully by tomorrow), it would be interesting to try to use it to connect to the version of your MCU used when the problem was discovered. From my current understanding, I would expect Chrome to not crash, but to display somewhat garbled media because the packets supposedly recovered via FEC are corrupt.

On the Chrome side, we obviously need to not crash on any network input. We should probably have handling of recovered packets bypass RED decapsulation, so that any recovered packet which happens to carry the RED payload type is treated like any other packet with an unknown payload type, and never sent back into the FEC machinery.
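To make the ordering concrete, here is a rough sender-side sketch in C++. All helper names and the FEC/RED handling are simplified stand-ins rather than real WebRTC or MCU code, and the sketch only assumes the "virtual RTP packet" reading above: compute FEC over the plain media packets first, then apply RED encapsulation to everything that goes on the wire.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Packet = std::vector<uint8_t>;

// Grossly simplified stand-in for ULP FEC: XOR the media packets together.
// (Real ULP FEC adds its own FEC header and level protection; omitted here.)
Packet BuildFec(const std::vector<Packet>& media_packets) {
  std::size_t max_len = 0;
  for (const Packet& p : media_packets)
    max_len = std::max(max_len, p.size());
  Packet fec(max_len, 0);
  for (const Packet& p : media_packets)
    for (std::size_t i = 0; i < p.size(); ++i)
      fec[i] ^= p[i];
  return fec;
}

// Simplified RED (RFC 2198) primary-block header: a single octet with the
// F bit cleared, followed by the original payload type of the payload.
Packet RedEncapsulate(const Packet& payload, uint8_t original_payload_type) {
  Packet red;
  red.push_back(original_payload_type & 0x7F);
  red.insert(red.end(), payload.begin(), payload.end());
  return red;
}

// Order matters: FEC is built over the plain media packets, and RED
// encapsulation is applied afterwards to both media and FEC packets.
std::vector<Packet> SendBlock(const std::vector<Packet>& media_packets,
                              uint8_t media_pt, uint8_t fec_pt) {
  std::vector<Packet> on_wire;
  Packet fec = BuildFec(media_packets);
  for (const Packet& m : media_packets)
    on_wire.push_back(RedEncapsulate(m, media_pt));
  on_wire.push_back(RedEncapsulate(fec, fec_pt));
  return on_wire;
}

The receive side would then have to strip RED before feeding packets into FEC recovery, mirroring the same ordering.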
Aug 28 2017
I agree our MCU needed to change. Now that I have more clarity on how it should have been done, it makes perfect sense. I think your understanding of RFC 5109 is correct. I have completed the work, but I am still able to test the previous MCU build to confirm that the upcoming Chrome Canary handles the stream without crashing. I understand that this testing is only to confirm that Chrome does not crash given this flawed input.
Aug 29 2017
The latest Chrome Canary has the fix. Can you help verify it?
Aug 29 2017
I will attempt to verify later tonight. I'll report my findings here.
Aug 29 2017
I was able to devote just over an hour to the testing. No crashes! The video becomes badly damaged but is eventually repaired, though it often took quite a while for the browser to request a key frame. I understand that behavior is expected. I created the condition that would previously have crashed many times over, without a single crash.
Aug 31 2017
Thanks a lot for the testing!
Comment 1 by manoranj...@chromium.org, Aug 11 2017. Labels: Needs-Triage-M60