5+ second browser process hang in UDPSocketWin::Close |
||||||||
Issue descriptionWhile using Chrome for Windows on a terrible hotel WiFi connection I found that Chrome was repeatedly hanging for 5+ seconds at a time, both on stable and canary. I managed to capture an ETW trace that recorded two hangs - one 3.4 seconds long and the other 5.5 seconds long. During these hangs the browser process and thus the entire Chrome window were entirely unresponsive. The ETW trace (2016-11-19_12-37-18 chrome hangs on WiFi.etl) is available on request but I have included my analysis below and that may be sufficient. The summary is that in both hangs the main thread of the browser process (PID 6052, TID 6048) was blocked waiting on thread 6024. Thread 6024 was blocked on one or more calls to chrome.dll!net::UDPSocketWin::Close, which calls ws2_32.dll!closesocket. Thread 6024 was eventually readied by the system process, at which point the hang ended. The documentation for closesocket is deliciously complex in its explanations of whether and when closesocket might not return immediately. It clearly depends on the linger settings and I don't know what Chrome does with those settings. However, experimentally, it is clear that whether or not closesocket *should* block it definitely sometimes *does* block. A closer look at the 5.5 s hang is shown here: Stable browser process (6052, thread 6048) hung for 5.516 s from 20.71 to 26.22 s. Thread 6048 was waiting for 5.515 s on this stack: chrome.dll!base::debug::TaskAnnotator::RunTask chrome.dll!base::internal::Invoker<base::internal::BindState<void (__thiscall base::win::ObjectWatcher::*)(base::win::ObjectWatcher::Delegate *),base::WeakPtr<base::win::ObjectWatcher>,base::win::ObjectWatcher::Delegate *>,void __cdecl(void)>::Run chrome.dll!cc::ContextCacheController::OnIdle chrome.dll!gpu::gles2::GLES2Implementation::SetAggressivelyFreeResources chrome.dll!gpu::gles2::GLES2Implementation::Flush chrome.dll!gpu::gles2::GLES2Implementation::FreeEverything chrome.dll!gpu::gles2::GLES2Implementation::WaitForCmd chrome.dll!gpu::CommandBufferHelper::Finish chrome.dll!gpu::CommandBufferHelper::WaitForGetOffsetInRange chrome.dll!gpu::CommandBufferProxyImpl::WaitForGetOffsetInRange chrome.dll!gpu::CommandBufferProxyImpl::Send chrome.dll!gpu::GpuChannelHost::Send chrome.dll!IPC::SyncChannel::Send chrome.dll!IPC::SyncChannel::WaitForReply chrome.dll!mojo::SyncHandleRegistry::WatchAllHandles chrome.dll!MojoWaitImpl chrome.dll!mojo::edk::Core::Wait chrome.dll!mojo::edk::Core::WaitManyInternal chrome.dll!mojo::edk::Waiter::Wait chrome.dll!base::ConditionVariable::Wait It was readied (AKA woken up, unlocked) by thread 6024 in the same process on this stack: chrome.dll!base::debug::TaskAnnotator::RunTask chrome.dll!base::internal::Invoker<base::internal::BindState<void (__thiscall policy::ExternalPolicyDataFetcherBackend::*)(policy::ExternalPolicyDataFetcher::Job *),base::WeakPtr<policy::ExternalPolicyDataFetcherBackend>,policy::ExternalPolicyDataFetcher::Job *>,void __cdecl(void)>::Run chrome.dll!mojo::Watcher::OnHandleReady chrome.dll!base::internal::Invoker<base::internal::BindState<void (__thiscall content::AccessibilityUI::*)(base::ListValue const *),base::internal::UnretainedWrapper<content::AccessibilityUI> >,void __cdecl(base::ListValue const *)>::Run chrome.dll!mojo::Connector::OnHandleReadyInternal chrome.dll!mojo::Connector::ReadSingleMessage chrome.dll!mojo::FilterChain::Accept chrome.dll!IPC::`anonymous namespace'::ChannelAssociatedGroupController::Accept chrome.dll!mojo::FilterChain::Accept chrome.dll!mojo::InterfaceEndpointClient::HandleValidatedMessage chrome.dll!IPC::mojom::ChannelStub::Accept chrome.dll!IPC::internal::MessagePipeReader::Receive chrome.dll!IPC::ChannelMojo::OnMessageReceived chrome.dll!IPC::SyncChannel::SyncContext::OnMessageReceived chrome.dll!IPC::SyncChannel::SyncContext::TryToUnblockListener chrome.dll!IPC::MojoEvent::Signal chrome.dll!MojoWriteMessageImpl chrome.dll!mojo::edk::Core::WriteMessageNew chrome.dll!mojo::edk::MessagePipeDispatcher::WriteMessage chrome.dll!mojo::edk::NodeController::SendMessage chrome.dll!mojo::edk::ports::Node::SendMessage chrome.dll!mojo::edk::ports::Node::SendMessageInternal chrome.dll!mojo::edk::ports::Node::AcceptMessage chrome.dll!mojo::edk::ports::Node::OnUserMessage chrome.dll!mojo::edk::NodeController::PortStatusChanged chrome.dll!mojo::edk::MessagePipeDispatcher::PortObserverThunk::OnPortStatusChanged chrome.dll!mojo::edk::AwakableList::AwakeForStateChange chrome.dll!mojo::edk::WaitSetDispatcher::Waiter::Awake chrome.dll!mojo::edk::WaitSetDispatcher::WakeDispatcher chrome.dll!mojo::edk::AwakableList::AwakeForStateChange chrome.dll!mojo::edk::Waiter::Awake ntdll.dll!RtlReleaseSRWLockExclusive Thread 6024 was blocked on this stack for 6.706 s: chrome.dll!base::debug::TaskAnnotator::RunTask chrome.dll!base::internal::Invoker<base::internal::BindState<void (__thiscall policy::URLBlacklistManager::*)(void),base::WeakPtr<policy::URLBlacklistManager> >,void __cdecl(void)>::Run chrome.dll!net::`anonymous namespace'::QuicChromeAlarm::OnAlarm chrome.dll!net::QuicConnection::CheckForTimeout chrome.dll!net::QuicConnection::CloseConnection chrome.dll!net::QuicConnection::TearDownLocalConnectionState chrome.dll!net::QuicChromiumClientSession::OnConnectionClosed chrome.dll!net::UDPSocketWin::Close ws2_32.dll!closesocket It was readied by the System process by three context switches totalling 5.235 s of waiting on this call stack: ndis.sys!<PDB not found> ndis.sys!<PDB not found> Netwsw00.sys!oscHandleInterrupt Netwsw00.sys!isrHandlerRoutineInta Netwsw00.sys!rfdQueueProcessFragments Netwsw00.sys!prvRfdQueueDispatch Netwsw00.sys!rxDataCbReceiveComplete Netwsw00.sys!rxDataProcessMPDUFrame Netwsw00.sys!rxDataDispatchMSDU Netwsw00.sys!rxDataProcessMSDUData Netwsw00.sys!rxDataSyncStatistics ndis.sys!<PDB not found> ndis.sys!<PDB not found> ndis.sys!<PDB not found> ndis.sys!<PDB not found> tcpip.sys!<PDB not found> ntoskrnl.exe!KeExpandKernelStackAndCalloutInternal tcpip.sys!<PDB not found> tcpip.sys!<PDB not found> tcpip.sys!<PDB not found> tcpip.sys!<PDB not found> tcpip.sys!<PDB not found> tcpip.sys!<PDB not found> tcpip.sys!<PDB not found> tcpip.sys!<PDB not found> tcpip.sys!<PDB not found> tcpip.sys!<PDB not found> tcpip.sys!<PDB not found> tdx.sys!TdxConnectConnectionTlRequestComplete ntoskrnl.exe!IopfCompleteRequest afd.sys!<PDB not found> ntoskrnl.exe!KeInsertQueueApc and two context switches totalling 1.471 s of waiting on this call stack: ntoskrnl.exe!PspSystemThreadStartup ntoskrnl.exe!ExpWorkerThread ntoskrnl.exe!IopProcessWorkItem NETIO.SYS!NetiopIoWorkItemRoutine tcpip.sys!<PDB not found> afd.sys!<PDB not found> ntoskrnl.exe!KeSetEvent
,
Dec 16 2016
QUIC-related, so adding QUIC label. Thread 6024 was blocked on this stack for 6.706 s: chrome.dll!base::debug::TaskAnnotator::RunTask chrome.dll!base::internal::Invoker<base::internal::BindState<void (__thiscall policy::URLBlacklistManager::*)(void),base::WeakPtr<policy::URLBlacklistManager> >,void __cdecl(void)>::Run chrome.dll!net::`anonymous namespace'::QuicChromeAlarm::OnAlarm chrome.dll!net::QuicConnection::CheckForTimeout chrome.dll!net::QuicConnection::CloseConnection chrome.dll!net::QuicConnection::TearDownLocalConnectionState chrome.dll!net::QuicChromiumClientSession::OnConnectionClosed chrome.dll!net::UDPSocketWin::Close ws2_32.dll!closesocket
,
Feb 15 2017
Ryan could you take a look? We have had issues with UDP socket closing being janky in the past, understanding this would be good.
,
Feb 16 2017
Assigning to jri to get more information about closing UDP sockets on Windows.
,
Feb 21 2017
While using a different laptop in a different hotel on a different continent I hit what I believe was the same problem again. I was unable to get an ETW trace so I can't be certain, but the symptoms seemed identical. As a first step it might be worthwhile adding UMA stats to monitor how long closesocket() takes. My guess is that we will see a bimodal distribution with peaks around zero seconds and five seconds, with the vast majority at zero seconds, but with enough samples at five seconds to make mitigations important.
,
Feb 21 2017
Sounds like a good idea. Jana, can you land a histogram to track this?
,
Mar 1 2017
Yes, histogram SG. I'll get this in this week.
,
Mar 14 2017
I was just about to get a histogram in, and I found one already there. Clearly this isn't the first time someone's been looking at this problem. See Net.UDPSocketWinClose. The data is interesting: (i) 0.01% of socket closes take > ~1.5 seconds per day, (ii) 2% of users get at least one socket close > ~1 second per day, and (iii) 1.2% of users get at least one socket close > ~5 seconds per day. This is quite bad. I'll dig in.
,
Mar 14 2017
(I'm on merge duty this week, so I'll dig as I find time.)
,
Mar 14 2017
Let me know if you want the original ETW trace - I think I still have it.
,
Mar 14 2017
My Lenovo T460S Windows 10 Insider machine which was unusably hangy with Chrome had the following in the Windows Event Log: "Closing a UDP socket with local port number 52951 in process 10288 is taking longer than expected. The local port number may not be available until the close operation is completed. This happens typically due to misbehaving network drivers. Ensure latest updates are installed for Windows and any third-party networking software including NIC drivers, firewalls, or other security products." After updating the Intel Wifi drivers to the latest and installing the 15048 Insider build, all of my hangs seem to have gone away.
,
Mar 14 2017
brucedawson: I reached out to an MS networking dev who asks: - What's the OS version - What's the version of the Netwsw00.sys driver on your system? (I don't know how to look for this version, so I'll find out.)
,
Mar 14 2017
Looking at the ETW trace I see: - OS version is Windows 10 Pro, 14393.447.amd64fre.rs1_release_inmarket.161102-0100 - NETwsw00.SYS version is 15.4.1.1, built in February of 2013 Windows explorer shows the version number of c:\Windows\System32\Drivers\NETwsw00.SYS correctly, but it doesn't show the build date - I used dumpbin /headers for that. Note that this happened on two different Windows 10 laptops. I don't have version information for the other one handy but I can get it if necessary. If Intel has a newer driver that avoids network hangs then can we get the update on Windows Update so that our customers actually receive it? Fixing this for my laptop would be nice, to be sure, but I'd actually rather not manually fix it because it is better to see what our customers are dealing with.
,
Mar 14 2017
Thanks for the additional info, brucedawson and elawrence. Definitely helpful to know that the hang may be due to a driver bug, but it still seems odd that the app needs to hang around. Chrome ought to be able to close the socket and leave without waiting for the driver to do something right. brucedawson: Yes, please! I've never read an ETW trace, but my MS colleague may find it helpful.
,
Mar 14 2017
The ETW trace is too big to attach but it is available here: https://drive.google.com/file/d/0B9b7VYaDqPZbSmlsRFhtRkthU2M/view?usp=sharing The analysis in the original filing of the bug should be detailed enough to let any ETW expert replicate my findings. According to https://downloadcenter.intel.com/download/25769/Intel-Network-Adapter-Driver-for-Windows-10 there have been eight Intel driver updates in the last few years, culminating in version 22.0.1. I installed Windows 10 on my laptop around August 2015, at which point version 20.1 was the most recent version. So, it seems rather tragic that I am running a version from 2013 that is at least twelve versions behind and was already at least five versions behind when I installed Windows 10.
,
Mar 14 2017
Great, thanks for sending the trace and the info. I'm glad that you have the old driver - it gives us something to work with! I'd like to understand why Chrome waits on the closesocket.
,
Mar 14 2017
Note that on both laptops I cannot reproduce this bug on demand. It requires certain types of finicky hotel WiFi, and even in these problematic hotels it is not consistent. And neither hotel is located in the country where I live :-(
,
Mar 22 2017
Is this related to crbug.com/661306 (on Chrome OS with evidence of other platforms being affected)? The symptoms are somewhat similar, with the UI thread blocking because the IO thread is busy, likely because of networking things (given the IO thread stacks in the traces).
,
Mar 22 2017
I don't think it is related. I think it's best to treat this as a Windows specific issue where closing sockets sometimes takes five seconds. It may simply be a matter of buggy (Intel in my cases) drivers that sometimes hit a five second timeout. I don't know whether newer drivers fix the problem because I can't reproduce the problem reliably, and because Intel does not support upgrading my drivers on Windows 10 (grrr...).
,
Apr 10 2017
,
Jul 28 2017
,
Jul 31 2017
> Is this related to crbug.com/661306 (on Chrome OS with evidence of other platforms being affected)? No. The ETW traces clearly show that closesocket() takes five seconds to return. The clear clustering around five seconds is strongly indicative of a timeout of some sort, rather than an overloaded UI thread. Also, the ETW trace makes it clear that the problem is strictly related to the thread that is running closesocket(). That is, the trace(s) show that Chrome calls closesocket() and then that thread goes idle for ~5 s. It is then woken by the system process at which point it is able to continue executing. At no point does the UI thread have a chance to intervene or affect this. Yes, I responded back in March but I'd typed most of this reply before I noticed this and I wanted to clarify a couple of things anyway. Note that bug 739327 changes the analysis somewhat by showing that the bug doesn't just occur with Intel drivers and can in fact happen on a wired network. But, it still maintains the same ~5 second timeout. This could mean that this is actually a Windows OS bug/feature, or it could mean that the 5 second timeout originated in some sample driver code which has now been copied by many manufacturers.
,
Jul 31 2017
Correction - it is crbug.com/749946 that shows that this bug can happen on a wired network as well.
,
Aug 3 2017
TO be clear, there is an Intel WiFi driver bug that causes this too, though now I'm not sure how much of what we see in Chrome is attributable to that bug given that you've seen it on a wired network as well. Maybe it's an Intel NIC on the wired network with the same code in the driver?
,
Aug 3 2017
,
Aug 16 2017
Had another hang managed to create an ETW trace whilst it was happening https://drive.google.com/open?id=0B5pqsyyhdETjVEtGVFMwQ0RrbTA It resolved when I toggled the wifi adapter ~70s in. Not sure if this helps, or if you need anything else.
,
Aug 19 2017
Another trace doesn't hurt (I'll take a look if I get a chance) but I doubt it will change the conclusion. Make sure your driver is up-to-date, and complain to the driver developer, pointing them to the trace.
,
Aug 19 2017
I just heard back today from Microsoft that a fix is in the Windows 10 preview builds and will be in the Fall Creators Update. Hopefully that will resolve the issue.
,
Aug 30 2017
Issue 749758 has been merged into this issue.
,
Jul 16
Crash analysis has not encountered any reports for this bug for the past 90 days. We have added the label 'crash-BugNoRepro' Crash analysis will be automatically closing the bug in 10 days. If you do not want Crash analysis to automatically close the bug, please remove the label 'crash-BugNoRepro'. If you have any feedback on this feature, please contact pranavk@
,
Aug 2
Crash analysis has not encountered any reports for this bug for the past 90 days. Hence as per the comment 16 days ago, we are closing the bug and setting the status to WontFix. If you have any feedback on this feature, please contact pranavk@ |
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by tyoshino@chromium.org
, Dec 16 2016