New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 667820 link

Starred by 12 users

Issue metadata

Status: WontFix
Owner:
Last visit > 30 days ago
Closed: Aug 2
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug



Sign in to add a comment

5+ second browser process hang in UDPSocketWin::Close

Project Member Reported by brucedaw...@chromium.org, Nov 22 2016

Issue description

While using Chrome for Windows on a terrible hotel WiFi connection I found that Chrome was repeatedly hanging for 5+ seconds at a time, both on stable and canary. I managed to capture an ETW trace that recorded two hangs - one 3.4 seconds long and the other 5.5 seconds long. During these hangs the browser process and thus the entire Chrome window were entirely unresponsive.

The ETW trace (2016-11-19_12-37-18 chrome hangs on WiFi.etl) is available on request but I have included my analysis below and that may be sufficient. The summary is that in both hangs the main thread of the browser process (PID 6052, TID 6048) was blocked waiting on thread 6024. Thread 6024 was blocked on one or more calls to chrome.dll!net::UDPSocketWin::Close, which calls ws2_32.dll!closesocket. Thread 6024 was eventually readied by the system process, at which point the hang ended.

The documentation for closesocket is deliciously complex in its explanations of whether and when closesocket might not return immediately. It clearly depends on the linger settings and I don't know what Chrome does with those settings. However, experimentally, it is clear that whether or not closesocket *should* block it definitely sometimes *does* block.

A closer look at the 5.5 s hang is shown here:

Stable browser process (6052, thread 6048) hung for 5.516 s from 20.71 to 26.22 s. Thread 6048 was waiting for 5.515 s on this stack:
  chrome.dll!base::debug::TaskAnnotator::RunTask
  chrome.dll!base::internal::Invoker<base::internal::BindState<void (__thiscall base::win::ObjectWatcher::*)(base::win::ObjectWatcher::Delegate *),base::WeakPtr<base::win::ObjectWatcher>,base::win::ObjectWatcher::Delegate *>,void __cdecl(void)>::Run
  chrome.dll!cc::ContextCacheController::OnIdle
  chrome.dll!gpu::gles2::GLES2Implementation::SetAggressivelyFreeResources
  chrome.dll!gpu::gles2::GLES2Implementation::Flush
  chrome.dll!gpu::gles2::GLES2Implementation::FreeEverything
  chrome.dll!gpu::gles2::GLES2Implementation::WaitForCmd
  chrome.dll!gpu::CommandBufferHelper::Finish
  chrome.dll!gpu::CommandBufferHelper::WaitForGetOffsetInRange
  chrome.dll!gpu::CommandBufferProxyImpl::WaitForGetOffsetInRange
  chrome.dll!gpu::CommandBufferProxyImpl::Send
  chrome.dll!gpu::GpuChannelHost::Send
  chrome.dll!IPC::SyncChannel::Send
  chrome.dll!IPC::SyncChannel::WaitForReply
  chrome.dll!mojo::SyncHandleRegistry::WatchAllHandles
  chrome.dll!MojoWaitImpl
  chrome.dll!mojo::edk::Core::Wait
  chrome.dll!mojo::edk::Core::WaitManyInternal
  chrome.dll!mojo::edk::Waiter::Wait
  chrome.dll!base::ConditionVariable::Wait

It was readied (AKA woken up, unlocked) by thread 6024 in the same process on this stack:
  chrome.dll!base::debug::TaskAnnotator::RunTask
  chrome.dll!base::internal::Invoker<base::internal::BindState<void (__thiscall policy::ExternalPolicyDataFetcherBackend::*)(policy::ExternalPolicyDataFetcher::Job *),base::WeakPtr<policy::ExternalPolicyDataFetcherBackend>,policy::ExternalPolicyDataFetcher::Job *>,void __cdecl(void)>::Run
  chrome.dll!mojo::Watcher::OnHandleReady
  chrome.dll!base::internal::Invoker<base::internal::BindState<void (__thiscall content::AccessibilityUI::*)(base::ListValue const *),base::internal::UnretainedWrapper<content::AccessibilityUI> >,void __cdecl(base::ListValue const *)>::Run
  chrome.dll!mojo::Connector::OnHandleReadyInternal
  chrome.dll!mojo::Connector::ReadSingleMessage
  chrome.dll!mojo::FilterChain::Accept
  chrome.dll!IPC::`anonymous namespace'::ChannelAssociatedGroupController::Accept
  chrome.dll!mojo::FilterChain::Accept
  chrome.dll!mojo::InterfaceEndpointClient::HandleValidatedMessage
  chrome.dll!IPC::mojom::ChannelStub::Accept
  chrome.dll!IPC::internal::MessagePipeReader::Receive
  chrome.dll!IPC::ChannelMojo::OnMessageReceived
  chrome.dll!IPC::SyncChannel::SyncContext::OnMessageReceived
  chrome.dll!IPC::SyncChannel::SyncContext::TryToUnblockListener
  chrome.dll!IPC::MojoEvent::Signal
  chrome.dll!MojoWriteMessageImpl
  chrome.dll!mojo::edk::Core::WriteMessageNew
  chrome.dll!mojo::edk::MessagePipeDispatcher::WriteMessage
  chrome.dll!mojo::edk::NodeController::SendMessage
  chrome.dll!mojo::edk::ports::Node::SendMessage
  chrome.dll!mojo::edk::ports::Node::SendMessageInternal
  chrome.dll!mojo::edk::ports::Node::AcceptMessage
  chrome.dll!mojo::edk::ports::Node::OnUserMessage
  chrome.dll!mojo::edk::NodeController::PortStatusChanged
  chrome.dll!mojo::edk::MessagePipeDispatcher::PortObserverThunk::OnPortStatusChanged
  chrome.dll!mojo::edk::AwakableList::AwakeForStateChange
  chrome.dll!mojo::edk::WaitSetDispatcher::Waiter::Awake
  chrome.dll!mojo::edk::WaitSetDispatcher::WakeDispatcher
  chrome.dll!mojo::edk::AwakableList::AwakeForStateChange
  chrome.dll!mojo::edk::Waiter::Awake
  ntdll.dll!RtlReleaseSRWLockExclusive

Thread 6024 was blocked on this stack for 6.706 s:
  chrome.dll!base::debug::TaskAnnotator::RunTask
  chrome.dll!base::internal::Invoker<base::internal::BindState<void (__thiscall policy::URLBlacklistManager::*)(void),base::WeakPtr<policy::URLBlacklistManager> >,void __cdecl(void)>::Run
  chrome.dll!net::`anonymous namespace'::QuicChromeAlarm::OnAlarm
  chrome.dll!net::QuicConnection::CheckForTimeout
  chrome.dll!net::QuicConnection::CloseConnection
  chrome.dll!net::QuicConnection::TearDownLocalConnectionState
  chrome.dll!net::QuicChromiumClientSession::OnConnectionClosed
  chrome.dll!net::UDPSocketWin::Close
  ws2_32.dll!closesocket

It was readied by the System process by three context switches totalling 5.235 s of waiting on this call stack:
  ndis.sys!<PDB not found>
  ndis.sys!<PDB not found>
  Netwsw00.sys!oscHandleInterrupt
  Netwsw00.sys!isrHandlerRoutineInta
  Netwsw00.sys!rfdQueueProcessFragments
  Netwsw00.sys!prvRfdQueueDispatch
  Netwsw00.sys!rxDataCbReceiveComplete
  Netwsw00.sys!rxDataProcessMPDUFrame
  Netwsw00.sys!rxDataDispatchMSDU
  Netwsw00.sys!rxDataProcessMSDUData
  Netwsw00.sys!rxDataSyncStatistics
  ndis.sys!<PDB not found>
  ndis.sys!<PDB not found>
  ndis.sys!<PDB not found>
  ndis.sys!<PDB not found>
  tcpip.sys!<PDB not found>
  ntoskrnl.exe!KeExpandKernelStackAndCalloutInternal
  tcpip.sys!<PDB not found>
  tcpip.sys!<PDB not found>
  tcpip.sys!<PDB not found>
  tcpip.sys!<PDB not found>
  tcpip.sys!<PDB not found>
  tcpip.sys!<PDB not found>
  tcpip.sys!<PDB not found>
  tcpip.sys!<PDB not found>
  tcpip.sys!<PDB not found>
  tcpip.sys!<PDB not found>
  tcpip.sys!<PDB not found>
  tdx.sys!TdxConnectConnectionTlRequestComplete
  ntoskrnl.exe!IopfCompleteRequest
  afd.sys!<PDB not found>
  ntoskrnl.exe!KeInsertQueueApc

and two context switches totalling 1.471 s of waiting on this call stack:
  ntoskrnl.exe!PspSystemThreadStartup
  ntoskrnl.exe!ExpWorkerThread
  ntoskrnl.exe!IopProcessWorkItem
  NETIO.SYS!NetiopIoWorkItemRoutine
  tcpip.sys!<PDB not found>
  afd.sys!<PDB not found>
  ntoskrnl.exe!KeSetEvent

 
Components: -Blink>Network Internals>Network

Comment 2 by mef@chromium.org, Dec 16 2016

Components: -Internals>Network Internals>Network>QUIC
QUIC-related, so adding QUIC label.

Thread 6024 was blocked on this stack for 6.706 s:
  chrome.dll!base::debug::TaskAnnotator::RunTask
  chrome.dll!base::internal::Invoker<base::internal::BindState<void (__thiscall policy::URLBlacklistManager::*)(void),base::WeakPtr<policy::URLBlacklistManager> >,void __cdecl(void)>::Run
  chrome.dll!net::`anonymous namespace'::QuicChromeAlarm::OnAlarm
  chrome.dll!net::QuicConnection::CheckForTimeout
  chrome.dll!net::QuicConnection::CloseConnection
  chrome.dll!net::QuicConnection::TearDownLocalConnectionState
  chrome.dll!net::QuicChromiumClientSession::OnConnectionClosed
  chrome.dll!net::UDPSocketWin::Close
  ws2_32.dll!closesocket

Comment 3 by eroman@chromium.org, Feb 15 2017

Owner: rch@chromium.org
Ryan could you take a look?

We have had issues with UDP socket closing being janky in the past, understanding this would be good.

Comment 4 by rch@chromium.org, Feb 16 2017

Owner: jri@chromium.org
Assigning to jri to get more information about closing UDP sockets on Windows.
While using a different laptop in a different hotel on a different continent I hit what I believe was the same problem again. I was unable to get an ETW trace so I can't be certain, but the symptoms seemed identical.

As a first step it might be worthwhile adding UMA stats to monitor how long closesocket() takes. My guess is that we will see a bimodal distribution with peaks around zero seconds and five seconds, with the vast majority at zero seconds, but with enough samples at five seconds to make mitigations important.

Comment 6 by rch@chromium.org, Feb 21 2017

Sounds like a good idea. Jana, can you land a histogram to track this?

Comment 7 by jri@chromium.org, Mar 1 2017

Yes, histogram SG. I'll get this in this week.

Comment 8 by jri@chromium.org, Mar 14 2017

Labels: -Pri-3 Pri-2
Status: Assigned (was: Untriaged)
I was just about to get a histogram in, and I found one already there. Clearly this isn't the first time someone's been looking at this problem. See Net.UDPSocketWinClose. The data is interesting:
(i) 0.01% of socket closes take > ~1.5 seconds per day,
(ii) 2% of users get at least one socket close > ~1 second per day, and
(iii) 1.2% of users get at least one socket close > ~5 seconds per day.

This is quite bad. I'll dig in.

Comment 9 by jri@chromium.org, Mar 14 2017

(I'm on merge duty this week, so I'll dig as I find time.)
Let me know if you want the original ETW trace - I think I still have it.
My Lenovo T460S Windows 10 Insider machine which was unusably hangy with Chrome had the following in the Windows Event Log: "Closing a UDP socket with local port number 52951 in process 10288 is taking longer than expected. The local port number may not be available until the close operation is completed. This happens typically due to misbehaving network drivers. Ensure latest updates are installed for Windows and any third-party networking software including NIC drivers, firewalls, or other security products."

After updating the Intel Wifi drivers to the latest and installing the 15048  Insider build, all of my hangs seem to have gone away.

Comment 12 by jri@chromium.org, Mar 14 2017

brucedawson: I reached out to an MS networking dev who asks:
- What's the OS version
- What's the version of the Netwsw00.sys driver on your system? (I don't know how to look for this version, so I'll find out.)
Looking at the ETW trace I see:
- OS version is Windows 10 Pro, 14393.447.amd64fre.rs1_release_inmarket.161102-0100
- NETwsw00.SYS version is 15.4.1.1, built in February of 2013

Windows explorer shows the version number of c:\Windows\System32\Drivers\NETwsw00.SYS correctly, but it doesn't show the build date - I used dumpbin /headers for that.

Note that this happened on two different Windows 10 laptops. I don't have version information for the other one handy but I can get it if necessary.

If Intel has a newer driver that avoids network hangs then can we get the update on Windows Update so that our customers actually receive it? Fixing this for my laptop would be nice, to be sure, but I'd actually rather not manually fix it because it is better to see what our customers are dealing with.

Comment 14 by jri@chromium.org, Mar 14 2017

Thanks for the additional info, brucedawson and elawrence. Definitely helpful to know that the hang may be due to a driver bug, but it still seems odd that the app needs to hang around. Chrome ought to be able to close the socket and leave without waiting for the driver to do something right.

brucedawson: Yes, please! I've never read an ETW trace, but my MS colleague may find it helpful.
The ETW trace is too big to attach but it is available here:

https://drive.google.com/file/d/0B9b7VYaDqPZbSmlsRFhtRkthU2M/view?usp=sharing

The analysis in the original filing of the bug should be detailed enough to let any ETW expert replicate my findings.

According to https://downloadcenter.intel.com/download/25769/Intel-Network-Adapter-Driver-for-Windows-10 there have been eight Intel driver updates in the last few years, culminating in version 22.0.1. I installed Windows 10 on my laptop around August 2015, at which point version 20.1 was the most recent version.

So, it seems rather tragic that I am running a version from 2013 that is at least twelve versions behind and was already at least five versions behind when I installed Windows 10.

Comment 16 by jri@chromium.org, Mar 14 2017

Great, thanks for sending the trace and the info. I'm glad that you have the old driver - it gives us something to work with! I'd like to understand why Chrome waits on the closesocket.
Note that on both laptops I cannot reproduce this bug on demand. It requires certain types of finicky hotel WiFi, and even in these problematic hotels it is not consistent. And neither hotel is located in the country where I live :-(

Comment 18 by piman@chromium.org, Mar 22 2017

Is this related to crbug.com/661306 (on Chrome OS with evidence of other platforms being affected)? The symptoms are somewhat similar, with the UI thread blocking because the IO thread is busy, likely because of networking things (given the IO thread stacks in the traces).
I don't think it is related. I think it's best to treat this as a Windows specific issue where closing sockets sometimes takes five seconds. It may simply be a matter of buggy (Intel in my cases) drivers that sometimes hit a five second timeout. I don't know whether newer drivers fix the problem because I can't reproduce the problem reliably, and because Intel does not support upgrading my drivers on Windows 10 (grrr...).

Comment 20 by jri@chromium.org, Apr 10 2017

Cc: zhongyi@chromium.org
Cc: brajkumar@chromium.org sandeepkumars@chromium.org
 Issue 739327  has been merged into this issue.
> Is this related to crbug.com/661306 (on Chrome OS with evidence of other platforms being affected)?

No.

The ETW traces clearly show that closesocket() takes five seconds to return. The clear clustering around five seconds is strongly indicative of a timeout of some sort, rather than an overloaded UI thread. Also, the ETW trace makes it clear that the problem is strictly related to the thread that is running closesocket(). That is, the trace(s) show that Chrome calls closesocket() and then that thread goes idle for ~5 s. It is then woken by the system process at which point it is able to continue executing. At no point does the UI thread have a chance to intervene or affect this.

Yes, I responded back in March but I'd typed most of this reply before I noticed this and I wanted to clarify a couple of things anyway.

Note that  bug 739327  changes the analysis somewhat by showing that the bug doesn't just occur with Intel drivers and can in fact happen on a wired network. But, it still maintains the same ~5 second timeout. This could mean that this is actually a Windows OS bug/feature, or it could mean that the 5 second timeout originated in some sample driver code which has now been copied by many manufacturers.

Correction - it is crbug.com/749946 that shows that this bug can happen on a wired network as well.

Comment 24 by jri@chromium.org, Aug 3 2017

TO be clear, there is an Intel WiFi driver bug that causes this too, though now I'm not sure how much of what we see in Chrome is attributable to that bug given that you've seen it on a wired network as well. Maybe it's an Intel NIC on the wired network with the same code in the driver?

Comment 25 by jri@chromium.org, Aug 3 2017

Cc: brucedaw...@chromium.org
Issue 749946 has been merged into this issue.
Had another hang managed to create an ETW trace whilst it was happening

https://drive.google.com/open?id=0B5pqsyyhdETjVEtGVFMwQ0RrbTA

It resolved when I toggled the wifi adapter ~70s in.

Not sure if this helps, or if you need anything else.
Another trace doesn't hurt (I'll take a look if I get a chance) but I doubt it will change the conclusion. Make sure your driver is up-to-date, and complain to the driver developer, pointing them to the trace.

Comment 28 by rch@chromium.org, Aug 19 2017

I just heard back today from Microsoft that a fix is in the Windows 10 preview builds and will be in the Fall Creators Update. Hopefully that will resolve the issue.
Issue 749758 has been merged into this issue.
Project Member

Comment 30 by PranavkRobot, Jul 16

Labels: crash-BugNoRepro
Crash analysis has not encountered any reports for this bug for the past 90 days. We have added the label 'crash-BugNoRepro'

Crash analysis will be automatically closing the bug in 10 days. If you do not want Crash analysis to automatically close the bug, please remove the label 'crash-BugNoRepro'. If you have any feedback on this feature, please contact pranavk@
Project Member

Comment 31 by PranavkRobot, Aug 2

Labels: crash-BugNoRepro-Closed
Status: WontFix (was: Assigned)
Crash analysis has not encountered any reports for this bug for the past 90 days. Hence as per the comment 16 days ago, we are closing the bug and setting the status to WontFix. If you have any feedback on this feature, please contact pranavk@

Sign in to add a comment