New issue
Advanced search Search tips

Issue 808615 link

Starred by 3 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug

Blocked on:
issue 808963



Sign in to add a comment

Chrome: Crash Report - [ThreadWatcher IO hang] base::MessagePumpForIO::WaitForIOCompletion

Project Member Reported by cr...@system.gserviceaccount.com, Feb 2 2018

Issue description

reporter:pbommana@google.com

Magic Signature: [ThreadWatcher IO hang] base::MessagePumpForIO::WaitForIOCompletion

Crash link: https://crash.corp.google.com//browse?q=product.name%3D'Chrome'%20AND%20product.version%3D'66.0.3336.5'%20AND%20expanded_custom_data.ChromeCrashProto.channel%3D'canary'%20AND%20expanded_custom_data.ChromeCrashProto.ptype%3D'browser'%20AND%20expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D'%5BThreadWatcher%20IO%20hang%5D%20base%3A%3AMessagePumpForIO%3A%3AWaitForIOCompletion'%20AND%20ReportID%3D'2fc3094b41ca6eca'#3

-------------------------------------------------------------------------------
Sample Report
-------------------------------------------------------------------------------
Product name: Chrome
Magic Signature : [ThreadWatcher IO hang] base::MessagePumpForIO::WaitForIOCompletion
Product Version: 66.0.3336.5
Process type: browser
Report ID: 2fc3094b41ca6eca
Report Url: https://crash.corp.google.com/2fc3094b41ca6eca
Report Time: 2018-02-02T11:23:34-08:00
Upload Time: 2018-02-02T11:24:00.417-08:00
Uptime: 129496000 ms
CumulativeProductUptime: 0 ms
OS Name: Windows NT
OS Version: 6.1.7601 17514
CPU Architecture: x86
CPU Info: GenuineIntel family 6 model 23 stepping 10

-------------------------------------------------------------------------------
Crashing thread: Thread index: 12. Stack Quality: 100%. Thread id: 4752.
-------------------------------------------------------------------------------
0x77bf70b4 (ntdll.dll + 0x000470b4)	KiFastSystemCallRet
0x77bf63a3 (ntdll.dll + 0x000463a3)	NtRemoveIoCompletion
0x75ef0646 (KERNELBASE.dll + 0x00020646)	GetQueuedCompletionStatus
0x5c416aa5 (chrome.dll - message_pump_win.cc: 518)	base::MessagePumpForIO::WaitForIOCompletion(unsigned long,base::MessagePumpForIO::IOHandler *)
0x5c41681c (chrome.dll - message_pump_win.cc: 492)	base::MessagePumpForIO::DoRunLoop()
0x5c4166dd (chrome.dll - message_pump_win.cc: 56)	base::MessagePumpWin::Run(base::MessagePump::Delegate *)
0x5c41664e (chrome.dll - message_loop.cc: 350)	base::MessageLoop::Run(bool)
0x5c41610d (chrome.dll - run_loop.cc: 133)	base::RunLoop::Run()
0x5c415fba (chrome.dll - thread.cc: 255)	base::Thread::Run(base::RunLoop *)
0x5c4b1150 (chrome.dll - browser_thread_impl.cc: 248)	content::BrowserThreadImpl::IOThreadRun(base::RunLoop *)
0x5c482b89 (chrome.dll - browser_thread_impl.cc: 283)	content::BrowserThreadImpl::Run(base::RunLoop *)
0x5c40dd54 (chrome.dll - thread.cc: 338)	base::Thread::ThreadMain()
0x5d03a62a (chrome.dll - platform_thread_win.cc: 91)	base::`anonymous namespace'::ThreadFunc
0x77b23c44 (kernel32.dll + 0x00053c44)	BaseThreadInitThunk
0x77c137f4 (ntdll.dll + 0x000637f4)	__RtlUserThreadStart
0x77c137c7 (ntdll.dll + 0x000637c7)	_RtlUserThreadStart

 
Cc: pbomm...@chromium.org
Labels: Stability-Sheriff-Desktop M-66 Target-66
This crash is seen since Chrome version 60.0.3112.90 and relatively crash rate is in single digits apart from couple of Chrome versions.

Please find crash history here : https://goto.google.com/cdoki


Note : Can't find a culprit CL hence tagging stability sheriff label, Also there was a CL from gab@(https://chromium-review.googlesource.com/c/chromium/src/+/848355) in this area to unearth some of the potential issues. 

Comment 2 by creis@chromium.org, Feb 9 2018

Owner: gab@chromium.org
Status: Assigned (was: Untriaged)
[Stability sheriff] gab@: Can you help triage this, if you're familiar with ThreadWatcher?

Comment 3 by gab@chromium.org, Feb 9 2018

Cc: robliao@chromium.org fdoray@chromium.org
Labels: -Restrict-View-EditIssue
Interesting!

In a quick survey:

 - MessagePumpForIO::ScheduleWork()'s InterlockedExchange could result in an ABA problem. But it seems to me like at worse this results in the pump doing an extra loop when idle when:
 (T1) PostTask
 (T2) PostTask
 (T1) ScheduleWork -> PostQueuedCompletionStatus()
 (T3 -- pumping thread) Processes both tasks
 (T3) Sets state back to READY
 (T2) ScheduleWork -> redundant PostQueuedCompletionStatus()

 - I think |if (InterlockedExchange(&work_state_, HAVE_WORK) != READY) return;| is wrong. As is generally the case with atomics : a thread is never *guaranteed* to see values written by other threads. The only thing that is guaranteed is that *if* T1 sees a value x = y and there was a Release barrier on T2 after it wrote 'y' to 'x' and 'x' was read with an acquire barrier; *then* T1 will also at least see all the memory side-effects from T2 committed before writing 'y' to 'x'. Now, this doesn't mean much if multiple threads can write redundant 'y's to 'x'..!

I think this second point is the bug, an unlikely race with atomics resulting in ScheduleWork() being skipped and the pump going idle when it shouldn't.

Comment 4 Deleted

Comment 5 by gab@chromium.org, Feb 9 2018

hmmm... actually... while I do think there's a "bug" in MessagePumpForUI/IO::ScheduleWork() as described above. In practice, MessageLoop::ScheduleWork() is documented as non-thread-safe and is called under a lock from IncomingTaskQueue. So this should actually "work"..?

Chrome.MessageLoopProblem might be a good insight into unexpected failure conditions :

https://uma.googleplex.com/histograms?endDate=20180207&dayCount=1&histograms=Chrome.MessageLoopProblem&fixupData=true&showMax=true&filters=platform%2Ceq%2CW%2Cchannel%2Ceq%2C4%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial
Project Member

Comment 6 by sheriffbot@chromium.org, Feb 12 2018

Labels: FoundIn-M-66 Fracas
Users experienced this crash on the following builds:

Win Canary 66.0.3344.0 -  0.23 CPM, 15 reports, 15 clients (signature [ThreadWatcher IO hang] base::MessagePumpForIO::WaitForIOCompletion)

If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas

Comment 7 by mfo...@chromium.org, Feb 20 2018

[Stability Sheriff] Could it simply be that the IO thread gets hung up in GetQueuedCompletionStatus?

It appears we are passing a zero timeout, which should either return the completed IO or return immediately if there was nothing to return.  But the IO thread appears to be hung in NtRemoveIoCompletion.

Or maybe I'm misunderstanding how this API should work...

gab@ can you please post next steps here?  Is there a fix possible for M66?

Comment 8 by gab@chromium.org, Feb 21 2018

Cc: gab@chromium.org
Labels: -Target-66 -FoundIn-M-66
Owner: ----
Status: Untriaged (was: Assigned)
Looked into crash/2fc3094b41ca6eca

Looks like the IO thread is waiting on an INFINITE timeout.

Looking up the stack, it indeed looks like MessageLoop::incoming_task_queue_ is empty.

Looking at other threads it looks like the PROCESS_LAUNCHER thread might actually be hung in a CreateProcess syscall.

Maybe the hang watcher is misreporting a PROCESS_LAUNCHER hang as an IO thread hang?

I will unfortunately not have time to look into this further for M66 (also note, this isn't new, the hang reporter was recently fixed to unhide reports but this impl hasn't changed in a while).

Comment 9 by gab@chromium.org, Feb 21 2018

Blockedon: 808963
Also, PROCESS_LAUNCHER isn't currently watched for hangs, so it's weird that reports would be generated for it and tagged with IO thread..?
Cc: brucedaw...@chromium.org
If someone can shed some more light on what's to blame here, and that's analysis is generalizable, I may be able to update the automated hang analysis so that we can cluster these better in the future.

Including Bruce in case he has some ideas on how to go about doing this analysis.

I looked at the following two reports:

 - crash/2fc3094b41ca6eca
   There are two threads that are suspicious here.  Thread 10 is creating a new process. Thread 32 is flushing some sqlite buffers.  Not sure whether these could be attributed to ThreadUnresponsive_IO.

 - crash/c04fc2e6f6af4bfd
   There are 6 threads stuck at file writes (NtWriteFile). Two threads are running some dns queries (QueryDnsForFamily).  Again, I'm not sure whether these could be attributed to ThreadUnresponsive_IO.

I looked through 13 crashes and saw some anomalous things but nothing consistent.

- One crash is running a version of Windows that hasn't been updated since 2010
- FlushFileBuffers is pretty expensive, as is creating a process, but ???
- slc.dll (something to do with Windows anti-piracy I think?) is present in 4 of the 13 crash dumps, and  is always be a version from July 2009, with Microsoft's name missing from the usual fields. Weird, but since it's only in 4 it doesn't seem related
- Lots of anti-virus - I'm not sure if it's more than normal

Sorry, no immediate ideas

Labels: -Stability-Sheriff-Desktop
This is a P2 and as suggested by the previous stability sheriff I’m dropping it from the queue.
Project Member

Comment 13 by sheriffbot@chromium.org, Mar 7 2018

Labels: FoundIn-M-67
Users experienced this crash on the following builds:

Win Canary 67.0.3362.0 -  0.10 CPM, 4 reports, 4 clients (signature [ThreadWatcher IO hang] base::MessagePumpForIO::WaitForIOCompletion)

If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas
Project Member

Comment 14 by sheriffbot@chromium.org, Mar 14 2018

Labels: FoundIn-67
Users experienced this crash on the following builds:

Win Canary 67.0.3368.1 -  0.10 CPM, 3 reports, 3 clients (signature [ThreadWatcher IO hang] base::MessagePumpForIO::WaitForIOCompletion)

If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas
Issue 871215 has been merged into this issue.
Components: Internals>Core

Sign in to add a comment