Chrome: Crash Report - [ThreadWatcher IO hang] base::MessagePumpForIO::WaitForIOCompletion |
|||||||||||
Issue descriptionreporter:pbommana@google.com Magic Signature: [ThreadWatcher IO hang] base::MessagePumpForIO::WaitForIOCompletion Crash link: https://crash.corp.google.com//browse?q=product.name%3D'Chrome'%20AND%20product.version%3D'66.0.3336.5'%20AND%20expanded_custom_data.ChromeCrashProto.channel%3D'canary'%20AND%20expanded_custom_data.ChromeCrashProto.ptype%3D'browser'%20AND%20expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D'%5BThreadWatcher%20IO%20hang%5D%20base%3A%3AMessagePumpForIO%3A%3AWaitForIOCompletion'%20AND%20ReportID%3D'2fc3094b41ca6eca'#3 ------------------------------------------------------------------------------- Sample Report ------------------------------------------------------------------------------- Product name: Chrome Magic Signature : [ThreadWatcher IO hang] base::MessagePumpForIO::WaitForIOCompletion Product Version: 66.0.3336.5 Process type: browser Report ID: 2fc3094b41ca6eca Report Url: https://crash.corp.google.com/2fc3094b41ca6eca Report Time: 2018-02-02T11:23:34-08:00 Upload Time: 2018-02-02T11:24:00.417-08:00 Uptime: 129496000 ms CumulativeProductUptime: 0 ms OS Name: Windows NT OS Version: 6.1.7601 17514 CPU Architecture: x86 CPU Info: GenuineIntel family 6 model 23 stepping 10 ------------------------------------------------------------------------------- Crashing thread: Thread index: 12. Stack Quality: 100%. Thread id: 4752. ------------------------------------------------------------------------------- 0x77bf70b4 (ntdll.dll + 0x000470b4) KiFastSystemCallRet 0x77bf63a3 (ntdll.dll + 0x000463a3) NtRemoveIoCompletion 0x75ef0646 (KERNELBASE.dll + 0x00020646) GetQueuedCompletionStatus 0x5c416aa5 (chrome.dll - message_pump_win.cc: 518) base::MessagePumpForIO::WaitForIOCompletion(unsigned long,base::MessagePumpForIO::IOHandler *) 0x5c41681c (chrome.dll - message_pump_win.cc: 492) base::MessagePumpForIO::DoRunLoop() 0x5c4166dd (chrome.dll - message_pump_win.cc: 56) base::MessagePumpWin::Run(base::MessagePump::Delegate *) 0x5c41664e (chrome.dll - message_loop.cc: 350) base::MessageLoop::Run(bool) 0x5c41610d (chrome.dll - run_loop.cc: 133) base::RunLoop::Run() 0x5c415fba (chrome.dll - thread.cc: 255) base::Thread::Run(base::RunLoop *) 0x5c4b1150 (chrome.dll - browser_thread_impl.cc: 248) content::BrowserThreadImpl::IOThreadRun(base::RunLoop *) 0x5c482b89 (chrome.dll - browser_thread_impl.cc: 283) content::BrowserThreadImpl::Run(base::RunLoop *) 0x5c40dd54 (chrome.dll - thread.cc: 338) base::Thread::ThreadMain() 0x5d03a62a (chrome.dll - platform_thread_win.cc: 91) base::`anonymous namespace'::ThreadFunc 0x77b23c44 (kernel32.dll + 0x00053c44) BaseThreadInitThunk 0x77c137f4 (ntdll.dll + 0x000637f4) __RtlUserThreadStart 0x77c137c7 (ntdll.dll + 0x000637c7) _RtlUserThreadStart
,
Feb 9 2018
[Stability sheriff] gab@: Can you help triage this, if you're familiar with ThreadWatcher?
,
Feb 9 2018
Interesting! In a quick survey: - MessagePumpForIO::ScheduleWork()'s InterlockedExchange could result in an ABA problem. But it seems to me like at worse this results in the pump doing an extra loop when idle when: (T1) PostTask (T2) PostTask (T1) ScheduleWork -> PostQueuedCompletionStatus() (T3 -- pumping thread) Processes both tasks (T3) Sets state back to READY (T2) ScheduleWork -> redundant PostQueuedCompletionStatus() - I think |if (InterlockedExchange(&work_state_, HAVE_WORK) != READY) return;| is wrong. As is generally the case with atomics : a thread is never *guaranteed* to see values written by other threads. The only thing that is guaranteed is that *if* T1 sees a value x = y and there was a Release barrier on T2 after it wrote 'y' to 'x' and 'x' was read with an acquire barrier; *then* T1 will also at least see all the memory side-effects from T2 committed before writing 'y' to 'x'. Now, this doesn't mean much if multiple threads can write redundant 'y's to 'x'..! I think this second point is the bug, an unlikely race with atomics resulting in ScheduleWork() being skipped and the pump going idle when it shouldn't.
,
Feb 9 2018
hmmm... actually... while I do think there's a "bug" in MessagePumpForUI/IO::ScheduleWork() as described above. In practice, MessageLoop::ScheduleWork() is documented as non-thread-safe and is called under a lock from IncomingTaskQueue. So this should actually "work"..? Chrome.MessageLoopProblem might be a good insight into unexpected failure conditions : https://uma.googleplex.com/histograms?endDate=20180207&dayCount=1&histograms=Chrome.MessageLoopProblem&fixupData=true&showMax=true&filters=platform%2Ceq%2CW%2Cchannel%2Ceq%2C4%2Cisofficial%2Ceq%2CTrue&implicitFilters=isofficial
,
Feb 12 2018
Users experienced this crash on the following builds: Win Canary 66.0.3344.0 - 0.23 CPM, 15 reports, 15 clients (signature [ThreadWatcher IO hang] base::MessagePumpForIO::WaitForIOCompletion) If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates. - Go/Fracas
,
Feb 20 2018
[Stability Sheriff] Could it simply be that the IO thread gets hung up in GetQueuedCompletionStatus? It appears we are passing a zero timeout, which should either return the completed IO or return immediately if there was nothing to return. But the IO thread appears to be hung in NtRemoveIoCompletion. Or maybe I'm misunderstanding how this API should work... gab@ can you please post next steps here? Is there a fix possible for M66?
,
Feb 21 2018
Looked into crash/2fc3094b41ca6eca Looks like the IO thread is waiting on an INFINITE timeout. Looking up the stack, it indeed looks like MessageLoop::incoming_task_queue_ is empty. Looking at other threads it looks like the PROCESS_LAUNCHER thread might actually be hung in a CreateProcess syscall. Maybe the hang watcher is misreporting a PROCESS_LAUNCHER hang as an IO thread hang? I will unfortunately not have time to look into this further for M66 (also note, this isn't new, the hang reporter was recently fixed to unhide reports but this impl hasn't changed in a while).
,
Feb 21 2018
Also, PROCESS_LAUNCHER isn't currently watched for hangs, so it's weird that reports would be generated for it and tagged with IO thread..?
,
Feb 22 2018
If someone can shed some more light on what's to blame here, and that's analysis is generalizable, I may be able to update the automated hang analysis so that we can cluster these better in the future. Including Bruce in case he has some ideas on how to go about doing this analysis. I looked at the following two reports: - crash/2fc3094b41ca6eca There are two threads that are suspicious here. Thread 10 is creating a new process. Thread 32 is flushing some sqlite buffers. Not sure whether these could be attributed to ThreadUnresponsive_IO. - crash/c04fc2e6f6af4bfd There are 6 threads stuck at file writes (NtWriteFile). Two threads are running some dns queries (QueryDnsForFamily). Again, I'm not sure whether these could be attributed to ThreadUnresponsive_IO.
,
Feb 22 2018
I looked through 13 crashes and saw some anomalous things but nothing consistent. - One crash is running a version of Windows that hasn't been updated since 2010 - FlushFileBuffers is pretty expensive, as is creating a process, but ??? - slc.dll (something to do with Windows anti-piracy I think?) is present in 4 of the 13 crash dumps, and is always be a version from July 2009, with Microsoft's name missing from the usual fields. Weird, but since it's only in 4 it doesn't seem related - Lots of anti-virus - I'm not sure if it's more than normal Sorry, no immediate ideas
,
Feb 23 2018
This is a P2 and as suggested by the previous stability sheriff I’m dropping it from the queue.
,
Mar 7 2018
Users experienced this crash on the following builds: Win Canary 67.0.3362.0 - 0.10 CPM, 4 reports, 4 clients (signature [ThreadWatcher IO hang] base::MessagePumpForIO::WaitForIOCompletion) If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates. - Go/Fracas
,
Mar 14 2018
Users experienced this crash on the following builds: Win Canary 67.0.3368.1 - 0.10 CPM, 3 reports, 3 clients (signature [ThreadWatcher IO hang] base::MessagePumpForIO::WaitForIOCompletion) If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates. - Go/Fracas
,
Aug 7
Issue 871215 has been merged into this issue.
,
Nov 1
|
|||||||||||
►
Sign in to add a comment |
|||||||||||
Comment 1 by pbomm...@chromium.org
, Feb 2 2018Labels: Stability-Sheriff-Desktop M-66 Target-66