UI/IO thread hang reports have incorrectly been hidden for several milestones (years?!!!) |
|||||||
Issue descriptionChrome Version: M66 and forever before OS: All platforms we watch for hangs. With https://chromium-review.googlesource.com/c/chromium/src/+/743846 and https://chromium-review.googlesource.com/c/chromium/src/+/848355 I showed that removing StartWatching() calls from browser threads that are now unused resulted in a slew of hang reports (order of thousands per day in Canary instead of 1 or 2)... Investigating this revealed a very scary result : we have been ignoring many hang reports for a while (forever?!) In ThreadWatcher::GotNoResponse() -- invoked when a thread is unresponsive we have the following logic : // Crash the browser if the watched thread is to be crashed on hang and if the // number of other threads responding is less than or equal to // live_threads_threshold_ and at least one other thread is responding. if (crash_on_hang_ && responding_thread_count > 0 && responding_thread_count <= live_threads_threshold_) { |live_threads_threshold_| is a constant set to 2 (ref. kLiveThreadsThreshold) If more than |live_threads_threshold_| threads are responsive, we do NOT report the hang..! I do not understand why this was a good heuristic in the past but it clearly doesn't work now because we have 3 BrowserThreads that have fully been migrated to TaskScheduler (and now do *nothing* but make the ThreadWatcher happy..!) I suspect this was a problem before the migration to TaskScheduler as well because these threads can be fairly idle after startup and them being responsive doesn't seem to me like a good reason to hide a hang report on UI/IO threads..! I'll go ahead and remove |live_threads_threshold_|, expect a slew of *REAL* hang reports..! I'm also dubious about |responding_thread_count > 0| (i.e. only report the hang if other threads are responsive). I guess this is to avoid reporting hangs when _everything_ is slow. I'll leave that in for now which will mean that UI thread will report a hang so long as IO thread is responsive and vice-versa.
,
Jan 22 2018
Wow, so a ton of these hangs weren't being reported. +pam for visibility.
,
Jan 22 2018
So I found this in the thread_watcher.h docs: // |live_threads_threshold| specifies the maximum number of browser threads // that have to be responsive when we want to crash the browser because of // hung watched thread. This threshold allows us to either look for a system // deadlock, or look for a solo hung thread. A small live_threads_threshold // looks for a broad deadlock (few browser threads left running), and a large // threshold looks for a single hung thread (this in only appropriate for a // thread that *should* never have much jank, such as the IO). Turns out the only two browser threads that were being watched for hangs were UI/IO and they were using a small live_threads_threshold... whoopsie, that's the opposite of what the documentation says (not that I blame anyone, this is all very hard to parse)..!
,
Jan 22 2018
+rtenneti for comment, in case there's any context we're missing somehow.
,
Jan 22 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0ce1d29679f777b676fab9bbabeb90f22494083b commit 0ce1d29679f777b676fab9bbabeb90f22494083b Author: Gabriel Charette <gab@chromium.org> Date: Mon Jan 22 18:00:44 2018 Make ThreadWatcherList::kLiveThreadsThreshold the maximum number of threads. This CL is *expected* to cause a surge of hang reports on the UI and IO threads. Those are *real* hangs!!! This was miconfigured for years. This parameter will be removed in a follow-up cleanup but for now bump it to the maximum number of browser threads to prove the diagnosis in crbug.com/804345 R=asvitkine@chromium.org Bug: 804345 Change-Id: I3258e85020b13e2c51cafbcf2b1d50d212327bcc Reviewed-on: https://chromium-review.googlesource.com/878825 Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Commit-Queue: Gabriel Charette <gab@chromium.org> Cr-Commit-Position: refs/heads/master@{#530909} [modify] https://crrev.com/0ce1d29679f777b676fab9bbabeb90f22494083b/chrome/browser/metrics/thread_watcher.cc
,
Jan 22 2018
Is this going to be so many that we need additional throttling, or to adjust our current throttling behavior to avoid washing out other reports?
,
Jan 22 2018
Little bit of history. I wasn't able to look at histogram data. Some of these limits were based histogram data at that time. + We wanted to be sure the system (the computer) is not hung and at least one other thread is responding, before we crash the browser. The following is the original code. https://codereview.chromium.org/7030020/diff/3002/chrome/browser/metrics/thread_watcher.cc (this was the first cut at it). // Crash the browser if IO thread hasn't responded atleast 3 times and if the // number of other threads is equal to 1. We picked 1 to reduce the number of // crashes and to get some sample data. if (thread_id_ == BrowserThread::IO && no_of_responding_threads == 1) { int* crash = NULL; CHECK(crash++); } + We might have increased it to 2 (to say at least 2 threads, generally FILE and CACHE threads used to be very responsive, if I recall. There are never hung or their responsive time used to be very good. I don't have access to the histograms and it based on more than 2 years old memory and I could be wrong). If they are responding, then system/browser is working ok, except UI/IO threads are not responding. + Lot of the limits in the code (at the time, we wrote this code), were to reduce false positive reports and/or reduce the number of crashes. We wanted to be sure it is working properly before we became aggressive. At that time, we used to crash the browser (we didn't have the code to report crashes without crashing the browser). We have also used ThreadWatcher histogram data (number of unresponsive threads and how many are being reported and how many crashes we could triage), to pick some of the numbers (which may not be valid now). Many many thanks for taking interest to read the code and changing it.
,
Jan 22 2018
In addition two way deadlocks (UI/IO), could there be 3-way dead locks? (UI/IO and another thread like FILE/DB/...)?
,
Jan 22 2018
@rtenneti: thanks for providing historical input :)! So we intentionally did not want to report a hang on the UI/IO thread if the FILE+CACHE threads were responsive? That's counterintuitive to me. These days all browser threads but UI/IO/PROCESS_LAUNCHER have been migrated to TaskScheduler. So FILE/CACHE/DB/FILE_USER_BLOCKING are *extremely* responsive because they all process zero work and happily reply to ThreadWatcher every ping. So ignoring a UI/IO thread hang because other browser threads (which we thus far failed to remove because of this issue (crbug.com/768886)) doesn't make sense (anymore)). @#6(pam): it will be low thousands of reports on Canary as we saw last time I tried to remove hang watching of the unused threads (issue 768886). I don't think this justifies throttling but have no expertise these, your call.
,
Jan 22 2018
I am wondering if "responding_thread_count <= live_threads_threshold_" should be changed to "responding_thread_count >= 1". That is at least one other thread is responding and MessageLoop is working (IMO, ThreadWatcher is working, browser is working because we are executing the code).
,
Jan 22 2018
We already have |responding_thread_count > 0|. But as I stated in the OP, even that I think is not necessary (and especially so now that all BrowserThreads are critical). I'm trying to resolve the fire and do some cleanups, not sure I'll get around to that one though as I don't have the bandwidth to monitor the side-effects...
,
Jan 22 2018
If this has really not be working as intended for years then there's no critical rush to change the behaviour. Perhaps we should agree first what we expect and/or want the detector to actually do, rather than just changing the threshold as in #5, or at least limit the threshold change to Canary channel for now?
,
Jan 22 2018
We are at the beginning of a new milestone, seems like the perfect time to test the waters. I don't see why we need a complex solution if this one doesn't turn out to flood us too bad. This is high priority because it blocks removing the deprecated BrowserThreads (we create 4 useless threads every startup just to monitor them)... This has been the case for too long already because I went on leave shortly after the migration to TaskScheduler completed (and failed to land their removal because I hit this issue through crbug.com/768886) This parameter definitely makes no sense in a world where all remaining BrowserThreads are critical.
,
Jan 23 2018
Re #13: Sorry, Aha - I didn't have context for StartWatching() when reviewing the CL description, so the connection to the otherwise-unused BrowserThreads wasn't obvious. So basically what you're saying is that we have a load of thread hangs being detected that never get reported, and removing the unused threads has the side-effect of causing those to be reported?
,
Jan 23 2018
Re #14: precisely, yes. (sorry about the unclear connection, I marked this bug a blocker of the BrowserThread removal but didn't explicitly mention them everywhere to try to keep this focused on upcoming hang reports)
,
Jan 24 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/71147015f1590a2e17aae762c4f4e7db0235f3ff commit 71147015f1590a2e17aae762c4f4e7db0235f3ff Author: Gabriel Charette <gab@chromium.org> Date: Wed Jan 24 12:56:52 2018 Remove ThreadWatcher's live_threads_threshold param. Will wait for https://chromium-review.googlesource.com/c/chromium/src/+/878825 to be live on Canary before landing this CL (to confirm the diagnosis in issue 804345 ). This is then the first cleanup CL of the ThreadWatcher's config. First removing the live_threads_threshold param which is confusing and definitely useless in a world with a small amount of BrowserThreads which are all critical. Bug: 804345 Change-Id: I3026eb0ddfa1e2fecd213a6aec52ce0e910b0ea9 Reviewed-on: https://chromium-review.googlesource.com/879145 Commit-Queue: Gabriel Charette <gab@chromium.org> Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Cr-Commit-Position: refs/heads/master@{#531510} [modify] https://crrev.com/71147015f1590a2e17aae762c4f4e7db0235f3ff/chrome/browser/metrics/thread_watcher.cc [modify] https://crrev.com/71147015f1590a2e17aae762c4f4e7db0235f3ff/chrome/browser/metrics/thread_watcher.h [modify] https://crrev.com/71147015f1590a2e17aae762c4f4e7db0235f3ff/chrome/browser/metrics/thread_watcher_unittest.cc [modify] https://crrev.com/71147015f1590a2e17aae762c4f4e7db0235f3ff/chrome/common/chrome_switches.cc
,
Jan 24 2018
Going through the Mac reports, the majority are entirely in-actionable on the UI thread because the thread is all in system code. On 66.0.3333.0, 26/43 reported magic signatures are [ThreadWatcher UI hang], which I think means the hang delay on Mac may be too aggressive. Can we turn up the timeout so that the crash dashboards become a little more usable? I think part of the issue may because on the UI thread, our measure of responsiveness isn't quite reflective of what's happening in the system. The task runner ping is only indicative of whether one work source on the runloop is getting serviced. But the UI thread runloop services several other sources, which can cause some starvation of the Chromium task runner sources (we've observed and documented this before). So these are "hangs" with respect to Chrome's task runner work, but they are not necessarily indicative of the UI thread not making progress.
,
Jan 25 2018
First let's have it stabilize (#hangs/day on Canary is still ramping up). Then let's remove watching of the deprecated threads and see if that affects reporting rate (since their ability to pong is factored in decision to report a detected hang...). At that point I'll call this issue Fixed. On my side, I'll be able to finally remove deprecated BrowserThreads completely at last. //metrics/OWNERS and interested parties can then decide what to do with the hang reporting infrastructure.
,
Jan 25 2018
Re #18: Bear in mind that these hang reports aren't free; each DumpWithoutCrashing() call can cause a multi-second hang while a minidump is generated. So if hang reports are firing very often due to aggressive timeouts as per #17 then we could be making the situation considerably worse for affected users.
,
Jan 26 2018
,
Jan 26 2018
Actually, thought about #17 some more (hadn't fully grasped it when posting #18). I think the main issue is the ThreadWatcher doesn't support MessageLoopForUI well as many platforms pop system dialogs with a nested message pump (in which chrome application tasks intentionally aren't processed by default [1]). Those shouldn't be declared as hangs but they are per the pong not occuring when application tasks aren't processed. Until now, ThreadWatcher had avoided this issue per ignoring hangs if other (always idle) threads were responsive... But this was incorrect as it was discarding most true hangs as well (unless system deadlocked, but then the |responding_thread_count > 0| condition would also discard... so needed to be precisely not-processing-tasks-on-one-thread-and-just-a-tad-responsive-but-not-too-much=on-others...). I filed issue 806174 if someone wants to dig into this. Until then I propose we disable hang watching on the UI thread (was already pretty much dead prior to r530909) : https://chromium-review.googlesource.com/#/c/chromium/src/+/888498 I'll call this immediate issue fixed though (hangs are no longer hidden, they're just poorly detected, but already were). [1] https://cs.chromium.org/chromium/src/base/message_loop/message_loop.cc?type=cs&q=MessageLoop::DoWork+task_execution_allowed_&sq=package:chromium&l=437
,
Jan 26 2018
Re #17 and #21: MessageLoop has special handling for "nested" task processing, to cope with the same issue under Windows, where we specifically use a posted message (see https://cs.chromium.org/chromium/src/base/message_loop/message_pump_win.cc?l=247) to ensure that we get a chance to process tasks even while stuck in some system-provided event processing loop. Are we missing similar logic in the Mac MessagePump, or is it simply not feasible to provide it?
,
Jan 26 2018
On Mac there is a similar mechanism, but it depends on what is nested. If there is a nested system runloop without a call to MessageLoop::Run on the stack, then Chromium work will get pumped in that nested loop. If there already is a MessageLoop::Run frame on the stack and a nested system run loop starts, then Chromium work will not get pumped in the nested loop.
,
Jan 26 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/505bed3a721218e94c8c5b6d3ee9ebc3507b6b1a commit 505bed3a721218e94c8c5b6d3ee9ebc3507b6b1a Author: Gabriel Charette <gab@chromium.org> Date: Fri Jan 26 23:16:49 2018 Disable hang watching on the UI thread for now. Also restrict hang reporting to the Canary channel while uncovered issues stabilize. R=asviktine@chromium.org Bug: 806174, 804345 Change-Id: Ie99b0e7b36d606102642ccab10ef3a30832f08ab Reviewed-on: https://chromium-review.googlesource.com/888498 Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Commit-Queue: Gabriel Charette <gab@chromium.org> Cr-Commit-Position: refs/heads/master@{#532086} [modify] https://crrev.com/505bed3a721218e94c8c5b6d3ee9ebc3507b6b1a/chrome/browser/metrics/thread_watcher.cc
,
Jan 29 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/794ac52d0357563a91b834bd200e11572e4d3334 commit 794ac52d0357563a91b834bd200e11572e4d3334 Author: Gabriel Charette <gab@chromium.org> Date: Mon Jan 29 09:36:07 2018 Re-attempt to remove hang watching for deprecated BrowserThreads. https://chromium-review.googlesource.com/c/chromium/src/+/878825 didn't cause as big of a spike as I expected (~250 instead of ~1000), will wait for it to settle a bit more before landing this CL (to confirm whether this one still causes havoc and there's more to uncover or not) R=asvitkine@chromium.org Bug: 804345 Change-Id: Iee2d1f52d55dff2e840420c28e6af7c80d526ca1 Reviewed-on: https://chromium-review.googlesource.com/883122 Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Commit-Queue: Gabriel Charette <gab@chromium.org> Cr-Commit-Position: refs/heads/master@{#532369} [modify] https://crrev.com/794ac52d0357563a91b834bd200e11572e4d3334/chrome/browser/metrics/thread_watcher.cc
,
Jan 30 2018
Issue 805358 has been merged into this issue.
,
Feb 1 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9f99a202a791d488f4d6608dc29402ba630ee0c2 commit 9f99a202a791d488f4d6608dc29402ba630ee0c2 Author: Gabriel Charette <gab@chromium.org> Date: Thu Feb 01 11:50:30 2018 Finally remove all deprecated BrowserThreads and cleanup remaining dependencies. This is the final piece in the attempt to reland https://chromium-review.googlesource.com/c/chromium/src/+/705775 Bug: 804345 Change-Id: I240576ae186886ae9b76c50a7920aed37b3a4ff1 TBR=avi@chromium.org, jam@chromium.org (fwd'ing from https://chromium-review.googlesource.com/c/chromium/src/+/705775) Change-Id: I240576ae186886ae9b76c50a7920aed37b3a4ff1 Reviewed-on: https://chromium-review.googlesource.com/890263 Commit-Queue: Gabriel Charette <gab@chromium.org> Reviewed-by: Alexei Svitkine <asvitkine@chromium.org> Reviewed-by: Dirk Pranke <dpranke@chromium.org> Cr-Commit-Position: refs/heads/master@{#533640} [modify] https://crrev.com/9f99a202a791d488f4d6608dc29402ba630ee0c2/PRESUBMIT.py [modify] https://crrev.com/9f99a202a791d488f4d6608dc29402ba630ee0c2/chrome/browser/metrics/thread_watcher_report_hang.cc [modify] https://crrev.com/9f99a202a791d488f4d6608dc29402ba630ee0c2/components/metrics/call_stack_profile_metrics_provider.cc [modify] https://crrev.com/9f99a202a791d488f4d6608dc29402ba630ee0c2/components/metrics/call_stack_profile_params.h [modify] https://crrev.com/9f99a202a791d488f4d6608dc29402ba630ee0c2/components/metrics/public/cpp/call_stack_profile_struct_traits.h [modify] https://crrev.com/9f99a202a791d488f4d6608dc29402ba630ee0c2/components/metrics/public/cpp/call_stack_profile_struct_traits_unittest.cc [modify] https://crrev.com/9f99a202a791d488f4d6608dc29402ba630ee0c2/components/metrics/public/interfaces/call_stack_profile_collector.mojom [modify] https://crrev.com/9f99a202a791d488f4d6608dc29402ba630ee0c2/content/browser/browser_main_loop.cc [modify] https://crrev.com/9f99a202a791d488f4d6608dc29402ba630ee0c2/content/browser/browser_thread_impl.cc [modify] https://crrev.com/9f99a202a791d488f4d6608dc29402ba630ee0c2/content/browser/browser_thread_impl.h [modify] https://crrev.com/9f99a202a791d488f4d6608dc29402ba630ee0c2/content/public/browser/browser_thread.h [modify] https://crrev.com/9f99a202a791d488f4d6608dc29402ba630ee0c2/content/public/test/test_browser_thread_bundle.cc [modify] https://crrev.com/9f99a202a791d488f4d6608dc29402ba630ee0c2/content/public/test/test_browser_thread_bundle.h [modify] https://crrev.com/9f99a202a791d488f4d6608dc29402ba630ee0c2/third_party/metrics_proto/execution_context.proto
,
Feb 1 2018
Oops wrong bug id, should have gone to issue 689520 .
,
Feb 5 2018
,
Feb 13 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/7f6a4e0d1c7ad0f26360196c98a3390b0c0b1dcb commit 7f6a4e0d1c7ad0f26360196c98a3390b0c0b1dcb Author: Gabriel Charette <gab@chromium.org> Date: Tue Feb 13 15:42:35 2018 Delete subtle_threading_bugs.md It's obsolete in TaskScheduler era. R=jam@chromium.org Bug: 804345 Change-Id: I37a861ae55658ce12fb11ac86cce727c06e77019 Reviewed-on: https://chromium-review.googlesource.com/897782 Reviewed-by: John Abd-El-Malek <jam@chromium.org> Commit-Queue: Gabriel Charette <gab@chromium.org> Cr-Commit-Position: refs/heads/master@{#536361} [delete] https://crrev.com/2a49ec7eae6dc68d0b95e9ea3416c4a77ae33a8f/docs/subtle_threading_bugs.md
,
Feb 13 2018
Oops, what's wrong with my copy-paste... should also have been issue 689520 ... |
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by gab@chromium.org
, Jan 22 2018