Issue metadata
Sign in to add a comment
|
UI freeze across all windows triggered by event in webpage
Reported by
br...@electroly.com,
Aug 21
|
||||||||||||||||||||||||
Issue descriptionUserAgent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3529.3 Safari/537.36 Steps to reproduce the problem: 1. Open a window with any website, this will be the control window for detecting that an action in one window is able to lock up unrelated windows. 2. In a second window, open https://www.shacknews.com/chatty -- this is an online forum comprised of threads and replies. 3. Scroll down to a large thread, for instance any thread that says "Thread Truncated. Click to see all N replies." 4. Click on one of the replies in the large thread. This will expand the thread. 5. Using the mouse wheel, immediately begin scrolling up and down continuously so you can tell when the UI thread locks up. 6. Within 1-2 seconds, observe that the scrolling ceases and the browser becomes unresponsive. Observe that the unrelated window created in step 1 is also unresponsive. 7. Up to a minute later, observe that the browser becomes responsive again. The duration of the lockup seems to be correlated to the number of replies in the thread that was clicked. Here is a YouTube video depicting this reproduction: https://www.youtube.com/watch?v=wz-3xvYJygs I have attached a profiler capture that demonstrates a 60-second lockup. The lockup even freezes the dev tools pane itself. Notably, this is NOT reproducible on every machine that I tried, despite identical OS version, Chrome version, and (no) extensions installed. I cannot reproduce the issue in any other browser engine. I can reproduce in Chrome stable 68 and Chrome Canary 70. What is the expected behavior? The browser UI should not lock up when interacting with this webpage. Other browsers do not lock up with the same steps, and other computers running Chrome also do not lock up with the same steps. I have found one other person who can reproduce this issue on his computer; he is able to reproduce it using the Vivaldi browser. What went wrong? I have not identified the cause of this issue, but this website does not appear to have any noticeable UI-blocking behavior in other browsers, and this is a recent regression using Chrome. It is unknown whether the website changed their code causing this problem to be triggered, or whether it was a Chrome update that caused it. Did this work before? Yes Unknown -- worked fine a week ago Chrome version: 70.0.3529.3 Channel: canary OS Version: 10.0 Flash Version:
,
Aug 21
,
Aug 21
I like to solve all problems with ETW tracing. Well, at least performance problems on Windows. I've got instructions on how to grab ETW traces at https://tinyurl.com/uiforetw. The default instructions (start circular-buffer tracing, record the trace when Chrome recovers from the hang) should work fine. If a link to the trace can be shared with brucedawson@chromium.org then I can take a look. That should give me enough information to know who to assign the bug to.
,
Aug 21
I have sent the link to the .ETL file your way. It captures a UI hang that was maybe 15 seconds in length.
,
Aug 21
Thanks for the trace. The trace shows that the Chrome browser process was spinning in a loop and not pumping messages, causing the hang. A MsgCheck Delay was detected by Windows which says that the hang was for 31.578 s - that's pretty long. The CPU Usage (Sampled) data is pretty clean. I viewed it as a flame graph and attached a WPA screenshot. You can see the interval and time-location of the hang in the trace, and the CPU usage graphs. In a flame graph the width of the bars represents the number of samples - the amount of CPU time. We can see that basically 100% of the time is inside ui::AXTree::Unserialize. The CPU time eventually spreads out but it is mostly just different paths leading to recursive calls to chrome.dll!content::`anonymous namespace'::FindNodeWithChildTreeId. I assume that some accessibility mode/provider is enabled in the browser? Has that changed recently? Have we changed something about accessibility recently? Or has the web page changed in some way to trigger this latent problem? I don't know the answer to any of these questions but this should give enough information to allow for triage. Chrome tracing may be needed to get more information about what is going on. Looked at from another point of view, about 40% of the CPU samples were inside std::_Hash<...>::lower_bound, called on this (inverted) partial call stack: chrome.dll!std::_Hash<std::_Umap_traits<int,std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::_Uhash_compare<int,base_hash::hash<int>,std::equal_to<int> >,std::allocator<std::pair<const int,std::basic_string<char,std::char_traits<char>,std::allocator<char> > > >,0> >::lower_bound |- chrome.dll!content::BrowserAccessibilityManager::GetFromAXNode | |- chrome.dll!content::`anonymous namespace'::FindNodeWithChildTreeId | | |- chrome.dll!content::`anonymous namespace'::FindNodeWithChildTreeId | | | chrome.dll!content::`anonymous namespace'::FindNodeWithChildTreeId | | | |- chrome.dll!content::`anonymous namespace'::FindNodeWithChildTreeId Another 17% were in GetIntAttribute, called on this (inverted) partial call stack: chrome.dll!ui::AXNodeData::GetIntAttribute chrome.dll!content::`anonymous namespace'::FindNodeWithChildTreeId |- chrome.dll!content::`anonymous namespace'::FindNodeWithChildTreeId | chrome.dll!content::`anonymous namespace'::FindNodeWithChildTreeId | chrome.dll!content::`anonymous namespace'::FindNodeWithChildTreeId | chrome.dll!content::`anonymous namespace'::FindNodeWithChildTreeId I can help with further dicing/slicing/analysis of the trace if needed but I'm hopeful that this gives enough clues.
,
Aug 21
I don't know which accessibility component to set, so I'll set all three and hope one of those teams knows what to do.
,
Aug 22
I have collected more reproduction reports. Current hypothesis is that it's related to touch-screen support. Among the machines tested, every machine with touch hardware exhibits the freeze and every machine without touch hardware does not. This is readily reproducible on Surface tablets and laptops.
,
Aug 22
We have confirmed on a machine exhibiting the freeze that disabling the touch hardware in the Device Manager will cause the freeze to vanish.
,
Aug 22
,
Aug 23
The use of the bisect-builds.py tool has yielded the following result: "You are probably looking for a change made after 482747 (known good), but no later than 482762 (first known bad)." https://chromium.googlesource.com/chromium/src/+log/64500c1e30d83c949f68d560b898d2c6274b9f73..b962595bafd843504c3489fe8dbbf5e4c518528b The troublemaker is surely 57c6dff: Forward BrowserAccessibility get_accState to AXPlatformNode. This is an old commit (1 year, 2 months ago), so something external to Chrome (perhaps in this site's code) must have changed to trigger this latent behavior. In bisecting it, the display choppiness between a good build and a bad build was obvious while the page loaded: if the loading indicator spins smoothly, then it ends up being a good build, but if the loading indicator is choppy, then it ends up being a bad build. This makes me think the loading process is now full of "micro-freezes" caused by the same issue, and I just happened to find a "smoking gun" where it freezes for a long time.
,
Aug 23
->dougt, author of the commit in comment 11, to triage further.
,
Aug 23
TL;DR: Can someone get me a longer stack trace? I want to see a few frames above the recursive FindNodeWithChildTreeId calls. -- Initial analysis: When things are working correctly, FindNodeWithChildTreeId is supposed to be called just once the first time we try to find the parent node of a BrowserAccessibilityManager representing a child frame. Basically it's searching for the iframe element in the parent frame. It is a somewhat expensive recursive function, but its result is supposed to be cached so we're only paying the cost once each time an iframe is attached, so the overall cost is supposed to be negligible. This suggests that there's a bug in BrowserAccessibilityManager::GetParentNodeFromParentTree() so that the caching isn't working in this circumstance. That means it's searching through every node in the AX tree every time it's called. For example, if a child frame C thought that its parent frame was P, but there was no node in P that lists C as its child frame, then that could cause this problem. This url, https://www.shacknews.com/chatty, has more than 10k nodes, which is definitely on the large side for a web page. That means searching through all of P's nodes takes longer. To make things worse, scrolling back and forth really fast on these systems with a touch screen is probably generating hundreds of queries per second to whatever accessibility API is being triggered. The combination of all of these would explain the system locking up. -- Bisect The bisect is not very helpful so far. The change you implicated is more or less just refactoring. However, it might become more clear if we could see the full stack trace. Perhaps that refactoring resulted in a call to GetParent in a place that wouldn't have called it before. -- Without understanding the problem any further, I can think of one solution that will almost certainly work - we could have each BrowserAccessibilityManager keep track of child tree IDs as they're encountered and store them in a hash map, and then the recursive FindNodeWithChildTreeId wouldn't be necessary. I'll work on that as one option.
,
Aug 23
@brucedawson, did the trace that you got have a longer stack trace? Let me know, that would be helpful. @brian@electroly.com, could you open up chrome://accessibility on an affected machine and let me know which of the checkboxes are checked?
,
Aug 23
,
Aug 23
"Native accessibility API support" and "Web accessibility" are checked. I took a shot at turning them off and restarting the canary browser, but the freeze persisted. A better reproduction would be to replace my step 5 with waving the mouse over the back/forward/refresh/home buttons and watching the hover effect to see when it stops updating. I believe that would eliminate from consideration the side effect of the mouse wheel itself causing accessibility queries from the web content. As best I can tell, the duration of the freeze is the same if I wave the mouse over these buttons instead of scrolling the web content.
,
Aug 23
Thanks. I have a potential fix here: https://chromium-review.googlesource.com/c/chromium/src/+/1186191
,
Aug 23
Here is a stack that goes all the way to the root. The numbers are sample counts, and the indenting happens when there are sibling functions sampled. Some of the samples are on sibling-stacks that aren't shown here. The GetFocus() and IsOffScreen() functions at the bottom are siblings, both children of MSAAState(). The flame graph in comment #6 shows the leaf nodes much better and overlaps with the bottom ~10 levels of this textual call stack. Line #, Stack, Count 1, [Root], 13080 2, ntdll.dll!RtlUserThreadStart, 13080 3, kernel32.dll!BaseThreadInitThunk, 13080 4, chrome.exe!__scrt_common_main_seh, 13080 5, chrome.exe!wWinMain, 13080 6, chrome.exe!MainDllLoader::Launch, 13080 7, chrome.dll!ChromeMain, 13080 8, chrome.dll!content::ContentMain, 13080 9, chrome.dll!service_manager::Main, 13080 10, chrome.dll!content::ContentMainRunnerImpl::Run, 13080 11, chrome.dll!content::RunBrowserProcessMain, 13080 12, chrome.dll!content::BrowserMain, 13080 13, chrome.dll!content::BrowserMainRunnerImpl::Run, 13080 14, chrome.dll!content::BrowserMainLoop::RunMainMessageLoopParts, 13080 15, chrome.dll!ChromeBrowserMainParts::MainMessageLoopRun, 13080 16, chrome.dll!base::RunLoop::Run, 13080 17, chrome.dll!base::MessagePumpWin::Run, 13080 18, chrome.dll!base::MessagePumpForUI::DoRunLoop, 13080 19, |- chrome.dll!base::MessageLoop::DoWork, 13079 20, | chrome.dll!base::MessageLoop::RunTask, 13079 21, | chrome.dll!base::debug::TaskAnnotator::RunTask, 13079 22, | chrome.dll!IPC::ChannelProxy::Context::OnDispatchMessage, 13079 23, | chrome.dll!content::RenderProcessHostImpl::OnMessageReceived, 13079 24, | chrome.dll!content::RenderFrameHostImpl::OnMessageReceived, 13079 25, | chrome.dll!IPC::MessageT<AccessibilityHostMsg_EventBundle_Meta,std::tuple<AccessibilityHostMsg_EventBundleParams,int,int>,void>::Dispatch<content::RenderFrameHostImpl,content::RenderFrameHostImpl,void,void (content::RenderFrameHostImpl::*)(const AccessibilityHostMsg_EventBundleParams &, int, int)>, 13079 26, | |- chrome.dll!content::RenderFrameHostImpl::OnAccessibilityEvents, 13077 27, | | |- chrome.dll!content::BrowserAccessibilityManager::OnAccessibilityEvents, 13075 28, | | | |- chrome.dll!ui::AXTree::Unserialize, 13062 29, | | | | |- chrome.dll!content::BrowserAccessibilityManagerWin::OnAtomicUpdateFinished, 13060 30, | | | | | |- chrome.dll!content::BrowserAccessibilityComWin::UpdateStep3FireEvents, 10480 31, | | | | | | |- chrome.dll!ui::AXPlatformNodeWin::MSAAState, 10265 32, | | | | | | | |- chrome.dll!content::BrowserAccessibility::GetFocus, 5131 33, | | | | | | | |- chrome.dll!content::BrowserAccessibility::IsOffscreen, 5125 Ping me if any of this doesn't make sufficient sense.
,
Aug 24
I tested the potential fix in comment 17 and it seems to work. On the affected machine I am no longer able to induce the freeze.
,
Aug 27
,
Aug 29
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/aa77482ee81b46b9a41529e0cd4df51f6123077b commit aa77482ee81b46b9a41529e0cd4df51f6123077b Author: Dominic Mazzoni <dmazzoni@chromium.org> Date: Wed Aug 29 20:33:58 2018 Make GetParentNodeFromParentTree fast. A stack trace showed a lot of cpu time spent in FindNodeWithChildTreeId, which is called from GetParentNodeFromParentTree. The context in which this is supposed to be triggered is when calling GetParent() on the root of a child frame the first time. Previously the strategy was to search the whole parent frame for the node that has the child frame as its child tree ID, and then cache that node for the next call. If the node wasn't found, FindNodeWithChildTreeId would have been called every single time. Instead, have AXTree keep track of a map between each unique child tree ID and nodes that reference that child tree ID. Then BrowserAccessibilityManager can just query it directly with no tree search required, whether it's found or not. Bug: 876315 Change-Id: Iad432f1f43e1e65d1123c231d85f9cd42b1da52f Reviewed-on: https://chromium-review.googlesource.com/1186191 Commit-Queue: Dominic Mazzoni <dmazzoni@chromium.org> Reviewed-by: David Tseng <dtseng@chromium.org> Cr-Commit-Position: refs/heads/master@{#587279} [modify] https://crrev.com/aa77482ee81b46b9a41529e0cd4df51f6123077b/chrome/renderer/extensions/automation_ax_tree_wrapper.cc [modify] https://crrev.com/aa77482ee81b46b9a41529e0cd4df51f6123077b/chrome/renderer/extensions/automation_ax_tree_wrapper.h [modify] https://crrev.com/aa77482ee81b46b9a41529e0cd4df51f6123077b/chrome/renderer/extensions/automation_internal_custom_bindings.cc [modify] https://crrev.com/aa77482ee81b46b9a41529e0cd4df51f6123077b/content/browser/accessibility/browser_accessibility_manager.cc [modify] https://crrev.com/aa77482ee81b46b9a41529e0cd4df51f6123077b/content/browser/accessibility/browser_accessibility_manager.h [modify] https://crrev.com/aa77482ee81b46b9a41529e0cd4df51f6123077b/ui/accessibility/ax_tree.cc [modify] https://crrev.com/aa77482ee81b46b9a41529e0cd4df51f6123077b/ui/accessibility/ax_tree.h [modify] https://crrev.com/aa77482ee81b46b9a41529e0cd4df51f6123077b/ui/accessibility/ax_tree_unittest.cc
,
Sep 10
|
|||||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||||
Comment 1 by pkasting@chromium.org
, Aug 21Components: -UI