Mach shared memory can cause renderer hang.
Reported by
alli...@saucelabs.com,
Mar 25 2016
Issue description

UserAgent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Firefox/45.0

Example URL: http://google.com

Steps to reproduce the problem:
Here are the steps I follow to trigger my problem:
1. Using kvm-enabled qemu, launch a Mac VM (such as 10.9 - this issue was seen to affect OS X 10.8 through 10.11).
2. Download Chrome 49 (https://www.google.com/chrome/browser/desktop/) using Safari/Firefox
3. Launch Chrome

What is the expected behavior?
The two starting tabs load:
"Chrome": chrome://chrome-signin/?access_point=0&reason=0
"Getting Started": https://www.google.com/intl/en/chrome/browser/welcome.html
Additional user-created tabs can also request urls and load them

What went wrong?
Of the two starting tabs, the chrome://chrome-signin tab either never loads, or appears to have loaded, but nothing is rendered on the screen. (In these cases, the Network tab shows that the page resources have been successfully retrieved, although they are not visible.)
The "Getting Started" tab either never loads and spins forever, or successfully loads. When it successfully loads, any links on the page may be successfully clicked and loaded, and it's possible to edit elements via "Inspect element" to serve links to any desired webpage, and these successfully load. Sometimes, when this "Getting Started" tab loads successfully, inputting a new url in the address bar is also successful; other times, inputting a url and hitting enter appears to do nothing whatsoever - the page doesn't appear to be loading, and external network traffic logging indicates that no request is issued.
Regardless of how the first two tabs behave, additional user-created tabs are unable to load anything.
I've uploaded two gifs here that demonstrate the issue, as well as the Guest profile workaround described below: http://imgur.com/a/QqMg9

Does it occur on multiple sites: Yes

Is it a problem with a plugin? No

Did this work before? Yes Chrome 48 and below

Does this work in other browsers? Yes

Chrome version: 49.0.2623.108 Channel: stable
OS Version: OS X 10.9
Flash Version: Shockwave Flash 21.0 r0

Interestingly, the issue goes away when switching the profile to a Guest profile - in this case, everything loads as expected. Switching away from this Guest profile causes the issue to come back.
A copy of Chrome 48 on the same machine works without issue, as do Firefox and Safari.
Google's search suggestions also appear successfully as text is written to the address bar. Hitting enter doesn't seem to cause the url to load.
Launching Chrome with the switches "--enable-logging --v=1" and inspecting the resulting chrome_debug.log, there was only one error, which may or may not be related:
[4303:1287:0324/171016:WARNING:vt_video_decode_accelerator_mac.cc(205)] Failed to create hardware VideoToolbox session. Hardware accelerated video decoding will be disabled.
The Guest profile can load chrome://net-internals (and any other chrome:// pages), but the Default profile cannot.
I verified that the latest dotrelease (49.0.2623.108) did not solve the issue.
I'm more than happy to provide any additional details that would help. Many thanks for your time!
,
Mar 28 2016
Hi there! We actually came upon that suggestion when looking for a fix on our own, but the VMs that this is failing on don't have a Default user profile to begin with (it gets generated on the first run of Chrome 49). We've also tried having Chrome 48 generate a Default user profile for us (which gives us a working Chrome 48 instance), and then quitting out of Chrome 48 and starting a Chrome 49 instance afterwards to no avail. Interestingly enough, the opposite seems to work fine (having Chrome 49 generate the default profile and then using Chrome 48 to browse the web). I'm wondering if there is anything else we can do on our end to give you more information? I know getting a data dump from chrome://net-internals has been requested before, but we can really only access the chrome:// pages right now if we use the Guest profile, so I'm not sure how much use that would be. One other option is that we can set you up with remote access to one of our VMs that are exhibiting this problem so that you can see it first hand and poke around on the machine yourselves if you would like! Thanks in advance for all of your help! It's much appreciated :)
,
Mar 29 2016
Thanks for the response! I'm afraid that removing the default user profile didn't solve the issue, as my colleague mentioned (we're trying to solve this together). Unfortunately it's happening on a fresh install. We're following this similar issue: https://bugs.chromium.org/p/chromium/issues/detail?id=595968 It mentions old hardware, and the virtualized CPU we're using is also a Core 2 Duo, for what it's worth. If there's anything we can provide to help investigate this further, please do let us know! Thanks again for your help.
,
Mar 29 2016
One thing that was mentioned in issue 595968 was taking a look at the task manager to see if processes were being spawned for each tab, and it looks like in our case they are. I've attached another .gif that shows this behavior. One thing that might be worth noting is that the browser doesn't seem to respond to the "Stop" and "Refresh" commands.
,
Mar 30 2016
Thank you for providing more feedback. Adding requester "dtapuska@chromium.org" for another review and adding "Needs-Review" label for tracking. For more details visit https://sites.google.com/a/chromium.org/dev/issue-tracking/autotriage - Your friendly Sheriffbot
,
Mar 30 2016
,
Mar 30 2016
Could one of you provide a network log as well? Instructions: https://sites.google.com/a/chromium.org/dev/for-testers/providing-network-details I don't think this is a network bug, but best to be sure.
,
Mar 30 2016
Hi! Unfortunately the chrome://net-internals page doesn't load using the default profile, but I was able to start a Guest session and use net-internals to capture information from the default profile window (which I didn't know was possible before). Please see the attached :)
,
Mar 30 2016
Also, if you would like to see this behavior first hand, we can still set you up with a VM to try out :)
,
Mar 30 2016
The fact that net-internals fails to load strongly indicates this is an issue starting up / setting up the child process. The log you uploaded also supports that theory - the only network requests are for google.com, from the omnibox, as you typed "twitter.com", and one for http://hello/. None for twitter, so the new process never made any network requests. I assume devtools shows the same behavior? If you run chrome with "--no-sandbox" or "--single-process", does it work? You should never use either of these command line flags in production; this is purely for diagnostic purposes.
,
Mar 30 2016
And thanks for the offer of a VM, and your helpfulness in general. Someone may take you up on it, I'm just trying to determine who to pass this bug to.
,
Mar 30 2016
You're very welcome! We really appreciate the time you are taking to look into this as it's a fairly big issue for us considering having Chrome 49 available is a big part of our product offering. I tried out the --no-sandbox and --single-process switches both individually and together, and unfortunately the problem still persists. You are right about devtools showing no network activity when I use the omnibox to try and request a new URL. I've attached a screenshot where I've asked it to get hello.com, and as you can see, nothing new showed on the network monitor.
,
Mar 30 2016
That's unexpected - I would have thought devtools would be blank and hanging, too, just like the webpage, since chrome pages are hanging. Going to keep the navigation and blink labels, because that seems the place to start.
,
Mar 30 2016
Does chrome://settings work? If the only chrome:// URL failing is the initial sign-in page, that's a special case that involves loading some content from the web. I'm not sure what's special about the Getting Started process (and Guest profile), but it sounds like most other renderer processes are unable to make network requests for some reason. Typing in a cross-site URL like cnn.com will normally start a new process, just like going to it in a different tab. Clicking a link will normally stay in the same process, and that's working for you (at least in a process that is already working, like the Getting Started page). You can confirm this by running with --process-per-tab and typing cnn.com into the omnibox in the Getting Started tab. That won't do a process swap and it should work. Conversely, you can run with --site-per-process and we'll do a process swap on link clicks as well, which I would expect to fail. Unfortunately, I have no idea why network requests wouldn't work in most renderer processes. This would just help narrow down the cause a bit.
,
Mar 30 2016
chrome://net-internals/ is also apparently not working with the default profile.
,
Mar 30 2016
Just to clarify, none of the chrome:// pages load when using the default profile (including chrome://settings). Also, I decided to try a few more instances of opening up devtools, and it seems to only intermittently open up the devtools box. I've attached another .gif showing chrome://settings not loading, even though I'm trying it in a tab that can load other web pages without problems (the .gif also shows devtools not opening). Going to try the --process-per-tab switch and report back my findings.
,
Mar 30 2016
One interesting finding that we had was that if the working "Getting Started" tab spawns other tabs through link clicks, those tabs seem to work without any problems. spawnedtabs.gif shows this in action.
@creis regarding the --process-per-tab and --site-per-process switches, the info I gathered is as follows:
1. --process-per-tab works as you expected with loading cnn.com in the working "Getting Started" tab (this pretty much behaves the same as if I were not to use the switch).
2. Using --site-per-process, I was still able to load links, webpages and spawn new working tabs from the "Getting Started" tab (which, again, is similar behavior to what would happen if I were not to use the switch).
I attached the siteperprocess.gif where I show what I was doing with the --site-per-process switch on.
,
Mar 30 2016
Sounds like my theory might be wrong. Then again, you're only browsing google.com pages in all those videos, so we wouldn't be doing process swaps in any of them. What happens if you actually go cross-site to cnn.com or chromium.org? My theory was that any time you switch to a new process, it's unlikely to work. You can verify whether a new process is created or not using Chrome's Task Manager and seeing if the Process ID for the tab changes.
,
Mar 30 2016
I gave the --site-per-process thing another shot, this time trying to go to cnn.com, and it does indeed hang on anything that isn't a google domain. I had the task manager open while doing this, and it doesn't appear to be able to spawn a new process for cnn.com. You can see this in siteperprocesscnn.gif. I also included a gif of a --process-per-tab session for good measure.
,
Apr 4 2016
@creis in issue 595968 which seems to be very related (especially since using the Guest Profile also fixes the issue for the OP) the OP mentions that he's on a laptop using a Core 2 Duo as does another poster in that thread. This is a little suspicious since in our VMs we're currently emulating Core 2 Duo architecture as well. Off the top of your head, do you know of any recent changes in 49 that might cause this sort of incompatibility?
,
Apr 4 2016
I'm afraid I don't know enough in that area. Adding Avi and Mark in case they have any ideas about these two bugs, and how Core 2 Duo on Mac might have somehow regressed in M49. So far, it does sound like certain renderer processes end up getting stuck unable to navigate.
,
Apr 4 2016
I appreciate all of your help so far in this anyways @creis :) Hoping we can figure out what's going on soon! On our end, we're going to try cloning one of our 10.9 VMs and having it run on a Sandy Bridge emulation instead to see if that fixes the issue. Will report back if there's any progress.
,
Apr 5 2016
,
Apr 5 2016
More reports from issue 595968 that users running into this are using a Core 2 Duo. It could be a race that is more likely to surface when there are only two cores, I suppose.
,
Apr 5 2016
Updating title, to make issue more discoverable.
,
Apr 5 2016
Mark, could you help triage this? Avi said you might know more about the lower levels where we're having trouble with Core 2 Duo users.
,
Apr 5 2016
Is this a VM-only problem or has anyone been able to reproduce on a bare machine?
,
Apr 5 2016
@mark This is not only a VM problem. I originally reported it on my 2010 MacBook Air if you look at the other ticket that was just merged into this one.
,
Apr 5 2016
My Late 2009 Mac Mini is affected by this bug. Not a VM.
,
Apr 5 2016
Got it, thanks. I think that comment 24 is the most likely explanation so far, rather than it being a problem with the Core 2 architecture itself.
,
Apr 5 2016
We're actually only simulating 1 core in our VMs where this error occurs, so having two cores MIGHT not be the issue?
,
Apr 5 2016
I apologize for stating earlier that we're using Core 2 Duos; we're actually using Core 2 Solos. That was my mistake.
,
Apr 5 2016
Can we get samples of all of the involved processes when you’re seeing this?
,
Apr 5 2016
Hi Mark! I've attached two .gifs displaying the task manager while I work through different tabs. This is what you were looking for, right? The first run is in processes.gif, and I figured that it might help to include a .gif of a subsequent run where it wasn't doing the "first run" stuff, in processes2.gif. I've offered this before, but if you like we can easily set you up with a trial account on our service so that you can see this behavior firsthand and poke around on the VM.
,
Apr 5 2016
@mark just wanted to point out since I was the person who created the ticket that got merged into this one. The behavior I have been noticing (as well as the other people who commented on that ticket) is slightly different than the one described in this ticket. Occasionally I have seen the new tab page fail to load any page or do anything at all like in the gif attached here, but more often random pages will hang and fail to load and spin for a while. They show up in the task manager, but never actually make any network requests to load the page. It definitely feels like a race condition since it seems completely random when it happens. During the time when it is stuck the stop and reload buttons do not work, but killing the task in the task manager allows you to reload it at which point it will work again. It also applies to loading of extensions (certain extensions will randomly fail to load) when the browser is started up. After killing them and getting the message that they crashed and click to reload, they work again. I am pretty sure that is the more common behavior people are seeing. I can attach a GIF later to illustrate the behavior if you would like.
,
Apr 5 2016
No, I want you to run the “sample” tool and attach its output. You can use the Task Manager to find the relevant PIDs, and run “sample 1234 > sample_1234.txt” substituting the PID you’re interested in for 1234. Do this for each wedged process, and attach the captured samples here.
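The per-process sampling requested above can be scripted when several processes are wedged at once. This is only a rough sketch: it assumes a macOS machine with the `sample` tool on the PATH and PIDs read by hand from Chrome's Task Manager. The helper names (`sample_command`, `output_path`, `sample_all`) are made up for illustration; the command and file naming simply mirror "sample 1234 > sample_1234.txt".

```python
import subprocess
from pathlib import Path


def sample_command(pid):
    """Build the argv that mirrors `sample <pid>` from the comment above."""
    if pid <= 0:
        raise ValueError(f"not a valid PID: {pid}")
    return ["sample", str(pid)]


def output_path(pid, out_dir=Path(".")):
    """Name the capture file the same way as `sample_1234.txt`."""
    return Path(out_dir) / f"sample_{pid}.txt"


def sample_all(pids, out_dir=Path(".")):
    """Run `sample` once per PID, redirecting stdout to one file per process.

    Only meaningful on macOS; raises CalledProcessError if `sample` fails.
    """
    written = []
    for pid in pids:
        target = output_path(pid, out_dir)
        with target.open("w") as out:
            subprocess.run(sample_command(pid), stdout=out, check=True)
        written.append(target)
    return written
```

For example, `sample_all([1234, 5678])` would be called with the PIDs of the "Browser" process and one hung renderer (both PIDs here are placeholders), and the resulting text files attached to the bug.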
,
Apr 5 2016
Hi Mark! I've attached three sample files that I obtained using the following steps:
1. Started up Google Chrome and opened Task Manager
2. Opened a new Tab
3. Started all three sample instances to capture the processes "Browser", "GPU Process" and the "new Tab" over a period of 120 seconds
4. Typed CNN.com in the new tab and hit enter (tab didn't respond as usual)
Hopefully these are what you're looking for, but if not please let me know what else I can grab for you. Cheers!
,
Apr 5 2016
Symbolized.
,
Apr 5 2016
Fixed symbolized gpu sample.
,
Apr 6 2016
While I've starred this bug and 595968 for some time, I thought I had no additional information. Being on OS X 10.9.5, Intel Core 2 Duo (natively, not a VM), I also noticed this random failure to load pages, starting suddenly with Chrome 49. However, I just noticed that this also happens if I start Chrome anew and it loads the local starting page listing the 8 top visited pages. I think nobody mentioned this before, but it looks like the very same issue. Sometimes this starting page is shown, other times the window remains completely empty. No network should be involved to display this page. Also, no spinning counter is shown in the tab, while a spinning counter is shown for web pages failing to load. Everything else is identical to the failing load of web pages. It sometimes works, it sometimes doesn't. Really strange and very annoying.
,
Apr 6 2016
The exact same issue is happening to me (natively) on a MacBook Pro Mid 2009 Core 2 Duo and 10.9.5. I have the same version of Chrome and OS on a newer Mac Mini 2012 or so (so a different architecture) and I don't have the issue there.
,
Apr 11 2016
Just an update from our side: we managed to get Chrome 49 working on a test VM running OS X 10.9.5 by switching over to Sandy Bridge emulation rather than using the Core 2 Solo we were using before. This isn't really anything new, but I just thought it might be useful to have another data point that supports this issue being related to compatibility between the Core 2 architecture and Chrome 49.
,
Apr 14 2016
Hi guys! I just confirmed that unfortunately this problem still exists on the new stable released today (50.0.2661.75). Is there currently still an effort to get this resolved? Thanks!
,
Apr 14 2016
Yeah. I pointed out in the other ticket that it still was not working correctly in Canary (51 at the time).
,
Apr 19 2016
Hi Mark, In case additional samples could be helpful, I have attached samples from this happening on a non VM machine.
,
Apr 19 2016
Hi, I'm also seeing this bug since Chrome 49. I'm running OS X 10.11.4 on a Mid-2010 MacBook Pro which uses a Core2Duo processor. Let me know if you need any further debug info; this issue is prevalent enough to be annoying, i.e. multiple times per hour. No specific URL affected; I agree that it feels like a race condition - closing the tab and loading the URL again usually fixes the issue on a per-incident basis. Cheers, Ian
,
Apr 20 2016
,
Apr 20 2016
I just did a bisect of all of the builds between Chrome 48 and Chrome 49 to try to figure out where this broke. I have narrowed it down to a couple of commits.
Build 361868 works fine.
Build 361871 does not work.
I am fairly sure this is the commit that broke it:
https://chromium.googlesource.com/chromium/src/+/8e53d1be210193655b682ab381c3377700705067
Looking at the notes in the commit:
> Speculative revert: 3 mac perf bots started timing out in random tests around the time this patch landed. I'll keep an eye on the waterfall and reland if things don't improve.
Then it was relanded because
> Relanding. Looks like the failures may have been unrelated after all.
I really hope someone from the Chromium team can now take a look at this since I spent hours narrowing this down for you :)
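The bisect described above is essentially a binary search over the available snapshot builds between a known-good and a known-bad one. A minimal sketch of that search follows, with two loud assumptions: `is_broken` is a hypothetical predicate that a person answers by installing the build and checking for hung tabs, and the build numbers 361850 and 361900 in the usage example are invented padding (only 361868 and 361871 come from this comment). Because the bug is a race, a real run would likely need several attempts per build before calling it good.

```python
def bisect_builds(builds, is_broken):
    """Return (last_good, first_bad) from a sorted list of build numbers.

    Assumes builds[0] is known good and builds[-1] is known bad, and that
    every build after the first bad one is also bad.
    """
    if len(builds) < 2:
        raise ValueError("need at least one good and one bad build")
    lo, hi = 0, len(builds) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_broken(builds[mid]):
            hi = mid  # the regression is at or before mid
        else:
            lo = mid  # the regression is after mid
    return builds[lo], builds[hi]


# Usage: 361850 and 361900 are hypothetical; the predicate just encodes
# the outcome reported in the comment (361868 works, 361871 does not).
available = [361850, 361868, 361871, 361900]
result = bisect_builds(available, lambda build: build >= 361871)
```

Each step halves the remaining range, so even a few thousand snapshot builds need only around a dozen manual checks.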
,
Apr 20 2016
Moving skyostil@ to owners. Really appreciate that you ran the bisect.
,
Apr 20 2016
Thanks for the bisect! Passing this over to erikchen@ who is the original author of that CL.
,
Apr 20 2016
Just wanted to clarify that I was testing the behavior mentioned in issue 595968 . Someone from saucelabs will have to verify my findings for this ticket since I don't have a way to test that, but it is likely the same issue :)
,
Apr 20 2016
Will do, Craig! I still feel like our issues stem from the same problem, even though our symptoms seem to be slightly different. Thanks for getting the ball rolling again on this. Great find! :)
,
Apr 20 2016
iamcraig: Please follow the steps in Comment 36 to get samples of "Browser" after it has wedged. Both iamcraig and aliao: Follow the steps in https://www.chromium.org/developers/how-tos/trace-event-profiling-tool/recording-tracing-runs to record a tracing run. After you've started the browser, make sure that starting a trace is the first thing you do. Then cause the browser to wedge and stop the tracing run. aliao: If you can give me access to a VM, I'll take a look directly as well.
,
Apr 20 2016
@erikc My samples are attached in comment #45. I will try to do the tracing run later tonight.
,
Apr 20 2016
iamcraig: Unfortunately, those are samples of the wrong process. Use Chrome's Task Manager to find out the PID of the "Browser" process and grab a sample of that. It should say "Google Chrome" instead of "Google Chrome Helper"
,
Apr 20 2016
@erikc could that be related to the issue? I am almost certain that I copied the correct Process IDs from the Task Manager. Could it be that when the tabs are in this “uninitialized spinning” state they show up as Chrome Helper processes? I will try it again later tonight.
,
Apr 20 2016
Let's try this: Open your task manager and take a screenshot using cmd+shift+4. Then get samples for "Browser", whose name should be "Google Chrome" and a hung renderer, whose name should be "Google Chrome Helper".
,
Apr 20 2016
Hi Erik! I've attached a zip file containing the trace that you requested, along with a .gif showing what I was doing along with the trace. Please let me know if this isn't sufficient and if you need anything else from me. I'm sending you an email right now with some instructions on getting you set up with a VM that exhibits this behavior so you can poke around yourself if you like! :) Cheers!
,
Apr 20 2016
@erikc Sorry I totally misread that you were asking for “Browser” samples. The instructions in comment 36 mentioned finding the pids for the individual stuck processes. I will get that and the tracing run for you later, but looks like aliao has some additional info in the meantime :)
,
Apr 20 2016
Interesting observations: The problem in aliao's VMs is definitely related to Mach shared memory. You can verify this by adding one of the following command line flags to Chrome 49. Adding the first causes Chrome to break. Adding the second causes Chrome to work.
--force-fieldtrials=MacMemoryMechanism/Mach
--force-fieldtrials=MacMemoryMechanism/Posix
iamcraig: Can you verify this with your non-VM machine?
The NTP doesn't appear to have any trouble loading. Furthermore, launching a site with "./out/Debug/Chromium.app/... wikipedia.org" also works. When it loads, navigating by clicking on links works fine. New tabs, however, cannot perform any navigations: if you open a new tab, all navigations in it fail. If you then press the "X" button and wait, Chrome will put up the "Page not responding" dialog, which implies that the browser process is having trouble communicating with the new renderer.
I added logging around all direct uses of Mach ports (MapAt, Duplicate, Create, etc.) but everything seems to work fine.
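To compare the two memory mechanisms repeatedly, the launch command can be built by a small helper. This is a sketch under assumptions: the binary path below is the usual install location for Chrome on OS X and may differ on a given machine, and the names `chrome_argv` and `launch` are illustrative. The only flag used is the `--force-fieldtrials=MacMemoryMechanism/...` switch quoted in the comment above.

```python
import subprocess

# Assumed default install location; adjust for the machine under test.
CHROME_BINARY = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"

MECHANISMS = ("Mach", "Posix")


def chrome_argv(mechanism, binary=CHROME_BINARY, extra_args=()):
    """Build the argv that forces one shared memory mechanism."""
    if mechanism not in MECHANISMS:
        raise ValueError(f"expected one of {MECHANISMS}, got {mechanism!r}")
    flag = f"--force-fieldtrials=MacMemoryMechanism/{mechanism}"
    return [binary, flag, *extra_args]


def launch(mechanism, extra_args=()):
    """Start Chrome with the forced mechanism and return the process handle."""
    return subprocess.Popen(chrome_argv(mechanism, extra_args=extra_args))
```

Calling `launch("Mach")` should reproduce the hang on an affected machine, and `launch("Posix")` should avoid it, matching the behavior reported in this thread.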
,
Apr 21 2016
@erikc I can confirm that when using --force-fieldtrials=MacMemoryMechanism/Posix everything works, and when using Mach, tabs occasionally hang. I have attached the new samples you have requested. Unfortunately, I cannot get tracing to work. When I press the record button, nothing happens. It's supposed to bring up a dialog with options from what I understand. I don't get anything. No errors in the console either.
,
Apr 21 2016
I tried again and was able to grab a trace. For some reason the tracing behavior also seems to exhibit the same tendencies (clicking record and it hanging intermittently). Although it could potentially be related to issue 604885 .
,
Apr 21 2016
Thanks for all the help, iamcraig and aliao. The problem is a race condition that is *almost* never hit, because it requires some very odd thread scheduling. Thread A posts a task onto thread B, and is immediately interrupted. Thread B runs the task to completion, and then Thread A continues its task. Working on a fix.
,
Apr 21 2016
,
Apr 21 2016
I'm going to merge the fix to M51, but we may want to consider using Finch to turn off Mach ports for M49 and M50. I'm inclined to do so. shrike@, opinion?
,
Apr 21 2016
To give slightly more detail about the bug: We were connecting IPC channels, and then registering them with the attachment broker. This means that if there was a Mach port sent to a newly created process that was waiting to be processed, and if connecting the IPC channel pauses the current task immediately [instead of executing the next line of code, which would register the channel with the attachment broker], switches threads to the newly posted task and processes that, then all future IPC messages to the new process will be hosed, because we will have effectively lost an attachment brokering message.
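The interleaving described above can be made concrete with a toy model. This is not Chromium code: `ToyBroker` and `start_child_buggy` are invented names, and the thread switch is simulated with a boolean rather than real threads so that the unlucky schedule is reproducible. It only illustrates why "connect, then register" loses a brokering message when the posted task runs before the registration line.

```python
class ToyBroker:
    """Stand-in for an attachment broker that only serves registered channels."""

    def __init__(self):
        self.registered = set()
        self.delivered = []
        self.lost = []

    def register(self, channel_name):
        self.registered.add(channel_name)

    def handle_brokering_message(self, channel_name, attachment):
        # A message for an unregistered channel is dropped, which models
        # the "effectively lost" attachment brokering message.
        if channel_name in self.registered:
            self.delivered.append(attachment)
        else:
            self.lost.append(attachment)


def start_child_buggy(broker, pending, preempted_after_connect):
    """Connect first, register second (the pre-fix order).

    Returns True if every pending attachment was delivered.
    """
    name = "renderer"

    def posted_task():
        # Connecting posts this task to another thread; it drains messages.
        for attachment in pending:
            broker.handle_brokering_message(name, attachment)

    if preempted_after_connect:
        posted_task()          # other thread runs the task to completion first
        broker.register(name)  # original thread resumes too late
    else:
        broker.register(name)  # usual schedule: next line runs immediately
        posted_task()
    return not broker.lost


usual = start_child_buggy(ToyBroker(), ["mach_port"], preempted_after_connect=False)
unlucky = start_child_buggy(ToyBroker(), ["mach_port"], preempted_after_connect=True)
```

Under the usual schedule the attachment is delivered; under the unlucky one it lands in `lost`, which corresponds to the renderer that never becomes able to navigate.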
,
Apr 21 2016
Seems like using Finch to prevent MacMemoryMechanism/Mach for M49 and M50 is the right thing. Can you provide a bit more info about this experiment/pointer to bug or design doc?
,
Apr 21 2016
,
Apr 21 2016
,
Apr 21 2016
Thanks for the design doc links (very well written!).
,
Apr 22 2016
Removing from triaging queue.
,
Apr 25 2016
Yeah, I'm in favor of turning this off for M49 and M50 via Finch, and having the fix in M51. Thanks for all the research!
,
Apr 25 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/9097190cafdecfb1b5796b65c83af92e861ccaf8

commit 9097190cafdecfb1b5796b65c83af92e861ccaf8
Author: erikchen <erikchen@chromium.org>
Date: Mon Apr 25 23:45:31 2016

IPC: Fix attachment brokering race condition.

A channel must be registered as a broker communication channel before it is connected. When possible, invert the sequence of the call to connect a channel, and the call to register the channel as a broker. In some cases, the channel constructor and the channel initializer had to be separated, so that the registration could happen in between.

This requirement is now enforced by a CHECK, which verifies that a channel cannot be registered after it is connected.

BUG= 598088
Review URL: https://codereview.chromium.org/1903663004
Cr-Commit-Position: refs/heads/master@{#389618}

[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/components/nacl/loader/nacl_listener.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/content/browser/renderer_host/render_process_host_impl.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/content/child/child_thread_impl.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/content/common/child_process_host_impl.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/ipc/attachment_broker_mac_unittest.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/ipc/attachment_broker_privileged_win_unittest.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/ipc/ipc_channel.h
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/ipc/ipc_channel_common.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/ipc/ipc_channel_nacl.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/ipc/ipc_channel_posix.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/ipc/ipc_channel_proxy.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/ipc/ipc_channel_proxy.h
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/ipc/ipc_channel_win.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/ipc/ipc_endpoint.h
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/ipc/ipc_test_base.h
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/ipc/mojo/ipc_channel_mojo.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/remoting/host/desktop_process.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/remoting/host/desktop_session_agent.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/remoting/host/ipc_util.h
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/remoting/host/ipc_util_posix.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/remoting/host/ipc_util_win.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/remoting/host/remoting_me2me_host.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/remoting/host/win/unprivileged_process_delegate.cc
[modify] https://crrev.com/9097190cafdecfb1b5796b65c83af92e861ccaf8/remoting/host/win/wts_session_process_delegate.cc
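The ordering rule in this commit message (register as a broker channel first, connect second, with a CHECK rejecting the reverse order) can be illustrated with a toy class. This is a hedged sketch, not the actual IPC::Channel API: `ToyChannel`, `register_with_broker` and `connect` are invented names, the broker is modeled as a plain set, and a `RuntimeError` stands in for the CHECK.

```python
class ToyChannel:
    """Toy channel that enforces 'register before connect'."""

    def __init__(self, name):
        self.name = name
        self.connected = False

    def register_with_broker(self, broker_channels):
        # Stand-in for the CHECK: registering after connecting is a
        # programming error, so fail loudly instead of racing.
        if self.connected:
            raise RuntimeError(
                "channel must be registered before it is connected")
        broker_channels.add(self.name)

    def connect(self):
        # After this point messages may be processed on another thread,
        # so the broker must already know about the channel.
        self.connected = True


# Correct order: registration is complete before any message can arrive.
broker_channels = set()
channel = ToyChannel("renderer")
channel.register_with_broker(broker_channels)
channel.connect()
```

Splitting construction from initialization, as the commit message describes, is what creates the gap in which registration can happen before the channel starts processing messages.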
,
Apr 26 2016
,
Apr 26 2016
,
Apr 26 2016
Transitioning to Mach shared memory fixed a common GPU hang. After turning off Mach shared memory, we saw a sudden spike (thousands) of GPU hangs: https://bugs.chromium.org/p/chromium/issues/detail?id=560875 This may be worse than the very rare users who see a stuck renderer. So we may want to consider turning on Mach shared memory for M49 and M50 again...
,
Apr 26 2016
Yikes! If I'm following this correctly, the real fix for this issue is going to be put into M51, but the hotpatch to fix this in M50 and M49 was to have Chrome default to using Posix instead of Mach, and this is causing issue 560875 to become worse?
,
Apr 26 2016
That's right. Switching to Mach substantially reduced GPU hangs. Unfortunately, we don't have a good metric for the number of users who are seeing hung renderers so it's hard to make a call on this one.
,
Apr 26 2016
Your change meets the bar and is auto-approved for M51 (branch: 2704)
,
Apr 26 2016
Note that the crash "spike" used to be default behavior (M47 and older) so we're just delaying the milestone that fixes it.
,
Apr 26 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/a132709b08846c8ac75bead8d97a0fec2cb30ace

commit a132709b08846c8ac75bead8d97a0fec2cb30ace
Author: erikchen <erikchen@chromium.org>
Date: Tue Apr 26 23:28:02 2016

IPC: Fix attachment brokering race condition.

A channel must be registered as a broker communication channel before it is connected. When possible, invert the sequence of the call to connect a channel, and the call to register the channel as a broker. In some cases, the channel constructor and the channel initializer had to be separated, so that the registration could happen in between.

This requirement is now enforced by a CHECK, which verifies that a channel cannot be registered after it is connected.

BUG= 598088

Review URL: https://codereview.chromium.org/1903663004
Cr-Commit-Position: refs/heads/master@{#389618}
(cherry picked from commit 9097190cafdecfb1b5796b65c83af92e861ccaf8)

Review URL: https://codereview.chromium.org/1917333002 .
Cr-Commit-Position: refs/branch-heads/2704@{#258}
Cr-Branched-From: 6e53600def8f60d8c632fadc70d7c1939ccea347-refs/heads/master@{#386251}

[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/components/nacl/loader/nacl_listener.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/content/browser/renderer_host/render_process_host_impl.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/content/child/child_thread_impl.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/content/common/child_process_host_impl.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/ipc/attachment_broker_mac_unittest.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/ipc/attachment_broker_privileged_win_unittest.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/ipc/ipc_channel.h
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/ipc/ipc_channel_common.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/ipc/ipc_channel_nacl.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/ipc/ipc_channel_posix.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/ipc/ipc_channel_proxy.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/ipc/ipc_channel_proxy.h
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/ipc/ipc_channel_win.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/ipc/ipc_endpoint.h
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/ipc/ipc_test_base.h
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/ipc/mojo/ipc_channel_mojo.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/remoting/host/desktop_process.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/remoting/host/desktop_session_agent.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/remoting/host/ipc_util.h
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/remoting/host/ipc_util_posix.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/remoting/host/ipc_util_win.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/remoting/host/remoting_me2me_host.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/remoting/host/win/unprivileged_process_delegate.cc
[modify] https://crrev.com/a132709b08846c8ac75bead8d97a0fec2cb30ace/remoting/host/win/wts_session_process_delegate.cc
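The ordering rule in the commit message above (register the channel with the broker first, connect it second, and refuse a late registration) can be illustrated with a minimal sketch. This is not the Chromium code: SketchChannel, SketchBroker and RegisterChannel are hypothetical names made up for illustration, and the sketch throws an exception where the real change uses a CHECK, only so that the failure is observable without aborting the process.

```cpp
#include <set>
#include <stdexcept>

// Hypothetical stand-in for an IPC channel. It only tracks whether it has
// been connected, i.e. whether it could already be processing messages.
class SketchChannel {
 public:
  void Connect() { connected_ = true; }
  bool connected() const { return connected_; }

 private:
  bool connected_ = false;
};

// Hypothetical stand-in for an attachment broker that must know about a
// channel before any message arrives on it.
class SketchBroker {
 public:
  // Enforces the invariant described in the commit message: a channel
  // cannot be registered after it is connected. The real change uses a
  // CHECK; this sketch throws instead (an assumption made for testability).
  void RegisterChannel(SketchChannel* channel) {
    if (channel->connected())
      throw std::logic_error("channel registered after it was connected");
    channels_.insert(channel);
  }

  bool IsRegistered(const SketchChannel* channel) const {
    return channels_.count(channel) > 0;
  }

 private:
  std::set<const SketchChannel*> channels_;
};

// Safe ordering: register first, then connect. Returns true when the
// channel is both registered and connected afterwards.
inline bool RegisterThenConnect(SketchBroker& broker, SketchChannel& channel) {
  broker.RegisterChannel(&channel);
  channel.Connect();
  return broker.IsRegistered(&channel) && channel.connected();
}
```

With this shape, calling Connect() before RegisterChannel() fails loudly at the registration call instead of leaving a window in which messages with attachments arrive on a channel the broker does not know about.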
,
Apr 26 2016
Issue 560875 has been merged into this issue.
,
Apr 27 2016
Mach shared memory is fixed on M51+, and turned off via Finch on M50 and older.
,
Apr 27 2016
,
Apr 27 2016
,
May 6 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/75ca2da31f0449be76215e6c4cddd371d2631e50

commit 75ca2da31f0449be76215e6c4cddd371d2631e50
Author: erikchen <erikchen@chromium.org>
Date: Fri May 06 17:43:17 2016

[Merge to M50] IPC: Fix attachment brokering race condition.

This CL is a merge of the safest and most important pieces of https://codereview.chromium.org/1917333002/ to M50.

There is a race condition where an IPC::Channel can start processing messages before it is registered as a communication channel with the attachment broker. This CL reorders channel connection and registration.

BUG= 598088 , 609262
R=avi@chromium.org, mseaborn@chromium.org, tsepez@chromium.org

Review URL: https://codereview.chromium.org/1960513002 .
Cr-Commit-Position: refs/branch-heads/2661@{#660}
Cr-Branched-From: ef6f6ae5e4c96622286b563658d5cd62a6cf1197-refs/heads/master@{#378081}

[modify] https://crrev.com/75ca2da31f0449be76215e6c4cddd371d2631e50/components/nacl/loader/nacl_listener.cc
[modify] https://crrev.com/75ca2da31f0449be76215e6c4cddd371d2631e50/content/browser/renderer_host/render_process_host_impl.cc
[modify] https://crrev.com/75ca2da31f0449be76215e6c4cddd371d2631e50/content/child/child_thread_impl.cc
[modify] https://crrev.com/75ca2da31f0449be76215e6c4cddd371d2631e50/content/common/child_process_host_impl.cc
[modify] https://crrev.com/75ca2da31f0449be76215e6c4cddd371d2631e50/ipc/ipc_channel_proxy.h
,
Jul 26 2016
erikchen@ - the patch in c#73 doesn't seem to touch any files that are directly related to the Mac. But given that you landed it here and marked the bug as fixed I guess that change does affect the Mac? Does it also prevent a similar race condition on Windows and Linux?
,
Jul 26 2016
It affects a race condition on Windows and Mac for brokering HANDLEs and Mach ports, respectively. Note that this mostly affects releases around M51, since MojoChannel has been turned on for most channels in tip of tree.
,
Jul 26 2016
I'm hunting a regression, and this change (landed in beta) stands out the most. But can you tell me more about the Mojo stuff? I guess it handles the attachment brokering so that a regression caused by this change might no longer exist? Is MojoChannel turned on on the Mac?
,
Jul 26 2016
It's really unlikely that the change in c#73 causes a performance regression. Furthermore, almost all Chrome IPC channels are now layered on top of MojoChannels, including on Mac.
,
Jul 26 2016
I'm hunting a regression in SessionRestore.ForegroundTabFirstLoaded in beta: https://uma.googleplex.com/timeline_v2?sid=4277d44b761dc7f6fb8a15703790d74b It appears to occur within this range of CLs, and nothing else stands out: https://chromium.googlesource.com/chromium/src/+log/51.0.2704.22..51.0.2704.29?pretty=fuller&n=10000 When I look at where this change landed in Canary, there is also a spike in SessionRestore.ForegroundTabFirstLoaded time. I'm asking about Mojo because there's a chance that the metric improved when we switched to it and away from this patch. That would perhaps help confirm that this was the source of the regression.
,
Jul 26 2016
rockot@ is the right person to talk to about the timing of MojoChannel changes.
Comment 1 by dtapu...@chromium.org
, Mar 28 2016