| Page hoses browser | ||||||||||
| Project Member Reported by pkasting@chromium.org, Apr 21 2010 | Back to list | |||||||||
Visit http://www.teamliquid.net/forum/viewmessage.php?topic_id=120488 Now try various things while it's loading: * Hitting stop changes the stop button to a go button but the throbber keeps spinning * Clicking one of the "Spoiler" links doesn't do anything * Trying to end the process of the associated tab doesn't work (at least sometimes) * Opening a new tab, closing the old tab, and trying to visit something (e.g. google.com) in the new tab hangs up and glitches WTF? I don't even know who should look at this.
,
Apr 22 2010
I suspect this is because of mbelshe's backup ConnectJobs which don't respect socket limits. My change is probably not the root cause, it just exacerbates the problem. I will look into it tomorrow.
,
Apr 22 2010
Thanks for investigating! (sorry i didn't mean to assign blame, just figured you would be a good person to debug this)
,
Apr 22 2010
Hah, don't worry, I'm not a 12 yr old. Not being defensive, just providing the probable diagnosis.
,
Apr 23 2010
I'm working on fixing this right now. Just for clarification, I don't think it had anything to do with my change (I was confusing it with a different change, which would have exacerbated the issue). Actually, I'm sure of it, because I just tested against dev channel which didn't have my change. Would people consider this a release blocker for Mstone-5? I'm working on a fix now, although as eroman will attest to, this code is pretty sensitive. It might be worth considering a revert, but I don't know. Adding laforge and mbelshe. A simple fix might be to change the != to a greater than or less than condition. I'm not convinced it won't just hide some other issue, but I need to investigate further.
,
Apr 23 2010
I would consider this an M5 blocker, yes. "Just visiting a particular page hoses the IO thread and destroys the whole browser" is a bad bug.
,
Apr 23 2010
,
Apr 23 2010
I spent a reasonable chunk of today looking at how to fix this cleanly, without adding more one off hacks. I think we may need to revert the backup socket jobs in branch 375. Mike, is that ok with you? I know it might screw up some SPDY benchmarking / stats gathering.
,
Apr 23 2010
Will - I am happy to take this if you think it is related to the backup jobs. I think we can fix this. Personally, I can't reproduce any problems yet though. I tried following the instructions and it just works?
,
Apr 26 2010
mbelshe, willchan: for M5 we need to check in a fix (or revert the backup socket jobs CL) on the 375 branch today. Can you do that today?
,
Apr 26 2010
I did a messy revert of r41543 (original CL), r41547 (crash fix), and r42179 (vandebo's fix so we don't do backup jobs for non-TCP sockets). It's not a clean revert on ToT, partly due to eroman's LoadLog=>NetLog refactors. I did verify that a revert fixes the problem. It might be cleaner to revert on 375 directly. If not, then perhaps we should just do a patch similar to r42179 that disables the backup connect job code for all socket pools, not just non-TCP pools. It'd require updating unit tests too though. I don't think any revert will be clean since the code's been in for awhile. I'm assigning to mbelshe.
,
Apr 26 2010
,
Apr 26 2010
Er, Mike started looking into it and the backup connect job doesn't look to be the cause. It does fix the socket limits issue that we were seeing with the page, but apparently the ReachedMaxSocketLimits() thing might be a red herring. Peter, can you double check which version of Chrome causes this problem for you? If it's not reproducible in dev channel, then the backup connect job stuff clearly is not at fault.
,
Apr 27 2010
It was the Dev channel release at the time of filing, which I believe was 5.0.375.9.
,
Apr 27 2010
Oops, looks like my followup post didn't go through. Yes, we also verified that dev channel is broken. There is a while(true) loop that breaks out when ReachedMaxSocketLimits() returns true. Mike's change broke that. Mike says that he still sees hangs even after turning off the backup connect job. I'm working on trying to verify this. I don't see it though. So, there's at least one bug, perhaps more. Fixing the backup socket connect job will address the majority case at least, if not fix the entire issue. mbelshe's got a change for that which I'm going to look at shortly.
,
Apr 27 2010
This should block next dev channel/beta, not just stable. Twiddling priority and milestones.
,
Apr 27 2010
The biggest part of this bug was bug 42597 . I believe I can still reproduce this bug when we hit limits, but willchan says he cannot. I think there is a lingering bug, so leaving this open for a bit longer.
,
Apr 29 2010
mbelshe fixed the biggest part of this bug in bug 42597 for M5. Leave this bug open for the remaining issue. Moved to M6 and lowered priority to P2.
,
May 6 2010
Issue 43208 has been merged into this issue.
,
May 6 2010
This still seems to be happening on ToT for me. I looked into it, I think there's a bug with the releasing sockets bugfix stuff that I wrote back in February. I'm not sure why it's taken this long to show up. I think it's because on this website, we open so many connections. My worker pool had like 100+ threads in it. I think the fix will be touchy. A quick guaranteed fix is to turn off the 256 socket limit in the 375 branch. I don't think it's gaining us much, so I think we should just do it and get a real fix in for Mstone-6.
,
May 7 2010
,
May 7 2010
,
May 7 2010
The following revision refers to this bug:
http://src.chromium.org/viewvc/chrome?view=rev&revision=46641
------------------------------------------------------------------------
r46641 | laforge@chromium.org | 2010-05-06 17:19:07 -0700 (Thu, 06 May 2010) | 4 lines
Changed paths:
M http://src.chromium.org/viewvc/chrome/branches/375/src/net/http/http_network_session.cc?r1=46641&r2=46640
Change socket pool max from 100 to 1000000
This will effectively disable the socket limit, which is known to be buggy and can cause the IO thread to enter into an infinite loop.
BUG= 42267
Review URL: http://codereview.chromium.org/2038005
------------------------------------------------------------------------
,
May 7 2010
The following revision refers to this bug:
http://src.chromium.org/viewvc/chrome?view=rev&revision=46757
------------------------------------------------------------------------
r46757 | willchan@chromium.org | 2010-05-07 16:45:30 -0700 (Fri, 07 May 2010) | 7 lines
Changed paths:
M http://src.chromium.org/viewvc/chrome/trunk/src/net/socket/client_socket_pool_base.cc?r1=46757&r2=46756
M http://src.chromium.org/viewvc/chrome/trunk/src/net/socket/client_socket_pool_base.h?r1=46757&r2=46756
M http://src.chromium.org/viewvc/chrome/trunk/src/net/socket/client_socket_pool_base_unittest.cc?r1=46757&r2=46756
M http://src.chromium.org/viewvc/chrome/trunk/src/net/socket/tcp_client_socket_pool_unittest.cc?r1=46757&r2=46756
Fix IO thread hang on releasing a socket.
The problem was we assumed ProcessPendingRequest() would always make progress once the group's releasing socket went down to zero. However, in OnAvailableSocketSlot(), since the top stalled group might still have a releasing socket, it won't necessarily make progress. The algorithmic solution is to simply never do any socket slot processing in DoReleaseSocket() if there are any releasing sockets. This requires us to search for any stalled groups after releasing sockets gets back down to 0, so it requires the full group scan each time num_releasing_sockets goes back to 0. There is a performance hit here, but I think a linear search through 256~ groups in the worst case is ok.
TODO(willchan): evaluate the perf hit and consider adding a secondary data structure to improve the stalled group search.
BUG= 42267
Review URL: http://codereview.chromium.org/2013009
------------------------------------------------------------------------
,
May 8 2010
,
May 10 2010
The following revision refers to this bug:
http://src.chromium.org/viewvc/chrome?view=rev&revision=46792
------------------------------------------------------------------------
r46792 | sky@chromium.org | 2010-05-08 13:18:14 -0700 (Sat, 08 May 2010) | 10 lines
Changed paths:
M http://src.chromium.org/viewvc/chrome/trunk/src/net/socket/client_socket_pool_base.cc?r1=46792&r2=46791
M http://src.chromium.org/viewvc/chrome/trunk/src/net/socket/client_socket_pool_base.h?r1=46792&r2=46791
M http://src.chromium.org/viewvc/chrome/trunk/src/net/socket/client_socket_pool_base_unittest.cc?r1=46792&r2=46791
M http://src.chromium.org/viewvc/chrome/trunk/src/net/socket/tcp_client_socket_pool_unittest.cc?r1=46792&r2=46791
Revert 46757 - Fix IO thread hang on releasing a socket.
The problem was we assumed ProcessPendingRequest() would always make progress once the group's releasing socket went down to zero. However, in OnAvailableSocketSlot(), since the top stalled group might still have a releasing socket, it won't necessarily make progress. The algorithmic solution is to simply never do any socket slot processing in DoReleaseSocket() if there are any releasing sockets. This requires us to search for any stalled groups after releasing sockets gets back down to 0, so it requires the full group scan each time num_releasing_sockets goes back to 0. There is a performance hit here, but I think a linear search through 256~ groups in the worst case is ok.
TODO(willchan): evaluate the perf hit and consider adding a secondary data structure to improve the stalled group search.
BUG= 42267
Review URL: http://codereview.chromium.org/2013009
TBR=willchan@chromium.org
Review URL: http://codereview.chromium.org/1992010
------------------------------------------------------------------------
,
May 10 2010
The following revision refers to this bug:
http://src.chromium.org/viewvc/chrome?view=rev&revision=46837
------------------------------------------------------------------------
r46837 | willchan@chromium.org | 2010-05-10 11:58:54 -0700 (Mon, 10 May 2010) | 5 lines
Changed paths:
M http://src.chromium.org/viewvc/chrome/trunk/src/net/socket/client_socket_pool_base.cc?r1=46837&r2=46836
M http://src.chromium.org/viewvc/chrome/trunk/src/net/socket/client_socket_pool_base.h?r1=46837&r2=46836
M http://src.chromium.org/viewvc/chrome/trunk/src/net/socket/client_socket_pool_base_unittest.cc?r1=46837&r2=46836
M http://src.chromium.org/viewvc/chrome/trunk/src/net/socket/tcp_client_socket_pool_unittest.cc?r1=46837&r2=46836
Reland 46757.
OnAvailableSocketSlot() was dereferencing a NULL pointer. Add a NULL check. Add a test.
BUG= 42267
Review URL: http://codereview.chromium.org/2050005
------------------------------------------------------------------------
,
May 10 2010
Issue 43741 has been merged into this issue.
,
May 12 2010
This is not fixed yet Program received signal SIGPIPE, Broken pipe. [Switching to Thread 0x7fffec3d7910 (LWP 2298)] 0x00007ffff3a4296b in write () from /lib/libpthread.so.0 (gdb) q Installed: 6.0.401.0~svn20100512r46998-0ubuntu1~ucd1~karmic 6.0.401.0 (46998)
,
May 12 2010
The bug was an infinite loop. That shouldn't have anything to do with a SIGPIPE. Perhaps you want to file a new bug.
,
May 12 2010
Will i did... issue 43741 it got dupped of this one :S
,
May 12 2010
Bug 43741 doesn't say anything about a SIGPIPE.
,
May 12 2010
ok, i'll file a new one tomorrow
,
May 13 2010
Issue 44049 has been merged into this issue.
,
May 13 2010
so is issue 43741 a dupe or not? if it isnt, please remove its status as dupe
,
May 13 2010
The original issue you posted there is a dupe. The SIGPIPE sounds like a separate issue. You might be encountering multiple different issues, in which you should file separate bugs for each. Again, please post a stacktrace for all thread stacks when you file a new bug. "thread apply all bt" in gdb.
,
May 13 2010
Issue 43079 has been merged into this issue.
,
May 17 2010
Will as requested, new issue open http://code.google.com/p/chromium/issues/detail?id=44358 let me know if i can provide any more data, or if it is a dupe of one of the other i already have open
,
Sep 23 2010
Issue 43769 has been merged into this issue.
,
Oct 12 2012
This issue has been closed for some time. No one will pay attention to new comments. If you are seeing this bug or have new data, please click New Issue to start a new bug.
,
Mar 10 2013
,
Mar 13 2013
|
||||||||||
| ► Sign in to add a comment | ||||||||||
Status: Assigned