Network stack not releasing a IOBuffer after calling ResourceDispatcherHostImpl::CancelRequest(). |
|||||||||||||||
Issue descriptionHere are two failing tests made for this issue: Subresources: XXX.CancelRequestXHR From https://chromium-review.googlesource.com/c/chromium/src/+/880952 Main resource: BrowserSideNavigationBaseBrowserTest.CancelRequestAfterReadyToCommit It needs NavigationMojoResponse to be enabled, else the MojoAsyncResourceHandler would not be used. # Repro step: 1) Load a slow resource and interrupt the server while sending bytes. 2) Cancel the request using ResourceDispatcherHostImpl::CancelRequest. 3) Wait for the Renderer to be notified about it, sometimes it isn't. After an initial investigation: 1) ResourceDispatcherHostImpl::CancelRequest(...) is called. 2) MojoAsyncResourceHandler, ResourceLoader, URLRequestHttpJob, etc... are deleted. 3) MojoAsyncResourceHandler::{SharedWriter, WriterIOBuffer} are not deleted. If they are not deleted, it prevents the mojo::ScopedDataPipeProducerHandle to close the mojo connection and the Renderer to be notified. The WriterIOBuffer is a ref-counted object (IOBuffer). They are a lot of objects in the Network stack that are keeping a reference to it. I suspect one of them not releasing it (I suspect HttpCache::Writers at least). edit: The IOBuffer is not released at all, except if the server sends more bytes or if new requests happen.
,
Jan 23 2018
Actually, since Mojo makes no ordering guarantees, and the test apparently isn't waiting on the IOBuffer being destroyed, even destroying the IOBuffer promptly would likely leave the test still racy.
,
Jan 23 2018
Thanks for the answer! So maybe we could do nothing and wait. If I enable NavigationMojoResponse, the test: BrowserSideNavigationBaseBrowserTest.CancelRequestAfterReadyToCommit starts failing. The reason is this bug. I don't really know when ResourceDispatcherHostImpl::CancelRequest is called, but do you think I can start experimenting with finch with this bug? Knowing that this bug already already occurs for subresources. I have some question: 1) Wouldn't the copy from the buffer to the disk, expected to end at some point? 2) What will be the replacement for ResourceDispatcherHostImpl::CancelRequest()? Why do you think we won't have the same problem with the replacement? FYI: When I say "Network stack not releasing **immediately** a IOBuffer", I meant it does not release it at all, unless the server send more bytes or after some new network requests.
,
Jan 23 2018
Ah, not being released at all is a much different issue. I'm not sure why it's not being released - that sounds more concerning. I'm not sure why that would be happening.
,
Jan 23 2018
Ours answers crossed. Response to comment 2: The test is implicitly waiting for the IOBuffer destruction. IOBuffer destruction => mojo::ScopedDataPipeProducerHandle destruction => URLLoaderClientImpl being notified of it (URLLoaderClientImpl::OnConnectionClosed()) => The XHR "abort" event is fired => The DomMessageQueue gets an event (We are waiting for it) I don't want the IOBuffer to be deleted promptly, I want it to be deleted at some point if there are no more network activity. See comment 3.
,
Jan 23 2018
,
Jan 23 2018
I am updating the title to avoid confusion. Sorry.
,
Jan 23 2018
Oops, completely missed what's going on here. We don't use RDH when using the network service, so this test can never pass with the network service enabled, as it's written. If we need this functionality, we'll need to implement it another way. It seems like the test should depend on top level content APIs, and RDH is not one (any more, at least). What is the test trying to do?
,
Jan 23 2018
The test is not intended to be used with the NetworkService ;-) I knew that RDH are not used with the NetworkService. What this test is doing: 1) The renderer sends a XHR. 2) The server responds with half of the response and nothing after that. 3) The browser cancels the request. 4) The renderer should be notified. I am testing it by waiting for: https://xhr.spec.whatwg.org/#event-xhr-abort
,
Jan 23 2018
Ok, so looking at CL, this is *not* about the rule that you added to mojo.fyi.network_content_browsertests.filter, but the test itself failing as written, when directly, and when you mentioned Finch, you meant a trial for enabling xhr-abort, not for enabling the network service (Which is a rather long ways off, so the mention confused me a bit).
,
Jan 23 2018
*When directly = when run directly.
,
Jan 23 2018
> Ok, so looking at CL, this is *not* about the rule that you added to mojo.fyi.network_content_browsertests.filter, but the test itself failing as written, when run directly.
Yes!
The test pass, but only because I am using the wrong value in the EXPECT_EQ.
```
// 6) Wait for XHR cancellation.
{
std::string message;
EXPECT_TRUE(dom_message_queue.WaitForMessage(&message));
// TODO(arthursonzogni): This is wrong, "abort" should be received.
// See https://crbug.com/804868
EXPECT_EQ("\"timeout\"", message);
}
```
> and when you mentioned Finch, you meant a trial for enabling xhr-abort, not for enabling the network service (Which is a rather long ways off, so the mention confused me a bit).
I am talking about NavigationMojoResponse (https://goo.gl/Rrrc7n), which is not the NetworkService.
,
Jan 23 2018
Ok, I'll see if I can repro, and will look into this. Upping the priority as no longer maintaining two navigation paths seems like something we want to do sooner rather than later.
,
Jan 24 2018
An HttpCache::Writers is keeping a reference to the IOBuffer.
HttpCache::Writers is owned by an ActiveEntry.
The ActiveEntry is owned by the HttpCache.
At some point in the test HttpCache::Writers::RemoveTransation(...) is called:
________________________________________________________________________________
|
1 |void HttpCache::Writers::RemoveTransaction(Transaction* transaction,
2 | bool success) {
3 | EraseTransaction(transaction, OK);
3 |
4 | if (!all_writers_.empty())
5 | return;
6 |
7 | if (!success) {
8 | DCHECK_NE(State::CACHE_WRITE_TRUNCATED_RESPONSE, next_state_);
9 | if (InitiateTruncateEntry()) {
10| // |this| may have been deleted after truncation, so don't touch any
11| // members.
12| return;
13| }
14| }
15|
16| cache_->WritersDoneWritingToEntry(entry_, success, should_keep_entry_,
17| TransactionSet());
18|}
__|_____________________________________________________________________________
It returns line 12 without calling WritersDoneWritingToEntry(..).
FYI: If I remove lines 7-14, the IOBuffer is released and my tests pass.
Then I took a look at InitiateTruncateEntry()
________________________________________________________________________________
|
1 |bool HttpCache::Writers::InitiateTruncateEntry() {
2 | // If there is already an operation ongoing in the state machine, queue the
3 | // truncation to happen after the outstanding operation is complete by setting
3 | // the state.
4 | if (next_state_ != State::NONE) {
5 | DCHECK(next_state_ == State::CACHE_WRITE_DATA_COMPLETE ||
6 | next_state_ == State::NETWORK_READ_COMPLETE);
7 | next_state_ = State::ASYNC_OP_COMPLETE_PRE_TRUNCATE;
8 | return true;
9 | }
10|
11| if (!ShouldTruncate())
12| return false;
13|
14| next_state_ = State::CACHE_WRITE_TRUNCATED_RESPONSE;
15| // If not in do loop, initiate do loop.
16| if (!in_do_loop_)
17| DoLoop(OK);
18| return true;
19|}
__|_____________________________________________________________________________
It returns line 8 and set next_state_ = ASYNC_OP_COMPLETE_PRE_COMPLETE.
We are not currently in the loop (|in_do_loop_| == false) and the DoLoop() is not called after that.
If I try to insert line 8:
```
if (!in_do_loop_)
DoLoop(OK);
```
Then the IOBuffer is released and my tests pass.
I sent it the trybot (https://chromium-review.googlesource.com/c/chromium/src/+/883808). We will see...
I don't have much hope, because I don't know what I am doing ;-)
I will continue investigating...
,
Jan 24 2018
[+morlovich, +shivanisha], who know the cache. It's all black magic to me.
,
Jan 24 2018
[shivanisha]: Believe you're no longer working on the cache, but the relevant code is in HttpCache::Writers::InitiateTruncateEntry, which I believe was part of your cache lock fix (Though the code may have behaved the same prior to your work, not sure)
,
Jan 24 2018
[+jkarlin, +rdsmith] This is what's happening: The Writers state machine may possibly be in one of the following two async states, when the last transaction is removed and we would like to mark the entry as truncated: - NETWORK_READ_COMPLETE - CACHE_WRITE_DATA_COMPLETE At this point we do not continue with the truncation and wait for the async operation to complete and then proceed with truncation. This allows for making use of the data that is currently in flight (network). Josh/Randy: Do you remember any other reasons for doing this instead of what the HC::T used to earlier do which is : send the request to mark the entry as truncated and destroy this object. But as mentioned in this case, it could cause Writers (and thus IOBuffer) to stay alive for longer (is there a timeout at the lower layers which will finally return the async callback?) in case of a server hang.
,
Jan 24 2018
Truncating is an async operation. What if the network responds while waiting for truncation? We'll be in the wrong state.
,
Jan 24 2018
Ah, you mention deleting the writers object immediately instead of waiting for truncation to finish. That seems reasonable since the network callback takes a weak_ptr to writers.
,
Jan 25 2018
,
Jan 26 2018
I don't remember why, but I thought we considered destruction of the Writers object in several different error cases and had decided that there were races involved. I'll note that we could probably release the IOBuffer in this case without any problems--if it was still being used by any subsystem (cache disk write, network read) that subsystem would hold onto any reference it needed.
,
Feb 1 2018
Should be able to get to creating a fix for this next week.
,
Feb 2 2018
Great, thank you!
,
Feb 16 2018
,
Feb 16 2018
811214 (has high-level testcases) seems to trigger basically the same thing w/a multipart-mixed-replace --- we're waiting for a read to complete, but it never comes. I imagine hanging-get type scenarios may be affected, too.
,
Feb 19 2018
I'm not sure whether it's of any use for you still, but please tell me if the http://speedy.netavis.hu:62354/test server (the test-case from 811214) is not required anymore.
,
Feb 20 2018
Hi Shivani. I realize you're pretty busy as of late. Happy to take this one over if necessary. It looks like we'll be removing all of the PRE_TRUNCATE logic from writers, is that what you had in mind? It simplifies the code a good bit.
,
Feb 20 2018
Hi Josh, That's right, instead of transitioning to PRE_TRUNCATE state, writers send the truncate request to the cache and let itself be destroyed by the HttpCache. Thanks a lot for offering to take this one. Appreciate it. Happy to be the reviewer if necessary.
,
Feb 21 2018
Okay. I have a repro in a unittest: https://chromium-review.googlesource.com/c/chromium/src/+/928845 balazs@ please feel free to stop your test server.
,
Feb 21 2018
Great, thanks for the notification! In case you'll need it again, just tell me.
,
Feb 26 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/79689073b8fae2c2b1a06c9e5d9b616399c03c3b commit 79689073b8fae2c2b1a06c9e5d9b616399c03c3b Author: Josh Karlin <jkarlin@chromium.org> Date: Mon Feb 26 19:00:36 2018 [HttpCache] Immediately free HttpWriters when transaction is released If the last transaction in a Writers object is released, then free it. There is no need to wait for any pending cache writes or network reads to finish before truncating. Otherwise we wind up holding on to resources for far longer than necessary. Bug: 804868 Change-Id: I9480a801b0314c3be9d2aa4cbb579b96503ce638 Reviewed-on: https://chromium-review.googlesource.com/928845 Commit-Queue: Josh Karlin <jkarlin@chromium.org> Reviewed-by: Shivani Sharma <shivanisha@chromium.org> Cr-Commit-Position: refs/heads/master@{#539217} [modify] https://crrev.com/79689073b8fae2c2b1a06c9e5d9b616399c03c3b/net/http/http_cache.h [modify] https://crrev.com/79689073b8fae2c2b1a06c9e5d9b616399c03c3b/net/http/http_cache_unittest.cc [modify] https://crrev.com/79689073b8fae2c2b1a06c9e5d9b616399c03c3b/net/http/http_cache_writers.cc [modify] https://crrev.com/79689073b8fae2c2b1a06c9e5d9b616399c03c3b/net/http/http_cache_writers.h [modify] https://crrev.com/79689073b8fae2c2b1a06c9e5d9b616399c03c3b/net/http/http_cache_writers_unittest.cc [modify] https://crrev.com/79689073b8fae2c2b1a06c9e5d9b616399c03c3b/net/http/mock_http_cache.cc [modify] https://crrev.com/79689073b8fae2c2b1a06c9e5d9b616399c03c3b/net/http/mock_http_cache.h
,
Feb 26 2018
,
Feb 27 2018
I've tested it briefly, and everything seems to work correctly now. Thanks for the great work guys!
,
Feb 27 2018
Thanks for the bug report and the confirmation!
,
Feb 27 2018
One last question though: which version will contain this fix presumably? Currently we have a workaround in code, which falls back to jpeg mode if chrome's version number starts with 64 or later. Would it be safe to assume that if it's 66 (current dev version) or later, we can use mjpeg again? Or this check should rather be done against the build/patch number?
,
Mar 5 2018
Thanks for this patch! I tried again the two tests I talked about in #1. * BrowserSideNavigationBaseBrowserTest.CancelRequestXHR in https://chromium-review.googlesource.com/c/chromium/src/+/948526 * BrowserSideNavigationBaseBrowserTest.CancelRequestAfterReadyToCommit in https://chromium-review.googlesource.com/c/chromium/src/+/852094 With you patch, these two tests are now working on Linux. I tried locally and on the try bots. Unfortunately, it seems to work only on Linux. The first one is still failing on Mac/Windows/Android. That's weird, maybe there was an additional issue. I will take a look as soon as I have some time. Maybe there is still an issue with the IOBuffer.
,
Mar 5 2018
,
Mar 12 2018
I tried again BrowserSideNavigationBaseBrowserTest.CancelRequestXHR in https://chromium-review.googlesource.com/c/chromium/src/+/948526. The issue was in this temporary test, not in Chrome. On the other side, the test: BrowserSideNavigationBaseBrowserTest.CancelRequestAfterReadyToCommit with NavigationMojoResponse enabled is still not working on Windows. The previous CL made it works on every other platform.
,
Mar 20 2018
Sorry, I've been away. That's strange that it works consistently on some platforms but not others. Seems like a race, because as you say, there is nothing platform specific about my patch. Are you saying that CancelRequestXHR now works but CancelRequestAfterReadyToCommit does not? Both are showing lots of red for me.
,
Mar 20 2018
CancelRequestAfterReadyToCommit doesn't work in a specific case: * On Windows * With NavigationMojoResponse enabled. By enabling it, it will use the MojoAsyncResourceHandler to load the main response, even for the main resource. I have no Windows OS, so that's a bit difficult for me to debug. I used https://chromium.googlesource.com/chromium/src/+/master/docs/win_cross.md to build a Windows binary on Linux and execute it on bots. It works fine. I am very happy to know it it possible. I put some logs and I am seeing the IOBuffer(i.e. SharedWriter) is created but not released. On linux is works fine. That's weird. I can reproduce it consistently. I don't know if CancelRequestXHR works on Windows. It is not working right know because it is using an hardcoded value for canceling the request and it happens it is incorrect on other OSes. I tried on Linux and Android, it works. I haven't tried on Windows. With NavigationMojoResponse, both test {CancelRequestXHR, CancelRequestAfterReadyToCommit} are using the MojoAsyncResourceHandler to load the {main,sub} resources.
,
Mar 22 2018
I updated the test CancelRequestXHR such that it doesn't use hardcoded value (it was failing on non-linux OS previously because of that). https://chromium-review.googlesource.com/#/c/880952 The try bots shows it doesn't work Windows. So it is the same as in CancelRequestAfterReadyToCommit, but now we have the prof this is not NavigationMojoResponse specific.
,
Apr 5 2018
Issue 820959 has been merged into this issue.
,
Apr 5 2018
I mark this bug as available, because it has been fixed on every platform, except on Windows.
,
Apr 5 2018
I understand now why things are different on Windows and why jkarlin@ patches fixed the issue on every platform except on this Windows.
There are two TCPSockets implementations, one for Windows and one for the others.
```
#if defined(OS_WIN)
typedef TCPSocketWin TCPSocket;
#elif defined(OS_POSIX)
typedef TCPSocketPosix TCPSocket;
#endif
```
On destruction, the TCPSocketWin will not release immediately its IOBuffer, instead there is an helper class that will outlive it:
```
// This class encapsulates all the state that has to be preserved as long as
// there is a network IO operation in progress. If the owner TCPSocketWin is
// destroyed while an operation is in progress, the Core is detached and it
// lives until the operation completes and the OS doesn't reference any resource
// declared on this class anymore.
class TCPSocketWin::Core : public base::RefCounted<Core>
```
It is waiting on Windows to write something or close the file descriptor:
```
bool ObjectWatcher::StartWatchingInternal(HANDLE object, Delegate* delegate,
bool execute_only_once) {
[...]
if (!RegisterWaitForSingleObject(&wait_object_, object, DoneWaiting,
this, INFINITE, wait_flags)) {
```
Since the embedded_test_server doesn't send any byte anymore, nothing happens.
Windows will probably close the connection after some time, but it will be after the test timeouts.
This issue has always been there (and on every platform before jkarlin@ patches). Maybe it is okay to continue doing this.
FYI: The function TCPSocketWin::Close() is interesting.
+mmenke@ What do you think?
,
Apr 5 2018
Closing a socket should result in a notification, per https://cs.chromium.org/chromium/src/net/socket/tcp_socket_win.cc?type=cs&q=TCPSocketWin::Core&sq=package:chromium&l=658 We use overlapped IO for writes on windows, and async IO for reads. The difference is that overlapped IO involves out-of-band writes by the system, while async IO means we get a notification when data can be read synchronously from the underlying socket. Both operations can be cancelled, but cancelling overlapped IO still requires waiting (Since the system could be writing data offthread), while waiting on an asyc notification can be cancelled synchronously. We could switch to async IO everywhere, and get rid of Core, and all should work... There may be a bug here: If there's a pending (async IO) read, I'm not sure closing the socket will ever result in a subsequent async notification. closesocket is documented to cancel the WSAEventSelect call used to get notifications of the socket being readable. So maybe all we need to do is, if there's a pending read, abort the read watcher and release a reference?
,
Apr 5 2018
I'm going to investigate this theory, and see if I repro this in a unit test. If so, I'll try and fix it. Not going to sink too much time into this, though.
,
Apr 5 2018
I've confirmed that this is a 6-year-old leak on windows. Thanks for all your work on this, arthursonzogni! I have a fix I'll send out for review as soon as the bots all pass.
,
Apr 6 2018
Issue 829332 has been merged into this issue.
,
Apr 6 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/e0673201112d4f6f49c588b715b3592c2a86c425 commit e0673201112d4f6f49c588b715b3592c2a86c425 Author: Matt Menke <mmenke@chromium.org> Date: Fri Apr 06 21:41:24 2018 Fix IOBuffer leak when a TCPSocketWin is closed with a pending read. This leak was added when the class was switched to use async reads instead of overlapped ones, years ago. Bug: 804868 Change-Id: Ibd6538820be3d966d5b98b0bc714d2c47bd2cb1a Reviewed-on: https://chromium-review.googlesource.com/998873 Commit-Queue: Matt Menke <mmenke@chromium.org> Reviewed-by: Eric Roman <eroman@chromium.org> Cr-Commit-Position: refs/heads/master@{#548937} [modify] https://crrev.com/e0673201112d4f6f49c588b715b3592c2a86c425/net/socket/tcp_socket_unittest.cc [modify] https://crrev.com/e0673201112d4f6f49c588b715b3592c2a86c425/net/socket/tcp_socket_win.cc
,
Apr 9 2018
I think this should be fixed now.
,
Apr 10 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/d8173fb9e8eebbf3043870525498c7283f7b4ea4 commit d8173fb9e8eebbf3043870525498c7283f7b4ea4 Author: arthursonzogni <arthursonzogni@chromium.org> Date: Tue Apr 10 16:12:11 2018 Enable the test CancelRequestAfterReadyToCommit. This test has been disabled on Windows with NavigationMojoResponse: https://chromium-review.googlesource.com/c/chromium/src/+/958904 Then the issue has been fixed by: https://chromium-review.googlesource.com/998873 This CL enabled this test again. Bug: 804868 Change-Id: I77f5faaba1dd6a2e3517fe3d14df5d281a2f5e30 Reviewed-on: https://chromium-review.googlesource.com/1005347 Reviewed-by: Camille Lamy <clamy@chromium.org> Commit-Queue: Arthur Sonzogni <arthursonzogni@chromium.org> Cr-Commit-Position: refs/heads/master@{#549548} [modify] https://crrev.com/d8173fb9e8eebbf3043870525498c7283f7b4ea4/content/browser/browser_side_navigation_browsertest.cc |
|||||||||||||||
►
Sign in to add a comment |
|||||||||||||||
Comment 1 by mmenke@chromium.org
, Jan 23 2018