New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 804868 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment

Network stack not releasing a IOBuffer after calling ResourceDispatcherHostImpl::CancelRequest().

Project Member Reported by arthurso...@chromium.org, Jan 23 2018

Issue description

Here are two failing tests made for this issue:

Subresources:
  XXX.CancelRequestXHR
  From https://chromium-review.googlesource.com/c/chromium/src/+/880952

Main resource:
  BrowserSideNavigationBaseBrowserTest.CancelRequestAfterReadyToCommit
  It needs NavigationMojoResponse to be enabled, else the MojoAsyncResourceHandler would not be used.

# Repro step:
1) Load a slow resource and interrupt the server while sending bytes.
2) Cancel the request using ResourceDispatcherHostImpl::CancelRequest.
3) Wait for the Renderer to be notified about it, sometimes it isn't.

After an initial investigation:
1) ResourceDispatcherHostImpl::CancelRequest(...) is called.
2) MojoAsyncResourceHandler, ResourceLoader, URLRequestHttpJob, etc... are deleted.
3) MojoAsyncResourceHandler::{SharedWriter, WriterIOBuffer} are not deleted.
   If they are not deleted, it prevents the mojo::ScopedDataPipeProducerHandle to
   close the mojo connection and the Renderer to be notified.

The WriterIOBuffer is a ref-counted object (IOBuffer). They are a lot of objects in the Network stack that are keeping a reference to it. I suspect one of them not releasing it (I suspect HttpCache::Writers at least).


edit: The IOBuffer is not released at all, except if the server sends more bytes or if new requests happen.
 

Comment 1 by mmenke@chromium.org, Jan 23 2018

This can't be changed, without adding extra copies - we may be copying data from the buffer to disk on another thread.  I think we need to fix the tests not to depend on this behavior instead.  Also note that ResourceDispatcherHostImpl::CancelRequest itself is going away, so it's possible switching to whatever new cancellation path we will use would also fix these tests.

Comment 2 by mmenke@chromium.org, Jan 23 2018

Actually, since Mojo makes no ordering guarantees, and the test apparently isn't waiting on the IOBuffer being destroyed, even destroying the IOBuffer promptly would likely leave the test still racy.
Thanks for the answer! So maybe we could do nothing and wait.

If I enable NavigationMojoResponse, the test:
BrowserSideNavigationBaseBrowserTest.CancelRequestAfterReadyToCommit
starts failing. The reason is this bug. I don't really know when
ResourceDispatcherHostImpl::CancelRequest is called, but do you think
I can start experimenting with finch with this bug? Knowing that this
bug already already occurs for subresources.


I have some question:
1) Wouldn't the copy from the buffer to the disk, expected to end at some point?
2) What will be the replacement for ResourceDispatcherHostImpl::CancelRequest()?
   Why do you think we won't have the same problem with the replacement?

FYI: When I say "Network stack not releasing **immediately** a IOBuffer", I meant it does not release it at all, unless the server send more bytes or after some new network requests.

Comment 4 by mmenke@chromium.org, Jan 23 2018

Ah, not being released at all is a much different issue.

I'm not sure why it's not being released - that sounds more concerning.  I'm not sure why that would be happening.
Ours answers crossed.

Response to comment 2:
The test is implicitly waiting for the IOBuffer destruction.
IOBuffer destruction
  => mojo::ScopedDataPipeProducerHandle destruction
  => URLLoaderClientImpl being notified of it (URLLoaderClientImpl::OnConnectionClosed())
  => The XHR "abort" event is fired
  => The DomMessageQueue gets an event (We are waiting for it)

I don't want the IOBuffer to be deleted promptly, I want it to be deleted at some point if there are no more network activity. See comment 3.
Description: Show this description
Summary: Network stack not releasing a IOBuffer after calling ResourceDispatcherHostImpl::CancelRequest(). (was: Network stack not releasing immediately a IOBuffer after calling ResourceDispatcherHostImpl::CancelRequest().)
I am updating the title to avoid confusion. Sorry.

Comment 8 by mmenke@chromium.org, Jan 23 2018

Oops, completely missed what's going on here.  We don't use RDH when using the network service, so this test can never pass with the network service enabled, as it's written.  If we need this functionality, we'll need to implement it another way.  It seems like the test should depend on top level content APIs, and RDH is not one (any more, at least).  What is the test trying to do?

Comment 9 Deleted

The test is not intended to be used with the NetworkService ;-) I knew that RDH are not used with the NetworkService.

What this test is doing:
1) The renderer sends a XHR.
2) The server responds with half of the response and nothing after that.
3) The browser cancels the request.
4) The renderer should be notified. I am testing it by waiting for:
   https://xhr.spec.whatwg.org/#event-xhr-abort
Ok, so looking at CL, this is *not* about the rule that you added to mojo.fyi.network_content_browsertests.filter, but the test itself failing as written, when directly, and when you mentioned Finch, you meant a trial for enabling xhr-abort, not for enabling the network service (Which is a rather long ways off, so the mention confused me a bit).
*When directly = when run directly.
> Ok, so looking at CL, this is *not* about the rule that you added to mojo.fyi.network_content_browsertests.filter, but the test itself failing as written, when run directly.
Yes!

The test pass, but only because I am using the wrong value in the EXPECT_EQ.
```
  // 6) Wait for XHR cancellation.
  {
    std::string message;
    EXPECT_TRUE(dom_message_queue.WaitForMessage(&message));
    // TODO(arthursonzogni): This is wrong, "abort" should be received.
    // See  https://crbug.com/804868 
    EXPECT_EQ("\"timeout\"", message);
  }
```

> and when you mentioned Finch, you meant a trial for enabling xhr-abort, not for enabling the network service (Which is a rather long ways off, so the mention confused me a bit).
I am talking about NavigationMojoResponse (https://goo.gl/Rrrc7n), which is not the NetworkService.
Labels: -Pri-3 Pri-2
Owner: mmenke@chromium.org
Status: Assigned (was: Available)
Ok, I'll see if I can repro, and will look into this.  Upping the priority as no longer maintaining two navigation paths seems like something we want to do sooner rather than later.
An HttpCache::Writers is keeping a reference to the IOBuffer.
HttpCache::Writers is owned by an ActiveEntry.
The ActiveEntry is owned by the HttpCache.

At some point in the test HttpCache::Writers::RemoveTransation(...) is called:
________________________________________________________________________________
  |
1 |void HttpCache::Writers::RemoveTransaction(Transaction* transaction,
2 |                                           bool success) {
3 |  EraseTransaction(transaction, OK);
3 |
4 |  if (!all_writers_.empty())
5 |    return;
6 |
7 |  if (!success) {
8 |    DCHECK_NE(State::CACHE_WRITE_TRUNCATED_RESPONSE, next_state_);
9 |    if (InitiateTruncateEntry()) {
10|       // |this| may have been deleted after truncation, so don't touch any
11|      // members.
12|      return;
13|    }
14|  }
15|
16|  cache_->WritersDoneWritingToEntry(entry_, success, should_keep_entry_,
17|                                    TransactionSet());
18|}
__|_____________________________________________________________________________

It returns line 12 without calling WritersDoneWritingToEntry(..).
FYI: If I remove lines 7-14, the IOBuffer is released and my tests pass.

Then I took a look at InitiateTruncateEntry()
________________________________________________________________________________
  |
1 |bool HttpCache::Writers::InitiateTruncateEntry() {
2 |  // If there is already an operation ongoing in the state machine, queue the
3 |  // truncation to happen after the outstanding operation is complete by setting
3 |  // the state.
4 |  if (next_state_ != State::NONE) {
5 |    DCHECK(next_state_ == State::CACHE_WRITE_DATA_COMPLETE ||
6 |           next_state_ == State::NETWORK_READ_COMPLETE);
7 |    next_state_ = State::ASYNC_OP_COMPLETE_PRE_TRUNCATE;
8 |    return true;
9 |  }
10|
11|  if (!ShouldTruncate())
12|    return false;
13|
14|  next_state_ = State::CACHE_WRITE_TRUNCATED_RESPONSE;
15|  // If not in do loop, initiate do loop.
16|  if (!in_do_loop_)
17|    DoLoop(OK);
18|  return true;
19|}
__|_____________________________________________________________________________

It returns line 8 and set next_state_ = ASYNC_OP_COMPLETE_PRE_COMPLETE.

We are not currently in the loop (|in_do_loop_| == false) and the DoLoop() is not called after that.
If I try to insert line 8:
```
  if (!in_do_loop_)
    DoLoop(OK);
```

Then the IOBuffer is released and my tests pass.

I sent it the trybot (https://chromium-review.googlesource.com/c/chromium/src/+/883808). We will see...
I don't have much hope, because I don't know what I am doing ;-)

I will continue investigating...
Cc: shivanisha@chromium.org morlovich@chromium.org
Components: Internals>Network>Cache
Owner: ----
Status: Available (was: Assigned)
[+morlovich, +shivanisha], who know the cache.  It's all black magic to me.
[shivanisha]:  Believe you're no longer working on the cache, but the relevant code is in HttpCache::Writers::InitiateTruncateEntry, which I believe was part of your cache lock fix (Though the code may have behaved the same prior to your work, not sure)
Cc: jkarlin@chromium.org rdsmith@chromium.org
[+jkarlin, +rdsmith]

This is what's happening:

The Writers state machine may possibly be in one of the following two async states, when the last transaction is removed and we would like to mark the entry as truncated:
- NETWORK_READ_COMPLETE
- CACHE_WRITE_DATA_COMPLETE
At this point we do not continue with the truncation and wait for the async operation to complete and then proceed with truncation. This allows for making use of the data that is currently in flight (network).

Josh/Randy: Do you remember any other reasons for doing this instead of what the HC::T used to earlier do which is : send the request to mark the entry as truncated and destroy this object.

But as mentioned in this case, it could cause Writers (and thus IOBuffer) to stay alive for longer (is there a timeout at the lower layers which will finally return the async callback?) in case of a server hang.


Truncating is an async operation. What if the network responds while waiting for truncation? We'll be in the wrong state.
Ah, you mention deleting the writers object immediately instead of waiting for truncation to finish. That seems reasonable since the network callback takes a weak_ptr to writers.

Comment 21 Deleted

Owner: ----
Status: Available (was: Started)
I don't remember why, but I thought we considered destruction of the Writers object in several different error cases and had decided that there were races involved.  

I'll note that we could probably release the IOBuffer in this case without any problems--if it was still being used by any subsystem (cache disk write, network read) that subsystem would hold onto any reference it needed.

Owner: shivanisha@chromium.org
Status: Assigned (was: Available)
Should be able to get to creating a fix for this next week.
Great, thank you!
Cc: viswa.karala@chromium.org mge...@chromium.org
 Issue 811214  has been merged into this issue.
811214 (has high-level testcases) seems to trigger basically the same thing w/a multipart-mixed-replace --- we're waiting for a read to complete, but it never comes. I imagine hanging-get type scenarios may be affected, too.

Comment 28 by bal...@netavis.hu, Feb 19 2018

I'm not sure whether it's of any use for you still, but please tell me if the http://speedy.netavis.hu:62354/test server (the test-case from 811214) is not required anymore.
Hi Shivani. I realize you're pretty busy as of late. Happy to take this one over if necessary. It looks like we'll be removing all of the PRE_TRUNCATE logic from writers, is that what you had in mind? It simplifies the code a good bit.
Owner: jkarlin@chromium.org
Hi Josh,
That's right, instead of transitioning to PRE_TRUNCATE state, writers send the truncate request to the cache and let itself be destroyed by the HttpCache.
Thanks a lot for offering to take this one. Appreciate it. Happy to be the reviewer if necessary.
Okay. I have a repro in a unittest: https://chromium-review.googlesource.com/c/chromium/src/+/928845

balazs@ please feel free to stop your test server.

Comment 32 by bal...@netavis.hu, Feb 21 2018

Great, thanks for the notification! In case you'll need it again, just tell me.
Project Member

Comment 33 by bugdroid1@chromium.org, Feb 26 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/79689073b8fae2c2b1a06c9e5d9b616399c03c3b

commit 79689073b8fae2c2b1a06c9e5d9b616399c03c3b
Author: Josh Karlin <jkarlin@chromium.org>
Date: Mon Feb 26 19:00:36 2018

[HttpCache] Immediately free HttpWriters when transaction is released

If the last transaction in a Writers object is released, then free it. There
is no need to wait for any pending cache writes or network reads to finish
before truncating. Otherwise we wind up holding on to resources for far longer
than necessary.

Bug:  804868 
Change-Id: I9480a801b0314c3be9d2aa4cbb579b96503ce638
Reviewed-on: https://chromium-review.googlesource.com/928845
Commit-Queue: Josh Karlin <jkarlin@chromium.org>
Reviewed-by: Shivani Sharma <shivanisha@chromium.org>
Cr-Commit-Position: refs/heads/master@{#539217}
[modify] https://crrev.com/79689073b8fae2c2b1a06c9e5d9b616399c03c3b/net/http/http_cache.h
[modify] https://crrev.com/79689073b8fae2c2b1a06c9e5d9b616399c03c3b/net/http/http_cache_unittest.cc
[modify] https://crrev.com/79689073b8fae2c2b1a06c9e5d9b616399c03c3b/net/http/http_cache_writers.cc
[modify] https://crrev.com/79689073b8fae2c2b1a06c9e5d9b616399c03c3b/net/http/http_cache_writers.h
[modify] https://crrev.com/79689073b8fae2c2b1a06c9e5d9b616399c03c3b/net/http/http_cache_writers_unittest.cc
[modify] https://crrev.com/79689073b8fae2c2b1a06c9e5d9b616399c03c3b/net/http/mock_http_cache.cc
[modify] https://crrev.com/79689073b8fae2c2b1a06c9e5d9b616399c03c3b/net/http/mock_http_cache.h

Status: Fixed (was: Assigned)

Comment 35 by bal...@netavis.hu, Feb 27 2018

I've tested it briefly, and everything seems to work correctly now. Thanks for the great work guys!
Status: Verified (was: Fixed)
Thanks for the bug report and the confirmation!

Comment 37 by bal...@netavis.hu, Feb 27 2018

One last question though: which version will contain this fix presumably? Currently we have a workaround in code, which falls back to jpeg mode if chrome's version number starts with 64 or later. Would it be safe to assume that if it's 66 (current dev version) or later, we can use mjpeg again? Or this check should rather be done against the build/patch number?
Thanks for this patch!

I tried again the two tests I talked about in #1.

* BrowserSideNavigationBaseBrowserTest.CancelRequestXHR
  in https://chromium-review.googlesource.com/c/chromium/src/+/948526
* BrowserSideNavigationBaseBrowserTest.CancelRequestAfterReadyToCommit
  in https://chromium-review.googlesource.com/c/chromium/src/+/852094

With you patch, these two tests are now working on Linux. I tried locally and on the try bots. Unfortunately, it seems to work only on Linux. The first one is still failing on Mac/Windows/Android.
That's weird, maybe there was an additional issue. I will take a look as soon as I have some time. Maybe there is still an issue with the IOBuffer.
Cc: -rdsmith@chromium.org
I tried again BrowserSideNavigationBaseBrowserTest.CancelRequestXHR in https://chromium-review.googlesource.com/c/chromium/src/+/948526. The issue was in this temporary test, not in Chrome.

On the other side, the test:
BrowserSideNavigationBaseBrowserTest.CancelRequestAfterReadyToCommit with NavigationMojoResponse enabled is still not working on Windows. The previous CL made it works on every other platform.
Sorry, I've been away. That's strange that it works consistently on some platforms but not others. Seems like a race, because as you say, there is nothing platform specific about my patch.

Are you saying that CancelRequestXHR now works but CancelRequestAfterReadyToCommit does not? Both are showing lots of red for me.
CancelRequestAfterReadyToCommit doesn't work in a specific case:
* On Windows
* With NavigationMojoResponse enabled. By enabling it, it will use the MojoAsyncResourceHandler to load the main response, even for the main resource.

I have no Windows OS, so that's a bit difficult for me to debug.
I used https://chromium.googlesource.com/chromium/src/+/master/docs/win_cross.md
to build a Windows binary on Linux and execute it on bots. It works fine. I am very happy to know it it possible.
I put some logs and I am seeing the IOBuffer(i.e. SharedWriter) is created but not released. On linux is works fine. That's weird.
I can reproduce it consistently.

I don't know if CancelRequestXHR works on Windows. It is not working right know because it is using an hardcoded value for canceling the request and it happens it is incorrect on other OSes.
I tried on Linux and Android, it works. I haven't tried on Windows.

With NavigationMojoResponse, both test {CancelRequestXHR, CancelRequestAfterReadyToCommit} are using the MojoAsyncResourceHandler to load the {main,sub} resources.

Comment 43 Deleted

I updated the test CancelRequestXHR such that it doesn't use hardcoded value (it was failing on non-linux OS previously because of that).
https://chromium-review.googlesource.com/#/c/880952
The try bots shows it doesn't work Windows.

So it is the same as in CancelRequestAfterReadyToCommit, but now we have the prof this is not NavigationMojoResponse specific.
 Issue 820959  has been merged into this issue.
I mark this bug as available, because it has been fixed on every platform, except on Windows.
Status: Available (was: Verified)
I understand now why things are different on Windows and why jkarlin@ patches fixed the issue on every platform except on this Windows.

There are two TCPSockets implementations, one for Windows and one for the others.

```
#if defined(OS_WIN)
typedef TCPSocketWin TCPSocket;
#elif defined(OS_POSIX)
typedef TCPSocketPosix TCPSocket;
#endif
```

On destruction, the TCPSocketWin will not release immediately its IOBuffer, instead there is an helper class that will outlive it:

```
// This class encapsulates all the state that has to be preserved as long as
// there is a network IO operation in progress. If the owner TCPSocketWin is
// destroyed while an operation is in progress, the Core is detached and it
// lives until the operation completes and the OS doesn't reference any resource
// declared on this class anymore.
class TCPSocketWin::Core : public base::RefCounted<Core>
```

It is waiting on Windows to write something or close the file descriptor:

```
bool ObjectWatcher::StartWatchingInternal(HANDLE object, Delegate* delegate,
                                          bool execute_only_once) {
  [...]
  if (!RegisterWaitForSingleObject(&wait_object_, object, DoneWaiting,
                                   this, INFINITE, wait_flags)) {
```

Since the embedded_test_server doesn't send any byte anymore, nothing happens.
Windows will probably close the connection after some time, but it will be after the test timeouts.

This issue has always been there (and on every platform before jkarlin@ patches). Maybe it is okay to continue doing this.

FYI: The function TCPSocketWin::Close() is interesting.

+mmenke@ What do you think?
Closing a socket should result in a notification, per https://cs.chromium.org/chromium/src/net/socket/tcp_socket_win.cc?type=cs&q=TCPSocketWin::Core&sq=package:chromium&l=658

We use overlapped IO for writes on windows, and async IO for reads.  The difference is that overlapped IO involves out-of-band writes by the system, while async IO means we get a notification when data can be read synchronously from the underlying socket.

Both operations can be cancelled, but cancelling overlapped IO still requires waiting (Since the system could be writing data offthread), while waiting on an asyc notification can be cancelled synchronously.  We could switch to async IO everywhere, and get rid of Core, and all should work...

There may be a bug here:  If there's a pending (async IO) read, I'm not sure closing the socket will ever result in a subsequent async notification.  closesocket is documented to cancel the WSAEventSelect call used to get notifications of the socket being readable.  So maybe all we need to do is, if there's a pending read, abort the read watcher and release a reference?
Owner: mmenke@chromium.org
Status: Assigned (was: Available)
I'm going to investigate this theory, and see if I repro this in a unit test.  If so, I'll try and fix it.  Not going to sink too much time into this, though.
I've confirmed that this is a 6-year-old leak on windows.  Thanks for all your work on this, arthursonzogni!  I have a fix I'll send out for review as soon as the bots all pass.
Issue 829332 has been merged into this issue.
Project Member

Comment 52 by bugdroid1@chromium.org, Apr 6 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e0673201112d4f6f49c588b715b3592c2a86c425

commit e0673201112d4f6f49c588b715b3592c2a86c425
Author: Matt Menke <mmenke@chromium.org>
Date: Fri Apr 06 21:41:24 2018

Fix IOBuffer leak when a TCPSocketWin is closed with a pending read.

This leak was added when the class was switched to use async reads
instead of overlapped ones, years ago.

Bug:  804868 
Change-Id: Ibd6538820be3d966d5b98b0bc714d2c47bd2cb1a
Reviewed-on: https://chromium-review.googlesource.com/998873
Commit-Queue: Matt Menke <mmenke@chromium.org>
Reviewed-by: Eric Roman <eroman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#548937}
[modify] https://crrev.com/e0673201112d4f6f49c588b715b3592c2a86c425/net/socket/tcp_socket_unittest.cc
[modify] https://crrev.com/e0673201112d4f6f49c588b715b3592c2a86c425/net/socket/tcp_socket_win.cc

Status: Fixed (was: Assigned)
I think this should be fixed now.
Project Member

Comment 54 by bugdroid1@chromium.org, Apr 10 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d8173fb9e8eebbf3043870525498c7283f7b4ea4

commit d8173fb9e8eebbf3043870525498c7283f7b4ea4
Author: arthursonzogni <arthursonzogni@chromium.org>
Date: Tue Apr 10 16:12:11 2018

Enable the test CancelRequestAfterReadyToCommit.

This test has been disabled on Windows with NavigationMojoResponse:
https://chromium-review.googlesource.com/c/chromium/src/+/958904

Then the issue has been fixed by:
https://chromium-review.googlesource.com/998873

This CL enabled this test again.

Bug:  804868 
Change-Id: I77f5faaba1dd6a2e3517fe3d14df5d281a2f5e30
Reviewed-on: https://chromium-review.googlesource.com/1005347
Reviewed-by: Camille Lamy <clamy@chromium.org>
Commit-Queue: Arthur Sonzogni <arthursonzogni@chromium.org>
Cr-Commit-Position: refs/heads/master@{#549548}
[modify] https://crrev.com/d8173fb9e8eebbf3043870525498c7283f7b4ea4/content/browser/browser_side_navigation_browsertest.cc

Sign in to add a comment