New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 870942 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 10
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Android , Windows , Chrome , Mac
Pri: 1
Type: Bug
Proj-Servicification



Sign in to add a comment

URLLoaderFactoryGetter is unsafe

Project Member Reported by mmenke@chromium.org, Aug 4

Issue description

URLLoaderFactoryGetter has the exact same issue that I'm in https://chromium-review.googlesource.com/c/chromium/src/+/1153630.  It's a refcounted object that owns a mojo pipe, and posts a task to delete itself.  If it gets a a connection error when the delete task is pending, it grabs another reference to itself while posting a task to the UI thread, upping the refcount from 0 to 1.  This causes a DCHECK, but if it didn't, the class would delete itself twice, since the reference count would reach 0 twice.

This hasn't surfaced as a problem because we generally succeed on retry, since it's a flaky error, but if you look at the mojo_linux browser_test results, you can see a bunch of these failures on each try run.

This class looks to be used pretty extensively in content, at least when the network service is enabled, though not sure if it's used at all when the network service is disabled.  Either way, we should fix this promptly.
 
And, just for the record, a sample crash stack:

[ RUN      ] SecurityStateTabHelperTest.VerifyEnterprisePasswordReuseMaliciousContentAndSecurityLevel/0
....
#0 0x00000628b78c base::debug::StackTrace::StackTrace()
#1 0x0000061da61b logging::LogMessage::~LogMessage()
#2 0x000003e8f6d6 _ZN4base8internal34BanUnconstructedRefCountedReceiverIMN7content29SingleRequestURLLoaderFactory12HandlerStateEFvN4mojo16InterfaceRequestIN7network5mojom9URLLoaderEEENS5_12InterfacePtrINS8_15URLLoaderClientEEEEPS4_JRSA_RSD_EEENSt3__19enable_ifIXaaaasr17MakeFunctorTraitsIT_EE9is_methodsr3std10is_pointerINSJ_5decayIT0_E4typeEEE5valuesr16IsRefCountedTypeINSJ_14remove_pointerISP_E4typeEEE5valueEvE4typeERKSN_DpOT1_
#3 0x000004c36300 _ZN4base8internal9BindStateIMN7content22URLLoaderFactoryGetterEFvvEJ13scoped_refptrIS3_EEE6CreateIS5_JPS3_EEEPS8_PFvvEOT_DpOT0_
#4 0x000004c35681 content::URLLoaderFactoryGetter::HandleURLLoaderFactoryConnectionError()
#5 0x0000025bc0c4 _ZN4base8internal7InvokerINS0_9BindStateIMN12_GLOBAL__N_116SimpleHttpServer10ConnectionEFvvEJNS_7WeakPtrIS5_EEEEEFvvEE7RunOnceEPNS0_13BindStateBaseE
#6 0x00000722c790 mojo::InterfaceEndpointClient::NotifyError()
#7 0x000007234786 mojo::internal::MultiplexRouter::ProcessNotifyErrorTask()
#8 0x00000723174e mojo::internal::MultiplexRouter::ProcessTasks()
#9 0x00000722fdb3 mojo::internal::MultiplexRouter::OnPipeConnectionError()
#10 0x000007228289 mojo::Connector::HandleError()
#11 0x00000722912c mojo::Connector::OnHandleReadyInternal()
#12 0x000003e75164 mojo::SimpleWatcher::DiscardReadyState()
#13 0x0000068e3f93 mojo::SimpleWatcher::OnHandleReady()
#14 0x0000068e4426 mojo::SimpleWatcher::Context::Notify()
#15 0x0000068e2e33 mojo::SimpleWatcher::Context::CallNotify()
#16 0x000004460eb7 mojo::core::WatcherDispatcher::InvokeWatchCallback()
#17 0x000004460602 mojo::core::Watch::InvokeCallback()
#18 0x00000445b44a mojo::core::RequestContext::~RequestContext()
#19 0x00000444e9cd mojo::core::NodeChannel::OnChannelMessage()
#20 0x00000443c3ef mojo::core::Channel::OnReadComplete()
#21 0x000004466022 mojo::core::(anonymous namespace)::ChannelPosix::OnFileCanReadWithoutBlocking()
It feels to me that |RefCountedThreadSafe<>| doesn't play nicely with |WeakPtr|, and a quick fix would be checking |HasAtLeastOneRef()| before upping the refcount if it was posted from a |WeakPtr<>|:
https://cs.chromium.org/chromium/src/content/browser/url_loader_factory_getter.cc?l=181&rcl=87a7d8bef673fd8dd19a96706beb1744e64f9174

I'm not sure but do we have better utilities?

HasAtLeastOneRef would still be racy - since we could post the task immediately after that check, but before we grab another ref.  We'd need an atomic method to check the reference count, and if it's non-zero, increment the reference.
Owner: mmenke@chromium.org
Status: Assigned (was: Untriaged)
Think I'll take this on myself - fixed the other instance that was much like this already (Though this case is more complicated than the one I fixed, and may need a different solution).
Labels: OS-Android OS-Chrome OS-Linux OS-Mac OS-Windows
Cool,thanks Matt!
Do we even need this class?  Seems like we could just switch to SharedURLLoaderFactory[Info], and we could even reuse the ForBrowserProcessIOThread one, since URLLoaderFactoryGetter doesn't set any magic options (Like process ID for the corresponding renderer process - which seems like it may be a bug, for things like auth) or any web security features.
So it looks to me like all network service crash tests that do a navigation after the crash are potentially flaky - I don't think there's any guarantee the URLLoaderFactoryGetter will notice the old URLLoaderFactory's mojo pipe was closed before we try and use it again.  In practice, it may be this currently always happens, but mojo behavior is subject to change...

I think the simplest change here would be to not listen to the mojo pipe for closure, and just check for liveness each time it's used.  That may mean on network service crash the next navigation will always fail, since we don't have a live request on the URLLoaderFactory pipe, though I'm not positive of that.  It does mean that on tests, at least, we consistently only notice the pipe is dead on navigation, so we'd need a call to explicitly revive the URLLoaderFactoryGetter after each crash, which does seem a bit ugly.

I could try another approach, but that mojo race concerns me a bit, and seems like it would be good to fix it at the same time as well.
> So it looks to me like all network service crash tests that do a navigation after the crash are potentially flaky - I don't think there's any guarantee the URLLoaderFactoryGetter will notice the old URLLoaderFactory's mojo pipe was closed before we try and use it again.  In practice, it may be this currently always happens, but mojo behavior is subject to change...

Hmmm will it work if we flush |URLLoaderFactoryGetter| after the crash in all related tests? e.g.
https://cs.chromium.org/chromium/src/content/browser/network_service_restart_browsertest.cc?l=321&rcl=339f50e14b2f03c6c1ffb04f7fcbe185bb910e1c

> I think the simplest change here would be to not listen to the mojo pipe for closure, and just check for liveness each time it's used.  That may mean on network service crash the next navigation will always fail, since we don't have a live request on the URLLoaderFactory pipe, though I'm not positive of that.  It does mean that on tests, at least, we consistently only notice the pipe is dead on navigation, so we'd need a call to explicitly revive the URLLoaderFactoryGetter after each crash, which does seem a bit ugly.

Actually I don't recall I ever added connection error listener in |URLLoaderFactoryGetter|, and my original approach was to just check for liveness each time it's used (as you've mentioned).
The next navigation won't fail since the new pipe will be returned (assuming |StoragePartition| has noticed the error).

The connection error handler seems to be introduced later:
https://chromium-review.googlesource.com/c/chromium/src/+/1147042/6/content/browser/url_loader_factory_getter.cc

> Hmmm will it work if we flush |URLLoaderFactoryGetter| after the crash in all related tests? e.g.
https://cs.chromium.org/chromium/src/content/browser/network_service_restart_browsertest.cc?l=321&rcl=339f50e14b2f03c6c1ffb04f7fcbe185bb910e1c

That's not quite enough - we need to flush the interface, and then try to grab a NetworkContext again (Which is, admittedly, just a minor fix), since when we notice an error, we do an IO->UI->IO hop to get a new URLLoaderFactor.  We might actually want to fix that, since otherwise we potentially show 2 error pages (Pipe looks fine, navigate, fails.  Then when we reload, we see the pipe is closed, but we use it anyways while getting a new one).
Cc: cduvall@chromium.org
> That's not quite enough - we need to flush the interface, and then try to grab a NetworkContext again (Which is, admittedly, just a minor fix), since when we notice an error, we do an IO->UI->IO hop to get a new URLLoaderFactor.

I'm not sure I understand the problem correctly, but |StoragePartitionImpl::GetNetworkContext()| should re-grab |NetworkContext| automatically if it was flushed first and noticed connection error (at least at the time I wrote it):
https://chromium.googlesource.com/chromium/src/+blame/2f46bdba272c67d22c59f712949b7fdd71fefb2e/content/browser/storage_partition_impl.cc#690

However the reconnection logic has changed and I'm not entirely sure how it works now.
+cduvall@ for more context.

Also it seems that we already have issue 613371 (`InterfacePtr::encountered_error() can be incorrect if no connection-error handler is set`).

(Actually issue 613371 was related to your previous email, but probably not this issue)
https://cs.chromium.org/chromium/src/content/browser/url_loader_factory_getter.cc?q=url_loader_factory_get&sq=package:chromium&g=0&l=167 - if you just add a connection_error() check there, it posts a task to the UI thread to get a new mojo object, but then it returns the old borked mojo object.
Yes, as mentioned in #c11 the reconnection logic has been changed and adding a |connection_error()| check won't work today. The original code was something like:
```
network::mojom::URLLoaderFactory*
URLLoaderFactoryGetter::GetURLLoaderFactory() {
  DCHECK_CURRENTLY_ON(BrowserThread::IO);
  if (g_get_network_factory_callback.Get() && !test_factory_)
    g_get_network_factory_callback.Get().Run(this);

  if (test_factory_)
    return test_factory_;

  if (!network_factory_.is_bound() || network_factory_.encountered_error()) {
    BrowserThread::PostTask(
        BrowserThread::UI, FROM_HERE,
        base::BindOnce(
            &URLLoaderFactoryGetter::HandleNetworkFactoryRequestOnUIThread,
            this, mojo::MakeRequest(&network_factory_)));
  }
  return network_factory_.get();
}
```
Where the new mojo pipe will be returned (and the similar for |StoragePartitionImpl::GetNetworkContext()|).

Are you suggesting reverting it back to this pattern? I believe cduvall@ changed them for a good reason.

Yes, I am...But also keeping the current pattern for initialization.  (i.e., bind on UI thread when the storage partition tells us to, and then bind the resulting request on the IO thread on first use.  Then if we need a new URLLoaderFactory, use that pattern)
I see, thanks for the explanation! I like the original pattern so we won't spin reconnect requests when it never succeeds.

Anyway will wait for cduvall@'s input.

chongz:  Looking at crbug.com/613371, it seems the issue is just what I mentioned above - it means we'd always show a single error page per StoragePartition after a network service crash, even if the network service had long since restarted.

I had been assuming the pattern was to avoid a threadhop on the first load.

Actually...The StoragePartition is already watching the NetworkContext.  Perhaps we can just have it poke the URLLoaderFactoryGetter when there's an error?  If we just assumed that the URLLoaderFactory only fails when the NetworkContext pipe does, too, we'd only need the push path.  Tests would still need to do something to wait until the StoragePartition notices the pipe closure, though.
Hmmm thinking more about it, the summary should be:

1. The original pattern (check |encountered_error()| lazily) won't notice error on first crash if it was never dereferenced before. Following crashes should be fine since we've dereferencing it immediately.

2. The new approach (|set_connection_error_handler()|) fixed the first crash issue in 1., but introduced the refcount issue mentioned in the issue description.


Possible Solutions:

a. Revert back to 1. and dereference |InterfacePtr<>| manually on first bind.

b. Adopt a push model (from |StoragePartition|) as mentioned in #c17.
  * We are already using push model in path: browser => render => workers.
  * Note: A centralized connection error observer on 'network_service_instance.h| won't work, see https://chromium-review.googlesource.com/c/chromium/src/+/1139101/2/content/browser/frame_host/render_frame_host_impl.cc#4666

c. Introduce utility |WrapRefCountedIfHasAtLeastOneRef()|.

d. Always bind |InterfacePtr<>|, and use |InterfacePtrInfo<>| for transition purpose.


I don't have a strong preference but b. seems to be easier to implement (cannot recall why I didn't use the push model in the first place...).

I'll brainstorm a bit more tomorrow, but will tentatively plan to go with b.
Another solution:  Use a weak reference to the StoragePartitionImpl, add an object that owns that weak reference, and lives on the IO thread, owned by the refcounted object.  Have that object own the interface ptr, and listen to it for errors, and when it needs a reference, post a task to the UI thread to get a new one - this task would run a static function, so can be called even after the getter is destroyed, without issue.  When the getter is destroyed, regardless of which thread it's destroyed on, post a task to destroy the non-refcounted object on the IO thread.  (The weak ptr could also just be owned by a callback passed to the getter instead, if we want to minimize exposure of our weak ptrs)

This is more complicated to describe than it is to implement.  Nice thing is that the getter class is a bit more self-contained.
Project Member

Comment 21 by bugdroid1@chromium.org, Aug 10

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/58a77c090a803c77ed5496237bde55b62c21deda

commit 58a77c090a803c77ed5496237bde55b62c21deda
Author: Matt Menke <mmenke@chromium.org>
Date: Fri Aug 10 03:49:02 2018

URLLoaderFactoryGetter: Fix double free.

The class is refcounted and uses Mojo pipes. It could execute tasks as
a result of Mojo callbacks after its refcount was 0, while there was a
task to delete itself posted. Those tasks, if executed, would grab
another reference (incrementing it from 0 to 1). That would result in
use-after-frees and double deletion, once the already posted delete
task was executed.

This CL fixes that by getting rid of the callbacks executed from Mojo,
and removing the weak reference used in another call. The downside of
this approach is that Mojo errors are only lazily detected, which could
result in displaying an extra error page in the case of network service
crash.

Bug:  870942 
Change-Id: Ic7b00de6e7c623dc62098118292290666c91b1a7
Reviewed-on: https://chromium-review.googlesource.com/1164302
Reviewed-by: Chong Zhang <chongz@chromium.org>
Reviewed-by: John Abd-El-Malek <jam@chromium.org>
Commit-Queue: Matt Menke <mmenke@chromium.org>
Cr-Commit-Position: refs/heads/master@{#582038}
[modify] https://crrev.com/58a77c090a803c77ed5496237bde55b62c21deda/content/browser/url_loader_factory_getter.cc
[modify] https://crrev.com/58a77c090a803c77ed5496237bde55b62c21deda/content/browser/url_loader_factory_getter.h

Status: Fixed (was: Assigned)
Oops - that description is wrong, describes the first approach I tried.  The actual one I went with does not display extra error pages (Instead, it results in a lazily done extra thread-hop in that case, which I don't think is a big deal)

Sign in to add a comment