URLLoaderFactoryGetter is unsafe |
|||||
Issue descriptionURLLoaderFactoryGetter has the exact same issue that I'm in https://chromium-review.googlesource.com/c/chromium/src/+/1153630. It's a refcounted object that owns a mojo pipe, and posts a task to delete itself. If it gets a a connection error when the delete task is pending, it grabs another reference to itself while posting a task to the UI thread, upping the refcount from 0 to 1. This causes a DCHECK, but if it didn't, the class would delete itself twice, since the reference count would reach 0 twice. This hasn't surfaced as a problem because we generally succeed on retry, since it's a flaky error, but if you look at the mojo_linux browser_test results, you can see a bunch of these failures on each try run. This class looks to be used pretty extensively in content, at least when the network service is enabled, though not sure if it's used at all when the network service is disabled. Either way, we should fix this promptly.
,
Aug 6
It feels to me that |RefCountedThreadSafe<>| doesn't play nicely with |WeakPtr|, and a quick fix would be checking |HasAtLeastOneRef()| before upping the refcount if it was posted from a |WeakPtr<>|: https://cs.chromium.org/chromium/src/content/browser/url_loader_factory_getter.cc?l=181&rcl=87a7d8bef673fd8dd19a96706beb1744e64f9174 I'm not sure but do we have better utilities?
,
Aug 6
HasAtLeastOneRef would still be racy - since we could post the task immediately after that check, but before we grab another ref. We'd need an atomic method to check the reference count, and if it's non-zero, increment the reference.
,
Aug 6
Think I'll take this on myself - fixed the other instance that was much like this already (Though this case is more complicated than the one I fixed, and may need a different solution).
,
Aug 6
,
Aug 6
Cool,thanks Matt!
,
Aug 6
Do we even need this class? Seems like we could just switch to SharedURLLoaderFactory[Info], and we could even reuse the ForBrowserProcessIOThread one, since URLLoaderFactoryGetter doesn't set any magic options (Like process ID for the corresponding renderer process - which seems like it may be a bug, for things like auth) or any web security features.
,
Aug 7
So it looks to me like all network service crash tests that do a navigation after the crash are potentially flaky - I don't think there's any guarantee the URLLoaderFactoryGetter will notice the old URLLoaderFactory's mojo pipe was closed before we try and use it again. In practice, it may be this currently always happens, but mojo behavior is subject to change... I think the simplest change here would be to not listen to the mojo pipe for closure, and just check for liveness each time it's used. That may mean on network service crash the next navigation will always fail, since we don't have a live request on the URLLoaderFactory pipe, though I'm not positive of that. It does mean that on tests, at least, we consistently only notice the pipe is dead on navigation, so we'd need a call to explicitly revive the URLLoaderFactoryGetter after each crash, which does seem a bit ugly. I could try another approach, but that mojo race concerns me a bit, and seems like it would be good to fix it at the same time as well.
,
Aug 7
> So it looks to me like all network service crash tests that do a navigation after the crash are potentially flaky - I don't think there's any guarantee the URLLoaderFactoryGetter will notice the old URLLoaderFactory's mojo pipe was closed before we try and use it again. In practice, it may be this currently always happens, but mojo behavior is subject to change... Hmmm will it work if we flush |URLLoaderFactoryGetter| after the crash in all related tests? e.g. https://cs.chromium.org/chromium/src/content/browser/network_service_restart_browsertest.cc?l=321&rcl=339f50e14b2f03c6c1ffb04f7fcbe185bb910e1c > I think the simplest change here would be to not listen to the mojo pipe for closure, and just check for liveness each time it's used. That may mean on network service crash the next navigation will always fail, since we don't have a live request on the URLLoaderFactory pipe, though I'm not positive of that. It does mean that on tests, at least, we consistently only notice the pipe is dead on navigation, so we'd need a call to explicitly revive the URLLoaderFactoryGetter after each crash, which does seem a bit ugly. Actually I don't recall I ever added connection error listener in |URLLoaderFactoryGetter|, and my original approach was to just check for liveness each time it's used (as you've mentioned). The next navigation won't fail since the new pipe will be returned (assuming |StoragePartition| has noticed the error). The connection error handler seems to be introduced later: https://chromium-review.googlesource.com/c/chromium/src/+/1147042/6/content/browser/url_loader_factory_getter.cc
,
Aug 7
> Hmmm will it work if we flush |URLLoaderFactoryGetter| after the crash in all related tests? e.g. https://cs.chromium.org/chromium/src/content/browser/network_service_restart_browsertest.cc?l=321&rcl=339f50e14b2f03c6c1ffb04f7fcbe185bb910e1c That's not quite enough - we need to flush the interface, and then try to grab a NetworkContext again (Which is, admittedly, just a minor fix), since when we notice an error, we do an IO->UI->IO hop to get a new URLLoaderFactor. We might actually want to fix that, since otherwise we potentially show 2 error pages (Pipe looks fine, navigate, fails. Then when we reload, we see the pipe is closed, but we use it anyways while getting a new one).
,
Aug 7
> That's not quite enough - we need to flush the interface, and then try to grab a NetworkContext again (Which is, admittedly, just a minor fix), since when we notice an error, we do an IO->UI->IO hop to get a new URLLoaderFactor. I'm not sure I understand the problem correctly, but |StoragePartitionImpl::GetNetworkContext()| should re-grab |NetworkContext| automatically if it was flushed first and noticed connection error (at least at the time I wrote it): https://chromium.googlesource.com/chromium/src/+blame/2f46bdba272c67d22c59f712949b7fdd71fefb2e/content/browser/storage_partition_impl.cc#690 However the reconnection logic has changed and I'm not entirely sure how it works now. +cduvall@ for more context. Also it seems that we already have issue 613371 (`InterfacePtr::encountered_error() can be incorrect if no connection-error handler is set`).
,
Aug 7
(Actually issue 613371 was related to your previous email, but probably not this issue)
,
Aug 7
https://cs.chromium.org/chromium/src/content/browser/url_loader_factory_getter.cc?q=url_loader_factory_get&sq=package:chromium&g=0&l=167 - if you just add a connection_error() check there, it posts a task to the UI thread to get a new mojo object, but then it returns the old borked mojo object.
,
Aug 7
Yes, as mentioned in #c11 the reconnection logic has been changed and adding a |connection_error()| check won't work today. The original code was something like:
```
network::mojom::URLLoaderFactory*
URLLoaderFactoryGetter::GetURLLoaderFactory() {
DCHECK_CURRENTLY_ON(BrowserThread::IO);
if (g_get_network_factory_callback.Get() && !test_factory_)
g_get_network_factory_callback.Get().Run(this);
if (test_factory_)
return test_factory_;
if (!network_factory_.is_bound() || network_factory_.encountered_error()) {
BrowserThread::PostTask(
BrowserThread::UI, FROM_HERE,
base::BindOnce(
&URLLoaderFactoryGetter::HandleNetworkFactoryRequestOnUIThread,
this, mojo::MakeRequest(&network_factory_)));
}
return network_factory_.get();
}
```
Where the new mojo pipe will be returned (and the similar for |StoragePartitionImpl::GetNetworkContext()|).
Are you suggesting reverting it back to this pattern? I believe cduvall@ changed them for a good reason.
,
Aug 7
Yes, I am...But also keeping the current pattern for initialization. (i.e., bind on UI thread when the storage partition tells us to, and then bind the resulting request on the IO thread on first use. Then if we need a new URLLoaderFactory, use that pattern)
,
Aug 7
I see, thanks for the explanation! I like the original pattern so we won't spin reconnect requests when it never succeeds. Anyway will wait for cduvall@'s input.
,
Aug 7
chongz: Looking at crbug.com/613371, it seems the issue is just what I mentioned above - it means we'd always show a single error page per StoragePartition after a network service crash, even if the network service had long since restarted. I had been assuming the pattern was to avoid a threadhop on the first load. Actually...The StoragePartition is already watching the NetworkContext. Perhaps we can just have it poke the URLLoaderFactoryGetter when there's an error? If we just assumed that the URLLoaderFactory only fails when the NetworkContext pipe does, too, we'd only need the push path. Tests would still need to do something to wait until the StoragePartition notices the pipe closure, though.
,
Aug 7
Hmmm thinking more about it, the summary should be: 1. The original pattern (check |encountered_error()| lazily) won't notice error on first crash if it was never dereferenced before. Following crashes should be fine since we've dereferencing it immediately. 2. The new approach (|set_connection_error_handler()|) fixed the first crash issue in 1., but introduced the refcount issue mentioned in the issue description. Possible Solutions: a. Revert back to 1. and dereference |InterfacePtr<>| manually on first bind. b. Adopt a push model (from |StoragePartition|) as mentioned in #c17. * We are already using push model in path: browser => render => workers. * Note: A centralized connection error observer on 'network_service_instance.h| won't work, see https://chromium-review.googlesource.com/c/chromium/src/+/1139101/2/content/browser/frame_host/render_frame_host_impl.cc#4666 c. Introduce utility |WrapRefCountedIfHasAtLeastOneRef()|. d. Always bind |InterfacePtr<>|, and use |InterfacePtrInfo<>| for transition purpose. I don't have a strong preference but b. seems to be easier to implement (cannot recall why I didn't use the push model in the first place...).
,
Aug 7
I'll brainstorm a bit more tomorrow, but will tentatively plan to go with b.
,
Aug 8
Another solution: Use a weak reference to the StoragePartitionImpl, add an object that owns that weak reference, and lives on the IO thread, owned by the refcounted object. Have that object own the interface ptr, and listen to it for errors, and when it needs a reference, post a task to the UI thread to get a new one - this task would run a static function, so can be called even after the getter is destroyed, without issue. When the getter is destroyed, regardless of which thread it's destroyed on, post a task to destroy the non-refcounted object on the IO thread. (The weak ptr could also just be owned by a callback passed to the getter instead, if we want to minimize exposure of our weak ptrs) This is more complicated to describe than it is to implement. Nice thing is that the getter class is a bit more self-contained.
,
Aug 10
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/58a77c090a803c77ed5496237bde55b62c21deda commit 58a77c090a803c77ed5496237bde55b62c21deda Author: Matt Menke <mmenke@chromium.org> Date: Fri Aug 10 03:49:02 2018 URLLoaderFactoryGetter: Fix double free. The class is refcounted and uses Mojo pipes. It could execute tasks as a result of Mojo callbacks after its refcount was 0, while there was a task to delete itself posted. Those tasks, if executed, would grab another reference (incrementing it from 0 to 1). That would result in use-after-frees and double deletion, once the already posted delete task was executed. This CL fixes that by getting rid of the callbacks executed from Mojo, and removing the weak reference used in another call. The downside of this approach is that Mojo errors are only lazily detected, which could result in displaying an extra error page in the case of network service crash. Bug: 870942 Change-Id: Ic7b00de6e7c623dc62098118292290666c91b1a7 Reviewed-on: https://chromium-review.googlesource.com/1164302 Reviewed-by: Chong Zhang <chongz@chromium.org> Reviewed-by: John Abd-El-Malek <jam@chromium.org> Commit-Queue: Matt Menke <mmenke@chromium.org> Cr-Commit-Position: refs/heads/master@{#582038} [modify] https://crrev.com/58a77c090a803c77ed5496237bde55b62c21deda/content/browser/url_loader_factory_getter.cc [modify] https://crrev.com/58a77c090a803c77ed5496237bde55b62c21deda/content/browser/url_loader_factory_getter.h
,
Aug 10
,
Aug 10
Oops - that description is wrong, describes the first approach I tried. The actual one I went with does not display extra error pages (Instead, it results in a lazily done extra thread-hop in that case, which I don't think is a big deal) |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by mmenke@chromium.org
, Aug 4