Cancel orphaned TCP job if its not bound |
|||||
Issue description
Currently, we leave the main job running if the request is bound to an alternative job.
I am proposing that we cancel the main job if the HttpStreamRequest is served by an alternative job.
This will reduce the number of unnecessary TCP/SSL connect attempts.
The change involves adding main_job_.reset() to the else-case below.
void HttpStreamFactoryImpl::JobController::OnRequestComplete() {
DCHECK(request_);
RemoveRequestFromSpdySessionRequestMap();
CancelJobs();
request_ = nullptr;
if (bound_job_) {
if (bound_job_->job_type() == MAIN) {
main_job_.reset();
// |alternative_job_| can be non-null if |main_job_| is resumed after
// |main_job_wait_time_| has elapsed. Allow |alternative_job_| to run to
// completion, rather than resetting it. OnOrphanedJobComplete() will
// clean up |this| when the job completes.
} else {
DCHECK(bound_job_->job_type() == ALTERNATIVE);
alternative_job_.reset();
+ main_job_.reset();
}
bound_job_ = nullptr;
}
MaybeNotifyFactoryOfCompletion();
}
,
Aug 21 2017
We delay the creation of the TCP job by <some amount> to avoid doing the TLS handshake. Only if the QUIC job has not succeeded by this time do we start the main job. Perhaps this timeout is poorly tuned? I don't think I have any concerns about this change, but I'd be curious to know how often it happens. Perhaps we could add a histogram associated of when we reset the main job in that case (and the main job is non-null). If this is happening regularly, that'd be (I think) some strong evidence that we've tuned the timer poorly.
,
Aug 21 2017
> I am proposing that we cancel the main job if the HttpStreamRequest is served by an alternative job. Would that be better to cancel the TCP job in HttpStreamFactoryImpl::JobController::OrphanUnboundJob? We used to think cancelling TCP job would be hard if it started already. And currently we treat "still waiting" as TCP job is still cancellable. If you want to cancel TCP job preemptively, this might be a better place to do it?
,
Aug 21 2017
> We delay the creation of the TCP job by <some amount> to avoid doing the TLS handshake. Only if the QUIC job has not succeeded by this time do we start the main job. Perhaps this timeout is poorly tuned? The problem isn't so much with the value of the timeout. When a Quic job cannot finish synchronously, we start TCP job (should_wait = false). If we have a couple of these should_wait = false main TCP jobs, there are a lot TLS handshakes happening at the same time. Both the racing timeout and the additional throttle logic added in Issue 718576 do not curb these simultaneous TLS handshakes to the same origin. I can definitely add a histogram if you think it's worth doing. > Would that be better to cancel the TCP job in HttpStreamFactoryImpl::JobController::OrphanUnboundJob? Yes, that part needs a cleanup as well. I sent out a CL, let's move the discussion over there? https://chromium-review.googlesource.com/c/chromium/src/+/624420
,
Aug 21 2017
An NetLog showing this behavior is at: https://drive.google.com/a/google.com/file/d/0B_9WseIqRoaFUjdETTVYOXNKVHc/view?usp=sharing (shared internally because of sensitive data) Event 19 is a main TCP Job. Event 20 is a QUIC Job. The request is served at t=1163, but the TCP job still hangs around and continues to do tls handshake.
,
Aug 21 2017
> The problem isn't so much with the value of the timeout. When a Quic job cannot finish synchronously, we start TCP job (should_wait = false). This is ... super confusing! :) The TCP job *should* be waiting. Cherie and I talked about this, and we're both flummoxed by why this is happening. She's going to dig.
,
Aug 22 2017
Thanks Ryan digging into the code together. There're two bugs here. #1: The initial TCP job(event 6) *should* be waiting. And it's not waiting because the alt job doesn't finish synchronously(which unblocks the tcp job) and the wait time is zero as QuicStreamFactory by default requires handshake confirmation. The reason that we delay TCP in this case is that we figured we don't require confirmation too late until a socket is configured for QUIC. #2: The second TCP job(event 19) *should* be killed. This is caused as new "throttling logic" adds in and breaks the invariant. We used to assume "orphan" logic happens right after wait state and either before init connection or during connection. However, this is no longer true now. There's a new state where job could be resumed, go to throttling, and get orphaned before init connection. This needs to be fixed. Additionally, we used to not kill TCP jobs once they start connecting. I am not sure about the underlying complexities of killing everything. Being conservative, I would recommend fixing the two bugs mentioned above while hold back killing too aggressively.
,
Aug 22 2017
To solve #2, I think the cleanest is to kill/cancel all orphaned TCP jobs right away. I think there are too many assumptions and complexity in this class -- to quote: "assume "orphan" logic happens right after wait state and either before init connection or during connection." I vote for simplify this part of logic.
,
Aug 22 2017
Just killing it is definitely a possibility and it's clearly cleaner. One potential gotcha is that in the case of 0-RTT failing, the Job will finish before the handshake (Eventually) fails. At that point the request will need to be retried. It would be desirable, in that case, to be able to use the socket established by the main job. But of course, we want to minimize the number of spurious handshakes we do, so the desire is to delay the start of the main job so perhaps this situation is rare? That being said, from looking at the socket pool code, I *think* that canceling the HttpStreamFactoryImpl::Job will not actual cancel the underlying TCP/SSL connect job. So if that's right, it totally mitigates my concern. (So yay! :>) On the other hand maybe that means it won't do what we're hoping it will do?
,
Aug 22 2017
Since we always do late-binding for socket requests, I think it's fine canceling the TCP HttpStreamFactoryImpl::Job. I don't think we gain much by having the TCP job around. +mmenke@ to confirm. If the QUIC connection for some reason fails later (in the 0RTT case) and we mark QUIC broken, retrying the HttpNetworkTransaction shouldn't be much different. The spurious handshakes are not as rare as we thought they were, especially for cases like startup and session restore. Spurious handshakes aside, I am more concerned about the code complexity of this class.
,
Aug 22 2017
To clarify, your late-binding comment means that you think killing the TCP HttpStreamFactoryImpl::Job will not kill the TPC connect job. Is that right? If so, I definitely have no concerns about killing the TCP HttpStreamFactoryImpl::Job in the case the that alternative job has completed. "The spurious handshakes are not as rare as we thought they were, especially for cases like startup and session restore." We should definitely fix these! :( Cherie's work to solve issue #1 should to a long way to help here, I believe, but if there are other cases where we're doing the wrong thing, let's get to the bottom of the situation and figure out how to avoid screwing things up. For example, perhaps even when handshake confirmation is required by the QuicStreamFactory, we should still delay the TCP job (as we do when it is not required).
,
Aug 22 2017
> To clarify, your late-binding comment means that you think killing the TCP HttpStreamFactoryImpl::Job will not kill the TPC connect job. Is that right? Yes. I believe that is true. Resetting not-yet-established ClientSocketHandle only cancels the request. Any in-progress ConnectJobs will continue and return established sockets to socket pools. I do think we need delay tcp job logic as well as the throttling logic. Since Cherie is fixing #1, do you have any objection in me uploading a patch to cancel TCP Job for #2?
,
Aug 22 2017
Cancelling an HttpStreamFactoryImpl::Job generally will not cancel the underlying TCP/SSL ConnectJob (Unless we're at the socket pool limit). See https://cs.chromium.org/chromium/src/net/socket/client_socket_pool_base.cc?type=cs&l=609 for the relevant code.
,
Aug 22 2017
Great! In that case there should be no downside to killing the TCP/TLS HttpStreamFactoryImpl::Job. I wonder if the same reasoning can be applied to the alternative job. If we kill the QuicStreamRequest, for example, that will not terminate the on-going QUIC connection. So perhaps we can kill the alternative job when the main job succeeds, even if the alternative job has not completed. If the QUIC connection eventually succeeds, a subsequent HttpStreamFactoryImpl::Job will be able to find it. What do y'all think?
,
Aug 22 2017
+100. I like the idea of killing alternative job. That will considerably simplify the JobController lifetime. JobController can now be tied with HttpStreamRequest's lifetime :)
,
Aug 22 2017
> I wonder if the same reasoning can be applied to the alternative job. If we kill the QuicStreamRequest, for example, that will not terminate the on-going QUIC connection. So perhaps we can kill the alternative job when the main job succeeds, even if the alternative job has not completed. We used to not kill alt job when main job succeeds so as to establish the QUIC connection and next time alt job could win the race. Looking at the code, killing QuicSteramRequest removes the Request->QSF::Job mapping, and leaves the QSF::Job around to finish connection. So yeah, we should be able to apply the same logic to alt job as well.
,
Aug 22 2017
Getting rid of orphaned jobs sounds great to me, too.
,
Aug 22 2017
I will start a CL. #1 in Cherie's findings is orthogonal and needs to be split off to a different bug.
,
Aug 22 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/7f0e229fb317da182573209639c32529755dd1f5 commit 7f0e229fb317da182573209639c32529755dd1f5 Author: Helen Li <xunjieli@chromium.org> Date: Tue Aug 22 23:10:41 2017 Cancel orphaned HttpStreamFactoryImpl::Job when HttpStreamRequest is done When an HttpStreamRequest is done, cancel any orphaned HttpStreamFactoryImpl::Job. We do not gain much by leaving the orphaned Jobs around to do connection establishment. In both TCP's and QUIC's cases, canceling Impl::Job does not terminate the underlying connection attempt. This will allow us to simplify the JobController code further -- now that it doesn't live longer than a HttpStreamRequest. I will do follow-ups to clean up unneeded code. Bug: 757548 Change-Id: I7d4c09f5625b27c778a74ed0b44fa83a2b0db695 Reviewed-on: https://chromium-review.googlesource.com/624420 Commit-Queue: Helen Li <xunjieli@chromium.org> Reviewed-by: Ryan Hamilton <rch@chromium.org> Cr-Commit-Position: refs/heads/master@{#496493} [modify] https://crrev.com/7f0e229fb317da182573209639c32529755dd1f5/net/http/http_stream_factory_impl_job_controller.cc [modify] https://crrev.com/7f0e229fb317da182573209639c32529755dd1f5/net/http/http_stream_factory_impl_job_controller_unittest.cc
,
Aug 23 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab commit b6ea9d7648b75a2215126a020bd3c0db0cf1ccab Author: Helen Li <xunjieli@chromium.org> Date: Wed Aug 23 20:17:03 2017 Make HttpStreamFactoryImpl::Request own JobController JobController no longer outlives Request after 7f0e229fb317da182573209639c32529755dd1f5. This CL simplies the lifetime of JobController for non-preconnects. Follow-up CLs will clean up unused logic and duplicated param passing. Bug: 475060, 757548 Change-Id: I4542f0f4f6386e3eeae47c41da8d6e98cce4dd83 Reviewed-on: https://chromium-review.googlesource.com/627311 Reviewed-by: Ryan Hamilton <rch@chromium.org> Reviewed-by: Zhongyi Shi <zhongyi@chromium.org> Commit-Queue: Helen Li <xunjieli@chromium.org> Cr-Commit-Position: refs/heads/master@{#496785} [modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl.cc [modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl.h [modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl_job_controller.cc [modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl_job_controller.h [modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl_job_controller_unittest.cc [modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl_request.cc [modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl_request.h [modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl_request_unittest.cc [modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_test_util.h
,
Aug 23 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f62edf73906fa20106211c5f76de5e03278fdf0a commit f62edf73906fa20106211c5f76de5e03278fdf0a Author: Helen Li <xunjieli@chromium.org> Date: Wed Aug 23 22:25:18 2017 Revert "Make HttpStreamFactoryImpl::Request own JobController" This reverts commit b6ea9d7648b75a2215126a020bd3c0db0cf1ccab. Reason for revert: There breaks ReportBrokenAlternativeService() Original change's description: > Make HttpStreamFactoryImpl::Request own JobController > > JobController no longer outlives Request after > 7f0e229fb317da182573209639c32529755dd1f5. This CL simplies the lifetime of > JobController for non-preconnects. > > Follow-up CLs will clean up unused logic and duplicated param passing. > > Bug: 475060, 757548 > Change-Id: I4542f0f4f6386e3eeae47c41da8d6e98cce4dd83 > Reviewed-on: https://chromium-review.googlesource.com/627311 > Reviewed-by: Ryan Hamilton <rch@chromium.org> > Reviewed-by: Zhongyi Shi <zhongyi@chromium.org> > Commit-Queue: Helen Li <xunjieli@chromium.org> > Cr-Commit-Position: refs/heads/master@{#496785} TBR=rch@chromium.org,xunjieli@chromium.org,zhongyi@chromium.org Change-Id: Id42cc4b5fafbfcba00ae572c3d2505100c93e376 No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: 475060, 757548 Reviewed-on: https://chromium-review.googlesource.com/630396 Reviewed-by: Ryan Hamilton <rch@chromium.org> Commit-Queue: Ryan Hamilton <rch@chromium.org> Cr-Commit-Position: refs/heads/master@{#496831} [modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl.cc [modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl.h [modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl_job_controller.cc [modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl_job_controller.h [modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl_job_controller_unittest.cc [modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl_request.cc [modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl_request.h [modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl_request_unittest.cc [modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_test_util.h
,
Aug 23 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/1856b037275d856bea65471fa568cecf71eed20d commit 1856b037275d856bea65471fa568cecf71eed20d Author: Helen Li <xunjieli@chromium.org> Date: Wed Aug 23 22:30:13 2017 Revert "Cancel orphaned HttpStreamFactoryImpl::Job when HttpStreamRequest is done" This reverts commit 7f0e229fb317da182573209639c32529755dd1f5. Reason for revert: Alternative job cannot be reset when main job finishes Original change's description: > Cancel orphaned HttpStreamFactoryImpl::Job when HttpStreamRequest is done > > When an HttpStreamRequest is done, cancel any orphaned > HttpStreamFactoryImpl::Job. We do not gain much by leaving the > orphaned Jobs around to do connection establishment. In both > TCP's and QUIC's cases, canceling Impl::Job does not terminate > the underlying connection attempt. > > This will allow us to simplify the JobController code further -- > now that it doesn't live longer than a HttpStreamRequest. I will do > follow-ups to clean up unneeded code. > > Bug: 757548 > Change-Id: I7d4c09f5625b27c778a74ed0b44fa83a2b0db695 > Reviewed-on: https://chromium-review.googlesource.com/624420 > Commit-Queue: Helen Li <xunjieli@chromium.org> > Reviewed-by: Ryan Hamilton <rch@chromium.org> > Cr-Commit-Position: refs/heads/master@{#496493} TBR=rch@chromium.org,xunjieli@chromium.org Change-Id: Iade06edb13376bea21b57f4a531de1e726da1d89 No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: 757548 Reviewed-on: https://chromium-review.googlesource.com/630297 Reviewed-by: Helen Li <xunjieli@chromium.org> Commit-Queue: Helen Li <xunjieli@chromium.org> Cr-Commit-Position: refs/heads/master@{#496834} [modify] https://crrev.com/1856b037275d856bea65471fa568cecf71eed20d/net/http/http_stream_factory_impl_job_controller.cc [modify] https://crrev.com/1856b037275d856bea65471fa568cecf71eed20d/net/http/http_stream_factory_impl_job_controller_unittest.cc
,
Aug 23 2017
Talked to Ryan offline. The main reason why we can't reset the alternative job on Request completion is because only JobController knows whether quic job fails after tcp job has succeeded. The information is needed to mark QUIC as bad (the ReportBrokenAlternativeService() method). Without the orphan job logic, we wouldn't be able to know the cases where a quic job fails after a tcp job has succeeded. I will upload a CL tomorrow to only reset the orphaned TCP job.
,
Aug 24 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/5649d99327a922a75948fafdab49529675a576b8 commit 5649d99327a922a75948fafdab49529675a576b8 Author: Helen Li <xunjieli@chromium.org> Date: Thu Aug 24 21:23:06 2017 Cancel orphaned TCP HttpStreamFactoryImpl::Job When a tcp HttpStreamFactoryImpl::Job is orphaned, delete it. We do not gain much by leaving the TCP orphaned Jobs around to do connection establishment, because canceling Impl::Job does not terminate the underlying connection attempt. See more discussion on this in the linked bug below. Bug: 757548 Change-Id: If791bf56e66378c190ade935130bc380fab41b39 Reviewed-on: https://chromium-review.googlesource.com/630179 Commit-Queue: Helen Li <xunjieli@chromium.org> Reviewed-by: Zhongyi Shi <zhongyi@chromium.org> Reviewed-by: Ryan Hamilton <rch@chromium.org> Cr-Commit-Position: refs/heads/master@{#497191} [modify] https://crrev.com/5649d99327a922a75948fafdab49529675a576b8/net/http/http_stream_factory_impl_job.cc [modify] https://crrev.com/5649d99327a922a75948fafdab49529675a576b8/net/http/http_stream_factory_impl_job.h [modify] https://crrev.com/5649d99327a922a75948fafdab49529675a576b8/net/http/http_stream_factory_impl_job_controller.cc [modify] https://crrev.com/5649d99327a922a75948fafdab49529675a576b8/net/http/http_stream_factory_impl_job_controller_unittest.cc
,
Aug 24 2017
,
Aug 24 2017
\o/ Thanks!
,
Aug 25 2017
|
|||||
►
Sign in to add a comment |
|||||
Comment 1 by xunji...@chromium.org
, Aug 21 2017