New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 757548 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 718576



Sign in to add a comment

Cancel orphaned TCP job if its not bound

Project Member Reported by xunji...@chromium.org, Aug 21 2017

Issue description

Currently, we leave the main job running if the request is bound to an alternative job.

I am proposing that we cancel the main job if the HttpStreamRequest is served by an alternative job.

This will reduce the number of unnecessary TCP/SSL connect attempts.

The change involves adding main_job_.reset() to the else-case below.

void HttpStreamFactoryImpl::JobController::OnRequestComplete() {
  DCHECK(request_);

  RemoveRequestFromSpdySessionRequestMap();
  CancelJobs();
  request_ = nullptr;
  if (bound_job_) {
    if (bound_job_->job_type() == MAIN) {
      main_job_.reset();
      // |alternative_job_| can be non-null if |main_job_| is resumed after
      // |main_job_wait_time_| has elapsed. Allow |alternative_job_| to run to
      // completion, rather than resetting it. OnOrphanedJobComplete() will
      // clean up |this| when the job completes.
    } else {
      DCHECK(bound_job_->job_type() == ALTERNATIVE);
      alternative_job_.reset();
+     main_job_.reset();
    }
    bound_job_ = nullptr;
  }
  MaybeNotifyFactoryOfCompletion();
}
 
Cc: rch@chromium.org zhongyi@chromium.org
+ rch, zhongyi@: Any concern?

Comment 2 by rch@chromium.org, Aug 21 2017

We delay the creation of the TCP job by <some amount> to avoid doing the TLS handshake. Only if the QUIC job has not succeeded by this time do we start the main job. Perhaps this timeout is poorly tuned?

I don't think I have any concerns about this change, but I'd be curious to know how often it happens. Perhaps we could add a histogram associated of when we reset the main job in that case (and the main job is non-null). If this is happening regularly, that'd be (I think) some strong evidence that we've tuned the timer poorly.
> I am proposing that we cancel the main job if the HttpStreamRequest is served by an alternative job.

Would that be better to cancel the TCP job in HttpStreamFactoryImpl::JobController::OrphanUnboundJob?

We used to think cancelling TCP job would be hard if it started already. And currently we treat "still waiting" as TCP job is still cancellable. If you want to cancel TCP job preemptively, this might be a better place to do it?
> We delay the creation of the TCP job by <some amount> to avoid doing the TLS handshake. Only if the QUIC job has not succeeded by this time do we start the main job. Perhaps this timeout is poorly tuned?

The problem isn't so much with the value of the timeout. When a Quic job cannot finish synchronously, we start TCP job (should_wait = false). If we have a couple of these should_wait = false main TCP jobs, there are a lot TLS handshakes happening at the same time. Both the racing timeout and the additional throttle logic added in  Issue 718576  do not curb these simultaneous TLS handshakes to the same origin.

I can definitely add a histogram if you think it's worth doing.


> Would that be better to cancel the TCP job in HttpStreamFactoryImpl::JobController::OrphanUnboundJob?

Yes, that part needs a cleanup as well.

I sent out a CL, let's move the discussion over there?

https://chromium-review.googlesource.com/c/chromium/src/+/624420
An NetLog showing this behavior is at: https://drive.google.com/a/google.com/file/d/0B_9WseIqRoaFUjdETTVYOXNKVHc/view?usp=sharing (shared internally because of sensitive data)

Event 19 is a main TCP Job. Event 20 is a QUIC Job. The request is served at t=1163, but the TCP job still hangs around and continues to do tls handshake.

Comment 6 by rch@chromium.org, Aug 21 2017

> The problem isn't so much with the value of the timeout. When a Quic job cannot finish synchronously, we start TCP job (should_wait = false).

This is ... super confusing! :) The TCP job *should* be waiting. Cherie and I talked about this, and we're both flummoxed by why this is happening. She's going to dig.


Thanks Ryan digging into the code together. 

There're two bugs here. 
#1: The initial TCP job(event 6) *should* be waiting. And it's not waiting because the alt job doesn't finish synchronously(which unblocks the tcp job) and the wait time is zero as QuicStreamFactory by default requires handshake confirmation. The reason that we delay TCP in this case is that we figured we don't require confirmation too late until a socket is configured for QUIC. 

#2: The second TCP job(event 19) *should* be killed. This is caused as new "throttling logic" adds in and breaks the invariant. We used to assume "orphan" logic happens right after wait state and either before init connection or during connection. However, this is no longer true now. There's a new state where job could be resumed, go to throttling, and get orphaned before init connection. This needs to be fixed. 

Additionally, we used to not kill TCP jobs once they start connecting. I am not sure about the underlying complexities of killing everything. Being conservative, I would recommend fixing the two bugs mentioned above while hold back killing too aggressively.
To solve #2, I think the cleanest is to kill/cancel all orphaned TCP jobs right away.  I think there are too many assumptions and complexity in this class -- to quote: "assume "orphan" logic happens right after wait state and either before init connection or during connection." I vote for simplify this part of logic. 



Comment 9 by rch@chromium.org, Aug 22 2017

Just killing it is definitely a possibility and it's clearly cleaner. One potential gotcha is that in the case of 0-RTT failing, the Job will finish before the handshake (Eventually) fails. At that point the request will need to be retried. It would be desirable, in that case, to be able to use the socket established by the main job. 

But of course, we want to minimize the number of spurious handshakes we do, so the desire is to delay the start of the main job so perhaps this situation is rare?

That being said, from looking at the socket pool code, I *think* that canceling the HttpStreamFactoryImpl::Job will not actual cancel the underlying TCP/SSL connect job. So if that's right, it totally mitigates my concern. (So yay! :>)

On the other hand maybe that means it won't do what we're hoping it will do?
Cc: mmenke@chromium.org
Since we always do late-binding for socket requests, I think it's fine canceling the TCP HttpStreamFactoryImpl::Job. I don't think we gain much by having the TCP job around. +mmenke@ to confirm. If the QUIC connection for some reason fails later (in the 0RTT case) and we mark QUIC broken, retrying the HttpNetworkTransaction shouldn't be much different.

The spurious handshakes are not as rare as we thought they were, especially for cases like startup and session restore. Spurious handshakes aside, I am more concerned about the code complexity of this class. 

Comment 11 by rch@chromium.org, Aug 22 2017

To clarify, your late-binding comment means that you think killing the TCP HttpStreamFactoryImpl::Job will not kill the TPC connect job. Is that right? If so, I definitely have no concerns about killing the TCP HttpStreamFactoryImpl::Job in the case the that alternative job has completed.

"The spurious handshakes are not as rare as we thought they were, especially for cases like startup and session restore." We should definitely fix these! :( Cherie's work to solve issue #1 should to a long way to help here, I believe, but if there are other cases where we're doing the wrong thing, let's get to the bottom of the situation and figure out how to avoid screwing things up. For example, perhaps even when handshake confirmation is required by the QuicStreamFactory, we should still delay the TCP job (as we do when it is not required).
> To clarify, your late-binding comment means that you think killing the TCP HttpStreamFactoryImpl::Job will not kill the TPC connect job. Is that right? 

Yes. I believe that is true. Resetting not-yet-established ClientSocketHandle only cancels the request. Any in-progress ConnectJobs will continue and return established sockets to socket pools.

I do think we need delay tcp job logic as well as the throttling logic. 

Since Cherie is fixing #1, do you have any objection in me uploading a patch to cancel TCP Job for #2?
Cancelling an HttpStreamFactoryImpl::Job generally will not cancel the underlying TCP/SSL ConnectJob (Unless we're at the socket pool limit).  See https://cs.chromium.org/chromium/src/net/socket/client_socket_pool_base.cc?type=cs&l=609 for the relevant code.

Comment 14 by rch@chromium.org, Aug 22 2017

Great! In that case there should be no downside to killing the TCP/TLS HttpStreamFactoryImpl::Job. I wonder if the same reasoning can be applied to the alternative job. If we kill the QuicStreamRequest, for example, that will not terminate the on-going QUIC connection. So perhaps we can kill the alternative job when the main job succeeds, even if the alternative job has not completed. If the QUIC connection eventually succeeds, a subsequent HttpStreamFactoryImpl::Job will be able to find it. 

What do y'all think?
+100. I like the idea of killing alternative job. That will considerably simplify the JobController lifetime. JobController can now be tied with HttpStreamRequest's lifetime :)
> I wonder if the same reasoning can be applied to the alternative job. If we kill the QuicStreamRequest, for example, that will not terminate the on-going QUIC connection. So perhaps we can kill the alternative job when the main job succeeds, even if the alternative job has not completed.

We used to not kill alt job when main job succeeds so as to establish the QUIC connection and next time alt job could win the race. Looking at the code, killing QuicSteramRequest removes the Request->QSF::Job mapping, and leaves the QSF::Job around to finish connection. So yeah, we should be able to apply the same logic to alt job as well. 
Getting rid of orphaned jobs sounds great to me, too.
Labels: -Pri-3 M-62 Pri-2
Owner: xunji...@chromium.org
Status: Assigned (was: Untriaged)
I will start a CL. #1 in Cherie's findings is orthogonal and needs to be split off to a different bug.
Project Member

Comment 19 by bugdroid1@chromium.org, Aug 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7f0e229fb317da182573209639c32529755dd1f5

commit 7f0e229fb317da182573209639c32529755dd1f5
Author: Helen Li <xunjieli@chromium.org>
Date: Tue Aug 22 23:10:41 2017

Cancel orphaned HttpStreamFactoryImpl::Job when HttpStreamRequest is done

When an HttpStreamRequest is done, cancel any orphaned
HttpStreamFactoryImpl::Job. We do not gain much by leaving the
orphaned Jobs around to do connection establishment. In both
TCP's and QUIC's cases, canceling Impl::Job does not terminate
the underlying connection attempt.

This will allow us to simplify the JobController code further --
now that it doesn't live longer than a HttpStreamRequest. I will do
follow-ups to clean up unneeded code.

Bug:  757548 
Change-Id: I7d4c09f5625b27c778a74ed0b44fa83a2b0db695
Reviewed-on: https://chromium-review.googlesource.com/624420
Commit-Queue: Helen Li <xunjieli@chromium.org>
Reviewed-by: Ryan Hamilton <rch@chromium.org>
Cr-Commit-Position: refs/heads/master@{#496493}
[modify] https://crrev.com/7f0e229fb317da182573209639c32529755dd1f5/net/http/http_stream_factory_impl_job_controller.cc
[modify] https://crrev.com/7f0e229fb317da182573209639c32529755dd1f5/net/http/http_stream_factory_impl_job_controller_unittest.cc

Project Member

Comment 20 by bugdroid1@chromium.org, Aug 23 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab

commit b6ea9d7648b75a2215126a020bd3c0db0cf1ccab
Author: Helen Li <xunjieli@chromium.org>
Date: Wed Aug 23 20:17:03 2017

Make HttpStreamFactoryImpl::Request own JobController

JobController no longer outlives Request after
7f0e229fb317da182573209639c32529755dd1f5. This CL simplies the lifetime of
JobController for non-preconnects.

Follow-up CLs will clean up unused logic and duplicated param passing.

Bug: 475060,  757548 
Change-Id: I4542f0f4f6386e3eeae47c41da8d6e98cce4dd83
Reviewed-on: https://chromium-review.googlesource.com/627311
Reviewed-by: Ryan Hamilton <rch@chromium.org>
Reviewed-by: Zhongyi Shi <zhongyi@chromium.org>
Commit-Queue: Helen Li <xunjieli@chromium.org>
Cr-Commit-Position: refs/heads/master@{#496785}
[modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl.cc
[modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl.h
[modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl_job_controller.cc
[modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl_job_controller.h
[modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl_job_controller_unittest.cc
[modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl_request.cc
[modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl_request.h
[modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_impl_request_unittest.cc
[modify] https://crrev.com/b6ea9d7648b75a2215126a020bd3c0db0cf1ccab/net/http/http_stream_factory_test_util.h

Project Member

Comment 21 by bugdroid1@chromium.org, Aug 23 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f62edf73906fa20106211c5f76de5e03278fdf0a

commit f62edf73906fa20106211c5f76de5e03278fdf0a
Author: Helen Li <xunjieli@chromium.org>
Date: Wed Aug 23 22:25:18 2017

Revert "Make HttpStreamFactoryImpl::Request own JobController"

This reverts commit b6ea9d7648b75a2215126a020bd3c0db0cf1ccab.

Reason for revert: There breaks ReportBrokenAlternativeService()

Original change's description:
> Make HttpStreamFactoryImpl::Request own JobController
> 
> JobController no longer outlives Request after
> 7f0e229fb317da182573209639c32529755dd1f5. This CL simplies the lifetime of
> JobController for non-preconnects.
> 
> Follow-up CLs will clean up unused logic and duplicated param passing.
> 
> Bug: 475060,  757548 
> Change-Id: I4542f0f4f6386e3eeae47c41da8d6e98cce4dd83
> Reviewed-on: https://chromium-review.googlesource.com/627311
> Reviewed-by: Ryan Hamilton <rch@chromium.org>
> Reviewed-by: Zhongyi Shi <zhongyi@chromium.org>
> Commit-Queue: Helen Li <xunjieli@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#496785}

TBR=rch@chromium.org,xunjieli@chromium.org,zhongyi@chromium.org

Change-Id: Id42cc4b5fafbfcba00ae572c3d2505100c93e376
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug: 475060,  757548 
Reviewed-on: https://chromium-review.googlesource.com/630396
Reviewed-by: Ryan Hamilton <rch@chromium.org>
Commit-Queue: Ryan Hamilton <rch@chromium.org>
Cr-Commit-Position: refs/heads/master@{#496831}
[modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl.cc
[modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl.h
[modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl_job_controller.cc
[modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl_job_controller.h
[modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl_job_controller_unittest.cc
[modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl_request.cc
[modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl_request.h
[modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_impl_request_unittest.cc
[modify] https://crrev.com/f62edf73906fa20106211c5f76de5e03278fdf0a/net/http/http_stream_factory_test_util.h

Project Member

Comment 22 by bugdroid1@chromium.org, Aug 23 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/1856b037275d856bea65471fa568cecf71eed20d

commit 1856b037275d856bea65471fa568cecf71eed20d
Author: Helen Li <xunjieli@chromium.org>
Date: Wed Aug 23 22:30:13 2017

Revert "Cancel orphaned HttpStreamFactoryImpl::Job when HttpStreamRequest is done"

This reverts commit 7f0e229fb317da182573209639c32529755dd1f5.

Reason for revert: Alternative job cannot be reset when main job finishes

Original change's description:
> Cancel orphaned HttpStreamFactoryImpl::Job when HttpStreamRequest is done
> 
> When an HttpStreamRequest is done, cancel any orphaned
> HttpStreamFactoryImpl::Job. We do not gain much by leaving the
> orphaned Jobs around to do connection establishment. In both
> TCP's and QUIC's cases, canceling Impl::Job does not terminate
> the underlying connection attempt.
> 
> This will allow us to simplify the JobController code further --
> now that it doesn't live longer than a HttpStreamRequest. I will do
> follow-ups to clean up unneeded code.
> 
> Bug:  757548 
> Change-Id: I7d4c09f5625b27c778a74ed0b44fa83a2b0db695
> Reviewed-on: https://chromium-review.googlesource.com/624420
> Commit-Queue: Helen Li <xunjieli@chromium.org>
> Reviewed-by: Ryan Hamilton <rch@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#496493}

TBR=rch@chromium.org,xunjieli@chromium.org

Change-Id: Iade06edb13376bea21b57f4a531de1e726da1d89
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug:  757548 
Reviewed-on: https://chromium-review.googlesource.com/630297
Reviewed-by: Helen Li <xunjieli@chromium.org>
Commit-Queue: Helen Li <xunjieli@chromium.org>
Cr-Commit-Position: refs/heads/master@{#496834}
[modify] https://crrev.com/1856b037275d856bea65471fa568cecf71eed20d/net/http/http_stream_factory_impl_job_controller.cc
[modify] https://crrev.com/1856b037275d856bea65471fa568cecf71eed20d/net/http/http_stream_factory_impl_job_controller_unittest.cc

Talked to Ryan offline. The main reason why we can't reset the alternative job on Request completion is because only JobController knows whether quic job fails after tcp job has succeeded. The information is needed to mark QUIC as bad (the ReportBrokenAlternativeService() method). Without the orphan job logic, we wouldn't be able to know the cases where a quic job fails after a tcp job has succeeded.

I will upload a CL tomorrow to only reset the orphaned TCP job.
Project Member

Comment 24 by bugdroid1@chromium.org, Aug 24 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5649d99327a922a75948fafdab49529675a576b8

commit 5649d99327a922a75948fafdab49529675a576b8
Author: Helen Li <xunjieli@chromium.org>
Date: Thu Aug 24 21:23:06 2017

Cancel orphaned TCP HttpStreamFactoryImpl::Job

When a tcp HttpStreamFactoryImpl::Job is orphaned, delete it. We do not gain
much by leaving the TCP orphaned Jobs around to do connection establishment,
because canceling Impl::Job does not terminate the underlying connection
attempt. See more discussion on this in the linked bug below.

Bug:  757548 
Change-Id: If791bf56e66378c190ade935130bc380fab41b39
Reviewed-on: https://chromium-review.googlesource.com/630179
Commit-Queue: Helen Li <xunjieli@chromium.org>
Reviewed-by: Zhongyi Shi <zhongyi@chromium.org>
Reviewed-by: Ryan Hamilton <rch@chromium.org>
Cr-Commit-Position: refs/heads/master@{#497191}
[modify] https://crrev.com/5649d99327a922a75948fafdab49529675a576b8/net/http/http_stream_factory_impl_job.cc
[modify] https://crrev.com/5649d99327a922a75948fafdab49529675a576b8/net/http/http_stream_factory_impl_job.h
[modify] https://crrev.com/5649d99327a922a75948fafdab49529675a576b8/net/http/http_stream_factory_impl_job_controller.cc
[modify] https://crrev.com/5649d99327a922a75948fafdab49529675a576b8/net/http/http_stream_factory_impl_job_controller_unittest.cc

Status: Fixed (was: Assigned)

Comment 26 by rch@chromium.org, Aug 24 2017

\o/ Thanks!
Blocking: 718576

Sign in to add a comment