
Issue 590875

Starred by 6 users

Issue metadata

Status: WontFix
Owner: ----
Closed: Aug 21
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 2
Type: Bug

Blocked on:
issue 607354

Blocking:
issue 410497




Persist MultiThreadedCertVerifier's |cache_| on Android when browser is killed under memory pressure

Project Member Reported by rtenneti@chromium.org, Feb 29 2016

Issue description

Pickle (serialize and deserialize) MultiThreadedCertVerifier's |cache_|
and its objects.

Wire this up to Android's bundle when Chrome is killed under
memory pressure.
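
A minimal sketch of the pickling step, assuming a simplified per-entry struct and a flat list of entries (the real cache stores CertVerifyResult objects keyed by request parameters, so this is illustrative only):

#include <cstdint>
#include <string>
#include <vector>

#include "base/pickle.h"

// Hypothetical flattened view of one cached verification result; the real
// cache stores CertVerifyResult objects, so these fields are illustrative.
struct CachedVerification {
  std::string hostname;
  std::string cert_der;    // DER encoding of the verified leaf certificate.
  int error;               // net error code of the original verification.
  uint32_t cert_status;    // CertStatus bits from the CertVerifyResult.
};

// Serialize the cached results into a base::Pickle; the resulting blob is
// what would be handed to the Android saved-instance-state bundle.
base::Pickle SerializeCache(const std::vector<CachedVerification>& entries) {
  base::Pickle pickle;
  pickle.WriteInt(static_cast<int>(entries.size()));
  for (const CachedVerification& entry : entries) {
    pickle.WriteString(entry.hostname);
    pickle.WriteString(entry.cert_der);
    pickle.WriteInt(entry.error);
    pickle.WriteUInt32(entry.cert_status);
  }
  return pickle;
}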

 
Components: Internals>Network>SSL
Are we OK with the security implications of this?
Left comments on the CL attached, but I think a lot more work needs to be done to describe, in this bug (or in a publicly available design doc), the rationale and reasoning here, as well as our pass/fail criteria and threat model.

Comment 3 by rch@chromium.org, Mar 21 2016

A couple of notes on motivation and metrics.

Anecdotally, looking at net-internals traces from Android in the wild, we noticed that QUIC connections regularly spend a non-trivial amount of time verifying certs. Cert verification is expensive, and so there is a cache of the results put in front of the actual expensive verification process. We know that Chrome on Android (as well as Cronet apps) is killed because of memory pressure repeatedly throughout the day. As a result, the cert verification cache will be repeatedly emptied. We believe that this contributes to the significantly higher ratio of average cert verification time to first verification time on Android than on Windows. There are, of course, other factors, but this seems likely.

(Annoyingly, when QUIC attempts to do a 0-RTT handshake, it "immediately" starts verifying the certificate before it is able to send the CHLO packet. TLS, on the other hand, waits for the TCP handshake to complete before verifying the cert. In practice, this often results in the QUIC verification job taking a long time and overlapping with the TCP handshake. Then the TLS verification job is nearly instantaneous since the QUIC results end up cached. Cute :>)

In any case, we intend to run a Finch experiment where we persist the cert verification cache somewhere (the Android app bundle seems like a promising target), which would be loaded on startup (or resume, I guess). We will then be able to measure changes in the histograms Net.CertVerifier_First_Job_Latency and Net.CertVerifier_Job_Latency as well as the QUIC-specific histogram Net.QuicSession.VerifyProofTime. We may well expect to see slight changes in Net.AlternateProtocolUsage (because QUIC will end up beating TLS a bit more). And finally, we will look for changes in per-service metrics like YouTube join latency.

If this approach does not bear fruit, we will of course revert these changes.

One note on persistence from our offline conversation. There may be multiple cert verifications for the certificates. In this case, the cache will have multiple entries, but there will be only one underlying OSCertHandle. We should only persist this cert once so as not to bloat the storage.

We should also evaluate extending the lifetime of cached cert verify results, though this is of course orthogonal.
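
A sketch of the deduplication idea from the note above: persist each distinct DER-encoded certificate once and have entries refer to it by index (the types and layout here are assumptions, not the real serializer):

#include <map>
#include <string>
#include <vector>

#include "base/pickle.h"

// Write each distinct DER-encoded certificate exactly once; entries then
// refer to certificates by index instead of repeating the raw bytes.
void SerializeCertsWithDedup(const std::vector<std::string>& entry_cert_ders,
                             base::Pickle* pickle) {
  std::map<std::string, int> der_to_index;
  std::vector<std::string> unique_ders;
  std::vector<int> entry_indices;

  for (const std::string& der : entry_cert_ders) {
    auto it = der_to_index.find(der);
    if (it == der_to_index.end()) {
      it = der_to_index.emplace(der, static_cast<int>(unique_ders.size())).first;
      unique_ders.push_back(der);
    }
    entry_indices.push_back(it->second);
  }

  // Certificates first, once each...
  pickle->WriteInt(static_cast<int>(unique_ders.size()));
  for (const std::string& der : unique_ders)
    pickle->WriteString(der);
  // ...then one small index per cache entry.
  pickle->WriteInt(static_cast<int>(entry_indices.size()));
  for (int index : entry_indices)
    pickle->WriteInt(index);
}
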
"Cert verification is expensive and so there is a cache of the results put in front of the actual expensive verification process."

- This is not correct. Verification is single-threaded, but not intrinsically expensive across platforms. Because we don't control the OS behaviour, and because it's designed around a blocking API, we place a verification cache in front of it so that we can both join existing verifications AND avoid a call into a blocking system library, but that does not imply, nor is it meant to suggest, that it's expensive.

For example, on Android (in particular), there is no blocking API to involve, and everything is relatively inexpensive (and "could" be conducted on a single thread / on the IO thread, but we multi-thread for consistency and due to JNI)


"Contributes to a significantly higher ratio of average cert verification time to first verification time on Android than Windows"

- What metrics are you using to quantify this? Non-Google Chromium builds should at least be able to observe the same location that you're histogramming, hence the desire to be specific.

"Then the TLS verification job is nearly instantaneous since the QUIC results end up cached" 

- We have independent metrics that look at overall verification time versus the time-to-verify-since-joining an in progress (or cached) result. Understanding what metrics you're examining will help ensure that they're being compared apples-to-apples.


"Which would be loaded on startup (or resume, I guess)"

- What metrics will you be watching to ensure that Startup time does not increase? That is, there's obviously a tradeoff being made here, how will we make sure that we're not driving down the cert verification time by driving up the startup time?


Comment 5 by rch@chromium.org, Mar 22 2016

ARGH! Raman mentioned that he didn't see my reply to comment #4 yesterday. I see a reply in my sent mail folder, but I guess it didn't make it to the bug. Argh.  Trying again.

On Mon, Mar 21, 2016 at 12:53 PM, rsleevi@chromium.org via Monorail <monorail@chromium.org> wrote:

Comment #4 on  issue 590875  by rsleevi@chromium.org: Persist MultiThreadedCertVerifier's |cache_| on Android when browser is killed under memory pressure
https://bugs.chromium.org/p/chromium/issues/detail?id=590875#c4

"- This is not correct. Verification is single-threaded, but not intrinsically expensive across platform. Because we don't control the OS behaviour, and because it's designed around a blocking API, we place a verification cache in front of it so that we can both join existing verifications AND avoid a call into a blocking system library, but that does not intrinsically imply or is meant to suggest it's expensive.

For example, on Android (in particular), there is no blocking API to involve, and everything is relatively inexpensive (and "could" be conducted on a single thread / on the IO thread, but we multi-thread for consistency and due to JNI)"

Fair enough.

"- What metrics are you using to quantify this? Non-Google Chromium builds should at least be able to observe the same location that you're histogramming, hence the desire to be specific. "

Net.CertVerifier_First_Job_Latency vs Net.CertVerifier_Job_Latency
 
">"Then the TLS verification job is nearly instantaneous since the QUIC results end up cached" 

- We have independent metrics that look at overall verification time versus the time-to-verify-since-joining an in progress (or cached) result. Understanding what metrics you're examining will help ensure that they're being compared apples-to-apples. "

I'm not sure what you mean. Can you say more? I'm talking about what we see looking at net-internals traces, where we see a QUIC job verify the cert, which fires off a cert verifier job that takes a non-trivial time to complete. After it completes, QUIC sends off a CHLO. Later, we see the racing HTTPS job's TCP handshake complete and the TLS handshake progress to the point that it fires off a cert verifier job. This hits the cache and is super fast.

"- What metrics will you be watching to ensure that Startup time does not increase? That is, there's obviously a tradeoff being made here, how will we make sure that we're not driving down the cert verification time by driving up the startup time? "

Startup time was not one of the metrics we were planning to look at. I'm happy to though. What do you recommend?

Cc: rsch...@chromium.org gab@chromium.org rch@chromium.org
- Regarding startup time, I'm not sure. I will note that previous efforts at caching certificate information (though not verifications) to disk showed negligible benefit, but did show startup increases - but with the Perf Code Purple, I understand that many of these were retrofit. I've cc'd rschoen@ and gab@ for their thoughts on what appropriate metrics to monitor for this experiment would be.



- Regarding metrics, I do want to make sure I'm restating accurately:

Problem:
- There's a divergence in the Android Net.CertVerifier_First_Job_Latency (50% at 201ms, 90% at 717ms) and Net.CertVerifier_Job_Latency (50% at 37ms, 90% at 140ms). 
- You plan to cache results to see if this helps move both metrics down.
- You anticipate that, in doing so, you will also see Net.QuicSession.VerifyProofTime (and, presumably, Net.SSLCertVerificationTime) trend downward
- You expect to see Net.AlternateProtocolUsage increase, because QUIC will complete as-fast-or-faster than TLS

It's unclear why you would expect Net.CertVerifier_First_Job_Latency to decrease with this cache - the first time the OS has to be interrogated, you still have to load the certificates in memory (this is somewhat OS specific and hairy and not accurate, but 'close enough' approximation), so I wouldn't anticipate this to change, unless you expect to bias the metric by counting cached results - which I don't think you should.

Because Job metrics aren't logged when cached, I also don't expect to see Net.CertVerifier_Job_Latency trend down - because none of the cache results benefit.

In short, I'm trying to say that both those metrics are bad metrics to target, and if you see them move, it's likely due to a bug.


That said, I would expect Net.QuicSession.VerifyProofTime and Net.SSLCertVerificationTime to move downward, since those benefit from the cache. That said, it's worth noting that the metrics for Net.SSLCertVerificationTime are >50% at 0ms, and 90% at 75ms. For QUIC, you're at >50% in 8ms, and 90% at 71ms. The 0ms vs 7ms split, I suspect, is because of the additional cryptographic overhead necessary for QUIC proofs (that is, you're not measuring apples:apples, as the TLS verification only concerns cert verification, whereas QUIC also verifies the signature over the server config *and* the CT status)

So, at a minimum, if you want to compare the QUIC cert verification time to TLS cert verification time, it seems like you need a better metric, especially if it only ends up being a small move. It also suggests that you're paying an 8ms penalty for having to verify the PSS signature - perhaps you can/should cache that? That seems low-hanging fruit.


Comment 8 by rch@chromium.org, Mar 22 2016

> - Regarding metrics, I do want to make sure I'm restating accurately:
>
> Problem:
> - There's a divergence in the Android Net.CertVerifier_First_Job_Latency (50% at 201ms, 90% at 717ms) and Net.CertVerifier_Job_Latency (50% at 37ms, 90% at 140ms).

No, this is not the problem. The problem is that when we look at net
internals dumps of actual browsing in which QUIC is being used, we
often see non-trivial times spent verifying certs. (In addition
because of when QUIC and TLS verify certs in relation to each other,
TLS ends up able to use the cached QUIC results and that ends up
giving TLS a leg up).

> - You plan to cache results to see if this helps move both metrics down.
> - You anticipate that, in doing so, you will also see Net.QuicSession.VerifyProofTime (and, presumably, Net.SSLCertVerificationTime) trend downward

The job latency metrics seem consistent with our hypothesized
explanation for this behavior, and so their movement seemed like
a reasonable thing to watch. (Though, as you mention below, they may
not be the right metrics.)

VerifyProofTime and YouTube join latency are the two key metrics
we are motivated to see change.

> - You expect to see Net.AlternateProtocolUsage increase, because QUIC will complete as-fast-or-faster than TLS

Not quite. I would expect to see WON_RACE increase and LOST_RACE
decrease. But I would expect this movement to happen because there
will be some cases where today QUIC has to do work to verify the cert
and TLS doesn't, but with the cache QUIC will also not have to do
the work.

> It's unclear why you would expect Net.CertVerifier_First_Job_Latency to decrease with this cache - the first time the OS has to be interrogated, you still have to load the certificates in memory (this is somewhat OS specific and hairy and not accurate, but 'close enough' approximation), so I wouldn't anticipate this to change, unless you expect to bias the metric by counting cached results - which I don't think you should.

No, I would not expect it to decrease. But the ratio of it to
Net.CertVerifier_Job_Latency is important. And I wouldn't want to look
at Net.CertVerifier_Job_Latency without also looking at the first job
latency.

> Because Job metrics aren't logged when cached, I also don't expect to see Net.CertVerifier_Job_Latency trend down - because none of the cache results benefit.

Ah, interesting. I tried to trace through the code and it looks like
it was per job. I didn't see where the cache short circuited, but I
guess I didn't follow it correctly.

> That said, I would expect Net.QuicSession.VerifyProofTime and Net.SSLCertVerificationTime to move downward, since those benefit from the cache. That said, it's worth noting that the metrics for Net.SSLCertVerificationTime are >50% at 0ms, and 90% at 75ms. For QUIC, you're at >50% in 8ms, and 90% at 71ms.

To be clear, ~70ms at 90% is a highly non-trivial amount of time.
Moving this would be fantastic.

> The 0ms vs 7ms split, I suspect, is because of the additional cryptographic overhead necessary for QUIC proofs (that is, you're not measuring apples:apples, as the TLS verification only concerns cert verification, whereas QUIC also verifies the signature over the server config *and* the CT status)

IIRC, CT is only enabled with QUIC v30 which we only turned on in
Canary last week. I'd be surprised if CT status was part of the
difference here, but I'm surprised by lots of things.

> So, at a minimum, if you want to compare the QUIC cert verification time to TLS cert verification time, it seems like you need a better metric, especially if it only ends up being a small move.

I don't want to compare QUIC cert verification time to TLS cert
verification time. Remember that only Google servers speak QUIC at the
moment, but lots of other servers speak TLS, so comparing QUIC metrics
to TLS metrics is problematic because of population bias problems.
We've seen this repeatedly in any number of metrics.

> It also suggests that you're paying an 8ms penalty for having to verify the PSS signature - perhaps you can/should cache that? That seems low-hanging fruit.

We already have an in-memory cache which allows us to skip proof
verification completely (certs, signatures, CT, all of it) if we're
using the same server config from a previous connection. Is that
effectively what you are talking about? This may well explain part of the
reason that TLS has more 0ms verifications than QUIC does.

(We could, of course, simply persist and restore *this* cache, but
that would seem to be unfair to TLS.)
>  The problem is that when we look at net internals dumps of actual browsing in which QUIC is being used, we often see non-trivial times spent verifying certs.

Do you have metrics that capture this? The net-internals thing seems even harder to use to justify this change, because of the bias intrinsic to it.

> I didn't see where the cache short circuited, but I guess I didn't follow it correctly.

https://code.google.com/p/chromium/codesearch#chromium/src/net/cert/multi_threaded_cert_verifier.cc&l=439

Cached entries never create a Job. As such, you should just disregard the Net.CertVerifier_*_Latency from being part of either the justification or the results. If it's a justification, it's not a good one, and if it moves in the results, then the change is bad :)
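
In simplified, self-contained form, the short-circuit looks roughly like this (a model, not the actual MultiThreadedCertVerifier code):

#include <map>
#include <string>

// A cache hit returns synchronously and never creates a Job, so the Job
// latency histograms only ever record cache misses. Names are illustrative.
struct VerifyResult {
  int error;
};

class CachingVerifierModel {
 public:
  // Returns true if the result came from the cache (no Job is created and
  // nothing is recorded in Net.CertVerifier_*_Job_Latency).
  bool Verify(const std::string& host, VerifyResult* out) {
    auto it = cache_.find(host);
    if (it != cache_.end()) {
      *out = it->second;
      return true;  // Short-circuit: synchronous cache hit.
    }
    // Cache miss: this is the only path the Job latency histograms measure.
    VerifyResult fresh = RunBlockingVerificationJob(host);
    cache_.emplace(host, fresh);
    *out = fresh;
    return false;
  }

 private:
  VerifyResult RunBlockingVerificationJob(const std::string& host) {
    return VerifyResult{0};  // Stand-in for the expensive verification Job.
  }

  std::map<std::string, VerifyResult> cache_;
};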

> To be clear, ~70ms at 90% is a highly non-trivial amount of time.

This covers both success and failure cases.

> IIRC, CT is only enabled with QUIC v30 which we only turned on in Canary last week.

No. The ProofVerifier has been checking it for quite some time (for whitelisted EV certs & expect-ct). I sent you those CLs, remember? :) Either way, it's part of your measurements that you're doing additional cryptographic operations (proof verification) and memory operations (whitelist scanning, CT policy validation). You're thinking about the in-band delivery of SCTs, but that's orthogonal to the CT policy checking.

> Remember that only Google servers speak QUIC at the moment,

Is that enforced by code? Because I'm aware of several organizations taking Chrome code for their servers to use QUIC. Unless it's programmatically gated, I don't think we can uniformly make this suggestion.

> if we're using the same server config from a previous connection.

Possibly.

To recap this:
- Net.CertVerifier_*_Job metrics are bad for this case (don't count the positive impact that the cache has, won't be impacted by caching)
- Net.QuicSession.VerifyProofTime accounts for far more than its TLS equivalent, and thus makes it hard to compare 1:1. It also means there's more noise here - you're not *just* verifying certificates, but doing "other stuff"
- The justification ("We see long verification in net-internals") doesn't seem to be supported by metrics, so it's unclear how we'll declare success/failure.


It *seems* like you want to see if caching to disk drives down Net.QuicSession.VerifyProofTime, and if it drives down Net.SSLCertVerificationTime. We can reasonably expect that SSLCertVerificationTime will go down at a greater percentage than VerifyProofTime, because of the additional noise in the QUIC implementation.

In addition to measuring that these go down (how much they go down to warrant complexity TBD), we also need to identify the metrics that shouldn't go up - whatever is the appropriate metric for startup time (TBD).


Separately, I'm nervous about the proof verification skipping - that sounds like you're doing something we explicitly removed from the TLS side on security grounds. We can follow-up separately on that.

Comment 10 by rch@chromium.org, Mar 25 2016


"> The problem is that when we look at net internals dumps of actual browsing in which QUIC is being used, we often see non-trivial times spent verifying certs. 

Do you have metrics that capture this? The net-internals thing seems even harder to use to justify this change, because of the bias intrinsic to it." 

I'm not sure what you're looking for. We know that when Chrome is restarted the cache is empty. We know that Chrome is frequently restarted ('cause of memory) on Android. We know that when cert verification does not hit the cache, it takes non-trivial time (from the histograms you mentioned earlier). We have a *hypothesis* that by persisting the cache and restoring it on startup we will drive down the actual time spent verifying certs. We may be wrong in this hypothesis, but all the information we have is consistent with this hypothesis. Hence, we propose to run an experiment. If the hypothesis is correct, then we'll see Net.QuicSession.VerifyProofTime and Net.SSLCertVerificationTime move. Woo hoo! If we're wrong we will rip out this code and will have learned something.

"> Remember that only Google servers speak QUIC at the moment, 

Is that enforced by code?"

Yes, it is because of the DROWN attack. In fact, I think you reviewed the CL which added the whitelist (which is empty) in addition to the IsGooglePinnedHost() check. Here's the code which does the check:

https://code.google.com/p/chromium/codesearch#chromium/src/net/http/http_stream_factory_impl.cc&rcl=1458675567&l=360

"Because I'm aware of several organizations taking Chrome code for their servers to use QUIC."

We are, of course, working actively with several different companies to deploy QUIC server side, but until Chrome enables QUIC v31 they will not be able to use QUIC. This has been a hot topic of conversation as you might imagine.
 
"Unless it's programatically gated, I don't think we can uniformly make this suggestion."

Even if it were not gated, we could still make this claim because we have looked at Chrome's QUIC traffic when we've disabled QUIC on Google servers and, not surprisingly, it goes to basically 0. But, as I said, it is gated at the moment.

"To recap this: 
- Net.CertVerifier_*_Job metrics are bad for this case (don't count the positive impact that the cache has, won't be impacted by caching) 
- Net.QuicSession.VerifyProofTime accounts for far more than its TLS equivalent, and thus makes it hard to compare 1:1. It also means there's more noise here - you're not *just* verifying certificates, but doing "other stuff" "

I don't understand why you keep talking about comparing the QUIC and TLS verification times to each other. I have no intention of doing that. As I mentioned before, QUIC is only to Google currently (and will only be to a small set of sites as other QUIC servers come online) and TLS is to basically everything else. This is a huge population bias (in addition to whatever differences are inherent to the two histograms). My objective is to make these metrics lower, not to change the ratio between them (though that ratio may well change, that's a side effect, not an objective).
 
"- The justification ("We see long verification in net-internals") doesn't seem to be supported by metrics, so it's unclear how we'll declare success/failure. "

We will declare success if we, say, reduce Net.QuicSession.VerifyProofTime and/or Net.SSLCertVerificationTime by 10%. (Just making a number up.)

"In addition to measuring that these go down (how much they go down to warrant complexity TBD), we also need to identify the metrics that shouldn't go up - whatever is the appropriate metric for startup time (TBD). "

Works for me. I welcome any suggestions you (or others) might have.
"I'm not sure what you're looking for."

My concern here is this - the initial set of metrics you proposed (CertVerifier_) were unrelated to cache, and biased towards full verifications.

The metrics you do have - VerifyProofTime and SSLCertVerificationTime - SHOULD move if caching helps, but WILL move at different rates, and I want to make sure we get that documented up front, because I read the original suggestion as saying they would move at EQUAL rates, which would not be expected.

"non-trivial time" is a phrase I'd object to. We're sticking a cache in front of something without understanding why it's taking long. The lesson from the Perf CodePurple was to actually do the deep dives, and do systemic analysis about why performance wasn't what we expect, because that's where we find the interesting and exciting things. Caching, for example, makes a tradeoff for memory vs CPU, but the entire justification is for Android, where we're already memory constrained and multiple teams are trying to reduce memory usage (... and reduce the size of the networks' caches). As such, I still struggle to believe this is long-term the correct answer - this feels like a bandaid being presented as a solution.

But while I don't want to belabor that point too much, I want to circle back on a clear statement of the problem.

The initial statement (as I read it) was:
- We see non-trivial verification times in Net.CertVerifier_*_Job metrics

The problem I had was:
- "non-trivial" doesn't provide a good definition
- CertVerifier* wasn't a good set of metrics

So we've iterated a few times, and now we have
- We see non-trivial verification times in Net.QuicSession.VerifyProofTime and Net.SSLCertVerificationTime (or maybe net-internals, I'm not sure)

The problem I have with this is:
- "non-trivial" is still undefined (this goes back to having a clear problem definition, and a proposed solution, so that we can objectively evaluate if the solution addresses the problem)
- VerifyProofTime *isn't* just certificate verification - it's a number of other things that contribute noise/entropy to the thing being measured.
- I want to make sure that if you mean "non-trivial in net-internals", then we're focusing on that problem - and it's a reasonable problem to investigate. If you mean "we don't like the value in the histograms", then we're focusing on that problem (and *ignoring* net-internals) - which is also reasonable. In the past, justifications I've heard for this is "net-internals says this is taking 10 seconds" - but that sort of statement isn't corroborated by the metrics, and hasn't been deep-dived. I want to make sure we're not still hanging on to that problem-statement behind the scenes, but that we're clear on what we're measuring :)


So let's say we reduce VerifyProofTime by 10%, but we add 100-to-200ms to startup time (which, given *why* we see CertVerifier_*_Job's first-job take a while, is *not* unreasonable to expect), or we now cause Chrome's memory usage overall to increase (because now, you can no longer Force-Quit Chrome and expect that, on Chrome resumption, memory usage will be in the default/no extra usage state). Is that success? Are we looking at the right metrics?

My request is that it's reasonable to expect that:
- Startup time will increase (IO contention, size of the data we're writing, CPU time to serialize/deserialize, or delays in loading the cache during first verification)
- Memory usage will increase (we're now caching things even longer, and restarting Chrome will be insufficient to clear them)

As such, I think the onus should be to find the right metrics and work with the teams responsible, to make sure you're watching (or coordinating with them and they're watching) for "non-trivial" increases :)

I've tried to cc the people who might help, but I'm not sure - but I think given the proposed plan of action, it's reasonable to expect we'll see this, and given that we're doing a data-driven experiment, it seems reasonable to request that you see about making sure we're looking at the whole picture, and not just shifting our problem (CertVerification sluggishness on Android) to someone else's team (performance or svelte)

Comment 12 by rch@chromium.org, Mar 29 2016

[Man, I wish it were easier to get quoting to work in monorail]

"My concern here is this - the initial set of metrics you proposed (CertVerifier_) were unrelated to cache, and biased towards full verifications."

Indeed. And when you pointed this out, I explained that I had misunderstood the way those metrics were calculated.

"The metrics you do have - VerifyProofTime and SSLCertVerificationTime - SHOULD move if caching helps, but WILL move at different rates, and I want to make sure we get that documented up front, because I read the original suggestion is that they would move at EQUAL rates, which would not be expected."

I'm not sure where you got the impression that I thought they will move at equal rates. In any case, I think we're on the same page now that I have no intention of comparing those metrics to each other, right?

""non-trivial time" is a phrase I'd object to. We're sticking a cache in front of something without understanding why it's taking long."

I'm not proposing sticking a cache in front of anything. The cache is already there. I'm just proposing repopulating the cache after we restart after being killed because the system ran out of memory.

"The lesson from the Perf CodePurple was to actually do the deep dives, and do systemic analysis about why performance wasn't what we expect, because that's where we find the interesting and exciting things. Caching, for example, makes a tradeoff for memory vs CPU, but the entire justification is for Android, where we're already memory constrained and multiple teams are trying to reduce memory usage (... and reduce the size of the networks' caches). As such, I still struggle to believe this is long-term the correct answer - this feels like a bandaid being presented as a solution."

I appreciate that this is not the one true solution to cert verification performance. However, I do think that this has the potential to address a performance issue associated with applications being killed/restarted on Android. I don't have the resources to perform a deep dive into cert verifier performance, but I do have some resources to implement this small feature. I hope someone (on your team?) will have the time to dig deeper into the general performance situation in the future.

"So we've iterated a few times, and now we have
- We see non-trivial verification times in Net.QuicSession.VerifyProofTime and Net.SSLCertVerificationTime (or maybe net-internals, I'm not sure)

The problem I have with this is:
- "non-trivial" is still undefined (this goes back to having a clear problem definition, and a proposed solution, so that we can objectively evaluate if the solution addresses the problem)
- VerifyProofTime *isn't* just certificate verification - it's a number of other things that contribute noise/entropy to the thing being measured.
- I want to make sure that if you mean "non-trivial in net-internals", then we're focusing on that problem - and it's a reasonable problem to investigate. If you mean "we don't like the value in the histograms", then we're focusing on that problem (and *ignoring* net-internals) - which is also reasonable. In the past, justifications I've heard for this is "net-internals says this is taking 10 seconds" - but that sort of statement isn't corroborated by the metrics, and hasn't been deep-dived. I want to make sure we're not still hanging on to that problem-statement behind the scenes, but that we're clear on what we're measuring :)"

We started thinking about this issue of making cert verification faster (specifically on mobile) because when we looked at net-internals we saw the behavior I described. In particular, for the anecdotal evidence we were looking at, it seemed consistent with the hypothesis that Chrome had been killed (as it pretty much always is for me) when it had been put into the background. Looking at histograms demonstrated that cert verifier latency was not "super fast". I'm not terribly familiar with cert verification and was thinking it might be a couple ms, but I hadn't thought about it too deeply. But when we see 71ms at 90%, that suggests that there is a lot of opportunity to speed things up.

We then thought about how the existing cert verification cache worked and wondered if persisting/restoring when Chrome is killed would alleviate this problem. Hopefully this explains the motivation for the feature. This is a different question than what the metrics for success are.

"So let's say we reduce VerifyProofTime by 10%, but we add 100-to-200ms to startup time (which, given *why* we see CertVerifier_*_Job's first-job take a while, is *not* unreasonable to expect), or we now cause Chrome's memory usage overall to increase (because now, you can no longer Force-Quit Chrome and expect that, on Chrome resumption, memory usage will be in the default/no extra usage state)."

It's not clear to me that Force-Quit would behave the way you are describing, but I'm happy to chat about this. My understanding was that the "instance state" (or whatever it's called) is written to when the app is about to be killed because of memory pressure and is then read when the app is restored. From what I've read, it is NOT used when the app is force-killed. But if this is not the way things work, let's talk about it.

"Is that success? Are we looking at the right metrics?"

That's a good question. As I said initially, I hadn't been planning to look at these metrics, and have solicited suggestions for good metrics to look at; I'm hoping to get some. (And will follow up offline if needed.) A 100-200ms increase in startup time doesn't seem awesome, and I suspect that if we only move the mean by 10% that wouldn't be a success. Can you say more about why you would expect to see such a large increase in startup latency?

That being said, as you point out, this is a balance. I'm not sure I can describe the precise function which determines success based on the change in the distribution of proof verification time and the change in the distribution of startup behavior. What if it's a 10 ms increase in startup and a 50% reduction at 90%? That'd probably be a win. What about a 100 ms increase in startup and a 20% reduction at 90%? Probably not. But trying to enumerate these up front seems challenging.

"My request is that it's reasonable to expect that:
- Startup time will increase (IO contention, size of the data we're writing, CPU time to serialize/deserialize, or delays in loading the cache during first verification)"

Agreed. As you say it's definitely a balance.

"- Memory usage will increase (we're now caching things even longer, and restarting Chrome will be insufficient to clear them)"

Restarting Chrome should be sufficient to clear them, if I understand how instance state works. But true, after being killed by the OS, they will be restored rather than cleared.

"As such, I think the onus should be to find the right metrics and work with the teams responsible, to make sure you're watching (or coordinating with them and they're watching) for "non-trivial" increases :)

I've tried to cc the people who might help, but I'm not sure - but I think given the proposed plan of action, it's reasonable to expect we'll see this, and given that we're doing a data-driven experiment, it seems reasonable to request that you see about making sure we're looking at the whole picture, and not just shifting our problem (CertVerification sluggishness on Android) to someone elses' team (performance or svelte)"

Yes, that's why I agreed to look at this issue. (Though since I don't know what the right metrics are right now, I clearly can't enumerate what those metrics are and what the sorts of regression might be acceptable.) 
So, first, some follow-up notes, followed by making sure we're on the same page at the end.

--- Notes
"I'm not terribly familiar with cert verification and was thinking it might be a couple ms"

Considering that the CT check and your RSA-PSS operation take 7ms (based on metrics), I think we have ample evidence that it won't be, until a more systemic investigation is done. That's a bit why I used the pejorative of "bandaid" - because 7ms to do an RSA operation and take a short-circuit seems rather unreasonably high to begin with, and if you've got engineering time to build a cache, it almost seems like understanding that could have a big win.

The interesting thing about the delta (between the TLS vs QUIC, which itself is measured in the RSA-PSS) is that it avoids the Java overhead. Just running one of the OpenSSL 'speed' commands on a device (or range of devices that you believe are a representative sample) will give you an idea about whether it's a realistic expectation for verification time. If you see that you can only do 100 RSA verifications a second, then you know your lower-bound is going to be 10ms per verification, and that a given certificate chain will average 3-4 of those verifications at a minimum (more in pathological cases). It also means that if you can optimize that (say, by using assembly rather than C), and get a 2x multiplier (although in practice they tend to be 5x-50x depending on the algorithm), you can cut your time in half, w/o any of the memory/complexity tradeoffs.

Alternatively, it may be that the algorithm is already finely tuned. The cache will help there, but alternatively you could look at Android providing some better-integrated persistent caching at the system layer (e.g. as iOS does), and amortize the cost to memory (if multiple applications using Cronet are running) as well as improve the overall system performance.

It feels like we're stopping here because "This is all the time we have to investigate" - which is an understandable/reasonable statement, but which feels like it leaves a lot on the table, and just doesn't feel like it sits right with how we do investigations.
--- End notes

So to make sure we're on the same page:
- You or Raman will figure out (from gab / rschoen) what metrics or dashboards we have that are relevant to things like
  - Startup time
  - Memory usage / pressure
- We'll only use the metrics from the Net.QuicSession.VerifyProofTime and Net.SSLCertVerificationTime in the positive case


One last suggestion:
- It could be useful to examine metrics for
  - Cache hit/miss rate
  - success vs fail metrics (since both are cached)

Gathering those two sets will help inform fine tuning of the cache size when serializing (for example, maybe we should only serialize the 10 most recent entries, even though the cache itself could handle 256, on the basis that those are most likely open tabs to be restored?)
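
A minimal sketch of how the hit/miss and success/failure numbers could be gathered with the standard UMA macros; the histogram names are made up for illustration:

#include "base/metrics/histogram_macros.h"

// Record whether a verification was served from the cache, and whether a
// cached result was a success or a failure. Histogram names are hypothetical.
void RecordCertCacheLookup(bool was_cache_hit, bool verification_succeeded) {
  UMA_HISTOGRAM_BOOLEAN("Net.CertVerifierCache.Hit", was_cache_hit);
  if (was_cache_hit) {
    UMA_HISTOGRAM_BOOLEAN("Net.CertVerifierCache.HitWasSuccess",
                          verification_succeeded);
  }
}
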
I haven't really been following this discussion but, for reference, here are bssl speed numbers on a Nexus 4:

$ adb shell /data/local/tmp/bssl speed RSA
Did 36 RSA 2048 signing operations in 1017213us (35.4 ops/sec)
Did 3003 RSA 2048 verify operations in 1012056us (2967.2 ops/sec)
Did 40 RSA 2048 sign PSS SHA-256 operations in 1010621us (39.6 ops/sec)
Did 2761 RSA 2048 verify PSS SHA-256 operations in 1029727us (2681.3 ops/sec)
Did 44 RSA 2048 (3 prime, e=3) signing operations in 1026797us (42.9 ops/sec)
Did 2992 RSA 2048 (3 prime, e=3) verify operations in 1061072us (2819.8 ops/sec)
Did 44 RSA 2048 (3 prime, e=3) sign PSS SHA-256 operations in 1017977us (43.2 ops/sec)
Did 2761 RSA 2048 (3 prime, e=3) verify PSS SHA-256 operations in 1059607us (2605.7 ops/sec)
Did 8 RSA 4096 signing operations in 1049565us (7.6 ops/sec)
Did 913 RSA 4096 verify operations in 1078621us (846.5 ops/sec)
Did 8 RSA 4096 sign PSS SHA-256 operations in 1004060us (8.0 ops/sec)
Did 891 RSA 4096 verify PSS SHA-256 operations in 1094033us (814.4 ops/sec)

You're certainly doing RSA 2048, not RSA 4096. This is not including the SHA-256 at the beginning, but that part is a drop in the bucket:

$ adb shell /data/local/tmp/bssl speed SHA-256
Did 708435 SHA-256 (16 bytes) operations in 1000366us (708175.8 ops/sec): 11.3 MB/s
Did 288500 SHA-256 (256 bytes) operations in 1000092us (288473.5 ops/sec): 73.8 MB/s
Did 15000 SHA-256 (8192 bytes) operations in 1020692us (14695.9 ops/sec): 120.4 MB/s
Thanks for those numbers, David. That certainly makes it seem like the difference (between QUIC, which counts EV-with-empty-whitelist + RSA-PSS-2048-SHA256 proof verify, and TLS, which doesn't) is unusually high in aggregate, and may serve as a good exploration as to why verification itself is high.

That's based on "Did 2761 RSA 2048 verify PSS SHA-256 operations in 1029727us (2681.3 ops/sec)" (which hopefully I'm not misreading), which suggests that operation should be much, MUCH faster than the difference between the two.
Here's the same numbers on a Galaxy Nexus:

$ adb shell /data/local/tmp/bssl speed RSA
Did 18 RSA 2048 signing operations in 1024292us (17.6 ops/sec)
Did 720 RSA 2048 verify operations in 1021728us (704.7 ops/sec)
Did 18 RSA 2048 sign PSS SHA-256 operations in 1009704us (17.8 ops/sec)
Did 690 RSA 2048 verify PSS SHA-256 operations in 1000702us (689.5 ops/sec)
Did 39 RSA 2048 (3 prime, e=3) signing operations in 1039246us (37.5 ops/sec)
Did 2464 RSA 2048 (3 prime, e=3) verify operations in 1038544us (2372.6 ops/sec)
Did 39 RSA 2048 (3 prime, e=3) sign PSS SHA-256 operations in 1039825us (37.5 ops/sec)
Did 2240 RSA 2048 (3 prime, e=3) verify PSS SHA-256 operations in 1017395us (2201.7 ops/sec)
Did 2 RSA 4096 signing operations in 1125702us (1.8 ops/sec)
Did 192 RSA 4096 verify operations in 1051452us (182.6 ops/sec)
Did 2 RSA 4096 sign PSS SHA-256 operations in 1103669us (1.8 ops/sec)
Did 192 RSA 4096 verify PSS SHA-256 operations in 1063415us (180.6 ops/sec)

$ adb shell /data/local/tmp/bssl speed SHA-256
Did 438984 SHA-256 (16 bytes) operations in 1001098us (438502.5 ops/sec): 7.0 MB/s
Did 125000 SHA-256 (256 bytes) operations in 1004913us (124388.9 ops/sec): 31.8 MB/s
Did 5317 SHA-256 (8192 bytes) operations in 1002655us (5302.9 ops/sec): 43.4 MB/s


And a Nexus 6P:

$ adb shell /data/local/tmp/bssl speed RSA
Did 72 RSA 2048 signing operations in 1019069us (70.7 ops/sec)
Did 4356 RSA 2048 verify operations in 1012336us (4302.9 ops/sec)
Did 77 RSA 2048 sign PSS SHA-256 operations in 1058083us (72.8 ops/sec)
Did 4444 RSA 2048 verify PSS SHA-256 operations in 1063392us (4179.1 ops/sec)
Did 119 RSA 2048 (3 prime, e=3) signing operations in 1061607us (112.1 ops/sec)
Did 7139 RSA 2048 (3 prime, e=3) verify operations in 1065565us (6699.7 ops/sec)
Did 121 RSA 2048 (3 prime, e=3) sign PSS SHA-256 operations in 1089498us (111.1 ops/sec)
Did 6699 RSA 2048 (3 prime, e=3) verify PSS SHA-256 operations in 1048213us (6390.9 ops/sec)
Did 13 RSA 4096 signing operations in 1013129us (12.8 ops/sec)
Did 1243 RSA 4096 verify operations in 1080651us (1150.2 ops/sec)
Did 13 RSA 4096 sign PSS SHA-256 operations in 1000200us (13.0 ops/sec)
Did 1243 RSA 4096 verify PSS SHA-256 operations in 1094736us (1135.4 ops/sec)

$ adb shell /data/local/tmp/bssl speed SHA-256
Did 2242000 SHA-256 (16 bytes) operations in 1000037us (2241917.0 ops/sec): 35.9 MB/s
Did 1404000 SHA-256 (256 bytes) operations in 1000068us (1403904.5 ops/sec): 359.4 MB/s
Did 100000 SHA-256 (8192 bytes) operations in 1003449us (99656.3 ops/sec): 816.4 MB/s

(The 6P has SHA-256 instructions, which is why that's so fast.)

Comment 17 by rch@chromium.org, Apr 4 2016

I don't understand what those adb shell commands are doing. Can you explain?
They're running BoringSSL's performance tests, compiled for that device, on a device under test. That gives you an idea of what the performance is of the BoringSSL implementation (e.g. the EVP_PKEY calls). So if the ProofVerify code is contributing 7ms to the difference between QUIC vs TLS (since the difference is QUIC does Certs+CT+Proof while TLS does Certs), and we know that the BoringSSL performance contribution is significantly less than 7ms, it suggests there's very low-hanging fruit in the ProofVerify code to optimize.

Comment 19 by rch@chromium.org, Apr 4 2016

I see. So those numbers are just for the basic EVP_PKEY call, not actually doing cert verification. Happy to add some additional histograms to proof_verifier_chromium.cc to see if we can determine where the time is going. Is that what you're recommending?
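
A minimal sketch of what such additional histograms could look like; the phase functions and split points are assumptions, though the histogram names match the Net.QuicSession.Verify*Time numbers reported later in this bug:

#include "base/metrics/histogram_macros.h"
#include "base/time/time.h"

// Stand-ins for the phases inside ProofVerifierChromium; the real split
// points in proof_verifier_chromium.cc are an assumption.
void VerifyProofSignature() {}  // RSA-PSS over the server config.
void VerifyCertChain() {}       // The actual certificate verification.
void VerifyPostCertSteps() {}   // CT / EV / policy checks after the chain.

void TimedProofVerify() {
  base::TimeTicks start = base::TimeTicks::Now();
  VerifyProofSignature();
  base::TimeTicks after_signature = base::TimeTicks::Now();
  VerifyCertChain();
  base::TimeTicks after_cert = base::TimeTicks::Now();
  VerifyPostCertSteps();
  base::TimeTicks end = base::TimeTicks::Now();

  UMA_HISTOGRAM_TIMES("Net.QuicSession.VerifySignatureTime",
                      after_signature - start);
  UMA_HISTOGRAM_TIMES("Net.QuicSession.VerifyCertTime",
                      after_cert - after_signature);
  UMA_HISTOGRAM_TIMES("Net.QuicSession.VerifyPostCertTime", end - after_cert);
  UMA_HISTOGRAM_TIMES("Net.QuicSession.VerifyProofTime", end - start);
}
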
Following up offline, as I think we're saying different things again (or maybe I'm just not understanding your restating)
By the way, it's worth noting that the ProofVerifier operation isn't some inherent disadvantage QUIC has over TLS. TLS does a public key operation on the server's public key too[*]. That's what checking the ServerKeyExchange signature is. The only difference is you all happen to be bucketing time to include it on one case and not in the other.

[*] in non-resumption handshakes
Following-up from our offbug discussion:

The metric measures the time to do four things
1) Verifying the presence and validity of SCTs
2) Verifying the proof from the server (RSA-PSS operation)
3) Verifying the certificate chain
4) Verifying the SCTs match the SCT policy

In the case of a TLS handshake, 2 is handled by the ServerKeyExchange in the TLS handshake time, and #1/#4 are bucketed separately from the cert verification. Further, #1 only happens *after* we've verified the cert (in the SSLClientSocket*::VerifyCT method)

So we've got a 7ms difference between QUIC and TLS, and that difference is Steps #1/#2/#4. #4 should be a short-circuit for QUIC, because it only triggers for EV certs. So what we're really saying is that we're spending 7ms in #1/#2.

Both #1 and #2 are "supposed" to be bulk-crypto operations; #1 is ECC, #2 is RSA-PSS. We know from the speed metrics that RSA-PSS should be well below that, and I thought ECC metrics were similar-to-much-faster (after all, ECC). David, do you have bssl speeds for those?

Either way, my suggestion is to investigate why #1/#2 are contributing 7ms (and/or if the short-circuit in #4 is broken). That would yield low-hanging fruit, as the metrics should be easier to gather (see //base/tracing/trace_event and the jank dashboard), and also easier to optimize (it's all native code). And it shouldn't have very many trade-offs to optimize.
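
If trace events are the route taken, a minimal sketch of wrapping the four steps above (the trace category and step functions are placeholders):

#include "base/trace_event/trace_event.h"

// Placeholder step functions standing in for #1-#4 above.
void VerifyScts() {}
void VerifySignature() {}
void VerifyChain() {}
void CheckCtPolicy() {}

void TracedProofVerify() {
  // Each TRACE_EVENT0 records the duration of its enclosing scope, so the
  // per-step cost shows up directly in a chrome://tracing capture.
  { TRACE_EVENT0("net", "ProofVerify.Scts");      VerifyScts(); }
  { TRACE_EVENT0("net", "ProofVerify.Signature"); VerifySignature(); }
  { TRACE_EVENT0("net", "ProofVerify.CertChain"); VerifyChain(); }
  { TRACE_EVENT0("net", "ProofVerify.CtPolicy");  CheckCtPolicy(); }
}
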
Some data from Cronet. In the following log, QUIC's cert verifier job is taking 115ms. Attached are the net-internals logs Helen ran with Cronet.

12: QUIC_SESSION
www.gstatic.com
Start Time: 2016-03-25 11:42:07.712

t=   128 [st=     0] +QUIC_SESSION  [dt=141150+]
                      --> cert_verify_flags = 4
                      --> host = "www.gstatic.com"
                      --> port = 443
                      --> privacy_mode = true
                      --> require_confirmation = false
t=   136 [st=     8]   +CERT_VERIFIER_REQUEST  [dt=115]
t=   137 [st=     9]      CERT_VERIFIER_REQUEST_BOUND_TO_JOB
                          --> source_dependency = 13 (CERT_VERIFIER_JOB)
t=   251 [st=   123]   -CERT_VERIFIER_REQUEST
....

13: CERT_VERIFIER_JOB
Start Time: 2016-03-25 11:42:07.719

t=135 [st=  0] +CERT_VERIFIER_JOB  [dt=115]
                --> certificates =
...
t=250 [st=115] -CERT_VERIFIER_JOB
                --> verified_cert =
Attached: cronet_quic_2.txt (1.3 MB)
Oh, one more thing I mentioned on chat - We should move #1 to after #3; the presence/absence of SCTs is only relevant in the case of affirmative trust. That is, the order of operations in TLS is #2 (SKE), #3 (SSLClientSocket*::VerifyCert), #1 (SSLClientSocket*::VerifyCT), and #4 (SSLClientSocket*::VerifyCT, but only if EV || expect-staple)
rtenneti: Yes, that's expected for the first verification, and why in the CertVerifier code this is measured as CertVerifier_First_Job_Latency vs the rest.
In the following cert verification, the cert verifier took 41ms and ProofVerify took another 36ms before the CHLO was sent.

It would be good to know where this 36ms is getting spent.

t=   624 [st=    64]   -CERT_VERIFIER_REQUEST
t=   660 [st=   100]    QUIC_SESSION_CERTIFICATE_VERIFIED

23: QUIC_SESSION
tachyon-grpc-test.sandbox.googleapis.com
Start Time: 2016-03-25 11:42:08.144

t=   584 [st=    24]   +CERT_VERIFIER_REQUEST  [dt=40]
t=   584 [st=    24]      CERT_VERIFIER_REQUEST_BOUND_TO_JOB
                          --> source_dependency = 24 (CERT_VERIFIER_JOB)
t=   624 [st=    64]   -CERT_VERIFIER_REQUEST
t=   660 [st=   100]    QUIC_SESSION_CERTIFICATE_VERIFIED
                        --> subjects = [...]
t=   661 [st=   101]    QUIC_SESSION_CRYPTO_HANDSHAKE_MESSAGE_SENT

24: CERT_VERIFIER_JOB
Start Time: 2016-03-25 11:42:08.166

t=582 [st= 0] +CERT_VERIFIER_JOB  [dt=41]
               --> certificates =
...
t=623 [st=41] -CERT_VERIFIER_JOB
               --> verified_cert =
> David, do you have bssl speeds for those?

Actually, ECDSA verifications are pretty slow in general. I'm at IETF, but I conveniently left a Nexus 7 still connected to my desktop:

Did 46 RSA 2048 signing operations in 1025574us (44.9 ops/sec)
Did 3597 RSA 2048 verify operations in 1079559us (3331.9 ops/sec)
Did 51 RSA 2048 (3 prime, e=3) signing operations in 1012786us (50.4 ops/sec)
Did 3267 RSA 2048 (3 prime, e=3) verify operations in 1062470us (3074.9 ops/sec)

Did 220 ECDSA P-256 signing operations in 1008820us (218.1 ops/sec)
Did 209 ECDSA P-256 verify operations in 1068512us (195.6 ops/sec)

There is, however, some low-hanging fruit here. I believe Adam has some 32-bit P-256 code that we're not currently using. There's also upstream's ARM version of the very fast P-256 implementation, but that's even more completely unreadable assembly and probably would be costly in binary size. (We don't ship the Intel version of that code in Chrome for that reason.)
rtenneti: There's an internal thread on this. Perhaps it'd be useful to track a separate bug if you're trying to do a public investigation. But that net-internals information isn't necessarily adding new information from what's discussed on this bug.
Building and getting data for Android. Meanwhile, I have data for Linux with Chromium, loading a bunch of Google properties multiple times.

The following is for the 50th percentile:

1) Signature verification mean is 6ms
2) Cert verification mean is 0ms (for tests it is around 20 microseconds).
3) Post-cert verification mean is 0ms

Overall proof verify time's mean is 8ms.


Histogram: Net.QuicSession.VerifyCertTime recorded 139 samples, average = 75.0 (flags = 0x1)
0     ------------------------------------------------------------------------O (90 = 64.7%)



Histogram: Net.QuicSession.VerifyPostCertTime recorded 139 samples, average = 1.4 (flags = 0x1)
0  ------------------------------------------------------------------------O (95 = 68.3%)


Histogram: Net.QuicSession.VerifyProofTime recorded 139 samples, average = 82.6 (flags = 0x1)
0     ... 
4     ------------------------------------------------------O                   (21 = 15.1%) {0.0%}
5     ------------------------------------------------------------------------O (28 = 20.1%) {15.1%}
6     ---------------------------------------------------O                      (20 = 14.4%) {35.3%}
7     ----------O                                                               (4 = 2.9%) {49.6%}
8     -------------------O                                                      (15 = 10.8%) {52.5%}


Histogram: Net.QuicSession.VerifySignatureTime recorded 139 samples, average = 5.7 (flags = 0x1)
0   ... 
4   ----------------------------------------O                                 (28 = 20.1%) {0.0%}
5   ------------------------------------------------------------------------O (50 = 36.0%) {20.1%}
6   ---------------------------------------------O                            (31 = 22.3%) {56.1%}





rtenneti: Comparing Linux and Android is going to be an Apples and Oranges comparison. I don't believe it will usefully add to the discussion here.

Comment 31 by gab@chromium.org, Apr 5 2016

Re #6, what startup stat to monitor: Startup.FirstWebContents.NonEmptyPaint2.* is a good one (the .* is optional, but its variants help split cold/warm startup -- for things that impact I/O you'll want to look more closely at Startup.FirstWebContents.NonEmptyPaint2.ColdStartup). Also, probably stating the obvious, but looking at high percentiles is where you'll see the impact of I/O contention.

I highly recommend putting your change behind Finch to be able to depict the impact; noise levels often hide it when looking strictly at the timeline for Canary (and on more stable channels it becomes too far removed from the commit to tell anything).
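
A minimal sketch of that Finch gating, assuming a base::Feature guards the persistence path (the feature name is made up):

#include "base/feature_list.h"

// Hypothetical feature so the persistence path can be enabled per-group via
// a Finch experiment and compared against a control group.
const base::Feature kPersistCertVerifierCache{
    "PersistCertVerifierCache", base::FEATURE_DISABLED_BY_DEFAULT};

bool ShouldPersistCertVerifierCache() {
  return base::FeatureList::IsEnabled(kPersistCertVerifierCache);
}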

Comment 32 by mef@chromium.org, Apr 26 2016

Blocking: 596854
Blocking: -596854
Project Member

Comment 34 by bugdroid1@chromium.org, Jun 16 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b25d90c36cbf578cd8459f1e0acfd26930d0e139

commit b25d90c36cbf578cd8459f1e0acfd26930d0e139
Author: rsleevi <rsleevi@chromium.org>
Date: Thu Jun 16 07:11:59 2016

Add support for walking and modifying the CachingCertVerifier

In anticipation of experimenting with persisting the certificate
verification cache across process evictions on mobile, add the
ability to iterate through the cache's contents (allowing it to
be serialized) and to restore the contents later.

Because restoration is expected to be disjoint from object
creation, rather than requiring the deserialized entries be
presented at construction, allow them to be provided after the
cache already has results, provided that no other results are
evicted.

BUG= 590875 

Review-Url: https://codereview.chromium.org/1999733002
Cr-Commit-Position: refs/heads/master@{#400100}

[modify] https://crrev.com/b25d90c36cbf578cd8459f1e0acfd26930d0e139/net/cert/caching_cert_verifier.cc
[modify] https://crrev.com/b25d90c36cbf578cd8459f1e0acfd26930d0e139/net/cert/caching_cert_verifier.h
[modify] https://crrev.com/b25d90c36cbf578cd8459f1e0acfd26930d0e139/net/cert/caching_cert_verifier_unittest.cc
[modify] https://crrev.com/b25d90c36cbf578cd8459f1e0acfd26930d0e139/net/cert/cert_verifier.cc
[modify] https://crrev.com/b25d90c36cbf578cd8459f1e0acfd26930d0e139/net/cert/cert_verifier.h
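
As a rough sketch, the capability described by this CL might be used like the following; aside from AddEntry (mentioned in comment 46), the names and shape here are assumptions rather than the real CachingCertVerifier interface:

#include <cstddef>
#include <string>
#include <vector>

// Rough sketch: walk the cache contents so a serializer can visit every
// entry, and re-add previously serialized entries without evicting fresher
// results. Other than AddEntry, the names here are assumptions.
class CachingCertVerifierSketch {
 public:
  struct Entry {
    std::string hostname;
    std::string cert_der;
    int error;
  };

  // Visit each cached entry (e.g. to pickle or proto-encode it).
  template <typename Visitor>
  void ForEachEntry(Visitor visit) const {
    for (const Entry& entry : entries_)
      visit(entry);
  }

  // Restore a deserialized entry, but only if doing so would not evict an
  // existing (likely fresher) result.
  bool AddEntry(const Entry& entry) {
    if (entries_.size() >= kMaxEntries)
      return false;
    entries_.push_back(entry);
    return true;
  }

 private:
  static constexpr size_t kMaxEntries = 256;  // In-memory cache size noted earlier.
  std::vector<Entry> entries_;
};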

Components: -Internals>Network>SSL Internals>Network>Certificate
Blocking: 410497
Blocking: -410497
Blocking: 410497
Cc: cbrubaker@google.com kroot@chromium.org davidben@chromium.org yfried...@chromium.org
 Issue 410497  has been merged into this issue.
Blockedon: 607354
In crbug.com/607354, the following two tasks were completed (for Cronet/Android only):

1) CL that persists the protobuf of CertVerifierCache
2) Cronet Java API call for persisting the data into Bundle
Cc: xunji...@chromium.org
+ xunjieli@ to help Cronet-based apps use the Java API call to serialize/deserialize cert verify results into the Bundle.
Cc: -davidben@chromium.org zhongyi@chromium.org
Cc: -zhongyi@chromium.org
Owner: zhongyi@chromium.org
zhongyi@: Some of the work listed in https://bugs.chromium.org/p/chromium/issues/detail?id=590875#c22 was completed.

Please reassign this to the right person. IMO, it is not right for this bug to be assigned to me while I am not working on it (if it is assigned to the right person, the work will get completed). Thanks much.
Cc: -rsch...@chromium.org
Cc: zhongyi@chromium.org
Owner: ----
Status: Available (was: Started)
As discussed, https://chromium-review.googlesource.com/c/chromium/src/+/1117036 will remove the Cronet CachingCertVerifier experiment, as it baked in some internal API assumptions and is blocking CertVerifier cleanup. If we manage to find a customer for this experiment, we may reland the code with a fix (to not rely on internal details).

A side note on the status of this experiment: there is currently no customer for this experiment. Unfortunately, I don't have enough bandwidth to find a customer and help them deploy/evaluate/investigate this experiment for the time being, so I'm unassigning myself from this bug. If anyone is interested in pushing this experiment forward, feel free to ping me and I can provide more context.
Project Member

Comment 45 by bugdroid1@chromium.org, Jun 29 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7f132d5aea772cbb5ef9f3c7bee25897c8993e68

commit 7f132d5aea772cbb5ef9f3c7bee25897c8993e68
Author: Ryan Sleevi <rsleevi@chromium.org>
Date: Fri Jun 29 19:26:11 2018

Remove the Cronet CachingCertVerifier experiment

Cronet had an experimental cache serializer/deserializer for certificate
verification. However, this API baked into the serialization format
internal API assumptions (such as flags' semantic meaning), rather than
translating them.

To support cleanup of this API (in which the flags will be removed in
favor of explicit configuration), remove the
serialization/deserialization logic.

BUG= 590875 

Cq-Include-Trybots: master.tryserver.chromium.android:android_cronet_tester;master.tryserver.chromium.mac:ios-simulator-cronet
Change-Id: I4af8ab8167ab2b7d909dba8d76535b8a584bec4b
Reviewed-on: https://chromium-review.googlesource.com/1117036
Reviewed-by: Paul Jensen <pauljensen@chromium.org>
Reviewed-by: Andrei Kapishnikov <kapishnikov@chromium.org>
Commit-Queue: Ryan Sleevi <rsleevi@chromium.org>
Cr-Commit-Position: refs/heads/master@{#571588}
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/android/BUILD.gn
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/android/api/src/org/chromium/net/ExperimentalCronetEngine.java
[delete] https://crrev.com/afde76d6ab1d064821294da9d31cd626cbf8581a/components/cronet/android/cert/cert_verifier_cache_serializer.cc
[delete] https://crrev.com/afde76d6ab1d064821294da9d31cd626cbf8581a/components/cronet/android/cert/cert_verifier_cache_serializer.h
[delete] https://crrev.com/afde76d6ab1d064821294da9d31cd626cbf8581a/components/cronet/android/cert/cert_verifier_cache_serializer_unittest.cc
[delete] https://crrev.com/afde76d6ab1d064821294da9d31cd626cbf8581a/components/cronet/android/cert/proto/cert_verification.proto
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/android/cronet_url_request_context_adapter.cc
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/android/cronet_url_request_context_adapter.h
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/android/java/src/org/chromium/net/impl/CronetEngineBuilderImpl.java
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/android/java/src/org/chromium/net/impl/CronetUrlRequestContext.java
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/android/java/src/org/chromium/net/impl/JavaCronetEngine.java
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/android/test/cronet_url_request_context_config_test.cc
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/android/test/javatests/src/org/chromium/net/BidirectionalStreamQuicTest.java
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/android/test/javatests/src/org/chromium/net/CronetUrlRequestContextTest.java
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/cronet_url_request_context.cc
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/cronet_url_request_context.h
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/native/engine.cc
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/stale_host_resolver_unittest.cc
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/url_request_context_config.cc
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/url_request_context_config.h
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/components/cronet/url_request_context_config_unittest.cc
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/net/cert/caching_cert_verifier.cc
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/net/cert/caching_cert_verifier.h
[modify] https://crrev.com/7f132d5aea772cbb5ef9f3c7bee25897c8993e68/net/cert/caching_cert_verifier_unittest.cc

Project Member

Comment 46 by bugdroid1@chromium.org, Jul 2

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/57203881ab7de84c910096fdc16738416449570e

commit 57203881ab7de84c910096fdc16738416449570e
Author: Ryan Sleevi <rsleevi@chromium.org>
Date: Mon Jul 02 19:23:22 2018

Remove unused code from CachingCertVerifier

CachingCertVerifier::AddEntry is now dead code after r571588

BUG= 590875 

Change-Id: I4e5bc5845e7ca33972e768ae619033f447cda744
Reviewed-on: https://chromium-review.googlesource.com/1122696
Reviewed-by: David Benjamin <davidben@chromium.org>
Commit-Queue: Ryan Sleevi <rsleevi@chromium.org>
Cr-Commit-Position: refs/heads/master@{#571964}
[modify] https://crrev.com/57203881ab7de84c910096fdc16738416449570e/net/cert/caching_cert_verifier.cc
[modify] https://crrev.com/57203881ab7de84c910096fdc16738416449570e/net/cert/caching_cert_verifier.h
[modify] https://crrev.com/57203881ab7de84c910096fdc16738416449570e/net/cert/caching_cert_verifier_unittest.cc

Project Member

Comment 47 by bugdroid1@chromium.org, Aug 10

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5d77f2b5485304ef6f4a8b6bba175fbb801da011

commit 5d77f2b5485304ef6f4a8b6bba175fbb801da011
Author: Ryan Sleevi <rsleevi@chromium.org>
Date: Fri Aug 10 21:02:29 2018

No longer explicitly initialize a CertVerifier in Cronet

Support for file-based persistence was removed in r571588,
so its no longer necessary for Cronet to explicitly create
a CertVerifier to guarantee it is a CachingCertVerifier.

Bug:  590875 
Cq-Include-Trybots: master.tryserver.chromium.android:android_cronet_tester;master.tryserver.chromium.mac:ios-simulator-cronet
Change-Id: I23e9e6db12225a7474ce589d9333595729108f1c
Reviewed-on: https://chromium-review.googlesource.com/1169822
Reviewed-by: Helen Li <xunjieli@chromium.org>
Commit-Queue: Ryan Sleevi <rsleevi@chromium.org>
Cr-Commit-Position: refs/heads/master@{#582330}
[modify] https://crrev.com/5d77f2b5485304ef6f4a8b6bba175fbb801da011/components/cronet/url_request_context_config.cc

Status: WontFix (was: Available)
