
Issue 700617

Starred by 13 users

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: 2017-04-10
OS: All
Pri: 1
Type: Bug

Blocked on:
issue 708802

Blocking:
issue 702411
issue 704949
issue 704956
issue 704988
issue 708240
issue 708260
issue 708569




Memory leak in network stack: alternative jobs [QUIC].

Project Member Reported by erikc...@chromium.org, Mar 11 2017

Issue description

I'm in the process of running an experiment where I profile all memory allocations made by Chrome.
https://docs.google.com/document/d/1Qka5hEG91L_a702AHwDFEiNw40WvdrCrd2rlC1z0u2A/edit#

I started this experiment a week ago. About a day ago, and continuing into today, browser process memory usage [live, malloc objects] has climbed by 200MB. It's stayed elevated for ~10 hours. I closed all my usual windows, waited 10 minutes, and used a guest profile to take a trace but did not see the numbers go back down. I'm currently observing ~600k live objects allocated from net, taking up just under 200MB. 

macOS: 10.12.3
chrome: branched from 7fa349e632a44c152b05ca6a66ade5f2e5b3f139, chrome://version shows 58.0.3029.0
trace: https://drive.google.com/a/google.com/file/d/0B6ho_7_ut5e1dXVnTGx6d0ExLWs/view?usp=sharing

There are two main stacks that are responsible for creating these objects. The first hits an internal profiling limit and dead-ends at net::HttpCache::Transaction::DoLoop, so I don't have more details. That stack is responsible for 90MB and 300k allocations. The other stack is within the profiling limit, and stems from DefaultJobFactory::CreateJob. This is also responsible for ~85MB and 300k allocations.

If necessary, I can rebuild the binary I'm using to increase the profiling limits, and get more information about the stack that dead-ends at net::HttpCache::Transaction::DoLoop, but I don't know how long it will take me to repro this scenario. I'm still using this browser session, so I'm also happy to grab data from net-internals, or anywhere else. Please let me know how I can help.
 
There are ~20k live malloc objects that stem directly from DefaultJobFactory::CreateJob, not including the constructor for Job. 
"""
    return new HttpStreamFactoryImpl::Job(
        delegate, job_type, session, request_info, priority, server_ssl_config,
        proxy_ssl_config, destination, origin_url, net_log);
"""
https://cs.chromium.org/chromium/src/net/http/http_stream_factory_impl.cc?q=defaultjobfactory::create+package:%5Echromium$&l=54

This suggests that there are currently ~20k live Job objects. :(
Screen Shot 2017-03-10 at 7.11.05 PM.png (57.4 KB)
Screen Shot 2017-03-10 at 6.11.18 PM.png (174 KB)
Screen Shot 2017-03-10 at 6.09.55 PM.png (146 KB)
Cc: xunji...@chromium.org
Cc: rch@chromium.org zhongyi@chromium.org
Components: Internals>Network>QUIC
Labels: -OS-Mac OS-All
Thanks erikchen@. This is the HttpStreamFactoryImpl::JobController leak again. Looks like either we missed this case in Issue 678768 or something has changed.

Here's a screenshot of net/ MemoryDumpProvider. There are 33,515 live JobControllers. Out of those, 33,167 are created for the purpose of QUIC (column "alt_job_count"), and 348 are for preconnects.
Screen Shot 2017-03-13 at 9.12.35 AM.png (89.7 KB)
Last time we leaked the main jobs.  This time we are leaking the alternative jobs.  Summing all leaked jobs across http_network_sessions, there are about 40+k alt jobs and 40+k job controllers leaked. I'll look at the code.
Owner: zhongyi@chromium.org
Status: Assigned (was: Untriaged)
We recently had a change to resume the main job only, which landed in
https://chromium.googlesource.com/chromium/src/+/ff2bfbfd76f3c43e5618addd34c82fed42d2a6f5

That might be related. Looking now. 

Comment 6 by mmenke@chromium.org, Mar 13 2017

Given the number of major leaks we've run into in this code, we should probably think about improving the ownership model.
erikchen@: could you grab a net-internals for investigation? That would be helpful.

Comment 8 Deleted

Reading the net logs, it does seem that QUIC jobs are leaked: they have been orphaned but never deleted.

t=310413 [st=  47]    HTTP_STREAM_JOB_WAITING  [dt=2]
                      --> should_wait = false
t=310574 [st= 208]    HTTP_STREAM_JOB_ORPHANED
t=317739 [st=7373]  (the net log dump exits)

And it seems the alt job never finishes. I am trying to repro locally to see whether it's a QUIC issue or a JobController issue.

Comment 10 by rch@chromium.org, Mar 13 2017

mmenke: Ironically, the JobController was created specifically to address the terribly confusing ownership model of the previous code. It's now much easier to reason about the ownership and in fact we're not actually leaking (in the sense of un-reachable) memory. Instead, we (in the previous bug and it looks like here too) are not properly cleaning up. (Not that this is any less buggy, of course)

In any case, I welcome any suggestions for how to improve the ownership model!

Comment 11 by ssid@chromium.org, Mar 13 2017

Last time, we checked UMA stats for QUIC-enabled and QUIC-disabled users, and the two were comparable. So if this is a large leak, it's probably not due to QUIC.
Attaching another net-export log, this time taken from google guest.
chrome-net-export-log.json 3.zip (948 KB)
ssid:  How do you figure?  Per comment #5, changes have been made recently in this area.
Labels: Performance-Memory

Comment 15 by ssid@chromium.org, Mar 13 2017

Re #13: I checked the recent Dev channel UMA stats for the Finch experiment; though we don't have large hit counts, they show similar numbers.

https://uma.googleplex.com/p/chrome/variations?sid=95691007b5ffd5bb64ee3190b28dc99b
https://uma.googleplex.com/p/chrome/variations?sid=2f7efc2f4b8f8c301d3ec71fae32bb07
gchatz@ has a browser process that regularly hits 5GB+ memory usage. I've asked him to start running a profiling build. He left it on for a week while he was OOO and I've included the attached trace. It shows the same leak as I was observing, but to a lesser extent. So far IOThread only shows 100MB, but he also hasn't been using this browser at all.

https://drive.google.com/a/google.com/file/d/0B6ho_7_ut5e1ZGQ3aVhKbDFGQU0/view?usp=sharing
Cc: -zhongyi@chromium.org
The trace in #16 shows that there are ~18k alternative jobs (under the black column, "net", in the Browser Process). It looks like the same issue. The Chrome version is also 58.0.3029.0 according to the trace metadata.

Re#15: The population on M58 canary/dev is probably not large enough to see a difference in Browser.Memory.Large2. 

I am not sure how we can structure HttpStreamFactory/JobController/Job to be less error-prone. I think we should have some basic integration tests where we spin up a simple QUIC server, issue 1000 requests, and track the estimates from the net/ MemoryDumpProvider. I filed Issue 701387 to do it.

After spending some time on the net logs, here is a summary of some observations:
1. All of the uncleaned-up alt_job/job_controller pairs fall into the delayed TCP case: the QUIC job starts first and DoInitConnectionImpl returns ERR_IO_PENDING (when the connection initialization completes, it runs the callback, which is HttpStreamFactoryImpl::Job::OnIOComplete), which then calls MaybeResumeMainJob with a nonzero delay. MaybeResumeMainJob simply marks main_job_is_blocked_ as false and returns immediately without posting a task to resume the main job. The job controller then starts the main job, and the main job queries the job controller about whether it should wait. With a nonzero delay, the main job waits (by returning ERR_IO_PENDING) and a delayed task is posted to resume it.

2. In the cases where the alt job never gets cleaned up, the alt job seems to be stuck somewhere and never reports status back to the controller.
3. In some cases from the netlog in #12, we expected the alt job to succeed by using an existing QUIC_SESSION, which didn't happen.
2148257 URLRequest: (Managed to use QUIC_SESSION)
URL_REQUEST_START_JOB  [dt=30]
                    --> load_flags = 34624 (DO_NOT_SAVE_COOKIES | DO_NOT_SEND_AUTH_DATA | DO_NOT_SEND_COOKIES | MAYBE_USER_GESTURE | VERIFY_EV_CERT)
                    --> method = "OPTIONS"
                    --> url = "https://play.google.com/log?format=json&u=0&authuser=0"
2148331 URL_REQUEST: (which misses the QUIC_SESSION created by 2148257, and has an uncleaned-up alt job)
URL_REQUEST_START_JOB  [dt=259]
                    --> load_flags = 34624 (DO_NOT_SAVE_COOKIES | DO_NOT_SEND_AUTH_DATA | DO_NOT_SEND_COOKIES | MAYBE_USER_GESTURE | VERIFY_EV_CERT)
                     --> method = "POST"
                     --> upload_id = "0"
                     --> url = "https://play.google.com/log?format=json"

The two URLRequests have the same destination (https://play.google.com:443) and the same privacy mode, as both load flags are the same. Without much logging in the QUIC layer, I couldn't deduce why the existing QUIC_SESSION was not used, and, if it wasn't, why the QuicStreamFactory didn't create a new Job to serve the request.

What has changed in the code recently:
1. MaybeResumeMainJob()
old code:
- return immediately if main_job_is_blocked_ == false
- flip main_job_is_blocked_ to false
- check if the main job is already in the wait state; if so, post a task to resume it. Else, no-op.
new code:
- no early return on main_job_is_blocked_
- flip main_job_is_blocked_ to false
- check if the main job is already in the wait state; if so, post a task to resume it. Else, no-op.

In which case could this change the behavior? The main job can now be resumed earlier: say the main job is already in the wait state, the alt job called MaybeResumeMainJob with a nonzero delay when its connection initialization returned ERR_IO_PENDING, and later the alt job fails and calls MaybeResumeMainJob again with NO delay.
- The old code would do nothing, waiting for the first timer to fire and resume the main job.
- The new code posts a task to resume the main job with delay = 0, resuming the main job immediately.

For the new code path to be exercised, MaybeResumeMainJob has to be called at least twice, and the second call must have delay = 0. Looking at where MaybeResumeMainJob() is called with delay = 0, it must be coming from OnStreamFailed or some similar method which then tries to reset the alt job:
- If the main job happens to have been bound to the request before, OnStreamFailed then realizes the alt job is orphaned and deletes it.
- If the main job hasn't, OnStreamFailed deletes the alt job silently since the main job is still alive.
A simplified sketch of the old vs. new MaybeResumeMainJob is below.
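To make the old/new comparison above concrete, here is a minimal, self-contained sketch of the behavior difference; the names here (PostResumeTask, main_job_is_waiting_, use_old_behavior_) are illustrative stand-ins, not the real JobController code.
"""
// Minimal sketch (not Chromium code) of the old vs. new MaybeResumeMainJob
// behavior described above, with simplified stand-in names.
#include <iostream>

struct JobControllerSketch {
  bool use_old_behavior_ = false;
  bool main_job_is_blocked_ = true;
  bool main_job_is_waiting_ = false;

  // Stand-in for posting a (possibly delayed) task that resumes the main job.
  void PostResumeTask(int delay_ms) {
    std::cout << "resume main job in " << delay_ms << " ms\n";
  }

  void MaybeResumeMainJob(int delay_ms) {
    // OLD behavior only: once the flag has been flipped by an earlier call,
    // a later call (e.g. with delay 0 after the alt job fails) is a no-op,
    // so nothing extra is posted here.
    if (use_old_behavior_ && !main_job_is_blocked_)
      return;

    main_job_is_blocked_ = false;

    // Both variants: if the main job is already waiting, post a task to
    // resume it. In the NEW code a second call with delay 0 reaches this
    // point and resumes the main job immediately.
    if (main_job_is_waiting_)
      PostResumeTask(delay_ms);
  }
};

int main() {
  JobControllerSketch c;
  // Alt job's connection init returned ERR_IO_PENDING: nonzero delay,
  // main job not waiting yet, so only the flag is flipped.
  c.MaybeResumeMainJob(300);
  c.main_job_is_waiting_ = true;  // Main job later enters the wait state.
  // Alt job fails and calls again with no delay: the new code resumes the
  // main job right away; the old code would have returned early above.
  c.MaybeResumeMainJob(0);
}
"""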
Another log, this time with adblock turned off.
chrome-net-export-log5.json.zip (5.2 MB)
Had an offline discussion with erikchen@; I will upload a debugging CL to add granular NetLog events built on top of 7fa349e632a44c152b05ca6a66ade5f2e5b3f139.

Since it's unclear whether this memory leak is platform-specific, erikchen@ will help build a macOS version and both of us will try to repro on Mac. At the same time I will also build a Linux version to see whether it can be repro'd on Linux. I will update here once the debugging CL is ready.
Cc: dskiba@chromium.org
I have cooked up a CL adding more NetLog events in quic_stream_factory and http_stream_factory_impl_job(_controller), which should give us more information about why the alt job is not deleted.

https://codereview.chromium.org/2754003002/

Comment 24 by ssid@chromium.org, Mar 16 2017

Is it possible to add a UMA histogram for the number of connections open / kept around at a time, so that we become aware when such leaks happen?
After digging into the code, my current hypothesis is that active_sessions_ or active_jobs_ in QuicStreamFactory might be corrupted in some cases: an alt job doesn't find an existing QUIC session in active_sessions_, but finds that active_jobs_ has the server id as a key. Either there's no real QUIC job serving that server id, or there's a QUIC job that never finishes.

Re #24: I can add a UMA histogram in HttpStreamFactoryImpl to count how many job controllers are left uncleaned up; a rough sketch of what that might look like is below.
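A minimal sketch of such a counter, assuming a hypothetical histogram name and a job_controller_set_ member holding the live controllers; the histograms that actually landed (see comment 38) may differ.
"""
// Sketch only: count JobControllers that are still alive when the factory is
// torn down. Histogram name and member name are assumptions for illustration.
#include "base/metrics/histogram_macros.h"

HttpStreamFactoryImpl::~HttpStreamFactoryImpl() {
  // Any JobController still in the set at this point was never cleaned up.
  UMA_HISTOGRAM_COUNTS("Net.HttpStreamFactoryImpl.UncleanedJobControllers",
                       job_controller_set_.size());
}
"""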
I ran the macOS build that erikchen@ sent out overnight. I am able to repro the problem. Attached are a NetLog and a memory dump.

View the NetLog dump using chrome://net-internals#import and apply the "type: HTTP_STREAM_JOB_CONTROLLER is:active" filter; you will find a few live controllers that should have gone away but didn't.

In the memory dump, there are about ~80 live alt jobs. Not a big leak, but something is not working as expected.
trace_jobcontroller_leak.json.gz (502 KB)
net-internals-log.json (1.9 MB)
Components: -Internals>Network
When the QuicStreamRequest is created, there is an active QuicStreamFactory::Job with that |server_id|, so a new one isn't created (Event 15377 is an example). QuicStreamFactory::OnJobCompletion() is never invoked for that existing active job, and thus the QuicStreamRequest never receives a completion callback.

I am still trying to figure out what happened to that existing active QuicStreamFactory::Job...
Thanks for getting the NetLogs, Helen!

Looking at the NetLog in #26: all of the uncleaned-up JobControllers are waiting for their alt job to complete, and all of those stuck alt jobs are associated with the same existing QuicStreamFactory::Job, which we only logged partially (because that NetLog doesn't include +QUIC_STREAM_FACTORY_JOB). But looking at the timeline of this QuicStreamFactory::Job, it has been hanging around for 107s, which is not normal.

Looking into the QuicStreamFactory::Job state machine to see what might cause this.
If it helps, last night I only left an inbox.google.com tab open. It's the job to inbox.google.com that hangs.
 
One theory could be that when the internet was disconnected and we called CloseAllSessions(), we canceled pending stream requests but never erased that active QuicStreamFactory::Job.
#29 is the only pathway that I see where a QuicStreamFactory::Job is not cleaned up.

QuicStreamFactory::CloseAllSessions() closes all active sessions and deletes them synchronously.
If an existing QuicStreamFactory::Job is stuck in connect, the QuicChromiumClientSession that was created for it would be gone after CloseAllSessions(). The QuicStreamFactory::Job wouldn't hear back because the session is gone together with the completion callback.
Scratch that.. RecordAndCloseSessionOnError() does run the |callback_|.

Comment 32 by rch@chromium.org, Mar 17 2017

I think it might be a good idea to add a CHECK in QuicStreamFactory::Job::~Job() that callback_.is_null() to ensure that the callback was actually run.

Of course, we don't know whether the job in active_jobs_ is active (in which case this CHECK would not help), or the JobSet is empty (in which case this CHECK would not help), or the job is dead (in which case it would). But perhaps this would help narrow things down.
I think it might be worth adding a DCHECK in QuicChromiumClientSession's destructor as well to make sure callback_.is_null() (and additionally output what state the session is in). It might be that the session got destroyed but never notified the QuicStreamFactory::Job. A sketch of both checks is below.
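A minimal sketch of the two checks being discussed; the exact placement, severity (CHECK vs. DCHECK), and any extra state to log are open questions, so treat this as illustrative only.
"""
// Sketch of the suggested sanity checks, not a landed change.
QuicStreamFactory::Job::~Job() {
  // The completion callback should have been run (and cleared) before the
  // Job is destroyed; otherwise a waiting HttpStreamFactoryImpl::Job hangs.
  DCHECK(callback_.is_null());
}

QuicChromiumClientSession::~QuicChromiumClientSession() {
  // If the session dies with a pending CryptoConnect callback, the
  // QuicStreamFactory::Job waiting on it was never notified.
  DCHECK(callback_.is_null());
}
"""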

Comment 34 by rch@chromium.org, Mar 17 2017

That sounds good too.
Had an offline discussion with xunjieli@; here is a summary of what we have observed in tracing and NetLogs.

The tracing indicates there are 15 sessions in the QuicStreamFactory::all_sessions_ map. The NetLog shows 14 sessions, as 14 QUIC_SESSION entries are present.

The missing one is for inbox.google.com. It has been created but not activated (active_sessions_ doesn't contain it, which is why HttpStreamFactoryImpl::Job only finds the active job), and it has had no activity for a while.

That being said, the QuicChromiumClientSession is not deleted yet. Adding the CHECKs will ensure the callbacks are run, though it might not help in this case.

Comment 36 by rch@chromium.org, Mar 17 2017

Can you tell if the session to inbox has never been activated, or if it could have been activated and then received a GOAWAY?
I have no answer to that with the current log, since logging for the first QUIC_STREAM_FACTORY_JOB is not available :(

And if Issue 702909 is confirmed, that indicates we can reach a *state A* where, for a specific server, there is no entry in active_sessions_ but there are still entries in active_jobs_ and all_sessions_.

1. OnGoAway (assuming we could be in *state A* when we receive a GOAWAY) could lead us to delete the entry in active_sessions_ but not delete the job.
2. Connection migration?
3. Clock skew: removes the active session, but not the active job.

I'm adding more logging to debug this: to see whether we received a GOAWAY, whether we notified the factory of going away, and whether we notified the factory that the session closed. I tried with YouTube turned on but wasn't getting a repro. Since QUIC currently is not fully turned back on, we might need to wait until Monday.
Comment 38 by bugdroid1@chromium.org (Project Member), Mar 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/de52d1630afa30d255f1e773d350c758056fd2ed

commit de52d1630afa30d255f1e773d350c758056fd2ed
Author: zhongyi <zhongyi@chromium.org>
Date: Mon Mar 20 23:17:05 2017

Add histograms to track the number of job controller and job that are
not cleaned up when HttpStreamFactoryImpl is destructed.

BUG= 700617 

Review-Url: https://codereview.chromium.org/2752193003
Cr-Commit-Position: refs/heads/master@{#458228}

[modify] https://crrev.com/de52d1630afa30d255f1e773d350c758056fd2ed/net/http/http_stream_factory_impl.cc
[modify] https://crrev.com/de52d1630afa30d255f1e773d350c758056fd2ed/tools/metrics/histograms/histograms.xml

I think we can confirm that the QUIC job hangs, which causes all the dependent alt jobs to hang and never finish. And we now know why this is happening.

* Why are a lot of alt jobs not finishing?
HttpStreamFactoryImpl::Job (the alt job) always tries to find an active QuicChromiumClientSession if one exists. Otherwise, it checks whether there is already an active QuicStreamFactory::Job serving the same server_id and binds itself to that Job. If neither of these happens, it eventually creates a new QuicStreamFactory::Job to serve the server_id, and the new Job is added to the active_jobs_ map until it finishes. In this case, where we see a lot of alt jobs not finishing, they are all bound to QuicStreamFactory::Jobs that hang. (A rough sketch of this lookup order follows.)
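To illustrate the lookup order just described, here is a small self-contained sketch; the container names mirror the discussion (active_sessions_, active_jobs_), but everything else is a simplified stand-in rather than the real QuicStreamFactory code.
"""
// Self-contained sketch of the lookup order described above (illustrative
// stand-ins only, not the real net/ classes).
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct SessionSketch {};
struct JobSketch {
  std::vector<int> bound_requests;  // Requests waiting on this Job to finish.
};

struct QuicStreamFactorySketch {
  enum Result { kUsedExistingSession, kBoundToExistingJob, kStartedNewJob };

  std::map<std::string, SessionSketch*> active_sessions_;
  std::map<std::string, std::unique_ptr<JobSketch>> active_jobs_;

  Result Request(const std::string& server_id, int request_id) {
    // 1. Reuse an already-active session for this server if one exists.
    if (active_sessions_.count(server_id))
      return kUsedExistingSession;

    // 2. Otherwise, bind the request to an in-flight Job for the same
    //    server_id. The request completes only when that Job completes, so
    //    if the Job hangs (the bug here), every bound request hangs with it.
    auto it = active_jobs_.find(server_id);
    if (it != active_jobs_.end()) {
      it->second->bound_requests.push_back(request_id);
      return kBoundToExistingJob;
    }

    // 3. Otherwise, start a new Job; it stays in active_jobs_ until done.
    auto job = std::make_unique<JobSketch>();
    job->bound_requests.push_back(request_id);
    active_jobs_[server_id] = std::move(job);
    return kStartedNewJob;
  }
};
"""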

* Why does QuicStreamFactory::Job hang?
Those Jobs were discovered to be hanging in the CONNECT state during the investigation. The Job waits on QuicChromiumClientSession::CryptoConnect and passes a callback to be run once the session finishes its work. The callback is not stored in callback_ until crypto_stream_->CryptoConnect() returns, but we can close the session during the crypto connect via QuicChromiumClientSession::OnConnectionClosed() without notifying the Job, because callback_ is still null at that point. [In this case, the network was disconnected, triggering a PACKET_WRITE_ERROR, which caused us to close the connection and the session.] We don't check whether the connection is still alive when crypto_stream_->CryptoConnect() returns, and instead return ERR_IO_PENDING to the Job. Thus the Job waits for the session to run the callback, while the session has been deleted and will never run it.

int QuicChromiumClientSession::CryptoConnect(
    const CompletionCallback& callback) {
  connect_timing_.connect_start = base::TimeTicks::Now();
  RecordHandshakeState(STATE_STARTED);
  DCHECK(flow_controller());
                                       // callback_ is still null here.
  crypto_stream_->CryptoConnect();     // Writes packets, gets PACKET_WRITE_ERROR,
                                       // closes the session; on return the session
                                       // is in goaway state and will be deleted soon.
  if (IsCryptoHandshakeConfirmed()) {
    connect_timing_.connect_end = base::TimeTicks::Now();
    return OK;
  }

  // Unless we require handshake confirmation, activate the session if
  // we have established initial encryption.
  if (!require_confirmation_ && IsEncryptionEstablished())
    return OK;

  callback_ = callback;
  return ERR_IO_PENDING;               // Returns ERR_IO_PENDING to the Job even
                                       // though the session will be deleted.
}

void QuicChromiumClientSession::OnConnectionClosed() {
  ...
  NotifyFactoryOfSessionGoingAway();
  ...
  if (!callback_.is_null()) {          // Not executed, as callback_ is null.
    base::ResetAndReturn(&callback_).Run(ERR_QUIC_PROTOCOL_ERROR);
  }
  ...
  NotifyFactoryOfSessionClosedLater(); // Will delete the session soon.
}

QuicChromiumClientSession::~QuicChromiumClientSession() {
  // ... does not check callback_ or run it.
}
net-internals-log (6).json (1.7 MB)
Awesome work!

Three requests:
1) Before we merge fixes to M-57/58/59, can we figure out the typical impact of the leak? I want to know if it is likely responsible for several multi-GB browser leaks that I've observed [on my machine and others] but have not been able to track down.

2) It seems like we were already emitting a memory-infra stat [see c#3] that should have made it obvious that we were leaking jobs, but we weren't tracking this anywhere. Can we emit UMA stats for this and similar metrics that we can track with chirps?

3) Can we write some type of end-to-end test system that makes us more likely to catch this type of error in the lab? Maybe we need to restructure the code to make fuzzing easier, or perhaps we just need end-to-end tests that allow us to exercise interesting edge cases.
Labels: -M-59 ReleaseBlock-Stable M-58

Comment 42 by rch@chromium.org, Mar 21 2017

erikchen: Interestingly, as far as we can tell, this bug has been present for years. Is it surprising that this was not previously caught by memory profiling? As I read the bug report, I think this was discovered by recent tools/process. If so, it's not shocking that we didn't notice this before. Am I understanding this correctly? If not, I wonder why we didn't notice this before :/
> Is it surprising that this was not previously caught by memory profiling?
No. AFAICT, no one has used memory profiling on Desktop since ssid did his Linux analysis a couple of months ago. [It hasn't even worked on macOS/Windows until very recently].


Comment 44 by rch@chromium.org, Mar 21 2017

Great! That solves one mystery :)

Comment 45 by jri@chromium.org, Mar 22 2017

Excellent work, zhongyi@! I assume that you've been able to repro this with logging showing this happens?
Re #40:
1) Before we merge fixes to M-57/58/59, can we figure out the typical impact of the leak?
I thought ssid@ had done some similar work when measuring crbug.com/678768; he might have better ideas. Isn't the UMA histogram Memory.Browser.Large2 tracking the difference between the Control and Enabled groups? Once the fix is checked in, presumably we could compare data in those histograms.

2) Can we emit UMA stats for this and similar metrics that we can track with chirps?
I have added histograms in comment 38 which log the number of uncleaned-up JobControllers and Jobs when HttpStreamFactoryImpl is destructed. I will register chirp alerts for those histograms.

3) Maybe we need to restructure the code to make fuzzing more easy, or perhaps we just need end-to-end tests that allow us to exercise interesting edge cases.
The way this bug is triggered is really tricky. I feel we have to know the edge case in order to cover it in e2e tests. Fuzzing sounds like a plan, though I don't know how much effort it would require as it's a really big stack.

Re #45:
Yup. The easiest way I know to repro this issue is to have inbox.google.com open in one tab and close your Mac's lid, which hopefully triggers a PACKET_WRITE_ERROR while we are attempting to send the CHLO to the server; resume your laptop and you might be able to catch it in net-internals.
Using the same analysis as ssid@ from his postmortem:
https://docs.google.com/document/d/1DPM0iZTo9I331a3KYs5VznOXD276OjOxeY7m5lYMqUY/edit#

Googlers, OSX, show a 26.7% regression in the 95th percentile in Memory.Browser.Large2 with QUIC enabled.
https://uma.googleplex.com/p/chrome/variations/?sid=149f63017031c255aabcd8d20a3ca559

All users, OSX, show a 5.2% regression in the 95th percentile in Memory.Browser.Large2 with QUIC enabled.
https://uma.googleplex.com/p/chrome/variations/?sid=149f63017031c255aabcd8d20a3ca559

Another reason to spend time understanding the impact of this particular bug/fix is to avoid the issue we ran into with ssid's post-mortem, where we assumed that the bug we fixed was responsible for the regression we were seeing.

Given the impact of this bug [multiple years of a severe memory leak], should we write a post-mortem?
Summary: Memory leak in network stack: alternative jobs [QUIC]. (was: Potential memory leak in net/, DefaultJobFactory::CreateJob)
Re #47: The problem with Browser.Memory.Large2 or some other general UMA is that we don't really know whether the diff is expected or whether something fishy is going on.

Ssid@'s work and the net/ MemoryDumpProvider helped to narrow down the cause of this leak quickly (see comment #3).

The net/ MDP was whitelisted only recently. I saw one trace from Android with leaked alternative jobs; however, that only turned up because I started looking. There's definitely some improvement that can be made to the tooling and reporting infrastructure. As a first line of defense, maybe more code coverage in unit tests could help us with issues like this? zhongyi@: what do you think?
Re #47: Having said that, I think we will benefit from writing a postmortem so we know where the gaps are. Talked to zhongyi@ offline, we will draft one and send it out.
Is total leaked memory linearly proportional to number of leaked jobs? If so, the new UMA stat should give us an idea of the impact. 

The problem is that the leak grows over time, and people who pick up the new UMA stat are getting a new binary [and restarting chrome], so we'll see a bias towards lower leaks.
And once zhongyi's CL lands [and fixes the leak], we will no longer have any idea what the impact of the leak is.

Comment 53 by ssid@chromium.org, Mar 23 2017

If the leak is fixed, we should see an improvement in the Memory.Browser.Large2 histogram, right? By that I mean the links posted in #47 should no longer show a difference at the 95th percentile.
Also, from the histograms, this seems to affect mainly Googlers. I'd say the 5% at the 95th percentile could be noise, or QUIC just uses different data types that use slightly more memory.

About the bias towards lower leaks: we should be comparing versions from the same channels after a sufficiently long time. Comparing two Dev releases would show users with a week of uptime. But yes, that would not show the impact on the stable channel, and we should wait a few weeks after the stable release. Also, if we see that the difference at the 95th percentile goes down for the Dev channel, we can be fairly confident that the difference in the stable channel also goes down.

The newly added UMA wouldn't help much because it would only have reached the Dev channel by now, and we are again back to only a week-long session before the fix lands and the browser gets updated.

Re #49: Yes, I think we should look at more traces from slow reports. FYI, there are traces from desktop platforms as well. I have never looked at them.
ssid: That's not the right way to measure impact of the leak.

>I'd say the 5% at 95th percentile would be noise
The 5% increase in the 95th percentile has a p value of 0.0001. It is *highly unlikely* for this to be noise.
https://uma.googleplex.com/p/chrome/variations/?sid=149f63017031c255aabcd8d20a3ca559

>QUIC just uses different data types, which uses slightly more memory.
If QUIC intentionally causes Chrome to use 5% more memory at the 95th percentile, then we need to re-evaluate the implementation for bugs.

Just looking at before/after for a UMA metric introduces many potential sources of noise. I will point out that the last time this analysis was done, we *failed to realize* that there was another, potentially very significant memory leak.

There are two thorough ways to measure impact:
1) A/B experiment of fix.
Given that the problem has been present for years, rolling out the solution to half the population first isn't going to be the end of the world.

2) Explicit UMA metric that measures the number of alternate jobs that would have leaked, if the solution were not present. This is harder to get right, but doesn't delay rolling out the solution.

Comment 55 by ssid@chromium.org, Mar 23 2017

If the goal we are talking about is to avoid such leaks in the future, then shouldn't a UMA of just the number of jobs alive help here? We would be able to see how many jobs are kept around on devices. If this number goes to 100k, like in this case, then it's definitely a leak.

If the goal is to measure impact of the fix in terms of memory reduction, then we could look at memory UMA between 2 Chrome versions / QUIC experiment rather than a new experiment.
Having an explicit UMA that measures memory that might have been leaked sounds good. But that might be expensive. Also, IIUC just the number of jobs might not be representative of the memory usage.

IMHO rolling out an experiment for the fix seems overkill just to measure impact of the fixed leak and it is not going to help with future leaks. But, if we are going for totally noiseless measurement of memory impact, then this seems to be a good idea. Mind that there would still be noise because we are looking at total memory instead of net memory.
> we could look at memory UMA between 2 Chrome versions / QUIC experiment 

This makes several assumptions:
1) All changes to the UMA metric are caused by the CL you're trying to measure. Let's say that you see a 10MB regression. Your CL actually only fixes a 5MB regression, and another CL fixes a different 5MB regression. You'll think that your 10MB regression is "fixed", even though you only fixed 5MB of it.

2) Canary/Dev UMA metrics are representative of Beta/Stable.
In your last post-mortem:
https://docs.google.com/document/d/1DPM0iZTo9I331a3KYs5VznOXD276OjOxeY7m5lYMqUY/edit#

You did exactly as you suggested and showed some graphs that seemed to imply that on Linux, pre-fix, QUIC used ~double the memory at the 95th percentile, and post-fix, QUIC used ~half the memory.

This is really suspicious, and the huge swing should tell you that the data isn't particularly reliable. When I looked at the Linux stable channel [28 day aggregation], I see that QUIC users *still* use 18% more memory. 
https://docs.google.com/document/d/1QS2Wv_dCpaII-ApmX-tNNrOvKz4uA1Q3XCfVb1dkbUw/edit?ts=58d421d9#

I spoke with jam@, who convinced me that it wasn't worthwhile to spend extra effort to measure the impact of the leak/fix. Fine. But we shouldn't substitute inaccurate information in its place. Attempting to attribute changes to UMA stats to specific CLs without an A/B experiment is fraught with peril.
For some reason, bugdroid is not working. The bugfix CL just landed in chromium. Manually adding it here. 

commit	363c91c57a5ae99e63a138555afa83dc02262325	[log] [tgz]
author	zhongyi <zhongyi@chromium.org>	Thu Mar 23 23:16:08 2017
committer	Commit bot <commit-bot@chromium.org>	Thu Mar 23 23:16:08 2017
tree	77ceaf75bf379afdfd37bd7a53b2df8cce089c83
parent	5e1457c6a247571aa25c76364693863b7bdd4b7f [diff]
QUIC: mark QUIC handshake failed if connection is closed after CryptoConnect
so that QuicStreamFactory::Job will not hang.

Currently, if QuicConnection is closed during crypto_stream_->CryptConnect(),
it will silently post a task to notify QuicStreamFactory to of the session
close without notifying the QuicStreamFactory::Job. This causes
QuicStreamFactory::Job to hang. Subsequent alt jobs will be bound to the
hanging QUIC job, and will never win race over TCP and get cleaned up.

BUG= 700617 

Review-Url: https://codereview.chromium.org/2766603004
Cr-Commit-Position: refs/heads/master@{#459271}
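Based on the commit message above, the shape of the fix is roughly: after crypto_stream_->CryptoConnect() returns, check whether the connection was closed underneath us and fail the handshake instead of returning ERR_IO_PENDING. The sketch below assumes that check is expressed via connection()->connected() and ERR_QUIC_HANDSHAKE_FAILED; the landed CL may differ in detail.
"""
int QuicChromiumClientSession::CryptoConnect(
    const CompletionCallback& callback) {
  connect_timing_.connect_start = base::TimeTicks::Now();
  RecordHandshakeState(STATE_STARTED);
  DCHECK(flow_controller());

  crypto_stream_->CryptoConnect();

  if (IsCryptoHandshakeConfirmed()) {
    connect_timing_.connect_end = base::TimeTicks::Now();
    return OK;
  }

  // New: if the connection died while sending the CHLO (e.g. a
  // PACKET_WRITE_ERROR after a network disconnect), there is no live session
  // left to run the callback later, so report the handshake as failed now
  // instead of returning ERR_IO_PENDING and leaving the Job hanging.
  if (!connection()->connected())
    return ERR_QUIC_HANDSHAKE_FAILED;

  // Unless we require handshake confirmation, activate the session if
  // we have established initial encryption.
  if (!require_confirmation_ && IsEncryptionEstablished())
    return OK;

  callback_ = callback;
  return ERR_IO_PENDING;
}
"""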

Blocking: 702411
Blocking: 704949
Blocking: 704956
Labels: Merge-Request-58
Comment 62 by sheriffbot@chromium.org (Project Member), Mar 24 2017

Labels: -Merge-Request-58 Hotlist-Merge-Approved Merge-Approved-58
Your change meets the bar and is auto-approved for M58. Please go ahead and merge the CL to branch 3029 manually. Please contact milestone owner if you have questions.
Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
The bugfix 363c91c57a5ae99e63a138555afa83dc02262325 is currently in Canary. Unless there's any objection, I will merge it to M58 on Monday.
Blocking: 704988
Cc: primiano@chromium.org
It seems we didn't see anything suspicious during the past two days: no new crashes due to the bugfix. I'll start the merge to M58 now.
Has anyone confirmed that this change fixes the problem erikchen@ reported in this bug?
Re #67: zhongyi@'s CL fixed one issue. We think it's likely the one that erikchen@ saw. Whether that will fix all the issues is to be verified.  Issue 704956  is one example that we need to follow up on. 
Comment 69 by bugdroid1@chromium.org (Project Member), Mar 27 2017

Labels: -merge-approved-58 merge-merged-3029
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0a4d3c147a0f52493a850051008e1fe57c125d30

commit 0a4d3c147a0f52493a850051008e1fe57c125d30
Author: Zhongyi Shi <zhongyi@chromium.org>
Date: Mon Mar 27 18:57:45 2017

QUIC: mark QUIC handshake failed if connection is closed after CryptoConnect so that QuicStreamFactory::Job will not hang.

Currently, if QuicConnection is closed during crypto_stream_->CryptConnect(),
it will silently post a task to notify QuicStreamFactory to of the session
close without notifying the QuicStreamFactory::Job. This causes
QuicStreamFactory::Job to hang. Subsequent alt jobs will be bound to the
hanging QUIC job, and will never win race over TCP and get cleaned up.

BUG= 700617 

Review-Url: https://codereview.chromium.org/2766603004
Cr-Commit-Position: refs/heads/master@{#459271}
(cherry picked from commit 363c91c57a5ae99e63a138555afa83dc02262325)

R=rch@chromium.org

Review-Url: https://codereview.chromium.org/2780523002 .
Cr-Commit-Position: refs/branch-heads/3029@{#435}
Cr-Branched-From: 939b32ee5ba05c396eef3fd992822fcca9a2e262-refs/heads/master@{#454471}

[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/chromium/mock_crypto_client_stream_factory.cc
[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/chromium/quic_chromium_client_session.cc
[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/chromium/quic_network_transaction_unittest.cc
[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/chromium/quic_stream_factory_test.cc
[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/test_tools/mock_crypto_client_stream.cc
[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/test_tools/mock_crypto_client_stream.h
[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/test_tools/quic_stream_factory_peer.cc
[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/test_tools/quic_stream_factory_peer.h

Blocking: 708240
Blocking: 708260
Blocking: 708569
Blockedon: 708802
Blockedon: 708806
NextAction: 2017-04-10
A friendly reminder that the M58 Stable launch is coming soon! Your bug is labelled as ReleaseBlock-Stable; please make sure to land the fix, verify it in trunk, and get it merged into the release branch ASAP.
Labels: -ReleaseBlock-Stable
The main CL has been merged into M58.

Leaving this bug open to do a couple of follow-ups but those shouldn't block the stable release.
Labels: sr-pm-5
Blockedon: -708806
Status: Fixed (was: Assigned)
Marking this bug as fixed. We have had local Chrome instances running for a long while, and I haven't noticed any similar leaks so far. The bugfix (in 58.0.3029.40) has rolled out to stable.
