Memory leak in network stack: alternative jobs [QUIC]
Issue description

I'm in the process of running an experiment where I profile all memory allocations made by Chrome: https://docs.google.com/document/d/1Qka5hEG91L_a702AHwDFEiNw40WvdrCrd2rlC1z0u2A/edit# I started this experiment a week ago.

About a day ago, and continuing into today, browser process memory usage [live, malloc objects] has climbed by 200MB. It's stayed elevated for ~10 hours. I closed all my usual windows, waited 10 minutes, and used a guest profile to take a trace, but did not see the numbers go back down. I'm currently observing ~600k live objects allocated from net, taking up just under 200MB.

macOS: 10.12.3
chrome: branched from 7fa349e632a44c152b05ca6a66ade5f2e5b3f139, chrome://version shows 58.0.3029.0
trace: https://drive.google.com/a/google.com/file/d/0B6ho_7_ut5e1dXVnTGx6d0ExLWs/view?usp=sharing

There are two main stacks responsible for creating these objects. The first hits an internal profiling limit and dead-ends at net::HttpCache::Transaction::DoLoop, so I don't have more details; that stack is responsible for 90MB and 300k allocations. The other stack is within the profiling limit and stems from DefaultJobFactory::CreateJob; it is responsible for ~85MB and 300k allocations.

If necessary, I can rebuild the binary I'm using to increase the profiling limits and get more information about the stack that dead-ends at net::HttpCache::Transaction::DoLoop, but I don't know how long it will take me to repro this scenario. I'm still using this browser session, so I'm also happy to grab data from net-internals, or anywhere else. Please let me know how I can help.
,
Mar 11 2017
,
Mar 13 2017
Thanks erikchen@. This is the HttpStreamFactoryImpl::JobController leak again. Looks like either we missed this case in Issue 678768 or something has changed. Here's a screenshot of the net/ MemoryDumpProvider. There are 33,515 live JobControllers. Of those, 33,167 were created for the purpose of QUIC (column "alt_job_count"), and 348 are for preconnects.
,
Mar 13 2017
Last time we leaked the main jobs. This time we are leaking the alternative jobs. Summing all leaked jobs across http_network_sessions, there are about 40+k alt jobs and 40+k job controllers leaked. I'll look at the code.
,
Mar 13 2017
We recently had a change to resume the main job only, which landed in https://chromium.googlesource.com/chromium/src/+/ff2bfbfd76f3c43e5618addd34c82fed42d2a6f5. That might be related. Looking now.
,
Mar 13 2017
Given the number of major leaks we've run into in this code, we should probably think about improving the ownership model.
,
Mar 13 2017
erikchen@: could you grab a net-internals dump for investigation? That would be helpful.
,
Mar 13 2017
Reading the net logs, it does seem that QUIC jobs are leaked: they get orphaned but never deleted.
t=310413 [st= 47] HTTP_STREAM_JOB_WAITING [dt=2]
--> should_wait = false
t=310574 [st= 208] HTTP_STREAM_JOB_ORPHANED
t=317739 [st=7373] (the net log dump ends here)
And it seems the alt job never finishes. I am trying to repro locally to see whether it's a QUIC issue or a JobController issue.
,
Mar 13 2017
mmenke: Ironically, the JobController was created specifically to address the terribly confusing ownership model of the previous code. It's now much easier to reason about the ownership, and in fact we're not actually leaking (in the sense of unreachable) memory. Instead, we (in the previous bug, and it looks like here too) are not properly cleaning up. (Not that this is any less buggy, of course.) In any case, I welcome any suggestions for how to improve the ownership model!
,
Mar 13 2017
Last time, we checked UMA stats for QUIC-enabled and QUIC-disabled users, and the two were comparable. So if this is a large leak, it is probably not due to QUIC.
,
Mar 13 2017
Attaching another net-export log, this time taken from google guest.
,
Mar 13 2017
ssid: How do you figure? Per comment #5, changes have been made recently in this area.
,
Mar 13 2017
Re #13: I checked the recent Dev-channel UMA stats for the Finch experiment; though we don't have many samples, they show similar numbers. https://uma.googleplex.com/p/chrome/variations?sid=95691007b5ffd5bb64ee3190b28dc99b https://uma.googleplex.com/p/chrome/variations?sid=2f7efc2f4b8f8c301d3ec71fae32bb07
,
Mar 14 2017
gchatz@ has a browser process that regularly hits 5GB+ memory usage. I've asked him to start running a profiling build. He left it on for a week while he was OOO and I've included the attached trace. It shows the same leak as I was observing, but to a lesser extent. So far IOThread only shows 100MB, but he also hasn't been using this browser at all. https://drive.google.com/a/google.com/file/d/0B6ho_7_ut5e1ZGQ3aVhKbDFGQU0/view?usp=sharing
,
Mar 14 2017
The trace in #16 shows that there are ~18k alternative jobs (under the black column, "net", in the Browser Process). It looks like the same issue. The Chrome version is also 58.0.3029.0 according to the trace metadata. Re #15: The population on M58 canary/dev is probably not large enough to see a difference in Browser.Memory.Large2. I am not sure how we can structure HttpStreamFactory/JobController/Job to be less error-prone. I think we should have some basic integration tests where we spin up a QUIC simple server, issue 1000 requests, and track the estimates from the net/ MemoryDumpProvider (rough sketch below). I filed Issue 701387 for that.
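To make that concrete, here's the rough shape such a test could take. Everything here is hypothetical: the fixture, the request helper, and the counter accessor are invented for illustration, not existing APIs.

// Hypothetical end-to-end regression test for leaked stream jobs.
TEST_F(QuicSimpleServerEndToEndTest, NoJobControllersLeftAfterRequests) {
  for (int i = 0; i < 1000; ++i)
    IssueRequestAndWaitForCompletion();  // hypothetical helper
  // Once every request has completed, no JobControllers should remain
  // alive in the HttpStreamFactoryImpl.
  EXPECT_EQ(0u, GetLiveJobControllerCountForTesting());  // hypothetical
}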
,
Mar 14 2017
After spending some time on the net logs, here's a summary of observations:
1. All of the uncleaned alt_job/job_controller pairs fall into the delayed-TCP case: the QUIC job starts first, and DoInitConnectionImpl returns ERR_IO_PENDING (when connection init completes, it runs the callback, HttpStreamFactoryImpl::Job::OnIOComplete), which then calls MaybeResumeMainJob with a nonzero delay. MaybeResumeMainJob simply marks main_job_is_blocked_ false and returns immediately without posting a task to resume the main job. The job controller then starts the main job, and the main job queries the job controller about whether it should wait. With a nonzero delay, the main job waits (by returning ERR_IO_PENDING) with a delayed task posted to resume it. (See the sketch after this list.)
2. In the cases where the alt job never gets cleaned up, the alt job seems stuck somewhere and never reports status to the controller.
3. In some cases from the netlog in #12, we expected the alt job to succeed by using an existing QUIC_SESSION, which didn't happen:
2148257 URLRequest: (Managed to use QUIC_SESSION)
URL_REQUEST_START_JOB [dt=30]
--> load_flags = 34624 (DO_NOT_SAVE_COOKIES | DO_NOT_SEND_AUTH_DATA | DO_NOT_SEND_COOKIES | MAYBE_USER_GESTURE | VERIFY_EV_CERT)
--> method = "OPTIONS"
--> url = "https://play.google.com/log?format=json&u=0&authuser=0"
2148331 URL_REQUEST: (which misses the QUIC_SESSION created by 2148257, and has an uncleaned alt job)
URL_REQUEST_START_JOB [dt=259]
--> load_flags = 34624 (DO_NOT_SAVE_COOKIES | DO_NOT_SEND_AUTH_DATA | DO_NOT_SEND_COOKIES | MAYBE_USER_GESTURE | VERIFY_EV_CERT)
--> method = "POST"
--> upload_id = "0"
--> url = "https://play.google.com/log?format=json"
The two URLRequests have the same destination (https://play.google.com:443) and the same privacy mode, as both have the same load flags. Without much logging in the QUIC layer, I couldn't deduce why the existing QUIC_SESSION is not used, and, if it can't be, why QuicStreamFactory doesn't create a new Job to serve the request.
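To make point 1 above easier to follow, here is a condensed model of the wait/resume handshake (a sketch with simplified member and method names, not the verbatim Chromium code):

#include "base/time/time.h"

// Simplified model of the JobController's delayed-TCP handshake.
struct JobControllerSketch {
  bool main_job_is_blocked_ = true;
  base::TimeDelta main_job_wait_time_;  // nonzero in the delayed-TCP case

  // Called from the alt (QUIC) job once its connection attempt goes async.
  void MaybeResumeMainJob(base::TimeDelta delay) {
    main_job_is_blocked_ = false;  // Only flips the flag;
    main_job_wait_time_ = delay;   // no resume task is posted here.
  }

  // Called by the main job; true means "wait", i.e. return ERR_IO_PENDING
  // and rely on a delayed task to resume the main job later.
  bool ShouldWait() const {
    if (main_job_is_blocked_)
      return true;
    return !main_job_wait_time_.is_zero();
  }
};

The controller only tears itself down once both jobs report back, so a hung alt job keeps the JobController (and everything it owns) alive indefinitely.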
,
Mar 14 2017
What we have changed in this code recently:

1. MaybeResumeMainJob()
old code:
- return immediately if main_job_is_blocked_ == false
- flip main_job_is_blocked_ to false
- check if the main job is in the wait state; if so, post a task to resume it; else, no-op
new code:
- ignore main_job_is_blocked_
- flip main_job_is_blocked_ to false
- check if the main job is in the wait state; if so, post a task to resume it; else, no-op

In which case could this change behavior? We expect the main job to be resumed immediately if it has been in the wait state all along, the alt job called MaybeResumeMainJob with a nonzero delay when its connection initialization returned ERR_IO_PENDING, and the alt job later failed, calling MaybeResumeMainJob with NO delay.
- old code: do nothing, waiting for the first timer to fire and resume the main job
- new code: post a task to resume the main job with delay = 0, resuming the main job immediately

For the new code path to be executed, MaybeResumeMainJob must be called at least twice, and the second call must have delay = 0. If you look at where we call MaybeResumeMainJob() with delay = 0, it must be coming from OnStreamFailed or some similar method, which then tries to reset the alt job:
- if the main job happened to be bound to the request before, OnStreamFailed realizes the alt job is orphaned and deletes it
- if the main job hasn't been bound, OnStreamFailed deletes the alt job silently, since the main job is still alive

A sketch of the two variants follows.
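(Paraphrased, not the literal diff; PostResumeTask and main_job_is_waiting_ stand in for the real members.)

struct ResumeLogicSketch {
  bool main_job_is_blocked_ = true;
  bool main_job_is_waiting_ = false;
  void PostResumeTask(base::TimeDelta delay);  // posts a delayed resume

  // Old behavior: a second call was swallowed by the early return.
  void MaybeResumeMainJobOld(base::TimeDelta delay) {
    if (!main_job_is_blocked_)
      return;  // e.g. a later OnStreamFailed call with delay = 0 is a no-op
    main_job_is_blocked_ = false;
    if (main_job_is_waiting_)
      PostResumeTask(delay);
  }

  // New behavior: the early return is gone, so a second call with
  // delay = 0 posts an immediate resume task.
  void MaybeResumeMainJobNew(base::TimeDelta delay) {
    main_job_is_blocked_ = false;
    if (main_job_is_waiting_)
      PostResumeTask(delay);
  }
};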
,
Mar 14 2017
Another log, this time with adblock turned off.
,
Mar 14 2017
Had an offline discussion with erikchen@. I will upload a debugging CL to add granular NetLogs built on top of 7fa349e632a44c152b05ca6a66ade5f2e5b3f139. Since it's unclear whether this memory leak is platform-specific, erikchen@ will help build the macOS version and both of us will try to repro on Mac. At the same time I will also build a Linux version to see whether it can be repro'd on Linux. I will update here once the debugging CL is ready.
,
Mar 16 2017
I have cooked up a CL adding more NetLogs in quic_stream_factory and http_stream_factory_impl_job(_controller), which should give us more information about why the alt job is not deleted. https://codereview.chromium.org/2754003002/
,
Mar 16 2017
Is it possible to add a UMA stat for the number of connections open / kept around at a time, so that we become aware when leaks happen?
,
Mar 16 2017
After digging into the code, my current hypothesis is that active_sessions_ or active_jobs_ in QuicStreamFactory might be corrupted in some cases: an alt job doesn't find an existing QUIC session in active_sessions_, but does find the server id as a key in active_jobs_. Either there's no real QUIC job serving that server id, or there's one QUIC job that never finishes. Re #24: I can add a UMA histogram in HttpStreamFactoryImpl to count how many job controllers are left uncleaned.
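A sketch of that counter (the histogram name and member are illustrative, not necessarily what the eventual CL used):

// Illustrative: record how many JobControllers are still alive when
// the factory is torn down. Requires base/metrics/histogram_macros.h.
HttpStreamFactoryImpl::~HttpStreamFactoryImpl() {
  UMA_HISTOGRAM_COUNTS("Net.JobControllerSet.CountOfJobController",
                       job_controller_set_.size());
}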
,
Mar 17 2017
I ran the macOS build that erikchen@ sent out overnight, and I am able to repro the problem. Attached is a NetLog and a memory dump. View the NetLog dump using chrome://net-internals#import and apply the "type: HTTP_STREAM_JOB_CONTROLLER is:active" filter; you will find a few live controllers that should have gone away but didn't. In the memory dump, there are ~80 live alt jobs. Not a big leak, but something is not working as expected.
,
Mar 17 2017
When the QuicStreamRequest is created, there is an active QuicStreamFactory::Job with that |server_id|, so a new one isn't created (Event 15377 is an example). QuicStreamFactory::OnJobCompletion() is never invoked for that existing active job, and thus the QuicStreamRequest never receives a completion callback. I am still trying to figure out what happened to that existing active QuicStreamFactory::Job...
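For context, the lookup described above has roughly this shape (simplified from the 2017-era QuicStreamFactory; treat the member names as approximate rather than verbatim):

// Inside QuicStreamFactory::Create(), before spinning up a new Job:
auto it = active_jobs_.find(server_id);
if (it != active_jobs_.end()) {
  // Attach the request to the existing Job. The request only completes
  // when OnJobCompletion() fires for that Job -- which, in this bug,
  // never happens.
  job_requests_map_[server_id].insert(request);
  active_requests_[request] = server_id;
  return ERR_IO_PENDING;
}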
,
Mar 17 2017
Thanks for getting the NetLogs, Helen! Looking at the NetLog in #26: all the uncleaned JobControllers are waiting for their alt jobs to complete, and all the stuck alt jobs are associated with the same existing QuicStreamFactory::Job, which we only logged partially (because that NetLog doesn't have +QUIC_STREAM_FACTORY_JOB enabled). But looking at the timeline of this QuicStreamFactory::Job, it has been around for 107s, which is not normal. Looking into the QuicStreamFactory::Job state machine to see what might cause this.
,
Mar 17 2017
If it helps, last night I left only the inbox.google.com tab open, and it's the job to inbox.google.com that hangs. One theory could be that when the internet was disconnected and we called CloseAllSessions(), we canceled pending stream requests but never erased that active QuicStreamFactory::Job.
,
Mar 17 2017
#29 is the only pathway I see where a QuicStreamFactory::Job is not cleaned up. QuicStreamFactory::CloseAllSessions() closes all active sessions and deletes them synchronously. If an existing QuicStreamFactory::Job is stuck in connect, the QuicChromiumClientSession that was created for it would have been destroyed in CloseAllSessions(), and the Job wouldn't hear back because the session is gone, together with the completion callback.
,
Mar 17 2017
Scratch that.. RecordAndCloseSessionOnError() does run the |callback_|.
,
Mar 17 2017
I think it might be a good idea to add a CHECK in QuicStreamFactory::Job::~Job() that callback_.is_null(), to ensure that the callback was actually run. Of course, we don't know if the job in active_jobs_ is alive (in which case this CHECK would not help), or if the JobSet is empty (in which case it would not help either), or if the job is dead (in which case it would). But perhaps this would help narrow things down.
,
Mar 17 2017
I think it might be worth adding a DCHECK in QuicChromiumClientSession's destructor to make sure callback_.is_null() (and additionally output what state the session is in). It might be that the session got destroyed but never notified the QuicStreamFactory::Job.
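Putting the two suggestions together, the assertions would look roughly like this (a sketch; the exact placement and message text are illustrative):

// Proposed assertion in the Job destructor:
QuicStreamFactory::Job::~Job() {
  // The Job must have been notified (callback run) before destruction;
  // a live callback here means someone dropped the Job on the floor.
  CHECK(callback_.is_null());
}

// Proposed assertion in the session destructor:
QuicChromiumClientSession::~QuicChromiumClientSession() {
  // The session must not die while CryptoConnect's caller still waits.
  DCHECK(callback_.is_null()) << "destroyed with pending connect callback";
}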
,
Mar 17 2017
That sounds good too.
,
Mar 17 2017
Had an offline discussion with xunjieli@; summarizing what we have observed in tracing and NetLogs. The tracing indicates there are 15 sessions in the QuicStreamFactory::all_sessions_ map. The NetLog shows 14 sessions, as 14 QUIC_SESSION events are present. The missing one is inbox.google.com: it's been created but not activated (active_sessions_ doesn't contain it, which is why HttpStreamFactoryImpl::Job only finds the active job), and it has had no activity for a while. That said, the QuicChromiumClientSession is not deleted yet. Adding the CHECKs will ensure the callbacks are run, though that might not help in this case.
,
Mar 17 2017
Can you tell if the session to inbox has never been activated, or if it could have been activated and then received a GOAWAY?
,
Mar 19 2017
That I can't answer with the current log, since logging for the first QUIC_STREAM_FACTORY_JOB is not available :( And if Issue 702909 is confirmed, that indicates we can reach a *state A* where, for a specific server, the session is in all_sessions_ but not in active_sessions_, while a Job remains in active_jobs_. 1. OnGoAway (assuming we can be in *state A* when receiving a GOAWAY) could lead us to delete the active session but not the job. 2. Connection migration? 3. Clock skew, which removes the active session but not the active job. I'm adding more logs to debug: whether we received a GOAWAY, whether we notified the factory of going away, and whether we notified the factory the session closed. I tried with YouTube open but wasn't able to get a repro. Since QUIC currently is not fully back on, we might need to wait until Monday.
,
Mar 20 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/de52d1630afa30d255f1e773d350c758056fd2ed

commit de52d1630afa30d255f1e773d350c758056fd2ed
Author: zhongyi <zhongyi@chromium.org>
Date: Mon Mar 20 23:17:05 2017

Add histograms to track the number of job controllers and jobs that are not cleaned up when HttpStreamFactoryImpl is destructed.

BUG=700617
Review-Url: https://codereview.chromium.org/2752193003
Cr-Commit-Position: refs/heads/master@{#458228}

[modify] https://crrev.com/de52d1630afa30d255f1e773d350c758056fd2ed/net/http/http_stream_factory_impl.cc
[modify] https://crrev.com/de52d1630afa30d255f1e773d350c758056fd2ed/tools/metrics/histograms/histograms.xml
,
Mar 21 2017
I think we can now confirm that the QUIC job hangs, which causes all the dependent alt jobs to hang and never finish. And we know why this is happening.
* Why are a lot of alt jobs not finishing?
HttpStreamFactoryImpl::Job (the alt job) always tries to find an active QuicChromiumClientSession if one exists. Otherwise, it checks whether there is an active QuicStreamFactory::Job serving the same server_id and binds itself to that Job. If neither of these happens, it eventually creates a new QuicStreamFactory::Job to serve the server_id, and the new Job is added to the active_jobs_ map until it finishes. In the cases where we see a lot of alt jobs not finishing, they are all bound to QuicStreamFactory::Jobs that hang.
* Why does QuicStreamFactory::Job hang?
Those Jobs were found hanging in the CONNECT state during the investigation. The Job waits on QuicChromiumClientSession::CryptoConnect and passes a callback to be resumed once the session finishes its work. But the callback is not stored in callback_ until crypto_stream_->CryptoConnect() returns, so the session can be closed during the crypto connect via QuicChromiumClientSession::OnConnectionClosed() without notifying the Job, because callback_ is still null. [In this case, the network disconnected, triggering a PACKET_WRITE_ERROR which caused us to close the connection and session.] We don't check whether the connection is still alive when crypto_stream_->CryptoConnect() returns; we just return ERR_IO_PENDING to the Job. Thus the Job waits for the session to run the callback, while the session has been deleted and will never run it.
int QuicChromiumClientSession::CryptoConnect(
    const CompletionCallback& callback) {
  connect_timing_.connect_start = base::TimeTicks::Now();
  RecordHandshakeState(STATE_STARTED);
  DCHECK(flow_controller());
  // callback_ is still null at this point.
  crypto_stream_->CryptoConnect();
  // ^ Writes packets, gets PACKET_WRITE_ERROR, closes the session.
  //   When it returns, the session is in goaway state and will be
  //   deleted soon.
  if (IsCryptoHandshakeConfirmed()) {
    connect_timing_.connect_end = base::TimeTicks::Now();
    return OK;
  }
  // Unless we require handshake confirmation, activate the session if
  // we have established initial encryption.
  if (!require_confirmation_ && IsEncryptionEstablished())
    return OK;
  callback_ = callback;
  return ERR_IO_PENDING;  // Returned to the Job, even though the
                          // session is about to be deleted.
}

void QuicChromiumClientSession::OnConnectionClosed() {
  // ...
  NotifyFactoryOfSessionGoingAway();
  // ...
  if (!callback_.is_null()) {  // Not taken: callback_ is still null.
    base::ResetAndReturn(&callback_).Run(ERR_QUIC_PROTOCOL_ERROR);
  }
  // ...
  NotifyFactoryOfSessionClosedLater();  // Will delete the session soon.
}

QuicChromiumClientSession::~QuicChromiumClientSession() {
  // ...
  // Never checks or runs callback_.
  // ...
}
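Given the analysis above, the natural direction for a fix is to check the connection state as soon as crypto_stream_->CryptoConnect() returns. A sketch of the idea (simplified; per the commit message later in this bug, the CL that landed marks the handshake as failed in exactly this situation):

// Inside QuicChromiumClientSession::CryptoConnect(), after the call:
crypto_stream_->CryptoConnect();
if (!connection()->connected()) {
  // The write error closed the connection synchronously inside
  // CryptoConnect(); surface a failure instead of ERR_IO_PENDING so
  // the QuicStreamFactory::Job hears back and can clean up.
  return ERR_QUIC_HANDSHAKE_FAILED;
}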
,
Mar 21 2017
Awesome work! Three requests:

1) Before we merge fixes to M-57/58/59, can we figure out the typical impact of the leak? I want to know if it is likely responsible for several multi-GB browser leaks that I've observed [on my machine and others] but have not been able to track down.

2) It seems like we were already emitting a memory-infra stat [see c#3] that should have made it obvious that we were leaking jobs, but we weren't tracking it anywhere. Can we emit UMA stats for this and similar metrics that we can track with chirps?

3) Can we write some type of end-to-end test system that makes us more likely to catch this type of error in the lab? Maybe we need to restructure the code to make fuzzing easier, or perhaps we just need end-to-end tests that let us exercise interesting edge cases.
,
Mar 21 2017
erikchen: Interestingly, as far as we can tell, this bug has been present for years. Is it surprising that this was not previously caught by memory profiling? As I read the bug report, I think this was discovered by recent tools/process. If so, it's not shocking that we didn't notice this before. Am I understanding this correctly? If not, I wonder why we didn't notice this before :/
,
Mar 21 2017
> Is it surprising that this was not previously caught by memory profiling? No. AFAICT, no one has used memory profiling on Desktop since ssid did his Linux analysis a couple of months ago. [It hasn't even worked on macOS/Windows until very recently].
,
Mar 21 2017
Great! That solves one mystery :)
,
Mar 22 2017
Excellent work, zhongyi@! I assume that you've been able to repro this with logging showing this happens?
,
Mar 22 2017
Re #40:

1) Before we merge fixes to M-57/58/59, can we figure out the typical impact of the leak?
I thought ssid@ did similar work when measuring crbug.com/678768; he might have better ideas. Isn't the UMA histogram Memory.Browser.Large2 tracking the difference between the Control and Enabled groups? Once the fix is checked in, presumably we could compare data in those histograms.

2) Can we emit UMA stats for this and similar metrics that we can track with chirps?
I added histograms in comment 38 which log the number of uncleaned JobControllers and Jobs when HttpStreamFactoryImpl is destructed. I will register chirp alerts for those histograms.

3) Maybe we need to restructure the code to make fuzzing easier, or perhaps we just need end-to-end tests that allow us to exercise interesting edge cases.
The way this bug is triggered is really tricky; I feel we have to know the edge case in order to cover it in e2e tests. Fuzzing sounds like a plan, though I don't know how much effort it would require, as it's a really big stack.

Re #45: Yup. The easiest repro I know is to have inbox.google.com open in one tab and close your Mac's lid, which hopefully triggers a PACKET_WRITE_ERROR when we attempt to send the CHLO to the server; resume your laptop and you might be able to catch it in net-internals.
,
Mar 22 2017
Using the same analysis as ssid@'s postmortem (https://docs.google.com/document/d/1DPM0iZTo9I331a3KYs5VznOXD276OjOxeY7m5lYMqUY/edit#): Googlers on OS X show a 26.7% regression at the 95th percentile of Memory.Browser.Large2 with QUIC enabled: https://uma.googleplex.com/p/chrome/variations/?sid=149f63017031c255aabcd8d20a3ca559 All users on OS X show a 5.2% regression at the 95th percentile with QUIC enabled: https://uma.googleplex.com/p/chrome/variations/?sid=149f63017031c255aabcd8d20a3ca559

Another reason to spend time understanding the impact of this particular bug/fix is to avoid the issue we ran into with ssid's postmortem, where we assumed that the bug we fixed was responsible for the regression we were seeing. Given the impact of this bug [multiple years of a severe memory leak], should we write a postmortem?
,
Mar 22 2017
Re #47: The problem with Browser.Memory.Large2, or any other general UMA metric, is that we don't really know whether a diff is expected or whether something fishy is going on. ssid@'s work and the net/ MemoryDumpProvider helped narrow down the cause of this leak quickly (see comment #3). The net/ MDP was whitelisted only recently. I saw one trace from Android with leaked alternative jobs, but that only happened once I started looking. There's definitely room for improvement in tooling and reporting infrastructure. As a first line of defense, maybe more code coverage in unit tests could help with issues like this? zhongyi@: what do you think?
,
Mar 22 2017
Re #47: Having said that, I think we will benefit from writing a postmortem so we know where the gaps are. I talked to zhongyi@ offline; we will draft one and send it out.
,
Mar 22 2017
Is total leaked memory linearly proportional to number of leaked jobs? If so, the new UMA stat should give us an idea of the impact. The problem is that the leak grows over time, and people who pick up the new UMA stat are getting a new binary [and restarting chrome], so we'll see a bias towards lower leaks.
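As a back-of-envelope check using numbers already in this bug: the original description attributes ~85MB and ~300k allocations to the CreateJob stack, and an earlier comment counted roughly 40k leaked alt jobs plus 40k job controllers, i.e. on the order of 2KB and a handful of allocations per leaked job from that stack alone. If the HttpCache::Transaction stack (~90MB for a similar allocation count) scales the same way, each leaked request pins roughly 4-5KB in total, which suggests leaked memory is indeed roughly linear in the leaked-job count.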
,
Mar 22 2017
And once zhongyi's CL lands [and fixes the leak], we will no longer have any idea what the impact of the leak is.
,
Mar 23 2017
If the leak is fixed, we should see improvement in the Memory.Browser.Large2 histogram, right? By that I mean the links posted in #47 should no longer show a difference at the 95th percentile. Also, this seems to affect mainly Googlers per the histograms; I'd say the 5% at the 95th percentile could be noise, or QUIC just uses different data types which use slightly more memory. About the bias towards lower leaks: we should compare versions from the same channel after a sufficiently long time. Comparing two Dev releases would show users with a week of uptime. But yes, that would not show the impact on the stable channel, and we should wait some weeks after the stable release. Also, if we see the difference at the 95th percentile go down for the Dev channel, we can be fairly confident that the difference on the stable channel also goes down. The new UMA wouldn't help much because it would have reached just the Dev channel by now, and we again get only a week-long session before the fix lands and the browser gets updated. Re #49: Yes, I think we should look at more traces from slow-reports. FYI, there are traces from desktop platforms as well; I have never looked at them.
,
Mar 23 2017
ssid: That's not the right way to measure the impact of the leak.

> I'd say the 5% at 95th percentile would be noise

The 5% increase at the 95th percentile has a p-value of 0.0001. It is *highly unlikely* for this to be noise. https://uma.googleplex.com/p/chrome/variations/?sid=149f63017031c255aabcd8d20a3ca559

> QUIC just uses different data types, which uses slightly more memory

If QUIC intentionally causes Chrome to use 5% more memory at the 95th percentile, then we need to re-evaluate the implementation for bugs.

Just looking at before/after for a UMA metric introduces many potential sources of noise. I will point out that the last time this analysis was done, we *failed to realize* that there was another, potentially very significant memory leak. There are two thorough ways to measure impact:
1) An A/B experiment of the fix. Given that the problem has been present for years, rolling out the solution to half the population first isn't going to be the end of the world.
2) An explicit UMA metric that measures the number of alternative jobs that would have leaked if the solution were not present. This is harder to get right, but doesn't delay rolling out the solution.
,
Mar 23 2017
If the goal is to avoid such leaks in the future, then shouldn't a UMA of just the number of jobs alive help here? We would be able to see how many jobs are kept around on devices; if this number goes to 100k, like in this case, then it's definitely a leak. If the goal is to measure the impact of the fix in terms of memory reduction, then we could look at memory UMA between two Chrome versions / the QUIC experiment rather than a new experiment. Having an explicit UMA that measures memory that might have been leaked sounds good, but that might be expensive; also, IIUC, just the number of jobs might not be representative of the memory usage. IMHO, rolling out an experiment for the fix seems overkill just to measure the impact of the fixed leak, and it is not going to help with future leaks. But if we are going for a totally noiseless measurement of memory impact, then this seems to be a good idea. Mind that there would still be noise because we are looking at total memory instead of net memory.
,
Mar 23 2017
> we could look at memory UMA between 2 Chrome versions / QUIC experiment This makes several assumptions: 1) All changes to the UMA metric are caused by the CL you're trying to measure. Let's say that you see a 10MB regression. Your CL actually only fixes a 5MB regression, and another CL fixes a different 5MB regression. You'll think that your 10MB regression is "fixed", even though you only fixed by 5MB. 2) Canary/Dev UMA metrics are representative of Beta/Stable. In your last post-mortem: https://docs.google.com/document/d/1DPM0iZTo9I331a3KYs5VznOXD276OjOxeY7m5lYMqUY/edit# You did exactly as you suggested and showed some graphs that seemed to imply that on Linux, pre-fix, QUIC used ~double the memory at the 95th percentile, and post-fix, QUIC used ~half the memory. This is really suspicious, and the huge swing should tell you that the data isn't particularly reliable. When I looked at the Linux stable channel [28 day aggregation], I see that QUIC users *still* use 18% more memory. https://docs.google.com/document/d/1QS2Wv_dCpaII-ApmX-tNNrOvKz4uA1Q3XCfVb1dkbUw/edit?ts=58d421d9# I spoke with jam@, who convinced me that it wasn't worthwhile to spend extra effort to measure the impact of the leak/fix. Fine. But we shouldn't substitute inaccurate information in its place. Attempting to attribute changes to UMA stats to specific CLs without an A/B experiment is fraught with peril.
,
Mar 23 2017
For some reason, bugdroid is not working. The bugfix CL just landed in chromium. Manually adding it here.

commit 363c91c57a5ae99e63a138555afa83dc02262325 [log] [tgz]
author zhongyi <zhongyi@chromium.org> Thu Mar 23 23:16:08 2017
committer Commit bot <commit-bot@chromium.org> Thu Mar 23 23:16:08 2017
tree 77ceaf75bf379afdfd37bd7a53b2df8cce089c83
parent 5e1457c6a247571aa25c76364693863b7bdd4b7f [diff]

QUIC: mark the QUIC handshake as failed if the connection is closed after CryptoConnect, so that QuicStreamFactory::Job will not hang.

Currently, if the QuicConnection is closed during crypto_stream_->CryptoConnect(), it will silently post a task to notify QuicStreamFactory of the session close without notifying the QuicStreamFactory::Job. This causes QuicStreamFactory::Job to hang. Subsequent alt jobs will be bound to the hanging QUIC job, will never win the race over TCP, and will never get cleaned up.

BUG=700617
Review-Url: https://codereview.chromium.org/2766603004
Cr-Commit-Position: refs/heads/master@{#459271}
,
Mar 24 2017
Your change meets the bar and is auto-approved for M58. Please go ahead and merge the CL to branch 3029 manually. Please contact milestone owner if you have questions. Owners: amineer@(Android), cmasso@(iOS), bhthompson@(ChromeOS), govind@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Mar 24 2017
The bugfix 363c91c57a5ae99e63a138555afa83dc02262325 currently is in Canary. Unless there's any objection, I will merge it to M58 on Monday.
,
Mar 27 2017
Seems like we didn't see anything suspicious during the past two days: no new crashes due to the bugfix. I'll start the merge to M58 now.
,
Mar 27 2017
Has anyone confirmed that this change fixes the problem erikchen@ reported in this bug?
,
Mar 27 2017
Re #67: zhongyi@'s CL fixed one issue. We think it's likely the one that erikchen@ saw. Whether that will fix all the issues is to be verified. Issue 704956 is one example that we need to follow up on.
,
Mar 27 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0a4d3c147a0f52493a850051008e1fe57c125d30

commit 0a4d3c147a0f52493a850051008e1fe57c125d30
Author: Zhongyi Shi <zhongyi@chromium.org>
Date: Mon Mar 27 18:57:45 2017

QUIC: mark the QUIC handshake as failed if the connection is closed after CryptoConnect, so that QuicStreamFactory::Job will not hang.

Currently, if the QuicConnection is closed during crypto_stream_->CryptoConnect(), it will silently post a task to notify QuicStreamFactory of the session close without notifying the QuicStreamFactory::Job. This causes QuicStreamFactory::Job to hang. Subsequent alt jobs will be bound to the hanging QUIC job, will never win the race over TCP, and will never get cleaned up.

BUG=700617
Review-Url: https://codereview.chromium.org/2766603004
Cr-Commit-Position: refs/heads/master@{#459271}
(cherry picked from commit 363c91c57a5ae99e63a138555afa83dc02262325)
R=rch@chromium.org
Review-Url: https://codereview.chromium.org/2780523002 .
Cr-Commit-Position: refs/branch-heads/3029@{#435}
Cr-Branched-From: 939b32ee5ba05c396eef3fd992822fcca9a2e262-refs/heads/master@{#454471}

[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/chromium/mock_crypto_client_stream_factory.cc
[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/chromium/quic_chromium_client_session.cc
[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/chromium/quic_network_transaction_unittest.cc
[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/chromium/quic_stream_factory_test.cc
[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/test_tools/mock_crypto_client_stream.cc
[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/test_tools/mock_crypto_client_stream.h
[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/test_tools/quic_stream_factory_peer.cc
[modify] https://crrev.com/0a4d3c147a0f52493a850051008e1fe57c125d30/net/quic/test_tools/quic_stream_factory_peer.h
,
Apr 5 2017
A friendly reminder that M58 Stable launch is coming soon! Your bug is labelled as Stable ReleaseBlock, please make sure to land the fix, verified in trunk and get it merged into the release branch ASAP.
,
Apr 5 2017
- ReleaseBlock-Stable The main CL has been merged into M58. Leaving this bug open to do a couple of follow-ups but those shouldn't block the stable release.
,
Apr 25 2017
Marking this bug as fixed. We have had local Chrome instances running for a long while, and I haven't noticed any similar leaks so far. The bugfix (in 58.0.3029.40) has rolled out to stable.
Comment 1 by erikc...@chromium.org, Mar 11 2017

There are ~20k live malloc objects that stem directly from DefaultJobFactory::CreateJob, not including the constructor for Job.

"""
return new HttpStreamFactoryImpl::Job(
    delegate, job_type, session, request_info, priority,
    server_ssl_config, proxy_ssl_config, destination, origin_url,
    net_log);
"""

https://cs.chromium.org/chromium/src/net/http/http_stream_factory_impl.cc?q=defaultjobfactory::create+package:%5Echromium$&l=54

This suggests that there are currently ~20k live Job objects. :(