Defer allocation in QuicStreamSequencerBuffer |
|||||||||||
Issue descriptionAs part of identifying redundant allocations in Chrome, I found that QuicStreamSequencerBuffer in some cases allocates 120 KiB of zero memory (15 * 8 KiB), see https://docs.google.com/spreadsheets/d/1HfHKXy4q-ytZ-nwPay8YYyVSn1o-X9I7cWzYlygvtwQ/edit#gid=71377754 This is how it happens: 1. QuicStreamSequencer::QuicStreamSequencer() constructs QuicStreamSequencerBuffer (buffered_frames_) with kStreamReceiveWindowLimit, which is 16 MiB 2. QuicStreamSequencerBuffer::QuicStreamSequencerBuffer() constructs std::vector<BufferBlock*> (blocks_) with 2K elements (16 MiB / kBlockSizeBytes, which is 8 KiB) 3. vector<> allocates 8 KiB bytes (4 * 2K) 4. blocks_ is never written to, remaining just a blob of zero memory I'm proposing resizing blocks_ later, when we're writing its elements in QuicStreamSequencerBuffer::OnStreamData(). Something along the lines of the attached diff. (More info about the effort to kill redundant allocations is here: https://groups.google.com/a/chromium.org/forum/#!msg/project-trim/EiKXwnbTvbo/0z-Y5Kg1CgAJ)
,
Aug 5 2016
,
Aug 5 2016
,
Aug 5 2016
Hi, this is a reasonable improvement. But here I'm still not clear what kind of issue we are trying to solve. What's the definition of redundant/duplicated allocations? The suggestion seems only defers the allocation of std::vector<BufferBlock*> (blocks_) for a little while. > 4. blocks_ is never written to, remaining just a blob of zero memory Are you saying the memory allocated is never used? I'm surprised about it because, as far as I know, a quic stream should always have something to receive sooner or later. (I might miss some corner case regarding this point). If this suggestion is only targeting to allocate memory as late as possible, I totally agree with it. If it's targeting for preventing unused memory allocation/deallocation, I suspect there are other issues about unnecessary quic stream creation.
,
Aug 5 2016
,
Aug 5 2016
Those allocations surfaced under hash value 0, which is a special hash value meaning that the whole allocated region was zeroes at the moment when we looked at it (30s after starting Chrome and opening theverge.com). So you are right, and that might be because those QuicStreamSequencerBuffer objects were Clear()ed, and not because they were unused. I'll verify that. If that's the case, the fix is to kill blocks_ in Clear(), and re-initialize them on demand in OnStreamData(). But since constructor also calls Clear(), there is no point in initializing blocks_ in the constructor at all. I.e. those two changes (deferred allocation + deallocation in Clear) should be done at once.
,
Aug 5 2016
>30s after starting Chrome and opening theverge.com A stream should have been closed and thus QuicStreamSequencerBuffer destructed by then. Because std::vector<BufferBlock*> (blocks_) live through out the life time of a stream. And a quic stream is per request/response. blocks_ can be zero memory since creating a stream till receiving a response. And between all received response read out and more to come in, which should be momentary.
,
Aug 5 2016
That is strange, some classes are definitely alive after 30s. The number of them is not constant, though. During my testing I've seen as many as 26 and as few as 8. Maybe TheVerge downloads stuff in the background? (Let me revert my change below and see how many QuicStreamSequencerBuffer instances remain if I set timeout to 300s.) Anyway, I managed to get rid of blocks_, but I had to introduce allocated_blocks_count_ to keep number of valid blocks, and once that number drops to zero in RetireBlock(), I kill blocks_. The diff is attached. Note that I changed std::vector<> to just std::unique_ptr<>, because (1) the code doesn't actually use vector API (beyond operator[], which unique_ptr also provides) and (2) std::vector<> is hard to kill - shrink_to_fit() doesn't guarantee deallocation.
,
Aug 5 2016
,
Aug 5 2016
,
Aug 5 2016
,
Aug 5 2016
Sorry, for some reason it keeps taking someone off when I add people. Apologies for spam.
,
Aug 6 2016
Yup, don't see QuicStreamSequencerBuffer instances after 300s.
,
Aug 9 2016
> some classes are definitely alive after 30s. The number of them is not constant, though. During my testing I've seen as many as 26 and as few as 8. Maybe TheVerge downloads stuff in the background? I kind of know why. In the spreadsheet you shared most sequencer buffer objects alive after 30s belong to crypto stream or headers stream. These two streams are supposed to be there for the whole life time of a quic connection because there might be more crypto messages and request/response headers to be transferred. But I don't understand why there is also one QuicChromiumClientStream alive after 30s. If Chrome is still downloading, buffer shouldn't be empty for long time. If, it has finished, stream should be closed and destructed together with QuicStreamSequencerBuffer. That's saying as long as a quic stream starts receiving response, buffers_ shouldn't be empty for long time. Once all response arrives QuicChromiumClientStream should be closed and buffer be destructed. Could it be possible that streams are not closed after receiving a full response? The long living crypto/headers stream might also because of this kinda long living data stream. How often do we see data stream is alive after 30s? > but I had to introduce allocated_blocks_count_ to keep number of valid blocks, and once that number drops to zero in RetireBlock(), I kill blocks_. Doing this sounds like a good improvement on cryto/headers stream but might be too costly for data stream. If data just arrive in sequence (the buffer will be filled from beginning), every read might empty the buffer(if not read blocked), it would cause a lot of 8k memory alloc/dealloc. One of the goals of introducing QuicStreamSequencerBuffer is to reducing the frequency of memory alloc/dealloc. If the buffer is empty while we know more data will come soon, why do we release the memory? Deferring blocks_ allocation during construction till OnStreamData() is called can save the memory for 1 rtt which is definitely a win.
,
Aug 9 2016
,
Aug 16 2016
I talked with alyssar@ about this issue offline. Below is the summary: 1. the mysterious data stream still alive after 30s could be hanging GET. Some applications keeps a stream open throughout the lifetime intentionally. So it is not a surprise. 2. lazy allocation of blocks_ can benefit both client side and server side. Actually, it helps server side more in case of GET request. We can do this as first step of improvement. 3. early release of blocks_ when a stream is read closed should also help. but neither lazy allocation nor early release of blocks_ can solve the problem introduced by long standing crypto/headers stream. Questions for rch@: Can we tell when a crypto stream is not going to be used for a while? How complex is it to make change to release the buffer of crypto stream? dskiba@, can you give more explanation about the purpose of getting rid of 0 memory after certain time period? Is it for preventing unused alloc or recycling memory in time?
,
Aug 16 2016
Thanks for the investigation, Dan! BTW, you are correct that freeing blocks_ in RetireBlock() and then allocating them again on demand increases amount of recycled memory. However note that block itself is 8 KiB (i.e. same size as blocks_), so we're already recycling 8 KiB buffers. But of course adding 8 KiB to that doesn't help. The purpose of this experiment of finding redundant memory was to avoid allocating memory that's not used. We do have some buffers that are allocated, but not touched at all, staying zero (see for example issue 635015 ). However, it seems that QuicStreamSequencerBuffer is more complicated, since it does use blocks_ from time to time. Another interesting question is how common are buffers that use just one block? We could optimize for that case by always keeping a block in a class, but allocating more blocks and block_ vector on demand. I.e. we would go from 16 KiB (blocks_ + 1 block) to 8 KiB ("zero" block, empty blocks_) in that special case.
,
Aug 16 2016
"Questions for rch@: Can we tell when a crypto stream is not going to be used for a while? How complex is it to make change to release the buffer of crypto stream?" The server may send a "Server Config Update" message at any time, alas. (And in the future, the server might also be able to send a key update). That being said, these are not *super* frequent, so when a SHLO message is received (which completes the handshake) we could easily call FreeUpBuffers() at that point.
,
Sep 9 2016
Dan, do you have any progress on this issue? I'm happy to help and can try a fix if you have one. #18 looks especially promising.
,
Sep 9 2016
I'm working on lazy allocation and early release of blocks_ mentioned in #16 as first step. Internal CL has been submitted, will merge it to chromium soon. After that I will try #18.
,
Sep 12 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2f64ff9a9ff6d76a2e4b755d3e01b79936759b6f commit 2f64ff9a9ff6d76a2e4b755d3e01b79936759b6f Author: danzh <danzh@chromium.org> Date: Mon Sep 12 15:34:15 2016 lazy allocation and early release memory in QuicStreamSequencerBuffer. Protected by --quic_reduce_sequencer_buffer_memory_life_time. Postpone the instantiation of the underlying buffer in QuicStreamSequencerBuffer till receiving first data frame, this will make GET request cheaper. And release the memory as soon as the stream is read closed. Merge internal change: 131985218 and 132338729 R=rch@chromium.org BUG= 635041 Review-Url: https://codereview.chromium.org/2328633004 Cr-Commit-Position: refs/heads/master@{#417947} [modify] https://crrev.com/2f64ff9a9ff6d76a2e4b755d3e01b79936759b6f/net/quic/core/quic_flags_list.h [modify] https://crrev.com/2f64ff9a9ff6d76a2e4b755d3e01b79936759b6f/net/quic/core/quic_stream_sequencer.cc [modify] https://crrev.com/2f64ff9a9ff6d76a2e4b755d3e01b79936759b6f/net/quic/core/quic_stream_sequencer.h [modify] https://crrev.com/2f64ff9a9ff6d76a2e4b755d3e01b79936759b6f/net/quic/core/quic_stream_sequencer_buffer.cc [modify] https://crrev.com/2f64ff9a9ff6d76a2e4b755d3e01b79936759b6f/net/quic/core/quic_stream_sequencer_buffer.h [modify] https://crrev.com/2f64ff9a9ff6d76a2e4b755d3e01b79936759b6f/net/quic/core/quic_stream_sequencer_buffer_test.cc [modify] https://crrev.com/2f64ff9a9ff6d76a2e4b755d3e01b79936759b6f/net/quic/core/reliable_quic_stream.cc
,
Sep 19 2016
Dan, is this bug fixed now? Feel free to mark this bug fixed if it is.
,
Sep 19 2016
Not yet. see #20
,
Sep 19 2016
BTW, I've measured duplicate memory again, and I still see 100-160 KiB of zero memory coming from net::QuicStreamSequencer::OnStreamFrame(), where we now allocate blocks_ on demand.
So I've added a histogram on how many blocks were used during QuicStreamSequencerBuffer lifetime (see attached diff). Here is my anecdata from casual browsing:
Histogram: A-QuicStreamSequencerBuffer.MaxUsedBlocks recorded 552 samples, average = 1.3 (flags = 0x40)
0 -------------------O (102 = 18.5%)
1 ------------------------------------------------------------------------O (392 = 71.0%) {18.5%}
2 ---O (18 = 3.3%) {89.5%}
3 --O (9 = 1.6%) {92.8%}
4 -O (6 = 1.1%) {94.4%}
5 -O (4 = 0.7%) {95.5%}
6 --O (9 = 1.6%) {96.2%}
7 O (2 = 0.4%) {97.8%}
8 -O (3 = 0.5%) {98.2%}
9 O (1 = 0.2%) {98.7%}
10 -O (3 = 0.5%) {98.9%}
11 O (1 = 0.2%) {99.5%}
12 ...
36 O (1 = 0.2%) {99.6%}
38 ...
45 O (1 = 0.2%) {99.8%}
48 ...
So maybe we can resize blocks_ on demand, as more blocks are needed?
,
Sep 20 2016
Thanks for the histogram! Seems that most resources are less than 2*8kB. But resizing blocks_ is a bummer to me, because resizing a ring buffer is more complicated and error-prone. And unless the buffer can shrink, which is impossible for ring buffer, there will still be 0-allocation when the stream is hanging there. The fix I checked in won't improve much for chrome because it only release blocks_ when stream is read closed. As we discovered, most of the 0-allocated memory are from long-staying crypto and headers stream. I expect that the few hanging GET will no longer cause 0-allocated memory. But headers/crypto stream are still going to hold the memory there. Can we have histograms for header and crypto streams along?
,
Sep 20 2016
Hmm, I'm not sure I understand where ring buffer is involved. I'm proposing that instead of lazily allocating max_blocks_count_ blocks at once we allocate min(max_blocks_count_, write_block_num * kGrowFactor) blocks if write_block_num >= blocks_count_. I drafted an implementation in an attached diff. I used grow factor of 16, which means that for 2K blocks we will reallocate at most 3 times (0->16, 16->256, 256->2000).
,
Sep 29 2016
Dan, did you have a chance to look at the diff from #26? What do you think?
,
Sep 30 2016
Hi dskiba, thanks for your prototype. My concern about that approach is that there will still be some 0-allocated memory at the end because blocks_ doesn't shrink after growth. Based on your histogram, 90% of the streams only use first 3 memory blocks which means it will leave a 16 Byte instead of 2KB empty array. But still the memory will be there for the whole life time of connection. Would this 16 Bytes pass your radar or you need more optimization? Meanwhile I have another approach to make cyrpto/headers to release the memory when they are less frequently used. I will soon merge a change from internal code base to release crypto stream's memory after handshake confirmed(see #20). This can ensure no staying crypto stream will hold the memory. For headers stream, my proposal is releasing the memory when no data streams remains for that connection, which I can start as next step if it sounds good to you. After these two changes, all streams won't hold any empty memory if no data is being transferred. What do you think?
,
Sep 30 2016
Actually any zero memory below 1 KiB won't show up in my tool. I can tweak it to 512 bytes, but that's the reasonable minimum. So 16 byte zero arrays definitely won't show up. Thanks for your efforts on this! It looks like your approach is better, because it optimizes on a higher level. Once your changes are merged I'll run my test and report here.
,
Oct 3 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/43eb0ff7957852816782585938122a25b1684601 commit 43eb0ff7957852816782585938122a25b1684601 Author: danzh <danzh@chromium.org> Date: Mon Oct 03 15:30:34 2016 Release read buffer of QuicCryptoStream when incoming message become less frequent. Protected by --quic_release_crypto_stream_buffer. Release the buffer when handshake has been confirmed and complete handshake message has been read out and no further bytes are in buffer. Merge internal change: 134493398 and 134809175. BUG= 635041 Review-Url: https://codereview.chromium.org/2387753002 Cr-Commit-Position: refs/heads/master@{#422426} [modify] https://crrev.com/43eb0ff7957852816782585938122a25b1684601/net/net.gypi [modify] https://crrev.com/43eb0ff7957852816782585938122a25b1684601/net/quic/core/quic_crypto_client_stream_test.cc [modify] https://crrev.com/43eb0ff7957852816782585938122a25b1684601/net/quic/core/quic_crypto_stream.cc [modify] https://crrev.com/43eb0ff7957852816782585938122a25b1684601/net/quic/core/quic_flags_list.h [modify] https://crrev.com/43eb0ff7957852816782585938122a25b1684601/net/quic/core/quic_stream_sequencer.cc [modify] https://crrev.com/43eb0ff7957852816782585938122a25b1684601/net/quic/core/quic_stream_sequencer.h [modify] https://crrev.com/43eb0ff7957852816782585938122a25b1684601/net/quic/core/quic_stream_sequencer_buffer_test.cc [add] https://crrev.com/43eb0ff7957852816782585938122a25b1684601/net/quic/test_tools/quic_stream_sequencer_buffer_peer.cc [add] https://crrev.com/43eb0ff7957852816782585938122a25b1684601/net/quic/test_tools/quic_stream_sequencer_buffer_peer.h [modify] https://crrev.com/43eb0ff7957852816782585938122a25b1684601/net/quic/test_tools/quic_stream_sequencer_peer.cc [modify] https://crrev.com/43eb0ff7957852816782585938122a25b1684601/net/quic/test_tools/quic_stream_sequencer_peer.h [modify] https://crrev.com/43eb0ff7957852816782585938122a25b1684601/net/quic/test_tools/reliable_quic_stream_peer.cc [modify] https://crrev.com/43eb0ff7957852816782585938122a25b1684601/net/quic/test_tools/reliable_quic_stream_peer.h [modify] https://crrev.com/43eb0ff7957852816782585938122a25b1684601/net/tools/quic/end_to_end_test.cc
,
Oct 3 2016
dskiba@, please patch PS#4 for verification that no crypto stream should hold 0-allocated memory.
,
Oct 5 2016
Hurray! With PS#4 (i.e. changing the flag to true) I see ~90 KiB less zero memory coming from QuicStreamSequencer::OnStreamFrame().
,
Oct 25 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/34c70a32d562e8205bad739b94c28f1e4d387a15 commit 34c70a32d562e8205bad739b94c28f1e4d387a15 Author: danzh <danzh@chromium.org> Date: Tue Oct 25 22:38:22 2016 Make Quic client more memory efficient. Protected by --quic_headers_stream_release_sequencer_buffer. Release Quic client headers stream's sequencer buffer memory if no active request is going on. Merge internal change: 136071184 R=rch@chromium.org BUG= 635041 Review-Url: https://codereview.chromium.org/2446893003 Cr-Commit-Position: refs/heads/master@{#427514} [modify] https://crrev.com/34c70a32d562e8205bad739b94c28f1e4d387a15/net/quic/core/quic_client_session_base.cc [modify] https://crrev.com/34c70a32d562e8205bad739b94c28f1e4d387a15/net/quic/core/quic_client_session_base.h [modify] https://crrev.com/34c70a32d562e8205bad739b94c28f1e4d387a15/net/quic/core/quic_flags_list.h [modify] https://crrev.com/34c70a32d562e8205bad739b94c28f1e4d387a15/net/quic/core/quic_headers_stream.cc [modify] https://crrev.com/34c70a32d562e8205bad739b94c28f1e4d387a15/net/quic/core/quic_headers_stream.h [modify] https://crrev.com/34c70a32d562e8205bad739b94c28f1e4d387a15/net/quic/core/quic_spdy_session.cc [modify] https://crrev.com/34c70a32d562e8205bad739b94c28f1e4d387a15/net/quic/core/quic_spdy_session.h [modify] https://crrev.com/34c70a32d562e8205bad739b94c28f1e4d387a15/net/tools/quic/end_to_end_test.cc
,
Oct 26 2016
dskiba@, please patch https://codereview.chromium.org/2446893003/#ps40001 to verify no 0-allocated memory.
,
Oct 27 2016
Hi Dan! I did that in early October (see #32) - most zero memory wasn't there.
,
Oct 27 2016
Wow, the improvement surprised me! Before #32, I only improved crypto stream. I expected the 0-memory allocation issue could be mitigated to more or less than half. I misunderstood "~90 kB less" for there were still 90 kB such kind of memory. Anyway, thanks for helping verification! The following fix should solve this issue for headers stream too. You can verify there shouldn't be any 0-alloc memory with these two flags on: quic_headers_stream_release_sequencer_buffer, quic_reduce_sequencer_buffer_memory_life_time.
,
Nov 9 2016
It seems that fix #30 already reduce the 0-allocated memory enough to pass the bar. I'll mark this bug as fixed. Feel free to reopen it if this problem shows up again.
,
Nov 21 2016
Latest memory trace shows this among top allocations in QUIC. Talked to danzh@ offline. This flag is currently turned off. But it will be turned on by default in Chromium once it passes through the internal flag process. |
|||||||||||
►
Sign in to add a comment |
|||||||||||
Comment 1 by dskiba@chromium.org
, Aug 5 2016