cc::Rtree::nodes_ can waste ~550 KB per renderer due to the std::deque implementation
Issue description

Full context: "[project-trim] STL memory usage" [1]

TL;DR
- dskiba@ found that in some implementations std::deque allocates blocks with a granularity of 4 KB.
- cc::RTree::nodes_ is a deque.
- Experimental results (linked in [1]) show that on a real-world page (BBC in this case) we can have ~330 RTrees (see http://pasteboard.co/9FIapuFZ2.png). If they are not always full (w.r.t. the aforementioned 4 KB block size), they each end up wasting up to 4 KB. The experimental results above measured an overall 4% efficiency for std::deque in the renderer.

Observations:
- danakj suggests that std::vector is not suitable because growth implies a memcpy of the old buffer.
- jbroman also pointed out that pointers into nodes_ are handed out, so the container elements must not move. This is not guaranteed by std::vector (growing can remap the full buffer and invalidate all pointers) but is guaranteed by std::deque (modulo shrink_to_fit or similar).

[1] https://groups.google.com/a/chromium.org/d/msg/project-trim/rX9mshct51w/mHWOY-AiCgAJ
[2] https://docs.google.com/spreadsheets/d/1UMnyXevN_BQf21bGpbz4yNxCp7nucuksarkU-BICfig/edit#gid=512878209
Dec 14 2016
I think that declaring std::vector unsuitable because of the copying of elements during growth is an overreaction. If we are inserting items one at a time without first calling reserve(), then the number of item copies will always be a small multiple of the number of items - typically two to four at most - and the number of reallocations will be a small multiple of log(n). These are very modest and reasonable costs. Yes, std::vector results will be strictly better if we can reserve the correct size - in time, fragmentation, and memory wastage - but when the correct size is not known, std::vector will still *usually* be faster than the alternatives due to its memory access patterns. Of course std::vector is not appropriate if we need the elements to not move. And we should consider resizing vectors to fit after we have loaded them up, in order to avoid wasting memory.
Dec 14 2016
Well, for this specific problem I am not sure it is suitable to ask cc: "either you estimate beforehand the number of nodes you need in an RTree, or you will be doomed". If they under-estimate they will have no way back (because they handed out the pointers to the elements and now cannot amend them). So probably their reaction here would be: "alright, then we will always reserve(9999), because science has proven no rtree has more than 9999 nodes *". And with this we'd be back to the problem of over-reserving memory. I think for this specific problem (an append-only container where entries are guaranteed to be stable) std::deque would have been a good answer... if they had made the block size a template argument. But instead they made the 4k block choice for the whole world, and this choice plays badly in this situation (and no, I don't think it's something you can fix by passing your own allocator argument to the template).

* This is inspired by a true story: https://cs.chromium.org/chromium/src/third_party/WebKit/Source/core/frame/SmartClip.cpp?rcl=0&l=237 - I can give you all the context about this if you are curious :P
Dec 14 2016
I think in that case a simple std::list is best. RTree::Node is 224 bytes long on a 32-bit system, so the list's overhead of two pointers per element will be negligible. Actually, std::forward_list fits even better.
Dec 14 2016
Re #3: isn't forward_list going to cause a malloc() per node? I think this is part of the original problem RTree was meant to solve, and this seems to be very hot code.
Dec 14 2016
Using PartitionAlloc (which is in base now) might make using malloc per node much cheaper. We should consider using that in cc in general, it would probably help here. Could we use data structures from WTF and PA here?
Dec 14 2016
Yes, forward_list will allocate each node individually. Allocating many times will be slower than allocating once, but that might not be very noticeable given the complexity of RTree::BuildRecursive. BTW, SkRTree (which was the inspiration) uses an array (SkTDArray). I guess I can change std::deque to std::vector and check how that changes things.
Dec 14 2016
I'll add a perf test as a start to ensure that we keep performance on par with the current thing. I think a list would work fine here, although it might lose some cache coherence. It's not clear to me how much that benefits the current implementation though, hence the perf test.
Dec 14 2016
Err, I missed a bunch of comments, so ignore my statements about list, but perftest still incoming.
Dec 14 2016
If there is nothing similar already, we could probably make the approach in #1 a //base facility, like GrowableNonMoveableNoReallocBuffer<T, BLOCK_SIZE>. It's really 30-40 LOC.
Dec 14 2016
That also sounds similar to {blink,cc}::ContiguousContainer and cc::ListContainer, except that those are more general (and still do allocations with exponentially increasing sizes).
Dec 14 2016
> I think that declaring std::vector unsuitable because of the copying of
> elements during growth is an overreaction. If we are inserting items one at a
> time without first calling reserve() then the number of item copies will always
> be a small multiple of the number of items - typically two to four at most. And
> the number of reallocations will be a small multiple of log(n). These are very
> modest and reasonable costs.

Yeah, but at the scale of doing this every frame for hundreds of things, it can become a measurable fraction of the frame time. For instance, reserving memory instead of resizing the vector's buffer log(n) times reduced serialization time for CompositorFrames by about 1/3, and removing mallocs in the UI painting code for each view reduced painting time measurably each time I removed one. So I'm very careful about introducing many allocations on the compositor critical path. Here it's log(n) * the number of layers, for example. It sounds from vlad's performance tests that a list will work though, so that's good, and I'm glad we can save some memory too.
Dec 14 2016
Added parent issue 674287 about std::deque inefficiency.
Dec 14 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/0d69452813c95ee10e53b8e83eaceea23a2b1780

commit 0d69452813c95ee10e53b8e83eaceea23a2b1780
Author: vmpstr <vmpstr@chromium.org>
Date: Wed Dec 14 23:37:19 2016

cc: Add rtree perftests.

This patch adds some perftests to measure RTree performance. Sample run (on z620):

[ RUN ] RTreePerfTest.Construct
*RESULT rtree_construct: 100= 346091.53125 runs/s
*RESULT rtree_construct: 1000= 36854.61328125 runs/s
*RESULT rtree_construct: 10000= 3392.330322265625 runs/s
*RESULT rtree_construct: 100000= 369.79052734375 runs/s
[ OK ] RTreePerfTest.Construct (8040 ms)
[ RUN ] RTreePerfTest.Search
*RESULT rtree_search: 100= 2992395.5 runs/s
*RESULT rtree_search: 1000= 507624.9375 runs/s
*RESULT rtree_search: 10000= 32956.60546875 runs/s
*RESULT rtree_search: 100000= 3477.254638671875 runs/s
[ OK ] RTreePerfTest.Search (8020 ms)
[ PASSED ] 2 tests.

R=danakj@chromium.org, dskiba@chromium.org
BUG=674169
CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_trusty_blink_rel
Review-Url: https://codereview.chromium.org/2576793002
Cr-Commit-Position: refs/heads/master@{#438675}

[modify] https://crrev.com/0d69452813c95ee10e53b8e83eaceea23a2b1780/cc/BUILD.gn
[modify] https://crrev.com/0d69452813c95ee10e53b8e83eaceea23a2b1780/cc/base/DEPS
[add] https://crrev.com/0d69452813c95ee10e53b8e83eaceea23a2b1780/cc/base/rtree_perftest.cc
Dec 15 2016
Some new data. I added dummy list and vector in RTree, and inserted same amount of nodes into them:
RTree::Node* RTree::AllocateNodeAtLevel(int level) {
+ node_vector_.push_back(Node());
+ node_list_.push_back(Node());
nodes_.push_back(Node());
...
Memory figures for AllocateNodeAtLevel() came out to be:

                size       count
  ---------------------------------
  std::deque    713.4 KiB  362
  std::list     42.8 KiB   189
  std::vector   42.7 KiB   180
Since std::vector does a single allocation per instance, this means that there were 180 RTree instances, and most of them had a single node.
Dec 17 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/ba09803911f3a0dc8d8e3b87f3fb8529baf4888d

commit ba09803911f3a0dc8d8e3b87f3fb8529baf4888d
Author: vmpstr <vmpstr@chromium.org>
Date: Sat Dec 17 00:42:57 2016

cc: Change Nodes collection to use base::ContiguousContainer instead of deque.

This patch changes the underlying structure to be base::ContiguousContainer. This provides better memory behavior and impacts performance as demonstrated by the results below. This is an acceptable change in performance given the benefit of memory savings.

(N7v2) Before:
[ RUN ] RTreePerfTest.Construct
*RESULT rtree_construct: 100= 91842.1953125 runs/s
*RESULT rtree_construct: 1000= 9855.638671875 runs/s
*RESULT rtree_construct: 10000= 747.183837890625 runs/s
*RESULT rtree_construct: 100000= 63.701416015625 runs/s
[ OK ] RTreePerfTest.Construct (8245 ms)
[ RUN ] RTreePerfTest.Search
*RESULT rtree_search: 100= 363610 runs/s
*RESULT rtree_search: 1000= 97453.4921875 runs/s
*RESULT rtree_search: 10000= 13022.0244140625 runs/s
*RESULT rtree_search: 100000= 848.8214111328125 runs/s
[ OK ] RTreePerfTest.Search (8093 ms)

After:
[ RUN ] RTreePerfTest.Construct
*RESULT rtree_construct: 100= 74468.8828125 runs/s (-19%)
*RESULT rtree_construct: 1000= 12053.529296875 runs/s (+22%)
*RESULT rtree_construct: 10000= 716.7188720703125 runs/s (-4%)
*RESULT rtree_construct: 100000= 54.01991653442383 runs/s (-16%)
[ OK ] RTreePerfTest.Construct (8182 ms)
[ RUN ] RTreePerfTest.Search
*RESULT rtree_search: 100= 351755 runs/s
*RESULT rtree_search: 1000= 96155 runs/s
*RESULT rtree_search: 10000= 10232.65625 runs/s
*RESULT rtree_search: 100000= 863.5242309570312 runs/s
[ OK ] RTreePerfTest.Search (8176 ms)

Additionally this changes the union operation to be faster than gfx::Rect::Union, since there are some assumptions that we can make. Also, Union shows up as the hottest function on Linux profiles.

BUG=674169
R=danakj@chromium.org, dskiba@chromium.org
CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_trusty_blink_rel
Review-Url: https://codereview.chromium.org/2570393002
Cr-Commit-Position: refs/heads/master@{#439258}

[modify] https://crrev.com/ba09803911f3a0dc8d8e3b87f3fb8529baf4888d/cc/base/rtree.cc
[modify] https://crrev.com/ba09803911f3a0dc8d8e3b87f3fb8529baf4888d/cc/base/rtree.h
Dec 17 2016
Oops, sorry, I missed the change from std::list to ContiguousContainer. As I understand it, ContiguousContainer is even worse than std::deque here: we construct it with sizeof(Node), which is 4 + 11 * 5 * 4 = 224 bytes on a 32-bit system (this becomes max_object_size_). Then ContiguousContainerBase::Allocate() allocates kDefaultInitialBufferSize * max_object_size_ bytes, which is 32 * 224 = 7168. So we actually regressed almost 2x.
Dec 17 2016
As an alternative I can think of using std::vector, and converting Branch::subtree into Branch::subtree_index that would index into the vector. Hopefully one additional indirection in Search() won't matter much.
Dec 17 2016
Why aren't we exploring using WTF::Deque? Also, if ContiguousContainer and std::deque are always so bad for memory, should we ban adding new uses of them? The names certainly don't imply that they'd be bad for memory, so anyone not following these threads closely won't know to use them only with great caution.
Dec 17 2016
I was actually thinking about adding presubmit error to ban std::deque (and also std::queue / std::stack which use std::deque by default). Was surprised to find ContiguousContainer is also memory hungry. Actually, it's kinda OK with small items, say for 16-byte items it will allocate 512-byte buffer, while std::deque always allocates at least 4 KiB.
Dec 17 2016
ContiguousContainer has 2x growing behavior, but that could be made configurable if it's important. Fast growing helps with memory locality and number of allocations, and 2x growing is common for things like std::vector (unless you shrink to fit, which is seldom done). The 4 KiB lower bound in some implementations of std::deque might be an issue for small uses, though.
Dec 17 2016
> Why aren't we exploring using WTF::deque?

The front/back push/pop isn't the property this code was using from std::deque; it was using the fact that deque does not move its contents in memory, so pointers remain valid. I think this was mentioned on the email thread or another bug, but WTF::Deque does not have this behaviour: it moves its contents to new memory locations.

> Then ContiguousContainerBase::Allocate() allocates kDefaultInitialBufferSize * max_object_size_ bytes, which is 32 * 224 = 7168.

It would be possible to not use kDefaultInitialBufferSize by passing a smaller number to the alternate constructor.
Dec 19 2016
Which site is this example from? I have a patch that wastes at most 6 Node elements per RTree. From my understanding, you have a page with ~180 rtrees, most of which have one node... If this is such a common case, maybe we can add a codepath that bypasses the rtree altogether, since we can just do the intersection check on 11 elements directly instead of sticking them into an rtree that ends up as a flat list of 11 elements anyway.
Dec 19 2016
From #16, "Since std::vector does a single allocation per instance" what do you mean by this statement? A vector would reallocate if size == capacity and a push_back is requested, no?
Dec 20 2016
From the screenshot in #0 it looks like it was the BBC homepage.

> From #16, "Since std::vector does a single allocation per instance" what do you mean by this statement? A vector would reallocate if size == capacity and a push_back is requested, no?

I think what he meant is: ironically, for the special case of an RTree having exactly one node (the special case he was analysing in #16), a vector is even better, because a single heap allocation initializes the vector itself and provides enough storage for one element. But modulo this, as we extensively discussed, a vector wouldn't be appropriate here because of the greedy 2x expansion factor (IIRC) and, more importantly, the lack of the non-moveability guarantee, which is the only thing that RTree really needs here.
Dec 20 2016
@danakj, ah, I hadn't seen that in the email thread. Sorry for the noise. Still, the ban against std::deque seems reasonable, but this bug is probably not the right place to discuss it. :)
Dec 20 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/b80d44193d48fb38e2fffa7c3f16d89979351371

commit b80d44193d48fb38e2fffa7c3f16d89979351371
Author: vmpstr <vmpstr@chromium.org>
Date: Tue Dec 20 03:05:02 2016

cc: Change RTree nodes container to be a vector (with reserve).

This patch changes the nodes container in the rtree to use a vector. In order to preserve consistent access to elements, it reserves the vector to the needed capacity. The wasted space is <= 6 elements. Added a unittest to verify the reserve path is correct for up to 1000 rects. However, locally I've tested this for up to 520k rects (and am still testing for up to 1m rects).

The perf results from N7v2 are below.

Before the patch:
[ RUN ] RTreePerfTest.Construct
*RESULT rtree_construct: 100= 75723.8671875 runs/s
*RESULT rtree_construct: 1000= 11985 runs/s
*RESULT rtree_construct: 10000= 715.2415161132812 runs/s
*RESULT rtree_construct: 100000= 54.19050216674805 runs/s
[ OK ] RTreePerfTest.Construct (8243 ms)
[ RUN ] RTreePerfTest.Search
*RESULT rtree_search: 100= 352105 runs/s
*RESULT rtree_search: 1000= 97542.0234375 runs/s
*RESULT rtree_search: 10000= 10274.0595703125 runs/s
*RESULT rtree_search: 100000= 866.167236328125 runs/s
[ OK ] RTreePerfTest.Search (8122 ms)

After the patch:
[ RUN ] RTreePerfTest.Construct
*RESULT rtree_construct: 100= 112420 runs/s (+48%)
*RESULT rtree_construct: 1000= 13482.32421875 runs/s (+12%)
*RESULT rtree_construct: 10000= 666.6328125 runs/s (-7%)
*RESULT rtree_construct: 100000= 55.98017501831055 runs/s (+3%)
[ OK ] RTreePerfTest.Construct (8340 ms)
[ RUN ] RTreePerfTest.Search
*RESULT rtree_search: 100= 365225 runs/s
*RESULT rtree_search: 1000= 98002.0078125 runs/s
*RESULT rtree_search: 10000= 10344.048828125 runs/s
*RESULT rtree_search: 100000= 866.0883178710938 runs/s
[ OK ] RTreePerfTest.Search (8092 ms)

BUG=674169
R=danakj@chromium.org, dskiba@chromium.org
CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_trusty_blink_rel
Review-Url: https://codereview.chromium.org/2591533002
Cr-Commit-Position: refs/heads/master@{#439686}

[modify] https://crrev.com/b80d44193d48fb38e2fffa7c3f16d89979351371/cc/base/rtree.cc
[modify] https://crrev.com/b80d44193d48fb38e2fffa7c3f16d89979351371/cc/base/rtree.h
[modify] https://crrev.com/b80d44193d48fb38e2fffa7c3f16d89979351371/cc/base/rtree_unittest.cc
Dec 21 2016
dskiba@ can you verify that ToT performs better in terms of memory now? If so, I think this can be closed as fixed.
Dec 23 2016
Thank you vmpstr@ for working on this! Sorry for being the weakest link, I'm traveling. But I'll check the patch soon (today / tomorrow).
Dec 24 2016
So, the change to std::vector helped. The website is BBC.com, and the test is just to load it and wait 60 seconds for things to settle before taking a heap dump.

With std::deque, RTree was allocating 863.2 KiB:

AllocateNodeAtLevel  863.2 KiB  438
Build<std::__ndk1::vector<gfx::Rect, std::__ndk1::counting_allocator<gfx::Rect, std::__ndk1::allocation_group::vector, gfx::Rect> >, (lambda at ../../cc/base/rtree.h:76:18)>  169.5 KiB  86
BuildRecursive  500.6 KiB  254
Build<std::__ndk1::vector<gfx::Rect, std::__ndk1::counting_allocator<gfx::Rect, std::__ndk1::allocation_group::vector, gfx::Rect> >, (lambda at ../../cc/base/rtree.h:76:18)>  496.6 KiB  252
EndGeneratingMetadata  3.9 KiB  2
EndGeneratingMetadata  193.1 KiB  98

With std::vector, RTree now allocates 189.2 KiB:

Build<std::__ndk1::vector<gfx::Rect, std::__ndk1::counting_allocator<gfx::Rect, std::__ndk1::allocation_group::vector, gfx::Rect> >, (lambda at ../../cc/base/rtree.h:94:18)>  177.2 KiB  169
EndGeneratingMetadata  12.0 KiB  50

The numbers match exactly: std::deque does two allocations per instance (one for a block, another for the block pointer array), so in total there were 438 / 2 = 219 RTrees (since each holds a single std::deque instance). std::vector does a single allocation per instance, so there were the same number of them (169 + 50 = 219). So the latest CL recovers 674.0 KiB of memory.

BTW, here is the (reversed) stack trace to Build<...>:

Build<...>
Finalize
PaintContentsToDisplayList
non-virtual thunk to cc_blink::WebContentLayerImpl::PaintContentsToDisplayList(cc::ContentLayerClient::PaintingControlSetting)
Update
UpdateLayers
DoUpdateLayers
UpdateLayers
BeginMainFrame
Jan 4 2017
y'all are awesome, marking this as fixed. Reopen if I missed something.
Comment 1 by primiano@chromium.org, Dec 14 2016

It seems that the only use of nodes_ is as persistent append-only storage, where the lifetime of the nodes is bound to the lifetime of the parent rtree. Maybe the right solution here is a minimal home-brewed version of std::deque, where the block size is determined experimentally. I am really talking about something like:

struct NodesBlock {
  Node nodes[NUM_NODES_PER_BLOCK];
  Node* next_free_node = &nodes[0];  // first unused slot in this block
  std::unique_ptr<NodesBlock> next_block;  // keeps older blocks alive
};

RTree::Node* RTree::AllocateNodeAtLevel(int level) {
  Node* node = block_->next_free_node;
  node->num_children = 0;
  node->level = level;
  block_->next_free_node++;
  if (block_->next_free_node >= &block_->nodes[NUM_NODES_PER_BLOCK]) {
    auto new_block = MakeUnique<NodesBlock>();
    new_block->next_block = std::move(block_);
    block_ = std::move(new_block);
  }
  return node;
}

With some data we might find a good tradeoff for NUM_NODES_PER_BLOCK to balance heap traffic against waste due to fragmentation. Or maybe this could even be a runtime-tunable parameter for the mem coordinator one day.