std::deque is inefficient |
|||||||||||||||
Issue descriptionAs was found during libc++ (Android / macOS) instrumentation [1], its std::deque is quite memory inefficient, because it allocates memory in 4096-byte blocks: https://chromium.googlesource.com/android_ndk/+/master/sources/cxx-stl/llvm-libc++/libcxx/include/deque#912 When number of elements is low that 4096-byte block wastes quite a bit of memory. (Note that for payloads >= 256 block length is 16 elements.) It turned out that libstdc++ (Linux / ChromeOS) also allocates in blocks, only the block size there is 512 bytes: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/stl_deque.h#L91 (Note that for payloads > 256 block length is 1 element.) MSVC (Windows) is different: it seems that the block length is 1 for anything > 8 bytes, see http://stackoverflow.com/questions/13989835/dealing-with-deque-block-size-causing-performance-issues that exposes _DEQUESIZ configuration variable. Test seems confirm that: http://rextester.com/IIDS7391 In a table form: _____________________________________________________ |std::deque<T> block size (bytes) when sizeof(T) is | | | <= 8 | >8 | > 256 | |---------------------------------------------------- |libc++ | 4096 | 4096 | 16 * sizeof(T) | |libstdc++ | 512 | 512 | sizeof(T) | |MSVC | 16 | sizeof(T) | sizeof(T) | ----------------------------------------------------- This means that it's very hard to use std::deque in crossplatform code efficiently: * On MSVC it allocates each item individually for most payloads, i.e. it behaves like a list, just slightly more memory-efficient (one pointer per element in a separate pointer array instead of two pointers in each node). * libc++ wastes lots of memory if number of elements is low. Extreme case is 1-element deque, which wastes something close to 4 KB. You're especially unlucky if your payload is >256 bytes - in this case it wastes 15 * payload bytes. Note that libc++ is what we use on Android, where memory is scarce, especially on low-end devices. * libstdc++ is the most balanced - it doesn't waste much for small elements, and turns to a list for large payloads. Given all that it's very hard to come up with a compelling use case for std::deque. It should either be std::vector or std::list depending on planned utilization / required reference stability. All of this applies to std::queue too, because by default its backed by std::deque. (In [1] queues were observed consuming quite a bit of memory, for example 340 KiB was attributed to TaskQueueImpl queues.) [1] https://docs.google.com/document/d/1YL1FORFMWo0FK0lMg7WsImnjNQ3ZpY0nK0NHGjkeHT4/edit?usp=sharing
,
Dec 14 2016
Could we use the deque in WTF instead? https://cs.chromium.org/chromium/src/third_party/WebKit/Source/wtf/Deque.h?sq=package:chromium&rcl=1481733099&l=51 It uses a Vector for the backing for this reason.
,
Dec 14 2016
That wouldn't be different than using a vector then, so we could just use std::vector when it makes sense. std::deque has different properties such as not invalidating pointers/not reallocing old items which WTF's does not since it's a vector that appears like a std::deque.
,
Dec 15 2016
Turns out std::stack is also using std::deque by default. Just found that v8::internal::SlotSet::PreFreeEmptyBucket() allocates 200 KiB of std::deque memory via its stack variable.
,
Dec 15 2016
danakj@ that would be different than using a std::vector, WTF::Deque adjusts the pointers internally so removing from the start doesn't involve shifting things down. Having to implement that yourself with a vector is error prone. WTF also does a growth factor of 1.5 which saves memory waste and also allows inline capacity which for many cases can avoid malloc entirely. We should just have a Deque data structure we write ourselves that is well behaved. We have one in WTF, we could move it to base and use it for lots of these cases. It might not work for all of the usage inside chromium (maybe rtree needs something special) but it should work generally and avoid waste in the commit case and gives us something that works the same on all platforms which is not true of std::deque/stack or queue.
,
Dec 15 2016
esprehn@, re #5: WTF::Deque does not seem to give the same guarantees of std::deque in terms of non-movability of elements. Which is a key requirement for cases like Issue 674169 . expandCapacity() seems to do re-espansion and memmove, which invalidate pointers previously handed out. I created a repro test here that shows the problem in action https://codereview.chromium.org/2581673002 IMHO there are at least two different things that we need to figure out: 1) Is there a need for fast append-only containers? In other words, are cases like Issue 674169 (cc's RTree) rare? (In which case the solution would be just doing something ad-hoc for RTree) Or is there a more common pattern that we could solve exposing a dedicated container in base for this? 2) How many users of std::deque rely on pointer stability? If WTF::Deque is more efficient and consistent cross-platform, it could be used to replace some of these inefficient users of std::deque. However, this won't work if clients rely on non-movability of elements like in the aforementioned case. Not sure how feasible it is to audit this requirement in the existing code. In other words, as things stand today it seems quite risky to just drop-in replace std::deque with WTF::deque.
,
Dec 15 2016
,
Dec 15 2016
+thakis FYI
,
Dec 15 2016
,
Dec 15 2016
,
Dec 15 2016
primiano@ Yes, see my previous comments. For more information about the inefficient of std::deque and friends: http://info.prelert.com/blog/speed-is-not-the-only-consideration-with-stddeque http://info.prelert.com/blog/stl-container-memory-usage at least C++11 fixed some of the really egregious things like std::list::size() being linear with some stl implementations!
,
Dec 15 2016
,
Dec 15 2016
,
Jan 9 2017
If we're going to import a deque replacement into base:: that is more efficient but can move objects in memory, I'd suggest naming it RingBuffer (or similar) to distinguish it from std::deque and make it more obvious when it should be used.
,
Jan 10 2017
std::queue has a much smaller API with no iterators, so can we have a drop-in replacement while the argument about deque is ongoing? std::list is not a panacea.
,
Jan 10 2017
What if std::deque allowed to specify 'block length' as a template argument? I'm thinking of just copying libc++ implementation and exposing block length (not size, which is in bytes, but count of elements in a block). We could then define base::queue as template <class T, size_t BlockLength> using queue = std::queue<T, deque<T, std::allocator<T>, BlockLength>>; and use appropriate BlockLength in specific cases.
,
Jan 10 2017
We can not just copy the libc++ implementation, it uses libc++ internals which we can't use. And we can't define it in std:: either. We would have to reimplement it.
,
Jan 10 2017
Yeah, I meant "copy and tweak to base:: namespace", but you are right, I overlooked the fact that libc++ implementation relies on __split_buffer.
,
Jan 10 2017
Actually, I think ricea@ is right, and if we're just talking about std::queue replacement, then simple deque implementing just front/back, push_front/push_back is way easier. And implementing pop_back will make it suitable for std::stack too. I'll give it a try just to see how complex gets.
,
Jan 10 2017
We also have a deque in WTF already, I don't think we want to just randomly write another one? I can see it's a fun project but a better one would probably be figuring out how WTF deque can live in base.
,
Jan 10 2017
While I personally support using WTF::Deque in base, it may be a bit tricky because of the Oilpan integration. At the least, I think we'd need to expose some Oilpan traits.
,
Jan 10 2017
before making any changes would be great to do some auditing about who is using std::deque/queue in the codebase, what is their use case and what guarantees they need. The tricky part is figuring out the subset of those who rely on pointer unmoveability (like cc did before the recent refactoring) and hence cannot be switched to WTF::Deque. There are ~250 occurences in the codebase https://cs.chromium.org/search/?q=std::deque+-f:test+-f:src/v8+-f:src/third_party+f:%5C.(c%7Ch)&p=21&sq=package:chromium&type=cs I wonder if we could do some analysis via clang plugins, e.g., how many instances of these for sure don't take pointers from the dequeue elements and could be safely switched to WTF::Deque? How many of them never insert_front/remove_front and could be switched to just a vector?
,
Jan 10 2017
Is there an example of moving a WTF container to base? Allocator template seems very Blink-specific (with methods like enterGCForbiddenScope). As for WTF::Deque, I see it's using VectorBuffer as a backing. VectorBuffer seems to be smart and doesn't copy elements on front removals, i.e. works like a ring buffer. However, since WTF::Deque uses single VectorBuffer, it means that it simply grows as more elements are added (instead of adding new blocks) I.e. in this regard WTF::Deque is closer to std::vector than to std::deque. Additionally inlineCapacity of WTF::Deque also works quite differently than std::deque's block size: it makes WTF::Deque (via VectorBuffer) to have inline buffer of the specified capacity, but once deque grows past it, larger buffer is allocated on the heap and inline buffer turns into a dead weight.
,
Jan 10 2017
Unfortunately, I don't think there's any examples of moving WTF containers to base yet, so doing so would have trailblazing. As for the inline capacity... I think that's a configurable parameter? It defaults to zero, which means it's not an issue except for code that wants to opt into having some inline capacity.
,
Feb 2 2017
Two new datapoints: * libstdc++: Default constructor of std::deque always allocates a block (512 bytes). When deque shrinks, it keeps a single block. * libc++: Default constructor doesn't allocate, but deque keeps up to two blocks (~4096 bytes each) when shrinking. See http://rextester.com/FTBKT82856
,
Mar 21 2017
I thought of ways we could fix std::deque problem here: https://docs.google.com/document/d/1PEuPnSW54LaoWpUIEAHobqGt9nbD2DgQ_W5DdlJOkJU/edit?usp=sharing Comments are welcome. So far my favorite is to just patch libc++ on Android. Everything else seem to involve base::deque of some form, which adds its own set of problems.
,
Apr 5 2017
To add more data: I did some profiling on desktop Linux and std::deque pops up as a big thrasher of memory (that is, it goes through a lot of allocations even though the total size at any given time might not be that bad). The two worst cases are std::queue<content::FrameTreeNode*> (goes through 23MB in 2 minutes) and std::deque<safe_browsing::SBAddPrefix> (goes through 16MB in 2 minutes). We should solve this problem for all systems, not just Android.
,
Apr 5 2017
Wow, that's interesting given that the plan above is roughly "make libc++ use libstdc++'s deque allcoation strategy". Sounds like that's not good enough then? Or should high-frequency deque's explicitly use a slab allocator with the deque?
,
Apr 5 2017
Well, libstdc++ allocation strategy is more efficient, so we want it on Android. But of course going from 4096 to 512 can increase number of allocations / deallocations 8x in some cases. 23MiB is 47104 512-byte blocks in 2 minutes, so it's 392 allocations / frees per second. I wonder what is the average thrashing rate? And how much do we spend in malloc / free?
,
Apr 5 2017
> goes through 23MB in 2 minutes Brett, when you say "goes through" you mean that: A) SUM(mallocs) - SUM(frees) ~= 0, but SUM(mallocs) ~= 23 MB after 2 mins? B) SUM(mallocs) - SUM(frees) ~= 23 MB after 2 mins Also, do they come from the same deque instances, or from several instances from the same class? > Sounds like that's not good enough then? The previous problem was: on Android using many deques with few elements each ends up wasting lot of memory because almost each deque wastes 4k. #27 really depends on what is the specific pathological pattern we are hitting. If B and they come from the same instance, this could just be us misusing deques in those specific classes.
,
Apr 17 2017
Hi all, I'm planning to do port WTF::Deque in base (or somewhere in Chromium), and try to experiment with adopting it to some of the critical std::deque usage (I don't know which std::deque use is critical yet, though). According to the data above, this seems worth trying to me. If nobody is strongly against the idea, I'll start writing a design doc and ask for opinions on the details. WDYT?
,
Apr 17 2017
yutak@, have you seen https://docs.google.com/document/d/1PEuPnSW54LaoWpUIEAHobqGt9nbD2DgQ_W5DdlJOkJU/edit?usp=sharing ? Porting WTF::Deque was one of the solutions there. However, the consensus was to try to convince LLVM people to change std::deque, and then cherry-pick that change to Chrome's copy of Android NDK. See https://bugs.llvm.org/show_bug.cgi?id=32435 There are issues with both introducing base::deque and with porting WTF::Deque as is - they are outlined in the document. The way WTF::Deque is implemented puts it closer to circular buffer than to std::deque - and I think having circular buffer in base:: can be useful.
,
Apr 17 2017
,
Apr 18 2017
My assumption is that 512-byte block (that libstdc++ uses) is still too large for very small deques. I also think MSVC's implementation wasteful, as it's just like a linked list. Do you have any insights on libstdc++ or on MSVC? Are they good enough?
,
Apr 19 2017
I think 512-byte block is fine - with it Chrome wastes just ~150 KiB (instead of >1 MiB). We can think of decreasing it further, but we need to make sure we're not degrading into a list (as MSVC does). Also, the smaller is the buffer, the more allocations deque ends up doing. See https://bugs.llvm.org/show_bug.cgi?id=32435#c5 for benchmark results where 512/2 configuration (512-byte buffer / min 2 items per block) is 2x - 8x slower than 4096/16. Also see comment #7 there which has information about types commonly used with std::deque. When choosing a block size we need to make sure it can hold several PendingTasks.
,
May 15 2017
,
May 15 2017
+haraken
,
May 25 2017
The following revision refers to this bug: https://chromium.googlesource.com/android_ndk/+/eecd8c2d681b019efca486f92fdda9a93f52328f commit eecd8c2d681b019efca486f92fdda9a93f52328f Author: Dmitry Skiba <dskiba@chromium.org> Date: Thu May 25 18:30:33 2017 Lower std::deque block size. It was discovered that libc++ implementation of std::deque allocates in 4KiB blocks, so it uses 4KiB even to store a single element. That wastes massive amounts of memory in several cases, see the bug and docs it links to. There are several ways to fix this (see comment #26 in the bug), but the easiest one is to just patch the NDK. We decided to fix the issue upstream first, so bugs.llvm.org/show_bug.cgi?id=32435 was filed. However, the LLVM bug has stalled, and it's unclear how long it will take us to make the change. At the same time pressure to reduce Chrome memory usage on low-end devices has increased following the Android Go announcement. So here is the new plan: 1. Temporarily patch NDK to lower deque's block size 2. Monitor memory and performance dashboards for regressions 3. Decide whether to keep the lower block size 4. Submit findings to the LLVM bug Bug: 674287 Change-Id: Idf7ad7eab5ab1ea6a66902bc5069745e56079cf4 Reviewed-on: https://chromium-review.googlesource.com/514207 Reviewed-by: Nico Weber <thakis@chromium.org> Reviewed-by: John Budorick <jbudorick@chromium.org> Tested-by: Dmitry Skiba <dskiba@chromium.org> [modify] https://crrev.com/eecd8c2d681b019efca486f92fdda9a93f52328f/sources/cxx-stl/llvm-libc++/libcxx/include/deque
,
May 25 2017
The following revision refers to this bug: https://chromium.googlesource.com/android_tools/+/023e2f65409a2b7886b8d644d6a88542ead6cd0a commit 023e2f65409a2b7886b8d644d6a88542ead6cd0a Author: Dmitry Skiba <dskiba@chromium.org> Date: Thu May 25 23:52:34 2017 Roll NDK to pick std::deque patch. Bug: 674287 Change-Id: Ia38c12af231b74f65b952c59703f470e3e6da076 Reviewed-on: https://chromium-review.googlesource.com/516604 Reviewed-by: John Budorick <jbudorick@chromium.org> [modify] https://crrev.com/023e2f65409a2b7886b8d644d6a88542ead6cd0a/DEPS
,
Jun 12 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/293876a7b9dcaa1a8d74d63b0ab272c54f7a8dcc commit 293876a7b9dcaa1a8d74d63b0ab272c54f7a8dcc Author: dskiba <dskiba@chromium.org> Date: Mon Jun 12 19:02:19 2017 Roll src/third_party/android_tools/ cb6bc2110..023e2f654 (1 commit). https://chromium.googlesource.com/android_tools.git/+log/cb6bc2110..023e2f654 023e2f6 Roll NDK to pick std::deque patch. BUG=674287 Review-Url: https://codereview.chromium.org/2904813005 Cr-Commit-Position: refs/heads/master@{#478703} [modify] https://crrev.com/293876a7b9dcaa1a8d74d63b0ab272c54f7a8dcc/DEPS [modify] https://crrev.com/293876a7b9dcaa1a8d74d63b0ab272c54f7a8dcc/build/config/android/BUILD.gn
,
Jun 20 2017
What work is outstanding? Assigning to you as you've landed a few patches already.
,
Jun 23 2017
With the change system health benchmark saw ~1 MiB decrease in total malloc number: https://chromeperf.appspot.com/report?sid=65741376d622592e0e647fab105154d24cb183bc6eccba9ef5baf386cfe4932d&start_rev=477022&end_rev=481164
,
Jun 24 2017
This is a great improvement!!
,
Jun 24 2017
You could post an update to the old memory-dev thread about this achievement :)
,
May 18 2018
Next steps: 1. Think of lowering block size on desktop platforms too. 2. Fix memory_usage_estimator to use lower block size on Android. Currently it's 4096, i.e. we're overestimating.
,
May 21 2018
Don't we have base::circular_deque for this now?
,
May 22 2018
It has fairly different characteristics than std::deque: https://cs.chromium.org/chromium/src/base/containers/circular_deque.h?rcl=6139e2e6a42b79b4b17f5b0e18588b19dcb1b347&l=27
,
May 22 2018
Sorry I was unclear, I meant I didn't think there was much std::deque usage left after Brett's work that would benefit from global std::deque optimizations. ~50, https://cs.chromium.org/search/?q=std::deque+-file:third_party/&sq=package:chromium&type=cs ~313, https://cs.chromium.org/search/?q=base::circular_deque&type=cs&sq=package:chromium But if we're counting third party then it's more like 50%.
,
May 23 2018
BTW, std::deque is also used in std::queue and std::stack. Re different characteristics of base::circular_deque - just saw crrev.com/c/1069550, where UAF was introduced by calling pop_front(). std::deque is safe in this case.
,
Oct 26
,
Oct 26
I'll take point on this. The problem I forsee is that it's actually a huge performance loss to decrease the block size, and landing a change that makes deque slower in a fair few cases will be tricky. It might be easier to provide a switch for users which allows them to manually specificity the block size or number of elements in the block.
,
Jan 11
Available, but no owner or component? Please find a component, as no one will ever find this without one. |
|||||||||||||||
►
Sign in to add a comment |
|||||||||||||||
Comment 1 by dskiba@chromium.org
, Dec 14 2016