Use of VirtualAlloc for Windows Discardable can be very slow for large allocations |
||||||||||
Issue descriptionOn Windows, base::DiscardableSharedMemory makes use of two functions to lock/unlock memory: VirtualAlloc(..., MEM_RESET, ...); // Unlock - memory not actively in use VirtualAlloc(..., MEM_RESET_UNDO, ...); // Lock - memory about to be used While it's expected that there is a cost to these calls, both are more costly than I'd expect. The current cost makes it hard to use on a per-frame basis for large allocations. For instance, for a ~300MB allocation, locking and unlocking take ~3-6MS each It'd be good to investigate if there are ways to use VirtualAlloc such that the cost to Lock/Unlock is reduced. We're currently hitting this issue when dealing with large images in the ImageDecodeController logic in CC. Attached is a trace which shows CC raster for a large image (2,000 x 40,000 - 320MB). For a sample frame during a scroll, we spend 24MS Locking the image (7 occurrences) and 35MS Unlocking (7 occurrences). While we could factor things to lock and unlock less, you'd still want to at least lock/unlock between frames, which would still make it tricky to hit 60fps.
,
Mar 18 2016
Oh wow that's really bad! pennymac@, can you investigate if there's a way to make this more efficient?
,
Mar 18 2016
This is causing some bad behavior with image decodes, since it's written with the expectation that the Unlock of discardable is always cheap.
,
Mar 18 2016
,
Mar 18 2016
Err, I didn't mean to p1 this
,
Mar 18 2016
Issue 584786 has been merged into this issue.
,
Mar 21 2016
This is very interesting. Eric, could you confirm for me if this is a specific version of Windows, and if it's x86/x64? Unlock/Unpin at the very least should be extremely fast. My understanding is that it just flags the memory. +brucedawson Hey Bruce, is there a fast way to instrument only calls to VirtualAlloc with MEM_RESET/MEM_RESET_UNDO and get some speed numbers? Even just a baseline for when a system is NOT under stress (pages don't get taken) would be great. Possibly two sets of numbers: one for large image and one for smaller size. (We can chat further offline Bruce.) I'd be curious to see what the overhead of our code is as well (mem maps/SM/DSM). There is no other way for us to call VirtualAlloc except with these two flags (to pin/unpin). There are APIs in newer versions of Windows OfferVirtualMemory/ReclaimVirtualMemory, but they carry out the same actions underneath. They were also buggy on Win10. (Ref https://bugs.chromium.org/p/chromium/issues/detail?id=180334#c234)
,
Mar 21 2016
,
Mar 21 2016
Thanks for looking into this - I did these tests on Windows 10 pro, x64.
,
Mar 21 2016
In general one complication with large memory allocations is the hidden costs. The actual VirtualAlloc calls are usually extremely cheap, but when pages are touched they have to be faulted in. VirtualFree is usually somewhat expensive (with the cost being proportional to how many pages were faulted in, IIRC) and then the newly freed pages have to be zeroed at some point, either by the system process (a very well hidden cost, but very real) or when being faulted in to a subsequent allocation. I wrote about this a while ago after seeing some of these issues in Chrome: https://randomascii.wordpress.com/2014/12/10/hidden-costs-of-memory-allocation/ I don't know how this applies to discardable memory. I assume that this memory is not actually being discarded, merely being made available for discard? IIRC MEM_RESET is implemented by removing pages from the working set and Windows does seem to be very inefficient about doing this. Removing 300 MB from the working set is something that I would expect to be expensive. The actual cost seems to vary depending on OS and/or CPU type. I don't know how quickly Linux can do the equivalent operation. I suspect that large pages would help but the last time I tried using them I found that they were problematic, and they don't support paging so that would be a non-starter. I'm not sure how it makes sense to mark memory as discardable when we're going to need it again in ~16.6 ms. That seems contradictory. TL;DR Mapping 300+ MB in and out of a process every frame is expected to be expensive.
,
Mar 21 2016
Thanks for the quick response. I agree that we shouldn't be pinning/unpinning memory that we know will be used again in rapid succession. That could be a design change. WRT MEM_RESET, I'm fairly sure that it does not remove pages from the working set. It's the newer APIs that do that extra - and another reason for us not to use the new APIs. MEM_RESET only marks the memory range as "no longer of interest"/available-if-needed-under-stress, and the pages should not be read from or written to the paging file. The range will be used again later though, so the pages are not decommitted. - https://msdn.microsoft.com/en-us/library/windows/desktop/aa366887(v=vs.85).aspx - http://stackoverflow.com/questions/32023771/new-windows-8-1-apis-for-virtual-memory-management-discardvirtualmemory-vs - From msdn: "Note that offering and reclaiming virtual memory is similar to using the MEM_RESET and MEM_RESET_UNDO memory allocation flags, except that OfferVirtualMemory removes the memory from the process working set and restricts access to the offered pages until they are reclaimed."
,
Mar 21 2016
I hacked up a quick test program that uses MEM_RESET. This confirms pennymac@'s belief that MEM_RESET does not remove pages from the working set. I grabbed an ETW trace and the sampling profiler shows the MEM_RESET call spending time in MiActOnPte. So I guess it walks the ~1,200 or so page table entries and modifies them in some way, and this is not very fast. The numbers along the left side are sample counts with each one representing approximately 1 ms:
19 | ntoskrnl.exe!NtAllocateVirtualMemory
19 | ntoskrnl.exe!MiAllocateVirtualMemory
19 | ntoskrnl.exe!MiResetVirtualMemory
19 | ntoskrnl.exe!MiWalkVaRange
19 | |- ntoskrnl.exe!MiActOnPte
17 | | |- ntoskrnl.exe!MI_IS_PFN
14 | | |- ntoskrnl.exe!MiActOnPte<itself>
2 | | |- ntoskrnl.exe!MiLocateWsle
1 | |- ntoskrnl.exe!MiInsertTbFlushEntry
1 | |- ntoskrnl.exe!MiWalkVaRange<itself>
My test program is just an MFC dialog app with three buttons handled by these three functions:
void* pFoo;
const size_t kBigBuffer = 320 * 1024 * 1024;
void CMemoryTestsDlg::OnBnClickedAllocate()
{
if (!pFoo)
{
pFoo = VirtualAlloc(0, kBigBuffer, MEM_COMMIT, PAGE_READWRITE);
}
}
void CMemoryTestsDlg::OnBnClickedTouch()
{
if (pFoo)
{
memset(pFoo, 1, kBigBuffer);
}
}
void CMemoryTestsDlg::OnBnClickedReset()
{
if (pFoo)
{
VirtualAlloc(pFoo, kBigBuffer, MEM_RESET, PAGE_READWRITE);
}
}
I'm not sure that this additional information tells us anything particularly useful. I am also testing on Windows 10.
,
Mar 23 2016
Ok, so thank you very much Bruce. I'm surprised at how inefficient this flagging is on Windows. reveman@, ericrk@, and jschuh@, I think there are two main options for moving forward - and would like your input. 1) I think that most people would agree that we should not be unpinning/unlocking memory and data that we know will be used in rapid succession. How do we feel about working on changing that use of *Discardable* shared memory? We might be fine with the small overhead on Windows then. 2) Another option is to work a fast time check into the Windows DSM objects. Every time Unlock is called on the DSM, we check how long it has been since the last Unlock was done (store one timestamp). If it's happening too quickly, we can choose to not actually unpin on Windows (to prevent overuse). Could also include a small counter to allow up to N rapid unlocks before no longer unpinning. Etc. I'm interested in input on this backup idea as well.
,
Mar 24 2016
I'm not thrilled with the idea of adding heuristics to DSM to potentially ignore unlocks. It is tempting because then it is a choke point to fix bad behavior from multiple clients, but getting such heuristics tuned perfectly is tricky - well, actually, probably impossible. It also leads to magic-performance-cusps where if you cross over some invisible line then your performance suddenly changes dramatically. The Windows heap has some of those and they make profiling very tricky. Can this be fixed at the ImageDecodeController level, which hopefully has enough information to make cleaner decisions?
,
Mar 24 2016
I think if MEM_RESET usage is this expensive then we should just revert those changes until we have an API on Windows that is efficient enough. Discardable memory works without that, just doesn't give the kernel the ability to discard memory when needed but that's the same as on MacOS and Linux.
,
Mar 24 2016
At the ImageDecodeController level, we typically make decisions based on what we need to schedule. This information can change frame to frame depending on what Blink is providing us. We also do try to minimize lock/unlocking based on what we do know at the current time. I'm afraid the fix at the image decode controller level would be similar to the fix at discardable memory level: don't unlock immediately even if the memory is no longer required, then the relock is free; if it's still not required after some time then really unlock it
,
Mar 24 2016
So just to be clear, if I revert the use of pinning/unpinning on Windows, the operating system can not take any resources during times of stress. The only support for "discardable" will be if we cross our memory usage threshold (set internally to Chrome) and we choose to free "unlocked" chunks of shared memory from the Browser process. If we do agree to remove the Windows "pinning" due to overhead, I can still leave the change to Purge() on Windows so that it does a DECOMMIT during Purge. Thoughts Justin?
,
Mar 24 2016
Ok, Justin and I have had a good discussion about this. I'm going to go ahead and make a CL removing the MEM_RESET and MEM_RESET_UNDO from Unlock() and Lock() functions for Windows. Justin had an interesting idea though to do MEM_RESETs from within the Browser process (on Windows only), similar to how we currently handle purging when we hit a memory usage threshold. This "unpinning" extravaganza could happen at a threshold lower than the existing emergency "purging" event. A good discussion ensued. - Based either on a Chrome-internal threshold (lower than the Purge threshold), a timer event (every n seconds) that checks system resource stress, or an event fired from a Chrome centralized task manager or resource monitor thread (when it detects system resource stress). - Reference existing HostDiscardableSharedMemoryManager::ReduceMemoryUsageUntilWithinLimit() function in src/content/common/host_discardable_shared_memory_manager.cc. - We could roll through the least-recently-used-ordered unlocked segment vector (the way we currently do), and instead of calling Purge, call an Unpin function that does MEM_RESET on the DSM pieces. - MEM_RESET_UNDO would still remain in the Lock() function, called by the child processes who already have to handle if their data was lost and needs to be reconstructed. No behavioural changes for how child processes interact with DSM. - We might need a new SharedState flag, to differentiate between a segment that is UNLOCKED vs UNPINNED. Because they're not the same thing. Lock() will need to know if the segment has actually been MEM_RESET or not. I'll be attaching the CL to the original ticket (180334). Stay tuned.
,
Mar 28 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/6a2afca1819c6f14aff51dd94aead834e6ad6fd8 commit 6a2afca1819c6f14aff51dd94aead834e6ad6fd8 Author: pennymac <pennymac@chromium.org> Date: Mon Mar 28 17:51:33 2016 [Discardable Shared Memory] Remove Windows (un)pinning support. Removing MEM_RESET and MEM_RESET_UNDO from Unlock and Lock functions. The overhead is a bit too high for some scenarios. See ticket 595930 for details and plans for potential future implementations. BUG= 180334 , 595930 Review URL: https://codereview.chromium.org/1832183002 Cr-Commit-Position: refs/heads/master@{#383514} [modify] https://crrev.com/6a2afca1819c6f14aff51dd94aead834e6ad6fd8/base/memory/discardable_shared_memory.cc
,
Mar 31 2016
Verified. The merge request will happen in 180334.
,
Mar 31 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2b1af3bda9c60987393ece94c49f86c98df4a666 commit 2b1af3bda9c60987393ece94c49f86c98df4a666 Author: Penny MacNeil <pennymac@chromium.org> Date: Thu Mar 31 18:18:19 2016 [Discardable Shared Memory] Remove Windows (un)pinning support. Removing MEM_RESET and MEM_RESET_UNDO from Unlock and Lock functions. The overhead is a bit too high for some scenarios. See ticket 595930 for details and plans for potential future implementations. BUG= 180334 , 595930 Review URL: https://codereview.chromium.org/1832183002 Cr-Commit-Position: refs/heads/master@{#383514} (cherry picked from commit 6a2afca1819c6f14aff51dd94aead834e6ad6fd8) Review URL: https://codereview.chromium.org/1850703004 . Cr-Commit-Position: refs/branch-heads/2661@{#446} Cr-Branched-From: ef6f6ae5e4c96622286b563658d5cd62a6cf1197-refs/heads/master@{#378081} [modify] https://crrev.com/2b1af3bda9c60987393ece94c49f86c98df4a666/base/memory/discardable_shared_memory.cc |
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by ericrk@chromium.org
, Mar 18 2016Owner: ----