
Issue 870054

Starred by 16 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 21
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug




v8 triggered cfg fragmentation causing gmail hangs and significant memory waste

Project Member Reported by brucedaw...@chromium.org, Aug 1

Issue description

While using gmail in M68 on Windows 10 I have started noticing two+ second hangs in gmail. As I am typing into an email the typed characters will be appearing fine, then gmail will stop updating, and then after two to three seconds gmail will resume updating. No characters are lost but the two to three second pauses are quite noticeable.

I grabbed an ETW trace and was able to locate the hang in the trace and characterize it somewhat. The CrRendererMain thread in the gmail browser process is repeatedly blocked in calls to VirtualAlloc because it is fighting over a lock with WmiPrvSE.exe. Chrome ends up waiting on the stack below over ~450 times (~450 context switches) and is blocked for a total of 2.87 seconds over a 2.94 second time period:

  |    chrome_child.dll!v8::internal::Runtime_StackGuard
  |    chrome_child.dll!v8::internal::StackGuard::HandleInterrupts
  |    chrome_child.dll!v8::internal::OptimizingCompileDispatcher::InstallOptimizedFunctions
  |    chrome_child.dll!v8::internal::Compiler::FinalizeCompilationJob
  |    chrome_child.dll!v8::internal::OptimizedCompilationJob::FinalizeJob
  |    chrome_child.dll!v8::internal::compiler::PipelineCompilationJob::FinalizeJobImpl
  |    chrome_child.dll!v8::internal::compiler::PipelineImpl::FinalizeCode
  |    chrome_child.dll!v8::internal::compiler::PipelineImpl::Run<v8::internal::compiler::FinalizeCodePhase>
  |    chrome_child.dll!v8::internal::compiler::CodeGenerator::FinalizeCode
  |    chrome_child.dll!v8::internal::Factory::TryNewCode
  |    chrome_child.dll!v8::internal::Heap::ProtectUnprotectedMemoryChunks
  |    chrome_child.dll!v8::internal::MemoryChunk::SetReadAndExecutable
  |    chrome_child.dll!v8::internal::SetPermissions
  |    chrome_child.dll!base::SetSystemPagesAccess
  |    KernelBase.dll!VirtualAlloc
  |    ntdll.dll!NtAllocateVirtualMemory
  |    ntoskrnl.exe!KiSystemServiceCopyEnd
  |    ntoskrnl.exe!NtAllocateVirtualMemory
  |    ntoskrnl.exe!MiAllocateVirtualMemory
  |    ntoskrnl.exe!MiCommitVadCfgBits
  |    ntoskrnl.exe!MiMarkProcessCfgBits
  |    ntoskrnl.exe!MiMarkPrivateOpenCfgBits
  |    ntoskrnl.exe!MiPopulateCfgBitMap
  |    ntoskrnl.exe!ExAcquirePushLockExclusiveEx
  |    ntoskrnl.exe!ExfAcquirePushLockExclusiveEx
  |    ntoskrnl.exe!KeWaitForSingleObject
  |    ntoskrnl.exe!KiCommitThreadWait
  |    ntoskrnl.exe!KiSwapThread

In every case Chrome is readied by WmiPrvSE.exe on this stack:

  |- ntoskrnl.exe!KiSystemServiceCopyEnd
  |    ntoskrnl.exe!NtQueryVirtualMemory
  |    ntoskrnl.exe!MmQueryVirtualMemory
  |    ntoskrnl.exe!MiUnlockAndDereferenceVad
  |    ntoskrnl.exe!ExfTryToWakePushLock
  |    ntoskrnl.exe!ExpWakePushLock

Chrome and WmiPrvSE.exe's threads are both at priority 8 but WmiPrvSE.exe must either be hammering the lock hard enough to exclude chrome.exe or holding it for long periods of time (an average of 6.3 ms each time).

The attached screenshot shows CPU usage of gmail's CrRendererMain thread around the time of the hang. The vertical lines represent keystrokes. Notice that at the beginning (left-side of the screenshot) each key stroke triggers a burst of CPU activity. Then there is a ~2.8 second period (from ~33.1 to ~36.0 s) where CPU usage is much lower - this is the hang when CrRendererMain cannot keep up due to being blocked by lock contention. Then at 36.0 s into the trace CPU usage resumes, the backlog of work is processed, and then gmail returns to normal.

This behavior appears to be new with M68. There are four possible explanations:
1) WmiPrvSE.exe has changed its behavior (in which case this is unrelated to M68, which seems unlikely)
2) v8 has changed its allocation patterns so that it calls VirtualAlloc more frequently leading to greater contention with WmiPrvSE.exe
3) Chrome has changed so that it is triggering WmiPrvSE.exe to do more work
4) gmail has changed so that it triggers more calls to VirtualAlloc

The CPU time in WmiPrvSE.exe around the time of the hang is mostly on this stack:

ntoskrnl.exe!MiGetNextPageTable
ntoskrnl.exe!MiQueryAddressState
ntoskrnl.exe!MiQueryAddressSpan
ntoskrnl.exe!MmQueryVirtualMemory
ntoskrnl.exe!NtQueryVirtualMemory
ntoskrnl.exe!KiSystemServiceCopyEnd
ntdll.dll!NtQueryVirtualMemory
perfproc.dll!GetProcessVaData
perfproc.dll!GetSystemVaData
perfproc.dll!CollectSysProcessObjectData
advapi32.dll!QueryV1Provider
advapi32.dll!QueryExtensibleData
advapi32.dll!PerfRegQueryValue
KernelBase.dll!LocalBaseRegQueryValue
KernelBase.dll!RegQueryValueExW
pdh.dll!GetSystemPerfData
pdh.dll!GetMachineEx
pdh.dll!PdhiTranslateCounter
pdh.dll!PdhTranslateLocaleCounterW
WmiPerfClass.dll!ConvertCounterPath
WmiPerfClass.dll!EnumSelectCounterObjects
WmiPerfClass.dll!CClassCache::RefreshThreadUpdateSelectedProviders
WmiPerfClass.dll!CClassCache::RefreshThreadProviderObjectUpdate
WmiPerfClass.dll!CClassCache::RefreshThreadProc
kernel32.dll!BaseThreadInitThunk
ntdll.dll!RtlUserThreadStart

 
KeystrokesToCPUUsage.PNG (45.2 KB)
For thoughts on how to investigate see my WMI related blog post from last year:
https://randomascii.wordpress.com/2017/09/05/hey-synaptics-can-you-please-stop-polling/

It includes a link to this suggestion for extra data to record - I'll have to see if that reveals anything:
https://twitter.com/kobyk/status/901545728400601092

Odd mystery: during the ~38 s or so that the trace fully captured the *only* process that is interfered with is chrome(27368) - the gmail renderer process. No other process in the system is blocked by WmiPrvSE.exe on a VirtualAlloc call even once. Meanwhile a suitable WPA query (readying process of WmiPrvSE.exe, VirtualAlloc on the New Thread Stack) found another 345 hits earlier on in the trace, from 8.218 to 10.339 s, blocking CrRendererMain for 2.037 s of the 2.120 s interval. WmiPrvSE.exe was burning an entire CPU core virtually the entire ~38 s, but was affecting nobody else.

More weirdness. Looking at the VirtualAlloc data (UIforETW records every call to VirtualAlloc from all processes while it is tracing) shows 189 calls to VirtualAlloc, fairly evenly spread out during the trace, except that there are *zero* VirtualAlloc calls recorded during the two hang periods - presumably because the call is only logged once it completes. What this means is that v8 made a single call to VirtualAlloc which then kept trying but failing to acquire the lock. Once it acquired the lock it was able to change its code page to read/execute and the hang resolved itself.

So v8 isn't doing anything particularly weird or expensive - it's just making a single call to VirtualAlloc which takes two to three seconds to complete. I don't understand how Windows implements its kernel mode locks but this is not the first time I have seen this situation where a thread is readied because the lock was released but the thread then fails to acquire the lock because another thread has reacquired it by then.

So, the questions are:
1) Who is doing the WMI query and why? (not that there is anything wrong with doing WMI queries)
2) Why does the WMI query take so long?
3) Why are the kernel locks implemented in such a way that Chrome is unable to acquire the lock for seconds at a time?

I'm not seeing any sign that Chrome is doing anything wrong here.

FYI about V8:

AFAIK, allocation behavior with respect to JIT-ed code did not really change significantly throughout the last months. Page sizes also stayed the same.

I see that you already figured out that the VirtualAlloc calls are probably due to V8 being more restrictive in setting permissions for pages that are used to hold JIT-ed code. You probably observe a hang in [1]. Restricting page permissions was a long-requested security feature.

I don't know the exact timeline when things got fully enabled but it was turned on and off throughout the last couple of months and probably stuck at some point.

[1] https://cs.chromium.org/chromium/src/v8/src/base/platform/platform-win32.cc?q=platform-win32.cc&sq=package:chromium&dr&l=867
Are there any v8 metrics that might show us how frequently this problem is being hit? Maybe some UMA stats around the performance of FinalizeCompilationJob? It appears that this will only happen when VirtualAlloc is used to change the PAGE_EXECUTE_READ flag since that is what will force the OS to update the CFG (Control Flow Guard) bitmap.

It may be that this is a rare event that only happens on over-managed corporate machines, or this WMI behavior could be something that happens far more frequently.

The allocation size from the initial hang was 36 KiB - as you say, nothing unusual or excessive.

It would be helpful to understand exactly what changed about how v8 manages page permissions and when - this change seems to be the most likely trigger for this Windows bug. Can you confirm what CL finally enabled this?

It would be good to know if there is any sign of this happening in the wild in Chrome M68. I am seeing these hangs happen many times per day but so far nobody else has reported them. In the attached screenshot you can see that if you View Callers of VirtualAlloc then you can easily see the two spots in this trace where the hang happened. Over the 40 s in this (new trace, from today) the hang happens twice, once for 2.4 s (from 80685 to 80687.2) and once for 7.35 s (from 80696.70 to 80704.05).

It would also be helpful to understand who is triggering the WMI scans - fixing the issue from the other end. I have full Microsoft-Windows-WMI-Activity data in the most recent trace but it's tricky to interpret. There is one entry for the problematic WmiPrvSE.exe process (6884) but it doesn't obviously indicate to me what query was being done or by whom. Here is the raw data (headers and payload):

Line #, Provider Name, Task Name, Opcode Name, Process, Id, Event Name, Cpu, ThreadId, ProviderName (Field 1), Code (Field 2), HostProcess (Field 3), ProcessID (Field 4), ProviderPath (Field 5), Field 6, Field 7, Field 8, Field 9, Field 10, Field 11, Field 12, Field 13, Field 14, Field 15, Field 16, Field 17, Field 18, Field 19, Field 20, Field 21, Count, Time (s)
11, , , , WmiPrvSE.exe (6884), 5857, Microsoft-Windows-WMI-Activity//, 8, 6932, WMIProv, 0x00000000, wmiprvse.exe, 6884, %systemroot%\system32\wbem\wmiprov.dll, , , , , , , , , , , , , , , , , 1, 80,704.952239817


HangDetection.PNG (123 KB)
In the sampling profiler data we seem to be seeing this scenario in the wild, but at relatively low frequency. I looked at 30s render thread profiles from canary and dev that contained at least 1s of execution in v8::internal::SetPermissions(). These profiles correspond to ~0.1% of all profiles containing a non-trivial amount of non-idle execution.

Of these profiles, the time spent in v8::internal::SetPermissions() by decile in seconds is roughly: 1,1,1.1,1.2,1.3,1.4,1.6,1.9,2.5,3.7,28.7
Cc: mstarzinger@chromium.org
The change was that V8 switches code pages between RX and RW on code allocation. So every time the compiler allocates V8 needs to re-map a specific area as RW to allocate and write code into the memory. The memory is switched back to RX before returning the code.
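For concreteness, here is a minimal sketch of that flip in terms of the Win32 calls involved (this is not V8's actual code; on Windows V8's SetPermissions re-commits already-committed pages with the new protection, which is the VirtualAlloc call visible in the hang stack above):

  #include <windows.h>
  #include <cstring>

  int main() {
    const size_t kPage = 4096;
    // Steady state: the code page is RX.
    void* code = VirtualAlloc(nullptr, kPage, MEM_RESERVE | MEM_COMMIT,
                              PAGE_EXECUTE_READ);
    if (!code) return 1;

    // Flip to RW to write freshly generated code. Re-committing an already
    // committed page with a new protection changes its protection.
    VirtualAlloc(code, kPage, MEM_COMMIT, PAGE_READWRITE);
    memset(code, 0xC3, 16);  // stand-in for the generated machine code

    // Flip back to RX before handing the code out. With CFG enabled, each
    // executable commit forces the kernel to update the CFG bitmap (the
    // MiCommitVadCfgBits / MiPopulateCfgBitMap path in the hang stack).
    VirtualAlloc(code, kPage, MEM_COMMIT, PAGE_EXECUTE_READ);

    VirtualFree(code, 0, MEM_RELEASE);
    return 0;
  }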

AFAIK, the feature was initially turned on in March [1] on ToT.

mstarzinger@ should know more. I think there also was an issue where V8's handling was not ideal and we did far too many transitions, which also resulted in renderer hangs on Windows (which could never be fully explained).

[1] https://chromium.googlesource.com/v8/v8.git/+/f7aa8ea00bbf200e9050a22ec84fab4f323849a7
> resulted in renderer hangs on Windows (which could never be fully explained)

Well I guess we have an explanation now. And I can see why they would not have been easily explained - these hangs were tricky to figure out. Reducing the number of transitions is certainly important. I'm not sure why gmail in particular never seems to stabilize - I would have thought that after a few days of up-time that it would stop generating code so frequently.

The change you link to was landed in v8 on March 12th and we branched for M67 on March 13th. Based on that and the link in the M67 branch email (which said that we were branching with v8 6.7, https://chromium.googlesource.com/v8/v8/+log/refs/tags/lkgr/6.7) it looks like that change did not go to stable until M68, so I'm not just imagining that I started seeing it after I upgraded my stable browser.

So, we have a genuine regression due to hitting a bug/glitch/whatever in Windows, on some unknown number of systems. Presumably the next steps are:
1) Monitor so we know how serious the problem is
2) Think about what it would take to revert the change if it hits too many people (or the wrong people, such as enterprise)
3) Follow-up with Microsoft to see if this is a bug that they can fix.

I can do number 3 but that path is likely to be *very* slow.

> I'm not sure why gmail in particular never seems to stabilize - I would have 
> thought that after a few days of up-time that it would stop generating code so
> frequently.

It is not just code generation that needs to flip the permission. Code contains pointers to the heap, so we also need to change permissions when the GC moves objects around (i.e., during compaction) and references to those objects need to be updated in code.

Apparently this odd behavior of GetProcessVaData being extremely expensive is a long-standing issue. It was reported on or before 2014 here:
https://support.microsoft.com/en-us/help/2996013/high-cpu-usage-in-wmiprvse-exe-when-you-have-sap-installed

Unfortunately, despite consulting with a WMI expert, I still haven't found what WMI query will cause the GetProcessVaData calls. This means I have no repro which means that investigating is slow.

Some random notes from investigating this bug last Friday:

Another suggestion was that I should use something like Process Explorer or Process Hacker to look for anything with wbemsvc.dll loaded. That's likely to at least be things which are calling into WMI. I see:

CcmExec.exe
nvwmi64.exe (NVIDIA WMI host)
PerfWatson2.exe
svchost.exe (several instances)
WmiPrvSE.exe (several instances)
Various Google specific applications.

It would also be possible to scan these binaries or processes to look for WMI query strings. The class names have to be in a text form.

A list of perf classes can be found with this powershell command:

Get-WmiObject -Query 'SELECT * FROM meta_class WHERE __this ISA "Win32_Perf"'

Full names with this syntax:

Get-WmiObject -Query 'SELECT * FROM meta_class WHERE __this ISA "Win32_Perf"' | select "Name"

This powershell command triggers the hang but is a bit heavy-handed:

measure-command {Get-WmiObject -Query "SELECT * FROM Win32_Perf"}

Apparently perfproc.dll corresponds to these counters:
230 Process
232 Thread
786 Process Address Space
740 Image
816 Thread Details
1408 Full Image
1500 Job Object
1548 Job Object Details
1760 Heap
(IDs and counter names)

But they don't obviously map to the Win32_Perf* counters.

That eventually led me to Win32_PerfFormattedData_PerfProc_ProcessAddressSpace_Costly - querying this reproduces the problem. There are ten "Costly" counters (wot???) and that was the only one that mentions address space, and it gives identical results.

Win32_PerfFormattedData_PerfProc_FullImage_Costly
Win32_PerfRawData_PerfProc_FullImage_Costly
Win32_PerfFormattedData_PerfProc_Heap_Costly
Win32_PerfRawData_PerfProc_Heap_Costly
Win32_PerfFormattedData_PerfProc_Image_Costly
Win32_PerfRawData_PerfProc_Image_Costly
Win32_PerfFormattedData_PerfProc_ProcessAddressSpace_Costly
Win32_PerfRawData_PerfProc_ProcessAddressSpace_Costly
Win32_PerfFormattedData_PerfProc_ThreadDetails_Costly
Win32_PerfRawData_PerfProc_ThreadDetails_Costly

These costly classes actually appear and disappear - they can be enabled and disabled? Apparently some WMI code references a registry key called "Enable Costly Providers", in HKLM\SOFTWARE\Microsoft\WBEM\PROVIDERS\WmiPerf. And there's a function in the WbemPerfClass DLL called EnableCostlyProviders. Because of course.

Labels: -Pri-3 Pri-2
Summary: v8 triggered cfg fragmentation causing gmail hangs and significant memory waste (was: WmiPrvSE.exe causing Chrome hangs due to hogging of VM lock)
TL;DR v8's ASLR is getting along badly with cfg. The cfg reservation (http://www.alex-ionescu.com/?p=246) in the gmail process is heavily fragmented. This makes scanning it much more expensive, which is ultimately (with the associated lock conflicts) what makes gmail hang. This also seems to cause gmail to use a *lot* more page table memory - vmmap says that gmail uses 15-50x as much page table memory (147 MB versus 3 or 10 MB) as other processes. This excessive use of page tables represents a memory increase of about 30%, hidden from our metrics.

I've written a program that scans the virtual address space of a program using ZwQueryVirtualMemory. This program triggers hangs in gmail, just like the WMI scans. This has also let me experiment to try to understand what is going on. After testing on various processes I can confirm several things:

1) Scanning gmail's process takes ~100x longer than scanning most other processes
2) Part of this is because gmail has more memory regions than most processes - ~25x more than most
3) I suspect that this is because gmail has more cfg fragmentation than most (~15,000 fragments, versus much smaller numbers for other processes)

> VirtualScan.exe 16224 (wpa.exe, 64-bit, managed code)
Scanning took 0.125s at 16:10:04.
Scanned 16181 regions, 172 code regions, cfg alloc (0000000000000000) with 0 fragments.

> VirtualScan.exe 9172 (explorer.exe, 64-bit, cfg supported)
Scanning took 0.093s at 16:09:27.
Scanned 4778 regions, 298 code regions, cfg alloc (00007DF5FF150000) with 460 fragments.

> VirtualScan.exe 37592 (chrome.exe, browser process)
Scanning took 0.078s at 16:11:55.
Scanned 5462 regions, 166 code regions, cfg alloc (00007DF5FF3A0000) with 234 fragments.

> VirtualScan.exe 13520 (chrome.exe, docs, 79 MB JavaScript memory)
Scanning took 0.609s at 16:12:32.
Scanned 3073 regions, 70 code regions, cfg alloc (00007DF5FF040000) with 440 fragments.

> VirtualScan.exe 12424 (chrome.exe, gmail, 247 MB JavaScript memory)
Scanning took 13.766s at 16:10:23.
Scanned 25594 regions, 95 code regions, cfg alloc (00007DF5FF6B0000) with 15060 fragments.

Fragmenting the 2 TB cfg allocation

I don't know why WmiPrvSE.exe is triggering expensive scans, but I'm going to ignore that for now.
I don't know why the scanning in ZwQueryVirtualMemory is so slow, but we may be able to reduce cfg fragmentation by scaling back our ASLR aggressiveness, thus saving memory and avoiding these hangs, and probably making other virtual memory operations faster.
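For reference, the core of the scanner is just a walk of the target's address space with VirtualQueryEx (which wraps the NtQueryVirtualMemory/ZwQueryVirtualMemory call mentioned above). This is a stripped-down sketch of what the attached VirtualScan.exe does; the real tool also locates the cfg reservation and counts its fragments:

  #include <windows.h>
  #include <cstdio>

  // Walk the address space of `pid`, counting regions and executable regions.
  void ScanProcess(DWORD pid) {
    HANDLE process = OpenProcess(PROCESS_QUERY_INFORMATION, FALSE, pid);
    if (!process) return;
    size_t regions = 0, code_regions = 0;
    MEMORY_BASIC_INFORMATION info;
    for (char* addr = nullptr;
         VirtualQueryEx(process, addr, &info, sizeof(info)) == sizeof(info);
         addr = static_cast<char*>(info.BaseAddress) + info.RegionSize) {
      ++regions;
      if (info.State == MEM_COMMIT &&
          (info.Protect & (PAGE_EXECUTE | PAGE_EXECUTE_READ |
                           PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY))) {
        ++code_regions;
      }
    }
    printf("Scanned %zu regions, %zu code regions.\n", regions, code_regions);
    CloseHandle(process);
  }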

This seems related to the Mac bug  crbug.com/700928 

So far gmail is the only process I have found which is hit badly, but there may be others.

VirtualScan.zip (5.1 KB)
This issue seems to be caused by long-running gmail tabs. Either v8 is fragmenting its code pages or cfg is failing to clean up cfg memory and associated page tables, thus making this seem ever more like  crbug.com/700928 .

Here are before/after scans from closing gmail and reopening it (refreshing was not sufficient) - the pid changed from 12424 to 40936:

> VirtualScan.exe 13520 37592 12424
Scanning 13520 took 0.625 s, found 3013 regions, 70 code regions, cfg alloc (00007DF5FF040000) with 444 fragments.
Scanning 37592 took 0.110 s, found 5560 regions, 166 code regions, cfg alloc (00007DF5FF3A0000) with 234 fragments.
Scanning 12424 took 13.812 s, found 26471 regions, 82 code regions, cfg alloc (00007DF5FF6B0000) with 15152 fragments.

> VirtualScan.exe 13520 37592 40936
Scanning 13520 took 0.610 s, found 3013 regions, 70 code regions, cfg alloc (00007DF5FF040000) with 444 fragments.
Scanning 37592 took 0.093 s, found 5563 regions, 166 code regions, cfg alloc (00007DF5FF3A0000) with 234 fragments.
Scanning 40936 took 0.063 s, found 2912 regions, 74 code regions, cfg alloc (00007DF5FFA40000) with 82 fragments.

The results remained unchanged (in scanning speed and number of cfg fragments) after gmail had been running for several minutes, so the fragmentation must take a while to accumulate.

The cfg block accounts for 15,000 of the additional memory regions but there are still others. Those are mostly in the Private Data section of vmmap, which represents memory allocated by VirtualAlloc. This was about 8,100 before restarting gmail and just 1,600 after. This suggests that maybe there is a v8 leak which then forces cfg to map more regions. I have attached two vmmap files (which can be loaded on Windows with Sysinternals' vmmap) which were recorded from a long-running gmail instance and the restarted tab (closed and reopened).

gmail_longrunning.mmp (16.9 MB)
gmail_restarted.mmp (1.7 MB)
Thanks for the interesting investigation! It is a bit surprising that the system allocates CFG pages even though CFG is disabled for chrome.exe (at least that's what http://www.alex-ionescu.com/?p=246 shows).

I am wondering whether it would help if we changed permission with VirtualProtect rather than with VirtualAlloc (possibly with PAGE_TARGETS_NO_UPDATE, as described by https://msdn.microsoft.com/en-us/library/windows/desktop/aa366786(v=vs.85).aspx).
cfg is enabled for chrome - for 64-bit Chrome anyway. dumpbin /headers on chrome.exe says:
            C160 DLL characteristics
                   ...
                   Control Flow Guard

I think the first step is to understand why this happens on some machines/pages only. I've tested on a few coworkers' machines using VirtualScan.exe and this does not happen for them, even after gmail has been running for days. On my machine, in contrast, gmail accumulates cfg fragments while sitting idle. I closed and reopened the gmail tab about 14 hours ago and it has accumulated ~3,000 fragments and now takes 3.4 s to scan, most of which is scanning the cfg region. Here are my results for a few minutes after restart, a few hours of idle time after restart, and ~14 hours of idle time after restart (with updated output):

Scanning 40936 took 0.062 s, found 3139 regions, 75 code regions, cfg alloc (00007DF5FFA40000) with 82 fragments.
Scanning 40936 took 1.250 s, found 4888 regions, 79 code regions, cfg alloc (00007DF5FFA40000) with 1000 fragments.
Scanning 40936 took 3.476 s, 3.433 s cfg scanning, found 7210 regions,  84 code regions, 2904 cfg fragments.

My coworkers' machines' gmail tabs stay under 100 fragments after several days of uptime, also on M68. Mine, with 2904 cfg fragments, has accumulated 39 MB of page tables (~32 MB more than 'normal').

The good news is that this bug is genuinely rare. The bad news is that it is still a real issue and we don't yet know what triggers it. Any ideas on what sort of gmail or v8 experiment could cause this or how I could figure out what it is?

I've attached the source for an updated VirtualScan tool. It lets you go "VirtualScan.exe chrome.exe" and it improves the formatting. Scanning all of chrome.exe just proves that only gmail is affected at all by this.

So, only gmail, only on my machine, and only with M68.

VirtualScan.zip (6.0 KB)
I now have a full end-to-end repro. I've created one program (VAllocStress.exe) which allocates 2,000 PAGE_EXECUTE_READ blocks of memory and then frees them. I then run "VirtualScan.exe VAllocStress.exe" and it scans the memory of the target process. Typical results from the scan are shown below:

    Printing all processes that take more than 0.060 s to scan.
    Scanning all processes with names that match "VAllocStress.exe"
    Scanning 52048 took 9.086 s, 9.080 s cfg scanning, found 4104 regions,   7 code regions, 4020 cfg fragments.

Meanwhile, after initialization VAllocStress.exe sits in a loop alternating between Sleep(500) and allocating/freeing a block of executable memory. It monitors how long the allocation takes (usually very little time) and prints a message whenever it takes longer than 500 ms. Whenever it is scanned VirtualAlloc takes a long time and this warning is triggered. Its output looks like this:

    pid is 52048
    Finished initialization.

    VirtualAlloc took 8.250s at 13:24:18.
    VirtualAlloc took 8.641s at 13:24:37.
    VirtualAlloc took 7.375s at 13:25:08.
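The monitoring loop that produces those messages is roughly this (a sketch rather than a verbatim excerpt of the attached VAllocStress.cpp; the block size here is a placeholder):

  #include <windows.h>
  #include <cstdio>

  // Runs after the 2,000-block initialization described above: repeatedly
  // allocate and free a block of executable memory and warn if the
  // allocation itself stalls (e.g. because a WMI scan holds the VM lock).
  void MonitorLoop() {
    while (true) {
      Sleep(500);
      ULONGLONG start = GetTickCount64();
      void* p = VirtualAlloc(nullptr, 64 * 1024, MEM_RESERVE | MEM_COMMIT,
                             PAGE_EXECUTE_READ);
      ULONGLONG elapsed_ms = GetTickCount64() - start;
      if (p)
        VirtualFree(p, 0, MEM_RELEASE);
      if (elapsed_ms > 500)
        printf("VirtualAlloc took %.3fs\n", elapsed_ms / 1000.0);
    }
  }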

Note that when VAllocStress.exe has finished initialization its private working set consists of 10 MB of CFG memory, 28 MB of page tables, and just ~2 MB of actual data - so it is more than 90% waste.

These symptoms perfectly match what is happening with gmail so it is reasonable to assume that the root cause has been identified:
1) v8 is allocating and then freeing blocks of executable memory at random locations, perhaps due to gmail dynamically loading and unloading scripts or ???
2) Windows (Windows 10 16299) is allocating CFG regions but never freeing them

That's it.

Fixing this will avoid hangs for those users who endure WmiPrvSE.exe scans. For other users it will save memory - sometimes a considerable amount of memory. In one real case - gmail in Chrome M68 - the process private working set included 35 MB of CFG and ~140 MB of CFG-related page table data, out of a total of 718 MB. This means that CFG was accounting for ~24% of private working set.

On some pages this issue seems to make no difference, but on gmail it matters a lot.

Source code to both test programs is attached (VirtualScan has been updated).

It's still not clear why this seemed to appear suddenly in M68. I tried running Chrome with --js-flags="--no-write-protect-code-memory" but it doesn't seem to be helping. CFG fragments are gradually accumulating (although not yet to catastrophic levels, only 904 so far, taking 1.17 s to scan).

Fixing this would mean somehow avoiding using new regions for generated code - regions would have to be reused as much as possible. Note that when I scan the gmail process I generally see fewer than 100 code blocks, while the number of CFG fragments imply that ~7,000 code blocks have been freed over the lifetime of the process.

Testing on Windows 10 17134 (April 2018 update, latest version) suggests that the performance problem on scanning has been fixed, while the memory wastage has not. That is, scanning the process is now very fast but Private WS remains high. So, going forward this is primarily a memory optimization.

I think my work here is done so I'm reassigning this. Of course I'm happy to help with further investigations.

VAllocStress.zip (4.9 KB)
VirtualScan.zip (6.5 KB)
Owner: hpayer@chromium.org
Actually reassigning...
It looks like this issue may still be occurring in newer versions of Win8 and Win8.1 according to sampling profiler data. Here are the raw counts of the number of cases we saw in the last week having >= 1 second equivalent worth of samples in v8::internal::SetPermissions(), split by version of KernelBase.dll, where we saw at least 5 such cases. Note the large number of cases in 6.2.17134.165.

10.0.18204.1001 9
10.0.17730.1000 5
10.0.17723.1000 8
10.0.17713.1002 6
10.0.15063.726 6
10.0.10586.1478 34
10.0.10240.17889 12
6.3.9600.18938 254
6.3.9600.18666 6
6.3.9600.17415 9
6.3.9600.17031 6
6.3.9600.16656 5
6.2.17134.1 6
6.2.17134.165 570
6.2.17134.112 18
6.2.16299.492 76
6.2.16299.402 25
6.2.16299.371 6
6.2.16299.309 13
6.2.16299.15 11
6.2.15063.1155 118
6.2.15063.1029 9
6.2.14393.2363 10
6.2.14393.2189 21
6.2.10240.17394 73
6.1.7601.24168 83
6.1.7601.24150 48
6.1.7601.24117 8
6.1.7601.23915 14
Labels: Performance-Memory
Status: Assigned (was: Started)
Adding memory tag due to this bug using 23-24% of private working set on long-running gmail tabs (my second workstation is currently using 50 MB of CFG private working set, 80 MB of page tables (70 MB above normal) out of 519 MB of total private set, so 23%).

vmmap is also noticeably slower to scan these gmail processes since it is scanning the address space.

Issue 871820 has been merged into this issue.
Bruce,
can you confirm that --js-flags="--no-write-protect-code-memory" is not fixing the issue?
I can confirm that it is not fixing the issue. This makes sense because the issue is not caused by the particular protection settings that we are using, it is caused by ASLR - by the diversity of code addresses that we are using.

The bug is currently tagged as FoundIn-68 but I don't know that for sure. It showed up around the same time but me noticing it may have been triggered by a WMI scanning change by Google's winops team. Or it may have been triggered by some other change to Chrome, or a gmail change.

I am positive that v8 is using blocks of executable memory, then freeing them and using different blocks of executable memory, and Microsoft's CFG system supports this *really* badly.
Are you running 64 bit or 32 bit?

On 64 bit we have a code range, i.e. code range virtual memory will always be executable and will never be used for regular objects.

On 32 bit we don't have that and every block of memory can be used for everything, code and regular objects. 
This is a 64-bit only issue. The 2 TB in-proc CFG reservation only appears in 64-bit builds of Chrome, for obvious reasons.

The vmmap files (attached earlier) show a level of CFG reservation fragmentation that can only happen if the code range is enormous. How big is the code range? I would guess (and I can test) that a few GB range would probably be okay, but a TB or more would not be.

That said, I'm not exactly sure how code addresses map to CFG addresses. I could investigate that.

Cc: titzer@chromium.org
The V8 code range should be at most 256M (https://cs.chromium.org/chromium/src/v8/src/globals.h?q=kMaximalCodeRangeSize&dr=CSs&l=199).

Could this come from WASM? Adding titzer and mstarzinger.
I just tested with a 36-bit code range and the scanning speed is still fine. With a 40-bit code range I started seeing a slowdown.

The actual transition point will depend on how many code blocks are allocated (I used 3,000) but a 256 M range (28 bits) should be *well* within the safe zone.

So WASM is looking like a likely culprit.

I tested by adding the "offset & " line to VAllocStress.cpp and then running VirtualScan to see how long the scanning took. In the code below I am masking the offset down to 24 bits and it is then multiplied by 4096 to give a 36-bit range. With one more 'F' in the masking constant the scanning started to get slower.

        size_t offset = 0;
        for (int i = 0; i < 3; ++i)
        {
          // Generate a random 36-bit number.
          offset = offset * 4096 + (rand() & 4095);
        }
        // Mask down to 24 bits; multiplied by 4096 below this gives a 36-bit
        // address range (one more 'F' in the mask gives a 40-bit range).
        offset &= 0xFFFFFF;
        // Generate a page-aligned address (48-bit range without the mask).
        char* addr = (char*)nullptr + offset * 4096;

I just got a twitter-DM report from a chrome user who is seeing the issue on a Google Sheets tab on M67.

We already had reports that Google Docs was affected (I haven't been doing an Google Docs editing which is probably why I hadn't noticed) but this is the first report I've seen of it happening on M67.

This issue may have been around for quite a while - it just wasn't noticeable until the WmiPrvSE.exe scanning turned it from a memory problem into a hang problem.

Is gmail deploying WASM modules? If their use case is to create many small modules and throw them away then there are going to be performance issues for them.

If we can get a JS/wasm repro then we can work on tuning on our end. E.g. we could do pooling of WASM code memory regions to avoid creating/freeing a lot of them, or find a way to reduce the size of ranges that we allocate.
Kind of surprising that Gmail would use WASM (a lot), no?
Maybe an extension, or something that uses asm.js?
Bruce, in your repro do you only load Gmail? No other tabs? Any extensions?

Since this sounds more like a WASM issue I will assign it to titzer.
Owner: titzer@chromium.org
I have lots of tabs open but only the gmail tab shows the problem (my docs tabs are sitting idle which I assume is why they don't trigger the issue). I don't think I have any relevant extensions or experiments - I mostly just have the Google corporate extensions.

Is there a WASM test page I can use to test this? Perhaps one that allocates and frees modules? For mitigations it should work fine to restrict the WASM code memory regions to a range of addresses - 256 M like v8 does perhaps. Even 1 G should work fine. It's when the code blocks end up scattered over hundreds of GB or TB that things go south.

Re #33: Have you checked with Chrome's Task Manager whether your GMail renderer is being shared with other frames?  I often see google.com auth frames sharing that renderer, and those do tend to be short-lived.
Good point - my gmail tab is being shared with a couple of google.com subframes. Having those appearing and disappearing in the gmail process could explain the creation and destruction of code blocks.

That would then explain why this issue showed up recently - site isolation. However site isolation would then be serving as a catalyst for the bug. The root cause is still presumed to be WASM spreading out its code memory allocations over too broad a range.

> The V8 code range should be at most 256M

TL;DR - that's great, but if you have enough CodeRange objects then it's still a problem, and you do (have enough CodeRange objects to be a problem).

I just did a custom build of Chrome that prints the PID whenever the CodeRange constructor runs. I then loaded gmail and a few other tabs. So far the gmail tab's process has called the CodeRange constructor nine times in about ten minutes, with no sign of it stopping. There is one subframe in that process but I don't know if that is relevant.

These may well be temporary objects, but that doesn't matter, since the root cause of this bug is that the cfg memory associated with temporary code blocks is never freed.

My guess is that we need a mechanism to ensure that the address ranges used by deleted CodeRange objects are recycled, instead of each CodeRange object selecting a random address range.

The steady creation of CodeRange objects seems to happen with and without subframes sharing the gmail process, although it seems to be faster with subframes. However I'm seeing a new CodeRange object created every few minutes in an idle gmail process with no shared subframes.

So this is back to being a v8 problem, but WASM should also make sure that they aren't vulnerable.

This is the code I added:
  char buffer[1000];
  sprintf_s(buffer, "In CodeRange constructor for process %d\n",
            GetCurrentProcessId());
  OutputDebugStringA(buffer);

I then ran under windbg and periodically piped the output through grep.

Owner: hpayer@chromium.org
The CodeRange constructor is being called ~30 times an hour (measured over a two hour period). But where is the CodeRange constructor being called from?

00 chrome_child!v8::internal::CodeRange::CodeRange
01 chrome_child!v8::internal::MemoryAllocator::MemoryAllocator
02 chrome_child!v8::internal::Heap::SetUp
03 chrome_child!v8::internal::Isolate::Init
04 chrome_child!v8::internal::Snapshot::Initialize
05 chrome_child!v8::Isolate::Initialize
06 chrome_child!gin::IsolateHolder::IsolateHolder
07 chrome_child!blink::V8PerIsolateData::V8PerIsolateData
08 chrome_child!blink::V8PerIsolateData::Initialize
09 chrome_child!blink::WorkerBackingThread::InitializeOnBackingThread
0a chrome_child!blink::WorkerThread::InitializeOnWorkerThread
0b chrome_child!base::internal::Invoker<>::Run
0c chrome_child!`anonymous namespace'::DiscardDeviceInfosAndCallContinuation
0d chrome_child!base::internal::Invoker<>::RunOnce
0e chrome_child!base::debug::TaskAnnotator::RunTask
0f chrome_child!base::sequence_manager::internal::ThreadControllerImpl::DoWork
10 chrome_child!base::debug::TaskAnnotator::RunTask
11 chrome_child!base::MessageLoop::RunTask
12 chrome_child!base::MessageLoop::DoWork
13 chrome_child!base::MessagePumpDefault::Run
14 chrome_child!base::RunLoop::Run
15 chrome_child!base::Thread::ThreadMain
16 chrome_child!base::`anonymous namespace'::ThreadFunc
17 KERNEL32!BaseThreadInitThunk
18 ntdll!RtlUserThreadStart

The debug::TaskAnnotator says that it was posted from Start in "../../third_party/blink/renderer/core/workers/worker_thread.cc", line 127.

I don't know if that is always the call stack, but I've hit it several times. If so then what is happening is that a worker thread is created (line 127 is inside of WorkerThread::Start) which then creates an IsolateHolder which creates a CodeRange which gets used for some generated code and then discarded. All would be fine except that the CFG reservation is permanently disrupted.

A possible fix would be to create a process-global map of addresses that have been used by CodeRange objects. On destruction they would lock this map and add their reservation address to it. On construction they would look for an entry in the list and remove and use it if found. This will solve this problem. There are other possible fixes but they all circle back to making sure that new CodeRange objects reuse the code address ranges used by old CodeRange objects, when possible.
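A hypothetical sketch of that map (names invented here; the fix that eventually landed uses a similar singleton inside V8's heap code):

  #include <mutex>
  #include <vector>

  // Hypothetical process-global recycler for code range base addresses.
  // A CodeRange adds its reservation address here on destruction; a new
  // CodeRange asks for a hint first and only falls back to a fresh random
  // address if no previously used address is available.
  class CodeRangeAddressHints {
   public:
    void NotifyFreed(void* address) {
      std::lock_guard<std::mutex> lock(mutex_);
      freed_.push_back(address);
    }
    // Returns a previously used base address, or nullptr if none is recorded.
    void* GetHint() {
      std::lock_guard<std::mutex> lock(mutex_);
      if (freed_.empty()) return nullptr;
      void* hint = freed_.back();
      freed_.pop_back();
      return hint;
    }
   private:
    std::mutex mutex_;
    std::vector<void*> freed_;
  };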

But who is calling WorkerThread::Start? Here's one call stack:

00 chrome_child!blink::WorkerThread::Start
01 chrome_child!blink::WebEmbeddedWorkerImpl::StartWorkerThread
02 chrome_child!blink::WebEmbeddedWorkerImpl::OnShadowPageInitialized
03 chrome_child!blink::FrameLoader::FinishedParsing
04 chrome_child!blink::Document::FinishedParsing
05 chrome_child!blink::HTMLDocumentParser::end
06 chrome_child!blink::HTMLDocumentParser::Finish
07 chrome_child!blink::DocumentLoader::FinishedLoading
08 chrome_child!blink::Resource::DidAddClient
09 chrome_child!blink::RawResource::DidAddClient
0a chrome_child!blink::Resource::FinishPendingClients
0b chrome_child!blink::TaskHandle::Runner::Run
0c chrome_child!base::debug::TaskAnnotator::RunTask
0d chrome_child!base::sequence_manager::internal::ThreadControllerImpl::DoWork
0e chrome_child!base::debug::TaskAnnotator::RunTask
0f chrome_child!base::MessageLoop::RunTask
10 chrome_child!base::MessageLoop::DoWork
11 chrome_child!base::MessagePumpDefault::Run
12 chrome_child!base::RunLoop::Run
13 chrome_child!content::RendererMain
14 chrome_child!content::ContentMainRunnerImpl::Run
15 chrome_child!service_manager::Main
16 chrome_child!content::ContentMain
17 chrome_child!ChromeMain

This was posted from DoWork in "../../base/task/sequence_manager/thread_controller_impl.cc", line 202. I assume that this is enough information now?

Assigning back to hpayer@ for v8.

Nice! It seems like workers are created and destroyed over and over again. Indeed, pooling the code ranges would solve the problem.
Owner: u...@chromium.org
Thanks, Bruce. Awesome investigation!

I agree that some form of code range pooling is necessary.
Owner: mstarzinger@chromium.org
Thanks, Bruce.

So as far as I understand from skimming back through the comments, a large number of code space reservations are created and destroyed, and to avoid this we need to pool allocations (in the WasmCodeManager). This will be easier now that we have a shared code manager for the entire V8 process. As discussed with mstarzinger@, we will recycle the last code space reservation in the code manager and decommit the memory using madvise() or VirtualAlloc() to free the underlying physical pages.
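For reference, decommitting while keeping the reservation looks roughly like this (a sketch, not the actual WasmCodeManager change; on Windows the decommit itself is done with VirtualFree and MEM_DECOMMIT):

  #include <cstddef>
  #if defined(_WIN32)
  #include <windows.h>
  #else
  #include <sys/mman.h>
  #endif

  // Free the physical pages behind a retired code space while keeping its
  // address range reserved so the next code space can reuse it.
  void DecommitCodeSpace(void* base, size_t size) {
  #if defined(_WIN32)
    VirtualFree(base, size, MEM_DECOMMIT);
  #else
    madvise(base, size, MADV_DONTNEED);
  #endif
  }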


If it makes any difference, you could actually fully release the code space reservations, as long as you reuse previously used addresses. Not releasing the reservation is one way of doing that, and perhaps the simplest, but just remembering your previously used reservation addresses and reusing them is also sufficient.

> we will recycle the last code space reservation

This sounds like just the last code space reservation will be reused, which will probably solve this particular problem, but I can imagine a case where a page creates eight worker threads, then destroys them, and then creates eight again, repeating. If we just recycle the last code space reservation then we will allocate seven new address spaces each time, and the issue will reappear. We need to record and (attempt to) reuse all of the CodeRange reservation addresses, not just the last one.

Just clarifying what is actually needed to resolve this so that we get a correct solution the first time.

It's good to hear that we've got the structures in place to help us solve this.
I'm seeing the accumulation of cfg regions in g+ as well.
Thanks for the clarification, Bruce.

I guess if we just keep a list of the last-freed code range addresses (easy with a process-wide code space manager), we can try them as hints to the virtual memory allocator. So, as you point out, we don't need to actually keep the reservations.

Do we need mitigations for 68 and 69? Before 70, WASM still does not have a process-wide code space manager. If we need mitigations in 68 or 69 we can introduce a separate per-process list of hints, but that would be a one-off solution for these versions.

> Do we need mitigations for 68 and 69?

We can probably manage without, although I'm not certain what made this issue suddenly show up. Is it just that the scanning is new? Or is the use of worker threads new? Or is it a Chrome change? I'm not sure.

Most people will not hit the hangs, if only because lots of customers are either on Windows 7 or the latest Windows 10 which are both immune to the hangs (no cfg in Windows 7, fast scanning in the latest Windows 10). The memory waste is still an issue for the latest Windows 10 users but I would guess that it is not severe enough to justify a back port, unless porting it back would be fairly straightforward.

This is *probably* irrelevant, but I wanted to double check and record the results. When I run "virtualscan chrome.exe" I get (trimmed) results like these:

Scanning  9556 took 0.386 s, 0.337 s cfg scanning, found  5512 regions, 168 code regions,   220 cfg fragments.
Scanning  9644 took 0.281 s, 0.254 s cfg scanning, found   399 regions,  38 code regions,    42 cfg fragments.
Scanning  9384 took 0.209 s, 0.193 s cfg scanning, found   401 regions,  40 code regions,    40 cfg fragments.
Scanning 10824 took 0.143 s, 0.090 s cfg scanning, found  6679 regions, 123 code regions,   170 cfg fragments.
Scanning 11328 took 12.418 s, 12.369 s cfg scanning, found 23025 regions,  83 code regions, 12716 cfg fragments.
Scanning 12904 took 0.114 s, 0.103 s cfg scanning, found  1465 regions,  67 code regions,    78 cfg fragments.
Scanning  3528 took 0.352 s, 0.339 s cfg scanning, found  2790 regions,  69 code regions,   286 cfg fragments.
Scanning 15140 took 0.427 s, 0.411 s cfg scanning, found  2286 regions,  73 code regions,   308 cfg fragments.
Scanning 15260 took 0.161 s, 0.147 s cfg scanning, found  1241 regions,  69 code regions,    84 cfg fragments.
Scanning 13936 took 0.063 s, 0.053 s cfg scanning, found  1709 regions,  69 code regions,    82 cfg fragments.
Scanning 15568 took 0.169 s, 0.156 s cfg scanning, found  1451 regions,  69 code regions,    82 cfg fragments.

The number of cfg fragments for the gmail process and the amount of time spent scanning the cfg section is obvious. That's this bug. But, the number of regions/fragments in the process as a whole is also significant. Even after we subtract out the 12716 cfg fragments the gmail process still has 10309 fragments, which seems like a lot.

Those fragments are *not* contributing to the slow scanning. However they *may* be contributing to the page-table cost. The blocks are mostly 512 KB which suggests they are coming from v8, with a fair number of 1280 KB allocations and a long-tail of others.

I decided to crudely check whether there were enough of these private data allocations to affect the size of the page tables. I grabbed the base addresses for the ~840 address reservations and committed and used 512 KB at each address. This is an imperfect approximation (many of the regions were subdivided) but should be fairly accurate from a page-table point of view since each low-level page-table can map 2 MiB as easily as 4 KiB.
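The experiment amounted to something like this (a sketch; reservation_bases stands in for the ~840 base addresses harvested from the earlier scan):

  #include <windows.h>
  #include <cstring>
  #include <vector>

  // Recreate the observed private-data layout in a test process: commit and
  // touch 512 KB at each previously observed reservation base so the
  // page-table cost of that layout can be measured with vmmap.
  void ReplayLayout(const std::vector<void*>& reservation_bases) {
    const size_t kBlock = 512 * 1024;
    for (void* base : reservation_bases) {
      void* p = VirtualAlloc(base, kBlock, MEM_RESERVE | MEM_COMMIT,
                             PAGE_READWRITE);
      if (p)
        memset(p, 1, kBlock);  // touch every page so it is actually mapped
    }
  }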

vmmap says that the resulting page table was about 9.5 MB for 428 MiB of data so the ~10,300 fragments/regions are not a sign of trouble and can be ignored.

Good news: the crazy page-table costs that I have been seeing in vmmap appear to just be a vmmap bug. It is somehow miscalculating how much page-table memory is used so the actual number is orders of magnitude smaller.

Bad news: the CFG memory cost seems to keep growing indefinitely. After six days of gmail sitting open I am seeing 250 MB of CFG memory, from 37,000 blocks (and ~34 s of scanning on RS3 when WMI does its thing, but most customers won't hit that).

I asked the memory team what they think about the need for a back-port of the fix, but I guess we can wait and see what the fix looks like before deciding.

Owner: u...@chromium.org
Thinking about this more, if we go with the "caching the code range hint" strategy, aren't we effectively disabling ASLR for code ranges? If that's true then we need to make a security judgment here and/or think carefully about whether/how much we need to preserve the entropy introduced for ASLR.

#c46: if this is a vmmap bug, does that mean that there will be a Windows kernel fix?

Re #c45 it seems like most of the mappings are coming from V8's codespace pages and not due to WASM. As such, reassigning back to ulan@, but mstarzinger@ and I will keep monitoring this bug and offer help as necessary.
If the issue happens only on 64-bit, could it be that the leak happens due to fragmentation in high-level page tables? (We had similar issue before on MacOS: https://chromium-review.googlesource.com/c/v8/v8/+/558876)

If that is the case, then we could limit the "code range hints" to a 4GB (or 8GB) virtual address region that is randomly selected at process startup. That would be a relatively simple fix without disadvantages of caching.

Bruce, Ben, wdyt?
Here is a POC of the proposed workaround: https://chromium-review.googlesource.com/c/v8/v8/+/1174716

It generates random hints within a fixed 8GB region. If the hint points to an address that was already taken, then the code range is created with the address chosen by the OS.
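A sketch of that hint generation (constants and names are placeholders; the real logic is in the linked CL):

  #include <cstdint>
  #include <random>

  // Pick a random, 64 KiB-aligned hint inside one fixed 8 GiB region chosen
  // at process startup. If the hinted address is already taken, the caller
  // lets the OS choose the address instead.
  const uint64_t kRegionSize = 8ULL << 30;  // 8 GiB
  const uint64_t kAlignment = 64 * 1024;    // 64 KiB

  void* GetCodeRangeHint(uint64_t region_base, std::mt19937_64& rng) {
    std::uniform_int_distribution<uint64_t> slot(0, kRegionSize / kAlignment - 1);
    return reinterpret_cast<void*>(region_base + slot(rng) * kAlignment);
  }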


Another idea is to have one large global code range that serves code allocations from all isolates. That would be a major architectural redesign, but it would allow us to support a lot more workers than we currently do.
> aren't we effectively disabling ASLR for code ranges?

I don't think so. There will still be multiple code ranges in one process, and these code ranges will be unpredictable between processes. I don't think ASLR gets much additional benefit from not reusing address ranges. Looked at from a native code point of view, Windows tries to load particular DLLs in the same location across processes and across time, only randomizing the location on reboots or when all instances of the DLL have been unloaded across the entire system. So, we would still be more random than the Windows ASLR.

> if this is a vmmap bug, does that mean that there will be a Windows kernel fix?

The only vmmap bug is the miscalculation of the page-table cost. I now have numbers for the actual page table cost. From my 7-day running gmail tab I now have 317 MB of CFG commit and it takes 54 MB of page tables to map that. vmmap reports much higher page-table numbers but I now know to ignore those numbers.

It is *possible* that Microsoft will decide to change the kernel to free CFG memory when the executable blocks are freed, but that would be a year in the future if it ever happens.

> we could limit the "code range hints" to a 4GB (or 8GB) virtual address region

That would help. The amount of CFG memory used is 1/64th of the code range (subject to page quantization) so a 4 GB virtual address region would use at most 64 MB of CFG memory (and very few page tables). Still a lot, but an improvement. However this would suggest that only 16 CodeRange objects could be created simultaneously. Reusing addresses seems simpler to me since it doesn't add any arbitrary limitations and will be as efficient as possible.

> could it be that the leak happens due to fragmentation in high-level page tables?

This is happening, as a secondary effect. Our CodeRange objects are spread out over the entire address space which then causes CFG allocations to be spread out over the entire 2 TiB CFG address space which then requires a lot of page-table entries because the addresses used are so spread out. However the first-level problem is the CFG memory usage. The page tables to back this are (in my most recent test) about 20% the size of the CFG memory.

I'll look at the proposed workaround now.
I'm not qualified to evaluate this next suggestion, but should we consider turning off CFG? The code that v8 generates is not making use of CFG (all addresses are tagged as legal indirect branch targets, I believe) and chrome.exe, chrome.dll, and chrome_child.dll are not making use of CFG (for various reasons we aren't exporting a list of valid indirect branch targets) so the only benefit is whatever Microsoft DLLs are built with CFG.

That is, we are paying the price for this CFG bitmap but only getting a modest security benefit from it. Thoughts?
Re-using code range address turned out to be simpler than I expected: https://chromium-review.googlesource.com/c/v8/v8/+/1174837


Cc: palmer@chromium.org
+Chris for security-perspective of turning off CFG (comment #52).
Note: the CodeRange size for 64-bit Windows is 128 MiB, not 256 MiB - https://cs.chromium.org/chromium/src/v8/src/globals.h?q=kMaximalCodeRangeSize&dr=CSs&l=182 - crrev.com/c/1044195.

It looks like the proposed CL might fail to allocate memory if the random address selected is not available. That is, the hint may be a bit too strong. And, it looks like the hint will not give 256 MiB aligned memory. But, I'm not sure how the underlying functions work so maybe I'm just misunderstanding it.

As mentioned in the previous comment, restricting all code to an 8 GiB range will cap the CFG memory at 8 GiB/64 which is 128 MiB, which is still quite a lot. Whether we hit that limit depends whether we use all of the 128 MiB CodeRange ranges available inside the 8 GiB range, and how we generate code in each of them. In the gmail context we will quickly cycle through all 64 available 128 MiB address ranges, so the amount of CFG memory allocated will depend on how we use each CodeRange.

Here's how the math works: CFG memory is, not surprisingly, allocated in 4 KiB blocks. Each of these controls access to an aligned 256 KiB block of executable memory (4 KiB times 64). So a 128 MiB address range may commit somewhere between 1 and 512 CFG pages. This is a pretty big range. So, it is probably the case that the amount of CFG memory consumed depends as much on how we allocate memory within each CodeRange as how we allocate CodeRange objects. It looks like we randomize addresses within CodeRange objects as well as randomizing the CodeRange addresses and that is why our CFG memory usage keeps increasing.

I just looked at the vmmap data for my 7-day gmail process. It shows three 128 MiB executable regions (view Private Data, sort by reservation size). This suggests that if gmail reused CodeRange addresses then the CFG impact would be greatly minimized because there would only be three or four different CFG address ranges allocated over the lifetime of the process, for a maximum CFG impact of 4 * 128 MiB / 64 = 8 MiB - tiny.
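That arithmetic as a small helper, using the constants from the two paragraphs above (a sketch for estimating the worst case, not anything Windows exposes):

  #include <cstdint>

  // One 4 KiB CFG bitmap page covers an aligned 256 KiB block of executable
  // address space, so the bitmap costs at most 1/64th of the covered range.
  uint64_t MaxCfgBytesForCodeRange(uint64_t code_range_bytes) {
    const uint64_t kCfgPage = 4096;
    const uint64_t kCoveredPerCfgPage = 64 * kCfgPage;  // 256 KiB
    uint64_t cfg_pages =
        (code_range_bytes + kCoveredPerCfgPage - 1) / kCoveredPerCfgPage;
    return cfg_pages * kCfgPage;
  }

  // Examples: a 128 MiB code range commits at most 512 CFG pages = 2 MiB;
  // four such ranges cap out at 8 MiB; an 8 GiB hint region at 128 MiB.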

BTW, the upper bound for this bug is terrifying - it would be 2 TiB of CFG memory and 4+ GiB of page tables to back it. But, I really think that if we can reuse CodeRange addresses then we can completely squash this bug, and the security implications should be minor to zero.


Chris - can you comment on the security implications of reusing CodeRange addresses? That is, do we reduce the benefits of ASLR significantly if new CodeRange objects try to allocate their code at the same location used by previous CodeRange objects?

If the random address selected is not available, then the code range should be allocated at another address selected by the OS (without randomization).

Considering this, it looks like the CL from comment 49 will allow an attacker to disable code range ASLR by creating a sufficient number of workers (e.g. more than 8).

I am in favor of the approach in comment 53 - reusing addresses.
Cc: penny...@chromium.org wfh@chromium.org
+wfh and pennymac, Windows platform security team experts.

If we reuse addresses, how much more performance benefit does turning off CFG get us? We save lots of memory in page tables? From what I understand #52 is correct, but let's believe Windows experts instead of me.

ASLR caused us a similar problem on macOS. (https://bugs.chromium.org/p/chromium/issues/detail?id=738925)
If we reuse addresses then the CFG cost from this bug will, in the cases that I am aware of, drop precipitously, so turning off CFG will not be needed. In the gmail case it looks like the maximum CFG memory cost will be ~8 MB (down from the ~370 MB I'm currently at after one week, still growing).

Leaving CFG enabled then gives us the option of enabling CFG fully for our EXEs and DLLs, and we could even start specifying which addresses are valid indirect branch targets in our generated code if that is worthwhile.

Just finished back reading.

I don't think we should disable CFG in our processes - we are currently leveraging forward cfi built into the system DLLs (which are otherwise very handy for exploit chains). Also, even in the clang world, we'll be moving towards adding more cfg/cfi, rather than less. This isn't going away.

Given Bruce's comments, especially in 55/58, it doesn't sound like there's a strong argument for it anyhow.

<Bruce just responded similarly now. :) >

Let's make the changes for address reuse, and we can make sure that msft is aware of the possibility of freeing kernel CFG memory longer term.
jarin@ mentioned PAGE_TARGETS_NO_UPDATE, has anyone checked if that mitigates the problem?

So far as I know, we are only enabling CFG in the linker so that Windows system DLLs have their indirect calls checked. I strongly suspect that a Windows system DLL should never indirectly call code generated by V8. We may need to re-evaluate this when we deploy CFG checks in Chrome.
I just tried PAGE_TARGETS_NO_UPDATE and PAGE_TARGETS_INVALID and both together and it made no difference in my test application. I tried a few variants - here is one version:

const int num_contig_allocs = 32;
char* null_char = nullptr;  // base pointer used to form the sparse addresses
size_t stride = 1024LL * 1024 * 1024; // 2 * 1024 * 1024;
size_t pages = 0;
for (size_t offset = stride; offset < 0x1000000000000; offset += stride)
{
	void* p = VirtualAlloc(null_char + offset, stride, MEM_COMMIT | MEM_RESERVE,
		PAGE_EXECUTE_READWRITE | PAGE_TARGETS_NO_UPDATE);
	if (p)
	{
		memset(p, 1, 4096);
		VirtualFree(p, 0, MEM_RELEASE);
		++pages;
		if (pages >= num_contig_allocs)
			break;
	}
}
printf("Allocated %zd sparse pages.\n", pages);
Sleep(1000000);

This code does 32 temporary allocations, each of 1 GiB of executable memory. This causes 512 MiB of CFG memory to be allocated, with and without the extra flags. So, good suggestion but it doesn't help. A flag to disable generation of CFG bits does seem like an obvious feature.

Great, thanks, pennymac. The plan in #59 SGTM.
Thanks all, I will go ahead and land the address reuse change:
https://chromium-review.googlesource.com/c/v8/v8/+/1174837
Project Member

Comment 64 by bugdroid1@chromium.org, Aug 15

The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/4d474c51d828b050b27c7152e567edc2fb5f9701

commit 4d474c51d828b050b27c7152e567edc2fb5f9701
Author: Ulan Degenbaev <ulan@chromium.org>
Date: Wed Aug 15 18:53:11 2018

[heap] Reuse freed CodeRange addresses.

This patch adds a singleton that tracks recently freed code range
regions and provides hints for newly created code ranges such that
the freed addresses are reused.

This is a workaround for the CFG leak described in the linked bug.

Bug:  chromium:870054 

Change-Id: Ice237a056268379f0fef40abdb1accad125a56b3
Reviewed-on: https://chromium-review.googlesource.com/1174837
Commit-Queue: Ulan Degenbaev <ulan@chromium.org>
Reviewed-by: Michael Lippautz <mlippautz@chromium.org>
Cr-Commit-Position: refs/heads/master@{#55139}
[modify] https://crrev.com/4d474c51d828b050b27c7152e567edc2fb5f9701/src/heap/spaces.cc
[modify] https://crrev.com/4d474c51d828b050b27c7152e567edc2fb5f9701/src/heap/spaces.h
[modify] https://crrev.com/4d474c51d828b050b27c7152e567edc2fb5f9701/test/unittests/heap/spaces-unittest.cc
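
For readers following along, here is a minimal sketch of the mechanism that commit describes: a process-wide cache of recently freed code-range base addresses that later reservations use as hints. The class and method names here are illustrative only, not the actual spaces.h/spaces.cc API:

// Hypothetical sketch of a freed-code-range address cache. Reusing a freed
// base address means the CFG bitmap pages already committed for that range
// are reused instead of new ones being committed elsewhere.
#include <mutex>
#include <vector>

class CodeRangeAddressHint {
 public:
  // Returns a previously freed base address to try first, or nullptr if
  // nothing has been freed yet.
  void* GetAddressHint() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (recently_freed_.empty()) return nullptr;
    void* hint = recently_freed_.back();
    recently_freed_.pop_back();
    return hint;
  }

  // Records the base address of a code range that was just released.
  void NotifyFreedCodeRange(void* base) {
    std::lock_guard<std::mutex> lock(mutex_);
    recently_freed_.push_back(base);
  }

 private:
  std::mutex mutex_;
  std::vector<void*> recently_freed_;
};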

Thanks for the fix!

We will have to decide whether it is worth trying to backport this to M69. In addition to the risk/complexity we need to understand the benefits. After eight days, this is what the stats look like on my gmail tab:

     Scan time,   Commit, page tables, committed blocks
Total: 40.657s, 1433 MiB,    67.1 MiB,  31640, 93 code blocks, in process 11328
  CFG: 40.654s,  351 MiB,    58.9 MiB,  24736

So, 410 MiB of RAM devoted to CFG memory (and still growing) and its associated page tables, with the normal value being about 10-27 MiB. That's pretty bad, but still seems to be anomalous - I haven't heard of a lot of other people hitting the problem. On the other hand, there was a report of this bug showing up in google docs, but I just checked and I have offline mode enabled in docs and I'm not seeing any CFG growth.

Unfortunately our memory metrics seem to be blind to this, probably because the CFG RAM shows up in the Shareable category even though it is ultimately private (theoretically shareable, but not in practice). For whatever reason our Memory footprint number seems to be omitting the CFG RAM, and without custom scanning tools it is easy to be hit by this bug without realizing it. So, ???

Source to the latest version of the scanning tool is attached.

VirtualScan.zip
6.8 KB Download
"probably because the CFG RAM shows up in the Shareable category even though it is ultimately private " - is it shareable with a share count of 1?
The fix is relatively simple. It should be safe to back merge once we get Canary coverage.
I got a pointer to the source that populates our Memory footprint column and I used that to confirm that we *do* count the CFG memory, and the page-table memory. So, this fix should show up on our memory metrics. It will probably not be significant on canary because it takes a while to leak a lot of memory, but the effect will be there and should show up more strongly on other channels.

Details:

The Memory footprint number comes from this code:

https://cs.chromium.org/chromium/src/services/resource_coordinator/public/cpp/memory_instrumentation/os_metrics_win.cc?type=cs&q=private_footprint+file:win+-file:src/out&sq=package:chromium&g=0&l=46

It's simply a matter of calling GetProcessMemoryInfo() and then printing the PrivateUsage field which is documented as "The Commit Charge value in bytes for this process", so everything that is backed by RAM or the page file. I added this to my scanning program and confirmed that this number perfectly matched what our task manager shows for gmail. I then ran it on my vallocstress.exe program which allocates 2053 MiB of CFG memory and basically nothing else. The PrivateUsage number showed 2052 MiB, so that's good.
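
For reference, a minimal standalone sketch of that measurement (not the resource_coordinator code itself); depending on the SDK configuration it may need psapi.lib linked:

// Print PrivateUsage (commit charge) for the current process - the same
// field the Memory footprint number is based on.
#include <windows.h>
#include <psapi.h>
#include <stdio.h>

int main() {
  PROCESS_MEMORY_COUNTERS_EX pmc = {};
  if (GetProcessMemoryInfo(GetCurrentProcess(),
                           reinterpret_cast<PROCESS_MEMORY_COUNTERS*>(&pmc),
                           sizeof(pmc))) {
    // PrivateUsage covers everything backed by RAM or the page file, which
    // (per the measurements above) includes the committed CFG bitmap pages
    // and the page tables that map them.
    printf("PrivateUsage: %.1f MiB\n", pmc.PrivateUsage / (1024.0 * 1024.0));
  }
  return 0;
}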

I then ran it on a different version of VAllocStress that allocates lots of sparse CFG fragments and therefore lots of page table entries. With 5,000 random code allocations of 4 KiB each (all freed) I end up with 24 MiB of CFG memory and 26.8 MiB of page tables to map the CFG RAM (yay sparsity!). PrivateUsage is listed as 54 MiB so it *looks* like the page tables are accounted for as well.
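
A rough model of why the sparse case is so expensive (granularities assumed, not measured exactly): each isolated 4 KiB code allocation commits about one 4 KiB CFG bitmap page, and because those bitmap pages are scattered across a very large reservation, most of them also need their own 4 KiB page-table page:

// Rough model of 5,000 isolated 4 KiB executable allocations.
#include <cstdio>

int main() {
  const int kAllocations = 5000;
  const double kPageMiB = 4.0 / 1024.0;            // 4 KiB expressed in MiB
  printf("Estimated CFG bitmap: ~%.1f MiB\n",      // ~19.5 MiB (24 observed)
         kAllocations * kPageMiB);
  printf("Estimated page tables: >= %.1f MiB\n",   // ~19.5 MiB plus higher-level
         kAllocations * kPageMiB);                 // tables (26.8 observed)
  return 0;
}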

As for why I thought that we weren't accounting for CFG memory, well, there are a lot of memory categories and it's hard to tell how they are added up, and vmmap overstates the Page Table category just to add extra confusion. I think this is the calculation, going through the vmmap rows:

Image:        Count Private bytes
Mapped File:  Not counted
Shareable:    CFG memory only (Private and Private WS are zero in the summary, Private is zero in the details pane, but Private WS is 332,312 KB so use that)
Heap:         Count Committed
Stack:        Count Committed
Private Data: Count Committed
Page Table:   Get data from VirtualScan tool and count that

It's also tricky because it takes about a minute for vmmap to scan the process so I have to monitor task manager during that time to get a sense of what it thinks is happening. 

Using this combination of vmmap data and VirtualScan for page tables I get this:
3432 + 0 + 332688 + 198500 + 1376 + 452464 + 68608 = 1057068
Meanwhile task manager was reporting 1066000 KiB, give or take a few hundred KiB.

I can't account for the missing 9 MiB, but oh well. Close enough. I would guess that some of the Image pages are marked as committed because they are copy-on-write pages in the data segment, but ???

Read through your blog and was wondering whether this (or something similar) also happens on other OSes.

Ever since I've used Chrome, GMail has been one of those tabs that keeps bloating day after day, without much activity. It is not uncommon to have 500MB+ GMail tabs after a week or so.
The only way to avoid it is to load the basic HTML version.

My experience has been that the Linux version also seems to consume more and more memory over time. This is true regardless of whether it's Ubuntu, openSUSE or ChromeOS. This is memory that I've repeatedly verified *doesn't get freed when you close all but one new tab.*
Hey ulan -- Can you confirm that on macOS, this CL https://chromium-review.googlesource.com/c/v8/v8/+/558876 will also restrict code page allocations to the same 4GB range? 
There are quite a few possibly relevant changes:

https://chromium-review.googlesource.com/c/v8/v8/+/558876 - "On MacOS the hints are confined to 4GB contiguous region."
https://chromium-review.googlesource.com/c/chromium/src/+/641979 - "Avoid leaking wired pages on macOS due to ASLR." but it masks down to 39 bits?
https://chromium-review.googlesource.com/c/v8/v8/+/557958 - "Confine mmap hints to a 32-bit region on macOS"

I can't tell how they all interact and whether these changes were already sufficient.

On Windows we decided to reuse CodeRange addresses rather than restrict ourselves to 4 GiB because address reuse is more efficient (it uses even less CFG RAM and page tables in most cases) but adds no additional constraints (we can have as many CodeRange objects as needed if that is what a page requests).

The OSX issue and this issue are different.

The OSX issue is about allocating pages within the same V8 instance in a loop.
This issue is about allocating CodeRanges in a loop (which is equivalent to creating V8 instances in a loop).

Re comment 70: within the same V8 instance code page allocations are restricted to 128MB (that's the purpose of CodeRange).

If multiple V8 instances (workers) are created, then their 128MB CodeRange regions can be reserved anywhere in the virtual address space.

The fix in comment 59 ensures that V8 instances allocated one after another reuse CodeRange addresses.


The OSX fix confines data page allocations to a 4GB region with base address Heap::mmap_region_base_. (Code page allocations are confined to 128MB by CodeRange.)

Similar to reuse of CodeRange across V8 instances, we might want to implement reuse of Heap::mmap_region_base_ for OSX.
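
To make that concrete, here is an illustrative sketch (my own, not the actual Heap::mmap_region_base_ code) of confining allocation hints to a single 4 GiB region chosen once per heap:

// Pick one 4 GiB region up front, then keep every allocation hint inside it
// so the kernel-side bookkeeping for code/data pages stays bounded.
#include <cstdint>
#include <random>

constexpr uint64_t kRegionSize = 4ULL << 30;  // 4 GiB

// Rough equivalent of choosing mmap_region_base_ once: a random,
// region-aligned base address in the lower part of the address space.
uint64_t ChooseRegionBase(std::mt19937_64& rng) {
  return (rng() % (1ULL << 46)) & ~(kRegionSize - 1);
}

// Returns an aligned hint inside [base, base + 4 GiB).
// alignment must be a nonzero power of two that divides kRegionSize.
void* HintWithinRegion(uint64_t base, uint64_t alignment, std::mt19937_64& rng) {
  uint64_t offset = (rng() % (kRegionSize / alignment)) * alignment;
  return reinterpret_cast<void*>(base + offset);
}

Reusing one such base across V8 instances, as suggested above, would then amount to sharing the base process-wide, analogous to the CodeRange address reuse on Windows.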




Project Member

Comment 75 by sheriffbot@chromium.org, Aug 21

Labels: -Merge-Request-69 Merge-Review-69 Hotlist-Merge-Review
This bug requires manual review: We are only 13 days from stable.
Please contact the milestone owner if you have questions.
Owners: amineer@(Android), kariahda@(iOS), cindyb@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Seems like this bug was found in 68. Why is it critical to merge to M69 this late in the release cycle?
govind@, it fixes a large memory leak that makes Chrome unusable on some pages.

More discussion on why we need to merge this back:
https://groups.google.com/a/google.com/forum/#!topic/chrome-memory/ryBujwq3u6o
The bug was first noticed in M68 but it is not clear why, so it may have existed before then and just been exposed by external changes. That is, changes to gmail or to Windows scanning at Google may have made the bug visible. Neither explanation is compelling so we really don't know.

The hangs that this bug causes have been reported by multiple Google employees in gmail, docs, and other pages that use service workers to implement offline mode.

The memory waste (400+ MB on one gmail tab that had been running for eight days, with every indication that the leak would have continued at that rate indefinitely) is not confined to Google. The memory leak will happen to anybody who uses a site with service workers, at an unknown rate.

The hang was the obvious symptom that then led to the investigation which found the memory leak.

So, reasons to merge the fix are:
- Avoid repeated gmail/docs hangs for some number of people, probably mostly Google employees
- Avoid significant memory leak for some larger number of people
- Fix is simple/safe - it doesn't actually cache resources (it just caches a memory address hint) so the risk of memory errors is particularly low

Cc: hablich@chromium.org


+hablich@ (V8 TPM) for M69 merge review.  I'm ok to take this merge in for M69 per comments #77 and #78 if hablich@ is ok with it.
Labels: -Merge-Review-69 Merge-Approved-69
+1 to merging the fix in  #64 to M69.
Project Member

Comment 81 by bugdroid1@chromium.org, Aug 21

Labels: merge-merged-6.9
The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/7036f19f57b23f8f5d5fcb7d79b527af77ddacb8

commit 7036f19f57b23f8f5d5fcb7d79b527af77ddacb8
Author: Ulan Degenbaev <ulan@chromium.org>
Date: Tue Aug 21 18:11:03 2018

Merged: [heap] Reuse freed CodeRange addresses.

Revision: 4d474c51d828b050b27c7152e567edc2fb5f9701

NOTRY=true
NOPRESUBMIT=true
NOTREECHECKS=true
R=mlippautz@chromium.org

Bug:  chromium:870054 
Change-Id: I651b191367895d5ace045f024de34605f464786b
Reviewed-on: https://chromium-review.googlesource.com/1183904
Reviewed-by: Michael Lippautz <mlippautz@chromium.org>
Cr-Commit-Position: refs/branch-heads/6.9@{#33}
Cr-Branched-From: d7b61abe7b48928aed739f02bf7695732d359e7e-refs/heads/6.9.427@{#1}
Cr-Branched-From: b7e108d6016bf6b7de3a34e6d61cb522f5193460-refs/heads/master@{#54504}
[modify] https://crrev.com/7036f19f57b23f8f5d5fcb7d79b527af77ddacb8/src/heap/spaces.cc
[modify] https://crrev.com/7036f19f57b23f8f5d5fcb7d79b527af77ddacb8/src/heap/spaces.h
[modify] https://crrev.com/7036f19f57b23f8f5d5fcb7d79b527af77ddacb8/test/unittests/heap/spaces-unittest.cc

Labels: -Merge-Approved-69
Status: Fixed (was: Assigned)
Closing this issue, I will fork comment 73 into a separate issue.
Project Member

Comment 84 by bugdroid1@chromium.org, Aug 23

The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/6930df0f1cf6671f356886b068d676a45c3543d6

commit 6930df0f1cf6671f356886b068d676a45c3543d6
Author: Bruce Dawson <brucedawson@chromium.org>
Date: Thu Aug 23 18:27:42 2018

Use PAGE_TARGETS_INVALID when allocating code pages

PAGE_TARGETS_INVALID tells CFG (Control Flow Guard) to mark all
addresses as invalid indirect branch targets. This makes exploits more
difficult. The benefit is minor because most of the code in the Chrome
process doesn't use the CFG checks, but this will close off a few
weaknesses and is the direction we will want to go in eventually
anyway (with specific targets or call sites opted-in to allowing
calls, using SetProcessValidCallTargets).

PAGE_TARGETS_INVALID may ultimately cause CFG to not allocate memory -
that is implied by Windows Internals 7th Edition - and if that is
implemented then this change will save some modest amount of memory.

PAGE_TARGETS_INVALID was introduced in Windows 10 - according to
Windows Internals Part 1 7th Edition - prior to that it will cause
VirtualAlloc to fail.

Bug:  chromium:870054 
Change-Id: Ib1784fba37cc0ecb5fe5df595f1519531b3b3a20
Reviewed-on: https://chromium-review.googlesource.com/1186025
Commit-Queue: Bruce Dawson <brucedawson@chromium.org>
Reviewed-by: Ulan Degenbaev <ulan@chromium.org>
Reviewed-by: Hannes Payer <hpayer@chromium.org>
Cr-Commit-Position: refs/heads/master@{#55365}
[modify] https://crrev.com/6930df0f1cf6671f356886b068d676a45c3543d6/src/base/platform/platform-win32.cc
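
A minimal sketch of the kind of change the commit above describes (simplified; AllocateJitPages is a made-up name, not the actual platform-win32.cc code):

// Request PAGE_TARGETS_INVALID so CFG treats nothing in the region as a
// valid indirect-call target; fall back to plain PAGE_EXECUTE_READWRITE on
// pre-Windows-10 systems, where the flag makes VirtualAlloc fail.
#include <windows.h>

void* AllocateJitPages(void* hint, size_t size) {
  void* result =
      VirtualAlloc(hint, size, MEM_RESERVE | MEM_COMMIT,
                   PAGE_EXECUTE_READWRITE | PAGE_TARGETS_INVALID);
  if (result == nullptr) {
    // Older Windows versions do not recognize PAGE_TARGETS_INVALID.
    result = VirtualAlloc(hint, size, MEM_RESERVE | MEM_COMMIT,
                          PAGE_EXECUTE_READWRITE);
  }
  return result;
}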

I can confirm that the fix is in M69 (69.0.3497.57) and working. I reproed the bug in the previous version of Chrome beta and then left the version with the fix running for two days. The output below shows stable, beta, and canary with the CFG rows being the critical ones. The difference is quite dramatic.

Stable (no fix):
     Scan time, Committed, page tables, committed blocks
Total: 16.787s, 1020.8 MiB,    29.9 MiB,  13649, 90 code blocks, in process 37424
  CFG: 16.783s, 156.3 MiB,    22.8 MiB,   8913

Beta (with fix):
     Scan time, Committed, page tables, committed blocks
Total:  0.069s, 621.2 MiB,     5.1 MiB,   2520, 79 code blocks, in process 24696
  CFG:  0.065s,  28.5 MiB,     0.1 MiB,     60

Canary (with fix):
     Scan time, Committed, page tables, committed blocks
Total:  0.086s, 724.7 MiB,     5.3 MiB,   3126, 79 code blocks, in process 50452
  CFG:  0.081s,  31.2 MiB,     0.1 MiB,     60

The small difference in CFG committed memory between beta and canary is not relevant. The page table and committed block numbers are.

Thank you for verifying, Bruce.
