New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 744552 link

Starred by 0 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 3
Type: Bug
Hotlist-MemoryInfra



Sign in to add a comment

malloc shim doesn't seem to respect OOM scenarios

Project Member Reported by mlippautz@chromium.org, Jul 17 2017

Issue description

We were just going through a few Heap::SetUp failures and most of them seem to be regular OOMs that do not trigger any magic hook.

Erik: Since you are on the authors/blame of the allocator shim, we were wondering if you can shed some light into that mystery. We were expecting these malloc OOMs to trigger some magic stack signature which we often see in other cases. 

(I do realize that malloc failures can be handled gracefully; however, the use within operator new should trigger an OOM.)

https://crash.corp.google.com/browse?q=product.name%3D%27Chrome%27%20OMIT%20RECORD%20IF%20SUM%28CrashedStackTrace.StackFrame.FunctionName%3D%27v8%3A%3Ainternal%3A%3AHeap%3A%3ASetUp%28%29%27%29%20%3D%200&ignore_case=false&enable_rewrite=true&omit_field_name=CrashedStackTrace.StackFrame.FunctionName&omit_field_value=v8%3A%3Ainternal%3A%3AHeap%3A%3ASetUp%28%29&omit_field_opt=%3D&stbtiq=&reportid=b220d2b488000000&index=1#0
 
Cc: primiano@chromium.org etienneb@chromium.org
Hm. EnableTerminationonOOMMemory is pretty much the first thing that service_manager/embedeer/main.cc does:
https://cs.chromium.org/chromium/src/services/service_manager/embedder/main.cc?type=cs&q=EnableTerminationOnOutOfMemory&l=333

This sets the new OOM handler.
https://cs.chromium.org/chromium/src/base/process/memory_win.cc?type=cs&q=EnableTerminationOnOutOfMemory&l=61

which is also be used for malloc OOM.

All of this being said, it looks like v8 directly uses VirtualAlloc & friends, which we don't hook:
https://cs.chromium.org/chromium/src/v8/src/base/platform/platform-win32.cc?l=740

> (I do realize that malloc failures can be handled gracefully; however, the use within operator new should trigger an OOM.)
Are you referring to placement new? I would expect that to be caught by the shim as well.

Can you be more specific about the case you expect us to shim/fail on, that we don't?
V8 only used the platform allocators for its managed heap. Everything allocated using malloc/new is using whatever is provided, see [1].

[1] https://cs.chromium.org/chromium/src/v8/src/allocation.h?q=NewArray&l=46

Cc: w...@chromium.org ajwong@chromium.org
Labels: OS-Windows
As per offline discussion, the process is crashing on first access to memory, rather than on new. This is *really* odd, since on Windows, theoretically committed memory is guaranteed to have backing space in the page file.

Maybe it's possible for all physical memory to be wired/non-reusable, and that will cause a failure.
Cc: erikc...@chromium.org
Components: Internals>Instrumentation>Memory
Owner: ----
Status: Untriaged (was: Assigned)
v8 historically opted for its own allocation handling (with good reasons), which means that the base:: EnableTerminationonOOMMemory doesn't affect that, in the same way it doesn't affect partition alloc or oilpan.
The expectation is that such subsystems enforce their own suicide-on-mmap-failure. I was pretty sure to have seen code in v8 that does that after printing some log statement. (If that isn't the case please file a v8 bug, as that is a security bug)

> As per offline discussion, the process is crashing on first access to memory, rather than on new. This is *really* odd, since on Windows, theoretically committed memory is guaranteed to have backing space in the page file.

FWIW this behavior is unfortunately specific only to Windows AFAIK. I don't know why this is crashing on windows, but I do definitely expect this to be able to crash on another platform even if using placement new.

On Linux/Android/CrOS (and maybe mac, don't know about that?) the real problem about OOMs random kills is memory-overcommit*. Essentially, depending on the kernel configuration and the scenario, in many cases when you malloc/mmap the kernel will just lie to you and say "go ahead, here's the pointer to your virtual address space... but I make no guarantees". Eventually when you access the memory, during the page fault handler, the kernel says "aha now I gotta find some physical memory to back this page that I promised to this process" so it start running all sort of vmscan shrinking (purge clean memory, writeback file-backed memory, etc). In the very worst case however the kernel will realize that there is no way to maintain that promise. So, again, depending on the system config, the policy will be either suspend the current process or murder any process (not necessarily the one who was page-faulting). By default the latter.

The base::allocator shim does not and can not prevent this. The allocator shim is mostly there for security reasons to avoid propagating null pointers around. But that fact that one gets a non-null pointers doesn't mean that there is memory. 
Also the allocator shim doesn't / cannot easily interfere with mmaps, as some subsystems want to gracefully handle those failures (and also because on linux mmap is a syscall).

* https://www.kernel.org/doc/Documentation/vm/overcommit-accounting

> v8 historically opted for its own allocation handling (with good reasons), which means that the base:: EnableTerminationonOOMMemory doesn't affect that, in the same way it doesn't affect partition alloc or oilpan.
> The expectation is that such subsystems enforce their own suicide-on-mmap-failure. I was pretty sure to have seen code in v8 that does that after printing some log statement. (If that isn't the case please file a v8 bug, as that is a security bug)

V8 managed heap is working as intended, i.e., we will print log messages and defer to proper OOM handlers.

The reason I posted this is that I thought that Windows doesn't actually do on-demand paging and reserves physical pages for the reserved memory; s there should be no overcommitting. I guess I am wrong here and Windows actually supports on-demand paging by now? Otherwise, I have a hard time explaining this behavior.
Cc: brucedaw...@chromium.org
I know who could know: +brucedawson :D

Comment 8 by ajwong@google.com, Jul 18 2017

How did you determine that this is breaking on touching of memory instead of on allocation?  Did someone download the minidump and check that?

Agreed that this seems to violate the windows contract for VirtualAlloc...
Windows does not over-commit - it won't give you a pointer that can't be backed by RAM of space in the pagefile. (this doesn't mean it gives you the memory immediately, that would be wasteful, but it does promise that it can). So, derferencing a pointer that was legally obtained should never trigger OOM on Windows.

So...

Loading the crash into windbg or the latest VC++ 2017 shows that the linked crash is going through the shim:

>	ntdll.dll!RtlpAllocateHeap()	Unknown	Non-user code. Symbols loaded.
 	ntdll.dll!RtlAllocateHeap()	Unknown	Non-user code. Symbols loaded.
 	[Inline Frame] chrome_child.dll!base::allocator::WinHeapMalloc(unsigned __int64) Line 34	C++	Symbols loaded.
 	[Inline Frame] chrome_child.dll!?A0x499c34b3::DefaultWinHeapMallocImpl(const base::allocator::AllocatorDispatch *) Line 17	C++	Symbols loaded.
 	[Inline Frame] chrome_child.dll!ShimMalloc(unsigned __int64) Line 191	C++	Symbols loaded.
 	chrome_child.dll!malloc(unsigned __int64 size) Line 51	C++	Symbols loaded.

However it then crashes inside the heap. This isn't really supposed to happen. The crashing instruction is "xor dword ptr [rdi+8],eax" where rdi is 0x41f0, which is an odd value.

It could be that there is a bug in the heap such that it internally gets a null-pointer (due to an allocation failing) and then fails to handle it properly. I think it is more likely that the heap was corrupted. If this is happening enough then it might mean that we are the ones corrupting it.

Actually we checked a few crashers on crash/ and if that stack traces are correct then we are crashing on the first access. See [1] for example. This might very well be a regular corruption though.

I will check a few minidumps once I have some spare cycles.

Thanks for clarifying that Windows does not support over committing.

[1] https://crash.corp.google.com/browse?q=product.name%3D%27Chrome%27%20OMIT%20RECORD%20IF%20SUM(CrashedStackTrace.StackFrame.FunctionName%3D%27v8%3A%3Ainternal%3A%3AHeap%3A%3ASetUp()%27)%20%3D%200&ignore_case=false&enable_rewrite=true&omit_field_name=CrashedStackTrace.StackFrame.FunctionName&omit_field_value=v8%3A%3Ainternal%3A%3AHeap%3A%3ASetUp()&omit_field_opt=%3D&stbtiq&reportid=f1fb25aa78000000&index=15#4
There is something else going on here. The lasted linked crash happens in this code:

  // Allocate and set up the histogram arrays if necessary.
  allocated_histogram_ = NewArray<HistogramInfo>(LAST_TYPE + 1);
  promoted_histogram_ = NewArray<HistogramInfo>(LAST_TYPE + 1);
#define SET_NAME(name)                        \
  allocated_histogram_[name].set_name(#name); \
  promoted_histogram_[name].set_name(#name);
  INSTANCE_TYPE_LIST(SET_NAME)
#undef SET_NAME

So, the code is going through one name at a time calling .set_name() on allocated_histogram_[name] and promoted_histogram_[name]. A snippet of the code looks like this:

0ff99f1e b9f48dc911      mov     ecx,offset chrome_child!`string' (11c98df4)
0ff99f23 8b8688020000    mov     eax,dword ptr [esi+288h]
0ff99f29 898814030000    mov     dword ptr [eax+314h],ecx
0ff99f2f 8b868c020000    mov     eax,dword ptr [esi+28Ch]
0ff99f35 898814030000    mov     dword ptr [eax+314h],ecx
0ff99f3b b9dc8dc911      mov     ecx,offset chrome_child!`string' (11c98ddc)
0ff99f40 8b8688020000    mov     eax,dword ptr [esi+288h]
0ff99f46 898820030000    mov     dword ptr [eax+320h],ecx - crash!
0ff99f4c 8b868c020000    mov     eax,dword ptr [esi+28Ch]
0ff99f52 898820030000    mov     dword ptr [eax+320h],ecx

When the crash happens eax is zero. The theory in this bug is that it is zero because malloc returned zero, but that doesn't match the evidence because eax was loaded from esi+288h (allocated_histogram_) before it is dereferenced, and just five instructions earlier the same pattern was repeated and worked just fine. So, [esi+288h] used to be a valid pointer and is now zero. I don't know how that could happen, but it has. In fact, dozens of successful .set_name() calls were made to both arrays before the crash, so a valid pointer clearly was returned.

My guess is that the allocation is too small, but I don't know the area well enough to speculate about what the problem could be.

As an aside, the generated code is quite inefficient. That's because the allocated_histogram_ and promoted_histogram_ pointers are being accessed through a pointer and VC++ (conservatively) assumes that there could be aliasing so it reloads them every time. If you cached those pointers in local variables then 40% of the instructions - at least - would be eliminated. That would also hide this bug, probably, so that simple optimization should be done after the bug is fixed.

Sign in to add a comment