New issue
Advanced search Search tips

Issue 730918 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 2
Type: Bug



Sign in to add a comment

mach_override_ptr makes too many allocation attempts

Project Member Reported by borisv@chromium.org, Jun 8 2017

Issue description

Chrome Version: 60.0.3112.0
OS: OSX 10.12.5

What steps will reproduce the problem?
(1) Make sure that you have Xcode installed
(2) Launch Instruments
(3) Create a new Memory profiling session and point to Chrome executable
(4) Launch the session and stop it after the UI is displayed
(4) Observe the stacks, displayed by the Instruments tool

What is the expected result?

What happens instead?
See the attached picture (the whole trace is too big to attach). Note that this is an optimized build, but ultimately the child methods of mach_override_ptr call vm_allocate 131,074 times and end up allocating 512MB of virtual memory. The actual allocation happens in allocateBranchIsland, which has a logic that tries allocating a slowly shrinking amount of virtual memory "until it works".

If I understand the code correctly, the issue is that when it changes the implementation of one function with another, the method does not know how much memory the method occupies, hence it overshoots as much as possible. 


 
Screen Shot 2017-06-06 at 5.27.33 PM.png
216 KB View Download
It sounds like the code is attempting to find a code page to allocate, I am wondering if there is a better API instead of such sequential tries one page at a time. Also, it sounds like the 'last' page cleans 4 more bits (up to 32), but the comment suggests only bits 13-28 needs to be varied.

		vm_address_t last = (uint64_t)originalFunctionAddress & ~((0x1UL << 32) - 1);


https://cs.chromium.org/chromium/src/third_party/mach_override/mach_override.c?rcl=cac79ef152484f51825a407006982f53529ccc05&l=406

Some more information. General info about mach_override_ptr can be found here:
http://www.buckleyisms.com/home/2013/9/11/a-high-level-explanation-of-mach_override-and-how-misusing-i.html

This thread is also relevant. We may be able to cut the number of allocations in half in some cases. E.g. during my debugging, two branch islands are allocated and they both seem to attempt the same address space, as such both failing the same pages. 
This is another good thread about this method, although my personal experience is of way faster allocation times (under 1 second):
https://github.com/google/sanitizers/issues/24 

Comment 3 by mark@chromium.org, Jun 9 2017

I think we only use mach_override in one spot and I'd be very happy to stop using it altogether.
Is there other ways to avoid the runtime loading plugins into our bundle?

Comment 5 by mark@chromium.org, Jun 9 2017

I don’t know if there’s a way to be as heavy-handed as we are now, but I will note that since we’ve gone 64-bit, the OS will load plugins from fewer locations.

And I’ll also note that the CFBundleBlocker doesn’t limit loads from all third-party locations, either.

There’s the code signing “library” bit (library validation), but we can’t use that in the browser because we do need to be able to load printer driver PDEs (print dialog extensions).

Comment 6 by borisv@chromium.org, Jun 10 2017

I am not sure why this method allocates on x86_64 from the last address to the smallest in the ASLR range. Is this intentional for security reasons? If I flip it around, the vm_alloc succeeds immediately. Observe that I flipped first and last, so that first is the smallest address and also make the attempts start from that small address moving up.
https://cs.chromium.org/chromium/src/third_party/mach_override/mach_override.c?rcl=5ae7434f6f92a98b9a56d0818342e06ee59b7f5a&l=405:
...
#elif defined(__x86_64__)
		// 64-bit ASLR is in bits 13-28
		vm_address_t last = ((uint64_t)originalFunctionAddress & ~( (0xFUL << 28) | (PAGE_SIZE - 1) ) ) | (0x1UL << 31);
		vm_address_t first = (uint64_t)originalFunctionAddress & ~((0x1UL << 32) - 1);
#endif

		page = first;
		int allocated = 0;
		vm_map_t task_self = mach_task_self();

		while( !err && !allocated && page != last ) {

			err = vm_allocate( task_self, &page, PAGE_SIZE, 0 );
			if( err == err_none )
				allocated = 1;
			else if( err == KERN_NO_SPACE ) {
// #if defined(__x86_64__)
// 				page -= PAGE_SIZE;
// #else
				page += PAGE_SIZE;
//#endif
				err = err_none;
			}
		}

Comment 7 by borisv@chromium.org, Jun 14 2017

Labels: -Pri-3 Pri-2
The total impact of these allocations on the start up time on my machine (optimized Chrome build) is not trivial: ~90ms (34ms + 55ms, as it allocates two branch islands. It really asks for better algorithm. When I try allocating from the low addresses, though, the code causes severe memory corruption, although it finds a suitable address to allocate really fast. This is extremely suspicious, as in theory it should not matter which page we allocate. I am wondering if this code conflicts with other places where we attempt to re-arrange the system calls. 

Comment 8 by borisv@chromium.org, Jun 14 2017

Cc: a...@chromium.org
rsesek@, avi@ do you folks know who is familiar with this code? There is a good opportunity to optimize it.

Comment 9 by mark@chromium.org, Jun 15 2017

What do the addresses it gives you look like (both low and high strategies)? It's possible that low winds up camping on a reserved address that isn't actually spoken for in the map.

Comment 10 Deleted

All callsites today pass kAllocateHigh, so there's no attempt at the low allocation strategy. I don't understand why it attempts to find a page at a specific address rather than just using VM_FLAGS_ANYWHERE, and I wasn't able to track that down from the git history.

If it does need to be at a specific address range, using vm_region() could probably improve the algorithm significantly. It looks like this was attempted here https://github.com/rentzsch/mach_override/commit/cb8012cb30d2ec990212af0cd3f8ed04354b5d90 but it was subsequently backed out here https://github.com/rentzsch/mach_override/commit/1b3aa46fc36ea65ad6294379df54fad13a7b27a6.

Also see https://github.com/rentzsch/mach_star/pull/38.
I think I am getting closer to understanding the problem, as I look at the disassembly of the final _CFBundleLoadExecutableAndReturnError in debug builds. Here are some details, though. I will deep a bit more into what instructions the method writes.

@rsesek: The previous attempts to update this logic where back in 2012 when we still had 32-bit of Chrome. Things have gotten significantly worse since, as the address space to walk is now bigger.

@Mark: here are the address spaces: 
The current implementation attempts to reserve a page from the top of the stack and the first one it is able to reserve is at 7FFF6FFFF000, the second one at 7FFF6FFFE000. 

If I allocate from the start of the ASLR range, I got the branch islands at 0x7FFF00000000 and 0x7FFF00001000, allocated almost immediately.

The page size is 0x1000.

In debug mode, allocating from bottom of the ASLR range results in crash that starts in-[KesytoneGlue loadKeystoneRegistration]. The crash is actually in the overridden function _CFBundleLoadExecutableAndReturnError.

Here is how this code looks like (observe the super suspicious jump at the beginning):

CoreFoundation`_CFBundleLoadExecutableAndReturnError:
->  0x7fffa5273e10 <+0>:   jmp    0x800000000000
    0x7fffa5273e15 <+5>:   pushq  %rdi
    0x7fffa5273e16 <+6>:   pushq  %r14
    0x7fffa5273e18 <+8>:   pushq  %r13
    0x7fffa5273e1a <+10>:  pushq  %r12
    0x7fffa5273e1c <+12>:  pushq  %rbx
    0x7fffa5273e1d <+13>:  subq   $0x18, %rsp
    0x7fffa5273e21 <+17>:  movq   %rdx, -0x38(%rbp)
    0x7fffa5273e25 <+21>:  movl   %esi, -0x3c(%rbp)
    0x7fffa5273e28 <+24>:  movq   %rdi, %r13

Some more information. The performance impact is actually bigger than I thought - this method is executed for the browser and every renderer on the main thread. The following statistical data was collected on my laptop (2014 Macbook Pro) while restoring 32 tabs (the method is called 45 times, as that many processes are created).

Median	96.18957 ms
Average	123.4742964 ms
Min	84.18488 ms
Max	384.57725 ms

Here are all of the problems with that method:
1. The method picks the address of the method to override, sets bits 13-28 to 1 and tries allocating every page in that address space until it finds one for the escape island. Note that it starts from the top space, which the kernel seems to reserve for other purposes, hence the long walk.
2. Once one one is found, it passes that page address again to allocate the reentry island. The problem is that due to logic in #1, the search starts from the top again.
3. The method then creates a jump instruction with a 32-bit signed offset. However, the code does not check anywhere the address is within the signed 32-bit limit. In fact, this was the reason for many crashes that I experienced during investigation. This also the most probable reason for the crashes in the implementation that was added and removed in 2012.
4. The method makes an assumption about the ASLR using bits 13 to 28. Yet, the logic that sets those bits has some bugs and thus it does not clear

Just changing the order to start allocating from the bottom of the address (originalFunctionPtr - 0x7fffffff) makes the allocations happen immediately and results in sub 0ms for the the whole method.

I am preparing a fix, but what is the story for updating the ? Should I submit a pull request for the third_party library, or can I check it in chromium directly?

Comment 14 by mark@chromium.org, Jun 17 2017

We can start with a local change but should definitely upstream it. He's always been good about accepting our patches.
Nice find!

We generally try to upstream our changes so that we can avoid local modifications. In this case, given the severity, I would recommend
1) Fixing this in Chrome
  * make sure to update third_party/mach_override/README.chromium with the modifications.
2) Submitting a pull request.
...
3) At some time in the future, if the pull request is accepted, rolling mach_override.
Status: Fixed (was: Assigned)
We see the expected improvements of 90ms in startup time:
https://chromeperf.appspot.com/report?sid=351c7e148f627067c5ea0559ae54c09137cada0b4c557a4c3d1cdf368ba66ea6&start_rev=477937&end_rev=481504

Sign in to add a comment