We should deal more gracefully with browser process memory leaks
Issue description

As we found while digging into bug #690517 (and the related partner bug http://crosbug.com/p/62731, which has more detailed digging), memory leaks in the browser process (especially fast ones) are really hard to debug. It would be awfully nice to have _something_ that trips when the total browser-process heap space gets above a certain size and then logs memory stats, maybe even logging stack traces for the next 10 allocations or something crazy like that. This should be a super quick check, so in theory it could stay in even for release builds. Presumably the size thresholds would be something like this (see the sketch below):
- On 32-bit userspace systems, the check should trip at 2 GB.
- On 64-bit systems, the check should trip when the browser process is using 75% of the system's total physical memory.

No idea how practical that would be, but maybe it would help avoid the next complicated debug session?
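For concreteness, here is a minimal, hypothetical sketch of what such a check could look like on Linux / Chrome OS. It is not Chromium code; all names and thresholds are illustrative. It reads RSS from /proc/self/statm and crashes deliberately once a platform-dependent limit is exceeded (a real version would also need the logging path to avoid allocating):

// Hypothetical watchdog sketch (not Chromium code). Reads the browser
// process's resident set size and deliberately crashes, after logging,
// once a platform-dependent threshold is exceeded.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>

namespace {

uint64_t ResidentBytes() {
  // /proc/self/statm reports sizes in pages: size resident shared ...
  FILE* f = std::fopen("/proc/self/statm", "r");
  if (!f)
    return 0;
  unsigned long vm_pages = 0, rss_pages = 0;
  if (std::fscanf(f, "%lu %lu", &vm_pages, &rss_pages) != 2)
    rss_pages = 0;
  std::fclose(f);
  return static_cast<uint64_t>(rss_pages) * sysconf(_SC_PAGESIZE);
}

uint64_t TotalPhysicalBytes() {
  return static_cast<uint64_t>(sysconf(_SC_PHYS_PAGES)) * sysconf(_SC_PAGESIZE);
}

}  // namespace

// Would be called periodically, e.g. from a low-frequency watchdog timer.
void CheckBrowserMemory() {
  uint64_t limit;
  if (sizeof(void*) == 4) {
    limit = 2ull * 1024 * 1024 * 1024;     // 2 GB on 32-bit userspace
  } else {
    limit = TotalPhysicalBytes() / 4 * 3;  // 75% of physical RAM on 64-bit
  }
  const uint64_t rss = ResidentBytes();
  if (rss > limit) {
    std::fprintf(stderr,
                 "Browser RSS %llu exceeds limit %llu; crashing for diagnosis\n",
                 static_cast<unsigned long long>(rss),
                 static_cast<unsigned long long>(limit));
    std::abort();  // crash while a report can still capture useful state
  }
}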
Mar 3 2017
I'd dupe this against Issue 698266, as ssid reported.

> This should be a super quick check, so in theory it could stay in even for release builds.

There are more subtleties in release builds, which we are discussing in Issue 698266. For instance, even stack unwinding is a non-trivial problem. I like the idea, but honestly this might require more work than it seems. At that point I think we should just shoot for:
- Having a reliable heap profiler, so we can just ask people to restart Chrome with --enable-heap-profiling and push a button to file a bug.
- Having that tool send reports from the field when we encounter situations like the one you describe.
Apr 14 2017
Bumping priority here, as this was an issue in a recent postmortem: https://docs.google.com/document/d/1PphqDNEMLueeTtK5OZDN4HtnQ_hrRHEEQXmyr-C8ST0/edit

We still cannot measure impact on Chrome OS, which is where we had the biggest complaints, including from paying customers.
Apr 17 2017
If this should be duped against issue 698266, presumably that bug needs an owner and a priority higher than 3.

However, if there's something quicker and dirtier we can do in the short term to at least get us out of the terrible state we're in now, it seems like we could address that in this bug. While not beautiful and elegant, something like my proposal at the start of this bug wouldn't be terribly hard to implement and would at least get us off the ground. Even the simplest case, adding a check that makes Chrome "crash" early, might be worthwhile if everything else is too hard or risky.
Apr 18 2017
> However, if there's something quicker and dirtier that we can do in the short term to at least get us out of the terrible state that we're in now, it seems like we could address that in this bug.

What is your primary concern here?
1) Being able to debug memory leaks/bloats when they happen?
2) Ensuring that the browser process crashes early when it hits a given limit?

About 1, today you should already be able to restart Chrome with --enable-heap-profiling=native and get a full heap dump on demand using memory-infra's heap profiler (docs here: https://chromium.googlesource.com/chromium/src/+/master/docs/memory-infra/heap_profiler.md). If for any reason that doesn't work on CrOS, I'd appreciate help from somebody on CrOS to make it work on that platform too, since this is the tool we are using and investing in for all other platforms, both for immediate manual investigations and for upcoming plans of a live version of it.

About 2, I'm not sure what the benefit would be of crashing Chrome early when it hits a fixed 2 GB (or whatever) limit, considering that it would crash naturally later. (I am fully on board with the need to collect diagnostics once we hit a number like that.)

> While not beautiful and elegant, it seems like something like my proposal at the start of this bug wouldn't be terribly hard to implement and would at least get us off the ground?

It is not fully clear to me which problem your #0 is proposing to solve. If it's 1), you might already have something there. Happy to discuss further.
Apr 18 2017
A) Fast leaks

The current solution doesn't help us with "fast" memory leaks. This is the type of leak we just spent months getting to the bottom of, and better tools could have helped us solve it much, much faster. Basically all of our tools, even when they're working well, are useless with a "fast" leak at the moment.

1. On a machine with a 32-bit userspace there will be no chance to "dump" any of this profiling information. The machine will go from "no leak" to "crashed" in a matter of seconds. During the leaking time it's likely that the system will be unusable as it tries to cope with the massive influx of allocations. Unless your profiling tool somehow preallocates memory (so it doesn't need to do any allocations _after_ we've run out of memory) and registers for the crash notification, it wouldn't have been helpful. (See the sketch after this comment for the preallocation idea.)

2. On Chrome OS machines with a 64-bit userspace and a fast leak, today we won't get any sort of Chrome crash. Memory will be exhausted very quickly and the kernel will start "OOM killing" processes. When this happens, the processes are killed without a chance to get a notification or dump any sort of debugging information.

3. On Mac / Windows / Linux machines with a 64-bit userspace and a fast leak, presumably we won't get any sort of crash at all. The user's machine will slow to a crawl as more and more swap is allocated by the OS. Note: if you have sufficient physical memory and the leak is bounded, you _might_ be able to use your tools here.

The above could be solved, as I mentioned, by forcing the machine to "crash" earlier.

---

B) Command-line flags

I will also note that the crash we just debugged was originally reproduced on Chromebooks that people were using normally (not in dev mode). We can't add command-line flags in this mode. Often it took several days to get into this state, and it may have been important that people were logged in with their corp profile so they could sync their Google Docs. Telling people "please get back into this state in dev mode after enabling these flags" would have been a no-go. It is not completely unreasonable to force someone to set a Chrome flag and restart Chrome, though.
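To illustrate the preallocation point in item 1 above, here is a minimal, hypothetical sketch (not Chromium code): the diagnostic buffer is reserved up front so the handler never allocates after the heap is exhausted, and it is wired to an intentional abort() rather than to the kernel's OOM SIGKILL, which cannot be trapped:

// Hypothetical sketch: preallocate a report buffer at startup so a
// crash-time handler never has to allocate after the heap is exhausted.
#include <signal.h>
#include <string.h>
#include <unistd.h>

namespace {

// Filled in periodically during normal operation (e.g. with memory stats);
// the handler below only reads it.
char g_preallocated_report[4096];

void OnIntentionalAbort(int /*sig*/) {
  // Only async-signal-safe calls here: write(), no malloc, no printf.
  static const char kHeader[] = "browser memory watchdog fired; report:\n";
  write(STDERR_FILENO, kHeader, sizeof(kHeader) - 1);
  write(STDERR_FILENO, g_preallocated_report,
        strnlen(g_preallocated_report, sizeof(g_preallocated_report)));
  _exit(1);
}

}  // namespace

void InstallMemoryWatchdogHandler() {
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = OnIntentionalAbort;
  // SIGABRT is raised by an intentional abort(); a kernel OOM kill
  // (SIGKILL) would never reach this handler.
  sigaction(SIGABRT, &sa, nullptr);
}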
Apr 18 2017
> A) Fast Leaks

Very good point. I have never seen a leak like that before, so I never considered that use case, but I totally believe you that it was the case here. Some thoughts:

A1) One solution I see here is to have a heap-profiling mode where, once the X GB limit is reached, we stop the world, flush the heap profiler snapshot and suicide. This involves some work, and given the other priorities we have (way more slow leaks), it is unrealistic to see this happen this quarter.*

A2) One thing I'm wondering: if a leak is that fast, it should show up immediately in a generic callstack sampling profiler (i.e. perf on Linux), as I'd expect the process to spend statistically most of its time in malloc() (and in the callstack that is causing the leak). Have you tried that? If it didn't work, any idea why not? If you haven't, honestly I feel this should be the way to debug these kinds of issues.

A3) Another option, more aligned with what you suggested above: have an --enable-heap-profiling mode where, after the 1 GB limit, we start logging 1 malloc call out of 1000 to stdout (or a log file); see the rough sketch after this comment. I don't love that option, but I wouldn't be too opposed to it, provided that:
- I understand why A2 didn't work.
- Prioritization: given that this is the first time I've come across this sort of issue, unfortunately I still feel all the other OKRs we have, which target the slow leaks that happen to be more frequent, come first.* Which translates into: if you find some resources for this, I'll happily VC/IM with them to head them in the right direction and handle code reviews, but I don't have bandwidth on my team this quarter to do it myself.

> B) Command line flags

You make very good points again. This has been discussed in Issue 684839 (TL;DR: make it a chrome://flags flag). Again this boils down to prioritization.* Unfortunately the truth is that so far chrome://flags support is a nice-to-have for all other (i.e. non-CrOS) platforms, and a must-have only for CrOS. Honestly speaking, CrOS is the one platform from which we (people working on tooling) have received zero support: people are contributing to these tools for Windows/OSX/Linux and Android, but nobody from CrOS. Yet again*, if anybody wants to pick up Issue 684839, I'll be more than happy to give guidance and review the code. Unfortunately, it's currently a nice-to-have on all platforms but CrOS and we have other battles to fight this quarter.

* I know my answers can be annoying and be perceived as "yes, I could fix this but I don't want to". I really do want to fix these, but I have dozens of other problems like this in my queue and not enough people to deal with them. I'm just trying to be honest and realistic. Unfortunately we have a huge storm of problems, a lot of work to do on the tooling and a very limited number of people working on the area. :/ If we had more people or fewer problems I'd immediately re-route this bug to somebody. But as things stand today, I'd be asking somebody to drop one battle on the floor just to fight another.
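A rough sketch of what option A3 could look like, purely as an illustration: the hook name, thresholds and sampling rate are all made up, and a real version would capture stacks and write to a preallocated buffer rather than calling fprintf:

// Hypothetical sketch of option A3 (not the real heap profiler): once the
// process has allocated ~1 GB in total, log roughly 1 in every 1000
// allocations. Assumes the embedder provides some allocator hook that
// calls OnAllocationSampled().
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>

namespace {

std::atomic<uint64_t> g_total_allocated{0};
std::atomic<uint64_t> g_allocation_count{0};
thread_local bool g_in_hook = false;  // guards against re-entrant logging

constexpr uint64_t kStartLoggingAtBytes = 1ull << 30;  // ~1 GB
constexpr uint64_t kSampleEvery = 1000;

}  // namespace

// Hypothetical hook, invoked for every allocation.
void OnAllocationSampled(void* ptr, std::size_t size) {
  const uint64_t total = g_total_allocated.fetch_add(size) + size;
  if (total < kStartLoggingAtBytes)
    return;
  if (g_allocation_count.fetch_add(1) % kSampleEvery != 0)
    return;
  if (g_in_hook)
    return;  // fprintf may itself allocate; avoid infinite recursion
  g_in_hook = true;
  std::fprintf(stderr, "sampled alloc: %zu bytes at %p (total %llu bytes)\n",
               size, ptr, static_cast<unsigned long long>(total));
  g_in_hook = false;
}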
Apr 18 2017
> A2) One thing I'm wondering here is: if a leak is so fast, it should show up immediately in a generic callstack sampling profiler (i.e. perf on linux) as I expect the process to spend statistically most of its time on malloc() (and in the callstack that is causing the leak). have you tried that? If that didn't work any idea why it didn't? If you didn't honestly I feel this should be the way to debug these kinds of issues.

Yup, and that's how we made progress on the bug in question. BUT:

i) That only helps on 32-bit userspace. As per the above, for 64-bit userspace we get an OOM kill with no info at all.

ii) Something about the process being fully out of memory actually made the crash reporting fail to post symbolized stacks. I'm not sure if this is because the crash reporter itself was only 32-bit and has a bug when the virtual address space is too big, or if Chrome was supposed to do something special at crash time and couldn't because there was no memory. There are bugs open about trying to make sure we symbolize all crashes better, but that is (yet another) low-priority bug; see bug #701162. Apparently people hope this will be better with "Crashpad" "by the end of this year".

It sounds like most of your arguments here come down to resourcing. I'm afraid I can't really help you there. I'm just a schmoe Software Engineer who works on Chrome OS kernels and I can't easily find you a stack of talented engineers. +zelidrag or +puneetster might be able to, though.

Since it sounds as if you have no time / resources / plans to work on this, I'm assigning to zelidrag to help with prioritizing and resourcing.
Apr 18 2017
Ah:

> i) That only helps on 32-bit userspace. As per above, for 64-bit userspace we get an OOM kill w/ no info at all.

I realize now that you weren't saying we'd statistically catch the one memory allocation that mattered, but that we'd actually be able to use "perf" to find the problem. Hmmm, that might have been possible, assuming people could reproduce this in dev mode (people were mostly reproducing in release mode, as per above).
May 3 2017
BTW, the enable-heap-profiling flag is now settable via chrome://flags; on non-userdebug Android phones we also don't have any means of setting the command line.

I wonder if late-enabling heap profiling could help here. I.e. you notice that Chrome is sluggish, you take a trace and find out that the browser's malloc usage is huge. You then late-enable heap profiling (via devtools, for example), wait for some time to accumulate allocations, and take another dump, which now includes stack traces of recent allocations. Would that have helped diagnose this issue faster?
May 3 2017
I think the issue is not being able to take a trace when Chrome gets into that state. We already have an IndexedDB memory dump provider. If we had been able to take a trace, we could have found the issue with just a normal memory dump.
May 3 2017
I think the issue was related to leveldb::Iterators; are they accounted for by the IndexedDB MDP? If not, we need to account for them.
May 3 2017
We weren't able to get a trace easily (we actually got lucky here: +dianders and +cmumford left tracing on for a long time and were eventually able to capture it 'in the wild'). If we added accounting for leveldb (third-party library) memory allocations to the IndexedDB MDP, would that have provided more information for the minidump, or something uploaded to crash/?
May 3 2017
Also, leveldb uses its own arena allocator for all of its large data allocations; does that change our strategy?
May 3 2017
Re #15: I think we account for the block cache in the approximate_memory_usage calculation, and the allocations for the iterators shown in this stack trace seem to be the blocks that are read into the cache: https://bugs.chromium.org/p/chromium/issues/detail?id=696055#c17
Also, I initially added IndexedDB accounting because of issue 533648, which is similar to the current problem. So the dump provider should have accounted for the memory usage.

Re #16: We do not add a memory dump to a minidump or crash report. It would be hard because memory dumps are a series of async tasks, which cannot be performed when a crash happens.
May 3 2017
We're actually talking to the LevelDB folks about accounting for all LevelDB databases, not only IndexedDB ones (see issue 711518). I'm experimenting with an approach (crrev.com/2855953002) where all LevelDB databases are opened through a DBRegistry object, which keeps track of all databases and reports them to memory-infra (a rough sketch of the dump-provider side is below). I'll add accounting for iterators in the next patch.
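For anyone following along, here is a very rough sketch of what the dump-provider side could look like. The base::trace_event names below match the memory-infra API as I understand it, but the registry structure and dump naming are assumptions, not the actual crrev.com/2855953002 patch:

// Rough sketch only; not the actual CL. Reports one memory-infra allocator
// dump per tracked LevelDB database, using whatever usage estimate the
// registry keeps (e.g. block cache + memtable).
#include <cinttypes>
#include <cstdint>
#include <vector>

#include "base/strings/stringprintf.h"
#include "base/trace_event/memory_allocator_dump.h"
#include "base/trace_event/memory_dump_provider.h"
#include "base/trace_event/process_memory_dump.h"

struct TrackedLevelDB {
  const void* db;                     // used only to build a stable dump name
  uint64_t approximate_memory_usage;  // estimate tracked by the registry
};

class LevelDBDumpProvider : public base::trace_event::MemoryDumpProvider {
 public:
  bool OnMemoryDump(const base::trace_event::MemoryDumpArgs& args,
                    base::trace_event::ProcessMemoryDump* pmd) override {
    for (const TrackedLevelDB& entry : databases_) {
      base::trace_event::MemoryAllocatorDump* dump = pmd->CreateAllocatorDump(
          base::StringPrintf("leveldb/db_0x%" PRIxPTR,
                             reinterpret_cast<uintptr_t>(entry.db)));
      dump->AddScalar(base::trace_event::MemoryAllocatorDump::kNameSize,
                      base::trace_event::MemoryAllocatorDump::kUnitsBytes,
                      entry.approximate_memory_usage);
    }
    return true;
  }

 private:
  std::vector<TrackedLevelDB> databases_;  // filled by a DBRegistry-like object
};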
May 3 2017
OK, so it sounds like this is all good, but it wouldn't have helped with our specific problem, where the crash was already happening. Could it have helped prevent it through earlier detection, if we had proper monitoring and fuzzing?
May 3 2017
After talking with +Oystein, it sounds like there was a Tracing V2 proposal that would trigger lightweight tracing randomly, and possibly in circumstances like "we have crashed twice in the past hour". The traces would be stored in a way that crashpad or breakpad can read on a crash. This would have _definitely_ caught the issue from the postmortem, as people were seeing crashes every 20 minutes or so.
May 3 2017
Yeah, primiano@ had some great ideas about using Slow Reports to trigger background tracing under certain scenarios (in this case it would also need to be able to trigger memory dumps), and about changing the tracing ring-buffer format to protobufs that breakpad/crashpad could grab from the crashed process and attach to the crash report. There's a non-trivial amount of work there, though.
May 4 2017
What you and Oystein said (relevant docs: bit.ly/TracingV2 + go/crash-and-trace), but also prioritization. Given the number of problems we have to solve (this is only one of the four memory-related postmortems I have seen this quarter) and the extremely small number of people I have to work on all these things, that is unlikely to happen within the next two quarters.
May 5 2017
That makes sense. Let's get Primiano more headcount! :)
May 30 2017
hashimoto@ gave me two links [1][2]. The former says:

"""Unfortunately, we don't have a good way to report these OOM killings. When the kernel kills a process with the oom-killer, it effectively does so with the SIGKILL signal. Because of this, the Breakpad signal handler does not get invoked (which is how Chrome reports its crashes), nor does crash_reporter get called by the kernel."""

The latter says:

"""We're going to build a "double-wall" for low-memory states: A "soft wall" where the machine is very low on memory, which triggers a user space attempt to free memory, and a "hard wall" where the machine is out of memory, which triggers the normal kernel OOM killer. In both cases we'll attempt to minimize impact on the user."""

jamescook@, is this double-wall thing working?

[1] https://www.chromium.org/chromium-os/packages/crash-reporting/faq#TOC-Do-we-report-out-of-memory-OOM-crashes-
[2] https://www.chromium.org/chromium-os/chromiumos-design-docs/out-of-memory-handling
May 30 2017
That doc [2] seems to be from 2011. Ironically, a similar discussion is now happening again in the context of Memory Coordinator / Global Resource Coordinator.
May 30 2017
semenzato might know the current state of [2]. It has broken in the past on ARM machines (I don't know why), but it should still be working. When it was built, it relied on a special Chrome OS-only kernel-to-userspace low-memory signal. I think that has been refactored sometime in the last six years, but I'm not sure.
May 31 2017
issue 203180 seems to be the kernel work for [2]. I'm trying to find the chrome code that receives the event.
May 31 2017
found the chrome code: https://cs.chromium.org/chromium/src/base/memory/memory_pressure_monitor_chromeos.cc
May 31 2017
+skuhne, who worked on the Chrome code (issue 439493). Re "we don't have a good way to report these OOM killings": I wonder if it's possible to have the browser process call abort() when it's under high memory pressure but there are no tabs left to discard, so that a crash report is uploaded (rough sketch below).
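A hypothetical sketch of that idea; the base::MemoryPressureListener callback shape matches what I believe base provides, but BrowserHasDiscardableTabs() and the wiring are invented for illustration:

// Hypothetical sketch (not a real patch): on critical memory pressure with
// nothing left to discard, crash deliberately so breakpad/crashpad uploads
// a report instead of the kernel silently SIGKILLing the browser.
#include <cstdlib>

#include "base/memory/memory_pressure_listener.h"

// Assumed hook into the tab discarder; not a real API.
bool BrowserHasDiscardableTabs();

void OnMemoryPressure(
    base::MemoryPressureListener::MemoryPressureLevel level) {
  if (level !=
      base::MemoryPressureListener::MEMORY_PRESSURE_LEVEL_CRITICAL) {
    return;
  }
  if (BrowserHasDiscardableTabs())
    return;  // let the discarder try to relieve pressure first
  std::abort();  // produces a crash report we can see on the dashboard
}

// The listener would be created during browser startup and kept alive,
// roughly: base::MemoryPressureListener listener(base::Bind(&OnMemoryPressure));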
May 31 2017
> I wonder if it's possible to let the browser process call abort() when it's under high memory pressure but there are no tabs to discard, so that the crash report is uploaded.

1. Why? That is an invasive UX change. What is the benefit?
2. What do you expect to see from the crash report?
May 31 2017
We're now talking about bits of code that Sonny and Luigi are familiar with. Add them. Also add Ben as FYI.
May 31 2017
Re #31 primiano@, it was a random idea, but here's what I was thinking:

> 1. Why? that is an invasive UX change. what is the benefit?

If memory leaks are happening fast in the browser process and there are no open tabs to discard, the browser process will soon be OOM killed. Since there is no crash report from an OOM kill, this isn't recorded in the crash dashboard. With a crash report, it can be.

> 2. What do you expect to see from the crash report?

abort() would be called from the low-memory handling path. That would tell us that the browser decided to terminate itself because there was nothing else it could do (very close to being OOM killed). The stack trace itself isn't so helpful, so it would be good if we could attach some other useful bits.
Jun 1 2017
> If memory leaks are happening fast in the browser process and there are no open tabs to discard, the browser process will soon be OOM killed.

The problem here is that "soon". How do you know the browser will really be killed? How do you know the memory pressure wasn't transitory? Do you have data showing that we can identify the cases where we will surely not recover, as opposed to temporary memory-pressure situations?

> Since there is no crash report from OOM killing, this isn't recorded in the crash dashboard. With a crash report, it can be recorded.

This statement is true and not true at the same time. :) The devil is in the details. OOM is an ill-defined concept and can take different shapes (see the small illustration after this comment):

A) OOM can happen when we try to malloc/mmap and we get a nullptr back. This is quite infrequent because of overcommit, and realistically happens only in cases of virtual address space exhaustion on 32-bit platforms. In these cases AFAIK we do generate crash reports, as that code ends up hitting things like:
- https://cs.chromium.org/chromium/src/base/compiler_specific.h?l=95&ct=xref_jump_to_def&gsn=NOINLINE
- v8's Utils::ReportOOMFailure
- https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/heap/PageMemory.cpp?rcl=a031ee0c27808d2dcb4d4c54ceb147b29874317d&l=58
- https://cs.chromium.org/chromium/src/base/allocator/partition_allocator/partition_alloc.cc?rcl=a031ee0c27808d2dcb4d4c54ceb147b29874317d&l=250
So in these cases, we already have a crash report.

B) We succeed at malloc()/mmap(), but then the OS shoots us in the face when we try to use that memory. In these cases, depending on the platform, there is very little we can do. AFAIK on Linux/Android we get killed via a SIGKILL, which is un-trappable, so we can't run breakpad there to get a crash report even if we want to.

> abort() is called from the low memory handling path. That'll tell us that the browser decided to terminate itself since there was nothing else it could do (very closed to get OOM killed). The stack trace itself isn't so helpful so would be good if we could attach some other useful bits.

How can you be confident that you won't just shut down the browser every time, even in cases where we would only have been temporarily slower?
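A small standard-C++ illustration of the difference between (A) and (B): in case (A) the failure is observable in-process (e.g. via a new-handler or a null return), so crash-reporting code can run; in case (B) the SIGKILL leaves no such opportunity. This is only an illustration, not how Chromium's allocators are actually wired:

// Case (A): an in-process allocation failure gives us a chance to run code
// (and therefore to crash with a report). Case (B), a kernel SIGKILL after
// an overcommitted allocation "succeeded", never reaches any handler.
#include <cstdio>
#include <cstdlib>
#include <new>

void OnAllocationFailed() {
  // Invoked by ::operator new when it cannot satisfy a request (case A).
  std::fputs("allocation failed in-process; aborting to get a crash report\n",
             stderr);
  std::abort();  // an installed crash handler (breakpad-style) could run here
}

int main() {
  std::set_new_handler(OnAllocationFailed);
  // ... normal browser-ish work. On a 32-bit process, exhausting the
  // virtual address space ends up in OnAllocationFailed; on an
  // overcommitting 64-bit Linux system, the process may instead be
  // SIGKILLed by the OOM killer with no code of ours running (case B).
  return 0;
}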
Jun 1 2017
Hi, sorry, I didn't realize there was a question for me here (#27).

We probably should NOT try to detect the upcoming OOM termination of the browser process, because in general we're not having great success at detecting and preventing OOMs. The tab/app discard mechanism "works" only statistically: if allocation is very fast, the discarder is called too late. The same thing will happen when trying to get a clean abort(), which, in addition, will almost surely try to allocate extra memory (if nothing else, it will page-fault new code in) and thus make its own life difficult.

The original request seems reasonable. Is it possible to safely estimate an upper bound for the size of the browser process, and make the browser abort() when that size is grossly exceeded, no matter how much RAM we have?
May 21 2018
** Bulk edit **

This bug has the label Postmortem-Followup but has not been updated in 3+ weeks. We are working on a new workflow to improve postmortem follow-through. Postmortems and postmortem bugs are very important in making sure we don't repeat prior mistakes and in making Chrome better for all. We will be taking a closer look at these bugs in the coming weeks. Please take some time to work on this, reassign it, or close it if the issue has been fixed. Thank you.
Aug 3
This bug has an owner, and thus it's been triaged. Changing status to "assigned".