New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 881197 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Oct 16
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 2
Type: Bug



Sign in to add a comment

Rename Stability.Android.ProcessedCrashCounts OOM metrics

Project Member Reported by yuzus@chromium.org, Sep 6

Issue description

OOM in the code base refers to mmap() or VirtualAlloc() returning NULL, which is either (1) virtual address space exhaustion or (2) over size commit.

However, in crash_metrics_reporter_andoid, OOM refers to (a) killed by Android low memory killer or (b) physical out of memory, and virtual_oom refers to (1)/(2) (what's called OOM in the other parts of the code base).

We should rename the existing metrics for consistency.
 
Cc: haraken@chromium.org boliu@chromium.org ssid@chromium.org
Components: Internals>Metrics>UMA
Labels: -Pri-3 Pri-2
Owner: yuzus@chromium.org
Status: Assigned (was: Assis)
so what are they being renamed to?
Labels: OS-Android
How about 
- virtual_oom -> oom
- oom -> os_kill_or_physical_oom
?
I think "OOM" applies equally to all of these cases, so they all should have some sort of qualification what "OOM" it's referring to.
That's true, then
- virtual_oom -> vas_exhausted_or_overcommit_oom
- oom -> os_kill_or_physical_oom
? 
Then can debate verbose-ness I guess. Is virtual_oom and os_or_physical_oom good enough? We don't need the variable name to explain exhaustively what all the cases are, as long as there is a comment explaining it somewhere.
What do you mean by other parts of the code referring to oom? Can you provide some example?
Some code I know is referring to physical oom as oom, like oom metics reporter(which should probably include all kinds of oom) and The oom intervention code refers to oom for both cases of oom

I'm trying to get the motivation behind renaming. It will also mess up the timeline view of the oom metrics we have right now. 
Cc: ajwong@chromium.org
I want to step back... As far as I know, different people / components use "OOM" for different meanings. The different meanings are mixed in many parts of the code base.

(a) The renderer is killed by lmkd.
(b) The renderer hits a physical memory OOM.
(c) The renderer hits a virtual address space OOM.
(d) The renderer hits a virtual memory oversize commit.
(e) The renderer hits a hard-coded heap limit of some memory component (e.g., V8, Oilpan, PartitionAlloc).

I agree that we should make the meaning of "OOM" consistent across the code base, but it's going to be a non-trivial amount of work. At the very least, just renaming Stability.Android.ProcessedCrashCounts won't solve the problem.

yuzu-san's goal is to count (c), (d) and (e) in UMA (because (a) and (b) are counted by Stability.Android.ProcessedCrashCounts).

Honestly speaking, I'd prefer putting the renaming work out of the scope of yuzu-san's project for now given the complexity.


Agreed. I do not want to scope creep yuzu-san's project. I also don't think this should be a blocker to submitting the metrics CL, but it does affect naming the API in //base.

I actually don't quite agree that every component uses OOM differently. Auditing the code so far, every single usage being touched seems to consistently say "OOM" means "mmap or VirtualAlloc returned NULL."  The one oddity does actually seem to be Stability.Android.ProcessedCrashCounts.  Do you have examples of another divergence?
I'm not aware of any code. However, when we say "OOM" in the Android context, it normally means that the process was killed by lmkd (or hit physical memory OOM). When the web-platform team says "reduce the OOM rate on Android", it means the rate of crashes caused by lmkd.

That code may not appear in Chromium because it's mostly handled by the Android side.

Here's a proposal
  (a) don't rename the existing metric (cause it messes up timeline view)
  (b) in //base/process/memory.h, partition_alloc, and v8, "OOM" seems to consistently mean "mmap()/VirtualAlloc() returned NULL" so when adding handlers there, just keep using the term "OOM"
  (c) For your new counter, call it "AllocationFailed" instead of OOM.

(c) in particular is more accurate because mmap() can return NULL on android/Linux because you are allocating too fast, not just because you have exhausted VM. This is the weird overcommit behavior.  In fact, I worry that using the term "virtual address space" in general is confusing cause I don't believe we're actually directly measuring that anywhere?

I think that is pretty easy and keeps things mostly consistent?
Sounds reasonable to me!

Cool!  And for completeness, here's the overcommit code in the kernel that shows the exact algorithm that's applied where the system mmap() will return NULL before address space exhaustion:
  https://github.com/torvalds/linux/blob/0e9b10395018ab78bf6bffcb9561a703c7f82cee/mm/util.c#L663

As a simple rule of thumb, if you are low on physical ram/swap, and you allocate a larger chunk (eg, wasm) mmap may deny you the allocation even if it'd fit in your VM space fine.

In your current design docs for better OOMs, the virtual address exhaustion concept seems to have actually been mmap() failure instead. I would gently nudge in replacing that terminology with something like allocation failure before it gets ingrained too much further.
Cc: keishi@chromium.org
It just occurred to me that mmap() returning NULL on a low-end android device before address space exhaustion may actually not be that crazy of a scenario. I assume on a 1GB device, you probably only have a few hundred megs free for any given renderer. If that renderer allocated multi-meg shared buffers in a loop (cause of buggy code, a leak, or something), you could probably hit the kernel overcommit check cited earlier waay before you exhaust the 32-bit address space.  I'm not 100% sure of my reasoning here, but wanted to throw it out there as I'm gone until next week.  Will play with it monday and see if I can some up with a working example.
Finally got a chance to inspect this on my Android One device. Seems like on Android:

shell@A3000:/ # cat /proc/sys/vm/overcommit_memory                             
1
shell@A3000:/ # cat /proc/sys/vm/overcommit_ratio                              
50

which I think means that mmap() will never return NULL except in virtual address exhaustion circumstances.

Thus in android land (though not linux land) your original assumption does seem safe.
Project Member

Comment 17 by bugdroid1@chromium.org, Sep 13

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6b41ce3ad6fcee93466568699d1b8dd7e9588773

commit 6b41ce3ad6fcee93466568699d1b8dd7e9588773
Author: Yuzu Saijo <yuzus@chromium.org>
Date: Thu Sep 13 03:57:09 2018

Rename virtual_oom to allocation_failed

This CL renames virtual_oom for Stability.Counts UMA to allocation_failed to be accurate. Allocation failure includes both virtual address space exhaustion and overcommit crash.

Bug:  881197 
Change-Id: I2d21d7e43bd02a47282bc545d271326747d60bbd
Reviewed-on: https://chromium-review.googlesource.com/1212434
Commit-Queue: Yuzu Saijo <yuzus@chromium.org>
Reviewed-by: Kentaro Hara <haraken@chromium.org>
Reviewed-by: Bo <boliu@chromium.org>
Reviewed-by: Siddhartha S <ssid@chromium.org>
Reviewed-by: Robert Sesek <rsesek@chromium.org>
Cr-Commit-Position: refs/heads/master@{#590918}
[modify] https://crrev.com/6b41ce3ad6fcee93466568699d1b8dd7e9588773/components/crash/content/browser/crash_metrics_reporter_android.cc
[modify] https://crrev.com/6b41ce3ad6fcee93466568699d1b8dd7e9588773/components/crash/content/browser/crash_metrics_reporter_android.h
[modify] https://crrev.com/6b41ce3ad6fcee93466568699d1b8dd7e9588773/components/crash/content/browser/crash_metrics_reporter_android_unittest.cc
[modify] https://crrev.com/6b41ce3ad6fcee93466568699d1b8dd7e9588773/third_party/blink/public/common/oom_intervention/oom_intervention_types.h
[modify] https://crrev.com/6b41ce3ad6fcee93466568699d1b8dd7e9588773/third_party/blink/renderer/controller/crash_memory_metrics_reporter_impl.cc
[modify] https://crrev.com/6b41ce3ad6fcee93466568699d1b8dd7e9588773/tools/metrics/histograms/enums.xml

Status: Fixed (was: Assigned)

Sign in to add a comment