Rename Stability.Android.ProcessedCrashCounts OOM metrics |
|||||
Issue descriptionOOM in the code base refers to mmap() or VirtualAlloc() returning NULL, which is either (1) virtual address space exhaustion or (2) over size commit. However, in crash_metrics_reporter_andoid, OOM refers to (a) killed by Android low memory killer or (b) physical out of memory, and virtual_oom refers to (1)/(2) (what's called OOM in the other parts of the code base). We should rename the existing metrics for consistency.
,
Sep 6
so what are they being renamed to?
,
Sep 6
How about - virtual_oom -> oom - oom -> os_kill_or_physical_oom ?
,
Sep 6
I think "OOM" applies equally to all of these cases, so they all should have some sort of qualification what "OOM" it's referring to.
,
Sep 6
That's true, then - virtual_oom -> vas_exhausted_or_overcommit_oom - oom -> os_kill_or_physical_oom ?
,
Sep 6
Then can debate verbose-ness I guess. Is virtual_oom and os_or_physical_oom good enough? We don't need the variable name to explain exhaustively what all the cases are, as long as there is a comment explaining it somewhere.
,
Sep 6
What do you mean by other parts of the code referring to oom? Can you provide some example? Some code I know is referring to physical oom as oom, like oom metics reporter(which should probably include all kinds of oom) and The oom intervention code refers to oom for both cases of oom I'm trying to get the motivation behind renaming. It will also mess up the timeline view of the oom metrics we have right now.
,
Sep 6
I want to step back... As far as I know, different people / components use "OOM" for different meanings. The different meanings are mixed in many parts of the code base. (a) The renderer is killed by lmkd. (b) The renderer hits a physical memory OOM. (c) The renderer hits a virtual address space OOM. (d) The renderer hits a virtual memory oversize commit. (e) The renderer hits a hard-coded heap limit of some memory component (e.g., V8, Oilpan, PartitionAlloc). I agree that we should make the meaning of "OOM" consistent across the code base, but it's going to be a non-trivial amount of work. At the very least, just renaming Stability.Android.ProcessedCrashCounts won't solve the problem. yuzu-san's goal is to count (c), (d) and (e) in UMA (because (a) and (b) are counted by Stability.Android.ProcessedCrashCounts). Honestly speaking, I'd prefer putting the renaming work out of the scope of yuzu-san's project for now given the complexity.
,
Sep 6
Agreed. I do not want to scope creep yuzu-san's project. I also don't think this should be a blocker to submitting the metrics CL, but it does affect naming the API in //base. I actually don't quite agree that every component uses OOM differently. Auditing the code so far, every single usage being touched seems to consistently say "OOM" means "mmap or VirtualAlloc returned NULL." The one oddity does actually seem to be Stability.Android.ProcessedCrashCounts. Do you have examples of another divergence?
,
Sep 6
I'm not aware of any code. However, when we say "OOM" in the Android context, it normally means that the process was killed by lmkd (or hit physical memory OOM). When the web-platform team says "reduce the OOM rate on Android", it means the rate of crashes caused by lmkd. That code may not appear in Chromium because it's mostly handled by the Android side.
,
Sep 6
Here's a proposal (a) don't rename the existing metric (cause it messes up timeline view) (b) in //base/process/memory.h, partition_alloc, and v8, "OOM" seems to consistently mean "mmap()/VirtualAlloc() returned NULL" so when adding handlers there, just keep using the term "OOM" (c) For your new counter, call it "AllocationFailed" instead of OOM. (c) in particular is more accurate because mmap() can return NULL on android/Linux because you are allocating too fast, not just because you have exhausted VM. This is the weird overcommit behavior. In fact, I worry that using the term "virtual address space" in general is confusing cause I don't believe we're actually directly measuring that anywhere? I think that is pretty easy and keeps things mostly consistent?
,
Sep 6
Sounds reasonable to me!
,
Sep 7
Cool! And for completeness, here's the overcommit code in the kernel that shows the exact algorithm that's applied where the system mmap() will return NULL before address space exhaustion: https://github.com/torvalds/linux/blob/0e9b10395018ab78bf6bffcb9561a703c7f82cee/mm/util.c#L663 As a simple rule of thumb, if you are low on physical ram/swap, and you allocate a larger chunk (eg, wasm) mmap may deny you the allocation even if it'd fit in your VM space fine. In your current design docs for better OOMs, the virtual address exhaustion concept seems to have actually been mmap() failure instead. I would gently nudge in replacing that terminology with something like allocation failure before it gets ingrained too much further.
,
Sep 7
,
Sep 7
It just occurred to me that mmap() returning NULL on a low-end android device before address space exhaustion may actually not be that crazy of a scenario. I assume on a 1GB device, you probably only have a few hundred megs free for any given renderer. If that renderer allocated multi-meg shared buffers in a loop (cause of buggy code, a leak, or something), you could probably hit the kernel overcommit check cited earlier waay before you exhaust the 32-bit address space. I'm not 100% sure of my reasoning here, but wanted to throw it out there as I'm gone until next week. Will play with it monday and see if I can some up with a working example.
,
Sep 10
Finally got a chance to inspect this on my Android One device. Seems like on Android: shell@A3000:/ # cat /proc/sys/vm/overcommit_memory 1 shell@A3000:/ # cat /proc/sys/vm/overcommit_ratio 50 which I think means that mmap() will never return NULL except in virtual address exhaustion circumstances. Thus in android land (though not linux land) your original assumption does seem safe.
,
Sep 13
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/6b41ce3ad6fcee93466568699d1b8dd7e9588773 commit 6b41ce3ad6fcee93466568699d1b8dd7e9588773 Author: Yuzu Saijo <yuzus@chromium.org> Date: Thu Sep 13 03:57:09 2018 Rename virtual_oom to allocation_failed This CL renames virtual_oom for Stability.Counts UMA to allocation_failed to be accurate. Allocation failure includes both virtual address space exhaustion and overcommit crash. Bug: 881197 Change-Id: I2d21d7e43bd02a47282bc545d271326747d60bbd Reviewed-on: https://chromium-review.googlesource.com/1212434 Commit-Queue: Yuzu Saijo <yuzus@chromium.org> Reviewed-by: Kentaro Hara <haraken@chromium.org> Reviewed-by: Bo <boliu@chromium.org> Reviewed-by: Siddhartha S <ssid@chromium.org> Reviewed-by: Robert Sesek <rsesek@chromium.org> Cr-Commit-Position: refs/heads/master@{#590918} [modify] https://crrev.com/6b41ce3ad6fcee93466568699d1b8dd7e9588773/components/crash/content/browser/crash_metrics_reporter_android.cc [modify] https://crrev.com/6b41ce3ad6fcee93466568699d1b8dd7e9588773/components/crash/content/browser/crash_metrics_reporter_android.h [modify] https://crrev.com/6b41ce3ad6fcee93466568699d1b8dd7e9588773/components/crash/content/browser/crash_metrics_reporter_android_unittest.cc [modify] https://crrev.com/6b41ce3ad6fcee93466568699d1b8dd7e9588773/third_party/blink/public/common/oom_intervention/oom_intervention_types.h [modify] https://crrev.com/6b41ce3ad6fcee93466568699d1b8dd7e9588773/third_party/blink/renderer/controller/crash_memory_metrics_reporter_impl.cc [modify] https://crrev.com/6b41ce3ad6fcee93466568699d1b8dd7e9588773/tools/metrics/histograms/enums.xml
,
Oct 16
|
|||||
►
Sign in to add a comment |
|||||
Comment 1 by yuzus@chromium.org
, Sep 6Components: Internals>Metrics>UMA
Labels: -Pri-3 Pri-2
Owner: yuzus@chromium.org
Status: Assigned (was: Assis)