
Issue 852906

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




Add UMA metrics for tab discard processing time

Project Member Reported by cylee@chromium.org, Jun 14 2018

Issue description

Per https://crbug/850545, we should better understand how time is spent from when a low memory signal is fired until the first tab is killed. I'm adding more UMA metrics for this.

 
Project Member

Comment 2 by bugdroid1@chromium.org, Jul 3

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/1375063e3bbfd6386d7f5af717db9f091c2edc63

commit 1375063e3bbfd6386d7f5af717db9f091c2edc63
Author: Kuo-Hsin Yang <vovoy@chromium.org>
Date: Tue Jul 03 06:10:45 2018

Add unit to Arc.LowMemoryKiller.FirstKillLatency

BUG=852906

Change-Id: I6a5c7624bca5ad803047247f668014c84d866b91
Reviewed-on: https://chromium-review.googlesource.com/1119732
Commit-Queue: Vovo Yang <vovoy@chromium.org>
Reviewed-by: Robert Kaplow <rkaplow@chromium.org>
Cr-Commit-Position: refs/heads/master@{#572122}
[modify] https://crrev.com/1375063e3bbfd6386d7f5af717db9f091c2edc63/tools/metrics/histograms/histograms.xml

Cc: chinglinyu@chromium.org vovoy@chromium.org
On dev/canary, we can see that 5-10% of FirstKillLatency samples are > 1 second. The most suspicious bottleneck for FirstKillLatency is the RPC call to ARC to get the list of Android processes. When the system is busy, switching between Chrome threads may take longer than expected, which can delay the Chrome UI thread in firing and receiving RPC calls. One piece of evidence is that the thread watcher sometimes complains about long response times in feedback reports, like:
[942:1255:0705/014901.361436:WARNING:thread_watcher.cc(171)] Long response time on the UI thread: 2.58608 s

That said, I haven't correlated long FirstKillLatency with long UI thread response time yet. I need more data from feedback reports to analyze the correlation.

Besides that, I think Luigi's memd also contains timestamped information about low-mem notifications and tab discards. I'll check it out in recent feedback reports to see if we can get more from it.
Cc: semenzato@chromium.org sonnyrao@chromium.org
The timeline of a low memory handling cycle looks like this:

1) Kernel enters low memory condition
2) Chrome detects low memory condition by periodically polling
3) Chrome asks to kill tabs/apps
4) Chrome actually kills tabs/apps
5) Kernel leaves low memory condition

The previously added UMA metric "FirstKillLatency" only measures (roughly) the time between 3) and 4). Since we may want to know more about the total time spent on a full low memory handling cycle, more measurements are needed. In particular, we may want to know the time between 1) and 4), or 1) and 5), to tell whether we address low memory situations efficiently.
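To make that concrete, here's a rough Rust sketch (hypothetical struct and field names, not existing code) of recording one timestamp per stage of the timeline and deriving those two intervals:

// Hypothetical sketch: one timestamp per stage of the timeline above.
use std::time::{Duration, Instant};

struct LowMemCycle {
    entered_low_mem: Instant,        // 1) kernel enters low memory condition
    detected: Option<Instant>,       // 2) Chrome detects it by polling
    kill_requested: Option<Instant>, // 3) Chrome asks to kill tabs/apps
    first_kill: Option<Instant>,     // 4) first tab/app actually killed
    left_low_mem: Option<Instant>,   // 5) kernel leaves low memory condition
}

impl LowMemCycle {
    // Time between 1) and 4): how long until the first kill lands.
    fn enter_to_first_kill(&self) -> Option<Duration> {
        self.first_kill.map(|t| t.duration_since(self.entered_low_mem))
    }

    // Time between 1) and 5): how long the whole low-memory episode lasts.
    fn enter_to_leave(&self) -> Option<Duration> {
        self.left_low_mem.map(|t| t.duration_since(self.entered_low_mem))
    }
}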

To know when 1) happens, we should either poll /dev/chromeos-low-mem or do a blocking read on it. I've read the code in Luigi's memd and it seems to do the job nicely. However, it's in Rust, so I can't add UMA reporting directly in that code. Here are some possible solutions:

1. Find a way to call the UMA C++ library from memd. I don't have much of a clue how to bind the interface yet.

2. Have another thread in Chrome that reads memd's output "clip files" and uploads the statistics to UMA. This is a more promising approach.

3. Do it from scratch in Chrome:
  3.1 Have another thread do blocking reads on /dev/chromeos-low-mem, monitor all tab kill events, and detect when the system leaves the low-mem state (see the sketch after this list). The downside is that this partially re-implements memd. But it could be a bit more accurate, since the tab kill event could be delivered by a simple function call instead of a D-Bus communication.
  3.2 cywang@ proposed exposing a timestamp from the kernel along with /dev/chromeos-low-mem to show the last time the system entered the low memory condition. It may be helpful if we don't want a blocking thread like the one described in 3.1, but it adds more complexity. Let's discuss it offline.
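A minimal Rust sketch of the blocking wait in 3.1, assuming /dev/chromeos-low-mem becomes readable to poll(2) while the system is under memory pressure (that's how memd appears to use it; please double-check). It needs the libc crate:

use std::fs::File;
use std::os::unix::io::AsRawFd;
use std::time::Instant;

// Block until the kernel reports the low-memory condition on the device fd.
fn wait_for_low_mem(dev: &File) -> std::io::Result<Instant> {
    let mut fds = [libc::pollfd {
        fd: dev.as_raw_fd(),
        events: libc::POLLIN,
        revents: 0,
    }];
    let rc = unsafe { libc::poll(fds.as_mut_ptr(), 1, -1) };
    if rc < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(Instant::now()) // timestamp of step 1) in the timeline above
}

fn main() -> std::io::Result<()> {
    let dev = File::open("/dev/chromeos-low-mem")?;
    let entered = wait_for_low_mem(&dev)?;
    println!("entered low-mem at {:?}", entered);
    Ok(())
}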

Solution 2 or 3 looks good to me. Solution 1 saves some more work - if there's a solution to the solution :)


There are some other issues, like how we judge that the system has really resolved memory pressure. The system can bounce between high-mem and low-mem very frequently when the memory level is close to the margin. While it keeps crossing the margin up and down, memory pressure is not actually resolved yet. Maybe we should combine adjacent memory pressure events and count them only once.
As a side note, we may not be able to detect "leave memory pressure" events perfectly because we can only poll for them now. If memory goes up and down quickly, we'll miss some of them. Failing to detect such a high memory state may be fine, because we can keep treating the system as still under memory pressure.
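A rough sketch of the coalescing idea (the gap value below is just a placeholder): treat an enter that follows a leave by less than the gap as part of the same event, and only finalize the event once the gap has passed without a re-entry.

use std::time::{Duration, Instant};

// Hypothetical tracker that merges rapid enter/leave transitions into one event.
struct PressureTracker {
    gap: Duration,                  // e.g. 1 second
    event_start: Option<Instant>,   // start of the current coalesced event
    pending_leave: Option<Instant>, // last observed leave, not yet finalized
}

impl PressureTracker {
    fn new(gap: Duration) -> Self {
        Self { gap, event_start: None, pending_leave: None }
    }

    fn on_enter(&mut self, now: Instant) {
        match self.pending_leave {
            // Re-entered within the gap: keep extending the same event.
            Some(left) if now.duration_since(left) < self.gap => self.pending_leave = None,
            _ => {
                self.event_start = Some(now);
                self.pending_leave = None;
            }
        }
    }

    fn on_leave(&mut self, now: Instant) {
        self.pending_leave = Some(now);
    }

    // Call periodically; once the gap elapses with no re-entry, the event is
    // final and its duration can be reported.
    fn maybe_finalize(&mut self, now: Instant) -> Option<Duration> {
        let (start, left) = (self.event_start?, self.pending_leave?);
        if now.duration_since(left) >= self.gap {
            self.event_start = None;
            self.pending_leave = None;
            Some(left.duration_since(start))
        } else {
            None
        }
    }
}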

I think solution 1 is the best.  We can call C from Rust pretty easily, I think, and there is a C API for metrics.  Not totally sure about the linking but it should be doable.  It would also be useful for the crostini VM.

Regarding the last part of your comment: we should probably add some hysteresis to tab discards so we would not have to worry about rapid crossings of the low-mem threshold.  Maybe we could even add it in the kernel.  (Maybe not.)  In any case it makes sense, I think, once the threshold is crossed, to not watch it too closely at first.
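For the hysteresis idea, a tiny sketch (hypothetical cooldown value) on the discard side: once a discard fires, ignore further threshold crossings for a cooldown period instead of reacting to every one.

use std::time::{Duration, Instant};

// Hypothetical throttle: back off after a discard so rapid threshold
// crossings don't each trigger another kill.
struct DiscardThrottle {
    cooldown: Duration,
    last_discard: Option<Instant>,
}

impl DiscardThrottle {
    fn should_discard(&mut self, low_mem_now: bool, now: Instant) -> bool {
        if !low_mem_now {
            return false;
        }
        match self.last_discard {
            Some(t) if now.duration_since(t) < self.cooldown => false,
            _ => {
                self.last_discard = Some(now);
                true
            }
        }
    }
}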
Cc: dgreid@chromium.org
I just spoke with the crostini folks and by coincidence they have been working on this issue in the past couple of days.  From my understanding, first they had to disentangle the metrics library from libchrome, then they added (or are adding) libmetrics to their build, so they can call the C functions via Rust FFI.

Dylan, would you be so kind as to link any relevant CL here? Thanks!
https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/1163644

That's the issue where we were calling the C API from cras. Calling from Rust should be similar.
Components: OS>Performance>Memory
https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1203330
https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/1203350
I've implemented CLs to report the duration of a memory pressure event (defined as the elapsed time from entering the low-mem state to leaving it). Consecutive events spaced less than 1s apart (or whatever number we pick) are treated as a single memory pressure event. The implementation uses Rust FFI to talk to libmetrics.
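For anyone following along, the FFI path looks roughly like the sketch below. The extern declarations are my reading of libmetrics' C wrapper (c_metrics_library.h), and the histogram name and bucket parameters are made up, so verify both against the actual CLs above before reusing this:

use std::ffi::CString;
use std::os::raw::{c_char, c_int, c_void};

type CMetricsLibrary = *mut c_void;

// Assumed signatures from c_metrics_library.h; check the header when linking.
#[link(name = "metrics")]
extern "C" {
    fn CMetricsLibraryNew() -> CMetricsLibrary;
    fn CMetricsLibraryDelete(handle: CMetricsLibrary);
    fn CMetricsLibraryInit(handle: CMetricsLibrary);
    fn CMetricsLibrarySendToUMA(
        handle: CMetricsLibrary,
        name: *const c_char,
        sample: c_int,
        min: c_int,
        max: c_int,
        nbuckets: c_int,
    ) -> c_int;
}

// Report one memory pressure event duration (ms) to a hypothetical histogram.
fn report_pressure_duration_ms(ms: i32) {
    let name = CString::new("Platform.MemoryPressureDuration").unwrap();
    unsafe {
        let handle = CMetricsLibraryNew();
        CMetricsLibraryInit(handle);
        CMetricsLibrarySendToUMA(handle, name.as_ptr(), ms, 1, 60_000, 50);
        CMetricsLibraryDelete(handle);
    }
}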

More to add:
- Measure the time from entering the low-mem state to Chrome receiving the low-mem signal (which currently relies on polling). Chrome needs to send an extra D-Bus message to report that it received the low-mem signal, and it should attach a timestamp to be accurate (because of D-Bus overhead).
- We could also report the count of OOM kills / tab discards within one memory pressure event.

Cc: cylee@chromium.org cros-perf-detectives@google.com
Issue 869744 has been merged into this issue.
Status: Assigned (was: Untriaged)
This issue has an owner, a component and a priority, but is still listed as untriaged or unconfirmed. By definition, this bug is triaged. Changing status to "assigned". Please reach out to me if you disagree with how I've done this.
