
Issue 773110


Issue metadata

Status: Assigned
OS: Chrome
Pri: 2
Type: Feature




log chrome and system memory activity near events of interest

Project Member Reported by semenzato@chromium.org, Oct 9 2017

Issue description

We collect separately a number of statistics about tab discards (in the Chrome user log), OOM kills (in the kernel log, but also in the Chrome user log), vm activity (vmlogs, added recently), and UMA samples.  Some of these logs are uploaded with feedback reports, and many are time-stamped so in principle we can correlate events.

However, there are other measurements of interest, and continuous collection (as with vmlogs) limits the granularity we can afford.

I am specifically interested in looking at memory manager activity under memory pressure "events".  These can be:

- low-memory notifications
- tab discards (which may or may not happen after a low-memory notification)
- oom kills
- swap activity beyond certain thresholds
- maxima of allocation and swap activity

I would like to collect high-granularity memory manager data for a few seconds before and after such events, and maintain a FIFO of the most recent N such collections.  This data would be uploaded when sending feedback reports and can be used for statistical analysis and possibly drive improvements in the memory management strategy.
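To make the data structure concrete, here is a rough sketch of the FIFO I have in mind (all names are made up; this is not an implementation): each entry holds the samples from a window around one event, and only the most recent N entries are kept.

  // Hypothetical sketch (names made up): keep the most recent N per-event
  // collections of high-granularity samples.
  #include <cstddef>
  #include <cstdint>
  #include <deque>
  #include <string>
  #include <utility>
  #include <vector>

  struct VmSample {
    int64_t timestamp_ms;  // when the sample was taken
    uint64_t pgmajfault;   // counters read from /proc/vmstat
    uint64_t pswpin;
    uint64_t pswpout;
  };

  struct EventWindow {
    std::string event_type;         // "tab_discard", "oom_kill", ...
    std::vector<VmSample> samples;  // a few seconds before and after the event
  };

  class EventWindowFifo {
   public:
    explicit EventWindowFifo(size_t max_windows) : max_windows_(max_windows) {}

    void Add(EventWindow window) {
      windows_.push_back(std::move(window));
      if (windows_.size() > max_windows_)
        windows_.pop_front();  // drop the oldest collection
    }

    const std::deque<EventWindow>& windows() const { return windows_; }

   private:
    const size_t max_windows_;
    std::deque<EventWindow> windows_;
  };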

 
See issue 773111 for more motivation for this.
Other events:

- start of loading (or reloading) a page
- both the beginning and the end (if available) of a tab discard

Also, the earlier sentence "tab discards (which may or may not happen after a low-memory notification)" is ambiguous.  It should be "tab discards (always triggered by low-memory notifications, although not all notifications trigger them)".



I am thinking that a daemon (possibly the metrics daemon) should collect all this information.  Chrome would send the relevant events via dbus.  The daemon can collect what it needs from /proc/stat and /proc/vmstat etc.  Maybe add another procfs or sysfs node for OOM kills.
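As a sketch of the sampling side (the field names are real /proc/vmstat counters; the code itself is just an illustration, not an existing implementation):

  // Sketch: read selected counters from /proc/vmstat into a map.
  #include <cstdint>
  #include <fstream>
  #include <map>
  #include <sstream>
  #include <string>

  std::map<std::string, uint64_t> ReadVmstat() {
    std::map<std::string, uint64_t> counters;
    std::ifstream vmstat("/proc/vmstat");
    std::string line;
    while (std::getline(vmstat, line)) {
      std::istringstream iss(line);
      std::string name;
      uint64_t value;
      if (iss >> name >> value)
        counters[name] = value;  // e.g. "pswpin", "pswpout", "pgmajfault"
    }
    return counters;
  }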

All of this information is already available by monitoring log files, but that approach looks fragile.  On the other hand, there is already a daemon, the anomaly collector, that monitors the syslog using inotify to detect when new lines are logged and when the file is renamed (for log rotation); it uses the scanner generator lex.  We could extend that daemon instead.
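For reference, the monitoring part of such a daemon boils down to something like this (a sketch only; the real anomaly collector also handles re-opening the file after rotation and feeds the new lines to a lex-generated scanner):

  // Sketch: watch /var/log/messages for appended lines and for renames.
  #include <sys/inotify.h>
  #include <unistd.h>

  int main() {
    int fd = inotify_init();
    if (fd < 0)
      return 1;
    // IN_MODIFY fires when lines are appended; IN_MOVE_SELF fires when the
    // file is renamed during log rotation.
    int wd = inotify_add_watch(fd, "/var/log/messages",
                               IN_MODIFY | IN_MOVE_SELF);
    if (wd < 0)
      return 1;
    char buf[4096];
    for (;;) {
      ssize_t len = read(fd, buf, sizeof(buf));
      if (len <= 0)
        break;
      // ... read the new tail of the log and scan it for events here ...
    }
    close(fd);
    return 0;
  }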

The granularity of vmstat collection should be fairly high.  I am thinking maybe 10 Hz or so.  If this is too much overhead, we can make the rate dynamic so that it becomes quasi-quiescent when there is a lot of free RAM, and the allocation rate is low.
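Something along these lines for the dynamic rate (the thresholds are placeholders, not tuned values):

  // Sketch: choose the next sampling interval from current memory pressure.
  #include <cstdint>

  constexpr int kFastIntervalMs = 100;   // ~10 Hz under pressure
  constexpr int kSlowIntervalMs = 2000;  // quasi-quiescent when RAM is plentiful

  int NextSampleIntervalMs(uint64_t free_ram_mb, uint64_t alloc_pages_per_sec) {
    // Placeholder thresholds; the real values would come from measurement.
    if (free_ram_mb > 1024 && alloc_pages_per_sec < 1000)
      return kSlowIntervalMs;
    return kFastIntervalMs;
  }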

Comment 4 by derat@chromium.org, Oct 10 2017

There's probably enough going on here to justify a design doc. :-)

I think it's probably okay to make Chrome export a new D-Bus service (maybe something like org.chromium.Browser, even?) and emit signals on it for rare memory events. I'm less sure about whether it's reasonable to announce all loads (navigations?), though. It may be better to make Chrome collect these stats internally and expose a GetMemoryEvents or similar method that metricsd can call; then you're not spamming the system bus with rarely-used information.
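To sketch the signal variant with Chrome's //dbus bindings (the service, interface, and signal names here are placeholders, not a proposed API):

  // Sketch only: emit a signal for a rare memory event.
  #include <string>

  #include "dbus/exported_object.h"
  #include "dbus/message.h"

  void EmitMemoryEvent(dbus::ExportedObject* exported_object,
                       const std::string& event_type) {
    dbus::Signal signal("org.chromium.Browser", "MemoryPressureEvent");
    dbus::MessageWriter writer(&signal);
    writer.AppendString(event_type);  // e.g. "tab_discard", "oom_kill"
    exported_object->SendSignal(&signal);
  }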
Ha, and to think that this all started with a seemingly innocent question. ;)

Actually I agree.  Here it goes.  Thanks.

https://docs.google.com/a/google.com/document/d/1cTze7WIbY_YEmM8dX9JRBdo-ZoZjIJtlWIOiUDueKcY/edit?usp=sharing


To address the second part of #4: is there really much overhead associated with a D-Bus SendSignal?  I wonder whether a tab load, or even a tab switch, isn't such a large operation (and relatively infrequent, since these happen on a human time scale) that a SendSignal is tolerable.  We could make it even more lightweight, since we can assume there is only one recipient of those notifications.

Comment 7 by derat@chromium.org, Oct 12 2017

Why pass debugging information around between processes when we don't actually need to? I also think it's likely that more information will be added as time goes on, and it seems better for the collection/reporting mechanism to start out scalable (and less likely to spam the "dbus-monitor --system" output).

(Re overhead with sending signals, it at the very least causes context switches to dbus-daemon and the receiver(s). Those are probably insignificant compared to the work required to load a webpage, but see my earlier suspicion that more information will be added in the future.)
OK, I see your point: once the interface is there, it can be abused, because the interface cannot easily enforce or even define a "use only sparingly" contract.

I am thinking of other ways of looking at this.

Cc: dtor@chromium.org
The problem with the proposal in #4 is that the external process now needs to poll Chrome continuously for rare events.  Also, the ring buffer holding the stats needs to be larger to account for polling delays.  (We want the data around the event, not at poll time.)

We could consider some other form of IPC, but I am not sure it makes sense to add one.

We could put everything into Chrome, using PostDelayedTask, but other than the event notifications from Chrome, everything else could be done more easily outside Chrome (smaller binaries to deal with, and no need for knowledge of Chrome internals).
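(For concreteness, the all-in-Chrome variant would look roughly like this; a sketch only, with SampleVmstatAfterEvent as a hypothetical helper.)

  // Sketch: on a memory event, schedule a delayed task that records the
  // "after" samples a few seconds later.
  #include "base/bind.h"
  #include "base/location.h"
  #include "base/threading/thread_task_runner_handle.h"
  #include "base/time/time.h"

  void SampleVmstatAfterEvent();  // hypothetical helper that takes the samples

  void OnMemoryPressureEvent() {
    // ... record the "before" samples from the continuously updated buffer ...
    base::ThreadTaskRunnerHandle::Get()->PostDelayedTask(
        FROM_HERE, base::Bind(&SampleVmstatAfterEvent),
        base::TimeDelta::FromSeconds(3));
  }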

(Adding Dmitry since I discussed this with him.)

Comment 10 by derat@chromium.org, Oct 15 2017

Cc: steve...@chromium.org
If Chrome has access to all of the data that you want to report, as well as knowing when it should be reported, then I'd vote for having fewer moving parts and just putting this all inside of Chrome. I don't find the arguably-simpler developer experience of putting this in metricsd or a new daemon to be strong justification for a more-complicated, harder-to-debug design that uses IPC.
I see your point, but by this reasoning pretty much every daemon (shill, powerd, etc.) should be integrated into Chrome.  Are you saying that it's just an accident that they are not?

I'd like to see more explanation of why the design is more complicated and harder to debug.  The IPC in this case is simple: Chrome sends DBus signals for some events of interest.  That's it.  I don't see how that part would be hard to debug.

OTOH Chrome on Chrome OS is harder to debug than on other systems.  Incremental compile is fast, but linking takes time, as does copying the binary to the DUT.

I'll agree that the IPC stubs are a royal pain though.  I haven't counted the number of stubs, but 37,000 lines of code (the content of src/chromeos/dbus) seems like a lot just for stubs.  (This doesn't include the 15,000 lines in src/dbus.)  The class structure is organized for testability, but it's not clear what is being tested here: just the stubs?

A number of people with experience with the kernel memory manager don't have a strong background in Chrome coding, and it's not something that can be picked up quickly.  It seems valuable to make it easier for those people to work on this part of the system.

 

Comment 12 by derat@chromium.org, Oct 16 2017

Labels: -Type-Bug Type-Feature
Re being harder to debug, you'll have two separate processes talking over D-Bus instead of having everything contained in a single process. If you really want to put this in metricsd, I'm not going to stop you.
Re #12: Thanks.  I am also beginning to wonder how much we'd have to rely on debugd to get some of the data into Chrome.  Then we'd still have to do IPC, and probably in a worse way.
Status: Assigned (was: Untriaged)
