Issue 749727


Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug




Vmlog reports a large number of page swap-ins without many major faults.

Project Member Reported by teravest@chromium.org, Jul 27 2017

Issue description

While debugging user reports of jank, I started to look at vmlog data to try to correlate swap activity with slowness logged by Chrome.

I was able to match up high swap activity to "Slow method" log lines in the M61 feedback, and roughly match up the time to tab discarding in the M60 feedback.

Two recent feedback reports show significant swapping activity, and a surprising ratio of swapped-in pages to major faults:
  https://listnr.corp.google.com/report/69149412199  (M61)
  https://listnr.corp.google.com/report/69266491909  (M60)

Here's one surprising excerpt from the M60 feedback:
$ cat vmlog_data | awk 'NR==6 {old_mf=$2; old_in=$5; old_out=$6; next}; {print $1, $2-old_mf, $5-old_in, $6-old_out; old_mf=$2; old_in=$5; old_out=$6}'

time majfault_diff pswpin_diff pswpout_diff
[47612.585877] 5765 7182 22247
[47614.588085] 3261 3068 57060
[47616.590277] 5228 37849 47169
[47618.592463] 5967 63530 88807
[47620.594358] 5305 81497 116868
[47622.596198] 4779 156290 180023
[47624.598345] 5107 66778 138156
[47626.600310] 7222 134962 167160
[47628.601382] 3549 3351 0
[47630.603378] 5961 6681 35
[47632.606104] 5140 60536 85271
[47634.608168] 1543 78701 71555
[47636.610198] 1765 223093 216027
[47638.612186] 3603 193182 201938
[47640.614698] 470 192090 183774
[47642.616766] 3879 200923 208216
[47644.618208] 184 213786 214028
[47646.620372] 856 251920 252174
[47648.622493] 500 276144 277755
[47650.624222] 1044 213318 253818
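
For readability, here is a rough Python sketch of the diff computation done by the awk one-liner above (not part of the original report; the column positions are inferred from the awk field numbers: $1 = timestamp, $2 = pgmajfault, $5 = pswpin, $6 = pswpout, all cumulative counters):

# Rough equivalent of the awk one-liner: seed the baselines at line 6 of
# vmlog_data, then print per-interval differences of the cumulative counters.
# (The awk script also echoes the first five lines against a zero baseline,
# which this sketch skips.)
prev = None
print("time majfault_diff pswpin_diff pswpout_diff")
with open("vmlog_data") as f:
    for lineno, line in enumerate(f, start=1):
        if lineno < 6:
            continue                       # awk only seeds its baselines at NR==6
        fields = line.split()
        ts = fields[0]
        majfault, pswpin, pswpout = int(fields[1]), int(fields[4]), int(fields[5])
        if prev is not None:
            print(ts, majfault - prev[0], pswpin - prev[1], pswpout - prev[2])
        prev = (majfault, pswpin, pswpout)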

Those swapout rates seem extremely high:
>>> 277755.0 * 4096 / 2 / 1024 / 1024
542.490234375 MB/sec

At the same time, there are not many major faults. Looking at the ratio of swap-ins to major faults:
>>> 276144.0 / 500
552.288

This seems very suspicious, especially because most of the log shows the expected, roughly 1:1 ratio of major faults to swap-ins.
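
For reference, the same two conversions can be applied to every interval to flag the anomalous ones (an illustrative sketch, not from the report; it assumes 4 KiB pages and the 2-second sampling interval visible in the timestamps above):

# Convert each interval's counter diffs into a swap-out rate (MB/s) and a
# swap-in to major-fault ratio; the sample rows are taken from the excerpt.
PAGE_SIZE = 4096
INTERVAL_S = 2.0

samples = [
    (5765, 7182, 22247),       # majfault_diff, pswpin_diff, pswpout_diff
    (500, 276144, 277755),
    (1044, 213318, 253818),
]

for majfault, pswpin, pswpout in samples:
    mb_per_s = pswpout * PAGE_SIZE / INTERVAL_S / 1024 / 1024
    ratio = pswpin / max(majfault, 1)
    print(f"swap-out {mb_per_s:8.1f} MB/s   swap-in:majfault {ratio:8.1f}")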
 
Cc: igo@chromium.org
Doug pointed out that this may be explained by swapping in huge pages.  The unit of swap is still the small page, so a single page fault might swap in a large number of them.
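For scale (an illustration, not from the report): a 2 MiB huge page spans 512 small 4 KiB pages, so under this theory one fault could in principle account for hundreds of swap-ins:

>>> (2 * 1024 * 1024) // 4096
512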
It's not huge pages.  Chrome OS uses hugepages, but only for the Chrome text segment, which we don't swap.
Could it be from general compaction triggered by higher order allocations?
But why would compaction need to swap pages in?  Only the pages present in memory need to be copied.
re #5 -- oh, I just saw the high swap-out rate. Yeah, I'm not sure why there would be a lot of swap-ins, unless compaction could cause swap-out + swap-in when moving pages around.
I'm not sure that compaction could cause this behavior -- it was just a guess; I need to look at the code more.

The feedback report of a sudden burst of activity reminded me a bit of an older bug: when Chrome switched to Oilpan, which didn't have generational GC, we started triggering excessive mark-and-sweep-style GC:

https://bugs.chromium.org/p/chromium/issues/detail?id=588172#c84

So, it's just another random thought, with no data to back it up, but maybe this type of linear-scan GC activity could cause a lot of prefetching without a lot of page faults if it were happening on mmapped memory?
#7 very nice theory.  I wonder if we shouldn't see such GCs more often.  Maybe we just haven't looked for them.

I looked more closely at the logs.  The "swap storm" starts here:

[47632.606104] 5140 60536 85271

In the syslog, this corresponds roughly to this line:

2017-07-27T08:11:46.570297+00:00 NOTICE kernel: [47626.631116] audit: type=1400 audit(1501143106.569:188): avc:  denied  { read } for  pid=8410 comm="Binder:2860_1" name="/" dev="tmpfs" ino=26817 scontext=u:r:priv_app:s0:c512,c768 tcontext=u:object_r:device:s0 tclass=dir permissive=0

(there's nothing interesting from the kernel at that time)

I think the syslog time is off by 7 hours (GMT difference) because the syslog ends here:

2017-07-27T08:14:10.189258+00:00 DEBUG kernel: [47770.249358] SELinux: initialized (dev proc, type proc), uses genfs_contexts

and the chrome user log ends about 7 hours earlier:

[1635:1704:0727/011328.539112:VERBOSE1:ppapi_plugin_process_host.cc(506)] PpapiPluginProcessHost[broker]<IPv6: 5>OnChannelError()
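
As a rough sanity check of the 7-hour hypothesis (an illustration, not from the report; the two logs do not end at exactly the same instant, so only the approximate difference matters):

# Difference between the last syslog stamp and the last chrome user log stamp
# quoted above; a result of roughly 7 hours supports the timezone-offset theory.
from datetime import datetime

syslog_end = datetime(2017, 7, 27, 8, 14, 10)   # last syslog line: 2017-07-27T08:14:10
chrome_end = datetime(2017, 7, 27, 1, 13, 28)   # last chrome user log line: 0727/011328
print(syslog_end - chrome_end)                  # 7:00:42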

Based on this hypothesis, the chrome user log shows that the tab discarder had just started running (about 5 seconds before the storm start):

[1635:1635:0727/011146.346449:WARNING:oom_memory_details.cc(46)] Tab Discards Memory details (274 ms):

but had done nothing:

[1635:1635:0727/011146.355439:ERROR:device_event_log_impl.cc(156)] [01:11:46.355] Memory: tab_manager_delegate_chromeos.cc:577 Target memory to free: -329728 KB

The next tab discard is 90 seconds later.
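
As a rough cross-check of the "about 5 seconds before the storm start" timing (an illustration, not from the report), the quoted syslog line gives one anchor point relating kernel timestamps to chrome-log times under the 7-hour hypothesis:

# kernel ts 47626.631 <-> syslog 08:11:46.57 <-> chrome-log 01:11:46.57 (minus 7 h)
anchor_kernel_ts = 47626.631                      # syslog: [47626.631116]
anchor_chrome_s = 1 * 3600 + 11 * 60 + 46.57      # 01:11:46.57 as seconds since midnight

storm_kernel_ts = 47632.606                       # first big pswpin jump in the vmlog
discard_chrome_s = 1 * 3600 + 11 * 60 + 46.346    # "Tab Discards Memory details" at 01:11:46.346

storm_chrome_s = anchor_chrome_s + (storm_kernel_ts - anchor_kernel_ts)
print(round(storm_chrome_s - discard_chrome_s, 1))   # ~6.2 s: discarder ran ~5-6 s before the storm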

Shouldn't a large GC event be logged?

Could something be going on on the Android side?

The included bugreport suggests not:

Total PSS by process:
     51,160K: com.android.vending (pid 27620 / activities)
     41,191K: com.google.android.gms (pid 2888)
     36,516K: system (pid 289)
     35,198K: com.google.android.gms.unstable (pid 2860)
     32,769K: com.google.android.gms.persistent (pid 2731)
     27,488K: com.google.android.gm (pid 8300)
     15,713K: android.process.media (pid 7348)
      8,306K: android.process.acore (pid 6936)
      8,276K: com.google.android.gms.ui (pid 26340)
      5,426K: com.android.pacprocessor (pid 1906)
      5,044K: com.google.android.keep (pid 27171)
      4,084K: com.google.process.gapps (pid 783)
      3,829K: com.google.process.gapps (pid 28450)

These are all small processes.

I am not super-proficient at reading bugreports (this is the first one I've read), but I don't see anything super-exciting at that time.

Cc: w...@chromium.org
> based on this hypothesis, the chrome user log shows that the tab discarder 
> had just started running (about 5 seconds before the storm start):

When we were talking about the tab discarder, didn't someone suggest that the process of discarding a tab (especially with a shared renderer?) could trigger a GC?  Yeah, Wez said:

> killing a tab and GCing to reclaim memory is a non-starter in that situation.  
> Tab-killing from shared renderers really makes most sense if we're going to do 
> it before the renderer starts getting swapped-out, though, which I suspect makes 
> it not-very-useful in practice.


Comment 10 by w...@chromium.org, Jul 28 2017

Cc: haraken@chromium.org
Re #9: Discarding a tab in a shared renderer will not free up any memory until both V8 and Oilpan have GC'd; I thought we had a fast-path tear-down for single-tab renderers, but I think that depends on details of what the renderer is running - Kentaro should be able to confirm.
If a single-tab renderer does not have unload handlers, shared workers etc, it is killed immediately (and the memory is freed up).

If the renderer is shared by multiple tabs, we need V8 and Oilpan GCs to collect the memory of discarded tabs.

There are 4 paths that increment PSWPIN (in kernel 3.18):

1. PGMAJFAULT_A (anonymous)
  do_swap_page() -> swapin_readahead() -> read_swap_cache_async()
2. PGMAJFAULT_S (shmem)
  shmem_fault() -> shmem_getpage_gfp() -> shmem_swapin() -> swapin_readahead() -> read_swap_cache_async()
3. madvise(..., MADV_WILLNEED)
  madvise_willneed() -> force_swapin_readahead() / force_shm_swapin_readahead() -> read_swap_cache_async()
4. swapoff
  swapoff() -> try_to_unuse() -> read_swap_cache_async()

swapoff is unlikely. It is most likely madvise(..., MADV_WILLNEED) that caused the big mismatch between PSWPIN and PGMAJFAULT.
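
A minimal user-space sketch of path 3 (an illustration only, assuming Linux and Python 3.8+, which exposes madvise on mmap objects): MADV_WILLNEED asks the kernel to read the advised range back in ahead of time, so swapped-out pages are brought in (bumping PSWPIN) without the process taking major faults when it later touches them.

# Illustration of path 3: madvise(MADV_WILLNEED) on an anonymous mapping.
# If parts of 'buf' have been swapped out, the madvise call goes through
# madvise_willneed() -> force_swapin_readahead() -> read_swap_cache_async(),
# incrementing PSWPIN without charging PGMAJFAULT to this process.
import mmap

length = 64 * 1024 * 1024            # 64 MiB of anonymous memory
buf = mmap.mmap(-1, length)
buf[:] = b"\xff" * length            # touch every page so it can later be swapped out

# ... after memory pressure has swapped some of these pages out ...
buf.madvise(mmap.MADV_WILLNEED)      # prefetch the whole range back from swap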
re #12 -- What about the shmem path?  Could it be that graphics buffers see only one fault but the entire object gets paged in?  That could certainly result in a very large ratio of swap-ins to faults.

Comment 14 Deleted

re #13: It could be. The following call path increases PSWPIN without increasing PGMAJFAULT.
i915_gem_object_get_pages_gtt() -> shmem_read_mapping_page_gfp() -> shmem_getpage_gfp() -> shmem_swapin() -> swapin_readahead() -> read_swap_cache_async() -> swap_readpage()

#12 misses some cases; shmem_getpage_gfp() is called by several functions.
