
Issue 840315

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocked on:
issue 850457

Blocking:
issue 850452




Reboot caused by chrome browser process memory leak

Project Member Reported by vovoy@chromium.org, May 7 2018

Issue description

Crash report:
https://crash.corp.google.com/browse?q=ReportID%3Daaf50b736d58e1c0#7
https://crash.corp.google.com/browse?q=ReportID%3D7b1290fbbe06a807#7
https://crash.corp.google.com/browse?q=ReportID%3D6268730156958b08#7
https://crash.corp.google.com/browse?q=ReportID%3Df6ee28cb89b7055a#7

snippet from the first crash report:
<6>[484044.113951] [25179] 1000 25179 331666 26000 395 111325 -1000 TaskSchedulerSi
<6>[484044.113961] [25199] 224 25199 541 253 6 61 -1000 dhcpcd
<6>[484044.113971] [25201] 1000 25201 344058 29925 401 107377 -1000 TaskSchedulerSi
<6>[484044.113981] [25220] 1000 25220 92463 7489 132 6425 -1000 chrome
<6>[484044.113991] [25221] 1000 25221 352338 29140 405 108148 -1000 TaskSchedulerSi
<6>[484044.114001] [25236] 1000 25236 84597 7923 109 5883 -1000 chrome
<6>[484044.114012] [25243] 1000 25243 355142 30277 405 107017 -1000 TaskSchedulerSi
<6>[484044.114021] [25255] 1000 25255 85877 7918 116 6244 -1000 chrome
<6>[484044.114032] [25256] 1000 25256 356922 25326 406 111980 -1000 TaskSchedulerSi
<6>[484044.114043] [25271] 1000 25271 357426 25837 406 111470 -1000 TaskSchedulerSi
<6>[484044.114055] [25300] 1000 25300 355473 16822 407 120513 -1000 TaskSchedulerSi
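
A minimal sketch of how rows like the ones above could be parsed. It assumes the columns are pid, uid, tgid, total_vm, rss, nr_ptes, swapents, oom_score_adj and name (the layout printed by older kernels' OOM task dump), and that the page size is 4 KiB. The names `OomTaskRow`, `parse_oom_row` and `rss_plus_swap_kb` are made up for illustration and are not taken from any script referenced in this issue.

```python
import re
from typing import NamedTuple, Optional

PAGE_KB = 4  # assumption: 4 KiB pages


class OomTaskRow(NamedTuple):
    pid: int
    uid: int
    tgid: int
    total_vm: int       # pages
    rss: int            # pages
    nr_ptes: int
    swapents: int       # pages
    oom_score_adj: int
    name: str


# "[pid]" followed by seven integers and the (truncated) task name.
# The "[484044.113951]" timestamp cannot match because of the dot.
_ROW_RE = re.compile(
    r"\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(-?\d+)\s+(.+?)\s*$"
)


def parse_oom_row(line: str) -> Optional[OomTaskRow]:
    """Parse one task row of the OOM dump; return None for other lines."""
    m = _ROW_RE.search(line)
    if m is None:
        return None
    *numbers, name = m.groups()
    return OomTaskRow(*(int(n) for n in numbers), name)


def rss_plus_swap_kb(row: OomTaskRow) -> int:
    """Resident plus swapped-out memory of the task, in KB."""
    return (row.rss + row.swapents) * PAGE_KB


sample = ("<6>[484044.113951] [25179] 1000 25179 331666 26000 395 "
          "111325 -1000 TaskSchedulerSi")
row = parse_oom_row(sample)
print(row.name, row.oom_score_adj, rss_plus_swap_kb(row))
```

Every row in the snippet parses to an oom_score_adj of -1000, which is why the kernel found nothing it was allowed to kill.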

A memory leak in the Chrome browser process may cause it to consume most of the system memory. Because no process is killable at that point, the system reboots.

The current oom_score_adj design: every process starts with an oom_score_adj of -1000, which marks it as "not OOM killable", and the value is then adjusted for the processes that we deem killable (renderers, plugins, etc.).

Reference bug and CL:
 https://crbug.com/199548 
 https://crbug.com/200193 
https://crrev.com/c/5701

One solution is to make the Chrome browser process killable by setting its oom_score_adj to 0. Restarting the Chrome browser process is better than rebooting the system.
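
For illustration, a process can change its own value by writing to /proc/self/oom_score_adj. The sketch below is not the Chrome or upstart change itself; `write_oom_score_adj` is a hypothetical helper, and the `path` parameter exists only so the function can be exercised against an ordinary file.

```python
OOM_SCORE_ADJ_MIN = -1000  # the value that makes a process unkillable by the OOM killer
OOM_SCORE_ADJ_MAX = 1000


def write_oom_score_adj(value: int, path: str = "/proc/self/oom_score_adj") -> None:
    """Write an oom_score_adj value for the calling process.

    Raising the value is normally allowed; lowering it may require extra
    privileges, in which case open() or write() raises OSError.
    """
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj out of range: {value}")
    with open(path, "w") as f:
        f.write(str(value))
```

Calling `write_oom_score_adj(0)` early in a process would make it an ordinary OOM-kill candidate, which is the effect proposed here for the browser process.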
 
Cc: conradlo@chromium.org

Comment 2 by vovoy@chromium.org, May 17 2018

Cc: keybuk@chromium.org
Labels: -Pri-3 Pri-2
I think the right solution is to revert "upstart - disable OOM killer for all jobs" https://crrev.com/c/5701 .
The default oom_score_adj should be 0 instead of -1000. The renderers' oom_score_adj is set to 300, so they will be killed earlier. If the OOM killer is invoked and every process has an oom_score_adj of -1000, a kernel panic is triggered.
Any suggestions?
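
The selection logic described above can be modelled roughly as follows. This is a simplified sketch, not the kernel code: the exact scoring differs between kernel versions, and `Task`, `badness` and `select_victim` are illustrative names. It only shows why a task at -1000 is never chosen, why a higher oom_score_adj is chosen earlier, and why there is no victim when everything is at -1000.

```python
from typing import List, NamedTuple, Optional


class Task(NamedTuple):
    name: str
    rss: int            # pages
    swapents: int       # pages
    nr_ptes: int
    oom_score_adj: int


def badness(task: Task, total_pages: int) -> Optional[int]:
    """Simplified score; None means the task is exempt from OOM killing."""
    if task.oom_score_adj == -1000:
        return None
    points = task.rss + task.swapents + task.nr_ptes
    # oom_score_adj shifts the score in units of 1/1000 of total memory.
    points += task.oom_score_adj * (total_pages // 1000)
    return max(points, 1)


def select_victim(tasks: List[Task], total_pages: int) -> Optional[Task]:
    """Return the killable task with the highest score, or None if none is killable."""
    scored = [(badness(t, total_pages), t) for t in tasks]
    killable = [(score, t) for score, t in scored if score is not None]
    if not killable:
        return None  # the situation reported here: out of memory and no killable processes
    return max(killable, key=lambda pair: pair[0])[1]


total_pages = 2_000_000  # assumption: roughly 8 GB of RAM in 4 KiB pages
browser = Task("chrome", 26000, 111325, 395, -1000)
renderer = Task("renderer", 7489, 6425, 132, 300)
print(select_victim([browser], total_pages))
print(select_victim([browser, renderer], total_pages).name)
```

With the browser at 0 instead of -1000 it would get a score of its own, but a renderer at 300 would still rank higher in this model.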

Comment 3 by vovoy@chromium.org, May 17 2018

Status: Started (was: Available)

Comment 4 by cywang@chromium.org, May 17 2018

Should we first try setting Chrome's default oom_score_adj to zero? System services started by upstart should presumably stay protected with -1000 to keep the system stable.

Comment 5 by vovoy@chromium.org, May 17 2018

In current Chrome OS, every process, not only system services, gets an oom_score_adj of -1000 by default. Chrome is not the only process that can leak memory or use it abnormally; see e.g. the recent arc_camera3_service memory leak [1]. Any process that has oom_score_adj=-1000 and leaks memory could exhaust the system memory and trigger a kernel panic.

Upstream upstart sets the default oom_score_adj to 0 [2]. As the original bug [3] suggests, the main purpose of "upstart - disable OOM killer for all jobs" [4] was to investigate OS memory compression. I think the -1000 default was meant as an experiment; on a stable system, only a small set of processes should set oom_score_adj to -1000.

I propose setting the default oom_score_adj to 0 and having the important services set oom_score_adj to -1000 themselves.

E.g. in gLinux, most system services set oom_score_adj to 0.
Only 4 processes set oom_score_adj to -1000: auditd, sshd, dmeventd, systemd-udevd.
These 4 processes write -1000 to /proc/self/oom_score_adj [5].

[1]: https://listnr.corp.google.com/report/85419030956
[2]: https://bazaar.launchpad.net/~upstart-devel/upstart/trunk/view/head:/init/job_class.h#L116
[3]: https://crbug.com/199548
[4]: https://crrev.com/c/5701
[5]: https://cs.corp.google.com/piper///depot/google3/third_party/systemd/src/udev/udevd.c?rcl=46931103&l=1171

Comment 6 by vovoy@chromium.org, May 18 2018

The process for changing oom_score_adj:
1. Send an RFC email to collect the important system services that should set a negative oom_score_adj.
2. Change the oom_score_adj of these system services.
3. Set the default oom_score_adj to 0 by reverting https://crrev.com/c/5701

Comment 7 by vovoy@chromium.org, May 18 2018

A script to list oom_score_adj of all processes:
https://user.git.corp.google.com/vovoy/utils/+/master/dut_utils/oom_score.py
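
The linked script is not reproduced here. As a rough idea of what such a listing involves, the sketch below walks /proc and reads each process's oom_score_adj and comm files. The function name and the `proc_root` parameter are assumptions made for this example.

```python
import os
from typing import List, Tuple


def list_oom_score_adj(proc_root: str = "/proc") -> List[Tuple[int, int, str]]:
    """Return (pid, oom_score_adj, name) for every process, lowest score first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score_adj")) as f:
                adj = int(f.read().strip())
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue  # the process exited or is not readable
        rows.append((int(entry), adj, name))
    return sorted(rows, key=lambda r: (r[1], r[0]))
```

On a device, `for pid, adj, name in list_oom_score_adj(): print(pid, adj, name)` prints the processes at -1000 first, which makes it easy to see how many of them are exempt from the OOM killer.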

Comment 8 by vovoy@chromium.org, May 21 2018

I analyzed 10000 kernel panic reports caused by "Out of memory and no killable processes". The attachment lists the processes that consumed the most memory and were not killable when the kernel panic was triggered.
largest_process.txt
983 KB View Download

Comment 9 by vovoy@chromium.org, May 21 2018

The crash reports are available at https://crash.corp.google.com . E.g. Report ID 7f0687f8e4a5f36e is available at https://crash.corp.google.com/browse?q=ReportID%3D%277f0687f8e4a5f36e%27

Comment 10 Deleted

I don't understand, it says that taskscheduler sometimes consumes 34gb ?
or chrome at 17gb?

Comment 12 by vovoy@chromium.org, May 24 2018

Updated the crash report summary.
memory_hogs.txt: lists the processes that consume the most memory.
memory_hogs_details.txt: lists a one-line summary for each crash report.
memory_hogs.txt
1.6 KB View Download
memory_hogs_details.txt
980 KB View Download

Comment 13 by vovoy@chromium.org, May 24 2018

There was a bug in the parsing script. It's fixed.

Comment 14 by vovoy@chromium.org, May 24 2018

Also note that the program size is RAM + swap used, so the program size can be greater than total RAM.

e.g.:
Report ID       , anon(KB), swapfree(KB), total(GB),     version,      board,    largest_proc, size(KB)
d5f356f589bfa245,  6857388,            0,         8,  10323.67.0,        eve,             vlc, 18023808

vlc used 6.8 GB of RAM and 11 GB of swap.
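
The arithmetic behind that split, as a small worked example. It follows the comment in treating the anon(KB) column as vlc's RAM use and uses decimal units (1 GB = 1,000,000 KB); both are assumptions, and `swap_used_kb` is an illustrative helper.

```python
KB_PER_GB = 1_000_000  # decimal GB, matching the rounded figures in the comment


def swap_used_kb(size_kb: int, ram_kb: int) -> int:
    """Swap share of a process whose size is reported as RAM + swap."""
    return size_kb - ram_kb


# Values from report d5f356f589bfa245 in the table above.
size_kb = 18023808   # size(KB) of the largest process (vlc)
anon_kb = 6857388    # anon(KB), used here as vlc's RAM use
total_gb = 8         # total(GB) of the device

swap_kb = swap_used_kb(size_kb, anon_kb)
print(f"RAM:  {anon_kb / KB_PER_GB:.2f} GB")
print(f"swap: {swap_kb / KB_PER_GB:.2f} GB")
print("larger than total RAM:", size_kb > total_gb * KB_PER_GB)
```

The swap share comes to 11,166,420 KB, so the reported size of about 18 GB on an 8 GB device is consistent once swap is counted.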

Comment 15 by vapier@google.com, May 24 2018

there's a lot of dev-mode related processes in there (vlc, kodi, mono, matlab, games, etc...).  can you rerun your summary but filter out devices w/dev-mode enabled ?

however, related to that, it doesn't seem like we run crosh with oom adjusted, so i suspect everything people run from there are getting -1000.  we should fix that ... probably want to do it in the process_proxy code ?

Comment 16 by vovoy@chromium.org, May 24 2018

Cc: vapier@chromium.org
Here is an example crash report:
https://crash.corp.google.com/browse?q=ReportID%3D%2769ef425c282bf51b%27

I could not tell from the crash report whether the device was in dev mode, but the majority of the processes in the list are probably from dev mode.

IMO, if we can fix the panic issue in dev mode without affecting the normal mode user, it's worth doing.

I am writing a doc to explain my plan to adjust oom_score_adj.
https://docs.google.com/document/d/1NIul6tcKDfiC5J37q8_7hw1MrTz6mCqdvlUOzkwjKuc/edit
CrOS's crash uploader will set image_type=dev when in dev mode, but i don't know how crash/ exposes that

thanks for the link to the doc.

Comment 18 by vovoy@chromium.org, May 24 2018

Dev mode can be identified from Product data -> boot_mode = dev.

e.g.
https://crash.corp.google.com/browse?q=ReportID%3D%27d5f356f589bfa245%27#2

I will modify the script to analyze crash reports w/o dev mode.

Comment 19 by vovoy@chromium.org, May 25 2018

The result without dev mode.

count of reports in which the largest process used > 500 MB: 5555
count of distinct largest processes: 19

Largest process name, count in crash reports, dev mode excluded:

chrome:
chrome          5289
TaskSchedulerSi   49 chrome thread, showing child thread because main thread (chrome) doesn't exist.
renderer_crash_    1 chrome thread
TaskSchedulerFo    1 chrome thread

upstart services:
shill             77
udevd              4
bluetoothd         3
permission_brok    2
dbus-daemon        2
cras               1
update_engine      1

crosh:
memtester         23 from crosh command memory_test

android:
com.rovio.tnt      1 angry birds evolution, android app with wrong oom_score_adj
Chrome_ProcessL    1 should be a thread of an android app, android app with wrong oom_score_adj
.katana:browser    1 part of facebook android app, android app with wrong oom_score_adj

other:
x.client:glview   88 unknown
futility           9 firmware utility
quipper            1 part of debugd
gs                 1 related to gstoraster and cupsd

memory_hogs_details_nodev.txt
1.0 MB View Download
Components: OS>Kernel
Blockedon: 850457
Blocking: 850452

Comment 23 by vovoy@chromium.org, Jun 12 2018

Status: Fixed (was: Started)
Chrome on Chrome OS is killable with fixes in  https://crbug.com/850457 
