Reboot caused by chrome browser process memory leak
Issue description

Crash reports:
https://crash.corp.google.com/browse?q=ReportID%3Daaf50b736d58e1c0#7
https://crash.corp.google.com/browse?q=ReportID%3D7b1290fbbe06a807#7
https://crash.corp.google.com/browse?q=ReportID%3D6268730156958b08#7
https://crash.corp.google.com/browse?q=ReportID%3Df6ee28cb89b7055a#7

Snippet from the first crash report:

<6>[484044.113951] [25179] 1000 25179 331666 26000 395 111325 -1000 TaskSchedulerSi
<6>[484044.113961] [25199] 224 25199 541 253 6 61 -1000 dhcpcd
<6>[484044.113971] [25201] 1000 25201 344058 29925 401 107377 -1000 TaskSchedulerSi
<6>[484044.113981] [25220] 1000 25220 92463 7489 132 6425 -1000 chrome
<6>[484044.113991] [25221] 1000 25221 352338 29140 405 108148 -1000 TaskSchedulerSi
<6>[484044.114001] [25236] 1000 25236 84597 7923 109 5883 -1000 chrome
<6>[484044.114012] [25243] 1000 25243 355142 30277 405 107017 -1000 TaskSchedulerSi
<6>[484044.114021] [25255] 1000 25255 85877 7918 116 6244 -1000 chrome
<6>[484044.114032] [25256] 1000 25256 356922 25326 406 111980 -1000 TaskSchedulerSi
<6>[484044.114043] [25271] 1000 25271 357426 25837 406 111470 -1000 TaskSchedulerSi
<6>[484044.114055] [25300] 1000 25300 355473 16822 407 120513 -1000 TaskSchedulerSi

There may be a memory leak in the Chrome browser process: it consumes a lot of memory, and because no process is killable, the system reboots.

The current oom_score_adj design: make all processes have an oom_score_adj value of -1000 so that they are all marked as "not OOM killable", and then adjust the oom_score_adj value for all the processes that we deem killable (renderers, plugins, etc.).

Reference bugs and CL:
https://crbug.com/199548
https://crbug.com/200193
https://crrev.com/c/5701

A solution is to make the Chrome browser process killable by setting its oom_score_adj to 0. Restarting the Chrome browser process would be better than a system reboot.
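To make the snippet above easier to read, its lines can be split into named fields. This is a minimal sketch that assumes the usual kernel OOM task-dump column order (pid, uid, tgid, total_vm, rss, nr_ptes, swapents, oom_score_adj, name) with sizes in pages, and a 4 KiB page size. Neither assumption is stated in the report, and the names `TaskRow`, `parse_row` and `footprint_mb` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

PAGE_KB = 4  # assumed page size in KiB; not stated in the crash report


class TaskRow(NamedTuple):
    pid: int
    uid: int
    tgid: int
    total_vm: int       # pages (assumed)
    rss: int            # pages (assumed)
    nr_ptes: int
    swapents: int       # swapped-out pages (assumed)
    oom_score_adj: int
    name: str


# "[pid] uid tgid total_vm rss nr_ptes swapents oom_score_adj name"
ROW_RE = re.compile(
    r"\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(-?\d+)\s+(\S+)"
)


def parse_row(line: str) -> Optional[TaskRow]:
    """Parse one task-dump line; return None if the line does not match."""
    m = ROW_RE.search(line)
    if m is None:
        return None
    *numbers, name = m.groups()
    return TaskRow(*map(int, numbers), name)


def footprint_mb(row: TaskRow) -> float:
    """Resident plus swapped memory in MiB, under the page-size assumption."""
    return (row.rss + row.swapents) * PAGE_KB / 1024
```

Calling `parse_row` on the first line of the snippet yields pid 25179 with oom_score_adj -1000, which is the value that keeps the OOM killer from selecting that task.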
,
May 17 2018
I think the right solution is to revert "upstart - disable OOM killer for all jobs" (https://crrev.com/c/5701). The default oom_score_adj should be 0 instead of -1000. The renderer's oom_score_adj is set to 300, so renderers will still be killed first. If the OOM killer is invoked and every process has an oom_score_adj of -1000, a kernel panic is triggered. Any suggestions?
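The behaviour described in this comment can be illustrated with a simplified model of victim selection. It is only a sketch: the scoring loosely follows the idea that a task's footprint is shifted by oom_score_adj as a share of total memory, and it is not the kernel's actual code. `Task`, `badness` and `pick_victim` are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence

OOM_SCORE_ADJ_MIN = -1000  # tasks with this value are never selected


class Task(NamedTuple):
    name: str
    pages: int            # resident + swapped pages (simplified footprint)
    oom_score_adj: int


def badness(task: Task, total_pages: int) -> Optional[int]:
    """Return a score, or None if the task is exempt from OOM killing."""
    if task.oom_score_adj == OOM_SCORE_ADJ_MIN:
        return None
    # Simplified heuristic: footprint plus oom_score_adj thousandths of RAM.
    return task.pages + task.oom_score_adj * total_pages // 1000


def pick_victim(tasks: Sequence[Task], total_pages: int) -> Optional[Task]:
    """Return the highest-scoring killable task, or None if there is none."""
    killable = []
    for task in tasks:
        score = badness(task, total_pages)
        if score is not None:
            killable.append((score, task))
    if not killable:
        # In the real system this is the "no killable processes" panic case.
        return None
    return max(killable, key=lambda pair: pair[0])[1]
```

With every task at -1000, `pick_victim` returns None, which corresponds to the panic case. With the browser at 0 and a renderer at 300, a small renderer still outranks a moderately sized browser, while a browser that has leaked most of memory becomes the victim.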
,
May 17 2018
,
May 17 2018
Shouldn't we first try adjusting Chrome's default OOM score to zero, since system services started by upstart should stay protected with -1000 to keep the system stable?
,
May 17 2018
In the current Chrome OS, not only system services but every process gets an oom_score_adj of -1000 by default. Chrome is not the only process that can have a memory leak or abnormal memory usage; see e.g. the recent arc_camera3_service memory leak issue [1]. Any process that has oom_score_adj=-1000 and a memory leak could exhaust system memory and trigger a kernel panic.

Upstream upstart sets the default oom_score_adj to 0 [2]. And as the original bug [3] suggests, the main purpose of "upstart - disable OOM killer for all jobs" [4] was to investigate OS memory compression. I think setting the default oom_score_adj to -1000 was meant as an experiment; on a stable system, only a small set of processes should set oom_score_adj to -1000.

I propose setting the default oom_score_adj to 0 and having the important services set oom_score_adj to -1000 themselves. E.g. in gLinux, most system services have oom_score_adj set to 0. Only 4 processes set oom_score_adj to -1000: auditd, sshd, dmeventd, systemd-udevd. These 4 processes write -1000 to /proc/self/oom_score_adj [5].

[1]: https://listnr.corp.google.com/report/85419030956
[2]: https://bazaar.launchpad.net/~upstart-devel/upstart/trunk/view/head:/init/job_class.h#L116
[3]: https://crbug.com/199548
[4]: https://crrev.com/c/5701
[5]: https://cs.corp.google.com/piper///depot/google3/third_party/systemd/src/udev/udevd.c?rcl=46931103&l=1171
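As an illustration of a service protecting itself by writing to /proc/self/oom_score_adj, here is a minimal sketch. The function name `set_oom_score_adj` and its `path` parameter are hypothetical, added so the behaviour can be tried against an ordinary file. On a real system, lowering the value generally requires privilege (CAP_SYS_RESOURCE).

```python
from pathlib import Path

OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def set_oom_score_adj(value: int, path: str = "/proc/self/oom_score_adj") -> None:
    """Write an oom_score_adj value for the calling process.

    `path` defaults to the procfs file of the current process; it is a
    parameter only so the function can be exercised on a regular file.
    """
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj must be in [-1000, 1000], got {value}")
    Path(path).write_text(f"{value}\n")
```

Under the proposal above, only the few services that must survive an OOM event would call something like `set_oom_score_adj(-1000)` at startup, and everything else would inherit the default of 0.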
,
May 18 2018
The process to change the oom_score_adj:
1. Send an RFC email to collect the important system services that should set a negative oom_score_adj.
2. Change the oom_score_adj of these system services.
3. Set the default oom_score_adj to 0 by reverting https://crrev.com/c/5701
,
May 18 2018
A script to list oom_score_adj of all processes: https://user.git.corp.google.com/vovoy/utils/+/master/dut_utils/oom_score.py
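For readers without access to the linked script, the idea can be sketched as follows. This is not the script from the link; it is an independent minimal version that reads the standard procfs files `comm` and `oom_score_adj` for each numeric directory. The function name `iter_oom_score_adj` and the `proc_root` parameter are hypothetical.

```python
import os
from typing import Iterator, Tuple


def iter_oom_score_adj(proc_root: str = "/proc") -> Iterator[Tuple[int, str, int]]:
    """Yield (pid, command name, oom_score_adj) for each readable process."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
            with open(os.path.join(base, "oom_score_adj")) as f:
                adj = int(f.read().strip())
        except (OSError, ValueError):
            continue  # the process exited or the files are unreadable
        yield int(entry), name, adj
```

Sorting the results by the third field, for example `sorted(iter_oom_score_adj(), key=lambda row: row[2])`, puts the processes protected with -1000 first.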
,
May 21 2018
I analyzed 10000 kernel panic reports caused by "Out of memory and no killable processes". The attachment lists the processes that consumed the most memory and were not killable when the kernel panic was triggered.
,
May 21 2018
The crash reports are available at https://crash.corp.google.com . E.g. Report ID 7f0687f8e4a5f36e is available at https://crash.corp.google.com/browse?q=ReportID%3D%277f0687f8e4a5f36e%27
,
May 24 2018
I don't understand. It says that TaskScheduler sometimes consumes 34 GB? Or chrome 17 GB?
,
May 24 2018
Updated the crash report summary.
memory_hogs.txt: lists the processes that consume the most memory.
memory_hogs_details.txt: lists one summary line for each crash report.
,
May 24 2018
There was a bug in the parsing script. It's fixed.
,
May 24 2018
Also note that the program size is RAM + swap used, so the program size can be greater than total RAM. E.g.:

Report ID, anon(KB), swapfree(KB), total(GB), version, board, largest_proc, size(KB)
d5f356f589bfa245, 6857388, 0, 8, 10323.67.0, eve, vlc, 18023808

vlc used 6.8 GB of RAM and 11 GB of swap.
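The arithmetic behind the vlc example can be checked with a short sketch. It assumes decimal units (1 GB = 1,000,000 KB) and, as the comment does, treats the anon figure as vlc's RAM usage. The helper name `split_ram_swap` is hypothetical.

```python
from typing import Tuple

KB_PER_GB = 1_000_000  # decimal GB; an assumption about the units used


def split_ram_swap(size_kb: int, ram_kb: int) -> Tuple[float, float]:
    """Split a program size (RAM + swap) into RAM and swap, both in GB."""
    swap_kb = size_kb - ram_kb
    return ram_kb / KB_PER_GB, swap_kb / KB_PER_GB


# Values from the report line above: size(KB) and anon(KB).
ram_gb, swap_gb = split_ram_swap(18_023_808, 6_857_388)
```

Under these assumptions the split is about 6.9 GB of RAM and about 11.2 GB of swap, so the total of roughly 18 GB exceeds the 8 GB of physical RAM on the device.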
,
May 24 2018
there's a lot of dev-mode related processes in there (vlc, kodi, mono, matlab, games, etc...). can you rerun your summary but filter out devices w/dev-mode enabled ? however, related to that, it doesn't seem like we run crosh with oom adjusted, so i suspect everything people run from there are getting -1000. we should fix that ... probably want to do it in the process_proxy code ?
,
May 24 2018
Here is an example crash report: https://crash.corp.google.com/browse?q=ReportID%3D%2769ef425c282bf51b%27

I could not tell from the crash report whether the device is in dev mode, but most of the processes in the list are probably from dev mode. IMO, if we can fix the panic issue in dev mode without affecting normal-mode users, it's worth doing.

I am writing a doc to explain my plan to adjust oom_score_adj: https://docs.google.com/document/d/1NIul6tcKDfiC5J37q8_7hw1MrTz6mCqdvlUOzkwjKuc/edit
,
May 24 2018
CrOS's crash uploader will set image_type=dev when in dev mode, but i don't know how crash/ exposes that.

thanks for the link to the doc.
,
May 24 2018
Dev mode can be identified from Product data -> boot_mode = dev, e.g. https://crash.corp.google.com/browse?q=ReportID%3D%27d5f356f589bfa245%27#2

I will modify the script to analyze only crash reports that are not from dev mode.
,
May 25 2018
The result without dev mode.

Count of reports in which the largest process used > 500 MB: 5555
Count of distinct largest processes: 19

Largest process name, count in crash reports (dev mode excluded), notes:

chrome:
  chrome             5289
  TaskSchedulerSi      49   chrome thread; the child thread is shown because the main thread (chrome) doesn't exist
  renderer_crash_       1   chrome thread
  TaskSchedulerFo       1   chrome thread

upstart services:
  shill                77
  udevd                 4
  bluetoothd            3
  permission_brok       2
  dbus-daemon           2
  cras                  1
  update_engine         1

crosh:
  memtester            23   from the crosh command memory_test

android:
  com.rovio.tnt         1   Angry Birds Evolution; android app with wrong oom_score_adj
  Chrome_ProcessL       1   should be a thread of an android app; android app with wrong oom_score_adj
  .katana:browser       1   part of the Facebook android app; android app with wrong oom_score_adj

other:
  x.client:glview      88   unknown
  futility              9   firmware utility
  quipper               1   part of debugd
  gs                    1   related to gstoraster and cupsd
,
May 25 2018
,
Jun 7 2018
,
Jun 7 2018
,
Jun 12 2018
Comment 1 by conradlo@chromium.org
, May 15 2018