Kernel OOM-killer janks system and kills almost all tabs after leaving device over the weekend.
Issue description

Chrome Version: 62.0.3202.43
OS: ChromeOS Panther

What steps will reproduce the problem?
(1) Use device and open several windows and lots of tabs (I have two profiles signed-in, one with ~20 tabs open, across four windows, the other with one tab in one window).
(2) Use it daily and leave it logged-in each evening, but locked.
(3) Leave it logged-in over the entire weekend.

What is the expected result?
Expect that on Monday, the system is responsive as soon as it is unlocked. Expect that there is little change in the running tabs over the weekend.

What happens instead?
System is extremely janky - the cursor actually stutters around the screen. Once the jank stops, virtually every tab on the system shows a sad-tab icon; visiting about:discards there are no Chrome-initiated discards, so this is a result of the kernel OOM-killer killing the tabs.

This is presumably a mixture of timer-based activities (e.g. JS stuff, V8+Oilpan GCs, etc) firing, and tab-discard signals not being quite right to prevent things b0rking. Unfortunately it seems to have become worse recently. :(

Mainly filing this to associate a feedback report with.
,
Oct 16 2017
report #76413531611
,
Oct 16 2017
Curiously, I don't see discards or kernel OOM-kills in the logs for this morning. There is a spew of these, probably unrelated, but let me check with the wifi folks if it may be of interest.

2017-10-16T10:30:44.953514-07:00 DEBUG kernel: [163429.594400] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:30:53.872509-07:00 DEBUG kernel: [163438.511485] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:30:53.964523-07:00 DEBUG kernel: [163438.604135] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:30:56.872515-07:00 DEBUG kernel: [163441.510989] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:30:56.935520-07:00 DEBUG kernel: [163441.574036] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:30:58.872509-07:00 DEBUG kernel: [163443.510658] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:30:58.879521-07:00 DEBUG kernel: [163443.518489] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:31:00.868508-07:00 DEBUG kernel: [163445.506388] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:31:00.928631-07:00 DEBUG kernel: [163445.566451] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:31:09.868515-07:00 DEBUG kernel: [163454.504836] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:31:09.939569-07:00 DEBUG kernel: [163454.575864] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:31:11.872511-07:00 DEBUG kernel: [163456.508565] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:31:11.884509-07:00 DEBUG kernel: [163456.521143] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:31:32.868568-07:00 DEBUG kernel: [163477.500996] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:31:32.876513-07:00 DEBUG kernel: [163477.509637] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:31:39.872554-07:00 DEBUG kernel: [163484.503826] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:31:39.942535-07:00 DEBUG kernel: [163484.574134] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:31:41.868537-07:00 DEBUG kernel: [163486.499526] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:31:41.887519-07:00 DEBUG kernel: [163486.519364] wlan0: cancelling probereq poll due to a received beacon
,
Oct 16 2017
+ grundler as FYI. I don't know the history with ath9k, but hopefully we don't end up there again with ath10k. That looks pretty bad, but panther has an ath9k chip that we are not supporting/debugging anymore :(
,
Oct 16 2017
I don't see evidence yet of a wifi problem here despite the probes getting canceled. And I don't want to pollute this bug with secondary symptoms. If there are issues with ath9k, please open a new bug and include whatever logs suggest wifi is misbehaving.

The main issue sounds like a failure of Chrome to monitor its memory usage, and Chrome may be leaking memory, which the kernel eventually detects. But that's pure speculation, since I expect kernel OOMs to be visible in either dmesg or /var/log/messages (which is where I know Luigi will have looked).

I haven't looked at Chrome OS OOM issues in the past ~5-6 months. At that time _all_ of the OOM reports from the kernel were to kill chrome process(es). However, OOM picks a victim that isn't necessarily the culprit. To find the culprit, we need to see the OOM report from the kernel.

@wez: if you have "Automatically send usage statistics and crash reports to Google" enabled under "Advanced Settings", I thought we get a "kernel crash report" for every OOM event (as a kernel warning IIRC).
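
For anyone wanting to check their own logs for the OOM reports mentioned above, here is a minimal sketch of a scanner. The regexes assume the "invoked oom-killer" and "Killed process" wording that kernels of this era typically print; the exact text varies by kernel version, and the function name `find_oom_events` is made up for illustration.

```python
import re

# Assumed message wording; other kernel versions may phrase these differently.
INVOKED_RE = re.compile(r"(\S+) invoked oom-killer")
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_events(lines):
    """Return (kind, detail) tuples for OOM-related kernel log lines.

    kind is "invoked" (detail = task that triggered the OOM killer)
    or "killed" (detail = "name[pid]" of the victim).
    """
    events = []
    for line in lines:
        match = INVOKED_RE.search(line)
        if match:
            events.append(("invoked", match.group(1)))
            continue
        match = KILLED_RE.search(line)
        if match:
            events.append(("killed", f"{match.group(2)}[{match.group(1)}]"))
    return events
```

It could be fed the lines of /var/log/messages or the output of dmesg; an empty result would match the "no OOM-kills in the logs" observation earlier in this bug.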
,
Oct 16 2017
Are we actually able to capture & store crash reports when the system is busy OOMing, though?
,
Oct 16 2017
#4 Thanks!

#5 By all means we won't use this bug for the wifi problem, but it wasn't clear that it's even a bug, hence the question.

#6 Most of the time OOM-kills are not disruptive to anything except the renderer(s) being killed. If crashes happen around them, those crash reports won't be affected. The OOM kills themselves don't generate crashes.
,
Oct 16 2017
I thought Kernel OOMs are reported in crash.corp against kernel...but maybe this tag is just a "side-effect" of other crashes getting reported. In any case, it's one of the "buckets" I used for crash analysis. https://crash.corp.google.com/browse?q=product.name%3D%27ChromeOS%27%20AND%20EXISTS%20(SELECT%201%20FROM%20UNNEST(productdata)%20WHERE%20Key%3D%27hwclass%27%20AND%20STRPOS(Value%2C%20%27PANTHER%27)%20%3E%200)%20AND%20EXISTS%20(SELECT%201%20FROM%20UNNEST(productdata)%20WHERE%20Key%3D%27exec_name%27%20AND%20Value%3D%27kernel%27)%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27oom%27&sql_dialect=googlesql&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D
,
Oct 17 2017
Ah I see. We may be confusing "kernel OOM panics" and "kernel OOM kills". The former causes a crash report (like all other kernel panics) with that signature. The latter does not. Also, re #5: "the kernel chooses a victim that isn't necessarily the culprit". I am not clear on what is meant by "the culprit".
,
Oct 17 2017
Right - I'm confusing OOM panics with kills. "The culprit" is the process that is consuming (or losing) memory. It's possible that, due to system configuration, the kernel isn't allowed to kill that particular process. In an ideal world, the "victim" (killed by Chrome or kernel OOM) == "culprit" (consuming lots of mem).
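
To make the victim-vs-culprit distinction concrete, here is a rough sketch that compares the process with the highest /proc/<pid>/oom_score (the kernel's badness heuristic, so the most likely victim) against the process with the largest resident set (a crude stand-in for "culprit"). The helper names and the `proc_root` parameter are made up for illustration, and RSS is only an approximation of who is actually consuming or leaking memory.

```python
import os


def read_proc_memory(proc_root="/proc"):
    """Collect (pid, name, rss_kb, oom_score, oom_score_adj) per process."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "status")) as f:
                status = dict(line.split(":", 1) for line in f if ":" in line)
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read())
            with open(os.path.join(base, "oom_score_adj")) as f:
                adj = int(f.read())
        except (OSError, ValueError):
            continue  # process exited or is not readable
        rss_field = status.get("VmRSS")
        if rss_field is None:
            continue  # kernel threads have no VmRSS line
        rss_kb = int(rss_field.split()[0])
        name = status.get("Name", "?").strip()
        rows.append((int(entry), name, rss_kb, score, adj))
    return rows


def likely_victim(rows):
    """Process with the highest oom_score."""
    return max(rows, key=lambda row: row[3])


def largest_consumer(rows):
    """Process with the largest resident set size."""
    return max(rows, key=lambda row: row[2])
```

If a large process has a strongly negative oom_score_adj, `likely_victim` and `largest_consumer` will disagree, which is the configuration case described above.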
Comment 1 Deleted