
Issue 775118


Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug




Kernel OOM-killer janks system and kills almost all tabs after leaving device over the weekend.

Project Member Reported by w...@chromium.org, Oct 16 2017

Issue description

Chrome Version: 62.0.3202.43
OS: ChromeOS Panther

What steps will reproduce the problem?
(1) Use device and open several windows and lots of tabs (I have two profiles signed-in, one with ~20 tabs open, across four windows, the other with one tab in one window).
(2) Use it daily and leave it logged-in each evening, but locked.
(3) Leave it logged-in over the entire weekend.

What is the expected result?

Expect that on Monday, the system is responsive as soon as it is unlocked.
Expect that there is little change in the running tabs over the weekend.

What happens instead?

System is extremely janky - the cursor actually stutters around the screen.
Once the jank stops, virtually every tab on the system shows a sad-tab icon. about:discards lists no Chrome-initiated discards, so this is a result of the kernel OOM-killer killing the tabs.

This is presumably a mixture of timer-based activities (e.g. JS stuff, V8+Oilpan GCs, etc.) firing, and tab-discard signals not being quite right to prevent things b0rking. Unfortunately it seems to have become worse recently. :(

Mainly filing this to associate a feedback report with.
 

Comment 1 Deleted

report #76413531611



Cc: kirtika@chromium.org
Curiously, I don't see discards or kernel OOM-kills in the logs for this morning.

There is a spew of these, probably unrelated, but let me check with the wifi folks if it may be of interest.

2017-10-16T10:30:44.953514-07:00 DEBUG kernel: [163429.594400] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:30:53.872509-07:00 DEBUG kernel: [163438.511485] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:30:53.964523-07:00 DEBUG kernel: [163438.604135] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:30:56.872515-07:00 DEBUG kernel: [163441.510989] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:30:56.935520-07:00 DEBUG kernel: [163441.574036] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:30:58.872509-07:00 DEBUG kernel: [163443.510658] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:30:58.879521-07:00 DEBUG kernel: [163443.518489] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:31:00.868508-07:00 DEBUG kernel: [163445.506388] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:31:00.928631-07:00 DEBUG kernel: [163445.566451] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:31:09.868515-07:00 DEBUG kernel: [163454.504836] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:31:09.939569-07:00 DEBUG kernel: [163454.575864] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:31:11.872511-07:00 DEBUG kernel: [163456.508565] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:31:11.884509-07:00 DEBUG kernel: [163456.521143] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:31:32.868568-07:00 DEBUG kernel: [163477.500996] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:31:32.876513-07:00 DEBUG kernel: [163477.509637] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:31:39.872554-07:00 DEBUG kernel: [163484.503826] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:31:39.942535-07:00 DEBUG kernel: [163484.574134] wlan0: cancelling probereq poll due to a received beacon
2017-10-16T10:31:41.868537-07:00 DEBUG kernel: [163486.499526] wlan0: detected beacon loss from AP - sending probe request
2017-10-16T10:31:41.887519-07:00 DEBUG kernel: [163486.519364] wlan0: cancelling probereq poll due to a received beacon
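
One way to judge whether this spew matters is to measure how long each beacon loss lasts before a beacon arrives again. The sketch below is not from the thread: it pairs each "detected beacon loss" line with the next "cancelling probereq poll" line and reports the gap in seconds, using the kernel uptime stamp in square brackets. The function name `beacon_recovery_times` is made up for illustration, and the regex assumes the exact message wording quoted above.

```python
import re

# Matches the kernel uptime stamp and the two wlan0 messages quoted in the log.
LINE_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+wlan0: "
    r"(?P<event>detected beacon loss|cancelling probereq poll)"
)


def beacon_recovery_times(lines):
    """Return the seconds between each beacon-loss line and the next cancel line."""
    gaps = []
    loss_at = None  # uptime of the most recent unmatched beacon loss
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        t = float(m.group("uptime"))
        if m.group("event") == "detected beacon loss":
            loss_at = t
        elif loss_at is not None:
            gaps.append(t - loss_at)
            loss_at = None
    return gaps
```

Passing an open log file, e.g. `beacon_recovery_times(open("/var/log/messages"))`, gives a list of gaps. In the excerpt above every loss is followed by a received beacon within roughly 0.1 s.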
Cc: grundler@chromium.org
+ grundler as FYI. I don't know the history with ath9k, but hopefully we don't end up there again with ath10k.

That looks pretty bad, but Panther has an ath9k chip that we are not supporting/debugging anymore :(

Cc: abodenha@chromium.org
I don't see evidence yet of a wifi problem here despite the probes getting canceled. And I don't want to pollute this bug with secondary symptoms.  If there are issues with ath9k, please open a new bug and include whatever logs that suggest wifi is misbehaving.

The main issue sounds like a failure of Chrome to monitor its memory usage, and possibly a memory leak, which the kernel eventually detects. But that's pure speculation, since I expect kernel OOMs to be visible in either dmesg or /var/log/messages (which is where I know Luigi will have looked).

I haven't looked at Chrome OS OOM issues in the past ~5-6 months. At that time _all_ of the OOM reports from the kernel were to kill Chrome process(es). However, OOM picks a victim that isn't necessarily the culprit. To find the culprit, we need to see the OOM report from the kernel.
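
As a rough illustration of checking dmesg or /var/log/messages for OOM kills, here is a minimal sketch that scans log lines for two strings the Linux OOM killer writes: "invoked oom-killer" and "Killed process <pid> (<name>)". It is not what anyone on this bug ran. The function name `find_oom_events` is hypothetical, and the exact wording of the kernel messages varies between kernel versions, so treat the patterns as assumptions to adjust.

```python
import re

# "<comm> invoked oom-killer: ..." names the task whose allocation triggered
# the OOM killer. This is not necessarily the task that gets killed.
INVOKED_RE = re.compile(r"(?P<comm>\S+) invoked oom-killer")
# "Killed process <pid> (<comm>) ..." names the victim.
KILLED_RE = re.compile(r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)")


def find_oom_events(lines):
    """Return (kind, comm, pid) tuples for OOM-killer lines, in log order."""
    events = []
    for line in lines:
        m = KILLED_RE.search(line)
        if m:
            events.append(("killed", m.group("comm"), int(m.group("pid"))))
            continue
        m = INVOKED_RE.search(line)
        if m:
            # The trigger line carries no pid in this form, so store None.
            events.append(("invoked", m.group("comm"), None))
    return events
```

An empty result from `find_oom_events(open("/var/log/messages"))` would be consistent with the observation that no kernel OOM kills appear in the logs, assuming the log covers the period in question.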

@wez: if you have "Automatically send usage statistics and crash reports to Google" enabled under "Advanced Settings", I thought we get a "kernel crash report" for every OOM event (as a kernel warning IIRC).

Comment 6 by w...@chromium.org, Oct 16 2017

Are we actually able to capture & store crash reports when the system is busy OOMing, though?
#4 Thanks!

#5 By all means we won't use this bug for the wifi problem, but it wasn't clear that it's even a bug, hence the question.

#6 Most of the time OOM-kills are not disruptive to anything except the renderer(s) being killed. If crashes happen around them, they won't be affected. The OOM kills themselves don't generate crashes.
Ah I see.  We may be confusing "kernel OOM panics" and "kernel OOM kills".  The former causes a crash report (like all other kernel panics) with that signature.  The latter does not.

Also, re #5: "the kernel chooses a victim that isn't necessarily the culprit".  I am not clear on what is meant by "the culprit".
Right - I'm confusing OOM panics with kills.

"the culprit" is the process that is consuming (or leaking) memory. It is possible that, due to system configuration, the kernel isn't allowed to kill that particular process. In an ideal world, the "victim" (killed by Chrome or kernel OOM) == "culprit" (consuming lots of mem).

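To make the victim/culprit distinction concrete: on Linux, each process exposes `/proc/<pid>/oom_score` (the kernel's current badness score) and `/proc/<pid>/oom_score_adj` (an adjustment from -1000 to 1000, where -1000 exempts the process from the OOM killer). The sketch below lists processes in the order the kernel would prefer to kill them and flags exempt ones. It is only an illustration under those assumptions; the function names and the `proc_root` parameter are made up here, and nothing in this bug says how Chrome OS configures these values.

```python
import os

OOM_SCORE_ADJ_MIN = -1000  # oom_score_adj value that exempts a process


def read_int(path):
    """Read a single integer from a procfs file, or None if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def oom_candidates(proc_root="/proc"):
    """Return per-process OOM info, highest oom_score (likeliest victim) first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():  # skip non-process entries such as "self"
            continue
        base = os.path.join(proc_root, entry)
        score = read_int(os.path.join(base, "oom_score"))
        adj = read_int(os.path.join(base, "oom_score_adj"))
        if score is None or adj is None:
            continue  # process exited or files are unreadable
        try:
            with open(os.path.join(base, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            comm = "?"
        rows.append({
            "pid": int(entry),
            "comm": comm,
            "oom_score": score,
            "oom_score_adj": adj,
            "exempt": adj == OOM_SCORE_ADJ_MIN,
        })
    rows.sort(key=lambda r: r["oom_score"], reverse=True)
    return rows
```

Comparing the top of this list with actual memory use (for example RSS from `/proc/<pid>/status`) shows whether the likeliest victim is also the biggest consumer. A large consumer marked `exempt` is the case described above, where the kernel isn't allowed to kill the culprit.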