When memory is low and things get OOM killed, sometimes system locks up (maybe ARC++ related)

Project Member Reported by semenzato@chromium.org, Mar 17 2017

Issue description

(credits to Doug for noticing this)

Android processes (for instance, the binder) run with the default oom_score_adj value of 0.  This means that they are all killable by the kernel OOM killer (the value -1000 would make them unkillable).  However, the critical Android processes are likely to be killed only after all Chrome renderers and Android apps are killed.

I have to say "likely" because the oom_score_adj is only one factor in choosing the kill candidate (see issue 701848).  Another is the process size (RSS + swap + page table pages).  However, chrome renderers and android apps have oom_score_adj values between 300 and 1000, which is a lot: the units of oom_score_adj are 0.1% of total RAM, thus on a 4GB system each unit is equivalent to 4MB.  The lowest adj is 300, corresponding to 1.2GB.  Thus an android system process would have to be at least that size (or 600MB on a 2GB system) to be kernel-OOM-killed before a tab or an app.

The question here is whether we want to make them OOM-unkillable, i.e. we'd rather panic, or whether it's OK to have the kernel OOM-kill Android when we're out of memory.  Note that at that point no Android apps are running and no Chrome tabs are open, but extensions are still loaded and running.  (We should really start killing those too...)

I am in favor of killing Android before panicking.

 
Cc: elijahtaylor@chromium.org mitsuji@chromium.org dgreid@chromium.org
I asked a similar question in an email thread after seeing a report where various ARC processes got killed by the OOM killer, but didn't get any answer -- adding the folks from the thread to see if we can get an answer here.

Comment 2 by dgreid@chromium.org, Mar 17 2017

Hmm, very difficult question.  I think the answer is yes, we should prevent the OOM killer from taking out Android system-critical processes.  Android will either restart the process anyway or behave erratically (erratic behavior seems to be the common failure mode here).

However, if we are out of Android apps to kill, I'd be in favor of killing the framework as a whole.  Turn off the container, because at that point it isn't doing anything useful and it's eating hundreds of megs of RAM.  We'd have to be smart about restarting it when memory pressure relaxes so notification apps would work again.
Better unkillable than hanging the system like what seems to happen today when binder gets OOM killed.  :-P  ...so for the short term, making critical system processes unkillable seems to make sense...
Labels: -Pri-2 M-58 Pri-1
Since I just hit this personally, I'd be very curious to know if we can at least get some of this handled in short order?  It seems that OOM kill of Android stuff is pretty broken right now because of the bad crash when binder is killed.  Even if you're hardly using Android at all, simply having Play Store enabled on your system means you're in an unhappy state.
#2: agreed.  I hadn't thought about notifications; that complicates things, because it makes for really bad UX if notifications aren't reliable.
Owner: cylee@chromium.org
Status: Assigned (was: Untriaged)

Comment 7 Deleted

I'd be worried about UX when the oom-killer starts killing Android processes. Shouldn't Android be handling what to kill or GC for anything it spawns to free up memory? It would be good to have a way to notify Android about a low-memory situation so it could handle it in a more deterministic manner.

Short of that, in the near term I also think we should make at least Android system-critical processes unkillable by the oom-killer.  And if these critical processes become so bloated that they prevent users from doing what they want, e.g. loading a new tab, then turning off the container without relying on the oom-killer seems reasonable.  Hopefully that'll never be necessary with a reasonable DRAM spec.
Summary: Android framework should be discarded (or OOM killed) before panicking (was: Should Android processes in Chrome OS be unkillable by default?)
#8: Partly my fault, but we have to be careful to distinguish between normal and emergency behavior---although the two should be as close as possible.

Emergency behavior is when there is a memory allocation request and no memory left.  The kernel has two choices: it can kill some process, or it can panic.  Panicking kills all processes, so it's seldom better than killing *some* processes.

Right now we panic rather than killing, therefore this bug.  I am changing the title to make this clearer.

The immediate consequence of this reasoning is that, for consistency, we should shut down the entire Android framework when short on memory.  Unfortunately this not only introduces Android restart latency, but also kills Android notifications.

Maybe the strategy should then be:

1. "Discard candidate" can be Android app, Chrome tab, or Android framework.
2. Use LRU to discard Android apps and Chrome tabs, but treat the framework separately.
3. When the framework is LRU, and there is only one Chrome tab left (only two?  Only "visible" tabs?), discard the framework, but pop up this dialog: "Android is being shut down for RAM shortage and will stop sending notifications"

This doesn't necessarily work well, but I am not sure what else we can do.



Hmmm, so maybe there are several bugs to address here.  The problem I was seeing (that I thought this bug was about) is:

1. Memory starts getting low

2. For whatever reason, the Chrome Tab Discarder isn't able to react fast enough.  Maybe we have it tuned wrong.  Maybe there's a bug.  Maybe memory was just allocated too quickly.  We can discuss ways to make this better, but there will always be some times the tab discarder doesn't run.  We can discuss that elsewhere (not here!).  In my case, assume the tab discarder didn't free memory.

3. We get to the kernel OOM killer.

4. The OOM killer chooses a VIAP (very important Android process) as the one to kill.  Like "binder", for instance.

5. When the VIAP dies, the system is doomed.  It might not crash right away, but within a finite amount of time (like 120 seconds) the kernel will detect a hung task and reboot.  Possibly some parts of the system might freeze/hang during this waiting period, or possibly not.

---

So, my assertion is that rather than kill the VIAP, the following would have been preferable:

A) Kill a Chrome process.

B) Kill an unimportant Android process.

C) Kill the whole android container.

D) Just panic.


-> In cases I've looked at I'm pretty sure that A) was possible, but we chose not to do it because we thought that the VIAP was a better candidate to kill.

-> If it's too hard to figure out B) or C) right now, I'd advocate that A) + D) is still better than what we have today.

---

As per dgreid, possibly we should also try to fix the system so that it doesn't fall on its face if a VIAP crashes, but even so it still probably makes sense not to purposefully kill one.

===

I'm going to try some experiments to actually show how the system behaves when a VIAP dies.
My current effort on this is to try to actually kill a "Binder_X" process and watch the system die.  I think that's been relatively common in the feedback reports I've seen.

As per dgreid these are short-lived processes and are spawned for binder to communicate back to something, so maybe that explains why I don't see them normally in "ps aux".

---

Looking in "messages" for a machine where I was having problems (attached), I notice:

2017-03-20T15:55:32.534852-07:00 INFO kernel: [18962.589904] [ 6041] 665367  6041   238332        0     106       4     4165           965 d.process.acore
...
2017-03-20T15:55:32.534912-07:00 ERR kernel: [18962.590018] Out of memory: Kill process 6041 (d.process.acore) score 966 or sacrifice child
...
2017-03-20T15:55:32.544008-07:00 INFO kernel: [18964.056499] [18856] 665367  6041   238332       15     106       4     4151           965 Binder_7
...
2017-03-20T15:55:32.544070-07:00 ERR kernel: [18964.057098] Out of memory: Kill process 18856 (Binder_7) score 966 or sacrifice child


So this "Binder_7" was spawned _after_ we killed "d.process.acore", but has the same "uid" and same OOM Score, so it seems like maybe it was created for the "d.process.acore" and maybe inherited its OOM score?

---
[Attachment: messages.2.txt, 2.4 MB]
Killing "Binder" processes alone doesn't seem to make this happen all the time...

Let's see if I can find a pattern...

---

b/35647865, "OoM-killer_chrome_crash.tar.gz", "messages":


$ grep sacrifice messages 
2017-02-05T22:29:02.500816-08:00 ERR kernel: [24286.111920] Out of memory: Kill process 7916 (e.process.gapps) score 980 or sacrifice child
2017-02-05T22:29:02.509046-08:00 ERR kernel: [24286.373930] Out of memory: Kill process 11860 (Binder_10) score 980 or sacrifice child
2017-02-05T22:29:08.290024-08:00 ERR kernel: [24292.813157] Out of memory: Kill process 26777 (rk.clouddpc.arc) score 960 or sacrifice child
2017-02-05T22:29:10.327657-08:00 ERR kernel: [24294.851191] Out of memory: Kill process 26791 (Binder_2) score 960 or sacrifice child
2017-02-05T22:29:20.294308-08:00 ERR kernel: [24304.032281] Out of memory: Kill process 12014 (gle.android.gms) score 941 or sacrifice child
2017-02-05T22:29:50.909243-08:00 ERR kernel: [24335.432407] Out of memory: Kill process 8110 (c.intent_helper) score 919 or sacrifice child
2017-02-05T22:29:51.154379-08:00 ERR kernel: [24335.441583] Out of memory: Kill process 8121 (Binder_2) score 919 or sacrifice child
2017-02-05T22:29:59.202457-08:00 ERR kernel: [24343.726336] Out of memory: Kill process 7835 (arc.applauncher) score 898 or sacrifice child
2017-02-05T22:30:02.553861-08:00 ERR kernel: [24347.046614] Out of memory: Kill process 8272 (rprisereporting) score 877 or sacrifice child
...
...
2017-02-05T22:30:28.409704-08:00 WARNING session_manager[1488]: [WARNING:liveness_checker_impl.cc(68)] Browser hang detected!
2017-02-05T22:30:28.410279-08:00 WARNING session_manager[1488]: [WARNING:liveness_checker_impl.cc(72)] Aborting browser process.
2017-02-05T22:30:28.410309-08:00 INFO session_manager[1488]: [INFO:browser_job.cc(124)] Terminating process: Browser did not respond to DBus liveness check.
2017-02-05T22:30:28.410319-08:00 INFO session_manager[1488]: [INFO:system_utils_impl.cc(110)] Sending 6 to 1515 as 1000
2017-02-05T22:30:30.341316-08:00 WARNING session_manager[1488]: [WARNING:browser_job.cc(132)] Aborting child process 1515's process group 3 seconds after sending signal
2017-02-05T22:30:30.341376-08:00 INFO session_manager[1488]: [INFO:browser_job.cc(116)] Terminating process group: Browser took more than 3 seconds to exit after signal.
2017-02-05T22:30:30.341386-08:00 INFO session_manager[1488]: [INFO:system_utils_impl.cc(110)] Sending 6 to -1515 as 1000
2017-02-05T22:30:31.046142-08:00 WARNING crash_reporter[27313]: [user] Received crash notification for chrome[1615] sig 6, user 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2017-02-05T22:30:31.046646-08:00 WARNING crash_reporter[27313]: [ARC] Received crash notification for chrome[1615] sig 6, user 1000 (ignoring - crash origin is not ARC)
2017-02-05T22:30:31.049006-08:00 WARNING crash_reporter[27312]: [user] Received crash notification for chrome[1733] sig 6, user 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2017-02-05T22:30:31.049086-08:00 WARNING crash_reporter[27314]: [user] Received crash notification for chrome[1710] sig 6, user 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2017-02-05T22:30:31.049107-08:00 WARNING crash_reporter[27312]: [ARC] Received crash notification for chrome[1733] sig 6, user 1000 (ignoring - crash origin is not ARC)
2017-02-05T22:30:31.050571-08:00 WARNING crash_reporter[27314]: [ARC] Received crash notification for chrome[1710] sig 6, user 1000 (ignoring - crash origin is not ARC)
2017-02-05T22:30:31.164928-08:00 INFO kernel: [24375.646032] binder: 7404:7480 transaction failed 29189, size 328-0
2017-02-05T22:30:32.131740-08:00 WARNING crash_reporter[27328]: Received crash notification for chrome[1515] user 1000 (called directly)
2017-02-05T22:30:32.481673-08:00 WARNING crash_reporter[27354]: [user] Received crash notification for chrome[1515] sig 6, user 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2017-02-05T22:30:32.482366-08:00 WARNING crash_reporter[27354]: [ARC] Received crash notification for chrome[1515] sig 6, user 1000 (ignoring - crash origin is not ARC)


IIUC kHangDetectionIntervalDefaultSeconds is 60, so I think that means that the browser has been hung for between 60 seconds and 120 seconds.  So any of those could have been it.


Looking more closely at my crash (messages.2 from comment 11):

$ grep sacrifice ~/Downloads/messages.2.txt 
2017-03-20T15:55:32.534125-07:00 ERR kernel: [18957.450399] Out of memory: Kill process 9556 (ndroid.gms:snet) score 985 or sacrifice child
2017-03-20T15:55:32.534912-07:00 ERR kernel: [18962.590018] Out of memory: Kill process 6041 (d.process.acore) score 966 or sacrifice child
2017-03-20T15:55:32.544070-07:00 ERR kernel: [18964.057098] Out of memory: Kill process 18856 (Binder_7) score 966 or sacrifice child
2017-03-20T15:55:32.544931-07:00 ERR kernel: [18967.234454] Out of memory: Kill process 5976 (e.process.gapps) score 949 or sacrifice child
2017-03-20T15:55:32.545729-07:00 ERR kernel: [18972.399221] Out of memory: Kill process 9974 (.android.gms.ui) score 931 or sacrifice child
2017-03-20T15:55:32.546457-07:00 ERR kernel: [18977.249886] Out of memory: Kill process 6190 (id.gms.unstable) score 919 or sacrifice child

2017-03-20T15:55:41.821899-07:00 ERR kernel: [18988.743190] Out of memory: Kill process 6569 (ndroid.settings) score 896 or sacrifice child
2017-03-20T15:55:46.146568-07:00 ERR kernel: [18992.756865] Out of memory: Kill process 5702 (d.process.media) score 879 or sacrifice child
2017-03-20T15:55:52.548938-07:00 ERR kernel: [18999.470414] Out of memory: Kill process 5585 (e.process.gapps) score 861 or sacrifice child
2017-03-20T15:55:54.920177-07:00 ERR kernel: [18999.473870] Out of memory: Kill process 15358 (Binder_F) score 861 or sacrifice child
2017-03-20T15:56:06.439886-07:00 ERR kernel: [19013.361888] Out of memory: Kill process 5707 (gle.android.gms) score 847 or sacrifice child
2017-03-20T15:56:20.526220-07:00 ERR kernel: [19027.447412] Out of memory: Kill process 5730 (Binder_2) score 847 or sacrifice child
...
...
2017-03-20T15:57:37.543721-07:00 WARNING session_manager[1302]: [WARNING:liveness_checker_impl.cc(68)] Browser hang detected!
2017-03-20T15:57:37.554394-07:00 WARNING session_manager[1302]: [WARNING:liveness_checker_impl.cc(72)] Aborting browser process.
2017-03-20T15:57:37.555689-07:00 INFO session_manager[1302]: [INFO:browser_job.cc(157)] Terminating process: Browser did not respond to DBus liveness check.
2017-03-20T15:57:37.555815-07:00 INFO session_manager[1302]: [INFO:system_utils_impl.cc(110)] Sending 6 to 1370 as 1000
2017-03-20T15:57:40.436358-07:00 WARNING kernel: [19107.357368] udevd[189]: seq 2719 '/devices/virtual/xt_idletimer/timers' is taking a long time
2017-03-20T15:57:41.036557-07:00 WARNING session_manager[1302]: [WARNING:browser_job.cc(165)] Aborting child process 1370's process group 3 seconds after sending signal
2017-03-20T15:57:41.036714-07:00 INFO session_manager[1302]: [INFO:browser_job.cc(149)] Terminating process group: Browser took more than 3 seconds to exit after signal.
2017-03-20T15:57:41.036778-07:00 INFO session_manager[1302]: [INFO:system_utils_impl.cc(110)] Sending 6 to -1370 as 1000
2017-03-20T15:57:41.410436-07:00 WARNING crash_reporter[29916]: [user] Received crash notification for chrome[1633] sig 6, user 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2017-03-20T15:57:41.410977-07:00 WARNING crash_reporter[29916]: [ARC] Received crash notification for chrome[1633] sig 6, user 1000 (ignoring - crash origin is not ARC)
2017-03-20T15:57:42.106383-07:00 WARNING crash_reporter[29927]: Received crash notification for chrome[1370] user 1000 (called directly)
2017-03-20T15:57:42.149559-07:00 INFO kernel: [19109.070859] binder: 4168:5160 transaction failed 29189, size 328-0
2017-03-20T15:57:42.551664-07:00 WARNING crash_reporter[29948]: [user] Received crash notification for chrome[1370] sig 6, user 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2017-03-20T15:57:42.552048-07:00 WARNING crash_reporter[29948]: [ARC] Received crash notification for chrome[1370] sig 6, user 1000 (ignoring - crash origin is not ARC)
2017-03-20T15:57:42.561921-07:00 WARNING crash_reporter[29949]: Received crash notification for chrome-crash-unknown-process[29890] user 1000 (called directly)
2017-03-20T15:57:42.571654-07:00 INFO kernel: [19109.492945] Watchdog[1707]: unhandled level 1 translation fault (11) at 0x00000000, esr 0x92000045
2017-03-20T15:57:42.571710-07:00 ALERT kernel: [19109.492955] pgd = ffffffc0e7998000
2017-03-20T15:57:42.571718-07:00 ALERT kernel: [19109.492959] [00000000] *pgd=0000000000000000, *pud=0000000000000000
2017-03-20T15:57:42.571724-07:00 WARNING kernel: [19109.492966] 
2017-03-20T15:57:42.571730-07:00 WARNING kernel: [19109.492972] CPU: 4 PID: 1707 Comm: Watchdog Not tainted 4.4.35-06794-g75ce794 #1
2017-03-20T15:57:42.571736-07:00 WARNING kernel: [19109.492976] Hardware name: Google Kevin (DT)
2017-03-20T15:57:42.571741-07:00 WARNING kernel: [19109.492980] task: ffffffc0ea7b2700 ti: ffffffc0e514c000 task.ti: ffffffc0e514c000
2017-03-20T15:57:42.571748-07:00 WARNING kernel: [19109.492985] PC is at 0xaebf8cf0
2017-03-20T15:57:42.571753-07:00 WARNING kernel: [19109.492989] LR is at 0xee2886d3
2017-03-20T15:57:42.571760-07:00 WARNING kernel: [19109.492992] pc : [<00000000aebf8cf0>] lr : [<00000000ee2886d3>] pstate: 600f0030
2017-03-20T15:57:42.571766-07:00 WARNING kernel: [19109.492996] sp : 00000000ea4e5a48
2017-03-20T15:57:42.571771-07:00 WARNING kernel: [19109.492999] x12: 00000000ee2d8714 
2017-03-20T15:57:42.571813-07:00 WARNING kernel: [19109.493004] x11: 0000000000000000 x10: 00000000ea4e5bb0 
2017-03-20T15:57:42.571820-07:00 WARNING kernel: [19109.493011] x9 : 00000000ea4e5bbc x8 : 00000000ea4e5ba8 
2017-03-20T15:57:42.571826-07:00 WARNING kernel: [19109.493017] x7 : 00000000ea4e5a48 x6 : 00000000ea4e5a64 
2017-03-20T15:57:42.571832-07:00 WARNING kernel: [19109.493023] x5 : 00000000b3320dd0 x4 : 00000000ea4e5a60 
2017-03-20T15:57:42.571838-07:00 WARNING kernel: [19109.493029] x3 : 00000000ea4e5ab4 x2 : 0000000000000000 
2017-03-20T15:57:42.571844-07:00 WARNING kernel: [19109.493036] x1 : 0000000000000000 x0 : 00000000ea4e5a60 
2017-03-20T15:57:42.571850-07:00 WARNING kernel: [19109.493042] 
2017-03-20T15:57:42.624555-07:00 WARNING crash_reporter[29957]: [user] Received crash notification for chrome[1699] sig 6, user 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2017-03-20T15:57:42.625951-07:00 WARNING crash_reporter[29950]: Received crash notification for chrome-crash-unknown-process[18799] user 1000 (called directly)
2017-03-20T15:57:42.626255-07:00 WARNING crash_reporter[29957]: [ARC] Received crash notification for chrome[1699] sig 6, user 1000 (ignoring - crash origin is not ARC)
2017-03-20T15:57:42.628515-07:00 WARNING crash_reporter[29954]: [user] Received crash notification for chrome[1676] sig 11, user 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2017-03-20T15:57:42.628965-07:00 WARNING crash_reporter[29954]: [ARC] Received crash notification for chrome[1676] sig 11, user 1000 (ignoring - crash origin is not ARC)
2017-03-20T15:57:42.708806-07:00 INFO session_manager[1302]: [INFO:child_exit_handler.cc(77)] Handling 1370 exit.
2017-03-20T15:57:42.708860-07:00 ERR session_manager[1302]: [ERROR:child_exit_handler.cc(85)]   Exited with signal 6

---

Based on my analysis above, the hang must have happened _after_ 15:55:37.  


Assuming the crash from above shares the same root cause, then maybe it's not binder but instead "e.process.gapps" or "gle.android.gms"?

...or maybe something else is important?  I also tracked down the crash report for the browser when it was killed due to the missing liveness check.  It is 3c2edd6640000000.  Nothing jumps out at me, though...
Owner: diand...@chromium.org
Summary: When memory is low and things get OOM killed, sometimes system locks up (maybe ARC++ related) (was: Android framework should be discarded (or OOM killed) before panicking)
Moving ownership to myself since:

1. I seem to be able to reproduce this, or something similar, using a combination of starting up ARC++ and running "balloon" to chew through memory (see the memory-eater sketch after this list).

2. When I reproduce the problem, the system is in a clearly bad state where pretty much nothing is happening (other than interrupts firing) for ~2 minutes until the hung task detector comes along.

3. I'm not really sure that binder is involved here, or at least not sure that it's always involved here.
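
For anyone without the "balloon" test tool handy, a minimal memory-eater along these lines reproduces the same kind of pressure (this is a generic sketch, not the actual tool):

/* Minimal memory-eater sketch (not the real "balloon"): allocate in
 * 100MB chunks and touch every page so the memory is really resident,
 * until allocation fails or the OOM killer steps in. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t chunk = 100 * 1024 * 1024;
    size_t total = 0;

    for (;;) {
        char *p = malloc(chunk);
        if (!p)
            break;
        memset(p, 0xaa, chunk);   /* fault every page in */
        total += chunk;
        printf("allocated %zu MB\n", total / (1024 * 1024));
    }
    printf("malloc failed after %zu MB; holding memory\n", total / (1024 * 1024));
    pause();                      /* keep the memory until killed */
    return 0;
}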

--

At the moment I'm hesitant to speculate too much on details of what I've "figured out" since much of it doesn't make lots of sense.  Certainly there are lots of tasks that are sitting there waiting to run and the system isn't doing anything to help make them run.  I haven't yet been able to figure out why, but many things are blocked waiting for memory to become available or waiting on various disk / squashfs operations...

Happy for any help folks want to give, or else I'll just plod along trying to understand this...

Comment 15 by bccheng@google.com, Mar 23 2017

Hi Doug,

Since you are on it, could you play with increasing the low memory margin by writing a larger number to /sys/kernel/mm/chromeos-low_mem/margin and see if it makes any difference?
@15: I'd rather not.  IMHO the fact that the OOM killer can get us into this awful hung state is a bug and that bug needs to be addressed.  We already have open bugs about adjusting the low memory margin and I think we can focus on tweaks and tests there.
I made a bit of progress.  At the moment I'm convinced that our OOM killer is often churning.  It runs and then fails to pick a process to kill, but the allocation path thinks progress was made, so it keeps trying.

I've got a set of steps to reproduce this problem and I've been slowly churning through the code to debug.  More Monday.


OK, here's where I'm at today.  I got a bit blocked at the end because I was trying to hack up debugd and got into version mismatch hell w/ it, libchromeos, and the system that was on my device.  :(  I'm trying to rebuild everything from scratch now...

---

Top priority:

From go/arc-memory, some Android processes are supposed to have negative OOM scores.  I'm fairly certain that we're ignoring these and killing these "important" Android tasks.  I believe these processes aren't great candidates to kill, and that's why we're triggering such problems.

Specifically, when I see OOM kills, all sorts of important-sounding Android processes have OOM scores like ~800 or ~900, not negative.  ...and I've never seen an important Android process with a negative score.

cylee@: can you help figure out how to fix this?
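
For reference, the adjustment lives in a per-process file under /proc, so whatever fixes this presumably just needs to write the right values there.  A minimal sketch of reading and (with enough privilege) lowering a process's own value -- the -900 is purely illustrative, not the value go/arc-memory prescribes:

/* Sketch: read and write oom_score_adj via procfs.  Lowering the value
 * below its current setting needs CAP_SYS_RESOURCE, so the write below
 * may simply fail for an unprivileged process. */
#include <stdio.h>

int main(void)
{
    char buf[32];
    FILE *f = fopen("/proc/self/oom_score_adj", "r");

    if (f && fgets(buf, sizeof(buf), f))
        printf("current oom_score_adj: %s", buf);
    if (f)
        fclose(f);

    f = fopen("/proc/self/oom_score_adj", "w");
    if (f) {
        fprintf(f, "-900\n");     /* illustrative value only */
        fclose(f);
    }
    return 0;
}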

--

That being said, it looks like we're hitting a bunch of kernel corner cases here where the OOM killer really doesn't work well.  The main problem is that the OOM killer generally doesn't cope when the task to be killed is blocked on a mutex (or similar).

I'm still poking, prodding, and gathering logs, but I'm pretty sure we're hitting this problem:


A) Memory gets low and pretty much everyone is starved.

B) Process 1000 has a mutex named "mutex" and is waiting to allocate memory.

C) Process 1001 wants "mutex".

D) We choose to kill process 1001, for whatever reason.

...when the above happens we mark process 1001 as "MEMDIE" and we assume it will die soon.  In fact, we avoid OOM killing any other process _until_ it does, because this process gets special access to reserved memory to help it die and we don't want to give that special access to lots of processes at the same time.  See the comment in oom_scan_process_thread().

...while we're waiting for the process to die, we spin like crazy across many CPUs, just for the fun of it.  The "out_of_memory" function will silently return "true", indicating that it made progress, as long as any process is still marked as MEMDIE, which will cause us to loop and loop in the allocation functions.  Even better, we'll be joined by other processes looping because they'll notice we hold the oom_lock (they assume we're making progress for them).  See __alloc_pages_may_oom().


So basically we've got a classic deadlock.  

-> Process 1000 can't make progress (which it needs to do before it releases the mutex) because it can't get any memory and nobody's allowed to OOM kill.

-> Process 1001 can't finish dying (giving up its memory and then letting other things get killed) because it's waiting on the mutex held by process 1000.


...presumably we have to relax the restriction of only allowing one process to be marked MEMDIE at once, and just make sure that there's enough reserved memory for everyone.
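
To make the livelock easier to see, here is a toy userspace model of the loop described above (plain C, not kernel code; names and structure are simplified):

/*
 * Toy model of the livelock: out_of_memory() claims progress as long as
 * some victim is marked MEMDIE, so the allocating task keeps retrying,
 * while the victim can never actually exit because it is blocked on a
 * lock held by the starved allocator.
 */
#include <stdbool.h>
#include <stdio.h>

struct task {
    const char *name;
    bool memdie;       /* picked as OOM victim, "should" exit soon */
    bool holds_lock;   /* e.g. the binder mutex */
    bool wants_lock;
};

/* Simplified stand-in for the "skip further kills while a victim exists" rule. */
static bool out_of_memory(struct task *tasks, int n)
{
    for (int i = 0; i < n; i++)
        if (tasks[i].memdie)
            return true;   /* "made progress": a victim is supposedly dying */
    return false;          /* (real code would pick and kill a new victim) */
}

int main(void)
{
    struct task tasks[] = {
        { "proc 1000 (allocating, holds mutex)", false, true,  false },
        { "proc 1001 (victim, wants mutex)",     true,  false, true  },
    };

    /* Allocation retry loop: nothing can run, yet we always "make progress". */
    for (int retry = 1; retry <= 5; retry++) {
        if (!out_of_memory(tasks, 2))
            break;
        printf("retry %d: victim still MEMDIE, allocator still starved\n", retry);
        /* the victim never clears MEMDIE: it needs the mutex held by proc 1000 */
    }
    printf("...in the real kernel this loops until a watchdog reboots the box\n");
    return 0;
}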


NOTE: with Android processes we seem to end up in this state much more often than we did for Chrome.  One place where we seem to get stuck on mutexes a lot:

[  119.172054] DOUG: select_bad_process 333: OOM_SCAN_ABORT: Binder_1-4149 (0, 0x00404040, 0x00000004)
[  119.172055] Call trace:
[  119.172064] [<ffffffc000204ff0>] __switch_to+0x9c/0xa8
[  119.172069] [<ffffffc00093bd10>] __schedule+0x504/0x740
[  119.172073] [<ffffffc00093bfd4>] schedule+0x88/0xa8
[  119.172077] [<ffffffc00093c358>] schedule_preempt_disabled+0x28/0x44
[  119.172081] [<ffffffc00093ddd4>] __mutex_lock_slowpath+0xf8/0x1a4
[  119.172085] [<ffffffc00093decc>] mutex_lock+0x4c/0x68
[  119.172091] [<ffffffc0007c0d30>] binder_thread_read+0x694/0x11bc
[  119.172095] [<ffffffc0007c4210>] binder_ioctl_write_read.constprop.46+0x1e8/0x318
[  119.172099] [<ffffffc0007c46a0>] binder_ioctl+0x360/0x768
[  119.172104] [<ffffffc0003b89c4>] compat_SyS_ioctl+0x134/0x10ac
[  119.173642] [<ffffffc000203e34>] el0_svc_naked+0x24/0x28
[  119.173644] DOUG: end of stack }

--

It's possible there are some other related bugs here, all related to tasks not dying fast enough:

1. Possibly there are cases where we're not properly giving priority to a dying process.  In oom_scan_process_thread() we stop scanning with OOM_SCAN_ABORT when we see task_will_free_mem().  ...but I don't see where we tag that process with MEMDIE.  In many other cases we _do_ mark the task with mark_oom_victim().  I'm pretty sure I found one deadlock where we were looping waiting for a process to finish.  Note that alternatively we may be able to patch page_alloc.c to pay more attention to PF_EXITING.  It seems to pay attention to it in some places, but not others.

2. In oom_kill_process() I think there's a bug when we choose a child to die in our place (such good parents we are!).  If that child fails to die (see the mutex discussion above) then we'll keep choosing it over and over again.  I think we need a call to oom_scan_process_thread() before the call to oom_badness() (actually, I think it's not super valid to ever call oom_badness() without first calling oom_scan_process_thread()).

3. In the OOM killing code you see "schedule_timeout_killable(1)".  The argument is in jiffies, so it waits a different amount of time depending on the value of HZ.  We have HZ=1000, so that only gives 1 ms.  Possibly this should be a little longer?  (A quick arithmetic check follows this list.)

4. We only do the "schedule_timeout_killable(1)" in the case where we actually tried to kill the process.  ...but it seems like we'd want to do that in other cases too, like if we notice another process is dying.  Basically any time we claim that we "made progress" from out_of_memory() it seems like we should have some sort of delay since we can't exactly kill things instantly (we mark them to die and then wait for them to get to userspace so they can be cleanly killed).

5. I'm less certain about this one, but in __alloc_pages_may_oom() it sure seems like the oom_lock should be moved to be only around the block of code commented as "Exhausted what can be done so it's blamo time".
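
For item 3, a quick check of the jiffies arithmetic (the HZ values below are just examples; only the HZ=1000 row matches what the comment above says we run):

/* schedule_timeout_killable() takes jiffies, so a literal "1" means 1/HZ
 * seconds.  This just prints the conversion for a few example HZ values. */
#include <stdio.h>

int main(void)
{
    const int hz_values[] = { 100, 250, 1000 };

    for (int i = 0; i < 3; i++) {
        int hz = hz_values[i];
        printf("HZ=%4d: schedule_timeout_killable(1) waits %6.1f ms; "
               "a ~100 ms wait would be %d jiffies\n",
               hz, 1000.0 / hz, hz / 10);
    }
    return 0;
}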

--

A few other interesting findings:

* I believe I also found OOM-killed tasks sitting waiting in squashfs_cache_get() a lot of the time.  That pointed me at CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU, which we probably want (need to benchmark).  That's on my back burner to dig into.

* I also found OOM killed tasks sitting waiting in kbase_mem_evictable_reclaim_count_objects() on a mutex.  As per a previous investigation I did, grabbing a mutex in count_objects isn't ideal.  This can easily be optimized and we can see if it helps our behavior when memory is low.



Forked ARC++ issue to bug #706048
Status: Started (was: Assigned)
Patches posted:

4.4:

remote:   https://chromium-review.googlesource.com/465186 UPSTREAM: mm: oom_kill: don't ignore oom score on exiting tasks
remote:   https://chromium-review.googlesource.com/465187 CHROMIUM: DROP: mm/oom_kill: Don't kill a subthread in our place if we still ...
remote:   https://chromium-review.googlesource.com/465188 CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place
remote:   https://chromium-review.googlesource.com/465189 CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victims

3.14:

remote:   https://chromium-review.googlesource.com/465468 BACKPORT: mm: oom_kill: don't ignore oom score on exiting tasks        
remote:   https://chromium-review.googlesource.com/465469 CHROMIUM: DROP: mm/oom_kill: Don't kill a subthread in our place if we still ...        
remote:   https://chromium-review.googlesource.com/465470 CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place        
remote:   https://chromium-review.googlesource.com/465471 CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victims        

---

I'll try to port to some of our other kernels, but I will likely need some help testing on them...
These are untested:

---

3.18:

remote:   https://chromium-review.googlesource.com/465472 BACKPORT: mm: oom_kill: don't ignore oom score on exiting tasks        
remote:   https://chromium-review.googlesource.com/465473 CHROMIUM: DROP: mm/oom_kill: Don't kill a subthread in our place if we still ...        
remote:   https://chromium-review.googlesource.com/465474 CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place        
remote:   https://chromium-review.googlesource.com/465475 CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victims    


We probably want to test and apply that since ARC++ seems to make this worse...    

---

3.10 (only 3 patches needed):

remote:   https://chromium-review.googlesource.com/465411 BACKPORT: mm: oom_kill: don't ignore oom score on exiting tasks        
remote:   https://chromium-review.googlesource.com/465412 CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place        
remote:   https://chromium-review.googlesource.com/465413 CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victims        

---

3.8:

remote:   https://chromium-review.googlesource.com/465586 BACKPORT: mm: oom_kill: don't ignore oom score on exiting tasks        
remote:   https://chromium-review.googlesource.com/465587 CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place        
remote:   https://chromium-review.googlesource.com/465588 CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victims        

We could confirm basic OOMing here and if we're good then we can maybe just mark this and 3.10 to go in?


====


Project Member Comment 22 by bugdroid1@chromium.org, Mar 31 2017

Labels: merge-merged-chromeos-4.4
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/88bdca832d35cb1a28436ee41626da92c2f3251e

commit 88bdca832d35cb1a28436ee41626da92c2f3251e
Author: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri Mar 31 22:16:21 2017

UPSTREAM: mm: oom_kill: don't ignore oom score on exiting tasks

When the OOM killer scans tasks and encounters a PF_EXITING one, it
force-selects that task regardless of the score.  The problem is that if
that task got stuck waiting for some state the allocation site is
holding, the OOM reaper can not move on to the next best victim.

Frankly, I don't even know why we check for exiting tasks in the OOM
killer.  We've tried direct reclaim at least 15 times by the time we
decide the system is OOM, there was plenty of time to exit and free
memory; and a task might exit voluntarily right after we issue a kill.
This is testing pure noise.  Remove it.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Change-Id: Ied8391fd2d8cda4d1728a28dbe6cfea31883b24e
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
(cherry picked from commit 6a618957ad17d8f4f4c7eeede752685374b1b176)
Reviewed-on: https://chromium-review.googlesource.com/465186
Reviewed-by: Dylan Reid <dgreid@chromium.org>

[modify] https://crrev.com/88bdca832d35cb1a28436ee41626da92c2f3251e/mm/oom_kill.c

Project Member Comment 23 by bugdroid1@chromium.org, Mar 31 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/222d652a0495a51be1d799772ec669e169ca9496

commit 222d652a0495a51be1d799772ec669e169ca9496
Author: Douglas Anderson <dianders@chromium.org>
Date: Fri Mar 31 22:16:22 2017

CHROMIUM: DROP: mm/oom_kill: Don't kill a subthread in our place if we still have mm

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

In the function find_lock_task_mm() the stated purpose is to find a
subthread in the case that the original task no longer has an "->mm".

Implied in the description is that we should pick the original process
"p" if it still has "->mm" and we should only pick a subthread if
needed.  However, that's not what happens with the current
implementation.  Let's change the code to explicitly make sure that we
return the original process "p" if it has a valid "->mm".

Why do we care?  A future patch will improve the OOM killer to
sometimes allow more than one OOM victim to exist at once.  If we do
that without making find_lock_task_mm() prefer the original process
then we can get into this loop:
1. Pick a process as a victim.
2. Don't kill the chosen process, but instead kill some other task
   that (perhaps) has already been previously marked as a victim.
3. Go back to step #1 and pick the same process as a victim (since it
   wasn't actually killed before).

Even without the future patch, though, it still makes sense to pick
the original process if possible.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Change-Id: I3b59bbff74bd5fed1fa7790dd9696e29030fba02
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465187

[modify] https://crrev.com/222d652a0495a51be1d799772ec669e169ca9496/mm/oom_kill.c

Project Member Comment 24 by bugdroid1@chromium.org, Mar 31 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/d7cd7d859ad92dac0f9f4511a2bc1cf2208e8cbe

commit d7cd7d859ad92dac0f9f4511a2bc1cf2208e8cbe
Author: Douglas Anderson <dianders@chromium.org>
Date: Fri Mar 31 22:16:23 2017

CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

If the child of an OOM victim has a different "->mm" element than the
victim, we may choose to kill the child in place of the victim.  This
is sane because if we killed the victim the child would die anyway, so
why not see if we can save the parent?

...but, before we kill the child let's double-check that it's OK by
calling oom_scan_process_thread().

I'll admit that as the code stands right now this is a totally useless
thing to do.  The only time oom_scan_process_thread() will tell us not
to kill the child is if:

1. The child is already marked with MEMDIE.  However, that's not
   possible because the code (as it stands today) won't let more than
   one process have MEMDIE.
2. The child has no "mm".  ...but that shouldn't happen because the
   parent and child shouldn't share an "mm" if the child's "mm" is
   NULL.

Now that I've explained why this patch is stupid and useless, here's
why we want it anyway: in a future patch we'll make it (sometimes)
possible for more than one process to be marked with MEMDIE.  With
that in mind, point #1 above is moot and suddenly this patch has merit.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Change-Id: I0ea9b5f4a03c1392d52d9d927bd395ed44d7b98c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465188

[modify] https://crrev.com/d7cd7d859ad92dac0f9f4511a2bc1cf2208e8cbe/mm/oom_kill.c

Project Member Comment 25 by bugdroid1@chromium.org, Mar 31 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/ed758deb4a6ef27d56e95da9d1a73be0cbd09ac5

commit ed758deb4a6ef27d56e95da9d1a73be0cbd09ac5
Author: Douglas Anderson <dianders@chromium.org>
Date: Fri Mar 31 22:16:24 2017

CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victims

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

It turns out that there's a pretty big flaw with the way the OOM
killer works.  If the chosen OOM kill victim doesn't die then the OOM
killer totally stops working.  Specifically once we've picked an OOM
victim we won't ever pick another until the first victim dies.  If
that victim refuses to die then we'll silently loop until eventually
some sort of watchdog reboots the system.

This is easy to see in toy code.  Just write a function that userspace
can invoke that grabs a mutex twice.  Now set the OOM score so that
this is the preferred process to kill.  The first invocation of the
OOM killer will try to kill this process but won't be able to because
the process will never get the mutex.  All subsequent OOM killer
invocations will be complete no-ops.

This is easy to argue about in non-toy code too.  Just imagine:
A) Memory gets low and pretty much everyone is starved.
B) Process 1000 has a mutex named "mutex" and is waiting to allocate
   memory.
C) Process 1001 wants "mutex".
D) We choose to OOM kill process 1001, maybe because it has an
   attractive looking OOM score.

OOM kill is the last resort of the memory manager.  That means it has
no other ways to get memory to give to process 1000 in the example
above and process 1000 will loop waiting forever, never releasing its
mutex.  Process 1001 will never get the mutex and thus can never die
to either release memory or allow an additional OOM kill.  Classic
deadlock.

This is seen in real life.  Specifically while stressing the OOM
killer on a Chromebook with Android processes running, the OOM killer
likes to kill Android processes.  Often these processes are sitting in
binder_thread_read().  When we try to kill them they wake up and try
to grab the binder mutex.  Often the binder mutex is held by another
process who is stuck trying to allocate memory.  Ugh.  We probably
want to fix Binder to make this specific case behave better, but we
can't fix every case.

This is seen in real life in non-binder code as well.  Specifically
I've seen the victim process sitting waiting like this:
 schedule+0x88/0xa8
 schedule_timeout+0x44/0x27c
 io_schedule_timeout+0x7c/0xbc
 bt_get+0x174/0x278
 blk_mq_get_tag+0x4c/0xd4
 __blk_mq_alloc_request+0x28/0x158
 blk_mq_map_request+0x340/0x39c
 blk_sq_make_request+0xb0/0x3b8
 generic_make_request+0xcc/0x174
 submit_bio+0xe4/0x1d4
 submit_bh_wbc.isra.35+0x124/0x138
 submit_bh+0x2c/0x38
 ll_rw_block+0xac/0xd8
 squashfs_read_data+0x354/0x510
 squashfs_readpage_block+0x2d8/0x490
 squashfs_readpage+0x48c/0x600
 __do_page_cache_readahead+0x1e4/0x248
 filemap_fault+0x14c/0x3b8
 __do_fault+0x78/0xf0
 handle_mm_fault+0x744/0x1050
 do_page_fault+0x140/0x2dc
 do_mem_abort+0x64/0xd8

While I didn't dig into why the process couldn't get unblocked in the
squashfs case, it seems fairly sane to assume some other process is
waiting on memory.

Now that you (hopefully) believe we have a problem, we need a
solution.  One sane solution is to eventually let the OOM killer kill
more than one process.  We don't want to kill more than one process
right away as per the dire warnings in the OOM killer code about
exhausting our memory reserves and ending up in a livelock.  While
imperfect, it seems sane to perhaps use a heuristic here where we wait
a bit of time (100 ms) for a process to die.  If it fails, we give up
and try another process.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Change-Id: I4099314abed36b8af0680ceb0eb6d85427ba4962
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465189
Reviewed-by: Dylan Reid <dgreid@chromium.org>

[modify] https://crrev.com/ed758deb4a6ef27d56e95da9d1a73be0cbd09ac5/mm/oom_kill.c

Project Member Comment 26 by bugdroid1@chromium.org, Apr 1 2017

Labels: merge-merged-chromeos-3.8
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/fcdc42d2699d037a399f655571637e093e9fe09e

commit fcdc42d2699d037a399f655571637e093e9fe09e
Author: Johannes Weiner <hannes@cmpxchg.org>
Date: Sat Apr 01 01:56:57 2017

BACKPORT: mm: oom_kill: don't ignore oom score on exiting tasks

When the OOM killer scans tasks and encounters a PF_EXITING one, it
force-selects that task regardless of the score.  The problem is that if
that task got stuck waiting for some state the allocation site is
holding, the OOM reaper can not move on to the next best victim.

Frankly, I don't even know why we check for exiting tasks in the OOM
killer.  We've tried direct reclaim at least 15 times by the time we
decide the system is OOM, there was plenty of time to exit and free
memory; and a task might exit voluntarily right after we issue a kill.
This is testing pure noise.  Remove it.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were extra checks in 3.14, but the concept is still the same.

Change-Id: Ied8391fd2d8cda4d1728a28dbe6cfea31883b24e
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
(cherry picked from commit 6a618957ad17d8f4f4c7eeede752685374b1b176)
Reviewed-on: https://chromium-review.googlesource.com/465586

[modify] https://crrev.com/fcdc42d2699d037a399f655571637e093e9fe09e/mm/oom_kill.c

Project Member Comment 27 by bugdroid1@chromium.org, Apr 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/6d287f9bc9e24b57f938155368784a755204e64c

commit 6d287f9bc9e24b57f938155368784a755204e64c
Author: Douglas Anderson <dianders@chromium.org>
Date: Sat Apr 01 01:56:58 2017

CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

If the child of an OOM victim has a different "->mm" element than the
victim, we may choose to kill the child in place of the victim.  This
is sane because if we killed the victim the child would die anyway, so
why not see if we can save the parent?

...but, before we kill the child let's double-check that it's OK by
calling oom_scan_process_thread().

I'll admit that as the code stands right now this is a totally useless
thing to do.  The only time oom_scan_process_thread() will tell us not
to kill the child is if:

1. The child is already marked with MEMDIE.  However, that's not
   possible because the code (as it stands today) won't let more than
   one process have MEMDIE.
2. The child has no "mm".  ...but that shouldn't happen because the
   parent and child shouldn't share an "mm" if the child's "mm" is
   NULL.

Now that I've explained why this patch is stupid and useless, here's
why we want it anyway: in a future patch we'll make it (sometimes)
possible for more than one process to be marked with MEMDIE.  With
that in mind, point #1 above is moot and suddenly this patch has merit.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

NOTE: this file is slightly different than its 4.4 counterpart (where
the patch was originally written) due to different parameters in
oom_scan_process_thread().

Change-Id: I0ea9b5f4a03c1392d52d9d927bd395ed44d7b98c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465587

[modify] https://crrev.com/6d287f9bc9e24b57f938155368784a755204e64c/mm/oom_kill.c

Project Member Comment 28 by bugdroid1@chromium.org, Apr 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/ea803c877bd53ab8a15e7ff6ed8a20b766b247e5

commit ea803c877bd53ab8a15e7ff6ed8a20b766b247e5
Author: Douglas Anderson <dianders@chromium.org>
Date: Sat Apr 01 01:56:59 2017

CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victims

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

It turns out that there's a pretty big flaw with the way the OOM
killer works.  If the chosen OOM kill victim doesn't die then the OOM
killer totally stops working.  Specifically once we've picked an OOM
victim we won't ever pick another until the first victim dies.  If
that victim refuses to die then we'll silently loop until eventually
some sort of watchdog reboots the system.

This is easy to see in toy code.  Just write a function that userspace
can invoke that grabs a mutex twice.  Now set the OOM score so that
this is the preferred process to kill.  The first invocation of the
OOM killer will try to kill this process but won't be able to because
the process will never get the mutex.  All subsequent OOM killer
invocations will be complete no-ops.

This is easy to argue about in non-toy code too.  Just imagine:
A) Memory gets low and pretty much everyone is starved.
B) Process 1000 has a mutex named "mutex" and is waiting to allocate
   memory.
C) Process 1001 wants "mutex".
D) We choose to OOM kill process 1001, maybe because it has an
   attractive looking OOM score.

OOM kill is the last resort of the memory manager.  That means it has
no other ways to get memory to give to process 1000 in the example
above and process 1000 will loop waiting forever, never releasing its
mutex.  Process 1001 will never get the mutex and thus can never die
to either release memory or allow an additional OOM kill.  Classic
deadlock.

This is seen in real life.  Specifically while stressing the OOM
killer on a Chromebook with Android processes running, the OOM killer
likes to kill Android processes.  Often these processes are sitting in
binder_thread_read().  When we try to kill them they wake up and try
to grab the binder mutex.  Often the binder mutex is held by another
process who is stuck trying to allocate memory.  Ugh.  We probably
want to fix Binder to make this specific case behave better, but we
can't fix every case.

This is seen in real life in non-binder code as well.  Specifically
I've seen the victim process sitting waiting like this:
 schedule+0x88/0xa8
 schedule_timeout+0x44/0x27c
 io_schedule_timeout+0x7c/0xbc
 bt_get+0x174/0x278
 blk_mq_get_tag+0x4c/0xd4
 __blk_mq_alloc_request+0x28/0x158
 blk_mq_map_request+0x340/0x39c
 blk_sq_make_request+0xb0/0x3b8
 generic_make_request+0xcc/0x174
 submit_bio+0xe4/0x1d4
 submit_bh_wbc.isra.35+0x124/0x138
 submit_bh+0x2c/0x38
 ll_rw_block+0xac/0xd8
 squashfs_read_data+0x354/0x510
 squashfs_readpage_block+0x2d8/0x490
 squashfs_readpage+0x48c/0x600
 __do_page_cache_readahead+0x1e4/0x248
 filemap_fault+0x14c/0x3b8
 __do_fault+0x78/0xf0
 handle_mm_fault+0x744/0x1050
 do_page_fault+0x140/0x2dc
 do_mem_abort+0x64/0xd8

While I didn't dig into why the process couldn't get unblocked in the
squashfs case, it seems fairly sane to assume some other process is
waiting on memory.

Now that you (hopefully) believe we have a problem, we need a
solution.  One sane solution is to eventually let the OOM killer kill
more than one process.  We don't want to kill more than one process
right away as per the dire warnings in the OOM killer code about
exhausting our memory reserves and ending up in a livelock.  While
imperfect, it seems sane to perhaps use a heuristic here where we wait
a bit of time (100 ms) for a process to die.  If it fails, we give up
and try another process.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were conflicts compared to the 4.4 version because of lots of
churn in the oom kill code.  Conceptually this is the same thing,
though.

Change-Id: I4099314abed36b8af0680ceb0eb6d85427ba4962
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465588
Reviewed-by: Dylan Reid <dgreid@chromium.org>

[modify] https://crrev.com/ea803c877bd53ab8a15e7ff6ed8a20b766b247e5/mm/oom_kill.c

Project Member Comment 29 by bugdroid1@chromium.org, Apr 1 2017

Labels: merge-merged-chromeos-3.18
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/60728a964498e1200265b94d414cc169b1cdb840

commit 60728a964498e1200265b94d414cc169b1cdb840
Author: Johannes Weiner <hannes@cmpxchg.org>
Date: Sat Apr 01 05:33:30 2017

BACKPORT: mm: oom_kill: don't ignore oom score on exiting tasks

When the OOM killer scans tasks and encounters a PF_EXITING one, it
force-selects that task regardless of the score.  The problem is that if
that task got stuck waiting for some state the allocation site is
holding, the OOM reaper can not move on to the next best victim.

Frankly, I don't even know why we check for exiting tasks in the OOM
killer.  We've tried direct reclaim at least 15 times by the time we
decide the system is OOM, there was plenty of time to exit and free
memory; and a task might exit voluntarily right after we issue a kill.
This is testing pure noise.  Remove it.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were extra checks in 3.14, but the concept is still the same.

Change-Id: Ied8391fd2d8cda4d1728a28dbe6cfea31883b24e
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
(cherry picked from commit 6a618957ad17d8f4f4c7eeede752685374b1b176)
Reviewed-on: https://chromium-review.googlesource.com/465472
Reviewed-by: Dylan Reid <dgreid@chromium.org>

[modify] https://crrev.com/60728a964498e1200265b94d414cc169b1cdb840/mm/oom_kill.c

Project Member

Comment 30 by bugdroid1@chromium.org, Apr 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/b74ed4c3bfcdf15339adf67e28e2b2525d8270c7

commit b74ed4c3bfcdf15339adf67e28e2b2525d8270c7
Author: Douglas Anderson <dianders@chromium.org>
Date: Sat Apr 01 05:33:31 2017

CHROMIUM: DROP: mm/oom_kill: Don't kill a subthread in our place if we still have mm

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

In the function find_lock_task_mm() the stated purpose is to find a
subthread in the case that the original task no longer has an "->mm".

Implied in the description is that we should pick the original process
"p" if it still has "->mm" and we should only pick a subthread if
needed.  However, that's not what happens with the current
implementation.  Let's change the code to explicitly make sure that we
return the original process "p" if it has a valid "->mm".

Why do we care?  A future patch will improve the OOM killer to
sometimes allow more than one OOM victim to exist at once.  If we do
that without making find_lock_task_mm() prefer the original process
then we can get into this loop:
1. Pick a process as a victim.
2. Don't kill the chosen process, but instead kill some other task
   that (perhaps) has already been previously marked as a victim.
3. Go back to step #1 and pick the same process as a victim (since it
   wasn't actually killed before).

Even without the future patch, though, it still makes sense to pick
the original process if possible.
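
Roughly, the preference being described can be sketched like this
(user-space stand-ins for task_struct/mm_struct; not the actual kernel
function):

#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for mm_struct and task_struct. */
struct mm { int dummy; };
struct task {
    const char *comm;
    struct mm *mm;
    struct task *threads;  /* other threads in the group */
    int nr_threads;
};

/*
 * Prefer the original process "p" whenever it still has an mm; only
 * fall back to scanning its subthreads when p->mm is gone.
 */
static struct task *find_task_with_mm(struct task *p)
{
    if (p->mm)
        return p;
    for (int i = 0; i < p->nr_threads; i++)
        if (p->threads[i].mm)
            return &p->threads[i];
    return NULL;
}

int main(void)
{
    struct mm m;
    struct task sub = { "subthread", &m, NULL, 0 };
    struct task p   = { "main", &m, &sub, 1 };

    printf("picked: %s\n", find_task_with_mm(&p)->comm); /* prints "main" */
    return 0;
}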

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Change-Id: I3b59bbff74bd5fed1fa7790dd9696e29030fba02
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465473

[modify] https://crrev.com/b74ed4c3bfcdf15339adf67e28e2b2525d8270c7/mm/oom_kill.c

Project Member

Comment 31 by bugdroid1@chromium.org, Apr 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/e8278e9f12d0f82e39c46452420cf2e74cb5a228

commit e8278e9f12d0f82e39c46452420cf2e74cb5a228
Author: Douglas Anderson <dianders@chromium.org>
Date: Sat Apr 01 05:33:32 2017

CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

If the child of an OOM victim has a different "->mm" element than the
victim, we may choose to kill the child in place of the victim.  This
is sane because if we killed the victim the child would die anyway, so
why not see if we can save the parent?

...but, before we kill the child let's double-check that it's OK by
calling oom_scan_process_thread().

I'll admit that as the code stands right now this is a totally useless
thing to do.  The only time oom_scan_process_thread() will tell us not
to kill the child is if:

1. The child is already marked with MEMDIE.  However, that's not
   possible because the code (as it stands today) won't let more than
   one process have MEMDIE.
2. The child has no "mm".  ...but that shouldn't happen because the
   parent and child shouldn't share an "mm" if the child's "mm" is
   NULL.

Now that I've explained why this patch is stupid and useless, here's
why we want it anyway: in a future patch we'll make it (sometimes)
possible for more than one process to be marked with MEMDIE.  With
that in mind, point #1 above is moot and suddenly this patch has merit.
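
A minimal sketch of that double-check, with placeholder names standing
in for oom_scan_process_thread() and its return values:

#include <stdio.h>

struct mm { int dummy; };
struct task {
    const char *comm;
    struct mm *mm;
    int memdie;  /* already marked as an OOM victim */
};

enum scan_result { SCAN_OK, SCAN_SKIP };

/* The check only matters once multiple victims are allowed: refuse the
 * child if it is itself already a marked victim. */
static enum scan_result scan_task(struct task *t)
{
    return t->memdie ? SCAN_SKIP : SCAN_OK;
}

/* Kill the child in the victim's place only when it has its own mm and
 * the scan does not veto it; otherwise fall back to the victim. */
static struct task *pick_kill_target(struct task *victim, struct task *child)
{
    if (child->mm && child->mm != victim->mm &&
        scan_task(child) == SCAN_OK)
        return child;
    return victim;
}

int main(void)
{
    struct mm vm, cm;
    struct task victim = { "parent", &vm, 0 };
    struct task child  = { "child",  &cm, 1 };  /* already a victim */

    printf("kill: %s\n", pick_kill_target(&victim, &child)->comm);
    return 0;
}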

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

NOTE: this file is slightly different than its 4.4 counterpart (where
the patch was originally written) due to different parameters in
oom_scan_process_thread().

Change-Id: I0ea9b5f4a03c1392d52d9d927bd395ed44d7b98c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465474

[modify] https://crrev.com/e8278e9f12d0f82e39c46452420cf2e74cb5a228/mm/oom_kill.c

Project Member

Comment 32 by bugdroid1@chromium.org, Apr 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/25d68b7fa97d3b331e4ac5e1ff0c8399574569bf

commit 25d68b7fa97d3b331e4ac5e1ff0c8399574569bf
Author: Douglas Anderson <dianders@chromium.org>
Date: Sat Apr 01 05:33:33 2017

CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victims

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

It turns out that there's a pretty big flaw with the way the OOM
killer works.  If the chosen OOM kill victim doesn't die then the OOM
killer totally stops working.  Specifically once we've picked an OOM
victim we won't ever pick another until the first victim dies.  If
that victim refuses to die then we'll silently loop until eventually
some sort of watchdog reboots the system.

This is easy to see in toy code.  Just write a function that userspace
can invoke that grabs a mutex twice.  Now set the OOM score so that
this is the preferred process to kill.  The first invocation of the
OOM killer will try to kill this process but won't be able to because
the process will never get the mutex.  All subsequent OOM killer
invocations will be complete no-ops.
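
The "victim that can never die" behavior is easy to reproduce in user
space with a non-recursive mutex; this is only an analogue of the
in-kernel toy described above (a SIGKILL would still terminate a
user-space process, which is where the analogy ends).  Build with
-pthread:

#include <pthread.h>
#include <stdio.h>

int main(void)
{
    pthread_mutex_t lock;
    pthread_mutexattr_t attr;

    /* PTHREAD_MUTEX_NORMAL is specified to deadlock when the owning
     * thread tries to relock it, which is exactly what we want to show. */
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_NORMAL);
    pthread_mutex_init(&lock, &attr);

    pthread_mutex_lock(&lock);
    printf("got the lock once; taking it again blocks forever...\n");
    pthread_mutex_lock(&lock);   /* never returns */

    printf("never reached\n");
    return 0;
}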

This is easy to argue about in non-toy code too.  Just imagine:
A) Memory gets low and pretty much everyone is starved.
B) Process 1000 has a mutex named "mutex" and is waiting to allocate
   memory.
C) Process 1001 wants "mutex".
D) We choose to OOM kill process 1001, maybe because it has an
   attractive looking OOM score.

OOM kill is the last resort of the memory manager.  That means it has
no other ways to get memory to give to process 1000 in the example
above and process 1000 will loop waiting forever, never releasing its
mutex.  Process 1001 will never get the mutex and thus can never die
to either release memory or allow an additional OOM kill.  Classic
deadlock.

This is seen in real life.  Specifically while stressing the OOM
killer on a Chromebook with Android processes running, the OOM killer
likes to kill Android processes.  Often these processes are sitting in
binder_thread_read().  When we try to kill them they wake up and try
to grab the binder mutex.  Often the binder mutex is held by another
process who is stuck trying to allocate memory.  Ugh.  We probably
want to fix Binder to make this specific case behave better, but we
can't fix every case.

This is seen in real life in non-binder code as well.  Specifically
I've seen the victim process sitting waiting like this:
 schedule+0x88/0xa8
 schedule_timeout+0x44/0x27c
 io_schedule_timeout+0x7c/0xbc
 bt_get+0x174/0x278
 blk_mq_get_tag+0x4c/0xd4
 __blk_mq_alloc_request+0x28/0x158
 blk_mq_map_request+0x340/0x39c
 blk_sq_make_request+0xb0/0x3b8
 generic_make_request+0xcc/0x174
 submit_bio+0xe4/0x1d4
 submit_bh_wbc.isra.35+0x124/0x138
 submit_bh+0x2c/0x38
 ll_rw_block+0xac/0xd8
 squashfs_read_data+0x354/0x510
 squashfs_readpage_block+0x2d8/0x490
 squashfs_readpage+0x48c/0x600
 __do_page_cache_readahead+0x1e4/0x248
 filemap_fault+0x14c/0x3b8
 __do_fault+0x78/0xf0
 handle_mm_fault+0x744/0x1050
 do_page_fault+0x140/0x2dc
 do_mem_abort+0x64/0xd8

While I didn't dig into why the process couldn't get unblocked in the
squashfs case, it seems fairly sane to assume some other process is
waiting on memory.

Now that you (hopefully) believe we have a problem, we need a
solution.  One sane solution is to eventually let the OOM killer kill
more than one process.  We don't want to kill more than one process
right away as per the dire warnings in the OOM killer code about
exhausting our memory reserves and ending up in a livelock.  While
imperfect, it seems sane to perhaps use a heuristic here where we wait
a bit of time (100 ms) for a process to die.  If it fails, we give up
and try another process.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were conflicts compared to the 4.4 version because of lots of
churn in the oom kill code.  Conceptually this is the same thing,
though.

Change-Id: I4099314abed36b8af0680ceb0eb6d85427ba4962
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465475
Tested-by: Dylan Reid <dgreid@chromium.org>
Reviewed-by: Dylan Reid <dgreid@chromium.org>

[modify] https://crrev.com/25d68b7fa97d3b331e4ac5e1ff0c8399574569bf/mm/oom_kill.c

Project Member

Comment 33 by bugdroid1@chromium.org, Apr 1 2017

Labels: merge-merged-chromeos-3.14
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/8b4a16f82833c8b9f8c3d9cf67898c38354c65c0

commit 8b4a16f82833c8b9f8c3d9cf67898c38354c65c0
Author: Johannes Weiner <hannes@cmpxchg.org>
Date: Sat Apr 01 05:33:25 2017

BACKPORT: mm: oom_kill: don't ignore oom score on exiting tasks

When the OOM killer scans tasks and encounters a PF_EXITING one, it
force-selects that task regardless of the score.  The problem is that if
that task got stuck waiting for some state the allocation site is
holding, the OOM reaper can not move on to the next best victim.

Frankly, I don't even know why we check for exiting tasks in the OOM
killer.  We've tried direct reclaim at least 15 times by the time we
decide the system is OOM, there was plenty of time to exit and free
memory; and a task might exit voluntarily right after we issue a kill.
This is testing pure noise.  Remove it.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were extra checks in 3.14, but the concept is still the same.

Change-Id: Ied8391fd2d8cda4d1728a28dbe6cfea31883b24e
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
(cherry picked from commit 6a618957ad17d8f4f4c7eeede752685374b1b176)
Reviewed-on: https://chromium-review.googlesource.com/465468
Reviewed-by: Dylan Reid <dgreid@chromium.org>

[modify] https://crrev.com/8b4a16f82833c8b9f8c3d9cf67898c38354c65c0/mm/oom_kill.c

Project Member

Comment 34 by bugdroid1@chromium.org, Apr 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/e792e36e5d689f3daf93a70741ca1d407deb4513

commit e792e36e5d689f3daf93a70741ca1d407deb4513
Author: Douglas Anderson <dianders@chromium.org>
Date: Sat Apr 01 05:33:27 2017

CHROMIUM: DROP: mm/oom_kill: Don't kill a subthread in our place if we still have mm

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

In the function find_lock_task_mm() the stated purpose is to find a
subthread in the case that the original task no longer has an "->mm".

Implied in the description is that we should pick the original process
"p" if it still has "->mm" and we should only pick a subthread if
needed.  However, that's not what happens with the current
implementation.  Let's change the code to explicitly make sure that we
return the original process "p" if it has a valid "->mm".

Why do we care?  A future patch will improve the OOM killer to
sometimes allow more than one OOM victim to exist at once.  If we do
that without making find_lock_task_mm() prefer the original process
then we can get into this loop:
1. Pick a process as a victim.
2. Don't kill the chosen process, but instead kill some other task
   that (perhaps) has already been previously marked as a victim.
3. Go back to step #1 and pick the same process as a victim (since it
   wasn't actually killed before).

Even without the future patch, though, it still makes sense to pick
the original process if possible.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Change-Id: I3b59bbff74bd5fed1fa7790dd9696e29030fba02
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465469

[modify] https://crrev.com/e792e36e5d689f3daf93a70741ca1d407deb4513/mm/oom_kill.c

Project Member

Comment 35 by bugdroid1@chromium.org, Apr 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/195db82069300f8440a507faa20a4649ad25c022

commit 195db82069300f8440a507faa20a4649ad25c022
Author: Douglas Anderson <dianders@chromium.org>
Date: Sat Apr 01 05:33:28 2017

CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

If the child of an OOM victim has a different "->mm" element than the
victim, we may choose to kill the child in place of the victim.  This
is sane because if we killed the victim the child would die anyway, so
why not see if we can save the parent?

...but, before we kill the child let's double-check that it's OK by
calling oom_scan_process_thread().

I'll admit that as the code stands right now this is a totally useless
thing to do.  The only time oom_scan_process_thread() will tell us not
to kill the child is if:

1. The child is already marked with MEMDIE.  However, that's not
   possible because the code (as it stands today) won't let more than
   one process have MEMDIE.
2. The child has no "mm".  ...but that shouldn't happen because the
   parent and child shouldn't share an "mm" if the child's "mm" is
   NULL.

Now that I've explained why this patch is stupid and useless, here's
why we want it anyway: in a future patch we'll make it (sometimes)
possible for more than one process to be marked with MEMDIE.  With
that in mind, point #1 above is moot and suddenly this patch has merit.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

NOTE: this file is slightly different than its 4.4 counterpart (where
the patch was originally written) due to different parameters in
oom_scan_process_thread().

Change-Id: I0ea9b5f4a03c1392d52d9d927bd395ed44d7b98c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465470

[modify] https://crrev.com/195db82069300f8440a507faa20a4649ad25c022/mm/oom_kill.c

Project Member

Comment 36 by bugdroid1@chromium.org, Apr 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/fc300ccd3b96963b018fbcf1bd90570b7eb512b9

commit fc300ccd3b96963b018fbcf1bd90570b7eb512b9
Author: Douglas Anderson <dianders@chromium.org>
Date: Sat Apr 01 05:33:29 2017

CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victims

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

It turns out that there's a pretty big flaw with the way the OOM
killer works.  If the chosen OOM kill victim doesn't die then the OOM
killer totally stops working.  Specifically once we've picked an OOM
victim we won't ever pick another until the first victim dies.  If
that victim refuses to die then we'll silently loop until eventually
some sort of watchdog reboots the system.

This is easy to see in toy code.  Just write a function that userspace
can invoke that grabs a mutex twice.  Now set the OOM score so that
this is the preferred process to kill.  The first invocation of the
OOM killer will try to kill this process but won't be able to because
the process will never get the mutex.  All subsequent OOM killer
invocations will be complete no-ops.

This is easy to argue about in non-toy code too.  Just imagine:
A) Memory gets low and pretty much everyone is starved.
B) Process 1000 has a mutex named "mutex" and is waiting to allocate
   memory.
C) Process 1001 wants "mutex".
D) We choose to OOM kill process 1001, maybe because it has an
   attractive looking OOM score.

OOM kill is the last resort of the memory manager.  That means it has
no other ways to get memory to give to process 1000 in the example
above and process 1000 will loop waiting forever, never releasing its
mutex.  Process 1001 will never get the mutex and thus can never die
to either release memory or allow an additional OOM kill.  Classic
deadlock.

This is seen in real life.  Specifically while stressing the OOM
killer on a Chromebook with Android processes running, the OOM killer
likes to kill Android processes.  Often these processes are sitting in
binder_thread_read().  When we try to kill them they wake up and try
to grab the binder mutex.  Often the binder mutex is held by another
process who is stuck trying to allocate memory.  Ugh.  We probably
want to fix Binder to make this specific case behave better, but we
can't fix every case.

This is seen in real life in non-binder code as well.  Specifically
I've seen the victim process sitting waiting like this:
 schedule+0x88/0xa8
 schedule_timeout+0x44/0x27c
 io_schedule_timeout+0x7c/0xbc
 bt_get+0x174/0x278
 blk_mq_get_tag+0x4c/0xd4
 __blk_mq_alloc_request+0x28/0x158
 blk_mq_map_request+0x340/0x39c
 blk_sq_make_request+0xb0/0x3b8
 generic_make_request+0xcc/0x174
 submit_bio+0xe4/0x1d4
 submit_bh_wbc.isra.35+0x124/0x138
 submit_bh+0x2c/0x38
 ll_rw_block+0xac/0xd8
 squashfs_read_data+0x354/0x510
 squashfs_readpage_block+0x2d8/0x490
 squashfs_readpage+0x48c/0x600
 __do_page_cache_readahead+0x1e4/0x248
 filemap_fault+0x14c/0x3b8
 __do_fault+0x78/0xf0
 handle_mm_fault+0x744/0x1050
 do_page_fault+0x140/0x2dc
 do_mem_abort+0x64/0xd8

While I didn't dig into why the process couldn't get unblocked in the
squashfs case, it seems fairly sane to assume some other process is
waiting on memory.

Now that you (hopefully) believe we have a problem, we need a
solution.  One sane solution is to eventually let the OOM killer kill
more than one process.  We don't want to kill more than one process
right away as per the dire warnings in the OOM killer code about
exhausting our memory reserves and ending up in a livelock.  While
imperfect, it seems sane to perhaps use a heuristic here where we wait
a bit of time (100 ms) for a process to die.  If it fails, we give up
and try another process.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were conflicts compared to the 4.4 version because of lots of
churn in the oom kill code.  Conceptually this is the same thing,
though.

Change-Id: I4099314abed36b8af0680ceb0eb6d85427ba4962
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465471
Reviewed-by: Dylan Reid <dgreid@chromium.org>

[modify] https://crrev.com/fc300ccd3b96963b018fbcf1bd90570b7eb512b9/mm/oom_kill.c

Project Member

Comment 37 by bugdroid1@chromium.org, Apr 2 2017

Labels: merge-merged-chromeos-3.10
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/b4b86c11b400271aaec13c9b56f50e0509d7aaec

commit b4b86c11b400271aaec13c9b56f50e0509d7aaec
Author: Johannes Weiner <hannes@cmpxchg.org>
Date: Sun Apr 02 04:52:26 2017

BACKPORT: mm: oom_kill: don't ignore oom score on exiting tasks

When the OOM killer scans tasks and encounters a PF_EXITING one, it
force-selects that task regardless of the score.  The problem is that if
that task got stuck waiting for some state the allocation site is
holding, the OOM reaper can not move on to the next best victim.

Frankly, I don't even know why we check for exiting tasks in the OOM
killer.  We've tried direct reclaim at least 15 times by the time we
decide the system is OOM, there was plenty of time to exit and free
memory; and a task might exit voluntarily right after we issue a kill.
This is testing pure noise.  Remove it.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were extra checks in 3.14, but the concept is still the same.

Change-Id: Ied8391fd2d8cda4d1728a28dbe6cfea31883b24e
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
(cherry picked from commit 6a618957ad17d8f4f4c7eeede752685374b1b176)
Reviewed-on: https://chromium-review.googlesource.com/465411
Reviewed-by: Guenter Roeck <groeck@chromium.org>

[modify] https://crrev.com/b4b86c11b400271aaec13c9b56f50e0509d7aaec/mm/oom_kill.c

Project Member

Comment 38 by bugdroid1@chromium.org, Apr 2 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/6e3f395bb36e20d04fb83ac945bab47410e12e86

commit 6e3f395bb36e20d04fb83ac945bab47410e12e86
Author: Douglas Anderson <dianders@chromium.org>
Date: Sun Apr 02 04:52:27 2017

CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

If the child of an OOM victim has a different "->mm" element than the
victim, we may choose to kill the child in place of the victim.  This
is sane because if we killed the victim the child would die anyway, so
why not see if we can save the parent?

...but, before we kill the child let's double-check that it's OK by
calling oom_scan_process_thread().

I'll admit that as the code stands right now this is a totally useless
thing to do.  The only time oom_scan_process_thread() will tell us not
to kill the child is if:

1. The child is already marked with MEMDIE.  However, that's not
   possible because the code (as it stands today) won't let more than
   one process have MEMDIE.
2. The child has no "mm".  ...but that shouldn't happen because the
   parent and child shouldn't share an "mm" if the child's "mm" is
   NULL.

Now that I've explained why this patch is stupid and useless, here's
why we want it anyway: in a future patch we'll make it (sometimes)
possible for more than one process to be marked with MEMDIE.  With
that in mind, point #1 above is moot and suddenly this patch has merit.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

NOTE: this file is slightly different than its 4.4 counterpart (where
the patch was originally written) due to different parameters in
oom_scan_process_thread().

Change-Id: I0ea9b5f4a03c1392d52d9d927bd395ed44d7b98c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465412
Reviewed-by: Guenter Roeck <groeck@chromium.org>

[modify] https://crrev.com/6e3f395bb36e20d04fb83ac945bab47410e12e86/mm/oom_kill.c

Project Member

Comment 39 by bugdroid1@chromium.org, Apr 2 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/aa071ca4b372e1e30701c56fe50823f34b3bc9f6

commit aa071ca4b372e1e30701c56fe50823f34b3bc9f6
Author: Douglas Anderson <dianders@chromium.org>
Date: Sun Apr 02 04:52:28 2017

CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victims

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

It turns out that there's a pretty big flaw with the way the OOM
killer works.  If the chosen OOM kill victim doesn't die then the OOM
killer totally stops working.  Specifically once we've picked an OOM
victim we won't ever pick another until the first victim dies.  If
that victim refuses to die then we'll silently loop until eventually
some sort of watchdog reboots the system.

This is easy to see in toy code.  Just write a function that userspace
can invoke that grabs a mutex twice.  Now set the OOM score so that
this is the preferred process to kill.  The first invocation of the
OOM killer will try to kill this process but won't be able to because
the process will never get the mutex.  All subsequent OOM killer
invocations will be complete no-ops.

This is easy to argue about in non-toy code too.  Just imagine:
A) Memory gets low and pretty much everyone is starved.
B) Process 1000 has a mutex named "mutex" and is waiting to allocate
   memory.
C) Process 1001 wants "mutex".
D) We choose to OOM kill process 1001, maybe because it has an
   attractive looking OOM score.

OOM kill is the last resort of the memory manager.  That means it has
no other ways to get memory to give to process 1000 in the example
above and process 1000 will loop waiting forever, never releasing its
mutex.  Process 1001 will never get the mutex and thus can never die
to either release memory or allow an additional OOM kill.  Classic
deadlock.

This is seen in real life.  Specifically while stressing the OOM
killer on a Chromebook with Android processes running, the OOM killer
likes to kill Android processes.  Often these processes are sitting in
binder_thread_read().  When we try to kill them they wake up and try
to grab the binder mutex.  Often the binder mutex is held by another
process who is stuck trying to allocate memory.  Ugh.  We probably
want to fix Binder to make this specific case behave better, but we
can't fix every case.

This is seen in real life in non-binder code as well.  Specifically
I've seen the victim process sitting waiting like this:
 schedule+0x88/0xa8
 schedule_timeout+0x44/0x27c
 io_schedule_timeout+0x7c/0xbc
 bt_get+0x174/0x278
 blk_mq_get_tag+0x4c/0xd4
 __blk_mq_alloc_request+0x28/0x158
 blk_mq_map_request+0x340/0x39c
 blk_sq_make_request+0xb0/0x3b8
 generic_make_request+0xcc/0x174
 submit_bio+0xe4/0x1d4
 submit_bh_wbc.isra.35+0x124/0x138
 submit_bh+0x2c/0x38
 ll_rw_block+0xac/0xd8
 squashfs_read_data+0x354/0x510
 squashfs_readpage_block+0x2d8/0x490
 squashfs_readpage+0x48c/0x600
 __do_page_cache_readahead+0x1e4/0x248
 filemap_fault+0x14c/0x3b8
 __do_fault+0x78/0xf0
 handle_mm_fault+0x744/0x1050
 do_page_fault+0x140/0x2dc
 do_mem_abort+0x64/0xd8

While I didn't dig into why the process couldn't get unblocked in the
squashfs case, it seems fairly sane to assume some other process is
waiting on memory.

Now that you (hopefully) believe we have a problem, we need a
solution.  One sane solution is to eventually let the OOM killer kill
more than one process.  We don't want to kill more than one process
right away as per the dire warnings in the OOM killer code about
exhausting our memory reserves and ending up in a livelock.  While
imperfect, it seems sane to perhaps use a heuristic here where we wait
a bit of time (100 ms) for a process to die.  If it fails, we give up
and try another process.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were conflicts compared to the 4.4 version because of lots of
churn in the oom kill code.  Conceptually this is the same thing,
though.

Change-Id: I4099314abed36b8af0680ceb0eb6d85427ba4962
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465413
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Reviewed-by: Dylan Reid <dgreid@chromium.org>

[modify] https://crrev.com/aa071ca4b372e1e30701c56fe50823f34b3bc9f6/mm/oom_kill.c

Labels: Merge-Request-58
Owner: bhthompson@chromium.org
Landed in all kernels in M59.

Bernie: I'd love to hear your thoughts on this for M58.

---

1. This is not a new issue and not a regression.  However, we're hitting it a lot more because:
1a) ARC++ triggers this much more easily and ARC++ usage is increasing on kernels 3.14, 3.18, and 4.4.
1b) Possibly our adjustments of OOM scores are making us trigger this more.
1c) Possibly a change to the browser and how it works with subprocesses is making us trigger this more (bug #649116)

2. Change landed across all our kernels between Friday and Saturday, so not much bake time yet.

3. I have done the most testing with kernel 4.4 and second most with 3.14.  I backported this to other kernels (and needed to make some small changes) and did light testing there.  Dylan did a bit of testing on 3.18.  I did a bit of testing on 3.8 (no ARC++ though).  I didn't test 3.10, but code is nearly the same as 3.8.

4. Change only affects the OOM killer.  Should be zero impact when memory isn't tight.  When memory is tight the new code will be run.

5. The change has been mostly tested synthetically.  It's a bit hard to quickly get your device into a low memory situation where the OOM killer runs during normal usage.

---

To me it seems like this would be good to get into R58, but I value other opinions.

---

It would be plausible to get QA to try to stress memory usage on at least one device per kernel version, ideally with ARC++ involved where possible.  Searching through "file:///var/log/messages" for "refused to die" is a good way to confirm that my code ran.

A few notes to QA:

A) When stressing memory, if you allocate slowly enough you're supposed to get memory freed nicely by Chrome.  I'm not sure how much this happens, but that's the idea.

B) To really stress things out, you need to do something on the web that allocates very quickly.  The best way I've found is Facebook infinite scrolling with autoplay videos.  Find a feed with lots of videos, make sure high-def autoplay is on, then start scrolling and see how many videos you can get playing at once.

C) Possibly you could find some Android apps that use lots of memory, but I'm not personally aware of them.

D) If my code is _working_, you _expect_ to see "tab killed".  AKA: if you see tab killed, that's a really, really good thing as far as this fix is concerned.  Said another way:

D-1. Ideally other parts of Chrome OS will notice that memory is low and shut things down cleanly.  The kernel OOM killer (what we're fixing here) is the code that runs when those other parts of Chrome OS fail to free memory fast enough.  It is known and expected that the kernel OOM killer will sometimes need to run since Chrome can't react as fast as the kernel can.

D-2. When the kernel OOM killer runs the side effect is "tab killed".  That's much better than the alternative of freezing or crashing the whole system.


E) What you really _don't_ want to see is a full system hang (1-2 minutes of no responsiveness followed by a reboot).  Actually, as long as you're not synthetically stressing the system w/ a command line tool (like balloon.arm), you don't want to see _any_ full reboots of the system.


F) Even if you can't manage to reproduce the "refused to die", hopefully you can at least see the OOM killer running.  Look for "OOM" in file:///var/log/messages
Labels: -Merge-Request-58 Merge-Approved-58
If you are confident in it, and nothing bad has happened so far with these on ToT, let's go for it for 58.

We should be doing another beta by Wednesday, so these should land before then if possible.  I'll defer to you on whether you want to see more action from ToT before merging.
Project Member

Comment 42 by bugdroid1@chromium.org, Apr 3 2017

Labels: merge-merged-release-R58-9334.B-chromeos-4.4
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/eed40556449ebee13af39c8308ac02bfb252cf40

commit eed40556449ebee13af39c8308ac02bfb252cf40
Author: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon Apr 03 23:28:10 2017

UPSTREAM: mm: oom_kill: don't ignore oom score on exiting tasks

When the OOM killer scans tasks and encounters a PF_EXITING one, it
force-selects that task regardless of the score.  The problem is that if
that task got stuck waiting for some state the allocation site is
holding, the OOM reaper can not move on to the next best victim.

Frankly, I don't even know why we check for exiting tasks in the OOM
killer.  We've tried direct reclaim at least 15 times by the time we
decide the system is OOM, there was plenty of time to exit and free
memory; and a task might exit voluntarily right after we issue a kill.
This is testing pure noise.  Remove it.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Change-Id: Ied8391fd2d8cda4d1728a28dbe6cfea31883b24e
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
(cherry picked from commit 6a618957ad17d8f4f4c7eeede752685374b1b176)
Reviewed-on: https://chromium-review.googlesource.com/465186
Reviewed-by: Dylan Reid <dgreid@chromium.org>
(cherry picked from commit 88bdca832d35cb1a28436ee41626da92c2f3251e)
Reviewed-on: https://chromium-review.googlesource.com/466559

[modify] https://crrev.com/eed40556449ebee13af39c8308ac02bfb252cf40/mm/oom_kill.c

Project Member

Comment 43 by bugdroid1@chromium.org, Apr 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/3d3748ee6c0a5cac059e1fa65adf10c85870c890

commit 3d3748ee6c0a5cac059e1fa65adf10c85870c890
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 03 23:28:29 2017

CHROMIUM: DROP: mm/oom_kill: Don't kill a subthread in our place if we still have mm

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

In the function find_lock_task_mm() the stated purpose is to find a
subthread in the case that the original task no longer has an "->mm".

Implied in the description is that we should pick the original process
"p" if it still has "->mm" and we should only pick a subthread if
needed.  However, that's not what happens with the current
implementation.  Let's change the code to explicitly make sure that we
return the original process "p" if it has a valid "->mm".

Why do we care?  A future patch will improve the OOM killer to
sometimes allow more than one OOM victim to exist at once.  If we do
that without making find_lock_task_mm() prefer the original process
then we can get into this loop:
1. Pick a process as a victim.
2. Don't kill the chosen process, but instead kill some other task
   that (perhaps) has already been previously marked as a victim.
3. Go back to step #1 and pick the same process as a victim (since it
   wasn't actually killed before).

Even without the future patch, though, it still makes sense to pick
the original process if possible.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Change-Id: I3b59bbff74bd5fed1fa7790dd9696e29030fba02
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465187
(cherry picked from commit 222d652a0495a51be1d799772ec669e169ca9496)
Reviewed-on: https://chromium-review.googlesource.com/466560

[modify] https://crrev.com/3d3748ee6c0a5cac059e1fa65adf10c85870c890/mm/oom_kill.c

Project Member

Comment 44 by bugdroid1@chromium.org, Apr 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/b4253e4b60fb6903693aa137bf2f2334970e03df

commit b4253e4b60fb6903693aa137bf2f2334970e03df
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 03 23:28:57 2017

CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

If the child of an OOM victim has a different "->mm" element than the
victim, we may choose to kill the child in place of the victim.  This
is sane because if we killed the victim the child would die anyway, so
why not see if we can save the parent?

...but, before we kill the child let's double-check that it's OK by
calling oom_scan_process_thread().

I'll admit that as the code stands right now this is a totally useless
thing to do.  The only time oom_scan_process_thread() will tell us not
to kill the child is if:

1. The child is already marked with MEMDIE.  However, that's not
   possible because the code (as it stands today) won't let more than
   one process have MEMDIE.
2. The child has no "mm".  ...but that shouldn't happen because the
   parent and child shouldn't share an "mm" if the child's "mm" is
   NULL.

Now that I've explained why this patch is stupid and useless, here's
why we want it anyway: in a future patch we'll make it (sometimes)
possible for more than one process to be marked with MEMDIE.  With
that in mind, point #1 above is moot and suddenly this patch has merit.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Change-Id: I0ea9b5f4a03c1392d52d9d927bd395ed44d7b98c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465188
(cherry picked from commit d7cd7d859ad92dac0f9f4511a2bc1cf2208e8cbe)
Reviewed-on: https://chromium-review.googlesource.com/466561

[modify] https://crrev.com/b4253e4b60fb6903693aa137bf2f2334970e03df/mm/oom_kill.c

Project Member

Comment 45 by bugdroid1@chromium.org, Apr 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/7271dacb3f1f17aa698a9831c941bcb7e4b4a8a2

commit 7271dacb3f1f17aa698a9831c941bcb7e4b4a8a2
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 03 23:29:25 2017

CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victims

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

It turns out that there's a pretty big flaw with the way the OOM
killer works.  If the chosen OOM kill victim doesn't die then the OOM
killer totally stops working.  Specifically once we've picked an OOM
victim we won't ever pick another until the first victim dies.  If
that victim refuses to die then we'll silently loop until eventually
some sort of watchdog reboots the system.

This is easy to see in toy code.  Just write a function that userspace
can invoke that grabs a mutex twice.  Now set the OOM score so that
this is the preferred process to kill.  The first invocation of the
OOM killer will try to kill this process but won't be able to because
the process will never get the mutex.  All subsequent OOM killer
invocations will be complete no-ops.

This is easy to argue about in non-toy code too.  Just imagine:
A) Memory gets low and pretty much everyone is starved.
B) Process 1000 has a mutex named "mutex" and is waiting to allocate
   memory.
C) Process 1001 wants "mutex".
D) We choose to OOM kill process 1001, maybe because it has an
   attractive looking OOM score.

OOM kill is the last resort of the memory manager.  That means it has
no other ways to get memory to give to process 1000 in the example
above and process 1000 will loop waiting forever, never releasing its
mutex.  Process 1001 will never get the mutex and thus can never die
to either release memory or allow an additional OOM kill.  Classic
deadlock.

This is seen in real life.  Specifically while stressing the OOM
killer on a Chromebook with Android processes running, the OOM killer
likes to kill Android processes.  Often these processes are sitting in
binder_thread_read().  When we try to kill them they wake up and try
to grab the binder mutex.  Often the binder mutex is held by another
process who is stuck trying to allocate memory.  Ugh.  We probably
want to fix Binder to make this specific case behave better, but we
can't fix every case.

This is seen in real life in non-binder code as well.  Specifically
I've seen the victim process sitting waiting like this:
 schedule+0x88/0xa8
 schedule_timeout+0x44/0x27c
 io_schedule_timeout+0x7c/0xbc
 bt_get+0x174/0x278
 blk_mq_get_tag+0x4c/0xd4
 __blk_mq_alloc_request+0x28/0x158
 blk_mq_map_request+0x340/0x39c
 blk_sq_make_request+0xb0/0x3b8
 generic_make_request+0xcc/0x174
 submit_bio+0xe4/0x1d4
 submit_bh_wbc.isra.35+0x124/0x138
 submit_bh+0x2c/0x38
 ll_rw_block+0xac/0xd8
 squashfs_read_data+0x354/0x510
 squashfs_readpage_block+0x2d8/0x490
 squashfs_readpage+0x48c/0x600
 __do_page_cache_readahead+0x1e4/0x248
 filemap_fault+0x14c/0x3b8
 __do_fault+0x78/0xf0
 handle_mm_fault+0x744/0x1050
 do_page_fault+0x140/0x2dc
 do_mem_abort+0x64/0xd8

While I didn't dig into why the process couldn't get unblocked in the
squashfs case, it seems fairly sane to assume some other process is
waiting on memory.

Now that you (hopefully) believe we have a problem, we need a
solution.  One sane solution is to eventually let the OOM killer kill
more than one process.  We don't want to kill more than one process
right away as per the dire warnings in the OOM killer code about
exhausting our memory reserves and ending up in a livelock.  While
imperfect, it seems sane to perhaps use a heuristic here where we wait
a bit of time (100 ms) for a process to die.  If it fails, we give up
and try another process.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Change-Id: I4099314abed36b8af0680ceb0eb6d85427ba4962
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465189
Reviewed-by: Dylan Reid <dgreid@chromium.org>
(cherry picked from commit ed758deb4a6ef27d56e95da9d1a73be0cbd09ac5)
Reviewed-on: https://chromium-review.googlesource.com/466562

[modify] https://crrev.com/7271dacb3f1f17aa698a9831c941bcb7e4b4a8a2/mm/oom_kill.c

Project Member

Comment 46 by bugdroid1@chromium.org, Apr 3 2017

Labels: merge-merged-release-R58-9334.B-chromeos-3.18
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/d40d12a0baa2776d91908c74f37a0d1c46eecb37

commit d40d12a0baa2776d91908c74f37a0d1c46eecb37
Author: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon Apr 03 23:30:15 2017

BACKPORT: mm: oom_kill: don't ignore oom score on exiting tasks

When the OOM killer scans tasks and encounters a PF_EXITING one, it
force-selects that task regardless of the score.  The problem is that if
that task got stuck waiting for some state the allocation site is
holding, the OOM reaper can not move on to the next best victim.

Frankly, I don't even know why we check for exiting tasks in the OOM
killer.  We've tried direct reclaim at least 15 times by the time we
decide the system is OOM, there was plenty of time to exit and free
memory; and a task might exit voluntarily right after we issue a kill.
This is testing pure noise.  Remove it.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were extra checks in 3.14, but the concept is still the same.

Change-Id: Ied8391fd2d8cda4d1728a28dbe6cfea31883b24e
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
(cherry picked from commit 6a618957ad17d8f4f4c7eeede752685374b1b176)
Reviewed-on: https://chromium-review.googlesource.com/465472
Reviewed-by: Dylan Reid <dgreid@chromium.org>
(cherry picked from commit 60728a964498e1200265b94d414cc169b1cdb840)
Reviewed-on: https://chromium-review.googlesource.com/466563

[modify] https://crrev.com/d40d12a0baa2776d91908c74f37a0d1c46eecb37/mm/oom_kill.c

Project Member

Comment 47 by bugdroid1@chromium.org, Apr 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/bb3ab0662a46da9b135df2917f80da5b1e47c9ee

commit bb3ab0662a46da9b135df2917f80da5b1e47c9ee
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 03 23:30:38 2017

CHROMIUM: DROP: mm/oom_kill: Don't kill a subthread in our place if we still have mm

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined to the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

In the function find_lock_task_mm() the stated purpose is to find a
subthread in the case that the original task no longer has an "->mm".

Implied in the description is that we should pick the original process
"p" if it still has "->mm" and we should only pick a subthread if
needed.  However, that's not what happens with the current
implementation.  Let's change the code to explicitly make sure that we
return the original process "p" if it has a valid "->mm".

Why do we care?  A future patch will improve the OOM killer to
sometimes allow more than one OOM victim to exist at once.  If we do
that without making find_lock_task_mm() prefer the original process
then we can get into this loop:
1. Pick a process as a victim.
2. Don't kill the chosen process, but instead kill some other task
   that (perhaps) has already been previously marked as a victim.
3. Go back to step #1 and pick the same process as a victim (since it
   wasn't actually killed before).

Even without the future patch, though, it still makes sense to pick
the original process if possible.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Change-Id: I3b59bbff74bd5fed1fa7790dd9696e29030fba02
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465473
(cherry picked from commit b74ed4c3bfcdf15339adf67e28e2b2525d8270c7)
Reviewed-on: https://chromium-review.googlesource.com/466564

[modify] https://crrev.com/bb3ab0662a46da9b135df2917f80da5b1e47c9ee/mm/oom_kill.c
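
To make the rule concrete, here is a self-contained userspace sketch of the selection order the patch argues for: prefer the original task "p" whenever it still owns an mm, and only fall back to a subthread otherwise. The struct and helper names are invented for illustration and this is not the kernel's find_lock_task_mm():

  /* Illustration only: a toy model of "prefer p if it still has ->mm". */
  #include <stddef.h>
  #include <stdio.h>

  struct fake_task {
      const char *name;
      void *mm;                 /* NULL means this task has lost its mm */
      struct fake_task *next;   /* next subthread in the thread group */
  };

  static struct fake_task *pick_task_with_mm(struct fake_task *p)
  {
      if (p->mm)
          return p;             /* always prefer the original process */
      for (struct fake_task *t = p->next; t; t = t->next)
          if (t->mm)
              return t;         /* otherwise any subthread with an mm */
      return NULL;
  }

  int main(void)
  {
      int mm;
      struct fake_task sub    = { "subthread", &mm, NULL };
      struct fake_task leader = { "leader",    &mm, &sub };

      printf("chosen: %s\n", pick_task_with_mm(&leader)->name);  /* leader */
      leader.mm = NULL;
      printf("chosen: %s\n", pick_task_with_mm(&leader)->name);  /* subthread */
      return 0;
  }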

Project Member

Comment 48 by bugdroid1@chromium.org, Apr 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/f70e18011ebfe8f16d22b86ba34dea84aa1c3ef9

commit f70e18011ebfe8f16d22b86ba34dea84aa1c3ef9
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 03 23:31:20 2017

CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined with the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

If the child of an OOM victim has a different "->mm" element than the
victim, we may choose to kill the child in place of the victim.  This
is sane because if we killed the victim the child would die anyway, so
why not see if we can save the parent?

...but, before we kill the child let's double-check that it's OK by
calling oom_scan_process_thread().

I'll admit that as the code stands right now this is a totally useless
thing to do.  The only time oom_scan_process_thread() will tell us not
to kill the child is if:

1. The child is already marked with MEMDIE.  However, that's not
   possible because the code (as it stands today) won't let more than
   one process have MEMDIE.
2. The child has no "mm".  ...but that shouldn't happen because the
   parent and child shouldn't share an "mm" if the child's "mm" is
   NULL.

Now that I've explained why this patch is stupid and useless, here's
why we want it anyway: in a future patch we'll make it (sometimes)
possible for more than one process to be marked with MEMDIE.  With
that in mind, point #1 above is moot and suddenly this patch has merit.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

NOTE: this file is slightly different than its 4.4 counterpart (where
the patch was originally written) due to different parameters in
oom_scan_process_thread().

Change-Id: I0ea9b5f4a03c1392d52d9d927bd395ed44d7b98c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465474
(cherry picked from commit e8278e9f12d0f82e39c46452420cf2e74cb5a228)
Reviewed-on: https://chromium-review.googlesource.com/466565

[modify] https://crrev.com/f70e18011ebfe8f16d22b86ba34dea84aa1c3ef9/mm/oom_kill.c
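
The double-check itself is easy to model outside the kernel. Below is a hedged userspace sketch of the idea: before substituting a child for the chosen victim, re-run the same eligibility test on the child. The field and function names are invented stand-ins for the MEMDIE mark and for oom_scan_process_thread():

  /* Illustration only: double-check a substitute child before killing it. */
  #include <stdbool.h>
  #include <stdio.h>

  struct fake_task {
      const char *name;
      bool marked_victim;   /* stand-in for the MEMDIE mark */
      bool has_mm;          /* stand-in for a non-NULL ->mm */
  };

  static bool ok_to_kill(const struct fake_task *t)
  {
      return !t->marked_victim && t->has_mm;
  }

  static const struct fake_task *choose(const struct fake_task *victim,
                                        const struct fake_task *child)
  {
      if (ok_to_kill(child))
          return child;     /* save the parent if we safely can */
      return victim;        /* otherwise fall back to the original victim */
  }

  int main(void)
  {
      struct fake_task parent = { "parent", false, true };
      struct fake_task child  = { "child",  true,  true };  /* already dying */

      printf("kill: %s\n", choose(&parent, &child)->name);   /* parent */
      child.marked_victim = false;
      printf("kill: %s\n", choose(&parent, &child)->name);   /* child */
      return 0;
  }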

Project Member

Comment 49 by bugdroid1@chromium.org, Apr 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/2014267785d257ebc84d3f9da30ed0c5df41e2b0

commit 2014267785d257ebc84d3f9da30ed0c5df41e2b0
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 03 23:31:46 2017

CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victims

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined with the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

It turns out that there's a pretty big flaw with the way the OOM
killer works.  If the chosen OOM kill victim doesn't die then the OOM
killer totally stops working.  Specifically once we've picked an OOM
victim we won't ever pick another until the first victim dies.  If
that victim refuses to die then we'll silently loop until eventually
some sort of watchdog reboots the system.

This is easy to see in toy code.  Just write a function that userspace
can invoke that grabs a mutex twice.  Now set the OOM score so that
this is the preferred process to kill.  The first invocation of the
OOM killer will try to kill this process but won't be able to because
the process will never get the mutex.  All subsequent OOM killer
invocations will be complete no-ops.

This is easy to argue about in non-toy code too.  Just imagine:
A) Memory gets low and pretty much everyone is starved.
B) Process 1000 has a mutex named "mutex" and is waiting to allocate
   memory.
C) Process 1001 wants "mutex".
D) We choose to OOM kill process 1001, maybe because it has an
   attractive looking OOM score.

OOM kill is the last resort of the memory manager.  That means it has
no other ways to get memory to give to process 1000 in the example
above and process 1000 will loop waiting forever, never releasing its
mutex.  Process 1001 will never get the mutex and thus can never die
to either release memory or allow an additional OOM kill.  Classic
deadlock.

This is seen in real life.  Specifically while stressing the OOM
killer on a Chromebook with Android processes running, the OOM killer
likes to kill Android processes.  Often these processes are sitting in
binder_thread_read().  When we try to kill them they wake up and try
to grab the binder mutex.  Often the binder mutex is held by another
process that is stuck trying to allocate memory.  Ugh.  We probably
want to fix Binder to make this specific case behave better, but we
can't fix every case.

This is seen in real life in non-binder code as well.  Specifically
I've seen the victim process sitting waiting like this:
 schedule+0x88/0xa8
 schedule_timeout+0x44/0x27c
 io_schedule_timeout+0x7c/0xbc
 bt_get+0x174/0x278
 blk_mq_get_tag+0x4c/0xd4
 __blk_mq_alloc_request+0x28/0x158
 blk_mq_map_request+0x340/0x39c
 blk_sq_make_request+0xb0/0x3b8
 generic_make_request+0xcc/0x174
 submit_bio+0xe4/0x1d4
 submit_bh_wbc.isra.35+0x124/0x138
 submit_bh+0x2c/0x38
 ll_rw_block+0xac/0xd8
 squashfs_read_data+0x354/0x510
 squashfs_readpage_block+0x2d8/0x490
 squashfs_readpage+0x48c/0x600
 __do_page_cache_readahead+0x1e4/0x248
 filemap_fault+0x14c/0x3b8
 __do_fault+0x78/0xf0
 handle_mm_fault+0x744/0x1050
 do_page_fault+0x140/0x2dc
 do_mem_abort+0x64/0xd8

While I didn't dig into why the process couldn't get unblocked in the
squashfs case, it seems fairly sane to assume some other process is
waiting on memory.

Now that you (hopefully) believe we have a problem, we need a
solution.  One sane solution is to eventually let the OOM killer kill
more than one process.  We don't want to kill more than one process
right away as per the dire warnings in the OOM killer code about
exhausting our memory reserves and ending up in a livelock.  While
imperfect, it seems sane to perhaps use a heuristic here where we wait
a bit of time (100 ms) for a process to die.  If it fails, we give up
and try another process.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were conflicts compared to the 4.4 version because of lots of
churn in the oom kill code.  Conceptually this is the same thing,
though.

Change-Id: I4099314abed36b8af0680ceb0eb6d85427ba4962
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465475
Tested-by: Dylan Reid <dgreid@chromium.org>
Reviewed-by: Dylan Reid <dgreid@chromium.org>
(cherry picked from commit 25d68b7fa97d3b331e4ac5e1ff0c8399574569bf)
Reviewed-on: https://chromium-review.googlesource.com/466846

[modify] https://crrev.com/2014267785d257ebc84d3f9da30ed0c5df41e2b0/mm/oom_kill.c
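
For reference when reproducing this, the memory-pressure half of the toy setup can be driven from userspace. The helper below is a minimal sketch (not part of the patch): it marks itself as the preferred OOM victim via /proc/self/oom_score_adj and then dirties memory until the OOM killer fires. The "victim that cannot die" half requires the kernel-side mutex described above and is not modeled here:

  /* Illustration only: become the preferred OOM victim, then eat memory. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      /* oom_score_adj ranges from -1000 to 1000; 1000 is "kill me first". */
      int fd = open("/proc/self/oom_score_adj", O_WRONLY);
      if (fd < 0 || write(fd, "1000", 4) != 4) {
          perror("oom_score_adj");
          return 1;
      }
      close(fd);

      for (;;) {
          size_t chunk = 64 << 20;            /* 64 MiB at a time */
          char *p = malloc(chunk);
          if (!p) {                           /* keep pressure on reclaim */
              usleep(100 * 1000);
              continue;
          }
          memset(p, 0xa5, chunk);             /* touch every page */
      }
  }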

Project Member

Comment 50 by bugdroid1@chromium.org, Apr 3 2017

Labels: merge-merged-release-R58-9334.B-chromeos-3.14
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/facb2dedf039748d0c736e4b00893e2923d83179

commit facb2dedf039748d0c736e4b00893e2923d83179
Author: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon Apr 03 23:33:04 2017

BACKPORT: mm: oom_kill: don't ignore oom score on exiting tasks

When the OOM killer scans tasks and encounters a PF_EXITING one, it
force-selects that task regardless of the score.  The problem is that if
that task got stuck waiting for some state the allocation site is
holding, the OOM reaper can not move on to the next best victim.

Frankly, I don't even know why we check for exiting tasks in the OOM
killer.  We've tried direct reclaim at least 15 times by the time we
decide the system is OOM, there was plenty of time to exit and free
memory; and a task might exit voluntarily right after we issue a kill.
This is testing pure noise.  Remove it.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were extra checks in 3.14, but the concept is still the same.

Change-Id: Ied8391fd2d8cda4d1728a28dbe6cfea31883b24e
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
(cherry picked from commit 6a618957ad17d8f4f4c7eeede752685374b1b176)
Reviewed-on: https://chromium-review.googlesource.com/465468
Reviewed-by: Dylan Reid <dgreid@chromium.org>
(cherry picked from commit 8b4a16f82833c8b9f8c3d9cf67898c38354c65c0)
Reviewed-on: https://chromium-review.googlesource.com/466847

[modify] https://crrev.com/facb2dedf039748d0c736e4b00893e2923d83179/mm/oom_kill.c

Project Member

Comment 51 by bugdroid1@chromium.org, Apr 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/a0a5e011f72a347ab78e1ac2381830d91b50f6cd

commit a0a5e011f72a347ab78e1ac2381830d91b50f6cd
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 03 23:33:29 2017

CHROMIUM: DROP: mm/oom_kill: Don't kill a subthread in our place if we still have mm

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined with the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

In the function find_lock_task_mm() the stated purpose is to find a
subthread in the case that the original task no longer has an "->mm".

Implied in the description is that we should pick the original process
"p" if it still has "->mm" and we should only pick a subthread if
needed.  However, that's not what happens with the current
implementation.  Let's change the code to explicitly make sure that we
return the original process "p" if it has a valid "->mm".

Why do we care?  A future patch will improve the OOM killer to
sometimes allow more than one OOM victim to exist at once.  If we do
that without making find_lock_task_mm() prefer the original process
then we can get into this loop:
1. Pick a process as a victim.
2. Don't kill the chosen process, but instead kill some other task
   that (perhaps) has already been previously marked as a victim.
3. Go back to step #1 and pick the same process as a victim (since it
   wasn't actually killed before).

Even without the future patch, though, it still makes sense to pick
the original process if possible.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Change-Id: I3b59bbff74bd5fed1fa7790dd9696e29030fba02
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465469
(cherry picked from commit e792e36e5d689f3daf93a70741ca1d407deb4513)
Reviewed-on: https://chromium-review.googlesource.com/466848

[modify] https://crrev.com/a0a5e011f72a347ab78e1ac2381830d91b50f6cd/mm/oom_kill.c

Project Member

Comment 52 by bugdroid1@chromium.org, Apr 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/b3ab459eb659ad748cdfd9b8f5074eed42c8b535

commit b3ab459eb659ad748cdfd9b8f5074eed42c8b535
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 03 23:33:52 2017

CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined with the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

If the child of an OOM victim has a different "->mm" element than the
victim, we may choose to kill the child in place of the victim.  This
is sane because if we killed the victim the child would die anyway, so
why not see if we can save the parent?

...but, before we kill the child let's double-check that it's OK by
calling oom_scan_process_thread().

I'll admit that as the code stands right now this is a totally useless
thing to do.  The only time oom_scan_process_thread() will tell us not
to kill the child is if:

1. The child is already marked with MEMDIE.  However, that's not
   possible because the code (as it stands today) won't let more than
   one process have MEMDIE.
2. The child has no "mm".  ...but that shouldn't happen because the
   parent and child shouldn't share an "mm" if the child's "mm" is
   NULL.

Now that I've explained why this patch is stupid and useless, here's
why we want it anyway: in a future patch we'll make it (sometimes)
possible for more than one process to be marked with MEMDIE.  With
that in mind, point #1 above is moot and suddenly this patch has merit.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

NOTE: this file is slightly different than its 4.4 counterpart (where
the patch was originally written) due to different parameters in
oom_scan_process_thread().

Change-Id: I0ea9b5f4a03c1392d52d9d927bd395ed44d7b98c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465470
(cherry picked from commit 195db82069300f8440a507faa20a4649ad25c022)
Reviewed-on: https://chromium-review.googlesource.com/466849

[modify] https://crrev.com/b3ab459eb659ad748cdfd9b8f5074eed42c8b535/mm/oom_kill.c

Project Member

Comment 54 by bugdroid1@chromium.org, Apr 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/54435bd3425984b4ea5e6d8eaa8cbc59055b96e7

commit 54435bd3425984b4ea5e6d8eaa8cbc59055b96e7
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 03 23:34:18 2017

CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victims

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined with the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

It turns out that there's a pretty big flaw with the way the OOM
killer works.  If the chosen OOM kill victim doesn't die then the OOM
killer totally stops working.  Specifically once we've picked an OOM
victim we won't ever pick another until the first victim dies.  If
that victim refuses to die then we'll silently loop until eventually
some sort of watchdog reboots the system.

This is easy to see in toy code.  Just write a function that userspace
can invoke that grabs a mutex twice.  Now set the OOM score so that
this is the preferred process to kill.  The first invocation of the
OOM killer will try to kill this process but won't be able to because
the process will never get the mutex.  All subsequent OOM killer
invocations will be complete no-ops.

This is easy to argue about in non-toy code too.  Just imagine:
A) Memory gets low and pretty much everyone is starved.
B) Process 1000 has a mutex named "mutex" and is waiting to allocate
   memory.
C) Process 1001 wants "mutex".
D) We choose to OOM kill process 1001, maybe because it has an
   attractive looking OOM score.

OOM kill is the last resort of the memory manager.  That means it has
no other ways to get memory to give to process 1000 in the example
above and process 1000 will loop waiting forever, never releasing its
mutex.  Process 1001 will never get the mutex and thus can never die
to either release memory or allow an additional OOM kill.  Classic
deadlock.

This is seen in real life.  Specifically while stressing the OOM
killer on a Chromebook with Android processes running, the OOM killer
likes to kill Android processes.  Often these processes are sitting in
binder_thread_read().  When we try to kill them they wake up and try
to grab the binder mutex.  Often the binder mutex is held by another
process that is stuck trying to allocate memory.  Ugh.  We probably
want to fix Binder to make this specific case behave better, but we
can't fix every case.

This is seen in real life in non-binder code as well.  Specifically
I've seen the victim process sitting waiting like this:
 schedule+0x88/0xa8
 schedule_timeout+0x44/0x27c
 io_schedule_timeout+0x7c/0xbc
 bt_get+0x174/0x278
 blk_mq_get_tag+0x4c/0xd4
 __blk_mq_alloc_request+0x28/0x158
 blk_mq_map_request+0x340/0x39c
 blk_sq_make_request+0xb0/0x3b8
 generic_make_request+0xcc/0x174
 submit_bio+0xe4/0x1d4
 submit_bh_wbc.isra.35+0x124/0x138
 submit_bh+0x2c/0x38
 ll_rw_block+0xac/0xd8
 squashfs_read_data+0x354/0x510
 squashfs_readpage_block+0x2d8/0x490
 squashfs_readpage+0x48c/0x600
 __do_page_cache_readahead+0x1e4/0x248
 filemap_fault+0x14c/0x3b8
 __do_fault+0x78/0xf0
 handle_mm_fault+0x744/0x1050
 do_page_fault+0x140/0x2dc
 do_mem_abort+0x64/0xd8

While I didn't dig into why the process couldn't get unblocked in the
squashfs case, it seems fairly sane to assume some other process is
waiting on memory.

Now that you (hopefully) believe we have a problem, we need a
solution.  One sane solution is to eventually let the OOM killer kill
more than one process.  We don't want to kill more than one process
right away as per the dire warnings in the OOM killer code about
exhausting our memory reserves and ending up in a livelock.  While
imperfect, it seems sane to perhaps use a heuristic here where we wait
a bit of time (100 ms) for a process to die.  If it fails, we give up
and try another process.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were conflicts compared to the 4.4 version because of lots of
churn in the oom kill code.  Conceptually this is the same thing,
though.

Change-Id: I4099314abed36b8af0680ceb0eb6d85427ba4962
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465471
Reviewed-by: Dylan Reid <dgreid@chromium.org>
(cherry picked from commit fc300ccd3b96963b018fbcf1bd90570b7eb512b9)
Reviewed-on: https://chromium-review.googlesource.com/466850

[modify] https://crrev.com/54435bd3425984b4ea5e6d8eaa8cbc59055b96e7/mm/oom_kill.c

Project Member

Comment 55 by bugdroid1@chromium.org, Apr 3 2017

Labels: merge-merged-release-R58-9334.B-chromeos-3.10
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/392aa7079e1b6ac13458ee46a01b9676e221a146

commit 392aa7079e1b6ac13458ee46a01b9676e221a146
Author: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon Apr 03 23:35:05 2017

BACKPORT: mm: oom_kill: don't ignore oom score on exiting tasks

When the OOM killer scans tasks and encounters a PF_EXITING one, it
force-selects that task regardless of the score.  The problem is that if
that task got stuck waiting for some state the allocation site is
holding, the OOM reaper can not move on to the next best victim.

Frankly, I don't even know why we check for exiting tasks in the OOM
killer.  We've tried direct reclaim at least 15 times by the time we
decide the system is OOM, there was plenty of time to exit and free
memory; and a task might exit voluntarily right after we issue a kill.
This is testing pure noise.  Remove it.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were extra checks in 3.14, but the concept is still the same.

Change-Id: Ied8391fd2d8cda4d1728a28dbe6cfea31883b24e
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
(cherry picked from commit 6a618957ad17d8f4f4c7eeede752685374b1b176)
Reviewed-on: https://chromium-review.googlesource.com/465411
Reviewed-by: Guenter Roeck <groeck@chromium.org>
(cherry picked from commit b4b86c11b400271aaec13c9b56f50e0509d7aaec)
Reviewed-on: https://chromium-review.googlesource.com/466851

[modify] https://crrev.com/392aa7079e1b6ac13458ee46a01b9676e221a146/mm/oom_kill.c

Project Member

Comment 56 by bugdroid1@chromium.org, Apr 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/9d5640372c747c33e57b8e6a9172854334921c61

commit 9d5640372c747c33e57b8e6a9172854334921c61
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 03 23:35:32 2017

CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined with the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

If the child of an OOM victim has a different "->mm" element than the
victim, we may choose to kill the child in place of the victim.  This
is sane because if we killed the victim the child would die anyway, so
why not see if we can save the parent?

...but, before we kill the child let's double-check that it's OK by
calling oom_scan_process_thread().

I'll admit that as the code stands right now this is a totally useless
thing to do.  The only time oom_scan_process_thread() will tell us not
to kill the child is if:

1. The child is already marked with MEMDIE.  However, that's not
   possible because the code (as it stands today) won't let more than
   one process have MEMDIE.
2. The child has no "mm".  ...but that shouldn't happen because the
   parent and child shouldn't share an "mm" if the child's "mm" is
   NULL.

Now that I've explained why this patch is stupid and useless, here's
why we want it anyway: in a future patch we'll make it (sometimes)
possible for more than one process to be marked with MEMDIE.  With
that in mind, point #1 above is moot and suddenly this patch has merit.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

NOTE: this file is slightly different than its 4.4 counterpart (where
the patch was originally written) due to different parameters in
oom_scan_process_thread().

Change-Id: I0ea9b5f4a03c1392d52d9d927bd395ed44d7b98c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465412
Reviewed-by: Guenter Roeck <groeck@chromium.org>
(cherry picked from commit 6e3f395bb36e20d04fb83ac945bab47410e12e86)
Reviewed-on: https://chromium-review.googlesource.com/466852

[modify] https://crrev.com/9d5640372c747c33e57b8e6a9172854334921c61/mm/oom_kill.c

Project Member

Comment 57 by bugdroid1@chromium.org, Apr 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/58aa31ac416443efae4ff24085a55e143891c64d

commit 58aa31ac416443efae4ff24085a55e143891c64d
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 03 23:35:52 2017

CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victims

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined with the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

It turns out that there's a pretty big flaw with the way the OOM
killer works.  If the chosen OOM kill victim doesn't die then the OOM
killer totally stops working.  Specifically once we've picked an OOM
victim we won't ever pick another until the first victim dies.  If
that victim refuses to die then we'll silently loop until eventually
some sort of watchdog reboots the system.

This is easy to see in toy code.  Just write a function that userspace
can invoke that grabs a mutex twice.  Now set the OOM score so that
this is the preferred process to kill.  The first invocation of the
OOM killer will try to kill this process but won't be able to because
the process will never get the mutex.  All subsequent OOM killer
invocations will be complete no-ops.

This is easy to argue about in non-toy code too.  Just imagine:
A) Memory gets low and pretty much everyone is starved.
B) Process 1000 has a mutex named "mutex" and is waiting to allocate
   memory.
C) Process 1001 wants "mutex".
D) We choose to OOM kill process 1001, maybe because it has an
   attractive looking OOM score.

OOM kill is the last resort of the memory manager.  That means it has
no other ways to get memory to give to process 1000 in the example
above and process 1000 will loop waiting forever, never releasing its
mutex.  Process 1001 will never get the mutex and thus can never die
to either release memory or allow an additional OOM kill.  Classic
deadlock.

This is seen in real life.  Specifically while stressing the OOM
killer on a Chromebook with Android processes running, the OOM killer
likes to kill Android processes.  Often these processes are sitting in
binder_thread_read().  When we try to kill them they wake up and try
to grab the binder mutex.  Often the binder mutex is held by another
process that is stuck trying to allocate memory.  Ugh.  We probably
want to fix Binder to make this specific case behave better, but we
can't fix every case.

This is seen in real life in non-binder code as well.  Specifically
I've seen the victim process sitting waiting like this:
 schedule+0x88/0xa8
 schedule_timeout+0x44/0x27c
 io_schedule_timeout+0x7c/0xbc
 bt_get+0x174/0x278
 blk_mq_get_tag+0x4c/0xd4
 __blk_mq_alloc_request+0x28/0x158
 blk_mq_map_request+0x340/0x39c
 blk_sq_make_request+0xb0/0x3b8
 generic_make_request+0xcc/0x174
 submit_bio+0xe4/0x1d4
 submit_bh_wbc.isra.35+0x124/0x138
 submit_bh+0x2c/0x38
 ll_rw_block+0xac/0xd8
 squashfs_read_data+0x354/0x510
 squashfs_readpage_block+0x2d8/0x490
 squashfs_readpage+0x48c/0x600
 __do_page_cache_readahead+0x1e4/0x248
 filemap_fault+0x14c/0x3b8
 __do_fault+0x78/0xf0
 handle_mm_fault+0x744/0x1050
 do_page_fault+0x140/0x2dc
 do_mem_abort+0x64/0xd8

While I didn't dig into why the process couldn't get unblocked in the
squashfs case, it seems fairly sane to assume some other process is
waiting on memory.

Now that you (hopefully) believe we have a problem, we need a
solution.  One sane solution is to eventually let the OOM killer kill
more than one process.  We don't want to kill more than one process
right away as per the dire warnings in the OOM killer code about
exhausting our memory reserves and ending up in a livelock.  While
imperfect, it seems sane to perhaps use a heuristic here where we wait
a bit of time (100 ms) for a process to die.  If it fails, we give up
and try another process.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were conflicts compared to the 4.4 version because of lots of
churn in the oom kill code.  Conceptually this is the same thing,
though.

Change-Id: I4099314abed36b8af0680ceb0eb6d85427ba4962
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465413
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Reviewed-by: Dylan Reid <dgreid@chromium.org>
(cherry picked from commit aa071ca4b372e1e30701c56fe50823f34b3bc9f6)
Reviewed-on: https://chromium-review.googlesource.com/466853

[modify] https://crrev.com/58aa31ac416443efae4ff24085a55e143891c64d/mm/oom_kill.c

Project Member

Comment 58 by bugdroid1@chromium.org, Apr 3 2017

Labels: merge-merged-release-R58-9334.B-chromeos-3.8
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/ae0debcbe000ccc368b801d24d8d80ddff4d708f

commit ae0debcbe000ccc368b801d24d8d80ddff4d708f
Author: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon Apr 03 23:36:37 2017

BACKPORT: mm: oom_kill: don't ignore oom score on exiting tasks

When the OOM killer scans tasks and encounters a PF_EXITING one, it
force-selects that task regardless of the score.  The problem is that if
that task got stuck waiting for some state the allocation site is
holding, the OOM reaper can not move on to the next best victim.

Frankly, I don't even know why we check for exiting tasks in the OOM
killer.  We've tried direct reclaim at least 15 times by the time we
decide the system is OOM, there was plenty of time to exit and free
memory; and a task might exit voluntarily right after we issue a kill.
This is testing pure noise.  Remove it.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were extra checks in 3.14, but the concept is still the same.

Change-Id: Ied8391fd2d8cda4d1728a28dbe6cfea31883b24e
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
(cherry picked from commit 6a618957ad17d8f4f4c7eeede752685374b1b176)
Reviewed-on: https://chromium-review.googlesource.com/465586
(cherry picked from commit fcdc42d2699d037a399f655571637e093e9fe09e)
Reviewed-on: https://chromium-review.googlesource.com/466854

[modify] https://crrev.com/ae0debcbe000ccc368b801d24d8d80ddff4d708f/mm/oom_kill.c

Project Member

Comment 59 by bugdroid1@chromium.org, Apr 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/21f9394b508ab427db3ad9bf2504a1c045541796

commit 21f9394b508ab427db3ad9bf2504a1c045541796
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 03 23:37:01 2017

CHROMIUM: DROP: mm/oom_kill: Double-check before killing a child in our place

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined with the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

If the child of an OOM victim has a different "->mm" element than the
victim, we may choose to kill the child in place of the victim.  This
is sane because if we killed the victim the child would die anyway, so
why not see if we can save the parent?

...but, before we kill the child let's double-check that it's OK by
calling oom_scan_process_thread().

I'll admit that as the code stands right now this is a totally useless
thing to do.  The only time oom_scan_process_thread() will tell us not
to kill the child is if:

1. The child is already marked with MEMDIE.  However, that's not
   possible because the code (as it stands today) won't let more than
   one process have MEMDIE.
2. The child has no "mm".  ...but that shouldn't happen because the
   parent and child shouldn't share an "mm" if the child's "mm" is
   NULL.

Now that I've explained why this patch is stupid and useless, here's
why we want it anyway: in a future patch we'll make it (sometimes)
possible for more than one process to be marked with MEMDIE.  With
that in mind, point #1 above is moot and suddenly this patch has merit.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

NOTE: this file is slightly different than its 4.4 counterpart (where
the patch was originally written) due to different parameters in
oom_scan_process_thread().

Change-Id: I0ea9b5f4a03c1392d52d9d927bd395ed44d7b98c
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465587
(cherry picked from commit 6d287f9bc9e24b57f938155368784a755204e64c)
Reviewed-on: https://chromium-review.googlesource.com/466855

[modify] https://crrev.com/21f9394b508ab427db3ad9bf2504a1c045541796/mm/oom_kill.c

Project Member

Comment 60 by bugdroid1@chromium.org, Apr 3 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/b879df4c109d88fbc3b3f3b255c62dafe97acab0

commit b879df4c109d88fbc3b3f3b255c62dafe97acab0
Author: Douglas Anderson <dianders@chromium.org>
Date: Mon Apr 03 23:37:23 2017

CHROMIUM: DROP: mm/oom_kill: Avoid deadlock; allow multiple victims

DROP THIS PATCH ON REBASE.

It is part of a patch series that attempts to solve similar problems
to the OOM reaper upstream.  NOTE that the OOM reaper patches weren't
backported because mm patches in general are fairly intertwined and
picking the OOM reaper without an mm expert and lots of careful
testing could cause new mysterious problems.

Compared to the OOM reaper, this patch series:
+ Is simpler to reason about.
+ Is not very intertwined with the rest of the system.
+ Can be fairly easily reasoned about to see that it should do no
  harm, even if it doesn't fix every problem.
- Can more easily get into the state where all memory reserves are
  gone, since it allows more than one task to be in "MEMDIE".
- Doesn't free memory as quickly.  The reaper kills anonymous /
  swapped memory even before the doomed task exits.
- Contains the magic "100 ms" heuristic, which is non-ideal.

---

It turns out that there's a pretty big flaw with the way the OOM
killer works.  If the chosen OOM kill victim doesn't die then the OOM
killer totally stops working.  Specifically once we've picked an OOM
victim we won't ever pick another until the first victim dies.  If
that victim refuses to die then we'll silently loop until eventually
some sort of watchdog reboots the system.

This is easy to see in toy code.  Just write a function that userspace
can invoke that grabs a mutex twice.  Now set the OOM score so that
this is the preferred process to kill.  The first invocation of the
OOM killer will try to kill this process but won't be able to because
the process will never get the mutex.  All subsequent OOM killer
invocations will be complete no-ops.

This is easy to argue about in non-toy code too.  Just imagine:
A) Memory gets low and pretty much everyone is starved.
B) Process 1000 has a mutex named "mutex" and is waiting to allocate
   memory.
C) Process 1001 wants "mutex".
D) We choose to OOM kill process 1001, maybe because it has an
   attractive looking OOM score.

OOM kill is the last resort of the memory manager.  That means it has
no other ways to get memory to give to process 1000 in the example
above and process 1000 will loop waiting forever, never releasing its
mutex.  Process 1001 will never get the mutex and thus can never die
to either release memory or allow an additional OOM kill.  Classic
deadlock.

This is seen in real life.  Specifically while stressing the OOM
killer on a Chromebook with Android processes running, the OOM killer
likes to kill Android processes.  Often these processes are sitting in
binder_thread_read().  When we try to kill them they wake up and try
to grab the binder mutex.  Often the binder mutex is held by another
process that is stuck trying to allocate memory.  Ugh.  We probably
want to fix Binder to make this specific case behave better, but we
can't fix every case.

This is seen in real life in non-binder code as well.  Specifically
I've seen the victim process sitting waiting like this:
 schedule+0x88/0xa8
 schedule_timeout+0x44/0x27c
 io_schedule_timeout+0x7c/0xbc
 bt_get+0x174/0x278
 blk_mq_get_tag+0x4c/0xd4
 __blk_mq_alloc_request+0x28/0x158
 blk_mq_map_request+0x340/0x39c
 blk_sq_make_request+0xb0/0x3b8
 generic_make_request+0xcc/0x174
 submit_bio+0xe4/0x1d4
 submit_bh_wbc.isra.35+0x124/0x138
 submit_bh+0x2c/0x38
 ll_rw_block+0xac/0xd8
 squashfs_read_data+0x354/0x510
 squashfs_readpage_block+0x2d8/0x490
 squashfs_readpage+0x48c/0x600
 __do_page_cache_readahead+0x1e4/0x248
 filemap_fault+0x14c/0x3b8
 __do_fault+0x78/0xf0
 handle_mm_fault+0x744/0x1050
 do_page_fault+0x140/0x2dc
 do_mem_abort+0x64/0xd8

While I didn't dig into why the process couldn't get unblocked in the
squashfs case, it seems fairly sane to assume some other process is
waiting on memory.

Now that you (hopefully) believe we have a problem, we need a
solution.  One sane solution is to eventually let the OOM killer kill
more than one process.  We don't want to kill more than one process
right away as per the dire warnings in the OOM killer code about
exhausting our memory reserves and ending up in a livelock.  While
imperfect, it seems sane to perhaps use a heuristic here where we wait
a bit of time (100 ms) for a process to die.  If it fails, we give up
and try another process.

BUG=chromium:706048, chromium:702707
TEST=With whole series, eating memory doesn't hang system

Conflicts:
	mm/oom_kill.c
There were conflicts compared to the 4.4 version because of lots of
churn in the oom kill code.  Conceptually this is the same thing,
though.

Change-Id: I4099314abed36b8af0680ceb0eb6d85427ba4962
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/465588
Reviewed-by: Dylan Reid <dgreid@chromium.org>
(cherry picked from commit ea803c877bd53ab8a15e7ff6ed8a20b766b247e5)
Reviewed-on: https://chromium-review.googlesource.com/466856

[modify] https://crrev.com/b879df4c109d88fbc3b3f3b255c62dafe97acab0/mm/oom_kill.c

Cc: bhthompson@chromium.org
Labels: -Merge-Approved-58 Merge-Merged
Owner: diand...@chromium.org
Status: Fixed (was: Started)
QA: please see comment 39 for testing notes
Cc: abodenha@chromium.org durga.behera@chromium.org jcliang@chromium.org semenzato@chromium.org alberto@chromium.org ajha@chromium.org derat@chromium.org snanda@chromium.org marc...@chromium.org pucchakayala@chromium.org brajkumar@chromium.org songsuk@chromium.org srcv@chromium.org kavvaru@chromium.org dhadd...@chromium.org katierh@chromium.org ericrk@chromium.org
Issue 649116 has been merged into this issue.
Project Member

Comment 64 by bugdroid1@chromium.org, May 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/94f9ee2d35f666bced955ea4b4ee25beb1a18aea

commit 94f9ee2d35f666bced955ea4b4ee25beb1a18aea
Author: Sonny Rao <sonnyrao@chromium.org>
Date: Wed May 10 06:20:10 2017

CHROMIUM: mm: lower oom dump burst limit to 1

Dumping tasks during oom seems to cause noticeable UI jank.  The kernel
has a mechanism for disabling the dump altogether, but the dump can
still be useful to see in feedback reports. So, this lowers the
default ratelimit burst from 10 to 1 so that we still get one report
of what was consuming memory when we get OOMs.

BUG=chromium:702707
TEST=induce an OOM and check for UI jank

Change-Id: Ia0e73640ad706d8d4f0d6593a4d3ec36e5cd8728
Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/501248
Reviewed-by: Douglas Anderson <dianders@chromium.org>

[modify] https://crrev.com/94f9ee2d35f666bced955ea4b4ee25beb1a18aea/mm/oom_kill.c
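
The knob being tuned here is the printk-style ratelimit attached to the OOM task dump: "burst" is how many expensive dumps are allowed per interval, so dropping it from 10 to 1 keeps exactly one report per window while avoiding back-to-back dumps. The self-contained sketch below only models those semantics; it is not the kernel's __ratelimit() and the names are invented:

  /* Illustration only: ratelimit semantics with burst = 1. */
  #include <stdbool.h>
  #include <stdio.h>
  #include <time.h>

  struct ratelimit {
      time_t interval_sec;   /* window length */
      int burst;             /* events allowed per window */
      time_t window_start;
      int used;
  };

  /* Returns true if the caller may emit its (expensive) report. */
  static bool ratelimit_ok(struct ratelimit *rl)
  {
      time_t now = time(NULL);
      if (now - rl->window_start >= rl->interval_sec) {
          rl->window_start = now;
          rl->used = 0;
      }
      if (rl->used < rl->burst) {
          rl->used++;
          return true;
      }
      return false;
  }

  int main(void)
  {
      /* burst = 1: only the first OOM kill in a window gets the task dump. */
      struct ratelimit oom_rs = { 5, 1, time(NULL), 0 };

      for (int i = 0; i < 4; i++)
          printf("oom kill %d: dump tasks? %s\n",
                 i, ratelimit_ok(&oom_rs) ? "yes" : "no");
      return 0;
  }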

Project Member

Comment 65 by bugdroid1@chromium.org, May 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/df33c0ef46b833a96930500a54a0e729bb8ed642

commit df33c0ef46b833a96930500a54a0e729bb8ed642
Author: Sonny Rao <sonnyrao@chromium.org>
Date: Wed May 10 06:20:09 2017

CHROMIUM: mm: lower oom dump burst limit to 1

Dumping tasks during oom seems to cause noticeable UI jank.  The kernel
has a mechanism for disabling the dump altogether, but the dump can
still be useful to see in feedback reports. So, this lowers the
default ratelimit burst from 10 to 1 so that we still get one report
of what was consuming memory when we get OOMs.

BUG=chromium:702707
TEST=induce an OOM and check for UI jank

Change-Id: Ia0e73640ad706d8d4f0d6593a4d3ec36e5cd8728
Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/501228
Reviewed-by: Douglas Anderson <dianders@chromium.org>

[modify] https://crrev.com/df33c0ef46b833a96930500a54a0e729bb8ed642/mm/oom_kill.c

Project Member

Comment 66 by bugdroid1@chromium.org, May 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/41088d1fcb294668f992627cbeee4c9d06f2e61f

commit 41088d1fcb294668f992627cbeee4c9d06f2e61f
Author: Sonny Rao <sonnyrao@chromium.org>
Date: Wed May 10 18:58:24 2017

CHROMIUM: mm: lower oom dump burst limit to 1

Dumping tasks during oom seems to cause noticeable UI jank.  The kernel
has a mechanism for disabling the dump altogether, but the dump can
still be useful to see in feedback reports. So, this lowers the
default ratelimit burst from 10 to 1 so that we still get one report
of what was consuming memory when we get OOMs.

BUG=chromium:702707
TEST=induce an OOM and check for UI jank

Change-Id: Ia0e73640ad706d8d4f0d6593a4d3ec36e5cd8728
Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/501408
Commit-Ready: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>

[modify] https://crrev.com/41088d1fcb294668f992627cbeee4c9d06f2e61f/mm/oom_kill.c

Project Member

Comment 67 by bugdroid1@chromium.org, May 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/7fbd9fcc0499ef438d22b5af12e53cb846ff7722

commit 7fbd9fcc0499ef438d22b5af12e53cb846ff7722
Author: Sonny Rao <sonnyrao@chromium.org>
Date: Wed May 10 18:58:26 2017

CHROMIUM: mm: lower oom dump burst limit to 1

Dumping tasks during oom seems to cause noticeable UI jank.  The kernel
has a mechanism for disabling the dump altogether, but the dump can
still be useful to see in feedback reports. So, this lowers the
default ratelimit burst from 10 to 1 so that we still get one report
of what was consuming memory when we get OOMs.

BUG=chromium:702707
TEST=induce an OOM and check for UI jank

Change-Id: Ia0e73640ad706d8d4f0d6593a4d3ec36e5cd8728
Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/501310
Commit-Ready: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>

[modify] https://crrev.com/7fbd9fcc0499ef438d22b5af12e53cb846ff7722/mm/oom_kill.c

Project Member

Comment 68 by bugdroid1@chromium.org, May 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/735c96dc4cb5f73cc9ee62228336a72446f89b5d

commit 735c96dc4cb5f73cc9ee62228336a72446f89b5d
Author: Sonny Rao <sonnyrao@chromium.org>
Date: Wed May 10 18:58:29 2017

CHROMIUM: mm: lower oom dump burst limit to 1

Dumping tasks during oom seems to cause noticeable UI jank.  The kernel
has a mechanism for disabling the dump altogether, but the dump can
still be useful to see in feedback reports. So, this lowers the
default ratelimit burst from 10 to 1 so that we still get one report
of what was consuming memory when we get OOMs.

BUG=chromium:702707
TEST=induce an OOM and check for UI jank

Change-Id: Ia0e73640ad706d8d4f0d6593a4d3ec36e5cd8728
Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/501287
Commit-Ready: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>

[modify] https://crrev.com/735c96dc4cb5f73cc9ee62228336a72446f89b5d/mm/oom_kill.c

Issue 728194 has been merged into this issue.

Comment 70 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)
