too many "refused to die" messages suggest something is wrong
Issue description
On a ToT 2GB cyan (kernel 3.18) under a lot of memory pressure, I got a long hang followed by a watchdog crash. From console-ramoops (full file attached):
[504264.082335] [18787] 0 18787 3082 205 11 50 -1000 sleep
[504264.082349] Out of memory: Kill process 18442 (chrome) score 982 or sacrifice child
[504264.082368] Killed process 18442 (chrome) total-vm:1081824kB, anon-rss:0kB, file-rss:28536kB
[504264.202537] Task chrome:18442 refused to die
[504264.204571] Task chrome:18442 refused to die
[504264.206478] Task chrome:18442 refused to die
[504264.215332] Task chrome:18442 refused to die
... <repeats about 2240 times> ...
[504312.378338] Task chrome:18442 refused to die
[504312.379960] Task chrome:18442 refused to die
[504313.688327] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=60002 jiffies, g=18626824, c=18626823, q=543253)
[504313.688359] INFO: Stall ended before state dump start
[504463.764489] INFO: task TaskSchedulerFo:17813 blocked for more than 120 seconds.
The number of refusals seems unreasonable.
Comment 2, Dec 15 2017
You mean that chrome:18442 was starved of CPU cycles, rather than blocked on something else? That seems unlikely. I had 20 random CNN tabs open, and that process refused to die for about 48 seconds.
Comment 3, Dec 15 2017
> You mean that chrome:18442 was starved of CPU cycles, rather than blocked on something else?

Yes. That's how broken I think some of the MM code is. I think tasks spend many tens of seconds in MM code, and preemption is disabled for those processes, so they can fully hog the CPU. Specifically:

- The pattern of messages you show indicates that the process is marked to die and is in state TASK_RUNNING; specifically, it's not blocked on any kernel object like a mutex. Hrmmm, I guess it could be blocked on a spinlock, though presumably if it's taking this long to come back then it's blocked on a spinlock held by someone requesting memory (and that means they're requesting atomic memory). Do you think we're in that state?

- The process marked to die can only be running in kernel code. The moment it is ready to return to userspace, it will die.
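To make that second bullet concrete, here is a pseudocode-level C sketch (an illustration, not any particular arch's actual exit path; handle_pending_signals() is a hypothetical stand-in for the real signal-delivery code) of why a SIGKILLed task only dies at the kernel-to-user boundary:

        /*
         * Hedged sketch: pending signals -- including the fatal SIGKILL the
         * OOM killer sends -- are only acted on when a task heads back toward
         * userspace.  A task looping inside kernel MM code never reaches this
         * check, so it stays TASK_RUNNING and keeps "refusing to die".
         */
        static void return_to_user(struct task_struct *tsk)
        {
                if (test_tsk_thread_flag(tsk, TIF_SIGPENDING))
                        handle_pending_signals(tsk);    /* hypothetical; a SIGKILL exits here */

                /* ...otherwise resume executing userspace instructions... */
        }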
Comment 4, Dec 15 2017
re #3 -- oh really? An OOMing process starved for 48 seconds seems hard to believe -- I thought it was usually that it was waiting on a lock and the process holding the lock was getting starved, but I might be misremembering.
Comment 5, Dec 15 2017
@4: That is my assertion, though (as always) I'm happy to be proven wrong. I make that assertion because of this bit of code:
        pr_warn("Task %s:%d refused to die\n",
                task->comm, task->pid);
        if (task->state != TASK_RUNNING) {
                sched_show_task(task);
                return OOM_SCAN_CONTINUE;
        }
So if the state of the task were not TASK_RUNNING, we'd be showing a stack trace of the task each time, and we'd be continuing on to pick a new victim.
So I guess I'd say I'm very certain that "chrome:18442" is in a TASK_RUNNING state for the 48 seconds, at least... ;)
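For context, here is a hedged reconstruction of how that fragment plausibly sits in the 3.18 victim-scan path; this is the pre-fix behavior that the later CLs adjust. The OOM_SCAN_* values and the TIF_MEMDIE check are stock 3.18 kernel, but the exact CHROMIUM wiring may differ:

        /*
         * Hedged sketch, not the verbatim CHROMIUM source: a task already
         * marked to die carries TIF_MEMDIE.  If it's blocked we dump its
         * stack and go pick a different victim; if it's runnable we bail out
         * without marking anyone new, which is why every subsequent OOM pass
         * printed the warning again.
         */
        static enum oom_scan_t check_existing_victim(struct task_struct *task)
        {
                if (!test_tsk_thread_flag(task, TIF_MEMDIE))
                        return OOM_SCAN_OK;             /* not a victim; evaluate normally */

                pr_warn("Task %s:%d refused to die\n", task->comm, task->pid);

                if (task->state != TASK_RUNNING) {
                        sched_show_task(task);          /* show where it's blocked... */
                        return OOM_SCAN_CONTINUE;       /* ...and go pick someone else */
                }

                return OOM_SCAN_ABORT;                  /* runnable: let it finish exiting */
        }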
Comment 8, Dec 15 2017
From hallway conversation: we think we can't be in the spinlock case, because anyone holding a spinlock can't be preempted...

---

OK, I used mmm_donut (CL:828144, patch set #2) to at least reproduce a few "refused to die" messages. Not 48 seconds' worth, but at least enough to confirm my patch worked.

CL:829921 - CHROMIUM: oom: Chill out and don't print refusals to die so often
CL:831054 - CHROMIUM: oom: Report info about the killer in OOM kills

With those on a kevin (with kasan off, slub_debug off, stop ui), I ran:

mmm_donut.py -n200 -s10 -z50

AKA: fork 200 processes, each trying to allocate 50 MB at a time. Obviously lots of OOM kills, but I got a few "refused to die" messages. Here's where I got 4 in a row:

2017-12-15T15:06:47.517946-08:00 ERR kernel: [ 2033.872688] Killed process 6402 (python2) total-vm:173864kB, anon-rss:3536kB, file-rss:50592kB
2017-12-15T15:06:47.517951-08:00 WARNING kernel: [ 2033.985583] Task python2:6402 refused to die (killer python2:6557, nvcsw=35389, nivcsw=806)
2017-12-15T15:06:47.603453-08:00 WARNING kernel: [ 2034.108710] Task python2:6402 refused to die (killer python2:6422, nvcsw=33292, nivcsw=909)
2017-12-15T15:06:47.724459-08:00 WARNING kernel: [ 2034.229038] Task python2:6402 refused to die (killer python2:6419, nvcsw=33548, nivcsw=924)
2017-12-15T15:06:47.918451-08:00 WARNING kernel: [ 2034.423714] Task python2:6402 refused to die (killer python2:6533, nvcsw=33911, nivcsw=814)

As you can see, there is now nicely over 100 ms between printouts, and we can see the killer (format sketched below). In this case the killer is different each time, but what do you expect with 200 processes all vying for memory?

---

Note: I wonder if the problem is that some of the tasks asking for memory actually have a higher priority. AKA, if one of our audio threads asks for memory, maybe that's why it can keep blocking things? In any case, if we land the patches above and can reproduce, maybe we can find out.
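For reference, the "(killer ...)" annotations in the logs above line up with a print along these lines. This is a sketch of what CL:831054 plausibly adds (assumed shape, not the verbatim diff); current is the task that invoked the OOM killer:

        /*
         * Hedged sketch: nvcsw/nivcsw are the killer's voluntary/involuntary
         * context-switch counts, which hint at whether the killer itself is
         * spinning instead of letting the victim run.
         */
        pr_warn("Task %s:%d refused to die (killer %s:%d, nvcsw=%lu, nivcsw=%lu)\n",
                task->comm, task->pid,
                current->comm, current->pid,
                current->nvcsw, current->nivcsw);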
Comment 9, Dec 16 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/3f2d33d57bba55fe6d4cc0534c1e94f3f0181568

commit 3f2d33d57bba55fe6d4cc0534c1e94f3f0181568
Author: Douglas Anderson <dianders@chromium.org>
Date: Sat Dec 16 21:09:29 2017

CHROMIUM: oom: Chill out and don't print refusals to die so often

As of the commit ("CHROMIUM: oom: do not select new victim if current one is runnable") we may print out about a task refusing to die but then not select a new victim. In that case, each subsequent call into the OOM code would print another refusal. Let's avoid that by pretending that we selected a victim if we noticed an existing victim that is runnable.

BUG=chromium:795344
TEST='mmm_donut.py -n200 -s10 -z50' on kevin w/ no ui, no slub debug

Change-Id: Id628c601483951adc54b7e4358512912bf96fbb9
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/829921
Reviewed-by: Sonny Rao <sonnyrao@chromium.org>

[modify] https://crrev.com/3f2d33d57bba55fe6d4cc0534c1e94f3f0181568/mm/oom_kill.c
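A minimal sketch of the idea in the commit above, assuming the stock 3.18 shape of out_of_memory() where a successful kill ends in a short sleep before the allocation is retried (the helper name is hypothetical; see the Reviewed-on link for the real diff):

        /*
         * Hedged sketch, not the actual diff: claiming a kill happened when
         * an existing runnable victim is found sends the caller down the
         * normal post-kill exit path, which sleeps briefly before retrying
         * the allocation -- so "refused to die" prints once per
         * pass-plus-sleep instead of in a tight loop.
         */
        static bool handle_runnable_victim(struct task_struct *victim)
        {
                if (victim->state == TASK_RUNNING) {
                        pr_warn("Task %s:%d refused to die\n",
                                victim->comm, victim->pid);
                        return true;            /* pretend we selected a victim */
                }
                return false;
        }

        /* ...and at the tail of out_of_memory() in 3.18: */
        if (killed)
                schedule_timeout_killable(1);   /* give the victim time to exit */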
Comment 10, Dec 16 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/5f420c83308f9f97b81494a6df192f11a1082b73

commit 5f420c83308f9f97b81494a6df192f11a1082b73
Author: Douglas Anderson <dianders@chromium.org>
Date: Sat Dec 16 21:09:30 2017

CHROMIUM: oom: Report info about the killer in OOM kills

Report which task (name/PID) is doing the killing when an OOM kill happens. Also report nvcsw (voluntary switches) and nivcsw (involuntary switches) for the killer. The idea here is that if a task is runnable but refusing to die, perhaps the killer is just looping too much.

BUG=chromium:795344
TEST='mmm_donut.py -n200 -s10 -z50' on kevin w/ no ui, no slub debug

Change-Id: I702a28e8ff5c5feb177f81be2a39f29c251e8890
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/831054
Reviewed-by: Sonny Rao <sonnyrao@chromium.org>

[modify] https://crrev.com/5f420c83308f9f97b81494a6df192f11a1082b73/mm/oom_kill.c
Comment 11, Dec 20 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/cdccaa9eb8fe60dad5b24f328a8a0047098da928

commit cdccaa9eb8fe60dad5b24f328a8a0047098da928
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Dec 20 03:49:15 2017

CHROMIUM: oom: Chill out and don't print refusals to die so often

As of the commit ("CHROMIUM: oom: do not select new victim if current one is runnable") we may print out about a task refusing to die but then not select a new victim. In that case, each subsequent call into the OOM code would print another refusal. Let's avoid that by pretending that we selected a victim if we noticed an existing victim that is runnable.

BUG=chromium:795344
TEST='mmm_donut.py -n200 -s10 -z50' on kevin w/ no ui, no slub debug

Change-Id: Id628c601483951adc54b7e4358512912bf96fbb9
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/829921
Reviewed-by: Sonny Rao <sonnyrao@chromium.org>
(cherry picked from commit 3f2d33d57bba55fe6d4cc0534c1e94f3f0181568)

[modify] https://crrev.com/cdccaa9eb8fe60dad5b24f328a8a0047098da928/mm/oom_kill.c
Comment 12, Dec 20 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/0c4a37e44ebc44267db57ffa7cde52347044721f

commit 0c4a37e44ebc44267db57ffa7cde52347044721f
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Dec 20 03:49:18 2017

CHROMIUM: oom: Report info about the killer in OOM kills

Report which task (name/PID) is doing the killing when an OOM kill happens. Also report nvcsw (voluntary switches) and nivcsw (involuntary switches) for the killer. The idea here is that if a task is runnable but refusing to die, perhaps the killer is just looping too much.

BUG=chromium:795344
TEST='mmm_donut.py -n200 -s10 -z50' on kevin w/ no ui, no slub debug

Change-Id: I702a28e8ff5c5feb177f81be2a39f29c251e8890
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/831054
Reviewed-by: Sonny Rao <sonnyrao@chromium.org>
(cherry picked from commit 5f420c83308f9f97b81494a6df192f11a1082b73)

[modify] https://crrev.com/0c4a37e44ebc44267db57ffa7cde52347044721f/mm/oom_kill.c