New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 795344 link

Starred by 1 user

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug



Sign in to add a comment

too many "refused to die" messages suggest something is wrong

Project Member Reported by semenzato@chromium.org, Dec 15 2017

Issue description

On a ToT 2GB cyan (kernel 3.18) under a lot of memory pressure, I got a long hang followed by a watchdog crash.  From console-ramoops (full file attached):

[504264.082335] [18787]     0 18787     3082      205      11       50         -1000 sleep
[504264.082349] Out of memory: Kill process 18442 (chrome) score 982 or sacrifice child
[504264.082368] Killed process 18442 (chrome) total-vm:1081824kB, anon-rss:0kB, file-rss:28536kB
[504264.202537] Task chrome:18442 refused to die
[504264.204571] Task chrome:18442 refused to die
[504264.206478] Task chrome:18442 refused to die
[504264.215332] Task chrome:18442 refused to die
... <repeats about 2240 times> ...
[504312.378338] Task chrome:18442 refused to die
[504312.379960] Task chrome:18442 refused to die
[504313.688327] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=60002 jiffies, g=18626824, c=18626823, q=543253)
[504313.688359] INFO: Stall ended before state dump start
[504463.764489] INFO: task TaskSchedulerFo:17813 blocked for more than 120 seconds.

The number of refusals seems unreasonable.

 
refused-to-die
127 KB View Download
This basically is the whole issue where the task chrome:18442 is runnable but isn't actually getting run.  Presumably there are so many competing tasks thrashing through the memory code that they all prevent the doomed task from running.

This patch ought to decrease the number of refusals printed.  I'll see if i can find any way to test:

https://chromium-review.googlesource.com/#/c/chromiumos/third_party/kernel/+/829921
You mean that chrome:18442 was starved of CPU cycles, rather blocked on something else?  That seems unlikely.  I had 20 open random CNN tabs, and that process refused to die for about 48 seconds.
> You mean that chrome:18442 was starved of CPU cycles, rather 
> blocked on something else

Yes.  That's how broken I think some of the MM code is.  I think tasks spend many tens of seconds in MM code and preemption is disabled for those processes.  Thus they can fully hog the CPU.  Specifically:

- The pattern of messages you show indicates that the process is marked to die and is in state "TASK_RUNNING".  Specifically it's not blocked on any kernel object like a mutex, etc.  Hrmmm, I guess it could be blocked on a spinlock, though, but presumably if it's taking this long to come back then it's blocked on a spinlock held by someone requesting memory (and that means they're requesting atomic memory).  You think we're in that state?

- The process marked to die can only be running in kernel code.  The moment it is ready to return to userspace code it will die.

re #3 -- oh really?  OOMing process Starved for 48 seconds seems hard to believe -- I thought it was usually that it was waiting on a lock an the process holding the lock was getting starved, but I might be mis-remembering
@4: That is my assertion, though (as always) I'm happy to be proven wrong.  I make that assertion because of this bit of code:

  pr_warn("Task %s:%d refused to die\n",
	  task->comm, task->pid);
  if (task->state != TASK_RUNNING) {
	  sched_show_task(task);
	  return OOM_SCAN_CONTINUE;
 }

So if the state of the task was not TASK_RUNNING then we'd be showing a stack trace of the task each time.  ...and we'd be continuing on to pick a new victim.


So I guess I'd say I'm very certain that "chrome:18442" is in a TASK_RUNNING state for the 48 seconds, at least...  ;)
Components: OS>Kernel
Labels: OS-Chrome
From hallway conversation: we think we can't be in the spinlock case because anyone holding a spinlock can't be preempted...

---

OK, I used mmm_donut (CL:828144, patch set #2) to at least reproduce a few "refused to die".  Not 84 seconds worth, but at least enough to confirm my patch worked.

  CL:829921 - CHROMIUM: oom: Chill out and don't print refusals to die so often
  CL:831054 - CHROMIUM: oom: Report info about the killer in OOM kills

With those on a kevin (with kasan off, slub_debug off, stop ui), I ran:
  mmm_donut.py -n200 -s10 -z50

AKA: fork 200 processes each trying to allocate 50 MB at a time.  Obviously lots of OOM kills, but I got a few "refused to die" messages.  Here's where I got 4 in a row:

2017-12-15T15:06:47.517946-08:00 ERR kernel: [ 2033.872688] Killed process 6402 (python2) total-vm:173864kB, anon-rss:3536kB, file-rss:50592kB
2017-12-15T15:06:47.517951-08:00 WARNING kernel: [ 2033.985583] Task python2:6402 refused to die (killer python2:6557, nvcsw=35389, nivcsw=806)
2017-12-15T15:06:47.603453-08:00 WARNING kernel: [ 2034.108710] Task python2:6402 refused to die (killer python2:6422, nvcsw=33292, nivcsw=909)
2017-12-15T15:06:47.724459-08:00 WARNING kernel: [ 2034.229038] Task python2:6402 refused to die (killer python2:6419, nvcsw=33548, nivcsw=924)
2017-12-15T15:06:47.918451-08:00 WARNING kernel: [ 2034.423714] Task python2:6402 refused to die (killer python2:6533, nvcsw=33911, nivcsw=814)

As you can see, there is now nicely over 100 ms between printouts and we can see the killer.  In this case we can see that the killer is different each time, but what do you expect with 200 processes all vying for memory?

---

Note: I wonder if perhaps the problem is that some of the people asking for memory actually have a higher priority.  ...AKA if one of our audio threads asks for memory, maybe that's why it can keep blocking things?  In any case, if we land the patches above and can reproduce then maybe we can find out?
Project Member

Comment 9 by bugdroid1@chromium.org, Dec 16 2017

Labels: merge-merged-chromeos-4.4
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/3f2d33d57bba55fe6d4cc0534c1e94f3f0181568

commit 3f2d33d57bba55fe6d4cc0534c1e94f3f0181568
Author: Douglas Anderson <dianders@chromium.org>
Date: Sat Dec 16 21:09:29 2017

CHROMIUM: oom: Chill out and don't print refusals to die so often

As of the commit ("CHROMIUM: oom: do not select new victim if current
one is runnable") we may print out about a task refusing to die but
then not select a new victim.  In that case then each subsequent call
into the OOM call would print another new refusal.

Let's avoid that by pretending that we selected a victim if we noticed
an existing victim that is runnable.

BUG=chromium:795344
TEST='mmm_donut.py -n200 -s10 -z50' on kevin w/ no ui, no slub debug

Change-Id: Id628c601483951adc54b7e4358512912bf96fbb9
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/829921
Reviewed-by: Sonny Rao <sonnyrao@chromium.org>

[modify] https://crrev.com/3f2d33d57bba55fe6d4cc0534c1e94f3f0181568/mm/oom_kill.c

Project Member

Comment 10 by bugdroid1@chromium.org, Dec 16 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/5f420c83308f9f97b81494a6df192f11a1082b73

commit 5f420c83308f9f97b81494a6df192f11a1082b73
Author: Douglas Anderson <dianders@chromium.org>
Date: Sat Dec 16 21:09:30 2017

CHROMIUM: oom: Report info about the killer in OOM kills

Report which task (name/PID) is doing the killing when an OOM kill
happens.  Also report nvcsw (voluntary switches) and nivcsw
(involuntary switches) for the killer.  The idea here is that if a
task is runnable but refusing to die that perhaps the killer is just
looping too much.

BUG=chromium:795344
TEST='mmm_donut.py -n200 -s10 -z50' on kevin w/ no ui, no slub debug

Change-Id: I702a28e8ff5c5feb177f81be2a39f29c251e8890
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/831054
Reviewed-by: Sonny Rao <sonnyrao@chromium.org>

[modify] https://crrev.com/5f420c83308f9f97b81494a6df192f11a1082b73/mm/oom_kill.c

Project Member

Comment 11 by bugdroid1@chromium.org, Dec 20 2017

Labels: merge-merged-chromeos-3.18
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/cdccaa9eb8fe60dad5b24f328a8a0047098da928

commit cdccaa9eb8fe60dad5b24f328a8a0047098da928
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Dec 20 03:49:15 2017

CHROMIUM: oom: Chill out and don't print refusals to die so often

As of the commit ("CHROMIUM: oom: do not select new victim if current
one is runnable") we may print out about a task refusing to die but
then not select a new victim.  In that case then each subsequent call
into the OOM call would print another new refusal.

Let's avoid that by pretending that we selected a victim if we noticed
an existing victim that is runnable.

BUG=chromium:795344
TEST='mmm_donut.py -n200 -s10 -z50' on kevin w/ no ui, no slub debug

Change-Id: Id628c601483951adc54b7e4358512912bf96fbb9
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/829921
Reviewed-by: Sonny Rao <sonnyrao@chromium.org>
(cherry picked from commit 3f2d33d57bba55fe6d4cc0534c1e94f3f0181568)

[modify] https://crrev.com/cdccaa9eb8fe60dad5b24f328a8a0047098da928/mm/oom_kill.c

Project Member

Comment 12 by bugdroid1@chromium.org, Dec 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/0c4a37e44ebc44267db57ffa7cde52347044721f

commit 0c4a37e44ebc44267db57ffa7cde52347044721f
Author: Douglas Anderson <dianders@chromium.org>
Date: Wed Dec 20 03:49:18 2017

CHROMIUM: oom: Report info about the killer in OOM kills

Report which task (name/PID) is doing the killing when an OOM kill
happens.  Also report nvcsw (voluntary switches) and nivcsw
(involuntary switches) for the killer.  The idea here is that if a
task is runnable but refusing to die that perhaps the killer is just
looping too much.

BUG=chromium:795344
TEST='mmm_donut.py -n200 -s10 -z50' on kevin w/ no ui, no slub debug

Change-Id: I702a28e8ff5c5feb177f81be2a39f29c251e8890
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/831054
Reviewed-by: Sonny Rao <sonnyrao@chromium.org>
(cherry picked from commit 5f420c83308f9f97b81494a6df192f11a1082b73)

[modify] https://crrev.com/0c4a37e44ebc44267db57ffa7cde52347044721f/mm/oom_kill.c

Sign in to add a comment