task crash_sender blocked for more than 120 seconds on cheza
Issue description
I've been leaving my cheza device sitting idle and I see that it keeps crashing / rebooting recently.
I've been building my kernel with a whole bunch of USE flags. Specifically:
USE="debugobjects kasan kmemleak lockdebug kgdb vtconsole"
...and I'm using my most recent kernel sandbox. I'm on an official root filesystem "R73-11386.0.0".
---
I'm not sure when these problems started. It could be when I changed root filesystems. It could be a kernel change. ...or it could be that I've started building with "debugobjects" and "kasan" recently. If I can't figure out what's up, I can try rolling back each of these to see if the problem goes away.
---
The problem takes a long time to reproduce. From looking at my scrollback buffer on the serial console I've seen it take up to 20 hours to reproduce. I've also seen it as soon as ~70 minutes after boot. Grepping through my scrollback:
[13272.721718] __wait_rcu_gp+0x1dc/0x204
[134432.399408] __wait_rcu_gp+0x1dc/0x204
[37725.846919] __wait_rcu_gp+0x1dc/0x204
[ 8850.144080] __wait_rcu_gp+0x1dc/0x204
[ 7620.251198] __wait_rcu_gp+0x1e4/0x214
[45224.519949] __wait_rcu_gp+0x1e4/0x214
[ 7250.603385] __wait_rcu_gp+0x1e4/0x214
[16347.714383] __wait_rcu_gp+0x1e4/0x214
[ 7992.851015] __wait_rcu_gp+0x1e4/0x214
[ 5534.019869] __wait_rcu_gp+0x1e4/0x214
[26545.758671] __wait_rcu_gp+0x1e4/0x214
[ 4304.611596] __wait_rcu_gp+0x1e4/0x214
[ 7990.872401] __wait_rcu_gp+0x1e4/0x214
[41415.237878] __wait_rcu_gp+0x1e4/0x214
[56299.736007] __wait_rcu_gp+0x1e4/0x214
[11066.391837] __wait_rcu_gp+0x1e4/0x214
[ 7992.849207] __wait_rcu_gp+0x1e4/0x214
[ 4304.607969] __wait_rcu_gp+0x1e4/0x214
[74959.487822] __wait_rcu_gp+0x1e4/0x214
[11060.881908] __wait_rcu_gp+0x1e4/0x214
[ 7378.155356] __wait_rcu_gp+0x1e4/0x214
[40552.069280] __wait_rcu_gp+0x1e4/0x214
[30245.258316] __wait_rcu_gp+0x1e4/0x214
---
The actual crash always looks something like this ("crash_sender" is always the blocked task):
[ 7992.789370] INFO: task crash_sender:11279 blocked for more than 120 seconds.
[ 7992.797346] Tainted: G W 4.19.8 #14
[ 7992.803042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7992.811469] crash_sender D 0 11279 11278 0x00400000
[ 7992.817559] Call trace:
[ 7992.820376] __switch_to+0x174/0x1c4
[ 7992.824271] __schedule+0x124c/0x1bbc
[ 7992.828274] schedule+0x160/0x1a0
[ 7992.831987] schedule_timeout+0xc0/0xcb4
[ 7992.836238] do_wait_for_common+0x2c8/0x4d4
[ 7992.840763] wait_for_common+0x58/0x78
[ 7992.844787] wait_for_completion+0x24/0x30
[ 7992.849207] __wait_rcu_gp+0x1e4/0x214
[ 7992.853268] synchronize_rcu+0xd8/0x100
[ 7992.857427] namespace_unlock+0x10c/0x11c
[ 7992.861727] ksys_umount+0x7b0/0x7e0
[ 7992.865595] __arm64_sys_umount+0x48/0x5c
[ 7992.869905] el0_svc_common+0x124/0x1c4
[ 7992.877978] el0_svc_compat_handler+0x64/0x8c
[ 7992.882591] el0_svc_compat+0x8/0x18
...and in all the crashes we see "imklog" fingered too:
[ 7992.905367] 1 lock held by in:imklog/617:
[ 7992.909573] #0: 000000005a99472b (&f->f_pos_lock){+.+.}, at: __fdget_pos+0x7c/0xb0
---
Earliest instance I see of this was on '4.19.7' built by me Fri Dec 7 18:02:25 PST 2018.
---
Looking at how crash_sender is called, it looks as if it's called once per hour. See '/etc/init/crash-sender.conf':
exec periodic_scheduler 3600 14400 crash_sender /sbin/crash_sender
This probably explains some of the patterns in the crash times above. I haven't yet been able to reproduce this by running crash_sender manually.
Jan 7
It appears that turning on extra debug flags is what made this start showing up. I left my machine up all vacation and it never hit this error. I'm not sure exactly what flags I was running during that time, but I believe it was USE="lockdebug kgdb vtconsole". Thus some combination of "debugobjects", "kasan", and "kmemleak" seems to be what triggered this for me.
Today (13 hours ago)
In case it helps, reproduced this with: "kasan lockdebug kgdb vtconsole"
Today (10 hours ago)
I gave this another 30 minutes of debugging today.
...and noticed something a bit suspicious...
Just to have something concrete to talk about, in the case I debugged today, the crash looked like this:
---
[11675.286800] INFO: task crash_sender:15228 blocked for more than 120 seconds.
[11675.295414] Tainted: G W 4.19.15 #75
[11675.317365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11675.335628] crash_sender D 0 15228 15227 0x00400000
[11675.348488] Call trace:
[11675.351606] __switch_to+0x174/0x1c4
[11675.356233] __schedule+0x124c/0x1bbc
[11675.362541] schedule+0x160/0x1a0
[11675.366828] schedule_timeout+0xc0/0xc9c
[11675.372103] do_wait_for_common+0x2c8/0x4d4
[11675.378271] wait_for_common+0x58/0x78
[11675.385044] wait_for_completion+0x24/0x30
[11675.390982] __wait_rcu_gp+0x1dc/0x204
[11675.395495] synchronize_rcu+0xd8/0x100
[11675.401937] namespace_unlock+0x10c/0x11c
[11675.406220] ksys_umount+0x7b0/0x7e0
[11675.409971] __arm64_sys_umount+0x48/0x5c
[11675.414328] el0_svc_common+0x124/0x1c4
[11675.418380] el0_svc_compat_handler+0x64/0x8c
[11675.422968] el0_svc_compat+0x8/0x18
[11675.426740]
[11675.426740] Showing all locks held in the system:
[11675.433221] 1 lock held by khungtaskd/58:
[11675.437411] #0: 0000000015c459a1 (rcu_read_lock){....}, at: rcu_lock_acquire+0x8/0x38
[11675.445715] 1 lock held by in:imklog/611:
[11675.449906] #0: 00000000f7ee42e9 (&f->f_pos_lock){+.+.}, at: __fdget_pos+0x7c/0xb0
[11675.457970] 2 locks held by bash/6607:
[11675.461911] #0: 00000000be9ee2a5 (&tty->ldisc_sem){++++}, at: ldsem_down_read+0x3c/0x48
[11675.470346] #1: 0000000020d55df2 (&ldata->atomic_read_lock){+.+.}, at: n_tty_read+0x1f4/0xf74
---
...the same as all the previous ones.
So something is fishy in the above. Specifically, it claims that at the time of the printout khungtaskd is holding the RCU read lock. Huh? That's weird because in check_hung_uninterruptible_tasks() you can see that the code specifically calls rcu_read_unlock() before calling debug_show_all_locks(). So why would it still show as holding the lock?
===
Another interesting thing I noticed is this patch upstream:
--
commit 304ae42739b108305f8d7b3eb3c1aec7c2b643a9
Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
AuthorDate: Thu Jan 3 15:26:31 2019 -0800
Commit: Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Fri Jan 4 13:13:45 2019 -0800
kernel/hung_task.c: break RCU locks based on jiffies
check_hung_uninterruptible_tasks() is currently calling rcu_lock_break()
for every 1024 threads. But check_hung_task() is very slow if printk()
was called, and is very fast otherwise.
If many threads within some 1024 threads called printk(), the RCU grace
period might be extended enough to trigger RCU stall warnings.
Therefore, calling rcu_lock_break() for every some fixed jiffies will be
safer.
Link: http://lkml.kernel.org/r/1544800658-11423-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
--
I'm not convinced that's what we're seeing here, but I at least did add some debug prints to check...
Comment 1 by diand...@chromium.org, Dec 19