
Issue 687711


Issue metadata

Status: Verified
Closed: Jun 2018
OS: Chrome
Pri: 3
Type: Bug




Kernel hung in ext4 code during cros deploy

Project Member Reported by briannorris@chromium.org, Feb 1 2017

Issue description

I was testing a self-built ToT (R58) on Kevin, emerging some new packages with 'cros deploy' when I hit a hung task warning -> panic. console-ramoops attached.

[ 1800.116621] INFO: task ebuild.sh:5569 blocked for more than 120 seconds.
[ 1800.123420]       Not tainted 4.4.35 #893
[ 1800.127553] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1800.135437] ebuild.sh       D ffffffc000204fd8     0  5569   5475 0x00400000
[ 1800.142693] Call trace:
[ 1800.145262] [<ffffffc000204fd8>] __switch_to+0x9c/0xa8
[ 1800.150507] [<ffffffc0009215bc>] __schedule+0x478/0x748
[ 1800.155801] [<ffffffc000921920>] schedule+0x94/0xb4
[ 1800.160720] [<ffffffc000924a98>] schedule_timeout+0x44/0x27c
[ 1800.166446] [<ffffffc000921104>] io_schedule_timeout+0x7c/0xbc
[ 1800.172353] [<ffffffc000922278>] bit_wait_io+0x1c/0x6c
[ 1800.177537] [<ffffffc000921e10>] __wait_on_bit+0x74/0xc0
[ 1800.182887] [<ffffffc000302720>] wait_on_page_bit+0xa4/0xc8
[ 1800.188577] [<ffffffc00031294c>] truncate_inode_pages_range+0x504/0x570
[ 1800.195396] [<ffffffc0003129e4>] truncate_inode_pages+0x2c/0x38
[ 1800.201347] [<ffffffc000312aa4>] truncate_pagecache+0x50/0x74
[ 1800.207108] [<ffffffc0003d7630>] ext4_setattr+0x448/0x694
[ 1800.212532] [<ffffffc00036fddc>] notify_change+0x20c/0x328
[ 1800.218029] [<ffffffc000352050>] do_truncate+0xac/0xe8
[ 1800.223167] [<ffffffc000362a08>] path_openat+0xbb4/0xd9c
[ 1800.228492] [<ffffffc000363c74>] do_filp_open+0x58/0xbc
[ 1800.233725] [<ffffffc0003532d8>] do_sys_open+0x184/0x2d8
[ 1800.239052] [<ffffffc0003a493c>] compat_SyS_open+0x38/0x48
[ 1800.244536] [<ffffffc000203e34>] el0_svc_naked+0x24/0x28


I was booted from an SD card image, FWIW. But from the looks of this, it's either generic VFS or EXT4 code that caused the lockup.
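For local debugging of hangs like this, the standard hung-task sysctls are useful (a sketch only; the values are illustrative and the commented-out writes need root):

```shell
# Show the current hung-task detector timeout
# (120 s matches the warning in the log above).
cat /proc/sys/kernel/hung_task_timeout_secs

# As root: panic on the first hung task so ramoops captures state
# immediately instead of after repeated warnings.
# echo 1 > /proc/sys/kernel/hung_task_panic

# As root: dump all uninterruptible (D-state) tasks on demand.
# echo w > /proc/sysrq-trigger
```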
 
Attachment: ramoops.txt (151 KB)
It's actually possible that this is SD-card related. When I was booting from an SD card to test some PSR issues, I sometimes found the device hung like this.

I happened to have <https://chromium-review.googlesource.com/#/c/392327/> applied, which was quite useful here. It showed that I was getting a dwmmc interrupt every 10 us (!).

Here's the log I saw:

2017-01-26T14:51:00.733181-08:00 WARNING kernel: [   44.755922] zram: 2268 (metrics_daemon) Attribute compr_data_size (and others) will be removed. See zram documentation.
2017-01-26T14:51:43.798333-08:00 WARNING kernel: [   87.827162] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:52:17.711188-08:00 WARNING kernel: [  121.740110] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:52:34.286305-08:00 WARNING kernel: [  138.315117] dw-mci (IRQ 28): 102 IRQs in the last 1 ms
2017-01-26T14:52:37.788554-08:00 WARNING kernel: [  141.817120] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:52:47.375285-08:00 WARNING kernel: [  151.404153] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:52:48.797760-08:00 WARNING kernel: [  152.826142] dw-mci (IRQ 28): 102 IRQs in the last 1 ms
2017-01-26T14:52:55.308285-08:00 WARNING kernel: [  159.337125] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:52:58.305372-08:00 WARNING kernel: [  162.334113] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:52:58.853615-08:00 WARNING kernel: [  162.882173] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:03.257396-08:00 WARNING kernel: [  167.286128] dw-mci (IRQ 28): 102 IRQs in the last 1 ms
2017-01-26T14:53:04.303497-08:00 WARNING kernel: [  168.332137] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:06.268251-08:00 WARNING kernel: [  170.297131] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:06.336511-08:00 WARNING kernel: [  170.365175] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:08.847514-08:00 WARNING kernel: [  172.876145] dw-mci (IRQ 28): 102 IRQs in the last 1 ms
2017-01-26T14:53:11.833532-08:00 WARNING kernel: [  175.862195] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:11.928313-08:00 WARNING kernel: [  175.957125] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:14.803338-08:00 WARNING kernel: [  178.832135] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:14.845370-08:00 WARNING kernel: [  178.874164] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:24.308395-08:00 WARNING kernel: [  188.337136] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:24.341435-08:00 WARNING kernel: [  188.370213] dw-mci (IRQ 28): 102 IRQs in the last 1 ms
2017-01-26T14:53:34.951259-08:00 WARNING kernel: [  198.980124] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:34.963450-08:00 WARNING kernel: [  198.992122] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:35.042512-08:00 WARNING kernel: [  199.071132] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:35.859307-08:00 WARNING kernel: [  199.888112] dw-mci (IRQ 28): 102 IRQs in the last 1 ms
2017-01-26T14:53:38.032503-08:00 WARNING kernel: [  202.061153] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:40.409437-08:00 WARNING kernel: [  204.438131] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:40.421328-08:00 WARNING kernel: [  204.450134] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:44.374424-08:00 WARNING kernel: [  208.403119] dw-mci (IRQ 28): 102 IRQs in the last 1 ms
2017-01-26T14:53:47.314317-08:00 WARNING kernel: [  211.343123] dw-mci (IRQ 28): 102 IRQs in the last 1 ms
2017-01-26T14:53:47.850358-08:00 WARNING kernel: [  211.879153] dw-mci (IRQ 28): 102 IRQs in the last 1 ms
2017-01-26T14:53:51.126422-08:00 WARNING kernel: [  215.155159] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:51.234355-08:00 WARNING kernel: [  215.263122] dw-mci (IRQ 28): 102 IRQs in the last 1 ms
2017-01-26T14:53:54.761348-08:00 WARNING kernel: [  218.790113] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:56.722266-08:00 WARNING kernel: [  220.751114] dw-mci (IRQ 28): 102 IRQs in the last 1 ms
2017-01-26T14:53:57.756268-08:00 WARNING kernel: [  221.785135] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:53:58.384343-08:00 WARNING kernel: [  222.413112] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:54:01.380420-08:00 WARNING kernel: [  225.409143] dw-mci (IRQ 28): 102 IRQs in the last 1 ms
2017-01-26T14:54:01.393437-08:00 WARNING kernel: [  225.422109] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:54:04.686236-08:00 WARNING kernel: [  228.715122] dw-mci (IRQ 28): 101 IRQs in the last 1 ms
2017-01-26T14:54:16.673206-08:00 ERR kernel: [  240.700028] INFO: task loop0:327 blocked for more than 120 seconds.
2017-01-26T14:54:16.686423-08:00 ERR kernel: [  240.708301]       Not tainted 4.4.21 #498
2017-01-26T14:54:16.686598-08:00 ERR kernel: [  240.712431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2017-01-26T14:54:16.697851-08:00 INFO kernel: [  240.720359] loop0           D ffffffc000204fd8     0   327      2 0x00000000
2017-01-26T14:54:16.697989-08:00 EMERG kernel: [  240.728208] Call trace:
2017-01-26T14:54:16.698004-08:00 WARNING kernel: [  240.730754] [<ffffffc000204fd8>] __switch_to+0x9c/0xa8
2017-01-26T14:54:16.717840-08:00 WARNING kernel: [  240.735401] dw-mci (IRQ 28): 105 IRQs in the last 1 ms
2017-01-26T14:54:16.717954-08:00 WARNING kernel: [  240.738546] dw-mci (IRQ 28): 118 IRQs in the last 1 ms
2017-01-26T14:54:16.717966-08:00 WARNING kernel: [  240.746581] [<ffffffc00092799c>] __schedule+0x478/0x748
2017-01-26T14:54:16.717976-08:00 WARNING kernel: [  240.750346] dw-mci (IRQ 28): 122 IRQs in the last 1 ms
2017-01-26T14:54:16.728409-08:00 WARNING kernel: [  240.755855] dw-mci (IRQ 28): 218 IRQs in the last 1 ms
2017-01-26T14:54:16.740220-08:00 WARNING kernel: [  240.763572] [<ffffffc000927d00>] schedule+0x94/0xb4
2017-01-26T14:54:16.740346-08:00 WARNING kernel: [  240.768502] [<ffffffc00092ae78>] schedule_timeout+0x44/0x27c
2017-01-26T14:54:16.740358-08:00 WARNING kernel: [  240.774199] [<ffffffc0009274e4>] io_schedule_timeout+0x7c/0xbc
2017-01-26T14:54:16.756457-08:00 WARNING kernel: [  240.776931] dw-mci (IRQ 28): 171 IRQs in the last 1 ms
2017-01-26T14:54:16.756579-08:00 WARNING kernel: [  240.785221] [<ffffffc000928658>] bit_wait_io+0x1c/0x6c
2017-01-26T14:54:16.756595-08:00 WARNING kernel: [  240.790079] dw-mci (IRQ 28): 153 IRQs in the last 1 ms
2017-01-26T14:54:16.766934-08:00 WARNING kernel: [  240.795567] [<ffffffc0009281f0>] __wait_on_bit+0x74/0xc0
2017-01-26T14:54:16.767049-08:00 WARNING kernel: [  240.800962] [<ffffffc00030db68>] wait_on_page_bit+0xa4/0xc8
2017-01-26T14:54:16.778974-08:00 WARNING kernel: [  240.806557] [<ffffffc00030dcc8>] __filemap_fdatawait_range+0xdc/0x158
2017-01-26T14:54:16.779047-08:00 WARNING kernel: [  240.813024] [<ffffffc00030dd74>] filemap_fdatawait_range+0x30/0x50
2017-01-26T14:54:16.791787-08:00 WARNING kernel: [  240.819225] [<ffffffc00030fbf8>] filemap_write_and_wait_range+0x5c/0x88
2017-01-26T14:54:16.791904-08:00 WARNING kernel: [  240.825878] [<ffffffc0003d7630>] ext4_sync_file+0x1b8/0x38c
2017-01-26T14:54:16.802900-08:00 WARNING kernel: [  240.831497] [<ffffffc000391530>] vfs_fsync_range+0xa4/0xb4
2017-01-26T14:54:16.803032-08:00 WARNING kernel: [  240.837021] [<ffffffc000391574>] vfs_fsync+0x34/0x40
2017-01-26T14:54:16.813472-08:00 WARNING kernel: [  240.842004] [<ffffffc000618adc>] loop_queue_work+0x84/0x828
2017-01-26T14:54:16.813579-08:00 WARNING kernel: [  240.847621] [<ffffffc00023f6b4>] kthread_worker_fn+0x138/0x1a8
2017-01-26T14:54:16.824290-08:00 WARNING kernel: [  240.853472] [<ffffffc00023f524>] kthread+0x10c/0x114
2017-01-26T14:54:16.824386-08:00 WARNING kernel: [  240.858469] [<ffffffc000203dd0>] ret_from_fork+0x10/0x40

---

Unless the constant interrupts were related to constant IO requests from ext4, I'd believe this to be a dw_mmc bug.
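Without that debug patch, a rough version of the same measurement can be taken from userspace by sampling /proc/interrupts twice (a sketch; IRQ 28 is taken from the log above, and the one-second window is an assumption):

```shell
#!/bin/sh
# Estimate the interrupt rate for one IRQ line by sampling
# /proc/interrupts twice and diffing the summed per-CPU counters.
irq=28

count_irq() {
  awk -v n="${irq}:" '
    NR == 1 { ncpu = NF }    # header row: one column per CPU
    $1 == n {                # matching IRQ row: sum only the CPU columns
      s = 0
      for (i = 2; i <= ncpu + 1; i++) s += $i
      print s
    }
  ' /proc/interrupts
}

a=$(count_irq)
sleep 1
b=$(count_irq)
[ -n "$a" ] && [ -n "$b" ] && echo "IRQ $irq: $((b - a)) interrupts/s"
```

Summing only the first `ncpu` columns matters because the rows also carry a hardware IRQ number after the controller name, which would otherwise be miscounted.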
Attachment: screen-exchange (210 KB)

Comment 2 by dchan@google.com, Sep 8 2017

Cc: dhadd...@chromium.org
Cc: sdantul...@chromium.org mkarkada@chromium.org abod...@chromium.org
Booted Kevin from sdcard using build 9901.7.0, 62.0.3202.9 dev.

Did not notice any device hang. Attached device logs for reference.
Attachment: debug-logs_20170908-110056.tgz (102 KB)
This is an old bug, and I'll point out that I boot my kevin from an SD card 100% of the time when I install a new kernel. I very, very rarely see any errors like the above.

...but somehow we do have some SD Card problems.  Specifically:

--

1. There's a feedback report <https://listnr.corp.google.com/report/72651271882> that looks like someone is seeing a hang in accessing their SD card.  The traceback from there looks like:

[  360.121372] INFO: task mount.exfat-fus:3707 blocked for more than 120 seconds.
[  360.121416]       Not tainted 4.4.70-11295-gf64189f58926 #1
[  360.121438] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.121461] mount.exfat-fus D ffffffc000204ff0     0  3707   2024 0x00400000
[  360.121504] Call trace:
[  360.121545] [<ffffffc000204ff0>] __switch_to+0x9c/0xa8
[  360.121578] [<ffffffc00093ab88>] __schedule+0x504/0x740
[  360.121612] [<ffffffc00093ae4c>] schedule+0x88/0xa8
[  360.121651] [<ffffffc00093eaa0>] schedule_timeout+0x44/0x27c
[  360.121683] [<ffffffc00093a644>] io_schedule_timeout+0x7c/0xbc
[  360.121718] [<ffffffc00093b7a4>] bit_wait_io+0x1c/0x6c
[  360.121750] [<ffffffc00093b33c>] __wait_on_bit+0x74/0xc0
[  360.121779] [<ffffffc000302d18>] wait_on_page_bit+0xa4/0xc8
[  360.121810] [<ffffffc000302e78>] __filemap_fdatawait_range+0xdc/0x158
[  360.121841] [<ffffffc000302f24>] filemap_fdatawait_range+0x30/0x50
[  360.121970] [<ffffffc000304d04>] filemap_write_and_wait_range+0x5c/0x88
[  360.121999] [<ffffffc00039465c>] blkdev_fsync+0x34/0x60
[  360.122057] [<ffffffc00038d104>] vfs_fsync_range+0xa4/0xb4
[  360.122115] [<ffffffc00038d148>] vfs_fsync+0x34/0x40
[  360.122165] [<ffffffc00038d418>] SyS_fsync+0x40/0x64
[  360.122223] [<ffffffc000203e34>] el0_svc_naked+0x24/0x28

--

2. We've seen lots of feedback reports and crash reports with SD card problems.  Most are during tuning.  That's captured by <https://buganizer.corp.google.com/issues/35580203>.

...in fact, if you look at crash reports for kevin, a fairly significant percentage of them appear to be crashes during SD card tuning.

It's unclear whether the problems we see at tuning time and the problems here are the same, but it's plausible.

Comment 6 by dchan@google.com, Sep 9 2017

Cc: sontis@chromium.org
+sontis found this

I have noticed the "mount_exfat_fuse" process crash on 9765.53.0.

Steps to repro this crash:
1. Start copying a big file (~2 GB) from the device to the micro SD card.
2. Unplug the micro SD card in the middle of the copy operation.

Result:
The "mount_exfat_fuse" process crashes.
crash link: https://crash.corp.google.com/browse?q=reportid=%27d7db2b65bbb9bf1a%27

0xf193ab86	(libc-2.23.so + 0x00016b86 )	__libc_do_syscall
0xf1949de3	(libc-2.23.so -raise.c:54 )	raise
0xf194ad55	(libc-2.23.so -abort.c:89 )	abort
0xab76c605	(mount.exfat-fuse -log.c:50 )	exfat_bug
0xab76dcad	(mount.exfat-fuse -node.c:601 )	reset_cache
0xab76dbe5	(mount.exfat-fuse -node.c:587 )	reset_cache
0xab76d56b	(mount.exfat-fuse -mount.c:365 )	exfat_unmount
0xf1a34ed5	(libfuse.so.2.8.6 -fuse.c:1962 )	fuse_fs_destroy
0xf1a36321	(libfuse.so.2.8.6 -fuse.c:1975 )	fuse_lib_destroy
0xf1a40b6b	(libfuse.so.2.8.6 -fuse_lowlevel.c:1260 )	do_destroy
0xf1a3eb5b	(libfuse.so.2.8.6 -fuse_lowlevel.c:1514 )	fuse_ll_process
0xf1a3ce45	(libfuse.so.2.8.6 -fuse_loop.c:33 )	fuse_session_loop
0xab76a5eb	(mount.exfat-fuse -main.c:627 )	main
0xf193a8b7	(libc-2.23.so -libc-start.c:289 )	__libc_start_main
0xab76a1e3	(mount.exfat-fuse + 0x000011e3 )	_start
0xab77126b	(mount.exfat-fuse -elf-init.c:87 )	__libc_csu_init
0xf1a7af3f	(ld-2.23.so + 0x0000bf3f )	_dl_sort_fini
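The repro above can be scripted up to the manual unplug step. A sketch only; the file name, default size, and SD card mount point below are assumptions, not taken from the bug report:

```shell
#!/bin/sh
# Hypothetical repro helper for the exfat unmount crash: start a large
# copy to the SD card, then physically unplug the card while it runs.
SRC=/tmp/exfat-repro.bin
DST="${1:-/media/removable/SD Card}"   # assumed Chrome OS mount point
COUNT="${2:-64}"                       # MiB; the repro used ~2 GB (2048)

# Create the source file filled with zeros.
dd if=/dev/zero of="$SRC" bs=1M count="$COUNT" status=none

# Kick off the copy in the background; unplug the card mid-transfer.
cp "$SRC" "$DST/" &
echo "copy running as PID $! -- unplug the card now"
wait
```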

Owner: sontis@chromium.org
Can you please try to reproduce on 9901.46.0? Thanks! :) It's plausible this was fixed by the same thing that fixed b/35580203.

Comment 8 by sontis@chromium.org, Jun 15 2018

Status: Verified (was: Untriaged)
Did not notice this issue on ToT.
