unclean unmounts lead to EXT4 lockups
Issue description

OS: scarlet-factory/R65-10211.19.0
Forked from https://issuetracker.google.com/76121632

Reboot tests done on the factory toolkit previously didn't cleanly unmount the stateful partition, which led to various sorts of filesystem corruption (expected). What's unexpected is that we also saw the system occasionally lock up during shutdown, usually with plenty of EXT4 / block device code in the blame list. We could never reproduce this on a proper test image (without the factory packages); I suspect this is because we do a better job of cleanly unmounting there.

Full console-ramoops attached, but here's a snippet:

[ 240.115707] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 240.115720] umount          D ffffffc000213030     0  2618   2605 0x00400008
[ 240.115745] Call trace:
[ 240.115771] [<ffffffc000213030>] __switch_to+0x9c/0xa8
[ 240.115792] [<ffffffc00092dc9c>] __schedule+0x3cc/0x840
[ 240.115810] [<ffffffc00092d86c>] schedule+0x4c/0xb0
[ 240.115826] [<ffffffc000930924>] schedule_timeout+0x44/0x5a0
[ 240.115844] [<ffffffc00092eff4>] do_wait_for_common+0xcc/0x16c
[ 240.115862] [<ffffffc00092ed2c>] wait_for_common+0x58/0x78
[ 240.115879] [<ffffffc00092ecc8>] wait_for_completion+0x24/0x30
[ 240.115896] [<ffffffc000233464>] flush_work+0x15c/0x1a0
[ 240.115914] [<ffffffc00030dd2c>] lru_add_drain_all+0x138/0x184
[ 240.115933] [<ffffffc000392024>] invalidate_bdev+0x2c/0x48
[ 240.115952] [<ffffffc0003da59c>] ext4_put_super+0x1e8/0x278
[ 240.115970] [<ffffffc0003699c8>] generic_shutdown_super+0x6c/0xd8
[ 240.115986] [<ffffffc00036a9c4>] kill_block_super+0x2c/0x70
[ 240.116002] [<ffffffc0003697fc>] deactivate_locked_super+0x58/0x84
[ 240.116018] [<ffffffc0003698a0>] deactivate_super+0x38/0x44
[ 240.116035] [<ffffffc00037fccc>] cleanup_mnt+0x40/0x78
[ 240.116052] [<ffffffc00037fc30>] __cleanup_mnt+0x1c/0x28
[ 240.116071] [<ffffffc0002daec4>] task_work_run+0x88/0xd0
[ 240.116090] [<ffffffc000208dd8>] do_notify_resume+0x530/0x56c
[ 240.116106] [<ffffffc000202d28>] work_pending+0x1c/0x20

Testers saw another repro case on a similar image, but this time it also put some RK3399 graphics code in the list of blocked tasks, as well as plenty of the same filesystem tasks:

[ 240.115541] INFO: task rockchip_drm_at:163 blocked for more than 120 seconds.
[ 240.115577]       Not tainted 4.4.111-12566-g462126335459c-dirty #1
[ 240.115590] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 240.115602] rockchip_drm_at D ffffffc000213030     0   163      2 0x00000000
[ 240.115629] Call trace:
[ 240.115660] [<ffffffc000213030>] __switch_to+0x9c/0xa8
[ 240.115681] [<ffffffc00092dc9c>] __schedule+0x3cc/0x840
[ 240.115699] [<ffffffc00092d86c>] schedule+0x4c/0xb0
[ 240.115716] [<ffffffc000930924>] schedule_timeout+0x44/0x5a0
[ 240.115737] [<ffffffc000615098>] dma_fence_default_wait+0x128/0x214
[ 240.115755] [<ffffffc000614cb4>] dma_fence_wait_timeout+0xb8/0x158
[ 240.115777] [<ffffffc0004f3678>] rockchip_atomic_commit_complete+0x78/0x4a0
[ 240.115798] [<ffffffc0005a5f60>] rockchip_drm_atomic_work+0x1c/0x28
[ 240.115818] [<ffffffc0002db064>] kthread_worker_fn+0xe8/0x1b8
[ 240.115837] [<ffffffc00023a5d8>] kthread+0xe0/0xf0
[ 240.115855] [<ffffffc000202dd0>] ret_from_fork+0x10/0x40
...
[ 240.116717] umount          D ffffffc000213030     0  2634   2621 0x00400008
[ 240.116790] Call trace:
[ 240.116830] [<ffffffc000213030>] __switch_to+0x9c/0xa8
[ 240.116867] [<ffffffc00092dc9c>] __schedule+0x3cc/0x840
[ 240.116889] [<ffffffc00092d86c>] schedule+0x4c/0xb0
[ 240.116904] [<ffffffc000930924>] schedule_timeout+0x44/0x5a0
[ 240.116923] [<ffffffc00092eff4>] do_wait_for_common+0xcc/0x16c
[ 240.116940] [<ffffffc00092ed2c>] wait_for_common+0x58/0x78
[ 240.116958] [<ffffffc00092ecc8>] wait_for_completion+0x24/0x30
[ 240.116974] [<ffffffc000233464>] flush_work+0x15c/0x1a0
[ 240.116993] [<ffffffc00030dd2c>] lru_add_drain_all+0x138/0x184
[ 240.117012] [<ffffffc000392024>] invalidate_bdev+0x2c/0x48
[ 240.117032] [<ffffffc0003da59c>] ext4_put_super+0x1e8/0x278
[ 240.117050] [<ffffffc0003699c8>] generic_shutdown_super+0x6c/0xd8
[ 240.117066] [<ffffffc00036a9c4>] kill_block_super+0x2c/0x70
[ 240.117082] [<ffffffc0003697fc>] deactivate_locked_super+0x58/0x84
[ 240.117098] [<ffffffc0003698a0>] deactivate_super+0x38/0x44
[ 240.117116] [<ffffffc00037fccc>] cleanup_mnt+0x40/0x78
[ 240.117132] [<ffffffc00037fc30>] __cleanup_mnt+0x1c/0x28
[ 240.117150] [<ffffffc0002daec4>] task_work_run+0x88/0xd0
[ 240.117170] [<ffffffc000208dd8>] do_notify_resume+0x530/0x56c
[ 240.117186] [<ffffffc000202d28>] work_pending+0x1c/0x20
...
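For context on what "cleanly unmounting" means here: at the syscall level the test image essentially syncs and unmounts the stateful partition before issuing the reboot, while the factory reboot tests rebooted with it still mounted. Below is a minimal userspace sketch of that sequence; the mount point and the bail-out-on-failure behavior are illustrative assumptions, not taken from the factory toolkit.

#include <stdio.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/reboot.h>

int main(void)
{
        /* Assumed mount point for the stateful partition; illustrative only. */
        const char *stateful = "/mnt/stateful_partition";

        /* Flush dirty pages first so the unmount has less to write back. */
        sync();

        if (umount2(stateful, 0) != 0) {
                perror("umount2");
                /* A real test harness would retry or kill remaining users
                 * instead of rebooting with the filesystem still mounted. */
                return 1;
        }

        /* Requires CAP_SYS_BOOT; does not return on success. */
        reboot(RB_AUTOBOOT);
        perror("reboot");
        return 1;
}

Rebooting while umount2() still fails with EBUSY is exactly the unclean case described above, which explains the filesystem corruption even when the lockup doesn't happen.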
Oct 2
It's been a while since this issue was fresh in my mind... but wasn't this determined not to be entirely an EXT4 issue? It was another bug (the RTC driver, I think?), and the locked-up task just happened to sit on the same workqueues that EXT4 used. I think mka@ looked into this, and IIUC he may have found that this work got split onto different queues in later kernels, so it wouldn't be as much of a problem there. I'm mostly working from memory; you might find more at https://issuetracker.google.com/76121632. Anyway, if the above is true, this might not be worth spending a lot of time on, unless we're still seeing a lot of new, similar failure modes.
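To make that theory concrete: in the traces above, umount reaches ext4_put_super() -> invalidate_bdev() -> lru_add_drain_all(), which queues drain work and then flush_work()s it, so if an unrelated long-running work item (e.g. the suspected RTC work) is serviced ahead of the drain by the same worker, umount ends up waiting for that item too. Below is a rough userspace analogy of that head-of-line blocking; it is not kernel code, and it assumes a single worker servicing items strictly in FIFO order.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

struct work_item {
        void (*fn)(void);
        bool done;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* Stand-in for the misbehaving driver work (the suspected RTC bug). */
static void slow_fn(void) { sleep(5); }

/* Stand-in for the drain work queued by lru_add_drain_all(). */
static void quick_fn(void) { }

static struct work_item queue[] = { { slow_fn, false }, { quick_fn, false } };

/* One worker draining the queue in FIFO order (the shared-worker assumption). */
static void *worker(void *arg)
{
        (void)arg;
        for (size_t i = 0; i < sizeof(queue) / sizeof(queue[0]); i++) {
                queue[i].fn();
                pthread_mutex_lock(&lock);
                queue[i].done = true;
                pthread_cond_broadcast(&cond);
                pthread_mutex_unlock(&lock);
        }
        return NULL;
}

/* Analogue of flush_work(): block until one specific item has completed. */
static void flush_item(struct work_item *w)
{
        pthread_mutex_lock(&lock);
        while (!w->done)
                pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, worker, NULL);

        /* "umount" only needs the quick drain item, but because it sits
         * behind the slow item on the same worker, the flush waits out both;
         * in the real logs that wait blew past the 120 s hung-task watchdog. */
        flush_item(&queue[1]);
        puts("drain flushed; umount could proceed now");

        pthread_join(&t, NULL);
        return 0;
}

The sketch only illustrates the ordering dependency; the real kernel workqueue machinery is concurrency-managed and more nuanced than a single FIFO thread, which is why splitting the work onto separate queues (as recalled above) would remove the dependency.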
Comment 1 by briannorris@chromium.org, Apr 24 2018