likely OOM kill deadlock in file system (jbd2_journal_commit, kjournald2)
Issue description

Ben produced this on a caroline under high memory pressure. From a superficial reading there is more than one possibility for a deadlock, but an unusually large number of processes are in the middle of file system operations. Some of them are blocked on a bit_wait, others in a kjournald transaction. There is also a very strange 5-second jump in the timestamps while the process list is being printed in the last OOM report, which immediately precedes the hung task panic. Ben, how old is your tree?

[  594.735599] [ 2040]   218  2040     5162   218   14  104  -1000  bluetoothd
[  594.735609] [ 2053]     0  2053     6767   411   16  149  -1000  metrics_daemon
[  594.735619] [ 2079]     0  2079     2263    86    9   51  -1000  upstart-socket-
[  594.735630] [ 2091]     0  2091   207058   229   37  222  -1000  esif_ufd
[  594.735640] [ 2101]     0  2101    26267   380   20  292  -1000  disks
[  599.622498] [ 2131]   238  2131     3405   207   11   75  -1000  avahi-daemon
[  599.622512] [ 2135]   238  2135     3405     1   10   58  -1000  avahi-daemon
[  599.622523] [ 2140]     0  2140     3947     0   11  129  -1000  sshd
Apr 14 2017
Luigi: I don't think this is the same 5-second jump we were debugging before. We should have seen a log message for that one.
...but it is a good point about some of these delays. I hadn't noticed that. I think the system is basically behaving poorly and we spend a whole chunk of time printing out the debug info to the console. Possibly if we turn down debugging verbosity this problem will go away?
I.e., maybe we adjust this in oom_kill_process():
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
                              DEFAULT_RATELIMIT_BURST);
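Something like the following, purely as a sketch (the 30s/1 numbers are made up, not a tested recommendation):

/*
 * mm/oom_kill.c, oom_kill_process(): sketch only. The defaults allow up to
 * DEFAULT_RATELIMIT_BURST (10) full OOM dumps per DEFAULT_RATELIMIT_INTERVAL
 * (5*HZ) to reach the console, so under sustained OOM activity we can spend
 * a lot of time writing reports to a slow console. Widening the interval and
 * shrinking the burst would cut that down.
 */
static DEFINE_RATELIMIT_STATE(oom_rs, 30 * HZ,  /* was DEFAULT_RATELIMIT_INTERVAL */
                              1);               /* was DEFAULT_RATELIMIT_BURST */

The existing __ratelimit(&oom_rs) check that gates dump_header() would then fire much less often. The other knob is simply lowering the console loglevel, so that most of the dump stays in the ring buffer instead of going out over a slow serial console.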
Apr 14 2017
Right, we would have seen the log messages. Also, it's a little short of 5s (599.622498 - 594.735640 is about 4.89s), and I'd expect such delays to be slightly more than the target. 5s is a long time given how long it takes to dump the OOM logs (milliseconds), and I don't know how to explain that. Maybe tlsdated kicked in? But maybe those logs are atomic. If so, you're right that blocking that long during OOM situations is not ideal.
Apr 14 2017
Doug found this: https://access.redhat.com/solutions/96783

  "Per vmcore dumps captured after system hung, appears to be an apparent jbd2 deadlock that occurs when high loads are applied to ext4 filesystems. One or more of the jbd2 threads block on jbd2_journal_commit_transaction, and this causes other threads to block on jbd2_log_wait_commit."

That was with the 2.6.32-220.7.1.el6.x86_64 kernel, so this hang may not be directly related to memory management.
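For anyone who hasn't read the jbd2 side, the rough shape of the dependency described there is below (paraphrased, not the actual fs/jbd2 source): the commit thread cannot finish the transaction until every open handle on it is released, and everyone doing a journalled write or fsync queues up behind the commit.

#include <linux/jbd2.h>

/*
 * Sketch of the two wait points, not real code.
 *
 * Commit side (kjournald2 / the jbd2 worker): jbd2_journal_commit_transaction()
 * has to wait for every open handle on the running transaction to be released
 * before it can write the commit record. A task that holds a handle but gets
 * stuck, e.g. in reclaim under memory pressure, keeps t_updates non-zero and
 * stalls the commit.
 */
static void commit_side_sketch(journal_t *journal, transaction_t *t)
{
	wait_event(journal->j_wait_updates, atomic_read(&t->t_updates) == 0);
}

/*
 * Waiter side (fsync, journalled writes): jbd2_log_wait_commit() sleeps until
 * the commit thread has advanced j_commit_sequence past the tid it needs, so a
 * stalled commit piles up D-state tasks behind it, which would match the
 * bit_wait / kjournald backtraces in the issue description.
 */
static void waiter_side_sketch(journal_t *journal, tid_t tid)
{
	wait_event(journal->j_wait_done_commit,
		   !tid_gt(tid, journal->j_commit_sequence));
}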
Comment 1 by bccheng@chromium.org, Apr 14 2017