
Issue 711673


Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




likely OOM kill deadlock in file system (jbd2_journal_commit, kjournald2)

Reported by semenzato@chromium.org (Project Member), Apr 14 2017

Issue description

Ben produced this on a caroline under high memory pressure.

From a superficial reading there is more than one possible deadlock, but an unusually large number of processes are in the middle of file system operations.  Some of them are blocked on a bit_wait, others in a kjournald transaction.

There is also a very strange 5-second jump in the timestamps while printing the process list in the last OOM (from 594.735640 to 599.622498 in the excerpt below), which immediately precedes the hung-task panic.  Ben, how old is your tree?


[  594.735599] [ 2040]   218  2040     5162      218      14      104         -1000 bluetoothd
[  594.735609] [ 2053]     0  2053     6767      411      16      149         -1000 metrics_daemon
[  594.735619] [ 2079]     0  2079     2263       86       9       51         -1000 upstart-socket-
[  594.735630] [ 2091]     0  2091   207058      229      37      222         -1000 esif_ufd
[  594.735640] [ 2101]     0  2101    26267      380      20      292         -1000 disks
[  599.622498] [ 2131]   238  2131     3405      207      11       75         -1000 avahi-daemon
[  599.622512] [ 2135]   238  2135     3405        1      10       58         -1000 avahi-daemon
[  599.622523] [ 2140]     0  2140     3947        0      11      129         -1000 sshd



 
Attachment: ramoops.txt (128 KB)
It is ToT from earlier this week, but it probably missed the GPU memory purge fix.
Luigi: I don't think this is the same 5-second jump we were debugging before.  We should have seen a log message for that one.

...but it is a good point about some of these delays.  I hadn't noticed that.  I think the system is basically behaving poorly and we spend a whole chunk of time printing the debug info out to the console.  Possibly if we turn down the debugging verbosity this problem will go away?

I.e., maybe we should adjust this rate limit in oom_kill_process():

        static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
                                              DEFAULT_RATELIMIT_BURST);
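
For reference, this is roughly how that state gates the dump (a simplified sketch; the exact surrounding code and dump_header() arguments vary by kernel version):

        /* Simplified sketch of the pattern in mm/oom_kill.c, not verbatim code.
         * DEFAULT_RATELIMIT_INTERVAL is 5*HZ and DEFAULT_RATELIMIT_BURST is 10
         * (include/linux/ratelimit.h), i.e. at most 10 full dumps per 5-second
         * window.  Shrinking the burst or widening the interval would cut the
         * time spent printing the task list to the (slow) serial console. */
        if (__ratelimit(&oom_rs))
                dump_header(oc, p, memcg);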

Right, we would have seen the log messages.  Also, it's a little short of 5s; I'd expect such delays to be slightly more than the target.

5s is a long time, considering that dumping the OOM logs only takes milliseconds.  I don't know how to explain that.  Maybe tlsdated kicked in?  But maybe those logs are atomic.

If so, you're right that blocking that long during OOM situations is not ideal.

Labels: -Pri-1 Pri-2
Doug found this:

https://access.redhat.com/solutions/96783

Per vmcore dumps captured after the system hung, this appears to be a jbd2 deadlock that occurs when high loads are applied to ext4 filesystems. One or more of the jbd2 threads block on jbd2_journal_commit_transaction, and this causes other threads to block on jbd2_log_wait_commit. This happened with the 2.6.32-220.7.1.el6.x86_64 kernel.

So this hang may not be directly related to memory management.
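
For anyone not familiar with the jbd2 side, a heavily simplified sketch of the wait chain being described (not the actual source; see fs/jbd2/ in the kernel tree):

        /* Heavily simplified illustration of the dependency, not verbatim
         * kernel code.  Threads needing a committed transaction end up here: */
        int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
        {
                /* Poke the commit thread (kjournald2), then sleep until it has
                 * advanced j_commit_sequence past the tid we need. */
                wake_up(&journal->j_wait_commit);
                wait_event(journal->j_wait_done_commit,
                           !tid_gt(tid, journal->j_commit_sequence));
                return 0;
        }

        /* kjournald2 does the actual commit in jbd2_journal_commit_transaction().
         * If that stalls (e.g. blocked on page allocation or writeback under
         * memory pressure), j_commit_sequence never advances and every waiter
         * above hangs, which matches the blocked tasks in the ramoops dump. */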
