From: Andrew Morton <akpm@linux-foundation.org>
To: Andrea Arcangeli <andrea@suse.de>
Cc: linux-kernel@vger.kernel.org, jack@suse.cz,
Ingo Molnar <mingo@elte.hu>,
"Eric W. Biederman" <ebiederm@xmission.com>,
Alexey Dobriyan <adobriyan@gmail.com>
Subject: Re: /proc dcache deadlock in do_exit
Date: Tue, 27 Nov 2007 14:38:52 -0800 [thread overview]
Message-ID: <20071127143852.601509ac.akpm@linux-foundation.org> (raw)
In-Reply-To: <20071127132022.GW6840@v2.random>
On Tue, 27 Nov 2007 14:20:22 +0100
Andrea Arcangeli <andrea@suse.de> wrote:
> Hi,
>
> this patch fixes a sles9 system hang in start_this_handle from a
> customer with some heavy workload where all tasks are waiting on
> kjournald to commit the transaction, but kjournald waits on t_updates
> to go down to zero (it never does). This was reported as a lowmem
> shortage deadlock but when checking the debug data I noticed the VM
> wasn't under pressure at all (well it was really under vm pressure,
> because lots of tasks hanged in the VM prune_dcache methods trying to
> flush dirty inodes, but no task was hanging in GFP_NOFS mode, the
> holder of the journal handle should have if this was a vm issue in the
> first place). No task was apparently holding the leftover handle in
> the committing transaction, so I deduced t_updates was stuck to 1
> because a journal_stop was never run by some path (this turned out to
> be correct). With a debug patch adding proper reverse links and stack
> trace logging in ext3 deployed in production, I found journal_stop is
> never run because mark_inode_dirty_sync is called inside release_task
> called by do_exit. (that was quite fun because I would have never
> thought about this subtleness, I thought a regular path in ext3 had a
> bug and it forgot to call journal_stop)
>
> do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
> come back to run journal_stop)
I don't see why the schedule() will not return? Because the task has
PF_EXITING set? Doesn't TASK_DEAD do that?
> The reason is that shrink_dcache_parent is racy by design (feature not
> a bug) and it can do blocking I/O in some case, but the point is that
> calling shrink_dcache_parent at the last stage of do_exit isn't safe
> for self-reaping tasks.
>
> I guess the memory pressure of the unbalanced highmem system allowed
> to trigger this more easily.
>
> Now mainline doesn't have this line in iput (like sles9 has):
>
> if (inode->i_state & I_DIRTY_DELAYED)
> mark_inode_dirty_sync(inode);
>
> so it will probably not crash with ext3, but for example ext2
> implements an I/O-blocking ext2_put_inode that will lead to similar
> screwups with ext2_free_blocks never coming back and it's definitely
> wrong to call blocking-IO paths inside do_exit. So this should fix a
> subtle bug in mainline too (not verified in practice though). The
> equivalent fix for ext3 is also not verified yet to fix the problem in
> sles9 but I don't have doubt it will (it usually takes days to crash,
> so it'll take weeks to be sure).
>
> An alternate fix would be to offload that work to a kernel thread, but
> I don't think a reschedule for this is worth it, the vm should be able
> to collect those entries for the synchronous release_task.
>
> Signed-off-by: Andrea Arcangeli <andrea@suse.de>
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2265,7 +2265,8 @@ static void proc_flush_task_mnt(struct v
> name.len = snprintf(buf, sizeof(buf), "%d", pid);
> dentry = d_hash_and_lookup(mnt->mnt_root, &name);
> if (dentry) {
> - shrink_dcache_parent(dentry);
> + if (!(current->flags & PF_EXITING))
> + shrink_dcache_parent(dentry);
> d_drop(dentry);
> dput(dentry);
> }
What are the implications of not running shrink_dcache_parent() on the exit
path sometimes? We'll leave procfs stuff behind? Will they be reaped by
memory pressure later on?
next prev parent reply other threads:[~2007-11-27 22:39 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-11-27 13:20 /proc dcache deadlock in do_exit Andrea Arcangeli
2007-11-27 22:38 ` Andrew Morton [this message]
2007-11-28 1:21 ` Andrea Arcangeli
2007-11-28 1:45 ` Andrea Arcangeli
2007-11-28 2:07 ` Eric W. Biederman
2007-11-28 1:53 ` Eric W. Biederman
2007-11-28 1:43 ` Eric W. Biederman
2007-11-28 16:30 ` Oleg Nesterov
2007-11-28 2:09 ` Eric W. Biederman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20071127143852.601509ac.akpm@linux-foundation.org \
--to=akpm@linux-foundation.org \
--cc=adobriyan@gmail.com \
--cc=andrea@suse.de \
--cc=ebiederm@xmission.com \
--cc=jack@suse.cz \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.