From mboxrd@z Thu Jan  1 00:00:00 1970
From: Al Viro <viro@ZenIV.linux.org.uk>
Subject: Re: [2.6.38] Deadlock between rename_lock and vfsmount_lock.
Date: Fri, 18 Mar 2011 12:07:48 +0000
Message-ID: <20110318120748.GH22723@ZenIV.linux.org.uk>
References: <201103160854.p2G8sR6c077737@www262.sakura.ne.jp>
 <201103170501.p2H51PjU052428@www262.sakura.ne.jp>
 <201103181959.CHC73937.SQOLJtVHOMFFOF@I-love.SAKURA.ne.jp>
 <20110318110603.GG22723@ZenIV.linux.org.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: npiggin@kernel.dk, linux-fsdevel@vger.kernel.org
To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from zeniv.linux.org.uk ([195.92.253.2]:58911 "EHLO
	ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754064Ab1CRMHu (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Fri, 18 Mar 2011 08:07:50 -0400
Content-Disposition: inline
In-Reply-To: <20110318110603.GG22723@ZenIV.linux.org.uk>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Fri, Mar 18, 2011 at 11:06:03AM +0000, Al Viro wrote:

> That's incredibly ugly.  I agree that the deadlock exists and needs to be
> dealt with, but not that way.  _Strongly_ NAKed.  I'll see what I can come
> up with, but that variant is not an option.

Actually, why do we hold vfsmount_lock over that loop at all?  We already
hold namespace_sem, so ->mnt_parent is protected...

FWIW, there's a thing I really don't like about that area - I would really
prefer to have namespace_sem nest _inside_ i_mutex and have no fs operations
whatsoever done under it.  Note that we already take care (mostly) to
keep actual fs shutdowns outside of it.

The real obstacle is follow_down() we do under namespace_sem on several
paths; otherwise we'd be able to grab i_mutex first and be done with that.
Moreover, we have extra ugliness in follow_down(); I really wonder
what is the only instance trying to do here.

Look: we pass mounting_here == true to autofs4_d_manage().  It sees the
flag, then checks if we have DCACHE_MOUNTED set and returns either -EISDIR
or 0.  -EISDIR leads to follow_down() returning 0 immediately.  0 leads
to trying to cross the mountpoint.  First of all, we could as well return
0 regardless - if DCACHE_MOUNTED is not set, follow_down() will see that
and return 0 since there's nothing left to do.

What's more, we have a funny situation here - we might as well replace
                if (managed & DCACHE_MANAGE_TRANSIT) {
with
                if (!mounting_here && managed & DCACHE_MANAGE_TRANSIT) {
in follow_down() and lose that argument of ->d_manage().

So let's do this:

lock_mount(path, follow)
retry:
	lock path->dentry->d_inode
	if unlikely(can't mount)
		unlock
		fail
	grab namespace_sem
	if !follow || likely(lookup_mnt() returns NULL)
		we are done
	drop namespace_sem
	unlock
	drop path
	replace it with result of lookup_mnt() (and its ->mnt_root)
	goto retry;

and use that (local to fs/namespace.c) in do_add_mount()/do_move_mount()/
do_loopback() (with follow = 1) and pivot_root() (follow = 0).  BTW,
the lack of following in do_loopback() looks like a bug...
Also in do_loopback(): take release_mounts() after unlocking everything.
graft_tree() doesn't grab i_mutex - callers take it.
follow_down() loses mounting_here argument and so does ->d_manage().
cant_mount() calls are all merged into one in lock_mount().

Comments?