From mboxrd@z Thu Jan  1 00:00:00 1970
From: Al Viro <viro@ZenIV.linux.org.uk>
Subject: Re: ->d_lock FUBAR (was Re: Linux 3.0-rc6)
Date: Wed, 13 Jul 2011 01:56:34 +0100
Message-ID: <20110713005634.GK11013@ZenIV.linux.org.uk>
References: <CA+55aFxQw4T07hxz8JSm12x3FOH_Dcf=G5mvLrxiTuLxjbw+Mg@mail.gmail.com>
 <20110711060315.GI11013@ZenIV.linux.org.uk>
 <20110712234806.GJ11013@ZenIV.linux.org.uk>
 <CA+55aFxmJDOpx6grff73tAaGtVWbXNLwwzZLUmL5V8ESn9ij+w@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-fsdevel@vger.kernel.org, npiggin@kernel.dk
To: Linus Torvalds <torvalds@linux-foundation.org>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from zeniv.linux.org.uk ([195.92.253.2]:42965 "EHLO
	ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753709Ab1GMA4f (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Tue, 12 Jul 2011 20:56:35 -0400
Content-Disposition: inline
In-Reply-To: <CA+55aFxmJDOpx6grff73tAaGtVWbXNLwwzZLUmL5V8ESn9ij+w@mail.gmail.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Tue, Jul 12, 2011 at 05:04:55PM -0700, Linus Torvalds wrote:

> > All flakiness of the locking "order" aside, here we simply lock two dentries
> > that might be nowhere near each other by now. ?Hell, by that point the parent
> > might've been moved under (what used to be) child. ?Or it might have address
> > greater than that of child and be not an ancestor anymore. ?Note that no
> > i_mutex, etc. is held at that point, so there's no external serialization
> > to save our arses...
> 
> So why wouldn't it be sufficient to just take the s_vfs_rename_mutex
> in the caller? We're only talking d_materialise_unique(), no?

Alas, no.  d_materialize_unique() aside (we have a couple of bugs in
there), ->d_lock handling is fscked up.  Basically, it's unlazy_walk()
taking ->d_lock instances without anything to guarantee that it's doing
that in order consistent with d_move() (and any other users, for that
matter).  *And* the "locking order" in question is not transitive, but
that's a separate story.  And no, we obviously are not going to
serialize the switch from RCU to normal pathwalk on a per-fs mutex...

I'm crawling through the d_lock users right now (>400 places ;-/), trying
to put together reasonable locking rules.

FWIW, we used to have this for d_move() et.al.:
	* all places changing ->d_parent hold i_mutex on parent(s)
	* if parents differ, we have ->s_vfs_rename_mutex as well
	* if old_dentry was not attached to the tree, it was positive
and a subdirectory of new_dentry->d_parent.  Said new_dentry->d_parent
was been locked by caller.

Unfortunately, current tree has these rules violated in several places.
And these rules have nothing to say about ->d_lock ;-/

Nick, could you please describe the locking rules you had in mind for
->d_lock?  unlazy_walk() (aka nameidata_dentry_drop_rcu()) can probably
be dealt with by checking d_seq twice, once before locking the child.
Then we could be sure that it's still a child of parent and will stay
so as long as parent's ->d_lock is held, and thus the ordering would
stay stable...