races in ll_splice

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* races in ll_splice_alias()
@ 2016-03-08 16:05 Al Viro
  2016-03-08 20:44 ` Drokin, Oleg
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2016-03-08 16:05 UTC (permalink / raw)
  To: Drokin, Oleg; +Cc: Linus Torvalds, linux-fsdevel

	Suppose the last dentry refering to an inode is being evicted.
It went through __dentry_kill(), into dentry_iput() and called ->d_iput().
At that stage it is unhashed, out of the list of children, out of the list
of aliases and does not refer to inode anymore.  No locks are held and
it's about to call iput().

	In the meanwhile, two lookups have found links to that inode.  The
same in-core inode has already been picked by ll_iget() by both.  Their
dentries are unhashed, so ll_lookup_it_finish() hits
		alias = ll_splice_alias(inode, *de);
in both.  inode is non-NULL, so
                new = ll_find_alias(inode, de);
is done.
static struct dentry *ll_find_alias(struct inode *inode, struct dentry *dentry)
{
        struct dentry *alias, *discon_alias, *invalid_alias;

        if (hlist_empty(&inode->i_dentry))
                return NULL;

        discon_alias = invalid_alias = NULL;

        ll_lock_dcache(inode);

ends up returning NULL for both (ll_lock_dcache == grab ->i_lock).  So we
get NULL from ll_find_alias(), proceed to allocate ->d_fsdata for both
dentries, then do d_add(de, inode).  See the problem?

We do not hold ->i_lock between ll_find_alias() and d_add(), so there's
nothing to guarantee that negative result will stay valid.  Even had we kept
it (and with block allocation it would require some work), decision that
there's no aliases at all is made _outside_ of ->i_lock, so we have a race
there as well.

Moreover, ll_find_alias() itself is rather fishy - it accepts any alias that
happens to have DCACHE_DISCONNECTED and anything unhashed with exact match
to parent/name/hash.  The former part is trouble - DCACHE_DISCONNECTED is
cleared *after* a detached subtree had been glued in.  It's _not_ equivalent
to "it's a root of detached subtree".

IOW, that thing can return dentries that do have parent other than the one
we'd been looking for.  Which means that d_move() done by ll_splice_alias()
after having found one is short on locking.

Another oddity: you have ll_d_init() called on the result of ll_find_alias(),
but that alias must have either been through ll_splice_alias() before or
through d_instantiate() in ll_lookup_it_finish(), or it wouldn't have been
positive.  And the second variant required that it had been hashed at the
time, which means that it had been through ll_splice_alias() at some earlier
point.  Once ->d_fsdata had been allocated and set, it's never cleared until
we free that dentry, so how the hell is it possible for that call of
ll_d_init() on result of ll_find_alias() to actually do anything at all?

Am I right assuming that it's basically trying to do d_splice_alias() with
a couple of twists - we are likely to find an exact unhashed alias (due to
promiscuous ->d_revalidate()?) we want to try and reuse even for
non-directories and it's likely enough to make unconditional allocation of
->d_fsdata a bad idea since dentry is likely to be replaced by alias?

If so, how about adding d_hash_exact_alias(dentry) that would try to find
and hash an exact unhashed alias, so that this thing would
	* call d_hash_exact_alias(dentry), if found - just return it
	* ll_d_init(dentry)
	* return d_splice_alias() ?: dentry
Do you see any problems with that?  Parent directory is locked, so no new
unhashed exact aliases should have a chance to appear and d_splice_alias()
would take care of everything else (correctness and detached ones).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias()
  2016-03-08 16:05 races in ll_splice_alias() Al Viro
@ 2016-03-08 20:44 ` Drokin, Oleg
  2016-03-08 21:11   ` Al Viro
  0 siblings, 1 reply; 30+ messages in thread
From: Drokin, Oleg @ 2016-03-08 20:44 UTC (permalink / raw)
  To: Al Viro, Dilger, Andreas
  Cc: Linus Torvalds, <linux-fsdevel@vger.kernel.org>


On Mar 8, 2016, at 11:05 AM, Al Viro wrote:

> 	Suppose the last dentry refering to an inode is being evicted.
> It went through __dentry_kill(), into dentry_iput() and called ->d_iput().
> At that stage it is unhashed, out of the list of children, out of the list
> of aliases and does not refer to inode anymore.  No locks are held and
> it's about to call iput().

ok.

> 	In the meanwhile, two lookups have found links to that inode.  The

The links are hardlinks, right? (because otherwise they would not be
parallel due to parent dir i_mutex held).

> same in-core inode has already been picked by ll_iget() by both.  Their
> dentries are unhashed, so ll_lookup_it_finish() hits
> 		alias = ll_splice_alias(inode, *de);
> in both.  inode is non-NULL, so
>                new = ll_find_alias(inode, de);
> is done.
> static struct dentry *ll_find_alias(struct inode *inode, struct dentry *dentry)
> {
>        struct dentry *alias, *discon_alias, *invalid_alias;
> 
>        if (hlist_empty(&inode->i_dentry))
>                return NULL;
> 
>        discon_alias = invalid_alias = NULL;
> 
>        ll_lock_dcache(inode);
> 
> ends up returning NULL for both (ll_lock_dcache == grab ->i_lock).  So we
> get NULL from ll_find_alias(), proceed to allocate ->d_fsdata for both
> dentries, then do d_add(de, inode).  See the problem?
> 
> We do not hold ->i_lock between ll_find_alias() and d_add(), so there's
> nothing to guarantee that negative result will stay valid.  Even had we kept
> it (and with block allocation it would require some work), decision that
> there's no aliases at all is made _outside_ of ->i_lock, so we have a race
> there as well.

Hm, The race is a "safe" one, since if the alias have appeared, it does not break
anything other than using up some RAM, I think?
In fact what's to stop one appearing after we released the locking leading to the
same situation?

> Moreover, ll_find_alias() itself is rather fishy - it accepts any alias that
> happens to have DCACHE_DISCONNECTED and anything unhashed with exact match
> to parent/name/hash.  The former part is trouble - DCACHE_DISCONNECTED is
> cleared *after* a detached subtree had been glued in.  It's _not_ equivalent
> to "it's a root of detached subtree".

yes, this indeed seems troublesome.

> IOW, that thing can return dentries that do have parent other than the one
> we'd been looking for.  Which means that d_move() done by ll_splice_alias()
> after having found one is short on locking.
> 
> Another oddity: you have ll_d_init() called on the result of ll_find_alias(),
> but that alias must have either been through ll_splice_alias() before or
> through d_instantiate() in ll_lookup_it_finish(), or it wouldn't have been
> positive.  And the second variant required that it had been hashed at the
> time, which means that it had been through ll_splice_alias() at some earlier
> point.  Once ->d_fsdata had been allocated and set, it's never cleared until
> we free that dentry, so how the hell is it possible for that call of
> ll_d_init() on result of ll_find_alias() to actually do anything at all?

I traced a bit through the history and apparently it used to be a call
that set dentry->d_op before, which is no longer necessary after 2.6.38

> Am I right assuming that it's basically trying to do d_splice_alias() with
> a couple of twists - we are likely to find an exact unhashed alias (due to
> promiscuous ->d_revalidate()?) we want to try and reuse even for
> non-directories and it's likely enough to make unconditional allocation of
> ->d_fsdata a bad idea since dentry is likely to be replaced by alias?

Yes, pretty much.
Lustre has "locks" that guard metadata and when locks are revoked (due to
conflicting accesses) the dentries are "invalidated", but since users
will likely come for them again we do not want to use new dentry every time
and prefer to find old ones if possible.

> If so, how about adding d_hash_exact_alias(dentry) that would try to find
> and hash an exact unhashed alias, so that this thing would
> 	* call d_hash_exact_alias(dentry), if found - just return it
> 	* ll_d_init(dentry)
> 	* return d_splice_alias() ?: dentry
> Do you see any problems with that?  Parent directory is locked, so no new
> unhashed exact aliases should have a chance to appear and d_splice_alias()
> would take care of everything else (correctness and detached ones).

This sounds fine on the surface. I think I remember there were some other
complications with d_splice_alias.
Andreas, do ou remember?

I'll try to research it a bit more and follow up soon.

Thanks for looking into this.

Bye,
    Oleg


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias()
  2016-03-08 20:44 ` Drokin, Oleg
@ 2016-03-08 21:11   ` Al Viro
  2016-03-08 23:18     ` Drokin, Oleg
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2016-03-08 21:11 UTC (permalink / raw)
  To: Drokin, Oleg
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>

On Tue, Mar 08, 2016 at 08:44:24PM +0000, Drokin, Oleg wrote:

> The links are hardlinks, right? (because otherwise they would not be
> parallel due to parent dir i_mutex held).

Yes.

> Hm, The race is a "safe" one, since if the alias have appeared, it does not break
> anything other than using up some RAM, I think?
> In fact what's to stop one appearing after we released the locking leading to the
> same situation?

Kinda-sorta.  It's safe unless a rename on server gets you see the same
directory in two places.  d_splice_alias() would've coped with that, this
code won't.

> > If so, how about adding d_hash_exact_alias(dentry) that would try to find
> > and hash an exact unhashed alias, so that this thing would
> > 	* call d_hash_exact_alias(dentry), if found - just return it
> > 	* ll_d_init(dentry)
> > 	* return d_splice_alias() ?: dentry
> > Do you see any problems with that?  Parent directory is locked, so no new
> > unhashed exact aliases should have a chance to appear and d_splice_alias()
> > would take care of everything else (correctness and detached ones).
> 
> This sounds fine on the surface. I think I remember there were some other
> complications with d_splice_alias.
> Andreas, do ou remember?

FWIW, a patch in my queue kills d_add_unique(), replacing it with
d_exact_alias() and d_splice_alias(); it could be used to implement
ll_splice_alias() as well (with code duplication gone *and* capable
of dealing with directories moving around), except for the odd
rules re inode refcount on error; it would've been easier if ll_splice_alias()
would always either inserted inode reference into a new dentry or dropped it.
I'm still trying to trace what does iput() in case of error in your current
code; I understand the one in do_statahead_enter(), but what does it in
ll_lookup_it_finish()?

Anyway, d_add_unique() removal (and switching its only caller to replacement)
follows:

replace d_add_unique() with saner primitive

new primitive: d_exact_alias(dentry, inode).  If there is an unhashed
dentry with the same name/parent and given inode, rehash, grab and
return it.  Otherwise, return NULL.  The only caller of d_add_unique()
switched to d_exact_alias() + d_splice_alias().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/dcache.c b/fs/dcache.c
index 2398f9f9..4d20bf5 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1782,81 +1782,6 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
 EXPORT_SYMBOL(d_instantiate);
 
 /**
- * d_instantiate_unique - instantiate a non-aliased dentry
- * @entry: dentry to instantiate
- * @inode: inode to attach to this dentry
- *
- * Fill in inode information in the entry. On success, it returns NULL.
- * If an unhashed alias of "entry" already exists, then we return the
- * aliased dentry instead and drop one reference to inode.
- *
- * Note that in order to avoid conflicts with rename() etc, the caller
- * had better be holding the parent directory semaphore.
- *
- * This also assumes that the inode count has been incremented
- * (or otherwise set) by the caller to indicate that it is now
- * in use by the dcache.
- */
-static struct dentry *__d_instantiate_unique(struct dentry *entry,
-					     struct inode *inode)
-{
-	struct dentry *alias;
-	int len = entry->d_name.len;
-	const char *name = entry->d_name.name;
-	unsigned int hash = entry->d_name.hash;
-
-	if (!inode) {
-		__d_instantiate(entry, NULL);
-		return NULL;
-	}
-
-	hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
-		/*
-		 * Don't need alias->d_lock here, because aliases with
-		 * d_parent == entry->d_parent are not subject to name or
-		 * parent changes, because the parent inode i_mutex is held.
-		 */
-		if (alias->d_name.hash != hash)
-			continue;
-		if (alias->d_parent != entry->d_parent)
-			continue;
-		if (alias->d_name.len != len)
-			continue;
-		if (dentry_cmp(alias, name, len))
-			continue;
-		__dget(alias);
-		return alias;
-	}
-
-	__d_instantiate(entry, inode);
-	return NULL;
-}
-
-struct dentry *d_instantiate_unique(struct dentry *entry, struct inode *inode)
-{
-	struct dentry *result;
-
-	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
-
-	if (inode)
-		spin_lock(&inode->i_lock);
-	result = __d_instantiate_unique(entry, inode);
-	if (inode)
-		spin_unlock(&inode->i_lock);
-
-	if (!result) {
-		security_d_instantiate(entry, inode);
-		return NULL;
-	}
-
-	BUG_ON(!d_unhashed(result));
-	iput(inode);
-	return result;
-}
-
-EXPORT_SYMBOL(d_instantiate_unique);
-
-/**
  * d_instantiate_no_diralias - instantiate a non-aliased dentry
  * @entry: dentry to complete
  * @inode: inode to attach to this dentry
@@ -2437,6 +2362,56 @@ void d_rehash(struct dentry * entry)
 EXPORT_SYMBOL(d_rehash);
 
 /**
+ * d_exact_alias - find and hash an exact unhashed alias
+ * @entry: dentry to add
+ * @inode: The inode to go with this dentry
+ *
+ * If an unhashed dentry with the same name/parent and desired
+ * inode already exists, hash and return it.  Otherwise, return
+ * NULL.
+ *
+ * Parent directory should be locked.
+ */
+struct dentry *d_exact_alias(struct dentry *entry, struct inode *inode)
+{
+	struct dentry *alias;
+	int len = entry->d_name.len;
+	const char *name = entry->d_name.name;
+	unsigned int hash = entry->d_name.hash;
+
+	spin_lock(&inode->i_lock);
+	hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+		/*
+		 * Don't need alias->d_lock here, because aliases with
+		 * d_parent == entry->d_parent are not subject to name or
+		 * parent changes, because the parent inode i_mutex is held.
+		 */
+		if (alias->d_name.hash != hash)
+			continue;
+		if (alias->d_parent != entry->d_parent)
+			continue;
+		if (alias->d_name.len != len)
+			continue;
+		if (dentry_cmp(alias, name, len))
+			continue;
+		spin_lock(&alias->d_lock);
+		if (!d_unhashed(alias)) {
+			spin_unlock(&alias->d_lock);
+			alias = NULL;
+		} else {
+			__dget_dlock(alias);
+			_d_rehash(alias);
+			spin_unlock(&alias->d_lock);
+		}
+		spin_unlock(&inode->i_lock);
+		return alias;
+	}
+	spin_unlock(&inode->i_lock);
+	return NULL;
+}
+EXPORT_SYMBOL(d_exact_alias);
+
+/**
  * dentry_update_name_case - update case insensitive dentry with a new name
  * @dentry: dentry to be updated
  * @name: new name
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 4bfc33a..9a5d67f 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2461,14 +2461,16 @@ static int _nfs4_open_and_get_state(struct nfs4_opendata *opendata,
 
 	dentry = opendata->dentry;
 	if (d_really_is_negative(dentry)) {
-		/* FIXME: Is this d_drop() ever needed? */
+		struct dentry *alias;
 		d_drop(dentry);
-		dentry = d_add_unique(dentry, igrab(state->inode));
-		if (dentry == NULL) {
-			dentry = opendata->dentry;
-		} else if (dentry != ctx->dentry) {
+		alias = d_exact_alias(dentry, state->inode);
+		if (!alias)
+			alias = d_splice_alias(igrab(state->inode), dentry);
+		/* d_splice_alias() can't fail here - it's a non-directory */
+		if (alias) {
+			dentry = dget(alias);
 			dput(ctx->dentry);
-			ctx->dentry = dget(dentry);
+			ctx->dentry = dentry;
 		}
 		nfs_set_verifier(dentry,
 				nfs_save_change_attribute(d_inode(opendata->dir)));
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index c4b5f4b..bda4ec5 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -246,6 +246,7 @@ extern struct dentry * d_alloc(struct dentry *, const struct qstr *);
 extern struct dentry * d_alloc_pseudo(struct super_block *, const struct qstr *);
 extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
 extern struct dentry * d_add_ci(struct dentry *, struct inode *, struct qstr *);
+extern struct dentry * d_exact_alias(struct dentry *, struct inode *);
 extern struct dentry *d_find_any_alias(struct inode *inode);
 extern struct dentry * d_obtain_alias(struct inode *);
 extern struct dentry * d_obtain_root(struct inode *);
@@ -288,23 +289,6 @@ static inline void d_add(struct dentry *entry, struct inode *inode)
 	d_rehash(entry);
 }
 
-/**
- * d_add_unique - add dentry to hash queues without aliasing
- * @entry: dentry to add
- * @inode: The inode to attach to this dentry
- *
- * This adds the entry to the hash queues and initializes @inode.
- * The entry was actually filled in earlier during d_alloc().
- */
-static inline struct dentry *d_add_unique(struct dentry *entry, struct inode *inode)
-{
-	struct dentry *res;
-
-	res = d_instantiate_unique(entry, inode);
-	d_rehash(res != NULL ? res : entry);
-	return res;
-}
-
 extern void dentry_update_name_case(struct dentry *, struct qstr *);
 
 /* used for rename() and baskets */

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias()
  2016-03-08 21:11   ` Al Viro
@ 2016-03-08 23:18     ` Drokin, Oleg
  2016-03-09  0:34       ` Al Viro
  0 siblings, 1 reply; 30+ messages in thread
From: Drokin, Oleg @ 2016-03-08 23:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>


On Mar 8, 2016, at 4:11 PM, Al Viro wrote:

> On Tue, Mar 08, 2016 at 08:44:24PM +0000, Drokin, Oleg wrote:
> 
>> The links are hardlinks, right? (because otherwise they would not be
>> parallel due to parent dir i_mutex held).
> 
> Yes.
> 
>> Hm, The race is a "safe" one, since if the alias have appeared, it does not break
>> anything other than using up some RAM, I think?
>> In fact what's to stop one appearing after we released the locking leading to the
>> same situation?
> 
> Kinda-sorta.  It's safe unless a rename on server gets you see the same
> directory in two places.  d_splice_alias() would've coped with that, this
> code won't.

Rename on server cannot get us to see the same directory in two places, or what's
the scenario?
In Lustre:
thread1: lookup a directory on the server.
         get a lock back
         instantiate a dentry (guarded by the lock)
         make a lock revocable (ll_lookup_finish_locks() in ll_lookup_it())
thread2: rename the directory moving it somewhere else
         server attempts to revoke the lock from thread1
         node that runs thread1 drops the lock and marks all dentries for that
              inode invalid
         server completes rename and releases the lock it holds
thread3: lookup the directory under new name on the server
         this can only return from server when the rename has completed and the
         dentry from thread1 marked invalid.

>>> If so, how about adding d_hash_exact_alias(dentry) that would try to find
>>> and hash an exact unhashed alias, so that this thing would
>>> 	* call d_hash_exact_alias(dentry), if found - just return it
>>> 	* ll_d_init(dentry)
>>> 	* return d_splice_alias() ?: dentry
>>> Do you see any problems with that?  Parent directory is locked, so no new
>>> unhashed exact aliases should have a chance to appear and d_splice_alias()
>>> would take care of everything else (correctness and detached ones).
>> 
>> This sounds fine on the surface. I think I remember there were some other
>> complications with d_splice_alias.
>> Andreas, do ou remember?
> 
> FWIW, a patch in my queue kills d_add_unique(), replacing it with
> d_exact_alias() and d_splice_alias(); it could be used to implement
> ll_splice_alias() as well (with code duplication gone *and* capable
> of dealing with directories moving around), except for the odd
> rules re inode refcount on error; it would've been easier if ll_splice_alias()

Ok, let me try my hand at that. Hopefully whatever complications are there would
show themselves right away too.

> would always either inserted inode reference into a new dentry or dropped it.
> I'm still trying to trace what does iput() in case of error in your current
> code; I understand the one in do_statahead_enter(), but what does it in
> ll_lookup_it_finish()?

You mean when ll_d_init() fails in ll_splice_alias()?
Hmï¿½ It appears that we are indeed leaking the inode in that case, thanks.
I'll try to address that too.
Hm, in fact this was almost noticed, but I guess nobody retested it after
fixing the initial crash we had with 7486bc06ab2c46d6957f0211d09bc549aaf9cc87

Thanks.

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias()
  2016-03-08 23:18     ` Drokin, Oleg
@ 2016-03-09  0:34       ` Al Viro
  2016-03-09  0:53         ` Drokin, Oleg
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2016-03-09  0:34 UTC (permalink / raw)
  To: Drokin, Oleg
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>

On Tue, Mar 08, 2016 at 11:18:09PM +0000, Drokin, Oleg wrote:
> Rename on server cannot get us to see the same directory in two places, or what's
> the scenario?
> In Lustre:
> thread1: lookup a directory on the server.
>          get a lock back
>          instantiate a dentry (guarded by the lock)
>          make a lock revocable (ll_lookup_finish_locks() in ll_lookup_it())
> thread2: rename the directory moving it somewhere else
>          server attempts to revoke the lock from thread1
>          node that runs thread1 drops the lock and marks all dentries for that
>               inode invalid
>          server completes rename and releases the lock it holds
> thread3: lookup the directory under new name on the server
>          this can only return from server when the rename has completed and the
>          dentry from thread1 marked invalid.

thread2 might run on another client; might or might not make any difference,
but in any case server going nuts shouldn't corrupt data structures on client...

> Ok, let me try my hand at that. Hopefully whatever complications are there would
> show themselves right away too.
> 
> > would always either inserted inode reference into a new dentry or dropped it.
> > I'm still trying to trace what does iput() in case of error in your current
> > code; I understand the one in do_statahead_enter(), but what does it in
> > ll_lookup_it_finish()?
> 
> You mean when ll_d_init() fails in ll_splice_alias()?
> Hm… It appears that we are indeed leaking the inode in that case, thanks.
> I'll try to address that too.
> Hm, in fact this was almost noticed, but I guess nobody retested it after
> fixing the initial crash we had with 7486bc06ab2c46d6957f0211d09bc549aaf9cc87

If that's the case, I'd try this (on top of the patch from upthread):

diff --git a/drivers/staging/lustre/lustre/llite/namei.c b/drivers/staging/lustre/lustre/llite/namei.c
index da5f443..bcc9841 100644
--- a/drivers/staging/lustre/lustre/llite/namei.c
+++ b/drivers/staging/lustre/lustre/llite/namei.c
@@ -320,81 +320,37 @@ void ll_i2gids(__u32 *suppgids, struct inode *i1, struct inode *i2)
 }
 
 /*
- * try to reuse three types of dentry:
- * 1. unhashed alias, this one is unhashed by d_invalidate (but it may be valid
- *    by concurrent .revalidate).
- * 2. INVALID alias (common case for no valid ldlm lock held, but this flag may
- *    be cleared by others calling d_lustre_revalidate).
- * 3. DISCONNECTED alias.
- */
-static struct dentry *ll_find_alias(struct inode *inode, struct dentry *dentry)
-{
-	struct dentry *alias, *discon_alias, *invalid_alias;
-
-	if (hlist_empty(&inode->i_dentry))
-		return NULL;
-
-	discon_alias = invalid_alias = NULL;
-
-	ll_lock_dcache(inode);
-	hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
-		LASSERT(alias != dentry);
-
-		spin_lock(&alias->d_lock);
-		if (alias->d_flags & DCACHE_DISCONNECTED)
-			/* LASSERT(last_discon == NULL); LU-405, bz 20055 */
-			discon_alias = alias;
-		else if (alias->d_parent == dentry->d_parent	     &&
-			 alias->d_name.hash == dentry->d_name.hash       &&
-			 alias->d_name.len == dentry->d_name.len	 &&
-			 memcmp(alias->d_name.name, dentry->d_name.name,
-				dentry->d_name.len) == 0)
-			invalid_alias = alias;
-		spin_unlock(&alias->d_lock);
-
-		if (invalid_alias)
-			break;
-	}
-	alias = invalid_alias ?: discon_alias ?: NULL;
-	if (alias) {
-		spin_lock(&alias->d_lock);
-		dget_dlock(alias);
-		spin_unlock(&alias->d_lock);
-	}
-	ll_unlock_dcache(inode);
-
-	return alias;
-}
-
-/*
  * Similar to d_splice_alias(), but lustre treats invalid alias
  * similar to DCACHE_DISCONNECTED, and tries to use it anyway.
  */
 struct dentry *ll_splice_alias(struct inode *inode, struct dentry *de)
 {
-	struct dentry *new;
+	struct dentry *alias = NULL;
 	int rc;
 
 	if (inode) {
-		new = ll_find_alias(inode, de);
-		if (new) {
-			rc = ll_d_init(new);
-			if (rc < 0) {
-				dput(new);
-				return ERR_PTR(rc);
-			}
-			d_move(new, de);
+		alias = d_exact_alias(de, inode);
+		if (alias)
+			iput(inode);
+	}
+
+	if (!alias) {
+		rc = ll_d_init(de);
+		if (rc < 0) {
 			iput(inode);
-			CDEBUG(D_DENTRY,
-			       "Reuse dentry %p inode %p refc %d flags %#x\n",
-			      new, d_inode(new), d_count(new), new->d_flags);
-			return new;
+			return ERR_PTR(rc);
 		}
+		alias = d_splice_alias(inode, de);
+		if (IS_ERR(alias))
+			return alias;
+	}
+
+	if (alias) {
+		CDEBUG(D_DENTRY,
+		       "Reuse dentry %p inode %p refc %d flags %#x\n",
+		      alias, d_inode(alias), d_count(alias), alias->d_flags);
+		return alias;
 	}
-	rc = ll_d_init(de);
-	if (rc < 0)
-		return ERR_PTR(rc);
-	d_add(de, inode);
 	CDEBUG(D_DENTRY, "Add dentry %p inode %p refc %d flags %#x\n",
 	       de, d_inode(de), d_count(de), de->d_flags);
 	return de;
diff --git a/drivers/staging/lustre/lustre/llite/statahead.c b/drivers/staging/lustre/lustre/llite/statahead.c
index 88ffd8e..6c64de0 100644
--- a/drivers/staging/lustre/lustre/llite/statahead.c
+++ b/drivers/staging/lustre/lustre/llite/statahead.c
@@ -1576,6 +1576,7 @@ int do_statahead_enter(struct inode *dir, struct dentry **dentryp,
 					alias = ll_splice_alias(inode,
 								   *dentryp);
 					if (IS_ERR(alias)) {
+						entry->se_inode = NULL;
 						ll_sai_unplug(sai, entry);
 						return PTR_ERR(alias);
 					}

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias()
  2016-03-09  0:34       ` Al Viro
@ 2016-03-09  0:53         ` Drokin, Oleg
  2016-03-09  1:26           ` Al Viro
  0 siblings, 1 reply; 30+ messages in thread
From: Drokin, Oleg @ 2016-03-09  0:53 UTC (permalink / raw)
  To: Al Viro
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>


On Mar 8, 2016, at 7:34 PM, Al Viro wrote:

> On Tue, Mar 08, 2016 at 11:18:09PM +0000, Drokin, Oleg wrote:
>> Rename on server cannot get us to see the same directory in two places, or what's
>> the scenario?
>> In Lustre:
>> thread1: lookup a directory on the server.
>>         get a lock back
>>         instantiate a dentry (guarded by the lock)
>>         make a lock revocable (ll_lookup_finish_locks() in ll_lookup_it())
>> thread2: rename the directory moving it somewhere else
>>         server attempts to revoke the lock from thread1
>>         node that runs thread1 drops the lock and marks all dentries for that
>>              inode invalid
>>         server completes rename and releases the lock it holds
>> thread3: lookup the directory under new name on the server
>>         this can only return from server when the rename has completed and the
>>         dentry from thread1 marked invalid.
> 
> thread2 might run on another client; might or might not make any difference,
> but in any case server going nuts shouldn't corrupt data structures on clientï¿½

It does not matter indeed if thread2 is run on any particular client due
to the lustre locking.

>> Ok, let me try my hand at that. Hopefully whatever complications are there would
>> show themselves right away too.
>> 
>>> would always either inserted inode reference into a new dentry or dropped it.
>>> I'm still trying to trace what does iput() in case of error in your current
>>> code; I understand the one in do_statahead_enter(), but what does it in
>>> ll_lookup_it_finish()?
>> 
>> You mean when ll_d_init() fails in ll_splice_alias()?
>> Hmï¿½ It appears that we are indeed leaking the inode in that case, thanks.
>> I'll try to address that too.
>> Hm, in fact this was almost noticed, but I guess nobody retested it after
>> fixing the initial crash we had with 7486bc06ab2c46d6957f0211d09bc549aaf9cc87
> 
> If that's the case, I'd try this (on top of the patch from upthread):

I have not tried this exact one yet (will try in a moment).
But it appears we do need ll_d_init() call if we got a hit
in d_exact_alias().
A particular test case fails due to lack of the d_fsdata being null.

So far I hit it in case of unix sockets with a trace like this:

[  291.019261] Lustre: DEBUG MARKER: == sanity test 54a: unix domain socket test ========================================================== 19:31:19 (1457483479)
[  291.127120] LustreError: 4984:0:(llite_internal.h:1462:d_lustre_revalidate()) ASSERTION( ((struct ll_dentry_data *)((dentry)->d_fsdata)) ) failed: 
[  291.127900] LustreError: 4984:0:(llite_internal.h:1462:d_lustre_revalidate()) LBUG
[  291.128626] CPU: 5 PID: 4984 Comm: socketclient Tainted: G         C      4.5.0-rc6-vm-nfs+ #74
[  291.129340] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  291.129343]  0000000000000000 ffff8800d7a379a0 ffffffff81413801 ffffffffa0464200
[  291.129343]  ffff8800d3b05f50 ffff8800d7a379c0 ffffffffa016ca36 0000000000000000
[  291.129344]  ffff8800d3b05ed0 ffff8800d7a37a60 ffffffffa0438680 ffff8800c22d3000
[  291.129345] Call Trace:
[  291.129349]  [<ffffffff81413801>] dump_stack+0x63/0x82
[  291.129356]  [<ffffffffa016ca36>] lbug_with_loc+0x46/0xb0 [libcfs]
[  291.129368]  [<ffffffffa0438680>] ll_lookup_it_finish+0x610/0x970 [lustre]
[  291.129374]  [<ffffffffa042dace>] ? ll_finish_md_op_data+0xe/0x10 [lustre]
[  291.129380]  [<ffffffffa0438c6b>] ll_lookup_it+0x28b/0x680 [lustre]
[  291.129386]  [<ffffffffa04375a0>] ? ll_rmdir+0x390/0x390 [lustre]
[  291.129392]  [<ffffffffa0439a23>] ll_lookup_nd+0x73/0x120 [lustre]
[  291.129394]  [<ffffffff8122796d>] lookup_real+0x1d/0x60
[  291.129395]  [<ffffffff81227e23>] __lookup_hash+0x33/0x40
[  291.129396]  [<ffffffff8122a71c>] walk_component+0x1bc/0x510
[  291.129397]  [<ffffffff8122ad50>] ? link_path_walk+0x2e0/0x500
[  291.129398]  [<ffffffff8122bd7a>] ? path_init+0x3ca/0x5a0
[  291.129400]  [<ffffffff8122c02d>] path_lookupat+0x5d/0x110
[  291.129401]  [<ffffffff8122e07e>] filename_lookup+0x9e/0x150
[  291.129403]  [<ffffffff811f9591>] ? __kmalloc_node_track_caller+0x31/0x40
[  291.129404]  [<ffffffff811f9591>] ? __kmalloc_node_track_caller+0x31/0x40
[  291.129405]  [<ffffffff8122db24>] ? getname_kernel+0x34/0x120
[  291.129406]  [<ffffffff8122db7d>] ? getname_kernel+0x8d/0x120
[  291.129407]  [<ffffffff8122e15b>] kern_path+0x2b/0x30
[  291.129408]  [<ffffffff81741928>] unix_find_other+0x38/0x1f0
[  291.129409]  [<ffffffff81741bd8>] unix_stream_connect+0xf8/0x480
[  291.129411]  [<ffffffff8166b817>] SYSC_connect+0xb7/0xf0
[  291.129412]  [<ffffffff8166c3ee>] SyS_connect+0xe/0x10
[  291.129413]  [<ffffffff817d1b36>] entry_SYSCALL_64_fastpath+0x16/0x7a

Sadly it seems crash tool does not know how to work with modern kernels
yet again so it is time to update and see what was going on in more details.

The test itself is simple:
test_54a() {
        $SOCKETSERVER $DIR/socket ||
                error "$SOCKETSERVER $DIR/socket failed: $?"
        $SOCKETCLIENT $DIR/socket ||
                error "$SOCKETCLIENT $DIR/socket failed: $?"
        $MUNLINK $DIR/socket || error "$MUNLINK $DIR/socket failed: $?"
}
run_test 54a "unix domain socket test =========================="

with the socket server and socket client doing what you would think.

The patch I had at hand was:
--- a/drivers/staging/lustre/lustre/llite/namei.c
+++ b/drivers/staging/lustre/lustre/llite/namei.c
@@ -358,14 +358,8 @@ struct dentry *ll_splice_alias(struct inode *inode, struct dentry *de)
 	int rc;
 
 	if (inode) {
-		new = ll_find_alias(inode, de);
+		new = d_exact_alias(de, inode);
 		if (new) {
-			rc = ll_d_init(new);
-			if (rc < 0) {
-				dput(new);
-				return ERR_PTR(rc);
-			}
-			d_move(new, de);
 			iput(inode);
 			CDEBUG(D_DENTRY,
 			       "Reuse dentry %p inode %p refc %d flags %#x\n",
@@ -374,9 +368,9 @@ struct dentry *ll_splice_alias(struct inode *inode, struct dentry *de)
 		}
 	}
 	rc = ll_d_init(de);
+	d_add(de, inode);
 	if (rc < 0)
 		return ERR_PTR(rc);
-	d_add(de, inode);
 	CDEBUG(D_DENTRY, "Add dentry %p inode %p refc %d flags %#x\n",
 	       de, d_inode(de), d_count(de), de->d_flags);

Overall the rate of hits in d_exact_alias was very low, need to check if pre-patch it was any better.

Also any particular reason to prefer d_splice_alias over d_add? Did not we already determine there
were no other aliases.
> 
> diff --git a/drivers/staging/lustre/lustre/llite/namei.c b/drivers/staging/lustre/lustre/llite/namei.c
> index da5f443..bcc9841 100644
> --- a/drivers/staging/lustre/lustre/llite/namei.c
> +++ b/drivers/staging/lustre/lustre/llite/namei.c
> @@ -320,81 +320,37 @@ void ll_i2gids(__u32 *suppgids, struct inode *i1, struct inode *i2)
> }
> 
> /*
> - * try to reuse three types of dentry:
> - * 1. unhashed alias, this one is unhashed by d_invalidate (but it may be valid
> - *    by concurrent .revalidate).
> - * 2. INVALID alias (common case for no valid ldlm lock held, but this flag may
> - *    be cleared by others calling d_lustre_revalidate).
> - * 3. DISCONNECTED alias.
> - */
> -static struct dentry *ll_find_alias(struct inode *inode, struct dentry *dentry)
> -{
> -	struct dentry *alias, *discon_alias, *invalid_alias;
> -
> -	if (hlist_empty(&inode->i_dentry))
> -		return NULL;
> -
> -	discon_alias = invalid_alias = NULL;
> -
> -	ll_lock_dcache(inode);
> -	hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
> -		LASSERT(alias != dentry);
> -
> -		spin_lock(&alias->d_lock);
> -		if (alias->d_flags & DCACHE_DISCONNECTED)
> -			/* LASSERT(last_discon == NULL); LU-405, bz 20055 */
> -			discon_alias = alias;
> -		else if (alias->d_parent == dentry->d_parent	     &&
> -			 alias->d_name.hash == dentry->d_name.hash       &&
> -			 alias->d_name.len == dentry->d_name.len	 &&
> -			 memcmp(alias->d_name.name, dentry->d_name.name,
> -				dentry->d_name.len) == 0)
> -			invalid_alias = alias;
> -		spin_unlock(&alias->d_lock);
> -
> -		if (invalid_alias)
> -			break;
> -	}
> -	alias = invalid_alias ?: discon_alias ?: NULL;
> -	if (alias) {
> -		spin_lock(&alias->d_lock);
> -		dget_dlock(alias);
> -		spin_unlock(&alias->d_lock);
> -	}
> -	ll_unlock_dcache(inode);
> -
> -	return alias;
> -}
> -
> -/*
>  * Similar to d_splice_alias(), but lustre treats invalid alias
>  * similar to DCACHE_DISCONNECTED, and tries to use it anyway.
>  */
> struct dentry *ll_splice_alias(struct inode *inode, struct dentry *de)
> {
> -	struct dentry *new;
> +	struct dentry *alias = NULL;
> 	int rc;
> 
> 	if (inode) {
> -		new = ll_find_alias(inode, de);
> -		if (new) {
> -			rc = ll_d_init(new);
> -			if (rc < 0) {
> -				dput(new);
> -				return ERR_PTR(rc);
> -			}
> -			d_move(new, de);
> +		alias = d_exact_alias(de, inode);
> +		if (alias)
> +			iput(inode);
> +	}
> +
> +	if (!alias) {
> +		rc = ll_d_init(de);
> +		if (rc < 0) {
> 			iput(inode);
> -			CDEBUG(D_DENTRY,
> -			       "Reuse dentry %p inode %p refc %d flags %#x\n",
> -			      new, d_inode(new), d_count(new), new->d_flags);
> -			return new;
> +			return ERR_PTR(rc);
> 		}
> +		alias = d_splice_alias(inode, de);
> +		if (IS_ERR(alias))
> +			return alias;
> +	}
> +
> +	if (alias) {
> +		CDEBUG(D_DENTRY,
> +		       "Reuse dentry %p inode %p refc %d flags %#x\n",
> +		      alias, d_inode(alias), d_count(alias), alias->d_flags);
> +		return alias;
> 	}
> -	rc = ll_d_init(de);
> -	if (rc < 0)
> -		return ERR_PTR(rc);
> -	d_add(de, inode);
> 	CDEBUG(D_DENTRY, "Add dentry %p inode %p refc %d flags %#x\n",
> 	       de, d_inode(de), d_count(de), de->d_flags);
> 	return de;
> diff --git a/drivers/staging/lustre/lustre/llite/statahead.c b/drivers/staging/lustre/lustre/llite/statahead.c
> index 88ffd8e..6c64de0 100644
> --- a/drivers/staging/lustre/lustre/llite/statahead.c
> +++ b/drivers/staging/lustre/lustre/llite/statahead.c
> @@ -1576,6 +1576,7 @@ int do_statahead_enter(struct inode *dir, struct dentry **dentryp,
> 					alias = ll_splice_alias(inode,
> 								   *dentryp);
> 					if (IS_ERR(alias)) {
> +						entry->se_inode = NULL;
> 						ll_sai_unplug(sai, entry);
> 						return PTR_ERR(alias);
> 					}


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias()
  2016-03-09  0:53         ` Drokin, Oleg
@ 2016-03-09  1:26           ` Al Viro
  2016-03-09  5:20             ` Drokin, Oleg
  2016-03-09 23:47             ` Drokin, Oleg
  0 siblings, 2 replies; 30+ messages in thread
From: Al Viro @ 2016-03-09  1:26 UTC (permalink / raw)
  To: Drokin, Oleg
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>

On Wed, Mar 09, 2016 at 12:53:20AM +0000, Drokin, Oleg wrote:

> I have not tried this exact one yet (will try in a moment).
> But it appears we do need ll_d_init() call if we got a hit
> in d_exact_alias().
> A particular test case fails due to lack of the d_fsdata being null.

Very interesting.  I'd suggest checking if we hit that d_instantiate()
in your namei.c with NULL ->d_fsdata.  That really shouldn't happen, AFAICS...

> @@ -374,9 +368,9 @@ struct dentry *ll_splice_alias(struct inode *inode, struct dentry *de)
>  		}
>  	}
>  	rc = ll_d_init(de);
> +	d_add(de, inode);
>  	if (rc < 0)
>  		return ERR_PTR(rc);
> -	d_add(de, inode);

Uhh...  Why would you reorder them like that?

> Also any particular reason to prefer d_splice_alias over d_add? Did not we already determine there
> were no other aliases.

No unhashed aliases with the same name and parent.  Seriously, d_splice_alias()
will do exactly the same thing as d_add() except for the situations where
d_add() would've created multiple dentries over the same directory inode.
And it's not more costly outside of that case...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias()
  2016-03-09  1:26           ` Al Viro
@ 2016-03-09  5:20             ` Drokin, Oleg
  2016-03-09 23:47             ` Drokin, Oleg
  1 sibling, 0 replies; 30+ messages in thread
From: Drokin, Oleg @ 2016-03-09  5:20 UTC (permalink / raw)
  To: Al Viro
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>

On Mar 8, 2016, at 8:26 PM, Al Viro wrote:

> On Wed, Mar 09, 2016 at 12:53:20AM +0000, Drokin, Oleg wrote:
> 
>> I have not tried this exact one yet (will try in a moment).
>> But it appears we do need ll_d_init() call if we got a hit
>> in d_exact_alias().
>> A particular test case fails due to lack of the d_fsdata being null.
> 
> Very interesting.  I'd suggest checking if we hit that d_instantiate()
> in your namei.c with NULL ->d_fsdata.  That really shouldn't happen, AFAICSï¿½

No, the d_instantiate for the hashed case comes through with filled in d_fsdata.

Ah! Found it. the lookup (ll_lookup_nd) checks if the operation is CREATE without
corresponding open (i.e. mknod) and returns NULL right away to sort it out in
ll_mknod() later. This adds the new dentry into the cache that is not otherwise
initialized. I guess we should call ll_d_init() right there.

This in particular works so far:
--- a/drivers/staging/lustre/lustre/llite/namei.c
+++ b/drivers/staging/lustre/lustre/llite/namei.c
@@ -558,8 +518,13 @@ static struct dentry *ll_lookup_nd(struct inode *parent, st
ruct dentry *dentry,
               parent->i_generation, parent, flags);

        /* Optimize away (CREATE && !OPEN). Let .create handle the race. */
-       if ((flags & LOOKUP_CREATE) && !(flags & LOOKUP_OPEN))
+       if ((flags & LOOKUP_CREATE) && !(flags & LOOKUP_OPEN)) {
+               int rc;
+               rc = ll_d_init(dentry);
+               if (rc < 0)
+                       return ERR_PTR(rc);
                return NULL;
+       }

        if (flags & (LOOKUP_PARENT|LOOKUP_OPEN|LOOKUP_CREATE))
                itp = NULL;

>> @@ -374,9 +368,9 @@ struct dentry *ll_splice_alias(struct inode *inode, struct dentry *de)
>> 		}
>> 	}
>> 	rc = ll_d_init(de);
>> +	d_add(de, inode);
>> 	if (rc < 0)
>> 		return ERR_PTR(rc);
>> -	d_add(de, inode);
> 
> Uhh...  Why would you reorder them like that?

Hmï¿½ The thinking at the time was to always attach inode to dentry, so even in case of error
it's not really leaked. Though I guess that also makes dentry visible which is not ideal.

>> Also any particular reason to prefer d_splice_alias over d_add? Did not we already determine there
>> were no other aliases.
> 
> No unhashed aliases with the same name and parent.  Seriously, d_splice_alias()
> will do exactly the same thing as d_add() except for the situations where
> d_add() would've created multiple dentries over the same directory inode.
> And it's not more costly outside of that case...

Actually I guess it's kind of important with d_exact_alias, since it skips any hashed dentries,
but in fact invalid Lustre dentries are only marked with a flag inside that we then
validate during d_compare.
ll_find_alias was not ignoring hashed entries so it all worked out, but with
d_exact_alias we also need the d_splice_alias to certainly find those hashed entries too, we get
a lot of hits in it too according to my tests.

After looking into it a bit moreï¿½ d_splice_alias would pick any alias there might be,
unhashed or whatever. So what's the importance of obtaining an exact (but unhashed) alias first,
since if we don't find it we are still going to pick first one from the alias list (both in nfsd
and in Lustre with your patches)? Just because the exact unhashed alias might be a better fit
and we are preferring it? Any other reason I don't realize yet?

I guess this looks pretty good at the moment. I'll do some additional multinode (and nfs reexport)
testing to ensure it all holds up nicely just in case.

Thanks!

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias()
  2016-03-09  1:26           ` Al Viro
  2016-03-09  5:20             ` Drokin, Oleg
@ 2016-03-09 23:47             ` Drokin, Oleg
  2016-03-10  2:20               ` races in ll_splice_alias() and elsewhere (ext4, ocfs2) Al Viro
  1 sibling, 1 reply; 30+ messages in thread
From: Drokin, Oleg @ 2016-03-09 23:47 UTC (permalink / raw)
  To: Al Viro
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>

On Mar 8, 2016, at 8:26 PM, Al Viro wrote:

> On Wed, Mar 09, 2016 at 12:53:20AM +0000, Drokin, Oleg wrote:
> 
>> I have not tried this exact one yet (will try in a moment).
>> But it appears we do need ll_d_init() call if we got a hit
>> in d_exact_alias().
>> A particular test case fails due to lack of the d_fsdata being null.
> 
> Very interesting.  I'd suggest checking if we hit that d_instantiate()
> in your namei.c with NULL ->d_fsdata.  That really shouldn't happen, AFAICSï¿½

Well, it appears there's another way of getting this in addition to
the already identified "return NULL" at the start of the ll_lookup_nd()
that was uncovered when reexporting via NFS.
ll_iget_for_nfs does d_obtain_alias() that might produce a dentry also
in need of initialization if it's a disconnected one.
This is a path that you did not list in your possible ways to find a
dentry in ll_find_alias() before.

Now, do you think I should just call ll_d_init() on it at all times,
since that one might be uninitialized.

BTW, I also did a quick look through the code and ext4 "encryption"
code potentially got it wrong too? ext4_lookup checks if the parent
dir have encryption and sets d_fsdata to 1 and also rewrites
dentry ops with ext4_encrypted_d_ops, but it then does
d_splice_alias that might find such a disconnected dentry and it would
lose all of that work in the end?

ocfs2 reimplements d_splice_alias as ocfs2_find_local_alias where they
sidestep the initialization of d_fsdata issue (even though they
refer to it as d_splice_alias elsewhere in the comments).
This would cause them to always reject disconnected dentries from
being found at the very least which is probably somewhat suboptimal
and leads to some wasted RAM for those extra dentries.

With this (and the other one) fixes in place I see no other problems
with your patch.

So if you think the patch below is ok, I can submit it to Greg
(with addition of removing the extra init after ll_find_alias call),
or to you if you think it's better for you to carry it in your tree
(the one below probably has whitespace all wrong, so it won't really
apply as is and needs a proper resend).

diff --git a/drivers/staging/lustre/lustre/llite/llite_nfs.c b/drivers/staging/lustre/lustre/llite/llite_nfs.c
index 193aab8..8a0bf73 100644
--- a/drivers/staging/lustre/lustre/llite/llite_nfs.c
+++ b/drivers/staging/lustre/lustre/llite/llite_nfs.c
@@ -173,6 +173,16 @@ ll_iget_for_nfs(struct super_block *sb, struct lu_fid *fid, struct lu_fid *paren
        /* N.B. d_obtain_alias() drops inode ref on error */
        result = d_obtain_alias(inode);

+       if (!IS_ERR(result)) {
+               int rc;
+
+               rc = ll_d_init(result);
+               if (rc < 0) {
+                       dput(result);
+                       result = ERR_PTR(rc);
+               }
+       }
+
        return result;
 }
diff --git a/drivers/staging/lustre/lustre/llite/namei.c b/drivers/staging/lustre/lustre/llite/namei.c
index f8f98e4..e18f3a5 100644
--- a/drivers/staging/lustre/lustre/llite/namei.c
+++ b/drivers/staging/lustre/lustre/llite/namei.c
@@ -302,81 +302,44 @@ void ll_i2gids(__u32 *suppgids, struct inode *i1, struct inode *i2)
@@ -558,8 +522,13 @@ static struct dentry *ll_lookup_nd(struct inode *parent, struct dentry *dentry,
               parent->i_generation, parent, flags);

        /* Optimize away (CREATE && !OPEN). Let .create handle the race. */
-       if ((flags & LOOKUP_CREATE) && !(flags & LOOKUP_OPEN))
+       if ((flags & LOOKUP_CREATE) && !(flags & LOOKUP_OPEN)) {
+               int rc;
+               rc = ll_d_init(dentry);
+               if (rc < 0)
+                       return ERR_PTR(rc);
                return NULL;
+       }

        if (flags & (LOOKUP_PARENT|LOOKUP_OPEN|LOOKUP_CREATE))
                itp = NULL;

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-09 23:47             ` Drokin, Oleg
@ 2016-03-10  2:20               ` Al Viro
  2016-03-10  2:59                 ` Al Viro
                                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Al Viro @ 2016-03-10  2:20 UTC (permalink / raw)
  To: Drokin, Oleg
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Theodore Ts'o,
	Mark Fasheh

[ext4 and ocfs2 folks Cc'd]
On Wed, Mar 09, 2016 at 11:47:50PM +0000, Drokin, Oleg wrote:

> Well, it appears there's another way of getting this in addition to
> the already identified "return NULL" at the start of the ll_lookup_nd()
> that was uncovered when reexporting via NFS.
> ll_iget_for_nfs does d_obtain_alias() that might produce a dentry also
> in need of initialization if it's a disconnected one.
> This is a path that you did not list in your possible ways to find a
> dentry in ll_find_alias() before.
> 
> Now, do you think I should just call ll_d_init() on it at all times,
> since that one might be uninitialized.

> BTW, I also did a quick look through the code and ext4 "encryption"
> code potentially got it wrong too? ext4_lookup checks if the parent
> dir have encryption and sets d_fsdata to 1 and also rewrites
> dentry ops with ext4_encrypted_d_ops, but it then does
> d_splice_alias that might find such a disconnected dentry and it would
> lose all of that work in the end?

Umm...  AFAICS, ext4_d_revalidate() is racy, starting with the very
first line.  What's to prevent it being moved while we are calling that?
Lose timeslice on preemption, have mv(1) move it elsewhere, followed by
rmdir taking the now-empty parent out.  Come back and dir points to
freed memory, with ci being complete junk.  Looks like oopsen galore...
Ted, am I missing something subtle here?

d_splice_alias() aside (nfsd interaction will be interesting, though),
this looks like an oopsable race. If not worse - dereferencing some
addresses can really screw hardware...

> ocfs2 reimplements d_splice_alias as ocfs2_find_local_alias where they
> sidestep the initialization of d_fsdata issue (even though they
> refer to it as d_splice_alias elsewhere in the comments).

Not really.  One of the callers (ocfs2_lookup() -> ocfs2_dentry_attach_lock() 
-> ocfs2_find_local_alias()) does that after d_splice_alias(); it doesn't
reimplement d_splice_alias().  There inode is already equal to dentry->d_inode.
In other callers of ocfs2_dentry_attach_lock() we are yet to set ->d_inode,
but already know its value (link/mknod/symlink/reflink).  We do *not* set
it in that function - it's done afterwards (look for d_instantiate()
downstream).  Yet another (cross-directory rename) does that to dentry with
dentry->d_inode already set and equal to inode.  IOW, it's not even close
to d_splice_alias().

> This would cause them to always reject disconnected dentries from
> being found at the very least which is probably somewhat suboptimal
> and leads to some wasted RAM for those extra dentries.

No.  d_splice_alias() in ocfs2_lookup() will pick those just fine.  _Then_
it will go looking for other links in the same directory, and try to share
their (refcounted) ->d_fsdata.  If not found, it'll just alloc a new one -
ocfs2_dentry_lock, not dentry.  Rightfully so, since they are basically
indexed by (parent inumber, child inumber) pairs, so links elsewhere are
out of consideration anyway.

I'm not sure if delaying ->d_fsdata setting until after d_splice_alias() is
such a good idea - it can be picked by dcache lookups as soon as it has
been hashed, so if somebody finds in dcache just as ocfs2_lookup() has
finished d_splice_alias(), we'll end up with ->d_revalidate() called and
possibly finding still NULL ->d_fsdata.  With caller of ->d_revalidate()
unhashing it there and then.  

The same ->d_revalidate() can also cause an unpleasant problem for e.g.
mknod - it'll see ->d_inode still negative, while ->d_fsdata is already
switched from "generation number of parent" to "pointer to ocfs2_dentry_lock",
notice that it does not match the generation number of parent and, again,
kick it out of hash.

Hell knows; I'd probably try to use the LSB of ->d_fsdata to tell one from
another.  ocfs2_dentry_lock is going to be at least 32bit-aligned, so we
could use odd values for ->d_fsdata of negatives.  Always flip to pointer
before making dentry positive and have ->d_revalidate() treat "NULL ->d_inode
and even ->d_fsdata" as "valid negative" - after all, it must have been
just prior to that and dentry *is* negative.

Note, BTW, that d_splice_alias() will not look for aliases in case of
non-directories - for those it's the same d_add(), since there we can
legitimately have many dentry aliases over the same inode.  For directories
we *can't*.

After looking at that code...  ocfs2_match_dentry() is mostly junk.
	a) dentry *never* has NULL ->d_parent; disconnected ones have
->d_parent point to dentry itself.
	b) ->d_parent->d_inode is never NULL.  If it ever happens, we
are really screwed in so many places...
	c) parent_blkno thing is very dubious - if it hadn't just come
from the actual in-core parent inode (and in that case, why not compare
that?), it's pretty much begging for iput() races, _especially_ if they
have async code using it.  For the call chains going through
ocfs2_dentry_attach_lock() the former is true; not sure about the one in
ocfs2_dentry_convert_worker()...

(c) might actually be correct; I hadn't looked in enough details to tell.
(a) and (b) are absolute crap.  BTW, the loop in ocfs2_dentry_convert_worker()
smells like O(busy aliases^2) ;-/

> With this (and the other one) fixes in place I see no other problems
> with your patch.
> 
> So if you think the patch below is ok, I can submit it to Greg
> (with addition of removing the extra init after ll_find_alias call),
> or to you if you think it's better for you to carry it in your tree
> (the one below probably has whitespace all wrong, so it won't really
> apply as is and needs a proper resend).

I'll pick it.  Mind if I fold it into the one I'd posted (with credits,
obviously)?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-10  2:20               ` races in ll_splice_alias() and elsewhere (ext4, ocfs2) Al Viro
@ 2016-03-10  2:59                 ` Al Viro
  2016-03-10 23:55                   ` Theodore Ts'o
  2016-03-10  3:08                 ` Drokin, Oleg
  2016-03-10 19:59                 ` Al Viro
  2 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2016-03-10  2:59 UTC (permalink / raw)
  To: Drokin, Oleg
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Theodore Ts'o,
	Mark Fasheh

On Thu, Mar 10, 2016 at 02:20:42AM +0000, Al Viro wrote:

> Umm...  AFAICS, ext4_d_revalidate() is racy, starting with the very
> first line.  What's to prevent it being moved while we are calling that?
> Lose timeslice on preemption, have mv(1) move it elsewhere, followed by
> rmdir taking the now-empty parent out.  Come back and dir points to
> freed memory, with ci being complete junk.  Looks like oopsen galore...
> Ted, am I missing something subtle here?

BTW, the fact that original parent dentry is pinned by caller doesn't help
at all - by the time we get to ext4_d_revalidate() its ->d_parent might have
been pointing to something we are *not* pinning, with another rename() +
rmdir() completing the problem.  It's going to be hard to hit, but not
impossible.  Have d_move() happen right after we'd found the match in
__d_lookup(), then get preempted just as we'd fetched (already changed)
->d_parent->d_inode in ext4_d_revalidate().  The second rename() + rmdir()
have to complete by the time we regain CPU and we are screwed.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-10  2:20               ` races in ll_splice_alias() and elsewhere (ext4, ocfs2) Al Viro
  2016-03-10  2:59                 ` Al Viro
@ 2016-03-10  3:08                 ` Drokin, Oleg
  2016-03-10  3:34                   ` Al Viro
  2016-03-10 19:59                 ` Al Viro
  2 siblings, 1 reply; 30+ messages in thread
From: Drokin, Oleg @ 2016-03-10  3:08 UTC (permalink / raw)
  To: Al Viro
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Mark Fasheh

[removing Ted from CC since I am not going over ext4 part and he's quite
busy already I am sure.]
On Mar 9, 2016, at 9:20 PM, Al Viro wrote:

> [ext4 and ocfs2 folks Cc'd]
> On Wed, Mar 09, 2016 at 11:47:50PM +0000, Drokin, Oleg wrote:
> 
>> ocfs2 reimplements d_splice_alias as ocfs2_find_local_alias where they
>> sidestep the initialization of d_fsdata issue (even though they
>> refer to it as d_splice_alias elsewhere in the comments).
> 
> Not really.  One of the callers (ocfs2_lookup() -> ocfs2_dentry_attach_lock() 
> -> ocfs2_find_local_alias()) does that after d_splice_alias(); it doesn't

Oh, I missed that call somehow.

> Note, BTW, that d_splice_alias() will not look for aliases in case of
> non-directories - for those it's the same d_add(), since there we can
> legitimately have many dentry aliases over the same inode.  For directories
> we *can't*.

Ah! Btw this highlights the missed case for d_exact_alias + d_splice_alias
in Lustre with current collection of patches you carry.

Suppose we have a dentry pointing to an inode all nicely covered by a lustre
lock (to ensure it is valid).
Now some other client does something that invalidates the name (rename,
or just open(O_CREAT) even), this causes our local lock to disappear
and causes the dentry to be declared "lustre invalid" via
ll_md_blocking_ast()->ll_invalidate_aliases()->d_lustre_invalidate()

This sets the "invalid" flag in __d_lustre_invalidate() and
also would try to unhash the dentry in some cases:
        if (d_count(dentry) == 0 && !(dentry->d_flags & DCACHE_DISCONNECTED))
                __d_drop(dentry);

Now if none of those conditions hit, the dentry stays hashed.
But the problem is if it's not a directory, then d_splice_alias would
also ignore it, while d_exact_alias would ignore it due to it being still hashed
and this dentry would just hang around uselessly taking RAM.

Is there an easy way to rectify this, I wonder?

>> With this (and the other one) fixes in place I see no other problems
>> with your patch.
>> 
>> So if you think the patch below is ok, I can submit it to Greg
>> (with addition of removing the extra init after ll_find_alias call),
>> or to you if you think it's better for you to carry it in your tree
>> (the one below probably has whitespace all wrong, so it won't really
>> apply as is and needs a proper resend).
> 
> I'll pick it.  Mind if I fold it into the one I'd posted (with credits,
> obviously)?

Sure. Thanks.

Bye,
    Oleg


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-10  3:08                 ` Drokin, Oleg
@ 2016-03-10  3:34                   ` Al Viro
  2016-03-10  3:46                     ` Drokin, Oleg
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2016-03-10  3:34 UTC (permalink / raw)
  To: Drokin, Oleg
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Mark Fasheh

On Thu, Mar 10, 2016 at 03:08:57AM +0000, Drokin, Oleg wrote:
> > Note, BTW, that d_splice_alias() will not look for aliases in case of
> > non-directories - for those it's the same d_add(), since there we can
> > legitimately have many dentry aliases over the same inode.  For directories
> > we *can't*.
> 
> Ah! Btw this highlights the missed case for d_exact_alias + d_splice_alias
> in Lustre with current collection of patches you carry.
> 
> Suppose we have a dentry pointing to an inode all nicely covered by a lustre
> lock (to ensure it is valid).
> Now some other client does something that invalidates the name (rename,
> or just open(O_CREAT) even), this causes our local lock to disappear
> and causes the dentry to be declared "lustre invalid" via
> ll_md_blocking_ast()->ll_invalidate_aliases()->d_lustre_invalidate()
> 
> This sets the "invalid" flag in __d_lustre_invalidate() and
> also would try to unhash the dentry in some cases:
>         if (d_count(dentry) == 0 && !(dentry->d_flags & DCACHE_DISCONNECTED))
>                 __d_drop(dentry);
> 
> Now if none of those conditions hit, the dentry stays hashed.
> But the problem is if it's not a directory, then d_splice_alias would
> also ignore it, while d_exact_alias would ignore it due to it being still hashed
> and this dentry would just hang around uselessly taking RAM.
> 
> Is there an easy way to rectify this, I wonder?

Wait a minute.  If it's hashed, has the right name and the right parent,
why the hell are we calling ->lookup() on a new dentry in the first place?
Why hadn't we simply picked it from dcache?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-10  3:34                   ` Al Viro
@ 2016-03-10  3:46                     ` Drokin, Oleg
  2016-03-10  4:22                       ` Drokin, Oleg
  2016-03-10  4:43                       ` Al Viro
  0 siblings, 2 replies; 30+ messages in thread
From: Drokin, Oleg @ 2016-03-10  3:46 UTC (permalink / raw)
  To: Al Viro
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Mark Fasheh


On Mar 9, 2016, at 10:34 PM, Al Viro wrote:

> On Thu, Mar 10, 2016 at 03:08:57AM +0000, Drokin, Oleg wrote:
>>> Note, BTW, that d_splice_alias() will not look for aliases in case of
>>> non-directories - for those it's the same d_add(), since there we can
>>> legitimately have many dentry aliases over the same inode.  For directories
>>> we *can't*.
>> 
>> Ah! Btw this highlights the missed case for d_exact_alias + d_splice_alias
>> in Lustre with current collection of patches you carry.
>> 
>> Suppose we have a dentry pointing to an inode all nicely covered by a lustre
>> lock (to ensure it is valid).
>> Now some other client does something that invalidates the name (rename,
>> or just open(O_CREAT) even), this causes our local lock to disappear
>> and causes the dentry to be declared "lustre invalid" via
>> ll_md_blocking_ast()->ll_invalidate_aliases()->d_lustre_invalidate()
>> 
>> This sets the "invalid" flag in __d_lustre_invalidate() and
>> also would try to unhash the dentry in some cases:
>>        if (d_count(dentry) == 0 && !(dentry->d_flags & DCACHE_DISCONNECTED))
>>                __d_drop(dentry);
>> 
>> Now if none of those conditions hit, the dentry stays hashed.
>> But the problem is if it's not a directory, then d_splice_alias would
>> also ignore it, while d_exact_alias would ignore it due to it being still hashed
>> and this dentry would just hang around uselessly taking RAM.
>> 
>> Is there an easy way to rectify this, I wonder?
> 
> Wait a minute.  If it's hashed, has the right name and the right parent,
> why the hell are we calling ->lookup() on a new dentry in the first place?
> Why hadn't we simply picked it from dcache?

This is because of the trickery we do in the d_compare.
our d_compare looks at the "invalid" flag and if it's set, returns "not matching",
triggering the lookup instead of revalidate.
This makes revalidate simple and fast.
(We used to have a complicated revalidate with a lot of code duplication with
lookup in order to be able to query the server and pass all sorts of data there
and it was nothing but trouble).


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-10  3:46                     ` Drokin, Oleg
@ 2016-03-10  4:22                       ` Drokin, Oleg
  2016-03-10  4:43                       ` Al Viro
  1 sibling, 0 replies; 30+ messages in thread
From: Drokin, Oleg @ 2016-03-10  4:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>


On Mar 9, 2016, at 10:46 PM, Oleg Drokin wrote:

> 
> On Mar 9, 2016, at 10:34 PM, Al Viro wrote:
> 
>> On Thu, Mar 10, 2016 at 03:08:57AM +0000, Drokin, Oleg wrote:
>>>> Note, BTW, that d_splice_alias() will not look for aliases in case of
>>>> non-directories - for those it's the same d_add(), since there we can
>>>> legitimately have many dentry aliases over the same inode.  For directories
>>>> we *can't*.
>>> 
>>> Ah! Btw this highlights the missed case for d_exact_alias + d_splice_alias
>>> in Lustre with current collection of patches you carry.
>>> 
>>> Suppose we have a dentry pointing to an inode all nicely covered by a lustre
>>> lock (to ensure it is valid).
>>> Now some other client does something that invalidates the name (rename,
>>> or just open(O_CREAT) even), this causes our local lock to disappear
>>> and causes the dentry to be declared "lustre invalid" via
>>> ll_md_blocking_ast()->ll_invalidate_aliases()->d_lustre_invalidate()
>>> 
>>> This sets the "invalid" flag in __d_lustre_invalidate() and
>>> also would try to unhash the dentry in some cases:
>>>       if (d_count(dentry) == 0 && !(dentry->d_flags & DCACHE_DISCONNECTED))
>>>               __d_drop(dentry);
>>> 
>>> Now if none of those conditions hit, the dentry stays hashed.
>>> But the problem is if it's not a directory, then d_splice_alias would
>>> also ignore it, while d_exact_alias would ignore it due to it being still hashed
>>> and this dentry would just hang around uselessly taking RAM.
>>> 
>>> Is there an easy way to rectify this, I wonder?
>> 
>> Wait a minute.  If it's hashed, has the right name and the right parent,
>> why the hell are we calling ->lookup() on a new dentry in the first place?
>> Why hadn't we simply picked it from dcache?
> 
> This is because of the trickery we do in the d_compare.
> our d_compare looks at the "invalid" flag and if it's set, returns "not matching",
> triggering the lookup instead of revalidate.
> This makes revalidate simple and fast.
> (We used to have a complicated revalidate with a lot of code duplication with
> lookup in order to be able to query the server and pass all sorts of data there
> and it was nothing but trouble)

Let me add that all that complicated logic was due to d_revalidate returning 0 in the
past was bound to quickly turn into ESTALE returned to userspace which was not really
acceptable.

It seems this has changed dramatically and we might not need this now if ESTALE
is never returned just because dentry happened to be invalid.
Time to run some tests, I guess.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-10  3:46                     ` Drokin, Oleg
  2016-03-10  4:22                       ` Drokin, Oleg
@ 2016-03-10  4:43                       ` Al Viro
  2016-03-10  5:15                         ` Al Viro
  2016-03-10  5:47                         ` Drokin, Oleg
  1 sibling, 2 replies; 30+ messages in thread
From: Al Viro @ 2016-03-10  4:43 UTC (permalink / raw)
  To: Drokin, Oleg
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Mark Fasheh

On Thu, Mar 10, 2016 at 03:46:43AM +0000, Drokin, Oleg wrote:

> > Wait a minute.  If it's hashed, has the right name and the right parent,
> > why the hell are we calling ->lookup() on a new dentry in the first place?
> > Why hadn't we simply picked it from dcache?
> 
> This is because of the trickery we do in the d_compare.
> our d_compare looks at the "invalid" flag and if it's set, returns "not matching",
> triggering the lookup instead of revalidate.
> This makes revalidate simple and fast.
> (We used to have a complicated revalidate with a lot of code duplication with
> lookup in order to be able to query the server and pass all sorts of data there
> and it was nothing but trouble).

*Ugh*...  That's really nasty.  We certainly could make d_exact_match()
accept unhashed ones and make rehashing conditional (NFS doesn't pull
anything similar, so it won't care), but your ->d_revalidate()
has exact same problem as ext4_d_revalidate() one mentioned upthread -
there's no warranty that dentry->d_parent will stay stable.

We are *NOT* guaranteed locked parent when ->d_revalidate() is called, or
we would have to lock every damn directory on the way through the pathname
resolution.  Moreover, ->d_revalidate() really can overlap with rename(2).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-10  4:43                       ` Al Viro
@ 2016-03-10  5:15                         ` Al Viro
  2016-03-11  3:47                           ` Drokin, Oleg
  2016-03-10  5:47                         ` Drokin, Oleg
  1 sibling, 1 reply; 30+ messages in thread
From: Al Viro @ 2016-03-10  5:15 UTC (permalink / raw)
  To: Drokin, Oleg
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Mark Fasheh

On Thu, Mar 10, 2016 at 04:43:16AM +0000, Al Viro wrote:
> On Thu, Mar 10, 2016 at 03:46:43AM +0000, Drokin, Oleg wrote:
> 
> > > Wait a minute.  If it's hashed, has the right name and the right parent,
> > > why the hell are we calling ->lookup() on a new dentry in the first place?
> > > Why hadn't we simply picked it from dcache?
> > 
> > This is because of the trickery we do in the d_compare.
> > our d_compare looks at the "invalid" flag and if it's set, returns "not matching",
> > triggering the lookup instead of revalidate.
> > This makes revalidate simple and fast.
> > (We used to have a complicated revalidate with a lot of code duplication with
> > lookup in order to be able to query the server and pass all sorts of data there
> > and it was nothing but trouble).
> 
> *Ugh*...  That's really nasty.  We certainly could make d_exact_match()
> accept unhashed ones and make rehashing conditional (NFS doesn't pull
> anything similar, so it won't care), but your ->d_revalidate()
> has exact same problem as ext4_d_revalidate() one mentioned upthread -
> there's no warranty that dentry->d_parent will stay stable.
> 
> We are *NOT* guaranteed locked parent when ->d_revalidate() is called, or
> we would have to lock every damn directory on the way through the pathname
> resolution.  Moreover, ->d_revalidate() really can overlap with rename(2).

PS: there's a reason why e.g. NFS ->d_revalidate() is doing
        if (flags & LOOKUP_RCU) {
                parent = ACCESS_ONCE(dentry->d_parent);
                dir = d_inode_rcu(parent);
                if (!dir)
                        return -ECHILD;
        } else {
                parent = dget_parent(dentry);
                dir = d_inode(parent);
        }  
and so do other instances.  It does *not* guarantee that parent will remain
the parent through the whole thing (or will still be one by the time
dget_parent() caller gets the return value), but it does guarantee that it
won't get freed under you.  Note that the original parent won't disappear
(it's pinned by the caller), but there's no promise that what you'll
fetch from dentry->d_parent inside the method will have anything to do with
that.

BTW, we might be better off if we passed the parent and child as separate
arguments...

By the quick look through the instances, we have
	* a bunch that don't look at the parent at all
	* some that use dget_parent()/dput() (and often enough use only
->d_inode of the parent).
	* some that look at it under dentry->d_lock - that's enough for
stability, but can't block.  ceph, BTW, does igrab() of parent's inode under
->d_lock, uses it outside of ->d_lock and iput() in the end.
	* kernfs, which serializes just about everything on a single
system-wide mutex.
	* lustre (and ext4 crypto in -next) - broken

Only the third class (and actually only one instance in there - vfat) wouldn't
be just as fine if we passed it the parent as argument.  VFAT one does
        spin_lock(&dentry->d_lock);
        if (dentry->d_time != d_inode(dentry->d_parent)->i_version)
                ret = 0;
        spin_unlock(&dentry->d_lock);
That one does care about ->d_time and ->d_parent being from the same moment.
And it can bloody well keep doing what it does.

Reducing the amount of dget_parent() callers would also be nice...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-10  4:43                       ` Al Viro
  2016-03-10  5:15                         ` Al Viro
@ 2016-03-10  5:47                         ` Drokin, Oleg
  1 sibling, 0 replies; 30+ messages in thread
From: Drokin, Oleg @ 2016-03-10  5:47 UTC (permalink / raw)
  To: Al Viro
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Mark Fasheh

On Mar 9, 2016, at 11:43 PM, Al Viro wrote:

> On Thu, Mar 10, 2016 at 03:46:43AM +0000, Drokin, Oleg wrote:
> 
>>> Wait a minute.  If it's hashed, has the right name and the right parent,
>>> why the hell are we calling ->lookup() on a new dentry in the first place?
>>> Why hadn't we simply picked it from dcache?
>> 
>> This is because of the trickery we do in the d_compare.
>> our d_compare looks at the "invalid" flag and if it's set, returns "not matching",
>> triggering the lookup instead of revalidate.
>> This makes revalidate simple and fast.
>> (We used to have a complicated revalidate with a lot of code duplication with
>> lookup in order to be able to query the server and pass all sorts of data there
>> and it was nothing but trouble).
> 
> *Ugh*...  That's really nasty.  We certainly could make d_exact_match()
> accept unhashed ones and make rehashing conditional (NFS doesn't pull
> anything similar, so it won't care), but your ->d_revalidate()
> has exact same problem as ext4_d_revalidate() one mentioned upthread -
> there's no warranty that dentry->d_parent will stay stable.
> 
> We are *NOT* guaranteed locked parent when ->d_revalidate() is called, or
> we would have to lock every damn directory on the way through the pathname
> resolution.  Moreover, ->d_revalidate() really can overlap with rename(2).

Hm, you are right.

I wonder why did we never see this race in practice, though.

I understand the race is rare, but we have a special test scenario called "racer"
(you can see it here: http://git.whamcloud.com/fs/lustre-release.git/tree/refs/heads/master:/lustre/tests/racer )

that does a bunch of operations in a limited namespace uncovering all sorts of
race conditions.

In order to make the effect better, I typically run it on a node with "unstable
cpus" (really a vm with 8 virtual cpus) where the cpus randomly get paused
for random amount of time up to a few seconds to extend the race windows.

All sorts of super unlikely stuff was hit in our code, but not this one.
(don't remember exact details and don't want to make stuff up on the spot ;)
but it certainly made some eye-opening races )

Is there something that suppresses it like our additional locking?
Hm, I guess that must be it. Rename always reaches to the server and that
invalidates the lustre dentry lock.
So if the rename progressed far enough that the lock got cancelled and dentry
got invalidated - and then lookup happened - it would never call revalidate
in the first place, so d_move does not matter.
Now if lookup and rename started at about the same time and we entered revalidate
before the lock got cancelled, the lock cancellation comes in and
blocks in ll_invalidate_aliases()->ll_lock_dcache() on inode_mutex, and
until it completes, the server cannot continue.

So probably just by this sheer luck we are not really affected even when VFS rules
say we are.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-10  2:20               ` races in ll_splice_alias() and elsewhere (ext4, ocfs2) Al Viro
  2016-03-10  2:59                 ` Al Viro
  2016-03-10  3:08                 ` Drokin, Oleg
@ 2016-03-10 19:59                 ` Al Viro
  2016-03-10 20:34                   ` do we need that smp_wmb() in __d_alloc()? Al Viro
  2016-03-10 21:22                   ` races in ll_splice_alias() and elsewhere (ext4, ocfs2) Drokin, Oleg
  2 siblings, 2 replies; 30+ messages in thread
From: Al Viro @ 2016-03-10 19:59 UTC (permalink / raw)
  To: Drokin, Oleg
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Theodore Ts'o,
	Mark Fasheh

On Thu, Mar 10, 2016 at 02:20:42AM +0000, Al Viro wrote:

> I'll pick it.  Mind if I fold it into the one I'd posted (with credits,
> obviously)?

Actually, it's still racy - there's a window between d_obtain_alias() and
ll_d_init() where it can be picked by d_splice_alias().

Sigh...  I'm really tempted to just add ->d_init() and let d_alloc() call it
if non-NULL.  

The question is what should it get and what should it be the method of...

How about
	int (*d_init)(struct dentry *);
in dentry_operations, with __d_alloc() ending with
        d_set_d_op(dentry, dentry->d_sb->s_d_op);
	if (unlikely(dentry->d_op && dentry->d_op->d_init)) {
		int err = dentry->d_op->d_init(dentry);
		if (unlikely(err)) {
			dentry_free(dentry);
			return ERR_PTR(err);
		}
	}
        this_cpu_inc(nr_dentry);

        return dentry;
}

Then lustre would simply have

int ll_d_init(struct dentry *de)
{
	struct ll_dentry_data *lld = kzalloc(sizeof(*lld), GFP_NOFS);
	if (unlikely(!lld))
		return -ENOMEM;
	lld->lld_invalid = 1;
	smp_wmb();	/* read barrier in whatever will find us */
	de->d_fsdata = lld;
        return 0;
}

as its ->d_init() and forget about all that mess.

Objections, better ideas?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* do we need that smp_wmb() in __d_alloc()?
  2016-03-10 19:59                 ` Al Viro
@ 2016-03-10 20:34                   ` Al Viro
  2016-03-10 21:17                     ` Al Viro
  2016-03-10 21:22                   ` races in ll_splice_alias() and elsewhere (ext4, ocfs2) Drokin, Oleg
  1 sibling, 1 reply; 30+ messages in thread
From: Al Viro @ 2016-03-10 20:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-fsdevel

On Thu, Mar 10, 2016 at 07:59:51PM +0000, Al Viro wrote:

> int ll_d_init(struct dentry *de)
> {
> 	struct ll_dentry_data *lld = kzalloc(sizeof(*lld), GFP_NOFS);
> 	if (unlikely(!lld))
> 		return -ENOMEM;
> 	lld->lld_invalid = 1;
> 	smp_wmb();	/* read barrier in whatever will find us */
> 	de->d_fsdata = lld;
>         return 0;
> }
> 
> as its ->d_init() and forget about all that mess.
> 
> Objections, better ideas?

Speaking of barriers - why do we need one there at all?  In __d_alloc(), that
is.  Look: callers of __d_alloc() are:
	* d_alloc() - cycles parent's ->d_lock, inserts into the list of
parent's children.  Anyone observing it on that list of children will be
holding the same parent's ->d_lock.  And anyone finding it in any other way
will have to observe the effect of store done after the write barrier in
spin_unlock.
	* __d_obtain_alias() - same story, only it's ->i_lock of the inode
and ->d_lock of dentry itself.

There's also d_alloc_pseudo() and d_make_root(); I suspect that for
d_make_root() an implicit barrier in unlocking ->s_umount would serve,
but in any case, wouldn't it make more sense to lift that smp_wmb() from
__d_alloc() to d_alloc_pseudo() and d_make_root()?  Linus, do you see
any problems with that?  I don't really trust my taste wrt barriers, so...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: do we need that smp_wmb() in __d_alloc()?
  2016-03-10 20:34                   ` do we need that smp_wmb() in __d_alloc()? Al Viro
@ 2016-03-10 21:17                     ` Al Viro
  0 siblings, 0 replies; 30+ messages in thread
From: Al Viro @ 2016-03-10 21:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-fsdevel

On Thu, Mar 10, 2016 at 08:34:28PM +0000, Al Viro wrote:

> Speaking of barriers - why do we need one there at all?  In __d_alloc(), that
> is.  Look: callers of __d_alloc() are:
> 	* d_alloc() - cycles parent's ->d_lock, inserts into the list of
> parent's children.  Anyone observing it on that list of children will be
> holding the same parent's ->d_lock.  And anyone finding it in any other way
> will have to observe the effect of store done after the write barrier in
> spin_unlock.

... which still leaves the possibility of both stores migrating inside the
lock/unlock pair and past each other.  IOW,
	lock parent's d_lock
	lock child's d_lock
	put into hash
	store terminating NUL in ->d_name
	unlock parent's d_lock
	unlock child's d_lock
is a possible sequence of events, with RCU access on another CPU observing
the damn without seeing the NUL-terminated name ;-/

Oh, well...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-10 19:59                 ` Al Viro
  2016-03-10 20:34                   ` do we need that smp_wmb() in __d_alloc()? Al Viro
@ 2016-03-10 21:22                   ` Drokin, Oleg
  2016-03-10 23:23                     ` Al Viro
  1 sibling, 1 reply; 30+ messages in thread
From: Drokin, Oleg @ 2016-03-10 21:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Theodore Ts'o,
	Mark Fasheh


On Mar 10, 2016, at 2:59 PM, Al Viro wrote:

> On Thu, Mar 10, 2016 at 02:20:42AM +0000, Al Viro wrote:
> 
>> I'll pick it.  Mind if I fold it into the one I'd posted (with credits,
>> obviously)?
> 
> Actually, it's still racy - there's a window between d_obtain_alias() and
> ll_d_init() where it can be picked by d_splice_alias().
> 
> Sigh...  I'm really tempted to just add ->d_init() and let d_alloc() call it
> if non-NULL.  

In fact I was surprised this was not the case already and so people
needed to call it by hand in every possible case where a new dentry is possibly
originated leading to a bit of a mess (and some bugs).

The downside is that we do the allocation at all times even if the dentry is
going to be promptly destroyed because we found a better alias.

In this way calling it just like we do today, after d_exact_alias + d_splice_alias
is another option, I guess and saves us some alloc + free trouble and a minuscle
overhead reduction for filesystems (majority of the in-kerel ones) that don't
really need this init at all.

> The question is what should it get and what should it be the method of...
> 
> How about
> 	int (*d_init)(struct dentry *);
> in dentry_operations, with __d_alloc() ending with
>        d_set_d_op(dentry, dentry->d_sb->s_d_op);
> 	if (unlikely(dentry->d_op && dentry->d_op->d_init)) {
> 		int err = dentry->d_op->d_init(dentry);
> 		if (unlikely(err)) {
> 			dentry_free(dentry);
> 			return ERR_PTR(err);
> 		}
> 	}
>        this_cpu_inc(nr_dentry);
> 
>        return dentry;
> }
> 
> Then lustre would simply have
> 
> int ll_d_init(struct dentry *de)
> {
> 	struct ll_dentry_data *lld = kzalloc(sizeof(*lld), GFP_NOFS);
> 	if (unlikely(!lld))
> 		return -ENOMEM;
> 	lld->lld_invalid = 1;
> 	smp_wmb();	/* read barrier in whatever will find us */
> 	de->d_fsdata = lld;
>        return 0;
> }
> 
> as its ->d_init() and forget about all that mess.
> 
> Objections, better ideas?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-10 21:22                   ` races in ll_splice_alias() and elsewhere (ext4, ocfs2) Drokin, Oleg
@ 2016-03-10 23:23                     ` Al Viro
  2016-03-11  3:25                       ` Drokin, Oleg
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2016-03-10 23:23 UTC (permalink / raw)
  To: Drokin, Oleg
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Theodore Ts'o,
	Mark Fasheh

On Thu, Mar 10, 2016 at 09:22:21PM +0000, Drokin, Oleg wrote:
> In fact I was surprised this was not the case already and so people
> needed to call it by hand in every possible case where a new dentry is possibly
> originated leading to a bit of a mess (and some bugs).

Most of the filesystems do not need that at all - just look at the
d_splice_alias() callers.  In mainline we have
	* 22 callers of form return d_splice_alias(inode, dentry); in
some ->lookup() instance.
	* 1 caller in d_add_ci(), all callers of which are in ->lookup()
instances, with return value directly returned to caller of ->lookup().
	* 1 caller in kernfs ->lookup(), with return value directly
returned after releasing a global mutex.
	* ceph - returned in a very convoluted way, but return values is
not dereferenced until it gets passed to caller of ->lookup()
	* gfs2 - passed to caller (or fed to finish_open() when it's in
atomic open) with some unlocks along the way.
	* ocfs2 - broken, and in a way that wouldn't be fixable at dentry
allocation time.
	* cifs - seeding dcache on readdir.  No postprocessing of dentry.
fs/9p/vfs_inode.c:834:  res = d_splice_alias(inode, dentry);
	* fuse - essentially touching a timestamp of dentry, be it new or
old one.  Both on ->lookup() and when seeding dcache on readdir.  Same
as done on successful revalidate, so the worst we are risking is extra work
in ->d_revalidate() for somebody finding it in dcache before we'd touched
the timestamp.
	* nfs - ditto.
	* 9p - the fid we'd obtained is hung on whatever dentry we get.
If revalidate doesn't find a fid, it'll try to get one from server and
hang it on the dentry.

d_obtain_alias() callers make it even more clear - 27 call sites where it's
directly returned, 5 more where it's returned without being dereferenced
and 5 callers in 3 filesystems (ceph, fuse, lustre) where something is
(or ought to be) done.  ceph might benefit from that thing as well - looks
like its users of d_obtain_alias() might be racy wrt d_splice_alias() from
->lookup() picking the alias before ceph_init_dentry() finishes.  Or not -
the call chain in there is convoluted as hell and I might miss the call
of ceph_init_denty() along the way.  ocfs2 might or might not be correct -
it doesn't do any postprocessing after d_obtain_alias(); no idea if it
really needs one there.

So we have two filesystems that could benefit from having ->d_init() -
ceph and lustre.  Note that e.g. ocfs2 wouldn't benefit from that - it needs
a lot more than just allocation *and* it uses ->d_fsdata differently for
positives and negatives.

FWIW, there's a troubling thing in ceph_init_dentry() -
        if (ceph_snap(d_inode(dentry->d_parent)) == CEPH_NOSNAP)
                d_set_d_op(dentry, &ceph_dentry_ops);
        else if (ceph_snap(d_inode(dentry->d_parent)) == CEPH_SNAPDIR)
                d_set_d_op(dentry, &ceph_snapdir_dentry_ops);
        else   
                d_set_d_op(dentry, &ceph_snap_dentry_ops);

_That_ can't be done in __d_alloc() and, what's worse, it doesn't look good
in their __fh_to_dentry() call of ceph_init_dentry().  ->d_parent has
a good chance to be pointing to the dentry itself.  Not sure how much trouble
does it lead to - I'm not familiar with ceph snapshots or ceph reexports over
NFS, let alone the combination...

> The downside is that we do the allocation at all times even if the dentry is
> going to be promptly destroyed because we found a better alias.

The cost of kmalloc() (or kmem_cache_alloc() - it might make sense
to introduce a specialized slab for those beasts) is trivial compared to the
cost of talking to server to find which inode you are going to deal with.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-10  2:59                 ` Al Viro
@ 2016-03-10 23:55                   ` Theodore Ts'o
  2016-03-11  3:18                     ` Al Viro
  0 siblings, 1 reply; 30+ messages in thread
From: Theodore Ts'o @ 2016-03-10 23:55 UTC (permalink / raw)
  To: Al Viro
  Cc: Drokin, Oleg, Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Mark Fasheh

On Thu, Mar 10, 2016 at 02:59:49AM +0000, Al Viro wrote:
> On Thu, Mar 10, 2016 at 02:20:42AM +0000, Al Viro wrote:
> 
> > Umm...  AFAICS, ext4_d_revalidate() is racy, starting with the very
> > first line.  What's to prevent it being moved while we are calling that?
> > Lose timeslice on preemption, have mv(1) move it elsewhere, followed by
> > rmdir taking the now-empty parent out.  Come back and dir points to
> > freed memory, with ci being complete junk.  Looks like oopsen galore...
> > Ted, am I missing something subtle here?
> 
> BTW, the fact that original parent dentry is pinned by caller doesn't help
> at all - by the time we get to ext4_d_revalidate() its ->d_parent might have
> been pointing to something we are *not* pinning, with another rename() +
> rmdir() completing the problem.  It's going to be hard to hit, but not
> impossible.  Have d_move() happen right after we'd found the match in
> __d_lookup(), then get preempted just as we'd fetched (already changed)
> ->d_parent->d_inode in ext4_d_revalidate().  The second rename() + rmdir()
> have to complete by the time we regain CPU and we are screwed.

The fundamental problem is that ChromeOS wanted root to be able to
delete files for which it didn't have the key.  (The use case is you
have multiple users sharing a single ChromeOS laptop, and you want to
reclaim space from another user's cache directory by deleting files,
even though you don't have the key to decrypt the files' contents or
filename.)  So if you don't have the key, the directory entries are
returned in as a (modified) base-64 encoding of the encrypted
directory entry.

The problem is we've been using the keyring infrastructure, which
means that it's possible for a non-privileged uid (for example,
"chronos") to have access to the key, but for another user (perhaps
the root user) to not have access to the key.  And this violates one
of the fundamental assumptions of the dentry cache, which is that
there is only one view of the namespace for all users.  Some users
might not have permission to access part of the namespace --- but if
they do, it looks the same regardless of whether you are root or a
non-root user.

The ext4_d_revalidate() function was an attempt to square the circle,
but yes, I've been coming to the conclusion that it doesn't work all
that well.  One thing I've been considering is to make access to the
keys a global property.  So the first time a user with access to the
key tries to access an encrypted file, we import the key into mounted
file system's ext4_sb_info structure, and we bump a generation
counter, and then ext4_d_revalidate() simply returns "invalid" for all
denrty entries which (a) refer to an encrypted file or directory, and
(b) whose generation number is less than the current generation
number.

Similarly, if we invalidate a key, we remove the key from tthe keyring
hanging off of the mounted file system's sb_info structure, and then
bump the generation number, which will invalidate the dentries for all
encrypted files/directories on that file system.

This is a bit non-optimal, but I don't see any other way of solving
the problem.  Al, do you have any suggestions?

					- Ted

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-10 23:55                   ` Theodore Ts'o
@ 2016-03-11  3:18                     ` Al Viro
  2016-03-11 15:42                       ` Theodore Ts'o
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2016-03-11  3:18 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Drokin, Oleg, Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Mark Fasheh

On Thu, Mar 10, 2016 at 06:55:43PM -0500, Theodore Ts'o wrote:

> The ext4_d_revalidate() function was an attempt to square the circle,
> but yes, I've been coming to the conclusion that it doesn't work all
> that well.  One thing I've been considering is to make access to the
> keys a global property.  So the first time a user with access to the
> key tries to access an encrypted file, we import the key into mounted
> file system's ext4_sb_info structure, and we bump a generation
> counter, and then ext4_d_revalidate() simply returns "invalid" for all
> denrty entries which (a) refer to an encrypted file or directory, and
> (b) whose generation number is less than the current generation
> number.

That sounds rather DoSable, if I'm not misparsing you...

> Similarly, if we invalidate a key, we remove the key from tthe keyring
> hanging off of the mounted file system's sb_info structure, and then
> bump the generation number, which will invalidate the dentries for all
> encrypted files/directories on that file system.
> 
> This is a bit non-optimal, but I don't see any other way of solving
> the problem.  Al, do you have any suggestions?

In any case, the current variant needs at least dget_parent()/dput() - blind
access of ->d_parent is simply broken.  As for using ->d_revalidate() for
that stuff...  I'm not sure.  How should those directories look like for
somebody who doesn't have a key?  Should e.g. getdents(2) results depend on
who's calling it, or who'd opened the directory, or...?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-10 23:23                     ` Al Viro
@ 2016-03-11  3:25                       ` Drokin, Oleg
  2016-03-12 17:22                         ` Al Viro
  0 siblings, 1 reply; 30+ messages in thread
From: Drokin, Oleg @ 2016-03-11  3:25 UTC (permalink / raw)
  To: Al Viro
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Theodore Ts'o,
	Mark Fasheh

On Mar 10, 2016, at 6:23 PM, Al Viro wrote:

> On Thu, Mar 10, 2016 at 09:22:21PM +0000, Drokin, Oleg wrote:
>> The downside is that we do the allocation at all times even if the dentry is
>> going to be promptly destroyed because we found a better alias.
> 
> The cost of kmalloc() (or kmem_cache_alloc() - it might make sense
> to introduce a specialized slab for those beasts) is trivial compared to the
> cost of talking to server to find which inode you are going to deal with.

It is, though it's not like we always need to talk to the server in those
cases.
Overhead is overhead no matter how small, and an extra check for the d_init
d_op would happen for every filesystem, not just ceph and Lustre.
Overhead also has this bad property of adding up.
Yes, it's minuscule and CPUs are still getting somewhat faster.
Is it worth it in the long run? May be, I do not have a strong opinion
here since always doing ll_d_init after d_splice_alias solves the issue
for us and does it in a way that does not affect anybody else. And as you
correctly observed, most of users don't really care.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-10  5:15                         ` Al Viro
@ 2016-03-11  3:47                           ` Drokin, Oleg
  0 siblings, 0 replies; 30+ messages in thread
From: Drokin, Oleg @ 2016-03-11  3:47 UTC (permalink / raw)
  To: Al Viro; +Cc: Dilger, Andreas, <linux-fsdevel@vger.kernel.org>


On Mar 10, 2016, at 12:15 AM, Al Viro wrote:

> PS: there's a reason why e.g. NFS ->d_revalidate() is doing
>        if (flags & LOOKUP_RCU) {
>                parent = ACCESS_ONCE(dentry->d_parent);
>                dir = d_inode_rcu(parent);
>                if (!dir)
>                        return -ECHILD;
>        } else {
>                parent = dget_parent(dentry);
>                dir = d_inode(parent);
>        }  
> and so do other instances.  It does *not* guarantee that parent will remain
> the parent through the whole thing (or will still be one by the time
> dget_parent() caller gets the return value), but it does guarantee that it
> won't get freed under you.  Note that the original parent won't disappear
> (it's pinned by the caller), but there's no promise that what you'll
> fetch from dentry->d_parent inside the method will have anything to do with
> that.
> 
> BTW, we might be better off if we passed the parent and child as separate
> argumentsï¿½

I wonder if this is something you are seriously contemplating doing or should
I still future-proof Lustre a bit and get a patch going to use dget_parent/dput
in ll_revalidate_dentry in Lustre?
While it's not a problem now due to the external locking, that might change
in the future

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-11  3:18                     ` Al Viro
@ 2016-03-11 15:42                       ` Theodore Ts'o
  0 siblings, 0 replies; 30+ messages in thread
From: Theodore Ts'o @ 2016-03-11 15:42 UTC (permalink / raw)
  To: Al Viro
  Cc: Drokin, Oleg, Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Mark Fasheh

On Fri, Mar 11, 2016 at 03:18:51AM +0000, Al Viro wrote:
> On Thu, Mar 10, 2016 at 06:55:43PM -0500, Theodore Ts'o wrote:
> 
> > The ext4_d_revalidate() function was an attempt to square the circle,
> > but yes, I've been coming to the conclusion that it doesn't work all
> > that well.  One thing I've been considering is to make access to the
> > keys a global property.  So the first time a user with access to the
> > key tries to access an encrypted file, we import the key into mounted
> > file system's ext4_sb_info structure, and we bump a generation
> > counter, and then ext4_d_revalidate() simply returns "invalid" for all
> > denrty entries which (a) refer to an encrypted file or directory, and
> > (b) whose generation number is less than the current generation
> > number.
> 
> That sounds rather DoSable, if I'm not misparsing you...

Well, it means that someone who can add and remove a key can force all
file system accesses for encrypted files to go through ext4_lookup(),
yes.  As I said, "non-optimal".  I suppose I could make the generation
number be one on a per-key-id basis to mitigate this problem.

> 
> > Similarly, if we invalidate a key, we remove the key from tthe keyring
> > hanging off of the mounted file system's sb_info structure, and then
> > bump the generation number, which will invalidate the dentries for all
> > encrypted files/directories on that file system.
> > 
> > This is a bit non-optimal, but I don't see any other way of solving
> > the problem.  Al, do you have any suggestions?
> 
> In any case, the current variant needs at least dget_parent()/dput() - blind
> access of ->d_parent is simply broken.  As for using ->d_revalidate() for
> that stuff...  I'm not sure.  How should those directories look like for
> somebody who doesn't have a key?  Should e.g. getdents(2) results depend on
> who's calling it, or who'd opened the directory, or...?

>From a security perspective, we are assuming that the kernel is
trusted, and that if the key is present, the kernel can be counted
upon to use Unix permissions to enforce access controls.  The use of
encryption is to prevent off-line attacks, and to limit the exposure
such that when a user is logged out (and their key is not present),
their data is not at risk.  Only the data for the logged-in user(s)
would be at risk.

Right now we do make the getdents(2) results depend on who opened the
directory --- although we probably should make it be based on who is
calling getdents(2) --- mainly because we don't want a inconsistencies
where getdents(2) returns the decrypted file names to a user who
doesn't have the key, and so attempts do a lookup fail (for example).

Or inconsistencies where the user tried to do a lookup for a file,
failed because the key hadn't been established yet at that point in
the login sequence, but then after the key has been loaded, since we
have a negative dentry caching the previous failure, future attempts
to lookup that file fail even though they now have the key.  This was
what the ext4_d_revalidate() was trying to address....

I realize that at some level we're doing something that the dentry
cache was never designed to handle; it fundamentally assumes that
there is a single global view of a file system hierarchy.  This is why
I've about making the availability (or not) of keys a global property,
and when that changes, we will need to revoke the cache since the
cached results might no longer be correct.

							- Ted

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-11  3:25                       ` Drokin, Oleg
@ 2016-03-12 17:22                         ` Al Viro
  2016-03-13 14:35                           ` Sage Weil
  0 siblings, 1 reply; 30+ messages in thread
From: Al Viro @ 2016-03-12 17:22 UTC (permalink / raw)
  To: Drokin, Oleg
  Cc: Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Theodore Ts'o,
	Mark Fasheh, Sage Weil

On Fri, Mar 11, 2016 at 03:25:22AM +0000, Drokin, Oleg wrote:

> It is, though it's not like we always need to talk to the server in those
> cases.
> Overhead is overhead no matter how small, and an extra check for the d_init
> d_op would happen for every filesystem, not just ceph and Lustre.
> Overhead also has this bad property of adding up.
> Yes, it's minuscule and CPUs are still getting somewhat faster.
> Is it worth it in the long run? May be, I do not have a strong opinion
> here since always doing ll_d_init after d_splice_alias solves the issue
> for us and does it in a way that does not affect anybody else. And as you
> correctly observed, most of users don't really care.--

Maybe...  Keep in mind that d_set_d_op() called just prior to that point
has already done a bunch of checks for methods being non-NULL, so we are
looking at something pretty much guaranteed to be in cache.  So the cost
of test is pretty much nil; it turns into "fetch (with cache hit), test,
branch (expectedly) not taken" in normal case.

We probably have already spent more cycles discussing that than all processors
will spend executing those instructions ;-)

I'm really curious about the ceph situation; fh_to_dentry turns out to
be irrelevant (ceph refuses to export snapshots and inode it'll be looking
for won't be in snapshot, just as its parent), but I wonder why we have
those 3 variants of ->d_op in the first place.

->d_release() is the same in all 3.
->d_prune() is the same in plain and snapshot variants and absent for
children of .snap
->d_revalidate() is nominally different for all 3, but stuff inside the
snapshots and .snap children have equivalent ones *AND* normal ->d_revalidate
appears to cover all three cases -
        dir = ceph_get_dentry_parent_inode(dentry);  

        /* always trust cached snapped dentries, snapdir dentry */
        if (ceph_snap(dir) != CEPH_NOSNAP) {
                dout("d_revalidate %p '%pd' inode %p is SNAPPED\n", dentry,
                     dentry, d_inode(dentry));
                valid = 1;
        } else if (d_really_is_positive(dentry) &&
                   ceph_snap(d_inode(dentry)) == CEPH_SNAPDIR) {
                valid = 1;
        } else if (dentry_lease_is_valid(dentry) ||
in there is explicitly checking the other two.

Sage, is there any reason not to fold all three dentry_operations variants
in one and be done with d_set_d_op()?  I see that in one place you check
->d_op being the plain one, but AFAICS checking ceph_snap() of the parent
would serve just as well there.  BTW, looks like ceph is too pessimistic
about bailing out of RCU mode on ->d_revalidate(), might be worth looking
into at some point...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: races in ll_splice_alias() and elsewhere (ext4, ocfs2)
  2016-03-12 17:22                         ` Al Viro
@ 2016-03-13 14:35                           ` Sage Weil
  0 siblings, 0 replies; 30+ messages in thread
From: Sage Weil @ 2016-03-13 14:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Drokin, Oleg, Dilger, Andreas, Linus Torvalds,
	<linux-fsdevel@vger.kernel.org>, Theodore Ts'o,
	Mark Fasheh

On Sat, 12 Mar 2016, Al Viro wrote:
> On Fri, Mar 11, 2016 at 03:25:22AM +0000, Drokin, Oleg wrote:
> 
> > It is, though it's not like we always need to talk to the server in those
> > cases.
> > Overhead is overhead no matter how small, and an extra check for the d_init
> > d_op would happen for every filesystem, not just ceph and Lustre.
> > Overhead also has this bad property of adding up.
> > Yes, it's minuscule and CPUs are still getting somewhat faster.
> > Is it worth it in the long run? May be, I do not have a strong opinion
> > here since always doing ll_d_init after d_splice_alias solves the issue
> > for us and does it in a way that does not affect anybody else. And as you
> > correctly observed, most of users don't really care.--
> 
> Maybe...  Keep in mind that d_set_d_op() called just prior to that point
> has already done a bunch of checks for methods being non-NULL, so we are
> looking at something pretty much guaranteed to be in cache.  So the cost
> of test is pretty much nil; it turns into "fetch (with cache hit), test,
> branch (expectedly) not taken" in normal case.
> 
> We probably have already spent more cycles discussing that than all processors
> will spend executing those instructions ;-)
> 
> I'm really curious about the ceph situation; fh_to_dentry turns out to
> be irrelevant (ceph refuses to export snapshots and inode it'll be looking
> for won't be in snapshot, just as its parent), but I wonder why we have
> those 3 variants of ->d_op in the first place.
> 
> ->d_release() is the same in all 3.
> ->d_prune() is the same in plain and snapshot variants and absent for
> children of .snap
> ->d_revalidate() is nominally different for all 3, but stuff inside the
> snapshots and .snap children have equivalent ones *AND* normal ->d_revalidate
> appears to cover all three cases -
>         dir = ceph_get_dentry_parent_inode(dentry);  
> 
>         /* always trust cached snapped dentries, snapdir dentry */
>         if (ceph_snap(dir) != CEPH_NOSNAP) {
>                 dout("d_revalidate %p '%pd' inode %p is SNAPPED\n", dentry,
>                      dentry, d_inode(dentry));
>                 valid = 1;
>         } else if (d_really_is_positive(dentry) &&
>                    ceph_snap(d_inode(dentry)) == CEPH_SNAPDIR) {
>                 valid = 1;
>         } else if (dentry_lease_is_valid(dentry) ||
> in there is explicitly checking the other two.
> 
> Sage, is there any reason not to fold all three dentry_operations variants
> in one and be done with d_set_d_op()?  I see that in one place you check
> ->d_op being the plain one, but AFAICS checking ceph_snap() of the parent
> would serve just as well there. 

Yeah, I don't see any reason why we can't fold them into a single 
d_ops.

> BTW, looks like ceph is too pessimistic
> about bailing out of RCU mode on ->d_revalidate(), might be worth looking
> into at some point...

Ah, that dates back to Nick Piggin's original pushdown when adding RCU 
walk back in 2.6.38.  I'll make a note, thanks!

sage

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2016-03-13 14:35 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-03-08 16:05 races in ll_splice_alias() Al Viro
2016-03-08 20:44 ` Drokin, Oleg
2016-03-08 21:11   ` Al Viro
2016-03-08 23:18     ` Drokin, Oleg
2016-03-09  0:34       ` Al Viro
2016-03-09  0:53         ` Drokin, Oleg
2016-03-09  1:26           ` Al Viro
2016-03-09  5:20             ` Drokin, Oleg
2016-03-09 23:47             ` Drokin, Oleg
2016-03-10  2:20               ` races in ll_splice_alias() and elsewhere (ext4, ocfs2) Al Viro
2016-03-10  2:59                 ` Al Viro
2016-03-10 23:55                   ` Theodore Ts'o
2016-03-11  3:18                     ` Al Viro
2016-03-11 15:42                       ` Theodore Ts'o
2016-03-10  3:08                 ` Drokin, Oleg
2016-03-10  3:34                   ` Al Viro
2016-03-10  3:46                     ` Drokin, Oleg
2016-03-10  4:22                       ` Drokin, Oleg
2016-03-10  4:43                       ` Al Viro
2016-03-10  5:15                         ` Al Viro
2016-03-11  3:47                           ` Drokin, Oleg
2016-03-10  5:47                         ` Drokin, Oleg
2016-03-10 19:59                 ` Al Viro
2016-03-10 20:34                   ` do we need that smp_wmb() in __d_alloc()? Al Viro
2016-03-10 21:17                     ` Al Viro
2016-03-10 21:22                   ` races in ll_splice_alias() and elsewhere (ext4, ocfs2) Drokin, Oleg
2016-03-10 23:23                     ` Al Viro
2016-03-11  3:25                       ` Drokin, Oleg
2016-03-12 17:22                         ` Al Viro
2016-03-13 14:35                           ` Sage Weil

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).