d_revalidate and real

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* d_revalidate and real_lookup woes
@ 2008-08-21 23:48 Sage Weil
  2008-08-22 23:18 ` Andreas Dilger
  0 siblings, 1 reply; 3+ messages in thread
From: Sage Weil @ 2008-08-21 23:48 UTC (permalink / raw)
  To: viro, linux-fsdevel; +Cc: mfasheh, zach.brown

I came across two somewhat related issues with d_revalidate, and wanted to 
find out if these were either known issues (possibly even addressed by 
Al's planned dcache changes?) or just the product of my own confusion..

The d_revalidate() behavior in real_lookup() currently bails out with 
-ENOENT when it encounters a race with i_mutex that confuses it.  
Specifically:

- in do_lookup(), if no dentry is found (or do_revalidate() fails), we 
call real_lookup().
- real_lookup() takes the dir i_mutex.
- if the dentry now exists (is found by d_lookup()), real_lookup() drops 
i_mutex, and calls do_revalidate() (again).
- if do_revalidate() fails now, we return -ENOENT.

This "two chances or -ENOENT" strategy strikes me as fundamentally racy.  
If, while we wait for i_mutex, something else hashes the dentry, but for 
one reason or another leaves it in a state in which d_revalidate will 
fail, we get -ENOENT instead of a fresh call to lookup.  For (reasonably 
long) timeout-based revalidation schemes, that doesn't really happen, but 
if your timeout is sufficiently short and i_mutex is contended, or you are 
explicitly invalidating a dentry (e.g. due to a callback from the server), 
it can happen more frequently. 

It also looks like we can get into trouble if a ->d_revalidate() doesn't 
d_drop() when it fails: do_revalidate() will try to d_invalidate(), but 
that can -EBUSY if the dentry has non-prunable children, and 
do_revalidate() silently drops that error condition.  If I'm reading this 
correctly, that means if we ever get a hashed dentry with children for 
which ->d_revalidate() fails (and doesn't d_drop()), we will get -ENOENT 
forever without ever calling ->lookup().

(I started worrying about d_revalidate()s that don't d_drop() after 
noticing that at least cifs and ocfs2 don't always do so on failure.  It 
seems reasonable to expect/require d_revalidate() to d_drop() any dentry 
it fails on, but that doesn't appear to be the case.)

Possible solutions?

- I looked a bit at whether simply looping in real_lookup() would address 
the issue (i.e. retake i_mutex, d_lookup(), call ->lookup() as needed).  
Starvation issues aside, that can loop forever if it gets a 
bad-but-hashed dentry (as described above).

- Instead of returning -ENOENT (which seems kind of unlikely to ever be 
the "right" answer), real_lookup() could call ->lookup(), even though the 
dentry is hashed.  Assuming that's even allowed (?), that might lead to 
more ->lookup() calls than we need...

 - A new ->d_revalidate_locked() method that can be called with the dir 
i_mutex held would allow real_lookup() to decide whether to call 
->lookup() while avoiding any additional race/starvaction issues.

None of these possibilities seem especially attractive. :( Are these known 
issues, or am I just misreading the code here?

sage

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: d_revalidate and real_lookup woes
  2008-08-21 23:48 d_revalidate and real_lookup woes Sage Weil
@ 2008-08-22 23:18 ` Andreas Dilger
  2008-11-06 20:06   ` Sage Weil
  0 siblings, 1 reply; 3+ messages in thread
From: Andreas Dilger @ 2008-08-22 23:18 UTC (permalink / raw)
  To: Sage Weil; +Cc: viro, linux-fsdevel, mfasheh, zach.brown

On Aug 21, 2008  16:48 -0700, Sage Weil wrote:
> I came across two somewhat related issues with d_revalidate, and wanted to 
> find out if these were either known issues (possibly even addressed by 
> Al's planned dcache changes?) or just the product of my own confusion..
> 
> The d_revalidate() behavior in real_lookup() currently bails out with 
> -ENOENT when it encounters a race with i_mutex that confuses it.  
> Specifically:
> 
> - in do_lookup(), if no dentry is found (or do_revalidate() fails), we 
> call real_lookup().
> - real_lookup() takes the dir i_mutex.
> - if the dentry now exists (is found by d_lookup()), real_lookup() drops 
> i_mutex, and calls do_revalidate() (again).
> - if do_revalidate() fails now, we return -ENOENT.
> 
> This "two chances or -ENOENT" strategy strikes me as fundamentally racy.  
> If, while we wait for i_mutex, something else hashes the dentry, but for 
> one reason or another leaves it in a state in which d_revalidate will 
> fail, we get -ENOENT instead of a fresh call to lookup.  For (reasonably 
> long) timeout-based revalidation schemes, that doesn't really happen, but 
> if your timeout is sufficiently short and i_mutex is contended, or you are 
> explicitly invalidating a dentry (e.g. due to a callback from the server), 
> it can happen more frequently. 

Yes, we had all sorts of problems with this code with Lustre, because it
falls into the category of "server revokes lock rapidly on contended
directory".

> It also looks like we can get into trouble if a ->d_revalidate() doesn't 
> d_drop() when it fails: do_revalidate() will try to d_invalidate(), but 
> that can -EBUSY if the dentry has non-prunable children, and 
> do_revalidate() silently drops that error condition.  If I'm reading this 
> correctly, that means if we ever get a hashed dentry with children for 
> which ->d_revalidate() fails (and doesn't d_drop()), we will get -ENOENT 
> forever without ever calling ->lookup().

Hmm, this might even be a bug we see occasionally that hasn't been
investigated deeply enough...

> - I looked a bit at whether simply looping in real_lookup() would address 
> the issue (i.e. retake i_mutex, d_lookup(), call ->lookup() as needed).  
> Starvation issues aside, that can loop forever if it gets a 
> bad-but-hashed dentry (as described above).

We put in an arbitrary limit of 10 retries in real_lookup() before
returning -ESTALE.

> - Instead of returning -ENOENT (which seems kind of unlikely to ever be 
> the "right" answer), real_lookup() could call ->lookup(), even though the 
> dentry is hashed.  Assuming that's even allowed (?), that might lead to 
> more ->lookup() calls than we need...
> 
>  - A new ->d_revalidate_locked() method that can be called with the dir 
> i_mutex held would allow real_lookup() to decide whether to call 
> ->lookup() while avoiding any additional race/starvaction issues.

We did that also, moving looping inside the mutex_lock(), and moving
mutex_unlock() until after the revalidate.  After some code inspection,
the only filesystem that touched i_mutex inside revalidate was devfs
(at the time), but that isn't in the tree anymore.  I'm not sure if
other filesystems have changed since that time to use i_mutex there.

A patch against the SLES10 kernel is below.

diff -urNp linux-2.6.16.21-0.8.orig/fs/namei.c linux-2.6.16.21-0.8/fs/namei.c
--- linux-2.6.16.21-0.8.orig/fs/namei.c	2006-10-04 02:18:11.000000000 +0300
+++ linux-2.6.16.21-0.8/fs/namei.c	2007-01-29 18:21:10.000000000 +0200
@@ -440,8 +451,12 @@ static struct dentry * real_lookup(struc
 {
 	struct dentry * result;
 	struct inode *dir = parent->d_inode;
+	int counter = 0;
 
 	mutex_lock(&dir->i_mutex);
+again:
+	counter++;
+
 	/*
 	 * First re-do the cached lookup just in case it was created
 	 * while we waited for the directory semaphore..
@@ -475,13 +490,16 @@ static struct dentry * real_lookup(struc
 	 * Uhhuh! Nasty case: the cache was re-populated while
 	 * we waited on the semaphore. Need to revalidate.
 	 */
-	mutex_unlock(&dir->i_mutex);
 	if (result->d_op && result->d_op->d_revalidate) {
 		if (!result->d_op->d_revalidate(result, nd) && !d_invalidate(result)) {
 			dput(result);
-			result = ERR_PTR(-ENOENT);
+			if (counter > 10)
+				result = ERR_PTR(-ESTALE);
+			if (!IS_ERR(result))
+				goto again;
 		}
 	}
+	mutex_unlock(&dir->i_mutex);
 	return result;
 }
 
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: d_revalidate and real_lookup woes
  2008-08-22 23:18 ` Andreas Dilger
@ 2008-11-06 20:06   ` Sage Weil
  0 siblings, 0 replies; 3+ messages in thread
From: Sage Weil @ 2008-11-06 20:06 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: viro, linux-fsdevel, mfasheh, zach.brown, yehuda, akpm

Hi,

I dropped this real_lookup() problem for a while, but it just bit us again 
and I think we have a cleaner solution.  (BTW thanks for looking at this 
originally, Andreas!)

On Fri, 22 Aug 2008, Andreas Dilger wrote:
> On Aug 21, 2008  16:48 -0700, Sage Weil wrote:
> > The d_revalidate() behavior in real_lookup() currently bails out with 
> > -ENOENT when it encounters a race with i_mutex that confuses it.  
> > Specifically:
> > 
> > - in do_lookup(), if no dentry is found (or do_revalidate() fails), we 
> > call real_lookup().
> > - real_lookup() takes the dir i_mutex.
> > - if the dentry now exists (is found by d_lookup()), real_lookup() drops 
> > i_mutex, and calls do_revalidate() (again).
> > - if do_revalidate() fails now, we return -ENOENT.
> > 
> > This "two chances or -ENOENT" strategy strikes me as fundamentally racy.  
> > If, while we wait for i_mutex, something else hashes the dentry, but for 
> > one reason or another leaves it in a state in which d_revalidate will 
> > fail, we get -ENOENT instead of a fresh call to lookup.  For (reasonably 
> > long) timeout-based revalidation schemes, that doesn't really happen, but 
> > if your timeout is sufficiently short and i_mutex is contended, or you are 
> > explicitly invalidating a dentry (e.g. due to a callback from the server), 
> > it can happen more frequently. 
> 
> Yes, we had all sorts of problems with this code with Lustre, because it
> falls into the category of "server revokes lock rapidly on contended
> directory".

[...]

> >  - A new ->d_revalidate_locked() method that can be called with the dir 
> > i_mutex held would allow real_lookup() to decide whether to call 
> > ->lookup() while avoiding any additional race/starvaction issues.
> 
> We did that also, moving looping inside the mutex_lock(), and moving
> mutex_unlock() until after the revalidate.  After some code inspection,
> the only filesystem that touched i_mutex inside revalidate was devfs
> (at the time), but that isn't in the tree anymore.  I'm not sure if
> other filesystems have changed since that time to use i_mutex there.

We did a bit more inspection, and noticed that d_revalidate is _already_ 
called with i_mutex held, in lots of places.  do_filp_open(), 
lookup_create(), do_unlinkat(), do_rmdir(), and possibly others all take 
i_mutex, and then

-> lookup_hash
	-> __lookup_hash
		-> cached_lookup
			-> do_revalidate

Assuming that this is all okay, I think the cleaner solution is to just 
do_revalidate in real_lookup with i_mutex still held, and avoid any 
subsequent races altogether.  This simplifies the function a bit and gets 
rid of the nasty -ENOENT behavior.  We have a reliable test case (on ceph) 
and this fixes it.

Does this look reasonable?  If so, it would be great to get a fix for this 
upstream one way or another.  It is highly unlikely to bite existing in 
tree file systems like NFS with the spurious -ENOENT, but it's certainly 
possible, and should be fixed regardless.

Thanks-
sage


Signed-off-by: Yehuda Sadeh <yehuda@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
---

diff --git a/fs/namei.c b/fs/namei.c
index 7a7949c..698b36c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -475,6 +475,7 @@ static struct dentry * real_lookup(struct dentry * 
parent, struct qstr * name, s
 {
        struct dentry * result;
        struct inode *dir = parent->d_inode;
+       struct dentry *dentry;
 
        mutex_lock(&dir->i_mutex);
        /*
@@ -492,38 +493,41 @@ static struct dentry * real_lookup(struct dentry * 
parent, struct qstr * name, s
         * so doing d_lookup() (with seqlock), instead of lockfree 
__d_lookup
         */
        result = d_lookup(parent, name);
-       if (!result) {
-               struct dentry *dentry;
-
-               /* Don't create child dentry for a dead directory. */
-               result = ERR_PTR(-ENOENT);
-               if (IS_DEADDIR(dir))
-                       goto out_unlock;
-
-               dentry = d_alloc(parent, name);
-               result = ERR_PTR(-ENOMEM);
-               if (dentry) {
-                       result = dir->i_op->lookup(dir, dentry, nd);
+       if (result) {
+               /*
+                * Uhhuh! Nasty case: the cache was re-populated while
+                * we waited on the mutex. Need to revalidate, this
+                * time with i_mutex held to avoid any subsequent
+                * race.
+                */
+               if (result->d_op && result->d_op->d_revalidate) {
+                       result = do_revalidate(result, nd);
                        if (result)
-                               dput(dentry);
-                       else
-                               result = dentry;
+                               goto out_unlock;
+                       /*
+                        * Not only did we race on the dentry, but it was
+                        * left behind invalid. We'll just do the lookup.
+                        */
                }
-out_unlock:
-               mutex_unlock(&dir->i_mutex);
-               return result;
        }
 
-       /*
-        * Uhhuh! Nasty case: the cache was re-populated while
-        * we waited on the semaphore. Need to revalidate.
-        */
-       mutex_unlock(&dir->i_mutex);
-       if (result->d_op && result->d_op->d_revalidate) {
-               result = do_revalidate(result, nd);
-               if (!result)
-                       result = ERR_PTR(-ENOENT);
+       /* Don't create child dentry for a dead directory. */
+       result = ERR_PTR(-ENOENT);
+       if (IS_DEADDIR(dir))
+               goto out_unlock;
+
+       dentry = d_alloc(parent, name);
+       result = ERR_PTR(-ENOMEM);
+       if (dentry) {
+               result = dir->i_op->lookup(dir, dentry, nd);
+               if (result)
+                       dput(dentry);
+               else
+                       result = dentry;
        }
+out_unlock:
+       mutex_unlock(&dir->i_mutex);
+
        return result;
 }




^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2008-11-06 20:06 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-08-21 23:48 d_revalidate and real_lookup woes Sage Weil
2008-08-22 23:18 ` Andreas Dilger
2008-11-06 20:06   ` Sage Weil

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).