* d_revalidate and real_lookup woes
@ 2008-08-21 23:48 Sage Weil
2008-08-22 23:18 ` Andreas Dilger
0 siblings, 1 reply; 3+ messages in thread
From: Sage Weil @ 2008-08-21 23:48 UTC (permalink / raw)
To: viro, linux-fsdevel; +Cc: mfasheh, zach.brown
I came across two somewhat related issues with d_revalidate, and wanted to
find out if these were either known issues (possibly even addressed by
Al's planned dcache changes?) or just the product of my own confusion..
The d_revalidate() behavior in real_lookup() currently bails out with
-ENOENT when it encounters a race with i_mutex that confuses it.
Specifically:
- in do_lookup(), if no dentry is found (or do_revalidate() fails), we
call real_lookup().
- real_lookup() takes the dir i_mutex.
- if the dentry now exists (is found by d_lookup()), real_lookup() drops
i_mutex, and calls do_revalidate() (again).
- if do_revalidate() fails now, we return -ENOENT.
This "two chances or -ENOENT" strategy strikes me as fundamentally racy.
If, while we wait for i_mutex, something else hashes the dentry, but for
one reason or another leaves it in a state in which d_revalidate will
fail, we get -ENOENT instead of a fresh call to lookup. For (reasonably
long) timeout-based revalidation schemes, that doesn't really happen, but
if your timeout is sufficiently short and i_mutex is contended, or you are
explicitly invalidating a dentry (e.g. due to a callback from the server),
it can happen more frequently.
It also looks like we can get into trouble if a ->d_revalidate() doesn't
d_drop() when it fails: do_revalidate() will try to d_invalidate(), but
that can -EBUSY if the dentry has non-prunable children, and
do_revalidate() silently drops that error condition. If I'm reading this
correctly, that means if we ever get a hashed dentry with children for
which ->d_revalidate() fails (and doesn't d_drop()), we will get -ENOENT
forever without ever calling ->lookup().
(I started worrying about d_revalidate()s that don't d_drop() after
noticing that at least cifs and ocfs2 don't always do so on failure. It
seems reasonable to expect/require d_revalidate() to d_drop() any dentry
it fails on, but that doesn't appear to be the case.)
Possible solutions?
- I looked a bit at whether simply looping in real_lookup() would address
the issue (i.e. retake i_mutex, d_lookup(), call ->lookup() as needed).
Starvation issues aside, that can loop forever if it gets a
bad-but-hashed dentry (as described above).
- Instead of returning -ENOENT (which seems kind of unlikely to ever be
the "right" answer), real_lookup() could call ->lookup(), even though the
dentry is hashed. Assuming that's even allowed (?), that might lead to
more ->lookup() calls than we need...
- A new ->d_revalidate_locked() method that can be called with the dir
i_mutex held would allow real_lookup() to decide whether to call
->lookup() while avoiding any additional race/starvaction issues.
None of these possibilities seem especially attractive. :( Are these known
issues, or am I just misreading the code here?
sage
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: d_revalidate and real_lookup woes
2008-08-21 23:48 d_revalidate and real_lookup woes Sage Weil
@ 2008-08-22 23:18 ` Andreas Dilger
2008-11-06 20:06 ` Sage Weil
0 siblings, 1 reply; 3+ messages in thread
From: Andreas Dilger @ 2008-08-22 23:18 UTC (permalink / raw)
To: Sage Weil; +Cc: viro, linux-fsdevel, mfasheh, zach.brown
On Aug 21, 2008 16:48 -0700, Sage Weil wrote:
> I came across two somewhat related issues with d_revalidate, and wanted to
> find out if these were either known issues (possibly even addressed by
> Al's planned dcache changes?) or just the product of my own confusion..
>
> The d_revalidate() behavior in real_lookup() currently bails out with
> -ENOENT when it encounters a race with i_mutex that confuses it.
> Specifically:
>
> - in do_lookup(), if no dentry is found (or do_revalidate() fails), we
> call real_lookup().
> - real_lookup() takes the dir i_mutex.
> - if the dentry now exists (is found by d_lookup()), real_lookup() drops
> i_mutex, and calls do_revalidate() (again).
> - if do_revalidate() fails now, we return -ENOENT.
>
> This "two chances or -ENOENT" strategy strikes me as fundamentally racy.
> If, while we wait for i_mutex, something else hashes the dentry, but for
> one reason or another leaves it in a state in which d_revalidate will
> fail, we get -ENOENT instead of a fresh call to lookup. For (reasonably
> long) timeout-based revalidation schemes, that doesn't really happen, but
> if your timeout is sufficiently short and i_mutex is contended, or you are
> explicitly invalidating a dentry (e.g. due to a callback from the server),
> it can happen more frequently.
Yes, we had all sorts of problems with this code with Lustre, because it
falls into the category of "server revokes lock rapidly on contended
directory".
> It also looks like we can get into trouble if a ->d_revalidate() doesn't
> d_drop() when it fails: do_revalidate() will try to d_invalidate(), but
> that can -EBUSY if the dentry has non-prunable children, and
> do_revalidate() silently drops that error condition. If I'm reading this
> correctly, that means if we ever get a hashed dentry with children for
> which ->d_revalidate() fails (and doesn't d_drop()), we will get -ENOENT
> forever without ever calling ->lookup().
Hmm, this might even be a bug we see occasionally that hasn't been
investigated deeply enough...
> - I looked a bit at whether simply looping in real_lookup() would address
> the issue (i.e. retake i_mutex, d_lookup(), call ->lookup() as needed).
> Starvation issues aside, that can loop forever if it gets a
> bad-but-hashed dentry (as described above).
We put in an arbitrary limit of 10 retries in real_lookup() before
returning -ESTALE.
> - Instead of returning -ENOENT (which seems kind of unlikely to ever be
> the "right" answer), real_lookup() could call ->lookup(), even though the
> dentry is hashed. Assuming that's even allowed (?), that might lead to
> more ->lookup() calls than we need...
>
> - A new ->d_revalidate_locked() method that can be called with the dir
> i_mutex held would allow real_lookup() to decide whether to call
> ->lookup() while avoiding any additional race/starvaction issues.
We did that also, moving looping inside the mutex_lock(), and moving
mutex_unlock() until after the revalidate. After some code inspection,
the only filesystem that touched i_mutex inside revalidate was devfs
(at the time), but that isn't in the tree anymore. I'm not sure if
other filesystems have changed since that time to use i_mutex there.
A patch against the SLES10 kernel is below.
diff -urNp linux-2.6.16.21-0.8.orig/fs/namei.c linux-2.6.16.21-0.8/fs/namei.c
--- linux-2.6.16.21-0.8.orig/fs/namei.c 2006-10-04 02:18:11.000000000 +0300
+++ linux-2.6.16.21-0.8/fs/namei.c 2007-01-29 18:21:10.000000000 +0200
@@ -440,8 +451,12 @@ static struct dentry * real_lookup(struc
{
struct dentry * result;
struct inode *dir = parent->d_inode;
+ int counter = 0;
mutex_lock(&dir->i_mutex);
+again:
+ counter++;
+
/*
* First re-do the cached lookup just in case it was created
* while we waited for the directory semaphore..
@@ -475,13 +490,16 @@ static struct dentry * real_lookup(struc
* Uhhuh! Nasty case: the cache was re-populated while
* we waited on the semaphore. Need to revalidate.
*/
- mutex_unlock(&dir->i_mutex);
if (result->d_op && result->d_op->d_revalidate) {
if (!result->d_op->d_revalidate(result, nd) && !d_invalidate(result)) {
dput(result);
- result = ERR_PTR(-ENOENT);
+ if (counter > 10)
+ result = ERR_PTR(-ESTALE);
+ if (!IS_ERR(result))
+ goto again;
}
}
+ mutex_unlock(&dir->i_mutex);
return result;
}
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: d_revalidate and real_lookup woes
2008-08-22 23:18 ` Andreas Dilger
@ 2008-11-06 20:06 ` Sage Weil
0 siblings, 0 replies; 3+ messages in thread
From: Sage Weil @ 2008-11-06 20:06 UTC (permalink / raw)
To: Andreas Dilger; +Cc: viro, linux-fsdevel, mfasheh, zach.brown, yehuda, akpm
Hi,
I dropped this real_lookup() problem for a while, but it just bit us again
and I think we have a cleaner solution. (BTW thanks for looking at this
originally, Andreas!)
On Fri, 22 Aug 2008, Andreas Dilger wrote:
> On Aug 21, 2008 16:48 -0700, Sage Weil wrote:
> > The d_revalidate() behavior in real_lookup() currently bails out with
> > -ENOENT when it encounters a race with i_mutex that confuses it.
> > Specifically:
> >
> > - in do_lookup(), if no dentry is found (or do_revalidate() fails), we
> > call real_lookup().
> > - real_lookup() takes the dir i_mutex.
> > - if the dentry now exists (is found by d_lookup()), real_lookup() drops
> > i_mutex, and calls do_revalidate() (again).
> > - if do_revalidate() fails now, we return -ENOENT.
> >
> > This "two chances or -ENOENT" strategy strikes me as fundamentally racy.
> > If, while we wait for i_mutex, something else hashes the dentry, but for
> > one reason or another leaves it in a state in which d_revalidate will
> > fail, we get -ENOENT instead of a fresh call to lookup. For (reasonably
> > long) timeout-based revalidation schemes, that doesn't really happen, but
> > if your timeout is sufficiently short and i_mutex is contended, or you are
> > explicitly invalidating a dentry (e.g. due to a callback from the server),
> > it can happen more frequently.
>
> Yes, we had all sorts of problems with this code with Lustre, because it
> falls into the category of "server revokes lock rapidly on contended
> directory".
[...]
> > - A new ->d_revalidate_locked() method that can be called with the dir
> > i_mutex held would allow real_lookup() to decide whether to call
> > ->lookup() while avoiding any additional race/starvaction issues.
>
> We did that also, moving looping inside the mutex_lock(), and moving
> mutex_unlock() until after the revalidate. After some code inspection,
> the only filesystem that touched i_mutex inside revalidate was devfs
> (at the time), but that isn't in the tree anymore. I'm not sure if
> other filesystems have changed since that time to use i_mutex there.
We did a bit more inspection, and noticed that d_revalidate is _already_
called with i_mutex held, in lots of places. do_filp_open(),
lookup_create(), do_unlinkat(), do_rmdir(), and possibly others all take
i_mutex, and then
-> lookup_hash
-> __lookup_hash
-> cached_lookup
-> do_revalidate
Assuming that this is all okay, I think the cleaner solution is to just
do_revalidate in real_lookup with i_mutex still held, and avoid any
subsequent races altogether. This simplifies the function a bit and gets
rid of the nasty -ENOENT behavior. We have a reliable test case (on ceph)
and this fixes it.
Does this look reasonable? If so, it would be great to get a fix for this
upstream one way or another. It is highly unlikely to bite existing in
tree file systems like NFS with the spurious -ENOENT, but it's certainly
possible, and should be fixed regardless.
Thanks-
sage
Signed-off-by: Yehuda Sadeh <yehuda@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
---
diff --git a/fs/namei.c b/fs/namei.c
index 7a7949c..698b36c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -475,6 +475,7 @@ static struct dentry * real_lookup(struct dentry *
parent, struct qstr * name, s
{
struct dentry * result;
struct inode *dir = parent->d_inode;
+ struct dentry *dentry;
mutex_lock(&dir->i_mutex);
/*
@@ -492,38 +493,41 @@ static struct dentry * real_lookup(struct dentry *
parent, struct qstr * name, s
* so doing d_lookup() (with seqlock), instead of lockfree
__d_lookup
*/
result = d_lookup(parent, name);
- if (!result) {
- struct dentry *dentry;
-
- /* Don't create child dentry for a dead directory. */
- result = ERR_PTR(-ENOENT);
- if (IS_DEADDIR(dir))
- goto out_unlock;
-
- dentry = d_alloc(parent, name);
- result = ERR_PTR(-ENOMEM);
- if (dentry) {
- result = dir->i_op->lookup(dir, dentry, nd);
+ if (result) {
+ /*
+ * Uhhuh! Nasty case: the cache was re-populated while
+ * we waited on the mutex. Need to revalidate, this
+ * time with i_mutex held to avoid any subsequent
+ * race.
+ */
+ if (result->d_op && result->d_op->d_revalidate) {
+ result = do_revalidate(result, nd);
if (result)
- dput(dentry);
- else
- result = dentry;
+ goto out_unlock;
+ /*
+ * Not only did we race on the dentry, but it was
+ * left behind invalid. We'll just do the lookup.
+ */
}
-out_unlock:
- mutex_unlock(&dir->i_mutex);
- return result;
}
- /*
- * Uhhuh! Nasty case: the cache was re-populated while
- * we waited on the semaphore. Need to revalidate.
- */
- mutex_unlock(&dir->i_mutex);
- if (result->d_op && result->d_op->d_revalidate) {
- result = do_revalidate(result, nd);
- if (!result)
- result = ERR_PTR(-ENOENT);
+ /* Don't create child dentry for a dead directory. */
+ result = ERR_PTR(-ENOENT);
+ if (IS_DEADDIR(dir))
+ goto out_unlock;
+
+ dentry = d_alloc(parent, name);
+ result = ERR_PTR(-ENOMEM);
+ if (dentry) {
+ result = dir->i_op->lookup(dir, dentry, nd);
+ if (result)
+ dput(dentry);
+ else
+ result = dentry;
}
+out_unlock:
+ mutex_unlock(&dir->i_mutex);
+
return result;
}
^ permalink raw reply related [flat|nested] 3+ messages in thread
end of thread, other threads:[~2008-11-06 20:06 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-08-21 23:48 d_revalidate and real_lookup woes Sage Weil
2008-08-22 23:18 ` Andreas Dilger
2008-11-06 20:06 ` Sage Weil
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).