* [RFC] Reinstate NFS exportability for JFFS2.
@ 2008-05-01 19:42 David Woodhouse
  2008-05-01 20:48 ` Christoph Hellwig
  0 siblings, 1 reply; 55+ messages in thread

From: David Woodhouse @ 2008-05-01 19:42 UTC (permalink / raw)
To: linux-fsdevel; +Cc: hch, linux-mtd

On top of commit 27c72b040c0be8f3704ed0b6b84c12cbba24a7e8 (in the
mtd-2.6 tree if Linus hasn't pulled it yet), this provides
export_operations for JFFS2.

It doesn't yet solve the readdirplus deadlock described in
http://lists.infradead.org/pipermail/linux-mtd/2007-March/017663.html

You also still need to set fsid manually when exporting it -- I'm
thinking of adding a 'get_fsid' to the export_operations to allow the
filesystem to provide its own...

diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index 7da69ea..25f6472 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -22,6 +22,7 @@
 #include <linux/mtd/super.h>
 #include <linux/ctype.h>
 #include <linux/namei.h>
+#include <linux/exportfs.h>
 #include "compr.h"
 #include "nodelist.h"

@@ -62,6 +63,64 @@ static int jffs2_sync_fs(struct super_block *sb, int wait)
 	return 0;
 }

+static struct inode *jffs2_nfs_get_inode(struct super_block *sb, uint64_t ino,
+					 uint32_t generation)
+{
+	return jffs2_iget(sb, ino);
+}
+
+static struct dentry *jffs2_fh_to_dentry(struct super_block *sb, struct fid *fid,
+					 int fh_len, int fh_type)
+{
+	return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+				    jffs2_nfs_get_inode);
+}
+
+static struct dentry *jffs2_fh_to_parent(struct super_block *sb, struct fid *fid,
+					 int fh_len, int fh_type)
+{
+	return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+				    jffs2_nfs_get_inode);
+}
+
+static struct dentry *jffs2_get_parent(struct dentry *child)
+{
+	struct jffs2_inode_info *f;
+	struct dentry *parent;
+	struct inode *inode;
+	uint32_t pino;
+
+	if (!S_ISDIR(child->d_inode->i_mode))
+		return ERR_PTR(-EACCES);
+
+	f = JFFS2_INODE_INFO(child->d_inode);
+
+	pino = f->inocache->pino_nlink;
+
+	JFFS2_DEBUG("Parent of directory ino #%u is #%u\n",
+		    f->inocache->ino, pino);
+
+	inode = jffs2_iget(child->d_inode->i_sb, pino);
+	if (IS_ERR(inode)) {
+		JFFS2_ERROR("Failed to get inode #%u: %ld\n", pino,
+			    PTR_ERR(inode));
+		return ERR_CAST(inode);
+	}
+
+	parent = d_alloc_anon(inode);
+	if (!parent) {
+		iput(inode);
+		parent = ERR_PTR(-ENOMEM);
+	}
+	return parent;
+}
+
+static struct export_operations jffs2_export_ops = {
+	.get_parent = jffs2_get_parent,
+	.fh_to_dentry = jffs2_fh_to_dentry,
+	.fh_to_parent = jffs2_fh_to_parent,
+};
+
 static const struct super_operations jffs2_super_operations =
 {
 	.alloc_inode =	jffs2_alloc_inode,
@@ -104,6 +163,7 @@ static int jffs2_fill_super(struct super_block *sb, void *data, int silent)
 	spin_lock_init(&c->inocache_lock);

 	sb->s_op = &jffs2_super_operations;
+	sb->s_export_op = &jffs2_export_ops;
 	sb->s_flags = sb->s_flags | MS_NOATIME;
 	sb->s_xattr = jffs2_xattr_handlers;
 #ifdef CONFIG_JFFS2_FS_POSIX_ACL

-- 
dwmw2

^ permalink raw reply related	[flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-05-01 19:42 [RFC] Reinstate NFS exportability for JFFS2 David Woodhouse @ 2008-05-01 20:48 ` Christoph Hellwig 2008-05-01 22:44 ` David Woodhouse 0 siblings, 1 reply; 55+ messages in thread From: Christoph Hellwig @ 2008-05-01 20:48 UTC (permalink / raw) To: David Woodhouse; +Cc: linux-fsdevel, hch, linux-mtd On Thu, May 01, 2008 at 08:42:59PM +0100, David Woodhouse wrote: > On top of commit 27c72b040c0be8f3704ed0b6b84c12cbba24a7e8 (in the > mtd-2.6 tree if Linus hasn't pulled it yet), this provides > export_operations for JFFS2. > > It doesn't yet solve the readdirplus deadlock described in > http://lists.infradead.org/pipermail/linux-mtd/2007-March/017663.html > > You also still need to set fsid manually when exporting it -- I'm > thinking of adding a 'get_fsid' to the export_operations to allow the > filesystem to provide its own... Yes, and get_fsid would be extremely useful, especially for those filesystems that already have a uuid in the superblock (*cough*, XFS, *cough*), but it'll need some co-operation with nfs-utils on when to use it. Patch looks good to me except for the few introduced overlong lines. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-05-01 20:48 ` Christoph Hellwig @ 2008-05-01 22:44 ` David Woodhouse 2008-05-02 1:38 ` Neil Brown 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-05-01 22:44 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-fsdevel, linux-mtd On Thu, 2008-05-01 at 16:48 -0400, Christoph Hellwig wrote: > Yes, and get_fsid would be extremely useful, especially for those > filesystems that already have a uuid in the superblock > (*cough*, XFS, *cough*), but it'll need some co-operation with > nfs-utils on when to use it. Why do you need to co-operate with userspace? Userspace shouldn't need to do anything -- we'll just generate a suitable fsid/uuid for ourselves, unless userspace deliberately overrides it for the export in question. > Patch looks good to me except for the few introduced overlong lines. Bah, don't you start. A less onanistic problem with it is the deadlock with NFS readdirplus (->readdir->encode_entry->compose_entry_fh->lookup). I wonder if we should postpone the calls to compose_entry_fh() until _after_ readdir() has completed. Leave space in the response for the filehandles, but only walk through it again calling compose_entry_fh() once we're done in readdir. That would allow us to get rid of the various icky hacks that file systems have to avoid that deadlock. -- dwmw2 ^ permalink raw reply [flat|nested] 55+ messages in thread
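As an illustration of the 'get_fsid' idea floated above, here is a minimal sketch of what such an export operation might look like for JFFS2. The hook itself does not exist in the kernel; the callback name, its signature, and the choice of the MTD device index as the stable identifier are all assumptions, not anything from the posted patch:

/*
 * Hypothetical addition to struct export_operations (not a real API):
 *
 *	int (*get_fsid)(struct super_block *sb, u8 uuid[16]);
 *
 * which nfsd/exportfs glue could call when no fsid= option was given.
 */
static int jffs2_get_fsid(struct super_block *sb, u8 uuid[16])
{
	struct jffs2_sb_info *c = JFFS2_SB_INFO(sb);

	/* JFFS2 has no on-media UUID, so fabricate a stable identifier
	 * from the MTD device index -- an assumption about what "provide
	 * its own" fsid would mean in practice. */
	memset(uuid, 0, 16);
	memcpy(uuid, "jffs2", 5);
	uuid[14] = (c->mtd->index >> 8) & 0xff;
	uuid[15] = c->mtd->index & 0xff;
	return 0;
}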
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-05-01 22:44 ` David Woodhouse @ 2008-05-02 1:38 ` Neil Brown 2008-05-02 11:37 ` David Woodhouse ` (5 more replies) 0 siblings, 6 replies; 55+ messages in thread From: Neil Brown @ 2008-05-02 1:38 UTC (permalink / raw) To: David Woodhouse; +Cc: Christoph Hellwig, linux-fsdevel, linux-mtd, linux-nfs On Thursday May 1, dwmw2@infradead.org wrote: > On Thu, 2008-05-01 at 16:48 -0400, Christoph Hellwig wrote: > > Yes, and get_fsid would be extremely useful, especially for those > > filesystems that already have a uuid in the superblock > > (*cough*, XFS, *cough*), but it'll need some co-operation with > > nfs-utils on when to use it. > > Why do you need to co-operate with userspace? Userspace shouldn't need > to do anything -- we'll just generate a suitable fsid/uuid for > ourselves, unless userspace deliberately overrides it for the export in > question. Actually it is the kernel that doesn't need to do anything.... Mapping between the filesystem and the filesystem part of the filehandle is done entirely in user space. The kernel says "Here is a filehandle fragment, what filesystem should I be accessing". So what you really want is to teach nfs-utils to recognise JFFS2 and extract an appropriate uuid. It already uses libblkid to get uuids for ext3 and XFS and others. Extending that to handle JFFS2 shouldn't be much of a drama. > > > Patch looks good to me except for the few introduced overlong lines. > > Bah, don't you start. A less onanistic problem with it is the deadlock > with NFS readdirplus (->readdir->encode_entry->compose_entry_fh->lookup). > > I wonder if we should postpone the calls to compose_entry_fh() until > _after_ readdir() has completed. Leave space in the response for the > filehandles, but only walk through it again calling compose_entry_fh() > once we're done in readdir. That would allow us to get rid of the > various icky hacks that file systems have to avoid that deadlock. Why is there a deadlock here? Both readdir and lookup are called with i_mutex held on the directory so there shouldn't be any need to do extra locking (he said, naively). In the readdirplus cases, i_mutex is held across both the readdir and the lookup.... One problem with your proposed solution is that filehandles aren't all the same length, so you cannot reliably leave space for them. Awkward. NeilBrown ^ permalink raw reply [flat|nested] 55+ messages in thread
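For reference, a small standalone program (not nfs-utils code) showing roughly how a userspace tool asks libblkid for a filesystem UUID, the way nfs-utils does for block-based filesystems. For JFFS2 on a bare MTD device there is simply nothing to probe, which is the gap David points out below:

/* cc probe_uuid.c -lblkid */
#include <stdio.h>
#include <stdlib.h>
#include <blkid/blkid.h>

int main(int argc, char **argv)
{
	blkid_cache cache;
	char *uuid;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <device>\n", argv[0]);
		return 1;
	}
	if (blkid_get_cache(&cache, NULL) < 0)
		return 1;

	/* Ask libblkid for the UUID tag of the given device. */
	uuid = blkid_get_tag_value(cache, "UUID", argv[1]);
	if (uuid) {
		printf("%s UUID=%s\n", argv[1], uuid);
		free(uuid);
	} else {
		/* e.g. JFFS2 on an MTD device: no UUID to be found. */
		printf("%s has no probeable UUID\n", argv[1]);
	}
	blkid_put_cache(cache);
	return 0;
}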
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-05-02 1:38 ` Neil Brown @ 2008-05-02 11:37 ` David Woodhouse [not found] ` <1209728238.25560.686.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 2008-07-31 21:54 ` David Woodhouse ` (4 subsequent siblings) 5 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-05-02 11:37 UTC (permalink / raw) To: Neil Brown; +Cc: Christoph Hellwig, linux-fsdevel, linux-mtd, linux-nfs On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote: > On Thursday May 1, dwmw2@infradead.org wrote: > > On Thu, 2008-05-01 at 16:48 -0400, Christoph Hellwig wrote: > > > Yes, and get_fsid would be extremely useful, especially for those > > > filesystems that already have a uuid in the superblock > > > (*cough*, XFS, *cough*), but it'll need some co-operation with > > > nfs-utils on when to use it. > > > > Why do you need to co-operate with userspace? Userspace shouldn't need > > to do anything -- we'll just generate a suitable fsid/uuid for > > ourselves, unless userspace deliberately overrides it for the export in > > question. > > Actually it is the kernel that doesn't need to do anything.... > Mapping between the filesystem and the filesystem part of the > filehandle is done entirely in user space. > The kernel says "Here is a filehandle fragment, what filesystem > should I be accessing". > > So what you really want is to teach nfs-utils to recognise JFFS2 and > extract an appropriate uuid. > It already uses libblkid to get uuids for ext3 and XFS and others. > Extending that to handle JFFS2 shouldn't be much of a drama. For JFFS2, there is no UUID; only i_sb->s_dev. Actually, if we just set FS_REQUIRES_DEV then it would work out OK -- at least for the NFS export. We don't do that though, because it gives behaviour we don't want in other situations. > Why is there a deadlock here? Many file systems have their own locking, and lookup() can end up trying to re-take a lock which readdir() is already holding. In the JFFS2 case, it's the fs-internal inode mutex, which is required because the garbage collector can't use i_mutex for lock ordering reasons. See also the readdir implementation and surrounding comments in fs/xfs/linux-2.6/xfs_file.c -- and the way GFS2 uses gfs2_glock_is_locked_by_me() to avoid the deadlock. The annoying thing is that JFFS2 doesn't even _implement_ i_generation, so you get no more useful information out of the lookup() call anyway :) > Both readdir and lookup are called with i_mutex held on the directory > so there shouldn't be any need to do extra locking (he said, naively). In > the readdirplus cases, i_mutex is held across both the readdir and the > lookup.... > > One problem with your proposed solution is that filehandles aren't all > the same length, so you cannot reliably leave space for them. Not without moving stuff around during the postprocessing, I suppose. Which isn't very pretty -- but it's prettier than some of the hacks we have at the moment to avoid the deadlock. -- dwmw2 ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <1209728238.25560.686.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> @ 2008-05-02 14:08 ` J. Bruce Fields 0 siblings, 0 replies; 55+ messages in thread From: J. Bruce Fields @ 2008-05-02 14:08 UTC (permalink / raw) To: David Woodhouse Cc: Neil Brown, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, linux-nfs-u79uwXL29TY76Z2rM5mHXA On Fri, May 02, 2008 at 12:37:18PM +0100, David Woodhouse wrote: > On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote: > > Why is there a deadlock here? > > Many file systems have their own locking, and lookup() can end up trying > to re-take a lock which readdir() is already holding. In the JFFS2 case, > it's the fs-internal inode mutex, which is required because the garbage > collector can't use i_mutex for lock ordering reasons. > > See also the readdir implementation and surrounding comments in > fs/xfs/linux-2.6/xfs_file.c -- and the way GFS2 uses > gfs2_glock_is_locked_by_me() to avoid the deadlock. > > The annoying thing is that JFFS2 doesn't even _implement_ i_generation, > so you get no more useful information out of the lookup() call anyway :) > > > Both readdir and lookup are called with i_mutex held on the directory > > so there should need to do any extra locking (he said, naively). In > > the readdirplus cases, i_mutex is held across both the readdir and the > > lookup.... > > > > One problem with your proposed solution is that filehandles aren't all > > the same length, so you cannot reliably leave space for them. > > Not without moving stuff around during the postprocessing, I suppose. > Which isn't very pretty -- but it's prettier than some of the hacks we > have at the moment to avoid the deadlock. This comes up periodically. It would be great to finally get it fixed. The last conversation I remember started at about: http://marc.info/?l=linux-kernel&m=119506226209056&w=2 Summary: one approach would be to define a ->readdirplus() that passes a dentry to its equivalent of the ->filldir callback. (Christoph points out that returning a stat struct would be simpler. But unfortunately we need to check for mountpoints here, so that's not sufficient.) --b. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-05-02 1:38 ` Neil Brown 2008-05-02 11:37 ` David Woodhouse @ 2008-07-31 21:54 ` David Woodhouse 2008-08-01 0:16 ` Neil Brown 2008-07-31 21:54 ` [PATCH 1/4] Factor out nfsd_do_readdir() into its own function David Woodhouse ` (3 subsequent siblings) 5 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-07-31 21:54 UTC (permalink / raw) To: Neil Brown; +Cc: Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote: > Why is there a deadlock here? > Both readdir and lookup are called with i_mutex held on the directory > so there shouldn't be any need to do extra locking (he said, naively). In > the readdirplus cases, i_mutex is held across both the readdir and the > lookup.... > > One problem with your proposed solution is that filehandles aren't all > the same length, so you cannot reliably leave space for them. > > Awkward. Yeah. I think the sanest plan for the short term is, as hch suggests, just to transplant the existing XFS hack into the nfsd code. That way, at least we can avoid using the hack for local users. And it makes NFS export from other file systems (jffs2, btrfs, etc.) easier without having to put the same hacks in each one. Git tree at git.infradead.org/users/dwmw2/nfsexport-2.6.git; patch sequence follows... -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply [flat|nested] 55+ messages in thread
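A minimal user-space model of the two-pass structure being proposed here: collect the names while the directory read is in progress, and do the per-entry work (lookup and filehandle encoding, in nfsd's case) only after readdir has returned, so the filesystem is never re-entered from inside its own readdir. This illustrates the control flow only; it is not the nfsd or XFS code:

/* cc readdir_twopass.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <fcntl.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : ".";
	DIR *dir = opendir(path);
	struct dirent *de;
	char **names = NULL;
	size_t n = 0, i;

	if (!dir)
		return 1;

	/* Pass 1: just record names -- the analogue of letting the
	 * filesystem's ->readdir() run to completion undisturbed. */
	while ((de = readdir(dir)) != NULL) {
		char **tmp = realloc(names, (n + 1) * sizeof(*names));
		if (!tmp)
			break;
		names = tmp;
		names[n++] = strdup(de->d_name);
	}

	/* Pass 2: per-entry work (stat here; lookup + filehandle encoding
	 * in nfsd) happens only after the directory read has finished. */
	for (i = 0; i < n; i++) {
		struct stat st;
		if (fstatat(dirfd(dir), names[i], &st, AT_SYMLINK_NOFOLLOW) == 0)
			printf("%-20s ino %llu\n", names[i],
			       (unsigned long long)st.st_ino);
		free(names[i]);
	}
	free(names);
	closedir(dir);
	return 0;
}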
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-07-31 21:54 ` David Woodhouse @ 2008-08-01 0:16 ` Neil Brown [not found] ` <18578.21997.529551.676627-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 0 siblings, 1 reply; 55+ messages in thread From: Neil Brown @ 2008-08-01 0:16 UTC (permalink / raw) To: David Woodhouse; +Cc: Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Thursday July 31, dwmw2@infradead.org wrote: > On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote: > > Why is there a deadlock here? I was really hoping you would answer this question. I can see the sense in your approach, but it does still seem a bit hackish. I would like to understand the details of the problem enough to be confident that there is no less-hackish solution. Thanks, NeilBrown > > Both readdir and lookup are called with i_mutex held on the directory > > so there shouldn't be any need to do extra locking (he said, naively). In > > the readdirplus cases, i_mutex is held across both the readdir and the > > lookup.... > > > > One problem with your proposed solution is that filehandles aren't all > > the same length, so you cannot reliably leave space for them. > > > > Awkward. > > Yeah. I think the sanest plan for the short term is, as hch suggests, > just to transplant the existing XFS hack into the nfsd code. That way, > at least we can avoid using the hack for local users. And it makes NFS > export from other file systems (jffs2, btrfs, etc.) easier without > having to put the same hacks in each one. > > Git tree at git.infradead.org/users/dwmw2/nfsexport-2.6.git; patch > sequence follows... > > -- > David Woodhouse Open Source Technology Centre > David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <18578.21997.529551.676627-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2008-08-01 0:40 ` David Woodhouse [not found] ` <1217551230.3719.15.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org> ` (2 more replies) 0 siblings, 3 replies; 55+ messages in thread From: David Woodhouse @ 2008-08-01 0:40 UTC (permalink / raw) To: Neil Brown Cc: Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Fri, 2008-08-01 at 10:16 +1000, Neil Brown wrote: > On Thursday July 31, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org wrote: > > On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote: > > > Why is there a deadlock here? > > I was really hoping you would answer this question. It's because the nfsd readdirplus code recurses into the file system. From the file system's ->readdir() function we call back into nfsd's encode_entry(), which then calls back into the file system's ->lookup() function so that it can generate a filehandle. For any file system which has its own internal locking -- which includes jffs2, btrfs, xfs, jfs, gfs*, ocfs* and probably others -- that recursive call back into the file system will deadlock. In the specific case of JFFS2, we need that internal locking because of lock ordering constraints with the garbage collection -- we need to take the allocator lock _before_ the per-inode lock, which means we can't use the generic inode->i_mutex for that purpose. That's documented in fs/jffs2/README.Locking. I know fewer details about the other affected file systems. > I can see the sense in your approach, but it does still seem a bit > hackish. I would like to understand the details of the problem enough > to be confident that there is no less-hackish solution. I was thinking that we could perhaps get away with an extension to readdir() that passes the filehandle to its filldir callback too. I'm not sure if that's sufficient for NFS4 though. Perhaps we could add a ->lookup_fh() method which is guaranteed to be called _only_ from within ->readdir(), so it can have (or _lack_) the appropriate locking. For now though, this approach moves the problem into the nfsd code where it belongs -- the VFS never calls into the file system recursively like this, in the general case. -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
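A toy user-space model of the recursion described above, using an error-checking pthread mutex so the self-deadlock is reported rather than hanging (a kernel mutex, like JFFS2's per-inode f->sem, would simply block forever). All of the names here are made up for illustration:

/* cc deadlock_model.c -pthread */
#include <stdio.h>
#include <string.h>
#include <pthread.h>

static pthread_mutex_t dir_sem;	/* stands in for the fs-internal inode lock */

/* Stand-in for ->lookup(): needs the per-inode lock. */
static int fake_lookup(const char *name)
{
	int err = pthread_mutex_lock(&dir_sem);
	if (err)
		return err;		/* EDEADLK: this thread already holds it */
	/* ... look the name up ... */
	pthread_mutex_unlock(&dir_sem);
	return 0;
}

/* Stand-in for nfsd's readdirplus filldir callback, which does a lookup
 * to build the filehandle for each entry. */
static void filldir_plus(const char *name)
{
	int err = fake_lookup(name);
	printf("lookup(%s) -> %s\n", name, err ? strerror(err) : "ok");
}

/* Stand-in for ->readdir(): holds the lock while invoking the callback. */
static void fake_readdir(void)
{
	pthread_mutex_lock(&dir_sem);
	filldir_plus("some_entry");
	pthread_mutex_unlock(&dir_sem);
}

int main(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&dir_sem, &attr);

	fake_readdir();	/* prints "Resource deadlock avoided" */
	return 0;
}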
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <1217551230.3719.15.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org> @ 2008-08-01 0:52 ` David Woodhouse 0 siblings, 0 replies; 55+ messages in thread From: David Woodhouse @ 2008-08-01 0:52 UTC (permalink / raw) To: Neil Brown Cc: Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Fri, 2008-08-01 at 01:40 +0100, David Woodhouse wrote: > For any file system which has its own internal locking -- which includes > jffs2, btrfs, xfs, jfs, gfs*, ocfs* and probably others -- that > recursive call back into the file system will deadlock. It _would_ deadlock, I mean. They each have their own strategy to avoid it, for now. jffs2 and btrfs just don't provide export_ops so they avoid the problem, but we'd _like_ to be able to export them. xfs has the hack I've just transplanted. gfs2 has a 'lock is owned by me' check. Not sure about the others. -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 0:40 ` David Woodhouse [not found] ` <1217551230.3719.15.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org> @ 2008-08-01 0:53 ` Chuck Lever [not found] ` <76bd70e30807311753m2785c6d3kd82edd1fe8b5f8b7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2008-08-01 2:14 ` Neil Brown 2 siblings, 1 reply; 55+ messages in thread From: Chuck Lever @ 2008-08-01 0:53 UTC (permalink / raw) To: David Woodhouse Cc: Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Thu, Jul 31, 2008 at 8:40 PM, David Woodhouse <dwmw2@infradead.org> wrote: > On Fri, 2008-08-01 at 10:16 +1000, Neil Brown wrote: >> On Thursday July 31, dwmw2@infradead.org wrote: >> > On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote: >> > > Why is there a deadlock here? >> >> I was really hoping you would answer this question. > > It's because the nfsd readdirplus code recurses into the file system. > >From the file system's ->readdir() function we call back into nfsd's > encode_entry(), which then calls back into the file system's ->lookup() > function so that it can generate a filehandle. > > For any file system which has its own internal locking -- which includes > jffs2, btrfs, xfs, jfs, gfs*, ocfs* and probably others -- that > recursive call back into the file system will deadlock. > > In the specific case of JFFS2, we need that internal locking because of > lock ordering constraints with the garbage collection -- we need to take > the allocator lock _before_ the per-inode lock, which means we can't use > the generic inode->i_mutex for that purpose. That's documented in > fs/jffs2/README.Locking. I know fewer details about the other affected > file systems. > >> I can see the sense in your approach, but it does still seem a bit >> hackish. I would like to understand the details of the problem enough >> to be confident that there is no less-hackish solution. > > I was thinking that we could perhaps get away with an extension to > readdir() that passes the filehandle to its filldir callback too. I'm > not sure if that's sufficient for NFS4 though. For now it is sufficient, IMO. NFSv4 doesn't implement a readdirplus operation, and the performance benefits of NFSv3 readdirplus are equivocal -- there isn't a strong desire to replicate the complexity of NFSv3 readdirplus in NFSv4. I'm not even sure you can do it even with a single compound RPC, so even in the long run NFSv4 may not ever have the locking issues that NFSv3 does here. I agree with Neil, though -- as a reviewer, I think the architecture of your solution is valid, but would need to audit these pretty closely to get a sense of the individual correctness/appropriateness of each change. -- "Alright guard, begin the unnecessarily slow-moving dipping mechanism." --Dr. Evil ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <76bd70e30807311753m2785c6d3kd82edd1fe8b5f8b7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-08-01 1:00 ` David Woodhouse [not found] ` <1217552437.3719.30.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org> 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-01 1:00 UTC (permalink / raw) To: chucklever-Re5JQEeQqe8AvxtiuMwx3w Cc: Neil Brown, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Thu, 2008-07-31 at 20:53 -0400, Chuck Lever wrote: > For now it is sufficient, IMO. NFSv4 doesn't implement a readdirplus > operation, and the performance benefits of NFSv3 readdirplus are > equivocal -- there isn't a strong desire to replicate the complexity > of NFSv3 readdirplus in NFSv4. I'm not even sure you can do it even > with a single compound RPC, so even in the long run NFSv4 may not ever > have the locking issues that NFSv3 does here. AFAICT NFSv4 does have the same recursion issues already. The call trace goes fs->readdir() ... nfsd4_encode_dirent() ... nfsd4_encode_dirent_fattr() ... lookup_one_len() ... fs->lookup(). Or am I mistaken? > I agree with Neil, though -- as a reviewer, I think the architecture > of your solution is valid, but would need to audit these pretty > closely to get a sense of the individual correctness/appropriateness > of each change. The important part to review is this one, which is fairly close to what XFS has been doing (for local access too) for a long time already: http://git.infradead.org/users/dwmw2/nfsexport-2.6.git?a=commitdiff;h=37f3aae5380eb4ef23a77f3f52b8849a18cee188 -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <1217552437.3719.30.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org> @ 2008-08-01 1:31 ` Chuck Lever 2008-08-01 8:13 ` David Woodhouse 2008-08-01 13:35 ` David Woodhouse 0 siblings, 2 replies; 55+ messages in thread From: Chuck Lever @ 2008-08-01 1:31 UTC (permalink / raw) To: David Woodhouse Cc: Neil Brown, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Thu, Jul 31, 2008 at 9:00 PM, David Woodhouse <dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > On Thu, 2008-07-31 at 20:53 -0400, Chuck Lever wrote: >> For now it is sufficient, IMO. NFSv4 doesn't implement a readdirplus >> operation, and the performance benefits of NFSv3 readdirplus are >> equivocal -- there isn't a strong desire to replicate the complexity >> of NFSv3 readdirplus in NFSv4. I'm not even sure you can do it even >> with a single compound RPC, so even in the long run NFSv4 may not ever >> have the locking issues that NFSv3 does here. > > AFAICT NFSv4 does have the same recursion issues already. The call trace > goes fs->readdir() ... nfsd4_encode_dirent() ... > nfsd4_encode_dirent_fattr() ... lookup_one_len() ... fs->lookup(). > > Or am I mistaken? It looks like it needs a directory entry's dentry for a couple of reasons: 1. To determine whether a directory entry is a mount point 2. If the client has asked for file handles (via a bitmask) for the directory entries So, yep, you're right. NFSv4 will hit this too. As I recall, the Linux NFSv4 client doesn't use FATTR4_WORD0_FILEHANDLE during NFSv4 READDIR for the reasons I stated earlier; and it only uses FATTR4_WORD0_FSID during GETATTR. Other clients might use these during a READDIR however. -- "Alright guard, begin the unnecessarily slow-moving dipping mechanism." --Dr. Evil -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 1:31 ` Chuck Lever @ 2008-08-01 8:13 ` David Woodhouse 2008-08-01 13:35 ` David Woodhouse 1 sibling, 0 replies; 55+ messages in thread From: David Woodhouse @ 2008-08-01 8:13 UTC (permalink / raw) To: chucklever Cc: Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Thu, 2008-07-31 at 21:31 -0400, Chuck Lever wrote: > It looks like it needs a directory entry's dentry for a couple of > reasons: > > 1. To determine whether a directory entry is a mount point > > 2. If the client has asked for file handles (via a bitmask) for the > directory entries Theoretically, neither of those actually need the _inode_. If it's a mountpoint, won't it be in the dcache already? So a purely dcache-based lookup would find it, without having to call through to the file system's ->lookup() method on a dcache miss? And to generate the file handle, all you need in the common case is i_generation? You don't need to pull every inode into core. -- dwmw2 ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 1:31 ` Chuck Lever 2008-08-01 8:13 ` David Woodhouse @ 2008-08-01 13:35 ` David Woodhouse [not found] ` <1217597759.3454.356.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 1 sibling, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-01 13:35 UTC (permalink / raw) To: chucklever, viro Cc: Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Thu, 2008-07-31 at 21:31 -0400, Chuck Lever wrote: > On Thu, Jul 31, 2008 at 9:00 PM, David Woodhouse <dwmw2@infradead.org> wrote: > > On Thu, 2008-07-31 at 20:53 -0400, Chuck Lever wrote: > >> For now it is sufficient, IMO. NFSv4 doesn't implement a readdirplus > >> operation, and the performance benefits of NFSv3 readdirplus are > >> equivocal -- there isn't a strong desire to replicate the complexity > >> of NFSv3 readdirplus in NFSv4. I'm not even sure you can do it even > >> with a single compound RPC, so even in the long run NFSv4 may not ever > >> have the locking issues that NFSv3 does here. > > > > AFAICT NFSv4 does have the same recursion issues already. The call trace > > goes fs->readdir() ... nfsd4_encode_dirent() ... > > nfsd4_encode_dirent_fattr() ... lookup_one_len() ... fs->lookup(). > > > > Or am I mistaken? > > It looks like it needs a directory entry's dentry for a couple of reasons: > > 1. To determine whether a directory entry is a mount point > > 2. If the client has asked for file handles (via a bitmask) for the > directory entries Those are needed by NFSv3 too -- and can be handled with a lookup_fh() method in the file system which is guaranteed to be called from within the filldir callback, and some support in the VFS for checking if it's a mountpoint. NFSv4 introduces another problem though, which is that it seems to be able to return the _full_ getattr() results for each object, and there's no real way round the fact that we need to do the ->lookup() for that. If sane clients aren't expected to _ask_ for that, though, then perhaps it would be OK to fall back to something like the existing readdir-to-buffer hack for that case, while most normal clients won't trigger it. -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <1217597759.3454.356.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> @ 2008-08-01 13:56 ` David Woodhouse 2008-08-01 16:05 ` Chuck Lever 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-01 13:56 UTC (permalink / raw) To: chucklever-Re5JQEeQqe8AvxtiuMwx3w Cc: viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Neil Brown, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Fri, 2008-08-01 at 14:36 +0100, David Woodhouse wrote: > Those are needed by NFSv3 too -- and can be handled with a lookup_fh() > method in the file system which is guaranteed to be called from within > the filldir callback, and some support in the VFS for checking if it's a > mountpoint. > > NFSv4 introduces another problem though, which is that it seems to be > able to return the _full_ getattr() results for each object, and there's > no real way round the fact that we need to do the ->lookup() for that. > > If sane clients aren't expected to _ask_ for that, though, then perhaps > it would be OK to fall back to something like the existing > readdir-to-buffer hack for that case, while most normal clients won't > trigger it. Or maybe we could just mask the offending attrs out of ->rd_bmval for readdir calls, and say we don't support them? Would anyone scream if we did that? -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 13:56 ` David Woodhouse @ 2008-08-01 16:05 ` Chuck Lever 2008-08-01 16:19 ` David Woodhouse 0 siblings, 1 reply; 55+ messages in thread From: Chuck Lever @ 2008-08-01 16:05 UTC (permalink / raw) To: David Woodhouse, J. Bruce Fields Cc: viro, Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd Hi David- On Fri, Aug 1, 2008 at 9:56 AM, David Woodhouse <dwmw2@infradead.org> wrote: > On Fri, 2008-08-01 at 14:36 +0100, David Woodhouse wrote: >> Those are needed by NFSv3 too -- and can be handled with a lookup_fh() >> method in the file system which is guaranteed to be called from within >> the filldir callback, and some support in the VFS for checking if it's a >> mountpoint. >> >> NFSv4 introduces another problem though, which is that it seems to be >> able to return the _full_ getattr() results for each object, and there's >> no real way round the fact that we need to do the ->lookup() for that. >> >> If sane clients aren't expected to _ask_ for that, though, then perhaps >> it would be OK to fall back to something like the existing >> readdir-to-buffer hack for that case, while most normal clients won't >> trigger it. Sanity / insanity is probably not the right description for these different types of clients... NFSv4 is designed to work with clients that have a typical VFS (like an in-kernel Unix client), or user-space clients, or clients on systems that don't have typical VFS APIs (like Windows). So, servers have to be prepared to expect a wide gamut of different combinations of individual operations via compound RPCs. In practice, nearly all client implementations so far are VFS-based in-kernel clients, and have roughly the same kinds of readdir (ie driven by getdents(3) and allowing seek offsets like a byte stream). But they are at generally different stages of implementation, and have different ways to go about their work. However, speaking generally, the advanced features of NFSv4, like FS_LOCATIONS and pseudofs often do require some special sauce that is sometimes not terribly friendly to the server-side VFS. I only mentioned that the Linux client doesn't use WORD0_FILEHANDLE to caution against testing any server change with a Linux NFSv4 client -- it wouldn't necessarily exercise the server code in question. Some NFSv3 clients don't support READDIRPLUS at all, while some can disable it (like Linux, Mac OS, and FreeBSD), and others use it only in certain cases (Linux). I wouldn't describe any of these as saner or more commonly encountered than another. > Or maybe we could just mask the offending attrs out of ->rd_bmval for > readdir calls, and say we don't support them? Would anyone scream if we > did that? I'm not an NFSv4 expert (hence my initial incorrect assertion about NFSv4 not supporting readdirplus at all). I defer to those who are actually working on the standard and Linux implementation (Bruce?) But typically masking out these features could potentially cause severe interoperability problems for certain client implementations. We can only know for sure after a lot of testing at multivendor events like Connectathon; it's not something I would disable cavalierly. I believe the server can also indicate to clients that NFSv3 READDIRPLUS is not supported, and that wouldn't cause as much of a disaster for clients. It would even be feasible to disable READDIRPLUS only for certain physical file systems that would have a problem with lookup-during-filldir. 
I rather prefer making NFSD do the right thing here -- it seems to localize and document the issue and provide a solution that all file systems can use with a minimum of real fuss. I know that OCFS2 has this locking inversion issue as well, and we know OCFS2 users do share data from it with NFS. So Oracle is interested in a good solution here too. -- Chuck Lever ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 16:05 ` Chuck Lever @ 2008-08-01 16:19 ` David Woodhouse 2008-08-01 17:47 ` Chuck Lever 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-01 16:19 UTC (permalink / raw) To: chucklever Cc: J. Bruce Fields, viro, Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Fri, 2008-08-01 at 12:05 -0400, Chuck Lever wrote: > Some NFSv3 clients don't support READDIRPLUS at all, while some can > disable it (like Linux, Mac OS, and FreeBSD), and others use it only > in certain cases (Linux). I wouldn't describe any of these as saner > or more commonly encountered than another. Readdirplus is easy enough to fix with a lookup_fh() method and some way to check if a given entry is a mountpoint. I'm not worried about that. > > Or maybe we could just mask the offending attrs out of ->rd_bmval for > > readdir calls, and say we don't support them? Would anyone scream if we > > did that? > > I'm not an NFSv4 expert (hence my initial incorrect assertion about > NFSv4 not supporting readdirplus at all). I defer to those who are > actually working on the standard and Linux implementation (Bruce?) > But typically masking out these features could potentially cause > severe interoperability problems for certain client implementations. > We can only know for sure after a lot of testing at multivendor events > like Connectathon; it's not something I would disable cavalierly. It's not readdirplus (FATTR4_WORD0_FILEHANDLE?) I was talking about masking out -- as I said, we can cope with that. It's things that would _really_ require the inode, like FATTR4_WORD0_ACL, FATTR4_WORD0_CHANGE, etc. But still, if masking them out would be a problem, we could use the existing trick of doing readdir into a buffer and then going back and doing the actual lookup later. But _only_ for that relatively rare case, rather than for _all_ users of readdirplus, as we do at the moment. > I rather prefer making NFSD do the right thing here -- it seems to > localize and document the issue and provide a solution that all file > systems can use with a minimum of real fuss. Yes, that's definitely my preference too. What I've posted is a good first step to that, and we can talk about improving it later. -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 16:19 ` David Woodhouse @ 2008-08-01 17:47 ` Chuck Lever 2008-08-02 18:26 ` J. Bruce Fields 0 siblings, 1 reply; 55+ messages in thread From: Chuck Lever @ 2008-08-01 17:47 UTC (permalink / raw) To: David Woodhouse Cc: J. Bruce Fields, viro, Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Fri, Aug 1, 2008 at 12:19 PM, David Woodhouse <dwmw2@infradead.org> wrote: > On Fri, 2008-08-01 at 12:05 -0400, Chuck Lever wrote: >> Some NFSv3 clients don't support READDIRPLUS at all, while some can >> disable it (like Linux, Mac OS, and FreeBSD), and others use it only >> in certain cases (Linux). I wouldn't describe any of these as saner >> or more commonly encountered than another. > > Readdirplus is easy enough to fix with a lookup_fh() method and some way > to check if a given entry is a mountpoint. I'm not worried about that. > >> > Or maybe we could just mask the offending attrs out of ->rd_bmval for >> > readdir calls, and say we don't support them? Would anyone scream if we >> > did that? >> >> I'm not an NFSv4 expert (hence my initial incorrect assertion about >> NFSv4 not supporting readdirplus at all). I defer to those who are >> actually working on the standard and Linux implementation (Bruce?) >> But typically masking out these features could potentially cause >> severe interoperability problems for certain client implementations. >> We can only know for sure after a lot of testing at multivendor events >> like Connectathon; it's not something I would disable cavalierly. > > It's not readdirplus (FATTR4_WORD0_FILEHANDLE?) I was talking about > masking out -- as I said, we can cope with that. It's things that would > _really_ require the inode, like FATTR4_WORD0_ACL, FATTR4_WORD0_CHANGE, > etc. AFAIK WORD0_CHANGE is a required value for NFSv4 clients to detect data changes on the server. That's one I think needs to remain enabled in the server capabilities bitmask. In fact most local file systems on Linux don't handle that one correctly yet anyway, but "saner" clients do depend on that one to manage their page caches. I don't think there's a way to indicate to a client that a server supports WORD0_CHANGE for a normal NFSv4 GETATTR, but not during a NFSv4 READDIR. > But still, if masking them out would be a problem, we could use the > existing trick of doing readdir into a buffer and then going back and > doing the actual lookup later. But _only_ for that relatively rare case, > rather than for _all_ users of readdirplus, as we do at the moment. Personally I don't have an immediate problem with the double-buffering here. The extra data copy wouldn't add significant overhead to an already expensive and not terribly frequent operation. There may be a caching opportunity in saving the buffered result, but I would bet either the contents of a directory or the attributes of objects residing in it would change too often for there to be much benefit. I think we have an opportunity for a little better concurrency here, though. Doing all the lookups and the directory read asynchronously and concurrently, if that's possible, might be a performance enhancement in most normal cases. This would allow the requests to be passed all at once to the local file system, which could then organize any necessary disk accesses in a more optimal way. That would be a reason to continue to do a double-buffered readdir in every case. -- Chuck Lever ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 17:47 ` Chuck Lever @ 2008-08-02 18:26 ` J. Bruce Fields [not found] ` <20080802182644.GE30454-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> 2008-08-03 11:56 ` Neil Brown 0 siblings, 2 replies; 55+ messages in thread From: J. Bruce Fields @ 2008-08-02 18:26 UTC (permalink / raw) To: chucklever Cc: David Woodhouse, viro, Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Fri, Aug 01, 2008 at 01:47:58PM -0400, Chuck Lever wrote: > AFAIK WORD0_CHANGE is a required value for NFSv4 clients to detect > data changes on the server. That's one I think needs to remain > enabled in the server capabilities bitmask. In fact most local file > systems on Linux don't handle that one correctly yet anyway, but > "saner" clients do depend on that one to manage their page caches. > > I don't think there's a way to indicate to a client that a server > supports WORD0_CHANGE for a normal NFSv4 GETATTR, but not during a > NFSv4 READDIR. > > > But still, if masking them out would be a problem, we could use the > > existing trick of doing readdir into a buffer and then going back and > > doing the actual lookup later. But _only_ for that relatively rare case, > > rather than for _all_ users of readdirplus, as we do at the moment. > > Personally I don't have an immediate problem with the double-buffering > here. The extra data copy wouldn't add significant overhead to an > already expensive and not terribly frequent operation. There may be a > caching opportunity to saving the buffered result, but I would bet > either the contents of a directory or the attributes of objects > residing in it would change too often for there to be much benefit. > > I think we have an opportunity for a little better concurrency here, > though. Doing all the lookups and the directory read asynchronously > and concurrently, if that's possible, might be a performance > enhancement in most normal cases. This would allow the requests to be > passed all at once to the local file system, which could then organize > any necessary disk accesses in a more optimal way. > > That would be a reason to continue to do a double-buffered readdir in > every case. I don't see how xfs's double-buffered readdir allows you to do the "lookups and directory reads asynchronously and concurrently"? The filesystem doesn't know the lookups are going to be coming at the time that it gets the readdir, so it doesn't know to start them yet. Could you add a readdirplus vfs operation which took a flag indicating how much extra information you're going to need? The three cases where readdir can be called are: - ordinary readdir, for which the existing filldir_t provides all the information needed. - nfsv3 readdirplus, where the only additional information needed is whether there's a mountpoint. - nfsv4 readdir, where we may need all the stat info, acls, etc., etc. So you could define a readdirplus file operation like readdirplus(file, void, plusfiller, flag) with plusfiller defined just as filldir_t but with an extra struct dentry * argument. The only nonzero flag value would be DENTRY_NEEDED, and without DENTRY_NEEDED set, the filesystem would have the option of passing NULL in the plusfiller callback's dentry argument. So nfsv3 would use flag == 0, and get a dentry passed back to it only when it was already cached (all it needs to detect mountpoints). But nfsv4 would pass in DENTRY_NEEDED, and get a dentry on every callback. Um. 
That assumes it's only the lookup that has the locking problems, and that the calls back into the filesystem for acls and so on will be safe. Etc. Or you could define a more complicated set of flags indicating exactly what would be needed, and have the callback pass back a big structure, most of whose fields could be left unset depending on what was requested. Though really I can't see any great objection to just moving xfs's hack up into nfsd. It may not do everything, but it seems like an incremental improvement. --b. ^ permalink raw reply [flat|nested] 55+ messages in thread
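A sketch of the interface Bruce describes above, written out as kernel-style declarations. None of this exists in mainline; the names, the flag value, and the callback shape are assumptions based purely on the description in the message:

/* Hypothetical "readdirplus" file operation and its callback. */
#define DENTRY_NEEDED	0x1	/* caller wants a dentry on every callback */

typedef int (*plusfiller_t)(void *buf, const char *name, int namlen,
			    loff_t offset, u64 ino, unsigned int d_type,
			    struct dentry *dentry);	/* dentry may be NULL
							   unless DENTRY_NEEDED
							   was passed */

/* Would sit alongside ->readdir() in struct file_operations: */
int (*readdirplus)(struct file *file, void *buf,
		   plusfiller_t filler, unsigned int flags);

/* Intended usage (per the message above): NFSv3 readdirplus calls it with
 * flags == 0 and only gets a dentry when one is already cached, which is
 * enough to spot mountpoints; NFSv4 readdir passes DENTRY_NEEDED. */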
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <20080802182644.GE30454-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2008-08-02 20:42 ` David Woodhouse 2008-08-02 21:33 ` J. Bruce Fields 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-02 20:42 UTC (permalink / raw) To: J. Bruce Fields Cc: chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Neil Brown, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Sat, 2008-08-02 at 14:26 -0400, J. Bruce Fields wrote: > Could you add a readdirplus vfs operation which took a flag indicating > how much extra information you're going to need? Actually, if we're screwing with readdir then xfs would like to know how much it's going to be asked to read, rather than just have the filldir callback return zero when it's done. > The three cases where readdir can be called are: > - ordinary readdir, for which the existing filldir_t provides > all the information needed. > - nfsv3 readdirplus, where the only additional information > needed is whether there's a mountpoint. It also wants a file handle. For which I think it just needs i_generation in addition to the information it already has. > - nfsv4 readdir, where we may need all the stat info, acls, > etc., etc. We _might_, but most of the time we won't. It might be OK to fall back to the existing double-buffer hack for the cases where we _do_ need that extra information. I think a ->lookup_fh() method (which _expects_ to be called from within filldir, and hence does its locking automatically) is the way to go. One alternative might just be a full lookup method with the same locking rules: ->lookup_locked(). That might be easier to implement, and would solve the problem even for the corner cases of NFSv4. Although I still think we'd do best to avoid populating the inode cache with _all_ children when we do a readdirplus. -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
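A rough sketch of what a JFFS2 ->lookup_fh() might look like under David's proposal: a method that nfsd only ever calls from inside the filldir callback, so the filesystem knows its readdir locks are already held and must not be retaken. Neither the method nor this function exists anywhere; the signature and the idea of returning only the inode number and generation are assumptions:

/*
 * Hypothetical inode operation, only ever invoked from within the
 * filesystem's own ->readdir(), i.e. with dir_f->sem already held.
 */
static int jffs2_lookup_fh(struct inode *dir, const char *name, int namelen,
			   u64 *ino, u32 *generation)
{
	struct jffs2_inode_info *dir_f = JFFS2_INODE_INFO(dir);
	struct jffs2_full_dirent *fd;

	/* dir_f->sem is held by jffs2_readdir(); do NOT take it again. */
	for (fd = dir_f->dents; fd; fd = fd->next) {
		if (fd->nsize == namelen && !memcmp(fd->name, name, namelen)) {
			*ino = fd->ino;
			*generation = 0;   /* JFFS2 has no i_generation */
			return 0;
		}
	}
	return -ENOENT;
}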
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-02 20:42 ` David Woodhouse @ 2008-08-02 21:33 ` J. Bruce Fields [not found] ` <20080802213337.GA2833-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> 0 siblings, 1 reply; 55+ messages in thread From: J. Bruce Fields @ 2008-08-02 21:33 UTC (permalink / raw) To: David Woodhouse Cc: chucklever, viro, Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Sat, Aug 02, 2008 at 09:42:32PM +0100, David Woodhouse wrote: > On Sat, 2008-08-02 at 14:26 -0400, J. Bruce Fields wrote: > > Could you add a readdirplus vfs operation which took a flag indicating > > how much extra information you're going to need? > > Actually, if we're screwing with readdir then xfs would like to know how > much it's going to be asked to read, rather than just have the filldir > callback return zero when it's done. OK. > > The three cases where readdir can be called are: > > - ordinary readdir, for which the existing filldir_t provides > > all the information needed. > > - nfsv3 readdirplus, where the only additional information > > needed is whether there's a mountpoint. > > It also wants a file handle. Oops, right. > For which I think it just needs > i_generation in addition to the information it already has. Typically, right, though the filesystem's allowed some choice about what exactly it wants to use in the filehandle. I don't know how the various filesystems are actually using that in practice. > > - nfsv4 readdir, where we may need all the stat info, acls, > > etc., etc. > > We _might_, but most of the time we won't. It might be OK to fall back > to the existing double-buffer hack for the cases where we _do_ need that > extra information. How bad is the "double-buffer hack" anyway? Rather than have this as a fallback case that's rarely used (hence rarely tested), it might be simpler just to use it for everything if we're going to use it at all. --b. > I think a ->lookup_fh() method (which _expects_ to be called from within > filldir, and hence does its locking automatically) is the way to go. > > One alternative might just be a full lookup method with the same locking > rules: ->lookup_locked(). That might be easier to implement, and would > solve the problem even for the corner cases of NFSv4. Although I still > think we'd do best to avoid populating the inode cache with _all_ > children when we do a readdirplus. > > -- > David Woodhouse Open Source Technology Centre > David.Woodhouse@intel.com Intel Corporation > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <20080802213337.GA2833-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2008-08-03 8:39 ` David Woodhouse 0 siblings, 0 replies; 55+ messages in thread From: David Woodhouse @ 2008-08-03 8:39 UTC (permalink / raw) To: J. Bruce Fields Cc: chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Neil Brown, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Sat, 2008-08-02 at 17:33 -0400, J. Bruce Fields wrote: > On Sat, Aug 02, 2008 at 09:42:32PM +0100, David Woodhouse wrote: > > On Sat, 2008-08-02 at 14:26 -0400, J. Bruce Fields wrote: > > > Could you add a readdirplus vfs operation which took a flag indicating > > > how much extra information you're going to need? > > > > Actually, if we're screwing with readdir then xfs would like to know how > > much it's going to be asked to read, rather than just have the filldir > > callback return zero when it's done. > > OK. > > > > The three cases where readdir can be called are: > > > - ordinary readdir, for which the existing filldir_t provides > > > all the information needed. > > > - nfsv3 readdirplus, where the only additional information > > > needed is whether there's a mountpoint. > > > > It also wants a file handle. > > Oops, right. > > > For which I think it just needs > > i_generation in addition to the information it already has. > > Typically, right, though the filesystem's allowed some choice about what > exactly it wants to use in the filehandle. I don't know how the various > filesystems are actually using that in practice. That's OK. If we do ->lookup_fh() and they're making their own, they can put what they like in it. > > > - nfsv4 readdir, where we may need all the stat info, acls, > > > etc., etc. > > > > We _might_, but most of the time we won't. It might be OK to fall back > > to the existing double-buffer hack for the cases where we _do_ need that > > extra information. > > How bad is the "double-buffer hack" anyway? Rather than have this as a > fallback case that's rarely used (hence rarely tested), it might be > simpler just to use it for everything if we're going to use it at all. It's certainly a good enough answer for now, which is why I've posted the patches to do exactly that. And yes, I have wondered the same myself, since realising that we'll need a full lookup for some NFSv4 clients anyway. Or maybe the full ->lookup_locked(), perhaps... -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-02 18:26 ` J. Bruce Fields [not found] ` <20080802182644.GE30454-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2008-08-03 11:56 ` Neil Brown 2008-08-03 17:15 ` Chuck Lever 1 sibling, 1 reply; 55+ messages in thread From: Neil Brown @ 2008-08-03 11:56 UTC (permalink / raw) To: J. Bruce Fields Cc: chucklever-Re5JQEeQqe8AvxtiuMwx3w, David Woodhouse, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Saturday August 2, bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org wrote: > > Though really I can't see any great objection to just moving xfs's hack > up into nfsd. It may not do everything, but it seems like an > incremental improvement. Because it is a hack, and hacks have a tendency to hide deeper problems, to never get cleaned up, and generally to become a burden to future generations. However if you do go down that path, can I suggest: 1/ get rid of the word "hack" throughout the code. If you think it is sensible, make it appear sensible. 2/ drop the "retry malloc of a smaller size" thing. In fact, you can probably use one of the set of pages that has been reserved for the request. It is very rare that a readdir request will be as big as the largest read. 3/ Make the new way unconditional. That gives it broader test coverage which can only be a good thing. And what is good for the goose is good for the gander... (not that I'm calling anyone a goose). But I still prefer the O_READDIRPLUS approach. NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-03 11:56 ` Neil Brown @ 2008-08-03 17:15 ` Chuck Lever 2008-08-04 1:03 ` Neil Brown 0 siblings, 1 reply; 55+ messages in thread From: Chuck Lever @ 2008-08-03 17:15 UTC (permalink / raw) To: Neil Brown Cc: J. Bruce Fields, David Woodhouse, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Sun, Aug 3, 2008 at 7:56 AM, Neil Brown <neilb@suse.de> wrote: > On Saturday August 2, bfields@fieldses.org wrote: >> >> Though really I can't see any great objection to just moving xfs's hack >> up into nfsd. It may not do everything, but it seems like an >> incremental improvement. > > Because it is a hack, and hacks have a tendency to hide deeper > problems, and not be ever get cleaned up and generally to become a > burden to future generations. Agreed that maintainability is an important concern. However, I don't see that what David suggests in general is hiding a deeper problem, but is rather exposing it. Can you explain what you think is the issue? > However if you do go down that path, can I suggest: > > 1/ get rid of the word "hack" throughout the code. If you think it > is sensible, make it appear sensible. Yes, if we're going to codify this method of handling readdir, let's document it properly and treat it as a first class API. > 2/ drop the "retry malloc of a smaller size" thing. > In fact, you can probably use one of the set of pages that has > been reserved for the request. It is very rare that a readdir > request will be as big as the largest read. Well, many Linux clients support reading directories only a page at a time (this limitation may have been lifted recently). But other clients often ask to read much more. Again it appears that a Linux NFS client is not going to be the real acid test here. Solaris and FreeBSD would probably be the best to try, I think. > 3/ Make the new way unconditional. That gives it broader test > coverage which can only be a good thing. And what is good for the > goose is good for the gander... (not that I'm calling anyone a > goose). As Bruce also suggested, and I agree with this (not the goose part, the unconditionality and test coverage parts). -- "Alright guard, begin the unnecessarily slow-moving dipping mechanism." --Dr. Evil ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-03 17:15 ` Chuck Lever @ 2008-08-04 1:03 ` Neil Brown [not found] ` <18582.21855.2092.903688-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2008-08-04 18:41 ` J. Bruce Fields 0 siblings, 2 replies; 55+ messages in thread From: Neil Brown @ 2008-08-04 1:03 UTC (permalink / raw) To: chucklever Cc: J. Bruce Fields, David Woodhouse, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Sunday August 3, chuck.lever@oracle.com wrote: > On Sun, Aug 3, 2008 at 7:56 AM, Neil Brown <neilb@suse.de> wrote: > > On Saturday August 2, bfields@fieldses.org wrote: > >> > >> Though really I can't see any great objection to just moving xfs's hack > >> up into nfsd. It may not do everything, but it seems like an > >> incremental improvement. > > > > Because it is a hack, and hacks have a tendency to hide deeper > > problems, and not be ever get cleaned up and generally to become a > > burden to future generations. > > Agreed that maintainability is an important concern. > > However, I don't see that what David suggests in general is hiding a > deeper problem, but is rather exposing it. Can you explain what you > think is the issue? The locking bothers me. The VFS seems to have a general policy of doing the locking itself to make life easier for filesystem implementors (or to make it harder for them to mess up, depending on your perspective). The current issue seems to suggest that the locking provided by the VFS is no longer adequate, so each filesystem needs to create something itself. That suggests to me a weakness in the model. Possibly the VFS should give up trying to be in control and let filesystems do their own locking. Possibly there are still things that the VFS can do which are universally good. I think these are questions that should be addressed. Maybe they have already been addressed and I missed the conversation (that wouldn't surprise me much). But seeing words like "hack" suggests to me that it hasn't. So I want to make sure I understand the problem properly and deeply before giving my blessing to a hack. So: what are the issues? Obviously readdir can race with create and you don't want them tripping each other up. The current VFS approach to this is i_mutex. Any place which modifies a directory or does a lookup in a directory takes i_mutex to ensure that the directory doesn't change. This is probably fairly limiting. With a tree-structured directory you only really need to lock the 'current node' of the tree. So any access would lock the top node, find which child to follow, then lock the child and unlock the parent. Rebalancing might need to be creative as you cannot lock a parent while holding a lock on the child, but that isn't insurmountable. So I could argue that holding i_mutex across a lookup or create or readdir maybe isn't ideal. It would be interesting to explore the possibility of dropping i_mutex across calls into the filesystem. By the time the filesystem is called, we really only need to be locking the names (dentries) rather than the whole directory. More on this later... Some filesystems want to restructure directories at times that are independent of any particular access. This might be defragmentation or rebalancing or whatever. Clearly there needs to be mutual exclusion between this and other accesses such as readdir and lookup. The VFS clearly cannot help with this. It doesn't know when rebalancing might happen or at what sort of granularity. So the filesystem must do its own locking.
It strikes me that this sort of locking would best be done by spinlocks at the block level rather than a per-inode mutex. However I haven't actually implemented this (yet) so I cannot be certain. This is what is causing the current problem (for JFFS2 at least). JFFS2 has a per-inode lock which protects against internally visible changes to the inode. Further (and this is key) this lock is held across the filldir callback. i_mutex is also held across the filldir callback, but there is an important difference. It is taken by the VFS, not the filesystem, and it is guaranteed always to be held across the filldir callback. So the filldir callback can call e.g. lookup without further locking. Backing up a little: given that the filesystem implementor chose to use per-inode locking to protect internal restructuring (which is certainly an easier option), why not just use i_mutex? The reason is that a create operation might trigger system-wide garbage collection which could trigger restructuring of the current inode, which would lead to an A-A deadlock (as the create is waiting for the garbage collection, and so still holding i_mutex). So, given that background, it is possible to see some further possible solutions (other than the ones already mentioned). - drop the internal lock across filldir. It could be seen as impolite to hold any locks across a callback that are not documented as being held. This would put an extra burden on the filesystem, but it shouldn't be a particularly big burden. A filesystem needs to be able to find the 'next' entry from a sequential 'seek' pointer, so that is the most state that needs to be preserved. It might be convenient to be able to keep more state (pointers into data structures etc). All this can be achieved with fairly standard practice: a/ have a 'version' number per inode which is updated when any internal restructure happens. b/ when calling filldir, record the version number, drop the lock, call filldir, reclaim the lock, check the version number. c/ if the version number matches (as it mostly will) just keep going. If it doesn't, jump back to the top of readdir where we decode the current seek address. Some filesystems have problems implementing seekdir/telldir so they might not be totally happy here. I have little sympathy for such filesystems and feel the burden should be on them to make it work. - use i_mutex to protect internal changes too, and drop i_mutex while doing internal restructuring. This would need some VFS changes so that dropping i_mutex would be safe. It would require some way to lock an individual dentry. Probably you would lock it under i_mutex by setting a flag bit (waiting on some inode-wide waitqueue if it is already set), and drop the lock by clearing the flag and waking the waitqueue. And you are never allowed to look at ->d_inode if the lock flag is set. Of these I really like the second. Refining the i_mutex down to a per-name lock before calling into the filesystem seems like a really good idea and should be good for scalability and large directories. It isn't something to be done lightly though. The filesystem would still be given i_mutex held, but would be allowed to drop it if it wanted to. We could have an FS_DROPS_I_MUTEX similar to the current FS_RENAME_DOES_D_MOVE. For the first, I really like the idea that a lock should not be held across calls to filldir. I feel that a filesystem doing that is "wrong" in much the same way that some think that recursing into the filesystem as nfsd does is "wrong". 
In reality neither is "wrong"; we just need to work together and negotiate and work out the best way to meet all of our needs. So what should we do now? I think that for JFFS2 to drop and reclaim f->sem in readdir would be all of 20 lines of code, including updating a ->version counter elsewhere in the code. Replicating that in all the filesystems that need it would probably be more code than the nfsd change though. On the other hand, if we implement the nfsd change, it will almost certainly never go away, even if all filesystems eventually find that they don't need it any more because someone improves the locking rules in the VFS. Whereas the code in the filesystems could quite possibly go away when they are revised to make better use of the locking regime. So I don't think that is an ideal situation either. If I had time, I would work on modifying the VFS to allow filesystems to drop i_mutex. However I don't have time at the moment, so I'll leave the decision to be made by someone else (Hi Bruce! I'll support whatever you decide). But I think it is important to understand what is really going on and not just implement a hack that works around the problem. I think I do understand now, so I am a lot happier. Hopefully you do too. NeilBrown ^ permalink raw reply [flat|nested] 55+ messages in thread
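As an illustration of the first option Neil describes above (drop the filesystem's internal lock around the filldir callback and detect concurrent restructuring with a per-inode version counter), a minimal kernel-style sketch might look like the following. This is not taken from JFFS2 or any other filesystem; the foo_* names, the ->version field and the helper functions are all hypothetical, and the seek-cookie handling is simplified.

static int foo_readdir(struct file *filp, void *dirent, filldir_t filldir)
{
	struct foo_inode_info *fi = FOO_I(filp->f_path.dentry->d_inode);
	struct foo_dirent *de;
	unsigned long version;

	mutex_lock(&fi->lock);
restart:
	/* Re-derive the 'next' entry from the seek cookie alone; any
	 * pointers held across a restructure may be stale. */
	de = foo_entry_at(fi, filp->f_pos);
	while (de) {
		/* Copy out what filldir needs while the lock is held,
		 * because 'de' may be freed once we drop it. */
		char name[FOO_MAX_NAMELEN];
		int namlen = de->namlen;
		u64 ino = de->ino;
		unsigned int d_type = de->d_type;
		loff_t pos = de->pos;

		memcpy(name, de->name, namlen);
		version = fi->version;		/* a/ snapshot the counter */

		mutex_unlock(&fi->lock);	/* b/ filldir may now recurse */
		if (filldir(dirent, name, namlen, pos, ino, d_type)) {
			filp->f_pos = pos;	/* entry not consumed; retried later */
			return 0;
		}
		mutex_lock(&fi->lock);

		filp->f_pos = pos + 1;		/* next cookie (simplified) */
		if (version != fi->version)	/* c/ restructured underneath us */
			goto restart;
		de = foo_next_entry(fi, de);
	}
	mutex_unlock(&fi->lock);
	return 0;
}

In the common case this costs one extra unlock/lock pair per entry; a restructure (which would bump fi->version under fi->lock) simply forces the loop to re-seek from the current f_pos.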
[parent not found: <18582.21855.2092.903688-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>]
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <18582.21855.2092.903688-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2008-08-04 6:19 ` Chuck Lever 2008-08-05 8:51 ` Dave Chinner 2008-08-17 18:22 ` [RFC] Reinstate NFS exportability for JFFS2 Andreas Dilger 1 sibling, 1 reply; 55+ messages in thread From: Chuck Lever @ 2008-08-04 6:19 UTC (permalink / raw) To: Neil Brown Cc: J. Bruce Fields, David Woodhouse, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Sun, Aug 3, 2008 at 9:03 PM, Neil Brown <neilb-l3A5Bk7waGM@public.gmane.org> wrote: > On Sunday August 3, chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org wrote: >> On Sun, Aug 3, 2008 at 7:56 AM, Neil Brown <neilb-l3A5Bk7waGM@public.gmane.org> wrote: >> > On Saturday August 2, bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org wrote: >> >> >> >> Though really I can't see any great objection to just moving xfs's hack >> >> up into nfsd. It may not do everything, but it seems like an >> >> incremental improvement. >> > >> > Because it is a hack, and hacks have a tendency to hide deeper >> > problems, and not be ever get cleaned up and generally to become a >> > burden to future generations. >> >> Agreed that maintainability is an important concern. >> >> However, I don't see that what David suggests in general is hiding a >> deeper problem, but is rather exposing it. Can you explain what you >> think is the issue? > > The locking bothers me. > The VFS seems to have a general policy of doing the locking itself to > make life easier for filesystem implementors (or to make it harder for > them to mess up, depending on your perspective). > > The current issue seems to suggest that the locking provided by the > VFS is no longer adequate, so each filesystem is needing to create > something itself. That suggests to me a weakness in the model. I often find the VFS locking schemes to be inflexible. However, in this case, I think that the VFS locking is adequate; NFSD is doing something that the VFS (and hence the individual file systems themselves) wasn't designed for. Adding extra export APIs (like lookup_locked or lookup_fh) seems like an appropriate way to address the problem. > Possibly the VFS should give up trying to be in control and let > filesystems do their own locking. Possibly there are still things > that the VFS can do which are universally good. I think these are > questions that should be addressed. I agree it's worth thinking carefully about this. Thinking out loud (as you have done) is helpful. > Maybe they have already been addressed and I missed the conversation > (that wouldn't surprise me much). But seeing words like "hack" > suggests to me that it hasn't. So I want to make sure I understand > the problem properly and deeply before giving my blessing to a hack. > > So: what are the issues? > > Obviously readdir can race with create and you don't want them > tripping each other up. The current VFS approach to this is i_mutex. > Any place which modifies a directory or does a lookup in a directory > takes i_mutex to ensure that the directory doesn't change. > > This is probably fairly limiting. With a tree-structured directory > you only really need to lock the 'current node' of the tree. > So any access would lock the top node, find which child to follow, > then lock the child and unlock the parent. 
Rebalancing might need to > be creative as you cannot lock a parent while holding a lock on the > child, but that isn't insurmountable. > So I could argue that holding i_mutex across a lookup or create or > readdir maybe isn't ideal. It would be interesting to explore the > possibility of dropping i_mutex across calls into the filesystem. > By the time the filesystem is called, we really only need to be > locking the names (dentries) rather than the whole directory. > More on this later... > > Some filesystems want to restructure directories and times that are > independent of any particular access. This might be defragmentation > or rebalancing or whatever. Clearly there needs to be mutual > exclusion between this and other accesses such as readdir and lookup. > The VFS clearly cannot help with this. It doesn't know when > rebalancing might happen or are what sort of granularity. So the > filesystem must do it's own locking. > It strikes me that this sort of locking would best be done by > spinlocks at the block level rather than a per-inode mutex. However > I haven't actually implemented this (yet) so I cannot be certain. > > This is what is causing the current problem (for JFFS2 at least). > JFF2 has a per-inode lock which protects against internally visible > changes to the inode. Further (and this is key) this lock is held > across the filldir callback. > i_mutex is also held across the filldir callback, but there is an > important difference. It is taken by the VFS, not the filesystem, > and it is guaranteed always to be held across the filldir callback. > So the filldir callback can call e.g. lookup without further locking. > > Backing up a little: given that the filesystem implementor chose to > use per-inode locking to protect internal restructuring (which is > certainly an easier option), why not just use i_mutex? The reason > is that a create operation might trigger system-wide garbage > collection which could trigger restructuring of the current inode, > which would lead to an A-A deadlock (as the create is waiting for the > garbage collection, and so still holding i_mutex). > > So, given that background, it is possible to see some more possible > solutions (other than the ones already mentioned). > > - drop the internal lock across filldir. > It could be seen a impolite to hold any locks across a callback > that are not documented as being held. > This would put an extra burden on the filesystem, but it shouldn't > be a particularly big burden. > A filesystem needs to be able to find the 'next' entry from a > sequential 'seek' pointer so that is the most state that needs to > be preserved. It might be convenient to be able to keep more state > (pointers into data structures etc). All this can be achieved with > fairly standard practice: > a/ have a 'version' number per inode which is updated when any > internal restructure happens. > b/ when calling filldir, record the version number, drop the lock > call filldir, reclaim the lock, check the version number > c/ if the version number matches (as it mostly will) just keep > going. If it doesn't jump back to the top of readdir where > we decode the current seek address. > > Some filesystems have problems implementing seekdir/telldir so they > might not be totally happy here. I have little sympathy for such > filesystems and feel the burden should be on them to make it work. 
If I understand your suggestion correctly, I can see cases where it would be nearly impossible for NFSD to make forward progress assembling a full readdir result to post to a client. This seems like the same problem as retrying a pathname resolution until we no longer get an ESTALE. If an application makes continuous changes to a directory (such as, say, a very busy MTA) it would be impossible for NFSD to finish a READDIR. As a sidebar, I do agree that locking out other accessors when handling a very large directory can be a real problem. I've seen this on the client side with certain workloads. The usual answer in this case is to ask the application developer to use a hierarchical directory structure. :-/ But we are still stuck with seekdir/telldir. > - use i_mutex to protect internal changes too, and drop i_mutex while > doing internal restructuring. This would need some VFS changes so > that dropping i_mutex would be safe. It would require some way to > lock an individual dentry. Probably you would lock it under > i_mutex by setting a flag bit, wait for the flag on some inode-wide > waitqueue, and drop the lock by clearing the flag and waking the > waitqueue. And you are never allowed to look at ->d_inode if the > lock flag is set. > > Of these I really like the second. Refining the i_mutex down to a > per-name lock before calling in to the filesystem seems like a really > good idea and should be good for scalability and large directories. > It isn't something to be done lightly though. The filesystem would > still be given i_mutex held, but would be allowed to drop it if it > wanted to. We could have an FS_DROPS_I_MUTEX similar to the current > FS_RENAME_DOES_D_MOVE. > > For the first, I really like the idea that a lock should not be held > across calls the filldir. I feel that a filesystem doing that is > "wrong" in much the same way that some think that recursing into the > filesystem as nfsd does is "wrong". In reality neither is "wrong", we > just need to work together and negotiate and work out the best way to > meet all of our needs. > > So what should we do now? I think that for JFFS2 to drop and reclaim > f->sem in readdir would be all of 20 lines of code, including updating > a ->version counter elsewhere in the code. Replicating that in all > the filesystems that need it would probably be more code than the nfsd > change though. That doesn't fix XFS or the clustered file systems, and I'm still concerned about starving other accessors (see above). > On the other hand, if we implement the nfsd change, it will almost > certainly never go away, even if all filesystems eventually find that > they don't need it any more because someone improves the locking > rules in the VFS. I understand this sentiment completely, but I'm not as pessimistic about this. > Where as the code in the filesystems could quite > possibly go away when they are revised to make better use of the > locking regime. So I don't think that is an ideal situation either. Making such changes to the VFS could take a long time. I know that distributions are already getting phone calls (twitters?) about this problem. > If I had time, I would work on modifying the VFS to allow filesystems > to drop i_mutex. However I don't have time at the moment, so I'll > leave the decision to be made by someone else (Hi Bruce! I'll > support whatever you decide). > > But I think it is important to understand what is really going on and > not just implement a hack that works around the problem. 
I think I do > understand now, so I am a lot happier. Hopefully you do too. Yes, thanks. I will need to re-read your explanation a few more times to digest it further. So, the JFFS2 locking problem is a garbage-collection issue. I'm not sure this is the case with other file systems like XFS and OCFS2. My impression was that XFS had a transaction logging deadlock, and that OCFS2 (and possibly GFS2) would have issues with cross-node locking in these cases (yes, it's 2am here and I should be in bed instead of answering e-mail). They are similar in that these are all internal restructuring issues, but potentially different enough that each will need some careful analysis -- cross-node locking has failure modes that may be a challenge to deal with. Not to mention the other NFSv4 issues David dug up that have nothing to do with readdir. -- Chuck Lever -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-04 6:19 ` Chuck Lever @ 2008-08-05 8:51 ` Dave Chinner 2008-08-05 8:59 ` David Woodhouse 2008-08-05 23:06 ` Neil Brown 0 siblings, 2 replies; 55+ messages in thread From: Dave Chinner @ 2008-08-05 8:51 UTC (permalink / raw) To: chucklever Cc: Neil Brown, J. Bruce Fields, David Woodhouse, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Mon, Aug 04, 2008 at 02:19:12AM -0400, Chuck Lever wrote: > So, the JFFS2 locking problem is a garbage-collection issue. I'm not > sure this is the case with other file systems like XFS and OCFS2. My > impression was that XFS had a transaction logging deadlock, Just to clarify - XFS has a directory buffer lock deadlock. That is, while reading the contents of the directory buffer it is locked to prevent modifications from occurring while extracting the contents. Looking up an entry in the directory also requires the directory buffer lock (for the same reason), so calling the lookup while already holding the directory buffer lock (i.e from the filldir callback) will deadlock. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 55+ messages in thread
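For readers who do not know the XFS directory code, the shape of the deadlock Dave describes can be sketched generically. This is not XFS code: the single foo_dir_lock mutex stands in for the per-buffer directory locking, and all foo_* names are hypothetical.

static DEFINE_MUTEX(foo_dir_lock);	/* stand-in for the directory buffer lock */

static struct dentry *foo_lookup(struct inode *dir, struct dentry *dentry,
				 struct nameidata *nd)
{
	mutex_lock(&foo_dir_lock);	/* second acquisition: sleeps forever */
	/* ... search the directory block for dentry->d_name ... */
	mutex_unlock(&foo_dir_lock);
	return NULL;
}

static int foo_readdir(struct file *filp, void *dirent, filldir_t filldir)
{
	mutex_lock(&foo_dir_lock);	/* first acquisition */
	/*
	 * For each entry we call filldir().  When nfsd is the caller, its
	 * callback encodes a file handle for the entry, which means
	 * lookup_one_len() and eventually foo_lookup(), which blocks on the
	 * foo_dir_lock already held above, so the readdir never returns.
	 */
	mutex_unlock(&foo_dir_lock);
	return 0;
}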
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-05 8:51 ` Dave Chinner @ 2008-08-05 8:59 ` David Woodhouse 2008-08-05 9:47 ` Dave Chinner 2008-08-05 23:06 ` Neil Brown 1 sibling, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-05 8:59 UTC (permalink / raw) To: Dave Chinner Cc: chucklever, Neil Brown, J. Bruce Fields, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Tue, 2008-08-05 at 18:51 +1000, Dave Chinner wrote: > On Mon, Aug 04, 2008 at 02:19:12AM -0400, Chuck Lever wrote: > > So, the JFFS2 locking problem is a garbage-collection issue. I'm not > > sure this is the case with other file systems like XFS and OCFS2. My > > impression was that XFS had a transaction logging deadlock, > > Just to clarify - XFS has a directory buffer lock deadlock. That is, > while reading the contents of the directory buffer it is locked to > prevent modifications from occurring while extracting the contents. > Looking up an entry in the directory also requires the directory > buffer lock (for the same reason), so calling the lookup while > already holding the directory buffer lock (i.e from the filldir > callback) will deadlock. But if we had a ->lookup_locked() or ->lookup_fh() method which is _guaranteed_ to be called from within your ->readdir(), you could manage to bypass that locking and avoid the deadlock? -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply [flat|nested] 55+ messages in thread
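No such method existed at the time, and nothing of this exact shape was merged later; purely as an illustration of the interface David is floating here, it might be declared along these lines. Every name below, and the choice of which operations table it would live in, is an assumption.

	/*
	 * Hypothetical: a lookup variant that nfsd would only ever call from
	 * inside the same filesystem's ->readdir() (i.e. from the filldir
	 * callback), so the filesystem knows its internal directory locks
	 * are already held and must not be taken again.
	 */
	struct dentry *(*lookup_locked)(struct inode *dir,
					struct dentry *dentry,
					struct nameidata *nd);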
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-05 8:59 ` David Woodhouse @ 2008-08-05 9:47 ` Dave Chinner 0 siblings, 0 replies; 55+ messages in thread From: Dave Chinner @ 2008-08-05 9:47 UTC (permalink / raw) To: David Woodhouse Cc: chucklever, Neil Brown, J. Bruce Fields, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Tue, Aug 05, 2008 at 09:59:48AM +0100, David Woodhouse wrote: > On Tue, 2008-08-05 at 18:51 +1000, Dave Chinner wrote: > > On Mon, Aug 04, 2008 at 02:19:12AM -0400, Chuck Lever wrote: > > > So, the JFFS2 locking problem is a garbage-collection issue. I'm not > > > sure this is the case with other file systems like XFS and OCFS2. My > > > impression was that XFS had a transaction logging deadlock, > > > > Just to clarify - XFS has a directory buffer lock deadlock. That is, > > while reading the contents of the directory buffer it is locked to > > prevent modifications from occurring while extracting the contents. > > Looking up an entry in the directory also requires the directory > > buffer lock (for the same reason), so calling the lookup while > > already holding the directory buffer lock (i.e from the filldir > > callback) will deadlock. > > But if we had a ->lookup_locked() or ->lookup_fh() method which is > _guaranteed_ to be called from within your ->readdir(), you could manage > to bypass that locking and avoid the deadlock? I *think* it's possible. It's definitely in the realm of difficult because the buffer locks are a couple of layers removed from the directory code and both readdir and lookup assume exclusive access to the directory block. We'd probably have to introduce a directory buffer (dabuf) cache layer with it's own locking to allow multiple accessors to the buffer at the same time. Christoph might have some other ideas for this, but I think it's going to take significant surgery to implement a 'recursive lockless lookup' like this. The main problem you are going to have is finding someone who has the time to do the XFS work. If you only implement this lookup interface, then I'd say it's going to be some time before XFS would actually use it.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-05 8:51 ` Dave Chinner 2008-08-05 8:59 ` David Woodhouse @ 2008-08-05 23:06 ` Neil Brown 2008-08-06 0:08 ` Dave Chinner 1 sibling, 1 reply; 55+ messages in thread From: Neil Brown @ 2008-08-05 23:06 UTC (permalink / raw) To: Dave Chinner Cc: chucklever-Re5JQEeQqe8AvxtiuMwx3w, J. Bruce Fields, David Woodhouse, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Tuesday August 5, david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org wrote: > On Mon, Aug 04, 2008 at 02:19:12AM -0400, Chuck Lever wrote: > > So, the JFFS2 locking problem is a garbage-collection issue. I'm not > > sure this is the case with other file systems like XFS and OCFS2. My > > impression was that XFS had a transaction logging deadlock, > > Just to clarify - XFS has a directory buffer lock deadlock. That is, > while reading the contents of the directory buffer it is locked to > prevent modifications from occurring while extracting the contents. > Looking up an entry in the directory also requires the directory > buffer lock (for the same reason), so calling the lookup while > already holding the directory buffer lock (i.e from the filldir > callback) will deadlock. How much cost would it be, do you think, to drop the lock across the call to filldir? Then reclaim the lock, validate pointers etc against a 'version' counter, and restart based on the current telldir cookie if needed? To me, that is the generic solution to allowing filldir to call ->lookup. I'm just not sure what it costs to be constantly dropping and reclaiming the lock in the normal case where ->lookup isn't being called. NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-05 23:06 ` Neil Brown @ 2008-08-06 0:08 ` Dave Chinner 2008-08-06 19:56 ` J. Bruce Fields 0 siblings, 1 reply; 55+ messages in thread From: Dave Chinner @ 2008-08-06 0:08 UTC (permalink / raw) To: Neil Brown Cc: chucklever, J. Bruce Fields, David Woodhouse, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Wed, Aug 06, 2008 at 09:06:48AM +1000, Neil Brown wrote: > On Tuesday August 5, david@fromorbit.com wrote: > > On Mon, Aug 04, 2008 at 02:19:12AM -0400, Chuck Lever wrote: > > > So, the JFFS2 locking problem is a garbage-collection issue. I'm not > > > sure this is the case with other file systems like XFS and OCFS2. My > > > impression was that XFS had a transaction logging deadlock, > > > > Just to clarify - XFS has a directory buffer lock deadlock. That is, > > while reading the contents of the directory buffer it is locked to > > prevent modifications from occurring while extracting the contents. > > Looking up an entry in the directory also requires the directory > > buffer lock (for the same reason), so calling the lookup while > > already holding the directory buffer lock (i.e from the filldir > > callback) will deadlock. > > How much cost would it be, do you think, to drop the lock across the > call to filldir? Then reclaim the lock, validate pointers etc against > a 'version' counter, and restart based on the current telldir cookie > if needed? The problem is that the dabuf is a temporary structure only valid for the length of a block read or transaction - it is built from buffers that are cached and provide persistence. Remember, XFS supports large, non-contiguous directory blocks, and so the directory code is extremely complex in places. To do the above we need to pretty much tear down the dabuf to unlock everything before the filldir call, then build it in the lookup during the filldir call, then tear it down for the readdir to build it again, validate, etc.... Basically, what you suggest above still needs the same infrastructure as a proper shared locking scheme on the dabuf to work efficiently. Using a shared locking scheme gives much more benefit, because it will allow parallel directory traversals and lookups in *all cases*, not just NFS. Basically, I don't want to replace an _easily validated_ hack with some other nasty, non-trivial, disaster-waiting-to-happen hack that doesn't provide any benefit over the current hack.... > To me, that is the generic solution to allowing filldir to call > ->lookup. I'm just not sure what it costs to be constantly dropping > and reclaiming the lock in the normal case where ->lookup isn't being > called. Allowing filldir to call lookup requires shared read lock semantics between readdir and lookup. I don't think any filesystem has that implemented; it can't be implemented with i_mutex involved, and it will be non-trivial to implement in the filesystems that need it. Normally the generic solution is the lowest common denominator solution - move the double buffering into the NFSD so everything works with the current exclusive locking semantics, and then provide another filldir+lookup for filesystems that are able to do something special to avoid the slower generic path. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-06 0:08 ` Dave Chinner @ 2008-08-06 19:56 ` J. Bruce Fields 2008-08-06 20:10 ` David Woodhouse 0 siblings, 1 reply; 55+ messages in thread From: J. Bruce Fields @ 2008-08-06 19:56 UTC (permalink / raw) To: Neil Brown, chucklever, David Woodhouse, viro, Christoph Hellwig, linux-fsdevel So the only solution that seems to really be holding up (at least for the near term) is the double-buffering "hack"; could we get a version of that with Neil's review comments addressed? --b. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-06 19:56 ` J. Bruce Fields @ 2008-08-06 20:10 ` David Woodhouse [not found] ` <1218053443.5111.148.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-06 20:10 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Brown, chucklever, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Wed, 2008-08-06 at 15:56 -0400, J. Bruce Fields wrote: > So the only solution that seems to really be holding up (at least for > the near term) is the double-buffering "hack"; could we get a version > of that with Neil's review comments addressed? git.infradead.org/users/dwmw2/nfsexport-2.6.git I inverted the logic so that the file system has to set a FS_LOOKUP_IN_READDIR flag to avoid the hack, rather than setting a FS_NO_LOOKUP_IN_READDIR flag to _use_ it. Renamed it so that it doesn't say 'hack', took out the loop to allocate memory. Tasks left for someone more familiar with the NFS code are: - making it use one of the pages already allocated for the response. - giving it a hint about how many dirents to read into the buffer. -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply [flat|nested] 55+ messages in thread
[parent not found: <1218053443.5111.148.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>]
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <1218053443.5111.148.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> @ 2008-08-09 16:47 ` David Woodhouse 2008-08-09 19:55 ` David Woodhouse 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-09 16:47 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Wed, 2008-08-06 at 21:10 +0100, David Woodhouse wrote: > On Wed, 2008-08-06 at 15:56 -0400, J. Bruce Fields wrote: > > So the only solution that seems to really be holding up (at least for > > the near term) is the double-buffering "hack"; could we get a version > > of that with Neil's review comments addressed? > > git.infradead.org/users/dwmw2/nfsexport-2.6.git > > I inverted the logic so that the file system has to set a > FS_LOOKUP_IN_READDIR flag to avoid the hack, rather than setting a > FS_NO_LOOKUP_IN_READDIR flag to _use_ it. Renamed it so that it doesn't > say 'hack', took out the loop to allocate memory. > > Tasks left for someone more familiar with the NFS code are: > - making it use one of the pages already allocated for the response. > - giving it a hint about how many dirents to read into the buffer. Here it is in a single patch... commit b4bf782886f5f7711c8ce2c0f3778bf52bda1d56 Author: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Date: Thu Jul 31 20:39:25 2008 +0100 [JFFS2] Reinstate NFS exportability Now that the readdir/lookup deadlock issues have been dealt with, we can export JFFS2 file systems again. (For now, you have to specify fsid manually; we should add a method to the export_ops to handle that too.) Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> commit 681b29d988a0dbf8d086958cbeae26fea710baf1 Author: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Date: Wed Aug 6 15:21:59 2008 +0100 Mark isofs as being able to handle lookup() within readdir() Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> commit 14b5cf410c17afd7b2589b68ca3ca23352789d2f Author: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Date: Wed Aug 6 15:21:39 2008 +0100 Mark msdos/vfat as being able to handle lookup() within readdir() Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> commit 20a7a67be0ac1f788e77c3e333f241e3395e5dc7 Author: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Date: Wed Aug 6 15:18:37 2008 +0100 Mark ext[234] as able to handle lookup() within readdir() Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> commit bade5f353e2cfd31fcec038b41e7cb68bb3321f1 Author: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Date: Thu Jul 31 20:38:04 2008 +0100 Remove XFS buffered readdir hack Now that we've moved the readdir hack to the nfsd code, we can remove the local version from the XFS code. 
Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> commit b8179eee3e72039c8acc9b8efe807031bba3b931 Author: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Date: Thu Jul 31 20:29:12 2008 +0100 Copy XFS readdir hack into nfsd code, introduce FS_LOOKUP_IN_READDIR flag Some file systems with their own internal locking have problems with the way that nfsd calls the ->lookup() method from within a filldir function called from their ->readdir() method. The recursion back into the file system code can cause deadlock. XFS has a fairly hackish solution to this which involves doing the readdir() into a locally-allocated buffer, then going back through it calling the filldir function afterwards. It's not ideal, but it works. It's particularly suboptimal because XFS does this for local file systems too, where it's completely unnecessary. Copy this hack into the NFS code where it can be used only for NFS export. Allow file systems which are _not_ prone to this deadlock to avoid the buffered version by setting FS_LOOKUP_IN_READDIR in their fs_type flags to indicate that calling lookup() from readdir() is OK. Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> commit f5d45f637d4907cd6a55bda9465eec997351e9fc Author: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Date: Thu Jul 31 17:16:51 2008 +0100 Factor out nfsd_do_readdir() into its own function Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> fs/ext2/super.c | 2 +- fs/ext3/super.c | 2 +- fs/ext4/super.c | 2 +- fs/isofs/inode.c | 2 +- fs/jffs2/super.c | 60 +++++++++++++++++++ fs/msdos/namei.c | 2 +- fs/nfsd/vfs.c | 136 ++++++++++++++++++++++++++++++++++++++----- fs/vfat/namei.c | 2 +- fs/xfs/linux-2.6/xfs_file.c | 120 -------------------------------------- include/linux/fs.h | 2 + 10 files changed, 189 insertions(+), 141 deletions(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index fd88c7b..d8ef9ea 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -1401,7 +1401,7 @@ static struct file_system_type ext2_fs_type = { .name = "ext2", .get_sb = ext2_get_sb, .kill_sb = kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_LOOKUP_IN_READDIR, }; static int __init init_ext2_fs(void) diff --git a/fs/ext3/super.c b/fs/ext3/super.c index f38a5af..790a40e 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -2938,7 +2938,7 @@ static struct file_system_type ext3_fs_type = { .name = "ext3", .get_sb = ext3_get_sb, .kill_sb = kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_LOOKUP_IN_READDIR, }; static int __init init_ext3_fs(void) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index d5d7795..2911f43 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3497,7 +3497,7 @@ static struct file_system_type ext4dev_fs_type = { .name = "ext4dev", .get_sb = ext4_get_sb, .kill_sb = kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_LOOKUP_IN_READDIR, }; static int __init init_ext4_fs(void) diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c index 26948a6..f81b920 100644 --- a/fs/isofs/inode.c +++ b/fs/isofs/inode.c @@ -1460,7 +1460,7 @@ static struct file_system_type iso9660_fs_type = { .name = "iso9660", .get_sb = isofs_get_sb, .kill_sb = kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_LOOKUP_IN_READDIR, }; static int __init init_iso9660_fs(void) diff --git a/fs/jffs2/super.c 
b/fs/jffs2/super.c index efd4012..41ed563 100644 --- a/fs/jffs2/super.c +++ b/fs/jffs2/super.c @@ -22,6 +22,7 @@ #include <linux/mtd/super.h> #include <linux/ctype.h> #include <linux/namei.h> +#include <linux/exportfs.h> #include "compr.h" #include "nodelist.h" @@ -62,6 +63,64 @@ static int jffs2_sync_fs(struct super_block *sb, int wait) return 0; } +static struct inode *jffs2_nfs_get_inode(struct super_block *sb, uint64_t ino, + uint32_t generation) +{ + return jffs2_iget(sb, ino); +} + +static struct dentry *jffs2_fh_to_dentry(struct super_block *sb, struct fid *fid, + int fh_len, int fh_type) +{ + return generic_fh_to_dentry(sb, fid, fh_len, fh_type, + jffs2_nfs_get_inode); +} + +static struct dentry *jffs2_fh_to_parent(struct super_block *sb, struct fid *fid, + int fh_len, int fh_type) +{ + return generic_fh_to_parent(sb, fid, fh_len, fh_type, + jffs2_nfs_get_inode); +} + +static struct dentry *jffs2_get_parent(struct dentry *child) +{ + struct jffs2_inode_info *f; + struct dentry *parent; + struct inode *inode; + uint32_t pino; + + if (!S_ISDIR(child->d_inode->i_mode)) + return ERR_PTR(-EACCES); + + f = JFFS2_INODE_INFO(child->d_inode); + + pino = f->inocache->pino_nlink; + + JFFS2_DEBUG("Parent of directory ino #%u is #%u\n", + f->inocache->ino, pino); + + inode = jffs2_iget(child->d_inode->i_sb, pino); + if (IS_ERR(inode)) { + JFFS2_ERROR("Failed to get inode #%u: %ld\n", pino, + PTR_ERR(inode)); + return ERR_CAST(inode); + } + + parent = d_alloc_anon(inode); + if (!parent) { + iput(inode); + parent = ERR_PTR(-ENOMEM); + } + return parent; +} + +static struct export_operations jffs2_export_ops = { + .get_parent = jffs2_get_parent, + .fh_to_dentry = jffs2_fh_to_dentry, + .fh_to_parent = jffs2_fh_to_parent, +}; + static const struct super_operations jffs2_super_operations = { .alloc_inode = jffs2_alloc_inode, @@ -104,6 +163,7 @@ static int jffs2_fill_super(struct super_block *sb, void *data, int silent) spin_lock_init(&c->inocache_lock); sb->s_op = &jffs2_super_operations; + sb->s_export_op = &jffs2_export_ops; sb->s_flags = sb->s_flags | MS_NOATIME; sb->s_xattr = jffs2_xattr_handlers; #ifdef CONFIG_JFFS2_FS_POSIX_ACL diff --git a/fs/msdos/namei.c b/fs/msdos/namei.c index e844b98..feebc28 100644 --- a/fs/msdos/namei.c +++ b/fs/msdos/namei.c @@ -681,7 +681,7 @@ static struct file_system_type msdos_fs_type = { .name = "msdos", .get_sb = msdos_get_sb, .kill_sb = kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_LOOKUP_IN_READDIR, }; static int __init init_msdos_fs(void) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 18060be..ccff326 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1814,6 +1814,123 @@ out: return err; } +struct buffered_dirent { + u64 ino; + loff_t offset; + int namlen; + unsigned int d_type; + char name[]; +}; + +struct readdir_data { + char *dirent; + size_t used; +}; + +static int nfsd_buffered_filldir(void *__buf, const char *name, int namlen, + loff_t offset, u64 ino, unsigned int d_type) +{ + struct readdir_data *buf = __buf; + struct buffered_dirent *de = (void *)(buf->dirent + buf->used); + unsigned int reclen; + + reclen = ALIGN(sizeof(struct buffered_dirent) + namlen, sizeof(u64)); + if (buf->used + reclen > PAGE_SIZE) + return -EINVAL; + + de->namlen = namlen; + de->offset = offset; + de->ino = ino; + de->d_type = d_type; + memcpy(de->name, name, namlen); + buf->used += reclen; + + return 0; +} + +static int nfsd_buffered_readdir(struct file *file, filldir_t func, + struct readdir_cd *cdp, loff_t *offsetp) +{ + struct 
readdir_data buf; + struct buffered_dirent *de; + int host_err; + int size; + loff_t offset; + + buf.dirent = (void *)__get_free_page(GFP_KERNEL); + if (!buf.dirent) + return -ENOMEM; + + offset = *offsetp; + cdp->err = nfserr_eof; /* will be cleared on successful read */ + + while (1) { + unsigned int reclen; + + buf.used = 0; + + host_err = vfs_readdir(file, nfsd_buffered_filldir, &buf); + if (host_err) + break; + + size = buf.used; + + if (!size) + break; + + + de = (struct buffered_dirent *)buf.dirent; + while (size > 0) { + offset = de->offset; + + if (func(cdp, de->name, de->namlen, de->offset, + de->ino, de->d_type)) + goto done; + + if (cdp->err != nfs_ok) + goto done; + + reclen = ALIGN(sizeof(*de) + de->namlen, + sizeof(u64)); + size -= reclen; + de = (struct buffered_dirent *)((char *)de + reclen); + } + offset = vfs_llseek(file, 0, 1); + } + + done: + free_page((unsigned long)(buf.dirent)); + + if (host_err) + return nfserrno(host_err); + + *offsetp = offset; + return cdp->err; +} + +static int nfsd_do_readdir(struct file *file, filldir_t func, + struct readdir_cd *cdp, loff_t *offsetp) +{ + int host_err; + + /* + * Read the directory entries. This silly loop is necessary because + * readdir() is not guaranteed to fill up the entire buffer, but + * may choose to do less. + */ + do { + cdp->err = nfserr_eof; /* will be cleared on successful read */ + host_err = vfs_readdir(file, func, cdp); + } while (host_err >=0 && cdp->err == nfs_ok); + + *offsetp = vfs_llseek(file, 0, 1); + + if (host_err) + return nfserrno(host_err); + else + return cdp->err; +} + /* * Read entries from a directory. * The NFSv3/4 verifier we ignore for now. @@ -1823,7 +1940,6 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, struct readdir_cd *cdp, filldir_t func) { __be32 err; - int host_err; struct file *file; loff_t offset = *offsetp; @@ -1837,21 +1953,11 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, goto out_close; } - /* - * Read the directory entries. This silly loop is necessary because - * readdir() is not guaranteed to fill up the entire buffer, but - * may choose to do less. - */ - - do { - cdp->err = nfserr_eof; /* will be cleared on successful read */ - host_err = vfs_readdir(file, func, cdp); - } while (host_err >=0 && cdp->err == nfs_ok); - if (host_err) - err = nfserrno(host_err); + if ((file->f_path.dentry->d_inode->i_sb->s_type->fs_flags & + FS_LOOKUP_IN_READDIR)) + err = nfsd_do_readdir(file, func, cdp, offsetp); else - err = cdp->err; - *offsetp = vfs_llseek(file, 0, 1); + err = nfsd_buffered_readdir(file, func, cdp, offsetp); if (err == nfserr_eof || err == nfserr_toosmall) err = nfs_ok; /* can still be found in ->err */ diff --git a/fs/vfat/namei.c b/fs/vfat/namei.c index 155c10b..95d0b6f 100644 --- a/fs/vfat/namei.c +++ b/fs/vfat/namei.c @@ -1034,7 +1034,7 @@ static struct file_system_type vfat_fs_type = { .name = "vfat", .get_sb = vfat_get_sb, .kill_sb = kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_LOOKUP_IN_READDIR, }; static int __init init_vfat_fs(void) diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c index 5f60363..d65d377 100644 --- a/fs/xfs/linux-2.6/xfs_file.c +++ b/fs/xfs/linux-2.6/xfs_file.c @@ -212,7 +212,6 @@ xfs_file_fsync( * Hopefully we'll find a better workaround that allows to use the optimal * version at least for local readdirs for 2.6.25. 
*/ -#if 0 STATIC int xfs_file_readdir( struct file *filp, @@ -244,125 +243,6 @@ xfs_file_readdir( return -error; return 0; } -#else - -struct hack_dirent { - u64 ino; - loff_t offset; - int namlen; - unsigned int d_type; - char name[]; -}; - -struct hack_callback { - char *dirent; - size_t len; - size_t used; -}; - -STATIC int -xfs_hack_filldir( - void *__buf, - const char *name, - int namlen, - loff_t offset, - u64 ino, - unsigned int d_type) -{ - struct hack_callback *buf = __buf; - struct hack_dirent *de = (struct hack_dirent *)(buf->dirent + buf->used); - unsigned int reclen; - - reclen = ALIGN(sizeof(struct hack_dirent) + namlen, sizeof(u64)); - if (buf->used + reclen > buf->len) - return -EINVAL; - - de->namlen = namlen; - de->offset = offset; - de->ino = ino; - de->d_type = d_type; - memcpy(de->name, name, namlen); - buf->used += reclen; - return 0; -} - -STATIC int -xfs_file_readdir( - struct file *filp, - void *dirent, - filldir_t filldir) -{ - struct inode *inode = filp->f_path.dentry->d_inode; - xfs_inode_t *ip = XFS_I(inode); - struct hack_callback buf; - struct hack_dirent *de; - int error; - loff_t size; - int eof = 0; - xfs_off_t start_offset, curr_offset, offset; - - /* - * Try fairly hard to get memory - */ - buf.len = PAGE_CACHE_SIZE; - do { - buf.dirent = kmalloc(buf.len, GFP_KERNEL); - if (buf.dirent) - break; - buf.len >>= 1; - } while (buf.len >= 1024); - - if (!buf.dirent) - return -ENOMEM; - - curr_offset = filp->f_pos; - if (curr_offset == 0x7fffffff) - offset = 0xffffffff; - else - offset = filp->f_pos; - - while (!eof) { - unsigned int reclen; - - start_offset = offset; - - buf.used = 0; - error = -xfs_readdir(ip, &buf, buf.len, &offset, - xfs_hack_filldir); - if (error || offset == start_offset) { - size = 0; - break; - } - - size = buf.used; - de = (struct hack_dirent *)buf.dirent; - while (size > 0) { - curr_offset = de->offset /* & 0x7fffffff */; - if (filldir(dirent, de->name, de->namlen, - curr_offset & 0x7fffffff, - de->ino, de->d_type)) { - goto done; - } - - reclen = ALIGN(sizeof(struct hack_dirent) + de->namlen, - sizeof(u64)); - size -= reclen; - de = (struct hack_dirent *)((char *)de + reclen); - } - } - - done: - if (!error) { - if (size == 0) - filp->f_pos = offset & 0x7fffffff; - else if (de) - filp->f_pos = curr_offset; - } - - kfree(buf.dirent); - return error; -} -#endif STATIC int xfs_file_mmap( diff --git a/include/linux/fs.h b/include/linux/fs.h index 580b513..001ba17 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -100,6 +100,8 @@ extern int dir_notify_enable; #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() * during rename() internally. */ +#define FS_LOOKUP_IN_READDIR 65536 /* FS won't deadlock if you call its + lookup() method from filldir */ /* * These are the fs-independent mount-flags: up to 32 flags are supported -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-09 16:47 ` David Woodhouse @ 2008-08-09 19:55 ` David Woodhouse [not found] ` <1218311710.26926.125.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-09 19:55 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Brown, chucklever, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Sat, 2008-08-09 at 17:47 +0100, David Woodhouse wrote: > Here it is in a single patch... Updated again after more feedback from hch, to really make it unconditional instead of just inverting the logic, turn a check for S_ISDIR() in JFFS2 into a BUG_ON() because it should _really_ never happen, and fix up some comments (remove the one about why we do the buffering in xfs, and add something similar back in nfsd). {http://,git://} git.infradead.org/users/dwmw2/nfsexport-2.6.git Patch sequence follows... -- dwmw2 ^ permalink raw reply [flat|nested] 55+ messages in thread
[parent not found: <1218311710.26926.125.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>]
* [PATCH 1/4] Factor out nfsd_do_readdir() into its own function [not found] ` <1218311710.26926.125.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> @ 2008-08-09 20:01 ` David Woodhouse [not found] ` <1218312114.5063.5.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 2008-08-09 20:02 ` [PATCH 2/4] Copy XFS readdir hack into nfsd code David Woodhouse ` (2 subsequent siblings) 3 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-09 20:01 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> --- fs/nfsd/vfs.c | 40 ++++++++++++++++++++++++---------------- 1 files changed, 24 insertions(+), 16 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 18060be..3e22634 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1814,6 +1814,29 @@ out: return err; } +static int nfsd_do_readdir(struct file *file, filldir_t func, + struct readdir_cd *cdp, loff_t *offsetp) +{ + int host_err; + + /* + * Read the directory entries. This silly loop is necessary because + * readdir() is not guaranteed to fill up the entire buffer, but + * may choose to do less. + */ + do { + cdp->err = nfserr_eof; /* will be cleared on successful read */ + host_err = vfs_readdir(file, func, cdp); + } while (host_err >=0 && cdp->err == nfs_ok); + + *offsetp = vfs_llseek(file, 0, 1); + + if (host_err) + return nfserrno(host_err); + else + return cdp->err; +} + /* * Read entries from a directory. * The NFSv3/4 verifier we ignore for now. @@ -1823,7 +1846,6 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, struct readdir_cd *cdp, filldir_t func) { __be32 err; - int host_err; struct file *file; loff_t offset = *offsetp; @@ -1837,21 +1859,7 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, goto out_close; } - /* - * Read the directory entries. This silly loop is necessary because - * readdir() is not guaranteed to fill up the entire buffer, but - * may choose to do less. - */ - - do { - cdp->err = nfserr_eof; /* will be cleared on successful read */ - host_err = vfs_readdir(file, func, cdp); - } while (host_err >=0 && cdp->err == nfs_ok); - if (host_err) - err = nfserrno(host_err); - else - err = cdp->err; - *offsetp = vfs_llseek(file, 0, 1); + err = nfsd_do_readdir(file, func, cdp, offsetp); if (err == nfserr_eof || err == nfserr_toosmall) err = nfs_ok; /* can still be found in ->err */ -- 1.5.5.1 -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 55+ messages in thread
[parent not found: <1218312114.5063.5.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>]
* Re: [PATCH 1/4] Factor out nfsd_do_readdir() into its own function [not found] ` <1218312114.5063.5.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> @ 2008-08-09 20:07 ` Christoph Hellwig 0 siblings, 0 replies; 55+ messages in thread From: Christoph Hellwig @ 2008-08-09 20:07 UTC (permalink / raw) To: David Woodhouse Cc: J. Bruce Fields, Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Looks good. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH 2/4] Copy XFS readdir hack into nfsd code. [not found] ` <1218311710.26926.125.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 2008-08-09 20:01 ` [PATCH 1/4] Factor out nfsd_do_readdir() into its own function David Woodhouse @ 2008-08-09 20:02 ` David Woodhouse 2008-08-09 20:08 ` Christoph Hellwig 2008-08-09 20:03 ` [PATCH 3/4] Remove XFS buffered readdir hack David Woodhouse 2008-08-09 20:03 ` [PATCH 4/4] Reinstate NFS exportability David Woodhouse 3 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-09 20:02 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Some file systems with their own internal locking have problems with the way that nfsd calls the ->lookup() method from within a filldir function called from their ->readdir() method. The recursion back into the file system code can cause deadlock. XFS has a fairly hackish solution to this which involves doing the readdir() into a locally-allocated buffer, then going back through it calling the filldir function afterwards. It's not ideal, but it works. It's particularly suboptimal because XFS does this for local file systems too, where it's completely unnecessary. Copy this hack into the NFS code where it can be used only for NFS export. In response to feedback, use it unconditionally rather than only for the affected file systems. Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> --- fs/nfsd/vfs.c | 108 +++++++++++++++++++++++++++++++++++++++++++++++++-------- 1 files changed, 93 insertions(+), 15 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 3e22634..d789fb8 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1814,27 +1814,105 @@ out: return err; } -static int nfsd_do_readdir(struct file *file, filldir_t func, - struct readdir_cd *cdp, loff_t *offsetp) +/* + * We do this buffering because we must not call back into the file + * system's ->lookup() method from the filldir callback. That may well + * deadlock a number of file systems. + * + * This is based heavily on the implementation of same in XFS. + */ +struct buffered_dirent { + u64 ino; + loff_t offset; + int namlen; + unsigned int d_type; + char name[]; +}; + +struct readdir_data { + char *dirent; + size_t used; +}; + +static int nfsd_buffered_filldir(void *__buf, const char *name, int namlen, + loff_t offset, u64 ino, unsigned int d_type) +{ + struct readdir_data *buf = __buf; + struct buffered_dirent *de = (void *)(buf->dirent + buf->used); + unsigned int reclen; + + reclen = ALIGN(sizeof(struct buffered_dirent) + namlen, sizeof(u64)); + if (buf->used + reclen > PAGE_SIZE) + return -EINVAL; + + de->namlen = namlen; + de->offset = offset; + de->ino = ino; + de->d_type = d_type; + memcpy(de->name, name, namlen); + buf->used += reclen; + + return 0; +} + +static int nfsd_buffered_readdir(struct file *file, filldir_t func, + struct readdir_cd *cdp, loff_t *offsetp) { + struct readdir_data buf; + struct buffered_dirent *de; int host_err; + int size; + loff_t offset; - /* - * Read the directory entries. This silly loop is necessary because - * readdir() is not guaranteed to fill up the entire buffer, but - * may choose to do less. 
- */ - do { - cdp->err = nfserr_eof; /* will be cleared on successful read */ - host_err = vfs_readdir(file, func, cdp); - } while (host_err >=0 && cdp->err == nfs_ok); + buf.dirent = (void *)__get_free_page(GFP_KERNEL); + if (!buf.dirent) + return -ENOMEM; + + offset = *offsetp; + cdp->err = nfserr_eof; /* will be cleared on successful read */ - *offsetp = vfs_llseek(file, 0, 1); + while (1) { + unsigned int reclen; + + buf.used = 0; + + host_err = vfs_readdir(file, nfsd_buffered_filldir, &buf); + if (host_err) + break; + + size = buf.used; + + if (!size) + break; + + + de = (struct buffered_dirent *)buf.dirent; + while (size > 0) { + offset = de->offset; + + if (func(cdp, de->name, de->namlen, de->offset, + de->ino, de->d_type)) + goto done; + + if (cdp->err != nfs_ok) + goto done; + + reclen = ALIGN(sizeof(*de) + de->namlen, + sizeof(u64)); + size -= reclen; + de = (struct buffered_dirent *)((char *)de + reclen); + } + offset = vfs_llseek(file, 0, 1); + } + + done: + free_page((unsigned long)(buf.dirent)); if (host_err) return nfserrno(host_err); - else - return cdp->err; + + *offsetp = offset; + return cdp->err; } /* @@ -1859,7 +1937,7 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, goto out_close; } - err = nfsd_do_readdir(file, func, cdp, offsetp); + err = nfsd_buffered_readdir(file, func, cdp, offsetp); if (err == nfserr_eof || err == nfserr_toosmall) err = nfs_ok; /* can still be found in ->err */ -- 1.5.5.1 -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH 2/4] Copy XFS readdir hack into nfsd code. 2008-08-09 20:02 ` [PATCH 2/4] Copy XFS readdir hack into nfsd code David Woodhouse @ 2008-08-09 20:08 ` Christoph Hellwig 0 siblings, 0 replies; 55+ messages in thread From: Christoph Hellwig @ 2008-08-09 20:08 UTC (permalink / raw) To: David Woodhouse Cc: J. Bruce Fields, Neil Brown, chucklever, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd Looks good. ^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH 3/4] Remove XFS buffered readdir hack [not found] ` <1218311710.26926.125.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 2008-08-09 20:01 ` [PATCH 1/4] Factor out nfsd_do_readdir() into its own function David Woodhouse 2008-08-09 20:02 ` [PATCH 2/4] Copy XFS readdir hack into nfsd code David Woodhouse @ 2008-08-09 20:03 ` David Woodhouse [not found] ` <1218312191.5063.8.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 2008-08-09 20:03 ` [PATCH 4/4] Reinstate NFS exportability David Woodhouse 3 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-09 20:03 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Now that we've moved the readdir hack to the nfsd code, we can remove the local version from the XFS code. Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> --- fs/xfs/linux-2.6/xfs_file.c | 128 ------------------------------------------- 1 files changed, 0 insertions(+), 128 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c index 5f60363..1946b44 100644 --- a/fs/xfs/linux-2.6/xfs_file.c +++ b/fs/xfs/linux-2.6/xfs_file.c @@ -204,15 +204,6 @@ xfs_file_fsync( return -xfs_fsync(XFS_I(dentry->d_inode)); } -/* - * Unfortunately we can't just use the clean and simple readdir implementation - * below, because nfs might call back into ->lookup from the filldir callback - * and that will deadlock the low-level btree code. - * - * Hopefully we'll find a better workaround that allows to use the optimal - * version at least for local readdirs for 2.6.25. 
- */ -#if 0 STATIC int xfs_file_readdir( struct file *filp, @@ -244,125 +235,6 @@ xfs_file_readdir( return -error; return 0; } -#else - -struct hack_dirent { - u64 ino; - loff_t offset; - int namlen; - unsigned int d_type; - char name[]; -}; - -struct hack_callback { - char *dirent; - size_t len; - size_t used; -}; - -STATIC int -xfs_hack_filldir( - void *__buf, - const char *name, - int namlen, - loff_t offset, - u64 ino, - unsigned int d_type) -{ - struct hack_callback *buf = __buf; - struct hack_dirent *de = (struct hack_dirent *)(buf->dirent + buf->used); - unsigned int reclen; - - reclen = ALIGN(sizeof(struct hack_dirent) + namlen, sizeof(u64)); - if (buf->used + reclen > buf->len) - return -EINVAL; - - de->namlen = namlen; - de->offset = offset; - de->ino = ino; - de->d_type = d_type; - memcpy(de->name, name, namlen); - buf->used += reclen; - return 0; -} - -STATIC int -xfs_file_readdir( - struct file *filp, - void *dirent, - filldir_t filldir) -{ - struct inode *inode = filp->f_path.dentry->d_inode; - xfs_inode_t *ip = XFS_I(inode); - struct hack_callback buf; - struct hack_dirent *de; - int error; - loff_t size; - int eof = 0; - xfs_off_t start_offset, curr_offset, offset; - - /* - * Try fairly hard to get memory - */ - buf.len = PAGE_CACHE_SIZE; - do { - buf.dirent = kmalloc(buf.len, GFP_KERNEL); - if (buf.dirent) - break; - buf.len >>= 1; - } while (buf.len >= 1024); - - if (!buf.dirent) - return -ENOMEM; - - curr_offset = filp->f_pos; - if (curr_offset == 0x7fffffff) - offset = 0xffffffff; - else - offset = filp->f_pos; - - while (!eof) { - unsigned int reclen; - - start_offset = offset; - - buf.used = 0; - error = -xfs_readdir(ip, &buf, buf.len, &offset, - xfs_hack_filldir); - if (error || offset == start_offset) { - size = 0; - break; - } - - size = buf.used; - de = (struct hack_dirent *)buf.dirent; - while (size > 0) { - curr_offset = de->offset /* & 0x7fffffff */; - if (filldir(dirent, de->name, de->namlen, - curr_offset & 0x7fffffff, - de->ino, de->d_type)) { - goto done; - } - - reclen = ALIGN(sizeof(struct hack_dirent) + de->namlen, - sizeof(u64)); - size -= reclen; - de = (struct hack_dirent *)((char *)de + reclen); - } - } - - done: - if (!error) { - if (size == 0) - filp->f_pos = offset & 0x7fffffff; - else if (de) - filp->f_pos = curr_offset; - } - - kfree(buf.dirent); - return error; -} -#endif STATIC int xfs_file_mmap( -- 1.5.5.1 -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH 3/4] Remove XFS buffered readdir hack [not found] ` <1218312191.5063.8.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> @ 2008-08-09 20:09 ` Christoph Hellwig 0 siblings, 0 replies; 55+ messages in thread From: Christoph Hellwig @ 2008-08-09 20:09 UTC (permalink / raw) To: David Woodhouse Cc: J. Bruce Fields, Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Sat, Aug 09, 2008 at 09:03:11PM +0100, David Woodhouse wrote: > Now that we've moved the readdir hack to the nfsd code, we can > remove the local version from the XFS code. Looks good to me, I'm also throwing this into my local xfsqa queue. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH 4/4] Reinstate NFS exportability [not found] ` <1218311710.26926.125.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> ` (2 preceding siblings ...) 2008-08-09 20:03 ` [PATCH 3/4] Remove XFS buffered readdir hack David Woodhouse @ 2008-08-09 20:03 ` David Woodhouse [not found] ` <1218312213.5063.9.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 3 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-09 20:03 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Now that the readdir/lookup deadlock issues have been dealt with, we can export JFFS2 file systems again. (For now, you have to specify fsid manually; we should add a method to the export_ops to handle that too.) Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> --- fs/jffs2/super.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 59 insertions(+), 0 deletions(-) diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c index efd4012..7f03a28 100644 --- a/fs/jffs2/super.c +++ b/fs/jffs2/super.c @@ -22,6 +22,7 @@ #include <linux/mtd/super.h> #include <linux/ctype.h> #include <linux/namei.h> +#include <linux/exportfs.h> #include "compr.h" #include "nodelist.h" @@ -62,6 +63,63 @@ static int jffs2_sync_fs(struct super_block *sb, int wait) return 0; } +static struct inode *jffs2_nfs_get_inode(struct super_block *sb, uint64_t ino, + uint32_t generation) +{ + return jffs2_iget(sb, ino); +} + +static struct dentry *jffs2_fh_to_dentry(struct super_block *sb, struct fid *fid, + int fh_len, int fh_type) +{ + return generic_fh_to_dentry(sb, fid, fh_len, fh_type, + jffs2_nfs_get_inode); +} + +static struct dentry *jffs2_fh_to_parent(struct super_block *sb, struct fid *fid, + int fh_len, int fh_type) +{ + return generic_fh_to_parent(sb, fid, fh_len, fh_type, + jffs2_nfs_get_inode); +} + +static struct dentry *jffs2_get_parent(struct dentry *child) +{ + struct jffs2_inode_info *f; + struct dentry *parent; + struct inode *inode; + uint32_t pino; + + BUG_ON(!S_ISDIR(child->d_inode->i_mode)); + + f = JFFS2_INODE_INFO(child->d_inode); + + pino = f->inocache->pino_nlink; + + JFFS2_DEBUG("Parent of directory ino #%u is #%u\n", + f->inocache->ino, pino); + + inode = jffs2_iget(child->d_inode->i_sb, pino); + if (IS_ERR(inode)) { + JFFS2_ERROR("Failed to get inode #%u: %ld\n", pino, + PTR_ERR(inode)); + return ERR_CAST(inode); + } + + parent = d_alloc_anon(inode); + if (!parent) { + iput(inode); + parent = ERR_PTR(-ENOMEM); + } + return parent; +} + +static struct export_operations jffs2_export_ops = { + .get_parent = jffs2_get_parent, + .fh_to_dentry = jffs2_fh_to_dentry, + .fh_to_parent = jffs2_fh_to_parent, +}; + static const struct super_operations jffs2_super_operations = { .alloc_inode = jffs2_alloc_inode, @@ -104,6 +162,7 @@ static int jffs2_fill_super(struct super_block *sb, void *data, int silent) spin_lock_init(&c->inocache_lock); sb->s_op = &jffs2_super_operations; + sb->s_export_op = &jffs2_export_ops; sb->s_flags = sb->s_flags | MS_NOATIME; sb->s_xattr = jffs2_xattr_handlers; #ifdef CONFIG_JFFS2_FS_POSIX_ACL -- 1.5.5.1 -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message 
to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH 4/4] Reinstate NFS exportability [not found] ` <1218312213.5063.9.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> @ 2008-08-09 20:10 ` Christoph Hellwig 0 siblings, 0 replies; 55+ messages in thread From: Christoph Hellwig @ 2008-08-09 20:10 UTC (permalink / raw) To: David Woodhouse Cc: J. Bruce Fields, Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Sat, Aug 09, 2008 at 09:03:33PM +0100, David Woodhouse wrote: > Now that the readdir/lookup deadlock issues have been dealt with, we can > export JFFS2 file systems again. Looks good, but you might want to add a comment why jffs2 doesn't care about an inode generation number. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
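One way to answer Christoph's request would be a comment along the lines of the sketch below on jffs2_nfs_get_inode(). The wording is illustrative rather than taken from any committed patch, but the rationale, as I understand it, is that JFFS2 hands out inode numbers monotonically and never recycles them, so a stale filehandle cannot quietly resolve to a different file.

static struct inode *jffs2_nfs_get_inode(struct super_block *sb, uint64_t ino,
                                         uint32_t generation)
{
        /*
         * We don't bother with i_generation: JFFS2 allocates inode
         * numbers monotonically and never re-uses them, so the flash
         * will wear out long before a recycled inode number could make
         * a stale filehandle resolve to the wrong file.
         */
        return jffs2_iget(sb, ino);
}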
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <18582.21855.2092.903688-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2008-08-04 6:19 ` Chuck Lever @ 2008-08-17 18:22 ` Andreas Dilger 1 sibling, 0 replies; 55+ messages in thread From: Andreas Dilger @ 2008-08-17 18:22 UTC (permalink / raw) To: Neil Brown Cc: chucklever-Re5JQEeQqe8AvxtiuMwx3w, J. Bruce Fields, David Woodhouse, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Aug 04, 2008 11:03 +1000, Neil Brown wrote: > So, given that background, it is possible to see some more possible > solutions (other than the ones already mentioned). > > - drop the internal lock across filldir. > It could be seen a impolite to hold any locks across a callback > that are not documented as being held. > This would put an extra burden on the filesystem, but it shouldn't > be a particularly big burden. > A filesystem needs to be able to find the 'next' entry from a > sequential 'seek' pointer so that is the most state that needs to > be preserved. It might be convenient to be able to keep more state > (pointers into data structures etc). All this can be achieved with > fairly standard practice: > a/ have a 'version' number per inode which is updated when any > internal restructure happens. > b/ when calling filldir, record the version number, drop the lock > call filldir, reclaim the lock, check the version number > c/ if the version number matches (as it mostly will) just keep > going. If it doesn't jump back to the top of readdir where > we decode the current seek address. > > Some filesystems have problems implementing seekdir/telldir so they > might not be totally happy here. I have little sympathy for such > filesystems and feel the burden should be on them to make it work. > > - use i_mutex to protect internal changes too, and drop i_mutex while > doing internal restructuring. This would need some VFS changes so > that dropping i_mutex would be safe. It would require some way to > lock an individual dentry. Probably you would lock it under > i_mutex by setting a flag bit, wait for the flag on some inode-wide > waitqueue, and drop the lock by clearing the flag and waking the > waitqueue. And you are never allowed to look at ->d_inode if the > lock flag is set. When we were working on scaling the performance of concurrent operations in a single directory we added hashed dentry locks instead of using i_mutex (well, i_sem in those days) to lock the whole directory. To make the change manageable we replaced direct i_sem locking on the directory inode with ->lock_dir() and ->unlock_dir() methods, defaulting to just down() and up() on i_sem, but replacing this with a per-entry lock on the child dentry hash. This allowed Lustre servers to create/lookup/rename/remove many entries in a single directory concurrently, and I think this same approach could be useful in this case also. This allows filesystems that need it to bypass i_mutex if they need their own brand of locking, while leaving the majority of filesystems untouched. It also has the benefit that filesystems that need improved multi-threaded performance in a single directory (e.g. JFFS2, XFS, or HPC or MTA workloads) have the ability to do it. 
There is definitely some work needed internal to the filesystem to take advantage of this increased parallelism, and we did implement such changes for ext3+htree directories, adding internal locking on each leaf block that scaled the concurrency with the size of the directory. Alas, we don't have any up-to-date kernel patches for this, though the VFS patch was posted to LKML back in Feb 2005 as "RFC: pdirops: vfs patch" http://www.mail-archive.com/linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg01617.html We have better dynamic locking code today (one of Jan's objections about that patch), but the VFS part of the patch is no longer maintained. The ext3+htree patch was also posted "[RFC] parallel directory operations" http://www.ussg.iu.edu/hypermail/linux/kernel/0307.1/att-0041/03-ext3-pdirops.patch Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
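To make the shape of that change a little more concrete, a minimal sketch of the ->lock_dir()/->unlock_dir() hooks Andreas describes might look like the following. None of this exists in mainline: the two methods, the opaque cookie and the helper names are all hypothetical, and the default path simply falls back to today's whole-directory i_mutex.

/* Hypothetical additions to struct inode_operations:
 *      void *(*lock_dir)(struct inode *dir, struct dentry *dentry);
 *      void (*unlock_dir)(struct inode *dir, struct dentry *dentry,
 *                         void *cookie);
 */

static inline void *vfs_lock_dir(struct inode *dir, struct dentry *dentry)
{
        if (dir->i_op->lock_dir)
                /* e.g. take a hashed per-name lock, so concurrency scales
                 * with the directory instead of serialising everything */
                return dir->i_op->lock_dir(dir, dentry);

        mutex_lock(&dir->i_mutex);      /* default: lock the whole directory */
        return NULL;
}

static inline void vfs_unlock_dir(struct inode *dir, struct dentry *dentry,
                                  void *cookie)
{
        if (dir->i_op->unlock_dir)
                dir->i_op->unlock_dir(dir, dentry, cookie);
        else
                mutex_unlock(&dir->i_mutex);
}

Callers in namei.c that today take i_mutex around a single create/lookup/unlink would switch to these helpers, which is essentially what the 2005 pdirops patch linked above did with i_sem.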
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-04 1:03 ` Neil Brown [not found] ` <18582.21855.2092.903688-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2008-08-04 18:41 ` J. Bruce Fields 2008-08-04 22:37 ` Neil Brown 1 sibling, 1 reply; 55+ messages in thread From: J. Bruce Fields @ 2008-08-04 18:41 UTC (permalink / raw) To: Neil Brown Cc: chucklever, David Woodhouse, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Mon, Aug 04, 2008 at 11:03:27AM +1000, Neil Brown wrote: > The locking bothers me. > The VFS seems to have a general policy of doing the locking itself to > make life easier for filesystem implementors (or to make it harder for > them to mess up, depending on your perspective). > > The current issue seems to suggest that the locking provided by the > VFS is no longer adequate, so each filesystem is needing to create > something itself. That suggests to me a weakness in the model. > Possibly the VFS should give up trying to be in control and let > filesystems do their own locking. Possibly there are still things > that the VFS can do which are universally good. I think these are > questions that should be addressed. > Maybe they have already been addressed and I missed the conversation > (that wouldn't surprise me much). But seeing words like "hack" > suggests to me that it hasn't. So I want to make sure I understand > the problem properly and deeply before giving my blessing to a hack. > > So: what are the issues? > > Obviously readdir can race with create and you don't want them > tripping each other up. The current VFS approach to this is i_mutex. > Any place which modifies a directory or does a lookup in a directory > takes i_mutex to ensure that the directory doesn't change. > > This is probably fairly limiting. With a tree-structured directory > you only really need to lock the 'current node' of the tree. > So any access would lock the top node, find which child to follow, > then lock the child and unlock the parent. Rebalancing might need to > be creative as you cannot lock a parent while holding a lock on the > child, but that isn't insurmountable. > So I could argue that holding i_mutex across a lookup or create or > readdir maybe isn't ideal. It would be interesting to explore the > possibility of dropping i_mutex across calls into the filesystem. > By the time the filesystem is called, we really only need to be > locking the names (dentries) rather than the whole directory. > More on this later... > > Some filesystems want to restructure directories and times that are > independent of any particular access. This might be defragmentation > or rebalancing or whatever. Clearly there needs to be mutual > exclusion between this and other accesses such as readdir and lookup. > The VFS clearly cannot help with this. It doesn't know when > rebalancing might happen or are what sort of granularity. So the > filesystem must do it's own locking. > It strikes me that this sort of locking would best be done by > spinlocks at the block level rather than a per-inode mutex. However > I haven't actually implemented this (yet) so I cannot be certain. > > This is what is causing the current problem (for JFFS2 at least). > JFF2 has a per-inode lock which protects against internally visible > changes to the inode. Further (and this is key) this lock is held > across the filldir callback. > i_mutex is also held across the filldir callback, but there is an > important difference. 
It is taken by the VFS, not the filesystem, > and it is guaranteed always to be held across the filldir callback. > So the filldir callback can call e.g. lookup without further locking. > > Backing up a little: given that the filesystem implementor chose to > use per-inode locking to protect internal restructuring (which is > certainly an easier option), why not just use i_mutex? The reason > is that a create operation might trigger system-wide garbage > collection which could trigger restructuring of the current inode, > which would lead to an A-A deadlock (as the create is waiting for the > garbage collection, and so still holding i_mutex). > > So, given that background, it is possible to see some more possible > solutions (other than the ones already mentioned). > > - drop the internal lock across filldir. > It could be seen a impolite to hold any locks across a callback > that are not documented as being held. > This would put an extra burden on the filesystem, but it shouldn't > be a particularly big burden. > A filesystem needs to be able to find the 'next' entry from a > sequential 'seek' pointer so that is the most state that needs to > be preserved. It might be convenient to be able to keep more state > (pointers into data structures etc). All this can be achieved with > fairly standard practice: > a/ have a 'version' number per inode which is updated when any > internal restructure happens. > b/ when calling filldir, record the version number, drop the lock > call filldir, reclaim the lock, check the version number > c/ if the version number matches (as it mostly will) just keep > going. If it doesn't jump back to the top of readdir where > we decode the current seek address. > > Some filesystems have problems implementing seekdir/telldir so they > might not be totally happy here. I have little sympathy for such > filesystems and feel the burden should be on them to make it work. > > - use i_mutex to protect internal changes too, and drop i_mutex while > doing internal restructuring. This would need some VFS changes so > that dropping i_mutex would be safe. It would require some way to > lock an individual dentry. Probably you would lock it under > i_mutex by setting a flag bit, wait for the flag on some inode-wide > waitqueue, and drop the lock by clearing the flag and waking the > waitqueue. And you are never allowed to look at ->d_inode if the > lock flag is set. Isn't there a lot of kernel code that assumes that following ->d_inode is safe? Or are you only worried about looking at certain filesystem-internal fields of d_inode? The locking required to keep the filesystem namespace consistent is difficult and important, so I think changing it would require an extremely careful description of the new design and a commitment from some filesystem developers to actually take advantage of it.... --b. > Of these I really like the second. Refining the i_mutex down to a > per-name lock before calling in to the filesystem seems like a really > good idea and should be good for scalability and large directories. > It isn't something to be done lightly though. The filesystem would > still be given i_mutex held, but would be allowed to drop it if it > wanted to. We could have an FS_DROPS_I_MUTEX similar to the current > FS_RENAME_DOES_D_MOVE. > > For the first, I really like the idea that a lock should not be held > across calls the filldir. I feel that a filesystem doing that is > "wrong" in much the same way that some think that recursing into the > filesystem as nfsd does is "wrong". 
In reality neither is "wrong", we > just need to work together and negotiate and work out the best way to > meet all of our needs. > > So what should we do now? I think that for JFFS2 to drop and reclaim > f->sem in readdir would be all of 20 lines of code, including updating > a ->version counter elsewhere in the code. Replicating that in all > the filesystems that need it would probably be more code than the nfsd > change though. > > On the other hand, if we implement the nfsd change, it will almost > certainly never go away, even if all filesystems eventually find that > they don't need it any more because someone improves the locking > rules in the VFS. Where as the code in the filesystems could quite > possibly go away when they are revised to make better use of the > locking regime. So I don't think that is an ideal situation either. > > If I had time, I would work on modifying the VFS to allow filesystems > to drop i_mutex. However I don't have time at the moment, so I'll > leave the decision to be made by someone else (Hi Bruce! I'll > support whatever you decide). > > But I think it is important to understand what is really going on and > not just implement a hack that works around the problem. I think I do > understand now, so I am a lot happier. Hopefully you do too. ^ permalink raw reply [flat|nested] 55+ messages in thread
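For anyone trying to picture the first option quoted above (drop the internal lock across filldir, revalidate with a version counter), a rough sketch of such a ->readdir() follows. The struct names, the dir_version counter and example_seek() are invented for illustration; only the mutex and the filldir calling convention are real. The point is simply that no filesystem lock is held while filldir, and therefore nfsd's lookup, runs.

static int example_readdir(struct file *filp, void *dirent, filldir_t filldir)
{
        struct example_inode_info *f = EXAMPLE_I(filp->f_path.dentry->d_inode);
        struct example_dirent *fd;
        char name[NAME_MAX + 1];

restart:
        mutex_lock(&f->sem);
        for (fd = example_seek(f, filp->f_pos); fd; fd = fd->next) {
                /* (a)/(b): note the directory's version, copy out the entry,
                 * then drop the internal lock before calling back */
                unsigned long version = f->dir_version;
                int namlen = fd->namlen;
                uint64_t ino = fd->ino;
                unsigned int type = fd->type;
                int over;

                memcpy(name, fd->name, namlen);
                name[namlen] = '\0';

                mutex_unlock(&f->sem);
                /* filldir may recurse into ->lookup; we hold no fs locks here */
                over = filldir(dirent, name, namlen, filp->f_pos, ino, type);
                mutex_lock(&f->sem);

                if (over)
                        break;
                filp->f_pos++;

                /* (c): if the directory was restructured while we were
                 * unlocked, go back and re-decode the seek position */
                if (version != f->dir_version) {
                        mutex_unlock(&f->sem);
                        goto restart;
                }
        }
        mutex_unlock(&f->sem);
        return 0;
}

The cost is a copy of each name and the occasional restart; the benefit is that nfsd's readdirplus never re-enters the filesystem while the internal lock is held.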
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-04 18:41 ` J. Bruce Fields @ 2008-08-04 22:37 ` Neil Brown 0 siblings, 0 replies; 55+ messages in thread From: Neil Brown @ 2008-08-04 22:37 UTC (permalink / raw) To: J. Bruce Fields Cc: chucklever, David Woodhouse, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Monday August 4, bfields@fieldses.org wrote: > On Mon, Aug 04, 2008 at 11:03:27AM +1000, Neil Brown wrote: > > - use i_mutex to protect internal changes too, and drop i_mutex while > > doing internal restructuring. This would need some VFS changes so > > that dropping i_mutex would be safe. It would require some way to > > lock an individual dentry. Probably you would lock it under > > i_mutex by setting a flag bit, wait for the flag on some inode-wide > > waitqueue, and drop the lock by clearing the flag and waking the > > waitqueue. And you are never allowed to look at ->d_inode if the > > lock flag is set. > > Isn't there a lot of kernel code that assumes that following ->d_inode > is safe? Or are you only worried about looking at certain > filesystem-internal fields of d_inode? I overstated that restriction a bit. The way ->d_inode works is that it is set once and then never changed. A rename doesn't change ->d_inode, it changes the name in the dentry, and possibly the ->d_parent pointer. An unlink also leaves ->d_inode alone and just unhashes the dentry. ->d_inode is never cleared until the dentry is freed. So ->d_inode starts as NULL, is (possibly) set to a value, and stays that way. Currently if the name exists, then ->d_inode will be set non-NULL under i_mutex and so a NULL value will never be visible outside of i_mutex. If the name doesn't exist, a NULL value *will* be visible outside of i_mutex and it could subsequently get (under i_mutex) set when name is created. If i_mutex is allowed to be dropped early, then a premature NULL could be visible in ->d_inode for a name that does exist. This is the case the needs to be guarded against. So when I said "you are never allowed to look at ->d_inode ...." I could have said "you are never allowed to see a NULL in ->d_inode ..." While lots of places dereference ->d_inode, relative few do it in a context where ->d_inode could be NULL, and these are probably all in fs/namei.c or related code. Most of them would be fairly easy to get to work with dentry locking. The problem case is rename (as it always is). With rename, d_move needs to be called after the filesystem has committed to the rename, but before i_mutex is dropped. That would be awkward for filesystems that needed to collect garbage or whatever before completing the rename. I would probably address this by allowing ->rename to return -EAGAIN with the meaning that i_mutex was dropped but nothing was committed to, and that vfs_rename_* should try again from the top. Presumably ->rename will have done some garbage collection or whatever is needed and the next call to ->rename has a much better chance of going through without needing to drop the lock. I don't know if livelocks could be a problem with this approach. > > The locking required to keep the filesystem namespace consistent is > difficult and important, so I think changing it would require an > extremely careful description of the new design and a commitment from > some filesystem developers to actually take advantage of it.... An "extremely careful description of the new design" is something we could happily see more of in the kernel world, isn't it :-) Agreed. 
NeilBrown ^ permalink raw reply [flat|nested] 55+ messages in thread
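Neil's -EAGAIN idea for ->rename can be pictured as a retry loop in the VFS rename path, along the lines of the sketch below. The helper name is invented, and the hard part, re-validating both dentries once the locks have been re-taken, is only hinted at in the comment.

static int vfs_do_rename_once(struct inode *old_dir, struct dentry *old_dentry,
                              struct inode *new_dir, struct dentry *new_dentry)
{
        int error;

        do {
                /*
                 * -EAGAIN would mean the filesystem dropped i_mutex (for
                 * instance to garbage-collect) without committing to the
                 * rename.  This sketch simply retries; a real version would
                 * re-validate old_dentry and new_dentry under the
                 * re-acquired locks before each attempt.
                 */
                error = old_dir->i_op->rename(old_dir, old_dentry,
                                              new_dir, new_dentry);
        } while (error == -EAGAIN);

        return error;
}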
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 0:40 ` David Woodhouse [not found] ` <1217551230.3719.15.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org> 2008-08-01 0:53 ` Chuck Lever @ 2008-08-01 2:14 ` Neil Brown [not found] ` <18578.29049.38904.746701-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2 siblings, 1 reply; 55+ messages in thread From: Neil Brown @ 2008-08-01 2:14 UTC (permalink / raw) To: David Woodhouse Cc: Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Friday August 1, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org wrote: > On Fri, 2008-08-01 at 10:16 +1000, Neil Brown wrote: > > On Thursday July 31, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org wrote: > > > On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote: > > > > Why is there a deadlock here? > > > > I was really hoping you would answer this question. > > It's because the nfsd readdirplus code recurses into the file system. > >From the file system's ->readdir() function we call back into nfsd's > encode_entry(), which then calls back into the file system's ->lookup() > function so that it can generate a filehandle. > > For any file system which has its own internal locking -- which includes > jffs2, btrfs, xfs, jfs, gfs*, ocfs* and probably others -- that > recursive call back into the file system will deadlock. > > In the specific case of JFFS2, we need that internal locking because of > lock ordering constraints with the garbage collection -- we need to take > the allocator lock _before_ the per-inode lock, which means we can't use > the generic inode->i_mutex for that purpose. That's documented in > fs/jffs2/README.Locking. I know fewer details about the other affected > file systems. It sounds to me like the core problem here is that the locking regime imposed by the VFS simply isn't suitable for JFFS2 .. and a bunch of other filesystems. This seems to suggest that the VFS should be changed. Superficially, it seems that changing the locking rules so that the callee takes i_mutex rather than the caller taking it would help and, in the case of JFFS2, make f->sem redundant. Does that match your understanding? That is obviously a big change and one that should not be made lightly, but if it was to benefit a number of the newer filesystems, then it would seem like the appropriate way to go. Clearly we need a short term solution too as we don't want to wait for VFS locking rules to be renegotiated. The idea of a "lock is owned by me" check is appealing to me as it is a small, localised change that would easily be removed if/when the locking we "fixed" properly. In the JFFS2 case I imagine this taking the following form: - new field in jffs2_inode_info "struct task_struct *sem_owner", initialised to NULL - in jffs2_readdir after locking ->sem, set ->sem_owner to current. - before unlocking ->sem, set ->sem_owner to NULL - in jffs2_lookup, check if "dir_f->sem_owner == current" If it does, set a flag reminding us not to drop the lock, else mutex_lock(&dir_f->sem); This should fix the problem with very little cost either in performance or elegance. And if/when the locking rules were changed, the accompanying code review would immediately notice this and remove it. 
NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
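Neil's four bullet points correspond roughly to the hack David had posted earlier (linked below in his reply). A simplified sketch of the lookup side is shown here; the sem_owner field does not exist in mainline JFFS2, the dirent walk is reduced to its essentials, and jffs2_readdir() would be expected to set f->sem_owner = current immediately after taking f->sem and clear it again before releasing the lock.

static struct dentry *jffs2_lookup(struct inode *dir_i, struct dentry *target,
                                   struct nameidata *nd)
{
        struct jffs2_inode_info *dir_f = JFFS2_INODE_INFO(dir_i);
        struct jffs2_full_dirent *fd;
        struct inode *inode = NULL;
        uint32_t ino = 0;
        /* Are we being called back from our own readdir, via nfsd's
         * filldir -> lookup_one_len()?  Then we already hold the lock. */
        int recursing = (dir_f->sem_owner == current);

        if (!recursing)
                mutex_lock(&dir_f->sem);

        /* simplified walk: the last matching dirent wins, deletions have ino 0 */
        for (fd = dir_f->dents; fd; fd = fd->next) {
                if (fd->nhash == target->d_name.hash &&
                    !strcmp(fd->name, target->d_name.name))
                        ino = fd->ino;
        }

        if (!recursing)
                mutex_unlock(&dir_f->sem);

        if (ino) {
                inode = jffs2_iget(dir_i->i_sb, ino);
                if (IS_ERR(inode))
                        return ERR_CAST(inode);
        }
        d_add(target, inode);
        return NULL;
}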
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <18578.29049.38904.746701-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2008-08-01 8:50 ` David Woodhouse 2008-08-01 10:03 ` Al Viro 1 sibling, 0 replies; 55+ messages in thread From: David Woodhouse @ 2008-08-01 8:50 UTC (permalink / raw) To: Neil Brown Cc: Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Fri, 2008-08-01 at 12:14 +1000, Neil Brown wrote: > It sounds to me like the core problem here is that the locking regime > imposed by the VFS simply isn't suitable for JFFS2 .. and a bunch of > other filesystems. Well, the VFS isn't the part that's recursing into the file system code and causing deadlock. It's only nfsd which is doing that. > This seems to suggest that the VFS should be changed. > Superficially, it seems that changing the locking rules so that the > callee takes i_mutex rather than the caller taking it would help and, > in the case of JFFS2, make f->sem redundant. Does that match your > understanding? Not entirely. I'm not sure I see how we could make such a change -- the VFS relies on i_mutex for a number of operations it performs for itself, so it's hard to see how we could do without using it there. I'm not entirely sure the approach would be sufficient for JFFS2, either. Both jffs2_readdir() and jffs2_lookup() are _still_ going to want to take the lock, whichever lock it is. And recursing back into the file system while the file system is already holding the lock is _still_ going to be considered harmful. I have even less faith that it would end up working for the various other file systems which are affected. Don't let me discourage you from trying, though :) > Clearly we need a short term solution too as we don't want to wait for > VFS locking rules to be renegotiated. The idea of a "lock is owned by > me" check is appealing to me as it is a small, localised change that > would easily be removed if/when the locking we "fixed" properly. > > In the JFFS2 case I imagine this taking the following form: > > - new field in jffs2_inode_info "struct task_struct *sem_owner", > initialised to NULL > - in jffs2_readdir after locking ->sem, set ->sem_owner to current. > - before unlocking ->sem, set ->sem_owner to NULL > - in jffs2_lookup, check if "dir_f->sem_owner == current" > If it does, set a flag reminding us not to drop the lock, > else mutex_lock(&dir_f->sem); You're describing the hack I posted at http://lists.infradead.org/pipermail/linux-mtd/2007-March/017663.html but couldn't bring myself to commit. And it (or some other workaround) would be needed in _every_ affected file system. I'm much happier with putting the workaround in nfsd code, and keeping it out of the individual file systems. -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <18578.29049.38904.746701-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2008-08-01 8:50 ` David Woodhouse @ 2008-08-01 10:03 ` Al Viro 2008-08-01 23:11 ` Neil Brown 1 sibling, 1 reply; 55+ messages in thread From: Al Viro @ 2008-08-01 10:03 UTC (permalink / raw) To: Neil Brown Cc: David Woodhouse, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Fri, Aug 01, 2008 at 12:14:17PM +1000, Neil Brown wrote: > It sounds to me like the core problem here is that the locking regime > imposed by the VFS simply isn't suitable for JFFS2 .. and a bunch of > other filesystems. > > This seems to suggest that the VFS should be changed. > Superficially, it seems that changing the locking rules so that the > callee takes i_mutex rather than the caller taking it would help and, > in the case of JFFS2, make f->sem redundant. Does that match your > understanding? Huh? How could that possibly help? You are not changing the sequence of operations on locks, you only slightly modify the timing; how could that possibly eliminate a deadlock? Moreover, how the _hell_ could making exclusion area smaller make f->sem redundant? -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 10:03 ` Al Viro @ 2008-08-01 23:11 ` Neil Brown 0 siblings, 0 replies; 55+ messages in thread From: Neil Brown @ 2008-08-01 23:11 UTC (permalink / raw) To: Al Viro Cc: David Woodhouse, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Friday August 1, viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org wrote: > On Fri, Aug 01, 2008 at 12:14:17PM +1000, Neil Brown wrote: > > > It sounds to me like the core problem here is that the locking regime > > imposed by the VFS simply isn't suitable for JFFS2 .. and a bunch of > > other filesystems. > > > > This seems to suggest that the VFS should be changed. > > Superficially, it seems that changing the locking rules so that the > > callee takes i_mutex rather than the caller taking it would help and, > > in the case of JFFS2, make f->sem redundant. Does that match your > > understanding? > > Huh? How could that possibly help? You are not changing the sequence > of operations on locks, you only slightly modify the timing; how could > that possibly eliminate a deadlock? Moreover, how the _hell_ could > making exclusion area smaller make f->sem redundant? By moving the locking of i_mutex inside the filesystem methods, I reasoned that the filesystem would have more control and be able to do things in the order that it wanted. I gathered from the JFFS locking document - admitted a very brief reading - that it the VFS took i_mutex at awkward times for JFFS2, and the for this reason, JFFS2 introduced f->sem, described as: "This is the JFFS2-internal equivalent of the inode mutex i->i_sem." If i_mutex became internal, maybe JFFS2 wouldn't need an internal equivalent. But I did miss an important point. The reason that the current code works with many filesystems is that lookup_one_len *doesn't* take i_mutex before calling ->lookup. It assumes that it is already held. So pushing the locking down into ->lookup would actually break other filesystems without fixing this one. My Bad. So it seems like recursion into the filesystem is a Bad Thing and we need a different approach. I really don't like NFSD having to cache all the names that it gets from readdir and then hand them back to lookup one at a time, though it might end up being the answer.. What about this (I know you're going to hate it)?? A new open flags: O_READDIRPLUS. If ->readdir finds that it is set on filp->f_flags, then it does a lookup on each name it finds and makes sure they are all in the dcache. Then when NFSD calls lookup_one_len inside the filldir callback, it will find the answer in the cache and not need to recurse into the filesystem. We would need some way to be sure that the new dentry didn't get flushed from the dcache before NFSD called lookup_one_len. I guess if ->readdir held a reference on it while filldir was called that would be an adequate guarantee. This would then make O_READDIRPLUS available to userspace too. "ls -l" could use it and thus give the filesystem an opportunity to optimise things. ->readdir could: load a block of the directory schedule reads for all the inodes in parallel attach them to the dcache in series, using whatever locking was needed and no more. I'm actually starting to like the idea. Which means it is probably doomed. 
NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
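To make the O_READDIRPLUS idea a little more concrete: the flag itself and example_iget() below are invented, but the dcache calls are the real primitives a filesystem's ->readdir() could use to pre-instantiate each entry, keeping a reference across the filldir call as Neil suggests, so that nfsd's lookup_one_len() becomes a pure cache hit.

static struct dentry *example_prime_dcache(struct dentry *parent,
                                           const unsigned char *name,
                                           unsigned int namlen, uint32_t ino)
{
        struct qstr q = { .name = name, .len = namlen };
        struct dentry *child;
        struct inode *inode;

        q.hash = full_name_hash(q.name, q.len);

        child = d_lookup(parent, &q);
        if (child)
                return child;           /* already in the dcache */

        child = d_alloc(parent, &q);
        if (!child)
                return NULL;            /* best effort only */

        inode = example_iget(parent->d_inode->i_sb, ino);
        if (IS_ERR(inode)) {
                dput(child);
                return NULL;
        }

        d_add(child, inode);
        return child;   /* caller dputs this after its filldir() call */
}

The filesystem's ->readdir() would call this for each name when (filp->f_flags & O_READDIRPLUS) is set, invoke filldir, then dput() the returned dentry; because it picks the moment itself, it can do all of this with none of its internal locks held.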
* [PATCH 1/4] Factor out nfsd_do_readdir() into its own function 2008-05-02 1:38 ` Neil Brown 2008-05-02 11:37 ` David Woodhouse 2008-07-31 21:54 ` David Woodhouse @ 2008-07-31 21:54 ` David Woodhouse 2008-07-31 21:54 ` [PATCH 2/4] Copy XFS readdir hack into nfsd code, introduce FS_NO_LOOKUP_IN_READDIR flag David Woodhouse ` (2 subsequent siblings) 5 siblings, 0 replies; 55+ messages in thread From: David Woodhouse @ 2008-07-31 21:54 UTC (permalink / raw) To: Neil Brown; +Cc: Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> --- fs/nfsd/vfs.c | 40 ++++++++++++++++++++++++---------------- 1 files changed, 24 insertions(+), 16 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 18060be..3e22634 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1814,6 +1814,29 @@ out: return err; } +static int nfsd_do_readdir(struct file *file, filldir_t func, + struct readdir_cd *cdp, loff_t *offsetp) +{ + int host_err; + + /* + * Read the directory entries. This silly loop is necessary because + * readdir() is not guaranteed to fill up the entire buffer, but + * may choose to do less. + */ + do { + cdp->err = nfserr_eof; /* will be cleared on successful read */ + host_err = vfs_readdir(file, func, cdp); + } while (host_err >=0 && cdp->err == nfs_ok); + + *offsetp = vfs_llseek(file, 0, 1); + + if (host_err) + return nfserrno(host_err); + else + return cdp->err; +} + /* * Read entries from a directory. * The NFSv3/4 verifier we ignore for now. @@ -1823,7 +1846,6 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, struct readdir_cd *cdp, filldir_t func) { __be32 err; - int host_err; struct file *file; loff_t offset = *offsetp; @@ -1837,21 +1859,7 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, goto out_close; } - /* - * Read the directory entries. This silly loop is necessary because - * readdir() is not guaranteed to fill up the entire buffer, but - * may choose to do less. - */ - - do { - cdp->err = nfserr_eof; /* will be cleared on successful read */ - host_err = vfs_readdir(file, func, cdp); - } while (host_err >=0 && cdp->err == nfs_ok); - if (host_err) - err = nfserrno(host_err); - else - err = cdp->err; - *offsetp = vfs_llseek(file, 0, 1); + err = nfsd_do_readdir(file, func, cdp, offsetp); if (err == nfserr_eof || err == nfserr_toosmall) err = nfs_ok; /* can still be found in ->err */ -- 1.5.5.1 -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH 2/4] Copy XFS readdir hack into nfsd code, introduce FS_NO_LOOKUP_IN_READDIR flag 2008-05-02 1:38 ` Neil Brown ` (2 preceding siblings ...) 2008-07-31 21:54 ` [PATCH 1/4] Factor out nfsd_do_readdir() into its own function David Woodhouse @ 2008-07-31 21:54 ` David Woodhouse 2008-07-31 21:55 ` [PATCH 3/4] Switch XFS to using FS_NO_LOOKUP_IN_READDIR, remove local readdir hack David Woodhouse [not found] ` <18458.28833.539314.455215-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 5 siblings, 0 replies; 55+ messages in thread From: David Woodhouse @ 2008-07-31 21:54 UTC (permalink / raw) To: Neil Brown; +Cc: Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd Some file systems with their own internal locking have problems with the way that nfsd calls the ->lookup() method from within a filldir function called from their ->readdir() method. The recursion back into the file system code can cause deadlock. XFS has a fairly hackish solution to this which involves doing the readdir() into a locally-allocated buffer, then going back through it calling the filldir function afterwards. It's not ideal, but it works. It's particularly suboptimal because XFS does this for local file systems too, where it's completely unnecessary. Copy this hack into the NFS code where it can be used only for NFS, and only for file systems which indicate that they need it by setting FS_NO_LOOKUP_IN_READDIR in their fs_type flags. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> --- fs/nfsd/vfs.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++++- include/linux/fs.h | 2 + 2 files changed, 112 insertions(+), 1 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 3e22634..81c9411 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1814,6 +1814,111 @@ out: return err; } +struct hack_dirent { + u64 ino; + loff_t offset; + int namlen; + unsigned int d_type; + char name[]; +}; + +struct hack_callback { + char *dirent; + size_t len; + size_t used; +}; + +static int nfsd_hack_filldir(void *__buf, const char *name, int namlen, + loff_t offset, u64 ino, unsigned int d_type) +{ + struct hack_callback *buf = __buf; + struct hack_dirent *de = (void *)(buf->dirent + buf->used); + unsigned int reclen; + + reclen = ALIGN(sizeof(struct hack_dirent) + namlen, sizeof(u64)); + if (buf->used + reclen > buf->len) + return -EINVAL; + + de->namlen = namlen; + de->offset = offset; + de->ino = ino; + de->d_type = d_type; + memcpy(de->name, name, namlen); + buf->used += reclen; + + return 0; +} + +static int nfsd_hack_readdir(struct file *file, filldir_t func, + struct readdir_cd *cdp, loff_t *offsetp) +{ + struct hack_callback buf; + struct hack_dirent *de; + int host_err; + int size; + loff_t offset; + + /* + * Try fairly hard to get memory + */ + buf.len = PAGE_CACHE_SIZE; + do { + buf.dirent = kmalloc(buf.len, GFP_KERNEL); + if (buf.dirent) + break; + buf.len >>= 1; + } while (buf.len >= 1024); + + if (!buf.dirent) + return -ENOMEM; + + offset = *offsetp; + cdp->err = nfserr_eof; /* will be cleared on successful read */ + + while (1) { + unsigned int reclen; + + buf.used = 0; + + host_err = vfs_readdir(file, nfsd_hack_filldir, &buf); + if (host_err) + break; + + size = buf.used; + + if (!size) + break; + + + de = (struct hack_dirent *)buf.dirent; + while (size > 0) { + offset = de->offset; + + if (func(cdp, de->name, de->namlen, de->offset, + de->ino, de->d_type)) + goto done; + + if (cdp->err != nfs_ok) + goto done; + + reclen = ALIGN(sizeof(struct hack_dirent) + de->namlen, + sizeof(u64)); + size -= reclen; + de = 
(struct hack_dirent *)((char *)de + reclen); + } + offset = vfs_llseek(file, 0, 1); + } + + done: + kfree(buf.dirent); + + if (host_err) + return nfserrno(host_err); + + *offsetp = offset; + return cdp->err; +} + static int nfsd_do_readdir(struct file *file, filldir_t func, struct readdir_cd *cdp, loff_t *offsetp) { @@ -1859,7 +1964,11 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, goto out_close; } - err = nfsd_do_readdir(file, func, cdp, offsetp); + if ((file->f_path.dentry->d_inode->i_sb->s_type->fs_flags & + FS_NO_LOOKUP_IN_READDIR)) + err = nfsd_hack_readdir(file, func, cdp, offsetp); + else + err = nfsd_do_readdir(file, func, cdp, offsetp); if (err == nfserr_eof || err == nfserr_toosmall) err = nfs_ok; /* can still be found in ->err */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 580b513..80ca410 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -100,6 +100,8 @@ extern int dir_notify_enable; #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() * during rename() internally. */ +#define FS_NO_LOOKUP_IN_READDIR 65536 /* FS will deadlock if you call its + lookup() method from filldir */ /* * These are the fs-independent mount-flags: up to 32 flags are supported -- 1.5.5.1 -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH 3/4] Switch XFS to using FS_NO_LOOKUP_IN_READDIR, remove local readdir hack 2008-05-02 1:38 ` Neil Brown ` (3 preceding siblings ...) 2008-07-31 21:54 ` [PATCH 2/4] Copy XFS readdir hack into nfsd code, introduce FS_NO_LOOKUP_IN_READDIR flag David Woodhouse @ 2008-07-31 21:55 ` David Woodhouse [not found] ` <18458.28833.539314.455215-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 5 siblings, 0 replies; 55+ messages in thread From: David Woodhouse @ 2008-07-31 21:55 UTC (permalink / raw) To: Neil Brown; +Cc: Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd Now that we've moved the readdir hack to the nfsd code, we can just set the FS_NO_LOOKUP_IN_READDIR flag and then remove the local hack from the xfs code. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> --- fs/xfs/linux-2.6/xfs_file.c | 120 ------------------------------------------ fs/xfs/linux-2.6/xfs_super.c | 2 +- 2 files changed, 1 insertions(+), 121 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c index 5f60363..d65d377 100644 --- a/fs/xfs/linux-2.6/xfs_file.c +++ b/fs/xfs/linux-2.6/xfs_file.c @@ -212,7 +212,6 @@ xfs_file_fsync( * Hopefully we'll find a better workaround that allows to use the optimal * version at least for local readdirs for 2.6.25. */ -#if 0 STATIC int xfs_file_readdir( struct file *filp, @@ -244,125 +243,6 @@ xfs_file_readdir( return -error; return 0; } -#else - -struct hack_dirent { - u64 ino; - loff_t offset; - int namlen; - unsigned int d_type; - char name[]; -}; - -struct hack_callback { - char *dirent; - size_t len; - size_t used; -}; - -STATIC int -xfs_hack_filldir( - void *__buf, - const char *name, - int namlen, - loff_t offset, - u64 ino, - unsigned int d_type) -{ - struct hack_callback *buf = __buf; - struct hack_dirent *de = (struct hack_dirent *)(buf->dirent + buf->used); - unsigned int reclen; - - reclen = ALIGN(sizeof(struct hack_dirent) + namlen, sizeof(u64)); - if (buf->used + reclen > buf->len) - return -EINVAL; - - de->namlen = namlen; - de->offset = offset; - de->ino = ino; - de->d_type = d_type; - memcpy(de->name, name, namlen); - buf->used += reclen; - return 0; -} - -STATIC int -xfs_file_readdir( - struct file *filp, - void *dirent, - filldir_t filldir) -{ - struct inode *inode = filp->f_path.dentry->d_inode; - xfs_inode_t *ip = XFS_I(inode); - struct hack_callback buf; - struct hack_dirent *de; - int error; - loff_t size; - int eof = 0; - xfs_off_t start_offset, curr_offset, offset; - - /* - * Try fairly hard to get memory - */ - buf.len = PAGE_CACHE_SIZE; - do { - buf.dirent = kmalloc(buf.len, GFP_KERNEL); - if (buf.dirent) - break; - buf.len >>= 1; - } while (buf.len >= 1024); - - if (!buf.dirent) - return -ENOMEM; - - curr_offset = filp->f_pos; - if (curr_offset == 0x7fffffff) - offset = 0xffffffff; - else - offset = filp->f_pos; - - while (!eof) { - unsigned int reclen; - - start_offset = offset; - - buf.used = 0; - error = -xfs_readdir(ip, &buf, buf.len, &offset, - xfs_hack_filldir); - if (error || offset == start_offset) { - size = 0; - break; - } - - size = buf.used; - de = (struct hack_dirent *)buf.dirent; - while (size > 0) { - curr_offset = de->offset /* & 0x7fffffff */; - if (filldir(dirent, de->name, de->namlen, - curr_offset & 0x7fffffff, - de->ino, de->d_type)) { - goto done; - } - - reclen = ALIGN(sizeof(struct hack_dirent) + de->namlen, - sizeof(u64)); - size -= reclen; - de = (struct hack_dirent *)((char *)de + reclen); - } - } - - done: - if (!error) { - if (size == 0) - filp->f_pos = offset & 0x7fffffff; - else if 
(de) - filp->f_pos = curr_offset; - } - - kfree(buf.dirent); - return error; -} -#endif STATIC int xfs_file_mmap( diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c index 9433812..e44a21d 100644 --- a/fs/xfs/linux-2.6/xfs_super.c +++ b/fs/xfs/linux-2.6/xfs_super.c @@ -1426,7 +1426,7 @@ static struct file_system_type xfs_fs_type = { .name = "xfs", .get_sb = xfs_fs_get_sb, .kill_sb = kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_NO_LOOKUP_IN_READDIR, }; -- 1.5.5.1 -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH 4/4] [JFFS2] Reinstate NFS exportability
       [not found]             ` <18458.28833.539314.455215-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2008-07-31 21:55               ` David Woodhouse
  0 siblings, 0 replies; 55+ messages in thread
From: David Woodhouse @ 2008-07-31 21:55 UTC (permalink / raw)
  To: Neil Brown
  Cc: Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Now that the readdir/lookup deadlock issues have been dealt with, we can
export JFFS2 file systems again. (For now, you have to specify fsid
manually; we should add a method to the export_ops to handle that too.)

Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 fs/jffs2/super.c |   61 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 61 insertions(+), 0 deletions(-)

diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index efd4012..2138b26 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -22,6 +22,7 @@
 #include <linux/mtd/super.h>
 #include <linux/ctype.h>
 #include <linux/namei.h>
+#include <linux/exportfs.h>
 #include "compr.h"
 #include "nodelist.h"
 
@@ -62,6 +63,64 @@ static int jffs2_sync_fs(struct super_block *sb, int wait)
 	return 0;
 }
 
+static struct inode *jffs2_nfs_get_inode(struct super_block *sb, uint64_t ino,
+					 uint32_t generation)
+{
+	return jffs2_iget(sb, ino);
+}
+
+static struct dentry *jffs2_fh_to_dentry(struct super_block *sb, struct fid *fid,
+					 int fh_len, int fh_type)
+{
+	return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+				    jffs2_nfs_get_inode);
+}
+
+static struct dentry *jffs2_fh_to_parent(struct super_block *sb, struct fid *fid,
+					 int fh_len, int fh_type)
+{
+	return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+				    jffs2_nfs_get_inode);
+}
+
+static struct dentry *jffs2_get_parent(struct dentry *child)
+{
+	struct jffs2_inode_info *f;
+	struct dentry *parent;
+	struct inode *inode;
+	uint32_t pino;
+
+	if (!S_ISDIR(child->d_inode->i_mode))
+		return ERR_PTR(-EACCES);
+
+	f = JFFS2_INODE_INFO(child->d_inode);
+
+	pino = f->inocache->pino_nlink;
+
+	JFFS2_DEBUG("Parent of directory ino #%u is #%u\n",
+		    f->inocache->ino, pino);
+
+	inode = jffs2_iget(child->d_inode->i_sb, pino);
+	if (IS_ERR(inode)) {
+		JFFS2_ERROR("Failed to get inode #%u: %ld\n", pino,
+			    PTR_ERR(inode));
+		return ERR_CAST(inode);
+	}
+
+	parent = d_alloc_anon(inode);
+	if (!parent) {
+		iput(inode);
+		parent = ERR_PTR(-ENOMEM);
+	}
+	return parent;
+}
+
+static struct export_operations jffs2_export_ops = {
+	.get_parent = jffs2_get_parent,
+	.fh_to_dentry = jffs2_fh_to_dentry,
+	.fh_to_parent = jffs2_fh_to_parent,
+};
+
 static const struct super_operations jffs2_super_operations =
 {
 	.alloc_inode =	jffs2_alloc_inode,
@@ -104,6 +163,7 @@ static int jffs2_fill_super(struct super_block *sb, void *data, int silent)
 	spin_lock_init(&c->inocache_lock);
 
 	sb->s_op = &jffs2_super_operations;
+	sb->s_export_op = &jffs2_export_ops;
 	sb->s_flags = sb->s_flags | MS_NOATIME;
 	sb->s_xattr = jffs2_xattr_handlers;
 #ifdef CONFIG_JFFS2_FS_POSIX_ACL
@@ -161,6 +221,7 @@ static struct file_system_type jffs2_fs_type = {
 	.name =		"jffs2",
 	.get_sb =	jffs2_get_sb,
 	.kill_sb =	jffs2_kill_sb,
+	.fs_flags =	FS_NO_LOOKUP_IN_READDIR,
 };
 
 static int __init init_jffs2_fs(void)
-- 
1.5.5.1

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org              Intel Corporation

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 55+ messages in thread
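
[Editorial aside, not part of the archived messages.] The remark above about
specifying fsid manually refers to the server's export configuration. JFFS2
sits on an MTD device rather than a block device, so the usual device-number
derived fsid is not available and the exports(5) entry has to supply one
explicitly. A hypothetical /etc/exports line might look like the following;
the mount point, client range and fsid value are placeholders chosen for this
illustration:

	# Illustrative only: export a JFFS2 mount with a manually chosen fsid.
	/mnt/jffs2   192.168.1.0/24(rw,sync,no_subtree_check,fsid=42)
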
end of thread, other threads:[~2008-08-17 18:22 UTC | newest]
Thread overview: 55+ messages
-- links below jump to the message on this page --
2008-05-01 19:42 [RFC] Reinstate NFS exportability for JFFS2 David Woodhouse
2008-05-01 20:48 ` Christoph Hellwig
2008-05-01 22:44 ` David Woodhouse
2008-05-02 1:38 ` Neil Brown
2008-05-02 11:37 ` David Woodhouse
[not found] ` <1209728238.25560.686.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>
2008-05-02 14:08 ` J. Bruce Fields
2008-07-31 21:54 ` David Woodhouse
2008-08-01 0:16 ` Neil Brown
[not found] ` <18578.21997.529551.676627-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2008-08-01 0:40 ` David Woodhouse
[not found] ` <1217551230.3719.15.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org>
2008-08-01 0:52 ` David Woodhouse
2008-08-01 0:53 ` Chuck Lever
[not found] ` <76bd70e30807311753m2785c6d3kd82edd1fe8b5f8b7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-08-01 1:00 ` David Woodhouse
[not found] ` <1217552437.3719.30.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org>
2008-08-01 1:31 ` Chuck Lever
2008-08-01 8:13 ` David Woodhouse
2008-08-01 13:35 ` David Woodhouse
[not found] ` <1217597759.3454.356.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>
2008-08-01 13:56 ` David Woodhouse
2008-08-01 16:05 ` Chuck Lever
2008-08-01 16:19 ` David Woodhouse
2008-08-01 17:47 ` Chuck Lever
2008-08-02 18:26 ` J. Bruce Fields
[not found] ` <20080802182644.GE30454-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2008-08-02 20:42 ` David Woodhouse
2008-08-02 21:33 ` J. Bruce Fields
[not found] ` <20080802213337.GA2833-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2008-08-03 8:39 ` David Woodhouse
2008-08-03 11:56 ` Neil Brown
2008-08-03 17:15 ` Chuck Lever
2008-08-04 1:03 ` Neil Brown
[not found] ` <18582.21855.2092.903688-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2008-08-04 6:19 ` Chuck Lever
2008-08-05 8:51 ` Dave Chinner
2008-08-05 8:59 ` David Woodhouse
2008-08-05 9:47 ` Dave Chinner
2008-08-05 23:06 ` Neil Brown
2008-08-06 0:08 ` Dave Chinner
2008-08-06 19:56 ` J. Bruce Fields
2008-08-06 20:10 ` David Woodhouse
[not found] ` <1218053443.5111.148.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>
2008-08-09 16:47 ` David Woodhouse
2008-08-09 19:55 ` David Woodhouse
[not found] ` <1218311710.26926.125.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>
2008-08-09 20:01 ` [PATCH 1/4] Factor out nfsd_do_readdir() into its own function David Woodhouse
[not found] ` <1218312114.5063.5.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>
2008-08-09 20:07 ` Christoph Hellwig
2008-08-09 20:02 ` [PATCH 2/4] Copy XFS readdir hack into nfsd code David Woodhouse
2008-08-09 20:08 ` Christoph Hellwig
2008-08-09 20:03 ` [PATCH 3/4] Remove XFS buffered readdir hack David Woodhouse
[not found] ` <1218312191.5063.8.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>
2008-08-09 20:09 ` Christoph Hellwig
2008-08-09 20:03 ` [PATCH 4/4] Reinstate NFS exportability David Woodhouse
[not found] ` <1218312213.5063.9.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>
2008-08-09 20:10 ` Christoph Hellwig
2008-08-17 18:22 ` [RFC] Reinstate NFS exportability for JFFS2 Andreas Dilger
2008-08-04 18:41 ` J. Bruce Fields
2008-08-04 22:37 ` Neil Brown
2008-08-01 2:14 ` Neil Brown
[not found] ` <18578.29049.38904.746701-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2008-08-01 8:50 ` David Woodhouse
2008-08-01 10:03 ` Al Viro
2008-08-01 23:11 ` Neil Brown
2008-07-31 21:54 ` [PATCH 1/4] Factor out nfsd_do_readdir() into its own function David Woodhouse
2008-07-31 21:54 ` [PATCH 2/4] Copy XFS readdir hack into nfsd code, introduce FS_NO_LOOKUP_IN_READDIR flag David Woodhouse
2008-07-31 21:55 ` [PATCH 3/4] Switch XFS to using FS_NO_LOOKUP_IN_READDIR, remove local readdir hack David Woodhouse
[not found] ` <18458.28833.539314.455215-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2008-07-31 21:55 ` [PATCH 4/4] [JFFS2] Reinstate NFS exportability David Woodhouse