* [RFC] Reinstate NFS exportability for JFFS2.
@ 2008-05-01 19:42 David Woodhouse
  2008-05-01 20:48 ` Christoph Hellwig
  0 siblings, 1 reply; 55+ messages in thread

From: David Woodhouse @ 2008-05-01 19:42 UTC (permalink / raw)
To: linux-fsdevel; +Cc: hch, linux-mtd

On top of commit 27c72b040c0be8f3704ed0b6b84c12cbba24a7e8 (in the
mtd-2.6 tree if Linus hasn't pulled it yet), this provides
export_operations for JFFS2.

It doesn't yet solve the readdirplus deadlock described in
http://lists.infradead.org/pipermail/linux-mtd/2007-March/017663.html

You also still need to set fsid manually when exporting it -- I'm
thinking of adding a 'get_fsid' to the export_operations to allow the
filesystem to provide its own...

diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index 7da69ea..25f6472 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -22,6 +22,7 @@
 #include <linux/mtd/super.h>
 #include <linux/ctype.h>
 #include <linux/namei.h>
+#include <linux/exportfs.h>
 #include "compr.h"
 #include "nodelist.h"

@@ -62,6 +63,64 @@ static int jffs2_sync_fs(struct super_block *sb, int wait)
 	return 0;
 }

+static struct inode *jffs2_nfs_get_inode(struct super_block *sb, uint64_t ino,
+					 uint32_t generation)
+{
+	return jffs2_iget(sb, ino);
+}
+
+static struct dentry *jffs2_fh_to_dentry(struct super_block *sb, struct fid *fid,
+					 int fh_len, int fh_type)
+{
+	return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+				    jffs2_nfs_get_inode);
+}
+
+static struct dentry *jffs2_fh_to_parent(struct super_block *sb, struct fid *fid,
+					 int fh_len, int fh_type)
+{
+	return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+				    jffs2_nfs_get_inode);
+}
+
+static struct dentry *jffs2_get_parent(struct dentry *child)
+{
+	struct jffs2_inode_info *f;
+	struct dentry *parent;
+	struct inode *inode;
+	uint32_t pino;
+
+	if (!S_ISDIR(child->d_inode->i_mode))
+		return ERR_PTR(-EACCES);
+
+	f = JFFS2_INODE_INFO(child->d_inode);
+
+	pino = f->inocache->pino_nlink;
+
+	JFFS2_DEBUG("Parent of directory ino #%u is #%u\n",
+		    f->inocache->ino, pino);
+
+	inode = jffs2_iget(child->d_inode->i_sb, pino);
+	if (IS_ERR(inode)) {
+		JFFS2_ERROR("Failed to get inode #%u: %ld\n", pino,
+			    PTR_ERR(inode));
+		return ERR_CAST(inode);
+	}
+
+	parent = d_alloc_anon(inode);
+	if (!parent) {
+		iput(inode);
+		parent = ERR_PTR(-ENOMEM);
+	}
+	return parent;
+}
+
+static struct export_operations jffs2_export_ops = {
+	.get_parent = jffs2_get_parent,
+	.fh_to_dentry = jffs2_fh_to_dentry,
+	.fh_to_parent = jffs2_fh_to_parent,
+};
+
 static const struct super_operations jffs2_super_operations =
 {
 	.alloc_inode =	jffs2_alloc_inode,
@@ -104,6 +163,7 @@ static int jffs2_fill_super(struct super_block *sb, void *data, int silent)
 	spin_lock_init(&c->inocache_lock);

 	sb->s_op = &jffs2_super_operations;
+	sb->s_export_op = &jffs2_export_ops;
 	sb->s_flags = sb->s_flags | MS_NOATIME;
 	sb->s_xattr = jffs2_xattr_handlers;
 #ifdef CONFIG_JFFS2_FS_POSIX_ACL

-- 
dwmw2

^ permalink raw reply related	[flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-05-01 19:42 [RFC] Reinstate NFS exportability for JFFS2 David Woodhouse @ 2008-05-01 20:48 ` Christoph Hellwig 2008-05-01 22:44 ` David Woodhouse 0 siblings, 1 reply; 55+ messages in thread From: Christoph Hellwig @ 2008-05-01 20:48 UTC (permalink / raw) To: David Woodhouse; +Cc: linux-fsdevel, hch, linux-mtd On Thu, May 01, 2008 at 08:42:59PM +0100, David Woodhouse wrote: > On top of commit 27c72b040c0be8f3704ed0b6b84c12cbba24a7e8 (in the > mtd-2.6 tree if Linus hasn't pulled it yet), this provides > export_operations for JFFS2. > > It doesn't yet solve the readdirplus deadlock described in > http://lists.infradead.org/pipermail/linux-mtd/2007-March/017663.html > > You also still need to set fsid manually when exporting it -- I'm > thinking of adding a 'get_fsid' to the export_operations to allow the > filesystem to provide its own... Yes, and get_fsid would be extremely useful, especially for those filesystems that already have a uuid in the superblock (*cough*, XFS, *cough*), but it'll need some co-operation with nfs-utils on when to use it. Patch looks good to me except for the few introduced overlong lines. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-05-01 20:48 ` Christoph Hellwig @ 2008-05-01 22:44 ` David Woodhouse 2008-05-02 1:38 ` Neil Brown 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-05-01 22:44 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-fsdevel, linux-mtd On Thu, 2008-05-01 at 16:48 -0400, Christoph Hellwig wrote: > Yes, and get_fsid would be extremely useful, especially for those > filesystems that already have a uuid in the superblock > (*cough*, XFS, *cough*), but it'll need some co-operation with > nfs-utils on when to use it. Why do you need to co-operate with userspace? Userspace shouldn't need to do anything -- we'll just generate a suitable fsid/uuid for ourselves, unless userspace deliberately overrides it for the export in question. > Patch looks good to me except for the few introduced overlong lines. Bah, don't you start. A less onanistic problem with it is the deadlock with NFS readdirplus (->readdir->encode_entry->compose_entry_fh->lookup). I wonder if we should postpone the calls to compose_entry_fh() until _after_ readdir() has completed. Leave space in the response for the filehandles, but only walk through it again calling compose_entry_fh() once we're done in readdir. That would allow us to get rid of the various icky hacks that file systems have to avoid that deadlock. -- dwmw2 ^ permalink raw reply [flat|nested] 55+ messages in thread
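As an illustration of the 'get_fsid' idea floated above, here is a minimal sketch of what such an export operation might look like for JFFS2. The hook itself does not exist in the kernel; the callback name, its signature, and the choice of the MTD device index as the stable identifier are all assumptions, not anything from the posted patch:

/*
 * Hypothetical addition to struct export_operations (not a real API):
 *
 *	int (*get_fsid)(struct super_block *sb, u8 uuid[16]);
 *
 * which nfsd/exportfs glue could call when no fsid= option was given.
 */
static int jffs2_get_fsid(struct super_block *sb, u8 uuid[16])
{
	struct jffs2_sb_info *c = JFFS2_SB_INFO(sb);

	/* JFFS2 has no on-media UUID, so fabricate a stable identifier
	 * from the MTD device index -- an assumption about what "provide
	 * its own" fsid would mean in practice. */
	memset(uuid, 0, 16);
	memcpy(uuid, "jffs2", 5);
	uuid[14] = (c->mtd->index >> 8) & 0xff;
	uuid[15] = c->mtd->index & 0xff;
	return 0;
}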
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-05-01 22:44 ` David Woodhouse @ 2008-05-02 1:38 ` Neil Brown 2008-05-02 11:37 ` David Woodhouse ` (5 more replies) 0 siblings, 6 replies; 55+ messages in thread From: Neil Brown @ 2008-05-02 1:38 UTC (permalink / raw) To: David Woodhouse; +Cc: Christoph Hellwig, linux-fsdevel, linux-mtd, linux-nfs On Thursday May 1, dwmw2@infradead.org wrote: > On Thu, 2008-05-01 at 16:48 -0400, Christoph Hellwig wrote: > > Yes, and get_fsid would be extremely useful, especially for those > > filesystems that already have a uuid in the superblock > > (*cough*, XFS, *cough*), but it'll need some co-operation with > > nfs-utils on when to use it. > > Why do you need to co-operate with userspace? Userspace shouldn't need > to do anything -- we'll just generate a suitable fsid/uuid for > ourselves, unless userspace deliberately overrides it for the export in > question. Actually it is the kernel that doesn't need to do anything.... Mapping between the filesystem and the filesystem part of the filehandle is done entirely in user space. The kernel says "Here is a filehandle fragment, what filesystem should I be accessing". So what you really want is to teach nfs-utils to recognise JFFS2 and extract an appropriate uuid. It already uses libblkid to get uuids for ext3 and XFS and others. Extending that to handle JFFS2 shouldn't be much of a drama. > > > Patch looks good to me except for the few introduced overlong lines. > > Bah, don't you start. A less onanistic problem with it is the deadlock > with NFS readdirplus (->readdir->encode_entry->compose_entry_fh->lookup). > > I wonder if we should postpone the calls to compose_entry_fh() until > _after_ readdir() has completed. Leave space in the response for the > filehandles, but only walk through it again calling compose_entry_fh() > once we're done in readdir. That would allow us to get rid of the > various icky hacks that file systems have to avoid that deadlock. Why is there a deadlock here? Both readdir and lookup are called with i_mutex held on the directory so there shouldn't be any need to do extra locking (he said, naively). In the readdirplus cases, i_mutex is held across both the readdir and the lookup.... One problem with your proposed solution is that filehandles aren't all the same length, so you cannot reliably leave space for them. Awkward. NeilBrown ^ permalink raw reply [flat|nested] 55+ messages in thread
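For reference, a small standalone program (not nfs-utils code) showing roughly how a userspace tool asks libblkid for a filesystem UUID, the way nfs-utils does for block-based filesystems. For JFFS2 on a bare MTD device there is simply nothing to probe, which is the gap David points out below:

/* cc probe_uuid.c -lblkid */
#include <stdio.h>
#include <stdlib.h>
#include <blkid/blkid.h>

int main(int argc, char **argv)
{
	blkid_cache cache;
	char *uuid;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <device>\n", argv[0]);
		return 1;
	}
	if (blkid_get_cache(&cache, NULL) < 0)
		return 1;

	/* Ask libblkid for the UUID tag of the given device. */
	uuid = blkid_get_tag_value(cache, "UUID", argv[1]);
	if (uuid) {
		printf("%s UUID=%s\n", argv[1], uuid);
		free(uuid);
	} else {
		/* e.g. JFFS2 on an MTD device: no UUID to be found. */
		printf("%s has no probeable UUID\n", argv[1]);
	}
	blkid_put_cache(cache);
	return 0;
}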
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-05-02 1:38 ` Neil Brown @ 2008-05-02 11:37 ` David Woodhouse [not found] ` <1209728238.25560.686.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 2008-07-31 21:54 ` David Woodhouse ` (4 subsequent siblings) 5 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-05-02 11:37 UTC (permalink / raw) To: Neil Brown; +Cc: Christoph Hellwig, linux-fsdevel, linux-mtd, linux-nfs On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote: > On Thursday May 1, dwmw2@infradead.org wrote: > > On Thu, 2008-05-01 at 16:48 -0400, Christoph Hellwig wrote: > > > Yes, and get_fsid would be extremely useful, especially for those > > > filesystems that already have a uuid in the superblock > > > (*cough*, XFS, *cough*), but it'll need some co-operation with > > > nfs-utils on when to use it. > > > > Why do you need to co-operate with userspace? Userspace shouldn't need > > to do anything -- we'll just generate a suitable fsid/uuid for > > ourselves, unless userspace deliberately overrides it for the export in > > question. > > Actually it is the kernel that doesn't need to do anything.... > Mapping between the filesystem and the filesystem part of the > filehandle is done entirely in user space. > The kernel says "Here is a filehandle fragment, what filesystem > should I be accessing". > > So what you really want is to teach nfs-utils to recognise JFFS2 and > extract an appropriate uuid. > It already uses libblkid to get uuids for ext3 and XFS and others. > Extending that to handle JFFS2 shouldn't be much of a drama. For JFFS2, there is no UUID; only i_sb->s_dev. Actually, if we just set FS_REQUIRES_DEV then it would work out OK -- at least for the NFS export. We don't do that though, because it gives behaviour we don't want in other situations. > Why is there a deadlock here? Many file systems have their own locking, and lookup() can end up trying to re-take a lock which readdir() is already holding. In the JFFS2 case, it's the fs-internal inode mutex, which is required because the garbage collector can't use i_mutex for lock ordering reasons. See also the readdir implementation and surrounding comments in fs/xfs/linux-2.6/xfs_file.c -- and the way GFS2 uses gfs2_glock_is_locked_by_me() to avoid the deadlock. The annoying thing is that JFFS2 doesn't even _implement_ i_generation, so you get no more useful information out of the lookup() call anyway :) > Both readdir and lookup are called with i_mutex held on the directory > so there shouldn't be any need to do extra locking (he said, naively). In > the readdirplus cases, i_mutex is held across both the readdir and the > lookup.... > > One problem with your proposed solution is that filehandles aren't all > the same length, so you cannot reliably leave space for them. Not without moving stuff around during the postprocessing, I suppose. Which isn't very pretty -- but it's prettier than some of the hacks we have at the moment to avoid the deadlock. -- dwmw2 ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <1209728238.25560.686.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> @ 2008-05-02 14:08 ` J. Bruce Fields 0 siblings, 0 replies; 55+ messages in thread From: J. Bruce Fields @ 2008-05-02 14:08 UTC (permalink / raw) To: David Woodhouse Cc: Neil Brown, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, linux-nfs-u79uwXL29TY76Z2rM5mHXA On Fri, May 02, 2008 at 12:37:18PM +0100, David Woodhouse wrote: > On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote: > > Why is there a deadlock here? > > Many file systems have their own locking, and lookup() can end up trying > to re-take a lock which readdir() is already holding. In the JFFS2 case, > it's the fs-internal inode mutex, which is required because the garbage > collector can't use i_mutex for lock ordering reasons. > > See also the readdir implementation and surrounding comments in > fs/xfs/linux-2.6/xfs_file.c -- and the way GFS2 uses > gfs2_glock_is_locked_by_me() to avoid the deadlock. > > The annoying thing is that JFFS2 doesn't even _implement_ i_generation, > so you get no more useful information out of the lookup() call anyway :) > > > Both readdir and lookup are called with i_mutex held on the directory > > so there should need to do any extra locking (he said, naively). In > > the readdirplus cases, i_mutex is held across both the readdir and the > > lookup.... > > > > One problem with your proposed solution is that filehandles aren't all > > the same length, so you cannot reliably leave space for them. > > Not without moving stuff around during the postprocessing, I suppose. > Which isn't very pretty -- but it's prettier than some of the hacks we > have at the moment to avoid the deadlock. This comes up periodically. It would be great to finally get it fixed. The last conversation I remember started at about: http://marc.info/?l=linux-kernel&m=119506226209056&w=2 Summary: one approach would be to define a ->readdirplus() that passes a dentry to its equivalent of the ->filldir callback. (Christoph points out that returning a stat struct would be simpler. But unfortunately we need to check for mountpoints here, so that's not sufficient.) --b. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-05-02 1:38 ` Neil Brown 2008-05-02 11:37 ` David Woodhouse @ 2008-07-31 21:54 ` David Woodhouse 2008-08-01 0:16 ` Neil Brown 2008-07-31 21:54 ` [PATCH 1/4] Factor out nfsd_do_readdir() into its own function David Woodhouse ` (3 subsequent siblings) 5 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-07-31 21:54 UTC (permalink / raw) To: Neil Brown; +Cc: Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote: > Why is there a deadlock here? > Both readdir and lookup are called with i_mutex held on the directory > so there shouldn't be any need to do extra locking (he said, naively). In > the readdirplus cases, i_mutex is held across both the readdir and the > lookup.... > > One problem with your proposed solution is that filehandles aren't all > the same length, so you cannot reliably leave space for them. > > Awkward. Yeah. I think the sanest plan for the short term is, as hch suggests, just to transplant the existing XFS hack into the nfsd code. That way, at least we can avoid using the hack for local users. And it makes NFS export from other file systems (jffs2, btrfs, etc.) easier without having to put the same hacks in each one. Git tree at git.infradead.org/users/dwmw2/nfsexport-2.6.git; patch sequence follows... -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply [flat|nested] 55+ messages in thread
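A minimal user-space model of the two-pass structure being proposed here: collect the names while the directory read is in progress, and do the per-entry work (lookup and filehandle encoding, in nfsd's case) only after readdir has returned, so the filesystem is never re-entered from inside its own readdir. This illustrates the control flow only; it is not the nfsd or XFS code:

/* cc readdir_twopass.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <fcntl.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : ".";
	DIR *dir = opendir(path);
	struct dirent *de;
	char **names = NULL;
	size_t n = 0, i;

	if (!dir)
		return 1;

	/* Pass 1: just record names -- the analogue of letting the
	 * filesystem's ->readdir() run to completion undisturbed. */
	while ((de = readdir(dir)) != NULL) {
		char **tmp = realloc(names, (n + 1) * sizeof(*names));
		if (!tmp)
			break;
		names = tmp;
		names[n++] = strdup(de->d_name);
	}

	/* Pass 2: per-entry work (stat here; lookup + filehandle encoding
	 * in nfsd) happens only after the directory read has finished. */
	for (i = 0; i < n; i++) {
		struct stat st;
		if (fstatat(dirfd(dir), names[i], &st, AT_SYMLINK_NOFOLLOW) == 0)
			printf("%-20s ino %llu\n", names[i],
			       (unsigned long long)st.st_ino);
		free(names[i]);
	}
	free(names);
	closedir(dir);
	return 0;
}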
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-07-31 21:54 ` David Woodhouse @ 2008-08-01 0:16 ` Neil Brown [not found] ` <18578.21997.529551.676627-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 0 siblings, 1 reply; 55+ messages in thread From: Neil Brown @ 2008-08-01 0:16 UTC (permalink / raw) To: David Woodhouse; +Cc: Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Thursday July 31, dwmw2@infradead.org wrote: > On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote: > > Why is there a deadlock here? I was really hoping you would answer this question. I can see the sense in your approach, but it does still seem a bit hackish. I would like to understand the details of the problem enough to be confident that there is no less-hackish solution. Thanks, NeilBrown > > Both readdir and lookup are called with i_mutex held on the directory > > so there shouldn't be any need to do extra locking (he said, naively). In > > the readdirplus cases, i_mutex is held across both the readdir and the > > lookup.... > > > > One problem with your proposed solution is that filehandles aren't all > > the same length, so you cannot reliably leave space for them. > > > > Awkward. > > Yeah. I think the sanest plan for the short term is, as hch suggests, > just to transplant the existing XFS hack into the nfsd code. That way, > at least we can avoid using the hack for local users. And it makes NFS > export from other file systems (jffs2, btrfs, etc.) easier without > having to put the same hacks in each one. > > Git tree at git.infradead.org/users/dwmw2/nfsexport-2.6.git; patch > sequence follows... > > -- > David Woodhouse Open Source Technology Centre > David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <18578.21997.529551.676627-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2008-08-01 0:40 ` David Woodhouse [not found] ` <1217551230.3719.15.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org> ` (2 more replies) 0 siblings, 3 replies; 55+ messages in thread From: David Woodhouse @ 2008-08-01 0:40 UTC (permalink / raw) To: Neil Brown Cc: Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Fri, 2008-08-01 at 10:16 +1000, Neil Brown wrote: > On Thursday July 31, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org wrote: > > On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote: > > > Why is there a deadlock here? > > I was really hoping you would answer this question. It's because the nfsd readdirplus code recurses into the file system. From the file system's ->readdir() function we call back into nfsd's encode_entry(), which then calls back into the file system's ->lookup() function so that it can generate a filehandle. For any file system which has its own internal locking -- which includes jffs2, btrfs, xfs, jfs, gfs*, ocfs* and probably others -- that recursive call back into the file system will deadlock. In the specific case of JFFS2, we need that internal locking because of lock ordering constraints with the garbage collection -- we need to take the allocator lock _before_ the per-inode lock, which means we can't use the generic inode->i_mutex for that purpose. That's documented in fs/jffs2/README.Locking. I know fewer details about the other affected file systems. > I can see the sense in your approach, but it does still seem a bit > hackish. I would like to understand the details of the problem enough > to be confident that there is no less-hackish solution. I was thinking that we could perhaps get away with an extension to readdir() that passes the filehandle to its filldir callback too. I'm not sure if that's sufficient for NFS4 though. Perhaps we could add a ->lookup_fh() method which is guaranteed to be called _only_ from within ->readdir(), so it can have (or _lack_) the appropriate locking. For now though, this approach moves the problem into the nfsd code where it belongs -- the VFS never calls into the file system recursively like this, in the general case. -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
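A toy user-space model of the recursion described above, using an error-checking pthread mutex so the self-deadlock is reported rather than hanging (a kernel mutex, like JFFS2's per-inode f->sem, would simply block forever). All of the names here are made up for illustration:

/* cc deadlock_model.c -pthread */
#include <stdio.h>
#include <string.h>
#include <pthread.h>

static pthread_mutex_t dir_sem;	/* stands in for the fs-internal inode lock */

/* Stand-in for ->lookup(): needs the per-inode lock. */
static int fake_lookup(const char *name)
{
	int err = pthread_mutex_lock(&dir_sem);
	if (err)
		return err;		/* EDEADLK: this thread already holds it */
	/* ... look the name up ... */
	pthread_mutex_unlock(&dir_sem);
	return 0;
}

/* Stand-in for nfsd's readdirplus filldir callback, which does a lookup
 * to build the filehandle for each entry. */
static void filldir_plus(const char *name)
{
	int err = fake_lookup(name);
	printf("lookup(%s) -> %s\n", name, err ? strerror(err) : "ok");
}

/* Stand-in for ->readdir(): holds the lock while invoking the callback. */
static void fake_readdir(void)
{
	pthread_mutex_lock(&dir_sem);
	filldir_plus("some_entry");
	pthread_mutex_unlock(&dir_sem);
}

int main(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&dir_sem, &attr);

	fake_readdir();	/* prints "Resource deadlock avoided" */
	return 0;
}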
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <1217551230.3719.15.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org> @ 2008-08-01 0:52 ` David Woodhouse 0 siblings, 0 replies; 55+ messages in thread From: David Woodhouse @ 2008-08-01 0:52 UTC (permalink / raw) To: Neil Brown Cc: Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Fri, 2008-08-01 at 01:40 +0100, David Woodhouse wrote: > For any file system which has its own internal locking -- which includes > jffs2, btrfs, xfs, jfs, gfs*, ocfs* and probably others -- that > recursive call back into the file system will deadlock. It _would_ deadlock, I mean. They each have their own strategy to avoid it, for now. jffs2 and btrfs just don't provide export_ops so they avoid the problem, but we'd _like_ to be able to export them. xfs has the hack I've just transplanted. gfs2 has a 'lock is owned by me' check. Not sure about the others. -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 0:40 ` David Woodhouse [not found] ` <1217551230.3719.15.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org> @ 2008-08-01 0:53 ` Chuck Lever [not found] ` <76bd70e30807311753m2785c6d3kd82edd1fe8b5f8b7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2008-08-01 2:14 ` Neil Brown 2 siblings, 1 reply; 55+ messages in thread From: Chuck Lever @ 2008-08-01 0:53 UTC (permalink / raw) To: David Woodhouse Cc: Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Thu, Jul 31, 2008 at 8:40 PM, David Woodhouse <dwmw2@infradead.org> wrote: > On Fri, 2008-08-01 at 10:16 +1000, Neil Brown wrote: >> On Thursday July 31, dwmw2@infradead.org wrote: >> > On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote: >> > > Why is there a deadlock here? >> >> I was really hoping you would answer this question. > > It's because the nfsd readdirplus code recurses into the file system. > >From the file system's ->readdir() function we call back into nfsd's > encode_entry(), which then calls back into the file system's ->lookup() > function so that it can generate a filehandle. > > For any file system which has its own internal locking -- which includes > jffs2, btrfs, xfs, jfs, gfs*, ocfs* and probably others -- that > recursive call back into the file system will deadlock. > > In the specific case of JFFS2, we need that internal locking because of > lock ordering constraints with the garbage collection -- we need to take > the allocator lock _before_ the per-inode lock, which means we can't use > the generic inode->i_mutex for that purpose. That's documented in > fs/jffs2/README.Locking. I know fewer details about the other affected > file systems. > >> I can see the sense in your approach, but it does still seem a bit >> hackish. I would like to understand the details of the problem enough >> to be confident that there is no less-hackish solution. > > I was thinking that we could perhaps get away with an extension to > readdir() that passes the filehandle to its filldir callback too. I'm > not sure if that's sufficient for NFS4 though. For now it is sufficient, IMO. NFSv4 doesn't implement a readdirplus operation, and the performance benefits of NFSv3 readdirplus are equivocal -- there isn't a strong desire to replicate the complexity of NFSv3 readdirplus in NFSv4. I'm not even sure you can do it even with a single compound RPC, so even in the long run NFSv4 may not ever have the locking issues that NFSv3 does here. I agree with Neil, though -- as a reviewer, I think the architecture of your solution is valid, but would need to audit these pretty closely to get a sense of the individual correctness/appropriateness of each change. -- "Alright guard, begin the unnecessarily slow-moving dipping mechanism." --Dr. Evil ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <76bd70e30807311753m2785c6d3kd82edd1fe8b5f8b7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2008-08-01 1:00 ` David Woodhouse [not found] ` <1217552437.3719.30.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org> 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-01 1:00 UTC (permalink / raw) To: chucklever-Re5JQEeQqe8AvxtiuMwx3w Cc: Neil Brown, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Thu, 2008-07-31 at 20:53 -0400, Chuck Lever wrote: > For now it is sufficient, IMO. NFSv4 doesn't implement a readdirplus > operation, and the performance benefits of NFSv3 readdirplus are > equivocal -- there isn't a strong desire to replicate the complexity > of NFSv3 readdirplus in NFSv4. I'm not even sure you can do it even > with a single compound RPC, so even in the long run NFSv4 may not ever > have the locking issues that NFSv3 does here. AFAICT NFSv4 does have the same recursion issues already. The call trace goes fs->readdir() ... nfsd4_encode_dirent() ... nfsd4_encode_dirent_fattr() ... lookup_one_len() ... fs->lookup(). Or am I mistaken? > I agree with Neil, though -- as a reviewer, I think the architecture > of your solution is valid, but would need to audit these pretty > closely to get a sense of the individual correctness/appropriateness > of each change. The important part to review is this one, which is fairly close to what XFS has been doing (for local access too) for a long time already: http://git.infradead.org/users/dwmw2/nfsexport-2.6.git?a=commitdiff;h=37f3aae5380eb4ef23a77f3f52b8849a18cee188 -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <1217552437.3719.30.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org> @ 2008-08-01 1:31 ` Chuck Lever 2008-08-01 8:13 ` David Woodhouse 2008-08-01 13:35 ` David Woodhouse 0 siblings, 2 replies; 55+ messages in thread From: Chuck Lever @ 2008-08-01 1:31 UTC (permalink / raw) To: David Woodhouse Cc: Neil Brown, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Thu, Jul 31, 2008 at 9:00 PM, David Woodhouse <dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote: > On Thu, 2008-07-31 at 20:53 -0400, Chuck Lever wrote: >> For now it is sufficient, IMO. NFSv4 doesn't implement a readdirplus >> operation, and the performance benefits of NFSv3 readdirplus are >> equivocal -- there isn't a strong desire to replicate the complexity >> of NFSv3 readdirplus in NFSv4. I'm not even sure you can do it even >> with a single compound RPC, so even in the long run NFSv4 may not ever >> have the locking issues that NFSv3 does here. > > AFAICT NFSv4 does have the same recursion issues already. The call trace > goes fs->readdir() ... nfsd4_encode_dirent() ... > nfsd4_encode_dirent_fattr() ... lookup_one_len() ... fs->lookup(). > > Or am I mistaken? It looks like it needs a directory entry's dentry for a couple of reasons: 1. To determine whether a directory entry is a mount point 2. If the client has asked for file handles (via a bitmask) for the directory entries So, yep, you're right. NFSv4 will hit this too. As I recall, the Linux NFSv4 client doesn't use FATTR4_WORD0_FILEHANDLE during NFSv4 READDIR for the reasons I stated earlier; and it only uses FATTR4_WORD0_FSID during GETATTR. Other clients might use these during a READDIR however. -- "Alright guard, begin the unnecessarily slow-moving dipping mechanism." --Dr. Evil -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 1:31 ` Chuck Lever @ 2008-08-01 8:13 ` David Woodhouse 2008-08-01 13:35 ` David Woodhouse 1 sibling, 0 replies; 55+ messages in thread From: David Woodhouse @ 2008-08-01 8:13 UTC (permalink / raw) To: chucklever Cc: Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Thu, 2008-07-31 at 21:31 -0400, Chuck Lever wrote: > It looks like it needs a directory entry's dentry for a couple of > reasons: > > 1. To determine whether a directory entry is a mount point > > 2. If the client has asked for file handles (via a bitmask) for the > directory entries Theoretically, neither of those actually need the _inode_. If it's a mountpoint, won't it be in the dcache already? So a purely dcache-based lookup would find it, without having to call through to the file system's ->lookup() method on a dcache miss? And to generate the file handle, all you need in the common case is i_generation? You don't need to pull every inode into core. -- dwmw2 ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 1:31 ` Chuck Lever 2008-08-01 8:13 ` David Woodhouse @ 2008-08-01 13:35 ` David Woodhouse [not found] ` <1217597759.3454.356.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 1 sibling, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-01 13:35 UTC (permalink / raw) To: chucklever, viro Cc: Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Thu, 2008-07-31 at 21:31 -0400, Chuck Lever wrote: > On Thu, Jul 31, 2008 at 9:00 PM, David Woodhouse <dwmw2@infradead.org> wrote: > > On Thu, 2008-07-31 at 20:53 -0400, Chuck Lever wrote: > >> For now it is sufficient, IMO. NFSv4 doesn't implement a readdirplus > >> operation, and the performance benefits of NFSv3 readdirplus are > >> equivocal -- there isn't a strong desire to replicate the complexity > >> of NFSv3 readdirplus in NFSv4. I'm not even sure you can do it even > >> with a single compound RPC, so even in the long run NFSv4 may not ever > >> have the locking issues that NFSv3 does here. > > > > AFAICT NFSv4 does have the same recursion issues already. The call trace > > goes fs->readdir() ... nfsd4_encode_dirent() ... > > nfsd4_encode_dirent_fattr() ... lookup_one_len() ... fs->lookup(). > > > > Or am I mistaken? > > It looks like it needs a directory entry's dentry for a couple of reasons: > > 1. To determine whether a directory entry is a mount point > > 2. If the client has asked for file handles (via a bitmask) for the > directory entries Those are needed by NFSv3 too -- and can be handled with a lookup_fh() method in the file system which is guaranteed to be called from within the filldir callback, and some support in the VFS for checking if it's a mountpoint. NFSv4 introduces another problem though, which is that it seems to be able to return the _full_ getattr() results for each object, and there's no real way round the fact that we need to do the ->lookup() for that. If sane clients aren't expected to _ask_ for that, though, then perhaps it would be OK to fall back to something like the existing readdir-to-buffer hack for that case, while most normal clients won't trigger it. -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <1217597759.3454.356.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> @ 2008-08-01 13:56 ` David Woodhouse 2008-08-01 16:05 ` Chuck Lever 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-01 13:56 UTC (permalink / raw) To: chucklever-Re5JQEeQqe8AvxtiuMwx3w Cc: viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Neil Brown, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Fri, 2008-08-01 at 14:36 +0100, David Woodhouse wrote: > Those are needed by NFSv3 too -- and can be handled with a lookup_fh() > method in the file system which is guaranteed to be called from within > the filldir callback, and some support in the VFS for checking if it's a > mountpoint. > > NFSv4 introduces another problem though, which is that it seems to be > able to return the _full_ getattr() results for each object, and there's > no real way round the fact that we need to do the ->lookup() for that. > > If sane clients aren't expected to _ask_ for that, though, then perhaps > it would be OK to fall back to something like the existing > readdir-to-buffer hack for that case, while most normal clients won't > trigger it. Or maybe we could just mask the offending attrs out of ->rd_bmval for readdir calls, and say we don't support them? Would anyone scream if we did that? -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 13:56 ` David Woodhouse @ 2008-08-01 16:05 ` Chuck Lever 2008-08-01 16:19 ` David Woodhouse 0 siblings, 1 reply; 55+ messages in thread From: Chuck Lever @ 2008-08-01 16:05 UTC (permalink / raw) To: David Woodhouse, J. Bruce Fields Cc: viro, Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd Hi David- On Fri, Aug 1, 2008 at 9:56 AM, David Woodhouse <dwmw2@infradead.org> wrote: > On Fri, 2008-08-01 at 14:36 +0100, David Woodhouse wrote: >> Those are needed by NFSv3 too -- and can be handled with a lookup_fh() >> method in the file system which is guaranteed to be called from within >> the filldir callback, and some support in the VFS for checking if it's a >> mountpoint. >> >> NFSv4 introduces another problem though, which is that it seems to be >> able to return the _full_ getattr() results for each object, and there's >> no real way round the fact that we need to do the ->lookup() for that. >> >> If sane clients aren't expected to _ask_ for that, though, then perhaps >> it would be OK to fall back to something like the existing >> readdir-to-buffer hack for that case, while most normal clients won't >> trigger it. Sanity / insanity is probably not the right description for these different types of clients... NFSv4 is designed to work with clients that have a typical VFS (like an in-kernel Unix client), or user-space clients, or clients on systems that don't have typical VFS APIs (like Windows). So, servers have to be prepared to expect a wide gamut of different combinations of individual operations via compound RPCs. In practice, nearly all client implementations so far are VFS-based in-kernel clients, and have roughly the same kinds of readdir (ie driven by getdents(3) and allowing seek offsets like a byte stream). But they are at generally different stages of implementation, and have different ways to go about their work. However, speaking generally, the advanced features of NFSv4, like FS_LOCATIONS and pseudofs often do require some special sauce that is sometimes not terribly friendly to the server-side VFS. I only mentioned that the Linux client doesn't use WORD0_FILEHANDLE to caution against testing any server change with a Linux NFSv4 client -- it wouldn't necessarily exercise the server code in question. Some NFSv3 clients don't support READDIRPLUS at all, while some can disable it (like Linux, Mac OS, and FreeBSD), and others use it only in certain cases (Linux). I wouldn't describe any of these as saner or more commonly encountered than another. > Or maybe we could just mask the offending attrs out of ->rd_bmval for > readdir calls, and say we don't support them? Would anyone scream if we > did that? I'm not an NFSv4 expert (hence my initial incorrect assertion about NFSv4 not supporting readdirplus at all). I defer to those who are actually working on the standard and Linux implementation (Bruce?) But typically masking out these features could potentially cause severe interoperability problems for certain client implementations. We can only know for sure after a lot of testing at multivendor events like Connectathon; it's not something I would disable cavalierly. I believe the server can also indicate to clients that NFSv3 READDIRPLUS is not supported, and that wouldn't cause as much of a disaster for clients. It would even be feasible to disable READDIRPLUS only for certain physical file systems that would have a problem with lookup-during-filldir. 
I rather prefer making NFSD do the right thing here -- it seems to localize and document the issue and provide a solution that all file systems can use with a minimum of real fuss. I know that OCFS2 has this locking inversion issue as well, and we know OCFS2 users do share data from it with NFS. So Oracle is interested in a good solution here too. -- Chuck Lever ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 16:05 ` Chuck Lever @ 2008-08-01 16:19 ` David Woodhouse 2008-08-01 17:47 ` Chuck Lever 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-01 16:19 UTC (permalink / raw) To: chucklever Cc: J. Bruce Fields, viro, Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Fri, 2008-08-01 at 12:05 -0400, Chuck Lever wrote: > Some NFSv3 clients don't support READDIRPLUS at all, while some can > disable it (like Linux, Mac OS, and FreeBSD), and others use it only > in certain cases (Linux). I wouldn't describe any of these as saner > or more commonly encountered than another. Readdirplus is easy enough to fix with a lookup_fh() method and some way to check if a given entry is a mountpoint. I'm not worried about that. > > Or maybe we could just mask the offending attrs out of ->rd_bmval for > > readdir calls, and say we don't support them? Would anyone scream if we > > did that? > > I'm not an NFSv4 expert (hence my initial incorrect assertion about > NFSv4 not supporting readdirplus at all). I defer to those who are > actually working on the standard and Linux implementation (Bruce?) > But typically masking out these features could potentially cause > severe interoperability problems for certain client implementations. > We can only know for sure after a lot of testing at multivendor events > like Connectathon; it's not something I would disable cavalierly. It's not readdirplus (FATTR4_WORD0_FILEHANDLE?) I was talking about masking out -- as I said, we can cope with that. It's things that would _really_ require the inode, like FATTR4_WORD0_ACL, FATTR4_WORD0_CHANGE, etc. But still, if masking them out would be a problem, we could use the existing trick of doing readdir into a buffer and then going back and doing the actual lookup later. But _only_ for that relatively rare case, rather than for _all_ users of readdirplus, as we do at the moment. > I rather prefer making NFSD do the right thing here -- it seems to > localize and document the issue and provide a solution that all file > systems can use with a minimum of real fuss. Yes, that's definitely my preference too. What I've posted is a good first step to that, and we can talk about improving it later. -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 16:19 ` David Woodhouse @ 2008-08-01 17:47 ` Chuck Lever 2008-08-02 18:26 ` J. Bruce Fields 0 siblings, 1 reply; 55+ messages in thread From: Chuck Lever @ 2008-08-01 17:47 UTC (permalink / raw) To: David Woodhouse Cc: J. Bruce Fields, viro, Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Fri, Aug 1, 2008 at 12:19 PM, David Woodhouse <dwmw2@infradead.org> wrote: > On Fri, 2008-08-01 at 12:05 -0400, Chuck Lever wrote: >> Some NFSv3 clients don't support READDIRPLUS at all, while some can >> disable it (like Linux, Mac OS, and FreeBSD), and others use it only >> in certain cases (Linux). I wouldn't describe any of these as saner >> or more commonly encountered than another. > > Readdirplus is easy enough to fix with a lookup_fh() method and some way > to check if a given entry is a mountpoint. I'm not worried about that. > >> > Or maybe we could just mask the offending attrs out of ->rd_bmval for >> > readdir calls, and say we don't support them? Would anyone scream if we >> > did that? >> >> I'm not an NFSv4 expert (hence my initial incorrect assertion about >> NFSv4 not supporting readdirplus at all). I defer to those who are >> actually working on the standard and Linux implementation (Bruce?) >> But typically masking out these features could potentially cause >> severe interoperability problems for certain client implementations. >> We can only know for sure after a lot of testing at multivendor events >> like Connectathon; it's not something I would disable cavalierly. > > It's not readdirplus (FATTR4_WORD0_FILEHANDLE?) I was talking about > masking out -- as I said, we can cope with that. It's things that would > _really_ require the inode, like FATTR4_WORD0_ACL, FATTR4_WORD0_CHANGE, > etc. AFAIK WORD0_CHANGE is a required value for NFSv4 clients to detect data changes on the server. That's one I think needs to remain enabled in the server capabilities bitmask. In fact most local file systems on Linux don't handle that one correctly yet anyway, but "saner" clients do depend on that one to manage their page caches. I don't think there's a way to indicate to a client that a server supports WORD0_CHANGE for a normal NFSv4 GETATTR, but not during a NFSv4 READDIR. > But still, if masking them out would be a problem, we could use the > existing trick of doing readdir into a buffer and then going back and > doing the actual lookup later. But _only_ for that relatively rare case, > rather than for _all_ users of readdirplus, as we do at the moment. Personally I don't have an immediate problem with the double-buffering here. The extra data copy wouldn't add significant overhead to an already expensive and not terribly frequent operation. There may be a caching opportunity in saving the buffered result, but I would bet either the contents of a directory or the attributes of objects residing in it would change too often for there to be much benefit. I think we have an opportunity for a little better concurrency here, though. Doing all the lookups and the directory read asynchronously and concurrently, if that's possible, might be a performance enhancement in most normal cases. This would allow the requests to be passed all at once to the local file system, which could then organize any necessary disk accesses in a more optimal way. That would be a reason to continue to do a double-buffered readdir in every case. -- Chuck Lever ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 17:47 ` Chuck Lever @ 2008-08-02 18:26 ` J. Bruce Fields [not found] ` <20080802182644.GE30454-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> 2008-08-03 11:56 ` Neil Brown 0 siblings, 2 replies; 55+ messages in thread From: J. Bruce Fields @ 2008-08-02 18:26 UTC (permalink / raw) To: chucklever Cc: David Woodhouse, viro, Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Fri, Aug 01, 2008 at 01:47:58PM -0400, Chuck Lever wrote: > AFAIK WORD0_CHANGE is a required value for NFSv4 clients to detect > data changes on the server. That's one I think needs to remain > enabled in the server capabilities bitmask. In fact most local file > systems on Linux don't handle that one correctly yet anyway, but > "saner" clients do depend on that one to manage their page caches. > > I don't think there's a way to indicate to a client that a server > supports WORD0_CHANGE for a normal NFSv4 GETATTR, but not during a > NFSv4 READDIR. > > > But still, if masking them out would be a problem, we could use the > > existing trick of doing readdir into a buffer and then going back and > > doing the actual lookup later. But _only_ for that relatively rare case, > > rather than for _all_ users of readdirplus, as we do at the moment. > > Personally I don't have an immediate problem with the double-buffering > here. The extra data copy wouldn't add significant overhead to an > already expensive and not terribly frequent operation. There may be a > caching opportunity to saving the buffered result, but I would bet > either the contents of a directory or the attributes of objects > residing in it would change too often for there to be much benefit. > > I think we have an opportunity for a little better concurrency here, > though. Doing all the lookups and the directory read asynchronously > and concurrently, if that's possible, might be a performance > enhancement in most normal cases. This would allow the requests to be > passed all at once to the local file system, which could then organize > any necessary disk accesses in a more optimal way. > > That would be a reason to continue to do a double-buffered readdir in > every case. I don't see how xfs's double-buffered readdir allows you to do the "lookups and directory reads asynchronously and concurrently"? The filesystem doesn't know the lookups are going to be coming at the time that it gets the readdir, so it doesn't know to start them yet. Could you add a readdirplus vfs operation which took a flag indicating how much extra information you're going to need? The three cases where readdir can be called are: - ordinary readdir, for which the existing filldir_t provides all the information needed. - nfsv3 readdirplus, where the only additional information needed is whether there's a mountpoint. - nfsv4 readdir, where we may need all the stat info, acls, etc., etc. So you could define a readdirplus file operation like readdirplus(file, void, plusfiller, flag) with plusfiller defined just as filldir_t but with an extra struct dentry * argument. The only nonzero flag value would be DENTRY_NEEDED, and without DENTRY_NEEDED set, the filesystem would have the option of passing NULL in the plusfiller callback's dentry argument. So nfsv3 would use flag == 0, and get a dentry passed back to it only when it was already cached (all it needs to detect mountpoints). But nfsv4 would pass in DENTRY_NEEDED, and get a dentry on every callback. Um. 
That assumes it's only the lookup that has the locking problems, and that the calls back into the filesystem for acls and so on will be safe. Etc. Or you could define a more complicated set of flags indicating exactly what would be needed, and have the callback pass back a big structure, most of whose fields could be left unset depending on what was requested. Though really I can't see any great objection to just moving xfs's hack up into nfsd. It may not do everything, but it seems like an incremental improvement. --b. ^ permalink raw reply [flat|nested] 55+ messages in thread
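A sketch of the interface Bruce describes above, written out as kernel-style declarations. None of this exists in mainline; the names, the flag value, and the callback shape are assumptions based purely on the description in the message:

/* Hypothetical "readdirplus" file operation and its callback. */
#define DENTRY_NEEDED	0x1	/* caller wants a dentry on every callback */

typedef int (*plusfiller_t)(void *buf, const char *name, int namlen,
			    loff_t offset, u64 ino, unsigned int d_type,
			    struct dentry *dentry);	/* dentry may be NULL
							   unless DENTRY_NEEDED
							   was passed */

/* Would sit alongside ->readdir() in struct file_operations: */
int (*readdirplus)(struct file *file, void *buf,
		   plusfiller_t filler, unsigned int flags);

/* Intended usage (per the message above): NFSv3 readdirplus calls it with
 * flags == 0 and only gets a dentry when one is already cached, which is
 * enough to spot mountpoints; NFSv4 readdir passes DENTRY_NEEDED. */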
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <20080802182644.GE30454-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2008-08-02 20:42 ` David Woodhouse 2008-08-02 21:33 ` J. Bruce Fields 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-02 20:42 UTC (permalink / raw) To: J. Bruce Fields Cc: chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Neil Brown, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Sat, 2008-08-02 at 14:26 -0400, J. Bruce Fields wrote: > Could you add a readdirplus vfs operation which took a flag indicating > how much extra information you're going to need? Actually, if we're screwing with readdir then xfs would like to know how much it's going to be asked to read, rather than just have the filldir callback return zero when it's done. > The three cases where readdir can be called are: > - ordinary readdir, for which the existing filldir_t provides > all the information needed. > - nfsv3 readdirplus, where the only additional information > needed is whether there's a mountpoint. It also wants a file handle. For which I think it just needs i_generation in addition to the information it already has. > - nfsv4 readdir, where we may need all the stat info, acls, > etc., etc. We _might_, but most of the time we won't. It might be OK to fall back to the existing double-buffer hack for the cases where we _do_ need that extra information. I think a ->lookup_fh() method (which _expects_ to be called from within filldir, and hence does its locking automatically) is the way to go. One alternative might just be a full lookup method with the same locking rules: ->lookup_locked(). That might be easier to implement, and would solve the problem even for the corner cases of NFSv4. Although I still think we'd do best to avoid populating the inode cache with _all_ children when we do a readdirplus. -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
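A rough sketch of what a JFFS2 ->lookup_fh() might look like under David's proposal: a method that nfsd only ever calls from inside the filldir callback, so the filesystem knows its readdir locks are already held and must not be retaken. Neither the method nor this function exists anywhere; the signature and the idea of returning only the inode number and generation are assumptions:

/*
 * Hypothetical inode operation, only ever invoked from within the
 * filesystem's own ->readdir(), i.e. with dir_f->sem already held.
 */
static int jffs2_lookup_fh(struct inode *dir, const char *name, int namelen,
			   u64 *ino, u32 *generation)
{
	struct jffs2_inode_info *dir_f = JFFS2_INODE_INFO(dir);
	struct jffs2_full_dirent *fd;

	/* dir_f->sem is held by jffs2_readdir(); do NOT take it again. */
	for (fd = dir_f->dents; fd; fd = fd->next) {
		if (fd->nsize == namelen && !memcmp(fd->name, name, namelen)) {
			*ino = fd->ino;
			*generation = 0;   /* JFFS2 has no i_generation */
			return 0;
		}
	}
	return -ENOENT;
}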
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-02 20:42 ` David Woodhouse @ 2008-08-02 21:33 ` J. Bruce Fields [not found] ` <20080802213337.GA2833-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> 0 siblings, 1 reply; 55+ messages in thread From: J. Bruce Fields @ 2008-08-02 21:33 UTC (permalink / raw) To: David Woodhouse Cc: chucklever, viro, Neil Brown, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Sat, Aug 02, 2008 at 09:42:32PM +0100, David Woodhouse wrote: > On Sat, 2008-08-02 at 14:26 -0400, J. Bruce Fields wrote: > > Could you add a readdirplus vfs operation which took a flag indicating > > how much extra information you're going to need? > > Actually, if we're screwing with readdir then xfs would like to know how > much it's going to be asked to read, rather than just have the filldir > callback return zero when it's done. OK. > > The three cases where readdir can be called are: > > - ordinary readdir, for which the existing filldir_t provides > > all the information needed. > > - nfsv3 readdirplus, where the only additional information > > needed is whether there's a mountpoint. > > It also wants a file handle. Oops, right. > For which I think it just needs > i_generation in addition to the information it already has. Typically, right, though the filesystem's allowed some choice about what exactly it wants to use in the filehandle. I don't know how the various filesystems are actually using that in practice. > > - nfsv4 readdir, where we may need all the stat info, acls, > > etc., etc. > > We _might_, but most of the time we won't. It might be OK to fall back > to the existing double-buffer hack for the cases where we _do_ need that > extra information. How bad is the "double-buffer hack" anyway? Rather than have this as a fallback case that's rarely used (hence rarely tested), it might be simpler just to use it for everything if we're going to use it at all. --b. > I think a ->lookup_fh() method (which _expects_ to be called from within > filldir, and hence does its locking automatically) is the way to go. > > One alternative might just be a full lookup method with the same locking > rules: ->lookup_locked(). That might be easier to implement, and would > solve the problem even for the corner cases of NFSv4. Although I still > think we'd do best to avoid populating the inode cache with _all_ > children when we do a readdirplus. > > -- > David Woodhouse Open Source Technology Centre > David.Woodhouse@intel.com Intel Corporation > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <20080802213337.GA2833-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2008-08-03 8:39 ` David Woodhouse 0 siblings, 0 replies; 55+ messages in thread From: David Woodhouse @ 2008-08-03 8:39 UTC (permalink / raw) To: J. Bruce Fields Cc: chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Neil Brown, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Sat, 2008-08-02 at 17:33 -0400, J. Bruce Fields wrote: > On Sat, Aug 02, 2008 at 09:42:32PM +0100, David Woodhouse wrote: > > On Sat, 2008-08-02 at 14:26 -0400, J. Bruce Fields wrote: > > > Could you add a readdirplus vfs operation which took a flag indicating > > > how much extra information you're going to need? > > > > Actually, if we're screwing with readdir then xfs would like to know how > > much it's going to be asked to read, rather than just have the filldir > > callback return zero when it's done. > > OK. > > > > The three cases where readdir can be called are: > > > - ordinary readdir, for which the existing filldir_t provides > > > all the information needed. > > > - nfsv3 readdirplus, where the only additional information > > > needed is whether there's a mountpoint. > > > > It also wants a file handle. > > Oops, right. > > > For which I think it just needs > > i_generation in addition to the information it already has. > > Typically, right, though the filesystem's allowed some choice about what > exactly it wants to use in the filehandle. I don't know how the various > filesystems are actually using that in practice. That's OK. If we do ->lookup_fh() and they're making their own, they can put what they like in it. > > > - nfsv4 readdir, where we may need all the stat info, acls, > > > etc., etc. > > > > We _might_, but most of the time we won't. It might be OK to fall back > > to the existing double-buffer hack for the cases where we _do_ need that > > extra information. > > How bad is the "double-buffer hack" anyway? Rather than have this as a > fallback case that's rarely used (hence rarely tested), it might be > simpler just to use it for everything if we're going to use it at all. It's certainly a good enough answer for now, which is why I've posted the patches to do exactly that. And yes, I have wondered the same myself, since realising that we'll need a full lookup for some NFSv4 clients anyway. Or maybe the full ->lookup_locked(), perhaps... -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-02 18:26 ` J. Bruce Fields [not found] ` <20080802182644.GE30454-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2008-08-03 11:56 ` Neil Brown 2008-08-03 17:15 ` Chuck Lever 1 sibling, 1 reply; 55+ messages in thread From: Neil Brown @ 2008-08-03 11:56 UTC (permalink / raw) To: J. Bruce Fields Cc: chucklever-Re5JQEeQqe8AvxtiuMwx3w, David Woodhouse, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Saturday August 2, bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org wrote: > > Though really I can't see any great objection to just moving xfs's hack > up into nfsd. It may not do everything, but it seems like an > incremental improvement. Because it is a hack, and hacks have a tendency to hide deeper problems, to never get cleaned up, and generally to become a burden to future generations. However if you do go down that path, can I suggest: 1/ get rid of the word "hack" throughout the code. If you think it is sensible, make it appear sensible. 2/ drop the "retry malloc of a smaller size" thing. In fact, you can probably use one of the set of pages that has been reserved for the request. It is very rare that a readdir request will be as big as the largest read. 3/ Make the new way unconditional. That gives it broader test coverage which can only be a good thing. And what is good for the goose is good for the gander... (not that I'm calling anyone a goose). But I still prefer the O_READDIRPLUS approach. NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-03 11:56 ` Neil Brown @ 2008-08-03 17:15 ` Chuck Lever 2008-08-04 1:03 ` Neil Brown 0 siblings, 1 reply; 55+ messages in thread From: Chuck Lever @ 2008-08-03 17:15 UTC (permalink / raw) To: Neil Brown Cc: J. Bruce Fields, David Woodhouse, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Sun, Aug 3, 2008 at 7:56 AM, Neil Brown <neilb@suse.de> wrote: > On Saturday August 2, bfields@fieldses.org wrote: >> >> Though really I can't see any great objection to just moving xfs's hack >> up into nfsd. It may not do everything, but it seems like an >> incremental improvement. > > Because it is a hack, and hacks have a tendency to hide deeper > problems, and not be ever get cleaned up and generally to become a > burden to future generations. Agreed that maintainability is an important concern. However, I don't see that what David suggests in general is hiding a deeper problem, but is rather exposing it. Can you explain what you think is the issue? > However if you do go down that path, can I suggest: > > 1/ get rid of the word "hack" throughout the code. If you think it > is sensible, make it appear sensible. Yes, if we're going to codify this method of handling readdir, let's document it properly and treat it as a first class API. > 2/ drop the "retry malloc of a smaller size" thing. > In fact, you can probably use one of the set of pages that has > been reserved for the request. It is very rare that a readdir > request will be as big as the largest read. Well, many Linux clients support reading directories only a page at a time (this limitation may have been lifted recently). But other clients often ask to read much more. Again it appears that a Linux NFS client is not going to be the real acid test here. Solaris and FreeBSD would probably be the best to try, I think. > 3/ Make the new way unconditional. That gives it broader test > coverage which can only be a good thing. And what is good for the > goose is good for the gander... (not that I'm calling anyone a > goose). As Bruce also suggested, and I agree with this (not the goose part, the unconditionality and test coverage parts). -- "Alright guard, begin the unnecessarily slow-moving dipping mechanism." --Dr. Evil ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-03 17:15 ` Chuck Lever @ 2008-08-04 1:03 ` Neil Brown [not found] ` <18582.21855.2092.903688-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2008-08-04 18:41 ` J. Bruce Fields 0 siblings, 2 replies; 55+ messages in thread From: Neil Brown @ 2008-08-04 1:03 UTC (permalink / raw) To: chucklever Cc: J. Bruce Fields, David Woodhouse, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Sunday August 3, chuck.lever@oracle.com wrote: > On Sun, Aug 3, 2008 at 7:56 AM, Neil Brown <neilb@suse.de> wrote: > > On Saturday August 2, bfields@fieldses.org wrote: > >> > >> Though really I can't see any great objection to just moving xfs's hack > >> up into nfsd. It may not do everything, but it seems like an > >> incremental improvement. > > > > Because it is a hack, and hacks have a tendency to hide deeper > > problems, and not be ever get cleaned up and generally to become a > > burden to future generations. > > Agreed that maintainability is an important concern. > > However, I don't see that what David suggests in general is hiding a > deeper problem, but is rather exposing it. Can you explain what you > think is the issue? The locking bothers me. The VFS seems to have a general policy of doing the locking itself to make life easier for filesystem implementors (or to make it harder for them to mess up, depending on your perspective). The current issue seems to suggest that the locking provided by the VFS is no longer adequate, so each filesystem needs to create something itself. That suggests to me a weakness in the model. Possibly the VFS should give up trying to be in control and let filesystems do their own locking. Possibly there are still things that the VFS can do which are universally good. I think these are questions that should be addressed. Maybe they have already been addressed and I missed the conversation (that wouldn't surprise me much). But seeing words like "hack" suggests to me that it hasn't. So I want to make sure I understand the problem properly and deeply before giving my blessing to a hack. So: what are the issues? Obviously readdir can race with create and you don't want them tripping each other up. The current VFS approach to this is i_mutex. Any place which modifies a directory or does a lookup in a directory takes i_mutex to ensure that the directory doesn't change. This is probably fairly limiting. With a tree-structured directory you only really need to lock the 'current node' of the tree. So any access would lock the top node, find which child to follow, then lock the child and unlock the parent. Rebalancing might need to be creative as you cannot lock a parent while holding a lock on the child, but that isn't insurmountable. So I could argue that holding i_mutex across a lookup or create or readdir maybe isn't ideal. It would be interesting to explore the possibility of dropping i_mutex across calls into the filesystem. By the time the filesystem is called, we really only need to be locking the names (dentries) rather than the whole directory. More on this later... Some filesystems want to restructure directories at times that are independent of any particular access. This might be defragmentation or rebalancing or whatever. Clearly there needs to be mutual exclusion between this and other accesses such as readdir and lookup. The VFS clearly cannot help with this. It doesn't know when rebalancing might happen or at what sort of granularity. So the filesystem must do its own locking.
It strikes me that this sort of locking would best be done by spinlocks at the block level rather than a per-inode mutex. However I haven't actually implemented this (yet) so I cannot be certain. This is what is causing the current problem (for JFFS2 at least). JFFS2 has a per-inode lock which protects against internally visible changes to the inode. Further (and this is key) this lock is held across the filldir callback. i_mutex is also held across the filldir callback, but there is an important difference. It is taken by the VFS, not the filesystem, and it is guaranteed always to be held across the filldir callback. So the filldir callback can call e.g. lookup without further locking. Backing up a little: given that the filesystem implementor chose to use per-inode locking to protect internal restructuring (which is certainly an easier option), why not just use i_mutex? The reason is that a create operation might trigger system-wide garbage collection which could trigger restructuring of the current inode, which would lead to an A-A deadlock (as the create is waiting for the garbage collection, and so still holding i_mutex). So, given that background, it is possible to see some further possible solutions (other than the ones already mentioned). - drop the internal lock across filldir. It could be seen as impolite to hold any locks across a callback that are not documented as being held. This would put an extra burden on the filesystem, but it shouldn't be a particularly big burden. A filesystem needs to be able to find the 'next' entry from a sequential 'seek' pointer, so that is the most state that needs to be preserved. It might be convenient to be able to keep more state (pointers into data structures etc). All this can be achieved with fairly standard practice: a/ have a 'version' number per inode which is updated when any internal restructure happens. b/ when calling filldir, record the version number, drop the lock, call filldir, reclaim the lock, check the version number. c/ if the version number matches (as it mostly will) just keep going. If it doesn't, jump back to the top of readdir where we decode the current seek address. Some filesystems have problems implementing seekdir/telldir so they might not be totally happy here. I have little sympathy for such filesystems and feel the burden should be on them to make it work. - use i_mutex to protect internal changes too, and drop i_mutex while doing internal restructuring. This would need some VFS changes so that dropping i_mutex would be safe. It would require some way to lock an individual dentry. Probably you would lock it under i_mutex by setting a flag bit (waiting on some inode-wide waitqueue if it is already set), and drop the lock by clearing the flag and waking the waitqueue. And you are never allowed to look at ->d_inode if the lock flag is set. Of these I really like the second. Refining the i_mutex down to a per-name lock before calling into the filesystem seems like a really good idea and should be good for scalability and large directories. It isn't something to be done lightly though. The filesystem would still be given i_mutex held, but would be allowed to drop it if it wanted to. We could have an FS_DROPS_I_MUTEX similar to the current FS_RENAME_DOES_D_MOVE. For the first, I really like the idea that a lock should not be held across calls to filldir. I feel that a filesystem doing that is "wrong" in much the same way that some think that recursing into the filesystem as nfsd does is "wrong". 
In reality neither is "wrong"; we just need to work together and negotiate and work out the best way to meet all of our needs. So what should we do now? I think that for JFFS2 to drop and reclaim f->sem in readdir would be all of 20 lines of code, including updating a ->version counter elsewhere in the code. Replicating that in all the filesystems that need it would probably be more code than the nfsd change though. On the other hand, if we implement the nfsd change, it will almost certainly never go away, even if all filesystems eventually find that they don't need it any more because someone improves the locking rules in the VFS. Whereas the code in the filesystems could quite possibly go away when they are revised to make better use of the locking regime. So I don't think that is an ideal situation either. If I had time, I would work on modifying the VFS to allow filesystems to drop i_mutex. However I don't have time at the moment, so I'll leave the decision to be made by someone else (Hi Bruce! I'll support whatever you decide). But I think it is important to understand what is really going on and not just implement a hack that works around the problem. I think I do understand now, so I am a lot happier. Hopefully you do too. NeilBrown ^ permalink raw reply [flat|nested] 55+ messages in thread
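As an illustration of the first option Neil describes above (drop the filesystem's internal lock around the filldir callback and detect concurrent restructuring with a per-inode version counter), a minimal kernel-style sketch might look like the following. This is not taken from JFFS2 or any other filesystem; the foo_* names, the ->version field and the helper functions are all hypothetical, and the seek-cookie handling is simplified.

static int foo_readdir(struct file *filp, void *dirent, filldir_t filldir)
{
	struct foo_inode_info *fi = FOO_I(filp->f_path.dentry->d_inode);
	struct foo_dirent *de;
	unsigned long version;

	mutex_lock(&fi->lock);
restart:
	/* Re-derive the 'next' entry from the seek cookie alone; any
	 * pointers held across a restructure may be stale. */
	de = foo_entry_at(fi, filp->f_pos);
	while (de) {
		/* Copy out what filldir needs while the lock is held,
		 * because 'de' may be freed once we drop it. */
		char name[FOO_MAX_NAMELEN];
		int namlen = de->namlen;
		u64 ino = de->ino;
		unsigned int d_type = de->d_type;
		loff_t pos = de->pos;

		memcpy(name, de->name, namlen);
		version = fi->version;		/* a/ snapshot the counter */

		mutex_unlock(&fi->lock);	/* b/ filldir may now recurse */
		if (filldir(dirent, name, namlen, pos, ino, d_type)) {
			filp->f_pos = pos;	/* entry not consumed; retried later */
			return 0;
		}
		mutex_lock(&fi->lock);

		filp->f_pos = pos + 1;		/* next cookie (simplified) */
		if (version != fi->version)	/* c/ restructured underneath us */
			goto restart;
		de = foo_next_entry(fi, de);
	}
	mutex_unlock(&fi->lock);
	return 0;
}

In the common case this costs one extra unlock/lock pair per entry; a restructure (which would bump fi->version under fi->lock) simply forces the loop to re-seek from the current f_pos.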
[parent not found: <18582.21855.2092.903688-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>]
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <18582.21855.2092.903688-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2008-08-04 6:19 ` Chuck Lever 2008-08-05 8:51 ` Dave Chinner 2008-08-17 18:22 ` [RFC] Reinstate NFS exportability for JFFS2 Andreas Dilger 1 sibling, 1 reply; 55+ messages in thread From: Chuck Lever @ 2008-08-04 6:19 UTC (permalink / raw) To: Neil Brown Cc: J. Bruce Fields, David Woodhouse, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Sun, Aug 3, 2008 at 9:03 PM, Neil Brown <neilb-l3A5Bk7waGM@public.gmane.org> wrote: > On Sunday August 3, chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org wrote: >> On Sun, Aug 3, 2008 at 7:56 AM, Neil Brown <neilb-l3A5Bk7waGM@public.gmane.org> wrote: >> > On Saturday August 2, bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org wrote: >> >> >> >> Though really I can't see any great objection to just moving xfs's hack >> >> up into nfsd. It may not do everything, but it seems like an >> >> incremental improvement. >> > >> > Because it is a hack, and hacks have a tendency to hide deeper >> > problems, and not be ever get cleaned up and generally to become a >> > burden to future generations. >> >> Agreed that maintainability is an important concern. >> >> However, I don't see that what David suggests in general is hiding a >> deeper problem, but is rather exposing it. Can you explain what you >> think is the issue? > > The locking bothers me. > The VFS seems to have a general policy of doing the locking itself to > make life easier for filesystem implementors (or to make it harder for > them to mess up, depending on your perspective). > > The current issue seems to suggest that the locking provided by the > VFS is no longer adequate, so each filesystem is needing to create > something itself. That suggests to me a weakness in the model. I often find the VFS locking schemes to be inflexible. However, in this case, I think that the VFS locking is adequate; NFSD is doing something that the VFS (and hence the individual file systems themselves) wasn't designed for. Adding extra export APIs (like lookup_locked or lookup_fh) seems like an appropriate way to address the problem. > Possibly the VFS should give up trying to be in control and let > filesystems do their own locking. Possibly there are still things > that the VFS can do which are universally good. I think these are > questions that should be addressed. I agree it's worth thinking carefully about this. Thinking out loud (as you have done) is helpful. > Maybe they have already been addressed and I missed the conversation > (that wouldn't surprise me much). But seeing words like "hack" > suggests to me that it hasn't. So I want to make sure I understand > the problem properly and deeply before giving my blessing to a hack. > > So: what are the issues? > > Obviously readdir can race with create and you don't want them > tripping each other up. The current VFS approach to this is i_mutex. > Any place which modifies a directory or does a lookup in a directory > takes i_mutex to ensure that the directory doesn't change. > > This is probably fairly limiting. With a tree-structured directory > you only really need to lock the 'current node' of the tree. > So any access would lock the top node, find which child to follow, > then lock the child and unlock the parent. 
Rebalancing might need to > be creative as you cannot lock a parent while holding a lock on the > child, but that isn't insurmountable. > So I could argue that holding i_mutex across a lookup or create or > readdir maybe isn't ideal. It would be interesting to explore the > possibility of dropping i_mutex across calls into the filesystem. > By the time the filesystem is called, we really only need to be > locking the names (dentries) rather than the whole directory. > More on this later... > > Some filesystems want to restructure directories and times that are > independent of any particular access. This might be defragmentation > or rebalancing or whatever. Clearly there needs to be mutual > exclusion between this and other accesses such as readdir and lookup. > The VFS clearly cannot help with this. It doesn't know when > rebalancing might happen or are what sort of granularity. So the > filesystem must do it's own locking. > It strikes me that this sort of locking would best be done by > spinlocks at the block level rather than a per-inode mutex. However > I haven't actually implemented this (yet) so I cannot be certain. > > This is what is causing the current problem (for JFFS2 at least). > JFF2 has a per-inode lock which protects against internally visible > changes to the inode. Further (and this is key) this lock is held > across the filldir callback. > i_mutex is also held across the filldir callback, but there is an > important difference. It is taken by the VFS, not the filesystem, > and it is guaranteed always to be held across the filldir callback. > So the filldir callback can call e.g. lookup without further locking. > > Backing up a little: given that the filesystem implementor chose to > use per-inode locking to protect internal restructuring (which is > certainly an easier option), why not just use i_mutex? The reason > is that a create operation might trigger system-wide garbage > collection which could trigger restructuring of the current inode, > which would lead to an A-A deadlock (as the create is waiting for the > garbage collection, and so still holding i_mutex). > > So, given that background, it is possible to see some more possible > solutions (other than the ones already mentioned). > > - drop the internal lock across filldir. > It could be seen a impolite to hold any locks across a callback > that are not documented as being held. > This would put an extra burden on the filesystem, but it shouldn't > be a particularly big burden. > A filesystem needs to be able to find the 'next' entry from a > sequential 'seek' pointer so that is the most state that needs to > be preserved. It might be convenient to be able to keep more state > (pointers into data structures etc). All this can be achieved with > fairly standard practice: > a/ have a 'version' number per inode which is updated when any > internal restructure happens. > b/ when calling filldir, record the version number, drop the lock > call filldir, reclaim the lock, check the version number > c/ if the version number matches (as it mostly will) just keep > going. If it doesn't jump back to the top of readdir where > we decode the current seek address. > > Some filesystems have problems implementing seekdir/telldir so they > might not be totally happy here. I have little sympathy for such > filesystems and feel the burden should be on them to make it work. 
If I understand your suggestion correctly, I can see cases where it would be nearly impossible for NFSD to make forward progress assembling a full readdir result to post to a client. This seems like the same problem as retrying a pathname resolution until we no longer get an ESTALE. If an application makes continuous changes to a directory (such as, say, a very busy MTA) it would be impossible for NFSD to finish a READDIR. As a sidebar, I do agree that locking out other accessors when handling a very large directory can be a real problem. I've seen this on the client side with certain workloads. The usual answer in this case is to ask the application developer to use a hierarchical directory structure. :-/ But we are still stuck with seekdir/telldir. > - use i_mutex to protect internal changes too, and drop i_mutex while > doing internal restructuring. This would need some VFS changes so > that dropping i_mutex would be safe. It would require some way to > lock an individual dentry. Probably you would lock it under > i_mutex by setting a flag bit, wait for the flag on some inode-wide > waitqueue, and drop the lock by clearing the flag and waking the > waitqueue. And you are never allowed to look at ->d_inode if the > lock flag is set. > > Of these I really like the second. Refining the i_mutex down to a > per-name lock before calling in to the filesystem seems like a really > good idea and should be good for scalability and large directories. > It isn't something to be done lightly though. The filesystem would > still be given i_mutex held, but would be allowed to drop it if it > wanted to. We could have an FS_DROPS_I_MUTEX similar to the current > FS_RENAME_DOES_D_MOVE. > > For the first, I really like the idea that a lock should not be held > across calls the filldir. I feel that a filesystem doing that is > "wrong" in much the same way that some think that recursing into the > filesystem as nfsd does is "wrong". In reality neither is "wrong", we > just need to work together and negotiate and work out the best way to > meet all of our needs. > > So what should we do now? I think that for JFFS2 to drop and reclaim > f->sem in readdir would be all of 20 lines of code, including updating > a ->version counter elsewhere in the code. Replicating that in all > the filesystems that need it would probably be more code than the nfsd > change though. That doesn't fix XFS or the clustered file systems, and I'm still concerned about starving other accessors (see above). > On the other hand, if we implement the nfsd change, it will almost > certainly never go away, even if all filesystems eventually find that > they don't need it any more because someone improves the locking > rules in the VFS. I understand this sentiment completely, but I'm not as pessimistic about this. > Where as the code in the filesystems could quite > possibly go away when they are revised to make better use of the > locking regime. So I don't think that is an ideal situation either. Making such changes to the VFS could take a long time. I know that distributions are already getting phone calls (twitters?) about this problem. > If I had time, I would work on modifying the VFS to allow filesystems > to drop i_mutex. However I don't have time at the moment, so I'll > leave the decision to be made by someone else (Hi Bruce! I'll > support whatever you decide). > > But I think it is important to understand what is really going on and > not just implement a hack that works around the problem. 
I think I do > understand now, so I am a lot happier. Hopefully you do too. Yes, thanks. I will need to re-read your explanation a few more times to digest it further. So, the JFFS2 locking problem is a garbage-collection issue. I'm not sure this is the case with other file systems like XFS and OCFS2. My impression was that XFS had a transaction logging deadlock, and that OCFS2 (and possibly GFS2) would have issues with cross-node locking in these cases (yes, it's 2am here and I should be in bed instead of answering e-mail). They are similar in that these are all internal restructuring issues, but potentially different enough that each will need some careful analysis -- cross-node locking has failure modes that may be a challenge to deal with. Not to mention the other NFSv4 issues David dug up that have nothing to do with readdir. -- Chuck Lever -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-04 6:19 ` Chuck Lever @ 2008-08-05 8:51 ` Dave Chinner 2008-08-05 8:59 ` David Woodhouse 2008-08-05 23:06 ` Neil Brown 0 siblings, 2 replies; 55+ messages in thread From: Dave Chinner @ 2008-08-05 8:51 UTC (permalink / raw) To: chucklever Cc: Neil Brown, J. Bruce Fields, David Woodhouse, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Mon, Aug 04, 2008 at 02:19:12AM -0400, Chuck Lever wrote: > So, the JFFS2 locking problem is a garbage-collection issue. I'm not > sure this is the case with other file systems like XFS and OCFS2. My > impression was that XFS had a transaction logging deadlock, Just to clarify - XFS has a directory buffer lock deadlock. That is, while reading the contents of the directory buffer it is locked to prevent modifications from occurring while extracting the contents. Looking up an entry in the directory also requires the directory buffer lock (for the same reason), so calling the lookup while already holding the directory buffer lock (i.e from the filldir callback) will deadlock. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 55+ messages in thread
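For readers who do not know the XFS directory code, the shape of the deadlock Dave describes can be sketched generically. This is not XFS code: the single foo_dir_lock mutex stands in for the per-buffer directory locking, and all foo_* names are hypothetical.

static DEFINE_MUTEX(foo_dir_lock);	/* stand-in for the directory buffer lock */

static struct dentry *foo_lookup(struct inode *dir, struct dentry *dentry,
				 struct nameidata *nd)
{
	mutex_lock(&foo_dir_lock);	/* second acquisition: sleeps forever */
	/* ... search the directory block for dentry->d_name ... */
	mutex_unlock(&foo_dir_lock);
	return NULL;
}

static int foo_readdir(struct file *filp, void *dirent, filldir_t filldir)
{
	mutex_lock(&foo_dir_lock);	/* first acquisition */
	/*
	 * For each entry we call filldir().  When nfsd is the caller, its
	 * callback encodes a file handle for the entry, which means
	 * lookup_one_len() and eventually foo_lookup(), which blocks on the
	 * foo_dir_lock already held above, so the readdir never returns.
	 */
	mutex_unlock(&foo_dir_lock);
	return 0;
}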
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-05 8:51 ` Dave Chinner @ 2008-08-05 8:59 ` David Woodhouse 2008-08-05 9:47 ` Dave Chinner 2008-08-05 23:06 ` Neil Brown 1 sibling, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-05 8:59 UTC (permalink / raw) To: Dave Chinner Cc: chucklever, Neil Brown, J. Bruce Fields, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Tue, 2008-08-05 at 18:51 +1000, Dave Chinner wrote: > On Mon, Aug 04, 2008 at 02:19:12AM -0400, Chuck Lever wrote: > > So, the JFFS2 locking problem is a garbage-collection issue. I'm not > > sure this is the case with other file systems like XFS and OCFS2. My > > impression was that XFS had a transaction logging deadlock, > > Just to clarify - XFS has a directory buffer lock deadlock. That is, > while reading the contents of the directory buffer it is locked to > prevent modifications from occurring while extracting the contents. > Looking up an entry in the directory also requires the directory > buffer lock (for the same reason), so calling the lookup while > already holding the directory buffer lock (i.e from the filldir > callback) will deadlock. But if we had a ->lookup_locked() or ->lookup_fh() method which is _guaranteed_ to be called from within your ->readdir(), you could manage to bypass that locking and avoid the deadlock? -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply [flat|nested] 55+ messages in thread
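No such method existed at the time, and nothing of this exact shape was merged later; purely as an illustration of the interface David is floating here, it might be declared along these lines. Every name below, and the choice of which operations table it would live in, is an assumption.

	/*
	 * Hypothetical: a lookup variant that nfsd would only ever call from
	 * inside the same filesystem's ->readdir() (i.e. from the filldir
	 * callback), so the filesystem knows its internal directory locks
	 * are already held and must not be taken again.
	 */
	struct dentry *(*lookup_locked)(struct inode *dir,
					struct dentry *dentry,
					struct nameidata *nd);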
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-05 8:59 ` David Woodhouse @ 2008-08-05 9:47 ` Dave Chinner 0 siblings, 0 replies; 55+ messages in thread From: Dave Chinner @ 2008-08-05 9:47 UTC (permalink / raw) To: David Woodhouse Cc: chucklever, Neil Brown, J. Bruce Fields, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Tue, Aug 05, 2008 at 09:59:48AM +0100, David Woodhouse wrote: > On Tue, 2008-08-05 at 18:51 +1000, Dave Chinner wrote: > > On Mon, Aug 04, 2008 at 02:19:12AM -0400, Chuck Lever wrote: > > > So, the JFFS2 locking problem is a garbage-collection issue. I'm not > > > sure this is the case with other file systems like XFS and OCFS2. My > > > impression was that XFS had a transaction logging deadlock, > > > > Just to clarify - XFS has a directory buffer lock deadlock. That is, > > while reading the contents of the directory buffer it is locked to > > prevent modifications from occurring while extracting the contents. > > Looking up an entry in the directory also requires the directory > > buffer lock (for the same reason), so calling the lookup while > > already holding the directory buffer lock (i.e from the filldir > > callback) will deadlock. > > But if we had a ->lookup_locked() or ->lookup_fh() method which is > _guaranteed_ to be called from within your ->readdir(), you could manage > to bypass that locking and avoid the deadlock? I *think* it's possible. It's definitely in the realm of difficult because the buffer locks are a couple of layers removed from the directory code and both readdir and lookup assume exclusive access to the directory block. We'd probably have to introduce a directory buffer (dabuf) cache layer with it's own locking to allow multiple accessors to the buffer at the same time. Christoph might have some other ideas for this, but I think it's going to take significant surgery to implement a 'recursive lockless lookup' like this. The main problem you are going to have is finding someone who has the time to do the XFS work. If you only implement this lookup interface, then I'd say it's going to be some time before XFS would actually use it.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-05 8:51 ` Dave Chinner 2008-08-05 8:59 ` David Woodhouse @ 2008-08-05 23:06 ` Neil Brown 2008-08-06 0:08 ` Dave Chinner 1 sibling, 1 reply; 55+ messages in thread From: Neil Brown @ 2008-08-05 23:06 UTC (permalink / raw) To: Dave Chinner Cc: chucklever-Re5JQEeQqe8AvxtiuMwx3w, J. Bruce Fields, David Woodhouse, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Tuesday August 5, david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org wrote: > On Mon, Aug 04, 2008 at 02:19:12AM -0400, Chuck Lever wrote: > > So, the JFFS2 locking problem is a garbage-collection issue. I'm not > > sure this is the case with other file systems like XFS and OCFS2. My > > impression was that XFS had a transaction logging deadlock, > > Just to clarify - XFS has a directory buffer lock deadlock. That is, > while reading the contents of the directory buffer it is locked to > prevent modifications from occurring while extracting the contents. > Looking up an entry in the directory also requires the directory > buffer lock (for the same reason), so calling the lookup while > already holding the directory buffer lock (i.e from the filldir > callback) will deadlock. How much cost would it be, do you think, to drop the lock across the call to filldir? Then reclaim the lock, validate pointers etc against a 'version' counter, and restart based on the current telldir cookie if needed? To me, that is the generic solution to allowing filldir to call ->lookup. I'm just not sure what it costs to be constantly dropping and reclaiming the lock in the normal case where ->lookup isn't being called. NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-05 23:06 ` Neil Brown @ 2008-08-06 0:08 ` Dave Chinner 2008-08-06 19:56 ` J. Bruce Fields 0 siblings, 1 reply; 55+ messages in thread From: Dave Chinner @ 2008-08-06 0:08 UTC (permalink / raw) To: Neil Brown Cc: chucklever, J. Bruce Fields, David Woodhouse, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Wed, Aug 06, 2008 at 09:06:48AM +1000, Neil Brown wrote: > On Tuesday August 5, david@fromorbit.com wrote: > > On Mon, Aug 04, 2008 at 02:19:12AM -0400, Chuck Lever wrote: > > > So, the JFFS2 locking problem is a garbage-collection issue. I'm not > > > sure this is the case with other file systems like XFS and OCFS2. My > > > impression was that XFS had a transaction logging deadlock, > > > > Just to clarify - XFS has a directory buffer lock deadlock. That is, > > while reading the contents of the directory buffer it is locked to > > prevent modifications from occurring while extracting the contents. > > Looking up an entry in the directory also requires the directory > > buffer lock (for the same reason), so calling the lookup while > > already holding the directory buffer lock (i.e from the filldir > > callback) will deadlock. > > How much cost would it be, do you think, to drop the lock across the > call to filldir? Then reclaim the lock, validate pointers etc against > a 'version' counter, and restart based on the current telldir cookie > if needed? The problem is that the dabuf is a temporary structure only valid for the length of a block read or transaction - it is built from buffers that are cached and provide persistence. Remember, XFS supports large, non-contiguous directory blocks, and so the directory code is extremely complex in places. To do the above we need to pretty much tear down the dabuf to unlock everything before the filldir call, then build it in the lookup during the filldir call, then tear it down for the readdir to build it again, validate, etc.... Basically, what you suggest above still needs the same infrastructure as a proper shared locking scheme on the dabuf to work efficiently. Using a shared locking scheme gives much more benefit, because it will allow parallel directory traversals and lookups in *all cases*, not just NFS. Basically, I don't want to replace an _easily validated_ hack with some other nasty, non-trivial, disaster-waiting-to-happen hack that doesn't provide any benefit over the current hack.... > To me, that is the generic solution to allowing filldir to call > ->lookup. I'm just not sure what it costs to be constantly dropping > and reclaiming the lock in the normal case where ->lookup isn't being > called. Allowing filldir to call lookup requires shared read lock semantics between readdir and lookup. I don't think any filesystem has that implemented; it can't be implemented with i_mutex involved, and it will be non-trivial to implement in the filesystems that need it. Normally the generic solution is the lowest common denominator solution - move the double buffering into the NFSD so everything works with the current exclusive locking semantics, and then provide another filldir+lookup for filesystems that are able to do something special to avoid the slower generic path. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-06 0:08 ` Dave Chinner @ 2008-08-06 19:56 ` J. Bruce Fields 2008-08-06 20:10 ` David Woodhouse 0 siblings, 1 reply; 55+ messages in thread From: J. Bruce Fields @ 2008-08-06 19:56 UTC (permalink / raw) To: Neil Brown, chucklever, David Woodhouse, viro, Christoph Hellwig, linux-fsdevel So the only solution that seems to really be holding up (at least for the near term) is the double-buffering "hack"; could we get a version of that with Neil's review comments addressed? --b. ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-06 19:56 ` J. Bruce Fields @ 2008-08-06 20:10 ` David Woodhouse [not found] ` <1218053443.5111.148.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-06 20:10 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Brown, chucklever, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Wed, 2008-08-06 at 15:56 -0400, J. Bruce Fields wrote: > So the only solution that seems to really be holding up (at least for > the near term) is the double-buffering "hack"; could we get a version > of that with Neil's review comments addressed? git.infradead.org/users/dwmw2/nfsexport-2.6.git I inverted the logic so that the file system has to set a FS_LOOKUP_IN_READDIR flag to avoid the hack, rather than setting a FS_NO_LOOKUP_IN_READDIR flag to _use_ it. Renamed it so that it doesn't say 'hack', took out the loop to allocate memory. Tasks left for someone more familiar with the NFS code are: - making it use one of the pages already allocated for the response. - giving it a hint about how many dirents to read into the buffer. -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply [flat|nested] 55+ messages in thread
[parent not found: <1218053443.5111.148.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>]
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <1218053443.5111.148.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> @ 2008-08-09 16:47 ` David Woodhouse 2008-08-09 19:55 ` David Woodhouse 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-09 16:47 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Wed, 2008-08-06 at 21:10 +0100, David Woodhouse wrote: > On Wed, 2008-08-06 at 15:56 -0400, J. Bruce Fields wrote: > > So the only solution that seems to really be holding up (at least for > > the near term) is the double-buffering "hack"; could we get a version > > of that with Neil's review comments addressed? > > git.infradead.org/users/dwmw2/nfsexport-2.6.git > > I inverted the logic so that the file system has to set a > FS_LOOKUP_IN_READDIR flag to avoid the hack, rather than setting a > FS_NO_LOOKUP_IN_READDIR flag to _use_ it. Renamed it so that it doesn't > say 'hack', took out the loop to allocate memory. > > Tasks left for someone more familiar with the NFS code are: > - making it use one of the pages already allocated for the response. > - giving it a hint about how many dirents to read into the buffer. Here it is in a single patch... commit b4bf782886f5f7711c8ce2c0f3778bf52bda1d56 Author: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Date: Thu Jul 31 20:39:25 2008 +0100 [JFFS2] Reinstate NFS exportability Now that the readdir/lookup deadlock issues have been dealt with, we can export JFFS2 file systems again. (For now, you have to specify fsid manually; we should add a method to the export_ops to handle that too.) Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> commit 681b29d988a0dbf8d086958cbeae26fea710baf1 Author: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Date: Wed Aug 6 15:21:59 2008 +0100 Mark isofs as being able to handle lookup() within readdir() Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> commit 14b5cf410c17afd7b2589b68ca3ca23352789d2f Author: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Date: Wed Aug 6 15:21:39 2008 +0100 Mark msdos/vfat as being able to handle lookup() within readdir() Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> commit 20a7a67be0ac1f788e77c3e333f241e3395e5dc7 Author: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Date: Wed Aug 6 15:18:37 2008 +0100 Mark ext[234] as able to handle lookup() within readdir() Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> commit bade5f353e2cfd31fcec038b41e7cb68bb3321f1 Author: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Date: Thu Jul 31 20:38:04 2008 +0100 Remove XFS buffered readdir hack Now that we've moved the readdir hack to the nfsd code, we can remove the local version from the XFS code. 
Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> commit b8179eee3e72039c8acc9b8efe807031bba3b931 Author: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Date: Thu Jul 31 20:29:12 2008 +0100 Copy XFS readdir hack into nfsd code, introduce FS_LOOKUP_IN_READDIR flag Some file systems with their own internal locking have problems with the way that nfsd calls the ->lookup() method from within a filldir function called from their ->readdir() method. The recursion back into the file system code can cause deadlock. XFS has a fairly hackish solution to this which involves doing the readdir() into a locally-allocated buffer, then going back through it calling the filldir function afterwards. It's not ideal, but it works. It's particularly suboptimal because XFS does this for local file systems too, where it's completely unnecessary. Copy this hack into the NFS code where it can be used only for NFS export. Allow file systems which are _not_ prone to this deadlock to avoid the buffered version by setting FS_LOOKUP_IN_READDIR in their fs_type flags to indicate that calling lookup() from readdir() is OK. Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> commit f5d45f637d4907cd6a55bda9465eec997351e9fc Author: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Date: Thu Jul 31 17:16:51 2008 +0100 Factor out nfsd_do_readdir() into its own function Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> fs/ext2/super.c | 2 +- fs/ext3/super.c | 2 +- fs/ext4/super.c | 2 +- fs/isofs/inode.c | 2 +- fs/jffs2/super.c | 60 +++++++++++++++++++ fs/msdos/namei.c | 2 +- fs/nfsd/vfs.c | 136 ++++++++++++++++++++++++++++++++++++++----- fs/vfat/namei.c | 2 +- fs/xfs/linux-2.6/xfs_file.c | 120 -------------------------------------- include/linux/fs.h | 2 + 10 files changed, 189 insertions(+), 141 deletions(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index fd88c7b..d8ef9ea 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -1401,7 +1401,7 @@ static struct file_system_type ext2_fs_type = { .name = "ext2", .get_sb = ext2_get_sb, .kill_sb = kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_LOOKUP_IN_READDIR, }; static int __init init_ext2_fs(void) diff --git a/fs/ext3/super.c b/fs/ext3/super.c index f38a5af..790a40e 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -2938,7 +2938,7 @@ static struct file_system_type ext3_fs_type = { .name = "ext3", .get_sb = ext3_get_sb, .kill_sb = kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_LOOKUP_IN_READDIR, }; static int __init init_ext3_fs(void) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index d5d7795..2911f43 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -3497,7 +3497,7 @@ static struct file_system_type ext4dev_fs_type = { .name = "ext4dev", .get_sb = ext4_get_sb, .kill_sb = kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_LOOKUP_IN_READDIR, }; static int __init init_ext4_fs(void) diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c index 26948a6..f81b920 100644 --- a/fs/isofs/inode.c +++ b/fs/isofs/inode.c @@ -1460,7 +1460,7 @@ static struct file_system_type iso9660_fs_type = { .name = "iso9660", .get_sb = isofs_get_sb, .kill_sb = kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_LOOKUP_IN_READDIR, }; static int __init init_iso9660_fs(void) diff --git a/fs/jffs2/super.c 
b/fs/jffs2/super.c index efd4012..41ed563 100644 --- a/fs/jffs2/super.c +++ b/fs/jffs2/super.c @@ -22,6 +22,7 @@ #include <linux/mtd/super.h> #include <linux/ctype.h> #include <linux/namei.h> +#include <linux/exportfs.h> #include "compr.h" #include "nodelist.h" @@ -62,6 +63,64 @@ static int jffs2_sync_fs(struct super_block *sb, int wait) return 0; } +static struct inode *jffs2_nfs_get_inode(struct super_block *sb, uint64_t ino, + uint32_t generation) +{ + return jffs2_iget(sb, ino); +} + +static struct dentry *jffs2_fh_to_dentry(struct super_block *sb, struct fid *fid, + int fh_len, int fh_type) +{ + return generic_fh_to_dentry(sb, fid, fh_len, fh_type, + jffs2_nfs_get_inode); +} + +static struct dentry *jffs2_fh_to_parent(struct super_block *sb, struct fid *fid, + int fh_len, int fh_type) +{ + return generic_fh_to_parent(sb, fid, fh_len, fh_type, + jffs2_nfs_get_inode); +} + +static struct dentry *jffs2_get_parent(struct dentry *child) +{ + struct jffs2_inode_info *f; + struct dentry *parent; + struct inode *inode; + uint32_t pino; + + if (!S_ISDIR(child->d_inode->i_mode)) + return ERR_PTR(-EACCES); + + f = JFFS2_INODE_INFO(child->d_inode); + + pino = f->inocache->pino_nlink; + + JFFS2_DEBUG("Parent of directory ino #%u is #%u\n", + f->inocache->ino, pino); + + inode = jffs2_iget(child->d_inode->i_sb, pino); + if (IS_ERR(inode)) { + JFFS2_ERROR("Failed to get inode #%u: %ld\n", pino, + PTR_ERR(inode)); + return ERR_CAST(inode); + } + + parent = d_alloc_anon(inode); + if (!parent) { + iput(inode); + parent = ERR_PTR(-ENOMEM); + } + return parent; +} + +static struct export_operations jffs2_export_ops = { + .get_parent = jffs2_get_parent, + .fh_to_dentry = jffs2_fh_to_dentry, + .fh_to_parent = jffs2_fh_to_parent, +}; + static const struct super_operations jffs2_super_operations = { .alloc_inode = jffs2_alloc_inode, @@ -104,6 +163,7 @@ static int jffs2_fill_super(struct super_block *sb, void *data, int silent) spin_lock_init(&c->inocache_lock); sb->s_op = &jffs2_super_operations; + sb->s_export_op = &jffs2_export_ops; sb->s_flags = sb->s_flags | MS_NOATIME; sb->s_xattr = jffs2_xattr_handlers; #ifdef CONFIG_JFFS2_FS_POSIX_ACL diff --git a/fs/msdos/namei.c b/fs/msdos/namei.c index e844b98..feebc28 100644 --- a/fs/msdos/namei.c +++ b/fs/msdos/namei.c @@ -681,7 +681,7 @@ static struct file_system_type msdos_fs_type = { .name = "msdos", .get_sb = msdos_get_sb, .kill_sb = kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_LOOKUP_IN_READDIR, }; static int __init init_msdos_fs(void) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 18060be..ccff326 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1814,6 +1814,123 @@ out: return err; } +struct buffered_dirent { + u64 ino; + loff_t offset; + int namlen; + unsigned int d_type; + char name[]; +}; + +struct readdir_data { + char *dirent; + size_t used; +}; + +static int nfsd_buffered_filldir(void *__buf, const char *name, int namlen, + loff_t offset, u64 ino, unsigned int d_type) +{ + struct readdir_data *buf = __buf; + struct buffered_dirent *de = (void *)(buf->dirent + buf->used); + unsigned int reclen; + + reclen = ALIGN(sizeof(struct buffered_dirent) + namlen, sizeof(u64)); + if (buf->used + reclen > PAGE_SIZE) + return -EINVAL; + + de->namlen = namlen; + de->offset = offset; + de->ino = ino; + de->d_type = d_type; + memcpy(de->name, name, namlen); + buf->used += reclen; + + return 0; +} + +static int nfsd_buffered_readdir(struct file *file, filldir_t func, + struct readdir_cd *cdp, loff_t *offsetp) +{ + struct 
readdir_data buf; + struct buffered_dirent *de; + int host_err; + int size; + loff_t offset; + + buf.dirent = (void *)__get_free_page(GFP_KERNEL); + if (!buf.dirent) + return -ENOMEM; + + offset = *offsetp; + cdp->err = nfserr_eof; /* will be cleared on successful read */ + + while (1) { + unsigned int reclen; + + buf.used = 0; + + host_err = vfs_readdir(file, nfsd_buffered_filldir, &buf); + if (host_err) + break; + + size = buf.used; + + if (!size) + break; + + + de = (struct buffered_dirent *)buf.dirent; + while (size > 0) { + offset = de->offset; + + if (func(cdp, de->name, de->namlen, de->offset, + de->ino, de->d_type)) + goto done; + + if (cdp->err != nfs_ok) + goto done; + + reclen = ALIGN(sizeof(*de) + de->namlen, + sizeof(u64)); + size -= reclen; + de = (struct buffered_dirent *)((char *)de + reclen); + } + offset = vfs_llseek(file, 0, 1); + } + + done: + free_page((unsigned long)(buf.dirent)); + + if (host_err) + return nfserrno(host_err); + + *offsetp = offset; + return cdp->err; +} + +static int nfsd_do_readdir(struct file *file, filldir_t func, + struct readdir_cd *cdp, loff_t *offsetp) +{ + int host_err; + + /* + * Read the directory entries. This silly loop is necessary because + * readdir() is not guaranteed to fill up the entire buffer, but + * may choose to do less. + */ + do { + cdp->err = nfserr_eof; /* will be cleared on successful read */ + host_err = vfs_readdir(file, func, cdp); + } while (host_err >=0 && cdp->err == nfs_ok); + + *offsetp = vfs_llseek(file, 0, 1); + + if (host_err) + return nfserrno(host_err); + else + return cdp->err; +} + /* * Read entries from a directory. * The NFSv3/4 verifier we ignore for now. @@ -1823,7 +1940,6 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, struct readdir_cd *cdp, filldir_t func) { __be32 err; - int host_err; struct file *file; loff_t offset = *offsetp; @@ -1837,21 +1953,11 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, goto out_close; } - /* - * Read the directory entries. This silly loop is necessary because - * readdir() is not guaranteed to fill up the entire buffer, but - * may choose to do less. - */ - - do { - cdp->err = nfserr_eof; /* will be cleared on successful read */ - host_err = vfs_readdir(file, func, cdp); - } while (host_err >=0 && cdp->err == nfs_ok); - if (host_err) - err = nfserrno(host_err); + if ((file->f_path.dentry->d_inode->i_sb->s_type->fs_flags & + FS_LOOKUP_IN_READDIR)) + err = nfsd_do_readdir(file, func, cdp, offsetp); else - err = cdp->err; - *offsetp = vfs_llseek(file, 0, 1); + err = nfsd_buffered_readdir(file, func, cdp, offsetp); if (err == nfserr_eof || err == nfserr_toosmall) err = nfs_ok; /* can still be found in ->err */ diff --git a/fs/vfat/namei.c b/fs/vfat/namei.c index 155c10b..95d0b6f 100644 --- a/fs/vfat/namei.c +++ b/fs/vfat/namei.c @@ -1034,7 +1034,7 @@ static struct file_system_type vfat_fs_type = { .name = "vfat", .get_sb = vfat_get_sb, .kill_sb = kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_LOOKUP_IN_READDIR, }; static int __init init_vfat_fs(void) diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c index 5f60363..d65d377 100644 --- a/fs/xfs/linux-2.6/xfs_file.c +++ b/fs/xfs/linux-2.6/xfs_file.c @@ -212,7 +212,6 @@ xfs_file_fsync( * Hopefully we'll find a better workaround that allows to use the optimal * version at least for local readdirs for 2.6.25. 
*/ -#if 0 STATIC int xfs_file_readdir( struct file *filp, @@ -244,125 +243,6 @@ xfs_file_readdir( return -error; return 0; } -#else - -struct hack_dirent { - u64 ino; - loff_t offset; - int namlen; - unsigned int d_type; - char name[]; -}; - -struct hack_callback { - char *dirent; - size_t len; - size_t used; -}; - -STATIC int -xfs_hack_filldir( - void *__buf, - const char *name, - int namlen, - loff_t offset, - u64 ino, - unsigned int d_type) -{ - struct hack_callback *buf = __buf; - struct hack_dirent *de = (struct hack_dirent *)(buf->dirent + buf->used); - unsigned int reclen; - - reclen = ALIGN(sizeof(struct hack_dirent) + namlen, sizeof(u64)); - if (buf->used + reclen > buf->len) - return -EINVAL; - - de->namlen = namlen; - de->offset = offset; - de->ino = ino; - de->d_type = d_type; - memcpy(de->name, name, namlen); - buf->used += reclen; - return 0; -} - -STATIC int -xfs_file_readdir( - struct file *filp, - void *dirent, - filldir_t filldir) -{ - struct inode *inode = filp->f_path.dentry->d_inode; - xfs_inode_t *ip = XFS_I(inode); - struct hack_callback buf; - struct hack_dirent *de; - int error; - loff_t size; - int eof = 0; - xfs_off_t start_offset, curr_offset, offset; - - /* - * Try fairly hard to get memory - */ - buf.len = PAGE_CACHE_SIZE; - do { - buf.dirent = kmalloc(buf.len, GFP_KERNEL); - if (buf.dirent) - break; - buf.len >>= 1; - } while (buf.len >= 1024); - - if (!buf.dirent) - return -ENOMEM; - - curr_offset = filp->f_pos; - if (curr_offset == 0x7fffffff) - offset = 0xffffffff; - else - offset = filp->f_pos; - - while (!eof) { - unsigned int reclen; - - start_offset = offset; - - buf.used = 0; - error = -xfs_readdir(ip, &buf, buf.len, &offset, - xfs_hack_filldir); - if (error || offset == start_offset) { - size = 0; - break; - } - - size = buf.used; - de = (struct hack_dirent *)buf.dirent; - while (size > 0) { - curr_offset = de->offset /* & 0x7fffffff */; - if (filldir(dirent, de->name, de->namlen, - curr_offset & 0x7fffffff, - de->ino, de->d_type)) { - goto done; - } - - reclen = ALIGN(sizeof(struct hack_dirent) + de->namlen, - sizeof(u64)); - size -= reclen; - de = (struct hack_dirent *)((char *)de + reclen); - } - } - - done: - if (!error) { - if (size == 0) - filp->f_pos = offset & 0x7fffffff; - else if (de) - filp->f_pos = curr_offset; - } - - kfree(buf.dirent); - return error; -} -#endif STATIC int xfs_file_mmap( diff --git a/include/linux/fs.h b/include/linux/fs.h index 580b513..001ba17 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -100,6 +100,8 @@ extern int dir_notify_enable; #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() * during rename() internally. */ +#define FS_LOOKUP_IN_READDIR 65536 /* FS won't deadlock if you call its + lookup() method from filldir */ /* * These are the fs-independent mount-flags: up to 32 flags are supported -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-09 16:47 ` David Woodhouse @ 2008-08-09 19:55 ` David Woodhouse [not found] ` <1218311710.26926.125.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 0 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-09 19:55 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Brown, chucklever, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Sat, 2008-08-09 at 17:47 +0100, David Woodhouse wrote: > Here it is in a single patch... Updated again after more feedback from hch, to really make it unconditional instead of just inverting the logic, turn a check for S_ISDIR() in JFFS2 into a BUG_ON() because it should _really_ never happen, and fix up some comments (remove the one about why we do the buffering in xfs, and add something similar back in nfsd). {http://,git://} git.infradead.org/users/dwmw2/nfsexport-2.6.git Patch sequence follows... -- dwmw2 ^ permalink raw reply [flat|nested] 55+ messages in thread
[parent not found: <1218311710.26926.125.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>]
* [PATCH 1/4] Factor out nfsd_do_readdir() into its own function [not found] ` <1218311710.26926.125.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> @ 2008-08-09 20:01 ` David Woodhouse [not found] ` <1218312114.5063.5.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 2008-08-09 20:02 ` [PATCH 2/4] Copy XFS readdir hack into nfsd code David Woodhouse ` (2 subsequent siblings) 3 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-09 20:01 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> --- fs/nfsd/vfs.c | 40 ++++++++++++++++++++++++---------------- 1 files changed, 24 insertions(+), 16 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 18060be..3e22634 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1814,6 +1814,29 @@ out: return err; } +static int nfsd_do_readdir(struct file *file, filldir_t func, + struct readdir_cd *cdp, loff_t *offsetp) +{ + int host_err; + + /* + * Read the directory entries. This silly loop is necessary because + * readdir() is not guaranteed to fill up the entire buffer, but + * may choose to do less. + */ + do { + cdp->err = nfserr_eof; /* will be cleared on successful read */ + host_err = vfs_readdir(file, func, cdp); + } while (host_err >=0 && cdp->err == nfs_ok); + + *offsetp = vfs_llseek(file, 0, 1); + + if (host_err) + return nfserrno(host_err); + else + return cdp->err; +} + /* * Read entries from a directory. * The NFSv3/4 verifier we ignore for now. @@ -1823,7 +1846,6 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, struct readdir_cd *cdp, filldir_t func) { __be32 err; - int host_err; struct file *file; loff_t offset = *offsetp; @@ -1837,21 +1859,7 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, goto out_close; } - /* - * Read the directory entries. This silly loop is necessary because - * readdir() is not guaranteed to fill up the entire buffer, but - * may choose to do less. - */ - - do { - cdp->err = nfserr_eof; /* will be cleared on successful read */ - host_err = vfs_readdir(file, func, cdp); - } while (host_err >=0 && cdp->err == nfs_ok); - if (host_err) - err = nfserrno(host_err); - else - err = cdp->err; - *offsetp = vfs_llseek(file, 0, 1); + err = nfsd_do_readdir(file, func, cdp, offsetp); if (err == nfserr_eof || err == nfserr_toosmall) err = nfs_ok; /* can still be found in ->err */ -- 1.5.5.1 -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 55+ messages in thread
[parent not found: <1218312114.5063.5.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>]
* Re: [PATCH 1/4] Factor out nfsd_do_readdir() into its own function [not found] ` <1218312114.5063.5.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> @ 2008-08-09 20:07 ` Christoph Hellwig 0 siblings, 0 replies; 55+ messages in thread From: Christoph Hellwig @ 2008-08-09 20:07 UTC (permalink / raw) To: David Woodhouse Cc: J. Bruce Fields, Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Looks good. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH 2/4] Copy XFS readdir hack into nfsd code. [not found] ` <1218311710.26926.125.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 2008-08-09 20:01 ` [PATCH 1/4] Factor out nfsd_do_readdir() into its own function David Woodhouse @ 2008-08-09 20:02 ` David Woodhouse 2008-08-09 20:08 ` Christoph Hellwig 2008-08-09 20:03 ` [PATCH 3/4] Remove XFS buffered readdir hack David Woodhouse 2008-08-09 20:03 ` [PATCH 4/4] Reinstate NFS exportability David Woodhouse 3 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-09 20:02 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Some file systems with their own internal locking have problems with the way that nfsd calls the ->lookup() method from within a filldir function called from their ->readdir() method. The recursion back into the file system code can cause deadlock. XFS has a fairly hackish solution to this which involves doing the readdir() into a locally-allocated buffer, then going back through it calling the filldir function afterwards. It's not ideal, but it works. It's particularly suboptimal because XFS does this for local file systems too, where it's completely unnecessary. Copy this hack into the NFS code where it can be used only for NFS export. In response to feedback, use it unconditionally rather than only for the affected file systems. Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> --- fs/nfsd/vfs.c | 108 +++++++++++++++++++++++++++++++++++++++++++++++++-------- 1 files changed, 93 insertions(+), 15 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 3e22634..d789fb8 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1814,27 +1814,105 @@ out: return err; } -static int nfsd_do_readdir(struct file *file, filldir_t func, - struct readdir_cd *cdp, loff_t *offsetp) +/* + * We do this buffering because we must not call back into the file + * system's ->lookup() method from the filldir callback. That may well + * deadlock a number of file systems. + * + * This is based heavily on the implementation of same in XFS. + */ +struct buffered_dirent { + u64 ino; + loff_t offset; + int namlen; + unsigned int d_type; + char name[]; +}; + +struct readdir_data { + char *dirent; + size_t used; +}; + +static int nfsd_buffered_filldir(void *__buf, const char *name, int namlen, + loff_t offset, u64 ino, unsigned int d_type) +{ + struct readdir_data *buf = __buf; + struct buffered_dirent *de = (void *)(buf->dirent + buf->used); + unsigned int reclen; + + reclen = ALIGN(sizeof(struct buffered_dirent) + namlen, sizeof(u64)); + if (buf->used + reclen > PAGE_SIZE) + return -EINVAL; + + de->namlen = namlen; + de->offset = offset; + de->ino = ino; + de->d_type = d_type; + memcpy(de->name, name, namlen); + buf->used += reclen; + + return 0; +} + +static int nfsd_buffered_readdir(struct file *file, filldir_t func, + struct readdir_cd *cdp, loff_t *offsetp) { + struct readdir_data buf; + struct buffered_dirent *de; int host_err; + int size; + loff_t offset; - /* - * Read the directory entries. This silly loop is necessary because - * readdir() is not guaranteed to fill up the entire buffer, but - * may choose to do less. 
- */ - do { - cdp->err = nfserr_eof; /* will be cleared on successful read */ - host_err = vfs_readdir(file, func, cdp); - } while (host_err >=0 && cdp->err == nfs_ok); + buf.dirent = (void *)__get_free_page(GFP_KERNEL); + if (!buf.dirent) + return -ENOMEM; + + offset = *offsetp; + cdp->err = nfserr_eof; /* will be cleared on successful read */ - *offsetp = vfs_llseek(file, 0, 1); + while (1) { + unsigned int reclen; + + buf.used = 0; + + host_err = vfs_readdir(file, nfsd_buffered_filldir, &buf); + if (host_err) + break; + + size = buf.used; + + if (!size) + break; + + + de = (struct buffered_dirent *)buf.dirent; + while (size > 0) { + offset = de->offset; + + if (func(cdp, de->name, de->namlen, de->offset, + de->ino, de->d_type)) + goto done; + + if (cdp->err != nfs_ok) + goto done; + + reclen = ALIGN(sizeof(*de) + de->namlen, + sizeof(u64)); + size -= reclen; + de = (struct buffered_dirent *)((char *)de + reclen); + } + offset = vfs_llseek(file, 0, 1); + } + + done: + free_page((unsigned long)(buf.dirent)); if (host_err) return nfserrno(host_err); - else - return cdp->err; + + *offsetp = offset; + return cdp->err; } /* @@ -1859,7 +1937,7 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, goto out_close; } - err = nfsd_do_readdir(file, func, cdp, offsetp); + err = nfsd_buffered_readdir(file, func, cdp, offsetp); if (err == nfserr_eof || err == nfserr_toosmall) err = nfs_ok; /* can still be found in ->err */ -- 1.5.5.1 -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH 2/4] Copy XFS readdir hack into nfsd code. 2008-08-09 20:02 ` [PATCH 2/4] Copy XFS readdir hack into nfsd code David Woodhouse @ 2008-08-09 20:08 ` Christoph Hellwig 0 siblings, 0 replies; 55+ messages in thread From: Christoph Hellwig @ 2008-08-09 20:08 UTC (permalink / raw) To: David Woodhouse Cc: J. Bruce Fields, Neil Brown, chucklever, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd Looks good. ^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH 3/4] Remove XFS buffered readdir hack [not found] ` <1218311710.26926.125.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 2008-08-09 20:01 ` [PATCH 1/4] Factor out nfsd_do_readdir() into its own function David Woodhouse 2008-08-09 20:02 ` [PATCH 2/4] Copy XFS readdir hack into nfsd code David Woodhouse @ 2008-08-09 20:03 ` David Woodhouse [not found] ` <1218312191.5063.8.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 2008-08-09 20:03 ` [PATCH 4/4] Reinstate NFS exportability David Woodhouse 3 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-09 20:03 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Now that we've moved the readdir hack to the nfsd code, we can remove the local version from the XFS code. Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> --- fs/xfs/linux-2.6/xfs_file.c | 128 ------------------------------------------- 1 files changed, 0 insertions(+), 128 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c index 5f60363..1946b44 100644 --- a/fs/xfs/linux-2.6/xfs_file.c +++ b/fs/xfs/linux-2.6/xfs_file.c @@ -204,15 +204,6 @@ xfs_file_fsync( return -xfs_fsync(XFS_I(dentry->d_inode)); } -/* - * Unfortunately we can't just use the clean and simple readdir implementation - * below, because nfs might call back into ->lookup from the filldir callback - * and that will deadlock the low-level btree code. - * - * Hopefully we'll find a better workaround that allows to use the optimal - * version at least for local readdirs for 2.6.25. 
- */ -#if 0 STATIC int xfs_file_readdir( struct file *filp, @@ -244,125 +235,6 @@ xfs_file_readdir( return -error; return 0; } -#else - -struct hack_dirent { - u64 ino; - loff_t offset; - int namlen; - unsigned int d_type; - char name[]; -}; - -struct hack_callback { - char *dirent; - size_t len; - size_t used; -}; - -STATIC int -xfs_hack_filldir( - void *__buf, - const char *name, - int namlen, - loff_t offset, - u64 ino, - unsigned int d_type) -{ - struct hack_callback *buf = __buf; - struct hack_dirent *de = (struct hack_dirent *)(buf->dirent + buf->used); - unsigned int reclen; - - reclen = ALIGN(sizeof(struct hack_dirent) + namlen, sizeof(u64)); - if (buf->used + reclen > buf->len) - return -EINVAL; - - de->namlen = namlen; - de->offset = offset; - de->ino = ino; - de->d_type = d_type; - memcpy(de->name, name, namlen); - buf->used += reclen; - return 0; -} - -STATIC int -xfs_file_readdir( - struct file *filp, - void *dirent, - filldir_t filldir) -{ - struct inode *inode = filp->f_path.dentry->d_inode; - xfs_inode_t *ip = XFS_I(inode); - struct hack_callback buf; - struct hack_dirent *de; - int error; - loff_t size; - int eof = 0; - xfs_off_t start_offset, curr_offset, offset; - - /* - * Try fairly hard to get memory - */ - buf.len = PAGE_CACHE_SIZE; - do { - buf.dirent = kmalloc(buf.len, GFP_KERNEL); - if (buf.dirent) - break; - buf.len >>= 1; - } while (buf.len >= 1024); - - if (!buf.dirent) - return -ENOMEM; - - curr_offset = filp->f_pos; - if (curr_offset == 0x7fffffff) - offset = 0xffffffff; - else - offset = filp->f_pos; - - while (!eof) { - unsigned int reclen; - - start_offset = offset; - - buf.used = 0; - error = -xfs_readdir(ip, &buf, buf.len, &offset, - xfs_hack_filldir); - if (error || offset == start_offset) { - size = 0; - break; - } - - size = buf.used; - de = (struct hack_dirent *)buf.dirent; - while (size > 0) { - curr_offset = de->offset /* & 0x7fffffff */; - if (filldir(dirent, de->name, de->namlen, - curr_offset & 0x7fffffff, - de->ino, de->d_type)) { - goto done; - } - - reclen = ALIGN(sizeof(struct hack_dirent) + de->namlen, - sizeof(u64)); - size -= reclen; - de = (struct hack_dirent *)((char *)de + reclen); - } - } - - done: - if (!error) { - if (size == 0) - filp->f_pos = offset & 0x7fffffff; - else if (de) - filp->f_pos = curr_offset; - } - - kfree(buf.dirent); - return error; -} -#endif STATIC int xfs_file_mmap( -- 1.5.5.1 -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH 3/4] Remove XFS buffered readdir hack [not found] ` <1218312191.5063.8.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> @ 2008-08-09 20:09 ` Christoph Hellwig 0 siblings, 0 replies; 55+ messages in thread From: Christoph Hellwig @ 2008-08-09 20:09 UTC (permalink / raw) To: David Woodhouse Cc: J. Bruce Fields, Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Sat, Aug 09, 2008 at 09:03:11PM +0100, David Woodhouse wrote: > Now that we've moved the readdir hack to the nfsd code, we can > remove the local version from the XFS code. Looks good to me, I'm also throwing this into my local xfsqa queue. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH 4/4] Reinstate NFS exportability [not found] ` <1218311710.26926.125.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> ` (2 preceding siblings ...) 2008-08-09 20:03 ` [PATCH 3/4] Remove XFS buffered readdir hack David Woodhouse @ 2008-08-09 20:03 ` David Woodhouse [not found] ` <1218312213.5063.9.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> 3 siblings, 1 reply; 55+ messages in thread From: David Woodhouse @ 2008-08-09 20:03 UTC (permalink / raw) To: J. Bruce Fields Cc: Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Now that the readdir/lookup deadlock issues have been dealt with, we can export JFFS2 file systems again. (For now, you have to specify fsid manually; we should add a method to the export_ops to handle that too.) Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> --- fs/jffs2/super.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 59 insertions(+), 0 deletions(-) diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c index efd4012..7f03a28 100644 --- a/fs/jffs2/super.c +++ b/fs/jffs2/super.c @@ -22,6 +22,7 @@ #include <linux/mtd/super.h> #include <linux/ctype.h> #include <linux/namei.h> +#include <linux/exportfs.h> #include "compr.h" #include "nodelist.h" @@ -62,6 +63,63 @@ static int jffs2_sync_fs(struct super_block *sb, int wait) return 0; } +static struct inode *jffs2_nfs_get_inode(struct super_block *sb, uint64_t ino, + uint32_t generation) +{ + return jffs2_iget(sb, ino); +} + +static struct dentry *jffs2_fh_to_dentry(struct super_block *sb, struct fid *fid, + int fh_len, int fh_type) +{ + return generic_fh_to_dentry(sb, fid, fh_len, fh_type, + jffs2_nfs_get_inode); +} + +static struct dentry *jffs2_fh_to_parent(struct super_block *sb, struct fid *fid, + int fh_len, int fh_type) +{ + return generic_fh_to_parent(sb, fid, fh_len, fh_type, + jffs2_nfs_get_inode); +} + +static struct dentry *jffs2_get_parent(struct dentry *child) +{ + struct jffs2_inode_info *f; + struct dentry *parent; + struct inode *inode; + uint32_t pino; + + BUG_ON(!S_ISDIR(child->d_inode->i_mode)); + + f = JFFS2_INODE_INFO(child->d_inode); + + pino = f->inocache->pino_nlink; + + JFFS2_DEBUG("Parent of directory ino #%u is #%u\n", + f->inocache->ino, pino); + + inode = jffs2_iget(child->d_inode->i_sb, pino); + if (IS_ERR(inode)) { + JFFS2_ERROR("Failed to get inode #%u: %ld\n", pino, + PTR_ERR(inode)); + return ERR_CAST(inode); + } + + parent = d_alloc_anon(inode); + if (!parent) { + iput(inode); + parent = ERR_PTR(-ENOMEM); + } + return parent; +} + +static struct export_operations jffs2_export_ops = { + .get_parent = jffs2_get_parent, + .fh_to_dentry = jffs2_fh_to_dentry, + .fh_to_parent = jffs2_fh_to_parent, +}; + static const struct super_operations jffs2_super_operations = { .alloc_inode = jffs2_alloc_inode, @@ -104,6 +162,7 @@ static int jffs2_fill_super(struct super_block *sb, void *data, int silent) spin_lock_init(&c->inocache_lock); sb->s_op = &jffs2_super_operations; + sb->s_export_op = &jffs2_export_ops; sb->s_flags = sb->s_flags | MS_NOATIME; sb->s_xattr = jffs2_xattr_handlers; #ifdef CONFIG_JFFS2_FS_POSIX_ACL -- 1.5.5.1 -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message 
to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH 4/4] Reinstate NFS exportability [not found] ` <1218312213.5063.9.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org> @ 2008-08-09 20:10 ` Christoph Hellwig 0 siblings, 0 replies; 55+ messages in thread From: Christoph Hellwig @ 2008-08-09 20:10 UTC (permalink / raw) To: David Woodhouse Cc: J. Bruce Fields, Neil Brown, chucklever-Re5JQEeQqe8AvxtiuMwx3w, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Sat, Aug 09, 2008 at 09:03:33PM +0100, David Woodhouse wrote: > Now that the readdir/lookup deadlock issues have been dealt with, we can > export JFFS2 file systems again. Looks good, but you might want to add a comment why jffs2 doesn't care about an inode generation number. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
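One way to answer Christoph's request would be a comment along the lines of the sketch below on jffs2_nfs_get_inode(). The wording is illustrative rather than taken from any committed patch, but the rationale, as I understand it, is that JFFS2 hands out inode numbers monotonically and never recycles them, so a stale filehandle cannot quietly resolve to a different file.

static struct inode *jffs2_nfs_get_inode(struct super_block *sb, uint64_t ino,
                                         uint32_t generation)
{
        /*
         * We don't bother with i_generation: JFFS2 allocates inode
         * numbers monotonically and never re-uses them, so the flash
         * will wear out long before a recycled inode number could make
         * a stale filehandle resolve to the wrong file.
         */
        return jffs2_iget(sb, ino);
}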
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <18582.21855.2092.903688-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2008-08-04 6:19 ` Chuck Lever @ 2008-08-17 18:22 ` Andreas Dilger 1 sibling, 0 replies; 55+ messages in thread From: Andreas Dilger @ 2008-08-17 18:22 UTC (permalink / raw) To: Neil Brown Cc: chucklever-Re5JQEeQqe8AvxtiuMwx3w, J. Bruce Fields, David Woodhouse, viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Aug 04, 2008 11:03 +1000, Neil Brown wrote: > So, given that background, it is possible to see some more possible > solutions (other than the ones already mentioned). > > - drop the internal lock across filldir. > It could be seen a impolite to hold any locks across a callback > that are not documented as being held. > This would put an extra burden on the filesystem, but it shouldn't > be a particularly big burden. > A filesystem needs to be able to find the 'next' entry from a > sequential 'seek' pointer so that is the most state that needs to > be preserved. It might be convenient to be able to keep more state > (pointers into data structures etc). All this can be achieved with > fairly standard practice: > a/ have a 'version' number per inode which is updated when any > internal restructure happens. > b/ when calling filldir, record the version number, drop the lock > call filldir, reclaim the lock, check the version number > c/ if the version number matches (as it mostly will) just keep > going. If it doesn't jump back to the top of readdir where > we decode the current seek address. > > Some filesystems have problems implementing seekdir/telldir so they > might not be totally happy here. I have little sympathy for such > filesystems and feel the burden should be on them to make it work. > > - use i_mutex to protect internal changes too, and drop i_mutex while > doing internal restructuring. This would need some VFS changes so > that dropping i_mutex would be safe. It would require some way to > lock an individual dentry. Probably you would lock it under > i_mutex by setting a flag bit, wait for the flag on some inode-wide > waitqueue, and drop the lock by clearing the flag and waking the > waitqueue. And you are never allowed to look at ->d_inode if the > lock flag is set. When we were working on scaling the performance of concurrent operations in a single directory we added hashed dentry locks instead of using i_mutex (well, i_sem in those days) to lock the whole directory. To make the change manageable we replaced direct i_sem locking on the directory inode with ->lock_dir() and ->unlock_dir() methods, defaulting to just down() and up() on i_sem, but replacing this with a per-entry lock on the child dentry hash. This allowed Lustre servers to create/lookup/rename/remove many entries in a single directory concurrently, and I think this same approach could be useful in this case also. This allows filesystems that need it to bypass i_mutex if they need their own brand of locking, while leaving the majority of filesystems untouched. It also has the benefit that filesystems that need improved multi-threaded performance in a single directory (e.g. JFFS2, XFS, or HPC or MTA workloads) have the ability to do it. 
There is definitely some work needed internal to the filesystem to take advantage of this increased parallelism, and we did implement such changes for ext3+htree directories, adding internal locking on each leaf block that scaled the concurrency with the size of the directory. Alas, we don't have any up-to-date kernel patches for this, though the VFS patch was posted to LKML back in Feb 2005 as "RFC: pdirops: vfs patch" http://www.mail-archive.com/linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg01617.html We have better dynamic locking code today (one of Jan's objections about that patch), but the VFS part of the patch is no longer maintained. The ext3+htree patch was also posted "[RFC] parallel directory operations" http://www.ussg.iu.edu/hypermail/linux/kernel/0307.1/att-0041/03-ext3-pdirops.patch Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
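To make the shape of that change a little more concrete, a minimal sketch of the ->lock_dir()/->unlock_dir() hooks Andreas describes might look like the following. None of this exists in mainline: the two methods, the opaque cookie and the helper names are all hypothetical, and the default path simply falls back to today's whole-directory i_mutex.

/* Hypothetical additions to struct inode_operations:
 *      void *(*lock_dir)(struct inode *dir, struct dentry *dentry);
 *      void (*unlock_dir)(struct inode *dir, struct dentry *dentry,
 *                         void *cookie);
 */

static inline void *vfs_lock_dir(struct inode *dir, struct dentry *dentry)
{
        if (dir->i_op->lock_dir)
                /* e.g. take a hashed per-name lock, so concurrency scales
                 * with the directory instead of serialising everything */
                return dir->i_op->lock_dir(dir, dentry);

        mutex_lock(&dir->i_mutex);      /* default: lock the whole directory */
        return NULL;
}

static inline void vfs_unlock_dir(struct inode *dir, struct dentry *dentry,
                                  void *cookie)
{
        if (dir->i_op->unlock_dir)
                dir->i_op->unlock_dir(dir, dentry, cookie);
        else
                mutex_unlock(&dir->i_mutex);
}

Callers in namei.c that today take i_mutex around a single create/lookup/unlink would switch to these helpers, which is essentially what the 2005 pdirops patch linked above did with i_sem.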
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-04 1:03 ` Neil Brown [not found] ` <18582.21855.2092.903688-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2008-08-04 18:41 ` J. Bruce Fields 2008-08-04 22:37 ` Neil Brown 1 sibling, 1 reply; 55+ messages in thread From: J. Bruce Fields @ 2008-08-04 18:41 UTC (permalink / raw) To: Neil Brown Cc: chucklever, David Woodhouse, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Mon, Aug 04, 2008 at 11:03:27AM +1000, Neil Brown wrote: > The locking bothers me. > The VFS seems to have a general policy of doing the locking itself to > make life easier for filesystem implementors (or to make it harder for > them to mess up, depending on your perspective). > > The current issue seems to suggest that the locking provided by the > VFS is no longer adequate, so each filesystem is needing to create > something itself. That suggests to me a weakness in the model. > Possibly the VFS should give up trying to be in control and let > filesystems do their own locking. Possibly there are still things > that the VFS can do which are universally good. I think these are > questions that should be addressed. > Maybe they have already been addressed and I missed the conversation > (that wouldn't surprise me much). But seeing words like "hack" > suggests to me that it hasn't. So I want to make sure I understand > the problem properly and deeply before giving my blessing to a hack. > > So: what are the issues? > > Obviously readdir can race with create and you don't want them > tripping each other up. The current VFS approach to this is i_mutex. > Any place which modifies a directory or does a lookup in a directory > takes i_mutex to ensure that the directory doesn't change. > > This is probably fairly limiting. With a tree-structured directory > you only really need to lock the 'current node' of the tree. > So any access would lock the top node, find which child to follow, > then lock the child and unlock the parent. Rebalancing might need to > be creative as you cannot lock a parent while holding a lock on the > child, but that isn't insurmountable. > So I could argue that holding i_mutex across a lookup or create or > readdir maybe isn't ideal. It would be interesting to explore the > possibility of dropping i_mutex across calls into the filesystem. > By the time the filesystem is called, we really only need to be > locking the names (dentries) rather than the whole directory. > More on this later... > > Some filesystems want to restructure directories and times that are > independent of any particular access. This might be defragmentation > or rebalancing or whatever. Clearly there needs to be mutual > exclusion between this and other accesses such as readdir and lookup. > The VFS clearly cannot help with this. It doesn't know when > rebalancing might happen or are what sort of granularity. So the > filesystem must do it's own locking. > It strikes me that this sort of locking would best be done by > spinlocks at the block level rather than a per-inode mutex. However > I haven't actually implemented this (yet) so I cannot be certain. > > This is what is causing the current problem (for JFFS2 at least). > JFF2 has a per-inode lock which protects against internally visible > changes to the inode. Further (and this is key) this lock is held > across the filldir callback. > i_mutex is also held across the filldir callback, but there is an > important difference. 
It is taken by the VFS, not the filesystem, > and it is guaranteed always to be held across the filldir callback. > So the filldir callback can call e.g. lookup without further locking. > > Backing up a little: given that the filesystem implementor chose to > use per-inode locking to protect internal restructuring (which is > certainly an easier option), why not just use i_mutex? The reason > is that a create operation might trigger system-wide garbage > collection which could trigger restructuring of the current inode, > which would lead to an A-A deadlock (as the create is waiting for the > garbage collection, and so still holding i_mutex). > > So, given that background, it is possible to see some more possible > solutions (other than the ones already mentioned). > > - drop the internal lock across filldir. > It could be seen a impolite to hold any locks across a callback > that are not documented as being held. > This would put an extra burden on the filesystem, but it shouldn't > be a particularly big burden. > A filesystem needs to be able to find the 'next' entry from a > sequential 'seek' pointer so that is the most state that needs to > be preserved. It might be convenient to be able to keep more state > (pointers into data structures etc). All this can be achieved with > fairly standard practice: > a/ have a 'version' number per inode which is updated when any > internal restructure happens. > b/ when calling filldir, record the version number, drop the lock > call filldir, reclaim the lock, check the version number > c/ if the version number matches (as it mostly will) just keep > going. If it doesn't jump back to the top of readdir where > we decode the current seek address. > > Some filesystems have problems implementing seekdir/telldir so they > might not be totally happy here. I have little sympathy for such > filesystems and feel the burden should be on them to make it work. > > - use i_mutex to protect internal changes too, and drop i_mutex while > doing internal restructuring. This would need some VFS changes so > that dropping i_mutex would be safe. It would require some way to > lock an individual dentry. Probably you would lock it under > i_mutex by setting a flag bit, wait for the flag on some inode-wide > waitqueue, and drop the lock by clearing the flag and waking the > waitqueue. And you are never allowed to look at ->d_inode if the > lock flag is set. Isn't there a lot of kernel code that assumes that following ->d_inode is safe? Or are you only worried about looking at certain filesystem-internal fields of d_inode? The locking required to keep the filesystem namespace consistent is difficult and important, so I think changing it would require an extremely careful description of the new design and a commitment from some filesystem developers to actually take advantage of it.... --b. > Of these I really like the second. Refining the i_mutex down to a > per-name lock before calling in to the filesystem seems like a really > good idea and should be good for scalability and large directories. > It isn't something to be done lightly though. The filesystem would > still be given i_mutex held, but would be allowed to drop it if it > wanted to. We could have an FS_DROPS_I_MUTEX similar to the current > FS_RENAME_DOES_D_MOVE. > > For the first, I really like the idea that a lock should not be held > across calls the filldir. I feel that a filesystem doing that is > "wrong" in much the same way that some think that recursing into the > filesystem as nfsd does is "wrong". 
In reality neither is "wrong", we > just need to work together and negotiate and work out the best way to > meet all of our needs. > > So what should we do now? I think that for JFFS2 to drop and reclaim > f->sem in readdir would be all of 20 lines of code, including updating > a ->version counter elsewhere in the code. Replicating that in all > the filesystems that need it would probably be more code than the nfsd > change though. > > On the other hand, if we implement the nfsd change, it will almost > certainly never go away, even if all filesystems eventually find that > they don't need it any more because someone improves the locking > rules in the VFS. Where as the code in the filesystems could quite > possibly go away when they are revised to make better use of the > locking regime. So I don't think that is an ideal situation either. > > If I had time, I would work on modifying the VFS to allow filesystems > to drop i_mutex. However I don't have time at the moment, so I'll > leave the decision to be made by someone else (Hi Bruce! I'll > support whatever you decide). > > But I think it is important to understand what is really going on and > not just implement a hack that works around the problem. I think I do > understand now, so I am a lot happier. Hopefully you do too. ^ permalink raw reply [flat|nested] 55+ messages in thread
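For anyone trying to picture the first option quoted above (drop the internal lock across filldir, revalidate with a version counter), a rough sketch of such a ->readdir() follows. The struct names, the dir_version counter and example_seek() are invented for illustration; only the mutex and the filldir calling convention are real. The point is simply that no filesystem lock is held while filldir, and therefore nfsd's lookup, runs.

static int example_readdir(struct file *filp, void *dirent, filldir_t filldir)
{
        struct example_inode_info *f = EXAMPLE_I(filp->f_path.dentry->d_inode);
        struct example_dirent *fd;
        char name[NAME_MAX + 1];

restart:
        mutex_lock(&f->sem);
        for (fd = example_seek(f, filp->f_pos); fd; fd = fd->next) {
                /* (a)/(b): note the directory's version, copy out the entry,
                 * then drop the internal lock before calling back */
                unsigned long version = f->dir_version;
                int namlen = fd->namlen;
                uint64_t ino = fd->ino;
                unsigned int type = fd->type;
                int over;

                memcpy(name, fd->name, namlen);
                name[namlen] = '\0';

                mutex_unlock(&f->sem);
                /* filldir may recurse into ->lookup; we hold no fs locks here */
                over = filldir(dirent, name, namlen, filp->f_pos, ino, type);
                mutex_lock(&f->sem);

                if (over)
                        break;
                filp->f_pos++;

                /* (c): if the directory was restructured while we were
                 * unlocked, go back and re-decode the seek position */
                if (version != f->dir_version) {
                        mutex_unlock(&f->sem);
                        goto restart;
                }
        }
        mutex_unlock(&f->sem);
        return 0;
}

The cost is a copy of each name and the occasional restart; the benefit is that nfsd's readdirplus never re-enters the filesystem while the internal lock is held.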
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-04 18:41 ` J. Bruce Fields @ 2008-08-04 22:37 ` Neil Brown 0 siblings, 0 replies; 55+ messages in thread From: Neil Brown @ 2008-08-04 22:37 UTC (permalink / raw) To: J. Bruce Fields Cc: chucklever, David Woodhouse, viro, Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd On Monday August 4, bfields@fieldses.org wrote: > On Mon, Aug 04, 2008 at 11:03:27AM +1000, Neil Brown wrote: > > - use i_mutex to protect internal changes too, and drop i_mutex while > > doing internal restructuring. This would need some VFS changes so > > that dropping i_mutex would be safe. It would require some way to > > lock an individual dentry. Probably you would lock it under > > i_mutex by setting a flag bit, wait for the flag on some inode-wide > > waitqueue, and drop the lock by clearing the flag and waking the > > waitqueue. And you are never allowed to look at ->d_inode if the > > lock flag is set. > > Isn't there a lot of kernel code that assumes that following ->d_inode > is safe? Or are you only worried about looking at certain > filesystem-internal fields of d_inode? I overstated that restriction a bit. The way ->d_inode works is that it is set once and then never changed. A rename doesn't change ->d_inode, it changes the name in the dentry, and possibly the ->d_parent pointer. An unlink also leaves ->d_inode alone and just unhashes the dentry. ->d_inode is never cleared until the dentry is freed. So ->d_inode starts as NULL, is (possibly) set to a value, and stays that way. Currently if the name exists, then ->d_inode will be set non-NULL under i_mutex and so a NULL value will never be visible outside of i_mutex. If the name doesn't exist, a NULL value *will* be visible outside of i_mutex and it could subsequently get (under i_mutex) set when name is created. If i_mutex is allowed to be dropped early, then a premature NULL could be visible in ->d_inode for a name that does exist. This is the case the needs to be guarded against. So when I said "you are never allowed to look at ->d_inode ...." I could have said "you are never allowed to see a NULL in ->d_inode ..." While lots of places dereference ->d_inode, relative few do it in a context where ->d_inode could be NULL, and these are probably all in fs/namei.c or related code. Most of them would be fairly easy to get to work with dentry locking. The problem case is rename (as it always is). With rename, d_move needs to be called after the filesystem has committed to the rename, but before i_mutex is dropped. That would be awkward for filesystems that needed to collect garbage or whatever before completing the rename. I would probably address this by allowing ->rename to return -EAGAIN with the meaning that i_mutex was dropped but nothing was committed to, and that vfs_rename_* should try again from the top. Presumably ->rename will have done some garbage collection or whatever is needed and the next call to ->rename has a much better chance of going through without needing to drop the lock. I don't know if livelocks could be a problem with this approach. > > The locking required to keep the filesystem namespace consistent is > difficult and important, so I think changing it would require an > extremely careful description of the new design and a commitment from > some filesystem developers to actually take advantage of it.... An "extremely careful description of the new design" is something we could happily see more of in the kernel world, isn't it :-) Agreed. 
NeilBrown ^ permalink raw reply [flat|nested] 55+ messages in thread
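Neil's -EAGAIN idea for ->rename can be pictured as a retry loop in the VFS rename path, along the lines of the sketch below. The helper name is invented, and the hard part, re-validating both dentries once the locks have been re-taken, is only hinted at in the comment.

static int vfs_do_rename_once(struct inode *old_dir, struct dentry *old_dentry,
                              struct inode *new_dir, struct dentry *new_dentry)
{
        int error;

        do {
                /*
                 * -EAGAIN would mean the filesystem dropped i_mutex (for
                 * instance to garbage-collect) without committing to the
                 * rename.  This sketch simply retries; a real version would
                 * re-validate old_dentry and new_dentry under the
                 * re-acquired locks before each attempt.
                 */
                error = old_dir->i_op->rename(old_dir, old_dentry,
                                              new_dir, new_dentry);
        } while (error == -EAGAIN);

        return error;
}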
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 0:40 ` David Woodhouse [not found] ` <1217551230.3719.15.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org> 2008-08-01 0:53 ` Chuck Lever @ 2008-08-01 2:14 ` Neil Brown [not found] ` <18578.29049.38904.746701-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2 siblings, 1 reply; 55+ messages in thread From: Neil Brown @ 2008-08-01 2:14 UTC (permalink / raw) To: David Woodhouse Cc: Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Friday August 1, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org wrote: > On Fri, 2008-08-01 at 10:16 +1000, Neil Brown wrote: > > On Thursday July 31, dwmw2-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org wrote: > > > On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote: > > > > Why is there a deadlock here? > > > > I was really hoping you would answer this question. > > It's because the nfsd readdirplus code recurses into the file system. > >From the file system's ->readdir() function we call back into nfsd's > encode_entry(), which then calls back into the file system's ->lookup() > function so that it can generate a filehandle. > > For any file system which has its own internal locking -- which includes > jffs2, btrfs, xfs, jfs, gfs*, ocfs* and probably others -- that > recursive call back into the file system will deadlock. > > In the specific case of JFFS2, we need that internal locking because of > lock ordering constraints with the garbage collection -- we need to take > the allocator lock _before_ the per-inode lock, which means we can't use > the generic inode->i_mutex for that purpose. That's documented in > fs/jffs2/README.Locking. I know fewer details about the other affected > file systems. It sounds to me like the core problem here is that the locking regime imposed by the VFS simply isn't suitable for JFFS2 .. and a bunch of other filesystems. This seems to suggest that the VFS should be changed. Superficially, it seems that changing the locking rules so that the callee takes i_mutex rather than the caller taking it would help and, in the case of JFFS2, make f->sem redundant. Does that match your understanding? That is obviously a big change and one that should not be made lightly, but if it was to benefit a number of the newer filesystems, then it would seem like the appropriate way to go. Clearly we need a short term solution too as we don't want to wait for VFS locking rules to be renegotiated. The idea of a "lock is owned by me" check is appealing to me as it is a small, localised change that would easily be removed if/when the locking we "fixed" properly. In the JFFS2 case I imagine this taking the following form: - new field in jffs2_inode_info "struct task_struct *sem_owner", initialised to NULL - in jffs2_readdir after locking ->sem, set ->sem_owner to current. - before unlocking ->sem, set ->sem_owner to NULL - in jffs2_lookup, check if "dir_f->sem_owner == current" If it does, set a flag reminding us not to drop the lock, else mutex_lock(&dir_f->sem); This should fix the problem with very little cost either in performance or elegance. And if/when the locking rules were changed, the accompanying code review would immediately notice this and remove it. 
NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
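Neil's four bullet points correspond roughly to the hack David had posted earlier (linked below in his reply). A simplified sketch of the lookup side is shown here; the sem_owner field does not exist in mainline JFFS2, the dirent walk is reduced to its essentials, and jffs2_readdir() would be expected to set f->sem_owner = current immediately after taking f->sem and clear it again before releasing the lock.

static struct dentry *jffs2_lookup(struct inode *dir_i, struct dentry *target,
                                   struct nameidata *nd)
{
        struct jffs2_inode_info *dir_f = JFFS2_INODE_INFO(dir_i);
        struct jffs2_full_dirent *fd;
        struct inode *inode = NULL;
        uint32_t ino = 0;
        /* Are we being called back from our own readdir, via nfsd's
         * filldir -> lookup_one_len()?  Then we already hold the lock. */
        int recursing = (dir_f->sem_owner == current);

        if (!recursing)
                mutex_lock(&dir_f->sem);

        /* simplified walk: the last matching dirent wins, deletions have ino 0 */
        for (fd = dir_f->dents; fd; fd = fd->next) {
                if (fd->nhash == target->d_name.hash &&
                    !strcmp(fd->name, target->d_name.name))
                        ino = fd->ino;
        }

        if (!recursing)
                mutex_unlock(&dir_f->sem);

        if (ino) {
                inode = jffs2_iget(dir_i->i_sb, ino);
                if (IS_ERR(inode))
                        return ERR_CAST(inode);
        }
        d_add(target, inode);
        return NULL;
}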
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <18578.29049.38904.746701-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2008-08-01 8:50 ` David Woodhouse 2008-08-01 10:03 ` Al Viro 1 sibling, 0 replies; 55+ messages in thread From: David Woodhouse @ 2008-08-01 8:50 UTC (permalink / raw) To: Neil Brown Cc: Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Fri, 2008-08-01 at 12:14 +1000, Neil Brown wrote: > It sounds to me like the core problem here is that the locking regime > imposed by the VFS simply isn't suitable for JFFS2 .. and a bunch of > other filesystems. Well, the VFS isn't the part that's recursing into the file system code and causing deadlock. It's only nfsd which is doing that. > This seems to suggest that the VFS should be changed. > Superficially, it seems that changing the locking rules so that the > callee takes i_mutex rather than the caller taking it would help and, > in the case of JFFS2, make f->sem redundant. Does that match your > understanding? Not entirely. I'm not sure I see how we could make such a change -- the VFS relies on i_mutex for a number of operations it performs for itself, so it's hard to see how we could do without using it there. I'm not entirely sure the approach would be sufficient for JFFS2, either. Both jffs2_readdir() and jffs2_lookup() are _still_ going to want to take the lock, whichever lock it is. And recursing back into the file system while the file system is already holding the lock is _still_ going to be considered harmful. I have even less faith that it would end up working for the various other file systems which are affected. Don't let me discourage you from trying, though :) > Clearly we need a short term solution too as we don't want to wait for > VFS locking rules to be renegotiated. The idea of a "lock is owned by > me" check is appealing to me as it is a small, localised change that > would easily be removed if/when the locking we "fixed" properly. > > In the JFFS2 case I imagine this taking the following form: > > - new field in jffs2_inode_info "struct task_struct *sem_owner", > initialised to NULL > - in jffs2_readdir after locking ->sem, set ->sem_owner to current. > - before unlocking ->sem, set ->sem_owner to NULL > - in jffs2_lookup, check if "dir_f->sem_owner == current" > If it does, set a flag reminding us not to drop the lock, > else mutex_lock(&dir_f->sem); You're describing the hack I posted at http://lists.infradead.org/pipermail/linux-mtd/2007-March/017663.html but couldn't bring myself to commit. And it (or some other workaround) would be needed in _every_ affected file system. I'm much happier with putting the workaround in nfsd code, and keeping it out of the individual file systems. -- David Woodhouse Open Source Technology Centre David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. [not found] ` <18578.29049.38904.746701-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2008-08-01 8:50 ` David Woodhouse @ 2008-08-01 10:03 ` Al Viro 2008-08-01 23:11 ` Neil Brown 1 sibling, 1 reply; 55+ messages in thread From: Al Viro @ 2008-08-01 10:03 UTC (permalink / raw) To: Neil Brown Cc: David Woodhouse, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Fri, Aug 01, 2008 at 12:14:17PM +1000, Neil Brown wrote: > It sounds to me like the core problem here is that the locking regime > imposed by the VFS simply isn't suitable for JFFS2 .. and a bunch of > other filesystems. > > This seems to suggest that the VFS should be changed. > Superficially, it seems that changing the locking rules so that the > callee takes i_mutex rather than the caller taking it would help and, > in the case of JFFS2, make f->sem redundant. Does that match your > understanding? Huh? How could that possibly help? You are not changing the sequence of operations on locks, you only slightly modify the timing; how could that possibly eliminate a deadlock? Moreover, how the _hell_ could making exclusion area smaller make f->sem redundant? -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [RFC] Reinstate NFS exportability for JFFS2. 2008-08-01 10:03 ` Al Viro @ 2008-08-01 23:11 ` Neil Brown 0 siblings, 0 replies; 55+ messages in thread From: Neil Brown @ 2008-08-01 23:11 UTC (permalink / raw) To: Al Viro Cc: David Woodhouse, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r On Friday August 1, viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org wrote: > On Fri, Aug 01, 2008 at 12:14:17PM +1000, Neil Brown wrote: > > > It sounds to me like the core problem here is that the locking regime > > imposed by the VFS simply isn't suitable for JFFS2 .. and a bunch of > > other filesystems. > > > > This seems to suggest that the VFS should be changed. > > Superficially, it seems that changing the locking rules so that the > > callee takes i_mutex rather than the caller taking it would help and, > > in the case of JFFS2, make f->sem redundant. Does that match your > > understanding? > > Huh? How could that possibly help? You are not changing the sequence > of operations on locks, you only slightly modify the timing; how could > that possibly eliminate a deadlock? Moreover, how the _hell_ could > making exclusion area smaller make f->sem redundant? By moving the locking of i_mutex inside the filesystem methods, I reasoned that the filesystem would have more control and be able to do things in the order that it wanted. I gathered from the JFFS locking document - admitted a very brief reading - that it the VFS took i_mutex at awkward times for JFFS2, and the for this reason, JFFS2 introduced f->sem, described as: "This is the JFFS2-internal equivalent of the inode mutex i->i_sem." If i_mutex became internal, maybe JFFS2 wouldn't need an internal equivalent. But I did miss an important point. The reason that the current code works with many filesystems is that lookup_one_len *doesn't* take i_mutex before calling ->lookup. It assumes that it is already held. So pushing the locking down into ->lookup would actually break other filesystems without fixing this one. My Bad. So it seems like recursion into the filesystem is a Bad Thing and we need a different approach. I really don't like NFSD having to cache all the names that it gets from readdir and then hand them back to lookup one at a time, though it might end up being the answer.. What about this (I know you're going to hate it)?? A new open flags: O_READDIRPLUS. If ->readdir finds that it is set on filp->f_flags, then it does a lookup on each name it finds and makes sure they are all in the dcache. Then when NFSD calls lookup_one_len inside the filldir callback, it will find the answer in the cache and not need to recurse into the filesystem. We would need some way to be sure that the new dentry didn't get flushed from the dcache before NFSD called lookup_one_len. I guess if ->readdir held a reference on it while filldir was called that would be an adequate guarantee. This would then make O_READDIRPLUS available to userspace too. "ls -l" could use it and thus give the filesystem an opportunity to optimise things. ->readdir could: load a block of the directory schedule reads for all the inodes in parallel attach them to the dcache in series, using whatever locking was needed and no more. I'm actually starting to like the idea. Which means it is probably doomed. 
NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 55+ messages in thread
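To make the O_READDIRPLUS idea a little more concrete: the flag itself and example_iget() below are invented, but the dcache calls are the real primitives a filesystem's ->readdir() could use to pre-instantiate each entry, keeping a reference across the filldir call as Neil suggests, so that nfsd's lookup_one_len() becomes a pure cache hit.

static struct dentry *example_prime_dcache(struct dentry *parent,
                                           const unsigned char *name,
                                           unsigned int namlen, uint32_t ino)
{
        struct qstr q = { .name = name, .len = namlen };
        struct dentry *child;
        struct inode *inode;

        q.hash = full_name_hash(q.name, q.len);

        child = d_lookup(parent, &q);
        if (child)
                return child;           /* already in the dcache */

        child = d_alloc(parent, &q);
        if (!child)
                return NULL;            /* best effort only */

        inode = example_iget(parent->d_inode->i_sb, ino);
        if (IS_ERR(inode)) {
                dput(child);
                return NULL;
        }

        d_add(child, inode);
        return child;   /* caller dputs this after its filldir() call */
}

The filesystem's ->readdir() would call this for each name when (filp->f_flags & O_READDIRPLUS) is set, invoke filldir, then dput() the returned dentry; because it picks the moment itself, it can do all of this with none of its internal locks held.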
* [PATCH 1/4] Factor out nfsd_do_readdir() into its own function 2008-05-02 1:38 ` Neil Brown 2008-05-02 11:37 ` David Woodhouse 2008-07-31 21:54 ` David Woodhouse @ 2008-07-31 21:54 ` David Woodhouse 2008-07-31 21:54 ` [PATCH 2/4] Copy XFS readdir hack into nfsd code, introduce FS_NO_LOOKUP_IN_READDIR flag David Woodhouse ` (2 subsequent siblings) 5 siblings, 0 replies; 55+ messages in thread From: David Woodhouse @ 2008-07-31 21:54 UTC (permalink / raw) To: Neil Brown; +Cc: Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> --- fs/nfsd/vfs.c | 40 ++++++++++++++++++++++++---------------- 1 files changed, 24 insertions(+), 16 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 18060be..3e22634 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1814,6 +1814,29 @@ out: return err; } +static int nfsd_do_readdir(struct file *file, filldir_t func, + struct readdir_cd *cdp, loff_t *offsetp) +{ + int host_err; + + /* + * Read the directory entries. This silly loop is necessary because + * readdir() is not guaranteed to fill up the entire buffer, but + * may choose to do less. + */ + do { + cdp->err = nfserr_eof; /* will be cleared on successful read */ + host_err = vfs_readdir(file, func, cdp); + } while (host_err >=0 && cdp->err == nfs_ok); + + *offsetp = vfs_llseek(file, 0, 1); + + if (host_err) + return nfserrno(host_err); + else + return cdp->err; +} + /* * Read entries from a directory. * The NFSv3/4 verifier we ignore for now. @@ -1823,7 +1846,6 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, struct readdir_cd *cdp, filldir_t func) { __be32 err; - int host_err; struct file *file; loff_t offset = *offsetp; @@ -1837,21 +1859,7 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, goto out_close; } - /* - * Read the directory entries. This silly loop is necessary because - * readdir() is not guaranteed to fill up the entire buffer, but - * may choose to do less. - */ - - do { - cdp->err = nfserr_eof; /* will be cleared on successful read */ - host_err = vfs_readdir(file, func, cdp); - } while (host_err >=0 && cdp->err == nfs_ok); - if (host_err) - err = nfserrno(host_err); - else - err = cdp->err; - *offsetp = vfs_llseek(file, 0, 1); + err = nfsd_do_readdir(file, func, cdp, offsetp); if (err == nfserr_eof || err == nfserr_toosmall) err = nfs_ok; /* can still be found in ->err */ -- 1.5.5.1 -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH 2/4] Copy XFS readdir hack into nfsd code, introduce FS_NO_LOOKUP_IN_READDIR flag 2008-05-02 1:38 ` Neil Brown ` (2 preceding siblings ...) 2008-07-31 21:54 ` [PATCH 1/4] Factor out nfsd_do_readdir() into its own function David Woodhouse @ 2008-07-31 21:54 ` David Woodhouse 2008-07-31 21:55 ` [PATCH 3/4] Switch XFS to using FS_NO_LOOKUP_IN_READDIR, remove local readdir hack David Woodhouse [not found] ` <18458.28833.539314.455215-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 5 siblings, 0 replies; 55+ messages in thread From: David Woodhouse @ 2008-07-31 21:54 UTC (permalink / raw) To: Neil Brown; +Cc: Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd Some file systems with their own internal locking have problems with the way that nfsd calls the ->lookup() method from within a filldir function called from their ->readdir() method. The recursion back into the file system code can cause deadlock. XFS has a fairly hackish solution to this which involves doing the readdir() into a locally-allocated buffer, then going back through it calling the filldir function afterwards. It's not ideal, but it works. It's particularly suboptimal because XFS does this for local file systems too, where it's completely unnecessary. Copy this hack into the NFS code where it can be used only for NFS, and only for file systems which indicate that they need it by setting FS_NO_LOOKUP_IN_READDIR in their fs_type flags. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> --- fs/nfsd/vfs.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++++- include/linux/fs.h | 2 + 2 files changed, 112 insertions(+), 1 deletions(-) diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 3e22634..81c9411 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1814,6 +1814,111 @@ out: return err; } +struct hack_dirent { + u64 ino; + loff_t offset; + int namlen; + unsigned int d_type; + char name[]; +}; + +struct hack_callback { + char *dirent; + size_t len; + size_t used; +}; + +static int nfsd_hack_filldir(void *__buf, const char *name, int namlen, + loff_t offset, u64 ino, unsigned int d_type) +{ + struct hack_callback *buf = __buf; + struct hack_dirent *de = (void *)(buf->dirent + buf->used); + unsigned int reclen; + + reclen = ALIGN(sizeof(struct hack_dirent) + namlen, sizeof(u64)); + if (buf->used + reclen > buf->len) + return -EINVAL; + + de->namlen = namlen; + de->offset = offset; + de->ino = ino; + de->d_type = d_type; + memcpy(de->name, name, namlen); + buf->used += reclen; + + return 0; +} + +static int nfsd_hack_readdir(struct file *file, filldir_t func, + struct readdir_cd *cdp, loff_t *offsetp) +{ + struct hack_callback buf; + struct hack_dirent *de; + int host_err; + int size; + loff_t offset; + + /* + * Try fairly hard to get memory + */ + buf.len = PAGE_CACHE_SIZE; + do { + buf.dirent = kmalloc(buf.len, GFP_KERNEL); + if (buf.dirent) + break; + buf.len >>= 1; + } while (buf.len >= 1024); + + if (!buf.dirent) + return -ENOMEM; + + offset = *offsetp; + cdp->err = nfserr_eof; /* will be cleared on successful read */ + + while (1) { + unsigned int reclen; + + buf.used = 0; + + host_err = vfs_readdir(file, nfsd_hack_filldir, &buf); + if (host_err) + break; + + size = buf.used; + + if (!size) + break; + + + de = (struct hack_dirent *)buf.dirent; + while (size > 0) { + offset = de->offset; + + if (func(cdp, de->name, de->namlen, de->offset, + de->ino, de->d_type)) + goto done; + + if (cdp->err != nfs_ok) + goto done; + + reclen = ALIGN(sizeof(struct hack_dirent) + de->namlen, + sizeof(u64)); + size -= reclen; + de = 
(struct hack_dirent *)((char *)de + reclen); + } + offset = vfs_llseek(file, 0, 1); + } + + done: + kfree(buf.dirent); + + if (host_err) + return nfserrno(host_err); + + *offsetp = offset; + return cdp->err; +} + static int nfsd_do_readdir(struct file *file, filldir_t func, struct readdir_cd *cdp, loff_t *offsetp) { @@ -1859,7 +1964,11 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp, goto out_close; } - err = nfsd_do_readdir(file, func, cdp, offsetp); + if ((file->f_path.dentry->d_inode->i_sb->s_type->fs_flags & + FS_NO_LOOKUP_IN_READDIR)) + err = nfsd_hack_readdir(file, func, cdp, offsetp); + else + err = nfsd_do_readdir(file, func, cdp, offsetp); if (err == nfserr_eof || err == nfserr_toosmall) err = nfs_ok; /* can still be found in ->err */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 580b513..80ca410 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -100,6 +100,8 @@ extern int dir_notify_enable; #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() * during rename() internally. */ +#define FS_NO_LOOKUP_IN_READDIR 65536 /* FS will deadlock if you call its + lookup() method from filldir */ /* * These are the fs-independent mount-flags: up to 32 flags are supported -- 1.5.5.1 -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH 3/4] Switch XFS to using FS_NO_LOOKUP_IN_READDIR, remove local readdir hack 2008-05-02 1:38 ` Neil Brown ` (3 preceding siblings ...) 2008-07-31 21:54 ` [PATCH 2/4] Copy XFS readdir hack into nfsd code, introduce FS_NO_LOOKUP_IN_READDIR flag David Woodhouse @ 2008-07-31 21:55 ` David Woodhouse [not found] ` <18458.28833.539314.455215-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 5 siblings, 0 replies; 55+ messages in thread From: David Woodhouse @ 2008-07-31 21:55 UTC (permalink / raw) To: Neil Brown; +Cc: Christoph Hellwig, linux-fsdevel, linux-nfs, linux-mtd Now that we've moved the readdir hack to the nfsd code, we can just set the FS_NO_LOOKUP_IN_READDIR flag and then remove the local hack from the xfs code. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> --- fs/xfs/linux-2.6/xfs_file.c | 120 ------------------------------------------ fs/xfs/linux-2.6/xfs_super.c | 2 +- 2 files changed, 1 insertions(+), 121 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c index 5f60363..d65d377 100644 --- a/fs/xfs/linux-2.6/xfs_file.c +++ b/fs/xfs/linux-2.6/xfs_file.c @@ -212,7 +212,6 @@ xfs_file_fsync( * Hopefully we'll find a better workaround that allows to use the optimal * version at least for local readdirs for 2.6.25. */ -#if 0 STATIC int xfs_file_readdir( struct file *filp, @@ -244,125 +243,6 @@ xfs_file_readdir( return -error; return 0; } -#else - -struct hack_dirent { - u64 ino; - loff_t offset; - int namlen; - unsigned int d_type; - char name[]; -}; - -struct hack_callback { - char *dirent; - size_t len; - size_t used; -}; - -STATIC int -xfs_hack_filldir( - void *__buf, - const char *name, - int namlen, - loff_t offset, - u64 ino, - unsigned int d_type) -{ - struct hack_callback *buf = __buf; - struct hack_dirent *de = (struct hack_dirent *)(buf->dirent + buf->used); - unsigned int reclen; - - reclen = ALIGN(sizeof(struct hack_dirent) + namlen, sizeof(u64)); - if (buf->used + reclen > buf->len) - return -EINVAL; - - de->namlen = namlen; - de->offset = offset; - de->ino = ino; - de->d_type = d_type; - memcpy(de->name, name, namlen); - buf->used += reclen; - return 0; -} - -STATIC int -xfs_file_readdir( - struct file *filp, - void *dirent, - filldir_t filldir) -{ - struct inode *inode = filp->f_path.dentry->d_inode; - xfs_inode_t *ip = XFS_I(inode); - struct hack_callback buf; - struct hack_dirent *de; - int error; - loff_t size; - int eof = 0; - xfs_off_t start_offset, curr_offset, offset; - - /* - * Try fairly hard to get memory - */ - buf.len = PAGE_CACHE_SIZE; - do { - buf.dirent = kmalloc(buf.len, GFP_KERNEL); - if (buf.dirent) - break; - buf.len >>= 1; - } while (buf.len >= 1024); - - if (!buf.dirent) - return -ENOMEM; - - curr_offset = filp->f_pos; - if (curr_offset == 0x7fffffff) - offset = 0xffffffff; - else - offset = filp->f_pos; - - while (!eof) { - unsigned int reclen; - - start_offset = offset; - - buf.used = 0; - error = -xfs_readdir(ip, &buf, buf.len, &offset, - xfs_hack_filldir); - if (error || offset == start_offset) { - size = 0; - break; - } - - size = buf.used; - de = (struct hack_dirent *)buf.dirent; - while (size > 0) { - curr_offset = de->offset /* & 0x7fffffff */; - if (filldir(dirent, de->name, de->namlen, - curr_offset & 0x7fffffff, - de->ino, de->d_type)) { - goto done; - } - - reclen = ALIGN(sizeof(struct hack_dirent) + de->namlen, - sizeof(u64)); - size -= reclen; - de = (struct hack_dirent *)((char *)de + reclen); - } - } - - done: - if (!error) { - if (size == 0) - filp->f_pos = offset & 0x7fffffff; - else if 
(de) - filp->f_pos = curr_offset; - } - - kfree(buf.dirent); - return error; -} -#endif STATIC int xfs_file_mmap( diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c index 9433812..e44a21d 100644 --- a/fs/xfs/linux-2.6/xfs_super.c +++ b/fs/xfs/linux-2.6/xfs_super.c @@ -1426,7 +1426,7 @@ static struct file_system_type xfs_fs_type = { .name = "xfs", .get_sb = xfs_fs_get_sb, .kill_sb = kill_block_super, - .fs_flags = FS_REQUIRES_DEV, + .fs_flags = FS_REQUIRES_DEV | FS_NO_LOOKUP_IN_READDIR, }; -- 1.5.5.1 -- David Woodhouse Open Source Technology Centre David.Woodhouse@intel.com Intel Corporation ^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH 4/4] [JFFS2] Reinstate NFS exportability
       [not found]             ` <18458.28833.539314.455215-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2008-07-31 21:55               ` David Woodhouse
  0 siblings, 0 replies; 55+ messages in thread
From: David Woodhouse @ 2008-07-31 21:55 UTC (permalink / raw)
  To: Neil Brown
  Cc: Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Now that the readdir/lookup deadlock issues have been dealt with, we can
export JFFS2 file systems again. (For now, you have to specify fsid
manually; we should add a method to the export_ops to handle that too.)

Signed-off-by: David Woodhouse <David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 fs/jffs2/super.c |   61 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 61 insertions(+), 0 deletions(-)

diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index efd4012..2138b26 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -22,6 +22,7 @@
 #include <linux/mtd/super.h>
 #include <linux/ctype.h>
 #include <linux/namei.h>
+#include <linux/exportfs.h>
 #include "compr.h"
 #include "nodelist.h"
 
@@ -62,6 +63,64 @@ static int jffs2_sync_fs(struct super_block *sb, int wait)
 	return 0;
 }
 
+static struct inode *jffs2_nfs_get_inode(struct super_block *sb, uint64_t ino,
+					 uint32_t generation)
+{
+	return jffs2_iget(sb, ino);
+}
+
+static struct dentry *jffs2_fh_to_dentry(struct super_block *sb, struct fid *fid,
+					 int fh_len, int fh_type)
+{
+	return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+				    jffs2_nfs_get_inode);
+}
+
+static struct dentry *jffs2_fh_to_parent(struct super_block *sb, struct fid *fid,
+					 int fh_len, int fh_type)
+{
+	return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+				    jffs2_nfs_get_inode);
+}
+
+static struct dentry *jffs2_get_parent(struct dentry *child)
+{
+	struct jffs2_inode_info *f;
+	struct dentry *parent;
+	struct inode *inode;
+	uint32_t pino;
+
+	if (!S_ISDIR(child->d_inode->i_mode))
+		return ERR_PTR(-EACCES);
+
+	f = JFFS2_INODE_INFO(child->d_inode);
+
+	pino = f->inocache->pino_nlink;
+
+	JFFS2_DEBUG("Parent of directory ino #%u is #%u\n",
+		    f->inocache->ino, pino);
+
+	inode = jffs2_iget(child->d_inode->i_sb, pino);
+	if (IS_ERR(inode)) {
+		JFFS2_ERROR("Failed to get inode #%u: %ld\n", pino,
+			    PTR_ERR(inode));
+		return ERR_CAST(inode);
+	}
+
+	parent = d_alloc_anon(inode);
+	if (!parent) {
+		iput(inode);
+		parent = ERR_PTR(-ENOMEM);
+	}
+	return parent;
+}
+
+static struct export_operations jffs2_export_ops = {
+	.get_parent = jffs2_get_parent,
+	.fh_to_dentry = jffs2_fh_to_dentry,
+	.fh_to_parent = jffs2_fh_to_parent,
+};
+
 static const struct super_operations jffs2_super_operations =
 {
 	.alloc_inode =	jffs2_alloc_inode,
@@ -104,6 +163,7 @@ static int jffs2_fill_super(struct super_block *sb, void *data, int silent)
 	spin_lock_init(&c->inocache_lock);
 
 	sb->s_op = &jffs2_super_operations;
+	sb->s_export_op = &jffs2_export_ops;
 	sb->s_flags = sb->s_flags | MS_NOATIME;
 	sb->s_xattr = jffs2_xattr_handlers;
 #ifdef CONFIG_JFFS2_FS_POSIX_ACL
@@ -161,6 +221,7 @@ static struct file_system_type jffs2_fs_type = {
 	.name =		"jffs2",
 	.get_sb =	jffs2_get_sb,
 	.kill_sb =	jffs2_kill_sb,
+	.fs_flags =	FS_NO_LOOKUP_IN_READDIR,
 };
 
 static int __init init_jffs2_fs(void)
-- 
1.5.5.1

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org              Intel Corporation

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 55+ messages in thread
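
[Editorial aside, not part of the archived messages.] The remark above about
specifying fsid manually refers to the server's export configuration. JFFS2
sits on an MTD device rather than a block device, so the usual device-number
derived fsid is not available and the exports(5) entry has to supply one
explicitly. A hypothetical /etc/exports line might look like the following;
the mount point, client range and fsid value are placeholders chosen for this
illustration:

	# Illustrative only: export a JFFS2 mount with a manually chosen fsid.
	/mnt/jffs2   192.168.1.0/24(rw,sync,no_subtree_check,fsid=42)
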
end of thread, other threads:[~2008-08-17 18:22 UTC | newest]
Thread overview: 55+ messages
-- links below jump to the message on this page --
2008-05-01 19:42 [RFC] Reinstate NFS exportability for JFFS2 David Woodhouse
2008-05-01 20:48 ` Christoph Hellwig
2008-05-01 22:44 ` David Woodhouse
2008-05-02 1:38 ` Neil Brown
2008-05-02 11:37 ` David Woodhouse
[not found] ` <1209728238.25560.686.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>
2008-05-02 14:08 ` J. Bruce Fields
2008-07-31 21:54 ` David Woodhouse
2008-08-01 0:16 ` Neil Brown
[not found] ` <18578.21997.529551.676627-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2008-08-01 0:40 ` David Woodhouse
[not found] ` <1217551230.3719.15.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org>
2008-08-01 0:52 ` David Woodhouse
2008-08-01 0:53 ` Chuck Lever
[not found] ` <76bd70e30807311753m2785c6d3kd82edd1fe8b5f8b7-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-08-01 1:00 ` David Woodhouse
[not found] ` <1217552437.3719.30.camel-Fexsq3y4057IgHVZqg5X0TlWvGAXklZc@public.gmane.org>
2008-08-01 1:31 ` Chuck Lever
2008-08-01 8:13 ` David Woodhouse
2008-08-01 13:35 ` David Woodhouse
[not found] ` <1217597759.3454.356.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>
2008-08-01 13:56 ` David Woodhouse
2008-08-01 16:05 ` Chuck Lever
2008-08-01 16:19 ` David Woodhouse
2008-08-01 17:47 ` Chuck Lever
2008-08-02 18:26 ` J. Bruce Fields
[not found] ` <20080802182644.GE30454-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2008-08-02 20:42 ` David Woodhouse
2008-08-02 21:33 ` J. Bruce Fields
[not found] ` <20080802213337.GA2833-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2008-08-03 8:39 ` David Woodhouse
2008-08-03 11:56 ` Neil Brown
2008-08-03 17:15 ` Chuck Lever
2008-08-04 1:03 ` Neil Brown
[not found] ` <18582.21855.2092.903688-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2008-08-04 6:19 ` Chuck Lever
2008-08-05 8:51 ` Dave Chinner
2008-08-05 8:59 ` David Woodhouse
2008-08-05 9:47 ` Dave Chinner
2008-08-05 23:06 ` Neil Brown
2008-08-06 0:08 ` Dave Chinner
2008-08-06 19:56 ` J. Bruce Fields
2008-08-06 20:10 ` David Woodhouse
[not found] ` <1218053443.5111.148.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>
2008-08-09 16:47 ` David Woodhouse
2008-08-09 19:55 ` David Woodhouse
[not found] ` <1218311710.26926.125.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>
2008-08-09 20:01 ` [PATCH 1/4] Factor out nfsd_do_readdir() into its own function David Woodhouse
[not found] ` <1218312114.5063.5.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>
2008-08-09 20:07 ` Christoph Hellwig
2008-08-09 20:02 ` [PATCH 2/4] Copy XFS readdir hack into nfsd code David Woodhouse
2008-08-09 20:08 ` Christoph Hellwig
2008-08-09 20:03 ` [PATCH 3/4] Remove XFS buffered readdir hack David Woodhouse
[not found] ` <1218312191.5063.8.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>
2008-08-09 20:09 ` Christoph Hellwig
2008-08-09 20:03 ` [PATCH 4/4] Reinstate NFS exportability David Woodhouse
[not found] ` <1218312213.5063.9.camel-ZP4jZrcIevRpWr+L1FloEB2eb7JE58TQ@public.gmane.org>
2008-08-09 20:10 ` Christoph Hellwig
2008-08-17 18:22 ` [RFC] Reinstate NFS exportability for JFFS2 Andreas Dilger
2008-08-04 18:41 ` J. Bruce Fields
2008-08-04 22:37 ` Neil Brown
2008-08-01 2:14 ` Neil Brown
[not found] ` <18578.29049.38904.746701-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2008-08-01 8:50 ` David Woodhouse
2008-08-01 10:03 ` Al Viro
2008-08-01 23:11 ` Neil Brown
2008-07-31 21:54 ` [PATCH 1/4] Factor out nfsd_do_readdir() into its own function David Woodhouse
2008-07-31 21:54 ` [PATCH 2/4] Copy XFS readdir hack into nfsd code, introduce FS_NO_LOOKUP_IN_READDIR flag David Woodhouse
2008-07-31 21:55 ` [PATCH 3/4] Switch XFS to using FS_NO_LOOKUP_IN_READDIR, remove local readdir hack David Woodhouse
[not found] ` <18458.28833.539314.455215-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2008-07-31 21:55 ` [PATCH 4/4] [JFFS2] Reinstate NFS exportability David Woodhouse