* why is i_ino unsigned long, anyway?
@ 2013-09-12 16:03 J. Bruce Fields
2013-09-12 19:33 ` Al Viro
0 siblings, 1 reply; 24+ messages in thread
From: J. Bruce Fields @ 2013-09-12 16:03 UTC (permalink / raw)
To: linux-fsdevel; +Cc: Al Viro, linux-nfs, sandeen
Somebody noticed an "ls -l" over nfs failing on entries with inode
numbers greater than 2^32 on a 32-bit NFS server. The cause is some
code that tries to compare i_ino to the full 64-bit inode number.
I think the following will fix it, but I'm curious: why is i_ino
"unsigned long", anyway? Is there something know that depends on that,
or is it just that the sheer number of users makes it too scary to
change?
--b.
commit 0cc784eb430285535ae7a79dd5133ab66e9ce839
Author: J. Bruce Fields <bfields@redhat.com>
Date: Tue Sep 10 11:41:12 2013 -0400
exportfs: fix 32-bit nfsd handling of 64-bit inode numbers
Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a
32-bit NFS server exporting a very large XFS filesystem, when the
server's cache is cold (so the inodes in question are not in cache).
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 293bc2e..6a79bb8 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -215,7 +215,7 @@ struct getdents_callback {
struct dir_context ctx;
char *name; /* name that was found. It already points to a
buffer NAME_MAX+1 is size */
- unsigned long ino; /* the inum we are looking for */
+ u64 ino; /* the inum we are looking for */
int found; /* inode matched? */
int sequence; /* sequence counter */
};
@@ -255,10 +255,10 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
struct inode *dir = path->dentry->d_inode;
int error;
struct file *file;
+ struct kstat stat;
struct getdents_callback buffer = {
.ctx.actor = filldir_one,
.name = name,
- .ino = child->d_inode->i_ino
};
error = -ENOTDIR;
@@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child)
if (!dir->i_fop)
goto out;
/*
+ * inode->i_ino is unsigned long, kstat->ino is u64, so the
+ * former would be insufficient on 32-bit hosts when the
+ * filesystem supports 64-bit inode numbers. So we need to
+ * actually call ->getattr, not just read i_ino:
+ */
+ error = vfs_getattr(path, &stat);
+ if (error)
+ return error;
+ buffer.ino = stat.ino;
+ /*
* Open the directory ...
*/
file = dentry_open(path, O_RDONLY, cred);
^ permalink raw reply related [flat|nested] 24+ messages in thread* Re: why is i_ino unsigned long, anyway? 2013-09-12 16:03 why is i_ino unsigned long, anyway? J. Bruce Fields @ 2013-09-12 19:33 ` Al Viro [not found] ` <20130912193328.GP13318-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> 0 siblings, 1 reply; 24+ messages in thread From: Al Viro @ 2013-09-12 19:33 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-fsdevel, linux-nfs, sandeen On Thu, Sep 12, 2013 at 12:03:24PM -0400, J. Bruce Fields wrote: > Somebody noticed an "ls -l" over nfs failing on entries with inode > numbers greater than 2^32 on a 32-bit NFS server. The cause is some > code that tries to compare i_ino to the full 64-bit inode number. > > I think the following will fix it, but I'm curious: why is i_ino > "unsigned long", anyway? Is there something know that depends on that, > or is it just that the sheer number of users makes it too scary to > change? i_ino use is entirely up to filesystem; it may be used by library helpers, provided that the choice of using or not using those is, again, up to filesystem in question. NFSD has no damn business looking at it; library helpers in fs/exportfs might, but that makes them not suitable for use by filesystems without inode numbers or with 64bit ones. The reason why it's there at all is that it serves as convenient icache search key for many filesystems. IOW, it's used by iget_locked() and avoiding the overhead of 64bit comparisons on 32bit hosts is the main reason to avoid making it u64. Again, no fs-independent code has any business looking at it, 64bit or not. From the VFS point of view there is no such thing as inode number. And get_name() is just a library helper. For many fs types it works as suitable ->s_export_op->get_name() instance, but decision to use it or not belongs to filesystem in question and frankly, it's probably better to provide an instance of your own anyway. ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <20130912193328.GP13318-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>]
* Re: why is i_ino unsigned long, anyway? [not found] ` <20130912193328.GP13318-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> @ 2013-09-29 11:54 ` Christoph Hellwig [not found] ` <20130929115454.GA3953-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> 0 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2013-09-29 11:54 UTC (permalink / raw) To: Al Viro Cc: J. Bruce Fields, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, sandeen-H+wXaHxf7aLQT0dZR+AlfA On Thu, Sep 12, 2013 at 08:33:28PM +0100, Al Viro wrote: > i_ino use is entirely up to filesystem; it may be used by library helpers, > provided that the choice of using or not using those is, again, up to > filesystem in question. With more and more filesystems using large inode numbers I'm starting to wonder if this still makes sense. But given that it's been this way for a long time we should have more documentation of it for sure. > NFSD has no damn business looking at it; library > helpers in fs/exportfs might, but that makes them not suitable for use > by filesystems without inode numbers or with 64bit ones. > > The reason why it's there at all is that it serves as convenient icache > search key for many filesystems. IOW, it's used by iget_locked() and > avoiding the overhead of 64bit comparisons on 32bit hosts is the main > reason to avoid making it u64. > > Again, no fs-independent code has any business looking at it, 64bit or > not. From the VFS point of view there is no such thing as inode number. > And get_name() is just a library helper. For many fs types it works > as suitable ->s_export_op->get_name() instance, but decision to use it > or not belongs to filesystem in question and frankly, it's probably better > to provide an instance of your own anyway. Given that these days most exportable filesystems use 64-bit inode numbers I think we should put the patch from Bruce in. Nevermind that it's in a slow path, so the overhead of vfs_getattr really doesn't hurt. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <20130929115454.GA3953-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>]
* Re: why is i_ino unsigned long, anyway? [not found] ` <20130929115454.GA3953-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> @ 2013-10-02 14:25 ` J. Bruce Fields [not found] ` <20131002142527.GD14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> 0 siblings, 1 reply; 24+ messages in thread From: J. Bruce Fields @ 2013-10-02 14:25 UTC (permalink / raw) To: Christoph Hellwig Cc: Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, sandeen-H+wXaHxf7aLQT0dZR+AlfA On Sun, Sep 29, 2013 at 04:54:54AM -0700, Christoph Hellwig wrote: > On Thu, Sep 12, 2013 at 08:33:28PM +0100, Al Viro wrote: > > i_ino use is entirely up to filesystem; it may be used by library helpers, > > provided that the choice of using or not using those is, again, up to > > filesystem in question. > > With more and more filesystems using large inode numbers I'm starting to > wonder if this still makes sense. But given that it's been this way for > a long time we should have more documentation of it for sure. > > > NFSD has no damn business looking at it; library > > helpers in fs/exportfs might, but that makes them not suitable for use > > by filesystems without inode numbers or with 64bit ones. > > > > The reason why it's there at all is that it serves as convenient icache > > search key for many filesystems. IOW, it's used by iget_locked() and > > avoiding the overhead of 64bit comparisons on 32bit hosts is the main > > reason to avoid making it u64. > > > > Again, no fs-independent code has any business looking at it, 64bit or > > not. From the VFS point of view there is no such thing as inode number. > > And get_name() is just a library helper. For many fs types it works > > as suitable ->s_export_op->get_name() instance, but decision to use it > > or not belongs to filesystem in question and frankly, it's probably better > > to provide an instance of your own anyway. > > Given that these days most exportable filesystems use 64-bit inode > numbers I think we should put the patch from Bruce in. Nevermind that > it's in a slow path, so the overhead of vfs_getattr really doesn't hurt. Calling vfs_getattr adds a security_inode_getattr() call that wasn't there before. Any chance of that being a problem? If so then it's no huge code duplication to it by hand: if (inode->i_op->getattr) inode->i_op->getattr(path->mnt, path->dentry, &stat); else generic_fillattr(inode, &stat); --b. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <20131002142527.GD14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>]
* Re: why is i_ino unsigned long, anyway? [not found] ` <20131002142527.GD14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2013-10-02 15:43 ` J. Bruce Fields [not found] ` <20131002154320.GE14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> 2013-10-02 16:05 ` Christoph Hellwig 1 sibling, 1 reply; 24+ messages in thread From: J. Bruce Fields @ 2013-10-02 15:43 UTC (permalink / raw) To: Christoph Hellwig Cc: Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, sandeen-H+wXaHxf7aLQT0dZR+AlfA On Wed, Oct 02, 2013 at 10:25:27AM -0400, J. Bruce Fields wrote: > On Sun, Sep 29, 2013 at 04:54:54AM -0700, Christoph Hellwig wrote: > > On Thu, Sep 12, 2013 at 08:33:28PM +0100, Al Viro wrote: > > > i_ino use is entirely up to filesystem; it may be used by library helpers, > > > provided that the choice of using or not using those is, again, up to > > > filesystem in question. > > > > With more and more filesystems using large inode numbers I'm starting to > > wonder if this still makes sense. But given that it's been this way for > > a long time we should have more documentation of it for sure. > > > > > NFSD has no damn business looking at it; library > > > helpers in fs/exportfs might, but that makes them not suitable for use > > > by filesystems without inode numbers or with 64bit ones. > > > > > > The reason why it's there at all is that it serves as convenient icache > > > search key for many filesystems. IOW, it's used by iget_locked() and > > > avoiding the overhead of 64bit comparisons on 32bit hosts is the main > > > reason to avoid making it u64. > > > > > > Again, no fs-independent code has any business looking at it, 64bit or > > > not. From the VFS point of view there is no such thing as inode number. > > > And get_name() is just a library helper. For many fs types it works > > > as suitable ->s_export_op->get_name() instance, but decision to use it > > > or not belongs to filesystem in question and frankly, it's probably better > > > to provide an instance of your own anyway. > > > > Given that these days most exportable filesystems use 64-bit inode > > numbers I think we should put the patch from Bruce in. Nevermind that > > it's in a slow path, so the overhead of vfs_getattr really doesn't hurt. > > Calling vfs_getattr adds a security_inode_getattr() call that wasn't > there before. Any chance of that being a problem? > > If so then it's no huge code duplication to it by hand: > > if (inode->i_op->getattr) > inode->i_op->getattr(path->mnt, path->dentry, &stat); > else > generic_fillattr(inode, &stat); Just for fun, I took a stab at an xfs-specific method. Pretty much untested and probably wrong as I know nothing about xfs, but it does seem like it allows some minor simplifications compared to the common helper. But as Christoph says other filesystems have the same problem so we probably want to fix the common helper anyway. --b. commit 1d2c2fa899d94aef5283aa66f4bd453305a62030 Author: J. Bruce Fields <bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Date: Fri Sep 20 16:14:53 2013 -0400 exportfs: fix 32-bit nfsd handling of 64-bit inode numbers Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a 32-bit NFS server exporting a very large XFS filesystem, when the server's cache is cold (so the inodes in question are not in cache). Reported-by: Trevor Cordes <trevor-CGgvEiIIHbIFyWsGDH9TEg@public.gmane.org> Signed-off-by: J. Bruce Fields <bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c index 066df42..e76a288 100644 --- a/fs/xfs/xfs_export.c +++ b/fs/xfs/xfs_export.c @@ -31,6 +31,7 @@ #include "xfs_inode_item.h" #include "xfs_trace.h" #include "xfs_icache.h" +#include "xfs_dir2_priv.h" /* * Note that we only accept fileids which are long enough rather than allow @@ -240,10 +241,90 @@ xfs_fs_nfs_commit_metadata( return _xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, NULL); } +struct getdents_callback { + struct dir_context ctx; + char *name; /* name that was found. It already points to a + buffer NAME_MAX+1 is size */ + u64 ino; /* the inum we are looking for */ + int found; /* inode matched? */ + int sequence; /* sequence counter */ +}; + +/* + * A rather strange filldir function to capture + * the name matching the specified inode number. + */ +static int filldir_one(void * __buf, const char * name, int len, + loff_t pos, u64 ino, unsigned int d_type) +{ + struct getdents_callback *buf = __buf; + int result = 0; + + buf->sequence++; + if (buf->ino == ino && len <= NAME_MAX) { + memcpy(buf->name, name, len); + buf->name[len] = '\0'; + buf->found = 1; + result = -1; + } + return result; +} + +STATIC int +xfs_fs_get_name( + struct dentry *parent, + char *name, + struct dentry *child) +{ + struct inode *dir = parent->d_inode; + int error; + struct xfs_inode *ip = XFS_I(child->d_inode); + struct getdents_callback buffer = { + .ctx.actor = filldir_one, + .name = name, + .ino = ip->i_ino + }; + size_t bufsize; + + error = -ENOTDIR; + if (!dir || !S_ISDIR(dir->i_mode)) + goto out; + + /* See comment in fs/xfs/xfs_file.c:xfs_file_readdir */ + bufsize = (size_t)min_t(loff_t, 32768, ip->i_d.di_size); + + buffer.sequence = 0; + while (1) { + int old_seq = buffer.sequence; + + error = mutex_lock_killable(&dir->i_mutex); + if (error) + goto out; + if (!IS_DEADDIR(dir)) + error = -xfs_readdir(ip, &buffer.ctx, bufsize); + mutex_unlock(&dir->i_mutex); + if (buffer.found) { + error = 0; + break; + } + + if (error) + break; + + error = -ENOENT; + if (old_seq == buffer.sequence) + break; + } + +out: + return error; +} + const struct export_operations xfs_export_operations = { .encode_fh = xfs_fs_encode_fh, .fh_to_dentry = xfs_fs_fh_to_dentry, .fh_to_parent = xfs_fs_fh_to_parent, .get_parent = xfs_fs_get_parent, .commit_metadata = xfs_fs_nfs_commit_metadata, + .get_name = xfs_fs_get_name }; -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 24+ messages in thread
[parent not found: <20131002154320.GE14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>]
* Re: why is i_ino unsigned long, anyway? [not found] ` <20131002154320.GE14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2013-10-02 16:04 ` Christoph Hellwig 2013-10-02 18:14 ` J. Bruce Fields 0 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2013-10-02 16:04 UTC (permalink / raw) To: J. Bruce Fields Cc: Christoph Hellwig, Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, sandeen-H+wXaHxf7aLQT0dZR+AlfA On Wed, Oct 02, 2013 at 11:43:20AM -0400, J. Bruce Fields wrote: > > Just for fun, I took a stab at an xfs-specific method. > > Pretty much untested and probably wrong as I know nothing about xfs, but > it does seem like it allows some minor simplifications compared to the > common helper. > > But as Christoph says other filesystems have the same problem so we > probably want to fix the common helper anyway. You'll need this in at least ext4, btrfs, gfs2, ocfs2 and probably more. So fixing this for real sounds like the best deal. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: why is i_ino unsigned long, anyway? 2013-10-02 16:04 ` Christoph Hellwig @ 2013-10-02 18:14 ` J. Bruce Fields 0 siblings, 0 replies; 24+ messages in thread From: J. Bruce Fields @ 2013-10-02 18:14 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Al Viro, linux-fsdevel, linux-nfs, sandeen On Wed, Oct 02, 2013 at 09:04:48AM -0700, Christoph Hellwig wrote: > On Wed, Oct 02, 2013 at 11:43:20AM -0400, J. Bruce Fields wrote: > > > > Just for fun, I took a stab at an xfs-specific method. > > > > Pretty much untested and probably wrong as I know nothing about xfs, but > > it does seem like it allows some minor simplifications compared to the > > common helper. > > > > But as Christoph says other filesystems have the same problem so we > > probably want to fix the common helper anyway. > > You'll need this in at least ext4, btrfs, gfs2, ocfs2 and probably more. > So fixing this for real sounds like the best deal. Based on "git grep '\bget_name\b' fs/".... Actually gfs2 and btrfs already have their own get_name methods. They look like they probably handle 64-bit inodes fine. That leaves xfs, ext4, and ocfs2. Anyway I'm still in favor of fixing the helper. --b. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: why is i_ino unsigned long, anyway? [not found] ` <20131002142527.GD14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> 2013-10-02 15:43 ` J. Bruce Fields @ 2013-10-02 16:05 ` Christoph Hellwig [not found] ` <20131002160527.GB23875-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> 1 sibling, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2013-10-02 16:05 UTC (permalink / raw) To: J. Bruce Fields Cc: Christoph Hellwig, Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, sandeen-H+wXaHxf7aLQT0dZR+AlfA On Wed, Oct 02, 2013 at 10:25:27AM -0400, J. Bruce Fields wrote: > If so then it's no huge code duplication to it by hand: > > if (inode->i_op->getattr) > inode->i_op->getattr(path->mnt, path->dentry, &stat); > else > generic_fillattr(inode, &stat); Maybe make that a vfs_getattr_nosec and let vfs_getattr call it? Including a proper kerneldoc comment explaining when to use it, please. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <20131002160527.GB23875-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>]
* Re: why is i_ino unsigned long, anyway? [not found] ` <20131002160527.GB23875-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> @ 2013-10-02 17:53 ` J. Bruce Fields [not found] ` <20131002175328.GF14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> 2013-10-02 18:47 ` why is i_ino unsigned long, anyway? Sage Weil 0 siblings, 2 replies; 24+ messages in thread From: J. Bruce Fields @ 2013-10-02 17:53 UTC (permalink / raw) To: Christoph Hellwig Cc: Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, sandeen-H+wXaHxf7aLQT0dZR+AlfA On Wed, Oct 02, 2013 at 09:05:27AM -0700, Christoph Hellwig wrote: > On Wed, Oct 02, 2013 at 10:25:27AM -0400, J. Bruce Fields wrote: > > If so then it's no huge code duplication to it by hand: > > > > if (inode->i_op->getattr) > > inode->i_op->getattr(path->mnt, path->dentry, &stat); > > else > > generic_fillattr(inode, &stat); > > Maybe make that a vfs_getattr_nosec and let vfs_getattr call it? > > Including a proper kerneldoc comment explaining when to use it, please. Something like this? --b. commit 8418a41b7192cf2f372ae091207adb29a088f9a0 Author: J. Bruce Fields <bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Date: Tue Sep 10 11:41:12 2013 -0400 exportfs: fix 32-bit nfsd handling of 64-bit inode numbers Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a 32-bit NFS server exporting a very large XFS filesystem, when the server's cache is cold (so the inodes in question are not in cache). Reported-by: Trevor Cordes <trevor-CGgvEiIIHbIFyWsGDH9TEg@public.gmane.org> Signed-off-by: J. Bruce Fields <bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c index 293bc2e..811831a 100644 --- a/fs/exportfs/expfs.c +++ b/fs/exportfs/expfs.c @@ -215,7 +215,7 @@ struct getdents_callback { struct dir_context ctx; char *name; /* name that was found. It already points to a buffer NAME_MAX+1 is size */ - unsigned long ino; /* the inum we are looking for */ + u64 ino; /* the inum we are looking for */ int found; /* inode matched? */ int sequence; /* sequence counter */ }; @@ -255,10 +255,10 @@ static int get_name(const struct path *path, char *name, struct dentry *child) struct inode *dir = path->dentry->d_inode; int error; struct file *file; + struct kstat stat; struct getdents_callback buffer = { .ctx.actor = filldir_one, .name = name, - .ino = child->d_inode->i_ino }; error = -ENOTDIR; @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child) if (!dir->i_fop) goto out; /* + * inode->i_ino is unsigned long, kstat->ino is u64, so the + * former would be insufficient on 32-bit hosts when the + * filesystem supports 64-bit inode numbers. So we need to + * actually call ->getattr, not just read i_ino: + */ + error = vfs_getattr_nosec(path, &stat); + if (error) + return error; + buffer.ino = stat.ino; + /* * Open the directory ... */ file = dentry_open(path, O_RDONLY, cred); diff --git a/fs/stat.c b/fs/stat.c index 04ce1ac..71a39e8 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -37,14 +37,21 @@ void generic_fillattr(struct inode *inode, struct kstat *stat) EXPORT_SYMBOL(generic_fillattr); -int vfs_getattr(struct path *path, struct kstat *stat) +/** + * vfs_getattr_nosec - getattr without security checks + * @path: file to get attributes from + * @stat: structure to return attributes in + * + * Get attributes without calling security_inode_getattr. + * + * Currently the only caller other than vfs_getattr is internal to the + * filehandle lookup code, which uses only the inode number and returns + * no attributes to any user. Any other code probably wants + * vfs_getattr. + */ +int vfs_getattr_nosec(struct path *path, struct kstat *stat) { struct inode *inode = path->dentry->d_inode; - int retval; - - retval = security_inode_getattr(path->mnt, path->dentry); - if (retval) - return retval; if (inode->i_op->getattr) return inode->i_op->getattr(path->mnt, path->dentry, stat); @@ -53,6 +60,18 @@ int vfs_getattr(struct path *path, struct kstat *stat) return 0; } +EXPORT_SYMBOL_GPL(vfs_getattr_nosec); + +int vfs_getattr(struct path *path, struct kstat *stat) +{ + int retval; + + retval = security_inode_getattr(path->mnt, path->dentry); + if (retval) + return retval; + return vfs_getattr_nosec(path, stat); +} + EXPORT_SYMBOL(vfs_getattr); int vfs_fstat(unsigned int fd, struct kstat *stat) diff --git a/include/linux/fs.h b/include/linux/fs.h index 9818747..5a51faa 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2500,6 +2500,7 @@ extern int page_symlink(struct inode *inode, const char *symname, int len); extern const struct inode_operations page_symlink_inode_operations; extern int generic_readlink(struct dentry *, char __user *, int); extern void generic_fillattr(struct inode *, struct kstat *); +int vfs_getattr_nosec(struct path *path, struct kstat *stat); extern int vfs_getattr(struct path *, struct kstat *); void __inode_add_bytes(struct inode *inode, loff_t bytes); void inode_add_bytes(struct inode *inode, loff_t bytes); -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 24+ messages in thread
[parent not found: <20131002175328.GF14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>]
* Re: why is i_ino unsigned long, anyway? [not found] ` <20131002175328.GF14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2013-10-02 17:57 ` Christoph Hellwig 2013-10-02 21:07 ` J. Bruce Fields 0 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2013-10-02 17:57 UTC (permalink / raw) To: J. Bruce Fields Cc: Christoph Hellwig, Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, sandeen-H+wXaHxf7aLQT0dZR+AlfA > Something like this? Looks good, although I'd split it into two separate patches for the VFS and exportfs bits. Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: why is i_ino unsigned long, anyway? 2013-10-02 17:57 ` Christoph Hellwig @ 2013-10-02 21:07 ` J. Bruce Fields [not found] ` <20131002210736.GA20598-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> 0 siblings, 1 reply; 24+ messages in thread From: J. Bruce Fields @ 2013-10-02 21:07 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Al Viro, linux-fsdevel, linux-nfs, sandeen On Wed, Oct 02, 2013 at 10:57:27AM -0700, Christoph Hellwig wrote: > > Something like this? > > Looks good, although I'd split it into two separate patches for the VFS > and exportfs bits. Hm, this might be to the point of leaving the first patch a little unmotivated, but OK, maybe so. Thanks for the review! --b. ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <20131002210736.GA20598-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>]
* [PATCH 1/2] vfs: split out vfs_getattr_nosec [not found] ` <20131002210736.GA20598-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2013-10-02 21:28 ` J. Bruce Fields 2013-10-02 21:28 ` [PATCH 2/2] exportfs: fix 32-bit nfsd handling of 64-bit inode numbers J. Bruce Fields 0 siblings, 1 reply; 24+ messages in thread From: J. Bruce Fields @ 2013-10-02 21:28 UTC (permalink / raw) To: Al Viro Cc: Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, sandeen-H+wXaHxf7aLQT0dZR+AlfA, J. Bruce Fields From: "J. Bruce Fields" <bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> The filehandle lookup code wants this version of getattr. Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> Signed-off-by: J. Bruce Fields <bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- fs/stat.c | 31 +++++++++++++++++++++++++------ include/linux/fs.h | 1 + 2 files changed, 26 insertions(+), 6 deletions(-) diff --git a/fs/stat.c b/fs/stat.c index 04ce1ac..71a39e8 100644 --- a/fs/stat.c +++ b/fs/stat.c @@ -37,14 +37,21 @@ void generic_fillattr(struct inode *inode, struct kstat *stat) EXPORT_SYMBOL(generic_fillattr); -int vfs_getattr(struct path *path, struct kstat *stat) +/** + * vfs_getattr_nosec - getattr without security checks + * @path: file to get attributes from + * @stat: structure to return attributes in + * + * Get attributes without calling security_inode_getattr. + * + * Currently the only caller other than vfs_getattr is internal to the + * filehandle lookup code, which uses only the inode number and returns + * no attributes to any user. Any other code probably wants + * vfs_getattr. + */ +int vfs_getattr_nosec(struct path *path, struct kstat *stat) { struct inode *inode = path->dentry->d_inode; - int retval; - - retval = security_inode_getattr(path->mnt, path->dentry); - if (retval) - return retval; if (inode->i_op->getattr) return inode->i_op->getattr(path->mnt, path->dentry, stat); @@ -53,6 +60,18 @@ int vfs_getattr(struct path *path, struct kstat *stat) return 0; } +EXPORT_SYMBOL_GPL(vfs_getattr_nosec); + +int vfs_getattr(struct path *path, struct kstat *stat) +{ + int retval; + + retval = security_inode_getattr(path->mnt, path->dentry); + if (retval) + return retval; + return vfs_getattr_nosec(path, stat); +} + EXPORT_SYMBOL(vfs_getattr); int vfs_fstat(unsigned int fd, struct kstat *stat) diff --git a/include/linux/fs.h b/include/linux/fs.h index 9818747..5a51faa 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2500,6 +2500,7 @@ extern int page_symlink(struct inode *inode, const char *symname, int len); extern const struct inode_operations page_symlink_inode_operations; extern int generic_readlink(struct dentry *, char __user *, int); extern void generic_fillattr(struct inode *, struct kstat *); +int vfs_getattr_nosec(struct path *path, struct kstat *stat); extern int vfs_getattr(struct path *, struct kstat *); void __inode_add_bytes(struct inode *inode, loff_t bytes); void inode_add_bytes(struct inode *inode, loff_t bytes); -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH 2/2] exportfs: fix 32-bit nfsd handling of 64-bit inode numbers 2013-10-02 21:28 ` [PATCH 1/2] vfs: split out vfs_getattr_nosec J. Bruce Fields @ 2013-10-02 21:28 ` J. Bruce Fields [not found] ` <1380749295-20854-2-git-send-email-bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 24+ messages in thread From: J. Bruce Fields @ 2013-10-02 21:28 UTC (permalink / raw) To: Al Viro Cc: Christoph Hellwig, linux-fsdevel, linux-nfs, sandeen, J. Bruce Fields From: "J. Bruce Fields" <bfields@redhat.com> Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a 32-bit NFS server exporting a very large XFS filesystem, when the server's cache is cold (so the inodes in question are not in cache). Reviewed-by: Christoph Hellwig <hch@lst.de> Reported-by: Trevor Cordes <trevor@tecnopolis.ca> Signed-off-by: J. Bruce Fields <bfields@redhat.com> --- fs/exportfs/expfs.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c index 293bc2e..811831a 100644 --- a/fs/exportfs/expfs.c +++ b/fs/exportfs/expfs.c @@ -215,7 +215,7 @@ struct getdents_callback { struct dir_context ctx; char *name; /* name that was found. It already points to a buffer NAME_MAX+1 is size */ - unsigned long ino; /* the inum we are looking for */ + u64 ino; /* the inum we are looking for */ int found; /* inode matched? */ int sequence; /* sequence counter */ }; @@ -255,10 +255,10 @@ static int get_name(const struct path *path, char *name, struct dentry *child) struct inode *dir = path->dentry->d_inode; int error; struct file *file; + struct kstat stat; struct getdents_callback buffer = { .ctx.actor = filldir_one, .name = name, - .ino = child->d_inode->i_ino }; error = -ENOTDIR; @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child) if (!dir->i_fop) goto out; /* + * inode->i_ino is unsigned long, kstat->ino is u64, so the + * former would be insufficient on 32-bit hosts when the + * filesystem supports 64-bit inode numbers. So we need to + * actually call ->getattr, not just read i_ino: + */ + error = vfs_getattr_nosec(path, &stat); + if (error) + return error; + buffer.ino = stat.ino; + /* * Open the directory ... */ file = dentry_open(path, O_RDONLY, cred); -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 24+ messages in thread
[parent not found: <1380749295-20854-2-git-send-email-bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 2/2] exportfs: fix 32-bit nfsd handling of 64-bit inode numbers [not found] ` <1380749295-20854-2-git-send-email-bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2013-10-04 22:12 ` J. Bruce Fields [not found] ` <20131004221216.GC18051-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> 0 siblings, 1 reply; 24+ messages in thread From: J. Bruce Fields @ 2013-10-04 22:12 UTC (permalink / raw) To: J. Bruce Fields Cc: Al Viro, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, sandeen-H+wXaHxf7aLQT0dZR+AlfA On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote: > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child) > if (!dir->i_fop) > goto out; > /* > + * inode->i_ino is unsigned long, kstat->ino is u64, so the > + * former would be insufficient on 32-bit hosts when the > + * filesystem supports 64-bit inode numbers. So we need to > + * actually call ->getattr, not just read i_ino: > + */ > + error = vfs_getattr_nosec(path, &stat); Doh, "path" here is for the parent.... The following works better! --b. commit 231d38b6f775c4677a61b4d7dc15e0a0d6bbb709 Author: J. Bruce Fields <bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Date: Tue Sep 10 11:41:12 2013 -0400 exportfs: fix 32-bit nfsd handling of 64-bit inode numbers Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a 32-bit NFS server exporting a very large XFS filesystem, when the server's cache is cold (so the inodes in question are not in cache). Reviewed-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> Reported-by: Trevor Cordes <trevor-CGgvEiIIHbIFyWsGDH9TEg@public.gmane.org> Signed-off-by: J. Bruce Fields <bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c index a235f00..c43fe9b 100644 --- a/fs/exportfs/expfs.c +++ b/fs/exportfs/expfs.c @@ -215,7 +215,7 @@ struct getdents_callback { struct dir_context ctx; char *name; /* name that was found. It already points to a buffer NAME_MAX+1 is size */ - unsigned long ino; /* the inum we are looking for */ + u64 ino; /* the inum we are looking for */ int found; /* inode matched? */ int sequence; /* sequence counter */ }; @@ -255,10 +255,14 @@ static int get_name(const struct path *path, char *name, struct dentry *child) struct inode *dir = path->dentry->d_inode; int error; struct file *file; + struct kstat stat; + struct path child_path = { + .mnt = path->mnt, + .dentry = child, + }; struct getdents_callback buffer = { .ctx.actor = filldir_one, .name = name, - .ino = child->d_inode->i_ino }; error = -ENOTDIR; @@ -268,6 +272,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child) if (!dir->i_fop) goto out; /* + * inode->i_ino is unsigned long, kstat->ino is u64, so the + * former would be insufficient on 32-bit hosts when the + * filesystem supports 64-bit inode numbers. So we need to + * actually call ->getattr, not just read i_ino: + */ + error = vfs_getattr_nosec(&child_path, &stat); + if (error) + return error; + buffer.ino = stat.ino; + /* * Open the directory ... */ file = dentry_open(path, O_RDONLY, cred); -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 24+ messages in thread
[parent not found: <20131004221216.GC18051-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>]
* Re: [PATCH 2/2] exportfs: fix 32-bit nfsd handling of 64-bit inode numbers [not found] ` <20131004221216.GC18051-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2013-10-04 22:15 ` J. Bruce Fields 2013-10-08 21:56 ` J. Bruce Fields 0 siblings, 1 reply; 24+ messages in thread From: J. Bruce Fields @ 2013-10-04 22:15 UTC (permalink / raw) To: J. Bruce Fields Cc: Al Viro, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, sandeen-H+wXaHxf7aLQT0dZR+AlfA On Fri, Oct 04, 2013 at 06:12:16PM -0400, bfields wrote: > On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote: > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child) > > if (!dir->i_fop) > > goto out; > > /* > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the > > + * former would be insufficient on 32-bit hosts when the > > + * filesystem supports 64-bit inode numbers. So we need to > > + * actually call ->getattr, not just read i_ino: > > + */ > > + error = vfs_getattr_nosec(path, &stat); > > Doh, "path" here is for the parent.... The following works better! By the way, I'm testing this with: - create a bunch of nested subdirectories, use name_to_fhandle_at to get a handle for the bottom directory. - echo 2 >/proc/sys/vm/drop_caches - open_by_fhandle_at on the filehandle But this only actually exercises the reconnect path on the first run after boot. Is there something obvious I'm missing here? --b. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/2] exportfs: fix 32-bit nfsd handling of 64-bit inode numbers 2013-10-04 22:15 ` J. Bruce Fields @ 2013-10-08 21:56 ` J. Bruce Fields 2013-10-09 0:16 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: J. Bruce Fields @ 2013-10-08 21:56 UTC (permalink / raw) To: J. Bruce Fields Cc: Al Viro, Christoph Hellwig, linux-fsdevel, linux-nfs, sandeen On Fri, Oct 04, 2013 at 06:15:22PM -0400, J. Bruce Fields wrote: > On Fri, Oct 04, 2013 at 06:12:16PM -0400, bfields wrote: > > On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote: > > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child) > > > if (!dir->i_fop) > > > goto out; > > > /* > > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the > > > + * former would be insufficient on 32-bit hosts when the > > > + * filesystem supports 64-bit inode numbers. So we need to > > > + * actually call ->getattr, not just read i_ino: > > > + */ > > > + error = vfs_getattr_nosec(path, &stat); > > > > Doh, "path" here is for the parent.... The following works better! > > By the way, I'm testing this with: > > - create a bunch of nested subdirectories, use > name_to_fhandle_at to get a handle for the bottom directory. > - echo 2 >/proc/sys/vm/drop_caches > - open_by_fhandle_at on the filehandle > > But this only actually exercises the reconnect path on the first run > after boot. Is there something obvious I'm missing here? Looking at the code.... OK, most of the work of drop_caches is done by shrink_slab_node, which doesn't actually try to free every single thing that it could free--in particular, it won't try to free anything if it thinks there are less than shrinker->batch_size (1024 in the super_block->s_shrink case) objects to free. So for now I'm just nesting deeper (2048 subdirectories) to get a reliable way to exercise reconnect_path. --b. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/2] exportfs: fix 32-bit nfsd handling of 64-bit inode numbers 2013-10-08 21:56 ` J. Bruce Fields @ 2013-10-09 0:16 ` Dave Chinner 2013-10-09 14:53 ` J. Bruce Fields 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2013-10-09 0:16 UTC (permalink / raw) To: J. Bruce Fields Cc: J. Bruce Fields, Al Viro, Christoph Hellwig, linux-fsdevel, linux-nfs, sandeen On Tue, Oct 08, 2013 at 05:56:56PM -0400, J. Bruce Fields wrote: > On Fri, Oct 04, 2013 at 06:15:22PM -0400, J. Bruce Fields wrote: > > On Fri, Oct 04, 2013 at 06:12:16PM -0400, bfields wrote: > > > On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote: > > > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child) > > > > if (!dir->i_fop) > > > > goto out; > > > > /* > > > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the > > > > + * former would be insufficient on 32-bit hosts when the > > > > + * filesystem supports 64-bit inode numbers. So we need to > > > > + * actually call ->getattr, not just read i_ino: > > > > + */ > > > > + error = vfs_getattr_nosec(path, &stat); > > > > > > Doh, "path" here is for the parent.... The following works better! > > > > By the way, I'm testing this with: > > > > - create a bunch of nested subdirectories, use > > name_to_fhandle_at to get a handle for the bottom directory. > > - echo 2 >/proc/sys/vm/drop_caches > > - open_by_fhandle_at on the filehandle > > > > But this only actually exercises the reconnect path on the first run > > after boot. Is there something obvious I'm missing here? > > Looking at the code.... OK, most of the work of drop_caches is done by > shrink_slab_node, which doesn't actually try to free every single thing > that it could free--in particular, it won't try to free anything if it > thinks there are less than shrinker->batch_size (1024 in the > super_block->s_shrink case) objects to free. That's not quite right. Yes, the shrinker won't be called if the calculated scan count is less than the batch size, but the left over is added back the shrinker scan count to carry over to the next call to the shrinker. Hence if you repeated call the shrinker on a small cache with a large batch size, it will eventually aggregate the scan counts to over the batch size and trim the cache.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/2] exportfs: fix 32-bit nfsd handling of 64-bit inode numbers 2013-10-09 0:16 ` Dave Chinner @ 2013-10-09 14:53 ` J. Bruce Fields 2013-10-10 22:28 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: J. Bruce Fields @ 2013-10-09 14:53 UTC (permalink / raw) To: Dave Chinner Cc: J. Bruce Fields, Al Viro, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, sandeen-H+wXaHxf7aLQT0dZR+AlfA On Wed, Oct 09, 2013 at 11:16:31AM +1100, Dave Chinner wrote: > On Tue, Oct 08, 2013 at 05:56:56PM -0400, J. Bruce Fields wrote: > > On Fri, Oct 04, 2013 at 06:15:22PM -0400, J. Bruce Fields wrote: > > > On Fri, Oct 04, 2013 at 06:12:16PM -0400, bfields wrote: > > > > On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote: > > > > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child) > > > > > if (!dir->i_fop) > > > > > goto out; > > > > > /* > > > > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the > > > > > + * former would be insufficient on 32-bit hosts when the > > > > > + * filesystem supports 64-bit inode numbers. So we need to > > > > > + * actually call ->getattr, not just read i_ino: > > > > > + */ > > > > > + error = vfs_getattr_nosec(path, &stat); > > > > > > > > Doh, "path" here is for the parent.... The following works better! > > > > > > By the way, I'm testing this with: > > > > > > - create a bunch of nested subdirectories, use > > > name_to_fhandle_at to get a handle for the bottom directory. > > > - echo 2 >/proc/sys/vm/drop_caches > > > - open_by_fhandle_at on the filehandle > > > > > > But this only actually exercises the reconnect path on the first run > > > after boot. Is there something obvious I'm missing here? > > > > Looking at the code.... OK, most of the work of drop_caches is done by > > shrink_slab_node, which doesn't actually try to free every single thing > > that it could free--in particular, it won't try to free anything if it > > thinks there are less than shrinker->batch_size (1024 in the > > super_block->s_shrink case) objects to free. (Oops, sorry, that should have been "less than half of shrinker->batch_size", see below.) > That's not quite right. Yes, the shrinker won't be called if the > calculated scan count is less than the batch size, but the left over > is added back the shrinker scan count to carry over to the next call > to the shrinker. Hence if you repeated call the shrinker on a small > cache with a large batch size, it will eventually aggregate the scan > counts to over the batch size and trim the cache.... No, in shrink_slab_count, we do this: if (total_scan > max_pass * 2) total_scan = max_pass * 2; while (total_scan >= batch_size) { ... } where max_pass is the value returned from count_objects. So as long as count_objects returns less than half batch_size, nothing ever happens. (I wonder if that check's correct? The "forever" in the comment above it seems wrong at least.) --b. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/2] exportfs: fix 32-bit nfsd handling of 64-bit inode numbers 2013-10-09 14:53 ` J. Bruce Fields @ 2013-10-10 22:28 ` Dave Chinner 2013-10-11 21:53 ` J. Bruce Fields 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2013-10-10 22:28 UTC (permalink / raw) To: J. Bruce Fields Cc: J. Bruce Fields, Al Viro, Christoph Hellwig, linux-fsdevel, linux-nfs, sandeen On Wed, Oct 09, 2013 at 10:53:20AM -0400, J. Bruce Fields wrote: > On Wed, Oct 09, 2013 at 11:16:31AM +1100, Dave Chinner wrote: > > On Tue, Oct 08, 2013 at 05:56:56PM -0400, J. Bruce Fields wrote: > > > On Fri, Oct 04, 2013 at 06:15:22PM -0400, J. Bruce Fields wrote: > > > > On Fri, Oct 04, 2013 at 06:12:16PM -0400, bfields wrote: > > > > > On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote: > > > > > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child) > > > > > > if (!dir->i_fop) > > > > > > goto out; > > > > > > /* > > > > > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the > > > > > > + * former would be insufficient on 32-bit hosts when the > > > > > > + * filesystem supports 64-bit inode numbers. So we need to > > > > > > + * actually call ->getattr, not just read i_ino: > > > > > > + */ > > > > > > + error = vfs_getattr_nosec(path, &stat); > > > > > > > > > > Doh, "path" here is for the parent.... The following works better! > > > > > > > > By the way, I'm testing this with: > > > > > > > > - create a bunch of nested subdirectories, use > > > > name_to_fhandle_at to get a handle for the bottom directory. > > > > - echo 2 >/proc/sys/vm/drop_caches > > > > - open_by_fhandle_at on the filehandle > > > > > > > > But this only actually exercises the reconnect path on the first run > > > > after boot. Is there something obvious I'm missing here? > > > > > > Looking at the code.... OK, most of the work of drop_caches is done by > > > shrink_slab_node, which doesn't actually try to free every single thing > > > that it could free--in particular, it won't try to free anything if it > > > thinks there are less than shrinker->batch_size (1024 in the > > > super_block->s_shrink case) objects to free. > > (Oops, sorry, that should have been "less than half of > shrinker->batch_size", see below.) > > > That's not quite right. Yes, the shrinker won't be called if the > > calculated scan count is less than the batch size, but the left over > > is added back the shrinker scan count to carry over to the next call > > to the shrinker. Hence if you repeated call the shrinker on a small > > cache with a large batch size, it will eventually aggregate the scan > > counts to over the batch size and trim the cache.... > > No, in shrink_slab_count, we do this: > > if (total_scan > max_pass * 2) > total_scan = max_pass * 2; > > while (total_scan >= batch_size) { > ... > } > > where max_pass is the value returned from count_objects. So as long as > count_objects returns less than half batch_size, nothing ever happens. Ah, right - I hadn't considered what that does to small caches - the intended purpose of that is to limit the scan size when caches are extremely large and lots of deferral has occurred. Perhaps we need to consider the batch size in this? e.g.: total_scan = min(total_scan, max(max_pass * 2, batch_size)); Hence for small caches (max_pass <<< batch_size), it evaluates as: total_scan = min(total_scan, batch_size); and hence once aggregation of repeated calls pushes us over the batch size we run the shrinker. For large caches (max_pass >>> batch_size), it evaluates as: total_scan = min(total_scan, max_pass * 2); which gives us the same behaviour as the current code. I'll write up a patch to do this... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/2] exportfs: fix 32-bit nfsd handling of 64-bit inode numbers 2013-10-10 22:28 ` Dave Chinner @ 2013-10-11 21:53 ` J. Bruce Fields 2013-10-13 22:52 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: J. Bruce Fields @ 2013-10-11 21:53 UTC (permalink / raw) To: Dave Chinner Cc: J. Bruce Fields, Al Viro, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, sandeen-H+wXaHxf7aLQT0dZR+AlfA On Fri, Oct 11, 2013 at 09:28:07AM +1100, Dave Chinner wrote: > On Wed, Oct 09, 2013 at 10:53:20AM -0400, J. Bruce Fields wrote: > > On Wed, Oct 09, 2013 at 11:16:31AM +1100, Dave Chinner wrote: > > > On Tue, Oct 08, 2013 at 05:56:56PM -0400, J. Bruce Fields wrote: > > > > On Fri, Oct 04, 2013 at 06:15:22PM -0400, J. Bruce Fields wrote: > > > > > On Fri, Oct 04, 2013 at 06:12:16PM -0400, bfields wrote: > > > > > > On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote: > > > > > > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child) > > > > > > > if (!dir->i_fop) > > > > > > > goto out; > > > > > > > /* > > > > > > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the > > > > > > > + * former would be insufficient on 32-bit hosts when the > > > > > > > + * filesystem supports 64-bit inode numbers. So we need to > > > > > > > + * actually call ->getattr, not just read i_ino: > > > > > > > + */ > > > > > > > + error = vfs_getattr_nosec(path, &stat); > > > > > > > > > > > > Doh, "path" here is for the parent.... The following works better! > > > > > > > > > > By the way, I'm testing this with: > > > > > > > > > > - create a bunch of nested subdirectories, use > > > > > name_to_fhandle_at to get a handle for the bottom directory. > > > > > - echo 2 >/proc/sys/vm/drop_caches > > > > > - open_by_fhandle_at on the filehandle > > > > > > > > > > But this only actually exercises the reconnect path on the first run > > > > > after boot. Is there something obvious I'm missing here? > > > > > > > > Looking at the code.... OK, most of the work of drop_caches is done by > > > > shrink_slab_node, which doesn't actually try to free every single thing > > > > that it could free--in particular, it won't try to free anything if it > > > > thinks there are less than shrinker->batch_size (1024 in the > > > > super_block->s_shrink case) objects to free. > > > > (Oops, sorry, that should have been "less than half of > > shrinker->batch_size", see below.) > > > > > That's not quite right. Yes, the shrinker won't be called if the > > > calculated scan count is less than the batch size, but the left over > > > is added back the shrinker scan count to carry over to the next call > > > to the shrinker. Hence if you repeated call the shrinker on a small > > > cache with a large batch size, it will eventually aggregate the scan > > > counts to over the batch size and trim the cache.... > > > > No, in shrink_slab_count, we do this: > > > > if (total_scan > max_pass * 2) > > total_scan = max_pass * 2; > > > > while (total_scan >= batch_size) { > > ... > > } > > > > where max_pass is the value returned from count_objects. So as long as > > count_objects returns less than half batch_size, nothing ever happens. > > Ah, right - I hadn't considered what that does to small caches - the > intended purpose of that is to limit the scan size when caches are > extremely large and lots of deferral has occurred. Perhaps we need > to consider the batch size in this? e.g.: > > total_scan = min(total_scan, max(max_pass * 2, batch_size)); > > Hence for small caches (max_pass <<< batch_size), it evaluates as: > > total_scan = min(total_scan, batch_size); > > and hence once aggregation of repeated calls pushes us over the > batch size we run the shrinker. > > For large caches (max_pass >>> batch_size), it evaluates as: > > total_scan = min(total_scan, max_pass * 2); > > which gives us the same behaviour as the current code. > > I'll write up a patch to do this... It all feels a bit ad-hoc, but OK. drop_caches could still end up leaving some small caches alone, right? I hadn't expected that, but then again maybe I don't really understand what drop_caches is for. --b. -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/2] exportfs: fix 32-bit nfsd handling of 64-bit inode numbers 2013-10-11 21:53 ` J. Bruce Fields @ 2013-10-13 22:52 ` Dave Chinner 0 siblings, 0 replies; 24+ messages in thread From: Dave Chinner @ 2013-10-13 22:52 UTC (permalink / raw) To: J. Bruce Fields Cc: J. Bruce Fields, Al Viro, Christoph Hellwig, linux-fsdevel, linux-nfs, sandeen On Fri, Oct 11, 2013 at 05:53:51PM -0400, J. Bruce Fields wrote: > On Fri, Oct 11, 2013 at 09:28:07AM +1100, Dave Chinner wrote: > > On Wed, Oct 09, 2013 at 10:53:20AM -0400, J. Bruce Fields wrote: > > > On Wed, Oct 09, 2013 at 11:16:31AM +1100, Dave Chinner wrote: > > > > On Tue, Oct 08, 2013 at 05:56:56PM -0400, J. Bruce Fields wrote: > > > > > On Fri, Oct 04, 2013 at 06:15:22PM -0400, J. Bruce Fields wrote: > > > > > > On Fri, Oct 04, 2013 at 06:12:16PM -0400, bfields wrote: > > > > > > > On Wed, Oct 02, 2013 at 05:28:14PM -0400, J. Bruce Fields wrote: > > > > > > > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child) > > > > > > > > if (!dir->i_fop) > > > > > > > > goto out; > > > > > > > > /* > > > > > > > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the > > > > > > > > + * former would be insufficient on 32-bit hosts when the > > > > > > > > + * filesystem supports 64-bit inode numbers. So we need to > > > > > > > > + * actually call ->getattr, not just read i_ino: > > > > > > > > + */ > > > > > > > > + error = vfs_getattr_nosec(path, &stat); > > > > > > > > > > > > > > Doh, "path" here is for the parent.... The following works better! > > > > > > > > > > > > By the way, I'm testing this with: > > > > > > > > > > > > - create a bunch of nested subdirectories, use > > > > > > name_to_fhandle_at to get a handle for the bottom directory. > > > > > > - echo 2 >/proc/sys/vm/drop_caches > > > > > > - open_by_fhandle_at on the filehandle > > > > > > > > > > > > But this only actually exercises the reconnect path on the first run > > > > > > after boot. Is there something obvious I'm missing here? > > > > > > > > > > Looking at the code.... OK, most of the work of drop_caches is done by > > > > > shrink_slab_node, which doesn't actually try to free every single thing > > > > > that it could free--in particular, it won't try to free anything if it > > > > > thinks there are less than shrinker->batch_size (1024 in the > > > > > super_block->s_shrink case) objects to free. > > > > > > (Oops, sorry, that should have been "less than half of > > > shrinker->batch_size", see below.) > > > > > > > That's not quite right. Yes, the shrinker won't be called if the > > > > calculated scan count is less than the batch size, but the left over > > > > is added back the shrinker scan count to carry over to the next call > > > > to the shrinker. Hence if you repeated call the shrinker on a small > > > > cache with a large batch size, it will eventually aggregate the scan > > > > counts to over the batch size and trim the cache.... > > > > > > No, in shrink_slab_count, we do this: > > > > > > if (total_scan > max_pass * 2) > > > total_scan = max_pass * 2; > > > > > > while (total_scan >= batch_size) { > > > ... > > > } > > > > > > where max_pass is the value returned from count_objects. So as long as > > > count_objects returns less than half batch_size, nothing ever happens. > > > > Ah, right - I hadn't considered what that does to small caches - the > > intended purpose of that is to limit the scan size when caches are > > extremely large and lots of deferral has occurred. Perhaps we need > > to consider the batch size in this? e.g.: > > > > total_scan = min(total_scan, max(max_pass * 2, batch_size)); > > > > Hence for small caches (max_pass <<< batch_size), it evaluates as: > > > > total_scan = min(total_scan, batch_size); > > > > and hence once aggregation of repeated calls pushes us over the > > batch size we run the shrinker. > > > > For large caches (max_pass >>> batch_size), it evaluates as: > > > > total_scan = min(total_scan, max_pass * 2); > > > > which gives us the same behaviour as the current code. > > > > I'll write up a patch to do this... > > It all feels a bit ad-hoc, but OK. > > drop_caches could still end up leaving some small caches alone, right? Yes, but it's iterative nature means that as long as it is making progress it will continue to call the shrinkers and hence in most cases caches will get more than just one call to be shrunk. > I hadn't expected that, but then again maybe I don't really understand > what drop_caches is for. drop_caches is a "best attempt" to free memory, not a guaranteed method of freeing pages or slab objects. It's a big hammer that can free a lot of memory and it will continue to free memory as long as it makes progress. But if it can't make progress, then it simply stops, and that can happen at any time during slab cache shrinking... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: why is i_ino unsigned long, anyway? 2013-10-02 17:53 ` J. Bruce Fields [not found] ` <20131002175328.GF14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2013-10-02 18:47 ` Sage Weil [not found] ` <alpine.DEB.2.00.1310021130280.7765-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org> 1 sibling, 1 reply; 24+ messages in thread From: Sage Weil @ 2013-10-02 18:47 UTC (permalink / raw) To: J. Bruce Fields Cc: Christoph Hellwig, Al Viro, linux-fsdevel, linux-nfs, sandeen On Wed, 2 Oct 2013, J. Bruce Fields wrote: > On Wed, Oct 02, 2013 at 09:05:27AM -0700, Christoph Hellwig wrote: > > On Wed, Oct 02, 2013 at 10:25:27AM -0400, J. Bruce Fields wrote: > > > If so then it's no huge code duplication to it by hand: > > > > > > if (inode->i_op->getattr) > > > inode->i_op->getattr(path->mnt, path->dentry, &stat); > > > else > > > generic_fillattr(inode, &stat); > > > > Maybe make that a vfs_getattr_nosec and let vfs_getattr call it? > > > > Including a proper kerneldoc comment explaining when to use it, please. > > Something like this? I'm late to this thread, but: getattr() is a comparatively expensive operation for just getting an (immutable) ino value on a distributed or clustered fs. It would be nice if there were a separate call that didn't try to populate all the other kstat fields with valid data. Something like this was part of the xstat series from forever ago (a bit mask passed to getattr indicating which fields were needed), so the approach below might be okay if we think we'll get there sometime soon, but my preference would be for another fix... (Ceph also uses 64-bit inos.) Thanks! sage > > --b. > > commit 8418a41b7192cf2f372ae091207adb29a088f9a0 > Author: J. Bruce Fields <bfields@redhat.com> > Date: Tue Sep 10 11:41:12 2013 -0400 > > exportfs: fix 32-bit nfsd handling of 64-bit inode numbers > > Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a > 32-bit NFS server exporting a very large XFS filesystem, when the > server's cache is cold (so the inodes in question are not in cache). > > Reported-by: Trevor Cordes <trevor@tecnopolis.ca> > Signed-off-by: J. Bruce Fields <bfields@redhat.com> > > diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c > index 293bc2e..811831a 100644 > --- a/fs/exportfs/expfs.c > +++ b/fs/exportfs/expfs.c > @@ -215,7 +215,7 @@ struct getdents_callback { > struct dir_context ctx; > char *name; /* name that was found. It already points to a > buffer NAME_MAX+1 is size */ > - unsigned long ino; /* the inum we are looking for */ > + u64 ino; /* the inum we are looking for */ > int found; /* inode matched? */ > int sequence; /* sequence counter */ > }; > @@ -255,10 +255,10 @@ static int get_name(const struct path *path, char *name, struct dentry *child) > struct inode *dir = path->dentry->d_inode; > int error; > struct file *file; > + struct kstat stat; > struct getdents_callback buffer = { > .ctx.actor = filldir_one, > .name = name, > - .ino = child->d_inode->i_ino > }; > > error = -ENOTDIR; > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child) > if (!dir->i_fop) > goto out; > /* > + * inode->i_ino is unsigned long, kstat->ino is u64, so the > + * former would be insufficient on 32-bit hosts when the > + * filesystem supports 64-bit inode numbers. So we need to > + * actually call ->getattr, not just read i_ino: > + */ > + error = vfs_getattr_nosec(path, &stat); > + if (error) > + return error; > + buffer.ino = stat.ino; > + /* > * Open the directory ... > */ > file = dentry_open(path, O_RDONLY, cred); > diff --git a/fs/stat.c b/fs/stat.c > index 04ce1ac..71a39e8 100644 > --- a/fs/stat.c > +++ b/fs/stat.c > @@ -37,14 +37,21 @@ void generic_fillattr(struct inode *inode, struct kstat *stat) > > EXPORT_SYMBOL(generic_fillattr); > > -int vfs_getattr(struct path *path, struct kstat *stat) > +/** > + * vfs_getattr_nosec - getattr without security checks > + * @path: file to get attributes from > + * @stat: structure to return attributes in > + * > + * Get attributes without calling security_inode_getattr. > + * > + * Currently the only caller other than vfs_getattr is internal to the > + * filehandle lookup code, which uses only the inode number and returns > + * no attributes to any user. Any other code probably wants > + * vfs_getattr. > + */ > +int vfs_getattr_nosec(struct path *path, struct kstat *stat) > { > struct inode *inode = path->dentry->d_inode; > - int retval; > - > - retval = security_inode_getattr(path->mnt, path->dentry); > - if (retval) > - return retval; > > if (inode->i_op->getattr) > return inode->i_op->getattr(path->mnt, path->dentry, stat); > @@ -53,6 +60,18 @@ int vfs_getattr(struct path *path, struct kstat *stat) > return 0; > } > > +EXPORT_SYMBOL_GPL(vfs_getattr_nosec); > + > +int vfs_getattr(struct path *path, struct kstat *stat) > +{ > + int retval; > + > + retval = security_inode_getattr(path->mnt, path->dentry); > + if (retval) > + return retval; > + return vfs_getattr_nosec(path, stat); > +} > + > EXPORT_SYMBOL(vfs_getattr); > > int vfs_fstat(unsigned int fd, struct kstat *stat) > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 9818747..5a51faa 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2500,6 +2500,7 @@ extern int page_symlink(struct inode *inode, const char *symname, int len); > extern const struct inode_operations page_symlink_inode_operations; > extern int generic_readlink(struct dentry *, char __user *, int); > extern void generic_fillattr(struct inode *, struct kstat *); > +int vfs_getattr_nosec(struct path *path, struct kstat *stat); > extern int vfs_getattr(struct path *, struct kstat *); > void __inode_add_bytes(struct inode *inode, loff_t bytes); > void inode_add_bytes(struct inode *inode, loff_t bytes); > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <alpine.DEB.2.00.1310021130280.7765-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>]
* Re: why is i_ino unsigned long, anyway? [not found] ` <alpine.DEB.2.00.1310021130280.7765-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org> @ 2013-10-02 19:00 ` J. Bruce Fields [not found] ` <20131002190034.GH14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> 0 siblings, 1 reply; 24+ messages in thread From: J. Bruce Fields @ 2013-10-02 19:00 UTC (permalink / raw) To: Sage Weil Cc: Christoph Hellwig, Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, sandeen-H+wXaHxf7aLQT0dZR+AlfA On Wed, Oct 02, 2013 at 11:47:22AM -0700, Sage Weil wrote: > On Wed, 2 Oct 2013, J. Bruce Fields wrote: > > On Wed, Oct 02, 2013 at 09:05:27AM -0700, Christoph Hellwig wrote: > > > On Wed, Oct 02, 2013 at 10:25:27AM -0400, J. Bruce Fields wrote: > > > > If so then it's no huge code duplication to it by hand: > > > > > > > > if (inode->i_op->getattr) > > > > inode->i_op->getattr(path->mnt, path->dentry, &stat); > > > > else > > > > generic_fillattr(inode, &stat); > > > > > > Maybe make that a vfs_getattr_nosec and let vfs_getattr call it? > > > > > > Including a proper kerneldoc comment explaining when to use it, please. > > > > Something like this? > > I'm late to this thread, but: getattr() is a comparatively expensive > operation for just getting an (immutable) ino value on a distributed or > clustered fs. It would be nice if there were a separate call that didn't > try to populate all the other kstat fields with valid data. Something > like this was part of the xstat series from forever ago (a bit mask passed > to getattr indicating which fields were needed), so the approach below > might be okay if we think we'll get there sometime soon, but my preference > would be for another fix... Understood, and perhaps this code should eventually take advantage of xstat, but: - this code isn't handling the common case, so we're not too worried about the performance, and - you also have the option of defining your own export_operations->get_name. Especially consider that if you have some better way to answer the question "what is the name of inode X in directory Y" that's better than reading Y looking for a matching inode number. --b. > (Ceph also uses 64-bit inos.) > > Thanks! > sage > > > > > > --b. > > > > commit 8418a41b7192cf2f372ae091207adb29a088f9a0 > > Author: J. Bruce Fields <bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > Date: Tue Sep 10 11:41:12 2013 -0400 > > > > exportfs: fix 32-bit nfsd handling of 64-bit inode numbers > > > > Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a > > 32-bit NFS server exporting a very large XFS filesystem, when the > > server's cache is cold (so the inodes in question are not in cache). > > > > Reported-by: Trevor Cordes <trevor-CGgvEiIIHbIFyWsGDH9TEg@public.gmane.org> > > Signed-off-by: J. Bruce Fields <bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c > > index 293bc2e..811831a 100644 > > --- a/fs/exportfs/expfs.c > > +++ b/fs/exportfs/expfs.c > > @@ -215,7 +215,7 @@ struct getdents_callback { > > struct dir_context ctx; > > char *name; /* name that was found. It already points to a > > buffer NAME_MAX+1 is size */ > > - unsigned long ino; /* the inum we are looking for */ > > + u64 ino; /* the inum we are looking for */ > > int found; /* inode matched? */ > > int sequence; /* sequence counter */ > > }; > > @@ -255,10 +255,10 @@ static int get_name(const struct path *path, char *name, struct dentry *child) > > struct inode *dir = path->dentry->d_inode; > > int error; > > struct file *file; > > + struct kstat stat; > > struct getdents_callback buffer = { > > .ctx.actor = filldir_one, > > .name = name, > > - .ino = child->d_inode->i_ino > > }; > > > > error = -ENOTDIR; > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child) > > if (!dir->i_fop) > > goto out; > > /* > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the > > + * former would be insufficient on 32-bit hosts when the > > + * filesystem supports 64-bit inode numbers. So we need to > > + * actually call ->getattr, not just read i_ino: > > + */ > > + error = vfs_getattr_nosec(path, &stat); > > + if (error) > > + return error; > > + buffer.ino = stat.ino; > > + /* > > * Open the directory ... > > */ > > file = dentry_open(path, O_RDONLY, cred); > > diff --git a/fs/stat.c b/fs/stat.c > > index 04ce1ac..71a39e8 100644 > > --- a/fs/stat.c > > +++ b/fs/stat.c > > @@ -37,14 +37,21 @@ void generic_fillattr(struct inode *inode, struct kstat *stat) > > > > EXPORT_SYMBOL(generic_fillattr); > > > > -int vfs_getattr(struct path *path, struct kstat *stat) > > +/** > > + * vfs_getattr_nosec - getattr without security checks > > + * @path: file to get attributes from > > + * @stat: structure to return attributes in > > + * > > + * Get attributes without calling security_inode_getattr. > > + * > > + * Currently the only caller other than vfs_getattr is internal to the > > + * filehandle lookup code, which uses only the inode number and returns > > + * no attributes to any user. Any other code probably wants > > + * vfs_getattr. > > + */ > > +int vfs_getattr_nosec(struct path *path, struct kstat *stat) > > { > > struct inode *inode = path->dentry->d_inode; > > - int retval; > > - > > - retval = security_inode_getattr(path->mnt, path->dentry); > > - if (retval) > > - return retval; > > > > if (inode->i_op->getattr) > > return inode->i_op->getattr(path->mnt, path->dentry, stat); > > @@ -53,6 +60,18 @@ int vfs_getattr(struct path *path, struct kstat *stat) > > return 0; > > } > > > > +EXPORT_SYMBOL_GPL(vfs_getattr_nosec); > > + > > +int vfs_getattr(struct path *path, struct kstat *stat) > > +{ > > + int retval; > > + > > + retval = security_inode_getattr(path->mnt, path->dentry); > > + if (retval) > > + return retval; > > + return vfs_getattr_nosec(path, stat); > > +} > > + > > EXPORT_SYMBOL(vfs_getattr); > > > > int vfs_fstat(unsigned int fd, struct kstat *stat) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 9818747..5a51faa 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -2500,6 +2500,7 @@ extern int page_symlink(struct inode *inode, const char *symname, int len); > > extern const struct inode_operations page_symlink_inode_operations; > > extern int generic_readlink(struct dentry *, char __user *, int); > > extern void generic_fillattr(struct inode *, struct kstat *); > > +int vfs_getattr_nosec(struct path *path, struct kstat *stat); > > extern int vfs_getattr(struct path *, struct kstat *); > > void __inode_add_bytes(struct inode *inode, loff_t bytes); > > void inode_add_bytes(struct inode *inode, loff_t bytes); > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 24+ messages in thread
[parent not found: <20131002190034.GH14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>]
* Re: why is i_ino unsigned long, anyway? [not found] ` <20131002190034.GH14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org> @ 2013-10-02 19:04 ` Sage Weil 0 siblings, 0 replies; 24+ messages in thread From: Sage Weil @ 2013-10-02 19:04 UTC (permalink / raw) To: J. Bruce Fields Cc: Christoph Hellwig, Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-nfs-u79uwXL29TY76Z2rM5mHXA, sandeen-H+wXaHxf7aLQT0dZR+AlfA On Wed, 2 Oct 2013, J. Bruce Fields wrote: > On Wed, Oct 02, 2013 at 11:47:22AM -0700, Sage Weil wrote: > > On Wed, 2 Oct 2013, J. Bruce Fields wrote: > > > On Wed, Oct 02, 2013 at 09:05:27AM -0700, Christoph Hellwig wrote: > > > > On Wed, Oct 02, 2013 at 10:25:27AM -0400, J. Bruce Fields wrote: > > > > > If so then it's no huge code duplication to it by hand: > > > > > > > > > > if (inode->i_op->getattr) > > > > > inode->i_op->getattr(path->mnt, path->dentry, &stat); > > > > > else > > > > > generic_fillattr(inode, &stat); > > > > > > > > Maybe make that a vfs_getattr_nosec and let vfs_getattr call it? > > > > > > > > Including a proper kerneldoc comment explaining when to use it, please. > > > > > > Something like this? > > > > I'm late to this thread, but: getattr() is a comparatively expensive > > operation for just getting an (immutable) ino value on a distributed or > > clustered fs. It would be nice if there were a separate call that didn't > > try to populate all the other kstat fields with valid data. Something > > like this was part of the xstat series from forever ago (a bit mask passed > > to getattr indicating which fields were needed), so the approach below > > might be okay if we think we'll get there sometime soon, but my preference > > would be for another fix... > > Understood, and perhaps this code should eventually take advantage of > xstat, but: > > - this code isn't handling the common case, so we're not too > worried about the performance, and > - you also have the option of defining your own > export_operations->get_name. Especially consider that if you > have some better way to answer the question "what is the name > of inode X in directory Y" that's better than reading Y > looking for a matching inode number. Ah, I see. Sounds good then! Thanks- sage > > --b. > > > (Ceph also uses 64-bit inos.) > > > > Thanks! > > sage > > > > > > > > > > --b. > > > > > > commit 8418a41b7192cf2f372ae091207adb29a088f9a0 > > > Author: J. Bruce Fields <bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > Date: Tue Sep 10 11:41:12 2013 -0400 > > > > > > exportfs: fix 32-bit nfsd handling of 64-bit inode numbers > > > > > > Symptoms were spurious -ENOENTs on stat of an NFS filesystem from a > > > 32-bit NFS server exporting a very large XFS filesystem, when the > > > server's cache is cold (so the inodes in question are not in cache). > > > > > > Reported-by: Trevor Cordes <trevor-CGgvEiIIHbIFyWsGDH9TEg@public.gmane.org> > > > Signed-off-by: J. Bruce Fields <bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > > > diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c > > > index 293bc2e..811831a 100644 > > > --- a/fs/exportfs/expfs.c > > > +++ b/fs/exportfs/expfs.c > > > @@ -215,7 +215,7 @@ struct getdents_callback { > > > struct dir_context ctx; > > > char *name; /* name that was found. It already points to a > > > buffer NAME_MAX+1 is size */ > > > - unsigned long ino; /* the inum we are looking for */ > > > + u64 ino; /* the inum we are looking for */ > > > int found; /* inode matched? */ > > > int sequence; /* sequence counter */ > > > }; > > > @@ -255,10 +255,10 @@ static int get_name(const struct path *path, char *name, struct dentry *child) > > > struct inode *dir = path->dentry->d_inode; > > > int error; > > > struct file *file; > > > + struct kstat stat; > > > struct getdents_callback buffer = { > > > .ctx.actor = filldir_one, > > > .name = name, > > > - .ino = child->d_inode->i_ino > > > }; > > > > > > error = -ENOTDIR; > > > @@ -268,6 +268,16 @@ static int get_name(const struct path *path, char *name, struct dentry *child) > > > if (!dir->i_fop) > > > goto out; > > > /* > > > + * inode->i_ino is unsigned long, kstat->ino is u64, so the > > > + * former would be insufficient on 32-bit hosts when the > > > + * filesystem supports 64-bit inode numbers. So we need to > > > + * actually call ->getattr, not just read i_ino: > > > + */ > > > + error = vfs_getattr_nosec(path, &stat); > > > + if (error) > > > + return error; > > > + buffer.ino = stat.ino; > > > + /* > > > * Open the directory ... > > > */ > > > file = dentry_open(path, O_RDONLY, cred); > > > diff --git a/fs/stat.c b/fs/stat.c > > > index 04ce1ac..71a39e8 100644 > > > --- a/fs/stat.c > > > +++ b/fs/stat.c > > > @@ -37,14 +37,21 @@ void generic_fillattr(struct inode *inode, struct kstat *stat) > > > > > > EXPORT_SYMBOL(generic_fillattr); > > > > > > -int vfs_getattr(struct path *path, struct kstat *stat) > > > +/** > > > + * vfs_getattr_nosec - getattr without security checks > > > + * @path: file to get attributes from > > > + * @stat: structure to return attributes in > > > + * > > > + * Get attributes without calling security_inode_getattr. > > > + * > > > + * Currently the only caller other than vfs_getattr is internal to the > > > + * filehandle lookup code, which uses only the inode number and returns > > > + * no attributes to any user. Any other code probably wants > > > + * vfs_getattr. > > > + */ > > > +int vfs_getattr_nosec(struct path *path, struct kstat *stat) > > > { > > > struct inode *inode = path->dentry->d_inode; > > > - int retval; > > > - > > > - retval = security_inode_getattr(path->mnt, path->dentry); > > > - if (retval) > > > - return retval; > > > > > > if (inode->i_op->getattr) > > > return inode->i_op->getattr(path->mnt, path->dentry, stat); > > > @@ -53,6 +60,18 @@ int vfs_getattr(struct path *path, struct kstat *stat) > > > return 0; > > > } > > > > > > +EXPORT_SYMBOL_GPL(vfs_getattr_nosec); > > > + > > > +int vfs_getattr(struct path *path, struct kstat *stat) > > > +{ > > > + int retval; > > > + > > > + retval = security_inode_getattr(path->mnt, path->dentry); > > > + if (retval) > > > + return retval; > > > + return vfs_getattr_nosec(path, stat); > > > +} > > > + > > > EXPORT_SYMBOL(vfs_getattr); > > > > > > int vfs_fstat(unsigned int fd, struct kstat *stat) > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > > index 9818747..5a51faa 100644 > > > --- a/include/linux/fs.h > > > +++ b/include/linux/fs.h > > > @@ -2500,6 +2500,7 @@ extern int page_symlink(struct inode *inode, const char *symname, int len); > > > extern const struct inode_operations page_symlink_inode_operations; > > > extern int generic_readlink(struct dentry *, char __user *, int); > > > extern void generic_fillattr(struct inode *, struct kstat *); > > > +int vfs_getattr_nosec(struct path *path, struct kstat *stat); > > > extern int vfs_getattr(struct path *, struct kstat *); > > > void __inode_add_bytes(struct inode *inode, loff_t bytes); > > > void inode_add_bytes(struct inode *inode, loff_t bytes); > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2013-10-13 22:53 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-12 16:03 why is i_ino unsigned long, anyway? J. Bruce Fields
2013-09-12 19:33 ` Al Viro
[not found] ` <20130912193328.GP13318-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
2013-09-29 11:54 ` Christoph Hellwig
[not found] ` <20130929115454.GA3953-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2013-10-02 14:25 ` J. Bruce Fields
[not found] ` <20131002142527.GD14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2013-10-02 15:43 ` J. Bruce Fields
[not found] ` <20131002154320.GE14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2013-10-02 16:04 ` Christoph Hellwig
2013-10-02 18:14 ` J. Bruce Fields
2013-10-02 16:05 ` Christoph Hellwig
[not found] ` <20131002160527.GB23875-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2013-10-02 17:53 ` J. Bruce Fields
[not found] ` <20131002175328.GF14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2013-10-02 17:57 ` Christoph Hellwig
2013-10-02 21:07 ` J. Bruce Fields
[not found] ` <20131002210736.GA20598-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2013-10-02 21:28 ` [PATCH 1/2] vfs: split out vfs_getattr_nosec J. Bruce Fields
2013-10-02 21:28 ` [PATCH 2/2] exportfs: fix 32-bit nfsd handling of 64-bit inode numbers J. Bruce Fields
[not found] ` <1380749295-20854-2-git-send-email-bfields-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-10-04 22:12 ` J. Bruce Fields
[not found] ` <20131004221216.GC18051-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2013-10-04 22:15 ` J. Bruce Fields
2013-10-08 21:56 ` J. Bruce Fields
2013-10-09 0:16 ` Dave Chinner
2013-10-09 14:53 ` J. Bruce Fields
2013-10-10 22:28 ` Dave Chinner
2013-10-11 21:53 ` J. Bruce Fields
2013-10-13 22:52 ` Dave Chinner
2013-10-02 18:47 ` why is i_ino unsigned long, anyway? Sage Weil
[not found] ` <alpine.DEB.2.00.1310021130280.7765-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2013-10-02 19:00 ` J. Bruce Fields
[not found] ` <20131002190034.GH14808-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
2013-10-02 19:04 ` Sage Weil
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).