* [PATCH v3 1/5] add metadata_incore ioctl in vfs
@ 2011-01-19 1:15 Shaohua Li
2011-01-19 20:41 ` Andrew Morton
0 siblings, 1 reply; 8+ messages in thread
From: Shaohua Li @ 2011-01-19 1:15 UTC (permalink / raw)
To: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Cc: Chris Mason, Christoph Hellwig, Andrew Morton, Arjan van de Ven,
Yan, Zheng, Wu, Fengguang, linux-api, manpages
Subject: add metadata_incore ioctl in vfs
Add an ioctl to dump filesystem's metadata in memory in vfs. Userspace collects
such info and uses it to do metadata readahead.
Filesystem can hook to super_operations.metadata_incore to get metadata in
specific approach. Next patch will give an example how to implement
.metadata_incore in btrfs.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
---
fs/compat_ioctl.c | 2 ++
fs/ioctl.c | 42 ++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 10 ++++++++++
3 files changed, 54 insertions(+)
Index: linux/fs/ioctl.c
===================================================================
--- linux.orig/fs/ioctl.c 2011-01-18 10:15:17.000000000 +0800
+++ linux/fs/ioctl.c 2011-01-18 10:39:40.000000000 +0800
@@ -530,6 +530,45 @@ static int ioctl_fsthaw(struct file *fil
}
/*
+ * Copy info about metadata in memory to userspace
+ * Returns:
+ * = 1, one metadata range copied to userspace
+ * = 0, no more metadata
+ * < 0, error
+ */
+static int ioctl_metadata_incore(struct file *filp, void __user *argp)
+{
+ struct super_block *sb = filp->f_path.dentry->d_inode->i_sb;
+ struct metadata_incore_args args;
+ loff_t offset;
+ ssize_t size;
+
+ if (!sb->s_op->metadata_incore)
+ return -EINVAL;
+
+ if (copy_from_user(&args, argp, sizeof(args)))
+ return -EFAULT;
+
+ /* we check metadata info in page unit */
+ if (args.offset & ~PAGE_CACHE_MASK)
+ return -EINVAL;
+
+ offset = args.offset;
+
+ if (sb->s_op->metadata_incore(sb, &offset, &size) < 0)
+ return 0;
+
+ args.address = offset;
+ args.size = size;
+ args.unused = 0;
+
+ if (copy_to_user(argp, &args, sizeof(args)))
+ return -EFAULT;
+
+ return 1;
+}
+
+/*
* When you add any new common ioctls to the switches above and below
* please update compat_sys_ioctl() too.
*
@@ -589,6 +628,9 @@ int do_vfs_ioctl(struct file *filp, unsi
return put_user(inode->i_sb->s_blocksize, p);
}
+ case FIMETADATA_INCORE:
+ return ioctl_metadata_incore(filp, argp);
+
default:
if (S_ISREG(filp->f_path.dentry->d_inode->i_mode))
error = file_ioctl(filp, cmd, arg);
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h 2011-01-18 10:15:17.000000000 +0800
+++ linux/include/linux/fs.h 2011-01-18 10:39:40.000000000 +0800
@@ -53,6 +53,13 @@ struct inodes_stat_t {
};
+struct metadata_incore_args {
+ __u64 offset; /* offset in metadata address */
+ __u64 address; /* returned address of metadata in memory */
+ __u32 size; /* size of the metadata */
+ __u32 unused;
+};
+
#define NR_FILE 8192 /* this can well be larger on a larger system */
#define MAY_EXEC 1
@@ -327,6 +334,7 @@ struct inodes_stat_t {
#define FIFREEZE _IOWR('X', 119, int) /* Freeze */
#define FITHAW _IOWR('X', 120, int) /* Thaw */
#define FITRIM _IOWR('X', 121, struct fstrim_range) /* Trim */
+#define FIMETADATA_INCORE _IOWR('X', 122, struct metadata_incore_args)
#define FS_IOC_GETFLAGS _IOR('f', 1, long)
#define FS_IOC_SETFLAGS _IOW('f', 2, long)
@@ -1626,6 +1634,8 @@ struct super_operations {
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
#endif
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
+ int (*metadata_incore)(struct super_block*, loff_t *offset,
+ ssize_t *size);
};
/*
Index: linux/fs/compat_ioctl.c
===================================================================
--- linux.orig/fs/compat_ioctl.c 2011-01-18 09:38:03.000000000 +0800
+++ linux/fs/compat_ioctl.c 2011-01-18 10:39:40.000000000 +0800
@@ -883,6 +883,7 @@ COMPATIBLE_IOCTL(FIGETBSZ)
/* 'X' - originally XFS but some now in the VFS */
COMPATIBLE_IOCTL(FIFREEZE)
COMPATIBLE_IOCTL(FITHAW)
+COMPATIBLE_IOCTL(FIMETADATA_INCORE)
COMPATIBLE_IOCTL(KDGETKEYCODE)
COMPATIBLE_IOCTL(KDSETKEYCODE)
COMPATIBLE_IOCTL(KDGKBTYPE)
@@ -1578,6 +1579,7 @@ asmlinkage long compat_sys_ioctl(unsigne
case FIONBIO:
case FIOASYNC:
case FIOQSIZE:
+ case FIMETADATA_INCORE:
break;
#if defined(CONFIG_IA64) || defined(CONFIG_X86_64)
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs
2011-01-19 1:15 [PATCH v3 1/5] add metadata_incore ioctl in vfs Shaohua Li
@ 2011-01-19 20:41 ` Andrew Morton
[not found] ` <1295490647.1949.890.camel@sli10-conroe>
0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2011-01-19 20:41 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Arjan van de Ven, Yan, Zheng,
Wu, Fengguang, linux-api, manpages
On Wed, 19 Jan 2011 09:15:18 +0800
Shaohua Li <shaohua.li@intel.com> wrote:
> Subject: add metadata_incore ioctl in vfs
>
> Add an ioctl to dump filesystem's metadata in memory in vfs. Userspace collects
> such info and uses it to do metadata readahead.
> Filesystem can hook to super_operations.metadata_incore to get metadata in
> specific approach. Next patch will give an example how to implement
> .metadata_incore in btrfs.
>
> ...
>
> /*
> + * Copy info about metadata in memory to userspace
> + * Returns:
> + * = 1, one metadata range copied to userspace
> + * = 0, no more metadata
> + * < 0, error
> + */
> +static int ioctl_metadata_incore(struct file *filp, void __user *argp)
> +{
> + struct super_block *sb = filp->f_path.dentry->d_inode->i_sb;
> + struct metadata_incore_args args;
> + loff_t offset;
> + ssize_t size;
> +
> + if (!sb->s_op->metadata_incore)
> + return -EINVAL;
> +
> + if (copy_from_user(&args, argp, sizeof(args)))
> + return -EFAULT;
> +
> + /* we check metadata info in page unit */
> + if (args.offset & ~PAGE_CACHE_MASK)
> + return -EINVAL;
> +
> + offset = args.offset;
> +
> + if (sb->s_op->metadata_incore(sb, &offset, &size) < 0)
> + return 0;
> +
> + args.address = offset;
> + args.size = size;
> + args.unused = 0;
> +
> + if (copy_to_user(argp, &args, sizeof(args)))
> + return -EFAULT;
> +
> + return 1;
> +}
So userspace opens any file on the fs and runs this ioctl against it?
That's a pretty awkward interface - we're doing an fs-wide operation
but the fs is identified by a single file which happens to live on that
fs. For example, this precludes a future extension whereby userspace
can query the incore metadata for a particular file. The statfs
syscall sucks in the same manner.
I don't know if this is worth addressing. Perhaps require that the
filp refers to the root of the fs?
Also, is this a privileged operation? If not, then that might be a
problem - could it be used by unprivileged users to work out which
files have been opened recently or something like that?
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs
[not found] ` <20110119184240.b0a6a016.akpm@linux-foundation.org>
@ 2011-01-20 2:48 ` Shaohua Li
2011-01-20 3:05 ` Andrew Morton
0 siblings, 1 reply; 8+ messages in thread
From: Shaohua Li @ 2011-01-20 2:48 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Arjan van de Ven, Yan, Zheng,
Wu, Fengguang, linux-api, manpages
On Thu, 2011-01-20 at 10:42 +0800, Andrew Morton wrote:
> On Thu, 20 Jan 2011 10:30:47 +0800 Shaohua Li <shaohua.li@intel.com> wrote:
>
> > > I don't know if this is worth addressing. Perhaps require that the
> > > filp refers to the root of the fs?
> > I didn't see why this is needed, but I can limit the fip to the root of
> > the fs.
>
> I don't think it matters much either. The only problem I can see is if
> we were to later try to extend the ioctl into a per-file thing.
since we return page range, a metadata page might be shared by several
files, which makes the per-file thing doesn't work. For a fs using
trees, it's even more hard to distinguish a file's metadata
> > > Also, is this a privileged operation? If not, then that might be a
> > > problem - could it be used by unprivileged users to work out which
> > > files have been opened recently or something like that?
> > it's harmless even a unprivileged user uses it. I don't think
> > unprivileged user can decode the data returned from the ioctl.
>
> um.
>
> Well, by doing a before-and-after thing I can use this ioctl to work
> out what metadata blocks are used when someone reads
> /my/super/secret-directory/foo. Then I can write a program which sits
> there waiting until someone else reads /my/super/secret-directory/foo.
> Then I can use that information to start WWIII or something.
>
> I dunno, strange things happen. Unless there's a good *need* to make
> this available to unprivileged users then we should not do so.
ok, looks interesting, I'll update the patch to limit unprivileged
users.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs
2011-01-20 2:48 ` Shaohua Li
@ 2011-01-20 3:05 ` Andrew Morton
[not found] ` <1295493709.1949.910.camel@sli10-conroe>
0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2011-01-20 3:05 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Arjan van de Ven, Yan, Zheng,
Wu, Fengguang, linux-api, manpages
On Thu, 20 Jan 2011 10:48:33 +0800 Shaohua Li <shaohua.li@intel.com> wrote:
> On Thu, 2011-01-20 at 10:42 +0800, Andrew Morton wrote:
> > On Thu, 20 Jan 2011 10:30:47 +0800 Shaohua Li <shaohua.li@intel.com> wrote:
> >
> > > > I don't know if this is worth addressing. Perhaps require that the
> > > > filp refers to the root of the fs?
> > > I didn't see why this is needed, but I can limit the fip to the root of
> > > the fs.
> >
> > I don't think it matters much either. The only problem I can see is if
> > we were to later try to extend the ioctl into a per-file thing.
> since we return page range, a metadata page might be shared by several
> files, which makes the per-file thing doesn't work. For a fs using
> trees, it's even more hard to distinguish a file's metadata
hm, why. A query for "which blocks need to be read to access this
file" may return blocks which are shared with other files, but it's
still useful info. Because it will represent vastly less data (and
hence IO) than the current fs-wide thing.
Now I actually look at it, I cannot find any documentation for the ioctl!
It seems to return a single offset/length tuple which refers to the
btrfs metadata "file", with the intent that this tuple later be fed
into a btrfs-specific readahead ioctl.
I can see how this might be used with say fatfs or ext3 where all
metadata resides within the blockdev address_space. But how is a
filesytem which keeps its metadata in multiple address_spaces supposed
to use this interface?
So. Please fully document the proposed userspace APIs! This should be
the first thing we look at. Then we can take a look at how applicable
that is to other-than-btrfs filesystems.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs
[not found] ` <20110119201014.adf02a78.akpm@linux-foundation.org>
@ 2011-01-20 4:41 ` Dave Chinner
[not found] ` <1295501898.1949.917.camel@sli10-conroe>
1 sibling, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2011-01-20 4:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Shaohua Li, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, Chris Mason, Christoph Hellwig,
Arjan van de Ven, Yan, Zheng, Wu, Fengguang, linux-api, manpages
On Wed, Jan 19, 2011 at 08:10:14PM -0800, Andrew Morton wrote:
> On Thu, 20 Jan 2011 11:21:49 +0800 Shaohua Li <shaohua.li@intel.com> wrote:
>
> > > It seems to return a single offset/length tuple which refers to the
> > > btrfs metadata "file", with the intent that this tuple later be fed
> > > into a btrfs-specific readahead ioctl.
> > >
> > > I can see how this might be used with say fatfs or ext3 where all
> > > metadata resides within the blockdev address_space. But how is a
> > > filesytem which keeps its metadata in multiple address_spaces supposed
> > > to use this interface?
> > Oh, this looks like a big problem, thanks for letting me know such
> > filesystems. is it possible specific filesystem mapping multiple
> > address_space ranges to a virtual big ranges? the new ioctls handle the
> > mapping.
>
> I'm not sure what you mean by that.
>
> ext2, minix and probably others create an address_space for each
> directory. Heaven knows what xfs does (for example).
In 2.6.39 it won't even use address spaces for metadata caching.
Besides, XFS already has pretty sophisticated metadata readahead
built in - it's one of the reasons why the XFS directory code scales
so well on cold cache lookups of arge directories - so I don't see
much need for such an interface for XFS.
Perhaps btrfs would be better served by implementing speculative
metadata readahead in the places where it makes sense (e.g. readdir)
bcause it will improve cold-cache performance on a much wider range
of workloads than at just boot-time....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs
[not found] ` <1295503953.1949.928.camel@sli10-conroe>
@ 2011-01-20 6:19 ` Wu Fengguang
2011-01-20 6:37 ` Shaohua Li
2011-01-20 6:27 ` Andrew Morton
1 sibling, 1 reply; 8+ messages in thread
From: Wu Fengguang @ 2011-01-20 6:19 UTC (permalink / raw)
To: Li, Shaohua
Cc: Andrew Morton, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, Chris Mason, Christoph Hellwig,
Arjan van de Ven, Yan, Zheng, linux-api, manpages
On Thu, Jan 20, 2011 at 02:12:33PM +0800, Li, Shaohua wrote:
> On Thu, 2011-01-20 at 13:55 +0800, Andrew Morton wrote:
> > On Thu, 20 Jan 2011 13:38:18 +0800 Shaohua Li <shaohua.li@intel.com> wrote:
> >
> > > > ext2, minix and probably others create an address_space for each
> > > > directory. Heaven knows what xfs does (for example).
> > > yes, this is for one directiory, but the all files's metadata are in
> > > block_dev address_space.
> > > I thought you mean there are several block_dev address_space like
> > > address_space in some filesystems, which doesn't fit well in my
> > > implementation. for ext like filesystem, there is only one
> > > address_space. for filesystems with several address_space, my proposal
> > > is map them to a virtual big address_space in the new ioctls.
> >
> > ext2 and minixfs (and I think sysv and ufs) have a separate
> > address_space for each directory. I don't see how those can be
> > represented with a single "virtual big address_space" - we also need
> > identifiers in there so each directory's address_space can be created
> > and appropriately populated.
> Oh, I misunderstand your comments. you are right, the ioctl methods
> don't work for ext2. the dir's address_space can't be readahead either.
> Looks we could only do the metadata readahead in filesystem specific
> way.
There should be little interest on ext2 boot time optimization.
However if necessary, we might somehow treat ext2 dirs as files and
do normal fadvise on them? The other ext2 metadata may still be
accessible via the block_dev interface.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs
[not found] ` <1295503953.1949.928.camel@sli10-conroe>
2011-01-20 6:19 ` Wu Fengguang
@ 2011-01-20 6:27 ` Andrew Morton
1 sibling, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2011-01-20 6:27 UTC (permalink / raw)
To: Shaohua Li
Cc: linux-btrfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
Chris Mason, Christoph Hellwig, Arjan van de Ven, Yan, Zheng,
Wu, Fengguang, linux-api, manpages
On Thu, 20 Jan 2011 14:12:33 +0800 Shaohua Li <shaohua.li@intel.com> wrote:
> On Thu, 2011-01-20 at 13:55 +0800, Andrew Morton wrote:
> > On Thu, 20 Jan 2011 13:38:18 +0800 Shaohua Li <shaohua.li@intel.com> wrote:
> >
> > > > ext2, minix and probably others create an address_space for each
> > > > directory. Heaven knows what xfs does (for example).
> > > yes, this is for one directiory, but the all files's metadata are in
> > > block_dev address_space.
> > > I thought you mean there are several block_dev address_space like
> > > address_space in some filesystems, which doesn't fit well in my
> > > implementation. for ext like filesystem, there is only one
> > > address_space. for filesystems with several address_space, my proposal
> > > is map them to a virtual big address_space in the new ioctls.
> >
> > ext2 and minixfs (and I think sysv and ufs) have a separate
> > address_space for each directory. I don't see how those can be
> > represented with a single "virtual big address_space" - we also need
> > identifiers in there so each directory's address_space can be created
> > and appropriately populated.
> Oh, I misunderstand your comments. you are right, the ioctl methods
> don't work for ext2. the dir's address_space can't be readahead either.
> Looks we could only do the metadata readahead in filesystem specific
> way.
Another way of doing all this would be to implement some sort of
lookaside cache at the vfs->block boundary. At boot time, load that
cache up with all the disk blocks which we know the boot will need (a
single ascending pass across the disk) and then when the vfs/fs goes to
read a disk block take a peek in that cache first and if it's a hit,
either steal the page or memcpy it.
It has the obvious coherence problems which would be pretty simple to
solve by hooking into the block write path as well. The list of needed
blocks can be very simply generated with existing blktrace
infrastructure. It does add permanent runtime overhead - once the
cache is invalidated and disabled, every IO operation would incur a
test-n-not-taken-branch. Maybe not too bad.
Need to handle small-memory systems somehow, where the cache simply
ooms the machine or becomes ineffective because it's causing eviction
elsewhere.
It could perhaps all be implemented as an md or dm driver.
Or even as an IO scheduler. I say this because IO schedulers can be
replaced on-the-fly, so the caching layer can be unloaded from the
stack once it is finished with.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v3 1/5] add metadata_incore ioctl in vfs
2011-01-20 6:19 ` Wu Fengguang
@ 2011-01-20 6:37 ` Shaohua Li
0 siblings, 0 replies; 8+ messages in thread
From: Shaohua Li @ 2011-01-20 6:37 UTC (permalink / raw)
To: Wu, Fengguang
Cc: Andrew Morton, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, Chris Mason, Christoph Hellwig,
Arjan van de Ven, Yan, Zheng, linux-api, manpages
On Thu, 2011-01-20 at 14:19 +0800, Wu, Fengguang wrote:
> On Thu, Jan 20, 2011 at 02:12:33PM +0800, Li, Shaohua wrote:
> > On Thu, 2011-01-20 at 13:55 +0800, Andrew Morton wrote:
> > > On Thu, 20 Jan 2011 13:38:18 +0800 Shaohua Li <shaohua.li@intel.com> wrote:
> > >
> > > > > ext2, minix and probably others create an address_space for each
> > > > > directory. Heaven knows what xfs does (for example).
> > > > yes, this is for one directiory, but the all files's metadata are in
> > > > block_dev address_space.
> > > > I thought you mean there are several block_dev address_space like
> > > > address_space in some filesystems, which doesn't fit well in my
> > > > implementation. for ext like filesystem, there is only one
> > > > address_space. for filesystems with several address_space, my proposal
> > > > is map them to a virtual big address_space in the new ioctls.
> > >
> > > ext2 and minixfs (and I think sysv and ufs) have a separate
> > > address_space for each directory. I don't see how those can be
> > > represented with a single "virtual big address_space" - we also need
> > > identifiers in there so each directory's address_space can be created
> > > and appropriately populated.
> > Oh, I misunderstand your comments. you are right, the ioctl methods
> > don't work for ext2. the dir's address_space can't be readahead either.
> > Looks we could only do the metadata readahead in filesystem specific
> > way.
>
> There should be little interest on ext2 boot time optimization.
>
> However if necessary, we might somehow treat ext2 dirs as files and
> do normal fadvise on them? The other ext2 metadata may still be
> accessible via the block_dev interface.
current readahead syscall might already work here. however, I would
expect there is stall easily when reading dirs.
thinking 3 dirs:
/
/aa
/aa/bb
before / is in memory, reading aa will have stall. And we can't reduce
disk seeks here. metadata readahead will still have some benefits but
might not be much in such filesystem.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2011-01-20 6:37 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-01-19 1:15 [PATCH v3 1/5] add metadata_incore ioctl in vfs Shaohua Li
2011-01-19 20:41 ` Andrew Morton
[not found] ` <1295490647.1949.890.camel@sli10-conroe>
[not found] ` <20110119184240.b0a6a016.akpm@linux-foundation.org>
2011-01-20 2:48 ` Shaohua Li
2011-01-20 3:05 ` Andrew Morton
[not found] ` <1295493709.1949.910.camel@sli10-conroe>
[not found] ` <20110119201014.adf02a78.akpm@linux-foundation.org>
2011-01-20 4:41 ` Dave Chinner
[not found] ` <1295501898.1949.917.camel@sli10-conroe>
[not found] ` <20110119215510.0882db92.akpm@linux-foundation.org>
[not found] ` <1295503953.1949.928.camel@sli10-conroe>
2011-01-20 6:19 ` Wu Fengguang
2011-01-20 6:37 ` Shaohua Li
2011-01-20 6:27 ` Andrew Morton
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).