[PATCH 1/5] vfs: vfs-level fiemap interface

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/5] vfs: vfs-level fiemap interface
@ 2008-05-25  0:01 Mark Fasheh
  2008-05-25  7:28 ` Andreas Dilger
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Mark Fasheh @ 2008-05-25  0:01 UTC (permalink / raw)
  To: linux-fsdevel

Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap
inode operation.

Userspace can get extent information on a file via fiemap ioctl. As input,
the fiemap ioctl takes a struct fiemap which includes an array of struct
fiemap_extent (fm_extents). Size of the extent array is passed as
fm_extent_count and number of extents returned will be written into
fm_mapped_extents. Offset and length fields on the fiemap structure
(fm_start, fm_length) describe a logical range which will be searched for
extents. All extents returned will at least partially contain this range.
The actual extent offsets and ranges returned will be unmodified from their
offset and range on-disk.

The fiemap ioctl returns '0' on success. On error, -1 is returned and errno
is set. If errno is equal to EBADR, then fm_flags will contain those flags
which were passed in which the kernel did not understand. On all other
errors, the contents of fm_extents is undefined.

As fiemap evolved, there have been many authors of the vfs patch. As far as
I can tell, the list includes:
Kalpak Shah <kalpak.shah@sun.com>
Andreas Dilger <adilger@sun.com>
Eric Sandeen <sandeen@redhat.com>
Mark Fasheh <mfasheh@suse.com>

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 Documentation/filesystems/fiemap.txt |  224 ++++++++++++++++++++++++++++++++++
 fs/ioctl.c                           |  140 +++++++++++++++++++++
 include/linux/fiemap.h               |   72 +++++++++++
 include/linux/fs.h                   |   16 +++
 4 files changed, 452 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/fiemap.txt
 create mode 100644 include/linux/fiemap.h

diff --git a/Documentation/filesystems/fiemap.txt b/Documentation/filesystems/fiemap.txt
new file mode 100644
index 0000000..84acf6c
--- /dev/null
+++ b/Documentation/filesystems/fiemap.txt
@@ -0,0 +1,224 @@
+Fiemap Ioctl
+============
+
+The fiemap ioctl is an efficient method for userspace to get file
+extent mappings. Instead of block-by-block mapping (such as bmap), fiemap
+returns a list of extents.
+
+
+Request Basics
+--------------
+
+A fiemap request is encoded within struct fiemap:
+
+struct fiemap {
+	__u64	fm_start;	 /* logical offset (inclusive) at
+				  * which to start mapping (in) */
+	__u64	fm_length;	 /* logical length of mapping which
+				  * userspace cares about (in) */
+	__u32	fm_flags;	 /* FIEMAP_FLAG_* flags for request (in) */
+	__u32	fm_extent_count; /* size of fm_extents array (in) */
+	__u32	fm_mapped_extents; /* number of extents that were
+				    * mapped (out) */
+	__u32	fm_reserved;
+	struct fiemap_extent	fm_extents[0];
+};
+
+
+fm_start, and fm_length specify the logical range within the file
+which the process would like mappings for. Extents returned mirror
+those on disk - that is, the logical offset of the 1st returned extent
+may start before fm_start, and the range covered by the last returned
+extent may end after fm_length. All offsets and lengths are in bytes.
+
+Certain flags to modify the way in which mappings are looked up can be
+set in fm_flags. If the kernel doesn't understand some particular
+flags, it will return EBADR and the contents of fm_flags will contain
+the set of flags which caused the error. If the kernel is compatible
+with all flags passed, the contents of fm_flags will be unmodified.
+It is up to userspace to determine whether rejection of a particular
+flag is fatal to it's operation. This scheme is intended to allow the
+fiemap interface to grow in the future but without losing
+compatibility with old software.
+
+Currently, there are four flags which can be set in fm_flags:
+
+* FIEMAP_FLAG_NUM_EXTENTS
+If this flag is set, extent information will not be returned via the
+fm_extents array and the value of fm_extent_count will be
+ignored. Instead, the total number of extents covering the range will
+be returned via fm_mapped_extents. This is useful for programs which
+only want to count the number of extents in a file, but don't care
+about the actual extent layout.
+
+* FIEMAP_FLAG_SYNC
+If this flag is set, the kernel will sync the file before mapping extents.
+
+* FIEMAP_FLAG_HSM_READ
+If the extent is offline, retrieve it before mapping and do not flag
+it as FIEMAP_EXTENT_SECONDARY. This flag has no effect if the file
+system does not support HSM.
+
+* FIEMAP_FLAG_XATTR
+If this flag is set, the extents returned will describe the inodes
+extended attribute lookup tree, instead of it's data tree.
+
+* FIEMAP_FLAG_LUN_ORDER
+If the file system stripes file data, this will return contiguous
+regions of physical allocation, sorted by LUN. Logical offsets may not
+make sense if this flag is passed. If the file system does not support
+multiple LUNs, this flag will be ignored.
+
+
+Extent Mapping
+--------------
+
+Note that all of this is ignored if FIEMAP_FLAG_NUM_EXTENTS is set.
+
+Extent information is returned within the embedded fm_extents array
+which userspace must allocate along with the fiemap structure. The
+total number of fiemap_extents available should be passed via
+fm_extent_count. The of extents mapped by kernel will be returned via
+fm_mapped_extents. If the number of fiemap_extents allocated is less
+than would be required to map the requested range, the maximum number
+of extents that can be mapped in available memory will be returned and
+fm_mapped_extents will be equal to fm_extent_count. In that case, the
+last extent in the array will not complete the requested range and
+will not have the FIEMAP_EXTENT_LAST flag set (see the next section on
+extent flags).
+
+Each extent is described by a single fiemap_extent structure as
+returned in fm_extents.
+
+struct fiemap_extent {
+	__u64	fe_logical;/* logical offset in bytes for the start of
+			    * the extent */
+	__u64	fe_physical; /* physical offset in bytes for the start
+			      * of the extent */
+	__u64	fe_length; /* length in bytes for the extent */
+	__u32	fe_flags;  /* returned FIEMAP_EXTENT_* flags for the extent */
+	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/
+};
+
+All offsets and lengths are in bytes and mirror those on disk - it is
+valid for an extents logical offset to start before the request or
+it's logical length to extend past the request. Unless
+FIEMAP_EXTENT_NOT_ALIGNED is returned, fe_logical, fe_physical and
+fe_length will be aligned to the block size of the file system.
+
+The fe_flags field contains flags which describe the extent
+returned. A special flag, FIEMAP_EXTENT_LAST is always set on the last
+extent in the file so that the process making fiemap calls can
+determine when no more extents are available.
+
+Some flags are intentionally vague and will always be set in the
+presence of other more specific flags. This way a program looking for
+a general property does not have to know all existing and future flags
+which imply that property.
+
+For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL
+are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking
+for inline or tail-packed data can key on the specific flag. Software
+which simply cares not to try operating on non-aligned extents
+however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to
+worry about all present and future flags which might imply unaligned
+data. Note that the opposite is not true - it would be valid for
+FIEMAP_EXTENT_NOT_ALIGNED to appear alone.
+
+* FIEMAP_EXTENT_LAST
+This is the last extent in the file. A mapping attempt past this
+extent will return nothing.
+
+* FIEMAP_EXTENT_UNKNOWN
+The location of this extent is currently unknown. This may indicate
+the data is stored on an inaccessible volume or that no storage has
+been allocated for the file yet.
+
+* FIEMAP_EXTENT_SECONDARY
+  - This will also set FIEMAP_EXTENT_UNKNOWN.
+The data for this extent is in secondary storage.
+
+* FIEMAP_EXTENT_DELALLOC
+  - This will also set FIEMAP_EXTENT_UNKNOWN.
+Delayed allocation - while there is data for this extent, it's
+physical location has not been allocated yet.
+
+* FIEMAP_EXTENT_NO_DIRECT
+Direct access to the data in this extent is illegal or will have
+undefined results.
+
+* FIEMAP_EXTENT_NET
+  - This will also set FIEMAP_EXTENT_NO_DIRECT
+The data for this extent is not stored in a locally-accessible device.
+
+* FIEMAP_EXTENT_DATA_COMPRESSED
+  - This will also set FIEMAP_EXTENT_NO_DIRECT
+The data in this extent has been compressed by the file system.
+
+* FIEMAP_EXTENT_DATA_ENCRYPTED
+  - This will also set FIEMAP_EXTENT_NO_DIRECT
+The data in this extent has been encrypted by the file system.
+
+* FIEMAP_EXTENT_NOT_ALIGNED
+Extent offsets and length are not guaranteed to be block aligned.
+
+* FIEMAP_EXTENT_DATA_INLINE
+  This will also set FIEMAP_EXTENT_NOT_ALIGNED
+Data is located within a meta data block.
+
+* FIEMAP_EXTENT_DATA_TAIL
+  This will also set FIEMAP_EXTENT_NOT_ALIGNED
+Data is packed into a block with data from other files.
+
+* FIEMAP_EXTENT_UNWRITTEN
+Unwritten extent - the extent is allocated but it's data has not been
+initialized.
+
+
+VFS -> File System Implementation
+---------------------------------
+
+File systems wishing to support fiemap must implement a ->fiemap
+callback (on struct inode_operations):
+
+struct inode_operations {
+       ...
+
+       int (*fiemap) (struct inode *, struct fiemap_extent_info *, u64 start,
+       	   	      u64 len);
+
+->fiemap is passed struct fiemap_extent_info which describes the
+fiemap request:
+
+struct fiemap_extent_info {
+	unsigned int	fi_flags;		/* Flags as passed from user */
+	unsigned int	fi_extents_mapped;	/* Number of mapped extents */
+	unsigned int	fi_extents_max;		/* Size of fiemap_extent array */
+	char		*fi_extents_start;	/* Start of fiemap_extent array */
+};
+
+It is intended that the file system should only need to access
+fi_flags directly. Aside from checking fi_flags to modify callback
+behavior, flags which the file system can not handle, can be written
+into fieinfo->fi_flags. In this case, the file system *must* return
+-EBADR so that ioctl_fiemap() can write them into the userspace
+buffer.
+
+For each extent in the request range, the file system should call
+the helper function, fiemap_fill_next_extent():
+
+int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
+			    u64 phys, u64 len, u32 flags, u32 lun);
+
+fiemap_fill_next_extent() will use the passed values to populate the
+next free extent in the fm_extents array. 'General' extent flags will
+automatically be set from specific flags on behalf of the calling file
+system so that the userspace API is not broken.
+
+fiemap_fill_next_extent() returns 0 on success, and 1 when the
+user-supplied fm_extents array is full. If an error is encountered
+while copying the extent to user memory, -EFAULT will be returned.
+
+If the request has the FIEMAP_FLAG_NUM_EXTENTS flag set, then calling
+this helper is not necessary and fi_extents_mapped can be set
+directly.
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 7db32b3..405bbcb 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -16,6 +16,9 @@
 
 #include <asm/ioctls.h>
 
+/* So that the fiemap access checks can't overflow on 32 bit machines. */
+#define FIEMAP_MAX_EXTENTS	(UINT_MAX / sizeof(struct fiemap_extent))
+
 /**
  * vfs_ioctl - call filesystem specific ioctl methods
  * @filp:	open file to invoke ioctl method on
@@ -71,6 +74,141 @@ static int ioctl_fibmap(struct file *filp, int __user *p)
 	return put_user(res, p);
 }
 
+/**
+ * fiemap_fill_next_extent - Fiemap helper function
+ * @fieinfo:	Fiemap context passed into ->fiemap
+ * @logical:	Extent logical start offset, in bytes
+ * @phys:	Extent physical start offset, in bytes
+ * @len:	Extent length, in bytes
+ * @flags:	FIEMAP_EXTENT flags that describe this extent
+ * @lun:	LUN on which this extent resides
+ *
+ * Called from file system ->fiemap callback. Will populate extent
+ * info as passed in via arguments and copy to user memory. On
+ * success, extent count on fieinfo is incremented.
+ *
+ * Returns 0 on success, -errno on error, 1 if this was the last
+ * extent that will fit in user array.
+ */
+#define SET_UNKNOWN_FLAGS	(FIEMAP_EXTENT_SECONDARY|FIEMAP_EXTENT_DELALLOC)
+#define SET_NO_DIRECT_FLAGS	(FIEMAP_EXTENT_DATA_COMPRESSED		\
+				 |FIEMAP_EXTENT_DATA_ENCRYPTED		\
+				 |FIEMAP_EXTENT_NET)
+#define SET_NOT_ALIGNED_FLAGS	(FIEMAP_EXTENT_DATA_TAIL|FIEMAP_EXTENT_DATA_INLINE)
+int fiemap_fill_next_extent(struct fiemap_extent_info *fieinfo,
+			    u64 logical, u64 phys, u64 len, u32 flags, u32 lun)
+{
+	struct fiemap_extent extent;
+	char *dest = fieinfo->fi_extents_start;
+
+	if (fieinfo->fi_flags & FIEMAP_FLAG_NUM_EXTENTS) {
+		fieinfo->fi_extents_mapped++;
+		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
+	}
+
+	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
+		return 1;
+
+	if (flags & SET_UNKNOWN_FLAGS)
+		flags |= FIEMAP_EXTENT_UNKNOWN;
+	if (flags & SET_NO_DIRECT_FLAGS)
+		flags |= FIEMAP_EXTENT_NO_DIRECT;
+	if (flags & SET_NOT_ALIGNED_FLAGS)
+		flags |= FIEMAP_EXTENT_NOT_ALIGNED;
+
+	extent.fe_logical = logical;
+	extent.fe_physical = phys;
+	extent.fe_length = len;
+	extent.fe_flags = flags;
+	extent.fe_lun = lun;
+
+	dest += (fieinfo->fi_extents_mapped * sizeof(extent));
+	if (copy_to_user(dest, &extent, sizeof(extent)))
+		return -EFAULT;
+
+	fieinfo->fi_extents_mapped++;
+	if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max)
+		return 1;
+	return 0;
+}
+EXPORT_SYMBOL(fiemap_fill_next_extent);
+
+static int ioctl_fiemap(struct file *filp, unsigned long arg)
+{
+	struct fiemap fiemap;
+	u64 len;
+	u32 incompat_flags;
+	struct fiemap_extent_info fieinfo = {0, };
+	struct inode *inode = filp->f_path.dentry->d_inode;
+	struct super_block *sb = inode->i_sb;
+	int error = 0;
+
+	if (!inode->i_op->fiemap)
+		return -EOPNOTSUPP;
+
+	if (copy_from_user(&fiemap, (struct fiemap __user *) arg,
+			   sizeof(struct fiemap)))
+		return -EFAULT;
+
+	/*
+	 * The VFS does basic sanity checks on the flag
+	 * value. Individual file systems can also reject otherwise
+	 * 'valid' flags by returning -EBADR from ->fiemap.
+	 */
+	incompat_flags = fiemap.fm_flags & ~FIEMAP_FLAGS_COMPAT;
+	if (incompat_flags)
+		goto out_bad_flags;
+
+	if (!(fiemap.fm_flags & FIEMAP_FLAG_NUM_EXTENTS) &&
+	    (fiemap.fm_extent_count == 0 ||
+	     fiemap.fm_extent_count > FIEMAP_MAX_EXTENTS))
+		return -EINVAL;
+
+	if (fiemap.fm_start > sb->s_maxbytes)
+		return -EFBIG;
+
+	len = fiemap.fm_length;
+	if (len == 0)
+		return -EINVAL;
+	/*
+	 * Shrink request scope to what the fs can actually handle.
+	 */
+	if ((len > sb->s_maxbytes) ||
+	    (sb->s_maxbytes - len) < fiemap.fm_start)
+		len = sb->s_maxbytes - fiemap.fm_start;
+
+	fieinfo.fi_flags = fiemap.fm_flags;
+	if (!(fiemap.fm_flags & FIEMAP_FLAG_NUM_EXTENTS)) {
+		fieinfo.fi_extents_max = fiemap.fm_extent_count;
+		fieinfo.fi_extents_start = (char *)arg + sizeof(fiemap);
+
+		if (!access_ok(VERIFY_WRITE, fieinfo.fi_extents_start,
+			       fieinfo.fi_extents_max * sizeof(struct fiemap_extent)))
+			return -EFAULT;
+	}
+
+	if (fieinfo.fi_flags & FIEMAP_FLAG_SYNC)
+		filemap_write_and_wait(inode->i_mapping);
+
+	error = inode->i_op->fiemap(inode, &fieinfo, fiemap.fm_start, len);
+	if (error == -EBADR) {
+		incompat_flags = fieinfo.fi_flags;
+		goto out_bad_flags;
+	}
+	if (error == 0) {
+		fiemap.fm_mapped_extents = fieinfo.fi_extents_mapped;
+		error = copy_to_user((char *)arg, &fiemap, sizeof(fiemap));
+	}
+
+	return error;
+
+out_bad_flags:
+	fiemap.fm_flags = incompat_flags;
+	if (copy_to_user((char *)arg, &fiemap, sizeof(fiemap)))
+		return -EFAULT;
+	return -EBADR;
+}
+
 static int file_ioctl(struct file *filp, unsigned int cmd,
 		unsigned long arg)
 {
@@ -80,6 +218,8 @@ static int file_ioctl(struct file *filp, unsigned int cmd,
 	switch (cmd) {
 	case FIBMAP:
 		return ioctl_fibmap(filp, p);
+	case FS_IOC_FIEMAP:
+		return ioctl_fiemap(filp, arg);
 	case FIGETBSZ:
 		return put_user(inode->i_sb->s_blocksize, p);
 	case FIONREAD:
diff --git a/include/linux/fiemap.h b/include/linux/fiemap.h
new file mode 100644
index 0000000..a6a629a
--- /dev/null
+++ b/include/linux/fiemap.h
@@ -0,0 +1,72 @@
+/*
+ * FIEMAP ioctl infrastructure.
+ *
+ * Copyright (C) 2007 Cluster File Systems, Inc
+ *
+ * Author: Kalpak Shah <kalpak.shah@sun.com>
+ *	   Andreas Dilger <adilger@sun.com>
+ */
+
+#ifndef _LINUX_FIEMAP_H
+#define _LINUX_FIEMAP_H
+
+struct fiemap_extent {
+	__u64	fe_logical;/* logical offset in bytes for the start of
+			    * the extent from the beginning of the file */
+	__u64	fe_physical; /* physical offset in bytes for the start
+			      * of the extent from the beginning of the disk */
+	__u64	fe_length; /* length in bytes for the extent */
+	__u32	fe_flags;  /* returned FIEMAP_EXTENT_* flags for the extent */
+	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/
+};
+
+struct fiemap {
+	__u64	fm_start;	 /* logical offset (inclusive) at
+				  * which to start mapping (in) */
+	__u64	fm_length;	 /* logical length of mapping which
+				  * userspace cares about (in) */
+	__u32	fm_flags;	 /* FIEMAP_FLAG_* flags for request (in/out) */
+	__u32	fm_extent_count; /* size of fm_extents array (in) */
+	__u32	fm_mapped_extents; /* number of extents that were
+				    * mapped (out) */
+	__u32	fm_reserved;
+	struct fiemap_extent	fm_extents[0];
+};
+
+#define FIEMAP_MAX_OFFSET	(~0ULL)
+
+#define FIEMAP_FLAG_NUM_EXTENTS 0x00000001 /* return only number of extents */
+#define FIEMAP_FLAG_SYNC	0x00000002 /* sync file data before map */
+#define FIEMAP_FLAG_HSM_READ	0x00000004 /* get data from HSM before map */
+#define FIEMAP_FLAG_XATTR	0x00000008 /* map extended attribute tree */
+#define	FIEMAP_FLAG_LUN_ORDER	0x00000010 /* return lun-sorted mappings */
+
+#define FIEMAP_FLAGS_COMPAT	(FIEMAP_FLAG_NUM_EXTENTS|FIEMAP_FLAG_SYNC|   \
+				 FIEMAP_FLAG_HSM_READ|FIEMAP_FLAG_XATTR|     \
+				 FIEMAP_FLAG_LUN_ORDER)
+
+#define	FIEMAP_EXTENT_LAST		0x00000001 /* Last extent in file. */
+#define	FIEMAP_EXTENT_UNKNOWN		0x00000002 /* In use, location unknown. */
+#define	FIEMAP_EXTENT_SECONDARY		0x00000004 /* Data in secondary storage.
+						    * Sets EXTENT_UNKNOWN. */
+#define	FIEMAP_EXTENT_DELALLOC		0x00000008 /* Has data, but not yet
+						    * written.
+						    * Sets EXTENT_UNKNOWN. */
+#define	FIEMAP_EXTENT_NO_DIRECT		0x00000010 /* Cannot access data directly. */
+#define	FIEMAP_EXTENT_NET		0x00000020 /* Data stored remotely.
+						    * Sets EXTENT_NO_DIRECT. */
+#define	FIEMAP_EXTENT_DATA_COMPRESSED	0x00000040 /* Data is compressed by fs.
+						    * Sets EXTENT_NO_DIRECT. */
+#define	FIEMAP_EXTENT_DATA_ENCRYPTED	0x00000080 /* Data is encrypted by fs.
+						    * Sets EXTENT_NO_DIRECT. */
+#define	FIEMAP_EXTENT_NOT_ALIGNED	0x00000100 /* Extent offsets may not be
+						    * block aligned. */
+#define	FIEMAP_EXTENT_DATA_INLINE	0x00000200 /* Data stored within fs
+						    * metadata.
+						    * Sets EXTENT_NOT_ALIGNED.*/
+#define	FIEMAP_EXTENT_DATA_TAIL		0x00000400 /* Data is tail-packed.
+						    * Sets EXTENT_NOT_ALIGNED.*/
+#define	FIEMAP_EXTENT_UNWRITTEN		0x00000800 /* Space allocated, but
+						    * no data. */
+
+#endif /* _LINUX_FIEMAP_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f413085..ed90393 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -228,6 +228,7 @@ extern int dir_notify_enable;
 #define	FS_IOC_SETFLAGS			_IOW('f', 2, long)
 #define	FS_IOC_GETVERSION		_IOR('v', 1, long)
 #define	FS_IOC_SETVERSION		_IOW('v', 2, long)
+#define FS_IOC_FIEMAP			_IOWR('f', 11, struct fiemap)
 #define FS_IOC32_GETFLAGS		_IOR('f', 1, int)
 #define FS_IOC32_SETFLAGS		_IOW('f', 2, int)
 #define FS_IOC32_GETVERSION		_IOR('v', 1, int)
@@ -288,6 +289,7 @@ extern int dir_notify_enable;
 #include <linux/mutex.h>
 #include <linux/capability.h>
 #include <linux/semaphore.h>
+#include <linux/fiemap.h>
 
 #include <asm/atomic.h>
 #include <asm/byteorder.h>
@@ -1144,6 +1146,18 @@ extern void dentry_unhash(struct dentry *dentry);
 extern int file_permission(struct file *, int);
 
 /*
+ * VFS FIEMAP helper definitions.
+ */
+struct fiemap_extent_info {
+	unsigned int	fi_flags;		/* Flags as passed from user */
+	unsigned int	fi_extents_mapped;	/* Number of mapped extents */
+	unsigned int	fi_extents_max;		/* Size of fiemap_extent array */
+	char		*fi_extents_start;	/* Start of fiemap_extent array */
+};
+int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
+			    u64 phys, u64 len, u32 flags, u32 lun);
+
+/*
  * File types
  *
  * NOTE! These match bits 12..15 of stat.st_mode
@@ -1273,6 +1287,8 @@ struct inode_operations {
 	void (*truncate_range)(struct inode *, loff_t, loff_t);
 	long (*fallocate)(struct inode *inode, int mode, loff_t offset,
 			  loff_t len);
+	int (*fiemap) (struct inode *, struct fiemap_extent_info *, u64 start,
+		       u64 len);
 };
 
 struct seq_file;
-- 
1.5.4.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-25  0:01 [PATCH 1/5] vfs: vfs-level fiemap interface Mark Fasheh
@ 2008-05-25  7:28 ` Andreas Dilger
  2008-05-27 18:31   ` Mark Fasheh
  2008-05-28 19:42 ` Andreas Dilger
  2008-06-05  5:18 ` Andreas Dilger
  2 siblings, 1 reply; 20+ messages in thread
From: Andreas Dilger @ 2008-05-25  7:28 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-fsdevel

On May 24, 2008  17:01 -0700, Mark Fasheh wrote:
> Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap
> inode operation.

Mark, this appears to be missing most of my previous comments.  Are
you getting my emails, or is there still a problem?

> +++ b/Documentation/filesystems/fiemap.txt
> +A fiemap request is encoded within struct fiemap:
> +
> +struct fiemap {
> +	__u64	fm_start;	 /* logical offset (inclusive) at
> +				  * which to start mapping (in) */
> +	__u64	fm_length;	 /* logical length of mapping which
> +				  * userspace cares about (in) */
> +	__u32	fm_flags;	 /* FIEMAP_FLAG_* flags for request (in) */

This should be marked (in/out) because of error return, documented below.

> +* FIEMAP_EXTENT_UNKNOWN
> +The location of this extent is currently unknown. This may indicate
> +the data is stored on an inaccessible volume or that no storage has
> +been allocated for the file yet.
> +
> +* FIEMAP_EXTENT_SECONDARY
> +  - This will also set FIEMAP_EXTENT_UNKNOWN.
> +The data for this extent is in secondary storage.

This is not correct that it will always set FIEMAP_EXTENT_UNKNOWN here.
It is possible (common) for data to be stored on both primary AND
secondary storage at the same time.  It is still desirable to mark the
data as FIEMAP_EXTENT_SECONDARY even if it is still resident in the
filesystem, because this informs the user that the HSM/backup system
has made a copy of this file already.

> +For each extent in the request range, the file system should call
> +the helper function, fiemap_fill_next_extent():
> +
> +int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> +			    u64 phys, u64 len, u32 flags, u32 lun);
> +
> +fiemap_fill_next_extent() will use the passed values to populate the
> +next free extent in the fm_extents array. 'General' extent flags will
> +automatically be set from specific flags on behalf of the calling file
> +system so that the userspace API is not broken.
> +
> +fiemap_fill_next_extent() returns 0 on success, and 1 when the
> +user-supplied fm_extents array is full. If an error is encountered
> +while copying the extent to user memory, -EFAULT will be returned.
> +
> +If the request has the FIEMAP_FLAG_NUM_EXTENTS flag set, then calling
> +this helper is not necessary and fi_extents_mapped can be set
> +directly.

It is worthwhile to mention here that if FIEMAP_FLAG_NUM_EXTENTS IS set,
it is also OK for the filesystem to call fiemap_fill_next_extent(), and
only the number of extents will be counted, avoiding the need to special-
case this flag in the filesystem implementation.



What about the idea to have fiemap_fill_next_extent() do "extent" merging
for filesystems that use the generic helper but do not return multiple
blocks via get_blocks()?  I don't think that is too hard to implement,
and makes the output much more useful, otherwise we get an extent per block.
The above is what I _think_ will work, haven't actually tried it out.

struct fiemap_extnent_info {
	struct fiemap_extent	fi_last;	/* Copy of previous extent */
	unsigned int	fi_flags;		/* Flags as passed from user */
	unsigned int	fi_extents_mapped;      /* Number of mapped extents */
	unsigned int	fi_extents_max;		/* fiemap_extent array size */
	char            *fi_extents_start;	/* fiemap_extent array start */
}

int fiemap_fill_next_extent(struct fiemap_extent_info *fieinfo,
			    u64 logical, u64 phys, u64 len, u32 flags, u32 lun)
{
	struct fiemap_extent *extent = &fieinfo->fi_extent_last;
	int joined = 1;

	if (flags & SET_UNKNOWN_FLAGS)
		flags |= FIEMAP_EXTENT_UNKNOWN;
	if (flags & SET_NO_DIRECT_FLAGS)
		flags |= FIEMAP_EXTENT_NO_DIRECT;
	if (flags & SET_NOT_ALIGNED_FLAGS)
		flags |= FIEMAP_EXTENT_NOT_ALIGNED;

	/* If this is an extension of the previous extent, add it in */
	if (fieinfo->fi_extents_mapped > 0 &&
	    extent->fe_logical + extent->fe_length == logical &&
	    extent->fe_physical + extent->fe_length == phys &&
	    extent->fe_flags == (flags & ~FIEMAP_FLAG_LAST) &&
	    extent->fe_lun == lun) {
	    	extent->fe_length += len;
		extent->fe_flags = flags;	/* maybe set FIEMAP_FLAG_LAST */

		if (!(fieinfo->fi_flags & FIEMAP_FLAG_NUM_EXTENTS))
			fieinfo->fe_extents_start -= sizeof(*extent);
		fieinfo->fi_extents_mapped--;

		joined = 1;
	}

	if (fieinfo->fi_flags & FIEMAP_FLAG_NUM_EXTENTS) {
		fieinfo->fi_extents_mapped++;
		return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
	}

	if (joined)
		goto copy;

	if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max)
		return 1;

	/* Store new extent info in fi_extent_last for next call */
	extent->fe_logical = logical;
	extent->fe_physical = phys;
	extent->fe_length = len;
	extent->fe_flags = flags;
	extent->fe_lun = lun;

copy:
	if (copy_to_user(fieinfo->fi_extents_start, extent, sizeof(*extent)))
		return -EFAULT;

	fieinfo->fi_extents_start += sizeof(*extent);
	fieinfo->fi_extents_mapped++;

	/* Don't return 1 if we are using the last extent slot as we may
	 * be able to merge with the next extent, and we check above if
	 * the next extent won't fit into the array.
	 */

	return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0;
}
EXPORT_SYMBOL(fiemap_fill_next_extent);

> +static int ioctl_fiemap(struct file *filp, unsigned long arg)
> +{
> +	struct fiemap fiemap;
> +	u64 len;
> +	u32 incompat_flags;
> +	struct fiemap_extent_info fieinfo = {0, };
> +	struct inode *inode = filp->f_path.dentry->d_inode;
> +	struct super_block *sb = inode->i_sb;
> +	int error = 0;
> +
> +	if (!inode->i_op->fiemap)
> +		return -EOPNOTSUPP;
> +
> +	if (copy_from_user(&fiemap, (struct fiemap __user *) arg,
> +			   sizeof(struct fiemap)))
> +		return -EFAULT;
> +
> +	/*
> +	 * The VFS does basic sanity checks on the flag
> +	 * value. Individual file systems can also reject otherwise
> +	 * 'valid' flags by returning -EBADR from ->fiemap.
> +	 */
> +	incompat_flags = fiemap.fm_flags & ~FIEMAP_FLAGS_COMPAT;
> +	if (incompat_flags)
> +		goto out_bad_flags;

Please remove this, and leave the checking for the individual filesystems.
They all need to do it anyways, because they can't depend on the check here
never changing or always being updated when this check does.  At best it is
redundant with the check in the filesystem, at worst it will lead to bugs
in the filesystem-specific code due to inconsistent checks.

> +#define FIEMAP_FLAGS_COMPAT	(FIEMAP_FLAG_NUM_EXTENTS|FIEMAP_FLAG_SYNC|   \
> +				 FIEMAP_FLAG_HSM_READ|FIEMAP_FLAG_XATTR|     \
> +				 FIEMAP_FLAG_LUN_ORDER)

This isn't needed anymore.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-25  7:28 ` Andreas Dilger
@ 2008-05-27 18:31   ` Mark Fasheh
  2008-05-28 16:09     ` Andreas Dilger
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Fasheh @ 2008-05-27 18:31 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel

On Sun, May 25, 2008 at 01:28:16AM -0600, Andreas Dilger wrote:
> On May 24, 2008  17:01 -0700, Mark Fasheh wrote:
> > Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap
> > inode operation.
> 
> Mark, this appears to be missing most of my previous comments.  Are
> you getting my emails, or is there still a problem?

No, I got your mail.. "missing most of" is an exaggeration though, isn't it?
I just missed one or two things  ;)


> > +++ b/Documentation/filesystems/fiemap.txt
> > +A fiemap request is encoded within struct fiemap:
> > +
> > +struct fiemap {
> > +	__u64	fm_start;	 /* logical offset (inclusive) at
> > +				  * which to start mapping (in) */
> > +	__u64	fm_length;	 /* logical length of mapping which
> > +				  * userspace cares about (in) */
> > +	__u32	fm_flags;	 /* FIEMAP_FLAG_* flags for request (in) */
> 
> This should be marked (in/out) because of error return, documented below.

Ahh, yeah I updated it in the header, but forgot about the document.


> > +* FIEMAP_EXTENT_UNKNOWN
> > +The location of this extent is currently unknown. This may indicate
> > +the data is stored on an inaccessible volume or that no storage has
> > +been allocated for the file yet.
> > +
> > +* FIEMAP_EXTENT_SECONDARY
> > +  - This will also set FIEMAP_EXTENT_UNKNOWN.
> > +The data for this extent is in secondary storage.
> 
> This is not correct that it will always set FIEMAP_EXTENT_UNKNOWN here.
> It is possible (common) for data to be stored on both primary AND
> secondary storage at the same time.  It is still desirable to mark the
> data as FIEMAP_EXTENT_SECONDARY even if it is still resident in the
> filesystem, because this informs the user that the HSM/backup system
> has made a copy of this file already.

Ok. I think I need to modify fiemap_fill_next_extent() slightly for that
too.


> > +For each extent in the request range, the file system should call
> > +the helper function, fiemap_fill_next_extent():
> > +
> > +int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical,
> > +			    u64 phys, u64 len, u32 flags, u32 lun);
> > +
> > +fiemap_fill_next_extent() will use the passed values to populate the
> > +next free extent in the fm_extents array. 'General' extent flags will
> > +automatically be set from specific flags on behalf of the calling file
> > +system so that the userspace API is not broken.
> > +
> > +fiemap_fill_next_extent() returns 0 on success, and 1 when the
> > +user-supplied fm_extents array is full. If an error is encountered
> > +while copying the extent to user memory, -EFAULT will be returned.
> > +
> > +If the request has the FIEMAP_FLAG_NUM_EXTENTS flag set, then calling
> > +this helper is not necessary and fi_extents_mapped can be set
> > +directly.
> 
> It is worthwhile to mention here that if FIEMAP_FLAG_NUM_EXTENTS IS set,
> it is also OK for the filesystem to call fiemap_fill_next_extent(), and
> only the number of extents will be counted, avoiding the need to special-
> case this flag in the filesystem implementation.

Right - good idea.


> What about the idea to have fiemap_fill_next_extent() do "extent" merging
> for filesystems that use the generic helper but do not return multiple
> blocks via get_blocks()?  I don't think that is too hard to implement,
> and makes the output much more useful, otherwise we get an extent per block.
> The above is what I _think_ will work, haven't actually tried it out.

I don't think we want to automatically merge extents within this helper
function. Otherwise we would diverge from the actual disk layout for extent
based file systems where an extent might be broken up between two records
for some other reason, such as maximum extent length being exceeded.

It probably makes more sense to encode this merging in the block-based
function instead since it's pretty specific to those.

Btw, how many block-based file systems that don't return multiple blocks via
get_blocks() are there that we actually care about enough to write this
code?


> > +static int ioctl_fiemap(struct file *filp, unsigned long arg)
> > +{
> > +	struct fiemap fiemap;
> > +	u64 len;
> > +	u32 incompat_flags;
> > +	struct fiemap_extent_info fieinfo = {0, };
> > +	struct inode *inode = filp->f_path.dentry->d_inode;
> > +	struct super_block *sb = inode->i_sb;
> > +	int error = 0;
> > +
> > +	if (!inode->i_op->fiemap)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (copy_from_user(&fiemap, (struct fiemap __user *) arg,
> > +			   sizeof(struct fiemap)))
> > +		return -EFAULT;
> > +
> > +	/*
> > +	 * The VFS does basic sanity checks on the flag
> > +	 * value. Individual file systems can also reject otherwise
> > +	 * 'valid' flags by returning -EBADR from ->fiemap.
> > +	 */
> > +	incompat_flags = fiemap.fm_flags & ~FIEMAP_FLAGS_COMPAT;
> > +	if (incompat_flags)
> > +		goto out_bad_flags;
> 
> Please remove this, and leave the checking for the individual filesystems.
> They all need to do it anyways, because they can't depend on the check here
> never changing or always being updated when this check does.

I hope that when we add new feature flags to fiemap, we take the time to
do it right and add it to FIEMAP_FLAGS_COMPAT, which is defined in the same
header for that reason. Likewise, we can't just add flags in any case
without considering how they affect existing users of ->fiemap. If we add a
flag that everyone can implement for example, we'll have to touch all file
systems anyway.


> At best it is redundant with the check in the filesystem, at worst it
> will lead to bugs in the filesystem-specific code due to inconsistent
> checks.

I makes sense to me that the VFS should verify that flag values passed
are valid in that they've been defined by the kernel <-> user interface. The
file systems can and should still reject flags which they can not support.
So the checks are *designed* to not be consistent, but rather allow the
client file system to whittle down the supported flags further.

Otherwise, it seems to me that client file systems could start to define
their own flags which might eventually stomp on those we'd like to add
later.
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-27 18:31   ` Mark Fasheh
@ 2008-05-28 16:09     ` Andreas Dilger
  2008-05-28 17:24       ` Joel Becker
  0 siblings, 1 reply; 20+ messages in thread
From: Andreas Dilger @ 2008-05-28 16:09 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-fsdevel

On May 27, 2008  11:31 -0700, Mark Fasheh wrote:
> > > +	/*
> > > +	 * The VFS does basic sanity checks on the flag
> > > +	 * value. Individual file systems can also reject otherwise
> > > +	 * 'valid' flags by returning -EBADR from ->fiemap.
> > > +	 */
> > > +	incompat_flags = fiemap.fm_flags & ~FIEMAP_FLAGS_COMPAT;
> > > +	if (incompat_flags)
> > > +		goto out_bad_flags;
> > 
> > Please remove this, and leave the checking for the individual filesystems.
> > They all need to do it anyways, because they can't depend on the check here
> > never changing or always being updated when this check does.
> 
> I hope that when we add new feature flags to fiemap, we take the time to
> do it right and add it to FIEMAP_FLAGS_COMPAT, which is defined in the same
> header for that reason. Likewise, we can't just add flags in any case
> without considering how they affect existing users of ->fiemap. If we add a
> flag that everyone can implement for example, we'll have to touch all file
> systems anyway.

Well, the problem is that older kernel will not have the most current
upstream FIEMAP_FLAGS_COMPAT, and short of patching the kernel to change
this ONE value (which we've worked for a long time to not do on the
client) there is no way that support for additional upstream flags can
be added.

Because the generic ioctl handling is done before the filesystem-specific
one (sys_ioctl->vfs_ioctl->file_ioctl->do_ioctl->(f_op->ioctl()), we can't
override the FIEMAP ioctl handling in the filesystem, regardless of what
we do with the ->fiemap() method.

> > At best it is redundant with the check in the filesystem, at worst it
> > will lead to bugs in the filesystem-specific code due to inconsistent
> > checks.
> 
> I makes sense to me that the VFS should verify that flag values passed
> are valid in that they've been defined by the kernel <-> user interface. The
> file systems can and should still reject flags which they can not support.
>
> So the checks are *designed* to not be consistent, but rather allow the
> client file system to whittle down the supported flags further.

But the problem is that people are error prone in their updating of code,
and if the filesystems assume "the VFS has checked all of the flags except
this one I don't understand" will likely become incorrect over time as
someone will forget, will misunderstand whether the different per-fs codes
need to be updated, or some patch will be delayed in a FS maintainer queue
while the VFS "acceptance" of the new feature will be included upstream.

I don't think it is safe for filesystems to depend on the VFS check, and
if the filesystems need to do their own complete check then the VFS doesn't
need to do it at all.  There isn't really much that the VFS cares about
the fm_flags field, with NUM_EXTENTS being the only notable exception.
The proposed FIEMAP_FLAG_METADATA is again something that the VFS doesn't
care about.  Even then, since the filesystems would do their own check
and would return -EBADR there isn't anything bad that could happen.

> Otherwise, it seems to me that client file systems could start to define
> their own flags which might eventually stomp on those we'd like to add
> later.

It would be foolish to do so, as long as it isn't impossible to get new
flags defined in the upstream kernel.

The issue is that most users don't have the latest upstream kernel
because they are using a vendor kernel that is a few years old, as you
likely know, but an updated Lustre or OCFS2 or btrfs should work with
the existing vendor kernels.

If we wanted to add something like FIEMAP_FLAG_METADATA, if the check
was done in the VFS, it would be impossible without patching the client
even if it exactly matched the upstream kernel implementation.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-28 16:09     ` Andreas Dilger
@ 2008-05-28 17:24       ` Joel Becker
  2008-05-29 23:46         ` Andreas Dilger
  0 siblings, 1 reply; 20+ messages in thread
From: Joel Becker @ 2008-05-28 17:24 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Mark Fasheh, linux-fsdevel

On Wed, May 28, 2008 at 10:09:52AM -0600, Andreas Dilger wrote:
> On May 27, 2008  11:31 -0700, Mark Fasheh wrote:
> > I hope that when we add new feature flags to fiemap, we take the time to
> > do it right and add it to FIEMAP_FLAGS_COMPAT, which is defined in the same
> > header for that reason. Likewise, we can't just add flags in any case
> > without considering how they affect existing users of ->fiemap. If we add a
> > flag that everyone can implement for example, we'll have to touch all file
> > systems anyway.
> 
> Well, the problem is that older kernel will not have the most current
> upstream FIEMAP_FLAGS_COMPAT, and short of patching the kernel to change
> this ONE value (which we've worked for a long time to not do on the
> client) there is no way that support for additional upstream flags can
> be added.

...

> But the problem is that people are error prone in their updating of code,
> and if the filesystems assume "the VFS has checked all of the flags except
> this one I don't understand" will likely become incorrect over time as
> someone will forget, will misunderstand whether the different per-fs codes
> need to be updated, or some patch will be delayed in a FS maintainer queue
> while the VFS "acceptance" of the new feature will be included upstream.

	This is a specious argument - if it doesn't go upstream, we
then have the overloaded-flag problem.  If you're looking for vendor
flags, let's just design a space for them.
 
> The issue is that most users don't have the latest upstream kernel
> because they are using a vendor kernel that is a few years old, as you
> likely know, but an updated Lustre or OCFS2 or btrfs should work with
> the existing vendor kernels.
> 
> If we wanted to add something like FIEMAP_FLAG_METADATA, if the check
> was done in the VFS, it would be impossible without patching the client
> even if it exactly matched the upstream kernel implementation.

	First, getting vendor kernels to update a supported flag set
that is already in mainline is pretty easy.  They are rightly interested
in following a well-defined interface, which is what Mark's trying to do
- no filesystems supporting flags that aren't part of the well-defined
interface.
	But if you are really worried about no kernel updates when you
install a new fs module, you can still solve it with a generic check.
Just add /proc/sys/fs/fiemap-flag-mask.  This covers any new flags for
the generic VFS check.  Alternately, allow filesystems to register their
flags and then do the VFS check based on that.

Joel

-- 

"What do you take me for, an idiot?"  
        - General Charles de Gaulle, when a journalist asked him
          if he was happy.

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-28 17:24       ` Joel Becker
@ 2008-05-29 23:46         ` Andreas Dilger
  2008-05-30  0:15           ` Mark Fasheh
  0 siblings, 1 reply; 20+ messages in thread
From: Andreas Dilger @ 2008-05-29 23:46 UTC (permalink / raw)
  To: Joel Becker; +Cc: Mark Fasheh, linux-fsdevel

On May 28, 2008  10:24 -0700, Joel Becker wrote:
> On Wed, May 28, 2008 at 10:09:52AM -0600, Andreas Dilger wrote:
> > But the problem is that people are error prone in their updating of code,
> > and if the filesystems assume "the VFS has checked all of the flags except
> > this one I don't understand" will likely become incorrect over time as
> > someone will forget, will misunderstand whether the different per-fs codes
> > need to be updated, or some patch will be delayed in a FS maintainer queue
> > while the VFS "acceptance" of the new feature will be included upstream.
> 
> 	This is a specious argument - if it doesn't go upstream, we
> then have the overloaded-flag problem.

I was actually thinking of the opposite case - the VFS part of the new
flag is included upstream (i.e. ioctl_fiemap() allows the new flag),
but the filesystem-specific part is delayed by some maintainer (or lack
thereof).

We've had an ongoing issue with ext4 because we need EXPORT_SYMBOL(zero_page),
but this is not making it through the m68k maintainer yet the ext4 part of
the patch is already upstream and Andrew complains about it regularly.

> If you're looking for vendor flags, let's just design a space for them.

By no means am I looking for "private" flags or adding support for flags
that don't exist upstream (assuming it is reasonable to get new flags
upstream).  What I'm specifically concerned about is being able to support
new features that are properly accepted upstream in Lustre built against
older vendor kernels.  We are trying to get out of the kernel-patching
days because customers aren't willing to void their kernel or 3rd-party
application support by running a patched kernel on the client.

Since this is a relatively new API, I think several features like
FIEMAP_FLAG_XATTR, FIEMAP_FLAG_METADATA, and maybe a few others will be
added in the next several months, and some vendor will grab one of the
"has FIEMAP, but not all of the flags" kernels and we won't be able to
add newer features on that kernel for possibly several years.

> > The issue is that most users don't have the latest upstream kernel
> > because they are using a vendor kernel that is a few years old, as you
> > likely know, but an updated Lustre or OCFS2 or btrfs should work with
> > the existing vendor kernels.
> > 
> > If we wanted to add something like FIEMAP_FLAG_METADATA, if the check
> > was done in the VFS, it would be impossible without patching the client
> > even if it exactly matched the upstream kernel implementation.
> 
> 	First, getting vendor kernels to update a supported flag set
> that is already in mainline is pretty easy.  They are rightly interested
> in following a well-defined interface, which is what Mark's trying to do
> - no filesystems supporting flags that aren't part of the well-defined
> interface.

Reasonably so, yes.  The issue is that everyone is busy, and what may
be a priority for us isn't necessarily for the vendor, and there is
another hurdle trying to get the customer to upgrade the kernels on
their 10000-node cluster to add some bits to the compatibility flags.
Being able to add in e.g. FIEMAP_FLAG_XATTR ourselves is easier.

> 	But if you are really worried about no kernel updates when you
> install a new fs module, you can still solve it with a generic check.
> Just add /proc/sys/fs/fiemap-flag-mask.  This covers any new flags for
> the generic VFS check.  Alternately, allow filesystems to register their
> flags and then do the VFS check based on that.

If you are suggesting that the filesystems all export their "supported
flags" mask somewhere, and the VFS uses that for a check, then yes I agree
it would be possible to do.  I don't see a huge benefit of that over just
letting the filesystems do it directly themselves at that point.  Adding a
/proc or /sys or /debugfs tunable for this seems heavyweight, and needs
a sysctl or other setting on each boot - a pain for diskless clients.

It seems backward to me to add arbitrary limits to the API when it was
designed in the first place to be flexible and allow features to be
added easily.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-29 23:46         ` Andreas Dilger
@ 2008-05-30  0:15           ` Mark Fasheh
  2008-05-30 17:24             ` Andreas Dilger
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Fasheh @ 2008-05-30  0:15 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Joel Becker, linux-fsdevel

On Thu, May 29, 2008 at 05:46:48PM -0600, Andreas Dilger wrote:
> > If you're looking for vendor flags, let's just design a space for them.
> 
> By no means am I looking for "private" flags or adding support for flags
> that don't exist upstream (assuming it is reasonable to get new flags
> upstream).  What I'm specifically concerned about is being able to support
> new features that are properly accepted upstream in Lustre built against
> older vendor kernels.  We are trying to get out of the kernel-patching
> days because customers aren't willing to void their kernel or 3rd-party
> application support by running a patched kernel on the client.

So, how about we just put the flag checks in a helper function which the fs
calls. It would take as arguments, the fiemap_extent_info struct and set of
flags which the fs supports. That would allow backported modules to plug in
their own check, but we retain the niceness of having the checks in a
central place for in-tree file systems.

int fiemap_check_flags(struct fiemap_extent_info fieinfo, u32 fs_flags)
{
	u32 incompat_flags;

	incompat_flags = fieinfo.fi_flags & ~(FIEMAP_FLAGS_COMPAT & fs_flags);
	if (incompat_flags) {
		fieinfo->fi_flags = incompat_flags;
		return -EBADR;
	}
	return 0;
}


> > 	But if you are really worried about no kernel updates when you
> > install a new fs module, you can still solve it with a generic check.
> > Just add /proc/sys/fs/fiemap-flag-mask.  This covers any new flags for
> > the generic VFS check.  Alternately, allow filesystems to register their
> > flags and then do the VFS check based on that.
> 
> If you are suggesting that the filesystems all export their "supported
> flags" mask somewhere, and the VFS uses that for a check, then yes I agree
> it would be possible to do.  I don't see a huge benefit of that over just
> letting the filesystems do it directly themselves at that point.  Adding a
> /proc or /sys or /debugfs tunable for this seems heavyweight, and needs
> a sysctl or other setting on each boot - a pain for diskless clients.

Yeah, I agree that it's total overkill.
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-30  0:15           ` Mark Fasheh
@ 2008-05-30 17:24             ` Andreas Dilger
  0 siblings, 0 replies; 20+ messages in thread
From: Andreas Dilger @ 2008-05-30 17:24 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Joel Becker, linux-fsdevel

On May 29, 2008  17:15 -0700, Mark Fasheh wrote:
> calls. It would take as arguments, the fiemap_extent_info struct and set of
> flags which the fs supports. That would allow backported modules to plug in
> their own check, but we retain the niceness of having the checks in a
> central place for in-tree file systems.
> 
> int fiemap_check_flags(struct fiemap_extent_info fieinfo, u32 fs_flags)
> {
> 	u32 incompat_flags;
> 
> 	incompat_flags = fieinfo.fi_flags & ~(FIEMAP_FLAGS_COMPAT & fs_flags);
> 	if (incompat_flags) {
> 		fieinfo->fi_flags = incompat_flags;
> 		return -EBADR;
> 	}
> 	return 0;
> }

I'm OK with this approach.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-25  0:01 [PATCH 1/5] vfs: vfs-level fiemap interface Mark Fasheh
  2008-05-25  7:28 ` Andreas Dilger
@ 2008-05-28 19:42 ` Andreas Dilger
  2008-05-28 19:54   ` Josef Bacik
                     ` (3 more replies)
  2008-06-05  5:18 ` Andreas Dilger
  2 siblings, 4 replies; 20+ messages in thread
From: Andreas Dilger @ 2008-05-28 19:42 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-fsdevel, Josef Bacik

On May 24, 2008  17:01 -0700, Mark Fasheh wrote:
> Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap
> inode operation.

Mark, I was looking at a way to remove the special-casing of NUM_EXTENTS
from ioctl_fiemap() in an effort to remove Christoph's objection to
keeping these in the same ioctl.

I think it is possible and reasonable to move the special-case handling
into fiemap_fill_next_extent().

> +static int ioctl_fiemap(struct file *filp, unsigned long arg)
> +{
> +
> +	if (!(fiemap.fm_flags & FIEMAP_FLAG_NUM_EXTENTS) &&
> +	    (fiemap.fm_extent_count == 0 ||
> +	     fiemap.fm_extent_count > FIEMAP_MAX_EXTENTS))
> +		return -EINVAL;

This can be changed to only check:

	if (fm_extent_count > FIEMAP_MAX_EXTENTS)
		return -EINVAL;

> +	fieinfo.fi_flags = fiemap.fm_flags;
> +	if (!(fiemap.fm_flags & FIEMAP_FLAG_NUM_EXTENTS)) {
> +		fieinfo.fi_extents_max = fiemap.fm_extent_count;
> +		fieinfo.fi_extents_start = (char *)arg + sizeof(fiemap);
> +
> +		if (!access_ok(VERIFY_WRITE, fieinfo.fi_extents_start,
> +			       fieinfo.fi_extents_max * sizeof(struct fiemap_extent)))
> +			return -EFAULT;
> +	}

It is harmless to set fi_extents_max and fi_extents_start, as this is
ignored by NUM_EXTENTS.  The fiemap_fill_next_extent() will already
check in copy_to_user() whether the fi_extents_start pointer is valid,
and fiemap_fill_next_extent() doesn't even get far enough to look at
fi_extents_max or fi_extents_start.  We just do:

	fieinfo.fi_extents = fiemap.fm_extent_count;
	fieinfo.fi_extents_start = (struct fiemap_extent *)((char *)arg +
							    sizeof(fiemap));

This leaves us with no checks for FIEMAP_FLAG_NUM_EXTENTS in ioctl_fiemap()
at all, and no changes needed in fiemap_fill_next_extent().

> > What about the idea to have fiemap_fill_next_extent() do "extent" merging
> > for filesystems that use the generic helper but do not return multiple
> > blocks via get_blocks()?  I don't think that is too hard to implement,
> > and makes the output more useful, otherwise we get an extent per block.  
> > The above is what I _think_ will work, haven't actually tried it out.
> 
> I don't think we want to automatically merge extents within this helper
> function. Otherwise we would diverge from the actual disk layout for extent
> based file systems where an extent might be broken up between two records
> for some other reason, such as maximum extent length being exceeded.

Do we really want to expose the filesystem-specific extent-length limits
to userspace?  In some sense, a block-based filesystem has a maximum
extent length of the blocksize, but it seems totally reasonable to merge
contiguous blocks into a single "extent" for return to userspace.  I
don't see this significantly different for ext4, even though it can have
extents up to 128MB, or unwritten extents up to 64MB.

> Btw, how many block-based file systems that don't return multiple blocks via
> get_blocks() are there that we actually care about enough to write this
> code?

That I have no clue about.  Joseph?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-28 19:42 ` Andreas Dilger
@ 2008-05-28 19:54   ` Josef Bacik
  2008-05-28 20:12     ` Mark Fasheh
  2008-05-28 21:23   ` Mark Fasheh
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Josef Bacik @ 2008-05-28 19:54 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Mark Fasheh, linux-fsdevel, Josef Bacik

On Wed, May 28, 2008 at 01:42:15PM -0600, Andreas Dilger wrote:
> On May 24, 2008  17:01 -0700, Mark Fasheh wrote:
> > Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap
> > inode operation.
> 
> Mark, I was looking at a way to remove the special-casing of NUM_EXTENTS
> from ioctl_fiemap() in an effort to remove Christoph's objection to
> keeping these in the same ioctl.
> 
> I think it is possible and reasonable to move the special-case handling
> into fiemap_fill_next_extent().
> 
> > +static int ioctl_fiemap(struct file *filp, unsigned long arg)
> > +{
> > +
> > +	if (!(fiemap.fm_flags & FIEMAP_FLAG_NUM_EXTENTS) &&
> > +	    (fiemap.fm_extent_count == 0 ||
> > +	     fiemap.fm_extent_count > FIEMAP_MAX_EXTENTS))
> > +		return -EINVAL;
> 
> This can be changed to only check:
> 
> 	if (fm_extent_count > FIEMAP_MAX_EXTENTS)
> 		return -EINVAL;
> 
> > +	fieinfo.fi_flags = fiemap.fm_flags;
> > +	if (!(fiemap.fm_flags & FIEMAP_FLAG_NUM_EXTENTS)) {
> > +		fieinfo.fi_extents_max = fiemap.fm_extent_count;
> > +		fieinfo.fi_extents_start = (char *)arg + sizeof(fiemap);
> > +
> > +		if (!access_ok(VERIFY_WRITE, fieinfo.fi_extents_start,
> > +			       fieinfo.fi_extents_max * sizeof(struct fiemap_extent)))
> > +			return -EFAULT;
> > +	}
> 
> It is harmless to set fi_extents_max and fi_extents_start, as this is
> ignored by NUM_EXTENTS.  The fiemap_fill_next_extent() will already
> check in copy_to_user() whether the fi_extents_start pointer is valid,
> and fiemap_fill_next_extent() doesn't even get far enough to look at
> fi_extents_max or fi_extents_start.  We just do:
> 
> 	fieinfo.fi_extents = fiemap.fm_extent_count;
> 	fieinfo.fi_extents_start = (struct fiemap_extent *)((char *)arg +
> 							    sizeof(fiemap));
> 	
> This leaves us with no checks for FIEMAP_FLAG_NUM_EXTENTS in ioctl_fiemap()
> at all, and no changes needed in fiemap_fill_next_extent().
> 
> > > What about the idea to have fiemap_fill_next_extent() do "extent" merging
> > > for filesystems that use the generic helper but do not return multiple
> > > blocks via get_blocks()?  I don't think that is too hard to implement,
> > > and makes the output more useful, otherwise we get an extent per block.  
> > > The above is what I _think_ will work, haven't actually tried it out.
> > 
> > I don't think we want to automatically merge extents within this helper
> > function. Otherwise we would diverge from the actual disk layout for extent
> > based file systems where an extent might be broken up between two records
> > for some other reason, such as maximum extent length being exceeded.
> 
> Do we really want to expose the filesystem-specific extent-length limits
> to userspace?  In some sense, a block-based filesystem has a maximum
> extent length of the blocksize, but it seems totally reasonable to merge
> contiguous blocks into a single "extent" for return to userspace.  I
> don't see this significantly different for ext4, even though it can have
> extents up to 128MB, or unwritten extents up to 64MB.
> 
> > Btw, how many block-based file systems that don't return multiple blocks via
> > get_blocks() are there that we actually care about enough to write this
> > code?
> 
> That I have no clue about.  Joseph?
> 
>

Well I can't comment on the "care enough about" part, but looking at it I don't
think reiserfs does, it just maps the one block we ask for.  Just cursory
looking through stuff it looks like most other stuff does map as much as it can,
even FAT does.  Thanks much,

Josef 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-28 19:54   ` Josef Bacik
@ 2008-05-28 20:12     ` Mark Fasheh
  2008-05-28 20:19       ` Josef Bacik
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Fasheh @ 2008-05-28 20:12 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Andreas Dilger, linux-fsdevel

On Wed, May 28, 2008 at 03:54:29PM -0400, Josef Bacik wrote:
> > > Btw, how many block-based file systems that don't return multiple blocks via
> > > get_blocks() are there that we actually care about enough to write this
> > > code?
> > 
> > That I have no clue about.  Joseph?
> > 
> >
> 
> Well I can't comment on the "care enough about" part, but looking at it I don't
> think reiserfs does, it just maps the one block we ask for.  Just cursory
> looking through stuff it looks like most other stuff does map as much as it can,
> even FAT does.  Thanks much,

Josef, thanks for looking.

Isn't reiserfs extent based anyway? So really, it should (eventually) get a
real ->fiemap callback.
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-28 20:12     ` Mark Fasheh
@ 2008-05-28 20:19       ` Josef Bacik
  0 siblings, 0 replies; 20+ messages in thread
From: Josef Bacik @ 2008-05-28 20:19 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Josef Bacik, Andreas Dilger, linux-fsdevel

On Wed, May 28, 2008 at 01:12:40PM -0700, Mark Fasheh wrote:
> On Wed, May 28, 2008 at 03:54:29PM -0400, Josef Bacik wrote:
> > > > Btw, how many block-based file systems that don't return multiple blocks via
> > > > get_blocks() are there that we actually care about enough to write this
> > > > code?
> > > 
> > > That I have no clue about.  Joseph?
> > > 
> > >
> > 
> > Well I can't comment on the "care enough about" part, but looking at it I don't
> > think reiserfs does, it just maps the one block we ask for.  Just cursory
> > looking through stuff it looks like most other stuff does map as much as it can,
> > even FAT does.  Thanks much,
> 
> Josef, thanks for looking.
> 
> Isn't reiserfs extent based anyway? So really, it should (eventually) get a
> real ->fiemap callback.
> 	--Mark
>

I looked again and then double checked with Chris and reiserfs (v3 at least) is
block based.  I can look through all the other block based fs's if you like and
see if any others do this, but like I said I hit all the ones that I could think
of that are commonly used and reiser was the only one that doesn't map multiple
blocks at once.  Thanks much,

Josef 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-28 19:42 ` Andreas Dilger
  2008-05-28 19:54   ` Josef Bacik
@ 2008-05-28 21:23   ` Mark Fasheh
  2008-05-29  1:24   ` Dave Chinner
  2008-05-29 13:03   ` Christoph Hellwig
  3 siblings, 0 replies; 20+ messages in thread
From: Mark Fasheh @ 2008-05-28 21:23 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel, Josef Bacik

On Wed, May 28, 2008 at 01:42:15PM -0600, Andreas Dilger wrote:
> > > What about the idea to have fiemap_fill_next_extent() do "extent" merging
> > > for filesystems that use the generic helper but do not return multiple
> > > blocks via get_blocks()?  I don't think that is too hard to implement,
> > > and makes the output more useful, otherwise we get an extent per block.  
> > > The above is what I _think_ will work, haven't actually tried it out.
> > 
> > I don't think we want to automatically merge extents within this helper
> > function. Otherwise we would diverge from the actual disk layout for extent
> > based file systems where an extent might be broken up between two records
> > for some other reason, such as maximum extent length being exceeded.
> 
> Do we really want to expose the filesystem-specific extent-length limits
> to userspace?

Think of it more as being as literal in our representation of extents as
possible. My max extent length was just a (poor) example, meant to
illustrate that goal.

A better one is extent merging. The Ocfs2 btree code tries pretty hard to
merge adjacent extents when they're contiguous. I want to use fiemap in a
tool to measure how good a job the extent merging code is doing, especially
after we've made bug fixes or enhancements to that area. If the vfs starts
to hide these important details from fiemap, then that use case becomes
impossible.


> In some sense, a block-based filesystem has a maximum extent length of the
> blocksize, but it seems totally reasonable to merge contiguous blocks into
> a single "extent" for return to userspace. I don't see this significantly
> different for ext4, even though it can have extents up to 128MB, or
> unwritten extents up to 64MB.

I don't know what the answer is for a block based file system. I do know
however that we have a compatibility callback there for them and that's the
place to implement any block-fs specific features that we want.
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-28 19:42 ` Andreas Dilger
  2008-05-28 19:54   ` Josef Bacik
  2008-05-28 21:23   ` Mark Fasheh
@ 2008-05-29  1:24   ` Dave Chinner
  2008-05-29 13:04     ` Christoph Hellwig
  2008-05-29 13:03   ` Christoph Hellwig
  3 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2008-05-29  1:24 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Mark Fasheh, linux-fsdevel, Josef Bacik

On Wed, May 28, 2008 at 01:42:15PM -0600, Andreas Dilger wrote:
> On May 24, 2008  17:01 -0700, Mark Fasheh wrote:
> > > blocks via get_blocks()?  I don't think that is too hard to implement,
> > > and makes the output more useful, otherwise we get an extent per block.  
> > > The above is what I _think_ will work, haven't actually tried it out.
> > 
> > I don't think we want to automatically merge extents within this helper
> > function. Otherwise we would diverge from the actual disk layout for extent
> > based file systems where an extent might be broken up between two records
> > for some other reason, such as maximum extent length being exceeded.
> 
> Do we really want to expose the filesystem-specific extent-length limits
> to userspace?

Is that really a problem we need to care about? FIEMAP is just a
list of offset/len pairs that makes no assumptions about the maximum
size of the underlying filesystem's extent size.

e.g. The maximum extent size on XFS varies with block size (2^21 *
block size) so having the filesystem merge extents will hide the
real number of extents in a given range (which is what we actually
want exposed).  e.g. we know that we have N extents on the inode,
but if a FIEMAP only returns N-2 because of merging, then we've got
a consistency problem.

To fix this, FIECOUNT becomes much more complex however other
existing interfaces will still expose conflicting information (e.g.
XFS_IOC_FSGETXATTR). Hence I don't think extent merging is a
good idea in general for this API. For specific implementations
it could be considered (e.g. the block-based filesystem wrappers)
but it should not forced on all implementations.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-29  1:24   ` Dave Chinner
@ 2008-05-29 13:04     ` Christoph Hellwig
  2008-05-29 17:02       ` Andreas Dilger
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2008-05-29 13:04 UTC (permalink / raw)
  To: Andreas Dilger, Mark Fasheh, linux-fsdevel, Josef Bacik

On Thu, May 29, 2008 at 11:24:35AM +1000, Dave Chinner wrote:
> To fix this, FIECOUNT becomes much more complex however other
> existing interfaces will still expose conflicting information (e.g.
> XFS_IOC_FSGETXATTR). Hence I don't think extent merging is a
> good idea in general for this API. For specific implementations
> it could be considered (e.g. the block-based filesystem wrappers)
> but it should not forced on all implementations.

Yes.  Extent merging only really makes sense for block based
filesystems, and there should be an output flag telling the user that
this happened.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-29 13:04     ` Christoph Hellwig
@ 2008-05-29 17:02       ` Andreas Dilger
  2008-05-31  8:16         ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Andreas Dilger @ 2008-05-29 17:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Mark Fasheh, linux-fsdevel, Josef Bacik

On May 29, 2008  09:04 -0400, Christoph Hellwig wrote:
> On Thu, May 29, 2008 at 11:24:35AM +1000, Dave Chinner wrote:
> > To fix this, FIECOUNT becomes much more complex however other
> > existing interfaces will still expose conflicting information (e.g.
> > XFS_IOC_FSGETXATTR). Hence I don't think extent merging is a
> > good idea in general for this API. For specific implementations
> > it could be considered (e.g. the block-based filesystem wrappers)
> > but it should not forced on all implementations.
> 
> Yes.  Extent merging only really makes sense for block based
> filesystems, and there should be an output flag telling the user
> that this happened.

I've accepted that extent merging is a bad idea for the generic interface.
I think for block-mapped filesystems this pretty much has to happen.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-29 17:02       ` Andreas Dilger
@ 2008-05-31  8:16         ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2008-05-31  8:16 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Christoph Hellwig, Mark Fasheh, linux-fsdevel, Josef Bacik

On Thu, May 29, 2008 at 11:02:16AM -0600, Andreas Dilger wrote:
> > Yes.  Extent merging only really makes sense for block based
> > filesystems, and there should be an output flag telling the user
> > that this happened.
> 
> I've accepted that extent merging is a bad idea for the generic interface.
> I think for block-mapped filesystems this pretty much has to happen.

Yes, I agree and probably was a little unclear in that previous mail.
It should happen for and only for block-based filesystems.  And there
should be a flag telling that user that it has happened.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-28 19:42 ` Andreas Dilger
                     ` (2 preceding siblings ...)
  2008-05-29  1:24   ` Dave Chinner
@ 2008-05-29 13:03   ` Christoph Hellwig
  3 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2008-05-29 13:03 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Mark Fasheh, linux-fsdevel, Josef Bacik

On Wed, May 28, 2008 at 01:42:15PM -0600, Andreas Dilger wrote:
> On May 24, 2008  17:01 -0700, Mark Fasheh wrote:
> > Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap
> > inode operation.
> 
> Mark, I was looking at a way to remove the special-casing of NUM_EXTENTS
> from ioctl_fiemap() in an effort to remove Christoph's objection to
> keeping these in the same ioctl.

It doesn't really matter where that case is :)


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-05-25  0:01 [PATCH 1/5] vfs: vfs-level fiemap interface Mark Fasheh
  2008-05-25  7:28 ` Andreas Dilger
  2008-05-28 19:42 ` Andreas Dilger
@ 2008-06-05  5:18 ` Andreas Dilger
  2008-06-05 21:35   ` jim owens
  2 siblings, 1 reply; 20+ messages in thread
From: Andreas Dilger @ 2008-06-05  5:18 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-fsdevel

On May 24, 2008  17:01 -0700, Mark Fasheh wrote:
> +Each extent is described by a single fiemap_extent structure as
> +returned in fm_extents.
> +
> +struct fiemap_extent {
> +	__u64	fe_logical;/* logical offset in bytes for the start of
> +			    * the extent */
> +	__u64	fe_physical; /* physical offset in bytes for the start
> +			      * of the extent */
> +	__u64	fe_length; /* length in bytes for the extent */
> +	__u32	fe_flags;  /* returned FIEMAP_EXTENT_* flags for the extent */
> +	__u32	fe_lun;	   /* logical device number for extent (starting at 0)*/
> +};

I was reading through the original FIEMAP thread, and one requirement
which isn't addressed with the current code (it was missed in the original
patch also) is the ability to return a physical extent length different
from the logical extent length.

This was originally brought up by Jörn Engel in the context of logfs
compressing data on disk.  It makes sense that the physical extent
length is shorter than the logical extent length in this case.  The
same is probably true for cramfs, and even ext* if the block compression
code was still maintained.  It wouldn't be a surprise if Chris was to
add compression to btrfs also.

So, I think we need another __u64 in he fiemap_extent which is
fe_loglength, and rename fe_length to fe_physlength.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] vfs: vfs-level fiemap interface
  2008-06-05  5:18 ` Andreas Dilger
@ 2008-06-05 21:35   ` jim owens
  0 siblings, 0 replies; 20+ messages in thread
From: jim owens @ 2008-06-05 21:35 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Mark Fasheh, linux-fsdevel

Andreas Dilger wrote:
> 
> So, I think we need another __u64 in he fiemap_extent which is
> fe_loglength, and rename fe_length to fe_physlength.

As some people guessed, my earlier post [PATCH 0/5]:
 > experience with a non-linux filesystem that has a similar API
refers to AdvFS on Tru64.

I was asked to provide more information about Tru64's equivalent
to fiemap.  I believe the person who asked wanted to get closure
on the fiemap definition, but I'm probably going to just throw
more gasoline around and light the match with this :)

My earlier post said how I thought our code worked, and as
usual if I describe something without looking at the code,
when I really go look at it I find it does something else
and I'm saying "damn, I didn't think it was that ugly".

Well it probably is ugly and it is really 4 different interfaces,
but after thinking about it I realized the 4 interface designs are
KISS defensible as being optimal for their intended use.
Here is the 10 year old "most used API" for userspace code:

#define F_GETMAP        21      /* retrieve a file's sparseness map */

struct extentmapentry {
         unsigned long offset;
         unsigned long size;
};

struct extentmap {
         unsigned long arraysize;
         unsigned long numextents;
         unsigned long offset;
         struct extentmapentry *extent;
};

fcntl(fileno, F_GETMAP, &extentmap)

Backup/dump tools call this fcntl() to retrieve the sparseness map
of an AdvFS or UFS file.  NFS and CD filesystems return an error.

Its intent is to return the LOGICAL extent map of a file, without
regard to the physical extent map of the file.  Multiple extents
will only be returned if the file in question is a sparse file.
All logically contiguous extents will be collapsed into a single
extent. FYI, "longs" are 64 bits on Tru64.

The extentmapentry.offset is byte-in-file and extentmapentry.size
is bytes-in-extent and only allocated data extents are returned
so there is no need for "extent type".  The extentmapentry is
designed to be small so that minimum memory is required when the
file is highly sparse-fragmented.

extentmap.arraysize is really max_extents (in) and "how_many_more"
(out) extents are present after the "numextents" (out) in the
*extent output array.  The part (so ugly) I forgot is that the
extentmap.offset is NOT a "starting byte in file", it is a
"skip over this many data extents".  That is not an intuitive api
but then I realized it is precisely the best for a backup program.

The backup always reads the complete file from 0..filesize, it
wants to duplicate sparse as sparse (or at least not read it
from the disk), it needs to use a reasonably sized extent array,
so it needs to walk forward in a loop (as in get 4 extents in
one call (0..3, 4..7, 8..11).  So extentmap.offset as an index
into the file's logical map makes sense and you don't need to
worry about start-at-byte-in-file not being an extent start.

A program that wanted to optimize random reads to a sparse file
could do it using this api though not as easily as if it had
the start_byte input parameter.

I'm not going to bore you with the other 3 interfaces that are
only supported in AdvFS to retrieve RAW extent maps for the
cluster and filesystem administrative tools.  These are the only
interfaces that return extent device allocation because normal
applications including backup need to do their data access
through the filesystem.  The bottom line is that information
that is filesystem-specific is only really valuable to tools that
are filesystem-specific.

=== LIGHT THE MATCH ===

- I don't want linux to implement Tru64 F_GETMAP for fiemap!

- The lesson is that a simple design covers the major use and
   other complicated needs are done somewhere else.

- I have talked to Mark and he has tools waiting to use the
   features he originally designed into fiemap... but every
   day there is a new flag or return field added "just in case".

- I know "memory is cheap", but we still seem to run out of it
   so expanding every return structure for data that may only be
   useful to a specific filesystem seems like a bad idea.

- A simplified filesystem-independent version and separate
   complex-as-you-want filesystem-dependent api might be better,
   for example:

   * We can't even agree what "device" is.
   * What good is "encrypted" or "compressed" without "how"?

=== THROW ON MORE GASOLINE ===

Subject:    [RFC] add FIEMAP ioctl to efficiently map file allocation
From:       Andreas Dilger <adilger () clusterfs ! com>
Date:       2007-04-12 11:05:50
Message-ID: 20070412110550.GM5967 () schatzie ! adilger ! int

we additionally need to get the mapping over the network so it needs to
be efficient in terms of how data is passed, and how easily it can be
extracted from the filesystem.

jim

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2008-06-05 21:37 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-25  0:01 [PATCH 1/5] vfs: vfs-level fiemap interface Mark Fasheh
2008-05-25  7:28 ` Andreas Dilger
2008-05-27 18:31   ` Mark Fasheh
2008-05-28 16:09     ` Andreas Dilger
2008-05-28 17:24       ` Joel Becker
2008-05-29 23:46         ` Andreas Dilger
2008-05-30  0:15           ` Mark Fasheh
2008-05-30 17:24             ` Andreas Dilger
2008-05-28 19:42 ` Andreas Dilger
2008-05-28 19:54   ` Josef Bacik
2008-05-28 20:12     ` Mark Fasheh
2008-05-28 20:19       ` Josef Bacik
2008-05-28 21:23   ` Mark Fasheh
2008-05-29  1:24   ` Dave Chinner
2008-05-29 13:04     ` Christoph Hellwig
2008-05-29 17:02       ` Andreas Dilger
2008-05-31  8:16         ` Christoph Hellwig
2008-05-29 13:03   ` Christoph Hellwig
2008-06-05  5:18 ` Andreas Dilger
2008-06-05 21:35   ` jim owens

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).