* YET ANOTHER resend of the fiemap patches @ 2008-09-13 18:47 Theodore Ts'o 2008-09-13 18:49 ` [PATCH 1/4] vfs: vfs-level fiemap interface Theodore Ts'o 0 siblings, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2008-09-13 18:47 UTC (permalink / raw) To: linux-fsdevel; +Cc: Mark Fasheh, linux-ext4 So the last round of the on-going round-and-round of the FIEMAP patches went to the linux-ext4 mailing list, because of a query there about why these patches had been stalled for so long. This resulted in a private set of e-mails attacking an ext4 developer and putting forward a conspiracy theory that these patches were trying to be snuck in, despite the fact that they have been sent on linux-fsdevel at least 4 times already. Mark has at this point I believe answered all of Christoph Hellwig's demands, and put forward a request to Andrew Morton to include them in -mm (minus the ext4 patch, since the version he had no longer compiled). As of this writing, Andrew has not put them into -mm, I suspect because he perceives there is still controversies around these patches, despite the fact that most of the e-mails and time wasted has been over individual bitflags and other really tiny nit-picky details. So I am re-sending these patches out for review on linux-fsdevel. The first three patches are taken from Mark's fiemap branch on the ocfs2 branch. The last patch is an updated ext4 patch from the ext4 patch queue, which is provided so that folks can see that the fiemap patches work just fine on ext4. Given that filesystem designers seem to love nit-picking tiny details, and I am personally starting to lose patience, if the fiemap patches stalls any further, my plan is to take a page from the XFS playbook and simply take the ext4-fiemap patch and implement an ext4-specific ioctl. If and when the linux-fsdevel community manages come to consensus on the fiemap patches, whether it happens in 2.6.28 or Linux 2.6.87, it will be easy enough to wire the ext4 support to the generic fiemap ioctl. "It's *just* an ioctl, folks" --- Andreas Dilger Yours disgustedly, - Ted ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-13 18:47 YET ANOTHER resend of the fiemap patches Theodore Ts'o @ 2008-09-13 18:49 ` Theodore Ts'o 2008-09-13 18:49 ` [PATCH 2/4] ocfs2: fiemap support Theodore Ts'o ` (2 more replies) 0 siblings, 3 replies; 40+ messages in thread From: Theodore Ts'o @ 2008-09-13 18:49 UTC (permalink / raw) To: linux-fsdevel; +Cc: linux-ext4, akpm, Mark Fasheh From: Mark Fasheh <mfasheh@suse.com> Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap inode operation. Userspace can get extent information on a file via fiemap ioctl. As input, the fiemap ioctl takes a struct fiemap which includes an array of struct fiemap_extent (fm_extents). Size of the extent array is passed as fm_extent_count and number of extents returned will be written into fm_mapped_extents. Offset and length fields on the fiemap structure (fm_start, fm_length) describe a logical range which will be searched for extents. All extents returned will at least partially contain this range. The actual extent offsets and ranges returned will be unmodified from their offset and range on-disk. The fiemap ioctl returns '0' on success. On error, -1 is returned and errno is set. If errno is equal to EBADR, then fm_flags will contain those flags which were passed in which the kernel did not understand. On all other errors, the contents of fm_extents is undefined. As fiemap evolved, there have been many authors of the vfs patch. As far as I can tell, the list includes: Kalpak Shah <kalpak.shah@sun.com> Andreas Dilger <adilger@sun.com> Eric Sandeen <sandeen@redhat.com> Mark Fasheh <mfasheh@suse.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com> --- Documentation/filesystems/fiemap.txt | 216 ++++++++++++++++++++++++++++++++++ fs/ioctl.c | 154 ++++++++++++++++++++++++ include/linux/fiemap.h | 62 ++++++++++ include/linux/fs.h | 18 +++ 4 files changed, 450 insertions(+), 0 deletions(-) create mode 100644 Documentation/filesystems/fiemap.txt create mode 100644 include/linux/fiemap.h diff --git a/Documentation/filesystems/fiemap.txt b/Documentation/filesystems/fiemap.txt new file mode 100644 index 0000000..5e95e8f --- /dev/null +++ b/Documentation/filesystems/fiemap.txt @@ -0,0 +1,216 @@ +============ +Fiemap Ioctl +============ + +The fiemap ioctl is an efficient method for userspace to get file +extent mappings. Instead of block-by-block mapping (such as bmap), fiemap +returns a list of extents. + + +Request Basics +-------------- + +A fiemap request is encoded within struct fiemap: + +struct fiemap { + __u64 fm_start; /* logical offset (inclusive) at + * which to start mapping (in) */ + __u64 fm_length; /* logical length of mapping which + * userspace cares about (in) */ + __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ + __u32 fm_mapped_extents; /* number of extents that were + * mapped (out) */ + __u32 fm_extent_count; /* size of fm_extents array (in) */ + __u32 fm_reserved; + struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */ +}; + + +fm_start, and fm_length specify the logical range within the file +which the process would like mappings for. Extents returned mirror +those on disk - that is, the logical offset of the 1st returned extent +may start before fm_start, and the range covered by the last returned +extent may end after fm_length. All offsets and lengths are in bytes. + +Certain flags to modify the way in which mappings are looked up can be +set in fm_flags. If the kernel doesn't understand some particular +flags, it will return EBADR and the contents of fm_flags will contain +the set of flags which caused the error. If the kernel is compatible +with all flags passed, the contents of fm_flags will be unmodified. +It is up to userspace to determine whether rejection of a particular +flag is fatal to it's operation. This scheme is intended to allow the +fiemap interface to grow in the future but without losing +compatibility with old software. + +fm_extent_count specifies the number of elements in the fm_extents[] array +that can be used to return extents. If fm_extent_count is zero, then the +fm_extents[] array is ignored (no extents will be returned), and the +fm_mapped_extents count will hold the number of extents needed in +fm_extents[] to hold the file's current mapping. Note that there is +nothing to prevent the file from changing between calls to FIEMAP. + +Currently, there are three flags which can be set in fm_flags: + +* FIEMAP_FLAG_SYNC +If this flag is set, the kernel will sync the file before mapping extents. + +* FIEMAP_FLAG_XATTR +If this flag is set, the extents returned will describe the inodes +extended attribute lookup tree, instead of it's data tree. + + +Extent Mapping +-------------- + +Extent information is returned within the embedded fm_extents array +which userspace must allocate along with the fiemap structure. The +number of elements in the fiemap_extents[] array should be passed via +fm_extent_count. The number of extents mapped by kernel will be +returned via fm_mapped_extents. If the number of fiemap_extents +allocated is less than would be required to map the requested range, +the maximum number of extents that can be mapped in the fm_extent[] +array will be returned and fm_mapped_extents will be equal to +fm_extent_count. In that case, the last extent in the array will not +complete the requested range and will not have the FIEMAP_EXTENT_LAST +flag set (see the next section on extent flags). + +Each extent is described by a single fiemap_extent structure as +returned in fm_extents. + +struct fiemap_extent { + __u64 fe_logical; /* logical offset in bytes for the start of + * the extent */ + __u64 fe_physical; /* physical offset in bytes for the start + * of the extent */ + __u64 fe_length; /* length in bytes for the extent */ + __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ + __u32 fe_reserved; +}; + +All offsets and lengths are in bytes and mirror those on disk. It is valid +for an extents logical offset to start before the request or it's logical +length to extend past the request. Unless FIEMAP_EXTENT_NOT_ALIGNED is +returned, fe_logical, fe_physical, and fe_length will be aligned to the +block size of the file system. With the exception of extents flagged as +FIEMAP_EXTENT_MERGED, adjacent extents will not be merged. + +The fe_flags field contains flags which describe the extent returned. +A special flag, FIEMAP_EXTENT_LAST is always set on the last extent in +the file so that the process making fiemap calls can determine when no +more extents are available, without having to call the ioctl again. + +Some flags are intentionally vague and will always be set in the +presence of other more specific flags. This way a program looking for +a general property does not have to know all existing and future flags +which imply that property. + +For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL +are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking +for inline or tail-packed data can key on the specific flag. Software +which simply cares not to try operating on non-aligned extents +however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to +worry about all present and future flags which might imply unaligned +data. Note that the opposite is not true - it would be valid for +FIEMAP_EXTENT_NOT_ALIGNED to appear alone. + +* FIEMAP_EXTENT_LAST +This is the last extent in the file. A mapping attempt past this +extent will return nothing. + +* FIEMAP_EXTENT_UNKNOWN +The location of this extent is currently unknown. This may indicate +the data is stored on an inaccessible volume or that no storage has +been allocated for the file yet. + +* FIEMAP_EXTENT_DELALLOC + - This will also set FIEMAP_EXTENT_UNKNOWN. +Delayed allocation - while there is data for this extent, it's +physical location has not been allocated yet. + +* FIEMAP_EXTENT_NO_BYPASS +Direct access to the data in this extent is illegal or will have +undefined results. + +* FIEMAP_EXTENT_DATA_ENCRYPTED + - This will also set FIEMAP_EXTENT_NO_BYPASS +The data in this extent has been encrypted by the file system. + +* FIEMAP_EXTENT_NOT_ALIGNED +Extent offsets and length are not guaranteed to be block aligned. + +* FIEMAP_EXTENT_DATA_INLINE + This will also set FIEMAP_EXTENT_NOT_ALIGNED +Data is located within a meta data block. + +* FIEMAP_EXTENT_DATA_TAIL + This will also set FIEMAP_EXTENT_NOT_ALIGNED +Data is packed into a block with data from other files. + +* FIEMAP_EXTENT_UNWRITTEN +Unwritten extent - the extent is allocated but it's data has not been +initialized. This indicates the extent's data will be all zero if read +through the filesystem but the contents are undefined if read directly from +the device. + +* FIEMAP_EXTENT_MERGED +This will be set when a file does not support extents, i.e., it uses a block +based addressing scheme. Since returning an extent for each block back to +userspace would be highly inefficient, the kernel will try to merge most +adjacent blocks into 'extents'. + + +VFS -> File System Implementation +--------------------------------- + +File systems wishing to support fiemap must implement a ->fiemap callback on +their inode_operations structure. The fs ->fiemap call is responsible for +defining it's set of supported fiemap flags, and calling a helper function on +each discovered extent: + +struct inode_operations { + ... + + int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, + u64 len); + +->fiemap is passed struct fiemap_extent_info which describes the +fiemap request: + +struct fiemap_extent_info { + unsigned int fi_flags; /* Flags as passed from user */ + unsigned int fi_extents_mapped; /* Number of mapped extents */ + unsigned int fi_extents_max; /* Size of fiemap_extent array */ + struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */ +}; + +It is intended that the file system should not need to access any of this +structure directly. + + +Flag checking should be done at the beginning of the ->fiemap callback via the +fiemap_check_flags() helper: + +int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); + +The struct fieinfo should be passed in as recieved from ioctl_fiemap(). The +set of fiemap flags which the fs understands should be passed via fs_flags. If +fiemap_check_flags finds invalid user flags, it will place the bad values in +fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR, from +fiemap_check_flags(), it should immediately exit, returning that error back to +ioctl_fiemap(). + + +For each extent in the request range, the file system should call +the helper function, fiemap_fill_next_extent(): + +int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, + u64 phys, u64 len, u32 flags, u32 dev); + +fiemap_fill_next_extent() will use the passed values to populate the +next free extent in the fm_extents array. 'General' extent flags will +automatically be set from specific flags on behalf of the calling file +system so that the userspace API is not broken. + +fiemap_fill_next_extent() returns 0 on success, and 1 when the +user-supplied fm_extents array is full. If an error is encountered +while copying the extent to user memory, -EFAULT will be returned. diff --git a/fs/ioctl.c b/fs/ioctl.c index 7db32b3..274625a 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -16,6 +16,9 @@ #include <asm/ioctls.h> +/* So that the fiemap access checks can't overflow on 32 bit machines. */ +#define FIEMAP_MAX_EXTENTS (UINT_MAX / sizeof(struct fiemap_extent)) + /** * vfs_ioctl - call filesystem specific ioctl methods * @filp: open file to invoke ioctl method on @@ -71,6 +74,155 @@ static int ioctl_fibmap(struct file *filp, int __user *p) return put_user(res, p); } +/** + * fiemap_fill_next_extent - Fiemap helper function + * @fieinfo: Fiemap context passed into ->fiemap + * @logical: Extent logical start offset, in bytes + * @phys: Extent physical start offset, in bytes + * @len: Extent length, in bytes + * @flags: FIEMAP_EXTENT flags that describe this extent + * + * Called from file system ->fiemap callback. Will populate extent + * info as passed in via arguments and copy to user memory. On + * success, extent count on fieinfo is incremented. + * + * Returns 0 on success, -errno on error, 1 if this was the last + * extent that will fit in user array. + */ +#define SET_UNKNOWN_FLAGS (FIEMAP_EXTENT_DELALLOC) +#define SET_NO_BYPASS_FLAGS (FIEMAP_EXTENT_DATA_ENCRYPTED) +#define SET_NOT_ALIGNED_FLAGS (FIEMAP_EXTENT_DATA_TAIL|FIEMAP_EXTENT_DATA_INLINE) +int fiemap_fill_next_extent(struct fiemap_extent_info *fieinfo, u64 logical, + u64 phys, u64 len, u32 flags) +{ + struct fiemap_extent extent; + struct fiemap_extent *dest = fieinfo->fi_extents_start; + + /* only count the extents */ + if (fieinfo->fi_extents_max == 0) { + fieinfo->fi_extents_mapped++; + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; + } + + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) + return 1; + + if (flags & SET_UNKNOWN_FLAGS) + flags |= FIEMAP_EXTENT_UNKNOWN; + if (flags & SET_NO_BYPASS_FLAGS) + flags |= FIEMAP_EXTENT_NO_BYPASS; + if (flags & SET_NOT_ALIGNED_FLAGS) + flags |= FIEMAP_EXTENT_NOT_ALIGNED; + + extent.fe_logical = logical; + extent.fe_physical = phys; + extent.fe_length = len; + extent.fe_flags = flags; + + dest += fieinfo->fi_extents_mapped; + if (copy_to_user(dest, &extent, sizeof(extent))) + return -EFAULT; + + fieinfo->fi_extents_mapped++; + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) + return 1; + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; +} +EXPORT_SYMBOL(fiemap_fill_next_extent); + +/** + * fiemap_check_flags - check validity of requested flags for fiemap + * @fieinfo: Fiemap context passed into ->fiemap + * @fs_flags: Set of fiemap flags that the file system understands + * + * Called from file system ->fiemap callback. This will compute the + * intersection of valid fiemap flags and those that the fs supports. That + * value is then compared against the user supplied flags. In case of bad user + * flags, the invalid values will be written into the fieinfo structure, and + * -EBADR is returned, which tells ioctl_fiemap() to return those values to + * userspace. For this reason, a return code of -EBADR should be preserved. + * + * Returns 0 on success, -EBADR on bad flags. + */ +int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags) +{ + u32 incompat_flags; + + incompat_flags = fieinfo->fi_flags & ~(FIEMAP_FLAGS_COMPAT & fs_flags); + if (incompat_flags) { + fieinfo->fi_flags = incompat_flags; + return -EBADR; + } + return 0; +} +EXPORT_SYMBOL(fiemap_check_flags); + +static int fiemap_check_ranges(struct super_block *sb, + u64 start, u64 len, u64 *new_len) +{ + *new_len = len; + + if (len == 0) + return -EINVAL; + + if (start > sb->s_maxbytes) + return -EFBIG; + + /* + * Shrink request scope to what the fs can actually handle. + */ + if ((len > sb->s_maxbytes) || + (sb->s_maxbytes - len) < start) + *new_len = sb->s_maxbytes - start; + + return 0; +} + +static int ioctl_fiemap(struct file *filp, unsigned long arg) +{ + struct fiemap fiemap; + struct fiemap_extent_info fieinfo = { 0, }; + struct inode *inode = filp->f_path.dentry->d_inode; + struct super_block *sb = inode->i_sb; + u64 len; + int error; + + if (!inode->i_op->fiemap) + return -EOPNOTSUPP; + + if (copy_from_user(&fiemap, (struct fiemap __user *)arg, + sizeof(struct fiemap))) + return -EFAULT; + + if (fiemap.fm_extent_count > FIEMAP_MAX_EXTENTS) + return -EINVAL; + + error = fiemap_check_ranges(sb, fiemap.fm_start, fiemap.fm_length, + &len); + if (error) + return error; + + fieinfo.fi_flags = fiemap.fm_flags; + fieinfo.fi_extents_max = fiemap.fm_extent_count; + fieinfo.fi_extents_start = (struct fiemap_extent *)(arg + sizeof(fiemap)); + + if (fiemap.fm_extent_count != 0 && + !access_ok(VERIFY_WRITE, fieinfo.fi_extents_start, + fieinfo.fi_extents_max * sizeof(struct fiemap_extent))) + return -EFAULT; + + if (fieinfo.fi_flags & FIEMAP_FLAG_SYNC) + filemap_write_and_wait(inode->i_mapping); + + error = inode->i_op->fiemap(inode, &fieinfo, fiemap.fm_start, len); + fiemap.fm_flags = fieinfo.fi_flags; + fiemap.fm_mapped_extents = fieinfo.fi_extents_mapped; + if (copy_to_user((char *)arg, &fiemap, sizeof(fiemap))) + error = -EFAULT; + + return error; +} + static int file_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { @@ -80,6 +232,8 @@ static int file_ioctl(struct file *filp, unsigned int cmd, switch (cmd) { case FIBMAP: return ioctl_fibmap(filp, p); + case FS_IOC_FIEMAP: + return ioctl_fiemap(filp, arg); case FIGETBSZ: return put_user(inode->i_sb->s_blocksize, p); case FIONREAD: diff --git a/include/linux/fiemap.h b/include/linux/fiemap.h new file mode 100644 index 0000000..29e5862 --- /dev/null +++ b/include/linux/fiemap.h @@ -0,0 +1,62 @@ +/* + * FS_IOC_FIEMAP ioctl infrastructure. + * + * Some portions copyright (C) 2007 Cluster File Systems, Inc + * + * Authors: Mark Fasheh <mfasheh@suse.com> + * Kalpak Shah <kalpak.shah@sun.com> + * Andreas Dilger <adilger@sun.com> + */ + +#ifndef _LINUX_FIEMAP_H +#define _LINUX_FIEMAP_H + +struct fiemap_extent { + __u64 fe_logical; /* logical offset in bytes for the start of + * the extent from the beginning of the file */ + __u64 fe_physical; /* physical offset in bytes for the start + * of the extent from the beginning of the disk */ + __u64 fe_length; /* length in bytes for this extent */ + __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ + __u32 fe_reserved; +}; + +struct fiemap { + __u64 fm_start; /* logical offset (inclusive) at + * which to start mapping (in) */ + __u64 fm_length; /* logical length of mapping which + * userspace wants (in) */ + __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ + __u32 fm_mapped_extents;/* number of extents that were mapped (out) */ + __u32 fm_extent_count; /* size of fm_extents array (in) */ + __u32 fm_reserved; + struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */ +}; + +#define FIEMAP_MAX_OFFSET (~0ULL) + +#define FIEMAP_FLAG_SYNC 0x00000001 /* sync file data before map */ +#define FIEMAP_FLAG_XATTR 0x00000002 /* map extended attribute tree */ + +#define FIEMAP_FLAGS_COMPAT (FIEMAP_FLAG_SYNC | FIEMAP_FLAG_XATTR) + +#define FIEMAP_EXTENT_LAST 0x00000001 /* Last extent in file. */ +#define FIEMAP_EXTENT_UNKNOWN 0x00000002 /* Data location unknown. */ +#define FIEMAP_EXTENT_DELALLOC 0x00000004 /* Location still pending. + * Sets EXTENT_UNKNOWN. */ +#define FIEMAP_EXTENT_NO_BYPASS 0x00000008 /* Data mapping undefined */ +#define FIEMAP_EXTENT_DATA_ENCRYPTED 0x00000080 /* Data is encrypted by fs. + * Sets EXTENT_NO_BYPASS. */ +#define FIEMAP_EXTENT_NOT_ALIGNED 0x00000100 /* Extent offsets may not be + * block aligned. */ +#define FIEMAP_EXTENT_DATA_INLINE 0x00000200 /* Data mixed with metadata. + * Sets EXTENT_NOT_ALIGNED.*/ +#define FIEMAP_EXTENT_DATA_TAIL 0x00000400 /* Multiple files in block. + * Sets EXTENT_NOT_ALIGNED.*/ +#define FIEMAP_EXTENT_UNWRITTEN 0x00000800 /* Space allocated, but + * no data (i.e. zero). */ +#define FIEMAP_EXTENT_MERGED 0x00001000 /* File does not natively + * support extents. Result + * merged for efficiency. */ + +#endif /* _LINUX_FIEMAP_H */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 580b513..194fb23 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -231,6 +231,7 @@ extern int dir_notify_enable; #define FS_IOC_SETFLAGS _IOW('f', 2, long) #define FS_IOC_GETVERSION _IOR('v', 1, long) #define FS_IOC_SETVERSION _IOW('v', 2, long) +#define FS_IOC_FIEMAP _IOWR('f', 11, struct fiemap) #define FS_IOC32_GETFLAGS _IOR('f', 1, int) #define FS_IOC32_SETFLAGS _IOW('f', 2, int) #define FS_IOC32_GETVERSION _IOR('v', 1, int) @@ -291,6 +292,7 @@ extern int dir_notify_enable; #include <linux/mutex.h> #include <linux/capability.h> #include <linux/semaphore.h> +#include <linux/fiemap.h> #include <asm/atomic.h> #include <asm/byteorder.h> @@ -1179,6 +1181,20 @@ extern void dentry_unhash(struct dentry *dentry); extern int file_permission(struct file *, int); /* + * VFS FS_IOC_FIEMAP helper definitions. + */ +struct fiemap_extent_info { + unsigned int fi_flags; /* Flags as passed from user */ + unsigned int fi_extents_mapped; /* Number of mapped extents */ + unsigned int fi_extents_max; /* Size of fiemap_extent array */ + struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent + * array */ +}; +int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, + u64 phys, u64 len, u32 flags); +int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); + +/* * File types * * NOTE! These match bits 12..15 of stat.st_mode @@ -1287,6 +1303,8 @@ struct inode_operations { void (*truncate_range)(struct inode *, loff_t, loff_t); long (*fallocate)(struct inode *inode, int mode, loff_t offset, loff_t len); + int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, + u64 len); }; struct seq_file; -- 1.5.6.1.205.ge2c7.dirty ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 2/4] ocfs2: fiemap support 2008-09-13 18:49 ` [PATCH 1/4] vfs: vfs-level fiemap interface Theodore Ts'o @ 2008-09-13 18:49 ` Theodore Ts'o 2008-09-13 18:49 ` [PATCH 3/4] generic block based fiemap implementation Theodore Ts'o 2008-09-13 21:17 ` [PATCH 1/4] vfs: vfs-level " Anton Altaparmakov 2008-09-14 13:47 ` Christoph Hellwig 2 siblings, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2008-09-13 18:49 UTC (permalink / raw) To: linux-fsdevel; +Cc: linux-ext4, akpm, Mark Fasheh From: Mark Fasheh <mfasheh@suse.com> Plug ocfs2 into ->fiemap. Some portions of ocfs2_get_clusters() had to be refactored so that the extent cache can be skipped in favor of going directly to the on-disk records. This makes it easier for us to determine which extent is the last one in the btree. Also, I'm not sure we want to be caching fiemap lookups anyway as they're not directly related to data read/write. Signed-off-by: Mark Fasheh <mfasheh@suse.com> --- fs/ocfs2/alloc.c | 9 -- fs/ocfs2/alloc.h | 9 ++ fs/ocfs2/extent_map.c | 346 +++++++++++++++++++++++++++++++++++++++++-------- fs/ocfs2/extent_map.h | 3 + fs/ocfs2/file.c | 1 + 5 files changed, 306 insertions(+), 62 deletions(-) diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c index 10bfb46..29ff57e 100644 --- a/fs/ocfs2/alloc.c +++ b/fs/ocfs2/alloc.c @@ -990,15 +990,6 @@ out: } /* - * This is only valid for leaf nodes, which are the only ones that can - * have empty extents anyway. - */ -static inline int ocfs2_is_empty_extent(struct ocfs2_extent_rec *rec) -{ - return !rec->e_leaf_clusters; -} - -/* * This function will discard the rightmost extent record. */ static void ocfs2_shift_records_right(struct ocfs2_extent_list *el) diff --git a/fs/ocfs2/alloc.h b/fs/ocfs2/alloc.h index 42ff94b..60cd3d5 100644 --- a/fs/ocfs2/alloc.h +++ b/fs/ocfs2/alloc.h @@ -146,4 +146,13 @@ static inline unsigned int ocfs2_rec_clusters(struct ocfs2_extent_list *el, return le16_to_cpu(rec->e_leaf_clusters); } +/* + * This is only valid for leaf nodes, which are the only ones that can + * have empty extents anyway. + */ +static inline int ocfs2_is_empty_extent(struct ocfs2_extent_rec *rec) +{ + return !rec->e_leaf_clusters; +} + #endif /* OCFS2_ALLOC_H */ diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c index c58668a..aed268e 100644 --- a/fs/ocfs2/extent_map.c +++ b/fs/ocfs2/extent_map.c @@ -25,6 +25,7 @@ #include <linux/fs.h> #include <linux/init.h> #include <linux/types.h> +#include <linux/fiemap.h> #define MLOG_MASK_PREFIX ML_EXTENT_MAP #include <cluster/masklog.h> @@ -32,6 +33,7 @@ #include "ocfs2.h" #include "alloc.h" +#include "dlmglue.h" #include "extent_map.h" #include "inode.h" #include "super.h" @@ -282,6 +284,51 @@ out: kfree(new_emi); } +static int ocfs2_last_eb_is_empty(struct inode *inode, + struct ocfs2_dinode *di) +{ + int ret, next_free; + u64 last_eb_blk = le64_to_cpu(di->i_last_eb_blk); + struct buffer_head *eb_bh = NULL; + struct ocfs2_extent_block *eb; + struct ocfs2_extent_list *el; + + ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), last_eb_blk, + &eb_bh, OCFS2_BH_CACHED, inode); + if (ret) { + mlog_errno(ret); + goto out; + } + + eb = (struct ocfs2_extent_block *) eb_bh->b_data; + el = &eb->h_list; + + if (!OCFS2_IS_VALID_EXTENT_BLOCK(eb)) { + ret = -EROFS; + OCFS2_RO_ON_INVALID_EXTENT_BLOCK(inode->i_sb, eb); + goto out; + } + + if (el->l_tree_depth) { + ocfs2_error(inode->i_sb, + "Inode %lu has non zero tree depth in " + "leaf block %llu\n", inode->i_ino, + (unsigned long long)eb_bh->b_blocknr); + ret = -EROFS; + goto out; + } + + next_free = le16_to_cpu(el->l_next_free_rec); + + if (next_free == 0 || + (next_free == 1 && ocfs2_is_empty_extent(&el->l_recs[0]))) + ret = 1; + +out: + brelse(eb_bh); + return ret; +} + /* * Return the 1st index within el which contains an extent start * larger than v_cluster. @@ -373,42 +420,28 @@ out: return ret; } -int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, - u32 *p_cluster, u32 *num_clusters, - unsigned int *extent_flags) +static int ocfs2_get_clusters_nocache(struct inode *inode, + struct buffer_head *di_bh, + u32 v_cluster, unsigned int *hole_len, + struct ocfs2_extent_rec *ret_rec, + unsigned int *is_last) { - int ret, i; - unsigned int flags = 0; - struct buffer_head *di_bh = NULL; - struct buffer_head *eb_bh = NULL; + int i, ret, tree_height, len; struct ocfs2_dinode *di; - struct ocfs2_extent_block *eb; + struct ocfs2_extent_block *uninitialized_var(eb); struct ocfs2_extent_list *el; struct ocfs2_extent_rec *rec; - u32 coff; - - if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) { - ret = -ERANGE; - mlog_errno(ret); - goto out; - } - - ret = ocfs2_extent_map_lookup(inode, v_cluster, p_cluster, - num_clusters, extent_flags); - if (ret == 0) - goto out; + struct buffer_head *eb_bh = NULL; - ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), OCFS2_I(inode)->ip_blkno, - &di_bh, OCFS2_BH_CACHED, inode); - if (ret) { - mlog_errno(ret); - goto out; - } + memset(ret_rec, 0, sizeof(*ret_rec)); + if (is_last) + *is_last = 0; di = (struct ocfs2_dinode *) di_bh->b_data; el = &di->id2.i_list; + tree_height = le16_to_cpu(el->l_tree_depth); - if (el->l_tree_depth) { + if (tree_height > 0) { ret = ocfs2_find_leaf(inode, el, v_cluster, &eb_bh); if (ret) { mlog_errno(ret); @@ -431,46 +464,143 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, i = ocfs2_search_extent_list(el, v_cluster); if (i == -1) { /* - * A hole was found. Return some canned values that - * callers can key on. If asked for, num_clusters will - * be populated with the size of the hole. + * Holes can be larger than the maximum size of an + * extent, so we return their lengths in a seperate + * field. */ - *p_cluster = 0; - if (num_clusters) { + if (hole_len) { ret = ocfs2_figure_hole_clusters(inode, el, eb_bh, - v_cluster, - num_clusters); + v_cluster, &len); if (ret) { mlog_errno(ret); goto out; } + + *hole_len = len; } - } else { - rec = &el->l_recs[i]; + goto out_hole; + } - BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos)); + rec = &el->l_recs[i]; - if (!rec->e_blkno) { - ocfs2_error(inode->i_sb, "Inode %lu has bad extent " - "record (%u, %u, 0)", inode->i_ino, - le32_to_cpu(rec->e_cpos), - ocfs2_rec_clusters(el, rec)); - ret = -EROFS; - goto out; + BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos)); + + if (!rec->e_blkno) { + ocfs2_error(inode->i_sb, "Inode %lu has bad extent " + "record (%u, %u, 0)", inode->i_ino, + le32_to_cpu(rec->e_cpos), + ocfs2_rec_clusters(el, rec)); + ret = -EROFS; + goto out; + } + + *ret_rec = *rec; + + /* + * Checking for last extent is potentially expensive - we + * might have to look at the next leaf over to see if it's + * empty. + * + * The first two checks are to see whether the caller even + * cares for this information, and if the extent is at least + * the last in it's list. + * + * If those hold true, then the extent is last if any of the + * additional conditions hold true: + * - Extent list is in-inode + * - Extent list is right-most + * - Extent list is 2nd to rightmost, with empty right-most + */ + if (is_last) { + if (i == (le16_to_cpu(el->l_next_free_rec) - 1)) { + if (tree_height == 0) + *is_last = 1; + else if (eb->h_blkno == di->i_last_eb_blk) + *is_last = 1; + else if (eb->h_next_leaf_blk == di->i_last_eb_blk) { + ret = ocfs2_last_eb_is_empty(inode, di); + if (ret < 0) { + mlog_errno(ret); + goto out; + } + if (ret == 1) + *is_last = 1; + } } + } + +out_hole: + ret = 0; +out: + brelse(eb_bh); + return ret; +} + +static void ocfs2_relative_extent_offsets(struct super_block *sb, + u32 v_cluster, + struct ocfs2_extent_rec *rec, + u32 *p_cluster, u32 *num_clusters) + +{ + u32 coff = v_cluster - le32_to_cpu(rec->e_cpos); + + *p_cluster = ocfs2_blocks_to_clusters(sb, le64_to_cpu(rec->e_blkno)); + *p_cluster = *p_cluster + coff; + + if (num_clusters) + *num_clusters = le16_to_cpu(rec->e_leaf_clusters) - coff; +} + +int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, + u32 *p_cluster, u32 *num_clusters, + unsigned int *extent_flags) +{ + int ret; + unsigned int uninitialized_var(hole_len), flags = 0; + struct buffer_head *di_bh = NULL; + struct ocfs2_extent_rec rec; - coff = v_cluster - le32_to_cpu(rec->e_cpos); + if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) { + ret = -ERANGE; + mlog_errno(ret); + goto out; + } - *p_cluster = ocfs2_blocks_to_clusters(inode->i_sb, - le64_to_cpu(rec->e_blkno)); - *p_cluster = *p_cluster + coff; + ret = ocfs2_extent_map_lookup(inode, v_cluster, p_cluster, + num_clusters, extent_flags); + if (ret == 0) + goto out; - if (num_clusters) - *num_clusters = ocfs2_rec_clusters(el, rec) - coff; + ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), OCFS2_I(inode)->ip_blkno, + &di_bh, OCFS2_BH_CACHED, inode); + if (ret) { + mlog_errno(ret); + goto out; + } - flags = rec->e_flags; + ret = ocfs2_get_clusters_nocache(inode, di_bh, v_cluster, &hole_len, + &rec, NULL); + if (ret) { + mlog_errno(ret); + goto out; + } - ocfs2_extent_map_insert_rec(inode, rec); + if (rec.e_blkno == 0ULL) { + /* + * A hole was found. Return some canned values that + * callers can key on. If asked for, num_clusters will + * be populated with the size of the hole. + */ + *p_cluster = 0; + if (num_clusters) { + *num_clusters = hole_len; + } + } else { + ocfs2_relative_extent_offsets(inode->i_sb, v_cluster, &rec, + p_cluster, num_clusters); + flags = rec.e_flags; + + ocfs2_extent_map_insert_rec(inode, &rec); } if (extent_flags) @@ -478,7 +608,6 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, out: brelse(di_bh); - brelse(eb_bh); return ret; } @@ -521,3 +650,114 @@ int ocfs2_extent_map_get_blocks(struct inode *inode, u64 v_blkno, u64 *p_blkno, out: return ret; } + +static int ocfs2_fiemap_inline(struct inode *inode, struct buffer_head *di_bh, + struct fiemap_extent_info *fieinfo, + u64 map_start) +{ + int ret; + unsigned int id_count; + struct ocfs2_dinode *di; + u64 phys; + u32 flags = FIEMAP_EXTENT_DATA_INLINE|FIEMAP_EXTENT_LAST; + struct ocfs2_inode_info *oi = OCFS2_I(inode); + + di = (struct ocfs2_dinode *)di_bh->b_data; + id_count = le16_to_cpu(di->id2.i_data.id_count); + + if (map_start < id_count) { + phys = oi->ip_blkno << inode->i_sb->s_blocksize_bits; + phys += offsetof(struct ocfs2_dinode, id2.i_data.id_data); + + ret = fiemap_fill_next_extent(fieinfo, 0, phys, id_count, + flags); + if (ret < 0) + return ret; + } + + return 0; +} + +#define OCFS2_FIEMAP_FLAGS (FIEMAP_FLAG_SYNC) + +int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 map_start, u64 map_len) +{ + int ret, is_last; + u32 mapping_end, cpos; + unsigned int hole_size; + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); + u64 len_bytes, phys_bytes, virt_bytes; + struct buffer_head *di_bh = NULL; + struct ocfs2_extent_rec rec; + + ret = fiemap_check_flags(fieinfo, OCFS2_FIEMAP_FLAGS); + if (ret) + return ret; + + ret = ocfs2_inode_lock(inode, &di_bh, 0); + if (ret) { + mlog_errno(ret); + goto out; + } + + down_read(&OCFS2_I(inode)->ip_alloc_sem); + + /* + * Handle inline-data separately. + */ + if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) { + ret = ocfs2_fiemap_inline(inode, di_bh, fieinfo, map_start); + goto out_unlock; + } + + cpos = map_start >> osb->s_clustersize_bits; + mapping_end = ocfs2_clusters_for_bytes(inode->i_sb, + map_start + map_len); + mapping_end -= cpos; + is_last = 0; + while (cpos < mapping_end && !is_last) { + u32 fe_flags; + + ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos, + &hole_size, &rec, &is_last); + if (ret) { + mlog_errno(ret); + goto out; + } + + if (rec.e_blkno == 0ULL) { + cpos += hole_size; + continue; + } + + fe_flags = 0; + if (rec.e_flags & OCFS2_EXT_UNWRITTEN) + fe_flags |= FIEMAP_EXTENT_UNWRITTEN; + if (is_last) + fe_flags |= FIEMAP_EXTENT_LAST; + len_bytes = (u64)le16_to_cpu(rec.e_leaf_clusters) << osb->s_clustersize_bits; + phys_bytes = le64_to_cpu(rec.e_blkno) << osb->sb->s_blocksize_bits; + virt_bytes = (u64)le32_to_cpu(rec.e_cpos) << osb->s_clustersize_bits; + + ret = fiemap_fill_next_extent(fieinfo, virt_bytes, phys_bytes, + len_bytes, fe_flags); + if (ret) + break; + + cpos = le32_to_cpu(rec.e_cpos)+ le16_to_cpu(rec.e_leaf_clusters); + } + + if (ret > 0) + ret = 0; + +out_unlock: + brelse(di_bh); + + up_read(&OCFS2_I(inode)->ip_alloc_sem); + + ocfs2_inode_unlock(inode, 0); +out: + + return ret; +} diff --git a/fs/ocfs2/extent_map.h b/fs/ocfs2/extent_map.h index de91e3e..1b97490 100644 --- a/fs/ocfs2/extent_map.h +++ b/fs/ocfs2/extent_map.h @@ -50,4 +50,7 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, u32 *p_cluster, int ocfs2_extent_map_get_blocks(struct inode *inode, u64 v_blkno, u64 *p_blkno, u64 *ret_count, unsigned int *extent_flags); +int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 map_start, u64 map_len); + #endif /* _EXTENT_MAP_H */ diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index ec2ed15..ed38796 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2228,6 +2228,7 @@ const struct inode_operations ocfs2_file_iops = { .getattr = ocfs2_getattr, .permission = ocfs2_permission, .fallocate = ocfs2_fallocate, + .fiemap = ocfs2_fiemap, }; const struct inode_operations ocfs2_special_file_iops = { -- 1.5.6.1.205.ge2c7.dirty ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 3/4] generic block based fiemap implementation 2008-09-13 18:49 ` [PATCH 2/4] ocfs2: fiemap support Theodore Ts'o @ 2008-09-13 18:49 ` Theodore Ts'o 2008-09-13 18:49 ` [PATCH 4/4] Hook ext4 to the vfs fiemap interface Theodore Ts'o 0 siblings, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2008-09-13 18:49 UTC (permalink / raw) To: linux-fsdevel; +Cc: linux-ext4, akpm, Josef Bacik, Mark Fasheh From: Josef Bacik <jbacik@redhat.com> Any block based fs (this patch includes ext3) just has to declare its own fiemap() function and then call this generic function with its own get_block_t. This works well for block based filesystems that will map multiple contiguous blocks at one time, but will work for filesystems that only map one block at a time, you will just end up with an "extent" for each block. One gotcha is this will not play nicely where there is hole+data after the EOF. This function will assume its hit the end of the data as soon as it hits a hole after the EOF, so if there is any data past that it will not pick that up. AFAIK no block based fs does this anyway, but its in the comments of the function anyway just in case. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com> --- fs/ext2/ext2.h | 2 + fs/ext2/file.c | 1 + fs/ext2/inode.c | 8 +++ fs/ext3/file.c | 1 + fs/ext3/inode.c | 8 +++ fs/ioctl.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++ include/linux/ext3_fs.h | 2 + include/linux/fs.h | 3 + 8 files changed, 143 insertions(+), 0 deletions(-) diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index 47d88da..bae998c 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -133,6 +133,8 @@ extern void ext2_truncate (struct inode *); extern int ext2_setattr (struct dentry *, struct iattr *); extern void ext2_set_inode_flags(struct inode *inode); extern void ext2_get_inode_flags(struct ext2_inode_info *); +extern int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len); int __ext2_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata); diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 5f2fa9c..45ed071 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -86,4 +86,5 @@ const struct inode_operations ext2_file_inode_operations = { #endif .setattr = ext2_setattr, .permission = ext2_permission, + .fiemap = ext2_fiemap, }; diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 991d6df..7658b33 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -31,6 +31,7 @@ #include <linux/writeback.h> #include <linux/buffer_head.h> #include <linux/mpage.h> +#include <linux/fiemap.h> #include "ext2.h" #include "acl.h" #include "xip.h" @@ -704,6 +705,13 @@ int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_ } +int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len) +{ + return generic_block_fiemap(inode, fieinfo, start, len, + ext2_get_block); +} + static int ext2_writepage(struct page *page, struct writeback_control *wbc) { return block_write_full_page(page, ext2_get_block, wbc); diff --git a/fs/ext3/file.c b/fs/ext3/file.c index acc4913..3be1e06 100644 --- a/fs/ext3/file.c +++ b/fs/ext3/file.c @@ -134,5 +134,6 @@ const struct inode_operations ext3_file_inode_operations = { .removexattr = generic_removexattr, #endif .permission = ext3_permission, + .fiemap = ext3_fiemap, }; diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index 507d868..ebfec4d 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -36,6 +36,7 @@ #include <linux/mpage.h> #include <linux/uio.h> #include <linux/bio.h> +#include <linux/fiemap.h> #include "xattr.h" #include "acl.h" @@ -981,6 +982,13 @@ out: return ret; } +int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len) +{ + return generic_block_fiemap(inode, fieinfo, start, len, + ext3_get_block); +} + /* * `handle' can be NULL if create is zero */ diff --git a/fs/ioctl.c b/fs/ioctl.c index 274625a..682c252 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -13,6 +13,8 @@ #include <linux/security.h> #include <linux/module.h> #include <linux/uaccess.h> +#include <linux/writeback.h> +#include <linux/buffer_head.h> #include <asm/ioctls.h> @@ -223,6 +225,122 @@ static int ioctl_fiemap(struct file *filp, unsigned long arg) return error; } +#define blk_to_logical(inode, blk) (blk << (inode)->i_blkbits) +#define logical_to_blk(inode, offset) (offset >> (inode)->i_blkbits); + +/* + * @inode - the inode to map + * @arg - the pointer to userspace where we copy everything to + * @get_block - the fs's get_block function + * + * This does FIEMAP for block based inodes. Basically it will just loop + * through get_block until we hit the number of extents we want to map, or we + * go past the end of the file and hit a hole. + * + * If it is possible to have data blocks beyond a hole past @inode->i_size, then + * please do not use this function, it will stop at the first unmapped block + * beyond i_size + */ +int generic_block_fiemap(struct inode *inode, + struct fiemap_extent_info *fieinfo, u64 start, + u64 len, get_block_t *get_block) +{ + struct buffer_head tmp; + unsigned int start_blk; + long long length = 0, map_len = 0; + u64 logical = 0, phys = 0, size = 0; + u32 flags = FIEMAP_EXTENT_MERGED; + int ret = 0; + + if ((ret = fiemap_check_flags(fieinfo, FIEMAP_FLAG_SYNC))) + return ret; + + start_blk = logical_to_blk(inode, start); + + /* guard against change */ + mutex_lock(&inode->i_mutex); + + length = (long long)min_t(u64, len, i_size_read(inode)); + map_len = length; + + do { + /* + * we set b_size to the total size we want so it will map as + * many contiguous blocks as possible at once + */ + memset(&tmp, 0, sizeof(struct buffer_head)); + tmp.b_size = map_len; + + ret = get_block(inode, start_blk, &tmp, 0); + if (ret) + break; + + /* HOLE */ + if (!buffer_mapped(&tmp)) { + /* + * first hole after going past the EOF, this is our + * last extent + */ + if (length <= 0) { + flags = FIEMAP_EXTENT_MERGED|FIEMAP_EXTENT_LAST; + ret = fiemap_fill_next_extent(fieinfo, logical, + phys, size, + flags); + break; + } + + length -= blk_to_logical(inode, 1); + + /* if we have holes up to/past EOF then we're done */ + if (length <= 0) + break; + + start_blk++; + } else { + if (length <= 0 && size) { + ret = fiemap_fill_next_extent(fieinfo, logical, + phys, size, + flags); + if (ret) + break; + } + + logical = blk_to_logical(inode, start_blk); + phys = blk_to_logical(inode, tmp.b_blocknr); + size = tmp.b_size; + flags = FIEMAP_EXTENT_MERGED; + + length -= tmp.b_size; + start_blk += logical_to_blk(inode, size); + + /* + * if we are past the EOF we need to loop again to see + * if there is a hole so we can mark this extent as the + * last one, and if not keep mapping things until we + * find a hole, or we run out of slots in the extent + * array + */ + if (length <= 0) + continue; + + ret = fiemap_fill_next_extent(fieinfo, logical, phys, + size, flags); + if (ret) + break; + } + cond_resched(); + } while (1); + + mutex_unlock(&inode->i_mutex); + + /* if ret is 1 then we just hit the end of the extent array */ + if (ret == 1) + ret = 0; + + return ret; +} +EXPORT_SYMBOL(generic_block_fiemap); + static int file_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h index 80171ee..8120fa1 100644 --- a/include/linux/ext3_fs.h +++ b/include/linux/ext3_fs.h @@ -837,6 +837,8 @@ extern void ext3_truncate (struct inode *); extern void ext3_set_inode_flags(struct inode *); extern void ext3_get_inode_flags(struct ext3_inode_info *); extern void ext3_set_aops(struct inode *inode); +extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len); /* ioctl.c */ extern int ext3_ioctl (struct inode *, struct file *, unsigned int, diff --git a/include/linux/fs.h b/include/linux/fs.h index 194fb23..385c9a1 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1998,6 +1998,9 @@ extern int vfs_fstat(unsigned int, struct kstat *); extern int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, unsigned long arg); +extern int generic_block_fiemap(struct inode *inode, + struct fiemap_extent_info *fieinfo, u64 start, + u64 len, get_block_t *get_block); extern void get_filesystem(struct file_system_type *fs); extern void put_filesystem(struct file_system_type *fs); -- 1.5.6.1.205.ge2c7.dirty ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 4/4] Hook ext4 to the vfs fiemap interface. 2008-09-13 18:49 ` [PATCH 3/4] generic block based fiemap implementation Theodore Ts'o @ 2008-09-13 18:49 ` Theodore Ts'o 0 siblings, 0 replies; 40+ messages in thread From: Theodore Ts'o @ 2008-09-13 18:49 UTC (permalink / raw) To: linux-fsdevel; +Cc: linux-ext4, akpm, Eric Sandeen, Theodore Ts'o From: Eric Sandeen <sandeen@redhat.com> ext4_ext_walk_space() was reinstated to be used for iterating over file extents with a callback; it is used by the ext4 fiemap implementation. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> --- fs/ext4/ext4.h | 2 + fs/ext4/ext4_extents.h | 15 +++ fs/ext4/extents.c | 248 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/ext4/file.c | 4 + 4 files changed, 269 insertions(+), 0 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 9fccf13..8ad9980 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1127,6 +1127,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int, int *); struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int, int *); +int ext4_get_block(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create); int ext4_get_blocks_handle(handle_t *handle, struct inode *inode, ext4_lblk_t iblock, unsigned long maxblocks, struct buffer_head *bh_result, diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h index 0f2c744..fbe34b4 100644 --- a/fs/ext4/ext4_extents.h +++ b/fs/ext4/ext4_extents.h @@ -124,6 +124,19 @@ struct ext4_ext_path { #define EXT4_EXT_CACHE_GAP 1 #define EXT4_EXT_CACHE_EXTENT 2 +/* + * to be called by ext4_ext_walk_space() + * negative retcode - error + * positive retcode - signal for ext4_ext_walk_space(), see below + * callback must return valid extent (passed or newly created) + */ +typedef int (*ext_prepare_callback)(struct inode *, struct ext4_ext_path *, + struct ext4_ext_cache *, + struct ext4_extent *, void *); + +#define EXT_CONTINUE 0 +#define EXT_BREAK 1 +#define EXT_REPEAT 2 #define EXT_MAX_BLOCK 0xffffffff @@ -224,6 +237,8 @@ extern int ext4_ext_try_to_merge(struct inode *inode, struct ext4_extent *); extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *); +extern int ext4_ext_walk_space(struct inode *, ext4_lblk_t, ext4_lblk_t, + ext_prepare_callback, void *); extern struct ext4_ext_path *ext4_ext_find_extent(struct inode *, ext4_lblk_t, struct ext4_ext_path *); extern int ext4_ext_search_left(struct inode *, struct ext4_ext_path *, diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 6ff30ff..32c1aa9 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -40,6 +40,7 @@ #include <linux/slab.h> #include <linux/falloc.h> #include <asm/uaccess.h> +#include <linux/fiemap.h> #include "ext4_jbd2.h" #include "ext4_extents.h" @@ -1657,6 +1658,113 @@ cleanup: return err; } +int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block, + ext4_lblk_t num, ext_prepare_callback func, + void *cbdata) +{ + struct ext4_ext_path *path = NULL; + struct ext4_ext_cache cbex; + struct ext4_extent *ex; + ext4_lblk_t next, start = 0, end = 0; + ext4_lblk_t last = block + num; + int depth, exists, err = 0; + + BUG_ON(func == NULL); + BUG_ON(inode == NULL); + + while (block < last && block != EXT_MAX_BLOCK) { + num = last - block; + /* find extent for this block */ + path = ext4_ext_find_extent(inode, block, path); + if (IS_ERR(path)) { + err = PTR_ERR(path); + path = NULL; + break; + } + + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + ex = path[depth].p_ext; + next = ext4_ext_next_allocated_block(path); + + exists = 0; + if (!ex) { + /* there is no extent yet, so try to allocate + * all requested space */ + start = block; + end = block + num; + } else if (le32_to_cpu(ex->ee_block) > block) { + /* need to allocate space before found extent */ + start = block; + end = le32_to_cpu(ex->ee_block); + if (block + num < end) + end = block + num; + } else if (block >= le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex)) { + /* need to allocate space after found extent */ + start = block; + end = block + num; + if (end >= next) + end = next; + } else if (block >= le32_to_cpu(ex->ee_block)) { + /* + * some part of requested space is covered + * by found extent + */ + start = block; + end = le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex); + if (block + num < end) + end = block + num; + exists = 1; + } else { + BUG(); + } + BUG_ON(end <= start); + + if (!exists) { + cbex.ec_block = start; + cbex.ec_len = end - start; + cbex.ec_start = 0; + cbex.ec_type = EXT4_EXT_CACHE_GAP; + } else { + cbex.ec_block = le32_to_cpu(ex->ee_block); + cbex.ec_len = ext4_ext_get_actual_len(ex); + cbex.ec_start = ext_pblock(ex); + cbex.ec_type = EXT4_EXT_CACHE_EXTENT; + } + + BUG_ON(cbex.ec_len == 0); + err = func(inode, path, &cbex, ex, cbdata); + ext4_ext_drop_refs(path); + + if (err < 0) + break; + + if (err == EXT_REPEAT) + continue; + else if (err == EXT_BREAK) { + err = 0; + break; + } + + if (ext_depth(inode) != depth) { + /* depth was changed. we have to realloc path */ + kfree(path); + path = NULL; + } + + block = cbex.ec_block + cbex.ec_len; + } + + if (path) { + ext4_ext_drop_refs(path); + kfree(path); + } + + return err; +} + static void ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block, __u32 len, ext4_fsblk_t start, int type) @@ -3006,3 +3114,143 @@ retry: mutex_unlock(&inode->i_mutex); return ret > 0 ? ret2 : ret; } + +/* + * Callback function called for each extent to gather FIEMAP information. + */ +int ext4_ext_fiemap_cb(struct inode *inode, struct ext4_ext_path *path, + struct ext4_ext_cache *newex, struct ext4_extent *ex, + void *data) +{ + struct fiemap_extent_info *fieinfo = data; + unsigned long blksize_bits = inode->i_sb->s_blocksize_bits; + __u64 logical; + __u64 physical; + __u64 length; + __u32 flags = 0; + int error; + + logical = (__u64)newex->ec_block << blksize_bits; + + if (newex->ec_type == EXT4_EXT_CACHE_GAP) { + pgoff_t offset; + struct page *page; + struct buffer_head *bh = NULL; + + offset = logical >> PAGE_SHIFT; + page = find_get_page(inode->i_mapping, offset); + if (!page || !page_has_buffers(page)) + return EXT_CONTINUE; + + bh = page_buffers(page); + + if (!bh) + return EXT_CONTINUE; + + if (buffer_delay(bh)) { + flags |= FIEMAP_EXTENT_DELALLOC; + page_cache_release(page); + } else { + page_cache_release(page); + return EXT_CONTINUE; + } + } + + physical = (__u64)newex->ec_start << blksize_bits; + length = (__u64)newex->ec_len << blksize_bits; + + if (ex && ext4_ext_is_uninitialized(ex)) + flags |= FIEMAP_EXTENT_UNWRITTEN; + + /* + * If this extent reaches EXT_MAX_BLOCK, it must be last. + * + * Or if ext4_ext_next_allocated_block is EXT_MAX_BLOCK, + * this also indicates no more allocated blocks. + * + * XXX this might miss a single-block extent at EXT_MAX_BLOCK + */ + if (logical + length - 1 == EXT_MAX_BLOCK || + ext4_ext_next_allocated_block(path) == EXT_MAX_BLOCK) + flags |= FIEMAP_EXTENT_LAST; + + error = fiemap_fill_next_extent(fieinfo, logical, physical, + length, flags); + if (error < 0) + return error; + if (error == 1) + return EXT_BREAK; + + return EXT_CONTINUE; +} + +/* fiemap flags we can handle specified here */ +#define EXT4_FIEMAP_FLAGS (FIEMAP_FLAG_SYNC|FIEMAP_FLAG_XATTR) + +int ext4_xattr_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo) +{ + __u64 physical = 0; + __u64 length; + __u32 flags = FIEMAP_EXTENT_LAST; + int blockbits = inode->i_sb->s_blocksize_bits; + int error = 0; + + /* in-inode? */ + if (EXT4_I(inode)->i_state & EXT4_STATE_XATTR) { + struct ext4_iloc iloc; + int offset; /* offset of xattr in inode */ + + error = ext4_get_inode_loc(inode, &iloc); + if (error) + return error; + physical = iloc.bh->b_blocknr << blockbits; + offset = EXT4_GOOD_OLD_INODE_SIZE + + EXT4_I(inode)->i_extra_isize; + physical += offset; + length = EXT4_SB(inode->i_sb)->s_inode_size - offset; + flags |= FIEMAP_EXTENT_DATA_INLINE; + } else { /* external block */ + physical = EXT4_I(inode)->i_file_acl << blockbits; + length = inode->i_sb->s_blocksize; + } + + if (physical) + error = fiemap_fill_next_extent(fieinfo, 0, physical, + length, flags); + return (error < 0 ? error : 0); +} + +int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + __u64 start, __u64 len) +{ + ext4_lblk_t start_blk; + ext4_lblk_t len_blks; + int error = 0; + + /* fallback to generic here if not in extents fmt */ + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) + return generic_block_fiemap(inode, fieinfo, start, len, + ext4_get_block); + + if (fiemap_check_flags(fieinfo, EXT4_FIEMAP_FLAGS)) + return -EBADR; + + if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR) { + error = ext4_xattr_fiemap(inode, fieinfo); + } else { + start_blk = start >> inode->i_sb->s_blocksize_bits; + len_blks = len >> inode->i_sb->s_blocksize_bits; + + /* + * Walk the extent tree gathering extent information. + * ext4_ext_fiemap_cb will push extents back to user. + */ + down_write(&EXT4_I(inode)->i_data_sem); + error = ext4_ext_walk_space(inode, start_blk, len_blks, + ext4_ext_fiemap_cb, fieinfo); + up_write(&EXT4_I(inode)->i_data_sem); + } + + return error; +} + diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 11b289f..8749d8e 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -140,6 +140,9 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) return 0; } +extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + __u64 start, __u64 len); + const struct file_operations ext4_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, @@ -170,5 +173,6 @@ const struct inode_operations ext4_file_inode_operations = { #endif .permission = ext4_permission, .fallocate = ext4_fallocate, + .fiemap = ext4_fiemap, }; -- 1.5.6.1.205.ge2c7.dirty ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-13 18:49 ` [PATCH 1/4] vfs: vfs-level fiemap interface Theodore Ts'o 2008-09-13 18:49 ` [PATCH 2/4] ocfs2: fiemap support Theodore Ts'o @ 2008-09-13 21:17 ` Anton Altaparmakov 2008-09-13 21:29 ` Theodore Tso 2008-09-14 13:47 ` Christoph Hellwig 2 siblings, 1 reply; 40+ messages in thread From: Anton Altaparmakov @ 2008-09-13 21:17 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-fsdevel, linux-ext4, akpm, Mark Fasheh Hi, Just a quick question (not a criticism of the patches in any way): Is there a specific reason that the flag FIEMAP_EXTENT_DATA_ENCRYPTED exists? I am only asking because there isn't a FIEMAP_EXTENT_DATA_COMPRESSED flag yet the effect on the data is the same. I suppose one could argue that compression is a form of encryption (as one cannot read the data on disk without decompressing it) and thus for compressed files I can just set FIEMAP_EXTENT_DATA_ENCRYPTED, too? Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-13 21:17 ` [PATCH 1/4] vfs: vfs-level " Anton Altaparmakov @ 2008-09-13 21:29 ` Theodore Tso 2008-09-13 21:45 ` Matthew Wilcox ` (2 more replies) 0 siblings, 3 replies; 40+ messages in thread From: Theodore Tso @ 2008-09-13 21:29 UTC (permalink / raw) To: Anton Altaparmakov; +Cc: linux-fsdevel, linux-ext4, akpm, Mark Fasheh On Sat, Sep 13, 2008 at 10:17:07PM +0100, Anton Altaparmakov wrote: > Hi, > > Just a quick question (not a criticism of the patches in any way): > > Is there a specific reason that the flag FIEMAP_EXTENT_DATA_ENCRYPTED > exists? I am only asking because there isn't a > FIEMAP_EXTENT_DATA_COMPRESSED flag yet the effect on the data is the > same. I suppose one could argue that compression is a form of > encryption (as one cannot read the data on disk without decompressing > it) and thus for compressed files I can just set > FIEMAP_EXTENT_DATA_ENCRYPTED, too? You could, but actually, the interface isn't really set up well to deal with compressed files. In order to deal with compressed data you really would need two fields in struct fiemap_extent: fe_length_physical, fe_length_logical, like this: struct fiemap_extent { __u64 fe_logical; /* logical offset in bytes for the start of * the extent from the beginning of the file */ __u64 fe_physical; /* physical offset in bytes for the start * of the extent from the beginning of the disk */ __u64 fe_length; /* length in bytes for this extent */ __u64 fe_length_physical; /* length in bytes for this extent as * encoded/compressed on disk */ __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ __u32 fe_reserved; }; Part of the problem/frustration as we go around and round fs-devel is that some people would request new features to support new filesystems, and other people have been rejecting new features because it makes the interface more complicated. In fact, the previous version of the patch had a flag for compression, but Cristoph Hellwig objected because he claimed there were no users for it. I don't care how this interface ends up, at least as far as this point is concerned, but it's really annoying when some people want to add features, and other people are objecting and trying to remove features, and months and months and months go by while the patch languishes while people play tug of war on the d*mned thing. Probably the best thing to do is to add a much larger number of reserved fields, of size __u32 and __u64, appropriately aligned, and then when someone adds a compressed filesystem, we can add the structure definition without changing the size of the structure. Otherwise, the thou-shalt-not-anticipate-the-needs-of-any-future- users-of-the-interface nazi's will strike again, and more months will go by with absolutely no progress.... - Ted ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-13 21:29 ` Theodore Tso @ 2008-09-13 21:45 ` Matthew Wilcox 2008-09-13 22:41 ` Theodore Tso 2008-09-13 21:59 ` Anton Altaparmakov 2008-09-14 13:48 ` Christoph Hellwig 2 siblings, 1 reply; 40+ messages in thread From: Matthew Wilcox @ 2008-09-13 21:45 UTC (permalink / raw) To: Theodore Tso Cc: Anton Altaparmakov, linux-fsdevel, linux-ext4, akpm, Mark Fasheh On Sat, Sep 13, 2008 at 05:29:03PM -0400, Theodore Tso wrote: > Otherwise, the thou-shalt-not-anticipate-the-needs-of-any-future- > users-of-the-interface nazi's will strike again, and more months will > go by with absolutely no progress.... Ted, I appreciate you're feeling frustration, but there's no call for this kind of language. Please moderate your tone. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-13 21:45 ` Matthew Wilcox @ 2008-09-13 22:41 ` Theodore Tso 0 siblings, 0 replies; 40+ messages in thread From: Theodore Tso @ 2008-09-13 22:41 UTC (permalink / raw) To: Matthew Wilcox Cc: Anton Altaparmakov, linux-fsdevel, linux-ext4, akpm, Mark Fasheh On Sat, Sep 13, 2008 at 03:45:33PM -0600, Matthew Wilcox wrote: > On Sat, Sep 13, 2008 at 05:29:03PM -0400, Theodore Tso wrote: > > Otherwise, the thou-shalt-not-anticipate-the-needs-of-any-future- > > users-of-the-interface nazi's will strike again, and more months will > > go by with absolutely no progress.... > > Ted, I appreciate you're feeling frustration, but there's no call for > this kind of language. Please moderate your tone. Well, being polite and us trying to be constructive hasn't gotten us anywhere, except getting (completely groundless) accusations that one of the ext4 developers was trying to "sneak this in" for the benefit of his company --- and months and months and months of time wasted. I really am beginning to wish that I had ignored Cristoph's whinging and just pushed an ext4-specific ioctl to Linus, instead of trying to develop a filesystem-generic interface. It's just an ioctl. Let's get this done, and move on. If we need to allocate another ioctl, we can do so. Heck, we have five or six different stat interfaces... I have no problem adding support so we can support filesystems with compressed extents. What I don't want to have happen is one patch version where we add a feature, and then get another complaint from someone else demanding that we remove that self-same feature from the interface, repeating, ad infinitum. We shouldn't tolerate this kind of sh*t. - Ted ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-13 21:29 ` Theodore Tso 2008-09-13 21:45 ` Matthew Wilcox @ 2008-09-13 21:59 ` Anton Altaparmakov 2008-09-14 13:51 ` Christoph Hellwig 2008-09-14 13:48 ` Christoph Hellwig 2 siblings, 1 reply; 40+ messages in thread From: Anton Altaparmakov @ 2008-09-13 21:59 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-fsdevel, linux-ext4, akpm, Mark Fasheh Hi, On 13 Sep 2008, at 22:29, Theodore Tso wrote: > On Sat, Sep 13, 2008 at 10:17:07PM +0100, Anton Altaparmakov wrote: >> Just a quick question (not a criticism of the patches in any way): >> >> Is there a specific reason that the flag FIEMAP_EXTENT_DATA_ENCRYPTED >> exists? I am only asking because there isn't a >> FIEMAP_EXTENT_DATA_COMPRESSED flag yet the effect on the data is the >> same. I suppose one could argue that compression is a form of >> encryption (as one cannot read the data on disk without decompressing >> it) and thus for compressed files I can just set >> FIEMAP_EXTENT_DATA_ENCRYPTED, too? > > You could, but actually, the interface isn't really set up well to > deal with compressed files. In order to deal with compressed data you > really would need two fields in struct fiemap_extent: > fe_length_physical, fe_length_logical, like this: > > struct fiemap_extent { > __u64 fe_logical; /* logical offset in bytes for the start of > * the extent from the beginning of the file */ > __u64 fe_physical; /* physical offset in bytes for the start > * of the extent from the beginning of the disk */ > __u64 fe_length; /* length in bytes for this extent */ > __u64 fe_length_physical; /* length in bytes for this extent as > * encoded/compressed on disk */ > __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ > __u32 fe_reserved; > }; Yes, that could make sense for compressed files although for NTFS at least it does not matter because the logical regions which do not exist on disk (due to compression having made the corresponding physical regions smaller by at least one block) are present in the extent list as sparse extents, so the extents for a compressed file could look like this for example (lengths are in blocks): logical ofs physical ofs length 0 X 14 14 SPARSE 2 16 Y 4 20 SPARSE 12 32 Z 9 41 SPARSE 7 ... ... ... As you may have noticed from the above example in NTFS each block of 16 blocks is compressed independently, hence the sparse extents interspersed with the compressed extents to bring the logical offset in line with the physical offset. > Part of the problem/frustration as we go around and round fs-devel is > that some people would request new features to support new > filesystems, and other people have been rejecting new features because > it makes the interface more complicated. > > In fact, the previous version of the patch had a flag for compression, > but Cristoph Hellwig objected because he claimed there were no users > for it. He is plain wrong then! NTFS supports compressed files and always has done! We only support them read-only at present but that does not matter for the purposes of FIEMAP. And in any case one day we will have write support for compressed files, too... > I don't care how this interface ends up, at least as far as this point > is concerned, but it's really annoying when some people want to add > features, and other people are objecting and trying to remove > features, and months and months and months go by while the patch > languishes while people play tug of war on the d*mned thing. > > Probably the best thing to do is to add a much larger number of > reserved fields, of size __u32 and __u64, appropriately aligned, and > then when someone adds a compressed filesystem, we can add the > structure definition without changing the size of the structure. > Otherwise, the thou-shalt-not-anticipate-the-needs-of-any-future- > users-of-the-interface nazi's will strike again, and more months will > go by with absolutely no progress.... We already have two file systems in the mainline kernel tree that support transparent compression/decompression: ntfs and zisofs so there is no "future" anywhere in sight... (-; Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-13 21:59 ` Anton Altaparmakov @ 2008-09-14 13:51 ` Christoph Hellwig 0 siblings, 0 replies; 40+ messages in thread From: Christoph Hellwig @ 2008-09-14 13:51 UTC (permalink / raw) To: Anton Altaparmakov Cc: Theodore Tso, linux-fsdevel, linux-ext4, akpm, Mark Fasheh On Sat, Sep 13, 2008 at 10:59:40PM +0100, Anton Altaparmakov wrote: >> In fact, the previous version of the patch had a flag for compression, >> but Cristoph Hellwig objected because he claimed there were no users >> for it. > > He is plain wrong then! NTFS supports compressed files and always has > done! We only support them read-only at present but that does not > matter for the purposes of FIEMAP. And in any case one day we will have > write support for compressed files, too... Note that I didn't object it because there are not user, this is just Ted making up BS because he's frustrated. I said don't add it until we actually have users so we know their requirements. The nice part about this interface is that there is no problem adding more flags once we need them, if it's just flags, and if we need mor information than just the flags we're better of discussing these bits beforehand. > We already have two file systems in the mainline kernel tree that > support transparent compression/decompression: ntfs and zisofs so there > is no "future" anywhere in sight... (-; or cramfs. But no one involved with this patch srries bothered actually looking into the requirements. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-13 21:29 ` Theodore Tso 2008-09-13 21:45 ` Matthew Wilcox 2008-09-13 21:59 ` Anton Altaparmakov @ 2008-09-14 13:48 ` Christoph Hellwig 2008-09-14 17:58 ` Theodore Tso 2 siblings, 1 reply; 40+ messages in thread From: Christoph Hellwig @ 2008-09-14 13:48 UTC (permalink / raw) To: Theodore Tso Cc: Anton Altaparmakov, linux-fsdevel, linux-ext4, akpm, Mark Fasheh On Sat, Sep 13, 2008 at 05:29:03PM -0400, Theodore Tso wrote: > Part of the problem/frustration as we go around and round fs-devel is > that some people would request new features to support new > filesystems, and other people have been rejecting new features because > it makes the interface more complicated. > > In fact, the previous version of the patch had a flag for compression, > but Cristoph Hellwig objected because he claimed there were no users > for it. Let's make it clear, I've said to not add it unless we have users. And What Anton brought up is exactly the reason for that - to support encrypted extents we actually need more information in the structure. That's why we need to have this broad and sometimes a little slow discussion on fsdevel instead of just rushing in some flag for future use that won't make any sense in the end. (and btw, it's Christoph) ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-14 13:48 ` Christoph Hellwig @ 2008-09-14 17:58 ` Theodore Tso 2008-09-15 14:49 ` Christoph Hellwig 0 siblings, 1 reply; 40+ messages in thread From: Theodore Tso @ 2008-09-14 17:58 UTC (permalink / raw) To: Christoph Hellwig Cc: Anton Altaparmakov, linux-fsdevel, linux-ext4, akpm, Mark Fasheh On Sun, Sep 14, 2008 at 09:48:59AM -0400, Christoph Hellwig wrote: > Let's make it clear, I've said to not add it unless we have users. And > What Anton brought up is exactly the reason for that - to support > encrypted extents we actually need more information in the structure. > That's why we need to have this broad and sometimes a little slow > discussion on fsdevel instead of just rushing in some flag for future > use that won't make any sense in the end. You're incorrect here. Encrypted extents does not require any additional information in the structure. Compressed extents are a bit more useful if we allow the the filesystem to return the amount of space used on the storage device, but as Anton has pointed out, it's not strictly required for ntfs (although it would be more useful for cramfs). But that being said, the fundamental question here is whether we should try to plan for future users of the data structure, and reserve space now for the, or not. Your approach of saying Nein! Nein! Nein! for every single feature where we don't have implementation pretty much guarantees that we will need to expand the structure later to make room for these extra fields, and then we'll need to define a new ioctl and have similar complexity to the stat system call to support multiple userspace interfaces. If we try to anticipate new users in advance, then there is at least a *chance* we'll get it right up front. The big problem here is that if we try for a generic interface, we *will* end up stalling for every as some people ask for new features, and other people (like you) push back on it. And in the mean time, we make no progress. I'd like to break through the logjam, *somehow*. I can see three possible paths: * We reserve space for likely features that could likely be used by real life filesystems that exist today. * We don't, and accept the fact that later on we will need to revise the interface, with the resulting hair that this will mean to the kernel. * I push an ext4-specific ioctl to Linus. I don't particularly care *which* one we choose, but I'd like for us to make a choice, and then move forward, instead of spinning endlessly. And no, I haven't made anything up. I'm leaving names out because the accusation was really out of line, and really unfair, so I want to spare the embarassment of the accuser and the accused, but ask Andrew Morton for confirmation if you want. This has gotten beyond ridiculous, and I'd really like us to make a decision, and move *on*. - Ted ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-14 17:58 ` Theodore Tso @ 2008-09-15 14:49 ` Christoph Hellwig 2008-09-15 17:53 ` Mark Fasheh 0 siblings, 1 reply; 40+ messages in thread From: Christoph Hellwig @ 2008-09-15 14:49 UTC (permalink / raw) To: Theodore Tso Cc: Christoph Hellwig, Anton Altaparmakov, linux-fsdevel, linux-ext4, akpm, Mark Fasheh On Sun, Sep 14, 2008 at 01:58:10PM -0400, Theodore Tso wrote: > On Sun, Sep 14, 2008 at 09:48:59AM -0400, Christoph Hellwig wrote: > > Let's make it clear, I've said to not add it unless we have users. And > > What Anton brought up is exactly the reason for that - to support > > encrypted extents we actually need more information in the structure. > > That's why we need to have this broad and sometimes a little slow > > discussion on fsdevel instead of just rushing in some flag for future > > use that won't make any sense in the end. > > You're incorrect here. Encrypted extents does not require any > additional information in the structure. Compressed extents are a bit > more useful if we allow the the filesystem to return the amount of > space used on the storage device, Sorry, should have written compressed and not encrypted above. > cramfs). But that being said, the fundamental question here is > whether we should try to plan for future users of the data structure, > and reserve space now for the, or not. Your approach of saying Nein! > Nein! Nein! for every single feature where we don't have > implementation pretty much guarantees that we will need to expand the > structure later to make room for these extra fields, and then we'll > need to define a new ioctl and have similar complexity to the stat > system call to support multiple userspace interfaces. If we try to > anticipate new users in advance, then there is at least a *chance* > we'll get it right up front. I agree to you (or someone elses - don't remember anymore) suggestion to put in more padding so we can add fields later. I strongly disagree putting in features now that we neither have a user, nor a usecase or testcase for. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-15 14:49 ` Christoph Hellwig @ 2008-09-15 17:53 ` Mark Fasheh 2008-09-16 6:51 ` Andreas Dilger 2008-09-16 21:31 ` Theodore Tso 0 siblings, 2 replies; 40+ messages in thread From: Mark Fasheh @ 2008-09-15 17:53 UTC (permalink / raw) To: Christoph Hellwig Cc: Theodore Tso, Anton Altaparmakov, linux-fsdevel, linux-ext4, akpm On Mon, Sep 15, 2008 at 10:49:48AM -0400, Christoph Hellwig wrote: > I agree to you (or someone elses - don't remember anymore) suggestion > to put in more padding so we can add fields later. I strongly disagree > putting in features now that we neither have a user, nor a usecase or > testcase for. So, how about another 64 bits of padding in struct fiemap_extent? That could help cover future uses like compression, which might require another 64 bit offset field - we only have 32 bits of reserved space there right now. --Mark -- Mark Fasheh ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-15 17:53 ` Mark Fasheh @ 2008-09-16 6:51 ` Andreas Dilger 2008-09-16 21:31 ` Theodore Tso 1 sibling, 0 replies; 40+ messages in thread From: Andreas Dilger @ 2008-09-16 6:51 UTC (permalink / raw) To: Mark Fasheh Cc: Christoph Hellwig, Theodore Tso, Anton Altaparmakov, linux-fsdevel, linux-ext4, akpm On Sep 15, 2008 10:53 -0700, Mark Fasheh wrote: > On Mon, Sep 15, 2008 at 10:49:48AM -0400, Christoph Hellwig wrote: > > I agree to you (or someone elses - don't remember anymore) suggestion > > to put in more padding so we can add fields later. I strongly disagree > > putting in features now that we neither have a user, nor a usecase or > > testcase for. > > So, how about another 64 bits of padding in struct fiemap_extent? That could > help cover future uses like compression, which might require another 64 bit > offset field - we only have 32 bits of reserved space there right now. I'd be happy to see it. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-15 17:53 ` Mark Fasheh 2008-09-16 6:51 ` Andreas Dilger @ 2008-09-16 21:31 ` Theodore Tso 2008-09-20 16:47 ` Josef 'Jeff' Sipek 1 sibling, 1 reply; 40+ messages in thread From: Theodore Tso @ 2008-09-16 21:31 UTC (permalink / raw) To: Mark Fasheh Cc: Christoph Hellwig, Anton Altaparmakov, linux-fsdevel, linux-ext4, akpm On Mon, Sep 15, 2008 at 10:53:05AM -0700, Mark Fasheh wrote: > On Mon, Sep 15, 2008 at 10:49:48AM -0400, Christoph Hellwig wrote: > > I agree to you (or someone elses - don't remember anymore) suggestion > > to put in more padding so we can add fields later. I strongly disagree > > putting in features now that we neither have a user, nor a usecase or > > testcase for. > > So, how about another 64 bits of padding in struct fiemap_extent? That could > help cover future uses like compression, which might require another 64 bit > offset field - we only have 32 bits of reserved space there right now. What I'd recommend is a 56 byte structure: struct fiemap_extent { __u64 fe_logical; /* logical offset in bytes for the start of * the extent from the beginning of the file */ __u64 fe_physical; /* physical offset in bytes for the start * of the extent from the beginning of the disk */ __u64 fe_length; /* length in bytes for this extent */ __u64 fe_reserved64[2]; * encoded/compressed on disk */ __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ __u32 fe_reserved[3]; }; Yeah, it's a little extra memory per extent, but filesystems seem to always invent new things, and it seem spretty clear that we have at least two extensions on deck (compression, multiple storage devices) both of which have at least one implementation that are either in the kernel or will likely enter the kernel. So it's likely that there is something that we may have missed, and leaving a little extra space doesn't actually cost us that much. - Ted ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-16 21:31 ` Theodore Tso @ 2008-09-20 16:47 ` Josef 'Jeff' Sipek 2008-09-29 1:07 ` Theodore Tso 0 siblings, 1 reply; 40+ messages in thread From: Josef 'Jeff' Sipek @ 2008-09-20 16:47 UTC (permalink / raw) To: Theodore Tso Cc: Mark Fasheh, Christoph Hellwig, Anton Altaparmakov, linux-fsdevel, linux-ext4, akpm On Tue, Sep 16, 2008 at 05:31:07PM -0400, Theodore Tso wrote: > On Mon, Sep 15, 2008 at 10:53:05AM -0700, Mark Fasheh wrote: > > On Mon, Sep 15, 2008 at 10:49:48AM -0400, Christoph Hellwig wrote: > > > I agree to you (or someone elses - don't remember anymore) suggestion > > > to put in more padding so we can add fields later. I strongly disagree > > > putting in features now that we neither have a user, nor a usecase or > > > testcase for. > > > > So, how about another 64 bits of padding in struct fiemap_extent? That could > > help cover future uses like compression, which might require another 64 bit > > offset field - we only have 32 bits of reserved space there right now. > > What I'd recommend is a 56 byte structure: Why not just make it 64 bytes? Sure, that's 8 extra bytes, but I find the power-of-2 size (and the extra space) comforting. (AFAIK, slab allocators will give you 64 bytes anyway; and I expect something similar on the user-space side of things.) ... > Yeah, it's a little extra memory per extent, but filesystems seem to > always invent new things, and it seem spretty clear that we have at > least two extensions on deck (compression, multiple storage devices) > both of which have at least one implementation that are either in the > kernel or will likely enter the kernel. So it's likely that there is > something that we may have missed, and leaving a little extra space > doesn't actually cost us that much. Right. Josef 'Jeff' Sipek. -- Once you have their hardware. Never give it back. (The First Rule of Hardware Acquisition) ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-20 16:47 ` Josef 'Jeff' Sipek @ 2008-09-29 1:07 ` Theodore Tso 2008-09-29 21:13 ` Mark Fasheh 0 siblings, 1 reply; 40+ messages in thread From: Theodore Tso @ 2008-09-29 1:07 UTC (permalink / raw) To: Mark Fasheh, Christoph Hellwig, Anton Altaparmakov, linux-fsdevel, linux-ext4, akpm Mark, Do you have time to update the fiemap patchset so we can hopefully come to consensus and get this pushed to Linus in time for the upcoming merge window? If not, I can take over for you and take over the care and feeding of this patchset so it can get pushed to Linus ASAP; I think we are pretty close to consensus on this interface, and I'd rather that we not lose momentum. - Ted ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-29 1:07 ` Theodore Tso @ 2008-09-29 21:13 ` Mark Fasheh 2008-09-29 22:10 ` Theodore Tso 0 siblings, 1 reply; 40+ messages in thread From: Mark Fasheh @ 2008-09-29 21:13 UTC (permalink / raw) To: Theodore Tso Cc: Christoph Hellwig, Anton Altaparmakov, linux-fsdevel, linux-ext4, akpm On Sun, Sep 28, 2008 at 09:07:20PM -0400, Theodore Tso wrote: > Do you have time to update the fiemap patchset so we can hopefully > come to consensus and get this pushed to Linus in time for the > upcoming merge window? I can probably do this early next week. If that's not soon enough, I won't be offended if you or someone else does the update. I agree, it seems like we're pretty close to consensus. > If not, I can take over for you and take over the care and feeding of > this patchset so it can get pushed to Linus ASAP; I think we are > pretty close to consensus on this interface, and I'd rather that we > not lose momentum. Why not let this sit in -mm for a cycle? I know how that sounds, considering how long it's taken everyone to decide what color to paint the bicycle shed, but at the least it'd give other file systems a chance to write their own support patches. There's also the benefit of a little extra time for ext4/ocfs2 to debug their own support patches. Do we lose any momentum if the patches are 'complete' and in -mm with a target for 2.6.29? --Mark -- Mark Fasheh ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-29 21:13 ` Mark Fasheh @ 2008-09-29 22:10 ` Theodore Tso 0 siblings, 0 replies; 40+ messages in thread From: Theodore Tso @ 2008-09-29 22:10 UTC (permalink / raw) To: Mark Fasheh Cc: Christoph Hellwig, Anton Altaparmakov, linux-fsdevel, linux-ext4, akpm On Mon, Sep 29, 2008 at 02:13:40PM -0700, Mark Fasheh wrote: > Why not let this sit in -mm for a cycle? I know how that sounds, considering > how long it's taken everyone to decide what color to paint the bicycle shed, > but at the least it'd give other file systems a chance to write their own > support patches. There's also the benefit of a little extra time for > ext4/ocfs2 to debug their own support patches. Do we lose any momentum if > the patches are 'complete' and in -mm with a target for 2.6.29? Quite frankly, the reason why I'd rather not is because I'd rather not have someone else showing up asking what color to paint the bicycle shed. We don't have to have all filesystems with support patches before we merge the core patches; in fact I don't think there's any real benefit in holding back until other filesystems are ready. At least the ext4 patches has been ready for quite some time --- and getting it merged would help us with our online defrag patches. - Ted ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-13 18:49 ` [PATCH 1/4] vfs: vfs-level fiemap interface Theodore Ts'o 2008-09-13 18:49 ` [PATCH 2/4] ocfs2: fiemap support Theodore Ts'o 2008-09-13 21:17 ` [PATCH 1/4] vfs: vfs-level " Anton Altaparmakov @ 2008-09-14 13:47 ` Christoph Hellwig 2008-09-14 18:01 ` Theodore Tso 2 siblings, 1 reply; 40+ messages in thread From: Christoph Hellwig @ 2008-09-14 13:47 UTC (permalink / raw) To: Theodore Ts'o Cc: linux-fsdevel, linux-ext4, akpm, Mark Fasheh, mtk.manpages On Sat, Sep 13, 2008 at 02:49:24PM -0400, Theodore Ts'o wrote: > +* FIEMAP_EXTENT_NO_BYPASS > +Direct access to the data in this extent is illegal or will have > +undefined results. This one is sitll misnamed and for sure utterly misdocumented. Direct access to the data is always illegal and has undefined results, and programs that do it anyway (e.g. grub) are in really deep trouble. I'm also not sure what happened to my request a few rounds ago to Cc Michael Kerrisk on the thread. He's been writing manpages for all important kernel interfaces, and his review resulting from that has been extremly useful. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-14 13:47 ` Christoph Hellwig @ 2008-09-14 18:01 ` Theodore Tso 2008-09-14 18:08 ` Christoph Hellwig 0 siblings, 1 reply; 40+ messages in thread From: Theodore Tso @ 2008-09-14 18:01 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-fsdevel, linux-ext4, akpm, Mark Fasheh, mtk.manpages On Sun, Sep 14, 2008 at 09:47:11AM -0400, Christoph Hellwig wrote: > On Sat, Sep 13, 2008 at 02:49:24PM -0400, Theodore Ts'o wrote: > > +* FIEMAP_EXTENT_NO_BYPASS > > +Direct access to the data in this extent is illegal or will have > > +undefined results. > > This one is sitll misnamed and for sure utterly misdocumented. Direct > access to the data is always illegal and has undefined results, and > programs that do it anyway (e.g. grub) are in really deep trouble. Programs like Grub *have* to. And in the case of a boot-loader, Lilo and Grub have been able to do so safely for over 15 years. This is basically a generic flag that indicates that they can't, but it reflects the reality that for many filesystems, they *can*. The main legal use is LILO and/or Grub, in fact. One could argue that programs that try accessing data blocks directly while the filesystem is mounted are doomed, but boot loaders do so while the filesystem are unmounted. The name isn't particularly important, but what it indicates is very clearly useful. - Ted ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-14 18:01 ` Theodore Tso @ 2008-09-14 18:08 ` Christoph Hellwig 2008-09-14 19:58 ` Theodore Tso 0 siblings, 1 reply; 40+ messages in thread From: Christoph Hellwig @ 2008-09-14 18:08 UTC (permalink / raw) To: Theodore Tso Cc: Christoph Hellwig, linux-fsdevel, linux-ext4, akpm, Mark Fasheh, mtk.manpages On Sun, Sep 14, 2008 at 02:01:32PM -0400, Theodore Tso wrote: > On Sun, Sep 14, 2008 at 09:47:11AM -0400, Christoph Hellwig wrote: > > On Sat, Sep 13, 2008 at 02:49:24PM -0400, Theodore Ts'o wrote: > > > +* FIEMAP_EXTENT_NO_BYPASS > > > +Direct access to the data in this extent is illegal or will have > > > +undefined results. > > > > This one is sitll misnamed and for sure utterly misdocumented. Direct > > access to the data is always illegal and has undefined results, and > > programs that do it anyway (e.g. grub) are in really deep trouble. > > Programs like Grub *have* to. Depends on which grub you mean. Grub that boot loader has a filesystem driver and accesses it, that's fine. /usr/sbin/grub, the installation program doesn't have to. In fact the only reason why it does is to "verify" that blocks are in place. This breaks down depending on the phase of the moon, but especially badly on filesystems using a different address space than the block device one. > And in the case of a boot-loader, Lilo > and Grub have been able to do so safely for over 15 years. I think your arguing 100% profs my point that the flag name and description are utterly confusing. What lilo does is load a kernel from an offset into the partition. This is obviously fine, and a perfect use case for an extent map / block map interface. I think (but not one could clarify to me) that this is the primary use case of the no bypass flag, which is supposed to tell lilo that this extent is not actually a real block but some sort of meshup and lilo should return an error that it gets. But this flag is totally unrelated to beeing able to bypass the filesystem driver when running the kernel which is always bogus, and has caused endless amounts of trouble with the only widespread tool that does it (/usr/sbin/grub). > The name isn't particularly important, but what it indicates is very > clearly useful. Sometimes I really wish people would read email before replying to them. I did not say anywhere that this flag (at least for the use case I think it is which no one even bothered to document) is good, but the name and description are utterly wrong, and misleading. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-14 18:08 ` Christoph Hellwig @ 2008-09-14 19:58 ` Theodore Tso 2008-09-15 14:47 ` Christoph Hellwig 0 siblings, 1 reply; 40+ messages in thread From: Theodore Tso @ 2008-09-14 19:58 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-fsdevel, linux-ext4, akpm, Mark Fasheh, mtk.manpages On Sun, Sep 14, 2008 at 02:08:44PM -0400, Christoph Hellwig wrote: > Sometimes I really wish people would read email before replying to them. > I did not say anywhere that this flag (at least for the use case I think > it is which no one even bothered to document) is good, but the name > and description are utterly wrong, and misleading. Want to pick a name, then? - Ted ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-14 19:58 ` Theodore Tso @ 2008-09-15 14:47 ` Christoph Hellwig 2008-09-16 6:49 ` Andreas Dilger 0 siblings, 1 reply; 40+ messages in thread From: Christoph Hellwig @ 2008-09-15 14:47 UTC (permalink / raw) To: Theodore Tso Cc: Christoph Hellwig, linux-fsdevel, linux-ext4, akpm, Mark Fasheh, mtk.manpages On Sun, Sep 14, 2008 at 03:58:11PM -0400, Theodore Tso wrote: > On Sun, Sep 14, 2008 at 02:08:44PM -0400, Christoph Hellwig wrote: > > Sometimes I really wish people would read email before replying to them. > > I did not say anywhere that this flag (at least for the use case I think > > it is which no one even bothered to document) is good, but the name > > and description are utterly wrong, and misleading. > > Want to pick a name, then? Again, I'd love it when people would read my mails :( Can we please first agree on what the flag is actually supposed to mean? I made one guess in this mail, but I'm not sure if that's even what Andreas intended. And of course the description for it is even more important than the name. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-15 14:47 ` Christoph Hellwig @ 2008-09-16 6:49 ` Andreas Dilger 2008-09-16 22:03 ` Theodore Tso 0 siblings, 1 reply; 40+ messages in thread From: Andreas Dilger @ 2008-09-16 6:49 UTC (permalink / raw) To: Christoph Hellwig Cc: Theodore Tso, linux-fsdevel, linux-ext4, akpm, Mark Fasheh, mtk.manpages On Sep 15, 2008 10:47 -0400, Christoph Hellwig wrote: > Again, I'd love it when people would read my mails :( Can we please > first agree on what the flag is actually supposed to mean? I made one > guess in this mail, but I'm not sure if that's even what Andreas > intended. And of course the description for it is even more important > than the name. The intent of this flag was a "catch-all" to indicate it isn't safe to try and read this block from disk, either because it is encrypted, compressed, on a remote system (HSM or over a network), or maybe not even written to disk yet (delalloc). In some cases (e.g. dump on a snapshot, or boot with LILO) it IS ok to read directly from a block device underneath the filesystem, but that would completely fail for the above cases. Note that the NO_BYPASS (formerly NO_DIRECT) flag is meant to be used in conjunction with other flags that specify more clearly the reason that this block is not directly accessible. Having a "generic" flag cover these different flag allows simple applications to know whether the block is readable or not, without having to understand each flag itself, and allows the flags to be expanded in the future (e.g. HSM or whatever that doesn't get included right away). Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-16 6:49 ` Andreas Dilger @ 2008-09-16 22:03 ` Theodore Tso 2008-09-17 14:18 ` Jörn Engel 0 siblings, 1 reply; 40+ messages in thread From: Theodore Tso @ 2008-09-16 22:03 UTC (permalink / raw) To: Andreas Dilger Cc: Christoph Hellwig, linux-fsdevel, linux-ext4, akpm, Mark Fasheh, mtk.manpages On Mon, Sep 15, 2008 at 11:49:43PM -0700, Andreas Dilger wrote: > The intent of this flag was a "catch-all" to indicate it isn't safe > to try and read this block from disk, either because it is encrypted, > compressed, on a remote system (HSM or over a network), or maybe not > even written to disk yet (delalloc). > > In some cases (e.g. dump on a snapshot, or boot with LILO) it IS ok to > read directly from a block device underneath the filesystem, but that > would completely fail for the above cases. Indeed, I thought it was pretty clear and obvious, but let me give an quick but formal definition, and a potential name: DATA_ENCODED If this flag is not set, then applications that who wish to access the file data may do so by accessing the block device at the indicated offset when the filesystem is unmounted. If the filesystem is mounted, it is undefined whether accessing via the block device will return valid data. If the flag DATA_ENCODED flag is set, it is almost certain that an application will never be able to access the file data via the block device. Would this make people happy? - Ted ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-16 22:03 ` Theodore Tso @ 2008-09-17 14:18 ` Jörn Engel 2008-09-17 15:02 ` Jamie Lokier 0 siblings, 1 reply; 40+ messages in thread From: Jörn Engel @ 2008-09-17 14:18 UTC (permalink / raw) To: Theodore Tso Cc: Andreas Dilger, Christoph Hellwig, linux-fsdevel, linux-ext4, akpm, Mark Fasheh, mtk.manpages On Tue, 16 September 2008 18:03:46 -0400, Theodore Tso wrote: > On Mon, Sep 15, 2008 at 11:49:43PM -0700, Andreas Dilger wrote: > > The intent of this flag was a "catch-all" to indicate it isn't safe > > to try and read this block from disk, either because it is encrypted, > > compressed, on a remote system (HSM or over a network), or maybe not > > even written to disk yet (delalloc). > > > > In some cases (e.g. dump on a snapshot, or boot with LILO) it IS ok to > > read directly from a block device underneath the filesystem, but that > > would completely fail for the above cases. > > Indeed, I thought it was pretty clear and obvious, but let me give an > quick but formal definition, and a potential name: DATA_ENCODED > > If this flag is not set, then applications that who wish to access the ^^^^^^^^ > file data may do so by accessing the block device at the indicated > offset when the filesystem is unmounted. If the filesystem is > mounted, it is undefined whether accessing via the block device will > return valid data. If the flag DATA_ENCODED flag is set, it is almost > certain that an application will never be able to access the file data > via the block device. > > Would this make people happy? Apart from the typo above, here is a more discouraging version: In general, accessing the block device directly is strongly discouraged. Exceptions exist mainly in the form of boot loaders like lilo and grub, at a time when the filesystem is not (cannot be) mounted. If the flag DATA_ENCODED is set, however, even this exception is no longer valid. The content is encoded in some form. Details are unknown, it could be compressed, encrypted or something else. Jörn -- Man darf nicht das, was uns unwahrscheinlich und unnatürlich erscheint, mit dem verwechseln, was absolut unmöglich ist. -- Carl Friedrich Gauß -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-17 14:18 ` Jörn Engel @ 2008-09-17 15:02 ` Jamie Lokier 2008-09-17 15:25 ` Theodore Tso 2008-09-19 14:05 ` Chris Mason 0 siblings, 2 replies; 40+ messages in thread From: Jamie Lokier @ 2008-09-17 15:02 UTC (permalink / raw) To: Jörn Engel Cc: Theodore Tso, Andreas Dilger, Christoph Hellwig, linux-fsdevel, linux-ext4, akpm, Mark Fasheh, mtk.manpages Jörn Engel wrote: > Apart from the typo above, here is a more discouraging version: > > In general, accessing the block device directly is strongly discouraged. > Exceptions exist mainly in the form of boot loaders like lilo and grub, > at a time when the filesystem is not (cannot be) mounted. > > If the flag DATA_ENCODED is set, however, even this exception is no > longer valid. The content is encoded in some form. Details are > unknown, it could be compressed, encrypted or something else. I'm not clear about something from the above description. If I were writing a journalling / tree-like filesystem, and I did store data in blocks without encoding, but fsync() only waits for them to be committed to journal, not their final destination, and also they might be moved around - should I set DATA_ENCODED or not? (And should I return the temporary location in the long-running journal since that's the only place the data is committed at the time of the call?) Assume that even reading after unmounting is not 100% safe, because the data blocks could be relocated after calling FIEMAP (when the filesystem must be mounted), and before the unmount. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-17 15:02 ` Jamie Lokier @ 2008-09-17 15:25 ` Theodore Tso 2008-09-19 14:05 ` Chris Mason 1 sibling, 0 replies; 40+ messages in thread From: Theodore Tso @ 2008-09-17 15:25 UTC (permalink / raw) To: Jamie Lokier Cc: Jörn Engel, Andreas Dilger, Christoph Hellwig, linux-fsdevel, linux-ext4, akpm, Mark Fasheh, mtk.manpages On Wed, Sep 17, 2008 at 04:02:12PM +0100, Jamie Lokier wrote: > > I'm not clear about something from the above description. > > If I were writing a journalling / tree-like filesystem, and I did > store data in blocks without encoding, but fsync() only waits for them > to be committed to journal, not their final destination, and also they > might be moved around - should I set DATA_ENCODED or not? (And should > I return the temporary location in the long-running journal since > that's the only place the data is committed at the time of the call?) I can imagine two ways of requiring this. One would be that you should only set DATA_ENCODED if you knew that the data had reached its final destination, possibly with some way of inducing the filesystem to wait until this had happened. The other approach would be to require in the specification that the filesystem would have to be unmounted at least once. I prefer the latter as it is much simpler, and the number of users who actually care about DATA_ENCODED is quite small. The other observation I would make is that for filesystems where "final location" has no meaning (i.e., a log structured filesystem where its log cleaner is constantly moving data blocks around for compaction and/or flash wear leveling purposes), it will probably need to set some bit that means LOCATION_UNSTABLE, as well as setting DATA_ENCODED. This is would be a little confusing since what DATA_ENCODED really means is, "application that might want to read this file contents on an unmounted filesystem --- think again". So maybe a better name would be NO_UNMOUNTED_IO. - Ted P.S. If such filesystems want to be used by boot loaders that need fixed bootstrap blocks, they would probably need to have a flag that pinned the data blocks and caused them not be moved by the log cleaner. Reiser3 had to have something like this to work for LILO, but iirc this was just to disable tail packing. Same concept, though; it was a specialized per-filesystem flag that was needed if that filesystem wanted to be used by a particular boot loader with direct (unmounted) I/O requirements to access bootstrap code/data. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-17 15:02 ` Jamie Lokier 2008-09-17 15:25 ` Theodore Tso @ 2008-09-19 14:05 ` Chris Mason 2008-09-19 17:38 ` Jamie Lokier 1 sibling, 1 reply; 40+ messages in thread From: Chris Mason @ 2008-09-19 14:05 UTC (permalink / raw) To: Jamie Lokier Cc: Jörn Engel, Theodore Tso, Andreas Dilger, Christoph Hellwig, linux-fsdevel, linux-ext4, akpm, Mark Fasheh, mtk.manpages On Wed, Sep 17, 2008 at 04:02:12PM +0100, Jamie Lokier wrote: > Jörn Engel wrote: > > Apart from the typo above, here is a more discouraging version: > > > > In general, accessing the block device directly is strongly discouraged. > > Exceptions exist mainly in the form of boot loaders like lilo and grub, > > at a time when the filesystem is not (cannot be) mounted. > > > > If the flag DATA_ENCODED is set, however, even this exception is no > > longer valid. The content is encoded in some form. Details are > > unknown, it could be compressed, encrypted or something else. > > I'm not clear about something from the above description. > > If I were writing a journalling / tree-like filesystem, and I did > store data in blocks without encoding, but fsync() only waits for them > to be committed to journal, not their final destination, and also they > might be moved around - should I set DATA_ENCODED or not? (And should > I return the temporary location in the long-running journal since > that's the only place the data is committed at the time of the call?) > > Assume that even reading after unmounting is not 100% safe, because > the data blocks could be relocated after calling FIEMAP (when the > filesystem must be mounted), and before the unmount. For the journal case at least, grub can walk through the log of the FS looking for up to date copies of things. It does this already for reiserfs because the btree can't be trusted at all without a log replay. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-19 14:05 ` Chris Mason @ 2008-09-19 17:38 ` Jamie Lokier 2008-09-20 7:43 ` Dave Chinner 0 siblings, 1 reply; 40+ messages in thread From: Jamie Lokier @ 2008-09-19 17:38 UTC (permalink / raw) To: Chris Mason Cc: Jörn Engel, Theodore Tso, Andreas Dilger, Christoph Hellwig, linux-fsdevel, linux-ext4, akpm, Mark Fasheh, mtk.manpages Chris Mason wrote: > On Wed, Sep 17, 2008 at 04:02:12PM +0100, Jamie Lokier wrote: > > Jörn Engel wrote: > > > Apart from the typo above, here is a more discouraging version: > > > > > > In general, accessing the block device directly is strongly discouraged. > > > Exceptions exist mainly in the form of boot loaders like lilo and grub, > > > at a time when the filesystem is not (cannot be) mounted. > > > > > > If the flag DATA_ENCODED is set, however, even this exception is no > > > longer valid. The content is encoded in some form. Details are > > > unknown, it could be compressed, encrypted or something else. > > > > I'm not clear about something from the above description. > > > > If I were writing a journalling / tree-like filesystem, and I did > > store data in blocks without encoding, but fsync() only waits for them > > to be committed to journal, not their final destination, and also they > > might be moved around - should I set DATA_ENCODED or not? (And should > > I return the temporary location in the long-running journal since > > that's the only place the data is committed at the time of the call?) > > > > Assume that even reading after unmounting is not 100% safe, because > > the data blocks could be relocated after calling FIEMAP (when the > > filesystem must be mounted), and before the unmount. > > For the journal case at least, grub can walk through the log of the FS > looking for up to date copies of things. It does this already for > reiserfs because the btree can't be trusted at all without a log replay. Ok, that's good - grub doesn't need FIEMAP, it reads the filesystem properly. So if I were writing a filesystem, what am I expected to return in FIEMAP for these cases? I'm thinking I should set DATA_ENCODED, even though the examples in Jörn's description don't cover this. I'm thinking there are three main uses for FIEMAP: 1. LILO and similar. LILO itself is fine with FIBMAP though. 2. Fragmentation measurement and possibly defragmentation tools. 3. Something wants to have an idea of which areas of disk will be accessed, so it can optimise I/O at a higher level - i.e. a database. This isn't foolproof, especially for writes on recent filesystems which don't overwrite in place. 1 means DATA_ENCODED should be set whenever there's any likelihood that the result isn't reliable, so that would include when data is stored in a journal or other temporary place and not a permanent place on disk. 2 and 3 don't care about DATA_ENCODED at all. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-19 17:38 ` Jamie Lokier @ 2008-09-20 7:43 ` Dave Chinner 2008-09-20 13:50 ` Chris Mason [not found] ` <20080920135048.GA15698@think.oraclecorp.com> 0 siblings, 2 replies; 40+ messages in thread From: Dave Chinner @ 2008-09-20 7:43 UTC (permalink / raw) To: Jamie Lokier Cc: Chris Mason, Jörn Engel, Theodore Tso, Andreas Dilger, Christoph Hellwig, linux-fsdevel, linux-ext4, akpm, Mark Fasheh, mtk.manpages On Fri, Sep 19, 2008 at 06:38:02PM +0100, Jamie Lokier wrote: > Chris Mason wrote: > > On Wed, Sep 17, 2008 at 04:02:12PM +0100, Jamie Lokier wrote: > > > Assume that even reading after unmounting is not 100% safe, because > > > the data blocks could be relocated after calling FIEMAP (when the > > > filesystem must be mounted), and before the unmount. > > > > For the journal case at least, grub can walk through the log of the FS > > looking for up to date copies of things. It does this already for > > reiserfs because the btree can't be trusted at all without a log replay. OMFG. > Ok, that's good - grub doesn't need FIEMAP, it reads the filesystem properly. That's garbage. The whole point of using an API like FIEMAP is to avoid the problems with having to parse the raw disk structures to do stuff. Grub has extremely hacky implementation of just the stuff it needs to read stuff directly off disk. The grub filesystem code is not reviewed by the various filesystem teams, it doesn't even use the same code as the filesystems, and last time I looked at it I came away with bleeding eyes and didn't want to look any further. Why was I even looking at the grub code? Well, grub doesn't implement everyhting that is needed to parse XFS metadata structures. In particular, the problem was out-of-line symlinks in XFS, or traversing symlinks that point to symlinks was simply not implemented in the grub XFS code. Of course, this was considered to be an XFS problem, not a grub problem.... Not to mention that if we change the metadata disk format which is conditional on a superblock feature bit set at mkfs time, grub knows *nothing* about that new on disk format, how to parse it and will break horribly on such a filesystem. All the APIs are there to prevent such problems that grub has. I just wish that they would be used rather than using grub's broken design as justification for not needing or using such APIs. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-20 7:43 ` Dave Chinner @ 2008-09-20 13:50 ` Chris Mason [not found] ` <20080920135048.GA15698@think.oraclecorp.com> 1 sibling, 0 replies; 40+ messages in thread From: Chris Mason @ 2008-09-20 13:50 UTC (permalink / raw) To: Jamie Lokier, Jörn Engel, Theodore Tso, Andreas Dilger, Christoph Hellwig On Sat, Sep 20, 2008 at 05:43:12PM +1000, Dave Chinner wrote: > On Fri, Sep 19, 2008 at 06:38:02PM +0100, Jamie Lokier wrote: > > Chris Mason wrote: > > > On Wed, Sep 17, 2008 at 04:02:12PM +0100, Jamie Lokier wrote: > > > > Assume that even reading after unmounting is not 100% safe, because > > > > the data blocks could be relocated after calling FIEMAP (when the > > > > filesystem must be mounted), and before the unmount. > > > > > > For the journal case at least, grub can walk through the log of the FS > > > looking for up to date copies of things. It does this already for > > > reiserfs because the btree can't be trusted at all without a log replay. > > OMFG. Shrug, either we force the user to do something to get the blocks on disk in a stable location or we have a bootloader that can parse the disk format. It's either pain on the developers (patching grub) or pain on every user, and the support people who have to remind them to rerun grub-find-the-disk-blocks-just-like-lilo -chris ^ permalink raw reply [flat|nested] 40+ messages in thread
[parent not found: <20080920135048.GA15698@think.oraclecorp.com>]
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface [not found] ` <20080920135048.GA15698@think.oraclecorp.com> @ 2008-09-20 15:36 ` Jamie Lokier 2008-09-20 15:44 ` Christoph Hellwig 0 siblings, 1 reply; 40+ messages in thread From: Jamie Lokier @ 2008-09-20 15:36 UTC (permalink / raw) To: Chris Mason Cc: Jörn Engel, Theodore Tso, Andreas Dilger, Christoph Hellwig, linux-fsdevel, linux-ext4, akpm, Mark Fasheh, mtk.manpages Chris Mason wrote: > > > > For the journal case at least, grub can walk through the log of the FS > > > > looking for up to date copies of things. It does this already for > > > > reiserfs because the btree can't be trusted at all without a log replay. > > OMFG. > > Shrug, either we force the user to do something to get the blocks on > disk in a stable location or we have a bootloader that can parse the > disk format. It's either pain on the developers (patching grub) or pain > on every user, and the support people who have to remind them to rerun > grub-find-the-disk-blocks-just-like-lilo I think this thread may be confusing /usr/bin/grub with grub the bootloader. One of the useful features of grub the bootloader is that it can boot from a real filesystem without needing "run lilo before it's safe to boot - or you'll regret shutting down". That feature needs to use some kind of bleeding-eyes filesystem code, or it will be as large as the filesystem code in Linux. On the other hand, any dodgy direct access done by /usr/bin/grub has no justification and is fair game for deletion, imho. -- Jamie ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 1/4] vfs: vfs-level fiemap interface 2008-09-20 15:36 ` Jamie Lokier @ 2008-09-20 15:44 ` Christoph Hellwig 0 siblings, 0 replies; 40+ messages in thread From: Christoph Hellwig @ 2008-09-20 15:44 UTC (permalink / raw) To: Jamie Lokier Cc: Chris Mason, J?rn Engel, Theodore Tso, Andreas Dilger, Christoph Hellwig, linux-fsdevel, linux-ext4, akpm, Mark Fasheh, mtk.manpages On Sat, Sep 20, 2008 at 04:36:34PM +0100, Jamie Lokier wrote: > > Shrug, either we force the user to do something to get the blocks on > > disk in a stable location or we have a bootloader that can parse the > > disk format. It's either pain on the developers (patching grub) or pain > > on every user, and the support people who have to remind them to rerun > > grub-find-the-disk-blocks-just-like-lilo > > I think this thread may be confusing /usr/bin/grub with grub the bootloader. > > One of the useful features of grub the bootloader is that it can boot > from a real filesystem without needing "run lilo before it's safe to > boot - or you'll regret shutting down". > > That feature needs to use some kind of bleeding-eyes filesystem code, > or it will be as large as the filesystem code in Linux. If the lilo or grub approach is better is at least debatable, but that's not really the point here. > On the other hand, any dodgy direct access done by /usr/bin/grub has > no justification and is fair game for deletion, imho. That _is_ the real problem. But we're getting quite offtopic as non of that is related to FIEMAP anymore ;-) ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 0/4] Updated fiemap patches @ 2008-10-03 21:59 Theodore Ts'o 2008-10-03 21:59 ` [PATCH 1/4] vfs: vfs-level fiemap interface Theodore Ts'o 0 siblings, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2008-10-03 21:59 UTC (permalink / raw) To: linux-fsdevel Since Mark has been busy, with his agreement I've updated the fiemap patches based on the last series of comments. All of the patches have been cc'ed to linux-fsdevel, and the first patch (which contains the interface definition) has been cc'ed to Michael Kerrisk and to the new linux-abi mailing list. Note that this patch has gone through a very long and tortured series of bike shed painting, and my goal is to see if we can get consensus so we can get this merged into the 2.6.28 merge window. If not, I will likely push an ext4-specific ioctl to Linus, as the long and painful process of getting a common API that everyone can agree upon is now starting to block ext4 online defrag development. For that reason, this patch doesn't define support for things like compression, as this was vetoed by Cristoph Hellwig. There is however plenty of room for the ABI to be extended to add this later. It is my devout wish that this get pushed at the next merge window, and folks who want to extended it later can provide patches to do so. I would rather not have an extension of the game where one set of people complain because the interface defines some feature, and another set of people complain because it doesn't and in the meantime, months and months go by with no forward progress. Thanks in advance... - Ted ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 1/4] vfs: vfs-level fiemap interface 2008-10-03 21:59 [PATCH 0/4] Updated fiemap patches Theodore Ts'o @ 2008-10-03 21:59 ` Theodore Ts'o 2008-10-03 21:59 ` [PATCH 2/4] ocfs2: fiemap support Theodore Ts'o 0 siblings, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2008-10-03 21:59 UTC (permalink / raw) To: linux-fsdevel; +Cc: Mark Fasheh, Theodore Ts'o, Michael Kerrisk, linux-abi From: Mark Fasheh <mfasheh@suse.com> Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap inode operation. Userspace can get extent information on a file via fiemap ioctl. As input, the fiemap ioctl takes a struct fiemap which includes an array of struct fiemap_extent (fm_extents). Size of the extent array is passed as fm_extent_count and number of extents returned will be written into fm_mapped_extents. Offset and length fields on the fiemap structure (fm_start, fm_length) describe a logical range which will be searched for extents. All extents returned will at least partially contain this range. The actual extent offsets and ranges returned will be unmodified from their offset and range on-disk. The fiemap ioctl returns '0' on success. On error, -1 is returned and errno is set. If errno is equal to EBADR, then fm_flags will contain those flags which were passed in which the kernel did not understand. On all other errors, the contents of fm_extents is undefined. As fiemap evolved, there have been many authors of the vfs patch. As far as I can tell, the list includes: Kalpak Shah <kalpak.shah@sun.com> Andreas Dilger <adilger@sun.com> Eric Sandeen <sandeen@redhat.com> Mark Fasheh <mfasheh@suse.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: linux-abi@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org --- Documentation/filesystems/fiemap.txt | 221 ++++++++++++++++++++++++++++++++++ fs/ioctl.c | 155 ++++++++++++++++++++++++ include/linux/fiemap.h | 64 ++++++++++ include/linux/fs.h | 18 +++ 4 files changed, 458 insertions(+), 0 deletions(-) create mode 100644 Documentation/filesystems/fiemap.txt create mode 100644 include/linux/fiemap.h diff --git a/Documentation/filesystems/fiemap.txt b/Documentation/filesystems/fiemap.txt new file mode 100644 index 0000000..0945d96 --- /dev/null +++ b/Documentation/filesystems/fiemap.txt @@ -0,0 +1,221 @@ +============ +Fiemap Ioctl +============ + +The fiemap ioctl is an efficient method for userspace to get file +extent mappings. Instead of block-by-block mapping (such as bmap), fiemap +returns a list of extents. + + +Request Basics +-------------- + +A fiemap request is encoded within struct fiemap: + +struct fiemap { + __u64 fm_start; /* logical offset (inclusive) at + * which to start mapping (in) */ + __u64 fm_length; /* logical length of mapping which + * userspace cares about (in) */ + __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ + __u32 fm_mapped_extents; /* number of extents that were + * mapped (out) */ + __u32 fm_extent_count; /* size of fm_extents array (in) */ + __u32 fm_reserved; + struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */ +}; + + +fm_start, and fm_length specify the logical range within the file +which the process would like mappings for. Extents returned mirror +those on disk - that is, the logical offset of the 1st returned extent +may start before fm_start, and the range covered by the last returned +extent may end after fm_length. All offsets and lengths are in bytes. + +Certain flags to modify the way in which mappings are looked up can be +set in fm_flags. If the kernel doesn't understand some particular +flags, it will return EBADR and the contents of fm_flags will contain +the set of flags which caused the error. If the kernel is compatible +with all flags passed, the contents of fm_flags will be unmodified. +It is up to userspace to determine whether rejection of a particular +flag is fatal to it's operation. This scheme is intended to allow the +fiemap interface to grow in the future but without losing +compatibility with old software. + +fm_extent_count specifies the number of elements in the fm_extents[] array +that can be used to return extents. If fm_extent_count is zero, then the +fm_extents[] array is ignored (no extents will be returned), and the +fm_mapped_extents count will hold the number of extents needed in +fm_extents[] to hold the file's current mapping. Note that there is +nothing to prevent the file from changing between calls to FIEMAP. + +Currently, there are three flags which can be set in fm_flags: + +* FIEMAP_FLAG_SYNC +If this flag is set, the kernel will sync the file before mapping extents. + +* FIEMAP_FLAG_XATTR +If this flag is set, the extents returned will describe the inodes +extended attribute lookup tree, instead of it's data tree. + + +Extent Mapping +-------------- + +Extent information is returned within the embedded fm_extents array +which userspace must allocate along with the fiemap structure. The +number of elements in the fiemap_extents[] array should be passed via +fm_extent_count. The number of extents mapped by kernel will be +returned via fm_mapped_extents. If the number of fiemap_extents +allocated is less than would be required to map the requested range, +the maximum number of extents that can be mapped in the fm_extent[] +array will be returned and fm_mapped_extents will be equal to +fm_extent_count. In that case, the last extent in the array will not +complete the requested range and will not have the FIEMAP_EXTENT_LAST +flag set (see the next section on extent flags). + +Each extent is described by a single fiemap_extent structure as +returned in fm_extents. + +struct fiemap_extent { + __u64 fe_logical; /* logical offset in bytes for the start of + * the extent */ + __u64 fe_physical; /* physical offset in bytes for the start + * of the extent */ + __u64 fe_length; /* length in bytes for the extent */ + __u64 fe_reserved64[2]; + __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ + __u32 fe_reserved[3]; +}; + +All offsets and lengths are in bytes and mirror those on disk. It is valid +for an extents logical offset to start before the request or it's logical +length to extend past the request. Unless FIEMAP_EXTENT_NOT_ALIGNED is +returned, fe_logical, fe_physical, and fe_length will be aligned to the +block size of the file system. With the exception of extents flagged as +FIEMAP_EXTENT_MERGED, adjacent extents will not be merged. + +The fe_flags field contains flags which describe the extent returned. +A special flag, FIEMAP_EXTENT_LAST is always set on the last extent in +the file so that the process making fiemap calls can determine when no +more extents are available, without having to call the ioctl again. + +Some flags are intentionally vague and will always be set in the +presence of other more specific flags. This way a program looking for +a general property does not have to know all existing and future flags +which imply that property. + +For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL +are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking +for inline or tail-packed data can key on the specific flag. Software +which simply cares not to try operating on non-aligned extents +however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to +worry about all present and future flags which might imply unaligned +data. Note that the opposite is not true - it would be valid for +FIEMAP_EXTENT_NOT_ALIGNED to appear alone. + +* FIEMAP_EXTENT_LAST +This is the last extent in the file. A mapping attempt past this +extent will return nothing. + +* FIEMAP_EXTENT_UNKNOWN +The location of this extent is currently unknown. This may indicate +the data is stored on an inaccessible volume or that no storage has +been allocated for the file yet. + +* FIEMAP_EXTENT_DELALLOC + - This will also set FIEMAP_EXTENT_UNKNOWN. +Delayed allocation - while there is data for this extent, it's +physical location has not been allocated yet. + +* FIEMAP_EXTENT_NO_MOUNTED_IO +Accessing the data in this extent via I/O to the block device while the +filesystem is unmounted is illegal or will have undefined results. This +may because the data is somehow encrypted or compressed by the filesystem. +Note that accessing the data using the information returned by the +FIEMAP interface is *always* undefined should not be attempted by user +applications. + +* FIEMAP_EXTENT_DATA_ENCRYPTED + - This will also set FIEMAP_EXTENT_NO_MOUNTED_IO +The data in this extent has been encrypted by the file system. + +* FIEMAP_EXTENT_NOT_ALIGNED +Extent offsets and length are not guaranteed to be block aligned. + +* FIEMAP_EXTENT_DATA_INLINE + This will also set FIEMAP_EXTENT_NOT_ALIGNED +Data is located within a meta data block. + +* FIEMAP_EXTENT_DATA_TAIL + This will also set FIEMAP_EXTENT_NOT_ALIGNED +Data is packed into a block with data from other files. + +* FIEMAP_EXTENT_UNWRITTEN +Unwritten extent - the extent is allocated but it's data has not been +initialized. This indicates the extent's data will be all zero if read +through the filesystem but the contents are undefined if read directly from +the device. + +* FIEMAP_EXTENT_MERGED +This will be set when a file does not support extents, i.e., it uses a block +based addressing scheme. Since returning an extent for each block back to +userspace would be highly inefficient, the kernel will try to merge most +adjacent blocks into 'extents'. + + +VFS -> File System Implementation +--------------------------------- + +File systems wishing to support fiemap must implement a ->fiemap callback on +their inode_operations structure. The fs ->fiemap call is responsible for +defining it's set of supported fiemap flags, and calling a helper function on +each discovered extent: + +struct inode_operations { + ... + + int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, + u64 len); + +->fiemap is passed struct fiemap_extent_info which describes the +fiemap request: + +struct fiemap_extent_info { + unsigned int fi_flags; /* Flags as passed from user */ + unsigned int fi_extents_mapped; /* Number of mapped extents */ + unsigned int fi_extents_max; /* Size of fiemap_extent array */ + struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */ +}; + +It is intended that the file system should not need to access any of this +structure directly. + + +Flag checking should be done at the beginning of the ->fiemap callback via the +fiemap_check_flags() helper: + +int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); + +The struct fieinfo should be passed in as recieved from ioctl_fiemap(). The +set of fiemap flags which the fs understands should be passed via fs_flags. If +fiemap_check_flags finds invalid user flags, it will place the bad values in +fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR, from +fiemap_check_flags(), it should immediately exit, returning that error back to +ioctl_fiemap(). + + +For each extent in the request range, the file system should call +the helper function, fiemap_fill_next_extent(): + +int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, + u64 phys, u64 len, u32 flags, u32 dev); + +fiemap_fill_next_extent() will use the passed values to populate the +next free extent in the fm_extents array. 'General' extent flags will +automatically be set from specific flags on behalf of the calling file +system so that the userspace API is not broken. + +fiemap_fill_next_extent() returns 0 on success, and 1 when the +user-supplied fm_extents array is full. If an error is encountered +while copying the extent to user memory, -EFAULT will be returned. diff --git a/fs/ioctl.c b/fs/ioctl.c index 7db32b3..837444a 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -16,6 +16,9 @@ #include <asm/ioctls.h> +/* So that the fiemap access checks can't overflow on 32 bit machines. */ +#define FIEMAP_MAX_EXTENTS (UINT_MAX / sizeof(struct fiemap_extent)) + /** * vfs_ioctl - call filesystem specific ioctl methods * @filp: open file to invoke ioctl method on @@ -71,6 +74,156 @@ static int ioctl_fibmap(struct file *filp, int __user *p) return put_user(res, p); } +/** + * fiemap_fill_next_extent - Fiemap helper function + * @fieinfo: Fiemap context passed into ->fiemap + * @logical: Extent logical start offset, in bytes + * @phys: Extent physical start offset, in bytes + * @len: Extent length, in bytes + * @flags: FIEMAP_EXTENT flags that describe this extent + * + * Called from file system ->fiemap callback. Will populate extent + * info as passed in via arguments and copy to user memory. On + * success, extent count on fieinfo is incremented. + * + * Returns 0 on success, -errno on error, 1 if this was the last + * extent that will fit in user array. + */ +#define SET_UNKNOWN_FLAGS (FIEMAP_EXTENT_DELALLOC) +#define SET_NO_MOUNTED_IO_FLAGS (FIEMAP_EXTENT_DATA_ENCRYPTED) +#define SET_NOT_ALIGNED_FLAGS (FIEMAP_EXTENT_DATA_TAIL|FIEMAP_EXTENT_DATA_INLINE) +int fiemap_fill_next_extent(struct fiemap_extent_info *fieinfo, u64 logical, + u64 phys, u64 len, u32 flags) +{ + struct fiemap_extent extent; + struct fiemap_extent *dest = fieinfo->fi_extents_start; + + /* only count the extents */ + if (fieinfo->fi_extents_max == 0) { + fieinfo->fi_extents_mapped++; + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; + } + + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) + return 1; + + if (flags & SET_UNKNOWN_FLAGS) + flags |= FIEMAP_EXTENT_UNKNOWN; + if (flags & SET_NO_MOUNTED_IO_FLAGS) + flags |= FIEMAP_EXTENT_NO_MOUNTED_IO; + if (flags & SET_NOT_ALIGNED_FLAGS) + flags |= FIEMAP_EXTENT_NOT_ALIGNED; + + memset(extent, 0, sizeof(extent)); + extent.fe_logical = logical; + extent.fe_physical = phys; + extent.fe_length = len; + extent.fe_flags = flags; + + dest += fieinfo->fi_extents_mapped; + if (copy_to_user(dest, &extent, sizeof(extent))) + return -EFAULT; + + fieinfo->fi_extents_mapped++; + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) + return 1; + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; +} +EXPORT_SYMBOL(fiemap_fill_next_extent); + +/** + * fiemap_check_flags - check validity of requested flags for fiemap + * @fieinfo: Fiemap context passed into ->fiemap + * @fs_flags: Set of fiemap flags that the file system understands + * + * Called from file system ->fiemap callback. This will compute the + * intersection of valid fiemap flags and those that the fs supports. That + * value is then compared against the user supplied flags. In case of bad user + * flags, the invalid values will be written into the fieinfo structure, and + * -EBADR is returned, which tells ioctl_fiemap() to return those values to + * userspace. For this reason, a return code of -EBADR should be preserved. + * + * Returns 0 on success, -EBADR on bad flags. + */ +int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags) +{ + u32 incompat_flags; + + incompat_flags = fieinfo->fi_flags & ~(FIEMAP_FLAGS_COMPAT & fs_flags); + if (incompat_flags) { + fieinfo->fi_flags = incompat_flags; + return -EBADR; + } + return 0; +} +EXPORT_SYMBOL(fiemap_check_flags); + +static int fiemap_check_ranges(struct super_block *sb, + u64 start, u64 len, u64 *new_len) +{ + *new_len = len; + + if (len == 0) + return -EINVAL; + + if (start > sb->s_maxbytes) + return -EFBIG; + + /* + * Shrink request scope to what the fs can actually handle. + */ + if ((len > sb->s_maxbytes) || + (sb->s_maxbytes - len) < start) + *new_len = sb->s_maxbytes - start; + + return 0; +} + +static int ioctl_fiemap(struct file *filp, unsigned long arg) +{ + struct fiemap fiemap; + struct fiemap_extent_info fieinfo = { 0, }; + struct inode *inode = filp->f_path.dentry->d_inode; + struct super_block *sb = inode->i_sb; + u64 len; + int error; + + if (!inode->i_op->fiemap) + return -EOPNOTSUPP; + + if (copy_from_user(&fiemap, (struct fiemap __user *)arg, + sizeof(struct fiemap))) + return -EFAULT; + + if (fiemap.fm_extent_count > FIEMAP_MAX_EXTENTS) + return -EINVAL; + + error = fiemap_check_ranges(sb, fiemap.fm_start, fiemap.fm_length, + &len); + if (error) + return error; + + fieinfo.fi_flags = fiemap.fm_flags; + fieinfo.fi_extents_max = fiemap.fm_extent_count; + fieinfo.fi_extents_start = (struct fiemap_extent *)(arg + sizeof(fiemap)); + + if (fiemap.fm_extent_count != 0 && + !access_ok(VERIFY_WRITE, fieinfo.fi_extents_start, + fieinfo.fi_extents_max * sizeof(struct fiemap_extent))) + return -EFAULT; + + if (fieinfo.fi_flags & FIEMAP_FLAG_SYNC) + filemap_write_and_wait(inode->i_mapping); + + error = inode->i_op->fiemap(inode, &fieinfo, fiemap.fm_start, len); + fiemap.fm_flags = fieinfo.fi_flags; + fiemap.fm_mapped_extents = fieinfo.fi_extents_mapped; + if (copy_to_user((char *)arg, &fiemap, sizeof(fiemap))) + error = -EFAULT; + + return error; +} + static int file_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { @@ -80,6 +233,8 @@ static int file_ioctl(struct file *filp, unsigned int cmd, switch (cmd) { case FIBMAP: return ioctl_fibmap(filp, p); + case FS_IOC_FIEMAP: + return ioctl_fiemap(filp, arg); case FIGETBSZ: return put_user(inode->i_sb->s_blocksize, p); case FIONREAD: diff --git a/include/linux/fiemap.h b/include/linux/fiemap.h new file mode 100644 index 0000000..90c2e88 --- /dev/null +++ b/include/linux/fiemap.h @@ -0,0 +1,64 @@ +/* + * FS_IOC_FIEMAP ioctl infrastructure. + * + * Some portions copyright (C) 2007 Cluster File Systems, Inc + * + * Authors: Mark Fasheh <mfasheh@suse.com> + * Kalpak Shah <kalpak.shah@sun.com> + * Andreas Dilger <adilger@sun.com> + */ + +#ifndef _LINUX_FIEMAP_H +#define _LINUX_FIEMAP_H + +struct fiemap_extent { + __u64 fe_logical; /* logical offset in bytes for the start of + * the extent from the beginning of the file */ + __u64 fe_physical; /* physical offset in bytes for the start + * of the extent from the beginning of the disk */ + __u64 fe_length; /* length in bytes for this extent */ + __u64 fe_reserved64[2]; + __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ + __u32 fe_reserved[3]; +}; + +struct fiemap { + __u64 fm_start; /* logical offset (inclusive) at + * which to start mapping (in) */ + __u64 fm_length; /* logical length of mapping which + * userspace wants (in) */ + __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ + __u32 fm_mapped_extents;/* number of extents that were mapped (out) */ + __u32 fm_extent_count; /* size of fm_extents array (in) */ + __u32 fm_reserved; + struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */ +}; + +#define FIEMAP_MAX_OFFSET (~0ULL) + +#define FIEMAP_FLAG_SYNC 0x00000001 /* sync file data before map */ +#define FIEMAP_FLAG_XATTR 0x00000002 /* map extended attribute tree */ + +#define FIEMAP_FLAGS_COMPAT (FIEMAP_FLAG_SYNC | FIEMAP_FLAG_XATTR) + +#define FIEMAP_EXTENT_LAST 0x00000001 /* Last extent in file. */ +#define FIEMAP_EXTENT_UNKNOWN 0x00000002 /* Data location unknown. */ +#define FIEMAP_EXTENT_DELALLOC 0x00000004 /* Location still pending. + * Sets EXTENT_UNKNOWN. */ +#define FIEMAP_EXTENT_NO_MOUNTED_IO 0x00000008 /* Data can not be read + * while fs is unmounted */ +#define FIEMAP_EXTENT_DATA_ENCRYPTED 0x00000080 /* Data is encrypted by fs. + * Sets EXTENT_NO_BYPASS. */ +#define FIEMAP_EXTENT_NOT_ALIGNED 0x00000100 /* Extent offsets may not be + * block aligned. */ +#define FIEMAP_EXTENT_DATA_INLINE 0x00000200 /* Data mixed with metadata. + * Sets EXTENT_NOT_ALIGNED.*/ +#define FIEMAP_EXTENT_DATA_TAIL 0x00000400 /* Multiple files in block. + * Sets EXTENT_NOT_ALIGNED.*/ +#define FIEMAP_EXTENT_UNWRITTEN 0x00000800 /* Space allocated, but + * no data (i.e. zero). */ +#define FIEMAP_EXTENT_MERGED 0x00001000 /* File does not natively + * support extents. Result + * merged for efficiency. */ + +#endif /* _LINUX_FIEMAP_H */ diff --git a/include/linux/fs.h b/include/linux/fs.h index ae0ab82..0724abf 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -232,6 +232,7 @@ extern int dir_notify_enable; #define FS_IOC_SETFLAGS _IOW('f', 2, long) #define FS_IOC_GETVERSION _IOR('v', 1, long) #define FS_IOC_SETVERSION _IOW('v', 2, long) +#define FS_IOC_FIEMAP _IOWR('f', 11, struct fiemap) #define FS_IOC32_GETFLAGS _IOR('f', 1, int) #define FS_IOC32_SETFLAGS _IOW('f', 2, int) #define FS_IOC32_GETVERSION _IOR('v', 1, int) @@ -292,6 +293,7 @@ extern int dir_notify_enable; #include <linux/mutex.h> #include <linux/capability.h> #include <linux/semaphore.h> +#include <linux/fiemap.h> #include <asm/atomic.h> #include <asm/byteorder.h> @@ -1180,6 +1182,20 @@ extern void dentry_unhash(struct dentry *dentry); extern int file_permission(struct file *, int); /* + * VFS FS_IOC_FIEMAP helper definitions. + */ +struct fiemap_extent_info { + unsigned int fi_flags; /* Flags as passed from user */ + unsigned int fi_extents_mapped; /* Number of mapped extents */ + unsigned int fi_extents_max; /* Size of fiemap_extent array */ + struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent + * array */ +}; +int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, + u64 phys, u64 len, u32 flags); +int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); + +/* * File types * * NOTE! These match bits 12..15 of stat.st_mode @@ -1288,6 +1304,8 @@ struct inode_operations { void (*truncate_range)(struct inode *, loff_t, loff_t); long (*fallocate)(struct inode *inode, int mode, loff_t offset, loff_t len); + int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, + u64 len); }; struct seq_file; -- 1.5.6.1.205.ge2c7.dirty ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 2/4] ocfs2: fiemap support 2008-10-03 21:59 ` [PATCH 1/4] vfs: vfs-level fiemap interface Theodore Ts'o @ 2008-10-03 21:59 ` Theodore Ts'o 2008-10-03 21:59 ` [PATCH 3/4] generic block based fiemap implementation Theodore Ts'o 0 siblings, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2008-10-03 21:59 UTC (permalink / raw) To: linux-fsdevel; +Cc: Mark Fasheh, Theodore Ts'o, ocfs2-devel From: Mark Fasheh <mfasheh@suse.com> Plug ocfs2 into ->fiemap. Some portions of ocfs2_get_clusters() had to be refactored so that the extent cache can be skipped in favor of going directly to the on-disk records. This makes it easier for us to determine which extent is the last one in the btree. Also, I'm not sure we want to be caching fiemap lookups anyway as they're not directly related to data read/write. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: ocfs2-devel@oss.oracle.com Cc: linux-fsdevel@vger.kernel.org --- fs/ocfs2/alloc.c | 9 -- fs/ocfs2/alloc.h | 9 ++ fs/ocfs2/extent_map.c | 346 +++++++++++++++++++++++++++++++++++++++++-------- fs/ocfs2/extent_map.h | 3 + fs/ocfs2/file.c | 1 + 5 files changed, 306 insertions(+), 62 deletions(-) diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c index 10bfb46..29ff57e 100644 --- a/fs/ocfs2/alloc.c +++ b/fs/ocfs2/alloc.c @@ -990,15 +990,6 @@ out: } /* - * This is only valid for leaf nodes, which are the only ones that can - * have empty extents anyway. - */ -static inline int ocfs2_is_empty_extent(struct ocfs2_extent_rec *rec) -{ - return !rec->e_leaf_clusters; -} - -/* * This function will discard the rightmost extent record. */ static void ocfs2_shift_records_right(struct ocfs2_extent_list *el) diff --git a/fs/ocfs2/alloc.h b/fs/ocfs2/alloc.h index 42ff94b..60cd3d5 100644 --- a/fs/ocfs2/alloc.h +++ b/fs/ocfs2/alloc.h @@ -146,4 +146,13 @@ static inline unsigned int ocfs2_rec_clusters(struct ocfs2_extent_list *el, return le16_to_cpu(rec->e_leaf_clusters); } +/* + * This is only valid for leaf nodes, which are the only ones that can + * have empty extents anyway. + */ +static inline int ocfs2_is_empty_extent(struct ocfs2_extent_rec *rec) +{ + return !rec->e_leaf_clusters; +} + #endif /* OCFS2_ALLOC_H */ diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c index c58668a..aed268e 100644 --- a/fs/ocfs2/extent_map.c +++ b/fs/ocfs2/extent_map.c @@ -25,6 +25,7 @@ #include <linux/fs.h> #include <linux/init.h> #include <linux/types.h> +#include <linux/fiemap.h> #define MLOG_MASK_PREFIX ML_EXTENT_MAP #include <cluster/masklog.h> @@ -32,6 +33,7 @@ #include "ocfs2.h" #include "alloc.h" +#include "dlmglue.h" #include "extent_map.h" #include "inode.h" #include "super.h" @@ -282,6 +284,51 @@ out: kfree(new_emi); } +static int ocfs2_last_eb_is_empty(struct inode *inode, + struct ocfs2_dinode *di) +{ + int ret, next_free; + u64 last_eb_blk = le64_to_cpu(di->i_last_eb_blk); + struct buffer_head *eb_bh = NULL; + struct ocfs2_extent_block *eb; + struct ocfs2_extent_list *el; + + ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), last_eb_blk, + &eb_bh, OCFS2_BH_CACHED, inode); + if (ret) { + mlog_errno(ret); + goto out; + } + + eb = (struct ocfs2_extent_block *) eb_bh->b_data; + el = &eb->h_list; + + if (!OCFS2_IS_VALID_EXTENT_BLOCK(eb)) { + ret = -EROFS; + OCFS2_RO_ON_INVALID_EXTENT_BLOCK(inode->i_sb, eb); + goto out; + } + + if (el->l_tree_depth) { + ocfs2_error(inode->i_sb, + "Inode %lu has non zero tree depth in " + "leaf block %llu\n", inode->i_ino, + (unsigned long long)eb_bh->b_blocknr); + ret = -EROFS; + goto out; + } + + next_free = le16_to_cpu(el->l_next_free_rec); + + if (next_free == 0 || + (next_free == 1 && ocfs2_is_empty_extent(&el->l_recs[0]))) + ret = 1; + +out: + brelse(eb_bh); + return ret; +} + /* * Return the 1st index within el which contains an extent start * larger than v_cluster. @@ -373,42 +420,28 @@ out: return ret; } -int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, - u32 *p_cluster, u32 *num_clusters, - unsigned int *extent_flags) +static int ocfs2_get_clusters_nocache(struct inode *inode, + struct buffer_head *di_bh, + u32 v_cluster, unsigned int *hole_len, + struct ocfs2_extent_rec *ret_rec, + unsigned int *is_last) { - int ret, i; - unsigned int flags = 0; - struct buffer_head *di_bh = NULL; - struct buffer_head *eb_bh = NULL; + int i, ret, tree_height, len; struct ocfs2_dinode *di; - struct ocfs2_extent_block *eb; + struct ocfs2_extent_block *uninitialized_var(eb); struct ocfs2_extent_list *el; struct ocfs2_extent_rec *rec; - u32 coff; - - if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) { - ret = -ERANGE; - mlog_errno(ret); - goto out; - } - - ret = ocfs2_extent_map_lookup(inode, v_cluster, p_cluster, - num_clusters, extent_flags); - if (ret == 0) - goto out; + struct buffer_head *eb_bh = NULL; - ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), OCFS2_I(inode)->ip_blkno, - &di_bh, OCFS2_BH_CACHED, inode); - if (ret) { - mlog_errno(ret); - goto out; - } + memset(ret_rec, 0, sizeof(*ret_rec)); + if (is_last) + *is_last = 0; di = (struct ocfs2_dinode *) di_bh->b_data; el = &di->id2.i_list; + tree_height = le16_to_cpu(el->l_tree_depth); - if (el->l_tree_depth) { + if (tree_height > 0) { ret = ocfs2_find_leaf(inode, el, v_cluster, &eb_bh); if (ret) { mlog_errno(ret); @@ -431,46 +464,143 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, i = ocfs2_search_extent_list(el, v_cluster); if (i == -1) { /* - * A hole was found. Return some canned values that - * callers can key on. If asked for, num_clusters will - * be populated with the size of the hole. + * Holes can be larger than the maximum size of an + * extent, so we return their lengths in a seperate + * field. */ - *p_cluster = 0; - if (num_clusters) { + if (hole_len) { ret = ocfs2_figure_hole_clusters(inode, el, eb_bh, - v_cluster, - num_clusters); + v_cluster, &len); if (ret) { mlog_errno(ret); goto out; } + + *hole_len = len; } - } else { - rec = &el->l_recs[i]; + goto out_hole; + } - BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos)); + rec = &el->l_recs[i]; - if (!rec->e_blkno) { - ocfs2_error(inode->i_sb, "Inode %lu has bad extent " - "record (%u, %u, 0)", inode->i_ino, - le32_to_cpu(rec->e_cpos), - ocfs2_rec_clusters(el, rec)); - ret = -EROFS; - goto out; + BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos)); + + if (!rec->e_blkno) { + ocfs2_error(inode->i_sb, "Inode %lu has bad extent " + "record (%u, %u, 0)", inode->i_ino, + le32_to_cpu(rec->e_cpos), + ocfs2_rec_clusters(el, rec)); + ret = -EROFS; + goto out; + } + + *ret_rec = *rec; + + /* + * Checking for last extent is potentially expensive - we + * might have to look at the next leaf over to see if it's + * empty. + * + * The first two checks are to see whether the caller even + * cares for this information, and if the extent is at least + * the last in it's list. + * + * If those hold true, then the extent is last if any of the + * additional conditions hold true: + * - Extent list is in-inode + * - Extent list is right-most + * - Extent list is 2nd to rightmost, with empty right-most + */ + if (is_last) { + if (i == (le16_to_cpu(el->l_next_free_rec) - 1)) { + if (tree_height == 0) + *is_last = 1; + else if (eb->h_blkno == di->i_last_eb_blk) + *is_last = 1; + else if (eb->h_next_leaf_blk == di->i_last_eb_blk) { + ret = ocfs2_last_eb_is_empty(inode, di); + if (ret < 0) { + mlog_errno(ret); + goto out; + } + if (ret == 1) + *is_last = 1; + } } + } + +out_hole: + ret = 0; +out: + brelse(eb_bh); + return ret; +} + +static void ocfs2_relative_extent_offsets(struct super_block *sb, + u32 v_cluster, + struct ocfs2_extent_rec *rec, + u32 *p_cluster, u32 *num_clusters) + +{ + u32 coff = v_cluster - le32_to_cpu(rec->e_cpos); + + *p_cluster = ocfs2_blocks_to_clusters(sb, le64_to_cpu(rec->e_blkno)); + *p_cluster = *p_cluster + coff; + + if (num_clusters) + *num_clusters = le16_to_cpu(rec->e_leaf_clusters) - coff; +} + +int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, + u32 *p_cluster, u32 *num_clusters, + unsigned int *extent_flags) +{ + int ret; + unsigned int uninitialized_var(hole_len), flags = 0; + struct buffer_head *di_bh = NULL; + struct ocfs2_extent_rec rec; - coff = v_cluster - le32_to_cpu(rec->e_cpos); + if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) { + ret = -ERANGE; + mlog_errno(ret); + goto out; + } - *p_cluster = ocfs2_blocks_to_clusters(inode->i_sb, - le64_to_cpu(rec->e_blkno)); - *p_cluster = *p_cluster + coff; + ret = ocfs2_extent_map_lookup(inode, v_cluster, p_cluster, + num_clusters, extent_flags); + if (ret == 0) + goto out; - if (num_clusters) - *num_clusters = ocfs2_rec_clusters(el, rec) - coff; + ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), OCFS2_I(inode)->ip_blkno, + &di_bh, OCFS2_BH_CACHED, inode); + if (ret) { + mlog_errno(ret); + goto out; + } - flags = rec->e_flags; + ret = ocfs2_get_clusters_nocache(inode, di_bh, v_cluster, &hole_len, + &rec, NULL); + if (ret) { + mlog_errno(ret); + goto out; + } - ocfs2_extent_map_insert_rec(inode, rec); + if (rec.e_blkno == 0ULL) { + /* + * A hole was found. Return some canned values that + * callers can key on. If asked for, num_clusters will + * be populated with the size of the hole. + */ + *p_cluster = 0; + if (num_clusters) { + *num_clusters = hole_len; + } + } else { + ocfs2_relative_extent_offsets(inode->i_sb, v_cluster, &rec, + p_cluster, num_clusters); + flags = rec.e_flags; + + ocfs2_extent_map_insert_rec(inode, &rec); } if (extent_flags) @@ -478,7 +608,6 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, out: brelse(di_bh); - brelse(eb_bh); return ret; } @@ -521,3 +650,114 @@ int ocfs2_extent_map_get_blocks(struct inode *inode, u64 v_blkno, u64 *p_blkno, out: return ret; } + +static int ocfs2_fiemap_inline(struct inode *inode, struct buffer_head *di_bh, + struct fiemap_extent_info *fieinfo, + u64 map_start) +{ + int ret; + unsigned int id_count; + struct ocfs2_dinode *di; + u64 phys; + u32 flags = FIEMAP_EXTENT_DATA_INLINE|FIEMAP_EXTENT_LAST; + struct ocfs2_inode_info *oi = OCFS2_I(inode); + + di = (struct ocfs2_dinode *)di_bh->b_data; + id_count = le16_to_cpu(di->id2.i_data.id_count); + + if (map_start < id_count) { + phys = oi->ip_blkno << inode->i_sb->s_blocksize_bits; + phys += offsetof(struct ocfs2_dinode, id2.i_data.id_data); + + ret = fiemap_fill_next_extent(fieinfo, 0, phys, id_count, + flags); + if (ret < 0) + return ret; + } + + return 0; +} + +#define OCFS2_FIEMAP_FLAGS (FIEMAP_FLAG_SYNC) + +int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 map_start, u64 map_len) +{ + int ret, is_last; + u32 mapping_end, cpos; + unsigned int hole_size; + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); + u64 len_bytes, phys_bytes, virt_bytes; + struct buffer_head *di_bh = NULL; + struct ocfs2_extent_rec rec; + + ret = fiemap_check_flags(fieinfo, OCFS2_FIEMAP_FLAGS); + if (ret) + return ret; + + ret = ocfs2_inode_lock(inode, &di_bh, 0); + if (ret) { + mlog_errno(ret); + goto out; + } + + down_read(&OCFS2_I(inode)->ip_alloc_sem); + + /* + * Handle inline-data separately. + */ + if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) { + ret = ocfs2_fiemap_inline(inode, di_bh, fieinfo, map_start); + goto out_unlock; + } + + cpos = map_start >> osb->s_clustersize_bits; + mapping_end = ocfs2_clusters_for_bytes(inode->i_sb, + map_start + map_len); + mapping_end -= cpos; + is_last = 0; + while (cpos < mapping_end && !is_last) { + u32 fe_flags; + + ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos, + &hole_size, &rec, &is_last); + if (ret) { + mlog_errno(ret); + goto out; + } + + if (rec.e_blkno == 0ULL) { + cpos += hole_size; + continue; + } + + fe_flags = 0; + if (rec.e_flags & OCFS2_EXT_UNWRITTEN) + fe_flags |= FIEMAP_EXTENT_UNWRITTEN; + if (is_last) + fe_flags |= FIEMAP_EXTENT_LAST; + len_bytes = (u64)le16_to_cpu(rec.e_leaf_clusters) << osb->s_clustersize_bits; + phys_bytes = le64_to_cpu(rec.e_blkno) << osb->sb->s_blocksize_bits; + virt_bytes = (u64)le32_to_cpu(rec.e_cpos) << osb->s_clustersize_bits; + + ret = fiemap_fill_next_extent(fieinfo, virt_bytes, phys_bytes, + len_bytes, fe_flags); + if (ret) + break; + + cpos = le32_to_cpu(rec.e_cpos)+ le16_to_cpu(rec.e_leaf_clusters); + } + + if (ret > 0) + ret = 0; + +out_unlock: + brelse(di_bh); + + up_read(&OCFS2_I(inode)->ip_alloc_sem); + + ocfs2_inode_unlock(inode, 0); +out: + + return ret; +} diff --git a/fs/ocfs2/extent_map.h b/fs/ocfs2/extent_map.h index de91e3e..1b97490 100644 --- a/fs/ocfs2/extent_map.h +++ b/fs/ocfs2/extent_map.h @@ -50,4 +50,7 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, u32 *p_cluster, int ocfs2_extent_map_get_blocks(struct inode *inode, u64 v_blkno, u64 *p_blkno, u64 *ret_count, unsigned int *extent_flags); +int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 map_start, u64 map_len); + #endif /* _EXTENT_MAP_H */ diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index ec2ed15..ed38796 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2228,6 +2228,7 @@ const struct inode_operations ocfs2_file_iops = { .getattr = ocfs2_getattr, .permission = ocfs2_permission, .fallocate = ocfs2_fallocate, + .fiemap = ocfs2_fiemap, }; const struct inode_operations ocfs2_special_file_iops = { -- 1.5.6.1.205.ge2c7.dirty ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 3/4] generic block based fiemap implementation 2008-10-03 21:59 ` [PATCH 2/4] ocfs2: fiemap support Theodore Ts'o @ 2008-10-03 21:59 ` Theodore Ts'o 2008-10-03 21:59 ` [PATCH 4/4] Hook ext4 to the vfs fiemap interface Theodore Ts'o 0 siblings, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2008-10-03 21:59 UTC (permalink / raw) To: linux-fsdevel; +Cc: Josef Bacik, Mark Fasheh, Theodore Ts'o From: Josef Bacik <jbacik@redhat.com> Any block based fs (this patch includes ext3) just has to declare its own fiemap() function and then call this generic function with its own get_block_t. This works well for block based filesystems that will map multiple contiguous blocks at one time, but will work for filesystems that only map one block at a time, you will just end up with an "extent" for each block. One gotcha is this will not play nicely where there is hole+data after the EOF. This function will assume its hit the end of the data as soon as it hits a hole after the EOF, so if there is any data past that it will not pick that up. AFAIK no block based fs does this anyway, but its in the comments of the function anyway just in case. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org --- fs/ext2/ext2.h | 2 + fs/ext2/file.c | 1 + fs/ext2/inode.c | 8 +++ fs/ext3/file.c | 1 + fs/ext3/inode.c | 8 +++ fs/ioctl.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++ include/linux/ext3_fs.h | 2 + include/linux/fs.h | 3 + 8 files changed, 143 insertions(+), 0 deletions(-) diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index 47d88da..bae998c 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -133,6 +133,8 @@ extern void ext2_truncate (struct inode *); extern int ext2_setattr (struct dentry *, struct iattr *); extern void ext2_set_inode_flags(struct inode *inode); extern void ext2_get_inode_flags(struct ext2_inode_info *); +extern int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len); int __ext2_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata); diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 5f2fa9c..45ed071 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -86,4 +86,5 @@ const struct inode_operations ext2_file_inode_operations = { #endif .setattr = ext2_setattr, .permission = ext2_permission, + .fiemap = ext2_fiemap, }; diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 991d6df..7658b33 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -31,6 +31,7 @@ #include <linux/writeback.h> #include <linux/buffer_head.h> #include <linux/mpage.h> +#include <linux/fiemap.h> #include "ext2.h" #include "acl.h" #include "xip.h" @@ -704,6 +705,13 @@ int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_ } +int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len) +{ + return generic_block_fiemap(inode, fieinfo, start, len, + ext2_get_block); +} + static int ext2_writepage(struct page *page, struct writeback_control *wbc) { return block_write_full_page(page, ext2_get_block, wbc); diff --git a/fs/ext3/file.c b/fs/ext3/file.c index acc4913..3be1e06 100644 --- a/fs/ext3/file.c +++ b/fs/ext3/file.c @@ -134,5 +134,6 @@ const struct inode_operations ext3_file_inode_operations = { .removexattr = generic_removexattr, #endif .permission = ext3_permission, + .fiemap = ext3_fiemap, }; diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index 507d868..ebfec4d 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -36,6 +36,7 @@ #include <linux/mpage.h> #include <linux/uio.h> #include <linux/bio.h> +#include <linux/fiemap.h> #include "xattr.h" #include "acl.h" @@ -981,6 +982,13 @@ out: return ret; } +int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len) +{ + return generic_block_fiemap(inode, fieinfo, start, len, + ext3_get_block); +} + /* * `handle' can be NULL if create is zero */ diff --git a/fs/ioctl.c b/fs/ioctl.c index 837444a..9019510 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -13,6 +13,8 @@ #include <linux/security.h> #include <linux/module.h> #include <linux/uaccess.h> +#include <linux/writeback.h> +#include <linux/buffer_head.h> #include <asm/ioctls.h> @@ -224,6 +226,122 @@ static int ioctl_fiemap(struct file *filp, unsigned long arg) return error; } +#define blk_to_logical(inode, blk) (blk << (inode)->i_blkbits) +#define logical_to_blk(inode, offset) (offset >> (inode)->i_blkbits); + +/* + * @inode - the inode to map + * @arg - the pointer to userspace where we copy everything to + * @get_block - the fs's get_block function + * + * This does FIEMAP for block based inodes. Basically it will just loop + * through get_block until we hit the number of extents we want to map, or we + * go past the end of the file and hit a hole. + * + * If it is possible to have data blocks beyond a hole past @inode->i_size, then + * please do not use this function, it will stop at the first unmapped block + * beyond i_size + */ +int generic_block_fiemap(struct inode *inode, + struct fiemap_extent_info *fieinfo, u64 start, + u64 len, get_block_t *get_block) +{ + struct buffer_head tmp; + unsigned int start_blk; + long long length = 0, map_len = 0; + u64 logical = 0, phys = 0, size = 0; + u32 flags = FIEMAP_EXTENT_MERGED; + int ret = 0; + + if ((ret = fiemap_check_flags(fieinfo, FIEMAP_FLAG_SYNC))) + return ret; + + start_blk = logical_to_blk(inode, start); + + /* guard against change */ + mutex_lock(&inode->i_mutex); + + length = (long long)min_t(u64, len, i_size_read(inode)); + map_len = length; + + do { + /* + * we set b_size to the total size we want so it will map as + * many contiguous blocks as possible at once + */ + memset(&tmp, 0, sizeof(struct buffer_head)); + tmp.b_size = map_len; + + ret = get_block(inode, start_blk, &tmp, 0); + if (ret) + break; + + /* HOLE */ + if (!buffer_mapped(&tmp)) { + /* + * first hole after going past the EOF, this is our + * last extent + */ + if (length <= 0) { + flags = FIEMAP_EXTENT_MERGED|FIEMAP_EXTENT_LAST; + ret = fiemap_fill_next_extent(fieinfo, logical, + phys, size, + flags); + break; + } + + length -= blk_to_logical(inode, 1); + + /* if we have holes up to/past EOF then we're done */ + if (length <= 0) + break; + + start_blk++; + } else { + if (length <= 0 && size) { + ret = fiemap_fill_next_extent(fieinfo, logical, + phys, size, + flags); + if (ret) + break; + } + + logical = blk_to_logical(inode, start_blk); + phys = blk_to_logical(inode, tmp.b_blocknr); + size = tmp.b_size; + flags = FIEMAP_EXTENT_MERGED; + + length -= tmp.b_size; + start_blk += logical_to_blk(inode, size); + + /* + * if we are past the EOF we need to loop again to see + * if there is a hole so we can mark this extent as the + * last one, and if not keep mapping things until we + * find a hole, or we run out of slots in the extent + * array + */ + if (length <= 0) + continue; + + ret = fiemap_fill_next_extent(fieinfo, logical, phys, + size, flags); + if (ret) + break; + } + cond_resched(); + } while (1); + + mutex_unlock(&inode->i_mutex); + + /* if ret is 1 then we just hit the end of the extent array */ + if (ret == 1) + ret = 0; + + return ret; +} +EXPORT_SYMBOL(generic_block_fiemap); + static int file_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h index 80171ee..8120fa1 100644 --- a/include/linux/ext3_fs.h +++ b/include/linux/ext3_fs.h @@ -837,6 +837,8 @@ extern void ext3_truncate (struct inode *); extern void ext3_set_inode_flags(struct inode *); extern void ext3_get_inode_flags(struct ext3_inode_info *); extern void ext3_set_aops(struct inode *inode); +extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len); /* ioctl.c */ extern int ext3_ioctl (struct inode *, struct file *, unsigned int, diff --git a/include/linux/fs.h b/include/linux/fs.h index 0724abf..6242491 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1999,6 +1999,9 @@ extern int vfs_fstat(unsigned int, struct kstat *); extern int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, unsigned long arg); +extern int generic_block_fiemap(struct inode *inode, + struct fiemap_extent_info *fieinfo, u64 start, + u64 len, get_block_t *get_block); extern void get_filesystem(struct file_system_type *fs); extern void put_filesystem(struct file_system_type *fs); -- 1.5.6.1.205.ge2c7.dirty ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 4/4] Hook ext4 to the vfs fiemap interface. 2008-10-03 21:59 ` [PATCH 3/4] generic block based fiemap implementation Theodore Ts'o @ 2008-10-03 21:59 ` Theodore Ts'o 0 siblings, 0 replies; 40+ messages in thread From: Theodore Ts'o @ 2008-10-03 21:59 UTC (permalink / raw) To: linux-fsdevel; +Cc: Eric Sandeen, Theodore Ts'o, linux-ext4 From: Eric Sandeen <sandeen@redhat.com> ext4_ext_walk_space() was reinstated to be used for iterating over file extents with a callback; it is used by the ext4 fiemap implementation. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org --- fs/ext4/ext4.h | 2 + fs/ext4/ext4_extents.h | 15 +++ fs/ext4/extents.c | 248 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/ext4/file.c | 4 + 4 files changed, 269 insertions(+), 0 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 6db6857..caaca1b 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1083,6 +1083,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int, int *); struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int, int *); +int ext4_get_block(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create); int ext4_get_blocks_handle(handle_t *handle, struct inode *inode, ext4_lblk_t iblock, unsigned long maxblocks, struct buffer_head *bh_result, diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h index 2fc0469..6407222 100644 --- a/fs/ext4/ext4_extents.h +++ b/fs/ext4/ext4_extents.h @@ -124,6 +124,19 @@ struct ext4_ext_path { #define EXT4_EXT_CACHE_GAP 1 #define EXT4_EXT_CACHE_EXTENT 2 +/* + * to be called by ext4_ext_walk_space() + * negative retcode - error + * positive retcode - signal for ext4_ext_walk_space(), see below + * callback must return valid extent (passed or newly created) + */ +typedef int (*ext_prepare_callback)(struct inode *, struct ext4_ext_path *, + struct ext4_ext_cache *, + struct ext4_extent *, void *); + +#define EXT_CONTINUE 0 +#define EXT_BREAK 1 +#define EXT_REPEAT 2 #define EXT_MAX_BLOCK 0xffffffff @@ -224,6 +237,8 @@ extern int ext4_ext_try_to_merge(struct inode *inode, struct ext4_extent *); extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *); +extern int ext4_ext_walk_space(struct inode *, ext4_lblk_t, ext4_lblk_t, + ext_prepare_callback, void *); extern struct ext4_ext_path *ext4_ext_find_extent(struct inode *, ext4_lblk_t, struct ext4_ext_path *); extern int ext4_ext_search_left(struct inode *, struct ext4_ext_path *, diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index b4f0c7d..dde234f 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -40,6 +40,7 @@ #include <linux/slab.h> #include <linux/falloc.h> #include <asm/uaccess.h> +#include <linux/fiemap.h> #include "ext4_jbd2.h" #include "ext4_extents.h" @@ -1626,6 +1627,113 @@ cleanup: return err; } +int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block, + ext4_lblk_t num, ext_prepare_callback func, + void *cbdata) +{ + struct ext4_ext_path *path = NULL; + struct ext4_ext_cache cbex; + struct ext4_extent *ex; + ext4_lblk_t next, start = 0, end = 0; + ext4_lblk_t last = block + num; + int depth, exists, err = 0; + + BUG_ON(func == NULL); + BUG_ON(inode == NULL); + + while (block < last && block != EXT_MAX_BLOCK) { + num = last - block; + /* find extent for this block */ + path = ext4_ext_find_extent(inode, block, path); + if (IS_ERR(path)) { + err = PTR_ERR(path); + path = NULL; + break; + } + + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + ex = path[depth].p_ext; + next = ext4_ext_next_allocated_block(path); + + exists = 0; + if (!ex) { + /* there is no extent yet, so try to allocate + * all requested space */ + start = block; + end = block + num; + } else if (le32_to_cpu(ex->ee_block) > block) { + /* need to allocate space before found extent */ + start = block; + end = le32_to_cpu(ex->ee_block); + if (block + num < end) + end = block + num; + } else if (block >= le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex)) { + /* need to allocate space after found extent */ + start = block; + end = block + num; + if (end >= next) + end = next; + } else if (block >= le32_to_cpu(ex->ee_block)) { + /* + * some part of requested space is covered + * by found extent + */ + start = block; + end = le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex); + if (block + num < end) + end = block + num; + exists = 1; + } else { + BUG(); + } + BUG_ON(end <= start); + + if (!exists) { + cbex.ec_block = start; + cbex.ec_len = end - start; + cbex.ec_start = 0; + cbex.ec_type = EXT4_EXT_CACHE_GAP; + } else { + cbex.ec_block = le32_to_cpu(ex->ee_block); + cbex.ec_len = ext4_ext_get_actual_len(ex); + cbex.ec_start = ext_pblock(ex); + cbex.ec_type = EXT4_EXT_CACHE_EXTENT; + } + + BUG_ON(cbex.ec_len == 0); + err = func(inode, path, &cbex, ex, cbdata); + ext4_ext_drop_refs(path); + + if (err < 0) + break; + + if (err == EXT_REPEAT) + continue; + else if (err == EXT_BREAK) { + err = 0; + break; + } + + if (ext_depth(inode) != depth) { + /* depth was changed. we have to realloc path */ + kfree(path); + path = NULL; + } + + block = cbex.ec_block + cbex.ec_len; + } + + if (path) { + ext4_ext_drop_refs(path); + kfree(path); + } + + return err; +} + static void ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block, __u32 len, ext4_fsblk_t start, int type) @@ -2971,3 +3079,143 @@ retry: mutex_unlock(&inode->i_mutex); return ret > 0 ? ret2 : ret; } + +/* + * Callback function called for each extent to gather FIEMAP information. + */ +int ext4_ext_fiemap_cb(struct inode *inode, struct ext4_ext_path *path, + struct ext4_ext_cache *newex, struct ext4_extent *ex, + void *data) +{ + struct fiemap_extent_info *fieinfo = data; + unsigned long blksize_bits = inode->i_sb->s_blocksize_bits; + __u64 logical; + __u64 physical; + __u64 length; + __u32 flags = 0; + int error; + + logical = (__u64)newex->ec_block << blksize_bits; + + if (newex->ec_type == EXT4_EXT_CACHE_GAP) { + pgoff_t offset; + struct page *page; + struct buffer_head *bh = NULL; + + offset = logical >> PAGE_SHIFT; + page = find_get_page(inode->i_mapping, offset); + if (!page || !page_has_buffers(page)) + return EXT_CONTINUE; + + bh = page_buffers(page); + + if (!bh) + return EXT_CONTINUE; + + if (buffer_delay(bh)) { + flags |= FIEMAP_EXTENT_DELALLOC; + page_cache_release(page); + } else { + page_cache_release(page); + return EXT_CONTINUE; + } + } + + physical = (__u64)newex->ec_start << blksize_bits; + length = (__u64)newex->ec_len << blksize_bits; + + if (ex && ext4_ext_is_uninitialized(ex)) + flags |= FIEMAP_EXTENT_UNWRITTEN; + + /* + * If this extent reaches EXT_MAX_BLOCK, it must be last. + * + * Or if ext4_ext_next_allocated_block is EXT_MAX_BLOCK, + * this also indicates no more allocated blocks. + * + * XXX this might miss a single-block extent at EXT_MAX_BLOCK + */ + if (logical + length - 1 == EXT_MAX_BLOCK || + ext4_ext_next_allocated_block(path) == EXT_MAX_BLOCK) + flags |= FIEMAP_EXTENT_LAST; + + error = fiemap_fill_next_extent(fieinfo, logical, physical, + length, flags); + if (error < 0) + return error; + if (error == 1) + return EXT_BREAK; + + return EXT_CONTINUE; +} + +/* fiemap flags we can handle specified here */ +#define EXT4_FIEMAP_FLAGS (FIEMAP_FLAG_SYNC|FIEMAP_FLAG_XATTR) + +int ext4_xattr_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo) +{ + __u64 physical = 0; + __u64 length; + __u32 flags = FIEMAP_EXTENT_LAST; + int blockbits = inode->i_sb->s_blocksize_bits; + int error = 0; + + /* in-inode? */ + if (EXT4_I(inode)->i_state & EXT4_STATE_XATTR) { + struct ext4_iloc iloc; + int offset; /* offset of xattr in inode */ + + error = ext4_get_inode_loc(inode, &iloc); + if (error) + return error; + physical = iloc.bh->b_blocknr << blockbits; + offset = EXT4_GOOD_OLD_INODE_SIZE + + EXT4_I(inode)->i_extra_isize; + physical += offset; + length = EXT4_SB(inode->i_sb)->s_inode_size - offset; + flags |= FIEMAP_EXTENT_DATA_INLINE; + } else { /* external block */ + physical = EXT4_I(inode)->i_file_acl << blockbits; + length = inode->i_sb->s_blocksize; + } + + if (physical) + error = fiemap_fill_next_extent(fieinfo, 0, physical, + length, flags); + return (error < 0 ? error : 0); +} + +int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + __u64 start, __u64 len) +{ + ext4_lblk_t start_blk; + ext4_lblk_t len_blks; + int error = 0; + + /* fallback to generic here if not in extents fmt */ + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) + return generic_block_fiemap(inode, fieinfo, start, len, + ext4_get_block); + + if (fiemap_check_flags(fieinfo, EXT4_FIEMAP_FLAGS)) + return -EBADR; + + if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR) { + error = ext4_xattr_fiemap(inode, fieinfo); + } else { + start_blk = start >> inode->i_sb->s_blocksize_bits; + len_blks = len >> inode->i_sb->s_blocksize_bits; + + /* + * Walk the extent tree gathering extent information. + * ext4_ext_fiemap_cb will push extents back to user. + */ + down_write(&EXT4_I(inode)->i_data_sem); + error = ext4_ext_walk_space(inode, start_blk, len_blks, + ext4_ext_fiemap_cb, fieinfo); + up_write(&EXT4_I(inode)->i_data_sem); + } + + return error; +} + diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 62796b7..6d5be15 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -140,6 +140,9 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) return 0; } +extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + __u64 start, __u64 len); + const struct file_operations ext4_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, @@ -170,5 +173,6 @@ const struct inode_operations ext4_file_inode_operations = { #endif .permission = ext4_permission, .fallocate = ext4_fallocate, + .fiemap = ext4_fiemap, }; -- 1.5.6.1.205.ge2c7.dirty ^ permalink raw reply related [flat|nested] 40+ messages in thread
* PATCH [0/4] Updated**2 fiemap patches @ 2008-10-07 1:06 Theodore Ts'o 2008-10-07 1:06 ` [PATCH 1/4] vfs: vfs-level fiemap interface Theodore Ts'o 0 siblings, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2008-10-07 1:06 UTC (permalink / raw) To: linux-fsdevel This the second update of the fiemap patches, which integrates the changes and typographical fixups as suggested by Jim Owens. In particular, FIEMAP_EXTENT_NO_MOUNTED_IO was renamed to what I had intended, but mistyped, FIEMAP_EXTENT_NO_UNMOUNTED_IO, and the explanation of that flag has been clarified. Hopefully folks are happy with this. I plan to push this to Linus when the next merge window opens up. - Ted ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH 1/4] vfs: vfs-level fiemap interface 2008-10-07 1:06 PATCH [0/4] Updated**2 fiemap patches Theodore Ts'o @ 2008-10-07 1:06 ` Theodore Ts'o 2008-10-07 1:06 ` [PATCH 2/4] ocfs2: fiemap support Theodore Ts'o 0 siblings, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2008-10-07 1:06 UTC (permalink / raw) To: linux-fsdevel; +Cc: Mark Fasheh, Theodore Ts'o, Michael Kerrisk, linux-abi From: Mark Fasheh <mfasheh@suse.com> Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap inode operation. Userspace can get extent information on a file via fiemap ioctl. As input, the fiemap ioctl takes a struct fiemap which includes an array of struct fiemap_extent (fm_extents). Size of the extent array is passed as fm_extent_count and number of extents returned will be written into fm_mapped_extents. Offset and length fields on the fiemap structure (fm_start, fm_length) describe a logical range which will be searched for extents. All extents returned will at least partially contain this range. The actual extent offsets and ranges returned will be unmodified from their offset and range on-disk. The fiemap ioctl returns '0' on success. On error, -1 is returned and errno is set. If errno is equal to EBADR, then fm_flags will contain those flags which were passed in which the kernel did not understand. On all other errors, the contents of fm_extents is undefined. As fiemap evolved, there have been many authors of the vfs patch. As far as I can tell, the list includes: Kalpak Shah <kalpak.shah@sun.com> Andreas Dilger <adilger@sun.com> Eric Sandeen <sandeen@redhat.com> Mark Fasheh <mfasheh@suse.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: linux-abi@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org --- Documentation/filesystems/fiemap.txt | 221 ++++++++++++++++++++++++++++++++++ fs/ioctl.c | 155 ++++++++++++++++++++++++ include/linux/fiemap.h | 64 ++++++++++ include/linux/fs.h | 18 +++ 4 files changed, 458 insertions(+), 0 deletions(-) create mode 100644 Documentation/filesystems/fiemap.txt create mode 100644 include/linux/fiemap.h diff --git a/Documentation/filesystems/fiemap.txt b/Documentation/filesystems/fiemap.txt new file mode 100644 index 0000000..d56b30e --- /dev/null +++ b/Documentation/filesystems/fiemap.txt @@ -0,0 +1,221 @@ +============ +Fiemap Ioctl +============ + +The fiemap ioctl is an efficient method for userspace to get file +extent mappings. Instead of block-by-block mapping (such as bmap), fiemap +returns a list of extents. + + +Request Basics +-------------- + +A fiemap request is encoded within struct fiemap: + +struct fiemap { + __u64 fm_start; /* logical offset (inclusive) at + * which to start mapping (in) */ + __u64 fm_length; /* logical length of mapping which + * userspace cares about (in) */ + __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ + __u32 fm_mapped_extents; /* number of extents that were + * mapped (out) */ + __u32 fm_extent_count; /* size of fm_extents array (in) */ + __u32 fm_reserved; + struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */ +}; + + +fm_start, and fm_length specify the logical range within the file +which the process would like mappings for. Extents returned mirror +those on disk - that is, the logical offset of the 1st returned extent +may start before fm_start, and the range covered by the last returned +extent may end after fm_length. All offsets and lengths are in bytes. + +Certain flags to modify the way in which mappings are looked up can be +set in fm_flags. If the kernel doesn't understand some particular +flags, it will return EBADR and the contents of fm_flags will contain +the set of flags which caused the error. If the kernel is compatible +with all flags passed, the contents of fm_flags will be unmodified. +It is up to userspace to determine whether rejection of a particular +flag is fatal to it's operation. This scheme is intended to allow the +fiemap interface to grow in the future but without losing +compatibility with old software. + +fm_extent_count specifies the number of elements in the fm_extents[] array +that can be used to return extents. If fm_extent_count is zero, then the +fm_extents[] array is ignored (no extents will be returned), and the +fm_mapped_extents count will hold the number of extents needed in +fm_extents[] to hold the file's current mapping. Note that there is +nothing to prevent the file from changing between calls to FIEMAP. + +The following flags can be set in fm_flags: + +* FIEMAP_FLAG_SYNC +If this flag is set, the kernel will sync the file before mapping extents. + +* FIEMAP_FLAG_XATTR +If this flag is set, the extents returned will describe the inodes +extended attribute lookup tree, instead of it's data tree. + + +Extent Mapping +-------------- + +Extent information is returned within the embedded fm_extents array +which userspace must allocate along with the fiemap structure. The +number of elements in the fiemap_extents[] array should be passed via +fm_extent_count. The number of extents mapped by kernel will be +returned via fm_mapped_extents. If the number of fiemap_extents +allocated is less than would be required to map the requested range, +the maximum number of extents that can be mapped in the fm_extent[] +array will be returned and fm_mapped_extents will be equal to +fm_extent_count. In that case, the last extent in the array will not +complete the requested range and will not have the FIEMAP_EXTENT_LAST +flag set (see the next section on extent flags). + +Each extent is described by a single fiemap_extent structure as +returned in fm_extents. + +struct fiemap_extent { + __u64 fe_logical; /* logical offset in bytes for the start of + * the extent */ + __u64 fe_physical; /* physical offset in bytes for the start + * of the extent */ + __u64 fe_length; /* length in bytes for the extent */ + __u64 fe_reserved64[2]; + __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ + __u32 fe_reserved[3]; +}; + +All offsets and lengths are in bytes and mirror those on disk. It is valid +for an extents logical offset to start before the request or it's logical +length to extend past the request. Unless FIEMAP_EXTENT_NOT_ALIGNED is +returned, fe_logical, fe_physical, and fe_length will be aligned to the +block size of the file system. With the exception of extents flagged as +FIEMAP_EXTENT_MERGED, adjacent extents will not be merged. + +The fe_flags field contains flags which describe the extent returned. +A special flag, FIEMAP_EXTENT_LAST is always set on the last extent in +the file so that the process making fiemap calls can determine when no +more extents are available, without having to call the ioctl again. + +Some flags are intentionally vague and will always be set in the +presence of other more specific flags. This way a program looking for +a general property does not have to know all existing and future flags +which imply that property. + +For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL +are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking +for inline or tail-packed data can key on the specific flag. Software +which simply cares not to try operating on non-aligned extents +however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to +worry about all present and future flags which might imply unaligned +data. Note that the opposite is not true - it would be valid for +FIEMAP_EXTENT_NOT_ALIGNED to appear alone. + +* FIEMAP_EXTENT_LAST +This is the last extent in the file. A mapping attempt past this +extent will return nothing. + +* FIEMAP_EXTENT_UNKNOWN +The location of this extent is currently unknown. This may indicate +the data is stored on an inaccessible volume or that no storage has +been allocated for the file yet. + +* FIEMAP_EXTENT_DELALLOC + - This will also set FIEMAP_EXTENT_UNKNOWN. +Delayed allocation - while there is data for this extent, it's +physical location has not been allocated yet. + +* FIEMAP_EXTENT_NO_UNMOUNTED_IO +Accessing the data in this extent via I/O to the block device while the +filesystem is unmounted is illegal or will have undefined results. This +may because the data is somehow encrypted or compressed by the filesystem. +Note that accessing the data using the information returned by the +FIEMAP interface while the filesystem is mounted is *always* undefined +should not be attempted by user applications. + +* FIEMAP_EXTENT_DATA_ENCRYPTED + - This will also set FIEMAP_EXTENT_NO_UNMOUNTED_IO +The data in this extent has been encrypted by the file system. + +* FIEMAP_EXTENT_NOT_ALIGNED +Extent offsets and length are not guaranteed to be block aligned. + +* FIEMAP_EXTENT_DATA_INLINE + This will also set FIEMAP_EXTENT_NOT_ALIGNED +Data is located within a meta data block. + +* FIEMAP_EXTENT_DATA_TAIL + This will also set FIEMAP_EXTENT_NOT_ALIGNED +Data is packed into a block with data from other files. + +* FIEMAP_EXTENT_UNWRITTEN +Unwritten extent - the extent is allocated but it's data has not been +initialized. This indicates the extent's data will be all zero if read +through the filesystem but the contents are undefined if read directly from +the device. + +* FIEMAP_EXTENT_MERGED +This will be set when a file does not support extents, i.e., it uses a block +based addressing scheme. Since returning an extent for each block back to +userspace would be highly inefficient, the kernel will try to merge most +adjacent blocks into 'extents'. + + +VFS -> File System Implementation +--------------------------------- + +File systems wishing to support fiemap must implement a ->fiemap callback on +their inode_operations structure. The fs ->fiemap call is responsible for +defining it's set of supported fiemap flags, and calling a helper function on +each discovered extent: + +struct inode_operations { + ... + + int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, + u64 len); + +->fiemap is passed struct fiemap_extent_info which describes the +fiemap request: + +struct fiemap_extent_info { + unsigned int fi_flags; /* Flags as passed from user */ + unsigned int fi_extents_mapped; /* Number of mapped extents */ + unsigned int fi_extents_max; /* Size of fiemap_extent array */ + struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */ +}; + +It is intended that the file system should not need to access any of this +structure directly. + + +Flag checking should be done at the beginning of the ->fiemap callback via the +fiemap_check_flags() helper: + +int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); + +The struct fieinfo should be passed in as recieved from ioctl_fiemap(). The +set of fiemap flags which the fs understands should be passed via fs_flags. If +fiemap_check_flags finds invalid user flags, it will place the bad values in +fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR, from +fiemap_check_flags(), it should immediately exit, returning that error back to +ioctl_fiemap(). + + +For each extent in the request range, the file system should call +the helper function, fiemap_fill_next_extent(): + +int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, + u64 phys, u64 len, u32 flags, u32 dev); + +fiemap_fill_next_extent() will use the passed values to populate the +next free extent in the fm_extents array. 'General' extent flags will +automatically be set from specific flags on behalf of the calling file +system so that the userspace API is not broken. + +fiemap_fill_next_extent() returns 0 on success, and 1 when the +user-supplied fm_extents array is full. If an error is encountered +while copying the extent to user memory, -EFAULT will be returned. diff --git a/fs/ioctl.c b/fs/ioctl.c index 7db32b3..b759cc2 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -16,6 +16,9 @@ #include <asm/ioctls.h> +/* So that the fiemap access checks can't overflow on 32 bit machines. */ +#define FIEMAP_MAX_EXTENTS (UINT_MAX / sizeof(struct fiemap_extent)) + /** * vfs_ioctl - call filesystem specific ioctl methods * @filp: open file to invoke ioctl method on @@ -71,6 +74,156 @@ static int ioctl_fibmap(struct file *filp, int __user *p) return put_user(res, p); } +/** + * fiemap_fill_next_extent - Fiemap helper function + * @fieinfo: Fiemap context passed into ->fiemap + * @logical: Extent logical start offset, in bytes + * @phys: Extent physical start offset, in bytes + * @len: Extent length, in bytes + * @flags: FIEMAP_EXTENT flags that describe this extent + * + * Called from file system ->fiemap callback. Will populate extent + * info as passed in via arguments and copy to user memory. On + * success, extent count on fieinfo is incremented. + * + * Returns 0 on success, -errno on error, 1 if this was the last + * extent that will fit in user array. + */ +#define SET_UNKNOWN_FLAGS (FIEMAP_EXTENT_DELALLOC) +#define SET_NO_UNMOUNTED_IO_FLAGS (FIEMAP_EXTENT_DATA_ENCRYPTED) +#define SET_NOT_ALIGNED_FLAGS (FIEMAP_EXTENT_DATA_TAIL|FIEMAP_EXTENT_DATA_INLINE) +int fiemap_fill_next_extent(struct fiemap_extent_info *fieinfo, u64 logical, + u64 phys, u64 len, u32 flags) +{ + struct fiemap_extent extent; + struct fiemap_extent *dest = fieinfo->fi_extents_start; + + /* only count the extents */ + if (fieinfo->fi_extents_max == 0) { + fieinfo->fi_extents_mapped++; + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; + } + + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) + return 1; + + if (flags & SET_UNKNOWN_FLAGS) + flags |= FIEMAP_EXTENT_UNKNOWN; + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) + flags |= FIEMAP_EXTENT_NO_UNMOUNTED_IO; + if (flags & SET_NOT_ALIGNED_FLAGS) + flags |= FIEMAP_EXTENT_NOT_ALIGNED; + + memset(&extent, 0, sizeof(extent)); + extent.fe_logical = logical; + extent.fe_physical = phys; + extent.fe_length = len; + extent.fe_flags = flags; + + dest += fieinfo->fi_extents_mapped; + if (copy_to_user(dest, &extent, sizeof(extent))) + return -EFAULT; + + fieinfo->fi_extents_mapped++; + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) + return 1; + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; +} +EXPORT_SYMBOL(fiemap_fill_next_extent); + +/** + * fiemap_check_flags - check validity of requested flags for fiemap + * @fieinfo: Fiemap context passed into ->fiemap + * @fs_flags: Set of fiemap flags that the file system understands + * + * Called from file system ->fiemap callback. This will compute the + * intersection of valid fiemap flags and those that the fs supports. That + * value is then compared against the user supplied flags. In case of bad user + * flags, the invalid values will be written into the fieinfo structure, and + * -EBADR is returned, which tells ioctl_fiemap() to return those values to + * userspace. For this reason, a return code of -EBADR should be preserved. + * + * Returns 0 on success, -EBADR on bad flags. + */ +int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags) +{ + u32 incompat_flags; + + incompat_flags = fieinfo->fi_flags & ~(FIEMAP_FLAGS_COMPAT & fs_flags); + if (incompat_flags) { + fieinfo->fi_flags = incompat_flags; + return -EBADR; + } + return 0; +} +EXPORT_SYMBOL(fiemap_check_flags); + +static int fiemap_check_ranges(struct super_block *sb, + u64 start, u64 len, u64 *new_len) +{ + *new_len = len; + + if (len == 0) + return -EINVAL; + + if (start > sb->s_maxbytes) + return -EFBIG; + + /* + * Shrink request scope to what the fs can actually handle. + */ + if ((len > sb->s_maxbytes) || + (sb->s_maxbytes - len) < start) + *new_len = sb->s_maxbytes - start; + + return 0; +} + +static int ioctl_fiemap(struct file *filp, unsigned long arg) +{ + struct fiemap fiemap; + struct fiemap_extent_info fieinfo = { 0, }; + struct inode *inode = filp->f_path.dentry->d_inode; + struct super_block *sb = inode->i_sb; + u64 len; + int error; + + if (!inode->i_op->fiemap) + return -EOPNOTSUPP; + + if (copy_from_user(&fiemap, (struct fiemap __user *)arg, + sizeof(struct fiemap))) + return -EFAULT; + + if (fiemap.fm_extent_count > FIEMAP_MAX_EXTENTS) + return -EINVAL; + + error = fiemap_check_ranges(sb, fiemap.fm_start, fiemap.fm_length, + &len); + if (error) + return error; + + fieinfo.fi_flags = fiemap.fm_flags; + fieinfo.fi_extents_max = fiemap.fm_extent_count; + fieinfo.fi_extents_start = (struct fiemap_extent *)(arg + sizeof(fiemap)); + + if (fiemap.fm_extent_count != 0 && + !access_ok(VERIFY_WRITE, fieinfo.fi_extents_start, + fieinfo.fi_extents_max * sizeof(struct fiemap_extent))) + return -EFAULT; + + if (fieinfo.fi_flags & FIEMAP_FLAG_SYNC) + filemap_write_and_wait(inode->i_mapping); + + error = inode->i_op->fiemap(inode, &fieinfo, fiemap.fm_start, len); + fiemap.fm_flags = fieinfo.fi_flags; + fiemap.fm_mapped_extents = fieinfo.fi_extents_mapped; + if (copy_to_user((char *)arg, &fiemap, sizeof(fiemap))) + error = -EFAULT; + + return error; +} + static int file_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { @@ -80,6 +233,8 @@ static int file_ioctl(struct file *filp, unsigned int cmd, switch (cmd) { case FIBMAP: return ioctl_fibmap(filp, p); + case FS_IOC_FIEMAP: + return ioctl_fiemap(filp, arg); case FIGETBSZ: return put_user(inode->i_sb->s_blocksize, p); case FIONREAD: diff --git a/include/linux/fiemap.h b/include/linux/fiemap.h new file mode 100644 index 0000000..cbc51e6 --- /dev/null +++ b/include/linux/fiemap.h @@ -0,0 +1,64 @@ +/* + * FS_IOC_FIEMAP ioctl infrastructure. + * + * Some portions copyright (C) 2007 Cluster File Systems, Inc + * + * Authors: Mark Fasheh <mfasheh@suse.com> + * Kalpak Shah <kalpak.shah@sun.com> + * Andreas Dilger <adilger@sun.com> + */ + +#ifndef _LINUX_FIEMAP_H +#define _LINUX_FIEMAP_H + +struct fiemap_extent { + __u64 fe_logical; /* logical offset in bytes for the start of + * the extent from the beginning of the file */ + __u64 fe_physical; /* physical offset in bytes for the start + * of the extent from the beginning of the disk */ + __u64 fe_length; /* length in bytes for this extent */ + __u64 fe_reserved64[2]; + __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ + __u32 fe_reserved[3]; +}; + +struct fiemap { + __u64 fm_start; /* logical offset (inclusive) at + * which to start mapping (in) */ + __u64 fm_length; /* logical length of mapping which + * userspace wants (in) */ + __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ + __u32 fm_mapped_extents;/* number of extents that were mapped (out) */ + __u32 fm_extent_count; /* size of fm_extents array (in) */ + __u32 fm_reserved; + struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */ +}; + +#define FIEMAP_MAX_OFFSET (~0ULL) + +#define FIEMAP_FLAG_SYNC 0x00000001 /* sync file data before map */ +#define FIEMAP_FLAG_XATTR 0x00000002 /* map extended attribute tree */ + +#define FIEMAP_FLAGS_COMPAT (FIEMAP_FLAG_SYNC | FIEMAP_FLAG_XATTR) + +#define FIEMAP_EXTENT_LAST 0x00000001 /* Last extent in file. */ +#define FIEMAP_EXTENT_UNKNOWN 0x00000002 /* Data location unknown. */ +#define FIEMAP_EXTENT_DELALLOC 0x00000004 /* Location still pending. + * Sets EXTENT_UNKNOWN. */ +#define FIEMAP_EXTENT_NO_UNMOUNTED_IO 0x00000008 /* Data can not be read + * while fs is unmounted */ +#define FIEMAP_EXTENT_DATA_ENCRYPTED 0x00000080 /* Data is encrypted by fs. + * Sets EXTENT_NO_BYPASS. */ +#define FIEMAP_EXTENT_NOT_ALIGNED 0x00000100 /* Extent offsets may not be + * block aligned. */ +#define FIEMAP_EXTENT_DATA_INLINE 0x00000200 /* Data mixed with metadata. + * Sets EXTENT_NOT_ALIGNED.*/ +#define FIEMAP_EXTENT_DATA_TAIL 0x00000400 /* Multiple files in block. + * Sets EXTENT_NOT_ALIGNED.*/ +#define FIEMAP_EXTENT_UNWRITTEN 0x00000800 /* Space allocated, but + * no data (i.e. zero). */ +#define FIEMAP_EXTENT_MERGED 0x00001000 /* File does not natively + * support extents. Result + * merged for efficiency. */ + +#endif /* _LINUX_FIEMAP_H */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 580b513..194fb23 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -231,6 +231,7 @@ extern int dir_notify_enable; #define FS_IOC_SETFLAGS _IOW('f', 2, long) #define FS_IOC_GETVERSION _IOR('v', 1, long) #define FS_IOC_SETVERSION _IOW('v', 2, long) +#define FS_IOC_FIEMAP _IOWR('f', 11, struct fiemap) #define FS_IOC32_GETFLAGS _IOR('f', 1, int) #define FS_IOC32_SETFLAGS _IOW('f', 2, int) #define FS_IOC32_GETVERSION _IOR('v', 1, int) @@ -291,6 +292,7 @@ extern int dir_notify_enable; #include <linux/mutex.h> #include <linux/capability.h> #include <linux/semaphore.h> +#include <linux/fiemap.h> #include <asm/atomic.h> #include <asm/byteorder.h> @@ -1179,6 +1181,20 @@ extern void dentry_unhash(struct dentry *dentry); extern int file_permission(struct file *, int); /* + * VFS FS_IOC_FIEMAP helper definitions. + */ +struct fiemap_extent_info { + unsigned int fi_flags; /* Flags as passed from user */ + unsigned int fi_extents_mapped; /* Number of mapped extents */ + unsigned int fi_extents_max; /* Size of fiemap_extent array */ + struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent + * array */ +}; +int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, + u64 phys, u64 len, u32 flags); +int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); + +/* * File types * * NOTE! These match bits 12..15 of stat.st_mode @@ -1287,6 +1303,8 @@ struct inode_operations { void (*truncate_range)(struct inode *, loff_t, loff_t); long (*fallocate)(struct inode *inode, int mode, loff_t offset, loff_t len); + int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, + u64 len); }; struct seq_file; -- 1.5.6.1.205.ge2c7.dirty ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 2/4] ocfs2: fiemap support 2008-10-07 1:06 ` [PATCH 1/4] vfs: vfs-level fiemap interface Theodore Ts'o @ 2008-10-07 1:06 ` Theodore Ts'o 2008-10-07 1:06 ` [PATCH 3/4] generic block based fiemap implementation Theodore Ts'o 0 siblings, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2008-10-07 1:06 UTC (permalink / raw) To: linux-fsdevel; +Cc: Mark Fasheh, Theodore Ts'o, ocfs2-devel From: Mark Fasheh <mfasheh@suse.com> Plug ocfs2 into ->fiemap. Some portions of ocfs2_get_clusters() had to be refactored so that the extent cache can be skipped in favor of going directly to the on-disk records. This makes it easier for us to determine which extent is the last one in the btree. Also, I'm not sure we want to be caching fiemap lookups anyway as they're not directly related to data read/write. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: ocfs2-devel@oss.oracle.com Cc: linux-fsdevel@vger.kernel.org --- fs/ocfs2/alloc.c | 9 -- fs/ocfs2/alloc.h | 9 ++ fs/ocfs2/extent_map.c | 346 +++++++++++++++++++++++++++++++++++++++++-------- fs/ocfs2/extent_map.h | 3 + fs/ocfs2/file.c | 1 + 5 files changed, 306 insertions(+), 62 deletions(-) diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c index 10bfb46..29ff57e 100644 --- a/fs/ocfs2/alloc.c +++ b/fs/ocfs2/alloc.c @@ -990,15 +990,6 @@ out: } /* - * This is only valid for leaf nodes, which are the only ones that can - * have empty extents anyway. - */ -static inline int ocfs2_is_empty_extent(struct ocfs2_extent_rec *rec) -{ - return !rec->e_leaf_clusters; -} - -/* * This function will discard the rightmost extent record. */ static void ocfs2_shift_records_right(struct ocfs2_extent_list *el) diff --git a/fs/ocfs2/alloc.h b/fs/ocfs2/alloc.h index 42ff94b..60cd3d5 100644 --- a/fs/ocfs2/alloc.h +++ b/fs/ocfs2/alloc.h @@ -146,4 +146,13 @@ static inline unsigned int ocfs2_rec_clusters(struct ocfs2_extent_list *el, return le16_to_cpu(rec->e_leaf_clusters); } +/* + * This is only valid for leaf nodes, which are the only ones that can + * have empty extents anyway. + */ +static inline int ocfs2_is_empty_extent(struct ocfs2_extent_rec *rec) +{ + return !rec->e_leaf_clusters; +} + #endif /* OCFS2_ALLOC_H */ diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c index c58668a..aed268e 100644 --- a/fs/ocfs2/extent_map.c +++ b/fs/ocfs2/extent_map.c @@ -25,6 +25,7 @@ #include <linux/fs.h> #include <linux/init.h> #include <linux/types.h> +#include <linux/fiemap.h> #define MLOG_MASK_PREFIX ML_EXTENT_MAP #include <cluster/masklog.h> @@ -32,6 +33,7 @@ #include "ocfs2.h" #include "alloc.h" +#include "dlmglue.h" #include "extent_map.h" #include "inode.h" #include "super.h" @@ -282,6 +284,51 @@ out: kfree(new_emi); } +static int ocfs2_last_eb_is_empty(struct inode *inode, + struct ocfs2_dinode *di) +{ + int ret, next_free; + u64 last_eb_blk = le64_to_cpu(di->i_last_eb_blk); + struct buffer_head *eb_bh = NULL; + struct ocfs2_extent_block *eb; + struct ocfs2_extent_list *el; + + ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), last_eb_blk, + &eb_bh, OCFS2_BH_CACHED, inode); + if (ret) { + mlog_errno(ret); + goto out; + } + + eb = (struct ocfs2_extent_block *) eb_bh->b_data; + el = &eb->h_list; + + if (!OCFS2_IS_VALID_EXTENT_BLOCK(eb)) { + ret = -EROFS; + OCFS2_RO_ON_INVALID_EXTENT_BLOCK(inode->i_sb, eb); + goto out; + } + + if (el->l_tree_depth) { + ocfs2_error(inode->i_sb, + "Inode %lu has non zero tree depth in " + "leaf block %llu\n", inode->i_ino, + (unsigned long long)eb_bh->b_blocknr); + ret = -EROFS; + goto out; + } + + next_free = le16_to_cpu(el->l_next_free_rec); + + if (next_free == 0 || + (next_free == 1 && ocfs2_is_empty_extent(&el->l_recs[0]))) + ret = 1; + +out: + brelse(eb_bh); + return ret; +} + /* * Return the 1st index within el which contains an extent start * larger than v_cluster. @@ -373,42 +420,28 @@ out: return ret; } -int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, - u32 *p_cluster, u32 *num_clusters, - unsigned int *extent_flags) +static int ocfs2_get_clusters_nocache(struct inode *inode, + struct buffer_head *di_bh, + u32 v_cluster, unsigned int *hole_len, + struct ocfs2_extent_rec *ret_rec, + unsigned int *is_last) { - int ret, i; - unsigned int flags = 0; - struct buffer_head *di_bh = NULL; - struct buffer_head *eb_bh = NULL; + int i, ret, tree_height, len; struct ocfs2_dinode *di; - struct ocfs2_extent_block *eb; + struct ocfs2_extent_block *uninitialized_var(eb); struct ocfs2_extent_list *el; struct ocfs2_extent_rec *rec; - u32 coff; - - if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) { - ret = -ERANGE; - mlog_errno(ret); - goto out; - } - - ret = ocfs2_extent_map_lookup(inode, v_cluster, p_cluster, - num_clusters, extent_flags); - if (ret == 0) - goto out; + struct buffer_head *eb_bh = NULL; - ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), OCFS2_I(inode)->ip_blkno, - &di_bh, OCFS2_BH_CACHED, inode); - if (ret) { - mlog_errno(ret); - goto out; - } + memset(ret_rec, 0, sizeof(*ret_rec)); + if (is_last) + *is_last = 0; di = (struct ocfs2_dinode *) di_bh->b_data; el = &di->id2.i_list; + tree_height = le16_to_cpu(el->l_tree_depth); - if (el->l_tree_depth) { + if (tree_height > 0) { ret = ocfs2_find_leaf(inode, el, v_cluster, &eb_bh); if (ret) { mlog_errno(ret); @@ -431,46 +464,143 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, i = ocfs2_search_extent_list(el, v_cluster); if (i == -1) { /* - * A hole was found. Return some canned values that - * callers can key on. If asked for, num_clusters will - * be populated with the size of the hole. + * Holes can be larger than the maximum size of an + * extent, so we return their lengths in a seperate + * field. */ - *p_cluster = 0; - if (num_clusters) { + if (hole_len) { ret = ocfs2_figure_hole_clusters(inode, el, eb_bh, - v_cluster, - num_clusters); + v_cluster, &len); if (ret) { mlog_errno(ret); goto out; } + + *hole_len = len; } - } else { - rec = &el->l_recs[i]; + goto out_hole; + } - BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos)); + rec = &el->l_recs[i]; - if (!rec->e_blkno) { - ocfs2_error(inode->i_sb, "Inode %lu has bad extent " - "record (%u, %u, 0)", inode->i_ino, - le32_to_cpu(rec->e_cpos), - ocfs2_rec_clusters(el, rec)); - ret = -EROFS; - goto out; + BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos)); + + if (!rec->e_blkno) { + ocfs2_error(inode->i_sb, "Inode %lu has bad extent " + "record (%u, %u, 0)", inode->i_ino, + le32_to_cpu(rec->e_cpos), + ocfs2_rec_clusters(el, rec)); + ret = -EROFS; + goto out; + } + + *ret_rec = *rec; + + /* + * Checking for last extent is potentially expensive - we + * might have to look at the next leaf over to see if it's + * empty. + * + * The first two checks are to see whether the caller even + * cares for this information, and if the extent is at least + * the last in it's list. + * + * If those hold true, then the extent is last if any of the + * additional conditions hold true: + * - Extent list is in-inode + * - Extent list is right-most + * - Extent list is 2nd to rightmost, with empty right-most + */ + if (is_last) { + if (i == (le16_to_cpu(el->l_next_free_rec) - 1)) { + if (tree_height == 0) + *is_last = 1; + else if (eb->h_blkno == di->i_last_eb_blk) + *is_last = 1; + else if (eb->h_next_leaf_blk == di->i_last_eb_blk) { + ret = ocfs2_last_eb_is_empty(inode, di); + if (ret < 0) { + mlog_errno(ret); + goto out; + } + if (ret == 1) + *is_last = 1; + } } + } + +out_hole: + ret = 0; +out: + brelse(eb_bh); + return ret; +} + +static void ocfs2_relative_extent_offsets(struct super_block *sb, + u32 v_cluster, + struct ocfs2_extent_rec *rec, + u32 *p_cluster, u32 *num_clusters) + +{ + u32 coff = v_cluster - le32_to_cpu(rec->e_cpos); + + *p_cluster = ocfs2_blocks_to_clusters(sb, le64_to_cpu(rec->e_blkno)); + *p_cluster = *p_cluster + coff; + + if (num_clusters) + *num_clusters = le16_to_cpu(rec->e_leaf_clusters) - coff; +} + +int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, + u32 *p_cluster, u32 *num_clusters, + unsigned int *extent_flags) +{ + int ret; + unsigned int uninitialized_var(hole_len), flags = 0; + struct buffer_head *di_bh = NULL; + struct ocfs2_extent_rec rec; - coff = v_cluster - le32_to_cpu(rec->e_cpos); + if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) { + ret = -ERANGE; + mlog_errno(ret); + goto out; + } - *p_cluster = ocfs2_blocks_to_clusters(inode->i_sb, - le64_to_cpu(rec->e_blkno)); - *p_cluster = *p_cluster + coff; + ret = ocfs2_extent_map_lookup(inode, v_cluster, p_cluster, + num_clusters, extent_flags); + if (ret == 0) + goto out; - if (num_clusters) - *num_clusters = ocfs2_rec_clusters(el, rec) - coff; + ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), OCFS2_I(inode)->ip_blkno, + &di_bh, OCFS2_BH_CACHED, inode); + if (ret) { + mlog_errno(ret); + goto out; + } - flags = rec->e_flags; + ret = ocfs2_get_clusters_nocache(inode, di_bh, v_cluster, &hole_len, + &rec, NULL); + if (ret) { + mlog_errno(ret); + goto out; + } - ocfs2_extent_map_insert_rec(inode, rec); + if (rec.e_blkno == 0ULL) { + /* + * A hole was found. Return some canned values that + * callers can key on. If asked for, num_clusters will + * be populated with the size of the hole. + */ + *p_cluster = 0; + if (num_clusters) { + *num_clusters = hole_len; + } + } else { + ocfs2_relative_extent_offsets(inode->i_sb, v_cluster, &rec, + p_cluster, num_clusters); + flags = rec.e_flags; + + ocfs2_extent_map_insert_rec(inode, &rec); } if (extent_flags) @@ -478,7 +608,6 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, out: brelse(di_bh); - brelse(eb_bh); return ret; } @@ -521,3 +650,114 @@ int ocfs2_extent_map_get_blocks(struct inode *inode, u64 v_blkno, u64 *p_blkno, out: return ret; } + +static int ocfs2_fiemap_inline(struct inode *inode, struct buffer_head *di_bh, + struct fiemap_extent_info *fieinfo, + u64 map_start) +{ + int ret; + unsigned int id_count; + struct ocfs2_dinode *di; + u64 phys; + u32 flags = FIEMAP_EXTENT_DATA_INLINE|FIEMAP_EXTENT_LAST; + struct ocfs2_inode_info *oi = OCFS2_I(inode); + + di = (struct ocfs2_dinode *)di_bh->b_data; + id_count = le16_to_cpu(di->id2.i_data.id_count); + + if (map_start < id_count) { + phys = oi->ip_blkno << inode->i_sb->s_blocksize_bits; + phys += offsetof(struct ocfs2_dinode, id2.i_data.id_data); + + ret = fiemap_fill_next_extent(fieinfo, 0, phys, id_count, + flags); + if (ret < 0) + return ret; + } + + return 0; +} + +#define OCFS2_FIEMAP_FLAGS (FIEMAP_FLAG_SYNC) + +int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 map_start, u64 map_len) +{ + int ret, is_last; + u32 mapping_end, cpos; + unsigned int hole_size; + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); + u64 len_bytes, phys_bytes, virt_bytes; + struct buffer_head *di_bh = NULL; + struct ocfs2_extent_rec rec; + + ret = fiemap_check_flags(fieinfo, OCFS2_FIEMAP_FLAGS); + if (ret) + return ret; + + ret = ocfs2_inode_lock(inode, &di_bh, 0); + if (ret) { + mlog_errno(ret); + goto out; + } + + down_read(&OCFS2_I(inode)->ip_alloc_sem); + + /* + * Handle inline-data separately. + */ + if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) { + ret = ocfs2_fiemap_inline(inode, di_bh, fieinfo, map_start); + goto out_unlock; + } + + cpos = map_start >> osb->s_clustersize_bits; + mapping_end = ocfs2_clusters_for_bytes(inode->i_sb, + map_start + map_len); + mapping_end -= cpos; + is_last = 0; + while (cpos < mapping_end && !is_last) { + u32 fe_flags; + + ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos, + &hole_size, &rec, &is_last); + if (ret) { + mlog_errno(ret); + goto out; + } + + if (rec.e_blkno == 0ULL) { + cpos += hole_size; + continue; + } + + fe_flags = 0; + if (rec.e_flags & OCFS2_EXT_UNWRITTEN) + fe_flags |= FIEMAP_EXTENT_UNWRITTEN; + if (is_last) + fe_flags |= FIEMAP_EXTENT_LAST; + len_bytes = (u64)le16_to_cpu(rec.e_leaf_clusters) << osb->s_clustersize_bits; + phys_bytes = le64_to_cpu(rec.e_blkno) << osb->sb->s_blocksize_bits; + virt_bytes = (u64)le32_to_cpu(rec.e_cpos) << osb->s_clustersize_bits; + + ret = fiemap_fill_next_extent(fieinfo, virt_bytes, phys_bytes, + len_bytes, fe_flags); + if (ret) + break; + + cpos = le32_to_cpu(rec.e_cpos)+ le16_to_cpu(rec.e_leaf_clusters); + } + + if (ret > 0) + ret = 0; + +out_unlock: + brelse(di_bh); + + up_read(&OCFS2_I(inode)->ip_alloc_sem); + + ocfs2_inode_unlock(inode, 0); +out: + + return ret; +} diff --git a/fs/ocfs2/extent_map.h b/fs/ocfs2/extent_map.h index de91e3e..1b97490 100644 --- a/fs/ocfs2/extent_map.h +++ b/fs/ocfs2/extent_map.h @@ -50,4 +50,7 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, u32 *p_cluster, int ocfs2_extent_map_get_blocks(struct inode *inode, u64 v_blkno, u64 *p_blkno, u64 *ret_count, unsigned int *extent_flags); +int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 map_start, u64 map_len); + #endif /* _EXTENT_MAP_H */ diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index ec2ed15..ed38796 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2228,6 +2228,7 @@ const struct inode_operations ocfs2_file_iops = { .getattr = ocfs2_getattr, .permission = ocfs2_permission, .fallocate = ocfs2_fallocate, + .fiemap = ocfs2_fiemap, }; const struct inode_operations ocfs2_special_file_iops = { -- 1.5.6.1.205.ge2c7.dirty ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 3/4] generic block based fiemap implementation 2008-10-07 1:06 ` [PATCH 2/4] ocfs2: fiemap support Theodore Ts'o @ 2008-10-07 1:06 ` Theodore Ts'o 2008-10-07 1:06 ` [PATCH 4/4] Hook ext4 to the vfs fiemap interface Theodore Ts'o 0 siblings, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2008-10-07 1:06 UTC (permalink / raw) To: linux-fsdevel; +Cc: Josef Bacik, Mark Fasheh, Theodore Ts'o From: Josef Bacik <jbacik@redhat.com> Any block based fs (this patch includes ext3) just has to declare its own fiemap() function and then call this generic function with its own get_block_t. This works well for block based filesystems that will map multiple contiguous blocks at one time, but will work for filesystems that only map one block at a time, you will just end up with an "extent" for each block. One gotcha is this will not play nicely where there is hole+data after the EOF. This function will assume its hit the end of the data as soon as it hits a hole after the EOF, so if there is any data past that it will not pick that up. AFAIK no block based fs does this anyway, but its in the comments of the function anyway just in case. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org --- fs/ext2/ext2.h | 2 + fs/ext2/file.c | 1 + fs/ext2/inode.c | 8 +++ fs/ext3/file.c | 1 + fs/ext3/inode.c | 8 +++ fs/ioctl.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++ include/linux/ext3_fs.h | 2 + include/linux/fs.h | 3 + 8 files changed, 143 insertions(+), 0 deletions(-) diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index 47d88da..bae998c 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -133,6 +133,8 @@ extern void ext2_truncate (struct inode *); extern int ext2_setattr (struct dentry *, struct iattr *); extern void ext2_set_inode_flags(struct inode *inode); extern void ext2_get_inode_flags(struct ext2_inode_info *); +extern int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len); int __ext2_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata); diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 5f2fa9c..45ed071 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -86,4 +86,5 @@ const struct inode_operations ext2_file_inode_operations = { #endif .setattr = ext2_setattr, .permission = ext2_permission, + .fiemap = ext2_fiemap, }; diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 991d6df..7658b33 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -31,6 +31,7 @@ #include <linux/writeback.h> #include <linux/buffer_head.h> #include <linux/mpage.h> +#include <linux/fiemap.h> #include "ext2.h" #include "acl.h" #include "xip.h" @@ -704,6 +705,13 @@ int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_ } +int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len) +{ + return generic_block_fiemap(inode, fieinfo, start, len, + ext2_get_block); +} + static int ext2_writepage(struct page *page, struct writeback_control *wbc) { return block_write_full_page(page, ext2_get_block, wbc); diff --git a/fs/ext3/file.c b/fs/ext3/file.c index acc4913..3be1e06 100644 --- a/fs/ext3/file.c +++ b/fs/ext3/file.c @@ -134,5 +134,6 @@ const struct inode_operations ext3_file_inode_operations = { .removexattr = generic_removexattr, #endif .permission = ext3_permission, + .fiemap = ext3_fiemap, }; diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index 507d868..ebfec4d 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -36,6 +36,7 @@ #include <linux/mpage.h> #include <linux/uio.h> #include <linux/bio.h> +#include <linux/fiemap.h> #include "xattr.h" #include "acl.h" @@ -981,6 +982,13 @@ out: return ret; } +int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len) +{ + return generic_block_fiemap(inode, fieinfo, start, len, + ext3_get_block); +} + /* * `handle' can be NULL if create is zero */ diff --git a/fs/ioctl.c b/fs/ioctl.c index b759cc2..f364f33 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -13,6 +13,8 @@ #include <linux/security.h> #include <linux/module.h> #include <linux/uaccess.h> +#include <linux/writeback.h> +#include <linux/buffer_head.h> #include <asm/ioctls.h> @@ -224,6 +226,122 @@ static int ioctl_fiemap(struct file *filp, unsigned long arg) return error; } +#define blk_to_logical(inode, blk) (blk << (inode)->i_blkbits) +#define logical_to_blk(inode, offset) (offset >> (inode)->i_blkbits); + +/* + * @inode - the inode to map + * @arg - the pointer to userspace where we copy everything to + * @get_block - the fs's get_block function + * + * This does FIEMAP for block based inodes. Basically it will just loop + * through get_block until we hit the number of extents we want to map, or we + * go past the end of the file and hit a hole. + * + * If it is possible to have data blocks beyond a hole past @inode->i_size, then + * please do not use this function, it will stop at the first unmapped block + * beyond i_size + */ +int generic_block_fiemap(struct inode *inode, + struct fiemap_extent_info *fieinfo, u64 start, + u64 len, get_block_t *get_block) +{ + struct buffer_head tmp; + unsigned int start_blk; + long long length = 0, map_len = 0; + u64 logical = 0, phys = 0, size = 0; + u32 flags = FIEMAP_EXTENT_MERGED; + int ret = 0; + + if ((ret = fiemap_check_flags(fieinfo, FIEMAP_FLAG_SYNC))) + return ret; + + start_blk = logical_to_blk(inode, start); + + /* guard against change */ + mutex_lock(&inode->i_mutex); + + length = (long long)min_t(u64, len, i_size_read(inode)); + map_len = length; + + do { + /* + * we set b_size to the total size we want so it will map as + * many contiguous blocks as possible at once + */ + memset(&tmp, 0, sizeof(struct buffer_head)); + tmp.b_size = map_len; + + ret = get_block(inode, start_blk, &tmp, 0); + if (ret) + break; + + /* HOLE */ + if (!buffer_mapped(&tmp)) { + /* + * first hole after going past the EOF, this is our + * last extent + */ + if (length <= 0) { + flags = FIEMAP_EXTENT_MERGED|FIEMAP_EXTENT_LAST; + ret = fiemap_fill_next_extent(fieinfo, logical, + phys, size, + flags); + break; + } + + length -= blk_to_logical(inode, 1); + + /* if we have holes up to/past EOF then we're done */ + if (length <= 0) + break; + + start_blk++; + } else { + if (length <= 0 && size) { + ret = fiemap_fill_next_extent(fieinfo, logical, + phys, size, + flags); + if (ret) + break; + } + + logical = blk_to_logical(inode, start_blk); + phys = blk_to_logical(inode, tmp.b_blocknr); + size = tmp.b_size; + flags = FIEMAP_EXTENT_MERGED; + + length -= tmp.b_size; + start_blk += logical_to_blk(inode, size); + + /* + * if we are past the EOF we need to loop again to see + * if there is a hole so we can mark this extent as the + * last one, and if not keep mapping things until we + * find a hole, or we run out of slots in the extent + * array + */ + if (length <= 0) + continue; + + ret = fiemap_fill_next_extent(fieinfo, logical, phys, + size, flags); + if (ret) + break; + } + cond_resched(); + } while (1); + + mutex_unlock(&inode->i_mutex); + + /* if ret is 1 then we just hit the end of the extent array */ + if (ret == 1) + ret = 0; + + return ret; +} +EXPORT_SYMBOL(generic_block_fiemap); + static int file_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h index 80171ee..8120fa1 100644 --- a/include/linux/ext3_fs.h +++ b/include/linux/ext3_fs.h @@ -837,6 +837,8 @@ extern void ext3_truncate (struct inode *); extern void ext3_set_inode_flags(struct inode *); extern void ext3_get_inode_flags(struct ext3_inode_info *); extern void ext3_set_aops(struct inode *inode); +extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len); /* ioctl.c */ extern int ext3_ioctl (struct inode *, struct file *, unsigned int, diff --git a/include/linux/fs.h b/include/linux/fs.h index 194fb23..385c9a1 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1998,6 +1998,9 @@ extern int vfs_fstat(unsigned int, struct kstat *); extern int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, unsigned long arg); +extern int generic_block_fiemap(struct inode *inode, + struct fiemap_extent_info *fieinfo, u64 start, + u64 len, get_block_t *get_block); extern void get_filesystem(struct file_system_type *fs); extern void put_filesystem(struct file_system_type *fs); -- 1.5.6.1.205.ge2c7.dirty ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 4/4] Hook ext4 to the vfs fiemap interface. 2008-10-07 1:06 ` [PATCH 3/4] generic block based fiemap implementation Theodore Ts'o @ 2008-10-07 1:06 ` Theodore Ts'o 0 siblings, 0 replies; 40+ messages in thread From: Theodore Ts'o @ 2008-10-07 1:06 UTC (permalink / raw) To: linux-fsdevel; +Cc: Eric Sandeen, Theodore Ts'o, linux-ext4 From: Eric Sandeen <sandeen@redhat.com> ext4_ext_walk_space() was reinstated to be used for iterating over file extents with a callback; it is used by the ext4 fiemap implementation. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org --- fs/ext4/ext4.h | 2 + fs/ext4/ext4_extents.h | 15 +++ fs/ext4/extents.c | 248 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/ext4/file.c | 4 + 4 files changed, 269 insertions(+), 0 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index c50c04c..f46a513 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1067,6 +1067,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int, int *); struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int, int *); +int ext4_get_block(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create); int ext4_get_blocks_handle(handle_t *handle, struct inode *inode, ext4_lblk_t iblock, unsigned long maxblocks, struct buffer_head *bh_result, diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h index d33dc56..bec7ce5 100644 --- a/fs/ext4/ext4_extents.h +++ b/fs/ext4/ext4_extents.h @@ -124,6 +124,19 @@ struct ext4_ext_path { #define EXT4_EXT_CACHE_GAP 1 #define EXT4_EXT_CACHE_EXTENT 2 +/* + * to be called by ext4_ext_walk_space() + * negative retcode - error + * positive retcode - signal for ext4_ext_walk_space(), see below + * callback must return valid extent (passed or newly created) + */ +typedef int (*ext_prepare_callback)(struct inode *, struct ext4_ext_path *, + struct ext4_ext_cache *, + struct ext4_extent *, void *); + +#define EXT_CONTINUE 0 +#define EXT_BREAK 1 +#define EXT_REPEAT 2 #define EXT_MAX_BLOCK 0xffffffff @@ -224,6 +237,8 @@ extern int ext4_ext_try_to_merge(struct inode *inode, struct ext4_extent *); extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *); +extern int ext4_ext_walk_space(struct inode *, ext4_lblk_t, ext4_lblk_t, + ext_prepare_callback, void *); extern struct ext4_ext_path *ext4_ext_find_extent(struct inode *, ext4_lblk_t, struct ext4_ext_path *); extern int ext4_ext_search_left(struct inode *, struct ext4_ext_path *, diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index c8f81f2..ea2ce3c 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -40,6 +40,7 @@ #include <linux/slab.h> #include <linux/falloc.h> #include <asm/uaccess.h> +#include <linux/fiemap.h> #include "ext4_jbd2.h" #include "ext4_extents.h" @@ -1626,6 +1627,113 @@ cleanup: return err; } +int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block, + ext4_lblk_t num, ext_prepare_callback func, + void *cbdata) +{ + struct ext4_ext_path *path = NULL; + struct ext4_ext_cache cbex; + struct ext4_extent *ex; + ext4_lblk_t next, start = 0, end = 0; + ext4_lblk_t last = block + num; + int depth, exists, err = 0; + + BUG_ON(func == NULL); + BUG_ON(inode == NULL); + + while (block < last && block != EXT_MAX_BLOCK) { + num = last - block; + /* find extent for this block */ + path = ext4_ext_find_extent(inode, block, path); + if (IS_ERR(path)) { + err = PTR_ERR(path); + path = NULL; + break; + } + + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + ex = path[depth].p_ext; + next = ext4_ext_next_allocated_block(path); + + exists = 0; + if (!ex) { + /* there is no extent yet, so try to allocate + * all requested space */ + start = block; + end = block + num; + } else if (le32_to_cpu(ex->ee_block) > block) { + /* need to allocate space before found extent */ + start = block; + end = le32_to_cpu(ex->ee_block); + if (block + num < end) + end = block + num; + } else if (block >= le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex)) { + /* need to allocate space after found extent */ + start = block; + end = block + num; + if (end >= next) + end = next; + } else if (block >= le32_to_cpu(ex->ee_block)) { + /* + * some part of requested space is covered + * by found extent + */ + start = block; + end = le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex); + if (block + num < end) + end = block + num; + exists = 1; + } else { + BUG(); + } + BUG_ON(end <= start); + + if (!exists) { + cbex.ec_block = start; + cbex.ec_len = end - start; + cbex.ec_start = 0; + cbex.ec_type = EXT4_EXT_CACHE_GAP; + } else { + cbex.ec_block = le32_to_cpu(ex->ee_block); + cbex.ec_len = ext4_ext_get_actual_len(ex); + cbex.ec_start = ext_pblock(ex); + cbex.ec_type = EXT4_EXT_CACHE_EXTENT; + } + + BUG_ON(cbex.ec_len == 0); + err = func(inode, path, &cbex, ex, cbdata); + ext4_ext_drop_refs(path); + + if (err < 0) + break; + + if (err == EXT_REPEAT) + continue; + else if (err == EXT_BREAK) { + err = 0; + break; + } + + if (ext_depth(inode) != depth) { + /* depth was changed. we have to realloc path */ + kfree(path); + path = NULL; + } + + block = cbex.ec_block + cbex.ec_len; + } + + if (path) { + ext4_ext_drop_refs(path); + kfree(path); + } + + return err; +} + static void ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block, __u32 len, ext4_fsblk_t start, int type) @@ -2971,3 +3079,143 @@ retry: mutex_unlock(&inode->i_mutex); return ret > 0 ? ret2 : ret; } + +/* + * Callback function called for each extent to gather FIEMAP information. + */ +int ext4_ext_fiemap_cb(struct inode *inode, struct ext4_ext_path *path, + struct ext4_ext_cache *newex, struct ext4_extent *ex, + void *data) +{ + struct fiemap_extent_info *fieinfo = data; + unsigned long blksize_bits = inode->i_sb->s_blocksize_bits; + __u64 logical; + __u64 physical; + __u64 length; + __u32 flags = 0; + int error; + + logical = (__u64)newex->ec_block << blksize_bits; + + if (newex->ec_type == EXT4_EXT_CACHE_GAP) { + pgoff_t offset; + struct page *page; + struct buffer_head *bh = NULL; + + offset = logical >> PAGE_SHIFT; + page = find_get_page(inode->i_mapping, offset); + if (!page || !page_has_buffers(page)) + return EXT_CONTINUE; + + bh = page_buffers(page); + + if (!bh) + return EXT_CONTINUE; + + if (buffer_delay(bh)) { + flags |= FIEMAP_EXTENT_DELALLOC; + page_cache_release(page); + } else { + page_cache_release(page); + return EXT_CONTINUE; + } + } + + physical = (__u64)newex->ec_start << blksize_bits; + length = (__u64)newex->ec_len << blksize_bits; + + if (ex && ext4_ext_is_uninitialized(ex)) + flags |= FIEMAP_EXTENT_UNWRITTEN; + + /* + * If this extent reaches EXT_MAX_BLOCK, it must be last. + * + * Or if ext4_ext_next_allocated_block is EXT_MAX_BLOCK, + * this also indicates no more allocated blocks. + * + * XXX this might miss a single-block extent at EXT_MAX_BLOCK + */ + if (logical + length - 1 == EXT_MAX_BLOCK || + ext4_ext_next_allocated_block(path) == EXT_MAX_BLOCK) + flags |= FIEMAP_EXTENT_LAST; + + error = fiemap_fill_next_extent(fieinfo, logical, physical, + length, flags); + if (error < 0) + return error; + if (error == 1) + return EXT_BREAK; + + return EXT_CONTINUE; +} + +/* fiemap flags we can handle specified here */ +#define EXT4_FIEMAP_FLAGS (FIEMAP_FLAG_SYNC|FIEMAP_FLAG_XATTR) + +int ext4_xattr_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo) +{ + __u64 physical = 0; + __u64 length; + __u32 flags = FIEMAP_EXTENT_LAST; + int blockbits = inode->i_sb->s_blocksize_bits; + int error = 0; + + /* in-inode? */ + if (EXT4_I(inode)->i_state & EXT4_STATE_XATTR) { + struct ext4_iloc iloc; + int offset; /* offset of xattr in inode */ + + error = ext4_get_inode_loc(inode, &iloc); + if (error) + return error; + physical = iloc.bh->b_blocknr << blockbits; + offset = EXT4_GOOD_OLD_INODE_SIZE + + EXT4_I(inode)->i_extra_isize; + physical += offset; + length = EXT4_SB(inode->i_sb)->s_inode_size - offset; + flags |= FIEMAP_EXTENT_DATA_INLINE; + } else { /* external block */ + physical = EXT4_I(inode)->i_file_acl << blockbits; + length = inode->i_sb->s_blocksize; + } + + if (physical) + error = fiemap_fill_next_extent(fieinfo, 0, physical, + length, flags); + return (error < 0 ? error : 0); +} + +int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + __u64 start, __u64 len) +{ + ext4_lblk_t start_blk; + ext4_lblk_t len_blks; + int error = 0; + + /* fallback to generic here if not in extents fmt */ + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) + return generic_block_fiemap(inode, fieinfo, start, len, + ext4_get_block); + + if (fiemap_check_flags(fieinfo, EXT4_FIEMAP_FLAGS)) + return -EBADR; + + if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR) { + error = ext4_xattr_fiemap(inode, fieinfo); + } else { + start_blk = start >> inode->i_sb->s_blocksize_bits; + len_blks = len >> inode->i_sb->s_blocksize_bits; + + /* + * Walk the extent tree gathering extent information. + * ext4_ext_fiemap_cb will push extents back to user. + */ + down_write(&EXT4_I(inode)->i_data_sem); + error = ext4_ext_walk_space(inode, start_blk, len_blks, + ext4_ext_fiemap_cb, fieinfo); + up_write(&EXT4_I(inode)->i_data_sem); + } + + return error; +} + diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 62796b7..6d5be15 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -140,6 +140,9 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) return 0; } +extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + __u64 start, __u64 len); + const struct file_operations ext4_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, @@ -170,5 +173,6 @@ const struct inode_operations ext4_file_inode_operations = { #endif .permission = ext4_permission, .fallocate = ext4_fallocate, + .fiemap = ext4_fiemap, }; -- 1.5.6.1.205.ge2c7.dirty ^ permalink raw reply related [flat|nested] 40+ messages in thread
* PATCH [0/4] Updated**3 fiemap patches
@ 2008-10-09 1:48 Theodore Ts'o
2008-10-09 1:48 ` [PATCH 1/4] vfs: vfs-level fiemap interface Theodore Ts'o
0 siblings, 1 reply; 40+ messages in thread
From: Theodore Ts'o @ 2008-10-09 1:48 UTC (permalink / raw)
To: linux-fsdevel
These patches updates the name and definition of the
FIEMAP_EXTENT_ENCODED flag (formerly FIEMAP_EXTENT_NO_UNMOUNTED_IO).
At this point I believe (hope!) that we have consensus that this patch
series is ready for submission to mainline.
- Ted
^ permalink raw reply [flat|nested] 40+ messages in thread* [PATCH 1/4] vfs: vfs-level fiemap interface @ 2008-10-09 1:48 ` Theodore Ts'o 2008-10-09 1:48 ` [PATCH 2/4] ocfs2: fiemap support Theodore Ts'o 0 siblings, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2008-10-09 1:48 UTC (permalink / raw) To: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA Cc: Mark Fasheh, Theodore Ts'o, Michael Kerrisk, linux-api-u79uwXL29TY76Z2rM5mHXA From: Mark Fasheh <mfasheh-IBi9RG/b67k@public.gmane.org> Basic vfs-level fiemap infrastructure, which sets up a new ->fiemap inode operation. Userspace can get extent information on a file via fiemap ioctl. As input, the fiemap ioctl takes a struct fiemap which includes an array of struct fiemap_extent (fm_extents). Size of the extent array is passed as fm_extent_count and number of extents returned will be written into fm_mapped_extents. Offset and length fields on the fiemap structure (fm_start, fm_length) describe a logical range which will be searched for extents. All extents returned will at least partially contain this range. The actual extent offsets and ranges returned will be unmodified from their offset and range on-disk. The fiemap ioctl returns '0' on success. On error, -1 is returned and errno is set. If errno is equal to EBADR, then fm_flags will contain those flags which were passed in which the kernel did not understand. On all other errors, the contents of fm_extents is undefined. As fiemap evolved, there have been many authors of the vfs patch. As far as I can tell, the list includes: Kalpak Shah <kalpak.shah-xsfywfwIY+M@public.gmane.org> Andreas Dilger <adilger-xsfywfwIY+M@public.gmane.org> Eric Sandeen <sandeen-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Mark Fasheh <mfasheh-IBi9RG/b67k@public.gmane.org> Signed-off-by: Mark Fasheh <mfasheh-IBi9RG/b67k@public.gmane.org> Signed-off-by: "Theodore Ts'o" <tytso-3s7WtUTddSA@public.gmane.org> Cc: Michael Kerrisk <mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org> Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org --- Documentation/filesystems/fiemap.txt | 228 ++++++++++++++++++++++++++++++++++ fs/ioctl.c | 155 +++++++++++++++++++++++ include/linux/fiemap.h | 64 ++++++++++ include/linux/fs.h | 18 +++ 4 files changed, 465 insertions(+), 0 deletions(-) create mode 100644 Documentation/filesystems/fiemap.txt create mode 100644 include/linux/fiemap.h diff --git a/Documentation/filesystems/fiemap.txt b/Documentation/filesystems/fiemap.txt new file mode 100644 index 0000000..1e3defc --- /dev/null +++ b/Documentation/filesystems/fiemap.txt @@ -0,0 +1,228 @@ +============ +Fiemap Ioctl +============ + +The fiemap ioctl is an efficient method for userspace to get file +extent mappings. Instead of block-by-block mapping (such as bmap), fiemap +returns a list of extents. + + +Request Basics +-------------- + +A fiemap request is encoded within struct fiemap: + +struct fiemap { + __u64 fm_start; /* logical offset (inclusive) at + * which to start mapping (in) */ + __u64 fm_length; /* logical length of mapping which + * userspace cares about (in) */ + __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ + __u32 fm_mapped_extents; /* number of extents that were + * mapped (out) */ + __u32 fm_extent_count; /* size of fm_extents array (in) */ + __u32 fm_reserved; + struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */ +}; + + +fm_start, and fm_length specify the logical range within the file +which the process would like mappings for. Extents returned mirror +those on disk - that is, the logical offset of the 1st returned extent +may start before fm_start, and the range covered by the last returned +extent may end after fm_length. All offsets and lengths are in bytes. + +Certain flags to modify the way in which mappings are looked up can be +set in fm_flags. If the kernel doesn't understand some particular +flags, it will return EBADR and the contents of fm_flags will contain +the set of flags which caused the error. If the kernel is compatible +with all flags passed, the contents of fm_flags will be unmodified. +It is up to userspace to determine whether rejection of a particular +flag is fatal to it's operation. This scheme is intended to allow the +fiemap interface to grow in the future but without losing +compatibility with old software. + +fm_extent_count specifies the number of elements in the fm_extents[] array +that can be used to return extents. If fm_extent_count is zero, then the +fm_extents[] array is ignored (no extents will be returned), and the +fm_mapped_extents count will hold the number of extents needed in +fm_extents[] to hold the file's current mapping. Note that there is +nothing to prevent the file from changing between calls to FIEMAP. + +The following flags can be set in fm_flags: + +* FIEMAP_FLAG_SYNC +If this flag is set, the kernel will sync the file before mapping extents. + +* FIEMAP_FLAG_XATTR +If this flag is set, the extents returned will describe the inodes +extended attribute lookup tree, instead of it's data tree. + + +Extent Mapping +-------------- + +Extent information is returned within the embedded fm_extents array +which userspace must allocate along with the fiemap structure. The +number of elements in the fiemap_extents[] array should be passed via +fm_extent_count. The number of extents mapped by kernel will be +returned via fm_mapped_extents. If the number of fiemap_extents +allocated is less than would be required to map the requested range, +the maximum number of extents that can be mapped in the fm_extent[] +array will be returned and fm_mapped_extents will be equal to +fm_extent_count. In that case, the last extent in the array will not +complete the requested range and will not have the FIEMAP_EXTENT_LAST +flag set (see the next section on extent flags). + +Each extent is described by a single fiemap_extent structure as +returned in fm_extents. + +struct fiemap_extent { + __u64 fe_logical; /* logical offset in bytes for the start of + * the extent */ + __u64 fe_physical; /* physical offset in bytes for the start + * of the extent */ + __u64 fe_length; /* length in bytes for the extent */ + __u64 fe_reserved64[2]; + __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ + __u32 fe_reserved[3]; +}; + +All offsets and lengths are in bytes and mirror those on disk. It is valid +for an extents logical offset to start before the request or it's logical +length to extend past the request. Unless FIEMAP_EXTENT_NOT_ALIGNED is +returned, fe_logical, fe_physical, and fe_length will be aligned to the +block size of the file system. With the exception of extents flagged as +FIEMAP_EXTENT_MERGED, adjacent extents will not be merged. + +The fe_flags field contains flags which describe the extent returned. +A special flag, FIEMAP_EXTENT_LAST is always set on the last extent in +the file so that the process making fiemap calls can determine when no +more extents are available, without having to call the ioctl again. + +Some flags are intentionally vague and will always be set in the +presence of other more specific flags. This way a program looking for +a general property does not have to know all existing and future flags +which imply that property. + +For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL +are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking +for inline or tail-packed data can key on the specific flag. Software +which simply cares not to try operating on non-aligned extents +however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to +worry about all present and future flags which might imply unaligned +data. Note that the opposite is not true - it would be valid for +FIEMAP_EXTENT_NOT_ALIGNED to appear alone. + +* FIEMAP_EXTENT_LAST +This is the last extent in the file. A mapping attempt past this +extent will return nothing. + +* FIEMAP_EXTENT_UNKNOWN +The location of this extent is currently unknown. This may indicate +the data is stored on an inaccessible volume or that no storage has +been allocated for the file yet. + +* FIEMAP_EXTENT_DELALLOC + - This will also set FIEMAP_EXTENT_UNKNOWN. +Delayed allocation - while there is data for this extent, it's +physical location has not been allocated yet. + +* FIEMAP_EXTENT_ENCODED +This extent does not consist of plain filesystem blocks but is +encoded (e.g. encrypted or compressed). Reading the data in this +extent via I/O to the block device will have undefined results. + +Note that it is *always* undefined to try to update the data +in-place by writing to the indicated location without the +assistance of the filesystem, or to access the data using the +information returned by the FIEMAP interface while the filesystem +is mounted. In other words, user applications may only read the +extent data via I/O to the block device while the filesystem is +unmounted, and then only if the FIEMAP_EXTENT_ENCODED flag is +clear; user applications must not try reading or writing to the +filesystem via the block device under any other circumstances. + +* FIEMAP_EXTENT_DATA_ENCRYPTED + - This will also set FIEMAP_EXTENT_ENCODED +The data in this extent has been encrypted by the file system. + +* FIEMAP_EXTENT_NOT_ALIGNED +Extent offsets and length are not guaranteed to be block aligned. + +* FIEMAP_EXTENT_DATA_INLINE + This will also set FIEMAP_EXTENT_NOT_ALIGNED +Data is located within a meta data block. + +* FIEMAP_EXTENT_DATA_TAIL + This will also set FIEMAP_EXTENT_NOT_ALIGNED +Data is packed into a block with data from other files. + +* FIEMAP_EXTENT_UNWRITTEN +Unwritten extent - the extent is allocated but it's data has not been +initialized. This indicates the extent's data will be all zero if read +through the filesystem but the contents are undefined if read directly from +the device. + +* FIEMAP_EXTENT_MERGED +This will be set when a file does not support extents, i.e., it uses a block +based addressing scheme. Since returning an extent for each block back to +userspace would be highly inefficient, the kernel will try to merge most +adjacent blocks into 'extents'. + + +VFS -> File System Implementation +--------------------------------- + +File systems wishing to support fiemap must implement a ->fiemap callback on +their inode_operations structure. The fs ->fiemap call is responsible for +defining it's set of supported fiemap flags, and calling a helper function on +each discovered extent: + +struct inode_operations { + ... + + int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, + u64 len); + +->fiemap is passed struct fiemap_extent_info which describes the +fiemap request: + +struct fiemap_extent_info { + unsigned int fi_flags; /* Flags as passed from user */ + unsigned int fi_extents_mapped; /* Number of mapped extents */ + unsigned int fi_extents_max; /* Size of fiemap_extent array */ + struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */ +}; + +It is intended that the file system should not need to access any of this +structure directly. + + +Flag checking should be done at the beginning of the ->fiemap callback via the +fiemap_check_flags() helper: + +int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); + +The struct fieinfo should be passed in as recieved from ioctl_fiemap(). The +set of fiemap flags which the fs understands should be passed via fs_flags. If +fiemap_check_flags finds invalid user flags, it will place the bad values in +fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR, from +fiemap_check_flags(), it should immediately exit, returning that error back to +ioctl_fiemap(). + + +For each extent in the request range, the file system should call +the helper function, fiemap_fill_next_extent(): + +int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, + u64 phys, u64 len, u32 flags, u32 dev); + +fiemap_fill_next_extent() will use the passed values to populate the +next free extent in the fm_extents array. 'General' extent flags will +automatically be set from specific flags on behalf of the calling file +system so that the userspace API is not broken. + +fiemap_fill_next_extent() returns 0 on success, and 1 when the +user-supplied fm_extents array is full. If an error is encountered +while copying the extent to user memory, -EFAULT will be returned. diff --git a/fs/ioctl.c b/fs/ioctl.c index 7db32b3..045d960 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -16,6 +16,9 @@ #include <asm/ioctls.h> +/* So that the fiemap access checks can't overflow on 32 bit machines. */ +#define FIEMAP_MAX_EXTENTS (UINT_MAX / sizeof(struct fiemap_extent)) + /** * vfs_ioctl - call filesystem specific ioctl methods * @filp: open file to invoke ioctl method on @@ -71,6 +74,156 @@ static int ioctl_fibmap(struct file *filp, int __user *p) return put_user(res, p); } +/** + * fiemap_fill_next_extent - Fiemap helper function + * @fieinfo: Fiemap context passed into ->fiemap + * @logical: Extent logical start offset, in bytes + * @phys: Extent physical start offset, in bytes + * @len: Extent length, in bytes + * @flags: FIEMAP_EXTENT flags that describe this extent + * + * Called from file system ->fiemap callback. Will populate extent + * info as passed in via arguments and copy to user memory. On + * success, extent count on fieinfo is incremented. + * + * Returns 0 on success, -errno on error, 1 if this was the last + * extent that will fit in user array. + */ +#define SET_UNKNOWN_FLAGS (FIEMAP_EXTENT_DELALLOC) +#define SET_NO_UNMOUNTED_IO_FLAGS (FIEMAP_EXTENT_DATA_ENCRYPTED) +#define SET_NOT_ALIGNED_FLAGS (FIEMAP_EXTENT_DATA_TAIL|FIEMAP_EXTENT_DATA_INLINE) +int fiemap_fill_next_extent(struct fiemap_extent_info *fieinfo, u64 logical, + u64 phys, u64 len, u32 flags) +{ + struct fiemap_extent extent; + struct fiemap_extent *dest = fieinfo->fi_extents_start; + + /* only count the extents */ + if (fieinfo->fi_extents_max == 0) { + fieinfo->fi_extents_mapped++; + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; + } + + if (fieinfo->fi_extents_mapped >= fieinfo->fi_extents_max) + return 1; + + if (flags & SET_UNKNOWN_FLAGS) + flags |= FIEMAP_EXTENT_UNKNOWN; + if (flags & SET_NO_UNMOUNTED_IO_FLAGS) + flags |= FIEMAP_EXTENT_ENCODED; + if (flags & SET_NOT_ALIGNED_FLAGS) + flags |= FIEMAP_EXTENT_NOT_ALIGNED; + + memset(&extent, 0, sizeof(extent)); + extent.fe_logical = logical; + extent.fe_physical = phys; + extent.fe_length = len; + extent.fe_flags = flags; + + dest += fieinfo->fi_extents_mapped; + if (copy_to_user(dest, &extent, sizeof(extent))) + return -EFAULT; + + fieinfo->fi_extents_mapped++; + if (fieinfo->fi_extents_mapped == fieinfo->fi_extents_max) + return 1; + return (flags & FIEMAP_EXTENT_LAST) ? 1 : 0; +} +EXPORT_SYMBOL(fiemap_fill_next_extent); + +/** + * fiemap_check_flags - check validity of requested flags for fiemap + * @fieinfo: Fiemap context passed into ->fiemap + * @fs_flags: Set of fiemap flags that the file system understands + * + * Called from file system ->fiemap callback. This will compute the + * intersection of valid fiemap flags and those that the fs supports. That + * value is then compared against the user supplied flags. In case of bad user + * flags, the invalid values will be written into the fieinfo structure, and + * -EBADR is returned, which tells ioctl_fiemap() to return those values to + * userspace. For this reason, a return code of -EBADR should be preserved. + * + * Returns 0 on success, -EBADR on bad flags. + */ +int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags) +{ + u32 incompat_flags; + + incompat_flags = fieinfo->fi_flags & ~(FIEMAP_FLAGS_COMPAT & fs_flags); + if (incompat_flags) { + fieinfo->fi_flags = incompat_flags; + return -EBADR; + } + return 0; +} +EXPORT_SYMBOL(fiemap_check_flags); + +static int fiemap_check_ranges(struct super_block *sb, + u64 start, u64 len, u64 *new_len) +{ + *new_len = len; + + if (len == 0) + return -EINVAL; + + if (start > sb->s_maxbytes) + return -EFBIG; + + /* + * Shrink request scope to what the fs can actually handle. + */ + if ((len > sb->s_maxbytes) || + (sb->s_maxbytes - len) < start) + *new_len = sb->s_maxbytes - start; + + return 0; +} + +static int ioctl_fiemap(struct file *filp, unsigned long arg) +{ + struct fiemap fiemap; + struct fiemap_extent_info fieinfo = { 0, }; + struct inode *inode = filp->f_path.dentry->d_inode; + struct super_block *sb = inode->i_sb; + u64 len; + int error; + + if (!inode->i_op->fiemap) + return -EOPNOTSUPP; + + if (copy_from_user(&fiemap, (struct fiemap __user *)arg, + sizeof(struct fiemap))) + return -EFAULT; + + if (fiemap.fm_extent_count > FIEMAP_MAX_EXTENTS) + return -EINVAL; + + error = fiemap_check_ranges(sb, fiemap.fm_start, fiemap.fm_length, + &len); + if (error) + return error; + + fieinfo.fi_flags = fiemap.fm_flags; + fieinfo.fi_extents_max = fiemap.fm_extent_count; + fieinfo.fi_extents_start = (struct fiemap_extent *)(arg + sizeof(fiemap)); + + if (fiemap.fm_extent_count != 0 && + !access_ok(VERIFY_WRITE, fieinfo.fi_extents_start, + fieinfo.fi_extents_max * sizeof(struct fiemap_extent))) + return -EFAULT; + + if (fieinfo.fi_flags & FIEMAP_FLAG_SYNC) + filemap_write_and_wait(inode->i_mapping); + + error = inode->i_op->fiemap(inode, &fieinfo, fiemap.fm_start, len); + fiemap.fm_flags = fieinfo.fi_flags; + fiemap.fm_mapped_extents = fieinfo.fi_extents_mapped; + if (copy_to_user((char *)arg, &fiemap, sizeof(fiemap))) + error = -EFAULT; + + return error; +} + static int file_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { @@ -80,6 +233,8 @@ static int file_ioctl(struct file *filp, unsigned int cmd, switch (cmd) { case FIBMAP: return ioctl_fibmap(filp, p); + case FS_IOC_FIEMAP: + return ioctl_fiemap(filp, arg); case FIGETBSZ: return put_user(inode->i_sb->s_blocksize, p); case FIONREAD: diff --git a/include/linux/fiemap.h b/include/linux/fiemap.h new file mode 100644 index 0000000..671decb --- /dev/null +++ b/include/linux/fiemap.h @@ -0,0 +1,64 @@ +/* + * FS_IOC_FIEMAP ioctl infrastructure. + * + * Some portions copyright (C) 2007 Cluster File Systems, Inc + * + * Authors: Mark Fasheh <mfasheh-IBi9RG/b67k@public.gmane.org> + * Kalpak Shah <kalpak.shah-xsfywfwIY+M@public.gmane.org> + * Andreas Dilger <adilger-xsfywfwIY+M@public.gmane.org> + */ + +#ifndef _LINUX_FIEMAP_H +#define _LINUX_FIEMAP_H + +struct fiemap_extent { + __u64 fe_logical; /* logical offset in bytes for the start of + * the extent from the beginning of the file */ + __u64 fe_physical; /* physical offset in bytes for the start + * of the extent from the beginning of the disk */ + __u64 fe_length; /* length in bytes for this extent */ + __u64 fe_reserved64[2]; + __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ + __u32 fe_reserved[3]; +}; + +struct fiemap { + __u64 fm_start; /* logical offset (inclusive) at + * which to start mapping (in) */ + __u64 fm_length; /* logical length of mapping which + * userspace wants (in) */ + __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ + __u32 fm_mapped_extents;/* number of extents that were mapped (out) */ + __u32 fm_extent_count; /* size of fm_extents array (in) */ + __u32 fm_reserved; + struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */ +}; + +#define FIEMAP_MAX_OFFSET (~0ULL) + +#define FIEMAP_FLAG_SYNC 0x00000001 /* sync file data before map */ +#define FIEMAP_FLAG_XATTR 0x00000002 /* map extended attribute tree */ + +#define FIEMAP_FLAGS_COMPAT (FIEMAP_FLAG_SYNC | FIEMAP_FLAG_XATTR) + +#define FIEMAP_EXTENT_LAST 0x00000001 /* Last extent in file. */ +#define FIEMAP_EXTENT_UNKNOWN 0x00000002 /* Data location unknown. */ +#define FIEMAP_EXTENT_DELALLOC 0x00000004 /* Location still pending. + * Sets EXTENT_UNKNOWN. */ +#define FIEMAP_EXTENT_ENCODED 0x00000008 /* Data can not be read + * while fs is unmounted */ +#define FIEMAP_EXTENT_DATA_ENCRYPTED 0x00000080 /* Data is encrypted by fs. + * Sets EXTENT_NO_BYPASS. */ +#define FIEMAP_EXTENT_NOT_ALIGNED 0x00000100 /* Extent offsets may not be + * block aligned. */ +#define FIEMAP_EXTENT_DATA_INLINE 0x00000200 /* Data mixed with metadata. + * Sets EXTENT_NOT_ALIGNED.*/ +#define FIEMAP_EXTENT_DATA_TAIL 0x00000400 /* Multiple files in block. + * Sets EXTENT_NOT_ALIGNED.*/ +#define FIEMAP_EXTENT_UNWRITTEN 0x00000800 /* Space allocated, but + * no data (i.e. zero). */ +#define FIEMAP_EXTENT_MERGED 0x00001000 /* File does not natively + * support extents. Result + * merged for efficiency. */ + +#endif /* _LINUX_FIEMAP_H */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 580b513..194fb23 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -231,6 +231,7 @@ extern int dir_notify_enable; #define FS_IOC_SETFLAGS _IOW('f', 2, long) #define FS_IOC_GETVERSION _IOR('v', 1, long) #define FS_IOC_SETVERSION _IOW('v', 2, long) +#define FS_IOC_FIEMAP _IOWR('f', 11, struct fiemap) #define FS_IOC32_GETFLAGS _IOR('f', 1, int) #define FS_IOC32_SETFLAGS _IOW('f', 2, int) #define FS_IOC32_GETVERSION _IOR('v', 1, int) @@ -291,6 +292,7 @@ extern int dir_notify_enable; #include <linux/mutex.h> #include <linux/capability.h> #include <linux/semaphore.h> +#include <linux/fiemap.h> #include <asm/atomic.h> #include <asm/byteorder.h> @@ -1179,6 +1181,20 @@ extern void dentry_unhash(struct dentry *dentry); extern int file_permission(struct file *, int); /* + * VFS FS_IOC_FIEMAP helper definitions. + */ +struct fiemap_extent_info { + unsigned int fi_flags; /* Flags as passed from user */ + unsigned int fi_extents_mapped; /* Number of mapped extents */ + unsigned int fi_extents_max; /* Size of fiemap_extent array */ + struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent + * array */ +}; +int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, + u64 phys, u64 len, u32 flags); +int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); + +/* * File types * * NOTE! These match bits 12..15 of stat.st_mode @@ -1287,6 +1303,8 @@ struct inode_operations { void (*truncate_range)(struct inode *, loff_t, loff_t); long (*fallocate)(struct inode *inode, int mode, loff_t offset, loff_t len); + int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, + u64 len); }; struct seq_file; -- 1.5.6.1.205.ge2c7.dirty -- To unsubscribe from this list: send the line "unsubscribe linux-api" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 2/4] ocfs2: fiemap support 2008-10-09 1:48 ` [PATCH 1/4] vfs: vfs-level fiemap interface Theodore Ts'o @ 2008-10-09 1:48 ` Theodore Ts'o 2008-10-09 1:48 ` [PATCH 3/4] generic block based fiemap implementation Theodore Ts'o 0 siblings, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2008-10-09 1:48 UTC (permalink / raw) To: linux-fsdevel; +Cc: Mark Fasheh, Theodore Ts'o, ocfs2-devel From: Mark Fasheh <mfasheh@suse.com> Plug ocfs2 into ->fiemap. Some portions of ocfs2_get_clusters() had to be refactored so that the extent cache can be skipped in favor of going directly to the on-disk records. This makes it easier for us to determine which extent is the last one in the btree. Also, I'm not sure we want to be caching fiemap lookups anyway as they're not directly related to data read/write. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: ocfs2-devel@oss.oracle.com Cc: linux-fsdevel@vger.kernel.org --- fs/ocfs2/alloc.c | 9 -- fs/ocfs2/alloc.h | 9 ++ fs/ocfs2/extent_map.c | 346 +++++++++++++++++++++++++++++++++++++++++-------- fs/ocfs2/extent_map.h | 3 + fs/ocfs2/file.c | 1 + 5 files changed, 306 insertions(+), 62 deletions(-) diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c index 10bfb46..29ff57e 100644 --- a/fs/ocfs2/alloc.c +++ b/fs/ocfs2/alloc.c @@ -990,15 +990,6 @@ out: } /* - * This is only valid for leaf nodes, which are the only ones that can - * have empty extents anyway. - */ -static inline int ocfs2_is_empty_extent(struct ocfs2_extent_rec *rec) -{ - return !rec->e_leaf_clusters; -} - -/* * This function will discard the rightmost extent record. */ static void ocfs2_shift_records_right(struct ocfs2_extent_list *el) diff --git a/fs/ocfs2/alloc.h b/fs/ocfs2/alloc.h index 42ff94b..60cd3d5 100644 --- a/fs/ocfs2/alloc.h +++ b/fs/ocfs2/alloc.h @@ -146,4 +146,13 @@ static inline unsigned int ocfs2_rec_clusters(struct ocfs2_extent_list *el, return le16_to_cpu(rec->e_leaf_clusters); } +/* + * This is only valid for leaf nodes, which are the only ones that can + * have empty extents anyway. + */ +static inline int ocfs2_is_empty_extent(struct ocfs2_extent_rec *rec) +{ + return !rec->e_leaf_clusters; +} + #endif /* OCFS2_ALLOC_H */ diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c index c58668a..aed268e 100644 --- a/fs/ocfs2/extent_map.c +++ b/fs/ocfs2/extent_map.c @@ -25,6 +25,7 @@ #include <linux/fs.h> #include <linux/init.h> #include <linux/types.h> +#include <linux/fiemap.h> #define MLOG_MASK_PREFIX ML_EXTENT_MAP #include <cluster/masklog.h> @@ -32,6 +33,7 @@ #include "ocfs2.h" #include "alloc.h" +#include "dlmglue.h" #include "extent_map.h" #include "inode.h" #include "super.h" @@ -282,6 +284,51 @@ out: kfree(new_emi); } +static int ocfs2_last_eb_is_empty(struct inode *inode, + struct ocfs2_dinode *di) +{ + int ret, next_free; + u64 last_eb_blk = le64_to_cpu(di->i_last_eb_blk); + struct buffer_head *eb_bh = NULL; + struct ocfs2_extent_block *eb; + struct ocfs2_extent_list *el; + + ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), last_eb_blk, + &eb_bh, OCFS2_BH_CACHED, inode); + if (ret) { + mlog_errno(ret); + goto out; + } + + eb = (struct ocfs2_extent_block *) eb_bh->b_data; + el = &eb->h_list; + + if (!OCFS2_IS_VALID_EXTENT_BLOCK(eb)) { + ret = -EROFS; + OCFS2_RO_ON_INVALID_EXTENT_BLOCK(inode->i_sb, eb); + goto out; + } + + if (el->l_tree_depth) { + ocfs2_error(inode->i_sb, + "Inode %lu has non zero tree depth in " + "leaf block %llu\n", inode->i_ino, + (unsigned long long)eb_bh->b_blocknr); + ret = -EROFS; + goto out; + } + + next_free = le16_to_cpu(el->l_next_free_rec); + + if (next_free == 0 || + (next_free == 1 && ocfs2_is_empty_extent(&el->l_recs[0]))) + ret = 1; + +out: + brelse(eb_bh); + return ret; +} + /* * Return the 1st index within el which contains an extent start * larger than v_cluster. @@ -373,42 +420,28 @@ out: return ret; } -int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, - u32 *p_cluster, u32 *num_clusters, - unsigned int *extent_flags) +static int ocfs2_get_clusters_nocache(struct inode *inode, + struct buffer_head *di_bh, + u32 v_cluster, unsigned int *hole_len, + struct ocfs2_extent_rec *ret_rec, + unsigned int *is_last) { - int ret, i; - unsigned int flags = 0; - struct buffer_head *di_bh = NULL; - struct buffer_head *eb_bh = NULL; + int i, ret, tree_height, len; struct ocfs2_dinode *di; - struct ocfs2_extent_block *eb; + struct ocfs2_extent_block *uninitialized_var(eb); struct ocfs2_extent_list *el; struct ocfs2_extent_rec *rec; - u32 coff; - - if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) { - ret = -ERANGE; - mlog_errno(ret); - goto out; - } - - ret = ocfs2_extent_map_lookup(inode, v_cluster, p_cluster, - num_clusters, extent_flags); - if (ret == 0) - goto out; + struct buffer_head *eb_bh = NULL; - ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), OCFS2_I(inode)->ip_blkno, - &di_bh, OCFS2_BH_CACHED, inode); - if (ret) { - mlog_errno(ret); - goto out; - } + memset(ret_rec, 0, sizeof(*ret_rec)); + if (is_last) + *is_last = 0; di = (struct ocfs2_dinode *) di_bh->b_data; el = &di->id2.i_list; + tree_height = le16_to_cpu(el->l_tree_depth); - if (el->l_tree_depth) { + if (tree_height > 0) { ret = ocfs2_find_leaf(inode, el, v_cluster, &eb_bh); if (ret) { mlog_errno(ret); @@ -431,46 +464,143 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, i = ocfs2_search_extent_list(el, v_cluster); if (i == -1) { /* - * A hole was found. Return some canned values that - * callers can key on. If asked for, num_clusters will - * be populated with the size of the hole. + * Holes can be larger than the maximum size of an + * extent, so we return their lengths in a seperate + * field. */ - *p_cluster = 0; - if (num_clusters) { + if (hole_len) { ret = ocfs2_figure_hole_clusters(inode, el, eb_bh, - v_cluster, - num_clusters); + v_cluster, &len); if (ret) { mlog_errno(ret); goto out; } + + *hole_len = len; } - } else { - rec = &el->l_recs[i]; + goto out_hole; + } - BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos)); + rec = &el->l_recs[i]; - if (!rec->e_blkno) { - ocfs2_error(inode->i_sb, "Inode %lu has bad extent " - "record (%u, %u, 0)", inode->i_ino, - le32_to_cpu(rec->e_cpos), - ocfs2_rec_clusters(el, rec)); - ret = -EROFS; - goto out; + BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos)); + + if (!rec->e_blkno) { + ocfs2_error(inode->i_sb, "Inode %lu has bad extent " + "record (%u, %u, 0)", inode->i_ino, + le32_to_cpu(rec->e_cpos), + ocfs2_rec_clusters(el, rec)); + ret = -EROFS; + goto out; + } + + *ret_rec = *rec; + + /* + * Checking for last extent is potentially expensive - we + * might have to look at the next leaf over to see if it's + * empty. + * + * The first two checks are to see whether the caller even + * cares for this information, and if the extent is at least + * the last in it's list. + * + * If those hold true, then the extent is last if any of the + * additional conditions hold true: + * - Extent list is in-inode + * - Extent list is right-most + * - Extent list is 2nd to rightmost, with empty right-most + */ + if (is_last) { + if (i == (le16_to_cpu(el->l_next_free_rec) - 1)) { + if (tree_height == 0) + *is_last = 1; + else if (eb->h_blkno == di->i_last_eb_blk) + *is_last = 1; + else if (eb->h_next_leaf_blk == di->i_last_eb_blk) { + ret = ocfs2_last_eb_is_empty(inode, di); + if (ret < 0) { + mlog_errno(ret); + goto out; + } + if (ret == 1) + *is_last = 1; + } } + } + +out_hole: + ret = 0; +out: + brelse(eb_bh); + return ret; +} + +static void ocfs2_relative_extent_offsets(struct super_block *sb, + u32 v_cluster, + struct ocfs2_extent_rec *rec, + u32 *p_cluster, u32 *num_clusters) + +{ + u32 coff = v_cluster - le32_to_cpu(rec->e_cpos); + + *p_cluster = ocfs2_blocks_to_clusters(sb, le64_to_cpu(rec->e_blkno)); + *p_cluster = *p_cluster + coff; + + if (num_clusters) + *num_clusters = le16_to_cpu(rec->e_leaf_clusters) - coff; +} + +int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, + u32 *p_cluster, u32 *num_clusters, + unsigned int *extent_flags) +{ + int ret; + unsigned int uninitialized_var(hole_len), flags = 0; + struct buffer_head *di_bh = NULL; + struct ocfs2_extent_rec rec; - coff = v_cluster - le32_to_cpu(rec->e_cpos); + if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) { + ret = -ERANGE; + mlog_errno(ret); + goto out; + } - *p_cluster = ocfs2_blocks_to_clusters(inode->i_sb, - le64_to_cpu(rec->e_blkno)); - *p_cluster = *p_cluster + coff; + ret = ocfs2_extent_map_lookup(inode, v_cluster, p_cluster, + num_clusters, extent_flags); + if (ret == 0) + goto out; - if (num_clusters) - *num_clusters = ocfs2_rec_clusters(el, rec) - coff; + ret = ocfs2_read_block(OCFS2_SB(inode->i_sb), OCFS2_I(inode)->ip_blkno, + &di_bh, OCFS2_BH_CACHED, inode); + if (ret) { + mlog_errno(ret); + goto out; + } - flags = rec->e_flags; + ret = ocfs2_get_clusters_nocache(inode, di_bh, v_cluster, &hole_len, + &rec, NULL); + if (ret) { + mlog_errno(ret); + goto out; + } - ocfs2_extent_map_insert_rec(inode, rec); + if (rec.e_blkno == 0ULL) { + /* + * A hole was found. Return some canned values that + * callers can key on. If asked for, num_clusters will + * be populated with the size of the hole. + */ + *p_cluster = 0; + if (num_clusters) { + *num_clusters = hole_len; + } + } else { + ocfs2_relative_extent_offsets(inode->i_sb, v_cluster, &rec, + p_cluster, num_clusters); + flags = rec.e_flags; + + ocfs2_extent_map_insert_rec(inode, &rec); } if (extent_flags) @@ -478,7 +608,6 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, out: brelse(di_bh); - brelse(eb_bh); return ret; } @@ -521,3 +650,114 @@ int ocfs2_extent_map_get_blocks(struct inode *inode, u64 v_blkno, u64 *p_blkno, out: return ret; } + +static int ocfs2_fiemap_inline(struct inode *inode, struct buffer_head *di_bh, + struct fiemap_extent_info *fieinfo, + u64 map_start) +{ + int ret; + unsigned int id_count; + struct ocfs2_dinode *di; + u64 phys; + u32 flags = FIEMAP_EXTENT_DATA_INLINE|FIEMAP_EXTENT_LAST; + struct ocfs2_inode_info *oi = OCFS2_I(inode); + + di = (struct ocfs2_dinode *)di_bh->b_data; + id_count = le16_to_cpu(di->id2.i_data.id_count); + + if (map_start < id_count) { + phys = oi->ip_blkno << inode->i_sb->s_blocksize_bits; + phys += offsetof(struct ocfs2_dinode, id2.i_data.id_data); + + ret = fiemap_fill_next_extent(fieinfo, 0, phys, id_count, + flags); + if (ret < 0) + return ret; + } + + return 0; +} + +#define OCFS2_FIEMAP_FLAGS (FIEMAP_FLAG_SYNC) + +int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 map_start, u64 map_len) +{ + int ret, is_last; + u32 mapping_end, cpos; + unsigned int hole_size; + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); + u64 len_bytes, phys_bytes, virt_bytes; + struct buffer_head *di_bh = NULL; + struct ocfs2_extent_rec rec; + + ret = fiemap_check_flags(fieinfo, OCFS2_FIEMAP_FLAGS); + if (ret) + return ret; + + ret = ocfs2_inode_lock(inode, &di_bh, 0); + if (ret) { + mlog_errno(ret); + goto out; + } + + down_read(&OCFS2_I(inode)->ip_alloc_sem); + + /* + * Handle inline-data separately. + */ + if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) { + ret = ocfs2_fiemap_inline(inode, di_bh, fieinfo, map_start); + goto out_unlock; + } + + cpos = map_start >> osb->s_clustersize_bits; + mapping_end = ocfs2_clusters_for_bytes(inode->i_sb, + map_start + map_len); + mapping_end -= cpos; + is_last = 0; + while (cpos < mapping_end && !is_last) { + u32 fe_flags; + + ret = ocfs2_get_clusters_nocache(inode, di_bh, cpos, + &hole_size, &rec, &is_last); + if (ret) { + mlog_errno(ret); + goto out; + } + + if (rec.e_blkno == 0ULL) { + cpos += hole_size; + continue; + } + + fe_flags = 0; + if (rec.e_flags & OCFS2_EXT_UNWRITTEN) + fe_flags |= FIEMAP_EXTENT_UNWRITTEN; + if (is_last) + fe_flags |= FIEMAP_EXTENT_LAST; + len_bytes = (u64)le16_to_cpu(rec.e_leaf_clusters) << osb->s_clustersize_bits; + phys_bytes = le64_to_cpu(rec.e_blkno) << osb->sb->s_blocksize_bits; + virt_bytes = (u64)le32_to_cpu(rec.e_cpos) << osb->s_clustersize_bits; + + ret = fiemap_fill_next_extent(fieinfo, virt_bytes, phys_bytes, + len_bytes, fe_flags); + if (ret) + break; + + cpos = le32_to_cpu(rec.e_cpos)+ le16_to_cpu(rec.e_leaf_clusters); + } + + if (ret > 0) + ret = 0; + +out_unlock: + brelse(di_bh); + + up_read(&OCFS2_I(inode)->ip_alloc_sem); + + ocfs2_inode_unlock(inode, 0); +out: + + return ret; +} diff --git a/fs/ocfs2/extent_map.h b/fs/ocfs2/extent_map.h index de91e3e..1b97490 100644 --- a/fs/ocfs2/extent_map.h +++ b/fs/ocfs2/extent_map.h @@ -50,4 +50,7 @@ int ocfs2_get_clusters(struct inode *inode, u32 v_cluster, u32 *p_cluster, int ocfs2_extent_map_get_blocks(struct inode *inode, u64 v_blkno, u64 *p_blkno, u64 *ret_count, unsigned int *extent_flags); +int ocfs2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 map_start, u64 map_len); + #endif /* _EXTENT_MAP_H */ diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index ec2ed15..ed38796 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2228,6 +2228,7 @@ const struct inode_operations ocfs2_file_iops = { .getattr = ocfs2_getattr, .permission = ocfs2_permission, .fallocate = ocfs2_fallocate, + .fiemap = ocfs2_fiemap, }; const struct inode_operations ocfs2_special_file_iops = { -- 1.5.6.1.205.ge2c7.dirty ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 3/4] generic block based fiemap implementation 2008-10-09 1:48 ` [PATCH 2/4] ocfs2: fiemap support Theodore Ts'o @ 2008-10-09 1:48 ` Theodore Ts'o 2008-10-09 1:48 ` [PATCH 4/4] Hook ext4 to the vfs fiemap interface Theodore Ts'o 0 siblings, 1 reply; 40+ messages in thread From: Theodore Ts'o @ 2008-10-09 1:48 UTC (permalink / raw) To: linux-fsdevel; +Cc: Josef Bacik, Mark Fasheh, Theodore Ts'o From: Josef Bacik <jbacik@redhat.com> Any block based fs (this patch includes ext3) just has to declare its own fiemap() function and then call this generic function with its own get_block_t. This works well for block based filesystems that will map multiple contiguous blocks at one time, but will work for filesystems that only map one block at a time, you will just end up with an "extent" for each block. One gotcha is this will not play nicely where there is hole+data after the EOF. This function will assume its hit the end of the data as soon as it hits a hole after the EOF, so if there is any data past that it will not pick that up. AFAIK no block based fs does this anyway, but its in the comments of the function anyway just in case. Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org --- fs/ext2/ext2.h | 2 + fs/ext2/file.c | 1 + fs/ext2/inode.c | 8 +++ fs/ext3/file.c | 1 + fs/ext3/inode.c | 8 +++ fs/ioctl.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++ include/linux/ext3_fs.h | 2 + include/linux/fs.h | 3 + 8 files changed, 143 insertions(+), 0 deletions(-) diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index 47d88da..bae998c 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -133,6 +133,8 @@ extern void ext2_truncate (struct inode *); extern int ext2_setattr (struct dentry *, struct iattr *); extern void ext2_set_inode_flags(struct inode *inode); extern void ext2_get_inode_flags(struct ext2_inode_info *); +extern int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len); int __ext2_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata); diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 5f2fa9c..45ed071 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -86,4 +86,5 @@ const struct inode_operations ext2_file_inode_operations = { #endif .setattr = ext2_setattr, .permission = ext2_permission, + .fiemap = ext2_fiemap, }; diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 991d6df..7658b33 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -31,6 +31,7 @@ #include <linux/writeback.h> #include <linux/buffer_head.h> #include <linux/mpage.h> +#include <linux/fiemap.h> #include "ext2.h" #include "acl.h" #include "xip.h" @@ -704,6 +705,13 @@ int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_ } +int ext2_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len) +{ + return generic_block_fiemap(inode, fieinfo, start, len, + ext2_get_block); +} + static int ext2_writepage(struct page *page, struct writeback_control *wbc) { return block_write_full_page(page, ext2_get_block, wbc); diff --git a/fs/ext3/file.c b/fs/ext3/file.c index acc4913..3be1e06 100644 --- a/fs/ext3/file.c +++ b/fs/ext3/file.c @@ -134,5 +134,6 @@ const struct inode_operations ext3_file_inode_operations = { .removexattr = generic_removexattr, #endif .permission = ext3_permission, + .fiemap = ext3_fiemap, }; diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index 507d868..ebfec4d 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -36,6 +36,7 @@ #include <linux/mpage.h> #include <linux/uio.h> #include <linux/bio.h> +#include <linux/fiemap.h> #include "xattr.h" #include "acl.h" @@ -981,6 +982,13 @@ out: return ret; } +int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len) +{ + return generic_block_fiemap(inode, fieinfo, start, len, + ext3_get_block); +} + /* * `handle' can be NULL if create is zero */ diff --git a/fs/ioctl.c b/fs/ioctl.c index 045d960..33a6b7e 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -13,6 +13,8 @@ #include <linux/security.h> #include <linux/module.h> #include <linux/uaccess.h> +#include <linux/writeback.h> +#include <linux/buffer_head.h> #include <asm/ioctls.h> @@ -224,6 +226,122 @@ static int ioctl_fiemap(struct file *filp, unsigned long arg) return error; } +#define blk_to_logical(inode, blk) (blk << (inode)->i_blkbits) +#define logical_to_blk(inode, offset) (offset >> (inode)->i_blkbits); + +/* + * @inode - the inode to map + * @arg - the pointer to userspace where we copy everything to + * @get_block - the fs's get_block function + * + * This does FIEMAP for block based inodes. Basically it will just loop + * through get_block until we hit the number of extents we want to map, or we + * go past the end of the file and hit a hole. + * + * If it is possible to have data blocks beyond a hole past @inode->i_size, then + * please do not use this function, it will stop at the first unmapped block + * beyond i_size + */ +int generic_block_fiemap(struct inode *inode, + struct fiemap_extent_info *fieinfo, u64 start, + u64 len, get_block_t *get_block) +{ + struct buffer_head tmp; + unsigned int start_blk; + long long length = 0, map_len = 0; + u64 logical = 0, phys = 0, size = 0; + u32 flags = FIEMAP_EXTENT_MERGED; + int ret = 0; + + if ((ret = fiemap_check_flags(fieinfo, FIEMAP_FLAG_SYNC))) + return ret; + + start_blk = logical_to_blk(inode, start); + + /* guard against change */ + mutex_lock(&inode->i_mutex); + + length = (long long)min_t(u64, len, i_size_read(inode)); + map_len = length; + + do { + /* + * we set b_size to the total size we want so it will map as + * many contiguous blocks as possible at once + */ + memset(&tmp, 0, sizeof(struct buffer_head)); + tmp.b_size = map_len; + + ret = get_block(inode, start_blk, &tmp, 0); + if (ret) + break; + + /* HOLE */ + if (!buffer_mapped(&tmp)) { + /* + * first hole after going past the EOF, this is our + * last extent + */ + if (length <= 0) { + flags = FIEMAP_EXTENT_MERGED|FIEMAP_EXTENT_LAST; + ret = fiemap_fill_next_extent(fieinfo, logical, + phys, size, + flags); + break; + } + + length -= blk_to_logical(inode, 1); + + /* if we have holes up to/past EOF then we're done */ + if (length <= 0) + break; + + start_blk++; + } else { + if (length <= 0 && size) { + ret = fiemap_fill_next_extent(fieinfo, logical, + phys, size, + flags); + if (ret) + break; + } + + logical = blk_to_logical(inode, start_blk); + phys = blk_to_logical(inode, tmp.b_blocknr); + size = tmp.b_size; + flags = FIEMAP_EXTENT_MERGED; + + length -= tmp.b_size; + start_blk += logical_to_blk(inode, size); + + /* + * if we are past the EOF we need to loop again to see + * if there is a hole so we can mark this extent as the + * last one, and if not keep mapping things until we + * find a hole, or we run out of slots in the extent + * array + */ + if (length <= 0) + continue; + + ret = fiemap_fill_next_extent(fieinfo, logical, phys, + size, flags); + if (ret) + break; + } + cond_resched(); + } while (1); + + mutex_unlock(&inode->i_mutex); + + /* if ret is 1 then we just hit the end of the extent array */ + if (ret == 1) + ret = 0; + + return ret; +} +EXPORT_SYMBOL(generic_block_fiemap); + static int file_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h index 80171ee..8120fa1 100644 --- a/include/linux/ext3_fs.h +++ b/include/linux/ext3_fs.h @@ -837,6 +837,8 @@ extern void ext3_truncate (struct inode *); extern void ext3_set_inode_flags(struct inode *); extern void ext3_get_inode_flags(struct ext3_inode_info *); extern void ext3_set_aops(struct inode *inode); +extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + u64 start, u64 len); /* ioctl.c */ extern int ext3_ioctl (struct inode *, struct file *, unsigned int, diff --git a/include/linux/fs.h b/include/linux/fs.h index 194fb23..385c9a1 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1998,6 +1998,9 @@ extern int vfs_fstat(unsigned int, struct kstat *); extern int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, unsigned long arg); +extern int generic_block_fiemap(struct inode *inode, + struct fiemap_extent_info *fieinfo, u64 start, + u64 len, get_block_t *get_block); extern void get_filesystem(struct file_system_type *fs); extern void put_filesystem(struct file_system_type *fs); -- 1.5.6.1.205.ge2c7.dirty ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH 4/4] Hook ext4 to the vfs fiemap interface. 2008-10-09 1:48 ` [PATCH 3/4] generic block based fiemap implementation Theodore Ts'o @ 2008-10-09 1:48 ` Theodore Ts'o 0 siblings, 0 replies; 40+ messages in thread From: Theodore Ts'o @ 2008-10-09 1:48 UTC (permalink / raw) To: linux-fsdevel; +Cc: Eric Sandeen, Theodore Ts'o, linux-ext4 From: Eric Sandeen <sandeen@redhat.com> ext4_ext_walk_space() was reinstated to be used for iterating over file extents with a callback; it is used by the ext4 fiemap implementation. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org --- fs/ext4/ext4.h | 2 + fs/ext4/ext4_extents.h | 15 +++ fs/ext4/extents.c | 248 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/ext4/file.c | 4 + fs/ext4/inode.c | 4 +- 5 files changed, 271 insertions(+), 2 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index c50c04c..f46a513 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1067,6 +1067,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int, int *); struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int, int *); +int ext4_get_block(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create); int ext4_get_blocks_handle(handle_t *handle, struct inode *inode, ext4_lblk_t iblock, unsigned long maxblocks, struct buffer_head *bh_result, diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h index d33dc56..bec7ce5 100644 --- a/fs/ext4/ext4_extents.h +++ b/fs/ext4/ext4_extents.h @@ -124,6 +124,19 @@ struct ext4_ext_path { #define EXT4_EXT_CACHE_GAP 1 #define EXT4_EXT_CACHE_EXTENT 2 +/* + * to be called by ext4_ext_walk_space() + * negative retcode - error + * positive retcode - signal for ext4_ext_walk_space(), see below + * callback must return valid extent (passed or newly created) + */ +typedef int (*ext_prepare_callback)(struct inode *, struct ext4_ext_path *, + struct ext4_ext_cache *, + struct ext4_extent *, void *); + +#define EXT_CONTINUE 0 +#define EXT_BREAK 1 +#define EXT_REPEAT 2 #define EXT_MAX_BLOCK 0xffffffff @@ -224,6 +237,8 @@ extern int ext4_ext_try_to_merge(struct inode *inode, struct ext4_extent *); extern unsigned int ext4_ext_check_overlap(struct inode *, struct ext4_extent *, struct ext4_ext_path *); extern int ext4_ext_insert_extent(handle_t *, struct inode *, struct ext4_ext_path *, struct ext4_extent *); +extern int ext4_ext_walk_space(struct inode *, ext4_lblk_t, ext4_lblk_t, + ext_prepare_callback, void *); extern struct ext4_ext_path *ext4_ext_find_extent(struct inode *, ext4_lblk_t, struct ext4_ext_path *); extern int ext4_ext_search_left(struct inode *, struct ext4_ext_path *, diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index c8f81f2..ea2ce3c 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -40,6 +40,7 @@ #include <linux/slab.h> #include <linux/falloc.h> #include <asm/uaccess.h> +#include <linux/fiemap.h> #include "ext4_jbd2.h" #include "ext4_extents.h" @@ -1626,6 +1627,113 @@ cleanup: return err; } +int ext4_ext_walk_space(struct inode *inode, ext4_lblk_t block, + ext4_lblk_t num, ext_prepare_callback func, + void *cbdata) +{ + struct ext4_ext_path *path = NULL; + struct ext4_ext_cache cbex; + struct ext4_extent *ex; + ext4_lblk_t next, start = 0, end = 0; + ext4_lblk_t last = block + num; + int depth, exists, err = 0; + + BUG_ON(func == NULL); + BUG_ON(inode == NULL); + + while (block < last && block != EXT_MAX_BLOCK) { + num = last - block; + /* find extent for this block */ + path = ext4_ext_find_extent(inode, block, path); + if (IS_ERR(path)) { + err = PTR_ERR(path); + path = NULL; + break; + } + + depth = ext_depth(inode); + BUG_ON(path[depth].p_hdr == NULL); + ex = path[depth].p_ext; + next = ext4_ext_next_allocated_block(path); + + exists = 0; + if (!ex) { + /* there is no extent yet, so try to allocate + * all requested space */ + start = block; + end = block + num; + } else if (le32_to_cpu(ex->ee_block) > block) { + /* need to allocate space before found extent */ + start = block; + end = le32_to_cpu(ex->ee_block); + if (block + num < end) + end = block + num; + } else if (block >= le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex)) { + /* need to allocate space after found extent */ + start = block; + end = block + num; + if (end >= next) + end = next; + } else if (block >= le32_to_cpu(ex->ee_block)) { + /* + * some part of requested space is covered + * by found extent + */ + start = block; + end = le32_to_cpu(ex->ee_block) + + ext4_ext_get_actual_len(ex); + if (block + num < end) + end = block + num; + exists = 1; + } else { + BUG(); + } + BUG_ON(end <= start); + + if (!exists) { + cbex.ec_block = start; + cbex.ec_len = end - start; + cbex.ec_start = 0; + cbex.ec_type = EXT4_EXT_CACHE_GAP; + } else { + cbex.ec_block = le32_to_cpu(ex->ee_block); + cbex.ec_len = ext4_ext_get_actual_len(ex); + cbex.ec_start = ext_pblock(ex); + cbex.ec_type = EXT4_EXT_CACHE_EXTENT; + } + + BUG_ON(cbex.ec_len == 0); + err = func(inode, path, &cbex, ex, cbdata); + ext4_ext_drop_refs(path); + + if (err < 0) + break; + + if (err == EXT_REPEAT) + continue; + else if (err == EXT_BREAK) { + err = 0; + break; + } + + if (ext_depth(inode) != depth) { + /* depth was changed. we have to realloc path */ + kfree(path); + path = NULL; + } + + block = cbex.ec_block + cbex.ec_len; + } + + if (path) { + ext4_ext_drop_refs(path); + kfree(path); + } + + return err; +} + static void ext4_ext_put_in_cache(struct inode *inode, ext4_lblk_t block, __u32 len, ext4_fsblk_t start, int type) @@ -2971,3 +3079,143 @@ retry: mutex_unlock(&inode->i_mutex); return ret > 0 ? ret2 : ret; } + +/* + * Callback function called for each extent to gather FIEMAP information. + */ +int ext4_ext_fiemap_cb(struct inode *inode, struct ext4_ext_path *path, + struct ext4_ext_cache *newex, struct ext4_extent *ex, + void *data) +{ + struct fiemap_extent_info *fieinfo = data; + unsigned long blksize_bits = inode->i_sb->s_blocksize_bits; + __u64 logical; + __u64 physical; + __u64 length; + __u32 flags = 0; + int error; + + logical = (__u64)newex->ec_block << blksize_bits; + + if (newex->ec_type == EXT4_EXT_CACHE_GAP) { + pgoff_t offset; + struct page *page; + struct buffer_head *bh = NULL; + + offset = logical >> PAGE_SHIFT; + page = find_get_page(inode->i_mapping, offset); + if (!page || !page_has_buffers(page)) + return EXT_CONTINUE; + + bh = page_buffers(page); + + if (!bh) + return EXT_CONTINUE; + + if (buffer_delay(bh)) { + flags |= FIEMAP_EXTENT_DELALLOC; + page_cache_release(page); + } else { + page_cache_release(page); + return EXT_CONTINUE; + } + } + + physical = (__u64)newex->ec_start << blksize_bits; + length = (__u64)newex->ec_len << blksize_bits; + + if (ex && ext4_ext_is_uninitialized(ex)) + flags |= FIEMAP_EXTENT_UNWRITTEN; + + /* + * If this extent reaches EXT_MAX_BLOCK, it must be last. + * + * Or if ext4_ext_next_allocated_block is EXT_MAX_BLOCK, + * this also indicates no more allocated blocks. + * + * XXX this might miss a single-block extent at EXT_MAX_BLOCK + */ + if (logical + length - 1 == EXT_MAX_BLOCK || + ext4_ext_next_allocated_block(path) == EXT_MAX_BLOCK) + flags |= FIEMAP_EXTENT_LAST; + + error = fiemap_fill_next_extent(fieinfo, logical, physical, + length, flags); + if (error < 0) + return error; + if (error == 1) + return EXT_BREAK; + + return EXT_CONTINUE; +} + +/* fiemap flags we can handle specified here */ +#define EXT4_FIEMAP_FLAGS (FIEMAP_FLAG_SYNC|FIEMAP_FLAG_XATTR) + +int ext4_xattr_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo) +{ + __u64 physical = 0; + __u64 length; + __u32 flags = FIEMAP_EXTENT_LAST; + int blockbits = inode->i_sb->s_blocksize_bits; + int error = 0; + + /* in-inode? */ + if (EXT4_I(inode)->i_state & EXT4_STATE_XATTR) { + struct ext4_iloc iloc; + int offset; /* offset of xattr in inode */ + + error = ext4_get_inode_loc(inode, &iloc); + if (error) + return error; + physical = iloc.bh->b_blocknr << blockbits; + offset = EXT4_GOOD_OLD_INODE_SIZE + + EXT4_I(inode)->i_extra_isize; + physical += offset; + length = EXT4_SB(inode->i_sb)->s_inode_size - offset; + flags |= FIEMAP_EXTENT_DATA_INLINE; + } else { /* external block */ + physical = EXT4_I(inode)->i_file_acl << blockbits; + length = inode->i_sb->s_blocksize; + } + + if (physical) + error = fiemap_fill_next_extent(fieinfo, 0, physical, + length, flags); + return (error < 0 ? error : 0); +} + +int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + __u64 start, __u64 len) +{ + ext4_lblk_t start_blk; + ext4_lblk_t len_blks; + int error = 0; + + /* fallback to generic here if not in extents fmt */ + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)) + return generic_block_fiemap(inode, fieinfo, start, len, + ext4_get_block); + + if (fiemap_check_flags(fieinfo, EXT4_FIEMAP_FLAGS)) + return -EBADR; + + if (fieinfo->fi_flags & FIEMAP_FLAG_XATTR) { + error = ext4_xattr_fiemap(inode, fieinfo); + } else { + start_blk = start >> inode->i_sb->s_blocksize_bits; + len_blks = len >> inode->i_sb->s_blocksize_bits; + + /* + * Walk the extent tree gathering extent information. + * ext4_ext_fiemap_cb will push extents back to user. + */ + down_write(&EXT4_I(inode)->i_data_sem); + error = ext4_ext_walk_space(inode, start_blk, len_blks, + ext4_ext_fiemap_cb, fieinfo); + up_write(&EXT4_I(inode)->i_data_sem); + } + + return error; +} + diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 62796b7..6d5be15 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -140,6 +140,9 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) return 0; } +extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo, + __u64 start, __u64 len); + const struct file_operations ext4_file_operations = { .llseek = generic_file_llseek, .read = do_sync_read, @@ -170,5 +173,6 @@ const struct inode_operations ext4_file_inode_operations = { #endif .permission = ext4_permission, .fallocate = ext4_fallocate, + .fiemap = ext4_fiemap, }; diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index bd770c3..a474786 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1135,8 +1135,8 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block, /* Maximum number of blocks we map for direct IO at once. */ #define DIO_MAX_BLOCKS 4096 -static int ext4_get_block(struct inode *inode, sector_t iblock, - struct buffer_head *bh_result, int create) +int ext4_get_block(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create) { handle_t *handle = ext4_journal_current_handle(); int ret = 0, started = 0; -- 1.5.6.1.205.ge2c7.dirty ^ permalink raw reply related [flat|nested] 40+ messages in thread
end of thread, other threads:[~2008-10-09 1:48 UTC | newest]
Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-09-13 18:47 YET ANOTHER resend of the fiemap patches Theodore Ts'o
2008-09-13 18:49 ` [PATCH 1/4] vfs: vfs-level fiemap interface Theodore Ts'o
2008-09-13 18:49 ` [PATCH 2/4] ocfs2: fiemap support Theodore Ts'o
2008-09-13 18:49 ` [PATCH 3/4] generic block based fiemap implementation Theodore Ts'o
2008-09-13 18:49 ` [PATCH 4/4] Hook ext4 to the vfs fiemap interface Theodore Ts'o
2008-09-13 21:17 ` [PATCH 1/4] vfs: vfs-level " Anton Altaparmakov
2008-09-13 21:29 ` Theodore Tso
2008-09-13 21:45 ` Matthew Wilcox
2008-09-13 22:41 ` Theodore Tso
2008-09-13 21:59 ` Anton Altaparmakov
2008-09-14 13:51 ` Christoph Hellwig
2008-09-14 13:48 ` Christoph Hellwig
2008-09-14 17:58 ` Theodore Tso
2008-09-15 14:49 ` Christoph Hellwig
2008-09-15 17:53 ` Mark Fasheh
2008-09-16 6:51 ` Andreas Dilger
2008-09-16 21:31 ` Theodore Tso
2008-09-20 16:47 ` Josef 'Jeff' Sipek
2008-09-29 1:07 ` Theodore Tso
2008-09-29 21:13 ` Mark Fasheh
2008-09-29 22:10 ` Theodore Tso
2008-09-14 13:47 ` Christoph Hellwig
2008-09-14 18:01 ` Theodore Tso
2008-09-14 18:08 ` Christoph Hellwig
2008-09-14 19:58 ` Theodore Tso
2008-09-15 14:47 ` Christoph Hellwig
2008-09-16 6:49 ` Andreas Dilger
2008-09-16 22:03 ` Theodore Tso
2008-09-17 14:18 ` Jörn Engel
2008-09-17 15:02 ` Jamie Lokier
2008-09-17 15:25 ` Theodore Tso
2008-09-19 14:05 ` Chris Mason
2008-09-19 17:38 ` Jamie Lokier
2008-09-20 7:43 ` Dave Chinner
2008-09-20 13:50 ` Chris Mason
[not found] ` <20080920135048.GA15698@think.oraclecorp.com>
2008-09-20 15:36 ` Jamie Lokier
2008-09-20 15:44 ` Christoph Hellwig
-- strict thread matches above, loose matches on Subject: below --
2008-10-03 21:59 [PATCH 0/4] Updated fiemap patches Theodore Ts'o
2008-10-03 21:59 ` [PATCH 1/4] vfs: vfs-level fiemap interface Theodore Ts'o
2008-10-03 21:59 ` [PATCH 2/4] ocfs2: fiemap support Theodore Ts'o
2008-10-03 21:59 ` [PATCH 3/4] generic block based fiemap implementation Theodore Ts'o
2008-10-03 21:59 ` [PATCH 4/4] Hook ext4 to the vfs fiemap interface Theodore Ts'o
2008-10-07 1:06 PATCH [0/4] Updated**2 fiemap patches Theodore Ts'o
2008-10-07 1:06 ` [PATCH 1/4] vfs: vfs-level fiemap interface Theodore Ts'o
2008-10-07 1:06 ` [PATCH 2/4] ocfs2: fiemap support Theodore Ts'o
2008-10-07 1:06 ` [PATCH 3/4] generic block based fiemap implementation Theodore Ts'o
2008-10-07 1:06 ` [PATCH 4/4] Hook ext4 to the vfs fiemap interface Theodore Ts'o
2008-10-09 1:48 PATCH [0/4] Updated**3 fiemap patches Theodore Ts'o
2008-10-09 1:48 ` [PATCH 1/4] vfs: vfs-level fiemap interface Theodore Ts'o
2008-10-09 1:48 ` [PATCH 2/4] ocfs2: fiemap support Theodore Ts'o
2008-10-09 1:48 ` [PATCH 3/4] generic block based fiemap implementation Theodore Ts'o
2008-10-09 1:48 ` [PATCH 4/4] Hook ext4 to the vfs fiemap interface Theodore Ts'o
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).