* [PATCH 0/3 v. 8] Ext3/Ext4 Batched discard support @ 2010-09-24 15:35 Lukas Czerner 2010-09-24 15:35 ` [PATCH 1/3] Add ioctl FITRIM Lukas Czerner ` (3 more replies) 0 siblings, 4 replies; 21+ messages in thread From: Lukas Czerner @ 2010-09-24 15:35 UTC (permalink / raw) To: linux-ext4; +Cc: tytso, rwheeler, sandeen, adilger, lczerner, snitzer Hi, since it has been a while from my last post, I have rebased patches to cleanly apply to the newest kernel. Also there have been some changes regarding blkdev_issue_discard() and its helpers, I have changed patches to use the new helper (ext4_issue_discard() for ext4) and new format of sb_issue_discard() (in ext3). Also in order to give the user better control over the discard I have changed FITRIM ioctl to return the exact amount of discarded Bytes, so the user will know how much space has been reclaimed to wear-leveling. SHORT DESCRIPTION: ================== Batched discard adds ability to discard free space on mounded filesystem, in order to avoid using current discard implementation which discards recently freed blocks. This approach may on some devices (it depends on how efficient is the device wear-leveling algorithm) result in huge performance loss. Batched discard can be invoked from user-space through FITRIM ioctl on the whole, or just a part, of file system. With this approach we are searching for continuous free blocks bigger than defined through ioctl to discard them. So, since we are searching for big continuous extents it is much more efficient than current approach and it gives user fine grained control over how much disk space will be reclaimed for wear-leveling and what impact will it have on performance. I have attached source code for example application which uses FITRIM to discard just a part or whole filesystem. Since FITRIM is filesystem independent ioctl it can be used by any filesystem which supports it. Usage: fstrim [-s start] [-l length] [-m minimum-extent] [-v] directory -s Starting Byte to discard from -l Number of Bytes to discard from the start -m Minimum extent length to discard -v Verbose - prints out number of really discarded Bytes --- a1edaf3 ext3: Add batched discard support 4ed7168 ext4: Add batched discard support 88f8c20 Add ioctl FITRIM. fs/ext3/balloc.c | 251 +++++++++++++++++++++++++++++++++++++++++++++++ fs/ext3/super.c | 1 + fs/ext4/ext4.h | 2 + fs/ext4/mballoc.c | 191 +++++++++++++++++++++++++++++++++++ fs/ext4/super.c | 1 + fs/ioctl.c | 39 +++++++ include/linux/ext3_fs.h | 1 + include/linux/fs.h | 2 + 8 files changed, 488 insertions(+), 0 deletions(-) ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 1/3] Add ioctl FITRIM. 2010-09-24 15:35 [PATCH 0/3 v. 8] Ext3/Ext4 Batched discard support Lukas Czerner @ 2010-09-24 15:35 ` Lukas Czerner 2010-09-24 17:03 ` Andreas Dilger 2010-09-24 15:35 ` [PATCH 2/3] ext4: Add batched discard support Lukas Czerner ` (2 subsequent siblings) 3 siblings, 1 reply; 21+ messages in thread From: Lukas Czerner @ 2010-09-24 15:35 UTC (permalink / raw) To: linux-ext4; +Cc: tytso, rwheeler, sandeen, adilger, lczerner, snitzer Adds an filesystem independent ioctl to allow implementation of file system batched discard support. It takes an array of three uint64_t as an argument, the meaning of each item is as follows: array[0] - (start) first Byte to trim array[1] - (len) number of Bytes to trim from start array[2] - (minlen) minimum extent length to trim, free extents shorter than this number of Bytes will be ignored. This number will be rounded up to the block size. It is also possible to specify NULL as an argument. In this case the arguments will set itself as follows: args[0] = 0; args[1] = ULLONG_MAX; args[2] = 0; So it will trim the whole file system at one run. After the FITRIM is done, the number of actually discarded Bytes is stored in args[1] (len) to give the user better insight on how much storage space was really released for wear-leveling. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/ioctl.c | 39 +++++++++++++++++++++++++++++++++++++++ include/linux/fs.h | 2 ++ 2 files changed, 41 insertions(+), 0 deletions(-) diff --git a/fs/ioctl.c b/fs/ioctl.c index f855ea4..3dd96b6 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -530,6 +530,41 @@ static int ioctl_fsthaw(struct file *filp) return thaw_super(sb); } +static int ioctl_fstrim(struct file *filp, void __user *argp) +{ + struct super_block *sb = filp->f_path.dentry->d_inode->i_sb; + uint64_t args[3]; + int ret = 0; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + /* If filesystem doesn't support trim feature, return. */ + if (sb->s_op->trim_fs == NULL) + return -EOPNOTSUPP; + + /* If a blockdevice-backed filesystem isn't specified, return EINVAL. */ + if (sb->s_bdev == NULL) + return -EINVAL; + + if (argp == NULL) { + args[0] = 0; + args[1] = ULLONG_MAX; + args[2] = 0; + } else if (copy_from_user(args, argp, sizeof(args))) + return -EFAULT; + + ret = sb->s_op->trim_fs(sb, args); + if (ret < 0) + return ret; + + if ((argp != NULL) && + (copy_to_user(argp, args, sizeof(args)))) + return -EFAULT; + + return 0; +} + /* * When you add any new common ioctls to the switches above and below * please update compat_sys_ioctl() too. @@ -580,6 +615,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, error = ioctl_fsthaw(filp); break; + case FITRIM: + error = ioctl_fstrim(filp, argp); + break; + case FS_IOC_FIEMAP: return ioctl_fiemap(filp, arg); diff --git a/include/linux/fs.h b/include/linux/fs.h index 76041b6..e1b8a4a 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -316,6 +316,7 @@ struct inodes_stat_t { #define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */ #define FIFREEZE _IOWR('X', 119, int) /* Freeze */ #define FITHAW _IOWR('X', 120, int) /* Thaw */ +#define FITRIM _IOWR('X', 121, uint64_t) /* Trim */ #define FS_IOC_GETFLAGS _IOR('f', 1, long) #define FS_IOC_SETFLAGS _IOW('f', 2, long) @@ -1577,6 +1578,7 @@ struct super_operations { ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); #endif int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t); + int (*trim_fs) (struct super_block *, uint64_t *); }; /* -- 1.7.2.3 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 1/3] Add ioctl FITRIM. 2010-09-24 15:35 ` [PATCH 1/3] Add ioctl FITRIM Lukas Czerner @ 2010-09-24 17:03 ` Andreas Dilger 2010-09-27 9:20 ` Lukas Czerner 0 siblings, 1 reply; 21+ messages in thread From: Andreas Dilger @ 2010-09-24 17:03 UTC (permalink / raw) To: Lukas Czerner; +Cc: linux-ext4, tytso, rwheeler, sandeen, snitzer On 2010-09-24, at 09:35, Lukas Czerner wrote: > array[0] - (start) first Byte to trim > array[1] - (len) number of Bytes to trim from start > array[2] - (minlen) minimum extent length to trim, free extents shorter > > than this number of Bytes will be ignored. This number will be rounded > up to the block size. > > It is also possible to specify NULL as an argument. In this case the > arguments will set itself as follows: > > args[0] = 0; > args[1] = ULLONG_MAX; > args[2] = 0; (minor) you use "array[]" in one place and "args[]" in another. > +#define FITRIM _IOWR('X', 121, uint64_t) /* Trim */ This ioctl definition isn't strictly correct. The "size" parameter is actually uint64_t[3], though I'm not sure whether you need to define a structure in order to specify this. Cheers, Andreas ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/3] Add ioctl FITRIM. 2010-09-24 17:03 ` Andreas Dilger @ 2010-09-27 9:20 ` Lukas Czerner 0 siblings, 0 replies; 21+ messages in thread From: Lukas Czerner @ 2010-09-27 9:20 UTC (permalink / raw) To: Andreas Dilger Cc: Lukas Czerner, linux-ext4, tytso, rwheeler, sandeen, snitzer On Fri, 24 Sep 2010, Andreas Dilger wrote: > On 2010-09-24, at 09:35, Lukas Czerner wrote: > > array[0] - (start) first Byte to trim > > array[1] - (len) number of Bytes to trim from start > > array[2] - (minlen) minimum extent length to trim, free extents shorter > > > > than this number of Bytes will be ignored. This number will be rounded > > up to the block size. > > > > It is also possible to specify NULL as an argument. In this case the > > arguments will set itself as follows: > > > > args[0] = 0; > > args[1] = ULLONG_MAX; > > args[2] = 0; > > (minor) you use "array[]" in one place and "args[]" in another. > > > +#define FITRIM _IOWR('X', 121, uint64_t) /* Trim */ > > This ioctl definition isn't strictly correct. The "size" parameter is actually uint64_t[3], though I'm not sure whether you need to define a structure in order to specify this. You,re probably right. A structure would be better. But for example BLKDISCARD takes takes uint64_t[2] and its definition looks like: #define BLKDISCARD _IO(0x12,119) which confuses me a bit. > > Cheers, Andreas ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 2/3] ext4: Add batched discard support 2010-09-24 15:35 [PATCH 0/3 v. 8] Ext3/Ext4 Batched discard support Lukas Czerner 2010-09-24 15:35 ` [PATCH 1/3] Add ioctl FITRIM Lukas Czerner @ 2010-09-24 15:35 ` Lukas Czerner 2010-09-24 15:35 ` [PATCH 3/3] ext3: " Lukas Czerner 2010-09-24 15:38 ` [PATCH 0/3 v. 8] Ext3/Ext4 Batched " Lukas Czerner 3 siblings, 0 replies; 21+ messages in thread From: Lukas Czerner @ 2010-09-24 15:35 UTC (permalink / raw) To: linux-ext4 Cc: tytso, rwheeler, sandeen, adilger, lczerner, snitzer, Dmitry Monakhov Walk through allocation groups and trim all free extents. It can be invoked through FITRIM ioctl on the file system. The main idea is to provide a way to trim the whole file system if needed, since some SSD's may suffer from performance loss after the whole device was filled (it does not mean that fs is full!). It search for free extents in allocation groups specified by Byte range start -> start+len. When the free extent is within this range, blocks are marked as used and then trimmed. Afterwards these blocks are marked as free in per-group bitmap. Since fstrim is a long operation it is good to have an ability to interrupt it by a signal. This was added by Dmitry Monakhov. Thanks Dimitry. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/ext4/ext4.h | 2 + fs/ext4/mballoc.c | 191 +++++++++++++++++++++++++++++++++++++++++++++++++++++ fs/ext4/super.c | 1 + 3 files changed, 194 insertions(+), 0 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 889ec9d..693ce4e 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1630,6 +1630,8 @@ extern int ext4_mb_add_groupinfo(struct super_block *sb, extern int ext4_mb_get_buddy_cache_lock(struct super_block *, ext4_group_t); extern void ext4_mb_put_buddy_cache_lock(struct super_block *, ext4_group_t, int); +extern int ext4_trim_fs(struct super_block *, uint64_t *); + /* inode.c */ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int, int *); diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 4b4ad4b..1328e50 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4689,3 +4689,194 @@ error_return: kmem_cache_free(ext4_ac_cachep, ac); return; } + +/** + * ext4_trim_extent -- function to TRIM one single free extent in the group + * @sb: super block for the file system + * @start: starting block of the free extent in the alloc. group + * @count: number of blocks to TRIM + * @group: alloc. group we are working with + * @e4b: ext4 buddy for the group + * + * Trim "count" blocks starting at "start" in the "group". To assure that no + * one will allocate those blocks, mark it as used in buddy bitmap. This must + * be called with under the group lock. + */ +static int ext4_trim_extent(struct super_block *sb, int start, int count, + ext4_group_t group, struct ext4_buddy *e4b) +{ + struct ext4_free_extent ex; + int ret = 0; + + assert_spin_locked(ext4_group_lock_ptr(sb, group)); + + ex.fe_start = start; + ex.fe_group = group; + ex.fe_len = count; + + /* + * Mark blocks used, so no one can reuse them while + * being trimmed. + */ + mb_mark_used(e4b, &ex); + ext4_unlock_group(sb, group); + + ret = ext4_issue_discard(sb, group, start, count); + if (ret) + ext4_std_error(sb, ret); + + ext4_lock_group(sb, group); + mb_free_blocks(NULL, e4b, start, ex.fe_len); + return ret; +} + +/** + * ext4_trim_all_free -- function to trim all free space in alloc. group + * @sb: super block for file system + * @e4b: ext4 buddy + * @start: first group block to examine + * @max: last group block to examine + * @minblocks: minimum extent block count + * + * ext4_trim_all_free walks through group's buddy bitmap searching for free + * extents. When the free block is found, ext4_trim_extent is called to TRIM + * the extent. + * + * + * ext4_trim_all_free walks through group's block bitmap searching for free + * extents. When the free extent is found, mark it as used in group buddy + * bitmap. Then issue a TRIM command on this extent and free the extent in + * the group buddy bitmap. This is done until whole group is scanned. + */ +ext4_grpblk_t ext4_trim_all_free(struct super_block *sb, struct ext4_buddy *e4b, + ext4_grpblk_t start, ext4_grpblk_t max, ext4_grpblk_t minblocks) +{ + void *bitmap; + ext4_grpblk_t next, count = 0; + ext4_group_t group; + int ret = 0; + + BUG_ON(e4b == NULL); + + bitmap = e4b->bd_bitmap; + group = e4b->bd_group; + start = (e4b->bd_info->bb_first_free > start) ? + e4b->bd_info->bb_first_free : start; + ext4_lock_group(sb, group); + + while (start < max) { + + start = mb_find_next_zero_bit(bitmap, max, start); + if (start >= max) + break; + next = mb_find_next_bit(bitmap, max, start); + + if ((next - start) >= minblocks) { + ret = ext4_trim_extent(sb, start, + next - start, group, e4b); + if (ret < 0) + break; + count += next - start; + } + start = next + 1; + + if (fatal_signal_pending(current)) { + count = -ERESTARTSYS; + break; + } + + if (need_resched()) { + ext4_unlock_group(sb, group); + cond_resched(); + ext4_lock_group(sb, group); + } + + if ((e4b->bd_info->bb_free - count) < minblocks) + break; + } + ext4_unlock_group(sb, group); + + ext4_debug("trimmed %d blocks in the group %d\n", + count, group); + + if (ret < 0) + count = ret; + + return count; +} + +/** + * ext4_trim_fs() -- trim ioctl handle function + * @sb: superblock for filesystem + * @args: Array of arguments (start, len, minlen) + * + * start: First Byte to trim + * len: number of Bytes to trim from start + * minlen: minimum extent length in Bytes + * ext4_trim_fs goes through all allocation groups containing Bytes from + * start to start+len. For each such a group ext4_trim_all_free function + * is invoked to trim all free space. + */ +int ext4_trim_fs(struct super_block *sb, uint64_t *args) +{ + struct ext4_buddy e4b; + ext4_fsblk_t first_group, last_group; + ext4_group_t group, ngroups = ext4_get_groups_count(sb); + ext4_grpblk_t cnt = 0, first_block, last_block; + struct ext4_super_block *es = EXT4_SB(sb)->s_es; + uint64_t start, len, minlen, trimmed; + int ret = 0; + + start = args[0] >> sb->s_blocksize_bits; + len = args[1] >> sb->s_blocksize_bits; + minlen = args[2] >> sb->s_blocksize_bits; + trimmed = 0; + + if (unlikely(minlen > EXT4_BLOCKS_PER_GROUP(sb))) + return -EINVAL; + + /* Determine first and last group to examine based on start and len */ + first_group = (start - le32_to_cpu(es->s_first_data_block)) / + EXT4_BLOCKS_PER_GROUP(sb); + last_group = (start + len - le32_to_cpu(es->s_first_data_block)) / + EXT4_BLOCKS_PER_GROUP(sb); + last_group = (last_group > ngroups - 1) ? ngroups - 1 : last_group; + + if (first_group > last_group) + return -EINVAL; + + first_block = (start - le32_to_cpu(es->s_first_data_block)) % + EXT4_BLOCKS_PER_GROUP(sb); + last_block = EXT4_BLOCKS_PER_GROUP(sb); + + for (group = first_group; group <= last_group; group++) { + + ret = ext4_mb_load_buddy(sb, group, &e4b); + if (ret) { + ext4_error(sb, "Error in loading buddy " + "information for %u", group); + break; + } + + if (len >= EXT4_BLOCKS_PER_GROUP(sb)) + len -= (EXT4_BLOCKS_PER_GROUP(sb) - first_block); + else + last_block = len; + + if (e4b.bd_info->bb_free >= minlen) { + cnt = ext4_trim_all_free(sb, &e4b, first_block, + last_block, minlen); + if (cnt < 0) { + ret = cnt; + ext4_mb_unload_buddy(&e4b); + break; + } + } + ext4_mb_unload_buddy(&e4b); + trimmed += cnt; + first_block = 0; + } + args[1] = trimmed * sb->s_blocksize; + + return ret; +} diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 2614774..686b849 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1173,6 +1173,7 @@ static const struct super_operations ext4_sops = { .quota_write = ext4_quota_write, #endif .bdev_try_to_free_page = bdev_try_to_free_page, + .trim_fs = ext4_trim_fs }; static const struct super_operations ext4_nojournal_sops = { -- 1.7.2.3 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 3/3] ext3: Add batched discard support 2010-09-24 15:35 [PATCH 0/3 v. 8] Ext3/Ext4 Batched discard support Lukas Czerner 2010-09-24 15:35 ` [PATCH 1/3] Add ioctl FITRIM Lukas Czerner 2010-09-24 15:35 ` [PATCH 2/3] ext4: Add batched discard support Lukas Czerner @ 2010-09-24 15:35 ` Lukas Czerner 2010-09-24 15:38 ` [PATCH 0/3 v. 8] Ext3/Ext4 Batched " Lukas Czerner 3 siblings, 0 replies; 21+ messages in thread From: Lukas Czerner @ 2010-09-24 15:35 UTC (permalink / raw) To: linux-ext4; +Cc: tytso, rwheeler, sandeen, adilger, lczerner, snitzer Walk through allocation groups and trim all free extents. It can be invoked through FITRIM ioctl on the file system. The main idea is to provide a way to trim the whole file system if needed, since some SSD's may suffer from performance loss after the whole device was filled (it does not mean that fs is full!). It search for free extents in allocation groups specified by Byte range start -> start+len. When the free extent is within this range, blocks are marked as used and then trimmed. Afterwards these blocks are marked as free in per-group bitmap. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/ext3/balloc.c | 251 +++++++++++++++++++++++++++++++++++++++++++++++ fs/ext3/super.c | 1 + include/linux/ext3_fs.h | 1 + 3 files changed, 253 insertions(+), 0 deletions(-) diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c index 4a32511..25fbc2f 100644 --- a/fs/ext3/balloc.c +++ b/fs/ext3/balloc.c @@ -20,6 +20,7 @@ #include <linux/ext3_jbd.h> #include <linux/quotaops.h> #include <linux/buffer_head.h> +#include <linux/blkdev.h> /* * balloc.c contains the blocks allocation and deallocation routines @@ -1882,3 +1883,253 @@ unsigned long ext3_bg_num_gdb(struct super_block *sb, int group) return ext3_bg_num_gdb_meta(sb,group); } + +/** + * ext3_trim_all_free -- function to trim all free space in alloc. group + * @sb: super block for file system + * @group: allocation group to trim + * @start: first group block to examine + * @max: last group block to examine + * @gdp: allocation group description structure + * @minblocks: minimum extent block count + * + * ext3_trim_all_free walks through group's block bitmap searching for free + * blocks. When the free block is found, it tries to allocate this block and + * consequent free block to get the biggest free extent possible, until it + * reaches any used block. Then issue a TRIM command on this extent and free + * the extent in the block bitmap. This is done until whole group is scanned. + */ +ext3_grpblk_t ext3_trim_all_free(struct super_block *sb, unsigned int group, + ext3_grpblk_t start, ext3_grpblk_t max, + ext3_grpblk_t minblocks) +{ + handle_t *handle; + ext3_grpblk_t next, count = 0, bit; + struct ext3_sb_info *sbi; + ext3_fsblk_t discard_block; + struct buffer_head *bitmap_bh = NULL; + struct buffer_head *gdp_bh; + ext3_grpblk_t free_blocks; + struct ext3_group_desc *gdp; + int err = 0, ret = 0; + ext3_grpblk_t freed; + + /* + * We will update one block bitmap, and one group descriptor + */ + handle = ext3_journal_start_sb(sb, 2); + if (IS_ERR(handle)) { + err = PTR_ERR(handle); + return err; + } + + bitmap_bh = read_block_bitmap(sb, group); + if (!bitmap_bh) + goto err_out; + + BUFFER_TRACE(bitmap_bh, "getting undo access"); + err = ext3_journal_get_undo_access(handle, bitmap_bh); + if (err) + goto err_out; + + gdp = ext3_get_group_desc(sb, group, &gdp_bh); + if (!gdp) + goto err_out; + + BUFFER_TRACE(gdp_bh, "get_write_access"); + err = ext3_journal_get_write_access(handle, gdp_bh); + if (err) + goto err_out; + + free_blocks = le16_to_cpu(gdp->bg_free_blocks_count); + sbi = EXT3_SB(sb); + + /* Walk through the whole group */ + while (start < max) { + + start = bitmap_search_next_usable_block(start, bitmap_bh, max); + if (start < 0) + break; + next = start; + + /* + * Allocate contiguous free extents by setting bits in the + * block bitmap + */ + while (next < max + && claim_block(sb_bgl_lock(sbi, group), + next, bitmap_bh)) { + next++; + } + + /* We did not claim any blocks */ + if (next == start) + continue; + + discard_block = (ext3_fsblk_t)start + + ext3_group_first_block_no(sb, group); + + /* Update counters */ + spin_lock(sb_bgl_lock(sbi, group)); + le16_add_cpu(&gdp->bg_free_blocks_count, start - next); + spin_unlock(sb_bgl_lock(sbi, group)); + percpu_counter_sub(&sbi->s_freeblocks_counter, next - start); + + /* Do not issue a TRIM on extents smaller than minblocks */ + if ((next - start) < minblocks) + goto free_extent; + + /* Send the TRIM command down to the device */ + ret = sb_issue_discard(sb, discard_block, next - start, + GFP_NOFS, 0); + count += (next - start); + +free_extent: + freed = 0; + + /* + * Clear bits in the bitmap + */ + for (bit = start; bit < next; bit++) { + BUFFER_TRACE(bitmap_bh, "clear bit"); + if (!ext3_clear_bit_atomic(sb_bgl_lock(sbi, group), + bit, bitmap_bh->b_data)) { + ext3_error(sb, __func__, + "bit already cleared for block "E3FSBLK, + (unsigned long)bit); + BUFFER_TRACE(bitmap_bh, "bit already cleared"); + } else { + freed++; + } + } + + /* Update couters */ + spin_lock(sb_bgl_lock(sbi, group)); + le16_add_cpu(&gdp->bg_free_blocks_count, freed); + spin_unlock(sb_bgl_lock(sbi, group)); + percpu_counter_add(&sbi->s_freeblocks_counter, next - start); + + start = next; + + if (ret < 0) { + if (ret == -EOPNOTSUPP) { + ext3_warning(sb, __func__, + "discard not supported!"); + count = ret; + break; + } + err = ret; + break; + } + + if (fatal_signal_pending(current)) { + count = -ERESTARTSYS; + break; + } + + cond_resched(); + + /* No more suitable extents */ + if ((free_blocks - count) < minblocks) + break; + } + + /* We dirtied the bitmap block */ + BUFFER_TRACE(bitmap_bh, "dirtied bitmap block"); + err = ext3_journal_dirty_metadata(handle, bitmap_bh); + + /* And the group descriptor block */ + BUFFER_TRACE(gdp_bh, "dirtied group descriptor block"); + ret = ext3_journal_dirty_metadata(handle, gdp_bh); + if (!err) + err = ret; + + ext3_debug("trimmed %d blocks in the group %d\n", + count, group); + +err_out: + if (err) { + ext3_std_error(sb, err); + count = err; + } + + ext3_journal_stop(handle); + brelse(bitmap_bh); + + return count; +} + +/** + * ext3_trim_fs() -- trim ioctl handle function + * @sb: superblock for filesystem + * @start: First Byte to trim + * @len: number of Bytes to trim from start + * @minlen: minimum extent length in Bytes + * + * ext3_trim_fs goes through all allocation groups containing Bytes from + * start to start+len. For each such a group ext3_trim_all_free function + * is invoked to trim all free space. + */ +int ext3_trim_fs(struct super_block *sb, uint64_t start, uint64_t len, + uint64_t minlen) +{ + ext3_grpblk_t last_block, first_block, free_blocks; + unsigned long long first_group, last_group; + unsigned long group, ngroups; + struct ext3_group_desc *gdp; + struct ext3_super_block *es; + int ret = 0; + + start >>= sb->s_blocksize_bits; + len >>= sb->s_blocksize_bits; + minlen >>= sb->s_blocksize_bits; + + if (unlikely(minlen > EXT3_BLOCKS_PER_GROUP(sb))) + return -EINVAL; + + es = EXT3_SB(sb)->s_es; + ngroups = EXT3_SB(sb)->s_groups_count; + smp_rmb(); + + /* Determine first and last group to examine based on start and len */ + first_group = (start - le32_to_cpu(es->s_first_data_block)) / + EXT3_BLOCKS_PER_GROUP(sb); + last_group = (start + len - le32_to_cpu(es->s_first_data_block)) / + EXT3_BLOCKS_PER_GROUP(sb); + last_group = (last_group > ngroups - 1) ? ngroups - 1 : last_group; + + if (first_group > last_group) + return -EINVAL; + + first_block = (start - le32_to_cpu(es->s_first_data_block)) % + EXT3_BLOCKS_PER_GROUP(sb); + last_block = EXT3_BLOCKS_PER_GROUP(sb); + + for (group = first_group; group <= last_group; group++) { + + gdp = ext3_get_group_desc(sb, group, NULL); + if (!gdp) + break; + + free_blocks = le16_to_cpu(gdp->bg_free_blocks_count); + if (free_blocks < minlen) + continue; + + if (len >= EXT3_BLOCKS_PER_GROUP(sb)) + len -= (EXT3_BLOCKS_PER_GROUP(sb) - first_block); + else + last_block = len; + + ret = ext3_trim_all_free(sb, group, first_block, + last_block, minlen); + if (ret < 0) + break; + + first_block = 0; + } + + if (ret >= 0) + ret = 0; + + return ret; +} diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 5dbf4db..2e095db 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -782,6 +782,7 @@ static const struct super_operations ext3_sops = { .quota_write = ext3_quota_write, #endif .bdev_try_to_free_page = bdev_try_to_free_page, + .trim_fs = ext3_trim_fs, }; static const struct export_operations ext3_export_ops = { diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h index 6ce1bca..07040dc 100644 --- a/include/linux/ext3_fs.h +++ b/include/linux/ext3_fs.h @@ -856,6 +856,7 @@ extern struct ext3_group_desc * ext3_get_group_desc(struct super_block * sb, extern int ext3_should_retry_alloc(struct super_block *sb, int *retries); extern void ext3_init_block_alloc_info(struct inode *); extern void ext3_rsv_window_add(struct super_block *sb, struct ext3_reserve_window_node *rsv); +extern int ext3_trim_fs(struct super_block *sb, uint64_t *args); /* dir.c */ extern int ext3_check_dir_entry(const char *, struct inode *, -- 1.7.2.3 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 0/3 v. 8] Ext3/Ext4 Batched discard support 2010-09-24 15:35 [PATCH 0/3 v. 8] Ext3/Ext4 Batched discard support Lukas Czerner ` (2 preceding siblings ...) 2010-09-24 15:35 ` [PATCH 3/3] ext3: " Lukas Czerner @ 2010-09-24 15:38 ` Lukas Czerner 2010-09-24 17:19 ` Andreas Dilger 3 siblings, 1 reply; 21+ messages in thread From: Lukas Czerner @ 2010-09-24 15:38 UTC (permalink / raw) To: Lukas Czerner; +Cc: linux-ext4, tytso, rwheeler, sandeen, adilger, snitzer FSTRIM ======= /* * e2trim.c - discard the part (or whole) mounted filesystem. * * Copyright (C) 2009 Red Hat, Inc., Lukas Czerner <lczerner@redhat.com> * * %Begin-Header% * This file may be redistributed under the terms of the GNU Public * License. * %End-Header% * * Usage: e2trim [options] <mount point> */ #include <string.h> #include <unistd.h> #include <stdlib.h> #include <errno.h> #include <stdio.h> #include <stdint.h> #include <fcntl.h> #include <limits.h> #ifdef HAVE_GETOPT_H #include <getopt.h> #else extern char *optarg; extern int optind; #endif #include <sys/ioctl.h> #include <sys/stat.h> #include <linux/fs.h> #ifndef FITRIM #define FITRIM _IOWR('X', 121, uint64_t) #endif const char *program_name = "fstrim"; struct options { uint64_t *range; char mpoint[PATH_MAX]; char verbose; }; static void usage(void) { fprintf(stderr, "Usage: %s [-s start] [-l length] [-m minimum-extent]" " [-v] directory\n\t-s Starting Byte to discard from\n" "\t-l Number of Bytes to discard from the start\n" "\t-m Minimum extent length to discard\n" "\t-v Verbose - prints out number of really discarded Bytes\n", program_name); } /** * Get the number from argument. It can be number followed by * units: k|K,m|M,g|G */ static unsigned long long get_number(char **optarg) { char *opt; unsigned long long number,max; /* get the max to avoid overflow */ max = ULLONG_MAX / 10; number = 0; opt = *optarg; /* compute the number */ while ((*opt >= '0') && (*opt <= '9') && (number < max)) { number = number * 10 + *opt++ - '0'; } while (1) { /* determine if units are defined */ switch(*opt++) { case 'K': /* kilobytes */ case 'k': number *= 1024; break; case 'M': /* megabytes */ case 'm': number *= 1024 * 1024; break; case 'G': /* gigabytes */ case 'g': number *= 1024 * 1024 * 1024; break; case ':': /* delimiter */ if ((number > max) || (number == 0)) { fprintf(stderr,"Numeric argument out" " of range\n"); return 0; } *optarg = opt; return number; case '\0': /* end of the string */ if ((number > max) || (number == 0)) { fprintf(stderr,"Numeric argument out" " of range\n"); return 0; } return number; default: fprintf(stderr,"Bad syntax of numeric" " argument\n"); return 0; } } return number; } static int parse_opts(int argc, char **argv, struct options *opts) { int c; while ((c = getopt(argc, argv, "s:l:m:v")) != EOF) { switch (c) { case 's': /* starting point */ if ((opts->range[0] = get_number(&optarg)) == 0) { return EXIT_FAILURE; } break; case 'l': /* length */ if ((opts->range[1] = get_number(&optarg)) == 0) { return EXIT_FAILURE; } break; case 'm': /* minlen */ if ((opts->range[2] = get_number(&optarg)) == 0) { return EXIT_FAILURE; } break; case 'v': /* verbose */ opts->verbose = 1; break; default: return EXIT_FAILURE; } } return 0; } static void free_opts(struct options *opts) { if (opts) { if (opts->range) free(opts->range); free(opts); } } static void free_opts_and_exit(struct options *opts) { free_opts(opts); exit(EXIT_FAILURE); } static void print_usage_and_exit(struct options *opts) { usage(); free_opts_and_exit(opts); } int main (int argc, char **argv) { struct options *opts; struct stat sb; int fd, ret = 0; opts = malloc(sizeof(struct options)); if (!opts) { perror("malloc"); return EXIT_FAILURE; } opts->range = NULL; opts->verbose = 0; if (argc > 1) strncpy(opts->mpoint, argv[argc - 1], sizeof(opts->mpoint)); if (argc > 2) { opts->range = calloc(3, sizeof(uint64_t)); if (!opts->range) { perror("calloc"); free_opts_and_exit(opts); } opts->range[1] = ULLONG_MAX; ret = parse_opts(argc, argv, opts); } if (ret) print_usage_and_exit(opts); if (strnlen(opts->mpoint, 1) < 1) { fprintf(stderr,"You have to specify mount point.\n"); print_usage_and_exit(opts); } if (stat(opts->mpoint, &sb) == -1) { fprintf(stderr,"%s is not a valid directory\n", opts->mpoint); print_usage_and_exit(opts); } if (!S_ISDIR(sb.st_mode)) { fprintf(stderr,"%s is not a valid directory\n", opts->mpoint); print_usage_and_exit(opts); } fd = open(opts->mpoint, O_RDONLY); if (fd < 0) { perror("open"); free_opts_and_exit(opts); } if (ioctl(fd, FITRIM, opts->range)) { if (errno == EOPNOTSUPP) fprintf(stderr, "Filesystem at %s does not " "support FITRIM\n", opts->mpoint); else perror("FITRIM"); free_opts_and_exit(opts); } if ((opts->verbose) && (opts->range)) fprintf(stdout,"%lu Bytes was trimmed\n", opts->range[1]); free_opts(opts); return ret; } ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/3 v. 8] Ext3/Ext4 Batched discard support 2010-09-24 15:38 ` [PATCH 0/3 v. 8] Ext3/Ext4 Batched " Lukas Czerner @ 2010-09-24 17:19 ` Andreas Dilger 2010-09-27 9:30 ` Lukas Czerner 0 siblings, 1 reply; 21+ messages in thread From: Andreas Dilger @ 2010-09-24 17:19 UTC (permalink / raw) To: Lukas Czerner; +Cc: linux-ext4, tytso, rwheeler, sandeen, snitzer On 2010-09-24, at 09:38, Lukas Czerner wrote: > /* > * e2trim.c - discard the part (or whole) mounted filesystem. > * > * Usage: e2trim [options] <mount point> > */ This tool isn't extN specific at all, and should probably go into util-linux or similar. That said, it might make sense to add an option to e2fsck to have it trim the free space at the end, since it will just have verified the bitmaps are correct (which is the safest time to do this). > const char *program_name = "fstrim"; Ah, your comment block has an inconsistent name for the program. > static void usage(void) > { > fprintf(stderr, "Usage: %s [-s start] [-l length] [-m minimum-extent]" > " [-v] directory\n\t-s Starting Byte to discard from\n" This formatting, while saving a line, makes it hard to read. Better is: fprintf(stderr, "Usage: %s [-s start] [-l length] [-m minimum-extent]" " [-v] {mountpoint}\n\t" "-s Starting Byte to discard from\n\t" "-l Number of Bytes to discard from the start\n\t" "-m Minimum extent length to discard\n\t" "-v Verbose - number of discarded bytes\n", program_name); > static unsigned long long get_number(char **optarg) > { > /* compute the number */ > while ((*opt >= '0') && (*opt <= '9') && (number < max)) { > number = number * 10 + *opt++ - '0'; > } Umm, strtoul(opt, &end, 0) is a much nicer way to parse this, and it also allows hex-formatted input. > while (1) { > /* determine if units are defined */ > switch(*opt++) { > case 'K': /* kilobytes */ > case 'k': > number *= 1024; > break; > case 'M': /* megabytes */ > case 'm': > number *= 1024 * 1024; > break; > case 'G': /* gigabytes */ > case 'g': > number *= 1024 * 1024 * 1024; > break; I usually implement this by descending TGMK, multiply by 1024 at each step, and then skip the "break" in between. Definitely you want to have "T" as a suffix these days. > case ':': /* delimiter */ It isn't clear from the usage message what this delimiter is used for? > if (argc > 2) { > opts->range = calloc(3, sizeof(uint64_t)); Thinking about this more, it might make more sense to have a structure with named fields here. This simplifies the ioctl definition, but also makes it much more clear from the caller which field is which. Otherwise, there will be callers that get the order of parameters wrong, etc. > if (stat(opts->mpoint, &sb) == -1) { > fprintf(stderr,"%s is not a valid directory\n", opts->mpoint); Using strerror() here to print the actual error is better. It might be e.g. a permission problem, etc. Also, using perror() will localize the message. > fd = open(opts->mpoint, O_RDONLY); > if (fd < 0) { > perror("open"); I prefer strerror() over perror(), since it allows creating a better error: fprintf(stderr, "%s: open(%s): %s\n", program_name, opts->mpoint, strerror(errno)); Cheers, Andreas ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 0/3 v. 8] Ext3/Ext4 Batched discard support 2010-09-24 17:19 ` Andreas Dilger @ 2010-09-27 9:30 ` Lukas Czerner 0 siblings, 0 replies; 21+ messages in thread From: Lukas Czerner @ 2010-09-27 9:30 UTC (permalink / raw) To: Andreas Dilger Cc: Lukas Czerner, linux-ext4, tytso, rwheeler, sandeen, snitzer On Fri, 24 Sep 2010, Andreas Dilger wrote: > On 2010-09-24, at 09:38, Lukas Czerner wrote: > > /* > > * e2trim.c - discard the part (or whole) mounted filesystem. > > * > > * Usage: e2trim [options] <mount point> > > */ > > This tool isn't extN specific at all, and should probably go into util-linux or similar. That said, it might make sense to add an option to e2fsck to have it trim the free space at the end, since it will just have verified the bitmaps are correct (which is the safest time to do this). > > > const char *program_name = "fstrim"; > > Ah, your comment block has an inconsistent name for the program. Right, I have changed it from e2trim to fstrim because of the same reason you've just pointed out. Looks like I had missed some old comments. > > > static void usage(void) > > { > > fprintf(stderr, "Usage: %s [-s start] [-l length] [-m minimum-extent]" > > " [-v] directory\n\t-s Starting Byte to discard from\n" > > This formatting, while saving a line, makes it hard to read. Better is: > > fprintf(stderr, > "Usage: %s [-s start] [-l length] [-m minimum-extent]" > " [-v] {mountpoint}\n\t" > "-s Starting Byte to discard from\n\t" > "-l Number of Bytes to discard from the start\n\t" > "-m Minimum extent length to discard\n\t" > "-v Verbose - number of discarded bytes\n", > program_name); > > > static unsigned long long get_number(char **optarg) > > { > > /* compute the number */ > > while ((*opt >= '0') && (*opt <= '9') && (number < max)) { > > number = number * 10 + *opt++ - '0'; > > } > > Umm, strtoul(opt, &end, 0) is a much nicer way to parse this, and it also allows hex-formatted input. That's right, but my original idea was to let the user write parameters like 10kk etc. But when I think of it now, it is not exactly useful :) > > > while (1) { > > /* determine if units are defined */ > > switch(*opt++) { > > case 'K': /* kilobytes */ > > case 'k': > > number *= 1024; > > break; > > case 'M': /* megabytes */ > > case 'm': > > number *= 1024 * 1024; > > break; > > case 'G': /* gigabytes */ > > case 'g': > > number *= 1024 * 1024 * 1024; > > break; > > I usually implement this by descending TGMK, multiply by 1024 at each step, and then skip the "break" in between. Definitely you want to have "T" as a suffix these days. > > > case ':': /* delimiter */ > > It isn't clear from the usage message what this delimiter is used for? Oh, I have just reused this part from another project of mine, so I have probably missed that out. Thanks. > > > if (argc > 2) { > > opts->range = calloc(3, sizeof(uint64_t)); > > Thinking about this more, it might make more sense to have a structure with named fields here. This simplifies the ioctl definition, but also makes it much more clear from the caller which field is which. Otherwise, there will be callers that get the order of parameters wrong, etc. Definitely, looks like I am going to use a structure for this ioctl. > > > if (stat(opts->mpoint, &sb) == -1) { > > fprintf(stderr,"%s is not a valid directory\n", opts->mpoint); > > Using strerror() here to print the actual error is better. It might be e.g. a permission problem, etc. Also, using perror() will localize the message. > > > fd = open(opts->mpoint, O_RDONLY); > > if (fd < 0) { > > perror("open"); > > I prefer strerror() over perror(), since it allows creating a better error: > > fprintf(stderr, "%s: open(%s): %s\n", > program_name, opts->mpoint, strerror(errno)); > > > Cheers, Andreas > Andreas, thanks a lot for reviewing this. -Lukas ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 0/3 ver. 7] Ext3/Ext4 Batched discard support @ 2010-08-10 14:19 Lukas Czerner 2010-08-10 14:19 ` [PATCH 1/3] Add ioctl FITRIM Lukas Czerner 0 siblings, 1 reply; 21+ messages in thread From: Lukas Czerner @ 2010-08-10 14:19 UTC (permalink / raw) To: linux-ext4 Cc: dmonakhov, jmoyer, rwheeler, eshishki, sandeen, jack, tytso, lczerner Hi all, this is "hopefully" the final version of the patch set, since it looks like everyone agree with this. Since the last version some simple fixes was done. - As Jan noticed it should subtract the s_first_data_block before modulo when computing the group starting block. - As Dimitry proposed cond_reshed() is really useless in the ext4_trim_extent and rescheduling should be done in the ext4_trim_all_free. Also from previous version it supports "partial" fs discard see [PATCH 1/3]. The patch set consist of these three patches. Second and third patch are independent on each other and dependent on the first patch. [PATCH 1/3] Add ioctl FITRIM. [PATCH 2/3] Add batched discard support for ext3 [PATCH 3/3] Add batched discard support for ext4 Regards -Lukas --- fs/ext3/balloc.c | 250 +++++++++++++++++++++++++++++++++++++++++++++++ fs/ext3/super.c | 1 + fs/ext4/ext4.h | 2 + fs/ext4/mballoc.c | 200 +++++++++++++++++++++++++++++++++++++ fs/ext4/super.c | 1 + fs/ioctl.c | 34 +++++++ include/linux/ext3_fs.h | 1 + include/linux/fs.h | 2 + 8 files changed, 491 insertions(+), 0 deletions(-) ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 1/3] Add ioctl FITRIM. 2010-08-10 14:19 [PATCH 0/3 ver. 7] " Lukas Czerner @ 2010-08-10 14:19 ` Lukas Czerner 0 siblings, 0 replies; 21+ messages in thread From: Lukas Czerner @ 2010-08-10 14:19 UTC (permalink / raw) To: linux-ext4 Cc: dmonakhov, jmoyer, rwheeler, eshishki, sandeen, jack, tytso, lczerner Adds an filesystem independent ioctl to allow implementation of file system batched discard support. It takes an array of three uint64_t as an argument, the meaning of each item is as follows: array[0] - (start) first Byte to trim array[1] - (len) number of Bytes to trim from start array[2] - (minlen) minimum extent length to trim, free extents shorter than this number of Bytes will be ignored. This number will be rounded up to the block size. It is also possible to specify NULL as an argument. In this case the arguments will set itself as follows: args[0] = 0; args[1] = ULLONG_MAX; args[2] = 0; So it will trim the whole file system at one run. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/ioctl.c | 34 ++++++++++++++++++++++++++++++++++ include/linux/fs.h | 2 ++ 2 files changed, 36 insertions(+), 0 deletions(-) diff --git a/fs/ioctl.c b/fs/ioctl.c index 2d140a7..a0e243c 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -540,6 +540,36 @@ static int ioctl_fsthaw(struct file *filp) return thaw_super(sb); } +static int ioctl_fstrim(struct file *filp, void __user *argp) +{ + struct super_block *sb = filp->f_path.dentry->d_inode->i_sb; + uint64_t args[3]; + int err; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + /* If filesystem doesn't support trim feature, return. */ + if (sb->s_op->trim_fs == NULL) + return -EOPNOTSUPP; + + /* If a blockdevice-backed filesystem isn't specified, return EINVAL. */ + if (sb->s_bdev == NULL) + return -EINVAL; + + if (argp == NULL) { + args[0] = 0; + args[1] = ULLONG_MAX; + args[2] = 0; + } else if (copy_from_user(args, argp, sizeof(args))) + return -EFAULT; + + err = sb->s_op->trim_fs(sb, args[0], args[1], args[2]); + if (err) + return err; + return 0; +} + /* * When you add any new common ioctls to the switches above and below * please update compat_sys_ioctl() too. @@ -590,6 +620,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, error = ioctl_fsthaw(filp); break; + case FITRIM: + error = ioctl_fstrim(filp, argp); + break; + case FS_IOC_FIEMAP: return ioctl_fiemap(filp, arg); diff --git a/include/linux/fs.h b/include/linux/fs.h index e5106e4..0cc196f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -316,6 +316,7 @@ struct inodes_stat_t { #define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */ #define FIFREEZE _IOWR('X', 119, int) /* Freeze */ #define FITHAW _IOWR('X', 120, int) /* Thaw */ +#define FITRIM _IOW('X', 121, uint64_t)/* Trim */ #define FS_IOC_GETFLAGS _IOR('f', 1, long) #define FS_IOC_SETFLAGS _IOW('f', 2, long) @@ -1582,6 +1583,7 @@ struct super_operations { ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); #endif int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t); + int (*trim_fs) (struct super_block *, uint64_t, uint64_t, uint64_t); }; /* -- 1.7.2.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 0/3] Batched discard support @ 2010-08-06 11:31 Lukas Czerner 2010-08-06 11:31 ` [PATCH 1/3] Add ioctl FITRIM Lukas Czerner 0 siblings, 1 reply; 21+ messages in thread From: Lukas Czerner @ 2010-08-06 11:31 UTC (permalink / raw) To: linux-ext4 Cc: dmonakhov, jmoyer, rwheeler, eshishki, sandeen, jack, tytso, lczerner Hi, all because people were worried about possibly long stalls appearing when FITRIM ioctl is working, I have changed the FITRIM interface as Dimitry suggested. Now you can choose whether to trim whole file system or just a part of it, resp. you can specify the range of Bytes to trim. To be specific you can create something like this: int main(int argc, char **argv) { int fd; uint64_t range[3]; range[0] = 40960; range[1] = 134217728; range[2] = 4096; fd = open(argv[1], O_RDONLY); if (fd < 0) { perror("open"); return 1; } if (ioctl(fd, FITRIM, range)) { if (errno == EOPNOTSUPP) fprintf(stderr, "FITRIM not supported\n"); else perror("FITRIM"); return 1; } return 0; } Range items have following meaning: range[0] - (start) first Byte to trim range[1] - (len) number of Bytes to trim from start range[2] - (minlen) minimum extent length to trim, free extents shorter than this number of Bytes will be ignored. This number will be rounded up to the block size. So in my example it will trim all free extents from block 10 of first alloc. group to block 10 of second alloc. group, assuming we have block_size = 4096. Also, when you want to trim the whole fs, you can simply pass NULL instead of range into the ioctl, or you can specify the range correctly to cover the whole fs. Regards -Lukas [PATCH 1/3] Add ioctl FITRIM. [PATCH 2/3] Add batched discard support for ext3 [PATCH 3/3] Add batched discard support for ext4 fs/ext3/balloc.c | 249 +++++++++++++++++++++++++++++++++++++++++++++++ fs/ext3/super.c | 1 + fs/ext4/ext4.h | 2 + fs/ext4/mballoc.c | 194 ++++++++++++++++++++++++++++++++++++ fs/ext4/super.c | 1 + fs/ioctl.c | 34 +++++++ include/linux/ext3_fs.h | 1 + include/linux/fs.h | 2 + 8 files changed, 484 insertions(+), 0 deletions(-) ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 1/3] Add ioctl FITRIM. 2010-08-06 11:31 [PATCH 0/3] Batched discard support Lukas Czerner @ 2010-08-06 11:31 ` Lukas Czerner 0 siblings, 0 replies; 21+ messages in thread From: Lukas Czerner @ 2010-08-06 11:31 UTC (permalink / raw) To: linux-ext4 Cc: dmonakhov, jmoyer, rwheeler, eshishki, sandeen, jack, tytso, lczerner Adds an filesystem independent ioctl to allow implementation of file system batched discard support. It takes an array of three uint64_t as an argument, the meaning of each item is as follows: array[0] - (start) first Byte to trim array[1] - (len) number of Bytes to trim from start array[2] - (minlen) minimum extent length to trim, free extents shorter than this number of Bytes will be ignored. This number will be rounded up to the block size. It is also possible to specify NULL as an argument. In this case the arguments will set itself as follows: args[0] = 0; args[1] = LLONG_MAX; args[2] = 0; So it will trim the whole file system at one run. Signed-off-by: Lukas Czerner <lczerner@redhat.com> --- fs/ioctl.c | 34 ++++++++++++++++++++++++++++++++++ include/linux/fs.h | 2 ++ 2 files changed, 36 insertions(+), 0 deletions(-) diff --git a/fs/ioctl.c b/fs/ioctl.c index 2d140a7..a0e243c 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -540,6 +540,36 @@ static int ioctl_fsthaw(struct file *filp) return thaw_super(sb); } +static int ioctl_fstrim(struct file *filp, void __user *argp) +{ + struct super_block *sb = filp->f_path.dentry->d_inode->i_sb; + uint64_t args[3]; + int err; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + /* If filesystem doesn't support trim feature, return. */ + if (sb->s_op->trim_fs == NULL) + return -EOPNOTSUPP; + + /* If a blockdevice-backed filesystem isn't specified, return EINVAL. */ + if (sb->s_bdev == NULL) + return -EINVAL; + + if (argp == NULL) { + args[0] = 0; + args[1] = ULLONG_MAX; + args[2] = 0; + } else if (copy_from_user(args, argp, sizeof(args))) + return -EFAULT; + + err = sb->s_op->trim_fs(sb, args[0], args[1], args[2]); + if (err) + return err; + return 0; +} + /* * When you add any new common ioctls to the switches above and below * please update compat_sys_ioctl() too. @@ -590,6 +620,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, error = ioctl_fsthaw(filp); break; + case FITRIM: + error = ioctl_fstrim(filp, argp); + break; + case FS_IOC_FIEMAP: return ioctl_fiemap(filp, arg); diff --git a/include/linux/fs.h b/include/linux/fs.h index e5106e4..0cc196f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -316,6 +316,7 @@ struct inodes_stat_t { #define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */ #define FIFREEZE _IOWR('X', 119, int) /* Freeze */ #define FITHAW _IOWR('X', 120, int) /* Thaw */ +#define FITRIM _IOW('X', 121, uint64_t)/* Trim */ #define FS_IOC_GETFLAGS _IOR('f', 1, long) #define FS_IOC_SETFLAGS _IOW('f', 2, long) @@ -1582,6 +1583,7 @@ struct super_operations { ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); #endif int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t); + int (*trim_fs) (struct super_block *, uint64_t, uint64_t, uint64_t); }; /* -- 1.7.2 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 1/3] Add ioctl FITRIM. @ 2010-08-04 13:44 Lukas Czerner 2010-08-04 14:57 ` Dmitry Monakhov 0 siblings, 1 reply; 21+ messages in thread From: Lukas Czerner @ 2010-08-04 13:44 UTC (permalink / raw) To: linux-ext4; +Cc: jmoyer, rwheeler, eshishki, sandeen, jack, tytso, lczerner Adds an filesystem independent ioctl to allow implementation of file system batched discard support. Signed-off-by: Lukas Czerner <lczerner@redhat.com> --- fs/ioctl.c | 31 +++++++++++++++++++++++++++++++ include/linux/fs.h | 2 ++ 2 files changed, 33 insertions(+), 0 deletions(-) diff --git a/fs/ioctl.c b/fs/ioctl.c index 2d140a7..6c01c3c 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -540,6 +540,33 @@ static int ioctl_fsthaw(struct file *filp) return thaw_super(sb); } +static int ioctl_fstrim(struct file *filp, unsigned long arg) +{ + struct super_block *sb = filp->f_path.dentry->d_inode->i_sb; + unsigned int minlen; + int err; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + /* If filesystem doesn't support trim feature, return. */ + if (sb->s_op->trim_fs == NULL) + return -EOPNOTSUPP; + + /* If a blockdevice-backed filesystem isn't specified, return EINVAL. */ + if (sb->s_bdev == NULL) + return -EINVAL; + + err = get_user(minlen, (unsigned int __user *) arg); + if (err) + return err; + + err = sb->s_op->trim_fs(minlen, sb); + if (err) + return err; + return 0; +} + /* * When you add any new common ioctls to the switches above and below * please update compat_sys_ioctl() too. @@ -590,6 +617,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, error = ioctl_fsthaw(filp); break; + case FITRIM: + error = ioctl_fstrim(filp, arg); + break; + case FS_IOC_FIEMAP: return ioctl_fiemap(filp, arg); diff --git a/include/linux/fs.h b/include/linux/fs.h index 68ca1b0..01632e4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -315,6 +315,7 @@ struct inodes_stat_t { #define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */ #define FIFREEZE _IOWR('X', 119, int) /* Freeze */ #define FITHAW _IOWR('X', 120, int) /* Thaw */ +#define FITRIM _IOWR('X', 121, int) /* Trim */ #define FS_IOC_GETFLAGS _IOR('f', 1, long) #define FS_IOC_SETFLAGS _IOW('f', 2, long) @@ -1580,6 +1581,7 @@ struct super_operations { ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); #endif int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t); + int (*trim_fs) (unsigned int, struct super_block *); }; /* -- 1.7.2 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 1/3] Add ioctl FITRIM. 2010-08-04 13:44 Lukas Czerner @ 2010-08-04 14:57 ` Dmitry Monakhov 2010-08-04 15:13 ` Lukas Czerner 0 siblings, 1 reply; 21+ messages in thread From: Dmitry Monakhov @ 2010-08-04 14:57 UTC (permalink / raw) To: Lukas Czerner Cc: linux-ext4, jmoyer, rwheeler, eshishki, sandeen, jack, tytso Lukas Czerner <lczerner@redhat.com> writes: > Adds an filesystem independent ioctl to allow implementation of file > system batched discard support. > > Signed-off-by: Lukas Czerner <lczerner@redhat.com> > --- > fs/ioctl.c | 31 +++++++++++++++++++++++++++++++ > include/linux/fs.h | 2 ++ > 2 files changed, 33 insertions(+), 0 deletions(-) > > diff --git a/fs/ioctl.c b/fs/ioctl.c > index 2d140a7..6c01c3c 100644 > --- a/fs/ioctl.c > +++ b/fs/ioctl.c > @@ -540,6 +540,33 @@ static int ioctl_fsthaw(struct file *filp) > return thaw_super(sb); > } > > +static int ioctl_fstrim(struct file *filp, unsigned long arg) BTW why do we have to trim fs in one shot ? IMHO it is much suitable to provide start,len parameters as we do in most functions(truncate, bdevdiscard, getdents). It allow userspace caller to implement a fancy looking progress bars. > +{ > + struct super_block *sb = filp->f_path.dentry->d_inode->i_sb; > + unsigned int minlen; > + int err; > + > + if (!capable(CAP_SYS_ADMIN)) > + return -EPERM; > + > + /* If filesystem doesn't support trim feature, return. */ > + if (sb->s_op->trim_fs == NULL) > + return -EOPNOTSUPP; > + > + /* If a blockdevice-backed filesystem isn't specified, return EINVAL. */ > + if (sb->s_bdev == NULL) > + return -EINVAL; > + > + err = get_user(minlen, (unsigned int __user *) arg); > + if (err) > + return err; > + > + err = sb->s_op->trim_fs(minlen, sb); > + if (err) > + return err; > + return 0; > +} > + > /* > * When you add any new common ioctls to the switches above and below > * please update compat_sys_ioctl() too. > @@ -590,6 +617,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, > error = ioctl_fsthaw(filp); > break; > > + case FITRIM: > + error = ioctl_fstrim(filp, arg); > + break; > + > case FS_IOC_FIEMAP: > return ioctl_fiemap(filp, arg); > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 68ca1b0..01632e4 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -315,6 +315,7 @@ struct inodes_stat_t { > #define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */ > #define FIFREEZE _IOWR('X', 119, int) /* Freeze */ > #define FITHAW _IOWR('X', 120, int) /* Thaw */ > +#define FITRIM _IOWR('X', 121, int) /* Trim */ > > #define FS_IOC_GETFLAGS _IOR('f', 1, long) > #define FS_IOC_SETFLAGS _IOW('f', 2, long) > @@ -1580,6 +1581,7 @@ struct super_operations { > ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); > #endif > int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t); > + int (*trim_fs) (unsigned int, struct super_block *); > }; > > /* ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/3] Add ioctl FITRIM. 2010-08-04 14:57 ` Dmitry Monakhov @ 2010-08-04 15:13 ` Lukas Czerner 2010-08-04 15:26 ` Greg Freemyer 2010-08-05 7:00 ` Dmitry Monakhov 0 siblings, 2 replies; 21+ messages in thread From: Lukas Czerner @ 2010-08-04 15:13 UTC (permalink / raw) To: Dmitry Monakhov Cc: Lukas Czerner, linux-ext4, jmoyer, rwheeler, eshishki, sandeen, jack, tytso On Wed, 4 Aug 2010, Dmitry Monakhov wrote: > Lukas Czerner <lczerner@redhat.com> writes: > > > Adds an filesystem independent ioctl to allow implementation of file > > system batched discard support. > > > > Signed-off-by: Lukas Czerner <lczerner@redhat.com> > > --- > > fs/ioctl.c | 31 +++++++++++++++++++++++++++++++ > > include/linux/fs.h | 2 ++ > > 2 files changed, 33 insertions(+), 0 deletions(-) > > > > diff --git a/fs/ioctl.c b/fs/ioctl.c > > index 2d140a7..6c01c3c 100644 > > --- a/fs/ioctl.c > > +++ b/fs/ioctl.c > > @@ -540,6 +540,33 @@ static int ioctl_fsthaw(struct file *filp) > > return thaw_super(sb); > > } > > > > +static int ioctl_fstrim(struct file *filp, unsigned long arg) > BTW why do we have to trim fs in one shot ? > IMHO it is much suitable to provide start,len parameters as we > do in most functions(truncate, bdevdiscard, getdents). > It allow userspace caller to implement a fancy looking progress bars. Hi, do you think it is really needed when even with todays SSD's it takes just a couple of seconds ? And I suppose it will improve in future. But generally I think we can do that..I would like to hear some more opinions before I start looking at this. Thanks. -Lukas. > > +{ > > + struct super_block *sb = filp->f_path.dentry->d_inode->i_sb; > > + unsigned int minlen; > > + int err; > > + > > + if (!capable(CAP_SYS_ADMIN)) > > + return -EPERM; > > + > > + /* If filesystem doesn't support trim feature, return. */ > > + if (sb->s_op->trim_fs == NULL) > > + return -EOPNOTSUPP; > > + > > + /* If a blockdevice-backed filesystem isn't specified, return EINVAL. */ > > + if (sb->s_bdev == NULL) > > + return -EINVAL; > > + > > + err = get_user(minlen, (unsigned int __user *) arg); > > + if (err) > > + return err; > > + > > + err = sb->s_op->trim_fs(minlen, sb); > > + if (err) > > + return err; > > + return 0; > > +} > > + > > /* > > * When you add any new common ioctls to the switches above and below > > * please update compat_sys_ioctl() too. > > @@ -590,6 +617,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, > > error = ioctl_fsthaw(filp); > > break; > > > > + case FITRIM: > > + error = ioctl_fstrim(filp, arg); > > + break; > > + > > case FS_IOC_FIEMAP: > > return ioctl_fiemap(filp, arg); > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 68ca1b0..01632e4 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -315,6 +315,7 @@ struct inodes_stat_t { > > #define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */ > > #define FIFREEZE _IOWR('X', 119, int) /* Freeze */ > > #define FITHAW _IOWR('X', 120, int) /* Thaw */ > > +#define FITRIM _IOWR('X', 121, int) /* Trim */ > > > > #define FS_IOC_GETFLAGS _IOR('f', 1, long) > > #define FS_IOC_SETFLAGS _IOW('f', 2, long) > > @@ -1580,6 +1581,7 @@ struct super_operations { > > ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); > > #endif > > int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t); > > + int (*trim_fs) (unsigned int, struct super_block *); > > }; > > > > /* > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/3] Add ioctl FITRIM. 2010-08-04 15:13 ` Lukas Czerner @ 2010-08-04 15:26 ` Greg Freemyer 2010-08-05 0:28 ` Ted Ts'o 2010-08-05 7:00 ` Dmitry Monakhov 1 sibling, 1 reply; 21+ messages in thread From: Greg Freemyer @ 2010-08-04 15:26 UTC (permalink / raw) To: Lukas Czerner Cc: Dmitry Monakhov, linux-ext4, jmoyer, rwheeler, eshishki, sandeen, jack, tytso, Mark Lord On Wed, Aug 4, 2010 at 11:13 AM, Lukas Czerner <lczerner@redhat.com> wrote: > On Wed, 4 Aug 2010, Dmitry Monakhov wrote: > >> Lukas Czerner <lczerner@redhat.com> writes: >> >> > Adds an filesystem independent ioctl to allow implementation of file >> > system batched discard support. >> > >> > Signed-off-by: Lukas Czerner <lczerner@redhat.com> >> > --- >> > fs/ioctl.c | 31 +++++++++++++++++++++++++++++++ >> > include/linux/fs.h | 2 ++ >> > 2 files changed, 33 insertions(+), 0 deletions(-) >> > >> > diff --git a/fs/ioctl.c b/fs/ioctl.c >> > index 2d140a7..6c01c3c 100644 >> > --- a/fs/ioctl.c >> > +++ b/fs/ioctl.c >> > @@ -540,6 +540,33 @@ static int ioctl_fsthaw(struct file *filp) >> > return thaw_super(sb); >> > } >> > >> > +static int ioctl_fstrim(struct file *filp, unsigned long arg) >> BTW why do we have to trim fs in one shot ? >> IMHO it is much suitable to provide start,len parameters as we >> do in most functions(truncate, bdevdiscard, getdents). >> It allow userspace caller to implement a fancy looking progress bars. > > Hi, > > do you think it is really needed when even with todays SSD's it takes > just a couple of seconds ? And I suppose it will improve in future. But > generally I think we can do that..I would like to hear some more > opinions before I start looking at this. > > Thanks. > > -Lukas. Since the proposed patch is not aggregating discards into multiple ranges per ATA command, I thought some of the non-optimized devices would take minutes / hours? If true, a way to control the progress from userspace is important. If in general it is only going to take a few seconds for a full FITRIM to run, it is much less important, but I suppose the the RT project might find even that problematic. Greg -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/3] Add ioctl FITRIM. 2010-08-04 15:26 ` Greg Freemyer @ 2010-08-05 0:28 ` Ted Ts'o 2010-08-05 6:51 ` Dmitry Monakhov 2010-08-05 15:47 ` Andreas Dilger 0 siblings, 2 replies; 21+ messages in thread From: Ted Ts'o @ 2010-08-05 0:28 UTC (permalink / raw) To: Greg Freemyer Cc: Lukas Czerner, Dmitry Monakhov, linux-ext4, jmoyer, rwheeler, eshishki, sandeen, jack, Mark Lord On Wed, Aug 04, 2010 at 11:26:56AM -0400, Greg Freemyer wrote: > > Since the proposed patch is not aggregating discards into multiple > ranges per ATA command, I thought some of the non-optimized devices > would take minutes / hours? > > If true, a way to control the progress from userspace is important. > > If in general it is only going to take a few seconds for a full FITRIM > to run, it is much less important, but I suppose the the RT project > might find even that problematic. Even if it without the RT project, if disk activity is slowed or completely stopped for a few seconds, I can think of plenty of workloads where this would be totally unacceptable. Suppose you are running a web site; it doesn't really matter whether it is at Google, Facebook, Twitter, etc. If this means that one or more web pages get stalled by "a few seconds" while the FITRIM is going on, this is generally not considered acceptable. Even if it slows down the server by 30-50%, for some sites this would also be quite unacceptable. This is a hard problem to solve, though, especially if there is an insistence to solve it in a fs-independent fashion. I could imagine doing this at work, by doing things one block group at a time, and then I could measure, for our specific hardware, how badly disk performance would get hit, and for how long, and then the userspace daemon could control how many block groups to do per unit time. But this would be of necessity ext2/3/4 specific.... So I'm not sure what to suggest here. Maybe the answer is we can have a fs-independent ioctl for desktop workloads, and one which gives more fine-grained control for those who need it? That seems ugly, but it might be the best compromise. - Ted ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/3] Add ioctl FITRIM. 2010-08-05 0:28 ` Ted Ts'o @ 2010-08-05 6:51 ` Dmitry Monakhov 2010-08-05 15:47 ` Andreas Dilger 1 sibling, 0 replies; 21+ messages in thread From: Dmitry Monakhov @ 2010-08-05 6:51 UTC (permalink / raw) To: Ted Ts'o Cc: Greg Freemyer, Lukas Czerner, linux-ext4, jmoyer, rwheeler, eshishki, sandeen, jack, Mark Lord Ted Ts'o <tytso@mit.edu> writes: > On Wed, Aug 04, 2010 at 11:26:56AM -0400, Greg Freemyer wrote: >> >> Since the proposed patch is not aggregating discards into multiple >> ranges per ATA command, I thought some of the non-optimized devices >> would take minutes / hours? >> >> If true, a way to control the progress from userspace is important. >> >> If in general it is only going to take a few seconds for a full FITRIM >> to run, it is much less important, but I suppose the the RT project >> might find even that problematic. Few second may not being true, We always have to think about crazy user, and crazy fs-layouts. My SSD is able to process 8*10^3 requests per second So in worst case( where 1k fs block are bysy like this 010101010101) it can process about 10^3/s * 2*1Kb = 16Mb/s Which is no good. > > Even if it without the RT project, if disk activity is slowed or > completely stopped for a few seconds, I can think of plenty of > workloads where this would be totally unacceptable. Suppose you are > running a web site; it doesn't really matter whether it is at Google, > Facebook, Twitter, etc. If this means that one or more web pages get > stalled by "a few seconds" while the FITRIM is going on, this is > generally not considered acceptable. Even if it slows down the server > by 30-50%, for some sites this would also be quite unacceptable. > > This is a hard problem to solve, though, especially if there is an > insistence to solve it in a fs-independent fashion. I could imagine > doing this at work, by doing things one block group at a time, and > then I could measure, for our specific hardware, how badly disk > performance would get hit, and for how long, and then the userspace > daemon could control how many block groups to do per unit time. > But this would be of necessity ext2/3/4 specific.... > > So I'm not sure what to suggest here. Maybe the answer is we can have > a fs-independent ioctl for desktop workloads, and one which gives more > fine-grained control for those who need it? That seems ugly, but it > might be the best compromise. Why do we have to invent a wheel again? We already have BLKDISCARD which has following arguments: uint64_t range[2] So IMHO it is reasonable that FITRIM should have following arguments uint64_t start, uint64_t len, uint64_t minlen. No problems with compat, no problems with interactivity. User who does not care about interactivity may just call ioctl(fd, FITRIM, 0, LLONG_MAX, 0), > > - Ted > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/3] Add ioctl FITRIM. 2010-08-05 0:28 ` Ted Ts'o 2010-08-05 6:51 ` Dmitry Monakhov @ 2010-08-05 15:47 ` Andreas Dilger 1 sibling, 0 replies; 21+ messages in thread From: Andreas Dilger @ 2010-08-05 15:47 UTC (permalink / raw) To: Ted Ts'o Cc: Greg Freemyer, Lukas Czerner, Dmitry Monakhov, linux-ext4, jmoyer, rwheeler, eshishki, sandeen, jack, Mark Lord On 2010-08-04, at 18:28, Ted Ts'o wrote: > On Wed, Aug 04, 2010 at 11:26:56AM -0400, Greg Freemyer wrote: >> If true, a way to control the progress from userspace is important. >> >> If in general it is only going to take a few seconds for a full FITRIM >> to run, it is much less important, but I suppose the the RT project >> might find even that problematic. > > Even if it without the RT project, if disk activity is slowed or > completely stopped for a few seconds, I can think of plenty of > workloads where this would be totally unacceptable. Suppose you are > running a web site; it doesn't really matter whether it is at Google, > Facebook, Twitter, etc. If this means that one or more web pages get > stalled by "a few seconds" while the FITRIM is going on, this is > generally not considered acceptable. Even if it slows down the server > by 30-50%, for some sites this would also be quite unacceptable. > > This is a hard problem to solve, though, especially if there is an > insistence to solve it in a fs-independent fashion. I could imagine > doing this at work, by doing things one block group at a time, and > then I could measure, for our specific hardware, how badly disk > performance would get hit, and for how long, and then the userspace > daemon could control how many block groups to do per unit time. > But this would be of necessity ext2/3/4 specific.... I think "blockgroup at a time" is simply the extN way of "range of blocks at a time". Having an API that is requesting "trim free blocks from [M,N]" is a generic enough interface to apply to any filesystem. If there is some way to query the "efficient trim increment size" (i.e. block group for extN, allocation group for xfs, ??? for btrfs) then userspace could do it that way, or simply pick some fraction of the filesystem and use a nice power-of-two value and hope it works out. > So I'm not sure what to suggest here. Maybe the answer is we can have > a fs-independent ioctl for desktop workloads, and one which gives more > fine-grained control for those who need it? That seems ugly, but it > might be the best compromise. No, too ugly. Cheers, Andreas ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/3] Add ioctl FITRIM. 2010-08-04 15:13 ` Lukas Czerner 2010-08-04 15:26 ` Greg Freemyer @ 2010-08-05 7:00 ` Dmitry Monakhov 2010-08-05 8:36 ` Lukas Czerner 1 sibling, 1 reply; 21+ messages in thread From: Dmitry Monakhov @ 2010-08-05 7:00 UTC (permalink / raw) To: Lukas Czerner Cc: linux-ext4, jmoyer, rwheeler, eshishki, sandeen, jack, tytso Lukas Czerner <lczerner@redhat.com> writes: > On Wed, 4 Aug 2010, Dmitry Monakhov wrote: > >> Lukas Czerner <lczerner@redhat.com> writes: >> >> > Adds an filesystem independent ioctl to allow implementation of file >> > system batched discard support. >> > >> > Signed-off-by: Lukas Czerner <lczerner@redhat.com> >> > --- >> > fs/ioctl.c | 31 +++++++++++++++++++++++++++++++ >> > include/linux/fs.h | 2 ++ >> > 2 files changed, 33 insertions(+), 0 deletions(-) >> > >> > diff --git a/fs/ioctl.c b/fs/ioctl.c >> > index 2d140a7..6c01c3c 100644 >> > --- a/fs/ioctl.c >> > +++ b/fs/ioctl.c >> > @@ -540,6 +540,33 @@ static int ioctl_fsthaw(struct file *filp) >> > return thaw_super(sb); >> > } >> > >> > +static int ioctl_fstrim(struct file *filp, unsigned long arg) >> BTW why do we have to trim fs in one shot ? >> IMHO it is much suitable to provide start,len parameters as we >> do in most functions(truncate, bdevdiscard, getdents). >> It allow userspace caller to implement a fancy looking progress bars. > > Hi, > > do you think it is really needed when even with todays SSD's it takes > just a couple of seconds ? And I suppose it will improve in future. But > generally I think we can do that..I would like to hear some more > opinions before I start looking at this. Hi, Lukas we may face a really long delays due to bad layouts and slow devices Please read my response to Ted I'm agree with you what this interface is important, BTW i already enabled FITRIM support on my notebook, my speed difference is about 2-3%. But let's provide right user interface from very beginning. > > Thanks. > > -Lukas. > >> > +{ >> > + struct super_block *sb = filp->f_path.dentry->d_inode->i_sb; >> > + unsigned int minlen; >> > + int err; >> > + >> > + if (!capable(CAP_SYS_ADMIN)) >> > + return -EPERM; >> > + >> > + /* If filesystem doesn't support trim feature, return. */ >> > + if (sb->s_op->trim_fs == NULL) >> > + return -EOPNOTSUPP; >> > + >> > + /* If a blockdevice-backed filesystem isn't specified, return EINVAL. */ >> > + if (sb->s_bdev == NULL) >> > + return -EINVAL; >> > + >> > + err = get_user(minlen, (unsigned int __user *) arg); >> > + if (err) >> > + return err; >> > + >> > + err = sb->s_op->trim_fs(minlen, sb); >> > + if (err) >> > + return err; >> > + return 0; >> > +} >> > + >> > /* >> > * When you add any new common ioctls to the switches above and below >> > * please update compat_sys_ioctl() too. >> > @@ -590,6 +617,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, >> > error = ioctl_fsthaw(filp); >> > break; >> > >> > + case FITRIM: >> > + error = ioctl_fstrim(filp, arg); >> > + break; >> > + >> > case FS_IOC_FIEMAP: >> > return ioctl_fiemap(filp, arg); >> > >> > diff --git a/include/linux/fs.h b/include/linux/fs.h >> > index 68ca1b0..01632e4 100644 >> > --- a/include/linux/fs.h >> > +++ b/include/linux/fs.h >> > @@ -315,6 +315,7 @@ struct inodes_stat_t { >> > #define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */ >> > #define FIFREEZE _IOWR('X', 119, int) /* Freeze */ >> > #define FITHAW _IOWR('X', 120, int) /* Thaw */ >> > +#define FITRIM _IOWR('X', 121, int) /* Trim */ >> > >> > #define FS_IOC_GETFLAGS _IOR('f', 1, long) >> > #define FS_IOC_SETFLAGS _IOW('f', 2, long) >> > @@ -1580,6 +1581,7 @@ struct super_operations { >> > ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); >> > #endif >> > int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t); >> > + int (*trim_fs) (unsigned int, struct super_block *); >> > }; >> > >> > /* >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/3] Add ioctl FITRIM. 2010-08-05 7:00 ` Dmitry Monakhov @ 2010-08-05 8:36 ` Lukas Czerner 0 siblings, 0 replies; 21+ messages in thread From: Lukas Czerner @ 2010-08-05 8:36 UTC (permalink / raw) To: Dmitry Monakhov Cc: Lukas Czerner, linux-ext4, jmoyer, rwheeler, eshishki, sandeen, jack, tytso On Thu, 5 Aug 2010, Dmitry Monakhov wrote: > Lukas Czerner <lczerner@redhat.com> writes: > > > On Wed, 4 Aug 2010, Dmitry Monakhov wrote: > > > >> Lukas Czerner <lczerner@redhat.com> writes: > >> > >> > Adds an filesystem independent ioctl to allow implementation of file > >> > system batched discard support. > >> > > >> > Signed-off-by: Lukas Czerner <lczerner@redhat.com> > >> > --- > >> > fs/ioctl.c | 31 +++++++++++++++++++++++++++++++ > >> > include/linux/fs.h | 2 ++ > >> > 2 files changed, 33 insertions(+), 0 deletions(-) > >> > > >> > diff --git a/fs/ioctl.c b/fs/ioctl.c > >> > index 2d140a7..6c01c3c 100644 > >> > --- a/fs/ioctl.c > >> > +++ b/fs/ioctl.c > >> > @@ -540,6 +540,33 @@ static int ioctl_fsthaw(struct file *filp) > >> > return thaw_super(sb); > >> > } > >> > > >> > +static int ioctl_fstrim(struct file *filp, unsigned long arg) > >> BTW why do we have to trim fs in one shot ? > >> IMHO it is much suitable to provide start,len parameters as we > >> do in most functions(truncate, bdevdiscard, getdents). > >> It allow userspace caller to implement a fancy looking progress bars. > > > > Hi, > > > > do you think it is really needed when even with todays SSD's it takes > > just a couple of seconds ? And I suppose it will improve in future. But > > generally I think we can do that..I would like to hear some more > > opinions before I start looking at this. > Hi, Lukas > we may face a really long delays due to bad layouts and slow devices > Please read my response to Ted > I'm agree with you what this interface is important, BTW i already > enabled FITRIM support on my notebook, my speed difference is about 2-3%. > But let's provide right user interface from very beginning. Hi, Dimitry I read the thread and really it makes sense to me. Sometimes it can be useful to have more fine-grained control beside just specifying minlen argument, which works quite well, however it is a little bit fuzzy because when you do not know how much space was actually trimmed. I think that there is no need to have two separate ioctls, even though it would be more effective to specify block group instead of block range. I am thinking about something like int optimize which will tell us to round the range to block group boundaries. But I can not tell if it would really help someone (probably not). So I will try to do something to be able to break the FITRIM to smaller pieces, uint64_t start, uint64_t len, uint64_t minlen seems good to me. Thanks -Lukas ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 0/3 v3] Batched discard support for Ext3/Ext4 @ 2010-07-27 12:41 Lukas Czerner 2010-07-27 12:41 ` [PATCH 1/3] Add ioctl FITRIM Lukas Czerner 0 siblings, 1 reply; 21+ messages in thread From: Lukas Czerner @ 2010-07-27 12:41 UTC (permalink / raw) To: linux-ext4; +Cc: jmoyer, rwheeler, eshishki, sandeen, jack, tytso Hi all, since my last post I have changed those patches slightly. Ext3 patch does not introduce DISCARD/NODISCARD mount option, so there is just FITRIM ioctl which can be invoked only by user with CAP_SYS_ADMIN. Ext3 patch also involves journaling because as Jan pointed out not using journal can cause problems when system crashes. Jan please look at this I hope it is ok now. Also I am not sure if it would not be better to close the journal after allocating free space and then start new one after the TRIM (this was proposed by Josef Bacik). Ext4 patch does not remove old "online" discard implementation and does not check for DISCARD mount option. To avoid journaling I have used alloc_sem semaphore to prevent allocations from the group while it is being trimmed. It seems sufficient to me. The general idea of batched discards stays the same. So when FITRIM ioctl is invoked upon the mount point, it walks through all allocation groups searching for free extents bigger than minlen, then trim those extents. In ext3 consistency is assured by allocating these free extents in alloc. group (with journal involved). In ext4 consistency is assured by use of alloc_sem semaphore. -Lukas [PATCH 1/3] Add ioctl FITRIM. [PATCH 2/3] Add batched discard support for ext3 [PATCH 3/3] Add batched discard support for ext4 ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 1/3] Add ioctl FITRIM. 2010-07-27 12:41 [PATCH 0/3 v3] Batched discard support for Ext3/Ext4 Lukas Czerner @ 2010-07-27 12:41 ` Lukas Czerner 0 siblings, 0 replies; 21+ messages in thread From: Lukas Czerner @ 2010-07-27 12:41 UTC (permalink / raw) To: linux-ext4 Cc: jmoyer, rwheeler, eshishki, sandeen, jack, tytso, Lukas Czerner Adds an filesystem independent ioctl to allow implementation of file system batched discard support. Signed-off-by: Lukas Czerner <lczerner@redhat.com> --- fs/ioctl.c | 31 +++++++++++++++++++++++++++++++ include/linux/fs.h | 2 ++ 2 files changed, 33 insertions(+), 0 deletions(-) diff --git a/fs/ioctl.c b/fs/ioctl.c index 7faefb4..09b33ae 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -551,6 +551,33 @@ static int ioctl_fsthaw(struct file *filp) return thaw_bdev(sb->s_bdev, sb); } +static int ioctl_fstrim(struct file *filp, unsigned long arg) +{ + struct super_block *sb = filp->f_path.dentry->d_inode->i_sb; + unsigned int minlen; + int err; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + /* If filesystem doesn't support trim feature, return. */ + if (sb->s_op->trim_fs == NULL) + return -EOPNOTSUPP; + + /* If a blockdevice-backed filesystem isn't specified, return EINVAL. */ + if (sb->s_bdev == NULL) + return -EINVAL; + + err = get_user(minlen, (unsigned int __user *) arg); + if (err) + return err; + + err = sb->s_op->trim_fs(minlen, sb); + if (err) + return err; + return 0; +} + /* * When you add any new common ioctls to the switches above and below * please update compat_sys_ioctl() too. @@ -601,6 +628,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, error = ioctl_fsthaw(filp); break; + case FITRIM: + error = ioctl_fstrim(filp, arg); + break; + case FS_IOC_FIEMAP: return ioctl_fiemap(filp, arg); diff --git a/include/linux/fs.h b/include/linux/fs.h index 44f35ae..7a27fa4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -315,6 +315,7 @@ struct inodes_stat_t { #define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */ #define FIFREEZE _IOWR('X', 119, int) /* Freeze */ #define FITHAW _IOWR('X', 120, int) /* Thaw */ +#define FITRIM _IOWR('X', 121, int) /* Trim */ #define FS_IOC_GETFLAGS _IOR('f', 1, long) #define FS_IOC_SETFLAGS _IOW('f', 2, long) @@ -1580,6 +1581,7 @@ struct super_operations { ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); #endif int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t); + int (*trim_fs) (unsigned int, struct super_block *); }; /* -- 1.7.1.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
end of thread, other threads:[~2010-09-27 9:30 UTC | newest] Thread overview: 21+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-09-24 15:35 [PATCH 0/3 v. 8] Ext3/Ext4 Batched discard support Lukas Czerner 2010-09-24 15:35 ` [PATCH 1/3] Add ioctl FITRIM Lukas Czerner 2010-09-24 17:03 ` Andreas Dilger 2010-09-27 9:20 ` Lukas Czerner 2010-09-24 15:35 ` [PATCH 2/3] ext4: Add batched discard support Lukas Czerner 2010-09-24 15:35 ` [PATCH 3/3] ext3: " Lukas Czerner 2010-09-24 15:38 ` [PATCH 0/3 v. 8] Ext3/Ext4 Batched " Lukas Czerner 2010-09-24 17:19 ` Andreas Dilger 2010-09-27 9:30 ` Lukas Czerner -- strict thread matches above, loose matches on Subject: below -- 2010-08-10 14:19 [PATCH 0/3 ver. 7] " Lukas Czerner 2010-08-10 14:19 ` [PATCH 1/3] Add ioctl FITRIM Lukas Czerner 2010-08-06 11:31 [PATCH 0/3] Batched discard support Lukas Czerner 2010-08-06 11:31 ` [PATCH 1/3] Add ioctl FITRIM Lukas Czerner 2010-08-04 13:44 Lukas Czerner 2010-08-04 14:57 ` Dmitry Monakhov 2010-08-04 15:13 ` Lukas Czerner 2010-08-04 15:26 ` Greg Freemyer 2010-08-05 0:28 ` Ted Ts'o 2010-08-05 6:51 ` Dmitry Monakhov 2010-08-05 15:47 ` Andreas Dilger 2010-08-05 7:00 ` Dmitry Monakhov 2010-08-05 8:36 ` Lukas Czerner 2010-07-27 12:41 [PATCH 0/3 v3] Batched discard support for Ext3/Ext4 Lukas Czerner 2010-07-27 12:41 ` [PATCH 1/3] Add ioctl FITRIM Lukas Czerner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).