* [PATCH 1/3] Add ioctl FITRIM.
2010-07-27 12:41 [PATCH 0/3 v3] Batched discard support for Ext3/Ext4 Lukas Czerner
@ 2010-07-27 12:41 ` Lukas Czerner
2010-07-27 12:41 ` [PATCH 2/3] Add batched discard support for ext3 Lukas Czerner
2010-07-27 12:41 ` [PATCH 3/3] Add batched discard support for ext4 Lukas Czerner
2 siblings, 0 replies; 7+ messages in thread
From: Lukas Czerner @ 2010-07-27 12:41 UTC (permalink / raw)
To: linux-ext4
Cc: jmoyer, rwheeler, eshishki, sandeen, jack, tytso, Lukas Czerner
Adds an filesystem independent ioctl to allow implementation of file
system batched discard support.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
fs/ioctl.c | 31 +++++++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
2 files changed, 33 insertions(+), 0 deletions(-)
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 7faefb4..09b33ae 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -551,6 +551,33 @@ static int ioctl_fsthaw(struct file *filp)
return thaw_bdev(sb->s_bdev, sb);
}
+static int ioctl_fstrim(struct file *filp, unsigned long arg)
+{
+ struct super_block *sb = filp->f_path.dentry->d_inode->i_sb;
+ unsigned int minlen;
+ int err;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* If filesystem doesn't support trim feature, return. */
+ if (sb->s_op->trim_fs == NULL)
+ return -EOPNOTSUPP;
+
+ /* If a blockdevice-backed filesystem isn't specified, return EINVAL. */
+ if (sb->s_bdev == NULL)
+ return -EINVAL;
+
+ err = get_user(minlen, (unsigned int __user *) arg);
+ if (err)
+ return err;
+
+ err = sb->s_op->trim_fs(minlen, sb);
+ if (err)
+ return err;
+ return 0;
+}
+
/*
* When you add any new common ioctls to the switches above and below
* please update compat_sys_ioctl() too.
@@ -601,6 +628,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
error = ioctl_fsthaw(filp);
break;
+ case FITRIM:
+ error = ioctl_fstrim(filp, arg);
+ break;
+
case FS_IOC_FIEMAP:
return ioctl_fiemap(filp, arg);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 44f35ae..7a27fa4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -315,6 +315,7 @@ struct inodes_stat_t {
#define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */
#define FIFREEZE _IOWR('X', 119, int) /* Freeze */
#define FITHAW _IOWR('X', 120, int) /* Thaw */
+#define FITRIM _IOWR('X', 121, int) /* Trim */
#define FS_IOC_GETFLAGS _IOR('f', 1, long)
#define FS_IOC_SETFLAGS _IOW('f', 2, long)
@@ -1580,6 +1581,7 @@ struct super_operations {
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
#endif
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
+ int (*trim_fs) (unsigned int, struct super_block *);
};
/*
--
1.7.1.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH 2/3] Add batched discard support for ext3
2010-07-27 12:41 [PATCH 0/3 v3] Batched discard support for Ext3/Ext4 Lukas Czerner
2010-07-27 12:41 ` [PATCH 1/3] Add ioctl FITRIM Lukas Czerner
@ 2010-07-27 12:41 ` Lukas Czerner
2010-07-27 15:43 ` Jan Kara
2010-07-27 12:41 ` [PATCH 3/3] Add batched discard support for ext4 Lukas Czerner
2 siblings, 1 reply; 7+ messages in thread
From: Lukas Czerner @ 2010-07-27 12:41 UTC (permalink / raw)
To: linux-ext4
Cc: jmoyer, rwheeler, eshishki, sandeen, jack, tytso, Lukas Czerner
Walk through each allocation group and trim all free extents. It can be
invoked through TRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSD's
may suffer from performance loss after the whole device was filled (it
does not mean that fs is full!).
It search for free extents in each allocation group. When the free
extent is found, blocks are marked as used and then trimmed. Afterwards
these blocks are marked as free in per-group bitmap.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
fs/ext3/balloc.c | 229 +++++++++++++++++++++++++++++++++++++++++++++++
fs/ext3/super.c | 1 +
include/linux/ext3_fs.h | 1 +
3 files changed, 231 insertions(+), 0 deletions(-)
diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index a177122..97c6867 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -20,6 +20,7 @@
#include <linux/ext3_jbd.h>
#include <linux/quotaops.h>
#include <linux/buffer_head.h>
+#include <linux/blkdev.h>
/*
* balloc.c contains the blocks allocation and deallocation routines
@@ -1876,3 +1877,231 @@ unsigned long ext3_bg_num_gdb(struct super_block *sb, int group)
return ext3_bg_num_gdb_meta(sb,group);
}
+
+/**
+ * ext3_trim_all_free -- function to trim all free space in alloc. group
+ * @sb: super block for file system
+ * @group: allocation group to trim
+ * @gdp: allocation group description structure
+ * @minblocks: minimum extent block count
+ *
+ * ext3_trim_all_free walks through group's block bitmap searching for free
+ * blocks. When the free block is found, it tries to allocate this block and
+ * consequent free block to get the biggest free extent possible, until it
+ * reaches any used block. Then issue a TRIM command on this extent and free
+ * the extent in the block bitmap. This is done until whole group is scanned.
+ */
+ext3_grpblk_t ext3_trim_all_free(struct super_block *sb,
+ unsigned int group, ext3_grpblk_t minblocks)
+{
+ handle_t *handle;
+ ext3_grpblk_t max = EXT3_BLOCKS_PER_GROUP(sb);
+ ext3_grpblk_t next, count = 0, start, bit;
+ struct ext3_sb_info *sbi;
+ ext3_fsblk_t discard_block;
+ struct buffer_head *bitmap_bh = NULL;
+ struct buffer_head *gdp_bh;
+ ext3_grpblk_t free_blocks;
+ struct ext3_group_desc *gdp;
+ int err = 0, ret;
+ ext3_grpblk_t freed;
+
+ /**
+ * We will update one block bitmap, and one group descriptor
+ */
+ handle = ext3_journal_start_sb(sb, 2);
+ if (IS_ERR(handle)) {
+ err = PTR_ERR(handle);
+ ext3_warning(sb, __func__, "error %d on journal start", err);
+ return err;
+ }
+
+ bitmap_bh = read_block_bitmap(sb, group);
+ if (!bitmap_bh)
+ goto err_out;
+
+ BUFFER_TRACE(bitmap_bh, "getting undo access");
+ err = ext3_journal_get_undo_access(handle, bitmap_bh);
+ if (err)
+ goto err_out;
+
+ gdp = ext3_get_group_desc(sb, group, &gdp_bh);
+ if (!gdp)
+ goto err_out;
+
+ BUFFER_TRACE(gd_bh, "get_write_access");
+ err = ext3_journal_get_write_access(handle, gdp_bh);
+ if (err)
+ goto err_out;
+
+ free_blocks = le16_to_cpu(gdp->bg_free_blocks_count);
+ sbi = EXT3_SB(sb);
+
+ /* Walk through the whole group */
+ start = 0;
+ while (start < max) {
+
+ start = bitmap_search_next_usable_block(start, bitmap_bh, max);
+ if (start < 0)
+ break;
+ next = start;
+
+ /**
+ * Allocate contiguous free extents by setting bits in the
+ * block bitmap
+ */
+ while (next < max
+ && claim_block(sb_bgl_lock(sbi, group),
+ next, bitmap_bh)) {
+ next++;
+ }
+
+ /* We did not claimed any blocks */
+ if (next == start)
+ continue;
+
+ discard_block = (ext3_fsblk_t)start +
+ ext3_group_first_block_no(sb, group);
+
+ /* Update counters */
+ spin_lock(sb_bgl_lock(sbi, group));
+ le16_add_cpu(&gdp->bg_free_blocks_count, start - next);
+ spin_unlock(sb_bgl_lock(sbi, group));
+ percpu_counter_sub(&sbi->s_freeblocks_counter, next - start);
+
+ /* Do not issue a TRIM on extents smaller than minblocks */
+ if ((next - start) < minblocks)
+ goto free_extent;
+
+ /* Send the TRIM command down to the device */
+ sb_issue_discard(sb, discard_block, next - start);
+ count += (next - start);
+ cond_resched();
+
+free_extent:
+
+ freed = 0;
+ jbd_lock_bh_state(bitmap_bh);
+
+ for (bit = start; bit < next; bit++) {
+
+ /**
+ * @@@ This prevents newly-allocated data from being
+ * freed and then reallocated within the same
+ * transaction.
+ */
+ BUFFER_TRACE(bitmap_bh, "set in b_committed_data");
+ J_ASSERT_BH(bitmap_bh,
+ bh2jh(bitmap_bh)->b_committed_data != NULL);
+ ext3_set_bit_atomic(sb_bgl_lock(sbi, group), bit,
+ bh2jh(bitmap_bh)->b_committed_data);
+
+ /**
+ * We clear the bit in the bitmap after setting the
+ * committed data bit, because this is the reverse
+ * order to that which the allocator uses.
+ */
+ BUFFER_TRACE(bitmap_bh, "clear bit");
+ if (!ext3_clear_bit_atomic(sb_bgl_lock(sbi, group),
+ bit, bitmap_bh->b_data)) {
+ jbd_unlock_bh_state(bitmap_bh);
+ ext3_error(sb, __func__,
+ "bit already cleared for block "E3FSBLK,
+ (unsigned long)bit);
+ jbd_lock_bh_state(bitmap_bh);
+ BUFFER_TRACE(bitmap_bh, "bit already cleared");
+ } else {
+ freed++;
+ }
+ }
+
+ jbd_unlock_bh_state(bitmap_bh);
+
+ /* Update couters */
+ spin_lock(sb_bgl_lock(sbi, group));
+ le16_add_cpu(&gdp->bg_free_blocks_count, freed);
+ spin_unlock(sb_bgl_lock(sbi, group));
+ percpu_counter_add(&sbi->s_freeblocks_counter, next - start);
+
+ start = next;
+
+ if (signal_pending(current)) {
+ count = -ERESTARTSYS;
+ break;
+ }
+
+ /* No more suitable extents */
+ if ((free_blocks - count) < minblocks)
+ break;
+ }
+
+ /* We dirtied the bitmap block */
+ BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
+ err = ext3_journal_dirty_metadata(handle, bitmap_bh);
+
+ /* And the group descriptor block */
+ BUFFER_TRACE(gdp_bh, "dirtied group descriptor block");
+ ret = ext3_journal_dirty_metadata(handle, gdp_bh);
+ if (!err)
+ err = ret;
+
+ ext3_debug("trimmed %d blocks in the group %d\n",
+ count, group);
+
+err_out:
+ if (err) {
+ ext3_std_error(sb, err);
+ count = -err;
+ }
+
+ ext3_journal_stop(handle);
+ brelse(bitmap_bh);
+
+ return count;
+}
+
+/**
+ * ext3_trim_fs() -- trim ioctl handle function
+ * @minlen: minimum extent length in Bytes
+ * @sb: superblock for filesystem
+ *
+ * ext3_trim_fs goes through all allocation group searching for groups with more
+ * free space than minlen. For such a group ext3_trim_all_free function is
+ * invoked to trim all free space.
+ */
+int ext3_trim_fs(unsigned int minlen, struct super_block *sb)
+{
+ ext3_grpblk_t minblocks;
+ unsigned long ngroups;
+ unsigned int group;
+ struct ext3_group_desc *gdp;
+ ext3_grpblk_t free_blocks;
+ int ret = 0;
+
+ minblocks = DIV_ROUND_UP(minlen, sb->s_blocksize);
+ if (unlikely(minblocks > EXT3_BLOCKS_PER_GROUP(sb)))
+ return -EINVAL;
+
+ ngroups = EXT3_SB(sb)->s_groups_count;
+ smp_rmb();
+
+ for (group = 0; group < ngroups; group++) {
+
+ gdp = ext3_get_group_desc(sb, group, NULL);
+ if (!gdp)
+ break;
+
+ free_blocks = le16_to_cpu(gdp->bg_free_blocks_count);
+ if (free_blocks < minblocks)
+ continue;
+
+ ret = ext3_trim_all_free(sb, group, minblocks);
+ if (ret < 0)
+ break;
+ }
+
+ if (ret >= 0)
+ ret = 0;
+
+ return ret;
+}
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 1bee604..75688f3 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -791,6 +791,7 @@ static const struct super_operations ext3_sops = {
.quota_write = ext3_quota_write,
#endif
.bdev_try_to_free_page = bdev_try_to_free_page,
+ .trim_fs = ext3_trim_fs,
};
static const struct export_operations ext3_export_ops = {
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 5f494b4..6e3fefc 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -857,6 +857,7 @@ extern struct ext3_group_desc * ext3_get_group_desc(struct super_block * sb,
extern int ext3_should_retry_alloc(struct super_block *sb, int *retries);
extern void ext3_init_block_alloc_info(struct inode *);
extern void ext3_rsv_window_add(struct super_block *sb, struct ext3_reserve_window_node *rsv);
+extern int ext3_trim_fs(unsigned int minlen, struct super_block *sb);
/* dir.c */
extern int ext3_check_dir_entry(const char *, struct inode *,
--
1.7.1.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 2/3] Add batched discard support for ext3
2010-07-27 12:41 ` [PATCH 2/3] Add batched discard support for ext3 Lukas Czerner
@ 2010-07-27 15:43 ` Jan Kara
2010-07-28 9:13 ` Lukas Czerner
0 siblings, 1 reply; 7+ messages in thread
From: Jan Kara @ 2010-07-27 15:43 UTC (permalink / raw)
To: Lukas Czerner
Cc: linux-ext4, jmoyer, rwheeler, eshishki, sandeen, jack, tytso
On Tue 27-07-10 14:41:53, Lukas Czerner wrote:
> Walk through each allocation group and trim all free extents. It can be
> invoked through TRIM ioctl on the file system. The main idea is to
> provide a way to trim the whole file system if needed, since some SSD's
> may suffer from performance loss after the whole device was filled (it
> does not mean that fs is full!).
>
> It search for free extents in each allocation group. When the free
> extent is found, blocks are marked as used and then trimmed. Afterwards
> these blocks are marked as free in per-group bitmap.
Thanks for the patch. It looks much better than the previous one. Some
comments below.
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> ---
> fs/ext3/balloc.c | 229 +++++++++++++++++++++++++++++++++++++++++++++++
> fs/ext3/super.c | 1 +
> include/linux/ext3_fs.h | 1 +
> 3 files changed, 231 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
> index a177122..97c6867 100644
> --- a/fs/ext3/balloc.c
> +++ b/fs/ext3/balloc.c
> @@ -20,6 +20,7 @@
> #include <linux/ext3_jbd.h>
> #include <linux/quotaops.h>
> #include <linux/buffer_head.h>
> +#include <linux/blkdev.h>
>
> /*
> * balloc.c contains the blocks allocation and deallocation routines
> @@ -1876,3 +1877,231 @@ unsigned long ext3_bg_num_gdb(struct super_block *sb, int group)
> return ext3_bg_num_gdb_meta(sb,group);
>
> }
> +
> +/**
> + * ext3_trim_all_free -- function to trim all free space in alloc. group
> + * @sb: super block for file system
> + * @group: allocation group to trim
> + * @gdp: allocation group description structure
> + * @minblocks: minimum extent block count
> + *
> + * ext3_trim_all_free walks through group's block bitmap searching for free
> + * blocks. When the free block is found, it tries to allocate this block and
> + * consequent free block to get the biggest free extent possible, until it
> + * reaches any used block. Then issue a TRIM command on this extent and free
> + * the extent in the block bitmap. This is done until whole group is scanned.
> + */
> +ext3_grpblk_t ext3_trim_all_free(struct super_block *sb,
> + unsigned int group, ext3_grpblk_t minblocks)
> +{
> + handle_t *handle;
> + ext3_grpblk_t max = EXT3_BLOCKS_PER_GROUP(sb);
> + ext3_grpblk_t next, count = 0, start, bit;
> + struct ext3_sb_info *sbi;
> + ext3_fsblk_t discard_block;
> + struct buffer_head *bitmap_bh = NULL;
> + struct buffer_head *gdp_bh;
> + ext3_grpblk_t free_blocks;
> + struct ext3_group_desc *gdp;
> + int err = 0, ret;
> + ext3_grpblk_t freed;
> +
> + /**
> + * We will update one block bitmap, and one group descriptor
> + */
> + handle = ext3_journal_start_sb(sb, 2);
> + if (IS_ERR(handle)) {
> + err = PTR_ERR(handle);
> + ext3_warning(sb, __func__, "error %d on journal start", err);
I think there's no need to issue another warning here... When
journal_start fails there should be notices about this in the log already.
> + return err;
> + }
> +
> + bitmap_bh = read_block_bitmap(sb, group);
> + if (!bitmap_bh)
> + goto err_out;
> +
> + BUFFER_TRACE(bitmap_bh, "getting undo access");
> + err = ext3_journal_get_undo_access(handle, bitmap_bh);
> + if (err)
> + goto err_out;
> +
> + gdp = ext3_get_group_desc(sb, group, &gdp_bh);
> + if (!gdp)
> + goto err_out;
> +
> + BUFFER_TRACE(gd_bh, "get_write_access");
> + err = ext3_journal_get_write_access(handle, gdp_bh);
> + if (err)
> + goto err_out;
> +
> + free_blocks = le16_to_cpu(gdp->bg_free_blocks_count);
> + sbi = EXT3_SB(sb);
> +
> + /* Walk through the whole group */
> + start = 0;
> + while (start < max) {
> +
> + start = bitmap_search_next_usable_block(start, bitmap_bh, max);
> + if (start < 0)
> + break;
> + next = start;
> +
> + /**
^^^ Use just /*. /** is reserved for comments processed by
kernel documentation generator and they have to have special format...
> + * Allocate contiguous free extents by setting bits in the
> + * block bitmap
> + */
> + while (next < max
> + && claim_block(sb_bgl_lock(sbi, group),
> + next, bitmap_bh)) {
> + next++;
> + }
> +
> + /* We did not claimed any blocks */
^^^^ claim
> + if (next == start)
> + continue;
> +
> + discard_block = (ext3_fsblk_t)start +
> + ext3_group_first_block_no(sb, group);
> +
> + /* Update counters */
> + spin_lock(sb_bgl_lock(sbi, group));
> + le16_add_cpu(&gdp->bg_free_blocks_count, start - next);
> + spin_unlock(sb_bgl_lock(sbi, group));
> + percpu_counter_sub(&sbi->s_freeblocks_counter, next - start);
> +
> + /* Do not issue a TRIM on extents smaller than minblocks */
> + if ((next - start) < minblocks)
> + goto free_extent;
> +
> + /* Send the TRIM command down to the device */
> + sb_issue_discard(sb, discard_block, next - start);
> + count += (next - start);
> + cond_resched();
Maybe you could cond_resched() after freeing the blocks instead of here?
So that we don't sleep with blocks unnecessarily allocated?
> +
> +free_extent:
> +
> + freed = 0;
> + jbd_lock_bh_state(bitmap_bh);
> +
> + for (bit = start; bit < next; bit++) {
> +
> + /**
> + * @@@ This prevents newly-allocated data from being
> + * freed and then reallocated within the same
> + * transaction.
> + */
> + BUFFER_TRACE(bitmap_bh, "set in b_committed_data");
> + J_ASSERT_BH(bitmap_bh,
> + bh2jh(bitmap_bh)->b_committed_data != NULL);
> + ext3_set_bit_atomic(sb_bgl_lock(sbi, group), bit,
> + bh2jh(bitmap_bh)->b_committed_data);
You don't have to do this. Since you didn't really allocate the blocks
(you are actually never going to commit bitmap changes to the journal)
blocks can be happily reallocated just after you clear the bits in the
bitmap. There's no problem with that. Also you don't have to hold bh_state
lock at all.
> +
> + /**
> + * We clear the bit in the bitmap after setting the
> + * committed data bit, because this is the reverse
> + * order to that which the allocator uses.
> + */
> + BUFFER_TRACE(bitmap_bh, "clear bit");
> + if (!ext3_clear_bit_atomic(sb_bgl_lock(sbi, group),
> + bit, bitmap_bh->b_data)) {
> + jbd_unlock_bh_state(bitmap_bh);
> + ext3_error(sb, __func__,
> + "bit already cleared for block "E3FSBLK,
> + (unsigned long)bit);
> + jbd_lock_bh_state(bitmap_bh);
> + BUFFER_TRACE(bitmap_bh, "bit already cleared");
> + } else {
> + freed++;
> + }
> + }
> +
> + jbd_unlock_bh_state(bitmap_bh);
> +
> + /* Update couters */
> + spin_lock(sb_bgl_lock(sbi, group));
> + le16_add_cpu(&gdp->bg_free_blocks_count, freed);
> + spin_unlock(sb_bgl_lock(sbi, group));
> + percpu_counter_add(&sbi->s_freeblocks_counter, next - start);
> +
> + start = next;
> +
> + if (signal_pending(current)) {
> + count = -ERESTARTSYS;
> + break;
> + }
> +
> + /* No more suitable extents */
> + if ((free_blocks - count) < minblocks)
> + break;
> + }
> +
> + /* We dirtied the bitmap block */
> + BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
> + err = ext3_journal_dirty_metadata(handle, bitmap_bh);
> +
> + /* And the group descriptor block */
> + BUFFER_TRACE(gdp_bh, "dirtied group descriptor block");
> + ret = ext3_journal_dirty_metadata(handle, gdp_bh);
> + if (!err)
> + err = ret;
> +
> + ext3_debug("trimmed %d blocks in the group %d\n",
> + count, group);
> +
> +err_out:
> + if (err) {
> + ext3_std_error(sb, err);
> + count = -err;
> + }
> +
> + ext3_journal_stop(handle);
> + brelse(bitmap_bh);
> +
> + return count;
> +}
Otherwise the patch looks OK.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/3] Add batched discard support for ext3
2010-07-27 15:43 ` Jan Kara
@ 2010-07-28 9:13 ` Lukas Czerner
0 siblings, 0 replies; 7+ messages in thread
From: Lukas Czerner @ 2010-07-28 9:13 UTC (permalink / raw)
To: Jan Kara
Cc: Lukas Czerner, linux-ext4, jmoyer, rwheeler, eshishki, sandeen,
tytso
On Tue, 27 Jul 2010, Jan Kara wrote:
> On Tue 27-07-10 14:41:53, Lukas Czerner wrote:
> > Walk through each allocation group and trim all free extents. It can be
> > invoked through TRIM ioctl on the file system. The main idea is to
> > provide a way to trim the whole file system if needed, since some SSD's
> > may suffer from performance loss after the whole device was filled (it
> > does not mean that fs is full!).
> >
> > It search for free extents in each allocation group. When the free
> > extent is found, blocks are marked as used and then trimmed. Afterwards
> > these blocks are marked as free in per-group bitmap.
> Thanks for the patch. It looks much better than the previous one. Some
> comments below.
>
> > Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> > ---
> > fs/ext3/balloc.c | 229 +++++++++++++++++++++++++++++++++++++++++++++++
> > fs/ext3/super.c | 1 +
> > include/linux/ext3_fs.h | 1 +
> > 3 files changed, 231 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
> > index a177122..97c6867 100644
> > --- a/fs/ext3/balloc.c
> > +++ b/fs/ext3/balloc.c
> > @@ -20,6 +20,7 @@
> > #include <linux/ext3_jbd.h>
> > #include <linux/quotaops.h>
> > #include <linux/buffer_head.h>
> > +#include <linux/blkdev.h>
> >
> > /*
> > * balloc.c contains the blocks allocation and deallocation routines
> > @@ -1876,3 +1877,231 @@ unsigned long ext3_bg_num_gdb(struct super_block *sb, int group)
> > return ext3_bg_num_gdb_meta(sb,group);
> >
> > }
> > +
> > +/**
> > + * ext3_trim_all_free -- function to trim all free space in alloc. group
> > + * @sb: super block for file system
> > + * @group: allocation group to trim
> > + * @gdp: allocation group description structure
> > + * @minblocks: minimum extent block count
> > + *
> > + * ext3_trim_all_free walks through group's block bitmap searching for free
> > + * blocks. When the free block is found, it tries to allocate this block and
> > + * consequent free block to get the biggest free extent possible, until it
> > + * reaches any used block. Then issue a TRIM command on this extent and free
> > + * the extent in the block bitmap. This is done until whole group is scanned.
> > + */
> > +ext3_grpblk_t ext3_trim_all_free(struct super_block *sb,
> > + unsigned int group, ext3_grpblk_t minblocks)
> > +{
> > + handle_t *handle;
> > + ext3_grpblk_t max = EXT3_BLOCKS_PER_GROUP(sb);
> > + ext3_grpblk_t next, count = 0, start, bit;
> > + struct ext3_sb_info *sbi;
> > + ext3_fsblk_t discard_block;
> > + struct buffer_head *bitmap_bh = NULL;
> > + struct buffer_head *gdp_bh;
> > + ext3_grpblk_t free_blocks;
> > + struct ext3_group_desc *gdp;
> > + int err = 0, ret;
> > + ext3_grpblk_t freed;
> > +
> > + /**
> > + * We will update one block bitmap, and one group descriptor
> > + */
> > + handle = ext3_journal_start_sb(sb, 2);
> > + if (IS_ERR(handle)) {
> > + err = PTR_ERR(handle);
> > + ext3_warning(sb, __func__, "error %d on journal start", err);
> I think there's no need to issue another warning here... When
> journal_start fails there should be notices about this in the log already.
That's right, thanks.
>
> > + return err;
> > + }
> > +
> > + bitmap_bh = read_block_bitmap(sb, group);
> > + if (!bitmap_bh)
> > + goto err_out;
> > +
> > + BUFFER_TRACE(bitmap_bh, "getting undo access");
> > + err = ext3_journal_get_undo_access(handle, bitmap_bh);
> > + if (err)
> > + goto err_out;
> > +
> > + gdp = ext3_get_group_desc(sb, group, &gdp_bh);
> > + if (!gdp)
> > + goto err_out;
> > +
> > + BUFFER_TRACE(gd_bh, "get_write_access");
> > + err = ext3_journal_get_write_access(handle, gdp_bh);
> > + if (err)
> > + goto err_out;
> > +
> > + free_blocks = le16_to_cpu(gdp->bg_free_blocks_count);
> > + sbi = EXT3_SB(sb);
> > +
> > + /* Walk through the whole group */
> > + start = 0;
> > + while (start < max) {
> > +
> > + start = bitmap_search_next_usable_block(start, bitmap_bh, max);
> > + if (start < 0)
> > + break;
> > + next = start;
> > +
> > + /**
> ^^^ Use just /*. /** is reserved for comments processed by
> kernel documentation generator and they have to have special format...
Ok.
>
> > + * Allocate contiguous free extents by setting bits in the
> > + * block bitmap
> > + */
> > + while (next < max
> > + && claim_block(sb_bgl_lock(sbi, group),
> > + next, bitmap_bh)) {
> > + next++;
> > + }
> > +
> > + /* We did not claimed any blocks */
> ^^^^ claim
Oh, that is my astonishing English skills ;). Thanks.
>
> > + if (next == start)
> > + continue;
> > +
> > + discard_block = (ext3_fsblk_t)start +
> > + ext3_group_first_block_no(sb, group);
> > +
> > + /* Update counters */
> > + spin_lock(sb_bgl_lock(sbi, group));
> > + le16_add_cpu(&gdp->bg_free_blocks_count, start - next);
> > + spin_unlock(sb_bgl_lock(sbi, group));
> > + percpu_counter_sub(&sbi->s_freeblocks_counter, next - start);
> > +
> > + /* Do not issue a TRIM on extents smaller than minblocks */
> > + if ((next - start) < minblocks)
> > + goto free_extent;
> > +
> > + /* Send the TRIM command down to the device */
> > + sb_issue_discard(sb, discard_block, next - start);
> > + count += (next - start);
> > + cond_resched();
> Maybe you could cond_resched() after freeing the blocks instead of here?
> So that we don't sleep with blocks unnecessarily allocated?
Of course, that would be better.
>
> > +
> > +free_extent:
> > +
> > + freed = 0;
> > + jbd_lock_bh_state(bitmap_bh);
> > +
> > + for (bit = start; bit < next; bit++) {
> > +
> > + /**
> > + * @@@ This prevents newly-allocated data from being
> > + * freed and then reallocated within the same
> > + * transaction.
> > + */
> > + BUFFER_TRACE(bitmap_bh, "set in b_committed_data");
> > + J_ASSERT_BH(bitmap_bh,
> > + bh2jh(bitmap_bh)->b_committed_data != NULL);
> > + ext3_set_bit_atomic(sb_bgl_lock(sbi, group), bit,
> > + bh2jh(bitmap_bh)->b_committed_data);
> You don't have to do this. Since you didn't really allocate the blocks
> (you are actually never going to commit bitmap changes to the journal)
> blocks can be happily reallocated just after you clear the bits in the
> bitmap. There's no problem with that. Also you don't have to hold bh_state
> lock at all.
Thanks.
>
> > +
> > + /**
> > + * We clear the bit in the bitmap after setting the
> > + * committed data bit, because this is the reverse
> > + * order to that which the allocator uses.
> > + */
> > + BUFFER_TRACE(bitmap_bh, "clear bit");
> > + if (!ext3_clear_bit_atomic(sb_bgl_lock(sbi, group),
> > + bit, bitmap_bh->b_data)) {
> > + jbd_unlock_bh_state(bitmap_bh);
> > + ext3_error(sb, __func__,
> > + "bit already cleared for block "E3FSBLK,
> > + (unsigned long)bit);
> > + jbd_lock_bh_state(bitmap_bh);
> > + BUFFER_TRACE(bitmap_bh, "bit already cleared");
> > + } else {
> > + freed++;
> > + }
> > + }
> > +
> > + jbd_unlock_bh_state(bitmap_bh);
> > +
> > + /* Update couters */
> > + spin_lock(sb_bgl_lock(sbi, group));
> > + le16_add_cpu(&gdp->bg_free_blocks_count, freed);
> > + spin_unlock(sb_bgl_lock(sbi, group));
> > + percpu_counter_add(&sbi->s_freeblocks_counter, next - start);
> > +
> > + start = next;
> > +
> > + if (signal_pending(current)) {
> > + count = -ERESTARTSYS;
> > + break;
> > + }
> > +
> > + /* No more suitable extents */
> > + if ((free_blocks - count) < minblocks)
> > + break;
> > + }
> > +
> > + /* We dirtied the bitmap block */
> > + BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
> > + err = ext3_journal_dirty_metadata(handle, bitmap_bh);
> > +
> > + /* And the group descriptor block */
> > + BUFFER_TRACE(gdp_bh, "dirtied group descriptor block");
> > + ret = ext3_journal_dirty_metadata(handle, gdp_bh);
> > + if (!err)
> > + err = ret;
> > +
> > + ext3_debug("trimmed %d blocks in the group %d\n",
> > + count, group);
> > +
> > +err_out:
> > + if (err) {
> > + ext3_std_error(sb, err);
> > + count = -err;
> > + }
> > +
> > + ext3_journal_stop(handle);
> > + brelse(bitmap_bh);
> > +
> > + return count;
> > +}
> Otherwise the patch looks OK.
Great, thanks for looking at this. I will resend the patch shortly.
>
> Honza
>
-Lukas
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 3/3] Add batched discard support for ext4
2010-07-27 12:41 [PATCH 0/3 v3] Batched discard support for Ext3/Ext4 Lukas Czerner
2010-07-27 12:41 ` [PATCH 1/3] Add ioctl FITRIM Lukas Czerner
2010-07-27 12:41 ` [PATCH 2/3] Add batched discard support for ext3 Lukas Czerner
@ 2010-07-27 12:41 ` Lukas Czerner
2010-07-27 16:28 ` Jan Kara
2 siblings, 1 reply; 7+ messages in thread
From: Lukas Czerner @ 2010-07-27 12:41 UTC (permalink / raw)
To: linux-ext4
Cc: jmoyer, rwheeler, eshishki, sandeen, jack, tytso, Lukas Czerner,
Dmitry Monakhov
Walk through each allocation group and trim all free extents. It can be
invoked through TRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSD's
may suffer from performance loss after the whole device was filled (it
does not mean that fs is full!).
It search fro free extents in each allocation group. When the free
extent is found, blocks are marked as used and then trimmed. Afterwards
these blocks are marked as free in per-group bitmap.
Since fstrim is a long operation it is good to have an ability to interrupt
it by a signal. This was added by Dmitry Monakhov. Thanks Dimitry.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/ext4.h | 2 +
fs/ext4/mballoc.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/super.c | 1 +
3 files changed, 106 insertions(+), 0 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf938cf..ba0fff0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1437,6 +1437,8 @@ extern int ext4_mb_add_groupinfo(struct super_block *sb,
extern int ext4_mb_get_buddy_cache_lock(struct super_block *, ext4_group_t);
extern void ext4_mb_put_buddy_cache_lock(struct super_block *,
ext4_group_t, int);
+extern int ext4_trim_fs(unsigned int, struct super_block *);
+
/* inode.c */
struct buffer_head *ext4_getblk(handle_t *, struct inode *,
ext4_lblk_t, int, int *);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b423a36..f00b7dd 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4640,3 +4640,106 @@ error_return:
kmem_cache_free(ext4_ac_cachep, ac);
return;
}
+
+/**
+ * Trim "count" blocks starting at "start" in "group"
+ * This must be called under group lock
+ */
+static void ext4_trim_extent(struct super_block *sb, int start, int count,
+ ext4_group_t group)
+{
+ ext4_fsblk_t discard_block;
+ struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+
+ discard_block = (ext4_fsblk_t)group *
+ EXT4_BLOCKS_PER_GROUP(sb)
+ + start
+ + le32_to_cpu(es->s_first_data_block);
+ trace_ext4_discard_blocks(sb,
+ (unsigned long long)discard_block,
+ count);
+ sb_issue_discard(sb, discard_block, count);
+ cond_resched();
+}
+
+/**
+ * Trim all free extents in group at least minblocks long
+ */
+ext4_grpblk_t ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
+ ext4_grpblk_t minblocks)
+{
+ struct buffer_head *bitmap_bh = NULL;
+ ext4_grpblk_t max = EXT4_BLOCKS_PER_GROUP(sb);
+ ext4_grpblk_t start, next, count = 0;
+ struct ext4_group_info *grp;
+ int err = 0;
+
+ err = -EIO;
+ bitmap_bh = ext4_read_block_bitmap(sb, group);
+ if (!bitmap_bh)
+ return 0;
+
+ grp = ext4_get_group_info(sb, group);
+ start = grp->bb_first_free;
+
+ down_write(&grp->alloc_sem);
+ while (start < max) {
+
+ start = mb_find_next_zero_bit(bitmap_bh->b_data, max, start);
+ if (start >= max)
+ break;
+ next = mb_find_next_bit(bitmap_bh->b_data, max, start);
+
+ if ((next - start) >= minblocks) {
+ count += next - start;
+ ext4_trim_extent(sb, start,
+ next - start, group);
+ }
+ start = next + 1;
+ if (signal_pending(current)) {
+ count = -ERESTARTSYS;
+ break;
+ }
+ if ((grp->bb_free - count) < minblocks)
+ break;
+ }
+ up_write(&grp->alloc_sem);
+
+ ext4_debug("trimmed %d blocks in the group %d\n",
+ count, group);
+
+ brelse(bitmap_bh);
+ return count;
+}
+
+/**
+ * ext4_trim_fs goes through all allocation groups searching for group with
+ * more free space than minlen. For such a group ext4_trim_all_free function
+ * is invoked to trim all free space.
+ */
+int ext4_trim_fs(unsigned int minlen, struct super_block *sb)
+{
+ ext4_group_t group;
+ ext4_group_t ngroups = ext4_get_groups_count(sb);
+ ext4_grpblk_t minblocks, cnt;
+ struct ext4_group_info *grp;
+ int ret = 0;
+
+ minblocks = DIV_ROUND_UP(minlen, sb->s_blocksize);
+ if (unlikely(minblocks > EXT4_BLOCKS_PER_GROUP(sb)))
+ return -EINVAL;
+
+ for (group = 0; group < ngroups; group++) {
+
+ grp = ext4_get_group_info(sb, group);
+
+ if (grp->bb_free >= minblocks) {
+ cnt = ext4_trim_all_free(sb, group, minblocks);
+ if (cnt < 0) {
+ ret = cnt;
+ break;
+ }
+ }
+ }
+ return ret;
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e14d22c..253eb98 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1109,6 +1109,7 @@ static const struct super_operations ext4_sops = {
.quota_write = ext4_quota_write,
#endif
.bdev_try_to_free_page = bdev_try_to_free_page,
+ .trim_fs = ext4_trim_fs
};
static const struct super_operations ext4_nojournal_sops = {
--
1.7.1.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH 3/3] Add batched discard support for ext4
2010-07-27 12:41 ` [PATCH 3/3] Add batched discard support for ext4 Lukas Czerner
@ 2010-07-27 16:28 ` Jan Kara
0 siblings, 0 replies; 7+ messages in thread
From: Jan Kara @ 2010-07-27 16:28 UTC (permalink / raw)
To: Lukas Czerner
Cc: linux-ext4, jmoyer, rwheeler, eshishki, sandeen, jack, tytso,
Dmitry Monakhov
On Tue 27-07-10 14:41:54, Lukas Czerner wrote:
> Walk through each allocation group and trim all free extents. It can be
> invoked through TRIM ioctl on the file system. The main idea is to
> provide a way to trim the whole file system if needed, since some SSD's
> may suffer from performance loss after the whole device was filled (it
> does not mean that fs is full!).
>
> It search fro free extents in each allocation group. When the free
> extent is found, blocks are marked as used and then trimmed. Afterwards
> these blocks are marked as free in per-group bitmap.
>
> Since fstrim is a long operation it is good to have an ability to interrupt
> it by a signal. This was added by Dmitry Monakhov. Thanks Dimitry.
>
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
A couple of comments below...
> ---
> fs/ext4/ext4.h | 2 +
> fs/ext4/mballoc.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> fs/ext4/super.c | 1 +
> 3 files changed, 106 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index b423a36..f00b7dd 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4640,3 +4640,106 @@ error_return:
> kmem_cache_free(ext4_ac_cachep, ac);
> return;
> }
> +
> +/**
> + * Trim "count" blocks starting at "start" in "group"
> + * This must be called under group lock
> + */
> +static void ext4_trim_extent(struct super_block *sb, int start, int count,
> + ext4_group_t group)
> +{
> + ext4_fsblk_t discard_block;
> + struct ext4_super_block *es = EXT4_SB(sb)->s_es;
> +
> + discard_block = (ext4_fsblk_t)group *
> + EXT4_BLOCKS_PER_GROUP(sb)
> + + start
> + + le32_to_cpu(es->s_first_data_block);
You can use ext4_group_first_block_no() I believe.
> + trace_ext4_discard_blocks(sb,
> + (unsigned long long)discard_block,
> + count);
> + sb_issue_discard(sb, discard_block, count);
> + cond_resched();
> +}
> +
> +/**
> + * Trim all free extents in group at least minblocks long
> + */
> +ext4_grpblk_t ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
> + ext4_grpblk_t minblocks)
> +{
> + struct buffer_head *bitmap_bh = NULL;
> + ext4_grpblk_t max = EXT4_BLOCKS_PER_GROUP(sb);
> + ext4_grpblk_t start, next, count = 0;
> + struct ext4_group_info *grp;
> + int err = 0;
> +
> + err = -EIO;
> + bitmap_bh = ext4_read_block_bitmap(sb, group);
> + if (!bitmap_bh)
> + return 0;
> +
> + grp = ext4_get_group_info(sb, group);
> + start = grp->bb_first_free;
> +
> + down_write(&grp->alloc_sem);
> + while (start < max) {
> +
> + start = mb_find_next_zero_bit(bitmap_bh->b_data, max, start);
> + if (start >= max)
> + break;
> + next = mb_find_next_bit(bitmap_bh->b_data, max, start);
Hmm, I don't think this is right. If you want to avoid doing the same
thing as you do for ext3, you have to use the buddy bitmap and not the
on-disk bitmap (we free blocks from a buddy bitmap only after a transaction
freeing them is committed). That way you avoid trimming blocks that were
freed in the transaction which is just committing (you mustn't do
that). So you have to load the buddy bitmap for the group into memory
(ext4_mb_load_buddy()), lock the group (ext4_group_lock()), and then you
can investigate the buddy bitmap (EXT4_MB_BITMAP()). You could actually use
the buddy information to make scanning bitmap faster (it carries
information about larger chunks of free blocks) but that's a voluntary
bonus I think ;).
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 7+ messages in thread