linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] Add ioctl FITRIM.
  2010-04-19 10:55 Ext4: batched discard support Lukas Czerner
@ 2010-04-19 10:55 ` Lukas Czerner
  0 siblings, 0 replies; 16+ messages in thread
From: Lukas Czerner @ 2010-04-19 10:55 UTC (permalink / raw)
  To: linux-ext4
  Cc: Jeff Moyer, Edward Shishkin, Eric Sandeen, Ric Wheeler,
	Lukas Czerner

Adds an filesystem independent ioctl to allow implementation of file
system batched discard support.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ioctl.c         |   31 +++++++++++++++++++++++++++++++
 include/linux/fs.h |    2 ++
 2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 6c75110..a43eaea 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -537,6 +537,33 @@ static int ioctl_fsthaw(struct file *filp)
 	return thaw_bdev(sb->s_bdev, sb);
 }
 
+static int ioctl_fstrim(struct file *filp, unsigned long arg)
+{
+	struct super_block *sb = filp->f_path.dentry->d_inode->i_sb;
+	unsigned int minlen;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* If filesystem doesn't support trim feature, return. */
+	if (sb->s_op->trim_fs == NULL)
+		return -EOPNOTSUPP;
+
+	/* If a blockdevice-backed filesystem isn't specified, return EINVAL. */
+	if (sb->s_bdev == NULL)
+		return -EINVAL;
+
+	err = get_user(minlen, (unsigned int __user *) arg);
+	if (err)
+		return err;
+
+	err = sb->s_op->trim_fs(minlen, sb);
+	if (err)
+		return err;
+	return 0;
+}
+
 /*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
@@ -587,6 +614,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 		error = ioctl_fsthaw(filp);
 		break;
 
+	case FITRIM:
+		error = ioctl_fstrim(filp, arg);
+		break;
+
 	case FS_IOC_FIEMAP:
 		return ioctl_fiemap(filp, arg);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 39d57bc..f6ecdff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -315,6 +315,7 @@ struct inodes_stat_t {
 #define FIGETBSZ   _IO(0x00,2)	/* get the block size used for bmap */
 #define FIFREEZE	_IOWR('X', 119, int)	/* Freeze */
 #define FITHAW		_IOWR('X', 120, int)	/* Thaw */
+#define FITRIM		_IOWR('X', 121, int)	/* Trim */
 
 #define	FS_IOC_GETFLAGS			_IOR('f', 1, long)
 #define	FS_IOC_SETFLAGS			_IOW('f', 2, long)
@@ -1580,6 +1581,7 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
+	int (*trim_fs) (unsigned int, struct super_block *);
 };
 
 /*
-- 
1.6.6.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Ext4: batched discard support - simplified version
@ 2010-07-07  7:53 Lukas Czerner
  2010-07-07  7:53 ` [PATCH 1/2] Add ioctl FITRIM Lukas Czerner
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Lukas Czerner @ 2010-07-07  7:53 UTC (permalink / raw)
  To: eshishki; +Cc: lczerner, jmoyer, rwheeler, linux-ext4, sandeen


Hi all,

since my last post I have done some more testing with various SSD's and the
trend is clear. Trim performance is getting better and the performance loss
without trim is getting lower. So I have decided to abandon the initial idea
to track free blocks within some internal data structure - it takes time and
memory.

Today there are some SSD's which performance does not seems to degrade over
the time (writes). I have filled those devices up to 200% and still did not
seen any performance loss. On the other hand, there are still some devices
which shows about 300% performance degradation, so I suppose TRIM will be
still needed for some time.

You can try it out with the simple program attached below. Just create ext4
fs, mount it with -o discard and invoke attached program on ext4 mount point.

>From my experience the time needed to trim whole file system is strongly device
dependent. It may take few seconds on one device up to one minute on another,
under the heavy io load the time to trim whole fs gets longer.

There are two pathes:

[PATCH 1/2] Add ioctl FITRIM.
 fs/ioctl.c         |   31 +++++++++++++++++++++++++++++++
 include/linux/fs.h |    2 ++
 2 files changed, 33 insertions(+), 0 deletions(-)

[PATCH 2/2] Add batched discard support for ext4
 fs/ext4/ext4.h    |    2 +
 fs/ext4/mballoc.c |  126 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 fs/ext4/super.c   |    1 +
 3 files changed, 118 insertions(+), 11 deletions(-)

Signed-off-by: "Lukas Czerner" <lczerner@redhat.com>


#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/ioctl.h>

#define FITRIM		_IOWR('X', 121, int)

int main(int argc, char **argv)
{
	int minsize = 4096;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s mountpoint\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (ioctl(fd, FITRIM, &minsize)) {
		if (errno == EOPNOTSUPP)
			fprintf(stderr, "TRIM not supported\n");
		else
			perror("EXT4_IOC_TRIM");
		return 1;
	}

	return 0;
}

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/2] Add ioctl FITRIM.
  2010-07-07  7:53 Ext4: batched discard support - simplified version Lukas Czerner
@ 2010-07-07  7:53 ` Lukas Czerner
  2010-07-07  7:53 ` [PATCH 2/2] Add batched discard support for ext4 Lukas Czerner
  2010-07-23 14:36 ` Ext4: batched discard support - simplified version Ted Ts'o
  2 siblings, 0 replies; 16+ messages in thread
From: Lukas Czerner @ 2010-07-07  7:53 UTC (permalink / raw)
  To: eshishki; +Cc: lczerner, jmoyer, rwheeler, linux-ext4, sandeen

Adds an filesystem independent ioctl to allow implementation of file
system batched discard support.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ioctl.c         |   31 +++++++++++++++++++++++++++++++
 include/linux/fs.h |    2 ++
 2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 7faefb4..09b33ae 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -551,6 +551,33 @@ static int ioctl_fsthaw(struct file *filp)
 	return thaw_bdev(sb->s_bdev, sb);
 }
 
+static int ioctl_fstrim(struct file *filp, unsigned long arg)
+{
+	struct super_block *sb = filp->f_path.dentry->d_inode->i_sb;
+	unsigned int minlen;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* If filesystem doesn't support trim feature, return. */
+	if (sb->s_op->trim_fs == NULL)
+		return -EOPNOTSUPP;
+
+	/* If a blockdevice-backed filesystem isn't specified, return EINVAL. */
+	if (sb->s_bdev == NULL)
+		return -EINVAL;
+
+	err = get_user(minlen, (unsigned int __user *) arg);
+	if (err)
+		return err;
+
+	err = sb->s_op->trim_fs(minlen, sb);
+	if (err)
+		return err;
+	return 0;
+}
+
 /*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
@@ -601,6 +628,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 		error = ioctl_fsthaw(filp);
 		break;
 
+	case FITRIM:
+		error = ioctl_fstrim(filp, arg);
+		break;
+
 	case FS_IOC_FIEMAP:
 		return ioctl_fiemap(filp, arg);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 44f35ae..7a27fa4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -315,6 +315,7 @@ struct inodes_stat_t {
 #define FIGETBSZ   _IO(0x00,2)	/* get the block size used for bmap */
 #define FIFREEZE	_IOWR('X', 119, int)	/* Freeze */
 #define FITHAW		_IOWR('X', 120, int)	/* Thaw */
+#define FITRIM		_IOWR('X', 121, int)	/* Trim */
 
 #define	FS_IOC_GETFLAGS			_IOR('f', 1, long)
 #define	FS_IOC_SETFLAGS			_IOW('f', 2, long)
@@ -1580,6 +1581,7 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
+	int (*trim_fs) (unsigned int, struct super_block *);
 };
 
 /*
-- 
1.6.6.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/2] Add batched discard support for ext4
  2010-07-07  7:53 Ext4: batched discard support - simplified version Lukas Czerner
  2010-07-07  7:53 ` [PATCH 1/2] Add ioctl FITRIM Lukas Czerner
@ 2010-07-07  7:53 ` Lukas Czerner
  2010-07-14  8:33   ` Dmitry Monakhov
  2010-07-23 14:36 ` Ext4: batched discard support - simplified version Ted Ts'o
  2 siblings, 1 reply; 16+ messages in thread
From: Lukas Czerner @ 2010-07-07  7:53 UTC (permalink / raw)
  To: eshishki; +Cc: lczerner, jmoyer, rwheeler, linux-ext4, sandeen

Walk through each allocation group and trim all free extents. It can be
invoked through TRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSD's
may suffer from performance loss after the whole device was filled (it
does not mean that fs is full!).

It search fro free extents in each allocation group. When the free
extent is found, blocks are marked as used and then trimmed. Afterwards
these blocks are marked as free in per-group bitmap.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/ext4.h    |    2 +
 fs/ext4/mballoc.c |  126 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 fs/ext4/super.c   |    1 +
 3 files changed, 118 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf938cf..ba0fff0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1437,6 +1437,8 @@ extern int ext4_mb_add_groupinfo(struct super_block *sb,
 extern int ext4_mb_get_buddy_cache_lock(struct super_block *, ext4_group_t);
 extern void ext4_mb_put_buddy_cache_lock(struct super_block *,
 						ext4_group_t, int);
+extern int ext4_trim_fs(unsigned int, struct super_block *);
+
 /* inode.c */
 struct buffer_head *ext4_getblk(handle_t *, struct inode *,
 						ext4_lblk_t, int, int *);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b423a36..c7b541c 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2535,17 +2535,6 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn)
 		mb_debug(1, "gonna free %u blocks in group %u (0x%p):",
 			 entry->count, entry->group, entry);
 
-		if (test_opt(sb, DISCARD)) {
-			ext4_fsblk_t discard_block;
-
-			discard_block = entry->start_blk +
-				ext4_group_first_block_no(sb, entry->group);
-			trace_ext4_discard_blocks(sb,
-					(unsigned long long)discard_block,
-					entry->count);
-			sb_issue_discard(sb, discard_block, entry->count);
-		}
-
 		err = ext4_mb_load_buddy(sb, entry->group, &e4b);
 		/* we expect to find existing buddy because it's pinned */
 		BUG_ON(err != 0);
@@ -4640,3 +4629,118 @@ error_return:
 		kmem_cache_free(ext4_ac_cachep, ac);
 	return;
 }
+
+/**
+ * Trim "count" blocks starting at "start" in "group"
+ * This must be called under group lock
+ */
+static void ext4_trim_extent(struct super_block *sb, int start, int count,
+		ext4_group_t group, struct ext4_buddy *e4b)
+{
+	ext4_fsblk_t discard_block;
+	struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+	struct ext4_free_extent ex;
+
+	assert_spin_locked(ext4_group_lock_ptr(sb, group));
+
+	ex.fe_start = start;
+	ex.fe_group = group;
+	ex.fe_len = count;
+
+	/**
+	 * Mark blocks used, so no one can reuse them while
+	 * being trimmed.
+	 */
+	mb_mark_used(e4b, &ex);
+	ext4_unlock_group(sb, group);
+
+	discard_block = (ext4_fsblk_t)group *
+			EXT4_BLOCKS_PER_GROUP(sb)
+			+ start
+			+ le32_to_cpu(es->s_first_data_block);
+	trace_ext4_discard_blocks(sb,
+			(unsigned long long)discard_block,
+			count);
+	sb_issue_discard(sb, discard_block, count);
+
+	ext4_lock_group(sb, group);
+	mb_free_blocks(NULL, e4b, start, ex.fe_len);
+}
+
+/**
+ * Trim all free extents in group at least minblocks long
+ */
+ext4_grpblk_t ext4_trim_all_free(struct super_block *sb, struct ext4_buddy *e4b,
+		ext4_grpblk_t minblocks)
+{
+	void *bitmap;
+	ext4_grpblk_t max = EXT4_BLOCKS_PER_GROUP(sb);
+	ext4_grpblk_t start, next, count = 0;
+	ext4_group_t group;
+
+	BUG_ON(e4b == NULL);
+
+	bitmap = e4b->bd_bitmap;
+	group = e4b->bd_group;
+	start = e4b->bd_info->bb_first_free;
+	ext4_lock_group(sb, group);
+
+	while (start < max) {
+
+		start = mb_find_next_zero_bit(bitmap, max, start);
+		if (start >= max)
+			break;
+		next = mb_find_next_bit(bitmap, max, start);
+
+		if ((next - start) >= minblocks) {
+			count += next - start;
+			ext4_trim_extent(sb, start,
+				next - start, group, e4b);
+		}
+		start = next + 1;
+
+		if ((e4b->bd_info->bb_free - count) < minblocks)
+			break;
+	}
+
+	ext4_unlock_group(sb, group);
+
+	ext4_debug("trimmed %d blocks in the group %d\n",
+		count, group);
+
+	return count;
+}
+
+int ext4_trim_fs(unsigned int minlen, struct super_block *sb)
+{
+	struct ext4_buddy e4b;
+	ext4_group_t group;
+	ext4_group_t ngroups = ext4_get_groups_count(sb);
+	ext4_grpblk_t minblocks;
+
+	if (!test_opt(sb, DISCARD))
+		return 0;
+
+	minblocks = DIV_ROUND_UP(minlen, sb->s_blocksize);
+	if (unlikely(minblocks > EXT4_BLOCKS_PER_GROUP(sb)))
+		return -EINVAL;
+
+	for (group = 0; group < ngroups; group++) {
+		int err;
+
+		err = ext4_mb_load_buddy(sb, group, &e4b);
+		if (err) {
+			ext4_error(sb, "Error in loading buddy "
+					"information for %u", group);
+			continue;
+		}
+
+		if (e4b.bd_info->bb_free >= minblocks) {
+			ext4_trim_all_free(sb, &e4b, minblocks);
+		}
+
+		ext4_mb_release_desc(&e4b);
+	}
+
+	return 0;
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e14d22c..253eb98 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1109,6 +1109,7 @@ static const struct super_operations ext4_sops = {
 	.quota_write	= ext4_quota_write,
 #endif
 	.bdev_try_to_free_page = bdev_try_to_free_page,
+	.trim_fs	= ext4_trim_fs
 };
 
 static const struct super_operations ext4_nojournal_sops = {
-- 
1.6.6.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] Add batched discard support for ext4
  2010-07-07  7:53 ` [PATCH 2/2] Add batched discard support for ext4 Lukas Czerner
@ 2010-07-14  8:33   ` Dmitry Monakhov
  2010-07-14  9:40     ` Lukas Czerner
  0 siblings, 1 reply; 16+ messages in thread
From: Dmitry Monakhov @ 2010-07-14  8:33 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: eshishki, jmoyer, rwheeler, linux-ext4, sandeen

[-- Attachment #1: Type: text/plain, Size: 1086 bytes --]

Lukas Czerner <lczerner@redhat.com> writes:

> Walk through each allocation group and trim all free extents. It can be
> invoked through TRIM ioctl on the file system. The main idea is to
> provide a way to trim the whole file system if needed, since some SSD's
> may suffer from performance loss after the whole device was filled (it
> does not mean that fs is full!).
>
> It search fro free extents in each allocation group. When the free
> extent is found, blocks are marked as used and then trimmed. Afterwards
> these blocks are marked as free in per-group bitmap.
Looks ok, except two small notes:
1) trim_fs is a time consuming operation and we have to add
   condresced, and signal_pending checks to allow user to interrupt
   cmd if necessery. See patch attached.
2) IMHO runtime trim support is useful sometimes, for example when
   user really care about data security i.e. unlinked file should be
   trimmed ASAP. I think we have to provide 'secdel' mount option
   similar to secdeletion flag for inode, but this is another story
   not directly connected with the patch.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-ext4-Add-interrupt-points-to-batched-discard.patch --]
[-- Type: text/x-diff, Size: 2230 bytes --]

>From 54ca2add544f88ac9ca8c647ae7b093be9ac2872 Mon Sep 17 00:00:00 2001
From: Dmitry Monakhov <dmonakhov@openvz.org>
Date: Wed, 14 Jul 2010 12:16:50 +0400
Subject: [PATCH] ext4: Add interrupt points to batched discard

Since fstrim is a long operation it will be good to have an
ability to interrupt it by a signal.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/mballoc.c |   27 +++++++++++++++++----------
 1 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 48abd3d..0948408 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3959,7 +3959,7 @@ static void ext4_trim_extent(struct super_block *sb, int start, int count,
 			(unsigned long long)discard_block,
 			count);
 	sb_issue_discard(sb, discard_block, count);
-
+	cond_resched();
 	ext4_lock_group(sb, group);
 	mb_free_blocks(NULL, e4b, start, ex.fe_len);
 }
@@ -3995,7 +3995,10 @@ ext4_grpblk_t ext4_trim_all_free(struct super_block *sb, struct ext4_buddy *e4b,
 				next - start, group, e4b);
 		}
 		start = next + 1;
-
+		if (signal_pending(current)) {
+			count = -ERESTARTSYS;
+			break;
+		}
 		if ((e4b->bd_info->bb_free - count) < minblocks)
 			break;
 	}
@@ -4013,7 +4016,8 @@ int ext4_trim_fs(unsigned int minlen, struct super_block *sb)
 	struct ext4_buddy e4b;
 	ext4_group_t group;
 	ext4_group_t ngroups = ext4_get_groups_count(sb);
-	ext4_grpblk_t minblocks;
+	ext4_grpblk_t minblocks, cnt;
+	int ret = 0;
 
 	if (!test_opt(sb, DISCARD))
 		return 0;
@@ -4023,22 +4027,25 @@ int ext4_trim_fs(unsigned int minlen, struct super_block *sb)
 		return -EINVAL;
 
 	for (group = 0; group < ngroups; group++) {
-		int err;
-
-		err = ext4_mb_load_buddy(sb, group, &e4b);
-		if (err) {
+		ret = ext4_mb_load_buddy(sb, group, &e4b);
+		if (ret) {
 			ext4_error(sb, "Error in loading buddy "
 					"information for %u", group);
-			continue;
+			break;
 		}
 
 		if (e4b.bd_info->bb_free >= minblocks) {
-			ext4_trim_all_free(sb, &e4b, minblocks);
+			cnt = ext4_trim_all_free(sb, &e4b, minblocks);
+			if (cnt < 0) {
+				ret = cnt;
+				ext4_mb_unload_buddy(&e4b);
+				break;
+			}
 		}
 		ext4_mb_unload_buddy(&e4b);
 	}
 
-	return 0;
+	return ret;
 }
 
 /*
-- 
1.6.6.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] Add batched discard support for ext4
  2010-07-14  8:33   ` Dmitry Monakhov
@ 2010-07-14  9:40     ` Lukas Czerner
  2010-07-14 10:03       ` Dmitry Monakhov
  0 siblings, 1 reply; 16+ messages in thread
From: Lukas Czerner @ 2010-07-14  9:40 UTC (permalink / raw)
  To: Dmitry Monakhov
  Cc: Lukas Czerner, eshishki, jmoyer, rwheeler, linux-ext4, sandeen

On Wed, 14 Jul 2010, Dmitry Monakhov wrote:

> Lukas Czerner <lczerner@redhat.com> writes:
> 
> > Walk through each allocation group and trim all free extents. It can be
> > invoked through TRIM ioctl on the file system. The main idea is to
> > provide a way to trim the whole file system if needed, since some SSD's
> > may suffer from performance loss after the whole device was filled (it
> > does not mean that fs is full!).
> >
> > It search fro free extents in each allocation group. When the free
> > extent is found, blocks are marked as used and then trimmed. Afterwards
> > these blocks are marked as free in per-group bitmap.
> Looks ok, except two small notes:
> 1) trim_fs is a time consuming operation and we have to add
>    condresced, and signal_pending checks to allow user to interrupt
>    cmd if necessery. See patch attached.

Hi, Dimitry

thanks for your patch! Although I have one question:


 	for (group = 0; group < ngroups; group++) {
-		int err;
-
-		err = ext4_mb_load_buddy(sb, group, &e4b);
-		if (err) {
+		ret = ext4_mb_load_buddy(sb, group, &e4b);
+		if (ret) {
 			ext4_error(sb, "Error in loading buddy "
 					"information for %u", group);
-			continue;
+			break;
 		}

Is there really need to jump out from the loop and exit in the case of
load_buddy failure ? Next group may very well succeed in loading buddy,
or am I missing something ?

> 2) IMHO runtime trim support is useful sometimes, for example when
>    user really care about data security i.e. unlinked file should be
>    trimmed ASAP. I think we have to provide 'secdel' mount option
>    similar to secdeletion flag for inode, but this is another story
>    not directly connected with the patch.

I like the idea, but IMO this should work for any underlying storage,
not just for SSDs.

Thanks!
-Lukas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] Add batched discard support for ext4
  2010-07-14  9:40     ` Lukas Czerner
@ 2010-07-14 10:03       ` Dmitry Monakhov
  2010-07-14 11:43         ` Lukas Czerner
  0 siblings, 1 reply; 16+ messages in thread
From: Dmitry Monakhov @ 2010-07-14 10:03 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: eshishki, jmoyer, rwheeler, linux-ext4, sandeen

Lukas Czerner <lczerner@redhat.com> writes:

> On Wed, 14 Jul 2010, Dmitry Monakhov wrote:
>
>> Lukas Czerner <lczerner@redhat.com> writes:
>> 
>> > Walk through each allocation group and trim all free extents. It can be
>> > invoked through TRIM ioctl on the file system. The main idea is to
>> > provide a way to trim the whole file system if needed, since some SSD's
>> > may suffer from performance loss after the whole device was filled (it
>> > does not mean that fs is full!).
>> >
>> > It search fro free extents in each allocation group. When the free
>> > extent is found, blocks are marked as used and then trimmed. Afterwards
>> > these blocks are marked as free in per-group bitmap.
>> Looks ok, except two small notes:
>> 1) trim_fs is a time consuming operation and we have to add
>>    condresced, and signal_pending checks to allow user to interrupt
>>    cmd if necessery. See patch attached.
>
> Hi, Dimitry
>
> thanks for your patch! Although I have one question:
>
>
>  	for (group = 0; group < ngroups; group++) {
> -		int err;
> -
> -		err = ext4_mb_load_buddy(sb, group, &e4b);
> -		if (err) {
> +		ret = ext4_mb_load_buddy(sb, group, &e4b);
> +		if (ret) {
>  			ext4_error(sb, "Error in loading buddy "
>  					"information for %u", group);
> -			continue;
> +			break;
>  		}
>
> Is there really need to jump out from the loop and exit in the case of
> load_buddy failure ? Next group may very well succeed in loading buddy,
> or am I missing something ?
Well, it may fail due to -ENOMEM which is not scary but in some places
it may fail due to EIO which is a very bad sign. So i think it is
slightly dangerous to continue if we have found a same group.
>
>> 2) IMHO runtime trim support is useful sometimes, for example when
>>    user really care about data security i.e. unlinked file should be
>>    trimmed ASAP. I think we have to provide 'secdel' mount option
>>    similar to secdeletion flag for inode, but this is another story
>>    not directly connected with the patch.
>
> I like the idea, but IMO this should work for any underlying storage,
> not just for SSDs.
Off course. We may use blkdev_issue_zeroout() if disk does not support
discard with zeroing.
>
> Thanks!
> -Lukas
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] Add batched discard support for ext4
  2010-07-14 10:03       ` Dmitry Monakhov
@ 2010-07-14 11:43         ` Lukas Czerner
  0 siblings, 0 replies; 16+ messages in thread
From: Lukas Czerner @ 2010-07-14 11:43 UTC (permalink / raw)
  To: Dmitry Monakhov
  Cc: Lukas Czerner, eshishki, jmoyer, rwheeler, linux-ext4, sandeen

On Wed, 14 Jul 2010, Dmitry Monakhov wrote:

> Lukas Czerner <lczerner@redhat.com> writes:
> 
> > On Wed, 14 Jul 2010, Dmitry Monakhov wrote:
> >
> >> Lukas Czerner <lczerner@redhat.com> writes:
> >> 
> >> > Walk through each allocation group and trim all free extents. It can be
> >> > invoked through TRIM ioctl on the file system. The main idea is to
> >> > provide a way to trim the whole file system if needed, since some SSD's
> >> > may suffer from performance loss after the whole device was filled (it
> >> > does not mean that fs is full!).
> >> >
> >> > It search fro free extents in each allocation group. When the free
> >> > extent is found, blocks are marked as used and then trimmed. Afterwards
> >> > these blocks are marked as free in per-group bitmap.
> >> Looks ok, except two small notes:
> >> 1) trim_fs is a time consuming operation and we have to add
> >>    condresced, and signal_pending checks to allow user to interrupt
> >>    cmd if necessery. See patch attached.
> >
> > Hi, Dimitry
> >
> > thanks for your patch! Although I have one question:
> >
> >
> >  	for (group = 0; group < ngroups; group++) {
> > -		int err;
> > -
> > -		err = ext4_mb_load_buddy(sb, group, &e4b);
> > -		if (err) {
> > +		ret = ext4_mb_load_buddy(sb, group, &e4b);
> > +		if (ret) {
> >  			ext4_error(sb, "Error in loading buddy "
> >  					"information for %u", group);
> > -			continue;
> > +			break;
> >  		}
> >
> > Is there really need to jump out from the loop and exit in the case of
> > load_buddy failure ? Next group may very well succeed in loading buddy,
> > or am I missing something ?
> Well, it may fail due to -ENOMEM which is not scary but in some places
> it may fail due to EIO which is a very bad sign. So i think it is
> slightly dangerous to continue if we have found a same group.

Ok, it seems reasonable.

> >
> >> 2) IMHO runtime trim support is useful sometimes, for example when
> >>    user really care about data security i.e. unlinked file should be
> >>    trimmed ASAP. I think we have to provide 'secdel' mount option
> >>    similar to secdeletion flag for inode, but this is another story
> >>    not directly connected with the patch.
> >
> > I like the idea, but IMO this should work for any underlying storage,
> > not just for SSDs.
> Off course. We may use blkdev_issue_zeroout() if disk does not support
> discard with zeroing.
> >

-Lukas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4: batched discard support - simplified version
  2010-07-07  7:53 Ext4: batched discard support - simplified version Lukas Czerner
  2010-07-07  7:53 ` [PATCH 1/2] Add ioctl FITRIM Lukas Czerner
  2010-07-07  7:53 ` [PATCH 2/2] Add batched discard support for ext4 Lukas Czerner
@ 2010-07-23 14:36 ` Ted Ts'o
  2010-07-23 15:13   ` Jeff Moyer
  2010-07-26 10:30   ` Lukas Czerner
  2 siblings, 2 replies; 16+ messages in thread
From: Ted Ts'o @ 2010-07-23 14:36 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: eshishki, jmoyer, rwheeler, linux-ext4, sandeen

On Wed, Jul 07, 2010 at 09:53:30AM +0200, Lukas Czerner wrote:
> 
> Hi all,
> 
> since my last post I have done some more testing with various SSD's and the
> trend is clear. Trim performance is getting better and the performance loss
> without trim is getting lower. So I have decided to abandon the initial idea
> to track free blocks within some internal data structure - it takes time and
> memory.

Do you have some numbers about how bad trim actually might be on
various devices?  I can imagine some devices where it might be better
(for wear levelling and better write endurance if nothing else) where
it's better to do the trim right away instead of batching things.

So what I'm thinking about doing is keeping the "discard" mount option
to mean non-batched discard.  If you want to use the explicit FITRIM
ioctl, I don't think we need to test to see if the dicard mount option
is set; if the user issues the ioctl, then we should do the batched
discard, and if we don't trust the user to do that, then well, the
ioctl should be restricted to privileged users only --- especially if
it could take up to a minute.

						- Ted

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4: batched discard support - simplified version
  2010-07-23 14:36 ` Ext4: batched discard support - simplified version Ted Ts'o
@ 2010-07-23 15:13   ` Jeff Moyer
  2010-07-23 15:19     ` Ted Ts'o
  2010-07-23 15:30     ` Greg Freemyer
  2010-07-26 10:30   ` Lukas Czerner
  1 sibling, 2 replies; 16+ messages in thread
From: Jeff Moyer @ 2010-07-23 15:13 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Lukas Czerner, eshishki, rwheeler, linux-ext4, sandeen

"Ted Ts'o" <tytso@mit.edu> writes:

> On Wed, Jul 07, 2010 at 09:53:30AM +0200, Lukas Czerner wrote:
>> 
>> Hi all,
>> 
>> since my last post I have done some more testing with various SSD's and the
>> trend is clear. Trim performance is getting better and the performance loss
>> without trim is getting lower. So I have decided to abandon the initial idea
>> to track free blocks within some internal data structure - it takes time and
>> memory.
>
> Do you have some numbers about how bad trim actually might be on
> various devices?

I'll let Lukas answer that when he gets back to the office next week.
The performance of the trim command itself varies by vendor, of course.

> I can imagine some devices where it might be better (for wear
> levelling and better write endurance if nothing else) where it's
> better to do the trim right away instead of batching things.

I don't think so.  In all of the configurations tested, I'm pretty sure
we saw a performance hit from doing the TRIMs right away.  The queue
flush really hurts.  Of course, I have no idea what you had in mind for
the amount of time in between batched discards.

> So what I'm thinking about doing is keeping the "discard" mount option
> to mean non-batched discard.  If you want to use the explicit FITRIM
> ioctl, I don't think we need to test to see if the dicard mount option
> is set; if the user issues the ioctl, then we should do the batched
> discard, and if we don't trust the user to do that, then well, the
> ioctl should be restricted to privileged users only --- especially if
> it could take up to a minute.

That sounds reasonable to me.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4: batched discard support - simplified version
  2010-07-23 15:13   ` Jeff Moyer
@ 2010-07-23 15:19     ` Ted Ts'o
  2010-07-23 15:40       ` Jeff Moyer
  2010-07-24 16:31       ` Ric Wheeler
  2010-07-23 15:30     ` Greg Freemyer
  1 sibling, 2 replies; 16+ messages in thread
From: Ted Ts'o @ 2010-07-23 15:19 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Lukas Czerner, eshishki, rwheeler, linux-ext4, sandeen

On Fri, Jul 23, 2010 at 11:13:52AM -0400, Jeff Moyer wrote:
> 
> I don't think so.  In all of the configurations tested, I'm pretty sure
> we saw a performance hit from doing the TRIMs right away.  The queue
> flush really hurts.  Of course, I have no idea what you had in mind for
> the amount of time in between batched discards.

Sure, but not all the world is SATA-attached SSD's.  I'm thinking in
particular of PCIe-attached SSD's, where the TRIM command might be
very fast indeed...  I believe Ric Wheeler tells me you have TMS
RamSan SSD's in house that you are testing?  And of course those
aren't the only PCIe-attached flash devices out there....

       	   		      	    	    	- Ted

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4: batched discard support - simplified version
  2010-07-23 15:13   ` Jeff Moyer
  2010-07-23 15:19     ` Ted Ts'o
@ 2010-07-23 15:30     ` Greg Freemyer
  1 sibling, 0 replies; 16+ messages in thread
From: Greg Freemyer @ 2010-07-23 15:30 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Ted Ts'o, Lukas Czerner, eshishki, rwheeler, linux-ext4,
	sandeen

On Fri, Jul 23, 2010 at 11:13 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> "Ted Ts'o" <tytso@mit.edu> writes:
>
>> On Wed, Jul 07, 2010 at 09:53:30AM +0200, Lukas Czerner wrote:
>>>
>>> Hi all,
>>>
>>> since my last post I have done some more testing with various SSD's and the
>>> trend is clear. Trim performance is getting better and the performance loss
>>> without trim is getting lower. So I have decided to abandon the initial idea
>>> to track free blocks within some internal data structure - it takes time and
>>> memory.
>>
>> Do you have some numbers about how bad trim actually might be on
>> various devices?
>
> I'll let Lukas answer that when he gets back to the office next week.
> The performance of the trim command itself varies by vendor, of course.
>
>> I can imagine some devices where it might be better (for wear
>> levelling and better write endurance if nothing else) where it's
>> better to do the trim right away instead of batching things.
>
> I don't think so.  In all of the configurations tested, I'm pretty sure
> we saw a performance hit from doing the TRIMs right away.  The queue
> flush really hurts.  Of course, I have no idea what you had in mind for
> the amount of time in between batched discards.

It was my understanding that way back when, Intel was pushing for the
TRIMs to be right away.  That may be why they never fully implemented
the TRIM command to accept more than one sectors worth of vectorized
data.  (That is still multiple ranges per discard, just not
hundreds/thousands.)

Along those lines, does this patch create multi-sector discard trim
commands when there is a large list of discard ranges (ie. thousands
of ranges to discard)?  And if so, does it have a blacklist for SSDs
like the Intel that don't implement the multi-sector payload part of
the spec?

Greg
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4: batched discard support - simplified version
  2010-07-23 15:19     ` Ted Ts'o
@ 2010-07-23 15:40       ` Jeff Moyer
  2010-07-23 17:00         ` Ted Ts'o
  2010-07-24 16:31       ` Ric Wheeler
  1 sibling, 1 reply; 16+ messages in thread
From: Jeff Moyer @ 2010-07-23 15:40 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Lukas Czerner, eshishki, rwheeler, linux-ext4, sandeen

"Ted Ts'o" <tytso@mit.edu> writes:

> On Fri, Jul 23, 2010 at 11:13:52AM -0400, Jeff Moyer wrote:
>> 
>> I don't think so.  In all of the configurations tested, I'm pretty sure
>> we saw a performance hit from doing the TRIMs right away.  The queue
>> flush really hurts.  Of course, I have no idea what you had in mind for
>> the amount of time in between batched discards.
>
> Sure, but not all the world is SATA-attached SSD's.  I'm thinking in
> particular of PCIe-attached SSD's, where the TRIM command might be
> very fast indeed...  I believe Ric Wheeler tells me you have TMS
> RamSan SSD's in house that you are testing?  And of course those
> aren't the only PCIe-attached flash devices out there....

You are right, and we have to consider thinly provisioned luns, as well.
The only case I can think of where it makes sense to issue those
discards immediately is if you are running tight on allocated space in
your thinly provisioned lun.  Aside from that, I'm not sure why you
would want to send those commands down with every journal commit,
instead of batched daily, for example.  But, I can certainly understand
wanting to allow this flexibility.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4: batched discard support - simplified version
  2010-07-23 15:40       ` Jeff Moyer
@ 2010-07-23 17:00         ` Ted Ts'o
  0 siblings, 0 replies; 16+ messages in thread
From: Ted Ts'o @ 2010-07-23 17:00 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Lukas Czerner, eshishki, rwheeler, linux-ext4, sandeen

On Fri, Jul 23, 2010 at 11:40:58AM -0400, Jeff Moyer wrote:
> 
> You are right, and we have to consider thinly provisioned luns, as well.
> The only case I can think of where it makes sense to issue those
> discards immediately is if you are running tight on allocated space in
> your thinly provisioned lun.  Aside from that, I'm not sure why you
> would want to send those commands down with every journal commit,
> instead of batched daily, for example.  But, I can certainly understand
> wanting to allow this flexibility.

The two reasons I could imagine is to give more flexibility to the
wear leveling algorithms (depending on how often you are turning over
files --- i.e., deleting blocks and then reusing them), and to
minimize latency (it might be nicer for the system to send down the
deleted blocks on a continuing basis rather than to send them down all
at once).  

The other issue is that by sending TRIM commands for all free extents,
even those that haven't been recently been released, the flash
translation layer needs to look up a large number of blocks in its
translation table to see if it needs to update it.  This can end up
burning CPU unnecessarily, especially for those flash devices (such as
FusionIO, for example) manage their FTL using the host CPU.

So this is one of the reasons why I want to leave some flexibility
here; BTW, for some systems, it may make sense for the FITRIM ioctl to
throttle the rate at which it locks block groups and sends down TRIM
requests so it doesn't end up causing performance hiccups for live
applications while the FITRIM ioctl is running.

	     	     	       	      	       - Ted

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4: batched discard support - simplified version
  2010-07-23 15:19     ` Ted Ts'o
  2010-07-23 15:40       ` Jeff Moyer
@ 2010-07-24 16:31       ` Ric Wheeler
  1 sibling, 0 replies; 16+ messages in thread
From: Ric Wheeler @ 2010-07-24 16:31 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Jeff Moyer, Lukas Czerner, eshishki, linux-ext4, sandeen

On 07/23/2010 11:19 AM, Ted Ts'o wrote:
> On Fri, Jul 23, 2010 at 11:13:52AM -0400, Jeff Moyer wrote:
>    
>> I don't think so.  In all of the configurations tested, I'm pretty sure
>> we saw a performance hit from doing the TRIMs right away.  The queue
>> flush really hurts.  Of course, I have no idea what you had in mind for
>> the amount of time in between batched discards.
>>      
> Sure, but not all the world is SATA-attached SSD's.  I'm thinking in
> particular of PCIe-attached SSD's, where the TRIM command might be
> very fast indeed...  I believe Ric Wheeler tells me you have TMS
> RamSan SSD's in house that you are testing?  And of course those
> aren't the only PCIe-attached flash devices out there....
>
>         	   		      	    	    	- Ted
>    

I think that some of the PCI-e cards might want that information right 
away, a lot of high end arrays actually prefer fewer larger chunks.

One other user might be virtual devices or some remote replication 
mechanism. I wonder if the drbd people for example might (do?) use these?

ric


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Ext4: batched discard support - simplified version
  2010-07-23 14:36 ` Ext4: batched discard support - simplified version Ted Ts'o
  2010-07-23 15:13   ` Jeff Moyer
@ 2010-07-26 10:30   ` Lukas Czerner
  1 sibling, 0 replies; 16+ messages in thread
From: Lukas Czerner @ 2010-07-26 10:30 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Lukas Czerner, eshishki, jmoyer, rwheeler, linux-ext4, sandeen

On Fri, 23 Jul 2010, Ted Ts'o wrote:

> On Wed, Jul 07, 2010 at 09:53:30AM +0200, Lukas Czerner wrote:
> > 
> > Hi all,
> > 
> > since my last post I have done some more testing with various SSD's and the
> > trend is clear. Trim performance is getting better and the performance loss
> > without trim is getting lower. So I have decided to abandon the initial idea
> > to track free blocks within some internal data structure - it takes time and
> > memory.
> 
> Do you have some numbers about how bad trim actually might be on
> various devices?  I can imagine some devices where it might be better
> (for wear levelling and better write endurance if nothing else) where
> it's better to do the trim right away instead of batching things.

Hi,

Yes, I have those numbers.

http://people.redhat.com/jmoyer/discard/ext4_batched_discard/ext4_discard.html
This page presents my test results on three different devices. I have
tested the current ext4 discard implementation (do the trim right away).
However, one tested device is still not on that page. With this
(Vendor4) device I have got only about 1.83% performance loss, which is
very good.

http://people.redhat.com/jmoyer/discard/ext4_batched_discard/ext4_ioctltrim.html
This page provides test results with my batched discard implementation.
Take those numbers with discretion, because the patch still does not
involve journaling and I have tested the "worst case" scenario, which
involves issuing FITRIM in endless loop without any sleep.

Generally the FITRIM ioctl can take from 2 seconds on fast devices to
several (2-4) minutes on very slow devices, or under heavy load.


> 
> So what I'm thinking about doing is keeping the "discard" mount option
> to mean non-batched discard.  If you want to use the explicit FITRIM
> ioctl, I don't think we need to test to see if the dicard mount option
> is set; if the user issues the ioctl, then we should do the batched
> discard, and if we don't trust the user to do that, then well, the
> ioctl should be restricted to privileged users only --- especially if
> it could take up to a minute.

I agree.

> 
> 						- Ted
> 

Thanks.

-Lukas

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2010-07-26 10:30 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-07-07  7:53 Ext4: batched discard support - simplified version Lukas Czerner
2010-07-07  7:53 ` [PATCH 1/2] Add ioctl FITRIM Lukas Czerner
2010-07-07  7:53 ` [PATCH 2/2] Add batched discard support for ext4 Lukas Czerner
2010-07-14  8:33   ` Dmitry Monakhov
2010-07-14  9:40     ` Lukas Czerner
2010-07-14 10:03       ` Dmitry Monakhov
2010-07-14 11:43         ` Lukas Czerner
2010-07-23 14:36 ` Ext4: batched discard support - simplified version Ted Ts'o
2010-07-23 15:13   ` Jeff Moyer
2010-07-23 15:19     ` Ted Ts'o
2010-07-23 15:40       ` Jeff Moyer
2010-07-23 17:00         ` Ted Ts'o
2010-07-24 16:31       ` Ric Wheeler
2010-07-23 15:30     ` Greg Freemyer
2010-07-26 10:30   ` Lukas Czerner
  -- strict thread matches above, loose matches on Subject: below --
2010-04-19 10:55 Ext4: batched discard support Lukas Czerner
2010-04-19 10:55 ` [PATCH 1/2] Add ioctl FITRIM Lukas Czerner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).