* Ext4: batched discard support - simplified version
@ 2010-07-07 7:53 Lukas Czerner
2010-07-07 7:53 ` [PATCH 1/2] Add ioctl FITRIM Lukas Czerner
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Lukas Czerner @ 2010-07-07 7:53 UTC (permalink / raw)
To: eshishki; +Cc: lczerner, jmoyer, rwheeler, linux-ext4, sandeen
Hi all,
since my last post I have done some more testing with various SSD's and the
trend is clear. Trim performance is getting better and the performance loss
without trim is getting lower. So I have decided to abandon the initial idea
to track free blocks within some internal data structure - it takes time and
memory.
Today there are some SSD's which performance does not seems to degrade over
the time (writes). I have filled those devices up to 200% and still did not
seen any performance loss. On the other hand, there are still some devices
which shows about 300% performance degradation, so I suppose TRIM will be
still needed for some time.
You can try it out with the simple program attached below. Just create ext4
fs, mount it with -o discard and invoke attached program on ext4 mount point.
>From my experience the time needed to trim whole file system is strongly device
dependent. It may take few seconds on one device up to one minute on another,
under the heavy io load the time to trim whole fs gets longer.
There are two pathes:
[PATCH 1/2] Add ioctl FITRIM.
fs/ioctl.c | 31 +++++++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
2 files changed, 33 insertions(+), 0 deletions(-)
[PATCH 2/2] Add batched discard support for ext4
fs/ext4/ext4.h | 2 +
fs/ext4/mballoc.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++-----
fs/ext4/super.c | 1 +
3 files changed, 118 insertions(+), 11 deletions(-)
Signed-off-by: "Lukas Czerner" <lczerner@redhat.com>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/ioctl.h>
#define FITRIM _IOWR('X', 121, int)
int main(int argc, char **argv)
{
int minsize = 4096;
int fd;
if (argc != 2) {
fprintf(stderr, "usage: %s mountpoint\n", argv[0]);
return 1;
}
fd = open(argv[1], O_RDONLY);
if (fd < 0) {
perror("open");
return 1;
}
if (ioctl(fd, FITRIM, &minsize)) {
if (errno == EOPNOTSUPP)
fprintf(stderr, "TRIM not supported\n");
else
perror("EXT4_IOC_TRIM");
return 1;
}
return 0;
}
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/2] Add ioctl FITRIM. 2010-07-07 7:53 Ext4: batched discard support - simplified version Lukas Czerner @ 2010-07-07 7:53 ` Lukas Czerner 2010-07-07 7:53 ` [PATCH 2/2] Add batched discard support for ext4 Lukas Czerner 2010-07-23 14:36 ` Ext4: batched discard support - simplified version Ted Ts'o 2 siblings, 0 replies; 15+ messages in thread From: Lukas Czerner @ 2010-07-07 7:53 UTC (permalink / raw) To: eshishki; +Cc: lczerner, jmoyer, rwheeler, linux-ext4, sandeen Adds an filesystem independent ioctl to allow implementation of file system batched discard support. Signed-off-by: Lukas Czerner <lczerner@redhat.com> --- fs/ioctl.c | 31 +++++++++++++++++++++++++++++++ include/linux/fs.h | 2 ++ 2 files changed, 33 insertions(+), 0 deletions(-) diff --git a/fs/ioctl.c b/fs/ioctl.c index 7faefb4..09b33ae 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -551,6 +551,33 @@ static int ioctl_fsthaw(struct file *filp) return thaw_bdev(sb->s_bdev, sb); } +static int ioctl_fstrim(struct file *filp, unsigned long arg) +{ + struct super_block *sb = filp->f_path.dentry->d_inode->i_sb; + unsigned int minlen; + int err; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + /* If filesystem doesn't support trim feature, return. */ + if (sb->s_op->trim_fs == NULL) + return -EOPNOTSUPP; + + /* If a blockdevice-backed filesystem isn't specified, return EINVAL. */ + if (sb->s_bdev == NULL) + return -EINVAL; + + err = get_user(minlen, (unsigned int __user *) arg); + if (err) + return err; + + err = sb->s_op->trim_fs(minlen, sb); + if (err) + return err; + return 0; +} + /* * When you add any new common ioctls to the switches above and below * please update compat_sys_ioctl() too. @@ -601,6 +628,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, error = ioctl_fsthaw(filp); break; + case FITRIM: + error = ioctl_fstrim(filp, arg); + break; + case FS_IOC_FIEMAP: return ioctl_fiemap(filp, arg); diff --git a/include/linux/fs.h b/include/linux/fs.h index 44f35ae..7a27fa4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -315,6 +315,7 @@ struct inodes_stat_t { #define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */ #define FIFREEZE _IOWR('X', 119, int) /* Freeze */ #define FITHAW _IOWR('X', 120, int) /* Thaw */ +#define FITRIM _IOWR('X', 121, int) /* Trim */ #define FS_IOC_GETFLAGS _IOR('f', 1, long) #define FS_IOC_SETFLAGS _IOW('f', 2, long) @@ -1580,6 +1581,7 @@ struct super_operations { ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); #endif int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t); + int (*trim_fs) (unsigned int, struct super_block *); }; /* -- 1.6.6.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 2/2] Add batched discard support for ext4 2010-07-07 7:53 Ext4: batched discard support - simplified version Lukas Czerner 2010-07-07 7:53 ` [PATCH 1/2] Add ioctl FITRIM Lukas Czerner @ 2010-07-07 7:53 ` Lukas Czerner 2010-07-14 8:33 ` Dmitry Monakhov 2010-07-23 14:36 ` Ext4: batched discard support - simplified version Ted Ts'o 2 siblings, 1 reply; 15+ messages in thread From: Lukas Czerner @ 2010-07-07 7:53 UTC (permalink / raw) To: eshishki; +Cc: lczerner, jmoyer, rwheeler, linux-ext4, sandeen Walk through each allocation group and trim all free extents. It can be invoked through TRIM ioctl on the file system. The main idea is to provide a way to trim the whole file system if needed, since some SSD's may suffer from performance loss after the whole device was filled (it does not mean that fs is full!). It search fro free extents in each allocation group. When the free extent is found, blocks are marked as used and then trimmed. Afterwards these blocks are marked as free in per-group bitmap. Signed-off-by: Lukas Czerner <lczerner@redhat.com> --- fs/ext4/ext4.h | 2 + fs/ext4/mballoc.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++----- fs/ext4/super.c | 1 + 3 files changed, 118 insertions(+), 11 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index bf938cf..ba0fff0 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1437,6 +1437,8 @@ extern int ext4_mb_add_groupinfo(struct super_block *sb, extern int ext4_mb_get_buddy_cache_lock(struct super_block *, ext4_group_t); extern void ext4_mb_put_buddy_cache_lock(struct super_block *, ext4_group_t, int); +extern int ext4_trim_fs(unsigned int, struct super_block *); + /* inode.c */ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int, int *); diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index b423a36..c7b541c 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2535,17 +2535,6 @@ static void release_blocks_on_commit(journal_t *journal, transaction_t *txn) mb_debug(1, "gonna free %u blocks in group %u (0x%p):", entry->count, entry->group, entry); - if (test_opt(sb, DISCARD)) { - ext4_fsblk_t discard_block; - - discard_block = entry->start_blk + - ext4_group_first_block_no(sb, entry->group); - trace_ext4_discard_blocks(sb, - (unsigned long long)discard_block, - entry->count); - sb_issue_discard(sb, discard_block, entry->count); - } - err = ext4_mb_load_buddy(sb, entry->group, &e4b); /* we expect to find existing buddy because it's pinned */ BUG_ON(err != 0); @@ -4640,3 +4629,118 @@ error_return: kmem_cache_free(ext4_ac_cachep, ac); return; } + +/** + * Trim "count" blocks starting at "start" in "group" + * This must be called under group lock + */ +static void ext4_trim_extent(struct super_block *sb, int start, int count, + ext4_group_t group, struct ext4_buddy *e4b) +{ + ext4_fsblk_t discard_block; + struct ext4_super_block *es = EXT4_SB(sb)->s_es; + struct ext4_free_extent ex; + + assert_spin_locked(ext4_group_lock_ptr(sb, group)); + + ex.fe_start = start; + ex.fe_group = group; + ex.fe_len = count; + + /** + * Mark blocks used, so no one can reuse them while + * being trimmed. + */ + mb_mark_used(e4b, &ex); + ext4_unlock_group(sb, group); + + discard_block = (ext4_fsblk_t)group * + EXT4_BLOCKS_PER_GROUP(sb) + + start + + le32_to_cpu(es->s_first_data_block); + trace_ext4_discard_blocks(sb, + (unsigned long long)discard_block, + count); + sb_issue_discard(sb, discard_block, count); + + ext4_lock_group(sb, group); + mb_free_blocks(NULL, e4b, start, ex.fe_len); +} + +/** + * Trim all free extents in group at least minblocks long + */ +ext4_grpblk_t ext4_trim_all_free(struct super_block *sb, struct ext4_buddy *e4b, + ext4_grpblk_t minblocks) +{ + void *bitmap; + ext4_grpblk_t max = EXT4_BLOCKS_PER_GROUP(sb); + ext4_grpblk_t start, next, count = 0; + ext4_group_t group; + + BUG_ON(e4b == NULL); + + bitmap = e4b->bd_bitmap; + group = e4b->bd_group; + start = e4b->bd_info->bb_first_free; + ext4_lock_group(sb, group); + + while (start < max) { + + start = mb_find_next_zero_bit(bitmap, max, start); + if (start >= max) + break; + next = mb_find_next_bit(bitmap, max, start); + + if ((next - start) >= minblocks) { + count += next - start; + ext4_trim_extent(sb, start, + next - start, group, e4b); + } + start = next + 1; + + if ((e4b->bd_info->bb_free - count) < minblocks) + break; + } + + ext4_unlock_group(sb, group); + + ext4_debug("trimmed %d blocks in the group %d\n", + count, group); + + return count; +} + +int ext4_trim_fs(unsigned int minlen, struct super_block *sb) +{ + struct ext4_buddy e4b; + ext4_group_t group; + ext4_group_t ngroups = ext4_get_groups_count(sb); + ext4_grpblk_t minblocks; + + if (!test_opt(sb, DISCARD)) + return 0; + + minblocks = DIV_ROUND_UP(minlen, sb->s_blocksize); + if (unlikely(minblocks > EXT4_BLOCKS_PER_GROUP(sb))) + return -EINVAL; + + for (group = 0; group < ngroups; group++) { + int err; + + err = ext4_mb_load_buddy(sb, group, &e4b); + if (err) { + ext4_error(sb, "Error in loading buddy " + "information for %u", group); + continue; + } + + if (e4b.bd_info->bb_free >= minblocks) { + ext4_trim_all_free(sb, &e4b, minblocks); + } + + ext4_mb_release_desc(&e4b); + } + + return 0; +} diff --git a/fs/ext4/super.c b/fs/ext4/super.c index e14d22c..253eb98 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -1109,6 +1109,7 @@ static const struct super_operations ext4_sops = { .quota_write = ext4_quota_write, #endif .bdev_try_to_free_page = bdev_try_to_free_page, + .trim_fs = ext4_trim_fs }; static const struct super_operations ext4_nojournal_sops = { -- 1.6.6.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] Add batched discard support for ext4 2010-07-07 7:53 ` [PATCH 2/2] Add batched discard support for ext4 Lukas Czerner @ 2010-07-14 8:33 ` Dmitry Monakhov 2010-07-14 9:40 ` Lukas Czerner 0 siblings, 1 reply; 15+ messages in thread From: Dmitry Monakhov @ 2010-07-14 8:33 UTC (permalink / raw) To: Lukas Czerner; +Cc: eshishki, jmoyer, rwheeler, linux-ext4, sandeen [-- Attachment #1: Type: text/plain, Size: 1086 bytes --] Lukas Czerner <lczerner@redhat.com> writes: > Walk through each allocation group and trim all free extents. It can be > invoked through TRIM ioctl on the file system. The main idea is to > provide a way to trim the whole file system if needed, since some SSD's > may suffer from performance loss after the whole device was filled (it > does not mean that fs is full!). > > It search fro free extents in each allocation group. When the free > extent is found, blocks are marked as used and then trimmed. Afterwards > these blocks are marked as free in per-group bitmap. Looks ok, except two small notes: 1) trim_fs is a time consuming operation and we have to add condresced, and signal_pending checks to allow user to interrupt cmd if necessery. See patch attached. 2) IMHO runtime trim support is useful sometimes, for example when user really care about data security i.e. unlinked file should be trimmed ASAP. I think we have to provide 'secdel' mount option similar to secdeletion flag for inode, but this is another story not directly connected with the patch. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: 0001-ext4-Add-interrupt-points-to-batched-discard.patch --] [-- Type: text/x-diff, Size: 2230 bytes --] >From 54ca2add544f88ac9ca8c647ae7b093be9ac2872 Mon Sep 17 00:00:00 2001 From: Dmitry Monakhov <dmonakhov@openvz.org> Date: Wed, 14 Jul 2010 12:16:50 +0400 Subject: [PATCH] ext4: Add interrupt points to batched discard Since fstrim is a long operation it will be good to have an ability to interrupt it by a signal. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/ext4/mballoc.c | 27 +++++++++++++++++---------- 1 files changed, 17 insertions(+), 10 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 48abd3d..0948408 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -3959,7 +3959,7 @@ static void ext4_trim_extent(struct super_block *sb, int start, int count, (unsigned long long)discard_block, count); sb_issue_discard(sb, discard_block, count); - + cond_resched(); ext4_lock_group(sb, group); mb_free_blocks(NULL, e4b, start, ex.fe_len); } @@ -3995,7 +3995,10 @@ ext4_grpblk_t ext4_trim_all_free(struct super_block *sb, struct ext4_buddy *e4b, next - start, group, e4b); } start = next + 1; - + if (signal_pending(current)) { + count = -ERESTARTSYS; + break; + } if ((e4b->bd_info->bb_free - count) < minblocks) break; } @@ -4013,7 +4016,8 @@ int ext4_trim_fs(unsigned int minlen, struct super_block *sb) struct ext4_buddy e4b; ext4_group_t group; ext4_group_t ngroups = ext4_get_groups_count(sb); - ext4_grpblk_t minblocks; + ext4_grpblk_t minblocks, cnt; + int ret = 0; if (!test_opt(sb, DISCARD)) return 0; @@ -4023,22 +4027,25 @@ int ext4_trim_fs(unsigned int minlen, struct super_block *sb) return -EINVAL; for (group = 0; group < ngroups; group++) { - int err; - - err = ext4_mb_load_buddy(sb, group, &e4b); - if (err) { + ret = ext4_mb_load_buddy(sb, group, &e4b); + if (ret) { ext4_error(sb, "Error in loading buddy " "information for %u", group); - continue; + break; } if (e4b.bd_info->bb_free >= minblocks) { - ext4_trim_all_free(sb, &e4b, minblocks); + cnt = ext4_trim_all_free(sb, &e4b, minblocks); + if (cnt < 0) { + ret = cnt; + ext4_mb_unload_buddy(&e4b); + break; + } } ext4_mb_unload_buddy(&e4b); } - return 0; + return ret; } /* -- 1.6.6.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] Add batched discard support for ext4 2010-07-14 8:33 ` Dmitry Monakhov @ 2010-07-14 9:40 ` Lukas Czerner 2010-07-14 10:03 ` Dmitry Monakhov 0 siblings, 1 reply; 15+ messages in thread From: Lukas Czerner @ 2010-07-14 9:40 UTC (permalink / raw) To: Dmitry Monakhov Cc: Lukas Czerner, eshishki, jmoyer, rwheeler, linux-ext4, sandeen On Wed, 14 Jul 2010, Dmitry Monakhov wrote: > Lukas Czerner <lczerner@redhat.com> writes: > > > Walk through each allocation group and trim all free extents. It can be > > invoked through TRIM ioctl on the file system. The main idea is to > > provide a way to trim the whole file system if needed, since some SSD's > > may suffer from performance loss after the whole device was filled (it > > does not mean that fs is full!). > > > > It search fro free extents in each allocation group. When the free > > extent is found, blocks are marked as used and then trimmed. Afterwards > > these blocks are marked as free in per-group bitmap. > Looks ok, except two small notes: > 1) trim_fs is a time consuming operation and we have to add > condresced, and signal_pending checks to allow user to interrupt > cmd if necessery. See patch attached. Hi, Dimitry thanks for your patch! Although I have one question: for (group = 0; group < ngroups; group++) { - int err; - - err = ext4_mb_load_buddy(sb, group, &e4b); - if (err) { + ret = ext4_mb_load_buddy(sb, group, &e4b); + if (ret) { ext4_error(sb, "Error in loading buddy " "information for %u", group); - continue; + break; } Is there really need to jump out from the loop and exit in the case of load_buddy failure ? Next group may very well succeed in loading buddy, or am I missing something ? > 2) IMHO runtime trim support is useful sometimes, for example when > user really care about data security i.e. unlinked file should be > trimmed ASAP. I think we have to provide 'secdel' mount option > similar to secdeletion flag for inode, but this is another story > not directly connected with the patch. I like the idea, but IMO this should work for any underlying storage, not just for SSDs. Thanks! -Lukas ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] Add batched discard support for ext4 2010-07-14 9:40 ` Lukas Czerner @ 2010-07-14 10:03 ` Dmitry Monakhov 2010-07-14 11:43 ` Lukas Czerner 0 siblings, 1 reply; 15+ messages in thread From: Dmitry Monakhov @ 2010-07-14 10:03 UTC (permalink / raw) To: Lukas Czerner; +Cc: eshishki, jmoyer, rwheeler, linux-ext4, sandeen Lukas Czerner <lczerner@redhat.com> writes: > On Wed, 14 Jul 2010, Dmitry Monakhov wrote: > >> Lukas Czerner <lczerner@redhat.com> writes: >> >> > Walk through each allocation group and trim all free extents. It can be >> > invoked through TRIM ioctl on the file system. The main idea is to >> > provide a way to trim the whole file system if needed, since some SSD's >> > may suffer from performance loss after the whole device was filled (it >> > does not mean that fs is full!). >> > >> > It search fro free extents in each allocation group. When the free >> > extent is found, blocks are marked as used and then trimmed. Afterwards >> > these blocks are marked as free in per-group bitmap. >> Looks ok, except two small notes: >> 1) trim_fs is a time consuming operation and we have to add >> condresced, and signal_pending checks to allow user to interrupt >> cmd if necessery. See patch attached. > > Hi, Dimitry > > thanks for your patch! Although I have one question: > > > for (group = 0; group < ngroups; group++) { > - int err; > - > - err = ext4_mb_load_buddy(sb, group, &e4b); > - if (err) { > + ret = ext4_mb_load_buddy(sb, group, &e4b); > + if (ret) { > ext4_error(sb, "Error in loading buddy " > "information for %u", group); > - continue; > + break; > } > > Is there really need to jump out from the loop and exit in the case of > load_buddy failure ? Next group may very well succeed in loading buddy, > or am I missing something ? Well, it may fail due to -ENOMEM which is not scary but in some places it may fail due to EIO which is a very bad sign. So i think it is slightly dangerous to continue if we have found a same group. > >> 2) IMHO runtime trim support is useful sometimes, for example when >> user really care about data security i.e. unlinked file should be >> trimmed ASAP. I think we have to provide 'secdel' mount option >> similar to secdeletion flag for inode, but this is another story >> not directly connected with the patch. > > I like the idea, but IMO this should work for any underlying storage, > not just for SSDs. Off course. We may use blkdev_issue_zeroout() if disk does not support discard with zeroing. > > Thanks! > -Lukas > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/2] Add batched discard support for ext4 2010-07-14 10:03 ` Dmitry Monakhov @ 2010-07-14 11:43 ` Lukas Czerner 0 siblings, 0 replies; 15+ messages in thread From: Lukas Czerner @ 2010-07-14 11:43 UTC (permalink / raw) To: Dmitry Monakhov Cc: Lukas Czerner, eshishki, jmoyer, rwheeler, linux-ext4, sandeen On Wed, 14 Jul 2010, Dmitry Monakhov wrote: > Lukas Czerner <lczerner@redhat.com> writes: > > > On Wed, 14 Jul 2010, Dmitry Monakhov wrote: > > > >> Lukas Czerner <lczerner@redhat.com> writes: > >> > >> > Walk through each allocation group and trim all free extents. It can be > >> > invoked through TRIM ioctl on the file system. The main idea is to > >> > provide a way to trim the whole file system if needed, since some SSD's > >> > may suffer from performance loss after the whole device was filled (it > >> > does not mean that fs is full!). > >> > > >> > It search fro free extents in each allocation group. When the free > >> > extent is found, blocks are marked as used and then trimmed. Afterwards > >> > these blocks are marked as free in per-group bitmap. > >> Looks ok, except two small notes: > >> 1) trim_fs is a time consuming operation and we have to add > >> condresced, and signal_pending checks to allow user to interrupt > >> cmd if necessery. See patch attached. > > > > Hi, Dimitry > > > > thanks for your patch! Although I have one question: > > > > > > for (group = 0; group < ngroups; group++) { > > - int err; > > - > > - err = ext4_mb_load_buddy(sb, group, &e4b); > > - if (err) { > > + ret = ext4_mb_load_buddy(sb, group, &e4b); > > + if (ret) { > > ext4_error(sb, "Error in loading buddy " > > "information for %u", group); > > - continue; > > + break; > > } > > > > Is there really need to jump out from the loop and exit in the case of > > load_buddy failure ? Next group may very well succeed in loading buddy, > > or am I missing something ? > Well, it may fail due to -ENOMEM which is not scary but in some places > it may fail due to EIO which is a very bad sign. So i think it is > slightly dangerous to continue if we have found a same group. Ok, it seems reasonable. > > > >> 2) IMHO runtime trim support is useful sometimes, for example when > >> user really care about data security i.e. unlinked file should be > >> trimmed ASAP. I think we have to provide 'secdel' mount option > >> similar to secdeletion flag for inode, but this is another story > >> not directly connected with the patch. > > > > I like the idea, but IMO this should work for any underlying storage, > > not just for SSDs. > Off course. We may use blkdev_issue_zeroout() if disk does not support > discard with zeroing. > > -Lukas ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Ext4: batched discard support - simplified version 2010-07-07 7:53 Ext4: batched discard support - simplified version Lukas Czerner 2010-07-07 7:53 ` [PATCH 1/2] Add ioctl FITRIM Lukas Czerner 2010-07-07 7:53 ` [PATCH 2/2] Add batched discard support for ext4 Lukas Czerner @ 2010-07-23 14:36 ` Ted Ts'o 2010-07-23 15:13 ` Jeff Moyer 2010-07-26 10:30 ` Lukas Czerner 2 siblings, 2 replies; 15+ messages in thread From: Ted Ts'o @ 2010-07-23 14:36 UTC (permalink / raw) To: Lukas Czerner; +Cc: eshishki, jmoyer, rwheeler, linux-ext4, sandeen On Wed, Jul 07, 2010 at 09:53:30AM +0200, Lukas Czerner wrote: > > Hi all, > > since my last post I have done some more testing with various SSD's and the > trend is clear. Trim performance is getting better and the performance loss > without trim is getting lower. So I have decided to abandon the initial idea > to track free blocks within some internal data structure - it takes time and > memory. Do you have some numbers about how bad trim actually might be on various devices? I can imagine some devices where it might be better (for wear levelling and better write endurance if nothing else) where it's better to do the trim right away instead of batching things. So what I'm thinking about doing is keeping the "discard" mount option to mean non-batched discard. If you want to use the explicit FITRIM ioctl, I don't think we need to test to see if the dicard mount option is set; if the user issues the ioctl, then we should do the batched discard, and if we don't trust the user to do that, then well, the ioctl should be restricted to privileged users only --- especially if it could take up to a minute. - Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Ext4: batched discard support - simplified version 2010-07-23 14:36 ` Ext4: batched discard support - simplified version Ted Ts'o @ 2010-07-23 15:13 ` Jeff Moyer 2010-07-23 15:19 ` Ted Ts'o 2010-07-23 15:30 ` Greg Freemyer 2010-07-26 10:30 ` Lukas Czerner 1 sibling, 2 replies; 15+ messages in thread From: Jeff Moyer @ 2010-07-23 15:13 UTC (permalink / raw) To: Ted Ts'o; +Cc: Lukas Czerner, eshishki, rwheeler, linux-ext4, sandeen "Ted Ts'o" <tytso@mit.edu> writes: > On Wed, Jul 07, 2010 at 09:53:30AM +0200, Lukas Czerner wrote: >> >> Hi all, >> >> since my last post I have done some more testing with various SSD's and the >> trend is clear. Trim performance is getting better and the performance loss >> without trim is getting lower. So I have decided to abandon the initial idea >> to track free blocks within some internal data structure - it takes time and >> memory. > > Do you have some numbers about how bad trim actually might be on > various devices? I'll let Lukas answer that when he gets back to the office next week. The performance of the trim command itself varies by vendor, of course. > I can imagine some devices where it might be better (for wear > levelling and better write endurance if nothing else) where it's > better to do the trim right away instead of batching things. I don't think so. In all of the configurations tested, I'm pretty sure we saw a performance hit from doing the TRIMs right away. The queue flush really hurts. Of course, I have no idea what you had in mind for the amount of time in between batched discards. > So what I'm thinking about doing is keeping the "discard" mount option > to mean non-batched discard. If you want to use the explicit FITRIM > ioctl, I don't think we need to test to see if the dicard mount option > is set; if the user issues the ioctl, then we should do the batched > discard, and if we don't trust the user to do that, then well, the > ioctl should be restricted to privileged users only --- especially if > it could take up to a minute. That sounds reasonable to me. Cheers, Jeff ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Ext4: batched discard support - simplified version 2010-07-23 15:13 ` Jeff Moyer @ 2010-07-23 15:19 ` Ted Ts'o 2010-07-23 15:40 ` Jeff Moyer 2010-07-24 16:31 ` Ric Wheeler 2010-07-23 15:30 ` Greg Freemyer 1 sibling, 2 replies; 15+ messages in thread From: Ted Ts'o @ 2010-07-23 15:19 UTC (permalink / raw) To: Jeff Moyer; +Cc: Lukas Czerner, eshishki, rwheeler, linux-ext4, sandeen On Fri, Jul 23, 2010 at 11:13:52AM -0400, Jeff Moyer wrote: > > I don't think so. In all of the configurations tested, I'm pretty sure > we saw a performance hit from doing the TRIMs right away. The queue > flush really hurts. Of course, I have no idea what you had in mind for > the amount of time in between batched discards. Sure, but not all the world is SATA-attached SSD's. I'm thinking in particular of PCIe-attached SSD's, where the TRIM command might be very fast indeed... I believe Ric Wheeler tells me you have TMS RamSan SSD's in house that you are testing? And of course those aren't the only PCIe-attached flash devices out there.... - Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Ext4: batched discard support - simplified version 2010-07-23 15:19 ` Ted Ts'o @ 2010-07-23 15:40 ` Jeff Moyer 2010-07-23 17:00 ` Ted Ts'o 2010-07-24 16:31 ` Ric Wheeler 1 sibling, 1 reply; 15+ messages in thread From: Jeff Moyer @ 2010-07-23 15:40 UTC (permalink / raw) To: Ted Ts'o; +Cc: Lukas Czerner, eshishki, rwheeler, linux-ext4, sandeen "Ted Ts'o" <tytso@mit.edu> writes: > On Fri, Jul 23, 2010 at 11:13:52AM -0400, Jeff Moyer wrote: >> >> I don't think so. In all of the configurations tested, I'm pretty sure >> we saw a performance hit from doing the TRIMs right away. The queue >> flush really hurts. Of course, I have no idea what you had in mind for >> the amount of time in between batched discards. > > Sure, but not all the world is SATA-attached SSD's. I'm thinking in > particular of PCIe-attached SSD's, where the TRIM command might be > very fast indeed... I believe Ric Wheeler tells me you have TMS > RamSan SSD's in house that you are testing? And of course those > aren't the only PCIe-attached flash devices out there.... You are right, and we have to consider thinly provisioned luns, as well. The only case I can think of where it makes sense to issue those discards immediately is if you are running tight on allocated space in your thinly provisioned lun. Aside from that, I'm not sure why you would want to send those commands down with every journal commit, instead of batched daily, for example. But, I can certainly understand wanting to allow this flexibility. Cheers, Jeff ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Ext4: batched discard support - simplified version 2010-07-23 15:40 ` Jeff Moyer @ 2010-07-23 17:00 ` Ted Ts'o 0 siblings, 0 replies; 15+ messages in thread From: Ted Ts'o @ 2010-07-23 17:00 UTC (permalink / raw) To: Jeff Moyer; +Cc: Lukas Czerner, eshishki, rwheeler, linux-ext4, sandeen On Fri, Jul 23, 2010 at 11:40:58AM -0400, Jeff Moyer wrote: > > You are right, and we have to consider thinly provisioned luns, as well. > The only case I can think of where it makes sense to issue those > discards immediately is if you are running tight on allocated space in > your thinly provisioned lun. Aside from that, I'm not sure why you > would want to send those commands down with every journal commit, > instead of batched daily, for example. But, I can certainly understand > wanting to allow this flexibility. The two reasons I could imagine is to give more flexibility to the wear leveling algorithms (depending on how often you are turning over files --- i.e., deleting blocks and then reusing them), and to minimize latency (it might be nicer for the system to send down the deleted blocks on a continuing basis rather than to send them down all at once). The other issue is that by sending TRIM commands for all free extents, even those that haven't been recently been released, the flash translation layer needs to look up a large number of blocks in its translation table to see if it needs to update it. This can end up burning CPU unnecessarily, especially for those flash devices (such as FusionIO, for example) manage their FTL using the host CPU. So this is one of the reasons why I want to leave some flexibility here; BTW, for some systems, it may make sense for the FITRIM ioctl to throttle the rate at which it locks block groups and sends down TRIM requests so it doesn't end up causing performance hiccups for live applications while the FITRIM ioctl is running. - Ted ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Ext4: batched discard support - simplified version 2010-07-23 15:19 ` Ted Ts'o 2010-07-23 15:40 ` Jeff Moyer @ 2010-07-24 16:31 ` Ric Wheeler 1 sibling, 0 replies; 15+ messages in thread From: Ric Wheeler @ 2010-07-24 16:31 UTC (permalink / raw) To: Ted Ts'o; +Cc: Jeff Moyer, Lukas Czerner, eshishki, linux-ext4, sandeen On 07/23/2010 11:19 AM, Ted Ts'o wrote: > On Fri, Jul 23, 2010 at 11:13:52AM -0400, Jeff Moyer wrote: > >> I don't think so. In all of the configurations tested, I'm pretty sure >> we saw a performance hit from doing the TRIMs right away. The queue >> flush really hurts. Of course, I have no idea what you had in mind for >> the amount of time in between batched discards. >> > Sure, but not all the world is SATA-attached SSD's. I'm thinking in > particular of PCIe-attached SSD's, where the TRIM command might be > very fast indeed... I believe Ric Wheeler tells me you have TMS > RamSan SSD's in house that you are testing? And of course those > aren't the only PCIe-attached flash devices out there.... > > - Ted > I think that some of the PCI-e cards might want that information right away, a lot of high end arrays actually prefer fewer larger chunks. One other user might be virtual devices or some remote replication mechanism. I wonder if the drbd people for example might (do?) use these? ric ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Ext4: batched discard support - simplified version 2010-07-23 15:13 ` Jeff Moyer 2010-07-23 15:19 ` Ted Ts'o @ 2010-07-23 15:30 ` Greg Freemyer 1 sibling, 0 replies; 15+ messages in thread From: Greg Freemyer @ 2010-07-23 15:30 UTC (permalink / raw) To: Jeff Moyer Cc: Ted Ts'o, Lukas Czerner, eshishki, rwheeler, linux-ext4, sandeen On Fri, Jul 23, 2010 at 11:13 AM, Jeff Moyer <jmoyer@redhat.com> wrote: > "Ted Ts'o" <tytso@mit.edu> writes: > >> On Wed, Jul 07, 2010 at 09:53:30AM +0200, Lukas Czerner wrote: >>> >>> Hi all, >>> >>> since my last post I have done some more testing with various SSD's and the >>> trend is clear. Trim performance is getting better and the performance loss >>> without trim is getting lower. So I have decided to abandon the initial idea >>> to track free blocks within some internal data structure - it takes time and >>> memory. >> >> Do you have some numbers about how bad trim actually might be on >> various devices? > > I'll let Lukas answer that when he gets back to the office next week. > The performance of the trim command itself varies by vendor, of course. > >> I can imagine some devices where it might be better (for wear >> levelling and better write endurance if nothing else) where it's >> better to do the trim right away instead of batching things. > > I don't think so. In all of the configurations tested, I'm pretty sure > we saw a performance hit from doing the TRIMs right away. The queue > flush really hurts. Of course, I have no idea what you had in mind for > the amount of time in between batched discards. It was my understanding that way back when, Intel was pushing for the TRIMs to be right away. That may be why they never fully implemented the TRIM command to accept more than one sectors worth of vectorized data. (That is still multiple ranges per discard, just not hundreds/thousands.) Along those lines, does this patch create multi-sector discard trim commands when there is a large list of discard ranges (ie. thousands of ranges to discard)? And if so, does it have a blacklist for SSDs like the Intel that don't implement the multi-sector payload part of the spec? Greg -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Ext4: batched discard support - simplified version 2010-07-23 14:36 ` Ext4: batched discard support - simplified version Ted Ts'o 2010-07-23 15:13 ` Jeff Moyer @ 2010-07-26 10:30 ` Lukas Czerner 1 sibling, 0 replies; 15+ messages in thread From: Lukas Czerner @ 2010-07-26 10:30 UTC (permalink / raw) To: Ted Ts'o Cc: Lukas Czerner, eshishki, jmoyer, rwheeler, linux-ext4, sandeen On Fri, 23 Jul 2010, Ted Ts'o wrote: > On Wed, Jul 07, 2010 at 09:53:30AM +0200, Lukas Czerner wrote: > > > > Hi all, > > > > since my last post I have done some more testing with various SSD's and the > > trend is clear. Trim performance is getting better and the performance loss > > without trim is getting lower. So I have decided to abandon the initial idea > > to track free blocks within some internal data structure - it takes time and > > memory. > > Do you have some numbers about how bad trim actually might be on > various devices? I can imagine some devices where it might be better > (for wear levelling and better write endurance if nothing else) where > it's better to do the trim right away instead of batching things. Hi, Yes, I have those numbers. http://people.redhat.com/jmoyer/discard/ext4_batched_discard/ext4_discard.html This page presents my test results on three different devices. I have tested the current ext4 discard implementation (do the trim right away). However, one tested device is still not on that page. With this (Vendor4) device I have got only about 1.83% performance loss, which is very good. http://people.redhat.com/jmoyer/discard/ext4_batched_discard/ext4_ioctltrim.html This page provides test results with my batched discard implementation. Take those numbers with discretion, because the patch still does not involve journaling and I have tested the "worst case" scenario, which involves issuing FITRIM in endless loop without any sleep. Generally the FITRIM ioctl can take from 2 seconds on fast devices to several (2-4) minutes on very slow devices, or under heavy load. > > So what I'm thinking about doing is keeping the "discard" mount option > to mean non-batched discard. If you want to use the explicit FITRIM > ioctl, I don't think we need to test to see if the dicard mount option > is set; if the user issues the ioctl, then we should do the batched > discard, and if we don't trust the user to do that, then well, the > ioctl should be restricted to privileged users only --- especially if > it could take up to a minute. I agree. > > - Ted > Thanks. -Lukas ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2010-07-26 10:30 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-07-07 7:53 Ext4: batched discard support - simplified version Lukas Czerner 2010-07-07 7:53 ` [PATCH 1/2] Add ioctl FITRIM Lukas Czerner 2010-07-07 7:53 ` [PATCH 2/2] Add batched discard support for ext4 Lukas Czerner 2010-07-14 8:33 ` Dmitry Monakhov 2010-07-14 9:40 ` Lukas Czerner 2010-07-14 10:03 ` Dmitry Monakhov 2010-07-14 11:43 ` Lukas Czerner 2010-07-23 14:36 ` Ext4: batched discard support - simplified version Ted Ts'o 2010-07-23 15:13 ` Jeff Moyer 2010-07-23 15:19 ` Ted Ts'o 2010-07-23 15:40 ` Jeff Moyer 2010-07-23 17:00 ` Ted Ts'o 2010-07-24 16:31 ` Ric Wheeler 2010-07-23 15:30 ` Greg Freemyer 2010-07-26 10:30 ` Lukas Czerner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).