[PATCH 1/4 v2] ext4: fix start and len arguments handling in ext4_trim

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/4 v2] ext4: fix start and len arguments handling in ext4_trim_fs()
@ 2012-03-02 12:11 Lukas Czerner
  2012-03-02 12:11 ` [PATCH 2/4 v2] ext4: Fix trimmed block count computing Lukas Czerner
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Lukas Czerner @ 2012-03-02 12:11 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Lukas Czerner

The overflow can happen when we are calling get_group_no_and_offset()
which stores the group number in the ext4_grpblk_t type which is
actually int. However when the blocknr is big enough the group number
might be bigger than ext4_grpblk_t resulting in overflow. This will
most likely happen with FITRIM default argument len = ULLONG_MAX.

Fix this by using "end" variable instead of "start+len" as it is easier
to get right and specifically check that the end is not beyond the end
of the file system, so we are sure that the result of
get_group_no_and_offset() will not overflow. Otherwise truncate it to
the size of the file system.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
v2: Squash the 'don't forget to discard last block in a group' into this one
    since this is the commit which reveals the bug.

 fs/ext4/mballoc.c |   57 +++++++++++++++++++++++++++-------------------------
 1 files changed, 30 insertions(+), 27 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index cb990b2..16a3aeb 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4971,11 +4971,11 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
 	start = (e4b.bd_info->bb_first_free > start) ?
 		e4b.bd_info->bb_first_free : start;
 
-	while (start < max) {
-		start = mb_find_next_zero_bit(bitmap, max, start);
-		if (start >= max)
+	while (start <= max) {
+		start = mb_find_next_zero_bit(bitmap, max + 1, start);
+		if (start > max)
 			break;
-		next = mb_find_next_bit(bitmap, max, start);
+		next = mb_find_next_bit(bitmap, max + 1, start);
 
 		if ((next - start) >= minblocks) {
 			ext4_trim_extent(sb, start,
@@ -5027,37 +5027,36 @@ out:
 int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
 {
 	struct ext4_group_info *grp;
-	ext4_group_t first_group, last_group;
-	ext4_group_t group, ngroups = ext4_get_groups_count(sb);
+	ext4_group_t group, first_group, last_group;
 	ext4_grpblk_t cnt = 0, first_cluster, last_cluster;
-	uint64_t start, len, minlen, trimmed = 0;
+	uint64_t start, end, minlen, trimmed = 0;
 	ext4_fsblk_t first_data_blk =
 			le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
+	ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es);
 	int ret = 0;
 
 	start = range->start >> sb->s_blocksize_bits;
-	len = range->len >> sb->s_blocksize_bits;
+	end = start + (range->len >> sb->s_blocksize_bits) - 1;
 	minlen = range->minlen >> sb->s_blocksize_bits;
 
-	if (unlikely(minlen > EXT4_CLUSTERS_PER_GROUP(sb)))
+	if (unlikely(minlen > EXT4_CLUSTERS_PER_GROUP(sb)) ||
+	    unlikely(start >= max_blks))
 		return -EINVAL;
-	if (start + len <= first_data_blk)
+	if (end >= max_blks)
+		end = max_blks - 1;
+	if (end <= first_data_blk)
 		goto out;
-	if (start < first_data_blk) {
-		len -= first_data_blk - start;
+	if (start < first_data_blk)
 		start = first_data_blk;
-	}
 
-	/* Determine first and last group to examine based on start and len */
+	/* Determine first and last group to examine based on start and end */
 	ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) start,
 				     &first_group, &first_cluster);
-	ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) (start + len),
+	ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) end,
 				     &last_group, &last_cluster);
-	last_group = (last_group > ngroups - 1) ? ngroups - 1 : last_group;
-	last_cluster = EXT4_CLUSTERS_PER_GROUP(sb);
 
-	if (first_group > last_group)
-		return -EINVAL;
+	/* end now represents the last cluster to discard in this group */
+	end = EXT4_CLUSTERS_PER_GROUP(sb) - 1;
 
 	for (group = first_group; group <= last_group; group++) {
 		grp = ext4_get_group_info(sb, group);
@@ -5069,24 +5068,28 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
 		}
 
 		/*
-		 * For all the groups except the last one, last block will
-		 * always be EXT4_BLOCKS_PER_GROUP(sb), so we only need to
-		 * change it for the last group in which case start +
-		 * len < EXT4_BLOCKS_PER_GROUP(sb).
+		 * For all the groups except the last one, last cluster will
+		 * always be EXT4_CLUSTERS_PER_GROUP(sb)-1, so we only need to
+		 * change it for the last group, note that last_cluster is
+		 * already computed earlier by ext4_get_group_no_and_offset()
 		 */
-		if (first_cluster + len < EXT4_CLUSTERS_PER_GROUP(sb))
-			last_cluster = first_cluster + len;
-		len -= last_cluster - first_cluster;
+		if (group == last_group)
+			end = last_cluster;
 
 		if (grp->bb_free >= minlen) {
 			cnt = ext4_trim_all_free(sb, group, first_cluster,
-						last_cluster, minlen);
+						end, minlen);
 			if (cnt < 0) {
 				ret = cnt;
 				break;
 			}
 		}
 		trimmed += cnt;
+
+		/*
+		 * For every group except the first one, we are sure
+		 * that the first cluster to discard will be cluster #0.
+		 */
 		first_cluster = 0;
 	}
 	range->len = trimmed * sb->s_blocksize;
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/4 v2] ext4: Fix trimmed block count computing
  2012-03-02 12:11 [PATCH 1/4 v2] ext4: fix start and len arguments handling in ext4_trim_fs() Lukas Czerner
@ 2012-03-02 12:11 ` Lukas Czerner
  2012-03-05 12:38   ` Jan Kara
  2012-03-02 12:11 ` [PATCH 3/4 v2] ext4: Always set then trimmed blocks count into len Lukas Czerner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Lukas Czerner @ 2012-03-02 12:11 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Lukas Czerner

Currently when there is not enough free block in the block group to
discard (grp->bb_free < minlen) the 'trimmed' is bumped up anyway with
the number of discarded blocks from the previous iteration. Fix this
by bumping up 'trimmed' only if the ext4_trim_all_free() was actually
run.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
v2: nothing changed

 fs/ext4/mballoc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 16a3aeb..f20688e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -5083,8 +5083,8 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
 				ret = cnt;
 				break;
 			}
+			trimmed += cnt;
 		}
-		trimmed += cnt;
 
 		/*
 		 * For every group except the first one, we are sure
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/4 v2] ext4: Fix trimmed block count computing
  2012-03-02 12:11 ` [PATCH 2/4 v2] ext4: Fix trimmed block count computing Lukas Czerner
@ 2012-03-05 12:38   ` Jan Kara
  2012-03-22  1:22     ` Ted Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2012-03-05 12:38 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, tytso

On Fri 02-03-12 13:11:56, Lukas Czerner wrote:
> Currently when there is not enough free block in the block group to
> discard (grp->bb_free < minlen) the 'trimmed' is bumped up anyway with
> the number of discarded blocks from the previous iteration. Fix this
> by bumping up 'trimmed' only if the ext4_trim_all_free() was actually
> run.
  Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> ---
> v2: nothing changed
> 
>  fs/ext4/mballoc.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 16a3aeb..f20688e 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -5083,8 +5083,8 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
>  				ret = cnt;
>  				break;
>  			}
> +			trimmed += cnt;
>  		}
> -		trimmed += cnt;
>  
>  		/*
>  		 * For every group except the first one, we are sure
> -- 
> 1.7.4.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/4 v2] ext4: Fix trimmed block count computing
  2012-03-05 12:38   ` Jan Kara
@ 2012-03-22  1:22     ` Ted Ts'o
  0 siblings, 0 replies; 15+ messages in thread
From: Ted Ts'o @ 2012-03-22  1:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: Lukas Czerner, linux-ext4

On Mon, Mar 05, 2012 at 01:38:20PM +0100, Jan Kara wrote:
> On Fri 02-03-12 13:11:56, Lukas Czerner wrote:
> > Currently when there is not enough free block in the block group to
> > discard (grp->bb_free < minlen) the 'trimmed' is bumped up anyway with
> > the number of discarded blocks from the previous iteration. Fix this
> > by bumping up 'trimmed' only if the ext4_trim_all_free() was actually
> > run.
>   Looks good. You can add:
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks, applied.

					- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 3/4 v2] ext4: Always set then trimmed blocks count into len
  2012-03-02 12:11 [PATCH 1/4 v2] ext4: fix start and len arguments handling in ext4_trim_fs() Lukas Czerner
  2012-03-02 12:11 ` [PATCH 2/4 v2] ext4: Fix trimmed block count computing Lukas Czerner
@ 2012-03-02 12:11 ` Lukas Czerner
  2012-03-05 12:38   ` Jan Kara
  2012-03-02 12:11 ` [PATCH 4/4 v2] ext4: Do not discard group with BLOCK_UNINIT set Lukas Czerner
  2012-03-05 12:37 ` [PATCH 1/4 v2] ext4: fix start and len arguments handling in ext4_trim_fs() Jan Kara
  3 siblings, 1 reply; 15+ messages in thread
From: Lukas Czerner @ 2012-03-02 12:11 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Lukas Czerner

Currently if the range to trim is too small, for example on 1K fs
the request to trim the first block, then the 'range->len' is not set
reporting wrong number of discarded block to the caller.

Fix this by always setting the 'range->len' before we return. Note that
when there is a failure (-EINVAL) caller can not depend on 'range->len'
being set properly.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
v2: nothing changed

 fs/ext4/mballoc.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index f20688e..8f817f2 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -5092,11 +5092,11 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
 		 */
 		first_cluster = 0;
 	}
-	range->len = trimmed * sb->s_blocksize;
 
 	if (!ret)
 		atomic_set(&EXT4_SB(sb)->s_last_trim_minblks, minlen);
 
 out:
+	range->len = trimmed * sb->s_blocksize;
 	return ret;
 }
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/4 v2] ext4: Always set then trimmed blocks count into len
  2012-03-02 12:11 ` [PATCH 3/4 v2] ext4: Always set then trimmed blocks count into len Lukas Czerner
@ 2012-03-05 12:38   ` Jan Kara
  2012-03-22  1:23     ` Ted Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2012-03-05 12:38 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, tytso

On Fri 02-03-12 13:11:57, Lukas Czerner wrote:
> Currently if the range to trim is too small, for example on 1K fs
> the request to trim the first block, then the 'range->len' is not set
> reporting wrong number of discarded block to the caller.
> 
> Fix this by always setting the 'range->len' before we return. Note that
> when there is a failure (-EINVAL) caller can not depend on 'range->len'
> being set properly.
  Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> ---
> v2: nothing changed
> 
>  fs/ext4/mballoc.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index f20688e..8f817f2 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -5092,11 +5092,11 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
>  		 */
>  		first_cluster = 0;
>  	}
> -	range->len = trimmed * sb->s_blocksize;
>  
>  	if (!ret)
>  		atomic_set(&EXT4_SB(sb)->s_last_trim_minblks, minlen);
>  
>  out:
> +	range->len = trimmed * sb->s_blocksize;
>  	return ret;
>  }
> -- 
> 1.7.4.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/4 v2] ext4: Always set then trimmed blocks count into len
  2012-03-05 12:38   ` Jan Kara
@ 2012-03-22  1:23     ` Ted Ts'o
  0 siblings, 0 replies; 15+ messages in thread
From: Ted Ts'o @ 2012-03-22  1:23 UTC (permalink / raw)
  To: Jan Kara; +Cc: Lukas Czerner, linux-ext4

On Mon, Mar 05, 2012 at 01:38:39PM +0100, Jan Kara wrote:
> On Fri 02-03-12 13:11:57, Lukas Czerner wrote:
> > Currently if the range to trim is too small, for example on 1K fs
> > the request to trim the first block, then the 'range->len' is not set
> > reporting wrong number of discarded block to the caller.
> > 
> > Fix this by always setting the 'range->len' before we return. Note that
> > when there is a failure (-EINVAL) caller can not depend on 'range->len'
> > being set properly.
>   Looks good. You can add:
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks, applied.

						- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 4/4 v2] ext4: Do not discard group with BLOCK_UNINIT set
  2012-03-02 12:11 [PATCH 1/4 v2] ext4: fix start and len arguments handling in ext4_trim_fs() Lukas Czerner
  2012-03-02 12:11 ` [PATCH 2/4 v2] ext4: Fix trimmed block count computing Lukas Czerner
  2012-03-02 12:11 ` [PATCH 3/4 v2] ext4: Always set then trimmed blocks count into len Lukas Czerner
@ 2012-03-02 12:11 ` Lukas Czerner
  2012-03-05 12:41   ` Jan Kara
  2012-03-06 22:18   ` Ted Ts'o
  2012-03-05 12:37 ` [PATCH 1/4 v2] ext4: fix start and len arguments handling in ext4_trim_fs() Jan Kara
  3 siblings, 2 replies; 15+ messages in thread
From: Lukas Czerner @ 2012-03-02 12:11 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Lukas Czerner

This commit is an optimization for FITRIM implementation. If the group
has not been initialized yet (BLOCK_UNINIT flag set), we do not need to
discard such group. This flag is set on mke2fs time to speed up
subsequent file system checks, because it says to us that there is
nothing there in the block group.

Because the BLOCK_UNINIT is only set on mke2fs time and cleared when
allocation from that group takes place we know that when set, there was
not anything allocated from that group, hence there should not be anything
to discard from the file system point of view. Of course there might be
situations where even if BLOCK_UNINIT is set the underlying storage is
provisioned. This might happen for example when the user disables discard
on mke2fs, however I think that this niche is not enough to not to take
advantage of this optimization.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
v2: nothing changed

 fs/ext4/mballoc.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 8f817f2..9ea1065a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -5033,6 +5033,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
 	ext4_fsblk_t first_data_blk =
 			le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
 	ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es);
+	struct ext4_group_desc *desc;
 	int ret = 0;
 
 	start = range->start >> sb->s_blocksize_bits;
@@ -5076,7 +5077,14 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
 		if (group == last_group)
 			end = last_cluster;
 
-		if (grp->bb_free >= minlen) {
+		desc = ext4_get_group_desc(sb, group, NULL);
+		if (!desc) {
+			ret = -EIO;
+			break;
+		}
+
+		if (!(desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) &&
+		    (grp->bb_free >= minlen)) {
 			cnt = ext4_trim_all_free(sb, group, first_cluster,
 						end, minlen);
 			if (cnt < 0) {
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4 v2] ext4: Do not discard group with BLOCK_UNINIT set
  2012-03-02 12:11 ` [PATCH 4/4 v2] ext4: Do not discard group with BLOCK_UNINIT set Lukas Czerner
@ 2012-03-05 12:41   ` Jan Kara
  2012-03-05 13:12     ` Lukas Czerner
  2012-03-06 22:18   ` Ted Ts'o
  1 sibling, 1 reply; 15+ messages in thread
From: Jan Kara @ 2012-03-05 12:41 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, tytso

On Fri 02-03-12 13:11:58, Lukas Czerner wrote:
> This commit is an optimization for FITRIM implementation. If the group
> has not been initialized yet (BLOCK_UNINIT flag set), we do not need to
> discard such group. This flag is set on mke2fs time to speed up
> subsequent file system checks, because it says to us that there is
> nothing there in the block group.
> 
> Because the BLOCK_UNINIT is only set on mke2fs time and cleared when
> allocation from that group takes place we know that when set, there was
> not anything allocated from that group, hence there should not be anything
> to discard from the file system point of view. Of course there might be
> situations where even if BLOCK_UNINIT is set the underlying storage is
> provisioned. This might happen for example when the user disables discard
> on mke2fs, however I think that this niche is not enough to not to take
> advantage of this optimization.
  This patch is correct but I'm undecided whether we really want to do this
optimization or not. It might be unexpected we didn't truncate block group
which was completely free (from user's POV). I don't consider FITRIM too
performance critical and I also don't think this will be such a massive
speedup...

								Honza
> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> ---
> v2: nothing changed
> 
>  fs/ext4/mballoc.c |   10 +++++++++-
>  1 files changed, 9 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 8f817f2..9ea1065a 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -5033,6 +5033,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
>  	ext4_fsblk_t first_data_blk =
>  			le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
>  	ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es);
> +	struct ext4_group_desc *desc;
>  	int ret = 0;
>  
>  	start = range->start >> sb->s_blocksize_bits;
> @@ -5076,7 +5077,14 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
>  		if (group == last_group)
>  			end = last_cluster;
>  
> -		if (grp->bb_free >= minlen) {
> +		desc = ext4_get_group_desc(sb, group, NULL);
> +		if (!desc) {
> +			ret = -EIO;
> +			break;
> +		}
> +
> +		if (!(desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) &&
> +		    (grp->bb_free >= minlen)) {
>  			cnt = ext4_trim_all_free(sb, group, first_cluster,
>  						end, minlen);
>  			if (cnt < 0) {
> -- 
> 1.7.4.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4 v2] ext4: Do not discard group with BLOCK_UNINIT set
  2012-03-05 12:41   ` Jan Kara
@ 2012-03-05 13:12     ` Lukas Czerner
  0 siblings, 0 replies; 15+ messages in thread
From: Lukas Czerner @ 2012-03-05 13:12 UTC (permalink / raw)
  To: Jan Kara; +Cc: Lukas Czerner, linux-ext4, tytso

On Mon, 5 Mar 2012, Jan Kara wrote:

> On Fri 02-03-12 13:11:58, Lukas Czerner wrote:
> > This commit is an optimization for FITRIM implementation. If the group
> > has not been initialized yet (BLOCK_UNINIT flag set), we do not need to
> > discard such group. This flag is set on mke2fs time to speed up
> > subsequent file system checks, because it says to us that there is
> > nothing there in the block group.
> > 
> > Because the BLOCK_UNINIT is only set on mke2fs time and cleared when
> > allocation from that group takes place we know that when set, there was
> > not anything allocated from that group, hence there should not be anything
> > to discard from the file system point of view. Of course there might be
> > situations where even if BLOCK_UNINIT is set the underlying storage is
> > provisioned. This might happen for example when the user disables discard
> > on mke2fs, however I think that this niche is not enough to not to take
> > advantage of this optimization.
>   This patch is correct but I'm undecided whether we really want to do this
> optimization or not. It might be unexpected we didn't truncate block group
> which was completely free (from user's POV). I don't consider FITRIM too
> performance critical and I also don't think this will be such a massive
> speedup...
> 
> 								Honza

Hi Honzo,

on small SSD's it certainly does not bring any significant speedup. But
consider huge thin-provisioned storage and you'll immediately notice the
change, but not only because of the storage size, but also because (from
my experience) those thing are really slow with discard.

Moreover ext4 would not be the only one not discarding the whole file
system on FITRIM. See btrfs which does not map the whole storage to the
file system at creation time, but rather allocate smaller chunks as the
demand for space grows. It also means that they will not discard the
whole file system, but only mapped chunks.

And lastly, FITRIM is supposed to be a way to notify the underlying
storage about the space which is no longer used. In conjunction with
full device discard on mke2fs (which is the default), we can skip UNINIT
groups just because from the fs point of view we are sure enough that
this space is not mapped. Note that the only case where this is not
true is if someone overrides the default mke2fs behaviour, or move their
file system with dd.

But I am certainly open for discussions about that.

Thanks!
-Lukas

> > 
> > Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> > ---
> > v2: nothing changed
> > 
> >  fs/ext4/mballoc.c |   10 +++++++++-
> >  1 files changed, 9 insertions(+), 1 deletions(-)
> > 
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 8f817f2..9ea1065a 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -5033,6 +5033,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
> >  	ext4_fsblk_t first_data_blk =
> >  			le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
> >  	ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es);
> > +	struct ext4_group_desc *desc;
> >  	int ret = 0;
> >  
> >  	start = range->start >> sb->s_blocksize_bits;
> > @@ -5076,7 +5077,14 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
> >  		if (group == last_group)
> >  			end = last_cluster;
> >  
> > -		if (grp->bb_free >= minlen) {
> > +		desc = ext4_get_group_desc(sb, group, NULL);
> > +		if (!desc) {
> > +			ret = -EIO;
> > +			break;
> > +		}
> > +
> > +		if (!(desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) &&
> > +		    (grp->bb_free >= minlen)) {
> >  			cnt = ext4_trim_all_free(sb, group, first_cluster,
> >  						end, minlen);
> >  			if (cnt < 0) {
> > -- 
> > 1.7.4.4
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4 v2] ext4: Do not discard group with BLOCK_UNINIT set
  2012-03-02 12:11 ` [PATCH 4/4 v2] ext4: Do not discard group with BLOCK_UNINIT set Lukas Czerner
  2012-03-05 12:41   ` Jan Kara
@ 2012-03-06 22:18   ` Ted Ts'o
  2012-03-07  7:10     ` Lukas Czerner
  1 sibling, 1 reply; 15+ messages in thread
From: Ted Ts'o @ 2012-03-06 22:18 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4

On Fri, Mar 02, 2012 at 01:11:58PM +0100, Lukas Czerner wrote:
> Because the BLOCK_UNINIT is only set on mke2fs time and cleared when
> allocation from that group takes place we know that when set, there was
> not anything allocated from that group, hence there should not be anything
> to discard from the file system point of view.

There's a really good reason to set BLOCK_UNINIT once we have noticed
that all of the blocks in the block group have been released....

If you have a 3TB HDD, running e2fsck takes 4 times as long if all of
the block groups have BLOCK_UNINIT cleared, compared to a freshly
mkfs'ed file system.  As a result of my getting really annoyed at how
long it took in this case, I'm planning on making e2fsck clear
BLOCK_UNINIT if possible, so that subsequent e2fsck's (and dumpe2fs
and debugfs invocations) can also be fast.

					- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4 v2] ext4: Do not discard group with BLOCK_UNINIT set
  2012-03-06 22:18   ` Ted Ts'o
@ 2012-03-07  7:10     ` Lukas Czerner
  2012-03-07 17:22       ` Ted Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: Lukas Czerner @ 2012-03-07  7:10 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Lukas Czerner, linux-ext4

On Tue, 6 Mar 2012, Ted Ts'o wrote:

> On Fri, Mar 02, 2012 at 01:11:58PM +0100, Lukas Czerner wrote:
> > Because the BLOCK_UNINIT is only set on mke2fs time and cleared when
> > allocation from that group takes place we know that when set, there was
> > not anything allocated from that group, hence there should not be anything
> > to discard from the file system point of view.
> 
> There's a really good reason to set BLOCK_UNINIT once we have noticed
> that all of the blocks in the block group have been released....
> 
> If you have a 3TB HDD, running e2fsck takes 4 times as long if all of
> the block groups have BLOCK_UNINIT cleared, compared to a freshly
> mkfs'ed file system.  As a result of my getting really annoyed at how
> long it took in this case, I'm planning on making e2fsck clear
> BLOCK_UNINIT if possible, so that subsequent e2fsck's (and dumpe2fs
> and debugfs invocations) can also be fast.
> 
> 					- Ted

Ok, if there is a plan to implement that, I am fine with dropping th
patches. But since this optimization would be helpful for discard, we
can introduce BLOCK_DISCARDED/UNPROVISIONED flag maybe ? Which would be
sen only after discard and cleared with the first allocation ?

Thanks!
-Lukas

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4 v2] ext4: Do not discard group with BLOCK_UNINIT set
  2012-03-07  7:10     ` Lukas Czerner
@ 2012-03-07 17:22       ` Ted Ts'o
  0 siblings, 0 replies; 15+ messages in thread
From: Ted Ts'o @ 2012-03-07 17:22 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4

On Wed, Mar 07, 2012 at 08:10:00AM +0100, Lukas Czerner wrote:
> 
> Ok, if there is a plan to implement that, I am fine with dropping th
> patches. But since this optimization would be helpful for discard, we
> can introduce BLOCK_DISCARDED/UNPROVISIONED flag maybe ? Which would be
> sen only after discard and cleared with the first allocation ?

Heh, great minds think alike.  I was thinking about a the possibility
of having a BLOCK_DISCARDED flag this morning, with exactly the
semantics that you are suggesting.  There may be devices in the future
that have fast trims which will prefer to always get the duplicate
trim requests, but there is no question there are a lot of crappy
devices out there right now where trims are extremely expensive, so
that seems fair.

Something else to think about for the future, for battery driven
devices (such as handsets) automatically sending tirm commands might
not be a good idea when the device is sleeping/has the screen turned
off.  Given that we don't have an easy way of determining whether or
not the device is in a low powered state (ideally we only send the
trims right after work has been queued to the device, so it's woken up
already, but when there isn't anything else that needs to be sent to
the device).

					- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4 v2] ext4: fix start and len arguments handling in ext4_trim_fs()
  2012-03-02 12:11 [PATCH 1/4 v2] ext4: fix start and len arguments handling in ext4_trim_fs() Lukas Czerner
                   ` (2 preceding siblings ...)
  2012-03-02 12:11 ` [PATCH 4/4 v2] ext4: Do not discard group with BLOCK_UNINIT set Lukas Czerner
@ 2012-03-05 12:37 ` Jan Kara
  2012-03-22  1:22   ` Ted Ts'o
  3 siblings, 1 reply; 15+ messages in thread
From: Jan Kara @ 2012-03-05 12:37 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, tytso

On Fri 02-03-12 13:11:55, Lukas Czerner wrote:
> The overflow can happen when we are calling get_group_no_and_offset()
> which stores the group number in the ext4_grpblk_t type which is
> actually int. However when the blocknr is big enough the group number
> might be bigger than ext4_grpblk_t resulting in overflow. This will
> most likely happen with FITRIM default argument len = ULLONG_MAX.
> 
> Fix this by using "end" variable instead of "start+len" as it is easier
> to get right and specifically check that the end is not beyond the end
> of the file system, so we are sure that the result of
> get_group_no_and_offset() will not overflow. Otherwise truncate it to
> the size of the file system.
  Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> ---
> v2: Squash the 'don't forget to discard last block in a group' into this one
>     since this is the commit which reveals the bug.
> 
>  fs/ext4/mballoc.c |   57 +++++++++++++++++++++++++++-------------------------
>  1 files changed, 30 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index cb990b2..16a3aeb 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4971,11 +4971,11 @@ ext4_trim_all_free(struct super_block *sb, ext4_group_t group,
>  	start = (e4b.bd_info->bb_first_free > start) ?
>  		e4b.bd_info->bb_first_free : start;
>  
> -	while (start < max) {
> -		start = mb_find_next_zero_bit(bitmap, max, start);
> -		if (start >= max)
> +	while (start <= max) {
> +		start = mb_find_next_zero_bit(bitmap, max + 1, start);
> +		if (start > max)
>  			break;
> -		next = mb_find_next_bit(bitmap, max, start);
> +		next = mb_find_next_bit(bitmap, max + 1, start);
>  
>  		if ((next - start) >= minblocks) {
>  			ext4_trim_extent(sb, start,
> @@ -5027,37 +5027,36 @@ out:
>  int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
>  {
>  	struct ext4_group_info *grp;
> -	ext4_group_t first_group, last_group;
> -	ext4_group_t group, ngroups = ext4_get_groups_count(sb);
> +	ext4_group_t group, first_group, last_group;
>  	ext4_grpblk_t cnt = 0, first_cluster, last_cluster;
> -	uint64_t start, len, minlen, trimmed = 0;
> +	uint64_t start, end, minlen, trimmed = 0;
>  	ext4_fsblk_t first_data_blk =
>  			le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
> +	ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es);
>  	int ret = 0;
>  
>  	start = range->start >> sb->s_blocksize_bits;
> -	len = range->len >> sb->s_blocksize_bits;
> +	end = start + (range->len >> sb->s_blocksize_bits) - 1;
>  	minlen = range->minlen >> sb->s_blocksize_bits;
>  
> -	if (unlikely(minlen > EXT4_CLUSTERS_PER_GROUP(sb)))
> +	if (unlikely(minlen > EXT4_CLUSTERS_PER_GROUP(sb)) ||
> +	    unlikely(start >= max_blks))
>  		return -EINVAL;
> -	if (start + len <= first_data_blk)
> +	if (end >= max_blks)
> +		end = max_blks - 1;
> +	if (end <= first_data_blk)
>  		goto out;
> -	if (start < first_data_blk) {
> -		len -= first_data_blk - start;
> +	if (start < first_data_blk)
>  		start = first_data_blk;
> -	}
>  
> -	/* Determine first and last group to examine based on start and len */
> +	/* Determine first and last group to examine based on start and end */
>  	ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) start,
>  				     &first_group, &first_cluster);
> -	ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) (start + len),
> +	ext4_get_group_no_and_offset(sb, (ext4_fsblk_t) end,
>  				     &last_group, &last_cluster);
> -	last_group = (last_group > ngroups - 1) ? ngroups - 1 : last_group;
> -	last_cluster = EXT4_CLUSTERS_PER_GROUP(sb);
>  
> -	if (first_group > last_group)
> -		return -EINVAL;
> +	/* end now represents the last cluster to discard in this group */
> +	end = EXT4_CLUSTERS_PER_GROUP(sb) - 1;
>  
>  	for (group = first_group; group <= last_group; group++) {
>  		grp = ext4_get_group_info(sb, group);
> @@ -5069,24 +5068,28 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
>  		}
>  
>  		/*
> -		 * For all the groups except the last one, last block will
> -		 * always be EXT4_BLOCKS_PER_GROUP(sb), so we only need to
> -		 * change it for the last group in which case start +
> -		 * len < EXT4_BLOCKS_PER_GROUP(sb).
> +		 * For all the groups except the last one, last cluster will
> +		 * always be EXT4_CLUSTERS_PER_GROUP(sb)-1, so we only need to
> +		 * change it for the last group, note that last_cluster is
> +		 * already computed earlier by ext4_get_group_no_and_offset()
>  		 */
> -		if (first_cluster + len < EXT4_CLUSTERS_PER_GROUP(sb))
> -			last_cluster = first_cluster + len;
> -		len -= last_cluster - first_cluster;
> +		if (group == last_group)
> +			end = last_cluster;
>  
>  		if (grp->bb_free >= minlen) {
>  			cnt = ext4_trim_all_free(sb, group, first_cluster,
> -						last_cluster, minlen);
> +						end, minlen);
>  			if (cnt < 0) {
>  				ret = cnt;
>  				break;
>  			}
>  		}
>  		trimmed += cnt;
> +
> +		/*
> +		 * For every group except the first one, we are sure
> +		 * that the first cluster to discard will be cluster #0.
> +		 */
>  		first_cluster = 0;
>  	}
>  	range->len = trimmed * sb->s_blocksize;
> -- 
> 1.7.4.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4 v2] ext4: fix start and len arguments handling in ext4_trim_fs()
  2012-03-05 12:37 ` [PATCH 1/4 v2] ext4: fix start and len arguments handling in ext4_trim_fs() Jan Kara
@ 2012-03-22  1:22   ` Ted Ts'o
  0 siblings, 0 replies; 15+ messages in thread
From: Ted Ts'o @ 2012-03-22  1:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: Lukas Czerner, linux-ext4

On Mon, Mar 05, 2012 at 01:37:57PM +0100, Jan Kara wrote:
> On Fri 02-03-12 13:11:55, Lukas Czerner wrote:
> > The overflow can happen when we are calling get_group_no_and_offset()
> > which stores the group number in the ext4_grpblk_t type which is
> > actually int. However when the blocknr is big enough the group number
> > might be bigger than ext4_grpblk_t resulting in overflow. This will
> > most likely happen with FITRIM default argument len = ULLONG_MAX.
> > 
> > Fix this by using "end" variable instead of "start+len" as it is easier
> > to get right and specifically check that the end is not beyond the end
> > of the file system, so we are sure that the result of
> > get_group_no_and_offset() will not overflow. Otherwise truncate it to
> > the size of the file system.
>   Looks good. You can add:
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks, applied.

					- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-03-22  1:23 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-03-02 12:11 [PATCH 1/4 v2] ext4: fix start and len arguments handling in ext4_trim_fs() Lukas Czerner
2012-03-02 12:11 ` [PATCH 2/4 v2] ext4: Fix trimmed block count computing Lukas Czerner
2012-03-05 12:38   ` Jan Kara
2012-03-22  1:22     ` Ted Ts'o
2012-03-02 12:11 ` [PATCH 3/4 v2] ext4: Always set then trimmed blocks count into len Lukas Czerner
2012-03-05 12:38   ` Jan Kara
2012-03-22  1:23     ` Ted Ts'o
2012-03-02 12:11 ` [PATCH 4/4 v2] ext4: Do not discard group with BLOCK_UNINIT set Lukas Czerner
2012-03-05 12:41   ` Jan Kara
2012-03-05 13:12     ` Lukas Czerner
2012-03-06 22:18   ` Ted Ts'o
2012-03-07  7:10     ` Lukas Czerner
2012-03-07 17:22       ` Ted Ts'o
2012-03-05 12:37 ` [PATCH 1/4 v2] ext4: fix start and len arguments handling in ext4_trim_fs() Jan Kara
2012-03-22  1:22   ` Ted Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).