[PATCH] btrfs: don't update the block group item if used bytes are the same

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

* [PATCH] btrfs: don't update the block group item if used bytes are the same
@ 2022-07-11  6:37 Qu Wenruo
  2022-07-11  8:30 ` Nikolay Borisov
  2022-09-07 14:31 ` Josef Bacik
  0 siblings, 2 replies; 11+ messages in thread
From: Qu Wenruo @ 2022-07-11  6:37 UTC (permalink / raw)
  To: linux-btrfs

When committing a transaction, we will update block group items for all
dirty block groups.

But in fact, dirty block groups don't always need to update their block
group items.
It's pretty common to have a metadata block group which experienced
several CoW operations, but still have the same amount of used bytes.

In that case, we may unnecessarily CoW a tree block doing nothing.

This patch will introduce btrfs_block_group::commit_used member to
remember the last used bytes, and use that new member to skip
unnecessary block group item update.

This would be more common for large fs, which metadata block group can
be as large as 1GiB, containing at most 64K metadata items.

In that case, if CoW added and the deleted one metadata item near the end
of the block group, then it's completely possible we don't need to touch
the block group item at all.

I don't have any benchmark to prove this, but this should not cause any
hurt either.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 6 ++++++
 fs/btrfs/block-group.h | 6 ++++++
 2 files changed, 12 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 0148a6d719a4..5b08ac282ace 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2024,6 +2024,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 
 	cache->length = key->offset;
 	cache->used = btrfs_stack_block_group_used(bgi);
+	cache->commit_used = cache->used;
 	cache->flags = btrfs_stack_block_group_flags(bgi);
 	cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);
 
@@ -2724,6 +2725,10 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 	struct btrfs_block_group_item bgi;
 	struct btrfs_key key;
 
+	/* No change in used bytes, can safely skip it. */
+	if (cache->commit_used == cache->used)
+		return 0;
+
 	key.objectid = cache->start;
 	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 	key.offset = cache->length;
@@ -2743,6 +2748,7 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 	btrfs_set_stack_block_group_flags(&bgi, cache->flags);
 	write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
 	btrfs_mark_buffer_dirty(leaf);
+	cache->commit_used = cache->used;
 fail:
 	btrfs_release_path(path);
 	return ret;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 35e0e860cc0b..3f92b8eb9a05 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -74,6 +74,12 @@ struct btrfs_block_group {
 	u64 cache_generation;
 	u64 global_root_id;
 
+	/*
+	 * The last committed used bytes of this block group, if above @used
+	 * is still the same as @commit_used, we don't need to update block
+	 * group item of this block group.
+	 */
+	u64 commit_used;
 	/*
 	 * If the free space extent count exceeds this number, convert the block
 	 * group to bitmaps.
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: don't update the block group item if used bytes are the same
  2022-07-11  6:37 [PATCH] btrfs: don't update the block group item if used bytes are the same Qu Wenruo
@ 2022-07-11  8:30 ` Nikolay Borisov
  2022-07-11  8:47   ` Qu Wenruo
  2022-09-07 14:31 ` Josef Bacik
  1 sibling, 1 reply; 11+ messages in thread
From: Nikolay Borisov @ 2022-07-11  8:30 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 11.07.22 г. 9:37 ч., Qu Wenruo wrote:
> When committing a transaction, we will update block group items for all
> dirty block groups.
> 
> But in fact, dirty block groups don't always need to update their block
> group items.
> It's pretty common to have a metadata block group which experienced
> several CoW operations, but still have the same amount of used bytes.

This could happen if for example the allocated/freed extents in a single 
transaction cancel each other out, right? Are there other cases where it 
could matter?

> 
> In that case, we may unnecessarily CoW a tree block doing nothing.
> 
> This patch will introduce btrfs_block_group::commit_used member to
> remember the last used bytes, and use that new member to skip
> unnecessary block group item update.
> 
> This would be more common for large fs, which metadata block group can
> be as large as 1GiB, containing at most 64K metadata items.
> 
> In that case, if CoW added and the deleted one metadata item near the end
> of the block group, then it's completely possible we don't need to touch
> the block group item at all.
> 
> I don't have any benchmark to prove this, but this should not cause any
> hurt either.

It should not but adds more state and is overall a maintenance burden. 
One way to test this would be to rig up the fs to count how many times 
the optimization has been hit over the course of, say, a full xfstest 
run or at least demonstrate a particular workload where this makes 
tangible difference.

> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/block-group.c | 6 ++++++
>   fs/btrfs/block-group.h | 6 ++++++
>   2 files changed, 12 insertions(+)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 0148a6d719a4..5b08ac282ace 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -2024,6 +2024,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
>   
>   	cache->length = key->offset;
>   	cache->used = btrfs_stack_block_group_used(bgi);
> +	cache->commit_used = cache->used;
>   	cache->flags = btrfs_stack_block_group_flags(bgi);
>   	cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);
>   
> @@ -2724,6 +2725,10 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
>   	struct btrfs_block_group_item bgi;
>   	struct btrfs_key key;
>   
> +	/* No change in used bytes, can safely skip it. */
> +	if (cache->commit_used == cache->used)
> +		return 0;
> +
>   	key.objectid = cache->start;
>   	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
>   	key.offset = cache->length;
> @@ -2743,6 +2748,7 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
>   	btrfs_set_stack_block_group_flags(&bgi, cache->flags);
>   	write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
>   	btrfs_mark_buffer_dirty(leaf);
> +	cache->commit_used = cache->used;
>   fail:
>   	btrfs_release_path(path);
>   	return ret;
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index 35e0e860cc0b..3f92b8eb9a05 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -74,6 +74,12 @@ struct btrfs_block_group {
>   	u64 cache_generation;
>   	u64 global_root_id;
>   
> +	/*
> +	 * The last committed used bytes of this block group, if above @used
> +	 * is still the same as @commit_used, we don't need to update block
> +	 * group item of this block group.
> +	 */
> +	u64 commit_used;
>   	/*
>   	 * If the free space extent count exceeds this number, convert the block
>   	 * group to bitmaps.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: don't update the block group item if used bytes are the same
  2022-07-11  8:30 ` Nikolay Borisov
@ 2022-07-11  8:47   ` Qu Wenruo
  2022-08-18 12:26     ` David Sterba
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2022-07-11  8:47 UTC (permalink / raw)
  To: Nikolay Borisov, Qu Wenruo, linux-btrfs



On 2022/7/11 16:30, Nikolay Borisov wrote:
>
>
> On 11.07.22 г. 9:37 ч., Qu Wenruo wrote:
>> When committing a transaction, we will update block group items for all
>> dirty block groups.
>>
>> But in fact, dirty block groups don't always need to update their block
>> group items.
>> It's pretty common to have a metadata block group which experienced
>> several CoW operations, but still have the same amount of used bytes.
>
> This could happen if for example the allocated/freed extents in a single
> transaction cancel each other out, right? Are there other cases where it
> could matter?

No need to completely cancel each other.

In fact, just COWing a path without adding/deleting new cousins would be
enough, and that would be very common for a lot of tree block operations.

>
>>
>> In that case, we may unnecessarily CoW a tree block doing nothing.
>>
>> This patch will introduce btrfs_block_group::commit_used member to
>> remember the last used bytes, and use that new member to skip
>> unnecessary block group item update.
>>
>> This would be more common for large fs, which metadata block group can
>> be as large as 1GiB, containing at most 64K metadata items.
>>
>> In that case, if CoW added and the deleted one metadata item near the end
>> of the block group, then it's completely possible we don't need to touch
>> the block group item at all.
>>
>> I don't have any benchmark to prove this, but this should not cause any
>> hurt either.
>
> It should not but adds more state and is overall a maintenance burden.
> One way to test this would be to rig up the fs to count how many times
> the optimization has been hit over the course of, say, a full xfstest
> run or at least demonstrate a particular workload where this makes
> tangible difference.

But in this particular case, there is really not that much status to bother.

In fact, we don't care about if there is any status, we only care about
the block_group::used is different from committed one.
Even no change to lock schemes.

Thanks,
Qu
>
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/block-group.c | 6 ++++++
>>   fs/btrfs/block-group.h | 6 ++++++
>>   2 files changed, 12 insertions(+)
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index 0148a6d719a4..5b08ac282ace 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -2024,6 +2024,7 @@ static int read_one_block_group(struct
>> btrfs_fs_info *info,
>>       cache->length = key->offset;
>>       cache->used = btrfs_stack_block_group_used(bgi);
>> +    cache->commit_used = cache->used;
>>       cache->flags = btrfs_stack_block_group_flags(bgi);
>>       cache->global_root_id =
>> btrfs_stack_block_group_chunk_objectid(bgi);
>> @@ -2724,6 +2725,10 @@ static int update_block_group_item(struct
>> btrfs_trans_handle *trans,
>>       struct btrfs_block_group_item bgi;
>>       struct btrfs_key key;
>> +    /* No change in used bytes, can safely skip it. */
>> +    if (cache->commit_used == cache->used)
>> +        return 0;
>> +
>>       key.objectid = cache->start;
>>       key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
>>       key.offset = cache->length;
>> @@ -2743,6 +2748,7 @@ static int update_block_group_item(struct
>> btrfs_trans_handle *trans,
>>       btrfs_set_stack_block_group_flags(&bgi, cache->flags);
>>       write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
>>       btrfs_mark_buffer_dirty(leaf);
>> +    cache->commit_used = cache->used;
>>   fail:
>>       btrfs_release_path(path);
>>       return ret;
>> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
>> index 35e0e860cc0b..3f92b8eb9a05 100644
>> --- a/fs/btrfs/block-group.h
>> +++ b/fs/btrfs/block-group.h
>> @@ -74,6 +74,12 @@ struct btrfs_block_group {
>>       u64 cache_generation;
>>       u64 global_root_id;
>> +    /*
>> +     * The last committed used bytes of this block group, if above @used
>> +     * is still the same as @commit_used, we don't need to update block
>> +     * group item of this block group.
>> +     */
>> +    u64 commit_used;
>>       /*
>>        * If the free space extent count exceeds this number, convert
>> the block
>>        * group to bitmaps.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: don't update the block group item if used bytes are the same
  2022-07-11  8:47   ` Qu Wenruo
@ 2022-08-18 12:26     ` David Sterba
  2022-09-02 12:51       ` David Sterba
  0 siblings, 1 reply; 11+ messages in thread
From: David Sterba @ 2022-08-18 12:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Nikolay Borisov, Qu Wenruo, linux-btrfs

On Mon, Jul 11, 2022 at 04:47:25PM +0800, Qu Wenruo wrote:
> 
> 
> On 2022/7/11 16:30, Nikolay Borisov wrote:
> >
> >
> > On 11.07.22 г. 9:37 ч., Qu Wenruo wrote:
> >> When committing a transaction, we will update block group items for all
> >> dirty block groups.
> >>
> >> But in fact, dirty block groups don't always need to update their block
> >> group items.
> >> It's pretty common to have a metadata block group which experienced
> >> several CoW operations, but still have the same amount of used bytes.
> >
> > This could happen if for example the allocated/freed extents in a single
> > transaction cancel each other out, right? Are there other cases where it
> > could matter?
> 
> No need to completely cancel each other.
> 
> In fact, just COWing a path without adding/deleting new cousins would be
> enough, and that would be very common for a lot of tree block operations.

I would be interested in numbers too, a percentage of skip/total would
be good for some workloads so we have at least some idea.

> >> In that case, we may unnecessarily CoW a tree block doing nothing.
> >>
> >> This patch will introduce btrfs_block_group::commit_used member to
> >> remember the last used bytes, and use that new member to skip
> >> unnecessary block group item update.
> >>
> >> This would be more common for large fs, which metadata block group can
> >> be as large as 1GiB, containing at most 64K metadata items.
> >>
> >> In that case, if CoW added and the deleted one metadata item near the end
> >> of the block group, then it's completely possible we don't need to touch
> >> the block group item at all.
> >>
> >> I don't have any benchmark to prove this, but this should not cause any
> >> hurt either.
> >
> > It should not but adds more state and is overall a maintenance burden.
> > One way to test this would be to rig up the fs to count how many times
> > the optimization has been hit over the course of, say, a full xfstest
> > run or at least demonstrate a particular workload where this makes
> > tangible difference.
> 
> But in this particular case, there is really not that much status to bother.
> 
> In fact, we don't care about if there is any status, we only care about
> the block_group::used is different from committed one.
> Even no change to lock schemes.

Looking to update_block_group_item, there are other block grup items
updated too:

- used - the one one you check
- flags - can't change on the fly, we'd have to do relocation, ie. a new
  block group
- chunk_objectid - that stays for the whole fileystem lifetime

So I think it's safe to just check the 'used' value but it increases
memory size of block group (that's been increasing in size recently) so
I'd rather have a better idea about the justification.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: don't update the block group item if used bytes are the same
  2022-08-18 12:26     ` David Sterba
@ 2022-09-02 12:51       ` David Sterba
  0 siblings, 0 replies; 11+ messages in thread
From: David Sterba @ 2022-09-02 12:51 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, Nikolay Borisov, Qu Wenruo, linux-btrfs

On Thu, Aug 18, 2022 at 02:26:24PM +0200, David Sterba wrote:
> On Mon, Jul 11, 2022 at 04:47:25PM +0800, Qu Wenruo wrote:
> > No need to completely cancel each other.
> > 
> > In fact, just COWing a path without adding/deleting new cousins would be
> > enough, and that would be very common for a lot of tree block operations.
> 
> I would be interested in numbers too, a percentage of skip/total would
> be good for some workloads so we have at least some idea.

I'll keep the patch in for-next until we see some numbers about how much
the optimization is used.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: don't update the block group item if used bytes are the same
  2022-07-11  6:37 [PATCH] btrfs: don't update the block group item if used bytes are the same Qu Wenruo
  2022-07-11  8:30 ` Nikolay Borisov
@ 2022-09-07 14:31 ` Josef Bacik
  2022-09-07 17:29   ` Josef Bacik
  1 sibling, 1 reply; 11+ messages in thread
From: Josef Bacik @ 2022-09-07 14:31 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Jul 11, 2022 at 02:37:52PM +0800, Qu Wenruo wrote:
> When committing a transaction, we will update block group items for all
> dirty block groups.
> 
> But in fact, dirty block groups don't always need to update their block
> group items.
> It's pretty common to have a metadata block group which experienced
> several CoW operations, but still have the same amount of used bytes.
> 
> In that case, we may unnecessarily CoW a tree block doing nothing.
> 
> This patch will introduce btrfs_block_group::commit_used member to
> remember the last used bytes, and use that new member to skip
> unnecessary block group item update.
> 
> This would be more common for large fs, which metadata block group can
> be as large as 1GiB, containing at most 64K metadata items.
> 
> In that case, if CoW added and the deleted one metadata item near the end
> of the block group, then it's completely possible we don't need to touch
> the block group item at all.
> 
> I don't have any benchmark to prove this, but this should not cause any
> hurt either.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

I've been seeing random btrfs check failures on our overnight testing since this
patch was merged.  I can't blame it directly yet, I've mostly seen it on
TEST_DEV, and once while running generic/648.  I'm running it in a loop now to
reproduce and then fix it.

We can start updating block groups before we're in the critical section, so we
can update block_group->bytes_used while we're updating the block group item in
a different thread.  So if we set the block_group item to some value of
bytes_used, then update it in another thread, and then set ->commit_used to the
new value we'll fail to update the block group item with the correct value
later.

We need to wrap this bit in the block_group->lock to avoid this particular
problem.  Once I reproduce and validate the fix I'll send that, but I wanted to
reply in case that takes longer than I expect.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: don't update the block group item if used bytes are the same
  2022-09-07 14:31 ` Josef Bacik
@ 2022-09-07 17:29   ` Josef Bacik
  2022-09-07 22:08     ` David Sterba
  2022-09-07 22:20     ` Qu Wenruo
  0 siblings, 2 replies; 11+ messages in thread
From: Josef Bacik @ 2022-09-07 17:29 UTC (permalink / raw)
  To: Qu Wenruo, dsterba; +Cc: linux-btrfs

On Wed, Sep 07, 2022 at 10:31:58AM -0400, Josef Bacik wrote:
> On Mon, Jul 11, 2022 at 02:37:52PM +0800, Qu Wenruo wrote:
> > When committing a transaction, we will update block group items for all
> > dirty block groups.
> > 
> > But in fact, dirty block groups don't always need to update their block
> > group items.
> > It's pretty common to have a metadata block group which experienced
> > several CoW operations, but still have the same amount of used bytes.
> > 
> > In that case, we may unnecessarily CoW a tree block doing nothing.
> > 
> > This patch will introduce btrfs_block_group::commit_used member to
> > remember the last used bytes, and use that new member to skip
> > unnecessary block group item update.
> > 
> > This would be more common for large fs, which metadata block group can
> > be as large as 1GiB, containing at most 64K metadata items.
> > 
> > In that case, if CoW added and the deleted one metadata item near the end
> > of the block group, then it's completely possible we don't need to touch
> > the block group item at all.
> > 
> > I don't have any benchmark to prove this, but this should not cause any
> > hurt either.
> > 
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> I've been seeing random btrfs check failures on our overnight testing since this
> patch was merged.  I can't blame it directly yet, I've mostly seen it on
> TEST_DEV, and once while running generic/648.  I'm running it in a loop now to
> reproduce and then fix it.
> 
> We can start updating block groups before we're in the critical section, so we
> can update block_group->bytes_used while we're updating the block group item in
> a different thread.  So if we set the block_group item to some value of
> bytes_used, then update it in another thread, and then set ->commit_used to the
> new value we'll fail to update the block group item with the correct value
> later.
> 
> We need to wrap this bit in the block_group->lock to avoid this particular
> problem.  Once I reproduce and validate the fix I'll send that, but I wanted to
> reply in case that takes longer than I expect.  Thanks,

Ok this is in fact the problem, this fixup made the problem go away.  Thanks,

Josef

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 6e7bb1c0352d..1e2773b120d4 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2694,10 +2694,16 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 	struct extent_buffer *leaf;
 	struct btrfs_block_group_item bgi;
 	struct btrfs_key key;
+	u64 used;
 
 	/* No change in used bytes, can safely skip it. */
-	if (cache->commit_used == cache->used)
+	spin_lock(&cache->lock);
+	used = cache->used;
+	if (cache->commit_used == used) {
+		spin_unlock(&cache->lock);
 		return 0;
+	}
+	spin_unlock(&cache->lock);
 
 	key.objectid = cache->start;
 	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
@@ -2712,13 +2718,14 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 
 	leaf = path->nodes[0];
 	bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
-	btrfs_set_stack_block_group_used(&bgi, cache->used);
+
+	btrfs_set_stack_block_group_used(&bgi, used);
 	btrfs_set_stack_block_group_chunk_objectid(&bgi,
 						   cache->global_root_id);
 	btrfs_set_stack_block_group_flags(&bgi, cache->flags);
 	write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
 	btrfs_mark_buffer_dirty(leaf);
-	cache->commit_used = cache->used;
+	cache->commit_used = used;
 fail:
 	btrfs_release_path(path);
 	return ret;

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: don't update the block group item if used bytes are the same
  2022-09-07 17:29   ` Josef Bacik
@ 2022-09-07 22:08     ` David Sterba
  2022-09-07 22:20     ` Qu Wenruo
  1 sibling, 0 replies; 11+ messages in thread
From: David Sterba @ 2022-09-07 22:08 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Qu Wenruo, dsterba, linux-btrfs

On Wed, Sep 07, 2022 at 01:29:49PM -0400, Josef Bacik wrote:
> On Wed, Sep 07, 2022 at 10:31:58AM -0400, Josef Bacik wrote:
> > On Mon, Jul 11, 2022 at 02:37:52PM +0800, Qu Wenruo wrote:
> > > When committing a transaction, we will update block group items for all
> > > dirty block groups.
> > > 
> > > But in fact, dirty block groups don't always need to update their block
> > > group items.
> > > It's pretty common to have a metadata block group which experienced
> > > several CoW operations, but still have the same amount of used bytes.
> > > 
> > > In that case, we may unnecessarily CoW a tree block doing nothing.
> > > 
> > > This patch will introduce btrfs_block_group::commit_used member to
> > > remember the last used bytes, and use that new member to skip
> > > unnecessary block group item update.
> > > 
> > > This would be more common for large fs, which metadata block group can
> > > be as large as 1GiB, containing at most 64K metadata items.
> > > 
> > > In that case, if CoW added and the deleted one metadata item near the end
> > > of the block group, then it's completely possible we don't need to touch
> > > the block group item at all.
> > > 
> > > I don't have any benchmark to prove this, but this should not cause any
> > > hurt either.
> > > 
> > > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > 
> > I've been seeing random btrfs check failures on our overnight testing since this
> > patch was merged.  I can't blame it directly yet, I've mostly seen it on
> > TEST_DEV, and once while running generic/648.  I'm running it in a loop now to
> > reproduce and then fix it.
> > 
> > We can start updating block groups before we're in the critical section, so we
> > can update block_group->bytes_used while we're updating the block group item in
> > a different thread.  So if we set the block_group item to some value of
> > bytes_used, then update it in another thread, and then set ->commit_used to the
> > new value we'll fail to update the block group item with the correct value
> > later.
> > 
> > We need to wrap this bit in the block_group->lock to avoid this particular
> > problem.  Once I reproduce and validate the fix I'll send that, but I wanted to
> > reply in case that takes longer than I expect.  Thanks,
> 
> Ok this is in fact the problem, this fixup made the problem go away.  Thanks,

Thanks for tracking it down, I've removed the patch from for-next, it's
an optimization and we haven't seen any numbers yet how useful it is.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: don't update the block group item if used bytes are the same
  2022-09-07 17:29   ` Josef Bacik
  2022-09-07 22:08     ` David Sterba
@ 2022-09-07 22:20     ` Qu Wenruo
  2022-09-07 22:35       ` Qu Wenruo
  1 sibling, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2022-09-07 22:20 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, dsterba; +Cc: linux-btrfs



On 2022/9/8 01:29, Josef Bacik wrote:
> On Wed, Sep 07, 2022 at 10:31:58AM -0400, Josef Bacik wrote:
>> On Mon, Jul 11, 2022 at 02:37:52PM +0800, Qu Wenruo wrote:
>>> When committing a transaction, we will update block group items for all
>>> dirty block groups.
>>>
>>> But in fact, dirty block groups don't always need to update their block
>>> group items.
>>> It's pretty common to have a metadata block group which experienced
>>> several CoW operations, but still have the same amount of used bytes.
>>>
>>> In that case, we may unnecessarily CoW a tree block doing nothing.
>>>
>>> This patch will introduce btrfs_block_group::commit_used member to
>>> remember the last used bytes, and use that new member to skip
>>> unnecessary block group item update.
>>>
>>> This would be more common for large fs, which metadata block group can
>>> be as large as 1GiB, containing at most 64K metadata items.
>>>
>>> In that case, if CoW added and the deleted one metadata item near the end
>>> of the block group, then it's completely possible we don't need to touch
>>> the block group item at all.
>>>
>>> I don't have any benchmark to prove this, but this should not cause any
>>> hurt either.
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>
>> I've been seeing random btrfs check failures on our overnight testing since this
>> patch was merged.  I can't blame it directly yet, I've mostly seen it on
>> TEST_DEV, and once while running generic/648.  I'm running it in a loop now to
>> reproduce and then fix it.
>>
>> We can start updating block groups before we're in the critical section, so we
>> can update block_group->bytes_used while we're updating the block group item in
>> a different thread.  So if we set the block_group item to some value of
>> bytes_used, then update it in another thread, and then set ->commit_used to the
>> new value we'll fail to update the block group item with the correct value
>> later.
>>
>> We need to wrap this bit in the block_group->lock to avoid this particular
>> problem.  Once I reproduce and validate the fix I'll send that, but I wanted to
>> reply in case that takes longer than I expect.  Thanks,
>
> Ok this is in fact the problem, this fixup made the problem go away.  Thanks,

This fix means, a bg members can change even we are at
update_block_group_item().

The old code is completely relying on the one time access on cache->used.

Anyway thanks for the fix.

Thanks,
Qu
>
> Josef
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 6e7bb1c0352d..1e2773b120d4 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -2694,10 +2694,16 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
>   	struct extent_buffer *leaf;
>   	struct btrfs_block_group_item bgi;
>   	struct btrfs_key key;
> +	u64 used;
>
>   	/* No change in used bytes, can safely skip it. */
> -	if (cache->commit_used == cache->used)
> +	spin_lock(&cache->lock);
> +	used = cache->used;
> +	if (cache->commit_used == used) {
> +		spin_unlock(&cache->lock);
>   		return 0;
> +	}
> +	spin_unlock(&cache->lock);
>
>   	key.objectid = cache->start;
>   	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> @@ -2712,13 +2718,14 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
>
>   	leaf = path->nodes[0];
>   	bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
> -	btrfs_set_stack_block_group_used(&bgi, cache->used);
> +
> +	btrfs_set_stack_block_group_used(&bgi, used);
>   	btrfs_set_stack_block_group_chunk_objectid(&bgi,
>   						   cache->global_root_id);
>   	btrfs_set_stack_block_group_flags(&bgi, cache->flags);
>   	write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
>   	btrfs_mark_buffer_dirty(leaf);
> -	cache->commit_used = cache->used;
> +	cache->commit_used = used;
>   fail:
>   	btrfs_release_path(path);
>   	return ret;

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] btrfs: don't update the block group item if used bytes are the same
  2022-09-07 22:20     ` Qu Wenruo
@ 2022-09-07 22:35       ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2022-09-07 22:35 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, dsterba; +Cc: linux-btrfs



On 2022/9/8 06:20, Qu Wenruo wrote:
>
>
> On 2022/9/8 01:29, Josef Bacik wrote:
>> On Wed, Sep 07, 2022 at 10:31:58AM -0400, Josef Bacik wrote:
>>> On Mon, Jul 11, 2022 at 02:37:52PM +0800, Qu Wenruo wrote:
>>>> When committing a transaction, we will update block group items for all
>>>> dirty block groups.
>>>>
>>>> But in fact, dirty block groups don't always need to update their block
>>>> group items.
>>>> It's pretty common to have a metadata block group which experienced
>>>> several CoW operations, but still have the same amount of used bytes.
>>>>
>>>> In that case, we may unnecessarily CoW a tree block doing nothing.
>>>>
>>>> This patch will introduce btrfs_block_group::commit_used member to
>>>> remember the last used bytes, and use that new member to skip
>>>> unnecessary block group item update.
>>>>
>>>> This would be more common for large fs, which metadata block group can
>>>> be as large as 1GiB, containing at most 64K metadata items.
>>>>
>>>> In that case, if CoW added and the deleted one metadata item near
>>>> the end
>>>> of the block group, then it's completely possible we don't need to
>>>> touch
>>>> the block group item at all.
>>>>
>>>> I don't have any benchmark to prove this, but this should not cause any
>>>> hurt either.
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>
>>> I've been seeing random btrfs check failures on our overnight testing
>>> since this
>>> patch was merged.  I can't blame it directly yet, I've mostly seen it on
>>> TEST_DEV, and once while running generic/648.  I'm running it in a
>>> loop now to
>>> reproduce and then fix it.
>>>
>>> We can start updating block groups before we're in the critical
>>> section, so we
>>> can update block_group->bytes_used while we're updating the block
>>> group item in
>>> a different thread.  So if we set the block_group item to some value of
>>> bytes_used, then update it in another thread, and then set
>>> ->commit_used to the
>>> new value we'll fail to update the block group item with the correct
>>> value
>>> later.
>>>
>>> We need to wrap this bit in the block_group->lock to avoid this
>>> particular
>>> problem.  Once I reproduce and validate the fix I'll send that, but I
>>> wanted to
>>> reply in case that takes longer than I expect.  Thanks,
>>
>> Ok this is in fact the problem, this fixup made the problem go away.
>> Thanks,
>
> This fix means, a bg members can change even we are at
> update_block_group_item().
>
> The old code is completely relying on the one time access on cache->used.
>
> Anyway thanks for the fix.

So this is only happening if we execute
btrfs_start_dirty_block_groups(), which unlike
btrfs_write_dirty_block_groups(), is not yet protected by transaction
critical path.

Thus we can have bg members changing halfway and caused the race.


To David, do I need to send a updated version with extra comments on this?

Thanks,
Qu

>
> Thanks,
> Qu
>>
>> Josef
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index 6e7bb1c0352d..1e2773b120d4 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -2694,10 +2694,16 @@ static int update_block_group_item(struct
>> btrfs_trans_handle *trans,
>>       struct extent_buffer *leaf;
>>       struct btrfs_block_group_item bgi;
>>       struct btrfs_key key;
>> +    u64 used;
>>
>>       /* No change in used bytes, can safely skip it. */
>> -    if (cache->commit_used == cache->used)
>> +    spin_lock(&cache->lock);
>> +    used = cache->used;
>> +    if (cache->commit_used == used) {
>> +        spin_unlock(&cache->lock);
>>           return 0;
>> +    }
>> +    spin_unlock(&cache->lock);
>>
>>       key.objectid = cache->start;
>>       key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
>> @@ -2712,13 +2718,14 @@ static int update_block_group_item(struct
>> btrfs_trans_handle *trans,
>>
>>       leaf = path->nodes[0];
>>       bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
>> -    btrfs_set_stack_block_group_used(&bgi, cache->used);
>> +
>> +    btrfs_set_stack_block_group_used(&bgi, used);
>>       btrfs_set_stack_block_group_chunk_objectid(&bgi,
>>                              cache->global_root_id);
>>       btrfs_set_stack_block_group_flags(&bgi, cache->flags);
>>       write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
>>       btrfs_mark_buffer_dirty(leaf);
>> -    cache->commit_used = cache->used;
>> +    cache->commit_used = used;
>>   fail:
>>       btrfs_release_path(path);
>>       return ret;

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] btrfs: don't update the block group item if used bytes are the same
@ 2022-09-09  6:45 Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2022-09-09  6:45 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

[BACKGROUND]

When committing a transaction, we will update block group items for all
dirty block groups.

But in fact, dirty block groups don't always need to update their block
group items.
It's pretty common to have a metadata block group which experienced
several CoW operations, but still have the same amount of used bytes.

In that case, we may unnecessarily CoW a tree block doing nothing.

[ENHANCEMENT]

This patch will introduce btrfs_block_group::commit_used member to
remember the last used bytes, and use that new member to skip
unnecessary block group item update.

This would be more common for large fs, which metadata block group can
be as large as 1GiB, containing at most 64K metadata items.

In that case, if CoW added and the deleted one metadata item near the end
of the block group, then it's completely possible we don't need to touch
the block group item at all.

[BENCHMARK]

The patchset itself can have quite a high chance (20~80%) to skip block
group item updates in a lot of workload.

As a result, it would result shorter time spent on
btrfs_write_dirty_block_groups(), and overall reduce the execution time
of the critical section of btrfs_commit_transaction().

Here comes a fio command, which will do random writes in 4K block size,
causing a very heavy metadata updates.

fio --filename=$mnt/file --size=512M --rw=randwrite --direct=1 --bs=4k \
    --ioengine=libaio --iodepth=64 --runtime=300 --numjobs=4 \
    --name=random_write --fallocate=none --time_based --fsync_on_close=1

The file size (512M) and number of threads (4) means 2GiB file size in
total, but during the full 300s run time, my dedicated SATA SSD is able
to write around 20~25GiB, which is over 10 times the file size.

Thus after we fill the initial 2G, we should not cause much block group items
updates.

Please note, the fio numbers by itself doesn't have much change, but if
we look deeper, there is some reduced execution time, especially
for the critical section of btrfs_commit_transaction().

I added extra trace_printk() to measure the following per-transaction
execution time

- Critical section of btrfs_commit_transaction()
  By re-using the existing update_commit_stats() function, which
  has already calculated the interval correctly.

- The while() loop for btrfs_write_dirty_block_groups()
  Although this includes the execution time of btrfs_run_delayed_refs(),
  it should still be representative overall.

Both result involves transid 7~30, the same amount of transaction
committed.

The result looks like this:

                      |      Before       |     After      |  Diff
----------------------+-------------------+----------------+--------
Transaction interval  | 229247198.5       | 215016933.6    | -6.2%
Block group interval  | 23133.33333       | 18970.83333    | -18.0%

The change in block group item updates is more obvious, as skipped bg
item updates also means less delayed refs, thus the change is more
obvious.

And the overall execution time for that bg update loop is pretty small,
thus we can assume the extent tree is already mostly cached.
If we can skip a uncached tree block, it would cause more obvious
change.

Unfortunately the overall reduce in commit transaction critical section
is much smaller, as the block group item updates loop is not really the
major part, at least for the above fio script.

But still we have a observable reduction in the critical section.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[Josef pinned down the race and provided a fix]
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
Changelog:
v3:
- Further improve the benchmark
  No code change.

 * Make it time based, so we can overwrite the file several times
 * Make it much longer
 * Reduce the total file size to avoid ENOSPC pressure
 * Measure both commit transaction interval and block group item
   updates interval to get a more representative result.

v2:
- Add a new benchmark
  I added btrfs_transaction::total_bg_updates and skipped_bg_updates
  atomics, the total one will get increased every time
  update_block_group_item() get called, and the skipped one will
  get increased when we skip one update.

  Then I go trace_printk() at btrfs_commit_transaction() to
  print the two atomics.

  The full benchmark is way better than I initially though, after
  the initial increase in metadata usage, later transactions all
  got a high percentage of skipped bg updates, between 40~80%.

- Thanks Josef for pinning down and fixing a race
  Previously, update_block_group_item() only access cache->used
  once, thus it doesn't really need extra protection.

  But now we need to access it at least 3 times (one to check if
  we can skip, one to update the block group item, one to update
  commit_used).

  This requires a lock to prevent the cache->used changed halfway,
  and lead to incorrect used bytes in block group item.
---
 fs/btrfs/block-group.c | 20 +++++++++++++++++++-
 fs/btrfs/block-group.h |  6 ++++++
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index e7b5a54c8258..0df4d98df278 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2002,6 +2002,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,

 	cache->length = key->offset;
 	cache->used = btrfs_stack_block_group_used(bgi);
+	cache->commit_used = cache->used;
 	cache->flags = btrfs_stack_block_group_flags(bgi);
 	cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);

@@ -2693,6 +2694,22 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 	struct extent_buffer *leaf;
 	struct btrfs_block_group_item bgi;
 	struct btrfs_key key;
+	u64 used;
+
+	/*
+	 * Block group items update can be triggered out of commit transaction
+	 * critical section, thus we need a consistent view of used bytes.
+	 * We can not direct use cache->used out of the spin lock, as it
+	 * may be changed.
+	 */
+	spin_lock(&cache->lock);
+	used = cache->used;
+	/* No change in used bytes, can safely skip it. */
+	if (cache->commit_used == used) {
+		spin_unlock(&cache->lock);
+		return 0;
+	}
+	spin_unlock(&cache->lock);

 	key.objectid = cache->start;
 	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
@@ -2707,12 +2724,13 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,

 	leaf = path->nodes[0];
 	bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
-	btrfs_set_stack_block_group_used(&bgi, cache->used);
+	btrfs_set_stack_block_group_used(&bgi, used);
 	btrfs_set_stack_block_group_chunk_objectid(&bgi,
 						   cache->global_root_id);
 	btrfs_set_stack_block_group_flags(&bgi, cache->flags);
 	write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
 	btrfs_mark_buffer_dirty(leaf);
+	cache->commit_used = used;
 fail:
 	btrfs_release_path(path);
 	return ret;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index f48db81d1d56..b57718020104 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -84,6 +84,12 @@ struct btrfs_block_group {
 	u64 cache_generation;
 	u64 global_root_id;

+	/*
+	 * The last committed used bytes of this block group, if above @used
+	 * is still the same as @commit_used, we don't need to update block
+	 * group item of this block group.
+	 */
+	u64 commit_used;
 	/*
 	 * If the free space extent count exceeds this number, convert the block
 	 * group to bitmaps.
-- 
2.37.3

^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-09-09  6:45 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-07-11  6:37 [PATCH] btrfs: don't update the block group item if used bytes are the same Qu Wenruo
2022-07-11  8:30 ` Nikolay Borisov
2022-07-11  8:47   ` Qu Wenruo
2022-08-18 12:26     ` David Sterba
2022-09-02 12:51       ` David Sterba
2022-09-07 14:31 ` Josef Bacik
2022-09-07 17:29   ` Josef Bacik
2022-09-07 22:08     ` David Sterba
2022-09-07 22:20     ` Qu Wenruo
2022-09-07 22:35       ` Qu Wenruo
  -- strict thread matches above, loose matches on Subject: below --
2022-09-09  6:45 Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox