public inbox for linux-btrfs@vger.kernel.org
* [PATCH v2] btrfs: don't update the block group item if used bytes are the same
@ 2022-09-08  6:33 Qu Wenruo
  2022-09-08  6:42 ` Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Qu Wenruo @ 2022-09-08  6:33 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik

[BACKGROUND]

When committing a transaction, we will update block group items for all
dirty block groups.

But in fact, dirty block groups don't always need their block group
items updated.
It's pretty common to have a metadata block group which experienced
several CoW operations but still has the same amount of used bytes.

In that case, we may unnecessarily CoW a tree block while changing
nothing.

[ENHANCEMENT]

This patch introduces a btrfs_block_group::commit_used member to
remember the last committed used bytes, and uses that new member to
skip unnecessary block group item updates.

This would be more common for a large fs, where a metadata block group
can be as large as 1GiB, containing at most 64K metadata tree blocks
(with the default 16KiB nodesize).

In that case, if CoW added one tree block and then deleted another
inside the block group, it's completely possible that we don't need to
touch the block group item at all.

[BENCHMARK]

To properly measure how many block group item updates we skipped, I
added some members to btrfs_transaction to record how many times
update_block_group_item() is called, and how many of those calls are
skipped.

Then I used a one-line fio command to trigger the workload on a newly
created btrfs:

  fio --filename=$mnt/file --size=4G --rw=randrw --bs=32k --ioengine=libaio \
      --direct=1 --iodepth=64 --runtime=120 --numjobs=4 --name=random_rw \
      --fallocate=posix

Then I got 101 transactions committed during that fio run, and a
total of 2062 update_block_group_item() calls, of which 1241 could be
skipped.

That is already a good ~60% skip rate.

The full analysis can be found here:
https://docs.google.com/spreadsheets/d/e/2PACX-1vTbjhqqqxoebnQM_ZJzSM1rF7EY7I1IRbAzZjv19imcDHsVDA7qeA_-MzXxltFZ0kHBjxMA15s2CSH8/pubhtml

Furthermore, since I have per-transaction accounting, it shows that
initially we have a very low percentage of skipped block group item
updates.

This is expected since we're inserting a lot of new file extents
initially, thus the metadata usage is going to increase.

But after the initial 3 transactions, all later transactions have a
very stable 40~80% skip rate, mostly proving the patch is working.

Although such a high skip rate is not always a huge win; with our
later block group tree feature, we will have a much higher chance that
the block group item is already CoWed, and thus can already skip some
CoW work.

But still, skipping a full CoW search on the extent tree is always a
good thing, especially when the benefit comes almost from thin air.

Signed-off-by: Qu Wenruo <wqu@suse.com>
[Josef pinned down the race and provided a fix]
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-group.c | 20 +++++++++++++++++++-
 fs/btrfs/block-group.h |  6 ++++++
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index e7b5a54c8258..0df4d98df278 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2002,6 +2002,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 
 	cache->length = key->offset;
 	cache->used = btrfs_stack_block_group_used(bgi);
+	cache->commit_used = cache->used;
 	cache->flags = btrfs_stack_block_group_flags(bgi);
 	cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);
 
@@ -2693,6 +2694,22 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 	struct extent_buffer *leaf;
 	struct btrfs_block_group_item bgi;
 	struct btrfs_key key;
+	u64 used;
+
+	/*
+	 * Block group item updates can be triggered outside of the commit
+	 * transaction critical section, thus we need a consistent view of
+	 * the used bytes. We cannot directly use cache->used outside of
+	 * the spinlock, as it may be changed at any time.
+	 */
+	spin_lock(&cache->lock);
+	used = cache->used;
+	/* No change in used bytes, can safely skip it. */
+	if (cache->commit_used == used) {
+		spin_unlock(&cache->lock);
+		return 0;
+	}
+	spin_unlock(&cache->lock);
 
 	key.objectid = cache->start;
 	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
@@ -2707,12 +2724,13 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
 
 	leaf = path->nodes[0];
 	bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
-	btrfs_set_stack_block_group_used(&bgi, cache->used);
+	btrfs_set_stack_block_group_used(&bgi, used);
 	btrfs_set_stack_block_group_chunk_objectid(&bgi,
 						   cache->global_root_id);
 	btrfs_set_stack_block_group_flags(&bgi, cache->flags);
 	write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
 	btrfs_mark_buffer_dirty(leaf);
+	cache->commit_used = used;
 fail:
 	btrfs_release_path(path);
 	return ret;
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index f48db81d1d56..b57718020104 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -84,6 +84,12 @@ struct btrfs_block_group {
 	u64 cache_generation;
 	u64 global_root_id;
 
+	/*
+	 * The last committed used bytes of this block group.  If the above
+	 * @used is still the same as @commit_used, we don't need to update
+	 * the block group item of this block group.
+	 */
+	u64 commit_used;
 	/*
 	 * If the free space extent count exceeds this number, convert the block
 	 * group to bitmaps.
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] btrfs: don't update the block group item if used bytes are the same
  2022-09-08  6:33 [PATCH v2] btrfs: don't update the block group item if used bytes are the same Qu Wenruo
@ 2022-09-08  6:42 ` Qu Wenruo
  2022-09-08 10:31 ` Filipe Manana
  2022-09-08 13:17 ` Josef Bacik
  2 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2022-09-08  6:42 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik



On 2022/9/8 14:33, Qu Wenruo wrote:
> [BACKGROUND]
> 
> When committing a transaction, we will update block group items for all
> dirty block groups.
> 
> But in fact, dirty block groups don't always need to update their block
> group items.
> It's pretty common to have a metadata block group which experienced
> several CoW operations, but still have the same amount of used bytes.
> 
> In that case, we may unnecessarily CoW a tree block doing nothing.
> 
> [ENHANCEMENT]
> 
> This patch will introduce btrfs_block_group::commit_used member to
> remember the last used bytes, and use that new member to skip
> unnecessary block group item update.
> 
> This would be more common for large fs, which metadata block group can
> be as large as 1GiB, containing at most 64K metadata items.
> 
> In that case, if CoW added and the deleted one metadata item near the end
> of the block group, then it's completely possible we don't need to touch
> the block group item at all.
> 
> [BENCHMARK]
> 
> To properly benchmark how many block group items we skipped the update,
> I added some members into btrfs_tranaction to record how many times
> update_block_group_item() is called, and how many of them are skipped.
> 
> Then with a single line fio to trigger the workload on a newly created
> btrfs:
> 
>    fio --filename=$mnt/file --size=4G --rw=randrw --bs=32k --ioengine=libaio \
>        --direct=1 --iodepth=64 --runtime=120 --numjobs=4 --name=random_rw \
>        --fallocate=posix
> 
> Then I got 101 transaction committed during that fio command, and a
> total of 2062 update_block_group_item() calls, in which 1241 can be
> skipped.
> 
> This is already a good 60% got skipped.
> 
> The full analyse can be found here:
> https://docs.google.com/spreadsheets/d/e/2PACX-1vTbjhqqqxoebnQM_ZJzSM1rF7EY7I1IRbAzZjv19imcDHsVDA7qeA_-MzXxltFZ0kHBjxMA15s2CSH8/pubhtml
> 
> Furthermore, since I have per-trans accounting, it shows that initially
> we have a very low percentage of skipped block group item update.
> 
> This is expected since we're inserting a lot of new file extents
> initially, thus the metadata usage is going to increase.
> 
> But after the initial 3 transactions, all later transactions are have a
> very stable 40~80% skip rate, mostly proving the patch is working.
> 
> Although such high skip rate is not always a huge win, as for
> our later block group tree feature, we will have a much higher chance to
> have a block group item already COWed, thus can skip some COW work.
> 
> But still, skipping a full COW search on extent tree is always a good
> thing, especially when the benefit almost comes from thin-air.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> [Josef pinned down the race and provided a fix]
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---

Changelog:

v2:

- Add a new benchmark

   I added btrfs_transaction::total_bg_updates and skipped_bg_updates
   atomics; the total one gets increased every time
   update_block_group_item() gets called, and the skipped one gets
   increased when we skip one update.

   Then I added a trace_printk() in btrfs_commit_transaction() to
   print the two atomics.

   The full benchmark result is way better than I initially thought;
   after the initial increase in metadata usage, later transactions all
   got a high percentage of skipped bg updates, between 40~80%.

- Thanks Josef for pinning down and fixing a race

   Previously, update_block_group_item() only accessed cache->used
   once, thus it didn't really need extra protection.

   But now we need to access it at least 3 times (once to check if
   we can skip, once to update the block group item, once to update
   commit_used).

   This requires a lock to prevent cache->used from changing halfway
   through, which would lead to incorrect used bytes in the block
   group item.

Thanks,
Qu

>   fs/btrfs/block-group.c | 20 +++++++++++++++++++-
>   fs/btrfs/block-group.h |  6 ++++++
>   2 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index e7b5a54c8258..0df4d98df278 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -2002,6 +2002,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
>   
>   	cache->length = key->offset;
>   	cache->used = btrfs_stack_block_group_used(bgi);
> +	cache->commit_used = cache->used;
>   	cache->flags = btrfs_stack_block_group_flags(bgi);
>   	cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);
>   
> @@ -2693,6 +2694,22 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
>   	struct extent_buffer *leaf;
>   	struct btrfs_block_group_item bgi;
>   	struct btrfs_key key;
> +	u64 used;
> +
> +	/*
> +	 * Block group items update can be triggered out of commit transaction
> +	 * critical section, thus we need a consistent view of used bytes.
> +	 * We can not direct use cache->used out of the spin lock, as it
> +	 * may be changed.
> +	 */
> +	spin_lock(&cache->lock);
> +	used = cache->used;
> +	/* No change in used bytes, can safely skip it. */
> +	if (cache->commit_used == used) {
> +		spin_unlock(&cache->lock);
> +		return 0;
> +	}
> +	spin_unlock(&cache->lock);
>   
>   	key.objectid = cache->start;
>   	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> @@ -2707,12 +2724,13 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
>   
>   	leaf = path->nodes[0];
>   	bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
> -	btrfs_set_stack_block_group_used(&bgi, cache->used);
> +	btrfs_set_stack_block_group_used(&bgi, used);
>   	btrfs_set_stack_block_group_chunk_objectid(&bgi,
>   						   cache->global_root_id);
>   	btrfs_set_stack_block_group_flags(&bgi, cache->flags);
>   	write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
>   	btrfs_mark_buffer_dirty(leaf);
> +	cache->commit_used = used;
>   fail:
>   	btrfs_release_path(path);
>   	return ret;
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index f48db81d1d56..b57718020104 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -84,6 +84,12 @@ struct btrfs_block_group {
>   	u64 cache_generation;
>   	u64 global_root_id;
>   
> +	/*
> +	 * The last committed used bytes of this block group, if above @used
> +	 * is still the same as @commit_used, we don't need to update block
> +	 * group item of this block group.
> +	 */
> +	u64 commit_used;
>   	/*
>   	 * If the free space extent count exceeds this number, convert the block
>   	 * group to bitmaps.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] btrfs: don't update the block group item if used bytes are the same
  2022-09-08  6:33 [PATCH v2] btrfs: don't update the block group item if used bytes are the same Qu Wenruo
  2022-09-08  6:42 ` Qu Wenruo
@ 2022-09-08 10:31 ` Filipe Manana
  2022-09-08 11:41   ` Qu Wenruo
  2022-09-08 13:17 ` Josef Bacik
  2 siblings, 1 reply; 7+ messages in thread
From: Filipe Manana @ 2022-09-08 10:31 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Josef Bacik

On Thu, Sep 8, 2022 at 7:40 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BACKGROUND]
>
> When committing a transaction, we will update block group items for all
> dirty block groups.
>
> But in fact, dirty block groups don't always need to update their block
> group items.
> It's pretty common to have a metadata block group which experienced
> several CoW operations, but still have the same amount of used bytes.
>
> In that case, we may unnecessarily CoW a tree block doing nothing.
>
> [ENHANCEMENT]
>
> This patch will introduce btrfs_block_group::commit_used member to
> remember the last used bytes, and use that new member to skip
> unnecessary block group item update.
>
> This would be more common for large fs, which metadata block group can
> be as large as 1GiB, containing at most 64K metadata items.
>
> In that case, if CoW added and the deleted one metadata item near the end
> of the block group, then it's completely possible we don't need to touch
> the block group item at all.
>
> [BENCHMARK]
>
> To properly benchmark how many block group items we skipped the update,
> I added some members into btrfs_tranaction to record how many times
> update_block_group_item() is called, and how many of them are skipped.
>
> Then with a single line fio to trigger the workload on a newly created
> btrfs:
>
>   fio --filename=$mnt/file --size=4G --rw=randrw --bs=32k --ioengine=libaio \
>       --direct=1 --iodepth=64 --runtime=120 --numjobs=4 --name=random_rw \
>       --fallocate=posix

And did this improve fio's numbers? Throughput, latency?
It's odd to paste a test here and not mention its results. I suppose
it didn't make a difference, but even if not, that should be
explicitly stated.

In this case, fewer extent tree updates can result in better
concurrency for the nocow checks, which need to search the extent
tree.

>
> Then I got 101 transaction committed during that fio command, and a
> total of 2062 update_block_group_item() calls, in which 1241 can be
> skipped.
>
> This is already a good 60% got skipped.
>
> The full analyse can be found here:
> https://docs.google.com/spreadsheets/d/e/2PACX-1vTbjhqqqxoebnQM_ZJzSM1rF7EY7I1IRbAzZjv19imcDHsVDA7qeA_-MzXxltFZ0kHBjxMA15s2CSH8/pubhtml

Not sure if keeping a URL to an external source that is not
guaranteed to be available "forever" is a good practice.
It also doesn't seem to provide any substantial value, as you have
already mentioned some numbers above.

>
> Furthermore, since I have per-trans accounting, it shows that initially
> we have a very low percentage of skipped block group item update.
>
> This is expected since we're inserting a lot of new file extents
> initially, thus the metadata usage is going to increase.
>
> But after the initial 3 transactions, all later transactions are have a

"are have" -> "have"

> very stable 40~80% skip rate, mostly proving the patch is working.
>
> Although such high skip rate is not always a huge win, as for
> our later block group tree feature, we will have a much higher chance to
> have a block group item already COWed, thus can skip some COW work.
>
> But still, skipping a full COW search on extent tree is always a good
> thing, especially when the benefit almost comes from thin-air.

Agreed, it's a good thing.

Were there other benefits observed, like for example less I/O due to fewer COW operations?
Or transaction commits taking less time?

Thanks.


>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> [Josef pinned down the race and provided a fix]
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/block-group.c | 20 +++++++++++++++++++-
>  fs/btrfs/block-group.h |  6 ++++++
>  2 files changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index e7b5a54c8258..0df4d98df278 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -2002,6 +2002,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
>
>         cache->length = key->offset;
>         cache->used = btrfs_stack_block_group_used(bgi);
> +       cache->commit_used = cache->used;
>         cache->flags = btrfs_stack_block_group_flags(bgi);
>         cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);
>
> @@ -2693,6 +2694,22 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
>         struct extent_buffer *leaf;
>         struct btrfs_block_group_item bgi;
>         struct btrfs_key key;
> +       u64 used;
> +
> +       /*
> +        * Block group items update can be triggered out of commit transaction
> +        * critical section, thus we need a consistent view of used bytes.
> +        * We can not direct use cache->used out of the spin lock, as it
> +        * may be changed.
> +        */
> +       spin_lock(&cache->lock);
> +       used = cache->used;
> +       /* No change in used bytes, can safely skip it. */
> +       if (cache->commit_used == used) {
> +               spin_unlock(&cache->lock);
> +               return 0;
> +       }
> +       spin_unlock(&cache->lock);
>
>         key.objectid = cache->start;
>         key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> @@ -2707,12 +2724,13 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
>
>         leaf = path->nodes[0];
>         bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
> -       btrfs_set_stack_block_group_used(&bgi, cache->used);
> +       btrfs_set_stack_block_group_used(&bgi, used);
>         btrfs_set_stack_block_group_chunk_objectid(&bgi,
>                                                    cache->global_root_id);
>         btrfs_set_stack_block_group_flags(&bgi, cache->flags);
>         write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
>         btrfs_mark_buffer_dirty(leaf);
> +       cache->commit_used = used;
>  fail:
>         btrfs_release_path(path);
>         return ret;
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index f48db81d1d56..b57718020104 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -84,6 +84,12 @@ struct btrfs_block_group {
>         u64 cache_generation;
>         u64 global_root_id;
>
> +       /*
> +        * The last committed used bytes of this block group, if above @used
> +        * is still the same as @commit_used, we don't need to update block
> +        * group item of this block group.
> +        */
> +       u64 commit_used;
>         /*
>          * If the free space extent count exceeds this number, convert the block
>          * group to bitmaps.
> --
> 2.37.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] btrfs: don't update the block group item if used bytes are the same
  2022-09-08 10:31 ` Filipe Manana
@ 2022-09-08 11:41   ` Qu Wenruo
  2022-09-08 12:18     ` Filipe Manana
  0 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2022-09-08 11:41 UTC (permalink / raw)
  To: fdmanana, Qu Wenruo; +Cc: linux-btrfs, Josef Bacik



On 2022/9/8 18:31, Filipe Manana wrote:
> On Thu, Sep 8, 2022 at 7:40 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BACKGROUND]
>>
>> When committing a transaction, we will update block group items for all
>> dirty block groups.
>>
>> But in fact, dirty block groups don't always need to update their block
>> group items.
>> It's pretty common to have a metadata block group which experienced
>> several CoW operations, but still have the same amount of used bytes.
>>
>> In that case, we may unnecessarily CoW a tree block doing nothing.
>>
>> [ENHANCEMENT]
>>
>> This patch will introduce btrfs_block_group::commit_used member to
>> remember the last used bytes, and use that new member to skip
>> unnecessary block group item update.
>>
>> This would be more common for large fs, which metadata block group can
>> be as large as 1GiB, containing at most 64K metadata items.
>>
>> In that case, if CoW added and the deleted one metadata item near the end
>> of the block group, then it's completely possible we don't need to touch
>> the block group item at all.
>>
>> [BENCHMARK]
>>
>> To properly benchmark how many block group items we skipped the update,
>> I added some members into btrfs_tranaction to record how many times
>> update_block_group_item() is called, and how many of them are skipped.
>>
>> Then with a single line fio to trigger the workload on a newly created
>> btrfs:
>>
>>    fio --filename=$mnt/file --size=4G --rw=randrw --bs=32k --ioengine=libaio \
>>        --direct=1 --iodepth=64 --runtime=120 --numjobs=4 --name=random_rw \
>>        --fallocate=posix
>
> And did this improve fio's numbers? Throughput, latency?
> It's odd to paste a test here and not mention its results. I suppose
> it didn't make
> a difference, but even if not, it should be explicitly stated.

Unfortunately that test was run with my extra debugging patch applied
(to show whether the patch works).
Thus I didn't take the fio numbers too seriously.

And if I were going to do a real test, I'd remove the fallocate,
decrease the block size, do more loops, and apply other VM tunings to
get a more performance-oriented test.

Just for reference, here is the script I slightly modified:

fio --filename=$mnt/file --size=2G --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=64 --runtime=300 --numjobs=4 \
    --name=random_write --fallocate=none

And with my VM tuned for perf tests (no heavy debug config, dedicated
SATA SSD, cache=none mode, less memory, larger file size).

[BEFORE]
   WRITE: bw=32.3MiB/s (33.9MB/s), 8269KiB/s-8315KiB/s (8468kB/s-8515kB/s),
   io=8192MiB (8590MB), run=252210-253603msec

[AFTER]
   WRITE: bw=31.7MiB/s (33.3MB/s), 8124KiB/s-8184KiB/s (8319kB/s-8380kB/s),
   io=8192MiB (8590MB), run=256257-258135msec

So in fact it's even slightly worse performance, which I cannot explain at all...

>
> In this case, less extent tree updates can result in better
> concurrency for the nocow checks,
> which need to check the extent tree.
>
>>
>> Then I got 101 transaction committed during that fio command, and a
>> total of 2062 update_block_group_item() calls, in which 1241 can be
>> skipped.
>>
>> This is already a good 60% got skipped.
>>
>> The full analyse can be found here:
>> https://docs.google.com/spreadsheets/d/e/2PACX-1vTbjhqqqxoebnQM_ZJzSM1rF7EY7I1IRbAzZjv19imcDHsVDA7qeA_-MzXxltFZ0kHBjxMA15s2CSH8/pubhtml
>
> Not sure if keeping an url to an external source that is not
> guaranteed to be available "forever" is a good practice.
> It also doesn't seem to provide any substantial value, as you have
> already mentioned above some numbers.
>
>>
>> Furthermore, since I have per-trans accounting, it shows that initially
>> we have a very low percentage of skipped block group item update.
>>
>> This is expected since we're inserting a lot of new file extents
>> initially, thus the metadata usage is going to increase.
>>
>> But after the initial 3 transactions, all later transactions are have a
>
> "are have" -> "have"
>
>> very stable 40~80% skip rate, mostly proving the patch is working.
>>
>> Although such high skip rate is not always a huge win, as for
>> our later block group tree feature, we will have a much higher chance to
>> have a block group item already COWed, thus can skip some COW work.
>>
>> But still, skipping a full COW search on extent tree is always a good
>> thing, especially when the benefit almost comes from thin-air.
>
> Agreed, it's a good thing.
>
> Were there other benefits observed, like for example less IO due to less COW?
> Or transaction commits taking less time?

I can definitely measure that, but judging from the fio numbers alone,
it doesn't seem to help at all?

Thanks,
Qu
>
> Thanks.
>
>
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> [Josef pinned down the race and provided a fix]
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>   fs/btrfs/block-group.c | 20 +++++++++++++++++++-
>>   fs/btrfs/block-group.h |  6 ++++++
>>   2 files changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index e7b5a54c8258..0df4d98df278 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -2002,6 +2002,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
>>
>>          cache->length = key->offset;
>>          cache->used = btrfs_stack_block_group_used(bgi);
>> +       cache->commit_used = cache->used;
>>          cache->flags = btrfs_stack_block_group_flags(bgi);
>>          cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);
>>
>> @@ -2693,6 +2694,22 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
>>          struct extent_buffer *leaf;
>>          struct btrfs_block_group_item bgi;
>>          struct btrfs_key key;
>> +       u64 used;
>> +
>> +       /*
>> +        * Block group items update can be triggered out of commit transaction
>> +        * critical section, thus we need a consistent view of used bytes.
>> +        * We can not direct use cache->used out of the spin lock, as it
>> +        * may be changed.
>> +        */
>> +       spin_lock(&cache->lock);
>> +       used = cache->used;
>> +       /* No change in used bytes, can safely skip it. */
>> +       if (cache->commit_used == used) {
>> +               spin_unlock(&cache->lock);
>> +               return 0;
>> +       }
>> +       spin_unlock(&cache->lock);
>>
>>          key.objectid = cache->start;
>>          key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
>> @@ -2707,12 +2724,13 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
>>
>>          leaf = path->nodes[0];
>>          bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
>> -       btrfs_set_stack_block_group_used(&bgi, cache->used);
>> +       btrfs_set_stack_block_group_used(&bgi, used);
>>          btrfs_set_stack_block_group_chunk_objectid(&bgi,
>>                                                     cache->global_root_id);
>>          btrfs_set_stack_block_group_flags(&bgi, cache->flags);
>>          write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
>>          btrfs_mark_buffer_dirty(leaf);
>> +       cache->commit_used = used;
>>   fail:
>>          btrfs_release_path(path);
>>          return ret;
>> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
>> index f48db81d1d56..b57718020104 100644
>> --- a/fs/btrfs/block-group.h
>> +++ b/fs/btrfs/block-group.h
>> @@ -84,6 +84,12 @@ struct btrfs_block_group {
>>          u64 cache_generation;
>>          u64 global_root_id;
>>
>> +       /*
>> +        * The last committed used bytes of this block group, if above @used
>> +        * is still the same as @commit_used, we don't need to update block
>> +        * group item of this block group.
>> +        */
>> +       u64 commit_used;
>>          /*
>>           * If the free space extent count exceeds this number, convert the block
>>           * group to bitmaps.
>> --
>> 2.37.3
>>
>
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] btrfs: don't update the block group item if used bytes are the same
  2022-09-08 11:41   ` Qu Wenruo
@ 2022-09-08 12:18     ` Filipe Manana
  0 siblings, 0 replies; 7+ messages in thread
From: Filipe Manana @ 2022-09-08 12:18 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs, Josef Bacik

On Thu, Sep 8, 2022 at 12:41 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/9/8 18:31, Filipe Manana wrote:
> > On Thu, Sep 8, 2022 at 7:40 AM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> [BACKGROUND]
> >>
> >> When committing a transaction, we will update block group items for all
> >> dirty block groups.
> >>
> >> But in fact, dirty block groups don't always need to update their block
> >> group items.
> >> It's pretty common to have a metadata block group which experienced
> >> several CoW operations, but still have the same amount of used bytes.
> >>
> >> In that case, we may unnecessarily CoW a tree block doing nothing.
> >>
> >> [ENHANCEMENT]
> >>
> >> This patch will introduce btrfs_block_group::commit_used member to
> >> remember the last used bytes, and use that new member to skip
> >> unnecessary block group item update.
> >>
> >> This would be more common for large fs, which metadata block group can
> >> be as large as 1GiB, containing at most 64K metadata items.
> >>
> >> In that case, if CoW added and the deleted one metadata item near the end
> >> of the block group, then it's completely possible we don't need to touch
> >> the block group item at all.
> >>
> >> [BENCHMARK]
> >>
> >> To properly benchmark how many block group items we skipped the update,
> >> I added some members into btrfs_tranaction to record how many times
> >> update_block_group_item() is called, and how many of them are skipped.
> >>
> >> Then with a single line fio to trigger the workload on a newly created
> >> btrfs:
> >>
> >>    fio --filename=$mnt/file --size=4G --rw=randrw --bs=32k --ioengine=libaio \
> >>        --direct=1 --iodepth=64 --runtime=120 --numjobs=4 --name=random_rw \
> >>        --fallocate=posix
> >
> > And did this improve fio's numbers? Throughput, latency?
> > It's odd to paste a test here and not mention its results. I suppose
> > it didn't make
> > a difference, but even if not, it should be explicitly stated.
>
> Unfortunately that test is run with my extra debugging patch (to show if
> the patch works).
> Thus I didn't take the fio numbers too seriously.
>
> And if I'm going to do a real tests, I'd remove the fallocate, decrease
> the blocksize, and do more loops, and other VM tunings to get a more
> performance orient tests.
>
> Just for reference, here is the script I slightly modified:
>
> fio --filename=$mnt/file --size=2G --rw=randwrite --bs=4k
> --ioengine=libaio --iodepth=64 --runtime=300 --numjobs=4
> --name=random_write --fallocate=none
>
> And with my VM tuned for perf tests (no heavy debug config, dedicated
> SATA SSD, none cache mode, less memory, larger file size).
>
> [BEFORE]
>    WRITE: bw=32.3MiB/s (33.9MB/s), 8269KiB/s-8315KiB/s
> (8468kB/s-8515kB/s), io=8192MiB (8590MB), run=252210-253603msec
>
> [AFTER]
> WRITE: bw=31.7MiB/s (33.3MB/s), 8124KiB/s-8184KiB/s (8319kB/s-8380kB/s),
> io=8192MiB (8590MB), run=256257-258135msec
>
> So in fact it's even worse performance, which I can not explain at all...

That's probably just noise.
Small variations like those are not unusual, especially with multiple jobs.

The idea seems fine to me.
Thanks.

>
> >
> > In this case, less extent tree updates can result in better
> > concurrency for the nocow checks,
> > which need to check the extent tree.
> >
> >>
> >> Then I got 101 transaction committed during that fio command, and a
> >> total of 2062 update_block_group_item() calls, in which 1241 can be
> >> skipped.
> >>
> >> This is already a good 60% got skipped.
> >>
> >> The full analyse can be found here:
> >> https://docs.google.com/spreadsheets/d/e/2PACX-1vTbjhqqqxoebnQM_ZJzSM1rF7EY7I1IRbAzZjv19imcDHsVDA7qeA_-MzXxltFZ0kHBjxMA15s2CSH8/pubhtml
> >
> > Not sure if keeping an url to an external source that is not
> > guaranteed to be available "forever" is a good practice.
> > It also doesn't seem to provide any substantial value, as you have
> > already mentioned above some numbers.
> >
> >>
> >> Furthermore, since I have per-trans accounting, it shows that initially
> >> we have a very low percentage of skipped block group item update.
> >>
> >> This is expected since we're inserting a lot of new file extents
> >> initially, thus the metadata usage is going to increase.
> >>
> >> But after the initial 3 transactions, all later transactions are have a
> >
> > "are have" -> "have"
> >
> >> very stable 40~80% skip rate, mostly proving the patch is working.
> >>
> >> Although such a high skip rate is not always a huge win, since with
> >> our later block group tree feature we will have a much higher chance
> >> that the block group item is already COWed, thus some COW work can be
> >> skipped anyway.
> >>
> >> But still, skipping a full COW search on the extent tree is always a
> >> good thing, especially when the benefit almost comes from thin air.
> >
> > Agreed, it's a good thing.
> >
> > Were there other benefits observed, like for example less IO due to less COW?
> > Or transaction commits taking less time?
>
> I can definitely do that, but just from fio numbers, it doesn't seem to
> help at all?
>
> Thanks,
> Qu
> >
> > Thanks.
> >
> >
> >>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> [Josef pinned down the race and provided a fix]
> >> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> >> ---
> >>   fs/btrfs/block-group.c | 20 +++++++++++++++++++-
> >>   fs/btrfs/block-group.h |  6 ++++++
> >>   2 files changed, 25 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> >> index e7b5a54c8258..0df4d98df278 100644
> >> --- a/fs/btrfs/block-group.c
> >> +++ b/fs/btrfs/block-group.c
> >> @@ -2002,6 +2002,7 @@ static int read_one_block_group(struct btrfs_fs_info *info,
> >>
> >>          cache->length = key->offset;
> >>          cache->used = btrfs_stack_block_group_used(bgi);
> >> +       cache->commit_used = cache->used;
> >>          cache->flags = btrfs_stack_block_group_flags(bgi);
> >>          cache->global_root_id = btrfs_stack_block_group_chunk_objectid(bgi);
> >>
> >> @@ -2693,6 +2694,22 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
> >>          struct extent_buffer *leaf;
> >>          struct btrfs_block_group_item bgi;
> >>          struct btrfs_key key;
> >> +       u64 used;
> >> +
> >> +       /*
> >> +        * Block group item updates can be triggered outside of the commit
> >> +        * transaction critical section, thus we need a consistent view of
> >> +        * used bytes. We cannot directly use cache->used outside of the
> >> +        * spin lock, as it may be changed.
> >> +        */
> >> +       spin_lock(&cache->lock);
> >> +       used = cache->used;
> >> +       /* No change in used bytes, can safely skip it. */
> >> +       if (cache->commit_used == used) {
> >> +               spin_unlock(&cache->lock);
> >> +               return 0;
> >> +       }
> >> +       spin_unlock(&cache->lock);
> >>
> >>          key.objectid = cache->start;
> >>          key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
> >> @@ -2707,12 +2724,13 @@ static int update_block_group_item(struct btrfs_trans_handle *trans,
> >>
> >>          leaf = path->nodes[0];
> >>          bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
> >> -       btrfs_set_stack_block_group_used(&bgi, cache->used);
> >> +       btrfs_set_stack_block_group_used(&bgi, used);
> >>          btrfs_set_stack_block_group_chunk_objectid(&bgi,
> >>                                                     cache->global_root_id);
> >>          btrfs_set_stack_block_group_flags(&bgi, cache->flags);
> >>          write_extent_buffer(leaf, &bgi, bi, sizeof(bgi));
> >>          btrfs_mark_buffer_dirty(leaf);
> >> +       cache->commit_used = used;
> >>   fail:
> >>          btrfs_release_path(path);
> >>          return ret;
> >> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> >> index f48db81d1d56..b57718020104 100644
> >> --- a/fs/btrfs/block-group.h
> >> +++ b/fs/btrfs/block-group.h
> >> @@ -84,6 +84,12 @@ struct btrfs_block_group {
> >>          u64 cache_generation;
> >>          u64 global_root_id;
> >>
> >> +       /*
> >> +        * The last committed used bytes of this block group. If @used is
> >> +        * still the same as @commit_used, we don't need to update the
> >> +        * block group item of this block group.
> >> +        */
> >> +       u64 commit_used;
> >>          /*
> >>           * If the free space extent count exceeds this number, convert the block
> >>           * group to bitmaps.
> >> --
> >> 2.37.3
> >>
> >
> >



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] btrfs: don't update the block group item if used bytes are the same
  2022-09-08  6:33 [PATCH v2] btrfs: don't update the block group item if used bytes are the same Qu Wenruo
  2022-09-08  6:42 ` Qu Wenruo
  2022-09-08 10:31 ` Filipe Manana
@ 2022-09-08 13:17 ` Josef Bacik
  2022-09-08 22:35   ` Qu Wenruo
  2 siblings, 1 reply; 7+ messages in thread
From: Josef Bacik @ 2022-09-08 13:17 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Sep 08, 2022 at 02:33:48PM +0800, Qu Wenruo wrote:
> [BACKGROUND]
> 
> When committing a transaction, we will update block group items for all
> dirty block groups.
> 
> But in fact, dirty block groups don't always need to update their block
> group items.
> It's pretty common to have a metadata block group which experienced
> several CoW operations, but still have the same amount of used bytes.
> 
> In that case, we may unnecessarily CoW a tree block doing nothing.
> 
> [ENHANCEMENT]
> 
> This patch will introduce btrfs_block_group::commit_used member to
> remember the last used bytes, and use that new member to skip
> unnecessary block group item update.
> 
> This would be more common for a large fs, whose metadata block groups can
> be as large as 1GiB, containing at most 64K metadata items.
> 
> In that case, if CoW added and then deleted one metadata item near the end
> of the block group, then it's completely possible we don't need to touch
> the block group item at all.
> 
> [BENCHMARK]
> 
> To properly benchmark how many block group item updates we skipped,
> I added some members into btrfs_transaction to record how many times
> update_block_group_item() is called, and how many of them are skipped.
> 
> Then with a single line fio to trigger the workload on a newly created
> btrfs:
> 
>   fio --filename=$mnt/file --size=4G --rw=randrw --bs=32k --ioengine=libaio \
>       --direct=1 --iodepth=64 --runtime=120 --numjobs=4 --name=random_rw \
>       --fallocate=posix
> 
> Then I got 101 transactions committed during that fio command, and a
> total of 2062 update_block_group_item() calls, of which 1241 could be
> skipped.
> 
> That is already a good 60% of the updates skipped.
> 
> The full analysis can be found here:
> https://docs.google.com/spreadsheets/d/e/2PACX-1vTbjhqqqxoebnQM_ZJzSM1rF7EY7I1IRbAzZjv19imcDHsVDA7qeA_-MzXxltFZ0kHBjxMA15s2CSH8/pubhtml
> 
> Furthermore, since I have per-trans accounting, it shows that initially
> we have a very low percentage of skipped block group item update.
> 
> This is expected since we're inserting a lot of new file extents
> initially, thus the metadata usage is going to increase.
> 
> But after the initial 3 transactions, all later transactions have a
> very stable 40~80% skip rate, mostly proving the patch is working.
> 
> Although such a high skip rate is not always a huge win, since with
> our later block group tree feature we will have a much higher chance
> that the block group item is already COWed, thus some COW work can be
> skipped anyway.
> 
> But still, skipping a full COW search on the extent tree is always a
> good thing, especially when the benefit almost comes from thin air.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> [Josef pinned down the race and provided a fix]
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Generally I like this change; any time we can avoid a tree search is good.
I would like to see if this makes any difference timing-wise.  Could you record
transaction commit times for this job with and without your change?  That would
likely show a difference.  In any case, the code is fine.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v2] btrfs: don't update the block group item if used bytes are the same
  2022-09-08 13:17 ` Josef Bacik
@ 2022-09-08 22:35   ` Qu Wenruo
  0 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2022-09-08 22:35 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo; +Cc: linux-btrfs



On 2022/9/8 21:17, Josef Bacik wrote:
> On Thu, Sep 08, 2022 at 02:33:48PM +0800, Qu Wenruo wrote:
>> [BACKGROUND]
>>
>> When committing a transaction, we will update block group items for all
>> dirty block groups.
>>
>> But in fact, dirty block groups don't always need to update their block
>> group items.
>> It's pretty common to have a metadata block group which experienced
>> several CoW operations, but still have the same amount of used bytes.
>>
>> In that case, we may unnecessarily CoW a tree block doing nothing.
>>
>> [ENHANCEMENT]
>>
>> This patch will introduce btrfs_block_group::commit_used member to
>> remember the last used bytes, and use that new member to skip
>> unnecessary block group item update.
>>
>> This would be more common for a large fs, whose metadata block groups can
>> be as large as 1GiB, containing at most 64K metadata items.
>>
>> In that case, if CoW added and then deleted one metadata item near the end
>> of the block group, then it's completely possible we don't need to touch
>> the block group item at all.
>>
>> [BENCHMARK]
>>
>> To properly benchmark how many block group item updates we skipped,
>> I added some members into btrfs_transaction to record how many times
>> update_block_group_item() is called, and how many of them are skipped.
>>
>> Then with a single line fio to trigger the workload on a newly created
>> btrfs:
>>
>>    fio --filename=$mnt/file --size=4G --rw=randrw --bs=32k --ioengine=libaio \
>>        --direct=1 --iodepth=64 --runtime=120 --numjobs=4 --name=random_rw \
>>        --fallocate=posix
>>
>> Then I got 101 transactions committed during that fio command, and a
>> total of 2062 update_block_group_item() calls, of which 1241 could be
>> skipped.
>>
>> That is already a good 60% of the updates skipped.
>>
>> The full analysis can be found here:
>> https://docs.google.com/spreadsheets/d/e/2PACX-1vTbjhqqqxoebnQM_ZJzSM1rF7EY7I1IRbAzZjv19imcDHsVDA7qeA_-MzXxltFZ0kHBjxMA15s2CSH8/pubhtml
>>
>> Furthermore, since I have per-trans accounting, it shows that initially
>> we have a very low percentage of skipped block group item update.
>>
>> This is expected since we're inserting a lot of new file extents
>> initially, thus the metadata usage is going to increase.
>>
>> But after the initial 3 transactions, all later transactions have a
>> very stable 40~80% skip rate, mostly proving the patch is working.
>>
>> Although such a high skip rate is not always a huge win, since with
>> our later block group tree feature we will have a much higher chance
>> that the block group item is already COWed, thus some COW work can be
>> skipped anyway.
>>
>> But still, skipping a full COW search on the extent tree is always a
>> good thing, especially when the benefit almost comes from thin air.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> [Josef pinned down the race and provided a fix]
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>
> Generally I like this change; any time we can avoid a tree search is good.
> I would like to see if this makes any difference timing-wise.  Could you record
> transaction commit times for this job with and without your change?

Sure, I will add benchmarking of the transaction commit time consumption.

However, previous fio performance tests showed no observable change in
the performance numbers, so I would also record the total and individual
execution times of the update_block_group_item() function, just in case.

Thanks,
Qu

> That would
> likely show a difference.  In any case, the code is fine.
>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>
> Thanks,
>
> Josef


end of thread, other threads:[~2022-09-08 22:35 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-08  6:33 [PATCH v2] btrfs: don't update the block group item if used bytes are the same Qu Wenruo
2022-09-08  6:42 ` Qu Wenruo
2022-09-08 10:31 ` Filipe Manana
2022-09-08 11:41   ` Qu Wenruo
2022-09-08 12:18     ` Filipe Manana
2022-09-08 13:17 ` Josef Bacik
2022-09-08 22:35   ` Qu Wenruo
