From: Christoph Hellwig <hch@infradead.org>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org, Eric Biggers <ebiggers@kernel.org>
Subject: Re: [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
Date: Fri, 24 Oct 2025 03:58:20 -0700 [thread overview]
Message-ID: <aPtbzMCLwhuLuo4d@infradead.org> (raw)
In-Reply-To: <44a1532190aee561c2a8ae7af9f84fc1e092ae9e.1761302592.git.wqu@suse.com>
This seems to be a lot of overhead at least for simple checksums
like crc32c or xxhash.
Did you also look into Eric's "[PATCH 10/10] btrfs: switch to library APIs
for checksums", which reduces the crypto API overhead?  That is probably
the worst overhead for checksums right now.
On Fri, Oct 24, 2025 at 09:19:35PM +1030, Qu Wenruo wrote:
> [ENHANCEMENT]
> Btrfs currently calculates its data checksums and then submits the bio.
>
> But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write
> if the inode requires checksum"), any write with data checksum will
> fall back to buffered IO, meaning the content will not change during
> writeback.
>
> This means it is safe to calculate the data checksum and submit the bio
> in parallel; we only need to make sure btrfs_bio::end_io() is called
> after the checksum calculation is done.
>
> As usual, this new feature is hidden behind the experimental flag.
>
> [THEORETICAL ANALYSIS]
> Consider the following theoretical hardware performance figures, which
> should be more or less representative of modern mainstream hardware:
>
> Memory bandwidth: 50GiB/s
> CRC32C bandwidth: 45GiB/s
> SSD bandwidth: 8GiB/s
>
> Then the btrfs write bandwidth with data checksum before this patch would be:
>
> 1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s
>
> After the patch, the bandwidth would be:
>
> 1 / ( 1 / 50 + max( 1 / 45, 1 / 8)) = 6.90 GiB/s
>
> That is a 15.32 % improvement.
>
> [REAL WORLD BENCHMARK]
> I'm using a Zen5 (HX 370) host; the VM has 4GiB of memory and 10 vCPUs,
> with storage backed by a PCIe Gen3 x4 NVMe SSD.
>
> The test is a direct IO write with a 1MiB block size, writing 7GiB of
> data into a btrfs mount with data checksum enabled, so the direct write
> falls back to a buffered one:
>
> Vanilla Datasum: 1619.97 MiB/s
> Patched Datasum: 1792.26 MiB/s
> Diff: +10.6 %
>
> In my case the bottleneck is the storage, so the improvement does not
> reach the theoretical one, but it is still clearly observable.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/bio.c | 17 ++++++++----
> fs/btrfs/bio.h | 5 ++++
> fs/btrfs/file-item.c | 61 +++++++++++++++++++++++++++++++-------------
> fs/btrfs/file-item.h | 2 +-
> 4 files changed, 61 insertions(+), 24 deletions(-)
>
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 18272ef4b4d8..a5b83c6c9e7f 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -103,6 +103,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
> /* Make sure we're already in task context. */
> ASSERT(in_task());
>
> + if (bbio->async_csum)
> + wait_for_completion(&bbio->csum_done);
> +
> bbio->bio.bi_status = status;
> if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
> struct btrfs_bio *orig_bbio = bbio->private;
> @@ -535,7 +538,7 @@ static int btrfs_bio_csum(struct btrfs_bio *bbio)
> {
> if (bbio->bio.bi_opf & REQ_META)
> return btree_csum_one_bio(bbio);
> - return btrfs_csum_one_bio(bbio);
> + return btrfs_csum_one_bio(bbio, true);
> }
>
> /*
> @@ -613,10 +616,14 @@ static bool should_async_write(struct btrfs_bio *bbio)
> struct btrfs_fs_devices *fs_devices = bbio->fs_info->fs_devices;
> enum btrfs_offload_csum_mode csum_mode = READ_ONCE(fs_devices->offload_csum_mode);
>
> - if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_OFF)
> - return false;
> -
> - auto_csum_mode = (csum_mode == BTRFS_OFFLOAD_CSUM_AUTO);
> + if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_ON)
> + return true;
> + /*
> + * Write bios calculate the checksum and submit the bio at the same time.
> + * Unless explicitly requested, don't offload the serial csum calculation
> + * and bio submission to a workqueue.
> + */
> + return false;
> #endif
>
> /* Submit synchronously if the checksum implementation is fast. */
> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> index 00883aea55d7..277f2ac090d9 100644
> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -60,6 +60,8 @@ struct btrfs_bio {
> struct {
> struct btrfs_ordered_extent *ordered;
> struct btrfs_ordered_sum *sums;
> + struct work_struct csum_work;
> + struct completion csum_done;
> u64 orig_physical;
> };
>
> @@ -84,6 +86,9 @@ struct btrfs_bio {
>
> /* Use the commit root to look up csums (data read bio only). */
> bool csum_search_commit_root;
> +
> + bool async_csum;
> +
> /*
> * This member must come last, bio_alloc_bioset will allocate enough
> * bytes for entire btrfs_bio but relies on bio being last.
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index a42e6d54e7cd..bedfcf4a088d 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -18,6 +18,7 @@
> #include "fs.h"
> #include "accessors.h"
> #include "file-item.h"
> +#include "volumes.h"
>
> #define __MAX_CSUM_ITEMS(r, size) ((unsigned long)(((BTRFS_LEAF_DATA_SIZE(r) - \
> sizeof(struct btrfs_item) * 2) / \
> @@ -764,21 +765,46 @@ int btrfs_lookup_csums_bitmap(struct btrfs_root *root, struct btrfs_path *path,
> return ret;
> }
>
> -/*
> - * Calculate checksums of the data contained inside a bio.
> - */
> -int btrfs_csum_one_bio(struct btrfs_bio *bbio)
> +static void csum_one_bio(struct btrfs_bio *bbio)
> {
> - struct btrfs_ordered_extent *ordered = bbio->ordered;
> struct btrfs_inode *inode = bbio->inode;
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
> struct bio *bio = &bbio->bio;
> - struct btrfs_ordered_sum *sums;
> + struct btrfs_ordered_sum *sums = bbio->sums;
> struct bvec_iter iter = bio->bi_iter;
> phys_addr_t paddr;
> const u32 blocksize = fs_info->sectorsize;
> - int index;
> + int index = 0;
> +
> + shash->tfm = fs_info->csum_shash;
> +
> + btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
> + btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
> + index += fs_info->csum_size;
> + }
> +}
> +
> +static void csum_one_bio_work(struct work_struct *work)
> +{
> + struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, csum_work);
> +
> + ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
> + ASSERT(bbio->async_csum == true);
> + csum_one_bio(bbio);
> + complete(&bbio->csum_done);
> +}
> +
> +/*
> + * Calculate checksums of the data contained inside a bio.
> + */
> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async)
> +{
> + struct btrfs_ordered_extent *ordered = bbio->ordered;
> + struct btrfs_inode *inode = bbio->inode;
> + struct btrfs_fs_info *fs_info = inode->root->fs_info;
> + struct bio *bio = &bbio->bio;
> + struct btrfs_ordered_sum *sums;
> unsigned nofs_flag;
>
> nofs_flag = memalloc_nofs_save();
> @@ -789,21 +815,20 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio)
> if (!sums)
> return -ENOMEM;
>
> + sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> sums->len = bio->bi_iter.bi_size;
> INIT_LIST_HEAD(&sums->list);
> -
> - sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> - index = 0;
> -
> - shash->tfm = fs_info->csum_shash;
> -
> - btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
> - btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
> - index += fs_info->csum_size;
> - }
> -
> bbio->sums = sums;
> btrfs_add_ordered_sum(ordered, sums);
> +
> + if (!async) {
> + csum_one_bio(bbio);
> + return 0;
> + }
> + init_completion(&bbio->csum_done);
> + bbio->async_csum = true;
> + INIT_WORK(&bbio->csum_work, csum_one_bio_work);
> + schedule_work(&bbio->csum_work);
> return 0;
> }
>
> diff --git a/fs/btrfs/file-item.h b/fs/btrfs/file-item.h
> index 63216c43676d..2a250cf8b2a1 100644
> --- a/fs/btrfs/file-item.h
> +++ b/fs/btrfs/file-item.h
> @@ -64,7 +64,7 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
> int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
> struct btrfs_root *root,
> struct btrfs_ordered_sum *sums);
> -int btrfs_csum_one_bio(struct btrfs_bio *bbio);
> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async);
> int btrfs_alloc_dummy_sum(struct btrfs_bio *bbio);
> int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
> struct list_head *list, int search_commit,
> --
> 2.51.0
>
>
---end quoted text---
Thread overview: 14+ messages
2025-10-24 10:49 [PATCH 0/4] btrfs: introduce async_csum feature Qu Wenruo
2025-10-24 10:49 ` [PATCH 1/4] btrfs: make sure all btrfs_bio::end_io is called in task context Qu Wenruo
2025-10-24 10:49 ` [PATCH 2/4] btrfs: remove btrfs_fs_info::compressed_write_workers Qu Wenruo
2025-10-24 10:49 ` [PATCH 3/4] btrfs: relax btrfs_inode::ordered_tree_lock Qu Wenruo
2025-10-24 10:49 ` [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
2025-10-24 10:58 ` Christoph Hellwig [this message]
2025-10-24 22:15 ` Qu Wenruo
2025-10-24 22:51 ` Eric Biggers
2025-10-24 23:13 ` Qu Wenruo
2025-10-24 14:51 ` Boris Burkov
2025-10-24 21:40 ` Qu Wenruo
2025-10-28 7:19 ` kernel test robot
2025-10-28 8:06 ` Qu Wenruo
2025-10-28 10:15 ` Qu Wenruo