From: Christoph Hellwig <hch@infradead.org>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org, Eric Biggers <ebiggers@kernel.org>
Subject: Re: [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
Date: Fri, 24 Oct 2025 03:58:20 -0700 [thread overview]
Message-ID: <aPtbzMCLwhuLuo4d@infradead.org> (raw)
In-Reply-To: <44a1532190aee561c2a8ae7af9f84fc1e092ae9e.1761302592.git.wqu@suse.com>
This seems to be a lot of overhead at least for simple checksums
like crc32c or xxhash.
Did you also look into Eric's
[PATCH 10/10] btrfs: switch to library APIs for checksums to reduce
the crypto API overhead, which is probably the worst overhead for
checksums right now?
On Fri, Oct 24, 2025 at 09:19:35PM +1030, Qu Wenruo wrote:
> [ENHANCEMENT]
> Btrfs currently calculate its data checksum then submit the bio.
>
> But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write
> if the inode requires checksum"), any writes with data checksum will
> fallback to buffered IO, meaning the content will not change during
> writeback.
>
> This means we're safe to calculate the data checksum and submit the bio
> in parallel, we only need to make sure btrfs_bio::end_io() is called
> after the checksum calculation is done.
>
> As usual, such new feature is hidden behind the experimental flag.
>
> [THEORETIC ANALYZE]
> Consider the following theoretic hardware performance, which should be
> more or less close to modern mainstream hardware:
>
> Memory bandwidth: 50GiB/s
> CRC32C bandwidth: 45GiB/s
> SSD bandwidth: 8GiB/s
>
> Then btrfs write bandwidth with data checksum before the patch would be
>
> 1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s
>
> After the patch, the bandwidth would be:
>
> 1 / ( 1 / 50 + max( 1 / 45 + 1 / 8)) = 6.90 GiB/s
>
> The difference would be 15.32 % improvement.
>
> [REAL WORLD BENCHMARK]
> I'm using a Zen5 (HX 370) as the host, the VM has 4GiB memory, 10 vCPUs, the
> storage is backed by a PCIE gen3 x4 NVME SSD.
>
> The test is a direct IO write, with 1MiB block size, write 7GiB data
> into a btrfs mount with data checksum. Thus the direct write will fallback
> to buffered one:
>
> Vanilla Datasum: 1619.97 GiB/s
> Patched Datasum: 1792.26 GiB/s
> Diff +10.6 %
>
> In my case, the bottleneck is the storage, thus the improvement is not
> reaching the theoretic one, but still some observable improvement.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/bio.c | 17 ++++++++----
> fs/btrfs/bio.h | 5 ++++
> fs/btrfs/file-item.c | 61 +++++++++++++++++++++++++++++++-------------
> fs/btrfs/file-item.h | 2 +-
> 4 files changed, 61 insertions(+), 24 deletions(-)
>
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 18272ef4b4d8..a5b83c6c9e7f 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -103,6 +103,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
> /* Make sure we're already in task context. */
> ASSERT(in_task());
>
> + if (bbio->async_csum)
> + wait_for_completion(&bbio->csum_done);
> +
> bbio->bio.bi_status = status;
> if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
> struct btrfs_bio *orig_bbio = bbio->private;
> @@ -535,7 +538,7 @@ static int btrfs_bio_csum(struct btrfs_bio *bbio)
> {
> if (bbio->bio.bi_opf & REQ_META)
> return btree_csum_one_bio(bbio);
> - return btrfs_csum_one_bio(bbio);
> + return btrfs_csum_one_bio(bbio, true);
> }
>
> /*
> @@ -613,10 +616,14 @@ static bool should_async_write(struct btrfs_bio *bbio)
> struct btrfs_fs_devices *fs_devices = bbio->fs_info->fs_devices;
> enum btrfs_offload_csum_mode csum_mode = READ_ONCE(fs_devices->offload_csum_mode);
>
> - if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_OFF)
> - return false;
> -
> - auto_csum_mode = (csum_mode == BTRFS_OFFLOAD_CSUM_AUTO);
> + if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_ON)
> + return true;
> + /*
> + * Write bios will calculate checksum and submit bio at the same time.
> + * Unless explicitly required don't offload serial csum calculate and bio
> + * submit into a workqueue.
> + */
> + return false;
> #endif
>
> /* Submit synchronously if the checksum implementation is fast. */
> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> index 00883aea55d7..277f2ac090d9 100644
> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -60,6 +60,8 @@ struct btrfs_bio {
> struct {
> struct btrfs_ordered_extent *ordered;
> struct btrfs_ordered_sum *sums;
> + struct work_struct csum_work;
> + struct completion csum_done;
> u64 orig_physical;
> };
>
> @@ -84,6 +86,9 @@ struct btrfs_bio {
>
> /* Use the commit root to look up csums (data read bio only). */
> bool csum_search_commit_root;
> +
> + bool async_csum;
> +
> /*
> * This member must come last, bio_alloc_bioset will allocate enough
> * bytes for entire btrfs_bio but relies on bio being last.
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index a42e6d54e7cd..bedfcf4a088d 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -18,6 +18,7 @@
> #include "fs.h"
> #include "accessors.h"
> #include "file-item.h"
> +#include "volumes.h"
>
> #define __MAX_CSUM_ITEMS(r, size) ((unsigned long)(((BTRFS_LEAF_DATA_SIZE(r) - \
> sizeof(struct btrfs_item) * 2) / \
> @@ -764,21 +765,46 @@ int btrfs_lookup_csums_bitmap(struct btrfs_root *root, struct btrfs_path *path,
> return ret;
> }
>
> -/*
> - * Calculate checksums of the data contained inside a bio.
> - */
> -int btrfs_csum_one_bio(struct btrfs_bio *bbio)
> +static void csum_one_bio(struct btrfs_bio *bbio)
> {
> - struct btrfs_ordered_extent *ordered = bbio->ordered;
> struct btrfs_inode *inode = bbio->inode;
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
> struct bio *bio = &bbio->bio;
> - struct btrfs_ordered_sum *sums;
> + struct btrfs_ordered_sum *sums = bbio->sums;
> struct bvec_iter iter = bio->bi_iter;
> phys_addr_t paddr;
> const u32 blocksize = fs_info->sectorsize;
> - int index;
> + int index = 0;
> +
> + shash->tfm = fs_info->csum_shash;
> +
> + btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
> + btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
> + index += fs_info->csum_size;
> + }
> +}
> +
> +static void csum_one_bio_work(struct work_struct *work)
> +{
> + struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, csum_work);
> +
> + ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
> + ASSERT(bbio->async_csum == true);
> + csum_one_bio(bbio);
> + complete(&bbio->csum_done);
> +}
> +
> +/*
> + * Calculate checksums of the data contained inside a bio.
> + */
> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async)
> +{
> + struct btrfs_ordered_extent *ordered = bbio->ordered;
> + struct btrfs_inode *inode = bbio->inode;
> + struct btrfs_fs_info *fs_info = inode->root->fs_info;
> + struct bio *bio = &bbio->bio;
> + struct btrfs_ordered_sum *sums;
> unsigned nofs_flag;
>
> nofs_flag = memalloc_nofs_save();
> @@ -789,21 +815,20 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio)
> if (!sums)
> return -ENOMEM;
>
> + sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> sums->len = bio->bi_iter.bi_size;
> INIT_LIST_HEAD(&sums->list);
> -
> - sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> - index = 0;
> -
> - shash->tfm = fs_info->csum_shash;
> -
> - btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
> - btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
> - index += fs_info->csum_size;
> - }
> -
> bbio->sums = sums;
> btrfs_add_ordered_sum(ordered, sums);
> +
> + if (!async) {
> + csum_one_bio(bbio);
> + return 0;
> + }
> + init_completion(&bbio->csum_done);
> + bbio->async_csum = true;
> + INIT_WORK(&bbio->csum_work, csum_one_bio_work);
> + schedule_work(&bbio->csum_work);
> return 0;
> }
>
> diff --git a/fs/btrfs/file-item.h b/fs/btrfs/file-item.h
> index 63216c43676d..2a250cf8b2a1 100644
> --- a/fs/btrfs/file-item.h
> +++ b/fs/btrfs/file-item.h
> @@ -64,7 +64,7 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
> int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
> struct btrfs_root *root,
> struct btrfs_ordered_sum *sums);
> -int btrfs_csum_one_bio(struct btrfs_bio *bbio);
> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async);
> int btrfs_alloc_dummy_sum(struct btrfs_bio *bbio);
> int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
> struct list_head *list, int search_commit,
> --
> 2.51.0
>
>
---end quoted text---
next prev parent reply other threads:[~2025-10-24 10:58 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-24 10:49 [PATCH 0/4] btrfs: introduce async_csum feature Qu Wenruo
2025-10-24 10:49 ` [PATCH 1/4] btrfs: make sure all btrfs_bio::end_io is called in task context Qu Wenruo
2025-10-24 10:49 ` [PATCH 2/4] btrfs: remove btrfs_fs_info::compressed_write_workers Qu Wenruo
2025-10-24 10:49 ` [PATCH 3/4] btrfs: relax btrfs_inode::ordered_tree_lock Qu Wenruo
2025-10-24 10:49 ` [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
2025-10-24 10:58 ` Christoph Hellwig [this message]
2025-10-24 22:15 ` Qu Wenruo
2025-10-24 22:51 ` Eric Biggers
2025-10-24 23:13 ` Qu Wenruo
2025-10-24 14:51 ` Boris Burkov
2025-10-24 21:40 ` Qu Wenruo
2025-10-28 7:19 ` kernel test robot
2025-10-28 8:06 ` Qu Wenruo
2025-10-28 10:15 ` Qu Wenruo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aPtbzMCLwhuLuo4d@infradead.org \
--to=hch@infradead.org \
--cc=ebiggers@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=wqu@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.