public inbox for linux-btrfs@vger.kernel.org
From: Christoph Hellwig <hch@infradead.org>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org, Eric Biggers <ebiggers@kernel.org>
Subject: Re: [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
Date: Fri, 24 Oct 2025 03:58:20 -0700	[thread overview]
Message-ID: <aPtbzMCLwhuLuo4d@infradead.org> (raw)
In-Reply-To: <44a1532190aee561c2a8ae7af9f84fc1e092ae9e.1761302592.git.wqu@suse.com>

This seems to be a lot of overhead at least for simple checksums
like crc32c or xxhash.

Did you also look into Eric's

  [PATCH 10/10] btrfs: switch to library APIs for checksums

to reduce the crypto API overhead, which is probably the worst
overhead for checksums right now?

On Fri, Oct 24, 2025 at 09:19:35PM +1030, Qu Wenruo wrote:
> [ENHANCEMENT]
> Btrfs currently calculates its data checksums and then submits the bio.
> 
> But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write
> if the inode requires checksum"), any write with data checksums will
> fall back to buffered IO, meaning the content will not change during
> writeback.
> 
> This means we are safe to calculate the data checksum and submit the bio
> in parallel; we only need to make sure btrfs_bio::end_io() is called
> after the checksum calculation is done.
> 
> As usual, the new feature is hidden behind the experimental flag.
> 
> [THEORETICAL ANALYSIS]
> Consider the following theoretical hardware performance figures, which
> should be more or less representative of modern mainstream hardware:
> 
> 	Memory bandwidth:	50GiB/s
> 	CRC32C bandwidth:	45GiB/s
> 	SSD bandwidth:		8GiB/s
> 
> Then the btrfs write bandwidth with data checksums before the patch would be:
> 
> 	1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s
> 
> After the patch, the bandwidth would be:
> 
> 	1 / ( 1 / 50 + max(1 / 45, 1 / 8)) = 6.90 GiB/s
> 
> That is roughly a 15.3 % improvement.
> 
> [REAL WORLD BENCHMARK]
> The host is a Zen 5 (HX 370) machine; the VM has 4GiB of memory and 10
> vCPUs, with storage backed by a PCIe gen3 x4 NVMe SSD.
> 
> The test is a direct IO write with a 1MiB block size, writing 7GiB of
> data into a btrfs mount with data checksums enabled, so the direct
> write falls back to a buffered one:
> 
> Vanilla Datasum:	1619.97 MiB/s
> Patched Datasum:	1792.26 MiB/s
> Diff:			+10.6 %
> 
> In my case the bottleneck is the storage, so the improvement does not
> reach the theoretical one, but it is still clearly observable.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/bio.c       | 17 ++++++++----
>  fs/btrfs/bio.h       |  5 ++++
>  fs/btrfs/file-item.c | 61 +++++++++++++++++++++++++++++++-------------
>  fs/btrfs/file-item.h |  2 +-
>  4 files changed, 61 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 18272ef4b4d8..a5b83c6c9e7f 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -103,6 +103,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
>  	/* Make sure we're already in task context. */
>  	ASSERT(in_task());
>  
> +	if (bbio->async_csum)
> +		wait_for_completion(&bbio->csum_done);
> +
>  	bbio->bio.bi_status = status;
>  	if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
>  		struct btrfs_bio *orig_bbio = bbio->private;
> @@ -535,7 +538,7 @@ static int btrfs_bio_csum(struct btrfs_bio *bbio)
>  {
>  	if (bbio->bio.bi_opf & REQ_META)
>  		return btree_csum_one_bio(bbio);
> -	return btrfs_csum_one_bio(bbio);
> +	return btrfs_csum_one_bio(bbio, true);
>  }
>  
>  /*
> @@ -613,10 +616,14 @@ static bool should_async_write(struct btrfs_bio *bbio)
>  	struct btrfs_fs_devices *fs_devices = bbio->fs_info->fs_devices;
>  	enum btrfs_offload_csum_mode csum_mode = READ_ONCE(fs_devices->offload_csum_mode);
>  
> -	if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_OFF)
> -		return false;
> -
> -	auto_csum_mode = (csum_mode == BTRFS_OFFLOAD_CSUM_AUTO);
> +	if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_ON)
> +		return true;
> +	/*
> +	 * Write bios calculate their checksum and submit the bio at the
> +	 * same time, so unless explicitly requested don't offload the
> +	 * serial csum calculation and bio submission to a workqueue.
> +	 */
> +	return false;
>  #endif
>  
>  	/* Submit synchronously if the checksum implementation is fast. */
> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> index 00883aea55d7..277f2ac090d9 100644
> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -60,6 +60,8 @@ struct btrfs_bio {
>  		struct {
>  			struct btrfs_ordered_extent *ordered;
>  			struct btrfs_ordered_sum *sums;
> +			struct work_struct csum_work;
> +			struct completion csum_done;
>  			u64 orig_physical;
>  		};
>  
> @@ -84,6 +86,9 @@ struct btrfs_bio {
>  
>  	/* Use the commit root to look up csums (data read bio only). */
>  	bool csum_search_commit_root;
> +
> +	bool async_csum;
> +
>  	/*
>  	 * This member must come last, bio_alloc_bioset will allocate enough
>  	 * bytes for entire btrfs_bio but relies on bio being last.
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index a42e6d54e7cd..bedfcf4a088d 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -18,6 +18,7 @@
>  #include "fs.h"
>  #include "accessors.h"
>  #include "file-item.h"
> +#include "volumes.h"
>  
>  #define __MAX_CSUM_ITEMS(r, size) ((unsigned long)(((BTRFS_LEAF_DATA_SIZE(r) - \
>  				   sizeof(struct btrfs_item) * 2) / \
> @@ -764,21 +765,46 @@ int btrfs_lookup_csums_bitmap(struct btrfs_root *root, struct btrfs_path *path,
>  	return ret;
>  }
>  
> -/*
> - * Calculate checksums of the data contained inside a bio.
> - */
> -int btrfs_csum_one_bio(struct btrfs_bio *bbio)
> +static void csum_one_bio(struct btrfs_bio *bbio)
>  {
> -	struct btrfs_ordered_extent *ordered = bbio->ordered;
>  	struct btrfs_inode *inode = bbio->inode;
>  	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>  	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
>  	struct bio *bio = &bbio->bio;
> -	struct btrfs_ordered_sum *sums;
> +	struct btrfs_ordered_sum *sums = bbio->sums;
>  	struct bvec_iter iter = bio->bi_iter;
>  	phys_addr_t paddr;
>  	const u32 blocksize = fs_info->sectorsize;
> -	int index;
> +	int index = 0;
> +
> +	shash->tfm = fs_info->csum_shash;
> +
> +	btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
> +		btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
> +		index += fs_info->csum_size;
> +	}
> +}
> +
> +static void csum_one_bio_work(struct work_struct *work)
> +{
> +	struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, csum_work);
> +
> +	ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
> +	ASSERT(bbio->async_csum == true);
> +	csum_one_bio(bbio);
> +	complete(&bbio->csum_done);
> +}
> +
> +/*
> + * Calculate checksums of the data contained inside a bio.
> + */
> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async)
> +{
> +	struct btrfs_ordered_extent *ordered = bbio->ordered;
> +	struct btrfs_inode *inode = bbio->inode;
> +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +	struct bio *bio = &bbio->bio;
> +	struct btrfs_ordered_sum *sums;
>  	unsigned nofs_flag;
>  
>  	nofs_flag = memalloc_nofs_save();
> @@ -789,21 +815,20 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio)
>  	if (!sums)
>  		return -ENOMEM;
>  
> +	sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
>  	sums->len = bio->bi_iter.bi_size;
>  	INIT_LIST_HEAD(&sums->list);
> -
> -	sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> -	index = 0;
> -
> -	shash->tfm = fs_info->csum_shash;
> -
> -	btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
> -		btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
> -		index += fs_info->csum_size;
> -	}
> -
>  	bbio->sums = sums;
>  	btrfs_add_ordered_sum(ordered, sums);
> +
> +	if (!async) {
> +		csum_one_bio(bbio);
> +		return 0;
> +	}
> +	init_completion(&bbio->csum_done);
> +	bbio->async_csum = true;
> +	INIT_WORK(&bbio->csum_work, csum_one_bio_work);
> +	schedule_work(&bbio->csum_work);
>  	return 0;
>  }
>  
> diff --git a/fs/btrfs/file-item.h b/fs/btrfs/file-item.h
> index 63216c43676d..2a250cf8b2a1 100644
> --- a/fs/btrfs/file-item.h
> +++ b/fs/btrfs/file-item.h
> @@ -64,7 +64,7 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
>  int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
>  			   struct btrfs_root *root,
>  			   struct btrfs_ordered_sum *sums);
> -int btrfs_csum_one_bio(struct btrfs_bio *bbio);
> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async);
>  int btrfs_alloc_dummy_sum(struct btrfs_bio *bbio);
>  int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
>  			     struct list_head *list, int search_commit,
> -- 
> 2.51.0
> 
> 
---end quoted text---


Thread overview: 14+ messages
2025-10-24 10:49 [PATCH 0/4] btrfs: introduce async_csum feature Qu Wenruo
2025-10-24 10:49 ` [PATCH 1/4] btrfs: make sure all btrfs_bio::end_io is called in task context Qu Wenruo
2025-10-24 10:49 ` [PATCH 2/4] btrfs: remove btrfs_fs_info::compressed_write_workers Qu Wenruo
2025-10-24 10:49 ` [PATCH 3/4] btrfs: relax btrfs_inode::ordered_tree_lock Qu Wenruo
2025-10-24 10:49 ` [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
2025-10-24 10:58   ` Christoph Hellwig [this message]
2025-10-24 22:15     ` Qu Wenruo
2025-10-24 22:51       ` Eric Biggers
2025-10-24 23:13         ` Qu Wenruo
2025-10-24 14:51   ` Boris Burkov
2025-10-24 21:40     ` Qu Wenruo
2025-10-28  7:19   ` kernel test robot
2025-10-28  8:06     ` Qu Wenruo
2025-10-28 10:15       ` Qu Wenruo
