* [PATCH 1/4] btrfs: make sure all btrfs_bio::end_io is called in task context
2025-10-24 10:49 [PATCH 0/4] btrfs: introduce async_csum feature Qu Wenruo
@ 2025-10-24 10:49 ` Qu Wenruo
2025-10-24 10:49 ` [PATCH 2/4] btrfs: remove btrfs_fs_info::compressed_write_workers Qu Wenruo
` (2 subsequent siblings)
3 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2025-10-24 10:49 UTC (permalink / raw)
To: linux-btrfs
[BACKGROUND]
Btrfs has a lot of different bi_end_io functions, to handle different
raid profiles. But they introduced a lot of different context for
btrfs_bio::end_io() calls:
- Simple read bios
Run in task context, backed by either endio_meta_workers or
endio_workers.
- Simple write bios
Run in irq context.
- RAID56 write or rebuild bios
Run in task context, backed by rmw_workers.
- Mirrored write bios
Run in irq context.
This is very inconsistent, and contributes to the huge amount of
workqueues used in btrfs.
[ENHANCEMENT]
Make all above bios to call their btrfs_bio::end_io() in task context,
backed by either endio_meta_workers for metadata, or endio_workers for
data.
For simple write bios, merge the handling into simple_end_io_work().
For mirrored write bios, it will be a little more complex, since both
the original and the cloned bios can run the final btrfs_bio::end_io().
Here we make sure the cloned bios are using btrfs_bioset, to reuse the
end_io_work, and run both original and cloned work inside the workqueue.
And add extra ASSERT()s to make sure btrfs_bio_end_io() is running in
task context.
This not only unifies the context for btrfs_bio::end_io() functions, but
also opens a new door for further btrfs_bio::end_io() related cleanups.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/bio.c | 62 +++++++++++++++++++++++++++++++++++---------------
1 file changed, 44 insertions(+), 18 deletions(-)
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 21df48e6c4fa..18272ef4b4d8 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -100,6 +100,9 @@ static struct btrfs_bio *btrfs_split_bio(struct btrfs_fs_info *fs_info,
void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
{
+ /* Make sure we're already in task context. */
+ ASSERT(in_task());
+
bbio->bio.bi_status = status;
if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
struct btrfs_bio *orig_bbio = bbio->private;
@@ -317,15 +320,20 @@ static struct workqueue_struct *btrfs_end_io_wq(const struct btrfs_fs_info *fs_i
return fs_info->endio_workers;
}
-static void btrfs_end_bio_work(struct work_struct *work)
+static void simple_end_io_work(struct work_struct *work)
{
struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work);
+ struct bio *bio = &bbio->bio;
- /* Metadata reads are checked and repaired by the submitter. */
- if (is_data_bbio(bbio))
- btrfs_check_read_bio(bbio, bbio->bio.bi_private);
- else
- btrfs_bio_end_io(bbio, bbio->bio.bi_status);
+ if (bio_op(bio) == REQ_OP_READ) {
+ /* Metadata reads are checked and repaired by the submitter. */
+ if (is_data_bbio(bbio))
+ return btrfs_check_read_bio(bbio, bbio->bio.bi_private);
+ return btrfs_bio_end_io(bbio, bbio->bio.bi_status);
+ }
+ if (bio_is_zone_append(bio) && !bio->bi_status)
+ btrfs_record_physical_zoned(bbio);
+ btrfs_bio_end_io(bbio, bbio->bio.bi_status);
}
static void btrfs_simple_end_io(struct bio *bio)
@@ -339,14 +347,8 @@ static void btrfs_simple_end_io(struct bio *bio)
if (bio->bi_status)
btrfs_log_dev_io_error(bio, dev);
- if (bio_op(bio) == REQ_OP_READ) {
- INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
- queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->end_io_work);
- } else {
- if (bio_is_zone_append(bio) && !bio->bi_status)
- btrfs_record_physical_zoned(bbio);
- btrfs_bio_end_io(bbio, bbio->bio.bi_status);
- }
+ INIT_WORK(&bbio->end_io_work, simple_end_io_work);
+ queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->end_io_work);
}
static void btrfs_raid56_end_io(struct bio *bio)
@@ -354,6 +356,9 @@ static void btrfs_raid56_end_io(struct bio *bio)
struct btrfs_io_context *bioc = bio->bi_private;
struct btrfs_bio *bbio = btrfs_bio(bio);
+ /* RAID56 endio is always handled in workqueue. */
+ ASSERT(in_task());
+
btrfs_bio_counter_dec(bioc->fs_info);
bbio->mirror_num = bioc->mirror_num;
if (bio_op(bio) == REQ_OP_READ && is_data_bbio(bbio))
@@ -364,11 +369,12 @@ static void btrfs_raid56_end_io(struct bio *bio)
btrfs_put_bioc(bioc);
}
-static void btrfs_orig_write_end_io(struct bio *bio)
+static void orig_write_end_io_work(struct work_struct *work)
{
+ struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work);
+ struct bio *bio = &bbio->bio;
struct btrfs_io_stripe *stripe = bio->bi_private;
struct btrfs_io_context *bioc = stripe->bioc;
- struct btrfs_bio *bbio = btrfs_bio(bio);
btrfs_bio_counter_dec(bioc->fs_info);
@@ -393,8 +399,18 @@ static void btrfs_orig_write_end_io(struct bio *bio)
btrfs_put_bioc(bioc);
}
-static void btrfs_clone_write_end_io(struct bio *bio)
+static void btrfs_orig_write_end_io(struct bio *bio)
{
+ struct btrfs_bio *bbio = btrfs_bio(bio);
+
+ INIT_WORK(&bbio->end_io_work, orig_write_end_io_work);
+ queue_work(btrfs_end_io_wq(bbio->fs_info, bio), &bbio->end_io_work);
+}
+
+static void clone_write_end_io_work(struct work_struct *work)
+{
+ struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work);
+ struct bio *bio = &bbio->bio;
struct btrfs_io_stripe *stripe = bio->bi_private;
if (bio->bi_status) {
@@ -409,6 +425,14 @@ static void btrfs_clone_write_end_io(struct bio *bio)
bio_put(bio);
}
+static void btrfs_clone_write_end_io(struct bio *bio)
+{
+ struct btrfs_bio *bbio = btrfs_bio(bio);
+
+ INIT_WORK(&bbio->end_io_work, clone_write_end_io_work);
+ queue_work(btrfs_end_io_wq(bbio->fs_info, bio), &bbio->end_io_work);
+}
+
static void btrfs_submit_dev_bio(struct btrfs_device *dev, struct bio *bio)
{
if (!dev || !dev->bdev ||
@@ -463,8 +487,10 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
bio = orig_bio;
bio->bi_end_io = btrfs_orig_write_end_io;
} else {
- bio = bio_alloc_clone(NULL, orig_bio, GFP_NOFS, &fs_bio_set);
+ /* We need to use endio_work to run end_io in task context. */
+ bio = bio_alloc_clone(NULL, orig_bio, GFP_NOFS, &btrfs_bioset);
bio_inc_remaining(orig_bio);
+ btrfs_bio_init(btrfs_bio(bio), bioc->fs_info, NULL, NULL);
bio->bi_end_io = btrfs_clone_write_end_io;
}
--
2.51.0
^ permalink raw reply related [flat|nested] 14+ messages in thread* [PATCH 2/4] btrfs: remove btrfs_fs_info::compressed_write_workers
2025-10-24 10:49 [PATCH 0/4] btrfs: introduce async_csum feature Qu Wenruo
2025-10-24 10:49 ` [PATCH 1/4] btrfs: make sure all btrfs_bio::end_io is called in task context Qu Wenruo
@ 2025-10-24 10:49 ` Qu Wenruo
2025-10-24 10:49 ` [PATCH 3/4] btrfs: relax btrfs_inode::ordered_tree_lock Qu Wenruo
2025-10-24 10:49 ` [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
3 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2025-10-24 10:49 UTC (permalink / raw)
To: linux-btrfs
The reason why end_bbio_compressed_write() queues a work into
compressed_write_workers wq is for end_compressed_writeback() call, as
it will grab all the involved folios and clear the writeback flags,
which may sleep.
However, since we now always run btrfs_bio::end_io() in task context,
there is no need to queue the work anymore.
Just remove btrfs_fs_info::compressed_write_workers and
compressed_bio::write_end_work.
There is a comment about the works queued into
compressed_write_workers; it is now changed to flush the endio wq instead,
which is responsible for handling all data endio functions.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/compression.c | 27 ++++++++-------------------
fs/btrfs/compression.h | 7 ++-----
fs/btrfs/disk-io.c | 9 ++-------
fs/btrfs/fs.h | 1 -
4 files changed, 12 insertions(+), 32 deletions(-)
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index bacad18357b3..2d7992e0145a 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -321,22 +321,6 @@ static noinline void end_compressed_writeback(const struct compressed_bio *cb)
/* the inode may be gone now */
}
-static void btrfs_finish_compressed_write_work(struct work_struct *work)
-{
- struct compressed_bio *cb =
- container_of(work, struct compressed_bio, write_end_work);
-
- btrfs_finish_ordered_extent(cb->bbio.ordered, NULL, cb->start, cb->len,
- cb->bbio.bio.bi_status == BLK_STS_OK);
-
- if (cb->writeback)
- end_compressed_writeback(cb);
- /* Note, our inode could be gone now */
-
- btrfs_free_compressed_folios(cb);
- bio_put(&cb->bbio.bio);
-}
-
/*
* Do the cleanup once all the compressed pages hit the disk. This will clear
* writeback on the file pages and free the compressed pages.
@@ -347,9 +331,15 @@ static void btrfs_finish_compressed_write_work(struct work_struct *work)
static void end_bbio_compressed_write(struct btrfs_bio *bbio)
{
struct compressed_bio *cb = to_compressed_bio(bbio);
- struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
- queue_work(fs_info->compressed_write_workers, &cb->write_end_work);
+ btrfs_finish_ordered_extent(cb->bbio.ordered, NULL, cb->start, cb->len,
+ cb->bbio.bio.bi_status == BLK_STS_OK);
+
+ if (cb->writeback)
+ end_compressed_writeback(cb);
+ /* Note, our inode could be gone now */
+ btrfs_free_compressed_folios(cb);
+ bio_put(&cb->bbio.bio);
}
static void btrfs_add_compressed_bio_folios(struct compressed_bio *cb)
@@ -402,7 +392,6 @@ void btrfs_submit_compressed_write(struct btrfs_ordered_extent *ordered,
cb->compressed_folios = compressed_folios;
cb->compressed_len = ordered->disk_num_bytes;
cb->writeback = writeback;
- INIT_WORK(&cb->write_end_work, btrfs_finish_compressed_write_work);
cb->nr_folios = nr_folios;
cb->bbio.bio.bi_iter.bi_sector = ordered->disk_bytenr >> SECTOR_SHIFT;
cb->bbio.ordered = ordered;
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index eba188a9e3bb..2ebf94794f0d 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -65,11 +65,8 @@ struct compressed_bio {
/* Whether this is a write for writeback. */
bool writeback;
- union {
- /* For reads, this is the bio we are copying the data into */
- struct btrfs_bio *orig_bbio;
- struct work_struct write_end_work;
- };
+ /* For reads, this is the bio we are copying the data into */
+ struct btrfs_bio *orig_bbio;
/* Must be last. */
struct btrfs_bio bbio;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0aa7e5d1b05f..c96051343450 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1773,8 +1773,6 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
destroy_workqueue(fs_info->endio_workers);
if (fs_info->rmw_workers)
destroy_workqueue(fs_info->rmw_workers);
- if (fs_info->compressed_write_workers)
- destroy_workqueue(fs_info->compressed_write_workers);
btrfs_destroy_workqueue(fs_info->endio_write_workers);
btrfs_destroy_workqueue(fs_info->endio_freespace_worker);
btrfs_destroy_workqueue(fs_info->delayed_workers);
@@ -1986,8 +1984,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
fs_info->endio_write_workers =
btrfs_alloc_workqueue(fs_info, "endio-write", flags,
max_active, 2);
- fs_info->compressed_write_workers =
- alloc_workqueue("btrfs-compressed-write", flags, max_active);
fs_info->endio_freespace_worker =
btrfs_alloc_workqueue(fs_info, "freespace-write", flags,
max_active, 0);
@@ -2003,7 +1999,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
if (!(fs_info->workers &&
fs_info->delalloc_workers && fs_info->flush_workers &&
fs_info->endio_workers && fs_info->endio_meta_workers &&
- fs_info->compressed_write_workers &&
fs_info->endio_write_workers &&
fs_info->endio_freespace_worker && fs_info->rmw_workers &&
fs_info->caching_workers && fs_info->fixup_workers &&
@@ -4290,7 +4285,7 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
/*
* When finishing a compressed write bio we schedule a work queue item
- * to finish an ordered extent - btrfs_finish_compressed_write_work()
+ * to finish an ordered extent - end_bbio_compressed_write()
* calls btrfs_finish_ordered_extent() which in turns does a call to
* btrfs_queue_ordered_fn(), and that queues the ordered extent
* completion either in the endio_write_workers work queue or in the
@@ -4298,7 +4293,7 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
* below, so before we flush them we must flush this queue for the
* workers of compressed writes.
*/
- flush_workqueue(fs_info->compressed_write_workers);
+ flush_workqueue(fs_info->endio_workers);
/*
* After we parked the cleaner kthread, ordered extents may have
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index ad389fb1c01a..2e705025a8a5 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -654,7 +654,6 @@ struct btrfs_fs_info {
struct workqueue_struct *endio_workers;
struct workqueue_struct *endio_meta_workers;
struct workqueue_struct *rmw_workers;
- struct workqueue_struct *compressed_write_workers;
struct btrfs_workqueue *endio_write_workers;
struct btrfs_workqueue *endio_freespace_worker;
struct btrfs_workqueue *caching_workers;
--
2.51.0
^ permalink raw reply related [flat|nested] 14+ messages in thread* [PATCH 3/4] btrfs: relax btrfs_inode::ordered_tree_lock
2025-10-24 10:49 [PATCH 0/4] btrfs: introduce async_csum feature Qu Wenruo
2025-10-24 10:49 ` [PATCH 1/4] btrfs: make sure all btrfs_bio::end_io is called in task context Qu Wenruo
2025-10-24 10:49 ` [PATCH 2/4] btrfs: remove btrfs_fs_info::compressed_write_workers Qu Wenruo
@ 2025-10-24 10:49 ` Qu Wenruo
2025-10-24 10:49 ` [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
3 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2025-10-24 10:49 UTC (permalink / raw)
To: linux-btrfs
We used the IRQ version of the spinlock for ordered_tree_lock, as
btrfs_finish_ordered_extent() can be called in end_bbio_data_write(),
which was in IRQ context.
However, since we're moving all the btrfs_bio::end_io() calls into task
context, there is no more need to support IRQ context, thus we can relax
to regular spin_lock()/spin_unlock() for btrfs_inode::ordered_tree_lock.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 5 ++--
fs/btrfs/inode.c | 4 +--
fs/btrfs/ordered-data.c | 57 ++++++++++++++++++-----------------------
fs/btrfs/tree-log.c | 4 +--
4 files changed, 31 insertions(+), 39 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cb680cdeb77d..4bca9d372826 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1728,7 +1728,6 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
if (cur >= i_size) {
struct btrfs_ordered_extent *ordered;
- unsigned long flags;
ordered = btrfs_lookup_first_ordered_range(inode, cur,
folio_end - cur);
@@ -1737,11 +1736,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
* there must be an ordered extent.
*/
ASSERT(ordered != NULL);
- spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+ spin_lock(&inode->ordered_tree_lock);
set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
ordered->truncated_len = min(ordered->truncated_len,
cur - ordered->file_offset);
- spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+ spin_unlock(&inode->ordered_tree_lock);
btrfs_put_ordered_extent(ordered);
btrfs_mark_ordered_io_finished(inode, folio, cur,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 79732756b87f..d2027bf063fb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7589,11 +7589,11 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset,
EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
EXTENT_DEFRAG, &cached_state);
- spin_lock_irq(&inode->ordered_tree_lock);
+ spin_lock(&inode->ordered_tree_lock);
set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
ordered->truncated_len = min(ordered->truncated_len,
cur - ordered->file_offset);
- spin_unlock_irq(&inode->ordered_tree_lock);
+ spin_unlock(&inode->ordered_tree_lock);
/*
* If the ordered extent has finished, we're safe to delete all
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index dfda952dcf7b..a421f7db9eec 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -237,14 +237,14 @@ static void insert_ordered_extent(struct btrfs_ordered_extent *entry)
/* One ref for the tree. */
refcount_inc(&entry->refs);
- spin_lock_irq(&inode->ordered_tree_lock);
+ spin_lock(&inode->ordered_tree_lock);
node = tree_insert(&inode->ordered_tree, entry->file_offset,
&entry->rb_node);
if (unlikely(node))
btrfs_panic(fs_info, -EEXIST,
"inconsistency in ordered tree at offset %llu",
entry->file_offset);
- spin_unlock_irq(&inode->ordered_tree_lock);
+ spin_unlock(&inode->ordered_tree_lock);
spin_lock(&root->ordered_extent_lock);
list_add_tail(&entry->root_extent_list,
@@ -328,9 +328,9 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
{
struct btrfs_inode *inode = entry->inode;
- spin_lock_irq(&inode->ordered_tree_lock);
+ spin_lock(&inode->ordered_tree_lock);
list_add_tail(&sum->list, &entry->list);
- spin_unlock_irq(&inode->ordered_tree_lock);
+ spin_unlock(&inode->ordered_tree_lock);
}
void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered)
@@ -417,15 +417,14 @@ void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
bool uptodate)
{
struct btrfs_inode *inode = ordered->inode;
- unsigned long flags;
bool ret;
trace_btrfs_finish_ordered_extent(inode, file_offset, len, uptodate);
- spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+ spin_lock(&inode->ordered_tree_lock);
ret = can_finish_ordered_extent(ordered, folio, file_offset, len,
uptodate);
- spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+ spin_unlock(&inode->ordered_tree_lock);
/*
* If this is a COW write it means we created new extent maps for the
@@ -481,13 +480,12 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
{
struct rb_node *node;
struct btrfs_ordered_extent *entry = NULL;
- unsigned long flags;
u64 cur = file_offset;
const u64 end = file_offset + num_bytes;
trace_btrfs_writepage_end_io_hook(inode, file_offset, end - 1, uptodate);
- spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+ spin_lock(&inode->ordered_tree_lock);
while (cur < end) {
u64 entry_end;
u64 this_end;
@@ -539,13 +537,13 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
ASSERT(len < U32_MAX);
if (can_finish_ordered_extent(entry, folio, cur, len, uptodate)) {
- spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+ spin_unlock(&inode->ordered_tree_lock);
btrfs_queue_ordered_fn(entry);
- spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+ spin_lock(&inode->ordered_tree_lock);
}
cur += len;
}
- spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+ spin_unlock(&inode->ordered_tree_lock);
}
/*
@@ -571,10 +569,9 @@ bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
{
struct rb_node *node;
struct btrfs_ordered_extent *entry = NULL;
- unsigned long flags;
bool finished = false;
- spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+ spin_lock(&inode->ordered_tree_lock);
if (cached && *cached) {
entry = *cached;
goto have_entry;
@@ -611,7 +608,7 @@ bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
refcount_inc(&entry->refs);
trace_btrfs_ordered_extent_dec_test_pending(inode, entry);
}
- spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+ spin_unlock(&inode->ordered_tree_lock);
return finished;
}
@@ -676,7 +673,7 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
percpu_counter_add_batch(&fs_info->ordered_bytes, -entry->num_bytes,
fs_info->delalloc_batch);
- spin_lock_irq(&btrfs_inode->ordered_tree_lock);
+ spin_lock(&btrfs_inode->ordered_tree_lock);
node = &entry->rb_node;
rb_erase(node, &btrfs_inode->ordered_tree);
RB_CLEAR_NODE(node);
@@ -684,7 +681,7 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
btrfs_inode->ordered_tree_last = NULL;
set_bit(BTRFS_ORDERED_COMPLETE, &entry->flags);
pending = test_and_clear_bit(BTRFS_ORDERED_PENDING, &entry->flags);
- spin_unlock_irq(&btrfs_inode->ordered_tree_lock);
+ spin_unlock(&btrfs_inode->ordered_tree_lock);
/*
* The current running transaction is waiting on us, we need to let it
@@ -969,9 +966,8 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *ino
{
struct rb_node *node;
struct btrfs_ordered_extent *entry = NULL;
- unsigned long flags;
- spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+ spin_lock(&inode->ordered_tree_lock);
node = ordered_tree_search(inode, file_offset);
if (!node)
goto out;
@@ -984,7 +980,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *ino
trace_btrfs_ordered_extent_lookup(inode, entry);
}
out:
- spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+ spin_unlock(&inode->ordered_tree_lock);
return entry;
}
@@ -997,7 +993,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_range(
struct rb_node *node;
struct btrfs_ordered_extent *entry = NULL;
- spin_lock_irq(&inode->ordered_tree_lock);
+ spin_lock(&inode->ordered_tree_lock);
node = ordered_tree_search(inode, file_offset);
if (!node) {
node = ordered_tree_search(inode, file_offset + len);
@@ -1024,7 +1020,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_range(
refcount_inc(&entry->refs);
trace_btrfs_ordered_extent_lookup_range(inode, entry);
}
- spin_unlock_irq(&inode->ordered_tree_lock);
+ spin_unlock(&inode->ordered_tree_lock);
return entry;
}
@@ -1039,7 +1035,7 @@ void btrfs_get_ordered_extents_for_logging(struct btrfs_inode *inode,
btrfs_assert_inode_locked(inode);
- spin_lock_irq(&inode->ordered_tree_lock);
+ spin_lock(&inode->ordered_tree_lock);
for (n = rb_first(&inode->ordered_tree); n; n = rb_next(n)) {
struct btrfs_ordered_extent *ordered;
@@ -1053,7 +1049,7 @@ void btrfs_get_ordered_extents_for_logging(struct btrfs_inode *inode,
refcount_inc(&ordered->refs);
trace_btrfs_ordered_extent_lookup_for_logging(inode, ordered);
}
- spin_unlock_irq(&inode->ordered_tree_lock);
+ spin_unlock(&inode->ordered_tree_lock);
}
/*
@@ -1066,7 +1062,7 @@ btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset)
struct rb_node *node;
struct btrfs_ordered_extent *entry = NULL;
- spin_lock_irq(&inode->ordered_tree_lock);
+ spin_lock(&inode->ordered_tree_lock);
node = ordered_tree_search(inode, file_offset);
if (!node)
goto out;
@@ -1075,7 +1071,7 @@ btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset)
refcount_inc(&entry->refs);
trace_btrfs_ordered_extent_lookup_first(inode, entry);
out:
- spin_unlock_irq(&inode->ordered_tree_lock);
+ spin_unlock(&inode->ordered_tree_lock);
return entry;
}
@@ -1096,9 +1092,8 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
struct rb_node *prev;
struct rb_node *next;
struct btrfs_ordered_extent *entry = NULL;
- unsigned long flags;
- spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+ spin_lock(&inode->ordered_tree_lock);
node = inode->ordered_tree.rb_node;
/*
* Here we don't want to use tree_search() which will use tree->last
@@ -1153,7 +1148,7 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
trace_btrfs_ordered_extent_lookup_first_range(inode, entry);
}
- spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+ spin_unlock(&inode->ordered_tree_lock);
return entry;
}
@@ -1285,9 +1280,7 @@ struct btrfs_ordered_extent *btrfs_split_ordered_extent(
/*
* Take the root's ordered_extent_lock to avoid a race with
* btrfs_wait_ordered_extents() when updating the disk_bytenr and
- * disk_num_bytes fields of the ordered extent below. And we disable
- * IRQs because the inode's ordered_tree_lock is used in IRQ context
- * elsewhere.
+ * disk_num_bytes fields of the ordered extent below.
*
* There's no concern about a previous caller of
* btrfs_wait_ordered_extents() getting the trimmed ordered extent
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 65079eb651da..2513b19eb472 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5409,12 +5409,12 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
set_bit(BTRFS_ORDERED_LOGGED, &ordered->flags);
if (!test_bit(BTRFS_ORDERED_COMPLETE, &ordered->flags)) {
- spin_lock_irq(&inode->ordered_tree_lock);
+ spin_lock(&inode->ordered_tree_lock);
if (!test_bit(BTRFS_ORDERED_COMPLETE, &ordered->flags)) {
set_bit(BTRFS_ORDERED_PENDING, &ordered->flags);
atomic_inc(&trans->transaction->pending_ordered);
}
- spin_unlock_irq(&inode->ordered_tree_lock);
+ spin_unlock(&inode->ordered_tree_lock);
}
btrfs_put_ordered_extent(ordered);
}
--
2.51.0
^ permalink raw reply related [flat|nested] 14+ messages in thread* [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
2025-10-24 10:49 [PATCH 0/4] btrfs: introduce async_csum feature Qu Wenruo
` (2 preceding siblings ...)
2025-10-24 10:49 ` [PATCH 3/4] btrfs: relax btrfs_inode::ordered_tree_lock Qu Wenruo
@ 2025-10-24 10:49 ` Qu Wenruo
2025-10-24 10:58 ` Christoph Hellwig
` (2 more replies)
3 siblings, 3 replies; 14+ messages in thread
From: Qu Wenruo @ 2025-10-24 10:49 UTC (permalink / raw)
To: linux-btrfs
[ENHANCEMENT]
Btrfs currently calculates its data checksum and then submits the bio.
But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write
if the inode requires checksum"), any writes with data checksum will
fall back to buffered IO, meaning the content will not change during
writeback.
This means we're safe to calculate the data checksum and submit the bio
in parallel, we only need to make sure btrfs_bio::end_io() is called
after the checksum calculation is done.
As usual, such new feature is hidden behind the experimental flag.
[THEORETICAL ANALYSIS]
Consider the following theoretical hardware performance, which should be
more or less close to modern mainstream hardware:
Memory bandwidth: 50GiB/s
CRC32C bandwidth: 45GiB/s
SSD bandwidth: 8GiB/s
Then btrfs write bandwidth with data checksum before the patch would be
1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s
After the patch, the bandwidth would be:
1 / ( 1 / 50 + max( 1 / 45, 1 / 8)) = 6.90 GiB/s
The difference would be 15.32 % improvement.
[REAL WORLD BENCHMARK]
I'm using a Zen5 (HX 370) as the host, the VM has 4GiB memory, 10 vCPUs, the
storage is backed by a PCIE gen3 x4 NVME SSD.
The test is a direct IO write, with 1MiB block size, writing 7GiB of data
into a btrfs mount with data checksum enabled. Thus the direct write will
fall back to a buffered one:
Vanilla Datasum: 1619.97 MiB/s
Patched Datasum: 1792.26 MiB/s
Diff +10.6 %
In my case, the bottleneck is the storage, thus the improvement does not
reach the theoretical one, but there is still an observable improvement.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/bio.c | 17 ++++++++----
fs/btrfs/bio.h | 5 ++++
fs/btrfs/file-item.c | 61 +++++++++++++++++++++++++++++++-------------
fs/btrfs/file-item.h | 2 +-
4 files changed, 61 insertions(+), 24 deletions(-)
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 18272ef4b4d8..a5b83c6c9e7f 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -103,6 +103,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
/* Make sure we're already in task context. */
ASSERT(in_task());
+ if (bbio->async_csum)
+ wait_for_completion(&bbio->csum_done);
+
bbio->bio.bi_status = status;
if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
struct btrfs_bio *orig_bbio = bbio->private;
@@ -535,7 +538,7 @@ static int btrfs_bio_csum(struct btrfs_bio *bbio)
{
if (bbio->bio.bi_opf & REQ_META)
return btree_csum_one_bio(bbio);
- return btrfs_csum_one_bio(bbio);
+ return btrfs_csum_one_bio(bbio, true);
}
/*
@@ -613,10 +616,14 @@ static bool should_async_write(struct btrfs_bio *bbio)
struct btrfs_fs_devices *fs_devices = bbio->fs_info->fs_devices;
enum btrfs_offload_csum_mode csum_mode = READ_ONCE(fs_devices->offload_csum_mode);
- if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_OFF)
- return false;
-
- auto_csum_mode = (csum_mode == BTRFS_OFFLOAD_CSUM_AUTO);
+ if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_ON)
+ return true;
+ /*
+ * Write bios will calculate checksum and submit bio at the same time.
+ * Unless explicitly required don't offload serial csum calculate and bio
+ * submit into a workqueue.
+ */
+ return false;
#endif
/* Submit synchronously if the checksum implementation is fast. */
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index 00883aea55d7..277f2ac090d9 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -60,6 +60,8 @@ struct btrfs_bio {
struct {
struct btrfs_ordered_extent *ordered;
struct btrfs_ordered_sum *sums;
+ struct work_struct csum_work;
+ struct completion csum_done;
u64 orig_physical;
};
@@ -84,6 +86,9 @@ struct btrfs_bio {
/* Use the commit root to look up csums (data read bio only). */
bool csum_search_commit_root;
+
+ bool async_csum;
+
/*
* This member must come last, bio_alloc_bioset will allocate enough
* bytes for entire btrfs_bio but relies on bio being last.
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index a42e6d54e7cd..bedfcf4a088d 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -18,6 +18,7 @@
#include "fs.h"
#include "accessors.h"
#include "file-item.h"
+#include "volumes.h"
#define __MAX_CSUM_ITEMS(r, size) ((unsigned long)(((BTRFS_LEAF_DATA_SIZE(r) - \
sizeof(struct btrfs_item) * 2) / \
@@ -764,21 +765,46 @@ int btrfs_lookup_csums_bitmap(struct btrfs_root *root, struct btrfs_path *path,
return ret;
}
-/*
- * Calculate checksums of the data contained inside a bio.
- */
-int btrfs_csum_one_bio(struct btrfs_bio *bbio)
+static void csum_one_bio(struct btrfs_bio *bbio)
{
- struct btrfs_ordered_extent *ordered = bbio->ordered;
struct btrfs_inode *inode = bbio->inode;
struct btrfs_fs_info *fs_info = inode->root->fs_info;
SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
struct bio *bio = &bbio->bio;
- struct btrfs_ordered_sum *sums;
+ struct btrfs_ordered_sum *sums = bbio->sums;
struct bvec_iter iter = bio->bi_iter;
phys_addr_t paddr;
const u32 blocksize = fs_info->sectorsize;
- int index;
+ int index = 0;
+
+ shash->tfm = fs_info->csum_shash;
+
+ btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
+ btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
+ index += fs_info->csum_size;
+ }
+}
+
+static void csum_one_bio_work(struct work_struct *work)
+{
+ struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, csum_work);
+
+ ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
+ ASSERT(bbio->async_csum == true);
+ csum_one_bio(bbio);
+ complete(&bbio->csum_done);
+}
+
+/*
+ * Calculate checksums of the data contained inside a bio.
+ */
+int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async)
+{
+ struct btrfs_ordered_extent *ordered = bbio->ordered;
+ struct btrfs_inode *inode = bbio->inode;
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ struct bio *bio = &bbio->bio;
+ struct btrfs_ordered_sum *sums;
unsigned nofs_flag;
nofs_flag = memalloc_nofs_save();
@@ -789,21 +815,20 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio)
if (!sums)
return -ENOMEM;
+ sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
sums->len = bio->bi_iter.bi_size;
INIT_LIST_HEAD(&sums->list);
-
- sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
- index = 0;
-
- shash->tfm = fs_info->csum_shash;
-
- btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
- btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
- index += fs_info->csum_size;
- }
-
bbio->sums = sums;
btrfs_add_ordered_sum(ordered, sums);
+
+ if (!async) {
+ csum_one_bio(bbio);
+ return 0;
+ }
+ init_completion(&bbio->csum_done);
+ bbio->async_csum = true;
+ INIT_WORK(&bbio->csum_work, csum_one_bio_work);
+ schedule_work(&bbio->csum_work);
return 0;
}
diff --git a/fs/btrfs/file-item.h b/fs/btrfs/file-item.h
index 63216c43676d..2a250cf8b2a1 100644
--- a/fs/btrfs/file-item.h
+++ b/fs/btrfs/file-item.h
@@ -64,7 +64,7 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct btrfs_ordered_sum *sums);
-int btrfs_csum_one_bio(struct btrfs_bio *bbio);
+int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async);
int btrfs_alloc_dummy_sum(struct btrfs_bio *bbio);
int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
struct list_head *list, int search_commit,
--
2.51.0
^ permalink raw reply related [flat|nested] 14+ messages in thread* Re: [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
2025-10-24 10:49 ` [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
@ 2025-10-24 10:58 ` Christoph Hellwig
2025-10-24 22:15 ` Qu Wenruo
2025-10-24 14:51 ` Boris Burkov
2025-10-28 7:19 ` kernel test robot
2 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-10-24 10:58 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, Eric Biggers
This seems to be a lot of overhead at least for simple checksums
like crc32c or xxhash.
Did you also look into Eric's
[PATCH 10/10] btrfs: switch to library APIs for checksums to reduce
the crypto API overhead, which is probably the worst overhead for
checksums right now?
On Fri, Oct 24, 2025 at 09:19:35PM +1030, Qu Wenruo wrote:
> [ENHANCEMENT]
> Btrfs currently calculate its data checksum then submit the bio.
>
> But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write
> if the inode requires checksum"), any writes with data checksum will
> fallback to buffered IO, meaning the content will not change during
> writeback.
>
> This means we're safe to calculate the data checksum and submit the bio
> in parallel, we only need to make sure btrfs_bio::end_io() is called
> after the checksum calculation is done.
>
> As usual, such new feature is hidden behind the experimental flag.
>
> [THEORETIC ANALYZE]
> Consider the following theoretic hardware performance, which should be
> more or less close to modern mainstream hardware:
>
> Memory bandwidth: 50GiB/s
> CRC32C bandwidth: 45GiB/s
> SSD bandwidth: 8GiB/s
>
> Then btrfs write bandwidth with data checksum before the patch would be
>
> 1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s
>
> After the patch, the bandwidth would be:
>
> 1 / ( 1 / 50 + max( 1 / 45, 1 / 8)) = 6.90 GiB/s
>
> The difference would be 15.32 % improvement.
>
> [REAL WORLD BENCHMARK]
> I'm using a Zen5 (HX 370) as the host, the VM has 4GiB memory, 10 vCPUs, the
> storage is backed by a PCIE gen3 x4 NVME SSD.
>
> The test is a direct IO write, with 1MiB block size, write 7GiB data
> into a btrfs mount with data checksum. Thus the direct write will fallback
> to buffered one:
>
> Vanilla Datasum: 1619.97 MiB/s
> Patched Datasum: 1792.26 MiB/s
> Diff +10.6 %
>
> In my case, the bottleneck is the storage, thus the improvement is not
> reaching the theoretic one, but still some observable improvement.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/bio.c | 17 ++++++++----
> fs/btrfs/bio.h | 5 ++++
> fs/btrfs/file-item.c | 61 +++++++++++++++++++++++++++++++-------------
> fs/btrfs/file-item.h | 2 +-
> 4 files changed, 61 insertions(+), 24 deletions(-)
>
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 18272ef4b4d8..a5b83c6c9e7f 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -103,6 +103,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
> /* Make sure we're already in task context. */
> ASSERT(in_task());
>
> + if (bbio->async_csum)
> + wait_for_completion(&bbio->csum_done);
> +
> bbio->bio.bi_status = status;
> if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
> struct btrfs_bio *orig_bbio = bbio->private;
> @@ -535,7 +538,7 @@ static int btrfs_bio_csum(struct btrfs_bio *bbio)
> {
> if (bbio->bio.bi_opf & REQ_META)
> return btree_csum_one_bio(bbio);
> - return btrfs_csum_one_bio(bbio);
> + return btrfs_csum_one_bio(bbio, true);
> }
>
> /*
> @@ -613,10 +616,14 @@ static bool should_async_write(struct btrfs_bio *bbio)
> struct btrfs_fs_devices *fs_devices = bbio->fs_info->fs_devices;
> enum btrfs_offload_csum_mode csum_mode = READ_ONCE(fs_devices->offload_csum_mode);
>
> - if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_OFF)
> - return false;
> -
> - auto_csum_mode = (csum_mode == BTRFS_OFFLOAD_CSUM_AUTO);
> + if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_ON)
> + return true;
> + /*
> + * Write bios will calculate checksum and submit bio at the same time.
> + * Unless explicitly required don't offload serial csum calculate and bio
> + * submit into a workqueue.
> + */
> + return false;
> #endif
>
> /* Submit synchronously if the checksum implementation is fast. */
> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> index 00883aea55d7..277f2ac090d9 100644
> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -60,6 +60,8 @@ struct btrfs_bio {
> struct {
> struct btrfs_ordered_extent *ordered;
> struct btrfs_ordered_sum *sums;
> + struct work_struct csum_work;
> + struct completion csum_done;
> u64 orig_physical;
> };
>
> @@ -84,6 +86,9 @@ struct btrfs_bio {
>
> /* Use the commit root to look up csums (data read bio only). */
> bool csum_search_commit_root;
> +
> + bool async_csum;
> +
> /*
> * This member must come last, bio_alloc_bioset will allocate enough
> * bytes for entire btrfs_bio but relies on bio being last.
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index a42e6d54e7cd..bedfcf4a088d 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -18,6 +18,7 @@
> #include "fs.h"
> #include "accessors.h"
> #include "file-item.h"
> +#include "volumes.h"
>
> #define __MAX_CSUM_ITEMS(r, size) ((unsigned long)(((BTRFS_LEAF_DATA_SIZE(r) - \
> sizeof(struct btrfs_item) * 2) / \
> @@ -764,21 +765,46 @@ int btrfs_lookup_csums_bitmap(struct btrfs_root *root, struct btrfs_path *path,
> return ret;
> }
>
> -/*
> - * Calculate checksums of the data contained inside a bio.
> - */
> -int btrfs_csum_one_bio(struct btrfs_bio *bbio)
> +static void csum_one_bio(struct btrfs_bio *bbio)
> {
> - struct btrfs_ordered_extent *ordered = bbio->ordered;
> struct btrfs_inode *inode = bbio->inode;
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
> struct bio *bio = &bbio->bio;
> - struct btrfs_ordered_sum *sums;
> + struct btrfs_ordered_sum *sums = bbio->sums;
> struct bvec_iter iter = bio->bi_iter;
> phys_addr_t paddr;
> const u32 blocksize = fs_info->sectorsize;
> - int index;
> + int index = 0;
> +
> + shash->tfm = fs_info->csum_shash;
> +
> + btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
> + btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
> + index += fs_info->csum_size;
> + }
> +}
> +
> +static void csum_one_bio_work(struct work_struct *work)
> +{
> + struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, csum_work);
> +
> + ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
> + ASSERT(bbio->async_csum == true);
> + csum_one_bio(bbio);
> + complete(&bbio->csum_done);
> +}
> +
> +/*
> + * Calculate checksums of the data contained inside a bio.
> + */
> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async)
> +{
> + struct btrfs_ordered_extent *ordered = bbio->ordered;
> + struct btrfs_inode *inode = bbio->inode;
> + struct btrfs_fs_info *fs_info = inode->root->fs_info;
> + struct bio *bio = &bbio->bio;
> + struct btrfs_ordered_sum *sums;
> unsigned nofs_flag;
>
> nofs_flag = memalloc_nofs_save();
> @@ -789,21 +815,20 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio)
> if (!sums)
> return -ENOMEM;
>
> + sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> sums->len = bio->bi_iter.bi_size;
> INIT_LIST_HEAD(&sums->list);
> -
> - sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> - index = 0;
> -
> - shash->tfm = fs_info->csum_shash;
> -
> - btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
> - btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
> - index += fs_info->csum_size;
> - }
> -
> bbio->sums = sums;
> btrfs_add_ordered_sum(ordered, sums);
> +
> + if (!async) {
> + csum_one_bio(bbio);
> + return 0;
> + }
> + init_completion(&bbio->csum_done);
> + bbio->async_csum = true;
> + INIT_WORK(&bbio->csum_work, csum_one_bio_work);
> + schedule_work(&bbio->csum_work);
> return 0;
> }
>
> diff --git a/fs/btrfs/file-item.h b/fs/btrfs/file-item.h
> index 63216c43676d..2a250cf8b2a1 100644
> --- a/fs/btrfs/file-item.h
> +++ b/fs/btrfs/file-item.h
> @@ -64,7 +64,7 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
> int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
> struct btrfs_root *root,
> struct btrfs_ordered_sum *sums);
> -int btrfs_csum_one_bio(struct btrfs_bio *bbio);
> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async);
> int btrfs_alloc_dummy_sum(struct btrfs_bio *bbio);
> int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
> struct list_head *list, int search_commit,
> --
> 2.51.0
>
>
---end quoted text---
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
2025-10-24 10:58 ` Christoph Hellwig
@ 2025-10-24 22:15 ` Qu Wenruo
2025-10-24 22:51 ` Eric Biggers
0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2025-10-24 22:15 UTC (permalink / raw)
To: Christoph Hellwig, Qu Wenruo; +Cc: linux-btrfs, Eric Biggers
在 2025/10/24 21:28, Christoph Hellwig 写道:
> This seems to be a lot of overhead at least for simple checksums
> like crc32c or xxhash.
>
> Did you also look into Eric's
>
> [PATCH 10/10] btrfs: switch to library APIs for checksums to reduce
> the crypto API overhead, which is probably the worst overhead for
> checksums right now?
That doesn't seem to be the biggest overhead to me, at least for modern
hardware accelerated CRC32C.
Although it will definitely help for much slower hashes.
The main bottleneck here is the storage; at least mainstream level
storage is still less than 10GiB/s, meanwhile dual channel DDR5
bandwidth is at around 50 GiB/s, not to mention 80GiB/s if one goes
LPDDR5 or quad channel ones.
This means even if the overhead is reduced, it's not eliminating the
latency to calculate the checksum.
So let's use the following theoretical example to see what the
difference would be:
- 50GiB/s DRAM bandwidth
- 45GiB/s CRC32C bandwidth (original)
- 50GiB/s CRC32C bandwidth (improved by the new API)
- 8GiB/s storage bandwidth
Original write bandwidth with data checksum:
1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s
Improved crypt API:
1 / ( 1/ 50 + 1 / 50 + 1 / 8) = 6.06 GiB/s
Async csum (no matter which crypt API):
1 / ( 1 / 50 + max (1 / 45, 1/ 8)) = 6.90 GiB/s
Even if the new API can cause black magic to make CRC32C to be faster
than DRAM bandwidth, it will not remove the latency.
And not to mention there are much slower checksums, some will be much
slower than the storage.
If I go with SHA256 which is super slow, around 1GiB/s, then the new API
will shine much more:
Vanilla:
1 / ( 1 / 50 + 1 / 1 + 1 / 8) = 0.87 GiB/s
Improved crypt API (+10% throughput for csum)
1 / ( 1 / 50 + 1 / 1.1 + 1 / 8) = 0.95 GiB/s
Async csum (old api):
1 / ( 1 / 50 + max( 1 / 1, 1/8)) = 0.98 GiB/s
Async csum (new api):
1 / ( 1 / 50 + max( 1 / 1.1, 1/8)) = 1.08 GiB/s
Although in that case, improved crypt API will definitely help us with
async csum.
So yes, I think that series will help overall, but still not as good as
async checksum calculation.
Thanks,
Qu
>
> On Fri, Oct 24, 2025 at 09:19:35PM +1030, Qu Wenruo wrote:
>> [ENHANCEMENT]
>> Btrfs currently calculate its data checksum then submit the bio.
>>
>> But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write
>> if the inode requires checksum"), any writes with data checksum will
>> fallback to buffered IO, meaning the content will not change during
>> writeback.
>>
>> This means we're safe to calculate the data checksum and submit the bio
>> in parallel, we only need to make sure btrfs_bio::end_io() is called
>> after the checksum calculation is done.
>>
>> As usual, such new feature is hidden behind the experimental flag.
>>
>> [THEORETIC ANALYZE]
>> Consider the following theoretic hardware performance, which should be
>> more or less close to modern mainstream hardware:
>>
>> Memory bandwidth: 50GiB/s
>> CRC32C bandwidth: 45GiB/s
>> SSD bandwidth: 8GiB/s
>>
>> Then btrfs write bandwidth with data checksum before the patch would be
>>
>> 1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s
>>
>> After the patch, the bandwidth would be:
>>
>> 1 / ( 1 / 50 + max( 1 / 45, 1 / 8)) = 6.90 GiB/s
>>
>> The difference would be 15.32 % improvement.
>>
>> [REAL WORLD BENCHMARK]
>> I'm using a Zen5 (HX 370) as the host, the VM has 4GiB memory, 10 vCPUs, the
>> storage is backed by a PCIE gen3 x4 NVME SSD.
>>
>> The test is a direct IO write, with 1MiB block size, write 7GiB data
>> into a btrfs mount with data checksum. Thus the direct write will fallback
>> to buffered one:
>>
>> Vanilla Datasum: 1619.97 MiB/s
>> Patched Datasum: 1792.26 MiB/s
>> Diff +10.6 %
>>
>> In my case, the bottleneck is the storage, thus the improvement is not
>> reaching the theoretic one, but still some observable improvement.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/btrfs/bio.c | 17 ++++++++----
>> fs/btrfs/bio.h | 5 ++++
>> fs/btrfs/file-item.c | 61 +++++++++++++++++++++++++++++++-------------
>> fs/btrfs/file-item.h | 2 +-
>> 4 files changed, 61 insertions(+), 24 deletions(-)
>>
>> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
>> index 18272ef4b4d8..a5b83c6c9e7f 100644
>> --- a/fs/btrfs/bio.c
>> +++ b/fs/btrfs/bio.c
>> @@ -103,6 +103,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
>> /* Make sure we're already in task context. */
>> ASSERT(in_task());
>>
>> + if (bbio->async_csum)
>> + wait_for_completion(&bbio->csum_done);
>> +
>> bbio->bio.bi_status = status;
>> if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
>> struct btrfs_bio *orig_bbio = bbio->private;
>> @@ -535,7 +538,7 @@ static int btrfs_bio_csum(struct btrfs_bio *bbio)
>> {
>> if (bbio->bio.bi_opf & REQ_META)
>> return btree_csum_one_bio(bbio);
>> - return btrfs_csum_one_bio(bbio);
>> + return btrfs_csum_one_bio(bbio, true);
>> }
>>
>> /*
>> @@ -613,10 +616,14 @@ static bool should_async_write(struct btrfs_bio *bbio)
>> struct btrfs_fs_devices *fs_devices = bbio->fs_info->fs_devices;
>> enum btrfs_offload_csum_mode csum_mode = READ_ONCE(fs_devices->offload_csum_mode);
>>
>> - if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_OFF)
>> - return false;
>> -
>> - auto_csum_mode = (csum_mode == BTRFS_OFFLOAD_CSUM_AUTO);
>> + if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_ON)
>> + return true;
>> + /*
>> + * Write bios will calculate checksum and submit bio at the same time.
>> + * Unless explicitly required don't offload serial csum calculate and bio
>> + * submit into a workqueue.
>> + */
>> + return false;
>> #endif
>>
>> /* Submit synchronously if the checksum implementation is fast. */
>> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
>> index 00883aea55d7..277f2ac090d9 100644
>> --- a/fs/btrfs/bio.h
>> +++ b/fs/btrfs/bio.h
>> @@ -60,6 +60,8 @@ struct btrfs_bio {
>> struct {
>> struct btrfs_ordered_extent *ordered;
>> struct btrfs_ordered_sum *sums;
>> + struct work_struct csum_work;
>> + struct completion csum_done;
>> u64 orig_physical;
>> };
>>
>> @@ -84,6 +86,9 @@ struct btrfs_bio {
>>
>> /* Use the commit root to look up csums (data read bio only). */
>> bool csum_search_commit_root;
>> +
>> + bool async_csum;
>> +
>> /*
>> * This member must come last, bio_alloc_bioset will allocate enough
>> * bytes for entire btrfs_bio but relies on bio being last.
>> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
>> index a42e6d54e7cd..bedfcf4a088d 100644
>> --- a/fs/btrfs/file-item.c
>> +++ b/fs/btrfs/file-item.c
>> @@ -18,6 +18,7 @@
>> #include "fs.h"
>> #include "accessors.h"
>> #include "file-item.h"
>> +#include "volumes.h"
>>
>> #define __MAX_CSUM_ITEMS(r, size) ((unsigned long)(((BTRFS_LEAF_DATA_SIZE(r) - \
>> sizeof(struct btrfs_item) * 2) / \
>> @@ -764,21 +765,46 @@ int btrfs_lookup_csums_bitmap(struct btrfs_root *root, struct btrfs_path *path,
>> return ret;
>> }
>>
>> -/*
>> - * Calculate checksums of the data contained inside a bio.
>> - */
>> -int btrfs_csum_one_bio(struct btrfs_bio *bbio)
>> +static void csum_one_bio(struct btrfs_bio *bbio)
>> {
>> - struct btrfs_ordered_extent *ordered = bbio->ordered;
>> struct btrfs_inode *inode = bbio->inode;
>> struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
>> struct bio *bio = &bbio->bio;
>> - struct btrfs_ordered_sum *sums;
>> + struct btrfs_ordered_sum *sums = bbio->sums;
>> struct bvec_iter iter = bio->bi_iter;
>> phys_addr_t paddr;
>> const u32 blocksize = fs_info->sectorsize;
>> - int index;
>> + int index = 0;
>> +
>> + shash->tfm = fs_info->csum_shash;
>> +
>> + btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
>> + btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
>> + index += fs_info->csum_size;
>> + }
>> +}
>> +
>> +static void csum_one_bio_work(struct work_struct *work)
>> +{
>> + struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, csum_work);
>> +
>> + ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
>> + ASSERT(bbio->async_csum == true);
>> + csum_one_bio(bbio);
>> + complete(&bbio->csum_done);
>> +}
>> +
>> +/*
>> + * Calculate checksums of the data contained inside a bio.
>> + */
>> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async)
>> +{
>> + struct btrfs_ordered_extent *ordered = bbio->ordered;
>> + struct btrfs_inode *inode = bbio->inode;
>> + struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> + struct bio *bio = &bbio->bio;
>> + struct btrfs_ordered_sum *sums;
>> unsigned nofs_flag;
>>
>> nofs_flag = memalloc_nofs_save();
>> @@ -789,21 +815,20 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio)
>> if (!sums)
>> return -ENOMEM;
>>
>> + sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
>> sums->len = bio->bi_iter.bi_size;
>> INIT_LIST_HEAD(&sums->list);
>> -
>> - sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
>> - index = 0;
>> -
>> - shash->tfm = fs_info->csum_shash;
>> -
>> - btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
>> - btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
>> - index += fs_info->csum_size;
>> - }
>> -
>> bbio->sums = sums;
>> btrfs_add_ordered_sum(ordered, sums);
>> +
>> + if (!async) {
>> + csum_one_bio(bbio);
>> + return 0;
>> + }
>> + init_completion(&bbio->csum_done);
>> + bbio->async_csum = true;
>> + INIT_WORK(&bbio->csum_work, csum_one_bio_work);
>> + schedule_work(&bbio->csum_work);
>> return 0;
>> }
>>
>> diff --git a/fs/btrfs/file-item.h b/fs/btrfs/file-item.h
>> index 63216c43676d..2a250cf8b2a1 100644
>> --- a/fs/btrfs/file-item.h
>> +++ b/fs/btrfs/file-item.h
>> @@ -64,7 +64,7 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
>> int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
>> struct btrfs_root *root,
>> struct btrfs_ordered_sum *sums);
>> -int btrfs_csum_one_bio(struct btrfs_bio *bbio);
>> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async);
>> int btrfs_alloc_dummy_sum(struct btrfs_bio *bbio);
>> int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
>> struct list_head *list, int search_commit,
>> --
>> 2.51.0
>>
>>
> ---end quoted text---
>
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
2025-10-24 22:15 ` Qu Wenruo
@ 2025-10-24 22:51 ` Eric Biggers
2025-10-24 23:13 ` Qu Wenruo
0 siblings, 1 reply; 14+ messages in thread
From: Eric Biggers @ 2025-10-24 22:51 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs
On Sat, Oct 25, 2025 at 08:45:09AM +1030, Qu Wenruo wrote:
> Even if the new API can cause black magic to make CRC32C to be faster than
> DRAM bandwidth, it will not remove the latency.
For what it's worth, crc32c checksumming is often faster than DRAM
already. Though, being faster than DRAM is still useful when the data
is already in cache.
For example, on AMD Ryzen 9 9950X (Zen 5), I get 89 GB/s crc32c with the
kernel's current crc32c code for data that's already in L1 cache.
However, Zen 5 tripled the number of ALUs that can execute the crc32
instruction, resulting in new code being optimal. I've tested that
130 GB/s crc32c is theoretically possible with Zen 5 optimized code.
The indirect calls and other overhead in the traditional crypto API
makes a notable difference in checksumming throughput when the actual
calculation is this fast...
But absolutely, the real bottleneck for I/O is almost always going to be
elsewhere in the stack.
- Eric
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
2025-10-24 22:51 ` Eric Biggers
@ 2025-10-24 23:13 ` Qu Wenruo
0 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2025-10-24 23:13 UTC (permalink / raw)
To: Eric Biggers; +Cc: Christoph Hellwig, Qu Wenruo, linux-btrfs
在 2025/10/25 09:21, Eric Biggers 写道:
> On Sat, Oct 25, 2025 at 08:45:09AM +1030, Qu Wenruo wrote:
>> Even if the new API can cause black magic to make CRC32C to be faster than
>> DRAM bandwidth, it will not remove the latency.
>
> For what it's worth, crc32c checksumming is often faster than DRAM
> already. Though, being faster than DRAM is still useful when the data
> is already in cache.
>
> For example, on AMD Ryzen 9 9950X (Zen 5), I get 89 GB/s crc32c with the
> kernel's current crc32c code for data that's already in L1 cache.
>
> However, Zen 5 tripled the number of ALUs that can execute the crc32
> instruction, resulting in new code being optimal. I've tested that
> 130 GB/s crc32c is theoretically possible with Zen 5 optimized code.
>
> The indirect calls and other overhead in the traditional crypto API
> makes a notable difference in checksumming throughput when the actual
> calculation is this fast...
Wow, that's really awesome. I guess this change itself is already going
to help btrfs a lot, especially for locations where the csum
verification can not be done in parallel, e.g. verification after read.
But I guess for much slower checksums like SHA256, the API change is not
really going to improve that much, as the real slow part is still the
calculation.
Thanks,
Qu
>
> But absolutely, the real bottleneck for I/O is almost always going to be
> elsewhere in the stack.
>
> - Eric
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
2025-10-24 10:49 ` [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
2025-10-24 10:58 ` Christoph Hellwig
@ 2025-10-24 14:51 ` Boris Burkov
2025-10-24 21:40 ` Qu Wenruo
2025-10-28 7:19 ` kernel test robot
2 siblings, 1 reply; 14+ messages in thread
From: Boris Burkov @ 2025-10-24 14:51 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Fri, Oct 24, 2025 at 09:19:35PM +1030, Qu Wenruo wrote:
> [ENHANCEMENT]
> Btrfs currently calculate its data checksum then submit the bio.
>
> But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write
> if the inode requires checksum"), any writes with data checksum will
> fallback to buffered IO, meaning the content will not change during
> writeback.
>
> This means we're safe to calculate the data checksum and submit the bio
> in parallel, we only need to make sure btrfs_bio::end_io() is called
> after the checksum calculation is done.
>
> As usual, such new feature is hidden behind the experimental flag.
>
> [THEORETIC ANALYZE]
> Consider the following theoretic hardware performance, which should be
> more or less close to modern mainstream hardware:
>
> Memory bandwidth: 50GiB/s
> CRC32C bandwidth: 45GiB/s
> SSD bandwidth: 8GiB/s
>
> Then btrfs write bandwidth with data checksum before the patch would be
>
> 1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s
>
> After the patch, the bandwidth would be:
>
> 1 / ( 1 / 50 + max( 1 / 45, 1 / 8)) = 6.90 GiB/s
>
> The difference would be 15.32 % improvement.
>
> [REAL WORLD BENCHMARK]
> I'm using a Zen5 (HX 370) as the host, the VM has 4GiB memory, 10 vCPUs, the
> storage is backed by a PCIE gen3 x4 NVME SSD.
>
> The test is a direct IO write, with 1MiB block size, write 7GiB data
> into a btrfs mount with data checksum. Thus the direct write will fallback
> to buffered one:
>
> Vanilla Datasum: 1619.97 MiB/s
> Patched Datasum: 1792.26 MiB/s
> Diff +10.6 %
>
> In my case, the bottleneck is the storage, thus the improvement is not
> reaching the theoretic one, but still some observable improvement.
This is really awesome! Makes me think we could also async the read csum
lookup...
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/bio.c | 17 ++++++++----
> fs/btrfs/bio.h | 5 ++++
> fs/btrfs/file-item.c | 61 +++++++++++++++++++++++++++++++-------------
> fs/btrfs/file-item.h | 2 +-
> 4 files changed, 61 insertions(+), 24 deletions(-)
>
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 18272ef4b4d8..a5b83c6c9e7f 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -103,6 +103,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
> /* Make sure we're already in task context. */
> ASSERT(in_task());
>
> + if (bbio->async_csum)
> + wait_for_completion(&bbio->csum_done);
> +
This is a principled way to do it; it ensures the whole endio path has
the csums. However, as far as I know, the only place the sums are
needed is at finish_ordered_io, when we call add_pending_csums() from
btrfs_finish_one_ordered(), which is already able to block.
So I think you could separate the need to change the workers from the
async csums if you delay the async csum wait until add_pending_csum. It
might be a little annoying to have to recover the bbio/wait context from
the ordered_extent with container_of, but I'm sure it can be managed.
One other "benefit" is that it technically delays the wait even further,
till the absolute last second, so it is strictly less waiting, though I
wouldn't expect any real benefit from that in practice.
Thanks,
Boris
> bbio->bio.bi_status = status;
> if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
> struct btrfs_bio *orig_bbio = bbio->private;
> @@ -535,7 +538,7 @@ static int btrfs_bio_csum(struct btrfs_bio *bbio)
> {
> if (bbio->bio.bi_opf & REQ_META)
> return btree_csum_one_bio(bbio);
> - return btrfs_csum_one_bio(bbio);
> + return btrfs_csum_one_bio(bbio, true);
> }
>
> /*
> @@ -613,10 +616,14 @@ static bool should_async_write(struct btrfs_bio *bbio)
> struct btrfs_fs_devices *fs_devices = bbio->fs_info->fs_devices;
> enum btrfs_offload_csum_mode csum_mode = READ_ONCE(fs_devices->offload_csum_mode);
>
> - if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_OFF)
> - return false;
> -
> - auto_csum_mode = (csum_mode == BTRFS_OFFLOAD_CSUM_AUTO);
> + if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_ON)
> + return true;
> + /*
> + * Write bios will calculate checksum and submit bio at the same time.
> + * Unless explicitly required don't offload serial csum calculate and bio
> + * submit into a workqueue.
> + */
> + return false;
> #endif
>
> /* Submit synchronously if the checksum implementation is fast. */
> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> index 00883aea55d7..277f2ac090d9 100644
> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -60,6 +60,8 @@ struct btrfs_bio {
> struct {
> struct btrfs_ordered_extent *ordered;
> struct btrfs_ordered_sum *sums;
> + struct work_struct csum_work;
> + struct completion csum_done;
> u64 orig_physical;
> };
>
> @@ -84,6 +86,9 @@ struct btrfs_bio {
>
> /* Use the commit root to look up csums (data read bio only). */
> bool csum_search_commit_root;
> +
> + bool async_csum;
> +
> /*
> * This member must come last, bio_alloc_bioset will allocate enough
> * bytes for entire btrfs_bio but relies on bio being last.
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index a42e6d54e7cd..bedfcf4a088d 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -18,6 +18,7 @@
> #include "fs.h"
> #include "accessors.h"
> #include "file-item.h"
> +#include "volumes.h"
>
> #define __MAX_CSUM_ITEMS(r, size) ((unsigned long)(((BTRFS_LEAF_DATA_SIZE(r) - \
> sizeof(struct btrfs_item) * 2) / \
> @@ -764,21 +765,46 @@ int btrfs_lookup_csums_bitmap(struct btrfs_root *root, struct btrfs_path *path,
> return ret;
> }
>
> -/*
> - * Calculate checksums of the data contained inside a bio.
> - */
> -int btrfs_csum_one_bio(struct btrfs_bio *bbio)
> +static void csum_one_bio(struct btrfs_bio *bbio)
> {
> - struct btrfs_ordered_extent *ordered = bbio->ordered;
> struct btrfs_inode *inode = bbio->inode;
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
> struct bio *bio = &bbio->bio;
> - struct btrfs_ordered_sum *sums;
> + struct btrfs_ordered_sum *sums = bbio->sums;
> struct bvec_iter iter = bio->bi_iter;
> phys_addr_t paddr;
> const u32 blocksize = fs_info->sectorsize;
> - int index;
> + int index = 0;
> +
> + shash->tfm = fs_info->csum_shash;
> +
> + btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
> + btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
> + index += fs_info->csum_size;
> + }
> +}
> +
> +static void csum_one_bio_work(struct work_struct *work)
> +{
> + struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, csum_work);
> +
> + ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
> + ASSERT(bbio->async_csum == true);
> + csum_one_bio(bbio);
> + complete(&bbio->csum_done);
> +}
> +
> +/*
> + * Calculate checksums of the data contained inside a bio.
> + */
> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async)
> +{
> + struct btrfs_ordered_extent *ordered = bbio->ordered;
> + struct btrfs_inode *inode = bbio->inode;
> + struct btrfs_fs_info *fs_info = inode->root->fs_info;
> + struct bio *bio = &bbio->bio;
> + struct btrfs_ordered_sum *sums;
> unsigned nofs_flag;
>
> nofs_flag = memalloc_nofs_save();
> @@ -789,21 +815,20 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio)
> if (!sums)
> return -ENOMEM;
>
> + sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> sums->len = bio->bi_iter.bi_size;
> INIT_LIST_HEAD(&sums->list);
> -
> - sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> - index = 0;
> -
> - shash->tfm = fs_info->csum_shash;
> -
> - btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
> - btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
> - index += fs_info->csum_size;
> - }
> -
> bbio->sums = sums;
> btrfs_add_ordered_sum(ordered, sums);
> +
> + if (!async) {
> + csum_one_bio(bbio);
> + return 0;
> + }
> + init_completion(&bbio->csum_done);
> + bbio->async_csum = true;
> + INIT_WORK(&bbio->csum_work, csum_one_bio_work);
> + schedule_work(&bbio->csum_work);
> return 0;
> }
>
> diff --git a/fs/btrfs/file-item.h b/fs/btrfs/file-item.h
> index 63216c43676d..2a250cf8b2a1 100644
> --- a/fs/btrfs/file-item.h
> +++ b/fs/btrfs/file-item.h
> @@ -64,7 +64,7 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
> int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
> struct btrfs_root *root,
> struct btrfs_ordered_sum *sums);
> -int btrfs_csum_one_bio(struct btrfs_bio *bbio);
> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async);
> int btrfs_alloc_dummy_sum(struct btrfs_bio *bbio);
> int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
> struct list_head *list, int search_commit,
> --
> 2.51.0
>
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
2025-10-24 14:51 ` Boris Burkov
@ 2025-10-24 21:40 ` Qu Wenruo
0 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2025-10-24 21:40 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs
在 2025/10/25 01:21, Boris Burkov 写道:
> On Fri, Oct 24, 2025 at 09:19:35PM +1030, Qu Wenruo wrote:
>> [ENHANCEMENT]
>> Btrfs currently calculate its data checksum then submit the bio.
>>
>> But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write
>> if the inode requires checksum"), any writes with data checksum will
>> fallback to buffered IO, meaning the content will not change during
>> writeback.
>>
>> This means we're safe to calculate the data checksum and submit the bio
>> in parallel, we only need to make sure btrfs_bio::end_io() is called
>> after the checksum calculation is done.
>>
>> As usual, such new feature is hidden behind the experimental flag.
>>
>> [THEORETICAL ANALYSIS]
>> Consider the following theoretic hardware performance, which should be
>> more or less close to modern mainstream hardware:
>>
>> Memory bandwidth: 50GiB/s
>> CRC32C bandwidth: 45GiB/s
>> SSD bandwidth: 8GiB/s
>>
>> Then btrfs write bandwidth with data checksum before the patch would be
>>
>> 1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s
>>
>> After the patch, the bandwidth would be:
>>
>> 1 / ( 1 / 50 + max(1 / 45, 1 / 8)) = 6.90 GiB/s
>>
>> The difference would be 15.32 % improvement.
>>
>> [REAL WORLD BENCHMARK]
>> I'm using a Zen5 (HX 370) as the host, the VM has 4GiB memory, 10 vCPUs, the
>> storage is backed by a PCIE gen3 x4 NVME SSD.
>>
>> The test is a direct IO write, with 1MiB block size, write 7GiB data
>> into a btrfs mount with data checksum. Thus the direct write will fallback
>> to buffered one:
>>
>> Vanilla Datasum: 1619.97 MiB/s
>> Patched Datasum: 1792.26 MiB/s
>> Diff +10.6 %
>>
>> In my case, the bottleneck is the storage, thus the improvement is not
>> reaching the theoretic one, but still some observable improvement.
>
> This is really awesome! Makes me think we could also async the read csum
> lookup...
Yes, lookup can be delayed, but I'm afraid that's the limit, and the
improvement may not be that huge?
>
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/btrfs/bio.c | 17 ++++++++----
>> fs/btrfs/bio.h | 5 ++++
>> fs/btrfs/file-item.c | 61 +++++++++++++++++++++++++++++++-------------
>> fs/btrfs/file-item.h | 2 +-
>> 4 files changed, 61 insertions(+), 24 deletions(-)
>>
>> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
>> index 18272ef4b4d8..a5b83c6c9e7f 100644
>> --- a/fs/btrfs/bio.c
>> +++ b/fs/btrfs/bio.c
>> @@ -103,6 +103,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
>> /* Make sure we're already in task context. */
>> ASSERT(in_task());
>>
>> + if (bbio->async_csum)
>> + wait_for_completion(&bbio->csum_done);
>> +
>
> This is a principled way to do it; ensure the whole endio path has the
> csums. However, as far as I know, the only place that the sums are only
> needed at finish_ordered_io when we call add_pending_csums() from
> btrfs_finish_one_ordered(), which is already able to block.
The problem is crossing layers.
btrfs_csum_bio() is called for each split bio range, thus I'd prefer
to do the wait for csum at the same layer, inside the btrfs bio layer, and
not expose it to higher layers.
>
> So I think you could separate the need to change the workers from the
> async csums if you delay the async csum wait until add_pending_csum. It
> might be a little annoying to have to recover the bbio/wait context from
> the ordered_extent with container_of, but I'm sure it can be managed.
I think it's possible to delay the wait to the whole bbio (the original
range without split).
Using an extra atomic to track the amount of pending async csums would
handle it.
But I'll still need to check if this will make it much harder to read.
I have already spent too much time debugging various bugs related to
where the wait should be placed, the existing code style and lack of
comments are definitely not helping.
>
> One other "benefit" is that it technically delays the wait even further,
> till the absolute last second, so it is strictly less waiting, though I
> wouldn't expect any real benefit from that in practice.
Yep, that's true, and also why I still prefer to keep it contained
inside bio layer.
Thanks,
Qu
>
> Thanks,
> Boris
>
>> bbio->bio.bi_status = status;
>> if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
>> struct btrfs_bio *orig_bbio = bbio->private;
>> @@ -535,7 +538,7 @@ static int btrfs_bio_csum(struct btrfs_bio *bbio)
>> {
>> if (bbio->bio.bi_opf & REQ_META)
>> return btree_csum_one_bio(bbio);
>> - return btrfs_csum_one_bio(bbio);
>> + return btrfs_csum_one_bio(bbio, true);
>> }
>>
>> /*
>> @@ -613,10 +616,14 @@ static bool should_async_write(struct btrfs_bio *bbio)
>> struct btrfs_fs_devices *fs_devices = bbio->fs_info->fs_devices;
>> enum btrfs_offload_csum_mode csum_mode = READ_ONCE(fs_devices->offload_csum_mode);
>>
>> - if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_OFF)
>> - return false;
>> -
>> - auto_csum_mode = (csum_mode == BTRFS_OFFLOAD_CSUM_AUTO);
>> + if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_ON)
>> + return true;
>> + /*
>> + * Write bios will calculate checksum and submit bio at the same time.
>> + * Unless explicitly required don't offload serial csum calculate and bio
>> + * submit into a workqueue.
>> + */
>> + return false;
>> #endif
>>
>> /* Submit synchronously if the checksum implementation is fast. */
>> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
>> index 00883aea55d7..277f2ac090d9 100644
>> --- a/fs/btrfs/bio.h
>> +++ b/fs/btrfs/bio.h
>> @@ -60,6 +60,8 @@ struct btrfs_bio {
>> struct {
>> struct btrfs_ordered_extent *ordered;
>> struct btrfs_ordered_sum *sums;
>> + struct work_struct csum_work;
>> + struct completion csum_done;
>> u64 orig_physical;
>> };
>>
>> @@ -84,6 +86,9 @@ struct btrfs_bio {
>>
>> /* Use the commit root to look up csums (data read bio only). */
>> bool csum_search_commit_root;
>> +
>> + bool async_csum;
>> +
>> /*
>> * This member must come last, bio_alloc_bioset will allocate enough
>> * bytes for entire btrfs_bio but relies on bio being last.
>> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
>> index a42e6d54e7cd..bedfcf4a088d 100644
>> --- a/fs/btrfs/file-item.c
>> +++ b/fs/btrfs/file-item.c
>> @@ -18,6 +18,7 @@
>> #include "fs.h"
>> #include "accessors.h"
>> #include "file-item.h"
>> +#include "volumes.h"
>>
>> #define __MAX_CSUM_ITEMS(r, size) ((unsigned long)(((BTRFS_LEAF_DATA_SIZE(r) - \
>> sizeof(struct btrfs_item) * 2) / \
>> @@ -764,21 +765,46 @@ int btrfs_lookup_csums_bitmap(struct btrfs_root *root, struct btrfs_path *path,
>> return ret;
>> }
>>
>> -/*
>> - * Calculate checksums of the data contained inside a bio.
>> - */
>> -int btrfs_csum_one_bio(struct btrfs_bio *bbio)
>> +static void csum_one_bio(struct btrfs_bio *bbio)
>> {
>> - struct btrfs_ordered_extent *ordered = bbio->ordered;
>> struct btrfs_inode *inode = bbio->inode;
>> struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
>> struct bio *bio = &bbio->bio;
>> - struct btrfs_ordered_sum *sums;
>> + struct btrfs_ordered_sum *sums = bbio->sums;
>> struct bvec_iter iter = bio->bi_iter;
>> phys_addr_t paddr;
>> const u32 blocksize = fs_info->sectorsize;
>> - int index;
>> + int index = 0;
>> +
>> + shash->tfm = fs_info->csum_shash;
>> +
>> + btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
>> + btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
>> + index += fs_info->csum_size;
>> + }
>> +}
>> +
>> +static void csum_one_bio_work(struct work_struct *work)
>> +{
>> + struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, csum_work);
>> +
>> + ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
>> + ASSERT(bbio->async_csum == true);
>> + csum_one_bio(bbio);
>> + complete(&bbio->csum_done);
>> +}
>> +
>> +/*
>> + * Calculate checksums of the data contained inside a bio.
>> + */
>> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async)
>> +{
>> + struct btrfs_ordered_extent *ordered = bbio->ordered;
>> + struct btrfs_inode *inode = bbio->inode;
>> + struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> + struct bio *bio = &bbio->bio;
>> + struct btrfs_ordered_sum *sums;
>> unsigned nofs_flag;
>>
>> nofs_flag = memalloc_nofs_save();
>> @@ -789,21 +815,20 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio)
>> if (!sums)
>> return -ENOMEM;
>>
>> + sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
>> sums->len = bio->bi_iter.bi_size;
>> INIT_LIST_HEAD(&sums->list);
>> -
>> - sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
>> - index = 0;
>> -
>> - shash->tfm = fs_info->csum_shash;
>> -
>> - btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
>> - btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
>> - index += fs_info->csum_size;
>> - }
>> -
>> bbio->sums = sums;
>> btrfs_add_ordered_sum(ordered, sums);
>> +
>> + if (!async) {
>> + csum_one_bio(bbio);
>> + return 0;
>> + }
>> + init_completion(&bbio->csum_done);
>> + bbio->async_csum = true;
>> + INIT_WORK(&bbio->csum_work, csum_one_bio_work);
>> + schedule_work(&bbio->csum_work);
>> return 0;
>> }
>>
>> diff --git a/fs/btrfs/file-item.h b/fs/btrfs/file-item.h
>> index 63216c43676d..2a250cf8b2a1 100644
>> --- a/fs/btrfs/file-item.h
>> +++ b/fs/btrfs/file-item.h
>> @@ -64,7 +64,7 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
>> int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
>> struct btrfs_root *root,
>> struct btrfs_ordered_sum *sums);
>> -int btrfs_csum_one_bio(struct btrfs_bio *bbio);
>> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async);
>> int btrfs_alloc_dummy_sum(struct btrfs_bio *bbio);
>> int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
>> struct list_head *list, int search_commit,
>> --
>> 2.51.0
>>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
2025-10-24 10:49 ` [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
2025-10-24 10:58 ` Christoph Hellwig
2025-10-24 14:51 ` Boris Burkov
@ 2025-10-28 7:19 ` kernel test robot
2025-10-28 8:06 ` Qu Wenruo
2 siblings, 1 reply; 14+ messages in thread
From: kernel test robot @ 2025-10-28 7:19 UTC (permalink / raw)
To: Qu Wenruo; +Cc: oe-lkp, lkp, linux-btrfs, oliver.sang
Hello,
kernel test robot noticed "xfstests.btrfs.026.fail" on:
commit: d72352d1c3a3a201dcd3684b05987f281b1d66aa ("[PATCH 4/4] btrfs: introduce btrfs_bio::async_csum")
url: https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-make-sure-all-btrfs_bio-end_io-is-called-in-task-context/20251024-185435
base: https://git.kernel.org/cgit/linux/kernel/git/kdave/linux.git for-next
patch link: https://lore.kernel.org/all/44a1532190aee561c2a8ae7af9f84fc1e092ae9e.1761302592.git.wqu@suse.com/
patch subject: [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
in testcase: xfstests
version: xfstests-x86_64-2cba4b54-1_20251020
with following parameters:
disk: 6HDD
fs: btrfs
test: btrfs-026
config: x86_64-rhel-9.4-func
compiler: gcc-14
test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (Haswell) with 8G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202510281522.d23994ae-lkp@intel.com
2025-10-27 23:06:41 cd /lkp/benchmarks/xfstests
2025-10-27 23:06:42 export TEST_DIR=/fs/sda1
2025-10-27 23:06:42 export TEST_DEV=/dev/sda1
2025-10-27 23:06:42 export FSTYP=btrfs
2025-10-27 23:06:42 export SCRATCH_MNT=/fs/scratch
2025-10-27 23:06:42 mkdir /fs/scratch -p
2025-10-27 23:06:42 export SCRATCH_DEV_POOL="/dev/sda2 /dev/sda3 /dev/sda4 /dev/sda5 /dev/sda6"
2025-10-27 23:06:42 echo btrfs/026
2025-10-27 23:06:42 ./check btrfs/026
FSTYP -- btrfs
PLATFORM -- Linux/x86_64 lkp-hsw-d01 6.18.0-rc1-00265-gd72352d1c3a3 #1 SMP PREEMPT_DYNAMIC Tue Oct 28 06:52:50 CST 2025
MKFS_OPTIONS -- /dev/sda2
MOUNT_OPTIONS -- /dev/sda2 /fs/scratch
btrfs/026 - output mismatch (see /lkp/benchmarks/xfstests/results//btrfs/026.out.bad)
--- tests/btrfs/026.out 2025-10-20 16:48:15.000000000 +0000
+++ /lkp/benchmarks/xfstests/results//btrfs/026.out.bad 2025-10-27 23:06:53.540513519 +0000
@@ -12,4 +12,4 @@
5876dba1217b4c2915cda86f4c67640e SCRATCH_MNT/bar
File digests after remounting the file system:
647d815906324ccdf288c7681f900ec0 SCRATCH_MNT/foo
-5876dba1217b4c2915cda86f4c67640e SCRATCH_MNT/bar
+md5sum: /fs/scratch/bar: Input/output error
...
(Run 'diff -u /lkp/benchmarks/xfstests/tests/btrfs/026.out /lkp/benchmarks/xfstests/results//btrfs/026.out.bad' to see the entire diff)
Ran: btrfs/026
Failures: btrfs/026
Failed 1 of 1 tests
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251028/202510281522.d23994ae-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
2025-10-28 7:19 ` kernel test robot
@ 2025-10-28 8:06 ` Qu Wenruo
2025-10-28 10:15 ` Qu Wenruo
0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2025-10-28 8:06 UTC (permalink / raw)
To: kernel test robot; +Cc: oe-lkp, lkp, linux-btrfs
在 2025/10/28 17:49, kernel test robot 写道:
>
>
> Hello,
>
> kernel test robot noticed "xfstests.btrfs.026.fail" on:
>
> commit: d72352d1c3a3a201dcd3684b05987f281b1d66aa ("[PATCH 4/4] btrfs: introduce btrfs_bio::async_csum")
> url: https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-make-sure-all-btrfs_bio-end_io-is-called-in-task-context/20251024-185435
> base: https://git.kernel.org/cgit/linux/kernel/git/kdave/linux.git for-next
> patch link: https://lore.kernel.org/all/44a1532190aee561c2a8ae7af9f84fc1e092ae9e.1761302592.git.wqu@suse.com/
> patch subject: [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
>
> in testcase: xfstests
> version: xfstests-x86_64-2cba4b54-1_20251020
> with following parameters:
>
> disk: 6HDD
> fs: btrfs
> test: btrfs-026
>
>
>
> config: x86_64-rhel-9.4-func
> compiler: gcc-14
> test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (Haswell) with 8G memory
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202510281522.d23994ae-lkp@intel.com
Unfortunately I'm unable to reproduce the failure here.
100+ runs no reproduce.
Thus I guess it may be some incompatibility between the series and the base
(which is the for-next branch, not the btrfs for-next branch).
Just to rule out the possibility, would you mind re-testing using my branch directly?
https://github.com/adam900710/linux/tree/async_csum
From the assets, the dmesg shows that all data checksum are zeros:
[ 62.192305][ T269] BTRFS warning (device sdb2): csum failed root 5
ino 258 off 275644416 csum 0xa54e4c94 expected csum 0x00000000 mirror 1
[ 62.192397][ T12] BTRFS warning (device sdb2): csum failed root 5
ino 258 off 275775488 csum 0xa54e4c94 expected csum 0x00000000 mirror 1
[ 62.192470][ T5037] BTRFS warning (device sdb2): csum failed root 5
ino 258 off 275906560 csum 0xa54e4c94 expected csum 0x00000000 mirror 1
This means we're running the end_io() functions before the csum is fully
calculated.
If that's the case, it will eventually hits some use-after-free at other
tests, which I hit too many times during development due to incorrect
wait timing.
But I'm only seeing this single report, which is pretty weird.
I hope it's just some bad code base.
Thanks,
Qu
>
> 2025-10-27 23:06:41 cd /lkp/benchmarks/xfstests
> 2025-10-27 23:06:42 export TEST_DIR=/fs/sda1
> 2025-10-27 23:06:42 export TEST_DEV=/dev/sda1
> 2025-10-27 23:06:42 export FSTYP=btrfs
> 2025-10-27 23:06:42 export SCRATCH_MNT=/fs/scratch
> 2025-10-27 23:06:42 mkdir /fs/scratch -p
> 2025-10-27 23:06:42 export SCRATCH_DEV_POOL="/dev/sda2 /dev/sda3 /dev/sda4 /dev/sda5 /dev/sda6"
> 2025-10-27 23:06:42 echo btrfs/026
> 2025-10-27 23:06:42 ./check btrfs/026
> FSTYP -- btrfs
> PLATFORM -- Linux/x86_64 lkp-hsw-d01 6.18.0-rc1-00265-gd72352d1c3a3 #1 SMP PREEMPT_DYNAMIC Tue Oct 28 06:52:50 CST 2025
> MKFS_OPTIONS -- /dev/sda2
> MOUNT_OPTIONS -- /dev/sda2 /fs/scratch
>
> btrfs/026 - output mismatch (see /lkp/benchmarks/xfstests/results//btrfs/026.out.bad)
> --- tests/btrfs/026.out 2025-10-20 16:48:15.000000000 +0000
> +++ /lkp/benchmarks/xfstests/results//btrfs/026.out.bad 2025-10-27 23:06:53.540513519 +0000
> @@ -12,4 +12,4 @@
> 5876dba1217b4c2915cda86f4c67640e SCRATCH_MNT/bar
> File digests after remounting the file system:
> 647d815906324ccdf288c7681f900ec0 SCRATCH_MNT/foo
> -5876dba1217b4c2915cda86f4c67640e SCRATCH_MNT/bar
> +md5sum: /fs/scratch/bar: Input/output error
> ...
> (Run 'diff -u /lkp/benchmarks/xfstests/tests/btrfs/026.out /lkp/benchmarks/xfstests/results//btrfs/026.out.bad' to see the entire diff)
> Ran: btrfs/026
> Failures: btrfs/026
> Failed 1 of 1 tests
>
>
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20251028/202510281522.d23994ae-lkp@intel.com
>
>
>
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
2025-10-28 8:06 ` Qu Wenruo
@ 2025-10-28 10:15 ` Qu Wenruo
0 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2025-10-28 10:15 UTC (permalink / raw)
To: kernel test robot; +Cc: oe-lkp, lkp, linux-btrfs
在 2025/10/28 18:36, Qu Wenruo 写道:
>
>
> 在 2025/10/28 17:49, kernel test robot 写道:
>>
>>
>> Hello,
>>
>> kernel test robot noticed "xfstests.btrfs.026.fail" on:
>>
>> commit: d72352d1c3a3a201dcd3684b05987f281b1d66aa ("[PATCH 4/4] btrfs:
>> introduce btrfs_bio::async_csum")
>> url: https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-
>> make-sure-all-btrfs_bio-end_io-is-called-in-task-context/20251024-185435
>> base: https://git.kernel.org/cgit/linux/kernel/git/kdave/linux.git
>> for-next
>> patch link: https://lore.kernel.org/
>> all/44a1532190aee561c2a8ae7af9f84fc1e092ae9e.1761302592.git.wqu@suse.com/
>> patch subject: [PATCH 4/4] btrfs: introduce btrfs_bio::async_csum
>>
>> in testcase: xfstests
>> version: xfstests-x86_64-2cba4b54-1_20251020
>> with following parameters:
>>
>> disk: 6HDD
>> fs: btrfs
>> test: btrfs-026
>>
>>
>>
>> config: x86_64-rhel-9.4-func
>> compiler: gcc-14
>> test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4770 CPU @
>> 3.40GHz (Haswell) with 8G memory
>>
>> (please refer to attached dmesg/kmsg for entire log/backtrace)
>>
>>
>>
>> If you fix the issue in a separate patch/commit (i.e. not just a new
>> version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>> | Closes: https://lore.kernel.org/oe-lkp/202510281522.d23994ae-
>> lkp@intel.com
>
> Unfortunately I'm unable to reproduce the failure here.
> 100+ runs no reproduce.
>
> Thus I guess it may be some incompatibility with the series and the base
> (which is for-next branch, not the btrfs for-next branch).
>
> Just to rule out the possibility, mind to re-test using my branch directly?
>
> https://github.com/adam900710/linux/tree/async_csum
>
>
> From the assets, the dmesg shows that all data checksum are zeros:
>
> [ 62.192305][ T269] BTRFS warning (device sdb2): csum failed root 5
> ino 258 off 275644416 csum 0xa54e4c94 expected csum 0x00000000 mirror 1
> [ 62.192397][ T12] BTRFS warning (device sdb2): csum failed root 5
> ino 258 off 275775488 csum 0xa54e4c94 expected csum 0x00000000 mirror 1
> [ 62.192470][ T5037] BTRFS warning (device sdb2): csum failed root 5
> ino 258 off 275906560 csum 0xa54e4c94 expected csum 0x00000000 mirror 1
>
> This means we're running the end_io() functions before the csum is fully
> calculated.
>
> If that's the case, it will eventually hits some use-after-free at other
> tests, which I hit too many times during development due to incorrect
> wait timing.
My bad, and after looking into my local runs, there are some tests that
fail with the same dmesg errors.
It turns out that there is a race on bi_iter, where csum_one_bio_work()
and lower storage layer can try to grab the same bi_iter.
On much newer hardware like zen5, the checksum calculation is way faster
than IO, thus under most cases csum_one_bio_work() can get the bi_iter
before lower layer advancing it.
But if lower layer advanced it before csum_one_bio_work(), then
csum_one_bio_work() will skip the calculation completely resulting the
csums to be all zero.
Thanks for detecting this bug, I'll update the fix to add a new
saved_iter so that csum_one_bio_work() can always grab the correct iter.
Thanks,
Qu
>
> But I'm only seeing this single report, which is pretty weird.
>
> I hope it's just some bad code base.
>
> Thanks,
> Qu
>>
>> 2025-10-27 23:06:41 cd /lkp/benchmarks/xfstests
>> 2025-10-27 23:06:42 export TEST_DIR=/fs/sda1
>> 2025-10-27 23:06:42 export TEST_DEV=/dev/sda1
>> 2025-10-27 23:06:42 export FSTYP=btrfs
>> 2025-10-27 23:06:42 export SCRATCH_MNT=/fs/scratch
>> 2025-10-27 23:06:42 mkdir /fs/scratch -p
>> 2025-10-27 23:06:42 export SCRATCH_DEV_POOL="/dev/sda2 /dev/sda3 /dev/
>> sda4 /dev/sda5 /dev/sda6"
>> 2025-10-27 23:06:42 echo btrfs/026
>> 2025-10-27 23:06:42 ./check btrfs/026
>> FSTYP -- btrfs
>> PLATFORM -- Linux/x86_64 lkp-hsw-d01 6.18.0-rc1-00265-
>> gd72352d1c3a3 #1 SMP PREEMPT_DYNAMIC Tue Oct 28 06:52:50 CST 2025
>> MKFS_OPTIONS -- /dev/sda2
>> MOUNT_OPTIONS -- /dev/sda2 /fs/scratch
>>
>> btrfs/026 - output mismatch (see /lkp/benchmarks/xfstests/
>> results//btrfs/026.out.bad)
>> --- tests/btrfs/026.out 2025-10-20 16:48:15.000000000 +0000
>> +++ /lkp/benchmarks/xfstests/results//btrfs/026.out.bad
>> 2025-10-27 23:06:53.540513519 +0000
>> @@ -12,4 +12,4 @@
>> 5876dba1217b4c2915cda86f4c67640e SCRATCH_MNT/bar
>> File digests after remounting the file system:
>> 647d815906324ccdf288c7681f900ec0 SCRATCH_MNT/foo
>> -5876dba1217b4c2915cda86f4c67640e SCRATCH_MNT/bar
>> +md5sum: /fs/scratch/bar: Input/output error
>> ...
>> (Run 'diff -u /lkp/benchmarks/xfstests/tests/btrfs/026.out /lkp/
>> benchmarks/xfstests/results//btrfs/026.out.bad' to see the entire diff)
>> Ran: btrfs/026
>> Failures: btrfs/026
>> Failed 1 of 1 tests
>>
>>
>>
>>
>> The kernel config and materials to reproduce are available at:
>> https://download.01.org/0day-ci/
>> archive/20251028/202510281522.d23994ae-lkp@intel.com
>>
>>
>>
>
>
^ permalink raw reply [flat|nested] 14+ messages in thread