public inbox for linux-btrfs@vger.kernel.org
* [PATCH v2 0/7] btrfs: introduce btrfs_bio::async_csum
@ 2025-10-29  5:34 Qu Wenruo
  2025-10-29  5:34 ` [PATCH v2 1/7] btrfs: replace BTRFS_MAX_BIO_SECTORS with BIO_MAX_VECS Qu Wenruo
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Qu Wenruo @ 2025-10-29  5:34 UTC (permalink / raw)
  To: linux-btrfs

[CHANGELOG]
v2:
- Fix a race where csum generation could be skipped completely
  If the lower storage layer advanced the bi_iter before we finished
  calculating the checksum, btrfs_csum_one_bio() may skip generating the
  csum completely.

  Unfortunately, solving this requires introducing a new bvec_iter,
  which increases the size of btrfs_bio.

- Remove btrfs_bio::fs_info by always requiring btrfs_bio::inode
  For scrub, use the btree inode instead, and for the only function that
  needs to be called by scrub, btrfs_submit_repair_write(), use a new
  member, btrfs_bio::is_scrub, to check the caller.

- Add a new patch to clean up the recursive include problems
  The idea is to keep the btrfs includes inside .h files to a minimum,
  thus lowering the risk of recursive inclusion.

- Remove the BTRFS_MAX_BIO_SECTORS macro
  It's the same as BIO_MAX_VECS, and is only used once.

The new async_csum feature will allow btrfs to calculate checksums for
data write bios while submitting those bios in parallel.

This will reduce latency and improve write throughput when data
checksums are enabled.

This will win back some of the performance lost after commit
968f19c5b1b7 ("btrfs: always fallback to buffered write if the inode
requires checksum").

Patch 1 is a minor cleanup to remove a single-use macro.
Patch 2 is a preparation to remove btrfs_bio::fs_info; its core is
to avoid recursive includes.
Patch 3 removes btrfs_bio::fs_info, since we will soon increase the
size of btrfs_bio by 16 bytes.

Patches 4~6 are preparations to make sure btrfs_bio::end_io() is called
in task context.

Patch 7 enables async checksum generation for data writes, behind the
experimental flag.

The series will increase the btrfs_bio size by 8 bytes (+16 for the new
structure, -8 for the removal of btrfs_bio::fs_info).

Since the new async_csum should be better than csum offload anyway,
we may want to deprecate the csum offload feature completely in the
future.
Thankfully the csum offload feature is still behind the experimental
flag, so this should not affect end users.

Qu Wenruo (7):
  btrfs: replace BTRFS_MAX_BIO_SECTORS with BIO_MAX_VECS
  btrfs: headers cleanup to remove unnecessary local includes
  btrfs: remove btrfs_bio::fs_info by extracting it from
    btrfs_bio::inode
  btrfs: make sure all btrfs_bio::end_io is called in task context
  btrfs: remove btrfs_fs_info::compressed_write_workers
  btrfs: relax btrfs_inode::ordered_tree_lock
  btrfs: introduce btrfs_bio::async_csum

 fs/btrfs/accessors.h    |   1 +
 fs/btrfs/bio.c          | 136 ++++++++++++++++++++++++++--------------
 fs/btrfs/bio.h          |  32 ++++++----
 fs/btrfs/btrfs_inode.h  |   8 +--
 fs/btrfs/compression.c  |  33 +++-------
 fs/btrfs/compression.h  |  13 ++--
 fs/btrfs/ctree.h        |   2 -
 fs/btrfs/defrag.c       |   1 +
 fs/btrfs/dir-item.c     |   1 +
 fs/btrfs/direct-io.c    |   8 +--
 fs/btrfs/disk-io.c      |  10 +--
 fs/btrfs/disk-io.h      |   3 +-
 fs/btrfs/extent-tree.c  |   1 +
 fs/btrfs/extent_io.c    |  27 ++++----
 fs/btrfs/extent_io.h    |   1 -
 fs/btrfs/extent_map.h   |   3 +-
 fs/btrfs/file-item.c    |  64 +++++++++++++------
 fs/btrfs/file-item.h    |   4 +-
 fs/btrfs/fs.h           |   1 -
 fs/btrfs/inode.c        |  12 ++--
 fs/btrfs/ordered-data.c |  57 ++++++++---------
 fs/btrfs/scrub.c        |  49 ++++++++-------
 fs/btrfs/space-info.c   |   1 +
 fs/btrfs/subpage.h      |   1 -
 fs/btrfs/transaction.c  |   2 +
 fs/btrfs/transaction.h  |   4 --
 fs/btrfs/tree-log.c     |   5 +-
 fs/btrfs/tree-log.h     |   3 +-
 fs/btrfs/zoned.c        |   4 +-
 fs/btrfs/zoned.h        |   1 -
 30 files changed, 264 insertions(+), 224 deletions(-)

-- 
2.51.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 1/7] btrfs: replace BTRFS_MAX_BIO_SECTORS with BIO_MAX_VECS
  2025-10-29  5:34 [PATCH v2 0/7] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
@ 2025-10-29  5:34 ` Qu Wenruo
  2025-10-29  5:34 ` [PATCH v2 2/7] btrfs: headers cleanup to remove unnecessary local includes Qu Wenruo
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2025-10-29  5:34 UTC (permalink / raw)
  To: linux-btrfs

It's impossible for a btrfs bio to have more than BIO_MAX_VECS vectors
anyway.

And since there is only one location using that macro, just replace it
with BIO_MAX_VECS.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/bio.h       | 7 -------
 fs/btrfs/direct-io.c | 2 +-
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index 00883aea55d7..3cc0fe23898f 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -18,13 +18,6 @@ struct btrfs_inode;
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE	64
 
-/*
- * Maximum number of sectors for a single bio to limit the size of the
- * checksum array.  This matches the number of bio_vecs per bio and thus the
- * I/O size for buffered I/O.
- */
-#define BTRFS_MAX_BIO_SECTORS		(256)
-
 typedef void (*btrfs_bio_end_io_t)(struct btrfs_bio *bbio);
 
 /*
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 802d4dbe5b38..db0191567b8d 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -385,7 +385,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
 	 * to allocate a contiguous array for the checksums.
 	 */
 	if (!write)
-		len = min_t(u64, len, fs_info->sectorsize * BTRFS_MAX_BIO_SECTORS);
+		len = min_t(u64, len, fs_info->sectorsize * BIO_MAX_VECS);
 
 	lockstart = start;
 	lockend = start + len - 1;
-- 
2.51.0



* [PATCH v2 2/7] btrfs: headers cleanup to remove unnecessary local includes
  2025-10-29  5:34 [PATCH v2 0/7] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
  2025-10-29  5:34 ` [PATCH v2 1/7] btrfs: replace BTRFS_MAX_BIO_SECTORS with BIO_MAX_VECS Qu Wenruo
@ 2025-10-29  5:34 ` Qu Wenruo
  2025-10-29  5:34 ` [PATCH v2 3/7] btrfs: remove btrfs_bio::fs_info by extracting it from btrfs_bio::inode Qu Wenruo
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2025-10-29  5:34 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
When I tried to remove btrfs_bio::fs_info and use btrfs_bio::inode to
grab the fs_info, the header "btrfs_inode.h" was needed to access the
full btrfs_inode structure.

Then btrfs failed to compile.

[CAUSE]
There is a recursive include chain:

  "bio.h" -> "btrfs_inode.h" -> "extent_map.h" -> "compression.h" ->
  "bio.h"

That recursive inclusion is what causes btrfs to fail to compile.

[ENHANCEMENT]
To reduce the risk of recursive inclusion:

- Remove unnecessary local includes from btrfs headers
  Either the included header is already pulled in by other headers, or
  it is completely unnecessary.

- Remove btrfs local includes if the header only requires a pointer
  In that case, let the implementing C file pull in the required header.

  This is especially important for headers like "btrfs_inode.h", which
  pulls in a lot of other btrfs headers and is thus a minefield of
  recursive inclusion.

- Remove unnecessary forward structure declarations
  Either the header defining the structure is already included, or the
  declaration is completely unused.

Now including "btrfs_inode.h" inside "bio.h" is completely fine.
"btrfs_inode.h" still includes "extent_map.h", but that header only
includes "fs.h", so there is no longer a chain back to "bio.h".

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/accessors.h   | 1 +
 fs/btrfs/btrfs_inode.h | 8 ++++----
 fs/btrfs/compression.h | 3 ---
 fs/btrfs/ctree.h       | 2 --
 fs/btrfs/defrag.c      | 1 +
 fs/btrfs/dir-item.c    | 1 +
 fs/btrfs/direct-io.c   | 2 ++
 fs/btrfs/disk-io.c     | 1 +
 fs/btrfs/disk-io.h     | 3 ++-
 fs/btrfs/extent-tree.c | 1 +
 fs/btrfs/extent_io.h   | 1 -
 fs/btrfs/extent_map.h  | 3 +--
 fs/btrfs/file-item.h   | 2 +-
 fs/btrfs/inode.c       | 1 +
 fs/btrfs/space-info.c  | 1 +
 fs/btrfs/subpage.h     | 1 -
 fs/btrfs/transaction.c | 2 ++
 fs/btrfs/transaction.h | 4 ----
 fs/btrfs/tree-log.c    | 1 +
 fs/btrfs/tree-log.h    | 3 +--
 fs/btrfs/zoned.h       | 1 -
 21 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index 99b3ced12805..78721412951c 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -12,6 +12,7 @@
 #include <linux/string.h>
 #include <linux/mm.h>
 #include <uapi/linux/btrfs_tree.h>
+#include "fs.h"
 #include "extent_io.h"
 
 struct extent_buffer;
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index af373d50a901..a66ca5531b5c 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -18,20 +18,20 @@
 #include <linux/lockdep.h>
 #include <uapi/linux/btrfs_tree.h>
 #include <trace/events/btrfs.h>
+#include "ctree.h"
 #include "block-rsv.h"
 #include "extent_map.h"
-#include "extent_io.h"
 #include "extent-io-tree.h"
-#include "ordered-data.h"
-#include "delayed-inode.h"
 
-struct extent_state;
 struct posix_acl;
 struct iov_iter;
 struct writeback_control;
 struct btrfs_root;
 struct btrfs_fs_info;
 struct btrfs_trans_handle;
+struct btrfs_bio;
+struct btrfs_file_extent;
+struct btrfs_delayed_node;
 
 /*
  * Since we search a directory based on f_pos (struct dir_context::pos) we have
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index eba188a9e3bb..c6812d5fcab7 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -14,14 +14,11 @@
 #include <linux/pagemap.h>
 #include "bio.h"
 #include "fs.h"
-#include "messages.h"
 
 struct address_space;
-struct page;
 struct inode;
 struct btrfs_inode;
 struct btrfs_ordered_extent;
-struct btrfs_bio;
 
 /*
  * We want to make sure that amount of RAM required to uncompress an extent is
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fe70b593c7cd..16dd11c48531 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -17,9 +17,7 @@
 #include <linux/refcount.h>
 #include <uapi/linux/btrfs_tree.h>
 #include "locking.h"
-#include "fs.h"
 #include "accessors.h"
-#include "extent-io-tree.h"
 
 struct extent_buffer;
 struct btrfs_block_rsv;
diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 7b277934f66f..a4cc1bc63562 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -15,6 +15,7 @@
 #include "defrag.h"
 #include "file-item.h"
 #include "super.h"
+#include "compression.h"
 
 static struct kmem_cache *btrfs_inode_defrag_cachep;
 
diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index 69863e398e22..77e1bcb2a74b 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -9,6 +9,7 @@
 #include "transaction.h"
 #include "accessors.h"
 #include "dir-item.h"
+#include "delayed-inode.h"
 
 /*
  * insert a name into a directory, doing overflow properly if there is a hash
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index db0191567b8d..f225cc3fd3a1 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -10,6 +10,8 @@
 #include "fs.h"
 #include "transaction.h"
 #include "volumes.h"
+#include "bio.h"
+#include "ordered-data.h"
 
 struct btrfs_dio_data {
 	ssize_t submitted;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0aa7e5d1b05f..46b715f3447b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -50,6 +50,7 @@
 #include "relocation.h"
 #include "scrub.h"
 #include "super.h"
+#include "delayed-inode.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 57920f2c6fe4..5320da83d0cf 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -9,7 +9,8 @@
 #include <linux/sizes.h>
 #include <linux/compiler_types.h>
 #include "ctree.h"
-#include "fs.h"
+#include "bio.h"
+#include "ordered-data.h"
 
 struct block_device;
 struct super_block;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f981ff72fb98..be8cb3bdc166 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -40,6 +40,7 @@
 #include "orphan.h"
 #include "tree-checker.h"
 #include "raid-stripe-tree.h"
+#include "delayed-inode.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5fcbfe44218c..02ebb2f238af 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -12,7 +12,6 @@
 #include <linux/rwsem.h>
 #include <linux/list.h>
 #include <linux/slab.h>
-#include "compression.h"
 #include "messages.h"
 #include "ulist.h"
 #include "misc.h"
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index d4b81ee4d97b..6f685f3c9327 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -8,8 +8,7 @@
 #include <linux/rbtree.h>
 #include <linux/list.h>
 #include <linux/refcount.h>
-#include "misc.h"
-#include "compression.h"
+#include "fs.h"
 
 struct btrfs_inode;
 struct btrfs_fs_info;
diff --git a/fs/btrfs/file-item.h b/fs/btrfs/file-item.h
index 63216c43676d..0d59e830018a 100644
--- a/fs/btrfs/file-item.h
+++ b/fs/btrfs/file-item.h
@@ -7,7 +7,7 @@
 #include <linux/list.h>
 #include <uapi/linux/btrfs_tree.h>
 #include "ctree.h"
-#include "accessors.h"
+#include "ordered-data.h"
 
 struct extent_map;
 struct btrfs_file_extent_item;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 03e9c3ac20ed..2ca7a34fc73b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -71,6 +71,7 @@
 #include "backref.h"
 #include "raid-stripe-tree.h"
 #include "fiemap.h"
+#include "delayed-inode.h"
 
 #define COW_FILE_RANGE_KEEP_LOCKED	(1UL << 0)
 #define COW_FILE_RANGE_NO_INLINE	(1UL << 1)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index cc8015c8b1ff..63d14b5dfc6c 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -15,6 +15,7 @@
 #include "accessors.h"
 #include "extent-tree.h"
 #include "zoned.h"
+#include "delayed-inode.h"
 
 /*
  * HOW DOES SPACE RESERVATION WORK
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index ad0552db7c7d..d81a0ade559f 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -7,7 +7,6 @@
 #include <linux/atomic.h>
 #include <linux/sizes.h>
 #include "btrfs_inode.h"
-#include "fs.h"
 
 struct address_space;
 struct folio;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 907f2d047b44..03c62fd1a091 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -32,6 +32,8 @@
 #include "ioctl.h"
 #include "relocation.h"
 #include "scrub.h"
+#include "ordered-data.h"
+#include "delayed-inode.h"
 
 static struct kmem_cache *btrfs_trans_handle_cachep;
 
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 9f7c777af635..18ef069197e5 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -14,10 +14,6 @@
 #include <linux/wait.h>
 #include "btrfs_inode.h"
 #include "delayed-ref.h"
-#include "extent-io-tree.h"
-#include "block-rsv.h"
-#include "messages.h"
-#include "misc.h"
 
 struct dentry;
 struct inode;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 8dfd504b37ae..afaf96a76e3f 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -29,6 +29,7 @@
 #include "orphan.h"
 #include "print-tree.h"
 #include "tree-checker.h"
+#include "delayed-inode.h"
 
 #define MAX_CONFLICT_INODES 10
 
diff --git a/fs/btrfs/tree-log.h b/fs/btrfs/tree-log.h
index dc313e6bb2fa..4f149d7d4fde 100644
--- a/fs/btrfs/tree-log.h
+++ b/fs/btrfs/tree-log.h
@@ -8,8 +8,7 @@
 
 #include <linux/list.h>
 #include <linux/fs.h>
-#include "messages.h"
-#include "ctree.h"
+#include <linux/fscrypt.h>
 #include "transaction.h"
 
 struct inode;
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index d64f7c9255fa..5cefdeb08b7b 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -15,7 +15,6 @@
 #include "disk-io.h"
 #include "block-group.h"
 #include "btrfs_inode.h"
-#include "fs.h"
 
 struct block_device;
 struct extent_buffer;
-- 
2.51.0



* [PATCH v2 3/7] btrfs: remove btrfs_bio::fs_info by extracting it from btrfs_bio::inode
  2025-10-29  5:34 [PATCH v2 0/7] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
  2025-10-29  5:34 ` [PATCH v2 1/7] btrfs: replace BTRFS_MAX_BIO_SECTORS with BIO_MAX_VECS Qu Wenruo
  2025-10-29  5:34 ` [PATCH v2 2/7] btrfs: headers cleanup to remove unnecessary local includes Qu Wenruo
@ 2025-10-29  5:34 ` Qu Wenruo
  2025-10-29  5:34 ` [PATCH v2 4/7] btrfs: make sure all btrfs_bio::end_io is called in task context Qu Wenruo
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2025-10-29  5:34 UTC (permalink / raw)
  To: linux-btrfs

Currently there is only one caller which doesn't populate
btrfs_bio::inode, and that's scrub.

The idea is that scrub doesn't want any automatic csum verification or
read-repair, as everything is handled by scrub itself.

However, that behavior is really no different from a metadata inode's,
thus we can reuse the btree inode as btrfs_bio::inode for scrub.

The only exception is in btrfs_submit_chunk(), where we set
rst_search_commit_root to true if a bbio is from scrub or the data
reloc inode.
This means we still need a way to distinguish scrub from metadata, but
that can be done with a new flag inside btrfs_bio.

Now that btrfs_bio::inode is a mandatory parameter, we can extract
fs_info from that inode and thus remove btrfs_bio::fs_info, saving
8 bytes in the btrfs_bio structure.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/bio.c         | 51 ++++++++++++++++++++++--------------------
 fs/btrfs/bio.h         | 18 ++++++++++-----
 fs/btrfs/compression.c |  6 ++---
 fs/btrfs/compression.h |  3 ++-
 fs/btrfs/direct-io.c   |  4 +---
 fs/btrfs/extent_io.c   | 22 ++++++++----------
 fs/btrfs/inode.c       |  7 ++----
 fs/btrfs/scrub.c       | 49 ++++++++++++++++++++++------------------
 fs/btrfs/zoned.c       |  4 ++--
 9 files changed, 84 insertions(+), 80 deletions(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 21df48e6c4fa..6028d8f0c8fc 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -41,13 +41,17 @@ static bool bbio_has_ordered_extent(const struct btrfs_bio *bbio)
  * Initialize a btrfs_bio structure.  This skips the embedded bio itself as it
  * is already initialized by the block layer.
  */
-void btrfs_bio_init(struct btrfs_bio *bbio, struct btrfs_fs_info *fs_info,
+void btrfs_bio_init(struct btrfs_bio *bbio, struct btrfs_inode *inode, u64 file_offset,
 		    btrfs_bio_end_io_t end_io, void *private)
 {
+	/* @inode parameter is mandatory. */
+	ASSERT(inode);
+
 	memset(bbio, 0, offsetof(struct btrfs_bio, bio));
-	bbio->fs_info = fs_info;
+	bbio->inode = inode;
 	bbio->end_io = end_io;
 	bbio->private = private;
+	bbio->file_offset = file_offset;
 	atomic_set(&bbio->pending_ios, 1);
 	WRITE_ONCE(bbio->status, BLK_STS_OK);
 }
@@ -60,7 +64,7 @@ void btrfs_bio_init(struct btrfs_bio *bbio, struct btrfs_fs_info *fs_info,
  * a mempool.
  */
 struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
-				  struct btrfs_fs_info *fs_info,
+				  struct btrfs_inode *inode, u64 file_offset,
 				  btrfs_bio_end_io_t end_io, void *private)
 {
 	struct btrfs_bio *bbio;
@@ -68,7 +72,7 @@ struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
 
 	bio = bio_alloc_bioset(NULL, nr_vecs, opf, GFP_NOFS, &btrfs_bioset);
 	bbio = btrfs_bio(bio);
-	btrfs_bio_init(bbio, fs_info, end_io, private);
+	btrfs_bio_init(bbio, inode, file_offset, end_io, private);
 	return bbio;
 }
 
@@ -85,9 +89,7 @@ static struct btrfs_bio *btrfs_split_bio(struct btrfs_fs_info *fs_info,
 		return ERR_CAST(bio);
 
 	bbio = btrfs_bio(bio);
-	btrfs_bio_init(bbio, fs_info, NULL, orig_bbio);
-	bbio->inode = orig_bbio->inode;
-	bbio->file_offset = orig_bbio->file_offset;
+	btrfs_bio_init(bbio, orig_bbio->inode, orig_bbio->file_offset, NULL, orig_bbio);
 	orig_bbio->file_offset += map_length;
 	if (bbio_has_ordered_extent(bbio)) {
 		refcount_inc(&orig_bbio->ordered->refs);
@@ -244,9 +246,8 @@ static struct btrfs_failed_bio *repair_one_sector(struct btrfs_bio *failed_bbio,
 	bio_add_folio_nofail(repair_bio, folio, sectorsize, foff);
 
 	repair_bbio = btrfs_bio(repair_bio);
-	btrfs_bio_init(repair_bbio, fs_info, NULL, fbio);
-	repair_bbio->inode = failed_bbio->inode;
-	repair_bbio->file_offset = failed_bbio->file_offset + bio_offset;
+	btrfs_bio_init(repair_bbio, failed_bbio->inode, failed_bbio->file_offset + bio_offset,
+		       NULL, fbio);
 
 	mirror = next_repair_mirror(fbio, failed_bbio->mirror_num);
 	btrfs_debug(fs_info, "submitting repair read to mirror %d", mirror);
@@ -332,7 +333,7 @@ static void btrfs_simple_end_io(struct bio *bio)
 {
 	struct btrfs_bio *bbio = btrfs_bio(bio);
 	struct btrfs_device *dev = bio->bi_private;
-	struct btrfs_fs_info *fs_info = bbio->fs_info;
+	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
 
 	btrfs_bio_counter_dec(fs_info);
 
@@ -581,10 +582,11 @@ static void run_one_async_done(struct btrfs_work *work, bool do_free)
 
 static bool should_async_write(struct btrfs_bio *bbio)
 {
+	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
 	bool auto_csum_mode = true;
 
 #ifdef CONFIG_BTRFS_EXPERIMENTAL
-	struct btrfs_fs_devices *fs_devices = bbio->fs_info->fs_devices;
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	enum btrfs_offload_csum_mode csum_mode = READ_ONCE(fs_devices->offload_csum_mode);
 
 	if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_OFF)
@@ -594,7 +596,7 @@ static bool should_async_write(struct btrfs_bio *bbio)
 #endif
 
 	/* Submit synchronously if the checksum implementation is fast. */
-	if (auto_csum_mode && test_bit(BTRFS_FS_CSUM_IMPL_FAST, &bbio->fs_info->flags))
+	if (auto_csum_mode && test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags))
 		return false;
 
 	/*
@@ -605,7 +607,7 @@ static bool should_async_write(struct btrfs_bio *bbio)
 		return false;
 
 	/* Zoned devices require I/O to be submitted in order. */
-	if ((bbio->bio.bi_opf & REQ_META) && btrfs_is_zoned(bbio->fs_info))
+	if ((bbio->bio.bi_opf & REQ_META) && btrfs_is_zoned(fs_info))
 		return false;
 
 	return true;
@@ -620,7 +622,7 @@ static bool btrfs_wq_submit_bio(struct btrfs_bio *bbio,
 				struct btrfs_io_context *bioc,
 				struct btrfs_io_stripe *smap, int mirror_num)
 {
-	struct btrfs_fs_info *fs_info = bbio->fs_info;
+	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
 	struct async_submit_bio *async;
 
 	async = kmalloc(sizeof(*async), GFP_NOFS);
@@ -639,11 +641,12 @@ static bool btrfs_wq_submit_bio(struct btrfs_bio *bbio,
 
 static u64 btrfs_append_map_length(struct btrfs_bio *bbio, u64 map_length)
 {
+	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
 	unsigned int nr_segs;
 	int sector_offset;
 
-	map_length = min(map_length, bbio->fs_info->max_zone_append_size);
-	sector_offset = bio_split_rw_at(&bbio->bio, &bbio->fs_info->limits,
+	map_length = min(map_length, fs_info->max_zone_append_size);
+	sector_offset = bio_split_rw_at(&bbio->bio, &fs_info->limits,
 					&nr_segs, map_length);
 	if (sector_offset) {
 		/*
@@ -651,7 +654,7 @@ static u64 btrfs_append_map_length(struct btrfs_bio *bbio, u64 map_length)
 		 * sectorsize and thus cause unaligned I/Os.  Fix that by
 		 * always rounding down to the nearest boundary.
 		 */
-		return ALIGN_DOWN(sector_offset << SECTOR_SHIFT, bbio->fs_info->sectorsize);
+		return ALIGN_DOWN(sector_offset << SECTOR_SHIFT, fs_info->sectorsize);
 	}
 	return map_length;
 }
@@ -659,7 +662,7 @@ static u64 btrfs_append_map_length(struct btrfs_bio *bbio, u64 map_length)
 static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
 {
 	struct btrfs_inode *inode = bbio->inode;
-	struct btrfs_fs_info *fs_info = bbio->fs_info;
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct bio *bio = &bbio->bio;
 	u64 logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
 	u64 length = bio->bi_iter.bi_size;
@@ -670,7 +673,7 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
 	blk_status_t status;
 	int ret;
 
-	if (!bbio->inode || btrfs_is_data_reloc_root(inode->root))
+	if (bbio->is_scrub || btrfs_is_data_reloc_root(inode->root))
 		smap.rst_search_commit_root = true;
 	else
 		smap.rst_search_commit_root = false;
@@ -782,7 +785,7 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
 static void assert_bbio_alignment(struct btrfs_bio *bbio)
 {
 #ifdef CONFIG_BTRFS_ASSERT
-	struct btrfs_fs_info *fs_info = bbio->fs_info;
+	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
 	struct bio_vec bvec;
 	struct bvec_iter iter;
 	const u32 blocksize = fs_info->sectorsize;
@@ -885,16 +888,16 @@ int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
  */
 void btrfs_submit_repair_write(struct btrfs_bio *bbio, int mirror_num, bool dev_replace)
 {
-	struct btrfs_fs_info *fs_info = bbio->fs_info;
+	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
 	u64 logical = bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
 	u64 length = bbio->bio.bi_iter.bi_size;
 	struct btrfs_io_stripe smap = { 0 };
 	int ret;
 
-	ASSERT(fs_info);
 	ASSERT(mirror_num > 0);
 	ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
-	ASSERT(!bbio->inode);
+	ASSERT(!is_data_inode(bbio->inode));
+	ASSERT(bbio->is_scrub);
 
 	btrfs_bio_counter_inc_blocked(fs_info);
 	ret = btrfs_map_repair_block(fs_info, &smap, logical, length, mirror_num);
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index 3cc0fe23898f..5d20f959e12d 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -27,7 +27,10 @@ typedef void (*btrfs_bio_end_io_t)(struct btrfs_bio *bbio);
 struct btrfs_bio {
 	/*
 	 * Inode and offset into it that this I/O operates on.
-	 * Only set for data I/O.
+	 *
+	 * If the inode is a data one, csum verification and read-repair
+	 * will be done automatically.
+	 * If the inode is a metadata one, everything is handled by the caller.
 	 */
 	struct btrfs_inode *inode;
 	u64 file_offset;
@@ -69,14 +72,17 @@ struct btrfs_bio {
 	atomic_t pending_ios;
 	struct work_struct end_io_work;
 
-	/* File system that this I/O operates on. */
-	struct btrfs_fs_info *fs_info;
-
 	/* Save the first error status of split bio. */
 	blk_status_t status;
 
 	/* Use the commit root to look up csums (data read bio only). */
 	bool csum_search_commit_root;
+
+	/*
+	 * Since scrub will reuse btree inode, we need this flag to distinguish
+	 * scrub bios.
+	 */
+	bool is_scrub;
 	/*
 	 * This member must come last, bio_alloc_bioset will allocate enough
 	 * bytes for entire btrfs_bio but relies on bio being last.
@@ -92,10 +98,10 @@ static inline struct btrfs_bio *btrfs_bio(struct bio *bio)
 int __init btrfs_bioset_init(void);
 void __cold btrfs_bioset_exit(void);
 
-void btrfs_bio_init(struct btrfs_bio *bbio, struct btrfs_fs_info *fs_info,
+void btrfs_bio_init(struct btrfs_bio *bbio, struct btrfs_inode *inode, u64 file_offset,
 		    btrfs_bio_end_io_t end_io, void *private);
 struct btrfs_bio *btrfs_bio_alloc(unsigned int nr_vecs, blk_opf_t opf,
-				  struct btrfs_fs_info *fs_info,
+				  struct btrfs_inode *inode, u64 file_offset,
 				  btrfs_bio_end_io_t end_io, void *private);
 void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status);
 
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index bacad18357b3..8c3899832a1a 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -67,9 +67,7 @@ static struct compressed_bio *alloc_compressed_bio(struct btrfs_inode *inode,
 
 	bbio = btrfs_bio(bio_alloc_bioset(NULL, BTRFS_MAX_COMPRESSED_PAGES, op,
 					  GFP_NOFS, &btrfs_compressed_bioset));
-	btrfs_bio_init(bbio, inode->root->fs_info, end_io, NULL);
-	bbio->inode = inode;
-	bbio->file_offset = start;
+	btrfs_bio_init(bbio, inode, start, end_io, NULL);
 	return to_compressed_bio(bbio);
 }
 
@@ -354,7 +352,7 @@ static void end_bbio_compressed_write(struct btrfs_bio *bbio)
 
 static void btrfs_add_compressed_bio_folios(struct compressed_bio *cb)
 {
-	struct btrfs_fs_info *fs_info = cb->bbio.fs_info;
+	struct btrfs_fs_info *fs_info = cb->bbio.inode->root->fs_info;
 	struct bio *bio = &cb->bbio.bio;
 	u32 offset = 0;
 
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index c6812d5fcab7..062ebd9c2d32 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -14,6 +14,7 @@
 #include <linux/pagemap.h>
 #include "bio.h"
 #include "fs.h"
+#include "btrfs_inode.h"
 
 struct address_space;
 struct inode;
@@ -74,7 +75,7 @@ struct compressed_bio {
 
 static inline struct btrfs_fs_info *cb_to_fs_info(const struct compressed_bio *cb)
 {
-	return cb->bbio.fs_info;
+	return cb->bbio.inode->root->fs_info;
 }
 
 /* @range_end must be exclusive. */
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index f225cc3fd3a1..962fccceffd6 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -715,10 +715,8 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
 		container_of(bbio, struct btrfs_dio_private, bbio);
 	struct btrfs_dio_data *dio_data = iter->private;
 
-	btrfs_bio_init(bbio, BTRFS_I(iter->inode)->root->fs_info,
+	btrfs_bio_init(bbio, BTRFS_I(iter->inode), file_offset,
 		       btrfs_dio_end_io, bio->bi_private);
-	bbio->inode = BTRFS_I(iter->inode);
-	bbio->file_offset = file_offset;
 
 	dip->file_offset = file_offset;
 	dip->bytes = bio->bi_iter.bi_size;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cb680cdeb77d..b25a2b45047e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -517,7 +517,7 @@ static void end_folio_read(struct folio *folio, bool uptodate, u64 start, u32 le
  */
 static void end_bbio_data_write(struct btrfs_bio *bbio)
 {
-	struct btrfs_fs_info *fs_info = bbio->fs_info;
+	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
 	struct bio *bio = &bbio->bio;
 	int error = blk_status_to_errno(bio->bi_status);
 	struct folio_iter fi;
@@ -573,7 +573,7 @@ static void begin_folio_read(struct btrfs_fs_info *fs_info, struct folio *folio)
  */
 static void end_bbio_data_read(struct btrfs_bio *bbio)
 {
-	struct btrfs_fs_info *fs_info = bbio->fs_info;
+	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
 	struct bio *bio = &bbio->bio;
 	struct folio_iter fi;
 
@@ -738,12 +738,10 @@ static void alloc_new_bio(struct btrfs_inode *inode,
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_bio *bbio;
 
-	bbio = btrfs_bio_alloc(BIO_MAX_VECS, bio_ctrl->opf, fs_info,
-			       bio_ctrl->end_io_func, NULL);
+	bbio = btrfs_bio_alloc(BIO_MAX_VECS, bio_ctrl->opf, inode,
+			       file_offset, bio_ctrl->end_io_func, NULL);
 	bbio->bio.bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;
 	bbio->bio.bi_write_hint = inode->vfs_inode.i_write_hint;
-	bbio->inode = inode;
-	bbio->file_offset = file_offset;
 	bio_ctrl->bbio = bbio;
 	bio_ctrl->len_to_oe_boundary = U32_MAX;
 	bio_ctrl->next_file_offset = file_offset;
@@ -2224,12 +2222,11 @@ static noinline_for_stack void write_one_eb(struct extent_buffer *eb,
 
 	bbio = btrfs_bio_alloc(INLINE_EXTENT_BUFFER_PAGES,
 			       REQ_OP_WRITE | REQ_META | wbc_to_write_flags(wbc),
-			       eb->fs_info, end_bbio_meta_write, eb);
+			       BTRFS_I(fs_info->btree_inode), eb->start,
+			       end_bbio_meta_write, eb);
 	bbio->bio.bi_iter.bi_sector = eb->start >> SECTOR_SHIFT;
 	bio_set_dev(&bbio->bio, fs_info->fs_devices->latest_dev->bdev);
 	wbc_init_bio(wbc, &bbio->bio);
-	bbio->inode = BTRFS_I(eb->fs_info->btree_inode);
-	bbio->file_offset = eb->start;
 	for (int i = 0; i < num_extent_folios(eb); i++) {
 		struct folio *folio = eb->folios[i];
 		u64 range_start = max_t(u64, eb->start, folio_pos(folio));
@@ -3844,6 +3841,7 @@ static void end_bbio_meta_read(struct btrfs_bio *bbio)
 int read_extent_buffer_pages_nowait(struct extent_buffer *eb, int mirror_num,
 				    const struct btrfs_tree_parent_check *check)
 {
+	struct btrfs_fs_info *fs_info = eb->fs_info;
 	struct btrfs_bio *bbio;
 
 	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
@@ -3877,11 +3875,9 @@ int read_extent_buffer_pages_nowait(struct extent_buffer *eb, int mirror_num,
 	refcount_inc(&eb->refs);
 
 	bbio = btrfs_bio_alloc(INLINE_EXTENT_BUFFER_PAGES,
-			       REQ_OP_READ | REQ_META, eb->fs_info,
-			       end_bbio_meta_read, eb);
+			       REQ_OP_READ | REQ_META, BTRFS_I(fs_info->btree_inode),
+			       eb->start, end_bbio_meta_read, eb);
 	bbio->bio.bi_iter.bi_sector = eb->start >> SECTOR_SHIFT;
-	bbio->inode = BTRFS_I(eb->fs_info->btree_inode);
-	bbio->file_offset = eb->start;
 	memcpy(&bbio->parent_check, check, sizeof(*check));
 	for (int i = 0; i < num_extent_folios(eb); i++) {
 		struct folio *folio = eb->folios[i];
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2ca7a34fc73b..bfcac2c68474 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9404,7 +9404,6 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 					  u64 disk_bytenr, u64 disk_io_size,
 					  struct page **pages, void *uring_ctx)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_encoded_read_private *priv, sync_priv;
 	struct completion sync_reads;
 	unsigned long i = 0;
@@ -9429,10 +9428,9 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 	priv->status = 0;
 	priv->uring_ctx = uring_ctx;
 
-	bbio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, fs_info,
+	bbio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, inode, 0,
 			       btrfs_encoded_read_endio, priv);
 	bbio->bio.bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;
-	bbio->inode = inode;
 
 	do {
 		size_t bytes = min_t(u64, disk_io_size, PAGE_SIZE);
@@ -9441,10 +9439,9 @@ int btrfs_encoded_read_regular_fill_pages(struct btrfs_inode *inode,
 			refcount_inc(&priv->pending_refs);
 			btrfs_submit_bbio(bbio, 0);
 
-			bbio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, fs_info,
+			bbio = btrfs_bio_alloc(BIO_MAX_VECS, REQ_OP_READ, inode, 0,
 					       btrfs_encoded_read_endio, priv);
 			bbio->bio.bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;
-			bbio->inode = inode;
 			continue;
 		}
 
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index fe266785804e..4951022ab402 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -929,10 +929,11 @@ static int calc_next_mirror(int mirror, int num_copies)
 static void scrub_bio_add_sector(struct btrfs_bio *bbio, struct scrub_stripe *stripe,
 				 int sector_nr)
 {
+	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
 	void *kaddr = scrub_stripe_get_kaddr(stripe, sector_nr);
 	int ret;
 
-	ret = bio_add_page(&bbio->bio, virt_to_page(kaddr), bbio->fs_info->sectorsize,
+	ret = bio_add_page(&bbio->bio, virt_to_page(kaddr), fs_info->sectorsize,
 			   offset_in_page(kaddr));
 	/*
 	 * Caller should ensure the bbio has enough size.
@@ -942,7 +943,19 @@ static void scrub_bio_add_sector(struct btrfs_bio *bbio, struct scrub_stripe *st
 	 * to create the minimal amount of bio vectors, for fs block size < page
 	 * size cases.
 	 */
-	ASSERT(ret == bbio->fs_info->sectorsize);
+	ASSERT(ret == fs_info->sectorsize);
+}
+
+static struct btrfs_bio *alloc_scrub_bbio(struct btrfs_fs_info *fs_info,
+					  unsigned int nr_vecs, blk_opf_t opf,
+					  u64 logical,
+					  btrfs_bio_end_io_t end_io, void *private)
+{
+	struct btrfs_bio *bbio = btrfs_bio_alloc(nr_vecs, opf, BTRFS_I(fs_info->btree_inode),
+						 logical, end_io, private);
+	bbio->is_scrub = true;
+	bbio->bio.bi_iter.bi_sector = logical >> SECTOR_SHIFT;
+	return bbio;
 }
 
 static void scrub_stripe_submit_repair_read(struct scrub_stripe *stripe,
@@ -968,12 +981,10 @@ static void scrub_stripe_submit_repair_read(struct scrub_stripe *stripe,
 			bbio = NULL;
 		}
 
-		if (!bbio) {
-			bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
-				fs_info, scrub_repair_read_endio, stripe);
-			bbio->bio.bi_iter.bi_sector = (stripe->logical +
-				(i << fs_info->sectorsize_bits)) >> SECTOR_SHIFT;
-		}
+		if (!bbio)
+			bbio = alloc_scrub_bbio(fs_info, stripe->nr_sectors, REQ_OP_READ,
+						stripe->logical + (i << fs_info->sectorsize_bits),
+						scrub_repair_read_endio, stripe);
 
 		scrub_bio_add_sector(bbio, stripe, i);
 	}
@@ -1352,13 +1363,10 @@ static void scrub_write_sectors(struct scrub_ctx *sctx, struct scrub_stripe *str
 			scrub_submit_write_bio(sctx, stripe, bbio, dev_replace);
 			bbio = NULL;
 		}
-		if (!bbio) {
-			bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_WRITE,
-					       fs_info, scrub_write_endio, stripe);
-			bbio->bio.bi_iter.bi_sector = (stripe->logical +
-				(sector_nr << fs_info->sectorsize_bits)) >>
-				SECTOR_SHIFT;
-		}
+		if (!bbio)
+			bbio = alloc_scrub_bbio(fs_info, stripe->nr_sectors, REQ_OP_WRITE,
+					stripe->logical + (sector_nr << fs_info->sectorsize_bits),
+					scrub_write_endio, stripe);
 		scrub_bio_add_sector(bbio, stripe, sector_nr);
 	}
 	if (bbio)
@@ -1849,9 +1857,8 @@ static void scrub_submit_extent_sector_read(struct scrub_stripe *stripe)
 				continue;
 			}
 
-			bbio = btrfs_bio_alloc(stripe->nr_sectors, REQ_OP_READ,
-					       fs_info, scrub_read_endio, stripe);
-			bbio->bio.bi_iter.bi_sector = logical >> SECTOR_SHIFT;
+			bbio = alloc_scrub_bbio(fs_info, stripe->nr_sectors, REQ_OP_READ, logical,
+						scrub_read_endio, stripe);
 		}
 
 		scrub_bio_add_sector(bbio, stripe, i);
@@ -1888,10 +1895,8 @@ static void scrub_submit_initial_read(struct scrub_ctx *sctx,
 		return;
 	}
 
-	bbio = btrfs_bio_alloc(BTRFS_STRIPE_LEN >> min_folio_shift, REQ_OP_READ, fs_info,
-			       scrub_read_endio, stripe);
-
-	bbio->bio.bi_iter.bi_sector = stripe->logical >> SECTOR_SHIFT;
+	bbio = alloc_scrub_bbio(fs_info, BTRFS_STRIPE_LEN >> min_folio_shift, REQ_OP_READ,
+				stripe->logical, scrub_read_endio, stripe);
 	/* Read the whole range inside the chunk boundary. */
 	for (unsigned int cur = 0; cur < nr_sectors; cur++)
 		scrub_bio_add_sector(bbio, stripe, cur);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 2e3145c1a102..8833fa54b07b 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1813,14 +1813,14 @@ bool btrfs_use_zone_append(struct btrfs_bio *bbio)
 {
 	u64 start = (bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT);
 	struct btrfs_inode *inode = bbio->inode;
-	struct btrfs_fs_info *fs_info = bbio->fs_info;
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_block_group *cache;
 	bool ret = false;
 
 	if (!btrfs_is_zoned(fs_info))
 		return false;
 
-	if (!inode || !is_data_inode(inode))
+	if (!is_data_inode(inode))
 		return false;
 
 	if (btrfs_op(&bbio->bio) != BTRFS_MAP_WRITE)
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 4/7] btrfs: make sure all btrfs_bio::end_io is called in task context
  2025-10-29  5:34 [PATCH v2 0/7] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
                   ` (2 preceding siblings ...)
  2025-10-29  5:34 ` [PATCH v2 3/7] btrfs: remove btrfs_bio::fs_info by extracting it from btrfs_bio::inode Qu Wenruo
@ 2025-10-29  5:34 ` Qu Wenruo
  2025-10-29  5:34 ` [PATCH v2 5/7] btrfs: remove btrfs_fs_info::compressed_write_workers Qu Wenruo
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2025-10-29  5:34 UTC (permalink / raw)
  To: linux-btrfs

[BACKGROUND]
Btrfs has a lot of different bi_end_io functions to handle the
different RAID profiles, but they introduced several different contexts
for the btrfs_bio::end_io() calls:

- Simple read bios
  Run in task context, backed by either endio_meta_workers or
  endio_workers.

- Simple write bios
  Run in irq context.

- RAID56 write or rebuild bios
  Run in task context, backed by rmw_workers.

- Mirrored write bios
  Run in irq context.

This is very inconsistent, and contributes to the large number of
workqueues used in btrfs.

[ENHANCEMENT]
Make all the above bios call their btrfs_bio::end_io() in task context,
backed by either endio_meta_workers for metadata or endio_workers for
data.

For simple write bios, merge the handling into simple_end_io_work().

For mirrored write bios, it is a little more complex, since either the
original or a cloned bio can run the final btrfs_bio::end_io().

Here we make sure the cloned bios are allocated from btrfs_bioset, so
that end_io_work can be reused, and run both the original and the
cloned work inside the workqueue.

And add extra ASSERT()s to make sure btrfs_bio_end_io() is running in
task context.

This not only unifies the context for btrfs_bio::end_io() functions, but
also opens a new door for further btrfs_bio::end_io() related cleanups.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/bio.c | 64 ++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 46 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 6028d8f0c8fc..5a5f23332c2e 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -102,6 +102,9 @@ static struct btrfs_bio *btrfs_split_bio(struct btrfs_fs_info *fs_info,
 
 void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
 {
+	/* Make sure we're already in task context. */
+	ASSERT(in_task());
+
 	bbio->bio.bi_status = status;
 	if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
 		struct btrfs_bio *orig_bbio = bbio->private;
@@ -318,15 +321,20 @@ static struct workqueue_struct *btrfs_end_io_wq(const struct btrfs_fs_info *fs_i
 	return fs_info->endio_workers;
 }
 
-static void btrfs_end_bio_work(struct work_struct *work)
+static void simple_end_io_work(struct work_struct *work)
 {
 	struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work);
+	struct bio *bio = &bbio->bio;
 
-	/* Metadata reads are checked and repaired by the submitter. */
-	if (is_data_bbio(bbio))
-		btrfs_check_read_bio(bbio, bbio->bio.bi_private);
-	else
-		btrfs_bio_end_io(bbio, bbio->bio.bi_status);
+	if (bio_op(bio) == REQ_OP_READ) {
+		/* Metadata reads are checked and repaired by the submitter. */
+		if (is_data_bbio(bbio))
+			return btrfs_check_read_bio(bbio, bbio->bio.bi_private);
+		return btrfs_bio_end_io(bbio, bbio->bio.bi_status);
+	}
+	if (bio_is_zone_append(bio) && !bio->bi_status)
+		btrfs_record_physical_zoned(bbio);
+	btrfs_bio_end_io(bbio, bbio->bio.bi_status);
 }
 
 static void btrfs_simple_end_io(struct bio *bio)
@@ -340,14 +348,8 @@ static void btrfs_simple_end_io(struct bio *bio)
 	if (bio->bi_status)
 		btrfs_log_dev_io_error(bio, dev);
 
-	if (bio_op(bio) == REQ_OP_READ) {
-		INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
-		queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->end_io_work);
-	} else {
-		if (bio_is_zone_append(bio) && !bio->bi_status)
-			btrfs_record_physical_zoned(bbio);
-		btrfs_bio_end_io(bbio, bbio->bio.bi_status);
-	}
+	INIT_WORK(&bbio->end_io_work, simple_end_io_work);
+	queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->end_io_work);
 }
 
 static void btrfs_raid56_end_io(struct bio *bio)
@@ -355,6 +357,9 @@ static void btrfs_raid56_end_io(struct bio *bio)
 	struct btrfs_io_context *bioc = bio->bi_private;
 	struct btrfs_bio *bbio = btrfs_bio(bio);
 
+	/* RAID56 endio is always handled in workqueue. */
+	ASSERT(in_task());
+
 	btrfs_bio_counter_dec(bioc->fs_info);
 	bbio->mirror_num = bioc->mirror_num;
 	if (bio_op(bio) == REQ_OP_READ && is_data_bbio(bbio))
@@ -365,11 +370,12 @@ static void btrfs_raid56_end_io(struct bio *bio)
 	btrfs_put_bioc(bioc);
 }
 
-static void btrfs_orig_write_end_io(struct bio *bio)
+static void orig_write_end_io_work(struct work_struct *work)
 {
+	struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work);
+	struct bio *bio = &bbio->bio;
 	struct btrfs_io_stripe *stripe = bio->bi_private;
 	struct btrfs_io_context *bioc = stripe->bioc;
-	struct btrfs_bio *bbio = btrfs_bio(bio);
 
 	btrfs_bio_counter_dec(bioc->fs_info);
 
@@ -394,8 +400,18 @@ static void btrfs_orig_write_end_io(struct bio *bio)
 	btrfs_put_bioc(bioc);
 }
 
-static void btrfs_clone_write_end_io(struct bio *bio)
+static void btrfs_orig_write_end_io(struct bio *bio)
 {
+	struct btrfs_bio *bbio = btrfs_bio(bio);
+
+	INIT_WORK(&bbio->end_io_work, orig_write_end_io_work);
+	queue_work(btrfs_end_io_wq(bbio->inode->root->fs_info, bio), &bbio->end_io_work);
+}
+
+static void clone_write_end_io_work(struct work_struct *work)
+{
+	struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work);
+	struct bio *bio = &bbio->bio;
 	struct btrfs_io_stripe *stripe = bio->bi_private;
 
 	if (bio->bi_status) {
@@ -410,6 +426,14 @@ static void btrfs_clone_write_end_io(struct bio *bio)
 	bio_put(bio);
 }
 
+static void btrfs_clone_write_end_io(struct bio *bio)
+{
+	struct btrfs_bio *bbio = btrfs_bio(bio);
+
+	INIT_WORK(&bbio->end_io_work, clone_write_end_io_work);
+	queue_work(btrfs_end_io_wq(bbio->inode->root->fs_info, bio), &bbio->end_io_work);
+}
+
 static void btrfs_submit_dev_bio(struct btrfs_device *dev, struct bio *bio)
 {
 	if (!dev || !dev->bdev ||
@@ -456,6 +480,7 @@ static void btrfs_submit_dev_bio(struct btrfs_device *dev, struct bio *bio)
 static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
 {
 	struct bio *orig_bio = bioc->orig_bio, *bio;
+	struct btrfs_bio *orig_bbio = btrfs_bio(orig_bio);
 
 	ASSERT(bio_op(orig_bio) != REQ_OP_READ);
 
@@ -464,8 +489,11 @@ static void btrfs_submit_mirrored_bio(struct btrfs_io_context *bioc, int dev_nr)
 		bio = orig_bio;
 		bio->bi_end_io = btrfs_orig_write_end_io;
 	} else {
-		bio = bio_alloc_clone(NULL, orig_bio, GFP_NOFS, &fs_bio_set);
+		/* We need to use endio_work to run end_io in task context. */
+		bio = bio_alloc_clone(NULL, orig_bio, GFP_NOFS, &btrfs_bioset);
 		bio_inc_remaining(orig_bio);
+		btrfs_bio_init(btrfs_bio(bio), orig_bbio->inode,
+			       orig_bbio->file_offset, NULL, NULL);
 		bio->bi_end_io = btrfs_clone_write_end_io;
 	}
 
-- 
2.51.0



* [PATCH v2 5/7] btrfs: remove btrfs_fs_info::compressed_write_workers
  2025-10-29  5:34 [PATCH v2 0/7] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
                   ` (3 preceding siblings ...)
  2025-10-29  5:34 ` [PATCH v2 4/7] btrfs: make sure all btrfs_bio::end_io is called in task context Qu Wenruo
@ 2025-10-29  5:34 ` Qu Wenruo
  2025-10-29  5:34 ` [PATCH v2 6/7] btrfs: relax btrfs_inode::ordered_tree_lock Qu Wenruo
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2025-10-29  5:34 UTC (permalink / raw)
  To: linux-btrfs

The reason why end_bbio_compressed_write() queues a work item into the
compressed_write_workers workqueue is the end_compressed_writeback()
call, which grabs all the involved folios and clears their writeback
flags, and may sleep.

However, now that we always run btrfs_bio::end_io() in task context,
there is no need to queue the work anymore.

Just remove btrfs_fs_info::compressed_write_workers and
compressed_bio::write_end_work.

There is a comment in close_ctree() about the works queued into
compressed_write_workers; update it to flush the endio workqueue
instead, which is now responsible for handling all data endio
functions.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.c | 27 ++++++++-------------------
 fs/btrfs/compression.h |  7 ++-----
 fs/btrfs/disk-io.c     |  9 ++-------
 fs/btrfs/fs.h          |  1 -
 4 files changed, 12 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 8c3899832a1a..ef6e4d1662cc 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -319,22 +319,6 @@ static noinline void end_compressed_writeback(const struct compressed_bio *cb)
 	/* the inode may be gone now */
 }
 
-static void btrfs_finish_compressed_write_work(struct work_struct *work)
-{
-	struct compressed_bio *cb =
-		container_of(work, struct compressed_bio, write_end_work);
-
-	btrfs_finish_ordered_extent(cb->bbio.ordered, NULL, cb->start, cb->len,
-				    cb->bbio.bio.bi_status == BLK_STS_OK);
-
-	if (cb->writeback)
-		end_compressed_writeback(cb);
-	/* Note, our inode could be gone now */
-
-	btrfs_free_compressed_folios(cb);
-	bio_put(&cb->bbio.bio);
-}
-
 /*
  * Do the cleanup once all the compressed pages hit the disk.  This will clear
  * writeback on the file pages and free the compressed pages.
@@ -345,9 +329,15 @@ static void btrfs_finish_compressed_write_work(struct work_struct *work)
 static void end_bbio_compressed_write(struct btrfs_bio *bbio)
 {
 	struct compressed_bio *cb = to_compressed_bio(bbio);
-	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
 
-	queue_work(fs_info->compressed_write_workers, &cb->write_end_work);
+	btrfs_finish_ordered_extent(cb->bbio.ordered, NULL, cb->start, cb->len,
+				    cb->bbio.bio.bi_status == BLK_STS_OK);
+
+	if (cb->writeback)
+		end_compressed_writeback(cb);
+	/* Note, our inode could be gone now */
+	btrfs_free_compressed_folios(cb);
+	bio_put(&cb->bbio.bio);
 }
 
 static void btrfs_add_compressed_bio_folios(struct compressed_bio *cb)
@@ -400,7 +390,6 @@ void btrfs_submit_compressed_write(struct btrfs_ordered_extent *ordered,
 	cb->compressed_folios = compressed_folios;
 	cb->compressed_len = ordered->disk_num_bytes;
 	cb->writeback = writeback;
-	INIT_WORK(&cb->write_end_work, btrfs_finish_compressed_write_work);
 	cb->nr_folios = nr_folios;
 	cb->bbio.bio.bi_iter.bi_sector = ordered->disk_bytenr >> SECTOR_SHIFT;
 	cb->bbio.ordered = ordered;
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 062ebd9c2d32..c1a6c76827da 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -63,11 +63,8 @@ struct compressed_bio {
 	/* Whether this is a write for writeback. */
 	bool writeback;
 
-	union {
-		/* For reads, this is the bio we are copying the data into */
-		struct btrfs_bio *orig_bbio;
-		struct work_struct write_end_work;
-	};
+	/* For reads, this is the bio we are copying the data into */
+	struct btrfs_bio *orig_bbio;
 
 	/* Must be last. */
 	struct btrfs_bio bbio;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 46b715f3447b..6a1fa3b08b3f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1774,8 +1774,6 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 		destroy_workqueue(fs_info->endio_workers);
 	if (fs_info->rmw_workers)
 		destroy_workqueue(fs_info->rmw_workers);
-	if (fs_info->compressed_write_workers)
-		destroy_workqueue(fs_info->compressed_write_workers);
 	btrfs_destroy_workqueue(fs_info->endio_write_workers);
 	btrfs_destroy_workqueue(fs_info->endio_freespace_worker);
 	btrfs_destroy_workqueue(fs_info->delayed_workers);
@@ -1987,8 +1985,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 	fs_info->endio_write_workers =
 		btrfs_alloc_workqueue(fs_info, "endio-write", flags,
 				      max_active, 2);
-	fs_info->compressed_write_workers =
-		alloc_workqueue("btrfs-compressed-write", flags, max_active);
 	fs_info->endio_freespace_worker =
 		btrfs_alloc_workqueue(fs_info, "freespace-write", flags,
 				      max_active, 0);
@@ -2004,7 +2000,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 	if (!(fs_info->workers &&
 	      fs_info->delalloc_workers && fs_info->flush_workers &&
 	      fs_info->endio_workers && fs_info->endio_meta_workers &&
-	      fs_info->compressed_write_workers &&
 	      fs_info->endio_write_workers &&
 	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
 	      fs_info->caching_workers && fs_info->fixup_workers &&
@@ -4291,7 +4286,7 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 
 	/*
 	 * When finishing a compressed write bio we schedule a work queue item
-	 * to finish an ordered extent - btrfs_finish_compressed_write_work()
+	 * to finish an ordered extent - end_bbio_compressed_write()
 	 * calls btrfs_finish_ordered_extent() which in turns does a call to
 	 * btrfs_queue_ordered_fn(), and that queues the ordered extent
 	 * completion either in the endio_write_workers work queue or in the
@@ -4299,7 +4294,7 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
 	 * below, so before we flush them we must flush this queue for the
 	 * workers of compressed writes.
 	 */
-	flush_workqueue(fs_info->compressed_write_workers);
+	flush_workqueue(fs_info->endio_workers);
 
 	/*
 	 * After we parked the cleaner kthread, ordered extents may have
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index ad389fb1c01a..2e705025a8a5 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -654,7 +654,6 @@ struct btrfs_fs_info {
 	struct workqueue_struct *endio_workers;
 	struct workqueue_struct *endio_meta_workers;
 	struct workqueue_struct *rmw_workers;
-	struct workqueue_struct *compressed_write_workers;
 	struct btrfs_workqueue *endio_write_workers;
 	struct btrfs_workqueue *endio_freespace_worker;
 	struct btrfs_workqueue *caching_workers;
-- 
2.51.0



* [PATCH v2 6/7] btrfs: relax btrfs_inode::ordered_tree_lock
  2025-10-29  5:34 [PATCH v2 0/7] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
                   ` (4 preceding siblings ...)
  2025-10-29  5:34 ` [PATCH v2 5/7] btrfs: remove btrfs_fs_info::compressed_write_workers Qu Wenruo
@ 2025-10-29  5:34 ` Qu Wenruo
  2025-10-29  5:34 ` [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
  2025-11-04  5:37 ` [PATCH v2 0/7] " David Sterba
  7 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2025-10-29  5:34 UTC (permalink / raw)
  To: linux-btrfs

We used the IRQ-safe version of the spinlock for ordered_tree_lock, as
btrfs_finish_ordered_extent() can be called from end_bbio_data_write(),
which used to run in IRQ context.

However, since all the btrfs_bio::end_io() calls have been moved into
task context, there is no longer any need to support IRQ context, thus
we can relax to regular spin_lock()/spin_unlock() for
btrfs_inode::ordered_tree_lock.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c    |  5 ++--
 fs/btrfs/inode.c        |  4 +--
 fs/btrfs/ordered-data.c | 57 ++++++++++++++++++-----------------------
 fs/btrfs/tree-log.c     |  4 +--
 4 files changed, 31 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b25a2b45047e..2d32dfc34ae3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1726,7 +1726,6 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 
 		if (cur >= i_size) {
 			struct btrfs_ordered_extent *ordered;
-			unsigned long flags;
 
 			ordered = btrfs_lookup_first_ordered_range(inode, cur,
 								   folio_end - cur);
@@ -1735,11 +1734,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			 * there must be an ordered extent.
 			 */
 			ASSERT(ordered != NULL);
-			spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+			spin_lock(&inode->ordered_tree_lock);
 			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
 			ordered->truncated_len = min(ordered->truncated_len,
 						     cur - ordered->file_offset);
-			spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+			spin_unlock(&inode->ordered_tree_lock);
 			btrfs_put_ordered_extent(ordered);
 
 			btrfs_mark_ordered_io_finished(inode, folio, cur,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bfcac2c68474..6ac8eceaeddd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7589,11 +7589,11 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset,
 					       EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
 					       EXTENT_DEFRAG, &cached_state);
 
-		spin_lock_irq(&inode->ordered_tree_lock);
+		spin_lock(&inode->ordered_tree_lock);
 		set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
 		ordered->truncated_len = min(ordered->truncated_len,
 					     cur - ordered->file_offset);
-		spin_unlock_irq(&inode->ordered_tree_lock);
+		spin_unlock(&inode->ordered_tree_lock);
 
 		/*
 		 * If the ordered extent has finished, we're safe to delete all
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index dfda952dcf7b..a421f7db9eec 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -237,14 +237,14 @@ static void insert_ordered_extent(struct btrfs_ordered_extent *entry)
 	/* One ref for the tree. */
 	refcount_inc(&entry->refs);
 
-	spin_lock_irq(&inode->ordered_tree_lock);
+	spin_lock(&inode->ordered_tree_lock);
 	node = tree_insert(&inode->ordered_tree, entry->file_offset,
 			   &entry->rb_node);
 	if (unlikely(node))
 		btrfs_panic(fs_info, -EEXIST,
 				"inconsistency in ordered tree at offset %llu",
 				entry->file_offset);
-	spin_unlock_irq(&inode->ordered_tree_lock);
+	spin_unlock(&inode->ordered_tree_lock);
 
 	spin_lock(&root->ordered_extent_lock);
 	list_add_tail(&entry->root_extent_list,
@@ -328,9 +328,9 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 {
 	struct btrfs_inode *inode = entry->inode;
 
-	spin_lock_irq(&inode->ordered_tree_lock);
+	spin_lock(&inode->ordered_tree_lock);
 	list_add_tail(&sum->list, &entry->list);
-	spin_unlock_irq(&inode->ordered_tree_lock);
+	spin_unlock(&inode->ordered_tree_lock);
 }
 
 void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered)
@@ -417,15 +417,14 @@ void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
 				 bool uptodate)
 {
 	struct btrfs_inode *inode = ordered->inode;
-	unsigned long flags;
 	bool ret;
 
 	trace_btrfs_finish_ordered_extent(inode, file_offset, len, uptodate);
 
-	spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+	spin_lock(&inode->ordered_tree_lock);
 	ret = can_finish_ordered_extent(ordered, folio, file_offset, len,
 					uptodate);
-	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+	spin_unlock(&inode->ordered_tree_lock);
 
 	/*
 	 * If this is a COW write it means we created new extent maps for the
@@ -481,13 +480,12 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 {
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry = NULL;
-	unsigned long flags;
 	u64 cur = file_offset;
 	const u64 end = file_offset + num_bytes;
 
 	trace_btrfs_writepage_end_io_hook(inode, file_offset, end - 1, uptodate);
 
-	spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+	spin_lock(&inode->ordered_tree_lock);
 	while (cur < end) {
 		u64 entry_end;
 		u64 this_end;
@@ -539,13 +537,13 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 		ASSERT(len < U32_MAX);
 
 		if (can_finish_ordered_extent(entry, folio, cur, len, uptodate)) {
-			spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+			spin_unlock(&inode->ordered_tree_lock);
 			btrfs_queue_ordered_fn(entry);
-			spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+			spin_lock(&inode->ordered_tree_lock);
 		}
 		cur += len;
 	}
-	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+	spin_unlock(&inode->ordered_tree_lock);
 }
 
 /*
@@ -571,10 +569,9 @@ bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
 {
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry = NULL;
-	unsigned long flags;
 	bool finished = false;
 
-	spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+	spin_lock(&inode->ordered_tree_lock);
 	if (cached && *cached) {
 		entry = *cached;
 		goto have_entry;
@@ -611,7 +608,7 @@ bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
 		refcount_inc(&entry->refs);
 		trace_btrfs_ordered_extent_dec_test_pending(inode, entry);
 	}
-	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+	spin_unlock(&inode->ordered_tree_lock);
 	return finished;
 }
 
@@ -676,7 +673,7 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 	percpu_counter_add_batch(&fs_info->ordered_bytes, -entry->num_bytes,
 				 fs_info->delalloc_batch);
 
-	spin_lock_irq(&btrfs_inode->ordered_tree_lock);
+	spin_lock(&btrfs_inode->ordered_tree_lock);
 	node = &entry->rb_node;
 	rb_erase(node, &btrfs_inode->ordered_tree);
 	RB_CLEAR_NODE(node);
@@ -684,7 +681,7 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 		btrfs_inode->ordered_tree_last = NULL;
 	set_bit(BTRFS_ORDERED_COMPLETE, &entry->flags);
 	pending = test_and_clear_bit(BTRFS_ORDERED_PENDING, &entry->flags);
-	spin_unlock_irq(&btrfs_inode->ordered_tree_lock);
+	spin_unlock(&btrfs_inode->ordered_tree_lock);
 
 	/*
 	 * The current running transaction is waiting on us, we need to let it
@@ -969,9 +966,8 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *ino
 {
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry = NULL;
-	unsigned long flags;
 
-	spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+	spin_lock(&inode->ordered_tree_lock);
 	node = ordered_tree_search(inode, file_offset);
 	if (!node)
 		goto out;
@@ -984,7 +980,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *ino
 		trace_btrfs_ordered_extent_lookup(inode, entry);
 	}
 out:
-	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+	spin_unlock(&inode->ordered_tree_lock);
 	return entry;
 }
 
@@ -997,7 +993,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_range(
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry = NULL;
 
-	spin_lock_irq(&inode->ordered_tree_lock);
+	spin_lock(&inode->ordered_tree_lock);
 	node = ordered_tree_search(inode, file_offset);
 	if (!node) {
 		node = ordered_tree_search(inode, file_offset + len);
@@ -1024,7 +1020,7 @@ struct btrfs_ordered_extent *btrfs_lookup_ordered_range(
 		refcount_inc(&entry->refs);
 		trace_btrfs_ordered_extent_lookup_range(inode, entry);
 	}
-	spin_unlock_irq(&inode->ordered_tree_lock);
+	spin_unlock(&inode->ordered_tree_lock);
 	return entry;
 }
 
@@ -1039,7 +1035,7 @@ void btrfs_get_ordered_extents_for_logging(struct btrfs_inode *inode,
 
 	btrfs_assert_inode_locked(inode);
 
-	spin_lock_irq(&inode->ordered_tree_lock);
+	spin_lock(&inode->ordered_tree_lock);
 	for (n = rb_first(&inode->ordered_tree); n; n = rb_next(n)) {
 		struct btrfs_ordered_extent *ordered;
 
@@ -1053,7 +1049,7 @@ void btrfs_get_ordered_extents_for_logging(struct btrfs_inode *inode,
 		refcount_inc(&ordered->refs);
 		trace_btrfs_ordered_extent_lookup_for_logging(inode, ordered);
 	}
-	spin_unlock_irq(&inode->ordered_tree_lock);
+	spin_unlock(&inode->ordered_tree_lock);
 }
 
 /*
@@ -1066,7 +1062,7 @@ btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset)
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry = NULL;
 
-	spin_lock_irq(&inode->ordered_tree_lock);
+	spin_lock(&inode->ordered_tree_lock);
 	node = ordered_tree_search(inode, file_offset);
 	if (!node)
 		goto out;
@@ -1075,7 +1071,7 @@ btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset)
 	refcount_inc(&entry->refs);
 	trace_btrfs_ordered_extent_lookup_first(inode, entry);
 out:
-	spin_unlock_irq(&inode->ordered_tree_lock);
+	spin_unlock(&inode->ordered_tree_lock);
 	return entry;
 }
 
@@ -1096,9 +1092,8 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
 	struct rb_node *prev;
 	struct rb_node *next;
 	struct btrfs_ordered_extent *entry = NULL;
-	unsigned long flags;
 
-	spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+	spin_lock(&inode->ordered_tree_lock);
 	node = inode->ordered_tree.rb_node;
 	/*
 	 * Here we don't want to use tree_search() which will use tree->last
@@ -1153,7 +1148,7 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
 		trace_btrfs_ordered_extent_lookup_first_range(inode, entry);
 	}
 
-	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+	spin_unlock(&inode->ordered_tree_lock);
 	return entry;
 }
 
@@ -1285,9 +1280,7 @@ struct btrfs_ordered_extent *btrfs_split_ordered_extent(
 	/*
 	 * Take the root's ordered_extent_lock to avoid a race with
 	 * btrfs_wait_ordered_extents() when updating the disk_bytenr and
-	 * disk_num_bytes fields of the ordered extent below. And we disable
-	 * IRQs because the inode's ordered_tree_lock is used in IRQ context
-	 * elsewhere.
+	 * disk_num_bytes fields of the ordered extent below.
 	 *
 	 * There's no concern about a previous caller of
 	 * btrfs_wait_ordered_extents() getting the trimmed ordered extent
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index afaf96a76e3f..a9436be5e142 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5410,12 +5410,12 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 		set_bit(BTRFS_ORDERED_LOGGED, &ordered->flags);
 
 		if (!test_bit(BTRFS_ORDERED_COMPLETE, &ordered->flags)) {
-			spin_lock_irq(&inode->ordered_tree_lock);
+			spin_lock(&inode->ordered_tree_lock);
 			if (!test_bit(BTRFS_ORDERED_COMPLETE, &ordered->flags)) {
 				set_bit(BTRFS_ORDERED_PENDING, &ordered->flags);
 				atomic_inc(&trans->transaction->pending_ordered);
 			}
-			spin_unlock_irq(&inode->ordered_tree_lock);
+			spin_unlock(&inode->ordered_tree_lock);
 		}
 		btrfs_put_ordered_extent(ordered);
 	}
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum
  2025-10-29  5:34 [PATCH v2 0/7] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
                   ` (5 preceding siblings ...)
  2025-10-29  5:34 ` [PATCH v2 6/7] btrfs: relax btrfs_inode::ordered_tree_lock Qu Wenruo
@ 2025-10-29  5:34 ` Qu Wenruo
  2025-11-10 18:40   ` Daniel Vacek
  2025-11-11 12:41   ` Daniel Vacek
  2025-11-04  5:37 ` [PATCH v2 0/7] " David Sterba
  7 siblings, 2 replies; 22+ messages in thread
From: Qu Wenruo @ 2025-10-29  5:34 UTC (permalink / raw)
  To: linux-btrfs

[ENHANCEMENT]
Btrfs currently calculates its data checksum and then submits the bio.

But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write
if the inode requires checksum"), any write with data checksum will
fall back to buffered IO, meaning the content will not change during
writeback.

This means it is safe to calculate the data checksum and submit the bio
in parallel; we only need the following new behaviors:

- Wait for the csum generation to finish before calling btrfs_bio::end_io()
  Otherwise the csum generation worker can hit a use-after-free.

- Save the current bi_iter for csum_one_bio()
  Since the submission part can advance btrfs_bio::bio.bi_iter, without
  the saved iter csum_one_bio() may get an empty bi_iter and generate no
  checksum at all.

  Unfortunately this means we have to increase the size of btrfs_bio by
  16 bytes.

As usual, the new feature is hidden behind the experimental flag.

[THEORETICAL ANALYSIS]
Consider the following theoretical hardware performance figures, which
should be more or less in line with modern mainstream hardware:

	Memory bandwidth:	50GiB/s
	CRC32C bandwidth:	45GiB/s
	SSD bandwidth:		8GiB/s

Then the btrfs write bandwidth with data checksum before the patch would be

	1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s

After the patch, the bandwidth would be:

	1 / ( 1 / 50 + max( 1 / 45, 1 / 8)) = 6.90 GiB/s

The difference would be a 15.32% improvement.

[REAL WORLD BENCHMARK]
I'm using a Zen5 (HX 370) machine as the host; the VM has 4GiB of memory
and 10 vCPUs, and the storage is backed by a PCIe gen3 x4 NVMe SSD.

The test is a direct IO write with 1MiB block size, writing 7GiB of data
into a btrfs mount with data checksum enabled, so the direct write will
fall back to a buffered one:

Vanilla Datasum:	1619.97 MiB/s
Patched Datasum:	1792.26 MiB/s
Diff			+10.6 %

In my case the bottleneck is the storage, thus the improvement does not
reach the theoretical one, but it is still clearly observable.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/bio.c       | 21 +++++++++++----
 fs/btrfs/bio.h       |  7 +++++
 fs/btrfs/file-item.c | 64 +++++++++++++++++++++++++++++++-------------
 fs/btrfs/file-item.h |  2 +-
 4 files changed, 69 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 5a5f23332c2e..8af2b68c2d53 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -105,6 +105,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
 	/* Make sure we're already in task context. */
 	ASSERT(in_task());
 
+	if (bbio->async_csum)
+		wait_for_completion(&bbio->csum_done);
+
 	bbio->bio.bi_status = status;
 	if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
 		struct btrfs_bio *orig_bbio = bbio->private;
@@ -538,7 +541,11 @@ static int btrfs_bio_csum(struct btrfs_bio *bbio)
 {
 	if (bbio->bio.bi_opf & REQ_META)
 		return btree_csum_one_bio(bbio);
-	return btrfs_csum_one_bio(bbio);
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	return btrfs_csum_one_bio(bbio, true);
+#else
+	return btrfs_csum_one_bio(bbio, false);
+#endif
 }
 
 /*
@@ -617,10 +624,14 @@ static bool should_async_write(struct btrfs_bio *bbio)
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	enum btrfs_offload_csum_mode csum_mode = READ_ONCE(fs_devices->offload_csum_mode);
 
-	if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_OFF)
-		return false;
-
-	auto_csum_mode = (csum_mode == BTRFS_OFFLOAD_CSUM_AUTO);
+	if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_ON)
+		return true;
+	/*
+	 * Write bios will calculate checksum and submit bio at the same time.
+	 * Unless explicitly required don't offload serial csum calculate and bio
+	 * submit into a workqueue.
+	 */
+	return false;
 #endif
 
 	/* Submit synchronously if the checksum implementation is fast. */
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index 5d20f959e12d..deaeea3becf4 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -56,6 +56,9 @@ struct btrfs_bio {
 		struct {
 			struct btrfs_ordered_extent *ordered;
 			struct btrfs_ordered_sum *sums;
+			struct work_struct csum_work;
+			struct completion csum_done;
+			struct bvec_iter csum_saved_iter;
 			u64 orig_physical;
 		};
 
@@ -83,6 +86,10 @@ struct btrfs_bio {
 	 * scrub bios.
 	 */
 	bool is_scrub;
+
+	/* Whether the csum generation for data write is async. */
+	bool async_csum;
+
 	/*
 	 * This member must come last, bio_alloc_bioset will allocate enough
 	 * bytes for entire btrfs_bio but relies on bio being last.
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index a42e6d54e7cd..4b7c40f05e8f 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -18,6 +18,7 @@
 #include "fs.h"
 #include "accessors.h"
 #include "file-item.h"
+#include "volumes.h"
 
 #define __MAX_CSUM_ITEMS(r, size) ((unsigned long)(((BTRFS_LEAF_DATA_SIZE(r) - \
 				   sizeof(struct btrfs_item) * 2) / \
@@ -764,21 +765,46 @@ int btrfs_lookup_csums_bitmap(struct btrfs_root *root, struct btrfs_path *path,
 	return ret;
 }
 
-/*
- * Calculate checksums of the data contained inside a bio.
- */
-int btrfs_csum_one_bio(struct btrfs_bio *bbio)
+static void csum_one_bio(struct btrfs_bio *bbio, struct bvec_iter *src)
 {
-	struct btrfs_ordered_extent *ordered = bbio->ordered;
 	struct btrfs_inode *inode = bbio->inode;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
 	struct bio *bio = &bbio->bio;
-	struct btrfs_ordered_sum *sums;
-	struct bvec_iter iter = bio->bi_iter;
+	struct btrfs_ordered_sum *sums = bbio->sums;
+	struct bvec_iter iter = *src;
 	phys_addr_t paddr;
 	const u32 blocksize = fs_info->sectorsize;
-	int index;
+	int index = 0;
+
+	shash->tfm = fs_info->csum_shash;
+
+	btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
+		btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
+		index += fs_info->csum_size;
+	}
+}
+
+static void csum_one_bio_work(struct work_struct *work)
+{
+	struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, csum_work);
+
+	ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
+	ASSERT(bbio->async_csum == true);
+	csum_one_bio(bbio, &bbio->csum_saved_iter);
+	complete(&bbio->csum_done);
+}
+
+/*
+ * Calculate checksums of the data contained inside a bio.
+ */
+int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async)
+{
+	struct btrfs_ordered_extent *ordered = bbio->ordered;
+	struct btrfs_inode *inode = bbio->inode;
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct bio *bio = &bbio->bio;
+	struct btrfs_ordered_sum *sums;
 	unsigned nofs_flag;
 
 	nofs_flag = memalloc_nofs_save();
@@ -789,21 +815,21 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio)
 	if (!sums)
 		return -ENOMEM;
 
+	sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
 	sums->len = bio->bi_iter.bi_size;
 	INIT_LIST_HEAD(&sums->list);
-
-	sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
-	index = 0;
-
-	shash->tfm = fs_info->csum_shash;
-
-	btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
-		btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
-		index += fs_info->csum_size;
-	}
-
 	bbio->sums = sums;
 	btrfs_add_ordered_sum(ordered, sums);
+
+	if (!async) {
+		csum_one_bio(bbio, &bbio->bio.bi_iter);
+		return 0;
+	}
+	init_completion(&bbio->csum_done);
+	bbio->async_csum = true;
+	bbio->csum_saved_iter = bbio->bio.bi_iter;
+	INIT_WORK(&bbio->csum_work, csum_one_bio_work);
+	schedule_work(&bbio->csum_work);
 	return 0;
 }
 
diff --git a/fs/btrfs/file-item.h b/fs/btrfs/file-item.h
index 0d59e830018a..5645c5e3abdb 100644
--- a/fs/btrfs/file-item.h
+++ b/fs/btrfs/file-item.h
@@ -64,7 +64,7 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
 int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 			   struct btrfs_root *root,
 			   struct btrfs_ordered_sum *sums);
-int btrfs_csum_one_bio(struct btrfs_bio *bbio);
+int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async);
 int btrfs_alloc_dummy_sum(struct btrfs_bio *bbio);
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 			     struct list_head *list, int search_commit,
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 0/7] btrfs: introduce btrfs_bio::async_csum
  2025-10-29  5:34 [PATCH v2 0/7] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
                   ` (6 preceding siblings ...)
  2025-10-29  5:34 ` [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
@ 2025-11-04  5:37 ` David Sterba
  7 siblings, 0 replies; 22+ messages in thread
From: David Sterba @ 2025-11-04  5:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Oct 29, 2025 at 04:04:10PM +1030, Qu Wenruo wrote:
> [CHANGELOG]
> v2:
> - Fix a race where csum generation can be completely skipped
>   If the lower storage layer advanced the bi_iter before we finished
>   calculating the checksum, csum_one_bio() may skip generating csum
>   completely.
> 
>   Unfortunately to solve this, a new bvec_iter must be introduced and
>   cause the size of btrfs_bio to increase.
> 
> - Remove btrfs_bio::fs_info by always requiring btrfs_bio::inode
>   For scrub, use btree inode instead, and for the only function that
>   requires to be called by scrub, btrfs_submit_repair_write(), use
>   a new member, btrfs_bio::is_scrub, to check the callers.
> 
> - Add a new patch to cleanup the recursive including problems
>   The idea is to keep the btrfs headers included from .h files to a
>   minimum, thus lowering the risk of recursive inclusion.
> 
> - Remove BTRFS_MAX_BIO_SECTORS macro
>   It's the same as BIO_MAX_VECS, and only utilized once.
> 
> The new async_csum feature will allow btrfs to calculate checksum for
> data write bios and submit them in parallel.
> 
> This will reduce latency and improve write throughput when data checksum
> is utilized.
> 
> This will partially recover the performance lost after commit
> 968f19c5b1b7 ("btrfs: always fallback to buffered write if the inode
> requires checksum").
> 
> Patch 1 is a minor cleanup to remove a single-use macro.
> Patch 2 is a preparation to remove btrfs_bio::fs_info; the core
> is to avoid recursive inclusion.
> Patch 3 removes btrfs_bio::fs_info, as we will soon
> increase the size of btrfs_bio by 16 bytes.
> 
> Patch 4~6 are preparations to make sure btrfs_bio::end_io() is called in
> task context.
> 
> Patch 7 enables the async checksum generation for data writes, under
> experimental flags.

Ok, I think we can leave it under experimental for just one cycle.

> The series will increase btrfs_bio size by 8 bytes (+16 for the new
> structures, -8 for the removal of btrfs_bio::fs_info).

The end result is still good, also the fs_info was really redundant.

> Since the new async_csum should be better than the csum offload anyway,
> we may want to deprecate the csum offload feature completely in the
> future.
> Thankfully csum offload feature is still behind the experimental flag,
> thus it should not affect end users.

Yeah, we can remove it eventually. The idea was to provide a way to
test different approaches to improving the performance, but it's a
band-aid and there should be only one. The logic can decide based on
various factors, but the sysfs tunable should not be needed.

> Qu Wenruo (7):
>   btrfs: replace BTRFS_MAX_BIO_SECTORS with BIO_MAX_VECS
>   btrfs: headers cleanup to remove unnecessary local includes
>   btrfs: remove btrfs_bio::fs_info by extracting it from
>     btrfs_bio::inode
>   btrfs: make sure all btrfs_bio::end_io is called in task context
>   btrfs: remove btrfs_fs_info::compressed_write_workers
>   btrfs: relax btrfs_inode::ordered_tree_lock
>   btrfs: introduce btrfs_bio::async_csum

Please add the patches to for-next as you see fit. I'll add them to
linux-next for early testing too.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum
  2025-10-29  5:34 ` [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
@ 2025-11-10 18:40   ` Daniel Vacek
  2025-11-10 21:05     ` Qu Wenruo
  2025-11-11 12:41   ` Daniel Vacek
  1 sibling, 1 reply; 22+ messages in thread
From: Daniel Vacek @ 2025-11-10 18:40 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, 29 Oct 2025 at 06:43, Qu Wenruo <wqu@suse.com> wrote:
>
> [...]
> @@ -105,6 +105,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
>         /* Make sure we're already in task context. */
>         ASSERT(in_task());
>
> +       if (bbio->async_csum)
> +               wait_for_completion(&bbio->csum_done);

Can we do `flush_work(&bbio->csum_work);` instead here and get rid of
the completion? I believe it is not needed at all.

--nX


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum
  2025-11-10 18:40   ` Daniel Vacek
@ 2025-11-10 21:05     ` Qu Wenruo
  2025-11-10 21:45       ` Qu Wenruo
  2025-11-11 12:38       ` Daniel Vacek
  0 siblings, 2 replies; 22+ messages in thread
From: Qu Wenruo @ 2025-11-10 21:05 UTC (permalink / raw)
  To: Daniel Vacek; +Cc: linux-btrfs



On 2025/11/11 05:10, Daniel Vacek wrote:
> On Wed, 29 Oct 2025 at 06:43, Qu Wenruo <wqu@suse.com> wrote:
>>
>> [...]
>> @@ -105,6 +105,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
>>          /* Make sure we're already in task context. */
>>          ASSERT(in_task());
>>
>> +       if (bbio->async_csum)
>> +               wait_for_completion(&bbio->csum_done);
> 
> Can we do `flush_work(&bbio->csum_work);` instead here and get rid of
> the completion? I believe it is not needed at all.

I tried this idea, but unfortunately it causes kernel warnings.

It triggers the warning from check_flush_dependency() inside
__flush_work().

It looks like the workqueue we're in (btrfs-endio) has a different
WQ_MEM_RECLAIM setting than the workqueue running csum_one_bio_work().

I'll keep digging to try to use flush_work() and remove csum_done, as
that would keep the size of btrfs_bio unchanged.

Thanks,
Qu


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum
  2025-11-10 21:05     ` Qu Wenruo
@ 2025-11-10 21:45       ` Qu Wenruo
  2025-11-11 12:38       ` Daniel Vacek
  1 sibling, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2025-11-10 21:45 UTC (permalink / raw)
  To: Daniel Vacek; +Cc: linux-btrfs



On 2025/11/11 07:35, Qu Wenruo wrote:
> 
> 
> On 2025/11/11 05:10, Daniel Vacek wrote:
[...]
>>> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
>>> index 5a5f23332c2e..8af2b68c2d53 100644
>>> --- a/fs/btrfs/bio.c
>>> +++ b/fs/btrfs/bio.c
>>> @@ -105,6 +105,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, 
>>> blk_status_t status)
>>>          /* Make sure we're already in task context. */
>>>          ASSERT(in_task());
>>>
>>> +       if (bbio->async_csum)
>>> +               wait_for_completion(&bbio->csum_done);
>>
>> Can we do `flush_work(&bbio->csum_work);` instead here and get rid of
>> the completion? I believe it is not needed at all.
> 
> I tried this idea, unfortunately causing kernel warnings.
> 
> It will trigger a warning inside __flush_work(), triggering the warning 
> from check_flush_dependency().
> 
> It looks like the workqueue we're in (btrfs-endio) has a different 
> WQ_MEM_RECLAIM flag for csum_one_bio_work().
> 
> 
> I'll keep digging to try to use flush_work() to remove the csum_done, as 
> that will keep the size of btrfs_bio unchanged.

Unfortunately it does not look possible to use flush_work() here.

The point is that btrfs_bio_end_io() can be called inside a workqueue
(endio_workers or rmw_workers).
The involved workqueues have the WQ_MEM_RECLAIM flag set, which means
they can not flush works on other workqueues without breaking the
forward-progress guarantee (see check_flush_dependency()).

It looks like flush_work() involves a more complex mechanism than just
waiting, and is not suitable for our pure waiting usage.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum
  2025-11-10 21:05     ` Qu Wenruo
  2025-11-10 21:45       ` Qu Wenruo
@ 2025-11-11 12:38       ` Daniel Vacek
  2025-11-11 20:33         ` Qu Wenruo
  1 sibling, 1 reply; 22+ messages in thread
From: Daniel Vacek @ 2025-11-11 12:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, 10 Nov 2025 at 22:05, Qu Wenruo <wqu@suse.com> wrote:
> On 2025/11/11 05:10, Daniel Vacek wrote:
> > On Wed, 29 Oct 2025 at 06:43, Qu Wenruo <wqu@suse.com> wrote:
> >> [...]
> >> @@ -105,6 +105,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
> >>          /* Make sure we're already in task context. */
> >>          ASSERT(in_task());
> >>
> >> +       if (bbio->async_csum)
> >> +               wait_for_completion(&bbio->csum_done);
> >
> > Can we do `flush_work(&bbio->csum_work);` instead here and get rid of
> > the completion? I believe it is not needed at all.
>
> I tried this idea, but unfortunately it causes kernel warnings.
>
> flush_work() triggers the warning inside __flush_work(), coming from
> check_flush_dependency().
>
> It looks like the workqueue we're in (btrfs-endio) has a different
> WQ_MEM_RECLAIM setting than the one running csum_one_bio_work().

If I read the code correctly, that would be solved by using the
btrfs_end_io_wq instead of the system wq (i.e. queue_work() instead of
schedule_work()).

> +       init_completion(&bbio->csum_done);
> +       bbio->async_csum = true;
> +       bbio->csum_saved_iter = bbio->bio.bi_iter;
> +       INIT_WORK(&bbio->csum_work, csum_one_bio_work);
> +       schedule_work(&bbio->csum_work);

queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->csum_work);

> I'll keep digging to try to use flush_work() to remove the csum_done, as
> that will keep the size of btrfs_bio unchanged.
>
> Thanks,
> Qu
>
> >
> > --nX
> >
>
> >>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum
  2025-10-29  5:34 ` [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
  2025-11-10 18:40   ` Daniel Vacek
@ 2025-11-11 12:41   ` Daniel Vacek
  2025-11-11 20:38     ` Qu Wenruo
  1 sibling, 1 reply; 22+ messages in thread
From: Daniel Vacek @ 2025-11-11 12:41 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, 29 Oct 2025 at 06:43, Qu Wenruo <wqu@suse.com> wrote:
>
> [ENHANCEMENT]
> Btrfs currently calculates its data checksum and then submits the bio.
>
> But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write
> if the inode requires checksum"), any writes with data checksum will
> fallback to buffered IO, meaning the content will not change during
> writeback.
>
> [...]
>
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 5a5f23332c2e..8af2b68c2d53 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -105,6 +105,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
>         /* Make sure we're already in task context. */
>         ASSERT(in_task());
>
> +       if (bbio->async_csum)
> +               wait_for_completion(&bbio->csum_done);
> +
>         bbio->bio.bi_status = status;
>         if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
>                 struct btrfs_bio *orig_bbio = bbio->private;
> @@ -538,7 +541,11 @@ static int btrfs_bio_csum(struct btrfs_bio *bbio)
>  {
>         if (bbio->bio.bi_opf & REQ_META)
>                 return btree_csum_one_bio(bbio);
> -       return btrfs_csum_one_bio(bbio);
> +#ifdef CONFIG_BTRFS_EXPERIMENTAL
> +       return btrfs_csum_one_bio(bbio, true);
> +#else
> +       return btrfs_csum_one_bio(bbio, false);
> +#endif
>  }
>
>  /*
> @@ -617,10 +624,14 @@ static bool should_async_write(struct btrfs_bio *bbio)
>         struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>         enum btrfs_offload_csum_mode csum_mode = READ_ONCE(fs_devices->offload_csum_mode);
>
> -       if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_OFF)
> -               return false;
> -
> -       auto_csum_mode = (csum_mode == BTRFS_OFFLOAD_CSUM_AUTO);
> +       if (csum_mode == BTRFS_OFFLOAD_CSUM_FORCE_ON)
> +               return true;
> +       /*
> +        * Write bios will calculate checksum and submit bio at the same time.
> +        * Unless explicitly required don't offload serial csum calculate and bio
> +        * submit into a workqueue.
> +        */
> +       return false;
>  #endif
>
>         /* Submit synchronously if the checksum implementation is fast. */
> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> index 5d20f959e12d..deaeea3becf4 100644
> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -56,6 +56,9 @@ struct btrfs_bio {
>                 struct {
>                         struct btrfs_ordered_extent *ordered;
>                         struct btrfs_ordered_sum *sums;
> +                       struct work_struct csum_work;
> +                       struct completion csum_done;
> +                       struct bvec_iter csum_saved_iter;
>                         u64 orig_physical;
>                 };
>
> @@ -83,6 +86,10 @@ struct btrfs_bio {
>          * scrub bios.
>          */
>         bool is_scrub;
> +
> +       /* Whether the csum generation for data write is async. */
> +       bool async_csum;
> +
>         /*
>          * This member must come last, bio_alloc_bioset will allocate enough
>          * bytes for entire btrfs_bio but relies on bio being last.
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index a42e6d54e7cd..4b7c40f05e8f 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -18,6 +18,7 @@
>  #include "fs.h"
>  #include "accessors.h"
>  #include "file-item.h"
> +#include "volumes.h"
>
>  #define __MAX_CSUM_ITEMS(r, size) ((unsigned long)(((BTRFS_LEAF_DATA_SIZE(r) - \
>                                    sizeof(struct btrfs_item) * 2) / \
> @@ -764,21 +765,46 @@ int btrfs_lookup_csums_bitmap(struct btrfs_root *root, struct btrfs_path *path,
>         return ret;
>  }
>
> -/*
> - * Calculate checksums of the data contained inside a bio.
> - */
> -int btrfs_csum_one_bio(struct btrfs_bio *bbio)
> +static void csum_one_bio(struct btrfs_bio *bbio, struct bvec_iter *src)
>  {
> -       struct btrfs_ordered_extent *ordered = bbio->ordered;
>         struct btrfs_inode *inode = bbio->inode;
>         struct btrfs_fs_info *fs_info = inode->root->fs_info;

If you use bbio->fs_info, the inode can also be removed.

>         SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
>         struct bio *bio = &bbio->bio;
> -       struct btrfs_ordered_sum *sums;
> -       struct bvec_iter iter = bio->bi_iter;
> +       struct btrfs_ordered_sum *sums = bbio->sums;
> +       struct bvec_iter iter = *src;
>         phys_addr_t paddr;
>         const u32 blocksize = fs_info->sectorsize;
> -       int index;
> +       int index = 0;
> +
> +       shash->tfm = fs_info->csum_shash;
> +
> +       btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
> +               btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
> +               index += fs_info->csum_size;
> +       }
> +}
> +
> +static void csum_one_bio_work(struct work_struct *work)
> +{
> +       struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, csum_work);
> +
> +       ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
> +       ASSERT(bbio->async_csum == true);
> +       csum_one_bio(bbio, &bbio->csum_saved_iter);
> +       complete(&bbio->csum_done);
> +}
> +
> +/*
> + * Calculate checksums of the data contained inside a bio.
> + */
> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async)
> +{
> +       struct btrfs_ordered_extent *ordered = bbio->ordered;
> +       struct btrfs_inode *inode = bbio->inode;
> +       struct btrfs_fs_info *fs_info = inode->root->fs_info;

And the same here.

--nX

> +       struct bio *bio = &bbio->bio;
> +       struct btrfs_ordered_sum *sums;
>         unsigned nofs_flag;
>
>         nofs_flag = memalloc_nofs_save();
> @@ -789,21 +815,21 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio)
>         if (!sums)
>                 return -ENOMEM;
>
> +       sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
>         sums->len = bio->bi_iter.bi_size;
>         INIT_LIST_HEAD(&sums->list);
> -
> -       sums->logical = bio->bi_iter.bi_sector << SECTOR_SHIFT;
> -       index = 0;
> -
> -       shash->tfm = fs_info->csum_shash;
> -
> -       btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
> -               btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
> -               index += fs_info->csum_size;
> -       }
> -
>         bbio->sums = sums;
>         btrfs_add_ordered_sum(ordered, sums);
> +
> +       if (!async) {
> +               csum_one_bio(bbio, &bbio->bio.bi_iter);
> +               return 0;
> +       }
> +       init_completion(&bbio->csum_done);
> +       bbio->async_csum = true;
> +       bbio->csum_saved_iter = bbio->bio.bi_iter;
> +       INIT_WORK(&bbio->csum_work, csum_one_bio_work);
> +       schedule_work(&bbio->csum_work);
>         return 0;
>  }
>
> diff --git a/fs/btrfs/file-item.h b/fs/btrfs/file-item.h
> index 0d59e830018a..5645c5e3abdb 100644
> --- a/fs/btrfs/file-item.h
> +++ b/fs/btrfs/file-item.h
> @@ -64,7 +64,7 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
>  int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
>                            struct btrfs_root *root,
>                            struct btrfs_ordered_sum *sums);
> -int btrfs_csum_one_bio(struct btrfs_bio *bbio);
> +int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async);
>  int btrfs_alloc_dummy_sum(struct btrfs_bio *bbio);
>  int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
>                              struct list_head *list, int search_commit,
> --
> 2.51.0
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum
  2025-11-11 12:38       ` Daniel Vacek
@ 2025-11-11 20:33         ` Qu Wenruo
  2025-11-12 10:46           ` Daniel Vacek
  0 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2025-11-11 20:33 UTC (permalink / raw)
  To: Daniel Vacek; +Cc: linux-btrfs



On 2025/11/11 23:08, Daniel Vacek wrote:
> On Mon, 10 Nov 2025 at 22:05, Qu Wenruo <wqu@suse.com> wrote:
>> On 2025/11/11 05:10, Daniel Vacek wrote:
>>> On Wed, 29 Oct 2025 at 06:43, Qu Wenruo <wqu@suse.com> wrote:
>>>>
>>>> [...]
>>>> @@ -105,6 +105,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
>>>>           /* Make sure we're already in task context. */
>>>>           ASSERT(in_task());
>>>>
>>>> +       if (bbio->async_csum)
>>>> +               wait_for_completion(&bbio->csum_done);
>>>
>>> Can we do `flush_work(&bbio->csum_work);` instead here and get rid of
>>> the completion? I believe it is not needed at all.
>>
>> I tried this idea, unfortunately causing kernel warnings.
>>
>> It will trigger a warning inside __flush_work(), triggering the warning
>> from check_flush_dependency().
>>
>> It looks like the workqueue we're in (btrfs-endio) has a different
>> WQ_MEM_RECLAIM flag for csum_one_bio_work().
> 
> If I read the code correctly that would be solved using the
> btrfs_end_io_wq instead of system wq (ie. queue_work() instead of
> schedule_work()).

That will cause dependency problems. The endio work now depends on the
csum work, and both are executed on the same workqueue.
If max_active is 1, the endio work will deadlock waiting for the csum one.

> 
>> +       init_completion(&bbio->csum_done);
>> +       bbio->async_csum = true;
>> +       bbio->csum_saved_iter = bbio->bio.bi_iter;
>> +       INIT_WORK(&bbio->csum_work, csum_one_bio_work);
>> +       schedule_work(&bbio->csum_work);
> 
> queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->csum_work);

Nope, I even created a new workqueue for the csum work, and it doesn't
work around the workqueue dependency check.

The check inside check_flush_dependency() is:

	WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
			      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),

It means the flushing worker's workqueue cannot have the WQ_MEM_RECLAIM
flag at all (unless it also has the __WQ_LEGACY flag).

So nope, the flush_work() idea won't work inside any current btrfs
workqueue, as they all have the WQ_MEM_RECLAIM flag set.

What we would need is to drop WQ_MEM_RECLAIM from endio_workers and
rmw_workers, but that is a huge change and may not even work.

Thanks,
Qu

> 
>> I'll keep digging to try to use flush_work() to remove the csum_done, as
>> that will keep the size of btrfs_bio unchanged.
>>
>> Thanks,
>> Qu
>>
>>>
>>> --nX
>>>
>>
>>>>
>>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum
  2025-11-11 12:41   ` Daniel Vacek
@ 2025-11-11 20:38     ` Qu Wenruo
  2025-11-12  7:12       ` Daniel Vacek
  0 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2025-11-11 20:38 UTC (permalink / raw)
  To: Daniel Vacek; +Cc: linux-btrfs



On 2025/11/11 23:11, Daniel Vacek wrote:
> On Wed, 29 Oct 2025 at 06:43, Qu Wenruo <wqu@suse.com> wrote:
[...]
>> -/*
>> - * Calculate checksums of the data contained inside a bio.
>> - */
>> -int btrfs_csum_one_bio(struct btrfs_bio *bbio)
>> +static void csum_one_bio(struct btrfs_bio *bbio, struct bvec_iter *src)
>>   {
>> -       struct btrfs_ordered_extent *ordered = bbio->ordered;
>>          struct btrfs_inode *inode = bbio->inode;
>>          struct btrfs_fs_info *fs_info = inode->root->fs_info;
> 
> If you use bbio->fs_info, the inode can be also removed.

Please check the patch 3/7.

There is no more btrfs_bio::fs_info.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum
  2025-11-11 20:38     ` Qu Wenruo
@ 2025-11-12  7:12       ` Daniel Vacek
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Vacek @ 2025-11-12  7:12 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, 11 Nov 2025 at 21:38, Qu Wenruo <wqu@suse.com> wrote:
> On 2025/11/11 23:11, Daniel Vacek wrote:
> > On Wed, 29 Oct 2025 at 06:43, Qu Wenruo <wqu@suse.com> wrote:
> [...]
> >> -/*
> >> - * Calculate checksums of the data contained inside a bio.
> >> - */
> >> -int btrfs_csum_one_bio(struct btrfs_bio *bbio)
> >> +static void csum_one_bio(struct btrfs_bio *bbio, struct bvec_iter *src)
> >>   {
> >> -       struct btrfs_ordered_extent *ordered = bbio->ordered;
> >>          struct btrfs_inode *inode = bbio->inode;
> >>          struct btrfs_fs_info *fs_info = inode->root->fs_info;
> >
> > If you use bbio->fs_info, the inode can be also removed.
>
> Please check the patch 3/7.
>
> There is no more btrfs_bio::fs_info.

Right. I missed that. Not a big deal then. The inode is still not
needed here, though.

Sorry about the noise.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum
  2025-11-11 20:33         ` Qu Wenruo
@ 2025-11-12 10:46           ` Daniel Vacek
  2025-11-12 20:11             ` Qu Wenruo
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Vacek @ 2025-11-12 10:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, 11 Nov 2025 at 21:33, Qu Wenruo <wqu@suse.com> wrote:
> On 2025/11/11 23:08, Daniel Vacek wrote:
> > On Mon, 10 Nov 2025 at 22:05, Qu Wenruo <wqu@suse.com> wrote:
> >> On 2025/11/11 05:10, Daniel Vacek wrote:
> >>> On Wed, 29 Oct 2025 at 06:43, Qu Wenruo <wqu@suse.com> wrote:
> >>>>
> >>>> [...]
> >>>> @@ -105,6 +105,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
> >>>>           /* Make sure we're already in task context. */
> >>>>           ASSERT(in_task());
> >>>>
> >>>> +       if (bbio->async_csum)
> >>>> +               wait_for_completion(&bbio->csum_done);
> >>>
> >>> Can we do `flush_work(&bbio->csum_work);` instead here and get rid of
> >>> the completion? I believe it is not needed at all.
> >>
> >> I tried this idea, unfortunately causing kernel warnings.
> >>
> >> It will trigger a warning inside __flush_work(), triggering the warning
> >> from check_flush_dependency().
> >>
> >> It looks like the workqueue we're in (btrfs-endio) has a different
> >> WQ_MEM_RECLAIM flag for csum_one_bio_work().
> >
> > If I read the code correctly that would be solved using the
> > btrfs_end_io_wq instead of system wq (ie. queue_work() instead of
> > schedule_work()).
>
> That will cause dependency problems. The endio work now depends on the
> csum work, which are both executed on the same workqueue.
> If the max_active is 1, endio work will deadlock waiting for the csum one.

When the csum work is queued, the bio has not even been submitted yet.
Chances are the csum work will be done even before the bio ends and
the end io work is queued. But even if the csum work is not done yet
(or not even started, due to scheduling delays or a previous unrelated
work item still being blocked), it is always serialized before the end
io work. So IIUC, there should be no deadlock possible. Unless I'm
still missing something; workqueues can be tricky.

> >> +       init_completion(&bbio->csum_done);
> >> +       bbio->async_csum = true;
> >> +       bbio->csum_saved_iter = bbio->bio.bi_iter;
> >> +       INIT_WORK(&bbio->csum_work, csum_one_bio_work);
> >> +       schedule_work(&bbio->csum_work);
> >
> > queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->csum_work);
>
> Nope, I even created a new workqueue for the csum work, and it doesn't
> workaround the workqueue dependency check.

Did you create it with the WQ_MEM_RECLAIM flag? Something like:

alloc_workqueue("btrfs-async-csum", ... | WQ_MEM_RECLAIM, ...);

I don't see how that would trigger the warning. See below for a
detailed explanation.

> The check inside check_flush_dependency() is:
>
>          WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
>                                (WQ_MEM_RECLAIM | __WQ_LEGACY)) ==
> WQ_MEM_RECLAIM),
>
> It means the worker can not have WQ_MEM_RECLAIM flag at all. (unless it
> also has the LEGACY flag)

I understand that if the work is queued on a wq with WQ_MEM_RECLAIM,
check_flush_dependency() returns early, so the warning's condition is
never even evaluated.

> So nope, the flush_work() idea won't work inside any current btrfs
> workqueue, which all have WQ_MEM_RECLAIM flag set.

With that said, I still see two possible solutions.
Either use btrfs_end_io_wq() as suggested before; it should be safe, IMO.
Or, if you're still worried about possible deadlocks, create a new
dedicated wq which also has WQ_MEM_RECLAIM set (which is needed for
compatibility with the context where we want to call flush_work()).

Both ways could help getting rid of the completion in btrfs_bio, which
is 32 bytes by itself.

What do I miss?

Out of curiosity, flush_work() internally also uses a completion in
pretty much exactly the same way as in this patch, but it's on the
caller's stack (in this case the stack which would call the
btrfs_bio_end_io() modified with flush_work()). So in the end the
effect would be like moving the completion from btrfs_bio to the stack.

> What we need is to make endio_workers and rmw_workers to get rid of
> WQ_MEM_RECLAIM, but that is a huge change and may not even work.
>
> Thanks,
> Qu
>
> >
> >> I'll keep digging to try to use flush_work() to remove the csum_done, as
> >> that will keep the size of btrfs_bio unchanged.
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> --nX
> >>>
> >>
> >>>>
> >>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum
  2025-11-12 10:46           ` Daniel Vacek
@ 2025-11-12 20:11             ` Qu Wenruo
  2025-11-12 21:43               ` Daniel Vacek
  0 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2025-11-12 20:11 UTC (permalink / raw)
  To: Daniel Vacek; +Cc: linux-btrfs



On 2025/11/12 21:16, Daniel Vacek wrote:
> On Tue, 11 Nov 2025 at 21:33, Qu Wenruo <wqu@suse.com> wrote:
>> On 2025/11/11 23:08, Daniel Vacek wrote:
>>> On Mon, 10 Nov 2025 at 22:05, Qu Wenruo <wqu@suse.com> wrote:
>>>> On 2025/11/11 05:10, Daniel Vacek wrote:
>>>>> On Wed, 29 Oct 2025 at 06:43, Qu Wenruo <wqu@suse.com> wrote:
>>>>>>
>>>>>> [...]
>>>>>> @@ -105,6 +105,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
>>>>>>            /* Make sure we're already in task context. */
>>>>>>            ASSERT(in_task());
>>>>>>
>>>>>> +       if (bbio->async_csum)
>>>>>> +               wait_for_completion(&bbio->csum_done);
>>>>>
>>>>> Can we do `flush_work(&bbio->csum_work);` instead here and get rid of
>>>>> the completion? I believe it is not needed at all.
>>>>
>>>> I tried this idea, unfortunately causing kernel warnings.
>>>>
>>>> It will trigger a warning inside __flush_work(), triggering the warning
>>>> from check_flush_dependency().
>>>>
>>>> It looks like the workqueue we're in (btrfs-endio) has a different
>>>> WQ_MEM_RECLAIM flag for csum_one_bio_work().
>>>
>>> If I read the code correctly that would be solved using the
>>> btrfs_end_io_wq instead of system wq (ie. queue_work() instead of
>>> schedule_work()).
>>
>> That will cause dependency problems. The endio work now depends on the
>> csum work, which are both executed on the same workqueue.
>> If the max_active is 1, endio work will deadlock waiting for the csum one.
> 
> When the csum work is being queued the bio was not even submitted yet.
> The chances are the csum work will be done even before the bio ends
> and the end io work is queued. But even if csum work is not done yet
> (or not even started due to scheduling delays or a previous (unrelated)
> worker still being blocked), it's always serialized before the end io
> work. So IIUC, there should be no deadlock possible. Unless I'm still
> missing something, workqueues could be tricky.
> 
>>>> +       init_completion(&bbio->csum_done);
>>>> +       bbio->async_csum = true;
>>>> +       bbio->csum_saved_iter = bbio->bio.bi_iter;
>>>> +       INIT_WORK(&bbio->csum_work, csum_one_bio_work);
>>>> +       schedule_work(&bbio->csum_work);
>>>
>>> queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->csum_work);
>>
>> Nope, I even created a new workqueue for the csum work, and it doesn't
>> work around the workqueue dependency check.
> 
> Did you create it with the WQ_MEM_RECLAIM flag? As like:
> 
> alloc_workqueue("btrfs-async-csum", ... | WQ_MEM_RECLAIM, ...);

Yes.

> 
> I don't see how that would trigger the warning. See below for a
> detailed explanation.

It's not the workqueue running the csum work, but the workqueue running 
the flush_work().

> 
>> The check inside check_flush_dependency() is:
>>
>>           WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
>>                                 (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
>>
>> It means the flushing worker's workqueue cannot have the WQ_MEM_RECLAIM
>> flag at all (unless it also has the __WQ_LEGACY flag).
> 
> I understand that if the work is queued on a wq with WQ_MEM_RECLAIM,
> the check_flush_dependency() returns early. Hence no need to worry
> about the warning's condition as it's no longer of any concern.

You can just try to use flush_work() and test it by yourself.

I have tried my best to explain it, but it looks like it's at my limit.

Just try a patch removing the csum_wait, and use flush_work() instead of 
wait_for_completion().

Then you'll see the problem I'm hitting.

Thanks,
Qu

> 
>> So nope, the flush_work() idea won't work inside any current btrfs
>> workqueue, which all have WQ_MEM_RECLAIM flag set.
> 
> With the above being said, I still see two possible solutions.
> Either using the btrfs_end_io_wq() as suggested before. It should be safe, IMO.
> Or, if you're still worried about possible deadlocks, creating a new
> dedicated wq which also has the WQ_MEM_RECLAIM set (which is needed
> for compatibility with the context where we want to call the
> flush_work()).
> 
> Both ways could help getting rid of the completion in btrfs_bio, which
> is 32 bytes by itself.
> 
> What do I miss?
> 
> Out of curiosity, flush_work() internally also uses completion in
> pretty much exactly the same way as in this patch, but it's on the
> caller's stack (in this case on the stack which would call the
> btrfs_bio_end_io() modified with flush_work()). So in the end the
> effect would be like moving the completion from btrfs_bio to a stack.
> 
>> What we need is to make endio_workers and rmw_workers to get rid of
>> WQ_MEM_RECLAIM, but that is a huge change and may not even work.
>>
>> Thanks,
>> Qu
>>
>>>
>>>> I'll keep digging to try to use flush_work() to remove the csum_done, as
>>>> that will keep the size of btrfs_bio unchanged.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> --nX
>>>>>
>>>>
>>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum
  2025-11-12 20:11             ` Qu Wenruo
@ 2025-11-12 21:43               ` Daniel Vacek
  2025-11-13  0:21                 ` Qu Wenruo
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Vacek @ 2025-11-12 21:43 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, 12 Nov 2025 at 21:11, Qu Wenruo <wqu@suse.com> wrote:
> 在 2025/11/12 21:16, Daniel Vacek 写道:
> > On Tue, 11 Nov 2025 at 21:33, Qu Wenruo <wqu@suse.com> wrote:
> >> 在 2025/11/11 23:08, Daniel Vacek 写道:
> >>> On Mon, 10 Nov 2025 at 22:05, Qu Wenruo <wqu@suse.com> wrote:
> >>>> 在 2025/11/11 05:10, Daniel Vacek 写道:
> >>>>> On Wed, 29 Oct 2025 at 06:43, Qu Wenruo <wqu@suse.com> wrote:
> >>>>>>
> >>>>>> [ENHANCEMENT]
> >>>>>> Btrfs currently calculates its data checksum and then submits the bio.
> >>>>>>
> >>>>>> But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write
> >>>>>> if the inode requires checksum"), any writes with data checksum will
> >>>>>> fall back to buffered IO, meaning the content will not change during
> >>>>>> writeback.
> >>>>>>
> >>>>>> This means we're safe to calculate the data checksum and submit the bio
> >>>>>> in parallel, and only need the following new behaviors:
> >>>>>>
> >>>>>> - Wait for the csum generation to finish before calling btrfs_bio::end_io()
> >>>>>>      Otherwise we can hit a use-after-free in the csum generation worker.
> >>>>>>
> >>>>>> - Save the current bi_iter for csum_one_bio()
> >>>>>>      As the submission part can advance btrfs_bio::bio.bi_iter, if it is
> >>>>>>      not saved csum_one_bio() may get an empty bi_iter and not generate
> >>>>>>      any checksum.
> >>>>>>
> >>>>>>      Unfortunately this means we have to increase the size of btrfs_bio
> >>>>>>      by 16 bytes.
> >>>>>>
> >>>>>> As usual, this new feature is hidden behind the experimental flag.
> >>>>>>
> >>>>>> [THEORETICAL ANALYSIS]
> >>>>>> Consider the following theoretical hardware performance, which should be
> >>>>>> more or less close to modern mainstream hardware:
> >>>>>>
> >>>>>>            Memory bandwidth:       50GiB/s
> >>>>>>            CRC32C bandwidth:       45GiB/s
> >>>>>>            SSD bandwidth:          8GiB/s
> >>>>>>
> >>>>>> Then btrfs write bandwidth with data checksum before the patch would be
> >>>>>>
> >>>>>>            1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s
> >>>>>>
> >>>>>> After the patch, the bandwidth would be:
> >>>>>>
> >>>>>>            1 / ( 1 / 50 + max(1 / 45, 1 / 8)) = 6.90 GiB/s
> >>>>>>
> >>>>>> The difference would be 15.32 % improvement.
> >>>>>>
> >>>>>> [REAL WORLD BENCHMARK]
> >>>>>> I'm using a Zen5 (HX 370) as the host, the VM has 4GiB memory, 10 vCPUs, the
> >>>>>> storage is backed by a PCIE gen3 x4 NVME SSD.
> >>>>>>
> >>>>>> The test is a direct IO write, with 1MiB block size, write 7GiB data
> >>>>>> into a btrfs mount with data checksum. Thus the direct write will fall back
> >>>>>> to a buffered one:
> >>>>>>
> >>>>>> Vanilla Datasum:        1619.97 MiB/s
> >>>>>> Patched Datasum:        1792.26 MiB/s
> >>>>>> Diff                    +10.6 %
> >>>>>>
> >>>>>> In my case, the bottleneck is the storage, thus the improvement does not
> >>>>>> reach the theoretical one, but it is still observable.
> >>>>>>
> >>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >>>>>> ---
> >>>>>>     fs/btrfs/bio.c       | 21 +++++++++++----
> >>>>>>     fs/btrfs/bio.h       |  7 +++++
> >>>>>>     fs/btrfs/file-item.c | 64 +++++++++++++++++++++++++++++++-------------
> >>>>>>     fs/btrfs/file-item.h |  2 +-
> >>>>>>     4 files changed, 69 insertions(+), 25 deletions(-)
> >>>>>>
> >>>>>> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> >>>>>> index 5a5f23332c2e..8af2b68c2d53 100644
> >>>>>> --- a/fs/btrfs/bio.c
> >>>>>> +++ b/fs/btrfs/bio.c
> >>>>>> @@ -105,6 +105,9 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
> >>>>>>            /* Make sure we're already in task context. */
> >>>>>>            ASSERT(in_task());
> >>>>>>
> >>>>>> +       if (bbio->async_csum)
> >>>>>> +               wait_for_completion(&bbio->csum_done);
> >>>>>
> >>>>> Can we do `flush_work(&bbio->csum_work);` instead here and get rid of
> >>>>> the completion? I believe it is not needed at all.
> >>>>
> >>>> I tried this idea, but unfortunately it causes kernel warnings.
> >>>>
> >>>> It triggers the warning from check_flush_dependency() inside
> >>>> __flush_work().
> >>>>
> >>>> It looks like the workqueue we're in (btrfs-endio) has the
> >>>> WQ_MEM_RECLAIM flag, unlike the one running csum_one_bio_work().
> >>>
> >>> If I read the code correctly that would be solved using the
> >>> btrfs_end_io_wq instead of system wq (ie. queue_work() instead of
> >>> schedule_work()).
> >>
> >> That will cause dependency problems. The endio work now depends on the
> >> csum work, which are both executed on the same workqueue.
> >> If the max_active is 1, endio work will deadlock waiting for the csum one.
> >
> > When the csum work is being queued the bio was not even submitted yet.
> > The chances are the csum work will be done even before the bio ends
> > and the end io work is queued. But even if csum work is not done yet
> > (or not even started due to scheduling delays or a previous (unrelated)
> > worker still being blocked), it's always serialized before the end io
> > work. So IIUC, there should be no deadlock possible. Unless I'm still
> > missing something, workqueues could be tricky.
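
To illustrate the ordering argument, here is a toy userspace model of a max_active == 1 queue (pthreads, not the kernel workqueue API; all the names below are made up): the earlier-queued csum work always runs before the later-queued endio work, so the endio side waiting on the csum work cannot deadlock on a single worker.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy FIFO workqueue with one worker thread, modeling max_active == 1. */
struct toy_work {
	void (*fn)(void);
	struct toy_work *next;
};

static struct toy_work *head, *tail;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t more = PTHREAD_COND_INITIALIZER;
static bool stopping;

static void toy_queue(struct toy_work *w)
{
	pthread_mutex_lock(&lock);
	w->next = NULL;
	if (tail)
		tail->next = w;
	else
		head = w;
	tail = w;
	pthread_cond_signal(&more);
	pthread_mutex_unlock(&lock);
}

static void *toy_worker(void *unused)
{
	(void)unused;
	for (;;) {
		pthread_mutex_lock(&lock);
		while (!head && !stopping)
			pthread_cond_wait(&more, &lock);
		if (!head) {	/* queue drained and stopping set */
			pthread_mutex_unlock(&lock);
			return NULL;
		}
		struct toy_work *w = head;
		head = w->next;
		if (!head)
			tail = NULL;
		pthread_mutex_unlock(&lock);
		w->fn();	/* strictly FIFO: csum runs before endio */
	}
}

static bool csum_done_flag;
static bool endio_saw_csum_done;

static void csum_fn(void) { csum_done_flag = true; }
static void endio_fn(void) { endio_saw_csum_done = csum_done_flag; }

bool run_demo(void)
{
	struct toy_work csum = { .fn = csum_fn };
	struct toy_work endio = { .fn = endio_fn };
	pthread_t t;

	/* csum is queued at submission time, endio only once the bio
	 * completes, so csum is always ahead of endio in the queue. */
	toy_queue(&csum);
	toy_queue(&endio);

	pthread_create(&t, NULL, toy_worker, NULL);
	pthread_mutex_lock(&lock);
	stopping = true;
	pthread_cond_signal(&more);
	pthread_mutex_unlock(&lock);
	pthread_join(t, NULL);
	return endio_saw_csum_done;
}
```

Of course the real concern is an unrelated *blocked* worker ahead in the queue; the point here is only the ordering guarantee between the two works of the same bio.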
> >
> >>>> +       init_completion(&bbio->csum_done);
> >>>> +       bbio->async_csum = true;
> >>>> +       bbio->csum_saved_iter = bbio->bio.bi_iter;
> >>>> +       INIT_WORK(&bbio->csum_work, csum_one_bio_work);
> >>>> +       schedule_work(&bbio->csum_work);
> >>>
> >>> queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->csum_work);
> >>
> >> Nope, I even created a new workqueue for the csum work, and it doesn't
> >> work around the workqueue dependency check.
> >
> > Did you create it with the WQ_MEM_RECLAIM flag? As like:
> >
> > alloc_workqueue("btrfs-async-csum", ... | WQ_MEM_RECLAIM, ...);
>
> Yes.
>
> >
> > I don't see how that would trigger the warning. See below for a
> > detailed explanation.
>
> It's not the workqueue running the csum work, but the workqueue running
> the flush_work().

The wq running the flush_work() is in the warning condition. *But* the
early return checks the csum wq.
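
For reference, the check can be modeled like this (a userspace sketch of check_flush_dependency() from kernel/workqueue.c; the flag bit positions here are illustrative, not the kernel's exact values):

```c
#include <stdbool.h>

#define WQ_MEM_RECLAIM	(1U << 3)	/* illustrative bit positions */
#define __WQ_LEGACY	(1U << 18)

/*
 * Sketch of check_flush_dependency(): warn when a WQ_MEM_RECLAIM worker
 * flushes work on a workqueue that cannot guarantee forward progress.
 * The early return tests the *flushed* (target) workqueue; the WARN
 * condition tests the workqueue of the worker doing the flushing.
 */
static bool would_warn(unsigned int flusher_wq_flags,
		       unsigned int target_wq_flags)
{
	if (target_wq_flags & WQ_MEM_RECLAIM)
		return false;
	return (flusher_wq_flags & (WQ_MEM_RECLAIM | __WQ_LEGACY)) ==
	       WQ_MEM_RECLAIM;
}
```

So a btrfs workqueue with WQ_MEM_RECLAIM flushing csum work queued on the system wq warns, while flushing work queued on any WQ_MEM_RECLAIM wq does not, whichever wq the flusher runs on.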

> >
> >> The check inside check_flush_dependency() is:
> >>
> >>           WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
> >>                                 (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
> >>
> >> It means the flushing worker's workqueue cannot have the WQ_MEM_RECLAIM
> >> flag at all (unless it also has the __WQ_LEGACY flag).
> >
> > I understand that if the work is queued on a wq with WQ_MEM_RECLAIM,
> > the check_flush_dependency() returns early. Hence no need to worry
> > about the warning's condition as it's no longer of any concern.
>
> You can just try to use flush_work() and test it by yourself.
>
> I have tried my best to explain it, but it looks like it's at my limit.
>
> Just try a patch removing the csum_wait, and use flush_work() instead of
> wait_for_completion().
>
> Then you'll see the problem I'm hitting.

No luck. With the change below all fstests pass without any warning or
deadlocks whatsoever. Just as I expected.
I don't think there would be any impact on performance, but that needs
to be tested and confirmed.

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index aba452dd9904..02b40d6fa13f 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -108,7 +108,8 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
        ASSERT(in_task());

        if (bbio->async_csum)
-               wait_for_completion(&bbio->csum_done);
+               flush_work(&bbio->csum_work);

        bbio->bio.bi_status = status;
        if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index 5015e327dbd9..11eb8bcab94d 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -58,7 +58,6 @@ struct btrfs_bio {
                        struct btrfs_ordered_extent *ordered;
                        struct btrfs_ordered_sum *sums;
                        struct work_struct csum_work;
-                       struct completion csum_done;
                        struct bvec_iter csum_saved_iter;
                        u64 orig_physical;
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index d2ecd26727ac..f69fa943d3e0 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -790,7 +790,15 @@ static void csum_one_bio_work(struct work_struct *work)
        ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
        ASSERT(bbio->async_csum == true);
        csum_one_bio(bbio, &bbio->csum_saved_iter);
-       complete(&bbio->csum_done);
+}
+
+static struct workqueue_struct *btrfs_end_io_wq(struct btrfs_fs_info *fs_info,
+                                                struct bio *bio)
+{
+       if (bio->bi_opf & REQ_META)
+               return fs_info->endio_meta_workers;
+       return fs_info->endio_workers;
 }

 /*
@@ -823,12 +831,13 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async)
                csum_one_bio(bbio, &bbio->bio.bi_iter);
                return 0;
        }
-       init_completion(&bbio->csum_done);
        bbio->async_csum = true;
        bbio->csum_saved_iter = bbio->bio.bi_iter;
        INIT_WORK(&bbio->csum_work, csum_one_bio_work);
-       schedule_work(&bbio->csum_work);
+       queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->csum_work);
        return 0;
 }

> Thanks,
> Qu
>
> >
> >> So nope, the flush_work() idea won't work inside any current btrfs
> >> workqueue, which all have WQ_MEM_RECLAIM flag set.
> >
> > With the above being said, I still see two possible solutions.
> > Either using the btrfs_end_io_wq() as suggested before. It should be safe, IMO.
> > Or, if you're still worried about possible deadlocks, creating a new
> > dedicated wq which also has the WQ_MEM_RECLAIM set (which is needed
> > for compatibility with the context where we want to call the
> > flush_work()).
> >
> > Both ways could help getting rid of the completion in btrfs_bio, which
> > is 32 bytes by itself.
> >
> > What do I miss?
> >
> > Out of curiosity, flush_work() internally also uses completion in
> > pretty much exactly the same way as in this patch, but it's on the
> > caller's stack (in this case on the stack which would call the
> > btrfs_bio_end_io() modified with flush_work()). So in the end the
> > effect would be like moving the completion from btrfs_bio to a stack.
> >
> >> What we need is to make endio_workers and rmw_workers to get rid of
> >> WQ_MEM_RECLAIM, but that is a huge change and may not even work.
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>>> I'll keep digging to try to use flush_work() to remove the csum_done, as
> >>>> that will keep the size of btrfs_bio unchanged.
> >>>>
> >>>> Thanks,
> >>>> Qu
> >>>>
> >>>>>
> >>>>> --nX
> >>>>>
> >>>>
> >>>>>>
> >>>>
> >>
>

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum
  2025-11-12 21:43               ` Daniel Vacek
@ 2025-11-13  0:21                 ` Qu Wenruo
  2025-11-13  0:36                   ` Qu Wenruo
  0 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2025-11-13  0:21 UTC (permalink / raw)
  To: Daniel Vacek, Qu Wenruo; +Cc: linux-btrfs



在 2025/11/13 08:13, Daniel Vacek 写道:
[...]
>>>>>
>>>>> queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->csum_work);
>>>>
>>>> Nope, I even created a new workqueue for the csum work, and it doesn't
>>>> work around the workqueue dependency check.
>>>
>>> Did you create it with the WQ_MEM_RECLAIM flag? As like:
>>>
>>> alloc_workqueue("btrfs-async-csum", ... | WQ_MEM_RECLAIM, ...);
>>
>> Yes.
>>
>>>
>>> I don't see how that would trigger the warning. See below for a
>>> detailed explanation.
>>
>> It's not the workqueue running the csum work, but the workqueue running
>> the flush_work().
> 
> The wq running the flush_work() is in the warning condition. *But* the
> early return checks the csum wq.
> 
>>>
>>>> The check inside check_flush_dependency() is:
>>>>
>>>>            WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
>>>>                                  (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
>>>>
>>>> It means the flushing worker's workqueue cannot have the WQ_MEM_RECLAIM
>>>> flag at all (unless it also has the __WQ_LEGACY flag).
>>>
>>> I understand that if the work is queued on a wq with WQ_MEM_RECLAIM,
>>> the check_flush_dependency() returns early. Hence no need to worry
>>> about the warning's condition as it's no longer of any concern.
>>
>> You can just try to use flush_work() and test it by yourself.
>>
>> I have tried my best to explain it, but it looks like it's at my limit.
>>
>> Just try a patch removing the csum_wait, and use flush_work() instead of
>> wait_for_completion().
>>
>> Then you'll see the problem I'm hitting.
> 
> No luck. With the change below all fstests pass without any warning or
> deadlocks whatsoever. Just as I expected.
> I don't think there would be any impact on performance, but that needs
> to be tested and confirmed.
> 
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index aba452dd9904..02b40d6fa13f 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -108,7 +108,8 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
>          ASSERT(in_task());
> 
>          if (bbio->async_csum)
> -               wait_for_completion(&bbio->csum_done);
> +               flush_work(&bbio->csum_work);
> 
>          bbio->bio.bi_status = status;
>          if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
> diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
> index 5015e327dbd9..11eb8bcab94d 100644
> --- a/fs/btrfs/bio.h
> +++ b/fs/btrfs/bio.h
> @@ -58,7 +58,6 @@ struct btrfs_bio {
>                          struct btrfs_ordered_extent *ordered;
>                          struct btrfs_ordered_sum *sums;
>                          struct work_struct csum_work;
> -                       struct completion csum_done;
>                          struct bvec_iter csum_saved_iter;
>                          u64 orig_physical;
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index d2ecd26727ac..f69fa943d3e0 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -790,7 +790,15 @@ static void csum_one_bio_work(struct work_struct *work)
>          ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
>          ASSERT(bbio->async_csum == true);
>          csum_one_bio(bbio, &bbio->csum_saved_iter);
> -       complete(&bbio->csum_done);
> +}
> +
> +static struct workqueue_struct *btrfs_end_io_wq(struct btrfs_fs_info *fs_info,
> +                                                struct bio *bio)
> +{
> +       if (bio->bi_opf & REQ_META)
> +               return fs_info->endio_meta_workers;
> +       return fs_info->endio_workers;

For btrfs_csum_one_bio() it's definitely data IO, thus you can go with 
endio_workers directly.
>   }
> 
>   /*
> @@ -823,12 +831,13 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio, bool async)
>                  csum_one_bio(bbio, &bbio->bio.bi_iter);
>                  return 0;
>          }
> -       init_completion(&bbio->csum_done);
>          bbio->async_csum = true;
>          bbio->csum_saved_iter = bbio->bio.bi_iter;
>          INIT_WORK(&bbio->csum_work, csum_one_bio_work);
> -       schedule_work(&bbio->csum_work);
> +       queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->csum_work);

This is what is causing the difference: you're calling flush_work() 
inside the same workqueue, at least in most cases.

However I'm still not 100% sure if it's fine, as for RAID56 the endio 
function is called inside rmw_workers, not the same as the regular 
endio_workers, causing the same cross-wq flush_work() call, which can 
still lead to the warning.

Personally speaking, I'd prefer not to bother with the cross-wq 
situations for flush_work() for now, as that is a completely new rabbit hole.

Feel free to send out a dedicated patch removing btrfs_bio::csum_done, 
and we can continue the discussion there.

Thanks,
Qu

>          return 0;
>   }
> 
>> Thanks,
>> Qu
>>
>>>
>>>> So nope, the flush_work() idea won't work inside any current btrfs
>>>> workqueue, which all have WQ_MEM_RECLAIM flag set.
>>>
>>> With the above being said, I still see two possible solutions.
>>> Either using the btrfs_end_io_wq() as suggested before. It should be safe, IMO.
>>> Or, if you're still worried about possible deadlocks, creating a new
>>> dedicated wq which also has the WQ_MEM_RECLAIM set (which is needed
>>> for compatibility with the context where we want to call the
>>> flush_work()).
>>>
>>> Both ways could help getting rid of the completion in btrfs_bio, which
>>> is 32 bytes by itself.
>>>
>>> What do I miss?
>>>
>>> Out of curiosity, flush_work() internally also uses completion in
>>> pretty much exactly the same way as in this patch, but it's on the
>>> caller's stack (in this case on the stack which would call the
>>> btrfs_bio_end_io() modified with flush_work()). So in the end the
>>> effect would be like moving the completion from btrfs_bio to a stack.
>>>
>>>> What we need is to make endio_workers and rmw_workers to get rid of
>>>> WQ_MEM_RECLAIM, but that is a huge change and may not even work.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>>> I'll keep digging to try to use flush_work() to remove the csum_done, as
>>>>>> that will keep the size of btrfs_bio unchanged.
>>>>>>
>>>>>> Thanks,
>>>>>> Qu
>>>>>>
>>>>>>>
>>>>>>> --nX
>>>>>>>
>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum
  2025-11-13  0:21                 ` Qu Wenruo
@ 2025-11-13  0:36                   ` Qu Wenruo
  0 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2025-11-13  0:36 UTC (permalink / raw)
  To: Qu Wenruo, Daniel Vacek; +Cc: linux-btrfs



在 2025/11/13 10:51, Qu Wenruo 写道:
> 
> 
> 在 2025/11/13 08:13, Daniel Vacek 写道:

>> -       init_completion(&bbio->csum_done);
>>          bbio->async_csum = true;
>>          bbio->csum_saved_iter = bbio->bio.bi_iter;
>>          INIT_WORK(&bbio->csum_work, csum_one_bio_work);
>> -       schedule_work(&bbio->csum_work);
>> +       queue_work(btrfs_end_io_wq(fs_info, bio), &bbio->csum_work);
> 
> This is what is causing the difference: you're calling flush_work()
> inside the same workqueue, at least in most cases.
> 
> However I'm still not 100% sure if it's fine, as for RAID56 the endio 
> function is called inside rmw_workers, not the same as the regular 
> endio_workers, causing the same cross-wq flush_work() call, which can 
> still lead to the warning.

Well, for RAID56 the rmw_workers still have the WQ_MEM_RECLAIM flag, 
thus it will still be fine.

It's really the system-wide workqueues without that flag that cause 
the warning.

> 
> Personally speaking, I'd prefer not to bother with the cross-wq
> situations for flush_work() for now, as that is a completely new rabbit hole.
> 
> Feel free to send out a dedicated patch removing btrfs_bio::csum_done, 
> and we can continue the discussion there.

Still, you found the fix and you deserve the patch.

Thanks,
Qu

> 
> Thanks,
> Qu
> 
>>          return 0;
>>   }
>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>>> So nope, the flush_work() idea won't work inside any current btrfs
>>>>> workqueue, which all have WQ_MEM_RECLAIM flag set.
>>>>
>>>> With the above being said, I still see two possible solutions.
>>>> Either using the btrfs_end_io_wq() as suggested before. It should be 
>>>> safe, IMO.
>>>> Or, if you're still worried about possible deadlocks, creating a new
>>>> dedicated wq which also has the WQ_MEM_RECLAIM set (which is needed
>>>> for compatibility with the context where we want to call the
>>>> flush_work()).
>>>>
>>>> Both ways could help getting rid of the completion in btrfs_bio, which
>>>> is 32 bytes by itself.
>>>>
>>>> What do I miss?
>>>>
>>>> Out of curiosity, flush_work() internally also uses completion in
>>>> pretty much exactly the same way as in this patch, but it's on the
>>>> caller's stack (in this case on the stack which would call the
>>>> btrfs_bio_end_io() modified with flush_work()). So in the end the
>>>> effect would be like moving the completion from btrfs_bio to a stack.
>>>>
>>>>> What we need is to make endio_workers and rmw_workers to get rid of
>>>>> WQ_MEM_RECLAIM, but that is a huge change and may not even work.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>
>>>>>>
>>>>>>> I'll keep digging to try to use flush_work() to remove the 
>>>>>>> csum_done, as
>>>>>>> that will keep the size of btrfs_bio unchanged.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Qu
>>>>>>>
>>>>>>>>
>>>>>>>> --nX
>>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-11-13  0:37 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-29  5:34 [PATCH v2 0/7] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
2025-10-29  5:34 ` [PATCH v2 1/7] btrfs: replace BTRFS_MAX_BIO_SECTORS with BIO_MAX_VECS Qu Wenruo
2025-10-29  5:34 ` [PATCH v2 2/7] btrfs: headers cleanup to remove unnecessary local includes Qu Wenruo
2025-10-29  5:34 ` [PATCH v2 3/7] btrfs: remove btrfs_bio::fs_info by extracting it from btrfs_bio::inode Qu Wenruo
2025-10-29  5:34 ` [PATCH v2 4/7] btrfs: make sure all btrfs_bio::end_io is called in task context Qu Wenruo
2025-10-29  5:34 ` [PATCH v2 5/7] btrfs: remove btrfs_fs_info::compressed_write_workers Qu Wenruo
2025-10-29  5:34 ` [PATCH v2 6/7] btrfs: relax btrfs_inode::ordered_tree_lock Qu Wenruo
2025-10-29  5:34 ` [PATCH v2 7/7] btrfs: introduce btrfs_bio::async_csum Qu Wenruo
2025-11-10 18:40   ` Daniel Vacek
2025-11-10 21:05     ` Qu Wenruo
2025-11-10 21:45       ` Qu Wenruo
2025-11-11 12:38       ` Daniel Vacek
2025-11-11 20:33         ` Qu Wenruo
2025-11-12 10:46           ` Daniel Vacek
2025-11-12 20:11             ` Qu Wenruo
2025-11-12 21:43               ` Daniel Vacek
2025-11-13  0:21                 ` Qu Wenruo
2025-11-13  0:36                   ` Qu Wenruo
2025-11-11 12:41   ` Daniel Vacek
2025-11-11 20:38     ` Qu Wenruo
2025-11-12  7:12       ` Daniel Vacek
2025-11-04  5:37 ` [PATCH v2 0/7] " David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox