linux-btrfs.vger.kernel.org archive mirror
* [PATCH v2 0/5] btrfs: bs > ps support preparation
@ 2025-09-02  9:02 Qu Wenruo
  2025-09-02  9:02 ` [PATCH v2 1/5] btrfs: support all block sizes which are no larger than page size Qu Wenruo
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Qu Wenruo @ 2025-09-02  9:02 UTC (permalink / raw)
  To: linux-btrfs

One of the blockers for bs > ps support is the conflict between the
single-page bvec iterators/helpers (like memzero_bvec(),
bio_for_each_segment(), etc.) and large folios (with highmem support).

For bs > ps support, all folios will have a minimal order, so that
each folio covers at least one block. This saves the filesystem the
hassle of handling sub-block contents.

However, those single-page bvec iterators/helpers can only handle a
bvec that is no larger than a page.

To address the conflicting features, take a completely different
approach to handling a fs block:

- Use phys_addr_t to represent a block inside a bio
  So we won't need to bother with the single-page bvec helpers; just
  pass a single paddr around.

- Do proper highmem handling for checksum generation/verification
  Now we grab the folio from the paddr, and make sure the folio
  covers at least one block starting at the paddr.

  If the folio is highmem, do proper per-page
  kmap_local_folio()/kunmap_local() to handle it.
  Otherwise do a full block csum calculation in one go.

  This should bring no extra overhead except the paddr->folio
  conversion (which should be really tiny): on systems without
  HIGHMEM, folio_test_partial_kmap() always returns false, and the
  HIGHMEM path is optimized out by the compiler completely.

  Unfortunately I don't have a 32-bit VM at hand to test with.

- Introduce extra macros to iterate the blocks inside a bio
  The first, btrfs_bio_for_each_block(), starts at the specified
  bio_iter.
  The other, btrfs_bio_for_each_block_all(), goes through all blocks
  in the bio.

  Both return a @paddr representing a block. Callers either use
  paddr-based helpers like
  btrfs_calculate_block_csum()/btrfs_check_block_csum(), or are
  RAID56 code, which already uses paddr. (A condensed usage sketch
  follows this list.)

  For now the macros are only utilized by btrfs; bcachefs has a
  similar helper, and that was my inspiration.

  I hope one day they can be promoted to bio.h.
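
To make this concrete, here is a condensed sketch of the read-time
verification loop, based on the btrfs_check_read_bio() hunk in patch 3
(error propagation and repair bookkeeping trimmed):

	phys_addr_t paddr;
	u32 offset = 0;

	/* One fs block per iteration; safe for highmem and large folios. */
	btrfs_bio_for_each_block(paddr, &bbio->bio, iter, fs_info->sectorsize) {
		if (!btrfs_data_csum_ok(bbio, dev, offset, paddr))
			fbio = repair_one_sector(bbio, offset, paddr, fbio);
		offset += fs_info->sectorsize;
	}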

With all those preparations done, btrfs can now support basic file
operations with bs > ps, but still with quite a few limitations:

- No compression support
  The compressed folios must be allocated using the minimal folio
  order, as btrfs_calculate_block_csum() requires the minimal folio
  size.

- No RAID56 support
- No scrub support
  As with compression, we currently allocate the folios at page size.
  Although the raid56 code now uses the btrfs_bio_for_each_block*()
  helpers, the underlying folio sizes still need updating.

[Changelog]
v2:
- Use paddr to represent a block inside a bio
  This makes a lot of the data checksum handling much easier to read,
  and makes csum verification/generation properly follow the highmem
  helpers, by doing the kmap inside the helper instead of requiring
  the callers to do it.

- Fix a compiler warning when caching the max and min folio orders
  Remove the fs_info local variable, as it is not utilized in
  non-experimental builds.

Qu Wenruo (5):
  btrfs: support all block sizes which are no larger than page size
  btrfs: concentrate highmem handling for data verification
  btrfs: introduce btrfs_bio_for_each_block() helper
  btrfs: introduce btrfs_bio_for_each_block_all() helper
  btrfs: cache max and min order inside btrfs_fs_info

 fs/btrfs/bio.c         | 22 +++++++-------
 fs/btrfs/btrfs_inode.h | 14 +++++----
 fs/btrfs/disk-io.c     |  2 ++
 fs/btrfs/file-item.c   | 26 ++++-------------
 fs/btrfs/fs.c          |  4 +++
 fs/btrfs/fs.h          |  2 ++
 fs/btrfs/inode.c       | 59 ++++++++++++++++++++++++++------------
 fs/btrfs/misc.h        | 49 +++++++++++++++++++++++++++++++
 fs/btrfs/raid56.c      | 65 ++++++++++++++++--------------------------
 fs/btrfs/scrub.c       | 18 ++++++++++--
 10 files changed, 162 insertions(+), 99 deletions(-)

-- 
2.50.1



* [PATCH v2 1/5] btrfs: support all block sizes which are no larger than page size
  2025-09-02  9:02 [PATCH v2 0/5] btrfs: bs > ps support preparation Qu Wenruo
@ 2025-09-02  9:02 ` Qu Wenruo
  2025-09-02  9:02 ` [PATCH v2 2/5] btrfs: concentrate highmem handling for data verification Qu Wenruo
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2025-09-02  9:02 UTC (permalink / raw)
  To: linux-btrfs

Currently, if block size < page size, btrfs supports only a single
configuration: 4K.

This is mostly to reduce the number of test configurations, as 4K is
going to be the default block size for all architectures.

However, all other major filesystems have no artificial limits on the
supported block size, and some already support block size > page
size.

Since btrfs subpage block support has been around for a long time,
it's time to enable support for all block sizes <= page size.

So enable support for all block sizes no larger than page size, for
experimental builds.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/fs.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/fs.c b/fs/btrfs/fs.c
index 335209fe3734..014fb8b12f96 100644
--- a/fs/btrfs/fs.c
+++ b/fs/btrfs/fs.c
@@ -78,6 +78,10 @@ bool __attribute_const__ btrfs_supported_blocksize(u32 blocksize)
 
 	if (blocksize == PAGE_SIZE || blocksize == SZ_4K || blocksize == BTRFS_MIN_BLOCKSIZE)
 		return true;
+#ifdef CONFIG_BTRFS_EXPERIMENTAL
+	if (blocksize <= PAGE_SIZE)
+		return true;
+#endif
 	return false;
 }
 
-- 
2.50.1



* [PATCH v2 2/5] btrfs: concentrate highmem handling for data verification
  2025-09-02  9:02 [PATCH v2 0/5] btrfs: bs > ps support preparation Qu Wenruo
  2025-09-02  9:02 ` [PATCH v2 1/5] btrfs: support all block sizes which are no larger than page size Qu Wenruo
@ 2025-09-02  9:02 ` Qu Wenruo
  2025-09-02  9:02 ` [PATCH v2 3/5] btrfs: introduce btrfs_bio_for_each_block() helper Qu Wenruo
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2025-09-02  9:02 UTC (permalink / raw)
  To: linux-btrfs

Currently, btrfs checksum verification is done in the following
pattern:

	kaddr = kmap_local_*();
	ret = btrfs_check_sector_csum(kaddr);
	kunmap_local(kaddr);

That's OK for now, but it doesn't follow the pattern of the helpers
inside linux/highmem.h, which never require a virtual memory address.

Those highmem helpers mostly accept a folio plus some offset/length
inside the folio, and internally check whether the folio needs a
partial kmap and handle it.

Inspired by those formal highmem helpers, enhance the highmem handling
of data checksum verification by:

- Rename btrfs_check_sector_csum() to btrfs_check_block_csum()
  To follow the more common term "block" used in all other major
  filesystems.

- Pass a physical address into btrfs_check_block_csum() and
  btrfs_data_csum_ok()
  The physical address is always available, even for a highmem page,
  since it is (page frame number << PAGE_SHIFT) + offset in page.

  With that physical address, we can grab the folio covering the
  page, and do extra checks to ensure it covers at least one block.

  This also allows us to do the kmap inside btrfs_check_block_csum().
  This means all the extra HIGHMEM handling is concentrated in
  btrfs_check_block_csum(), and no caller needs to bother with
  highmem by itself.

- Properly zero out the block if the csum mismatches
  Since btrfs_data_csum_ok() only gets a paddr, we cannot and should
  not use memzero_bvec(), which only accepts a single-page bvec.
  Instead, use the paddr to grab the folio and call
  folio_zero_range() (see the sketch below).
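
For illustration, a minimal sketch of the paddr round trip (using the
same helpers as the diff below, and assuming the block is fully
contained in one folio):

	/* A bvec's physical address, as bvec_phys() computes it. */
	phys_addr_t paddr = page_to_phys(bv->bv_page) + bv->bv_offset;

	/* And back from the paddr to the folio covering the block. */
	struct folio *folio = page_folio(phys_to_page(paddr));
	size_t foff = offset_in_folio(folio, paddr);

	ASSERT(foff + blocksize <= folio_size(folio));
	folio_zero_range(folio, foff, blocksize);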

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/bio.c         |  4 ++--
 fs/btrfs/btrfs_inode.h |  6 +++---
 fs/btrfs/inode.c       | 47 +++++++++++++++++++++++++++++-------------
 fs/btrfs/raid56.c      | 13 +++---------
 fs/btrfs/scrub.c       | 18 ++++++++++++++--
 5 files changed, 57 insertions(+), 31 deletions(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index ea7f7a17a3d5..493135bfa518 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -167,7 +167,7 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
 	int mirror = repair_bbio->mirror_num;
 
 	if (repair_bbio->bio.bi_status ||
-	    !btrfs_data_csum_ok(repair_bbio, dev, 0, bv)) {
+	    !btrfs_data_csum_ok(repair_bbio, dev, 0, bvec_phys(bv))) {
 		bio_reset(&repair_bbio->bio, NULL, REQ_OP_READ);
 		repair_bbio->bio.bi_iter = repair_bbio->saved_iter;
 
@@ -280,7 +280,7 @@ static void btrfs_check_read_bio(struct btrfs_bio *bbio, struct btrfs_device *de
 		struct bio_vec bv = bio_iter_iovec(&bbio->bio, *iter);
 
 		bv.bv_len = min(bv.bv_len, sectorsize);
-		if (status || !btrfs_data_csum_ok(bbio, dev, offset, &bv))
+		if (status || !btrfs_data_csum_ok(bbio, dev, offset, bvec_phys(&bv)))
 			fbio = repair_one_sector(bbio, offset, &bv, fbio);
 
 		bio_advance_iter_single(&bbio->bio, iter, sectorsize);
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index df3445448b7d..077b2f178816 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -542,10 +542,10 @@ static inline void btrfs_set_inode_mapping_order(struct btrfs_inode *inode)
 #define CSUM_FMT				"0x%*phN"
 #define CSUM_FMT_VALUE(size, bytes)		size, bytes
 
-int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, void *kaddr, u8 *csum,
-			    const u8 * const csum_expected);
+int btrfs_check_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr, u8 *csum,
+			   const u8 * const csum_expected);
 bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
-			u32 bio_offset, struct bio_vec *bv);
+			u32 bio_offset, phys_addr_t paddr);
 noinline int can_nocow_extent(struct btrfs_inode *inode, u64 offset, u64 *len,
 			      struct btrfs_file_extent *file_extent,
 			      bool nowait);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3cd9b505bd25..33bc1f51a8eb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3334,13 +3334,35 @@ int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered)
  *
  * @kaddr must be a properly kmapped address.
  */
-int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, void *kaddr, u8 *csum,
-			    const u8 * const csum_expected)
+int btrfs_check_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr, u8 *csum,
+			   const u8 * const csum_expected)
 {
+	struct folio *folio = page_folio(phys_to_page(paddr));
+	const u32 blocksize = fs_info->sectorsize;
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
 
 	shash->tfm = fs_info->csum_shash;
-	crypto_shash_digest(shash, kaddr, fs_info->sectorsize, csum);
+	/* The full block must be inside the folio. */
+	ASSERT(offset_in_folio(folio, paddr) + blocksize <= folio_size(folio));
+
+	if (folio_test_partial_kmap(folio)) {
+		size_t cur = paddr;
+
+		crypto_shash_init(shash);
+		while (cur < paddr + blocksize) {
+			void *kaddr;
+			size_t len = min(paddr + blocksize - cur,
+					 PAGE_SIZE - offset_in_page(cur));
+
+			kaddr = kmap_local_folio(folio, offset_in_folio(folio, cur));
+			crypto_shash_update(shash, kaddr, len);
+			kunmap_local(kaddr);
+			cur += len;
+		}
+		crypto_shash_final(shash, csum);
+	} else {
+		crypto_shash_digest(shash, phys_to_virt(paddr), blocksize, csum);
+	}
 
 	if (memcmp(csum, csum_expected, fs_info->csum_size))
 		return -EIO;
@@ -3361,17 +3383,16 @@ int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, void *kaddr, u8 *csum
  * Return %true if the sector is ok or had no checksum to start with, else %false.
  */
 bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
-			u32 bio_offset, struct bio_vec *bv)
+			u32 bio_offset, phys_addr_t paddr)
 {
 	struct btrfs_inode *inode = bbio->inode;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	const u32 blocksize = fs_info->sectorsize;
+	struct folio *folio;
 	u64 file_offset = bbio->file_offset + bio_offset;
-	u64 end = file_offset + bv->bv_len - 1;
+	u64 end = file_offset + blocksize - 1;
 	u8 *csum_expected;
 	u8 csum[BTRFS_CSUM_SIZE];
-	void *kaddr;
-
-	ASSERT(bv->bv_len == fs_info->sectorsize);
 
 	if (!bbio->csum)
 		return true;
@@ -3387,12 +3408,8 @@ bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
 
 	csum_expected = bbio->csum + (bio_offset >> fs_info->sectorsize_bits) *
 				fs_info->csum_size;
-	kaddr = bvec_kmap_local(bv);
-	if (btrfs_check_sector_csum(fs_info, kaddr, csum, csum_expected)) {
-		kunmap_local(kaddr);
+	if (btrfs_check_block_csum(fs_info, paddr, csum, csum_expected))
 		goto zeroit;
-	}
-	kunmap_local(kaddr);
 	return true;
 
 zeroit:
@@ -3400,7 +3417,9 @@ bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
 				    bbio->mirror_num);
 	if (dev)
 		btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_CORRUPTION_ERRS);
-	memzero_bvec(bv);
+	folio = page_folio(phys_to_page(paddr));
+	ASSERT(offset_in_folio(folio, paddr) + blocksize <= folio_size(folio));
+	folio_zero_range(folio, offset_in_folio(folio, paddr), blocksize);
 	return false;
 }
 
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 3ff2bedfb3a4..e88699460dda 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1585,9 +1585,6 @@ static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
 		return;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		void *kaddr;
-
-		kaddr = bvec_kmap_local(bvec);
 		for (u32 off = 0; off < bvec->bv_len;
 		     off += fs_info->sectorsize, total_sector_nr++) {
 			u8 csum_buf[BTRFS_CSUM_SIZE];
@@ -1599,12 +1596,11 @@ static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
 			if (!test_bit(total_sector_nr, rbio->csum_bitmap))
 				continue;
 
-			ret = btrfs_check_sector_csum(fs_info, kaddr + off,
-						      csum_buf, expected_csum);
+			ret = btrfs_check_block_csum(fs_info, bvec_phys(bvec) + off,
+						     csum_buf, expected_csum);
 			if (ret < 0)
 				set_bit(total_sector_nr, rbio->error_bitmap);
 		}
-		kunmap_local(kaddr);
 	}
 }
 
@@ -1802,7 +1798,6 @@ static int verify_one_sector(struct btrfs_raid_bio *rbio,
 	struct sector_ptr *sector;
 	u8 csum_buf[BTRFS_CSUM_SIZE];
 	u8 *csum_expected;
-	void *kaddr;
 	int ret;
 
 	if (!rbio->csum_bitmap || !rbio->csum_buf)
@@ -1824,9 +1819,7 @@ static int verify_one_sector(struct btrfs_raid_bio *rbio,
 	csum_expected = rbio->csum_buf +
 			(stripe_nr * rbio->stripe_nsectors + sector_nr) *
 			fs_info->csum_size;
-	kaddr = kmap_local_sector(sector);
-	ret = btrfs_check_sector_csum(fs_info, kaddr, csum_buf, csum_expected);
-	kunmap_local(kaddr);
+	ret = btrfs_check_block_csum(fs_info, sector->paddr, csum_buf, csum_expected);
 	return ret;
 }
 
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index d86020ace69c..0b993e7273d3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -696,6 +696,20 @@ static void *scrub_stripe_get_kaddr(struct scrub_stripe *stripe, int sector_nr)
 	return page_address(page) + offset_in_page(offset);
 }
 
+static phys_addr_t scrub_stripe_get_paddr(struct scrub_stripe *stripe, int sector_nr)
+{
+	struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
+	u32 offset = (sector_nr << fs_info->sectorsize_bits);
+	const struct page *page = stripe->pages[offset >> PAGE_SHIFT];
+
+	/* stripe->pages[] is allocated by us and no highmem is allowed. */
+	ASSERT(page);
+	ASSERT(!PageHighMem(page));
+	/* And the range must be contained inside the page. */
+	ASSERT(offset_in_page(offset) + fs_info->sectorsize <= PAGE_SIZE);
+	return page_to_phys(page) + offset_in_page(offset);
+}
+
 static void scrub_verify_one_metadata(struct scrub_stripe *stripe, int sector_nr)
 {
 	struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
@@ -788,7 +802,7 @@ static void scrub_verify_one_sector(struct scrub_stripe *stripe, int sector_nr)
 	struct btrfs_fs_info *fs_info = stripe->bg->fs_info;
 	struct scrub_sector_verification *sector = &stripe->sectors[sector_nr];
 	const u32 sectors_per_tree = fs_info->nodesize >> fs_info->sectorsize_bits;
-	void *kaddr = scrub_stripe_get_kaddr(stripe, sector_nr);
+	phys_addr_t paddr = scrub_stripe_get_paddr(stripe, sector_nr);
 	u8 csum_buf[BTRFS_CSUM_SIZE];
 	int ret;
 
@@ -833,7 +847,7 @@ static void scrub_verify_one_sector(struct scrub_stripe *stripe, int sector_nr)
 		return;
 	}
 
-	ret = btrfs_check_sector_csum(fs_info, kaddr, csum_buf, sector->csum);
+	ret = btrfs_check_block_csum(fs_info, paddr, csum_buf, sector->csum);
 	if (ret < 0) {
 		scrub_bitmap_set_bit_csum_error(stripe, sector_nr);
 		scrub_bitmap_set_bit_error(stripe, sector_nr);
-- 
2.50.1



* [PATCH v2 3/5] btrfs: introduce btrfs_bio_for_each_block() helper
  2025-09-02  9:02 [PATCH v2 0/5] btrfs: bs > ps support preparation Qu Wenruo
  2025-09-02  9:02 ` [PATCH v2 1/5] btrfs: support all block sizes which are no larger than page size Qu Wenruo
  2025-09-02  9:02 ` [PATCH v2 2/5] btrfs: concentrate highmem handling for data verification Qu Wenruo
@ 2025-09-02  9:02 ` Qu Wenruo
  2025-09-02  9:02 ` [PATCH v2 4/5] btrfs: introduce btrfs_bio_for_each_block_all() helper Qu Wenruo
  2025-09-02  9:02 ` [PATCH v2 5/5] btrfs: cache max and min order inside btrfs_fs_info Qu Wenruo
  4 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2025-09-02  9:02 UTC (permalink / raw)
  To: linux-btrfs

Currently, if we want to iterate a bio in block units, we do something
like this:

	while (iter->bi_size) {
		struct bio_vec bv = bio_iter_iovec(&bbio->bio, *iter);

		/* Do something with the bv */

		bio_advance_iter_single(&bbio->bio, iter, sectorsize);
	}

That's fine for now, but it will not handle the future bs > ps case,
as bio_iter_iovec() returns a single-page bvec, meaning bv_len will
never exceed page size.

This means code using that bv can only handle a block if bs <= ps.

To address this problem and handle future bs > ps cases better:

- Introduce a helper btrfs_bio_for_each_block()
  Instead of bio_vec, which has single-page and multi-page versions
  (and the multi-page version has quite a few limits), use my
  favorite way to represent a block: phys_addr_t.

  For bs <= ps cases, nothing changes, except for a very small
  overhead to convert the phys_addr_t to a folio, then use the proper
  folio helpers to handle possible highmem cases.

  For bs > ps cases, all blocks will be backed by large folios,
  meaning every folio covers at least one block. Still use the proper
  folio helpers to handle highmem cases.

  With phys_addr_t, we handle both large folios and highmem properly,
  so there is no better single variable to represent a btrfs block.

- Extract the data block csum calculation into a helper
  The new helper, btrfs_calculate_block_csum(), will be utilized by
  btrfs_csum_one_bio().

- Use btrfs_bio_for_each_block() to replace existing call sites
  Including:

  * index_one_bio() from raid56.c
    Very straightforward.

  * btrfs_check_read_bio()
    Also update repair_one_sector() to grab the folio using the
    phys_addr_t, and do extra checks to make sure the folio covers at
    least one block.
    We do not need to bother with bv_len at all now.

  * btrfs_csum_one_bio()
    Now we can move the highmem handling into the dedicated helper
    btrfs_calculate_block_csum() and use the
    btrfs_bio_for_each_block() helper (see the sketch below).

There is one exception, btrfs_decompress_buf2page(), which copies
decompressed data into the original bio; it does not iterate in block
size units, so we don't need to bother with it.
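
For reference, a condensed sketch of the resulting checksum loop
(mirroring the btrfs_csum_one_bio() hunk below):

	struct bvec_iter iter = bio->bi_iter;	/* iterate a local copy */
	phys_addr_t paddr;
	int index = 0;

	btrfs_bio_for_each_block(paddr, bio, &iter, fs_info->sectorsize) {
		btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
		index += fs_info->csum_size;
	}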

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/bio.c         | 20 +++++++++-----------
 fs/btrfs/btrfs_inode.h |  2 ++
 fs/btrfs/file-item.c   | 26 ++++++--------------------
 fs/btrfs/inode.c       | 26 +++++++++++++++-----------
 fs/btrfs/misc.h        | 25 +++++++++++++++++++++++++
 fs/btrfs/raid56.c      |  7 +++----
 6 files changed, 60 insertions(+), 46 deletions(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 493135bfa518..909b208f9ef3 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -204,18 +204,21 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
  */
 static struct btrfs_failed_bio *repair_one_sector(struct btrfs_bio *failed_bbio,
 						  u32 bio_offset,
-						  struct bio_vec *bv,
+						  phys_addr_t paddr,
 						  struct btrfs_failed_bio *fbio)
 {
 	struct btrfs_inode *inode = failed_bbio->inode;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct folio *folio = page_folio(phys_to_page(paddr));
 	const u32 sectorsize = fs_info->sectorsize;
+	const u32 foff = offset_in_folio(folio, paddr);
 	const u64 logical = (failed_bbio->saved_iter.bi_sector << SECTOR_SHIFT);
 	struct btrfs_bio *repair_bbio;
 	struct bio *repair_bio;
 	int num_copies;
 	int mirror;
 
+	ASSERT(foff + sectorsize <= folio_size(folio));
 	btrfs_debug(fs_info, "repair read error: read error at %llu",
 		    failed_bbio->file_offset + bio_offset);
 
@@ -238,7 +241,7 @@ static struct btrfs_failed_bio *repair_one_sector(struct btrfs_bio *failed_bbio,
 	repair_bio = bio_alloc_bioset(NULL, 1, REQ_OP_READ, GFP_NOFS,
 				      &btrfs_repair_bioset);
 	repair_bio->bi_iter.bi_sector = failed_bbio->saved_iter.bi_sector;
-	__bio_add_page(repair_bio, bv->bv_page, bv->bv_len, bv->bv_offset);
+	bio_add_folio_nofail(repair_bio, folio, sectorsize, foff);
 
 	repair_bbio = btrfs_bio(repair_bio);
 	btrfs_bio_init(repair_bbio, fs_info, NULL, fbio);
@@ -259,6 +262,7 @@ static void btrfs_check_read_bio(struct btrfs_bio *bbio, struct btrfs_device *de
 	struct bvec_iter *iter = &bbio->saved_iter;
 	blk_status_t status = bbio->bio.bi_status;
 	struct btrfs_failed_bio *fbio = NULL;
+	phys_addr_t paddr;
 	u32 offset = 0;
 
 	/* Read-repair requires the inode field to be set by the submitter. */
@@ -276,17 +280,11 @@ static void btrfs_check_read_bio(struct btrfs_bio *bbio, struct btrfs_device *de
 	/* Clear the I/O error. A failed repair will reset it. */
 	bbio->bio.bi_status = BLK_STS_OK;
 
-	while (iter->bi_size) {
-		struct bio_vec bv = bio_iter_iovec(&bbio->bio, *iter);
-
-		bv.bv_len = min(bv.bv_len, sectorsize);
-		if (status || !btrfs_data_csum_ok(bbio, dev, offset, bvec_phys(&bv)))
-			fbio = repair_one_sector(bbio, offset, &bv, fbio);
-
-		bio_advance_iter_single(&bbio->bio, iter, sectorsize);
+	btrfs_bio_for_each_block(paddr, &bbio->bio, iter, fs_info->sectorsize) {
+		if (status || !btrfs_data_csum_ok(bbio, dev, offset, paddr))
+			fbio = repair_one_sector(bbio, offset, paddr, fbio);
 		offset += sectorsize;
 	}
-
 	if (bbio->csum != bbio->csum_inline)
 		kfree(bbio->csum);
 
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 077b2f178816..c40e99ec13bf 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -542,6 +542,8 @@ static inline void btrfs_set_inode_mapping_order(struct btrfs_inode *inode)
 #define CSUM_FMT				"0x%*phN"
 #define CSUM_FMT_VALUE(size, bytes)		size, bytes
 
+void btrfs_calculate_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr,
+				u8 *dest);
 int btrfs_check_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr, u8 *csum,
 			   const u8 * const csum_expected);
 bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 4dd3d8a02519..7906aea75ee4 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -775,12 +775,10 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio)
 	SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
 	struct bio *bio = &bbio->bio;
 	struct btrfs_ordered_sum *sums;
-	char *data;
-	struct bvec_iter iter;
-	struct bio_vec bvec;
+	struct bvec_iter iter = bio->bi_iter;
+	phys_addr_t paddr;
+	const u32 blocksize = fs_info->sectorsize;
 	int index;
-	unsigned int blockcount;
-	int i;
 	unsigned nofs_flag;
 
 	nofs_flag = memalloc_nofs_save();
@@ -799,21 +797,9 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio)
 
 	shash->tfm = fs_info->csum_shash;
 
-	bio_for_each_segment(bvec, bio, iter) {
-		blockcount = BTRFS_BYTES_TO_BLKS(fs_info,
-						 bvec.bv_len + fs_info->sectorsize
-						 - 1);
-
-		for (i = 0; i < blockcount; i++) {
-			data = bvec_kmap_local(&bvec);
-			crypto_shash_digest(shash,
-					    data + (i * fs_info->sectorsize),
-					    fs_info->sectorsize,
-					    sums->sums + index);
-			kunmap_local(data);
-			index += fs_info->csum_size;
-		}
-
+	btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
+		btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
+		index += fs_info->csum_size;
 	}
 
 	bbio->sums = sums;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 33bc1f51a8eb..ad876779289e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3328,14 +3328,8 @@ int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered)
 	return btrfs_finish_one_ordered(ordered);
 }
 
-/*
- * Verify the checksum for a single sector without any extra action that depend
- * on the type of I/O.
- *
- * @kaddr must be a properly kmapped address.
- */
-int btrfs_check_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr, u8 *csum,
-			   const u8 * const csum_expected)
+void btrfs_calculate_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr,
+				u8 *dest)
 {
 	struct folio *folio = page_folio(phys_to_page(paddr));
 	const u32 blocksize = fs_info->sectorsize;
@@ -3359,11 +3353,21 @@ int btrfs_check_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr, u8
 			kunmap_local(kaddr);
 			cur += len;
 		}
-		crypto_shash_final(shash, csum);
+		crypto_shash_final(shash, dest);
 	} else {
-		crypto_shash_digest(shash, phys_to_virt(paddr), blocksize, csum);
+		crypto_shash_digest(shash, phys_to_virt(paddr), blocksize, dest);
 	}
-
+}
+/*
+ * Verify the checksum for a single sector without any extra action that depend
+ * on the type of I/O.
+ *
+ * @kaddr must be a properly kmapped address.
+ */
+int btrfs_check_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr, u8 *csum,
+			   const u8 * const csum_expected)
+{
+	btrfs_calculate_block_csum(fs_info, paddr, csum);
 	if (memcmp(csum, csum_expected, fs_info->csum_size))
 		return -EIO;
 	return 0;
diff --git a/fs/btrfs/misc.h b/fs/btrfs/misc.h
index ff5eac84d819..f210f808311f 100644
--- a/fs/btrfs/misc.h
+++ b/fs/btrfs/misc.h
@@ -11,6 +11,7 @@
 #include <linux/pagemap.h>
 #include <linux/math64.h>
 #include <linux/rbtree.h>
+#include <linux/bio.h>
 
 /*
  * Enumerate bits using enum autoincrement. Define the @name as the n-th bit.
@@ -20,6 +21,30 @@
 	name = (1U << __ ## name ## _BIT),              \
 	__ ## name ## _SEQ = __ ## name ## _BIT
 
+static inline phys_addr_t bio_iter_phys(struct bio *bio, struct bvec_iter *iter)
+{
+	struct bio_vec bv = bio_iter_iovec(bio, *iter);
+
+	return bvec_phys(&bv);
+}
+
+/*
+ * Helper to iterate bio using btrfs block size.
+ *
+ * This will handle large folio and highmem.
+ *
+ * @paddr:	Physical memory address of each iteration
+ * @bio:	The bio to iterate
+ * @iter:	The bvec_iter (pointer) to use.
+ * @blocksize:	The blocksize to iterate.
+ *
+ * This requires all folios in the bio to cover at least one block.
+ */
+#define btrfs_bio_for_each_block(paddr, bio, iter, blocksize)		\
+	for (; (iter)->bi_size &&					\
+	     (paddr = bio_iter_phys((bio), (iter)), 1);			\
+	     bio_advance_iter_single((bio), (iter), (blocksize)))
+
 static inline void cond_wake_up(struct wait_queue_head *wq)
 {
 	/*
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index e88699460dda..389f1b617fe7 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1208,17 +1208,16 @@ static void index_one_bio(struct btrfs_raid_bio *rbio, struct bio *bio)
 	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
 	const u32 sectorsize_bits = rbio->bioc->fs_info->sectorsize_bits;
 	struct bvec_iter iter = bio->bi_iter;
+	phys_addr_t paddr;
 	u32 offset = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
 		     rbio->bioc->full_stripe_logical;
 
-	while (iter.bi_size) {
+	btrfs_bio_for_each_block(paddr, bio, &iter, sectorsize) {
 		unsigned int index = (offset >> sectorsize_bits);
 		struct sector_ptr *sector = &rbio->bio_sectors[index];
-		struct bio_vec bv = bio_iter_iovec(bio, iter);
 
 		sector->has_paddr = true;
-		sector->paddr = bvec_phys(&bv);
-		bio_advance_iter_single(bio, &iter, sectorsize);
+		sector->paddr = paddr;
 		offset += sectorsize;
 	}
 }
-- 
2.50.1



* [PATCH v2 4/5] btrfs: introduce btrfs_bio_for_each_block_all() helper
  2025-09-02  9:02 [PATCH v2 0/5] btrfs: bs > ps support preparation Qu Wenruo
                   ` (2 preceding siblings ...)
  2025-09-02  9:02 ` [PATCH v2 3/5] btrfs: introduce btrfs_bio_for_each_block() helper Qu Wenruo
@ 2025-09-02  9:02 ` Qu Wenruo
  2025-09-02  9:02 ` [PATCH v2 5/5] btrfs: cache max and min order inside btrfs_fs_info Qu Wenruo
  4 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2025-09-02  9:02 UTC (permalink / raw)
  To: linux-btrfs

Currently, if we want to iterate all blocks inside a bio, we do
something like this:

	bio_for_each_segment_all(bvec, bio, iter_all) {
		for (off = 0; off < bvec->bv_len; off += sectorsize) {
			/* Iterate blocks using bv + off */
		}
	}

That's fine for now, but it will not handle the future bs > ps case:
bio_for_each_segment_all() is a single-page iterator and will always
return a bvec that is no larger than a page.

But for bs > ps cases, we need a full folio (which covers at least
one block) so that we can work on the block.

To address this problem and handle future bs > ps cases better:

- Introduce a helper btrfs_bio_for_each_block_all()
  This helper creates a local bvec_iter sized to the target bio,
  grabs the physical address of the current location, then advances
  the iterator by block size.

- Use btrfs_bio_for_each_block_all() to replace existing call sites
  Including:

  * set_bio_pages_uptodate() in raid56
  * verify_bio_data_sectors() in raid56

  Both result in much easier to read code (see the sketch below).
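
A condensed sketch of the new pattern, mirroring the
set_bio_pages_uptodate() hunk below:

	phys_addr_t paddr;

	/* Visit every block in the bio; no per-page inner loop needed. */
	btrfs_bio_for_each_block_all(paddr, bio, blocksize) {
		struct sector_ptr *sector = find_stripe_sector(rbio, paddr);

		ASSERT(sector);
		if (sector)
			sector->uptodate = 1;
	}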

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/misc.h   | 24 +++++++++++++++++++++++
 fs/btrfs/raid56.c | 49 +++++++++++++++++++----------------------------
 2 files changed, 44 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/misc.h b/fs/btrfs/misc.h
index f210f808311f..2352ab56dbb3 100644
--- a/fs/btrfs/misc.h
+++ b/fs/btrfs/misc.h
@@ -45,6 +45,30 @@ static inline phys_addr_t bio_iter_phys(struct bio *bio, struct bvec_iter *iter)
 	     (paddr = bio_iter_phys((bio), (iter)), 1);			\
 	     bio_advance_iter_single((bio), (iter), (blocksize)))
 
+/* Helper to initialize a bvec_iter to the size of the specified bio. */
+static inline struct bvec_iter init_bvec_iter_for_bio(struct bio *bio)
+{
+	struct bio_vec *bvec;
+	u32 bio_size = 0;
+	int i;
+
+	bio_for_each_bvec_all(bvec, bio, i)
+		bio_size += bvec->bv_len;
+
+	return (struct bvec_iter) {
+		.bi_sector = 0,
+		.bi_size = bio_size,
+		.bi_idx = 0,
+		.bi_bvec_done = 0,
+	};
+}
+
+#define btrfs_bio_for_each_block_all(paddr, bio, blocksize)		\
+	for (struct bvec_iter iter = init_bvec_iter_for_bio(bio);	\
+	     (iter).bi_size &&						\
+	     (paddr = bio_iter_phys((bio), &(iter)), 1);		\
+	     bio_advance_iter_single((bio), &(iter), (blocksize)))
+
 static inline void cond_wake_up(struct wait_queue_head *wq)
 {
 	/*
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 389f1b617fe7..2b4f577dcf39 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1510,22 +1510,17 @@ static struct sector_ptr *find_stripe_sector(struct btrfs_raid_bio *rbio,
  */
 static void set_bio_pages_uptodate(struct btrfs_raid_bio *rbio, struct bio *bio)
 {
-	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
+	const u32 blocksize = rbio->bioc->fs_info->sectorsize;
+	phys_addr_t paddr;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct sector_ptr *sector;
-		phys_addr_t paddr = bvec_phys(bvec);
+	btrfs_bio_for_each_block_all(paddr, bio, blocksize) {
+		struct sector_ptr *sector = find_stripe_sector(rbio, paddr);
 
-		for (u32 off = 0; off < bvec->bv_len; off += sectorsize) {
-			sector = find_stripe_sector(rbio, paddr + off);
-			ASSERT(sector);
-			if (sector)
-				sector->uptodate = 1;
-		}
+		ASSERT(sector);
+		if (sector)
+			sector->uptodate = 1;
 	}
 }
 
@@ -1572,8 +1567,7 @@ static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
 {
 	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
 	int total_sector_nr = get_bio_sector_nr(rbio, bio);
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
+	phys_addr_t paddr;
 
 	/* No data csum for the whole stripe, no need to verify. */
 	if (!rbio->csum_bitmap || !rbio->csum_buf)
@@ -1583,23 +1577,20 @@ static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
 	if (total_sector_nr >= rbio->nr_data * rbio->stripe_nsectors)
 		return;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		for (u32 off = 0; off < bvec->bv_len;
-		     off += fs_info->sectorsize, total_sector_nr++) {
-			u8 csum_buf[BTRFS_CSUM_SIZE];
-			u8 *expected_csum = rbio->csum_buf +
-					    total_sector_nr * fs_info->csum_size;
-			int ret;
+	btrfs_bio_for_each_block_all(paddr, bio, fs_info->sectorsize) {
+		u8 csum_buf[BTRFS_CSUM_SIZE];
+		u8 *expected_csum = rbio->csum_buf + total_sector_nr * fs_info->csum_size;
+		int ret;
 
-			/* No csum for this sector, skip to the next sector. */
-			if (!test_bit(total_sector_nr, rbio->csum_bitmap))
-				continue;
+		/* No csum for this sector, skip to the next sector. */
+		if (!test_bit(total_sector_nr, rbio->csum_bitmap))
+			continue;
 
-			ret = btrfs_check_block_csum(fs_info, bvec_phys(bvec) + off,
-						     csum_buf, expected_csum);
-			if (ret < 0)
-				set_bit(total_sector_nr, rbio->error_bitmap);
-		}
+		ret = btrfs_check_block_csum(fs_info, paddr,
+					     csum_buf, expected_csum);
+		if (ret < 0)
+			set_bit(total_sector_nr, rbio->error_bitmap);
+		total_sector_nr++;
 	}
 }
 
-- 
2.50.1



* [PATCH v2 5/5] btrfs: cache max and min order inside btrfs_fs_info
  2025-09-02  9:02 [PATCH v2 0/5] btrfs: bs > ps support preparation Qu Wenruo
                   ` (3 preceding siblings ...)
  2025-09-02  9:02 ` [PATCH v2 4/5] btrfs: introduce btrfs_bio_for_each_block_all() helper Qu Wenruo
@ 2025-09-02  9:02 ` Qu Wenruo
  4 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2025-09-02  9:02 UTC (permalink / raw)
  To: linux-btrfs

Inside btrfs_fs_info we cache several bit shifts, like
sectorsize_bits.

Apply the same idea to the max and min folio orders, so that every
time the mapping order needs to be applied we can skip the
calculation.
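
A worked example of the two cached orders, assuming PAGE_SIZE = 4K
and BITS_PER_LONG = 64 (the formulas match the open_ctree() hunk
below):

	/*
	 * sectorsize = 4K:
	 *   block_min_order = ilog2(round_up(4K, 4K) >> 12)  = ilog2(1)   = 0
	 *   block_max_order = ilog2((64 << 12) >> 12)        = ilog2(64)  = 6
	 *
	 * sectorsize = 16K (a future bs > ps config):
	 *   block_min_order = ilog2(round_up(16K, 4K) >> 12) = ilog2(4)   = 2
	 *   block_max_order = ilog2((64 << 14) >> 12)        = ilog2(256) = 8
	 */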

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/btrfs_inode.h | 6 +++---
 fs/btrfs/disk-io.c     | 2 ++
 fs/btrfs/fs.h          | 2 ++
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index c40e99ec13bf..f45dcdde0efc 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -532,9 +532,9 @@ static inline void btrfs_set_inode_mapping_order(struct btrfs_inode *inode)
 
 	/* We only allow BITS_PER_LONGS blocks for each bitmap. */
 #ifdef CONFIG_BTRFS_EXPERIMENTAL
-	mapping_set_folio_order_range(inode->vfs_inode.i_mapping, 0,
-			ilog2(((BITS_PER_LONG << inode->root->fs_info->sectorsize_bits)
-				>> PAGE_SHIFT)));
+	mapping_set_folio_order_range(inode->vfs_inode.i_mapping,
+				      inode->root->fs_info->block_min_order,
+				      inode->root->fs_info->block_max_order);
 #endif
 }
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7b06bbc40898..a2eba8bc4336 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3383,6 +3383,8 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	fs_info->nodesize_bits = ilog2(nodesize);
 	fs_info->sectorsize = sectorsize;
 	fs_info->sectorsize_bits = ilog2(sectorsize);
+	fs_info->block_min_order = ilog2(round_up(sectorsize, PAGE_SIZE) >> PAGE_SHIFT);
+	fs_info->block_max_order = ilog2((BITS_PER_LONG << fs_info->sectorsize_bits) >> PAGE_SHIFT);
 	fs_info->csums_per_leaf = BTRFS_MAX_ITEM_SIZE(fs_info) / fs_info->csum_size;
 	fs_info->stripesize = stripesize;
 	fs_info->fs_devices->fs_info = fs_info;
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index 5f0b185a7f21..bf4a1b75b0cf 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -825,6 +825,8 @@ struct btrfs_fs_info {
 	u32 sectorsize;
 	/* ilog2 of sectorsize, use to avoid 64bit division */
 	u32 sectorsize_bits;
+	u32 block_min_order;
+	u32 block_max_order;
 	u32 csum_size;
 	u32 csums_per_leaf;
 	u32 stripesize;
-- 
2.50.1


