public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/12] btrfs: add raid56 support for bs > ps cases
@ 2025-11-17  7:30 Qu Wenruo
  2025-11-17  7:30 ` [PATCH 01/12] btrfs: add an overview for the btrfs_raid_bio structure Qu Wenruo
                   ` (12 more replies)
  0 siblings, 13 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-11-17  7:30 UTC (permalink / raw)
  To: linux-btrfs

[OVERVIEW]
This series adds the missing raid56 support for the experimental bs > ps
(block size larger than page size) feature.

The main challenge here is the conflict between RAID56 RMW/recovery and
data checksum verification.

For RAID56 RMW/recovery, the vertical stripe can only be mapped one page
at a time, as the upper layer can pass bios that are not backed by large
folios (direct IO, encoded read/write/send).

On the other hand, data checksum verification requires multiple pages at
the same time, e.g. btrfs_calculate_block_csum_pages().

To meet both requirements, introduce a new unit, the step, which is
min(PAGE_SIZE, sectorsize), and make the paddrs[] arrays in RAID56 use
step-sized entries.

So for vertical stripe related work, reduce the map size from one sector
to one step. For data checksum verification, grab the pointers from the
involved paddrs[] array and pass the sub-array into
btrfs_calculate_block_csum_pages().

So before the patchset, the btrfs_raid_bio paddr pointers look like
this:

  16K page size, 4K fs block size (aka, subpage case)

                       0                   16K  ...
  stripe_pages[]:      |                   |    ...
  stripe_paddrs[]:     0    1    2    3    4    ...
  fs blocks            |<-->|<-->|<-->|<-->|    ...

  There is at least one fs block (sector) inside a page, and each
  paddrs[] entry maps 1:1 to an fs block.

Compare that to the new structure for bs > ps support:

  4K page size, 8K fs block size

                       0    4k   8K   12K   16K  ...
  stripe_pages[]:      |    |    |    |    |     ...
  stripe_paddrs[]:     0    1    2    3    4     ...
  fs blocks            |<------->|<------->|     ...

  Now a paddrs[] entry is no longer mapped 1:1 to an fs block; instead
  multiple paddrs[] entries map to one fs block.

The glue unit between paddrs[] and fs blocks is the step.

One fs block can contain one or more steps, and one step maps 1:1 to a
paddrs[] entry.

For bs <= ps cases, one step is the same as an fs block.
For bs > ps case, one step is just a page.

For RAID56, we now need one extra step iteration loop when handling an
fs block.
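
The step math described above can be sketched in plain userspace C (a
minimal illustration only; the function names mirror the series but this
is not the kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* One step is min(sectorsize, PAGE_SIZE). */
static size_t step_size(size_t sectorsize, size_t page_size)
{
	return sectorsize < page_size ? sectorsize : page_size;
}

/* How many paddrs[] entries (steps) one fs block occupies. */
static size_t sector_nsteps(size_t sectorsize, size_t page_size)
{
	return sectorsize / step_size(sectorsize, page_size);
}
```

For the diagrams above: the subpage case (bs=4K, ps=16K) gives one step
per fs block, while the bs=8K, ps=4K case gives two steps per fs block.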

[TESTING]
I have tested the following combinations:

- bs=4k ps=4k x86_64
- bs=4k ps=64k arm64
  The baseline, to ensure no regression is caused by this patchset for
  the bs == ps and bs < ps cases.

- bs=8k ps=4k x86_64
  The new run for this series.

  The only new failure is related to direct IO read verification, which
  is a known one caused by no direct IO support for bs > ps cases.

I'm afraid that in the long run the combination matrix will grow larger
and larger, and I'm not sure my environment can handle all the extra
bs/ps combinations.

The long term plan is to test bs=4k ps=4k, bs=4k ps=64k, bs=8k ps=4k
cases only.

[PATCHSET LAYOUT]
Patch 1 introduces an overview of how the btrfs_raid_bio structure
works.
Patches 2~10 start converting the existing infrastructure to use the
new step based paddr pointers.
Patch 11 enables RAID56 for bs > ps cases, which is still an
experimental feature.
The last patch removes the "_step" infix, which is used as temporary
naming during the work.

[ROADMAP FOR BS > PS SUPPORT]
The remaining feature not yet implemented for bs > ps cases is direct
IO. The needed iomap patch has been submitted through the VFS/iomap
tree, and the btrfs part is a very tiny patch which will be submitted
during the v6.19 cycle.


Qu Wenruo (12):
  btrfs: add an overview for the btrfs_raid_bio structure
  btrfs: introduce a new parameter to locate a sector
  btrfs: prepare generate_pq_vertical() for bs > ps cases
  btrfs: prepare recover_vertical() to support bs > ps cases
  btrfs: prepare verify_one_sector() to support bs > ps cases
  btrfs: prepare verify_bio_data_sectors() to support bs > ps cases
  btrfs: prepare set_bio_pages_uptodate() to support bs > ps cases
  btrfs: prepare steal_rbio() to support bs > ps cases
  btrfs: prepare rbio_bio_add_io_paddr() to support bs > ps cases
  btrfs: prepare finish_parity_scrub() to support bs > ps cases
  btrfs: enable bs > ps support for raid56
  btrfs: remove the "_step" infix

 fs/btrfs/disk-io.c |   6 -
 fs/btrfs/raid56.c  | 711 ++++++++++++++++++++++++++++-----------------
 fs/btrfs/raid56.h  |  87 ++++++
 3 files changed, 535 insertions(+), 269 deletions(-)

-- 
2.51.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 01/12] btrfs: add an overview for the btrfs_raid_bio structure
  2025-11-17  7:30 [PATCH 00/12] btrfs: add raid56 support for bs > ps cases Qu Wenruo
@ 2025-11-17  7:30 ` Qu Wenruo
  2025-11-17  7:30 ` [PATCH 02/12] btrfs: introduce a new parameter to locate a sector Qu Wenruo
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-11-17  7:30 UTC (permalink / raw)
  To: linux-btrfs

The structure needs to track both the pages from higher layer bios and
internal pages, thus it can be a little complex to grasp.

Add an overview of the structure, especially how we track different
pages from higher layer bios and internal ones, to save some time for
future developers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.h | 71 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 42a45716fb03..87b0c73ee05b 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -24,6 +24,77 @@ enum btrfs_rbio_ops {
 	BTRFS_RBIO_PARITY_SCRUB,
 };
 
+/*
+ * Overview of btrfs_raid_bio.
+ *
+ * One btrfs_raid_bio represents a full stripe of RAID56, including both data
+ * and P/Q stripes.
+ * For now, each data and P/Q stripe is in fixed length (64K).
+ *
+ * One btrfs_raid_bio can have one or more bios from higher layer, covering
+ * part or all of the data stripes.
+ *
+ * [PAGES FROM HIGHER LAYER BIOS]
+ * Higher layer bios are in the btrfs_raid_bio::bio_list.
+ *
+ * Pages from the bio_list are represented like the following.
+ *
+ *
+ * bio_list:	     |<- Bio 1 ->|             |<- Bio 2 ->|  ...
+ * bio_paddrs:	    [0]   [1]   [2]    [3]    [4]    [5]      ...
+ *
+ * If there is a bio covering a sector (one btrfs fs block), the corresponding
+ * pointer in btrfs_raid_bio::bio_paddrs[] will point to the physical address
+ * (with the offset inside the page) of the corresponding bio.
+ *
+ * If there is no bio covering a sector, then btrfs_raid_bio::bio_paddrs[i] will
+ * be INVALID_PADDR.
+ *
+ * The length of each entry in bio_paddrs[] is sectorsize.
+ *
+ * [PAGES FOR INTERNAL USAGES]
+ * For pages not covered by any bio or belonging to P/Q stripes, they are stored
+ * in btrfs_raid_bio::stripe_pages[] and stripe_paddrs[], like the following:
+ *
+ * stripe_pages:       |<- Page 0 ->|<- Page 1 ->|  ...
+ * stripe_paddrs:     [0]    [1]   [2]    [3]   [4] ...
+ *
+ * stripe_pages[] array stores all the pages covering the full stripe, including
+ * data and P/Q pages.
+ * stripe_pages[0] is the first page of the first data stripe.
+ * stripe_pages[BTRFS_STRIPE_LEN / PAGE_SIZE] is the first page of the second data stripe.
+ *
+ * Some pointers inside stripe_pages[] can be NULL, e.g. for a full stripe write
+ * (the bio covers all data stripes) there is no need to allocate pages for
+ * data stripes (can grab from bio_paddrs[]).
+ *
+ * If the corresponding page of stripe_paddrs[i] is not allocated, the value of
+ * stripe_paddrs[i] will be INVALID_PADDR.
+ *
+ * The length of each entry in stripe_paddrs[] is sectorsize.
+ *
+ * [LOCATING A SECTOR]
+ * To locate a sector for IO, we need the following info:
+ *
+ * - stripe_nr
+ *   Starts from 0 (representing the first data stripe), ends at
+ *   @nr_data (RAID5, P stripe) or @nr_data + 1 (RAID6, Q stripe).
+ *
+ * - sector_nr
+ *   Starts from 0 (representing the first sector of the stripe), ends
+ *   at BTRFS_STRIPE_LEN / sectorsize - 1.
+ *
+ *   All existing bitmaps are based on sector numbers.
+ *
+ * - from which array
+ *   Whether grabbing from stripe_paddrs[] (aka, internal pages) or from the
+ *   bio_paddrs[] (aka, from the higher layer bios).
+ *
+ * For IO, a physical address is returned, so that we can extract the page and
+ * the offset inside the page for IO.
+ * A special value INVALID_PADDR indicates the physical address is invalid,
+ * normally because there is no page allocated for the specified sector.
+ */
 struct btrfs_raid_bio {
 	struct btrfs_io_context *bioc;
 
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 02/12] btrfs: introduce a new parameter to locate a sector
  2025-11-17  7:30 [PATCH 00/12] btrfs: add raid56 support for bs > ps cases Qu Wenruo
  2025-11-17  7:30 ` [PATCH 01/12] btrfs: add an overview for the btrfs_raid_bio structure Qu Wenruo
@ 2025-11-17  7:30 ` Qu Wenruo
  2025-11-17  7:30 ` [PATCH 03/12] btrfs: prepare generate_pq_vertical() for bs > ps cases Qu Wenruo
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-11-17  7:30 UTC (permalink / raw)
  To: linux-btrfs

Since we can not ensure all bios from the higher layer are backed by
large folios (e.g. direct IO, encoded read/write/send), we need the
ability to locate a sub-block (aka, a page) inside a full stripe.

So the existing @stripe_nr + @sector_nr combination is not enough to
locate such a page for bs > ps cases.

Introduce a new parameter, @step_nr, to locate the page inside a larger
fs block.
The naming follows the conventions used elsewhere inside btrfs,
where one step is min(sectorsize, PAGE_SIZE).

This is still a preparation patch, only touching the following aspects:

- btrfs_dump_rbio()
  To show the new @sector_nsteps member.

- btrfs_raid_bio::sector_nsteps
  Recording how many steps there are inside an fs block.

- Enlarge btrfs_raid_bio::*_paddrs[] size
  To take @sector_nsteps into consideration.

- index_one_bio()
- index_stripe_sectors()
- memcpy_from_bio_to_stripe()
- cache_rbio_pages()
- need_read_stripe_sectors()
  Those functions iterate *_paddrs[], and need to take
  sector_nsteps into consideration.

- Rename rbio_stripe_sector_index() to rbio_sector_index()
  The "stripe" part is not that helpful.

  And an extra ASSERT() before returning the result.

- Add a new rbio_paddr_index() helper
  This will take the extra @step_nr into consideration.

- The comments of btrfs_raid_bio
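
The index math added by rbio_paddr_index() can be modeled in userspace
like this (a simplified sketch, not the kernel code: the paddrs[] arrays
are laid out by stripe, then by sector inside the stripe, then by step
inside the sector):

```c
#include <assert.h>

/* Sketch of the paddrs[] index calculation for a (stripe, sector, step) triple. */
static unsigned int paddr_index(unsigned int stripe_nsectors,
				unsigned int sector_nsteps,
				unsigned int stripe_nr,
				unsigned int sector_nr,
				unsigned int step_nr)
{
	assert(sector_nr < stripe_nsectors);
	assert(step_nr < sector_nsteps);
	return (stripe_nr * stripe_nsectors + sector_nr) * sector_nsteps + step_nr;
}
```

E.g. for bs=8K ps=4K (stripe_nsectors=8 with the 64K stripe length,
sector_nsteps=2), the two steps of one fs block occupy two consecutive
paddrs[] entries.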

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 92 +++++++++++++++++++++++++++++++----------------
 fs/btrfs/raid56.h | 22 ++++++++++--
 2 files changed, 80 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 95cc243d9c8b..7f01178be7d8 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -66,10 +66,10 @@ static void btrfs_dump_rbio(const struct btrfs_fs_info *fs_info,
 
 	dump_bioc(fs_info, rbio->bioc);
 	btrfs_crit(fs_info,
-"rbio flags=0x%lx nr_sectors=%u nr_data=%u real_stripes=%u stripe_nsectors=%u scrubp=%u dbitmap=0x%lx",
+"rbio flags=0x%lx nr_sectors=%u nr_data=%u real_stripes=%u stripe_nsectors=%u sector_nsteps=%u scrubp=%u dbitmap=0x%lx",
 		rbio->flags, rbio->nr_sectors, rbio->nr_data,
 		rbio->real_stripes, rbio->stripe_nsectors,
-		rbio->scrubp, rbio->dbitmap);
+		rbio->sector_nsteps, rbio->scrubp, rbio->dbitmap);
 }
 
 #define ASSERT_RBIO(expr, rbio)						\
@@ -229,15 +229,20 @@ int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info)
 
 static void memcpy_from_bio_to_stripe(struct btrfs_raid_bio *rbio, unsigned int sector_nr)
 {
-	phys_addr_t dst = rbio->stripe_paddrs[sector_nr];
-	phys_addr_t src = rbio->bio_paddrs[sector_nr];
+	const u32 step = min(rbio->bioc->fs_info->sectorsize, PAGE_SIZE);
 
-	ASSERT(dst != INVALID_PADDR);
-	ASSERT(src != INVALID_PADDR);
+	ASSERT(sector_nr < rbio->nr_sectors);
+	for (int i = 0; i < rbio->sector_nsteps; i++) {
+		unsigned int index = sector_nr * rbio->sector_nsteps + i;
+		phys_addr_t dst = rbio->stripe_paddrs[index];
+		phys_addr_t src = rbio->bio_paddrs[index];
 
-	memcpy_page(phys_to_page(dst), offset_in_page(dst),
-		    phys_to_page(src), offset_in_page(src),
-		    rbio->bioc->fs_info->sectorsize);
+		ASSERT(dst != INVALID_PADDR);
+		ASSERT(src != INVALID_PADDR);
+
+		memcpy_page(phys_to_page(dst), offset_in_page(dst),
+			    phys_to_page(src), offset_in_page(src), step);
+	}
 }
 
 /*
@@ -260,7 +265,7 @@ static void cache_rbio_pages(struct btrfs_raid_bio *rbio)
 
 	for (i = 0; i < rbio->nr_sectors; i++) {
 		/* Some range not covered by bio (partial write), skip it */
-		if (rbio->bio_paddrs[i] == INVALID_PADDR) {
+		if (rbio->bio_paddrs[i * rbio->sector_nsteps] == INVALID_PADDR) {
 			/*
 			 * Even if the sector is not covered by bio, if it is
 			 * a data sector it should still be uptodate as it is
@@ -320,11 +325,12 @@ static __maybe_unused bool full_page_sectors_uptodate(struct btrfs_raid_bio *rbi
  */
 static void index_stripe_sectors(struct btrfs_raid_bio *rbio)
 {
-	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
+	const u32 step = min(rbio->bioc->fs_info->sectorsize, PAGE_SIZE);
 	u32 offset;
 	int i;
 
-	for (i = 0, offset = 0; i < rbio->nr_sectors; i++, offset += sectorsize) {
+	for (i = 0, offset = 0; i < rbio->nr_sectors * rbio->sector_nsteps;
+	     i++, offset += step) {
 		int page_index = offset >> PAGE_SHIFT;
 
 		ASSERT(page_index < rbio->nr_pages);
@@ -668,21 +674,41 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
 	return 1;
 }
 
-static unsigned int rbio_stripe_sector_index(const struct btrfs_raid_bio *rbio,
-					     unsigned int stripe_nr,
-					     unsigned int sector_nr)
+/* Return the sector index for @stripe_nr and @sector_nr. */
+static unsigned int rbio_sector_index(const struct btrfs_raid_bio *rbio,
+				      unsigned int stripe_nr,
+				      unsigned int sector_nr)
 {
+	unsigned int ret;
+
 	ASSERT_RBIO_STRIPE(stripe_nr < rbio->real_stripes, rbio, stripe_nr);
 	ASSERT_RBIO_SECTOR(sector_nr < rbio->stripe_nsectors, rbio, sector_nr);
 
-	return stripe_nr * rbio->stripe_nsectors + sector_nr;
+	ret = stripe_nr * rbio->stripe_nsectors + sector_nr;
+	ASSERT(ret < rbio->nr_sectors);
+	return ret;
+}
+
+/* Return the paddr array index for @stripe_nr, @sector_nr and @step_nr. */
+static unsigned int rbio_paddr_index(const struct btrfs_raid_bio *rbio,
+				     unsigned int stripe_nr,
+				     unsigned int sector_nr,
+				     unsigned int step_nr)
+{
+	unsigned int ret;
+
+	ASSERT_RBIO_SECTOR(step_nr < rbio->sector_nsteps, rbio, step_nr);
+
+	ret = rbio_sector_index(rbio, stripe_nr, sector_nr) * rbio->sector_nsteps + step_nr;
+	ASSERT(ret < rbio->nr_sectors * rbio->sector_nsteps);
+	return ret;
 }
 
 /* Return a paddr from rbio->stripe_sectors, not from the bio list */
 static phys_addr_t rbio_stripe_paddr(const struct btrfs_raid_bio *rbio,
 				     unsigned int stripe_nr, unsigned int sector_nr)
 {
-	return rbio->stripe_paddrs[rbio_stripe_sector_index(rbio, stripe_nr, sector_nr)];
+	return rbio->stripe_paddrs[rbio_paddr_index(rbio, stripe_nr, sector_nr, 0)];
 }
 
 /* Grab a paddr inside P stripe */
@@ -985,6 +1011,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info,
 	const unsigned int stripe_nsectors =
 		BTRFS_STRIPE_LEN >> fs_info->sectorsize_bits;
 	const unsigned int num_sectors = stripe_nsectors * real_stripes;
+	const unsigned int step = min(fs_info->sectorsize, PAGE_SIZE);
+	const unsigned int sector_nsteps = fs_info->sectorsize / step;
 	struct btrfs_raid_bio *rbio;
 
 	/* PAGE_SIZE must also be aligned to sectorsize for subpage support */
@@ -1007,8 +1035,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info,
 		return ERR_PTR(-ENOMEM);
 	rbio->stripe_pages = kcalloc(num_pages, sizeof(struct page *),
 				     GFP_NOFS);
-	rbio->bio_paddrs = kcalloc(num_sectors, sizeof(phys_addr_t), GFP_NOFS);
-	rbio->stripe_paddrs = kcalloc(num_sectors, sizeof(phys_addr_t), GFP_NOFS);
+	rbio->bio_paddrs = kcalloc(num_sectors * sector_nsteps, sizeof(phys_addr_t), GFP_NOFS);
+	rbio->stripe_paddrs = kcalloc(num_sectors * sector_nsteps, sizeof(phys_addr_t), GFP_NOFS);
 	rbio->finish_pointers = kcalloc(real_stripes, sizeof(void *), GFP_NOFS);
 	rbio->error_bitmap = bitmap_zalloc(num_sectors, GFP_NOFS);
 	rbio->stripe_uptodate_bitmap = bitmap_zalloc(num_sectors, GFP_NOFS);
@@ -1019,7 +1047,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info,
 		kfree(rbio);
 		return ERR_PTR(-ENOMEM);
 	}
-	for (int i = 0; i < num_sectors; i++) {
+	for (int i = 0; i < num_sectors * sector_nsteps; i++) {
 		rbio->stripe_paddrs[i] = INVALID_PADDR;
 		rbio->bio_paddrs[i] = INVALID_PADDR;
 	}
@@ -1037,6 +1065,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info,
 	rbio->real_stripes = real_stripes;
 	rbio->stripe_npages = stripe_npages;
 	rbio->stripe_nsectors = stripe_nsectors;
+	rbio->sector_nsteps = sector_nsteps;
 	refcount_set(&rbio->refs, 1);
 	atomic_set(&rbio->stripes_pending, 0);
 
@@ -1192,18 +1221,19 @@ static int rbio_add_io_paddr(struct btrfs_raid_bio *rbio, struct bio_list *bio_l
 
 static void index_one_bio(struct btrfs_raid_bio *rbio, struct bio *bio)
 {
-	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
-	const u32 sectorsize_bits = rbio->bioc->fs_info->sectorsize_bits;
+	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
+	const u32 step = min(fs_info->sectorsize, PAGE_SIZE);
+	const u32 step_bits = min(fs_info->sectorsize_bits, PAGE_SHIFT);
 	struct bvec_iter iter = bio->bi_iter;
 	phys_addr_t paddr;
 	u32 offset = (bio->bi_iter.bi_sector << SECTOR_SHIFT) -
 		     rbio->bioc->full_stripe_logical;
 
-	btrfs_bio_for_each_block(paddr, bio, &iter, sectorsize) {
-		unsigned int index = (offset >> sectorsize_bits);
+	btrfs_bio_for_each_block(paddr, bio, &iter, step) {
+		unsigned int index = (offset >> step_bits);
 
 		rbio->bio_paddrs[index] = paddr;
-		offset += sectorsize;
+		offset += step;
 	}
 }
 
@@ -1303,7 +1333,7 @@ static void generate_pq_vertical(struct btrfs_raid_bio *rbio, int sectornr)
 				sector_paddr_in_rbio(rbio, stripe, sectornr, 0));
 
 	/* Then add the parity stripe */
-	set_bit(rbio_stripe_sector_index(rbio, rbio->nr_data, sectornr),
+	set_bit(rbio_sector_index(rbio, rbio->nr_data, sectornr),
 		rbio->stripe_uptodate_bitmap);
 	pointers[stripe++] = kmap_local_paddr(rbio_pstripe_paddr(rbio, sectornr));
 
@@ -1312,7 +1342,7 @@ static void generate_pq_vertical(struct btrfs_raid_bio *rbio, int sectornr)
 		 * RAID6, add the qstripe and call the library function
 		 * to fill in our p/q
 		 */
-		set_bit(rbio_stripe_sector_index(rbio, rbio->nr_data + 1, sectornr),
+		set_bit(rbio_sector_index(rbio, rbio->nr_data + 1, sectornr),
 			rbio->stripe_uptodate_bitmap);
 		pointers[stripe++] = kmap_local_paddr(rbio_qstripe_paddr(rbio, sectornr));
 
@@ -1932,7 +1962,7 @@ static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
 		if (ret < 0)
 			goto cleanup;
 
-		set_bit(rbio_stripe_sector_index(rbio, faila, sector_nr),
+		set_bit(rbio_sector_index(rbio, faila, sector_nr),
 			rbio->stripe_uptodate_bitmap);
 	}
 	if (failb >= 0) {
@@ -1940,7 +1970,7 @@ static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
 		if (ret < 0)
 			goto cleanup;
 
-		set_bit(rbio_stripe_sector_index(rbio, failb, sector_nr),
+		set_bit(rbio_sector_index(rbio, failb, sector_nr),
 			rbio->stripe_uptodate_bitmap);
 	}
 
@@ -2288,7 +2318,7 @@ static bool need_read_stripe_sectors(struct btrfs_raid_bio *rbio)
 	int i;
 
 	for (i = 0; i < rbio->nr_data * rbio->stripe_nsectors; i++) {
-		phys_addr_t paddr = rbio->stripe_paddrs[i];
+		phys_addr_t paddr = rbio->stripe_paddrs[i * rbio->sector_nsteps];
 
 		/*
 		 * We have a sector which doesn't have page nor uptodate,
@@ -2746,7 +2776,7 @@ static int scrub_assemble_read_bios(struct btrfs_raid_bio *rbio)
 		 * The bio cache may have handed us an uptodate sector.  If so,
 		 * use it.
 		 */
-		if (test_bit(rbio_stripe_sector_index(rbio, stripe, sectornr),
+		if (test_bit(rbio_sector_index(rbio, stripe, sectornr),
 			     rbio->stripe_uptodate_bitmap))
 			continue;
 
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 87b0c73ee05b..cafad2435ecf 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -50,7 +50,7 @@ enum btrfs_rbio_ops {
  * If there is no bio covering a sector, then btrfs_raid_bio::bio_paddrs[i] will
  * be INVALID_PADDR.
  *
- * The length of each entry in bio_paddrs[] is sectorsize.
+ * The length of each entry in bio_paddrs[] is a step (aka, min(sectorsize, PAGE_SIZE)).
  *
  * [PAGES FOR INTERNAL USAGES]
  * For pages not covered by any bio or belonging to P/Q stripes, they are stored
@@ -71,7 +71,7 @@ enum btrfs_rbio_ops {
  * If the corresponding page of stripe_paddrs[i] is not allocated, the value of
  * stripe_paddrs[i] will be INVALID_PADDR.
  *
- * The length of each entry in stripe_paddrs[] is sectorsize.
+ * The length of each entry in stripe_paddrs[] is a step.
  *
  * [LOCATING A SECTOR]
 * To locate a sector for IO, we need the following info:
@@ -84,7 +84,15 @@ enum btrfs_rbio_ops {
  *   Starts from 0 (representing the first sector of the stripe), ends
  *   at BTRFS_STRIPE_LEN / sectorsize - 1.
  *
- *   All existing bitmaps are based on sector numbers.
+ * - step_nr
+ *   A step is min(sectorsize, PAGE_SIZE).
+ *
+ *   Starts from 0 (representing the first step of the sector), ends
+ *   at @sector_nsteps - 1.
+ *
+ *   Most call sites do not need to bother with this parameter.
+ *   It is only needed for bs > ps support and only for vertical stripe
+ *   related work (e.g. RMW/recover).
  *
  * - from which array
  *   Whether grabbing from stripe_paddrs[] (aka, internal pages) or from the
@@ -152,6 +160,14 @@ struct btrfs_raid_bio {
 	/* How many sectors there are for each stripe */
 	u8 stripe_nsectors;
 
+	/*
+	 * How many steps there are for one sector.
+	 *
+	 * For bs > ps cases, it's sectorsize / PAGE_SIZE.
+	 * For bs <= ps cases, it's always 1.
+	 */
+	u8 sector_nsteps;
+
 	/* Stripe number that we're scrubbing  */
 	u8 scrubp;
 
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 03/12] btrfs: prepare generate_pq_vertical() for bs > ps cases
  2025-11-17  7:30 [PATCH 00/12] btrfs: add raid56 support for bs > ps cases Qu Wenruo
  2025-11-17  7:30 ` [PATCH 01/12] btrfs: add an overview for the btrfs_raid_bio structure Qu Wenruo
  2025-11-17  7:30 ` [PATCH 02/12] btrfs: introduce a new parameter to locate a sector Qu Wenruo
@ 2025-11-17  7:30 ` Qu Wenruo
  2025-11-17  7:30 ` [PATCH 04/12] btrfs: prepare recover_vertical() to support " Qu Wenruo
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-11-17  7:30 UTC (permalink / raw)
  To: linux-btrfs

Unlike btrfs_calculate_block_csum_pages(), we can not handle multiple
pages at the same time for P/Q generation.

So here we introduce a new @step_nr parameter, and various helpers to
grab the sub-block page from the rbio, and generate the P/Q stripes
page by page.
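
For RAID5 the per-step parity is just the XOR of the data stripes over
one step-sized buffer; a minimal userspace sketch of the idea (the
kernel code uses run_xor() and raid6_call.gen_syndrome() instead):

```c
#include <stddef.h>
#include <stdint.h>

/* Compute the RAID5 P stripe for one step: XOR of all data stripes. */
static void gen_p_step(uint8_t *p, uint8_t *const data[], int nr_data,
		       size_t step)
{
	for (size_t i = 0; i < step; i++) {
		uint8_t v = 0;

		for (int d = 0; d < nr_data; d++)
			v ^= data[d][i];
		p[i] = v;
	}
}
```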

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 92 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 70 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 7f01178be7d8..95cf037cbf0f 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -711,20 +711,25 @@ static phys_addr_t rbio_stripe_paddr(const struct btrfs_raid_bio *rbio,
 	return rbio->stripe_paddrs[rbio_paddr_index(rbio, stripe_nr, sector_nr, 0)];
 }
 
-/* Grab a paddr inside P stripe */
-static phys_addr_t rbio_pstripe_paddr(const struct btrfs_raid_bio *rbio,
-				      unsigned int sector_nr)
+static phys_addr_t rbio_stripe_step_paddr(const struct btrfs_raid_bio *rbio,
+					  unsigned int stripe_nr, unsigned int sector_nr,
+					  unsigned int step_nr)
 {
-	return rbio_stripe_paddr(rbio, rbio->nr_data, sector_nr);
+	return rbio->stripe_paddrs[rbio_paddr_index(rbio, stripe_nr, sector_nr, step_nr)];
 }
 
-/* Grab a paddr inside Q stripe, return INVALID_PADDR if not RAID6 */
-static phys_addr_t rbio_qstripe_paddr(const struct btrfs_raid_bio *rbio,
-				      unsigned int sector_nr)
+static phys_addr_t rbio_pstripe_step_paddr(const struct btrfs_raid_bio *rbio,
+					   unsigned int sector_nr, unsigned int step_nr)
+{
+	return rbio_stripe_step_paddr(rbio, rbio->nr_data, sector_nr, step_nr);
+}
+
+static phys_addr_t rbio_qstripe_step_paddr(const struct btrfs_raid_bio *rbio,
+					   unsigned int sector_nr, unsigned int step_nr)
 {
 	if (rbio->nr_data + 1 == rbio->real_stripes)
 		return INVALID_PADDR;
-	return rbio_stripe_paddr(rbio, rbio->nr_data + 1, sector_nr);
+	return rbio_stripe_step_paddr(rbio, rbio->nr_data + 1, sector_nr, step_nr);
 }
 
 /*
@@ -998,6 +1003,38 @@ static phys_addr_t sector_paddr_in_rbio(struct btrfs_raid_bio *rbio,
 	return rbio->stripe_paddrs[index];
 }
 
+/*
+ * Similar to sector_paddr_in_rbio(), but with extra consideration for
+ * bs > ps cases, where we can have multiple steps for a fs block.
+ */
+static phys_addr_t step_paddr_in_rbio(struct btrfs_raid_bio *rbio,
+				      int stripe_nr, int sector_nr, int step_nr,
+				      bool bio_list_only)
+{
+	phys_addr_t ret = INVALID_PADDR;
+	int index;
+
+	ASSERT_RBIO_STRIPE(stripe_nr >= 0 && stripe_nr < rbio->real_stripes,
+			   rbio, stripe_nr);
+	ASSERT_RBIO_SECTOR(sector_nr >= 0 && sector_nr < rbio->stripe_nsectors,
+			   rbio, sector_nr);
+	ASSERT_RBIO_SECTOR(step_nr >= 0 && step_nr < rbio->sector_nsteps,
+			   rbio, sector_nr);
+
+	index = (stripe_nr * rbio->stripe_nsectors + sector_nr) * rbio->sector_nsteps + step_nr;
+	ASSERT(index >= 0 && index < rbio->nr_sectors * rbio->sector_nsteps);
+
+	scoped_guard(spinlock, &rbio->bio_list_lock) {
+		if (rbio->bio_paddrs[index] != INVALID_PADDR || bio_list_only) {
+			/* Don't return sector without a valid page pointer */
+			if (rbio->bio_paddrs[index] != INVALID_PADDR)
+				ret = rbio->bio_paddrs[index];
+			return ret;
+		}
+	}
+	return rbio->stripe_paddrs[index];
+}
+
 /*
  * allocation and initial setup for the btrfs_raid_bio.  Not
  * this does not allocate any pages for rbio->pages.
@@ -1319,45 +1356,56 @@ static inline void *kmap_local_paddr(phys_addr_t paddr)
 	return kmap_local_page(phys_to_page(paddr)) + offset_in_page(paddr);
 }
 
-/* Generate PQ for one vertical stripe. */
-static void generate_pq_vertical(struct btrfs_raid_bio *rbio, int sectornr)
+static void generate_pq_vertical_step(struct btrfs_raid_bio *rbio, unsigned int sector_nr,
+				      unsigned int step_nr)
 {
 	void **pointers = rbio->finish_pointers;
-	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
+	const u32 step = min(rbio->bioc->fs_info->sectorsize, PAGE_SIZE);
 	int stripe;
 	const bool has_qstripe = rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID6;
 
 	/* First collect one sector from each data stripe */
 	for (stripe = 0; stripe < rbio->nr_data; stripe++)
 		pointers[stripe] = kmap_local_paddr(
-				sector_paddr_in_rbio(rbio, stripe, sectornr, 0));
+				step_paddr_in_rbio(rbio, stripe, sector_nr, step_nr, 0));
 
 	/* Then add the parity stripe */
-	set_bit(rbio_sector_index(rbio, rbio->nr_data, sectornr),
-		rbio->stripe_uptodate_bitmap);
-	pointers[stripe++] = kmap_local_paddr(rbio_pstripe_paddr(rbio, sectornr));
+	pointers[stripe++] = kmap_local_paddr(rbio_pstripe_step_paddr(rbio, sector_nr, step_nr));
 
 	if (has_qstripe) {
 		/*
 		 * RAID6, add the qstripe and call the library function
 		 * to fill in our p/q
 		 */
-		set_bit(rbio_sector_index(rbio, rbio->nr_data + 1, sectornr),
-			rbio->stripe_uptodate_bitmap);
-		pointers[stripe++] = kmap_local_paddr(rbio_qstripe_paddr(rbio, sectornr));
+		pointers[stripe++] = kmap_local_paddr(
+				rbio_qstripe_step_paddr(rbio, sector_nr, step_nr));
 
 		assert_rbio(rbio);
-		raid6_call.gen_syndrome(rbio->real_stripes, sectorsize,
-					pointers);
+		raid6_call.gen_syndrome(rbio->real_stripes, step, pointers);
 	} else {
 		/* raid5 */
-		memcpy(pointers[rbio->nr_data], pointers[0], sectorsize);
-		run_xor(pointers + 1, rbio->nr_data - 1, sectorsize);
+		memcpy(pointers[rbio->nr_data], pointers[0], step);
+		run_xor(pointers + 1, rbio->nr_data - 1, step);
 	}
 	for (stripe = stripe - 1; stripe >= 0; stripe--)
 		kunmap_local(pointers[stripe]);
 }
 
+/* Generate PQ for one vertical stripe. */
+static void generate_pq_vertical(struct btrfs_raid_bio *rbio, int sectornr)
+{
+	const bool has_qstripe = rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID6;
+
+	for (int i = 0; i < rbio->sector_nsteps; i++)
+		generate_pq_vertical_step(rbio, sectornr, i);
+
+	set_bit(rbio_sector_index(rbio, rbio->nr_data, sectornr),
+		rbio->stripe_uptodate_bitmap);
+	if (has_qstripe)
+		set_bit(rbio_sector_index(rbio, rbio->nr_data + 1, sectornr),
+			rbio->stripe_uptodate_bitmap);
+}
+
 static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
 				   struct bio_list *bio_list)
 {
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 04/12] btrfs: prepare recover_vertical() to support bs > ps cases
  2025-11-17  7:30 [PATCH 00/12] btrfs: add raid56 support for bs > ps cases Qu Wenruo
                   ` (2 preceding siblings ...)
  2025-11-17  7:30 ` [PATCH 03/12] btrfs: prepare generate_pq_vertical() for bs > ps cases Qu Wenruo
@ 2025-11-17  7:30 ` Qu Wenruo
  2025-11-17  7:30 ` [PATCH 05/12] btrfs: prepare verify_one_sector() " Qu Wenruo
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-11-17  7:30 UTC (permalink / raw)
  To: linux-btrfs

Currently recover_vertical() assumes that every fs block can be mapped
by one page, which is blocking bs > ps support for raid56.

Prepare recover_vertical() to support bs > ps cases by:

- Introduce a recover_vertical_step() helper
  It recovers a full step (min(PAGE_SIZE, sectorsize)).

  Now recover_vertical() does the error check for the specified sector,
  does the recovery step by step, then does the sector verification.

- Fix the spelling of get_rbio_veritical_errors()
  The old name has a typo ("veritical"); rename it to
  get_rbio_vertical_errors().
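
For the single-failure RAID5 case, recovering one step of a lost data
stripe is again an XOR, this time of the P stripe with the surviving
data stripes; a hedged userspace sketch of the idea (the real recovery
also handles RAID6 and double failures via the raid6 library):

```c
#include <stddef.h>
#include <stdint.h>

/* Rebuild one step of a failed RAID5 data stripe from P and the survivors. */
static void recover_step_raid5(uint8_t *out, const uint8_t *p,
			       uint8_t *const surviving[], int nr_surviving,
			       size_t step)
{
	for (size_t i = 0; i < step; i++) {
		uint8_t v = p[i];

		for (int d = 0; d < nr_surviving; d++)
			v ^= surviving[d][i];
		out[i] = v;
	}
}
```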

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 141 ++++++++++++++++++++++------------------------
 1 file changed, 68 insertions(+), 73 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 95cf037cbf0f..fafd200a2eff 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1007,21 +1007,13 @@ static phys_addr_t sector_paddr_in_rbio(struct btrfs_raid_bio *rbio,
  * Similar to sector_paddr_in_rbio(), but with extra consideration for
  * bs > ps cases, where we can have multiple steps for a fs block.
  */
-static phys_addr_t step_paddr_in_rbio(struct btrfs_raid_bio *rbio,
-				      int stripe_nr, int sector_nr, int step_nr,
-				      bool bio_list_only)
+static phys_addr_t sector_step_paddr_in_rbio(struct btrfs_raid_bio *rbio,
+					     int stripe_nr, int sector_nr, int step_nr,
+					     bool bio_list_only)
 {
 	phys_addr_t ret = INVALID_PADDR;
-	int index;
+	const int index = rbio_paddr_index(rbio, stripe_nr, sector_nr, step_nr);
 
-	ASSERT_RBIO_STRIPE(stripe_nr >= 0 && stripe_nr < rbio->real_stripes,
-			   rbio, stripe_nr);
-	ASSERT_RBIO_SECTOR(sector_nr >= 0 && sector_nr < rbio->stripe_nsectors,
-			   rbio, sector_nr);
-	ASSERT_RBIO_SECTOR(step_nr >= 0 && step_nr < rbio->sector_nsteps,
-			   rbio, sector_nr);
-
-	index = (stripe_nr * rbio->stripe_nsectors + sector_nr) * rbio->sector_nsteps + step_nr;
 	ASSERT(index >= 0 && index < rbio->nr_sectors * rbio->sector_nsteps);
 
 	scoped_guard(spinlock, &rbio->bio_list_lock) {
@@ -1147,8 +1139,8 @@ static int alloc_rbio_parity_pages(struct btrfs_raid_bio *rbio)
  * @faila and @failb will also be updated to the first and second stripe
  * number of the errors.
  */
-static int get_rbio_veritical_errors(struct btrfs_raid_bio *rbio, int sector_nr,
-				     int *faila, int *failb)
+static int get_rbio_vertical_errors(struct btrfs_raid_bio *rbio, int sector_nr,
+				    int *faila, int *failb)
 {
 	int stripe_nr;
 	int found_errors = 0;
@@ -1219,8 +1211,8 @@ static int rbio_add_io_paddr(struct btrfs_raid_bio *rbio, struct bio_list *bio_l
 			rbio->error_bitmap);
 
 		/* Check if we have reached tolerance early. */
-		found_errors = get_rbio_veritical_errors(rbio, sector_nr,
-							 NULL, NULL);
+		found_errors = get_rbio_vertical_errors(rbio, sector_nr,
+							NULL, NULL);
 		if (unlikely(found_errors > rbio->bioc->max_errors))
 			return -EIO;
 		return 0;
@@ -1367,7 +1359,7 @@ static void generate_pq_vertical_step(struct btrfs_raid_bio *rbio, unsigned int
 	/* First collect one sector from each data stripe */
 	for (stripe = 0; stripe < rbio->nr_data; stripe++)
 		pointers[stripe] = kmap_local_paddr(
-				step_paddr_in_rbio(rbio, stripe, sector_nr, step_nr, 0));
+				sector_step_paddr_in_rbio(rbio, stripe, sector_nr, step_nr, 0));
 
 	/* Then add the parity stripe */
 	pointers[stripe++] = kmap_local_paddr(rbio_pstripe_step_paddr(rbio, sector_nr, step_nr));
@@ -1868,41 +1860,18 @@ static int verify_one_sector(struct btrfs_raid_bio *rbio,
 	return ret;
 }
 
-/*
- * Recover a vertical stripe specified by @sector_nr.
- * @*pointers are the pre-allocated pointers by the caller, so we don't
- * need to allocate/free the pointers again and again.
- */
-static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
-			    void **pointers, void **unmap_array)
+static void recover_vertical_step(struct btrfs_raid_bio *rbio,
+				  unsigned int sector_nr,
+				  unsigned int step_nr,
+				  int faila, int failb,
+				  void **pointers, void **unmap_array)
 {
 	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
-	const u32 sectorsize = fs_info->sectorsize;
-	int found_errors;
-	int faila;
-	int failb;
+	const u32 step = min(fs_info->sectorsize, PAGE_SIZE);
 	int stripe_nr;
-	int ret = 0;
 
-	/*
-	 * Now we just use bitmap to mark the horizontal stripes in
-	 * which we have data when doing parity scrub.
-	 */
-	if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB &&
-	    !test_bit(sector_nr, &rbio->dbitmap))
-		return 0;
-
-	found_errors = get_rbio_veritical_errors(rbio, sector_nr, &faila,
-						 &failb);
-	/*
-	 * No errors in the vertical stripe, skip it.  Can happen for recovery
-	 * which only part of a stripe failed csum check.
-	 */
-	if (!found_errors)
-		return 0;
-
-	if (unlikely(found_errors > rbio->bioc->max_errors))
-		return -EIO;
+	ASSERT(step_nr < rbio->sector_nsteps);
+	ASSERT(sector_nr < rbio->stripe_nsectors);
 
 	/*
 	 * Setup our array of pointers with sectors from each stripe
@@ -1918,9 +1887,9 @@ static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
 		 * bio list if possible.
 		 */
 		if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
-			paddr = sector_paddr_in_rbio(rbio, stripe_nr, sector_nr, 0);
+			paddr = sector_step_paddr_in_rbio(rbio, stripe_nr, sector_nr, step_nr, 0);
 		} else {
-			paddr = rbio_stripe_paddr(rbio, stripe_nr, sector_nr);
+			paddr = rbio_stripe_step_paddr(rbio, stripe_nr, sector_nr, step_nr);
 		}
 		pointers[stripe_nr] = kmap_local_paddr(paddr);
 		unmap_array[stripe_nr] = pointers[stripe_nr];
@@ -1968,10 +1937,10 @@ static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
 		}
 
 		if (failb == rbio->real_stripes - 2) {
-			raid6_datap_recov(rbio->real_stripes, sectorsize,
+			raid6_datap_recov(rbio->real_stripes, step,
 					  faila, pointers);
 		} else {
-			raid6_2data_recov(rbio->real_stripes, sectorsize,
+			raid6_2data_recov(rbio->real_stripes, step,
 					  faila, failb, pointers);
 		}
 	} else {
@@ -1981,7 +1950,7 @@ static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
 		ASSERT(failb == -1);
 pstripe:
 		/* Copy parity block into failed block to start with */
-		memcpy(pointers[faila], pointers[rbio->nr_data], sectorsize);
+		memcpy(pointers[faila], pointers[rbio->nr_data], step);
 
 		/* Rearrange the pointer array */
 		p = pointers[faila];
@@ -1991,24 +1960,54 @@ static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
 		pointers[rbio->nr_data - 1] = p;
 
 		/* Xor in the rest */
-		run_xor(pointers, rbio->nr_data - 1, sectorsize);
-
+		run_xor(pointers, rbio->nr_data - 1, step);
 	}
 
+cleanup:
+	for (stripe_nr = rbio->real_stripes - 1; stripe_nr >= 0; stripe_nr--)
+		kunmap_local(unmap_array[stripe_nr]);
+}
+
+/*
+ * Recover a vertical stripe specified by @sector_nr.
+ * @*pointers are the pre-allocated pointers by the caller, so we don't
+ * need to allocate/free the pointers again and again.
+ */
+static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
+			    void **pointers, void **unmap_array)
+{
+	int found_errors;
+	int faila;
+	int failb;
+	int ret = 0;
+
 	/*
-	 * No matter if this is a RMW or recovery, we should have all
-	 * failed sectors repaired in the vertical stripe, thus they are now
-	 * uptodate.
-	 * Especially if we determine to cache the rbio, we need to
-	 * have at least all data sectors uptodate.
-	 *
-	 * If possible, also check if the repaired sector matches its data
-	 * checksum.
+	 * Now we just use bitmap to mark the horizontal stripes in
+	 * which we have data when doing parity scrub.
 	 */
+	if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB &&
+	    !test_bit(sector_nr, &rbio->dbitmap))
+		return 0;
+
+	found_errors = get_rbio_vertical_errors(rbio, sector_nr, &faila,
+						&failb);
+	/*
+	 * No errors in the vertical stripe, skip it.  Can happen for recovery
+	 * which only part of a stripe failed csum check.
+	 */
+	if (!found_errors)
+		return 0;
+
+	if (unlikely(found_errors > rbio->bioc->max_errors))
+		return -EIO;
+
+	for (int i = 0; i < rbio->sector_nsteps; i++)
+		recover_vertical_step(rbio, sector_nr, i, faila, failb,
+				      pointers, unmap_array);
 	if (faila >= 0) {
 		ret = verify_one_sector(rbio, faila, sector_nr);
 		if (ret < 0)
-			goto cleanup;
+			return ret;
 
 		set_bit(rbio_sector_index(rbio, faila, sector_nr),
 			rbio->stripe_uptodate_bitmap);
@@ -2016,15 +2015,11 @@ static int recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
 	if (failb >= 0) {
 		ret = verify_one_sector(rbio, failb, sector_nr);
 		if (ret < 0)
-			goto cleanup;
+			return ret;
 
 		set_bit(rbio_sector_index(rbio, failb, sector_nr),
 			rbio->stripe_uptodate_bitmap);
 	}
-
-cleanup:
-	for (stripe_nr = rbio->real_stripes - 1; stripe_nr >= 0; stripe_nr--)
-		kunmap_local(unmap_array[stripe_nr]);
 	return ret;
 }
 
@@ -2162,7 +2157,7 @@ static void set_rbio_raid6_extra_error(struct btrfs_raid_bio *rbio, int mirror_n
 		int faila;
 		int failb;
 
-		found_errors = get_rbio_veritical_errors(rbio, sector_nr,
+		found_errors = get_rbio_vertical_errors(rbio, sector_nr,
 							 &faila, &failb);
 		/* This vertical stripe doesn't have errors. */
 		if (!found_errors)
@@ -2455,7 +2450,7 @@ static void rmw_rbio(struct btrfs_raid_bio *rbio)
 	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++) {
 		int found_errors;
 
-		found_errors = get_rbio_veritical_errors(rbio, sectornr, NULL, NULL);
+		found_errors = get_rbio_vertical_errors(rbio, sectornr, NULL, NULL);
 		if (unlikely(found_errors > rbio->bioc->max_errors)) {
 			ret = -EIO;
 			break;
@@ -2735,7 +2730,7 @@ static int recover_scrub_rbio(struct btrfs_raid_bio *rbio)
 		int failb;
 		int found_errors;
 
-		found_errors = get_rbio_veritical_errors(rbio, sector_nr,
+		found_errors = get_rbio_vertical_errors(rbio, sector_nr,
 							 &faila, &failb);
 		if (unlikely(found_errors > rbio->bioc->max_errors)) {
 			ret = -EIO;
@@ -2869,7 +2864,7 @@ static void scrub_rbio(struct btrfs_raid_bio *rbio)
 	for (sector_nr = 0; sector_nr < rbio->stripe_nsectors; sector_nr++) {
 		int found_errors;
 
-		found_errors = get_rbio_veritical_errors(rbio, sector_nr, NULL, NULL);
+		found_errors = get_rbio_vertical_errors(rbio, sector_nr, NULL, NULL);
 		if (unlikely(found_errors > rbio->bioc->max_errors)) {
 			ret = -EIO;
 			break;
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 05/12] btrfs: prepare verify_one_sector() to support bs > ps cases
  2025-11-17  7:30 [PATCH 00/12] btrfs: add raid56 support for bs > ps cases Qu Wenruo
                   ` (3 preceding siblings ...)
  2025-11-17  7:30 ` [PATCH 04/12] btrfs: prepare recover_vertical() to support " Qu Wenruo
@ 2025-11-17  7:30 ` Qu Wenruo
  2025-11-17  7:30 ` [PATCH 06/12] btrfs: prepare verify_bio_data_sectors() " Qu Wenruo
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-11-17  7:30 UTC (permalink / raw)
  To: linux-btrfs

The function verify_one_sector() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce helpers to get a paddrs pointer
  Thankfully all the higher layer bios should still be aligned to the
  fs block size, thus an fs block should still be fully covered by the
  bio.

  Introduce sector_paddrs_in_rbio() and rbio_stripe_paddrs(), which will
  return a paddrs pointer inside btrfs_raid_bio::bio_paddrs[] or
  stripe_paddrs[].

  The pointer can be directly passed to
  btrfs_calculate_block_csum_pages() to verify the checksum.

- Open code btrfs_check_block_csum()
  btrfs_check_block_csum() only supports fs blocks backed by large
  folios.

  But for raid56 we can have fs blocks backed by multiple
  non-contiguous pages, e.g. direct IO, encoded read/write/send.

  So instead of using btrfs_check_block_csum(), open code it to use
  btrfs_calculate_block_csum_pages().
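
The open-coded check can be illustrated with a userspace sketch. The XOR
"checksum" below is a toy stand-in for the real csum algorithm, and the
function names are hypothetical; only the compute-then-memcmp structure
mirrors the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for btrfs_calculate_block_csum_pages(): hash an fs block
 * that is spread over nr_steps separate step-sized buffers. */
static void calc_block_csum_steps(const uint8_t *const *steps,
				  unsigned int nr_steps, unsigned int step,
				  uint8_t *csum)
{
	*csum = 0;
	for (unsigned int i = 0; i < nr_steps; i++)
		for (unsigned int j = 0; j < step; j++)
			*csum ^= steps[i][j];
}

/* Open-coded verification: recompute into a local buffer, then memcmp
 * against the expected csum, returning an -EIO style error on mismatch. */
static int check_block_csum_steps(const uint8_t *const *steps,
				  unsigned int nr_steps, unsigned int step,
				  const uint8_t *expected)
{
	uint8_t csum_buf;

	calc_block_csum_steps(steps, nr_steps, step, &csum_buf);
	if (memcmp(&csum_buf, expected, sizeof(csum_buf)) != 0)
		return -5; /* -EIO */
	return 0;
}
```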

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 55 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 49 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index fafd200a2eff..07d452439e37 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -732,6 +732,13 @@ static phys_addr_t rbio_qstripe_step_paddr(const struct btrfs_raid_bio *rbio,
 	return rbio_stripe_step_paddr(rbio, rbio->nr_data + 1, sector_nr, step_nr);
 }
 
+/* Return a paddr pointer into the rbio::stripe_paddrs[] for the specified sector. */
+static phys_addr_t *rbio_stripe_paddrs(const struct btrfs_raid_bio *rbio,
+				       unsigned int stripe_nr, unsigned int sector_nr)
+{
+	return &rbio->stripe_paddrs[rbio_paddr_index(rbio, stripe_nr, sector_nr, 0)];
+}
+
 /*
  * The first stripe in the table for a logical address
  * has the lock.  rbios are added in one of three ways:
@@ -1003,6 +1010,41 @@ static phys_addr_t sector_paddr_in_rbio(struct btrfs_raid_bio *rbio,
 	return rbio->stripe_paddrs[index];
 }
 
+/*
+ * Get paddr pointer for the sector specified by its @stripe_nr and @sector_nr.
+ *
+ * @rbio:               The raid bio
+ * @stripe_nr:          Stripe number, valid range [0, real_stripes)
+ * @sector_nr:		Sector number inside the stripe,
+ *			valid range [0, stripe_nsectors)
+ * @bio_list_only:      Whether to use sectors inside the bio list only.
+ *
+ * The read/modify/write code wants to reuse the original bio page as much
+ * as possible, and only use stripe_sectors as fallback.
+ *
+ * Return NULL if bio_list_only is set but the specified sector has no
+ * corresponding bio.
+ */
+static phys_addr_t *sector_paddrs_in_rbio(struct btrfs_raid_bio *rbio,
+					  int stripe_nr, int sector_nr,
+					  bool bio_list_only)
+{
+	phys_addr_t *ret = NULL;
+	const int index = rbio_paddr_index(rbio, stripe_nr, sector_nr, 0);
+
+	ASSERT(index >= 0 && index < rbio->nr_sectors * rbio->sector_nsteps);
+
+	scoped_guard(spinlock, &rbio->bio_list_lock) {
+		if (rbio->bio_paddrs[index] != INVALID_PADDR || bio_list_only) {
+			/* Don't return sector without a valid page pointer */
+			if (rbio->bio_paddrs[index] != INVALID_PADDR)
+				ret = &rbio->bio_paddrs[index];
+			return ret;
+		}
+	}
+	return &rbio->stripe_paddrs[index];
+}
+
 /*
  * Similar to sector_paddr_in_rbio(), but with extra consideration for
  * bs > ps cases, where we can have multiple steps for a fs block.
@@ -1832,10 +1874,9 @@ static int verify_one_sector(struct btrfs_raid_bio *rbio,
 			     int stripe_nr, int sector_nr)
 {
 	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
-	phys_addr_t paddr;
+	phys_addr_t *paddrs;
 	u8 csum_buf[BTRFS_CSUM_SIZE];
 	u8 *csum_expected;
-	int ret;
 
 	if (!rbio->csum_bitmap || !rbio->csum_buf)
 		return 0;
@@ -1848,16 +1889,18 @@ static int verify_one_sector(struct btrfs_raid_bio *rbio,
 	 * bio list if possible.
 	 */
 	if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
-		paddr = sector_paddr_in_rbio(rbio, stripe_nr, sector_nr, 0);
+		paddrs = sector_paddrs_in_rbio(rbio, stripe_nr, sector_nr, 0);
 	} else {
-		paddr = rbio_stripe_paddr(rbio, stripe_nr, sector_nr);
+		paddrs = rbio_stripe_paddrs(rbio, stripe_nr, sector_nr);
 	}
 
 	csum_expected = rbio->csum_buf +
 			(stripe_nr * rbio->stripe_nsectors + sector_nr) *
 			fs_info->csum_size;
-	ret = btrfs_check_block_csum(fs_info, paddr, csum_buf, csum_expected);
-	return ret;
+	btrfs_calculate_block_csum_pages(fs_info, paddrs, csum_buf);
+	if (unlikely(memcmp(csum_buf, csum_expected, fs_info->csum_size) != 0))
+		return -EIO;
+	return 0;
 }
 
 static void recover_vertical_step(struct btrfs_raid_bio *rbio,
-- 
2.51.2



* [PATCH 06/12] btrfs: prepare verify_bio_data_sectors() to support bs > ps cases
  2025-11-17  7:30 [PATCH 00/12] btrfs: add raid56 support for bs > ps cases Qu Wenruo
                   ` (4 preceding siblings ...)
  2025-11-17  7:30 ` [PATCH 05/12] btrfs: prepare verify_one_sector() " Qu Wenruo
@ 2025-11-17  7:30 ` Qu Wenruo
  2025-11-17  7:30 ` [PATCH 07/12] btrfs: prepare set_bio_pages_uptodate() " Qu Wenruo
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-11-17  7:30 UTC (permalink / raw)
  To: linux-btrfs

The function verify_bio_data_sectors() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Make get_bio_sector_nr() consider bs > ps cases
  The function is used to calculate the sector number of a device
  bio submitted by the btrfs raid56 layer.

- Assemble a local paddrs[] for checksum calculation

- Open code btrfs_check_block_csum()
  btrfs_check_block_csum() only supports fs blocks backed by large
  folios.

  But for raid56 we can have fs blocks backed by multiple
  non-contiguous pages, e.g. direct IO, encoded read/write/send.

  So instead of using btrfs_check_block_csum(), open code it to use
  btrfs_calculate_block_csum_pages().
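
The per-step accumulation can be sketched as follows; only the loop
structure mirrors verify_bio_data_sectors(), and the helper names are
illustrative:

```c
#include <assert.h>

/* Index into the local paddrs[] for the step at byte @offset of the bio,
 * mirroring "(offset / step) % nr_steps" in the patch. */
static unsigned int step_slot(unsigned int offset, unsigned int step,
			      unsigned int nr_steps)
{
	return (offset / step) % nr_steps;
}

/* True once the steps up to @offset complete a full fs block, i.e. the
 * IS_ALIGNED(offset, sectorsize) check done after "offset += step". */
static int block_complete(unsigned int offset, unsigned int sectorsize)
{
	return offset % sectorsize == 0;
}
```

With 8K blocks on 4K pages, offsets 0 and 4K land in slots 0 and 1, and
the checksum is only verified once the 8K boundary is reached.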

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 07d452439e37..6d9d9d494721 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1620,9 +1620,9 @@ static int get_bio_sector_nr(struct btrfs_raid_bio *rbio, struct bio *bio)
 	int i;
 
 	for (i = 0; i < rbio->nr_sectors; i++) {
-		if (rbio->stripe_paddrs[i] == bvec_paddr)
+		if (rbio->stripe_paddrs[i * rbio->sector_nsteps] == bvec_paddr)
 			break;
-		if (rbio->bio_paddrs[i] == bvec_paddr)
+		if (rbio->bio_paddrs[i * rbio->sector_nsteps] == bvec_paddr)
 			break;
 	}
 	ASSERT(i < rbio->nr_sectors);
@@ -1655,7 +1655,11 @@ static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
 				    struct bio *bio)
 {
 	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
+	const u32 step = min(fs_info->sectorsize, PAGE_SIZE);
+	const u32 nr_steps = rbio->sector_nsteps;
 	int total_sector_nr = get_bio_sector_nr(rbio, bio);
+	u32 offset = 0;
+	phys_addr_t paddrs[BTRFS_MAX_BLOCKSIZE / PAGE_SIZE];
 	phys_addr_t paddr;
 
 	/* No data csum for the whole stripe, no need to verify. */
@@ -1666,18 +1670,24 @@ static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
 	if (total_sector_nr >= rbio->nr_data * rbio->stripe_nsectors)
 		return;
 
-	btrfs_bio_for_each_block_all(paddr, bio, fs_info->sectorsize) {
+	btrfs_bio_for_each_block_all(paddr, bio, step) {
 		u8 csum_buf[BTRFS_CSUM_SIZE];
-		u8 *expected_csum = rbio->csum_buf + total_sector_nr * fs_info->csum_size;
-		int ret;
+		u8 *expected_csum;
+
+		paddrs[(offset / step) % nr_steps] = paddr;
+		offset += step;
+
+		/* Not yet covering the full fs block, continue to the next step. */
+		if (!IS_ALIGNED(offset, fs_info->sectorsize))
+			continue;
 
 		/* No csum for this sector, skip to the next sector. */
 		if (!test_bit(total_sector_nr, rbio->csum_bitmap))
 			continue;
 
-		ret = btrfs_check_block_csum(fs_info, paddr,
-					     csum_buf, expected_csum);
-		if (ret < 0)
+		expected_csum = rbio->csum_buf + total_sector_nr * fs_info->csum_size;
+		btrfs_calculate_block_csum_pages(fs_info, paddrs, csum_buf);
+		if (unlikely(memcmp(csum_buf, expected_csum, fs_info->csum_size) != 0))
 			set_bit(total_sector_nr, rbio->error_bitmap);
 		total_sector_nr++;
 	}
-- 
2.51.2



* [PATCH 07/12] btrfs: prepare set_bio_pages_uptodate() to support bs > ps cases
  2025-11-17  7:30 [PATCH 00/12] btrfs: add raid56 support for bs > ps cases Qu Wenruo
                   ` (5 preceding siblings ...)
  2025-11-17  7:30 ` [PATCH 06/12] btrfs: prepare verify_bio_data_sectors() " Qu Wenruo
@ 2025-11-17  7:30 ` Qu Wenruo
  2025-11-17  7:30 ` [PATCH 08/12] btrfs: prepare steal_rbio() " Qu Wenruo
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-11-17  7:30 UTC (permalink / raw)
  To: linux-btrfs

The function set_bio_pages_uptodate() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Update find_stripe_sector_nr() to check only the first step paddr
  We don't need to check each paddr, as the bios are still aligned to fs
  block size, thus checking the first step is enough.

- Use step size to iterate the bio
  This means we only need to find the sector number for the first step
  of each fs block, and skip the remaining part.
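
The effect of iterating by step while only looking up the first step of
each fs block can be sketched with a hypothetical counting helper (not
the kernel function):

```c
#include <assert.h>

/* Count how many find_stripe_sector_nr()-style lookups a bio of
 * @bio_len bytes needs when iterated in @step-sized chunks: only the
 * chunk at each fs block boundary triggers a lookup. */
static unsigned int nr_sector_lookups(unsigned int bio_len,
				      unsigned int sectorsize,
				      unsigned int step)
{
	unsigned int lookups = 0;

	for (unsigned int offset = 0; offset < bio_len; offset += step)
		if (offset % sectorsize == 0)	/* first step of a sector */
			lookups++;
	return lookups;
}
```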

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6d9d9d494721..820bdc7f6dbe 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1588,7 +1588,7 @@ static void set_rbio_range_error(struct btrfs_raid_bio *rbio, struct bio *bio)
 static int find_stripe_sector_nr(struct btrfs_raid_bio *rbio, phys_addr_t paddr)
 {
 	for (int i = 0; i < rbio->nr_sectors; i++) {
-		if (rbio->stripe_paddrs[i] == paddr)
+		if (rbio->stripe_paddrs[i * rbio->sector_nsteps] == paddr)
 			return i;
 	}
 	return -1;
@@ -1600,17 +1600,23 @@ static int find_stripe_sector_nr(struct btrfs_raid_bio *rbio, phys_addr_t paddr)
  */
 static void set_bio_pages_uptodate(struct btrfs_raid_bio *rbio, struct bio *bio)
 {
-	const u32 blocksize = rbio->bioc->fs_info->sectorsize;
+	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
+	const u32 step = min(sectorsize, PAGE_SIZE);
+	u32 offset = 0;
 	phys_addr_t paddr;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 
-	btrfs_bio_for_each_block_all(paddr, bio, blocksize) {
-		int sector_nr = find_stripe_sector_nr(rbio, paddr);
+	btrfs_bio_for_each_block_all(paddr, bio, step) {
+		/* Hitting the first step of a sector. */
+		if (IS_ALIGNED(offset, sectorsize)) {
+			int sector_nr = find_stripe_sector_nr(rbio, paddr);
 
-		ASSERT(sector_nr >= 0);
-		if (sector_nr >= 0)
-			set_bit(sector_nr, rbio->stripe_uptodate_bitmap);
+			ASSERT(sector_nr >= 0);
+			if (sector_nr >= 0)
+				set_bit(sector_nr, rbio->stripe_uptodate_bitmap);
+		}
+		offset += step;
 	}
 }
 
-- 
2.51.2



* [PATCH 08/12] btrfs: prepare steal_rbio() to support bs > ps cases
  2025-11-17  7:30 [PATCH 00/12] btrfs: add raid56 support for bs > ps cases Qu Wenruo
                   ` (6 preceding siblings ...)
  2025-11-17  7:30 ` [PATCH 07/12] btrfs: prepare set_bio_pages_uptodate() " Qu Wenruo
@ 2025-11-17  7:30 ` Qu Wenruo
  2025-11-17  7:30 ` [PATCH 09/12] btrfs: prepare rbio_bio_add_io_paddr() " Qu Wenruo
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-11-17  7:30 UTC (permalink / raw)
  To: linux-btrfs

The function steal_rbio() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce two helpers to calculate the sector number
  Previously we assumed one page would contain at least one fs block,
  thus could use something like "sectors_per_page = PAGE_SIZE / sectorsize;",
  but with bs > ps support that number would be 0.

  Instead introduce two helpers:

  * page_nr_to_sector_nr()
    Returns the sector number of the first sector covered by the page.

  * page_nr_to_num_sectors()
    Return how many sectors are covered by the page.

  And use the returned values for bitmap operations instead of the
  open-coded "PAGE_SIZE / sectorsize".
  Those helpers also have extra ASSERT()s to catch weird numbers.

- Use above helpers
  The involved functions are:
  * steal_rbio_page()
  * is_data_stripe_page()
  * full_page_sectors_uptodate()
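
The arithmetic of the two helpers can be checked with a userspace
sketch, assuming a 4K page purely for illustration; the names are
hypothetical:

```c
#include <assert.h>

#define SIM_PAGE_SIZE 4096u
#define SIM_PAGE_SHIFT 12u

/* First sector covered by @page_nr, as in page_nr_to_sector_nr(). */
static unsigned int page_to_sector(unsigned int page_nr,
				   unsigned int sectorsize_bits)
{
	return (page_nr << SIM_PAGE_SHIFT) >> sectorsize_bits;
}

/* Sectors covered by one page: round_up(PAGE_SIZE, sectorsize) / sectorsize,
 * which is 1 for bs > ps and ps / bs for bs <= ps. */
static unsigned int page_num_sectors(unsigned int sectorsize,
				     unsigned int sectorsize_bits)
{
	unsigned int rounded = (SIM_PAGE_SIZE + sectorsize - 1) /
			       sectorsize * sectorsize;

	return rounded >> sectorsize_bits;
}
```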

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 57 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 44 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 820bdc7f6dbe..7cb3b3eccda6 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -300,18 +300,47 @@ static int rbio_bucket(struct btrfs_raid_bio *rbio)
 	return hash_64(num >> 16, BTRFS_STRIPE_HASH_TABLE_BITS);
 }
 
-static __maybe_unused bool full_page_sectors_uptodate(struct btrfs_raid_bio *rbio,
-						      unsigned int page_nr)
+/* Get the sector number of the first sector covered by @page_nr. */
+static u32 page_nr_to_sector_nr(struct btrfs_raid_bio *rbio, unsigned int page_nr)
 {
-	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
-	const u32 sectors_per_page = PAGE_SIZE / sectorsize;
-	int i;
+	u32 sector_nr;
 
 	ASSERT(page_nr < rbio->nr_pages);
 
-	for (i = sectors_per_page * page_nr;
-	     i < sectors_per_page * page_nr + sectors_per_page;
-	     i++) {
+	sector_nr = page_nr << PAGE_SHIFT >> rbio->bioc->fs_info->sectorsize_bits;
+	ASSERT(sector_nr < rbio->nr_sectors);
+	return sector_nr;
+}
+
+/*
+ * Get the number of sectors covered by @page_nr.
+ *
+ * For bs > ps cases, the result will always be 1.
+ * For bs <= ps cases, the result will be ps / bs.
+ */
+static u32 page_nr_to_num_sectors(struct btrfs_raid_bio *rbio, unsigned int page_nr)
+{
+	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
+	u32 nr_sectors;
+
+	ASSERT(page_nr < rbio->nr_pages);
+
+	nr_sectors = round_up(PAGE_SIZE, fs_info->sectorsize) >> fs_info->sectorsize_bits;
+	ASSERT(nr_sectors > 0);
+	return nr_sectors;
+}
+
+static __maybe_unused bool full_page_sectors_uptodate(struct btrfs_raid_bio *rbio,
+						      unsigned int page_nr)
+{
+	const u32 sector_nr = page_nr_to_sector_nr(rbio, page_nr);
+	const u32 nr_bits = page_nr_to_num_sectors(rbio, page_nr);
+	int i;
+
+	ASSERT(page_nr < rbio->nr_pages);
+	ASSERT(sector_nr + nr_bits <= rbio->nr_sectors);
+
+	for (i = sector_nr; i < sector_nr + nr_bits; i++) {
 		if (!test_bit(i, rbio->stripe_uptodate_bitmap))
 			return false;
 	}
@@ -345,8 +374,11 @@ static void index_stripe_sectors(struct btrfs_raid_bio *rbio)
 static void steal_rbio_page(struct btrfs_raid_bio *src,
 			    struct btrfs_raid_bio *dest, int page_nr)
 {
-	const u32 sectorsize = src->bioc->fs_info->sectorsize;
-	const u32 sectors_per_page = PAGE_SIZE / sectorsize;
+	const u32 sector_nr = page_nr_to_sector_nr(src, page_nr);
+	const u32 nr_bits = page_nr_to_num_sectors(src, page_nr);
+
+	ASSERT(page_nr < src->nr_pages);
+	ASSERT(sector_nr + nr_bits <= src->nr_sectors);
 
 	if (dest->stripe_pages[page_nr])
 		__free_page(dest->stripe_pages[page_nr]);
@@ -354,13 +386,12 @@ static void steal_rbio_page(struct btrfs_raid_bio *src,
 	src->stripe_pages[page_nr] = NULL;
 
 	/* Also update the stripe_uptodate_bitmap bits. */
-	bitmap_set(dest->stripe_uptodate_bitmap, sectors_per_page * page_nr, sectors_per_page);
+	bitmap_set(dest->stripe_uptodate_bitmap, sector_nr, nr_bits);
 }
 
 static bool is_data_stripe_page(struct btrfs_raid_bio *rbio, int page_nr)
 {
-	const int sector_nr = (page_nr << PAGE_SHIFT) >>
-			      rbio->bioc->fs_info->sectorsize_bits;
+	const int sector_nr = page_nr_to_sector_nr(rbio, page_nr);
 
 	/*
 	 * We have ensured PAGE_SIZE is aligned with sectorsize, thus
-- 
2.51.2



* [PATCH 09/12] btrfs: prepare rbio_bio_add_io_paddr() to support bs > ps cases
  2025-11-17  7:30 [PATCH 00/12] btrfs: add raid56 support for bs > ps cases Qu Wenruo
                   ` (7 preceding siblings ...)
  2025-11-17  7:30 ` [PATCH 08/12] btrfs: prepare steal_rbio() " Qu Wenruo
@ 2025-11-17  7:30 ` Qu Wenruo
  2025-11-17  7:30 ` [PATCH 10/12] btrfs: prepare finish_parity_scrub() " Qu Wenruo
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-11-17  7:30 UTC (permalink / raw)
  To: linux-btrfs

The function rbio_bio_add_io_paddr() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce a helper bio_add_paddrs()
  Previously we only needed to add a single page to a bio for an fs
  block, but now we may need to add multiple pages, which means we can
  fail halfway through.

  In that case we need to properly revert the bio (only its size
  though) for such halfway failures.

- Rename rbio_add_io_paddr() to rbio_add_io_paddrs()
  And change the @paddr parameter to @paddrs[].

- Change all callers to use the updated rbio_add_io_paddrs()
  The @paddrs pointer for the new function can be grabbed using the
  sector_paddrs_in_rbio() and rbio_stripe_paddrs() helpers.
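
The revert-on-partial-failure behaviour of bio_add_paddrs() can be
modelled with a plain byte counter; the name and the capacity model are
illustrative, not the block layer API:

```c
#include <assert.h>

/*
 * Model of bio_add_paddrs(): add @nr_steps chunks of @step bytes to a
 * "bio" that can hold at most @capacity bytes.  On a halfway failure,
 * revert the size already added and report 0, so the caller sees an
 * all-or-nothing result, just like the real helper.
 */
static unsigned int add_steps(unsigned int *bio_size, unsigned int capacity,
			      unsigned int nr_steps, unsigned int step)
{
	unsigned int added = 0;

	for (unsigned int i = 0; i < nr_steps; i++) {
		if (*bio_size + step > capacity) {
			*bio_size -= added;	/* revert the partial add */
			return 0;
		}
		*bio_size += step;
		added += step;
	}
	return added;
}
```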

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 106 ++++++++++++++++++++++++++++------------------
 1 file changed, 65 insertions(+), 41 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 7cb3b3eccda6..44eede8d9544 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1245,17 +1245,41 @@ static int get_rbio_vertical_errors(struct btrfs_raid_bio *rbio, int sector_nr,
 	return found_errors;
 }
 
+static int bio_add_paddrs(struct bio *bio, phys_addr_t *paddrs, unsigned int nr_steps,
+			  unsigned int step)
+{
+	int added = 0;
+	int ret;
+
+	for (int i = 0; i < nr_steps; i++) {
+		ret = bio_add_page(bio, phys_to_page(paddrs[i]), step,
+				   offset_in_page(paddrs[i]));
+		if (ret != step)
+			goto revert;
+		added += ret;
+	}
+	return added;
+revert:
+	/*
+	 * We don't need to revert the bvec, as the bio will be submitted immediately;
+	 * as long as the size is reduced the extra bvec will not be accessed.
+	 */
+	bio->bi_iter.bi_size -= added;
+	return 0;
+}
+
 /*
  * Add a single sector @sector into our list of bios for IO.
  *
  * Return 0 if everything went well.
- * Return <0 for error.
+ * Return <0 for error, and no byte will be added to @rbio.
  */
-static int rbio_add_io_paddr(struct btrfs_raid_bio *rbio, struct bio_list *bio_list,
-			     phys_addr_t paddr, unsigned int stripe_nr,
-			     unsigned int sector_nr, enum req_op op)
+static int rbio_add_io_paddrs(struct btrfs_raid_bio *rbio, struct bio_list *bio_list,
+			      phys_addr_t *paddrs, unsigned int stripe_nr,
+			      unsigned int sector_nr, enum req_op op)
 {
 	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
+	const u32 step = min(sectorsize, PAGE_SIZE);
 	struct bio *last = bio_list->tail;
 	int ret;
 	struct bio *bio;
@@ -1271,7 +1295,7 @@ static int rbio_add_io_paddr(struct btrfs_raid_bio *rbio, struct bio_list *bio_l
 			   rbio, stripe_nr);
 	ASSERT_RBIO_SECTOR(sector_nr >= 0 && sector_nr < rbio->stripe_nsectors,
 			   rbio, sector_nr);
-	ASSERT(paddr != INVALID_PADDR);
+	ASSERT(paddrs != NULL);
 
 	stripe = &rbio->bioc->stripes[stripe_nr];
 	disk_start = stripe->physical + sector_nr * sectorsize;
@@ -1302,8 +1326,7 @@ static int rbio_add_io_paddr(struct btrfs_raid_bio *rbio, struct bio_list *bio_l
 		 */
 		if (last_end == disk_start && !last->bi_status &&
 		    last->bi_bdev == stripe->dev->bdev) {
-			ret = bio_add_page(last, phys_to_page(paddr), sectorsize,
-					   offset_in_page(paddr));
+			ret = bio_add_paddrs(last, paddrs, rbio->sector_nsteps, step);
 			if (ret == sectorsize)
 				return 0;
 		}
@@ -1316,7 +1339,8 @@ static int rbio_add_io_paddr(struct btrfs_raid_bio *rbio, struct bio_list *bio_l
 	bio->bi_iter.bi_sector = disk_start >> SECTOR_SHIFT;
 	bio->bi_private = rbio;
 
-	__bio_add_page(bio, phys_to_page(paddr), sectorsize, offset_in_page(paddr));
+	ret = bio_add_paddrs(bio, paddrs, rbio->sector_nsteps, step);
+	ASSERT(ret == sectorsize);
 	bio_list_add(bio_list, bio);
 	return 0;
 }
@@ -1497,7 +1521,7 @@ static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
 	 */
 	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
 	     total_sector_nr++) {
-		phys_addr_t paddr;
+		phys_addr_t *paddrs;
 
 		stripe = total_sector_nr / rbio->stripe_nsectors;
 		sectornr = total_sector_nr % rbio->stripe_nsectors;
@@ -1507,15 +1531,15 @@ static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
 			continue;
 
 		if (stripe < rbio->nr_data) {
-			paddr = sector_paddr_in_rbio(rbio, stripe, sectornr, 1);
-			if (paddr == INVALID_PADDR)
+			paddrs = sector_paddrs_in_rbio(rbio, stripe, sectornr, 1);
+			if (paddrs == NULL)
 				continue;
 		} else {
-			paddr = rbio_stripe_paddr(rbio, stripe, sectornr);
+			paddrs = rbio_stripe_paddrs(rbio, stripe, sectornr);
 		}
 
-		ret = rbio_add_io_paddr(rbio, bio_list, paddr, stripe,
-					sectornr, REQ_OP_WRITE);
+		ret = rbio_add_io_paddrs(rbio, bio_list, paddrs, stripe,
+					 sectornr, REQ_OP_WRITE);
 		if (ret)
 			goto error;
 	}
@@ -1532,7 +1556,7 @@ static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
 
 	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
 	     total_sector_nr++) {
-		phys_addr_t paddr;
+		phys_addr_t *paddrs;
 
 		stripe = total_sector_nr / rbio->stripe_nsectors;
 		sectornr = total_sector_nr % rbio->stripe_nsectors;
@@ -1557,14 +1581,14 @@ static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
 			continue;
 
 		if (stripe < rbio->nr_data) {
-			paddr = sector_paddr_in_rbio(rbio, stripe, sectornr, 1);
-			if (paddr == INVALID_PADDR)
+			paddrs = sector_paddrs_in_rbio(rbio, stripe, sectornr, 1);
+			if (paddrs == NULL)
 				continue;
 		} else {
-			paddr = rbio_stripe_paddr(rbio, stripe, sectornr);
+			paddrs = rbio_stripe_paddrs(rbio, stripe, sectornr);
 		}
 
-		ret = rbio_add_io_paddr(rbio, bio_list, paddr,
+		ret = rbio_add_io_paddrs(rbio, bio_list, paddrs,
 					 rbio->real_stripes,
 					 sectornr, REQ_OP_WRITE);
 		if (ret)
@@ -2184,7 +2208,7 @@ static void recover_rbio(struct btrfs_raid_bio *rbio)
 	     total_sector_nr++) {
 		int stripe = total_sector_nr / rbio->stripe_nsectors;
 		int sectornr = total_sector_nr % rbio->stripe_nsectors;
-		phys_addr_t paddr;
+		phys_addr_t *paddrs;
 
 		/*
 		 * Skip the range which has error.  It can be a range which is
@@ -2201,9 +2225,9 @@ static void recover_rbio(struct btrfs_raid_bio *rbio)
 			continue;
 		}
 
-		paddr = rbio_stripe_paddr(rbio, stripe, sectornr);
-		ret = rbio_add_io_paddr(rbio, &bio_list, paddr, stripe,
-					sectornr, REQ_OP_READ);
+		paddrs = rbio_stripe_paddrs(rbio, stripe, sectornr);
+		ret = rbio_add_io_paddrs(rbio, &bio_list, paddrs, stripe,
+					 sectornr, REQ_OP_READ);
 		if (ret < 0) {
 			bio_list_put(&bio_list);
 			goto out;
@@ -2393,11 +2417,11 @@ static int rmw_read_wait_recover(struct btrfs_raid_bio *rbio)
 	     total_sector_nr++) {
 		int stripe = total_sector_nr / rbio->stripe_nsectors;
 		int sectornr = total_sector_nr % rbio->stripe_nsectors;
-		phys_addr_t paddr;
+		phys_addr_t *paddrs;
 
-		paddr = rbio_stripe_paddr(rbio, stripe, sectornr);
-		ret = rbio_add_io_paddr(rbio, &bio_list, paddr, stripe,
-					sectornr, REQ_OP_READ);
+		paddrs = rbio_stripe_paddrs(rbio, stripe, sectornr);
+		ret = rbio_add_io_paddrs(rbio, &bio_list, paddrs, stripe,
+					 sectornr, REQ_OP_READ);
 		if (ret) {
 			bio_list_put(&bio_list);
 			return ret;
@@ -2751,11 +2775,11 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio)
 	 * everything else.
 	 */
 	for_each_set_bit(sectornr, &rbio->dbitmap, rbio->stripe_nsectors) {
-		phys_addr_t paddr;
+		phys_addr_t *paddrs;
 
-		paddr = rbio_stripe_paddr(rbio, rbio->scrubp, sectornr);
-		ret = rbio_add_io_paddr(rbio, &bio_list, paddr, rbio->scrubp,
-					sectornr, REQ_OP_WRITE);
+		paddrs = rbio_stripe_paddrs(rbio, rbio->scrubp, sectornr);
+		ret = rbio_add_io_paddrs(rbio, &bio_list, paddrs, rbio->scrubp,
+					 sectornr, REQ_OP_WRITE);
 		if (ret)
 			goto cleanup;
 	}
@@ -2769,11 +2793,11 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio)
 	 */
 	ASSERT_RBIO(rbio->bioc->replace_stripe_src >= 0, rbio);
 	for_each_set_bit(sectornr, pbitmap, rbio->stripe_nsectors) {
-		phys_addr_t paddr;
+		phys_addr_t *paddrs;
 
-		paddr = rbio_stripe_paddr(rbio, rbio->scrubp, sectornr);
-		ret = rbio_add_io_paddr(rbio, &bio_list, paddr, rbio->real_stripes,
-					sectornr, REQ_OP_WRITE);
+		paddrs = rbio_stripe_paddrs(rbio, rbio->scrubp, sectornr);
+		ret = rbio_add_io_paddrs(rbio, &bio_list, paddrs, rbio->real_stripes,
+					 sectornr, REQ_OP_WRITE);
 		if (ret)
 			goto cleanup;
 	}
@@ -2889,7 +2913,7 @@ static int scrub_assemble_read_bios(struct btrfs_raid_bio *rbio)
 	     total_sector_nr++) {
 		int sectornr = total_sector_nr % rbio->stripe_nsectors;
 		int stripe = total_sector_nr / rbio->stripe_nsectors;
-		phys_addr_t paddr;
+		phys_addr_t *paddrs;
 
 		/* No data in the vertical stripe, no need to read. */
 		if (!test_bit(sectornr, &rbio->dbitmap))
@@ -2900,11 +2924,11 @@ static int scrub_assemble_read_bios(struct btrfs_raid_bio *rbio)
 		 * read them from the disk. If sector_paddr_in_rbio() finds a sector
 		 * in the bio list we don't need to read it off the stripe.
 		 */
-		paddr = sector_paddr_in_rbio(rbio, stripe, sectornr, 1);
-		if (paddr == INVALID_PADDR)
+		paddrs = sector_paddrs_in_rbio(rbio, stripe, sectornr, 1);
+		if (paddrs == NULL)
 			continue;
 
-		paddr = rbio_stripe_paddr(rbio, stripe, sectornr);
+		paddrs = rbio_stripe_paddrs(rbio, stripe, sectornr);
 		/*
 		 * The bio cache may have handed us an uptodate sector.  If so,
 		 * use it.
@@ -2913,8 +2937,8 @@ static int scrub_assemble_read_bios(struct btrfs_raid_bio *rbio)
 			     rbio->stripe_uptodate_bitmap))
 			continue;
 
-		ret = rbio_add_io_paddr(rbio, &bio_list, paddr, stripe,
-					sectornr, REQ_OP_READ);
+		ret = rbio_add_io_paddrs(rbio, &bio_list, paddrs, stripe,
+					 sectornr, REQ_OP_READ);
 		if (ret) {
 			bio_list_put(&bio_list);
 			return ret;
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 10/12] btrfs: prepare finish_parity_scrub() to support bs > ps cases
  2025-11-17  7:30 [PATCH 00/12] btrfs: add raid56 support for bs > ps cases Qu Wenruo
                   ` (8 preceding siblings ...)
  2025-11-17  7:30 ` [PATCH 09/12] btrfs: prepare rbio_bio_add_io_paddr() " Qu Wenruo
@ 2025-11-17  7:30 ` Qu Wenruo
  2025-11-17  7:30 ` [PATCH 11/12] btrfs: enable bs > ps support for raid56 Qu Wenruo
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-11-17  7:30 UTC (permalink / raw)
  To: linux-btrfs

The function finish_parity_scrub() assumes each fs block can be mapped
by a single page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce a helper, verify_one_parity_step()
  Since the P/Q generation is always done in a vertical stripe, we have
  to handle the range step by step.

- Only clear the rbio->dbitmap if all steps of an fs block match

- Remove rbio_stripe_paddr() and sector_paddr_in_rbio() helpers
  Now we either use the paddrs version for checksum, or the step version
  for P/Q generation/recovery.

- Make alloc_rbio_essential_pages() handle bs > ps cases
  Since for bs > ps cases, one fs block needs multiple pages, the
  existing simple check against rbio->stripe_pages[] is not enough.

  Extract a dedicated helper, alloc_rbio_sector_pages(), out of the
  existing alloc_rbio_essential_pages(); the helper is still based on
  sector number.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 175 +++++++++++++++++++++++-----------------------
 1 file changed, 86 insertions(+), 89 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 44eede8d9544..d8d3af2c4db5 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -735,13 +735,6 @@ static unsigned int rbio_paddr_index(const struct btrfs_raid_bio *rbio,
 	return ret;
 }
 
-/* Return a paddr from rbio->stripe_sectors, not from the bio list */
-static phys_addr_t rbio_stripe_paddr(const struct btrfs_raid_bio *rbio,
-				     unsigned int stripe_nr, unsigned int sector_nr)
-{
-	return rbio->stripe_paddrs[rbio_paddr_index(rbio, stripe_nr, sector_nr, 0)];
-}
-
 static phys_addr_t rbio_stripe_step_paddr(const struct btrfs_raid_bio *rbio,
 					  unsigned int stripe_nr, unsigned int sector_nr,
 					  unsigned int step_nr)
@@ -1001,46 +994,6 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t status)
 		rbio_endio_bio_list(extra, status);
 }
 
-/*
- * Get the paddr specified by its @stripe_nr and @sector_nr.
- *
- * @rbio:               The raid bio
- * @stripe_nr:          Stripe number, valid range [0, real_stripe)
- * @sector_nr:		Sector number inside the stripe,
- *			valid range [0, stripe_nsectors)
- * @bio_list_only:      Whether to use sectors inside the bio list only.
- *
- * The read/modify/write code wants to reuse the original bio page as much
- * as possible, and only use stripe_sectors as fallback.
- */
-static phys_addr_t sector_paddr_in_rbio(struct btrfs_raid_bio *rbio,
-					int stripe_nr, int sector_nr,
-					bool bio_list_only)
-{
-	phys_addr_t ret = INVALID_PADDR;
-	int index;
-
-	ASSERT_RBIO_STRIPE(stripe_nr >= 0 && stripe_nr < rbio->real_stripes,
-			   rbio, stripe_nr);
-	ASSERT_RBIO_SECTOR(sector_nr >= 0 && sector_nr < rbio->stripe_nsectors,
-			   rbio, sector_nr);
-
-	index = stripe_nr * rbio->stripe_nsectors + sector_nr;
-	ASSERT(index >= 0 && index < rbio->nr_sectors);
-
-	spin_lock(&rbio->bio_list_lock);
-	if (rbio->bio_paddrs[index] != INVALID_PADDR || bio_list_only) {
-		/* Don't return sector without a valid page pointer */
-		if (rbio->bio_paddrs[index] != INVALID_PADDR)
-			ret = rbio->bio_paddrs[index];
-		spin_unlock(&rbio->bio_list_lock);
-		return ret;
-	}
-	spin_unlock(&rbio->bio_list_lock);
-
-	return rbio->stripe_paddrs[index];
-}
-
 /*
  * Get paddr pointer for the sector specified by its @stripe_nr and @sector_nr.
  *
@@ -2635,42 +2588,115 @@ struct btrfs_raid_bio *raid56_parity_alloc_scrub_rbio(struct bio *bio,
 	return rbio;
 }
 
+static int alloc_rbio_sector_pages(struct btrfs_raid_bio *rbio,
+				  int sector_nr)
+{
+	const u32 step = min(PAGE_SIZE, rbio->bioc->fs_info->sectorsize);
+	const u32 base = sector_nr * rbio->sector_nsteps;
+
+	for (int i = base; i < base + rbio->sector_nsteps; i++) {
+		const unsigned int page_index = (i * step) >> PAGE_SHIFT;
+		struct page *page;
+
+		if (rbio->stripe_pages[page_index])
+			continue;
+		page = alloc_page(GFP_NOFS);
+		if (!page)
+			return -ENOMEM;
+		rbio->stripe_pages[page_index] = page;
+	}
+	return 0;
+}
+
 /*
  * We just scrub the parity that we have correct data on the same horizontal,
  * so we needn't allocate all pages for all the stripes.
  */
 static int alloc_rbio_essential_pages(struct btrfs_raid_bio *rbio)
 {
-	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
 	int total_sector_nr;
 
 	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
 	     total_sector_nr++) {
-		struct page *page;
 		int sectornr = total_sector_nr % rbio->stripe_nsectors;
-		int index = (total_sector_nr * sectorsize) >> PAGE_SHIFT;
+		int ret;
 
 		if (!test_bit(sectornr, &rbio->dbitmap))
 			continue;
-		if (rbio->stripe_pages[index])
-			continue;
-		page = alloc_page(GFP_NOFS);
-		if (!page)
-			return -ENOMEM;
-		rbio->stripe_pages[index] = page;
+		ret = alloc_rbio_sector_pages(rbio, total_sector_nr);
+		if (ret < 0)
+			return ret;
 	}
 	index_stripe_sectors(rbio);
 	return 0;
 }
 
+/* Return true if the content of the step matches the calculated one. */
+static bool verify_one_parity_step(struct btrfs_raid_bio *rbio,
+				   void *pointers[], unsigned int sector_nr,
+				   unsigned int step_nr)
+{
+	const unsigned int nr_data = rbio->nr_data;
+	const bool has_qstripe = (rbio->real_stripes - rbio->nr_data == 2);
+	const u32 step = min(rbio->bioc->fs_info->sectorsize, PAGE_SIZE);
+	void *parity;
+	bool ret = false;
+
+	ASSERT(step_nr < rbio->sector_nsteps);
+
+	/* first collect one page from each data stripe */
+	for (int stripe = 0; stripe < nr_data; stripe++)
+		pointers[stripe] = kmap_local_paddr(
+				sector_step_paddr_in_rbio(rbio, stripe, sector_nr, step_nr, 0));
+
+	if (has_qstripe) {
+		assert_rbio(rbio);
+		/* RAID6, call the library function to fill in our P/Q */
+		raid6_call.gen_syndrome(rbio->real_stripes, step, pointers);
+	} else {
+		/* raid5 */
+		memcpy(pointers[nr_data], pointers[0], step);
+		run_xor(pointers + 1, nr_data - 1, step);
+	}
+
+	/* Check scrubbing parity and repair it */
+	parity = kmap_local_paddr(rbio_stripe_step_paddr(rbio, rbio->scrubp, sector_nr, step_nr));
+	if (memcmp(parity, pointers[rbio->scrubp], step) != 0)
+		memcpy(parity, pointers[rbio->scrubp], step);
+	else
+		ret = true;
+	kunmap_local(parity);
+
+	for (int stripe = nr_data - 1; stripe >= 0; stripe--)
+		kunmap_local(pointers[stripe]);
+	return ret;
+}
+
+/*
+ * The @pointers array should have the P/Q parity already mapped.
+ */
+static void verify_one_parity_sector(struct btrfs_raid_bio *rbio,
+				    void *pointers[], unsigned int sector_nr)
+{
+	bool found_error = false;
+
+	for (int step_nr = 0; step_nr < rbio->sector_nsteps; step_nr++) {
+		bool match;
+
+		match = verify_one_parity_step(rbio, pointers, sector_nr, step_nr);
+		if (!match)
+			found_error = true;
+	}
+	if (!found_error)
+		bitmap_clear(&rbio->dbitmap, sector_nr, 1);
+}
+
 static int finish_parity_scrub(struct btrfs_raid_bio *rbio)
 {
 	struct btrfs_io_context *bioc = rbio->bioc;
-	const u32 sectorsize = bioc->fs_info->sectorsize;
 	void **pointers = rbio->finish_pointers;
 	unsigned long *pbitmap = &rbio->finish_pbitmap;
 	int nr_data = rbio->nr_data;
-	int stripe;
 	int sectornr;
 	bool has_qstripe;
 	struct page *page;
@@ -2729,37 +2755,8 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio)
 
 	/* Map the parity stripe just once */
 
-	for_each_set_bit(sectornr, &rbio->dbitmap, rbio->stripe_nsectors) {
-		void *parity;
-
-		/* first collect one page from each data stripe */
-		for (stripe = 0; stripe < nr_data; stripe++)
-			pointers[stripe] = kmap_local_paddr(
-					sector_paddr_in_rbio(rbio, stripe, sectornr, 0));
-
-		if (has_qstripe) {
-			assert_rbio(rbio);
-			/* RAID6, call the library function to fill in our P/Q */
-			raid6_call.gen_syndrome(rbio->real_stripes, sectorsize,
-						pointers);
-		} else {
-			/* raid5 */
-			memcpy(pointers[nr_data], pointers[0], sectorsize);
-			run_xor(pointers + 1, nr_data - 1, sectorsize);
-		}
-
-		/* Check scrubbing parity and repair it */
-		parity = kmap_local_paddr(rbio_stripe_paddr(rbio, rbio->scrubp, sectornr));
-		if (memcmp(parity, pointers[rbio->scrubp], sectorsize) != 0)
-			memcpy(parity, pointers[rbio->scrubp], sectorsize);
-		else
-			/* Parity is right, needn't writeback */
-			bitmap_clear(&rbio->dbitmap, sectornr, 1);
-		kunmap_local(parity);
-
-		for (stripe = nr_data - 1; stripe >= 0; stripe--)
-			kunmap_local(pointers[stripe]);
-	}
+	for_each_set_bit(sectornr, &rbio->dbitmap, rbio->stripe_nsectors)
+		verify_one_parity_sector(rbio, pointers, sectornr);
 
 	kunmap_local(pointers[nr_data]);
 	__free_page(phys_to_page(p_paddr));
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 11/12] btrfs: enable bs > ps support for raid56
  2025-11-17  7:30 [PATCH 00/12] btrfs: add raid56 support for bs > ps cases Qu Wenruo
                   ` (9 preceding siblings ...)
  2025-11-17  7:30 ` [PATCH 10/12] btrfs: prepare finish_parity_scrub() " Qu Wenruo
@ 2025-11-17  7:30 ` Qu Wenruo
  2025-11-17  7:30 ` [PATCH 12/12] btrfs: remove the "_step" infix Qu Wenruo
  2025-11-18 15:15 ` [PATCH 00/12] btrfs: add raid56 support for bs > ps cases David Sterba
  12 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-11-17  7:30 UTC (permalink / raw)
  To: linux-btrfs

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c |  6 ------
 fs/btrfs/raid56.c  | 11 ++++++-----
 2 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0df81a09a3d1..fe62f5a244f5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3258,12 +3258,6 @@ int btrfs_check_features(struct btrfs_fs_info *fs_info, bool is_rw_mount)
 			   PAGE_SIZE, fs_info->sectorsize);
 		return -EINVAL;
 	}
-	if (fs_info->sectorsize > PAGE_SIZE && btrfs_fs_incompat(fs_info, RAID56)) {
-		btrfs_err(fs_info,
-		"RAID56 is not supported for page size %lu with sectorsize %u",
-			  PAGE_SIZE, fs_info->sectorsize);
-		return -EINVAL;
-	}
 
 	/* This can be called by remount, we need to protect the super block. */
 	spin_lock(&fs_info->super_lock);
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index d8d3af2c4db5..ec6427565f25 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1070,8 +1070,12 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info,
 	const unsigned int sector_nsteps = fs_info->sectorsize / step;
 	struct btrfs_raid_bio *rbio;
 
-	/* PAGE_SIZE must also be aligned to sectorsize for subpage support */
-	ASSERT(IS_ALIGNED(PAGE_SIZE, fs_info->sectorsize));
+	/*
+	 * For bs <= ps cases, ps must be aligned to bs.
+	 * For bs > ps cases, bs must be aligned to ps.
+	 */
+	ASSERT(IS_ALIGNED(PAGE_SIZE, fs_info->sectorsize) ||
+	       IS_ALIGNED(fs_info->sectorsize, PAGE_SIZE));
 	/*
 	 * Our current stripe len should be fixed to 64k thus stripe_nsectors
 	 * (at most 16) should be no larger than BITS_PER_LONG.
@@ -3013,9 +3017,6 @@ void raid56_parity_cache_data_folios(struct btrfs_raid_bio *rbio,
 	unsigned int foffset = 0;
 	int ret;
 
-	/* We shouldn't hit RAID56 for bs > ps cases for now. */
-	ASSERT(fs_info->sectorsize <= PAGE_SIZE);
-
 	/*
 	 * If we hit ENOMEM temporarily, but later at
 	 * raid56_parity_submit_scrub_rbio() time it succeeded, we just do
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 12/12] btrfs: remove the "_step" infix
  2025-11-17  7:30 [PATCH 00/12] btrfs: add raid56 support for bs > ps cases Qu Wenruo
                   ` (10 preceding siblings ...)
  2025-11-17  7:30 ` [PATCH 11/12] btrfs: enable bs > ps support for raid56 Qu Wenruo
@ 2025-11-17  7:30 ` Qu Wenruo
  2025-11-18 15:15 ` [PATCH 00/12] btrfs: add raid56 support for bs > ps cases David Sterba
  12 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-11-17  7:30 UTC (permalink / raw)
  To: linux-btrfs

The following functions are introduced as a middle step for bs > ps
support:

- rbio_stripe_step_paddr()
- rbio_pstripe_step_paddr()
- rbio_qstripe_step_paddr()
- sector_step_paddr_in_rbio()

The infix was needed because functions without it already existed, with
different parameter lists.

Now that those existing functions have been cleaned up, there is no need
to keep the "_step" infix, so remove it completely.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index ec6427565f25..26c1e0e8a1a8 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -735,25 +735,25 @@ static unsigned int rbio_paddr_index(const struct btrfs_raid_bio *rbio,
 	return ret;
 }
 
-static phys_addr_t rbio_stripe_step_paddr(const struct btrfs_raid_bio *rbio,
+static phys_addr_t rbio_stripe_paddr(const struct btrfs_raid_bio *rbio,
 					  unsigned int stripe_nr, unsigned int sector_nr,
 					  unsigned int step_nr)
 {
 	return rbio->stripe_paddrs[rbio_paddr_index(rbio, stripe_nr, sector_nr, step_nr)];
 }
 
-static phys_addr_t rbio_pstripe_step_paddr(const struct btrfs_raid_bio *rbio,
+static phys_addr_t rbio_pstripe_paddr(const struct btrfs_raid_bio *rbio,
 					   unsigned int sector_nr, unsigned int step_nr)
 {
-	return rbio_stripe_step_paddr(rbio, rbio->nr_data, sector_nr, step_nr);
+	return rbio_stripe_paddr(rbio, rbio->nr_data, sector_nr, step_nr);
 }
 
-static phys_addr_t rbio_qstripe_step_paddr(const struct btrfs_raid_bio *rbio,
+static phys_addr_t rbio_qstripe_paddr(const struct btrfs_raid_bio *rbio,
 					   unsigned int sector_nr, unsigned int step_nr)
 {
 	if (rbio->nr_data + 1 == rbio->real_stripes)
 		return INVALID_PADDR;
-	return rbio_stripe_step_paddr(rbio, rbio->nr_data + 1, sector_nr, step_nr);
+	return rbio_stripe_paddr(rbio, rbio->nr_data + 1, sector_nr, step_nr);
 }
 
 /* Return a paddr pointer into the rbio::stripe_paddrs[] for the specified sector. */
@@ -1033,9 +1033,9 @@ static phys_addr_t *sector_paddrs_in_rbio(struct btrfs_raid_bio *rbio,
  * Similar to sector_paddr_in_rbio(), but with extra consideration for
  * bs > ps cases, where we can have multiple steps for a fs block.
  */
-static phys_addr_t sector_step_paddr_in_rbio(struct btrfs_raid_bio *rbio,
-					     int stripe_nr, int sector_nr, int step_nr,
-					     bool bio_list_only)
+static phys_addr_t sector_paddr_in_rbio(struct btrfs_raid_bio *rbio,
+					int stripe_nr, int sector_nr, int step_nr,
+					bool bio_list_only)
 {
 	phys_addr_t ret = INVALID_PADDR;
 	const int index = rbio_paddr_index(rbio, stripe_nr, sector_nr, step_nr);
@@ -1413,10 +1413,10 @@ static void generate_pq_vertical_step(struct btrfs_raid_bio *rbio, unsigned int
 	/* First collect one sector from each data stripe */
 	for (stripe = 0; stripe < rbio->nr_data; stripe++)
 		pointers[stripe] = kmap_local_paddr(
-				sector_step_paddr_in_rbio(rbio, stripe, sector_nr, step_nr, 0));
+				sector_paddr_in_rbio(rbio, stripe, sector_nr, step_nr, 0));
 
 	/* Then add the parity stripe */
-	pointers[stripe++] = kmap_local_paddr(rbio_pstripe_step_paddr(rbio, sector_nr, step_nr));
+	pointers[stripe++] = kmap_local_paddr(rbio_pstripe_paddr(rbio, sector_nr, step_nr));
 
 	if (has_qstripe) {
 		/*
@@ -1424,7 +1424,7 @@ static void generate_pq_vertical_step(struct btrfs_raid_bio *rbio, unsigned int
 		 * to fill in our p/q
 		 */
 		pointers[stripe++] = kmap_local_paddr(
-				rbio_qstripe_step_paddr(rbio, sector_nr, step_nr));
+				rbio_qstripe_paddr(rbio, sector_nr, step_nr));
 
 		assert_rbio(rbio);
 		raid6_call.gen_syndrome(rbio->real_stripes, step, pointers);
@@ -1958,9 +1958,9 @@ static void recover_vertical_step(struct btrfs_raid_bio *rbio,
 		 * bio list if possible.
 		 */
 		if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
-			paddr = sector_step_paddr_in_rbio(rbio, stripe_nr, sector_nr, step_nr, 0);
+			paddr = sector_paddr_in_rbio(rbio, stripe_nr, sector_nr, step_nr, 0);
 		} else {
-			paddr = rbio_stripe_step_paddr(rbio, stripe_nr, sector_nr, step_nr);
+			paddr = rbio_stripe_paddr(rbio, stripe_nr, sector_nr, step_nr);
 		}
 		pointers[stripe_nr] = kmap_local_paddr(paddr);
 		unmap_array[stripe_nr] = pointers[stripe_nr];
@@ -2651,7 +2651,7 @@ static bool verify_one_parity_step(struct btrfs_raid_bio *rbio,
 	/* first collect one page from each data stripe */
 	for (int stripe = 0; stripe < nr_data; stripe++)
 		pointers[stripe] = kmap_local_paddr(
-				sector_step_paddr_in_rbio(rbio, stripe, sector_nr, step_nr, 0));
+				sector_paddr_in_rbio(rbio, stripe, sector_nr, step_nr, 0));
 
 	if (has_qstripe) {
 		assert_rbio(rbio);
@@ -2664,7 +2664,7 @@ static bool verify_one_parity_step(struct btrfs_raid_bio *rbio,
 	}
 
 	/* Check scrubbing parity and repair it */
-	parity = kmap_local_paddr(rbio_stripe_step_paddr(rbio, rbio->scrubp, sector_nr, step_nr));
+	parity = kmap_local_paddr(rbio_stripe_paddr(rbio, rbio->scrubp, sector_nr, step_nr));
 	if (memcmp(parity, pointers[rbio->scrubp], step) != 0)
 		memcpy(parity, pointers[rbio->scrubp], step);
 	else
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 00/12] btrfs: add raid56 support for bs > ps cases
  2025-11-17  7:30 [PATCH 00/12] btrfs: add raid56 support for bs > ps cases Qu Wenruo
                   ` (11 preceding siblings ...)
  2025-11-17  7:30 ` [PATCH 12/12] btrfs: remove the "_step" infix Qu Wenruo
@ 2025-11-18 15:15 ` David Sterba
  2025-11-18 21:10   ` Qu Wenruo
  12 siblings, 1 reply; 17+ messages in thread
From: David Sterba @ 2025-11-18 15:15 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Nov 17, 2025 at 06:00:40PM +1030, Qu Wenruo wrote:
> [OVERVIEW]
> This series adds the missing raid56 support for the experimental bs > ps
> support.

Please add it to for-next, it's covered by the experimental config so any
eventual bugs can be dealt with safely. Thanks.

> The main challenge here is the conflicts between RAID56 RMW/recovery and
> data checksum.
> 
> For RAID56 RMW/recovery, the vertical stripe can only be mapped one page
> one time, as the upper layer can pass bios that are not backed by large
> folios (direct IO, encoded read/write/send).
> 
> On the other hand, data checksum requires multiple pages at the same
> time, e.g. btrfs_calculate_block_csum_pages().
> 
> To meet both requirements, introduce a new unit, step, which is
> min(PAGE_SIZE, sectorsize), and make the paddrs[] arrays in RAID56 to be
> in step sizes.
> 
> So for vertical stripe related works, reduce the map size from
> one sector to one step. For data checksum verification grab the pointer
> from involved paddrs[] array and pass the sub-array into
> btrfs_calculate_block_csum_pages().
> 
> So before the patchset, the btrfs_raid_bio paddr pointers look like
> this:
> 
>   16K page size, 4K fs block size (aka, subpage case)
> 
>                        0                   16K  ...
>   stripe_pages[]:      |                   |    ...
>   stripe_paddrs[]:     0    1    2    3    4    ...
>   fs blocks            |<-->|<-->|<-->|<-->|    ...
> 
>   There is at least one fs block (sector) inside a page, and each
>   paddrs[] entry represents an fs block 1:1.
> 
> To the new structure for bs > ps support:
> 
>   4K page size, 8K fs block size
> 
>                        0    4k   8K   12K   16K  ...
>   stripe_pages[]:      |    |    |    |    |     ...
>   stripe_paddrs[]:     0    1    2    3    4     ...
>   fs blocks            |<------->|<------->|     ...
> 
>   Now a paddrs[] entry is no longer mapped 1:1 to an fs block; instead
>   multiple paddrs map to one fs block.
> 
> The glue unit between paddrs[] and fs blocks is a step.
> 
> One fs block can span one or more steps, and one step maps to a paddrs[]
> entry 1:1.
> 
> For bs <= ps cases, one step is the same as an fs block.
> For bs > ps case, one step is just a page.
> 
> For RAID56, now we need one extra step iteration loop when handling an
> fs block.
> 
> [TESTING]
> I have tested the following combinations:
> 
> - bs=4k ps=4k x86_64
> - bs=4k ps=64k arm64
>   The base line to ensure no regression caused by this patchset for bs
>   == ps and bs < ps cases.
> 
> - bs=8k ps=4k x86_64
>   The new run for this series.
> 
>   The only new failure is related to direct IO read verification, which
>   is a known one caused by no direct IO support for bs > ps cases.
> 
> I'm afraid in the long run the combination matrix will grow larger and
> larger, and I'm not sure if my environment can handle all the extra bs/ps
> combinations.
> 
> The long term plan is to test bs=4k ps=4k, bs=4k ps=64k, bs=8k ps=4k
> cases only.

Yes, the number of combinations increases; I'd recommend testing those
that make sense. The idea is to match what could exist as a native
combination on one side and be used on another host where it would have
to be emulated by the bs>ps code, e.g. 16K page and sector size on ARM
and then used on x86_64. The other size to consider is 64K, e.g. on
powerpc.

In your list, bs=8K and ps=4K exercises the code, but the only hardware
that may still be in use (that I know of) with 8K pages is SPARC. I'd
rather pick numbers that still have some contemporary hardware relevance.

> [PATCHSET LAYOUT]
> Patch 1 introduces an overview of how btrfs_raid_bio structure
> works.
> Patches 2~10 start converting the existing infrastructure to use the
> new step based paddr pointers.
> Patch 11 enables RAID56 for bs > ps cases, which is still an
> experimental feature.
> The last patch removes the "_step" infix which is used as a temporary
> naming during the work.
> 
> [ROADMAP FOR BS > PS SUPPORT]
> The remaining feature not yet implemented for bs > ps cases is direct
> IO. The needed patch in iomap is submitted through VFS/iomap tree, and
> the btrfs part is a very tiny patch, will be submitted during v6.19
> cycle.

Sounds good.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 00/12] btrfs: add raid56 support for bs > ps cases
  2025-11-18 15:15 ` [PATCH 00/12] btrfs: add raid56 support for bs > ps cases David Sterba
@ 2025-11-18 21:10   ` Qu Wenruo
  2025-11-19  8:13     ` David Sterba
  0 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2025-11-18 21:10 UTC (permalink / raw)
  To: dsterba, Qu Wenruo; +Cc: linux-btrfs



On 2025/11/19 01:45, David Sterba wrote:
[...]
>>
>> The long term plan is to test bs=4k ps=4k, bs=4k ps=64k, bs=8k ps=4k
>> cases only.
> 
> Yes the number of combinations increases, I'd recommend to test those
> that make sense. The idea is to match what could on one side exist as a
> native combination and could be used on another host where it would have
> to be emulated by the bs>ps code. E.g. 16K page and sectorsize on ARM
> and then used on x86_64. The other size to consider is 64K, e.g. on
> powerpc.
> 
> In your list the bs=8K and ps=4K exercises the code but the only harware
> taht may still be in use (I know of) and has 8K pages is SPARC. I'd
> rather pick numbers that still have some contemporary hardware relevance.

The bs > ps support has a hidden problem, a much higher chance of memory
allocation failure for the page cache, which can lead to false alerts.

E.g. with ps = 4k and bs = 64k, the order is 4, beyond the costly order
3, so the allocation can fail without retry.

Maybe that can help us expose more bugs, but for now I'm sticking to
the safest tests without extra -ENOMEM possibilities.

It can be expanded to 16K (order 2) and be more realistic though.


That said, if bs > ps support gets utilized for possible RAIDZ-like
profiles to solve the RAID56 write-hole problem, it may see wider usage,
and we may get more adventurous users to help with testing.

Thanks,
Qu
> 
>> [PATCHSET LAYOUT]
>> Patch 1 introduces an overview of how btrfs_raid_bio structure
>> works.
>> Patch 2~10 starts converting the existing infrastructures to use the
>> new step based paddr pointers.
>> Patch 11 enables RAID56 for bs > ps cases, which is still an
>> experimental feature.
>> The last patch removes the "_step" infix which is used as a temporary
>> naming during the work.
>>
>> [ROADMAP FOR BS > PS SUPPORT]
>> The remaining feature not yet implemented for bs > ps cases is direct
>> IO. The needed patch in iomap is submitted through VFS/iomap tree, and
>> the btrfs part is a very tiny patch, will be submitted during v6.19
>> cycle.
> 
> Sounds good.
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 00/12] btrfs: add raid56 support for bs > ps cases
  2025-11-18 21:10   ` Qu Wenruo
@ 2025-11-19  8:13     ` David Sterba
  2025-11-20 13:23       ` Neal Gompa
  0 siblings, 1 reply; 17+ messages in thread
From: David Sterba @ 2025-11-19  8:13 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Wed, Nov 19, 2025 at 07:40:50AM +1030, Qu Wenruo wrote:
> On 2025/11/19 01:45, David Sterba wrote:
> [...]
> >>
> >> The long term plan is to test bs=4k ps=4k, bs=4k ps=64k, bs=8k ps=4k
> >> cases only.
> > 
> > Yes the number of combinations increases, I'd recommend to test those
> > that make sense. The idea is to match what could on one side exist as a
> > native combination and could be used on another host where it would have
> > to be emulated by the bs>ps code. E.g. 16K page and sectorsize on ARM
> > and then used on x86_64. The other size to consider is 64K, e.g. on
> > powerpc.
> > 
> > In your list the bs=8K and ps=4K exercises the code but the only hardware
> > that may still be in use (I know of) and has 8K pages is SPARC. I'd
> > rather pick numbers that still have some contemporary hardware relevance.
> 
> The bs > ps support has a hidden problem, a much higher chance of memory 
> allocation failure for page cache, thus can lead to false alerts.
> 
> E.g. ps = 4k bs = 64k, the order is 4, beyond the costly order 3, thus 
> it can fail without retry.
> 
> Maybe that can help us exposing more bugs, but for now I'm sticking to 
> the safest tests without extra -ENOMEM possibilities.

I see, this could make the testing pointless.

> It can be expanded to 16K (order 2) and be more realistic though.

Yes 16K sounds as a good compromise.

> Although bs > ps support will be utilized for possible RAIDZ like 
> profiles to solve RAID56 write-holes problems, in that case bs > ps 
> support may see more widely usage, and we may get more adventurous users 
> to help testing.

In such a case we'd have to increase the reliability of allocations by
some sort of caching or an emergency pool for the requests. The memory
management people may have some generic solution, as large folio usage
is on the rise and I don't think the allocation problems are left to
everybody as a problem to solve on their own.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 00/12] btrfs: add raid56 support for bs > ps cases
  2025-11-19  8:13     ` David Sterba
@ 2025-11-20 13:23       ` Neal Gompa
  0 siblings, 0 replies; 17+ messages in thread
From: Neal Gompa @ 2025-11-20 13:23 UTC (permalink / raw)
  To: dsterba; +Cc: Qu Wenruo, Qu Wenruo, linux-btrfs, Justin M. Forbes, Michel Lind

On Wed, Nov 19, 2025 at 3:13 AM David Sterba <dsterba@suse.cz> wrote:
>
> On Wed, Nov 19, 2025 at 07:40:50AM +1030, Qu Wenruo wrote:
> > On 2025/11/19 01:45, David Sterba wrote:
> > [...]
> > >>
> > >> The long term plan is to test bs=4k ps=4k, bs=4k ps=64k, bs=8k ps=4k
> > >> cases only.
> > >
> > > Yes the number of combinations increases, I'd recommend to test those
> > > that make sense. The idea is to match what could on one side exist as a
> > > native combination and could be used on another host where it would have
> > > to be emulated by the bs>ps code. E.g. 16K page and sectorsize on ARM
> > > and then used on x86_64. The other size to consider is 64K, e.g. on
> > > powerpc.
> > >
> > > In your list the bs=8K and ps=4K exercises the code but the only hardware
> > > that may still be in use (I know of) and has 8K pages is SPARC. I'd
> > > rather pick numbers that still have some contemporary hardware relevance.
> >
> > The bs > ps support has a hidden problem: a much higher chance of memory
> > allocation failure for the page cache, which can lead to false alerts.
> >
> > E.g. with ps = 4K and bs = 64K, the allocation order is 4, beyond the
> > costly order 3, so it can fail without retry.
> >
> > Maybe that could help us expose more bugs, but for now I'm sticking to
> > the safest tests, without extra -ENOMEM possibilities.
>
> I see, this could make the testing pointless.
>
> > It can be expanded to 16K (order 2) and be more realistic though.
>
> Yes, 16K sounds like a good compromise.
>

Yes, and it would be immediately useful for Fedora systems, since users
created 16k btrfs volumes on Fedora Asahi Remix before the 4k bs
default was implemented.

> > That said, bs > ps support will also be utilized by possible RAIDZ-like
> > profiles to solve the RAID56 write-hole problem; in that case bs > ps
> > support may see much wider usage, and we may get more adventurous users
> > to help with testing.
>
> In such a case we'd have to increase the reliability of allocations by
> some sort of caching or an emergency pool for the requests. The memory
> management people may have some generic solution, as large folio usage
> is on the rise, and I don't think the allocation problem will be left
> for everybody to solve on their own.
>

If we need some testing for RAID modes, it might be possible to
organize something on the Fedora side through the Fedora Btrfs SIG[1].

[1]: https://fedoraproject.org/wiki/SIGs/Btrfs


-- 
真実はいつも一つ!/ Always, there's only one truth!


end of thread, other threads:[~2025-11-20 13:24 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-17  7:30 [PATCH 00/12] btrfs: add raid56 support for bs > ps cases Qu Wenruo
2025-11-17  7:30 ` [PATCH 01/12] btrfs: add an overview for the btrfs_raid_bio structure Qu Wenruo
2025-11-17  7:30 ` [PATCH 02/12] btrfs: introduce a new parameter to locate a sector Qu Wenruo
2025-11-17  7:30 ` [PATCH 03/12] btrfs: prepare generate_pq_vertical() for bs > ps cases Qu Wenruo
2025-11-17  7:30 ` [PATCH 04/12] btrfs: prepare recover_vertical() to support " Qu Wenruo
2025-11-17  7:30 ` [PATCH 05/12] btrfs: prepare verify_one_sector() " Qu Wenruo
2025-11-17  7:30 ` [PATCH 06/12] btrfs: prepare verify_bio_data_sectors() " Qu Wenruo
2025-11-17  7:30 ` [PATCH 07/12] btrfs: prepare set_bio_pages_uptodate() " Qu Wenruo
2025-11-17  7:30 ` [PATCH 08/12] btrfs: prepare steal_rbio() " Qu Wenruo
2025-11-17  7:30 ` [PATCH 09/12] btrfs: prepare rbio_bio_add_io_paddr() " Qu Wenruo
2025-11-17  7:30 ` [PATCH 10/12] btrfs: prepare finish_parity_scrub() " Qu Wenruo
2025-11-17  7:30 ` [PATCH 11/12] btrfs: enable bs > ps support for raid56 Qu Wenruo
2025-11-17  7:30 ` [PATCH 12/12] btrfs: remove the "_step" infix Qu Wenruo
2025-11-18 15:15 ` [PATCH 00/12] btrfs: add raid56 support for bs > ps cases David Sterba
2025-11-18 21:10   ` Qu Wenruo
2025-11-19  8:13     ` David Sterba
2025-11-20 13:23       ` Neal Gompa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox