* [PATCH v2 0/4] btrfs: enable encoded read/write/send for bs > ps cases
From: Qu Wenruo @ 2025-10-05 22:41 UTC
To: linux-btrfs
[CHANGELOG]
v2:
- Fix smatch warnings
  Mostly by removing the ASSERT()s that checked the return value of
  bio_alloc().
- Fix a bug in btrfs_add_compressed_bio_folios()
  The function still assumed large folios, but now we can have regular
  folios.
Previously encoded read/write/send were disabled for bs > ps cases,
because they either use regular pages or kvmalloc-ed memory as
buffers.
This means their buffers do not meet the folio size requirement (each
folio must contain at least one fs block, and no block can cross a
large folio boundary).
This series addresses those limits by allowing the following
functionality to support regular pages, without relying on large
folios:
- Checksum calculation
  Instead of passing a single @paddr which is ensured to be inside a
  large folio, an array, paddrs[], is passed in.
  For bs <= ps cases, it's still a single paddr.
  For bs > ps cases, we can accept an array of multiple paddrs that
  together represent a single fs block.
- Read repair
  Allow btrfs_repair_io_failure() to submit a bio with multiple
  incontiguous pages, using the same paddrs[] array building scheme.
  But this time, since we need to submit a bio with multiple bvecs, we
  can no longer use the current on-stack bio.
  This also brings a small improvement for metadata read-repair: we
  can submit the whole metadata block in one go.
- Compressed bio folios submission
  Now the function btrfs_add_compressed_bio_folios() can handle both
  large and regular folios, even handling differently sized folios in
  the same array.
- Read verification
  Just build the paddrs[] array for bs > ps cases and pass the array
  into btrfs_calculate_block_csum_pages().
Unfortunately, since there is no reliable on-stack VLA support, we have
to pay the extra on-stack memory (128 bytes on x86_64, or 8 bytes on
64K page sized systems) everywhere, even though in 99% of cases our
block size is no larger than the page size.
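
A minimal sketch of the shared paddrs[] gathering pattern (names taken
from the patches below; bio/iter come from the enclosing function, and
handle_one_block() is a placeholder for the per-block action):

  const u32 step = min(blocksize, PAGE_SIZE);  /* bytes covered per entry */
  const u32 nr_steps = blocksize / step;       /* 1 for bs <= ps */
  phys_addr_t paddrs[BTRFS_MAX_BLOCKSIZE / PAGE_SIZE];
  phys_addr_t paddr;
  u32 offset = 0;

  btrfs_bio_for_each_block(paddr, bio, &iter, step) {
          paddrs[(offset / step) % nr_steps] = paddr;
          offset += step;
          if (IS_ALIGNED(offset, blocksize))
                  handle_one_block(paddrs); /* one full fs block gathered */
  }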
Another thing is, even with all this support, direct IO is still not
supported.
The problem is that iomap can still split the bio between pages,
breaking our bio size requirement (bios still have to be block
aligned).
Qu Wenruo (4):
btrfs: make btrfs_csum_one_bio() handle bs > ps without large folios
btrfs: make btrfs_repair_io_failure() handle bs > ps cases without
large folios
btrfs: make read verification handle bs > ps cases without large
folios
btrfs: enable encoded read/write/send for bs > ps cases
fs/btrfs/bio.c | 138 +++++++++++++++++++++++++++++------------
fs/btrfs/bio.h | 2 +-
fs/btrfs/btrfs_inode.h | 8 ++-
fs/btrfs/compression.c | 10 +--
fs/btrfs/disk-io.c | 29 +++++----
fs/btrfs/file-item.c | 15 ++++-
fs/btrfs/inode.c | 91 ++++++++++++++++++---------
fs/btrfs/ioctl.c | 21 -------
fs/btrfs/send.c | 9 +--
9 files changed, 203 insertions(+), 120 deletions(-)
--
2.50.1
* [PATCH v2 1/4] btrfs: make btrfs_csum_one_bio() handle bs > ps without large folios
From: Qu Wenruo @ 2025-10-05 22:41 UTC
To: linux-btrfs
For bs > ps cases, every block passed into btrfs_csum_one_bio() is
ensured to be backed by a large folio.
But that requirement excludes features like direct IO and encoded
writes.
To support bs > ps without large folios, enhance btrfs_csum_one_bio()
by:
- Split btrfs_calculate_block_csum() into two versions
  * btrfs_calculate_block_csum_folio()
    For call sites where a fs block is always backed by a large folio.
    This will do extra checks on the folio size, build a paddrs[]
    array, and pass it into the new btrfs_calculate_block_csum_pages()
    helper.
    For now btrfs_check_block_csum() is still using this version.
  * btrfs_calculate_block_csum_pages()
    For call sites where a fs block may be backed by incontiguous
    pages.
    The pages are represented by the paddrs[] array, each entry
    including the offset inside its page.
    This function will do the proper sub-block handling.
- Make btrfs_csum_one_bio() use btrfs_calculate_block_csum_pages()
  This means we need to build a local paddrs[] array, and after
  filling a fs block, do the checksum calculation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
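A worked example of the slot math used below (illustration only, not
part of the diff): with a 16K block size on a 4K page system, step = 4K
and nr_steps = 4, so the loop fills the slots as:

  offset before the step   slot = (offset / step) % nr_steps
  0K                       0
  4K                       1
  8K                       2
  12K                      3

After the fourth step offset reaches 16K, IS_ALIGNED(offset, blocksize)
fires, and the checksum of the fully gathered block is computed.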
fs/btrfs/btrfs_inode.h | 6 ++--
fs/btrfs/file-item.c | 15 +++++++--
fs/btrfs/inode.c | 73 ++++++++++++++++++++++++++++++------------
3 files changed, 68 insertions(+), 26 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index f45dcdde0efc..03c107ab391b 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -542,8 +542,10 @@ static inline void btrfs_set_inode_mapping_order(struct btrfs_inode *inode)
#define CSUM_FMT "0x%*phN"
#define CSUM_FMT_VALUE(size, bytes) size, bytes
-void btrfs_calculate_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr,
- u8 *dest);
+void btrfs_calculate_block_csum_folio(struct btrfs_fs_info *fs_info,
+ const phys_addr_t paddr, u8 *dest);
+void btrfs_calculate_block_csum_pages(struct btrfs_fs_info *fs_info,
+ const phys_addr_t paddrs[], u8 *dest);
int btrfs_check_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr, u8 *csum,
const u8 * const csum_expected);
bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index a42e6d54e7cd..5b3c5489ee48 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -778,6 +778,10 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio)
struct bvec_iter iter = bio->bi_iter;
phys_addr_t paddr;
const u32 blocksize = fs_info->sectorsize;
+ const u32 step = min(blocksize, PAGE_SIZE);
+ const u32 nr_steps = blocksize / step;
+ phys_addr_t paddrs[BTRFS_MAX_BLOCKSIZE / PAGE_SIZE];
+ u32 offset = 0;
int index;
unsigned nofs_flag;
@@ -797,9 +801,14 @@ int btrfs_csum_one_bio(struct btrfs_bio *bbio)
shash->tfm = fs_info->csum_shash;
- btrfs_bio_for_each_block(paddr, bio, &iter, blocksize) {
- btrfs_calculate_block_csum(fs_info, paddr, sums->sums + index);
- index += fs_info->csum_size;
+ btrfs_bio_for_each_block(paddr, bio, &iter, step) {
+ paddrs[(offset / step) % nr_steps] = paddr;
+ offset += step;
+
+ if (IS_ALIGNED(offset, blocksize)) {
+ btrfs_calculate_block_csum_pages(fs_info, paddrs, sums->sums + index);
+ index += fs_info->csum_size;
+ }
}
bbio->sums = sums;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ac2fd589697d..287c67296a7d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3330,36 +3330,67 @@ int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered)
return btrfs_finish_one_ordered(ordered);
}
-void btrfs_calculate_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr,
- u8 *dest)
+/*
+ * Calculate the checksum of a fs block at physical memory address @paddr,
+ * and save the result to @dest.
+ *
+ * The folio containing @paddr must be large enough to contain a full fs block.
+ */
+void btrfs_calculate_block_csum_folio(struct btrfs_fs_info *fs_info,
+ const phys_addr_t paddr, u8 *dest)
{
struct folio *folio = page_folio(phys_to_page(paddr));
const u32 blocksize = fs_info->sectorsize;
- SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
+ const u32 step = min(blocksize, PAGE_SIZE);
+ const u32 nr_steps = blocksize / step;
+ phys_addr_t paddrs[BTRFS_MAX_BLOCKSIZE / PAGE_SIZE];
- shash->tfm = fs_info->csum_shash;
/* The full block must be inside the folio. */
ASSERT(offset_in_folio(folio, paddr) + blocksize <= folio_size(folio));
- if (folio_test_partial_kmap(folio)) {
- size_t cur = paddr;
+ for (int i = 0; i < nr_steps; i++) {
+ u32 pindex = offset_in_folio(folio, paddr + i * step) >> PAGE_SHIFT;
- crypto_shash_init(shash);
- while (cur < paddr + blocksize) {
- void *kaddr;
- size_t len = min(paddr + blocksize - cur,
- PAGE_SIZE - offset_in_page(cur));
-
- kaddr = kmap_local_folio(folio, offset_in_folio(folio, cur));
- crypto_shash_update(shash, kaddr, len);
- kunmap_local(kaddr);
- cur += len;
- }
- crypto_shash_final(shash, dest);
- } else {
- crypto_shash_digest(shash, phys_to_virt(paddr), blocksize, dest);
+ /*
+ * For bs <= ps cases, we will only run the loop once, so the offset
+ * inside the page will only be added to paddrs[0].
+ *
+ * For bs > ps cases, the block must be page aligned, thus offset
+ * inside the page will always be 0.
+ */
+ paddrs[i] = page_to_phys(folio_page(folio, pindex)) + offset_in_page(paddr);
}
+ return btrfs_calculate_block_csum_pages(fs_info, paddrs, dest);
}
+
+/*
+ * Calculate the checksum of a fs block backed by multiple (possibly
+ * incontiguous) pages at @paddrs[], and save the result to @dest.
+ *
+ * Each entry of @paddrs[] must cover min(blocksize, PAGE_SIZE) bytes
+ * inside its own page.
+ */
+void btrfs_calculate_block_csum_pages(struct btrfs_fs_info *fs_info,
+ const phys_addr_t paddrs[], u8 *dest)
+{
+ const u32 blocksize = fs_info->sectorsize;
+ const u32 step = min(blocksize, PAGE_SIZE);
+ const u32 nr_steps = blocksize / step;
+ SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
+
+ shash->tfm = fs_info->csum_shash;
+ crypto_shash_init(shash);
+ for (int i = 0; i < nr_steps; i++) {
+ const phys_addr_t paddr = paddrs[i];
+ void *kaddr;
+
+ ASSERT(offset_in_page(paddr) + step <= PAGE_SIZE);
+ kaddr = kmap_local_page(phys_to_page(paddr)) + offset_in_page(paddr);
+ crypto_shash_update(shash, kaddr, step);
+ kunmap_local(kaddr);
+ }
+ crypto_shash_final(shash, dest);
+}
+
/*
* Verify the checksum for a single sector without any extra action that depend
* on the type of I/O.
@@ -3369,7 +3400,7 @@ void btrfs_calculate_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr
int btrfs_check_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr, u8 *csum,
const u8 * const csum_expected)
{
- btrfs_calculate_block_csum(fs_info, paddr, csum);
+ btrfs_calculate_block_csum_folio(fs_info, paddr, csum);
if (unlikely(memcmp(csum, csum_expected, fs_info->csum_size) != 0))
return -EIO;
return 0;
--
2.50.1
* [PATCH v2 2/4] btrfs: make btrfs_repair_io_failure() handle bs > ps cases without large folios
From: Qu Wenruo @ 2025-10-05 22:41 UTC
To: linux-btrfs
Currently btrfs_repair_io_failure() only accepts a single @paddr
parameter, and for bs > ps cases it's required that @paddr is backed by
a large folio.
That assumption is quite limiting, preventing us from utilizing true
zero-copy direct IO and encoded read/write.
To address the problem, enhance btrfs_repair_io_failure() by:
- Accept an array of paddrs, up to 64K / PAGE_SIZE entries
  This acts like a bio_vec, but with very limited entries, as the
  function is only utilized to repair one fs data block or one tree
  block.
  Both have an upper size limit (BTRFS_MAX_BLOCKSIZE, aka 64K), so we
  don't need a full bio_vec to handle it.
- Allocate a bio with multiple slots
  Previously, even for bs > ps cases, we only passed in a contiguous
  physical address range, thus a single slot was enough.
  But not any more, so we have to allocate a bio structure, instead of
  using the on-stack one.
- Use on-stack memory for the @paddrs array
  It's at most 16 entries (4K page size, 64K block size), taking up at
  most 128 bytes.
  I think the on-stack cost is still acceptable.
- Add one extra check to make sure the repair bio is exactly one block
- Utilize btrfs_repair_io_failure() to submit a single bio for metadata
  This should improve the read-repair performance for metadata, as we
  now submit a node sized bio and wait once, instead of submitting each
  block of the metadata and waiting for each submitted block.
- Reduce the width of the @length parameter of btrfs_repair_io_failure()
  Since we only call btrfs_repair_io_failure() on a single data or
  metadata block, u64 is overkill.
  Use u32 instead and add one extra ASSERT() to make sure the length
  never exceeds BTRFS_MAX_BLOCKSIZE.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
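A sketch of a hypothetical caller using the new signature (illustration
only; file_offset, logical and mirror are placeholder locals): repairing
one 16K data block backed by four discontiguous 4K pages would look
roughly like:

  phys_addr_t paddrs[BTRFS_MAX_BLOCKSIZE / PAGE_SIZE];
  int ret;

  /* Fill paddrs[0..3]; each entry covers min(sectorsize, PAGE_SIZE) bytes. */
  ret = btrfs_repair_io_failure(fs_info, btrfs_ino(inode), file_offset,
                                fs_info->sectorsize, logical, paddrs,
                                mirror);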
fs/btrfs/bio.c | 66 ++++++++++++++++++++++++++++++++++++----------
fs/btrfs/bio.h | 2 +-
fs/btrfs/disk-io.c | 29 ++++++++++++--------
3 files changed, 71 insertions(+), 26 deletions(-)
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 21df48e6c4fa..5034a9eaf2f0 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -164,7 +164,21 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
struct btrfs_inode *inode = repair_bbio->inode;
struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct bio_vec *bv = bio_first_bvec_all(&repair_bbio->bio);
+ /*
+ * We can not advance the saved_iter, as it will be used by
+ * repair_bbio again later.
+ */
+ struct bvec_iter saved_iter = repair_bbio->saved_iter;
+ const u32 step = min(fs_info->sectorsize, PAGE_SIZE);
+ const u64 logical = repair_bbio->saved_iter.bi_sector << SECTOR_SHIFT;
+ const u32 nr_steps = repair_bbio->saved_iter.bi_size / step;
int mirror = repair_bbio->mirror_num;
+ phys_addr_t paddrs[BTRFS_MAX_BLOCKSIZE / PAGE_SIZE];
+ phys_addr_t paddr;
+ unsigned int slot = 0;
+
+ /* Repair bbio should be exactly one block sized. */
+ ASSERT(repair_bbio->saved_iter.bi_size == fs_info->sectorsize);
if (repair_bbio->bio.bi_status ||
!btrfs_data_csum_ok(repair_bbio, dev, 0, bvec_phys(bv))) {
@@ -182,12 +196,17 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
return;
}
+ btrfs_bio_for_each_block(paddr, &repair_bbio->bio, &saved_iter, step) {
+ ASSERT(slot < nr_steps);
+ paddrs[slot] = paddr;
+ slot++;
+ }
+
do {
mirror = prev_repair_mirror(fbio, mirror);
btrfs_repair_io_failure(fs_info, btrfs_ino(inode),
repair_bbio->file_offset, fs_info->sectorsize,
- repair_bbio->saved_iter.bi_sector << SECTOR_SHIFT,
- bvec_phys(bv), mirror);
+ logical, paddrs, mirror);
} while (mirror != fbio->bbio->mirror_num);
done:
@@ -824,18 +843,34 @@ void btrfs_submit_bbio(struct btrfs_bio *bbio, int mirror_num)
*
* The I/O is issued synchronously to block the repair read completion from
* freeing the bio.
+ *
+ * @ino: Offending inode number
+ * @fileoff: File offset inside the inode
+ * @length: Length of the repair write
+ * @logical: Logical address of the range
+ * @paddrs: Physical address array of the content.
+ * Length for each paddr should be min(sectorsize, PAGE_SIZE).
+ * @mirror_num: Mirror number to write to. Must not be zero.
*/
-int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
- u64 length, u64 logical, phys_addr_t paddr, int mirror_num)
+int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 fileoff,
+ u32 length, u64 logical, const phys_addr_t paddrs[], int mirror_num)
{
+ const u32 step = min(fs_info->sectorsize, PAGE_SIZE);
+ const u32 nr_steps = DIV_ROUND_UP_POW2(length, step);
struct btrfs_io_stripe smap = { 0 };
- struct bio_vec bvec;
- struct bio bio;
+ struct bio *bio = NULL;
int ret = 0;
ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
BUG_ON(!mirror_num);
+ /* Basic alignment checks. */
+ ASSERT(IS_ALIGNED(logical, fs_info->sectorsize));
+ ASSERT(IS_ALIGNED(length, fs_info->sectorsize));
+ ASSERT(IS_ALIGNED(fileoff, fs_info->sectorsize));
+ /* Either it's a single data or metadata block. */
+ ASSERT(length <= BTRFS_MAX_BLOCKSIZE);
+
if (btrfs_repair_one_zone(fs_info, logical))
return 0;
@@ -855,24 +890,27 @@ int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
goto out_counter_dec;
}
- bio_init(&bio, smap.dev->bdev, &bvec, 1, REQ_OP_WRITE | REQ_SYNC);
- bio.bi_iter.bi_sector = smap.physical >> SECTOR_SHIFT;
- __bio_add_page(&bio, phys_to_page(paddr), length, offset_in_page(paddr));
- ret = submit_bio_wait(&bio);
+ bio = bio_alloc(smap.dev->bdev, nr_steps, REQ_OP_WRITE | REQ_SYNC, GFP_NOFS);
+ bio->bi_iter.bi_sector = smap.physical >> SECTOR_SHIFT;
+ for (int i = 0; i < nr_steps; i++) {
+ ret = bio_add_page(bio, phys_to_page(paddrs[i]), step, offset_in_page(paddrs[i]));
+ /* We should have allocated enough slots to contain all the different pages. */
+ ASSERT(ret == step);
+ }
+ ret = submit_bio_wait(bio);
+ bio_put(bio);
if (ret) {
/* try to remap that extent elsewhere? */
btrfs_dev_stat_inc_and_print(smap.dev, BTRFS_DEV_STAT_WRITE_ERRS);
- goto out_bio_uninit;
+ goto out_counter_dec;
}
btrfs_info_rl(fs_info,
"read error corrected: ino %llu off %llu (dev %s sector %llu)",
- ino, start, btrfs_dev_name(smap.dev),
+ ino, fileoff, btrfs_dev_name(smap.dev),
smap.physical >> SECTOR_SHIFT);
ret = 0;
-out_bio_uninit:
- bio_uninit(&bio);
out_counter_dec:
btrfs_bio_counter_dec(fs_info);
return ret;
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index 00883aea55d7..e93e7ab6322a 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -112,6 +112,6 @@ void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status);
void btrfs_submit_bbio(struct btrfs_bio *bbio, int mirror_num);
void btrfs_submit_repair_write(struct btrfs_bio *bbio, int mirror_num, bool dev_replace);
int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
- u64 length, u64 logical, phys_addr_t paddr, int mirror_num);
+ u32 length, u64 logical, const phys_addr_t paddrs[], int mirror_num);
#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f475fb2272ac..2623b84dd6ed 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -182,26 +182,33 @@ static int btrfs_repair_eb_io_failure(const struct extent_buffer *eb,
int mirror_num)
{
struct btrfs_fs_info *fs_info = eb->fs_info;
+ const u32 step = min(fs_info->nodesize, PAGE_SIZE);
+ const u32 nr_steps = eb->len / step;
+ phys_addr_t paddrs[BTRFS_MAX_BLOCKSIZE / PAGE_SIZE];
int ret = 0;
if (sb_rdonly(fs_info->sb))
return -EROFS;
- for (int i = 0; i < num_extent_folios(eb); i++) {
+ for (int i = 0; i < num_extent_pages(eb); i++) {
struct folio *folio = eb->folios[i];
- u64 start = max_t(u64, eb->start, folio_pos(folio));
- u64 end = min_t(u64, eb->start + eb->len,
- folio_pos(folio) + eb->folio_size);
- u32 len = end - start;
- phys_addr_t paddr = PFN_PHYS(folio_pfn(folio)) +
- offset_in_folio(folio, start);
- ret = btrfs_repair_io_failure(fs_info, 0, start, len, start,
- paddr, mirror_num);
- if (ret)
- break;
+ /* No large folio support yet. */
+ ASSERT(folio_order(folio) == 0);
+ ASSERT(i < nr_steps);
+
+ /*
+ * For nodesize < page size, there is just one paddr, with some
+ * offset inside the page.
+ *
+ * For nodesize >= page size, it's one or more paddrs, and eb->start
+ * must be aligned to page boundary.
+ */
+ paddrs[i] = page_to_phys(&folio->page) + offset_in_page(eb->start);
}
+ ret = btrfs_repair_io_failure(fs_info, 0, eb->start, eb->len, eb->start,
+ paddrs, mirror_num);
return ret;
}
--
2.50.1
* [PATCH v2 3/4] btrfs: make read verification handle bs > ps cases without large folios
From: Qu Wenruo @ 2025-10-05 22:41 UTC
To: linux-btrfs
The current read verification also relies on large folios to support
bs > ps cases, but that introduces quite some limits.
To enhance read-repair to support bs > ps without large folios:
- Make btrfs_data_csum_ok() accept an array of paddrs
  This lets us pass the paddrs[] directly into
  btrfs_calculate_block_csum_pages().
- Make repair_one_sector() accept an array of paddrs
  So that it can submit a repair bio backed by regular pages, not only
  large folios.
  This requires us to allocate more slots at bio allocation time
  though.
  Also, since the caller may have only partially advanced the
  saved_iter for bs > ps cases, we can not directly trust the logical
  bytenr from saved_iter (it can be unaligned), thus a manual round
  down to the block boundary is necessary.
- Make btrfs_check_read_bio() build an array of paddrs
  The tricky part is that we can only call btrfs_data_csum_ok() after
  all involved pages are assembled.
  This means that, by the time btrfs_data_csum_ok() is called, our
  offset inside the bio is already at the end of the fs block.
  Thus we must re-calculate @bio_offset for btrfs_data_csum_ok() and
  repair_one_sector().
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
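A worked example of the @bio_offset recalculation (illustration only,
not part of the diff): with bs = 16K and ps = 4K, when
IS_ALIGNED(offset, sectorsize) first fires, offset has already been
advanced to 16K, i.e. one full block past the start of the block just
gathered. The block under verification therefore starts at
offset - sectorsize = 0, which is exactly the value passed to
btrfs_data_csum_ok() and repair_one_sector().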
fs/btrfs/bio.c | 62 ++++++++++++++++++++++++++++--------------
fs/btrfs/btrfs_inode.h | 2 +-
fs/btrfs/inode.c | 18 ++++++------
3 files changed, 52 insertions(+), 30 deletions(-)
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 5034a9eaf2f0..6642318320eb 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -163,7 +163,6 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
struct btrfs_failed_bio *fbio = repair_bbio->private;
struct btrfs_inode *inode = repair_bbio->inode;
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- struct bio_vec *bv = bio_first_bvec_all(&repair_bbio->bio);
/*
* We can not advance the saved_iter, as it will be used by
* repair_bbio again later.
@@ -180,8 +179,14 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
/* Repair bbio should be exactly one block sized. */
ASSERT(repair_bbio->saved_iter.bi_size == fs_info->sectorsize);
+ btrfs_bio_for_each_block(paddr, &repair_bbio->bio, &saved_iter, step) {
+ ASSERT(slot < nr_steps);
+ paddrs[slot] = paddr;
+ slot++;
+ }
+
if (repair_bbio->bio.bi_status ||
- !btrfs_data_csum_ok(repair_bbio, dev, 0, bvec_phys(bv))) {
+ !btrfs_data_csum_ok(repair_bbio, dev, 0, paddrs)) {
bio_reset(&repair_bbio->bio, NULL, REQ_OP_READ);
repair_bbio->bio.bi_iter = repair_bbio->saved_iter;
@@ -196,12 +201,6 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
return;
}
- btrfs_bio_for_each_block(paddr, &repair_bbio->bio, &saved_iter, step) {
- ASSERT(slot < nr_steps);
- paddrs[slot] = paddr;
- slot++;
- }
-
do {
mirror = prev_repair_mirror(fbio, mirror);
btrfs_repair_io_failure(fs_info, btrfs_ino(inode),
@@ -223,21 +222,25 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
*/
static struct btrfs_failed_bio *repair_one_sector(struct btrfs_bio *failed_bbio,
u32 bio_offset,
- phys_addr_t paddr,
+ phys_addr_t paddrs[],
struct btrfs_failed_bio *fbio)
{
struct btrfs_inode *inode = failed_bbio->inode;
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- struct folio *folio = page_folio(phys_to_page(paddr));
const u32 sectorsize = fs_info->sectorsize;
- const u32 foff = offset_in_folio(folio, paddr);
- const u64 logical = (failed_bbio->saved_iter.bi_sector << SECTOR_SHIFT);
+ const u32 step = min(fs_info->sectorsize, PAGE_SIZE);
+ const u32 nr_steps = sectorsize / step;
+ /*
+ * For bs > ps cases, the saved_iter can be partially moved forward.
+ * In that case we should round it down to the block boundary.
+ */
+ const u64 logical = round_down(failed_bbio->saved_iter.bi_sector << SECTOR_SHIFT,
+ sectorsize);
struct btrfs_bio *repair_bbio;
struct bio *repair_bio;
int num_copies;
int mirror;
- ASSERT(foff + sectorsize <= folio_size(folio));
btrfs_debug(fs_info, "repair read error: read error at %llu",
failed_bbio->file_offset + bio_offset);
@@ -257,10 +260,18 @@ static struct btrfs_failed_bio *repair_one_sector(struct btrfs_bio *failed_bbio,
atomic_inc(&fbio->repair_count);
- repair_bio = bio_alloc_bioset(NULL, 1, REQ_OP_READ, GFP_NOFS,
+ repair_bio = bio_alloc_bioset(NULL, nr_steps, REQ_OP_READ, GFP_NOFS,
&btrfs_repair_bioset);
- repair_bio->bi_iter.bi_sector = failed_bbio->saved_iter.bi_sector;
- bio_add_folio_nofail(repair_bio, folio, sectorsize, foff);
+ repair_bio->bi_iter.bi_sector = logical >> SECTOR_SHIFT;
+ for (int i = 0; i < nr_steps; i++) {
+ int ret;
+
+ ASSERT(offset_in_page(paddrs[i]) + step <= PAGE_SIZE);
+
+ ret = bio_add_page(repair_bio, phys_to_page(paddrs[i]), step,
+ offset_in_page(paddrs[i]));
+ ASSERT(ret == step);
+ }
repair_bbio = btrfs_bio(repair_bio);
btrfs_bio_init(repair_bbio, fs_info, NULL, fbio);
@@ -277,10 +288,13 @@ static void btrfs_check_read_bio(struct btrfs_bio *bbio, struct btrfs_device *de
{
struct btrfs_inode *inode = bbio->inode;
struct btrfs_fs_info *fs_info = inode->root->fs_info;
- u32 sectorsize = fs_info->sectorsize;
+ const u32 sectorsize = fs_info->sectorsize;
+ const u32 step = min(sectorsize, PAGE_SIZE);
+ const u32 nr_steps = sectorsize / step;
struct bvec_iter *iter = &bbio->saved_iter;
blk_status_t status = bbio->bio.bi_status;
struct btrfs_failed_bio *fbio = NULL;
+ phys_addr_t paddrs[BTRFS_MAX_BLOCKSIZE / PAGE_SIZE];
phys_addr_t paddr;
u32 offset = 0;
@@ -299,10 +313,16 @@ static void btrfs_check_read_bio(struct btrfs_bio *bbio, struct btrfs_device *de
/* Clear the I/O error. A failed repair will reset it. */
bbio->bio.bi_status = BLK_STS_OK;
- btrfs_bio_for_each_block(paddr, &bbio->bio, iter, fs_info->sectorsize) {
- if (status || !btrfs_data_csum_ok(bbio, dev, offset, paddr))
- fbio = repair_one_sector(bbio, offset, paddr, fbio);
- offset += sectorsize;
+ btrfs_bio_for_each_block(paddr, &bbio->bio, iter, step) {
+ paddrs[(offset / step) % nr_steps] = paddr;
+ offset += step;
+
+ if (IS_ALIGNED(offset, sectorsize)) {
+ if (status ||
+ !btrfs_data_csum_ok(bbio, dev, offset - sectorsize, paddrs))
+ fbio = repair_one_sector(bbio, offset - sectorsize,
+ paddrs, fbio);
+ }
}
if (bbio->csum != bbio->csum_inline)
kfree(bbio->csum);
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 03c107ab391b..ef3f2c1f7a77 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -549,7 +549,7 @@ void btrfs_calculate_block_csum_pages(struct btrfs_fs_info *fs_info,
int btrfs_check_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr, u8 *csum,
const u8 * const csum_expected);
bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
- u32 bio_offset, phys_addr_t paddr);
+ u32 bio_offset, const phys_addr_t paddrs[]);
noinline int can_nocow_extent(struct btrfs_inode *inode, u64 offset, u64 *len,
struct btrfs_file_extent *file_extent,
bool nowait);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 287c67296a7d..dbc5fad7b9a1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3407,12 +3407,13 @@ int btrfs_check_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr, u8
}
/*
- * Verify the checksum of a single data sector.
+ * Verify the checksum of a single data sector, which can be scattered
+ * across multiple incontiguous pages.
*
* @bbio: btrfs_io_bio which contains the csum
* @dev: device the sector is on
* @bio_offset: offset to the beginning of the bio (in bytes)
- * @bv: bio_vec to check
+ * @paddrs: physical addresses which back the fs block
*
* Check if the checksum on a data block is valid. When a checksum mismatch is
* detected, report the error and fill the corrupted range with zero.
@@ -3420,12 +3421,13 @@ int btrfs_check_block_csum(struct btrfs_fs_info *fs_info, phys_addr_t paddr, u8
* Return %true if the sector is ok or had no checksum to start with, else %false.
*/
bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
- u32 bio_offset, phys_addr_t paddr)
+ u32 bio_offset, const phys_addr_t paddrs[])
{
struct btrfs_inode *inode = bbio->inode;
struct btrfs_fs_info *fs_info = inode->root->fs_info;
const u32 blocksize = fs_info->sectorsize;
- struct folio *folio;
+ const u32 step = min(blocksize, PAGE_SIZE);
+ const u32 nr_steps = blocksize / step;
u64 file_offset = bbio->file_offset + bio_offset;
u64 end = file_offset + blocksize - 1;
u8 *csum_expected;
@@ -3445,7 +3447,8 @@ bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
csum_expected = bbio->csum + (bio_offset >> fs_info->sectorsize_bits) *
fs_info->csum_size;
- if (btrfs_check_block_csum(fs_info, paddr, csum, csum_expected))
+ btrfs_calculate_block_csum_pages(fs_info, paddrs, csum);
+ if (unlikely(memcmp(csum, csum_expected, fs_info->csum_size) != 0))
goto zeroit;
return true;
@@ -3454,9 +3457,8 @@ bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
bbio->mirror_num);
if (dev)
btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_CORRUPTION_ERRS);
- folio = page_folio(phys_to_page(paddr));
- ASSERT(offset_in_folio(folio, paddr) + blocksize <= folio_size(folio));
- folio_zero_range(folio, offset_in_folio(folio, paddr), blocksize);
+ for (int i = 0; i < nr_steps; i++)
+ memzero_page(phys_to_page(paddrs[i]), offset_in_page(paddrs[i]), step);
return false;
}
--
2.50.1
* [PATCH v2 4/4] btrfs: enable encoded read/write/send for bs > ps cases
From: Qu Wenruo @ 2025-10-05 22:41 UTC
To: linux-btrfs
Since read verification and read repair both support bs > ps without
large folios now, we can enable encoded read/write/send.
Now we can relax the per-bvec alignment check in
assert_bbio_alignment() to min(blocksize, PAGE_SIZE).
But also add an extra blocksize-based alignment check for the logical
address and length of the bbio.
There is a pitfall in btrfs_add_compressed_bio_folios(), which relies
on the folios passed in meeting the minimal folio order.
But now we can pass regular page sized folios in, so update it to
check each folio's size instead of using the minimal folio size.
This even allows btrfs_add_compressed_bio_folios() to handle a folio
array with mixed sizes, though thankfully we don't yet need to handle
such a crazy situation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
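A worked example of the per-folio length computation (illustration
only, not part of the diff): for compressed_len = 20K backed by one 16K
folio followed by one 4K folio, the loop adds
len = min(20K - 0, 16K) = 16K from folio 0, then
len = min(20K - 16K, 4K) = 4K from folio 1, advancing findex once per
folio regardless of each folio's size.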
fs/btrfs/bio.c | 22 ++++++++++++----------
fs/btrfs/compression.c | 10 +++++-----
fs/btrfs/ioctl.c | 21 ---------------------
fs/btrfs/send.c | 9 +--------
4 files changed, 18 insertions(+), 44 deletions(-)
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 6642318320eb..b2dbf93842d2 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -825,21 +825,23 @@ static void assert_bbio_alignment(struct btrfs_bio *bbio)
struct bio_vec bvec;
struct bvec_iter iter;
const u32 blocksize = fs_info->sectorsize;
+ const u32 alignment = min(blocksize, PAGE_SIZE);
+ const u64 logical = bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
+ const u32 length = bbio->bio.bi_iter.bi_size;
- /* Metadata has no extra bs > ps alignment requirement. */
- if (!is_data_bbio(bbio))
- return;
+ /* The logical and length should still be aligned to blocksize. */
+ ASSERT(IS_ALIGNED(logical, blocksize) && IS_ALIGNED(length, blocksize) &&
+ length != 0, "root=%llu inode=%llu logical=%llu length=%u",
+ btrfs_root_id(bbio->inode->root),
+ btrfs_ino(bbio->inode), logical, length);
bio_for_each_bvec(bvec, &bbio->bio, iter)
- ASSERT(IS_ALIGNED(bvec.bv_offset, blocksize) &&
- IS_ALIGNED(bvec.bv_len, blocksize),
+ ASSERT(IS_ALIGNED(bvec.bv_offset, alignment) &&
+ IS_ALIGNED(bvec.bv_len, alignment),
"root=%llu inode=%llu logical=%llu length=%u index=%u bv_offset=%u bv_len=%u",
btrfs_root_id(bbio->inode->root),
- btrfs_ino(bbio->inode),
- bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT,
- bbio->bio.bi_iter.bi_size, iter.bi_idx,
- bvec.bv_offset,
- bvec.bv_len);
+ btrfs_ino(bbio->inode), logical, length, iter.bi_idx,
+ bvec.bv_offset, bvec.bv_len);
#endif
}
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index bacad18357b3..82c43496011c 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -354,21 +354,21 @@ static void end_bbio_compressed_write(struct btrfs_bio *bbio)
static void btrfs_add_compressed_bio_folios(struct compressed_bio *cb)
{
- struct btrfs_fs_info *fs_info = cb->bbio.fs_info;
struct bio *bio = &cb->bbio.bio;
u32 offset = 0;
+ unsigned int findex = 0;
while (offset < cb->compressed_len) {
- struct folio *folio;
- int ret;
+ struct folio *folio = cb->compressed_folios[findex];
u32 len = min_t(u32, cb->compressed_len - offset,
- btrfs_min_folio_size(fs_info));
+ folio_size(folio));
+ int ret;
- folio = cb->compressed_folios[offset >> (PAGE_SHIFT + fs_info->block_min_order)];
/* Maximum compressed extent is smaller than bio size limit. */
ret = bio_add_folio(bio, folio, len, 0);
ASSERT(ret);
offset += len;
+ findex++;
}
}
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 938286bee6a8..552ad99e89fc 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4418,10 +4418,6 @@ static int btrfs_ioctl_encoded_read(struct file *file, void __user *argp,
goto out_acct;
}
- if (fs_info->sectorsize > PAGE_SIZE) {
- ret = -ENOTTY;
- goto out_acct;
- }
if (compat) {
#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT)
struct btrfs_ioctl_encoded_io_args_32 args32;
@@ -4513,7 +4509,6 @@ static int btrfs_ioctl_encoded_read(struct file *file, void __user *argp,
static int btrfs_ioctl_encoded_write(struct file *file, void __user *argp, bool compat)
{
- struct btrfs_fs_info *fs_info = inode_to_fs_info(file->f_inode);
struct btrfs_ioctl_encoded_io_args args;
struct iovec iovstack[UIO_FASTIOV];
struct iovec *iov = iovstack;
@@ -4527,11 +4522,6 @@ static int btrfs_ioctl_encoded_write(struct file *file, void __user *argp, bool
goto out_acct;
}
- if (fs_info->sectorsize > PAGE_SIZE) {
- ret = -ENOTTY;
- goto out_acct;
- }
-
if (!(file->f_mode & FMODE_WRITE)) {
ret = -EBADF;
goto out_acct;
@@ -4813,11 +4803,6 @@ static int btrfs_uring_encoded_read(struct io_uring_cmd *cmd, unsigned int issue
ret = -EPERM;
goto out_acct;
}
- if (fs_info->sectorsize > PAGE_SIZE) {
- ret = -ENOTTY;
- goto out_acct;
- }
-
sqe_addr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr));
if (issue_flags & IO_URING_F_COMPAT) {
@@ -4945,7 +4930,6 @@ static int btrfs_uring_encoded_read(struct io_uring_cmd *cmd, unsigned int issue
static int btrfs_uring_encoded_write(struct io_uring_cmd *cmd, unsigned int issue_flags)
{
struct file *file = cmd->file;
- struct btrfs_fs_info *fs_info = inode_to_fs_info(file->f_inode);
loff_t pos;
struct kiocb kiocb;
ssize_t ret;
@@ -4960,11 +4944,6 @@ static int btrfs_uring_encoded_write(struct io_uring_cmd *cmd, unsigned int issu
ret = -EPERM;
goto out_acct;
}
- if (fs_info->sectorsize > PAGE_SIZE) {
- ret = -ENOTTY;
- goto out_acct;
- }
-
sqe_addr = u64_to_user_ptr(READ_ONCE(cmd->sqe->addr));
if (!(file->f_mode & FMODE_WRITE)) {
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 9230e5066fc6..1fcc820f6baf 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -5601,14 +5601,7 @@ static int send_extent_data(struct send_ctx *sctx, struct btrfs_path *path,
ei = btrfs_item_ptr(leaf, path->slots[0],
struct btrfs_file_extent_item);
- /*
- * Do not go through encoded read for bs > ps cases.
- *
- * Encoded send is using vmallocated pages as buffer, which we can
- * not ensure every folio is large enough to contain a block.
- */
- if (sctx->send_root->fs_info->sectorsize <= PAGE_SIZE &&
- (sctx->flags & BTRFS_SEND_FLAG_COMPRESSED) &&
+ if ((sctx->flags & BTRFS_SEND_FLAG_COMPRESSED) &&
btrfs_file_extent_compression(leaf, ei) != BTRFS_COMPRESS_NONE) {
bool is_inline = (btrfs_file_extent_type(leaf, ei) ==
BTRFS_FILE_EXTENT_INLINE);
--
2.50.1