* [PATCH 0/2] btrfs: add large folio support to read-repair and defrag
@ 2025-04-07 6:09 Qu Wenruo
2025-04-07 6:09 ` [PATCH 1/2] btrfs: prepare btrfs_end_repair_bio() for larger data folios Qu Wenruo
2025-04-07 6:09 ` [PATCH 2/2] btrfs: enable larger data folios support for defrag Qu Wenruo
0 siblings, 2 replies; 6+ messages in thread
From: Qu Wenruo @ 2025-04-07 6:09 UTC (permalink / raw)
To: linux-btrfs
These two patches were originally part of an ASSERT() cleanup series, but
since I was later able to enable large data folios for testing, a bug was
exposed by the first patch, so these two have not yet been merged into the
for-next branch.
Now that the full fstests suite can run on my local large data folios
branch, I can finally verify the behavior of these two patches.
The first patch adds the extra calculation needed to convert a bio_vec
(whether single-page or multi-page) into the proper folio and the offset
inside that folio.
The second patch adds large folio support to defrag, which needs only
minor changes.
Qu Wenruo (2):
btrfs: prepare btrfs_end_repair_bio() for larger data folios
btrfs: enable larger data folios support for defrag
fs/btrfs/bio.c | 61 ++++++++++++++++++++++++++++++++++-----
fs/btrfs/defrag.c | 72 +++++++++++++++++++++++++++--------------------
2 files changed, 96 insertions(+), 37 deletions(-)
--
2.49.0
* [PATCH 1/2] btrfs: prepare btrfs_end_repair_bio() for larger data folios
2025-04-07 6:09 [PATCH 0/2] btrfs: add large folio support to read-repair and defrag Qu Wenruo
@ 2025-04-07 6:09 ` Qu Wenruo
2025-04-14 14:25 ` Filipe Manana
2025-04-07 6:09 ` [PATCH 2/2] btrfs: enable larger data folios support for defrag Qu Wenruo
1 sibling, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2025-04-07 6:09 UTC (permalink / raw)
To: linux-btrfs
The function btrfs_end_repair_bio() has an ASSERT() making sure the
folio is page sized.
The reason is mostly that we later pass a folio and its offset into
btrfs_repair_io_failure().
If larger folios are passed in, the later calculation of the folio and
its offset can go wrong, as there can be an extra offset from bv_page.
Change the behavior by:
- Doing a proper folio grab
Instead of just calling page_folio(bv_page), we should get the real page
first (as bv_offset can be larger than the page size), then call
page_folio() on it.
- Doing the extra folio offset calculation
real_page
bv_page | bv_offset (10K)
| | |
v v v
| | | |
|<- F1 ->|<--- Folio 2 -->|
| result off |
'|' is page boundary.
The folio is the one containing that real_page.
We want the real offset inside that folio.
The result offset we want is of two parts:
- the offset of the real page to the folio page
- the offset inside that real page
We can not use offset_in_folio() which will give an incorrect result.
(2K instead of 6K, as folio 1 has a different order)
With these changes, btrfs_end_repair_bio() is now able to handle not only
large folios but also multi-page bio vectors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/bio.c | 61 ++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 54 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 8c2eee1f1878..3140aa19aadc 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -156,6 +156,58 @@ static void btrfs_repair_done(struct btrfs_failed_bio *fbio)
}
}
+/*
+ * Since a single bio_vec can merge multiple physically contiguous pages
+ * into one bio_vec entry, we can have the following case:
+ *
+ * bv_page bv_offset
+ * v v
+ * | | | | | | |
+ *
+ * In that case we need to grab the real page where bv_offset is at.
+ */
+static struct page *bio_vec_get_real_page(const struct bio_vec *bv)
+{
+ return bv->bv_page + (bv->bv_offset >> PAGE_SHIFT);
+}
+static struct folio *bio_vec_get_folio(const struct bio_vec *bv)
+{
+ return page_folio(bio_vec_get_real_page(bv));
+}
+
+static unsigned long bio_vec_get_folio_offset(const struct bio_vec *bv)
+{
+ const struct page *real_page = bio_vec_get_real_page(bv);
+ const struct folio *folio = page_folio(real_page);
+
+ /*
+ * The following ASCII chart is to show how the calculation is done.
+ *
+ * real_page
+ * bv_page | bv_offset (10K)
+ * | | |
+ * v v v
+ * | | | |
+ * |<- F1 ->|<--- Folio 2 -->|
+ * | result off |
+ *
+ * '|' is page boundary.
+ *
+ * The folio is the one containing that real_page.
+ * We want the real offset inside that folio.
+ *
+ * The result offset we want is of two parts:
+ * - the offset of the real page to the folio page
+ * - the offset inside that real page
+ *
+ * We can not use offset_in_folio() which will give an incorrect result.
+ * (2K instead of 6K, as folio 1 has a different order)
+ */
+ ASSERT(&folio->page <= real_page);
+ return (folio_page_idx(folio, real_page) << PAGE_SHIFT) +
+ offset_in_page(bv->bv_offset);
+}
+
static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
struct btrfs_device *dev)
{
@@ -165,12 +217,6 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
struct bio_vec *bv = bio_first_bvec_all(&repair_bbio->bio);
int mirror = repair_bbio->mirror_num;
- /*
- * We can only trigger this for data bio, which doesn't support larger
- * folios yet.
- */
- ASSERT(folio_order(page_folio(bv->bv_page)) == 0);
-
if (repair_bbio->bio.bi_status ||
!btrfs_data_csum_ok(repair_bbio, dev, 0, bv)) {
bio_reset(&repair_bbio->bio, NULL, REQ_OP_READ);
@@ -192,7 +238,8 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
btrfs_repair_io_failure(fs_info, btrfs_ino(inode),
repair_bbio->file_offset, fs_info->sectorsize,
repair_bbio->saved_iter.bi_sector << SECTOR_SHIFT,
- page_folio(bv->bv_page), bv->bv_offset, mirror);
+ bio_vec_get_folio(bv), bio_vec_get_folio_offset(bv),
+ mirror);
} while (mirror != fbio->bbio->mirror_num);
done:
--
2.49.0
* [PATCH 2/2] btrfs: enable larger data folios support for defrag
2025-04-07 6:09 [PATCH 0/2] btrfs: add large folio support to read-repair and defrag Qu Wenruo
2025-04-07 6:09 ` [PATCH 1/2] btrfs: prepare btrfs_end_repair_bio() for larger data folios Qu Wenruo
@ 2025-04-07 6:09 ` Qu Wenruo
2025-04-14 14:38 ` Filipe Manana
1 sibling, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2025-04-07 6:09 UTC (permalink / raw)
To: linux-btrfs
Currently we rejects larger folios for defrag gracefully, but the
implementation itself is already mostly larger folios compatible.
There are several parts of defrag in btrfs:
- Extent map checking
Aka, defrag_collect_targets(), which prepares a list of target ranges
that should be defragged.
This part is completely folio unrelated, thus it doesn't care about
the folio size.
- Target folio preparation
Aka, defrag_prepare_one_folio(), which locks and reads (if needed) the
target folio.
Since folio read and lock already support larger folios, this part
needs only minor changes.
- Redirty the target range of the folio
This is already done in a way supporting larger folios.
So it's pretty straightforward to enable larger folios for defrag:
- Do not reject larger folios for experimental builds
This affects the larger folio check inside defrag_prepare_one_folio().
- Wait for ordered extents of the whole folio in
defrag_prepare_one_folio()
- Lock the whole extent range for all involved folios in
defrag_one_range()
- Allow the folios[] array to be partially empty
Since we can have larger folios, folios[] will not always be full.
This affects:
* How to allocate folios in defrag_one_range()
Now we cannot use the page index, but instead use the end position of
the folio as an iterator.
* How to free the folios[] array
If we hit an empty slot, it means we have larger folios and already
hit the end of the array.
* How to mark the range dirty
Instead of using the page index directly, we have to go through each
folio.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/defrag.c | 72 +++++++++++++++++++++++++++--------------------
1 file changed, 42 insertions(+), 30 deletions(-)
diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index d4310d93f532..f2fa8b5c64b5 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -857,13 +857,14 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
{
struct address_space *mapping = inode->vfs_inode.i_mapping;
gfp_t mask = btrfs_alloc_write_mask(mapping);
- u64 page_start = (u64)index << PAGE_SHIFT;
- u64 page_end = page_start + PAGE_SIZE - 1;
+ u64 folio_start;
+ u64 folio_end;
struct extent_state *cached_state = NULL;
struct folio *folio;
int ret;
again:
+ /* TODO: Add order fgp order flags when larger folios are fully enabled. */
folio = __filemap_get_folio(mapping, index,
FGP_LOCK | FGP_ACCESSED | FGP_CREAT, mask);
if (IS_ERR(folio))
@@ -871,13 +872,16 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
/*
* Since we can defragment files opened read-only, we can encounter
- * transparent huge pages here (see CONFIG_READ_ONLY_THP_FOR_FS). We
- * can't do I/O using huge pages yet, so return an error for now.
+ * transparent huge pages here (see CONFIG_READ_ONLY_THP_FOR_FS).
+ *
+ * The IO for such larger folios are not fully tested, thus return
+ * an error to reject such folios unless it's an experimental build.
+ *
* Filesystem transparent huge pages are typically only used for
* executables that explicitly enable them, so this isn't very
* restrictive.
*/
- if (folio_test_large(folio)) {
+ if (!IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL) && folio_test_large(folio)) {
folio_unlock(folio);
folio_put(folio);
return ERR_PTR(-ETXTBSY);
@@ -890,13 +894,15 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
return ERR_PTR(ret);
}
+ folio_start = folio_pos(folio);
+ folio_end = folio_pos(folio) + folio_size(folio) - 1;
/* Wait for any existing ordered extent in the range */
while (1) {
struct btrfs_ordered_extent *ordered;
- lock_extent(&inode->io_tree, page_start, page_end, &cached_state);
- ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_SIZE);
- unlock_extent(&inode->io_tree, page_start, page_end,
+ lock_extent(&inode->io_tree, folio_start, folio_end, &cached_state);
+ ordered = btrfs_lookup_ordered_range(inode, folio_start, folio_size(folio));
+ unlock_extent(&inode->io_tree, folio_start, folio_end,
&cached_state);
if (!ordered)
break;
@@ -1162,13 +1168,7 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
struct extent_changeset *data_reserved = NULL;
const u64 start = target->start;
const u64 len = target->len;
- unsigned long last_index = (start + len - 1) >> PAGE_SHIFT;
- unsigned long start_index = start >> PAGE_SHIFT;
- unsigned long first_index = folios[0]->index;
int ret = 0;
- int i;
-
- ASSERT(last_index - first_index + 1 <= nr_pages);
ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, len);
if (ret < 0)
@@ -1179,10 +1179,17 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
set_extent_bit(&inode->io_tree, start, start + len - 1,
EXTENT_DELALLOC | EXTENT_DEFRAG, cached_state);
- /* Update the page status */
- for (i = start_index - first_index; i <= last_index - first_index; i++) {
- folio_clear_checked(folios[i]);
- btrfs_folio_clamp_set_dirty(fs_info, folios[i], start, len);
+ /*
+ * Update the page status.
+ * Due to possible larger folios, we have to check all folios one by one.
+ * And the btrfs_folio_clamp_*() helpers can handle ranges out of the
+ * folio cases well.
+ */
+ for (int i = 0; i < nr_pages && folios[i]; i++) {
+ struct folio *folio = folios[i];
+
+ btrfs_folio_clamp_clear_checked(fs_info, folio, start, len);
+ btrfs_folio_clamp_set_dirty(fs_info, folio, start, len);
}
btrfs_delalloc_release_extents(inode, len);
extent_changeset_free(data_reserved);
@@ -1200,9 +1207,9 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
LIST_HEAD(target_list);
struct folio **folios;
const u32 sectorsize = inode->root->fs_info->sectorsize;
- u64 last_index = (start + len - 1) >> PAGE_SHIFT;
- u64 start_index = start >> PAGE_SHIFT;
- unsigned int nr_pages = last_index - start_index + 1;
+ u64 cur = start;
+ const unsigned int nr_pages = ((start + len - 1) >> PAGE_SHIFT) -
+ (start >> PAGE_SHIFT) + 1;
int ret = 0;
int i;
@@ -1214,21 +1221,25 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
return -ENOMEM;
/* Prepare all pages */
- for (i = 0; i < nr_pages; i++) {
- folios[i] = defrag_prepare_one_folio(inode, start_index + i);
+ for (i = 0; cur < start + len && i < nr_pages; i++) {
+ folios[i] = defrag_prepare_one_folio(inode, cur >> PAGE_SHIFT);
if (IS_ERR(folios[i])) {
ret = PTR_ERR(folios[i]);
- nr_pages = i;
+ folios[i] = NULL;
goto free_folios;
}
+ cur = folio_pos(folios[i]) + folio_size(folios[i]);
}
- for (i = 0; i < nr_pages; i++)
+ for (i = 0; i < nr_pages; i++) {
+ if (!folios[i])
+ break;
folio_wait_writeback(folios[i]);
+ }
- /* Lock the pages range */
- lock_extent(&inode->io_tree, start_index << PAGE_SHIFT,
- (last_index << PAGE_SHIFT) + PAGE_SIZE - 1,
+ /* Lock the folios[] range */
+ lock_extent(&inode->io_tree, folio_pos(folios[0]), cur - 1,
&cached_state);
+
/*
* Now we have a consistent view about the extent map, re-check
* which range really needs to be defragged.
@@ -1254,11 +1265,12 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
kfree(entry);
}
unlock_extent:
- unlock_extent(&inode->io_tree, start_index << PAGE_SHIFT,
- (last_index << PAGE_SHIFT) + PAGE_SIZE - 1,
+ unlock_extent(&inode->io_tree, folio_pos(folios[0]), cur - 1,
&cached_state);
free_folios:
for (i = 0; i < nr_pages; i++) {
+ if (!folios[i])
+ break;
folio_unlock(folios[i]);
folio_put(folios[i]);
}
--
2.49.0
* Re: [PATCH 1/2] btrfs: prepare btrfs_end_repair_bio() for larger data folios
2025-04-07 6:09 ` [PATCH 1/2] btrfs: prepare btrfs_end_repair_bio() for larger data folios Qu Wenruo
@ 2025-04-14 14:25 ` Filipe Manana
2025-04-14 22:58 ` Qu Wenruo
0 siblings, 1 reply; 6+ messages in thread
From: Filipe Manana @ 2025-04-14 14:25 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Mon, Apr 7, 2025 at 7:09 AM Qu Wenruo <wqu@suse.com> wrote:
>
> The function btrfs_end_repair_bio() has an ASSERT() making sure the
> folio is page sized.
>
> The reason is mostly related to the fact that later we pass a folio and
> its offset into btrfs_repair_io_failure().
> If we have larger folios passed in, later calculation of the folio and
> its offset can go wrong, as we have extra offset to the bv_page.
>
> Change the behavior by:
>
> - Doing a proper folio grab
> Instead of just page_folio(bv_page), we should get the real page (as the
> bv_offset can be larger than page size), then call page_folio().
>
> - Do extra folio offset calculation
>
> real_page
> bv_page | bv_offset (10K)
> | | |
> v v v
> | | | |
> |<- F1 ->|<--- Folio 2 -->|
> | result off |
>
> '|' is page boundary.
>
> The folio is the one containing that real_page.
> We want the real offset inside that folio.
>
> The result offset we want is of two parts:
> - the offset of the real page to the folio page
"to the folio page", this is confusing for me. Isn't it the offset of
the page inside the folio?
I.e. "the offset of the real page inside the folio"
Also, this terminology "real page", does it come from somewhere?
Aren't all pages "real"?
> - the offset inside that real page
>
> We can not use offset_in_folio() which will give an incorrect result.
> (2K instead of 6K, as folio 1 has a different order)
I don't think anyone can understand where that 2K and 6K come from.
The diagram doesn't mention how many pages per folio (folio order),
page sizes, and file offset of each folio.
So mentioning that 2K and 6K without all the relevant information
makes it useless and confusing IMO.
>
> With these changes, now btrfs_end_repair_bio() is able to handle not
> only large folios, but also multi-page bio vectors.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/bio.c | 61 ++++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 54 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 8c2eee1f1878..3140aa19aadc 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -156,6 +156,58 @@ static void btrfs_repair_done(struct btrfs_failed_bio *fbio)
> }
> }
>
> +/*
> + * Since a single bio_vec can merge multiple physically contiguous pages
> + * into one bio_vec entry, we can have the following case:
> + *
> + * bv_page bv_offset
> + * v v
> + * | | | | | | |
> + *
> + * In that case we need to grab the real page where bv_offset is at.
> + */
> +static struct page *bio_vec_get_real_page(const struct bio_vec *bv)
> +{
> + return bv->bv_page + (bv->bv_offset >> PAGE_SHIFT);
> +}
> +static struct folio *bio_vec_get_folio(const struct bio_vec *bv)
Please add a blank line between function declarations.
> +{
> + return page_folio(bio_vec_get_real_page(bv));
> +}
> +
> +static unsigned long bio_vec_get_folio_offset(const struct bio_vec *bv)
> +{
> + const struct page *real_page = bio_vec_get_real_page(bv);
> + const struct folio *folio = page_folio(real_page);
> +
> + /*
> + * The following ASCII chart is to show how the calculation is done.
> + *
> + * real_page
> + * bv_page | bv_offset (10K)
> + * | | |
> + * v v v
> + * | | | |
> + * |<- F1 ->|<--- Folio 2 -->|
> + * | result off |
> + *
> + * '|' is page boundary.
> + *
> + * The folio is the one containing that real_page.
> + * We want the real offset inside that folio.
> + *
> + * The result offset we want is of two parts:
> + * - the offset of the real page to the folio page
> + * - the offset inside that real page
> + *
> + * We can not use offset_in_folio() which will give an incorrect result.
> + * (2K instead of 6K, as folio 1 has a different order)
Same comment here, as this is copied from the change log.
Codewise this looks good to me, but those comments and terminology
("real page") make it confusing IMO.
Thanks.
> + */
> + ASSERT(&folio->page <= real_page);
> + return (folio_page_idx(folio, real_page) << PAGE_SHIFT) +
> + offset_in_page(bv->bv_offset);
> +}
> +
> static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
> struct btrfs_device *dev)
> {
> @@ -165,12 +217,6 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
> struct bio_vec *bv = bio_first_bvec_all(&repair_bbio->bio);
> int mirror = repair_bbio->mirror_num;
>
> - /*
> - * We can only trigger this for data bio, which doesn't support larger
> - * folios yet.
> - */
> - ASSERT(folio_order(page_folio(bv->bv_page)) == 0);
> -
> if (repair_bbio->bio.bi_status ||
> !btrfs_data_csum_ok(repair_bbio, dev, 0, bv)) {
> bio_reset(&repair_bbio->bio, NULL, REQ_OP_READ);
> @@ -192,7 +238,8 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
> btrfs_repair_io_failure(fs_info, btrfs_ino(inode),
> repair_bbio->file_offset, fs_info->sectorsize,
> repair_bbio->saved_iter.bi_sector << SECTOR_SHIFT,
> - page_folio(bv->bv_page), bv->bv_offset, mirror);
> + bio_vec_get_folio(bv), bio_vec_get_folio_offset(bv),
> + mirror);
> } while (mirror != fbio->bbio->mirror_num);
>
> done:
> --
> 2.49.0
>
>
* Re: [PATCH 2/2] btrfs: enable larger data folios support for defrag
2025-04-07 6:09 ` [PATCH 2/2] btrfs: enable larger data folios support for defrag Qu Wenruo
@ 2025-04-14 14:38 ` Filipe Manana
0 siblings, 0 replies; 6+ messages in thread
From: Filipe Manana @ 2025-04-14 14:38 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Mon, Apr 7, 2025 at 7:10 AM Qu Wenruo <wqu@suse.com> wrote:
>
> Currently we rejects larger folios for defrag gracefully, but the
rejects -> reject
> implementation itself is already mostly larger folios compatible.
>
> There are several parts of defrag in btrfs:
>
> - Extent map checking
> Aka, defrag_collect_targets(), which prepares a list of target ranges
> that should be defragged.
>
> This part is completely folio unrelated, thus it doesn't care about
> the folio size.
>
> - Target folio preparation
> Aka, defrag_prepare_one_folio(), which lock and read (if needed) the
> target folio.
>
> Since folio read and lock are already supporting larger folios, this
> part needs only minor changes.
>
> - Redirty the target range of the folio
> This is already done in a way supporting larger folios.
>
> So it's pretty straightforward to enable larger folios for defrag:
>
> - Do not reject larger folios for experimental builds
> This affects the larger folio check inside defrag_prepare_one_folio().
>
> - Wait for ordered extents of the whole folio in
> defrag_prepare_one_folio()
>
> - Lock the whole extent range for all involved folios in
> defrag_one_range()
>
> - Allow the folios[] array to be partially empty
> Since we can have larger folios, folios[] will not always be full.
>
> This affects:
> * How to allocate folios in defrag_one_range()
> Now we can not use page index, but use the end position of the folio
> as an iterator.
>
> * How to free the folios[] array
> If we hit an empty slot, it means we have larger folios and already
> hit the end of the array.
>
> * How to mark the range dirty
> Instead of use page index directly, we have to go through each
> folio.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/defrag.c | 72 +++++++++++++++++++++++++++--------------------
> 1 file changed, 42 insertions(+), 30 deletions(-)
>
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index d4310d93f532..f2fa8b5c64b5 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -857,13 +857,14 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
> {
> struct address_space *mapping = inode->vfs_inode.i_mapping;
> gfp_t mask = btrfs_alloc_write_mask(mapping);
> - u64 page_start = (u64)index << PAGE_SHIFT;
> - u64 page_end = page_start + PAGE_SIZE - 1;
> + u64 folio_start;
> + u64 folio_end;
> struct extent_state *cached_state = NULL;
> struct folio *folio;
> int ret;
>
> again:
> + /* TODO: Add order fgp order flags when larger folios are fully enabled. */
> folio = __filemap_get_folio(mapping, index,
> FGP_LOCK | FGP_ACCESSED | FGP_CREAT, mask);
> if (IS_ERR(folio))
> @@ -871,13 +872,16 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
>
> /*
> * Since we can defragment files opened read-only, we can encounter
> - * transparent huge pages here (see CONFIG_READ_ONLY_THP_FOR_FS). We
> - * can't do I/O using huge pages yet, so return an error for now.
> + * transparent huge pages here (see CONFIG_READ_ONLY_THP_FOR_FS).
> + *
> + * The IO for such larger folios are not fully tested, thus return
"are not" -> "is not"
> + * an error to reject such folios unless it's an experimental build.
> + *
> * Filesystem transparent huge pages are typically only used for
> * executables that explicitly enable them, so this isn't very
> * restrictive.
> */
> - if (folio_test_large(folio)) {
> + if (!IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL) && folio_test_large(folio)) {
> folio_unlock(folio);
> folio_put(folio);
> return ERR_PTR(-ETXTBSY);
> @@ -890,13 +894,15 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
> return ERR_PTR(ret);
> }
>
> + folio_start = folio_pos(folio);
> + folio_end = folio_pos(folio) + folio_size(folio) - 1;
> /* Wait for any existing ordered extent in the range */
> while (1) {
> struct btrfs_ordered_extent *ordered;
>
> - lock_extent(&inode->io_tree, page_start, page_end, &cached_state);
> - ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_SIZE);
> - unlock_extent(&inode->io_tree, page_start, page_end,
> + lock_extent(&inode->io_tree, folio_start, folio_end, &cached_state);
> + ordered = btrfs_lookup_ordered_range(inode, folio_start, folio_size(folio));
> + unlock_extent(&inode->io_tree, folio_start, folio_end,
> &cached_state);
> if (!ordered)
> break;
> @@ -1162,13 +1168,7 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
> struct extent_changeset *data_reserved = NULL;
> const u64 start = target->start;
> const u64 len = target->len;
> - unsigned long last_index = (start + len - 1) >> PAGE_SHIFT;
> - unsigned long start_index = start >> PAGE_SHIFT;
> - unsigned long first_index = folios[0]->index;
> int ret = 0;
> - int i;
> -
> - ASSERT(last_index - first_index + 1 <= nr_pages);
>
> ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, len);
> if (ret < 0)
> @@ -1179,10 +1179,17 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
> set_extent_bit(&inode->io_tree, start, start + len - 1,
> EXTENT_DELALLOC | EXTENT_DEFRAG, cached_state);
>
> - /* Update the page status */
> - for (i = start_index - first_index; i <= last_index - first_index; i++) {
> - folio_clear_checked(folios[i]);
> - btrfs_folio_clamp_set_dirty(fs_info, folios[i], start, len);
> + /*
> + * Update the page status.
> + * Due to possible larger folios, we have to check all folios one by one.
> + * And the btrfs_folio_clamp_*() helpers can handle ranges out of the
> + * folio cases well.
> + */
> + for (int i = 0; i < nr_pages && folios[i]; i++) {
> + struct folio *folio = folios[i];
> +
> + btrfs_folio_clamp_clear_checked(fs_info, folio, start, len);
> + btrfs_folio_clamp_set_dirty(fs_info, folio, start, len);
> }
> btrfs_delalloc_release_extents(inode, len);
> extent_changeset_free(data_reserved);
> @@ -1200,9 +1207,9 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
> LIST_HEAD(target_list);
> struct folio **folios;
> const u32 sectorsize = inode->root->fs_info->sectorsize;
> - u64 last_index = (start + len - 1) >> PAGE_SHIFT;
> - u64 start_index = start >> PAGE_SHIFT;
> - unsigned int nr_pages = last_index - start_index + 1;
> + u64 cur = start;
> + const unsigned int nr_pages = ((start + len - 1) >> PAGE_SHIFT) -
> + (start >> PAGE_SHIFT) + 1;
> int ret = 0;
> int i;
>
> @@ -1214,21 +1221,25 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
> return -ENOMEM;
>
> /* Prepare all pages */
> - for (i = 0; i < nr_pages; i++) {
> - folios[i] = defrag_prepare_one_folio(inode, start_index + i);
> + for (i = 0; cur < start + len && i < nr_pages; i++) {
It's a good opportunity to declare 'i' in the loop instead, like in
the previous function.
> + folios[i] = defrag_prepare_one_folio(inode, cur >> PAGE_SHIFT);
> if (IS_ERR(folios[i])) {
> ret = PTR_ERR(folios[i]);
> - nr_pages = i;
> + folios[i] = NULL;
> goto free_folios;
> }
> + cur = folio_pos(folios[i]) + folio_size(folios[i]);
> }
> - for (i = 0; i < nr_pages; i++)
> + for (i = 0; i < nr_pages; i++) {
Same here.
> + if (!folios[i])
> + break;
> folio_wait_writeback(folios[i]);
> + }
>
> - /* Lock the pages range */
> - lock_extent(&inode->io_tree, start_index << PAGE_SHIFT,
> - (last_index << PAGE_SHIFT) + PAGE_SIZE - 1,
> + /* Lock the folios[] range */
> + lock_extent(&inode->io_tree, folio_pos(folios[0]), cur - 1,
> &cached_state);
> +
This new line seems accidental.
> /*
> * Now we have a consistent view about the extent map, re-check
> * which range really needs to be defragged.
> @@ -1254,11 +1265,12 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
> kfree(entry);
> }
> unlock_extent:
> - unlock_extent(&inode->io_tree, start_index << PAGE_SHIFT,
> - (last_index << PAGE_SHIFT) + PAGE_SIZE - 1,
> + unlock_extent(&inode->io_tree, folio_pos(folios[0]), cur - 1,
> &cached_state);
> free_folios:
> for (i = 0; i < nr_pages; i++) {
Same here.
With those minor changes:
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks.
> + if (!folios[i])
> + break;
> folio_unlock(folios[i]);
> folio_put(folios[i]);
> }
> --
> 2.49.0
>
>
* Re: [PATCH 1/2] btrfs: prepare btrfs_end_repair_bio() for larger data folios
2025-04-14 14:25 ` Filipe Manana
@ 2025-04-14 22:58 ` Qu Wenruo
0 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2025-04-14 22:58 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
On 2025/4/14 23:55, Filipe Manana wrote:
> On Mon, Apr 7, 2025 at 7:09 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> The function btrfs_end_repair_bio() has an ASSERT() making sure the
>> folio is page sized.
>>
>> The reason is mostly related to the fact that later we pass a folio and
>> its offset into btrfs_repair_io_failure().
>> If we have larger folios passed in, later calculation of the folio and
>> its offset can go wrong, as we have extra offset to the bv_page.
>>
>> Change the behavior by:
>>
>> - Doing a proper folio grab
>> Instead of just page_folio(bv_page), we should get the real page (as the
>> bv_offset can be larger than page size), then call page_folio().
>>
>> - Do extra folio offset calculation
>>
>> real_page
>> bv_page | bv_offset (10K)
>> | | |
>> v v v
>> | | | |
>> |<- F1 ->|<--- Folio 2 -->|
>> | result off |
>>
>> '|' is page boundary.
>>
>> The folio is the one containing that real_page.
>> We want the real offset inside that folio.
>>
>> The result offset we want is of two parts:
>> - the offset of the real page to the folio page
>
> "to the folio page", this is confusing for me. Isn't it the offset of
> the page inside the folio?
> I.e. "the offset of the real page inside the folio"
That's correct.
>
> Also, this terminology "real page", does it come from somewhere?
> Aren't all pages "real"?
The "real" part is compared to bv_page, which is not really the page
we're working on if the bvec is a multi-page one.
>
>> - the offset inside that real page
>>
>> We can not use offset_in_folio() which will give an incorrect result.
>> (2K instead of 6K, as folio 1 has a different order)
>
> I don't think anyone can understand where that 2K and 6K come from.
> The diagram doesn't mention how many pages per folio (folio order),
> page sizes, and file offset of each folio.
> So mentioning that 2K and 6K without all the relevant information,
> make it useless and confusing IMO.
>
>>
>> With these changes, now btrfs_end_repair_bio() is able to handle not
>> only large folios, but also multi-page bio vectors.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> fs/btrfs/bio.c | 61 ++++++++++++++++++++++++++++++++++++++++++++------
>> 1 file changed, 54 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
>> index 8c2eee1f1878..3140aa19aadc 100644
>> --- a/fs/btrfs/bio.c
>> +++ b/fs/btrfs/bio.c
>> @@ -156,6 +156,58 @@ static void btrfs_repair_done(struct btrfs_failed_bio *fbio)
>> }
>> }
>>
>> +/*
>> + * Since a single bio_vec can merge multiple physically contiguous pages
>> + * into one bio_vec entry, we can have the following case:
>> + *
>> + * bv_page bv_offset
>> + * v v
>> + * | | | | | | |
>> + *
>> + * In that case we need to grab the real page where bv_offset is at.
>> + */
>> +static struct page *bio_vec_get_real_page(const struct bio_vec *bv)
>> +{
>> + return bv->bv_page + (bv->bv_offset >> PAGE_SHIFT);
>> +}
>> +static struct folio *bio_vec_get_folio(const struct bio_vec *bv)
>
> Please add a blank line between function delcarations.
>
>> +{
>> + return page_folio(bio_vec_get_real_page(bv));
>> +}
>> +
>> +static unsigned long bio_vec_get_folio_offset(const struct bio_vec *bv)
>> +{
>> + const struct page *real_page = bio_vec_get_real_page(bv);
>> + const struct folio *folio = page_folio(real_page);
>> +
>> + /*
>> + * The following ASCII chart is to show how the calculation is done.
>> + *
>> + * real_page
>> + * bv_page | bv_offset (10K)
>> + * | | |
>> + * v v v
>> + * | | | |
>> + * |<- F1 ->|<--- Folio 2 -->|
>> + * | result off |
>> + *
>> + * '|' is page boundary.
>> + *
>> + * The folio is the one containing that real_page.
>> + * We want the real offset inside that folio.
>> + *
>> + * The result offset we want is of two parts:
>> + * - the offset of the real page to the folio page
>> + * - the offset inside that real page
>> + *
>> + * We can not use offset_in_folio() which will give an incorrect result.
>> + * (2K instead of 6K, as folio 1 has a different order)
>
> Same comment here, as this is copied from the change log.
>
> Codewise this looks good to me, but those comments and terminology
> ("real page") make it confusing IMO.
This patch will be dropped though, as Christoph's physical address
solution is simpler overall, and he is not happy with us poking into the
bvec internals.
Thanks,
Qu
>
> Thanks.
>
>> + */
>> + ASSERT(&folio->page <= real_page);
>> + return (folio_page_idx(folio, real_page) << PAGE_SHIFT) +
>> + offset_in_page(bv->bv_offset);
>> +}
>> +
>> static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
>> struct btrfs_device *dev)
>> {
>> @@ -165,12 +217,6 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
>> struct bio_vec *bv = bio_first_bvec_all(&repair_bbio->bio);
>> int mirror = repair_bbio->mirror_num;
>>
>> - /*
>> - * We can only trigger this for data bio, which doesn't support larger
>> - * folios yet.
>> - */
>> - ASSERT(folio_order(page_folio(bv->bv_page)) == 0);
>> -
>> if (repair_bbio->bio.bi_status ||
>> !btrfs_data_csum_ok(repair_bbio, dev, 0, bv)) {
>> bio_reset(&repair_bbio->bio, NULL, REQ_OP_READ);
>> @@ -192,7 +238,8 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
>> btrfs_repair_io_failure(fs_info, btrfs_ino(inode),
>> repair_bbio->file_offset, fs_info->sectorsize,
>> repair_bbio->saved_iter.bi_sector << SECTOR_SHIFT,
>> - page_folio(bv->bv_page), bv->bv_offset, mirror);
>> + bio_vec_get_folio(bv), bio_vec_get_folio_offset(bv),
>> + mirror);
>> } while (mirror != fbio->bbio->mirror_num);
>>
>> done:
>> --
>> 2.49.0
>>
>>