public inbox for linux-btrfs@vger.kernel.org
* [PATCH 0/2] btrfs: add large folio support to read-repair and defrag
@ 2025-04-07  6:09 Qu Wenruo
  2025-04-07  6:09 ` [PATCH 1/2] btrfs: prepare btrfs_end_repair_bio() for larger data folios Qu Wenruo
  2025-04-07  6:09 ` [PATCH 2/2] btrfs: enable larger data folios support for defrag Qu Wenruo
  0 siblings, 2 replies; 6+ messages in thread
From: Qu Wenruo @ 2025-04-07  6:09 UTC (permalink / raw)
  To: linux-btrfs

These two patches were originally part of an ASSERT() cleanup series, but
since I was later able to enable large data folios for testing, a bug
was exposed in the first patch, thus these two have not yet been merged
into the for-next branch.

Now that the full fstests suite can run on my local large data folios
branch, I can finally verify the behavior of these two patches.

The first patch adds the extra calculation needed to convert a bio_vec
(whether single-page or multi-page) into the proper folio and the offset
inside that folio.

The second patch adds large folio support to defrag, which only needs
minor changes.

Qu Wenruo (2):
  btrfs: prepare btrfs_end_repair_bio() for larger data folios
  btrfs: enable larger data folios support for defrag

 fs/btrfs/bio.c    | 61 ++++++++++++++++++++++++++++++++++-----
 fs/btrfs/defrag.c | 72 +++++++++++++++++++++++++++--------------------
 2 files changed, 96 insertions(+), 37 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] btrfs: prepare btrfs_end_repair_bio() for larger data folios
  2025-04-07  6:09 [PATCH 0/2] btrfs: add large folio support to read-repair and defrag Qu Wenruo
@ 2025-04-07  6:09 ` Qu Wenruo
  2025-04-14 14:25   ` Filipe Manana
  2025-04-07  6:09 ` [PATCH 2/2] btrfs: enable larger data folios support for defrag Qu Wenruo
  1 sibling, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2025-04-07  6:09 UTC (permalink / raw)
  To: linux-btrfs

The function btrfs_end_repair_bio() has an ASSERT() making sure the
folio is page sized.

The reason is mostly related to the fact that later we pass a folio and
its offset into btrfs_repair_io_failure().
If we have larger folios passed in, later calculation of the folio and
its offset can go wrong, as we have extra offset to the bv_page.

Change the behavior by:

- Doing a proper folio grab
  Instead of just page_folio(bv_page), we should get the real page (as the
  bv_offset can be larger than page size), then call page_folio().

- Do extra folio offset calculation

                     real_page
   bv_page           |   bv_offset (10K)
   |                 |   |
   v                 v   v
   |        |        |       |
   |<- F1 ->|<--- Folio 2 -->|
            | result off |

   '|' is page boundary.

   The folio is the one containing that real_page.
   We want the real offset inside that folio.

   The result offset we want is of two parts:
   - the offset of the real page to the folio page
   - the offset inside that real page

   We can not use offset_in_folio() which will give an incorrect result.
   (2K instead of 6K, as folio 1 has a different order)

With these changes, now btrfs_end_repair_bio() is able to handle not
only large folios, but also multi-page bio vectors.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/bio.c | 61 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 54 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index 8c2eee1f1878..3140aa19aadc 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -156,6 +156,58 @@ static void btrfs_repair_done(struct btrfs_failed_bio *fbio)
 	}
 }
 
+/*
+ * Since a single bio_vec can merge multiple physically contiguous pages
+ * into one bio_vec entry, we can have the following case:
+ *
+ * bv_page             bv_offset
+ * v                   v
+ * |    |    |   |   |   |   |
+ *
+ * In that case we need to grab the real page where bv_offset is at.
+ */
+static struct page *bio_vec_get_real_page(const struct bio_vec *bv)
+{
+	return bv->bv_page + (bv->bv_offset >> PAGE_SHIFT);
+}
+static struct folio *bio_vec_get_folio(const struct bio_vec *bv)
+{
+	return page_folio(bio_vec_get_real_page(bv));
+}
+
+static unsigned long bio_vec_get_folio_offset(const struct bio_vec *bv)
+{
+	const struct page *real_page = bio_vec_get_real_page(bv);
+	const struct folio *folio = page_folio(real_page);
+
+	/*
+	 * The following ASCII chart is to show how the calculation is done.
+	 *
+	 *                   real_page
+	 * bv_page           |   bv_offset (10K)
+	 * |                 |   |
+	 * v                 v   v
+	 * |        |        |       |
+	 * |<- F1 ->|<--- Folio 2 -->|
+	 *          | result off |
+	 *
+	 * '|' is page boundary.
+	 *
+	 * The folio is the one containing that real_page.
+	 * We want the real offset inside that folio.
+	 *
+	 * The result offset we want is of two parts:
+	 * - the offset of the real page to the folio page
+	 * - the offset inside that real page
+	 *
+	 * We can not use offset_in_folio() which will give an incorrect result.
+	 * (2K instead of 6K, as folio 1 has a different order)
+	 */
+	ASSERT(&folio->page <= real_page);
+	return (folio_page_idx(folio, real_page) << PAGE_SHIFT) +
+		offset_in_page(bv->bv_offset);
+}
+
 static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
 				 struct btrfs_device *dev)
 {
@@ -165,12 +217,6 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
 	struct bio_vec *bv = bio_first_bvec_all(&repair_bbio->bio);
 	int mirror = repair_bbio->mirror_num;
 
-	/*
-	 * We can only trigger this for data bio, which doesn't support larger
-	 * folios yet.
-	 */
-	ASSERT(folio_order(page_folio(bv->bv_page)) == 0);
-
 	if (repair_bbio->bio.bi_status ||
 	    !btrfs_data_csum_ok(repair_bbio, dev, 0, bv)) {
 		bio_reset(&repair_bbio->bio, NULL, REQ_OP_READ);
@@ -192,7 +238,8 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
 		btrfs_repair_io_failure(fs_info, btrfs_ino(inode),
 				  repair_bbio->file_offset, fs_info->sectorsize,
 				  repair_bbio->saved_iter.bi_sector << SECTOR_SHIFT,
-				  page_folio(bv->bv_page), bv->bv_offset, mirror);
+				  bio_vec_get_folio(bv), bio_vec_get_folio_offset(bv),
+				  mirror);
 	} while (mirror != fbio->bbio->mirror_num);
 
 done:
-- 
2.49.0



* [PATCH 2/2] btrfs: enable larger data folios support for defrag
  2025-04-07  6:09 [PATCH 0/2] btrfs: add large folio support to read-repair and defrag Qu Wenruo
  2025-04-07  6:09 ` [PATCH 1/2] btrfs: prepare btrfs_end_repair_bio() for larger data folios Qu Wenruo
@ 2025-04-07  6:09 ` Qu Wenruo
  2025-04-14 14:38   ` Filipe Manana
  1 sibling, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2025-04-07  6:09 UTC (permalink / raw)
  To: linux-btrfs

Currently we rejects larger folios for defrag gracefully, but the
implementation itself is already mostly larger folios compatible.

There are several parts of defrag in btrfs:

- Extent map checking
  Aka, defrag_collect_targets(), which prepares a list of target ranges
  that should be defragged.

  This part is completely folio unrelated, thus it doesn't care about
  the folio size.

- Target folio preparation
  Aka, defrag_prepare_one_folio(), which lock and read (if needed) the
  target folio.

  Since folio read and lock are already supporting larger folios, this
  part needs only minor changes.

- Redirty the target range of the folio
  This is already done in a way supporting larger folios.

So it's pretty straightforward to enable larger folios for defrag:

- Do not reject larger folios for experimental builds
  This affects the larger folio check inside defrag_prepare_one_folio().

- Wait for ordered extents of the whole folio in
  defrag_prepare_one_folio()

- Lock the whole extent range for all involved folios in
  defrag_one_range()

- Allow the folios[] array to be partially empty
  Since we can have larger folios, folios[] will not always be full.

  This affects:
  * How to allocate folios in defrag_one_range()
    Now we can not use page index, but use the end position of the folio
    as an iterator.

  * How to free the folios[] array
    If we hit an empty slot, it means we have larger folios and already
    hit the end of the array.

  * How to mark the range dirty
    Instead of use page index directly, we have to go through each
    folio.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/defrag.c | 72 +++++++++++++++++++++++++++--------------------
 1 file changed, 42 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index d4310d93f532..f2fa8b5c64b5 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -857,13 +857,14 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
 {
 	struct address_space *mapping = inode->vfs_inode.i_mapping;
 	gfp_t mask = btrfs_alloc_write_mask(mapping);
-	u64 page_start = (u64)index << PAGE_SHIFT;
-	u64 page_end = page_start + PAGE_SIZE - 1;
+	u64 folio_start;
+	u64 folio_end;
 	struct extent_state *cached_state = NULL;
 	struct folio *folio;
 	int ret;
 
 again:
+	/* TODO: Add order fgp order flags when larger folios are fully enabled. */
 	folio = __filemap_get_folio(mapping, index,
 				    FGP_LOCK | FGP_ACCESSED | FGP_CREAT, mask);
 	if (IS_ERR(folio))
@@ -871,13 +872,16 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
 
 	/*
 	 * Since we can defragment files opened read-only, we can encounter
-	 * transparent huge pages here (see CONFIG_READ_ONLY_THP_FOR_FS). We
-	 * can't do I/O using huge pages yet, so return an error for now.
+	 * transparent huge pages here (see CONFIG_READ_ONLY_THP_FOR_FS).
+	 *
+	 * The IO for such larger folios are not fully tested, thus return
+	 * an error to reject such folios unless it's an experimental build.
+	 *
 	 * Filesystem transparent huge pages are typically only used for
 	 * executables that explicitly enable them, so this isn't very
 	 * restrictive.
 	 */
-	if (folio_test_large(folio)) {
+	if (!IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL) && folio_test_large(folio)) {
 		folio_unlock(folio);
 		folio_put(folio);
 		return ERR_PTR(-ETXTBSY);
@@ -890,13 +894,15 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
 		return ERR_PTR(ret);
 	}
 
+	folio_start = folio_pos(folio);
+	folio_end = folio_pos(folio) + folio_size(folio) - 1;
 	/* Wait for any existing ordered extent in the range */
 	while (1) {
 		struct btrfs_ordered_extent *ordered;
 
-		lock_extent(&inode->io_tree, page_start, page_end, &cached_state);
-		ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_SIZE);
-		unlock_extent(&inode->io_tree, page_start, page_end,
+		lock_extent(&inode->io_tree, folio_start, folio_end, &cached_state);
+		ordered = btrfs_lookup_ordered_range(inode, folio_start, folio_size(folio));
+		unlock_extent(&inode->io_tree, folio_start, folio_end,
 			      &cached_state);
 		if (!ordered)
 			break;
@@ -1162,13 +1168,7 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
 	struct extent_changeset *data_reserved = NULL;
 	const u64 start = target->start;
 	const u64 len = target->len;
-	unsigned long last_index = (start + len - 1) >> PAGE_SHIFT;
-	unsigned long start_index = start >> PAGE_SHIFT;
-	unsigned long first_index = folios[0]->index;
 	int ret = 0;
-	int i;
-
-	ASSERT(last_index - first_index + 1 <= nr_pages);
 
 	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, len);
 	if (ret < 0)
@@ -1179,10 +1179,17 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
 	set_extent_bit(&inode->io_tree, start, start + len - 1,
 		       EXTENT_DELALLOC | EXTENT_DEFRAG, cached_state);
 
-	/* Update the page status */
-	for (i = start_index - first_index; i <= last_index - first_index; i++) {
-		folio_clear_checked(folios[i]);
-		btrfs_folio_clamp_set_dirty(fs_info, folios[i], start, len);
+	/*
+	 * Update the page status.
+	 * Due to possible larger folios, we have to check all folios one by one.
+	 * And the btrfs_folio_clamp_*() helpers can handle ranges out of the
+	 * folio cases well.
+	 */
+	for (int i = 0; i < nr_pages && folios[i]; i++) {
+		struct folio *folio = folios[i];
+
+		btrfs_folio_clamp_clear_checked(fs_info, folio, start, len);
+		btrfs_folio_clamp_set_dirty(fs_info, folio, start, len);
 	}
 	btrfs_delalloc_release_extents(inode, len);
 	extent_changeset_free(data_reserved);
@@ -1200,9 +1207,9 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 	LIST_HEAD(target_list);
 	struct folio **folios;
 	const u32 sectorsize = inode->root->fs_info->sectorsize;
-	u64 last_index = (start + len - 1) >> PAGE_SHIFT;
-	u64 start_index = start >> PAGE_SHIFT;
-	unsigned int nr_pages = last_index - start_index + 1;
+	u64 cur = start;
+	const unsigned int nr_pages = ((start + len - 1) >> PAGE_SHIFT) -
+				      (start >> PAGE_SHIFT) + 1;
 	int ret = 0;
 	int i;
 
@@ -1214,21 +1221,25 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 		return -ENOMEM;
 
 	/* Prepare all pages */
-	for (i = 0; i < nr_pages; i++) {
-		folios[i] = defrag_prepare_one_folio(inode, start_index + i);
+	for (i = 0; cur < start + len && i < nr_pages; i++) {
+		folios[i] = defrag_prepare_one_folio(inode, cur >> PAGE_SHIFT);
 		if (IS_ERR(folios[i])) {
 			ret = PTR_ERR(folios[i]);
-			nr_pages = i;
+			folios[i] = NULL;
 			goto free_folios;
 		}
+		cur = folio_pos(folios[i]) + folio_size(folios[i]);
 	}
-	for (i = 0; i < nr_pages; i++)
+	for (i = 0; i < nr_pages; i++) {
+		if (!folios[i])
+			break;
 		folio_wait_writeback(folios[i]);
+	}
 
-	/* Lock the pages range */
-	lock_extent(&inode->io_tree, start_index << PAGE_SHIFT,
-		    (last_index << PAGE_SHIFT) + PAGE_SIZE - 1,
+	/* Lock the folios[] range */
+	lock_extent(&inode->io_tree, folio_pos(folios[0]), cur - 1,
 		    &cached_state);
+
 	/*
 	 * Now we have a consistent view about the extent map, re-check
 	 * which range really needs to be defragged.
@@ -1254,11 +1265,12 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
 		kfree(entry);
 	}
 unlock_extent:
-	unlock_extent(&inode->io_tree, start_index << PAGE_SHIFT,
-		      (last_index << PAGE_SHIFT) + PAGE_SIZE - 1,
+	unlock_extent(&inode->io_tree, folio_pos(folios[0]), cur - 1,
 		      &cached_state);
 free_folios:
 	for (i = 0; i < nr_pages; i++) {
+		if (!folios[i])
+			break;
 		folio_unlock(folios[i]);
 		folio_put(folios[i]);
 	}
-- 
2.49.0



* Re: [PATCH 1/2] btrfs: prepare btrfs_end_repair_bio() for larger data folios
  2025-04-07  6:09 ` [PATCH 1/2] btrfs: prepare btrfs_end_repair_bio() for larger data folios Qu Wenruo
@ 2025-04-14 14:25   ` Filipe Manana
  2025-04-14 22:58     ` Qu Wenruo
  0 siblings, 1 reply; 6+ messages in thread
From: Filipe Manana @ 2025-04-14 14:25 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Apr 7, 2025 at 7:09 AM Qu Wenruo <wqu@suse.com> wrote:
>
> The function btrfs_end_repair_bio() has an ASSERT() making sure the
> folio is page sized.
>
> The reason is mostly related to the fact that later we pass a folio and
> its offset into btrfs_repair_io_failure().
> If we have larger folios passed in, later calculation of the folio and
> its offset can go wrong, as we have extra offset to the bv_page.
>
> Change the behavior by:
>
> - Doing a proper folio grab
>   Instead of just page_folio(bv_page), we should get the real page (as the
>   bv_offset can be larger than page size), then call page_folio().
>
> - Do extra folio offset calculation
>
>                      real_page
>    bv_page           |   bv_offset (10K)
>    |                 |   |
>    v                 v   v
>    |        |        |       |
>    |<- F1 ->|<--- Folio 2 -->|
>             | result off |
>
>    '|' is page boundary.
>
>    The folio is the one containing that real_page.
>    We want the real offset inside that folio.
>
>    The result offset we want is of two parts:
>    - the offset of the real page to the folio page

"to the folio page", this is confusing for me. Isn't it the offset of
the page inside the folio?
I.e. "the offset of the real page inside the folio"

Also, this terminology "real page", does it come from somewhere?
Aren't all pages "real"?

>    - the offset inside that real page
>
>    We can not use offset_in_folio() which will give an incorrect result.
>    (2K instead of 6K, as folio 1 has a different order)

I don't think anyone can understand where that 2K and 6K come from.
The diagram doesn't mention how many pages per folio (folio order),
page sizes, and file offset of each folio.
So mentioning that 2K and 6K without all the relevant information,
make it useless and confusing IMO.

>
> With these changes, now btrfs_end_repair_bio() is able to handle not
> only large folios, but also multi-page bio vectors.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/bio.c | 61 ++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 54 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
> index 8c2eee1f1878..3140aa19aadc 100644
> --- a/fs/btrfs/bio.c
> +++ b/fs/btrfs/bio.c
> @@ -156,6 +156,58 @@ static void btrfs_repair_done(struct btrfs_failed_bio *fbio)
>         }
>  }
>
> +/*
> + * Since a single bio_vec can merge multiple physically contiguous pages
> + * into one bio_vec entry, we can have the following case:
> + *
> + * bv_page             bv_offset
> + * v                   v
> + * |    |    |   |   |   |   |
> + *
> + * In that case we need to grab the real page where bv_offset is at.
> + */
> +static struct page *bio_vec_get_real_page(const struct bio_vec *bv)
> +{
> +       return bv->bv_page + (bv->bv_offset >> PAGE_SHIFT);
> +}
> +static struct folio *bio_vec_get_folio(const struct bio_vec *bv)

Please add a blank line between function delcarations.

> +{
> +       return page_folio(bio_vec_get_real_page(bv));
> +}
> +
> +static unsigned long bio_vec_get_folio_offset(const struct bio_vec *bv)
> +{
> +       const struct page *real_page = bio_vec_get_real_page(bv);
> +       const struct folio *folio = page_folio(real_page);
> +
> +       /*
> +        * The following ASCII chart is to show how the calculation is done.
> +        *
> +        *                   real_page
> +        * bv_page           |   bv_offset (10K)
> +        * |                 |   |
> +        * v                 v   v
> +        * |        |        |       |
> +        * |<- F1 ->|<--- Folio 2 -->|
> +        *          | result off |
> +        *
> +        * '|' is page boundary.
> +        *
> +        * The folio is the one containing that real_page.
> +        * We want the real offset inside that folio.
> +        *
> +        * The result offset we want is of two parts:
> +        * - the offset of the real page to the folio page
> +        * - the offset inside that real page
> +        *
> +        * We can not use offset_in_folio() which will give an incorrect result.
> +        * (2K instead of 6K, as folio 1 has a different order)

Same comment here, as this is copied from the change log.

Codewise this looks good to me, but those comments and terminology
("real page") make it confusing IMO.

Thanks.

> +        */
> +       ASSERT(&folio->page <= real_page);
> +       return (folio_page_idx(folio, real_page) << PAGE_SHIFT) +
> +               offset_in_page(bv->bv_offset);
> +}
> +
>  static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
>                                  struct btrfs_device *dev)
>  {
> @@ -165,12 +217,6 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
>         struct bio_vec *bv = bio_first_bvec_all(&repair_bbio->bio);
>         int mirror = repair_bbio->mirror_num;
>
> -       /*
> -        * We can only trigger this for data bio, which doesn't support larger
> -        * folios yet.
> -        */
> -       ASSERT(folio_order(page_folio(bv->bv_page)) == 0);
> -
>         if (repair_bbio->bio.bi_status ||
>             !btrfs_data_csum_ok(repair_bbio, dev, 0, bv)) {
>                 bio_reset(&repair_bbio->bio, NULL, REQ_OP_READ);
> @@ -192,7 +238,8 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
>                 btrfs_repair_io_failure(fs_info, btrfs_ino(inode),
>                                   repair_bbio->file_offset, fs_info->sectorsize,
>                                   repair_bbio->saved_iter.bi_sector << SECTOR_SHIFT,
> -                                 page_folio(bv->bv_page), bv->bv_offset, mirror);
> +                                 bio_vec_get_folio(bv), bio_vec_get_folio_offset(bv),
> +                                 mirror);
>         } while (mirror != fbio->bbio->mirror_num);
>
>  done:
> --
> 2.49.0
>
>


* Re: [PATCH 2/2] btrfs: enable larger data folios support for defrag
  2025-04-07  6:09 ` [PATCH 2/2] btrfs: enable larger data folios support for defrag Qu Wenruo
@ 2025-04-14 14:38   ` Filipe Manana
  0 siblings, 0 replies; 6+ messages in thread
From: Filipe Manana @ 2025-04-14 14:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Apr 7, 2025 at 7:10 AM Qu Wenruo <wqu@suse.com> wrote:
>
> Currently we rejects larger folios for defrag gracefully, but the

rejects -> reject

> implementation itself is already mostly larger folios compatible.
>
> There are several parts of defrag in btrfs:
>
> - Extent map checking
>   Aka, defrag_collect_targets(), which prepares a list of target ranges
>   that should be defragged.
>
>   This part is completely folio unrelated, thus it doesn't care about
>   the folio size.
>
> - Target folio preparation
>   Aka, defrag_prepare_one_folio(), which lock and read (if needed) the
>   target folio.
>
>   Since folio read and lock are already supporting larger folios, this
>   part needs only minor changes.
>
> - Redirty the target range of the folio
>   This is already done in a way supporting larger folios.
>
> So it's pretty straightforward to enable larger folios for defrag:
>
> - Do not reject larger folios for experimental builds
>   This affects the larger folio check inside defrag_prepare_one_folio().
>
> - Wait for ordered extents of the whole folio in
>   defrag_prepare_one_folio()
>
> - Lock the whole extent range for all involved folios in
>   defrag_one_range()
>
> - Allow the folios[] array to be partially empty
>   Since we can have larger folios, folios[] will not always be full.
>
>   This affects:
>   * How to allocate folios in defrag_one_range()
>     Now we can not use page index, but use the end position of the folio
>     as an iterator.
>
>   * How to free the folios[] array
>     If we hit an empty slot, it means we have larger folios and already
>     hit the end of the array.
>
>   * How to mark the range dirty
>     Instead of use page index directly, we have to go through each
>     folio.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/defrag.c | 72 +++++++++++++++++++++++++++--------------------
>  1 file changed, 42 insertions(+), 30 deletions(-)
>
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index d4310d93f532..f2fa8b5c64b5 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -857,13 +857,14 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
>  {
>         struct address_space *mapping = inode->vfs_inode.i_mapping;
>         gfp_t mask = btrfs_alloc_write_mask(mapping);
> -       u64 page_start = (u64)index << PAGE_SHIFT;
> -       u64 page_end = page_start + PAGE_SIZE - 1;
> +       u64 folio_start;
> +       u64 folio_end;
>         struct extent_state *cached_state = NULL;
>         struct folio *folio;
>         int ret;
>
>  again:
> +       /* TODO: Add order fgp order flags when larger folios are fully enabled. */
>         folio = __filemap_get_folio(mapping, index,
>                                     FGP_LOCK | FGP_ACCESSED | FGP_CREAT, mask);
>         if (IS_ERR(folio))
> @@ -871,13 +872,16 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
>
>         /*
>          * Since we can defragment files opened read-only, we can encounter
> -        * transparent huge pages here (see CONFIG_READ_ONLY_THP_FOR_FS). We
> -        * can't do I/O using huge pages yet, so return an error for now.
> +        * transparent huge pages here (see CONFIG_READ_ONLY_THP_FOR_FS).
> +        *
> +        * The IO for such larger folios are not fully tested, thus return

"are not" -> "is not"

> +        * an error to reject such folios unless it's an experimental build.
> +        *
>          * Filesystem transparent huge pages are typically only used for
>          * executables that explicitly enable them, so this isn't very
>          * restrictive.
>          */
> -       if (folio_test_large(folio)) {
> +       if (!IS_ENABLED(CONFIG_BTRFS_EXPERIMENTAL) && folio_test_large(folio)) {
>                 folio_unlock(folio);
>                 folio_put(folio);
>                 return ERR_PTR(-ETXTBSY);
> @@ -890,13 +894,15 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
>                 return ERR_PTR(ret);
>         }
>
> +       folio_start = folio_pos(folio);
> +       folio_end = folio_pos(folio) + folio_size(folio) - 1;
>         /* Wait for any existing ordered extent in the range */
>         while (1) {
>                 struct btrfs_ordered_extent *ordered;
>
> -               lock_extent(&inode->io_tree, page_start, page_end, &cached_state);
> -               ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_SIZE);
> -               unlock_extent(&inode->io_tree, page_start, page_end,
> +               lock_extent(&inode->io_tree, folio_start, folio_end, &cached_state);
> +               ordered = btrfs_lookup_ordered_range(inode, folio_start, folio_size(folio));
> +               unlock_extent(&inode->io_tree, folio_start, folio_end,
>                               &cached_state);
>                 if (!ordered)
>                         break;
> @@ -1162,13 +1168,7 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
>         struct extent_changeset *data_reserved = NULL;
>         const u64 start = target->start;
>         const u64 len = target->len;
> -       unsigned long last_index = (start + len - 1) >> PAGE_SHIFT;
> -       unsigned long start_index = start >> PAGE_SHIFT;
> -       unsigned long first_index = folios[0]->index;
>         int ret = 0;
> -       int i;
> -
> -       ASSERT(last_index - first_index + 1 <= nr_pages);
>
>         ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, len);
>         if (ret < 0)
> @@ -1179,10 +1179,17 @@ static int defrag_one_locked_target(struct btrfs_inode *inode,
>         set_extent_bit(&inode->io_tree, start, start + len - 1,
>                        EXTENT_DELALLOC | EXTENT_DEFRAG, cached_state);
>
> -       /* Update the page status */
> -       for (i = start_index - first_index; i <= last_index - first_index; i++) {
> -               folio_clear_checked(folios[i]);
> -               btrfs_folio_clamp_set_dirty(fs_info, folios[i], start, len);
> +       /*
> +        * Update the page status.
> +        * Due to possible larger folios, we have to check all folios one by one.
> +        * And the btrfs_folio_clamp_*() helpers can handle ranges out of the
> +        * folio cases well.
> +        */
> +       for (int i = 0; i < nr_pages && folios[i]; i++) {
> +               struct folio *folio = folios[i];
> +
> +               btrfs_folio_clamp_clear_checked(fs_info, folio, start, len);
> +               btrfs_folio_clamp_set_dirty(fs_info, folio, start, len);
>         }
>         btrfs_delalloc_release_extents(inode, len);
>         extent_changeset_free(data_reserved);
> @@ -1200,9 +1207,9 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
>         LIST_HEAD(target_list);
>         struct folio **folios;
>         const u32 sectorsize = inode->root->fs_info->sectorsize;
> -       u64 last_index = (start + len - 1) >> PAGE_SHIFT;
> -       u64 start_index = start >> PAGE_SHIFT;
> -       unsigned int nr_pages = last_index - start_index + 1;
> +       u64 cur = start;
> +       const unsigned int nr_pages = ((start + len - 1) >> PAGE_SHIFT) -
> +                                     (start >> PAGE_SHIFT) + 1;
>         int ret = 0;
>         int i;
>
> @@ -1214,21 +1221,25 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
>                 return -ENOMEM;
>
>         /* Prepare all pages */
> -       for (i = 0; i < nr_pages; i++) {
> -               folios[i] = defrag_prepare_one_folio(inode, start_index + i);
> +       for (i = 0; cur < start + len && i < nr_pages; i++) {

It's a good opportunity to declare 'i' in the loop instead, like in
the previous function.

> +               folios[i] = defrag_prepare_one_folio(inode, cur >> PAGE_SHIFT);
>                 if (IS_ERR(folios[i])) {
>                         ret = PTR_ERR(folios[i]);
> -                       nr_pages = i;
> +                       folios[i] = NULL;
>                         goto free_folios;
>                 }
> +               cur = folio_pos(folios[i]) + folio_size(folios[i]);
>         }
> -       for (i = 0; i < nr_pages; i++)
> +       for (i = 0; i < nr_pages; i++) {

Same here.

> +               if (!folios[i])
> +                       break;
>                 folio_wait_writeback(folios[i]);
> +       }
>
> -       /* Lock the pages range */
> -       lock_extent(&inode->io_tree, start_index << PAGE_SHIFT,
> -                   (last_index << PAGE_SHIFT) + PAGE_SIZE - 1,
> +       /* Lock the folios[] range */
> +       lock_extent(&inode->io_tree, folio_pos(folios[0]), cur - 1,
>                     &cached_state);
> +

This new line seems accidental.

>         /*
>          * Now we have a consistent view about the extent map, re-check
>          * which range really needs to be defragged.
> @@ -1254,11 +1265,12 @@ static int defrag_one_range(struct btrfs_inode *inode, u64 start, u32 len,
>                 kfree(entry);
>         }
>  unlock_extent:
> -       unlock_extent(&inode->io_tree, start_index << PAGE_SHIFT,
> -                     (last_index << PAGE_SHIFT) + PAGE_SIZE - 1,
> +       unlock_extent(&inode->io_tree, folio_pos(folios[0]), cur - 1,
>                       &cached_state);
>  free_folios:
>         for (i = 0; i < nr_pages; i++) {

Same here.

With those minor changes:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

> +               if (!folios[i])
> +                       break;
>                 folio_unlock(folios[i]);
>                 folio_put(folios[i]);
>         }
> --
> 2.49.0
>
>


* Re: [PATCH 1/2] btrfs: prepare btrfs_end_repair_bio() for larger data folios
  2025-04-14 14:25   ` Filipe Manana
@ 2025-04-14 22:58     ` Qu Wenruo
  0 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2025-04-14 22:58 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs



On 2025/4/14 23:55, Filipe Manana wrote:
> On Mon, Apr 7, 2025 at 7:09 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> The function btrfs_end_repair_bio() has an ASSERT() making sure the
>> folio is page sized.
>>
>> The reason is mostly related to the fact that later we pass a folio and
>> its offset into btrfs_repair_io_failure().
>> If we have larger folios passed in, later calculation of the folio and
>> its offset can go wrong, as we have extra offset to the bv_page.
>>
>> Change the behavior by:
>>
>> - Doing a proper folio grab
>>    Instead of just page_folio(bv_page), we should get the real page (as the
>>    bv_offset can be larger than page size), then call page_folio().
>>
>> - Do extra folio offset calculation
>>
>>                       real_page
>>     bv_page           |   bv_offset (10K)
>>     |                 |   |
>>     v                 v   v
>>     |        |        |       |
>>     |<- F1 ->|<--- Folio 2 -->|
>>              | result off |
>>
>>     '|' is page boundary.
>>
>>     The folio is the one containing that real_page.
>>     We want the real offset inside that folio.
>>
>>     The result offset we want is of two parts:
>>     - the offset of the real page to the folio page
> 
> "to the folio page", this is confusing for me. Isn't it the offset of
> the page inside the folio?
> I.e. "the offset of the real page inside the folio"

That's correct.

> 
> Also, this terminology "real page", does it come from somewhere?
> Aren't all pages "real"?

The "real" part is in contrast to bv_page, which is not necessarily the 
page we're working on when the bvec is a multi-page one.

> 
>>     - the offset inside that real page
>>
>>     We can not use offset_in_folio() which will give an incorrect result.
>>     (2K instead of 6K, as folio 1 has a different order)
> 
> I don't think anyone can understand where that 2K and 6K come from.
> The diagram doesn't mention how many pages per folio (folio order),
> page sizes, and file offset of each folio.
> So mentioning that 2K and 6K without all the relevant information,
> make it useless and confusing IMO.
> 
>>
>> With these changes, now btrfs_end_repair_bio() is able to handle not
>> only large folios, but also multi-page bio vectors.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/bio.c | 61 ++++++++++++++++++++++++++++++++++++++++++++------
>>   1 file changed, 54 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
>> index 8c2eee1f1878..3140aa19aadc 100644
>> --- a/fs/btrfs/bio.c
>> +++ b/fs/btrfs/bio.c
>> @@ -156,6 +156,58 @@ static void btrfs_repair_done(struct btrfs_failed_bio *fbio)
>>          }
>>   }
>>
>> +/*
>> + * Since a single bio_vec can merge multiple physically contiguous pages
>> + * into one bio_vec entry, we can have the following case:
>> + *
>> + * bv_page             bv_offset
>> + * v                   v
>> + * |    |    |   |   |   |   |
>> + *
>> + * In that case we need to grab the real page where bv_offset is at.
>> + */
>> +static struct page *bio_vec_get_real_page(const struct bio_vec *bv)
>> +{
>> +       return bv->bv_page + (bv->bv_offset >> PAGE_SHIFT);
>> +}
>> +static struct folio *bio_vec_get_folio(const struct bio_vec *bv)
> 
> Please add a blank line between function declarations.
> 
>> +{
>> +       return page_folio(bio_vec_get_real_page(bv));
>> +}
>> +
>> +static unsigned long bio_vec_get_folio_offset(const struct bio_vec *bv)
>> +{
>> +       const struct page *real_page = bio_vec_get_real_page(bv);
>> +       const struct folio *folio = page_folio(real_page);
>> +
>> +       /*
>> +        * The following ASCII chart is to show how the calculation is done.
>> +        *
>> +        *                   real_page
>> +        * bv_page           |   bv_offset (10K)
>> +        * |                 |   |
>> +        * v                 v   v
>> +        * |        |        |       |
>> +        * |<- F1 ->|<--- Folio 2 -->|
>> +        *          | result off |
>> +        *
>> +        * '|' is page boundary.
>> +        *
>> +        * The folio is the one containing that real_page.
>> +        * We want the real offset inside that folio.
>> +        *
>> +        * The result offset we want is of two parts:
>> +        * - the offset of the real page to the folio page
>> +        * - the offset inside that real page
>> +        *
>> +        * We can not use offset_in_folio() which will give an incorrect result.
>> +        * (2K instead of 6K, as folio 1 has a different order)
> 
> Same comment here, as this is copied from the change log.
> 
> Codewise this looks good to me, but those comments and terminology
> ("real page") make it confusing IMO.

This patch will be dropped anyway, since Christoph's physical address 
based solution is simpler overall, and he is not happy with us poking 
into the bvec internals.

Thanks,
Qu

> 
> Thanks.
> 
>> +        */
>> +       ASSERT(&folio->page <= real_page);
>> +       return (folio_page_idx(folio, real_page) << PAGE_SHIFT) +
>> +               offset_in_page(bv->bv_offset);
>> +}
>> +
>>   static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
>>                                   struct btrfs_device *dev)
>>   {
>> @@ -165,12 +217,6 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
>>          struct bio_vec *bv = bio_first_bvec_all(&repair_bbio->bio);
>>          int mirror = repair_bbio->mirror_num;
>>
>> -       /*
>> -        * We can only trigger this for data bio, which doesn't support larger
>> -        * folios yet.
>> -        */
>> -       ASSERT(folio_order(page_folio(bv->bv_page)) == 0);
>> -
>>          if (repair_bbio->bio.bi_status ||
>>              !btrfs_data_csum_ok(repair_bbio, dev, 0, bv)) {
>>                  bio_reset(&repair_bbio->bio, NULL, REQ_OP_READ);
>> @@ -192,7 +238,8 @@ static void btrfs_end_repair_bio(struct btrfs_bio *repair_bbio,
>>                  btrfs_repair_io_failure(fs_info, btrfs_ino(inode),
>>                                    repair_bbio->file_offset, fs_info->sectorsize,
>>                                    repair_bbio->saved_iter.bi_sector << SECTOR_SHIFT,
>> -                                 page_folio(bv->bv_page), bv->bv_offset, mirror);
>> +                                 bio_vec_get_folio(bv), bio_vec_get_folio_offset(bv),
>> +                                 mirror);
>>          } while (mirror != fbio->bbio->mirror_num);
>>
>>   done:
>> --
>> 2.49.0
>>
>>



Thread overview: 6+ messages:
2025-04-07  6:09 [PATCH 0/2] btrfs: add large folio support to read-repair and defrag Qu Wenruo
2025-04-07  6:09 ` [PATCH 1/2] btrfs: prepare btrfs_end_repair_bio() for larger data folios Qu Wenruo
2025-04-14 14:25   ` Filipe Manana
2025-04-14 22:58     ` Qu Wenruo
2025-04-07  6:09 ` [PATCH 2/2] btrfs: enable larger data folios support for defrag Qu Wenruo
2025-04-14 14:38   ` Filipe Manana
