[PATCH 0/3] btrfs: enable large folios for data reloc inodes

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

* [PATCH 0/3] btrfs: enable large folios for data reloc inodes
@ 2025-07-16 21:14 Qu Wenruo
  2025-07-16 21:14 ` [PATCH 1/3] btrfs: reloc: unconditionally invalidate the page cache for each cluster Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Qu Wenruo @ 2025-07-16 21:14 UTC (permalink / raw)
  To: linux-btrfs

Although large data folios are enabled for experimental builds, data
reloc inodes are excluded due to the cluster folio handling are still
done in fixed page size.

But data reloc inodes fit large folios better than regular inodes, as
each relocation cluster is one or more file extents that are contiguous
in their logical addresses.

This series will enable large folios for data reloc inodes by:

- Simplify the handling of cluster boundary folio invalidation
  This patch has been sent to the list already, but it's the dependency
  of the series.

- Enhance the output of btrfs_subpage_assert() 
  As it easily caught a lot of bugs in the relocation code that are
  still using PAGE_SIZE

- Enable large folios for data reloc inodes
  The last patch that remove the PAGE_SIZE usage and fixed PAGE_SIZE
  itearation of data reloc inodes.

Qu Wenruo (3):
  btrfs: reloc: unconditionally invalidate the page cache for each
    cluster
  btrfs: output more info when btrfs_subpage_assert() failed
  btrfs: enable large data folios for data reloc inode

 fs/btrfs/btrfs_inode.h |  4 ---
 fs/btrfs/relocation.c  | 82 +++++++++++-------------------------------
 fs/btrfs/subpage.c     |  5 +--
 3 files changed, 24 insertions(+), 67 deletions(-)

-- 
2.50.0


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH 1/3] btrfs: reloc: unconditionally invalidate the page cache for each cluster
  2025-07-16 21:14 [PATCH 0/3] btrfs: enable large folios for data reloc inodes Qu Wenruo
@ 2025-07-16 21:14 ` Qu Wenruo
  2025-07-16 21:14 ` [PATCH 2/3] btrfs: output more info when btrfs_subpage_assert() failed Qu Wenruo
  2025-07-16 21:14 ` [PATCH 3/3] btrfs: enable large data folios for data reloc inode Qu Wenruo
  2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2025-07-16 21:14 UTC (permalink / raw)
  To: linux-btrfs

Commit 9d9ea1e68a05 ("btrfs: subpage: fix relocation potentially
overwriting last page data") fixed a bug when relocating data block
groups for subpage cases.

However for the incoming large folios for data reloc inode, we can hit
the same situation where block size is the same as page size, but the
folio we got is still larger than a block.

In that case, the old subpage specific check is no longer reliable.

Here we have to enhance the handling by:

- Unconditionally invalidate the page cache for the current cluster
  We set the @flush to true so that any dirty folios are properly
  written back first.

  And this time instead of dropping the whole page cache, just drop the
  range covered by the current cluster.

  This will bring some minor performance drop, as for a large folio, the
  heading half will be read twice (read by previous cluster, then
  invalidated, then read again by the current cluster).

  However that is required to support large folios, and this gets rid of
  the kinda tricky manual uptodate flag clearing for each block.

- Remove the special handling of writing back the whole page cache
  filemap_invalidate_inode() handles the write back already, and since
  we're invalidating all pages in the range, we no longer need to
  manually clear the uptodate flags for involved blocks.

  Thus there is no need to manually write back the whole page cache.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/relocation.c | 58 ++++++-------------------------------------
 1 file changed, 8 insertions(+), 50 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 8a71cffb4dfb..fa93ba40278d 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2666,66 +2666,24 @@ static noinline_for_stack int prealloc_file_extent_cluster(struct reloc_control
 	u64 num_bytes;
 	int nr;
 	int ret = 0;
-	u64 i_size = i_size_read(&inode->vfs_inode);
 	u64 prealloc_start = cluster->start - offset;
 	u64 prealloc_end = cluster->end - offset;
 	u64 cur_offset = prealloc_start;
 
 	/*
-	 * For subpage case, previous i_size may not be aligned to PAGE_SIZE.
-	 * This means the range [i_size, PAGE_END + 1) is filled with zeros by
-	 * btrfs_do_readpage() call of previously relocated file cluster.
+	 * For blocksize < folio size case (either bs < page size or large folios),
+	 * beyond i_size, all blocks are filled with zero.
 	 *
-	 * If the current cluster starts in the above range, btrfs_do_readpage()
+	 * If the current cluster covers above range, btrfs_do_readpage()
 	 * will skip the read, and relocate_one_folio() will later writeback
 	 * the padding zeros as new data, causing data corruption.
 	 *
-	 * Here we have to manually invalidate the range (i_size, PAGE_END + 1).
+	 * Here we have to invalidate the cache covering our cluster.
 	 */
-	if (!PAGE_ALIGNED(i_size)) {
-		struct address_space *mapping = inode->vfs_inode.i_mapping;
-		struct btrfs_fs_info *fs_info = inode->root->fs_info;
-		const u32 sectorsize = fs_info->sectorsize;
-		struct folio *folio;
-
-		ASSERT(sectorsize < PAGE_SIZE);
-		ASSERT(IS_ALIGNED(i_size, sectorsize));
-
-		/*
-		 * Subpage can't handle page with DIRTY but without UPTODATE
-		 * bit as it can lead to the following deadlock:
-		 *
-		 * btrfs_read_folio()
-		 * | Page already *locked*
-		 * |- btrfs_lock_and_flush_ordered_range()
-		 *    |- btrfs_start_ordered_extent()
-		 *       |- extent_write_cache_pages()
-		 *          |- lock_page()
-		 *             We try to lock the page we already hold.
-		 *
-		 * Here we just writeback the whole data reloc inode, so that
-		 * we will be ensured to have no dirty range in the page, and
-		 * are safe to clear the uptodate bits.
-		 *
-		 * This shouldn't cause too much overhead, as we need to write
-		 * the data back anyway.
-		 */
-		ret = filemap_write_and_wait(mapping);
-		if (ret < 0)
-			return ret;
-
-		folio = filemap_lock_folio(mapping, i_size >> PAGE_SHIFT);
-		/*
-		 * If page is freed we don't need to do anything then, as we
-		 * will re-read the whole page anyway.
-		 */
-		if (!IS_ERR(folio)) {
-			btrfs_subpage_clear_uptodate(fs_info, folio, i_size,
-					round_up(i_size, PAGE_SIZE) - i_size);
-			folio_unlock(folio);
-			folio_put(folio);
-		}
-	}
+	ret = filemap_invalidate_inode(&inode->vfs_inode, true, prealloc_start,
+				       prealloc_end);
+	if (ret < 0)
+		return ret;
 
 	BUG_ON(cluster->start != cluster->boundary[0]);
 	ret = btrfs_alloc_data_chunk_ondemand(inode,
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH 2/3] btrfs: output more info when btrfs_subpage_assert() failed
  2025-07-16 21:14 [PATCH 0/3] btrfs: enable large folios for data reloc inodes Qu Wenruo
  2025-07-16 21:14 ` [PATCH 1/3] btrfs: reloc: unconditionally invalidate the page cache for each cluster Qu Wenruo
@ 2025-07-16 21:14 ` Qu Wenruo
  2025-07-16 21:14 ` [PATCH 3/3] btrfs: enable large data folios for data reloc inode Qu Wenruo
  2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2025-07-16 21:14 UTC (permalink / raw)
  To: linux-btrfs

The function btrfs_subpage_assert() is a very common utilized assert to
make sure the range passed in is correct inside the folio.

And when some code is not properly subpage/large folio compatible
btrfs_subpage_assert() will be the first to be triggered.
E.g. when I incorrectly enabled large folios for data reloc inodes, it
immediately triggered btrfs_subpage_assert().

In that case, outputting all the involved members will be very helpful,
this includes:

- start
- len
- folio position inside the mapping
- folio size

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/subpage.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 2c5c9262b1a8..c9b3821957f7 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -186,8 +186,9 @@ static void btrfs_subpage_assert(const struct btrfs_fs_info *fs_info,
 	 * unmapped page like dummy extent buffer pages.
 	 */
 	if (folio->mapping)
-		ASSERT(folio_pos(folio) <= start &&
-		       start + len <= folio_end(folio));
+		ASSERT(folio_pos(folio) <= start && start + len <= folio_end(folio),
+		       "start=%llu len=%u folio_pos=%llu folio_size=%zu",
+		       start, len, folio_pos(folio), folio_size(folio));
 }
 
 #define subpage_calc_start_bit(fs_info, folio, name, start, len)	\
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH 3/3] btrfs: enable large data folios for data reloc inode
  2025-07-16 21:14 [PATCH 0/3] btrfs: enable large folios for data reloc inodes Qu Wenruo
  2025-07-16 21:14 ` [PATCH 1/3] btrfs: reloc: unconditionally invalidate the page cache for each cluster Qu Wenruo
  2025-07-16 21:14 ` [PATCH 2/3] btrfs: output more info when btrfs_subpage_assert() failed Qu Wenruo
@ 2025-07-16 21:14 ` Qu Wenruo
  2 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2025-07-16 21:14 UTC (permalink / raw)
  To: linux-btrfs

For data reloc inodes, they are a special type of inodes that is not
exposed to user space, and are only utilized during data block groups
relocation.

They do not go under regular read-write operations, but have its file
extents manually created to have the same layout of a block group, then
its content is read from the original block group, and written back to
the new location which is in a new block group.

Previously all those handling are done in page unit, and commit
c2832898126f ("btrfs: make relocate_one_page() handle subpage case")
makes the handling to handle subpage blocks.

On the other hand, data reloc inodes are a perfect match for large data
folios, as each relocation cluster represents one or more data extents
that are contiguous in their logical addresses.

This patch enables large folios for data reloc inodes by:

- Remove the special handling of data reloc inodes when setting folio
  order

- Change relocate_one_folio() to return the file offset of the next
  folio
  Originally it's designed to handle fixed page sized blocks, but with
  large folios, we can handle a large folio, thus we have to return the
  end of the current folio.

- Remove the warning on folio_order()

- Use folio_size() to replace fixed PAGE_SIZE usage

- Use file_offset as iterator inside relocate_file_extent_cluster

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/btrfs_inode.h |  4 ----
 fs/btrfs/relocation.c  | 24 +++++++++++++-----------
 2 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index c9f73fc3e3c8..024c7bc7cf63 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -530,10 +530,6 @@ static inline void btrfs_set_inode_mapping_order(struct btrfs_inode *inode)
 	/* Metadata inode should not reach here. */
 	ASSERT(is_data_inode(inode));
 
-	/* For data reloc inode, it still requires page sized folio. */
-	if (unlikely(btrfs_is_data_reloc_root(inode->root)))
-		return;
-
 	/* We only allows BITS_PER_LONGS blocks for each bitmap. */
 #ifdef CONFIG_BTRFS_EXPERIMENTAL
 	mapping_set_folio_order_range(inode->vfs_inode.i_mapping, 0,
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index fa93ba40278d..60dd3971c3ae 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2772,13 +2772,15 @@ static u64 get_cluster_boundary_end(const struct file_extent_cluster *cluster,
 
 static int relocate_one_folio(struct reloc_control *rc,
 			      struct file_ra_state *ra,
-			      int *cluster_nr, pgoff_t index)
+			      int *cluster_nr, u64 *file_offset_ret)
 {
 	const struct file_extent_cluster *cluster = &rc->cluster;
 	struct inode *inode = rc->data_inode;
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
+	const u64 orig_file_offset = *file_offset_ret;
 	u64 offset = BTRFS_I(inode)->reloc_block_group_start;
 	const pgoff_t last_index = (cluster->end - offset) >> PAGE_SHIFT;
+	const pgoff_t index = orig_file_offset >> PAGE_SHIFT;
 	gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
 	struct folio *folio;
 	u64 folio_start;
@@ -2811,8 +2813,6 @@ static int relocate_one_folio(struct reloc_control *rc,
 			return PTR_ERR(folio);
 	}
 
-	WARN_ON(folio_order(folio));
-
 	if (folio_test_readahead(folio) && !use_rst)
 		page_cache_async_readahead(inode->i_mapping, ra, NULL,
 					   folio, last_index + 1 - index);
@@ -2841,7 +2841,7 @@ static int relocate_one_folio(struct reloc_control *rc,
 		goto release_folio;
 
 	folio_start = folio_pos(folio);
-	folio_end = folio_start + PAGE_SIZE - 1;
+	folio_end = folio_start + folio_size(folio) - 1;
 
 	/*
 	 * Start from the cluster, as for subpage case, the cluster can start
@@ -2889,7 +2889,8 @@ static int relocate_one_folio(struct reloc_control *rc,
 		 * EXTENT_BOUNDARY bit prevents current extent from being merged
 		 * with previous extent.
 		 */
-		if (in_range(cluster->boundary[*cluster_nr] - offset, folio_start, PAGE_SIZE)) {
+		if (in_range(cluster->boundary[*cluster_nr] - offset,
+			     folio_start, folio_size(folio))) {
 			u64 boundary_start = cluster->boundary[*cluster_nr] -
 						offset;
 			u64 boundary_end = boundary_start +
@@ -2919,6 +2920,7 @@ static int relocate_one_folio(struct reloc_control *rc,
 	btrfs_throttle(fs_info);
 	if (btrfs_should_cancel_balance(fs_info))
 		ret = -ECANCELED;
+	*file_offset_ret = folio_end + 1;
 	return ret;
 
 release_folio:
@@ -2932,8 +2934,7 @@ static int relocate_file_extent_cluster(struct reloc_control *rc)
 	struct inode *inode = rc->data_inode;
 	const struct file_extent_cluster *cluster = &rc->cluster;
 	u64 offset = BTRFS_I(inode)->reloc_block_group_start;
-	pgoff_t index;
-	pgoff_t last_index;
+	u64 cur_file_offset = cluster->start - offset;
 	struct file_ra_state *ra;
 	int cluster_nr = 0;
 	int ret = 0;
@@ -2955,10 +2956,11 @@ static int relocate_file_extent_cluster(struct reloc_control *rc)
 	if (ret)
 		goto out;
 
-	last_index = (cluster->end - offset) >> PAGE_SHIFT;
-	for (index = (cluster->start - offset) >> PAGE_SHIFT;
-	     index <= last_index && !ret; index++)
-		ret = relocate_one_folio(rc, ra, &cluster_nr, index);
+	while (cur_file_offset < cluster->end - offset) {
+		ret = relocate_one_folio(rc, ra, &cluster_nr, &cur_file_offset);
+		if (ret)
+			break;
+	}
 	if (ret == 0)
 		WARN_ON(cluster_nr != cluster->nr);
 out:
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-07-16 21:15 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-16 21:14 [PATCH 0/3] btrfs: enable large folios for data reloc inodes Qu Wenruo
2025-07-16 21:14 ` [PATCH 1/3] btrfs: reloc: unconditionally invalidate the page cache for each cluster Qu Wenruo
2025-07-16 21:14 ` [PATCH 2/3] btrfs: output more info when btrfs_subpage_assert() failed Qu Wenruo
2025-07-16 21:14 ` [PATCH 3/3] btrfs: enable large data folios for data reloc inode Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox