[PATCH v2 0/3] btrfs: enhancement to pass generic/563


* [PATCH v2 0/3] btrfs: enhancement to pass generic/563
@ 2025-02-12  2:52 Qu Wenruo
  2025-02-12  2:52 ` [PATCH v2 1/3] btrfs: introduce a read path dedicated extent lock helper Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Qu Wenruo @ 2025-02-12  2:52 UTC
  To: linux-btrfs

[CHANGELOG]
v2:
- Rebased to the latest for-next branch
  There is a data corruption read fix, which changed the timing of
  btrfs_lock_and_flush_ordered_range().

- Introduce a dedicated and smarter, read path specific extent range lock
  helper, lock_extents_for_read()
  It has all the extra subpage specific deadlock avoidance mechanisms,
  along with much better comments on which types of ordered extents
  need to be waited for, and which can be completely skipped.

The test case generic/563 fails on btrfs on aarch64 with 64K page size
and 4K fs block size, but passes on EXT4 and XFS.

The detailed reason is explained in the last patch; the TL;DR is that
btrfs does not handle block aligned buffered writes in an optimized way
for subpage cases (block size < page size).

The first patch addresses the deadlock-prone
btrfs_lock_and_flush_ordered_range() in read paths, by introducing a
deadlock-avoiding helper for read paths.

The second patch is a refactor in preparation for the new enhancement.

Finally, the last patch enables the enhancement, which makes generic/563
pass.

Qu Wenruo (3):
  btrfs: introduce a read path dedicated extent lock helper
  btrfs: make btrfs_do_readpage() to do block-by-block read
  btrfs: allow buffered write to avoid full page read if it's block
    aligned

 fs/btrfs/defrag.c       |   2 +-
 fs/btrfs/direct-io.c    |   2 +-
 fs/btrfs/extent_io.c    | 224 +++++++++++++++++++++++++++++++++++-----
 fs/btrfs/file.c         |   9 +-
 fs/btrfs/inode.c        |   4 +-
 fs/btrfs/ordered-data.c |  29 ++++--
 fs/btrfs/ordered-data.h |   3 +-
 7 files changed, 229 insertions(+), 44 deletions(-)

-- 
2.48.1



* [PATCH v2 1/3] btrfs: introduce a read path dedicated extent lock helper
  2025-02-12  2:52 [PATCH v2 0/3] btrfs: enhancement to pass generic/563 Qu Wenruo
@ 2025-02-12  2:52 ` Qu Wenruo
  2025-02-25 13:00   ` David Sterba
  2025-02-12  2:52 ` [PATCH v2 2/3] btrfs: make btrfs_do_readpage() to do block-by-block read Qu Wenruo
  2025-02-12  2:52 ` [PATCH v2 3/3] btrfs: allow buffered write to avoid full page read if it's block aligned Qu Wenruo
  2 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2025-02-12  2:52 UTC
  To: linux-btrfs

Currently we're using btrfs_lock_and_flush_ordered_range() for both
btrfs_read_folio() and btrfs_readahead(), but it has one critical
problem for future subpage enhancements:

- It will call btrfs_start_ordered_extent() to writeback the involved
  folios

  But remember we're calling btrfs_lock_and_flush_ordered_range() at
  read paths, meaning the folio is already locked by read path.

  If we really trigger writeback, this will lead to a deadlock, as
  writeback can not take the folio lock.

  Such a deadlock is currently prevented by the fact that btrfs always
  keeps a dirty folio uptodate as well.

  But that is not the common behavior, as XFS/EXT4 both allow a folio to
  be dirty without keeping the full folio uptodate.

  They allow block aligned buffered writes to dirty only the involved
  blocks inside the folio, without reading the full folio.

Instead of blindly calling btrfs_start_ordered_extent(), introduce a
newer helper, which is smarter in the following ways:

- Only wait for and flush the ordered extent if
  * The folio doesn't even have private set
  * Some blocks of the ordered extent are not uptodate

  This can happen in the following cases:
  * Folio writeback finished, then the folio got invalidated, but the
    OE has not yet finished
  * Direct IO

  We have to wait for the ordered extent, as it may contain
  to-be-inserted data checksums.
  Without waiting, our read will fail with a missing csum.

  But either way, the OE should not need any extra flush inside the
  locked folio range.

- Skip the ordered extent completely if
  * All the blocks are dirty
    This happens when the OE was created by the writeback of a previous
    folio.
    The writeback of our folio will never happen (we're holding the
    folio lock for read), nor will the OE finish.

    Thus we must skip the range.

  * All the blocks are uptodate
    This happens when the writeback finished, but the OE has not yet
    finished.

    Since the blocks are already uptodate, we can skip the OE range.
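
The two rule sets above can be modelled in userspace C roughly as
follows. This is only a sketch, not the kernel code: struct folio_model,
enum oe_action and classify_block() are made-up names, and the per-block
bool arrays stand in for the subpage dirty/uptodate bitmaps of a 16K
folio with 4K blocks.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: one 16K folio holding four 4K blocks. */
#define BLOCKS_PER_FOLIO 4

/* Hypothetical stand-in for a folio and its per-block subpage flags. */
struct folio_model {
	bool has_private;
	bool dirty[BLOCKS_PER_FOLIO];
	bool uptodate[BLOCKS_PER_FOLIO];
};

enum oe_action {
	OE_WAIT,	/* Must wait for the ordered extent. */
	OE_SKIP,	/* Safe to skip the ordered extent at this block. */
};

static enum oe_action classify_block(const struct folio_model *folio,
				     unsigned int block)
{
	assert(block < BLOCKS_PER_FOLIO);

	/* No private flag: the folio got invalidated, or direct IO. */
	if (!folio->has_private)
		return OE_WAIT;
	/* Dirty: writeback can not start while we hold the folio lock. */
	if (folio->dirty[block])
		return OE_SKIP;
	/* Uptodate: nothing needs to be read for this block. */
	if (folio->uptodate[block])
		return OE_SKIP;
	/* Not uptodate: the read needs the csum the OE will insert. */
	return OE_WAIT;
}
```

For a folio whose first two blocks are dirty and uptodate and whose last
two blocks are not uptodate, block 0 is classified as OE_SKIP and block 3
as OE_WAIT.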

The new helper, lock_extents_for_read(), loops over the target range as
follows:

1) Lock the full range

2) If there is no ordered extent in the remaining range, exit

3) If there is an ordered extent that we can skip
   Skip to the end of the OE, and continue checking
   We do not trigger writeback nor wait for the OE.

4) If there is an ordered extent that we can not skip
   Unlock the whole extent range, start and wait for the ordered
   extent, then restart from step 1.
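
The control flow of the four steps can be sketched in userspace C as
below. This is a simplified model under stated assumptions, not the
helper itself: struct oe_model, lookup_first_oe() and
lock_range_for_read_model() are hypothetical names, the skip decision is
precomputed in a flag, waiting is modelled by marking the OE finished,
and the lookup is assumed to return the lowest overlapping OE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an ordered extent. */
struct oe_model {
	uint64_t file_offset;
	uint64_t num_bytes;
	bool can_skip;	/* Precomputed result of the skip check. */
	bool finished;	/* Finished OEs no longer show up in lookups. */
};

/* Return the lowest unfinished OE overlapping [start, end], or NULL. */
static struct oe_model *lookup_first_oe(struct oe_model *oes, size_t nr,
					uint64_t start, uint64_t end)
{
	struct oe_model *found = NULL;

	for (size_t i = 0; i < nr; i++) {
		struct oe_model *oe = &oes[i];
		uint64_t oe_last = oe->file_offset + oe->num_bytes - 1;

		if (oe->finished || oe_last < start || oe->file_offset > end)
			continue;
		if (!found || oe->file_offset < found->file_offset)
			found = oe;
	}
	return found;
}

/* Walk [start, end] (inclusive) and return how many OEs were waited for. */
static unsigned int lock_range_for_read_model(struct oe_model *oes, size_t nr,
					      uint64_t start, uint64_t end)
{
	unsigned int waits = 0;
	uint64_t cur;

	assert(start <= end);
again:
	/* Step 1: the real helper locks the full extent range here. */
	cur = start;
	while (cur < end) {
		struct oe_model *oe = lookup_first_oe(oes, nr, cur, end);
		uint64_t oe_end;

		/* Step 2: no OE left in the remaining range. */
		if (!oe)
			break;
		/* Step 3: skip to the end of the OE, no wait, no flush. */
		if (oe->can_skip) {
			oe_end = oe->file_offset + oe->num_bytes;
			cur = oe_end < end + 1 ? oe_end : end + 1;
			continue;
		}
		/* Step 4: "unlock", wait for the OE, restart from step 1. */
		oe->finished = true;
		waits++;
		goto again;
	}
	return waits;
}
```

With a skippable OE at [16K, 24K) and a non-skippable one at [28K, 32K),
walking [16K, 32K) waits exactly once, for the second OE only.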

And also update btrfs_start_ordered_extent() to add two more parameters:
@nowriteback_start and @nowriteback_len, to prevent triggering flush for
a certain range.
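
As an illustration of the arithmetic, the following userspace sketch
computes which byte ranges of an ordered extent would still be flushed
once a no-writeback range is given. It mirrors the two comparisons used
in this patch, but struct byte_range and split_flush_ranges() are
hypothetical names, and the sketch assumes the no-writeback range
overlaps the ordered extent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type, both ends inclusive. */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Split the OE range [start, end] (inclusive) around the no-writeback
 * range [nw_start, nw_start + nw_len).
 *
 * Return the number of ranges (0 to 2) stored in @out.
 */
static size_t split_flush_ranges(uint64_t start, uint64_t end,
				 uint64_t nw_start, uint32_t nw_len,
				 struct byte_range out[2])
{
	size_t nr = 0;

	assert(start <= end);

	/* A zero length means no exclusion, flush the whole OE. */
	if (!nw_len) {
		out[nr++] = (struct byte_range){ start, end };
		return nr;
	}
	/* The part of the OE before the no-writeback range. */
	if (start < nw_start)
		out[nr++] = (struct byte_range){ start, nw_start - 1 };
	/* The part of the OE after the no-writeback range. */
	if (nw_start + nw_len < end)
		out[nr++] = (struct byte_range){ nw_start + nw_len, end };
	return nr;
}
```

For an OE at [28K, 32K) and a no-writeback range of [16K, 32K), no range
is left to flush. For an OE at [8K, 40K) with the same no-writeback
range, the two ranges left are [8K, 16K) and [32K, 40K).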

This will allow us to handle the following case properly in the future:

 16K page size, 4K btrfs block size:

 16K           20K             24K              28K            32K
 |/////////////////////////////|                |              |
 |<----------- OE 2----------->|                |<--- OE 1 --->|

 The folio has been written back before, thus we have an OE at
 [28K, 32K).
 Although the OE 1 finished its io, the OE is not yet removed from IO
 tree.
 Later the folio got invalidated, but OE still exists.

 And [16K, 24K) range is dirty and uptodate, caused by a block aligned
 buffered write (and future enhancements allowing btrfs to skip full
 folio read for such case).

 Furthermore, OE 2 is created covering range [16K, 24K) by the writeback
 of previous folio.

 Since the full folio is not uptodate, if we want to read the folio,
 the existing btrfs_lock_and_flush_ordered_range() will deadlock, by:

 btrfs_read_folio()
 | Folio 16K is already locked
 |- btrfs_lock_and_flush_ordered_range()
    |- btrfs_start_ordered_extent() for range [16K, 24K)
       |- filemap_fdatawrite_range() for range [16K, 24K)
          |- extent_write_cache_pages()
             folio_lock() on folio 16K, deadlock.

 But now we will have the following sequence:

 btrfs_read_folio()
 | Folio 16K is already locked
 |- lock_extents_for_read()
    |- can_skip_ordered_extent() for range [16K, 24K)
    |  Returned true, the range [16K, 24K) will be skipped.
    |- can_skip_ordered_extent() for range [28K, 32K)
    |  Returned false.
    |- btrfs_start_ordered_extent() for range [28K, 32K) with
       [16K, 32K) as no writeback range
       No writeback for folio 16K will be triggered.

 And there will be no more possible deadlock on the same folio.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/defrag.c       |   2 +-
 fs/btrfs/direct-io.c    |   2 +-
 fs/btrfs/extent_io.c    | 183 +++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/file.c         |   4 +-
 fs/btrfs/inode.c        |   4 +-
 fs/btrfs/ordered-data.c |  29 +++++--
 fs/btrfs/ordered-data.h |   3 +-
 7 files changed, 210 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 968dae953948..d1330c138054 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -902,7 +902,7 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
 			break;
 
 		folio_unlock(folio);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, 0, 0);
 		btrfs_put_ordered_extent(ordered);
 		folio_lock(folio);
 		/*
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index b5190a010205..c98db5058967 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -103,7 +103,7 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 			 */
 			if (writing ||
 			    test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags))
-				btrfs_start_ordered_extent(ordered);
+				btrfs_start_ordered_extent(ordered, 0, 0);
 			else
 				ret = nowait ? -EAGAIN : -ENOTBLK;
 			btrfs_put_ordered_extent(ordered);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 91b20dccef73..819d51c3ed57 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1075,6 +1075,185 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 	return 0;
 }
 
+/*
+ * Check if we can skip waiting for the @ordered extent covering the block
+ * at file pos @cur.
+ *
+ * Return true if we can skip to @next_ret. The caller needs to check
+ * the @next_ret value to make sure it covers the full range, before
+ * skipping the OE.
+ *
+ * Return false if we must wait for the ordered extent.
+ *
+ * @cur:	The start file offset that we have locked folio for read.
+ * @next_ret:	If we return true, this indicates the start of the next
+ *		range to check.
+ */
+static bool can_skip_one_ordered_range(struct btrfs_inode *binode,
+				       struct btrfs_ordered_extent *ordered,
+				       u64 cur, u64 *next_ret)
+{
+	const struct btrfs_fs_info *fs_info = binode->root->fs_info;
+	struct folio *folio;
+	const u32 blocksize = fs_info->sectorsize;
+	u64 range_len;
+	bool ret;
+
+	folio = filemap_get_folio(binode->vfs_inode.i_mapping,
+				  cur >> PAGE_SHIFT);
+
+	/*
+	 * We should have locked the folio(s) for range [start, end], thus
+	 * there must be a folio and it must be locked.
+	 */
+	ASSERT(!IS_ERR(folio));
+	ASSERT(folio_test_locked(folio));
+
+	/*
+	 * We have several cases for the folio and OE combination:
+	 *
+	 * 0) Folio has no private flag
+	 *    The OE has all its IO done but not yet finished, and folio got
+	 *    invalidated. Or direct IO.
+	 *
+	 * Have to wait for the OE to finish, as it may contain the
+	 * to-be-inserted data checksum.
+	 * Without the data checksum inserted into csum tree, read
+	 * will just fail with missing csum.
+	 */
+	if (!folio_test_private(folio)) {
+		ret = false;
+		goto out;
+	}
+	range_len = min(folio_pos(folio) + folio_size(folio),
+			ordered->file_offset + ordered->num_bytes) - cur;
+
+	/*
+	 * 1) The first block is DIRTY.
+	 *
+	 * This means the OE is created by some folio before us, but writeback
+	 * has not started.
+	 * We can and must skip the whole OE, because it will never start until
+	 * we finished our folio read and unlocked the folio.
+	 */
+	if (btrfs_folio_test_dirty(fs_info, folio, cur, blocksize)) {
+		ret = true;
+		/*
+		 * At least inside the folio, all the remaining blocks should
+		 * also be dirty.
+		 */
+		ASSERT(btrfs_folio_test_dirty(fs_info, folio, cur, range_len));
+		*next_ret = ordered->file_offset + ordered->num_bytes;
+		goto out;
+	}
+
+	/*
+	 * 2) The first block is uptodate.
+	 *
+	 * At least the first block can be skipped, but we are still
+	 * not fully sure. E.g. if the OE has some other folios in
+	 * the range that can not be skipped.
+	 * So we return true and update @next_ret to the OE/folio boundary.
+	 */
+	if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
+		u64 range_len = min(folio_pos(folio) + folio_size(folio),
+				    ordered->file_offset + ordered->num_bytes) - cur;
+
+		/*
+		 * The whole range to the OE end or folio boundary should also
+		 * be uptodate.
+		 */
+		ASSERT(btrfs_folio_test_uptodate(fs_info, folio, cur, range_len));
+		ret = true;
+		*next_ret = cur + range_len;
+		goto out;
+	}
+
+	/*
+	 * 3) The first block is not uptodate.
+	 *
+	 * This means the folio is invalidated after the OE finished, or direct IO.
+	 * Very much the same as case 0), just with private flag set.
+	 */
+	ret = false;
+out:
+	folio_put(folio);
+	return ret;
+}
+
+static bool can_skip_ordered_extent(struct btrfs_inode *binode,
+				    struct btrfs_ordered_extent *ordered,
+				    u64 start, u64 end)
+{
+	u64 range_end = min(end, ordered->file_offset + ordered->num_bytes - 1);
+	u64 range_start = max(start, ordered->file_offset);
+	u64 cur = range_start;
+
+	while (cur < range_end) {
+		bool can_skip;
+		u64 next_start;
+
+		can_skip = can_skip_one_ordered_range(binode, ordered, cur,
+						      &next_start);
+		if (!can_skip)
+			return false;
+		cur = next_start;
+	}
+	return true;
+}
+
+/*
+ * To make sure we get a stable view of extent maps for the involved range.
+ * This is for folio read paths (read and readahead), thus involved range
+ * should have all the folios locked.
+ */
+static void lock_extents_for_read(struct btrfs_inode *binode, u64 start, u64 end,
+				  struct extent_state **cached_state)
+{
+	struct btrfs_ordered_extent *ordered;
+	u64 cur_pos;
+
+	/* Caller must provide a valid @cached_state. */
+	ASSERT(cached_state);
+
+	/*
+	 * The range must at least be page aligned, as all read paths
+	 * are folio based.
+	 */
+	ASSERT(IS_ALIGNED(start, PAGE_SIZE) && IS_ALIGNED(end + 1, PAGE_SIZE));
+
+again:
+	lock_extent(&binode->io_tree, start, end, cached_state);
+	cur_pos = start;
+	while (cur_pos < end) {
+		ordered = btrfs_lookup_ordered_range(binode, cur_pos,
+						     end - cur_pos + 1);
+		/*
+		 * No ordered extents in the range, and we hold the
+		 * extent lock, no one can modify the extent maps
+		 * in the range, we're safe to return.
+		 */
+		if (!ordered)
+			break;
+
+		/* Check if we can skip waiting for the whole OE. */
+		if (can_skip_ordered_extent(binode, ordered, start, end)) {
+			cur_pos = min(ordered->file_offset + ordered->num_bytes,
+				      end + 1);
+			btrfs_put_ordered_extent(ordered);
+			continue;
+		}
+
+		/* Now wait for the OE to finish. */
+		unlock_extent(&binode->io_tree, start, end,
+			      cached_state);
+		btrfs_start_ordered_extent(ordered, start, end + 1 - start);
+		btrfs_put_ordered_extent(ordered);
+		/* We have unlocked the whole range, restart from the beginning. */
+		goto again;
+	}
+}
+
 int btrfs_read_folio(struct file *file, struct folio *folio)
 {
 	struct btrfs_inode *inode = folio_to_inode(folio);
@@ -1085,7 +1264,7 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
 	struct extent_map *em_cached = NULL;
 	int ret;
 
-	btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
+	lock_extents_for_read(inode, start, end, &cached_state);
 	ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
 	unlock_extent(&inode->io_tree, start, end, &cached_state);
 
@@ -2375,7 +2554,7 @@ void btrfs_readahead(struct readahead_control *rac)
 	struct extent_map *em_cached = NULL;
 	u64 prev_em_start = (u64)-1;
 
-	btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
+	lock_extents_for_read(inode, start, end, &cached_state);
 
 	while ((folio = readahead_folio(rac)) != NULL)
 		btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 579706fab9b4..81e6cb599585 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -942,7 +942,7 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct folio *folio,
 				      cached_state);
 			folio_unlock(folio);
 			folio_put(folio);
-			btrfs_start_ordered_extent(ordered);
+			btrfs_start_ordered_extent(ordered, 0, 0);
 			btrfs_put_ordered_extent(ordered);
 			return -EAGAIN;
 		}
@@ -1846,7 +1846,7 @@ static vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 		unlock_extent(io_tree, page_start, page_end, &cached_state);
 		folio_unlock(folio);
 		up_read(&BTRFS_I(inode)->i_mmap_lock);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, 0, 0);
 		btrfs_put_ordered_extent(ordered);
 		goto again;
 	}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32895aabf0ff..eaf53408254d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2821,7 +2821,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
 		unlock_extent(&inode->io_tree, page_start, page_end,
 			      &cached_state);
 		folio_unlock(folio);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, 0, 0);
 		btrfs_put_ordered_extent(ordered);
 		goto again;
 	}
@@ -4876,7 +4876,7 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 		unlock_extent(io_tree, block_start, block_end, &cached_state);
 		folio_unlock(folio);
 		folio_put(folio);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, 0, 0);
 		btrfs_put_ordered_extent(ordered);
 		goto again;
 	}
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 4aca7475fd82..6075a6fa4817 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -729,7 +729,7 @@ static void btrfs_run_ordered_extent_work(struct btrfs_work *work)
 	struct btrfs_ordered_extent *ordered;
 
 	ordered = container_of(work, struct btrfs_ordered_extent, flush_work);
-	btrfs_start_ordered_extent(ordered);
+	btrfs_start_ordered_extent(ordered, 0, 0);
 	complete(&ordered->completion);
 }
 
@@ -842,10 +842,12 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
 /*
  * Start IO and wait for a given ordered extent to finish.
  *
- * Wait on page writeback for all the pages in the extent and the IO completion
- * code to insert metadata into the btree corresponding to the extent.
+ * Wait on page writeback for all the pages in the extent but not in
+ * [@nowriteback_start, @nowriteback_start + @nowriteback_len) and the
+ * IO completion code to insert metadata into the btree corresponding to the extent.
  */
-void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
+void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry,
+				u64 nowriteback_start, u32 nowriteback_len)
 {
 	u64 start = entry->file_offset;
 	u64 end = start + entry->num_bytes - 1;
@@ -865,8 +867,19 @@ void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
 	 * start IO on any dirty ones so the wait doesn't stall waiting
 	 * for the flusher thread to find them
 	 */
-	if (!test_bit(BTRFS_ORDERED_DIRECT, &entry->flags))
-		filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start, end);
+	if (!test_bit(BTRFS_ORDERED_DIRECT, &entry->flags)) {
+		if (!nowriteback_len) {
+			filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start, end);
+		} else {
+			if (start < nowriteback_start)
+				filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start,
+						nowriteback_start - 1);
+			if (nowriteback_start + nowriteback_len < end)
+				filemap_fdatawrite_range(inode->vfs_inode.i_mapping,
+						nowriteback_start + nowriteback_len,
+						end);
+		}
+	}
 
 	if (!freespace_inode)
 		btrfs_might_wait_for_event(inode->root->fs_info, btrfs_ordered_extent);
@@ -921,7 +934,7 @@ int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len)
 			btrfs_put_ordered_extent(ordered);
 			break;
 		}
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, 0, 0);
 		end = ordered->file_offset;
 		/*
 		 * If the ordered extent had an error save the error but don't
@@ -1174,7 +1187,7 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 			break;
 		}
 		unlock_extent(&inode->io_tree, start, end, cachedp);
-		btrfs_start_ordered_extent(ordered);
+		btrfs_start_ordered_extent(ordered, 0, 0);
 		btrfs_put_ordered_extent(ordered);
 	}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 4e152736d06c..d7cf69647434 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -191,7 +191,8 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 			   struct btrfs_ordered_sum *sum);
 struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
 							 u64 file_offset);
-void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry);
+void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry,
+				u64 nowriteback_start, u32 nowriteback_len);
 int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len);
 struct btrfs_ordered_extent *
 btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset);
-- 
2.48.1



* Re: [PATCH v2 1/3] btrfs: introduce a read path dedicated extent lock helper
  2025-02-12  2:52 ` [PATCH v2 1/3] btrfs: introduce a read path dedicated extent lock helper Qu Wenruo
@ 2025-02-25 13:00   ` David Sterba
  2025-02-26  0:04     ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: David Sterba @ 2025-02-25 13:00 UTC
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Feb 12, 2025 at 01:22:45PM +1030, Qu Wenruo wrote:
> Signed-off-by: Qu Wenruo <wqu@suse.com>

I have style comments that you can fix either at commit time or if you
have another iteration.

> ---
>  fs/btrfs/defrag.c       |   2 +-
>  fs/btrfs/direct-io.c    |   2 +-
>  fs/btrfs/extent_io.c    | 183 +++++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/file.c         |   4 +-
>  fs/btrfs/inode.c        |   4 +-
>  fs/btrfs/ordered-data.c |  29 +++++--
>  fs/btrfs/ordered-data.h |   3 +-
>  7 files changed, 210 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index 968dae953948..d1330c138054 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -902,7 +902,7 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
>  			break;
>  
>  		folio_unlock(folio);
> -		btrfs_start_ordered_extent(ordered);
> +		btrfs_start_ordered_extent(ordered, 0, 0);

There are many calls with the 0, 0 parameters, and only one instance
with other values. It would be good to add a static inline helper that
wraps these parameters.
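
A wrapper along those lines could look roughly like the sketch below.
The name btrfs_start_ordered_extent_nowriteback() for the three-argument
variant is only an assumption for illustration, and the structure and
function body are userspace stubs that just record the arguments, so
that the wrapper can be exercised outside the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Stub standing in for the kernel structure, records the last call. */
struct btrfs_ordered_extent {
	uint64_t last_nowriteback_start;
	uint32_t last_nowriteback_len;
	unsigned int nr_starts;
};

/* Stub for the three-argument variant, the name is hypothetical. */
static void btrfs_start_ordered_extent_nowriteback(
		struct btrfs_ordered_extent *entry,
		uint64_t nowriteback_start, uint32_t nowriteback_len)
{
	assert(entry);
	entry->last_nowriteback_start = nowriteback_start;
	entry->last_nowriteback_len = nowriteback_len;
	entry->nr_starts++;
}

/* Wrapper keeping the one-argument form for the common (0, 0) callers. */
static inline void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
{
	btrfs_start_ordered_extent_nowriteback(entry, 0, 0);
}
```

With such a wrapper, the existing callers would not need to change, and
only the read path would call the three-argument variant directly.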

>  		btrfs_put_ordered_extent(ordered);
>  		folio_lock(folio);
>  		/*
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index b5190a010205..c98db5058967 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -103,7 +103,7 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
>  			 */
>  			if (writing ||
>  			    test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags))
> -				btrfs_start_ordered_extent(ordered);
> +				btrfs_start_ordered_extent(ordered, 0, 0);
>  			else
>  				ret = nowait ? -EAGAIN : -ENOTBLK;
>  			btrfs_put_ordered_extent(ordered);
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 91b20dccef73..819d51c3ed57 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1075,6 +1075,185 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>  	return 0;
>  }
>  
> +/*
> + * Check if we can skip waiting for the @ordered extent covering the block
> + * at file pos @cur.
> + *
> + * Return true if we can skip to @next_ret. The caller needs to check
> + * the @next_ret value to make sure it covers the full range, before
> + * skipping the OE.
> + *
> + * Return false if we must wait for the ordered extent.
> + *
> + * @cur:	The start file offset that we have locked folio for read.
> + * @next_ret:	If we return true, this indicates the start of the next
> + *		range to check.

The parameters should be after the first line description, the return
value description is last.

Also please reformat it to 80-ish line width, it seems the comments
below are also formatted to 70.

> + */
> +static bool can_skip_one_ordered_range(struct btrfs_inode *binode,

Please use 'inode' for btrfs_inodes, I have patches cleaning that up
everywhere.

> +				       struct btrfs_ordered_extent *ordered,
> +				       u64 cur, u64 *next_ret)
> +{
> +	const struct btrfs_fs_info *fs_info = binode->root->fs_info;
> +	struct folio *folio;
> +	const u32 blocksize = fs_info->sectorsize;
> +	u64 range_len;
> +	bool ret;
> +
> +	folio = filemap_get_folio(binode->vfs_inode.i_mapping,
> +				  cur >> PAGE_SHIFT);

Should this be folio_shift?

> +
> +	/*
> +	 * We should have locked the folio(s) for range [start, end], thus
> +	 * there must be a folio and it must be locked.
> +	 */
> +	ASSERT(!IS_ERR(folio));
> +	ASSERT(folio_test_locked(folio));
> +
> +	/*
> +	 * We several cases for the folio and OE combination:
> +	 *
> +	 * 0) Folio has no private flag
> +	 *    The OE has all its IO done but not yet finished, and folio got
> +	 *    invalidated. Or direct IO.
> +	 *
> +	 * Have to wait for the OE to finish, as it may contain the
> +	 * to-be-inserted data checksum.
> +	 * Without the data checksum inserted into csum tree, read
> +	 * will just fail with missing csum.
> +	 */
> +	if (!folio_test_private(folio)) {
> +		ret = false;
> +		goto out;
> +	}
> +	range_len = min(folio_pos(folio) + folio_size(folio),
> +			ordered->file_offset + ordered->num_bytes) - cur;
> +
> +	/*
> +	 * 1) The first block is DIRTY.
> +	 *
> +	 * This means the OE is created by some folio before us, but writeback
> +	 * has not started.
> +	 * We can and must skip the whole OE, because it will never start until
> +	 * we finished our folio read and unlocked the folio.
> +	 */
> +	if (btrfs_folio_test_dirty(fs_info, folio, cur, blocksize)) {
> +		ret = true;
> +		/*
> +		 * At least inside the folio, all the remaining blocks should
> +		 * also be dirty.
> +		 */
> +		ASSERT(btrfs_folio_test_dirty(fs_info, folio, cur, range_len));
> +		*next_ret = ordered->file_offset + ordered->num_bytes;
> +		goto out;
> +	}
> +
> +	/*
> +	 * 2) The first block is uptodate.
> +	 *
> +	 * At least the first block can be skipped, but we are still
> +	 * not full sure. E.g. if the OE has some other folios in
> +	 * the range that can not be skipped.
> +	 * So we return true and update @next_ret to the OE/folio boundary.
> +	 */
> +	if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
> +		u64 range_len = min(folio_pos(folio) + folio_size(folio),
> +				    ordered->file_offset + ordered->num_bytes) - cur;
> +
> +		/*
> +		 * The whole range to the OE end or folio boundary should also
> +		 * be uptodate.
> +		 */
> +		ASSERT(btrfs_folio_test_uptodate(fs_info, folio, cur, range_len));
> +		ret = true;
> +		*next_ret = cur + range_len;
> +		goto out;
> +	}
> +
> +	/*
> +	 * 3) The first block is not uptodate.
> +	 *
> +	 * This means the folio is invalidated after the OE finished, or direct IO.
> +	 * Very much the same as case 1), just with private flag set.
> +	 */
> +	ret = false;
> +out:
> +	folio_put(folio);
> +	return ret;
> +}
> +
> +static bool can_skip_ordered_extent(struct btrfs_inode *binode,
> +				    struct btrfs_ordered_extent *ordered,
> +				    u64 start, u64 end)
> +{
> +	u64 range_end = min(end, ordered->file_offset + ordered->num_bytes - 1);
> +	u64 range_start = max(start, ordered->file_offset);
> +	u64 cur = range_start;
> +
> +	while (cur < range_end) {
> +		bool can_skip;
> +		u64 next_start;
> +
> +		can_skip = can_skip_one_ordered_range(binode, ordered, cur,
> +						      &next_start);
> +		if (!can_skip)
> +			return false;
> +		cur = next_start;
> +	}
> +	return true;
> +}
> +
> +/*
> + * To make sure we get a stable view of extent maps for the involved range.
> + * This is for folio read paths (read and readahead), thus involved range
> + * should have all the folios locked.
> + */
> +static void lock_extents_for_read(struct btrfs_inode *binode, u64 start, u64 end,
> +				  struct extent_state **cached_state)
> +{
> +	struct btrfs_ordered_extent *ordered;
> +	u64 cur_pos;
> +
> +	/* Caller must provide a valid @cached_state. */
> +	ASSERT(cached_state);
> +
> +	/*
> +	 * The range must at least be page aligned, as all read paths
> +	 * are folio based.
> +	 */
> +	ASSERT(IS_ALIGNED(start, PAGE_SIZE) && IS_ALIGNED(end + 1, PAGE_SIZE));
> +
> +again:
> +	lock_extent(&binode->io_tree, start, end, cached_state);
> +	cur_pos = start;
> +	while (cur_pos < end) {
> +		ordered = btrfs_lookup_ordered_range(binode, cur_pos,
> +						     end - cur_pos + 1);
> +		/*
> +		 * No ordered extents in the range, and we hold the
> +		 * extent lock, no one can modify the extent maps
> +		 * in the range, we're safe to return.
> +		 */
> +		if (!ordered)
> +			break;
> +
> +		/* Check if we can skip waiting for the whole OE. */
> +		if (can_skip_ordered_extent(binode, ordered, start, end)) {
> +			cur_pos = min(ordered->file_offset + ordered->num_bytes,
> +				      end + 1);
> +			btrfs_put_ordered_extent(ordered);
> +			continue;
> +		}
> +
> +		/* Now wait for the OE to finish. */
> +		unlock_extent(&binode->io_tree, start, end,
> +			      cached_state);
> +		btrfs_start_ordered_extent(ordered, start, end + 1 - start);
> +		btrfs_put_ordered_extent(ordered);
> +		/* We have unlocked the whole range, restart from the beginning. */
> +		goto again;

This is a bit wild, a goto at the end of a while loop, but I don't see a
cleaner way without making it complicated in another way.

> +	}
> +}
> +
>  int btrfs_read_folio(struct file *file, struct folio *folio)
>  {
>  	struct btrfs_inode *inode = folio_to_inode(folio);
> @@ -1085,7 +1264,7 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
>  	struct extent_map *em_cached = NULL;
>  	int ret;
>  
> -	btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
> +	lock_extents_for_read(inode, start, end, &cached_state);
>  	ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
>  	unlock_extent(&inode->io_tree, start, end, &cached_state);
>  
> @@ -2375,7 +2554,7 @@ void btrfs_readahead(struct readahead_control *rac)
>  	struct extent_map *em_cached = NULL;
>  	u64 prev_em_start = (u64)-1;
>  
> -	btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
> +	lock_extents_for_read(inode, start, end, &cached_state);
>  
>  	while ((folio = readahead_folio(rac)) != NULL)
>  		btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 579706fab9b4..81e6cb599585 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -942,7 +942,7 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct folio *folio,
>  				      cached_state);
>  			folio_unlock(folio);
>  			folio_put(folio);
> -			btrfs_start_ordered_extent(ordered);
> +			btrfs_start_ordered_extent(ordered, 0, 0);
>  			btrfs_put_ordered_extent(ordered);
>  			return -EAGAIN;
>  		}
> @@ -1846,7 +1846,7 @@ static vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>  		unlock_extent(io_tree, page_start, page_end, &cached_state);
>  		folio_unlock(folio);
>  		up_read(&BTRFS_I(inode)->i_mmap_lock);
> -		btrfs_start_ordered_extent(ordered);
> +		btrfs_start_ordered_extent(ordered, 0, 0);
>  		btrfs_put_ordered_extent(ordered);
>  		goto again;
>  	}
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 32895aabf0ff..eaf53408254d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2821,7 +2821,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
>  		unlock_extent(&inode->io_tree, page_start, page_end,
>  			      &cached_state);
>  		folio_unlock(folio);
> -		btrfs_start_ordered_extent(ordered);
> +		btrfs_start_ordered_extent(ordered, 0, 0);
>  		btrfs_put_ordered_extent(ordered);
>  		goto again;
>  	}
> @@ -4876,7 +4876,7 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
>  		unlock_extent(io_tree, block_start, block_end, &cached_state);
>  		folio_unlock(folio);
>  		folio_put(folio);
> -		btrfs_start_ordered_extent(ordered);
> +		btrfs_start_ordered_extent(ordered, 0, 0);
>  		btrfs_put_ordered_extent(ordered);
>  		goto again;
>  	}
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 4aca7475fd82..6075a6fa4817 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -729,7 +729,7 @@ static void btrfs_run_ordered_extent_work(struct btrfs_work *work)
>  	struct btrfs_ordered_extent *ordered;
>  
>  	ordered = container_of(work, struct btrfs_ordered_extent, flush_work);
> -	btrfs_start_ordered_extent(ordered);
> +	btrfs_start_ordered_extent(ordered, 0, 0);
>  	complete(&ordered->completion);
>  }
>  
> @@ -842,10 +842,12 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
>  /*
>   * Start IO and wait for a given ordered extent to finish.
>   *
> - * Wait on page writeback for all the pages in the extent and the IO completion
> - * code to insert metadata into the btree corresponding to the extent.
> + * Wait on page writeback for all the pages in the extent but not in
> + * [@nowriteback_start, @nowriteback_start + @nowriteback_len) and the
> + * IO completion code to insert metadata into the btree corresponding to the extent.
>   */
> -void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
> +void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry,
> +				u64 nowriteback_start, u32 nowriteback_len)
>  {
>  	u64 start = entry->file_offset;
>  	u64 end = start + entry->num_bytes - 1;
> @@ -865,8 +867,19 @@ void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
>  	 * start IO on any dirty ones so the wait doesn't stall waiting
>  	 * for the flusher thread to find them
>  	 */
> -	if (!test_bit(BTRFS_ORDERED_DIRECT, &entry->flags))
> -		filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start, end);
> +	if (!test_bit(BTRFS_ORDERED_DIRECT, &entry->flags)) {
> +		if (!nowriteback_len) {
> +			filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start, end);
> +		} else {
> +			if (start < nowriteback_start)
> +				filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start,
> +						nowriteback_start - 1);
> +			if (nowriteback_start + nowriteback_len < end)
> +				filemap_fdatawrite_range(inode->vfs_inode.i_mapping,
> +						nowriteback_start + nowriteback_len,
> +						end);
> +		}
> +	}
>  
>  	if (!freespace_inode)
>  		btrfs_might_wait_for_event(inode->root->fs_info, btrfs_ordered_extent);
> @@ -921,7 +934,7 @@ int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len)
>  			btrfs_put_ordered_extent(ordered);
>  			break;
>  		}
> -		btrfs_start_ordered_extent(ordered);
> +		btrfs_start_ordered_extent(ordered, 0, 0);
>  		end = ordered->file_offset;
>  		/*
>  		 * If the ordered extent had an error save the error but don't
> @@ -1174,7 +1187,7 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
>  			break;
>  		}
>  		unlock_extent(&inode->io_tree, start, end, cachedp);
> -		btrfs_start_ordered_extent(ordered);
> +		btrfs_start_ordered_extent(ordered, 0, 0);
>  		btrfs_put_ordered_extent(ordered);
>  	}
>  }
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index 4e152736d06c..d7cf69647434 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -191,7 +191,8 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
>  			   struct btrfs_ordered_sum *sum);
>  struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
>  							 u64 file_offset);
> -void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry);
> +void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry,
> +				u64 nowriteback_start, u32 nowriteback_len);
>  int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len);
>  struct btrfs_ordered_extent *
>  btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset);
> -- 
> 2.48.1
> 


* Re: [PATCH v2 1/3] btrfs: introduce a read path dedicated extent lock helper
  2025-02-25 13:00   ` David Sterba
@ 2025-02-26  0:04     ` Qu Wenruo
  0 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2025-02-26  0:04 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, Linux Memory Management List; +Cc: linux-btrfs

On 2025/2/25 23:30, David Sterba wrote:
> On Wed, Feb 12, 2025 at 01:22:45PM +1030, Qu Wenruo wrote:
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
[...]
>> +	folio = filemap_get_folio(binode->vfs_inode.i_mapping,
>> +				  cur >> PAGE_SHIFT);
>
> Should this be folio_shift?

This is the biggest trap!

The filemap_* helpers always use the page index, no matter the folio size.

The filemap can be considered a super large array of folio pointers
(implemented by xarray).

For the current folio size == page size case, it's straightforward: if
there is a pointer, then there is a cached folio for that index.

For larger folios, the overall idea does not change; a larger folio just
covers multiple slots, so it is no longer one folio per slot.

So when doing the search we should always use PAGE_SHIFT.

And that's why I hope the MM guys can provide a file offset based
filemap_get_folio_by_fileoff().

CC MM guys, would a dedicated helper reduce such confusion?
Or would it just make the currently very simple filemap_*() helpers too
complex?

Thanks,
Qu

[...]
>> +again:
>> +	lock_extent(&binode->io_tree, start, end, cached_state);
>> +	cur_pos = start;
>> +	while (cur_pos < end) {
>> +		ordered = btrfs_lookup_ordered_range(binode, cur_pos,
>> +						     end - cur_pos + 1);
>> +		/*
>> +		 * No ordered extents in the range, and we hold the
>> +		 * extent lock, no one can modify the extent maps
>> +		 * in the range, we're safe to return.
>> +		 */
>> +		if (!ordered)
>> +			break;
>> +
>> +		/* Check if we can skip waiting for the whole OE. */
>> +		if (can_skip_ordered_extent(binode, ordered, start, end)) {
>> +			cur_pos = min(ordered->file_offset + ordered->num_bytes,
>> +				      end + 1);
>> +			btrfs_put_ordered_extent(ordered);
>> +			continue;
>> +		}
>> +
>> +		/* Now wait for the OE to finish. */
>> +		unlock_extent(&binode->io_tree, start, end,
>> +			      cached_state);
>> +		btrfs_start_ordered_extent(ordered, start, end + 1 - start);
>> +		btrfs_put_ordered_extent(ordered);
>> +		/* We have unlocked the whole range, restart from the beginning. */
>> +		goto again;
>
> This is a bit wild, goto at the end of a while loop but I don't see a
> cleaner way without making it complicated in another way.

I have fixed this in the version submitted to the mailing list, by
introducing another layer of while loop (in another function).

Thanks,
Qu


* [PATCH v2 2/3] btrfs: make btrfs_do_readpage() to do block-by-block read
  2025-02-12  2:52 [PATCH v2 0/3] btrfs: enhancement to pass generic/563 Qu Wenruo
  2025-02-12  2:52 ` [PATCH v2 1/3] btrfs: introduce a read path dedicated extent lock helper Qu Wenruo
@ 2025-02-12  2:52 ` Qu Wenruo
  2025-02-25 13:04   ` David Sterba
  2025-02-12  2:52 ` [PATCH v2 3/3] btrfs: allow buffered write to avoid full page read if it's block aligned Qu Wenruo
  2 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2025-02-12  2:52 UTC (permalink / raw)
  To: linux-btrfs

Currently if a btrfs has block size (the older sector size) < page size,
btrfs_do_readpage() will handle the range extent by extent, this is good
for performance as it doesn't need to re-lookup the same extent map again
and again.
(Although __get_extent_map() already does extra cached em check, thus
the optimization is not that obvious)

This is totally fine and is a valid optimization, but it assumes that
there is no partially uptodate range in the page.

Meanwhile there is an incoming feature, requiring btrfs to skip the full
page read if a buffered write range covers a full block but not a full
page.

In that case, we can have a page that is partially uptodate, and the
current per-extent lookup cannot handle such a case.

So here we change btrfs_do_readpage() to do block-by-block read, this
simplifies the following things:

- Remove the need for @iosize variable
  Because we just use sectorsize as our increment.

- Remove @pg_offset, and calculate it inside the loop when needed
  It's just offset_in_folio().

- Use a for() loop instead of a while() loop

This will slightly reduce the read performance for
block size < page size cases, but for the future where we can skip a
full page read in a lot of cases, it should still be worth it.

For block size == page size, this brings no performance change.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 37 ++++++++++++-------------------------
 1 file changed, 12 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 819d51c3ed57..64812045a42d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -941,9 +941,7 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 	u64 last_byte = i_size_read(inode);
 	struct extent_map *em;
 	int ret = 0;
-	size_t pg_offset = 0;
-	size_t iosize;
-	size_t blocksize = fs_info->sectorsize;
+	const size_t blocksize = fs_info->sectorsize;
 
 	ret = set_folio_extent_mapped(folio);
 	if (ret < 0) {
@@ -954,24 +952,23 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 	if (folio_contains(folio, last_byte >> PAGE_SHIFT)) {
 		size_t zero_offset = offset_in_folio(folio, last_byte);
 
-		if (zero_offset) {
-			iosize = folio_size(folio) - zero_offset;
-			folio_zero_range(folio, zero_offset, iosize);
-		}
+		if (zero_offset)
+			folio_zero_range(folio, zero_offset,
+					 folio_size(folio) - zero_offset);
 	}
 	bio_ctrl->end_io_func = end_bbio_data_read;
 	begin_folio_read(fs_info, folio);
-	while (cur <= end) {
+	for (cur = start; cur <= end; cur += blocksize) {
 		enum btrfs_compression_type compress_type = BTRFS_COMPRESS_NONE;
+		unsigned long pg_offset = offset_in_folio(folio, cur);
 		bool force_bio_submit = false;
 		u64 disk_bytenr;
 		u64 block_start;
 
 		ASSERT(IS_ALIGNED(cur, fs_info->sectorsize));
 		if (cur >= last_byte) {
-			iosize = folio_size(folio) - pg_offset;
-			folio_zero_range(folio, pg_offset, iosize);
-			end_folio_read(folio, true, cur, iosize);
+			folio_zero_range(folio, pg_offset, end - cur + 1);
+			end_folio_read(folio, true, cur, end - cur + 1);
 			break;
 		}
 		em = get_extent_map(BTRFS_I(inode), folio, cur, end - cur + 1, em_cached);
@@ -985,8 +982,6 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 
 		compress_type = extent_map_compression(em);
 
-		iosize = min(extent_map_end(em) - cur, end - cur + 1);
-		iosize = ALIGN(iosize, blocksize);
 		if (compress_type != BTRFS_COMPRESS_NONE)
 			disk_bytenr = em->disk_bytenr;
 		else
@@ -1044,18 +1039,13 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 
 		/* we've found a hole, just zero and go on */
 		if (block_start == EXTENT_MAP_HOLE) {
-			folio_zero_range(folio, pg_offset, iosize);
-
-			end_folio_read(folio, true, cur, iosize);
-			cur = cur + iosize;
-			pg_offset += iosize;
+			folio_zero_range(folio, pg_offset, blocksize);
+			end_folio_read(folio, true, cur, blocksize);
 			continue;
 		}
 		/* the get_extent function already copied into the folio */
 		if (block_start == EXTENT_MAP_INLINE) {
-			end_folio_read(folio, true, cur, iosize);
-			cur = cur + iosize;
-			pg_offset += iosize;
+			end_folio_read(folio, true, cur, blocksize);
 			continue;
 		}
 
@@ -1066,12 +1056,9 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 
 		if (force_bio_submit)
 			submit_one_bio(bio_ctrl);
-		submit_extent_folio(bio_ctrl, disk_bytenr, folio, iosize,
+		submit_extent_folio(bio_ctrl, disk_bytenr, folio, blocksize,
 				    pg_offset);
-		cur = cur + iosize;
-		pg_offset += iosize;
 	}
-
 	return 0;
 }
 
-- 
2.48.1



* Re: [PATCH v2 2/3] btrfs: make btrfs_do_readpage() to do block-by-block read
  2025-02-12  2:52 ` [PATCH v2 2/3] btrfs: make btrfs_do_readpage() to do block-by-block read Qu Wenruo
@ 2025-02-25 13:04   ` David Sterba
  0 siblings, 0 replies; 8+ messages in thread
From: David Sterba @ 2025-02-25 13:04 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Feb 12, 2025 at 01:22:46PM +1030, Qu Wenruo wrote:
> Currently if a btrfs has block size (the older sector size) < page size,
> btrfs_do_readpage() will handle the range extent by extent, this is good
> for performance as it doesn't need to re-lookup the same extent map again
> and again.
> (Although __get_extent_map() already does extra cached em check, thus

Minor thing, __get_extent_map has been renamed to get_extent_map.

> the optimization is not that obvious)
> 
> This is totally fine and is a valid optimization, but it assumes that
> there is no partially uptodate range in the page.
> 
> Meanwhile there is an incoming feature, requiring btrfs to skip the full
> page read if a buffered write range covers a full block but not a full
> page.
> 
> In that case, we can have a page that is partially uptodate, and the
> current per-extent lookup cannot handle such a case.
> 
> So here we change btrfs_do_readpage() to do block-by-block read, this
> simplifies the following things:
> 
> - Remove the need for @iosize variable
>   Because we just use sectorsize as our increment.
> 
> - Remove @pg_offset, and calculate it inside the loop when needed
>   It's just offset_in_folio().
> 
> - Use a for() loop instead of a while() loop
> 
> This will slightly reduce the read performance for
> block size < page size cases, but for the future where we can skip a
> full page read in a lot of cases, it should still be worth it.
> 
> For block size == page size, this brings no performance change.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>


* [PATCH v2 3/3] btrfs: allow buffered write to avoid full page read if it's block aligned
  2025-02-12  2:52 [PATCH v2 0/3] btrfs: enhancement to pass generic/563 Qu Wenruo
  2025-02-12  2:52 ` [PATCH v2 1/3] btrfs: introduce a read path dedicated extent lock helper Qu Wenruo
  2025-02-12  2:52 ` [PATCH v2 2/3] btrfs: make btrfs_do_readpage() to do block-by-block read Qu Wenruo
@ 2025-02-12  2:52 ` Qu Wenruo
  2025-02-25 13:05   ` David Sterba
  2 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2025-02-12  2:52 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
Since the support of block size (sector size) < page size for btrfs,
test case generic/563 fails with 4K block size and 64K page size:

    --- tests/generic/563.out	2024-04-25 18:13:45.178550333 +0930
    +++ /home/adam/xfstests-dev/results//generic/563.out.bad	2024-09-30 09:09:16.155312379 +0930
    @@ -3,7 +3,8 @@
     read is in range
     write is in range
     write -> read/write
    -read is in range
    +read has value of 8388608
    +read is NOT in range -33792 .. 33792
     write is in range
    ...

[CAUSE]
The test case creates an 8MiB file, then does buffered writes into the
8MiB range using 4K block size, to overwrite the whole file.

On 4K page sized systems, since the write range covers the full block and
page, btrfs will not bother reading the page, just like what XFS and EXT4
do.

But on 64K page sized systems, although the 4K sized write is still block
aligned, it's not page aligned any more, thus btrfs will read the full
page, causing more reads than expected and failing the test case.

[FIX]
To skip the full page read, we need to do the following modifications:

- Do not trigger full page read as long as the buffered write is block
  aligned
  This is pretty simple by modifying the check inside
  prepare_uptodate_folio().

- Skip already uptodate blocks during full page read
  Otherwise it can lead to the following data corruption:

  0       32K        64K
  |///////|          |

  Where the file range [0, 32K) is dirtied by buffered write, the
  remaining range [32K, 64K) is not.

  When reading the full page, since [0,32K) is only dirtied but not
  written back, there is no data extent map for it, but a hole covering
  [0, 64k).

  If we continue reading the full page range [0, 64K), the dirtied range
  will be filled with 0 (since there is only a hole covering the whole
  range).
  This causes the dirtied range to get lost.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 4 ++++
 fs/btrfs/file.c      | 5 +++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 64812045a42d..abf43805ea92 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -971,6 +971,10 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 			end_folio_read(folio, true, cur, end - cur + 1);
 			break;
 		}
+		if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
+			end_folio_read(folio, true, cur, blocksize);
+			continue;
+		}
 		em = get_extent_map(BTRFS_I(inode), folio, cur, end - cur + 1, em_cached);
 		if (IS_ERR(em)) {
 			end_folio_read(folio, false, cur, end + 1 - cur);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 81e6cb599585..83a7238e8c2e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -804,14 +804,15 @@ static int prepare_uptodate_folio(struct inode *inode, struct folio *folio, u64
 {
 	u64 clamp_start = max_t(u64, pos, folio_pos(folio));
 	u64 clamp_end = min_t(u64, pos + len, folio_pos(folio) + folio_size(folio));
+	const u32 sectorsize = inode_to_fs_info(inode)->sectorsize;
 	int ret = 0;
 
 	if (folio_test_uptodate(folio))
 		return 0;
 
 	if (!force_uptodate &&
-	    IS_ALIGNED(clamp_start, PAGE_SIZE) &&
-	    IS_ALIGNED(clamp_end, PAGE_SIZE))
+	    IS_ALIGNED(clamp_start, sectorsize) &&
+	    IS_ALIGNED(clamp_end, sectorsize))
 		return 0;
 
 	ret = btrfs_read_folio(NULL, folio);
-- 
2.48.1



* Re: [PATCH v2 3/3] btrfs: allow buffered write to avoid full page read if it's block aligned
  2025-02-12  2:52 ` [PATCH v2 3/3] btrfs: allow buffered write to avoid full page read if it's block aligned Qu Wenruo
@ 2025-02-25 13:05   ` David Sterba
  0 siblings, 0 replies; 8+ messages in thread
From: David Sterba @ 2025-02-25 13:05 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Feb 12, 2025 at 01:22:47PM +1030, Qu Wenruo wrote:
> [BUG]
> Since the support of block size (sector size) < page size for btrfs,
> test case generic/563 fails with 4K block size and 64K page size:
> 
>     --- tests/generic/563.out	2024-04-25 18:13:45.178550333 +0930
>     +++ /home/adam/xfstests-dev/results//generic/563.out.bad	2024-09-30 09:09:16.155312379 +0930
>     @@ -3,7 +3,8 @@
>      read is in range
>      write is in range
>      write -> read/write
>     -read is in range
>     +read has value of 8388608
>     +read is NOT in range -33792 .. 33792
>      write is in range
>     ...
> 
> [CAUSE]
> The test case creates an 8MiB file, then does buffered writes into the
> 8MiB range using 4K block size, to overwrite the whole file.
> 
> On 4K page sized systems, since the write range covers the full block and
> page, btrfs will not bother reading the page, just like what XFS and EXT4
> do.
> 
> But on 64K page sized systems, although the 4K sized write is still block
> aligned, it's not page aligned any more, thus btrfs will read the full
> page, causing more reads than expected and failing the test case.
> 
> [FIX]
> To skip the full page read, we need to do the following modifications:
> 
> - Do not trigger full page read as long as the buffered write is block
>   aligned
>   This is pretty simple by modifying the check inside
>   prepare_uptodate_folio().
> 
> - Skip already uptodate blocks during full page read
>   Otherwise it can lead to the following data corruption:
> 
>   0       32K        64K
>   |///////|          |
> 
>   Where the file range [0, 32K) is dirtied by buffered write, the
>   remaining range [32K, 64K) is not.
> 
>   When reading the full page, since [0,32K) is only dirtied but not
>   written back, there is no data extent map for it, but a hole covering
>   [0, 64k).
> 
>   If we continue reading the full page range [0, 64K), the dirtied range
>   will be filled with 0 (since there is only a hole covering the whole
>   range).
>   This causes the dirtied range to get lost.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 4 ++++
>  fs/btrfs/file.c      | 5 +++--
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 64812045a42d..abf43805ea92 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -971,6 +971,10 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>  			end_folio_read(folio, true, cur, end - cur + 1);
>  			break;
>  		}
> +		if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
> +			end_folio_read(folio, true, cur, blocksize);
> +			continue;
> +		}
>  		em = get_extent_map(BTRFS_I(inode), folio, cur, end - cur + 1, em_cached);
>  		if (IS_ERR(em)) {
>  			end_folio_read(folio, false, cur, end + 1 - cur);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 81e6cb599585..83a7238e8c2e 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -804,14 +804,15 @@ static int prepare_uptodate_folio(struct inode *inode, struct folio *folio, u64
>  {
>  	u64 clamp_start = max_t(u64, pos, folio_pos(folio));
>  	u64 clamp_end = min_t(u64, pos + len, folio_pos(folio) + folio_size(folio));
> +	const u32 sectorsize = inode_to_fs_info(inode)->sectorsize;

In such cases you can name the local variables blocksize, this is the
least intrusive way to convert it from the sectorsize.

>  	int ret = 0;
>  
>  	if (folio_test_uptodate(folio))
>  		return 0;
>  
>  	if (!force_uptodate &&
> -	    IS_ALIGNED(clamp_start, PAGE_SIZE) &&
> -	    IS_ALIGNED(clamp_end, PAGE_SIZE))
> +	    IS_ALIGNED(clamp_start, sectorsize) &&
> +	    IS_ALIGNED(clamp_end, sectorsize))
>  		return 0;
>  
>  	ret = btrfs_read_folio(NULL, folio);
> -- 
> 2.48.1
> 
