* [PATCH 0/7] btrfs: make subpage handling be feature full
@ 2025-02-23 23:46 Qu Wenruo
2025-02-23 23:46 ` [PATCH 1/7] btrfs: prevent inline data extents read from touching blocks beyond its range Qu Wenruo
` (6 more replies)
0 siblings, 7 replies; 22+ messages in thread
From: Qu Wenruo @ 2025-02-23 23:46 UTC (permalink / raw)
To: linux-btrfs
[CHANGELOG]
v1:
- Merge previous two patches into one
- Re-order the patches so preparation/fixes are always before feature
enablement
- Update the commit message of the first patch
So that we do not focus on the out-of-tree part, but explain why it's
not ideal in the first place (double page zeroing), and just mention
the behavior change in the future.
Since the introduction of btrfs subpage support in v5.15, there have
been quite a few limitations:
- No RAID56 support
Added in v5.19.
- No memory efficient scrub support
Added in v6.4.
- No block perfect compressed write support
Previously btrfs requires the dirty range to be fully page aligned, or
it will skip compression completely.
Added in v6.13.
- Various subpage related error handling fixes
Added in v6.14.
- No support to create inline data extent
- No partial uptodate page support
This is a long existing optimization supported by EXT4/XFS and
is required to pass generic/563 with subpage block size.
The last two are addressed in this patchset.
And since all features are now implemented for the subpage case, the
long-standing experimental warning message can be dropped, as it is
already causing a lot of concern for users who check dmesg carefully.
Qu Wenruo (7):
btrfs: prevent inline data extents read from touching blocks beyond
its range
btrfs: fix the qgroup data free range for inline data extents
btrfs: introduce a read path dedicated extent lock helper
btrfs: make btrfs_do_readpage() to do block-by-block read
btrfs: allow buffered write to avoid full page read if it's block
aligned
btrfs: allow inline data extents creation if block size < page size
btrfs: remove the subpage related warning message
fs/btrfs/defrag.c | 2 +-
fs/btrfs/direct-io.c | 2 +-
fs/btrfs/disk-io.c | 5 -
fs/btrfs/extent_io.c | 224 +++++++++++++++++++++++++++++++++++-----
fs/btrfs/file.c | 9 +-
fs/btrfs/inode.c | 34 +++---
fs/btrfs/ordered-data.c | 29 ++++--
fs/btrfs/ordered-data.h | 3 +-
8 files changed, 239 insertions(+), 69 deletions(-)
--
2.48.1
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/7] btrfs: prevent inline data extents read from touching blocks beyond its range
2025-02-23 23:46 [PATCH 0/7] btrfs: make subpage handling be feature full Qu Wenruo
@ 2025-02-23 23:46 ` Qu Wenruo
2025-02-25 15:07 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 2/7] btrfs: fix the qgroup data free range for inline data extents Qu Wenruo
` (5 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2025-02-23 23:46 UTC (permalink / raw)
To: linux-btrfs
Currently reading an inline data extent will zero out all the remaining
range in the page.
This is not yet causing problems even for block size < page size
(subpage) cases because:
1) An inline data extent always starts at file offset 0
Meaning at page read, we always read the inline extent first, before
any other blocks in the page. Then later blocks are properly read out
and re-fill the zeroed out ranges.
2) Currently btrfs will read out the whole page if a buffered write is
not page aligned
So a page is either fully uptodate at buffered write time (covers the
whole page), or we will read out the whole page first.
Meaning there is nothing to lose for such an inline extent read.
But it's still not ideal:
- We're zeroing out the page twice
One done by read_inline_extent()/uncompress_inline(), one done by
btrfs_do_readpage() for ranges beyond i_size.
- We're touching blocks that don't belong to the inline extent
In the incoming patches, we can have a partially uptodate folio, where
some dirty blocks exist while the folio is not fully uptodate:
The page size is 16K and block size is 4K:
0 4K 8K 12K 16K
| | |/////////| |
Range [8K, 12K) is dirtied by a buffered write, while the remaining
blocks are not uptodate.
If range [0, 4K) contains an inline data extent, and we try to read
the whole page, the current behavior will overwrite range [8K, 12K)
with zero and cause data loss.
So to make the behavior more consistent and in preparation for future
changes, limit the inline data extents read to only zero out the range
inside the first block, not the whole page.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/inode.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c432ccfba56e..fe2c6038064a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6788,6 +6788,7 @@ static noinline int uncompress_inline(struct btrfs_path *path,
{
int ret;
struct extent_buffer *leaf = path->nodes[0];
+ const u32 sectorsize = leaf->fs_info->sectorsize;
char *tmp;
size_t max_size;
unsigned long inline_size;
@@ -6816,17 +6817,19 @@ static noinline int uncompress_inline(struct btrfs_path *path,
* cover that region here.
*/
- if (max_size < PAGE_SIZE)
- folio_zero_range(folio, max_size, PAGE_SIZE - max_size);
+ if (max_size < sectorsize)
+ folio_zero_range(folio, max_size, sectorsize - max_size);
kfree(tmp);
return ret;
}
-static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
+static int read_inline_extent(struct btrfs_fs_info *fs_info,
+ struct btrfs_path *path, struct folio *folio)
{
struct btrfs_file_extent_item *fi;
void *kaddr;
size_t copy_size;
+ const u32 sectorsize = fs_info->sectorsize;
if (!folio || folio_test_uptodate(folio))
return 0;
@@ -6844,8 +6847,8 @@ static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
read_extent_buffer(path->nodes[0], kaddr,
btrfs_file_extent_inline_start(fi), copy_size);
kunmap_local(kaddr);
- if (copy_size < PAGE_SIZE)
- folio_zero_range(folio, copy_size, PAGE_SIZE - copy_size);
+ if (copy_size < sectorsize)
+ folio_zero_range(folio, copy_size, sectorsize - copy_size);
return 0;
}
@@ -7020,7 +7023,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
ASSERT(em->disk_bytenr == EXTENT_MAP_INLINE);
ASSERT(em->len == fs_info->sectorsize);
- ret = read_inline_extent(path, folio);
+ ret = read_inline_extent(fs_info, path, folio);
if (ret < 0)
goto out;
goto insert;
--
2.48.1
* [PATCH 2/7] btrfs: fix the qgroup data free range for inline data extents
2025-02-23 23:46 [PATCH 0/7] btrfs: make subpage handling be feature full Qu Wenruo
2025-02-23 23:46 ` [PATCH 1/7] btrfs: prevent inline data extents read from touching blocks beyond its range Qu Wenruo
@ 2025-02-23 23:46 ` Qu Wenruo
2025-02-25 15:11 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 3/7] btrfs: introduce a read path dedicated extent lock helper Qu Wenruo
` (4 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2025-02-23 23:46 UTC (permalink / raw)
To: linux-btrfs
Inside __cow_file_range_inline(), since the inlined data no longer
takes any data space, we need to free up the reserved space.
However the code is still using the old page size == sector size
assumption, and will not handle the subpage case well.
Thankfully it is not going to cause any problems because we have two
extra safety nets:
- Inline data extent creation is disabled for sector size < page size
cases for now
But it won't stay that way for long.
- btrfs_qgroup_free_data() will only clear ranges which are already
reserved
So even if we pass a range larger than what we need, it should still
be fine, especially since there is only reserved space for a single
block at file offset 0 for an inline data extent.
But just for the sake of consistency, fix the call site to use
sectorsize instead of page size.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fe2c6038064a..0efe9f005149 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -672,7 +672,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
* And at reserve time, it's always aligned to page size, so
* just free one page here.
*/
- btrfs_qgroup_free_data(inode, NULL, 0, PAGE_SIZE, NULL);
+ btrfs_qgroup_free_data(inode, NULL, 0, fs_info->sectorsize, NULL);
btrfs_free_path(path);
btrfs_end_transaction(trans);
return ret;
--
2.48.1
* [PATCH 3/7] btrfs: introduce a read path dedicated extent lock helper
2025-02-23 23:46 [PATCH 0/7] btrfs: make subpage handling be feature full Qu Wenruo
2025-02-23 23:46 ` [PATCH 1/7] btrfs: prevent inline data extents read from touching blocks beyond its range Qu Wenruo
2025-02-23 23:46 ` [PATCH 2/7] btrfs: fix the qgroup data free range for inline data extents Qu Wenruo
@ 2025-02-23 23:46 ` Qu Wenruo
2025-02-25 17:05 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 4/7] btrfs: make btrfs_do_readpage() to do block-by-block read Qu Wenruo
` (3 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2025-02-23 23:46 UTC (permalink / raw)
To: linux-btrfs
Currently we're using btrfs_lock_and_flush_ordered_range() for both
btrfs_read_folio() and btrfs_readahead(), but it has one critical
problem for future subpage enhancements:
- It will call btrfs_start_ordered_extent() to writeback the involved
folios
But remember we're calling btrfs_lock_and_flush_ordered_range() at
read paths, meaning the folio is already locked by read path.
If we really trigger writeback, this will lead to a deadlock and
writeback can not hold the folio lock.
Such dead lock is prevented by the fact that btrfs always keeps a
dirty folio also uptodate, by either dirtying all blocks of the folio,
or read the folio before dirtying.
But that's not the common behavior, as XFS/EXT4 both allow a folio to
be dirty but not uptodate, by allowing buffered writes to dirty a full
block without reading the full folio.
Instead of blindly calling btrfs_start_ordered_extent(), introduce a
newer helper, which is smarter in the following ways:
- Only wait and flush the ordered extent if
* The folio doesn't even have private set
* Part of the blocks of the ordered extent is not uptodate
This can happen when:
* Folio writeback finished, then the folio got invalidated, but the
OE has not yet finished
* Direct IO
We have to wait for the ordered extent, as it may contain
to-be-inserted data checksums.
Without waiting, our read can fail due to the missing csums.
But either way, the OE should not need any extra flush inside the
locked folio range.
- Skip the ordered extent completely if
* All the blocks are dirty
This happens when the OE creation was caused by a previous folio.
The writeback will never happen (we're holding the folio lock for
read), nor will the OE finish.
Thus we must skip the range.
* All the blocks are uptodate
This happens when the writeback finished, but the OE has not yet finished.
Since the blocks are already uptodate, we can skip the OE range.
The newer helper, lock_extents_for_read() will do a loop for the target
range by:
1) Lock the full range
2) If there is no ordered extent in the remaining range, exit
3) If there is an ordered extent that we can skip
Skip to the end of the OE, and continue checking
We do not trigger writeback nor wait for the OE.
4) If there is an ordered extent that we can not skip
Unlock the whole extent range and start the ordered extent.
And also update btrfs_start_ordered_extent() to add two more parameters:
@nowriteback_start and @nowriteback_len, to prevent triggering flush for
a certain range.
This will allow us to handle the following case properly in the future:
16K page size, 4K btrfs block size:
16K 20K 24K 28K 32K
|/////////////////////////////| | |
|<----------- OE 2----------->| |<--- OE 1 --->|
The folio has been written back before, thus we have an OE at
[28K, 32K).
Although OE 1 finished its IO, the OE is not yet removed from the IO
tree.
Later the folio got invalidated, but the OE still exists.
And [16K, 24K) range is dirty and uptodate, caused by a block aligned
buffered write (and future enhancements allowing btrfs to skip full
folio read for such case).
Furthermore, OE 2 is created covering range [16K, 24K) by the writeback
of a previous folio.
Since the full folio is not uptodate, if we want to read the folio,
the existing btrfs_lock_and_flush_ordered_range() will deadlock:
btrfs_read_folio()
| Folio 16K is already locked
|- btrfs_lock_and_flush_ordered_range()
|- btrfs_start_ordered_extent() for range [16K, 24K)
|- filemap_fdatawrite_range() for range [16K, 24K)
|- extent_write_cache_pages()
folio_lock() on folio 16K, deadlock.
But now we will have the following sequence:
btrfs_read_folio()
| Folio 16K is already locked
|- lock_extents_for_read()
|- can_skip_ordered_extent() for range [16K, 24K)
| Returned true, the range [16K, 24K) will be skipped.
|- can_skip_ordered_extent() for range [28K, 32K)
| Returned false.
|- btrfs_start_ordered_extent() for range [28K, 32K) with
[16K, 32K) as no writeback range
No writeback for folio 16K will be triggered.
And there will be no more possible deadlock on the same folio.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/defrag.c | 2 +-
fs/btrfs/direct-io.c | 2 +-
fs/btrfs/extent_io.c | 183 +++++++++++++++++++++++++++++++++++++++-
fs/btrfs/file.c | 4 +-
fs/btrfs/inode.c | 4 +-
fs/btrfs/ordered-data.c | 29 +++++--
fs/btrfs/ordered-data.h | 3 +-
7 files changed, 210 insertions(+), 17 deletions(-)
diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 18f0704263f3..4b89094da3de 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -902,7 +902,7 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
break;
folio_unlock(folio);
- btrfs_start_ordered_extent(ordered);
+ btrfs_start_ordered_extent(ordered, 0, 0);
btrfs_put_ordered_extent(ordered);
folio_lock(folio);
/*
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index e8548de319e7..fb6df17fb55c 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -103,7 +103,7 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
*/
if (writing ||
test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags))
- btrfs_start_ordered_extent(ordered);
+ btrfs_start_ordered_extent(ordered, 0, 0);
else
ret = nowait ? -EAGAIN : -ENOTBLK;
btrfs_put_ordered_extent(ordered);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7b0aa332aedc..a6ffee6f6fd9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1081,6 +1081,185 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
return 0;
}
+/*
+ * Check if we can skip waiting for the @ordered extent covering the block
+ * at file pos @cur.
+ *
+ * Return true if we can skip to @next_ret. The caller needs to check
+ * the @next_ret value to make sure it covers the full range, before
+ * skipping the OE.
+ *
+ * Return false if we must wait for the ordered extent.
+ *
+ * @cur: The start file offset that we have locked folio for read.
+ * @next_ret: If we return true, this indicates the next check start
+ * range.
+ */
+static bool can_skip_one_ordered_range(struct btrfs_inode *binode,
+ struct btrfs_ordered_extent *ordered,
+ u64 cur, u64 *next_ret)
+{
+ const struct btrfs_fs_info *fs_info = binode->root->fs_info;
+ struct folio *folio;
+ const u32 blocksize = fs_info->sectorsize;
+ u64 range_len;
+ bool ret;
+
+ folio = filemap_get_folio(binode->vfs_inode.i_mapping,
+ cur >> PAGE_SHIFT);
+
+ /*
+ * We should have locked the folio(s) for range [start, end], thus
+ * there must be a folio and it must be locked.
+ */
+ ASSERT(!IS_ERR(folio));
+ ASSERT(folio_test_locked(folio));
+
+ /*
+ * We have several cases for the folio and OE combination:
+ *
+ * 0) Folio has no private flag
+ * The OE has all its IO done but not yet finished, and folio got
+ * invalidated. Or direct IO.
+ *
+ * Have to wait for the OE to finish, as it may contain the
+ * to-be-inserted data checksum.
+ * Without the data checksum inserted into csum tree, read
+ * will just fail with missing csum.
+ */
+ if (!folio_test_private(folio)) {
+ ret = false;
+ goto out;
+ }
+ range_len = min(folio_pos(folio) + folio_size(folio),
+ ordered->file_offset + ordered->num_bytes) - cur;
+
+ /*
+ * 1) The first block is DIRTY.
+ *
+ * This means the OE is created by some folio before us, but writeback
+ * has not started.
+ * We can and must skip the whole OE, because it will never start until
+ * we finished our folio read and unlocked the folio.
+ */
+ if (btrfs_folio_test_dirty(fs_info, folio, cur, blocksize)) {
+ ret = true;
+ /*
+ * At least inside the folio, all the remaining blocks should
+ * also be dirty.
+ */
+ ASSERT(btrfs_folio_test_dirty(fs_info, folio, cur, range_len));
+ *next_ret = ordered->file_offset + ordered->num_bytes;
+ goto out;
+ }
+
+ /*
+ * 2) The first block is uptodate.
+ *
+ * At least the first block can be skipped, but we are still
+ * not fully sure. E.g. the OE may have some other folios in
+ * the range that can not be skipped.
+ * So we return true and update @next_ret to the OE/folio boundary.
+ */
+ if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
+ u64 range_len = min(folio_pos(folio) + folio_size(folio),
+ ordered->file_offset + ordered->num_bytes) - cur;
+
+ /*
+ * The whole range to the OE end or folio boundary should also
+ * be uptodate.
+ */
+ ASSERT(btrfs_folio_test_uptodate(fs_info, folio, cur, range_len));
+ ret = true;
+ *next_ret = cur + range_len;
+ goto out;
+ }
+
+ /*
+ * 3) The first block is not uptodate.
+ *
+ * This means the folio is invalidated after the OE finished, or direct IO.
+ * Very much the same as case 1), just with private flag set.
+ */
+ ret = false;
+out:
+ folio_put(folio);
+ return ret;
+}
+
+static bool can_skip_ordered_extent(struct btrfs_inode *binode,
+ struct btrfs_ordered_extent *ordered,
+ u64 start, u64 end)
+{
+ u64 range_end = min(end, ordered->file_offset + ordered->num_bytes - 1);
+ u64 range_start = max(start, ordered->file_offset);
+ u64 cur = range_start;
+
+ while (cur < range_end) {
+ bool can_skip;
+ u64 next_start;
+
+ can_skip = can_skip_one_ordered_range(binode, ordered, cur,
+ &next_start);
+ if (!can_skip)
+ return false;
+ cur = next_start;
+ }
+ return true;
+}
+
+/*
+ * To make sure we get a stable view of extent maps for the involved range.
+ * This is for folio read paths (read and readahead), thus involved range
+ * should have all the folios locked.
+ */
+static void lock_extents_for_read(struct btrfs_inode *binode, u64 start, u64 end,
+ struct extent_state **cached_state)
+{
+ struct btrfs_ordered_extent *ordered;
+ u64 cur_pos;
+
+ /* Caller must provide a valid @cached_state. */
+ ASSERT(cached_state);
+
+ /*
+ * The range must at least be page aligned, as all read paths
+ * are folio based.
+ */
+ ASSERT(IS_ALIGNED(start, PAGE_SIZE) && IS_ALIGNED(end + 1, PAGE_SIZE));
+
+again:
+ lock_extent(&binode->io_tree, start, end, cached_state);
+ cur_pos = start;
+ while (cur_pos < end) {
+ ordered = btrfs_lookup_ordered_range(binode, cur_pos,
+ end - cur_pos + 1);
+ /*
+ * No ordered extents in the range, and we hold the
+ * extent lock, no one can modify the extent maps
+ * in the range, we're safe to return.
+ */
+ if (!ordered)
+ break;
+
+ /* Check if we can skip waiting for the whole OE. */
+ if (can_skip_ordered_extent(binode, ordered, start, end)) {
+ cur_pos = min(ordered->file_offset + ordered->num_bytes,
+ end + 1);
+ btrfs_put_ordered_extent(ordered);
+ continue;
+ }
+
+ /* Now wait for the OE to finish. */
+ unlock_extent(&binode->io_tree, start, end,
+ cached_state);
+ btrfs_start_ordered_extent(ordered, start, end + 1 - start);
+ btrfs_put_ordered_extent(ordered);
+ /* We have unlocked the whole range, restart from the beginning. */
+ goto again;
+ }
+}
+
int btrfs_read_folio(struct file *file, struct folio *folio)
{
struct btrfs_inode *inode = folio_to_inode(folio);
@@ -1091,7 +1270,7 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
struct extent_map *em_cached = NULL;
int ret;
- btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
+ lock_extents_for_read(inode, start, end, &cached_state);
ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
unlock_extent(&inode->io_tree, start, end, &cached_state);
@@ -2380,7 +2559,7 @@ void btrfs_readahead(struct readahead_control *rac)
struct extent_map *em_cached = NULL;
u64 prev_em_start = (u64)-1;
- btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
+ lock_extents_for_read(inode, start, end, &cached_state);
while ((folio = readahead_folio(rac)) != NULL)
btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e87d4a37c929..00c68b7b2206 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -941,7 +941,7 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct folio *folio,
cached_state);
folio_unlock(folio);
folio_put(folio);
- btrfs_start_ordered_extent(ordered);
+ btrfs_start_ordered_extent(ordered, 0, 0);
btrfs_put_ordered_extent(ordered);
return -EAGAIN;
}
@@ -1855,7 +1855,7 @@ static vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
unlock_extent(io_tree, page_start, page_end, &cached_state);
folio_unlock(folio);
up_read(&BTRFS_I(inode)->i_mmap_lock);
- btrfs_start_ordered_extent(ordered);
+ btrfs_start_ordered_extent(ordered, 0, 0);
btrfs_put_ordered_extent(ordered);
goto again;
}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0efe9f005149..e99cb5109967 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2821,7 +2821,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
unlock_extent(&inode->io_tree, page_start, page_end,
&cached_state);
folio_unlock(folio);
- btrfs_start_ordered_extent(ordered);
+ btrfs_start_ordered_extent(ordered, 0, 0);
btrfs_put_ordered_extent(ordered);
goto again;
}
@@ -4873,7 +4873,7 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
unlock_extent(io_tree, block_start, block_end, &cached_state);
folio_unlock(folio);
folio_put(folio);
- btrfs_start_ordered_extent(ordered);
+ btrfs_start_ordered_extent(ordered, 0, 0);
btrfs_put_ordered_extent(ordered);
goto again;
}
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 4aca7475fd82..6075a6fa4817 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -729,7 +729,7 @@ static void btrfs_run_ordered_extent_work(struct btrfs_work *work)
struct btrfs_ordered_extent *ordered;
ordered = container_of(work, struct btrfs_ordered_extent, flush_work);
- btrfs_start_ordered_extent(ordered);
+ btrfs_start_ordered_extent(ordered, 0, 0);
complete(&ordered->completion);
}
@@ -842,10 +842,12 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
/*
* Start IO and wait for a given ordered extent to finish.
*
- * Wait on page writeback for all the pages in the extent and the IO completion
- * code to insert metadata into the btree corresponding to the extent.
+ * Wait on page writeback for all the pages in the extent but not in
+ * [@nowriteback_start, @nowriteback_start + @nowriteback_len) and the
+ * IO completion code to insert metadata into the btree corresponding to the extent.
*/
-void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
+void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry,
+ u64 nowriteback_start, u32 nowriteback_len)
{
u64 start = entry->file_offset;
u64 end = start + entry->num_bytes - 1;
@@ -865,8 +867,19 @@ void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
* start IO on any dirty ones so the wait doesn't stall waiting
* for the flusher thread to find them
*/
- if (!test_bit(BTRFS_ORDERED_DIRECT, &entry->flags))
- filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start, end);
+ if (!test_bit(BTRFS_ORDERED_DIRECT, &entry->flags)) {
+ if (!nowriteback_len) {
+ filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start, end);
+ } else {
+ if (start < nowriteback_start)
+ filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start,
+ nowriteback_start - 1);
+ if (nowriteback_start + nowriteback_len < end)
+ filemap_fdatawrite_range(inode->vfs_inode.i_mapping,
+ nowriteback_start + nowriteback_len,
+ end);
+ }
+ }
if (!freespace_inode)
btrfs_might_wait_for_event(inode->root->fs_info, btrfs_ordered_extent);
@@ -921,7 +934,7 @@ int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len)
btrfs_put_ordered_extent(ordered);
break;
}
- btrfs_start_ordered_extent(ordered);
+ btrfs_start_ordered_extent(ordered, 0, 0);
end = ordered->file_offset;
/*
* If the ordered extent had an error save the error but don't
@@ -1174,7 +1187,7 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
break;
}
unlock_extent(&inode->io_tree, start, end, cachedp);
- btrfs_start_ordered_extent(ordered);
+ btrfs_start_ordered_extent(ordered, 0, 0);
btrfs_put_ordered_extent(ordered);
}
}
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index be36083297a7..5e17f03d43c9 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -192,7 +192,8 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
struct btrfs_ordered_sum *sum);
struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
u64 file_offset);
-void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry);
+void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry,
+ u64 nowriteback_start, u32 nowriteback_len);
int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len);
struct btrfs_ordered_extent *
btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset);
--
2.48.1
* [PATCH 4/7] btrfs: make btrfs_do_readpage() to do block-by-block read
2025-02-23 23:46 [PATCH 0/7] btrfs: make subpage handling be feature full Qu Wenruo
` (2 preceding siblings ...)
2025-02-23 23:46 ` [PATCH 3/7] btrfs: introduce a read path dedicated extent lock helper Qu Wenruo
@ 2025-02-23 23:46 ` Qu Wenruo
2025-02-25 17:09 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 5/7] btrfs: allow buffered write to avoid full page read if it's block aligned Qu Wenruo
` (2 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2025-02-23 23:46 UTC (permalink / raw)
To: linux-btrfs
Currently if a btrfs filesystem has its block size (the older sector
size) smaller than the page size, btrfs_do_readpage() will handle the
range extent by extent. This is good for performance as it doesn't need
to re-lookup the same extent map again and again.
(Although __get_extent_map() already does an extra cached em check, so
the optimization is not that obvious.)
This is totally fine and a valid optimization, but it assumes that
there is no partially uptodate range in the page.
Meanwhile there is an incoming feature, requiring btrfs to skip the full
page read if a buffered write range covers a full block but not a full
page.
In that case, we can have a page that is partially uptodate, and the
current per-extent lookup can not handle such a case.
So here we change btrfs_do_readpage() to do block-by-block reads, which
simplifies the following things:
- Remove the need for @iosize variable
Because we just use sectorsize as our increment.
- Remove @pg_offset, and calculate it inside the loop when needed
It's just offset_in_folio().
- Use a for() loop instead of a while() loop
This will slightly reduce the read performance for subpage cases, but
for the future where we need to skip already uptodate blocks, it should
still be worthwhile.
For block size == page size, this brings no performance change.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 37 ++++++++++++-------------------------
1 file changed, 12 insertions(+), 25 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a6ffee6f6fd9..b3a4a94212c9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -947,9 +947,7 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
u64 last_byte = i_size_read(inode);
struct extent_map *em;
int ret = 0;
- size_t pg_offset = 0;
- size_t iosize;
- size_t blocksize = fs_info->sectorsize;
+ const size_t blocksize = fs_info->sectorsize;
ret = set_folio_extent_mapped(folio);
if (ret < 0) {
@@ -960,24 +958,23 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
if (folio_contains(folio, last_byte >> PAGE_SHIFT)) {
size_t zero_offset = offset_in_folio(folio, last_byte);
- if (zero_offset) {
- iosize = folio_size(folio) - zero_offset;
- folio_zero_range(folio, zero_offset, iosize);
- }
+ if (zero_offset)
+ folio_zero_range(folio, zero_offset,
+ folio_size(folio) - zero_offset);
}
bio_ctrl->end_io_func = end_bbio_data_read;
begin_folio_read(fs_info, folio);
- while (cur <= end) {
+ for (cur = start; cur <= end; cur += blocksize) {
enum btrfs_compression_type compress_type = BTRFS_COMPRESS_NONE;
+ unsigned long pg_offset = offset_in_folio(folio, cur);
bool force_bio_submit = false;
u64 disk_bytenr;
u64 block_start;
ASSERT(IS_ALIGNED(cur, fs_info->sectorsize));
if (cur >= last_byte) {
- iosize = folio_size(folio) - pg_offset;
- folio_zero_range(folio, pg_offset, iosize);
- end_folio_read(folio, true, cur, iosize);
+ folio_zero_range(folio, pg_offset, end - cur + 1);
+ end_folio_read(folio, true, cur, end - cur + 1);
break;
}
em = get_extent_map(BTRFS_I(inode), folio, cur, end - cur + 1, em_cached);
@@ -991,8 +988,6 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
compress_type = extent_map_compression(em);
- iosize = min(extent_map_end(em) - cur, end - cur + 1);
- iosize = ALIGN(iosize, blocksize);
if (compress_type != BTRFS_COMPRESS_NONE)
disk_bytenr = em->disk_bytenr;
else
@@ -1050,18 +1045,13 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
/* we've found a hole, just zero and go on */
if (block_start == EXTENT_MAP_HOLE) {
- folio_zero_range(folio, pg_offset, iosize);
-
- end_folio_read(folio, true, cur, iosize);
- cur = cur + iosize;
- pg_offset += iosize;
+ folio_zero_range(folio, pg_offset, blocksize);
+ end_folio_read(folio, true, cur, blocksize);
continue;
}
/* the get_extent function already copied into the folio */
if (block_start == EXTENT_MAP_INLINE) {
- end_folio_read(folio, true, cur, iosize);
- cur = cur + iosize;
- pg_offset += iosize;
+ end_folio_read(folio, true, cur, blocksize);
continue;
}
@@ -1072,12 +1062,9 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
if (force_bio_submit)
submit_one_bio(bio_ctrl);
- submit_extent_folio(bio_ctrl, disk_bytenr, folio, iosize,
+ submit_extent_folio(bio_ctrl, disk_bytenr, folio, blocksize,
pg_offset);
- cur = cur + iosize;
- pg_offset += iosize;
}
-
return 0;
}
--
2.48.1
* [PATCH 5/7] btrfs: allow buffered write to avoid full page read if it's block aligned
2025-02-23 23:46 [PATCH 0/7] btrfs: make subpage handling be feature full Qu Wenruo
` (3 preceding siblings ...)
2025-02-23 23:46 ` [PATCH 4/7] btrfs: make btrfs_do_readpage() to do block-by-block read Qu Wenruo
@ 2025-02-23 23:46 ` Qu Wenruo
2025-02-25 17:55 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 6/7] btrfs: allow inline data extents creation if block size < page size Qu Wenruo
2025-02-23 23:46 ` [PATCH 7/7] btrfs: remove the subpage related warning message Qu Wenruo
6 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2025-02-23 23:46 UTC (permalink / raw)
To: linux-btrfs
[BUG]
Since the support of block size (sector size) < page size for btrfs,
test case generic/563 fails with 4K block size and 64K page size:
--- tests/generic/563.out 2024-04-25 18:13:45.178550333 +0930
+++ /home/adam/xfstests-dev/results//generic/563.out.bad 2024-09-30 09:09:16.155312379 +0930
@@ -3,7 +3,8 @@
read is in range
write is in range
write -> read/write
-read is in range
+read has value of 8388608
+read is NOT in range -33792 .. 33792
write is in range
...
[CAUSE]
The test case creates an 8MiB file, then does buffered writes over the
whole 8MiB using a 4K block size.
On 4K page sized systems, since the write range covers full blocks and
pages, btrfs does not bother reading the pages, just like XFS and EXT4.
But on 64K page sized systems, although the 4K sized write is still block
aligned, it is no longer page aligned, thus btrfs will read the full
page, causing more reads than expected and failing the test case.
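The mismatch can be sketched with a minimal model (the function name and shape are illustrative, not kernel code): the same 4K write at a 4K-aligned offset passes the skip check when measured against the 4K block size, but fails it against a 64K page size, which is what forces the extra read:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the alignment check: a buffered write can skip
 * the read-modify-write of the page only when its start and end are both
 * aligned to the chosen granularity (block size vs. page size).
 */
static bool write_needs_page_read(uint64_t pos, uint64_t len,
                                  uint32_t granularity)
{
	/* Aligned start and end mean every covered unit is fully overwritten. */
	return (pos % granularity) != 0 || ((pos + len) % granularity) != 0;
}
```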
[FIX]
To skip the full page read, we need to make the following modifications:
- Do not trigger full page read as long as the buffered write is block
aligned
This is pretty simple by modifying the check inside
prepare_uptodate_page().
- Skip already uptodate blocks during full page read
Otherwise we can hit the following data corruption:
0 32K 64K
|///////| |
Where the file range [0, 32K) is dirtied by a buffered write, and the
remaining range [32K, 64K) is not.
When reading the full page, since [0, 32K) is only dirtied but not yet
written back, there is no data extent map for it, only a hole covering
[0, 64K).
If we continue reading the full page range [0, 64K), the dirtied range
will be filled with 0 (since there is only a hole covering the whole
range).
This causes the dirtied range to get lost.
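The second point can be illustrated with a toy model (all names and structures here are hypothetical, not the btrfs implementation): a block-by-block read that skips already uptodate blocks preserves the dirtied-but-unwritten data, while an unconditional fill would zero it out:

```c
#include <stdbool.h>
#include <stdint.h>

#define NBLOCKS 16	/* e.g. a 64K folio with 4K blocks */

/* Toy stand-in for a folio with per-block uptodate tracking. */
struct toy_folio {
	bool uptodate[NBLOCKS];
	uint8_t data[NBLOCKS];
};

/* The fixed behavior: skip blocks that are already uptodate. */
static void toy_readpage(struct toy_folio *folio)
{
	for (int i = 0; i < NBLOCKS; i++) {
		if (folio->uptodate[i])
			continue;	/* keep dirtied-but-unwritten blocks */
		folio->data[i] = 0;	/* backing hole: fill with zeros */
		folio->uptodate[i] = true;
	}
}

/* The broken behavior: fill the whole folio from the hole. */
static void toy_readpage_buggy(struct toy_folio *folio)
{
	for (int i = 0; i < NBLOCKS; i++) {
		folio->data[i] = 0;
		folio->uptodate[i] = true;
	}
}

/* Dirty [0, 32K) as a block-aligned write would, then read the folio. */
static uint8_t toy_scenario(bool fixed)
{
	struct toy_folio folio = {0};

	for (int i = 0; i < NBLOCKS / 2; i++) {
		folio.uptodate[i] = true;
		folio.data[i] = 0xAB;
	}
	if (fixed)
		toy_readpage(&folio);
	else
		toy_readpage_buggy(&folio);
	return folio.data[0];
}
```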
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/extent_io.c | 4 ++++
fs/btrfs/file.c | 5 +++--
2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b3a4a94212c9..d7240e295bfc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -977,6 +977,10 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
end_folio_read(folio, true, cur, end - cur + 1);
break;
}
+ if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
+ end_folio_read(folio, true, cur, blocksize);
+ continue;
+ }
em = get_extent_map(BTRFS_I(inode), folio, cur, end - cur + 1, em_cached);
if (IS_ERR(em)) {
end_folio_read(folio, false, cur, end + 1 - cur);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 00c68b7b2206..e3d63192281d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -804,14 +804,15 @@ static int prepare_uptodate_folio(struct inode *inode, struct folio *folio, u64
{
u64 clamp_start = max_t(u64, pos, folio_pos(folio));
u64 clamp_end = min_t(u64, pos + len, folio_pos(folio) + folio_size(folio));
+ const u32 sectorsize = inode_to_fs_info(inode)->sectorsize;
int ret = 0;
if (folio_test_uptodate(folio))
return 0;
if (!force_uptodate &&
- IS_ALIGNED(clamp_start, PAGE_SIZE) &&
- IS_ALIGNED(clamp_end, PAGE_SIZE))
+ IS_ALIGNED(clamp_start, sectorsize) &&
+ IS_ALIGNED(clamp_end, sectorsize))
return 0;
ret = btrfs_read_folio(NULL, folio);
--
2.48.1
* [PATCH 6/7] btrfs: allow inline data extents creation if block size < page size
2025-02-23 23:46 [PATCH 0/7] btrfs: make subpage handling be feature full Qu Wenruo
` (4 preceding siblings ...)
2025-02-23 23:46 ` [PATCH 5/7] btrfs: allow buffered write to avoid full page read if it's block aligned Qu Wenruo
@ 2025-02-23 23:46 ` Qu Wenruo
2025-02-25 18:03 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 7/7] btrfs: remove the subpage related warning message Qu Wenruo
6 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2025-02-23 23:46 UTC (permalink / raw)
To: linux-btrfs
Previously inline data extent creation was disabled if the block size
(previously called sector size) is smaller than the page size, for the
following reasons:
- Possible mixed inline and regular data extents
However this can also happen when the block size matches the page size,
and we do not treat mixed inline and regular extents as an error.
Nor does subpage increase the chance of creating them; the requirement
is the same (a compressed inline data extent covering the whole first
block, followed by regular extents).
- Unable to handle async/inline delalloc range for block size < page
size cases
This is already fixed since commit 1d2fbb7f1f9e ("btrfs: allow
compression even if the range is not page aligned").
This was the major technical blockage, but it is no longer one.
With the major technical blockage removed, we can enable inline data
extent creation regardless of the block and page size, giving btrfs the
same capability for all block sizes.
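As a rough sketch, the eligibility logic left after this patch per the description above looks like the following (simplified and illustrative; it omits the other conditions checked in can_cow_file_range_inline(), and the function is a hypothetical model, not the kernel code):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * After removing the sectorsize != PAGE_SIZE bailout, inline extent
 * eligibility no longer depends on the page size at all: only the file
 * offset and the data size relative to the sector size matter.
 */
static bool can_inline(uint64_t offset, uint64_t size, uint32_t sectorsize)
{
	/* Inline extents must start at file offset 0. */
	if (offset != 0)
		return false;
	/* Inline extents are limited to sectorsize. */
	if (size > sectorsize)
		return false;
	return true;
}
```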
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/inode.c | 13 -------------
1 file changed, 13 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e99cb5109967..a1ea93bad80e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -566,19 +566,6 @@ static bool can_cow_file_range_inline(struct btrfs_inode *inode,
if (offset != 0)
return false;
- /*
- * Due to the page size limit, for subpage we can only trigger the
- * writeback for the dirty sectors of page, that means data writeback
- * is doing more writeback than what we want.
- *
- * This is especially unexpected for some call sites like fallocate,
- * where we only increase i_size after everything is done.
- * This means we can trigger inline extent even if we didn't want to.
- * So here we skip inline extent creation completely.
- */
- if (fs_info->sectorsize != PAGE_SIZE)
- return false;
-
/* Inline extents are limited to sectorsize. */
if (size > fs_info->sectorsize)
return false;
--
2.48.1
* [PATCH 7/7] btrfs: remove the subpage related warning message
2025-02-23 23:46 [PATCH 0/7] btrfs: make subpage handling be feature full Qu Wenruo
` (5 preceding siblings ...)
2025-02-23 23:46 ` [PATCH 6/7] btrfs: allow inline data extents creation if block size < page size Qu Wenruo
@ 2025-02-23 23:46 ` Qu Wenruo
2025-02-25 18:10 ` Filipe Manana
6 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2025-02-23 23:46 UTC (permalink / raw)
To: linux-btrfs
Since the initial enablement of block size < page size (subpage) support
for btrfs in v5.15, we have hit several milestones:
- RAID56 subpage support
In v5.19
- Refactored scrub support to support subpage better
In v6.4
- Block perfect (previously requires page aligned ranges) compressed write
In v6.13
- Various error handling fixes involving subpage
In v6.14
The only remaining missing feature was the pretty simple and harmless
inline data extent creation, added in the previous patches.
Now that btrfs has all of its features ready for both regular and
subpage cases, there is no reason to warn about the experimental subpage
support, so finally remove the message.
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/disk-io.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a799216aa264..c0b40dedceb5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3414,11 +3414,6 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
*/
fs_info->max_inline = min_t(u64, fs_info->max_inline, fs_info->sectorsize);
- if (sectorsize < PAGE_SIZE)
- btrfs_warn(fs_info,
- "read-write for sector size %u with page size %lu is experimental",
- sectorsize, PAGE_SIZE);
-
ret = btrfs_init_workqueues(fs_info);
if (ret)
goto fail_sb_buffer;
--
2.48.1
* Re: [PATCH 1/7] btrfs: prevent inline data extents read from touching blocks beyond its range
2025-02-23 23:46 ` [PATCH 1/7] btrfs: prevent inline data extents read from touching blocks beyond its range Qu Wenruo
@ 2025-02-25 15:07 ` Filipe Manana
2025-02-25 16:10 ` Filipe Manana
0 siblings, 1 reply; 22+ messages in thread
From: Filipe Manana @ 2025-02-25 15:07 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Sun, Feb 23, 2025 at 11:46 PM Qu Wenruo <wqu@suse.com> wrote:
>
> Currently reading an inline data extent will zero out all the remaining
> range in the page.
>
> This is not yet causing problems even for block size < page size
> (subpage) cases because:
>
> 1) An inline data extent always starts at file offset 0
> Meaning at page read, we always read the inline extent first, before
> any other blocks in the page. Then later blocks are properly read out
> and re-fill the zeroed out ranges.
>
> 2) Currently btrfs will read out the whole page if a buffered write is
> not page aligned
> So a page is either fully uptodate at buffered write time (covers the
> whole page), or we will read out the whole page first.
> Meaning there is nothing to lose for such an inline extent read.
>
> But it's still not ideal:
>
> - We're zeroing out the page twice
> One done by read_inline_extent()/uncompress_inline(), one done by
> btrfs_do_readpage() for ranges beyond i_size.
>
> - We're touching blocks that don't belong to the inline extent
> In the incoming patches, we can have a partially uptodate folio, where
> some dirty blocks exist while the folio is not fully uptodate:
>
> The page size is 16K and block size is 4K:
>
> 0 4K 8K 12K 16K
> | | |/////////| |
>
> And range [8K, 12K) is dirtied by a buffered write, the remaining
> blocks are not uptodate.
>
> If range [0, 4K) contains an inline data extent, and we try to read
> the whole page, the current behavior will overwrite range [8K, 12K)
> with zero and cause data loss.
>
> So to make the behavior more consistent and in preparation for future
> changes, limit the inline data extents read to only zero out the range
> inside the first block, not the whole page.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/inode.c | 15 +++++++++------
> 1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c432ccfba56e..fe2c6038064a 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6788,6 +6788,7 @@ static noinline int uncompress_inline(struct btrfs_path *path,
> {
> int ret;
> struct extent_buffer *leaf = path->nodes[0];
> + const u32 sectorsize = leaf->fs_info->sectorsize;
> char *tmp;
> size_t max_size;
> unsigned long inline_size;
> @@ -6816,17 +6817,19 @@ static noinline int uncompress_inline(struct btrfs_path *path,
> * cover that region here.
> */
>
> - if (max_size < PAGE_SIZE)
> - folio_zero_range(folio, max_size, PAGE_SIZE - max_size);
> + if (max_size < sectorsize)
> + folio_zero_range(folio, max_size, sectorsize - max_size);
> kfree(tmp);
> return ret;
> }
>
> -static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
> +static int read_inline_extent(struct btrfs_fs_info *fs_info,
> + struct btrfs_path *path, struct folio *folio)
> {
> struct btrfs_file_extent_item *fi;
> void *kaddr;
> size_t copy_size;
> + const u32 sectorsize = fs_info->sectorsize;
There's no need to add the fs_info argument just to get the sectorsize.
We can get it like this:
const u32 sectorsize = path->nodes[0]->fs_info->sectorsize;
Otherwise it looks good, so:
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks.
>
> if (!folio || folio_test_uptodate(folio))
> return 0;
> @@ -6844,8 +6847,8 @@ static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
> read_extent_buffer(path->nodes[0], kaddr,
> btrfs_file_extent_inline_start(fi), copy_size);
> kunmap_local(kaddr);
> - if (copy_size < PAGE_SIZE)
> - folio_zero_range(folio, copy_size, PAGE_SIZE - copy_size);
> + if (copy_size < sectorsize)
> + folio_zero_range(folio, copy_size, sectorsize - copy_size);
> return 0;
> }
>
> @@ -7020,7 +7023,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
> ASSERT(em->disk_bytenr == EXTENT_MAP_INLINE);
> ASSERT(em->len == fs_info->sectorsize);
>
> - ret = read_inline_extent(path, folio);
> + ret = read_inline_extent(fs_info, path, folio);
> if (ret < 0)
> goto out;
> goto insert;
> --
> 2.48.1
>
>
* Re: [PATCH 2/7] btrfs: fix the qgroup data free range for inline data extents
2025-02-23 23:46 ` [PATCH 2/7] btrfs: fix the qgroup data free range for inline data extents Qu Wenruo
@ 2025-02-25 15:11 ` Filipe Manana
0 siblings, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2025-02-25 15:11 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Sun, Feb 23, 2025 at 11:47 PM Qu Wenruo <wqu@suse.com> wrote:
>
> Inside function __cow_file_range_inline() since the inlined data no
> longer takes any data space, we need to free up the reserved space.
>
> However the code is still using the old page size == sector size
> assumption, and will not handle subpage case well.
>
> Thankfully it is not going to cause any problems because we have two extra
> safety nets:
>
> - Inline data extents creation is disabled for sector size < page size
> cases for now
> But it won't stay that way for long.
>
> - btrfs_qgroup_free_data() will only clear ranges which are already
> reserved
> So even if we pass a range larger than what we need, it should still
> be fine, especially since there is only reserved space for a single
> block at file offset 0 for an inline data extent.
>
> But just for the sake of consistency, fix the call site to use
> sectorsize instead of page size.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Looks good, thanks.
> ---
> fs/btrfs/inode.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index fe2c6038064a..0efe9f005149 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -672,7 +672,7 @@ static noinline int __cow_file_range_inline(struct btrfs_inode *inode,
> * And at reserve time, it's always aligned to page size, so
> * just free one page here.
> */
> - btrfs_qgroup_free_data(inode, NULL, 0, PAGE_SIZE, NULL);
> + btrfs_qgroup_free_data(inode, NULL, 0, fs_info->sectorsize, NULL);
> btrfs_free_path(path);
> btrfs_end_transaction(trans);
> return ret;
> --
> 2.48.1
>
>
* Re: [PATCH 1/7] btrfs: prevent inline data extents read from touching blocks beyond its range
2025-02-25 15:07 ` Filipe Manana
@ 2025-02-25 16:10 ` Filipe Manana
0 siblings, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2025-02-25 16:10 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Tue, Feb 25, 2025 at 3:07 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Sun, Feb 23, 2025 at 11:46 PM Qu Wenruo <wqu@suse.com> wrote:
> >
> > Currently reading an inline data extent will zero out all the remaining
> > range in the page.
> >
> > This is not yet causing problems even for block size < page size
> > (subpage) cases because:
> >
> > 1) An inline data extent always starts at file offset 0
> > Meaning at page read, we always read the inline extent first, before
> > any other blocks in the page. Then later blocks are properly read out
> > and re-fill the zeroed out ranges.
> >
> > 2) Currently btrfs will read out the whole page if a buffered write is
> > not page aligned
> > So a page is either fully uptodate at buffered write time (covers the
> > whole page), or we will read out the whole page first.
> > Meaning there is nothing to lose for such an inline extent read.
> >
> > But it's still not ideal:
> >
> > - We're zeroing out the page twice
> > One done by read_inline_extent()/uncompress_inline(), one done by
> > btrfs_do_readpage() for ranges beyond i_size.
> >
> > - We're touching blocks that don't belong to the inline extent
> > In the incoming patches, we can have a partially uptodate folio, where
> > some dirty blocks exist while the folio is not fully uptodate:
> >
> > The page size is 16K and block size is 4K:
> >
> > 0 4K 8K 12K 16K
> > | | |/////////| |
> >
> > And range [8K, 12K) is dirtied by a buffered write, the remaining
> > blocks are not uptodate.
> >
> > If range [0, 4K) contains an inline data extent, and we try to read
> > the whole page, the current behavior will overwrite range [8K, 12K)
> > with zero and cause data loss.
> >
> > So to make the behavior more consistent and in preparation for future
> > changes, limit the inline data extents read to only zero out the range
> > inside the first block, not the whole page.
> >
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > ---
> > fs/btrfs/inode.c | 15 +++++++++------
> > 1 file changed, 9 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index c432ccfba56e..fe2c6038064a 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -6788,6 +6788,7 @@ static noinline int uncompress_inline(struct btrfs_path *path,
> > {
> > int ret;
> > struct extent_buffer *leaf = path->nodes[0];
> > + const u32 sectorsize = leaf->fs_info->sectorsize;
> > char *tmp;
> > size_t max_size;
> > unsigned long inline_size;
> > @@ -6816,17 +6817,19 @@ static noinline int uncompress_inline(struct btrfs_path *path,
> > * cover that region here.
> > */
> >
> > - if (max_size < PAGE_SIZE)
> > - folio_zero_range(folio, max_size, PAGE_SIZE - max_size);
> > + if (max_size < sectorsize)
> > + folio_zero_range(folio, max_size, sectorsize - max_size);
Right above this we have:
max_size = min_t(unsigned long, PAGE_SIZE, max_size);
ret = btrfs_decompress(compress_type, tmp, folio, 0, inline_size,
max_size);
Should't we replace PAGE_SIZE with sectorsize too? Why not?
And there's a similar case at read_inline_extent() too.
Thanks.
> > kfree(tmp);
> > return ret;
> > }
> >
> > -static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
> > +static int read_inline_extent(struct btrfs_fs_info *fs_info,
> > + struct btrfs_path *path, struct folio *folio)
> > {
> > struct btrfs_file_extent_item *fi;
> > void *kaddr;
> > size_t copy_size;
> > + const u32 sectorsize = fs_info->sectorsize;
>
> There's no need to add the fs_info argument just to get the sectorsize.
> We can get it like this:
>
> const u32 sectorsize = path->nodes[0]->fs_info->sectorsize;
>
> Otherwise it looks good, so:
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>
> Thanks.
>
>
>
> >
> > if (!folio || folio_test_uptodate(folio))
> > return 0;
> > @@ -6844,8 +6847,8 @@ static int read_inline_extent(struct btrfs_path *path, struct folio *folio)
> > read_extent_buffer(path->nodes[0], kaddr,
> > btrfs_file_extent_inline_start(fi), copy_size);
> > kunmap_local(kaddr);
> > - if (copy_size < PAGE_SIZE)
> > - folio_zero_range(folio, copy_size, PAGE_SIZE - copy_size);
> > + if (copy_size < sectorsize)
> > + folio_zero_range(folio, copy_size, sectorsize - copy_size);
> > return 0;
> > }
> >
> > @@ -7020,7 +7023,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
> > ASSERT(em->disk_bytenr == EXTENT_MAP_INLINE);
> > ASSERT(em->len == fs_info->sectorsize);
> >
> > - ret = read_inline_extent(path, folio);
> > + ret = read_inline_extent(fs_info, path, folio);
> > if (ret < 0)
> > goto out;
> > goto insert;
> > --
> > 2.48.1
> >
> >
* Re: [PATCH 3/7] btrfs: introduce a read path dedicated extent lock helper
2025-02-23 23:46 ` [PATCH 3/7] btrfs: introduce a read path dedicated extent lock helper Qu Wenruo
@ 2025-02-25 17:05 ` Filipe Manana
2025-02-25 21:12 ` Qu Wenruo
0 siblings, 1 reply; 22+ messages in thread
From: Filipe Manana @ 2025-02-25 17:05 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Sun, Feb 23, 2025 at 11:47 PM Qu Wenruo <wqu@suse.com> wrote:
>
> Currently we're using btrfs_lock_and_flush_ordered_range() for both
> btrfs_read_folio() and btrfs_readahead(), but it has one critical
> problem for future subpage enhancements:
>
> - It will call btrfs_start_ordered_extent() to writeback the involved
> folios
>
> But remember we're calling btrfs_lock_and_flush_ordered_range() at
> read paths, meaning the folio is already locked by read path.
>
> If we really trigger writeback, this will lead to a deadlock and
> writeback can not hold the folio lock.
>
> Such deadlock is prevented by the fact that btrfs always keeps a
> dirty folio also uptodate, by either dirtying all blocks of the folio,
> or read the folio before dirtying.
>
> But it's not the common behavior, as XFS/EXT4 both allow the folio to
> be dirty but not uptodate, by allowing buffered write to dirty a full
> block without reading the full folio.
Ok, but what happens in other filesystems doesn't affect us.
Or did you forget to mention that some other subsequent patch or planned changes
will introduce that behaviour?
If so, it would be best to say that, otherwise that paragraph isn't relevant.
>
> Instead of blindly calling btrfs_start_ordered_extent(), introduce a
> newer helper, which is smarter in the following ways:
>
> - Only wait and flush the ordered extent if
> * The folio doesn't even have private set
> * Part of the blocks of the ordered extent is not uptodate
is -> are
>
> This can happen by:
> * Folio writeback finished, then get invalidated. But OE not yet
> finished
Ok, this one I understand.
> * Direct IO
This one I don't because direct IO doesn't use the page cache.
Can you elaborate here and explain why it can happen with direct IO?
>
> We have to wait for the ordered extent, as it may contain
> to-be-inserted data checksum.
> Without waiting, our read can fail due to the missing csum.
>
> But either way, the OE should not need any extra flush inside the
> locked folio range.
>
> - Skip the ordered extent completely if
> * All the blocks are dirty
> This happens when OE creation is caused by previous folio.
"OE creation is caused by previous folio" - what does this mean?
You mean by a write that happened before the read?
> The writeback will never happen (we're holding the folio lock for
> read), nor will the OE finish.
>
> Thus we must skip the range.
>
> * All the blocks are uptodate
> This happens when the writeback finished, but OE not yet finished.
>
> Since the blocks are already uptodate, we can skip the OE range.
>
> The newer helper, lock_extents_for_read() will do a loop for the target
> range by:
>
> 1) Lock the full range
>
> 2) If there is no ordered extent in the remaining range, exit
>
> 3) If there is an ordered extent that we can skip
> Skip to the end of the OE, and continue checking
> We do not trigger writeback nor wait for the OE.
>
> 4) If there is an ordered extent that we can not skip
> Unlock the whole extent range and start the ordered extent.
>
> And also update btrfs_start_ordered_extent() to add two more parameters:
> @nowriteback_start and @nowriteback_len, to prevent triggering flush for
> a certain range.
>
> This will allow us to handle the following case properly in the future:
>
> 16K page size, 4K btrfs block size:
>
> 16K 20K 24K 28K 32K
> |/////////////////////////////| | |
> |<----------- OE 2----------->| |<--- OE 1 --->|
>
> The folio has been written back before, thus we have an OE at
> [28K, 32K).
> Although the OE 1 finished its IO, the OE is not yet removed from IO
> tree.
> Later the folio got invalidated, but OE still exists.
This last sentence to be less confusing: "The folio got invalited
after writeback completed and before the ordered extent finished."
>
> And [16K, 24K) range is dirty and uptodate, caused by a block aligned
> buffered write (and future enhancements allowing btrfs to skip full
> folio read for such case).
>
> Furthermore, OE 2 is created covering range [16K, 24K) by the writeback
> of previous folio.
What previous folio? There's no other one in the example.
Do you mean there's a folio for range [0, 16K) and running delalloc
resulted in the creation of an OE for the range [0, 24K)? So that OE 2
doesn't really start at 16K but at 0 instead?
>
> Since the full folio is not uptodate, if we want to read the folio,
> the existing btrfs_lock_and_flush_ordered_range() will dead lock, by:
>
> btrfs_read_folio()
> | Folio 16K is already locked
> |- btrfs_lock_and_flush_ordered_range()
> |- btrfs_start_ordered_extent() for range [16K, 24K)
> |- filemap_fdatawrite_range() for range [16K, 24K)
> |- extent_write_cache_pages()
> folio_lock() on folio 16K, deadlock.
>
> But now we will have the following sequence:
>
> btrfs_read_folio()
> | Folio 16K is already locked
> |- lock_extents_for_read()
> |- can_skip_ordered_extent() for range [16K, 24K)
> | Returned true, the range [16K, 24K) will be skipped.
> |- can_skip_ordered_extent() for range [28K, 32K)
> | Returned false.
> |- btrfs_start_ordered_extent() for range [28K, 32K) with
> [16K, 32K) as no writeback range
> No writeback for folio 16K will be triggered.
>
> And there will be no more possible deadlock on the same folio.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/defrag.c | 2 +-
> fs/btrfs/direct-io.c | 2 +-
> fs/btrfs/extent_io.c | 183 +++++++++++++++++++++++++++++++++++++++-
> fs/btrfs/file.c | 4 +-
> fs/btrfs/inode.c | 4 +-
> fs/btrfs/ordered-data.c | 29 +++++--
> fs/btrfs/ordered-data.h | 3 +-
> 7 files changed, 210 insertions(+), 17 deletions(-)
>
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index 18f0704263f3..4b89094da3de 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -902,7 +902,7 @@ static struct folio *defrag_prepare_one_folio(struct btrfs_inode *inode, pgoff_t
> break;
>
> folio_unlock(folio);
> - btrfs_start_ordered_extent(ordered);
> + btrfs_start_ordered_extent(ordered, 0, 0);
> btrfs_put_ordered_extent(ordered);
> folio_lock(folio);
> /*
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index e8548de319e7..fb6df17fb55c 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -103,7 +103,7 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
> */
> if (writing ||
> test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags))
> - btrfs_start_ordered_extent(ordered);
> + btrfs_start_ordered_extent(ordered, 0, 0);
> else
> ret = nowait ? -EAGAIN : -ENOTBLK;
> btrfs_put_ordered_extent(ordered);
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 7b0aa332aedc..a6ffee6f6fd9 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1081,6 +1081,185 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> return 0;
> }
>
> +/*
> + * Check if we can skip waiting the @ordered extent covering the block
> + * at file pos @cur.
> + *
> + * Return true if we can skip to @next_ret. The caller needs to check
> + * the @next_ret value to make sure if covers the full range, before
if covers -> it covers
> + * skipping the OE.
> + *
> + * Return false if we must wait for the ordered extent.
> + *
> + * @cur: The start file offset that we have locked folio for read.
> + * @next_ret: If we return true, this indiciates the next check start
indiciates -> indicates
> + * range.
Normally the parameter list/description comes before describing the
return values.
> + */
> +static bool can_skip_one_ordered_range(struct btrfs_inode *binode,
Why binode and not just inode? There's no vfs inode in the function to
make any confusion.
> + struct btrfs_ordered_extent *ordered,
> + u64 cur, u64 *next_ret)
> +{
> + const struct btrfs_fs_info *fs_info = binode->root->fs_info;
> + struct folio *folio;
> + const u32 blocksize = fs_info->sectorsize;
> + u64 range_len;
> + bool ret;
> +
> + folio = filemap_get_folio(binode->vfs_inode.i_mapping,
> + cur >> PAGE_SHIFT);
> +
> + /*
> + * We should have locked the folio(s) for range [start, end], thus
> + * there must be a folio and it must be locked.
> + */
> + ASSERT(!IS_ERR(folio));
> + ASSERT(folio_test_locked(folio));
> +
> + /*
> + * We several cases for the folio and OE combination:
> + *
> + * 0) Folio has no private flag
Why does the numbering start from 0? It's not code and it's not
related to array indexes or offsets :)
> + * The OE has all its IO done but not yet finished, and folio got
> + * invalidated. Or direct IO.
Ok so I still don't understand the direct IO case, because direct IO
doesn't use the page cache.
Just like the change log, some explanation here would be desired.
> + *
> + * Have to wait for the OE to finish, as it may contain the
> + * to-be-inserted data checksum.
> + * Without the data checksum inserted into csum tree, read
> + * will just fail with missing csum.
Well we don't need to wait if it's a NOCOW / NODATASUM write.
> + */
> + if (!folio_test_private(folio)) {
> + ret = false;
> + goto out;
> + }
> + range_len = min(folio_pos(folio) + folio_size(folio),
> + ordered->file_offset + ordered->num_bytes) - cur;
This isn't used, only in one if branch below which repeats the computation.
> +
> + /*
> + * 1) The first block is DIRTY.
> + *
> + * This means the OE is created by some folio before us, but writeback
> + * has not started.
> + * We can and must skip the whole OE, because it will never start until
> + * we finished our folio read and unlocked the folio.
> + */
> + if (btrfs_folio_test_dirty(fs_info, folio, cur, blocksize)) {
> + ret = true;
> + /*
> + * At least inside the folio, all the remaining blocks should
> + * also be dirty.
> + */
> + ASSERT(btrfs_folio_test_dirty(fs_info, folio, cur, range_len));
> + *next_ret = ordered->file_offset + ordered->num_bytes;
> + goto out;
> + }
> +
> + /*
> + * 2) The first block is uptodate.
> + *
> + * At least the first block can be skipped, but we are still
> + * not full sure. E.g. if the OE has some other folios in
> + * the range that can not be skipped.
> + * So we return true and update @next_ret to the OE/folio boundary.
> + */
> + if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
> + u64 range_len = min(folio_pos(folio) + folio_size(folio),
> + ordered->file_offset + ordered->num_bytes) - cur;
Here, it's the same computation.
So we can remove the previous one and the range_len variable defined
at the top level scope.
> +
> + /*
> + * The whole range to the OE end or folio boundary should also
> + * be uptodate.
> + */
> + ASSERT(btrfs_folio_test_uptodate(fs_info, folio, cur, range_len));
> + ret = true;
> + *next_ret = cur + range_len;
> + goto out;
> + }
> +
> + /*
> + * 3) The first block is not uptodate.
> + *
> + * This means the folio is invalidated after the OE finished, or direct IO.
> + * Very much the same as case 1), just with private flag set.
> + */
> + ret = false;
> +out:
> + folio_put(folio);
> + return ret;
> +}
> +
> +static bool can_skip_ordered_extent(struct btrfs_inode *binode,
Same here, call it inode instead of binode.
> + struct btrfs_ordered_extent *ordered,
> + u64 start, u64 end)
> +{
> + u64 range_end = min(end, ordered->file_offset + ordered->num_bytes - 1);
> + u64 range_start = max(start, ordered->file_offset);
> + u64 cur = range_start;
Well the range_start variable is not really needed, could assign its
expression to 'cur'.
range_end can also be made const.
> +
> + while (cur < range_end) {
> + bool can_skip;
> + u64 next_start;
> +
> + can_skip = can_skip_one_ordered_range(binode, ordered, cur,
> + &next_start);
Can pass &cur and get rid of the next_start variable.
> + if (!can_skip)
> + return false;
> + cur = next_start;
> + }
> + return true;
> +}
> +
> +/*
> + * To make sure we get a stable view of extent maps for the involved range.
> + * This is for folio read paths (read and readahead), thus involved range
> + * should have all the folios locked.
> + */
> +static void lock_extents_for_read(struct btrfs_inode *binode, u64 start, u64 end,
Same here, call it inode instead of binode.
> + struct extent_state **cached_state)
> +{
> + struct btrfs_ordered_extent *ordered;
This isn't used in this scope, so it can be moved to the scope of the
while loop below.
> + u64 cur_pos;
> +
> + /* Caller must provide a valid @cached_state. */
> + ASSERT(cached_state);
> +
> + /*
> + * The range must at least be page aligned, as all read paths
> + * are folio based.
> + */
> + ASSERT(IS_ALIGNED(start, PAGE_SIZE) && IS_ALIGNED(end + 1, PAGE_SIZE));
It's best to have one ASSERT per condition, as it makes it easy to
figure out which one failed in case an assertion is triggered.
Thanks.
> +
> +again:
> + lock_extent(&binode->io_tree, start, end, cached_state);
> + cur_pos = start;
> + while (cur_pos < end) {
> + ordered = btrfs_lookup_ordered_range(binode, cur_pos,
> + end - cur_pos + 1);
> + /*
> + * No ordered extents in the range, and we hold the
> + * extent lock, no one can modify the extent maps
> + * in the range, we're safe to return.
> + */
> + if (!ordered)
> + break;
> +
> + /* Check if we can skip waiting for the whole OE. */
> + if (can_skip_ordered_extent(binode, ordered, start, end)) {
> + cur_pos = min(ordered->file_offset + ordered->num_bytes,
> + end + 1);
> + btrfs_put_ordered_extent(ordered);
> + continue;
> + }
> +
> + /* Now wait for the OE to finish. */
> + unlock_extent(&binode->io_tree, start, end,
> + cached_state);
> + btrfs_start_ordered_extent(ordered, start, end + 1 - start);
> + btrfs_put_ordered_extent(ordered);
> + /* We have unlocked the whole range, restart from the beginning. */
> + goto again;
> + }
> +}
> +
> int btrfs_read_folio(struct file *file, struct folio *folio)
> {
> struct btrfs_inode *inode = folio_to_inode(folio);
> @@ -1091,7 +1270,7 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
> struct extent_map *em_cached = NULL;
> int ret;
>
> - btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
> + lock_extents_for_read(inode, start, end, &cached_state);
> ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
> unlock_extent(&inode->io_tree, start, end, &cached_state);
>
> @@ -2380,7 +2559,7 @@ void btrfs_readahead(struct readahead_control *rac)
> struct extent_map *em_cached = NULL;
> u64 prev_em_start = (u64)-1;
>
> - btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
> + lock_extents_for_read(inode, start, end, &cached_state);
>
> while ((folio = readahead_folio(rac)) != NULL)
> btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index e87d4a37c929..00c68b7b2206 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -941,7 +941,7 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct folio *folio,
> cached_state);
> folio_unlock(folio);
> folio_put(folio);
> - btrfs_start_ordered_extent(ordered);
> + btrfs_start_ordered_extent(ordered, 0, 0);
> btrfs_put_ordered_extent(ordered);
> return -EAGAIN;
> }
> @@ -1855,7 +1855,7 @@ static vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
> unlock_extent(io_tree, page_start, page_end, &cached_state);
> folio_unlock(folio);
> up_read(&BTRFS_I(inode)->i_mmap_lock);
> - btrfs_start_ordered_extent(ordered);
> + btrfs_start_ordered_extent(ordered, 0, 0);
> btrfs_put_ordered_extent(ordered);
> goto again;
> }
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0efe9f005149..e99cb5109967 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2821,7 +2821,7 @@ static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
> unlock_extent(&inode->io_tree, page_start, page_end,
> &cached_state);
> folio_unlock(folio);
> - btrfs_start_ordered_extent(ordered);
> + btrfs_start_ordered_extent(ordered, 0, 0);
> btrfs_put_ordered_extent(ordered);
> goto again;
> }
> @@ -4873,7 +4873,7 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
> unlock_extent(io_tree, block_start, block_end, &cached_state);
> folio_unlock(folio);
> folio_put(folio);
> - btrfs_start_ordered_extent(ordered);
> + btrfs_start_ordered_extent(ordered, 0, 0);
> btrfs_put_ordered_extent(ordered);
> goto again;
> }
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 4aca7475fd82..6075a6fa4817 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -729,7 +729,7 @@ static void btrfs_run_ordered_extent_work(struct btrfs_work *work)
> struct btrfs_ordered_extent *ordered;
>
> ordered = container_of(work, struct btrfs_ordered_extent, flush_work);
> - btrfs_start_ordered_extent(ordered);
> + btrfs_start_ordered_extent(ordered, 0, 0);
> complete(&ordered->completion);
> }
>
> @@ -842,10 +842,12 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
> /*
> * Start IO and wait for a given ordered extent to finish.
> *
> - * Wait on page writeback for all the pages in the extent and the IO completion
> - * code to insert metadata into the btree corresponding to the extent.
> + * Wait on page writeback for all the pages in the extent but not in
> + * [@nowriteback_start, @nowriteback_start + @nowriteback_len) and the
> + * IO completion code to insert metadata into the btree corresponding to the extent.
> */
> -void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
> +void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry,
> + u64 nowriteback_start, u32 nowriteback_len)
> {
> u64 start = entry->file_offset;
> u64 end = start + entry->num_bytes - 1;
> @@ -865,8 +867,19 @@ void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
> * start IO on any dirty ones so the wait doesn't stall waiting
> * for the flusher thread to find them
> */
> - if (!test_bit(BTRFS_ORDERED_DIRECT, &entry->flags))
> - filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start, end);
> + if (!test_bit(BTRFS_ORDERED_DIRECT, &entry->flags)) {
> + if (!nowriteback_len) {
> + filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start, end);
> + } else {
> + if (start < nowriteback_start)
> + filemap_fdatawrite_range(inode->vfs_inode.i_mapping, start,
> + nowriteback_start - 1);
> + if (nowriteback_start + nowriteback_len < end)
> + filemap_fdatawrite_range(inode->vfs_inode.i_mapping,
> + nowriteback_start + nowriteback_len,
> + end);
> + }
> + }
>
> if (!freespace_inode)
> btrfs_might_wait_for_event(inode->root->fs_info, btrfs_ordered_extent);
> @@ -921,7 +934,7 @@ int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len)
> btrfs_put_ordered_extent(ordered);
> break;
> }
> - btrfs_start_ordered_extent(ordered);
> + btrfs_start_ordered_extent(ordered, 0, 0);
> end = ordered->file_offset;
> /*
> * If the ordered extent had an error save the error but don't
> @@ -1174,7 +1187,7 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
> break;
> }
> unlock_extent(&inode->io_tree, start, end, cachedp);
> - btrfs_start_ordered_extent(ordered);
> + btrfs_start_ordered_extent(ordered, 0, 0);
> btrfs_put_ordered_extent(ordered);
> }
> }
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index be36083297a7..5e17f03d43c9 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -192,7 +192,8 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
> struct btrfs_ordered_sum *sum);
> struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
> u64 file_offset);
> -void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry);
> +void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry,
> + u64 nowriteback_start, u32 nowriteback_len);
> int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len);
> struct btrfs_ordered_extent *
> btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset);
> --
> 2.48.1
>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/7] btrfs: make btrfs_do_readpage() to do block-by-block read
2025-02-23 23:46 ` [PATCH 4/7] btrfs: make btrfs_do_readpage() to do block-by-block read Qu Wenruo
@ 2025-02-25 17:09 ` Filipe Manana
0 siblings, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2025-02-25 17:09 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Sun, Feb 23, 2025 at 11:47 PM Qu Wenruo <wqu@suse.com> wrote:
>
> Currently if a btrfs filesystem has its block size (the older sector
> size) smaller than the page size, btrfs_do_readpage() will handle the
> range extent by extent. This is good for performance as it doesn't need
> to re-lookup the same extent map again and again.
> (Although __get_extent_map() already does an extra cached em check, thus
> the optimization is not that obvious.)
>
> This is totally fine and is a valid optimization, but it assumes that
> there is no partially uptodate range in the page.
>
> Meanwhile there is an incoming feature, requiring btrfs to skip the full
> page read if a buffered write range covers a full block but not a full
> page.
>
> In that case, we can have a page that is partially uptodate, and the
> current per-extent lookup cannot handle such a case.
>
> So here we change btrfs_do_readpage() to do a block-by-block read. This
> simplifies the following things:
>
> - Remove the need for @iosize variable
> Because we just use sectorsize as our increment.
>
> - Remove @pg_offset, and calculate it inside the loop when needed
> It's just offset_in_folio().
>
> - Use a for() loop instead of a while() loop
>
> This will slightly reduce the read performance for subpage cases, but for
> the future where we need to skip already uptodate blocks, it is still
> worthwhile.
>
> For block size == page size, this brings no performance change.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/extent_io.c | 37 ++++++++++++-------------------------
> 1 file changed, 12 insertions(+), 25 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index a6ffee6f6fd9..b3a4a94212c9 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -947,9 +947,7 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> u64 last_byte = i_size_read(inode);
> struct extent_map *em;
> int ret = 0;
> - size_t pg_offset = 0;
> - size_t iosize;
> - size_t blocksize = fs_info->sectorsize;
> + const size_t blocksize = fs_info->sectorsize;
>
> ret = set_folio_extent_mapped(folio);
> if (ret < 0) {
> @@ -960,24 +958,23 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> if (folio_contains(folio, last_byte >> PAGE_SHIFT)) {
> size_t zero_offset = offset_in_folio(folio, last_byte);
>
> - if (zero_offset) {
> - iosize = folio_size(folio) - zero_offset;
> - folio_zero_range(folio, zero_offset, iosize);
> - }
> + if (zero_offset)
> + folio_zero_range(folio, zero_offset,
> + folio_size(folio) - zero_offset);
> }
> bio_ctrl->end_io_func = end_bbio_data_read;
> begin_folio_read(fs_info, folio);
> - while (cur <= end) {
> + for (cur = start; cur <= end; cur += blocksize) {
While here, we can also declare the variable 'cur' in the for loop
instead of being at the top scope, since it's not used outside the
loop.
Either way, it looks good:
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks.
> enum btrfs_compression_type compress_type = BTRFS_COMPRESS_NONE;
> + unsigned long pg_offset = offset_in_folio(folio, cur);
> bool force_bio_submit = false;
> u64 disk_bytenr;
> u64 block_start;
>
> ASSERT(IS_ALIGNED(cur, fs_info->sectorsize));
> if (cur >= last_byte) {
> - iosize = folio_size(folio) - pg_offset;
> - folio_zero_range(folio, pg_offset, iosize);
> - end_folio_read(folio, true, cur, iosize);
> + folio_zero_range(folio, pg_offset, end - cur + 1);
> + end_folio_read(folio, true, cur, end - cur + 1);
> break;
> }
> em = get_extent_map(BTRFS_I(inode), folio, cur, end - cur + 1, em_cached);
> @@ -991,8 +988,6 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>
> compress_type = extent_map_compression(em);
>
> - iosize = min(extent_map_end(em) - cur, end - cur + 1);
> - iosize = ALIGN(iosize, blocksize);
> if (compress_type != BTRFS_COMPRESS_NONE)
> disk_bytenr = em->disk_bytenr;
> else
> @@ -1050,18 +1045,13 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>
> /* we've found a hole, just zero and go on */
> if (block_start == EXTENT_MAP_HOLE) {
> - folio_zero_range(folio, pg_offset, iosize);
> -
> - end_folio_read(folio, true, cur, iosize);
> - cur = cur + iosize;
> - pg_offset += iosize;
> + folio_zero_range(folio, pg_offset, blocksize);
> + end_folio_read(folio, true, cur, blocksize);
> continue;
> }
> /* the get_extent function already copied into the folio */
> if (block_start == EXTENT_MAP_INLINE) {
> - end_folio_read(folio, true, cur, iosize);
> - cur = cur + iosize;
> - pg_offset += iosize;
> + end_folio_read(folio, true, cur, blocksize);
> continue;
> }
>
> @@ -1072,12 +1062,9 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>
> if (force_bio_submit)
> submit_one_bio(bio_ctrl);
> - submit_extent_folio(bio_ctrl, disk_bytenr, folio, iosize,
> + submit_extent_folio(bio_ctrl, disk_bytenr, folio, blocksize,
> pg_offset);
> - cur = cur + iosize;
> - pg_offset += iosize;
> }
> -
> return 0;
> }
>
> --
> 2.48.1
>
>
* Re: [PATCH 5/7] btrfs: allow buffered write to avoid full page read if it's block aligned
2025-02-23 23:46 ` [PATCH 5/7] btrfs: allow buffered write to avoid full page read if it's block aligned Qu Wenruo
@ 2025-02-25 17:55 ` Filipe Manana
2025-02-25 21:16 ` Qu Wenruo
0 siblings, 1 reply; 22+ messages in thread
From: Filipe Manana @ 2025-02-25 17:55 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Sun, Feb 23, 2025 at 11:47 PM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> Since the support of block size (sector size) < page size for btrfs,
> test case generic/563 fails with 4K block size and 64K page size:
>
> --- tests/generic/563.out 2024-04-25 18:13:45.178550333 +0930
> +++ /home/adam/xfstests-dev/results//generic/563.out.bad 2024-09-30 09:09:16.155312379 +0930
> @@ -3,7 +3,8 @@
> read is in range
> write is in range
> write -> read/write
> -read is in range
> +read has value of 8388608
> +read is NOT in range -33792 .. 33792
> write is in range
> ...
>
> [CAUSE]
> The test case creates an 8MiB file, then does buffered writes into the
> 8MiB file using a 4K block size, to overwrite the whole file.
>
> On 4K page sized systems, since the write range covers the full block and
> page, btrfs will not bother reading the page, just like what XFS and EXT4
> do.
>
> But on 64K page sized systems, although the 4K sized write is still block
> aligned, it's not page aligned any more, thus btrfs will read the full
> page, causing more read than expected and failing the test case.
This part "causing more read than expected" got me puzzled, but it's explained
in the "Fix" section below. It's more like "causing previously dirty
blocks within the page to be zeroed out".
>
> [FIX]
> To skip the full page read, we need to do the following modification:
>
> - Do not trigger full page read as long as the buffered write is block
> aligned
> This is pretty simple by modifying the check inside
> prepare_uptodate_folio().
>
> - Skip already uptodate blocks during full page read
> Otherwise we can end up with the following data corruption:
>
> 0 32K 64K
> |///////| |
>
> Where the file range [0, 32K) is dirtied by buffered write, the
> remaining range [32K, 64K) is not.
>
> When reading the full page, since [0,32K) is only dirtied but not
> written back, there is no data extent map for it, but a hole covering
> [0, 64k).
>
> If we continue reading the full page range [0, 64K), the dirtied range
> will be filled with 0 (since there is only a hole covering the whole
> range).
> This causes the dirtied range to get lost.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Looks good, thanks.
> ---
> fs/btrfs/extent_io.c | 4 ++++
> fs/btrfs/file.c | 5 +++--
> 2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index b3a4a94212c9..d7240e295bfc 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -977,6 +977,10 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
> end_folio_read(folio, true, cur, end - cur + 1);
> break;
> }
> + if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
> + end_folio_read(folio, true, cur, blocksize);
> + continue;
> + }
> em = get_extent_map(BTRFS_I(inode), folio, cur, end - cur + 1, em_cached);
> if (IS_ERR(em)) {
> end_folio_read(folio, false, cur, end + 1 - cur);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 00c68b7b2206..e3d63192281d 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -804,14 +804,15 @@ static int prepare_uptodate_folio(struct inode *inode, struct folio *folio, u64
> {
> u64 clamp_start = max_t(u64, pos, folio_pos(folio));
> u64 clamp_end = min_t(u64, pos + len, folio_pos(folio) + folio_size(folio));
> + const u32 sectorsize = inode_to_fs_info(inode)->sectorsize;
> int ret = 0;
>
> if (folio_test_uptodate(folio))
> return 0;
>
> if (!force_uptodate &&
> - IS_ALIGNED(clamp_start, PAGE_SIZE) &&
> - IS_ALIGNED(clamp_end, PAGE_SIZE))
> + IS_ALIGNED(clamp_start, sectorsize) &&
> + IS_ALIGNED(clamp_end, sectorsize))
> return 0;
>
> ret = btrfs_read_folio(NULL, folio);
> --
> 2.48.1
>
>
* Re: [PATCH 6/7] btrfs: allow inline data extents creation if block size < page size
2025-02-23 23:46 ` [PATCH 6/7] btrfs: allow inline data extents creation if block size < page size Qu Wenruo
@ 2025-02-25 18:03 ` Filipe Manana
0 siblings, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2025-02-25 18:03 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Sun, Feb 23, 2025 at 11:47 PM Qu Wenruo <wqu@suse.com> wrote:
>
> Previously inline data extent creation was disabled if the block size
> (previously called sector size) is smaller than the page size, for the
> following reasons:
>
> - Possible mixed inline and regular data extents
> However this is also the same if the block size matches the page size,
> thus we do not treat mixed inline and regular extents as an error.
>
> And the chance of causing mixed inline and regular data extents is not
> even increased; it has the same requirement (compressed inline data
> extent covering the whole first block, followed by regular extents).
>
> - Unable to handle async/inline delalloc range for block size < page
> size cases
> This is already fixed since commit 1d2fbb7f1f9e ("btrfs: allow
> compression even if the range is not page aligned").
>
> This was the major technical blockage, but it's no longer a blockage
> anymore.
>
> With the major technical blockage already removed, we can enable inline
> data extent creation no matter the block size or the page size,
> allowing btrfs to have the same capability for all block sizes.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Looks good, thanks.
> ---
> fs/btrfs/inode.c | 13 -------------
> 1 file changed, 13 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e99cb5109967..a1ea93bad80e 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -566,19 +566,6 @@ static bool can_cow_file_range_inline(struct btrfs_inode *inode,
> if (offset != 0)
> return false;
>
> - /*
> - * Due to the page size limit, for subpage we can only trigger the
> - * writeback for the dirty sectors of page, that means data writeback
> - * is doing more writeback than what we want.
> - *
> - * This is especially unexpected for some call sites like fallocate,
> - * where we only increase i_size after everything is done.
> - * This means we can trigger inline extent even if we didn't want to.
> - * So here we skip inline extent creation completely.
> - */
> - if (fs_info->sectorsize != PAGE_SIZE)
> - return false;
> -
> /* Inline extents are limited to sectorsize. */
> if (size > fs_info->sectorsize)
> return false;
> --
> 2.48.1
>
>
* Re: [PATCH 7/7] btrfs: remove the subpage related warning message
2025-02-23 23:46 ` [PATCH 7/7] btrfs: remove the subpage related warning message Qu Wenruo
@ 2025-02-25 18:10 ` Filipe Manana
2025-02-25 21:19 ` Qu Wenruo
0 siblings, 1 reply; 22+ messages in thread
From: Filipe Manana @ 2025-02-25 18:10 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Sun, Feb 23, 2025 at 11:47 PM Qu Wenruo <wqu@suse.com> wrote:
>
> Since the initial enablement of block size < page size support for
> btrfs in v5.15, we have hit several milestones for block size < page
> size (subpage) support:
>
> - RAID56 subpage support
> In v5.19
>
> - Refactored scrub support to support subpage better
> In v6.4
>
> - Block perfect (previously requires page aligned ranges) compressed write
> In v6.13
>
> - Various error handling fixes involving subpage
> In v6.14
>
> Finally the only missing feature is the pretty simple and harmless
> inline data extent creation, just added in the previous patches.
>
> Now that btrfs has all of its features ready for both regular and
> subpage cases, there is no reason to output a warning about the
> experimental subpage support, and we can finally remove it now.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
I don't have a machine to test block size < page size, but I trust you
did all the testing besides being the expert on the code base for the
feature, so:
Acked-by: Filipe Manana <fdmanana@suse.com>
Thanks.
> ---
> fs/btrfs/disk-io.c | 5 -----
> 1 file changed, 5 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a799216aa264..c0b40dedceb5 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3414,11 +3414,6 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
> */
> fs_info->max_inline = min_t(u64, fs_info->max_inline, fs_info->sectorsize);
>
> - if (sectorsize < PAGE_SIZE)
> - btrfs_warn(fs_info,
> - "read-write for sector size %u with page size %lu is experimental",
> - sectorsize, PAGE_SIZE);
> -
> ret = btrfs_init_workqueues(fs_info);
> if (ret)
> goto fail_sb_buffer;
> --
> 2.48.1
>
>
* Re: [PATCH 3/7] btrfs: introduce a read path dedicated extent lock helper
2025-02-25 17:05 ` Filipe Manana
@ 2025-02-25 21:12 ` Qu Wenruo
2025-02-25 23:44 ` Qu Wenruo
2025-02-26 11:07 ` Filipe Manana
0 siblings, 2 replies; 22+ messages in thread
From: Qu Wenruo @ 2025-02-25 21:12 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
On 2025/2/26 03:35, Filipe Manana wrote:
> On Sun, Feb 23, 2025 at 11:47 PM Qu Wenruo <wqu@suse.com> wrote:
>>
>> Currently we're using btrfs_lock_and_flush_ordered_range() for both
>> btrfs_read_folio() and btrfs_readahead(), but it has one critical
>> problem for future subpage enhancements:
>>
>> - It will call btrfs_start_ordered_extent() to write back the involved
>> folios
>>
>> But remember we're calling btrfs_lock_and_flush_ordered_range() at
>> read paths, meaning the folio is already locked by read path.
>>
>> If we really trigger writeback, this will lead to a deadlock, since
>> writeback cannot take the folio lock already held by the read path.
>>
>> Such a deadlock is prevented by the fact that btrfs always keeps a
>> dirty folio also uptodate, by either dirtying all blocks of the folio,
>> or reading the folio before dirtying it.
>>
>> But it's not the common behavior, as XFS/EXT4 both allow the folio to
>> be dirty but not uptodate, by allowing buffered write to dirty a full
>> block without reading the full folio.
>
> Ok, but what happens in other filesystems doesn't affect us.
> Or did you forget to mention that some other subsequent patch or planned changes
> will introduce that behaviour?
>
> If so, it would be best to say that, otherwise that paragraph isn't relevant.
Got it, will add a mention of the incoming feature of partial uptodate
folio support.
But still, all the work done here is mostly to pass generic/563, and I
have to say the EXT4/XFS behavior is still a pretty good optimization.
I'll change the paragraph to focus on the incoming optimization.
>
>>
>> Instead of blindly calling btrfs_start_ordered_extent(), introduce a
>> newer helper, which is smarter in the following ways:
>>
>> - Only wait and flush the ordered extent if
>> * The folio doesn't even have private set
>> * Part of the blocks of the ordered extent is not uptodate
>
> is -> are
>
>>
>> This can happen by:
>> * Folio writeback finished, then got invalidated, but the OE has
>> not yet finished
>
> Ok, this one I understand.
>
>> * Direct IO
>
> This one I don't because direct IO doesn't use the page cache.
> Can you elaborate here and explain why it can happen with direct IO?
Direct IO always drops the page cache first, so the folios in the direct
IO range have no private attached, just as if they had been invalidated.
And since direct IO doesn't wait for the OE to finish, we end up in the
same situation as the previous case.
>
>>
>> We have to wait for the ordered extent, as it may contain
>> to-be-inserted data checksums.
>> Without waiting, our read can fail due to the missing csum.
>>
>> But either way, the OE should not need any extra flush inside the
>> locked folio range.
>>
>> - Skip the ordered extent completely if
>> * All the blocks are dirty
>> This happens when OE creation is caused by previous folio.
>
> "OE creation is caused by previous folio" - what does this mean?
> You mean by a write that happened before the read?
I mean the following case (16K page size, 4K block size):
0 8K 16K 24K 32K
|/////////////|//////| |
The writeback for folio 0 started, which ran the delalloc range [0,
24K), thus we have an OE at [0, 24K).
But at the same time, since range [24K, 32K) is not uptodate, a read can
be triggered on folio 16K.
>
>
>> The writeback will never happen (we're holding the folio lock for
>> read), nor will the OE finish.
>>
>> Thus we must skip the range.
>>
>> * All the blocks are uptodate
>> This happens when the writeback finished, but OE not yet finished.
>>
>> Since the blocks are already uptodate, we can skip the OE range.
>>
>> The newer helper, lock_extents_for_read() will do a loop for the target
>> range by:
>>
>> 1) Lock the full range
>>
>> 2) If there is no ordered extent in the remaining range, exit
>>
>> 3) If there is an ordered extent that we can skip
>> Skip to the end of the OE, and continue checking
>> We do not trigger writeback nor wait for the OE.
>>
>> 4) If there is an ordered extent that we can not skip
>> Unlock the whole extent range and start the ordered extent.
>>
>> And also update btrfs_start_ordered_extent() to add two more parameters:
>> @nowriteback_start and @nowriteback_len, to prevent triggering flush for
>> a certain range.
>>
>> This will allow us to handle the following case properly in the future:
>>
>> 16K page size, 4K btrfs block size:
>>
>> 16K 20K 24K 28K 32K
>> |/////////////////////////////| | |
>> |<----------- OE 2----------->| |<--- OE 1 --->|
>>
>> The folio has been written back before, thus we have an OE at
>> [28K, 32K).
>> Although the OE 1 finished its IO, the OE is not yet removed from IO
>> tree.
>> Later the folio got invalidated, but OE still exists.
>
> To make this last sentence less confusing: "The folio got invalidated
> after writeback completed and before the ordered extent finished."
>
>>
>> And [16K, 24K) range is dirty and uptodate, caused by a block aligned
>> buffered write (and future enhancements allowing btrfs to skip full
>> folio read for such case).
>>
>> Furthermore, OE 2 is created covering range [16K, 24K) by the writeback
>> of previous folio.
>
> What previous folio? There's no other one in the example.
> Do you mean there's a folio for range [0, 16K) and running delalloc
> resulted in the creation of an OE for the range [0, 24K)? So that OE 2
> doesn't really start at 16K but at 0 instead?
Yep, exactly.
[...]
> Normally the parameter list/description comes before describing the
> return values.
>
>> + */
>> +static bool can_skip_one_ordered_range(struct btrfs_inode *binode,
>
> Why binode and not just inode? There's no vfs inode in the function to
> make any confusion.
OK, will go the regular inode naming.
>
>
>> + struct btrfs_ordered_extent *ordered,
>> + u64 cur, u64 *next_ret)
>> +{
>> + const struct btrfs_fs_info *fs_info = binode->root->fs_info;
>> + struct folio *folio;
>> + const u32 blocksize = fs_info->sectorsize;
>> + u64 range_len;
>> + bool ret;
>> +
>> + folio = filemap_get_folio(binode->vfs_inode.i_mapping,
>> + cur >> PAGE_SHIFT);
>> +
>> + /*
>> + * We should have locked the folio(s) for range [start, end], thus
>> + * there must be a folio and it must be locked.
>> + */
>> + ASSERT(!IS_ERR(folio));
>> + ASSERT(folio_test_locked(folio));
>> +
>> + /*
>> + * We have several cases for the folio and OE combination:
>> + *
>> + * 0) Folio has no private flag
>
> Why does the numbering start from 0? It's not code and it's not
> related to array indexes or offsets :)
Because I forgot that the folio to-be-read may not even be managed by
btrfs, and hit ASSERT()s related to that.
Thus I thought the private flag was an important prerequisite, and
gave it 0.
But you're right, in fact it's very common and should not be treated
specially.
>
>> + * The OE has all its IO done but not yet finished, and folio got
>> + * invalidated. Or direct IO.
>
> Ok so I still don't understand the direct IO case, because direct IO
> doesn't use the page cache.
> Just like the change log, some explanation here would be desired.
>
>> + *
>> + * Have to wait for the OE to finish, as it may contain the
>> + * to-be-inserted data checksum.
>> + * Without the data checksum inserted into csum tree, read
>> + * will just fail with missing csum.
>
> Well we don't need to wait if it's a NOCOW / NODATASUM write.
That is also a valid optimization; I can enhance the check to skip
NODATASUM OEs.
Thanks a lot for reviewing the core mechanism of the patchset.
Thanks,
Qu
* Re: [PATCH 5/7] btrfs: allow buffered write to avoid full page read if it's block aligned
2025-02-25 17:55 ` Filipe Manana
@ 2025-02-25 21:16 ` Qu Wenruo
0 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2025-02-25 21:16 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
On 2025/2/26 04:25, Filipe Manana wrote:
> On Sun, Feb 23, 2025 at 11:47 PM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> Since the support of block size (sector size) < page size for btrfs,
>> test case generic/563 fails with 4K block size and 64K page size:
>>
>> --- tests/generic/563.out 2024-04-25 18:13:45.178550333 +0930
>> +++ /home/adam/xfstests-dev/results//generic/563.out.bad 2024-09-30 09:09:16.155312379 +0930
>> @@ -3,7 +3,8 @@
>> read is in range
>> write is in range
>> write -> read/write
>> -read is in range
>> +read has value of 8388608
>> +read is NOT in range -33792 .. 33792
>> write is in range
>> ...
>>
>> [CAUSE]
>> The test case creates an 8MiB file, then does buffered writes into the
>> 8MiB file using a 4K block size, to overwrite the whole file.
>>
>> On 4K page sized systems, since the write range covers the full block and
>> page, btrfs will not bother reading the page, just like what XFS and EXT4
>> do.
>>
>> But on 64K page sized systems, although the 4K sized write is still block
>> aligned, it's not page aligned any more, thus btrfs will read the full
>> page, causing more read than expected and failing the test case.
>
> This part "causing more read than expected" got me puzzled, but it's explained
> in the "Fix" section below. It's more like "causing previously dirty
> blocks within the page to be zeroed out".
It's not exactly the same.
Before this patch, we never allow a folio to be dirtied if it's not
uptodate (unless we're dirtying the full folio).
So there are no "previously dirty blocks to be zeroed out".
I have no better idea how to explain the situation here; it's really
about whether we need to read the folio before a buffered write.
I'd be super happy if you have a better expression here so that no one
would be confused.
Thanks,
Qu
>
>>
>> [FIX]
>> To skip the full page read, we need to do the following modification:
>>
>> - Do not trigger full page read as long as the buffered write is block
>> aligned
>> This is pretty simple by modifying the check inside
>> prepare_uptodate_folio().
>>
>> - Skip already uptodate blocks during full page read
>> Otherwise we can end up with the following data corruption:
>>
>> 0 32K 64K
>> |///////| |
>>
>> Where the file range [0, 32K) is dirtied by buffered write, the
>> remaining range [32K, 64K) is not.
>>
>> When reading the full page, since [0,32K) is only dirtied but not
>> written back, there is no data extent map for it, but a hole covering
>> [0, 64k).
>>
>> If we continue reading the full page range [0, 64K), the dirtied range
>> will be filled with 0 (since there is only a hole covering the whole
>> range).
>> This causes the dirtied range to get lost.
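The corruption mode can be demonstrated with a small userspace model (all names hypothetical; one `char` stands in for a block's payload, and the backing extent is a hole, so a naive full-page read fills every block with zeroes):

```c
#include <assert.h>
#include <stdbool.h>

#define BLOCKS_PER_PAGE 16	/* e.g. a 64K page of 4K blocks */

struct block {
	bool uptodate;		/* per-block flag, as in the subpage bitmap */
	char data;		/* stand-in for the block payload */
};

/* Read the whole page from a hole: the on-disk content is all zero. */
static void read_page_from_hole(struct block *page, bool skip_uptodate)
{
	for (int i = 0; i < BLOCKS_PER_PAGE; i++) {
		if (skip_uptodate && page[i].uptodate)
			continue;	/* the fix: keep in-memory data */
		page[i].data = 0;	/* a hole reads back as zeroes */
		page[i].uptodate = true;
	}
}

/* Dirty the first half of the page in memory, then read the page. */
static char first_block_after_read(bool skip_uptodate)
{
	struct block page[BLOCKS_PER_PAGE] = { 0 };

	for (int i = 0; i < BLOCKS_PER_PAGE / 2; i++) {
		page[i].uptodate = true;
		page[i].data = 'D';	/* dirtied, not yet written back */
	}
	read_page_from_hole(page, skip_uptodate);
	return page[0].data;
}
```

Without the skip, the dirtied data is overwritten with zeroes; with it, only the genuinely missing blocks are filled from disk.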
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>
> Looks good, thanks.
>
>> ---
>> fs/btrfs/extent_io.c | 4 ++++
>> fs/btrfs/file.c | 5 +++--
>> 2 files changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index b3a4a94212c9..d7240e295bfc 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -977,6 +977,10 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>> end_folio_read(folio, true, cur, end - cur + 1);
>> break;
>> }
>> + if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
>> + end_folio_read(folio, true, cur, blocksize);
>> + continue;
>> + }
>> em = get_extent_map(BTRFS_I(inode), folio, cur, end - cur + 1, em_cached);
>> if (IS_ERR(em)) {
>> end_folio_read(folio, false, cur, end + 1 - cur);
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 00c68b7b2206..e3d63192281d 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -804,14 +804,15 @@ static int prepare_uptodate_folio(struct inode *inode, struct folio *folio, u64
>> {
>> u64 clamp_start = max_t(u64, pos, folio_pos(folio));
>> u64 clamp_end = min_t(u64, pos + len, folio_pos(folio) + folio_size(folio));
>> + const u32 sectorsize = inode_to_fs_info(inode)->sectorsize;
>> int ret = 0;
>>
>> if (folio_test_uptodate(folio))
>> return 0;
>>
>> if (!force_uptodate &&
>> - IS_ALIGNED(clamp_start, PAGE_SIZE) &&
>> - IS_ALIGNED(clamp_end, PAGE_SIZE))
>> + IS_ALIGNED(clamp_start, sectorsize) &&
>> + IS_ALIGNED(clamp_end, sectorsize))
>> return 0;
>>
>> ret = btrfs_read_folio(NULL, folio);
>> --
>> 2.48.1
>>
>>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 7/7] btrfs: remove the subpage related warning message
2025-02-25 18:10 ` Filipe Manana
@ 2025-02-25 21:19 ` Qu Wenruo
0 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2025-02-25 21:19 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
On 2025/2/26 04:40, Filipe Manana wrote:
> On Sun, Feb 23, 2025 at 11:47 PM Qu Wenruo <wqu@suse.com> wrote:
>>
>> Since the initial enablement of block size < page size support for
>> btrfs in v5.15, we have hit several milestones for block size < page
>> size (subpage) support:
>>
>> - RAID56 subpage support
>> In v5.19
>>
>> - Refactored scrub support to support subpage better
>> In v6.4
>>
>> - Block perfect (previously requires page aligned ranges) compressed write
>> In v6.13
>>
>> - Various error handling fixes involving subpage
>> In v6.14
>>
>> Finally, the only missing feature is the pretty simple and harmless
>> inline data extent creation, just done in the previous patches.
>>
>> Now that btrfs has all of its features ready for both regular and
>> subpage cases, there is no reason to output a warning about the
>> experimental subpage support, and we can finally remove it.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> I don't have a machine to test block size < page size, but I trust you
> did all the testing besides being the expert on the code base for the
> feature, so:
There will be a very small patch series (3 patches, < 70 lines of
modification) to enable 2K block size on DEBUG btrfs builds.
Although it comes with a lot of extra limitations, it should be enough
to trigger most (if not all) bugs involved in this series.
Making fstests cope with the 2K block size limit will be a much bigger
piece of work though...
Thanks,
Qu
>
> Acked-by: Filipe Manana <fdmanana@suse.com>
>
> Thanks.
>
>> ---
>> fs/btrfs/disk-io.c | 5 -----
>> 1 file changed, 5 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index a799216aa264..c0b40dedceb5 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3414,11 +3414,6 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>> */
>> fs_info->max_inline = min_t(u64, fs_info->max_inline, fs_info->sectorsize);
>>
>> - if (sectorsize < PAGE_SIZE)
>> - btrfs_warn(fs_info,
>> - "read-write for sector size %u with page size %lu is experimental",
>> - sectorsize, PAGE_SIZE);
>> -
>> ret = btrfs_init_workqueues(fs_info);
>> if (ret)
>> goto fail_sb_buffer;
>> --
>> 2.48.1
>>
>>
* Re: [PATCH 3/7] btrfs: introduce a read path dedicated extent lock helper
2025-02-25 21:12 ` Qu Wenruo
@ 2025-02-25 23:44 ` Qu Wenruo
2025-02-26 11:07 ` Filipe Manana
2025-02-26 11:07 ` Filipe Manana
1 sibling, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2025-02-25 23:44 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
On 2025/2/26 07:42, Qu Wenruo wrote:
>
[...]
>>> + *
>>> + * Have to wait for the OE to finish, as it may contain the
>>> + * to-be-inserted data checksum.
>>> + * Without the data checksum inserted into csum tree, read
>>> + * will just fail with missing csum.
>>
>> Well we don't need to wait if it's a NOCOW / NODATASUM write.
>
> That is also a valid optimization, I can enhance the check to skip
> NODATASUM OEs.
BTW, the enhancement will not be in this series, as it will introduce
extra skip cases, which may further make things more complex.
I'll make it a dedicated patch in the future.
Currently there are quite a few patches tested based on the behavior
without the NODATASUM skip.
Thanks,
Qu
>
> Thanks a lot for reviewing the core mechanism of the patchset.
>
> Thanks,
> Qu
>
>
* Re: [PATCH 3/7] btrfs: introduce a read path dedicated extent lock helper
2025-02-25 21:12 ` Qu Wenruo
2025-02-25 23:44 ` Qu Wenruo
@ 2025-02-26 11:07 ` Filipe Manana
1 sibling, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2025-02-26 11:07 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Tue, Feb 25, 2025 at 9:12 PM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> On 2025/2/26 03:35, Filipe Manana wrote:
> > On Sun, Feb 23, 2025 at 11:47 PM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> Currently we're using btrfs_lock_and_flush_ordered_range() for both
> >> btrfs_read_folio() and btrfs_readahead(), but it has one critical
> >> problem for future subpage enhancements:
> >>
> >> - It will call btrfs_start_ordered_extent() to writeback the involved
> >> folios
> >>
> >> But remember we're calling btrfs_lock_and_flush_ordered_range() at
> >> read paths, meaning the folio is already locked by read path.
> >>
> >> If we really trigger writeback, this will lead to a deadlock, as
> >> writeback can not take the folio lock the read path already holds.
> >>
> >> Such a deadlock is prevented by the fact that btrfs always keeps a
> >> dirty folio also uptodate, by either dirtying all blocks of the folio,
> >> or reading the folio before dirtying.
> >>
> >> But this is not the common behavior: XFS and EXT4 both allow a folio
> >> to be dirty but not uptodate, by allowing a buffered write to dirty a
> >> full block without reading the full folio.
> >
> > Ok, but what happens in other filesystems doesn't affect us.
> > Or did you forget to mention that some other subsequent patch or planned changes
> > will introduce that behaviour?
> >
> > If so, it would be best to say that, otherwise that paragraph isn't relevant.
>
> Got it, will add the mention of the incoming feature of partial uptodate
> folio support.
>
> But still, all the work done here is mostly to pass generic/563, and I
> have to say the EXT4/XFS behavior is still a pretty good optimization.
>
> I'll change the paragraph to focus on the incoming optimization.
>
> >
> >>
> >> Instead of blindly calling btrfs_start_ordered_extent(), introduce a
> >> newer helper, which is smarter in the following ways:
> >>
> >> - Only wait and flush the ordered extent if
> >> * The folio doesn't even have private set
> >> * Part of the blocks of the ordered extent is not uptodate
> >
> > is -> are
> >
> >>
> >> This can happen by:
> >> * Folio writeback finished, then get invalidated. But OE not yet
> >> finished
> >
> > Ok, this one I understand.
> >
> >> * Direct IO
> >
> > This one I don't because direct IO doesn't use the page cache.
> > Can you elaborate here and explain why it can happen with direct IO?
>
> Direct IO always drops the page cache first, so the folio in the direct
> IO range should have no private attached, just as if it had been
> invalidated.
Ok, I was under the impression you were implying direct IO could cause a
deadlock, but what you mean is the scenario where private is not set.
So all is good.
Thanks.
>
> And direct IO doesn't wait for the OE to finish, we will have the same
> case just like the previous case.
>
> >
> >>
> >> We have to wait for the ordered extent, as it may contain
> >> to-be-inserted data checksum.
> >> Without waiting, our read can fail due to the missing csum.
> >>
> >> But either way, the OE should not need any extra flush inside the
> >> locked folio range.
> >>
> >> - Skip the ordered extent completely if
> >> * All the blocks are dirty
> >> This happens when OE creation is caused by previous folio.
> >
> > "OE creation is caused by previous folio" - what does this mean?
> > You mean by a write that happened before the read?
>
> I mean the following case, 16K page size 4K block size:
>
>
> 0 8K 16K 24K 32K
> |/////////////|//////| |
>
> The writeback for folio 0 started, which ran delalloc range for [0,
> 24K), thus we have an OE at [0, 24K).
>
> But at the same time, since range [24K, 32K) is not uptodate, a read can
> be triggered on folio 16K.
>
> >
> >
> >> The writeback will never happen (we're holding the folio lock for
> >> read), nor will the OE finish.
> >>
> >> Thus we must skip the range.
> >>
> >> * All the blocks are uptodate
> >> This happens when the writeback finished, but OE not yet finished.
> >>
> >> Since the blocks are already uptodate, we can skip the OE range.
> >>
> >> The newer helper, lock_extents_for_read() will do a loop for the target
> >> range by:
> >>
> >> 1) Lock the full range
> >>
> >> 2) If there is no ordered extent in the remaining range, exit
> >>
> >> 3) If there is an ordered extent that we can skip
> >> Skip to the end of the OE, and continue checking
> >> We do not trigger writeback nor wait for the OE.
> >>
> >> 4) If there is an ordered extent that we can not skip
> >> Unlock the whole extent range and start the ordered extent.
> >>
> >> And also update btrfs_start_ordered_extent() to add two more parameters:
> >> @nowriteback_start and @nowriteback_len, to prevent triggering flush for
> >> a certain range.
> >>
> >> This will allow us to handle the following case properly in the future:
> >>
> >> 16K page size, 4K btrfs block size:
> >>
> >> 16K 20K 24K 28K 32K
> >> |/////////////////////////////| | |
> >> |<----------- OE 2----------->| |<--- OE 1 --->|
> >>
> >> The folio has been written back before, thus we have an OE at
> >> [28K, 32K).
> >> Although the OE 1 finished its IO, the OE is not yet removed from IO
> >> tree.
> >> Later the folio got invalidated, but OE still exists.
> >
> > To make this last sentence less confusing: "The folio got invalidated
> > after writeback completed and before the ordered extent finished."
> >
> >>
> >> And [16K, 24K) range is dirty and uptodate, caused by a block aligned
> >> buffered write (and future enhancements allowing btrfs to skip full
> >> folio read for such case).
> >>
> >> Furthermore, OE 2 is created covering range [16K, 24K) by the writeback
> >> of previous folio.
> >
> > What previous folio? There's no other one in the example.
> > Do you mean there's a folio for range [0, 16K) and running delalloc
> > resulted in the creation of an OE for the range [0, 24K)? So that OE 2
> > doesn't really start at 16K but at 0 instead?
>
> Yep, exactly.
>
> [...]
> > Normally the parameter list/description comes before describing the
> > return values.
> >
> >> + */
> >> +static bool can_skip_one_ordered_range(struct btrfs_inode *binode,
> >
> > Why binode and not just inode? There's no vfs inode in the function to
> > make any confusion.
>
> OK, will go the regular inode naming.
>
> >
> >
> >> + struct btrfs_ordered_extent *ordered,
> >> + u64 cur, u64 *next_ret)
> >> +{
> >> + const struct btrfs_fs_info *fs_info = binode->root->fs_info;
> >> + struct folio *folio;
> >> + const u32 blocksize = fs_info->sectorsize;
> >> + u64 range_len;
> >> + bool ret;
> >> +
> >> + folio = filemap_get_folio(binode->vfs_inode.i_mapping,
> >> + cur >> PAGE_SHIFT);
> >> +
> >> + /*
> >> + * We should have locked the folio(s) for range [start, end], thus
> >> + * there must be a folio and it must be locked.
> >> + */
> >> + ASSERT(!IS_ERR(folio));
> >> + ASSERT(folio_test_locked(folio));
> >> +
> >> + /*
> >> + * We have several cases for the folio and OE combination:
> >> + *
> >> + * 0) Folio has no private flag
> >
> > Why does the numbering start from 0? It's not code and it's not
> > related to array indexes or offsets :)
>
> Because I forgot the to-be-read folio may not even be managed by btrfs,
> and hit ASSERT()s related to that.
> Thus I considered the private flag an important prerequisite, and gave
> that case number 0.
>
> But you're right, in fact it's very common and should not be treated
> specially.
>
> >
> >> + * The OE has all its IO done but not yet finished, and folio got
> >> + * invalidated. Or direct IO.
> >
> > Ok so I still don't understand the direct IO case, because direct IO
> > doesn't use the page cache.
> > Just like the change log, some explanation here would be desired.
> >
> >> + *
> >> + * Have to wait for the OE to finish, as it may contain the
> >> + * to-be-inserted data checksum.
> >> + * Without the data checksum inserted into csum tree, read
> >> + * will just fail with missing csum.
> >
> > Well we don't need to wait if it's a NOCOW / NODATASUM write.
>
> That is also a valid optimization, I can enhance the check to skip
> NODATASUM OEs.
>
> Thanks a lot for reviewing the core mechanism of the patchset.
>
> Thanks,
> Qu
>
* Re: [PATCH 3/7] btrfs: introduce a read path dedicated extent lock helper
2025-02-25 23:44 ` Qu Wenruo
@ 2025-02-26 11:07 ` Filipe Manana
0 siblings, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2025-02-26 11:07 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs
On Tue, Feb 25, 2025 at 11:44 PM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> On 2025/2/26 07:42, Qu Wenruo wrote:
> >
> [...]
> >>> + *
> >>> + * Have to wait for the OE to finish, as it may contain the
> >>> + * to-be-inserted data checksum.
> >>> + * Without the data checksum inserted into csum tree, read
> >>> + * will just fail with missing csum.
> >>
> >> Well we don't need to wait if it's a NOCOW / NODATASUM write.
> >
> > That is also a valid optimization, I can enhance the check to skip
> > NODATASUM OEs.
>
> BTW, the enhancement will not be in this series.
>
> > As it will introduce extra skip cases, which may further make things
> > more complex.
Sure, that's fine.
Thanks.
>
> I'll make it a dedicated patch in the future.
> Currently there are quite some patches tested based on the behavior
> without NODATASUM skip.
>
> Thanks,
> Qu
>
> >
> > Thanks a lot for reviewing the core mechanism of the patchset.
> >
> > Thanks,
> > Qu
> >
> >
>
end of thread, other threads:[~2025-02-26 11:08 UTC | newest]
Thread overview: 22+ messages
2025-02-23 23:46 [PATCH 0/7] btrfs: make subpage handling be feature full Qu Wenruo
2025-02-23 23:46 ` [PATCH 1/7] btrfs: prevent inline data extents read from touching blocks beyond its range Qu Wenruo
2025-02-25 15:07 ` Filipe Manana
2025-02-25 16:10 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 2/7] btrfs: fix the qgroup data free range for inline data extents Qu Wenruo
2025-02-25 15:11 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 3/7] btrfs: introduce a read path dedicated extent lock helper Qu Wenruo
2025-02-25 17:05 ` Filipe Manana
2025-02-25 21:12 ` Qu Wenruo
2025-02-25 23:44 ` Qu Wenruo
2025-02-26 11:07 ` Filipe Manana
2025-02-26 11:07 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 4/7] btrfs: make btrfs_do_readpage() to do block-by-block read Qu Wenruo
2025-02-25 17:09 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 5/7] btrfs: allow buffered write to avoid full page read if it's block aligned Qu Wenruo
2025-02-25 17:55 ` Filipe Manana
2025-02-25 21:16 ` Qu Wenruo
2025-02-23 23:46 ` [PATCH 6/7] btrfs: allow inline data extents creation if block size < page size Qu Wenruo
2025-02-25 18:03 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 7/7] btrfs: remove the subpage related warning message Qu Wenruo
2025-02-25 18:10 ` Filipe Manana
2025-02-25 21:19 ` Qu Wenruo