[PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups
@ 2025-10-13 12:05 fdmanana
  2025-10-13 12:05 ` [PATCH 1/7] btrfs: truncate ordered extent when skipping writeback past i_size fdmanana
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: fdmanana @ 2025-10-13 12:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The first patch fixes a bug where we can up persisting a file extent item
that crosses the i_size boundary, making btrfs check complain because it
won't find checksum items beyond the i_size boundary. Details in the change
log. The rest are several cleanups in related code that popped up while
debugging the issue.

Filipe Manana (7):
  btrfs: truncate ordered extent when skipping writeback past i_size
  btrfs: use variable for end offset in extent_writepage_io()
  btrfs: split assertion into two in extent_writepage_io()
  btrfs: add unlikely to unexpected error case in extent_writepages()
  btrfs: consistently round up or down i_size in btrfs_truncate()
  btrfs: avoid multiple i_size rounding in btrfs_truncate()
  btrfs: avoid repeated computations in btrfs_mark_ordered_io_finished()

 fs/btrfs/extent_io.c    | 33 ++++++++++++++++++++++++++-------
 fs/btrfs/inode.c        | 18 ++++++++----------
 fs/btrfs/ordered-data.c | 23 +++++++++++------------
 3 files changed, 45 insertions(+), 29 deletions(-)

-- 
2.47.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/7] btrfs: truncate ordered extent when skipping writeback past i_size
  2025-10-13 12:05 [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups fdmanana
@ 2025-10-13 12:05 ` fdmanana
  2025-10-13 20:46   ` Qu Wenruo
  2025-10-13 12:05 ` [PATCH 2/7] btrfs: use variable for end offset in extent_writepage_io() fdmanana
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: fdmanana @ 2025-10-13 12:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

While running test case btrfs/192 from fstests with support for large
folios (needs CONFIG_BTRFS_EXPERIMENTAL=y) I ended up getting very sporadic
btrfs check failures reporting that csum items were missing. Looking into
the issue it turned out that btrfs check searches for csum items of a file
extent item with a range that spans beyond the i_size of a file and we
don't have any, because the kernel's writeback code skips submitting bios
for ranges beyond eof. It's not expected however to find a file extent item
that crosses the rounded up (by the sector size) i_size value, but there is
a short time window where we can end up with a transaction commit leaving
this small inconsistency between the i_size and the last file extent item.

Example btrfs check output when this happens:

  $ btrfs check /dev/sdc
  Opening filesystem to check...
  Checking filesystem on /dev/sdc
  UUID: 69642c61-5efb-4367-aa31-cdfd4067f713
  [1/8] checking log skipped (none written)
  [2/8] checking root items
  [3/8] checking extents
  [4/8] checking free space tree
  [5/8] checking fs roots
  root 5 inode 332 errors 1000, some csum missing
  ERROR: errors found in fs roots
  (...)

Looking at a tree dump of the fs tree (root 5) for inode 332 we have:

   $ btrfs inspect-internal dump-tree -t 5 /dev/sdc
   (...)
        item 28 key (332 INODE_ITEM 0) itemoff 2006 itemsize 160
                generation 17 transid 19 size 610969 nbytes 86016
                block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0
                sequence 11 flags 0x0(none)
                atime 1759851068.391327881 (2025-10-07 16:31:08)
                ctime 1759851068.410098267 (2025-10-07 16:31:08)
                mtime 1759851068.410098267 (2025-10-07 16:31:08)
                otime 1759851068.391327881 (2025-10-07 16:31:08)
        item 29 key (332 INODE_REF 340) itemoff 1993 itemsize 13
                index 2 namelen 3 name: f1f
        item 30 key (332 EXTENT_DATA 589824) itemoff 1940 itemsize 53
                generation 19 type 1 (regular)
                extent data disk byte 21745664 nr 65536
                extent data offset 0 nr 65536 ram 65536
                extent compression 0 (none)
   (...)

We can see that the file extent item for file offset 589824 has a length of
64K and its number of bytes is 64K. Looking at the inode item we see that
its i_size is 610969 bytes which falls within the range of that file extent
item [589824, 655360[.

Looking into the csum tree:

  $ btrfs inspect-internal dump-tree /dev/sdc
  (...)
        item 15 key (EXTENT_CSUM EXTENT_CSUM 21565440) itemoff 991 itemsize 200
                range start 21565440 end 21770240 length 204800
           item 16 key (EXTENT_CSUM EXTENT_CSUM 1104576512) itemoff 983 itemsize 8
                range start 1104576512 end 1104584704 length 8192
  (..)

We see that the csum item number 15 covers the first 24K of the file extent
item - it ends at offset 21770240 and the extent's disk_bytenr is 21745664,
so we have:

   21770240 - 21745664 = 24K

We see that the next csum item (number 16) is completely outside the range,
so the remaining 40K of the extent doesn't have csum items in the tree.

If we round up the i_size to the sector size, we get:

   round_up(610969, 4096) = 614400

If we subtract from that the file offset for the extent item we get:

   614400 - 589824 = 24K

So the missing 40K corresponds to the end of the file extent item's range
minus the rounded up i_size:

   655360 - 614400 = 40K

Normally we don't expect a file extent item to span over the rounded up
i_size of an inode, since when truncating, doing hole punching and other
operations that trim a file extent item, the number of bytes is adjusted.

There is however a short time window where the kernel can end up,
temporarily,persisting an inode with an i_size that falls in the middle of
the last file extent item and the file extent item was not yet trimmed (its
number of bytes reduced so that it doesn't cross i_size rounded up by the
sector size).

The steps (in the kernel) that lead to such scenario are the following:

 1) We have inode I as an empty file, no allocated extents, i_size is 0;

 2) A buffered write is done for file range [589824, 655360[ (length of
    64K) and the i_size is updated to 655360. Note that we got a single
    large folio for the range (64K);

 3) A truncate operation starts that reduces the inode's i_size down to
    610969 bytes. The truncate sets the inode's new i_size at
    btrfs_setsize() by calling truncate_setsize() and before calling
    btrfs_truncate();

 4) At btrfs_truncate() we trigger writeback for the range starting at
    610304 (which is the new i_size rounded down to the sector size) and
    ending at (u64)-1;

 5) During the writeback, at extent_write_cache_pages(), we get from the
    call to filemap_get_folios_tag(), the 64K folio that starts at file
    offset 589824 since it contains the start offset of the writeback
    range (610304);

 6) At writepage_delalloc() we find the whole range of the folio is dirty
    and therefore we run delalloc for that 64K range ([589824, 655360[),
    reserving a 64K extent, creating an ordered extent, etc;

 7) At extent_writepage_io() we submit IO only for subrange [589824, 614400[
    because the inode's i_size is 610969 bytes (rounded up by sector size
    is 614400). There, in the while loop we intentionally skip IO beyond
    i_size to avoid any unnecessay work and just call
    btrfs_mark_ordered_io_finished() for the range [614400, 655360[ (which
    has a 40K length);

 8) Once the IO finishes we finish the ordered extent by ending up at
    btrfs_finish_one_ordered(), join transaction N, insert a file extent
    item in the inode's subvolume tree for file offset 589824 with a number
    of bytes of 64K, and update the inode's delayed inode item or directly
    the inode item with a call to btrfs_update_inode_fallback(), which
    results in storing the new i_size of 610969 bytes;

 9) Transaction N is committed either by the transaction kthread or some
    other task committed it (in response to a sync or fsync for example).

    At this point we have inode I persisted with an i_size of 610969 bytes
    and file extent item that starts at file offset 589824 and has a number
    of bytes of 64K, ending at an offset of 655360 which is beyond the
    i_size rounded up to the sector size (614400).

    --> So after a crash or power failure here, the btrfs check program
        reports that error about missing checksum items for this inode, as
	it tries to lookup for checksums covering the whole range of the
	extent;

10) Only after transaction N is committed that at btrfs_truncate() the
    call to btrfs_start_transaction() starts a new transaction, N + 1,
    instead of joining transaction N. And it's with transaction N + 1 that
    it calls btrfs_truncate_inode_items() which updates the file extent
    item at file offset 589824 to reduce its number of bytes from 64K down
    to 24K, so that the file extent item's range ends at the i_size
    rounded up to the sector size (614400 bytes).

Fix this by truncating the ordered extent at extent_writepage_io() when we
skip writeback because the current offset in the folio is beyond i_size.
This ensures we don't ever persist a file extent item with a number of
bytes beyond the rounded up (by sector size) value of the i_size.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c    | 21 +++++++++++++++++++--
 fs/btrfs/ordered-data.c |  5 +++--
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3a8681566fc5..67706c1efa88 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1691,13 +1691,13 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	bool submitted_io = false;
 	int found_error = 0;
 	const u64 folio_start = folio_pos(folio);
+	const u64 folio_end = folio_start + folio_size(folio);
 	const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio);
 	u64 cur;
 	int bit;
 	int ret = 0;
 
-	ASSERT(start >= folio_start &&
-	       start + len <= folio_start + folio_size(folio));
+	ASSERT(start >= folio_start && start + len <= folio_end);
 
 	ret = btrfs_writepage_cow_fixup(folio);
 	if (ret == -EAGAIN) {
@@ -1724,6 +1724,23 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 		cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits);
 
 		if (cur >= i_size) {
+			struct btrfs_ordered_extent *ordered;
+			unsigned long flags;
+
+			ordered = btrfs_lookup_first_ordered_range(inode, cur,
+								   folio_end - cur);
+			/*
+			 * We have just run delalloc before getting here, so
+			 * there must be an ordered extent.
+			 */
+			ASSERT(ordered != NULL);
+			spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
+			ordered->truncated_len = min(ordered->truncated_len,
+						     cur - ordered->file_offset);
+			spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+			btrfs_put_ordered_extent(ordered);
+
 			btrfs_mark_ordered_io_finished(inode, folio, cur,
 						       start + len - cur, true);
 			/*
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 2829f20d7bb5..8a8aa6ed405b 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1098,8 +1098,9 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
 	struct rb_node *prev;
 	struct rb_node *next;
 	struct btrfs_ordered_extent *entry = NULL;
+	unsigned long flags;
 
-	spin_lock_irq(&inode->ordered_tree_lock);
+	spin_lock_irqsave(&inode->ordered_tree_lock, flags);
 	node = inode->ordered_tree.rb_node;
 	/*
 	 * Here we don't want to use tree_search() which will use tree->last
@@ -1154,7 +1155,7 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
 		trace_btrfs_ordered_extent_lookup_first_range(inode, entry);
 	}
 
-	spin_unlock_irq(&inode->ordered_tree_lock);
+	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
 	return entry;
 }
 
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/7] btrfs: use variable for end offset in extent_writepage_io()
  2025-10-13 12:05 [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups fdmanana
  2025-10-13 12:05 ` [PATCH 1/7] btrfs: truncate ordered extent when skipping writeback past i_size fdmanana
@ 2025-10-13 12:05 ` fdmanana
  2025-10-13 20:51   ` Qu Wenruo
  2025-10-13 12:05 ` [PATCH 3/7] btrfs: split assertion into two " fdmanana
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: fdmanana @ 2025-10-13 12:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Instead of repeating the expression "start + len" multiple times, store it
in a variable and use it where needed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 67706c1efa88..c641eb50d0ee 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1690,6 +1690,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	unsigned long range_bitmap = 0;
 	bool submitted_io = false;
 	int found_error = 0;
+	const u64 end = start + len;
 	const u64 folio_start = folio_pos(folio);
 	const u64 folio_end = folio_start + folio_size(folio);
 	const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio);
@@ -1697,7 +1698,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	int bit;
 	int ret = 0;
 
-	ASSERT(start >= folio_start && start + len <= folio_end);
+	ASSERT(start >= folio_start && end <= folio_end);
 
 	ret = btrfs_writepage_cow_fixup(folio);
 	if (ret == -EAGAIN) {
@@ -1713,7 +1714,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 		return ret;
 	}
 
-	for (cur = start; cur < start + len; cur += fs_info->sectorsize)
+	for (cur = start; cur < end; cur += fs_info->sectorsize)
 		set_bit((cur - folio_start) >> fs_info->sectorsize_bits, &range_bitmap);
 	bitmap_and(&bio_ctrl->submit_bitmap, &bio_ctrl->submit_bitmap, &range_bitmap,
 		   blocks_per_folio);
@@ -1742,7 +1743,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			btrfs_put_ordered_extent(ordered);
 
 			btrfs_mark_ordered_io_finished(inode, folio, cur,
-						       start + len - cur, true);
+						       end - cur, true);
 			/*
 			 * This range is beyond i_size, thus we don't need to
 			 * bother writing back.
@@ -1751,8 +1752,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			 * writeback the sectors with subpage dirty bits,
 			 * causing writeback without ordered extent.
 			 */
-			btrfs_folio_clear_dirty(fs_info, folio, cur,
-						start + len - cur);
+			btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);
 			break;
 		}
 		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/7] btrfs: split assertion into two in extent_writepage_io()
  2025-10-13 12:05 [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups fdmanana
  2025-10-13 12:05 ` [PATCH 1/7] btrfs: truncate ordered extent when skipping writeback past i_size fdmanana
  2025-10-13 12:05 ` [PATCH 2/7] btrfs: use variable for end offset in extent_writepage_io() fdmanana
@ 2025-10-13 12:05 ` fdmanana
  2025-10-13 20:52   ` Qu Wenruo
  2025-10-13 12:05 ` [PATCH 4/7] btrfs: add unlikely to unexpected error case in extent_writepages() fdmanana
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: fdmanana @ 2025-10-13 12:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

If the assertion fails we don't get to know which of the two expressions
failed and neither the values used in each expression.

So split the assertion into two, each for a single expression, so that
if any is triggered we see a line number reported in a stack trace that
points to which expression failed. Also  make the assertions use the
verbose mode to print the values involved in the computations.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c641eb50d0ee..1cb73f55af20 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1698,7 +1698,9 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	int bit;
 	int ret = 0;
 
-	ASSERT(start >= folio_start && end <= folio_end);
+	ASSERT(start >= folio_start, "start=%llu folio_start=%llu", start, folio_start);
+	ASSERT(end <= folio_end, "start=%llu len=%u folio_start=%llu folio_size=%zu",
+	       start, len, folio_start, folio_size(folio));
 
 	ret = btrfs_writepage_cow_fixup(folio);
 	if (ret == -EAGAIN) {
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/7] btrfs: add unlikely to unexpected error case in extent_writepages()
  2025-10-13 12:05 [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups fdmanana
                   ` (2 preceding siblings ...)
  2025-10-13 12:05 ` [PATCH 3/7] btrfs: split assertion into two " fdmanana
@ 2025-10-13 12:05 ` fdmanana
  2025-10-13 20:52   ` Qu Wenruo
  2025-10-13 12:05 ` [PATCH 5/7] btrfs: consistently round up or down i_size in btrfs_truncate() fdmanana
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: fdmanana @ 2025-10-13 12:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We don't expect to hit errors and log the error message, so add the
unlikely annotation to make it clear and to hint the compiler that it may
reorganize code to be more efficient.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1cb73f55af20..870584dde575 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1874,7 +1874,7 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
 				  folio_size(folio), bio_ctrl, i_size);
 	if (ret == 1)
 		return 0;
-	if (ret < 0)
+	if (unlikely(ret < 0))
 		btrfs_err_rl(fs_info,
 "failed to submit blocks, root=%lld inode=%llu folio=%llu submit_bitmap=%*pbl: %d",
 			     btrfs_root_id(inode->root), btrfs_ino(inode),
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 5/7] btrfs: consistently round up or down i_size in btrfs_truncate()
  2025-10-13 12:05 [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups fdmanana
                   ` (3 preceding siblings ...)
  2025-10-13 12:05 ` [PATCH 4/7] btrfs: add unlikely to unexpected error case in extent_writepages() fdmanana
@ 2025-10-13 12:05 ` fdmanana
  2025-10-13 20:52   ` Qu Wenruo
  2025-10-13 12:05 ` [PATCH 6/7] btrfs: avoid multiple i_size rounding " fdmanana
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: fdmanana @ 2025-10-13 12:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We're using different ways to round down the i_size by sector size, one
with a bitwise and with a negated mask and another with ALIGN_DOWN(), and
using ALIGN() to round up.

Replace these uses with the round_down() and round_up() macros which have
have names that make it clear the direction of the rounding (unlike the
ALIGN() macro) and getting rid of the bitwise and, negated mask and local
variable for the mask.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ac2fd589697d..4a4cb91b7586 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7649,12 +7649,12 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
 	struct btrfs_block_rsv rsv;
 	int ret;
 	struct btrfs_trans_handle *trans;
-	u64 mask = fs_info->sectorsize - 1;
 	const u64 min_size = btrfs_calc_metadata_size(fs_info, 1);
 
 	if (!skip_writeback) {
 		ret = btrfs_wait_ordered_range(inode,
-					       inode->vfs_inode.i_size & (~mask),
+					       round_down(inode->vfs_inode.i_size,
+							  fs_info->sectorsize),
 					       (u64)-1);
 		if (ret)
 			return ret;
@@ -7720,7 +7720,7 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
 	while (1) {
 		struct extent_state *cached_state = NULL;
 		const u64 new_size = inode->vfs_inode.i_size;
-		const u64 lock_start = ALIGN_DOWN(new_size, fs_info->sectorsize);
+		const u64 lock_start = round_down(new_size, fs_info->sectorsize);
 
 		control.new_size = new_size;
 		btrfs_lock_extent(&inode->io_tree, lock_start, (u64)-1, &cached_state);
@@ -7730,7 +7730,7 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
 		 * block of the extent just the way it is.
 		 */
 		btrfs_drop_extent_map_range(inode,
-					    ALIGN(new_size, fs_info->sectorsize),
+					    round_up(new_size, fs_info->sectorsize),
 					    (u64)-1, false);
 
 		ret = btrfs_truncate_inode_items(trans, root, &control);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 6/7] btrfs: avoid multiple i_size rounding in btrfs_truncate()
  2025-10-13 12:05 [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups fdmanana
                   ` (4 preceding siblings ...)
  2025-10-13 12:05 ` [PATCH 5/7] btrfs: consistently round up or down i_size in btrfs_truncate() fdmanana
@ 2025-10-13 12:05 ` fdmanana
  2025-10-13 20:53   ` Qu Wenruo
  2025-10-13 12:05 ` [PATCH 7/7] btrfs: avoid repeated computations in btrfs_mark_ordered_io_finished() fdmanana
  2025-10-13 21:01 ` [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups Qu Wenruo
  7 siblings, 1 reply; 17+ messages in thread
From: fdmanana @ 2025-10-13 12:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We have the inode locked so no one can concurrently change its i_size and
neither do we change it ourselves, so there's no point in keep rounding
it in the while loop and setting it up in the control structure. That only
causes confusion when reading the code.

So move all the i_size setup and rounding out of the loop and assert the
inode is locked.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4a4cb91b7586..096b995fe87b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7643,6 +7643,7 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
 		.ino = btrfs_ino(inode),
 		.min_type = BTRFS_EXTENT_DATA_KEY,
 		.clear_extent_range = true,
+		.new_size = inode->vfs_inode.i_size,
 	};
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -7650,12 +7651,14 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
 	int ret;
 	struct btrfs_trans_handle *trans;
 	const u64 min_size = btrfs_calc_metadata_size(fs_info, 1);
+	const u64 lock_start = round_down(inode->vfs_inode.i_size, fs_info->sectorsize);
+	const u64 i_size_up = round_up(inode->vfs_inode.i_size, fs_info->sectorsize);
+
+	/* Our inode is locked and the i_size can't be changed concurrently. */
+	btrfs_assert_inode_locked(inode);
 
 	if (!skip_writeback) {
-		ret = btrfs_wait_ordered_range(inode,
-					       round_down(inode->vfs_inode.i_size,
-							  fs_info->sectorsize),
-					       (u64)-1);
+		ret = btrfs_wait_ordered_range(inode, lock_start, (u64)-1);
 		if (ret)
 			return ret;
 	}
@@ -7719,19 +7722,14 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
 
 	while (1) {
 		struct extent_state *cached_state = NULL;
-		const u64 new_size = inode->vfs_inode.i_size;
-		const u64 lock_start = round_down(new_size, fs_info->sectorsize);
 
-		control.new_size = new_size;
 		btrfs_lock_extent(&inode->io_tree, lock_start, (u64)-1, &cached_state);
 		/*
 		 * We want to drop from the next block forward in case this new
 		 * size is not block aligned since we will be keeping the last
 		 * block of the extent just the way it is.
 		 */
-		btrfs_drop_extent_map_range(inode,
-					    round_up(new_size, fs_info->sectorsize),
-					    (u64)-1, false);
+		btrfs_drop_extent_map_range(inode, i_size_up, (u64)-1, false);
 
 		ret = btrfs_truncate_inode_items(trans, root, &control);
 
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 7/7] btrfs: avoid repeated computations in btrfs_mark_ordered_io_finished()
  2025-10-13 12:05 [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups fdmanana
                   ` (5 preceding siblings ...)
  2025-10-13 12:05 ` [PATCH 6/7] btrfs: avoid multiple i_size rounding " fdmanana
@ 2025-10-13 12:05 ` fdmanana
  2025-10-13 20:55   ` Qu Wenruo
  2025-10-13 21:01 ` [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups Qu Wenruo
  7 siblings, 1 reply; 17+ messages in thread
From: fdmanana @ 2025-10-13 12:05 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We're computing a few values several times:

1) The current ordered extent's end offset inside the while loop, we have
   computed it and stored it in the 'entry_end' variable but then we
   compute it again later as the first argument to the min() macro;

2) The end file offset, open coded 3 times;

3) The current length (stored in variable 'len') computed 2 times, one
   inside an assertion and the other when assigning to the 'len' variable.

So use existing variables and add new ones to prevent repeating these
expressions and reduce the source code.

We were also subtracting one from the result of min() macro call and
then adding 1 back in the next line, making both operations pointless.
So just remove the decrement and increment by 1.

This also reduces very slightly the object code.

Before:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1916576	 161679	  15592	2093847	 1ff317	fs/btrfs/btrfs.ko

After:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  1916556	 161679	  15592	2093827	 1ff303	fs/btrfs/btrfs.ko

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ordered-data.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 8a8aa6ed405b..dfda952dcf7b 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -483,16 +483,15 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 	struct btrfs_ordered_extent *entry = NULL;
 	unsigned long flags;
 	u64 cur = file_offset;
+	const u64 end = file_offset + num_bytes;
 
-	trace_btrfs_writepage_end_io_hook(inode, file_offset,
-					  file_offset + num_bytes - 1,
-					  uptodate);
+	trace_btrfs_writepage_end_io_hook(inode, file_offset, end - 1, uptodate);
 
 	spin_lock_irqsave(&inode->ordered_tree_lock, flags);
-	while (cur < file_offset + num_bytes) {
+	while (cur < end) {
 		u64 entry_end;
-		u64 end;
-		u32 len;
+		u64 this_end;
+		u64 len;
 
 		node = ordered_tree_search(inode, cur);
 		/* No ordered extents at all */
@@ -535,10 +534,9 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 		 *	|
 		 *	cur
 		 */
-		end = min(entry->file_offset + entry->num_bytes,
-			  file_offset + num_bytes) - 1;
-		ASSERT(end + 1 - cur < U32_MAX);
-		len = end + 1 - cur;
+		this_end = min(entry_end, end);
+		len = this_end - cur;
+		ASSERT(len < U32_MAX);
 
 		if (can_finish_ordered_extent(entry, folio, cur, len, uptodate)) {
 			spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/7] btrfs: truncate ordered extent when skipping writeback past i_size
  2025-10-13 12:05 ` [PATCH 1/7] btrfs: truncate ordered extent when skipping writeback past i_size fdmanana
@ 2025-10-13 20:46   ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-10-13 20:46 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



在 2025/10/13 22:35, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
> 
> While running test case btrfs/192 from fstests with support for large
> folios (needs CONFIG_BTRFS_EXPERIMENTAL=y) I ended up getting very sporadic
> btrfs check failures reporting that csum items were missing. Looking into
> the issue it turned out that btrfs check searches for csum items of a file
> extent item with a range that spans beyond the i_size of a file and we
> don't have any, because the kernel's writeback code skips submitting bios
> for ranges beyond eof. It's not expected however to find a file extent item
> that crosses the rounded up (by the sector size) i_size value, but there is
> a short time window where we can end up with a transaction commit leaving
> this small inconsistency between the i_size and the last file extent item.
> 
> Example btrfs check output when this happens:
> 
>    $ btrfs check /dev/sdc
>    Opening filesystem to check...
>    Checking filesystem on /dev/sdc
>    UUID: 69642c61-5efb-4367-aa31-cdfd4067f713
>    [1/8] checking log skipped (none written)
>    [2/8] checking root items
>    [3/8] checking extents
>    [4/8] checking free space tree
>    [5/8] checking fs roots
>    root 5 inode 332 errors 1000, some csum missing
>    ERROR: errors found in fs roots
>    (...)
> 
> Looking at a tree dump of the fs tree (root 5) for inode 332 we have:
> 
>     $ btrfs inspect-internal dump-tree -t 5 /dev/sdc
>     (...)
>          item 28 key (332 INODE_ITEM 0) itemoff 2006 itemsize 160
>                  generation 17 transid 19 size 610969 nbytes 86016
>                  block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0
>                  sequence 11 flags 0x0(none)
>                  atime 1759851068.391327881 (2025-10-07 16:31:08)
>                  ctime 1759851068.410098267 (2025-10-07 16:31:08)
>                  mtime 1759851068.410098267 (2025-10-07 16:31:08)
>                  otime 1759851068.391327881 (2025-10-07 16:31:08)
>          item 29 key (332 INODE_REF 340) itemoff 1993 itemsize 13
>                  index 2 namelen 3 name: f1f
>          item 30 key (332 EXTENT_DATA 589824) itemoff 1940 itemsize 53
>                  generation 19 type 1 (regular)
>                  extent data disk byte 21745664 nr 65536
>                  extent data offset 0 nr 65536 ram 65536
>                  extent compression 0 (none)
>     (...)
> 
> We can see that the file extent item for file offset 589824 has a length of
> 64K and its number of bytes is 64K. Looking at the inode item we see that
> its i_size is 610969 bytes which falls within the range of that file extent
> item [589824, 655360[.
> 
> Looking into the csum tree:
> 
>    $ btrfs inspect-internal dump-tree /dev/sdc
>    (...)
>          item 15 key (EXTENT_CSUM EXTENT_CSUM 21565440) itemoff 991 itemsize 200
>                  range start 21565440 end 21770240 length 204800
>             item 16 key (EXTENT_CSUM EXTENT_CSUM 1104576512) itemoff 983 itemsize 8
>                  range start 1104576512 end 1104584704 length 8192
>    (..)
> 
> We see that the csum item number 15 covers the first 24K of the file extent
> item - it ends at offset 21770240 and the extent's disk_bytenr is 21745664,
> so we have:
> 
>     21770240 - 21745664 = 24K
> 
> We see that the next csum item (number 16) is completely outside the range,
> so the remaining 40K of the extent doesn't have csum items in the tree.
> 
> If we round up the i_size to the sector size, we get:
> 
>     round_up(610969, 4096) = 614400
> 
> If we subtract from that the file offset for the extent item we get:
> 
>     614400 - 589824 = 24K
> 
> So the missing 40K corresponds to the end of the file extent item's range
> minus the rounded up i_size:
> 
>     655360 - 614400 = 40K
> 
> Normally we don't expect a file extent item to span over the rounded up
> i_size of an inode, since when truncating, doing hole punching and other
> operations that trim a file extent item, the number of bytes is adjusted.
> 
> There is however a short time window where the kernel can end up,
> temporarily,persisting an inode with an i_size that falls in the middle of
> the last file extent item and the file extent item was not yet trimmed (its
> number of bytes reduced so that it doesn't cross i_size rounded up by the
> sector size).
> 
> The steps (in the kernel) that lead to such scenario are the following:
> 
>   1) We have inode I as an empty file, no allocated extents, i_size is 0;
> 
>   2) A buffered write is done for file range [589824, 655360[ (length of
>      64K) and the i_size is updated to 655360. Note that we got a single
>      large folio for the range (64K);
> 
>   3) A truncate operation starts that reduces the inode's i_size down to
>      610969 bytes. The truncate sets the inode's new i_size at
>      btrfs_setsize() by calling truncate_setsize() and before calling
>      btrfs_truncate();
> 
>   4) At btrfs_truncate() we trigger writeback for the range starting at
>      610304 (which is the new i_size rounded down to the sector size) and
>      ending at (u64)-1;
> 
>   5) During the writeback, at extent_write_cache_pages(), we get from the
>      call to filemap_get_folios_tag(), the 64K folio that starts at file
>      offset 589824 since it contains the start offset of the writeback
>      range (610304);
> 
>   6) At writepage_delalloc() we find the whole range of the folio is dirty
>      and therefore we run delalloc for that 64K range ([589824, 655360[),
>      reserving a 64K extent, creating an ordered extent, etc;
> 
>   7) At extent_writepage_io() we submit IO only for subrange [589824, 614400[
>      because the inode's i_size is 610969 bytes (rounded up by sector size
>      is 614400). There, in the while loop we intentionally skip IO beyond
>      i_size to avoid any unnecessay work and just call
>      btrfs_mark_ordered_io_finished() for the range [614400, 655360[ (which
>      has a 40K length);
> 
>   8) Once the IO finishes we finish the ordered extent by ending up at
>      btrfs_finish_one_ordered(), join transaction N, insert a file extent
>      item in the inode's subvolume tree for file offset 589824 with a number
>      of bytes of 64K, and update the inode's delayed inode item or directly
>      the inode item with a call to btrfs_update_inode_fallback(), which
>      results in storing the new i_size of 610969 bytes;
> 
>   9) Transaction N is committed either by the transaction kthread or some
>      other task committed it (in response to a sync or fsync for example).
> 
>      At this point we have inode I persisted with an i_size of 610969 bytes
>      and file extent item that starts at file offset 589824 and has a number
>      of bytes of 64K, ending at an offset of 655360 which is beyond the
>      i_size rounded up to the sector size (614400).
> 
>      --> So after a crash or power failure here, the btrfs check program
>          reports that error about missing checksum items for this inode, as
> 	it tries to lookup for checksums covering the whole range of the
> 	extent;
> 
> 10) Only after transaction N is committed that at btrfs_truncate() the
>      call to btrfs_start_transaction() starts a new transaction, N + 1,
>      instead of joining transaction N. And it's with transaction N + 1 that
>      it calls btrfs_truncate_inode_items() which updates the file extent
>      item at file offset 589824 to reduce its number of bytes from 64K down
>      to 24K, so that the file extent item's range ends at the i_size
>      rounded up to the sector size (614400 bytes).
> 
> Fix this by truncating the ordered extent at extent_writepage_io() when we
> skip writeback because the current offset in the folio is beyond i_size.
> This ensures we don't ever persist a file extent item with a number of
> bytes beyond the rounded up (by sector size) value of the i_size.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
[...]
> @@ -1724,6 +1724,23 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>   		cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits);
>   
>   		if (cur >= i_size) {
> +			struct btrfs_ordered_extent *ordered;
> +			unsigned long flags;
> +
> +			ordered = btrfs_lookup_first_ordered_range(inode, cur,
> +								   folio_end - cur);
> +			/*
> +			 * We have just run delalloc before getting here, so
> +			 * there must be an ordered extent.
> +			 */
> +			ASSERT(ordered != NULL);
> +			spin_lock_irqsave(&inode->ordered_tree_lock, flags);
> +			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> +			ordered->truncated_len = min(ordered->truncated_len,
> +						     cur - ordered->file_offset);
> +			spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
> +			btrfs_put_ordered_extent(ordered);
> +
>   			btrfs_mark_ordered_io_finished(inode, folio, cur,
>   						       start + len - cur, true);
>   			/*
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 2829f20d7bb5..8a8aa6ed405b 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -1098,8 +1098,9 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
>   	struct rb_node *prev;
>   	struct rb_node *next;
>   	struct btrfs_ordered_extent *entry = NULL;
> +	unsigned long flags;
>   
> -	spin_lock_irq(&inode->ordered_tree_lock);
> +	spin_lock_irqsave(&inode->ordered_tree_lock, flags);
>   	node = inode->ordered_tree.rb_node;
>   	/*
>   	 * Here we don't want to use tree_search() which will use tree->last
> @@ -1154,7 +1155,7 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
>   		trace_btrfs_ordered_extent_lookup_first_range(inode, entry);
>   	}
>   
> -	spin_unlock_irq(&inode->ordered_tree_lock);
> +	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
>   	return entry;
>   }
>   


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/7] btrfs: use variable for end offset in extent_writepage_io()
  2025-10-13 12:05 ` [PATCH 2/7] btrfs: use variable for end offset in extent_writepage_io() fdmanana
@ 2025-10-13 20:51   ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-10-13 20:51 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



在 2025/10/13 22:35, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Instead of repeating the expression "start + len" multiple times, store it
> in a variable and use it where needed.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/extent_io.c | 10 +++++-----
>   1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 67706c1efa88..c641eb50d0ee 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1690,6 +1690,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>   	unsigned long range_bitmap = 0;
>   	bool submitted_io = false;
>   	int found_error = 0;
> +	const u64 end = start + len;
>   	const u64 folio_start = folio_pos(folio);
>   	const u64 folio_end = folio_start + folio_size(folio);
>   	const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio);
> @@ -1697,7 +1698,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>   	int bit;
>   	int ret = 0;
>   
> -	ASSERT(start >= folio_start && start + len <= folio_end);
> +	ASSERT(start >= folio_start && end <= folio_end);
>   
>   	ret = btrfs_writepage_cow_fixup(folio);
>   	if (ret == -EAGAIN) {
> @@ -1713,7 +1714,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>   		return ret;
>   	}
>   
> -	for (cur = start; cur < start + len; cur += fs_info->sectorsize)
> +	for (cur = start; cur < end; cur += fs_info->sectorsize)
>   		set_bit((cur - folio_start) >> fs_info->sectorsize_bits, &range_bitmap);
>   	bitmap_and(&bio_ctrl->submit_bitmap, &bio_ctrl->submit_bitmap, &range_bitmap,
>   		   blocks_per_folio);
> @@ -1742,7 +1743,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>   			btrfs_put_ordered_extent(ordered);
>   
>   			btrfs_mark_ordered_io_finished(inode, folio, cur,
> -						       start + len - cur, true);
> +						       end - cur, true);
>   			/*
>   			 * This range is beyond i_size, thus we don't need to
>   			 * bother writing back.
> @@ -1751,8 +1752,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>   			 * writeback the sectors with subpage dirty bits,
>   			 * causing writeback without ordered extent.
>   			 */
> -			btrfs_folio_clear_dirty(fs_info, folio, cur,
> -						start + len - cur);
> +			btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);
>   			break;
>   		}
>   		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/7] btrfs: split assertion into two in extent_writepage_io()
  2025-10-13 12:05 ` [PATCH 3/7] btrfs: split assertion into two " fdmanana
@ 2025-10-13 20:52   ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-10-13 20:52 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



在 2025/10/13 22:35, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
> 
> If the assertion fails we don't get to know which of the two expressions
> failed and neither the values used in each expression.
> 
> So split the assertion into two, each for a single expression, so that
> if any is triggered we see a line number reported in a stack trace that
> points to which expression failed. Also  make the assertions use the
> verbose mode to print the values involved in the computations.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> ---
>   fs/btrfs/extent_io.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c641eb50d0ee..1cb73f55af20 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1698,7 +1698,9 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>   	int bit;
>   	int ret = 0;
>   
> -	ASSERT(start >= folio_start && end <= folio_end);
> +	ASSERT(start >= folio_start, "start=%llu folio_start=%llu", start, folio_start);
> +	ASSERT(end <= folio_end, "start=%llu len=%u folio_start=%llu folio_size=%zu",
> +	       start, len, folio_start, folio_size(folio));
>   
>   	ret = btrfs_writepage_cow_fixup(folio);
>   	if (ret == -EAGAIN) {


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/7] btrfs: add unlikely to unexpected error case in extent_writepages()
  2025-10-13 12:05 ` [PATCH 4/7] btrfs: add unlikely to unexpected error case in extent_writepages() fdmanana
@ 2025-10-13 20:52   ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-10-13 20:52 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



在 2025/10/13 22:35, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
> 
> We don't expect to hit errors and log the error message, so add the
> unlikely annotation to make it clear and to hint the compiler that it may
> reorganize code to be more efficient.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> ---
>   fs/btrfs/extent_io.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 1cb73f55af20..870584dde575 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1874,7 +1874,7 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
>   				  folio_size(folio), bio_ctrl, i_size);
>   	if (ret == 1)
>   		return 0;
> -	if (ret < 0)
> +	if (unlikely(ret < 0))
>   		btrfs_err_rl(fs_info,
>   "failed to submit blocks, root=%lld inode=%llu folio=%llu submit_bitmap=%*pbl: %d",
>   			     btrfs_root_id(inode->root), btrfs_ino(inode),


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/7] btrfs: consistently round up or down i_size in btrfs_truncate()
  2025-10-13 12:05 ` [PATCH 5/7] btrfs: consistently round up or down i_size in btrfs_truncate() fdmanana
@ 2025-10-13 20:52   ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-10-13 20:52 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



在 2025/10/13 22:35, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
> 
> We're using different ways to round down the i_size by sector size, one
> with a bitwise and with a negated mask and another with ALIGN_DOWN(), and
> using ALIGN() to round up.
> 
> Replace these uses with the round_down() and round_up() macros which have
> have names that make it clear the direction of the rounding (unlike the
> ALIGN() macro) and getting rid of the bitwise and, negated mask and local
> variable for the mask.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> ---
>   fs/btrfs/inode.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index ac2fd589697d..4a4cb91b7586 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7649,12 +7649,12 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
>   	struct btrfs_block_rsv rsv;
>   	int ret;
>   	struct btrfs_trans_handle *trans;
> -	u64 mask = fs_info->sectorsize - 1;
>   	const u64 min_size = btrfs_calc_metadata_size(fs_info, 1);
>   
>   	if (!skip_writeback) {
>   		ret = btrfs_wait_ordered_range(inode,
> -					       inode->vfs_inode.i_size & (~mask),
> +					       round_down(inode->vfs_inode.i_size,
> +							  fs_info->sectorsize),
>   					       (u64)-1);
>   		if (ret)
>   			return ret;
> @@ -7720,7 +7720,7 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
>   	while (1) {
>   		struct extent_state *cached_state = NULL;
>   		const u64 new_size = inode->vfs_inode.i_size;
> -		const u64 lock_start = ALIGN_DOWN(new_size, fs_info->sectorsize);
> +		const u64 lock_start = round_down(new_size, fs_info->sectorsize);
>   
>   		control.new_size = new_size;
>   		btrfs_lock_extent(&inode->io_tree, lock_start, (u64)-1, &cached_state);
> @@ -7730,7 +7730,7 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
>   		 * block of the extent just the way it is.
>   		 */
>   		btrfs_drop_extent_map_range(inode,
> -					    ALIGN(new_size, fs_info->sectorsize),
> +					    round_up(new_size, fs_info->sectorsize),
>   					    (u64)-1, false);
>   
>   		ret = btrfs_truncate_inode_items(trans, root, &control);


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/7] btrfs: avoid multiple i_size rounding in btrfs_truncate()
  2025-10-13 12:05 ` [PATCH 6/7] btrfs: avoid multiple i_size rounding " fdmanana
@ 2025-10-13 20:53   ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-10-13 20:53 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



在 2025/10/13 22:35, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
> 
> We have the inode locked so no one can concurrently change its i_size and
> neither do we change it ourselves, so there's no point in keep rounding
> it in the while loop and setting it up in the control structure. That only
> causes confusion when reading the code.
> 
> So move all the i_size setup and rounding out of the loop and assert the
> inode is locked.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> ---
>   fs/btrfs/inode.c | 18 ++++++++----------
>   1 file changed, 8 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 4a4cb91b7586..096b995fe87b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7643,6 +7643,7 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
>   		.ino = btrfs_ino(inode),
>   		.min_type = BTRFS_EXTENT_DATA_KEY,
>   		.clear_extent_range = true,
> +		.new_size = inode->vfs_inode.i_size,
>   	};
>   	struct btrfs_root *root = inode->root;
>   	struct btrfs_fs_info *fs_info = root->fs_info;
> @@ -7650,12 +7651,14 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
>   	int ret;
>   	struct btrfs_trans_handle *trans;
>   	const u64 min_size = btrfs_calc_metadata_size(fs_info, 1);
> +	const u64 lock_start = round_down(inode->vfs_inode.i_size, fs_info->sectorsize);
> +	const u64 i_size_up = round_up(inode->vfs_inode.i_size, fs_info->sectorsize);
> +
> +	/* Our inode is locked and the i_size can't be changed concurrently. */
> +	btrfs_assert_inode_locked(inode);
>   
>   	if (!skip_writeback) {
> -		ret = btrfs_wait_ordered_range(inode,
> -					       round_down(inode->vfs_inode.i_size,
> -							  fs_info->sectorsize),
> -					       (u64)-1);
> +		ret = btrfs_wait_ordered_range(inode, lock_start, (u64)-1);
>   		if (ret)
>   			return ret;
>   	}
> @@ -7719,19 +7722,14 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
>   
>   	while (1) {
>   		struct extent_state *cached_state = NULL;
> -		const u64 new_size = inode->vfs_inode.i_size;
> -		const u64 lock_start = round_down(new_size, fs_info->sectorsize);
>   
> -		control.new_size = new_size;
>   		btrfs_lock_extent(&inode->io_tree, lock_start, (u64)-1, &cached_state);
>   		/*
>   		 * We want to drop from the next block forward in case this new
>   		 * size is not block aligned since we will be keeping the last
>   		 * block of the extent just the way it is.
>   		 */
> -		btrfs_drop_extent_map_range(inode,
> -					    round_up(new_size, fs_info->sectorsize),
> -					    (u64)-1, false);
> +		btrfs_drop_extent_map_range(inode, i_size_up, (u64)-1, false);
>   
>   		ret = btrfs_truncate_inode_items(trans, root, &control);
>   


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 7/7] btrfs: avoid repeated computations in btrfs_mark_ordered_io_finished()
  2025-10-13 12:05 ` [PATCH 7/7] btrfs: avoid repeated computations in btrfs_mark_ordered_io_finished() fdmanana
@ 2025-10-13 20:55   ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2025-10-13 20:55 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



在 2025/10/13 22:35, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
> 
> We're computing a few values several times:
> 
> 1) The current ordered extent's end offset inside the while loop, we have
>     computed it and stored it in the 'entry_end' variable but then we
>     compute it again later as the first argument to the min() macro;
> 
> 2) The end file offset, open coded 3 times;
> 
> 3) The current length (stored in variable 'len') computed 2 times, one
>     inside an assertion and the other when assigning to the 'len' variable.
> 
> So use existing variables and add new ones to prevent repeating these
> expressions and reduce the source code.
> 
> We were also subtracting one from the result of min() macro call and
> then adding 1 back in the next line, making both operations pointless.
> So just remove the decrement and increment by 1.
> 
> This also reduces very slightly the object code.
> 
> Before:
> 
>    $ size fs/btrfs/btrfs.ko
>       text	   data	    bss	    dec	    hex	filename
>    1916576	 161679	  15592	2093847	 1ff317	fs/btrfs/btrfs.ko
> 
> After:
> 
>    $ size fs/btrfs/btrfs.ko
>       text	   data	    bss	    dec	    hex	filename
>    1916556	 161679	  15592	2093827	 1ff303	fs/btrfs/btrfs.ko
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> ---
>   fs/btrfs/ordered-data.c | 18 ++++++++----------
>   1 file changed, 8 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 8a8aa6ed405b..dfda952dcf7b 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -483,16 +483,15 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
>   	struct btrfs_ordered_extent *entry = NULL;
>   	unsigned long flags;
>   	u64 cur = file_offset;
> +	const u64 end = file_offset + num_bytes;
>   
> -	trace_btrfs_writepage_end_io_hook(inode, file_offset,
> -					  file_offset + num_bytes - 1,
> -					  uptodate);
> +	trace_btrfs_writepage_end_io_hook(inode, file_offset, end - 1, uptodate);
>   
>   	spin_lock_irqsave(&inode->ordered_tree_lock, flags);
> -	while (cur < file_offset + num_bytes) {
> +	while (cur < end) {
>   		u64 entry_end;
> -		u64 end;
> -		u32 len;
> +		u64 this_end;
> +		u64 len;
>   
>   		node = ordered_tree_search(inode, cur);
>   		/* No ordered extents at all */
> @@ -535,10 +534,9 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
>   		 *	|
>   		 *	cur
>   		 */
> -		end = min(entry->file_offset + entry->num_bytes,
> -			  file_offset + num_bytes) - 1;
> -		ASSERT(end + 1 - cur < U32_MAX);
> -		len = end + 1 - cur;
> +		this_end = min(entry_end, end);
> +		len = this_end - cur;
> +		ASSERT(len < U32_MAX);
>   
>   		if (can_finish_ordered_extent(entry, folio, cur, len, uptodate)) {
>   			spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups
  2025-10-13 12:05 [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups fdmanana
                   ` (6 preceding siblings ...)
  2025-10-13 12:05 ` [PATCH 7/7] btrfs: avoid repeated computations in btrfs_mark_ordered_io_finished() fdmanana
@ 2025-10-13 21:01 ` Qu Wenruo
  2025-10-13 22:41   ` Filipe Manana
  7 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2025-10-13 21:01 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



在 2025/10/13 22:35, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
> 
> The first patch fixes a bug where we can up persisting a file extent item
> that crosses the i_size boundary, making btrfs check complain because it
> won't find checksum items beyond the i_size boundary. Details in the change
> log. The rest are several cleanups in related code that popped up while
> debugging the issue.

The series looks good to me.

Just notice one small inconsistency during review and it's not related 
to the series itself.


Most callsites searching ordered extents are holding ordered_tree_lock 
with irq spinlock.

However in btrfs_split_ordered_extent() there is a non-irq version 
spinlock of ordered_tree_lock.

Looks like an inconsistency at least.

Thanks,
Qu

> 
> Filipe Manana (7):
>    btrfs: truncate ordered extent when skipping writeback past i_size
>    btrfs: use variable for end offset in extent_writepage_io()
>    btrfs: split assertion into two in extent_writepage_io()
>    btrfs: add unlikely to unexpected error case in extent_writepages()
>    btrfs: consistently round up or down i_size in btrfs_truncate()
>    btrfs: avoid multiple i_size rounding in btrfs_truncate()
>    btrfs: avoid repeated computations in btrfs_mark_ordered_io_finished()
> 
>   fs/btrfs/extent_io.c    | 33 ++++++++++++++++++++++++++-------
>   fs/btrfs/inode.c        | 18 ++++++++----------
>   fs/btrfs/ordered-data.c | 23 +++++++++++------------
>   3 files changed, 45 insertions(+), 29 deletions(-)
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups
  2025-10-13 21:01 ` [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups Qu Wenruo
@ 2025-10-13 22:41   ` Filipe Manana
  0 siblings, 0 replies; 17+ messages in thread
From: Filipe Manana @ 2025-10-13 22:41 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, Oct 13, 2025 at 10:01 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2025/10/13 22:35, fdmanana@kernel.org 写道:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > The first patch fixes a bug where we can up persisting a file extent item
> > that crosses the i_size boundary, making btrfs check complain because it
> > won't find checksum items beyond the i_size boundary. Details in the change
> > log. The rest are several cleanups in related code that popped up while
> > debugging the issue.
>
> The series looks good to me.
>
> Just notice one small inconsistency during review and it's not related
> to the series itself.
>
>
> Most callsites searching ordered extents are holding ordered_tree_lock
> with irq spinlock.
>
> However in btrfs_split_ordered_extent() there is a non-irq version
> spinlock of ordered_tree_lock.

Yes, and it's correct as it is.

Right before taking ordered_tree_lock, with a regular spin_lock(), we
take root->ordered_extent_lock with spin_lock_irq(), which disables
irqs.
That means the critical section of ordered_extent_lock can not be
interrupted to serve interrupts.

So everything's fine there.



>
> Looks like an inconsistency at least.
>
> Thanks,
> Qu
>
> >
> > Filipe Manana (7):
> >    btrfs: truncate ordered extent when skipping writeback past i_size
> >    btrfs: use variable for end offset in extent_writepage_io()
> >    btrfs: split assertion into two in extent_writepage_io()
> >    btrfs: add unlikely to unexpected error case in extent_writepages()
> >    btrfs: consistently round up or down i_size in btrfs_truncate()
> >    btrfs: avoid multiple i_size rounding in btrfs_truncate()
> >    btrfs: avoid repeated computations in btrfs_mark_ordered_io_finished()
> >
> >   fs/btrfs/extent_io.c    | 33 ++++++++++++++++++++++++++-------
> >   fs/btrfs/inode.c        | 18 ++++++++----------
> >   fs/btrfs/ordered-data.c | 23 +++++++++++------------
> >   3 files changed, 45 insertions(+), 29 deletions(-)
> >
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-10-13 22:42 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-13 12:05 [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups fdmanana
2025-10-13 12:05 ` [PATCH 1/7] btrfs: truncate ordered extent when skipping writeback past i_size fdmanana
2025-10-13 20:46   ` Qu Wenruo
2025-10-13 12:05 ` [PATCH 2/7] btrfs: use variable for end offset in extent_writepage_io() fdmanana
2025-10-13 20:51   ` Qu Wenruo
2025-10-13 12:05 ` [PATCH 3/7] btrfs: split assertion into two " fdmanana
2025-10-13 20:52   ` Qu Wenruo
2025-10-13 12:05 ` [PATCH 4/7] btrfs: add unlikely to unexpected error case in extent_writepages() fdmanana
2025-10-13 20:52   ` Qu Wenruo
2025-10-13 12:05 ` [PATCH 5/7] btrfs: consistently round up or down i_size in btrfs_truncate() fdmanana
2025-10-13 20:52   ` Qu Wenruo
2025-10-13 12:05 ` [PATCH 6/7] btrfs: avoid multiple i_size rounding " fdmanana
2025-10-13 20:53   ` Qu Wenruo
2025-10-13 12:05 ` [PATCH 7/7] btrfs: avoid repeated computations in btrfs_mark_ordered_io_finished() fdmanana
2025-10-13 20:55   ` Qu Wenruo
2025-10-13 21:01 ` [PATCH 0/7] btrfs: fix a bug with truncation and writeback and cleanups Qu Wenruo
2025-10-13 22:41   ` Filipe Manana

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox