FAILED: patch "[PATCH] btrfs: fix beyond-EOF write handling" failed to apply to 6.18-stable tree

public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed

* FAILED: patch "[PATCH] btrfs: fix beyond-EOF write handling" failed to apply to 6.18-stable tree
@ 2026-01-12 10:34 gregkh
  2026-01-12 13:54 ` [PATCH 6.18.y 1/3] btrfs: truncate ordered extent when skipping writeback past i_size Sasha Levin
  0 siblings, 1 reply; 4+ messages in thread
From: gregkh @ 2026-01-12 10:34 UTC (permalink / raw)
  To: wqu, dsterba, fdmanana; +Cc: stable

The patch below does not apply to the 6.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

To reproduce the conflict and resubmit, you may use the following commands:

git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.18.y
git checkout FETCH_HEAD
git cherry-pick -x e9e3b22ddfa760762b696ac6417c8d6edd182e49
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2026011240-unreal-knee-7bf4@gregkh' --subject-prefix 'PATCH 6.18.y' HEAD^..

Possible dependencies:

thanks,

greg k-h

------------------ original commit in Linus's tree ------------------

From e9e3b22ddfa760762b696ac6417c8d6edd182e49 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu@suse.com>
Date: Thu, 11 Dec 2025 12:45:17 +1030
Subject: [PATCH] btrfs: fix beyond-EOF write handling

[BUG]
For the following write sequence with 64K page size and 4K fs block size,
it will lead to file extent items to be inserted without any data
checksum:

  mkfs.btrfs -s 4k -f $dev > /dev/null
  mount $dev $mnt
  xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c pwrite "60k 64K" \
            -c "truncate 16k" $mnt/foobar
  umount $mnt

This will result the following 2 file extent items to be inserted (extra
trace point added to insert_ordered_extent_file_extent()):

  btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0
  btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384

Note for file offset 60K, we're inserting a file extent without any
data checksum.

Also note that range [32K, 36K) didn't reach
insert_ordered_extent_file_extent(), which is the correct behavior as
that OE is fully truncated, should not result any file extent.

Although file extent at 60K will be later dropped by btrfs_truncate(),
if the transaction got committed after file extent inserted but before
the file extent dropping, we will have a small window where we have a
file extent beyond EOF and without any data checksum.

That will cause "btrfs check" to report error.

[CAUSE]
The sequence happens like this:

- Buffered write dirtied the page cache and updated isize

  Now the inode size is 64K, with the following page cache layout:

  0             16K             32K              48K           64K
  |/////////////|               |//|                        |//|

- Truncate the inode to 16K
  Which will trigger writeback through:

  btrfs_setsize()
  |- truncate_setsize()
  |  Now the inode size is set to 16K
  |
  |- btrfs_truncate()
     |- btrfs_wait_ordered_range() for [16K, u64(-1)]
        |- btrfs_fdatawrite_range() for [16K, u64(-1)}
	   |- extent_writepage() for folio 0
	      |- writepage_delalloc()
	      |  Generated OE for [0, 16K), [32K, 36K] and [60K, 64K)
	      |
	      |- extent_writepage_io()

  Then inside extent_writepage_io(), the dirty fs blocks are handled
  differently:

  - Submit write for range [0, 16K)
    As they are still inside the inode size (16K).

  - Mark OE [32K, 36K) as truncated
    Since we only call btrfs_lookup_first_ordered_range() once, which
    returned the first OE after file offset 16K.

  - Mark all OEs inside range [16K, 64K) as finished
    Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished.

    For OE [32K, 36K) since it's already marked as truncated, and its
    truncated length is 0, no file extent will be inserted.

    For OE [60K, 64K) it has never been submitted thus has no data
    checksum, and we insert the file extent as usual.
    This is the root cause of file extent at 60K to be inserted without
    any data checksum.

  - Clear dirty flags for range [16K, 64K)
    It is the function btrfs_folio_clear_dirty() which searches and clears
    any dirty blocks inside that range.

[FIX]
The bug itself was introduced a long time ago, way before subpage and
large folio support.

At that time, fs block size must match page size, thus the range
[cur, end) is just one fs block.

But later with subpage and large folios, the same range [cur, end)
can have multiple blocks and ordered extents.

Later commit 18de34daa7c6 ("btrfs: truncate ordered extent when skipping
writeback past i_size") was fixing a bug related to subpage/large
folios, but it's still utilizing the old range [cur, end), meaning only
the first OE will be marked as truncated.

The proper fix here is to make EOF handling block-by-block, not trying
to handle the whole range to @end.

By this we always locate and truncate the OE for every dirty block.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2d32dfc34ae3..97748d0d54d9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1728,7 +1728,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			struct btrfs_ordered_extent *ordered;

 			ordered = btrfs_lookup_first_ordered_range(inode, cur,
-								   folio_end - cur);
+								   fs_info->sectorsize);
 			/*
 			 * We have just run delalloc before getting here, so
 			 * there must be an ordered extent.
@@ -1742,7 +1742,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			btrfs_put_ordered_extent(ordered);

 			btrfs_mark_ordered_io_finished(inode, folio, cur,
-						       end - cur, true);
+						       fs_info->sectorsize, true);
 			/*
 			 * This range is beyond i_size, thus we don't need to
 			 * bother writing back.
@@ -1751,8 +1751,8 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			 * writeback the sectors with subpage dirty bits,
 			 * causing writeback without ordered extent.
 			 */
-			btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);
-			break;
+			btrfs_folio_clear_dirty(fs_info, folio, cur, fs_info->sectorsize);
+			continue;
 		}
 		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
 		if (unlikely(ret < 0)) {

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH 6.18.y 1/3] btrfs: truncate ordered extent when skipping writeback past i_size
  2026-01-12 10:34 FAILED: patch "[PATCH] btrfs: fix beyond-EOF write handling" failed to apply to 6.18-stable tree gregkh
@ 2026-01-12 13:54 ` Sasha Levin
  2026-01-12 13:54   ` [PATCH 6.18.y 2/3] btrfs: use variable for end offset in extent_writepage_io() Sasha Levin
  2026-01-12 13:54   ` [PATCH 6.18.y 3/3] btrfs: fix beyond-EOF write handling Sasha Levin
  0 siblings, 2 replies; 4+ messages in thread
From: Sasha Levin @ 2026-01-12 13:54 UTC (permalink / raw)
  To: stable; +Cc: Filipe Manana, Qu Wenruo, Anand Jain, David Sterba, Sasha Levin

From: Filipe Manana <fdmanana@suse.com>

[ Upstream commit 18de34daa7c62c830be533aace6b7c271e8e95cf ]

While running test case btrfs/192 from fstests with support for large
folios (needs CONFIG_BTRFS_EXPERIMENTAL=y) I ended up getting very sporadic
btrfs check failures reporting that csum items were missing. Looking into
the issue it turned out that btrfs check searches for csum items of a file
extent item with a range that spans beyond the i_size of a file and we
don't have any, because the kernel's writeback code skips submitting bios
for ranges beyond eof. It's not expected however to find a file extent item
that crosses the rounded up (by the sector size) i_size value, but there is
a short time window where we can end up with a transaction commit leaving
this small inconsistency between the i_size and the last file extent item.

Example btrfs check output when this happens:

  $ btrfs check /dev/sdc
  Opening filesystem to check...
  Checking filesystem on /dev/sdc
  UUID: 69642c61-5efb-4367-aa31-cdfd4067f713
  [1/8] checking log skipped (none written)
  [2/8] checking root items
  [3/8] checking extents
  [4/8] checking free space tree
  [5/8] checking fs roots
  root 5 inode 332 errors 1000, some csum missing
  ERROR: errors found in fs roots
  (...)

Looking at a tree dump of the fs tree (root 5) for inode 332 we have:

   $ btrfs inspect-internal dump-tree -t 5 /dev/sdc
   (...)
        item 28 key (332 INODE_ITEM 0) itemoff 2006 itemsize 160
                generation 17 transid 19 size 610969 nbytes 86016
                block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0
                sequence 11 flags 0x0(none)
                atime 1759851068.391327881 (2025-10-07 16:31:08)
                ctime 1759851068.410098267 (2025-10-07 16:31:08)
                mtime 1759851068.410098267 (2025-10-07 16:31:08)
                otime 1759851068.391327881 (2025-10-07 16:31:08)
        item 29 key (332 INODE_REF 340) itemoff 1993 itemsize 13
                index 2 namelen 3 name: f1f
        item 30 key (332 EXTENT_DATA 589824) itemoff 1940 itemsize 53
                generation 19 type 1 (regular)
                extent data disk byte 21745664 nr 65536
                extent data offset 0 nr 65536 ram 65536
                extent compression 0 (none)
   (...)

We can see that the file extent item for file offset 589824 has a length of
64K and its number of bytes is 64K. Looking at the inode item we see that
its i_size is 610969 bytes which falls within the range of that file extent
item [589824, 655360[.

Looking into the csum tree:

  $ btrfs inspect-internal dump-tree /dev/sdc
  (...)
        item 15 key (EXTENT_CSUM EXTENT_CSUM 21565440) itemoff 991 itemsize 200
                range start 21565440 end 21770240 length 204800
           item 16 key (EXTENT_CSUM EXTENT_CSUM 1104576512) itemoff 983 itemsize 8
                range start 1104576512 end 1104584704 length 8192
  (..)

We see that the csum item number 15 covers the first 24K of the file extent
item - it ends at offset 21770240 and the extent's disk_bytenr is 21745664,
so we have:

   21770240 - 21745664 = 24K

We see that the next csum item (number 16) is completely outside the range,
so the remaining 40K of the extent doesn't have csum items in the tree.

If we round up the i_size to the sector size, we get:

   round_up(610969, 4096) = 614400

If we subtract from that the file offset for the extent item we get:

   614400 - 589824 = 24K

So the missing 40K corresponds to the end of the file extent item's range
minus the rounded up i_size:

   655360 - 614400 = 40K

Normally we don't expect a file extent item to span over the rounded up
i_size of an inode, since when truncating, doing hole punching and other
operations that trim a file extent item, the number of bytes is adjusted.

There is however a short time window where the kernel can end up,
temporarily,persisting an inode with an i_size that falls in the middle of
the last file extent item and the file extent item was not yet trimmed (its
number of bytes reduced so that it doesn't cross i_size rounded up by the
sector size).

The steps (in the kernel) that lead to such scenario are the following:

 1) We have inode I as an empty file, no allocated extents, i_size is 0;

 2) A buffered write is done for file range [589824, 655360[ (length of
    64K) and the i_size is updated to 655360. Note that we got a single
    large folio for the range (64K);

 3) A truncate operation starts that reduces the inode's i_size down to
    610969 bytes. The truncate sets the inode's new i_size at
    btrfs_setsize() by calling truncate_setsize() and before calling
    btrfs_truncate();

 4) At btrfs_truncate() we trigger writeback for the range starting at
    610304 (which is the new i_size rounded down to the sector size) and
    ending at (u64)-1;

 5) During the writeback, at extent_write_cache_pages(), we get from the
    call to filemap_get_folios_tag(), the 64K folio that starts at file
    offset 589824 since it contains the start offset of the writeback
    range (610304);

 6) At writepage_delalloc() we find the whole range of the folio is dirty
    and therefore we run delalloc for that 64K range ([589824, 655360[),
    reserving a 64K extent, creating an ordered extent, etc;

 7) At extent_writepage_io() we submit IO only for subrange [589824, 614400[
    because the inode's i_size is 610969 bytes (rounded up by sector size
    is 614400). There, in the while loop we intentionally skip IO beyond
    i_size to avoid any unnecessay work and just call
    btrfs_mark_ordered_io_finished() for the range [614400, 655360[ (which
    has a 40K length);

 8) Once the IO finishes we finish the ordered extent by ending up at
    btrfs_finish_one_ordered(), join transaction N, insert a file extent
    item in the inode's subvolume tree for file offset 589824 with a number
    of bytes of 64K, and update the inode's delayed inode item or directly
    the inode item with a call to btrfs_update_inode_fallback(), which
    results in storing the new i_size of 610969 bytes;

 9) Transaction N is committed either by the transaction kthread or some
    other task committed it (in response to a sync or fsync for example).

    At this point we have inode I persisted with an i_size of 610969 bytes
    and file extent item that starts at file offset 589824 and has a number
    of bytes of 64K, ending at an offset of 655360 which is beyond the
    i_size rounded up to the sector size (614400).

    --> So after a crash or power failure here, the btrfs check program
        reports that error about missing checksum items for this inode, as
	it tries to lookup for checksums covering the whole range of the
	extent;

10) Only after transaction N is committed that at btrfs_truncate() the
    call to btrfs_start_transaction() starts a new transaction, N + 1,
    instead of joining transaction N. And it's with transaction N + 1 that
    it calls btrfs_truncate_inode_items() which updates the file extent
    item at file offset 589824 to reduce its number of bytes from 64K down
    to 24K, so that the file extent item's range ends at the i_size
    rounded up to the sector size (614400 bytes).

Fix this by truncating the ordered extent at extent_writepage_io() when we
skip writeback because the current offset in the folio is beyond i_size.
This ensures we don't ever persist a file extent item with a number of
bytes beyond the rounded up (by sector size) value of the i_size.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: e9e3b22ddfa7 ("btrfs: fix beyond-EOF write handling")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/extent_io.c    | 21 +++++++++++++++++++--
 fs/btrfs/ordered-data.c |  5 +++--
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 23273d0e6f224..7cbccd4604c50 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1692,13 +1692,13 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	bool submitted_io = false;
 	int found_error = 0;
 	const u64 folio_start = folio_pos(folio);
+	const u64 folio_end = folio_start + folio_size(folio);
 	const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio);
 	u64 cur;
 	int bit;
 	int ret = 0;
 
-	ASSERT(start >= folio_start &&
-	       start + len <= folio_start + folio_size(folio));
+	ASSERT(start >= folio_start && start + len <= folio_end);
 
 	ret = btrfs_writepage_cow_fixup(folio);
 	if (ret == -EAGAIN) {
@@ -1725,6 +1725,23 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 		cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits);
 
 		if (cur >= i_size) {
+			struct btrfs_ordered_extent *ordered;
+			unsigned long flags;
+
+			ordered = btrfs_lookup_first_ordered_range(inode, cur,
+								   folio_end - cur);
+			/*
+			 * We have just run delalloc before getting here, so
+			 * there must be an ordered extent.
+			 */
+			ASSERT(ordered != NULL);
+			spin_lock_irqsave(&inode->ordered_tree_lock, flags);
+			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
+			ordered->truncated_len = min(ordered->truncated_len,
+						     cur - ordered->file_offset);
+			spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
+			btrfs_put_ordered_extent(ordered);
+
 			btrfs_mark_ordered_io_finished(inode, folio, cur,
 						       start + len - cur, true);
 			/*
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 2829f20d7bb59..8a8aa6ed405bd 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1098,8 +1098,9 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
 	struct rb_node *prev;
 	struct rb_node *next;
 	struct btrfs_ordered_extent *entry = NULL;
+	unsigned long flags;
 
-	spin_lock_irq(&inode->ordered_tree_lock);
+	spin_lock_irqsave(&inode->ordered_tree_lock, flags);
 	node = inode->ordered_tree.rb_node;
 	/*
 	 * Here we don't want to use tree_search() which will use tree->last
@@ -1154,7 +1155,7 @@ struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
 		trace_btrfs_ordered_extent_lookup_first_range(inode, entry);
 	}
 
-	spin_unlock_irq(&inode->ordered_tree_lock);
+	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
 	return entry;
 }
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH 6.18.y 2/3] btrfs: use variable for end offset in extent_writepage_io()
  2026-01-12 13:54 ` [PATCH 6.18.y 1/3] btrfs: truncate ordered extent when skipping writeback past i_size Sasha Levin
@ 2026-01-12 13:54   ` Sasha Levin
  2026-01-12 13:54   ` [PATCH 6.18.y 3/3] btrfs: fix beyond-EOF write handling Sasha Levin
  1 sibling, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2026-01-12 13:54 UTC (permalink / raw)
  To: stable; +Cc: Filipe Manana, Qu Wenruo, Anand Jain, David Sterba, Sasha Levin

From: Filipe Manana <fdmanana@suse.com>

[ Upstream commit 46a23908598f4b8e61483f04ea9f471b2affc58a ]

Instead of repeating the expression "start + len" multiple times, store it
in a variable and use it where needed.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: e9e3b22ddfa7 ("btrfs: fix beyond-EOF write handling")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/extent_io.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7cbccd4604c50..ca09f42d26aa1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1691,6 +1691,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	unsigned long range_bitmap = 0;
 	bool submitted_io = false;
 	int found_error = 0;
+	const u64 end = start + len;
 	const u64 folio_start = folio_pos(folio);
 	const u64 folio_end = folio_start + folio_size(folio);
 	const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio);
@@ -1698,7 +1699,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	int bit;
 	int ret = 0;
 
-	ASSERT(start >= folio_start && start + len <= folio_end);
+	ASSERT(start >= folio_start && end <= folio_end);
 
 	ret = btrfs_writepage_cow_fixup(folio);
 	if (ret == -EAGAIN) {
@@ -1714,7 +1715,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 		return ret;
 	}
 
-	for (cur = start; cur < start + len; cur += fs_info->sectorsize)
+	for (cur = start; cur < end; cur += fs_info->sectorsize)
 		set_bit((cur - folio_start) >> fs_info->sectorsize_bits, &range_bitmap);
 	bitmap_and(&bio_ctrl->submit_bitmap, &bio_ctrl->submit_bitmap, &range_bitmap,
 		   blocks_per_folio);
@@ -1743,7 +1744,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			btrfs_put_ordered_extent(ordered);
 
 			btrfs_mark_ordered_io_finished(inode, folio, cur,
-						       start + len - cur, true);
+						       end - cur, true);
 			/*
 			 * This range is beyond i_size, thus we don't need to
 			 * bother writing back.
@@ -1752,8 +1753,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			 * writeback the sectors with subpage dirty bits,
 			 * causing writeback without ordered extent.
 			 */
-			btrfs_folio_clear_dirty(fs_info, folio, cur,
-						start + len - cur);
+			btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);
 			break;
 		}
 		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH 6.18.y 3/3] btrfs: fix beyond-EOF write handling
  2026-01-12 13:54 ` [PATCH 6.18.y 1/3] btrfs: truncate ordered extent when skipping writeback past i_size Sasha Levin
  2026-01-12 13:54   ` [PATCH 6.18.y 2/3] btrfs: use variable for end offset in extent_writepage_io() Sasha Levin
@ 2026-01-12 13:54   ` Sasha Levin
  1 sibling, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2026-01-12 13:54 UTC (permalink / raw)
  To: stable; +Cc: Qu Wenruo, Filipe Manana, David Sterba, Sasha Levin

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit e9e3b22ddfa760762b696ac6417c8d6edd182e49 ]

[BUG]
For the following write sequence with 64K page size and 4K fs block size,
it will lead to file extent items to be inserted without any data
checksum:

  mkfs.btrfs -s 4k -f $dev > /dev/null
  mount $dev $mnt
  xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c pwrite "60k 64K" \
            -c "truncate 16k" $mnt/foobar
  umount $mnt

This will result the following 2 file extent items to be inserted (extra
trace point added to insert_ordered_extent_file_extent()):

  btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0
  btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384

Note for file offset 60K, we're inserting a file extent without any
data checksum.

Also note that range [32K, 36K) didn't reach
insert_ordered_extent_file_extent(), which is the correct behavior as
that OE is fully truncated, should not result any file extent.

Although file extent at 60K will be later dropped by btrfs_truncate(),
if the transaction got committed after file extent inserted but before
the file extent dropping, we will have a small window where we have a
file extent beyond EOF and without any data checksum.

That will cause "btrfs check" to report error.

[CAUSE]
The sequence happens like this:

- Buffered write dirtied the page cache and updated isize

  Now the inode size is 64K, with the following page cache layout:

  0             16K             32K              48K           64K
  |/////////////|               |//|                        |//|

- Truncate the inode to 16K
  Which will trigger writeback through:

  btrfs_setsize()
  |- truncate_setsize()
  |  Now the inode size is set to 16K
  |
  |- btrfs_truncate()
     |- btrfs_wait_ordered_range() for [16K, u64(-1)]
        |- btrfs_fdatawrite_range() for [16K, u64(-1)}
	   |- extent_writepage() for folio 0
	      |- writepage_delalloc()
	      |  Generated OE for [0, 16K), [32K, 36K] and [60K, 64K)
	      |
	      |- extent_writepage_io()

  Then inside extent_writepage_io(), the dirty fs blocks are handled
  differently:

  - Submit write for range [0, 16K)
    As they are still inside the inode size (16K).

  - Mark OE [32K, 36K) as truncated
    Since we only call btrfs_lookup_first_ordered_range() once, which
    returned the first OE after file offset 16K.

  - Mark all OEs inside range [16K, 64K) as finished
    Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished.

    For OE [32K, 36K) since it's already marked as truncated, and its
    truncated length is 0, no file extent will be inserted.

    For OE [60K, 64K) it has never been submitted thus has no data
    checksum, and we insert the file extent as usual.
    This is the root cause of file extent at 60K to be inserted without
    any data checksum.

  - Clear dirty flags for range [16K, 64K)
    It is the function btrfs_folio_clear_dirty() which searches and clears
    any dirty blocks inside that range.

[FIX]
The bug itself was introduced a long time ago, way before subpage and
large folio support.

At that time, fs block size must match page size, thus the range
[cur, end) is just one fs block.

But later with subpage and large folios, the same range [cur, end)
can have multiple blocks and ordered extents.

Later commit 18de34daa7c6 ("btrfs: truncate ordered extent when skipping
writeback past i_size") was fixing a bug related to subpage/large
folios, but it's still utilizing the old range [cur, end), meaning only
the first OE will be marked as truncated.

The proper fix here is to make EOF handling block-by-block, not trying
to handle the whole range to @end.

By this we always locate and truncate the OE for every dirty block.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/extent_io.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ca09f42d26aa1..1a07edaefaa0f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1730,7 +1730,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			unsigned long flags;
 
 			ordered = btrfs_lookup_first_ordered_range(inode, cur,
-								   folio_end - cur);
+								   fs_info->sectorsize);
 			/*
 			 * We have just run delalloc before getting here, so
 			 * there must be an ordered extent.
@@ -1744,7 +1744,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			btrfs_put_ordered_extent(ordered);
 
 			btrfs_mark_ordered_io_finished(inode, folio, cur,
-						       end - cur, true);
+						       fs_info->sectorsize, true);
 			/*
 			 * This range is beyond i_size, thus we don't need to
 			 * bother writing back.
@@ -1753,8 +1753,8 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			 * writeback the sectors with subpage dirty bits,
 			 * causing writeback without ordered extent.
 			 */
-			btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);
-			break;
+			btrfs_folio_clear_dirty(fs_info, folio, cur, fs_info->sectorsize);
+			continue;
 		}
 		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
 		if (unlikely(ret < 0)) {
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-01-12 13:54 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-12 10:34 FAILED: patch "[PATCH] btrfs: fix beyond-EOF write handling" failed to apply to 6.18-stable tree gregkh
2026-01-12 13:54 ` [PATCH 6.18.y 1/3] btrfs: truncate ordered extent when skipping writeback past i_size Sasha Levin
2026-01-12 13:54   ` [PATCH 6.18.y 2/3] btrfs: use variable for end offset in extent_writepage_io() Sasha Levin
2026-01-12 13:54   ` [PATCH 6.18.y 3/3] btrfs: fix beyond-EOF write handling Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox