[PATCH 0/2] btrfs: fix beyond-EOF write handling mostly for subpage and larger folios

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] btrfs: fix beyond-EOF write handling mostly for subpage and larger folios
@ 2025-12-11  2:15 Qu Wenruo
  2025-12-11  2:15 ` [PATCH 1/2] btrfs: fix beyond-EOF write handling Qu Wenruo
  2025-12-11  2:15 ` [PATCH 2/2] btrfs: add an ASSERT() to catch ordered extents without datasum Qu Wenruo
  0 siblings, 2 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-12-11  2:15 UTC (permalink / raw)
  To: linux-btrfs

Filipe fixed a bug in commit 18de34daa7c6 ("btrfs: truncate ordered
extent when skipping writeback past i_size"), which can lead to btrfs
check reporting missing data checksum.

However that fix is not complete, we can still get file extents inserted
without any data checksum.

The root cause is that the original beyond-EOF handling is not compatible
with subpage/larger folios from day 1, thus as long as we're still using
the old [cur, end) range, there will always be something incorrect.

The old handling is always handling the range [cur, end), which at that
time can only be one fs block.

Later subpage support is still re-using that part, but the enhanced
btrfs_folio_clear_dirty() function allows passing a range that covers
non-dirty blocks, this enhanced behavior masked the problem.

The true fix to the beyond-EOF handling should handle each beyond-EOF
block one-by-one, just like how we handle regular writes.

The first patch is the minimal fix that can be backported, but
unfortunately that still relies on the commit 18de34daa7c6 ("btrfs: truncate
ordered extent when skipping writeback past i_size") itself, and can be
tricky for older branches.

The second patch is to add an extra ASSERT() to catch any OE extent that
doesn't have a proper reason to have no data checksum.
With that new ASSERT() we can catch missing cases more reliably.

Qu Wenruo (2):
  btrfs: fix beyond-EOF write handling
  btrfs: add an ASSERT() to catch ordered extents without datasum

 fs/btrfs/extent_io.c |  8 ++++----
 fs/btrfs/inode.c     | 15 +++++++++++++++
 2 files changed, 19 insertions(+), 4 deletions(-)

-- 
2.52.0

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/2] btrfs: fix beyond-EOF write handling
  2025-12-11  2:15 [PATCH 0/2] btrfs: fix beyond-EOF write handling mostly for subpage and larger folios Qu Wenruo
@ 2025-12-11  2:15 ` Qu Wenruo
  2025-12-11 11:26   ` Filipe Manana
  2025-12-11  2:15 ` [PATCH 2/2] btrfs: add an ASSERT() to catch ordered extents without datasum Qu Wenruo
  1 sibling, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2025-12-11  2:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable

[BUG]
For the following write sequence with 64K page size and 4K fs block size,
it will lead to file extent items to be inserted without any data
checksum:

  mkfs.btrfs -s 4k -f $dev > /dev/null
  mount $dev $mnt
  xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c pwrite "60k 64K" \
            -c "truncate 16k" $mnt/foobar
  umount $mnt

This will result the following 2 file extent items to be inserted (extra
trace point added to insert_ordered_extent_file_extent()):

   btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0
   btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384

Note for file offset 60K, we're inserting an file extent without any
data checksum.

Also note that range [32K, 36K) didn't reach
insert_ordered_extent_file_extent(), which is the correct behavior as
that OE is fully truncated, should not result any file extent.

Although file extent at 60K will be later dropped by btrfs_truncate(),
if the transaction got committed after file extent inserted but before
the file extent dropping, we will have a small window where we have a
file extent beyond EOF and without any data checksum.

That will cause "btrfs check" to report error.

[CAUSE]
The sequence happens like this:

- Buffered write dirtied the page cache and updated isize

  Now the inode size is 64K, with the following page cache layout:

  0             16K             32K              48K           64K
  |/////////////|               |//|                        |//|

- Truncate the inode to 16K
  Which will trigger writeback through:

  btrfs_setsize()
  |- truncate_setsize()
  |  Now the inode size is set to 16K
  |
  |- btrfs_truncacte()
     |- btrfs_wait_ordered_range() for [16K, u64(-1)]
        |- btrfs_fdatawrite_range() for [16K, u64(-1)}
	   |- extent_writepage() for folio 0
	      |- writepage_delalloc()
	      |  Generated OE for [0, 16K), [32K, 36K] and [60K, 64K)
	      |
	      |- extent_writepage_io()

  Then inside extent_writepage_io(), the dirty fs blocks are handled
  differently:

  - Submit write for range [0, 16K)
    As they are still inside the inode size (16K).

  - Mark OE [32K, 36K) as truncated
    Since we only call btrfs_lookup_first_ordered_range() once, which
    returned the first OE after file offset 16K.

  - Mark all OEs inside range [16K, 64K) as finished
    Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished.

    For OE [32K, 36K) since it's already marked as truncated, and its
    truncated length is 0, no file extent will be inserted.

    For OE [60K, 64K) it has never been submitted thus has no data
    checksum, and we insert the file extent as usual.
    This is the root cause of file extent at 60K to be inserted without
    any data checksum.

  - Clear dirty flags for range [16K, 64K)
    It is the function btrfs_folio_clear_dirty() which search and clear
    any dirty blocks inside that range.

[FIX]
The bug itself is introduced a long time ago, way before subpage and
larger folio support.

At that time, fs block size must match page size, thus the range
[cur, end) is just one fs block.

But later with subpage and larger folios, the same range [cur, end)
can have multiple blocks and ordered extents.

Later commit 18de34daa7c6 ("btrfs: truncate ordered extent when skipping
writeback past i_size") is fixing a bug related to subpage/larger
folios, but it's still utilizing the old range [cur, end), meaning only
the first OE will be marked as truncated.

The proper fix here is to make EOF handling block-by-block, not trying
to handle the whole range to @end.

By this we always locate and truncate the OE for every dirty block.

Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 629fd5af4286..a4b74023618d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1728,7 +1728,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			struct btrfs_ordered_extent *ordered;
 
 			ordered = btrfs_lookup_first_ordered_range(inode, cur,
-								   folio_end - cur);
+								   fs_info->sectorsize);
 			/*
 			 * We have just run delalloc before getting here, so
 			 * there must be an ordered extent.
@@ -1742,7 +1742,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			btrfs_put_ordered_extent(ordered);
 
 			btrfs_mark_ordered_io_finished(inode, folio, cur,
-						       end - cur, true);
+						       fs_info->sectorsize, true);
 			/*
 			 * This range is beyond i_size, thus we don't need to
 			 * bother writing back.
@@ -1751,8 +1751,8 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 			 * writeback the sectors with subpage dirty bits,
 			 * causing writeback without ordered extent.
 			 */
-			btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);
-			break;
+			btrfs_folio_clear_dirty(fs_info, folio, cur, fs_info->sectorsize);
+			continue;
 		}
 		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
 		if (unlikely(ret < 0)) {
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] btrfs: fix beyond-EOF write handling
  2025-12-11  2:15 ` [PATCH 1/2] btrfs: fix beyond-EOF write handling Qu Wenruo
@ 2025-12-11 11:26   ` Filipe Manana
  2025-12-11 23:05     ` Qu Wenruo
  0 siblings, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2025-12-11 11:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, stable

On Thu, Dec 11, 2025 at 2:15 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> For the following write sequence with 64K page size and 4K fs block size,
> it will lead to file extent items to be inserted without any data
> checksum:
>
>   mkfs.btrfs -s 4k -f $dev > /dev/null
>   mount $dev $mnt
>   xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c pwrite "60k 64K" \
>             -c "truncate 16k" $mnt/foobar
>   umount $mnt

If you can, add this to fstests.

>
> This will result the following 2 file extent items to be inserted (extra
> trace point added to insert_ordered_extent_file_extent()):
>
>    btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0
>    btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384
>
> Note for file offset 60K, we're inserting an file extent without any

an file -> a file

> data checksum.
>
> Also note that range [32K, 36K) didn't reach
> insert_ordered_extent_file_extent(), which is the correct behavior as
> that OE is fully truncated, should not result any file extent.
>
> Although file extent at 60K will be later dropped by btrfs_truncate(),
> if the transaction got committed after file extent inserted but before
> the file extent dropping, we will have a small window where we have a
> file extent beyond EOF and without any data checksum.
>
> That will cause "btrfs check" to report error.
>
> [CAUSE]
> The sequence happens like this:
>
> - Buffered write dirtied the page cache and updated isize
>
>   Now the inode size is 64K, with the following page cache layout:
>
>   0             16K             32K              48K           64K
>   |/////////////|               |//|                        |//|
>
> - Truncate the inode to 16K
>   Which will trigger writeback through:
>
>   btrfs_setsize()
>   |- truncate_setsize()
>   |  Now the inode size is set to 16K
>   |
>   |- btrfs_truncacte()

truncacte -> truncate

It looks good otherwise, so:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

>      |- btrfs_wait_ordered_range() for [16K, u64(-1)]
>         |- btrfs_fdatawrite_range() for [16K, u64(-1)}
>            |- extent_writepage() for folio 0
>               |- writepage_delalloc()
>               |  Generated OE for [0, 16K), [32K, 36K] and [60K, 64K)
>               |
>               |- extent_writepage_io()
>
>   Then inside extent_writepage_io(), the dirty fs blocks are handled
>   differently:
>
>   - Submit write for range [0, 16K)
>     As they are still inside the inode size (16K).
>
>   - Mark OE [32K, 36K) as truncated
>     Since we only call btrfs_lookup_first_ordered_range() once, which
>     returned the first OE after file offset 16K.
>
>   - Mark all OEs inside range [16K, 64K) as finished
>     Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished.
>
>     For OE [32K, 36K) since it's already marked as truncated, and its
>     truncated length is 0, no file extent will be inserted.
>
>     For OE [60K, 64K) it has never been submitted thus has no data
>     checksum, and we insert the file extent as usual.
>     This is the root cause of file extent at 60K to be inserted without
>     any data checksum.
>
>   - Clear dirty flags for range [16K, 64K)
>     It is the function btrfs_folio_clear_dirty() which search and clear
>     any dirty blocks inside that range.
>
> [FIX]
> The bug itself is introduced a long time ago, way before subpage and
> larger folio support.
>
> At that time, fs block size must match page size, thus the range
> [cur, end) is just one fs block.
>
> But later with subpage and larger folios, the same range [cur, end)
> can have multiple blocks and ordered extents.
>
> Later commit 18de34daa7c6 ("btrfs: truncate ordered extent when skipping
> writeback past i_size") is fixing a bug related to subpage/larger
> folios, but it's still utilizing the old range [cur, end), meaning only
> the first OE will be marked as truncated.
>
> The proper fix here is to make EOF handling block-by-block, not trying
> to handle the whole range to @end.
>
> By this we always locate and truncate the OE for every dirty block.
>
> Cc: stable@vger.kernel.org # 5.15+
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 629fd5af4286..a4b74023618d 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1728,7 +1728,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>                         struct btrfs_ordered_extent *ordered;
>
>                         ordered = btrfs_lookup_first_ordered_range(inode, cur,
> -                                                                  folio_end - cur);
> +                                                                  fs_info->sectorsize);
>                         /*
>                          * We have just run delalloc before getting here, so
>                          * there must be an ordered extent.
> @@ -1742,7 +1742,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>                         btrfs_put_ordered_extent(ordered);
>
>                         btrfs_mark_ordered_io_finished(inode, folio, cur,
> -                                                      end - cur, true);
> +                                                      fs_info->sectorsize, true);
>                         /*
>                          * This range is beyond i_size, thus we don't need to
>                          * bother writing back.
> @@ -1751,8 +1751,8 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>                          * writeback the sectors with subpage dirty bits,
>                          * causing writeback without ordered extent.
>                          */
> -                       btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);
> -                       break;
> +                       btrfs_folio_clear_dirty(fs_info, folio, cur, fs_info->sectorsize);
> +                       continue;
>                 }
>                 ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
>                 if (unlikely(ret < 0)) {
> --
> 2.52.0
>
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] btrfs: fix beyond-EOF write handling
  2025-12-11 11:26   ` Filipe Manana
@ 2025-12-11 23:05     ` Qu Wenruo
  0 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-12-11 23:05 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs, stable



在 2025/12/11 21:56, Filipe Manana 写道:
> On Thu, Dec 11, 2025 at 2:15 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> For the following write sequence with 64K page size and 4K fs block size,
>> it will lead to file extent items to be inserted without any data
>> checksum:
>>
>>    mkfs.btrfs -s 4k -f $dev > /dev/null
>>    mount $dev $mnt
>>    xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c pwrite "60k 64K" \
>>              -c "truncate 16k" $mnt/foobar
>>    umount $mnt
> 
> If you can, add this to fstests.

The problem is, it's very hard to cause any observable result.

For most if not all cases, the file extent item insertion and truncation 
happen in the same transaction thus no error can be detected.

We need either a way to force all btrfs_end_transaction() to commit the 
trans, or very heavy workload to trigger enough transactions, then along 
with log-replay to detect such problem.

Thanks,
Qu

> 
>>
>> This will result the following 2 file extent items to be inserted (extra
>> trace point added to insert_ordered_extent_file_extent()):
>>
>>     btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0
>>     btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384
>>
>> Note for file offset 60K, we're inserting an file extent without any
> 
> an file -> a file
> 
>> data checksum.
>>
>> Also note that range [32K, 36K) didn't reach
>> insert_ordered_extent_file_extent(), which is the correct behavior as
>> that OE is fully truncated, should not result any file extent.
>>
>> Although file extent at 60K will be later dropped by btrfs_truncate(),
>> if the transaction got committed after file extent inserted but before
>> the file extent dropping, we will have a small window where we have a
>> file extent beyond EOF and without any data checksum.
>>
>> That will cause "btrfs check" to report error.
>>
>> [CAUSE]
>> The sequence happens like this:
>>
>> - Buffered write dirtied the page cache and updated isize
>>
>>    Now the inode size is 64K, with the following page cache layout:
>>
>>    0             16K             32K              48K           64K
>>    |/////////////|               |//|                        |//|
>>
>> - Truncate the inode to 16K
>>    Which will trigger writeback through:
>>
>>    btrfs_setsize()
>>    |- truncate_setsize()
>>    |  Now the inode size is set to 16K
>>    |
>>    |- btrfs_truncacte()
> 
> truncacte -> truncate
> 
> It looks good otherwise, so:
> 
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> 
> Thanks.
> 
>>       |- btrfs_wait_ordered_range() for [16K, u64(-1)]
>>          |- btrfs_fdatawrite_range() for [16K, u64(-1)}
>>             |- extent_writepage() for folio 0
>>                |- writepage_delalloc()
>>                |  Generated OE for [0, 16K), [32K, 36K] and [60K, 64K)
>>                |
>>                |- extent_writepage_io()
>>
>>    Then inside extent_writepage_io(), the dirty fs blocks are handled
>>    differently:
>>
>>    - Submit write for range [0, 16K)
>>      As they are still inside the inode size (16K).
>>
>>    - Mark OE [32K, 36K) as truncated
>>      Since we only call btrfs_lookup_first_ordered_range() once, which
>>      returned the first OE after file offset 16K.
>>
>>    - Mark all OEs inside range [16K, 64K) as finished
>>      Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished.
>>
>>      For OE [32K, 36K) since it's already marked as truncated, and its
>>      truncated length is 0, no file extent will be inserted.
>>
>>      For OE [60K, 64K) it has never been submitted thus has no data
>>      checksum, and we insert the file extent as usual.
>>      This is the root cause of file extent at 60K to be inserted without
>>      any data checksum.
>>
>>    - Clear dirty flags for range [16K, 64K)
>>      It is the function btrfs_folio_clear_dirty() which search and clear
>>      any dirty blocks inside that range.
>>
>> [FIX]
>> The bug itself is introduced a long time ago, way before subpage and
>> larger folio support.
>>
>> At that time, fs block size must match page size, thus the range
>> [cur, end) is just one fs block.
>>
>> But later with subpage and larger folios, the same range [cur, end)
>> can have multiple blocks and ordered extents.
>>
>> Later commit 18de34daa7c6 ("btrfs: truncate ordered extent when skipping
>> writeback past i_size") is fixing a bug related to subpage/larger
>> folios, but it's still utilizing the old range [cur, end), meaning only
>> the first OE will be marked as truncated.
>>
>> The proper fix here is to make EOF handling block-by-block, not trying
>> to handle the whole range to @end.
>>
>> By this we always locate and truncate the OE for every dirty block.
>>
>> Cc: stable@vger.kernel.org # 5.15+
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_io.c | 8 ++++----
>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 629fd5af4286..a4b74023618d 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1728,7 +1728,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>>                          struct btrfs_ordered_extent *ordered;
>>
>>                          ordered = btrfs_lookup_first_ordered_range(inode, cur,
>> -                                                                  folio_end - cur);
>> +                                                                  fs_info->sectorsize);
>>                          /*
>>                           * We have just run delalloc before getting here, so
>>                           * there must be an ordered extent.
>> @@ -1742,7 +1742,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>>                          btrfs_put_ordered_extent(ordered);
>>
>>                          btrfs_mark_ordered_io_finished(inode, folio, cur,
>> -                                                      end - cur, true);
>> +                                                      fs_info->sectorsize, true);
>>                          /*
>>                           * This range is beyond i_size, thus we don't need to
>>                           * bother writing back.
>> @@ -1751,8 +1751,8 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>>                           * writeback the sectors with subpage dirty bits,
>>                           * causing writeback without ordered extent.
>>                           */
>> -                       btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);
>> -                       break;
>> +                       btrfs_folio_clear_dirty(fs_info, folio, cur, fs_info->sectorsize);
>> +                       continue;
>>                  }
>>                  ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
>>                  if (unlikely(ret < 0)) {
>> --
>> 2.52.0
>>
>>
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 2/2] btrfs: add an ASSERT() to catch ordered extents without datasum
  2025-12-11  2:15 [PATCH 0/2] btrfs: fix beyond-EOF write handling mostly for subpage and larger folios Qu Wenruo
  2025-12-11  2:15 ` [PATCH 1/2] btrfs: fix beyond-EOF write handling Qu Wenruo
@ 2025-12-11  2:15 ` Qu Wenruo
  2025-12-11 11:41   ` Filipe Manana
  1 sibling, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2025-12-11  2:15 UTC (permalink / raw)
  To: linux-btrfs

Inside btrfs_finish_one_ordered(), there are only very limited
situations where the OE has no checksum:

- The OE is completely truncated or error happened
  In that case no file extent is going to be inserted.

- The inode has NODATASUM flag

- The inode belongs to data reloc tree

Add an ASSERT() using the last two cases, which will help us to catch
problems described in commit 18de34daa7c6 ("btrfs: truncate ordered
extent when skipping writeback past i_size"), and prevent future similar
cases.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 461725c8ccd7..740de9436d24 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3226,6 +3226,21 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
 		goto out;
 	}
 
+	/*
+	 * If we have no data checksum, either the OE is:
+	 * - Fully truncated
+	 *   Those ones won't reach here.
+	 *
+	 * - No data checksum
+	 *
+	 * - Belongs to data reloc inode
+	 *   Which doesn't have csum attached to OE, but cloned
+	 *   from original chunk.
+	 */
+	if (list_empty(&ordered_extent->list))
+		ASSERT(inode->flags & BTRFS_INODE_NODATASUM ||
+		       btrfs_is_data_reloc_root(inode->root));
+
 	ret = add_pending_csums(trans, &ordered_extent->list);
 	if (unlikely(ret)) {
 		btrfs_abort_transaction(trans, ret);
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] btrfs: add an ASSERT() to catch ordered extents without datasum
  2025-12-11  2:15 ` [PATCH 2/2] btrfs: add an ASSERT() to catch ordered extents without datasum Qu Wenruo
@ 2025-12-11 11:41   ` Filipe Manana
  2025-12-11 16:06     ` Filipe Manana
  2025-12-11 23:06     ` Qu Wenruo
  0 siblings, 2 replies; 10+ messages in thread
From: Filipe Manana @ 2025-12-11 11:41 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Dec 11, 2025 at 2:16 AM Qu Wenruo <wqu@suse.com> wrote:
>
> Inside btrfs_finish_one_ordered(), there are only very limited
> situations where the OE has no checksum:
>
> - The OE is completely truncated or error happened
>   In that case no file extent is going to be inserted.
>
> - The inode has NODATASUM flag
>
> - The inode belongs to data reloc tree
>
> Add an ASSERT() using the last two cases, which will help us to catch
> problems described in commit 18de34daa7c6 ("btrfs: truncate ordered
> extent when skipping writeback past i_size"), and prevent future similar
> cases.

How exactly does this new assertion catches that case described in that commit?
We had csums there, just not for the whole range of the ordered
extent, just for the range from its start offset to the rounded up
i_size (which is less than the ordered extent's end offset).

>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/inode.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 461725c8ccd7..740de9436d24 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3226,6 +3226,21 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
>                 goto out;
>         }
>
> +       /*
> +        * If we have no data checksum, either the OE is:
> +        * - Fully truncated
> +        *   Those ones won't reach here.
> +        *
> +        * - No data checksum
> +        *
> +        * - Belongs to data reloc inode
> +        *   Which doesn't have csum attached to OE, but cloned
> +        *   from original chunk.
> +        */
> +       if (list_empty(&ordered_extent->list))
> +               ASSERT(inode->flags & BTRFS_INODE_NODATASUM ||
> +                      btrfs_is_data_reloc_root(inode->root));

No need to inode->root, we have a local variable with the root already.

> +
>         ret = add_pending_csums(trans, &ordered_extent->list);
>         if (unlikely(ret)) {
>                 btrfs_abort_transaction(trans, ret);
> --
> 2.52.0
>
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] btrfs: add an ASSERT() to catch ordered extents without datasum
  2025-12-11 11:41   ` Filipe Manana
@ 2025-12-11 16:06     ` Filipe Manana
  2025-12-11 23:15       ` Qu Wenruo
  2025-12-11 23:06     ` Qu Wenruo
  1 sibling, 1 reply; 10+ messages in thread
From: Filipe Manana @ 2025-12-11 16:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Dec 11, 2025 at 11:41 AM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Thu, Dec 11, 2025 at 2:16 AM Qu Wenruo <wqu@suse.com> wrote:
> >
> > Inside btrfs_finish_one_ordered(), there are only very limited
> > situations where the OE has no checksum:
> >
> > - The OE is completely truncated or error happened
> >   In that case no file extent is going to be inserted.
> >
> > - The inode has NODATASUM flag
> >
> > - The inode belongs to data reloc tree
> >
> > Add an ASSERT() using the last two cases, which will help us to catch
> > problems described in commit 18de34daa7c6 ("btrfs: truncate ordered
> > extent when skipping writeback past i_size"), and prevent future similar
> > cases.
>
> How exactly does this new assertion catches that case described in that commit?
> We had csums there, just not for the whole range of the ordered
> extent, just for the range from its start offset to the rounded up
> i_size (which is less than the ordered extent's end offset).

Rather than checking for the presence of checksums, and if they cover
the whole range, the best is to check if we have an ordered range that
crosses EOF...
Since it's also a bug if we have a nocow ordered extent that crosses EOF.

Something like this (only compile tested):

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0cbac085cdaf..76f7eab7a750 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3139,6 +3139,11 @@ int btrfs_finish_one_ordered(struct
btrfs_ordered_extent *ordered_extent)
                        goto out;
        }

+       ASSERT(start + logical_len <=
+              round_up(i_size_read(&inode->vfs_inode), fs_info->sectorsize),
+              "start=%llu logical_len=%llu i_size=%lld sectorsize=%u",
+              start, logical_len, i_size_read(&inode->vfs_inode),
fs_info->sectorsize);
+
        /*
         * If it's a COW write we need to lock the extent range as we will be
         * inserting/replacing file extent items and unpinning an extent map.



>
> >
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > ---
> >  fs/btrfs/inode.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 461725c8ccd7..740de9436d24 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -3226,6 +3226,21 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
> >                 goto out;
> >         }
> >
> > +       /*
> > +        * If we have no data checksum, either the OE is:
> > +        * - Fully truncated
> > +        *   Those ones won't reach here.
> > +        *
> > +        * - No data checksum
> > +        *
> > +        * - Belongs to data reloc inode
> > +        *   Which doesn't have csum attached to OE, but cloned
> > +        *   from original chunk.
> > +        */
> > +       if (list_empty(&ordered_extent->list))
> > +               ASSERT(inode->flags & BTRFS_INODE_NODATASUM ||
> > +                      btrfs_is_data_reloc_root(inode->root));
>
> No need to inode->root, we have a local variable with the root already.
>
> > +
> >         ret = add_pending_csums(trans, &ordered_extent->list);
> >         if (unlikely(ret)) {
> >                 btrfs_abort_transaction(trans, ret);
> > --
> > 2.52.0
> >
> >

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] btrfs: add an ASSERT() to catch ordered extents without datasum
  2025-12-11 16:06     ` Filipe Manana
@ 2025-12-11 23:15       ` Qu Wenruo
  2025-12-12  3:15         ` Qu Wenruo
  0 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2025-12-11 23:15 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs



在 2025/12/12 02:36, Filipe Manana 写道:
> On Thu, Dec 11, 2025 at 11:41 AM Filipe Manana <fdmanana@kernel.org> wrote:
>>
>> On Thu, Dec 11, 2025 at 2:16 AM Qu Wenruo <wqu@suse.com> wrote:
>>>
>>> Inside btrfs_finish_one_ordered(), there are only very limited
>>> situations where the OE has no checksum:
>>>
>>> - The OE is completely truncated or error happened
>>>    In that case no file extent is going to be inserted.
>>>
>>> - The inode has NODATASUM flag
>>>
>>> - The inode belongs to data reloc tree
>>>
>>> Add an ASSERT() using the last two cases, which will help us to catch
>>> problems described in commit 18de34daa7c6 ("btrfs: truncate ordered
>>> extent when skipping writeback past i_size"), and prevent future similar
>>> cases.
>>
>> How exactly does this new assertion catches that case described in that commit?
>> We had csums there, just not for the whole range of the ordered
>> extent, just for the range from its start offset to the rounded up
>> i_size (which is less than the ordered extent's end offset).
> 
> Rather than checking for the presence of checksums, and if they cover
> the whole range, the best is to check if we have an ordered range that
> crosses EOF...

Although btrfs_setsize() will also wait for OEs during expansion, I'd 
prefer not to get isize involved in btrfs_finish_one_ordered() context.

Once i_size is involved, we need extra correctness check to make sure 
the value is correct and not racy.

I know the current one can not catch the situation that only part of the 
OE has csum, but I also didn't expect the ASSERT() to catch all situations.

The current one is good enough to catch most cases on subpage 
environment and still safe enough not to cause false alerts.

Thanks,
Qu

> Since it's also a bug if we have a nocow ordered extent that crosses EOF.
> 
> Something like this (only compile tested):
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0cbac085cdaf..76f7eab7a750 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3139,6 +3139,11 @@ int btrfs_finish_one_ordered(struct
> btrfs_ordered_extent *ordered_extent)
>                          goto out;
>          }
> 
> +       ASSERT(start + logical_len <=
> +              round_up(i_size_read(&inode->vfs_inode), fs_info->sectorsize),
> +              "start=%llu logical_len=%llu i_size=%lld sectorsize=%u",
> +              start, logical_len, i_size_read(&inode->vfs_inode),
> fs_info->sectorsize);
> +
>          /*
>           * If it's a COW write we need to lock the extent range as we will be
>           * inserting/replacing file extent items and unpinning an extent map.
> 
> 
> 
>>
>>>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>>   fs/btrfs/inode.c | 15 +++++++++++++++
>>>   1 file changed, 15 insertions(+)
>>>
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 461725c8ccd7..740de9436d24 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -3226,6 +3226,21 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
>>>                  goto out;
>>>          }
>>>
>>> +       /*
>>> +        * If we have no data checksum, either the OE is:
>>> +        * - Fully truncated
>>> +        *   Those ones won't reach here.
>>> +        *
>>> +        * - No data checksum
>>> +        *
>>> +        * - Belongs to data reloc inode
>>> +        *   Which doesn't have csum attached to OE, but cloned
>>> +        *   from original chunk.
>>> +        */
>>> +       if (list_empty(&ordered_extent->list))
>>> +               ASSERT(inode->flags & BTRFS_INODE_NODATASUM ||
>>> +                      btrfs_is_data_reloc_root(inode->root));
>>
>> No need to inode->root, we have a local variable with the root already.
>>
>>> +
>>>          ret = add_pending_csums(trans, &ordered_extent->list);
>>>          if (unlikely(ret)) {
>>>                  btrfs_abort_transaction(trans, ret);
>>> --
>>> 2.52.0
>>>
>>>
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] btrfs: add an ASSERT() to catch ordered extents without datasum
  2025-12-11 23:15       ` Qu Wenruo
@ 2025-12-12  3:15         ` Qu Wenruo
  0 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-12-12  3:15 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs



在 2025/12/12 09:45, Qu Wenruo 写道:
> 
> 
> 在 2025/12/12 02:36, Filipe Manana 写道:
>> On Thu, Dec 11, 2025 at 11:41 AM Filipe Manana <fdmanana@kernel.org> 
>> wrote:
>>>
>>> On Thu, Dec 11, 2025 at 2:16 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>
>>>> Inside btrfs_finish_one_ordered(), there are only very limited
>>>> situations where the OE has no checksum:
>>>>
>>>> - The OE is completely truncated or error happened
>>>>    In that case no file extent is going to be inserted.
>>>>
>>>> - The inode has NODATASUM flag
>>>>
>>>> - The inode belongs to data reloc tree
>>>>
>>>> Add an ASSERT() using the last two cases, which will help us to catch
>>>> problems described in commit 18de34daa7c6 ("btrfs: truncate ordered
>>>> extent when skipping writeback past i_size"), and prevent future 
>>>> similar
>>>> cases.
>>>
>>> How exactly does this new assertion catches that case described in 
>>> that commit?
>>> We had csums there, just not for the whole range of the ordered
>>> extent, just for the range from its start offset to the rounded up
>>> i_size (which is less than the ordered extent's end offset).
>>
>> Rather than checking for the presence of checksums, and if they cover
>> the whole range, the best is to check if we have an ordered range that
>> crosses EOF...
> 
> Although btrfs_setsize() will also wait for OEs during expansion, I'd 
> prefer not to get isize involved in btrfs_finish_one_ordered() context.
> 
> Once i_size is involved, we need extra correctness check to make sure 
> the value is correct and not racy.
> 
> I know the current one can not catch the situation that only part of the 
> OE has csum, but I also didn't expect the ASSERT() to catch all situations.
> 
> The current one is good enough to catch most cases on subpage 
> environment and still safe enough not to cause false alerts.

But on the other hand, if we're adding an ASSERT(), we really want to 
catch them all, not just some easier to catch cases.

So I'll try to do all the following checks in a dedicated helper instead:

- If inode has NODATASUM, there must be no csum

- No checksum should be there for data reloc inode
   Or it will lead to -EEXIST during csum insertion.

- Csum must match the num_bytes ranges for uncompressed OE

- Csum must match the full disk_num_bytes range for compressed OE
   I'm not 100% sure about this part and will do more testing.

By this we do not need to bother i_size and still catch all cases, 
including the one in 18de34daa7c6 ("btrfs: truncate ordered extent when 
skipping writeback past i_size") and the no csum cases in previous patch.

Thanks,
Qu

> 
> Thanks,
> Qu
> 
>> Since it's also a bug if we have a nocow ordered extent that crosses EOF.
>>
>> Something like this (only compile tested):
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 0cbac085cdaf..76f7eab7a750 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -3139,6 +3139,11 @@ int btrfs_finish_one_ordered(struct
>> btrfs_ordered_extent *ordered_extent)
>>                          goto out;
>>          }
>>
>> +       ASSERT(start + logical_len <=
>> +              round_up(i_size_read(&inode->vfs_inode), fs_info- 
>> >sectorsize),
>> +              "start=%llu logical_len=%llu i_size=%lld sectorsize=%u",
>> +              start, logical_len, i_size_read(&inode->vfs_inode),
>> fs_info->sectorsize);
>> +
>>          /*
>>           * If it's a COW write we need to lock the extent range as we 
>> will be
>>           * inserting/replacing file extent items and unpinning an 
>> extent map.
>>
>>
>>
>>>
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>>   fs/btrfs/inode.c | 15 +++++++++++++++
>>>>   1 file changed, 15 insertions(+)
>>>>
>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>>> index 461725c8ccd7..740de9436d24 100644
>>>> --- a/fs/btrfs/inode.c
>>>> +++ b/fs/btrfs/inode.c
>>>> @@ -3226,6 +3226,21 @@ int btrfs_finish_one_ordered(struct 
>>>> btrfs_ordered_extent *ordered_extent)
>>>>                  goto out;
>>>>          }
>>>>
>>>> +       /*
>>>> +        * If we have no data checksum, either the OE is:
>>>> +        * - Fully truncated
>>>> +        *   Those ones won't reach here.
>>>> +        *
>>>> +        * - No data checksum
>>>> +        *
>>>> +        * - Belongs to data reloc inode
>>>> +        *   Which doesn't have csum attached to OE, but cloned
>>>> +        *   from original chunk.
>>>> +        */
>>>> +       if (list_empty(&ordered_extent->list))
>>>> +               ASSERT(inode->flags & BTRFS_INODE_NODATASUM ||
>>>> +                      btrfs_is_data_reloc_root(inode->root));
>>>
>>> No need to inode->root, we have a local variable with the root already.
>>>
>>>> +
>>>>          ret = add_pending_csums(trans, &ordered_extent->list);
>>>>          if (unlikely(ret)) {
>>>>                  btrfs_abort_transaction(trans, ret);
>>>> -- 
>>>> 2.52.0
>>>>
>>>>
>>
> 
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] btrfs: add an ASSERT() to catch ordered extents without datasum
  2025-12-11 11:41   ` Filipe Manana
  2025-12-11 16:06     ` Filipe Manana
@ 2025-12-11 23:06     ` Qu Wenruo
  1 sibling, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-12-11 23:06 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs



在 2025/12/11 22:11, Filipe Manana 写道:
> On Thu, Dec 11, 2025 at 2:16 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> Inside btrfs_finish_one_ordered(), there are only very limited
>> situations where the OE has no checksum:
>>
>> - The OE is completely truncated or error happened
>>    In that case no file extent is going to be inserted.
>>
>> - The inode has NODATASUM flag
>>
>> - The inode belongs to data reloc tree
>>
>> Add an ASSERT() using the last two cases, which will help us to catch
>> problems described in commit 18de34daa7c6 ("btrfs: truncate ordered
>> extent when skipping writeback past i_size"), and prevent future similar
>> cases.
> 
> How exactly does this new assertion catches that case described in that commit?

If go with the example of patch 1, the OE range [60K, 64K) has no data 
checksum but the inode has data checksum, thus will immediately trigger 
the ASSERT().

Thanks,
Qu

> We had csums there, just not for the whole range of the ordered
> extent, just for the range from its start offset to the rounded up
> i_size (which is less than the ordered extent's end offset).
> 
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/inode.c | 15 +++++++++++++++
>>   1 file changed, 15 insertions(+)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 461725c8ccd7..740de9436d24 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -3226,6 +3226,21 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
>>                  goto out;
>>          }
>>
>> +       /*
>> +        * If we have no data checksum, either the OE is:
>> +        * - Fully truncated
>> +        *   Those ones won't reach here.
>> +        *
>> +        * - No data checksum
>> +        *
>> +        * - Belongs to data reloc inode
>> +        *   Which doesn't have csum attached to OE, but cloned
>> +        *   from original chunk.
>> +        */
>> +       if (list_empty(&ordered_extent->list))
>> +               ASSERT(inode->flags & BTRFS_INODE_NODATASUM ||
>> +                      btrfs_is_data_reloc_root(inode->root));
> 
> No need to inode->root, we have a local variable with the root already.
> 
>> +
>>          ret = add_pending_csums(trans, &ordered_extent->list);
>>          if (unlikely(ret)) {
>>                  btrfs_abort_transaction(trans, ret);
>> --
>> 2.52.0
>>
>>
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-12-12  3:15 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-11  2:15 [PATCH 0/2] btrfs: fix beyond-EOF write handling mostly for subpage and larger folios Qu Wenruo
2025-12-11  2:15 ` [PATCH 1/2] btrfs: fix beyond-EOF write handling Qu Wenruo
2025-12-11 11:26   ` Filipe Manana
2025-12-11 23:05     ` Qu Wenruo
2025-12-11  2:15 ` [PATCH 2/2] btrfs: add an ASSERT() to catch ordered extents without datasum Qu Wenruo
2025-12-11 11:41   ` Filipe Manana
2025-12-11 16:06     ` Filipe Manana
2025-12-11 23:15       ` Qu Wenruo
2025-12-12  3:15         ` Qu Wenruo
2025-12-11 23:06     ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox