* Re: [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker
2026-04-27 20:24 ` [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker Teng Liu
@ 2026-04-27 22:15 ` Qu Wenruo
2026-04-28 0:44 ` Qu Wenruo
2026-04-28 9:03 ` Johannes Thumshirn
` (2 subsequent siblings)
3 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2026-04-27 22:15 UTC (permalink / raw)
To: Teng Liu, linux-btrfs
Cc: dsterba, clm, linux-kernel, syzbot+3e20d8f3d41bac5dc9a2
在 2026/4/28 05:54, Teng Liu 写道:
> get_new_location() uses BUG_ON() to crash the kernel if the file extent
> item it looks up has any of offset, compression, encryption, or
> other_encoding set. The data reloc inode is only written by relocation's
> own paths -- insert_prealloc_file_extent() and
> insert_ordered_extent_file_extent() -- which always leave those four
> fields at 0 (the data reloc inode is created with BTRFS_INODE_NOCOMPRESS,
> and encryption/other_encoding are reserved-and-zero). Observing a
> non-zero value therefore means the leaf decoded from disk does not match
> what the kernel wrote, i.e. on-disk corruption. A malformed image can
> reach this code via balance and panic the kernel.
>
> Move the validation into tree-checker's check_extent_data_item(), where
> the constraint is enforced when the leaf is read off disk rather than
> after relocation has already started. The data reloc tree has a fixed
> root id (BTRFS_DATA_RELOC_TREE_OBJECTID) recorded in the extent buffer
> header, so check_extent_data_item() has all the information it needs to
> apply this check on its own. Report violations via file_extent_err() and
> print the four offending values.
>
> In get_new_location() replace the BUG_ON() with an ASSERT().
> The caller in replace_file_extents() already handles non-zero returns from
> get_new_location() by breaking out of the loop without aborting the
> transaction, so no caller changes are needed.
>
> Suggested-by: Qu Wenruo <wqu@suse.com>
> Suggested-by: David Sterba <dsterba@suse.com>
> Reported-by: syzbot+3e20d8f3d41bac5dc9a2@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=3e20d8f3d41bac5dc9a2
> Signed-off-by: Teng Liu <27rabbitlt@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
And merged.
Thanks,
Qu
> ---
> Changes in v3:
> - Move the corruption check from relocation.c into tree-checker's
> check_extent_data_item(), per Qu and David. The data reloc tree's
> fixed objectid is recorded in the extent buffer header, so the
> check has all the context it needs at read time.
> - Use file_extent_err() and print offset/compression/encryption/
> other_encoding values, per Qu.
> - Replace the BUG_ON in get_new_location() with ASSERT() rather than
> -EUCLEAN, per David.
>
> Changes in v2:
> - Pair the -EUCLEAN return with btrfs_print_leaf() and btrfs_err() so
> the offending leaf is dumped to dmesg, per Qu's review of v1.
> - Expand the changelog to argue why non-zero
> compression/encryption/other_encoding in the data reloc inode imply
> on-disk corruption.
>
> fs/btrfs/relocation.c | 8 ++++----
> fs/btrfs/tree-checker.c | 21 +++++++++++++++++++++
> 2 files changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 1c42c5180bdd..527d4dbfe31c 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -835,10 +835,10 @@ static int get_new_location(struct inode *reloc_inode, u64 *new_bytenr,
> fi = btrfs_item_ptr(leaf, path->slots[0],
> struct btrfs_file_extent_item);
>
> - BUG_ON(btrfs_file_extent_offset(leaf, fi) ||
> - btrfs_file_extent_compression(leaf, fi) ||
> - btrfs_file_extent_encryption(leaf, fi) ||
> - btrfs_file_extent_other_encoding(leaf, fi));
> + ASSERT(!btrfs_file_extent_offset(leaf, fi) &&
> + !btrfs_file_extent_compression(leaf, fi) &&
> + !btrfs_file_extent_encryption(leaf, fi) &&
> + !btrfs_file_extent_other_encoding(leaf, fi));
>
> if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi))
> return -EINVAL;
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index 1f15d0793a9c..e4864f7a471e 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -296,6 +296,27 @@ static int check_extent_data_item(struct extent_buffer *leaf,
> return 0;
> }
>
> + /*
> + * For the data reloc tree, file extent items are written by
> + * relocation's own paths, which always leave offset, compression,
> + * encryption and other_encoding as 0. Any non-zero value here means
> + * the leaf decoded from disk does not match what the kernel wrote,
> + * i.e. on-disk corruption.
> + */
> + if (unlikely(btrfs_header_owner(leaf) == BTRFS_DATA_RELOC_TREE_OBJECTID &&
> + (btrfs_file_extent_offset(leaf, fi) ||
> + btrfs_file_extent_compression(leaf, fi) ||
> + btrfs_file_extent_encryption(leaf, fi) ||
> + btrfs_file_extent_other_encoding(leaf, fi)))) {
> + file_extent_err(leaf, slot,
> +"invalid members for data reloc tree, offset=%llu compress=%u encryption=%u other_encoding=%u",
> + btrfs_file_extent_offset(leaf, fi),
> + btrfs_file_extent_compression(leaf, fi),
> + btrfs_file_extent_encryption(leaf, fi),
> + btrfs_file_extent_other_encoding(leaf, fi));
> + return -EUCLEAN;
> + }
> +
> /* Regular or preallocated extent has fixed item size */
> if (unlikely(item_size != sizeof(*fi))) {
> file_extent_err(leaf, slot,
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker
2026-04-27 22:15 ` Qu Wenruo
@ 2026-04-28 0:44 ` Qu Wenruo
2026-04-28 15:29 ` David Sterba
0 siblings, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2026-04-28 0:44 UTC (permalink / raw)
To: Teng Liu, linux-btrfs
Cc: dsterba, clm, linux-kernel, syzbot+3e20d8f3d41bac5dc9a2
在 2026/4/28 07:45, Qu Wenruo 写道:
>
>
> 在 2026/4/28 05:54, Teng Liu 写道:
>> get_new_location() uses BUG_ON() to crash the kernel if the file extent
>> item it looks up has any of offset, compression, encryption, or
>> other_encoding set. The data reloc inode is only written by relocation's
>> own paths -- insert_prealloc_file_extent() and
>> insert_ordered_extent_file_extent() -- which always leave those four
>> fields at 0 (the data reloc inode is created with BTRFS_INODE_NOCOMPRESS,
>> and encryption/other_encoding are reserved-and-zero). Observing a
>> non-zero value therefore means the leaf decoded from disk does not match
>> what the kernel wrote, i.e. on-disk corruption. A malformed image can
>> reach this code via balance and panic the kernel.
>>
>> Move the validation into tree-checker's check_extent_data_item(), where
>> the constraint is enforced when the leaf is read off disk rather than
>> after relocation has already started. The data reloc tree has a fixed
>> root id (BTRFS_DATA_RELOC_TREE_OBJECTID) recorded in the extent buffer
>> header, so check_extent_data_item() has all the information it needs to
>> apply this check on its own. Report violations via file_extent_err() and
>> print the four offending values.
>>
>> In get_new_location() replace the BUG_ON() with an ASSERT().
>> The caller in replace_file_extents() already handles non-zero returns
>> from
>> get_new_location() by breaking out of the loop without aborting the
>> transaction, so no caller changes are needed.
>>
>> Suggested-by: Qu Wenruo <wqu@suse.com>
>> Suggested-by: David Sterba <dsterba@suse.com>
>> Reported-by: syzbot+3e20d8f3d41bac5dc9a2@syzkaller.appspotmail.com
>> Closes: https://syzkaller.appspot.com/bug?extid=3e20d8f3d41bac5dc9a2
>> Signed-off-by: Teng Liu <27rabbitlt@gmail.com>
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> And merged.
Unfortunately this tree-checker got triggered during btrfs/061 runs at
write-time tree-checker, with arm64 64K page size.
The offending file extent is as the following:
[ 536.885066] item 69 key (258 EXTENT_DATA 4063232) itemoff 12400
itemsize 53
[ 536.885067] generation 28 type 1
[ 536.885067] extent data disk bytenr 10512723968 nr 36864
[ 536.885068] extent data offset 24576 nr 12288 ram 36864
[ 536.885069] extent compression 0
Note the offset is not zero, and the type is 1 which means it's a
regular file extent.
So the check is causing false alerts.
Now the patch is reverted, and I'll spend more time digging into the case.
Thanks,
Qu
>
> Thanks,
> Qu
>
>> ---
>> Changes in v3:
>> - Move the corruption check from relocation.c into tree-checker's
>> check_extent_data_item(), per Qu and David. The data reloc tree's
>> fixed objectid is recorded in the extent buffer header, so the
>> check has all the context it needs at read time.
>> - Use file_extent_err() and print offset/compression/encryption/
>> other_encoding values, per Qu.
>> - Replace the BUG_ON in get_new_location() with ASSERT() rather than
>> -EUCLEAN, per David.
>>
>> Changes in v2:
>> - Pair the -EUCLEAN return with btrfs_print_leaf() and btrfs_err() so
>> the offending leaf is dumped to dmesg, per Qu's review of v1.
>> - Expand the changelog to argue why non-zero
>> compression/encryption/other_encoding in the data reloc inode imply
>> on-disk corruption.
>>
>> fs/btrfs/relocation.c | 8 ++++----
>> fs/btrfs/tree-checker.c | 21 +++++++++++++++++++++
>> 2 files changed, 25 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index 1c42c5180bdd..527d4dbfe31c 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -835,10 +835,10 @@ static int get_new_location(struct inode
>> *reloc_inode, u64 *new_bytenr,
>> fi = btrfs_item_ptr(leaf, path->slots[0],
>> struct btrfs_file_extent_item);
>> - BUG_ON(btrfs_file_extent_offset(leaf, fi) ||
>> - btrfs_file_extent_compression(leaf, fi) ||
>> - btrfs_file_extent_encryption(leaf, fi) ||
>> - btrfs_file_extent_other_encoding(leaf, fi));
>> + ASSERT(!btrfs_file_extent_offset(leaf, fi) &&
>> + !btrfs_file_extent_compression(leaf, fi) &&
>> + !btrfs_file_extent_encryption(leaf, fi) &&
>> + !btrfs_file_extent_other_encoding(leaf, fi));
>> if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi))
>> return -EINVAL;
>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>> index 1f15d0793a9c..e4864f7a471e 100644
>> --- a/fs/btrfs/tree-checker.c
>> +++ b/fs/btrfs/tree-checker.c
>> @@ -296,6 +296,27 @@ static int check_extent_data_item(struct
>> extent_buffer *leaf,
>> return 0;
>> }
>> + /*
>> + * For the data reloc tree, file extent items are written by
>> + * relocation's own paths, which always leave offset, compression,
>> + * encryption and other_encoding as 0. Any non-zero value here means
>> + * the leaf decoded from disk does not match what the kernel wrote,
>> + * i.e. on-disk corruption.
>> + */
>> + if (unlikely(btrfs_header_owner(leaf) ==
>> BTRFS_DATA_RELOC_TREE_OBJECTID &&
>> + (btrfs_file_extent_offset(leaf, fi) ||
>> + btrfs_file_extent_compression(leaf, fi) ||
>> + btrfs_file_extent_encryption(leaf, fi) ||
>> + btrfs_file_extent_other_encoding(leaf, fi)))) {
>> + file_extent_err(leaf, slot,
>> +"invalid members for data reloc tree, offset=%llu compress=%u
>> encryption=%u other_encoding=%u",
>> + btrfs_file_extent_offset(leaf, fi),
>> + btrfs_file_extent_compression(leaf, fi),
>> + btrfs_file_extent_encryption(leaf, fi),
>> + btrfs_file_extent_other_encoding(leaf, fi));
>> + return -EUCLEAN;
>> + }
>> +
>> /* Regular or preallocated extent has fixed item size */
>> if (unlikely(item_size != sizeof(*fi))) {
>> file_extent_err(leaf, slot,
>
>
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker
2026-04-28 0:44 ` Qu Wenruo
@ 2026-04-28 15:29 ` David Sterba
0 siblings, 0 replies; 13+ messages in thread
From: David Sterba @ 2026-04-28 15:29 UTC (permalink / raw)
To: Qu Wenruo
Cc: Teng Liu, linux-btrfs, dsterba, clm, linux-kernel,
syzbot+3e20d8f3d41bac5dc9a2
On Tue, Apr 28, 2026 at 10:14:40AM +0930, Qu Wenruo wrote:
>
>
> 在 2026/4/28 07:45, Qu Wenruo 写道:
> >
> >
> > 在 2026/4/28 05:54, Teng Liu 写道:
> >> get_new_location() uses BUG_ON() to crash the kernel if the file extent
> >> item it looks up has any of offset, compression, encryption, or
> >> other_encoding set. The data reloc inode is only written by relocation's
> >> own paths -- insert_prealloc_file_extent() and
> >> insert_ordered_extent_file_extent() -- which always leave those four
> >> fields at 0 (the data reloc inode is created with BTRFS_INODE_NOCOMPRESS,
> >> and encryption/other_encoding are reserved-and-zero). Observing a
> >> non-zero value therefore means the leaf decoded from disk does not match
> >> what the kernel wrote, i.e. on-disk corruption. A malformed image can
> >> reach this code via balance and panic the kernel.
> >>
> >> Move the validation into tree-checker's check_extent_data_item(), where
> >> the constraint is enforced when the leaf is read off disk rather than
> >> after relocation has already started. The data reloc tree has a fixed
> >> root id (BTRFS_DATA_RELOC_TREE_OBJECTID) recorded in the extent buffer
> >> header, so check_extent_data_item() has all the information it needs to
> >> apply this check on its own. Report violations via file_extent_err() and
> >> print the four offending values.
> >>
> >> In get_new_location() replace the BUG_ON() with an ASSERT().
> >> The caller in replace_file_extents() already handles non-zero returns
> >> from
> >> get_new_location() by breaking out of the loop without aborting the
> >> transaction, so no caller changes are needed.
> >>
> >> Suggested-by: Qu Wenruo <wqu@suse.com>
> >> Suggested-by: David Sterba <dsterba@suse.com>
> >> Reported-by: syzbot+3e20d8f3d41bac5dc9a2@syzkaller.appspotmail.com
> >> Closes: https://syzkaller.appspot.com/bug?extid=3e20d8f3d41bac5dc9a2
> >> Signed-off-by: Teng Liu <27rabbitlt@gmail.com>
> >
> > Reviewed-by: Qu Wenruo <wqu@suse.com>
> >
> > And merged.
>
> Unfortunately this tree-checker got triggered during btrfs/061 runs at
> write-time tree-checker, with arm64 64K page size.
>
> The offending file extent is as the following:
>
> [ 536.885066] item 69 key (258 EXTENT_DATA 4063232) itemoff 12400
> itemsize 53
> [ 536.885067] generation 28 type 1
> [ 536.885067] extent data disk bytenr 10512723968 nr 36864
> [ 536.885068] extent data offset 24576 nr 12288 ram 36864
> [ 536.885069] extent compression 0
>
> Note the offset is not zero, and the type is 1 which means it's a
> regular file extent.
>
> So the check is causing false alerts.
I maybe have an idea. The difference from the BUG_ON and the
tree-checker is the context where it's called. In relocation it's
somewere in the middle and there are actions fixing up the offset. OTOH
when this is done in tree-checker the constraints are different.
get_new_location() - verifies offset, compression, ...
The offset corresponds to 'bytenr' and is returned via *new_bytenr to
replace_file_extents() and then updated in the leaf
btrfs_set_file_extent_disk_bytenr(leaf, fi, new_bytenr);
This eventually ends up in in the pre-write check.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker
2026-04-27 20:24 ` [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker Teng Liu
2026-04-27 22:15 ` Qu Wenruo
@ 2026-04-28 9:03 ` Johannes Thumshirn
2026-05-03 15:35 ` Teng Liu
2026-05-13 11:35 ` [PATCH v4] btrfs: validate data reloc tree file extent item members Teng Liu
3 siblings, 0 replies; 13+ messages in thread
From: Johannes Thumshirn @ 2026-04-28 9:03 UTC (permalink / raw)
To: Teng Liu, linux-btrfs
Cc: dsterba, clm, wqu, linux-kernel, syzbot+3e20d8f3d41bac5dc9a2
On 4/27/26 10:24 PM, Teng Liu wrote:
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 1c42c5180bdd..527d4dbfe31c 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -835,10 +835,10 @@ static int get_new_location(struct inode *reloc_inode, u64 *new_bytenr,
> fi = btrfs_item_ptr(leaf, path->slots[0],
> struct btrfs_file_extent_item);
>
> - BUG_ON(btrfs_file_extent_offset(leaf, fi) ||
> - btrfs_file_extent_compression(leaf, fi) ||
> - btrfs_file_extent_encryption(leaf, fi) ||
> - btrfs_file_extent_other_encoding(leaf, fi));
> + ASSERT(!btrfs_file_extent_offset(leaf, fi) &&
> + !btrfs_file_extent_compression(leaf, fi) &&
> + !btrfs_file_extent_encryption(leaf, fi) &&
> + !btrfs_file_extent_other_encoding(leaf, fi));
Can you split that into multiple ASSERT()s? So we quickly see which one
actually triggered.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker
2026-04-27 20:24 ` [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker Teng Liu
2026-04-27 22:15 ` Qu Wenruo
2026-04-28 9:03 ` Johannes Thumshirn
@ 2026-05-03 15:35 ` Teng Liu
2026-05-03 22:36 ` Qu Wenruo
2026-05-13 11:35 ` [PATCH v4] btrfs: validate data reloc tree file extent item members Teng Liu
3 siblings, 1 reply; 13+ messages in thread
From: Teng Liu @ 2026-05-03 15:35 UTC (permalink / raw)
To: linux-btrfs; +Cc: wqu, dsterba
On 2026-04-27 22:24, Teng Liu wrote:
> get_new_location() uses BUG_ON() to crash the kernel if the file extent
> item it looks up has any of offset, compression, encryption, or
> other_encoding set. The data reloc inode is only written by relocation's
> own paths -- insert_prealloc_file_extent() and
> insert_ordered_extent_file_extent() -- which always leave those four
> fields at 0 (the data reloc inode is created with BTRFS_INODE_NOCOMPRESS,
> and encryption/other_encoding are reserved-and-zero). Observing a
> non-zero value therefore means the leaf decoded from disk does not match
> what the kernel wrote, i.e. on-disk corruption. A malformed image can
> reach this code via balance and panic the kernel.
>
> Move the validation into tree-checker's check_extent_data_item(), where
> the constraint is enforced when the leaf is read off disk rather than
> after relocation has already started. The data reloc tree has a fixed
> root id (BTRFS_DATA_RELOC_TREE_OBJECTID) recorded in the extent buffer
> header, so check_extent_data_item() has all the information it needs to
> apply this check on its own. Report violations via file_extent_err() and
> print the four offending values.
>
> In get_new_location() replace the BUG_ON() with an ASSERT().
> The caller in replace_file_extents() already handles non-zero returns from
> get_new_location() by breaking out of the loop without aborting the
> transaction, so no caller changes are needed.
>
Hi Qu, David, and all,
I spent some time trying to understand why the tree-checker saw non-zero
offset while BUG_ON never hit. Share some observation here in case it's
helpful and also ask for further advice about how to make the patch
correct.
Short answer: I noticed that get_new_location() and the tree-checker
look at different itmes in the data reloc tree. The tree-chcker
validates every item pre-write; while get_new_location() searches only
exact cluster boundary key, so the offset is always 0.
== How I tested ==
I instrumented both check_extent_data_item() (tree-checker) and
get_new_location() with pr_info() to log ino+key_offset+offset for every
item. Then ran btrfs/061 on qemu.
== Some Evidence ==
Tree-checker hits (items with non-zero fields in the data reloc tree):
[ 63.531400] tree-checker HIT: ino=258 key_off=13303808 type=1 offset=102400
compress=0 encrypt=0 other=0
Printing out the leaf items (when tree-checker hit an item with non-zero
offset) and we found:
[ 63.540541] item 114 key (258 EXTENT_DATA 13201408) itemoff 10188 itemsize 53
[ 63.540549] generation 22 type 2
[ 63.540554] extent data disk bytenr 4566134784 nr 110592
[ 63.540558] extent data offset 0 nr 102400 ram 110592
[ 63.540613] extent compression 0
[ 63.540619] item 115 key (258 EXTENT_DATA 13303808) itemoff 10135 itemsize 53
[ 63.540644] generation 22 type 1
[ 63.540649] extent data disk bytenr 4566134784 nr 110592
[ 63.540744] extent data offset 102400 nr 8192 ram 110592
[ 63.540750] extent compression 0
Looking for the two 13201408 and 13303808 items in get_new_location:
get_new_location() search keys:
ino=258 key_off=13201408 offset=0 type=1
...
We didn't see 13303808 item at all during get_new_location.
get_new_location searched for key_off=13201408 (the cluster boundary) and
found offset=0. It never searched for 13303808.
=== Why this happens ===
During the MOVE_DATA_EXTENTS phase:
1. prealloc_file_extent_cluster() inserts one PREALLOC per cluster
boundary (via insert_prealloc_file_extent), all with offset=0.
2. relocate_one_folio() reads source data and marks pages dirty. On
writeback, ordered extents are created. When an ordered extent
completes, insert_ordered_extent_file_extent() calls
insert_reserved_file_extent(), which calls btrfs_drop_extents() to
carve out the written range from the PREALLOC, splitting it. The
right-hand piece (REG) gets the ordered extent's offset into the
shared disk_bytenr -- which can be non-zero.
3. The pre-write tree-checker fires when the modified leaf is written
to disk, seeing the REG item with offset!=0. This is the false
positive.
4. btrfs_wait_ordered_range() waits for all ordered extents to finish,
then the stage switches to UPDATE_DATA_PTRS. replace_file_extents()
calls get_new_location(src_disk_bytenr) for each source extent.
The search key is:
bytenr -= reloc_block_group_start;
btrfs_lookup_file_extent(..., ino, bytenr, 0);
This is the cluster boundary -- the start of an extent, where offset
is always 0. get_new_location never searches the mid-range keys
where the split pieces live.
== Advice Needed ==
If my analysis is correct, what do you think is the best way to fix this
problem? Or am I missing something somewhere?
Thanks in advance,
Teng Liu
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker
2026-05-03 15:35 ` Teng Liu
@ 2026-05-03 22:36 ` Qu Wenruo
0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2026-05-03 22:36 UTC (permalink / raw)
To: Teng Liu, linux-btrfs; +Cc: dsterba
在 2026/5/4 01:05, Teng Liu 写道:
> On 2026-04-27 22:24, Teng Liu wrote:
>> get_new_location() uses BUG_ON() to crash the kernel if the file extent
>> item it looks up has any of offset, compression, encryption, or
>> other_encoding set. The data reloc inode is only written by relocation's
>> own paths -- insert_prealloc_file_extent() and
>> insert_ordered_extent_file_extent() -- which always leave those four
>> fields at 0 (the data reloc inode is created with BTRFS_INODE_NOCOMPRESS,
>> and encryption/other_encoding are reserved-and-zero). Observing a
>> non-zero value therefore means the leaf decoded from disk does not match
>> what the kernel wrote, i.e. on-disk corruption. A malformed image can
>> reach this code via balance and panic the kernel.
>>
>> Move the validation into tree-checker's check_extent_data_item(), where
>> the constraint is enforced when the leaf is read off disk rather than
>> after relocation has already started. The data reloc tree has a fixed
>> root id (BTRFS_DATA_RELOC_TREE_OBJECTID) recorded in the extent buffer
>> header, so check_extent_data_item() has all the information it needs to
>> apply this check on its own. Report violations via file_extent_err() and
>> print the four offending values.
>>
>> In get_new_location() replace the BUG_ON() with an ASSERT().
>> The caller in replace_file_extents() already handles non-zero returns from
>> get_new_location() by breaking out of the loop without aborting the
>> transaction, so no caller changes are needed.
>>
>
> Hi Qu, David, and all,
>
> I spent some time trying to understand why the tree-checker saw non-zero
> offset while BUG_ON never hit. Share some observation here in case it's
> helpful and also ask for further advice about how to make the patch
> correct.
>
> Short answer: I noticed that get_new_location() and the tree-checker
> look at different itmes in the data reloc tree. The tree-chcker
> validates every item pre-write; while get_new_location() searches only
> exact cluster boundary key, so the offset is always 0.
>
> == How I tested ==
>
> I instrumented both check_extent_data_item() (tree-checker) and
> get_new_location() with pr_info() to log ino+key_offset+offset for every
> item. Then ran btrfs/061 on qemu.
>
> == Some Evidence ==
>
> Tree-checker hits (items with non-zero fields in the data reloc tree):
> [ 63.531400] tree-checker HIT: ino=258 key_off=13303808 type=1 offset=102400
> compress=0 encrypt=0 other=0
>
> Printing out the leaf items (when tree-checker hit an item with non-zero
> offset) and we found:
> [ 63.540541] item 114 key (258 EXTENT_DATA 13201408) itemoff 10188 itemsize 53
> [ 63.540549] generation 22 type 2
> [ 63.540554] extent data disk bytenr 4566134784 nr 110592
> [ 63.540558] extent data offset 0 nr 102400 ram 110592
> [ 63.540613] extent compression 0
> [ 63.540619] item 115 key (258 EXTENT_DATA 13303808) itemoff 10135 itemsize 53
> [ 63.540644] generation 22 type 1
> [ 63.540649] extent data disk bytenr 4566134784 nr 110592
> [ 63.540744] extent data offset 102400 nr 8192 ram 110592
> [ 63.540750] extent compression 0
>
> Looking for the two 13201408 and 13303808 items in get_new_location:
>
> get_new_location() search keys:
> ino=258 key_off=13201408 offset=0 type=1
> ...
>
> We didn't see 13303808 item at all during get_new_location.
> get_new_location searched for key_off=13201408 (the cluster boundary) and
> found offset=0. It never searched for 13303808.
>
> === Why this happens ===
>
> During the MOVE_DATA_EXTENTS phase:
>
> 1. prealloc_file_extent_cluster() inserts one PREALLOC per cluster
> boundary (via insert_prealloc_file_extent), all with offset=0.
>
> 2. relocate_one_folio() reads source data and marks pages dirty. On
> writeback, ordered extents are created. When an ordered extent
> completes, insert_ordered_extent_file_extent() calls
> insert_reserved_file_extent(), which calls btrfs_drop_extents() to
> carve out the written range from the PREALLOC, splitting it. The
> right-hand piece (REG) gets the ordered extent's offset into the
> shared disk_bytenr -- which can be non-zero.
>
> 3. The pre-write tree-checker fires when the modified leaf is written
> to disk, seeing the REG item with offset!=0. This is the false
> positive.
>
> 4. btrfs_wait_ordered_range() waits for all ordered extents to finish,
> then the stage switches to UPDATE_DATA_PTRS. replace_file_extents()
> calls get_new_location(src_disk_bytenr) for each source extent.
> The search key is:
>
> bytenr -= reloc_block_group_start;
> btrfs_lookup_file_extent(..., ino, bytenr, 0);
>
> This is the cluster boundary -- the start of an extent, where offset
> is always 0. get_new_location never searches the mid-range keys
> where the split pieces live.
>
> == Advice Needed ==
>
> If my analysis is correct, what do you think is the best way to fix this
> problem? Or am I missing something somewhere?
Make the tree-checker to reject compress/encoded extents in data reloc tree.
And add proper handling with leaf dump for relocation when an non-zero
offset is hit.
>
> Thanks in advance,
> Teng Liu
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v4] btrfs: validate data reloc tree file extent item members
2026-04-27 20:24 ` [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker Teng Liu
` (2 preceding siblings ...)
2026-05-03 15:35 ` Teng Liu
@ 2026-05-13 11:35 ` Teng Liu
3 siblings, 0 replies; 13+ messages in thread
From: Teng Liu @ 2026-05-13 11:35 UTC (permalink / raw)
To: linux-btrfs
Cc: Teng Liu, dsterba, clm, wqu, linux-kernel,
syzbot+3e20d8f3d41bac5dc9a2
get_new_location() uses BUG_ON() to crash the kernel if the file extent
item it looks up has any of offset, compression, encryption, or
other_encoding set non-zero. The data reloc inode is only written by
relocation's own paths and the four fields are always 0 in what the
kernel writes:
- insert_prealloc_file_extent() memsets the stack item to zero and
only fills in type, disk_bytenr, disk_num_bytes and num_bytes, so
offset/compression/encryption/other_encoding stay 0.
- insert_ordered_extent_file_extent() copies oe->compress_type into
the file extent's compression field, but the data reloc inode is
created with BTRFS_INODE_NOCOMPRESS so compress_type is always 0;
encryption and other_encoding are reserved-and-zero in btrfs.
A non-zero value here means the leaf decoded from disk does not match
what the kernel wrote, i.e. on-disk corruption. A malformed image
reaches this code via balance and panics the kernel.
A previous attempt to enforce all four constraints in tree-checker's
check_extent_data_item() was merged as commit 7d0ee95979e9 ("btrfs:
validate data reloc tree file extent item members in tree-checker")
and then reverted by commit 1c034697fcaa after btrfs/061 produced
false positives on arm64 with 64K pages. The reason: relocation
writeback legitimately produces REG file_extent_items with offset != 0
in the data reloc tree. When an ordered extent covers only the back
portion of an underlying PREALLOC (num_bytes < ram_bytes on the input
file_extent), insert_ordered_extent_file_extent() inserts a REG with
offset = oe->offset
num_bytes = oe->num_bytes
ram_bytes preserved from the original PREALLOC,
and this item can reach disk if a transaction commit fires while it
is present in the leaf.
The four fields belong in different layers:
- compression, encryption and other_encoding are universal
invariants for every item in the data reloc tree, regardless of
cluster geometry. Enforce them in tree-checker's
check_extent_data_item() so a corrupt leaf is rejected at read
time.
- offset is only an invariant at the cluster-boundary keys that
get_new_location() searches (the key is computed as
src_disk_bytenr - reloc_block_group_start). Partial-PREALLOC
writebacks legitimately place REG items at non-boundary keys with
offset != 0; tree-checker cannot reject these. The cluster-
boundary item is always written by either
insert_prealloc_file_extent() (offset=0 by memset) or by the
front portion of a partial writeback (offset=0 by construction),
so a non-zero offset there is corruption.
Enforce the universal invariants in check_extent_data_item() with a
file_extent_err() rejection. Convert the BUG_ON() in
get_new_location() to a -EUCLEAN return paired with btrfs_print_leaf()
and btrfs_err() so the offending leaf is logged. The caller in
replace_file_extents() already handles non-zero returns from
get_new_location() by breaking out of the loop without aborting the
transaction.
Suggested-by: Qu Wenruo <wqu@suse.com>
Suggested-by: David Sterba <dsterba@suse.com>
Reported-by: syzbot+3e20d8f3d41bac5dc9a2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=3e20d8f3d41bac5dc9a2
Signed-off-by: Teng Liu <27rabbitlt@gmail.com>
---
Changes in v4:
- Split the check by which layer the invariant holds in. Reject
compression/encryption/other_encoding != 0 in tree-checker (true
on-disk invariant for the entire data reloc tree). Keep the offset
check at the call site in get_new_location() (true only at the
cluster-boundary keys it searches; partial-PREALLOC writeback
legitimately produces non-zero offset at non-boundary keys, which
is why the v3 single-rule approach was reverted).
- Suggested by Qu Wenruo in reply to v3:
https://lore.kernel.org/linux-btrfs/20260427202822.278326-1-27rabbitlt@gmail.com/
Changes in v3:
- Moved the entire four-field check from get_new_location() into
tree-checker's check_extent_data_item(). Replaced BUG_ON() with
ASSERT() in get_new_location(). Merged as 7d0ee95979e9 and
reverted by 1c034697fcaa due to false positives in btrfs/061 on
arm64 64K pages.
Changes in v2:
- Pair the -EUCLEAN return with btrfs_print_leaf() and btrfs_err()
so the offending leaf is dumped to dmesg, per Qu's v1 review:
https://lore.kernel.org/linux-btrfs/6c54901d-5e07-4c46-9553-997b28c93b86@suse.com/
- Expand the changelog to argue why non-zero compression/encryption/
other_encoding in the data reloc inode imply on-disk corruption
rather than a kernel bug.
fs/btrfs/relocation.c | 22 ++++++++++++++++++----
fs/btrfs/tree-checker.c | 27 +++++++++++++++++++++++++++
2 files changed, 45 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 1c42c5180bdd..01977fa282db 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -814,6 +814,7 @@ static int get_new_location(struct inode *reloc_inode, u64 *new_bytenr,
u64 bytenr, u64 num_bytes)
{
struct btrfs_root *root = BTRFS_I(reloc_inode)->root;
+ struct btrfs_fs_info *fs_info = root->fs_info;
BTRFS_PATH_AUTO_FREE(path);
struct btrfs_file_extent_item *fi;
struct extent_buffer *leaf;
@@ -835,10 +836,23 @@ static int get_new_location(struct inode *reloc_inode, u64 *new_bytenr,
fi = btrfs_item_ptr(leaf, path->slots[0],
struct btrfs_file_extent_item);
- BUG_ON(btrfs_file_extent_offset(leaf, fi) ||
- btrfs_file_extent_compression(leaf, fi) ||
- btrfs_file_extent_encryption(leaf, fi) ||
- btrfs_file_extent_other_encoding(leaf, fi));
+ /*
+ * The cluster-boundary key searched above is always written by
+ * relocation with offset 0: either by insert_prealloc_file_extent()
+ * (memsets the stack item to 0) or by the front portion of a partial
+ * writeback (offset=0 by construction). A non-zero value here means
+ * the on-disk leaf does not match what relocation wrote, i.e.
+ * corruption. The other encoding fields are caught earlier by
+ * tree-checker's check_extent_data_item().
+ */
+ if (unlikely(btrfs_file_extent_offset(leaf, fi))) {
+ btrfs_print_leaf(leaf);
+ btrfs_err(fs_info,
+"unexpected non-zero offset in file extent item for data reloc inode %llu key offset %llu offset %llu",
+ btrfs_ino(BTRFS_I(reloc_inode)), bytenr,
+ btrfs_file_extent_offset(leaf, fi));
+ return -EUCLEAN;
+ }
if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi))
return -EINVAL;
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 1f15d0793a9c..8fc919dc08d0 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -296,6 +296,33 @@ static int check_extent_data_item(struct extent_buffer *leaf,
return 0;
}
+ /*
+ * For the data reloc tree, file extent items are written by
+ * relocation's own paths. The data reloc inode is created with
+ * BTRFS_INODE_NOCOMPRESS, so insert_ordered_extent_file_extent()
+ * always leaves the compression field at 0. Encryption and
+ * other_encoding are reserved-and-zero in btrfs. A non-zero value
+ * for any of these means the leaf decoded from disk does not match
+ * what the kernel wrote, i.e. on-disk corruption.
+ *
+ * The file_extent_item's offset field is NOT a universal invariant
+ * here: partial-PREALLOC writebacks legitimately produce REG items
+ * with non-zero offset at non-boundary keys. The offset check is
+ * performed at the call site in get_new_location(), which only
+ * inspects cluster-boundary keys where offset is always 0.
+ */
+ if (unlikely(btrfs_header_owner(leaf) == BTRFS_DATA_RELOC_TREE_OBJECTID &&
+ (btrfs_file_extent_compression(leaf, fi) ||
+ btrfs_file_extent_encryption(leaf, fi) ||
+ btrfs_file_extent_other_encoding(leaf, fi)))) {
+ file_extent_err(leaf, slot,
+"invalid encoding fields for data reloc tree, compression=%u encryption=%u other_encoding=%u",
+ btrfs_file_extent_compression(leaf, fi),
+ btrfs_file_extent_encryption(leaf, fi),
+ btrfs_file_extent_other_encoding(leaf, fi));
+ return -EUCLEAN;
+ }
+
/* Regular or preallocated extent has fixed item size */
if (unlikely(item_size != sizeof(*fi))) {
file_extent_err(leaf, slot,
base-commit: 6bf684b8823552b99c86bf791b22f622934ee771
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread