From: Teng Liu <27rabbitlt@gmail.com>
To: linux-btrfs@vger.kernel.org
Cc: wqu@suse.com, dsterba@suse.cz
Subject: Re: [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker
Date: Sun, 3 May 2026 17:35:48 +0200 [thread overview]
Message-ID: <afdoNU3rQUr0pREJ@rabbitArch> (raw)
In-Reply-To: <20260427202822.278326-1-27rabbitlt@gmail.com>
On 2026-04-27 22:24, Teng Liu wrote:
> get_new_location() uses BUG_ON() to crash the kernel if the file extent
> item it looks up has any of offset, compression, encryption, or
> other_encoding set. The data reloc inode is only written by relocation's
> own paths -- insert_prealloc_file_extent() and
> insert_ordered_extent_file_extent() -- which always leave those four
> fields at 0 (the data reloc inode is created with BTRFS_INODE_NOCOMPRESS,
> and encryption/other_encoding are reserved-and-zero). Observing a
> non-zero value therefore means the leaf decoded from disk does not match
> what the kernel wrote, i.e. on-disk corruption. A malformed image can
> reach this code via balance and panic the kernel.
>
> Move the validation into tree-checker's check_extent_data_item(), where
> the constraint is enforced when the leaf is read off disk rather than
> after relocation has already started. The data reloc tree has a fixed
> root id (BTRFS_DATA_RELOC_TREE_OBJECTID) recorded in the extent buffer
> header, so check_extent_data_item() has all the information it needs to
> apply this check on its own. Report violations via file_extent_err() and
> print the four offending values.
>
> In get_new_location() replace the BUG_ON() with an ASSERT().
> The caller in replace_file_extents() already handles non-zero returns from
> get_new_location() by breaking out of the loop without aborting the
> transaction, so no caller changes are needed.
>
Hi Qu, David, and all,
I spent some time trying to understand why the tree-checker saw non-zero
offset while BUG_ON never hit. Share some observation here in case it's
helpful and also ask for further advice about how to make the patch
correct.
Short answer: I noticed that get_new_location() and the tree-checker
look at different itmes in the data reloc tree. The tree-chcker
validates every item pre-write; while get_new_location() searches only
exact cluster boundary key, so the offset is always 0.
== How I tested ==
I instrumented both check_extent_data_item() (tree-checker) and
get_new_location() with pr_info() to log ino+key_offset+offset for every
item. Then ran btrfs/061 on qemu.
== Some Evidence ==
Tree-checker hits (items with non-zero fields in the data reloc tree):
[ 63.531400] tree-checker HIT: ino=258 key_off=13303808 type=1 offset=102400
compress=0 encrypt=0 other=0
Printing out the leaf items (when tree-checker hit an item with non-zero
offset) and we found:
[ 63.540541] item 114 key (258 EXTENT_DATA 13201408) itemoff 10188 itemsize 53
[ 63.540549] generation 22 type 2
[ 63.540554] extent data disk bytenr 4566134784 nr 110592
[ 63.540558] extent data offset 0 nr 102400 ram 110592
[ 63.540613] extent compression 0
[ 63.540619] item 115 key (258 EXTENT_DATA 13303808) itemoff 10135 itemsize 53
[ 63.540644] generation 22 type 1
[ 63.540649] extent data disk bytenr 4566134784 nr 110592
[ 63.540744] extent data offset 102400 nr 8192 ram 110592
[ 63.540750] extent compression 0
Looking for the two 13201408 and 13303808 items in get_new_location:
get_new_location() search keys:
ino=258 key_off=13201408 offset=0 type=1
...
We didn't see 13303808 item at all during get_new_location.
get_new_location searched for key_off=13201408 (the cluster boundary) and
found offset=0. It never searched for 13303808.
=== Why this happens ===
During the MOVE_DATA_EXTENTS phase:
1. prealloc_file_extent_cluster() inserts one PREALLOC per cluster
boundary (via insert_prealloc_file_extent), all with offset=0.
2. relocate_one_folio() reads source data and marks pages dirty. On
writeback, ordered extents are created. When an ordered extent
completes, insert_ordered_extent_file_extent() calls
insert_reserved_file_extent(), which calls btrfs_drop_extents() to
carve out the written range from the PREALLOC, splitting it. The
right-hand piece (REG) gets the ordered extent's offset into the
shared disk_bytenr -- which can be non-zero.
3. The pre-write tree-checker fires when the modified leaf is written
to disk, seeing the REG item with offset!=0. This is the false
positive.
4. btrfs_wait_ordered_range() waits for all ordered extents to finish,
then the stage switches to UPDATE_DATA_PTRS. replace_file_extents()
calls get_new_location(src_disk_bytenr) for each source extent.
The search key is:
bytenr -= reloc_block_group_start;
btrfs_lookup_file_extent(..., ino, bytenr, 0);
This is the cluster boundary -- the start of an extent, where offset
is always 0. get_new_location never searches the mid-range keys
where the split pieces live.
== Advice Needed ==
If my analysis is correct, what do you think is the best way to fix this
problem? Or am I missing something somewhere?
Thanks in advance,
Teng Liu
next prev parent reply other threads:[~2026-05-03 15:35 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-25 6:10 [PATCH] btrfs: replace BUG_ON() with error return in get_new_location() Teng Liu
2026-04-25 8:06 ` Qu Wenruo
2026-04-26 20:16 ` [PATCH v2] " Teng Liu
2026-04-27 1:19 ` Qu Wenruo
2026-04-27 13:50 ` David Sterba
2026-04-27 20:24 ` [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker Teng Liu
2026-04-27 22:15 ` Qu Wenruo
2026-04-28 0:44 ` Qu Wenruo
2026-04-28 15:29 ` David Sterba
2026-04-28 9:03 ` Johannes Thumshirn
2026-05-03 15:35 ` Teng Liu [this message]
2026-05-03 22:36 ` Qu Wenruo
2026-05-13 11:35 ` [PATCH v4] btrfs: validate data reloc tree file extent item members Teng Liu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=afdoNU3rQUr0pREJ@rabbitArch \
--to=27rabbitlt@gmail.com \
--cc=dsterba@suse.cz \
--cc=linux-btrfs@vger.kernel.org \
--cc=wqu@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox