Linux Btrfs filesystem development
 help / color / mirror / Atom feed
From: Teng Liu <27rabbitlt@gmail.com>
To: linux-btrfs@vger.kernel.org
Cc: wqu@suse.com, dsterba@suse.cz
Subject: Re: [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker
Date: Sun, 3 May 2026 17:35:48 +0200	[thread overview]
Message-ID: <afdoNU3rQUr0pREJ@rabbitArch> (raw)
In-Reply-To: <20260427202822.278326-1-27rabbitlt@gmail.com>

On 2026-04-27 22:24, Teng Liu wrote:
> get_new_location() uses BUG_ON() to crash the kernel if the file extent
> item it looks up has any of offset, compression, encryption, or
> other_encoding set. The data reloc inode is only written by relocation's
> own paths -- insert_prealloc_file_extent() and
> insert_ordered_extent_file_extent() -- which always leave those four
> fields at 0 (the data reloc inode is created with BTRFS_INODE_NOCOMPRESS,
> and encryption/other_encoding are reserved-and-zero). Observing a
> non-zero value therefore means the leaf decoded from disk does not match
> what the kernel wrote, i.e. on-disk corruption. A malformed image can
> reach this code via balance and panic the kernel.
> 
> Move the validation into tree-checker's check_extent_data_item(), where
> the constraint is enforced when the leaf is read off disk rather than
> after relocation has already started. The data reloc tree has a fixed
> root id (BTRFS_DATA_RELOC_TREE_OBJECTID) recorded in the extent buffer
> header, so check_extent_data_item() has all the information it needs to
> apply this check on its own. Report violations via file_extent_err() and
> print the four offending values.
> 
> In get_new_location() replace the BUG_ON() with an ASSERT().
> The caller in replace_file_extents() already handles non-zero returns from
> get_new_location() by breaking out of the loop without aborting the
> transaction, so no caller changes are needed.
> 

Hi Qu, David, and all,

I spent some time trying to understand why the tree-checker saw non-zero
offset while BUG_ON never hit. Share some observation here in case it's
helpful and also ask for further advice about how to make the patch
correct.

Short answer: I noticed that get_new_location() and the tree-checker
look at different itmes in the data reloc tree. The tree-chcker
validates every item pre-write; while get_new_location() searches only
exact cluster boundary key, so the offset is always 0.

== How I tested ==

I instrumented both check_extent_data_item() (tree-checker) and
get_new_location() with pr_info() to log ino+key_offset+offset for every
item. Then ran btrfs/061 on qemu. 

== Some Evidence ==

Tree-checker hits (items with non-zero fields in the data reloc tree):
[   63.531400] tree-checker HIT: ino=258 key_off=13303808 type=1 offset=102400
              compress=0 encrypt=0 other=0

Printing out the leaf items (when tree-checker hit an item with non-zero
offset) and we found:
[   63.540541]  item 114 key (258 EXTENT_DATA 13201408) itemoff 10188 itemsize 53
[   63.540549]          generation 22 type 2
[   63.540554]          extent data disk bytenr 4566134784 nr 110592
[   63.540558]          extent data offset 0 nr 102400 ram 110592
[   63.540613]          extent compression 0
[   63.540619]  item 115 key (258 EXTENT_DATA 13303808) itemoff 10135 itemsize 53
[   63.540644]          generation 22 type 1
[   63.540649]          extent data disk bytenr 4566134784 nr 110592
[   63.540744]          extent data offset 102400 nr 8192 ram 110592
[   63.540750]          extent compression 0

Looking for the two 13201408 and 13303808 items in get_new_location:

get_new_location() search keys:
  ino=258 key_off=13201408 offset=0 type=1
  ...

We didn't see 13303808 item at all during get_new_location.
get_new_location searched for key_off=13201408 (the cluster boundary) and
found offset=0.  It never searched for 13303808.

=== Why this happens ===

During the MOVE_DATA_EXTENTS phase:

1. prealloc_file_extent_cluster() inserts one PREALLOC per cluster
   boundary (via insert_prealloc_file_extent), all with offset=0.

2. relocate_one_folio() reads source data and marks pages dirty.  On
   writeback, ordered extents are created.  When an ordered extent
   completes, insert_ordered_extent_file_extent() calls
   insert_reserved_file_extent(), which calls btrfs_drop_extents() to
   carve out the written range from the PREALLOC, splitting it.  The
   right-hand piece (REG) gets the ordered extent's offset into the
   shared disk_bytenr -- which can be non-zero.

3. The pre-write tree-checker fires when the modified leaf is written
   to disk, seeing the REG item with offset!=0.  This is the false
   positive.

4. btrfs_wait_ordered_range() waits for all ordered extents to finish,
   then the stage switches to UPDATE_DATA_PTRS.  replace_file_extents()
   calls get_new_location(src_disk_bytenr) for each source extent.
   The search key is:

       bytenr -= reloc_block_group_start;
       btrfs_lookup_file_extent(..., ino, bytenr, 0);

   This is the cluster boundary -- the start of an extent, where offset
   is always 0.  get_new_location never searches the mid-range keys
   where the split pieces live.

== Advice Needed ==

If my analysis is correct, what do you think is the best way to fix this
problem? Or am I missing something somewhere?

Thanks in advance,
Teng Liu


  parent reply	other threads:[~2026-05-03 15:35 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-25  6:10 [PATCH] btrfs: replace BUG_ON() with error return in get_new_location() Teng Liu
2026-04-25  8:06 ` Qu Wenruo
2026-04-26 20:16 ` [PATCH v2] " Teng Liu
2026-04-27  1:19   ` Qu Wenruo
2026-04-27 13:50     ` David Sterba
2026-04-27 20:24   ` [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker Teng Liu
2026-04-27 22:15     ` Qu Wenruo
2026-04-28  0:44       ` Qu Wenruo
2026-04-28 15:29         ` David Sterba
2026-04-28  9:03     ` Johannes Thumshirn
2026-05-03 15:35     ` Teng Liu [this message]
2026-05-03 22:36       ` Qu Wenruo
2026-05-13 11:35     ` [PATCH v4] btrfs: validate data reloc tree file extent item members Teng Liu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=afdoNU3rQUr0pREJ@rabbitArch \
    --to=27rabbitlt@gmail.com \
    --cc=dsterba@suse.cz \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox