All of lore.kernel.org
 help / color / mirror / Atom feed
From: Teng Liu <27rabbitlt@gmail.com>
To: linux-btrfs@vger.kernel.org
Cc: wqu@suse.com, dsterba@suse.cz
Subject: Re: [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker
Date: Sun, 3 May 2026 17:35:48 +0200	[thread overview]
Message-ID: <afdoNU3rQUr0pREJ@rabbitArch> (raw)
In-Reply-To: <20260427202822.278326-1-27rabbitlt@gmail.com>

On 2026-04-27 22:24, Teng Liu wrote:
> get_new_location() uses BUG_ON() to crash the kernel if the file extent
> item it looks up has any of offset, compression, encryption, or
> other_encoding set. The data reloc inode is only written by relocation's
> own paths -- insert_prealloc_file_extent() and
> insert_ordered_extent_file_extent() -- which always leave those four
> fields at 0 (the data reloc inode is created with BTRFS_INODE_NOCOMPRESS,
> and encryption/other_encoding are reserved-and-zero). Observing a
> non-zero value therefore means the leaf decoded from disk does not match
> what the kernel wrote, i.e. on-disk corruption. A malformed image can
> reach this code via balance and panic the kernel.
> 
> Move the validation into tree-checker's check_extent_data_item(), where
> the constraint is enforced when the leaf is read off disk rather than
> after relocation has already started. The data reloc tree has a fixed
> root id (BTRFS_DATA_RELOC_TREE_OBJECTID) recorded in the extent buffer
> header, so check_extent_data_item() has all the information it needs to
> apply this check on its own. Report violations via file_extent_err() and
> print the four offending values.
> 
> In get_new_location() replace the BUG_ON() with an ASSERT().
> The caller in replace_file_extents() already handles non-zero returns from
> get_new_location() by breaking out of the loop without aborting the
> transaction, so no caller changes are needed.
> 

Hi Qu, David, and all,

I spent some time trying to understand why the tree-checker saw non-zero
offset while BUG_ON never hit. Share some observation here in case it's
helpful and also ask for further advice about how to make the patch
correct.

Short answer: I noticed that get_new_location() and the tree-checker
look at different itmes in the data reloc tree. The tree-chcker
validates every item pre-write; while get_new_location() searches only
exact cluster boundary key, so the offset is always 0.

== How I tested ==

I instrumented both check_extent_data_item() (tree-checker) and
get_new_location() with pr_info() to log ino+key_offset+offset for every
item. Then ran btrfs/061 on qemu. 

== Some Evidence ==

Tree-checker hits (items with non-zero fields in the data reloc tree):
[   63.531400] tree-checker HIT: ino=258 key_off=13303808 type=1 offset=102400
              compress=0 encrypt=0 other=0

Printing out the leaf items (when tree-checker hit an item with non-zero
offset) and we found:
[   63.540541]  item 114 key (258 EXTENT_DATA 13201408) itemoff 10188 itemsize 53
[   63.540549]          generation 22 type 2
[   63.540554]          extent data disk bytenr 4566134784 nr 110592
[   63.540558]          extent data offset 0 nr 102400 ram 110592
[   63.540613]          extent compression 0
[   63.540619]  item 115 key (258 EXTENT_DATA 13303808) itemoff 10135 itemsize 53
[   63.540644]          generation 22 type 1
[   63.540649]          extent data disk bytenr 4566134784 nr 110592
[   63.540744]          extent data offset 102400 nr 8192 ram 110592
[   63.540750]          extent compression 0

Looking for the two 13201408 and 13303808 items in get_new_location:

get_new_location() search keys:
  ino=258 key_off=13201408 offset=0 type=1
  ...

We didn't see 13303808 item at all during get_new_location.
get_new_location searched for key_off=13201408 (the cluster boundary) and
found offset=0.  It never searched for 13303808.

=== Why this happens ===

During the MOVE_DATA_EXTENTS phase:

1. prealloc_file_extent_cluster() inserts one PREALLOC per cluster
   boundary (via insert_prealloc_file_extent), all with offset=0.

2. relocate_one_folio() reads source data and marks pages dirty.  On
   writeback, ordered extents are created.  When an ordered extent
   completes, insert_ordered_extent_file_extent() calls
   insert_reserved_file_extent(), which calls btrfs_drop_extents() to
   carve out the written range from the PREALLOC, splitting it.  The
   right-hand piece (REG) gets the ordered extent's offset into the
   shared disk_bytenr -- which can be non-zero.

3. The pre-write tree-checker fires when the modified leaf is written
   to disk, seeing the REG item with offset!=0.  This is the false
   positive.

4. btrfs_wait_ordered_range() waits for all ordered extents to finish,
   then the stage switches to UPDATE_DATA_PTRS.  replace_file_extents()
   calls get_new_location(src_disk_bytenr) for each source extent.
   The search key is:

       bytenr -= reloc_block_group_start;
       btrfs_lookup_file_extent(..., ino, bytenr, 0);

   This is the cluster boundary -- the start of an extent, where offset
   is always 0.  get_new_location never searches the mid-range keys
   where the split pieces live.

== Advice Needed ==

If my analysis is correct, what do you think is the best way to fix this
problem? Or am I missing something somewhere?

Thanks in advance,
Teng Liu


  parent reply	other threads:[~2026-05-03 15:35 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-25  6:10 [PATCH] btrfs: replace BUG_ON() with error return in get_new_location() Teng Liu
2026-04-25  8:06 ` Qu Wenruo
2026-04-26 20:16 ` [PATCH v2] " Teng Liu
2026-04-27  1:19   ` Qu Wenruo
2026-04-27 13:50     ` David Sterba
2026-04-27 20:24   ` [PATCH v3] btrfs: validate data reloc tree file extent item members in tree-checker Teng Liu
2026-04-27 22:15     ` Qu Wenruo
2026-04-28  0:44       ` Qu Wenruo
2026-04-28 15:29         ` David Sterba
2026-04-28  9:03     ` Johannes Thumshirn
2026-05-03 15:35     ` Teng Liu [this message]
2026-05-03 22:36       ` Qu Wenruo
2026-05-13 11:35     ` [PATCH v4] btrfs: validate data reloc tree file extent item members Teng Liu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=afdoNU3rQUr0pREJ@rabbitArch \
    --to=27rabbitlt@gmail.com \
    --cc=dsterba@suse.cz \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.