public inbox for stable@vger.kernel.org

* [PATCH] btrfs: reject root with mismatched level between root_item and node header
@ 2026-03-12 10:22 ZhengYuan Huang
  2026-03-12 21:29 ` Qu Wenruo
  0 siblings, 1 reply; 4+ messages in thread
From: ZhengYuan Huang @ 2026-03-12 10:22 UTC
  To: dsterba, clm
  Cc: linux-btrfs, linux-kernel, baijiaju1990, r33s3n6, zzzccc427,
	ZhengYuan Huang, stable

[BUG]
A KASAN null-ptr-deref is triggered when running balance on a filesystem
with a corrupted root item:

  KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
  CPU: 1 UID: 0 PID: 347 ... Tainted: G OE  6.18.0+ #17 PREEMPT(voluntary)
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: QEMU Ubuntu 24.04 BIOS 1.16.3-debian-1.16.3-2 04/01/2014
  RIP: 0010:get_eb_folio_index fs/btrfs/extent_io.h:180 [inline]
  RIP: 0010:btrfs_get_64+0x91/0x590 fs/btrfs/accessors.c:117
  Code: 400400f3 f3f36548 8b056324 31054889
  Call Trace:
   btrfs_key_blockptr fs/btrfs/accessors.h:368 [inline]
   btrfs_node_blockptr fs/btrfs/accessors.h:380 [inline]
   handle_indirect_tree_backref fs/btrfs/backref.c:3324 [inline]
   btrfs_backref_add_tree_node+0x7a5/0x26a0 fs/btrfs/backref.c:3538
   build_backref_tree+0x11c/0xb00 fs/btrfs/relocation.c:437
   relocate_tree_blocks+0x583/0x1a30 fs/btrfs/relocation.c:2649
   relocate_block_group+0x521/0xf60 fs/btrfs/relocation.c:3584
   btrfs_relocate_block_group+0x4d8/0xde0 fs/btrfs/relocation.c:3984
   btrfs_relocate_chunk+0x133/0x620 fs/btrfs/volumes.c:3451
   __btrfs_balance fs/btrfs/volumes.c:4227 [inline]
   btrfs_balance+0x1e8b/0x42b0 fs/btrfs/volumes.c:4604
   btrfs_ioctl_balance fs/btrfs/ioctl.c:3577 [inline]
   btrfs_ioctl+0x25cf/0x5b90 fs/btrfs/ioctl.c:5313
   ...
  RIP: 0033:0x7bbaa73a75ad
  Code: ffc3662e 0f1f8400 00000000 90f30f1e fa4889f8

The bug is reproducible on 7.0.0-rc2-next-20260311 with our dynamic
metadata fuzzing tool that corrupts btrfs metadata at runtime.

[CAUSE]
The corruption consists of a single corrupted field in a root tree leaf:
the btrfs_root_item for the affected tree has its .level field set to 1,
while the actual root block on disk has header.level = 0. The root block
itself is completely intact; only the field value stored inside the root
tree leaf is wrong. The existing tree-checker validation in
check_root_item() accepts this because it only verifies that
root_item.level < BTRFS_MAX_LEVEL, and does not cross-check the value
against the root block's own header.
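
The gap described above can be illustrated with a small userspace model. This is only a sketch, not kernel code: MODEL_MAX_LEVEL and both helper names are hypothetical, and the range test merely mirrors what the description says check_root_item() verifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model constant standing in for BTRFS_MAX_LEVEL (assumed to be 8). */
#define MODEL_MAX_LEVEL 8

/* Range-only validation: looks at the root item and nothing else. */
static bool model_level_in_range(uint8_t root_item_level)
{
	return root_item_level < MODEL_MAX_LEVEL;
}

/* Cross-block validation: the item must agree with the block's header. */
static bool model_level_matches_header(uint8_t root_item_level,
				       uint8_t header_level)
{
	/* The header itself is assumed to have passed its own range check. */
	assert(header_level < MODEL_MAX_LEVEL);
	return root_item_level == header_level;
}
```

With root_item.level = 1 and header.level = 0, the range test passes while the cross-block test fails, which is the exact combination the fuzzer produced.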

The inconsistency becomes dangerous when balance calls
relocate_tree_blocks() to move a level-0 block belonging to that tree.
relocate_tree_blocks() has two sequential phases that together set the
trap:

Phase 1 -- get_tree_block_key(): reads the root block to retrieve its first
key before building the backref tree. The check level passed to
read_tree_block() here comes from the EXTENT_ITEM in the extent tree, which
correctly records level 0. The disk I/O completes,
btrfs_validate_extent_buffer() sees found_level(0) == check->level(0), and
marks the extent_buffer EXTENT_BUFFER_UPTODATE.

Phase 2 -- build_backref_tree() calls handle_indirect_tree_backref(), which
calls btrfs_get_fs_root() to open the affected tree. Inside
read_tree_root_path(), level is set from btrfs_root_level(&root->root_item),
yielding the corrupted value 1. read_tree_block() is then called with
check.level = 1 for the same bytenr. Because EXTENT_BUFFER_UPTODATE is
already set from Phase 1, read_extent_buffer_pages_nowait() returns
immediately via the cache fast path, skipping
btrfs_validate_extent_buffer() entirely. read_tree_root_path() has no
cross-check between btrfs_header_level(root->node) and the level read from
root_item, so it silently builds a btrfs_root whose root_item.level is 1
but whose commit_root has btrfs_header_level() == 0, and installs it in
the fs_roots radix tree.

Back in handle_indirect_tree_backref(), btrfs_root_level(&root->root_item)
returns 1, which does not equal cur->level(0), so the tree-root early-exit
is skipped and path->lowest_level is set to 1.
btrfs_search_slot_get_root() starts at commit_root (level 0), records it in
p->nodes[0], and returns immediately because it is already a leaf --
p->nodes[1] is never assigned and retains its kzalloc-zeroed NULL value.
eb = path->nodes[1], which is NULL, is then passed directly to
btrfs_node_blockptr(), which calls btrfs_get_64() and then
get_eb_folio_index(), where eb->folio_shift is dereferenced through the
NULL pointer, causing the crash.
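
Why path->nodes[1] stays NULL can be seen in a reduced model of the path structure. This is a sketch with hypothetical names (model_path, model_search_from_root, model_node_at), not the kernel's btrfs_path or search code; it only captures the case where the search starts at a root that is already a leaf.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_LEVEL 8 /* stands in for BTRFS_MAX_LEVEL */

struct model_node {
	int level;
};

struct model_path {
	struct model_node *nodes[MODEL_MAX_LEVEL];
	int lowest_level;
};

/*
 * Model of a search that starts at a leaf root: only the slot for the
 * root's real level is filled, every higher slot keeps its zeroed value.
 */
static void model_search_from_root(struct model_path *p,
				   struct model_node *root)
{
	assert(p != NULL && root != NULL);
	assert(root->level >= 0 && root->level < MODEL_MAX_LEVEL);
	p->nodes[root->level] = root;
}

/* Guarded lookup: callers must test the result before dereferencing it. */
static struct model_node *model_node_at(const struct model_path *p, int level)
{
	if (level < 0 || level >= MODEL_MAX_LEVEL)
		return NULL;
	return p->nodes[level];
}
```

With lowest_level set to 1 from the corrupted root item and a root whose real level is 0, model_node_at(&path, 1) returns NULL, which is the pointer the unguarded blockptr comparison dereferences.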

Note that the subsequent for() loop in handle_indirect_tree_backref()
already checks for a NULL path->nodes[level] correctly; the initial
blockptr comparison just above it was never given the same guard.

[FIX]
Catch the inconsistency in read_tree_root_path(), right after read_tree_block()
returns root->node and the generation and owner checks have passed. At that
point level = btrfs_root_level(&root->root_item) is already known, so
comparing it against btrfs_header_level(root->node) costs nothing. If they
differ, emit a btrfs_crit() message and return -EUCLEAN to prevent the
inconsistent btrfs_root object from being installed in the radix-tree cache
and reaching any caller. read_tree_root_path() is the only place that sees
both root_item.level and the actual root node simultaneously, making it the
correct and minimal location for this cross-block consistency check.
Returning -EUCLEAN is consistent with the existing owner-mismatch check
directly above and with the general btrfs policy of converting detectable
corruption into -EUCLEAN rather than crashing later.

After the fix, btrfs detects the level mismatch at root load time and
fails with -EUCLEAN instead of crashing later in
handle_indirect_tree_backref().

Cc: stable@vger.kernel.org
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
---
 fs/btrfs/disk-io.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 900e462d8ea1..06a8689cbf62 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1067,6 +1067,26 @@ static struct btrfs_root *read_tree_root_path(struct btrfs_root *tree_root,
 		ret = -EUCLEAN;
 		goto fail;
 	}
+	/*
+	 * Verify that the root node's on-disk level matches root_item.level.
+	 * These can diverge when the root item in the root tree was corrupted
+	 * (e.g. a bit flip changing level) while the actual tree block is
+	 * already cached in memory at its real level. In that case
+	 * read_tree_block() returns the cached buffer without re-running
+	 * btrfs_validate_extent_buffer(), silently bypassing the level check.
+	 * The mismatch would later cause a null-ptr-deref in backref walking
+	 * (handle_indirect_tree_backref) when the commit root's real height is
+	 * lower than what root_item.level claims.
+	 */
+	if (unlikely(btrfs_header_level(root->node) != level)) {
+		btrfs_crit(fs_info,
+           "root=%llu block=%llu, root item level mismatch: "
+           "root_item.level=%d block.level=%u",
+           btrfs_root_id(root), root->node->start,
+           level, btrfs_header_level(root->node));
+		ret = -EUCLEAN;
+		goto fail;
+	}
 	root->commit_root = btrfs_root_node(root);
 	return root;
 fail:
-- 
2.43.0


* Re: [PATCH] btrfs: reject root with mismatched level between root_item and node header
  2026-03-12 10:22 [PATCH] btrfs: reject root with mismatched level between root_item and node header ZhengYuan Huang
@ 2026-03-12 21:29 ` Qu Wenruo
  2026-03-13  2:49   ` ZhengYuan Huang
  0 siblings, 1 reply; 4+ messages in thread
From: Qu Wenruo @ 2026-03-12 21:29 UTC
  To: ZhengYuan Huang, dsterba, clm
  Cc: linux-btrfs, linux-kernel, baijiaju1990, r33s3n6, zzzccc427,
	stable



On 2026/3/12 20:52, ZhengYuan Huang wrote:
[...]
> 
> [FIX]
> Catch the inconsistency in read_tree_root_path(), right after read_tree_block()
> returns root->node and the generation and owner checks have passed. At that
> point level = btrfs_root_level(&root->root_item) is already known, so
> comparing it against btrfs_header_level(root->node) costs nothing. If they
> differ, emit a btrfs_crit() message and return -EUCLEAN to prevent the
> inconsistent btrfs_root object from being installed in the radix-tree cache
> and reaching any caller. read_tree_root_path() is the only place that sees
> both root_item.level and the actual root node simultaneously, making it the
> correct and minimal location for this cross-block consistency check.
> Returning -EUCLEAN is consistent with the existing owner-mismatch check
> directly above and with the general btrfs policy of converting detectable
> corruption into -EUCLEAN rather than crashing later.
> 
> After the fix, btrfs detects the level mismatch at root load time and
> fails with -EUCLEAN instead of crashing later in
> handle_indirect_tree_backref().
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
> ---
>   fs/btrfs/disk-io.c | 20 ++++++++++++++++++++
>   1 file changed, 20 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 900e462d8ea1..06a8689cbf62 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1067,6 +1067,26 @@ static struct btrfs_root *read_tree_root_path(struct btrfs_root *tree_root,
>   		ret = -EUCLEAN;
>   		goto fail;
>   	}
> +	/*
> +	 * Verify that the root node's on-disk level matches root_item.level.
> +	 * These can diverge when the root item in the root tree was corrupted
> +	 * (e.g. a bit flip changing level) while the actual tree block is
> +	 * already cached in memory at its real level. In that case
> +	 * read_tree_block() returns the cached buffer without re-running
> +	 * btrfs_validate_extent_buffer(), silently bypassing the level check.
> +	 * The mismatch would later cause a null-ptr-deref in backref walking
> +	 * (handle_indirect_tree_backref) when the commit root's real height is
> +	 * lower than what root_item.level claims.
> +	 */
> +	if (unlikely(btrfs_header_level(root->node) != level)) {

Nope, we have the btrfs_tree_parent_check structure, which carries all the
needed checks at read time.

The point of using it rather than checking manually here is that if one
mirror is bad but the other is good, we can still grab the good copy.
Checking here means that if we get the bad mirror first, we have no
further chance.

And when reading the root node, we have already passed the proper level
into it.

So the only possibility is that your fuzzing tool is modifying the memory
after the read check.

If so, it's impossible to fix.
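
The mirror argument can be sketched as a userspace model. All names here (model_mirror, model_read_with_check, model_read_then_check) are hypothetical and the retry loop is a simplification, not the actual btrfs read path; it only contrasts validating inside the read with validating after it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copy of one tree block on one mirror. */
struct model_mirror {
	uint8_t header_level;
};

/*
 * Check inside the read: a mirror that fails validation is skipped and
 * the next one is tried. Returns the index of the mirror used, or -1.
 */
static int model_read_with_check(const struct model_mirror *mirrors, size_t n,
				 uint8_t expected_level)
{
	assert(mirrors != NULL);
	for (size_t i = 0; i < n; i++) {
		if (mirrors[i].header_level == expected_level)
			return (int)i;
	}
	return -1;
}

/*
 * Check after the read: the first mirror is accepted unconditionally,
 * so a later mismatch is final. Returns 0 if it matched, -1 otherwise.
 */
static int model_read_then_check(const struct model_mirror *mirrors, size_t n,
				 uint8_t expected_level)
{
	assert(mirrors != NULL);
	if (n == 0)
		return -1;
	return mirrors[0].header_level == expected_level ? 0 : -1;
}
```

When the first mirror is bad and the second is good, the first variant recovers by returning index 1, while the second variant fails without ever looking at the good copy.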

> +		btrfs_crit(fs_info,
> +           "root=%llu block=%llu, root item level mismatch: "
> +           "root_item.level=%d block.level=%u",
> +           btrfs_root_id(root), root->node->start,
> +           level, btrfs_header_level(root->node));
> +		ret = -EUCLEAN;
> +		goto fail;
> +	}
>   	root->commit_root = btrfs_root_node(root);
>   	return root;
>   fail:



* Re: [PATCH] btrfs: reject root with mismatched level between root_item and node header
  2026-03-12 21:29 ` Qu Wenruo
@ 2026-03-13  2:49   ` ZhengYuan Huang
  2026-03-13  3:09     ` Qu Wenruo
  0 siblings, 1 reply; 4+ messages in thread
From: ZhengYuan Huang @ 2026-03-13  2:49 UTC
  To: Qu Wenruo
  Cc: dsterba, clm, linux-btrfs, linux-kernel, baijiaju1990, r33s3n6,
	zzzccc427, stable

On Fri, Mar 13, 2026 at 5:29 AM Qu Wenruo <wqu@suse.com> wrote:
> Nope, we have the btrfs_tree_parent_check structure, which carries all the
> needed checks at read time.
>
> The point of using it rather than checking manually here is that if one
> mirror is bad but the other is good, we can still grab the good copy.
> Checking here means that if we get the bad mirror first, we have no
> further chance.
>
> And when reading the root node, we have already passed the proper level
> into it.
>
> So the only possibility is that your fuzzing tool is modifying the memory
> after the read check.
>
> If so, it's impossible to fix.

Thanks for the review and for pointing this out.

I agree that btrfs_tree_parent_check is the intended read-time
verifier, but the crash path here relies on a cache-hit bypass where
that verification is not re-run.

My earlier description may have been misleading, or at least not clear
enough, so let me clarify the exact trigger path in more detail below.

Two different metadata blocks are involved:
- Block A: a root-tree leaf containing the root_item for tree 265 (its
level field is corrupted: root_item.level = 1)
  item 11 key (265 ROOT_ITEM 0) itemoff 13489 itemsize 439
      generation 4255 root_dirid 256 bytenr 18787663872 level 1 refs 1
      lastsnap 4214 byte_limit 0 bytes_used 16384 flags 0x0(none)
      uuid 4cc4bc58-9708-2848-a264-19b95269f104
      ctransid 13 otransid 13 stransid 0 rtransid 0
      ctime 1766050670.362764444 (2025-12-18 09:37:50)
      otime 1766050670.362000000 (2025-12-18 09:37:50)
      drop key (0 UNKNOWN.0 0) level 0

- Block B: the actual tree-265 root block at bytenr 18787663872
(header.level = 0, otherwise valid)
    item 57 key (18787663872 METADATA_ITEM 0) itemoff 14198 itemsize 33
      refs 1 gen 4255 flags TREE_BLOCK
      tree block skinny level 0
      tree block backref root 265

In relocate_tree_blocks phase 1 (get_tree_block_key), block B is read
with check.level = block->level = 0 (from the extent-tree metadata for
that extent item).
This I/O path runs btrfs_validate_extent_buffer, and the level check
passes (found 0, expected 0).
So block B becomes EXTENT_BUFFER_UPTODATE.

In phase 2 (build_backref_tree -> handle_indirect_tree_backref ->
btrfs_get_fs_root -> read_tree_root_path), level is taken from
root_item in block A via btrfs_root_level, so expected level becomes 1
(corrupt value).
Then read_tree_block is called for the same bytenr (block B), but now
it hits EXTENT_BUFFER_UPTODATE and returns from
read_extent_buffer_pages_nowait early.

On that cache-hit path, btrfs_validate_extent_buffer is not executed
again, so no level mismatch check occurs for expected=1 vs actual=0.

Because no read error is returned on cache hit, mirror retry logic is
never entered. So this is not a “bad mirror first, good mirror later”
case: there is no second mirror attempt because the read already
succeeded from cache.

read_tree_root_path then builds an inconsistent root object:
  - root->root_item.level = 1 (from block A)
  - root->node/commit_root level = 0 (from cached block B)

handle_indirect_tree_backref computes level = cur->level + 1 = 1,
searches commit_root (actual level 0), path->nodes[1] remains NULL,
and btrfs_node_blockptr(NULL, ...) crashes. So the issue is a
cross-block consistency gap at root construction time, not post-read
memory corruption by the fuzzer.

That is why the fix in read_tree_root_path (checking
btrfs_header_level(root->node) == btrfs_root_level(&root->root_item))
is needed even with btrfs_tree_parent_check in place.

To clarify, our fuzzing tool does not perform any in-memory
modification during testing. In fact, this bug is not caused by memory
corruption at all; it is triggered entirely by corrupted on-disk
metadata together with a cache-hit path that skips re-validation of
the root block. I have also uploaded the reproduction script to
https://drive.google.com/drive/folders/1BPXcgVI4DLzDcufNyynOakKD4EKnfVCg.

To reproduce the issue:
1. Build the PoC program:
     gcc repro.c -o poc
2. Build the ublk helper program from the ublk codebase, which is
   used to provide the runtime corruption capability:
     g++ -std=c++20 -fcoroutines -O2 -o standalone_replay \
       standalone_replay_btrfs.cpp targets/ublksrv_tgt.cpp \
       -I. -Iinclude -Itargets/include \
       -L./lib/.libs -lublksrv -luring -lpthread
3. Attach the crafted image through ublk:
     ./standalone_replay add -t loop -f /path/to/image
4. Run the PoC:
     ./poc

This reliably reproduces the bug.

Thanks,
ZhengYuan Huang


* Re: [PATCH] btrfs: reject root with mismatched level between root_item and node header
  2026-03-13  2:49   ` ZhengYuan Huang
@ 2026-03-13  3:09     ` Qu Wenruo
  0 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2026-03-13  3:09 UTC
  To: ZhengYuan Huang
  Cc: dsterba, clm, linux-btrfs, linux-kernel, baijiaju1990, r33s3n6,
	zzzccc427, stable



On 2026/3/13 13:19, ZhengYuan Huang wrote:
> On Fri, Mar 13, 2026 at 5:29 AM Qu Wenruo <wqu@suse.com> wrote:
>> Nope, we have the btrfs_tree_parent_check structure, which carries all the
>> needed checks at read time.
>>
>> The point of using it rather than checking manually here is that if one
>> mirror is bad but the other is good, we can still grab the good copy.
>> Checking here means that if we get the bad mirror first, we have no
>> further chance.
>>
>> And when reading the root node, we have already passed the proper level
>> into it.
>>
>> So the only possibility is that your fuzzing tool is modifying the memory
>> after the read check.
>>
>> If so, it's impossible to fix.
> 
> Thanks for the review and for pointing this out.
> 
> I agree that btrfs_tree_parent_check is the intended read-time
> verifier, but the crash path here relies on a cache-hit bypass where
> that verification is not re-run.
> 
> My earlier description may have been misleading, or at least not clear
> enough, so let me clarify the exact trigger path in more detail below.

My bad, I forgot to mention the correct way to fix this: put the check
into the cache-hit path rather than adding ad hoc checks elsewhere.

The correct way is to add an optional @check parameter to
btrfs_buffer_uptodate() so that a cached extent buffer is still checked.
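
The suggested direction can be sketched as a userspace model. This is not the kernel's btrfs_buffer_uptodate() and makes no claim about its real signature: model_buffer_uptodate(), struct cached_eb and struct level_check are hypothetical names, and only the level is checked where the real structure carries more fields.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* fallback for non-Linux hosts; value as on Linux */
#endif

/* Hypothetical cached buffer and a level-only parent check. */
struct cached_eb {
	uint8_t header_level;
	bool uptodate;
};

struct level_check {
	uint8_t level;
};

/*
 * Model of an uptodate test with an optional check argument.
 * Returns 0 if the buffer is not cached (caller must do I/O),
 * 1 if it is cached and acceptable, -EUCLEAN if it is cached but
 * contradicts the caller's expectation. A NULL check keeps the old
 * behaviour of trusting the cached buffer.
 */
static int model_buffer_uptodate(const struct cached_eb *eb,
				 const struct level_check *check)
{
	assert(eb != NULL);

	if (!eb->uptodate)
		return 0;
	if (check != NULL && eb->header_level != check->level)
		return -EUCLEAN;
	return 1;
}
```

In this model the cache hit with expected level 1 against a cached level-0 block is rejected at the same place where validation was previously skipped, so no separate check is needed at root construction time.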



