public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH AUTOSEL 6.19-6.12] btrfs: reserve enough transaction items for qgroup ioctls
       [not found] <20260324111931.3257972-1-sashal@kernel.org>
@ 2026-03-24 11:19 ` Sasha Levin
  2026-03-24 11:19 ` [PATCH AUTOSEL 6.19-5.15] btrfs: reject root items with drop_progress and zero drop_level Sasha Levin
  2026-03-24 11:19 ` [PATCH AUTOSEL 6.19-5.15] btrfs: don't take device_list_mutex when querying zone info Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2026-03-24 11:19 UTC (permalink / raw)
  To: patches, stable
  Cc: Filipe Manana, Boris Burkov, Qu Wenruo, David Sterba, Sasha Levin,
	clm, linux-btrfs, linux-kernel

From: Filipe Manana <fdmanana@suse.com>

[ Upstream commit f9a4e3015db1aeafbef407650eb8555445ca943e ]

Currently our qgroup ioctls don't reserve any space, they just do a
transaction join, which does not reserve any space, neither for the quota
tree updates nor for the delayed refs generated when updating the quota
tree. The quota root uses the global block reserve, which is fine most of
the time since we don't expect a lot of updates to the quota root, or to
be too close to -ENOSPC such that other critical metadata updates need to
resort to the global reserve.

However this is not optimal, as not reserving proper space may result in a
transaction abort due to not reserving space for delayed refs and then
abusing the use of the global block reserve.

For example, the following reproducer (which is unlikely to model any
real world use case, but just to illustrate the problem), triggers such a
transaction abort due to -ENOSPC when running delayed refs:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0

  umount $DEV &> /dev/null
  # Limit device to 1G so that it's much faster to reproduce the issue.
  mkfs.btrfs -f -b 1G $DEV
  mount -o commit=600 $DEV $MNT

  fallocate -l 800M $MNT/filler
  btrfs quota enable $MNT

  for ((i = 1; i <= 400000; i++)); do
      btrfs qgroup create 1/$i $MNT
  done

  umount $MNT

When running this, we can see in dmesg/syslog that a transaction abort
happened:

  [436.490] BTRFS error (device nullb0): failed to run delayed ref for logical 30408704 num_bytes 16384 type 176 action 1 ref_mod 1: -28
  [436.493] ------------[ cut here ]------------
  [436.494] BTRFS: Transaction aborted (error -28)
  [436.495] WARNING: fs/btrfs/extent-tree.c:2247 at btrfs_run_delayed_refs+0xd9/0x110 [btrfs], CPU#4: umount/2495372
  [436.497] Modules linked in: btrfs loop (...)
  [436.508] CPU: 4 UID: 0 PID: 2495372 Comm: umount Tainted: G        W           6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full)
  [436.510] Tainted: [W]=WARN
  [436.511] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
  [436.513] RIP: 0010:btrfs_run_delayed_refs+0xdf/0x110 [btrfs]
  [436.514] Code: 0f 82 ea (...)
  [436.518] RSP: 0018:ffffd511850b7d78 EFLAGS: 00010292
  [436.519] RAX: 00000000ffffffe4 RBX: ffff8f120dad37e0 RCX: 0000000002040001
  [436.520] RDX: 0000000000000002 RSI: 00000000ffffffe4 RDI: ffffffffc090fd80
  [436.522] RBP: 0000000000000000 R08: 0000000000000001 R09: ffffffffc04d1867
  [436.523] R10: ffff8f18dc1fffa8 R11: 0000000000000003 R12: ffff8f173aa89400
  [436.524] R13: 0000000000000000 R14: ffff8f173aa89400 R15: 0000000000000000
  [436.526] FS:  00007fe59045d840(0000) GS:ffff8f192e22e000(0000) knlGS:0000000000000000
  [436.527] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [436.528] CR2: 00007fe5905ff2b0 CR3: 000000060710a002 CR4: 0000000000370ef0
  [436.530] Call Trace:
  [436.530]  <TASK>
  [436.530]  btrfs_commit_transaction+0x73/0xc00 [btrfs]
  [436.531]  ? btrfs_attach_transaction_barrier+0x1e/0x70 [btrfs]
  [436.532]  sync_filesystem+0x7a/0x90
  [436.533]  generic_shutdown_super+0x28/0x180
  [436.533]  kill_anon_super+0x12/0x40
  [436.534]  btrfs_kill_super+0x12/0x20 [btrfs]
  [436.534]  deactivate_locked_super+0x2f/0xb0
  [436.534]  cleanup_mnt+0xea/0x180
  [436.535]  task_work_run+0x58/0xa0
  [436.535]  exit_to_user_mode_loop+0xed/0x480
  [436.536]  ? __x64_sys_umount+0x68/0x80
  [436.536]  do_syscall_64+0x2a5/0xf20
  [436.537]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [436.537] RIP: 0033:0x7fe5906b6217
  [436.538] Code: 0d 00 f7 (...)
  [436.540] RSP: 002b:00007ffcd87a61f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  [436.541] RAX: 0000000000000000 RBX: 00005618b9ecadc8 RCX: 00007fe5906b6217
  [436.541] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005618b9ecb100
  [436.542] RBP: 0000000000000000 R08: 00007ffcd87a4fe0 R09: 00000000ffffffff
  [436.544] R10: 0000000000000103 R11: 0000000000000246 R12: 00007fe59081626c
  [436.544] R13: 00005618b9ecb100 R14: 0000000000000000 R15: 00005618b9ecacc0
  [436.545]  </TASK>
  [436.545] ---[ end trace 0000000000000000 ]---

Fix this by changing the qgroup ioctls to use start transaction instead of
joining so that proper space is reserved for the delayed refs generated
for the updates to the quota root. This way we don't get any transaction
abort.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have enough information for a thorough analysis. Let me compile
the findings.

---

## PHASE 1: COMMIT MESSAGE FORENSICS

**Step 1.1: Subject Line**
- Subsystem: `btrfs` (filesystem)
- Action: "reserve enough" — implies fixing insufficient space
  reservation
- Summary: Qgroup ioctls don't reserve transaction space, leading to
  transaction aborts.

Record: [btrfs] [reserve/fix] [Fix insufficient transaction space
reservation in qgroup ioctls]

**Step 1.2: Tags**
- `Reviewed-by: Boris Burkov <boris@bur.io>` — btrfs developer
- `Reviewed-by: Qu Wenruo <wqu@suse.com>` — prolific btrfs
  maintainer/developer
- `Signed-off-by: Filipe Manana <fdmanana@suse.com>` — author, very
  active btrfs maintainer
- `Signed-off-by: David Sterba <dsterba@suse.com>` — btrfs tree
  maintainer

Record: Two reviews from respected btrfs developers, authored by a top
btrfs contributor, committed by the subsystem maintainer. No Fixes: tag,
no Cc: stable (expected for review candidates).

**Step 1.3: Commit Body Analysis**
- Bug: Qgroup ioctls use `btrfs_join_transaction()` which reserves 0
  metadata space items, relying entirely on the global block reserve.
- Symptom: Transaction abort with -ENOSPC (-28) when running delayed
  refs during transaction commit.
- Reproducer provided: creating 400,000 qgroups on a 1G filesystem
  triggers the abort.
- Stack trace included showing `btrfs_run_delayed_refs+0xd9/0x110` →
  `btrfs_commit_transaction` → `sync_filesystem` →
  `generic_shutdown_super` (during umount).
- The error message: "failed to run delayed ref for logical 30408704
  num_bytes 16384 type 176 action 1 ref_mod 1: -28"

Record: Transaction abort (-ENOSPC) during delayed ref processing. While
the reproducer is synthetic (400K qgroups), the underlying bug is real —
any scenario where global block reserve is insufficient and qgroup
operations are happening can trigger this. The fix is to properly
reserve space upfront.

**Step 1.4: Hidden Bug Fix Detection**
This is clearly a bug fix despite not having "fix" in the subject. It
prevents transaction aborts (which are critical — they make the
filesystem read-only and can lead to data loss). The commit message
explicitly says "Fix this by..." and provides a reproducer with a stack
trace.

Record: Definite bug fix. Transaction aborts are severe — they force the
filesystem read-only.

## PHASE 2: DIFF ANALYSIS

**Step 2.1: Inventory**
- 1 file changed: `fs/btrfs/ioctl.c`
- 3 hunks, each changing 1 line + adding 1-3 comment lines
- Functions modified: `btrfs_ioctl_qgroup_assign`,
  `btrfs_ioctl_qgroup_create`, `btrfs_ioctl_qgroup_limit`
- Total: ~6 lines added, 3 lines removed (net +3 lines)
- Scope: Single-file, surgical fix

Record: Minimal change, 3 substitutions of
`btrfs_join_transaction(root)` → `btrfs_start_transaction(root, N)` with
explanatory comments.

**Step 2.2: Code Flow Change**
- Hunk 1 (`qgroup_assign`): `btrfs_join_transaction(root)` →
  `btrfs_start_transaction(root, 2)` — reserves space for 2
  BTRFS_QGROUP_RELATION_KEY items
- Hunk 2 (`qgroup_create`): `btrfs_join_transaction(root)` →
  `btrfs_start_transaction(root, 2)` — reserves for 1 INFO + 1 LIMIT key
- Hunk 3 (`qgroup_limit`): `btrfs_join_transaction(root)` →
  `btrfs_start_transaction(root, 1)` — reserves for 1 LIMIT key

Before: No space reserved, relying on global block reserve (a shared
emergency pool).
After: Proper per-operation space reservation, ensuring delayed refs
have space.

**Step 2.3: Bug Mechanism**
Category: Resource exhaustion / incorrect space reservation →
transaction abort.
Mechanism: `btrfs_join_transaction()` passes `num_items=0` to
`start_transaction()`, meaning no metadata space is reserved. When the
quota tree is updated, delayed refs are generated. These delayed refs
need space to run. Without reservation, they fall back to the global
block reserve, which can be exhausted, causing -ENOSPC transaction
aborts.

**Step 2.4: Fix Quality**
- Obviously correct: `btrfs_start_transaction()` is the standard API for
  reserving transaction space, used throughout btrfs.
- Minimal and surgical: only the transaction start calls are changed.
- Regression risk: Very low. `btrfs_start_transaction` may block waiting
  for space (unlike `join`), but this is the correct behavior for ioctls
  that modify metadata. The caller is already in a context that can
  block (ioctl handler, not in atomic context).
- The item counts (2, 2, 1) match the number of quota tree items each
  operation modifies, per the comments.

Record: Fix is obviously correct, minimal, and low-risk.

## PHASE 3: GIT HISTORY INVESTIGATION

**Step 3.1: Blame Analysis**
The `btrfs_join_transaction(root)` calls in all three functions were
introduced in commit `5d13a37bd53272` by Arne Jansen, dated 2011-09-14,
with subject "Btrfs: add qgroup ioctls". This was in kernel 3.x era. The
bug has existed since the qgroup feature was first introduced — it
exists in ALL stable kernel trees.

Record: Bug present since 2011 (kernel ~3.0), exists in all active
stable trees.

**Step 3.2: Fixes Tag**
No Fixes: tag present (expected for review candidates).

**Step 3.3: File History**
Recent changes to `fs/btrfs/ioctl.c` show active development (146
commits since v6.6). Related qgroup changes include:
- `53a4acbfc1de8` - memory leak fix in qgroup assign (Sep 2025)
- `4addc1ffd67ad` - preallocate memory for qgroup relation (May 2024)
- `a943812bfffb3` - use btrfs_qgroup_enabled() in ioctls (Jul 2025)

These are context-only changes that don't affect the core fix. The
actual `join→start` substitution is independent.

**Step 3.4: Author**
Filipe Manana is one of the most prolific btrfs developers with hundreds
of commits. He is essentially a co-maintainer of btrfs. His fixes are
high quality.

**Step 3.5: Dependencies**
The patch itself (changing `join` to `start`) has no functional
dependencies. However, the diff context lines reference code from recent
commits (`btrfs_qgroup_enabled()`, `prealloc`, `btrfs_is_fstree()`). For
older stable trees, the patch would need minor context adaptation but
the actual change is trivially portable — it's just replacing one
function call with another.

Record: Standalone fix, no functional dependencies. Context may need
minor adaptation for older trees.

## PHASE 4: MAILING LIST RESEARCH

**Step 4.1: Lore Search**
Found the patch submission on linux-btrfs dated 2026-02-17 with
responses from Boris Burkov and Qu Wenruo (both gave Reviewed-by). The
patch was included in a "Btrfs fixes for 7.0-rc5" pull request by David
Sterba, confirming it was treated as a fix for the current cycle.

**Step 4.2: Bug Report**
No external bug reports referenced — the author found this through code
analysis and created a reproducer.

Record: Patch was reviewed positively by two experienced developers and
pulled as a fix.

## PHASE 5: CODE SEMANTIC ANALYSIS

**Step 5.1: Functions Modified**
- `btrfs_ioctl_qgroup_assign` — handles BTRFS_IOC_QGROUP_ASSIGN ioctl
- `btrfs_ioctl_qgroup_create` — handles BTRFS_IOC_QGROUP_CREATE ioctl
- `btrfs_ioctl_qgroup_limit` — handles BTRFS_IOC_QGROUP_LIMIT ioctl

**Step 5.2: Callers**
All three are called from `btrfs_ioctl()` switch statement — triggered
directly by userspace via `ioctl()` syscall. All require
`CAP_SYS_ADMIN`, so not exploitable by unprivileged users.

**Step 5.3-5.4: Call Chain**
userspace `ioctl()` → `btrfs_ioctl()` →
`btrfs_ioctl_qgroup_{assign,create,limit}()` → transaction → quota tree
updates → delayed refs → transaction commit. The abort happens during
`btrfs_commit_transaction()` → `btrfs_run_delayed_refs()`.

**Step 5.5: Similar Patterns**
Other btrfs code properly uses `btrfs_start_transaction()` with the
correct item count. The qgroup ioctls were an outlier using
`btrfs_join_transaction()`.

## PHASE 6: STABLE TREE ANALYSIS

**Step 6.1: Code Existence**
The buggy code (`btrfs_join_transaction` in qgroup ioctls) has existed
since 2011. It is present in ALL active stable trees (6.12, 6.6, 6.1,
5.15, etc.).

**Step 6.2: Backport Complications**
- The surrounding context differs in older stable trees due to recent
  refactoring (e.g., `btrfs_qgroup_enabled()`, `prealloc` code).
- However, the core change (replacing `btrfs_join_transaction(root)`
  with `btrfs_start_transaction(root, N)`) is trivially adaptable — just
  find the `btrfs_join_transaction` call in each function and replace
  it.
- `btrfs_start_transaction()` has existed since early btrfs days, so the
  target API is available in all stable trees.

Record: May need minor context adaptation but the fix itself is
trivially portable.

**Step 6.3: Related Fixes**
No related fix for this specific issue has been applied to stable.

## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT

**Step 7.1: Subsystem**
- btrfs filesystem (fs/btrfs/) — IMPORTANT criticality level
- Widely used filesystem, especially in enterprise (SUSE,
  Facebook/Meta), containers, and NAS systems

**Step 7.2: Activity**
Very active subsystem with continuous development.

## PHASE 8: IMPACT AND RISK ASSESSMENT

**Step 8.1: Affected Users**
All btrfs users who use qgroup functionality (quotas). Qgroups are used
in container environments, multi-tenant storage, and by tools like
Snapper (SUSE default backup tool).

**Step 8.2: Trigger Conditions**
- Requires creating/modifying many qgroups when filesystem is near
  ENOSPC
- Requires CAP_SYS_ADMIN (root)
- More likely on smaller filesystems or heavily utilized ones
- While the reproducer uses 400K qgroups, the real threshold depends on
  available global reserve space

Record: Moderate likelihood trigger — needs filesystem to be near
capacity while doing many qgroup operations. Common in container
environments with quota management.

**Step 8.3: Failure Mode**
- Transaction abort → filesystem goes READ-ONLY → potential data loss,
  system disruption
- Severity: HIGH to CRITICAL (transaction aborts are one of the most
  severe btrfs failures)

**Step 8.4: Risk-Benefit Ratio**
- BENEFIT: Prevents transaction aborts (filesystem going read-only)
  during qgroup operations
- RISK: Very low — the change is 3 simple function call substitutions,
  `btrfs_start_transaction` is the standard correct API, and the item
  counts are well-justified
- The only behavioral change is that the ioctls may now wait for space
  to be available instead of blindly proceeding — this is the correct
  behavior

Record: High benefit, very low risk. Excellent ratio.

## PHASE 9: FINAL SYNTHESIS

**Step 9.1: Evidence Compilation**

FOR backporting:
- Fixes a real bug: transaction abort (-ENOSPC) causing filesystem to go
  read-only
- Bug has existed since 2011 — affects all stable trees
- Fix is minimal: 3 function call substitutions (+3 comment lines)
- Obviously correct: uses the standard btrfs transaction API
- Reviewed by two experienced btrfs developers
- Authored by a top btrfs maintainer
- Includes reproducer demonstrating the bug
- Low regression risk

AGAINST backporting:
- Reproducer is synthetic (400K qgroups on 1G filesystem)
- Requires near-ENOSPC conditions to trigger in practice
- Minor context conflicts expected in older stable trees (but trivially
  resolvable)
- No Fixes: tag (expected, not a negative)

**Step 9.2: Stable Rules Checklist**
1. Obviously correct and tested? **YES** — reviewed by 2 developers,
   includes reproducer
2. Fixes a real bug? **YES** — transaction abort causing filesystem
   read-only
3. Important issue? **YES** — transaction abort is severe (data loss
   risk)
4. Small and contained? **YES** — 3 lines changed in 1 file
5. No new features? **YES** — no new features or APIs
6. Can apply to stable? **YES** — may need minor context adaptation

**Step 9.3: Exception Categories**
Not an exception category — this is a standard bug fix.

**Step 9.4: Decision**
This is a clear, well-reviewed, minimal bug fix from a top btrfs
maintainer that prevents transaction aborts (a severe failure mode) in
qgroup ioctls. The fix is obviously correct and uses the standard
transaction API. It meets all stable kernel criteria.

## Verification

- [Phase 1] Parsed tags: Reviewed-by from Boris Burkov and Qu Wenruo,
  authored by Filipe Manana, committed by David Sterba
- [Phase 1] Commit body: includes full stack trace, reproducer script,
  and detailed explanation of -ENOSPC transaction abort
- [Phase 2] Diff analysis: 3 substitutions of
  `btrfs_join_transaction(root)` → `btrfs_start_transaction(root, N)` in
  fs/btrfs/ioctl.c, +6/-3 lines
- [Phase 2] Verified `btrfs_join_transaction` passes `num_items=0`
  (transaction.c:831-834) while `btrfs_start_transaction` reserves space
  (transaction.c:816-821)
- [Phase 3] git blame: all three `btrfs_join_transaction` calls traced
  to commit 5d13a37bd53272 (2011-09-14, "Btrfs: add qgroup ioctls") —
  bug present since qgroup feature inception
- [Phase 3] git show 5d13a37bd53272: confirmed this is the original
  commit introducing qgroup ioctls
- [Phase 3] Author check: Filipe Manana has 10+ recent btrfs commits, is
  a primary btrfs developer
- [Phase 4] Lore search: found patch dated 2026-02-17, included in
  "Btrfs fixes for 7.0-rc5" pull
- [Phase 5] Callers: all three functions called from btrfs_ioctl()
  switch statement (ioctl dispatch)
- [Phase 6] Code exists in all stable trees — `btrfs_join_transaction`
  in qgroup ioctls present since 2011
- [Phase 6] `btrfs_start_transaction()` API verified available in all
  stable trees (defined in transaction.c since early btrfs)
- [Phase 6] Context differences: `btrfs_qgroup_enabled()` (Jul 2025),
  `prealloc` code (May 2024) may cause minor conflicts in older trees,
  but core fix is portable
- [Phase 7] Subsystem: btrfs filesystem, IMPORTANT criticality — used
  widely in enterprise and containers
- [Phase 8] Failure mode: transaction abort → filesystem goes read-only,
  severity HIGH/CRITICAL
- [Phase 8] Risk: Very low — standard API substitution, well-justified
  item counts, reviewed by domain experts

**YES**

 fs/btrfs/ioctl.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 1a5d98811f2b2..0a7d3a9dedfb0 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3690,7 +3690,8 @@ static long btrfs_ioctl_qgroup_assign(struct file *file, void __user *arg)
 		}
 	}
 
-	trans = btrfs_join_transaction(root);
+	/* 2 BTRFS_QGROUP_RELATION_KEY items. */
+	trans = btrfs_start_transaction(root, 2);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
 		goto out;
@@ -3762,7 +3763,11 @@ static long btrfs_ioctl_qgroup_create(struct file *file, void __user *arg)
 		goto out;
 	}
 
-	trans = btrfs_join_transaction(root);
+	/*
+	 * 1 BTRFS_QGROUP_INFO_KEY item.
+	 * 1 BTRFS_QGROUP_LIMIT_KEY item.
+	 */
+	trans = btrfs_start_transaction(root, 2);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
 		goto out;
@@ -3811,7 +3816,8 @@ static long btrfs_ioctl_qgroup_limit(struct file *file, void __user *arg)
 		goto drop_write;
 	}
 
-	trans = btrfs_join_transaction(root);
+	/* 1 BTRFS_QGROUP_LIMIT_KEY item. */
+	trans = btrfs_start_transaction(root, 1);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
 		goto out;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH AUTOSEL 6.19-5.15] btrfs: reject root items with drop_progress and zero drop_level
       [not found] <20260324111931.3257972-1-sashal@kernel.org>
  2026-03-24 11:19 ` [PATCH AUTOSEL 6.19-6.12] btrfs: reserve enough transaction items for qgroup ioctls Sasha Levin
@ 2026-03-24 11:19 ` Sasha Levin
  2026-03-24 11:19 ` [PATCH AUTOSEL 6.19-5.15] btrfs: don't take device_list_mutex when querying zone info Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2026-03-24 11:19 UTC (permalink / raw)
  To: patches, stable
  Cc: ZhengYuan Huang, Qu Wenruo, David Sterba, Sasha Levin, clm,
	linux-btrfs, linux-kernel

From: ZhengYuan Huang <gality369@gmail.com>

[ Upstream commit b17b79ff896305fd74980a5f72afec370ee88ca4 ]

[BUG]
When recovering relocation at mount time, merge_reloc_root() and
btrfs_drop_snapshot() both use BUG_ON(level == 0) to guard against
an impossible state: a non-zero drop_progress combined with a zero
drop_level in a root_item, which can be triggered:

------------[ cut here ]------------
kernel BUG at fs/btrfs/relocation.c:1545!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 1 UID: 0 PID: 283 ... Tainted: 6.18.0+ #16 PREEMPT(voluntary)
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Ubuntu 24.04 PC v2, BIOS 1.16.3-debian-1.16.3-2
RIP: 0010:merge_reloc_root+0x1266/0x1650 fs/btrfs/relocation.c:1545
Code: ffff0000 00004589 d7e9acfa ffffe8a1 79bafebe 02000000
Call Trace:
 merge_reloc_roots+0x295/0x890 fs/btrfs/relocation.c:1861
 btrfs_recover_relocation+0xd6e/0x11d0 fs/btrfs/relocation.c:4195
 btrfs_start_pre_rw_mount+0xa4d/0x1810 fs/btrfs/disk-io.c:3130
 open_ctree+0x5824/0x5fe0 fs/btrfs/disk-io.c:3640
 btrfs_fill_super fs/btrfs/super.c:987 [inline]
 btrfs_get_tree_super fs/btrfs/super.c:1951 [inline]
 btrfs_get_tree_subvol fs/btrfs/super.c:2094 [inline]
 btrfs_get_tree+0x111c/0x2190 fs/btrfs/super.c:2128
 vfs_get_tree+0x9a/0x370 fs/super.c:1758
 fc_mount fs/namespace.c:1199 [inline]
 do_new_mount_fc fs/namespace.c:3642 [inline]
 do_new_mount fs/namespace.c:3718 [inline]
 path_mount+0x5b8/0x1ea0 fs/namespace.c:4028
 do_mount fs/namespace.c:4041 [inline]
 __do_sys_mount fs/namespace.c:4229 [inline]
 __se_sys_mount fs/namespace.c:4206 [inline]
 __x64_sys_mount+0x282/0x320 fs/namespace.c:4206
 ...
RIP: 0033:0x7f969c9a8fde
Code: 0f1f4000 48c7c2b0 fffffff7 d8648902 b8ffffff ffc3660f
---[ end trace 0000000000000000 ]---

The bug is reproducible on 7.0.0-rc2-next-20260310 with our dynamic
metadata fuzzing tool that corrupts btrfs metadata at runtime.

[CAUSE]
A non-zero drop_progress.objectid means an interrupted
btrfs_drop_snapshot() left a resume point on disk, and in that case
drop_level must be greater than 0 because the checkpoint is only
saved at internal node levels.

Although this invariant is enforced when the kernel writes the root
item, it is not validated when the root item is read back from disk.
That allows on-disk corruption to provide an invalid state with
drop_progress.objectid != 0 and drop_level == 0.

When relocation recovery later processes such a root item,
merge_reloc_root() reads drop_level and hits BUG_ON(level == 0). The
same invalid metadata can also trigger the corresponding BUG_ON() in
btrfs_drop_snapshot().

[FIX]
Fix this by validating the root_item invariant in tree-checker when
reading root items from disk: if drop_progress.objectid is non-zero,
drop_level must also be non-zero. Reject such malformed metadata with
-EUCLEAN before it reaches merge_reloc_root() or btrfs_drop_snapshot()
and triggers the BUG_ON.

After the fix, the same corruption is correctly rejected by tree-checker
and the BUG_ON is no longer triggered.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Record: The `btrfs_root_drop_level()` accessor was added in
c842268458d90 (v5.10). Before that, code accessed
`root_item->drop_level` directly. The `btrfs_disk_key_objectid()`
accessor has existed much longer. In stable trees v5.4 and older, the
patch would need minor adjustment to use `ri.drop_progress.objectid` and
`ri.drop_level` directly instead of accessor macros. For v5.10+ the
patch should apply cleanly or with minimal fuzz.

### Step 6.2: CHECK FOR BACKPORT COMPLICATIONS
The new check uses:
- `btrfs_disk_key_objectid(&ri.drop_progress)` — long-standing accessor
- `btrfs_root_drop_level(&ri)` — added in v5.10 (c842268458d90)

For stable trees ≥ 5.10, the patch should apply cleanly. For 5.4.y, a
trivial modification would be needed.

## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT

### Step 7.1: IDENTIFY THE SUBSYSTEM AND ITS CRITICALITY
- **Subsystem:** btrfs (filesystem)
- **Criticality:** IMPORTANT — btrfs is a widely-used filesystem.
  Corruption handling is critical for data integrity.
- **Component:** tree-checker (metadata validation layer) — the
  defensive layer against on-disk corruption

### Step 7.2: ASSESS SUBSYSTEM ACTIVITY
Active subsystem with regular commits. The tree-checker is actively
maintained by Qu Wenruo and David Sterba.

## PHASE 8: IMPACT AND RISK ASSESSMENT

### Step 8.1: DETERMINE WHO IS AFFECTED
All btrfs users who encounter on-disk corruption (which can happen from
hardware failures, power loss, firmware bugs, etc.). This is filesystem-
specific but btrfs has a large user base.

### Step 8.2: DETERMINE THE TRIGGER CONDITIONS
- **Trigger:** Mounting a btrfs filesystem with specific metadata
  corruption (non-zero drop_progress.objectid + zero drop_level in a
  root item)
- **How common:** Not frequent in normal operation, but can happen with
  disk corruption, power loss during snapshot deletion, or intentional
  fuzzing
- **Security relevance:** An unprivileged user might be able to mount a
  crafted filesystem image (e.g., USB drive) to trigger the BUG_ON and
  crash the kernel

### Step 8.3: DETERMINE THE FAILURE MODE SEVERITY
- **Failure mode:** Kernel BUG (invalid opcode trap) → full kernel
  crash/oops
- **Severity:** CRITICAL — system crash on mount with corrupted
  metadata. No graceful error handling.

### Step 8.4: CALCULATE RISK-BENEFIT RATIO
- **BENEFIT:** Very high — prevents kernel crash (BUG_ON) from corrupted
  filesystem metadata. Converts crash into clean -EUCLEAN rejection.
- **RISK:** Very low — adds 17 lines of pure validation in an existing
  validation function. Only rejects metadata that was guaranteed to
  crash the kernel. Zero chance of regression for valid filesystems.
- **Ratio:** Excellent. Very high benefit, very low risk.

## PHASE 9: FINAL SYNTHESIS

### Step 9.1: COMPILE THE EVIDENCE

**Evidence FOR backporting:**
1. Prevents kernel BUG_ON crash (system crash) — CRITICAL severity
2. Small, surgical fix (17 lines added to existing validation function)
3. Follows established tree-checker validation patterns
4. Reviewed by senior btrfs developer Qu Wenruo
5. Merged by btrfs maintainer David Sterba
6. Author explicitly tagged `Cc: stable@vger.kernel.org # 5.3+`
7. Bug exists since v5.4 (when check_root_item was introduced) — affects
   ALL active stable trees
8. No dependencies on other commits
9. Reproducible with fuzzing tool, stack trace provided
10. Potential security implication (crafted filesystem image → kernel
    crash)
11. Related fix (4289b494ac553) was already Cc'd to stable

**Evidence AGAINST backporting:**
- None identified.

### Step 9.2: APPLY THE STABLE RULES CHECKLIST
1. **Obviously correct and tested?** YES — invariant is well-documented
   (drop_level must be >0 when drop_progress is set), enforced on write
   path, now validated on read path. Tested by reproducer.
2. **Fixes a real bug?** YES — BUG_ON kernel crash on corrupted
   metadata.
3. **Important issue?** YES — CRITICAL (kernel crash at mount time).
4. **Small and contained?** YES — 17 lines, single file, single
   function.
5. **No new features or APIs?** Correct — pure validation addition.
6. **Can apply to stable trees?** YES — clean for 5.10+, trivial
   adjustment for 5.4.

### Step 9.3: CHECK FOR EXCEPTION CATEGORIES
Not an exception category — this is a standard, high-quality bug fix.

### Step 9.4: MAKE YOUR DECISION
This is a textbook stable backport candidate. It prevents a kernel
BUG_ON crash from corrupted metadata with a minimal, obviously-correct
validation check in the tree-checker, following established patterns.
The author, reviewer, and maintainer all participated, and the author
explicitly nominated it for stable.

## Verification

- [Phase 1] Parsed tags: Reviewed-by Qu Wenruo, SOBs from author and
  maintainer David Sterba. No syzbot, found via metadata fuzzing tool.
- [Phase 2] Diff analysis: +17 lines in `check_root_item()` in tree-
  checker.c, adds validation for drop_progress vs drop_level invariant,
  returns -EUCLEAN on invalid combination.
- [Phase 3] git blame: `check_root_item()` introduced in commit
  259ee7754b6793 (v5.4, 2019-07-16). Bug (missing validation) present
  since inception.
- [Phase 3] Confirmed 3 BUG_ON(level == 0) crash sites:
  relocation.c:1538, relocation.c:1621, extent-tree.c:6118.
- [Phase 3] No Fixes: tag to follow (expected).
- [Phase 3] Author has no other commits in tree (first-time
  contributor), but patch reviewed by subsystem expert.
- [Phase 3] `btrfs_root_drop_level()` accessor exists since v5.10
  (c842268458d90). Patch applies cleanly to 5.10+ stable trees.
- [Phase 4] lore.kernel.org: Found patch v2 discussion. Patch includes
  `Cc: stable@vger.kernel.org # 5.3+`. No NAKs found.
- [Phase 4] Related fix 4289b494ac553 ("btrfs: do not allow relocation
  of partially dropped subvolumes") also Cc'd to stable 5.15+.
- [Phase 5] Call chain: `btrfs_check_leaf()` → `__btrfs_check_leaf()` →
  `check_leaf_items()` → `check_root_item()`. Called during all metadata
  reads from disk.
- [Phase 6] Buggy code exists in all active stable trees (v5.4+). Patch
  standalone, no dependencies.
- [Phase 7] btrfs filesystem — IMPORTANT subsystem, actively maintained.
- [Phase 8] Failure mode: BUG_ON kernel crash at mount time. Severity
  CRITICAL. Fix risk: very low (pure validation addition).
- UNVERIFIED: Exact applicability to v5.4.y (may need trivial accessor
  adjustment) — does not affect decision.

**YES**

 fs/btrfs/tree-checker.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 12d6ae49bc078..902a5874bda81 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -1260,6 +1260,23 @@ static int check_root_item(struct extent_buffer *leaf, struct btrfs_key *key,
 			    btrfs_root_drop_level(&ri), BTRFS_MAX_LEVEL - 1);
 		return -EUCLEAN;
 	}
+	/*
+	 * If drop_progress.objectid is non-zero, a btrfs_drop_snapshot() was
+	 * interrupted and the resume point was recorded in drop_progress and
+	 * drop_level.  In that case drop_level must be >= 1: level 0 is the
+	 * leaf level and drop_snapshot never saves a checkpoint there (it
+	 * only records checkpoints at internal node levels in DROP_REFERENCE
+	 * stage).  A zero drop_level combined with a non-zero drop_progress
+	 * objectid indicates on-disk corruption and would cause a BUG_ON in
+	 * merge_reloc_root() and btrfs_drop_snapshot() at mount time.
+	 */
+	if (unlikely(btrfs_disk_key_objectid(&ri.drop_progress) != 0 &&
+		     btrfs_root_drop_level(&ri) == 0)) {
+		generic_err(leaf, slot,
+			    "invalid root drop_level 0 with non-zero drop_progress objectid %llu",
+			    btrfs_disk_key_objectid(&ri.drop_progress));
+		return -EUCLEAN;
+	}
 
 	/* Flags check */
 	if (unlikely(btrfs_root_flags(&ri) & ~valid_root_flags)) {
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH AUTOSEL 6.19-5.15] btrfs: don't take device_list_mutex when querying zone info
       [not found] <20260324111931.3257972-1-sashal@kernel.org>
  2026-03-24 11:19 ` [PATCH AUTOSEL 6.19-6.12] btrfs: reserve enough transaction items for qgroup ioctls Sasha Levin
  2026-03-24 11:19 ` [PATCH AUTOSEL 6.19-5.15] btrfs: reject root items with drop_progress and zero drop_level Sasha Levin
@ 2026-03-24 11:19 ` Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2026-03-24 11:19 UTC (permalink / raw)
  To: patches, stable
  Cc: Johannes Thumshirn, Shin'ichiro Kawasaki, Damien Le Moal,
	David Sterba, Sasha Levin, clm, linux-btrfs, linux-kernel

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

[ Upstream commit 77603ab10429fe713a03345553ca8dbbfb1d91c6 ]

Shin'ichiro reported sporadic hangs when running generic/013 in our CI
system. When enabling lockdep, there is a lockdep splat when calling
btrfs_get_dev_zone_info_all_devices() in the mount path that can be
triggered by i.e. generic/013:

  ======================================================
  WARNING: possible circular locking dependency detected
  7.0.0-rc1+ #355 Not tainted
  ------------------------------------------------------
  mount/1043 is trying to acquire lock:
  ffff8881020b5470 (&vblk->vdev_mutex){+.+.}-{4:4}, at: virtblk_report_zones+0xda/0x430

  but task is already holding lock:
  ffff888102a738e0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: btrfs_get_dev_zone_info_all_devices+0x45/0x90

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #4 (&fs_devs->device_list_mutex){+.+.}-{4:4}:
	 __mutex_lock+0xa3/0x1360
	 btrfs_create_pending_block_groups+0x1f4/0x9d0
	 __btrfs_end_transaction+0x3e/0x2e0
	 btrfs_zoned_reserve_data_reloc_bg+0x2f8/0x390
	 open_ctree+0x1934/0x23db
	 btrfs_get_tree.cold+0x105/0x26c
	 vfs_get_tree+0x28/0xb0
	 __do_sys_fsconfig+0x324/0x680
	 do_syscall_64+0x92/0x4f0
	 entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #3 (btrfs_trans_num_extwriters){++++}-{0:0}:
	 join_transaction+0xc2/0x5c0
	 start_transaction+0x17c/0xbc0
	 btrfs_zoned_reserve_data_reloc_bg+0x2b4/0x390
	 open_ctree+0x1934/0x23db
	 btrfs_get_tree.cold+0x105/0x26c
	 vfs_get_tree+0x28/0xb0
	 __do_sys_fsconfig+0x324/0x680
	 do_syscall_64+0x92/0x4f0
	 entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #2 (btrfs_trans_num_writers){++++}-{0:0}:
	 lock_release+0x163/0x4b0
	 __btrfs_end_transaction+0x1c7/0x2e0
	 btrfs_dirty_inode+0x6f/0xd0
	 touch_atime+0xe5/0x2c0
	 btrfs_file_mmap_prepare+0x65/0x90
	 __mmap_region+0x4b9/0xf00
	 mmap_region+0xf7/0x120
	 do_mmap+0x43d/0x610
	 vm_mmap_pgoff+0xd6/0x190
	 ksys_mmap_pgoff+0x7e/0xc0
	 do_syscall_64+0x92/0x4f0
	 entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #1 (&mm->mmap_lock){++++}-{4:4}:
	 __might_fault+0x68/0xa0
	 _copy_to_user+0x22/0x70
	 blkdev_copy_zone_to_user+0x22/0x40
	 virtblk_report_zones+0x282/0x430
	 blkdev_report_zones_ioctl+0xfd/0x130
	 blkdev_ioctl+0x20f/0x2c0
	 __x64_sys_ioctl+0x86/0xd0
	 do_syscall_64+0x92/0x4f0
	 entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #0 (&vblk->vdev_mutex){+.+.}-{4:4}:
	 __lock_acquire+0x1522/0x2680
	 lock_acquire+0xd5/0x2f0
	 __mutex_lock+0xa3/0x1360
	 virtblk_report_zones+0xda/0x430
	 blkdev_report_zones_cached+0x162/0x190
	 btrfs_get_dev_zones+0xdc/0x2e0
	 btrfs_get_dev_zone_info+0x219/0xe80
	 btrfs_get_dev_zone_info_all_devices+0x62/0x90
	 open_ctree+0x1200/0x23db
	 btrfs_get_tree.cold+0x105/0x26c
	 vfs_get_tree+0x28/0xb0
	 __do_sys_fsconfig+0x324/0x680
	 do_syscall_64+0x92/0x4f0
	 entry_SYSCALL_64_after_hwframe+0x76/0x7e

  other info that might help us debug this:

  Chain exists of:
    &vblk->vdev_mutex --> btrfs_trans_num_extwriters --> &fs_devs->device_list_mutex

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&fs_devs->device_list_mutex);
				 lock(btrfs_trans_num_extwriters);
				 lock(&fs_devs->device_list_mutex);
    lock(&vblk->vdev_mutex);

   *** DEADLOCK ***

  3 locks held by mount/1043:
   #0: ffff88811063e878 (&fc->uapi_mutex){+.+.}-{4:4}, at: __do_sys_fsconfig+0x2ae/0x680
   #1: ffff88810cb9f0e8 (&type->s_umount_key#31/1){+.+.}-{4:4}, at: alloc_super+0xc0/0x3e0
   #2: ffff888102a738e0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: btrfs_get_dev_zone_info_all_devices+0x45/0x90

  stack backtrace:
  CPU: 2 UID: 0 PID: 1043 Comm: mount Not tainted 7.0.0-rc1+ #355 PREEMPT(full)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-9.fc43 06/10/2025
  Call Trace:
   <TASK>
   dump_stack_lvl+0x5b/0x80
   print_circular_bug.cold+0x18d/0x1d8
   check_noncircular+0x10d/0x130
   __lock_acquire+0x1522/0x2680
   ? vmap_small_pages_range_noflush+0x3ef/0x820
   lock_acquire+0xd5/0x2f0
   ? virtblk_report_zones+0xda/0x430
   ? lock_is_held_type+0xcd/0x130
   __mutex_lock+0xa3/0x1360
   ? virtblk_report_zones+0xda/0x430
   ? virtblk_report_zones+0xda/0x430
   ? __pfx_copy_zone_info_cb+0x10/0x10
   ? virtblk_report_zones+0xda/0x430
   virtblk_report_zones+0xda/0x430
   ? __pfx_copy_zone_info_cb+0x10/0x10
   blkdev_report_zones_cached+0x162/0x190
   ? __pfx_copy_zone_info_cb+0x10/0x10
   btrfs_get_dev_zones+0xdc/0x2e0
   btrfs_get_dev_zone_info+0x219/0xe80
   btrfs_get_dev_zone_info_all_devices+0x62/0x90
   open_ctree+0x1200/0x23db
   btrfs_get_tree.cold+0x105/0x26c
   ? rcu_is_watching+0x18/0x50
   vfs_get_tree+0x28/0xb0
   __do_sys_fsconfig+0x324/0x680
   do_syscall_64+0x92/0x4f0
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f615e27a40e
  RSP: 002b:00007fff11b18fb8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
  RAX: ffffffffffffffda RBX: 000055572e92ab10 RCX: 00007f615e27a40e
  RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
  RBP: 00007fff11b19100 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  R13: 000055572e92bc40 R14: 00007f615e3faa60 R15: 000055572e92bd08
   </TASK>

Don't hold the device_list_mutex while calling into
btrfs_get_dev_zone_info() in btrfs_get_dev_zone_info_all_devices() to
mitigate the issue. This is safe, as no other thread can touch the device
list at the moment of execution.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have sufficient information to complete the full analysis. Let me
compile the results.

---

## PHASE 1: COMMIT MESSAGE FORENSICS

### Step 1.1: Subject Line
- **Subsystem**: btrfs (zoned)
- **Action verb**: "don't take" — removing an unnecessary/harmful lock,
  indicates a deadlock fix
- **Summary**: Remove device_list_mutex from zone info query to fix
  circular lock dependency

### Step 1.2: Tags
- **Reported-by**: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> —
  real user/tester report of sporadic hangs
- **Reviewed-by**: Damien Le Moal <dlemoal@kernel.org> — block
  layer/zoned storage expert
- **Reviewed-by**: David Sterba <dsterba@suse.com> — btrfs maintainer
- **Signed-off-by**: Johannes Thumshirn <johannes.thumshirn@wdc.com> —
  author, btrfs/zoned regular contributor
- **Signed-off-by**: David Sterba <dsterba@suse.com> — btrfs maintainer
  acceptance
- **No Fixes: tag** (expected for manual review candidates)
- **No Cc: stable** (expected)

### Step 1.3: Commit Body
The commit describes sporadic hangs found by a CI system running
generic/013. Lockdep confirms a circular locking dependency:
- Chain: `device_list_mutex` → `vblk->vdev_mutex` → `mmap_lock` →
  `btrfs_trans_num_writers` → `btrfs_trans_num_extwriters` →
  `device_list_mutex`
- Full stack traces provided showing the exact lock acquisition paths
- Clear reproduction: mount of zoned btrfs filesystem (open_ctree path)

### Step 1.4: Hidden Bug Fix Detection
This is explicitly a deadlock fix. The commit message directly shows the
lockdep splat and the deadlock scenario. Not hidden at all.

**Record**: Real deadlock fix. Sporadic hangs during mount. Confirmed by
lockdep with full dependency chain.

---

## PHASE 2: DIFF ANALYSIS

### Step 2.1: Inventory
- **Files changed**: 1 — `fs/btrfs/zoned.c`
- **Lines**: -2 (mutex_lock/mutex_unlock removed), +4 (comment added)
- **Net**: +2 lines
- **Functions modified**: `btrfs_get_dev_zone_info_all_devices()`
- **Scope**: Single-file, single-function, surgical fix

### Step 2.2: Code Flow Change
- **Before**: The function acquires `device_list_mutex` before iterating
  the device list and releases it after
- **After**: The function iterates the device list without the mutex,
  relying on the fact that during mount (open_ctree), no other thread
  can modify the device list

### Step 2.3: Bug Mechanism
**Category**: Deadlock / circular lock dependency (CRITICAL)
- The lock creates a dependency chain that can deadlock with other paths
  that acquire these locks in different order
- CPU0 holds `device_list_mutex` and needs `vblk->vdev_mutex`
- CPU1 holds locks in the transaction path that eventually need
  `device_list_mutex`

### Step 2.4: Fix Quality
- **Obviously correct**: Yes — during `open_ctree()`, the filesystem is
  being mounted, no other thread can add/remove devices
- **Minimal/surgical**: Yes — only removes the unnecessary lock, adds
  explanatory comment
- **Regression risk**: Very low — the lock was unnecessary at this call
  site (mount path only)
- **Red flags**: None

**Record**: Minimal 2-line removal + comment. Obviously safe because
called only from mount path.

---

## PHASE 3: GIT HISTORY INVESTIGATION

### Step 3.1: Blame
- Buggy code (mutex_lock/unlock in
  `btrfs_get_dev_zone_info_all_devices`) introduced by commit
  `7365104236ade0` (Naohiro Aota, 2021-02-04)
- First appeared in **v5.12-rc1**
- Present in all current stable trees (v5.15.y, v6.1.y, v6.6.y, v6.12.y)

### Step 3.2: Fixes tag
No Fixes: tag present. The implicit target is `7365104236ade0` which
introduced the function with the mutex.

### Step 3.3: File History
- The lock dependency chain also depends on `2eadb9e75e8e65` (Nikolay
  Borisov, 2021-07-05) which added `device_list_mutex` to
  `btrfs_create_pending_block_groups` — first in **v5.15-rc1**
- So the deadlock is possible from **v5.15** onwards (when both sides of
  the circular dependency exist)
- Prior lockdep fixes in this file: `0b9e66762aa0c` (device_list_mutex
  deadlock in `btrfs_can_activate_zone`), `b18f3b60b35a8` (lock ordering
  in `btrfs_zone_activate`)

### Step 3.4: Author
Johannes Thumshirn is a regular btrfs/zoned contributor and co-
maintainer of zoned storage code. High trust.

### Step 3.5: Dependencies
- The fix is completely self-contained: removes 2 lines, adds a comment
- No dependency on other commits
- The function and its surrounding code are unchanged since introduction
  (except the `populate_cache` parameter added in v5.16)
- Will apply cleanly to all stable trees v5.15+

**Record**: Bug exists from v5.15+. Fix is standalone. Will apply
cleanly.

---

## PHASE 4: MAILING LIST RESEARCH

### Step 4.1: Lore Discussion
Found the original patch posted 2026-03-03. Reviewed by Damien Le Moal
(block/zoned expert) and David Sterba (btrfs maintainer). Included in
"Btrfs fixes for 7.0-rc5" pull request by David Sterba, confirming it's
treated as a fix.

### Step 4.2: Bug Report
The bug was found by Shin'ichiro Kawasaki in their CI system running
generic/013 (an fstests test). The lockdep splat is the primary
evidence. The commit also mentions "sporadic hangs."

### Step 4.3: Related Patches
No other patches in a series — this is a standalone fix.

### Step 4.4: Stable Discussion
No explicit stable nomination found, which is why this commit is being
manually reviewed.

**Record**: Standalone fix. Reviewed by 2 experts. Accepted as bugfix
for rc5.

---

## PHASE 5: CODE SEMANTIC ANALYSIS

### Step 5.1: Key Functions
- Modified: `btrfs_get_dev_zone_info_all_devices()`

### Step 5.2: Callers
- Single caller: `open_ctree()` in `fs/btrfs/disk-io.c:3464`
- This is the mount path — called during every btrfs mount operation on
  zoned devices

### Step 5.3: Callees
- Calls `btrfs_get_dev_zone_info()` which calls `btrfs_get_dev_zones()`
  which calls `blkdev_report_zones_cached()` (or `blkdev_report_zones()`
  in older kernels) → block device driver zone report callback
- The block device callback (e.g., `virtblk_report_zones`) may take its
  own mutex, creating the circular dependency

### Step 5.4: Call Chain
`mount` syscall → `vfs_get_tree` → `btrfs_get_tree` → `open_ctree` →
`btrfs_get_dev_zone_info_all_devices`
This is reachable from userspace during every mount of a zoned btrfs
filesystem.

### Step 5.5: Similar Patterns
Previous similar fix: `0b9e66762aa0c` removed `device_list_mutex` from
`btrfs_can_activate_zone()` for the same type of deadlock reason. This
pattern of removing unnecessary `device_list_mutex` usage has precedent.

**Record**: Called on every zoned btrfs mount. Single call site. Same
pattern as previous accepted fix.

---

## PHASE 6: STABLE TREE ANALYSIS

### Step 6.1: Buggy Code in Stable Trees
- The function `btrfs_get_dev_zone_info_all_devices` with
  `device_list_mutex` exists since v5.12
- The other side of the circular dependency (`device_list_mutex` in
  `btrfs_create_pending_block_groups`) exists since v5.15
- **Deadlock is possible in all stable trees from v5.15 onwards**:
  v5.15.y, v6.1.y, v6.6.y, v6.12.y

### Step 6.2: Backport Complications
- The function is essentially unchanged since introduction (only
  `populate_cache` parameter change in v5.16)
- The patch should apply cleanly to all stable trees v5.15+
- In v5.12-v5.14 the other side of the dependency doesn't exist, so the
  deadlock can't occur there

### Step 6.3: Related Fixes in Stable
No fix for this specific deadlock is already in stable trees.

**Record**: Bug exists in v5.15+. Clean apply expected. No existing fix
in stable.

---

## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT

### Step 7.1: Subsystem
- **Subsystem**: btrfs (filesystem), zoned device support
- **Criticality**: IMPORTANT — btrfs is widely used, zoned storage is
  growing (SMR HDDs, ZNS SSDs)

### Step 7.2: Activity
The btrfs zoned code is very actively developed with frequent fixes,
indicating ongoing maturation of this subsystem.

**Record**: btrfs/zoned — IMPORTANT criticality, actively developed.

---

## PHASE 8: IMPACT AND RISK ASSESSMENT

### Step 8.1: Affected Users
- All users of btrfs on zoned block devices (ZNS SSDs, SMR HDDs)
- Also affects users of btrfs zone emulation mode on regular devices
- Growing user population as zoned storage adoption increases

### Step 8.2: Trigger Conditions
- **Trigger**: Mount a zoned btrfs filesystem while another thread/CPU
  holds locks in the transaction/mmap path (the lock ordering violation
  makes this possible whenever two threads are active)
- **Frequency**: Sporadic — depends on timing, but confirmed
  reproducible with fstests generic/013
- **Unprivileged trigger**: No — requires mount privileges

### Step 8.3: Failure Mode Severity
- **Failure mode**: System hang / deadlock during mount
- **Severity**: **CRITICAL** — deadlock makes the system unusable, mount
  never completes
- Lockdep confirmed the circular dependency chain

### Step 8.4: Risk-Benefit Ratio
- **BENEFIT**: Very high — prevents deadlock/hang during mount of zoned
  btrfs
- **RISK**: Very low — 2-line removal of an unnecessary lock in the
  mount path, which is inherently single-threaded at that point
- The safety argument is sound: during `open_ctree()`, the filesystem is
  being set up; no device add/remove operations can be happening
- **Ratio**: Excellent — minimal risk, high benefit

---

## PHASE 9: FINAL SYNTHESIS

### Step 9.1: Evidence Summary

**FOR backporting:**
- Fixes a real deadlock (sporadic hangs, confirmed by lockdep)
- Reported by a real tester (Shin'ichiro Kawasaki)
- Reviewed by block/zoned expert (Damien Le Moal) and btrfs maintainer
  (David Sterba)
- Author is a trusted btrfs/zoned contributor (Johannes Thumshirn)
- Included in "Btrfs fixes for 7.0-rc5" — treated as a bugfix
- Surgical fix: 2 lines removed, comment added
- Bug exists in all stable trees v5.15+
- Will apply cleanly
- No dependencies
- Previous precedent for same pattern fix (`0b9e66762aa0c`)

**AGAINST backporting:**
- (None identified)

### Step 9.2: Stable Rules Checklist
1. **Obviously correct and tested?** YES — the mount path is single-
   threaded; no concurrent device list modification is possible. Tested
   in CI.
2. **Fixes a real bug?** YES — deadlock during mount, confirmed by
   lockdep
3. **Important issue?** YES — deadlock/hang, severity CRITICAL
4. **Small and contained?** YES — 2-line removal in 1 function, 1 file
5. **No new features or APIs?** YES — purely removes unnecessary locking
6. **Can apply to stable trees?** YES — clean apply expected for v5.15+

### Step 9.3: Exception Categories
Not applicable — this is a straightforward deadlock fix.

### Step 9.4: Decision
This is a clear YES. It fixes a real deadlock that causes system hangs
during mount of zoned btrfs filesystems. The fix is minimal, obviously
correct, well-reviewed, and will apply cleanly to stable trees.

---

## Verification

- **[Phase 1]** Parsed tags: Reported-by Shin'ichiro Kawasaki, Reviewed-
  by Damien Le Moal and David Sterba, SOB Johannes Thumshirn and David
  Sterba
- **[Phase 2]** Diff analysis: 2 lines removed (mutex_lock/unlock), 4
  lines added (comment). Single function
  `btrfs_get_dev_zone_info_all_devices()` in single file
  `fs/btrfs/zoned.c`
- **[Phase 3]** git blame: buggy mutex_lock introduced in commit
  `7365104236ade0` (v5.12-rc1, 2021-02-04). Other side of deadlock chain
  in `2eadb9e75e8e65` (v5.15-rc1)
- **[Phase 3]** git describe: confirmed `7365104236ade0` first in
  v5.12-rc1, `2eadb9e75e8e65` first in v5.15-rc1
- **[Phase 3]** Author history: Johannes Thumshirn is a regular
  btrfs/zoned contributor with multiple commits to this file
- **[Phase 3]** Related fixes: `0b9e66762aa0c` used same pattern
  (removing device_list_mutex) to fix deadlock in
  `btrfs_can_activate_zone`
- **[Phase 4]** lore.kernel.org: Patch found posted 2026-03-03, reviewed
  by Damien Le Moal and David Sterba, included in "Btrfs fixes for
  7.0-rc5"
- **[Phase 5]** Single caller: `open_ctree()` at `disk-io.c:3464` —
  mount path only
- **[Phase 5]** Call chain traced: mount syscall → vfs_get_tree →
  btrfs_get_tree → open_ctree → btrfs_get_dev_zone_info_all_devices
- **[Phase 6]** Stable tree check: confirmed zoned.c has been modified
  in v5.15.y, v6.1.y, v6.6.y stable trees (function present in all)
- **[Phase 6]** No conflicting fixes found in stable for this specific
  issue
- **[Phase 7]** Subsystem: btrfs (IMPORTANT criticality), actively
  developed
- **[Phase 8]** Failure mode: deadlock/hang during mount — CRITICAL
  severity
- **[Phase 8]** Risk: Very low (2-line lock removal in single-threaded
  mount path)
- UNVERIFIED: Exact clean apply status for each individual stable branch
  (would need cherry-pick test), but function code is nearly identical
  across all stable trees

**YES**

 fs/btrfs/zoned.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 392e6ad874cc7..ab7cc30212702 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -337,7 +337,10 @@ int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info)
 	if (!btrfs_fs_incompat(fs_info, ZONED))
 		return 0;
 
-	mutex_lock(&fs_devices->device_list_mutex);
+	/*
+	 * No need to take the device_list mutex here, we're still in the mount
+	 * path and devices cannot be added to or removed from the list yet.
+	 */
 	list_for_each_entry(device, &fs_devices->devices, dev_list) {
 		/* We can skip reading of zone info for missing devices */
 		if (!device->bdev)
@@ -347,7 +350,6 @@ int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info)
 		if (ret)
 			break;
 	}
-	mutex_unlock(&fs_devices->device_list_mutex);
 
 	return ret;
 }
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-03-24 11:19 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20260324111931.3257972-1-sashal@kernel.org>
2026-03-24 11:19 ` [PATCH AUTOSEL 6.19-6.12] btrfs: reserve enough transaction items for qgroup ioctls Sasha Levin
2026-03-24 11:19 ` [PATCH AUTOSEL 6.19-5.15] btrfs: reject root items with drop_progress and zero drop_level Sasha Levin
2026-03-24 11:19 ` [PATCH AUTOSEL 6.19-5.15] btrfs: don't take device_list_mutex when querying zone info Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox