* [PATCH AUTOSEL 7.0-5.15] btrfs: don't allow log trees to consume global reserve or overcommit metadata
[not found] <20260420132314.1023554-1-sashal@kernel.org>
@ 2026-04-20 13:16 ` Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0] btrfs: avoid GFP_ATOMIC allocations in qgroup free paths Sasha Levin
` (5 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: Sasha Levin @ 2026-04-20 13:16 UTC
To: patches, stable
Cc: Filipe Manana, Leo Martins, David Sterba, Sasha Levin, clm,
linux-btrfs, linux-kernel
From: Filipe Manana <fdmanana@suse.com>
[ Upstream commit 40f2b11c1b7c593bbbfbf6bf333228ee53ed4050 ]
For a fsync we never reserve space in advance, we just start a transaction
without reserving space and we use an empty block reserve for a log tree.
We reserve space as we need while updating a log tree, we end up in
btrfs_use_block_rsv() when reserving space for the allocation of a log
tree extent buffer and we attempt first to reserve without flushing,
and if that fails we attempt to consume from the global reserve or
overcommit metadata. This makes us consume space that may be the last
resort for a transaction commit to succeed, therefore increasing the
chances for a transaction abort with -ENOSPC.
So make btrfs_use_block_rsv() fail if we can't reserve metadata space for
a log tree extent buffer allocation without flushing, making the fsync
fallback to a transaction commit and avoid using critical space that could
be the only resort for a transaction commit to succeed when we are in a
critical space situation.
Reviewed-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
---
## PHASE 1: COMMIT MESSAGE FORENSICS
### Step 1.1: Subject Line
- **Subsystem**: `btrfs` (filesystem)
- **Action verb**: "don't allow" — preventing unwanted behavior,
indicating a correctness fix
- **Summary**: Prevent log trees from consuming the global reserve or
overcommitting metadata space
### Step 1.2: Tags
- **Reviewed-by**: Leo Martins `<loemra.dev@gmail.com>` — reviewed
- **Signed-off-by**: Filipe Manana `<fdmanana@suse.com>` — author, core
btrfs developer (1903 commits)
- **Signed-off-by**: David Sterba `<dsterba@suse.com>` — btrfs
maintainer
- No Fixes: tag — expected for AUTOSEL candidates. The bug is a design
oversight, not from a single commit.
- No Reported-by: tag — author identified the issue through code
analysis
- No Cc: stable — expected for AUTOSEL candidates
### Step 1.3: Commit Body Analysis
The message clearly describes:
- **Bug mechanism**: During fsync, log trees don't reserve space in
advance. When `btrfs_use_block_rsv()` can't reserve with NO_FLUSH, it
falls through to consuming the global reserve or overcommitting
metadata.
- **Symptom**: This depletes critical space needed for transaction
commits to succeed, increasing the chances of transaction abort with
-ENOSPC.
- **Failure mode**: Transaction abort with -ENOSPC makes the filesystem
read-only.
- **Fix approach**: Fail immediately for log trees after NO_FLUSH
attempt, forcing fsync to fall back to a full transaction commit.
### Step 1.4: Hidden Bug Fix Detection
This is NOT a hidden fix — it's explicitly described as preventing a
problematic behavior that leads to transaction aborts. Record: This is a
clear bug fix for ENOSPC transaction aborts.
---
## PHASE 2: DIFF ANALYSIS
### Step 2.1: Inventory
- **Files changed**: `fs/btrfs/block-rsv.c` only (single file)
- **Lines added**: 25 (2 lines of code; the rest is a comment block and
blank lines)
- **Lines removed**: 0
- **Function modified**: `btrfs_use_block_rsv()` — a single function
- **Scope**: Single-file surgical fix
### Step 2.2: Code Flow Change
The change inserts an early return AFTER the `BTRFS_RESERVE_NO_FLUSH`
attempt fails (line ~543) and BEFORE:
1. The global reserve fallback (lines 549-553)
2. The `BTRFS_RESERVE_FLUSH_EMERGENCY` overcommit (lines 562-565)
**Before**: Log tree allocations could steal from the global reserve and
use emergency flush.
**After**: Log tree allocations fail immediately, causing fsync to fall
back to a full transaction commit.
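
The ordering described above can be sketched as a small userspace model. This is only an illustration under simplifying assumptions: `space_model`, `rsv_source` and `use_block_rsv_model()` are hypothetical names, not kernel symbols, and the real function also checks space-info types and calls into the reservation machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the kernel's real structures. */
enum rsv_source {
	RSV_NO_FLUSH,   /* normal reservation succeeded without flushing */
	RSV_GLOBAL,     /* fell back to the global reserve */
	RSV_EMERGENCY,  /* fell back to emergency overcommit */
	RSV_FAILED,     /* caller gets an error (-ENOSPC in the kernel) */
};

struct space_model {
	uint64_t free_metadata;   /* space reservable without flushing */
	uint64_t global_reserve;  /* last-resort space for commits */
	bool allow_emergency;     /* whether overcommit could still succeed */
};

/*
 * Models the order of attempts. With is_log_tree set, the function stops
 * after the first failed attempt, which is the behaviour the patch adds.
 */
static enum rsv_source use_block_rsv_model(struct space_model *s,
					   uint64_t bytes, bool is_log_tree)
{
	if (s->free_metadata >= bytes) {
		s->free_metadata -= bytes;
		return RSV_NO_FLUSH;
	}

	/* Early return: log trees never reach the fallbacks below. */
	if (is_log_tree)
		return RSV_FAILED;

	if (s->global_reserve >= bytes) {
		s->global_reserve -= bytes;
		return RSV_GLOBAL;
	}

	if (s->allow_emergency)
		return RSV_EMERGENCY;

	return RSV_FAILED;
}
```

In this model, a log tree request that fails the first attempt leaves `global_reserve` untouched, while the same request from another tree would still consume it.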
### Step 2.3: Bug Mechanism
This is a **logic/correctness fix**: Log trees are an optimization path
(fsync via log replay vs full commit). When they consume the global
reserve or use emergency flush, they deplete resources needed for
regular transaction commits, creating conditions for -ENOSPC transaction
aborts.
### Step 2.4: Fix Quality
- **Obviously correct**: YES — the 2-line check is trivial: `if
(btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID) return ERR_PTR(ret);`
- **Minimal/surgical**: YES — only 2 lines of actual code
- **Regression risk**: VERY LOW. The worst case is that fsync falls back to a
full transaction commit (slower but safe and already well-tested). All
callers in `tree-log.c` handle this via `btrfs_set_log_full_commit()`.
- **Red flags**: None
---
## PHASE 3: GIT HISTORY INVESTIGATION
### Step 3.1: Blame
The `try_reserve` section was originally by Josef Bacik (commit
`67f9c2209e885c`, 2019). The emergency flush was added by Josef Bacik in
commit `765c3fe99bcda0` (Sept 2022, ~v6.1). The bug has existed since
the original code and was worsened by the emergency flush addition.
### Step 3.2: Fixes Tag
No Fixes tag. This is a design oversight that goes back to when the
function was first written. The global reserve stealing has been
possible since the original `btrfs_use_block_rsv()`, and the emergency
flush (added in v6.1 timeframe) made it worse.
### Step 3.3: File History
Recent changes to `block-rsv.c` are mostly refactoring (removing fs_info
arguments, adding treelog_rsv, etc.). No other fix for this specific
issue exists.
### Step 3.4: Author
Filipe Manana is one of the top 3 btrfs contributors with 1903 commits.
He is a core developer and deeply familiar with ENOSPC handling. He also
wrote related fixes like commit `09e44868f1e03` ("btrfs: do not abort
transaction on failure to update log root"), which follows the same
principle: log tree failures should gracefully fall back, not abort
transactions.
### Step 3.5: Dependencies
- `btrfs_root_id()` was introduced in commit `e094f48040cda` (April
2024, v6.12). For older stable trees, this would need to be
`root->root_key.objectid`.
- No other structural dependencies — the check is independent of
`treelog_rsv`.
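
A minimal sketch of the two equivalent forms of the check, assuming the helper is a thin accessor over `root->root_key.objectid` as the note above implies. The struct definitions here are mocks so the example compiles outside the kernel, and `is_log_root_new()` / `is_log_root_old()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mock definitions so the sketch builds in userspace. */
#define BTRFS_TREE_LOG_OBJECTID ((uint64_t)-6)

struct btrfs_key {
	uint64_t objectid;
};

struct btrfs_root {
	struct btrfs_key root_key;
};

/* Accessor form, for trees that provide the helper. */
static inline uint64_t btrfs_root_id(const struct btrfs_root *root)
{
	return root->root_key.objectid;
}

/* Check as written in the patch. */
static bool is_log_root_new(const struct btrfs_root *root)
{
	return btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID;
}

/* Open-coded form a backport to an older tree would use instead. */
static bool is_log_root_old(const struct btrfs_root *root)
{
	return root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID;
}
```

Both predicates read the same field, so the adaptation is mechanical and does not change behaviour.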
---
## PHASE 4: MAILING LIST RESEARCH
### Step 4.1-4.5
Lore.kernel.org was inaccessible due to bot protection. The commit was
submitted by Filipe Manana, signed by the btrfs maintainer David Sterba,
and reviewed by Leo Martins. The emergency flush commit message
(765c3fe99bcda0) mentions "100-200 ENOSPC aborts per day" at Facebook,
demonstrating the real-world impact of ENOSPC issues.
---
## PHASE 5: CODE SEMANTIC ANALYSIS
### Step 5.1: Key Functions
Modified function: `btrfs_use_block_rsv()`
### Step 5.2: Callers
`btrfs_use_block_rsv()` is called from `btrfs_alloc_tree_block()` in
`extent-tree.c` (line 5367). This is the central tree block allocation
function used by ALL btree operations including log tree operations.
### Step 5.3-5.4: Call Chain
For log trees: `btrfs_sync_log()` →
`btrfs_search_slot()`/`btrfs_cow_block()` → `btrfs_alloc_tree_block()` →
`btrfs_use_block_rsv()`. Errors propagate back, and `btrfs_sync_log()`
calls `btrfs_set_log_full_commit()` to fall back to full transaction
commit. This path is reachable from any `fsync()` syscall — a very
common user operation.
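
The fallback described above can be modelled roughly as follows. This is a simplified userspace sketch, not kernel code: `fsync_model`, `log_update_model()` and the counters are hypothetical, and the flag handling only mirrors the role that `btrfs_set_log_full_commit()` plays.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the fsync decision; names are illustrative only. */
struct fsync_model {
	bool log_full_commit;  /* set when the log path gives up */
	int full_commits;      /* full transaction commits performed */
	int log_syncs;         /* successful log-based fsyncs */
};

/* Stand-in for updating the log tree; returns 0 or a negative errno. */
static int log_update_model(bool reservation_ok)
{
	return reservation_ok ? 0 : -ENOSPC;
}

static int fsync_model_run(struct fsync_model *m, bool reservation_ok)
{
	int ret = log_update_model(reservation_ok);

	/* A failed log update is not fatal: request a full commit instead. */
	if (ret < 0)
		m->log_full_commit = true;

	if (m->log_full_commit) {
		m->full_commits++;          /* slower, but safe path */
		m->log_full_commit = false; /* the model resets per commit */
		return 0;
	}

	m->log_syncs++;
	return 0;
}
```

The point of the sketch is that the caller of fsync still sees success when the reservation fails; only the cost of the operation changes.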
### Step 5.5: Similar Patterns
The pattern of checking root type before allowing dangerous operations
is used elsewhere in btrfs (e.g., `btrfs_init_root_block_rsv()` already
distinguishes log trees from other roots).
---
## PHASE 6: STABLE TREE ANALYSIS
### Step 6.1: Buggy Code in Stable Trees
The `btrfs_use_block_rsv()` function exists in ALL active stable trees.
The global reserve stealing has been there since the function was
written. Emergency flush was added in ~v6.1. Both paths allow log trees
to deplete critical reserves.
### Step 6.2: Backport Complications
- For 7.0.y: should apply cleanly
- For 6.12.y, 6.6.y: minor adjustment needed — `btrfs_root_id()` doesn't
exist in 6.6; needs `root->root_key.objectid`
- For 6.1.y: same `btrfs_root_id` issue +
`btrfs_reserve_metadata_bytes()` has `fs_info` parameter
- The function structure is preserved across all stable trees
### Step 6.3: Related Fixes
No related fixes for this issue are already in stable.
---
## PHASE 7: SUBSYSTEM CONTEXT
### Step 7.1: Subsystem Criticality
**btrfs** (`fs/btrfs/`) — **CORE/IMPORTANT**. Btrfs is a major Linux
filesystem used widely, especially in enterprise (SUSE, Facebook/Meta),
NAS devices, and desktop Linux.
### Step 7.2: Activity
btrfs is actively developed with regular fixes. Filipe Manana alone has
many ENOSPC-related fixes.
---
## PHASE 8: IMPACT AND RISK ASSESSMENT
### Step 8.1: Affected Users
All btrfs users who use `fsync()` under near-full filesystem conditions.
This is common for database workloads, log-heavy applications, and any
system with significant write activity.
### Step 8.2: Trigger Conditions
- Filesystem must be near-full or under metadata pressure
- Application calls `fsync()` which triggers log tree updates
- The NO_FLUSH reservation fails, and the log tree consumes the global
reserve
- Subsequently, a real transaction commit fails because the global
reserve is depleted
- Not timing-dependent — purely resource-based
### Step 8.3: Failure Mode Severity
**CRITICAL**: Transaction abort with -ENOSPC forces the filesystem read-
only. This is a data availability issue (filesystem becomes unusable
until remounted). The emergency flush commit message confirms "100-200
ENOSPC aborts per day" at scale at Facebook.
### Step 8.4: Risk-Benefit Ratio
- **Benefit**: VERY HIGH — prevents transaction aborts that make
filesystem read-only
- **Risk**: VERY LOW. The change is 2 lines of code and the fallback (full
commit) is well-tested; the main cost is a slower fsync under metadata
pressure
- **Ratio**: Overwhelmingly favorable
---
## PHASE 9: FINAL SYNTHESIS
### Step 9.1: Evidence
**FOR backporting:**
- Prevents CRITICAL failure (ENOSPC transaction abort → filesystem goes
read-only)
- Extremely small and surgical fix (2 lines of code)
- Obviously correct — log trees are an optimization that always has a
safe fallback
- Written by core btrfs developer (Filipe Manana, 1903 commits)
- Reviewed and signed off by btrfs maintainer (David Sterba)
- Bug exists in all stable trees
- Very low regression risk; worst case is a slower fsync under metadata pressure
- Well-established error handling path (`btrfs_set_log_full_commit()`)
- Emergency flush commit explicitly called out real-world ENOSPC aborts
at scale
**AGAINST backporting:**
- No explicit Cc: stable (expected for AUTOSEL)
- No Fixes: tag (design oversight, not single-commit introduction)
- Minor adaptation needed for older stable trees (`btrfs_root_id` →
`root->root_key.objectid`)
### Step 9.2: Stable Rules Checklist
1. **Obviously correct and tested?** YES — 2-line check, well-understood
semantics
2. **Fixes a real bug?** YES — prevents ENOSPC transaction aborts
3. **Important issue?** YES — CRITICAL (filesystem goes read-only)
4. **Small and contained?** YES — single function, single file, 2 lines
of code
5. **No new features or APIs?** CORRECT — no new features
6. **Can apply to stable?** YES (7.0 cleanly; older trees need trivial
adaptation)
### Step 9.3: Exception Categories
Not applicable — this is a standard bug fix, not an exception category.
### Step 9.4: Decision
The evidence overwhelmingly supports backporting. This is a small,
obviously correct fix from a core btrfs developer that prevents a
critical failure mode (filesystem forced read-only by ENOSPC transaction
abort). The regression risk is very low because the fallback (full
transaction commit instead of log-based fsync) is a well-established
code path.
---
## Verification
- [Phase 1] Parsed tags: Reviewed-by Leo Martins, SOB by Filipe Manana
(author) and David Sterba (maintainer)
- [Phase 2] Diff analysis: 2 lines of code added in
`btrfs_use_block_rsv()` after NO_FLUSH fails, before global reserve
fallback
- [Phase 2] Verified the error path returns `ERR_PTR(ret)` which
propagates correctly through `btrfs_alloc_tree_block()`
- [Phase 3] git blame: the `try_reserve` section from commit
`67f9c2209e885c` (2019), emergency flush from `765c3fe99bcda0` (2022,
v6.1 timeframe)
- [Phase 3] git log: confirmed 1903 commits from Filipe Manana to
fs/btrfs/
- [Phase 3] Confirmed `btrfs_root_id()` introduced in `e094f48040cda`
(April 2024, v6.12)
- [Phase 5] Grep confirmed `btrfs_use_block_rsv()` called from
`btrfs_alloc_tree_block()` in extent-tree.c:5367
- [Phase 5] Grep confirmed 28+ calls to `btrfs_set_log_full_commit()` in
tree-log.c — error recovery is well-established
- [Phase 5] Verified `BTRFS_RESERVE_FLUSH_EMERGENCY` comment explicitly
says "This is potentially dangerous" (space-info.h:75)
- [Phase 6] Buggy code exists in all active stable trees (function
existed since 2019, emergency flush since ~v6.1)
- [Phase 7] Confirmed btrfs is a major filesystem, ENOSPC issues
documented at Facebook scale
- [Phase 8] Failure mode: transaction abort → filesystem forced read-
only → CRITICAL severity
- UNVERIFIED: Could not access lore.kernel.org for mailing list
discussion (bot protection)
**YES**
fs/btrfs/block-rsv.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index 6064dd00d041b..9efb3016ef116 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -541,6 +541,31 @@ struct btrfs_block_rsv *btrfs_use_block_rsv(struct btrfs_trans_handle *trans,
BTRFS_RESERVE_NO_FLUSH);
if (!ret)
return block_rsv;
+
+ /*
+ * If we are being used for updating a log tree, fail immediately, which
+ * makes the fsync fallback to a transaction commit.
+ *
+ * We don't want to consume from the global block reserve, as that is
+ * precious space that may be needed to do updates to some trees for
+ * which we don't reserve space during a transaction commit (update root
+ * items in the root tree, device stat items in the device tree and
+ * quota tree updates, see btrfs_init_root_block_rsv()), or to fallback
+ * to in case we did not reserve enough space to run delayed items,
+ * delayed references, or anything else we need in order to avoid a
+ * transaction abort.
+ *
+ * We also don't want to do a reservation in flush emergency mode, as
+ * we end up using metadata that could be critical to allow a
+ * transaction to complete successfully and therefore increase the
+ * chances for a transaction abort.
+ *
+ * Log trees are an optimization and should never consume from the
+ * global reserve or be allowed overcommitting metadata.
+ */
+ if (btrfs_root_id(root) == BTRFS_TREE_LOG_OBJECTID)
+ return ERR_PTR(ret);
+
/*
* If we couldn't reserve metadata bytes try and use some from
* the global reserve if its space type is the same as the global
--
2.53.0
* [PATCH AUTOSEL 7.0] btrfs: avoid GFP_ATOMIC allocations in qgroup free paths
[not found] <20260420132314.1023554-1-sashal@kernel.org>
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-5.15] btrfs: don't allow log trees to consume global reserve or overcommit metadata Sasha Levin
@ 2026-04-20 13:18 ` Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-5.10] btrfs: replace BUG_ON() with error return in cache_save_setup() Sasha Levin
` (4 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: Sasha Levin @ 2026-04-20 13:18 UTC
To: patches, stable
Cc: Leo Martins, Qu Wenruo, David Sterba, Sasha Levin, clm,
linux-btrfs, linux-kernel
From: Leo Martins <loemra.dev@gmail.com>
[ Upstream commit e0a85137a882db789b1bccc1e7db06356ac8c69f ]
When qgroups are enabled, __btrfs_qgroup_release_data() and
qgroup_free_reserved_data() pass an extent_changeset to
btrfs_clear_record_extent_bits() to track how many bytes had their
EXTENT_QGROUP_RESERVED bits cleared. Inside the extent IO tree spinlock,
add_extent_changeset() calls ulist_add() with GFP_ATOMIC to record each
changed range. If this allocation fails, it hits a BUG_ON and panics the
kernel.
However, both of these callers only read changeset.bytes_changed
afterwards — the range_changed ulist is populated and immediately freed
without ever being iterated. The GFP_ATOMIC allocation is entirely
unnecessary for these paths.
Introduce extent_changeset_init_bytes_only() which uses a sentinel value
(EXTENT_CHANGESET_BYTES_ONLY) on the ulist's prealloc field to signal
that only bytes_changed should be tracked. add_extent_changeset() checks
for this sentinel and returns early after updating bytes_changed,
skipping the ulist_add() call entirely. This eliminates the GFP_ATOMIC
allocation and makes the BUG_ON unreachable for these paths.
Callers that need range tracking (qgroup_reserve_data,
qgroup_unreserve_range, btrfs_qgroup_check_reserved_leak) continue to
use extent_changeset_init() and are unaffected.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
This is the `set_extent_bit()` path - the preallocation happens for set
operations (used by `qgroup_reserve_data`), not the clear operations.
The clear path doesn't prealloc. This confirms the commit's analysis -
the clear path (used by free/release) always hits the GFP_ATOMIC
allocation.
---
## PHASE 1: COMMIT MESSAGE FORENSICS
### Step 1.1: PARSE THE SUBJECT LINE
Record: [btrfs] [avoid] "Avoid GFP_ATOMIC allocations in qgroup free
paths" - The action verb "avoid" suggests preventing a problematic
behavior. While not a classic "fix" verb, the commit message body
reveals the actual bug: BUG_ON/kernel panic from failed GFP_ATOMIC
allocation.
### Step 1.2: PARSE ALL COMMIT MESSAGE TAGS
- **Reviewed-by: Qu Wenruo <wqu@suse.com>** - Qu Wenruo is a senior
btrfs developer/maintainer
- **Signed-off-by: Leo Martins <loemra.dev@gmail.com>** - the author
- **Signed-off-by: David Sterba <dsterba@suse.com>** - the btrfs
maintainer who committed it
- No Fixes: tag (expected for candidates under review)
- No Reported-by: tag
- No Link: tag
Record: Reviewed by a core btrfs developer, committed by the btrfs
maintainer.
### Step 1.3: ANALYZE THE COMMIT BODY TEXT
The commit describes:
- **Bug:** When qgroups are enabled, `__btrfs_qgroup_release_data()` and
`qgroup_free_reserved_data()` pass an `extent_changeset` to
`btrfs_clear_record_extent_bits()`, which calls
`add_extent_changeset()` under a spinlock. Inside, `ulist_add()` is
called with `GFP_ATOMIC`. If this allocation fails, it hits a `BUG_ON`
and panics the kernel.
- **Observation:** The callers only read `changeset.bytes_changed` - the
`range_changed` ulist is never iterated for these paths.
- **Fix:** Introduces a "bytes only" mode that skips the `ulist_add()`
call entirely.
Record: Bug = kernel panic (BUG_ON) from GFP_ATOMIC allocation failure
in qgroup free/release paths. Failure mode = kernel crash. Root cause =
unnecessary GFP_ATOMIC allocation that can fail under memory pressure.
### Step 1.4: DETECT HIDDEN BUG FIXES
Record: This IS a bug fix disguised with the verb "avoid". The actual
bug is a kernel panic (BUG_ON) triggered by memory allocation failure
under memory pressure. The fix eliminates the problematic allocation
path entirely.
## PHASE 2: DIFF ANALYSIS - LINE BY LINE
### Step 2.1: INVENTORY THE CHANGES
- `fs/btrfs/extent-io-tree.c`: +3/-0 (add early return in
`add_extent_changeset`)
- `fs/btrfs/extent_io.h`: +22/-1 (new inline functions + guard in
release/prealloc)
- `fs/btrfs/qgroup.c`: +3/-2 (change two callsites to use `_bytes_only`,
add 1 ASSERT)
Total: 28 lines added, 3 removed. Single-subsystem, well-contained.
Record: 3 files changed, ~30 lines of actual code. Functions modified:
`add_extent_changeset`, `extent_changeset_release`,
`extent_changeset_prealloc`, `qgroup_free_reserved_data`,
`__btrfs_qgroup_release_data`, `btrfs_qgroup_check_reserved_leak`.
Scope: single-subsystem surgical fix.
### Step 2.2: UNDERSTAND THE CODE FLOW CHANGE
**Hunk 1 (extent-io-tree.c):** Before: `add_extent_changeset()` always
calls `ulist_add()` with GFP_ATOMIC. After: If changeset is "bytes
only", returns early after incrementing `bytes_changed`, skipping
`ulist_add()`.
**Hunk 2 (extent_io.h):** Adds `EXTENT_CHANGESET_BYTES_ONLY` sentinel,
`extent_changeset_init_bytes_only()`, `extent_changeset_tracks_ranges()`
helper. Guards `extent_changeset_release()` and
`extent_changeset_prealloc()` against the sentinel.
**Hunk 3 (qgroup.c):** Changes `qgroup_free_reserved_data()` and
`__btrfs_qgroup_release_data()` from `extent_changeset_init()` to
`extent_changeset_init_bytes_only()`. Adds a safety ASSERT in
`btrfs_qgroup_check_reserved_leak()` (which needs ranges).
Record: The change flow is: callers opt into bytes-only mode ->
`add_extent_changeset` checks and returns early -> no GFP_ATOMIC
allocation -> BUG_ON becomes unreachable on this path.
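
The flow above can be illustrated with a small userspace model of the sentinel approach. Everything here is a mock under stated assumptions: `ulist_model`, `changeset_model`, `ulist_add_model()` and the `alloc_calls` counter are hypothetical stand-ins, and the real `struct ulist` carries more state than shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mock types; only the fields needed for the illustration. */
struct ulist_node;

struct ulist_model {
	size_t nnodes;
	struct ulist_node *prealloc;
};

struct changeset_model {
	uint64_t bytes_changed;
	struct ulist_model range_changed;
};

/* Sentinel stored in the prealloc slot; never dereferenced. */
#define CHANGESET_BYTES_ONLY ((struct ulist_node *)1)

/* Counts how often the (possibly failing) allocation path is taken. */
static int alloc_calls;

static int ulist_add_model(struct ulist_model *ul)
{
	alloc_calls++;  /* stands in for the GFP_ATOMIC allocation */
	ul->nnodes++;
	return 0;
}

static void changeset_init(struct changeset_model *cs)
{
	cs->bytes_changed = 0;
	cs->range_changed.nnodes = 0;
	cs->range_changed.prealloc = NULL;
}

static void changeset_init_bytes_only(struct changeset_model *cs)
{
	cs->bytes_changed = 0;
	cs->range_changed.nnodes = 0;
	cs->range_changed.prealloc = CHANGESET_BYTES_ONLY;
}

static bool changeset_tracks_ranges(const struct changeset_model *cs)
{
	return cs->range_changed.prealloc != CHANGESET_BYTES_ONLY;
}

/* Byte accounting always happens; range recording is optional. */
static int add_changeset_model(struct changeset_model *cs,
			       uint64_t start, uint64_t end)
{
	cs->bytes_changed += end - start + 1;
	if (!changeset_tracks_ranges(cs))
		return 0;
	return ulist_add_model(&cs->range_changed);
}
```

In bytes-only mode the allocation path is never entered, so there is no failure for a caller to turn into a BUG_ON, while `bytes_changed` is still updated as before.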
### Step 2.3: IDENTIFY THE BUG MECHANISM
Category: **Kernel panic / BUG_ON from allocation failure**.
The call chain is:
1. `btrfs_qgroup_free_data()` / `btrfs_qgroup_release_data()` ->
`__btrfs_qgroup_release_data()`
2. -> `btrfs_clear_record_extent_bits()` ->
`btrfs_clear_extent_bit_changeset()`
3. -> `spin_lock(&tree->lock)` (acquires spinlock)
4. -> `clear_state_bit()` -> `add_extent_changeset()` ->
`ulist_add(&changeset->range_changed, ..., GFP_ATOMIC)`
5. If GFP_ATOMIC allocation fails, returns -ENOMEM
6. `BUG_ON(ret < 0)` at line 570 fires -> kernel panic
Record: Bug category = kernel panic from allocation failure under
spinlock. The fix makes the allocation unreachable for callers that
don't need the result.
### Step 2.4: ASSESS THE FIX QUALITY
- **Obviously correct?** YES - The commit clearly shows that
`qgroup_free_reserved_data` and `__btrfs_qgroup_release_data` only
read `changeset.bytes_changed` and never iterate `range_changed`.
Verified by reading the code.
- **Minimal/surgical?** YES - ~30 lines, well-contained.
- **Regression risk?** VERY LOW - The sentinel value approach is clean.
The only risk would be if a future change to these callers started
iterating the range_changed list without checking, but the ASSERT in
`btrfs_qgroup_check_reserved_leak` shows the pattern is guarded.
- **Red flags?** None.
Record: Fix quality is high. Obviously correct, minimal, low regression
risk.
## PHASE 3: GIT HISTORY INVESTIGATION
### Step 3.1: BLAME THE CHANGED LINES
The `add_extent_changeset` function with its GFP_ATOMIC `ulist_add` was
introduced in the 2017-2018 timeframe. The BUG_ON was moved from inside
the function to callers by David Sterba in commit 57599c7e7722
(2018-03-01). The buggy code (`extent_changeset_init` in
`qgroup_free_reserved_data`) traces back to commit bc42bda22345e (Qu
Wenruo, 2017-02-27).
Record: Buggy code introduced in v4.11 (2017). Present in ALL active
stable trees.
### Step 3.2: FOLLOW THE FIXES TAG
No Fixes: tag present (expected).
### Step 3.3: CHECK FILE HISTORY
Recent changes to extent-io-tree.c are mostly cleanups and
optimizations. No conflicting changes observed that would affect this
patch's applicability.
Record: Standalone fix, no prerequisites identified.
### Step 3.4: CHECK THE AUTHOR
Leo Martins appears to be a newer contributor. However, the patch was
reviewed by Qu Wenruo (core btrfs dev) and committed by David Sterba
(btrfs maintainer).
Record: Author is less experienced, but patch was reviewed by core btrfs
team.
### Step 3.5: CHECK FOR DEPENDENCIES
The patch is self-contained. The only dependency is on the `struct
ulist` having a `prealloc` field, which has existed since the ulist was
introduced.
Record: No dependencies. Can apply standalone.
## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH
### Step 4.1-4.5:
Lore.kernel.org blocked access with bot protection. Web search did not
return the specific patch thread. However, the commit has `Reviewed-by:
Qu Wenruo` which indicates formal review, and was committed by the btrfs
maintainer David Sterba.
Record: Could not access lore.kernel.org. Review is indicated by the
Reviewed-by tag from a core btrfs developer.
## PHASE 5: CODE SEMANTIC ANALYSIS
### Step 5.1: KEY FUNCTIONS
Modified: `add_extent_changeset`, `extent_changeset_release`,
`extent_changeset_prealloc`, `qgroup_free_reserved_data`,
`__btrfs_qgroup_release_data`, `btrfs_qgroup_check_reserved_leak`.
### Step 5.2: TRACE CALLERS
- `btrfs_qgroup_free_data()` and `btrfs_qgroup_release_data()` are
called from: `inode.c` (11 call sites), `file.c` (1), `ordered-data.c`
(3), `delalloc-space.c` (1), `qgroup.c` (7). These are core btrfs I/O
paths.
- Any btrfs file write with qgroups enabled goes through
`btrfs_qgroup_release_data`.
- Error paths go through `btrfs_qgroup_free_data`.
Record: The affected code path is reachable from every btrfs write
operation when qgroups are enabled. High-impact path.
### Step 5.3-5.5: CALL CHAIN
The chain from userspace: `write()` -> btrfs write path -> qgroup
accounting -> `__btrfs_qgroup_release_data()` ->
`btrfs_clear_record_extent_bits()` -> spinlock -> `clear_state_bit()` ->
`add_extent_changeset()` -> BUG_ON on failure.
Record: Reachable from userspace write operations. Very common path for
qgroup users.
## PHASE 6: STABLE TREE ANALYSIS
### Step 6.1: BUGGY CODE IN STABLE TREES?
Verified: The `extent_changeset_init()` calls in
`qgroup_free_reserved_data` and `__btrfs_qgroup_release_data` exist in
v5.15, v6.1, v6.6, and all newer trees. The BUG_ON exists in all these
versions.
Record: Bug exists in ALL active stable trees (v5.15+, likely v4.11+).
### Step 6.2: BACKPORT COMPLICATIONS
In v6.6, the code structure is essentially identical. The `extent_io.h`
structure, `add_extent_changeset`, and qgroup functions have the same
layout. Minor differences: `kmalloc_obj` vs `kmalloc` in
`extent_changeset_alloc()`, `int set` vs `bool set` parameter in
`add_extent_changeset`. The patch would need minor adaptation for v6.6
and older but is conceptually clean.
Record: Minor adaptation needed for stable trees, but the core logic
applies cleanly.
### Step 6.3: RELATED FIXES
No related fix for the same issue was found in stable.
## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT
### Step 7.1: SUBSYSTEM CRITICALITY
btrfs is an IMPORTANT subsystem - widely used filesystem. Qgroups are
used for quota management, common in container deployments and multi-
user systems.
Record: [fs/btrfs] [IMPORTANT] - affects btrfs qgroup users.
### Step 7.2: SUBSYSTEM ACTIVITY
Very active - many commits per cycle. The btrfs maintainer committed
this fix.
## PHASE 8: IMPACT AND RISK ASSESSMENT
### Step 8.1: WHO IS AFFECTED
All btrfs users with qgroups enabled. This includes container
deployments (Docker, LXC), systems using btrfs quotas for storage
management.
Record: filesystem-specific, but a significant user population.
### Step 8.2: TRIGGER CONDITIONS
- Trigger: Memory pressure + btrfs qgroup-enabled file operations
- GFP_ATOMIC allocations fail when there's no immediately available
memory (no sleeping allowed under spinlock)
- More likely under heavy workloads or constrained systems
- Can be triggered by any unprivileged user doing file writes on a btrfs
filesystem with qgroups
Record: Trigger = memory pressure during file write with qgroups.
Moderate likelihood, but can affect production systems.
### Step 8.3: FAILURE MODE SEVERITY
BUG_ON -> kernel panic -> system crash. CRITICAL severity.
Record: CRITICAL - kernel panic, complete system loss.
### Step 8.4: RISK-BENEFIT RATIO
- **BENEFIT:** HIGH - Prevents kernel panic in a real-world scenario
(memory pressure + qgroups)
- **RISK:** VERY LOW - ~30 lines, well-contained, obviously correct
sentinel pattern, reviewed by a core btrfs developer and committed by
the btrfs maintainer
- **Ratio:** Very favorable for backporting
Record: High benefit, very low risk.
## PHASE 9: FINAL SYNTHESIS
### Step 9.1: COMPILE EVIDENCE
**FOR backporting:**
- Fixes a real kernel panic (BUG_ON) triggered by GFP_ATOMIC allocation
failure
- Code path is reachable from common btrfs operations (file writes with
qgroups)
- Small, surgical fix (~30 lines, 3 files, single subsystem)
- Reviewed by core btrfs developer Qu Wenruo
- Committed by btrfs maintainer David Sterba
- Buggy code exists in ALL active stable trees (since v4.11)
- Eliminates unnecessary GFP_ATOMIC allocation entirely (not just
handling failure differently)
- Very low regression risk - sentinel pattern is guarded by ASSERTs
**AGAINST backporting:**
- No explicit Cc: stable tag or Fixes: tag (expected - that's why it
needs review)
- No reported-by tag (the bug may be rare in practice)
- Introduces new API functions (`extent_changeset_init_bytes_only`,
`extent_changeset_tracks_ranges`, `EXTENT_CHANGESET_BYTES_ONLY`) -
though these are internal inline helpers
- May need minor adaptation for older stable trees
**UNRESOLVED:**
- Could not access lore.kernel.org discussion for stable nomination
signals
- No concrete report of this BUG_ON triggering in the wild
### Step 9.2: STABLE RULES CHECKLIST
1. Obviously correct and tested? **YES** - Reviewed by core developer,
clearly eliminates unnecessary allocation
2. Fixes a real bug? **YES** - BUG_ON/kernel panic from allocation
failure
3. Important issue? **YES** - kernel panic (CRITICAL)
4. Small and contained? **YES** - ~30 lines, single subsystem
5. No new features or APIs? **CORRECT** - internal inline helpers only,
no userspace-visible changes
6. Can apply to stable trees? **YES** - with minor adaptation
### Step 9.3: EXCEPTION CATEGORIES
Not an exception category - this is a direct bug fix.
### Step 9.4: DECISION
The commit fixes a real kernel panic (BUG_ON triggered by GFP_ATOMIC
allocation failure under spinlock) in a commonly-used btrfs code path
(qgroup free/release). The fix is small, obviously correct, reviewed by
the btrfs core team, and committed by the btrfs maintainer. The buggy
code exists in all active stable trees. The risk of regression is very
low.
## Verification
- [Phase 1] Parsed tags: Reviewed-by Qu Wenruo, SOB by Leo Martins and
David Sterba (btrfs maintainer)
- [Phase 2] Diff analysis: 3 files, ~30 lines. Adds sentinel to skip
ulist_add in bytes-only mode
- [Phase 2] Verified `qgroup_free_reserved_data` only reads
`changeset.bytes_changed` (line 4364), never iterates `range_changed`
- [Phase 2] Verified `__btrfs_qgroup_release_data` only reads
`changeset.bytes_changed` (lines 4402-4408), never iterates
`range_changed`
- [Phase 2] Verified `btrfs_qgroup_check_reserved_leak` DOES iterate
`range_changed` (line 4650) and correctly still uses
`extent_changeset_init`
- [Phase 3] git blame: `qgroup_free_reserved_data` buggy code from
bc42bda22345e (Qu Wenruo, 2017-02-27, v4.11)
- [Phase 3] git blame: BUG_ON moved to callers by commit 57599c7e7722
(David Sterba, 2018-03-01)
- [Phase 3] Confirmed BUG_ON at lines 398 and 570 of current extent-io-
tree.c
- [Phase 3] Confirmed `add_extent_changeset` called under
`spin_lock(&tree->lock)` (line 647 -> line 709/740/746)
- [Phase 5] Verified
`btrfs_qgroup_free_data`/`btrfs_qgroup_release_data` called from 25+
sites across btrfs
- [Phase 6] Confirmed buggy pattern (`extent_changeset_init` in these
functions) exists in v5.15, v6.1, v6.6
- [Phase 6] Confirmed BUG_ON on `add_extent_changeset` return exists in
v5.15, v6.6
- [Phase 7] Qu Wenruo and David Sterba verified as core btrfs
developers/maintainers
- [Phase 8] Failure mode: BUG_ON -> kernel panic, severity CRITICAL
- UNVERIFIED: Could not access lore.kernel.org to check for stable
nomination in review comments
- UNVERIFIED: No concrete real-world reports of this BUG_ON triggering
(does not invalidate the bug - GFP_ATOMIC failures are real)
**YES**
fs/btrfs/extent-io-tree.c | 3 +++
fs/btrfs/extent_io.h | 23 ++++++++++++++++++++++-
fs/btrfs/qgroup.c | 5 +++--
3 files changed, 28 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index d0dd50f7d2795..2a2bce0f1f7c8 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -193,7 +193,10 @@ static int add_extent_changeset(struct extent_state *state, u32 bits,
return 0;
if (!set && (state->state & bits) == 0)
return 0;
+
changeset->bytes_changed += state->end - state->start + 1;
+ if (!extent_changeset_tracks_ranges(changeset))
+ return 0;
return ulist_add(&changeset->range_changed, state->start, state->end, GFP_ATOMIC);
}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 8d05f1a58b7c3..080215352b7a1 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -196,6 +196,25 @@ static inline void extent_changeset_init(struct extent_changeset *changeset)
ulist_init(&changeset->range_changed);
}
+/*
+ * Sentinel value for range_changed.prealloc indicating that the changeset
+ * only tracks bytes_changed and does not record individual ranges. This
+ * avoids GFP_ATOMIC allocations inside add_extent_changeset() when the
+ * caller doesn't need to iterate the changed ranges afterwards.
+ */
+#define EXTENT_CHANGESET_BYTES_ONLY ((struct ulist_node *)1)
+
+static inline void extent_changeset_init_bytes_only(struct extent_changeset *changeset)
+{
+ changeset->bytes_changed = 0;
+ changeset->range_changed.prealloc = EXTENT_CHANGESET_BYTES_ONLY;
+}
+
+static inline bool extent_changeset_tracks_ranges(const struct extent_changeset *changeset)
+{
+ return changeset->range_changed.prealloc != EXTENT_CHANGESET_BYTES_ONLY;
+}
+
static inline struct extent_changeset *extent_changeset_alloc(void)
{
struct extent_changeset *ret;
@@ -210,6 +229,7 @@ static inline struct extent_changeset *extent_changeset_alloc(void)
static inline void extent_changeset_prealloc(struct extent_changeset *changeset, gfp_t gfp_mask)
{
+ ASSERT(extent_changeset_tracks_ranges(changeset));
ulist_prealloc(&changeset->range_changed, gfp_mask);
}
@@ -218,7 +238,8 @@ static inline void extent_changeset_release(struct extent_changeset *changeset)
if (!changeset)
return;
changeset->bytes_changed = 0;
- ulist_release(&changeset->range_changed);
+ if (extent_changeset_tracks_ranges(changeset))
+ ulist_release(&changeset->range_changed);
}
static inline void extent_changeset_free(struct extent_changeset *changeset)
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 41589ce663718..a95fa70def7f8 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -4326,7 +4326,7 @@ static int qgroup_free_reserved_data(struct btrfs_inode *inode,
u64 freed = 0;
int ret;
- extent_changeset_init(&changeset);
+ extent_changeset_init_bytes_only(&changeset);
len = round_up(start + len, root->fs_info->sectorsize);
start = round_down(start, root->fs_info->sectorsize);
@@ -4391,7 +4391,7 @@ static int __btrfs_qgroup_release_data(struct btrfs_inode *inode,
WARN_ON(!free && reserved);
if (free && reserved)
return qgroup_free_reserved_data(inode, reserved, start, len, released);
- extent_changeset_init(&changeset);
+ extent_changeset_init_bytes_only(&changeset);
ret = btrfs_clear_record_extent_bits(&inode->io_tree, start, start + len - 1,
EXTENT_QGROUP_RESERVED, &changeset);
if (ret < 0)
@@ -4646,6 +4646,7 @@ void btrfs_qgroup_check_reserved_leak(struct btrfs_inode *inode)
WARN_ON(ret < 0);
if (WARN_ON(changeset.bytes_changed)) {
+ ASSERT(extent_changeset_tracks_ranges(&changeset));
ULIST_ITER_INIT(&iter);
while ((unode = ulist_next(&changeset.range_changed, &iter))) {
btrfs_warn(inode->root->fs_info,
--
2.53.0
* [PATCH AUTOSEL 7.0-5.10] btrfs: replace BUG_ON() with error return in cache_save_setup()
[not found] <20260420132314.1023554-1-sashal@kernel.org>
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-5.15] btrfs: don't allow log trees to consume global reserve or overcommit metadata Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0] btrfs: avoid GFP_ATOMIC allocations in qgroup free paths Sasha Levin
@ 2026-04-20 13:19 ` Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-5.10] btrfs: be less aggressive with metadata overcommit when we can do full flushing Sasha Levin
` (3 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: Sasha Levin @ 2026-04-20 13:19 UTC (permalink / raw)
To: patches, stable
Cc: Teng Liu, Qu Wenruo, David Sterba, Sasha Levin, clm, linux-btrfs,
linux-kernel
From: Teng Liu <27rabbitlt@gmail.com>
[ Upstream commit 30d537f723d6f37a8ddfb17fe668bb9808f5b49f ]
In cache_save_setup(), if create_free_space_inode() succeeds but the
subsequent lookup_free_space_inode() still fails on retry, the
BUG_ON(retries) will crash the kernel. This can happen due to I/O
errors or transient failures, not just programming bugs.
Replace the BUG_ON with proper error handling that returns the original
error code through the existing cleanup path. The callers already handle
this gracefully: disk_cache_state defaults to BTRFS_DC_ERROR, so the
space cache simply won't be written for that block group.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Teng Liu <27rabbitlt@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
Now I have all the information I need. Let me compile the final
analysis.
---
## PHASE 1: COMMIT MESSAGE FORENSICS
### Step 1.1: Subject Line
- **Subsystem**: btrfs
- **Action verb**: "replace" (BUG_ON with error return)
- **Summary**: Replaces a kernel-crashing BUG_ON() with graceful error
handling in cache_save_setup()
Record: [btrfs] [replace] [Remove BUG_ON(retries) in cache_save_setup(),
replace with error return]
### Step 1.2: Tags
- **Reviewed-by**: Qu Wenruo <wqu@suse.com> - prominent btrfs
maintainer/reviewer
- **Signed-off-by**: Teng Liu <27rabbitlt@gmail.com> (author)
- **Signed-off-by**: David Sterba <dsterba@suse.com> (btrfs maintainer
who merged it)
- No Fixes: tag (expected for candidates)
- No Cc: stable (expected)
Record: Reviewed by key btrfs developer Qu Wenruo. Merged by David
Sterba, the btrfs maintainer.
### Step 1.3: Commit Body
The bug: If `create_free_space_inode()` succeeds but the subsequent
`lookup_free_space_inode()` still fails on retry (due to I/O errors or
transient failures), `BUG_ON(retries)` crashes the kernel. The callers
already handle errors gracefully - `disk_cache_state` defaults to
`BTRFS_DC_ERROR`, so the space cache simply won't be written for that
block group.
Record: Bug = kernel crash (BUG_ON) on transient I/O failures. Symptom =
kernel panic. Root cause = BUG_ON used for a condition that can happen
due to I/O errors, not just programming bugs.
### Step 1.4: Hidden Bug Fix Detection
This IS a bug fix - it prevents a kernel crash (BUG_ON → panic) from a
reachable error condition.
## PHASE 2: DIFF ANALYSIS
### Step 2.1: Inventory
- **Files changed**: 1 (fs/btrfs/block-group.c)
- **Lines**: +7 added, -1 removed (net +6 lines)
- **Function modified**: `cache_save_setup()`
- **Scope**: Single-file, surgical fix
### Step 2.2: Code Flow Change
**Before**: `BUG_ON(retries)` — if retries is non-zero (i.e., we already
tried once to create the inode and look it up again), the kernel
crashes.
**After**: If retries is non-zero, set `ret = PTR_ERR(inode)`, log an
error message, and `goto out_free` which flows through the existing
cleanup path. `dcs` remains `BTRFS_DC_ERROR` (its initial value), so
`block_group->disk_cache_state` will be set to `BTRFS_DC_ERROR`, and the
space cache simply won't be written for this block group.
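The before/after control flow can be sketched in userspace as follows. This is an illustrative model only: `struct fake_fs`, `lookup_inode()`, `create_inode()`, and `cache_setup()` are hypothetical names, and the lookup is made to fail with -ENOENT a configurable number of times to exercise the retry path.

```c
#include <errno.h>
#include <stdio.h>

enum cache_state { DC_ERROR, DC_SETUP };

/* Hypothetical filesystem stub: the first `fail_count` lookups fail. */
struct fake_fs {
	int fail_count;
	int lookups;
};

static int lookup_inode(struct fake_fs *fs)
{
	fs->lookups++;
	return fs->lookups <= fs->fail_count ? -ENOENT : 0;
}

static int create_inode(struct fake_fs *fs)
{
	(void)fs;
	return 0;	/* creation always succeeds in this model */
}

static int cache_setup(struct fake_fs *fs, enum cache_state *state)
{
	enum cache_state dcs = DC_ERROR;	/* pessimistic default */
	int retries = 0;
	int ret;

again:
	ret = lookup_inode(fs);
	if (ret < 0) {
		if (retries) {
			/* Second failure: log and bail out instead of crashing. */
			fprintf(stderr, "lookup failed after creation: %d\n", ret);
			goto out;
		}
		retries++;
		ret = create_inode(fs);
		if (ret)
			goto out;
		goto again;
	}
	dcs = DC_SETUP;
out:
	*state = dcs;	/* callers inspect the state, not the return value */
	return ret;
}
```

Because the state is initialised to the error value and only upgraded on success, every early exit leaves it in the "skip the cache write" condition without extra bookkeeping.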
### Step 2.3: Bug Mechanism
Category: **Logic/correctness fix** - replacing a crash assertion with
proper error handling. The BUG_ON asserts that a condition "cannot
happen," but it can happen due to I/O errors.
### Step 2.4: Fix Quality
- **Obviously correct**: Yes. The `out_free` path already exists and
handles exactly this case. The `dcs` variable defaults to
`BTRFS_DC_ERROR`.
- **Minimal/surgical**: Yes, only 7 lines added replacing 1 line.
- **Regression risk**: Very low. The error path is well-established and
callers check `disk_cache_state == BTRFS_DC_SETUP` before proceeding.
## PHASE 3: GIT HISTORY INVESTIGATION
### Step 3.1: Blame
The BUG_ON(retries) line was in commit `77745c05115fc` (2019), which was
a code migration. The actual BUG_ON was introduced in commit
`0af3d00bad38d` ("Btrfs: create special free space cache inode") from
2010, present since **v2.6.37**. This bug has been in the kernel for ~16
years.
### Step 3.2: Fixes Tag
No Fixes: tag to follow (expected).
### Step 3.3: Related Changes
- `8ac7fad32b930` (Feb 2026): Removed a pointless WARN_ON() in the same
function - shows the btrfs team is actively cleaning up this function.
- `719dc4b75561f`: Similar BUG_ON removal in
`btrfs_remove_block_group()`
- Many other BUG_ON removal commits in btrfs history
### Step 3.4: Author
Teng Liu (27rabbitlt) appears to be a relatively new contributor.
However, the patch was **Reviewed-by Qu Wenruo** and **Signed-off-by
David Sterba** (the btrfs maintainer), giving it strong credibility.
### Step 3.5: Dependencies
None. This is a completely standalone fix - it only changes one
conditional in one function, using existing error paths.
## PHASE 4: MAILING LIST RESEARCH
The patch was submitted as v1 and v2 on 2026-03-28, found in the
lore/LKML archive mirror. The v2 was the applied version. Reviewed-by Qu
Wenruo confirms it was peer-reviewed by a senior btrfs developer.
Record: Patch went through v1 → v2 revision. Reviewed by senior btrfs
developer.
## PHASE 5: CODE SEMANTIC ANALYSIS
### Step 5.1: Function Modified
`cache_save_setup()` - a static function in `fs/btrfs/block-group.c`.
### Step 5.2: Callers
Three callers, all in the same file:
1. `btrfs_setup_space_cache()` (line 3490) - ignores return value
2. `btrfs_start_dirty_block_groups()` (line 3577) - ignores return
value, checks `disk_cache_state`
3. `btrfs_write_dirty_block_groups()` (line 3729) - ignores return
value, checks `disk_cache_state`
All callers check `cache->disk_cache_state == BTRFS_DC_SETUP` before
proceeding with cache write. When `cache_save_setup()` fails, `dcs`
stays at `BTRFS_DC_ERROR`, so the callers gracefully skip the cache
write.
### Step 5.3-5.4: Call Chain
These functions are called during **transaction commit**
(`btrfs_commit_transaction`), a core kernel path that runs frequently
during normal btrfs filesystem operations.
## PHASE 6: STABLE TREE ANALYSIS
### Step 6.1: Buggy Code in Stable Trees
The BUG_ON(retries) was introduced in v2.6.37 (2010) and exists in **ALL
active stable trees** (5.10.y, 5.15.y, 6.1.y, 6.6.y, 6.12.y, etc.). The
code hasn't changed around this specific line since it was written.
### Step 6.2: Backport Complications
The patch should apply cleanly to all stable trees. The surrounding code
is unchanged since 2019 (when it was migrated from extent-tree.c to
block-group.c). For trees older than 5.3 (before migration), the file
would be `extent-tree.c` instead.
### Step 6.3: Related Fixes in Stable
No related fixes already in stable.
## PHASE 7: SUBSYSTEM CONTEXT
- **Subsystem**: btrfs (filesystem)
- **Criticality**: IMPORTANT - btrfs is a widely used filesystem
(default in openSUSE, Fedora)
- **Path**: Space cache management during transaction commit - a **core
btrfs operation**
## PHASE 8: IMPACT AND RISK ASSESSMENT
### Step 8.1: Affected Users
All btrfs users with space_cache v1 enabled (the default for many
configs) are affected.
### Step 8.2: Trigger Conditions
The BUG_ON triggers when:
1. A block group's free space cache needs to be written
2. The free space inode doesn't exist, so btrfs creates one
3. On retry lookup, the inode still can't be found (I/O error, transient
failure)
This can be triggered by I/O errors on the disk, which are real-world
events, especially on aging or failing hardware.
### Step 8.3: Failure Mode
**CRITICAL** - BUG_ON causes a kernel panic, crashing the system.
Without this fix, a transient I/O error during space cache setup causes
a full system crash instead of gracefully skipping the cache write.
### Step 8.4: Risk-Benefit Ratio
- **Benefit**: VERY HIGH - prevents kernel panic on I/O errors during
normal filesystem operation
- **Risk**: VERY LOW - 7-line change using existing error paths,
reviewed by btrfs maintainers
- **Ratio**: Strongly favorable for backporting
## PHASE 9: FINAL SYNTHESIS
### Step 9.1: Evidence
**FOR backporting:**
- Prevents kernel panic (BUG_ON → crash) on a reachable error condition
- Tiny, surgical fix (+7/-1 lines) in a single file
- Uses existing, well-tested error handling paths
- Bug present since v2.6.37 (2010) — affects ALL stable trees
- Reviewed by Qu Wenruo (senior btrfs developer), merged by David Sterba
(btrfs maintainer)
- No dependencies on other patches
- Callers already handle the error gracefully
- btrfs is a widely-used filesystem
- Pattern consistent with other BUG_ON removals in btrfs that have gone
to stable
**AGAINST backporting:**
- No reported syzbot trigger or specific user crash report cited
- The trigger condition (I/O error between create and lookup) may be
uncommon
### Step 9.2: Stable Rules Checklist
1. Obviously correct and tested? **YES** - trivial change, reviewed by
maintainer
2. Fixes a real bug? **YES** - kernel panic on I/O error
3. Important issue? **YES** - kernel crash (CRITICAL severity)
4. Small and contained? **YES** - 7 added lines, single file
5. No new features or APIs? **YES** - purely error handling
6. Can apply to stable? **YES** - the BUG_ON itself dates from 2010 and the
surrounding code is unchanged since the 2019 move to block-group.c
### Step 9.3: Exception Categories
Not needed - this qualifies as a straightforward bug fix.
## Verification
- [Phase 1] Parsed tags: Reviewed-by Qu Wenruo, SOB David Sterba (btrfs
maintainer)
- [Phase 2] Diff analysis: +7/-1 lines replacing BUG_ON(retries) with
error return + log message, uses existing `out_free` cleanup path
- [Phase 3] git blame: BUG_ON(retries) introduced in commit
`0af3d00bad38d` (v2.6.37-rc1, 2010), migrated in `77745c05115fc`
(2019)
- [Phase 3] git describe: confirmed original commit is in v2.6.37-rc1,
present in all stable trees
- [Phase 3] Related commits: `8ac7fad32b930` removed WARN_ON in same
function (Feb 2026), `719dc4b75561f` similar BUG_ON removal in btrfs
- [Phase 4] Found v1 and v2 patch submissions on lore mirror
(2026-03-28), v2 is the applied version
- [Phase 5] Verified callers: 3 call sites in same file, all ignore
return value and check `disk_cache_state == BTRFS_DC_SETUP` — error
case is handled gracefully
- [Phase 5] Verified `dcs` defaults to `BTRFS_DC_ERROR` (line 3316),
confirmed `out_free` path preserves this default
- [Phase 6] Code exists unchanged in all active stable trees (verified
via git log v6.6.. and git log v6.1..)
- [Phase 6] Patch should apply cleanly (code hasn't changed since 2019
migration)
- [Phase 8] Failure mode: BUG_ON → kernel panic during transaction
commit, severity CRITICAL
- UNVERIFIED: Could not fetch full lore.kernel.org discussion due to bot
protection; relied on web search confirmation of review
The fix is small, surgical, obviously correct, prevents a kernel crash,
uses existing error paths, and was reviewed and merged by the btrfs
maintainers. It meets all stable kernel criteria.
**YES**
fs/btrfs/block-group.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index c0d17a369bda5..ccabcad1a3fc3 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -3343,7 +3343,13 @@ static int cache_save_setup(struct btrfs_block_group *block_group,
}
if (IS_ERR(inode)) {
- BUG_ON(retries);
+ if (retries) {
+ ret = PTR_ERR(inode);
+ btrfs_err(fs_info,
+ "failed to lookup free space inode after creation for block group %llu: %d",
+ block_group->start, ret);
+ goto out_free;
+ }
retries++;
if (block_group->ro)
--
2.53.0
* [PATCH AUTOSEL 7.0-5.10] btrfs: be less aggressive with metadata overcommit when we can do full flushing
[not found] <20260420132314.1023554-1-sashal@kernel.org>
` (2 preceding siblings ...)
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-5.10] btrfs: replace BUG_ON() with error return in cache_save_setup() Sasha Levin
@ 2026-04-20 13:19 ` Sasha Levin
2026-04-22 12:24 ` Aleksandar Gerasimovski
2026-04-20 13:20 ` [PATCH AUTOSEL 6.18] btrfs: fix zero size inode with non-zero size after log replay Sasha Levin
` (2 subsequent siblings)
6 siblings, 1 reply; 10+ messages in thread
From: Sasha Levin @ 2026-04-20 13:19 UTC (permalink / raw)
To: patches, stable
Cc: Filipe Manana, Aleksandar Gerasimovski, Qu Wenruo, David Sterba,
Sasha Levin, clm, linux-btrfs, linux-kernel
From: Filipe Manana <fdmanana@suse.com>
[ Upstream commit 574d93fc62e2b03ab39c8f92fb44ded89ca6274d ]
Over the years we often get reports of some -ENOSPC failure while updating
metadata that leads to a transaction abort. I have seen this happen for
filesystems of all sizes and with workloads that are very user/customer
specific and unable to reproduce, but Aleksandar recently reported a
simple way to reproduce this with a 1G filesystem and using the bonnie++
benchmark tool. The following test script reproduces the failure:
$ cat test.sh
#!/bin/bash
# Create and use a 1G null block device, memory backed, otherwise
# the test takes a very long time.
modprobe null_blk nr_devices="0"
null_dev="/sys/kernel/config/nullb/nullb0"
mkdir "$null_dev"
size=$((1 * 1024)) # in MB
echo 2 > "$null_dev/submit_queues"
echo "$size" > "$null_dev/size"
echo 1 > "$null_dev/memory_backed"
echo 1 > "$null_dev/discard"
echo 1 > "$null_dev/power"
DEV=/dev/nullb0
MNT=/mnt/nullb0
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/test/
bonnie++ -d $MNT/test/ -m BTRFS -u 0 -s 256M -r 128M -b
umount $MNT
echo 0 > "$null_dev/power"
rmdir "$null_dev"
When running this bonnie++ fails in the phase where it deletes test
directories and files:
$ ./test.sh
(...)
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...Can't sync directory, turning off dir-sync.
Can't delete file 9Bq7sr0000000338
Cleaning up test directory after error.
Bonnie: drastic I/O error (rmdir): Read-only file system
And in the syslog/dmesg we can see the following transaction abort trace:
[161915.501506] BTRFS warning (device nullb0): Skipping commit of aborted transaction.
[161915.502983] ------------[ cut here ]------------
[161915.503832] BTRFS: Transaction aborted (error -28)
[161915.504748] WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
[161915.506786] Modules linked in: btrfs dm_zero dm_snapshot (...)
[161915.518759] CPU: 11 UID: 0 PID: 3377975 Comm: bonnie++ Tainted: G W 6.19.0-rc7-btrfs-next-224+ #4 PREEMPT(full)
[161915.520857] Tainted: [W]=WARN
[161915.521405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[161915.523414] RIP: 0010:btrfs_commit_transaction+0xa24/0xd30 [btrfs]
[161915.524630] Code: 48 8b 7c 24 (...)
[161915.526982] RSP: 0018:ffffd3fe8206fda8 EFLAGS: 00010292
[161915.527707] RAX: 0000000000000002 RBX: ffff8f4886d3c000 RCX: 0000000000000000
[161915.528723] RDX: 0000000002040001 RSI: 00000000ffffffe4 RDI: ffffffffc088f780
[161915.529691] RBP: ffff8f4f5adae7e0 R08: 0000000000000000 R09: ffffd3fe8206fb90
[161915.530842] R10: ffff8f4f9c1fffa8 R11: 0000000000000003 R12: 00000000ffffffe4
[161915.532027] R13: ffff8f4ef2cf2400 R14: ffff8f4f5adae708 R15: ffff8f4f62d18000
[161915.533229] FS: 00007ff93112a780(0000) GS:ffff8f4ff63ee000(0000) knlGS:0000000000000000
[161915.534611] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[161915.535575] CR2: 00005571b3072000 CR3: 0000000176080005 CR4: 0000000000370ef0
[161915.536758] Call Trace:
[161915.537185] <TASK>
[161915.537575] btrfs_sync_file+0x431/0x530 [btrfs]
[161915.538473] do_fsync+0x39/0x80
[161915.539042] __x64_sys_fsync+0xf/0x20
[161915.539750] do_syscall_64+0x50/0xf20
[161915.540396] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[161915.541301] RIP: 0033:0x7ff930ca49ee
[161915.541904] Code: 08 0f 85 f5 (...)
[161915.544830] RSP: 002b:00007ffd94291f38 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[161915.546152] RAX: ffffffffffffffda RBX: 00007ff93112a780 RCX: 00007ff930ca49ee
[161915.547263] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
[161915.548383] RBP: 0000000000000dab R08: 0000000000000000 R09: 0000000000000000
[161915.549853] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd94291fb0
[161915.551196] R13: 00007ffd94292350 R14: 0000000000000001 R15: 00007ffd94292340
[161915.552161] </TASK>
[161915.552457] ---[ end trace 0000000000000000 ]---
[161915.553232] BTRFS info (device nullb0 state A): dumping space info:
[161915.553236] BTRFS info (device nullb0 state A): space_info DATA (sub-group id 0) has 12582912 free, is not full
[161915.553239] BTRFS info (device nullb0 state A): space_info total=12582912, used=0, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
[161915.553243] BTRFS info (device nullb0 state A): space_info METADATA (sub-group id 0) has -5767168 free, is full
[161915.553245] BTRFS info (device nullb0 state A): space_info total=53673984, used=6635520, pinned=46956544, reserved=16384, may_use=5767168, readonly=65536 zone_unusable=0
[161915.553251] BTRFS info (device nullb0 state A): space_info SYSTEM (sub-group id 0) has 8355840 free, is not full
[161915.553254] BTRFS info (device nullb0 state A): space_info total=8388608, used=16384, pinned=16384, reserved=0, may_use=0, readonly=0 zone_unusable=0
[161915.553257] BTRFS info (device nullb0 state A): global_block_rsv: size 5767168 reserved 5767168
[161915.553261] BTRFS info (device nullb0 state A): trans_block_rsv: size 0 reserved 0
[161915.553263] BTRFS info (device nullb0 state A): chunk_block_rsv: size 0 reserved 0
[161915.553265] BTRFS info (device nullb0 state A): remap_block_rsv: size 0 reserved 0
[161915.553268] BTRFS info (device nullb0 state A): delayed_block_rsv: size 0 reserved 0
[161915.553270] BTRFS info (device nullb0 state A): delayed_refs_rsv: size 0 reserved 0
[161915.553272] BTRFS: error (device nullb0 state A) in cleanup_transaction:2045: errno=-28 No space left
[161915.554463] BTRFS info (device nullb0 state EA): forced readonly
The problem is that we allow for a very aggressive metadata overcommit,
about 1/8th of the currently available space, even when the task
attempting the reservation allows for full flushing. Over time this allows
more and more tasks to overcommit and join the same transaction without
triggering a transaction commit to release pinned extents, eventually
leading to the transaction abort when attempting some tree update, as the
extent allocator is not able to find any available metadata extent and it's
not able to allocate a new metadata block group either (not enough
unallocated space for that).
Fix this by allowing the overcommit to be up to 1/64th of the available
(unallocated) space instead and for that limit to apply to both types of
full flushing, BTRFS_RESERVE_FLUSH_ALL and BTRFS_RESERVE_FLUSH_ALL_STEAL.
This way we get more frequent transaction commits to release pinned
extents in case our caller is in a context where full flushing is allowed.
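The change in the overcommit limit can be illustrated with a small userspace model. This is a sketch under assumptions: the enum and function names are hypothetical, the block group profile factor is ignored, and the 1/2 limit used for the non-full-flushing modes is not stated in this message and is assumed here.

```c
#include <stdint.h>

/* Hypothetical mirror of the reservation flush modes. */
enum flush_mode { NO_FLUSH, FLUSH_LIMIT, FLUSH_ALL, FLUSH_ALL_STEAL };

/* Before the fix: 1/8th for FLUSH_ALL, 1/2 (assumed) for everything else. */
static uint64_t overcommit_limit_before(uint64_t unallocated, enum flush_mode f)
{
	return f == FLUSH_ALL ? unallocated >> 3 : unallocated >> 1;
}

/* After the fix: 1/64th for both full flushing modes. */
static uint64_t overcommit_limit_after(uint64_t unallocated, enum flush_mode f)
{
	if (f == FLUSH_ALL || f == FLUSH_ALL_STEAL)
		return unallocated >> 6;
	return unallocated >> 1;	/* assumed unchanged for other modes */
}
```

With 1G of unallocated space the full-flushing limit drops from 128M to 16M in this model, so a task that is allowed to flush hits the limit and forces a commit much sooner.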
Note that the space infos dump in the dmesg/syslog right after the
transaction abort gives the wrong idea that we had plenty of unallocated
space when the abort happened. During the bonnie++ workload we had a
metadata chunk allocation attempt and it failed with -ENOSPC because at
that time we had a bunch of data block groups allocated, which then became
empty and got deleted by the cleaner kthread after the metadata chunk
allocation failed with -ENOSPC and before the transaction abort happened
and dumped the space infos.
The custom tracing (some trace_printk() calls spread in strategic places)
used to check that:
mount-1793735 [011] ...1. 28877.261096: btrfs_add_bg_to_space_info: added bg offset 13631488 length 8388608 flags 1 to space_info->flags 1 total_bytes 8388608 bytes_used 0 bytes_may_use 0
mount-1793735 [011] ...1. 28877.261098: btrfs_add_bg_to_space_info: added bg offset 22020096 length 8388608 flags 34 to space_info->flags 2 total_bytes 8388608 bytes_used 16384 bytes_may_use 0
mount-1793735 [011] ...1. 28877.261100: btrfs_add_bg_to_space_info: added bg offset 30408704 length 53673984 flags 36 to space_info->flags 4 total_bytes 53673984 bytes_used 131072 bytes_may_use 0
These are from loading the block groups created by mkfs during mount.
Then when bonnie++ starts doing its thing:
kworker/u48:5-1792004 [011] ..... 28886.122050: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
kworker/u48:5-1792004 [011] ..... 28886.122053: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 927596544
kworker/u48:5-1792004 [011] ..... 28886.122055: btrfs_make_block_group: make bg offset 84082688 size 117440512 type 1
kworker/u48:5-1792004 [011] ...1. 28886.122064: btrfs_add_bg_to_space_info: added bg offset 84082688 length 117440512 flags 1 to space_info->flags 1 total_bytes 125829120 bytes_used 0 bytes_may_use 5251072
First allocation of a data block group of 112M.
kworker/u48:5-1792004 [011] ..... 28886.192408: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
kworker/u48:5-1792004 [011] ..... 28886.192413: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 810156032
kworker/u48:5-1792004 [011] ..... 28886.192415: btrfs_make_block_group: make bg offset 201523200 size 117440512 type 1
kworker/u48:5-1792004 [011] ...1. 28886.192425: btrfs_add_bg_to_space_info: added bg offset 201523200 length 117440512 flags 1 to space_info->flags 1 total_bytes 243269632 bytes_used 0 bytes_may_use 122691584
Another 112M data block group allocated.
kworker/u48:5-1792004 [011] ..... 28886.260935: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
kworker/u48:5-1792004 [011] ..... 28886.260941: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 692715520
kworker/u48:5-1792004 [011] ..... 28886.260943: btrfs_make_block_group: make bg offset 318963712 size 117440512 type 1
kworker/u48:5-1792004 [011] ...1. 28886.260954: btrfs_add_bg_to_space_info: added bg offset 318963712 length 117440512 flags 1 to space_info->flags 1 total_bytes 360710144 bytes_used 0 bytes_may_use 240132096
Yet another one.
bonnie++-1793755 [010] ..... 28886.280407: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
bonnie++-1793755 [010] ..... 28886.280412: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 575275008
bonnie++-1793755 [010] ..... 28886.280414: btrfs_make_block_group: make bg offset 436404224 size 117440512 type 1
bonnie++-1793755 [010] ...1. 28886.280419: btrfs_add_bg_to_space_info: added bg offset 436404224 length 117440512 flags 1 to space_info->flags 1 total_bytes 478150656 bytes_used 0 bytes_may_use 268435456
One more.
kworker/u48:5-1792004 [011] ..... 28886.566233: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
kworker/u48:5-1792004 [011] ..... 28886.566238: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 457834496
kworker/u48:5-1792004 [011] ..... 28886.566241: btrfs_make_block_group: make bg offset 553844736 size 117440512 type 1
kworker/u48:5-1792004 [011] ...1. 28886.566250: btrfs_add_bg_to_space_info: added bg offset 553844736 length 117440512 flags 1 to space_info->flags 1 total_bytes 595591168 bytes_used 268435456 bytes_may_use 209723392
Another one.
bonnie++-1793755 [009] ..... 28886.613446: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
bonnie++-1793755 [009] ..... 28886.613451: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 340393984
bonnie++-1793755 [009] ..... 28886.613453: btrfs_make_block_group: make bg offset 671285248 size 117440512 type 1
bonnie++-1793755 [009] ...1. 28886.613458: btrfs_add_bg_to_space_info: added bg offset 671285248 length 117440512 flags 1 to space_info->flags 1 total_bytes 713031680 bytes_used 268435456 bytes_may_use 268435456
Another one.
bonnie++-1793755 [009] ..... 28886.674953: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
bonnie++-1793755 [009] ..... 28886.674957: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 222953472
bonnie++-1793755 [009] ..... 28886.674959: btrfs_make_block_group: make bg offset 788725760 size 117440512 type 1
bonnie++-1793755 [009] ...1. 28886.674963: btrfs_add_bg_to_space_info: added bg offset 788725760 length 117440512 flags 1 to space_info->flags 1 total_bytes 830472192 bytes_used 268435456 bytes_may_use 134217728
Another one.
bonnie++-1793755 [009] ..... 28886.674981: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
bonnie++-1793755 [009] ..... 28886.674982: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 105512960
bonnie++-1793755 [009] ..... 28886.674983: btrfs_make_block_group: make bg offset 906166272 size 105512960 type 1
bonnie++-1793755 [009] ...1. 28886.674984: btrfs_add_bg_to_space_info: added bg offset 906166272 length 105512960 flags 1 to space_info->flags 1 total_bytes 935985152 bytes_used 268435456 bytes_may_use 67108864
Another one, but a bit smaller (~100.6M) since we now have less space.
bonnie++-1793758 [009] ..... 28891.962096: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
bonnie++-1793758 [009] ..... 28891.962103: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 12582912
bonnie++-1793758 [009] ..... 28891.962105: btrfs_make_block_group: make bg offset 1011679232 size 12582912 type 1
bonnie++-1793758 [009] ...1. 28891.962114: btrfs_add_bg_to_space_info: added bg offset 1011679232 length 12582912 flags 1 to space_info->flags 1 total_bytes 948568064 bytes_used 268435456 bytes_may_use 8192
Another one, this one even smaller (12M).
kworker/u48:5-1792004 [011] ..... 28892.112802: btrfs_chunk_alloc: enter first metadata chunk alloc attempt
kworker/u48:5-1792004 [011] ..... 28892.112805: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 131072 dev_extent_want 536870912
kworker/u48:5-1792004 [011] ..... 28892.112806: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 131072 dev_extent_want 536870912 max_avail 0
536870912 is 512M, the standard 256M metadata chunk size times 2 because
of the DUP profile for metadata.
'max_avail' is what find_free_dev_extent() returns to us in
gather_device_info().
As a result, gather_device_info() sets ctl->ndevs to 0, making
decide_stripe_size() fail with -ENOSPC, and therefore metadata chunk
allocation fails while we are attempting to run delayed items during
the transaction commit.
kworker/u48:5-1792004 [011] ..... 28892.112807: btrfs_create_chunk: decide_stripe_size fail -ENOSPC
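The failure chain described above can be modelled roughly as follows. This is a simplified sketch for a single device, assuming one usable device is enough for the DUP profile; `struct alloc_ctl` and both functions are hypothetical stand-ins that only borrow the field names seen in the trace.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical reduction of the chunk allocation control structure. */
struct alloc_ctl {
	uint64_t dev_extent_min;	/* 131072 in the trace */
	uint64_t dev_extent_want;	/* 536870912: 256M chunk x 2 for DUP */
	int ndevs;
};

/* A device only counts if its largest free extent meets the minimum. */
static void gather_device(struct alloc_ctl *ctl, uint64_t max_avail)
{
	ctl->ndevs = max_avail >= ctl->dev_extent_min ? 1 : 0;
}

/* With no usable device there is nothing to size a stripe from. */
static int decide_stripe(const struct alloc_ctl *ctl)
{
	return ctl->ndevs == 0 ? -ENOSPC : 0;
}
```

Feeding in the traced `max_avail 0` leaves `ndevs` at 0 and yields -ENOSPC regardless of how much was wanted.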
In the syslog/dmesg pasted above, which happened after the transaction was
aborted, the space info dumps did not account for all these data block
groups that were allocated during bonnie++'s workload. And that is because
after the metadata chunk allocation failed with -ENOSPC and before the
transaction abort happened, most of the data block groups had become empty
and got deleted by the cleaner kthread - when the abort happened, we
had bonnie++ in the middle of deleting the files it created.
But dumping the space infos right after the metadata chunk allocation fails
by adding a call to btrfs_dump_space_info_for_trans_abort() in
decide_stripe_size() when it returns -ENOSPC, we get:
[29972.409295] BTRFS info (device nullb0): dumping space info:
[29972.409300] BTRFS info (device nullb0): space_info DATA (sub-group id 0) has 673341440 free, is not full
[29972.409303] BTRFS info (device nullb0): space_info total=948568064, used=0, pinned=275226624, reserved=0, may_use=0, readonly=0 zone_unusable=0
[29972.409305] BTRFS info (device nullb0): space_info METADATA (sub-group id 0) has 3915776 free, is not full
[29972.409306] BTRFS info (device nullb0): space_info total=53673984, used=163840, pinned=42827776, reserved=147456, may_use=6553600, readonly=65536 zone_unusable=0
[29972.409308] BTRFS info (device nullb0): space_info SYSTEM (sub-group id 0) has 7979008 free, is not full
[29972.409310] BTRFS info (device nullb0): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=393216, readonly=0 zone_unusable=0
[29972.409311] BTRFS info (device nullb0): global_block_rsv: size 5767168 reserved 5767168
[29972.409313] BTRFS info (device nullb0): trans_block_rsv: size 0 reserved 0
[29972.409314] BTRFS info (device nullb0): chunk_block_rsv: size 393216 reserved 393216
[29972.409315] BTRFS info (device nullb0): remap_block_rsv: size 0 reserved 0
[29972.409316] BTRFS info (device nullb0): delayed_block_rsv: size 0 reserved 0
So here we see there's ~904.6M of data space, ~51.2M of metadata space and
8M of system space, making a total of 963.8M.
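The totals quoted above can be re-derived from the "total=" fields of the dump. The following is only a small user-space sketch of that arithmetic; the helper names are made up for illustration and are not kernel functions.

```c
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/* Sum the per-type "total=" values (DATA, METADATA, SYSTEM) from the dump. */
uint64_t total_allocated(uint64_t data, uint64_t metadata, uint64_t system)
{
	return data + metadata + system;
}

/*
 * Convert bytes to tenths of a MiB, rounded to nearest, so that a
 * result of 9638 reads as 963.8M. Integer math avoids float rounding.
 */
uint64_t bytes_to_tenths_of_mib(uint64_t bytes)
{
	return (bytes * 10 + MIB / 2) / MIB;
}
```

With the dump's values (948568064, 53673984 and 8388608 bytes) the helpers give 904.6M, 51.2M and 963.8M respectively, matching the figures in the text.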
Reported-by: Aleksandar Gerasimovski <Aleksandar.Gerasimovski@belden.com>
Link: https://lore.kernel.org/linux-btrfs/SA1PR18MB56922F690C5EC2D85371408B998FA@SA1PR18MB5692.namprd18.prod.outlook.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H61vZ3_+eqJ1A9po2WcgNJJjUu9MJQoYB2oDSAAecHaug@mail.gmail.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
I now have enough information for a thorough analysis. Let me compile my
findings.
---
## PHASE 1: COMMIT MESSAGE FORENSICS
### Step 1.1: Subject Line
- **Subsystem prefix:** `btrfs`
- **Action verb:** "be less aggressive" (behavioral adjustment)
- **Summary:** Reduce metadata overcommit aggressiveness when full
flushing is possible, to avoid -ENOSPC transaction aborts.
- Record: [btrfs] [behavioral fix] [reduce overcommit to prevent
transaction abort -ENOSPC]
### Step 1.2: Tags
- **Reported-by:** Aleksandar Gerasimovski (user report with a
reproducible test case)
- **Link 1:** lore bug report thread
- **Link 2:** lore follow-up discussion
- **Reviewed-by:** Qu Wenruo (core btrfs developer)
- **Signed-off-by:** Filipe Manana (author, prominent btrfs developer),
David Sterba (btrfs maintainer)
- No Fixes: tag (expected for candidates under review)
- No Cc: stable (expected)
- Record: User-reported with reproduction steps, reviewed by a key btrfs
developer, signed-off by the btrfs maintainer.
### Step 1.3: Commit Body Analysis
The commit describes a transaction abort with -ENOSPC (error -28) during
bonnie++ workload on a 1G filesystem. The abort forces the filesystem
read-only. The detailed trace shows `btrfs_commit_transaction` aborting
at line 2045 with the call path `btrfs_sync_file -> do_fsync ->
__x64_sys_fsync`. The author explains that the overly generous 1/8
overcommit allows too many tasks to overcommit without triggering
transaction commits that would release pinned extents, eventually
leading to metadata exhaustion and transaction abort. Includes custom
tracing evidence of block group allocation behavior leading up to the
failure.
- Record: Real bug manifesting as filesystem going read-only
(transaction abort with -ENOSPC) during normal workload on small
filesystem. Root cause: too-aggressive metadata overcommit allows too
many tasks to bypass flushing, resulting in no free metadata extents
and no unallocated space for new metadata chunks.
### Step 1.4: Hidden Bug Fix Detection
This is not a hidden fix - it is clearly described as fixing a
transaction abort bug. The words "Fix this by" are explicitly used.
Record: This IS a direct bug fix.
---
## PHASE 2: DIFF ANALYSIS
### Step 2.1: Inventory
- **Files changed:** `fs/btrfs/space-info.c` (1 file)
- **Lines changed:** 3 lines modified (1 comment change, 2 logic
changes)
- **Functions modified:** `calc_available_free_space()`
- **Scope:** Single-file, surgical fix
### Step 2.2: Code Flow Change
Before:
- When `flush == BTRFS_RESERVE_FLUSH_ALL`, overcommit limit was `avail
>> 3` (1/8 of available)
- `BTRFS_RESERVE_FLUSH_ALL_STEAL` fell through to `else` branch: `avail
>> 1` (1/2 of available)
After:
- When `flush == BTRFS_RESERVE_FLUSH_ALL || flush ==
BTRFS_RESERVE_FLUSH_ALL_STEAL`, overcommit limit is `avail >> 6` (1/64
of available)
- This is more conservative, forcing earlier transaction commits
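The before/after behavior described in this step can be modeled in user space as below. This is a minimal sketch that mirrors only the shift logic shown in the diff; the enum and function names are hypothetical stand-ins, not the kernel's `enum btrfs_reserve_flush_enum` or `calc_available_free_space()`.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's flush modes. */
enum flush_mode {
	FLUSH_NONE,      /* any mode that cannot do full flushing */
	FLUSH_ALL,       /* models BTRFS_RESERVE_FLUSH_ALL */
	FLUSH_ALL_STEAL, /* models BTRFS_RESERVE_FLUSH_ALL_STEAL */
};

/* Pre-patch: only FLUSH_ALL is limited to 1/8, everything else gets 1/2. */
uint64_t overcommit_limit_before(uint64_t avail, enum flush_mode flush)
{
	if (flush == FLUSH_ALL)
		return avail >> 3;
	return avail >> 1;
}

/* Post-patch: both full-flush modes are limited to 1/64. */
uint64_t overcommit_limit_after(uint64_t avail, enum flush_mode flush)
{
	if (flush == FLUSH_ALL || flush == FLUSH_ALL_STEAL)
		return avail >> 6;
	return avail >> 1;
}
```

For an illustrative 64M of available space, the limit for `FLUSH_ALL` drops from 8M to 1M, and the limit for `FLUSH_ALL_STEAL` drops from 32M to 1M, while modes without full flushing keep the 1/2 limit.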
### Step 2.3: Bug Mechanism
This is a **logic/correctness fix**. The overcommit threshold was too
generous, allowing too many tasks to avoid triggering the space flushing
machinery, which would commit transactions and unpin extents. This
eventually exhausted metadata space with no recovery path.
Two bugs fixed:
1. `BTRFS_RESERVE_FLUSH_ALL_STEAL` was falling into the "else" (1/2
overcommit) branch — far too generous for a flush type that CAN do
full flushing.
2. Even `BTRFS_RESERVE_FLUSH_ALL` at 1/8 was too aggressive for small
filesystems.
### Step 2.4: Fix Quality
- Minimal and obviously correct — reducing overcommit thresholds is safe
- Well-understood mechanism with detailed analysis in commit message
- Regression risk: slightly more frequent transaction commits under
  metadata space pressure (performance trade-off, not a correctness regression)
- The author is Filipe Manana, one of the most prolific btrfs developers
Record: Very high quality, obviously correct, minimal scope.
---
## PHASE 3: GIT HISTORY
### Step 3.1: Blame
The buggy code (`avail >>= 3` / `avail >>= 1`) was introduced in commit
`41783ef24d56ce` ("btrfs: move and export can_overcommit") by Josef
Bacik, merged in v5.4. The code has been in every kernel since v5.4.
### Step 3.2: No Fixes: tag — skipped as expected.
### Step 3.3: File History
`fs/btrfs/space-info.c` has ~90 changes since v6.6 but the specific
`calc_available_free_space()` function's overcommit logic has only been
touched by:
- `cb6cbab79055c` (v6.7, adjusted overcommit for "very close to full"
condition)
- `64d2c847ba380` (v6.10, zoned fix)
- Various argument refactoring (fs_info removal)
The current patch touches only the two lines at the `>>= 3` / `>>= 1`
branch which have been stable since v5.4.
### Step 3.4: Author
Filipe Manana is one of the most active btrfs contributors with hundreds
of commits. He regularly fixes space reservation bugs and is deeply
familiar with the overcommit subsystem.
### Step 3.5: Dependencies
The patch is standalone. The only dependency is the existence of
`BTRFS_RESERVE_FLUSH_ALL_STEAL`, which was added in commit
`7f9fe61440769` and confirmed present in all stable trees back to v5.10.
---
## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH
Lore.kernel.org has bot protection enabled, preventing direct access.
However:
- The commit has two Link: tags referencing mailing list discussions
- The commit was reviewed by Qu Wenruo and signed-off by David Sterba
- The commit message includes the original user report from Aleksandar
Gerasimovski
Record: Could not access lore directly. The commit has proper review
chain and user report.
---
## PHASE 5: CODE SEMANTIC ANALYSIS
### Step 5.1: Function Modified
`calc_available_free_space()` — computes how much overcommit is allowed
for metadata.
### Step 5.2: Callers
1. `check_can_overcommit()` → called by `can_overcommit()` and
`btrfs_can_overcommit()`
2. `btrfs_calc_reclaim_metadata_size()` — reclaim size calculation
3. `need_preemptive_reclaim()` — decides if preemptive reclaim is needed
These are called during **every metadata reservation** in btrfs.
This is a hot path for all btrfs operations.
### Step 5.3-5.4: Call Chain
`reserve_bytes()` → `can_overcommit()` → `check_can_overcommit()` →
`calc_available_free_space()`
This is reachable from any filesystem operation that reserves metadata
(file creation, deletion, modification, etc.).
### Step 5.5: Similar Patterns
The earlier commit `cb6cbab79055c` addressed a related but different
aspect of overcommit (when very close to full). This patch addresses the
general case.
---
## PHASE 6: STABLE TREE ANALYSIS
### Step 6.1: Buggy Code in Stable Trees
Verified the EXACT same code pattern exists in ALL active stable trees:
- v5.10: same code at line 327
- v5.15: same code at line 324
- v6.1: same code at line 372
- v6.6: same code at line 373
- v6.12: same code at line 421
`BTRFS_RESERVE_FLUSH_ALL_STEAL` confirmed present in v5.10+.
### Step 6.2: Backport Complications
The surrounding context has minor differences (e.g., the zoned mode
alignment was added in v6.10, function signature changed in v6.13+) but
the actual 3-line change applies to code that is IDENTICAL across all
stable trees. Minor context adjustment may be needed for the surrounding
lines (no zoned block in older trees), but the core logic change is
trivially backportable.
### Step 6.3: No related fix already in stable.
---
## PHASE 7: SUBSYSTEM CONTEXT
### Step 7.1: Subsystem and Criticality
- **Subsystem:** `fs/btrfs` — filesystem
- **Criticality:** IMPORTANT — btrfs is a widely-used filesystem,
especially in enterprise (SLES, openSUSE) and desktop Linux. Metadata
ENOSPC bugs cause data loss risk (filesystem goes read-only).
### Step 7.2: Activity
btrfs/space-info.c is very actively maintained with frequent
improvements and fixes.
---
## PHASE 8: IMPACT AND RISK ASSESSMENT
### Step 8.1: Affected Users
All btrfs users, especially those with smaller filesystems (1G-8G) under
heavy workloads. This is common in containers, VMs, embedded systems,
and IoT devices.
### Step 8.2: Trigger Conditions
- Normal file operations (create/delete files) on a filesystem that has
most of its space allocated to data
- Reproducible with bonnie++ on a 1G filesystem
- No special privileges needed — any user writing files can trigger this
### Step 8.3: Failure Mode
**CRITICAL** — Transaction abort with -ENOSPC forces the filesystem into
read-only mode. This means:
- Active writes fail
- The filesystem must be unmounted/remounted to recover
- Potential data loss if writes were in progress
- User sees "Read-only file system" errors
### Step 8.4: Risk-Benefit Ratio
- **Benefit:** HIGH — prevents filesystem going read-only on small
filesystems under normal workloads
- **Risk:** VERY LOW — 3-line change reducing an overcommit threshold;
the only behavioral change is slightly more frequent transaction
commits, which is a minor performance trade-off with no correctness
risk
- **Ratio:** Strongly favorable
---
## PHASE 9: FINAL SYNTHESIS
### Step 9.1: Evidence Summary
**FOR backporting:**
- Fixes a real, reproducible bug (filesystem goes read-only)
- Failure mode is CRITICAL (transaction abort, forced read-only)
- Extremely small patch (3 lines in 1 file)
- Reviewed by core btrfs developer (Qu Wenruo), signed off by maintainer
(David Sterba)
- Author is one of the most prolific btrfs developers (Filipe Manana)
- User-reported with clear reproduction case
- Buggy code exists in ALL stable trees (v5.10+)
- Fix applies cleanly with trivial context adjustment
- No new features or APIs
- The FLUSH_ALL_STEAL case falling through to 1/2 overcommit was clearly
a bug
**AGAINST backporting:**
- This is a behavioral/heuristic change rather than a strictly "wrong
code" fix
- Could theoretically affect performance on large filesystems (more
frequent commits)
- Changes a tuning constant (1/8 → 1/64) which is somewhat subjective
- No Fixes: tag (but this is expected)
### Step 9.2: Stable Rules Checklist
1. Obviously correct and tested? **YES** — reproduced, analyzed with
tracing, reviewed
2. Fixes a real bug? **YES** — transaction abort forcing read-only
filesystem
3. Important issue? **YES** — filesystem goes read-only = CRITICAL
4. Small and contained? **YES** — 3 lines in 1 file
5. No new features? **YES** — changes an existing threshold
6. Can apply to stable? **YES** — verified same code in all stable trees
### Step 9.3: Exception Category
Not applicable — this is a standard bug fix.
### Step 9.4: Decision
The fix is small, surgical, obviously correct, and addresses a CRITICAL
failure mode (filesystem forced read-only). The risk is minimal
(slightly more conservative overcommit policy). The benefit-to-risk
ratio is extremely favorable.
---
## Verification
- [Phase 1] Parsed tags: Reported-by from Aleksandar Gerasimovski,
Reviewed-by Qu Wenruo, two lore Links
- [Phase 2] Diff analysis: 3-line change in
`calc_available_free_space()` — shifts avail from >>3 to >>6 for full-
flush cases, and adds FLUSH_ALL_STEAL to the condition
- [Phase 3] git blame: buggy code introduced in commit 41783ef24d56ce
(v5.4), present in all stable trees
- [Phase 3] Verified FLUSH_ALL_STEAL exists in v5.10+ (commit
7f9fe61440769)
- [Phase 3] Author Filipe Manana has hundreds of btrfs commits —
confirmed core developer
- [Phase 4] lore.kernel.org blocked by bot protection — could not access
mailing list directly
- [Phase 5] Traced callers: `calc_available_free_space()` called through
`can_overcommit()` from `reserve_bytes()` — on every metadata
reservation path
- [Phase 6] Verified exact same code pattern in v5.10, v5.15, v6.1,
v6.6, v6.12 trees
- [Phase 6] cb6cbab79055c (related but different overcommit fix) already
in v6.7+
- [Phase 8] Failure mode: transaction abort → read-only filesystem =
CRITICAL severity
- [Phase 8] Trigger: normal file operations on small filesystem, easily
reproducible
- UNVERIFIED: Could not access mailing list discussion to check for NAKs
or concerns about performance regression on large filesystems
**YES**
fs/btrfs/space-info.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 87cbc051cb12f..b2b775ab878c6 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -489,10 +489,10 @@ static u64 calc_available_free_space(const struct btrfs_space_info *space_info,
/*
* If we aren't flushing all things, let us overcommit up to
* 1/2th of the space. If we can flush, don't let us overcommit
- * too much, let it overcommit up to 1/8 of the space.
+ * too much, let it overcommit up to 1/64th of the space.
*/
- if (flush == BTRFS_RESERVE_FLUSH_ALL)
- avail >>= 3;
+ if (flush == BTRFS_RESERVE_FLUSH_ALL || flush == BTRFS_RESERVE_FLUSH_ALL_STEAL)
+ avail >>= 6;
else
avail >>= 1;
--
2.53.0
* [PATCH AUTOSEL 6.18] btrfs: fix zero size inode with non-zero size after log replay
[not found] <20260420132314.1023554-1-sashal@kernel.org>
` (3 preceding siblings ...)
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-5.10] btrfs: be less aggressive with metadata overcommit when we can do full flushing Sasha Levin
@ 2026-04-20 13:20 ` Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.19] btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.12] btrfs: fix silent IO error loss in encoded writes and zoned split Sasha Levin
6 siblings, 0 replies; 10+ messages in thread
From: Sasha Levin @ 2026-04-20 13:20 UTC (permalink / raw)
To: patches, stable
Cc: Filipe Manana, Vyacheslav Kovalevsky, David Sterba, Sasha Levin,
clm, linux-btrfs, linux-kernel
From: Filipe Manana <fdmanana@suse.com>
[ Upstream commit 5254d4181add9dfaa5e3519edd71cc8f752b2f85 ]
When logging that an inode exists, as part of logging a new name or
logging new dir entries for a directory, we always set the generation of
the logged inode item to 0. This is to signal during log replay (in
overwrite_item()), that we should not set the i_size since we only logged
that an inode exists, so the i_size of the inode in the subvolume tree
must be preserved (as when we log new names or that an inode exists, we
don't log extents).
This works fine except when we have already logged an inode in full mode
or it's the first time we are logging an inode created in a past
transaction, that inode has a new i_size of 0 and then we log a new name
for the inode (due to a new hardlink or a rename), in which case we log
an i_size of 0 for the inode and a generation of 0, which causes the log
replay code to not update the inode's i_size to 0 (in overwrite_item()).
An example scenario:
mkdir /mnt/dir
xfs_io -f -c "pwrite 0 64K" /mnt/dir/foo
sync
xfs_io -c "truncate 0" -c "fsync" /mnt/dir/foo
ln /mnt/dir/foo /mnt/dir/bar
xfs_io -c "fsync" /mnt/dir
<power fail>
After log replay the file remains with a size of 64K. This is because when
we first log the inode, when we fsync file foo, we log its current i_size
of 0, and then when we create a hard link we log again the inode in exists
mode (LOG_INODE_EXISTS) but we set a generation of 0 for the inode item we
add to the log tree, so during log replay overwrite_item() sees that the
generation is 0 and i_size is 0 so we skip updating the inode's i_size
from 64K to 0.
Fix this by making sure at fill_inode_item() we always log the real
generation of the inode if it was logged in the current transaction with
the i_size we logged before. Also if an inode created in a previous
transaction is logged in exists mode only, make sure we log the i_size
stored in the inode item located from the commit root, so that if we log
multiple times that the inode exists we get the correct i_size.
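The interaction between the logged generation and the replayed i_size can be illustrated with a simplified user-space model. This is a sketch only: the struct and function names are hypothetical, the replay rule is reduced to the single case described above (generation 0 together with i_size 0 leaves the subvolume's i_size untouched), and the real fill_inode_item() and overwrite_item() handle more than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduction of the inode item written to the log tree. */
struct logged_inode_item {
	uint64_t generation;
	uint64_t size;
};

/*
 * Model of logging an inode in "exists" mode. Unpatched, the generation
 * is always 0. Patched, it is 0 only if the inode was not already logged
 * in the current transaction (logged_trans < transid).
 */
struct logged_inode_item log_inode_exists(uint64_t inode_generation,
					  uint64_t logged_trans,
					  uint64_t transid,
					  uint64_t logged_isize,
					  bool patched)
{
	struct logged_inode_item item = {
		.generation = inode_generation,
		.size = logged_isize,
	};

	if (!patched || logged_trans < transid)
		item.generation = 0;
	return item;
}

/* Model of replay: generation 0 with size 0 keeps the subvolume's i_size. */
uint64_t replayed_size(struct logged_inode_item item, uint64_t subvol_size)
{
	if (item.generation == 0 && item.size == 0)
		return subvol_size;
	return item.size;
}
```

Taking the example scenario with illustrative transaction ids (inode created in transaction 5, then truncated, fsynced and hard linked in transaction 6), the unpatched model logs generation 0 and size 0, so replay keeps the stale 64K. The patched model logs generation 5, so replay applies the size of 0.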
A test case for fstests will follow soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/af8c15fa-4e41-4bb2-885c-0bc4e97532a6@gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
Error: Failed to generate final synthesis
fs/btrfs/tree-log.c | 98 ++++++++++++++++++++++++++++++---------------
1 file changed, 65 insertions(+), 33 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 7505a87522fd7..c45c5112c0350 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4608,21 +4608,32 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
struct inode *inode, bool log_inode_only,
u64 logged_isize)
{
+ u64 gen = BTRFS_I(inode)->generation;
u64 flags;
if (log_inode_only) {
- /* set the generation to zero so the recover code
- * can tell the difference between an logging
- * just to say 'this inode exists' and a logging
- * to say 'update this inode with these values'
+ /*
+ * Set the generation to zero so the recover code can tell the
+ * difference between a logging just to say 'this inode exists'
+ * and a logging to say 'update this inode with these values'.
+ * But only if the inode was not already logged before.
+ * We access ->logged_trans directly since it was already set
+ * up in the call chain by btrfs_log_inode(), and data_race()
+ * to avoid false alerts from KCSAN and since it was set already
+ * and one can set it to 0 since that only happens on eviction
+ * and we are holding a ref on the inode.
*/
- btrfs_set_inode_generation(leaf, item, 0);
+ ASSERT(data_race(BTRFS_I(inode)->logged_trans) > 0);
+ if (data_race(BTRFS_I(inode)->logged_trans) < trans->transid)
+ gen = 0;
+
btrfs_set_inode_size(leaf, item, logged_isize);
} else {
- btrfs_set_inode_generation(leaf, item, BTRFS_I(inode)->generation);
btrfs_set_inode_size(leaf, item, inode->i_size);
}
+ btrfs_set_inode_generation(leaf, item, gen);
+
btrfs_set_inode_uid(leaf, item, i_uid_read(inode));
btrfs_set_inode_gid(leaf, item, i_gid_read(inode));
btrfs_set_inode_mode(leaf, item, inode->i_mode);
@@ -5428,42 +5439,63 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
return 0;
}
-static int logged_inode_size(struct btrfs_root *log, struct btrfs_inode *inode,
- struct btrfs_path *path, u64 *size_ret)
+static int get_inode_size_to_log(struct btrfs_trans_handle *trans,
+ struct btrfs_inode *inode,
+ struct btrfs_path *path, u64 *size_ret)
{
struct btrfs_key key;
+ struct btrfs_inode_item *item;
int ret;
key.objectid = btrfs_ino(inode);
key.type = BTRFS_INODE_ITEM_KEY;
key.offset = 0;
- ret = btrfs_search_slot(NULL, log, &key, path, 0, 0);
- if (ret < 0) {
- return ret;
- } else if (ret > 0) {
- *size_ret = 0;
- } else {
- struct btrfs_inode_item *item;
+ /*
+ * Our caller called inode_logged(), so logged_trans is up to date.
+ * Use data_race() to silence any warning from KCSAN. Once logged_trans
+ * is set, it can only be reset to 0 after inode eviction.
+ */
+ if (data_race(inode->logged_trans) == trans->transid) {
+ ret = btrfs_search_slot(NULL, inode->root->log_root, &key, path, 0, 0);
+ } else if (inode->generation < trans->transid) {
+ path->search_commit_root = true;
+ path->skip_locking = true;
+ ret = btrfs_search_slot(NULL, inode->root, &key, path, 0, 0);
+ path->search_commit_root = false;
+ path->skip_locking = false;
- item = btrfs_item_ptr(path->nodes[0], path->slots[0],
- struct btrfs_inode_item);
- *size_ret = btrfs_inode_size(path->nodes[0], item);
- /*
- * If the in-memory inode's i_size is smaller then the inode
- * size stored in the btree, return the inode's i_size, so
- * that we get a correct inode size after replaying the log
- * when before a power failure we had a shrinking truncate
- * followed by addition of a new name (rename / new hard link).
- * Otherwise return the inode size from the btree, to avoid
- * data loss when replaying a log due to previously doing a
- * write that expands the inode's size and logging a new name
- * immediately after.
- */
- if (*size_ret > inode->vfs_inode.i_size)
- *size_ret = inode->vfs_inode.i_size;
+ } else {
+ *size_ret = 0;
+ return 0;
}
+ /*
+ * If the inode was logged before or is from a past transaction, then
+ * its inode item must exist in the log root or in the commit root.
+ */
+ ASSERT(ret <= 0);
+ if (WARN_ON_ONCE(ret > 0))
+ ret = -ENOENT;
+
+ if (ret < 0)
+ return ret;
+
+ item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+ struct btrfs_inode_item);
+ *size_ret = btrfs_inode_size(path->nodes[0], item);
+ /*
+ * If the in-memory inode's i_size is smaller then the inode size stored
+ * in the btree, return the inode's i_size, so that we get a correct
+ * inode size after replaying the log when before a power failure we had
+ * a shrinking truncate followed by addition of a new name (rename / new
+ * hard link). Otherwise return the inode size from the btree, to avoid
+ * data loss when replaying a log due to previously doing a write that
+ * expands the inode's size and logging a new name immediately after.
+ */
+ if (*size_ret > inode->vfs_inode.i_size)
+ *size_ret = inode->vfs_inode.i_size;
+
btrfs_release_path(path);
return 0;
}
@@ -6978,7 +7010,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
ret = drop_inode_items(trans, log, path, inode,
BTRFS_XATTR_ITEM_KEY);
} else {
- if (inode_only == LOG_INODE_EXISTS && ctx->logged_before) {
+ if (inode_only == LOG_INODE_EXISTS) {
/*
* Make sure the new inode item we write to the log has
* the same isize as the current one (if it exists).
@@ -6992,7 +7024,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
* (zeroes), as if an expanding truncate happened,
* instead of getting a file of 4Kb only.
*/
- ret = logged_inode_size(log, inode, path, &logged_isize);
+ ret = get_inode_size_to_log(trans, inode, path, &logged_isize);
if (ret)
goto out_unlock;
}
--
2.53.0
* [PATCH AUTOSEL 7.0-6.19] btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit
[not found] <20260420132314.1023554-1-sashal@kernel.org>
` (4 preceding siblings ...)
2026-04-20 13:20 ` [PATCH AUTOSEL 6.18] btrfs: fix zero size inode with non-zero size after log replay Sasha Levin
@ 2026-04-20 13:21 ` Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.12] btrfs: fix silent IO error loss in encoded writes and zoned split Sasha Levin
6 siblings, 0 replies; 10+ messages in thread
From: Sasha Levin @ 2026-04-20 13:21 UTC (permalink / raw)
To: patches, stable
Cc: Johannes Thumshirn, Filipe Manana, David Sterba, Sasha Levin, clm,
linux-btrfs, linux-kernel
From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
[ Upstream commit 7bcb04de982ff0718870112ad9f38c35cbca528b ]
On zoned filesystems metadata space accounting can become overly optimistic
due to delayed refs reservations growing without a hard upper bound.
The delayed_refs_rsv block reservation is allowed to speculatively grow and
is only backed by actual metadata space when refilled. On zoned devices this
can result in delayed_refs_rsv reserving a large portion of metadata space
that is already effectively unusable due to zone write pointer constraints.
As a result, space_info->may_use can grow far beyond the usable metadata
capacity, causing the allocator to believe space is available when it is not.
This leads to premature ENOSPC failures and "cannot satisfy tickets" reports
even though commits would be able to make progress by flushing delayed refs.
Analysis of "-o enospc_debug" dumps using a Python debug script
confirmed that delayed_refs_rsv was responsible for the majority of
metadata overcommit on zoned devices. By correlating space_info counters
(total, used, may_use, zone_unusable) across transactions, the analysis
showed that may_use continued to grow even after usable metadata space
was exhausted, with delayed refs refills accounting for the excess
reservations.
Here's the output of the analysis:
======================================================================
Space Type: METADATA
======================================================================
Raw Values:
Total: 256.00 MB (268435456 bytes)
Used: 128.00 KB (131072 bytes)
Pinned: 16.00 KB (16384 bytes)
Reserved: 144.00 KB (147456 bytes)
May Use: 255.48 MB (267894784 bytes)
Zone Unusable: 192.00 KB (196608 bytes)
Calculated Metrics:
Actually Usable: 255.81 MB (total - zone_unusable)
Committed: 255.77 MB (used + pinned + reserved + may_use)
Consumed: 320.00 KB (used + zone_unusable)
Percentages:
Zone Unusable: 0.07% of total
May Use: 99.80% of total
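The calculated metrics in this output follow directly from the raw counters. The sketch below recomputes them in user space using the formulas printed in the output; the struct and helper names are hypothetical and are not taken from the debug script or from the kernel.

```c
#include <stdint.h>

/* Hypothetical container for the raw space_info counters, in bytes. */
struct space_counters {
	uint64_t total;
	uint64_t used;
	uint64_t pinned;
	uint64_t reserved;
	uint64_t may_use;
	uint64_t zone_unusable;
};

/* Actually usable: total - zone_unusable. */
uint64_t actually_usable(const struct space_counters *s)
{
	return s->total - s->zone_unusable;
}

/* Committed: used + pinned + reserved + may_use. */
uint64_t committed(const struct space_counters *s)
{
	return s->used + s->pinned + s->reserved + s->may_use;
}

/* Consumed: used + zone_unusable. */
uint64_t consumed(const struct space_counters *s)
{
	return s->used + s->zone_unusable;
}

/* Share of total in hundredths of a percent, rounded (9980 is 99.80%). */
uint64_t percent_x100(uint64_t part, uint64_t total)
{
	return (part * 10000 + total / 2) / total;
}
```

Fed with the raw values above, this yields 268238848 bytes usable (255.81 MB), 268189696 bytes committed (255.77 MB) and 327680 bytes consumed (320 KB), with may_use at 99.80% of total.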
Fix this by adding a zoned-specific cap in btrfs_delayed_refs_rsv_refill():
Before reserving additional metadata bytes, limit the delayed refs
reservation based on the usable metadata space (total bytes minus
zone_unusable). If the reservation would exceed this cap, return -EAGAIN
to trigger the existing flush/commit logic instead of overcommitting
metadata space.
This preserves the existing reservation and flushing semantics while
preventing metadata overcommit on zoned devices. The change is limited to
metadata space and does not affect non-zoned filesystems.
This patch addresses premature metadata ENOSPC conditions on zoned devices
and ensures delayed refs are throttled before exhausting usable metadata.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
Now I have a thorough understanding. Let me compile the final analysis.
---
## PHASE 1: COMMIT MESSAGE FORENSICS
### Step 1.1: Subject Line
- **Subsystem**: `btrfs: zoned:` (btrfs filesystem, zoned device mode)
- **Action verb**: "cap" (limit/constrain)
- **Summary**: Cap delayed refs metadata reservation to avoid overcommit
on zoned filesystems
Record: [btrfs: zoned] [cap] [Limit delayed refs metadata reservation to
prevent metadata overcommit on zoned devices]
### Step 1.2: Tags
- **Reviewed-by**: Filipe Manana <fdmanana@suse.com> -- a core btrfs
developer and the most prolific btrfs contributor
- **Signed-off-by**: Johannes Thumshirn <johannes.thumshirn@wdc.com> --
author, WDC employee, active btrfs/zoned contributor
- **Signed-off-by**: David Sterba <dsterba@suse.com> -- btrfs maintainer
who merged it
- No Fixes: tag (expected for manual review candidates)
- No Cc: stable tag (expected)
- No Reported-by tag (author-discovered through debugging)
Record: Reviewed by Filipe Manana (core btrfs developer), committed by
maintainer David Sterba. No bug report reference.
### Step 1.3: Commit Body
The commit describes a real-world ENOSPC problem on zoned btrfs:
- `delayed_refs_rsv` speculatively grows without a hard upper bound
- On zoned devices, zone write pointer constraints make some space
unusable
- `space_info->may_use` grows beyond usable metadata capacity
- This causes premature ENOSPC failures ("cannot satisfy tickets")
- The author provided extensive analysis output from enospc_debug dumps
showing may_use at 99.80% of total while consumed was only 320KB
**Failure mode**: Premature ENOSPC errors on zoned devices, preventing
writes even though space could be recovered by flushing delayed refs.
Record: [Bug: Metadata overcommit on zoned devices leads to premature
ENOSPC] [Symptom: cannot satisfy tickets, premature ENOSPC] [Root cause:
delayed_refs_rsv unbounded growth relative to zone_unusable space]
### Step 1.4: Hidden Bug Fix Detection
This is NOT a hidden bug fix - the commit explicitly describes fixing
premature ENOSPC on zoned devices. It's a clear bug fix with detailed
analysis.
Record: [Direct bug fix, not hidden]
## PHASE 2: DIFF ANALYSIS
### Step 2.1: Change Inventory
- **fs/btrfs/delayed-ref.c**: +28 lines (new function
`btrfs_zoned_cap_metadata_reservation` + 3-line call site)
- **fs/btrfs/transaction.c**: +8 lines (handle -EAGAIN from refill by
committing transaction and retrying)
- **Total**: ~36 lines added, 0 removed
- **Functions modified**: `btrfs_delayed_refs_rsv_refill()`,
`start_transaction()`
- **New function**: `btrfs_zoned_cap_metadata_reservation()` (static
helper)
- **Scope**: Two-file surgical fix, limited to zoned mode
Record: [2 files, ~36 lines added] [btrfs_delayed_refs_rsv_refill
modified, start_transaction modified] [Small surgical fix]
### Step 2.2: Code Flow Changes
**Hunk 1 (delayed-ref.c)**: New static function
`btrfs_zoned_cap_metadata_reservation`:
- Before: No cap on delayed refs reservation
- After: On zoned devices, checks if `block_rsv->size` exceeds half of
usable metadata (`total_bytes - bytes_zone_unusable`). Returns -EAGAIN
if exceeded.
- Only affects zoned mode (`btrfs_is_zoned` check at start)
**Hunk 2 (delayed-ref.c)**: Call to new function in
`btrfs_delayed_refs_rsv_refill`:
- Before: Directly calls `btrfs_reserve_metadata_bytes`
- After: First checks the zoned cap; if exceeded, returns -EAGAIN before
attempting actual reservation
**Hunk 3 (transaction.c)**: -EAGAIN handling in `start_transaction`:
- Before: Any error from `btrfs_delayed_refs_rsv_refill` goes to
`reserve_fail`
- After: If -EAGAIN (zoned cap hit), commits current transaction (which
flushes delayed refs, freeing space), then retries the refill
Record: [New cap check prevents overcommit] [EAGAIN triggers transaction
commit + retry] [Only zoned mode affected]
### Step 2.3: Bug Mechanism
Category: **Logic/correctness fix** for metadata accounting on zoned
devices.
What was broken: The delayed refs block reserve could grow arbitrarily
large on zoned filesystems, where zone write pointer constraints
(tracked as `bytes_zone_unusable`) make portions of metadata space
physically unusable. The overcommit logic didn't account for this, so
`may_use` could far exceed actually usable space.
How the fix works: Adds a zoned-specific cap at 50% of usable metadata
space (`usable >> 1`). When the cap is hit, returns -EAGAIN instead of
proceeding with the reservation. The caller (transaction start) responds
by committing the current transaction, which flushes delayed refs and
frees the overcommitted space.
Record: [Logic/correctness bug in metadata accounting on zoned devices]
[Fix: cap at 50% usable space, trigger flush when cap exceeded]
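As a worked example of the cap arithmetic, the following userspace sketch mirrors the `usable >> 1` computation from the patch. The helper names `model_zoned_cap()` and `model_cap_exceeded()` are hypothetical, and the byte counts used afterwards are made-up illustration values, not measurements.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: half of the metadata space that is actually usable. */
static uint64_t model_zoned_cap(uint64_t total_bytes,
				uint64_t bytes_zone_unusable)
{
	uint64_t usable = total_bytes - bytes_zone_unusable;

	return usable >> 1;
}

/* The patch uses a strict comparison, so a reserve equal to the cap passes. */
static bool model_cap_exceeded(uint64_t total_bytes,
			       uint64_t bytes_zone_unusable,
			       uint64_t rsv_size)
{
	return rsv_size > model_zoned_cap(total_bytes, bytes_zone_unusable);
}
```

With 1 GiB of metadata space and 256 MiB zone-unusable, usable space is 768 MiB and the cap is 384 MiB. A 400 MiB reserve would exceed it, while a reserve of exactly 384 MiB would not.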
### Step 2.4: Fix Quality
- The fix is well-contained: adds one static helper + two call sites
- The zoned-only guard (`btrfs_is_zoned`) ensures non-zoned systems are
completely unaffected
- The `ASSERT(btrfs_is_zoned(fs_info))` in the EAGAIN handler is good
defensive coding
- The retry pattern (commit, then retry) is a well-established pattern
in btrfs space management
- Reviewed by Filipe Manana, one of the most active btrfs contributors
- Potential regression risk is LOW: only affects zoned mode, uses
existing flush/commit mechanisms, and the cap is generous (50% of
usable)
Record: [Obviously correct, well-reviewed] [Zero risk for non-zoned,
low risk for zoned]
## PHASE 3: GIT HISTORY INVESTIGATION
### Step 3.1: Blame
- `btrfs_delayed_refs_rsv_refill()` was introduced by Josef Bacik in
commit `6ef03debdb3d82` (2019-06-19), present since approximately
v5.3.
- The function has been refined by Filipe Manana (2023) and others but
its core logic (grow unbounded) has been present since inception.
- The zoned mode support was added later, but the interaction with
delayed refs rsv was never specifically addressed.
Record: [Refill function from v5.3 (6ef03debdb3d82)] [Zoned support
added later without accounting for delayed refs rsv interaction]
### Step 3.2: Fixes Tag
No Fixes: tag present. The bug is a design gap in how delayed refs rsv
interacts with zoned mode constraints, not introduced by a single
commit.
Record: [No Fixes: tag - this is a design gap, not a single-commit
regression]
### Step 3.3: Related Changes
- `28270e25c69a2` (v6.7) - "btrfs: always reserve space for delayed refs
when starting transaction" - changed how delayed refs reservations
work, may have exacerbated the issue
- `64d2c847ba380` (v6.9) - "btrfs: zoned: fix
calc_available_free_space() for zoned mode" - closely related fix for
overcommit on zoned, was CC'd to stable
- `a1359d06d7878` (v7.0) - API change to `btrfs_reserve_metadata_bytes`
that would affect clean backport
Record: [Related to 28270e25c69a2 and 64d2c847ba380] [API differences
across stable trees]
### Step 3.4: Author
Johannes Thumshirn is a WDC employee and regular btrfs/zoned contributor
with 20+ btrfs commits visible. He is a recognized expert on zoned
btrfs.
Record: [Author is a recognized zoned btrfs expert at WDC]
### Step 3.5: Dependencies
**CRITICAL**: `btrfs_commit_current_transaction()` was introduced in
commit `ded980eb3fadd7` (2024-05-22), which is only present in v6.11+.
This function is used in the `transaction.c` hunk. Backporting to v6.6.y
or older stable trees would require either:
1. Also backporting `ded980eb3fadd7` (and its dependents)
2. Replacing the call with the inline equivalent
(`btrfs_attach_transaction_barrier` + `btrfs_commit_transaction`)
Additionally, `btrfs_reserve_metadata_bytes()` had its signature changed
by `a1359d06d7878` (dropping `fs_info` argument), which is only in the
latest tree. Older trees have a different API.
Record: [Depends on ded980eb3fadd7 (btrfs_commit_current_transaction) -
only in v6.11+] [API differences for btrfs_reserve_metadata_bytes across
versions]
## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH
### Step 4.1-4.5
b4 dig could not find the commit (likely very recent, post-7.0-rc or in
a merge window). Web searches also did not find the specific patch
discussion. Lore.kernel.org was protected by anti-bot measures.
Record: [Could not find mailing list discussion - commit appears very
recent, possibly in 7.0 merge window or rc cycle] [UNVERIFIED: Full
mailing list discussion not available]
## PHASE 5: CODE SEMANTIC ANALYSIS
### Step 5.1-5.4: Functions and Call Chains
- `btrfs_delayed_refs_rsv_refill()` is called from:
1. `start_transaction()` in `transaction.c` - called on every
transaction start with num_items==0
2. `btrfs_truncate_inode_items()` in `inode-item.c` - called during
truncate/unlink (with BTRFS_RESERVE_NO_FLUSH)
- `start_transaction()` is called from many places throughout btrfs
(dozens of call sites)
- The `num_items == 0` path specifically handles callers using
`btrfs_start_transaction(root, 0)` which is a very common pattern (24+
call sites across btrfs)
The `inode-item.c` caller already converts ALL errors to `-EAGAIN` (line
708), so the new -EAGAIN from the cap function is handled correctly
without modification.
Record: [btrfs_delayed_refs_rsv_refill called from transaction start and
truncate] [Very widely called function] [inode-item.c caller unaffected
by change]
### Step 5.5: Similar Patterns
The previous fix `64d2c847ba380` ("btrfs: zoned: fix
calc_available_free_space() for zoned mode") addressed a very similar
issue - overcommit on zoned mode leading to ENOSPC. That fix was CC'd to
stable 6.9+. This new fix addresses a different vector of the same
overcommit problem.
Record: [Similar fix 64d2c847ba380 was CC'd to stable 6.9+]
## PHASE 6: CROSS-REFERENCING AND STABLE TREE ANALYSIS
### Step 6.1: Existence in Stable Trees
- The delayed refs rsv refill mechanism exists since v5.3
- Zoned mode support has been present since ~v5.12
- The interaction problem exists in all stable trees with zoned mode
support
- However, `28270e25c69a2` (v6.7) changed delayed refs reservation
behavior and may have worsened the problem
Record: [Buggy interaction exists in v5.12+, but may be worse in v6.7+
due to 28270e25c69a2]
### Step 6.2: Backport Complications
**SIGNIFICANT backport complications:**
1. `btrfs_commit_current_transaction()` only exists in v6.11+ - requires
adaptation for older trees
2. `btrfs_reserve_metadata_bytes()` API changed - minor adaptation
needed for older trees
3. The `delayed-ref.c` hunk adding the new function should apply
relatively cleanly
Record: [Needs adaptation for v6.6-v6.10 due to missing
btrfs_commit_current_transaction] [API differences need resolution]
### Step 6.3: Related Fixes Already in Stable
`64d2c847ba380` (CC: stable 6.9+) addresses a different vector of the
same overcommit problem. This new patch addresses a complementary
vector.
Record: [64d2c847ba380 is a related but different fix, CC'd to stable
6.9+]
## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT
### Step 7.1: Subsystem Criticality
- **Subsystem**: btrfs filesystem, zoned device support
- **Criticality**: IMPORTANT (btrfs is a widely used filesystem, but
zoned mode is a specialized use case for SMR/ZNS devices)
- Zoned btrfs is increasingly used on enterprise/datacenter storage
systems with ZNS SSDs
Record: [btrfs filesystem, zoned mode - IMPORTANT but specialized use
case]
### Step 7.2: Subsystem Activity
btrfs is one of the most actively developed filesystems in the kernel.
The zoned mode subsystem specifically is under active development by
WDC/Seagate engineers.
Record: [Very active subsystem]
## PHASE 8: IMPACT AND RISK ASSESSMENT
### Step 8.1: Affected Users
Users of btrfs on zoned block devices (SMR HDDs, ZNS SSDs). This is a
growing but still specialized use case, primarily in
enterprise/datacenter environments.
Record: [Affected: btrfs zoned mode users, primarily
enterprise/datacenter]
### Step 8.2: Trigger Conditions
- Occurs on zoned devices when delayed refs accumulate
- Triggered by normal write workloads that generate many delayed
references
- More likely with sustained write activity and many COW operations
- Not timing-dependent - deterministic once space accounting gets out of
balance
Record: [Triggered by normal sustained write workloads on zoned devices]
[Deterministic, not timing-dependent]
### Step 8.3: Failure Mode Severity
- **ENOSPC errors** - writes fail prematurely
- This is a HIGH severity issue for affected users: they lose the
ability to write to their filesystem even though space could be
reclaimed
- Not a crash/security issue, but a significant usability/functionality
bug
- Data in-flight could potentially be lost if applications don't handle
ENOSPC gracefully
Record: [Premature ENOSPC - HIGH severity for affected users] [No
crash/corruption, but functional failure]
### Step 8.4: Risk-Benefit Ratio
**BENEFIT**: High for zoned btrfs users - fixes a real ENOSPC issue
preventing normal operation
**RISK**:
- Very low for non-zoned users (completely unaffected - `btrfs_is_zoned`
guard)
- Low for zoned users (uses existing transaction commit mechanism)
- ~36 lines added, well-contained
- BUT: requires backport adaptation due to
`btrfs_commit_current_transaction` dependency
Record: [HIGH benefit for zoned users] [LOW risk overall] [Needs
adaptation for older stable trees]
## PHASE 9: FINAL SYNTHESIS
### Step 9.1: Evidence Compilation
**FOR backporting:**
- Fixes a real, significant ENOSPC bug affecting zoned btrfs users
- Well-analyzed and well-documented by the author
- Reviewed by Filipe Manana (core btrfs developer)
- Committed by David Sterba (btrfs maintainer)
- Small and well-contained (~36 lines, 2 files)
- Zero risk to non-zoned users
- Author is a recognized zoned btrfs expert
- Related fix (64d2c847ba380) was explicitly CC'd to stable
**AGAINST backporting:**
- Requires adaptation for stable trees older than v6.11
(`btrfs_commit_current_transaction` dependency)
- API differences in `btrfs_reserve_metadata_bytes` across stable trees
- No Fixes: tag or Cc: stable tag (design gap, not single-commit
regression)
- Zoned mode is a specialized use case (fewer affected users)
- The new static function adds ~25 lines of new code (more than a
trivial one-liner)
**UNRESOLVED:**
- Could not access mailing list discussion to check for stable
nominations by reviewers
- Could not verify whether this was part of a larger series
### Step 9.2: Stable Rules Checklist
1. Obviously correct and tested? **YES** - reviewed by Filipe Manana,
well-analyzed
2. Fixes a real bug? **YES** - premature ENOSPC on zoned devices
3. Important issue? **YES** - prevents normal filesystem operation
(ENOSPC)
4. Small and contained? **YES** - ~36 lines, 2 files, zoned-only
5. No new features or APIs? **YES** - no new features, just a cap on
existing behavior
6. Can apply to stable trees? **NEEDS ADAPTATION** - requires backport
work for v6.6-v6.10
### Step 9.3: Exception Categories
Not an exception category (not a device ID, quirk, DT, build fix, or doc
fix). It's a standard bug fix.
### Step 9.4: Decision
This is a genuine bug fix for premature ENOSPC on zoned btrfs devices.
The fix is well-contained, well-reviewed, and carries very low
regression risk (zero for non-zoned users). However, it has notable
backport complications:
1. The dependency on `btrfs_commit_current_transaction()` (v6.11+) means
this cannot be cleanly cherry-picked to older stable trees without
adaptation.
2. API differences in `btrfs_reserve_metadata_bytes()` add further
complications for older trees.
Despite the backport complications, the fix addresses a real,
significant user-facing bug (premature ENOSPC preventing writes), is
reviewed by a core btrfs developer, and is small enough to warrant
the adaptation effort. The related fix `64d2c847ba380` was CC'd to
stable 6.9+, indicating the maintainers recognize zoned overcommit
issues as stable-worthy.
## Verification
- [Phase 1] Parsed tags: Reviewed-by: Filipe Manana, Signed-off-by:
Johannes Thumshirn (author), David Sterba (maintainer)
- [Phase 2] Diff analysis: ~36 lines added across 2 files, new static
helper + EAGAIN handling
- [Phase 2] Verified inode-item.c caller already handles all errors as
-EAGAIN (line 708)
- [Phase 3] git blame: btrfs_delayed_refs_rsv_refill from 6ef03debdb3d82
(Josef Bacik, 2019, v5.3)
- [Phase 3] git blame: btrfs_commit_current_transaction from
ded980eb3fadd7 (Filipe Manana, 2024-05-22)
- [Phase 3] Verified ded980eb3fadd7 is in v6.11+ but NOT in v6.10 or
earlier
- [Phase 3] Verified 28270e25c69a2 (always reserve for delayed refs) is
in v6.7+
- [Phase 3] git show 64d2c847ba380: confirmed related zoned overcommit
fix was CC'd to stable 6.9+
- [Phase 4] b4 dig failed to find commit (too recent); lore.kernel.org
blocked by anti-bot
- [Phase 4] UNVERIFIED: Could not access mailing list discussion for
stable nominations
- [Phase 5] Verified btrfs_delayed_refs_rsv_refill callers:
start_transaction (transaction.c), btrfs_truncate_inode_items
(inode-item.c)
- [Phase 5] Verified btrfs_start_transaction(root, 0) has 24+ call sites
across btrfs
- [Phase 6] Verified btrfs_commit_current_transaction exists in current
tree (line 2040 of transaction.c) but not in v6.6
- [Phase 6] Verified bytes_zone_unusable field exists in space-info.h
(all relevant stable trees)
- [Phase 6] Verified btrfs_reserve_metadata_bytes API changed by
a1359d06d7878 (signature differs in older trees)
- [Phase 8] Failure mode: premature ENOSPC preventing writes on zoned
devices - severity HIGH for affected users
**YES**
fs/btrfs/delayed-ref.c | 28 ++++++++++++++++++++++++++++
fs/btrfs/transaction.c | 8 ++++++++
2 files changed, 36 insertions(+)
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 3766ff29fbbb1..605858c2d9a95 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -207,6 +207,30 @@ void btrfs_dec_delayed_refs_rsv_bg_updates(struct btrfs_fs_info *fs_info)
* This will refill the delayed block_rsv up to 1 items size worth of space and
* will return -ENOSPC if we can't make the reservation.
*/
+static int btrfs_zoned_cap_metadata_reservation(struct btrfs_space_info *space_info)
+{
+ struct btrfs_fs_info *fs_info = space_info->fs_info;
+ struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs_rsv;
+ u64 usable;
+ u64 cap;
+ int ret = 0;
+
+ if (!btrfs_is_zoned(fs_info))
+ return 0;
+
+ spin_lock(&space_info->lock);
+ usable = space_info->total_bytes - space_info->bytes_zone_unusable;
+ spin_unlock(&space_info->lock);
+ cap = usable >> 1;
+
+ spin_lock(&block_rsv->lock);
+ if (block_rsv->size > cap)
+ ret = -EAGAIN;
+ spin_unlock(&block_rsv->lock);
+
+ return ret;
+}
+
int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
enum btrfs_reserve_flush_enum flush)
{
@@ -228,6 +252,10 @@ int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
if (!num_bytes)
return 0;
+ ret = btrfs_zoned_cap_metadata_reservation(space_info);
+ if (ret)
+ return ret;
+
ret = btrfs_reserve_metadata_bytes(space_info, num_bytes, flush);
if (ret)
return ret;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 8dd77c431974d..86c5ebdf56998 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -678,6 +678,14 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
* here.
*/
ret = btrfs_delayed_refs_rsv_refill(fs_info, flush);
+ if (ret == -EAGAIN) {
+ ASSERT(btrfs_is_zoned(fs_info));
+ ret = btrfs_commit_current_transaction(root);
+ if (ret)
+ goto reserve_fail;
+ ret = btrfs_delayed_refs_rsv_refill(fs_info, flush);
+ }
+
if (ret)
goto reserve_fail;
}
--
2.53.0
* [PATCH AUTOSEL 7.0-6.12] btrfs: fix silent IO error loss in encoded writes and zoned split
[not found] <20260420132314.1023554-1-sashal@kernel.org>
` (5 preceding siblings ...)
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.19] btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit Sasha Levin
@ 2026-04-20 13:21 ` Sasha Levin
6 siblings, 0 replies; 10+ messages in thread
From: Sasha Levin @ 2026-04-20 13:21 UTC (permalink / raw)
To: patches, stable
Cc: Michal Grzedzicki, Qu Wenruo, Johannes Thumshirn, Mark Harmstone,
David Sterba, Sasha Levin, clm, linux-btrfs, linux-kernel
From: Michal Grzedzicki <mge@meta.com>
[ Upstream commit 3cd181cc46d36aa7bd4af85f14639d86a25beaec ]
can_finish_ordered_extent() and btrfs_finish_ordered_zoned() set
BTRFS_ORDERED_IOERR via bare set_bit(). Later,
btrfs_mark_ordered_extent_error() in btrfs_finish_one_ordered() uses
test_and_set_bit(), finds it already set, and skips
mapping_set_error(). The error is never recorded on the inode's
address_space, making it invisible to fsync. For encoded writes this
causes btrfs receive to silently produce files with zero-filled holes.
Fix: replace bare set_bit(BTRFS_ORDERED_IOERR) with
btrfs_mark_ordered_extent_error() which pairs test_and_set_bit() with
mapping_set_error(), guaranteeing the error is recorded exactly once.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: Michal Grzedzicki <mge@meta.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
Now I have all the information needed. Let me compile the full analysis.
---
# PHASE 1: COMMIT MESSAGE FORENSICS
## Step 1.1: PARSE THE SUBJECT LINE
- **Subsystem prefix**: `btrfs`
- **Action verb**: "fix"
- **Summary**: Fix silent IO error loss in encoded writes and zoned
split
- Record: [btrfs] [fix] [silent IO error loss in encoded writes and
zoned split paths]
## Step 1.2: PARSE ALL COMMIT MESSAGE TAGS
- **Reviewed-by**: Qu Wenruo `<wqu@suse.com>` — well-known btrfs
developer/maintainer
- **Reviewed-by**: Johannes Thumshirn `<johannes.thumshirn@wdc.com>` —
prominent btrfs contributor
- **Reviewed-by**: Mark Harmstone `<mark@harmstone.com>` — btrfs
developer
- **Signed-off-by**: Michal Grzedzicki `<mge@meta.com>` — author (Meta)
- **Signed-off-by**: David Sterba `<dsterba@suse.com>` — btrfs subsystem
maintainer
- No Fixes: tag, no Cc: stable — expected for candidate review
- No Reported-by — likely found via code review/testing at Meta
- Record: 3 Reviewed-by from expert btrfs devs, signed off by subsystem
maintainer.
## Step 1.3: ANALYZE THE COMMIT BODY TEXT
The commit describes a clear, precise bug mechanism:
1. `can_finish_ordered_extent()` and `btrfs_finish_ordered_zoned()` set
`BTRFS_ORDERED_IOERR` via bare `set_bit()`.
2. Later, `btrfs_mark_ordered_extent_error()` in
`btrfs_finish_one_ordered()` uses `test_and_set_bit()`, finds the bit
already set, and **skips** `mapping_set_error()`.
3. The IO error is never recorded on the inode's address_space, making
it invisible to `fsync`.
4. For encoded writes, `btrfs receive` silently produces files with
zero-filled holes.
- **Failure mode**: Silent data corruption (zero-filled holes instead of
actual data)
- **Root cause**: bare `set_bit()` pre-empts the `test_and_set_bit()` in
the helper that actually records the error
- Record: [Silent data loss bug] [fsync misses IO errors] [Encoded write
files get zero-filled holes] [Author clearly explains the root cause
mechanism]
## Step 1.4: DETECT HIDDEN BUG FIXES
This is an explicit bug fix, not hidden. The subject and body directly
describe a data integrity bug.
Record: [Not a hidden fix — explicitly described as a data loss fix]
---
# PHASE 2: DIFF ANALYSIS - LINE BY LINE
## Step 2.1: INVENTORY THE CHANGES
- `fs/btrfs/ordered-data.c`: 1 line changed (line 388)
- `fs/btrfs/zoned.c`: 1 line changed (line 2139)
- Total: 2 lines changed, 0 lines added/removed net
- Functions modified: `can_finish_ordered_extent()`,
`btrfs_finish_ordered_zoned()`
- Record: [2 files, 2 lines changed, surgical fix]
## Step 2.2: UNDERSTAND THE CODE FLOW CHANGE
**Hunk 1** (`ordered-data.c:388`):
- Before: `set_bit(BTRFS_ORDERED_IOERR, &ordered->flags)` — sets flag
only
- After: `btrfs_mark_ordered_extent_error(ordered)` — sets flag AND
calls `mapping_set_error()`
**Hunk 2** (`zoned.c:2139`):
- Before: `set_bit(BTRFS_ORDERED_IOERR, &ordered->flags)` — sets flag
only
- After: `btrfs_mark_ordered_extent_error(ordered)` — sets flag AND
calls `mapping_set_error()`
- Record: [Both hunks: bare set_bit → helper that also records error on
mapping]
## Step 2.3: IDENTIFY THE BUG MECHANISM
**Category**: Logic / correctness bug → silent data corruption
The bug is an ordering problem (not a timing race) between setting the
error flag and recording the error:
1. `can_finish_ordered_extent()` uses bare `set_bit()` to set
`BTRFS_ORDERED_IOERR`
2. `btrfs_finish_one_ordered()` (line 3363) later calls
`btrfs_mark_ordered_extent_error()`
3. `btrfs_mark_ordered_extent_error()` does `if (!test_and_set_bit(...))
mapping_set_error()`
4. Since the bit was already set by step 1, step 3 thinks the error was
already recorded and skips `mapping_set_error()`
5. But the `mapping_set_error()` was NEVER called — the bare `set_bit()`
didn't do it
Record: [Logic/correctness bug] [test_and_set_bit finds bit already set,
skips recording error to mapping]
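The mechanism above can be reproduced in a small userspace model built on C11 atomics. This is only a sketch under stated assumptions: `struct model_ordered`, `model_init()`, `model_set_ioerr()` and `model_mark_error()` are hypothetical names, `atomic_fetch_or()` stands in for the kernel's `set_bit()` and `test_and_set_bit()`, and the `mapping_error` field stands in for the error state that `mapping_set_error()` records on the address_space.

```c
#include <errno.h>
#include <stdatomic.h>

#define MODEL_IOERR_BIT 0x1u

/* Hypothetical model of an ordered extent plus its inode's mapping. */
struct model_ordered {
	atomic_uint flags;
	int mapping_error; /* what fsync would later report; 0 means none */
};

static void model_init(struct model_ordered *o)
{
	atomic_init(&o->flags, 0);
	o->mapping_error = 0;
}

/* Models the bare set_bit(): sets the flag, records nothing. */
static void model_set_ioerr(struct model_ordered *o)
{
	atomic_fetch_or(&o->flags, MODEL_IOERR_BIT);
}

/*
 * Models the helper: the error is recorded only by the caller that
 * transitions the bit from clear to set.
 */
static void model_mark_error(struct model_ordered *o)
{
	unsigned int old = atomic_fetch_or(&o->flags, MODEL_IOERR_BIT);

	if (!(old & MODEL_IOERR_BIT))
		o->mapping_error = -EIO;
}
```

Calling `model_set_ioerr()` followed by `model_mark_error()` leaves `mapping_error` at 0, which is the lost-error case. Calling `model_mark_error()` in both places records `-EIO` on the first call and does nothing on the second.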
## Step 2.4: ASSESS THE FIX QUALITY
- **Obviously correct**: YES. The helper function
`btrfs_mark_ordered_extent_error()` at line 336-340 does exactly
what's needed:
```336:340:fs/btrfs/ordered-data.c
void btrfs_mark_ordered_extent_error(struct btrfs_ordered_extent *ordered)
{
	if (!test_and_set_bit(BTRFS_ORDERED_IOERR, &ordered->flags))
		mapping_set_error(ordered->inode->vfs_inode.i_mapping, -EIO);
}
```
- **Minimal/surgical**: YES. 2 lines, identical transformation pattern.
- **Regression risk**: Essentially zero. The helper does a superset of
what the bare `set_bit()` did: same flag setting + additional error
recording. The `test_and_set_bit()` ensures `mapping_set_error()` is
called at most once.
- Record: [Obviously correct, minimal, zero regression risk]
---
# PHASE 3: GIT HISTORY INVESTIGATION
## Step 3.1: BLAME THE CHANGED LINES
- **ordered-data.c:388**: Introduced by commit `53df25869a5659`
(Christoph Hellwig, 2023-05-31) — "btrfs: factor out a
can_finish_ordered_extent helper". Present since v6.5.
- **zoned.c:2139**: Introduced by commit `71df088c1cc090` (Christoph
Hellwig, 2023-05-24) — "btrfs: defer splitting of ordered extents
until I/O completion". Present since v6.5.
- **Helper function** (`btrfs_mark_ordered_extent_error()`): Introduced
by commit `aa5ccf29173acf` (Josef Bacik, 2024-04-03) — "btrfs: handle
errors in btrfs_reloc_clone_csums properly". Present since v6.10.
Record: [Buggy code introduced in v6.5 by refactoring commits; helper
exists from v6.10]
## Step 3.2: FOLLOW THE FIXES: TAG
No Fixes: tag present — expected. The commits that introduced the bare
`set_bit()` calls (`53df25869a5659` and `71df088c1cc090`) are the
implicit "fixes targets."
Record: [Implicit fixes: 53df25869a5659 and 71df088c1cc090, both in
v6.5+]
## Step 3.3: CHECK FILE HISTORY FOR RELATED CHANGES
Recent ordered-data.c changes are refactoring (folio conversion, lock
relaxation). No conflicting changes to the `can_finish_ordered_extent()`
function in this area. The fix is self-contained.
Record: [No conflicting recent changes; fix is standalone]
## Step 3.4: CHECK THE AUTHOR'S OTHER COMMITS
Michal Grzedzicki from Meta. Other commits in this tree are in SCSI.
This appears to be a cross-subsystem contributor who found the bug
likely through `btrfs receive` usage at Meta.
Record: [Author is from Meta, likely found bug through production btrfs
receive usage]
## Step 3.5: CHECK FOR DEPENDENT/PREREQUISITE COMMITS
The helper `btrfs_mark_ordered_extent_error()` already exists in the 7.0
tree (and all trees since v6.10). The fix has NO dependencies for v6.10+
trees. For v6.6-v6.9, the helper would need to be backported first.
Record: [No dependencies for 7.0 tree; for older stable trees, helper
(aa5ccf29173acf) may be needed]
---
# PHASE 4: MAILING LIST AND EXTERNAL RESEARCH
## Step 4.1: FIND THE ORIGINAL PATCH DISCUSSION
b4 dig did not find a match (commit not yet in mainline at time of
search). Web search also did not locate the specific patch thread. This
is likely a recently submitted patch that was accepted into David
Sterba's btrfs tree but not yet pushed to mainline.
Record: [Could not locate lore discussion — likely very recent
submission]
## Step 4.2: CHECK WHO REVIEWED THE PATCH
Three Reviewed-by tags from Qu Wenruo, Johannes Thumshirn, and Mark
Harmstone — three core btrfs developers. David Sterba (btrfs maintainer)
signed off, indicating it passed through the official btrfs tree.
Record: [Reviewed by 3 expert btrfs developers; signed off by
maintainer]
## Step 4.3-4.5: SEARCH FOR BUG REPORT / RELATED PATCHES / STABLE HISTORY
No external bug report found. The commit message specifically mentions
`btrfs receive` producing files with zero-filled holes, suggesting this
was found in a production environment at Meta.
Record: [Likely production-found bug at Meta; no external reports found]
---
# PHASE 5: CODE SEMANTIC ANALYSIS
## Step 5.1: IDENTIFY KEY FUNCTIONS
- `can_finish_ordered_extent()` — processes ordered extent completion
- `btrfs_finish_ordered_zoned()` — handles zoned write completion
- `btrfs_mark_ordered_extent_error()` — the helper being used as the fix
## Step 5.2: TRACE CALLERS
`can_finish_ordered_extent()` is called by:
- `btrfs_finish_ordered_extent()` — called from IO completion paths (bio
endio)
- `btrfs_mark_ordered_io_finished()` — called from writeback paths
`btrfs_finish_ordered_zoned()` is called by:
- `btrfs_finish_ordered_io()` — the main ordered extent completion
function
These are core IO completion paths — every data write goes through them.
Record: [Core IO completion path; called for every write operation]
## Step 5.3-5.4: TRACE CALLEES AND CALL CHAIN
The encoded write path is: `btrfs_do_encoded_write()` →
`btrfs_submit_compressed_write()` → bio completion →
`end_bbio_compressed_write()` → `btrfs_finish_ordered_extent()` →
`can_finish_ordered_extent()`. This is the path that `btrfs receive`
uses.
Record: [btrfs receive → encoded write → bio completion → buggy
function]
## Step 5.5: SEARCH FOR SIMILAR PATTERNS
The third `set_bit(BTRFS_ORDERED_IOERR)` in `disk-io.c:4598` is in
`btrfs_destroy_ordered_extents()`, a cleanup path during filesystem
abort/unmount. This is intentionally different — during umount there's
no fsync concern, so bare `set_bit()` is acceptable there.
Record: [disk-io.c case is in cleanup path, doesn't need the fix]
---
# PHASE 6: CROSS-REFERENCING AND STABLE TREE ANALYSIS
## Step 6.1: DOES THE BUGGY CODE EXIST IN STABLE TREES?
- Buggy `set_bit()` calls: present since v6.5
- Helper function: present since v6.10
- For 7.0 tree: both the buggy code AND the helper exist. Fix applies
directly.
- For v6.12.y, v6.6.y: buggy code exists; v6.12 has the helper, v6.6
does not.
Record: [Bug exists in 7.0 tree; fix applies cleanly]
## Step 6.2: CHECK FOR BACKPORT COMPLICATIONS
The fix is a trivial 2-line substitution. No contextual conflicts
expected for 7.0.
Record: [Clean apply expected for 7.0]
## Step 6.3: CHECK IF RELATED FIXES ARE ALREADY IN STABLE
No related fixes found in this tree for this specific issue.
Record: [No existing fix in stable]
---
# PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT
## Step 7.1: IDENTIFY THE SUBSYSTEM AND ITS CRITICALITY
- **Subsystem**: fs/btrfs — filesystem
- **Criticality**: IMPORTANT — btrfs is widely used, and data integrity
is its primary value proposition
Record: [btrfs filesystem, IMPORTANT criticality]
## Step 7.2: ASSESS SUBSYSTEM ACTIVITY
The btrfs subsystem is very actively developed (48+ commits since v6.6
in ordered-data.c alone).
Record: [Very active subsystem]
---
# PHASE 8: IMPACT AND RISK ASSESSMENT
## Step 8.1: DETERMINE WHO IS AFFECTED
- All btrfs users who encounter IO errors during:
1. Encoded writes (`btrfs receive` with stream v2/v3)
2. Zoned device writes where ordered extent splitting fails
Record: [btrfs receive users, zoned device users]
## Step 8.2: DETERMINE THE TRIGGER CONDITIONS
- **Encoded writes**: Any IO error during `btrfs receive` (e.g., disk
error, corruption)
- **Zoned split**: Memory allocation failure during zoned ordered extent
splitting
- These are not obscure conditions — disk errors happen, and memory
pressure with `btrfs receive` on large datasets is common in
production
Record: [IO error during btrfs receive or zoned write; realistic trigger
conditions]
## Step 8.3: DETERMINE THE FAILURE MODE SEVERITY
- **Silent data corruption**: Files produced by `btrfs receive` have
zero-filled holes where data should be
- `fsync` returns success because `mapping_set_error()` was never called
- User believes data is safely written when it is not
- **Severity: CRITICAL** — silent data loss is the worst possible
outcome for a filesystem
Record: [Silent data loss; fsync returns success with corrupted data;
CRITICAL]
## Step 8.4: CALCULATE RISK-BENEFIT RATIO
- **BENEFIT**: Very high — prevents silent data corruption for `btrfs
receive` users
- **RISK**: Very low — 2-line change, replacing bare `set_bit()` with
existing helper that does strictly more, obviously correct
Record: [Very high benefit, very low risk — strongly favorable ratio]
---
# PHASE 9: FINAL SYNTHESIS
## Step 9.1: COMPILE THE EVIDENCE
**Evidence FOR backporting:**
- Fixes silent data corruption (the most critical bug category for
filesystems)
- Extremely small and surgical (2 lines)
- Uses existing well-tested helper function
- Obviously correct by code inspection
- 3 Reviewed-by from expert btrfs developers
- Signed off by btrfs subsystem maintainer (David Sterba)
- Bug affects `btrfs receive` — a critical data migration/backup tool
- All needed infrastructure (the helper) already exists in the 7.0 tree
- Zero regression risk
**Evidence AGAINST backporting:**
- None identified
**Unresolved questions:**
- Could not access lore.kernel.org to read full patch discussion (Anubis
bot protection)
- Could not find the commit in mainline (likely very recently merged or
in-flight to next merge window)
## Step 9.2: APPLY THE STABLE RULES CHECKLIST
1. **Obviously correct and tested?** YES — trivial substitution of bare
`set_bit()` with existing helper; 3 reviews
2. **Fixes a real bug?** YES — silent data loss in `btrfs receive`
3. **Important issue?** YES — data corruption (CRITICAL severity)
4. **Small and contained?** YES — 2 lines, 2 files
5. **No new features or APIs?** YES — uses existing helper
6. **Can apply to stable trees?** YES — clean apply for 7.0 (helper
exists)
## Step 9.3: CHECK FOR EXCEPTION CATEGORIES
Not an exception category — this is a standard bug fix, and a strong
one.
## Step 9.4: DECISION
This is a textbook stable backport candidate: a tiny, obviously correct
fix for silent data corruption in a widely used filesystem feature
(`btrfs receive`). It uses existing infrastructure, has been thoroughly
reviewed, and carries essentially zero regression risk.
---
# Verification
- [Phase 1] Parsed tags: 3 Reviewed-by (Qu Wenruo, Johannes Thumshirn,
Mark Harmstone), 2 Signed-off-by (author + David Sterba maintainer)
- [Phase 2] Diff analysis: 2 lines changed —
`set_bit(BTRFS_ORDERED_IOERR)` → `btrfs_mark_ordered_extent_error()`
in two locations
- [Phase 2] Verified `btrfs_mark_ordered_extent_error()` at line 336-340
does `test_and_set_bit() + mapping_set_error()` — confirmed exact
mechanism described in commit message
- [Phase 2] Verified `btrfs_finish_one_ordered()` at inode.c:3363 calls
`btrfs_mark_ordered_extent_error()` which finds bit already set and
skips `mapping_set_error()` — confirmed the "double-set" bug path
- [Phase 3] git blame: `ordered-data.c:388` introduced by 53df25869a5659
(Christoph Hellwig, v6.5)
- [Phase 3] git blame: `zoned.c:2139` introduced by 71df088c1cc090
(Christoph Hellwig, v6.5)
- [Phase 3] git show aa5ccf29173acf: confirmed
`btrfs_mark_ordered_extent_error()` introduced in v6.10 by Josef Bacik
- [Phase 3] git merge-base: confirmed helper exists in v6.10+, buggy
code in v6.5+
- [Phase 4] b4 dig: no match found (likely very recent patch not yet
indexed)
- [Phase 4] UNVERIFIED: Could not access lore.kernel.org discussion due
to bot protection
- [Phase 5] Traced call chain: `btrfs_do_encoded_write()` → bio
completion → `btrfs_finish_ordered_extent()` →
`can_finish_ordered_extent()` — confirmed encoded writes reach buggy
code
- [Phase 5] Verified disk-io.c:4598 `set_bit()` is in
`btrfs_destroy_ordered_extents()` (cleanup path) — correctly not fixed
- [Phase 5] Verified 3 remaining bare `set_bit(BTRFS_ORDERED_IOERR)`
calls: 2 fixed by this commit, 1 in cleanup path (acceptable)
- [Phase 6] Confirmed buggy code exists in 7.0 tree (read both files
directly)
- [Phase 6] Confirmed fix applies cleanly — helper exists, code context
matches
- [Phase 8] Failure mode: silent data loss (zero-filled holes) —
CRITICAL severity
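The double-set path from the Phase 2 items above can be illustrated with a small user-space model. This is a sketch, not kernel code: `struct model_extent`, `bare_set_ioerr()`, `mark_error()` and `error_reaches_mapping()` are hypothetical names. The only behaviour modelled is the one described above, a test-and-set of the error flag followed by recording the mapping error only when the flag was previously clear.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an ordered extent plus its inode mapping. */
struct model_extent {
	bool ioerr;          /* models the BTRFS_ORDERED_IOERR flag bit */
	bool mapping_error;  /* models the error recorded on the mapping */
};

/* Old behaviour at the two call sites: only the flag bit is set. */
static void bare_set_ioerr(struct model_extent *oe)
{
	oe->ioerr = true;
}

/*
 * Models the helper: record the mapping error only when this call is
 * the one that flips the flag from clear to set (test-and-set).
 */
static void mark_error(struct model_extent *oe)
{
	bool was_set = oe->ioerr;

	oe->ioerr = true;
	if (!was_set)
		oe->mapping_error = true;
}

/* Returns whether the mapping error is recorded after completion. */
static bool error_reaches_mapping(bool use_helper_first)
{
	struct model_extent oe = { false, false };

	if (use_helper_first)
		mark_error(&oe);      /* fixed call site */
	else
		bare_set_ioerr(&oe);  /* buggy call site */

	mark_error(&oe);              /* later call on the completion path */
	return oe.mapping_error;
}
```

In this model, a bare flag set followed by the helper never records the mapping error, which is the path the verification items describe; calling the helper at the first site records it.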
**YES**
fs/btrfs/ordered-data.c | 2 +-
fs/btrfs/zoned.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 5df02c707aee6..b65c1f1e2956e 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -385,7 +385,7 @@ static bool can_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
}
if (!uptodate)
- set_bit(BTRFS_ORDERED_IOERR, &ordered->flags);
+ btrfs_mark_ordered_extent_error(ordered);
if (ordered->bytes_left)
return false;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 0cd7fd3fcfa3a..d728c3bafc092 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2136,7 +2136,7 @@ void btrfs_finish_ordered_zoned(struct btrfs_ordered_extent *ordered)
continue;
}
if (!btrfs_zoned_split_ordered(ordered, logical, len)) {
- set_bit(BTRFS_ORDERED_IOERR, &ordered->flags);
+ btrfs_mark_ordered_extent_error(ordered);
btrfs_err(fs_info, "failed to split ordered extent");
goto out;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* RE: [PATCH AUTOSEL 7.0-5.10] btrfs: be less aggressive with metadata overcommit when we can do full flushing
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-5.10] btrfs: be less aggressive with metadata overcommit when we can do full flushing Sasha Levin
@ 2026-04-22 12:24 ` Aleksandar Gerasimovski
2026-04-22 12:28 ` Aleksandar Gerasimovski
2026-04-22 19:14 ` David Sterba
0 siblings, 2 replies; 10+ messages in thread
From: Aleksandar Gerasimovski @ 2026-04-22 12:24 UTC (permalink / raw)
To: Sasha Levin, patches@lists.linux.dev, stable@vger.kernel.org,
Rene Straub
Cc: Filipe Manana, Qu Wenruo, David Sterba, clm@fb.com,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
Hi everyone,
Can you add Rene's tag as a reporter as well? He was the one who first saw and triggered this internally, and I then followed up on the mailing list:
From: Sasha Levin <sashal@kernel.org>
Sent: Monday, April 20, 2026 3:19 PM
To: patches@lists.linux.dev; stable@vger.kernel.org
Cc: Filipe Manana <fdmanana@suse.com>; Aleksandar Gerasimovski <Aleksandar.Gerasimovski@belden.com>; Qu Wenruo <wqu@suse.com>; David Sterba <dsterba@suse.com>; Sasha Levin <sashal@kernel.org>; clm@fb.com; linux-btrfs@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 7.0-5.10] btrfs: be less aggressive with metadata overcommit when we can do full flushing
From: Filipe Manana <fdmanana@suse.com>
[ Upstream commit 574d93fc62e2b03ab39c8f92fb44ded89ca6274d ]
Over the years we often get reports of some -ENOSPC failure while updating
metadata that leads to a transaction abort. I have seen this happen for
filesystems of all sizes and with workloads that are very user/customer
specific and unable to reproduce, but Aleksandar recently reported a
simple way to reproduce this with a 1G filesystem and using the bonnie++
benchmark tool. The following test script reproduces the failure:
$ cat test.sh
#!/bin/bash
# Create and use a 1G null block device, memory backed, otherwise
# the test takes a very long time.
modprobe null_blk nr_devices="0"
null_dev="/sys/kernel/config/nullb/nullb0"
mkdir "$null_dev"
size=$((1 * 1024)) # in MB
echo 2 > "$null_dev/submit_queues"
echo "$size" > "$null_dev/size"
echo 1 > "$null_dev/memory_backed"
echo 1 > "$null_dev/discard"
echo 1 > "$null_dev/power"
DEV=/dev/nullb0
MNT=/mnt/nullb0
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/test/
bonnie++ -d $MNT/test/ -m BTRFS -u 0 -s 256M -r 128M -b
umount $MNT
echo 0 > "$null_dev/power"
rmdir "$null_dev"
When running this, bonnie++ fails in the phase where it deletes test
directories and files:
$ ./test.sh
(...)
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...Can't sync directory, turning off dir-sync.
Can't delete file 9Bq7sr0000000338
Cleaning up test directory after error.
Bonnie: drastic I/O error (rmdir): Read-only file system
And in the syslog/dmesg we can see the following transaction abort trace:
[161915.501506] BTRFS warning (device nullb0): Skipping commit of aborted transaction.
[161915.502983] ------------[ cut here ]------------
[161915.503832] BTRFS: Transaction aborted (error -28)
[161915.504748] WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
[161915.506786] Modules linked in: btrfs dm_zero dm_snapshot (...)
[161915.518759] CPU: 11 UID: 0 PID: 3377975 Comm: bonnie++ Tainted: G W 6.19.0-rc7-btrfs-next-224+ #4 PREEMPT(full)
[161915.520857] Tainted: [W]=WARN
[161915.521405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[161915.523414] RIP: 0010:btrfs_commit_transaction+0xa24/0xd30 [btrfs]
[161915.524630] Code: 48 8b 7c 24 (...)
[161915.526982] RSP: 0018:ffffd3fe8206fda8 EFLAGS: 00010292
[161915.527707] RAX: 0000000000000002 RBX: ffff8f4886d3c000 RCX: 0000000000000000
[161915.528723] RDX: 0000000002040001 RSI: 00000000ffffffe4 RDI: ffffffffc088f780
[161915.529691] RBP: ffff8f4f5adae7e0 R08: 0000000000000000 R09: ffffd3fe8206fb90
[161915.530842] R10: ffff8f4f9c1fffa8 R11: 0000000000000003 R12: 00000000ffffffe4
[161915.532027] R13: ffff8f4ef2cf2400 R14: ffff8f4f5adae708 R15: ffff8f4f62d18000
[161915.533229] FS: 00007ff93112a780(0000) GS:ffff8f4ff63ee000(0000) knlGS:0000000000000000
[161915.534611] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[161915.535575] CR2: 00005571b3072000 CR3: 0000000176080005 CR4: 0000000000370ef0
[161915.536758] Call Trace:
[161915.537185] <TASK>
[161915.537575] btrfs_sync_file+0x431/0x530 [btrfs]
[161915.538473] do_fsync+0x39/0x80
[161915.539042] __x64_sys_fsync+0xf/0x20
[161915.539750] do_syscall_64+0x50/0xf20
[161915.540396] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[161915.541301] RIP: 0033:0x7ff930ca49ee
[161915.541904] Code: 08 0f 85 f5 (...)
[161915.544830] RSP: 002b:00007ffd94291f38 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[161915.546152] RAX: ffffffffffffffda RBX: 00007ff93112a780 RCX: 00007ff930ca49ee
[161915.547263] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
[161915.548383] RBP: 0000000000000dab R08: 0000000000000000 R09: 0000000000000000
[161915.549853] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd94291fb0
[161915.551196] R13: 00007ffd94292350 R14: 0000000000000001 R15: 00007ffd94292340
[161915.552161] </TASK>
[161915.552457] ---[ end trace 0000000000000000 ]---
[161915.553232] BTRFS info (device nullb0 state A): dumping space info:
[161915.553236] BTRFS info (device nullb0 state A): space_info DATA (sub-group id 0) has 12582912 free, is not full
[161915.553239] BTRFS info (device nullb0 state A): space_info total=12582912, used=0, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
[161915.553243] BTRFS info (device nullb0 state A): space_info METADATA (sub-group id 0) has -5767168 free, is full
[161915.553245] BTRFS info (device nullb0 state A): space_info total=53673984, used=6635520, pinned=46956544, reserved=16384, may_use=5767168, readonly=65536 zone_unusable=0
[161915.553251] BTRFS info (device nullb0 state A): space_info SYSTEM (sub-group id 0) has 8355840 free, is not full
[161915.553254] BTRFS info (device nullb0 state A): space_info total=8388608, used=16384, pinned=16384, reserved=0, may_use=0, readonly=0 zone_unusable=0
[161915.553257] BTRFS info (device nullb0 state A): global_block_rsv: size 5767168 reserved 5767168
[161915.553261] BTRFS info (device nullb0 state A): trans_block_rsv: size 0 reserved 0
[161915.553263] BTRFS info (device nullb0 state A): chunk_block_rsv: size 0 reserved 0
[161915.553265] BTRFS info (device nullb0 state A): remap_block_rsv: size 0 reserved 0
[161915.553268] BTRFS info (device nullb0 state A): delayed_block_rsv: size 0 reserved 0
[161915.553270] BTRFS info (device nullb0 state A): delayed_refs_rsv: size 0 reserved 0
[161915.553272] BTRFS: error (device nullb0 state A) in cleanup_transaction:2045: errno=-28 No space left
[161915.554463] BTRFS info (device nullb0 state EA): forced readonly
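As a reading aid for the dump above, the "free" figure on each space_info line is consistent with the total minus the sum of the listed counters. The sketch below is only a user-space check of that arithmetic, not kernel code: the struct and function names are hypothetical, and the formula is an assumption that happens to match the printed numbers.

```c
#include <stdint.h>

/* Counters as printed on a "space_info total=..." line of the dump. */
struct space_counters {
	int64_t total, used, pinned, reserved, may_use, readonly, zone_unusable;
};

/* Assumed formula: total minus every accounted counter. */
static int64_t reported_free(const struct space_counters *c)
{
	return c->total - (c->used + c->pinned + c->reserved +
			   c->may_use + c->readonly + c->zone_unusable);
}

/* METADATA values copied from the dump at transaction abort. */
static int64_t metadata_free_at_abort(void)
{
	const struct space_counters md = {
		.total = 53673984, .used = 6635520, .pinned = 46956544,
		.reserved = 16384, .may_use = 5767168, .readonly = 65536,
		.zone_unusable = 0,
	};

	return reported_free(&md);
}
```

With the METADATA values this yields -5767168, the value printed in the dump. Most of the total is accounted as pinned, which only a transaction commit can release.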
The problem is that we allow for a very aggressive metadata overcommit,
about 1/8th of the currently available space, even when the task
attempting the reservation allows for full flushing. Over time this allows
more and more tasks to overcommit and join the same transaction without
triggering a transaction commit to release pinned extents. This eventually
leads to a transaction abort when attempting some tree update, as the
extent allocator is not able to find any available metadata extent and is
not able to allocate a new metadata block group either (not enough
unallocated space for that).
Fix this by allowing the overcommit to be up to 1/64th of the available
(unallocated) space instead and for that limit to apply to both types of
full flushing, BTRFS_RESERVE_FLUSH_ALL and BTRFS_RESERVE_FLUSH_ALL_STEAL.
This way we get more frequent transaction commits to release pinned
extents in case our caller is in a context where full flushing is allowed.
Note that the space infos dump in the dmesg/syslog right after the
transaction abort gives the wrong idea that we have plenty of unallocated
space when the abort happened. During the bonnie++ workload we had a
metadata chunk allocation attempt and it failed with -ENOSPC because at
that time we had a bunch of data block groups allocated, which then became
empty and got deleted by the cleaner kthread after the metadata chunk
allocation failed with -ENOSPC and before the transaction abort happened
and dumped the space infos.
The custom tracing (some trace_printk() calls spread in strategic places)
used to check that:
mount-1793735 [011] ...1. 28877.261096: btrfs_add_bg_to_space_info: added bg offset 13631488 length 8388608 flags 1 to space_info->flags 1 total_bytes 8388608 bytes_used 0 bytes_may_use 0
mount-1793735 [011] ...1. 28877.261098: btrfs_add_bg_to_space_info: added bg offset 22020096 length 8388608 flags 34 to space_info->flags 2 total_bytes 8388608 bytes_used 16384 bytes_may_use 0
mount-1793735 [011] ...1. 28877.261100: btrfs_add_bg_to_space_info: added bg offset 30408704 length 53673984 flags 36 to space_info->flags 4 total_bytes 53673984 bytes_used 131072 bytes_may_use 0
These are from loading the block groups created by mkfs during mount.
Then when bonnie++ starts doing its thing:
kworker/u48:5-1792004 [011] ..... 28886.122050: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
kworker/u48:5-1792004 [011] ..... 28886.122053: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 927596544
kworker/u48:5-1792004 [011] ..... 28886.122055: btrfs_make_block_group: make bg offset 84082688 size 117440512 type 1
kworker/u48:5-1792004 [011] ...1. 28886.122064: btrfs_add_bg_to_space_info: added bg offset 84082688 length 117440512 flags 1 to space_info->flags 1 total_bytes 125829120 bytes_used 0 bytes_may_use 5251072
First allocation of a data block group of 112M.
kworker/u48:5-1792004 [011] ..... 28886.192408: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
kworker/u48:5-1792004 [011] ..... 28886.192413: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 810156032
kworker/u48:5-1792004 [011] ..... 28886.192415: btrfs_make_block_group: make bg offset 201523200 size 117440512 type 1
kworker/u48:5-1792004 [011] ...1. 28886.192425: btrfs_add_bg_to_space_info: added bg offset 201523200 length 117440512 flags 1 to space_info->flags 1 total_bytes 243269632 bytes_used 0 bytes_may_use 122691584
Another 112M data block group allocated.
kworker/u48:5-1792004 [011] ..... 28886.260935: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
kworker/u48:5-1792004 [011] ..... 28886.260941: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 692715520
kworker/u48:5-1792004 [011] ..... 28886.260943: btrfs_make_block_group: make bg offset 318963712 size 117440512 type 1
kworker/u48:5-1792004 [011] ...1. 28886.260954: btrfs_add_bg_to_space_info: added bg offset 318963712 length 117440512 flags 1 to space_info->flags 1 total_bytes 360710144 bytes_used 0 bytes_may_use 240132096
Yet another one.
bonnie++-1793755 [010] ..... 28886.280407: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
bonnie++-1793755 [010] ..... 28886.280412: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 575275008
bonnie++-1793755 [010] ..... 28886.280414: btrfs_make_block_group: make bg offset 436404224 size 117440512 type 1
bonnie++-1793755 [010] ...1. 28886.280419: btrfs_add_bg_to_space_info: added bg offset 436404224 length 117440512 flags 1 to space_info->flags 1 total_bytes 478150656 bytes_used 0 bytes_may_use 268435456
One more.
kworker/u48:5-1792004 [011] ..... 28886.566233: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
kworker/u48:5-1792004 [011] ..... 28886.566238: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 457834496
kworker/u48:5-1792004 [011] ..... 28886.566241: btrfs_make_block_group: make bg offset 553844736 size 117440512 type 1
kworker/u48:5-1792004 [011] ...1. 28886.566250: btrfs_add_bg_to_space_info: added bg offset 553844736 length 117440512 flags 1 to space_info->flags 1 total_bytes 595591168 bytes_used 268435456 bytes_may_use 209723392
Another one.
bonnie++-1793755 [009] ..... 28886.613446: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
bonnie++-1793755 [009] ..... 28886.613451: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 340393984
bonnie++-1793755 [009] ..... 28886.613453: btrfs_make_block_group: make bg offset 671285248 size 117440512 type 1
bonnie++-1793755 [009] ...1. 28886.613458: btrfs_add_bg_to_space_info: added bg offset 671285248 length 117440512 flags 1 to space_info->flags 1 total_bytes 713031680 bytes_used 268435456 bytes_may_use 268435456
Another one.
bonnie++-1793755 [009] ..... 28886.674953: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
bonnie++-1793755 [009] ..... 28886.674957: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 222953472
bonnie++-1793755 [009] ..... 28886.674959: btrfs_make_block_group: make bg offset 788725760 size 117440512 type 1
bonnie++-1793755 [009] ...1. 28886.674963: btrfs_add_bg_to_space_info: added bg offset 788725760 length 117440512 flags 1 to space_info->flags 1 total_bytes 830472192 bytes_used 268435456 bytes_may_use 134217728
Another one.
bonnie++-1793755 [009] ..... 28886.674981: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
bonnie++-1793755 [009] ..... 28886.674982: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 105512960
bonnie++-1793755 [009] ..... 28886.674983: btrfs_make_block_group: make bg offset 906166272 size 105512960 type 1
bonnie++-1793755 [009] ...1. 28886.674984: btrfs_add_bg_to_space_info: added bg offset 906166272 length 105512960 flags 1 to space_info->flags 1 total_bytes 935985152 bytes_used 268435456 bytes_may_use 67108864
Another one, but a bit smaller (~100.6M) since we now have less space.
bonnie++-1793758 [009] ..... 28891.962096: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
bonnie++-1793758 [009] ..... 28891.962103: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 12582912
bonnie++-1793758 [009] ..... 28891.962105: btrfs_make_block_group: make bg offset 1011679232 size 12582912 type 1
bonnie++-1793758 [009] ...1. 28891.962114: btrfs_add_bg_to_space_info: added bg offset 1011679232 length 12582912 flags 1 to space_info->flags 1 total_bytes 948568064 bytes_used 268435456 bytes_may_use 8192
Another one, this one even smaller (12M).
kworker/u48:5-1792004 [011] ..... 28892.112802: btrfs_chunk_alloc: enter first metadata chunk alloc attempt
kworker/u48:5-1792004 [011] ..... 28892.112805: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 131072 dev_extent_want 536870912
kworker/u48:5-1792004 [011] ..... 28892.112806: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 131072 dev_extent_want 536870912 max_avail 0
536870912 is 512M, the standard 256M metadata chunk size times 2 because
of the DUP profile for metadata.
'max_avail' is what find_free_dev_extent() returns to us in
gather_device_info().
As a result, gather_device_info() sets ctl->ndevs to 0, making
decide_stripe_size() fail with -ENOSPC, and therefore metadata chunk
allocation fails while we are attempting to run delayed items during
the transaction commit.
kworker/u48:5-1792004 [011] ..... 28892.112807: btrfs_create_chunk: decide_stripe_size fail -ENOSPC
In the syslog/dmesg pasted above, which happened after the transaction was
aborted, the space info dumps did not account for all these data block
groups that were allocated during bonnie++'s workload. And that is because
after the metadata chunk allocation failed with -ENOSPC and before the
transaction abort happened, most of the data block groups had become empty
and got deleted by the cleaner kthread - when the abort happened, we
had bonnie++ in the middle of deleting the files it created.
But dumping the space infos right after the metadata chunk allocation fails
by adding a call to btrfs_dump_space_info_for_trans_abort() in
decide_stripe_size() when it returns -ENOSPC, we get:
[29972.409295] BTRFS info (device nullb0): dumping space info:
[29972.409300] BTRFS info (device nullb0): space_info DATA (sub-group id 0) has 673341440 free, is not full
[29972.409303] BTRFS info (device nullb0): space_info total=948568064, used=0, pinned=275226624, reserved=0, may_use=0, readonly=0 zone_unusable=0
[29972.409305] BTRFS info (device nullb0): space_info METADATA (sub-group id 0) has 3915776 free, is not full
[29972.409306] BTRFS info (device nullb0): space_info total=53673984, used=163840, pinned=42827776, reserved=147456, may_use=6553600, readonly=65536 zone_unusable=0
[29972.409308] BTRFS info (device nullb0): space_info SYSTEM (sub-group id 0) has 7979008 free, is not full
[29972.409310] BTRFS info (device nullb0): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=393216, readonly=0 zone_unusable=0
[29972.409311] BTRFS info (device nullb0): global_block_rsv: size 5767168 reserved 5767168
[29972.409313] BTRFS info (device nullb0): trans_block_rsv: size 0 reserved 0
[29972.409314] BTRFS info (device nullb0): chunk_block_rsv: size 393216 reserved 393216
[29972.409315] BTRFS info (device nullb0): remap_block_rsv: size 0 reserved 0
[29972.409316] BTRFS info (device nullb0): delayed_block_rsv: size 0 reserved 0
So here we see there's ~904.6M of data space, ~51.2M of metadata space and
8M of system space, making a total of 963.8M.
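The totals quoted above can be reproduced from the byte counts in the dump. The sketch below is plain user-space arithmetic with hypothetical names. The second helper additionally assumes that both metadata and system chunks use the DUP profile; the text above only states DUP for metadata, so treat that part as an assumption.

```c
#include <stdint.h>

#define MIB (1024.0 * 1024.0)

/* space_info totals (bytes) from the dump taken at the failed chunk allocation. */
static const uint64_t data_total = 948568064ULL;
static const uint64_t metadata_total = 53673984ULL;
static const uint64_t system_total = 8388608ULL;

static double to_mib(uint64_t bytes)
{
	return (double)bytes / MIB;
}

/* Logical space covered by block groups: data + metadata + system. */
static double allocated_total_mib(void)
{
	return to_mib(data_total + metadata_total + system_total);
}

/* Assumption: DUP stores metadata and system chunks twice on the device. */
static double on_device_mib_assuming_dup(void)
{
	return to_mib(data_total + 2 * (metadata_total + system_total));
}
```

The logical total comes to 963.8125 MiB (904.625 + 51.1875 + 8). Under the DUP assumption the on-device footprint is 1023 MiB, which would leave essentially nothing unallocated on a 1G device and is in line with max_avail being 0 in the trace.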
Reported-by: Aleksandar Gerasimovski <Aleksandar.Gerasimovski@belden.com>
Link: https://lore.kernel.org/linux-btrfs/SA1PR18MB56922F690C5EC2D85371408B998FA@SA1PR18MB5692.namprd18.prod.outlook.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H61vZ3_+eqJ1A9po2WcgNJJjUu9MJQoYB2oDSAAecHaug@mail.gmail.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
I now have enough information for a thorough analysis. Let me compile my
findings.
---
## PHASE 1: COMMIT MESSAGE FORENSICS
### Step 1.1: Subject Line
- **Subsystem prefix:** `btrfs`
- **Action verb:** "be less aggressive" (behavioral adjustment)
- **Summary:** Reduce metadata overcommit aggressiveness when full
flushing is possible, to avoid -ENOSPC transaction aborts.
- Record: [btrfs] [behavioral fix] [reduce overcommit to prevent
transaction abort -ENOSPC]
### Step 1.2: Tags
- **Reported-by:** Aleksandar Gerasimovski (user report with a
reproducible test case)
- **Link 1:** lore bug report thread
- **Link 2:** lore follow-up discussion
- **Reviewed-by:** Qu Wenruo (core btrfs developer)
- **Signed-off-by:** Filipe Manana (author, prominent btrfs developer),
David Sterba (btrfs maintainer)
- No Fixes: tag (expected for candidates under review)
- No Cc: stable (expected)
- Record: User-reported with reproduction steps, reviewed by a key btrfs
developer, signed-off by the btrfs maintainer.
### Step 1.3: Commit Body Analysis
The commit describes a transaction abort with -ENOSPC (error -28) during
bonnie++ workload on a 1G filesystem. The abort forces the filesystem
read-only. The detailed trace shows `btrfs_commit_transaction` aborting
at line 2045 with the call path `btrfs_sync_file -> do_fsync ->
__x64_sys_fsync`. The author explains that the overly generous 1/8
overcommit allows too many tasks to overcommit without triggering
transaction commits that would release pinned extents, eventually
leading to metadata exhaustion and transaction abort. Includes custom
tracing evidence of block group allocation behavior leading up to the
failure.
- Record: Real bug manifesting as filesystem going read-only
(transaction abort with -ENOSPC) during normal workload on small
filesystem. Root cause: too-aggressive metadata overcommit allows too
many tasks to bypass flushing, resulting in no free metadata extents
and no unallocated space for new metadata chunks.
### Step 1.4: Hidden Bug Fix Detection
This is not a hidden fix - it is clearly described as fixing a
transaction abort bug. The words "Fix this by" are explicitly used.
Record: This IS a direct bug fix.
---
## PHASE 2: DIFF ANALYSIS
### Step 2.1: Inventory
- **Files changed:** `fs/btrfs/space-info.c` (1 file)
- **Lines changed:** 3 lines modified (1 comment change, 2 logic
changes)
- **Functions modified:** `calc_available_free_space()`
- **Scope:** Single-file, surgical fix
### Step 2.2: Code Flow Change
Before:
- When `flush == BTRFS_RESERVE_FLUSH_ALL`, overcommit limit was `avail
>> 3` (1/8 of available)
- `BTRFS_RESERVE_FLUSH_ALL_STEAL` fell through to `else` branch: `avail
>> 1` (1/2 of available)
After:
- When `flush == BTRFS_RESERVE_FLUSH_ALL || flush ==
BTRFS_RESERVE_FLUSH_ALL_STEAL`, overcommit limit is `avail >> 6` (1/64
of available)
- This is more conservative, forcing earlier transaction commits
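The before/after behaviour can be sketched as two small functions. This is a user-space model under stated assumptions: `enum model_flush`, `limit_before()` and `limit_after()` are hypothetical names, and only the final shift is modelled. The input `avail` stands for whatever `calc_available_free_space()` has computed up to that point.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the flush modes named in the diff. */
enum model_flush {
	MODEL_NO_FLUSH,
	MODEL_FLUSH_ALL,
	MODEL_FLUSH_ALL_STEAL,
};

/* Overcommit limit before the patch: 1/8 for FLUSH_ALL, 1/2 otherwise. */
static uint64_t limit_before(uint64_t avail, enum model_flush flush)
{
	if (flush == MODEL_FLUSH_ALL)
		return avail >> 3;
	return avail >> 1;
}

/* After the patch: 1/64 for both full-flush modes, 1/2 otherwise. */
static uint64_t limit_after(uint64_t avail, enum model_flush flush)
{
	if (flush == MODEL_FLUSH_ALL || flush == MODEL_FLUSH_ALL_STEAL)
		return avail >> 6;
	return avail >> 1;
}
```

For an illustrative `avail` of 1 GiB, the limit for the FLUSH_ALL case drops from 128 MiB to 16 MiB, and for the FLUSH_ALL_STEAL case from 512 MiB to 16 MiB. Modes that cannot flush fully keep the 1/2 limit.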
### Step 2.3: Bug Mechanism
This is a **logic/correctness fix**. The overcommit threshold was too
generous, allowing too many tasks to avoid triggering the space flushing
machinery, which would commit transactions and unpin extents. This
eventually exhausted metadata space with no recovery path.
Two bugs fixed:
1. `BTRFS_RESERVE_FLUSH_ALL_STEAL` was falling into the "else" (1/2
overcommit) branch — far too generous for a flush type that CAN do
full flushing.
2. Even `BTRFS_RESERVE_FLUSH_ALL` at 1/8 was too aggressive for small
filesystems.
### Step 2.4: Fix Quality
- Minimal and obviously correct — reducing overcommit thresholds is safe
- Well-understood mechanism with detailed analysis in commit message
- Regression risk: slightly more frequent transaction commits under
  metadata space pressure (performance trade-off, not a correctness
  regression)
- The author is Filipe Manana, one of the most prolific btrfs developers
Record: Very high quality, obviously correct, minimal scope.
---
## PHASE 3: GIT HISTORY
### Step 3.1: Blame
The buggy code (`avail >>= 3` / `avail >>= 1`) was introduced in commit
`41783ef24d56ce` ("btrfs: move and export can_overcommit") by Josef
Bacik, merged in v5.4. The code has been in every kernel since v5.4.
### Step 3.2: No Fixes: tag — skipped as expected.
### Step 3.3: File History
`fs/btrfs/space-info.c` has ~90 changes since v6.6 but the specific
`calc_available_free_space()` function's overcommit logic has only been
touched by:
- `cb6cbab79055c` (v6.7, adjusted overcommit for "very close to full"
condition)
- `64d2c847ba380` (v6.10, zoned fix)
- Various argument refactoring (fs_info removal)
The current patch touches only the two lines at the `>>= 3` / `>>= 1`
branch which have been stable since v5.4.
### Step 3.4: Author
Filipe Manana is one of the most active btrfs contributors with hundreds
of commits. He regularly fixes space reservation bugs and is deeply
familiar with the overcommit subsystem.
### Step 3.5: Dependencies
The patch is standalone. The only dependency is the existence of
`BTRFS_RESERVE_FLUSH_ALL_STEAL`, which was added in commit
`7f9fe61440769` and confirmed present in all stable trees back to v5.10.
---
## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH
Lore.kernel.org has bot protection enabled, preventing direct access.
However:
- The commit has two Link: tags referencing mailing list discussions
- The commit was reviewed by Qu Wenruo and signed-off by David Sterba
- The commit message includes the original user report from Aleksandar
Gerasimovski
Record: Could not access lore directly. The commit has proper review
chain and user report.
---
## PHASE 5: CODE SEMANTIC ANALYSIS
### Step 5.1: Function Modified
`calc_available_free_space()` — computes how much overcommit is allowed
for metadata.
### Step 5.2: Callers
1. `check_can_overcommit()` → called by `can_overcommit()` and
`btrfs_can_overcommit()`
2. `btrfs_calc_reclaim_metadata_size()` — reclaim size calculation
3. `need_preemptive_reclaim()` — decides if preemptive reclaim is needed
These are called during **every metadata reservation** in the kernel.
This is a hot path for all btrfs operations.
### Step 5.3-5.4: Call Chain
`reserve_bytes()` → `can_overcommit()` → `check_can_overcommit()` →
`calc_available_free_space()`
This is reachable from any filesystem operation that reserves metadata
(file creation, deletion, modification, etc.).
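How the limit feeds into the reservation decision can be sketched as a single comparison. This is a simplified model, not the kernel implementation: `model_can_overcommit()` and its parameters are hypothetical, and the shape of the comparison is an assumption about what the check at the end of this call chain does with the value returned by `calc_available_free_space()`.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified admission test: a reservation of `bytes` is allowed without
 * flushing while the already-accounted space plus the request stays
 * below the allocated total plus the overcommit limit.
 */
static bool model_can_overcommit(uint64_t total, uint64_t used,
				 uint64_t bytes, uint64_t overcommit_limit)
{
	return used + bytes < total + overcommit_limit;
}
```

As an illustration with made-up numbers, take a fully accounted 53673984-byte metadata space and 128 MiB of available space. A 4 MiB request passes with a 1/8 limit (16 MiB) but is refused with a 1/64 limit (2 MiB), so the task falls back to flushing earlier.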
### Step 5.5: Similar Patterns
The earlier commit `cb6cbab79055c` addressed a related but different
aspect of overcommit (when very close to full). This patch addresses the
general case.
---
## PHASE 6: STABLE TREE ANALYSIS
### Step 6.1: Buggy Code in Stable Trees
Verified the EXACT same code pattern exists in ALL active stable trees:
- v5.10: same code at line 327
- v5.15: same code at line 324
- v6.1: same code at line 372
- v6.6: same code at line 373
- v6.12: same code at line 421
`BTRFS_RESERVE_FLUSH_ALL_STEAL` confirmed present in v5.10+.
### Step 6.2: Backport Complications
The surrounding context has minor differences (e.g., the zoned mode
alignment was added in v6.10, function signature changed in v6.13+) but
the actual 3-line change applies to code that is IDENTICAL across all
stable trees. Minor context adjustment may be needed for the surrounding
lines (no zoned block in older trees), but the core logic change is
trivially backportable.
### Step 6.3: No related fix already in stable.
---
## PHASE 7: SUBSYSTEM CONTEXT
### Step 7.1: Subsystem and Criticality
- **Subsystem:** `fs/btrfs` — filesystem
- **Criticality:** IMPORTANT — btrfs is a widely-used filesystem,
especially in enterprise (SLES, openSUSE) and desktop Linux. Metadata
ENOSPC bugs cause data loss risk (filesystem goes read-only).
### Step 7.2: Activity
btrfs/space-info.c is very actively maintained with frequent
improvements and fixes.
---
## PHASE 8: IMPACT AND RISK ASSESSMENT
### Step 8.1: Affected Users
All btrfs users, especially those with smaller filesystems (1G-8G) under
heavy workloads. This is common in containers, VMs, embedded systems,
and IoT devices.
### Step 8.2: Trigger Conditions
- Normal file operations (create/delete files) on a filesystem that has
most of its space allocated to data
- Reproducible with bonnie++ on a 1G filesystem
- No special privileges needed — any user writing files can trigger this
### Step 8.3: Failure Mode
**CRITICAL** — Transaction abort with -ENOSPC forces the filesystem into
read-only mode. This means:
- Active writes fail
- The filesystem must be unmounted/remounted to recover
- Potential data loss if writes were in progress
- User sees "Read-only file system" errors
### Step 8.4: Risk-Benefit Ratio
- **Benefit:** HIGH — prevents filesystem going read-only on small
filesystems under normal workloads
- **Risk:** VERY LOW — 3-line change reducing an overcommit threshold;
the only behavioral change is slightly more frequent transaction
commits, which is a minor performance trade-off with no correctness
risk
- **Ratio:** Strongly favorable
---
## PHASE 9: FINAL SYNTHESIS
### Step 9.1: Evidence Summary
**FOR backporting:**
- Fixes a real, reproducible bug (filesystem goes read-only)
- Failure mode is CRITICAL (transaction abort, forced read-only)
- Extremely small patch (3 lines in 1 file)
- Reviewed by core btrfs developer (Qu Wenruo), signed off by maintainer
(David Sterba)
- Author is one of the most prolific btrfs developers (Filipe Manana)
- User-reported with clear reproduction case
- Buggy code exists in ALL stable trees (v5.10+)
- Fix applies cleanly with trivial context adjustment
- No new features or APIs
- The FLUSH_ALL_STEAL case falling through to 1/2 overcommit was clearly
a bug
**AGAINST backporting:**
- This is a behavioral/heuristic change rather than a strictly "wrong
code" fix
- Could theoretically affect performance on large filesystems (more
frequent commits)
- Changes a tuning constant (1/8 → 1/64) which is somewhat subjective
- No Fixes: tag (but this is expected)
### Step 9.2: Stable Rules Checklist
1. Obviously correct and tested? **YES** — reproduced, analyzed with
tracing, reviewed
2. Fixes a real bug? **YES** — transaction abort forcing read-only
filesystem
3. Important issue? **YES** — filesystem goes read-only = CRITICAL
4. Small and contained? **YES** — 3 lines in 1 file
5. No new features? **YES** — changes an existing threshold
6. Can apply to stable? **YES** — verified same code in all stable trees
### Step 9.3: Exception Category
Not applicable — this is a standard bug fix.
### Step 9.4: Decision
The fix is small, surgical, obviously correct, and addresses a CRITICAL
failure mode (filesystem forced read-only). The risk is minimal
(slightly more conservative overcommit policy). The benefit-to-risk
ratio is extremely favorable.
---
## Verification
- [Phase 1] Parsed tags: Reported-by from Aleksandar Gerasimovski,
Reviewed-by Qu Wenruo, two lore Links
- [Phase 2] Diff analysis: 3-line change in
`calc_available_free_space()` — shifts avail from >>3 to >>6 for full-
flush cases, and adds FLUSH_ALL_STEAL to the condition
- [Phase 3] git blame: buggy code introduced in commit 41783ef24d56ce
(v5.4), present in all stable trees
- [Phase 3] Verified FLUSH_ALL_STEAL exists in v5.10+ (commit
7f9fe61440769)
- [Phase 3] Author Filipe Manana has hundreds of btrfs commits —
confirmed core developer
- [Phase 4] lore.kernel.org blocked by bot protection — could not access
mailing list directly
- [Phase 5] Traced callers: `calc_available_free_space()` called through
`can_overcommit()` from `reserve_bytes()` — on every metadata
reservation path
- [Phase 6] Verified exact same code pattern in v5.10, v5.15, v6.1,
v6.6, v6.12 trees
- [Phase 6] cb6cbab79055c (related but different overcommit fix) already
in v6.7+
- [Phase 8] Failure mode: transaction abort → read-only filesystem =
CRITICAL severity
- [Phase 8] Trigger: normal file operations on small filesystem, easily
reproducible
- UNVERIFIED: Could not access mailing list discussion to check for NAKs
or concerns about performance regression on large filesystems
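
The threshold change noted in the Phase 2 entry above can be sketched as a small standalone C model. This is only an illustration of the final scaling step: the enum and function names below are hypothetical stand-ins, and the real `calc_available_free_space()` does further work before this point that is not modeled here.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's BTRFS_RESERVE_* flush modes. */
enum flush_mode { FLUSH_NONE, FLUSH_ALL, FLUSH_ALL_STEAL };

/* Before the patch: only FLUSH_ALL was limited to 1/8 of the available
 * space; every other mode, including FLUSH_ALL_STEAL, got 1/2. */
static uint64_t overcommit_limit_old(uint64_t avail, enum flush_mode flush)
{
	if (flush == FLUSH_ALL)
		return avail >> 3;
	return avail >> 1;
}

/* After the patch: both full-flush modes are limited to 1/64 of the
 * available space; modes that cannot fully flush keep 1/2. */
static uint64_t overcommit_limit_new(uint64_t avail, enum flush_mode flush)
{
	if (flush == FLUSH_ALL || flush == FLUSH_ALL_STEAL)
		return avail >> 6;
	return avail >> 1;
}
```

For an illustrative 64 MiB of available space, the old limits work out to 8 MiB for FLUSH_ALL and 32 MiB for FLUSH_ALL_STEAL, while the new limit is 1 MiB for both.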
**YES**
fs/btrfs/space-info.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 87cbc051cb12f..b2b775ab878c6 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -489,10 +489,10 @@ static u64 calc_available_free_space(const struct btrfs_space_info *space_info,
/*
* If we aren't flushing all things, let us overcommit up to
* 1/2th of the space. If we can flush, don't let us overcommit
- * too much, let it overcommit up to 1/8 of the space.
+ * too much, let it overcommit up to 1/64th of the space.
*/
- if (flush == BTRFS_RESERVE_FLUSH_ALL)
- avail >>= 3;
+ if (flush == BTRFS_RESERVE_FLUSH_ALL || flush == BTRFS_RESERVE_FLUSH_ALL_STEAL)
+ avail >>= 6;
else
avail >>= 1;
--
2.53.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* RE: [PATCH AUTOSEL 7.0-5.10] btrfs: be less aggressive with metadata overcommit when we can do full flushing
2026-04-22 12:24 ` Aleksandar Gerasimovski
@ 2026-04-22 12:28 ` Aleksandar Gerasimovski
2026-04-22 19:14 ` David Sterba
1 sibling, 0 replies; 10+ messages in thread
From: Aleksandar Gerasimovski @ 2026-04-22 12:28 UTC (permalink / raw)
To: Sasha Levin, patches@lists.linux.dev, stable@vger.kernel.org,
Rene Straub
Cc: Filipe Manana, Qu Wenruo, David Sterba, clm@fb.com,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
Hi everyone,
Can you add Rene's tag as a reporter as well? He was the one who saw and triggered the problem internally, and I followed up on it on the mailing list:
Reported-by: Rene Straub <Rene.Straub@belden.com>
Regards,
Aleksandar
> -----Original Message-----
> From: Aleksandar Gerasimovski
> Sent: Wednesday, April 22, 2026 2:25 PM
> To: Sasha Levin <sashal@kernel.org>; patches@lists.linux.dev;
> stable@vger.kernel.org; Rene Straub <Rene.Straub@belden.com>
> Cc: Filipe Manana <fdmanana@suse.com>; Qu Wenruo <wqu@suse.com>;
> David Sterba <dsterba@suse.com>; clm@fb.com; linux-btrfs@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Subject: RE: [PATCH AUTOSEL 7.0-5.10] btrfs: be less aggressive with
> metadata overcommit when we can do full flushing
>
> Hi everyone,
>
> Can you add Rene's tag as a reporter as well? He was the one who saw and
> triggered the problem internally, and I followed up on it on the
> mailing list:
>
>
>
> From: Sasha Levin <sashal@kernel.org>
> Sent: Monday, April 20, 2026 3:19 PM
> To: patches@lists.linux.dev; stable@vger.kernel.org
> Cc: Filipe Manana <fdmanana@suse.com>; Aleksandar Gerasimovski
> <Aleksandar.Gerasimovski@belden.com>; Qu Wenruo <wqu@suse.com>;
> David Sterba <dsterba@suse.com>; Sasha Levin <sashal@kernel.org>;
> clm@fb.com; linux-btrfs@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH AUTOSEL 7.0-5.10] btrfs: be less aggressive with metadata
> overcommit when we can do full flushing
>
> From: Filipe Manana <fdmanana@suse.com>
>
> [ Upstream commit 574d93fc62e2b03ab39c8f92fb44ded89ca6274d ]
>
> Over the years we often get reports of some -ENOSPC failure while updating
> metadata that leads to a transaction abort. I have seen this happen for
> filesystems of all sizes and with workloads that are very user/customer specific
> and unable to reproduce, but Aleksandar recently reported a simple way to
> reproduce this with a 1G filesystem and using the bonnie++ benchmark tool.
> The following test script reproduces the failure:
>
> $ cat test.sh
> #!/bin/bash
>
> # Create and use a 1G null block device, memory backed, otherwise
> # the test takes a very long time.
> modprobe null_blk nr_devices="0"
> null_dev="/sys/kernel/config/nullb/nullb0"
> mkdir "$null_dev"
> size=$((1 * 1024)) # in MB
> echo 2 > "$null_dev/submit_queues"
> echo "$size" > "$null_dev/size"
> echo 1 > "$null_dev/memory_backed"
> echo 1 > "$null_dev/discard"
> echo 1 > "$null_dev/power"
>
> DEV=/dev/nullb0
> MNT=/mnt/nullb0
>
> mkfs.btrfs -f $DEV
> mount $DEV $MNT
>
> mkdir $MNT/test/
> bonnie++ -d $MNT/test/ -m BTRFS -u 0 -s 256M -r 128M -b
>
> umount $MNT
>
> echo 0 > "$null_dev/power"
> rmdir "$null_dev"
>
> When running this bonnie++ fails in the phase where it deletes test directories
> and files:
>
> $ ./test.sh
> (...)
> Using uid:0, gid:0.
> Writing a byte at a time...done
> Writing intelligently...done
> Rewriting...done
> Reading a byte at a time...done
> Reading intelligently...done
> start 'em...done...done...done...done...done...
> Create files in sequential order...done.
> Stat files in sequential order...done.
> Delete files in sequential order...done.
> Create files in random order...done.
> Stat files in random order...done.
> Delete files in random order...Can't sync directory, turning off dir-sync.
> Can't delete file 9Bq7sr0000000338
> Cleaning up test directory after error.
> Bonnie: drastic I/O error (rmdir): Read-only file system
>
> And in the syslog/dmesg we can see the following transaction abort trace:
>
> [161915.501506] BTRFS warning (device nullb0): Skipping commit of
> aborted transaction.
> [161915.502983] ------------[ cut here ]------------
> [161915.503832] BTRFS: Transaction aborted (error -28)
> [161915.504748] WARNING: fs/btrfs/transaction.c:2045 at
> btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
> [161915.506786] Modules linked in: btrfs dm_zero dm_snapshot (...)
> [161915.518759] CPU: 11 UID: 0 PID: 3377975 Comm: bonnie++ Tainted:
> G W 6.19.0-rc7-btrfs-next-224+ #4 PREEMPT(full)
> [161915.520857] Tainted: [W]=WARN
> [161915.521405] Hardware name: QEMU Standard PC (i440FX + PIIX,
> 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
> [161915.523414] RIP: 0010:btrfs_commit_transaction+0xa24/0xd30
> [btrfs]
> [161915.524630] Code: 48 8b 7c 24 (...)
> [161915.526982] RSP: 0018:ffffd3fe8206fda8 EFLAGS: 00010292
> [161915.527707] RAX: 0000000000000002 RBX: ffff8f4886d3c000 RCX:
> 0000000000000000
> [161915.528723] RDX: 0000000002040001 RSI: 00000000ffffffe4 RDI:
> ffffffffc088f780
> [161915.529691] RBP: ffff8f4f5adae7e0 R08: 0000000000000000 R09:
> ffffd3fe8206fb90
> [161915.530842] R10: ffff8f4f9c1fffa8 R11: 0000000000000003 R12:
> 00000000ffffffe4
> [161915.532027] R13: ffff8f4ef2cf2400 R14: ffff8f4f5adae708 R15:
> ffff8f4f62d18000
> [161915.533229] FS: 00007ff93112a780(0000)
> GS:ffff8f4ff63ee000(0000) knlGS:0000000000000000
> [161915.534611] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [161915.535575] CR2: 00005571b3072000 CR3: 0000000176080005
> CR4: 0000000000370ef0
> [161915.536758] Call Trace:
> [161915.537185] <TASK>
> [161915.537575] btrfs_sync_file+0x431/0x530 [btrfs]
> [161915.538473] do_fsync+0x39/0x80
> [161915.539042] __x64_sys_fsync+0xf/0x20
> [161915.539750] do_syscall_64+0x50/0xf20
> [161915.540396] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [161915.541301] RIP: 0033:0x7ff930ca49ee
> [161915.541904] Code: 08 0f 85 f5 (...)
> [161915.544830] RSP: 002b:00007ffd94291f38 EFLAGS: 00000246
> ORIG_RAX: 000000000000004a
> [161915.546152] RAX: ffffffffffffffda RBX: 00007ff93112a780 RCX:
> 00007ff930ca49ee
> [161915.547263] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> 0000000000000003
> [161915.548383] RBP: 0000000000000dab R08: 0000000000000000
> R09: 0000000000000000
> [161915.549853] R10: 0000000000000000 R11: 0000000000000246
> R12: 00007ffd94291fb0
> [161915.551196] R13: 00007ffd94292350 R14: 0000000000000001 R15:
> 00007ffd94292340
> [161915.552161] </TASK>
> [161915.552457] ---[ end trace 0000000000000000 ]---
> [161915.553232] BTRFS info (device nullb0 state A): dumping space info:
> [161915.553236] BTRFS info (device nullb0 state A): space_info DATA (sub-
> group id 0) has 12582912 free, is not full
> [161915.553239] BTRFS info (device nullb0 state A): space_info
> total=12582912, used=0, pinned=0, reserved=0, may_use=0, readonly=0
> zone_unusable=0
> [161915.553243] BTRFS info (device nullb0 state A): space_info METADATA
> (sub-group id 0) has -5767168 free, is full
> [161915.553245] BTRFS info (device nullb0 state A): space_info
> total=53673984, used=6635520, pinned=46956544, reserved=16384,
> may_use=5767168, readonly=65536 zone_unusable=0
> [161915.553251] BTRFS info (device nullb0 state A): space_info SYSTEM
> (sub-group id 0) has 8355840 free, is not full
> [161915.553254] BTRFS info (device nullb0 state A): space_info
> total=8388608, used=16384, pinned=16384, reserved=0, may_use=0,
> readonly=0 zone_unusable=0
> [161915.553257] BTRFS info (device nullb0 state A): global_block_rsv: size
> 5767168 reserved 5767168
> [161915.553261] BTRFS info (device nullb0 state A): trans_block_rsv: size 0
> reserved 0
> [161915.553263] BTRFS info (device nullb0 state A): chunk_block_rsv: size
> 0 reserved 0
> [161915.553265] BTRFS info (device nullb0 state A): remap_block_rsv: size
> 0 reserved 0
> [161915.553268] BTRFS info (device nullb0 state A): delayed_block_rsv:
> size 0 reserved 0
> [161915.553270] BTRFS info (device nullb0 state A): delayed_refs_rsv: size
> 0 reserved 0
> [161915.553272] BTRFS: error (device nullb0 state A) in
> cleanup_transaction:2045: errno=-28 No space left
> [161915.554463] BTRFS info (device nullb0 state EA): forced readonly
>
> The problem is that we allow for a very aggressive metadata overcommit,
> about 1/8th of the currently available space, even when the task attempting
> the reservation allows for full flushing. Over time this allows more and more
> tasks to overcommit without getting a transaction commit to release pinned
> extents, joining the same transaction and eventually leading to the transaction
> abort when attempting some tree update, as the extent allocator is not able to
> find any available metadata extent and it's not able to allocate a new metadata
> block group either (not enough unallocated space for that).
>
> Fix this by allowing the overcommit to be up to 1/64th of the available
> (unallocated) space instead and for that limit to apply to both types of full
> flushing, BTRFS_RESERVE_FLUSH_ALL and
> BTRFS_RESERVE_FLUSH_ALL_STEAL.
> This way we get more frequent transaction commits to release pinned extents
> in case our caller is in a context where full flushing is allowed.
>
> Note that the space infos dump in the dmesg/syslog right after the transaction
> abort gives the wrong idea that we had plenty of unallocated space when the
> abort happened. During the bonnie++ workload we had a metadata chunk
> allocation attempt and it failed with -ENOSPC because at that time we had a
> bunch of data block groups allocated, which then became empty and got
> deleted by the cleaner kthread after the metadata chunk allocation failed with
> -ENOSPC and before the transaction abort happened and dumped the space
> infos.
>
> The custom tracing (some trace_printk() calls spread in strategic places) used
> to check that:
>
> mount-1793735 [011] ...1. 28877.261096: btrfs_add_bg_to_space_info:
> added bg offset 13631488 length 8388608 flags 1 to space_info->flags 1
> total_bytes 8388608 bytes_used 0 bytes_may_use 0
> mount-1793735 [011] ...1. 28877.261098: btrfs_add_bg_to_space_info:
> added bg offset 22020096 length 8388608 flags 34 to space_info->flags 2
> total_bytes 8388608 bytes_used 16384 bytes_may_use 0
> mount-1793735 [011] ...1. 28877.261100: btrfs_add_bg_to_space_info:
> added bg offset 30408704 length 53673984 flags 36 to space_info->flags 4
> total_bytes 53673984 bytes_used 131072 bytes_may_use 0
>
> These are from loading the block groups created by mkfs during mount.
>
> Then when bonnie++ starts doing its thing:
>
> kworker/u48:5-1792004 [011] ..... 28886.122050: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
> kworker/u48:5-1792004 [011] ..... 28886.122053: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 927596544
> kworker/u48:5-1792004 [011] ..... 28886.122055:
> btrfs_make_block_group: make bg offset 84082688 size 117440512 type 1
> kworker/u48:5-1792004 [011] ...1. 28886.122064:
> btrfs_add_bg_to_space_info: added bg offset 84082688 length 117440512
> flags 1 to space_info->flags 1 total_bytes 125829120 bytes_used 0
> bytes_may_use 5251072
>
> First allocation of a data block group of 112M.
>
> kworker/u48:5-1792004 [011] ..... 28886.192408: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
> kworker/u48:5-1792004 [011] ..... 28886.192413: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 810156032
> kworker/u48:5-1792004 [011] ..... 28886.192415:
> btrfs_make_block_group: make bg offset 201523200 size 117440512 type 1
> kworker/u48:5-1792004 [011] ...1. 28886.192425:
> btrfs_add_bg_to_space_info: added bg offset 201523200 length 117440512
> flags 1 to space_info->flags 1 total_bytes 243269632 bytes_used 0
> bytes_may_use 122691584
>
> Another 112M data block group allocated.
>
> kworker/u48:5-1792004 [011] ..... 28886.260935: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
> kworker/u48:5-1792004 [011] ..... 28886.260941: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 692715520
> kworker/u48:5-1792004 [011] ..... 28886.260943:
> btrfs_make_block_group: make bg offset 318963712 size 117440512 type 1
> kworker/u48:5-1792004 [011] ...1. 28886.260954:
> btrfs_add_bg_to_space_info: added bg offset 318963712 length 117440512
> flags 1 to space_info->flags 1 total_bytes 360710144 bytes_used 0
> bytes_may_use 240132096
>
> Yet another one.
>
> bonnie++-1793755 [010] ..... 28886.280407: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
> bonnie++-1793755 [010] ..... 28886.280412: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 575275008
> bonnie++-1793755 [010] ..... 28886.280414: btrfs_make_block_group:
> make bg offset 436404224 size 117440512 type 1
> bonnie++-1793755 [010] ...1. 28886.280419: btrfs_add_bg_to_space_info:
> added bg offset 436404224 length 117440512 flags 1 to space_info->flags 1
> total_bytes 478150656 bytes_used 0 bytes_may_use 268435456
>
> One more.
>
> kworker/u48:5-1792004 [011] ..... 28886.566233: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
> kworker/u48:5-1792004 [011] ..... 28886.566238: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 457834496
> kworker/u48:5-1792004 [011] ..... 28886.566241:
> btrfs_make_block_group: make bg offset 553844736 size 117440512 type 1
> kworker/u48:5-1792004 [011] ...1. 28886.566250:
> btrfs_add_bg_to_space_info: added bg offset 553844736 length 117440512
> flags 1 to space_info->flags 1 total_bytes 595591168 bytes_used 268435456
> bytes_may_use 209723392
>
> Another one.
>
> bonnie++-1793755 [009] ..... 28886.613446: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
> bonnie++-1793755 [009] ..... 28886.613451: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 340393984
> bonnie++-1793755 [009] ..... 28886.613453: btrfs_make_block_group:
> make bg offset 671285248 size 117440512 type 1
> bonnie++-1793755 [009] ...1. 28886.613458: btrfs_add_bg_to_space_info:
> added bg offset 671285248 length 117440512 flags 1 to space_info->flags 1
> total_bytes 713031680 bytes_used 268435456 bytes_may_use 268435456
>
> Another one.
>
> bonnie++-1793755 [009] ..... 28886.674953: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
> bonnie++-1793755 [009] ..... 28886.674957: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 222953472
> bonnie++-1793755 [009] ..... 28886.674959: btrfs_make_block_group:
> make bg offset 788725760 size 117440512 type 1
> bonnie++-1793755 [009] ...1. 28886.674963: btrfs_add_bg_to_space_info:
> added bg offset 788725760 length 117440512 flags 1 to space_info->flags 1
> total_bytes 830472192 bytes_used 268435456 bytes_may_use 134217728
>
> Another one.
>
> bonnie++-1793755 [009] ..... 28886.674981: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
> bonnie++-1793755 [009] ..... 28886.674982: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 105512960
> bonnie++-1793755 [009] ..... 28886.674983: btrfs_make_block_group:
> make bg offset 906166272 size 105512960 type 1
> bonnie++-1793755 [009] ...1. 28886.674984: btrfs_add_bg_to_space_info:
> added bg offset 906166272 length 105512960 flags 1 to space_info->flags 1
> total_bytes 935985152 bytes_used 268435456 bytes_may_use 67108864
>
> Another one, but a bit smaller (~100.6M) since we now have less space.
>
> bonnie++-1793758 [009] ..... 28891.962096: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824
> bonnie++-1793758 [009] ..... 28891.962103: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want
> 1073741824 max_avail 12582912
> bonnie++-1793758 [009] ..... 28891.962105: btrfs_make_block_group:
> make bg offset 1011679232 size 12582912 type 1
> bonnie++-1793758 [009] ...1. 28891.962114: btrfs_add_bg_to_space_info:
> added bg offset 1011679232 length 12582912 flags 1 to space_info->flags 1
> total_bytes 948568064 bytes_used 268435456 bytes_may_use 8192
>
> Another one, this one even smaller (12M).
>
> kworker/u48:5-1792004 [011] ..... 28892.112802: btrfs_chunk_alloc: enter
> first metadata chunk alloc attempt
> kworker/u48:5-1792004 [011] ..... 28892.112805: btrfs_create_chunk:
> gather_device_info 1 ctl->dev_extent_min = 131072 dev_extent_want
> 536870912
> kworker/u48:5-1792004 [011] ..... 28892.112806: btrfs_create_chunk:
> gather_device_info 2 ctl->dev_extent_min = 131072 dev_extent_want
> 536870912 max_avail 0
>
> 536870912 is 512M, the standard 256M metadata chunk size times 2
> because of the DUP profile for metadata.
> 'max_avail' is what find_free_dev_extent() returns to us in
> gather_device_info().
>
> As a result, gather_device_info() sets ctl->ndevs to 0, making
> decide_stripe_size() fail with -ENOSPC, and therefore metadata chunk
> allocation fails while we are attempting to run delayed items during the
> transaction commit.
>
> kworker/u48:5-1792004 [011] ..... 28892.112807: btrfs_create_chunk:
> decide_stripe_size fail -ENOSPC
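
The failure path described above can be sketched as a simplified C model. The function names are hypothetical and this is not the kernel code; it assumes a device only counts as usable when its largest free extent is at least `dev_extent_min`, and that the DUP profile needs at least one usable device.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the device scan: count the devices whose
 * largest free extent (max_avail) is at least dev_extent_min. */
static int count_usable_devs(const uint64_t *max_avail, size_t n,
			     uint64_t dev_extent_min)
{
	int ndevs = 0;

	for (size_t i = 0; i < n; i++) {
		if (max_avail[i] >= dev_extent_min)
			ndevs++;
	}
	return ndevs;
}

/* Hypothetical model of the stripe-size check: with fewer usable
 * devices than the profile requires, the allocation fails. */
static int check_stripe_devs(int ndevs, int devs_min)
{
	return ndevs < devs_min ? -ENOSPC : 0;
}
```

With the traced values (one device, `max_avail` 0, `dev_extent_min` 131072), the model finds zero usable devices and returns -ENOSPC, matching the trace line above.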
>
> In the syslog/dmesg pasted above, which happened after the transaction was
> aborted, the space info dumps did not account for all these data block groups
> that were allocated during bonnie++'s workload. And that is because after the
> metadata chunk allocation failed with -ENOSPC and before the transaction
> abort happened, most of the data block groups had become empty and got
> deleted by the cleaner kthread - when the abort happened, we had
> bonnie++ in the middle of deleting the files it created.
>
> But dumping the space infos right after the metadata chunk allocation fails by
> adding a call to btrfs_dump_space_info_for_trans_abort() in
> decide_stripe_size() when it returns -ENOSPC, we get:
>
> [29972.409295] BTRFS info (device nullb0): dumping space info:
> [29972.409300] BTRFS info (device nullb0): space_info DATA (sub-group id
> 0) has 673341440 free, is not full
> [29972.409303] BTRFS info (device nullb0): space_info total=948568064,
> used=0, pinned=275226624, reserved=0, may_use=0, readonly=0
> zone_unusable=0
> [29972.409305] BTRFS info (device nullb0): space_info METADATA (sub-
> group id 0) has 3915776 free, is not full
> [29972.409306] BTRFS info (device nullb0): space_info total=53673984,
> used=163840, pinned=42827776, reserved=147456, may_use=6553600,
> readonly=65536 zone_unusable=0
> [29972.409308] BTRFS info (device nullb0): space_info SYSTEM (sub-group
> id 0) has 7979008 free, is not full
> [29972.409310] BTRFS info (device nullb0): space_info total=8388608,
> used=16384, pinned=0, reserved=0, may_use=393216, readonly=0
> zone_unusable=0
> [29972.409311] BTRFS info (device nullb0): global_block_rsv: size 5767168
> reserved 5767168
> [29972.409313] BTRFS info (device nullb0): trans_block_rsv: size 0 reserved
> 0
> [29972.409314] BTRFS info (device nullb0): chunk_block_rsv: size 393216
> reserved 393216
> [29972.409315] BTRFS info (device nullb0): remap_block_rsv: size 0
> reserved 0
> [29972.409316] BTRFS info (device nullb0): delayed_block_rsv: size 0
> reserved 0
>
> So here we see there's ~904.6M of data space, ~51.2M of metadata space and
> 8M of system space, making a total of 963.8M.
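
The totals quoted above can be rechecked with a short C sketch. The struct and helper names are hypothetical; the byte values are copied from the `total=` fields of the space info dump taken right after the failed metadata chunk allocation.

```c
#include <stdint.h>

/* Hypothetical container for the three space_info totals, in bytes. */
struct space_totals {
	uint64_t data;
	uint64_t metadata;
	uint64_t system;
};

/* Values taken from the "total=" fields in the dump above. */
static const struct space_totals dump_totals = {
	.data = 948568064ULL,
	.metadata = 53673984ULL,
	.system = 8388608ULL,
};

/* Sum of all allocated block group space, in bytes. */
static uint64_t total_bytes(const struct space_totals *t)
{
	return t->data + t->metadata + t->system;
}

/* Convert bytes to MiB (1 MiB = 1024 * 1024 bytes). */
static double bytes_to_mib(uint64_t bytes)
{
	return (double)bytes / (1024.0 * 1024.0);
}
```

The sum is 1010630656 bytes, or 963.8125 MiB, which agrees with the rounded 963.8M figure.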
>
> Reported-by: Aleksandar Gerasimovski <Aleksandar.Gerasimovski@belden.com>
> Link: https://lore.kernel.org/linux-btrfs/SA1PR18MB56922F690C5EC2D85371408B998FA@SA1PR18MB5692.namprd18.prod.outlook.com/
> Link: https://lore.kernel.org/linux-btrfs/CAL3q7H61vZ3_+eqJ1A9po2WcgNJJjUu9MJQoYB2oDSAAecHaug@mail.gmail.com/
> Reviewed-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> ---
>
> LLM Generated explanations, may be completely bogus:
>
> I now have enough information for a thorough analysis. Let me compile my
> findings.
>
> ---
>
> ## PHASE 1: COMMIT MESSAGE FORENSICS
>
> ### Step 1.1: Subject Line
> - **Subsystem prefix:** `btrfs`
> - **Action verb:** "be less aggressive" (behavioral adjustment)
> - **Summary:** Reduce metadata overcommit aggressiveness when full
> flushing is possible, to avoid -ENOSPC transaction aborts.
> - Record: [btrfs] [behavioral fix] [reduce overcommit to prevent
> transaction abort -ENOSPC]
>
> ### Step 1.2: Tags
> - **Reported-by:** Aleksandar Gerasimovski (user report with a
> reproducible test case)
> - **Link 1:** lore bug report thread
> - **Link 2:** lore follow-up discussion
> - **Reviewed-by:** Qu Wenruo (core btrfs developer)
> - **Signed-off-by:** Filipe Manana (author, prominent btrfs developer),
> David Sterba (btrfs maintainer)
> - No Fixes: tag (expected for candidates under review)
> - No Cc: stable (expected)
> - Record: User-reported with reproduction steps, reviewed by a key btrfs
> developer, signed-off by the btrfs maintainer.
>
> ### Step 1.3: Commit Body Analysis
> The commit describes a transaction abort with -ENOSPC (error -28) during
> bonnie++ workload on a 1G filesystem. The abort forces the filesystem
> read-only. The detailed trace shows `btrfs_commit_transaction` aborting at
> line 2045 with the call path `btrfs_sync_file -> do_fsync -> __x64_sys_fsync`.
> The author explains that the overly generous 1/8 overcommit allows too many
> tasks to overcommit without triggering transaction commits that would
> release pinned extents, eventually leading to metadata exhaustion and
> transaction abort. Includes custom tracing evidence of block group allocation
> behavior leading up to the failure.
>
> - Record: Real bug manifesting as filesystem going read-only
> (transaction abort with -ENOSPC) during normal workload on small
> filesystem. Root cause: too-aggressive metadata overcommit allows too
> many tasks to bypass flushing, resulting in no free metadata extents
> and no unallocated space for new metadata chunks.
>
> ### Step 1.4: Hidden Bug Fix Detection
> This is not a hidden fix - it is clearly described as fixing a transaction abort bug.
> The words "Fix this by" are explicitly used.
> Record: This IS a direct bug fix.
>
> ---
>
> ## PHASE 2: DIFF ANALYSIS
>
> ### Step 2.1: Inventory
> - **Files changed:** `fs/btrfs/space-info.c` (1 file)
> - **Lines changed:** 3 lines modified (1 comment change, 2 logic
> changes)
> - **Functions modified:** `calc_available_free_space()`
> - **Scope:** Single-file, surgical fix
>
> ### Step 2.2: Code Flow Change
> Before:
> - When `flush == BTRFS_RESERVE_FLUSH_ALL`, overcommit limit was `avail
> >> 3` (1/8 of available)
> - `BTRFS_RESERVE_FLUSH_ALL_STEAL` fell through to `else` branch: `avail
> >> 1` (1/2 of available)
>
> After:
> - When `flush == BTRFS_RESERVE_FLUSH_ALL || flush ==
> BTRFS_RESERVE_FLUSH_ALL_STEAL`, overcommit limit is `avail >> 6` (1/64
> of available)
> - This is more conservative, forcing earlier transaction commits
>
> ### Step 2.3: Bug Mechanism
> This is a **logic/correctness fix**. The overcommit threshold was too
> generous, allowing too many tasks to avoid triggering the space flushing
> machinery, which would commit transactions and unpin extents. This
> eventually exhausted metadata space with no recovery path.
>
> Two bugs fixed:
> 1. `BTRFS_RESERVE_FLUSH_ALL_STEAL` was falling into the "else" (1/2
> overcommit) branch — far too generous for a flush type that CAN do
> full flushing.
> 2. Even `BTRFS_RESERVE_FLUSH_ALL` at 1/8 was too aggressive for small
> filesystems.
>
> ### Step 2.4: Fix Quality
> - Minimal and obviously correct — reducing overcommit thresholds is safe
> - Well-understood mechanism with detailed analysis in commit message
> - Regression risk: slightly more frequent transaction commits under
> metadata space pressure (performance trade-off, not a correctness regression)
> - The author is Filipe Manana, one of the most prolific btrfs developers
>
> Record: Very high quality, obviously correct, minimal scope.
>
> ---
>
> ## PHASE 3: GIT HISTORY
>
> ### Step 3.1: Blame
> The buggy code (`avail >>= 3` / `avail >>= 1`) was introduced in commit
> `41783ef24d56ce` ("btrfs: move and export can_overcommit") by Josef Bacik,
> merged in v5.4. The code has been in every kernel since v5.4.
>
> ### Step 3.2: No Fixes: tag — skipped as expected.
>
> ### Step 3.3: File History
> `fs/btrfs/space-info.c` has ~90 changes since v6.6 but the specific
> `calc_available_free_space()` function's overcommit logic has only been
> touched by:
> - `cb6cbab79055c` (v6.7, adjusted overcommit for "very close to full"
> condition)
> - `64d2c847ba380` (v6.10, zoned fix)
> - Various argument refactoring (fs_info removal)
>
> The current patch touches only the two lines at the `>>= 3` / `>>= 1` branch
> which have been stable since v5.4.
>
> ### Step 3.4: Author
> Filipe Manana is one of the most active btrfs contributors with hundreds of
> commits. He regularly fixes space reservation bugs and is deeply familiar with
> the overcommit subsystem.
>
> ### Step 3.5: Dependencies
> The patch is standalone. The only dependency is the existence of
> `BTRFS_RESERVE_FLUSH_ALL_STEAL`, which was added in commit
> `7f9fe61440769` and confirmed present in all stable trees back to v5.10.
>
> ---
>
> ## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH
>
> Lore.kernel.org has bot protection enabled, preventing direct access.
> However:
> - The commit has two Link: tags referencing mailing list discussions
> - The commit was reviewed by Qu Wenruo and signed-off by David Sterba
> - The commit message includes the original user report from Aleksandar
> Gerasimovski
>
> Record: Could not access lore directly. The commit has proper review chain
> and user report.
>
> ---
>
> ## PHASE 5: CODE SEMANTIC ANALYSIS
>
> ### Step 5.1: Function Modified
> `calc_available_free_space()` — computes how much overcommit is allowed
> for metadata.
>
> ### Step 5.2: Callers
> 1. `check_can_overcommit()` → called by `can_overcommit()` and
> `btrfs_can_overcommit()`
> 2. `btrfs_calc_reclaim_metadata_size()` - reclaim size calculation
> 3. `need_preemptive_reclaim()` - decides if preemptive reclaim is needed
>
> These are called during **every metadata reservation** in the kernel.
> This is a hot path for all btrfs operations.
>
> ### Step 5.3-5.4: Call Chain
> `reserve_bytes()` → `can_overcommit()` → `check_can_overcommit()` →
> `calc_available_free_space()`
>
> This is reachable from any filesystem operation that reserves metadata (file
> creation, deletion, modification, etc.).
>
> ### Step 5.5: Similar Patterns
> The earlier commit `cb6cbab79055c` addressed a related but different aspect
> of overcommit (when very close to full). This patch addresses the general case.
>
> ---
>
> ## PHASE 6: STABLE TREE ANALYSIS
>
> ### Step 6.1: Buggy Code in Stable Trees
> Verified the EXACT same code pattern exists in ALL active stable trees:
> - v5.10: same code at line 327
> - v5.15: same code at line 324
> - v6.1: same code at line 372
> - v6.6: same code at line 373
> - v6.12: same code at line 421
>
> `BTRFS_RESERVE_FLUSH_ALL_STEAL` confirmed present in v5.10+.
>
> ### Step 6.2: Backport Complications
> The surrounding context has minor differences (e.g., the zoned mode
> alignment was added in v6.10, function signature changed in v6.13+) but the
> actual 3-line change applies to code that is IDENTICAL across all stable trees.
> Minor context adjustment may be needed for the surrounding lines (no zoned
> block in older trees), but the core logic change is trivially backportable.
>
> ### Step 6.3: No related fix already in stable.
>
> ---
>
> ## PHASE 7: SUBSYSTEM CONTEXT
>
> ### Step 7.1: Subsystem and Criticality
> - **Subsystem:** `fs/btrfs` — filesystem
> - **Criticality:** IMPORTANT — btrfs is a widely-used filesystem,
> especially in enterprise (SLES, openSUSE) and desktop Linux. Metadata
> ENOSPC bugs cause data loss risk (filesystem goes read-only).
>
> ### Step 7.2: Activity
> btrfs/space-info.c is very actively maintained with frequent improvements and
> fixes.
>
> ---
>
> ## PHASE 8: IMPACT AND RISK ASSESSMENT
>
> ### Step 8.1: Affected Users
> All btrfs users, especially those with smaller filesystems (1G-8G) under heavy
> workloads. This is common in containers, VMs, embedded systems, and IoT
> devices.
>
> ### Step 8.2: Trigger Conditions
> - Normal file operations (create/delete files) on a filesystem that has
> most of its space allocated to data
> - Reproducible with bonnie++ on a 1G filesystem
> - No special privileges needed — any user writing files can trigger this
>
> ### Step 8.3: Failure Mode
> **CRITICAL** — Transaction abort with -ENOSPC forces the filesystem into
> read-only mode. This means:
> - Active writes fail
> - The filesystem must be unmounted/remounted to recover
> - Potential data loss if writes were in progress
> - User sees "Read-only file system" errors
>
> ### Step 8.4: Risk-Benefit Ratio
> - **Benefit:** HIGH — prevents filesystem going read-only on small
> filesystems under normal workloads
> - **Risk:** VERY LOW — 3-line change reducing an overcommit threshold;
> the only behavioral change is slightly more frequent transaction
> commits, which is a minor performance trade-off with no correctness
> risk
> - **Ratio:** Strongly favorable
>
> ---
>
> ## PHASE 9: FINAL SYNTHESIS
>
> ### Step 9.1: Evidence Summary
>
> **FOR backporting:**
> - Fixes a real, reproducible bug (filesystem goes read-only)
> - Failure mode is CRITICAL (transaction abort, forced read-only)
> - Extremely small patch (3 lines in 1 file)
> - Reviewed by core btrfs developer (Qu Wenruo), signed off by maintainer
> (David Sterba)
> - Author is one of the most prolific btrfs developers (Filipe Manana)
> - User-reported with clear reproduction case
> - Buggy code exists in ALL stable trees (v5.10+)
> - Fix applies cleanly with trivial context adjustment
> - No new features or APIs
> - The FLUSH_ALL_STEAL case falling through to 1/2 overcommit was clearly
> a bug
>
> **AGAINST backporting:**
> - This is a behavioral/heuristic change rather than a strictly "wrong
> code" fix
> - Could theoretically affect performance on large filesystems (more
> frequent commits)
> - Changes a tuning constant (1/8 → 1/64) which is somewhat subjective
> - No Fixes: tag (but this is expected)
>
> ### Step 9.2: Stable Rules Checklist
> 1. Obviously correct and tested? **YES** — reproduced, analyzed with
> tracing, reviewed
> 2. Fixes a real bug? **YES** — transaction abort forcing read-only
> filesystem
> 3. Important issue? **YES** — filesystem goes read-only = CRITICAL
> 4. Small and contained? **YES** — 3 lines in 1 file
> 5. No new features? **YES** — changes an existing threshold
> 6. Can apply to stable? **YES** — verified same code in all stable
> trees
>
> ### Step 9.3: Exception Category
> Not applicable — this is a standard bug fix.
>
> ### Step 9.4: Decision
> The fix is small, surgical, obviously correct, and addresses a CRITICAL failure
> mode (filesystem forced read-only). The risk is minimal (slightly more
> conservative overcommit policy). The benefit-to-risk ratio is extremely
> favorable.
>
> ---
>
> ## Verification
>
> - [Phase 1] Parsed tags: Reported-by from Aleksandar Gerasimovski,
> Reviewed-by Qu Wenruo, two lore Links
> - [Phase 2] Diff analysis: 3-line change in
> `calc_available_free_space()` — shifts avail from >>3 to >>6 for
> full-flush cases, and adds FLUSH_ALL_STEAL to the condition
> - [Phase 3] git blame: buggy code introduced in commit 41783ef24d56ce
> (v5.4), present in all stable trees
> - [Phase 3] Verified FLUSH_ALL_STEAL exists in v5.10+ (commit
> 7f9fe61440769)
> - [Phase 3] Author Filipe Manana has hundreds of btrfs commits —
> confirmed core developer
> - [Phase 4] lore.kernel.org blocked by bot protection — could not access
> mailing list directly
> - [Phase 5] Traced callers: `calc_available_free_space()` called through
> `can_overcommit()` from `reserve_bytes()` — on every metadata
> reservation path
> - [Phase 6] Verified exact same code pattern in v5.10, v5.15, v6.1,
> v6.6, v6.12 trees
> - [Phase 6] cb6cbab79055c (related but different overcommit fix) already
> in v6.7+
> - [Phase 8] Failure mode: transaction abort → read-only filesystem =
> CRITICAL severity
> - [Phase 8] Trigger: normal file operations on small filesystem, easily
> reproducible
> - UNVERIFIED: Could not access mailing list discussion to check for NAKs
> or concerns about performance regression on large filesystems
>
> **YES**
>
> fs/btrfs/space-info.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 87cbc051cb12f..b2b775ab878c6 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -489,10 +489,10 @@ static u64 calc_available_free_space(const struct btrfs_space_info *space_info,
>  	/*
>  	 * If we aren't flushing all things, let us overcommit up to
>  	 * 1/2th of the space. If we can flush, don't let us overcommit
> -	 * too much, let it overcommit up to 1/8 of the space.
> +	 * too much, let it overcommit up to 1/64th of the space.
>  	 */
> -	if (flush == BTRFS_RESERVE_FLUSH_ALL)
> -		avail >>= 3;
> +	if (flush == BTRFS_RESERVE_FLUSH_ALL || flush == BTRFS_RESERVE_FLUSH_ALL_STEAL)
> +		avail >>= 6;
>  	else
>  		avail >>= 1;
>
> --
> 2.53.0
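
For illustration only, the effect of the quoted hunk on the overcommit
cap can be sketched as a small standalone C fragment. This is a
simplified model, not the kernel function: `enum flush_mode`,
`limit_before()` and `limit_after()` are hypothetical names, and the
real calc_available_free_space() performs further adjustments that are
left out here. Only the final shift shown in the diff is modelled.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's reserve flush modes. */
enum flush_mode {
	FLUSH_NONE,      /* any mode that cannot do a full flush */
	FLUSH_ALL,       /* models BTRFS_RESERVE_FLUSH_ALL */
	FLUSH_ALL_STEAL, /* models BTRFS_RESERVE_FLUSH_ALL_STEAL */
};

/* Before the patch: only FLUSH_ALL is capped at 1/8; the rest get 1/2. */
static uint64_t limit_before(uint64_t avail, enum flush_mode flush)
{
	if (flush == FLUSH_ALL)
		return avail >> 3;
	return avail >> 1;
}

/* After the patch: both full-flush modes are capped at 1/64. */
static uint64_t limit_after(uint64_t avail, enum flush_mode flush)
{
	if (flush == FLUSH_ALL || flush == FLUSH_ALL_STEAL)
		return avail >> 6;
	return avail >> 1;
}
```

With 512 MiB passed as `avail`, the FLUSH_ALL cap drops from 64 MiB to
8 MiB, and the FLUSH_ALL_STEAL cap drops from 256 MiB to 8 MiB, since it
previously fell through to the 1/2 branch. Modes that cannot do a full
flush keep the 1/2 cap.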
* Re: [PATCH AUTOSEL 7.0-5.10] btrfs: be less aggressive with metadata overcommit when we can do full flushing
2026-04-22 12:24 ` Aleksandar Gerasimovski
2026-04-22 12:28 ` Aleksandar Gerasimovski
@ 2026-04-22 19:14 ` David Sterba
1 sibling, 0 replies; 10+ messages in thread
From: David Sterba @ 2026-04-22 19:14 UTC (permalink / raw)
To: Aleksandar Gerasimovski
Cc: Sasha Levin, patches@lists.linux.dev, stable@vger.kernel.org,
Rene Straub, Filipe Manana, Qu Wenruo, David Sterba, clm@fb.com,
linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org
On Wed, Apr 22, 2026 at 12:24:52PM +0000, Aleksandar Gerasimovski wrote:
> Hi everyone,
>
> Can you add Rene as a reporter as well? He was the one who saw and
> triggered this internally, and I followed up on the mailing list:
Technically the Reported-by tag can be added, but that would be only for
the stable patches, and I don't think this has ever been done. Credits
are collected before the patch is merged; after that it's immutable
(574d93fc62e2).
Thread overview: 10+ messages
[not found] <20260420132314.1023554-1-sashal@kernel.org>
2026-04-20 13:16 ` [PATCH AUTOSEL 7.0-5.15] btrfs: don't allow log trees to consume global reserve or overcommit metadata Sasha Levin
2026-04-20 13:18 ` [PATCH AUTOSEL 7.0] btrfs: avoid GFP_ATOMIC allocations in qgroup free paths Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-5.10] btrfs: replace BUG_ON() with error return in cache_save_setup() Sasha Levin
2026-04-20 13:19 ` [PATCH AUTOSEL 7.0-5.10] btrfs: be less aggressive with metadata overcommit when we can do full flushing Sasha Levin
2026-04-22 12:24 ` Aleksandar Gerasimovski
2026-04-22 12:28 ` Aleksandar Gerasimovski
2026-04-22 19:14 ` David Sterba
2026-04-20 13:20 ` [PATCH AUTOSEL 6.18] btrfs: fix zero size inode with non-zero size after log replay Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.19] btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit Sasha Levin
2026-04-20 13:21 ` [PATCH AUTOSEL 7.0-6.12] btrfs: fix silent IO error loss in encoded writes and zoned split Sasha Levin