* [PATCH AUTOSEL 6.19-5.10] btrfs: tracepoints: get correct superblock from dentry in event btrfs_sync_file()
[not found] <20260330123842.756154-1-sashal@kernel.org>
@ 2026-03-30 12:38 ` Sasha Levin
2026-03-30 12:38 ` [PATCH AUTOSEL 6.19-6.18] btrfs: fix zero size inode with non-zero size after log replay Sasha Levin
1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2026-03-30 12:38 UTC (permalink / raw)
To: patches, stable
Cc: Goldwyn Rodrigues, Boris Burkov, Goldwyn Rodrigues, David Sterba,
Sasha Levin, clm, rostedt, mhiramat, linux-btrfs, linux-kernel,
linux-trace-kernel
From: Goldwyn Rodrigues <rgoldwyn@suse.de>
[ Upstream commit a85b46db143fda5869e7d8df8f258ccef5fa1719 ]
If overlay is used on top of btrfs, dentry->d_sb translates to overlay's
super block and fsid assignment will lead to a crash.
Use file_inode(file)->i_sb to always get btrfs_sb.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
Now I have all the evidence. Here is the complete analysis.
---
## PHASE 1: COMMIT MESSAGE FORENSICS
**Step 1.1: Subject Line**
Record: [btrfs: tracepoints] [get correct / fix] [Fix incorrect
superblock derivation in the `btrfs_sync_file` trace event when
overlayfs is stacked on btrfs]
**Step 1.2: Tags**
- Reviewed-by: Boris Burkov <boris@bur.io> (btrfs developer/reviewer)
- Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> (author, active
btrfs contributor)
- Signed-off-by: David Sterba <dsterba@suse.com> (btrfs co-maintainer,
committer)
- No Fixes: tag (expected for commits under manual review)
- No Reported-by:, Tested-by:, Link:, or Cc: stable tags
Record: Reviewed by btrfs developer, committed by btrfs maintainer. No
Fixes: or Cc: stable (expected for manually reviewed candidates).
**Step 1.3: Commit Body**
Record: Bug: when overlayfs is used on top of btrfs, `dentry->d_sb` in
the tracepoint resolves to the overlay superblock, not btrfs'. The
`btrfs_sb()` inline function then treats the overlay's `s_fs_info` as
`struct btrfs_fs_info *`, and `TP_fast_assign_fsid` dereferences
`fs_info->fs_devices->fsid`—accessing completely invalid memory.
Symptom: kernel crash. Fix: use `file_inode(file)->i_sb` to always get
the btrfs superblock.
**Step 1.4: Hidden Bug Fix**
Record: Not hidden—this is an explicit crash fix. The commit message
directly states "will lead to a crash."
---
## PHASE 2: DIFF ANALYSIS
**Step 2.1: Inventory**
Record: 1 file changed: `include/trace/events/btrfs.h`. Approximately 6
lines added, 4 removed within the `TP_fast_assign` block of
`TRACE_EVENT(btrfs_sync_file)`. Single-file, surgical fix.
**Step 2.2: Code Flow Change**
Before:
```c
const struct dentry *dentry = file->f_path.dentry;
const struct inode *inode = d_inode(dentry);
TP_fast_assign_fsid(btrfs_sb(file->f_path.dentry->d_sb));
__entry->parent = btrfs_ino(BTRFS_I(d_inode(dentry->d_parent)));
```
After:
```c
struct dentry *dentry = file_dentry(file);
struct inode *inode = file_inode(file);
struct dentry *parent = dget_parent(dentry);
struct inode *parent_inode = d_inode(parent);
dput(parent);
TP_fast_assign_fsid(btrfs_sb(inode->i_sb));
__entry->parent = btrfs_ino(BTRFS_I(parent_inode));
```
Three independent improvements:
1. **Critical crash fix**: `file->f_path.dentry->d_sb` → `inode->i_sb`
for the fsid assignment
2. **Correctness**: `file->f_path.dentry` → `file_dentry(file)` and
`d_inode(dentry)` → `file_inode(file)` (overlay-safe helpers)
3. **Safety**: parent dentry now accessed via `dget_parent()`/`dput()`
(proper reference counting)
Record: Single hunk, tracepoint-only path, three small correctness
improvements.
**Step 2.3: Bug Mechanism**
Verified:
- `btrfs_sb(sb)` returns `sb->s_fs_info` (`fs/btrfs/super.h` line 21–24)
- `TP_fast_assign_fsid(fs_info)` does `memcpy(__entry->fsid,
fs_info->fs_devices->fsid, BTRFS_FSID_SIZE)` (line 163–170)
- Overlayfs stores `struct ovl_fs *` in `sb->s_fs_info`
(`fs/overlayfs/ovl_entry.h` line 115–121)
- When overlay sb is passed to `btrfs_sb()`, the returned pointer is not
a `btrfs_fs_info`; dereferencing `->fs_devices->fsid` accesses invalid
memory → crash
Record: [Type confusion via wrong superblock] [overlay's `s_fs_info`
interpreted as `btrfs_fs_info *`, then invalid dereference of
`fs_devices->fsid`]
**Step 2.4: Fix Quality**
Record: Obviously correct—this is the only btrfs tracepoint using
`file->f_path.dentry->d_sb`; all others already use `inode->i_sb`. Fix
aligns this tracepoint with the established pattern. Very low regression
risk: changes only tracepoint data assignment code.
---
## PHASE 3: GIT HISTORY INVESTIGATION
**Step 3.1: Blame**
Verified via `git blame -L 771,779`:
- The buggy fsid line
(`TP_fast_assign_fsid(btrfs_sb(file->f_path.dentry->d_sb))`)
introduced in commit `bc074524e123de` (Jeff Mahoney, 2016-06-09,
"btrfs: prefix fsid to all trace events")
- `git describe --contains bc074524e123de` → `v4.8-rc1~38^2~1^2~12`
- Bug has been present since **v4.8-rc1** — all currently active stable
trees are affected
Record: [Bug introduced in bc074524e123de, first in v4.8-rc1, present
since 2016 in all active stable trees]
**Step 3.2: Fixes Tag**
Record: No Fixes: tag present. The implicit target is bc074524e123de.
**Step 3.3: File History**
Verified via `git log --oneline -20 -- include/trace/events/btrfs.h`:
- Related prior fix: `f157dd661339f` ("btrfs: fix NULL dereference on
root when tracing inode eviction") — a different tracepoint crash fix
in the same file
- Historical related fix: `de17e793b104d` ("btrfs: fix crash/invalid
memory access on fsync when using overlayfs") — this fixed the **core
`btrfs_sync_file()` function** for the same overlayfs class of bug,
but did NOT fix the tracepoint. The current commit completes that
work.
- The historical commit includes a full oops trace showing the exact
crash scenario
Record: [Standalone fix. Historical `de17e793b104d` fixed the fsync
function itself but left the tracepoint buggy. This commit completes
that fix.]
**Step 3.4: Author**
Verified: Goldwyn Rodrigues has 10+ btrfs commits including folio
conversions and core btrfs work. David Sterba is a listed btrfs
maintainer.
Record: [Author is established btrfs contributor from SUSE; committed by
btrfs maintainer]
**Step 3.5: Dependencies**
Verified: `file_dentry()`, `file_inode()`, `dget_parent()`, `dput()` all
exist in v5.15, v6.1, v6.6 stable trees.
Record: [No dependencies. All required helper APIs confirmed present in
stable trees.]
---
## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH
**Step 4.1: Lore Search**
Record: lore.kernel.org returned Anubis anti-bot challenge — exact patch
thread not verified.
**Step 4.2: Bug Report**
Verified: The historical commit `de17e793b104d` includes a full kernel
oops trace from `btrfs_sync_file` when using overlayfs. This establishes
that overlayfs+btrfs fsync crashes are a known, real-world class of bug.
The current tracepoint fix addresses the remaining instance of the same
pattern.
Record: [Real-world crash reports documented in historical commit
de17e793b104d with full stack trace]
**Step 4.3–4.4: Related Patches / Stable History**
Record: Could not verify lore threads. No evidence of prior stable
selection for this specific tracepoint fix.
---
## PHASE 5: CODE SEMANTIC ANALYSIS
**Step 5.1: Functions Modified**
Record: `TRACE_EVENT(btrfs_sync_file)` — specifically its
`TP_fast_assign` block
**Step 5.2: Callers**
Verified: `trace_btrfs_sync_file(file, datasync)` called from exactly
one place: `fs/btrfs/file.c:1578` inside `btrfs_sync_file()`.
**Step 5.3–5.4: Call Chain / Reachability**
Verified complete path:
- `fsync(2)` / `fdatasync(2)` → `do_fsync()` → `vfs_fsync()` →
`vfs_fsync_range()` → `btrfs_sync_file()` → `trace_btrfs_sync_file()`
- Overlayfs path: `ovl_fsync()` (line 441 of `fs/overlayfs/file.c`) →
`vfs_fsync_range(upperfile, ...)` → `btrfs_sync_file()` →
`trace_btrfs_sync_file()`
- The tracepoint body executes only when the `btrfs_sync_file`
tracepoint is enabled (static key gated)
Record: [Directly reachable from userspace fsync() syscall. Overlayfs
path confirmed via ovl_fsync(). Tracepoint gated by static key.]
**Step 5.5: Similar Patterns**
Verified: `TP_fast_assign_fsid(btrfs_sb(file->f_path.dentry->d_sb))`
appears only once in the entire file — this tracepoint. All other btrfs
tracepoints use `inode->i_sb` or receive `fs_info` directly. This is the
sole inconsistent instance.
Record: [Only tracepoint with this bug pattern; all others already
correct]
---
## PHASE 6: STABLE TREE ANALYSIS
**Step 6.1: Presence in Stable Trees**
Verified via `git cat-file -p`:
- v5.15: buggy line at line 701 ✓
- v6.1: buggy line at line 766 ✓
- v6.6: buggy line at line 795 ✓
Record: [All active stable trees (v5.15, v6.1, v6.6) contain the exact
buggy line]
**Step 6.2: Backport Complications**
Record: Clean apply expected for recent stable trees — same code
structure, same APIs available. Minor line number offsets only.
**Step 6.3: Duplicate Fixes**
Record: No alternative fix for this tracepoint found in any stable tree.
The historical `de17e793b104d` fixed only the function, not the
tracepoint.
---
## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT
**Step 7.1: Subsystem**
Record: btrfs filesystem tracepoints — IMPORTANT subsystem. btrfs is
widely used, especially with overlayfs in container environments
(Docker, Podman).
**Step 7.2: Activity**
Record: Active — `include/trace/events/btrfs.h` has seen 20+ recent
commits including other tracepoint crash fixes.
---
## PHASE 8: IMPACT AND RISK ASSESSMENT
**Step 8.1: Affected Users**
Record: Users running overlayfs on top of btrfs with btrfs tracepoints
enabled. This includes container workloads and debugging/tracing
scenarios on production systems.
**Step 8.2: Trigger Conditions**
Record: Enable `btrfs_sync_file` tracepoint (or all btrfs events) + use
overlayfs on btrfs + any `fsync()`/`fdatasync()` call. Deterministic
when conditions met — not a race.
**Step 8.3: Failure Mode**
Record: Kernel crash / oops from invalid memory access in tracepoint
assignment. Severity: **CRITICAL** when triggered (system crash,
potential data loss from incomplete fsync).
**Step 8.4: Risk-Benefit**
- BENEFIT: HIGH — prevents a deterministic kernel crash in a real,
userspace-triggerable path
- RISK: VERY LOW — ~10 lines changed in a single tracepoint, using
established VFS helpers consistent with all other btrfs tracepoints
Record: [Benefit: HIGH, Risk: VERY LOW, Ratio: Excellent for
backporting]
---
## PHASE 9: FINAL SYNTHESIS
**Step 9.1: Evidence Compilation**
FOR backporting:
- Fixes a real kernel crash (type confusion → invalid memory access →
oops)
- Small, surgical fix: ~10 lines in 1 file, 1 tracepoint
- Obviously correct: aligns with how all other btrfs tracepoints handle
the superblock
- Bug class verified real via historical commit `de17e793b104d` with
full crash stack trace
- Reviewed by btrfs developer, committed by btrfs maintainer
- Bug present in ALL active stable trees (v5.15, v6.1, v6.6) — confirmed
- All required helper APIs exist in stable trees — confirmed
- No dependencies on other commits
- Overlayfs call path verified to reach the buggy code
AGAINST backporting:
- Tracepoint must be enabled to trigger (narrower population than core
path bugs)
- No Tested-by: tag
UNRESOLVED:
- Exact lore.kernel.org patch discussion thread (blocked by Anubis)
- Whether unprivileged users can enable this tracepoint
- `git apply --check` not run against stable branches (but same code
confirmed present)
**Step 9.2: Stable Rules Checklist**
1. Obviously correct and tested? **YES** — consistent with all other
btrfs tracepoints; maintainer-reviewed
2. Fixes a real bug? **YES** — kernel crash with overlayfs+btrfs+tracing
3. Important issue? **YES** — kernel oops (CRITICAL)
4. Small and contained? **YES** — ~10 lines, single tracepoint
5. No new features/APIs? **YES** — pure bug fix
6. Applies to stable? **YES** — buggy code and required APIs confirmed
in v5.15/v6.1/v6.6
**Step 9.3: Exception Categories**
Record: Not applicable. This is a standard bug fix, not a device
ID/quirk/DT exception.
**Step 9.4: Decision**
The fix addresses a real, deterministic kernel crash caused by type
confusion when overlayfs is stacked on btrfs. The crash mechanism is
fully verified: `btrfs_sb()` interprets overlay's `s_fs_info` (an
`ovl_fs *`) as `btrfs_fs_info *`, then `TP_fast_assign_fsid`
dereferences `fs_info->fs_devices->fsid` — accessing garbage memory. The
fix is small, obviously correct, consistent with all other btrfs
tracepoints, and has no dependencies. The bug exists in all active
stable trees. The only limiting factor is that tracepoints must be
enabled, but stable kernels are regularly used with tracing enabled for
support and debugging, and a crash in that scenario is unacceptable.
---
## Verification
- [Phase 1] Parsed tags from commit message: Reviewed-by Boris Burkov,
SOB from David Sterba (btrfs maintainer), SOB from Goldwyn Rodrigues
(author). No Fixes:, Reported-by:, Link:, Cc: stable.
- [Phase 2] Read `include/trace/events/btrfs.h` lines 163–170: confirmed
`TP_fast_assign_fsid` does `memcpy(__entry->fsid,
fs_info->fs_devices->fsid, BTRFS_FSID_SIZE)`
- [Phase 2] Read `include/trace/events/btrfs.h` lines 771–779: confirmed
pre-fix code uses `file->f_path.dentry->d_sb`
- [Phase 2] Grep confirmed `btrfs_sb()` returns `sb->s_fs_info` in
`fs/btrfs/super.h` lines 21–24
- [Phase 2] Read `fs/overlayfs/ovl_entry.h` lines 115–121: confirmed
`OVL_FS(sb)` casts `sb->s_fs_info` to `struct ovl_fs *` — type
confusion verified
- [Phase 2] Grep confirmed the buggy
`TP_fast_assign_fsid(btrfs_sb(file->f_path.dentry->d_sb))` is the ONLY
such pattern in btrfs.h
- [Phase 3] `git blame -L 771,779`: buggy fsid line from
`bc074524e123de` (Jeff Mahoney, 2016)
- [Phase 3] `git describe --contains bc074524e123de`:
`v4.8-rc1~38^2~1^2~12` — bug present since v4.8
- [Phase 3] `git show de17e793b104d`: confirmed historical
overlayfs+btrfs fsync crash with full kernel oops trace; that fix
addressed only `btrfs_sync_file()`, NOT the tracepoint
- [Phase 3] `git log --oneline --author="Goldwyn Rodrigues" -10 --
fs/btrfs`: confirmed active btrfs contributor
- [Phase 4] lore.kernel.org: blocked by Anubis — patch thread UNVERIFIED
- [Phase 5] Grep: `trace_btrfs_sync_file` called from exactly
`fs/btrfs/file.c:1578`
- [Phase 5] Read `fs/overlayfs/file.c` lines 441–464: confirmed
`ovl_fsync()` → `vfs_fsync_range(upperfile)` call path
- [Phase 6] `git cat-file -p v5.15:include/trace/events/btrfs.h`: buggy
line at line 701 ✓
- [Phase 6] `git cat-file -p v6.1:include/trace/events/btrfs.h`: buggy
line at line 766 ✓
- [Phase 6] `git cat-file -p v6.6:include/trace/events/btrfs.h`: buggy
line at line 795 ✓
- [Phase 6] Verified `file_inode`, `file_dentry`, `dget_parent` present
in v5.15, v6.1, v6.6 via `git cat-file` grep
- [Phase 8] Failure mode: kernel oops from invalid memory access in
tracepoint assignment, severity CRITICAL
- UNVERIFIED: Exact lore patch discussion (Anubis blocked); privilege
requirements for tracepoint enablement; `git apply --check` on stable
branches
**YES**
include/trace/events/btrfs.h | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 125bdc166bfed..0864700f76e0a 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -769,12 +769,15 @@ TRACE_EVENT(btrfs_sync_file,
),
TP_fast_assign(
- const struct dentry *dentry = file->f_path.dentry;
- const struct inode *inode = d_inode(dentry);
+ struct dentry *dentry = file_dentry(file);
+ struct inode *inode = file_inode(file);
+ struct dentry *parent = dget_parent(dentry);
+ struct inode *parent_inode = d_inode(parent);
- TP_fast_assign_fsid(btrfs_sb(file->f_path.dentry->d_sb));
+ dput(parent);
+ TP_fast_assign_fsid(btrfs_sb(inode->i_sb));
__entry->ino = btrfs_ino(BTRFS_I(inode));
- __entry->parent = btrfs_ino(BTRFS_I(d_inode(dentry->d_parent)));
+ __entry->parent = btrfs_ino(BTRFS_I(parent_inode));
__entry->datasync = datasync;
__entry->root_objectid = btrfs_root_id(BTRFS_I(inode)->root);
),
--
2.53.0
^ permalink raw reply related [flat|nested] 2+ messages in thread
* [PATCH AUTOSEL 6.19-6.18] btrfs: fix zero size inode with non-zero size after log replay
[not found] <20260330123842.756154-1-sashal@kernel.org>
2026-03-30 12:38 ` [PATCH AUTOSEL 6.19-5.10] btrfs: tracepoints: get correct superblock from dentry in event btrfs_sync_file() Sasha Levin
@ 2026-03-30 12:38 ` Sasha Levin
1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2026-03-30 12:38 UTC (permalink / raw)
To: patches, stable
Cc: Filipe Manana, Vyacheslav Kovalevsky, David Sterba, Sasha Levin,
clm, linux-btrfs, linux-kernel
From: Filipe Manana <fdmanana@suse.com>
[ Upstream commit 5254d4181add9dfaa5e3519edd71cc8f752b2f85 ]
When logging that an inode exists, as part of logging a new name or
logging new dir entries for a directory, we always set the generation of
the logged inode item to 0. This is to signal during log replay (in
overwrite_item()), that we should not set the i_size since we only logged
that an inode exists, so the i_size of the inode in the subvolume tree
must be preserved (as when we log new names or that an inode exists, we
don't log extents).
This works fine except when we have already logged an inode in full mode
or it's the first time we are logging an inode created in a past
transaction, that inode has a new i_size of 0 and then we log a new name
for the inode (due to a new hardlink or a rename), in which case we log
an i_size of 0 for the inode and a generation of 0, which causes the log
replay code to not update the inode's i_size to 0 (in overwrite_item()).
An example scenario:
mkdir /mnt/dir
xfs_io -f -c "pwrite 0 64K" /mnt/dir/foo
sync
xfs_io -c "truncate 0" -c "fsync" /mnt/dir/foo
ln /mnt/dir/foo /mnt/dir/bar
xfs_io -c "fsync" /mnt/dir
<power fail>
After log replay the file remains with a size of 64K. This is because when
we first log the inode, when we fsync file foo, we log its current i_size
of 0, and then when we create a hard link we log again the inode in exists
mode (LOG_INODE_EXISTS) but we set a generation of 0 for the inode item we
add to the log tree, so during log replay overwrite_item() sees that the
generation is 0 and i_size is 0 so we skip updating the inode's i_size
from 64K to 0.
Fix this by making sure at fill_inode_item() we always log the real
generation of the inode if it was logged in the current transaction with
the i_size we logged before. Also if an inode created in a previous
transaction is logged in exists mode only, make sure we log the i_size
stored in the inode item located from the commit root, so that if we log
multiple times that the inode exists we get the correct i_size.
A test case for fstests will follow soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/af8c15fa-4e41-4bb2-885c-0bc4e97532a6@gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
---
## PHASE 1: COMMIT MESSAGE FORENSICS
### Step 1.1: PARSE THE SUBJECT LINE
Record: **[btrfs]** **[fix]** Incorrect inode `i_size` after log replay
— a file truncated to zero appears with its old non-zero size after
power failure and log replay, when a new name (hardlink/rename) is
logged for the inode.
### Step 1.2: PARSE ALL COMMIT MESSAGE TAGS
- **Reported-by:** Vyacheslav Kovalevsky
\<slava.kovalevskiy.2014@gmail.com\> — a real user encountered this
bug
- **Link:** https://lore.kernel.org/linux-
btrfs/af8c15fa-4e41-4bb2-885c-0bc4e97532a6@gmail.com/ — bug report on
the btrfs mailing list
- **Signed-off-by:** Filipe Manana \<fdmanana@suse.com\> — author,
primary btrfs tree-log developer
- **Signed-off-by:** David Sterba \<dsterba@suse.com\> — btrfs
maintainer
- No Fixes: tag (expected — this is why the commit is under manual
review)
- No Cc: stable (expected)
- No Tested-by or Reviewed-by
Record: Single user reporter, fix from subsystem's primary tree-log
developer, accepted by maintainer.
### Step 1.3: ANALYZE THE COMMIT BODY TEXT
The commit provides a concrete, reproducible scenario:
1. `mkdir /mnt/dir && xfs_io -f -c "pwrite 0 64K" /mnt/dir/foo`
2. `sync`
3. `xfs_io -c "truncate 0" -c "fsync" /mnt/dir/foo` — truncates to 0,
fsync logs the inode with `i_size = 0`
4. `ln /mnt/dir/foo /mnt/dir/bar` — creates a hard link, triggering
LOG_INODE_EXISTS logging
5. `xfs_io -c "fsync" /mnt/dir` — syncs the directory log
6. Power failure → after log replay, file has 64K size instead of 0
**Root cause:** When logging "inode exists" (`LOG_INODE_EXISTS`),
`fill_inode_item()` always sets generation=0. During log replay,
`overwrite_item()` sees generation=0 and `ino_size=0` and skips updating
the subvolume tree's `i_size`. The file retains its stale pre-truncate
size.
Record: Data integrity bug — wrong file size after crash recovery. Clear
mechanism, reproducible, user-reported.
### Step 1.4: DETECT HIDDEN BUG FIXES
Record: Not hidden — explicitly a correctness fix for log replay data
integrity.
---
## PHASE 2: DIFF ANALYSIS — LINE BY LINE
### Step 2.1: INVENTORY THE CHANGES
- **File:** `fs/btrfs/tree-log.c` only
- **Functions modified:** `fill_inode_item()`, `logged_inode_size()` →
renamed to `get_inode_size_to_log()`, `btrfs_log_inode()`
- **Scope:** ~70 lines changed across 3 hunks in a single file.
Surgical, well-contained.
Record: Single-file fix, 3 functions modified, ~+55/-23 lines.
### Step 2.2: UNDERSTAND THE CODE FLOW CHANGE
**Hunk 1 — `fill_inode_item()`:**
- **Before:** Always `btrfs_set_inode_generation(leaf, item, 0)` when
`log_inode_only=true`.
- **After:** Default `gen` to the real inode generation. Only set
generation to 0 if `logged_trans < trans->transid` (i.e., the inode
was NOT already logged in the current transaction). If it WAS
previously logged in this transaction, keep the real generation so
replay will correctly apply the logged `i_size`.
**Hunk 2 — `logged_inode_size()` → `get_inode_size_to_log()`:**
- **Before:** Always searched the log tree; if item not found,
`*size_ret = 0`.
- **After:** If inode was logged in the current transaction, searches
the log tree (same as before). If inode is from a past transaction and
not yet logged, searches the **commit root** of the subvolume tree. If
inode was created in the current transaction and not yet logged,
returns 0. Added ASSERT/WARN_ON_ONCE safety checks.
**Hunk 3 — `btrfs_log_inode()`:**
- **Before:** `if (inode_only == LOG_INODE_EXISTS &&
ctx->logged_before)` — only fetched `logged_isize` if inode was logged
before.
- **After:** `if (inode_only == LOG_INODE_EXISTS)` — always fetches the
inode size to log, removing the `logged_before` guard. This ensures
correct size even on the first exists-only log in a transaction.
### Step 2.3: IDENTIFY THE BUG MECHANISM
**Category:** Logic/correctness fix — filesystem log replay invariant
violation leading to data inconsistency.
**Mechanism verified against `overwrite_item()` (lines 628-658 in
current tree):**
```644:658:fs/btrfs/tree-log.c
if (btrfs_inode_generation(wc->log_leaf, src_item) == 0)
{
const u64 ino_size =
btrfs_inode_size(wc->log_leaf, src_item);
// ...
if (S_ISREG(btrfs_inode_mode(wc->log_leaf,
src_item)) &&
S_ISREG(btrfs_inode_mode(dst_eb, dst_item))
&&
ino_size != 0)
btrfs_set_inode_size(dst_eb, dst_item,
ino_size);
goto no_copy;
}
```
When generation=0 AND `ino_size=0` (for a regular file), the size update
is skipped entirely and execution jumps to `no_copy`. This means the
subvolume tree's stale `i_size` (64K in the example) is preserved. The
fix ensures that when the inode was already logged in the same
transaction (with the truncated size), the real generation is used,
causing replay to take the full-copy path instead of the `no_copy` path.
### Step 2.4: ASSESS THE FIX QUALITY
- **Correctness:** The logic aligns with the documented
`overwrite_item()` replay semantics. Using `logged_trans` to
distinguish first-time vs. re-logging is consistent with existing
patterns in `inode_logged()` (line 3744) which already uses
`data_race(inode->logged_trans)`.
- **Minimality:** Well-contained to logging paths only.
- **Regression risk:** LOW — changes only affect the LOG_INODE_EXISTS
code path (triggered by hard links, renames). The ASSERT and
WARN_ON_ONCE provide debugging safety nets. No lock changes, no API
changes.
Record: Fix is obviously correct when cross-referenced with
`overwrite_item()` replay code. Minimal regression risk.
---
## PHASE 3: GIT HISTORY INVESTIGATION
### Step 3.1: BLAME THE CHANGED LINES
Verified via `git blame -L 4612,4625 fs/btrfs/tree-log.c`:
- The generation=0 logic was introduced by **`94edf4ae43a5f9`** (Josef
Bacik, 2012-09-25) — "Btrfs: don't bother committing delayed inode
updates when fsyncing"
- Lines 4620-4624 last touched by **`c418a1504540c6`** (David Sterba,
2025-06-27) — cosmetic accessor conversion, not the logic origin.
Record: The buggy generation=0 behavior has existed since **2012**
(pre-v3.8). Present in ALL active stable trees.
### Step 3.2: FOLLOW THE FIXES: TAG
No Fixes: tag present. N/A.
### Step 3.3: CHECK FILE HISTORY FOR RELATED CHANGES
Verified recent related commits from the same reporter/timeframe:
- **`953902e4fb4c3`** — "btrfs: set inode flag
BTRFS_INODE_COPY_EVERYTHING when logging new name" — different bug
(names not persisted), touches `inode.c` and `tree-log.c`
- **`bfe3d755ef7ce`** — "btrfs: do not update last_log_commit when
logging inode due to a new name" — different bug (directory fsync not
persisting), 1-line change in `tree-log.c`
These are independent fixes for different bugs, all reported by the same
user during testing.
Record: Standalone fix. Not part of a series. Independent of the two
related commits above.
### Step 3.4: CHECK THE AUTHOR'S OTHER COMMITS
Filipe Manana is the primary btrfs tree-log developer. He authored
`logged_inode_size()` (commit `1a4bcf470c886`, v4.0) and most of the
tree-log fsync correctness fixes. David Sterba signed off as btrfs
maintainer.
Record: Author is THE subsystem expert for this code. Highest-confidence
authorship signal.
### Step 3.5: CHECK FOR DEPENDENT/PREREQUISITE COMMITS
- Uses `logged_trans`, `data_race()`, `search_commit_root`,
`skip_locking` — all long-established.
- `ctx->logged_before` was introduced by **`0f8ce49821de3`** (Filipe
Manana) in **v5.18-rc1**. This field exists in stable trees 6.1+ and
later.
- `logged_inode_size()` was introduced by **`1a4bcf470c886`** in
**v4.0-rc1**.
Record: Self-contained. Requires `ctx->logged_before` (v5.18+), so
applies to stable trees 6.1+.
---
## PHASE 4: MAILING LIST AND EXTERNAL RESEARCH
### Step 4.1-4.4: LORE RESEARCH
Both lore.kernel.org URLs returned Anubis bot protection (inaccessible).
The Link: tag in the commit confirms a real bug report exists on the
linux-btrfs mailing list from Vyacheslav Kovalevsky.
Record: Lore inaccessible. Bug report confirmed to exist via Link: tag.
UNVERIFIED: Full review discussion, any explicit stable nominations by
reviewers, any NAKs.
---
## PHASE 5: CODE SEMANTIC ANALYSIS
### Step 5.1: KEY FUNCTIONS
Functions modified: `fill_inode_item()`, `logged_inode_size()` →
`get_inode_size_to_log()`, `btrfs_log_inode()`.
### Step 5.2: TRACE CALLERS
**`fill_inode_item()`** is called from:
1. `log_inode_item()` (line 4704) — with `log_inode_only=false` (NOT
affected by this change)
2. `copy_inode_items_to_log()` (line 4966) — with `log_inode_only =
(inode_only == LOG_INODE_EXISTS)` — **this is the affected path**
**`btrfs_log_inode()`** is called during fsync and directory logging.
The LOG_INODE_EXISTS path is triggered via `btrfs_log_new_name()`.
### Step 5.4: FOLLOW THE CALL CHAIN
Verified call chain for bug reachability:
- `link()` / `rename()` syscall → `btrfs_link()` / `btrfs_rename()` in
`inode.c` → `btrfs_log_new_name()` (line 7931) → `btrfs_log_inode()`
with `LOG_INODE_EXISTS` (line 6293) → `logged_inode_size()` (line
6992) → `fill_inode_item()` (line 4966)
This is a **common userspace operation** — any unprivileged user can
trigger it with `link()` or `rename()` followed by `fsync()`.
Record: Bug reachable from common syscalls (link, rename, fsync).
Unprivileged users can trigger it.
### Step 5.5: SEARCH FOR SIMILAR PATTERNS
The `data_race(inode->logged_trans)` pattern already exists at line 3744
in `inode_logged()`. The new code follows the same established pattern.
Record: Consistent with existing code patterns.
---
## PHASE 6: CROSS-REFERENCING AND STABLE TREE ANALYSIS
### Step 6.1: DOES THE BUGGY CODE EXIST IN STABLE TREES?
**Verified:** The pre-fix code (`logged_inode_size`, unconditional
generation=0 in `fill_inode_item`, `ctx->logged_before` guard) is still
present in the current tree (`v6.19.10`). `get_inode_size_to_log` does
not exist — the fix has NOT been applied yet.
- The generation=0 behavior dates to `94edf4ae43a5f9` (2012, pre-v3.8) —
present in ALL stable trees.
- `logged_inode_size()` dates to `1a4bcf470c886` (v4.0) — present in all
active stable trees.
- `ctx->logged_before` dates to `0f8ce49821de3` (v5.18) — present in
6.1+ stable trees.
Record: All active stable trees (6.1+, 6.6+, 6.12+) contain the buggy
code.
### Step 6.2: BACKPORT COMPLICATIONS
`tree-log.c` has been actively modified (5+ recent commits since v6.12).
The patch modifies a function signature (`logged_inode_size` →
`get_inode_size_to_log`), changes the condition in `btrfs_log_inode`,
and restructures `fill_inode_item`. May require minor adaptation for
older stables, but the core logic should port cleanly.
Record: Expected backport difficulty: clean apply on recent stables
(6.12+), minor conflicts possible on older (6.1, 6.6).
### Step 6.3: CHECK IF RELATED FIXES ARE ALREADY IN STABLE
No evidence that a different fix for the same bug exists. The confirmed
related commits (`953902e4fb4c3`, `bfe3d755ef7ce`) address different
bugs.
Record: No duplicate fix found.
---
## PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT
### Step 7.1: IDENTIFY THE SUBSYSTEM
**Subsystem:** fs/btrfs — filesystem layer, specifically the tree-log
(fsync journal) subsystem.
**Criticality:** IMPORTANT — btrfs is widely used, and crash recovery
correctness is fundamental to filesystem integrity. Wrong file sizes
after recovery = silent data corruption from the user's perspective.
### Step 7.2: ASSESS SUBSYSTEM ACTIVITY
`tree-log.c` is actively maintained with 30+ recent commits visible in
the log. This is a mature, well-maintained subsystem.
Record: Active subsystem, mature code, high importance for data
integrity.
---
## PHASE 8: IMPACT AND RISK ASSESSMENT
### Step 8.1: DETERMINE WHO IS AFFECTED
All btrfs users who rely on fsync for data integrity and who might
create hard links or rename files after truncating. This is a
fundamental filesystem operation.
Record: Affected population: all btrfs users (filesystem-specific but
btrfs is widely used).
### Step 8.2: DETERMINE THE TRIGGER CONDITIONS
- **Trigger:** truncate file to 0 → fsync → create hard link or rename →
fsync directory → power failure
- **How common:** Realistic workflow — the reporter hit it naturally
- **Unprivileged:** Yes, any user can trigger this with normal file
operations
Record: Realistic trigger, unprivileged, discovered by a real user.
### Step 8.3: DETERMINE THE FAILURE MODE SEVERITY
When triggered: **File appears with wrong (old) size after crash
recovery.** The file was truncated to 0 but shows 64K after log replay.
This is **data integrity corruption** — the filesystem state after
recovery does not match what was committed via fsync.
Record: **Failure mode: data integrity violation (wrong file size after
crash).** **Severity: HIGH** — silent data corruption, stale data
visible, violates fsync durability contract.
### Step 8.4: CALCULATE RISK-BENEFIT RATIO
- **Benefit: HIGH** — prevents data corruption after crash recovery in a
production filesystem. Fixes a violation of the fsync durability
guarantee.
- **Risk: LOW** — ~70 lines in one file, authored by the subsystem
expert, changes only the LOG_INODE_EXISTS logging path. Uses
established patterns (`data_race(logged_trans)`). Includes
ASSERT/WARN_ON safety nets. No lock changes, no API changes.
- **Net ratio: Strongly favorable for backporting.**
---
## PHASE 9: FINAL SYNTHESIS
### Step 9.1: COMPILE THE EVIDENCE
**FOR backporting:**
- Fixes a real, user-reported data integrity bug (wrong file size after
crash recovery)
- Concrete, reproducible scenario provided in the commit message
- Verified against `overwrite_item()` replay code: generation=0 +
ino_size=0 → skips `i_size` update → stale data preserved
- Authored by Filipe Manana, THE primary btrfs tree-log developer
- Accepted by David Sterba, btrfs maintainer
- Single file, ~70 lines, well-contained to logging paths
- Buggy code exists in ALL active stable trees (since 2012)
- Uses established patterns (`data_race(logged_trans)`)
- Trigger is realistic and reachable from unprivileged userspace
- Standalone fix with no dependencies beyond existing stable code
**AGAINST backporting:**
- Not trivial (not 1-5 lines) — moderate change size
- Function renamed and restructured (`logged_inode_size` →
`get_inode_size_to_log`)
- May require minor adaptation for oldest stable trees
- Accesses commit root in a new code path (though safe and read-only)
**UNRESOLVED:**
- Lore discussion inaccessible (Anubis protection) — could not verify
reviewer comments or explicit stable nominations
- Per-branch clean apply not verified
### Step 9.2: APPLY THE STABLE RULES CHECKLIST
1. **Obviously correct and tested?** YES — logic verified against
`overwrite_item()` replay code; authored by subsystem expert; fstests
mentioned as follow-up
2. **Fixes a real bug that affects users?** YES — user-reported,
reproducible, data integrity violation
3. **Important issue?** YES — data corruption (wrong file size after
crash recovery) — HIGH severity
4. **Small and contained?** YES — single file, ~70 lines, 3 functions in
tree-log.c
5. **No new features or APIs?** YES — purely a bug fix
6. **Can apply to stable trees?** YES — should apply to 6.1+ with
possible minor adaptation for older trees
### Step 9.3: CHECK FOR EXCEPTION CATEGORIES
Not applicable — this is a standard data integrity bug fix, not an
exception category.
### Step 9.4: DECISION
This is a clear **YES**. It fixes a real, user-reported data integrity
bug where a file's size is incorrect after crash recovery. The fix is
well-contained, authored by the subsystem's primary developer, and meets
all stable kernel criteria. The risk-benefit ratio strongly favors
backporting.
---
## Verification
- **[Phase 1]** Parsed tags: found Reported-by: Vyacheslav Kovalevsky,
Link: to lore.kernel.org bug report, Signed-off-by from Filipe Manana
(author) and David Sterba (maintainer)
- **[Phase 2]** Diff analysis: ~70 lines in `fs/btrfs/tree-log.c`,
modifies `fill_inode_item()`, `logged_inode_size()` →
`get_inode_size_to_log()`, `btrfs_log_inode()`
- **[Phase 2]** Verified `overwrite_item()` log replay logic at lines
644-658: generation=0 AND ino_size=0 → skips `btrfs_set_inode_size()`
→ goto `no_copy` → stale size preserved. Confirms the described bug
mechanism.
- **[Phase 3]** `git blame -L 4612,4625`: generation=0 behavior from
`94edf4ae43a5f9` (Josef Bacik, 2012-09-25) — present since pre-v3.8
- **[Phase 3]** `git log -S "logged_inode_size"`: introduced by
`1a4bcf470c886` "Btrfs: fix fsync data loss after adding hard link to
inode" — `git describe`: v4.0-rc1
- **[Phase 3]** `git log -S "ctx->logged_before"`: introduced by
`0f8ce49821de3` — `git describe`: v5.18-rc1 — present in stable trees
6.1+
- **[Phase 3]** Verified `953902e4fb4c3` and `bfe3d755ef7ce` are
independent fixes for different bugs (same reporter)
- **[Phase 3]** Confirmed Filipe Manana is the primary tree-log
developer (10+ recent commits in this file)
- **[Phase 4]** Lore.kernel.org inaccessible (Anubis bot protection) —
UNVERIFIED: full review discussion, explicit stable nominations
- **[Phase 5]** `fill_inode_item()` called from
`copy_inode_items_to_log()` at line 4966 with `log_inode_only =
(inode_only == LOG_INODE_EXISTS)` — verified
- **[Phase 5]** Call chain: `link()/rename()` → `btrfs_log_new_name()`
(line 7931) → `btrfs_log_inode()` → LOG_INODE_EXISTS path (line
6978/6992) — confirmed reachable from unprivileged syscalls
- **[Phase 5]** `data_race(inode->logged_trans)` pattern already used at
line 3744 in `inode_logged()` — consistent
- **[Phase 6]** `get_inode_size_to_log` does NOT exist in current tree —
patch not yet applied. Pre-fix code (`logged_inode_size`,
generation=0, `ctx->logged_before` guard) all confirmed present.
- **[Phase 6]** Buggy code exists in all active stable trees:
generation=0 since 2012, `logged_inode_size` since v4.0,
`ctx->logged_before` since v5.18
- **[Phase 8]** Failure mode: data integrity violation — wrong file size
after crash recovery. Severity: HIGH.
- **UNVERIFIED:** Lore review discussion content. Per-branch clean apply
testing. Runtime fstests validation.
**YES**
fs/btrfs/tree-log.c | 98 ++++++++++++++++++++++++++++++---------------
1 file changed, 65 insertions(+), 33 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 6c40f48cc194d..4cea0489f121c 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4609,21 +4609,32 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
struct inode *inode, bool log_inode_only,
u64 logged_isize)
{
+ u64 gen = BTRFS_I(inode)->generation;
u64 flags;
if (log_inode_only) {
- /* set the generation to zero so the recover code
- * can tell the difference between an logging
- * just to say 'this inode exists' and a logging
- * to say 'update this inode with these values'
+ /*
+ * Set the generation to zero so the recover code can tell the
+ * difference between a logging just to say 'this inode exists'
+ * and a logging to say 'update this inode with these values'.
+ * But only if the inode was not already logged before.
+ * We access ->logged_trans directly since it was already set
+ * up in the call chain by btrfs_log_inode(), and data_race()
+ * to avoid false alerts from KCSAN and since it was set already
+ * and one can set it to 0 since that only happens on eviction
+ * and we are holding a ref on the inode.
*/
- btrfs_set_inode_generation(leaf, item, 0);
+ ASSERT(data_race(BTRFS_I(inode)->logged_trans) > 0);
+ if (data_race(BTRFS_I(inode)->logged_trans) < trans->transid)
+ gen = 0;
+
btrfs_set_inode_size(leaf, item, logged_isize);
} else {
- btrfs_set_inode_generation(leaf, item, BTRFS_I(inode)->generation);
btrfs_set_inode_size(leaf, item, inode->i_size);
}
+ btrfs_set_inode_generation(leaf, item, gen);
+
btrfs_set_inode_uid(leaf, item, i_uid_read(inode));
btrfs_set_inode_gid(leaf, item, i_gid_read(inode));
btrfs_set_inode_mode(leaf, item, inode->i_mode);
@@ -5427,42 +5438,63 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
return 0;
}
-static int logged_inode_size(struct btrfs_root *log, struct btrfs_inode *inode,
- struct btrfs_path *path, u64 *size_ret)
+static int get_inode_size_to_log(struct btrfs_trans_handle *trans,
+ struct btrfs_inode *inode,
+ struct btrfs_path *path, u64 *size_ret)
{
struct btrfs_key key;
+ struct btrfs_inode_item *item;
int ret;
key.objectid = btrfs_ino(inode);
key.type = BTRFS_INODE_ITEM_KEY;
key.offset = 0;
- ret = btrfs_search_slot(NULL, log, &key, path, 0, 0);
- if (ret < 0) {
- return ret;
- } else if (ret > 0) {
- *size_ret = 0;
- } else {
- struct btrfs_inode_item *item;
+ /*
+ * Our caller called inode_logged(), so logged_trans is up to date.
+ * Use data_race() to silence any warning from KCSAN. Once logged_trans
+ * is set, it can only be reset to 0 after inode eviction.
+ */
+ if (data_race(inode->logged_trans) == trans->transid) {
+ ret = btrfs_search_slot(NULL, inode->root->log_root, &key, path, 0, 0);
+ } else if (inode->generation < trans->transid) {
+ path->search_commit_root = true;
+ path->skip_locking = true;
+ ret = btrfs_search_slot(NULL, inode->root, &key, path, 0, 0);
+ path->search_commit_root = false;
+ path->skip_locking = false;
- item = btrfs_item_ptr(path->nodes[0], path->slots[0],
- struct btrfs_inode_item);
- *size_ret = btrfs_inode_size(path->nodes[0], item);
- /*
- * If the in-memory inode's i_size is smaller then the inode
- * size stored in the btree, return the inode's i_size, so
- * that we get a correct inode size after replaying the log
- * when before a power failure we had a shrinking truncate
- * followed by addition of a new name (rename / new hard link).
- * Otherwise return the inode size from the btree, to avoid
- * data loss when replaying a log due to previously doing a
- * write that expands the inode's size and logging a new name
- * immediately after.
- */
- if (*size_ret > inode->vfs_inode.i_size)
- *size_ret = inode->vfs_inode.i_size;
+ } else {
+ *size_ret = 0;
+ return 0;
}
+ /*
+ * If the inode was logged before or is from a past transaction, then
+ * its inode item must exist in the log root or in the commit root.
+ */
+ ASSERT(ret <= 0);
+ if (WARN_ON_ONCE(ret > 0))
+ ret = -ENOENT;
+
+ if (ret < 0)
+ return ret;
+
+ item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+ struct btrfs_inode_item);
+ *size_ret = btrfs_inode_size(path->nodes[0], item);
+ /*
+ * If the in-memory inode's i_size is smaller then the inode size stored
+ * in the btree, return the inode's i_size, so that we get a correct
+ * inode size after replaying the log when before a power failure we had
+ * a shrinking truncate followed by addition of a new name (rename / new
+ * hard link). Otherwise return the inode size from the btree, to avoid
+ * data loss when replaying a log due to previously doing a write that
+ * expands the inode's size and logging a new name immediately after.
+ */
+ if (*size_ret > inode->vfs_inode.i_size)
+ *size_ret = inode->vfs_inode.i_size;
+
btrfs_release_path(path);
return 0;
}
@@ -6975,7 +7007,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
ret = drop_inode_items(trans, log, path, inode,
BTRFS_XATTR_ITEM_KEY);
} else {
- if (inode_only == LOG_INODE_EXISTS && ctx->logged_before) {
+ if (inode_only == LOG_INODE_EXISTS) {
/*
* Make sure the new inode item we write to the log has
* the same isize as the current one (if it exists).
@@ -6989,7 +7021,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
* (zeroes), as if an expanding truncate happened,
* instead of getting a file of 4Kb only.
*/
- ret = logged_inode_size(log, inode, path, &logged_isize);
+ ret = get_inode_size_to_log(trans, inode, path, &logged_isize);
if (ret)
goto out_unlock;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 2+ messages in thread
end of thread, other threads:[~2026-03-30 12:39 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20260330123842.756154-1-sashal@kernel.org>
2026-03-30 12:38 ` [PATCH AUTOSEL 6.19-5.10] btrfs: tracepoints: get correct superblock from dentry in event btrfs_sync_file() Sasha Levin
2026-03-30 12:38 ` [PATCH AUTOSEL 6.19-6.18] btrfs: fix zero size inode with non-zero size after log replay Sasha Levin
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox