[PATCH AUTOSEL 6.19-6.12] btrfs: fallback to buffered IO if the data profile has duplication

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH AUTOSEL 6.19-6.12] btrfs: fallback to buffered IO if the data profile has duplication
       [not found] <20260210233123.2905307-1-sashal@kernel.org>
@ 2026-02-10 23:30 ` Sasha Levin
  2026-02-10 23:30 ` [PATCH AUTOSEL 6.19] btrfs: don't BUG() on unexpected delayed ref type in run_one_delayed_ref() Sasha Levin
  2026-02-10 23:31 ` [PATCH AUTOSEL 6.19-6.6] btrfs: handle user interrupt properly in btrfs_trim_fs() Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2026-02-10 23:30 UTC (permalink / raw)
  To: patches, stable; +Cc: Qu Wenruo, David Sterba, Sasha Levin, clm, linux-btrfs

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit 7c2830f00c3e086292c1ee9f27b61efaf8e76c9a ]

[BACKGROUND]
Inspired by a recent kernel bug report, which is related to direct IO
buffer modification during writeback, that leads to contents mismatch of
different RAID1 mirrors.

[CAUSE AND PROBLEMS]
The root cause is exactly the same explained in commit 968f19c5b1b7
("btrfs: always fallback to buffered write if the inode requires
checksum"), that we can not trust direct IO buffer which can be modified
halfway during writeback.

Unlike data checksum verification, if this happened on inodes without
data checksum but has the data has extra mirrors, it will lead to
stealth data mismatch on different mirrors.

This will be way harder to detect without data checksum.

Furthermore for RAID56, we can even have data without checksum and data
with checksum mixed inside the same full stripe.

In that case if the direct IO buffer got changed halfway for the
nodatasum part, the data with checksum immediately lost its ability to
recover, e.g.:

" " = Good old data or parity calculated using good old data
"X" = Data modified during writeback

              0                32K                      64K
  Data 1      |                                         |  Has csum
  Data 2      |XXXXXXXXXXXXXXXX                         |  No csum
  Parity      |                                         |

In above case, the parity is calculated using data 1 (has csum, from
page cache, won't change during writeback), and old data 2 (has no csum,
direct IO write).

After parity is calculated, but before submission to the storage, direct
IO buffer of data 2 is modified, causing the range [0, 32K) of data 2
has a different content.

Now all data is submitted to the storage, and the fs got fully synced.

Then the device of data 1 is lost, has to be rebuilt from data 2 and
parity. But since the data 2 has some modified data, and the parity is
calculated using old data, the recovered data is no the same for data 1,
causing data checksum mismatch.

[FIX]
Fix the problem by checking the data allocation profile.
If our data allocation profile is either RAID0 or SINGLE, we can allow
true zero-copy direct IO and the end user is fully responsible for any
race.

However this is not going to fix all situations, as it's still possible
to race with balance where the fs got a new data profile after the data
allocation profile check.
But this fix should still greatly reduce the window of the original bug.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99171
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Here is my detailed analysis:

---

## Commit Analysis: "btrfs: fallback to buffered IO if the data profile
has duplication"

### 1. COMMIT MESSAGE ANALYSIS

The commit has an extremely detailed and well-structured message with
[BACKGROUND], [CAUSE AND PROBLEMS], and [FIX] sections. Key points:

- **Root cause**: Direct IO buffers can be modified by userspace during
  writeback. This is a known issue — the same root cause was addressed
  for checksummed inodes by commit 968f19c5b1b7.
- **Impact**: On filesystems with redundant data profiles (RAID1,
  RAID5/6, DUP, RAID10, RAID1C3/C4), modified DIO buffers cause **silent
  data mismatch across mirrors**. This is *data corruption* — different
  mirrors/copies hold different content.
- **RAID56 amplification**: For RAID56, the problem is worse — corrupted
  parity renders checksummed data unrecoverable, since parity is
  calculated from pre-modification data but the actual written data may
  differ.
- **Referenced bug**: [bugzilla.kernel.org
  #99171](https://bugzilla.kernel.org/show_bug.cgi?id=99171) — a long-
  standing issue dating back to 2015.
- **Authors**: Qu Wenruo (key btrfs developer), signed-off by David
  Sterba (btrfs maintainer).

### 2. CODE CHANGE ANALYSIS

The patch is **12 lines, single file** (`fs/btrfs/direct-io.c`),
touching only `btrfs_direct_write()`:

**Variable declaration added:**
```c
const u64 data_profile = btrfs_data_alloc_profile(fs_info) &
                         BTRFS_BLOCK_GROUP_PROFILE_MASK;
```

**Early bailout check added before `relock:` label:**
```c
if (data_profile != BTRFS_BLOCK_GROUP_RAID0 && data_profile != 0)
    goto buffered;
```

The logic is:
- `btrfs_data_alloc_profile()` returns the current data allocation
  profile, which includes type + profile bits
- After masking with `BTRFS_BLOCK_GROUP_PROFILE_MASK`, the result is:
  - `0` for SINGLE (no profile bits)
  - `BTRFS_BLOCK_GROUP_RAID0` for RAID0
  - Other values for RAID1/DUP/RAID5/RAID6/RAID10/RAID1C3/RAID1C4
- Only SINGLE and RAID0 are allowed through to direct IO — these have no
  data duplication
- All other profiles fall back to `buffered:` where the kernel controls
  the page cache

The `goto buffered` jumps to the existing `buffered:` label (line 965 in
current code) that handles the buffered write fallback, including NOWAIT
handling. This is safe because the inode lock has NOT been taken yet at
this point.

### 3. THE BUG MECHANISM IN DETAIL

The bug occurs in this sequence:
1. Application opens a file with O_DIRECT on a btrfs filesystem with
   RAID1/RAID5/RAID6/DUP data profile
2. Application issues a direct IO write
3. btrfs begins writing the data to multiple mirrors/parity
4. During writeback (between when the data is read from the user buffer
   and when it's written to disk), another thread or the application
   itself modifies the DIO buffer
5. Different mirrors receive different content — some get the original
   data, some get the modified data
6. For NODATASUM inodes, there's **no checksum to detect this mismatch**
7. If a drive fails and rebuild is needed, the wrong data may be
   reconstructed

This is *silent data corruption* — the most insidious kind, since the
user doesn't know it happened.

### 4. CLASSIFICATION

This is a **data corruption fix**. It prevents silent data mismatch on
RAID mirrors caused by DIO buffer modification during writeback. This is
firmly in the "data corruption" category of stable-worthy fixes.

### 5. SCOPE AND RISK ASSESSMENT

- **Lines changed**: 12 added, 0 removed — extremely small
- **Files touched**: 1 (`fs/btrfs/direct-io.c`)
- **Risk of regression**: LOW. The worst case is a performance
  regression — direct IO falls back to buffered IO on RAID
  configurations. But:
  - The buffered IO path is well-tested and reliable
  - Users who need zero-copy DIO on RAID can still use NODATASUM (it was
    already the common practice)
  - Data integrity is more important than performance
- **Subsystem maturity**: btrfs is mature; the `btrfs_direct_write()`
  function is well-understood

### 6. DEPENDENCY ANALYSIS

- **`btrfs_data_alloc_profile()`**: Exists since v5.4-rc1 (commit
  878d7b679491). Available in ALL active stable trees (5.15.y, 6.1.y,
  6.6.y, 6.12.y).
- **`BTRFS_BLOCK_GROUP_PROFILE_MASK`**: Exists in
  `include/uapi/linux/btrfs_tree.h` since very early btrfs — available
  everywhere.
- **`buffered:` label**: Exists in `btrfs_direct_write()` since the
  function was written — available everywhere.
- **Commit 968f19c5b1b7**: IS an ancestor of this commit (confirmed).
  However, this patch does NOT strictly depend on it — the new check is
  placed BEFORE `relock:`, before the inode lock is taken, and uses
  `goto buffered` which existed before 968f. The two fixes are
  complementary but independent:
  - 968f handles checksummed inodes (checksum mismatch)
  - This commit handles redundant profiles (mirror mismatch)
- **File location**: On v6.12+, code is in `fs/btrfs/direct-io.c`. On
  v6.6 and earlier, the direct IO code was in `fs/btrfs/inode.c` (moved
  by commit 9aa29a20b7009 in v6.7). Backports to v6.6 and earlier would
  need path adjustment.

### 7. USER IMPACT

- **Who is affected**: Anyone using btrfs with RAID1, RAID10, RAID5,
  RAID6, or DUP data profiles and direct IO writes. This is a **very
  common scenario** — VM images on btrfs RAID1 is a standard deployment
  pattern.
- **Severity**: **Data corruption** — silent mirror mismatch that can
  lead to data loss on rebuild.
- **How long the bug has existed**: Since btrfs RAID support was added.
  The bugzilla (#99171) was filed in 2015.
- **Acknowledgment in commit**: The author acknowledges this is an
  incomplete fix (race with balance can still occur), but it "greatly
  reduces the window."

### 8. STABILITY INDICATORS

- Author: Qu Wenruo — prolific btrfs developer, deeply understands the
  subsystem
- Committer: David Sterba — btrfs maintainer, reviewed and accepted
- Committed 2026-01-29, authored 2025-11-01 — has had time for review
- The commit is already on multiple autosel candidate branches,
  confirming interest in backporting

### Concerns

1. **Performance impact**: Users on RAID configurations who previously
   got zero-copy DIO will now get buffered IO. This is a behavioral
   change, though documented and deliberate.
2. **Incomplete fix**: The commit acknowledges a remaining race with
   `balance` changing profiles. But a partial fix that "greatly reduces
   the window" is better than no fix.
3. **Needs 968f19c5b1b7 for clean apply**: The diff context includes
   code from 968f. However, the fix can be adapted to apply without it —
   only the surrounding context differs. The core logic (profile check +
   goto buffered) is self-contained.

### Verdict

This commit fixes **silent data corruption** on btrfs RAID
configurations — one of the most serious categories of bugs. The fix is
minimal (12 lines), self-contained, obviously correct, and has zero risk
of introducing new crashes or instability (worst case: performance
regression from buffered IO fallback). The bug has been known since 2015
(bugzilla #99171) and affects a common use case. The author and
committer are the primary btrfs maintainers.

**YES**

 fs/btrfs/direct-io.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 07e19e88ba4b3..5443d69efe956 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -814,6 +814,8 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t ret;
 	unsigned int ilock_flags = 0;
 	struct iomap_dio *dio;
+	const u64 data_profile = btrfs_data_alloc_profile(fs_info) &
+				 BTRFS_BLOCK_GROUP_PROFILE_MASK;

 	if (iocb->ki_flags & IOCB_NOWAIT)
 		ilock_flags |= BTRFS_ILOCK_TRY;
@@ -827,6 +829,16 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode))
 		ilock_flags |= BTRFS_ILOCK_SHARED;

+	/*
+	 * If our data profile has duplication (either extra mirrors or RAID56),
+	 * we can not trust the direct IO buffer, the content may change during
+	 * writeback and cause different contents written to different mirrors.
+	 *
+	 * Thus only RAID0 and SINGLE can go true zero-copy direct IO.
+	 */
+	if (data_profile != BTRFS_BLOCK_GROUP_RAID0 && data_profile != 0)
+		goto buffered;
+
 relock:
 	ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
 	if (ret < 0)
-- 
2.51.0

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH AUTOSEL 6.19] btrfs: don't BUG() on unexpected delayed ref type in run_one_delayed_ref()
       [not found] <20260210233123.2905307-1-sashal@kernel.org>
  2026-02-10 23:30 ` [PATCH AUTOSEL 6.19-6.12] btrfs: fallback to buffered IO if the data profile has duplication Sasha Levin
@ 2026-02-10 23:30 ` Sasha Levin
  2026-02-10 23:31 ` [PATCH AUTOSEL 6.19-6.6] btrfs: handle user interrupt properly in btrfs_trim_fs() Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2026-02-10 23:30 UTC (permalink / raw)
  To: patches, stable
  Cc: Filipe Manana, Boris Burkov, Qu Wenruo, David Sterba, Sasha Levin,
	clm, linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

[ Upstream commit c7d1d4ff56744074e005771aff193b927392d51f ]

There is no need to BUG(), we can just return an error and log an error
message.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Those `BUG()` calls in the sub-functions (`run_delayed_data_ref` line
1602 and `run_delayed_tree_ref` line 1752) are for unexpected
`node->action` values, not `node->type` values. The commit we're
analyzing only addresses the `node->type` BUG() in
`run_one_delayed_ref`. These are separate concerns and independently
addressable.

## Analysis Summary

### What the commit does
This commit replaces a `BUG()` (unconditional kernel panic) with proper
error handling (`-EUCLEAN` return + `btrfs_err()` log message) in
`run_one_delayed_ref()` in `fs/btrfs/extent-tree.c`. The `BUG()`
triggers when a delayed ref node has an unexpected type value that
doesn't match any of the known reference key types
(`BTRFS_TREE_BLOCK_REF_KEY`, `BTRFS_SHARED_BLOCK_REF_KEY`,
`BTRFS_EXTENT_DATA_REF_KEY`, `BTRFS_SHARED_DATA_REF_KEY`, or
`BTRFS_EXTENT_OWNER_REF_KEY`).

### Why it matters
- **BUG() causes a complete kernel crash/panic.** The system becomes
  unresponsive, potentially causing data loss if writes were in
  progress, and requires a reboot.
- **With the fix**, the error returns -EUCLEAN, which propagates through
  `btrfs_run_delayed_refs_for_head()` -> `__btrfs_run_delayed_refs()` ->
  `btrfs_run_delayed_refs()`, where `btrfs_abort_transaction(trans,
  ret)` is called. This gracefully aborts the transaction and puts the
  filesystem into an error state, allowing the system to continue
  operating (possibly in read-only mode for that filesystem) without a
  full kernel crash.

### Trigger conditions
While the `node->type` field is typically populated correctly by
`btrfs_ref_type()` (which can only return 4 valid values), this BUG()
can be triggered by:
1. **Memory corruption** (e.g., hardware memory errors, other kernel
   bugs corrupting memory)
2. **Future code bugs** that introduce a new type without handling it
   here
3. **Unusual filesystem states** or logic bugs in the delayed ref
   infrastructure

### Stable kernel criteria assessment

1. **Obviously correct and tested**: Yes. The patch is reviewed by 3
   btrfs developers (Boris Burkov, Qu Wenruo, David Sterba). The error
   path already exists and is exercised by other error codes. The change
   is straightforward: `BUG()` -> `return -EUCLEAN` + error log.

2. **Fixes a real bug**: Yes. BUG() in a code path that can be triggered
   is a bug - it causes an unnecessary kernel crash when graceful error
   handling is possible. The entire call chain already handles errors
   from this function correctly.

3. **Fixes an important issue**: Yes. A kernel panic/crash is a serious
   issue. Converting to graceful error handling prevents data loss and
   system downtime.

4. **Small and contained**: Yes. The change is approximately 10 lines in
   a single function, in a single file. It only modifies error handling
   behavior.

5. **No new features**: Correct. No new features or APIs are added. This
   only changes how an error condition is handled.

6. **Applies cleanly**: For 6.12.y and later stable trees, it should
   apply cleanly. For 6.6.y, the `BTRFS_EXTENT_OWNER_REF_KEY` branch
   needs to be removed (it was introduced in v6.7), requiring a minor
   backport adaptation.

### Precedent
Similar BUG_ON()/BUG() to -EUCLEAN conversions in the same file
(`fs/btrfs/extent-tree.c`) by the same author have been backported to
stable:
- `28cb13f29faf` ("btrfs: don't BUG_ON() when 0 reference count at
  btrfs_lookup_extent_info()") - backported to 6.6.y
- `bb3868033a4c` ("btrfs: do not BUG_ON() when freeing tree block after
  error") - backported to 6.6.y

### Risk assessment
**Very low risk.** The change only affects an error path that was
previously causing a kernel panic. The new behavior (returning -EUCLEAN)
is strictly better than the old behavior (BUG/crash). The error
propagation chain is already well-tested since other errors from
`run_one_delayed_ref` follow the exact same path.

### Dependencies
The commit is **self-contained**. It has no dependencies on other
commits. For older stable trees (pre-6.7), the
`BTRFS_EXTENT_OWNER_REF_KEY` branch wouldn't exist, requiring a trivial
adaptation.

**YES**

 fs/btrfs/extent-tree.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e4cae34620d19..1bf081243efb2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1761,32 +1761,36 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 			       struct btrfs_delayed_extent_op *extent_op,
 			       bool insert_reserved)
 {
+	struct btrfs_fs_info *fs_info = trans->fs_info;
 	int ret = 0;

 	if (TRANS_ABORTED(trans)) {
 		if (insert_reserved) {
 			btrfs_pin_extent(trans, node->bytenr, node->num_bytes);
-			free_head_ref_squota_rsv(trans->fs_info, href);
+			free_head_ref_squota_rsv(fs_info, href);
 		}
 		return 0;
 	}

 	if (node->type == BTRFS_TREE_BLOCK_REF_KEY ||
-	    node->type == BTRFS_SHARED_BLOCK_REF_KEY)
+	    node->type == BTRFS_SHARED_BLOCK_REF_KEY) {
 		ret = run_delayed_tree_ref(trans, href, node, extent_op,
 					   insert_reserved);
-	else if (node->type == BTRFS_EXTENT_DATA_REF_KEY ||
-		 node->type == BTRFS_SHARED_DATA_REF_KEY)
+	} else if (node->type == BTRFS_EXTENT_DATA_REF_KEY ||
+		   node->type == BTRFS_SHARED_DATA_REF_KEY) {
 		ret = run_delayed_data_ref(trans, href, node, extent_op,
 					   insert_reserved);
-	else if (node->type == BTRFS_EXTENT_OWNER_REF_KEY)
+	} else if (node->type == BTRFS_EXTENT_OWNER_REF_KEY) {
 		ret = 0;
-	else
-		BUG();
+	} else {
+		ret = -EUCLEAN;
+		btrfs_err(fs_info, "unexpected delayed ref node type: %u", node->type);
+	}
+
 	if (ret && insert_reserved)
 		btrfs_pin_extent(trans, node->bytenr, node->num_bytes);
 	if (ret < 0)
-		btrfs_err(trans->fs_info,
+		btrfs_err(fs_info,
 "failed to run delayed ref for logical %llu num_bytes %llu type %u action %u ref_mod %d: %d",
 			  node->bytenr, node->num_bytes, node->type,
 			  node->action, node->ref_mod, ret);
-- 
2.51.0

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH AUTOSEL 6.19-6.6] btrfs: handle user interrupt properly in btrfs_trim_fs()
       [not found] <20260210233123.2905307-1-sashal@kernel.org>
  2026-02-10 23:30 ` [PATCH AUTOSEL 6.19-6.12] btrfs: fallback to buffered IO if the data profile has duplication Sasha Levin
  2026-02-10 23:30 ` [PATCH AUTOSEL 6.19] btrfs: don't BUG() on unexpected delayed ref type in run_one_delayed_ref() Sasha Levin
@ 2026-02-10 23:31 ` Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2026-02-10 23:31 UTC (permalink / raw)
  To: patches, stable
  Cc: jinbaohong, Qu Wenruo, Robbie Ko, Filipe Manana, David Sterba,
	Sasha Levin, clm, linux-btrfs

From: jinbaohong <jinbaohong@synology.com>

[ Upstream commit bfb670b9183b0e4ba660aff2e396ec1cc01d0761 ]

When a fatal signal is pending or the process is freezing,
btrfs_trim_block_group() and btrfs_trim_free_extents() return -ERESTARTSYS.
Currently this is treated as a regular error: the loops continue to the
next iteration and count it as a block group or device failure.

Instead, break out of the loops immediately and return -ERESTARTSYS to
userspace without counting it as a failure. Also skip the device loop
entirely if the block group loop was interrupted.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have a complete understanding. Let me provide my thorough
analysis.

---

## Detailed Analysis

### 1. COMMIT MESSAGE ANALYSIS

The commit message is clear and precise: when `btrfs_trim_block_group()`
or `btrfs_trim_free_extents()` return `-ERESTARTSYS` (due to fatal
signal or freezing), the current code treats it as a regular error. It
continues the loops, counts it as a failure, and logs misleading
warnings. The fix is to break out immediately and return `-ERESTARTSYS`
to userspace.

The commit has excellent review pedigree: reviewed by Qu Wenruo, Filipe
Manana, and David Sterba (the btrfs maintainer himself). Three separate
reviews is strong confidence.

### 2. CODE CHANGE ANALYSIS

The fix touches a single function `btrfs_trim_fs()` in `fs/btrfs/extent-
tree.c`, adding 11 lines across three locations:

**Location 1** - Block group loop: After `btrfs_trim_block_group()`
returns, check for `-ERESTARTSYS`/`-EINTR` and break immediately:

```c
if (ret == -ERESTARTSYS || ret == -EINTR) {
    btrfs_put_block_group(cache);
    break;
}
```

Note the critical `btrfs_put_block_group(cache)` call before `break` —
this prevents a reference count leak. When using `continue`, the loop
iterator `cache = btrfs_next_block_group(cache)` handles putting the old
reference. But on `break`, we must do it explicitly. This matches the
existing pattern earlier in the same loop:

```6530:6533:fs/btrfs/extent-tree.c
                if (cache->start >= range_end) {
                        btrfs_put_block_group(cache);
                        break;
                }
```

**Location 2** - Between the two loops: Skip the device trimming loop
entirely if the block group loop was interrupted:

```c
if (ret == -ERESTARTSYS || ret == -EINTR)
    return ret;
```

**Location 3** - Device loop and final return: Break out of the device
loop on interrupt and return appropriately:

```c
if (ret == -ERESTARTSYS || ret == -EINTR)
    break;
...
if (ret == -ERESTARTSYS || ret == -EINTR)
    return ret;
```

### 3. THE BUG MECHANISM

This bug is a **follow-up to commit `69313850dce33` ("btrfs: add
cancellation points to trim loops")** which was merged in v6.12 and is
`Cc: stable@vger.kernel.org # 5.15+`. That commit added
`btrfs_trim_interrupted()` checks to the inner trim functions
(`trim_no_bitmap`, `trim_bitmaps`, `btrfs_issue_discard`,
`btrfs_trim_free_extents`) so they return `-ERESTARTSYS` when a fatal
signal is pending or the process is freezing.

**The problem**: The outer function `btrfs_trim_fs()` was NOT updated to
handle this `-ERESTARTSYS` return. So the inner loops correctly detect
the interrupt and return early, but the outer loop just treats it as a
regular error and continues:

1. `btrfs_trim_block_group(cache_0)` → detects signal → returns
   `-ERESTARTSYS`
2. Outer loop: `bg_failed++`, `bg_ret = -ERESTARTSYS`, `continue`
3. `btrfs_trim_block_group(cache_1)` → detects signal again → returns
   `-ERESTARTSYS`
4. Repeat for ALL remaining block groups
5. Then iterate ALL devices, each returning `-ERESTARTSYS` immediately

On a large filesystem with thousands of block groups and multiple
devices, this means:
- **Delayed response to Ctrl+C/SIGKILL**: The process doesn't terminate
  promptly
- **Blocked system suspend**: `freezing(current)` remains true, but the
  outer loop keeps going, preventing the process from actually freezing.
  This was the exact scenario reported in [bug
  219180](https://bugzilla.kernel.org/show_bug.cgi?id=219180) and [SUSE
  bug 1229737](https://bugzilla.suse.com/show_bug.cgi?id=1229737)
- **Misleading dmesg warnings**: `btrfs_warn(fs_info, "failed to trim
  %llu block group(s)...")` fires, counting all the interrupted block
  groups as "failures"
- **Wrong return value**: Instead of returning `-ERESTARTSYS` cleanly to
  userspace, the function may return a mixed error code

### 4. SCOPE AND RISK ASSESSMENT

- **Size**: 11 lines added, 0 removed. Extremely small and surgical.
- **Files touched**: 1 (`fs/btrfs/extent-tree.c`)
- **Scope**: Only affects the interrupt/signal error path. The normal
  trim path (no signal pending) is completely unaffected — all new code
  is gated behind `ret == -ERESTARTSYS || ret == -EINTR` checks.
- **Risk**: Very low. The added checks are early-exit conditions that
  only trigger when a signal is pending or process is freezing. There's
  no way these can cause a regression in normal operation.
- **Reference counting**: Correctly handled
  (`btrfs_put_block_group(cache)` before break).

### 5. USER IMPACT

- **Who is affected**: Any user running `fstrim` on a btrfs filesystem
  who interrupts it (Ctrl+C) or has a system that suspends while trim is
  running. This is a very common scenario, especially on laptops with
  btrfs and periodic fstrim timers.
- **Call path**: `fstrim` → `FITRIM` ioctl → `btrfs_ioctl_fitrim()` →
  `btrfs_trim_fs()`
- **Severity**: The original bugs from the linked reports were about
  systems unable to suspend. The cancellation point commit
  (`69313850dce33`) fixed the inner loops but left the outer loop
  broken, meaning the fix was incomplete. This commit completes it.

### 6. DEPENDENCY CHECK

This commit depends on two preceding commits:

1. **`912d1c6680bdb` ("btrfs: continue trimming remaining devices on
   failure")** - Changes `break` to `continue` in the device loop.
   **Already targeted for stable** (`Fixes:` tag and `Cc:
   stable@vger.kernel.org # 5.4+`).

2. **`1cc4ada4182fa` ("btrfs: preserve first error in
   btrfs_trim_fs()")** - Changes `bg_ret = ret` to `if (!bg_ret) bg_ret
   = ret`. **Not targeted for stable**. This is a small context
   dependency; the core fix logic is independent of it.

Both prerequisites are small (1-line and 15-line changes respectively).
The first is already stable-bound. The second would be needed for clean
application but could alternatively be resolved by a minor context
adjustment during backport.

The fix also requires `69313850dce33` ("btrfs: add cancellation points
to trim loops") which is `Cc: stable # 5.15+` and should already be in
stable trees 5.15+.

### 7. STABILITY INDICATORS

- Reviewed by 3 btrfs experts
- The parent commit adding interruption infrastructure has been in
  stable since 5.15+
- The fix is straightforward conditional checks — no complex logic

### 8. CLASSIFICATION

This is a **bug fix** that:
- Fixes incomplete signal/interrupt handling
- Fixes potential system suspend blocking
- Fixes misleading kernel warnings
- Fixes incorrect error propagation to userspace
- Completes an existing stable fix (`69313850dce33`)

It is NOT:
- A new feature
- A performance optimization
- A code cleanup
- An API change

### Summary

This commit fixes a real, user-visible bug where btrfs trim operations
cannot be properly interrupted by signals (Ctrl+C) or system suspend. It
completes the fix started by `69313850dce33` (already in stable 5.15+)
which added cancellation points to inner trim loops but left the outer
loop in `btrfs_trim_fs()` unaware of the `-ERESTARTSYS` return code. The
fix is extremely small (11 lines), well-reviewed by 3 btrfs developers
including the maintainer, has zero risk to normal code paths, correctly
handles reference counting, and addresses the same class of bug (suspend
blocking) that motivated the original stable-tagged commit. It requires
two small prerequisite commits (one already stable-targeted) for clean
application.

**YES**

 fs/btrfs/extent-tree.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1bf081243efb2..8bdb609f58a7e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6555,6 +6555,10 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
 						     range->minlen);

 			trimmed += group_trimmed;
+			if (ret == -ERESTARTSYS || ret == -EINTR) {
+				btrfs_put_block_group(cache);
+				break;
+			}
 			if (ret) {
 				bg_failed++;
 				bg_ret = ret;
@@ -6568,6 +6572,9 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
 			"failed to trim %llu block group(s), last error %d",
 			bg_failed, bg_ret);

+	if (ret == -ERESTARTSYS || ret == -EINTR)
+		return ret;
+
 	mutex_lock(&fs_devices->device_list_mutex);
 	list_for_each_entry(device, &fs_devices->devices, dev_list) {
 		if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
@@ -6576,6 +6583,8 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
 		ret = btrfs_trim_free_extents(device, &group_trimmed);

 		trimmed += group_trimmed;
+		if (ret == -ERESTARTSYS || ret == -EINTR)
+			break;
 		if (ret) {
 			dev_failed++;
 			dev_ret = ret;
@@ -6589,6 +6598,8 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
 			"failed to trim %llu device(s), last error %d",
 			dev_failed, dev_ret);
 	range->len = trimmed;
+	if (ret == -ERESTARTSYS || ret == -EINTR)
+		return ret;
 	if (bg_ret)
 		return bg_ret;
 	return dev_ret;
-- 
2.51.0

^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-02-10 23:31 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20260210233123.2905307-1-sashal@kernel.org>
2026-02-10 23:30 ` [PATCH AUTOSEL 6.19-6.12] btrfs: fallback to buffered IO if the data profile has duplication Sasha Levin
2026-02-10 23:30 ` [PATCH AUTOSEL 6.19] btrfs: don't BUG() on unexpected delayed ref type in run_one_delayed_ref() Sasha Levin
2026-02-10 23:31 ` [PATCH AUTOSEL 6.19-6.6] btrfs: handle user interrupt properly in btrfs_trim_fs() Sasha Levin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox