* [PATCH AUTOSEL 6.19-6.18] md raid: fix hang when stopping arrays with metadata through dm-raid
[not found] <20260211123112.1330287-1-sashal@kernel.org>
@ 2026-02-11 12:30 ` Sasha Levin
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.6] md-cluster: fix NULL pointer dereference in process_metadata_update Sasha Levin
1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2026-02-11 12:30 UTC (permalink / raw)
To: patches, stable; +Cc: Heinz Mauelshagen, Yu Kuai, Sasha Levin, song, linux-raid
From: Heinz Mauelshagen <heinzm@redhat.com>
[ Upstream commit cefcb9297fbdb6d94b61787b4f8d84f55b741470 ]
When using device-mapper's dm-raid target, stopping a RAID array can cause
the system to hang under specific conditions.
This occurs when:
- A dm-raid managed device tree is suspended from top to bottom
(the top-level RAID device is suspended first, followed by its
underlying metadata and data devices)
- The top-level RAID device is then removed
Removing the top-level device triggers a hang in the following sequence:
the dm-raid destructor calls md_stop(), which tries to flush the
write-intent bitmap by writing to the metadata sub-devices. However, these
devices are already suspended, making them unable to complete the write-intent
operations and causing an indefinite block.
Fix:
- Prevent bitmap flushing when md_stop() is called from dm-raid
destructor context
and avoid a quiescing/unquiescing cycle, which could also cause I/O
- Still allow write-intent bitmap flushing when called from dm-raid
suspend context
This ensures that RAID array teardown can complete successfully even when the
underlying devices are in a suspended state.
This second patch uses md_is_rdwr() to distinguish between the suspend and
destructor paths, as elaborated above.
Link: https://lore.kernel.org/linux-raid/CAM23VxqYrwkhKEBeQrZeZwQudbiNey2_8B_SEOLqug=pXxaFrA@mail.gmail.com
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
## Complete Analysis
### 1. What the Bug Is
This commit fixes a **system hang** (indefinite block/deadlock) when
stopping dm-raid managed RAID arrays. The hang scenario is:
1. A dm-raid device tree is suspended top-to-bottom (top-level RAID
device first, then underlying metadata/data devices)
2. The top-level RAID device is removed (`raid_dtr` destructor)
3. `raid_dtr()` → `md_stop()` → `__md_stop_writes()` tries to:
- Quiesce the RAID personality (`pers->quiesce()`)
- Flush the write-intent bitmap to metadata devices
(`bitmap_ops->flush()`)
4. But the metadata devices are already suspended and **cannot complete
I/O**
5. The flush waits indefinitely → **system hang**
This is a real, user-reported bug (Link in commit message points to a
lore report).
### 2. The Fix
The fix adds a conditional guard around the quiesce and bitmap flush in
`__md_stop_writes()`:
```c
	if (md_is_rdwr(mddev) || !mddev_is_dm(mddev)) {
		/* quiesce + bitmap flush */
	}
```
This condition skips the quiesce and bitmap flush **only** when:
- The device is a dm-raid device (`mddev_is_dm()` returns true), AND
- The device is NOT in read-write mode (`md_is_rdwr()` returns false)
The clever trick: `raid_postsuspend()` (suspend path) already calls
`md_stop_writes()` while the device is still `MD_RDWR`, so the bitmap
flush proceeds normally during suspend. Then it sets `rs->md.ro =
MD_RDONLY`. Later when `raid_dtr()` calls `md_stop()` →
`__md_stop_writes()`, the device is `MD_RDONLY`, so the condition is
false and the dangerous I/O is skipped.
For non-dm md arrays (`!mddev_is_dm()` is true), the condition is always
true and behavior is unchanged.
### 3. Code Change Scope
- **1 file changed**: `drivers/md/md.c`
- **8 insertions, 6 deletions** (net +2 lines)
- Only touches the `__md_stop_writes()` function
- Small and surgical
### 4. Critical Dependency Issue
The commit message explicitly says **"This second patch"**, indicating
it's part of a 2-patch series. The first patch is `55dcfdf8af9c3` ("dm
raid: use proper md_ro_state enumerators"), which:
- Added `rs->md.ro = MD_RDONLY;` to `raid_postsuspend()` in `dm-raid.c`
- Without this line, when `raid_dtr` runs, `mddev->ro` is still
`MD_RDWR` (from `raid_resume`), so `md_is_rdwr()` returns true, and
the quiesce/flush is NOT skipped → the hang still occurs!
**This means the fix (`cefcb9297fbdb`) CANNOT work without the
prerequisite (`55dcfdf8af9c3`).**
The prerequisite `55dcfdf8af9c3` was merged for v6.18. It is:
- Present in 6.18.y
- **NOT** present in 6.12.y, 6.6.y, 6.1.y, or 5.15.y
### 5. Additional Backport Complications
For older stable trees (6.12.y and earlier), the code context has
changed significantly:
- v6.12+: `bitmap_ops->flush()` (ops-based interface)
- v6.11 and earlier: `md_bitmap_flush()` (direct function call)
- The `md_bitmap_enabled()` guard was added between v6.12 and current
- The `serialize_policy` field changed to a flag bit
- `md_stop_writes()` in v6.6 includes `set_bit(MD_RECOVERY_FROZEN)` and
sync thread reaping, which was later moved to the callers
### 6. Is the Bug Present in Older Stable Trees?
**Yes**, the underlying bug (bitmap flush blocking on suspended metadata
devices) exists in all kernel versions with dm-raid. In v6.6/v6.1,
`__md_stop_writes()` also calls `md_bitmap_flush()` and
`pers->quiesce()` unconditionally, and the same hang can occur when
`raid_dtr` runs after `raid_postsuspend`.
However, to fix it in older trees, you would need BOTH:
1. The prerequisite that sets `ro = MD_RDONLY` in `raid_postsuspend`
(equivalent of `55dcfdf8af9c3`)
2. This fix that checks `md_is_rdwr()` in `__md_stop_writes` (this
commit)
And both would need adaptation to the different code context.
### 7. Risk Assessment
- **Severity of bug**: HIGH — system hang is a serious issue, especially
for enterprise users of dm-raid (LVM RAID, etc.)
- **Risk of regression**: LOW-MEDIUM — the fix only changes behavior for
dm-raid devices in the destructor path (when `!md_is_rdwr &&
mddev_is_dm`); all other paths are unchanged
- **Size**: Very small (8 insertions, 6 deletions)
- **Dependency complexity**: HIGH — requires prerequisite commit and
significant code context differences across versions
### 8. Verdict
This commit fixes a **real, user-reported system hang** in dm-raid
teardown, which is clearly important for stable. The fix itself is
small, surgical, and well-understood. However, it has a **hard
dependency** on `55dcfdf8af9c3` ("dm raid: use proper md_ro_state
enumerators") which adds the critical `rs->md.ro = MD_RDONLY` line in
`raid_postsuspend()`. Without that prerequisite, the `md_is_rdwr()`
check in this fix would never distinguish between suspend and destructor
paths, and the fix would be completely ineffective.
For 6.18.y, the prerequisite is present and this patch could apply
(possibly with minor context adjustments). For older LTS trees (6.12.y,
6.6.y, 6.1.y), both this commit and the prerequisite (with appropriate
context adaptations) would need to be backported together.
Given that:
1. It fixes a definite system hang (high severity)
2. It's small and well-contained
3. It has a clear dependency that must also be backported
4. dm-raid (LVM RAID) is widely used in enterprise environments
5. The author is from Red Hat (enterprise focus) and it was reported by
a real user
This is a **YES** for backporting, with the strong caveat that the
prerequisite commit `55dcfdf8af9c3` must be included in any backport.
**YES**
drivers/md/md.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 6d73f6e196a9f..ac71640ff3a81 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6848,13 +6848,15 @@ static void __md_stop_writes(struct mddev *mddev)
 {
 	timer_delete_sync(&mddev->safemode_timer);
 
-	if (mddev->pers && mddev->pers->quiesce) {
-		mddev->pers->quiesce(mddev, 1);
-		mddev->pers->quiesce(mddev, 0);
-	}
+	if (md_is_rdwr(mddev) || !mddev_is_dm(mddev)) {
+		if (mddev->pers && mddev->pers->quiesce) {
+			mddev->pers->quiesce(mddev, 1);
+			mddev->pers->quiesce(mddev, 0);
+		}
 
-	if (md_bitmap_enabled(mddev, true))
-		mddev->bitmap_ops->flush(mddev);
+		if (md_bitmap_enabled(mddev, true))
+			mddev->bitmap_ops->flush(mddev);
+	}
 
 	if (md_is_rdwr(mddev) &&
 	    ((!mddev->in_sync && !mddev_is_clustered(mddev)) ||
--
2.51.0
* [PATCH AUTOSEL 6.19-6.6] md-cluster: fix NULL pointer dereference in process_metadata_update
[not found] <20260211123112.1330287-1-sashal@kernel.org>
2026-02-11 12:30 ` [PATCH AUTOSEL 6.19-6.18] md raid: fix hang when stopping arrays with metadata through dm-raid Sasha Levin
@ 2026-02-11 12:30 ` Sasha Levin
1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2026-02-11 12:30 UTC (permalink / raw)
To: patches, stable; +Cc: Jiasheng Jiang, Yu Kuai, Sasha Levin, song, linux-raid
From: Jiasheng Jiang <jiashengjiangcool@gmail.com>
[ Upstream commit f150e753cb8dd756085f46e86f2c35ce472e0a3c ]
The function process_metadata_update() blindly dereferences the 'thread'
pointer (acquired via rcu_dereference_protected) within the wait_event()
macro.
While the code comment states "daemon thread must exist", there is a valid
race condition window during the MD array startup sequence (md_run):
1. bitmap_load() is called, which invokes md_cluster_ops->join().
2. join() starts the "cluster_recv" thread (recv_daemon).
3. At this point, recv_daemon is active and processing messages.
4. However, mddev->thread (the main MD thread) is not initialized until
later in md_run().
If a METADATA_UPDATED message is received from a remote node during this
specific window, process_metadata_update() will be called while
mddev->thread is still NULL, leading to a kernel panic.
To fix this, we must validate the 'thread' pointer. If it is NULL, we
release the held lock (no_new_dev_lockres) and return early, safely
ignoring the update request as the array is not yet fully ready to
process it.
Link: https://lore.kernel.org/linux-raid/20260117145903.28921-1-jiashengjiangcool@gmail.com
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
## Detailed Analysis
### 1. Commit Message Analysis
The commit fixes a **NULL pointer dereference** in
`process_metadata_update()` in `drivers/md/md-cluster.c`. The subject
explicitly says "fix NULL pointer dereference" -- a strong indicator for
stable.
### 2. Code Change Analysis
The vulnerable code is at lines 552-556 (before the fix):
```552:556:drivers/md/md-cluster.c
	/* daemaon thread must exist */
	thread = rcu_dereference_protected(mddev->thread, true);
	wait_event(thread->wqueue,
		   (got_lock = mddev_trylock(mddev)) ||
		   test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
			    &cinfo->state));
```
The code obtains `mddev->thread` via `rcu_dereference_protected()` and
**immediately dereferences `thread->wqueue`** without any NULL check. If
`thread` is NULL, this is a guaranteed kernel panic.
**Critical comparison**: All other uses of `mddev->thread` in
`md-cluster.c` (lines 352, 468, 571, 726, 1079) go through
`md_wakeup_thread()`, which has a **built-in NULL check**:
```8520:8531:drivers/md/md.c
void __md_wakeup_thread(struct md_thread __rcu *thread)
{
	struct md_thread *t;

	t = rcu_dereference(thread);
	if (t) {
		pr_debug("md: waking up MD thread %s.\n", t->tsk->comm);
		set_bit(THREAD_WAKEUP, &t->flags);
		if (wq_has_sleeper(&t->wqueue))
			wake_up(&t->wqueue);
	}
}
```
So `process_metadata_update()` is the **only location** in the file that
directly dereferences `mddev->thread` without safety.
### 3. The Race Condition
The vulnerability was introduced in commit `0ba959774e939` ("md-cluster:
use sync way to handle METADATA_UPDATED msg", 2017, v4.12). The author
of that commit was aware of the `thread->wqueue` dependency -- they even
wrote a follow-up commit `48df498daf62e` ("md: move bitmap_destroy to
the beginning of __md_stop") that explicitly states:
> "process_metadata_update is depended on mddev->thread->wqueue"
> "clustered raid could possible hang if array received a
METADATA_UPDATED msg after array unregistered mddev->thread"
This follow-up only addressed the **shutdown ordering** (moving
`bitmap_destroy` before `mddev_detach`), but did NOT add a NULL safety
check for the startup/error paths.
The race window during startup:
- `md_run()` calls `pers->run()` which sets `mddev->thread`
- Then `md_bitmap_create()` -> `join()` creates recv_thread
- Then `bitmap_load()` -> `load_bitmaps()` enables message processing
While the normal ordering seems safe, there are scenarios involving:
- Error paths during bitmap creation where `mddev_detach()` is called
(NULLing `mddev->thread`) while the recv_thread may still have work
pending
- Edge cases in `dm-raid` which has a different bitmap_load timing
- Future code changes that could affect the ordering
### 4. The Fix
The fix adds a simple NULL check:
```diff
 	thread = rcu_dereference_protected(mddev->thread, true);
+	if (!thread) {
+		pr_warn("md-cluster: Received metadata update but MD thread is not ready\n");
+		dlm_unlock_sync(cinfo->no_new_dev_lockres);
+		return;
+	}
```
The fix properly:
- Checks for NULL before dereferencing `thread->wqueue`
- Releases the DLM lock (`no_new_dev_lockres`) acquired earlier in the
function (avoids deadlock on early return)
- Logs a warning for debugging
- Returns early, safely skipping the update (the array isn't fully ready
anyway)
- Removes the incorrect "daemaon" typo comment
### 5. Scope and Risk Assessment
- **Lines changed**: +6/-1, single file
- **Risk**: Near zero. The check only triggers when `thread` is NULL
(abnormal case). Normal operation is completely unaffected.
- **Subsystem**: MD RAID (clustered), mature subsystem present since
v4.12
- **Could break something**: No. This is purely defensive -- adding a
safety check that only activates in the error scenario.
### 6. User Impact
- **Who is affected**: Users of clustered MD RAID (enterprise/SAN
environments)
- **Severity if triggered**: Kernel panic/oops (NULL pointer
dereference)
- **Affected stable trees**: All versions since v4.12 (5.4, 5.10, 5.15,
6.1, 6.6, 6.12, etc.)
### 7. Stable Criteria Checklist
- **Obviously correct and tested**: Yes, trivially correct NULL check
with proper cleanup
- **Fixes a real bug**: Yes, NULL pointer dereference leading to kernel
panic
- **Important issue**: Yes, kernel crash
- **Small and contained**: Yes, 6-line change in one function in one
file
- **No new features**: Correct
- **Clean backport**: The fix should apply cleanly to all stable trees
since the code hasn't materially changed since v4.12
**YES**
drivers/md/md-cluster.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index 11f1e91d387d8..896279988dfd5 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -549,8 +549,13 @@ static void process_metadata_update(struct mddev *mddev, struct cluster_msg *msg
 
 	dlm_lock_sync(cinfo->no_new_dev_lockres, DLM_LOCK_CR);
 
-	/* daemaon thread must exist */
 	thread = rcu_dereference_protected(mddev->thread, true);
+	if (!thread) {
+		pr_warn("md-cluster: Received metadata update but MD thread is not ready\n");
+		dlm_unlock_sync(cinfo->no_new_dev_lockres);
+		return;
+	}
+
 	wait_event(thread->wqueue,
 		   (got_lock = mddev_trylock(mddev)) ||
 		   test_bit(MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD, &cinfo->state));
--
2.51.0