Linux RAID subsystem development
* [PATCH v2] md/raid5: fix reshape deadlock while failed devices more than max degraded
From: Chen Cheng @ 2026-06-15 11:34 UTC
  To: linux-raid, yukuai, yukuai; +Cc: chencheng, linux-kernel

From: Chen Cheng <chencheng@fnnas.com>

Reshape stripe lifetime:
- start reshape ==> reshape_request():
	* get the destination stripe,
	  - if source data chunks must be copied, set STRIPE_EXPANDING;
	  - or, if the region lies past the old end of the array, it is
	    zero-filled and needs no source data, so set
	    STRIPE_EXPANDING | STRIPE_EXPAND_READY
	* get the source stripe,
	  - set STRIPE_EXPAND_SOURCE

- handle expand stripe ==> handle_stripe():
	reshape uses reconstruct-write to build the destination stripe,
	in four stages:
	1. prepare source data chunks for the old-geometry stripe
		- fill the source stripe data by read or compute
	2. move data from the old-geometry source stripe to the
	   new-geometry destination stripe
		- the source stripe clears STRIPE_EXPAND_SOURCE
		- drain data from the source to the destination stripe
		- mark a stripe chunk R5_Expanded|R5_UPTODATE when the
		  drain from source chunk to destination chunk completes
		- once all chunk drains have completed, set
		  STRIPE_EXPAND_READY
	3. calculate the p/q chunks for the destination stripe
		- if the destination stripe doesn't depend on the source
		  stripe, then we can clear STRIPE_EXPANDING
	4. write out to disks and release
		- set R5_Wantwrite|R5_LOCKED and write out to disk
		- if the write-out succeeds, clear STRIPE_EXPAND_READY,
		  decrement reshape_stripes, and call md_done_sync() to
		  report reshape progress.
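
The destination-stripe lifetime above can be sketched as a small user-space
model. This is only an illustration under stated assumptions: the toy_*
names, the flag values, and the per-stripe chunk counter are hypothetical
and are not kernel code; only the order of flag changes follows the stages
listed above.

```c
#include <assert.h>

/* Toy flag bits. Names mirror the flags above; the values are arbitrary. */
#define TOY_EXPANDING    (1u << 0)
#define TOY_EXPAND_READY (1u << 1)

/* Hypothetical stand-ins for struct r5conf and struct stripe_head. */
struct toy_conf {
	int reshape_stripes;    /* destination stripes still in flight */
};

struct toy_stripe {
	unsigned int state;
	int chunks_needed;      /* data chunks that must come from sources */
	int chunks_done;        /* chunks already drained into this stripe */
};

/* reshape_request(): take a destination stripe into the reshape. */
void toy_start_dest(struct toy_conf *conf, struct toy_stripe *sh,
		    int chunks_needed)
{
	sh->state = TOY_EXPANDING;
	sh->chunks_needed = chunks_needed;
	sh->chunks_done = 0;
	conf->reshape_stripes++;
	/* A region past the old end needs no source data. */
	if (chunks_needed == 0)
		sh->state |= TOY_EXPAND_READY;
}

/* Stage 2: one chunk has been drained from a source stripe. */
void toy_drain_chunk(struct toy_stripe *sh)
{
	assert(sh->state & TOY_EXPANDING);
	assert(sh->chunks_done < sh->chunks_needed);
	if (++sh->chunks_done == sh->chunks_needed)
		sh->state |= TOY_EXPAND_READY;
}

/* Stage 3: parity computed, the stripe no longer depends on sources. */
void toy_compute_parity(struct toy_stripe *sh)
{
	assert(sh->state & TOY_EXPAND_READY);
	sh->state &= ~TOY_EXPANDING;
}

/* Stage 4: write-out succeeded, account the stripe as finished. */
void toy_write_done(struct toy_conf *conf, struct toy_stripe *sh)
{
	assert(!(sh->state & TOY_EXPANDING));
	sh->state &= ~TOY_EXPAND_READY;
	conf->reshape_stripes--;
}
```

In this model a stripe that stops anywhere before toy_write_done() keeps
reshape_stripes above zero, which is the kind of leftover the cleanup in
this patch is meant to account for.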

1. clean up the following kinds of **destination stripe**
	when more devices have failed than max_degraded allows:
  - new regions past the old end of the array, zero-filled in place,
    requiring no source data.
	(STRIPE_EXPANDING | STRIPE_EXPAND_READY)
  - source data chunks already prepared, but the write-out failed
	(STRIPE_EXPAND_READY)

2. destination stripes that need source data
	(STRIPE_EXPANDING, no STRIPE_HANDLE)
  - these stripes sit idle in the stripe cache and are never seen
    by handle_stripe(), so they are cleaned up indirectly when their
    source stripe (type 3) is processed.

3. source stripes (STRIPE_EXPAND_SOURCE)
  - hit handle_stripe() after their member disks are marked Faulty.
  - clear STRIPE_EXPAND_SOURCE, then find and clean up all dependent
    destination stripes that were waiting for data.
  - walk the source's data disks, compute the corresponding destination
    sector, look up the destination stripe, and clean it up (clear flags,
    decrement counters, call md_done_sync())
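
The type 3 walk can be sketched in user space as follows. It is a
simplified model, not the patch itself: the toy_* names are hypothetical,
and the dest[] pointer table stands in for the sector computation plus
stripe-cache lookup. It only shows that parity slots are skipped and that
a destination shared by several data disks is accounted once.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_DISKS 8

/* Toy flag bits. Names mirror the flags above; the values are arbitrary. */
#define TOY_EXPANDING     (1u << 0)
#define TOY_EXPAND_READY  (1u << 1)
#define TOY_EXPAND_SOURCE (1u << 2)

struct toy_conf {
	int reshape_stripes;    /* destination stripes still in flight */
};

struct toy_stripe {
	unsigned int state;
	int disks;
	int pd_idx;             /* P parity slot */
	int qd_idx;             /* Q parity slot, -1 if the level has none */
	/* Assumption: a direct pointer replaces sector math and lookup. */
	struct toy_stripe *dest[TOY_MAX_DISKS];
};

/* Non-atomic stand-in for test_and_clear_bit(). */
static int toy_test_and_clear(unsigned int *state, unsigned int bit)
{
	int was_set = (*state & bit) != 0;

	*state &= ~bit;
	return was_set;
}

/*
 * Clean up the destinations that depend on a failed source stripe.
 * Returns how many destination stripes were taken out of the reshape.
 */
int toy_fail_source(struct toy_conf *conf, struct toy_stripe *src)
{
	int released = 0;

	if (!toy_test_and_clear(&src->state, TOY_EXPAND_SOURCE))
		return 0;

	for (int i = 0; i < src->disks; i++) {
		struct toy_stripe *dst;

		/* Parity slots carry no data to move. */
		if (i == src->pd_idx || i == src->qd_idx)
			continue;

		dst = src->dest[i];
		if (dst == NULL)
			continue;

		/* Count each destination once, even if seen repeatedly. */
		if (toy_test_and_clear(&dst->state, TOY_EXPANDING)) {
			conf->reshape_stripes--;
			released++;
		}
		dst->state &= ~TOY_EXPAND_READY;
	}
	return released;
}
```

Using test-and-clear for the accounting is what keeps the counter from
being decremented twice when two data disks of one source map to the same
destination stripe.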

Reproducer:
  - Create a 4-disk RAID5 with mdadm on top of 5 disposable test disks
    wrapped by dm targets.
  - Add the 5th device as a spare and start a 4 -> 5 reshape.
  - Wait until /sys/block/mdX/md/sync_action reports "reshape".
  - Inject failures on two members so reshape exceeds max_degraded.
  - After a few seconds, write "frozen" to /sys/block/mdX/md/sync_action.
    Before this fix, the write blocks indefinitely.

Read-error variant:
  - Use dm-dust on /dev/sd[b-f].
  - Preload bad blocks on two source members, e.g. dust0 and dust1:
      dmsetup message dust0 0 addbadblock <range>
      dmsetup message dust1 0 addbadblock <range>
  - Start reshape:
      mdadm -C /dev/mdX -e 1.2 -l 5 -n 4 -c 64 --assume-clean /dev/mapper/dust{0..3}
      mdadm --manage /dev/mdX --add /dev/mapper/dust4
      mdadm --grow /dev/mdX -n 5 --backup-file=/tmp/grow.backup &
  - Once reshape starts, enable the injected read failures:
      dmsetup message dust0 0 enable
      dmsetup message dust1 0 enable
  - Then:
      echo frozen > /sys/block/mdX/md/sync_action
    hangs forever before the fix.

Write-error variant:
  - Use dm-flakey on /dev/sd[b-f].
  - Start the same 4 -> 5 reshape on flakey0..flakey4.
  - Once reshape starts, switch two members, e.g. flakey3 and flakey4,
    to error_writes.
  - Then:
      echo frozen > /sys/block/mdX/md/sync_action
    hangs forever before the fix.

md_do_sync() exits its main loop on MD_RECOVERY_INTR but then blocks
forever at:

  wait_event(mddev->recovery_wait,
		!atomic_read(&mddev->recovery_active));

After the fix, recovery_active drains to zero and the log shows

    md/raid:md0: Cannot continue operation (2/5 failed).
    md: md0: reshape interrupted.
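
A minimal sketch of why the wait above never finishes, assuming the usual
pairing in which the sync thread adds submitted sectors to recovery_active
and md_done_sync() subtracts them. The toy_* names are hypothetical, and
the value 8 for the stripe size in sectors is an assumption for the
example, not taken from this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: one stripe unit is 8 sectors (4 KiB of 512-byte sectors). */
#define TOY_STRIPE_SECTORS 8

/* Hypothetical stand-in for the part of struct mddev that matters here. */
struct toy_mddev {
	long recovery_active;   /* sectors submitted but not yet reported */
};

/* The sync thread accounts for work it has handed to the personality. */
void toy_submit(struct toy_mddev *mddev, int stripes)
{
	mddev->recovery_active += (long)stripes * TOY_STRIPE_SECTORS;
}

/* Stand-in for md_done_sync(): one stripe reports completion. */
void toy_done_sync(struct toy_mddev *mddev)
{
	assert(mddev->recovery_active >= TOY_STRIPE_SECTORS);
	mddev->recovery_active -= TOY_STRIPE_SECTORS;
}

/* The condition that the wait_event() above sleeps on. */
bool toy_sync_thread_can_exit(const struct toy_mddev *mddev)
{
	return mddev->recovery_active == 0;
}
```

If even one submitted stripe never reaches toy_done_sync(), the predicate
stays false. That is the state a stranded reshape stripe leaves behind,
and the state the cleanup calls to md_done_sync() are meant to clear.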

Changes v1 -> v2:
- handle the reshape write deadlock when failed devices exceed max_degraded

Signed-off-by: Chen Cheng <chencheng@fnnas.com>
---
 drivers/md/raid5.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 65ae7d8930fc..a320b71d7117 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3728,10 +3728,82 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
 
 	if (abort)
 		md_sync_error(conf->mddev);
 }
 
+/*
+ * handle_failed_reshape - handle failed stripes when reshape fails and
+ *			   failed devices > max_degraded
+ *
+ * Handles the following kinds of stripe:
+ * 1. clean up the following kinds of destination stripe:
+ *	- new regions past the old end of the array, zero-filled in place,
+ *	  requiring no source data.
+ *		(STRIPE_EXPANDING | STRIPE_EXPAND_READY)
+ *	- source data chunks already prepared, but the write-out failed
+ *		(STRIPE_EXPAND_READY)
+ * 2. dest stripes that need source data (STRIPE_EXPANDING, no STRIPE_HANDLE)
+ *   - these stripes sit idle in the stripe cache and are never seen
+ *     by handle_stripe(), so they are cleaned up indirectly when their
+ *     source stripe (type 3) is processed.
+ * 3. src stripes (STRIPE_EXPAND_SOURCE)
+ *   - hit handle_stripe() after their member disks are marked Faulty.
+ *   - clear STRIPE_EXPAND_SOURCE, then find and clean up all dependent
+ *     destination stripes that were waiting for data.
+ *   - walk the source's data disks, compute the corresponding destination
+ *     sector, look up the destination stripe, and clean it up (clear flags,
+ *     decrement counters, call md_done_sync())
+ */
+static void handle_failed_reshape(struct r5conf *conf, struct stripe_head *sh,
+				  struct stripe_head_state *s)
+{
+	int i;
+	bool was_expanding = test_and_clear_bit(STRIPE_EXPANDING, &sh->state);
+	bool was_ready = test_and_clear_bit(STRIPE_EXPAND_READY, &sh->state);
+
+	if (was_expanding || was_ready) {
+		atomic_dec(&conf->reshape_stripes);
+		wake_up(&conf->wait_for_reshape);
+		md_done_sync(conf->mddev, RAID5_STRIPE_SECTORS(conf));
+	}
+
+	s->expanded = 0;
+	s->expanding = 0;
+
+	/* release the destination stripes that are waiting to be filled */
+	if (test_and_clear_bit(STRIPE_EXPAND_SOURCE, &sh->state)) {
+		for (i = 0; i < sh->disks; i++) {
+			int dd_idx;
+			struct stripe_head *sh2;
+			sector_t bn, sec;
+
+			if (i == sh->pd_idx)
+				continue;
+			if (conf->level == 6 && i == sh->qd_idx)
+				continue;
+
+			bn = raid5_compute_blocknr(sh, i, 1);
+			sec = raid5_compute_sector(conf, bn, 0, &dd_idx, NULL);
+			sh2 = raid5_get_active_stripe(conf, NULL, sec,
+					R5_GAS_NOBLOCK | R5_GAS_NOQUIESCE);
+			if (!sh2)
+				continue;
+
+			if (test_and_clear_bit(STRIPE_EXPANDING, &sh2->state)) {
+				atomic_dec(&conf->reshape_stripes);
+				wake_up(&conf->wait_for_reshape);
+				md_done_sync(conf->mddev,
+					     RAID5_STRIPE_SECTORS(conf));
+			}
+
+			clear_bit(STRIPE_EXPAND_READY, &sh2->state);
+
+			raid5_release_stripe(sh2);
+		}
+	}
+}
+
 static int want_replace(struct stripe_head *sh, int disk_idx)
 {
 	struct md_rdev *rdev;
 	int rv = 0;
 
@@ -5001,10 +5073,12 @@ static void handle_stripe(struct stripe_head *sh)
 		break_stripe_batch_list(sh, 0);
 		if (s.to_read+s.to_write+s.written)
 			handle_failed_stripe(conf, sh, &s, disks);
 		if (s.syncing + s.replacing)
 			handle_failed_sync(conf, sh, &s);
+		if (s.expanding + s.expanded)
+			handle_failed_reshape(conf, sh, &s);
 	}
 
 	/* Now we check to see if any write operations have recently
 	 * completed
 	 */
-- 
2.54.0


* Re: [PATCH v2] md/raid5: fix reshape deadlock while failed devices more than max degraded
From: sashiko-bot @ 2026-06-15 11:49 UTC
  To: Chen Cheng; +Cc: yukuai, linux-raid

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] Clearing STRIPE_EXPANDING prematurely in handle_failed_reshape() allows a destination stripe to be freed and reused while asynchronous DMA is still writing to its pages, leading to DMA-after-free memory corruption.
--

commit ec55b78c975b4f9f35e287b83bee72d76b92b33a
Author: Chen Cheng <chencheng@fnnas.com>

md/raid5: fix reshape deadlock while failed devices more than max degraded
    
This patch fixes a deadlock in raid5 reshape by cleaning up stripe states 
(including destination and source stripes) when failed devices exceed the 
max degraded count.

> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index d0d7f96fd7cd9..2f4d82bd7c2e6 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3730,6 +3730,78 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
>  		md_sync_error(conf->mddev);
>  }
>  
> +/*
> + * handle_failed_reshape - handle failed stripes when reshape fails and
> + *			   failed devices > max_degraded

[ ... ]

> +static void handle_failed_reshape(struct r5conf *conf, struct stripe_head *sh,
> +				  struct stripe_head_state *s)
> +{
> +	int i;

[ ... ]

> +	/* release the destination stripes that are waiting to be filled */
> +	if (test_and_clear_bit(STRIPE_EXPAND_SOURCE, &sh->state)) {
> +		for (i = 0; i < sh->disks; i++) {
> +			int dd_idx;
> +			struct stripe_head *sh2;
> +			sector_t bn, sec;
> +
> +			if (i == sh->pd_idx)
> +				continue;
> +			if (conf->level == 6 && i == sh->qd_idx)
> +				continue;
> +
> +			bn = raid5_compute_blocknr(sh, i, 1);
> +			sec = raid5_compute_sector(conf, bn, 0, &dd_idx, NULL);
> +			sh2 = raid5_get_active_stripe(conf, NULL, sec,
> +					R5_GAS_NOBLOCK | R5_GAS_NOQUIESCE);
> +			if (!sh2)
> +				continue;
> +
> +			if (test_and_clear_bit(STRIPE_EXPANDING, &sh2->state)) {

[Severity: Critical]
Could clearing STRIPE_EXPANDING prematurely here lead to a DMA-after-free?

During a reshape, a destination stripe (sh2) might require data from multiple
source stripes. STRIPE_EXPANDING normally prevents the destination stripe from
being returned to the free list before all data is copied.

If Worker 1 is handling a source stripe via handle_stripe_expansion(), it
submits an asynchronous DMA copy to sh2's pages and then calls
raid5_release_stripe(sh2) to drop its reference BEFORE waiting for the DMA to
complete:

drivers/md/raid5.c:handle_stripe_expansion() {
    ...
            raid5_release_stripe(sh2);
        }
    /* done submitting copies, wait for them to complete */
    async_tx_quiesce(&tx);
    ...
}

If the array concurrently fails, Worker 2 could enter handle_failed_reshape()
for a different source stripe, look up sh2, clear its STRIPE_EXPANDING bit,
and release it. 

If sh2's refcount drops to 0 without STRIPE_EXPANDING set, could it be
added to the inactive_list and reallocated (or have its pages freed) while
the DMA from Worker 1 is still actively writing to it?
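
The interleaving asked about above can be written down as a small
deterministic model. It is a sketch of the hypothesis only and does not
show that the race occurs in the kernel: the toy_* names and the
dma_in_flight field are hypothetical, and the release rule (a stripe whose
count reaches zero goes to the inactive list only if the EXPANDING bit is
clear) is a simplification.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy flag bit. The name mirrors STRIPE_EXPANDING; the value is arbitrary. */
#define TOY_EXPANDING (1u << 0)

/* Hypothetical stand-in for a destination stripe (sh2). */
struct toy_stripe {
	unsigned int state;
	int count;              /* reference count */
	bool on_inactive_list;  /* eligible for reuse */
	bool dma_in_flight;     /* an async copy still targets its pages */
};

void toy_get(struct toy_stripe *sh)
{
	sh->count++;
}

/* Simplified release: the EXPANDING bit keeps the stripe off the list. */
void toy_release(struct toy_stripe *sh)
{
	assert(sh->count > 0);
	if (--sh->count == 0 && !(sh->state & TOY_EXPANDING))
		sh->on_inactive_list = true;
}

/* Worker 1: submit the copy, drop the reference, and only later wait. */
void toy_worker1_submit_copy(struct toy_stripe *sh2)
{
	toy_get(sh2);
	sh2->dma_in_flight = true;
	toy_release(sh2);
}

/* Worker 2: failure cleanup clears the bit, then drops its reference. */
void toy_worker2_fail_cleanup(struct toy_stripe *sh2)
{
	toy_get(sh2);
	sh2->state &= ~TOY_EXPANDING;
	toy_release(sh2);
}

/* True while a reusable stripe is still the target of a pending copy. */
bool toy_reuse_hazard(const struct toy_stripe *sh2)
{
	return sh2->on_inactive_list && sh2->dma_in_flight;
}
```

In the model, the hazard window opens when Worker 2 finishes and closes
only when the pending copy completes, so any fix would have to either keep
the bit set or wait for the copy before the final release.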

> +				atomic_dec(&conf->reshape_stripes);
> +				wake_up(&conf->wait_for_reshape);
> +				md_done_sync(conf->mddev,
> +					     RAID5_STRIPE_SECTORS(conf));
> +			}
> +
> +			clear_bit(STRIPE_EXPAND_READY, &sh2->state);
> +
> +			raid5_release_stripe(sh2);
> +		}
> +	}
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260615113450.2088877-1-chencheng@fnnas.com?part=1
