* [PATCH v2] md/raid5: fix reshape deadlock while failed devices more than max degraded
@ 2026-06-15 11:34 Chen Cheng
From: Chen Cheng @ 2026-06-15 11:34 UTC (permalink / raw)
To: linux-raid, yukuai, yukuai; +Cc: chencheng, linux-kernel
From: Chen Cheng <chencheng@fnnas.com>
Reshape stripe lifetime:

- Start reshape ==> reshape_request():
  * Get the destination stripe:
    - if source data chunks must be copied in, set STRIPE_EXPANDING;
    - or, if the stripe covers a new region past the old end of the
      array, it is zero-filled and needs no source data, so set
      STRIPE_EXPANDING | STRIPE_EXPAND_READY.
  * Get the source stripe:
    - set STRIPE_EXPAND_SOURCE.

- Handle expand stripe ==> handle_stripe():
  Reshape uses reconstruct-write to build the destination stripe, in
  four stages:
  1. Prepare the source data chunks for the old-geometry stripe
     - fill the source stripe's data by read or compute
  2. Move data from the old-geometry source stripe to the new-geometry
     destination stripe
     - clear STRIPE_EXPAND_SOURCE on the source stripe
     - drain data from the source stripe to the destination stripe
     - mark each destination chunk R5_Expanded|R5_UPTODATE once the
       drain from the source chunk to that chunk has completed
     - once all chunks of the stripe have been drained, set
       STRIPE_EXPAND_READY
  3. Calculate the P/Q chunks for the destination stripe
     - if the destination stripe doesn't depend on a source stripe,
       STRIPE_EXPANDING can be cleared
  4. Write out to disk and release
     - set R5_Wantwrite|R5_LOCKED and write out to disk
     - if the write-out succeeds, clear STRIPE_EXPAND_READY, decrement
       reshape_stripes, and call md_done_sync() to report reshape
       progress.

When the number of failed devices exceeds max_degraded, the fix handles
the following kinds of stripe:

1. Clean up the following kinds of destination stripe:
   - new regions past the old end of the array, zero-filled in place,
     requiring no source data
     (STRIPE_EXPANDING | STRIPE_EXPAND_READY)
   - stripes whose source data chunks were already prepared but whose
     write-out failed
     (STRIPE_EXPAND_READY)
2. Destination stripes that need source data
   (STRIPE_EXPANDING, no STRIPE_HANDLE)
   - these stripes sit idle in the stripe cache and are never seen by
     handle_stripe(), so they are cleaned up indirectly when their
     source stripe (type 3) is processed.
3. Source stripes (STRIPE_EXPAND_SOURCE)
   - they reach handle_stripe() after their member disks are marked
     Faulty.
   - clear STRIPE_EXPAND_SOURCE, then find and clean up all dependent
     destination stripes that were waiting for data.
   - walk the source's data disks, compute the corresponding
     destination sector, look up the destination stripe, and clean it
     up (clear flags, decrement counters, call md_done_sync()).
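
The three cases above can be sketched as a toy userspace model. This is only an illustration under stated assumptions, not the kernel code: the `toy_*` names are hypothetical, the flag values are arbitrary, the `dests[]` array stands in for the sector computation and stripe-cache lookup, and `recovery_active` is counted in stripes rather than sectors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag names mirror the patch; the bit values here are arbitrary. */
enum {
	STRIPE_EXPANDING     = 1u << 0,
	STRIPE_EXPAND_READY  = 1u << 1,
	STRIPE_EXPAND_SOURCE = 1u << 2,
};

struct toy_stripe {
	unsigned int state;
	/* Hypothetical: destinations waiting on this source stripe. */
	struct toy_stripe *dests[4];
	size_t ndests;
};

struct toy_conf {
	int reshape_stripes;	/* models conf->reshape_stripes */
	int recovery_active;	/* models mddev->recovery_active, in stripes */
};

/* Non-atomic stand-in for test_and_clear_bit(). */
static bool toy_test_and_clear(unsigned int *state, unsigned int bit)
{
	bool was_set = (*state & bit) != 0;

	*state &= ~bit;
	return was_set;
}

/* One stripe leaves the reshape: drop both counters once. */
static void toy_account_done(struct toy_conf *conf)
{
	assert(conf->reshape_stripes > 0 && conf->recovery_active > 0);
	conf->reshape_stripes--;
	conf->recovery_active--;
}

static void toy_handle_failed_reshape(struct toy_conf *conf,
				      struct toy_stripe *sh)
{
	/* Case 1: a destination stripe seen directly. */
	bool was_expanding = toy_test_and_clear(&sh->state, STRIPE_EXPANDING);
	bool was_ready = toy_test_and_clear(&sh->state, STRIPE_EXPAND_READY);

	if (was_expanding || was_ready)
		toy_account_done(conf);

	/* Case 3: a source stripe cleans up its idle destinations (case 2). */
	if (toy_test_and_clear(&sh->state, STRIPE_EXPAND_SOURCE)) {
		for (size_t i = 0; i < sh->ndests; i++) {
			struct toy_stripe *sh2 = sh->dests[i];

			if (toy_test_and_clear(&sh2->state, STRIPE_EXPANDING))
				toy_account_done(conf);
			sh2->state &= ~STRIPE_EXPAND_READY;
		}
	}
}
```

In this model, calling `toy_handle_failed_reshape()` once for every stripe that reaches the handler brings both counters back to zero, provided each idle destination is listed under some source stripe.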

Reproducer:
- Wrap 5 disposable test disks in dm targets and create a 4-disk RAID5
  on four of them with mdadm.
- Add the 5th device as a spare and start a 4 -> 5 reshape.
- Wait until /sys/block/mdX/md/sync_action reports "reshape".
- Inject failures on two members so that the number of failed devices
  exceeds max_degraded during the reshape.
- After a few seconds, write "frozen" to /sys/block/mdX/md/sync_action.
  Before this fix, the write blocks indefinitely.

Read-error variant:
- Use dm-dust on /dev/sd[b-f].
- Preload bad blocks on two source members, e.g. dust0 and dust1:
    dmsetup message dust0 0 addbadblock <range>
    dmsetup message dust1 0 addbadblock <range>
- Start the reshape:
    mdadm -C /dev/mdX -e 1.2 -l 5 -n 4 -c 64 --assume-clean /dev/mapper/dust{0..3}
    mdadm --manage /dev/mdX --add /dev/mapper/dust4
    mdadm --grow /dev/mdX -n 5 --backup-file=/tmp/grow.backup &
- Once the reshape starts, enable the injected read failures:
    dmsetup message dust0 0 enable
    dmsetup message dust1 0 enable
- Then:
    echo frozen > /sys/block/mdX/md/sync_action
  hangs forever before the fix.

Write-error variant:
- Use dm-flakey on /dev/sd[b-f].
- Start the same 4 -> 5 reshape on flakey0..flakey4.
- Once the reshape starts, switch two members, e.g. flakey3 and
  flakey4, to error_writes.
- Then:
    echo frozen > /sys/block/mdX/md/sync_action
  hangs forever before the fix.

Before the fix, md_do_sync() exits its main loop on MD_RECOVERY_INTR but
then blocks forever, because the stripes stranded by the failure never
report completion through md_done_sync():

    wait_event(mddev->recovery_wait,
               !atomic_read(&mddev->recovery_active));

After the fix, recovery_active drains to zero and the kernel log shows:

    md/raid:md0: Cannot continue operation (2/5 failed).
    md: md0: reshape interrupted.
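
The accounting behind that wait can be pictured with a small userspace model. It is a sketch only: `toy_mddev`, `toy_issue()`, `toy_done_sync()` and `toy_wait_would_return()` are hypothetical names, and the real counter is an atomic_t that is raised as the sync loop issues work and lowered by md_done_sync().

```c
#include <assert.h>
#include <stdbool.h>

struct toy_mddev {
	long recovery_active;	/* sectors issued but not yet reported done */
};

/* Models the sync loop accounting for a chunk of work it has issued. */
static void toy_issue(struct toy_mddev *mddev, long sectors)
{
	mddev->recovery_active += sectors;
}

/* Models md_done_sync(): a stripe reports its sectors as finished. */
static void toy_done_sync(struct toy_mddev *mddev, long sectors)
{
	assert(mddev->recovery_active >= sectors);
	mddev->recovery_active -= sectors;
}

/* Models the wait_event() condition: true once everything is reported. */
static bool toy_wait_would_return(const struct toy_mddev *mddev)
{
	return mddev->recovery_active == 0;
}
```

If one issued stripe never reaches `toy_done_sync()`, `toy_wait_would_return()` stays false. That is the model's equivalent of the hang, and the reason the cleanup path has to report every stranded stripe.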

Changes v1 -> v2:
- handle the reshape write deadlock when failed devices exceed max_degraded

Signed-off-by: Chen Cheng <chencheng@fnnas.com>
---
drivers/md/raid5.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 74 insertions(+)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 65ae7d8930fc..a320b71d7117 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3728,10 +3728,82 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
if (abort)
md_sync_error(conf->mddev);
}
+/*
+ * handle_failed_reshape - handle failed stripes when reshape fails and
+ * the number of failed devices exceeds max_degraded
+ *
+ * Handles the following kinds of stripe:
+ * 1. clean up the following kinds of destination stripe:
+ *   - new regions past the old end of the array, zero-filled in place,
+ *     requiring no source data.
+ *     (STRIPE_EXPANDING | STRIPE_EXPAND_READY)
+ *   - source data chunks already prepared, but write-out failed
+ *     (STRIPE_EXPAND_READY)
+ * 2. dest stripes that need source data (STRIPE_EXPANDING, no STRIPE_HANDLE)
+ *   - these stripes sit idle in the stripe cache and are never seen
+ *     by handle_stripe(), so they are cleaned up indirectly when their
+ *     source stripe (type 3) is processed.
+ * 3. src stripes (STRIPE_EXPAND_SOURCE)
+ *   - reach handle_stripe() after their member disks are marked Faulty.
+ *   - clear STRIPE_EXPAND_SOURCE, then find and clean up all dependent
+ *     destination stripes that were waiting for data.
+ *   - walk the source's data disks, compute the corresponding destination
+ *     sector, look up the destination stripe, and clean it up (clear flags,
+ *     decrement counters, call md_done_sync())
+ */
+static void handle_failed_reshape(struct r5conf *conf, struct stripe_head *sh,
+ struct stripe_head_state *s)
+{
+ int i;
+ bool was_expanding = test_and_clear_bit(STRIPE_EXPANDING, &sh->state);
+ bool was_ready = test_and_clear_bit(STRIPE_EXPAND_READY, &sh->state);
+
+ if (was_expanding || was_ready) {
+ atomic_dec(&conf->reshape_stripes);
+ wake_up(&conf->wait_for_reshape);
+ md_done_sync(conf->mddev, RAID5_STRIPE_SECTORS(conf));
+ }
+
+ s->expanded = 0;
+ s->expanding = 0;
+
+ /* release the destination stripes that are waiting to be filled */
+ if (test_and_clear_bit(STRIPE_EXPAND_SOURCE, &sh->state)) {
+ for (i = 0; i < sh->disks; i++) {
+ int dd_idx;
+ struct stripe_head *sh2;
+ sector_t bn, sec;
+
+ if (i == sh->pd_idx)
+ continue;
+ if (conf->level == 6 && i == sh->qd_idx)
+ continue;
+
+ bn = raid5_compute_blocknr(sh, i, 1);
+ sec = raid5_compute_sector(conf, bn, 0, &dd_idx, NULL);
+ sh2 = raid5_get_active_stripe(conf, NULL, sec,
+ R5_GAS_NOBLOCK | R5_GAS_NOQUIESCE);
+ if (!sh2)
+ continue;
+
+ if (test_and_clear_bit(STRIPE_EXPANDING, &sh2->state)) {
+ atomic_dec(&conf->reshape_stripes);
+ wake_up(&conf->wait_for_reshape);
+ md_done_sync(conf->mddev,
+ RAID5_STRIPE_SECTORS(conf));
+ }
+
+ clear_bit(STRIPE_EXPAND_READY, &sh2->state);
+
+ raid5_release_stripe(sh2);
+ }
+ }
+}
+
static int want_replace(struct stripe_head *sh, int disk_idx)
{
struct md_rdev *rdev;
int rv = 0;
@@ -5001,10 +5073,12 @@ static void handle_stripe(struct stripe_head *sh)
break_stripe_batch_list(sh, 0);
if (s.to_read+s.to_write+s.written)
handle_failed_stripe(conf, sh, &s, disks);
if (s.syncing + s.replacing)
handle_failed_sync(conf, sh, &s);
+ if (s.expanding + s.expanded)
+ handle_failed_reshape(conf, sh, &s);
}
/* Now we check to see if any write operations have recently
* completed
*/
--
2.54.0
* Re: [PATCH v2] md/raid5: fix reshape deadlock while failed devices more than max degraded
@ 2026-06-15 11:49 ` sashiko-bot
From: sashiko-bot @ 2026-06-15 11:49 UTC (permalink / raw)
To: Chen Cheng; +Cc: yukuai, linux-raid
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] Clearing STRIPE_EXPANDING prematurely in handle_failed_reshape() allows a destination stripe to be freed and reused while asynchronous DMA is still writing to its pages, leading to DMA-after-free memory corruption.
--
commit ec55b78c975b4f9f35e287b83bee72d76b92b33a
Author: Chen Cheng <chencheng@fnnas.com>
md/raid5: fix reshape deadlock while failed devices more than max degraded
This patch fixes a deadlock in raid5 reshape by cleaning up stripe states
(including destination and source stripes) when failed devices exceed the
max degraded count.
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index d0d7f96fd7cd9..2f4d82bd7c2e6 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3730,6 +3730,78 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
> md_sync_error(conf->mddev);
> }
>
> +/*
> + * handle_failed_reshape - handl failed stripes when reshape failed and
> + * degraded devices >= max_degraded
[ ... ]
> +static void handle_failed_reshape(struct r5conf *conf, struct stripe_head *sh,
> + struct stripe_head_state *s)
> +{
> + int i;
[ ... ]
> + /* release the destination stripes that are waiting to be filled */
> + if (test_and_clear_bit(STRIPE_EXPAND_SOURCE, &sh->state)) {
> + for (i = 0; i < sh->disks; i++) {
> + int dd_idx;
> + struct stripe_head *sh2;
> + sector_t bn, sec;
> +
> + if (i == sh->pd_idx)
> + continue;
> + if (conf->level == 6 && i == sh->qd_idx)
> + continue;
> +
> + bn = raid5_compute_blocknr(sh, i, 1);
> + sec = raid5_compute_sector(conf, bn, 0, &dd_idx, NULL);
> + sh2 = raid5_get_active_stripe(conf, NULL, sec,
> + R5_GAS_NOBLOCK | R5_GAS_NOQUIESCE);
> + if (!sh2)
> + continue;
> +
> + if (test_and_clear_bit(STRIPE_EXPANDING, &sh2->state)) {
[Severity: Critical]
Could clearing STRIPE_EXPANDING prematurely here lead to a DMA-after-free?
During a reshape, a destination stripe (sh2) might require data from multiple
source stripes. STRIPE_EXPANDING normally prevents the destination stripe from
being returned to the free list before all data is copied.
If Worker 1 is handling a source stripe via handle_stripe_expansion(), it
submits an asynchronous DMA copy to sh2's pages and then calls
raid5_release_stripe(sh2) to drop its reference BEFORE waiting for the DMA to
complete:

drivers/md/raid5.c:handle_stripe_expansion() {
    ...
            raid5_release_stripe(sh2);
        }
    /* done submitting copies, wait for them to complete */
    async_tx_quiesce(&tx);
    ...
}

If the array concurrently fails, Worker 2 could enter handle_failed_reshape()
for a different source stripe, look up sh2, clear its STRIPE_EXPANDING bit,
and release it.
If sh2's refcount drops to 0 without STRIPE_EXPANDING set, could it be
added to the inactive_list and reallocated (or have its pages freed) while
the DMA from Worker 1 is still actively writing to it?
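
The interleaving in question can be walked through with a toy userspace model. This is a sketch of the concern as described here, not kernel code: `toy_dest`, `toy_release()` and `toy_reuse_hazard()` are hypothetical names, `copy_in_flight` stands in for the outstanding asynchronous copy, and the model assumes that a stripe whose refcount reaches zero becomes reusable only when STRIPE_EXPANDING is clear.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_dest {
	int refcount;
	bool expanding;		/* models STRIPE_EXPANDING */
	bool copy_in_flight;	/* hypothetical: async copy still targets the pages */
	bool on_inactive_list;	/* stripe may now be handed out again */
};

/* Models dropping a reference; only non-expanding stripes become reusable. */
static void toy_release(struct toy_dest *sh)
{
	assert(sh->refcount > 0);
	if (--sh->refcount == 0 && !sh->expanding)
		sh->on_inactive_list = true;
}

/* True when a reusable stripe still has a copy writing into its pages. */
static bool toy_reuse_hazard(const struct toy_dest *sh)
{
	return sh->on_inactive_list && sh->copy_in_flight;
}
```

In this model, Worker 1 releasing sh2 with `expanding` still set is harmless. The hazard appears only once Worker 2 takes a reference, clears `expanding`, and releases sh2 while `copy_in_flight` is still true.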
> + atomic_dec(&conf->reshape_stripes);
> + wake_up(&conf->wait_for_reshape);
> + md_done_sync(conf->mddev,
> + RAID5_STRIPE_SECTORS(conf));
> + }
> +
> + clear_bit(STRIPE_EXPAND_READY, &sh2->state);
> +
> + raid5_release_stripe(sh2);
> + }
> + }
> +}
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260615113450.2088877-1-chencheng@fnnas.com?part=1