[PATCH v2 0/7] md/raid10: fixes, atomic write handling, and error-path cleanup

* [PATCH v2 0/7] md/raid10: fixes, atomic write handling, and error-path cleanup
@ 2026-06-28 14:24 Abd-Alrhman Masalkhi
  2026-06-28 14:24 ` [PATCH v2 1/7] md/raid10: fix r10bio leak in raid10_write_request() error paths Abd-Alrhman Masalkhi
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: Abd-Alrhman Masalkhi @ 2026-06-28 14:24 UTC (permalink / raw)
  To: song, yukuai, magiclinan, xiao, axboe, vverma, john.g.garry,
	martin.petersen, abd.masalkhi, linux-kernel
  Cc: linux-raid

Hi,

This v2 of the series contains a mix of bug fixes and cleanups for RAID10,
along with a related atomic write fix for RAID1.

Changes in v2:
 - Expand the commit message to explain why the
   allow_barrier()/wait_barrier() pair is no longer needed.
 - Drop the early atomic write split check from raid1_write_request().
 - Advertise the atomic write size limit via queue limits.
 - Disable write-behind instead of failing atomic writes when the
   BIO_MAX_VECS limit is encountered.
 - Drop the early atomic write split check from raid10_write_request()
   and rely on queue limits instead.
 - Fix a compilation error (bi -> bio).
 - Link to v1: https://lore.kernel.org/linux-raid/20260623072456.333437-1-abd.masalkhi@gmail.com/

Thanks,
Abd-Alrhman

Abd-Alrhman Masalkhi (7):
  md/raid10: fix r10bio leak in raid10_write_request() error paths
  md/raid1: advertise atomic write limits and handle runtime constraints
  md/raid10: consistently fail atomic writes that require splitting
  md/raid10: remove unnecessary barrier around bio_submit_split_bioset()
  md/raid10: replace wait loop with wait_event_idle()
  md/raid10: simplify write request error handling
  md/raid10: simplify read request error handling

 drivers/md/raid1.c  |  36 +++++++-------
 drivers/md/raid10.c | 118 +++++++++++++++++++++-----------------------
 2 files changed, 74 insertions(+), 80 deletions(-)

-- 
2.43.0



* [PATCH v2 1/7] md/raid10: fix r10bio leak in raid10_write_request() error paths
  2026-06-28 14:24 [PATCH v2 0/7] md/raid10: fixes, atomic write handling, and error-path cleanup Abd-Alrhman Masalkhi
@ 2026-06-28 14:24 ` Abd-Alrhman Masalkhi
  2026-06-28 14:39   ` sashiko-bot
  2026-06-28 14:24 ` [PATCH v2 2/7] md/raid1: advertise atomic write limits and handle runtime constraints Abd-Alrhman Masalkhi
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Abd-Alrhman Masalkhi @ 2026-06-28 14:24 UTC (permalink / raw)
  To: song, yukuai, magiclinan, xiao, axboe, vverma, john.g.garry,
	martin.petersen, abd.masalkhi, linux-kernel
  Cc: linux-raid, sashiko-bot

When raid10_write_request() fails because REQ_NOWAIT is set, the
allocated r10_bio is not freed before returning, resulting in a memory
leak. Free r10_bio before returning from the REQ_NOWAIT error paths.

Fixes: c9aa889b035f ("md: raid10 add nowait support")
Reported-by: sashiko-bot <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/linux-raid/20260613184042.BCEC01F000E9@smtp.kernel.org/
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
---
Changes in v2:
 - No changes.
 - Link to v1: https://lore.kernel.org/linux-raid/20260623072456.333437-2-abd.masalkhi@gmail.com/
---
 drivers/md/raid10.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 0a3cfdd3f5df..bd322eccdc3f 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1365,6 +1365,7 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
 		/* Bail out if REQ_NOWAIT is set for the bio */
 		if (bio->bi_opf & REQ_NOWAIT) {
 			bio_wouldblock_error(bio);
+			free_r10bio(r10_bio);
 			return false;
 		}
 		for (;;) {
@@ -1398,6 +1399,7 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
 		if (bio->bi_opf & REQ_NOWAIT) {
 			allow_barrier(conf);
 			bio_wouldblock_error(bio);
+			free_r10bio(r10_bio);
 			return false;
 		}
 		mddev_add_trace_msg(conf->mddev,
-- 
2.43.0
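
The leak fixed above can be modelled in userspace. The following is a
minimal sketch, not kernel code: struct r10bio_model, live_r10bios and
all three functions are hypothetical stand-ins, and the only point is
that the early-return path must release the object it was handed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct r10bio; it only tracks liveness. */
struct r10bio_model {
	int sectors;
};

/* Number of objects allocated but not yet freed. */
static int live_r10bios;

struct r10bio_model *alloc_r10bio_model(int sectors)
{
	struct r10bio_model *r = malloc(sizeof(*r));

	if (!r)
		return NULL;
	r->sectors = sectors;
	live_r10bios++;
	return r;
}

void free_r10bio_model(struct r10bio_model *r)
{
	assert(live_r10bios > 0);
	free(r);
	live_r10bios--;
}

/*
 * Models the REQ_NOWAIT bail-out. When the request would block and the
 * caller asked not to wait, the function returns false, and it must
 * free r first or the object leaks.
 */
bool write_request_model(struct r10bio_model *r, bool nowait,
			 bool would_block)
{
	if (would_block && nowait) {
		/* The kernel code calls bio_wouldblock_error(bio) here. */
		free_r10bio_model(r);	/* the kind of line this patch adds */
		return false;
	}

	/*
	 * Simplification: a blocking request would wait here. In the
	 * kernel the r10_bio is freed later, on I/O completion.
	 */
	free_r10bio_model(r);
	return true;
}
```

After a failed call such as write_request_model(r, true, true), the
counter live_r10bios should be back at zero. Without the free in the
bail-out branch it would stay at one.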



* [PATCH v2 2/7] md/raid1: advertise atomic write limits and handle runtime constraints
  2026-06-28 14:24 [PATCH v2 0/7] md/raid10: fixes, atomic write handling, and error-path cleanup Abd-Alrhman Masalkhi
  2026-06-28 14:24 ` [PATCH v2 1/7] md/raid10: fix r10bio leak in raid10_write_request() error paths Abd-Alrhman Masalkhi
@ 2026-06-28 14:24 ` Abd-Alrhman Masalkhi
  2026-06-28 14:38   ` sashiko-bot
  2026-06-28 14:24 ` [PATCH v2 3/7] md/raid10: consistently fail atomic writes that require splitting Abd-Alrhman Masalkhi
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Abd-Alrhman Masalkhi @ 2026-06-28 14:24 UTC (permalink / raw)
  To: song, yukuai, magiclinan, xiao, axboe, vverma, john.g.garry,
	martin.petersen, abd.masalkhi, linux-kernel
  Cc: linux-raid

Atomic writes in RAID1 must fit within a single barrier unit. Advertise
this restriction through the queue limits by setting
atomic_write_hw_unit_max to BARRIER_UNIT_SECTOR_SIZE so that bios which
would cross a barrier-unit boundary are rejected by the block layer
before reaching MD.

A bio that passes block-layer validation may still become unserviceable
within RAID1 due to bad blocks or write-behind constraints. In the former
case, complete the bio with EIO. In the latter case, disable
write-behind rather than failing the bio with EIO.

Fixes: f2a38abf5f1c ("md/raid1: Atomic write support")
Fixes: a4c55c902670 ("md/raid1: simplify raid1_write_request() error handling")
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
---
Changes in v2:
 - Drop the early atomic write split check from raid1_write_request().
 - Advertise the atomic write size limit via queue limits.
 - Disable write-behind instead of failing atomic writes when the
   BIO_MAX_VECS limit is encountered.
 - Link to v1: https://lore.kernel.org/linux-raid/20260623072456.333437-3-abd.masalkhi@gmail.com/
---
 drivers/md/raid1.c | 36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index afe2ca96ad8c..f322048ab3c2 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1522,6 +1522,7 @@ static bool raid1_write_request(struct mddev *mddev, struct bio *bio,
 	int first_clone;
 	bool write_behind = false;
 	bool nowait = bio->bi_opf & REQ_NOWAIT;
+	bool atomic = bio->bi_opf & REQ_ATOMIC;
 	bool is_discard = op_is_discard(bio->bi_opf);
 	sector_t sector = bio->bi_iter.bi_sector;
 
@@ -1603,20 +1604,6 @@ static bool raid1_write_request(struct mddev *mddev, struct bio *bio,
 			}
 			if (is_bad) {
 				int good_sectors;
-
-				/*
-				 * We cannot atomically write this, so just
-				 * error in that case. It could be possible to
-				 * atomically write other mirrors, but the
-				 * complexity of supporting that is not worth
-				 * the benefit.
-				 */
-				if (bio->bi_opf & REQ_ATOMIC) {
-					bio->bi_status = BLK_STS_NOTSUPP;
-					bio_endio(bio);
-					goto err_dec_pending;
-				}
-
 				good_sectors = first_bad - sector;
 				if (good_sectors < max_sectors)
 					max_sectors = good_sectors;
@@ -1633,10 +1620,24 @@ static bool raid1_write_request(struct mddev *mddev, struct bio *bio,
 	 * at a time and thus needs a new bio that can fit the whole payload
 	 * this bio in page sized chunks.
 	 */
-	if (write_behind && mddev->bitmap)
-		max_sectors = min_t(int, max_sectors,
-				    BIO_MAX_VECS * (PAGE_SIZE >> 9));
+	if (write_behind && mddev->bitmap) {
+		if (atomic && max_sectors > BIO_MAX_VECS * (PAGE_SIZE >> 9))
+			/*
+			 * Atomic writes cannot be split, so disable
+			 * write-behind.
+			 */
+			write_behind = false;
+		else
+			max_sectors = min_t(int, max_sectors,
+					    BIO_MAX_VECS * (PAGE_SIZE >> 9));
+	}
+
 	if (max_sectors < bio_sectors(bio)) {
+		if (atomic) {
+			bio_io_error(bio);
+			goto err_dec_pending;
+		}
+
 		bio = bio_submit_split_bioset(bio, max_sectors,
 					      &conf->bio_split);
 		if (!bio)
@@ -3229,6 +3230,7 @@ static int raid1_set_limits(struct mddev *mddev)
 	lim.max_write_zeroes_sectors = 0;
 	lim.max_hw_wzeroes_unmap_sectors = 0;
 	lim.logical_block_size = mddev->logical_block_size;
+	lim.atomic_write_hw_unit_max = BARRIER_UNIT_SECTOR_SIZE;
 	lim.features |= BLK_FEAT_ATOMIC_WRITES;
 	lim.features |= BLK_FEAT_PCI_P2PDMA;
 	err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
-- 
2.43.0
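
The runtime decision described in this patch can be sketched as a pure
function. This is a userspace model under stated assumptions: the
MODEL_* constants assume BIO_MAX_VECS is 256 and a 4 KiB page size, and
plan_write(), struct write_plan and enum write_outcome are hypothetical
names that do not exist in raid1.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration only. */
#define MODEL_BIO_MAX_VECS	256
#define MODEL_PAGE_SECTORS	(4096 >> 9)	/* PAGE_SIZE >> 9, 4 KiB pages */
#define MODEL_WB_LIMIT		(MODEL_BIO_MAX_VECS * MODEL_PAGE_SECTORS)

enum write_outcome {
	WRITE_WHOLE,	/* bio is written as a single unit */
	WRITE_SPLIT,	/* bio is split at max_sectors */
	WRITE_EIO,	/* atomic bio would need a split: fail it */
};

struct write_plan {
	enum write_outcome outcome;
	bool write_behind;
	int max_sectors;
};

/*
 * max_sectors is the writable range after bad-block handling.
 * write_behind says whether write-behind was selected for this bio.
 */
struct write_plan plan_write(int bio_sectors, int max_sectors,
			     bool atomic, bool write_behind)
{
	struct write_plan p;

	assert(bio_sectors > 0);

	if (write_behind) {
		if (atomic && max_sectors > MODEL_WB_LIMIT)
			/* Atomic writes cannot be split: drop write-behind. */
			write_behind = false;
		else if (max_sectors > MODEL_WB_LIMIT)
			max_sectors = MODEL_WB_LIMIT;
	}

	p.write_behind = write_behind;
	p.max_sectors = max_sectors;
	if (max_sectors < bio_sectors)
		p.outcome = atomic ? WRITE_EIO : WRITE_SPLIT;
	else
		p.outcome = WRITE_WHOLE;
	return p;
}
```

In this model a large atomic bio keeps its full range and loses
write-behind, a large non-atomic bio is clamped and split, and an atomic
bio whose range was shortened by bad blocks ends in WRITE_EIO.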



* [PATCH v2 3/7] md/raid10: consistently fail atomic writes that require splitting
  2026-06-28 14:24 [PATCH v2 0/7] md/raid10: fixes, atomic write handling, and error-path cleanup Abd-Alrhman Masalkhi
  2026-06-28 14:24 ` [PATCH v2 1/7] md/raid10: fix r10bio leak in raid10_write_request() error paths Abd-Alrhman Masalkhi
  2026-06-28 14:24 ` [PATCH v2 2/7] md/raid1: advertise atomic write limits and handle runtime constraints Abd-Alrhman Masalkhi
@ 2026-06-28 14:24 ` Abd-Alrhman Masalkhi
  2026-06-28 14:36   ` sashiko-bot
  2026-06-28 14:24 ` [PATCH v2 4/7] md/raid10: remove unnecessary barrier around bio_submit_split_bioset() Abd-Alrhman Masalkhi
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Abd-Alrhman Masalkhi @ 2026-06-28 14:24 UTC (permalink / raw)
  To: song, yukuai, magiclinan, xiao, axboe, vverma, john.g.garry,
	martin.petersen, abd.masalkhi, linux-kernel
  Cc: linux-raid

RAID10 currently handles one badblock path explicitly by failing atomic
writes with EIO. However, another badblock path can also reduce the
writable range and force the bio through bio_submit_split_bioset(),
which implicitly completes the bio with EINVAL.

Fix this by handling atomic writes in the common split check. If RAID10
determines that an atomic write would require splitting, complete the
bio with EIO.

Fixes: a1d9b4fd42d9 ("md/raid10: Atomic write support")
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
---
Changes in v2:
 - Drop the early atomic write split check from raid10_write_request()
   and rely on queue limits instead.
 - Link to v1: https://lore.kernel.org/linux-raid/20260623072456.333437-4-abd.masalkhi@gmail.com/
---
 drivers/md/raid10.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index bd322eccdc3f..3480fc7907f0 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1356,6 +1356,7 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
 	int i, k;
 	sector_t sectors;
 	int max_sectors;
+	bool atomic = bio->bi_opf & REQ_ATOMIC;
 
 	if ((mddev_is_clustered(mddev) &&
 	     mddev->cluster_ops->area_resyncing(mddev, WRITE,
@@ -1464,16 +1465,6 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
 			if (is_bad) {
 				int good_sectors;
 
-				/*
-				 * We cannot atomically write this, so just
-				 * error in that case. It could be possible to
-				 * atomically write other mirrors, but the
-				 * complexity of supporting that is not worth
-				 * the benefit.
-				 */
-				if (bio->bi_opf & REQ_ATOMIC)
-					goto err_handle;
-
 				good_sectors = first_bad - dev_sector;
 				if (good_sectors < max_sectors)
 					max_sectors = good_sectors;
@@ -1493,6 +1484,9 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
 		r10_bio->sectors = max_sectors;
 
 	if (r10_bio->sectors < bio_sectors(bio)) {
+		if (atomic)
+			goto err_handle;
+
 		allow_barrier(conf);
 		bio = bio_submit_split_bioset(bio, r10_bio->sectors,
 					      &conf->bio_split);
-- 
2.43.0
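
The common split check can be illustrated with a small userspace model.
It is a sketch only: struct mirror_model and writable_sectors() are
hypothetical, and it assumes every bad range starts strictly after the
write's first sector on that mirror, which is the case where the kernel
shortens the write instead of skipping the device.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-mirror view of one write. */
struct mirror_model {
	long dev_sector;	/* first sector of the write on this mirror */
	long first_bad;		/* first bad sector in range, or -1 if none */
};

/*
 * Returns the number of sectors that can be written to all mirrors, or
 * -EIO when an atomic write would have to be shortened (split).
 */
int writable_sectors(const struct mirror_model *m, int copies,
		     int bio_sectors, bool atomic)
{
	int max_sectors = bio_sectors;
	int i;

	for (i = 0; i < copies; i++) {
		long good_sectors;

		if (m[i].first_bad < 0)
			continue;

		/* Model assumption: the bad range starts after dev_sector. */
		assert(m[i].first_bad > m[i].dev_sector);

		good_sectors = m[i].first_bad - m[i].dev_sector;
		if (good_sectors < max_sectors)
			max_sectors = (int)good_sectors;
	}

	/* One check covers every path that shortened the range. */
	if (max_sectors < bio_sectors && atomic)
		return -EIO;

	return max_sectors;
}
```

With one clean mirror and one mirror whose bad range starts four sectors
into an eight-sector write, the model returns 4 for a regular write and
-EIO for an atomic one.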



* [PATCH v2 4/7] md/raid10: remove unnecessary barrier around bio_submit_split_bioset()
  2026-06-28 14:24 [PATCH v2 0/7] md/raid10: fixes, atomic write handling, and error-path cleanup Abd-Alrhman Masalkhi
                   ` (2 preceding siblings ...)
  2026-06-28 14:24 ` [PATCH v2 3/7] md/raid10: consistently fail atomic writes that require splitting Abd-Alrhman Masalkhi
@ 2026-06-28 14:24 ` Abd-Alrhman Masalkhi
  2026-06-28 14:24 ` [PATCH v2 5/7] md/raid10: replace wait loop with wait_event_idle() Abd-Alrhman Masalkhi
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Abd-Alrhman Masalkhi @ 2026-06-28 14:24 UTC (permalink / raw)
  To: song, yukuai, magiclinan, xiao, axboe, vverma, john.g.garry,
	martin.petersen, abd.masalkhi, linux-kernel
  Cc: linux-raid

raid10_write_request() drops the barrier before calling
bio_submit_split_bioset() and reacquires it afterwards. This is no
longer necessary because the split bio cannot re-enter
raid10_write_request() while the barrier is held.

The allow_barrier()/wait_barrier() pair was introduced by commit
e820d55cb99d ("md: fix raid10 hang issue caused by barrier") when
submit_flushes() called md_handle_request() directly, allowing re-entry
into raid10_write_request(). Since v5.2, submit_flushes() has instead
gone through submit_bio(), eliminating that recursion. submit_flushes()
was later removed entirely by commit b75197e86e6d ("md: Remove flush
handling").

Currently, raid10_write_request() is only entered from the bio
submission path, so the split bio submitted by bio_submit_split_bioset()
cannot recurse back into wait_barrier().

Remove the redundant allow_barrier()/wait_barrier() pair around
bio_submit_split_bioset().

Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
---
Changes in v2:
 - Expand the commit message to explain why the
   allow_barrier()/wait_barrier() pair is no longer needed.
 - Link to v1: https://lore.kernel.org/linux-raid/20260623072456.333437-5-abd.masalkhi@gmail.com/
---
 drivers/md/raid10.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3480fc7907f0..2574f60dd771 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1487,10 +1487,8 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
 		if (atomic)
 			goto err_handle;

-		allow_barrier(conf);
 		bio = bio_submit_split_bioset(bio, r10_bio->sectors,
 					      &conf->bio_split);
-		wait_barrier(conf, false);
 		if (!bio) {
 			set_bit(R10BIO_Returned, &r10_bio->state);
 			goto err_handle;
-- 
2.43.0


* [PATCH v2 5/7] md/raid10: replace wait loop with wait_event_idle()
  2026-06-28 14:24 [PATCH v2 0/7] md/raid10: fixes, atomic write handling, and error-path cleanup Abd-Alrhman Masalkhi
                   ` (3 preceding siblings ...)
  2026-06-28 14:24 ` [PATCH v2 4/7] md/raid10: remove unnecessary barrier around bio_submit_split_bioset() Abd-Alrhman Masalkhi
@ 2026-06-28 14:24 ` Abd-Alrhman Masalkhi
  2026-06-28 14:24 ` [PATCH v2 6/7] md/raid10: simplify write request error handling Abd-Alrhman Masalkhi
  2026-06-28 14:24 ` [PATCH v2 7/7] md/raid10: simplify read " Abd-Alrhman Masalkhi
  6 siblings, 0 replies; 12+ messages in thread
From: Abd-Alrhman Masalkhi @ 2026-06-28 14:24 UTC (permalink / raw)
  To: song, yukuai, magiclinan, xiao, axboe, vverma, john.g.garry,
	martin.petersen, abd.masalkhi, linux-kernel
  Cc: linux-raid

The open-coded wait loop is equivalent to wait_event_idle(). Use
wait_event_idle() instead to improve readability.

Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
---
Changes in v2:
 - No changes.
 - Link to v1: https://lore.kernel.org/linux-raid/20260623072456.333437-6-abd.masalkhi@gmail.com/
---
 drivers/md/raid10.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 2574f60dd771..57813f249578 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1362,22 +1362,17 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
 	     mddev->cluster_ops->area_resyncing(mddev, WRITE,
 						bio->bi_iter.bi_sector,
 						bio_end_sector(bio)))) {
-		DEFINE_WAIT(w);
 		/* Bail out if REQ_NOWAIT is set for the bio */
 		if (bio->bi_opf & REQ_NOWAIT) {
 			bio_wouldblock_error(bio);
 			free_r10bio(r10_bio);
 			return false;
 		}
-		for (;;) {
-			prepare_to_wait(&conf->wait_barrier,
-					&w, TASK_IDLE);
-			if (!mddev->cluster_ops->area_resyncing(mddev, WRITE,
-				 bio->bi_iter.bi_sector, bio_end_sector(bio)))
-				break;
-			schedule();
-		}
-		finish_wait(&conf->wait_barrier, &w);
+
+		wait_event_idle(conf->wait_barrier,
+				!mddev->cluster_ops->area_resyncing(mddev, WRITE,
+								    bio->bi_iter.bi_sector,
+								    bio_end_sector(bio)));
 	}
 
 	sectors = r10_bio->sectors;
-- 
2.43.0



* [PATCH v2 6/7] md/raid10: simplify write request error handling
  2026-06-28 14:24 [PATCH v2 0/7] md/raid10: fixes, atomic write handling, and error-path cleanup Abd-Alrhman Masalkhi
                   ` (4 preceding siblings ...)
  2026-06-28 14:24 ` [PATCH v2 5/7] md/raid10: replace wait loop with wait_event_idle() Abd-Alrhman Masalkhi
@ 2026-06-28 14:24 ` Abd-Alrhman Masalkhi
  2026-06-28 14:24 ` [PATCH v2 7/7] md/raid10: simplify read " Abd-Alrhman Masalkhi
  6 siblings, 0 replies; 12+ messages in thread
From: Abd-Alrhman Masalkhi @ 2026-06-28 14:24 UTC (permalink / raw)
  To: song, yukuai, magiclinan, xiao, axboe, vverma, john.g.garry,
	martin.petersen, abd.masalkhi, linux-kernel
  Cc: linux-raid

raid10_write_request() currently handles bio completion, barrier
handling, and r10_bio lifetime management in several different error
paths. This results in duplicated cleanup logic and increases the risk
of introducing bugs in future modifications.

Move bio_wouldblock_error() handling to the callers of
regular_request_wait(), consolidate the write error paths, and free
r10_bio from a single location in __make_request() when
raid10_write_request() fails.

Also remove the redundant local copies of r10_bio->sectors and use a
single max_sectors variable throughout the function.

Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
---
Changes in v2:
 - No changes.
 - Link to v1: https://lore.kernel.org/linux-raid/20260623072456.333437-7-abd.masalkhi@gmail.com/
---
 drivers/md/raid10.c | 58 ++++++++++++++++++++++-----------------------
 1 file changed, 28 insertions(+), 30 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 57813f249578..d94c1f28a6f6 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1123,18 +1123,16 @@ static bool regular_request_wait(struct mddev *mddev, struct r10conf *conf,
 				 struct bio *bio, sector_t sectors)
 {
 	/* Bail out if REQ_NOWAIT is set for the bio */
-	if (!wait_barrier(conf, bio->bi_opf & REQ_NOWAIT)) {
-		bio_wouldblock_error(bio);
+	if (!wait_barrier(conf, bio->bi_opf & REQ_NOWAIT))
 		return false;
-	}
+
 	while (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
 	    bio->bi_iter.bi_sector < conf->reshape_progress &&
 	    bio->bi_iter.bi_sector + sectors > conf->reshape_progress) {
 		allow_barrier(conf);
-		if (bio->bi_opf & REQ_NOWAIT) {
-			bio_wouldblock_error(bio);
+		if (bio->bi_opf & REQ_NOWAIT)
 			return false;
-		}
+
 		mddev_add_trace_msg(conf->mddev, "raid10 wait reshape");
 		wait_event(conf->wait_barrier,
 			   conf->reshape_progress <= bio->bi_iter.bi_sector ||
@@ -1192,6 +1190,7 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 	}
 
 	if (!regular_request_wait(mddev, conf, bio, r10_bio->sectors)) {
+		bio_wouldblock_error(bio);
 		free_r10bio(r10_bio);
 		return;
 	}
@@ -1354,8 +1353,8 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
 {
 	struct r10conf *conf = mddev->private;
 	int i, k;
-	sector_t sectors;
-	int max_sectors;
+	int max_sectors = r10_bio->sectors;
+	bool nowait = bio->bi_opf & REQ_NOWAIT;
 	bool atomic = bio->bi_opf & REQ_ATOMIC;
 
 	if ((mddev_is_clustered(mddev) &&
@@ -1363,9 +1362,8 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
 						bio->bi_iter.bi_sector,
 						bio_end_sector(bio)))) {
 		/* Bail out if REQ_NOWAIT is set for the bio */
-		if (bio->bi_opf & REQ_NOWAIT) {
+		if (nowait) {
 			bio_wouldblock_error(bio);
-			free_r10bio(r10_bio);
 			return false;
 		}
 
@@ -1375,28 +1373,25 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
 								    bio_end_sector(bio)));
 	}
 
-	sectors = r10_bio->sectors;
-	if (!regular_request_wait(mddev, conf, bio, sectors)) {
-		free_r10bio(r10_bio);
+	if (!regular_request_wait(mddev, conf, bio, max_sectors)) {
+		bio_wouldblock_error(bio);
 		return false;
 	}
 
 	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
 	    (mddev->reshape_backwards
 	     ? (bio->bi_iter.bi_sector < conf->reshape_safe &&
-		bio->bi_iter.bi_sector + sectors > conf->reshape_progress)
-	     : (bio->bi_iter.bi_sector + sectors > conf->reshape_safe &&
+		bio->bi_iter.bi_sector + max_sectors > conf->reshape_progress)
+	     : (bio->bi_iter.bi_sector + max_sectors > conf->reshape_safe &&
 		bio->bi_iter.bi_sector < conf->reshape_progress))) {
 		/* Need to update reshape_position in metadata */
 		mddev->reshape_position = conf->reshape_progress;
 		set_mask_bits(&mddev->sb_flags, 0,
 			      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
 		md_wakeup_thread(mddev->thread);
-		if (bio->bi_opf & REQ_NOWAIT) {
-			allow_barrier(conf);
+		if (nowait) {
 			bio_wouldblock_error(bio);
-			free_r10bio(r10_bio);
-			return false;
+			goto err_allow_barrier;
 		}
 		mddev_add_trace_msg(conf->mddev,
 			"raid10 wait reshape metadata");
@@ -1421,8 +1416,6 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
 
 	wait_blocked_dev(mddev, r10_bio);
 
-	max_sectors = r10_bio->sectors;
-
 	for (i = 0;  i < conf->copies; i++) {
 		int d = r10_bio->devs[i].devnum;
 		struct md_rdev *rdev, *rrdev;
@@ -1479,15 +1472,15 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
 		r10_bio->sectors = max_sectors;
 
 	if (r10_bio->sectors < bio_sectors(bio)) {
-		if (atomic)
-			goto err_handle;
+		if (atomic) {
+			bio_io_error(bio);
+			goto err_dec_pending;
+		}
 
 		bio = bio_submit_split_bioset(bio, r10_bio->sectors,
 					      &conf->bio_split);
-		if (!bio) {
-			set_bit(R10BIO_Returned, &r10_bio->state);
-			goto err_handle;
-		}
+		if (!bio)
+			goto err_dec_pending;
 
 		r10_bio->master_bio = bio;
 	}
@@ -1505,7 +1498,7 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
 	one_write_done(r10_bio);
 	return true;
 
-err_handle:
+err_dec_pending:
 	for (k = 0;  k < i; k++) {
 		int d = r10_bio->devs[k].devnum;
 		struct md_rdev *rdev = conf->mirrors[d].rdev;
@@ -1521,7 +1514,9 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
 		}
 	}
 
-	raid_end_bio_io(r10_bio);
+err_allow_barrier:
+	allow_barrier(conf);
+
 	return false;
 }
 
@@ -1546,8 +1541,11 @@ static bool __make_request(struct mddev *mddev, struct bio *bio, int sectors)
 	ret = true;
 	if (bio_data_dir(bio) == READ)
 		raid10_read_request(mddev, bio, r10_bio);
-	else
+	else {
 		ret = raid10_write_request(mddev, bio, r10_bio);
+		if (!ret)
+			free_r10bio(r10_bio);
+	}
 
 	return ret;
 }
-- 
2.43.0
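
The consolidated error handling follows the usual layered goto unwind,
with the allocation released by the caller. Below is a minimal userspace
sketch of that shape. struct unwind_model, enum fail_point and both
functions are hypothetical; the counters only stand in for the barrier,
the per-rdev nr_pending references and the r10_bio allocation.

```c
#include <assert.h>
#include <stdbool.h>

struct unwind_model {
	int barrier_held;	/* wait_barrier()/allow_barrier() balance */
	int nr_pending;		/* references taken on member devices */
	int r10bio_live;	/* outstanding r10_bio allocations */
};

enum fail_point {
	FAIL_NONE,
	FAIL_BEFORE_BARRIER,	/* e.g. REQ_NOWAIT and the barrier is raised */
	FAIL_AFTER_BARRIER,	/* e.g. REQ_NOWAIT during a reshape */
	FAIL_AFTER_PENDING,	/* e.g. an atomic write that needs a split */
};

/* Callee: undoes only what it acquired itself, in reverse order. */
bool write_request_unwind(struct unwind_model *s, enum fail_point fail)
{
	if (fail == FAIL_BEFORE_BARRIER)
		return false;		/* nothing acquired yet */

	s->barrier_held++;
	if (fail == FAIL_AFTER_BARRIER)
		goto err_allow_barrier;

	s->nr_pending += 2;		/* model: references on two mirrors */
	if (fail == FAIL_AFTER_PENDING)
		goto err_dec_pending;

	/* Success: the completion path drops these later. */
	return true;

err_dec_pending:
	s->nr_pending -= 2;
err_allow_barrier:
	s->barrier_held--;
	assert(s->barrier_held >= 0);
	return false;
}

/* Caller: allocates, and is the single place that frees on failure. */
bool make_request_model(struct unwind_model *s, enum fail_point fail)
{
	bool ret;

	s->r10bio_live++;
	ret = write_request_unwind(s, fail);
	if (!ret)
		s->r10bio_live--;
	return ret;
}
```

Whatever the failure point, all three counters return to their starting
values after a failed make_request_model() call. That balance is the
property the single free site and the two labels are meant to preserve.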



* [PATCH v2 7/7] md/raid10: simplify read request error handling
  2026-06-28 14:24 [PATCH v2 0/7] md/raid10: fixes, atomic write handling, and error-path cleanup Abd-Alrhman Masalkhi
                   ` (5 preceding siblings ...)
  2026-06-28 14:24 ` [PATCH v2 6/7] md/raid10: simplify write request error handling Abd-Alrhman Masalkhi
@ 2026-06-28 14:24 ` Abd-Alrhman Masalkhi
  6 siblings, 0 replies; 12+ messages in thread
From: Abd-Alrhman Masalkhi @ 2026-06-28 14:24 UTC (permalink / raw)
  To: song, yukuai, magiclinan, xiao, axboe, vverma, john.g.garry,
	martin.petersen, abd.masalkhi, linux-kernel
  Cc: linux-raid

raid10_read_request() currently handles bio completion, barrier
handling, and r10_bio lifetime management in several different error
paths. This results in duplicated cleanup logic and increases the risk
of introducing bugs in future modifications.

Make raid10_read_request() return a status to its callers, consolidate
the read error paths, and free r10_bio from a single location in the
callers. Since the callers allocate r10_bio, they should also be
responsible for freeing it when the request fails.

This makes the read path follow the same ownership model as the write
path and simplifies the error handling flow.

Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
---
Changes in v2:
 - Fix a compilation error (bi -> bio).
 - Link to v1: https://lore.kernel.org/linux-raid/20260623072456.333437-8-abd.masalkhi@gmail.com/
---
 drivers/md/raid10.c | 45 +++++++++++++++++++++++++--------------------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d94c1f28a6f6..01162c483644 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1143,7 +1143,7 @@ static bool regular_request_wait(struct mddev *mddev, struct r10conf *conf,
 	return true;
 }
 
-static void raid10_read_request(struct mddev *mddev, struct bio *bio,
+static bool raid10_read_request(struct mddev *mddev, struct bio *bio,
 				struct r10bio *r10_bio)
 {
 	struct r10conf *conf = mddev->private;
@@ -1191,8 +1191,7 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 
 	if (!regular_request_wait(mddev, conf, bio, r10_bio->sectors)) {
 		bio_wouldblock_error(bio);
-		free_r10bio(r10_bio);
-		return;
+		return false;
 	}
 
 	rdev = read_balance(conf, r10_bio, &max_sectors);
@@ -1202,8 +1201,8 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 					    mdname(mddev), b,
 					    (unsigned long long)r10_bio->sector);
 		}
-		raid_end_bio_io(r10_bio);
-		return;
+		bio_io_error(bio);
+		goto err_allow_barrier;
 	}
 	if (err_rdev)
 		pr_err_ratelimited("md/raid10:%s: %pg: redirecting sector %llu to another mirror\n",
@@ -1215,10 +1214,8 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 		bio = bio_submit_split_bioset(bio, max_sectors,
 					      &conf->bio_split);
 		wait_barrier(conf, false);
-		if (!bio) {
-			set_bit(R10BIO_Returned, &r10_bio->state);
-			goto err_handle;
-		}
+		if (!bio)
+			goto err_dec_pending;
 
 		r10_bio->master_bio = bio;
 		r10_bio->sectors = max_sectors;
@@ -1244,10 +1241,16 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
 	read_bio->bi_private = r10_bio;
 	mddev_trace_remap(mddev, read_bio, r10_bio->sector);
 	submit_bio_noacct(read_bio);
-	return;
-err_handle:
+
+	return true;
+
+err_dec_pending:
 	atomic_dec(&rdev->nr_pending);
-	raid_end_bio_io(r10_bio);
+
+err_allow_barrier:
+	allow_barrier(conf);
+
+	return false;
 }
 
 static void raid10_write_one_disk(struct mddev *mddev, struct r10bio *r10_bio,
@@ -1538,14 +1541,13 @@ static bool __make_request(struct mddev *mddev, struct bio *bio, int sectors)
 	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) *
 			conf->geo.raid_disks);
 
-	ret = true;
 	if (bio_data_dir(bio) == READ)
-		raid10_read_request(mddev, bio, r10_bio);
-	else {
+		ret = raid10_read_request(mddev, bio, r10_bio);
+	else
 		ret = raid10_write_request(mddev, bio, r10_bio);
-		if (!ret)
-			free_r10bio(r10_bio);
-	}
+
+	if (!ret)
+		free_r10bio(r10_bio);
 
 	return ret;
 }
@@ -1875,6 +1877,7 @@ static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
 	sector_t chunk_mask = (conf->geo.chunk_mask & conf->prev.chunk_mask);
 	int chunk_sects = chunk_mask + 1;
 	int sectors = bio_sectors(bio);
+	bool write = bio_data_dir(bio) == WRITE;
 
 	if (unlikely(bio->bi_opf & REQ_PREFLUSH)
 	    && md_flush_request(mddev, bio))
@@ -1898,7 +1901,7 @@ static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
 		sectors = chunk_sects -
 			(bio->bi_iter.bi_sector &
 			 (chunk_sects - 1));
-	if (!__make_request(mddev, bio, sectors))
+	if (!__make_request(mddev, bio, sectors) && write)
 		md_write_end(mddev);
 
 	/* In case raid10d snuck in to freeze_array */
@@ -2866,7 +2869,9 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 
 	rdev_dec_pending(rdev, mddev);
 	r10_bio->state = 0;
-	raid10_read_request(mddev, r10_bio->master_bio, r10_bio);
+	if (!raid10_read_request(mddev, r10_bio->master_bio, r10_bio))
+		free_r10bio(r10_bio);
+
 	/*
 	 * allow_barrier after re-submit to ensure no sync io
 	 * can be issued while regular io pending.
-- 
2.43.0



* Re: [PATCH v2 3/7] md/raid10: consistently fail atomic writes that require splitting
  2026-06-28 14:24 ` [PATCH v2 3/7] md/raid10: consistently fail atomic writes that require splitting Abd-Alrhman Masalkhi
@ 2026-06-28 14:36   ` sashiko-bot
  2026-06-28 21:35     ` Abd-Alrhman Masalkhi
  0 siblings, 1 reply; 12+ messages in thread
From: sashiko-bot @ 2026-06-28 14:36 UTC (permalink / raw)
  To: Abd-Alrhman Masalkhi; +Cc: linux-raid, yukuai

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] The `err_dec_pending` (formerly `err_handle`) cleanup path in `raid10_write_request` can trigger a NULL pointer dereference when a replacement device is promoted concurrently.
--

commit bc9d81fdb2c33f8c55baad27ab1eedf427ae25ed
Author: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>

md/raid10: consistently fail atomic writes that require splitting

RAID10 handles one badblock path explicitly by failing atomic writes with
EIO, but another path can also reduce the writable range and force the bio
through bio_submit_split_bioset(), completing it with EINVAL. This commit
centralizes the check to fail all atomic writes requiring splitting with
EIO.

> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index bd322eccdc3ff..3480fc7907f01 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
[ ... ]
> @@ -1493,6 +1484,9 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
>  		r10_bio->sectors = max_sectors;
>  
>  	if (r10_bio->sectors < bio_sectors(bio)) {
> +		if (atomic)
> +			goto err_handle;
> +

[Severity: High]
This is a pre-existing issue, but can this jump to the error handler cause a
null pointer dereference if a replacement device is concurrently promoted?

In the error handler path, the code fetches the replacement device without
checking for null or falling back to the primary slot:

err_handle:
	for (k = 0;  k < i; k++) {
		int d = r10_bio->devs[k].devnum;
		struct md_rdev *rrdev = conf->mirrors[d].replacement;
		...
		if (r10_bio->devs[k].repl_bio) {
			rdev_dec_pending(rrdev, mddev);
			r10_bio->devs[k].repl_bio = NULL;
		}
	}

If raid10_remove_disk() runs concurrently, it can promote the replacement to
the primary slot and clear conf->mirrors[d].replacement:

raid10_remove_disk() {
	...
	WRITE_ONCE(*rdevp, NULL);
	if (p->replacement) {
		WRITE_ONCE(p->rdev, p->replacement);
		clear_bit(Replacement, &p->replacement->flags);
		WRITE_ONCE(p->replacement, NULL);
	}
	...
}

If this concurrent promotion happens, will rrdev be null here? It looks like
other paths such as raid10_end_write_request() safely handle this exact
race by falling back to rdev when replacement is null, but this cleanup
path does not.

>  		allow_barrier(conf);
>  		bio = bio_submit_split_bioset(bio, r10_bio->sectors,
>  					      &conf->bio_split);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260628142420.1051027-1-abd.masalkhi@gmail.com?part=3
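
The fallback pattern mentioned in the review can be sketched in
userspace as follows. This is an illustration of the idea only, not a
proposed fix: struct rdev_model, struct mirror_slot and the three
functions are hypothetical, and the model ignores the memory-ordering
concerns (READ_ONCE()/WRITE_ONCE(), barriers) that the real code has to
respect.

```c
#include <assert.h>
#include <stddef.h>

struct rdev_model {
	int nr_pending;
};

struct mirror_slot {
	struct rdev_model *rdev;
	struct rdev_model *replacement;
};

/* Models the promotion step: the replacement takes over the primary slot. */
void promote_replacement(struct mirror_slot *slot)
{
	slot->rdev = slot->replacement;
	slot->replacement = NULL;
}

/*
 * Device whose reference belongs to a replacement write: the replacement
 * if it is still in its slot, otherwise the primary slot it may have
 * been promoted into.
 */
struct rdev_model *repl_or_promoted(const struct mirror_slot *slot)
{
	struct rdev_model *r = slot->replacement;

	if (!r)
		r = slot->rdev;
	return r;
}

/* Drops one reference; returns -1 instead of dereferencing NULL. */
int drop_repl_reference(struct mirror_slot *slot)
{
	struct rdev_model *r = repl_or_promoted(slot);

	if (!r)
		return -1;
	r->nr_pending--;
	return 0;
}
```

If promote_replacement() runs between taking the reference and the
cleanup, slot->replacement is NULL, but drop_repl_reference() still
reaches the same device through slot->rdev.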


* Re: [PATCH v2 2/7] md/raid1: advertise atomic write limits and handle runtime constraints
  2026-06-28 14:24 ` [PATCH v2 2/7] md/raid1: advertise atomic write limits and handle runtime constraints Abd-Alrhman Masalkhi
@ 2026-06-28 14:38   ` sashiko-bot
  0 siblings, 0 replies; 12+ messages in thread
From: sashiko-bot @ 2026-06-28 14:38 UTC (permalink / raw)
  To: Abd-Alrhman Masalkhi; +Cc: linux-raid, yukuai

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Atomic write limits are improperly configured and completely overwritten during limit stacking.
--

commit c2837eff05b0ac2b80653d924f40f1bef7bad6d3
Author: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>

md/raid1: advertise atomic write limits and handle runtime constraints

Atomic writes in RAID1 must fit within a single barrier unit. Advertise
this restriction through the queue limits by setting
atomic_write_hw_unit_max to BARRIER_UNIT_SECTOR_SIZE so that bios which
would cross a barrier-unit boundary are rejected by the block layer
before reaching MD.

A bio that passes block-layer validation may still become unserviceable
within RAID1 due to bad blocks or write-behind constraints. In the former
case, complete the bio with EIO. In the latter case, disable
write-behind rather than failing the bio with EIO.

> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1633,10 +1620,24 @@ static bool raid1_write_request(struct mddev *mddev, struct bio *bio,
[ ... ]
>  	if (max_sectors < bio_sectors(bio)) {
> +		if (atomic) {
> +			bio_io_error(bio);
> +			goto err_dec_pending;
> +		}
> +

[Severity: High]
Will this result in atomic writes dynamically failing with unexpected I/O
errors instead of being properly rejected early by the block layer?

Because the limits configured in raid1_set_limits() appear to get overwritten
during stacking, boundary-crossing bios will bypass block layer validation and
trigger an error dynamically here.

>  		bio = bio_submit_split_bioset(bio, max_sectors,
>  					      &conf->bio_split);
[ ... ]
> @@ -3229,6 +3230,7 @@ static int raid1_set_limits(struct mddev *mddev)
>  	lim.max_write_zeroes_sectors = 0;
>  	lim.max_hw_wzeroes_unmap_sectors = 0;
>  	lim.logical_block_size = mddev->logical_block_size;
> +	lim.atomic_write_hw_unit_max = BARRIER_UNIT_SECTOR_SIZE;
> +	lim.features |= BLK_FEAT_ATOMIC_WRITES;
> +	lim.features |= BLK_FEAT_PCI_P2PDMA;
>  	err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);

[Severity: High]
Is BARRIER_UNIT_SECTOR_SIZE being assigned in sectors while the
atomic_write_hw_unit_max limit expects bytes? If so, the advertised limit
would be 512 times smaller than the barrier unit.
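
If the limit is indeed expressed in bytes, the conversion in question might look like the sketch below. The helper name and EXAMPLE_BARRIER_UNIT_SECTORS are hypothetical, and the value 1 << 17 is an assumption made for the example; the real definition lives in raid1.h and should be checked there.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/* Assumed value for illustration only; see raid1.h for the real one. */
#define EXAMPLE_BARRIER_UNIT_SECTORS (1u << 17)

/* Convert a sector count to bytes, refusing values that overflow 32 bits. */
static uint32_t sectors_to_bytes_u32(uint32_t sectors)
{
	uint64_t bytes = (uint64_t)sectors << SECTOR_SHIFT;

	assert(bytes <= UINT32_MAX);
	return (uint32_t)bytes;
}
```

Under that assumption, passing the sector count through unconverted would advertise 128 KiB where 64 MiB was meant.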

Additionally, will these custom RAID1 atomic limits be completely overwritten
when stacking?

Because atomic_write_hw_max is left unmodified from its default of UINT_MAX,
when mddev_stack_rdev_limits() is called:

mddev_stack_rdev_limits() ->
   queue_limits_stack_bdev() ->
      blk_stack_atomic_writes_limits() ->
         blk_stack_atomic_writes_head()

The stacking code will see UINT_MAX and unconditionally overwrite the newly
configured RAID1 limit with the first underlying device's limits.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260628142420.1051027-1-abd.masalkhi@gmail.com?part=2


* Re: [PATCH v2 1/7] md/raid10: fix r10bio leak in raid10_write_request() error paths
  2026-06-28 14:24 ` [PATCH v2 1/7] md/raid10: fix r10bio leak in raid10_write_request() error paths Abd-Alrhman Masalkhi
@ 2026-06-28 14:39   ` sashiko-bot
  0 siblings, 0 replies; 12+ messages in thread
From: sashiko-bot @ 2026-06-28 14:39 UTC (permalink / raw)
  To: Abd-Alrhman Masalkhi; +Cc: yukuai, linux-raid

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [High] Execution paths in raid10_write_request and raid10_read_request unconditionally block on REQ_NOWAIT bios.
- [High] Unconditional use of GFP_NOIO for r10_bio allocation can block, violating REQ_NOWAIT semantics.
--

commit d4f532acb9d5fd7506dfbbb4a12a97248c3fb429
Author: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>

md/raid10: fix r10bio leak in raid10_write_request() error paths

When raid10_write_request() fails because REQ_NOWAIT is set, the
allocated r10_bio is not freed before returning, resulting in a memory
leak. Free r10_bio before returning from the REQ_NOWAIT error paths.

Link: https://lore.kernel.org/linux-raid/20260613184042.BCEC01F000E9@smtp.kernel.org/

> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 0a3cfdd3f5df8..bd322eccdc3ff 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1365,6 +1365,7 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
>  		/* Bail out if REQ_NOWAIT is set for the bio */
>  		if (bio->bi_opf & REQ_NOWAIT) {
>  			bio_wouldblock_error(bio);
> +			free_r10bio(r10_bio);
>  			return false;
>  		}

[ ... ]

> @@ -1398,6 +1399,7 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
>  		if (bio->bi_opf & REQ_NOWAIT) {
>  			allow_barrier(conf);
>  			bio_wouldblock_error(bio);
> +			free_r10bio(r10_bio);
>  			return false;
>  		}

[ ... ]

[Severity: High]
This is a pre-existing issue, but does wait_blocked_dev() still block
unconditionally later in raid10_write_request() even when REQ_NOWAIT is set?

raid10_write_request() {
    ...
    r10_bio->read_slot = -1;
    raid10_find_phys(conf, r10_bio);

    wait_blocked_dev(mddev, r10_bio);
    ...
}

Inside wait_blocked_dev(), the code calls md_wait_for_blocked_rdev() and
wait_barrier(), both of which can block indefinitely. Can this cause
unintended latency spikes or thread starvation for userspace applications
issuing async I/O?

Similarly, in raid10_read_request(), when a bio is split across a chunk
boundary, it calls wait_barrier(conf, false) without checking the
parent bio's REQ_NOWAIT flag.

[Severity: High]
This is another pre-existing issue, but do the mempool_alloc() calls in
__make_request() and raid10_handle_discard() bypass REQ_NOWAIT semantics?

__make_request() {
    ...
    r10_bio = mempool_alloc(&conf->r10bio_pool, GFP_NOIO);
    ...
}

If the mempool is depleted and memory is low, this GFP_NOIO allocation can
block. Should this use GFP_NOWAIT for REQ_NOWAIT bios and return
BLK_STS_AGAIN when the pool is exhausted to preserve the non-blocking
contract?
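
The behaviour being asked for can be modelled in userspace roughly as below. This is a toy pool, not the mempool API: toy_pool_get() and the result codes are hypothetical names, and the "must sleep" result merely marks the point where a blocking allocation would wait for an object to be returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_alloc_result {
	TOY_ALLOC_OK,		/* object handed out */
	TOY_ALLOC_AGAIN,	/* nowait caller: fail fast, caller retries */
	TOY_ALLOC_MUST_SLEEP,	/* blocking caller would wait for a free object */
};

/* Toy fixed-size pool standing in for a preallocated object pool. */
struct toy_pool {
	int free_objs;
};

/* Hand out one object, or report how an empty pool affects the caller. */
static enum toy_alloc_result toy_pool_get(struct toy_pool *pool, bool nowait)
{
	assert(pool != NULL);

	if (pool->free_objs > 0) {
		pool->free_objs--;
		return TOY_ALLOC_OK;
	}
	return nowait ? TOY_ALLOC_AGAIN : TOY_ALLOC_MUST_SLEEP;
}
```

In this model a REQ_NOWAIT-style caller never reaches the sleeping branch: an empty pool is reported immediately and the request would be completed with a would-block status.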

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260628142420.1051027-1-abd.masalkhi@gmail.com?part=1


* Re: [PATCH v2 3/7] md/raid10: consistently fail atomic writes that require splitting
  2026-06-28 14:36   ` sashiko-bot
@ 2026-06-28 21:35     ` Abd-Alrhman Masalkhi
  0 siblings, 0 replies; 12+ messages in thread
From: Abd-Alrhman Masalkhi @ 2026-06-28 21:35 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-raid, yukuai


Hi Kuai,

On Sun, Jun 28, 2026 at 14:36 +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
>
> Pre-existing issues:
> - [High] The `err_dec_pending` (formerly `err_handle`) cleanup path in `raid10_write_request` can trigger a NULL pointer dereference when a replacement device is promoted concurrently.
> --
>
> commit bc9d81fdb2c33f8c55baad27ab1eedf427ae25ed
> Author: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
>
> md/raid10: consistently fail atomic writes that require splitting
>
> RAID10 handles one badblock path explicitly by failing atomic writes with
> EIO, but another path can also reduce the writable range and force the bio
> through bio_submit_split_bioset(), completing it with EINVAL. This commit
> centralizes the check to fail all atomic writes requiring splitting with
> EIO.
>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index bd322eccdc3ff..3480fc7907f01 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
> [ ... ]
>> @@ -1493,6 +1484,9 @@ static bool raid10_write_request(struct mddev *mddev, struct bio *bio,
>>  		r10_bio->sectors = max_sectors;
>>  
>>  	if (r10_bio->sectors < bio_sectors(bio)) {
>> +		if (atomic)
>> +			goto err_handle;
>> +
>
> [Severity: High]
> This is a pre-existing issue, but can this jump to the error handler cause a
> null pointer dereference if a replacement device is concurrently promoted?
>
> In the error handler path, the code fetches the replacement device without
> checking for null or falling back to the primary slot:
>
> err_handle:
> 	for (k = 0;  k < i; k++) {
> 		int d = r10_bio->devs[k].devnum;
> 		struct md_rdev *rrdev = conf->mirrors[d].replacement;
> 		...
> 		if (r10_bio->devs[k].repl_bio) {
> 			rdev_dec_pending(rrdev, mddev);
> 			r10_bio->devs[k].repl_bio = NULL;
> 		}
> 	}
>
> If raid10_remove_disk() runs concurrently, it can promote the replacement to
> the primary slot and clear conf->mirrors[d].replacement:
>
> raid10_remove_disk() {
> 	...
> 	WRITE_ONCE(*rdevp, NULL);
> 	if (p->replacement) {
> 		WRITE_ONCE(p->rdev, p->replacement);
> 		clear_bit(Replacement, &p->replacement->flags);
> 		WRITE_ONCE(p->replacement, NULL);
> 	}
> 	...
> }
>
At first glance this looks unreachable, because raid10_remove_disk()
should only run while the array is suspended, which would drain the
in-flight write before any promotion. But that assumption does not hold
on the md_start_sync() path. The suspend there is gated on a lock-free
check of md_spares_need_change() taken once at function entry; we only
suspend if it returns true. We then call remove_spares()
unconditionally, regardless of that earlier decision, and
remove_spares() re-evaluates rdev_removeable() independently under the
lock.

So the suspend decision and the actual removal are sampled at different
times with nothing held across them. A primary that was In_sync at the
entry check (suspend skipped) but fails afterward becomes removable by
the time remove_spares() runs, and raid10_remove_disk() then promotes
the replacement and clears the slot with no suspend in effect. Meanwhile
an in-flight raid10_write_request() that referenced only the replacement
(because the primary was already Faulty when it ran) can be sitting in
err_handle, and it reads conf->mirrors[d].replacement as NULL.

It seems real. I'll submit a fix addressing this issue.
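
To make the window concrete, here is a deterministic userspace model of the check-then-act sequence described above. It is a toy: toy_removable() and toy_start_sync() are hypothetical names, a single "faulty" flag stands in for both md_spares_need_change() and rdev_removeable(), and no locking or real suspend is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_rdev {
	bool faulty;
};

struct toy_trace {
	bool suspended;	/* decision sampled once at entry */
	bool removed;	/* decision re-evaluated at removal time */
};

/* Stand-in predicate: a faulty device is considered removable. */
static bool toy_removable(const struct toy_rdev *rdev)
{
	return rdev->faulty;
}

/* Model of "check at entry, act later" with an optional failure in between. */
static struct toy_trace toy_start_sync(struct toy_rdev *rdev, bool fails_in_window)
{
	struct toy_trace trace;

	assert(rdev != NULL);
	trace.suspended = toy_removable(rdev);	/* entry check */
	if (fails_in_window)
		rdev->faulty = true;		/* state changes after the check */
	trace.removed = toy_removable(rdev);	/* independent re-check */
	return trace;
}
```

When the device fails inside the window, the model ends with removed set and suspended clear, which is the combination that lets the promotion run against an in-flight write.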

> If this concurrent promotion happens, will rrdev be null here? It looks like
> other paths such as raid10_end_write_request() safely handle this exact
> race by falling back to rdev when replacement is null, but this cleanup
> path does not.
>
>>  		allow_barrier(conf);
>>  		bio = bio_submit_split_bioset(bio, r10_bio->sectors,
>>  					      &conf->bio_split);
>
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260628142420.1051027-1-abd.masalkhi@gmail.com?part=3

-- 
Best Regards,
Abd-Alrhman


end of thread, other threads:[~2026-06-28 21:35 UTC | newest]

Thread overview: 12+ messages
2026-06-28 14:24 [PATCH v2 0/7] md/raid10: fixes, atomic write handling, and error-path cleanup Abd-Alrhman Masalkhi
2026-06-28 14:24 ` [PATCH v2 1/7] md/raid10: fix r10bio leak in raid10_write_request() error paths Abd-Alrhman Masalkhi
2026-06-28 14:39   ` sashiko-bot
2026-06-28 14:24 ` [PATCH v2 2/7] md/raid1: advertise atomic write limits and handle runtime constraints Abd-Alrhman Masalkhi
2026-06-28 14:38   ` sashiko-bot
2026-06-28 14:24 ` [PATCH v2 3/7] md/raid10: consistently fail atomic writes that require splitting Abd-Alrhman Masalkhi
2026-06-28 14:36   ` sashiko-bot
2026-06-28 21:35     ` Abd-Alrhman Masalkhi
2026-06-28 14:24 ` [PATCH v2 4/7] md/raid10: remove unnecessary barrier around bio_submit_split_bioset() Abd-Alrhman Masalkhi
2026-06-28 14:24 ` [PATCH v2 5/7] md/raid10: replace wait loop with wait_event_idle() Abd-Alrhman Masalkhi
2026-06-28 14:24 ` [PATCH v2 6/7] md/raid10: simplify write request error handling Abd-Alrhman Masalkhi
2026-06-28 14:24 ` [PATCH v2 7/7] md/raid10: simplify read " Abd-Alrhman Masalkhi
