Linux RAID subsystem development
* [PATCH v2 0/2] md/raid10: fix r10bio width mismatches across reshape
@ 2026-05-15  9:27 Chen Cheng
  2026-05-15  9:27 ` [PATCH v2 1/2] md/raid10: make r10bio_pool use fixed-size objects Chen Cheng
  2026-05-15  9:27 ` [PATCH v2 2/2] md/raid10: bound reused r10bio devs[] walks by used_nr_devs Chen Cheng
  0 siblings, 2 replies; 3+ messages in thread
From: Chen Cheng @ 2026-05-15  9:27 UTC (permalink / raw)
  To: Yu Kuai; +Cc: Chen Cheng, linux-raid, linux-kernel

From: Chen Cheng <chencheng@fnnas.com>

Hi,

This series fixes slab out-of-bounds accesses in raid10 when reshape changes
the number of raid disks while regular I/O is still reusing r10bio objects
allocated under the previous geometry.

The bug is reproducible with a simple 4-disk to 5-disk reshape under write
load, for example:

  mdadm -C /dev/md777 -l10 -n4 /dev/sda /dev/sdb /dev/sdc /dev/sdd
  mkfs.ext4 /dev/md777
  mount /dev/md777 /mnt/test
  fsstress -d /mnt/test -n 24000 -p 8 -l 24 &
  mdadm /dev/md777 --add /dev/sde
  mdadm --grow /dev/md777 --raid-devices=5 \
    --backup-file=/tmp/md-reshape-backup

Without these changes, an r10bio allocated under the old geometry can later be
reused, initialized, or freed after conf->geo.raid_disks has switched to the
new geometry. The object's allocated devs[] width then no longer matches the
width used to walk or initialize it, which can trigger KASAN reports such as
slab-out-of-bounds in __make_request(), put_all_bios(), or find_bio_disk().

This series addresses the problem in two steps:

  1. make the regular r10bio pool fixed-size across reshape transitions, and
     move the pool rebuild into the freeze window before the live geometry
     switch;

  2. track the number of valid devs[] entries in each reused r10bio and use
     that recorded width when walking devs[] after reshape.

Changes in v2:
  - add this cover letter
  - convert r10bio_pool to a fixed-size kmalloc mempool
  - rebuild r10bio_pool inside the freeze window before switching live reshape
    geometry
  - switch raid10_quiesce() to freeze_array()/unfreeze_array()

Open issues:

One point where this v2 series still differs from raid1 is the pool-switch
semantics during reshape.

raid1 handles this by:
  - converting r1bio_pool to a fixed-size pool,
  - freezing the array,
  - swapping in the new pool while the array is frozen,
  - switching the live geometry/state,
  - unfreezing the array, and
  - destroying the old pool afterwards.

In other words, raid1 keeps the old and new regular I/O pools logically
separated across the reshape transition.

This raid10 v2 series follows the same high-level direction by converting
r10bio_pool to a fixed-size pool and moving the pool rebuild into the freeze
window before the live geometry switch. However, it does not yet mirror
raid1 completely: queued regular r10bios may still exist on retry_list or
bio_end_io_list at the time of the pool replacement, and raid10's current
freeze semantics only guarantee that in-flight I/O has either completed or
been queued.

My current understanding is that there are two possible directions to make
this fully robust:

  1. strengthen raid10 freeze semantics so that the reshape-time pool switch
     guarantees that no old regular r10bio can survive across the transition;
     or

  2. explicitly associate in-flight regular r10bios with the pool they were
     allocated from, so they can always be returned to the correct pool even
     if old and new pools overlap in time.

There is also a pre-existing boundary issue in find_bio_disk(): if the bio
is not found in devs[], the search loop exits with slot equal to the recorded
width, and the subsequent devs[slot] access is one past the end. That issue
is not addressed in this series.

Testing:
  - reproduced the original KASAN slab-out-of-bounds on 4-disk -> 5-disk
    raid10 reshape with fsstress
  - verified that this series fixes that reproducer
  - exercised the 5-disk -> 4-disk reshape direction as well

Thanks,
Chen Cheng


Chen Cheng (2):
  md/raid10: make r10bio_pool use fixed-size objects
  md/raid10: bound reused r10bio devs[] walks by used_nr_devs

 drivers/md/raid10.c | 63 +++++++++++++++++++++++++++++++++------------
 drivers/md/raid10.h |  4 ++-
 2 files changed, 49 insertions(+), 18 deletions(-)

-- 
2.54.0


* [PATCH v2 1/2] md/raid10: make r10bio_pool use fixed-size objects
  2026-05-15  9:27 [PATCH v2 0/2] md/raid10: fix r10bio width mismatches across reshape Chen Cheng
@ 2026-05-15  9:27 ` Chen Cheng
  2026-05-15  9:27 ` [PATCH v2 2/2] md/raid10: bound reused r10bio devs[] walks by used_nr_devs Chen Cheng
  1 sibling, 0 replies; 3+ messages in thread
From: Chen Cheng @ 2026-05-15  9:27 UTC (permalink / raw)
  To: Yu Kuai; +Cc: Chen Cheng, linux-raid, linux-kernel

From: Chen Cheng <chencheng@fnnas.com>

raid10 currently allocates r10bio_pool objects with conf->geo.raid_disks,
which makes regular r10bio objects geometry-dependent.

That model breaks down across reshape. mempool objects are preallocated and
reused, so a reshape that changes the number of raid disks can leave old
r10bio objects in the regular I/O pool with a devs[] array sized for the
previous geometry. After the geometry switch, those stale objects may be
reused or later freed under the new layout, creating a width mismatch
between the reused r10bio and the current array geometry.

For example, during a 4-disk to 5-disk reshape, an r10bio allocated before
the geometry switch has room for only 4 devs[] entries. After reshape
updates conf->geo.raid_disks to 5, that stale object can be reused under
the new geometry. Code such as __make_request(), put_all_bios(), and
find_bio_disk() may then access devs[] using the new geometry and step
past the end of the old 4-slot object, leading to slab out-of-bounds
accesses.

The root problem is that regular r10bio pool objects are geometry-dependent,
while mempool elements are preallocated and reused across requests.

Switch r10bio_pool to a fixed-size kmalloc mempool so regular I/O objects no
longer carry an allocation width tied to the current geometry. Use the same
fixed-size allocation rule for the standalone r10bio allocated from
r10buf_pool_alloc().

Because reshape updates live array state such as conf->mirrors, conf->geo,
reshape_progress, and reshape_safe, the geometry switch must happen only
after normal I/O has gone fully quiet. raise_barrier() alone is not strong
enough here: freeze_array() also marks that an array freeze is in progress,
flushes pending writes, and waits until in-flight I/O has either completed
or been queued.

Freeze the array before switching reshape geometry, rebuild r10bio_pool for
the new width inside that freeze window, and switch raid10_quiesce() to use
freeze_array()/unfreeze_array() as well. This keeps new requests from reusing
stale-width regular I/O objects after the geometry change.

Signed-off-by: Chen Cheng <chencheng@fnnas.com>
---
 drivers/md/raid10.c | 57 +++++++++++++++++++++++++++++++++------------
 drivers/md/raid10.h |  2 +-
 2 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 39085e7dd6d2..886bbe6b1ebc 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -103,13 +103,28 @@ static inline struct r10bio *get_resync_r10bio(struct bio *bio)
 	return get_resync_pages(bio)->raid_bio;
 }
 
-static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
+static inline unsigned int calc_r10bio_pool_disks(struct mddev *mddev)
 {
-	struct r10conf *conf = data;
-	int size = offsetof(struct r10bio, devs[conf->geo.raid_disks]);
+	/* If delta_disks < 0, using a bigger r10bio->devs[] is fine. */
+	return mddev->raid_disks + max(0, mddev->delta_disks);
+}
+
+static inline int calc_r10bio_size(struct mddev *mddev)
+{
+	return offsetof(struct r10bio, devs[calc_r10bio_pool_disks(mddev)]);
+}
+
+static mempool_t *create_r10bio_pool(struct mddev *mddev)
+{
+	int size = calc_r10bio_size(mddev);
+
+	return mempool_create_kmalloc_pool(NR_RAID_BIOS, size);
+}
+
+static struct r10bio *alloc_r10bio(struct mddev *mddev, gfp_t gfp_flags)
+{
+	int size = calc_r10bio_size(mddev);
 
-	/* allocate a r10bio with room for raid_disks entries in the
-	 * bios array */
 	return kzalloc(size, gfp_flags);
 }
 
@@ -137,7 +152,7 @@ static void * r10buf_pool_alloc(gfp_t gfp_flags, void *data)
 	int nalloc, nalloc_rp;
 	struct resync_pages *rps;
 
-	r10_bio = r10bio_pool_alloc(gfp_flags, conf);
+	r10_bio = alloc_r10bio(conf->mddev, gfp_flags);
 	if (!r10_bio)
 		return NULL;
 
@@ -277,7 +292,7 @@ static void free_r10bio(struct r10bio *r10_bio)
 	struct r10conf *conf = r10_bio->mddev->private;
 
 	put_all_bios(conf, r10_bio);
-	mempool_free(r10_bio, &conf->r10bio_pool);
+	mempool_free(r10_bio, conf->r10bio_pool);
 }
 
 static void put_buf(struct r10bio *r10_bio)
@@ -1531,7 +1546,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio, int sectors)
 	struct r10conf *conf = mddev->private;
 	struct r10bio *r10_bio;
 
-	r10_bio = mempool_alloc(&conf->r10bio_pool, GFP_NOIO);
+	r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
 
 	r10_bio->master_bio = bio;
 	r10_bio->sectors = sectors;
@@ -1723,7 +1738,7 @@ static int raid10_handle_discard(struct mddev *mddev, struct bio *bio)
 				(last_stripe_index << geo->chunk_shift);
 
 retry_discard:
-	r10_bio = mempool_alloc(&conf->r10bio_pool, GFP_NOIO);
+	r10_bio = mempool_alloc(conf->r10bio_pool, GFP_NOIO);
 	r10_bio->mddev = mddev;
 	r10_bio->state = 0;
 	r10_bio->sectors = 0;
@@ -3823,7 +3838,7 @@ static void raid10_free_conf(struct r10conf *conf)
 	if (!conf)
 		return;
 
-	mempool_exit(&conf->r10bio_pool);
+	mempool_destroy(conf->r10bio_pool);
 	kfree(conf->mirrors);
 	kfree(conf->mirrors_old);
 	kfree(conf->mirrors_new);
@@ -3870,9 +3885,8 @@ static struct r10conf *setup_conf(struct mddev *mddev)
 
 	conf->geo = geo;
 	conf->copies = copies;
-	err = mempool_init(&conf->r10bio_pool, NR_RAID_BIOS, r10bio_pool_alloc,
-			   rbio_pool_free, conf);
-	if (err)
+	conf->r10bio_pool = create_r10bio_pool(mddev);
+	if (!conf->r10bio_pool)
 		goto out;
 
 	err = bioset_init(&conf->bio_split, BIO_POOL_SIZE, 0, 0);
@@ -4131,9 +4145,9 @@ static void raid10_quiesce(struct mddev *mddev, int quiesce)
 	struct r10conf *conf = mddev->private;
 
 	if (quiesce)
-		raise_barrier(conf, 0);
+		freeze_array(conf, 0);
 	else
-		lower_barrier(conf);
+		unfreeze_array(conf);
 }
 
 static int raid10_resize(struct mddev *mddev, sector_t sectors)
@@ -4365,6 +4379,7 @@ static int raid10_start_reshape(struct mddev *mddev)
 	struct md_rdev *rdev;
 	int spares = 0;
 	int ret;
+	mempool_t *new_pool;
 
 	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
 		return -EBUSY;
@@ -4400,7 +4415,17 @@ static int raid10_start_reshape(struct mddev *mddev)
 	if (spares < mddev->delta_disks)
 		return -EINVAL;
 
+	freeze_array(conf, 0);
 	conf->offset_diff = min_offset_diff;
+	if (mddev->delta_disks > 0) {
+		new_pool = create_r10bio_pool(mddev);
+		if (!new_pool) {
+			unfreeze_array(conf);
+			return -ENOMEM;
+		}
+		mempool_destroy(conf->r10bio_pool);
+		conf->r10bio_pool = new_pool;
+	}
 	spin_lock_irq(&conf->device_lock);
 	if (conf->mirrors_new) {
 		memcpy(conf->mirrors_new, conf->mirrors,
@@ -4417,6 +4442,7 @@ static int raid10_start_reshape(struct mddev *mddev)
 		sector_t size = raid10_size(mddev, 0, 0);
 		if (size < mddev->array_sectors) {
 			spin_unlock_irq(&conf->device_lock);
+			unfreeze_array(conf);
 			pr_warn("md/raid10:%s: array size must be reduce before number of disks\n",
 				mdname(mddev));
 			return -EINVAL;
@@ -4427,6 +4453,7 @@ static int raid10_start_reshape(struct mddev *mddev)
 		conf->reshape_progress = 0;
 	conf->reshape_safe = conf->reshape_progress;
 	spin_unlock_irq(&conf->device_lock);
+	unfreeze_array(conf);
 
 	if (mddev->delta_disks && mddev->bitmap) {
 		struct mdp_superblock_1 *sb = NULL;
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index ec79d87fb92f..b711626a5db7 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -87,7 +87,7 @@ struct r10conf {
 						   */
 	wait_queue_head_t	wait_barrier;
 
-	mempool_t		r10bio_pool;
+	mempool_t		*r10bio_pool;
 	mempool_t		r10buf_pool;
 	struct page		*tmppage;
 	struct bio_set		bio_split;
-- 
2.54.0


* [PATCH v2 2/2] md/raid10: bound reused r10bio devs[] walks by used_nr_devs
  2026-05-15  9:27 [PATCH v2 0/2] md/raid10: fix r10bio width mismatches across reshape Chen Cheng
  2026-05-15  9:27 ` [PATCH v2 1/2] md/raid10: make r10bio_pool use fixed-size objects Chen Cheng
@ 2026-05-15  9:27 ` Chen Cheng
  1 sibling, 0 replies; 3+ messages in thread
From: Chen Cheng @ 2026-05-15  9:27 UTC (permalink / raw)
  To: Yu Kuai; +Cc: Chen Cheng, linux-raid, linux-kernel

From: Chen Cheng <chencheng@fnnas.com>

After reshape changes raid_disks, an in-flight r10bio from the old geometry
can still be completed or freed later. In that case, using the current
geometry to walk r10_bio->devs[] is unsafe. A failure was reproduced with a
simple write workload while reshaping a raid10 array from 4 disks to 5 disks,
for example:

  mdadm -C /dev/md777 -l10 -n4 /dev/sda /dev/sdb /dev/sdc /dev/sdd
  mkfs.ext4 /dev/md777
  mount /dev/md777 /mnt/test
  fsstress -d /mnt/test -n 24000 -p 8 -l 24 &
  mdadm /dev/md777 --add /dev/sde
  mdadm --grow /dev/md777 --raid-devices=5 \
    --backup-file=/tmp/md-reshape-backup

The sequence above can trigger:

  BUG: KASAN: slab-out-of-bounds in free_r10bio+0x1c4/0x260 [raid10]
  Read of size 8 at addr ffff00008c2dfac8 by task ksoftirqd/0/15
  free_r10bio
  raid_end_bio_io
  one_write_done
  raid10_end_write_request

The buggy object was 200 bytes long, which matches an r10bio with space for
only four devs[] entries. However, put_all_bios() and find_bio_disk() walk
r10_bio->devs[] using the current conf->geo.raid_disks value. Once reshape
switches conf->geo.raid_disks from 4 to 5, an old 4-slot r10bio can be
completed or freed as if it had 5 slots, and the walk overruns devs[4]. The
same stale-width mismatch can also surface during a 5-disk to 4-disk reshape.

Track the number of valid devs[] entries in each reused r10bio with
used_nr_devs. Initialize it whenever an r10bio is prepared for regular I/O,
discard, or resync/recovery/reshape work, and use it to bound devs[] walks
in put_all_bios() and find_bio_disk().

Signed-off-by: Chen Cheng <chencheng@fnnas.com>
---
 drivers/md/raid10.c | 8 ++++++--
 drivers/md/raid10.h | 2 ++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 886bbe6b1ebc..42865d822d95 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -275,7 +275,7 @@ static void put_all_bios(struct r10conf *conf, struct r10bio *r10_bio)
 {
 	int i;
 
-	for (i = 0; i < conf->geo.raid_disks; i++) {
+	for (i = 0; i < r10_bio->used_nr_devs; i++) {
 		struct bio **bio = & r10_bio->devs[i].bio;
 		if (!BIO_SPECIAL(*bio))
 			bio_put(*bio);
@@ -372,7 +372,7 @@ static int find_bio_disk(struct r10conf *conf, struct r10bio *r10_bio,
 	int slot;
 	int repl = 0;
 
-	for (slot = 0; slot < conf->geo.raid_disks; slot++) {
+	for (slot = 0; slot < r10_bio->used_nr_devs; slot++) {
 		if (r10_bio->devs[slot].bio == bio)
 			break;
 		if (r10_bio->devs[slot].repl_bio == bio) {
@@ -1555,6 +1555,7 @@ static void __make_request(struct mddev *mddev, struct bio *bio, int sectors)
 	r10_bio->sector = bio->bi_iter.bi_sector;
 	r10_bio->state = 0;
 	r10_bio->read_slot = -1;
+	r10_bio->used_nr_devs = conf->geo.raid_disks;
 	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) *
 			conf->geo.raid_disks);
 
@@ -1742,6 +1743,7 @@ static int raid10_handle_discard(struct mddev *mddev, struct bio *bio)
 	r10_bio->mddev = mddev;
 	r10_bio->state = 0;
 	r10_bio->sectors = 0;
+	r10_bio->used_nr_devs = geo->raid_disks;
 	memset(r10_bio->devs, 0, sizeof(r10_bio->devs[0]) * geo->raid_disks);
 	wait_blocked_dev(mddev, r10_bio);
 
@@ -3076,6 +3078,8 @@ static struct r10bio *raid10_alloc_init_r10buf(struct r10conf *conf)
 	else
 		nalloc = 2; /* recovery */
 
+	r10bio->used_nr_devs = nalloc;
+
 	for (i = 0; i < nalloc; i++) {
 		bio = r10bio->devs[i].bio;
 		rp = bio->bi_private;
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index b711626a5db7..4751119f9770 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -127,6 +127,8 @@ struct r10bio {
 	 * if the IO is in READ direction, then this is where we read
 	 */
 	int			read_slot;
+	/* Used to bound devs[] walks when the object is reused. */
+	unsigned int		used_nr_devs;
 
 	struct list_head	retry_list;
 	/*
-- 
2.54.0

