* [PATCH 1/2] block: ignore underlying non-stack devices io_opt
@ 2025-08-17 15:26 colyli
2025-08-17 15:26 ` [PATCH 2/2] md: split bio by io_opt size in md_submit_bio() colyli
` (3 more replies)
0 siblings, 4 replies; 21+ messages in thread
From: colyli @ 2025-08-17 15:26 UTC (permalink / raw)
To: linux-raid; +Cc: linux-block, yukuai3, Coly Li
From: Coly Li <colyli@kernel.org>
This patch adds a new BLK_FLAG_STACK_IO_OPT flag for stacked block
devices. If a stacked block device such as md raid5 declares its own
io_opt and does not want blk_stack_limits() to modify it with the
io_opt of the underlying non-stacked block devices,
BLK_FLAG_STACK_IO_OPT can be set in limits.flags. Then in
blk_stack_limits(), lcm_not_zero(t->io_opt, b->io_opt) will be avoided.
For md raid5, it is necessary to keep a proper io_opt size for better
I/O thoughput.
Signed-off-by: Coly Li <colyli@kernel.org>
---
block/blk-settings.c | 6 +++++-
drivers/md/raid5.c | 1 +
include/linux/blkdev.h | 3 +++
3 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 07874e9b609f..46ee538b2be9 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -782,6 +782,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->features &= ~BLK_FEAT_POLL;
t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
+ t->flags |= (b->flags & BLK_FLAG_STACK_IO_OPT);
t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
t->max_user_sectors = min_not_zero(t->max_user_sectors,
@@ -839,7 +840,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
b->physical_block_size);
t->io_min = max(t->io_min, b->io_min);
- t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
+ if (!t->io_opt || !(t->flags & BLK_FLAG_STACK_IO_OPT) ||
+ (b->flags & BLK_FLAG_STACK_IO_OPT))
+ t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
+
t->dma_alignment = max(t->dma_alignment, b->dma_alignment);
/* Set non-power-of-2 compatible chunk_sectors boundary */
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 023649fe2476..989acd8abd98 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7730,6 +7730,7 @@ static int raid5_set_limits(struct mddev *mddev)
lim.io_min = mddev->chunk_sectors << 9;
lim.io_opt = lim.io_min * (conf->raid_disks - conf->max_degraded);
lim.features |= BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE;
+ lim.flags |= BLK_FLAG_STACK_IO_OPT;
lim.discard_granularity = stripe;
lim.max_write_zeroes_sectors = 0;
mddev_stack_rdev_limits(mddev, &lim, 0);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 95886b404b16..a22c7cea9836 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -366,6 +366,9 @@ typedef unsigned int __bitwise blk_flags_t;
/* passthrough command IO accounting */
#define BLK_FLAG_IOSTATS_PASSTHROUGH ((__force blk_flags_t)(1u << 2))
+/* ignore underlying non-stack devices io_opt */
+#define BLK_FLAG_STACK_IO_OPT ((__force blk_flags_t)(1u << 3))
+
struct queue_limits {
blk_features_t features;
blk_flags_t flags;
--
2.47.2
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 2/2] md: split bio by io_opt size in md_submit_bio()
2025-08-17 15:26 [PATCH 1/2] block: ignore underlying non-stack devices io_opt colyli
@ 2025-08-17 15:26 ` colyli
2025-08-18 1:38 ` Yu Kuai
` (2 more replies)
2025-08-17 18:37 ` [PATCH 1/2] block: ignore underlying non-stack devices io_opt Paul Menzel
` (2 subsequent siblings)
3 siblings, 3 replies; 21+ messages in thread
From: colyli @ 2025-08-17 15:26 UTC (permalink / raw)
To: linux-raid; +Cc: linux-block, yukuai3, Coly Li
From: Coly Li <colyli@kernel.org>
Currently in md_submit_bio() the incoming request bio is split by
bio_split_to_limits(), which makes sure the bio won't exceed
max_hw_sectors of a specific raid level before sending it into its
.make_request method.
For raid levels 4/5/6 such a split can be problematic and hurt large
read/write performance. Because limits.max_hw_sectors is not always
aligned to the limits.io_opt size, the split bio won't cover full
stripes on all data disks, and will introduce extra read-in I/O. Even
if the bio's bi_sector is aligned to limits.io_opt and the bio is
large enough, the resulting split bio is not size-friendly to the
corresponding raid456 level.
This patch introduces bio_split_by_io_opt() to solve the above issue:
1, If the incoming bio is not limits.io_opt aligned, split off the
non-aligned head part. Then the next part will be aligned.
2, If the incoming bio is limits.io_opt aligned and a split is
necessary, try to split at a multiple of limits.io_opt without
exceeding limits.max_hw_sectors.
Then for a large bio, the aligned split part will cover full stripes
on all data disks, with no extra read-in I/O when rmw_level is 0. And
for rmw_level > 0 conditions, the limits.io_opt aligned bios are
welcome for performance as well.
This patch was only tested on an 8-disk raid5 array with a 64KiB
chunk size. With this patch, sequential write performance on that
array increases from 900MiB/s to 1.1GiB/s with fio bs=10M. With fio
bs=448K (the exact limits.io_opt size) the peak sequential write
throughput can reach 1.51GiB/s.
Signed-off-by: Coly Li <colyli@kernel.org>
---
drivers/md/md.c | 51 +++++++++++++++++++++++++++++++++++++++++++++-
drivers/md/raid5.c | 6 +++++-
2 files changed, 55 insertions(+), 2 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ac85ec73a409..d0d4d05150fe 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -426,6 +426,55 @@ bool md_handle_request(struct mddev *mddev, struct bio *bio)
}
EXPORT_SYMBOL(md_handle_request);
+/**
+ * For raid456 read/write request, if bio LBA isn't aligned to io_opt,
+ * split the non io_opt aligned header, to make the second part's LBA be
+ * aligned to io_opt. Otherwise still call bio_split_to_limits() to
+ * handle bio split with queue limits.
+ */
+static struct bio *bio_split_by_io_opt(struct bio *bio)
+{
+ sector_t io_opt_sectors, start, offset;
+ struct queue_limits lim;
+ struct mddev *mddev;
+ struct bio *split;
+ int level;
+
+ mddev = bio->bi_bdev->bd_disk->private_data;
+ level = mddev->level;
+
+ /* Only handle raid456 read/write requests */
+ if (level == 1 || level == 10 || level == 0 || level == LEVEL_LINEAR ||
+ (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE))
+ return bio_split_to_limits(bio);
+
+ /* In case raid456 chunk size is too large */
+ lim = mddev->gendisk->queue->limits;
+ io_opt_sectors = lim.io_opt >> SECTOR_SHIFT;
+ if (unlikely(io_opt_sectors > lim.max_hw_sectors))
+ return bio_split_to_limits(bio);
+
+ /* Small request, no need to split */
+ if (bio_sectors(bio) <= io_opt_sectors)
+ return bio;
+
+ /* Only split the non-io-opt aligned header part */
+ start = bio->bi_iter.bi_sector;
+ offset = sector_div(start, io_opt_sectors);
+ if (offset == 0)
+ return bio_split_to_limits(bio);
+
+ split = bio_split(bio, (io_opt_sectors - offset), GFP_NOIO,
+ &bio->bi_bdev->bd_disk->bio_split);
+ if (!split)
+ return bio_split_to_limits(bio);
+
+ split->bi_opf |= REQ_NOMERGE;
+ bio_chain(split, bio);
+ submit_bio_noacct(bio);
+ return split;
+}
+
static void md_submit_bio(struct bio *bio)
{
const int rw = bio_data_dir(bio);
@@ -441,7 +490,7 @@ static void md_submit_bio(struct bio *bio)
return;
}
- bio = bio_split_to_limits(bio);
+ bio = bio_split_by_io_opt(bio);
if (!bio)
return;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 989acd8abd98..985fabeeead5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7759,9 +7759,13 @@ static int raid5_set_limits(struct mddev *mddev)
/*
* Requests require having a bitmap for each stripe.
- * Limit the max sectors based on this.
+ * Limit the max sectors based on this. And being
+ * aligned to lim.io_opt for better I/O performance.
*/
lim.max_hw_sectors = RAID5_MAX_REQ_STRIPES << RAID5_STRIPE_SHIFT(conf);
+ if (lim.max_hw_sectors > lim.io_opt >> SECTOR_SHIFT)
+ lim.max_hw_sectors = rounddown(lim.max_hw_sectors,
+ lim.io_opt >> SECTOR_SHIFT);
/* No restrictions on the number of segments in the request */
lim.max_segments = USHRT_MAX;
--
2.47.2
* Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt
2025-08-17 15:26 [PATCH 1/2] block: ignore underlying non-stack devices io_opt colyli
2025-08-17 15:26 ` [PATCH 2/2] md: split bio by io_opt size in md_submit_bio() colyli
@ 2025-08-17 18:37 ` Paul Menzel
2025-08-18 1:14 ` Yu Kuai
2025-08-18 2:51 ` Damien Le Moal
3 siblings, 0 replies; 21+ messages in thread
From: Paul Menzel @ 2025-08-17 18:37 UTC (permalink / raw)
To: Coly Li; +Cc: linux-raid, linux-block, yukuai3
Dear Coly,
Thank you for your patch.
On 17.08.25 at 17:26, colyli@kernel.org wrote:
> From: Coly Li <colyli@kernel.org>
>
> This patch adds a new BLK_FLAG_STACK_IO_OPT for stack block device. If a
> stack block device like md raid5 declares its io_opt when don't want
> blk_stack_limits() to change it with io_opt of underlying non-stack
> block devices, BLK_FLAG_STACK_IO_OPT can be set on limits.flags. Then in
> blk_stack_limits(), lcm_not_zero(t->io_opt, b->io_opt) will be avoided.
>
> For md raid5, it is necessary to keep a proper io_opt size for better
> I/O thoughput.
th*r*oughput
For performance claims, it’d be great if you added the benchmark and the
test system details and test results.
> Signed-off-by: Coly Li <colyli@kernel.org>
> ---
> block/blk-settings.c | 6 +++++-
> drivers/md/raid5.c | 1 +
> include/linux/blkdev.h | 3 +++
> 3 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 07874e9b609f..46ee538b2be9 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -782,6 +782,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> t->features &= ~BLK_FEAT_POLL;
>
> t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
> + t->flags |= (b->flags & BLK_FLAG_STACK_IO_OPT);
>
> t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
> t->max_user_sectors = min_not_zero(t->max_user_sectors,
> @@ -839,7 +840,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> b->physical_block_size);
>
> t->io_min = max(t->io_min, b->io_min);
> - t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
> + if (!t->io_opt || !(t->flags & BLK_FLAG_STACK_IO_OPT) ||
> + (b->flags & BLK_FLAG_STACK_IO_OPT))
> + t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
> +
> t->dma_alignment = max(t->dma_alignment, b->dma_alignment);
>
> /* Set non-power-of-2 compatible chunk_sectors boundary */
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 023649fe2476..989acd8abd98 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7730,6 +7730,7 @@ static int raid5_set_limits(struct mddev *mddev)
> lim.io_min = mddev->chunk_sectors << 9;
> lim.io_opt = lim.io_min * (conf->raid_disks - conf->max_degraded);
> lim.features |= BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE;
> + lim.flags |= BLK_FLAG_STACK_IO_OPT;
> lim.discard_granularity = stripe;
> lim.max_write_zeroes_sectors = 0;
> mddev_stack_rdev_limits(mddev, &lim, 0);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 95886b404b16..a22c7cea9836 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -366,6 +366,9 @@ typedef unsigned int __bitwise blk_flags_t;
> /* passthrough command IO accounting */
> #define BLK_FLAG_IOSTATS_PASSTHROUGH ((__force blk_flags_t)(1u << 2))
>
> +/* ignore underlying non-stack devices io_opt */
> +#define BLK_FLAG_STACK_IO_OPT ((__force blk_flags_t)(1u << 3))
> +
> struct queue_limits {
> blk_features_t features;
> blk_flags_t flags;
Kind regards,
Paul
* Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt
2025-08-17 15:26 [PATCH 1/2] block: ignore underlying non-stack devices io_opt colyli
2025-08-17 15:26 ` [PATCH 2/2] md: split bio by io_opt size in md_submit_bio() colyli
2025-08-17 18:37 ` [PATCH 1/2] block: ignore underlying non-stack devices io_opt Paul Menzel
@ 2025-08-18 1:14 ` Yu Kuai
2025-08-18 2:51 ` Damien Le Moal
3 siblings, 0 replies; 21+ messages in thread
From: Yu Kuai @ 2025-08-18 1:14 UTC (permalink / raw)
To: colyli, linux-raid; +Cc: linux-block, yukuai (C)
Hi,
On 2025/08/17 23:26, colyli@kernel.org wrote:
> From: Coly Li <colyli@kernel.org>
>
> This patch adds a new BLK_FLAG_STACK_IO_OPT for stack block device. If a
> stack block device like md raid5 declares its io_opt when don't want
> blk_stack_limits() to change it with io_opt of underlying non-stack
> block devices, BLK_FLAG_STACK_IO_OPT can be set on limits.flags. Then in
> blk_stack_limits(), lcm_not_zero(t->io_opt, b->io_opt) will be avoided.
>
It's better to refer to the thread:
https://lore.kernel.org/all/ywsfp3lqnijgig6yrlv2ztxram6ohf5z4yfeebswjkvp2dzisd@f5ikoyo3sfq5/
where it's noted that SCSI and mdraid have different definitions of io_opt.
> For md raid5, it is necessary to keep a proper io_opt size for better
> I/O thoughput.
>
> Signed-off-by: Coly Li <colyli@kernel.org>
> ---
> block/blk-settings.c | 6 +++++-
> drivers/md/raid5.c | 1 +
> include/linux/blkdev.h | 3 +++
> 3 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 07874e9b609f..46ee538b2be9 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -782,6 +782,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> t->features &= ~BLK_FEAT_POLL;
>
> t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
> + t->flags |= (b->flags & BLK_FLAG_STACK_IO_OPT);
>
> t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
> t->max_user_sectors = min_not_zero(t->max_user_sectors,
> @@ -839,7 +840,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> b->physical_block_size);
>
> t->io_min = max(t->io_min, b->io_min);
> - t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
> + if (!t->io_opt || !(t->flags & BLK_FLAG_STACK_IO_OPT) ||
> + (b->flags & BLK_FLAG_STACK_IO_OPT))
> + t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
> +
> t->dma_alignment = max(t->dma_alignment, b->dma_alignment);
>
> /* Set non-power-of-2 compatible chunk_sectors boundary */
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 023649fe2476..989acd8abd98 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7730,6 +7730,7 @@ static int raid5_set_limits(struct mddev *mddev)
> lim.io_min = mddev->chunk_sectors << 9;
> lim.io_opt = lim.io_min * (conf->raid_disks - conf->max_degraded);
> lim.features |= BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE;
> + lim.flags |= BLK_FLAG_STACK_IO_OPT;
> lim.discard_granularity = stripe;
> lim.max_write_zeroes_sectors = 0;
> mddev_stack_rdev_limits(mddev, &lim, 0);
And I think raid0/raid1/raid10 should all set this flag as well.
Thanks,
Kuai
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 95886b404b16..a22c7cea9836 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -366,6 +366,9 @@ typedef unsigned int __bitwise blk_flags_t;
> /* passthrough command IO accounting */
> #define BLK_FLAG_IOSTATS_PASSTHROUGH ((__force blk_flags_t)(1u << 2))
>
> +/* ignore underlying non-stack devices io_opt */
> +#define BLK_FLAG_STACK_IO_OPT ((__force blk_flags_t)(1u << 3))
> +
> struct queue_limits {
> blk_features_t features;
> blk_flags_t flags;
>
* Re: [PATCH 2/2] md: split bio by io_opt size in md_submit_bio()
2025-08-17 15:26 ` [PATCH 2/2] md: split bio by io_opt size in md_submit_bio() colyli
@ 2025-08-18 1:38 ` Yu Kuai
2025-08-18 8:01 ` Christoph Hellwig
2025-08-18 9:51 ` John Garry
2 siblings, 0 replies; 21+ messages in thread
From: Yu Kuai @ 2025-08-18 1:38 UTC (permalink / raw)
To: colyli, linux-raid; +Cc: linux-block, yukuai (C)
Hi,
On 2025/08/17 23:26, colyli@kernel.org wrote:
> From: Coly Li <colyli@kernel.org>
>
> Currently in md_submit_bio() the incoming request bio is split by
> bio_split_to_limits() which makes sure the bio won't exceed
> max_hw_sectors of a specific raid level before senting into its
> .make_request method.
>
> For raid level 4/5/6 such split method might be problematic and hurt
> large read/write perforamnce. Because limits.max_hw_sectors are not
> always aligned to limits.io_opt size, the split bio won't be full
> stripes covered on all data disks, and will introduce extra read-in I/O.
> Even the bio's bi_sector is aligned to limits.io_opt size and large
> enough, the resulted split bio is not size-friendly to corresponding
> raid456 level.
>
> This patch introduces bio_split_by_io_opt() to solve the above issue,
> 1, If the incoming bio is not limits.io_opt aligned, split the non-
> aligned head part. Then the next one will be aligned.
> 2, If the imcoming bio is limits.io_opt aligned, and split is necessary,
> then try to split a by multiple of limits.io_opt but not exceed
> limits.max_hw_sectors.
>
> Then for large bio, the sligned split part will be full-stripes covered
> to all data disks, no extra read-in I/Os when rmw_level is 0. And for
> rmw_level > 0 condistions, the limits.io_opt aligned bios are welcomed
> for performace as well.
>
> This patch only tests on 8 disks raid5 array with 64KiB chunk size.
> By this patch, 64KiB chunk size for a 8 disks raid5 array, sequential
> write performance increases from 900MiB/s to 1.1GiB/s by fio bs=10M.
> If fio bs=488K (exact limits.io_opt size) the peak sequential write
> throughput can reach 1.51GiB/s.
>
> Signed-off-by: Coly Li <colyli@kernel.org>
> ---
> drivers/md/md.c | 51 +++++++++++++++++++++++++++++++++++++++++++++-
> drivers/md/raid5.c | 6 +++++-
> 2 files changed, 55 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index ac85ec73a409..d0d4d05150fe 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -426,6 +426,55 @@ bool md_handle_request(struct mddev *mddev, struct bio *bio)
> }
> EXPORT_SYMBOL(md_handle_request);
>
> +/**
> + * For raid456 read/write request, if bio LBA isn't aligned tot io_opt,
> + * split the non io_opt aligned header, to make the second part's LBA be
> + * aligned to io_opt. Otherwise still call bio_split_to_limits() to
> + * handle bio split with queue limits.
> + */
> +static struct bio *bio_split_by_io_opt(struct bio *bio)
> +{
> + sector_t io_opt_sectors, start, offset;
> + struct queue_limits lim;
> + struct mddev *mddev;
> + struct bio *split;
> + int level;
> +
> + mddev = bio->bi_bdev->bd_disk->private_data;
> + level = mddev->level;
> +
> + /* Only handle read456 read/write requests */
> + if (level == 1 || level == 10 || level == 0 || level == LEVEL_LINEAR ||
> + (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE))
For raid0/1/10, I feel this change may also be beneficial for I/O
bandwidth, not 100% sure.
> + return bio_split_to_limits(bio);
> +
> + /* In case raid456 chunk size is too large */
> + lim = mddev->gendisk->queue->limits;
> + io_opt_sectors = lim.io_opt >> SECTOR_SHIFT;
> + if (unlikely(io_opt_sectors > lim.max_hw_sectors))
> + return bio_split_to_limits(bio);
> +
> + /* Small request, no need to split */
> + if (bio_sectors(bio) <= io_opt_sectors)
> + return bio;
> +
> + /* Only split the non-io-opt aligned header part */
> + start = bio->bi_iter.bi_sector;
> + offset = sector_div(start, io_opt_sectors);
> + if (offset == 0)
> + return bio_split_to_limits(bio);
> +
> + split = bio_split(bio, (io_opt_sectors - offset), GFP_NOIO,
> + &bio->bi_bdev->bd_disk->bio_split);
> + if (!split)
bio_split() returns an ERR_PTR now.
> + return bio_split_to_limits(bio);
> +
> + split->bi_opf |= REQ_NOMERGE;
> + bio_chain(split, bio);
> + submit_bio_noacct(bio);
> + return split;
> +}
> +
> static void md_submit_bio(struct bio *bio)
> {
> const int rw = bio_data_dir(bio);
> @@ -441,7 +490,7 @@ static void md_submit_bio(struct bio *bio)
> return;
> }
>
> - bio = bio_split_to_limits(bio);
> + bio = bio_split_by_io_opt(bio);
> if (!bio)
> return;
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 989acd8abd98..985fabeeead5 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7759,9 +7759,13 @@ static int raid5_set_limits(struct mddev *mddev)
>
> /*
> * Requests require having a bitmap for each stripe.
> - * Limit the max sectors based on this.
> + * Limit the max sectors based on this. And being
> + * aligned to lim.io_opt for better I/O performance.
> */
> lim.max_hw_sectors = RAID5_MAX_REQ_STRIPES << RAID5_STRIPE_SHIFT(conf);
> + if (lim.max_hw_sectors > lim.io_opt >> SECTOR_SHIFT)
> + lim.max_hw_sectors = rounddown(lim.max_hw_sectors,
> + lim.io_opt >> SECTOR_SHIFT);
For a huge chunk size, I think it'll also make sense to simplify
things in mdraid by setting max_hw_sectors to at least io_opt after
patch 1. Then we only need to consider alignment in the split case.
BTW, consider making this more common in mddev_stack_rdev_limits()?
Thanks,
Kuai
>
> /* No restrictions on the number of segments in the request */
> lim.max_segments = USHRT_MAX;
>
* Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt
2025-08-17 15:26 [PATCH 1/2] block: ignore underlying non-stack devices io_opt colyli
` (2 preceding siblings ...)
2025-08-18 1:14 ` Yu Kuai
@ 2025-08-18 2:51 ` Damien Le Moal
2025-08-18 2:57 ` Yu Kuai
3 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2025-08-18 2:51 UTC (permalink / raw)
To: colyli, linux-raid; +Cc: linux-block, yukuai3
On 8/18/25 12:26 AM, colyli@kernel.org wrote:
> From: Coly Li <colyli@kernel.org>
>
> This patch adds a new BLK_FLAG_STACK_IO_OPT for stack block device. If a
> stack block device like md raid5 declares its io_opt when don't want
> blk_stack_limits() to change it with io_opt of underlying non-stack
> block devices, BLK_FLAG_STACK_IO_OPT can be set on limits.flags. Then in
> blk_stack_limits(), lcm_not_zero(t->io_opt, b->io_opt) will be avoided.
>
> For md raid5, it is necessary to keep a proper io_opt size for better
> I/O thoughput.
>
> Signed-off-by: Coly Li <colyli@kernel.org>
> ---
> block/blk-settings.c | 6 +++++-
> drivers/md/raid5.c | 1 +
> include/linux/blkdev.h | 3 +++
> 3 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 07874e9b609f..46ee538b2be9 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -782,6 +782,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> t->features &= ~BLK_FEAT_POLL;
>
> t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
> + t->flags |= (b->flags & BLK_FLAG_STACK_IO_OPT);
>
> t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
> t->max_user_sectors = min_not_zero(t->max_user_sectors,
> @@ -839,7 +840,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> b->physical_block_size);
>
> t->io_min = max(t->io_min, b->io_min);
> - t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
> + if (!t->io_opt || !(t->flags & BLK_FLAG_STACK_IO_OPT) ||
> + (b->flags & BLK_FLAG_STACK_IO_OPT))
> + t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
> +
> t->dma_alignment = max(t->dma_alignment, b->dma_alignment);
>
> /* Set non-power-of-2 compatible chunk_sectors boundary */
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 023649fe2476..989acd8abd98 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7730,6 +7730,7 @@ static int raid5_set_limits(struct mddev *mddev)
> lim.io_min = mddev->chunk_sectors << 9;
> lim.io_opt = lim.io_min * (conf->raid_disks - conf->max_degraded);
It seems to me that moving this *after* the call to mddev_stack_rdev_limits()
would simply overwrite the io_opt limit coming from stacking and get you the
same result as your patch, but without adding the new limit flags.
I do not think that the use of such flag is good as we definitely do not want
to have more of these. That would make the limit stacking far too complicated
with too many corner cases. Instead, drivers should simply change whatever
limit they do not like. DM has the .io_hints method for that. md should have
something similar.
> lim.features |= BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE;
> + lim.flags |= BLK_FLAG_STACK_IO_OPT;
> lim.discard_granularity = stripe;
> lim.max_write_zeroes_sectors = 0;
> mddev_stack_rdev_limits(mddev, &lim, 0);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 95886b404b16..a22c7cea9836 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -366,6 +366,9 @@ typedef unsigned int __bitwise blk_flags_t;
> /* passthrough command IO accounting */
> #define BLK_FLAG_IOSTATS_PASSTHROUGH ((__force blk_flags_t)(1u << 2))
>
> +/* ignore underlying non-stack devices io_opt */
> +#define BLK_FLAG_STACK_IO_OPT ((__force blk_flags_t)(1u << 3))
> +
> struct queue_limits {
> blk_features_t features;
> blk_flags_t flags;
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt
2025-08-18 2:51 ` Damien Le Moal
@ 2025-08-18 2:57 ` Yu Kuai
2025-08-18 3:18 ` Damien Le Moal
0 siblings, 1 reply; 21+ messages in thread
From: Yu Kuai @ 2025-08-18 2:57 UTC (permalink / raw)
To: Damien Le Moal, colyli, linux-raid; +Cc: linux-block, yukuai (C)
Hi,
On 2025/08/18 10:51, Damien Le Moal wrote:
> On 8/18/25 12:26 AM, colyli@kernel.org wrote:
>> From: Coly Li <colyli@kernel.org>
>>
>> This patch adds a new BLK_FLAG_STACK_IO_OPT for stack block device. If a
>> stack block device like md raid5 declares its io_opt when don't want
>> blk_stack_limits() to change it with io_opt of underlying non-stack
>> block devices, BLK_FLAG_STACK_IO_OPT can be set on limits.flags. Then in
>> blk_stack_limits(), lcm_not_zero(t->io_opt, b->io_opt) will be avoided.
>>
>> For md raid5, it is necessary to keep a proper io_opt size for better
>> I/O thoughput.
>>
>> Signed-off-by: Coly Li <colyli@kernel.org>
>> ---
>> block/blk-settings.c | 6 +++++-
>> drivers/md/raid5.c | 1 +
>> include/linux/blkdev.h | 3 +++
>> 3 files changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 07874e9b609f..46ee538b2be9 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -782,6 +782,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>> t->features &= ~BLK_FEAT_POLL;
>>
>> t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
>> + t->flags |= (b->flags & BLK_FLAG_STACK_IO_OPT);
>>
>> t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
>> t->max_user_sectors = min_not_zero(t->max_user_sectors,
>> @@ -839,7 +840,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>> b->physical_block_size);
>>
>> t->io_min = max(t->io_min, b->io_min);
>> - t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
>> + if (!t->io_opt || !(t->flags & BLK_FLAG_STACK_IO_OPT) ||
>> + (b->flags & BLK_FLAG_STACK_IO_OPT))
>> + t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
>> +
>> t->dma_alignment = max(t->dma_alignment, b->dma_alignment);
>>
>> /* Set non-power-of-2 compatible chunk_sectors boundary */
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 023649fe2476..989acd8abd98 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -7730,6 +7730,7 @@ static int raid5_set_limits(struct mddev *mddev)
>> lim.io_min = mddev->chunk_sectors << 9;
>> lim.io_opt = lim.io_min * (conf->raid_disks - conf->max_degraded);
>
> It seems to me that moving this *after* the call to mddev_stack_rdev_limits()
> would simply overwrite the io_opt limit coming from stacking and get you the
> same result as your patch, but without adding the new limit flags.
This is not enough, we have the case where an array is built on top
of another array, and we still need the lcm_not_zero() to not break
this case.
And I would expect this flag for all the arrays, not just raid5.
Thanks,
Kuai
>
> I do not think that the use of such flag is good as we definitely do not want
> to have more of these. That would make the limit stacking far too complicated
> with too many corner cases. Instead, drivers should simply change whatever
> limit they do not like. DM has the .io_hint method for that. md should have
> something similar.
>
>> lim.features |= BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE;
>> + lim.flags |= BLK_FLAG_STACK_IO_OPT;
>> lim.discard_granularity = stripe;
>> lim.max_write_zeroes_sectors = 0;
>> mddev_stack_rdev_limits(mddev, &lim, 0);
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 95886b404b16..a22c7cea9836 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -366,6 +366,9 @@ typedef unsigned int __bitwise blk_flags_t;
>> /* passthrough command IO accounting */
>> #define BLK_FLAG_IOSTATS_PASSTHROUGH ((__force blk_flags_t)(1u << 2))
>>
>> +/* ignore underlying non-stack devices io_opt */
>> +#define BLK_FLAG_STACK_IO_OPT ((__force blk_flags_t)(1u << 3))
>> +
>> struct queue_limits {
>> blk_features_t features;
>> blk_flags_t flags;
>
>
* Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt
2025-08-18 2:57 ` Yu Kuai
@ 2025-08-18 3:18 ` Damien Le Moal
2025-08-18 3:40 ` Yu Kuai
0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2025-08-18 3:18 UTC (permalink / raw)
To: Yu Kuai, colyli, linux-raid; +Cc: linux-block, yukuai (C)
On 8/18/25 11:57 AM, Yu Kuai wrote:
>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>> index 023649fe2476..989acd8abd98 100644
>>> --- a/drivers/md/raid5.c
>>> +++ b/drivers/md/raid5.c
>>> @@ -7730,6 +7730,7 @@ static int raid5_set_limits(struct mddev *mddev)
>>> lim.io_min = mddev->chunk_sectors << 9;
>>> lim.io_opt = lim.io_min * (conf->raid_disks - conf->max_degraded);
>>
>> It seems to me that moving this *after* the call to mddev_stack_rdev_limits()
>> would simply overwrite the io_opt limit coming from stacking and get you the
>> same result as your patch, but without adding the new limit flags.
>
> This is not enough, we have the case array is build on the top of
> another array, we still need the lcm_not_zero() to not break this case.
> And I would expect this flag for all the arrays, not just raid5.
Nothing prevents you from doing that in the md code. The block layer stacking
limits provides a sensible default. If the block device driver does not like
the default given, it is free to change it for whatever valid reason it has.
As I said, that's what the DM .io_hints target driver method is for.
As for the "expected that flag for all arrays", that is optimistic at best. For
scsi hardware raid, as discussed already, the optimal I/O size is *not* the
stripe size. And good luck with any AHCI-based hardware RAID...
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt
2025-08-18 3:18 ` Damien Le Moal
@ 2025-08-18 3:40 ` Yu Kuai
2025-08-18 5:56 ` Christoph Hellwig
0 siblings, 1 reply; 21+ messages in thread
From: Yu Kuai @ 2025-08-18 3:40 UTC (permalink / raw)
To: Damien Le Moal, Yu Kuai, colyli, linux-raid; +Cc: linux-block, yukuai (C)
Hi,
On 2025/08/18 11:18, Damien Le Moal wrote:
> On 8/18/25 11:57 AM, Yu Kuai wrote:
>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>>> index 023649fe2476..989acd8abd98 100644
>>>> --- a/drivers/md/raid5.c
>>>> +++ b/drivers/md/raid5.c
>>>> @@ -7730,6 +7730,7 @@ static int raid5_set_limits(struct mddev *mddev)
>>>> lim.io_min = mddev->chunk_sectors << 9;
>>>> lim.io_opt = lim.io_min * (conf->raid_disks - conf->max_degraded);
>>>
>>> It seems to me that moving this *after* the call to mddev_stack_rdev_limits()
>>> would simply overwrite the io_opt limit coming from stacking and get you the
>>> same result as your patch, but without adding the new limit flags.
>>
>> This is not enough, we have the case array is build on the top of
>> another array, we still need the lcm_not_zero() to not break this case.
>> And I would expect this flag for all the arrays, not just raid5.
>
> Nothing prevents you from doing that in the md code. The block layer stacking
> limits provide a sensible default. If the block device driver does not like
> the default given, it is free to change it for whatever valid reason it has.
> As I said, that's what the DM .io_hints target driver method is for.
Then the code will be much more complex inside md code, and we'll have to
reimplement the stack limits logic, considering whether each rdev
is already a stacked rdev. I really do not like this ...
And in theory, we can't handle this case just in md code:
scsi disk -> mdarray -> device mapper/loop/nbd ... -> mdarray
>
> As for the "expected that flag for all arrays", that is optimistic at best. For
> scsi hardware raid, as discussed already, the optimal I/O size is *not* the
> stripe size. And good luck with any AHCI-based hardware RAID...
>
Yes, I just mean mdraid arrays, we absolutely don't want to touch other
drivers.
Thanks,
Kuai
* Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt
2025-08-18 3:40 ` Yu Kuai
@ 2025-08-18 5:56 ` Christoph Hellwig
2025-08-18 6:14 ` Yu Kuai
0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2025-08-18 5:56 UTC (permalink / raw)
To: Yu Kuai; +Cc: Damien Le Moal, colyli, linux-raid, linux-block, yukuai (C)
On Mon, Aug 18, 2025 at 11:40:54AM +0800, Yu Kuai wrote:
> Then the code will be much more complex inside md code, and we'll have to
> reimplement the stack limits logic, considering whether each rdev
> is already a stacked rdev. I really do not like this ...
You don't. You get the stacked value as input and then simply decide
if you want to take it or override it. The easiest way to do the
latter is to just stash it away in a local variable before calling
queue_limits_stack_bdev and then reapply it afterwards.
> And in theory, we can't handle this case just in md code:
>
> scsi disk -> mdarray -> device mapper/loop/nbd ... -> mdarray
I don't see how the code in this series could handle that either.
* Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt
2025-08-18 5:56 ` Christoph Hellwig
@ 2025-08-18 6:14 ` Yu Kuai
2025-08-18 6:18 ` Christoph Hellwig
0 siblings, 1 reply; 21+ messages in thread
From: Yu Kuai @ 2025-08-18 6:14 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: Damien Le Moal, colyli, linux-raid, linux-block, yukuai (C)
Hi,
On 2025/08/18 13:56, Christoph Hellwig wrote:
> On Mon, Aug 18, 2025 at 11:40:54AM +0800, Yu Kuai wrote:
>> Then the code will be much more complex inside md code, and we'll have to
>> reimplement the stack limits logic, considering whether each rdev
>> is already a stacked rdev. I really do not like this ...
>
> You don't. You get the stacked value as input and then simply decide
> if you want to take it or override it. The easiest way to do the
> latter is to just stash it away in a local variable before calling
> queue_limits_stack_bdev and then reapplying it afterwards.
>
>> And in theory, we can't handle this case just in md code:
>>
>> scsi disk -> mdarray -> device mapper/loop/nbd ... -> mdarray
>
> I don't see how the code in this series could handle that either.
Please take a look at the first patch, nothing special, the new flag
will be passed to the top device.
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 07874e9b609f..46ee538b2be9 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -782,6 +782,7 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->features &= ~BLK_FEAT_POLL;
t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
+ t->flags |= (b->flags & BLK_FLAG_STACK_IO_OPT);
t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
t->max_user_sectors = min_not_zero(t->max_user_sectors,
@@ -839,7 +840,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
b->physical_block_size);
t->io_min = max(t->io_min, b->io_min);
- t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
+ if (!t->io_opt || !(t->flags & BLK_FLAG_STACK_IO_OPT) ||
+ (b->flags & BLK_FLAG_STACK_IO_OPT))
+ t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
+
For the case scsi->mdraid:
t->io_opt is set, the flag is set for t and not for b, hence
lcm_not_zero() is skipped
For the case mdraid->dm/loop/nbd ... -> mdarray:
the flag will be set for both t and b, lcm_not_zero() will not be
skipped.
Thanks,
Kuai
* Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt
2025-08-18 6:14 ` Yu Kuai
@ 2025-08-18 6:18 ` Christoph Hellwig
2025-08-18 6:31 ` Yu Kuai
0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2025-08-18 6:18 UTC (permalink / raw)
To: Yu Kuai
Cc: Christoph Hellwig, Damien Le Moal, colyli, linux-raid,
linux-block, yukuai (C)
On Mon, Aug 18, 2025 at 02:14:06PM +0800, Yu Kuai wrote:
> Please take a look at the first patch, nothing special, the new flag
> will be passed to the top device.
But passing it on will be incorrect in many cases, e.g. for any
write caching solution. And that is a much more common use case
than stacking different raid level using block layer stacking.
* Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt
2025-08-18 6:18 ` Christoph Hellwig
@ 2025-08-18 6:31 ` Yu Kuai
2025-08-18 8:00 ` Christoph Hellwig
0 siblings, 1 reply; 21+ messages in thread
From: Yu Kuai @ 2025-08-18 6:31 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: Damien Le Moal, colyli, linux-raid, linux-block, yukuai (C)
Hi,
On 2025/08/18 14:18, Christoph Hellwig wrote:
> On Mon, Aug 18, 2025 at 02:14:06PM +0800, Yu Kuai wrote:
>> Please take a look at the first patch, nothing special, the new flag
>> will be passed to the top device.
>
> But passing it on will be incorrect in many cases, e.g. for any
> write caching solution. And that is a much more common use case
> than stacking different raid level using block layer stacking.
I don't quite understand why it's incorrect for a write caching solution,
can you please explain in detail? AFAIK, the behaviour is only changed
for the first mdraid device in the stacking chain.
Thanks,
Kuai
* Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt
2025-08-18 6:31 ` Yu Kuai
@ 2025-08-18 8:00 ` Christoph Hellwig
2025-08-18 8:10 ` Yu Kuai
0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2025-08-18 8:00 UTC (permalink / raw)
To: Yu Kuai
Cc: Christoph Hellwig, Damien Le Moal, colyli, linux-raid,
linux-block, yukuai (C)
On Mon, Aug 18, 2025 at 02:31:20PM +0800, Yu Kuai wrote:
> Hi,
>
> On 2025/08/18 14:18, Christoph Hellwig wrote:
> > On Mon, Aug 18, 2025 at 02:14:06PM +0800, Yu Kuai wrote:
> > > Please take a look at the first patch, nothing special, the new flag
> > > will be passed to the top device.
> >
> > But passing it on will be incorrect in many cases, e.g. for any
> > write caching solution. And that is a much more common use case
> > than stacking different raid level using block layer stacking.
>
> I don't quite understand why it's incorrect for a write caching solution,
> can you please explain in detail? AFAIK, the behaviour is only changed
> for the first mdraid device in the stacking chain.
The way I read the patch, the flag is inherited if any underlying
device sets it.
Now if you stack something that buffers most I/O the md raid limits
aren't really that relevant, and you'd rather expose the limits
for the writeback or read caching.
* Re: [PATCH 2/2] md: split bio by io_opt size in md_submit_bio()
2025-08-17 15:26 ` [PATCH 2/2] md: split bio by io_opt size in md_submit_bio() colyli
2025-08-18 1:38 ` Yu Kuai
@ 2025-08-18 8:01 ` Christoph Hellwig
2025-08-18 9:51 ` John Garry
2 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2025-08-18 8:01 UTC (permalink / raw)
To: colyli; +Cc: linux-raid, linux-block, yukuai3
On Sun, Aug 17, 2025 at 11:26:45PM +0800, colyli@kernel.org wrote:
> This patch introduces bio_split_by_io_opt() to solve the above issue,
> 1, If the incoming bio is not limits.io_opt aligned, split the non-
> aligned head part. Then the next one will be aligned.
> 2, If the incoming bio is limits.io_opt aligned, and a split is necessary,
> then try to split by a multiple of limits.io_opt but not exceed
> limits.max_hw_sectors.
>
> Then for a large bio, the aligned split part will be full stripes covering
> all data disks, with no extra read-in I/Os when rmw_level is 0. And for
> rmw_level > 0 conditions, the limits.io_opt aligned bios are welcome
> for performance as well.
>
> This patch was only tested on an 8-disk raid5 array with 64KiB chunk size.
> With this patch, for a 64KiB chunk size on an 8-disk raid5 array, sequential
> write performance increases from 900MiB/s to 1.1GiB/s with fio bs=10M.
> If fio bs=448K (the exact limits.io_opt size) the peak sequential write
> throughput can reach 1.51GiB/s.
All this code duplication seems like a bad idea. What is the problem
with setting max_hw_sectors to a stripe size aligned value and then
letting bio_split_by_io_opt do the work?
* Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt
2025-08-18 8:00 ` Christoph Hellwig
@ 2025-08-18 8:10 ` Yu Kuai
2025-08-18 8:14 ` Christoph Hellwig
0 siblings, 1 reply; 21+ messages in thread
From: Yu Kuai @ 2025-08-18 8:10 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: Damien Le Moal, colyli, linux-raid, linux-block, yukuai (C)
Hi,
On 2025/08/18 16:00, Christoph Hellwig wrote:
> On Mon, Aug 18, 2025 at 02:31:20PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2025/08/18 14:18, Christoph Hellwig 写道:
>>> On Mon, Aug 18, 2025 at 02:14:06PM +0800, Yu Kuai wrote:
>>>> Please take a look at the first patch, nothing special, the new flag
>>>> will be passed to the top device.
>>>
>>> But passing it on will be incorrect in many cases, e.g. for any
>>> write caching solution. And that is a much more common use case
>>> than stacking different raid level using block layer stacking.
>>
>> I don't quite understand why it's incorrect for a write caching solution,
>> can you please explain in detail? AFAIK, the behaviour is only changed
>> for the first mdraid device in the stacking chain.
>
> The way I read the patch, the flag is inherited if any underlying
> device sets it.
>
> Now if you stack something that buffers most I/O the md raid limits
> aren't really that relevant, and you'd rather expose the limits
> for the writeback or read caching.
Why? We just set the flag for mdraid disks first, and then it is inherited
by the top devices stacked on mdraid, so the md raid limits should always
be relevant. I still don't understand the problem you described :(
Thanks,
Kuai
* Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt
2025-08-18 8:10 ` Yu Kuai
@ 2025-08-18 8:14 ` Christoph Hellwig
2025-08-18 8:57 ` Yu Kuai
0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2025-08-18 8:14 UTC (permalink / raw)
To: Yu Kuai
Cc: Christoph Hellwig, Damien Le Moal, colyli, linux-raid,
linux-block, yukuai (C)
On Mon, Aug 18, 2025 at 04:10:32PM +0800, Yu Kuai wrote:
> Why? We just set the flag for mdraid disks first, and then it is inherited
> by the top devices stacked on mdraid, so the md raid limits should always
> be relevant. I still don't understand the problem you described :(
The point is that how mdraid calculates io_opt is somewhere
between irrelevant and wrong for any layer above. The layers above
just care about the value. And to calculate the value you don't
need a special flag. mdraid can simply override io_opt (or better
max_hw_sectors) after stacking the lower-level devices with a few
lines of code (and hopefully a good comment).
* Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt
2025-08-18 8:14 ` Christoph Hellwig
@ 2025-08-18 8:57 ` Yu Kuai
2025-08-18 9:08 ` Christoph Hellwig
0 siblings, 1 reply; 21+ messages in thread
From: Yu Kuai @ 2025-08-18 8:57 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: Damien Le Moal, colyli, linux-raid, linux-block, yukuai (C)
Hi,
On 2025/08/18 16:14, Christoph Hellwig wrote:
> On Mon, Aug 18, 2025 at 04:10:32PM +0800, Yu Kuai wrote:
>> Why? We just set the flag for mdraid disks first, and then it is inherited
>> by the top devices stacked on mdraid, so the md raid limits should always
>> be relevant. I still don't understand the problem you described :(
>
> The point is that how mdraid calculates io_opt is somewhere
> between irrelevant and wrong for any layer above. The layers above
> just care about the value. And to calculate the value you don't
> need a special flag. mdraid can simply override io_opt (or better
> max_hw_sectors) after stacking the lower-level devices with a few
> lines of code (and hopefully a good comment).
Ok, let's ignore the case where there are other drivers in the stacking
chain and consider just this one: mdraid on top of another mdraid, which
we already have. And in order not to introduce a regression, we can do
this inside mdraid.
Thanks,
Kuai
* Re: [PATCH 1/2] block: ignore underlying non-stack devices io_opt
2025-08-18 8:57 ` Yu Kuai
@ 2025-08-18 9:08 ` Christoph Hellwig
0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2025-08-18 9:08 UTC (permalink / raw)
To: Yu Kuai
Cc: Christoph Hellwig, Damien Le Moal, colyli, linux-raid,
linux-block, yukuai (C)
On Mon, Aug 18, 2025 at 04:57:30PM +0800, Yu Kuai wrote:
> Ok, let's ignore the case there are other drivers in the stack chains,
> just in this case: mdraid on the top of another mdraid, which we already
> have. And in order not to introduce regression, we can do this inside
> mdraid.
Whatever you want to do in that case you can do by looking at
BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE.
But I'm not even sure what you want to do. Assume you have two raid5
arrays stacked on top of each other, using the same chunk size, but a
different, non-multiple number of stripe units. The only thing you could
do is to multiply the values, but I doubt anything above will take the
resulting number seriously. And depending on the stripe size of the lower
raid5, the upper one might not even be capable of feeding it large enough I/O.
* Re: [PATCH 2/2] md: split bio by io_opt size in md_submit_bio()
2025-08-17 15:26 ` [PATCH 2/2] md: split bio by io_opt size in md_submit_bio() colyli
2025-08-18 1:38 ` Yu Kuai
2025-08-18 8:01 ` Christoph Hellwig
@ 2025-08-18 9:51 ` John Garry
[not found] ` <6DA25F37-26B3-4912-90A3-346CFD9A6EEA@coly.li>
2 siblings, 1 reply; 21+ messages in thread
From: John Garry @ 2025-08-18 9:51 UTC (permalink / raw)
To: colyli, linux-raid; +Cc: linux-block, yukuai3
On 17/08/2025 16:26, colyli@kernel.org wrote:
> From: Coly Li <colyli@kernel.org>
>
> Currently in md_submit_bio() the incoming request bio is split by
> bio_split_to_limits() which makes sure the bio won't exceed
> max_hw_sectors of a specific raid level before sending it into its
> .make_request method.
>
> For raid levels 4/5/6 such a split method might be problematic and hurt
> large read/write performance. Because limits.max_hw_sectors is not
> always aligned to the limits.io_opt size, the split bio won't be
> full-stripe covered on all data disks, and will introduce extra read-in I/O.
> Even if the bio's bi_sector is aligned to the limits.io_opt size and large
> enough, the resulting split bio is not size-friendly to the corresponding
> raid456 level.
>
> This patch introduces bio_split_by_io_opt() to solve the above issue,
> 1, If the incoming bio is not limits.io_opt aligned, split the non-
> aligned head part. Then the next one will be aligned.
> 2, If the incoming bio is limits.io_opt aligned, and a split is necessary,
> then try to split by a multiple of limits.io_opt but not exceed
> limits.max_hw_sectors.
>
this sounds like chunk_sectors functionality, apart from "split by a
multiple of limits.io_opt"
> Then for a large bio, the aligned split part will be full stripes covering
> all data disks, with no extra read-in I/Os when rmw_level is 0. And for
> rmw_level > 0 conditions, the limits.io_opt aligned bios are welcome
> for performance as well.
>
> This patch was only tested on an 8-disk raid5 array with 64KiB chunk size.
> With this patch, for a 64KiB chunk size on an 8-disk raid5 array, sequential
> write performance increases from 900MiB/s to 1.1GiB/s with fio bs=10M.
> If fio bs=448K (the exact limits.io_opt size) the peak sequential write
> throughput can reach 1.51GiB/s.
>
> Signed-off-by: Coly Li <colyli@kernel.org>
> ---
> drivers/md/md.c | 51 +++++++++++++++++++++++++++++++++++++++++++++-
> drivers/md/raid5.c | 6 +++++-
> 2 files changed, 55 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index ac85ec73a409..d0d4d05150fe 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -426,6 +426,55 @@ bool md_handle_request(struct mddev *mddev, struct bio *bio)
> }
> EXPORT_SYMBOL(md_handle_request);
>
> +/**
> + * For raid456 read/write requests, if the bio LBA isn't aligned to io_opt,
> + * split the non-io_opt-aligned head part, to make the second part's LBA be
> + * aligned to io_opt. Otherwise still call bio_split_to_limits() to
> + * handle bio split with queue limits.
> + */
> +static struct bio *bio_split_by_io_opt(struct bio *bio)
> +{
> + sector_t io_opt_sectors, start, offset;
> + struct queue_limits lim;
> + struct mddev *mddev;
> + struct bio *split;
> + int level;
> +
> + mddev = bio->bi_bdev->bd_disk->private_data;
> + level = mddev->level;
> +
> + /* Only handle raid456 read/write requests */
> + if (level == 1 || level == 10 || level == 0 || level == LEVEL_LINEAR ||
> + (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE))
> + return bio_split_to_limits(bio);
this should be taken outside this function, as we are not splitting to
io_opt here
> +
> + /* In case raid456 chunk size is too large */
> + lim = mddev->gendisk->queue->limits;
> + io_opt_sectors = lim.io_opt >> SECTOR_SHIFT;
> + if (unlikely(io_opt_sectors > lim.max_hw_sectors))
> + return bio_split_to_limits(bio);
> +
> + /* Small request, no need to split */
> + if (bio_sectors(bio) <= io_opt_sectors)
> + return bio;
According to 1, above, we should split this if bio->bi_iter.bi_sector is
not aligned, yet we possibly don't here
> +
> + /* Only split the non-io-opt aligned header part */
> + start = bio->bi_iter.bi_sector;
> + offset = sector_div(start, io_opt_sectors);
> + if (offset == 0)
> + return bio_split_to_limits(bio);
this does not seem to match the description in 2, above, where we have
"and split is necessary".
> +
> + split = bio_split(bio, (io_opt_sectors - offset), GFP_NOIO,
> + &bio->bi_bdev->bd_disk->bio_split);
> + if (!split)
that check is incorrect. It should be IS_ERR(). So I doubt the
functionality described earlier for handling "and split is necessary".
> + return bio_split_to_limits(bio);
> +
> + split->bi_opf |= REQ_NOMERGE;
> + bio_chain(split, bio);
> + submit_bio_noacct(bio);
> + return split;
> +}
> +
> static void md_submit_bio(struct bio *bio)
> {
> const int rw = bio_data_dir(bio);
> @@ -441,7 +490,7 @@ static void md_submit_bio(struct bio *bio)
> return;
> }
>
> - bio = bio_split_to_limits(bio);
> + bio = bio_split_by_io_opt(bio);
> if (!bio)
> return;
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 989acd8abd98..985fabeeead5 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7759,9 +7759,13 @@ static int raid5_set_limits(struct mddev *mddev)
>
> /*
> * Requests require having a bitmap for each stripe.
> - * Limit the max sectors based on this.
> + * Limit the max sectors based on this. And being
> + * aligned to lim.io_opt for better I/O performance.
> */
> lim.max_hw_sectors = RAID5_MAX_REQ_STRIPES << RAID5_STRIPE_SHIFT(conf);
> + if (lim.max_hw_sectors > lim.io_opt >> SECTOR_SHIFT)
> + lim.max_hw_sectors = rounddown(lim.max_hw_sectors,
> + lim.io_opt >> SECTOR_SHIFT);
>
> /* No restrictions on the number of segments in the request */
> lim.max_segments = USHRT_MAX;
* Re: [PATCH 2/2] md: split bio by io_opt size in md_submit_bio()
[not found] ` <6DA25F37-26B3-4912-90A3-346CFD9A6EEA@coly.li>
@ 2025-08-18 12:20 ` John Garry
0 siblings, 0 replies; 21+ messages in thread
From: John Garry @ 2025-08-18 12:20 UTC (permalink / raw)
To: Coly Li; +Cc: colyli, linux-raid, linux-block, yukuai3
On 18/08/2025 11:26, Coly Li wrote:
>>> static struct bio *bio_split_by_io_opt(struct bio *bio)
>>> +{
>>> + sector_t io_opt_sectors, start, offset;
>>> + struct queue_limits lim;
>>> + struct mddev *mddev;
>>> + struct bio *split;
>>> + int level;
>>> +
>>> + mddev = bio->bi_bdev->bd_disk->private_data;
>>> + level = mddev->level;
>>> +
>>> + /* Only handle raid456 read/write requests */
>>> + if (level == 1 || level == 10 || level == 0 || level == LEVEL_LINEAR ||
>>> + (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE))
>>> + return bio_split_to_limits(bio);
>> this should be taken outside this function, as we are not splitting to io_opt here
>>
>
> It is not split to io_opt, it is split to max_hw_sectors. And the value of max_hw_sectors is aligned to io_opt.
>
>
Where is alignment of max_hw_sectors and io_opt enforced? raid1 does not
even explicitly set max_hw_sectors or io_opt for the top device. I also
note that raid10 does not set max_hw_sectors, which I doubt is proper.
And md-linear does not set io_opt AFAICS.
>
>>> +
>>> + /* In case raid456 chunk size is too large */
>>> + lim = mddev->gendisk->queue->limits;
>>> + io_opt_sectors = lim.io_opt >> SECTOR_SHIFT;
>>> + if (unlikely(io_opt_sectors > lim.max_hw_sectors))
>>> + return bio_split_to_limits(bio);
>>> +
>>> + /* Small request, no need to split */
>>> + if (bio_sectors(bio) <= io_opt_sectors)
>>> + return bio;
>> According to 1, above, we should split this if bio->bi_iter.bi_sector is not aligned, yet we possibly don't here
>>
> The split is only for performance; for a bio that is too small, whether or not it is split obviously doesn't matter for performance.
Then it should be part of the documented rules in the commit message.
Thanks,
John