linux-block.vger.kernel.org archive mirror
* [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
@ 2025-07-15 17:59 colyli
  0 siblings, 0 replies; 19+ messages in thread
From: colyli @ 2025-07-15 17:59 UTC (permalink / raw)
  To: linux-raid
  Cc: linux-block, Coly Li, Yu Kuai, Xiao Ni, Hannes Reinecke,
	Martin Wilck

From: Coly Li <colyli@kernel.org>

Currently in md_submit_bio() the incoming request bio is split by
bio_split_to_limits(), which makes sure the bio won't exceed
max_hw_sectors of a specific raid level before it is sent into its
.make_request method.

For raid levels 4/5/6 such a split method can be problematic and hurt
large read/write performance. Because limits.max_hw_sectors is not
always a multiple of the limits.io_opt size, the split bio won't cover
full stripes on all data disks, and will introduce extra read-in I/O.
Even if the bio's bi_sector is aligned to the limits.io_opt size and
the bio is large enough, the resulting split bio is not size-friendly
to the corresponding raid456 level.

This patch introduces bio_split_by_io_opt() to solve the above issue:
1, If the incoming bio is not limits.io_opt aligned, split off the non-
   aligned head part. Then the next part will be aligned.
2, If the incoming bio is limits.io_opt aligned and a split is
   necessary, split by a multiple of limits.io_opt that does not exceed
   limits.max_hw_sectors.

Then for a large bio, the aligned split parts will cover full stripes
on all data disks, with no extra read-in I/Os when rmw_level is 0. For
rmw_level > 0 conditions, the limits.io_opt aligned bios are welcome
for performance as well.

This RFC patch is only tested on an 8-disk raid5 array with a 64KiB
chunk size. With this patch, sequential write performance on that array
increases from 900MiB/s to 1.1GiB/s with fio bs=10M. With fio bs=448K
(the exact limits.io_opt size) the peak sequential write throughput can
reach 1.51GiB/s.
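
As an illustrative calculation for this setup (the 4KiB default stripe
size and the 256-stripe cap RAID5_MAX_REQ_STRIPES are assumptions here,
taken from the raid5 defaults discussed later in this thread):

  io_opt         = (8 - 1) data disks * 64KiB chunk = 896 sectors  = 448KiB
  max_hw_sectors = 256 stripes * 8 sectors/stripe   = 2048 sectors = 1MiB
  2048 % 896     = 256, so a split at the max_hw_sectors boundary lands
                   in the middle of a stripe.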

Signed-off-by: Coly Li <colyli@kernel.org>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Martin Wilck <mwilck@suse.com>
---
 drivers/md/md.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0f03b21e66e4..363cff633af3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -426,6 +426,67 @@ bool md_handle_request(struct mddev *mddev, struct bio *bio)
 }
 EXPORT_SYMBOL(md_handle_request);
 
+static struct bio *bio_split_by_io_opt(struct bio *bio)
+{
+	sector_t io_opt_sectors, sectors, n;
+	struct queue_limits lim;
+	struct mddev *mddev;
+	struct bio *split;
+	int level;
+
+	mddev = bio->bi_bdev->bd_disk->private_data;
+	level = mddev->level;
+	if (level == 1 || level == 10 || level == 0 || level == LEVEL_LINEAR)
+		return bio_split_to_limits(bio);
+
+	lim = mddev->gendisk->queue->limits;
+	io_opt_sectors = min3(bio_sectors(bio), lim.io_opt >> SECTOR_SHIFT,
+			      lim.max_hw_sectors);
+
+	/* No need to split */
+	if (bio_sectors(bio) == io_opt_sectors)
+		return bio;
+
+	n = bio->bi_iter.bi_sector;
+	sectors = do_div(n, io_opt_sectors);
+	/* Aligned to io_opt size and no need to split for raid456 */
+	if (!sectors && (bio_sectors(bio) <=  lim.max_hw_sectors))
+		return bio;
+
+	if (sectors) {
+		/**
+		 * Not aligned to io_opt, split
+		 * non-aligned head part.
+		 */
+		sectors = io_opt_sectors - sectors;
+	} else {
+		/**
+		 * Aligned to io_opt, split to the largest multiple
+		 * of io_opt within max_hw_sectors, to make full
+		 * stripe write/read for underlying raid456 levels.
+		 */
+		n = lim.max_hw_sectors;
+		do_div(n, io_opt_sectors);
+		sectors = n * io_opt_sectors;
+	}
+
+	/* Almost won't happen */
+	if (unlikely(sectors >= bio_sectors(bio))) {
+		pr_warn("%s raid level %d: sectors %llu >= bio_sectors %u, not split\n",
+			__func__, level, sectors, bio_sectors(bio));
+		return bio;
+	}
+
+	split = bio_split(bio, sectors, GFP_NOIO,
+			  &bio->bi_bdev->bd_disk->bio_split);
+	if (!split)
+		return bio;
+	split->bi_opf |= REQ_NOMERGE;
+	bio_chain(split, bio);
+	submit_bio_noacct(bio);
+	return split;
+}
+
 static void md_submit_bio(struct bio *bio)
 {
 	const int rw = bio_data_dir(bio);
@@ -441,7 +502,7 @@ static void md_submit_bio(struct bio *bio)
 		return;
 	}
 
-	bio = bio_split_to_limits(bio);
+	bio = bio_split_by_io_opt(bio);
 	if (!bio)
 		return;
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
@ 2025-07-15 18:02 colyli
  2025-07-16  1:46 ` Yu Kuai
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: colyli @ 2025-07-15 18:02 UTC (permalink / raw)
  To: linux-raid
  Cc: linux-block, Coly Li, Yu Kuai, Xiao Ni, Hannes Reinecke,
	Martin Wilck, Christoph Hellwig, Keith Busch

From: Coly Li <colyli@kernel.org>

Currently in md_submit_bio() the incoming request bio is split by
bio_split_to_limits(), which makes sure the bio won't exceed
max_hw_sectors of a specific raid level before it is sent into its
.make_request method.

For raid levels 4/5/6 such a split method can be problematic and hurt
large read/write performance. Because limits.max_hw_sectors is not
always a multiple of the limits.io_opt size, the split bio won't cover
full stripes on all data disks, and will introduce extra read-in I/O.
Even if the bio's bi_sector is aligned to the limits.io_opt size and
the bio is large enough, the resulting split bio is not size-friendly
to the corresponding raid456 level.

This patch introduces bio_split_by_io_opt() to solve the above issue:
1, If the incoming bio is not limits.io_opt aligned, split off the non-
   aligned head part. Then the next part will be aligned.
2, If the incoming bio is limits.io_opt aligned and a split is
   necessary, split by a multiple of limits.io_opt that does not exceed
   limits.max_hw_sectors.

Then for a large bio, the aligned split parts will cover full stripes
on all data disks, with no extra read-in I/Os when rmw_level is 0. For
rmw_level > 0 conditions, the limits.io_opt aligned bios are welcome
for performance as well.

This RFC patch is only tested on an 8-disk raid5 array with a 64KiB
chunk size. With this patch, sequential write performance on that array
increases from 900MiB/s to 1.1GiB/s with fio bs=10M. With fio bs=448K
(the exact limits.io_opt size) the peak sequential write throughput can
reach 1.51GiB/s.

(Resend to include Christoph and Keith in CC list.)

Signed-off-by: Coly Li <colyli@kernel.org>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Martin Wilck <mwilck@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
---
 drivers/md/md.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0f03b21e66e4..363cff633af3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -426,6 +426,67 @@ bool md_handle_request(struct mddev *mddev, struct bio *bio)
 }
 EXPORT_SYMBOL(md_handle_request);
 
+static struct bio *bio_split_by_io_opt(struct bio *bio)
+{
+	sector_t io_opt_sectors, sectors, n;
+	struct queue_limits lim;
+	struct mddev *mddev;
+	struct bio *split;
+	int level;
+
+	mddev = bio->bi_bdev->bd_disk->private_data;
+	level = mddev->level;
+	if (level == 1 || level == 10 || level == 0 || level == LEVEL_LINEAR)
+		return bio_split_to_limits(bio);
+
+	lim = mddev->gendisk->queue->limits;
+	io_opt_sectors = min3(bio_sectors(bio), lim.io_opt >> SECTOR_SHIFT,
+			      lim.max_hw_sectors);
+
+	/* No need to split */
+	if (bio_sectors(bio) == io_opt_sectors)
+		return bio;
+
+	n = bio->bi_iter.bi_sector;
+	sectors = do_div(n, io_opt_sectors);
+	/* Aligned to io_opt size and no need to split for raid456 */
+	if (!sectors && (bio_sectors(bio) <=  lim.max_hw_sectors))
+		return bio;
+
+	if (sectors) {
+		/**
+		 * Not aligned to io_opt, split
+		 * non-aligned head part.
+		 */
+		sectors = io_opt_sectors - sectors;
+	} else {
+		/**
+		 * Aligned to io_opt, split to the largest multiple
+		 * of io_opt within max_hw_sectors, to make full
+		 * stripe write/read for underlying raid456 levels.
+		 */
+		n = lim.max_hw_sectors;
+		do_div(n, io_opt_sectors);
+		sectors = n * io_opt_sectors;
+	}
+
+	/* Almost won't happen */
+	if (unlikely(sectors >= bio_sectors(bio))) {
+		pr_warn("%s raid level %d: sectors %llu >= bio_sectors %u, not split\n",
+			__func__, level, sectors, bio_sectors(bio));
+		return bio;
+	}
+
+	split = bio_split(bio, sectors, GFP_NOIO,
+			  &bio->bi_bdev->bd_disk->bio_split);
+	if (!split)
+		return bio;
+	split->bi_opf |= REQ_NOMERGE;
+	bio_chain(split, bio);
+	submit_bio_noacct(bio);
+	return split;
+}
+
 static void md_submit_bio(struct bio *bio)
 {
 	const int rw = bio_data_dir(bio);
@@ -441,7 +502,7 @@ static void md_submit_bio(struct bio *bio)
 		return;
 	}
 
-	bio = bio_split_to_limits(bio);
+	bio = bio_split_by_io_opt(bio);
 	if (!bio)
 		return;
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-15 18:02 colyli
@ 2025-07-16  1:46 ` Yu Kuai
  2025-07-16  6:58 ` Yu Kuai
  2025-07-16 11:37 ` Christoph Hellwig
  2 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-07-16  1:46 UTC (permalink / raw)
  To: colyli, linux-raid
  Cc: linux-block, Xiao Ni, Hannes Reinecke, Martin Wilck,
	Christoph Hellwig, Keith Busch, yukuai (C)

Hi,

On 2025/07/16 2:02, colyli@kernel.org wrote:
> From: Coly Li <colyli@kernel.org>
> 
> Currently in md_submit_bio() the incoming request bio is split by
> bio_split_to_limits() which makes sure the bio won't exceed
> max_hw_sectors of a specific raid level before senting into its
> .make_request method.
> 
> For raid level 4/5/6 such split method might be problematic and hurt
> large read/write perforamnce. Because limits.max_hw_sectors are not
> always aligned to limits.io_opt size, the split bio won't be full
> stripes covered on all data disks, and will introduce extra read-in I/O.
> Even the bio's bi_sector is aligned to limits.io_opt size and large
> enough, the resulted split bio is not size-friendly to corresponding
> raid456 level.
> 
> This patch introduces bio_split_by_io_opt() to solve the above issue,
> 1, If the incoming bio is not limits.io_opt aligned, split the non-
>     aligned head part. Then the next one will be aligned.
> 2, If the imcoming bio is limits.io_opt aligned, and split is necessary,
>     then try to split a by multiple of limits.io_opt but not exceed
>     limits.max_hw_sectors.
> 
> Then for large bio, the sligned split part will be full-stripes covered
> to all data disks, no extra read-in I/Os when rmw_level is 0. And for
> rmw_level > 0 condistions, the limits.io_opt aligned bios are welcomed
> for performace as well.
> 
> This RFC patch only tests on 8 disks raid5 array with 64KiB chunk size.
> By this patch, 64KiB chunk size for a 8 disks raid5 array, sequential
> write performance increases from 900MiB/s to 1.1GiB/s by fio bs=10M.
> If fio bs=488K (exact limits.io_opt size) the peak sequential write
> throughput can reach 1.51GiB/s.
> 
> (Resend to include Christoph and Keith in CC list.)
> 
> Signed-off-by: Coly Li <colyli@kernel.org>
> Cc: Yu Kuai <yukuai3@huawei.com>
> Cc: Xiao Ni <xni@redhat.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Martin Wilck <mwilck@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Keith Busch <kbusch@kernel.org>
> ---
>   drivers/md/md.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 62 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 0f03b21e66e4..363cff633af3 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -426,6 +426,67 @@ bool md_handle_request(struct mddev *mddev, struct bio *bio)
>   }
>   EXPORT_SYMBOL(md_handle_request);
>   

The split should be due to:

lim.max_hw_sectors = RAID5_MAX_REQ_STRIPES << RAID5_STRIPE_SHIFT(conf);

Which is introduced by commit:

7e55c60acfbb ("md/raid5: Pivot raid5_make_request()")

And taking a quick look at that, I'm still not sure why max_hw_sectors
is limited now; perhaps if chunksize > 256 * stripe_size, and a bio
inside one chunk spans more than 256 stripes, such a bio could get
stuck forever because there are not enough stripes?

Thanks,
Kuai

> +static struct bio *bio_split_by_io_opt(struct bio *bio)
> +{
> +	sector_t io_opt_sectors, sectors, n;
> +	struct queue_limits lim;
> +	struct mddev *mddev;
> +	struct bio *split;
> +	int level;
> +
> +	mddev = bio->bi_bdev->bd_disk->private_data;
> +	level = mddev->level;
> +	if (level == 1 || level == 10 || level == 0 || level == LEVEL_LINEAR)
> +		return bio_split_to_limits(bio);
> +
> +	lim = mddev->gendisk->queue->limits;
> +	io_opt_sectors = min3(bio_sectors(bio), lim.io_opt >> SECTOR_SHIFT,
> +			      lim.max_hw_sectors);
> +
> +	/* No need to split */
> +	if (bio_sectors(bio) == io_opt_sectors)
> +		return bio;
> +
> +	n = bio->bi_iter.bi_sector;
> +	sectors = do_div(n, io_opt_sectors);
> +	/* Aligned to io_opt size and no need to split for radi456 */
> +	if (!sectors && (bio_sectors(bio) <=  lim.max_hw_sectors))
> +		return bio;
> +
> +	if (sectors) {
> +		/**
> +		 * Not aligned to io_opt, split
> +		 * non-aligned head part.
> +		 */
> +		sectors = io_opt_sectors - sectors;
> +	} else {
> +		/**
> +		 * Aligned to io_opt, split to the largest multiple
> +		 * of io_opt within max_hw_sectors, to make full
> +		 * stripe write/read for underlying raid456 levels.
> +		 */
> +		n = lim.max_hw_sectors;
> +		do_div(n, io_opt_sectors);
> +		sectors = n * io_opt_sectors;
> +	}
> +
> +	/* Almost won't happen */
> +	if (unlikely(sectors >= bio_sectors(bio))) {
> +		pr_warn("%s raid level %d: sectors %llu >= bio_sectors %u, not split\n",
> +			__func__, level, sectors, bio_sectors(bio));
> +		return bio;
> +	}
> +
> +	split = bio_split(bio, sectors, GFP_NOIO,
> +			  &bio->bi_bdev->bd_disk->bio_split);
> +	if (!split)
> +		return bio;
> +	split->bi_opf |= REQ_NOMERGE;
> +	bio_chain(split, bio);
> +	submit_bio_noacct(bio);
> +	return split;
> +}
> +
>   static void md_submit_bio(struct bio *bio)
>   {
>   	const int rw = bio_data_dir(bio);
> @@ -441,7 +502,7 @@ static void md_submit_bio(struct bio *bio)
>   		return;
>   	}
>   
> -	bio = bio_split_to_limits(bio);
> +	bio = bio_split_by_io_opt(bio);
>   	if (!bio)
>   		return;
>   
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-15 18:02 colyli
  2025-07-16  1:46 ` Yu Kuai
@ 2025-07-16  6:58 ` Yu Kuai
  2025-07-16  8:50   ` Coly Li
  2025-07-16 11:37 ` Christoph Hellwig
  2 siblings, 1 reply; 19+ messages in thread
From: Yu Kuai @ 2025-07-16  6:58 UTC (permalink / raw)
  To: colyli, linux-raid
  Cc: linux-block, Xiao Ni, Hannes Reinecke, Martin Wilck,
	Christoph Hellwig, Keith Busch, yukuai (C)

Hi,

On 2025/07/16 2:02, colyli@kernel.org wrote:
> From: Coly Li <colyli@kernel.org>
> 
> Currently in md_submit_bio() the incoming request bio is split by
> bio_split_to_limits() which makes sure the bio won't exceed
> max_hw_sectors of a specific raid level before senting into its
> .make_request method.
> 
> For raid level 4/5/6 such split method might be problematic and hurt
> large read/write perforamnce. Because limits.max_hw_sectors are not
> always aligned to limits.io_opt size, the split bio won't be full
> stripes covered on all data disks, and will introduce extra read-in I/O.
> Even the bio's bi_sector is aligned to limits.io_opt size and large
> enough, the resulted split bio is not size-friendly to corresponding
> raid456 level.
> 
> This patch introduces bio_split_by_io_opt() to solve the above issue,
> 1, If the incoming bio is not limits.io_opt aligned, split the non-
>     aligned head part. Then the next one will be aligned.
> 2, If the imcoming bio is limits.io_opt aligned, and split is necessary,
>     then try to split a by multiple of limits.io_opt but not exceed
>     limits.max_hw_sectors.
> 
> Then for large bio, the sligned split part will be full-stripes covered
> to all data disks, no extra read-in I/Os when rmw_level is 0. And for
> rmw_level > 0 condistions, the limits.io_opt aligned bios are welcomed
> for performace as well.
> 
> This RFC patch only tests on 8 disks raid5 array with 64KiB chunk size.
> By this patch, 64KiB chunk size for a 8 disks raid5 array, sequential
> write performance increases from 900MiB/s to 1.1GiB/s by fio bs=10M.
> If fio bs=488K (exact limits.io_opt size) the peak sequential write
> throughput can reach 1.51GiB/s.
> 
> (Resend to include Christoph and Keith in CC list.)
> 
> Signed-off-by: Coly Li <colyli@kernel.org>
> Cc: Yu Kuai <yukuai3@huawei.com>
> Cc: Xiao Ni <xni@redhat.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Martin Wilck <mwilck@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Keith Busch <kbusch@kernel.org>
> ---
>   drivers/md/md.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 62 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 0f03b21e66e4..363cff633af3 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -426,6 +426,67 @@ bool md_handle_request(struct mddev *mddev, struct bio *bio)
>   }
>   EXPORT_SYMBOL(md_handle_request);
>   
> +static struct bio *bio_split_by_io_opt(struct bio *bio)
> +{
> +	sector_t io_opt_sectors, sectors, n;
> +	struct queue_limits lim;
> +	struct mddev *mddev;
> +	struct bio *split;
> +	int level;
> +
> +	mddev = bio->bi_bdev->bd_disk->private_data;
> +	level = mddev->level;
> +	if (level == 1 || level == 10 || level == 0 || level == LEVEL_LINEAR)
> +		return bio_split_to_limits(bio);

There is another patch that provides a helper raid_is_456()
https://lore.kernel.org/all/20250707165202.11073-3-yukuai@kernel.org/

You might want to use it here.
> +
> +	lim = mddev->gendisk->queue->limits;
> +	io_opt_sectors = min3(bio_sectors(bio), lim.io_opt >> SECTOR_SHIFT,
> +			      lim.max_hw_sectors);

You might want to use max_sectors here, to honor the user setting.

And max_hw_sectors is just for normal read and write; for other I/O
like discard, atomic write and write zeroes, the limits are different.

> +
> +	/* No need to split */
> +	if (bio_sectors(bio) == io_opt_sectors)
> +		return bio;
> +

If the bio happens to cross two io_opt ranges, do you think it's better
to split it here? For example:

io_opt is 64k (chunk size) * 7 = 448k; issue an IO starting from 444k
with len = 8k. raid5 will have to use 2 stripes to handle such an IO.

> +	n = bio->bi_iter.bi_sector;
> +	sectors = do_div(n, io_opt_sectors);
> +	/* Aligned to io_opt size and no need to split for radi456 */
> +	if (!sectors && (bio_sectors(bio) <=  lim.max_hw_sectors))
> +		return bio;

I'm confused here: do_div doesn't mean aligned; should bio_offset() be
taken into consideration? For example, issue an IO starting from 4k
with len = 448 * 2 k. If I read the code correctly, the result is:

4 + 896 -> 4 + 896 (not split if within max_sectors)

What we really expect is:

4 + 896 -> 4 + 444, 448 + 448, 892 + 4
> +
> +	if (sectors) {
> +		/**
> +		 * Not aligned to io_opt, split
> +		 * non-aligned head part.
> +		 */
> +		sectors = io_opt_sectors - sectors;
> +	} else {
> +		/**
> +		 * Aligned to io_opt, split to the largest multiple
> +		 * of io_opt within max_hw_sectors, to make full
> +		 * stripe write/read for underlying raid456 levels.
> +		 */
> +		n = lim.max_hw_sectors;
> +		do_div(n, io_opt_sectors);
> +		sectors = n * io_opt_sectors;

rounddown()?
> +	}
> +
> +	/* Almost won't happen */
> +	if (unlikely(sectors >= bio_sectors(bio))) {
> +		pr_warn("%s raid level %d: sectors %llu >= bio_sectors %u, not split\n",
> +			__func__, level, sectors, bio_sectors(bio));
> +		return bio;
> +	}
> +
> +	split = bio_split(bio, sectors, GFP_NOIO,
> +			  &bio->bi_bdev->bd_disk->bio_split);
> +	if (!split)
> +		return bio;
> +	split->bi_opf |= REQ_NOMERGE;
> +	bio_chain(split, bio);
> +	submit_bio_noacct(bio);
> +	return split;
> +}
> +
>   static void md_submit_bio(struct bio *bio)
>   {
>   	const int rw = bio_data_dir(bio);
> @@ -441,7 +502,7 @@ static void md_submit_bio(struct bio *bio)
>   		return;
>   	}
>   
> -	bio = bio_split_to_limits(bio);
> +	bio = bio_split_by_io_opt(bio);
>   	if (!bio)
>   		return;
>   
> 

Thanks,
Kuai


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-16  6:58 ` Yu Kuai
@ 2025-07-16  8:50   ` Coly Li
  2025-07-16  9:30     ` Yu Kuai
  0 siblings, 1 reply; 19+ messages in thread
From: Coly Li @ 2025-07-16  8:50 UTC (permalink / raw)
  To: Yu Kuai
  Cc: linux-raid, linux-block, Xiao Ni, Hannes Reinecke, Martin Wilck,
	Christoph Hellwig, Keith Busch, yukuai (C)

On Wed, Jul 16, 2025 at 02:58:13PM +0800, Yu Kuai wrote:
> Hi,
> 
> On 2025/07/16 2:02, colyli@kernel.org wrote:
> > From: Coly Li <colyli@kernel.org>
> > 
> > Currently in md_submit_bio() the incoming request bio is split by
> > bio_split_to_limits() which makes sure the bio won't exceed
> > max_hw_sectors of a specific raid level before senting into its
> > .make_request method.
> > 
> > For raid level 4/5/6 such split method might be problematic and hurt
> > large read/write perforamnce. Because limits.max_hw_sectors are not
> > always aligned to limits.io_opt size, the split bio won't be full
> > stripes covered on all data disks, and will introduce extra read-in I/O.
> > Even the bio's bi_sector is aligned to limits.io_opt size and large
> > enough, the resulted split bio is not size-friendly to corresponding
> > raid456 level.
> > 
> > This patch introduces bio_split_by_io_opt() to solve the above issue,
> > 1, If the incoming bio is not limits.io_opt aligned, split the non-
> >     aligned head part. Then the next one will be aligned.
> > 2, If the imcoming bio is limits.io_opt aligned, and split is necessary,
> >     then try to split a by multiple of limits.io_opt but not exceed
> >     limits.max_hw_sectors.
> > 
> > Then for large bio, the sligned split part will be full-stripes covered
> > to all data disks, no extra read-in I/Os when rmw_level is 0. And for
> > rmw_level > 0 condistions, the limits.io_opt aligned bios are welcomed
> > for performace as well.
> > 
> > This RFC patch only tests on 8 disks raid5 array with 64KiB chunk size.
> > By this patch, 64KiB chunk size for a 8 disks raid5 array, sequential
> > write performance increases from 900MiB/s to 1.1GiB/s by fio bs=10M.
> > If fio bs=488K (exact limits.io_opt size) the peak sequential write
> > throughput can reach 1.51GiB/s.
> > 
> > (Resend to include Christoph and Keith in CC list.)
> > 
> > Signed-off-by: Coly Li <colyli@kernel.org>
> > Cc: Yu Kuai <yukuai3@huawei.com>
> > Cc: Xiao Ni <xni@redhat.com>
> > Cc: Hannes Reinecke <hare@suse.de>
> > Cc: Martin Wilck <mwilck@suse.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Keith Busch <kbusch@kernel.org>
> > ---
> >   drivers/md/md.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++-
> >   1 file changed, 62 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 0f03b21e66e4..363cff633af3 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -426,6 +426,67 @@ bool md_handle_request(struct mddev *mddev, struct bio *bio)
> >   }
> >   EXPORT_SYMBOL(md_handle_request);
> > +static struct bio *bio_split_by_io_opt(struct bio *bio)
> > +{
> > +	sector_t io_opt_sectors, sectors, n;
> > +	struct queue_limits lim;
> > +	struct mddev *mddev;
> > +	struct bio *split;
> > +	int level;
> > +
> > +	mddev = bio->bi_bdev->bd_disk->private_data;
> > +	level = mddev->level;
> > +	if (level == 1 || level == 10 || level == 0 || level == LEVEL_LINEAR)
> > +		return bio_split_to_limits(bio);
> 
> There is another patch that provide a helper raid_is_456()
> https://lore.kernel.org/all/20250707165202.11073-3-yukuai@kernel.org/
> 
> You might want to use it here.

Copied, I will use raid_is_456() after it gets merged.

> > +
> > +	lim = mddev->gendisk->queue->limits;
> > +	io_opt_sectors = min3(bio_sectors(bio), lim.io_opt >> SECTOR_SHIFT,
> > +			      lim.max_hw_sectors);
> 
> You might want to use max_sectors here, to honor user setting.
> 
> And max_hw_sectors is just for normal read and write, for other IO like
> discard, atomic write, write zero, the limit is different.
> 

Yeah, this is a reason why I want your comments :-) Originally I wanted
to change bio_split_to_limits(), but the raid5 logic cannot be accessed
there. If I write a clone of bio_split_to_limits(), it seems too heavy
and unnecessary.

How about handling only read/write bios here, and letting
bio_split_to_limits() do the rest, like:

	level = mddev->level;
	if (level == 1 || level == 10 || level == 0 || level == LEVEL_LINEAR ||
	    (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE))
		return bio_split_to_limits(bio);


> > +
> > +	/* No need to split */
> > +	if (bio_sectors(bio) == io_opt_sectors)
> > +		return bio;
> > +
> 
> If the bio happend to accross two io_opt, do you think it's better to
> split it here? For example:
> 

I assume the raid5 code will handle this: the pages of this bio will be
mapped to stripe pages that belong to different chunks, so there is no
need to split here. For such a condition it is the same handling
process as calling bio_split_to_limits().

> io_opt is 64k(chunk size) * 7 = 448k, issue an IO start from 444k with
> len = 8k. raid5 will have to use 2 stripes to handle such IO.
>

Yes, I assume that although this bio is not split, raid5 will use
different cloned bios to map the different stripe pages in
__add_stripe_bio().

And because they belong to different chunks, the extra read-in for XOR
(when rmw_level == 0) cannot be avoided.
 
> > +	n = bio->bi_iter.bi_sector;
> > +	sectors = do_div(n, io_opt_sectors);
> > +	/* Aligned to io_opt size and no need to split for radi456 */
> > +	if (!sectors && (bio_sectors(bio) <=  lim.max_hw_sectors))
> > +		return bio;
> 
> I'm confused here, do_div doesn't mean aligned, should bio_offset() be
> taken into consideration? For example, issue an IO start from 4k with
> len = 448 * 2 k, if I read the code correctly, the result is:
> 
> 4 + 896 -> 4 + 896 (not split if within max_sectors)
> 
> What we really expect is:
> 
> 4 + 896 -> 4 + 444, 448 + 448, 892 + 4

Yes, you are right. And I do this on purpose, because the size of the
bio is small (less than max_hw_sectors). Even if the bio is split as in
your example, the full-stripe write in the middle doesn't help
performance, because the extra read-in will still happen for the head
and tail parts when rmw_level == 0.

For a small bio (size < io_opt x 2 in this case), there is almost no
performance difference. The performance loss from an incorrect bio
split shows up when the bio is large enough and many split bios in the
middle are not full-stripe covered.

For bio size < max_hw_sectors, just let raid5 handle it as it does today.
> > +
> > +	if (sectors) {
> > +		/**
> > +		 * Not aligned to io_opt, split
> > +		 * non-aligned head part.
> > +		 */
> > +		sectors = io_opt_sectors - sectors;
> > +	} else {
> > +		/**
> > +		 * Aligned to io_opt, split to the largest multiple
> > +		 * of io_opt within max_hw_sectors, to make full
> > +		 * stripe write/read for underlying raid456 levels.
> > +		 */
> > +		n = lim.max_hw_sectors;
> > +		do_div(n, io_opt_sectors);
> > +		sectors = n * io_opt_sectors;
> 
> roundown() ?

Yes, of course :-)


> > +	}
> > +
> > +	/* Almost won't happen */
> > +	if (unlikely(sectors >= bio_sectors(bio))) {
> > +		pr_warn("%s raid level %d: sectors %llu >= bio_sectors %u, not split\n",
> > +			__func__, level, sectors, bio_sectors(bio));
> > +		return bio;
> > +	}
> > +
> > +	split = bio_split(bio, sectors, GFP_NOIO,
> > +			  &bio->bi_bdev->bd_disk->bio_split);
> > +	if (!split)
> > +		return bio;
> > +	split->bi_opf |= REQ_NOMERGE;
> > +	bio_chain(split, bio);
> > +	submit_bio_noacct(bio);
> > +	return split;
> > +}
> > +
> >   static void md_submit_bio(struct bio *bio)
> >   {
> >   	const int rw = bio_data_dir(bio);
> > @@ -441,7 +502,7 @@ static void md_submit_bio(struct bio *bio)
> >   		return;
> >   	}
> > -	bio = bio_split_to_limits(bio);
> > +	bio = bio_split_by_io_opt(bio);
> >   	if (!bio)
> >   		return;
> > 
>

Thanks for the review.

Coly Li 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-16  8:50   ` Coly Li
@ 2025-07-16  9:30     ` Yu Kuai
  0 siblings, 0 replies; 19+ messages in thread
From: Yu Kuai @ 2025-07-16  9:30 UTC (permalink / raw)
  To: Coly Li, Yu Kuai
  Cc: linux-raid, linux-block, Xiao Ni, Hannes Reinecke, Martin Wilck,
	Christoph Hellwig, Keith Busch, yukuai (C)

Hi,

On 2025/07/16 16:50, Coly Li wrote:
> On Wed, Jul 16, 2025 at 02:58:13PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/07/16 2:02, colyli@kernel.org wrote:
>>> From: Coly Li <colyli@kernel.org>
>>>
>>> Currently in md_submit_bio() the incoming request bio is split by
>>> bio_split_to_limits() which makes sure the bio won't exceed
>>> max_hw_sectors of a specific raid level before senting into its
>>> .make_request method.
>>>
>>> For raid level 4/5/6 such split method might be problematic and hurt
>>> large read/write perforamnce. Because limits.max_hw_sectors are not
>>> always aligned to limits.io_opt size, the split bio won't be full
>>> stripes covered on all data disks, and will introduce extra read-in I/O.
>>> Even the bio's bi_sector is aligned to limits.io_opt size and large
>>> enough, the resulted split bio is not size-friendly to corresponding
>>> raid456 level.
>>>
>>> This patch introduces bio_split_by_io_opt() to solve the above issue,
>>> 1, If the incoming bio is not limits.io_opt aligned, split the non-
>>>      aligned head part. Then the next one will be aligned.
>>> 2, If the imcoming bio is limits.io_opt aligned, and split is necessary,
>>>      then try to split a by multiple of limits.io_opt but not exceed
>>>      limits.max_hw_sectors.
>>>
>>> Then for large bio, the sligned split part will be full-stripes covered
>>> to all data disks, no extra read-in I/Os when rmw_level is 0. And for
>>> rmw_level > 0 condistions, the limits.io_opt aligned bios are welcomed
>>> for performace as well.
>>>
>>> This RFC patch only tests on 8 disks raid5 array with 64KiB chunk size.
>>> By this patch, 64KiB chunk size for a 8 disks raid5 array, sequential
>>> write performance increases from 900MiB/s to 1.1GiB/s by fio bs=10M.
>>> If fio bs=488K (exact limits.io_opt size) the peak sequential write
>>> throughput can reach 1.51GiB/s.
>>>
>>> (Resend to include Christoph and Keith in CC list.)
>>>
>>> Signed-off-by: Coly Li <colyli@kernel.org>
>>> Cc: Yu Kuai <yukuai3@huawei.com>
>>> Cc: Xiao Ni <xni@redhat.com>
>>> Cc: Hannes Reinecke <hare@suse.de>
>>> Cc: Martin Wilck <mwilck@suse.com>
>>> Cc: Christoph Hellwig <hch@lst.de>
>>> Cc: Keith Busch <kbusch@kernel.org>
>>> ---
>>>    drivers/md/md.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++-
>>>    1 file changed, 62 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index 0f03b21e66e4..363cff633af3 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -426,6 +426,67 @@ bool md_handle_request(struct mddev *mddev, struct bio *bio)
>>>    }
>>>    EXPORT_SYMBOL(md_handle_request);
>>> +static struct bio *bio_split_by_io_opt(struct bio *bio)
>>> +{
>>> +	sector_t io_opt_sectors, sectors, n;
>>> +	struct queue_limits lim;
>>> +	struct mddev *mddev;
>>> +	struct bio *split;
>>> +	int level;
>>> +
>>> +	mddev = bio->bi_bdev->bd_disk->private_data;
>>> +	level = mddev->level;
>>> +	if (level == 1 || level == 10 || level == 0 || level == LEVEL_LINEAR)
>>> +		return bio_split_to_limits(bio);
>>
>> There is another patch that provide a helper raid_is_456()
>> https://lore.kernel.org/all/20250707165202.11073-3-yukuai@kernel.org/
>>
>> You might want to use it here.
> 
> Copied, I will use raid_is_456() after it gets merged.
> 
>>> +
>>> +	lim = mddev->gendisk->queue->limits;
>>> +	io_opt_sectors = min3(bio_sectors(bio), lim.io_opt >> SECTOR_SHIFT,
>>> +			      lim.max_hw_sectors);
>>
>> You might want to use max_sectors here, to honor user setting.
>>
>> And max_hw_sectors is just for normal read and write, for other IO like
>> discard, atomic write, write zero, the limit is different.
>>
> 
> Yeah, this is a reason why I want your comments :-) Originally I want to
> change bio_split_to_limits(), but the raid5 logic cannot be accessed
> there. If I write a clone of bio_split_to_limits(), it seems too heavy
> and unncessary.
> 
> How about only handle read/write bio here, and let bio_split_to_limits()
> to do the rested, like,
> 
> 	level = mddev->level;
> 	if (level == 1 || level == 10 || level == 0 || level == LEVEL_LINEAR ||
> 	    (bio_op(bio) != REQ_OP_READ && bio_op(bio) != REQ_OP_WRITE))
> 		return bio_split_to_limits(bio);
> 
> 
>>> +
>>> +	/* No need to split */
>>> +	if (bio_sectors(bio) == io_opt_sectors)
>>> +		return bio;
>>> +
>>
>> If the bio happend to accross two io_opt, do you think it's better to
>> split it here? For example:
>>
> 
> I assume raid5 code will handle this, the pages of this bio will belong
> to different stripe pages belong to different chunks. So no need to
> split here. For such condition, it is same handling process as calling
> bio_split_to_limits().
> 
>> io_opt is 64k(chunk size) * 7 = 448k, issue an IO start from 444k with
>> len = 8k. raid5 will have to use 2 stripes to handle such IO.
>>
> 
> Yes, I assume although this bio is not split, raid5 will have different
> cloned bio to map different stripe pages in __add_stripe_bio().
> 
> And because they belong to difference chunks, extra read-in for XOR (when
> rmw_level == 0) cannot be avoided.
>   
>>> +	n = bio->bi_iter.bi_sector;
>>> +	sectors = do_div(n, io_opt_sectors);
>>> +	/* Aligned to io_opt size and no need to split for radi456 */
>>> +	if (!sectors && (bio_sectors(bio) <=  lim.max_hw_sectors))
>>> +		return bio;
>>
>> I'm confused here, do_div doesn't mean aligned, should bio_offset() be
>> taken into consideration? For example, issue an IO start from 4k with
>> len = 448 * 2 k, if I read the code correctly, the result is:
>>
>> 4 + 896 -> 4 + 896 (not split if within max_sectors)
>>
>> What we really expect is:
>>
>> 4 + 896 -> 4 + 444, 448 + 448, 892 + 4
> 
> Yes you are right. And I do this on purpose, becasue the size of bio
> is small (less then max_hw_sectors). Even split the bio as you exampled,
> the full-stripes-write in middle doesn't help performance because the
> extra read-in will happen for head and tail part when rmw_level == 0.
> 
> For small bio (size < io_opt x 2 in this case), there is almost no
> performance difference. The performance loss of incorrect bio split will
> show up when the bio is large enough and many split bios in middle are
> not full-stripes covered.
> 
> For bio size < max_hw_sectors, just let raid5 handle it as what it does.

I'm still a bit confused :( I do understand that in the above case,
where IO size < max_sectors, raid5 can handle it and there is not much
difference. What I don't understand is why not align to io_opt; for
example, in the above case, if I increase the IO size to 448 * n, I
would expect the result to be the following (assume max_sectors = 1M):

4 + 444, 448 + 448*2, 448*3 + 448*2, ..., 448*n + 4;

Other than the head and tail, all split bios in the middle will end up
as full stripes.

And with this patch, because the do_div check can pass, the split bios
will end up like:

4 + 448*2, 4+448*2 + 448*2, ...

And each bio will not end up as full stripes. I don't get how this
behaviour makes any difference from the behaviour without this patch;
raid5 will have to try to fill in stripes with different bios.

Thanks,
Kuai

>>> +
>>> +	if (sectors) {
>>> +		/**
>>> +		 * Not aligned to io_opt, split
>>> +		 * non-aligned head part.
>>> +		 */
>>> +		sectors = io_opt_sectors - sectors;
>>> +	} else {
>>> +		/**
>>> +		 * Aligned to io_opt, split to the largest multiple
>>> +		 * of io_opt within max_hw_sectors, to make full
>>> +		 * stripe write/read for underlying raid456 levels.
>>> +		 */
>>> +		n = lim.max_hw_sectors;
>>> +		do_div(n, io_opt_sectors);
>>> +		sectors = n * io_opt_sectors;
>>
>> roundown() ?
> 
> Yes, of course :-)
> 
> 
>>> +	}
>>> +
>>> +	/* Almost won't happen */
>>> +	if (unlikely(sectors >= bio_sectors(bio))) {
>>> +		pr_warn("%s raid level %d: sectors %llu >= bio_sectors %u, not split\n",
>>> +			__func__, level, sectors, bio_sectors(bio));
>>> +		return bio;
>>> +	}
>>> +
>>> +	split = bio_split(bio, sectors, GFP_NOIO,
>>> +			  &bio->bi_bdev->bd_disk->bio_split);
>>> +	if (!split)
>>> +		return bio;
>>> +	split->bi_opf |= REQ_NOMERGE;
>>> +	bio_chain(split, bio);
>>> +	submit_bio_noacct(bio);
>>> +	return split;
>>> +}
>>> +
>>>    static void md_submit_bio(struct bio *bio)
>>>    {
>>>    	const int rw = bio_data_dir(bio);
>>> @@ -441,7 +502,7 @@ static void md_submit_bio(struct bio *bio)
>>>    		return;
>>>    	}
>>> -	bio = bio_split_to_limits(bio);
>>> +	bio = bio_split_by_io_opt(bio);
>>>    	if (!bio)
>>>    		return;
>>>
>>
> 
> Thanks for the review.
> 
> Coly Li
> 
> .
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-15 18:02 colyli
  2025-07-16  1:46 ` Yu Kuai
  2025-07-16  6:58 ` Yu Kuai
@ 2025-07-16 11:37 ` Christoph Hellwig
  2025-07-16 11:39   ` Coly Li
  2 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2025-07-16 11:37 UTC (permalink / raw)
  To: colyli
  Cc: linux-raid, linux-block, Yu Kuai, Xiao Ni, Hannes Reinecke,
	Martin Wilck, Christoph Hellwig, Keith Busch

On Wed, Jul 16, 2025 at 02:02:41AM +0800, colyli@kernel.org wrote:
> From: Coly Li <colyli@kernel.org>
> 
> Currently in md_submit_bio() the incoming request bio is split by
> bio_split_to_limits() which makes sure the bio won't exceed
> max_hw_sectors of a specific raid level before senting into its
> .make_request method.
> 
> For raid level 4/5/6 such split method might be problematic and hurt
> large read/write perforamnce. Because limits.max_hw_sectors are not
> always aligned to limits.io_opt size, the split bio won't be full
> stripes covered on all data disks, and will introduce extra read-in I/O.
> Even the bio's bi_sector is aligned to limits.io_opt size and large
> enough, the resulted split bio is not size-friendly to corresponding
> raid456 level.

So why don't you set a sane max_hw_sectors value instead of duplicating
the splitting logic?


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-16 11:37 ` Christoph Hellwig
@ 2025-07-16 11:39   ` Coly Li
  2025-07-16 11:41     ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Coly Li @ 2025-07-16 11:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: colyli, linux-raid, linux-block, Yu Kuai, Xiao Ni,
	Hannes Reinecke, Martin Wilck, Keith Busch



> On Jul 16, 2025, at 19:37, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Wed, Jul 16, 2025 at 02:02:41AM +0800, colyli@kernel.org wrote:
>> From: Coly Li <colyli@kernel.org>
>> 
>> Currently in md_submit_bio() the incoming request bio is split by
>> bio_split_to_limits() which makes sure the bio won't exceed
>> max_hw_sectors of a specific raid level before senting into its
>> .make_request method.
>> 
>> For raid level 4/5/6 such split method might be problematic and hurt
>> large read/write perforamnce. Because limits.max_hw_sectors are not
>> always aligned to limits.io_opt size, the split bio won't be full
>> stripes covered on all data disks, and will introduce extra read-in I/O.
>> Even the bio's bi_sector is aligned to limits.io_opt size and large
>> enough, the resulted split bio is not size-friendly to corresponding
>> raid456 level.
> 
> So why don't you set a sane max_hw_sectors value instead of duplicating
> the splitting logic?

Can you explain in a bit more detail?  In case I misunderstand you like I did with Kuai’s comments.

Thanks.

Coly Li



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-16 11:39   ` Coly Li
@ 2025-07-16 11:41     ` Christoph Hellwig
  2025-07-16 11:44       ` Coly Li
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2025-07-16 11:41 UTC (permalink / raw)
  To: Coly Li
  Cc: Christoph Hellwig, colyli, linux-raid, linux-block, Yu Kuai,
	Xiao Ni, Hannes Reinecke, Martin Wilck, Keith Busch

On Wed, Jul 16, 2025 at 07:39:18PM +0800, Coly Li wrote:
> >> For raid level 4/5/6 such split method might be problematic and hurt
> >> large read/write perforamnce. Because limits.max_hw_sectors are not
> >> always aligned to limits.io_opt size, the split bio won't be full
> >> stripes covered on all data disks, and will introduce extra read-in I/O.
> >> Even the bio's bi_sector is aligned to limits.io_opt size and large
> >> enough, the resulted split bio is not size-friendly to corresponding
> >> raid456 level.
> > 
> > So why don't you set a sane max_hw_sectors value instead of duplicating
> > the splitting logic?
> 
> Can you explain a bit more detail?  In case I misunderstand you like I
> did with Kuai’s comments

Just set the max_hw_sectors you want.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-16 11:41     ` Christoph Hellwig
@ 2025-07-16 11:44       ` Coly Li
  2025-07-16 11:45         ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Coly Li @ 2025-07-16 11:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: colyli, linux-raid, linux-block, Yu Kuai, Xiao Ni,
	Hannes Reinecke, Martin Wilck, Keith Busch



> On Jul 16, 2025, at 19:41, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Wed, Jul 16, 2025 at 07:39:18PM +0800, Coly Li wrote:
>>>> For raid level 4/5/6 such split method might be problematic and hurt
>>>> large read/write perforamnce. Because limits.max_hw_sectors are not
>>>> always aligned to limits.io_opt size, the split bio won't be full
>>>> stripes covered on all data disks, and will introduce extra read-in I/O.
>>>> Even the bio's bi_sector is aligned to limits.io_opt size and large
>>>> enough, the resulted split bio is not size-friendly to corresponding
>>>> raid456 level.
>>> 
>>> So why don't you set a sane max_hw_sectors value instead of duplicating
>>> the splitting logic?
>> 
>> Can you explain a bit more detail?  In case I misunderstand you like I
>> did with Kuai’s comments
> 
> Just set the max_hw_sectors you want.

Do you mean setting max_hw_sectors as (chunk_size * data_disks)?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-16 11:44       ` Coly Li
@ 2025-07-16 11:45         ` Christoph Hellwig
  2025-07-16 12:10           ` Coly Li
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2025-07-16 11:45 UTC (permalink / raw)
  To: Coly Li
  Cc: Christoph Hellwig, colyli, linux-raid, linux-block, Yu Kuai,
	Xiao Ni, Hannes Reinecke, Martin Wilck, Keith Busch

On Wed, Jul 16, 2025 at 07:44:38PM +0800, Coly Li wrote:
> Do you mean setting max_hw_sectors as (chunk_size * data_disks)?

If that is what works best: yes.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-16 11:45         ` Christoph Hellwig
@ 2025-07-16 12:10           ` Coly Li
  2025-07-16 12:14             ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Coly Li @ 2025-07-16 12:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Coly Li, linux-raid, linux-block, Yu Kuai, Xiao Ni,
	Hannes Reinecke, Martin Wilck, Keith Busch

On Wed, Jul 16, 2025 at 01:45:33PM +0800, Christoph Hellwig wrote:
> On Wed, Jul 16, 2025 at 07:44:38PM +0800, Coly Li wrote:
> > Do you mean setting max_hw_sectors as (chunk_size * data_disks)?
> 
> If that is what works best: yes.
>

Simply setting max_hw_sectors to opt_io_size doesn't solve 2 issues,
1, the split length might be aligned to opt_io_size, but the start offset
   of a split bio cannot always be aligned to opt_io_size. The current
   code in bio_split_rw() doesn't handle that alignment.
2, opt_io_size is not large enough compared to max_hw_sectors, and the
   extra bios in the recursive md code path may hurt performance. Anyway,
   this can be solved by setting max_hw_sectors to a multiple of
   opt_io_size.

Just like handling discard requests, handling raid5 read/write bios should
try to split the large bio so that both its *offset* and *length* are
opt_io_size aligned. If I understand correctly, bio_split_to_limits()
doesn't handle offset alignment for raid5 read/write bios.
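
To illustrate point 1 with the 448KiB io_opt from the test setup above
(the offsets here are made up for the example): a bio starting at 4KiB
that is split only by length into 448KiB pieces becomes

  [4KiB, 452KiB), [452KiB, 900KiB), ...

and every piece still straddles a stripe boundary (448KiB, 896KiB, ...),
so length alignment alone does not avoid the extra read-in.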

Coly Li

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-16 12:10           ` Coly Li
@ 2025-07-16 12:14             ` Christoph Hellwig
  2025-07-16 12:16               ` Coly Li
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2025-07-16 12:14 UTC (permalink / raw)
  To: Coly Li
  Cc: Christoph Hellwig, Coly Li, linux-raid, linux-block, Yu Kuai,
	Xiao Ni, Hannes Reinecke, Martin Wilck, Keith Busch

On Wed, Jul 16, 2025 at 08:10:33PM +0800, Coly Li wrote:
> Just like hanlding discard requests, handling raid5 read/write bios should
> try to split the large bio into opt_io_size aligned both *offset* and
> *length*. If I understand correctly, bio_split_to_limits() doesn't handle
> offset alignment for read5 read/write bios.

Well, if you want offset alignment, set chunk_sectors.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-16 12:14             ` Christoph Hellwig
@ 2025-07-16 12:16               ` Coly Li
  2025-07-16 12:17                 ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Coly Li @ 2025-07-16 12:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Coly Li, linux-raid, linux-block, Yu Kuai, Xiao Ni,
	Hannes Reinecke, Martin Wilck, Keith Busch



> On Jul 16, 2025, at 20:14, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Wed, Jul 16, 2025 at 08:10:33PM +0800, Coly Li wrote:
>> Just like hanlding discard requests, handling raid5 read/write bios should
>> try to split the large bio into opt_io_size aligned both *offset* and
>> *length*. If I understand correctly, bio_split_to_limits() doesn't handle
>> offset alignment for read5 read/write bios.
> 
> Well, if you want offset alignment, set chunk_sectors.
> 

Do you mean setting max_hw_sectors as chunk_sectors?

Coly Li

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-16 12:16               ` Coly Li
@ 2025-07-16 12:17                 ` Christoph Hellwig
  2025-07-16 12:23                   ` Coly Li
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2025-07-16 12:17 UTC (permalink / raw)
  To: Coly Li
  Cc: Christoph Hellwig, Coly Li, linux-raid, linux-block, Yu Kuai,
	Xiao Ni, Hannes Reinecke, Martin Wilck, Keith Busch

On Wed, Jul 16, 2025 at 08:16:34PM +0800, Coly Li wrote:
> 
> 
> > On Jul 16, 2025, at 20:14, Christoph Hellwig <hch@lst.de> wrote:
> > 
> > On Wed, Jul 16, 2025 at 08:10:33PM +0800, Coly Li wrote:
> >> Just like hanlding discard requests, handling raid5 read/write bios should
> >> try to split the large bio into opt_io_size aligned both *offset* and
> >> *length*. If I understand correctly, bio_split_to_limits() doesn't handle
> >> offset alignment for read5 read/write bios.
> > 
> > Well, if you want offset alignment, set chunk_sectors.
> > 
> 
> Do you mean setting max_hw_sectors as chunk_sectors?

Setting both to the desired value (full stripe width).


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-16 12:17                 ` Christoph Hellwig
@ 2025-07-16 12:23                   ` Coly Li
  2025-07-16 16:29                     ` Yu Kuai
  0 siblings, 1 reply; 19+ messages in thread
From: Coly Li @ 2025-07-16 12:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Coly Li, linux-raid, linux-block, Yu Kuai, Xiao Ni,
	Hannes Reinecke, Martin Wilck, Keith Busch



> On Jul 16, 2025, at 20:17, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Wed, Jul 16, 2025 at 08:16:34PM +0800, Coly Li wrote:
>> 
>> 
>>> On Jul 16, 2025, at 20:14, Christoph Hellwig <hch@lst.de> wrote:
>>> 
>>> On Wed, Jul 16, 2025 at 08:10:33PM +0800, Coly Li wrote:
>>>> Just like hanlding discard requests, handling raid5 read/write bios should
>>>> try to split the large bio into opt_io_size aligned both *offset* and
>>>> *length*. If I understand correctly, bio_split_to_limits() doesn't handle
>>>> offset alignment for read5 read/write bios.
>>> 
>>> Well, if you want offset alignment, set chunk_sectors.
>>> 
>> 
>> Do you mean setting max_hw_sectors as chunk_sectors?
> 
> Setting both to the desired value (full stipe width).
> 

Do you mean setting chunk_size as (chunk_size * data_disks)?  That would be a deadlock…

If opt_io_size is (chunk_size * data_disks), setting the new max_hw_sectors as rounddown(current max_hw_sectors, opt_io_size) is a good idea.

Thanks.

Coly Li


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-16 12:23                   ` Coly Li
@ 2025-07-16 16:29                     ` Yu Kuai
  2025-07-17  4:52                       ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Yu Kuai @ 2025-07-16 16:29 UTC (permalink / raw)
  To: Coly Li, Christoph Hellwig
  Cc: Coly Li, linux-raid, linux-block, Yu Kuai, Xiao Ni,
	Hannes Reinecke, Martin Wilck, Keith Busch

Hi,

On 2025/7/16 20:23, Coly Li wrote:
>
>> On Jul 16, 2025, at 20:17, Christoph Hellwig <hch@lst.de> wrote:
>>
>> On Wed, Jul 16, 2025 at 08:16:34PM +0800, Coly Li wrote:
>>>
>>>> On Jul 16, 2025, at 20:14, Christoph Hellwig <hch@lst.de> wrote:
>>>>
>>>> On Wed, Jul 16, 2025 at 08:10:33PM +0800, Coly Li wrote:
>>>>> Just like hanlding discard requests, handling raid5 read/write bios should
>>>>> try to split the large bio into opt_io_size aligned both *offset* and
>>>>> *length*. If I understand correctly, bio_split_to_limits() doesn't handle
>>>>> offset alignment for read5 read/write bios.
>>>> Well, if you want offset alignment, set chunk_sectors.
>>>>
>>> Do you mean setting max_hw_sectors as chunk_sectors?
>> Setting both to the desired value (full stipe width).
>>
> Do you mean setting chunk_size as (chunk_size * data_disks)?  This is deadlock…
>
> If opt_io_size is (chunk_size * data_disks), setting new max_hw_sectors as rounddown(current max_hw_sectors, opt_io_size) is good idea.

I think rounding down max_hw_sectors to io_opt (chunk_size * data_disks)
will really make things much easier; perhaps that is what Christoph
means. All you need to do is handle the unaligned bio and split off that
part, and for aligned bios fall back to bio_split_to_limits().
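
A minimal sketch of that direction, for illustration only (the helper
name raid5_round_limits, its call site and its parameters are
assumptions here, not the actual raid5 limits code):

/*
 * Illustrative sketch only: round the stacked queue limits so that the
 * generic splitter cuts on full-stripe boundaries by construction.
 */
static void raid5_round_limits(struct queue_limits *lim,
			       unsigned int chunk_sectors,
			       unsigned int data_disks)
{
	unsigned int stripe_sectors = chunk_sectors * data_disks;

	/* io_opt (in bytes) advertises one full stripe */
	lim->io_opt = stripe_sectors << SECTOR_SHIFT;

	/*
	 * Offset alignment: with chunk_sectors set to the stripe width,
	 * bio_split_to_limits() splits off the unaligned head by itself.
	 */
	lim->chunk_sectors = stripe_sectors;

	/* Length alignment: cap max_hw_sectors to a stripe multiple. */
	lim->max_hw_sectors = rounddown(lim->max_hw_sectors, stripe_sectors);
}

With limits set up along these lines, md_submit_bio() could keep calling
plain bio_split_to_limits(): the chunk_sectors limit takes care of the
unaligned head, and the rounded max_hw_sectors keeps large aligned bios
on full-stripe boundaries.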

Thanks,
Kuai
>
> Thanks.
>
> Coly Li
>
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-16 16:29                     ` Yu Kuai
@ 2025-07-17  4:52                       ` Christoph Hellwig
  2025-07-17 15:19                         ` Coly Li
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2025-07-17  4:52 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Coly Li, Christoph Hellwig, Coly Li, linux-raid, linux-block,
	Yu Kuai, Xiao Ni, Hannes Reinecke, Martin Wilck, Keith Busch

On Thu, Jul 17, 2025 at 12:29:27AM +0800, Yu Kuai wrote:
>> If opt_io_size is (chunk_size * data_disks), setting new max_hw_sectors as rounddown(current max_hw_sectors, opt_io_size) is good idea.
>
> I think round down max_hw_sectors to io_opt(chunk_size * data_disks) will 
> really
> make things much easier, perhaps Christoph means this way. All you need to 
> do is to
> handle not aligned bio and split that part, and for aligned bio fall back 
> to use
> bio_split_to_limits().

If the raid5 code can handle multiple stripes per I/O, then you
just need to round it down, yes.  I assumed it could only handle
a single full or partial stripe at a time, but I guess I was wrong.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] md: split bio by io_opt size in md_submit_bio()
  2025-07-17  4:52                       ` Christoph Hellwig
@ 2025-07-17 15:19                         ` Coly Li
  0 siblings, 0 replies; 19+ messages in thread
From: Coly Li @ 2025-07-17 15:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Yu Kuai, Coly Li, linux-raid, linux-block, Yu Kuai, Xiao Ni,
	Hannes Reinecke, Martin Wilck, Keith Busch



> On Jul 17, 2025, at 12:52, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Thu, Jul 17, 2025 at 12:29:27AM +0800, Yu Kuai wrote:
>>> If opt_io_size is (chunk_size * data_disks), setting new max_hw_sectors as rounddown(current max_hw_sectors, opt_io_size) is good idea.
>> 
>> I think round down max_hw_sectors to io_opt(chunk_size * data_disks) will 
>> really
>> make things much easier, perhaps Christoph means this way. All you need to 
>> do is to
>> handle not aligned bio and split that part, and for aligned bio fall back 
>> to use
>> bio_split_to_limits().
> 
> If the raid5 code can handle multiple stripes per I/O, than you
> just need to round it down, yes.  I assumed it could only handle
> a single full or partial stripe at a time, but I guess I was wrong.
> 
> 

Thanks for the suggestion. I will compose a new version.

Coly Li

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread

Thread overview: 19+ messages
2025-07-15 17:59 [RFC PATCH] md: split bio by io_opt size in md_submit_bio() colyli
  -- strict thread matches above, loose matches on Subject: below --
2025-07-15 18:02 colyli
2025-07-16  1:46 ` Yu Kuai
2025-07-16  6:58 ` Yu Kuai
2025-07-16  8:50   ` Coly Li
2025-07-16  9:30     ` Yu Kuai
2025-07-16 11:37 ` Christoph Hellwig
2025-07-16 11:39   ` Coly Li
2025-07-16 11:41     ` Christoph Hellwig
2025-07-16 11:44       ` Coly Li
2025-07-16 11:45         ` Christoph Hellwig
2025-07-16 12:10           ` Coly Li
2025-07-16 12:14             ` Christoph Hellwig
2025-07-16 12:16               ` Coly Li
2025-07-16 12:17                 ` Christoph Hellwig
2025-07-16 12:23                   ` Coly Li
2025-07-16 16:29                     ` Yu Kuai
2025-07-17  4:52                       ` Christoph Hellwig
2025-07-17 15:19                         ` Coly Li
