* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 7:47 [PATCH] block: fix disordered IO in the case recursive split Yu Kuai
@ 2025-08-21 8:06 ` Coly Li
2025-08-21 8:43 ` Christoph Hellwig
2025-08-21 9:16 ` Coly Li
2 siblings, 0 replies; 14+ messages in thread
From: Coly Li @ 2025-08-21 8:06 UTC (permalink / raw)
To: Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, xni,
yukuai3, yi.zhang, yangerkun, johnny.chenyi
On Thu, Aug 21, 2025 at 03:47:06PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> Currently, the split bio is chained to the original bio, and the
> original bio is resubmitted to the tail of current->bio_list, waiting
> for the split bio to be issued. However, if the split bio gets split
> again, the IO order will be messed up. For example, in raid456 the IO
> will first be split by max_sectors from md_submit_bio(), and then later
> be split again by chunksize for internal handling:
>
> For example, assume max_sectors is 1M, and chunksize is 512k
>
> 1) issue a 2M IO:
>
> bio issuing: 0+2M
> current->bio_list: NULL
>
> 2) md_submit_bio() split by max_sectors:
>
> bio issuing: 0+1M
> current->bio_list: 1M+1M
>
> 3) chunk_aligned_read() split by chunksize:
>
> bio issuing: 0+512k
> current->bio_list: 1M+1M -> 512k+512k
>
> 4) after the first bio is issued, __submit_bio_noacct() will continue
> issuing the next bio:
>
> bio issuing: 1M+1M
> current->bio_list: 512k+512k
> bio issued: 0+512k
>
> 5) chunk_aligned_read() split by chunksize:
>
> bio issuing: 1M+512k
> current->bio_list: 512k+512k -> 1536k+512k
> bio issued: 0+512k
>
> 6) no split afterwards, finally the issue order is:
>
> 0+512k -> 1M+512k -> 512k+512k -> 1536k+512k
>
> This behaviour causes a large sequential read on raid456 to end up as
> small discontinuous IO on the underlying disks. Fix this problem by
> placing the chained bio at the head of current->bio_list.
>
> Test script: test on an 8-disk raid5 with 64k chunksize
> dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct
>
> Test results:
> Before this patch:
> 1) iostat results:
> Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
> md0 52430.00 3276.87 0.00 0.00 0.62 64.00 32.60 80.10
> sd* 4487.00 409.00 2054.00 31.40 0.82 93.34 3.68 71.20
> 2) blktrace G stage:
> 8,0 0 486445 11.357392936 843 G R 14071424 + 128 [dd]
> 8,0 0 486451 11.357466360 843 G R 14071168 + 128 [dd]
> 8,0 0 486454 11.357515868 843 G R 14071296 + 128 [dd]
> 8,0 0 486468 11.357968099 843 G R 14072192 + 128 [dd]
> 8,0 0 486474 11.358031320 843 G R 14071936 + 128 [dd]
> 8,0 0 486480 11.358096298 843 G R 14071552 + 128 [dd]
> 8,0 0 486490 11.358303858 843 G R 14071808 + 128 [dd]
> 3) io seek for sdx:
> Note: io seek is the result from the blktrace D stage, the statistic of:
> ABS((offset of next IO) - (offset + len of previous IO))
>
> Read|Write seek
> cnt 55175, zero cnt 25079
> >=(KB) .. <(KB) : count ratio |distribution |
> 0 .. 1 : 25079 45.5% |########################################|
> 1 .. 2 : 0 0.0% | |
> 2 .. 4 : 0 0.0% | |
> 4 .. 8 : 0 0.0% | |
> 8 .. 16 : 0 0.0% | |
> 16 .. 32 : 0 0.0% | |
> 32 .. 64 : 12540 22.7% |##################### |
> 64 .. 128 : 2508 4.5% |##### |
> 128 .. 256 : 0 0.0% | |
> 256 .. 512 : 10032 18.2% |################# |
> 512 .. 1024 : 5016 9.1% |######### |
>
> After this patch:
> 1) iostat results:
> Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
> md0 87965.00 5271.88 0.00 0.00 0.16 61.37 14.03 90.60
> sd* 6020.00 658.44 5117.00 45.95 0.44 112.00 2.68 86.50
> 2) blktrace G stage:
> 8,0 0 206296 5.354894072 664 G R 7156992 + 128 [dd]
> 8,0 0 206305 5.355018179 664 G R 7157248 + 128 [dd]
> 8,0 0 206316 5.355204438 664 G R 7157504 + 128 [dd]
> 8,0 0 206319 5.355241048 664 G R 7157760 + 128 [dd]
> 8,0 0 206333 5.355500923 664 G R 7158016 + 128 [dd]
> 8,0 0 206344 5.355837806 664 G R 7158272 + 128 [dd]
> 8,0 0 206353 5.355960395 664 G R 7158528 + 128 [dd]
> 8,0 0 206357 5.356020772 664 G R 7158784 + 128 [dd]
> 3) io seek for sdx
> Read|Write seek
> cnt 28644, zero cnt 21483
> >=(KB) .. <(KB) : count ratio |distribution |
> 0 .. 1 : 21483 75.0% |########################################|
> 1 .. 2 : 0 0.0% | |
> 2 .. 4 : 0 0.0% | |
> 4 .. 8 : 0 0.0% | |
> 8 .. 16 : 0 0.0% | |
> 16 .. 32 : 0 0.0% | |
> 32 .. 64 : 7161 25.0% |############## |
>
> BTW, this looks like a long-term problem from day one, and large
> sequential reads are a pretty common case, e.g. video playback.
>
> And even with this patch, IO in this test case is merged to at most
> 128k due to the block layer plug limit BLK_PLUG_FLUSH_SIZE; increasing
> that limit can get even better performance. However, we'll figure out
> how to do this properly later.
>
> Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Coly Li <colyli@kernel.org>
Yeah, I can see 'dd if=./on-fs-file of=/dev/null bs=6400K status=progress' report
the read throughput going from around 1.2GiB/s to 9.8GiB/s.
Great job and thank you!
Coly Li
> ---
> block/blk-core.c | 12 ++++++++----
> 1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4201504158a1..0d46d10edb22 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
>  	 * to collect a list of requests submited by a ->submit_bio method while
>  	 * it is active, and then process them after it returned.
>  	 */
> -	if (current->bio_list)
> -		bio_list_add(&current->bio_list[0], bio);
> -	else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
> +	if (current->bio_list) {
> +		if (bio_flagged(bio, BIO_CHAIN))
> +			bio_list_add_head(&current->bio_list[0], bio);
> +		else
> +			bio_list_add(&current->bio_list[0], bio);
> +	} else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
>  		__submit_bio_noacct_mq(bio);
> -	else
> +	} else {
>  		__submit_bio_noacct(bio);
> +	}
>  }
>
> static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
> --
> 2.39.2
>
>
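To make the walkthrough in the commit message concrete, here is a minimal
userspace model of the current->bio_list handling (purely illustrative;
the types and helpers below are stand-ins, not kernel code). "bios" are
(offset, length) pairs in KB, MAX_SECTORS/CHUNK mirror the 1M/512k limits
from the example, and the two runs differ only in whether a requeued
parent goes to the tail (current behaviour) or the head (this patch):

#include <stdio.h>
#include <stdlib.h>

#define MAX_SECTORS	1024	/* md_submit_bio() split limit (1M) */
#define CHUNK		512	/* chunk_aligned_read() chunksize (512k) */

struct bio { long off, len; struct bio *next; };
struct bio_list { struct bio *head, *tail; };

static void add_tail(struct bio_list *bl, struct bio *b)
{
	b->next = NULL;
	if (bl->tail)
		bl->tail->next = b;
	else
		bl->head = b;
	bl->tail = b;
}

static void add_head(struct bio_list *bl, struct bio *b)
{
	b->next = bl->head;
	bl->head = b;
	if (!bl->tail)
		bl->tail = b;
}

static struct bio *pop(struct bio_list *bl)
{
	struct bio *b = bl->head;

	if (b && !(bl->head = b->next))
		bl->tail = NULL;
	return b;
}

/* split the part past @limit off @b and requeue it, modelling
 * bio_split() plus resubmission of the chained parent */
static void split_requeue(struct bio *b, long limit, struct bio_list *bl,
			  int fix)
{
	struct bio *rest = malloc(sizeof(*rest));

	rest->off = b->off + limit;
	rest->len = b->len - limit;
	b->len = limit;
	if (fix)
		add_head(bl, rest);
	else
		add_tail(bl, rest);
}

static void simulate(const char *tag, int fix)
{
	struct bio_list list = { NULL, NULL };
	struct bio root = { 0, 2048, NULL };	/* one 2M read */
	struct bio *b = &root;

	printf("%s:", tag);
	do {	/* the __submit_bio_noacct() processing loop */
		if (b->len > MAX_SECTORS)
			split_requeue(b, MAX_SECTORS, &list, fix);
		if (b->len > CHUNK)
			split_requeue(b, CHUNK, &list, fix);
		printf(" %ldk+%ldk", b->off, b->len);	/* reaches the disk */
	} while ((b = pop(&list)) != NULL);
	printf("\n");
}

int main(void)
{
	simulate("tail insert (before)", 0);
	simulate("head insert (after)", 1);
	return 0;
}

The tail-insert run prints 0k+512k 1024k+512k 512k+512k 1536k+512k, the
disordered sequence from step 6) (with 1M written as 1024k), while the
head-insert run prints the sequential 0k+512k 512k+512k 1024k+512k
1536k+512k.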
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 7:47 [PATCH] block: fix disordered IO in the case recursive split Yu Kuai
2025-08-21 8:06 ` Coly Li
@ 2025-08-21 8:43 ` Christoph Hellwig
2025-08-21 8:56 ` Yu Kuai
2025-08-21 9:16 ` Coly Li
2 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-08-21 8:43 UTC (permalink / raw)
To: Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yukuai3, yi.zhang, yangerkun, johnny.chenyi
On Thu, Aug 21, 2025 at 03:47:06PM +0800, Yu Kuai wrote:
> +	if (current->bio_list) {
> +		if (bio_flagged(bio, BIO_CHAIN))
> +			bio_list_add_head(&current->bio_list[0], bio);
> +		else
> +			bio_list_add(&current->bio_list[0], bio);
> +	} else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
This breaks all the code that already chains the right way around,
and there's quite a bit of that (speaking as someone who created a few
instances).
So fix your submitter to chain the right way instead.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 8:43 ` Christoph Hellwig
@ 2025-08-21 8:56 ` Yu Kuai
2025-08-21 9:02 ` Christoph Hellwig
0 siblings, 1 reply; 14+ messages in thread
From: Yu Kuai @ 2025-08-21 8:56 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)
Hi,
On 2025/08/21 16:43, Christoph Hellwig wrote:
> On Thu, Aug 21, 2025 at 03:47:06PM +0800, Yu Kuai wrote:
>> +	if (current->bio_list) {
>> +		if (bio_flagged(bio, BIO_CHAIN))
>> +			bio_list_add_head(&current->bio_list[0], bio);
>> +		else
>> +			bio_list_add(&current->bio_list[0], bio);
>> +	} else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
>
> This breaks all the code that already chains the right way around,
> and there's quite a bit of that (speaking as someone who created a few
> instances).
>
> So fix your submitter to chain the right way instead.
>
Can you give some examples of how to chain the right way? BTW, for all
the IO split cases, should this order be fixed? I feel we should; this
disorder can happen in any stacking case where the top-level max_sectors
is greater than that of the stacked disk.
Thanks,
Kuai
>
> .
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 8:56 ` Yu Kuai
@ 2025-08-21 9:02 ` Christoph Hellwig
2025-08-21 9:33 ` Hannes Reinecke
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-08-21 9:02 UTC (permalink / raw)
To: Yu Kuai
Cc: Christoph Hellwig, axboe, neil, akpm, linux-block, linux-raid,
linux-kernel, colyli, xni, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
> Can you give some examples of how to chain the right way?
fs/xfs/xfs_bio_io.c: xfs_rw_bdev
fs/xfs/xfs_buf.c: xfs_buf_submit_bio
fs/xfs/xfs_log.c: xlog_write_iclog
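All three follow the same pattern; abridged from the fs/xfs/xfs_bio_io.c
loop (a sketch from memory, not the verbatim source; bio_max_vecs() is a
helper local to that file):

	while (bio_add_page(bio, page, len, off) != len) {
		struct bio *prev = bio;

		/* current bio is full: allocate the next bio starting
		 * at the following sector, chain the full bio in front
		 * of it and send it down immediately */
		bio = bio_alloc(prev->bi_bdev, bio_max_vecs(left),
				prev->bi_opf, GFP_KERNEL);
		bio->bi_iter.bi_sector = bio_end_sector(prev);
		bio_chain(prev, bio);	/* the new bio is prev's parent */
		submit_bio(prev);
	}

The old bio always goes down before the new one is even filled, so LBA
order is preserved without ever requeueing onto current->bio_list.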
> BTW, for all
> the IO split cases, should this order be fixed? I feel we should; this
> disorder can happen in any stacking case where the top-level max_sectors
> is greater than that of the stacked disk.
Yes, I've been trying to get Bart to fix this for a while instead of
putting in a workaround very similar to the one proposed here,
but so far nothing has happened.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 9:02 ` Christoph Hellwig
@ 2025-08-21 9:33 ` Hannes Reinecke
2025-08-21 9:42 ` Yu Kuai
2025-08-21 9:37 ` Yu Kuai
2025-08-21 15:19 ` Bart Van Assche
2 siblings, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2025-08-21 9:33 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)
On 8/21/25 11:02, Christoph Hellwig wrote:
> On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
>> Can you give some examples of how to chain the right way?
>
> fs/xfs/xfs_bio_io.c: xfs_rw_bdev
> fs/xfs/xfs_buf.c: xfs_buf_submit_bio
> fs/xfs/xfs_log.c: xlog_write_iclog
>
>> BTW, for all
>> the IO split cases, should this order be fixed? I feel we should; this
>> disorder can happen in any stacking case where the top-level max_sectors
>> is greater than that of the stacked disk.
>
> Yes, I've been trying to get Bart to fix this for a while instead of
> putting in a workaround very similar to the one proposed here,
> but so far nothing has happened.
>
>
This feels like a really stupid fix, but wouldn't that help?
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 023649fe2476..2b342bb59612 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5478,7 +5478,6 @@ static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio)
 		split = bio_split(raid_bio, sectors, GFP_NOIO,
 				  &conf->bio_split);
 		bio_chain(split, raid_bio);
 		submit_bio_noacct(raid_bio);
-		raid_bio = split;
 	}
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 9:33 ` Hannes Reinecke
@ 2025-08-21 9:42 ` Yu Kuai
0 siblings, 0 replies; 14+ messages in thread
From: Yu Kuai @ 2025-08-21 9:42 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig, Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)
Hi,
On 2025/08/21 17:33, Hannes Reinecke wrote:
> On 8/21/25 11:02, Christoph Hellwig wrote:
>> On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
>>> Can you give some examples of how to chain the right way?
>>
>> fs/xfs/xfs_bio_io.c: xfs_rw_bdev
>> fs/xfs/xfs_buf.c: xfs_buf_submit_bio
>> fs/xfs/xfs_log.c: xlog_write_iclog
>>
>>> BTW, for all
>>> the IO split cases, should this order be fixed? I feel we should; this
>>> disorder can happen in any stacking case where the top-level max_sectors
>>> is greater than that of the stacked disk.
>>
>> Yes, I've been trying to get Bart to fix this for a while instead of
>> putting in a workaround very similar to the one proposed here,
>> but so far nothing has happened.
>>
>>
> This feels like a really stupid fix, but wouldn't that help?
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 023649fe2476..2b342bb59612 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5478,7 +5478,6 @@ static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio)
>  		split = bio_split(raid_bio, sectors, GFP_NOIO,
>  				  &conf->bio_split);
>  		bio_chain(split, raid_bio);
>  		submit_bio_noacct(raid_bio);
> -		raid_bio = split;
>  	}
>
I do not understand how this can help; did you mean to submit split
instead?
With that change it can help; however, I think we'll still submit the
bio with the last LBA first, like bio_last -> bio0 -> bio1 ..., where
everything after the first bio is sequential.
BTW, this is not just a raid5 problem; the same is possible for raid0/10
and all other recursive split cases.
Thanks,
Kuai
>
> Cheers,
>
> Hannes
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 9:02 ` Christoph Hellwig
2025-08-21 9:33 ` Hannes Reinecke
@ 2025-08-21 9:37 ` Yu Kuai
2025-08-25 6:15 ` Yu Kuai
2025-08-25 9:17 ` Christoph Hellwig
2025-08-21 15:19 ` Bart Van Assche
2 siblings, 2 replies; 14+ messages in thread
From: Yu Kuai @ 2025-08-21 9:37 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yi.zhang, yangerkun, johnny.chenyi, yukuai (C), tieren
Hi,
On 2025/08/21 17:02, Christoph Hellwig wrote:
> On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
>> Can you give some examples of how to chain the right way?
>
> fs/xfs/xfs_bio_io.c: xfs_rw_bdev
Just took a look, this is
old bio -> new bio
while bio split is:
new bio -> old bio
So xfs_rw_bdev() won't flag the old bio as BIO_CHAIN, while the old bio
will still be resubmitted to current->bio_list; hence this patch won't
break this case, right?
I'll take a look at all the bio_chain() callers other than the split
case to make sure.
> fs/xfs/xfs_buf.c: xfs_buf_submit_bio
This is a little different: new bio -> old bio, and while the new bio is
resubmitted, it still doesn't have the BIO_CHAIN flag set, so this case
is not affected by this patch.
> fs/xfs/xfs_log.c: xlog_write_iclog
This is the same as above.
>
>> BTW, for all
>> the IO split cases, should this order be fixed? I feel we should; this
>> disorder can happen in any stacking case where the top-level max_sectors
>> is greater than that of the stacked disk.
>
> Yes, I've been trying to get Bart to fix this for a while instead of
> putting in a workaround very similar to the one proposed here,
> but so far nothing has happened.
>
Do you mean this thread?
"Fix bio splitting by the crypto fallback code"
I'll take a look at all the callers of bio_chain(); in theory, we'll
have different use cases like:
1) chain old -> new, or chain new -> old;
2) put the old or new bio on current->bio_list, currently always at the
tail; we might want a new case for the head.
Perhaps it'll make sense to add high-level helpers to do the chain and
resubmit, and convert all callers to use the new helpers; a sketch of
the idea follows below. What do you think?
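A minimal sketch of such a helper for the split case (the name is made
up and error handling is omitted; bio_split(), bio_chain(),
bio_list_add_head() and submit_bio_noacct() are the existing interfaces):

static struct bio *bio_split_resubmit(struct bio *bio, int sectors,
				      struct bio_set *bs)
{
	struct bio *split = bio_split(bio, sectors, GFP_NOIO, bs);

	bio_chain(split, bio);		/* @bio completes after @split */
	if (current->bio_list)		/* inside ->submit_bio recursion */
		bio_list_add_head(&current->bio_list[0], bio);
	else
		submit_bio_noacct(bio);
	return split;	/* caller keeps processing the split part inline */
}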
Thanks,
Kuai
> .
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 9:37 ` Yu Kuai
@ 2025-08-25 6:15 ` Yu Kuai
2025-08-25 9:15 ` Christoph Hellwig
2025-08-25 9:17 ` Christoph Hellwig
1 sibling, 1 reply; 14+ messages in thread
From: Yu Kuai @ 2025-08-25 6:15 UTC (permalink / raw)
To: Yu Kuai, Christoph Hellwig
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yi.zhang, yangerkun, johnny.chenyi, tieren, yukuai (C)
Hi,
On 2025/08/21 17:37, Yu Kuai wrote:
> On 2025/08/21 17:02, Christoph Hellwig wrote:
>> On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
>>> Can you give some examples of how to chain the right way?
>>
>> fs/xfs/xfs_bio_io.c: xfs_rw_bdev
>
> Just took a look, this is
>
> old bio -> new bio
>
> while bio split is:
>
> new bio -> old bio
>
> So xfs_rw_bdev() won't flag the old bio as BIO_CHAIN, while the old bio
> will still be resubmitted to current->bio_list; hence this patch won't
> break this case, right?
And xfs_rw_bdev() is not called in submit_bio() context; current->bio_list
is still NULL, which means xfs_rw_bdev() is submitting bios one by one in
the right LBA order, so the recursive bio handling is not involved in this
case.
Thanks,
Kuai
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-25 6:15 ` Yu Kuai
@ 2025-08-25 9:15 ` Christoph Hellwig
0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-08-25 9:15 UTC (permalink / raw)
To: Yu Kuai
Cc: Christoph Hellwig, axboe, neil, akpm, linux-block, linux-raid,
linux-kernel, colyli, xni, yi.zhang, yangerkun, johnny.chenyi,
tieren, yukuai (C)
On Mon, Aug 25, 2025 at 02:15:49PM +0800, Yu Kuai wrote:
> > be resubmitted to current->bio_list; hence this patch won't break this
> > case, right?
>
> And xfs_rw_bdev() is not called in submit_bio() context; current->bio_list
> is still NULL, which means xfs_rw_bdev() is submitting bios one by one in
> the right LBA order, so the recursive bio handling is not involved in this
> case.
Yes, usually not - unless we somehow eventually get the loop device
out of its separate workqueue context.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 9:37 ` Yu Kuai
2025-08-25 6:15 ` Yu Kuai
@ 2025-08-25 9:17 ` Christoph Hellwig
2025-08-25 9:49 ` Yu Kuai
1 sibling, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-08-25 9:17 UTC (permalink / raw)
To: Yu Kuai
Cc: Christoph Hellwig, axboe, neil, akpm, linux-block, linux-raid,
linux-kernel, colyli, xni, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C), tieren
On Thu, Aug 21, 2025 at 05:37:15PM +0800, Yu Kuai wrote:
> Fix bio splitting by the crypto fallback code
Yes.
>
> I'll take a look at all the callers of bio_chain(); in theory, we'll
> have different use cases like:
>
> 1) chain old -> new, or chain new -> old;
> 2) put the old or new bio on current->bio_list, currently always at the
> tail; we might want a new case for the head.
>
> Perhaps it'll make sense to add high-level helpers to do the chain and
> resubmit, and convert all callers to use the new helpers. What do you
> think?
I don't think chaining really is the problem here, but more how bios
are split when already in the block layer. It's been a bit of a
source of problems, so I think we'll need to sort it out. Especially
as the handling of splits for the same device vs devices below the
current one seems a bit problematic in general.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-25 9:17 ` Christoph Hellwig
@ 2025-08-25 9:49 ` Yu Kuai
0 siblings, 0 replies; 14+ messages in thread
From: Yu Kuai @ 2025-08-25 9:49 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yi.zhang, yangerkun, johnny.chenyi, tieren, yukuai (C)
Hi,
On 2025/08/25 17:17, Christoph Hellwig wrote:
> On Thu, Aug 21, 2025 at 05:37:15PM +0800, Yu Kuai wrote:
>> Fix bio splitting by the crypto fallback code
>
> Yes.
>
>>
>> I'll take a look at all the callers of bio_chain(); in theory, we'll
>> have different use cases like:
>>
>> 1) chain old -> new, or chain new -> old;
>> 2) put the old or new bio on current->bio_list, currently always at the
>> tail; we might want a new case for the head.
>>
>> Perhaps it'll make sense to add high-level helpers to do the chain and
>> resubmit, and convert all callers to use the new helpers. What do you
>> think?
>
> I don't think chaining really is the problem here, but more how bios
> are split when already in the block layer. It's been a bit of a
> source of problems, so I think we'll need to sort it out. Especially
> as the handling of splits for the same device vs devices below the
> current one seems a bit problematic in general.
>
I just sent a new RFC version that unifies the block layer and mdraid on
the same helper, bio_submit_split(), and converts only that helper to
insert the split bio at the head of current->bio_list. And probably
blk-crypto-fallback can use this new helper as well.
Can you take a look? This is the most proper solution I can think of
for now.
Thanks,
Kuai
> .
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 9:02 ` Christoph Hellwig
2025-08-21 9:33 ` Hannes Reinecke
2025-08-21 9:37 ` Yu Kuai
@ 2025-08-21 15:19 ` Bart Van Assche
2 siblings, 0 replies; 14+ messages in thread
From: Bart Van Assche @ 2025-08-21 15:19 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)
On 8/21/25 2:02 AM, Christoph Hellwig wrote:
> On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
>> BTW, for all the IO split cases, should this order be fixed? I feel
>> we should; this disorder can happen in any stacking case where the
>> top-level max_sectors is greater than that of the stacked disk.
>
> Yes, I've been trying to get Bart to fix this for a while instead of
> putting in a workaround very similar to the one proposed here,
> but so far nothing has happened.
Does the above comment refer to the block/blk-crypto-fallback.c code? I
will leave it to Eric Biggers to move the bio splitting call from that
code into the filesystems that need it.
Bart.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 7:47 [PATCH] block: fix disordered IO in the case recursive split Yu Kuai
2025-08-21 8:06 ` Coly Li
2025-08-21 8:43 ` Christoph Hellwig
@ 2025-08-21 9:16 ` Coly Li
2 siblings, 0 replies; 14+ messages in thread
From: Coly Li @ 2025-08-21 9:16 UTC (permalink / raw)
To: Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, xni,
yukuai3, yi.zhang, yangerkun, johnny.chenyi
On Thu, Aug 21, 2025 at 03:47:06PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> [...]
> Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
BTW, this issue was not originally caught by me; my colleague Tie Ren found it.
Please consider adding:
Reported-by: Tie Ren <tieren@fnnas.com>
Thanks.
Coly Li
^ permalink raw reply [flat|nested] 14+ messages in thread