* [PATCH] block: fix disordered IO in the case recursive split
@ 2025-08-21 7:47 Yu Kuai
2025-08-21 8:06 ` Coly Li
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Yu Kuai @ 2025-08-21 7:47 UTC (permalink / raw)
To: axboe, neil, akpm
Cc: linux-block, linux-raid, linux-kernel, colyli, xni, yukuai3,
yukuai1, yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Currently, a split bio is chained to the original bio, and the original
bio is resubmitted to the tail of current->bio_list, waiting for the
split bio to be issued. However, if the split bio gets split again, the
IO order will be messed up. For example, in raid456 an IO will first be
split by max_sectors from md_submit_bio(), and then later be split again
by chunk size for internal handling:
For example, assume max_sectors is 1M, and chunksize is 512k
1) issue a 2M IO:
bio issuing: 0+2M
current->bio_list: NULL
2) md_submit_bio() split by max_sector:
bio issuing: 0+1M
current->bio_list: 1M+1M
3) chunk_aligned_read() split by chunksize:
bio issuing: 0+512k
current->bio_list: 1M+1M -> 512k+512k
4) after the first bio is issued, __submit_bio_noacct() will continue
issuing the next bio:
bio issuing: 1M+1M
current->bio_list: 512k+512k
bio issued: 0+512k
5) chunk_aligned_read() split by chunksize:
bio issuing: 1M+512k
current->bio_list: 512k+512k -> 1536k+512k
bio issued: 0+512k
6) no split afterwards, finally the issue order is:
0+512k -> 1M+512k -> 512k+512k -> 1536k+512k
This behaviour causes a large sequential read on raid456 to end up as
small discontinuous IOs on the underlying disks. Fix this problem by
placing the chained bio at the head of current->bio_list.
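The reordering described above can be reproduced with a small toy model of
current->bio_list (illustrative Python only; the names and constants are
hypothetical, and the real handling lives in submit_bio_noacct_nocheck()):

```python
from collections import deque

MAX_SECTORS = 1024  # md queue limit in KiB (1M), as in the example above
CHUNK = 512         # raid5 chunk size in KiB (512k)

def simulate(total_kb, chained_to_head):
    """Model the two-stage split: first by max_sectors (md_submit_bio),
    then by chunk size (chunk_aligned_read). Each chained remainder is
    resubmitted to current->bio_list."""
    issued = []
    bio_list = deque([(0, total_kb)])  # models current->bio_list

    def resubmit(bio):
        if chained_to_head:
            bio_list.appendleft(bio)   # patched: head of bio_list
        else:
            bio_list.append(bio)       # current behaviour: tail

    while bio_list:
        off, length = bio_list.popleft()
        if length > MAX_SECTORS:       # split by max_sectors
            resubmit((off + MAX_SECTORS, length - MAX_SECTORS))
            length = MAX_SECTORS
        if length > CHUNK:             # split by chunk size
            resubmit((off + CHUNK, length - CHUNK))
            length = CHUNK
        issued.append((off, length))   # this chunk reaches the disk
    return issued

print(simulate(2048, chained_to_head=False))
# [(0, 512), (1024, 512), (512, 512), (1536, 512)]  -> disordered
print(simulate(2048, chained_to_head=True))
# [(0, 512), (512, 512), (1024, 512), (1536, 512)]  -> sequential
```

The FIFO run reproduces the issue order shown in step 6), while head
insertion keeps the whole 2M read sequential.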
Test script: test on 8 disk raid5 with 64k chunksize
dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct
Test results:
Before this patch
1) iostat results:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
md0 52430.00 3276.87 0.00 0.00 0.62 64.00 32.60 80.10
sd* 4487.00 409.00 2054.00 31.40 0.82 93.34 3.68 71.20
2) blktrace G stage:
8,0 0 486445 11.357392936 843 G R 14071424 + 128 [dd]
8,0 0 486451 11.357466360 843 G R 14071168 + 128 [dd]
8,0 0 486454 11.357515868 843 G R 14071296 + 128 [dd]
8,0 0 486468 11.357968099 843 G R 14072192 + 128 [dd]
8,0 0 486474 11.358031320 843 G R 14071936 + 128 [dd]
8,0 0 486480 11.358096298 843 G R 14071552 + 128 [dd]
8,0 0 486490 11.358303858 843 G R 14071808 + 128 [dd]
3) io seek for sdx:
Note: IO seek is derived from blktrace D stage, as the statistic of:
ABS((offset of next IO) - (offset + len of previous IO))
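As a hypothetical sketch, the statistic can be computed from consecutive
trace records like this (illustrative Python; offsets and lengths in
sectors, here fed with the first three G-stage records above):

```python
def seeks(ios):
    """Seek distance between consecutive IOs:
    ABS((offset of next IO) - (offset + len of previous IO))."""
    return [abs(nxt_off - (off + length))
            for (off, length), (nxt_off, _) in zip(ios, ios[1:])]

print(seeks([(14071424, 128), (14071168, 128), (14071296, 128)]))
# [384, 0]  -> one 192 KiB backward seek, then fully sequential
```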
Read|Write seek
cnt 55175, zero cnt 25079
>=(KB) .. <(KB) : count ratio |distribution |
0 .. 1 : 25079 45.5% |########################################|
1 .. 2 : 0 0.0% | |
2 .. 4 : 0 0.0% | |
4 .. 8 : 0 0.0% | |
8 .. 16 : 0 0.0% | |
16 .. 32 : 0 0.0% | |
32 .. 64 : 12540 22.7% |##################### |
64 .. 128 : 2508 4.5% |##### |
128 .. 256 : 0 0.0% | |
256 .. 512 : 10032 18.2% |################# |
512 .. 1024 : 5016 9.1% |######### |
After this patch:
1) iostat results:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
md0 87965.00 5271.88 0.00 0.00 0.16 61.37 14.03 90.60
sd* 6020.00 658.44 5117.00 45.95 0.44 112.00 2.68 86.50
2) blktrace G stage:
8,0 0 206296 5.354894072 664 G R 7156992 + 128 [dd]
8,0 0 206305 5.355018179 664 G R 7157248 + 128 [dd]
8,0 0 206316 5.355204438 664 G R 7157504 + 128 [dd]
8,0 0 206319 5.355241048 664 G R 7157760 + 128 [dd]
8,0 0 206333 5.355500923 664 G R 7158016 + 128 [dd]
8,0 0 206344 5.355837806 664 G R 7158272 + 128 [dd]
8,0 0 206353 5.355960395 664 G R 7158528 + 128 [dd]
8,0 0 206357 5.356020772 664 G R 7158784 + 128 [dd]
3) io seek for sdx:
Read|Write seek
cnt 28644, zero cnt 21483
>=(KB) .. <(KB) : count ratio |distribution |
0 .. 1 : 21483 75.0% |########################################|
1 .. 2 : 0 0.0% | |
2 .. 4 : 0 0.0% | |
4 .. 8 : 0 0.0% | |
8 .. 16 : 0 0.0% | |
16 .. 32 : 0 0.0% | |
32 .. 64 : 7161 25.0% |############## |
BTW, this looks like a long-standing problem from day one, and large
sequential reads are a pretty common case, e.g. video playback.
And even with this patch, IO in this test case is merged to at most 128k
due to the block layer plug limit BLK_PLUG_FLUSH_SIZE; increasing that
limit can get even better performance. However, we'll figure out how to
do this properly later.
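Why merges cap at 128k can be sketched with a loose model of plug merging
(hypothetical Python, not the real plug code; BLK_PLUG_FLUSH_SIZE is
assumed to be 128 KiB here):

```python
BLK_PLUG_FLUSH_SIZE = 128  # KiB, assumed value of the plug flush limit

def merge_under_plug(chunks):
    """Merge contiguous (offset, len) chunks into requests, refusing to
    grow a request beyond BLK_PLUG_FLUSH_SIZE."""
    reqs = []
    for off, length in chunks:
        if (reqs
                and reqs[-1][0] + reqs[-1][1] == off          # contiguous
                and reqs[-1][1] + length <= BLK_PLUG_FLUSH_SIZE):
            reqs[-1] = (reqs[-1][0], reqs[-1][1] + length)    # merge
        else:
            reqs.append((off, length))                        # new request
    return reqs

# eight sequential 64k chunks, as produced by the in-order split above
print(merge_under_plug([(i * 64, 64) for i in range(8)]))
# [(0, 128), (128, 128), (256, 128), (384, 128)]  -> capped at 128k
```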
Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-core.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 4201504158a1..0d46d10edb22 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
* to collect a list of requests submited by a ->submit_bio method while
* it is active, and then process them after it returned.
*/
-	if (current->bio_list)
-		bio_list_add(&current->bio_list[0], bio);
-	else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
+	if (current->bio_list) {
+		if (bio_flagged(bio, BIO_CHAIN))
+			bio_list_add_head(&current->bio_list[0], bio);
+		else
+			bio_list_add(&current->bio_list[0], bio);
+	} else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
 		__submit_bio_noacct_mq(bio);
-	else
+	} else {
 		__submit_bio_noacct(bio);
+	}
}
static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
--
2.39.2
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 7:47 [PATCH] block: fix disordered IO in the case recursive split Yu Kuai
@ 2025-08-21 8:06 ` Coly Li
2025-08-21 8:43 ` Christoph Hellwig
2025-08-21 9:16 ` Coly Li
2 siblings, 0 replies; 14+ messages in thread
From: Coly Li @ 2025-08-21 8:06 UTC (permalink / raw)
To: Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, xni,
yukuai3, yi.zhang, yangerkun, johnny.chenyi
On Thu, Aug 21, 2025 at 03:47:06PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> [...]
> Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Coly Li <colyli@kernel.org>
Yeah, with 'dd if=./on-fs-file of=/dev/null bs=6400K status=progress' I can see
the read throughput improve from around 1.2GiB/s to 9.8GiB/s.
Great job and thank you!
Coly Li
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 7:47 [PATCH] block: fix disordered IO in the case recursive split Yu Kuai
2025-08-21 8:06 ` Coly Li
@ 2025-08-21 8:43 ` Christoph Hellwig
2025-08-21 8:56 ` Yu Kuai
2025-08-21 9:16 ` Coly Li
2 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-08-21 8:43 UTC (permalink / raw)
To: Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yukuai3, yi.zhang, yangerkun, johnny.chenyi
On Thu, Aug 21, 2025 at 03:47:06PM +0800, Yu Kuai wrote:
> +	if (current->bio_list) {
> +		if (bio_flagged(bio, BIO_CHAIN))
> +			bio_list_add_head(&current->bio_list[0], bio);
> +		else
> +			bio_list_add(&current->bio_list[0], bio);
> +	} else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
This breaks all the code that already chains the right way around,
and there's quite a bit of that (speaking as someone who created a few
instances).
So instead, fix your submitter to chain the right way.
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 8:43 ` Christoph Hellwig
@ 2025-08-21 8:56 ` Yu Kuai
2025-08-21 9:02 ` Christoph Hellwig
0 siblings, 1 reply; 14+ messages in thread
From: Yu Kuai @ 2025-08-21 8:56 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)
Hi,
在 2025/08/21 16:43, Christoph Hellwig 写道:
> On Thu, Aug 21, 2025 at 03:47:06PM +0800, Yu Kuai wrote:
>> +	if (current->bio_list) {
>> +		if (bio_flagged(bio, BIO_CHAIN))
>> +			bio_list_add_head(&current->bio_list[0], bio);
>> +		else
>> +			bio_list_add(&current->bio_list[0], bio);
>> +	} else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
>
> This breaks all the code the already chains the right way around,
> and there's quite a bit of that (speaking as someone who created a few
> instances).
>
> So instead fix your submitter to chain the right way instead.
>
Can you give some examples of how to chain the right way? BTW, for all
the IO split cases, should this order be fixed? I feel we should; this
disorder can happen in any stacking case where the top-level max_sectors
is greater than that of the stacked disk.
Thanks,
Kuai
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 8:56 ` Yu Kuai
@ 2025-08-21 9:02 ` Christoph Hellwig
2025-08-21 9:33 ` Hannes Reinecke
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-08-21 9:02 UTC (permalink / raw)
To: Yu Kuai
Cc: Christoph Hellwig, axboe, neil, akpm, linux-block, linux-raid,
linux-kernel, colyli, xni, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C)
On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
> Can you give some examples as how to chain the right way?
fs/xfs/xfs_bio_io.c: xfs_rw_bdev
fs/xfs/xfs_buf.c: xfs_buf_submit_bio
fs/xfs/xfs_log.c: xlog_write_iclog
> BTW, for all
> the io split case, should this order be fixed? I feel we should, this
> disorder can happen on any stack case, where top max_sector is greater
> than stacked disk.
Yes, I've been trying to get Bart to fix this for a while instead of
putting in a workaround very similar to the one proposed here, but so
far nothing has happened.
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 7:47 [PATCH] block: fix disordered IO in the case recursive split Yu Kuai
2025-08-21 8:06 ` Coly Li
2025-08-21 8:43 ` Christoph Hellwig
@ 2025-08-21 9:16 ` Coly Li
2 siblings, 0 replies; 14+ messages in thread
From: Coly Li @ 2025-08-21 9:16 UTC (permalink / raw)
To: Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, xni,
yukuai3, yi.zhang, yangerkun, johnny.chenyi
On Thu, Aug 21, 2025 at 03:47:06PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> [...]
> Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
BTW, this issue was not originally caught by me; my colleague Tie Ren
found it. Please consider adding:
Reported-by: Tie Ren <tieren@fnnas.com>
Thanks.
Coly Li
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 9:02 ` Christoph Hellwig
@ 2025-08-21 9:33 ` Hannes Reinecke
2025-08-21 9:42 ` Yu Kuai
2025-08-21 9:37 ` Yu Kuai
2025-08-21 15:19 ` Bart Van Assche
2 siblings, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2025-08-21 9:33 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)
On 8/21/25 11:02, Christoph Hellwig wrote:
> On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
>> Can you give some examples as how to chain the right way?
>
> fs/xfs/xfs_bio_io.c: xfs_rw_bdev
> fs/xfs/xfs_buf.c: xfs_buf_submit_bio
> fs/xfs/xfs_log.c: xlog_write_iclog
>
>> BTW, for all
>> the io split case, should this order be fixed? I feel we should, this
>> disorder can happen on any stack case, where top max_sector is greater
>> than stacked disk.
>
> Yes, I've been trying get Bart to fix this for a while instead of
> putting in a workaround very similar to the one proposed here,
> but so far nothing happened.
>
>
This feels like a really stupid fix, but wouldn't that help?
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 023649fe2476..2b342bb59612 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5478,7 +5478,6 @@ static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio)
 		split = bio_split(raid_bio, sectors, GFP_NOIO,
 				  &conf->bio_split);
 		bio_chain(split, raid_bio);
 		submit_bio_noacct(raid_bio);
-		raid_bio = split;
 	}
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 9:02 ` Christoph Hellwig
2025-08-21 9:33 ` Hannes Reinecke
@ 2025-08-21 9:37 ` Yu Kuai
2025-08-25 6:15 ` Yu Kuai
2025-08-25 9:17 ` Christoph Hellwig
2025-08-21 15:19 ` Bart Van Assche
2 siblings, 2 replies; 14+ messages in thread
From: Yu Kuai @ 2025-08-21 9:37 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yi.zhang, yangerkun, johnny.chenyi, yukuai (C), tieren
Hi,
在 2025/08/21 17:02, Christoph Hellwig 写道:
> On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
>> Can you give some examples as how to chain the right way?
>
> fs/xfs/xfs_bio_io.c: xfs_rw_bdev
Just took a look; this is
old bio -> new bio
while a bio split is:
new bio -> old bio
So xfs_rw_bdev() won't flag the old bio as BIO_CHAIN, and the old bio
will still be resubmitted to current->bio_list; hence this patch won't
break this case, right?
I'll take a look at all the bio_chain() callers other than the split
case to make sure.
> fs/xfs/xfs_buf.c: xfs_buf_submit_bio
This one is a little different: new bio -> old bio, but the new bio is
resubmitted while it still doesn't have the BIO_CHAIN flag, so it is not
affected by this patch.
> fs/xfs/xfs_log.c: xlog_write_iclog
This is the same as above.
>
>> BTW, for all
>> the io split case, should this order be fixed? I feel we should, this
>> disorder can happen on any stack case, where top max_sector is greater
>> than stacked disk.
>
> Yes, I've been trying get Bart to fix this for a while instead of
> putting in a workaround very similar to the one proposed here,
> but so far nothing happened.
>
Do you mean this thread?
"Fix bio splitting by the crypto fallback code"
I'll take a look at all the callers of bio_chain(); in theory, we'll
have different use cases like:
1) chain old -> new, or chain new -> old;
2) put the old or new bio on current->bio_list, currently always at the
tail; we might want a new case for the head.
Perhaps it would make sense to add high-level helpers to do the chain
and resubmit, and convert all callers to use the new helpers. What do
you think?
Thanks,
Kuai
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 9:33 ` Hannes Reinecke
@ 2025-08-21 9:42 ` Yu Kuai
0 siblings, 0 replies; 14+ messages in thread
From: Yu Kuai @ 2025-08-21 9:42 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig, Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)
Hi,
在 2025/08/21 17:33, Hannes Reinecke 写道:
> On 8/21/25 11:02, Christoph Hellwig wrote:
>> On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
>>> Can you give some examples as how to chain the right way?
>>
>> fs/xfs/xfs_bio_io.c: xfs_rw_bdev
>> fs/xfs/xfs_buf.c: xfs_buf_submit_bio
>> fs/xfs/xfs_log.c: xlog_write_iclog
>>
>>> BTW, for all
>>> the io split case, should this order be fixed? I feel we should, this
>>> disorder can happen on any stack case, where top max_sector is greater
>>> than stacked disk.
>>
>> Yes, I've been trying get Bart to fix this for a while instead of
>> putting in a workaround very similar to the one proposed here,
>> but so far nothing happened.
>>
>>
> This feels like a really stupid fix, but wouldn't that help?
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 023649fe2476..2b342bb59612 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5478,7 +5478,6 @@ static struct bio *chunk_aligned_read(struct mddev
> *mddev, struct bio *raid_bio)
> split = bio_split(raid_bio, sectors, GFP_NOIO,
> &conf->bio_split);
> bio_chain(split, raid_bio);
> submit_bio_noacct(raid_bio);
> - raid_bio = split;
> }
>
I do not understand how this can help; did you mean to submit the split
bio instead?
And with that change it can help, however I think we'll still submit
the last LBA bio first, like bio_last -> bio0 -> bio1 ..., where the
following bios are sequential.
BTW, this is not just a raid5 problem; it is also possible for raid0/10
and all other recursive split cases.
Thanks,
Kuai
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 9:02 ` Christoph Hellwig
2025-08-21 9:33 ` Hannes Reinecke
2025-08-21 9:37 ` Yu Kuai
@ 2025-08-21 15:19 ` Bart Van Assche
2 siblings, 0 replies; 14+ messages in thread
From: Bart Van Assche @ 2025-08-21 15:19 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)
On 8/21/25 2:02 AM, Christoph Hellwig wrote:
> On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
>> BTW, for all the io split case, should this order be fixed? I feel
>> we should, this disorder can happen on any stack case, where top
>> max_sector is greater than stacked disk.
>
> Yes, I've been trying get Bart to fix this for a while instead of
> putting in a workaround very similar to the one proposed here,
> but so far nothing happened.
Does the above comment refer to the block/blk-crypto-fallback.c code? I
will leave it to Eric Biggers to move the bio splitting call from that
code into the filesystems that need it.
Bart.
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 9:37 ` Yu Kuai
@ 2025-08-25 6:15 ` Yu Kuai
2025-08-25 9:15 ` Christoph Hellwig
2025-08-25 9:17 ` Christoph Hellwig
1 sibling, 1 reply; 14+ messages in thread
From: Yu Kuai @ 2025-08-25 6:15 UTC (permalink / raw)
To: Yu Kuai, Christoph Hellwig
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yi.zhang, yangerkun, johnny.chenyi, tieren, yukuai (C)
Hi,
在 2025/08/21 17:37, Yu Kuai 写道:
> 在 2025/08/21 17:02, Christoph Hellwig 写道:
>> On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
>>> Can you give some examples as how to chain the right way?
>>
>> fs/xfs/xfs_bio_io.c: xfs_rw_bdev
>
> Just take a look, this is
>
> old bio->new bio
>
> while bio split is:
>
> new_bio->old bio
>
> So xfs_rw_bdev won't flag old bio as BIO_CHAIN, while old bio will still
> be resubmitted to current->bio_list, hence this patch won't break this
> case, right?
And xfs_rw_bdev() is not called in submit_bio() context, so
current->bio_list is still NULL, which means xfs_rw_bdev() submits bios
one by one in the right LBA order; the recursive bio handling is not
involved in this case.
Thanks,
Kuai
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-25 6:15 ` Yu Kuai
@ 2025-08-25 9:15 ` Christoph Hellwig
0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-08-25 9:15 UTC (permalink / raw)
To: Yu Kuai
Cc: Christoph Hellwig, axboe, neil, akpm, linux-block, linux-raid,
linux-kernel, colyli, xni, yi.zhang, yangerkun, johnny.chenyi,
tieren, yukuai (C)
On Mon, Aug 25, 2025 at 02:15:49PM +0800, Yu Kuai wrote:
> > be resubmitted to current->bio_list, hence this patch won't break this
> > case, right?
>
> And xfs_rw_bdev() is not under submit_bio() context, current->bio_list
> is still NULL, means xfs_rw_bdev() is submitting bio one by one in the
> right lba order, the bio recursive handling is not related in this case.
Yes, usually not, unless we somehow eventually move the loop device out
to a separate workqueue context.
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-21 9:37 ` Yu Kuai
2025-08-25 6:15 ` Yu Kuai
@ 2025-08-25 9:17 ` Christoph Hellwig
2025-08-25 9:49 ` Yu Kuai
1 sibling, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-08-25 9:17 UTC (permalink / raw)
To: Yu Kuai
Cc: Christoph Hellwig, axboe, neil, akpm, linux-block, linux-raid,
linux-kernel, colyli, xni, yi.zhang, yangerkun, johnny.chenyi,
yukuai (C), tieren
On Thu, Aug 21, 2025 at 05:37:15PM +0800, Yu Kuai wrote:
> Fix bio splitting by the crypto fallback code
Yes.
>
> I'll take look at all the callers of bio_chain(), in theory, we'll have
> different use cases like:
>
> 1) chain old -> new, or chain new -> old
> 2) put old or new to current->bio_list, currently always in the tail,
> we might want a new case to the head;
>
> Perhaps it'll make sense to add high level helpers to do the chain
> and resubmit and convert all callers to use new helpers, want do you
> think?
I don't think chaining really is the problem here, but rather how bios
are split when already in the block layer. It's been a bit of a source
of problems, so I think we'll need to sort it out, especially as the
handling of splits for the same device vs devices below the current one
seems a bit problematic in general.
* Re: [PATCH] block: fix disordered IO in the case recursive split
2025-08-25 9:17 ` Christoph Hellwig
@ 2025-08-25 9:49 ` Yu Kuai
0 siblings, 0 replies; 14+ messages in thread
From: Yu Kuai @ 2025-08-25 9:49 UTC (permalink / raw)
To: Christoph Hellwig, Yu Kuai
Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
xni, yi.zhang, yangerkun, johnny.chenyi, tieren, yukuai (C)
Hi,
在 2025/08/25 17:17, Christoph Hellwig 写道:
> On Thu, Aug 21, 2025 at 05:37:15PM +0800, Yu Kuai wrote:
>> Fix bio splitting by the crypto fallback code
>
> Yes.
>
>>
>> I'll take look at all the callers of bio_chain(), in theory, we'll have
>> different use cases like:
>>
>> 1) chain old -> new, or chain new -> old
>> 2) put old or new to current->bio_list, currently always in the tail,
>> we might want a new case to the head;
>>
>> Perhaps it'll make sense to add high level helpers to do the chain
>> and resubmit and convert all callers to use new helpers, want do you
>> think?
>
> I don't think chaining really is problem here, but more how bios
> are split when already in the block layer. It's been a bit of a
> source for problems, so I think we'll need to sort it out. Especially
> as the handling of splits for the same device vs devices below the
> current one seems a bit problematic in general.
>
I just sent a new RFC version that unifies the block layer and mdraid
to use the same helper, bio_submit_split(), and converts only that
helper to insert the split bio at the head of current->bio_list. And
probably blk-crypto-fallback can use this new helper as well.
Can you take a look? This is the proper solution that I can think of
for now.
Thanks,
Kuai