linux-block.vger.kernel.org archive mirror
* [PATCH] block: fix disordered IO in the case recursive split
@ 2025-08-21  7:47 Yu Kuai
  2025-08-21  8:06 ` Coly Li
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Yu Kuai @ 2025-08-21  7:47 UTC (permalink / raw)
  To: axboe, neil, akpm
  Cc: linux-block, linux-raid, linux-kernel, colyli, xni, yukuai3,
	yukuai1, yi.zhang, yangerkun, johnny.chenyi

From: Yu Kuai <yukuai3@huawei.com>

Currently, a split bio is chained to the original bio, and the original
bio is resubmitted to the tail of current->bio_list, waiting for the
split bio to be issued. However, if the split bio gets split again, the IO
order will be messed up. For example, in raid456 an IO will first be
split by max_sectors from md_submit_bio(), and then later be split
again by the chunk size for internal handling:

For example, assume max_sectors is 1M, and chunksize is 512k

1) issue a 2M IO:

bio issuing: 0+2M
current->bio_list: NULL

2) md_submit_bio() split by max_sector:

bio issuing: 0+1M
current->bio_list: 1M+1M

3) chunk_aligned_read() split by chunksize:

bio issuing: 0+512k
current->bio_list: 1M+1M -> 512k+512k

4) after the first bio is issued, __submit_bio_noacct() will continue
issuing the next bio:

bio issuing: 1M+1M
current->bio_list: 512k+512k
bio issued: 0+512k

5) chunk_aligned_read() split by chunksize:

bio issuing: 1M+512k
current->bio_list: 512k+512k -> 1536k+512k
bio issued: 0+512k

6) no split afterwards, finally the issue order is:

0+512k -> 1M+512k -> 512k+512k -> 1536k+512k

This behaviour causes a large sequential read on raid456 to end up as
small discontinuous IOs on the underlying disks. Fix this problem by
placing the chained bio at the head of current->bio_list.

Test script: test on 8 disk raid5 with 64k chunksize
dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct

Test results:
Before this patch
1) iostat results:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
md0           52430.00   3276.87     0.00   0.00    0.62    64.00   32.60  80.10
sd*           4487.00    409.00  2054.00  31.40    0.82    93.34    3.68  71.20
2) blktrace G stage:
  8,0    0   486445    11.357392936   843  G   R 14071424 + 128 [dd]
  8,0    0   486451    11.357466360   843  G   R 14071168 + 128 [dd]
  8,0    0   486454    11.357515868   843  G   R 14071296 + 128 [dd]
  8,0    0   486468    11.357968099   843  G   R 14072192 + 128 [dd]
  8,0    0   486474    11.358031320   843  G   R 14071936 + 128 [dd]
  8,0    0   486480    11.358096298   843  G   R 14071552 + 128 [dd]
  8,0    0   486490    11.358303858   843  G   R 14071808 + 128 [dd]
3) io seek for sdx:
Noted io seek is the result from blktrace D stage, statistic of:
ABS((offset of next IO) - (offset + len of previous IO))

Read|Write seek
cnt 55175, zero cnt 25079
    >=(KB) .. <(KB)     : count       ratio |distribution                            |
         0 .. 1         : 25079       45.5% |########################################|
         1 .. 2         : 0            0.0% |                                        |
         2 .. 4         : 0            0.0% |                                        |
         4 .. 8         : 0            0.0% |                                        |
         8 .. 16        : 0            0.0% |                                        |
        16 .. 32        : 0            0.0% |                                        |
        32 .. 64        : 12540       22.7% |#####################                   |
        64 .. 128       : 2508         4.5% |#####                                   |
       128 .. 256       : 0            0.0% |                                        |
       256 .. 512       : 10032       18.2% |#################                       |
       512 .. 1024      : 5016         9.1% |#########                               |

After this patch:
1) iostat results:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
md0           87965.00   5271.88     0.00   0.00    0.16    61.37   14.03  90.60
sd*           6020.00    658.44  5117.00  45.95    0.44   112.00    2.68  86.50
2) blktrace G stage:
  8,0    0   206296     5.354894072   664  G   R 7156992 + 128 [dd]
  8,0    0   206305     5.355018179   664  G   R 7157248 + 128 [dd]
  8,0    0   206316     5.355204438   664  G   R 7157504 + 128 [dd]
  8,0    0   206319     5.355241048   664  G   R 7157760 + 128 [dd]
  8,0    0   206333     5.355500923   664  G   R 7158016 + 128 [dd]
  8,0    0   206344     5.355837806   664  G   R 7158272 + 128 [dd]
  8,0    0   206353     5.355960395   664  G   R 7158528 + 128 [dd]
  8,0    0   206357     5.356020772   664  G   R 7158784 + 128 [dd]
3) io seek for sdx
Read|Write seek
cnt 28644, zero cnt 21483
    >=(KB) .. <(KB)     : count       ratio |distribution                            |
         0 .. 1         : 21483       75.0% |########################################|
         1 .. 2         : 0            0.0% |                                        |
         2 .. 4         : 0            0.0% |                                        |
         4 .. 8         : 0            0.0% |                                        |
         8 .. 16        : 0            0.0% |                                        |
        16 .. 32        : 0            0.0% |                                        |
        32 .. 64        : 7161        25.0% |##############                          |

BTW, this looks like a long-standing problem, present from day one, and a
large sequential read is a pretty common case, for example video playback.

And even with this patch, IO in this test case is merged to at most 128k
due to the block layer plug limit BLK_PLUG_FLUSH_SIZE; increasing that
limit can get even better performance. However, we'll figure out how to
do that properly later.

Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-core.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4201504158a1..0d46d10edb22 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
 	 * to collect a list of requests submited by a ->submit_bio method while
 	 * it is active, and then process them after it returned.
 	 */
-	if (current->bio_list)
-		bio_list_add(&current->bio_list[0], bio);
-	else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
+	if (current->bio_list) {
+		if (bio_flagged(bio, BIO_CHAIN))
+			bio_list_add_head(&current->bio_list[0], bio);
+		else
+			bio_list_add(&current->bio_list[0], bio);
+	} else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
 		__submit_bio_noacct_mq(bio);
-	else
+	} else {
 		__submit_bio_noacct(bio);
+	}
 }
 
 static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] block: fix disordered IO in the case recursive split
  2025-08-21  7:47 [PATCH] block: fix disordered IO in the case recursive split Yu Kuai
@ 2025-08-21  8:06 ` Coly Li
  2025-08-21  8:43 ` Christoph Hellwig
  2025-08-21  9:16 ` Coly Li
  2 siblings, 0 replies; 14+ messages in thread
From: Coly Li @ 2025-08-21  8:06 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, xni,
	yukuai3, yi.zhang, yangerkun, johnny.chenyi

On Thu, Aug 21, 2025 at 03:47:06PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Currently, a split bio is chained to the original bio, and the original
> bio is resubmitted to the tail of current->bio_list, waiting for the
> split bio to be issued. However, if the split bio gets split again, the IO
> order will be messed up. For example, in raid456 an IO will first be
> split by max_sectors from md_submit_bio(), and then later be split
> again by the chunk size for internal handling:
> 
> For example, assume max_sectors is 1M, and chunksize is 512k
> 
> 1) issue a 2M IO:
> 
> bio issuing: 0+2M
> current->bio_list: NULL
> 
> 2) md_submit_bio() split by max_sector:
> 
> bio issuing: 0+1M
> current->bio_list: 1M+1M
> 
> 3) chunk_aligned_read() split by chunksize:
> 
> bio issuing: 0+512k
> current->bio_list: 1M+1M -> 512k+512k
> 
> 4) after the first bio is issued, __submit_bio_noacct() will continue
> issuing the next bio:
> 
> bio issuing: 1M+1M
> current->bio_list: 512k+512k
> bio issued: 0+512k
> 
> 5) chunk_aligned_read() split by chunksize:
> 
> bio issuing: 1M+512k
> current->bio_list: 512k+512k -> 1536k+512k
> bio issued: 0+512k
> 
> 6) no split afterwards, finally the issue order is:
> 
> 0+512k -> 1M+512k -> 512k+512k -> 1536k+512k
> 
> This behaviour causes a large sequential read on raid456 to end up as
> small discontinuous IOs on the underlying disks. Fix this problem by
> placing the chained bio at the head of current->bio_list.
> 
> Test script: test on 8 disk raid5 with 64k chunksize
> dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct
> 
> Test results:
> Before this patch
> 1) iostat results:
> Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
> md0           52430.00   3276.87     0.00   0.00    0.62    64.00   32.60  80.10
> sd*           4487.00    409.00  2054.00  31.40    0.82    93.34    3.68  71.20
> 2) blktrace G stage:
>   8,0    0   486445    11.357392936   843  G   R 14071424 + 128 [dd]
>   8,0    0   486451    11.357466360   843  G   R 14071168 + 128 [dd]
>   8,0    0   486454    11.357515868   843  G   R 14071296 + 128 [dd]
>   8,0    0   486468    11.357968099   843  G   R 14072192 + 128 [dd]
>   8,0    0   486474    11.358031320   843  G   R 14071936 + 128 [dd]
>   8,0    0   486480    11.358096298   843  G   R 14071552 + 128 [dd]
>   8,0    0   486490    11.358303858   843  G   R 14071808 + 128 [dd]
> 3) io seek for sdx:
> Noted io seek is the result from blktrace D stage, statistic of:
> ABS((offset of next IO) - (offset + len of previous IO))
> 
> Read|Write seek
> cnt 55175, zero cnt 25079
>     >=(KB) .. <(KB)     : count       ratio |distribution                            |
>          0 .. 1         : 25079       45.5% |########################################|
>          1 .. 2         : 0            0.0% |                                        |
>          2 .. 4         : 0            0.0% |                                        |
>          4 .. 8         : 0            0.0% |                                        |
>          8 .. 16        : 0            0.0% |                                        |
>         16 .. 32        : 0            0.0% |                                        |
>         32 .. 64        : 12540       22.7% |#####################                   |
>         64 .. 128       : 2508         4.5% |#####                                   |
>        128 .. 256       : 0            0.0% |                                        |
>        256 .. 512       : 10032       18.2% |#################                       |
>        512 .. 1024      : 5016         9.1% |#########                               |
> 
> After this patch:
> 1) iostat results:
> Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
> md0           87965.00   5271.88     0.00   0.00    0.16    61.37   14.03  90.60
> sd*           6020.00    658.44  5117.00  45.95    0.44   112.00    2.68  86.50
> 2) blktrace G stage:
>   8,0    0   206296     5.354894072   664  G   R 7156992 + 128 [dd]
>   8,0    0   206305     5.355018179   664  G   R 7157248 + 128 [dd]
>   8,0    0   206316     5.355204438   664  G   R 7157504 + 128 [dd]
>   8,0    0   206319     5.355241048   664  G   R 7157760 + 128 [dd]
>   8,0    0   206333     5.355500923   664  G   R 7158016 + 128 [dd]
>   8,0    0   206344     5.355837806   664  G   R 7158272 + 128 [dd]
>   8,0    0   206353     5.355960395   664  G   R 7158528 + 128 [dd]
>   8,0    0   206357     5.356020772   664  G   R 7158784 + 128 [dd]
> 3) io seek for sdx
> Read|Write seek
> cnt 28644, zero cnt 21483
>     >=(KB) .. <(KB)     : count       ratio |distribution                            |
>          0 .. 1         : 21483       75.0% |########################################|
>          1 .. 2         : 0            0.0% |                                        |
>          2 .. 4         : 0            0.0% |                                        |
>          4 .. 8         : 0            0.0% |                                        |
>          8 .. 16        : 0            0.0% |                                        |
>         16 .. 32        : 0            0.0% |                                        |
>         32 .. 64        : 7161        25.0% |##############                          |
> 
> BTW, this looks like a long-standing problem, present from day one, and a
> large sequential read is a pretty common case, for example video playback.
> 
> And even with this patch, IO in this test case is merged to at most 128k
> due to the block layer plug limit BLK_PLUG_FLUSH_SIZE; increasing that
> limit can get even better performance. However, we'll figure out how to
> do that properly later.
> 
> Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>

Tested-by: Coly Li <colyli@kernel.org>

Yeah, I can see 'dd if=./on-fs-file of=/dev/null bs=6400K status=progress'
report the read throughput going from around 1.2GiB/s to 9.8GiB/s.

Great job and thank you!

Coly Li

> ---
>  block/blk-core.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4201504158a1..0d46d10edb22 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
>  	 * to collect a list of requests submited by a ->submit_bio method while
>  	 * it is active, and then process them after it returned.
>  	 */
> -	if (current->bio_list)
> -		bio_list_add(&current->bio_list[0], bio);
> -	else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
> +	if (current->bio_list) {
> +		if (bio_flagged(bio, BIO_CHAIN))
> +			bio_list_add_head(&current->bio_list[0], bio);
> +		else
> +			bio_list_add(&current->bio_list[0], bio);
> +	} else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
>  		__submit_bio_noacct_mq(bio);
> -	else
> +	} else {
>  		__submit_bio_noacct(bio);
> +	}
>  }
>  
>  static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
> -- 
> 2.39.2
> 
> 


* Re: [PATCH] block: fix disordered IO in the case recursive split
  2025-08-21  7:47 [PATCH] block: fix disordered IO in the case recursive split Yu Kuai
  2025-08-21  8:06 ` Coly Li
@ 2025-08-21  8:43 ` Christoph Hellwig
  2025-08-21  8:56   ` Yu Kuai
  2025-08-21  9:16 ` Coly Li
  2 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-08-21  8:43 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
	xni, yukuai3, yi.zhang, yangerkun, johnny.chenyi

On Thu, Aug 21, 2025 at 03:47:06PM +0800, Yu Kuai wrote:
> +	if (current->bio_list) {
> +		if (bio_flagged(bio, BIO_CHAIN))
> +			bio_list_add_head(&current->bio_list[0], bio);
> +		else
> +			bio_list_add(&current->bio_list[0], bio);
> +	} else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {

This breaks all the code that already chains the right way around,
and there's quite a bit of that (speaking as someone who created a few
instances).

So fix your submitter to chain the right way instead.



* Re: [PATCH] block: fix disordered IO in the case recursive split
  2025-08-21  8:43 ` Christoph Hellwig
@ 2025-08-21  8:56   ` Yu Kuai
  2025-08-21  9:02     ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Yu Kuai @ 2025-08-21  8:56 UTC (permalink / raw)
  To: Christoph Hellwig, Yu Kuai
  Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
	xni, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

Hi,

On 2025/08/21 16:43, Christoph Hellwig wrote:
> On Thu, Aug 21, 2025 at 03:47:06PM +0800, Yu Kuai wrote:
>> +	if (current->bio_list) {
>> +		if (bio_flagged(bio, BIO_CHAIN))
>> +			bio_list_add_head(&current->bio_list[0], bio);
>> +		else
>> +			bio_list_add(&current->bio_list[0], bio);
>> +	} else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
> 
> This breaks all the code that already chains the right way around,
> and there's quite a bit of that (speaking as someone who created a few
> instances).
> 
> So fix your submitter to chain the right way instead.
> 

Can you give some examples of how to chain the right way? BTW, for all
the IO split cases, should this order be fixed? I feel we should; this
disorder can happen in any stacked case where the top device's max_sectors
is greater than that of the stacked disk.

Thanks,
Kuai

> 
> .
> 



* Re: [PATCH] block: fix disordered IO in the case recursive split
  2025-08-21  8:56   ` Yu Kuai
@ 2025-08-21  9:02     ` Christoph Hellwig
  2025-08-21  9:33       ` Hannes Reinecke
                         ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-08-21  9:02 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Christoph Hellwig, axboe, neil, akpm, linux-block, linux-raid,
	linux-kernel, colyli, xni, yi.zhang, yangerkun, johnny.chenyi,
	yukuai (C)

On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
> Can you give some examples of how to chain the right way?

fs/xfs/xfs_bio_io.c: xfs_rw_bdev
fs/xfs/xfs_buf.c: xfs_buf_submit_bio
fs/xfs/xfs_log.c: xlog_write_iclog

> BTW, for all
> the IO split cases, should this order be fixed? I feel we should; this
> disorder can happen in any stacked case where the top device's max_sectors
> is greater than that of the stacked disk.

Yes, I've been trying to get Bart to fix this for a while instead of
putting in a workaround very similar to the one proposed here,
but so far nothing has happened.



* Re: [PATCH] block: fix disordered IO in the case recursive split
  2025-08-21  7:47 [PATCH] block: fix disordered IO in the case recursive split Yu Kuai
  2025-08-21  8:06 ` Coly Li
  2025-08-21  8:43 ` Christoph Hellwig
@ 2025-08-21  9:16 ` Coly Li
  2 siblings, 0 replies; 14+ messages in thread
From: Coly Li @ 2025-08-21  9:16 UTC (permalink / raw)
  To: Yu Kuai
  Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, xni,
	yukuai3, yi.zhang, yangerkun, johnny.chenyi

On Thu, Aug 21, 2025 at 03:47:06PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Currently, a split bio is chained to the original bio, and the original
> bio is resubmitted to the tail of current->bio_list, waiting for the
> split bio to be issued. However, if the split bio gets split again, the IO
> order will be messed up. For example, in raid456 an IO will first be
> split by max_sectors from md_submit_bio(), and then later be split
> again by the chunk size for internal handling:
> 
> For example, assume max_sectors is 1M, and chunksize is 512k
> 
> 1) issue a 2M IO:
> 
> bio issuing: 0+2M
> current->bio_list: NULL
> 
> 2) md_submit_bio() split by max_sector:
> 
> bio issuing: 0+1M
> current->bio_list: 1M+1M
> 
> 3) chunk_aligned_read() split by chunksize:
> 
> bio issuing: 0+512k
> current->bio_list: 1M+1M -> 512k+512k
> 
> 4) after the first bio is issued, __submit_bio_noacct() will continue
> issuing the next bio:
> 
> bio issuing: 1M+1M
> current->bio_list: 512k+512k
> bio issued: 0+512k
> 
> 5) chunk_aligned_read() split by chunksize:
> 
> bio issuing: 1M+512k
> current->bio_list: 512k+512k -> 1536k+512k
> bio issued: 0+512k
> 
> 6) no split afterwards, finally the issue order is:
> 
> 0+512k -> 1M+512k -> 512k+512k -> 1536k+512k
> 
> This behaviour causes a large sequential read on raid456 to end up as
> small discontinuous IOs on the underlying disks. Fix this problem by
> placing the chained bio at the head of current->bio_list.
> 
> Test script: test on 8 disk raid5 with 64k chunksize
> dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct
> 
> Test results:
> Before this patch
> 1) iostat results:
> Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
> md0           52430.00   3276.87     0.00   0.00    0.62    64.00   32.60  80.10
> sd*           4487.00    409.00  2054.00  31.40    0.82    93.34    3.68  71.20
> 2) blktrace G stage:
>   8,0    0   486445    11.357392936   843  G   R 14071424 + 128 [dd]
>   8,0    0   486451    11.357466360   843  G   R 14071168 + 128 [dd]
>   8,0    0   486454    11.357515868   843  G   R 14071296 + 128 [dd]
>   8,0    0   486468    11.357968099   843  G   R 14072192 + 128 [dd]
>   8,0    0   486474    11.358031320   843  G   R 14071936 + 128 [dd]
>   8,0    0   486480    11.358096298   843  G   R 14071552 + 128 [dd]
>   8,0    0   486490    11.358303858   843  G   R 14071808 + 128 [dd]
> 3) io seek for sdx:
> Noted io seek is the result from blktrace D stage, statistic of:
> ABS((offset of next IO) - (offset + len of previous IO))
> 
> Read|Write seek
> cnt 55175, zero cnt 25079
>     >=(KB) .. <(KB)     : count       ratio |distribution                            |
>          0 .. 1         : 25079       45.5% |########################################|
>          1 .. 2         : 0            0.0% |                                        |
>          2 .. 4         : 0            0.0% |                                        |
>          4 .. 8         : 0            0.0% |                                        |
>          8 .. 16        : 0            0.0% |                                        |
>         16 .. 32        : 0            0.0% |                                        |
>         32 .. 64        : 12540       22.7% |#####################                   |
>         64 .. 128       : 2508         4.5% |#####                                   |
>        128 .. 256       : 0            0.0% |                                        |
>        256 .. 512       : 10032       18.2% |#################                       |
>        512 .. 1024      : 5016         9.1% |#########                               |
> 
> After this patch:
> 1) iostat results:
> Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
> md0           87965.00   5271.88     0.00   0.00    0.16    61.37   14.03  90.60
> sd*           6020.00    658.44  5117.00  45.95    0.44   112.00    2.68  86.50
> 2) blktrace G stage:
>   8,0    0   206296     5.354894072   664  G   R 7156992 + 128 [dd]
>   8,0    0   206305     5.355018179   664  G   R 7157248 + 128 [dd]
>   8,0    0   206316     5.355204438   664  G   R 7157504 + 128 [dd]
>   8,0    0   206319     5.355241048   664  G   R 7157760 + 128 [dd]
>   8,0    0   206333     5.355500923   664  G   R 7158016 + 128 [dd]
>   8,0    0   206344     5.355837806   664  G   R 7158272 + 128 [dd]
>   8,0    0   206353     5.355960395   664  G   R 7158528 + 128 [dd]
>   8,0    0   206357     5.356020772   664  G   R 7158784 + 128 [dd]
> 3) io seek for sdx
> Read|Write seek
> cnt 28644, zero cnt 21483
>     >=(KB) .. <(KB)     : count       ratio |distribution                            |
>          0 .. 1         : 21483       75.0% |########################################|
>          1 .. 2         : 0            0.0% |                                        |
>          2 .. 4         : 0            0.0% |                                        |
>          4 .. 8         : 0            0.0% |                                        |
>          8 .. 16        : 0            0.0% |                                        |
>         16 .. 32        : 0            0.0% |                                        |
>         32 .. 64        : 7161        25.0% |##############                          |
> 
> BTW, this looks like a long-standing problem, present from day one, and a
> large sequential read is a pretty common case, for example video playback.
> 
> And even with this patch, IO in this test case is merged to at most 128k
> due to the block layer plug limit BLK_PLUG_FLUSH_SIZE; increasing that
> limit can get even better performance. However, we'll figure out how to
> do that properly later.
> 
> Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>

BTW, this issue was not originally caught by me; my colleague Tie Ren found it.

Please consider adding:
	Reported-by: Tie Ren <tieren@fnnas.com>

Thanks.

Coly Li

> ---
>  block/blk-core.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4201504158a1..0d46d10edb22 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
>  	 * to collect a list of requests submited by a ->submit_bio method while
>  	 * it is active, and then process them after it returned.
>  	 */
> -	if (current->bio_list)
> -		bio_list_add(&current->bio_list[0], bio);
> -	else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
> +	if (current->bio_list) {
> +		if (bio_flagged(bio, BIO_CHAIN))
> +			bio_list_add_head(&current->bio_list[0], bio);
> +		else
> +			bio_list_add(&current->bio_list[0], bio);
> +	} else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
>  		__submit_bio_noacct_mq(bio);
> -	else
> +	} else {
>  		__submit_bio_noacct(bio);
> +	}
>  }
>  
>  static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
> -- 
> 2.39.2
> 
> 


* Re: [PATCH] block: fix disordered IO in the case recursive split
  2025-08-21  9:02     ` Christoph Hellwig
@ 2025-08-21  9:33       ` Hannes Reinecke
  2025-08-21  9:42         ` Yu Kuai
  2025-08-21  9:37       ` Yu Kuai
  2025-08-21 15:19       ` Bart Van Assche
  2 siblings, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2025-08-21  9:33 UTC (permalink / raw)
  To: Christoph Hellwig, Yu Kuai
  Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
	xni, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

On 8/21/25 11:02, Christoph Hellwig wrote:
> On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
>> Can you give some examples of how to chain the right way?
> 
> fs/xfs/xfs_bio_io.c: xfs_rw_bdev
> fs/xfs/xfs_buf.c: xfs_buf_submit_bio
> fs/xfs/xfs_log.c: xlog_write_iclog
> 
>> BTW, for all
>> the IO split cases, should this order be fixed? I feel we should; this
>> disorder can happen in any stacked case where the top device's max_sectors
>> is greater than that of the stacked disk.
> 
> Yes, I've been trying to get Bart to fix this for a while instead of
> putting in a workaround very similar to the one proposed here,
> but so far nothing has happened.
> 
> 
This feels like a really stupid fix, but wouldn't that help?

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 023649fe2476..2b342bb59612 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5478,7 +5478,6 @@ static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio)
                 split = bio_split(raid_bio, sectors, GFP_NOIO, &conf->bio_split);
                 bio_chain(split, raid_bio);
                 submit_bio_noacct(raid_bio);
-               raid_bio = split;
         }


Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


* Re: [PATCH] block: fix disordered IO in the case recursive split
  2025-08-21  9:02     ` Christoph Hellwig
  2025-08-21  9:33       ` Hannes Reinecke
@ 2025-08-21  9:37       ` Yu Kuai
  2025-08-25  6:15         ` Yu Kuai
  2025-08-25  9:17         ` Christoph Hellwig
  2025-08-21 15:19       ` Bart Van Assche
  2 siblings, 2 replies; 14+ messages in thread
From: Yu Kuai @ 2025-08-21  9:37 UTC (permalink / raw)
  To: Christoph Hellwig, Yu Kuai
  Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
	xni, yi.zhang, yangerkun, johnny.chenyi, yukuai (C), tieren

Hi,

On 2025/08/21 17:02, Christoph Hellwig wrote:
> On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
>> Can you give some examples of how to chain the right way?
> 
> fs/xfs/xfs_bio_io.c: xfs_rw_bdev

Just took a look, this chains:

old bio -> new bio

while bio split chains:

new bio -> old bio

So xfs_rw_bdev won't flag the old bio as BIO_CHAIN, while the old bio will
still be resubmitted to current->bio_list, hence this patch won't break
this case, right?

I'll take a look at all the bio_chain() callers other than the split case
to make sure.
> fs/xfs/xfs_buf.c: xfs_buf_submit_bio

This one is a little different: new bio -> old bio, and the new bio is
resubmitted. The new bio still doesn't have the BIO_CHAIN flag, so it is
not affected by this patch.

> fs/xfs/xfs_log.c: xlog_write_iclog

This is the same as above.
> 
>> BTW, for all
>> the IO split cases, should this order be fixed? I feel we should; this
>> disorder can happen in any stacked case where the top device's max_sectors
>> is greater than that of the stacked disk.
> 
> Yes, I've been trying to get Bart to fix this for a while instead of
> putting in a workaround very similar to the one proposed here,
> but so far nothing has happened.
> 

Do you mean this thread?

Fix bio splitting by the crypto fallback code

I'll take a look at all the callers of bio_chain(); in theory, we'll have
different use cases like:

1) chain old -> new, or chain new -> old
2) put the old or new bio to current->bio_list, currently always at the
tail; we might want a new case for the head

Perhaps it'll make sense to add high-level helpers to do the chain and
resubmit, and convert all callers to use the new helpers. What do you
think?

Thanks,
Kuai

> .
> 



* Re: [PATCH] block: fix disordered IO in the case recursive split
  2025-08-21  9:33       ` Hannes Reinecke
@ 2025-08-21  9:42         ` Yu Kuai
  0 siblings, 0 replies; 14+ messages in thread
From: Yu Kuai @ 2025-08-21  9:42 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig, Yu Kuai
  Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
	xni, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)

Hi,

On 2025/08/21 17:33, Hannes Reinecke wrote:
> On 8/21/25 11:02, Christoph Hellwig wrote:
>> On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
>>> Can you give some examples of how to chain the right way?
>>
>> fs/xfs/xfs_bio_io.c: xfs_rw_bdev
>> fs/xfs/xfs_buf.c: xfs_buf_submit_bio
>> fs/xfs/xfs_log.c: xlog_write_iclog
>>
>>> BTW, for all
>>> the IO split cases, should this order be fixed? I feel we should; this
>>> disorder can happen in any stacked case where the top device's max_sectors
>>> is greater than that of the stacked disk.
>>
>> Yes, I've been trying to get Bart to fix this for a while instead of
>> putting in a workaround very similar to the one proposed here,
>> but so far nothing has happened.
>>
>>
> This feels like a really stupid fix, but wouldn't that help?
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 023649fe2476..2b342bb59612 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5478,7 +5478,6 @@ static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio)
>                  split = bio_split(raid_bio, sectors, GFP_NOIO, &conf->bio_split);
>                  bio_chain(split, raid_bio);
>                  submit_bio_noacct(raid_bio);
> -               raid_bio = split;
>          }
> 

I do not understand how can this help, do you miss that submit split
instead?

And with that change it can help; however, I think we'll still submit
the last LBA bio first, like bio_last -> bio0 -> bio1 ..., where only
the following bios are sequential.

BTW, this is not just a raid5 problem; it is also possible for
raid0/10 and all other recursive split cases.

Thanks,
Kuai

> 
> Cheers,
> 
> Hannes



* Re: [PATCH] block: fix disordered IO in the case recursive split
  2025-08-21  9:02     ` Christoph Hellwig
  2025-08-21  9:33       ` Hannes Reinecke
  2025-08-21  9:37       ` Yu Kuai
@ 2025-08-21 15:19       ` Bart Van Assche
  2 siblings, 0 replies; 14+ messages in thread
From: Bart Van Assche @ 2025-08-21 15:19 UTC (permalink / raw)
  To: Christoph Hellwig, Yu Kuai
  Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
	xni, yi.zhang, yangerkun, johnny.chenyi, yukuai (C)


On 8/21/25 2:02 AM, Christoph Hellwig wrote:
> On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
>> BTW, for all the io split case, should this order be fixed? I feel
>> we should, this disorder can happen on any stack case, where top
>> max_sector is greater than stacked disk.
> 
> Yes, I've been trying to get Bart to fix this for a while instead of
> putting in a workaround very similar to the one proposed here,
> but so far nothing happened.

Does the above comment refer to the block/blk-crypto-fallback.c code? I
will leave it to Eric Biggers to move the bio splitting call from that
code into the filesystems that need it.

Bart.


* Re: [PATCH] block: fix disordered IO in the case recursive split
  2025-08-21  9:37       ` Yu Kuai
@ 2025-08-25  6:15         ` Yu Kuai
  2025-08-25  9:15           ` Christoph Hellwig
  2025-08-25  9:17         ` Christoph Hellwig
  1 sibling, 1 reply; 14+ messages in thread
From: Yu Kuai @ 2025-08-25  6:15 UTC (permalink / raw)
  To: Yu Kuai, Christoph Hellwig
  Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
	xni, yi.zhang, yangerkun, johnny.chenyi, tieren, yukuai (C)

Hi,

On 2025/08/21 17:37, Yu Kuai wrote:
> On 2025/08/21 17:02, Christoph Hellwig wrote:
>> On Thu, Aug 21, 2025 at 04:56:33PM +0800, Yu Kuai wrote:
>>> Can you give some examples as how to chain the right way?
>>
>> fs/xfs/xfs_bio_io.c: xfs_rw_bdev
> 
> Just took a look; this chains
> 
> old bio -> new bio
> 
> while bio split chains:
> 
> new bio -> old bio
> 
> So xfs_rw_bdev() won't flag the old bio as BIO_CHAIN, while the old bio
> will still be resubmitted to current->bio_list as before, hence this
> patch won't break this case, right?

And xfs_rw_bdev() is not under submit_bio() context; current->bio_list
is still NULL, which means xfs_rw_bdev() is submitting bios one by one
in the right LBA order, so the recursive bio handling is not involved in
this case.

Thanks,
Kuai



* Re: [PATCH] block: fix disordered IO in the case recursive split
  2025-08-25  6:15         ` Yu Kuai
@ 2025-08-25  9:15           ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-08-25  9:15 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Christoph Hellwig, axboe, neil, akpm, linux-block, linux-raid,
	linux-kernel, colyli, xni, yi.zhang, yangerkun, johnny.chenyi,
	tieren, yukuai (C)

On Mon, Aug 25, 2025 at 02:15:49PM +0800, Yu Kuai wrote:
> > be resubmitted to current->bio_list, hence this patch won't break this
> > case, right?
> 
> And xfs_rw_bdev() is not under submit_bio() context; current->bio_list
> is still NULL, which means xfs_rw_bdev() is submitting bios one by one
> in the right LBA order, so the recursive bio handling is not involved
> in this case.

Yes, usually not - unless we somehow eventually get the loop device
out to a separate workqueue context.



* Re: [PATCH] block: fix disordered IO in the case recursive split
  2025-08-21  9:37       ` Yu Kuai
  2025-08-25  6:15         ` Yu Kuai
@ 2025-08-25  9:17         ` Christoph Hellwig
  2025-08-25  9:49           ` Yu Kuai
  1 sibling, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-08-25  9:17 UTC (permalink / raw)
  To: Yu Kuai
  Cc: Christoph Hellwig, axboe, neil, akpm, linux-block, linux-raid,
	linux-kernel, colyli, xni, yi.zhang, yangerkun, johnny.chenyi,
	yukuai (C), tieren

On Thu, Aug 21, 2025 at 05:37:15PM +0800, Yu Kuai wrote:
> Fix bio splitting by the crypto fallback code

Yes.

> 
> I'll take a look at all the callers of bio_chain(); in theory, we'll
> have different use cases like:
> 
> 1) chain old -> new, or chain new -> old
> 2) put old or new on current->bio_list, currently always at the tail;
> we might want a new case for the head;
> 
> Perhaps it would make sense to add high-level helpers to do the chain
> and resubmit, and convert all callers to use the new helpers. What do
> you think?

I don't think chaining really is the problem here, but rather how bios
are split when already in the block layer.  It's been a bit of a
source of problems, so I think we'll need to sort it out, especially
as the handling of splits for the same device vs. devices below the
current one seems a bit problematic in general.



* Re: [PATCH] block: fix disordered IO in the case recursive split
  2025-08-25  9:17         ` Christoph Hellwig
@ 2025-08-25  9:49           ` Yu Kuai
  0 siblings, 0 replies; 14+ messages in thread
From: Yu Kuai @ 2025-08-25  9:49 UTC (permalink / raw)
  To: Christoph Hellwig, Yu Kuai
  Cc: axboe, neil, akpm, linux-block, linux-raid, linux-kernel, colyli,
	xni, yi.zhang, yangerkun, johnny.chenyi, tieren, yukuai (C)

Hi,

On 2025/08/25 17:17, Christoph Hellwig wrote:
> On Thu, Aug 21, 2025 at 05:37:15PM +0800, Yu Kuai wrote:
>> Fix bio splitting by the crypto fallback code
> 
> Yes.
> 
>>
>> I'll take a look at all the callers of bio_chain(); in theory, we'll
>> have different use cases like:
>>
>> 1) chain old -> new, or chain new -> old
>> 2) put old or new on current->bio_list, currently always at the tail;
>> we might want a new case for the head;
>>
>> Perhaps it would make sense to add high-level helpers to do the chain
>> and resubmit, and convert all callers to use the new helpers. What do
>> you think?
> 
> I don't think chaining really is the problem here, but rather how bios
> are split when already in the block layer.  It's been a bit of a
> source of problems, so I think we'll need to sort it out, especially
> as the handling of splits for the same device vs. devices below the
> current one seems a bit problematic in general.
> 

I just sent a new RFC version to unify the block layer and mdraid to
use the same helper, bio_submit_split(), and converted only that helper
to insert the chained bio at the head of current->bio_list. And
probably blk-crypto-fallback can use this new helper as well.

Can you take a look? This is the best solution I can think of for now.
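
To make the idea concrete, such a helper might look roughly like the
sketch below. This is kernel-style pseudocode written for this
discussion, not code from the RFC; the function name is taken from the
message above, but the exact interface, error handling and call sites
are assumptions:

```c
/*
 * Sketch only, not the actual RFC code: one common helper that splits
 * a bio, chains the remainder behind the split-off front part, and
 * resubmits the remainder to the HEAD of current->bio_list so that
 * recursive splits keep their LBA order.
 */
static struct bio *bio_submit_split(struct bio *bio, int split_sectors)
{
	struct bio *split = bio_split(bio, split_sectors, GFP_NOIO,
				      &bio->bi_bdev->bd_disk->bio_split);

	bio_chain(split, bio);
	/* head insertion is the point of the fix discussed in this thread */
	bio_list_add_head(&current->bio_list[0], bio);
	return split;
}
```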

Thanks,
Kuai




end of thread, other threads:[~2025-08-25  9:49 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-21  7:47 [PATCH] block: fix disordered IO in the case recursive split Yu Kuai
2025-08-21  8:06 ` Coly Li
2025-08-21  8:43 ` Christoph Hellwig
2025-08-21  8:56   ` Yu Kuai
2025-08-21  9:02     ` Christoph Hellwig
2025-08-21  9:33       ` Hannes Reinecke
2025-08-21  9:42         ` Yu Kuai
2025-08-21  9:37       ` Yu Kuai
2025-08-25  6:15         ` Yu Kuai
2025-08-25  9:15           ` Christoph Hellwig
2025-08-25  9:17         ` Christoph Hellwig
2025-08-25  9:49           ` Yu Kuai
2025-08-21 15:19       ` Bart Van Assche
2025-08-21  9:16 ` Coly Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).