cgroups.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Yu Kuai <hailan@yukuai.org.cn>
To: Damien Le Moal <dlemoal@kernel.org>,
	Yu Kuai <yukuai1@huaweicloud.com>,
	axboe@kernel.dk, tj@kernel.org, josef@toxicpanda.com,
	song@kernel.org, neil@brown.name, akpm@linux-foundation.org,
	hch@infradead.org, colyli@kernel.org, hare@suse.de,
	tieren@fnnas.com
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, linux-raid@vger.kernel.org,
	yukuai3@huawei.com, yi.zhang@huawei.com, yangerkun@huawei.com,
	johnny.chenyi@huawei.com
Subject: Re: [PATCH RFC v2 09/10] block: fix disordered IO in the case recursive split
Date: Sat, 30 Aug 2025 12:28:28 +0800	[thread overview]
Message-ID: <79aae55c-a2fe-465c-9204-44dce9a80256@yukuai.org.cn> (raw)
In-Reply-To: <23872034-2b36-4a71-91b9-e599976902b6@kernel.org>

Hi,

在 2025/8/30 9:02, Damien Le Moal 写道:
> On 8/28/25 15:57, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Currently, split bio will be chained to original bio, and original bio
>> will be resubmitted to the tail of current->bio_list, waiting for
>> split bio to be issued. However, if split bio get split again, the IO
>> order will be messed up, for example, in raid456 IO will first be
>> split by max_sector from md_submit_bio(), and then later be split
>> again by chunksize for internal handling:
>>
>> For example, assume max_sectors is 1M, and chunksize is 512k
>>
>> 1) issue a 2M IO:
>>
>> bio issuing: 0+2M
>> current->bio_list: NULL
>>
>> 2) md_submit_bio() split by max_sector:
>>
>> bio issuing: 0+1M
>> current->bio_list: 1M+1M
>>
>> 3) chunk_aligned_read() split by chunksize:
>>
>> bio issuing: 0+512k
>> current->bio_list: 1M+1M -> 512k+512k
>>
>> 4) after first bio issued, __submit_bio_noacct() will contuine issuing
>> next bio:
>>
>> bio issuing: 1M+1M
>> current->bio_list: 512k+512k
>> bio issued: 0+512k
>>
>> 5) chunk_aligned_read() split by chunksize:
>>
>> bio issuing: 1M+512k
>> current->bio_list: 512k+512k -> 1536k+512k
>> bio issued: 0+512k
>>
>> 6) no split afterwards, finally the issue order is:
>>
>> 0+512k -> 1M+512k -> 512k+512k -> 1536k+512k
>>
>> This behaviour will cause large IO read on raid456 endup to be small
>> discontinuous IO in underlying disks. Fix this problem by placing split
>> bio to the head of current->bio_list.
>>
>> Test script: test on 8 disk raid5 with 64k chunksize
>> dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct
>>
>> Test results:
>> Before this patch
>> 1) iostat results:
>> Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
>> md0           52430.00   3276.87     0.00   0.00    0.62    64.00   32.60  80.10
>> sd*           4487.00    409.00  2054.00  31.40    0.82    93.34    3.68  71.20
>> 2) blktrace G stage:
>>    8,0    0   486445    11.357392936   843  G   R 14071424 + 128 [dd]
>>    8,0    0   486451    11.357466360   843  G   R 14071168 + 128 [dd]
>>    8,0    0   486454    11.357515868   843  G   R 14071296 + 128 [dd]
>>    8,0    0   486468    11.357968099   843  G   R 14072192 + 128 [dd]
>>    8,0    0   486474    11.358031320   843  G   R 14071936 + 128 [dd]
>>    8,0    0   486480    11.358096298   843  G   R 14071552 + 128 [dd]
>>    8,0    0   486490    11.358303858   843  G   R 14071808 + 128 [dd]
>> 3) io seek for sdx:
>> Noted io seek is the result from blktrace D stage, statistic of:
>> ABS((offset of next IO) - (offset + len of previous IO))
>>
>> Read|Write seek
>> cnt 55175, zero cnt 25079
>>      >=(KB) .. <(KB)     : count       ratio |distribution                            |
>>           0 .. 1         : 25079       45.5% |########################################|
>>           1 .. 2         : 0            0.0% |                                        |
>>           2 .. 4         : 0            0.0% |                                        |
>>           4 .. 8         : 0            0.0% |                                        |
>>           8 .. 16        : 0            0.0% |                                        |
>>          16 .. 32        : 0            0.0% |                                        |
>>          32 .. 64        : 12540       22.7% |#####################                   |
>>          64 .. 128       : 2508         4.5% |#####                                   |
>>         128 .. 256       : 0            0.0% |                                        |
>>         256 .. 512       : 10032       18.2% |#################                       |
>>         512 .. 1024      : 5016         9.1% |#########                               |
>>
>> After this patch:
>> 1) iostat results:
>> Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz  aqu-sz  %util
>> md0           87965.00   5271.88     0.00   0.00    0.16    61.37   14.03  90.60
>> sd*           6020.00    658.44  5117.00  45.95    0.44   112.00    2.68  86.50
>> 2) blktrace G stage:
>>    8,0    0   206296     5.354894072   664  G   R 7156992 + 128 [dd]
>>    8,0    0   206305     5.355018179   664  G   R 7157248 + 128 [dd]
>>    8,0    0   206316     5.355204438   664  G   R 7157504 + 128 [dd]
>>    8,0    0   206319     5.355241048   664  G   R 7157760 + 128 [dd]
>>    8,0    0   206333     5.355500923   664  G   R 7158016 + 128 [dd]
>>    8,0    0   206344     5.355837806   664  G   R 7158272 + 128 [dd]
>>    8,0    0   206353     5.355960395   664  G   R 7158528 + 128 [dd]
>>    8,0    0   206357     5.356020772   664  G   R 7158784 + 128 [dd]
>> 3) io seek for sdx
>> Read|Write seek
>> cnt 28644, zero cnt 21483
>>      >=(KB) .. <(KB)     : count       ratio |distribution                            |
>>           0 .. 1         : 21483       75.0% |########################################|
>>           1 .. 2         : 0            0.0% |                                        |
>>           2 .. 4         : 0            0.0% |                                        |
>>           4 .. 8         : 0            0.0% |                                        |
>>           8 .. 16        : 0            0.0% |                                        |
>>          16 .. 32        : 0            0.0% |                                        |
>>          32 .. 64        : 7161        25.0% |##############                          |
>>
>> BTW, this looks like a long term problem from day one, and large
>> sequential IO read is pretty common case like video playing.
>>
>> And even with this patch, in this test case IO is merged to at most 128k
>> is due to block layer plug limit BLK_PLUG_FLUSH_SIZE, increase such
>> limit can get even better performance. However, we'll figure out how to do
>> this properly later.
>>
>> Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
>> Reported-by: Tie Ren <tieren@fnnas.com>
>> Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   block/blk-core.c     | 21 ++++++++++++++-------
>>   block/blk-throttle.c |  2 +-
>>   block/blk.h          |  2 +-
>>   3 files changed, 16 insertions(+), 9 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 37836446f365..b643d3b1e9fe 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -725,7 +725,7 @@ static void __submit_bio_noacct_mq(struct bio *bio)
>>   	current->bio_list = NULL;
>>   }
>>   
>> -void submit_bio_noacct_nocheck(struct bio *bio)
>> +void submit_bio_noacct_nocheck(struct bio *bio, bool split)
>>   {
>>   	blk_cgroup_bio_start(bio);
>>   	blkcg_bio_issue_init(bio);
>> @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
>>   	 * to collect a list of requests submited by a ->submit_bio method while
>>   	 * it is active, and then process them after it returned.
>>   	 */
>> -	if (current->bio_list)
>> -		bio_list_add(&current->bio_list[0], bio);
>> -	else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
>> +	if (current->bio_list) {
>> +		if (split)
>> +			bio_list_add_head(&current->bio_list[0], bio);
>> +		else
>> +			bio_list_add(&current->bio_list[0], bio);
> This really needs a comment clarifying why we do an add at tail instead of
> keeping the original order with a add at head. I am also scared that this may
> break sequential write ordering for zoned devices.

I think add at head is exactly what we do here to keep the orginal order for
the case bio split. Other than split, if caller do generate multiple sequential
bios, we should keep the order by add at tail.

Not sure about zoned devices for now, I'll have a look in details.

Thanks,
Kuai

>
>> +	} else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
>>   		__submit_bio_noacct_mq(bio);
>> -	else
>> +	} else {
>>   		__submit_bio_noacct(bio);
>> +	}
>>   }
>
>

  reply	other threads:[~2025-08-30  4:29 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-28  6:57 [PATCH RFC v2 00/10] block: fix disordered IO in the case recursive split Yu Kuai
2025-08-28  6:57 ` [PATCH RFC v2 01/10] block: factor out a helper bio_submit_split_bioset() Yu Kuai
2025-08-30  0:37   ` Damien Le Moal
2025-08-30  4:03     ` Yu Kuai
2025-08-28  6:57 ` [PATCH RFC v2 02/10] md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset() Yu Kuai
2025-08-30  0:41   ` Damien Le Moal
2025-08-30  4:10     ` Yu Kuai
2025-08-30  4:38       ` Damien Le Moal
2025-08-28  6:57 ` [PATCH RFC v2 03/10] md/raid1: convert " Yu Kuai
2025-08-30  0:43   ` Damien Le Moal
2025-08-28  6:57 ` [PATCH RFC v2 04/10] md/raid10: convert read/write " Yu Kuai
2025-08-30  0:48   ` Damien Le Moal
2025-08-30  4:18     ` Yu Kuai
2025-08-28  6:57 ` [PATCH RFC v2 05/10] md/raid5: convert " Yu Kuai
2025-08-30  0:50   ` Damien Le Moal
2025-08-28  6:57 ` [PATCH RFC v2 06/10] md/md-linear: " Yu Kuai
2025-08-30  0:51   ` Damien Le Moal
2025-08-28  6:57 ` [PATCH RFC v2 07/10] blk-crypto: " Yu Kuai
2025-08-30  0:55   ` Damien Le Moal
2025-08-28  6:57 ` [PATCH RFC v2 08/10] block: skip unnecessary checks for split bio Yu Kuai
2025-08-30  0:58   ` Damien Le Moal
2025-08-30  4:22     ` Yu Kuai
2025-08-28  6:57 ` [PATCH RFC v2 09/10] block: fix disordered IO in the case recursive split Yu Kuai
2025-08-30  1:02   ` Damien Le Moal
2025-08-30  4:28     ` Yu Kuai [this message]
2025-09-01  2:40       ` Yu Kuai
2025-09-01  6:51         ` Damien Le Moal
2025-08-28  6:57 ` [PATCH RFC v2 10/10] md/raid0: convert raid0_make_request() to use bio_submit_split_bioset() Yu Kuai

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=79aae55c-a2fe-465c-9204-44dce9a80256@yukuai.org.cn \
    --to=hailan@yukuai.org.cn \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=cgroups@vger.kernel.org \
    --cc=colyli@kernel.org \
    --cc=dlemoal@kernel.org \
    --cc=hare@suse.de \
    --cc=hch@infradead.org \
    --cc=johnny.chenyi@huawei.com \
    --cc=josef@toxicpanda.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-raid@vger.kernel.org \
    --cc=neil@brown.name \
    --cc=song@kernel.org \
    --cc=tieren@fnnas.com \
    --cc=tj@kernel.org \
    --cc=yangerkun@huawei.com \
    --cc=yi.zhang@huawei.com \
    --cc=yukuai1@huaweicloud.com \
    --cc=yukuai3@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).