* [PATCH RFC v2 00/10] block: fix disordered IO in the case recursive split
@ 2025-08-28 6:57 Yu Kuai
2025-08-28 6:57 ` [PATCH RFC v2 01/10] block: factor out a helper bio_submit_split_bioset() Yu Kuai
` (9 more replies)
0 siblings, 10 replies; 28+ messages in thread
From: Yu Kuai @ 2025-08-28 6:57 UTC (permalink / raw)
To: axboe, tj, josef, song, neil, akpm, hch, colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Changes in v2:
- export a new helper bio_submit_split_bioset() instead of
exporting bio_submit_split() directly;
- don't set no merge flag in the new helper;
- add patch 7 and patch 10;
- add patch 8 to skip bio checks for resubmitting split bio;
patch 1 exports a bio split helper;
patches 2-7 unify the bio split code;
patches 8-9 convert the helper to insert the split bio at the head of the
current bio list;
patch 10 is a follow-up cleanup for raid0;
This set is only tested with raid5 for now, see details in patch 9;
Yu Kuai (10):
block: factor out a helper bio_submit_split_bioset()
md/raid0: convert raid0_handle_discard() to use
bio_submit_split_bioset()
md/raid1: convert to use bio_submit_split_bioset()
md/raid10: convert read/write to use bio_submit_split_bioset()
md/raid5: convert to use bio_submit_split_bioset()
md/md-linear: convert to use bio_submit_split_bioset()
blk-crypto: convert to use bio_submit_split_bioset()
block: skip unnecessary checks for split bio
block: fix disordered IO in the case recursive split
md/raid0: convert raid0_make_request() to use
bio_submit_split_bioset()
block/blk-core.c | 34 ++++++++++++++++----
block/blk-crypto-fallback.c | 15 +++------
block/blk-merge.c | 63 ++++++++++++++++++++++++-------------
block/blk-throttle.c | 2 +-
block/blk.h | 3 +-
drivers/md/md-linear.c | 14 ++-------
drivers/md/raid0.c | 31 ++++++------------
drivers/md/raid1.c | 35 +++++++++------------
drivers/md/raid1.h | 4 ++-
drivers/md/raid10.c | 51 +++++++++++++-----------------
drivers/md/raid10.h | 2 ++
drivers/md/raid5.c | 10 +++---
include/linux/blkdev.h | 2 ++
13 files changed, 137 insertions(+), 129 deletions(-)
--
2.39.2
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH RFC v2 01/10] block: factor out a helper bio_submit_split_bioset()
2025-08-28 6:57 [PATCH RFC v2 00/10] block: fix disordered IO in the case recursive split Yu Kuai
@ 2025-08-28 6:57 ` Yu Kuai
2025-08-30 0:37 ` Damien Le Moal
2025-08-28 6:57 ` [PATCH RFC v2 02/10] md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset() Yu Kuai
` (8 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2025-08-28 6:57 UTC (permalink / raw)
To: axboe, tj, josef, song, neil, akpm, hch, colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
No functional changes are intended. Some drivers like mdraid split bios
as part of internal processing; prepare to unify the bio split code.
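The contract the helper relies on is bio_chain(): the original (remainder) bio, which the helper submits itself, cannot complete before the returned split bio does. A minimal userspace sketch of that accounting (mock types and names, not kernel code):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Mock of the chaining contract used by bio_submit_split_bioset():
 * the parent's completion is deferred until the chained child ends.
 */
struct mock_bio {
	int remaining;		/* mimics the bio's __bi_remaining */
	struct mock_bio *parent;
	int ended;		/* "bi_end_io was called" */
};

static void mock_endio(struct mock_bio *bio)
{
	if (--bio->remaining == 0) {
		bio->ended = 1;
		if (bio->parent)
			mock_endio(bio->parent);
	}
}

/* Like bio_chain(split, bio): the parent also waits for the child. */
static void mock_chain(struct mock_bio *child, struct mock_bio *parent)
{
	child->parent = parent;
	parent->remaining++;
}
```

So even though the helper submits the remainder immediately, the caller still owns the returned split bio and the original only ends once both halves have ended.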
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-merge.c | 63 ++++++++++++++++++++++++++++--------------
include/linux/blkdev.h | 2 ++
2 files changed, 44 insertions(+), 21 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 70d704615be5..3d6dc9cc4f61 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -104,34 +104,55 @@ static unsigned int bio_allowed_max_sectors(const struct queue_limits *lim)
return round_down(UINT_MAX, lim->logical_block_size) >> SECTOR_SHIFT;
}
+/**
+ * bio_submit_split_bioset - Submit a bio, splitting it at a designated sector
+ * @bio: the original bio to be submitted and split
+ * @split_sectors: the sector count at which to split
+ * @bs: the bio set used for allocating the new split bio
+ *
+ * The original bio is modified to contain the remaining sectors and submitted.
+ * The caller is responsible for submitting the returned bio.
+ *
+ * On success, the newly allocated bio representing the initial part is
+ * returned; on failure, NULL is returned and the original bio is failed.
+ */
+struct bio *bio_submit_split_bioset(struct bio *bio, int split_sectors,
+ struct bio_set *bs)
+{
+ struct bio *split = bio_split(bio, split_sectors, GFP_NOIO, bs);
+
+ if (IS_ERR(split)) {
+ bio->bi_status = errno_to_blk_status(PTR_ERR(split));
+ bio_endio(bio);
+ return NULL;
+ }
+
+ blkcg_bio_issue_init(split);
+ bio_chain(split, bio);
+ trace_block_split(split, bio->bi_iter.bi_sector);
+ WARN_ON_ONCE(bio_zone_write_plugging(bio));
+ submit_bio_noacct(bio);
+
+ return split;
+}
+EXPORT_SYMBOL_GPL(bio_submit_split_bioset);
+
static struct bio *bio_submit_split(struct bio *bio, int split_sectors)
{
- if (unlikely(split_sectors < 0))
- goto error;
+ if (unlikely(split_sectors < 0)) {
+ bio->bi_status = errno_to_blk_status(split_sectors);
+ bio_endio(bio);
+ return NULL;
+ }
if (split_sectors) {
- struct bio *split;
-
- split = bio_split(bio, split_sectors, GFP_NOIO,
- &bio->bi_bdev->bd_disk->bio_split);
- if (IS_ERR(split)) {
- split_sectors = PTR_ERR(split);
- goto error;
- }
- split->bi_opf |= REQ_NOMERGE;
- blkcg_bio_issue_init(split);
- bio_chain(split, bio);
- trace_block_split(split, bio->bi_iter.bi_sector);
- WARN_ON_ONCE(bio_zone_write_plugging(bio));
- submit_bio_noacct(bio);
- return split;
+ bio = bio_submit_split_bioset(bio, split_sectors,
+ &bio->bi_bdev->bd_disk->bio_split);
+ if (bio)
+ bio->bi_opf |= REQ_NOMERGE;
}
return bio;
-error:
- bio->bi_status = errno_to_blk_status(split_sectors);
- bio_endio(bio);
- return NULL;
}
struct bio *bio_split_discard(struct bio *bio, const struct queue_limits *lim,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index fe1797bbec42..be4b3adf3989 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -999,6 +999,8 @@ extern int blk_register_queue(struct gendisk *disk);
extern void blk_unregister_queue(struct gendisk *disk);
void submit_bio_noacct(struct bio *bio);
struct bio *bio_split_to_limits(struct bio *bio);
+struct bio *bio_submit_split_bioset(struct bio *bio, int split_sectors,
+ struct bio_set *bs);
extern int blk_lld_busy(struct request_queue *q);
extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags);
--
2.39.2
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH RFC v2 02/10] md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset()
2025-08-28 6:57 [PATCH RFC v2 00/10] block: fix disordered IO in the case recursive split Yu Kuai
2025-08-28 6:57 ` [PATCH RFC v2 01/10] block: factor out a helper bio_submit_split_bioset() Yu Kuai
@ 2025-08-28 6:57 ` Yu Kuai
2025-08-30 0:41 ` Damien Le Moal
2025-08-28 6:57 ` [PATCH RFC v2 03/10] md/raid1: convert " Yu Kuai
` (7 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2025-08-28 6:57 UTC (permalink / raw)
To: axboe, tj, josef, song, neil, akpm, hch, colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
On the one hand, unify the bio split code, preparing to fix disordered
split IO; on the other hand, fix the missing blkcg_bio_issue_init() and
trace_block_split() for split IO.
Note that commit 319ff40a5427 ("md/raid0: Fix performance regression for large
sequential writes") already fixed disordered split IO by remapping the bio to
the underlying disks before submit_bio_noacct(), relying on the fact that
md_submit_bio() has already split by sectors and that raid0_make_request()
splits at most once for unaligned IO. This is a bit hacky, and we'll convert
it to the general solution later.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/raid0.c | 19 +++++++------------
1 file changed, 7 insertions(+), 12 deletions(-)
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index f1d8811a542a..4dcc5133d679 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -463,21 +463,16 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
zone = find_zone(conf, &start);
if (bio_end_sector(bio) > zone->zone_end) {
- struct bio *split = bio_split(bio,
- zone->zone_end - bio->bi_iter.bi_sector, GFP_NOIO,
- &mddev->bio_set);
-
- if (IS_ERR(split)) {
- bio->bi_status = errno_to_blk_status(PTR_ERR(split));
- bio_endio(bio);
+ bio = bio_submit_split_bioset(bio,
+ zone->zone_end - bio->bi_iter.bi_sector,
+ &mddev->bio_set);
+ if (!bio)
return;
- }
- bio_chain(split, bio);
- submit_bio_noacct(bio);
- bio = split;
+
end = zone->zone_end;
- } else
+ } else {
end = bio_end_sector(bio);
+ }
orig_end = end;
if (zone != conf->strip_zone)
--
2.39.2
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH RFC v2 03/10] md/raid1: convert to use bio_submit_split_bioset()
2025-08-28 6:57 [PATCH RFC v2 00/10] block: fix disordered IO in the case recursive split Yu Kuai
2025-08-28 6:57 ` [PATCH RFC v2 01/10] block: factor out a helper bio_submit_split_bioset() Yu Kuai
2025-08-28 6:57 ` [PATCH RFC v2 02/10] md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset() Yu Kuai
@ 2025-08-28 6:57 ` Yu Kuai
2025-08-30 0:43 ` Damien Le Moal
2025-08-28 6:57 ` [PATCH RFC v2 04/10] md/raid10: convert read/write " Yu Kuai
` (6 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2025-08-28 6:57 UTC (permalink / raw)
To: axboe, tj, josef, song, neil, akpm, hch, colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
On the one hand, unify the bio split code, preparing to fix disordered
split IO; on the other hand, fix the missing blkcg_bio_issue_init() and
trace_block_split() for split IO.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/raid1.c | 35 ++++++++++++++---------------------
drivers/md/raid1.h | 4 +++-
2 files changed, 17 insertions(+), 22 deletions(-)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 408c26398321..70ec5df43346 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1317,7 +1317,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
struct raid1_info *mirror;
struct bio *read_bio;
int max_sectors;
- int rdisk, error;
+ int rdisk;
bool r1bio_existed = !!r1_bio;
/*
@@ -1376,16 +1376,13 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
}
if (max_sectors < bio_sectors(bio)) {
- struct bio *split = bio_split(bio, max_sectors,
- gfp, &conf->bio_split);
-
- if (IS_ERR(split)) {
- error = PTR_ERR(split);
+ bio = bio_submit_split_bioset(bio, max_sectors,
+ &conf->bio_split);
+ if (!bio) {
+ set_bit(R1BIO_Returned, &r1_bio->state);
goto err_handle;
}
- bio_chain(split, bio);
- submit_bio_noacct(bio);
- bio = split;
+
r1_bio->master_bio = bio;
r1_bio->sectors = max_sectors;
}
@@ -1413,7 +1410,6 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
err_handle:
atomic_dec(&mirror->rdev->nr_pending);
- bio->bi_status = errno_to_blk_status(error);
set_bit(R1BIO_Uptodate, &r1_bio->state);
raid_end_bio_io(r1_bio);
}
@@ -1457,7 +1453,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
{
struct r1conf *conf = mddev->private;
struct r1bio *r1_bio;
- int i, disks, k, error;
+ int i, disks, k;
unsigned long flags;
int first_clone;
int max_sectors;
@@ -1562,7 +1558,8 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
* the benefit.
*/
if (bio->bi_opf & REQ_ATOMIC) {
- error = -EIO;
+ bio->bi_status =
+ errno_to_blk_status(-EIO);
goto err_handle;
}
@@ -1584,16 +1581,13 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
max_sectors = min_t(int, max_sectors,
BIO_MAX_VECS * (PAGE_SIZE >> 9));
if (max_sectors < bio_sectors(bio)) {
- struct bio *split = bio_split(bio, max_sectors,
- GFP_NOIO, &conf->bio_split);
-
- if (IS_ERR(split)) {
- error = PTR_ERR(split);
+ bio = bio_submit_split_bioset(bio, max_sectors,
+ &conf->bio_split);
+ if (!bio) {
+ set_bit(R1BIO_Returned, &r1_bio->state);
goto err_handle;
}
- bio_chain(split, bio);
- submit_bio_noacct(bio);
- bio = split;
+
r1_bio->master_bio = bio;
r1_bio->sectors = max_sectors;
}
@@ -1683,7 +1677,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
}
}
- bio->bi_status = errno_to_blk_status(error);
set_bit(R1BIO_Uptodate, &r1_bio->state);
raid_end_bio_io(r1_bio);
}
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index d236ef179cfb..64548057583b 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -178,7 +178,9 @@ enum r1bio_state {
* any write was successful. Otherwise we call when
* any write-behind write succeeds, otherwise we call
* with failure when last write completes (and all failed).
- * Record that bi_end_io was called with this flag...
+ *
+ * And for bio_split errors, record that bi_end_io was called with
+ * this flag...
*/
R1BIO_Returned,
/* If a write for this request means we can clear some
--
2.39.2
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH RFC v2 04/10] md/raid10: convert read/write to use bio_submit_split_bioset()
2025-08-28 6:57 [PATCH RFC v2 00/10] block: fix disordered IO in the case recursive split Yu Kuai
` (2 preceding siblings ...)
2025-08-28 6:57 ` [PATCH RFC v2 03/10] md/raid1: convert " Yu Kuai
@ 2025-08-28 6:57 ` Yu Kuai
2025-08-30 0:48 ` Damien Le Moal
2025-08-28 6:57 ` [PATCH RFC v2 05/10] md/raid5: convert " Yu Kuai
` (5 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2025-08-28 6:57 UTC (permalink / raw)
To: axboe, tj, josef, song, neil, akpm, hch, colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
On the one hand, unify the bio split code, preparing to fix disordered
split IO; on the other hand, fix the missing blkcg_bio_issue_init() and
trace_block_split() for split IO.
Note that discard is not handled, because discard is only split for the
unaligned head and tail, which can be considered a slow path; the
disorder here does not matter much.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/raid10.c | 51 +++++++++++++++++++--------------------------
drivers/md/raid10.h | 2 ++
2 files changed, 23 insertions(+), 30 deletions(-)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b60c30bfb6c7..0e7d2a67fca6 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -322,10 +322,12 @@ static void raid_end_bio_io(struct r10bio *r10_bio)
struct bio *bio = r10_bio->master_bio;
struct r10conf *conf = r10_bio->mddev->private;
- if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
- bio->bi_status = BLK_STS_IOERR;
+ if (!test_and_set_bit(R10BIO_Returned, &r10_bio->state)) {
+ if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
+ bio->bi_status = BLK_STS_IOERR;
+ bio_endio(bio);
+ }
- bio_endio(bio);
/*
* Wake up any possible resync thread that waits for the device
* to go idle.
@@ -1154,7 +1156,6 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
int slot = r10_bio->read_slot;
struct md_rdev *err_rdev = NULL;
gfp_t gfp = GFP_NOIO;
- int error;
if (slot >= 0 && r10_bio->devs[slot].rdev) {
/*
@@ -1203,17 +1204,15 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
rdev->bdev,
(unsigned long long)r10_bio->sector);
if (max_sectors < bio_sectors(bio)) {
- struct bio *split = bio_split(bio, max_sectors,
- gfp, &conf->bio_split);
- if (IS_ERR(split)) {
- error = PTR_ERR(split);
- goto err_handle;
- }
- bio_chain(split, bio);
allow_barrier(conf);
- submit_bio_noacct(bio);
+ bio = bio_submit_split_bioset(bio, max_sectors,
+ &conf->bio_split);
wait_barrier(conf, false);
- bio = split;
+ if (!bio) {
+ set_bit(R10BIO_Returned, &r10_bio->state);
+ goto err_handle;
+ }
+
r10_bio->master_bio = bio;
r10_bio->sectors = max_sectors;
}
@@ -1239,10 +1238,9 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio,
mddev_trace_remap(mddev, read_bio, r10_bio->sector);
submit_bio_noacct(read_bio);
return;
+
err_handle:
atomic_dec(&rdev->nr_pending);
- bio->bi_status = errno_to_blk_status(error);
- set_bit(R10BIO_Uptodate, &r10_bio->state);
raid_end_bio_io(r10_bio);
}
@@ -1351,7 +1349,6 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
int i, k;
sector_t sectors;
int max_sectors;
- int error;
if ((mddev_is_clustered(mddev) &&
mddev->cluster_ops->area_resyncing(mddev, WRITE,
@@ -1465,10 +1462,8 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
* complexity of supporting that is not worth
* the benefit.
*/
- if (bio->bi_opf & REQ_ATOMIC) {
- error = -EIO;
+ if (bio->bi_opf & REQ_ATOMIC)
goto err_handle;
- }
good_sectors = first_bad - dev_sector;
if (good_sectors < max_sectors)
@@ -1489,17 +1484,15 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
r10_bio->sectors = max_sectors;
if (r10_bio->sectors < bio_sectors(bio)) {
- struct bio *split = bio_split(bio, r10_bio->sectors,
- GFP_NOIO, &conf->bio_split);
- if (IS_ERR(split)) {
- error = PTR_ERR(split);
- goto err_handle;
- }
- bio_chain(split, bio);
allow_barrier(conf);
- submit_bio_noacct(bio);
+ bio = bio_submit_split_bioset(bio, r10_bio->sectors,
+ &conf->bio_split);
wait_barrier(conf, false);
- bio = split;
+ if (!bio) {
+ set_bit(R10BIO_Returned, &r10_bio->state);
+ goto err_handle;
+ }
+
r10_bio->master_bio = bio;
}
@@ -1531,8 +1524,6 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio,
}
}
- bio->bi_status = errno_to_blk_status(error);
- set_bit(R10BIO_Uptodate, &r10_bio->state);
raid_end_bio_io(r10_bio);
}
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 3f16ad6904a9..ba13f08a2646 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -165,6 +165,8 @@ enum r10bio_state {
* so that raid10d knows what to do with them.
*/
R10BIO_ReadError,
+/* For bio_split errors, record that bi_end_io was called */
+ R10BIO_Returned,
/* If a write for this request means we can clear some
* known-bad-block records, we set this flag.
*/
--
2.39.2
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH RFC v2 05/10] md/raid5: convert to use bio_submit_split_bioset()
2025-08-28 6:57 [PATCH RFC v2 00/10] block: fix disordered IO in the case recursive split Yu Kuai
` (3 preceding siblings ...)
2025-08-28 6:57 ` [PATCH RFC v2 04/10] md/raid10: convert read/write " Yu Kuai
@ 2025-08-28 6:57 ` Yu Kuai
2025-08-30 0:50 ` Damien Le Moal
2025-08-28 6:57 ` [PATCH RFC v2 06/10] md/md-linear: " Yu Kuai
` (4 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2025-08-28 6:57 UTC (permalink / raw)
To: axboe, tj, josef, song, neil, akpm, hch, colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
On the one hand, unify the bio split code, preparing to fix disordered
split IO; on the other hand, fix the missing blkcg_bio_issue_init() and
trace_block_split() for split IO.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/raid5.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 023649fe2476..3c9825ad3f07 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5468,17 +5468,17 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio)
{
- struct bio *split;
sector_t sector = raid_bio->bi_iter.bi_sector;
unsigned chunk_sects = mddev->chunk_sectors;
unsigned sectors = chunk_sects - (sector & (chunk_sects-1));
if (sectors < bio_sectors(raid_bio)) {
struct r5conf *conf = mddev->private;
- split = bio_split(raid_bio, sectors, GFP_NOIO, &conf->bio_split);
- bio_chain(split, raid_bio);
- submit_bio_noacct(raid_bio);
- raid_bio = split;
+
+ raid_bio = bio_submit_split_bioset(raid_bio, sectors,
+ &conf->bio_split);
+ if (!raid_bio)
+ return NULL;
}
if (!raid5_read_one_chunk(mddev, raid_bio))
--
2.39.2
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH RFC v2 06/10] md/md-linear: convert to use bio_submit_split_bioset()
2025-08-28 6:57 [PATCH RFC v2 00/10] block: fix disordered IO in the case recursive split Yu Kuai
` (4 preceding siblings ...)
2025-08-28 6:57 ` [PATCH RFC v2 05/10] md/raid5: convert " Yu Kuai
@ 2025-08-28 6:57 ` Yu Kuai
2025-08-30 0:51 ` Damien Le Moal
2025-08-28 6:57 ` [PATCH RFC v2 07/10] blk-crypto: " Yu Kuai
` (3 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2025-08-28 6:57 UTC (permalink / raw)
To: axboe, tj, josef, song, neil, akpm, hch, colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
On the one hand, unify the bio split code, preparing to fix disordered
split IO; on the other hand, fix the missing blkcg_bio_issue_init() and
trace_block_split() for split IO.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/md-linear.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)
diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
index 5d9b08115375..76f85cc32942 100644
--- a/drivers/md/md-linear.c
+++ b/drivers/md/md-linear.c
@@ -256,18 +256,10 @@ static bool linear_make_request(struct mddev *mddev, struct bio *bio)
if (unlikely(bio_end_sector(bio) > end_sector)) {
/* This bio crosses a device boundary, so we have to split it */
- struct bio *split = bio_split(bio, end_sector - bio_sector,
- GFP_NOIO, &mddev->bio_set);
-
- if (IS_ERR(split)) {
- bio->bi_status = errno_to_blk_status(PTR_ERR(split));
- bio_endio(bio);
+ bio = bio_submit_split_bioset(bio, end_sector - bio_sector,
+ &mddev->bio_set);
+ if (!bio)
return true;
- }
-
- bio_chain(split, bio);
- submit_bio_noacct(bio);
- bio = split;
}
md_account_bio(mddev, &bio);
--
2.39.2
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH RFC v2 07/10] blk-crypto: convert to use bio_submit_split_bioset()
2025-08-28 6:57 [PATCH RFC v2 00/10] block: fix disordered IO in the case recursive split Yu Kuai
` (5 preceding siblings ...)
2025-08-28 6:57 ` [PATCH RFC v2 06/10] md/md-linear: " Yu Kuai
@ 2025-08-28 6:57 ` Yu Kuai
2025-08-30 0:55 ` Damien Le Moal
2025-08-28 6:57 ` [PATCH RFC v2 08/10] block: skip unnecessary checks for split bio Yu Kuai
` (2 subsequent siblings)
9 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2025-08-28 6:57 UTC (permalink / raw)
To: axboe, tj, josef, song, neil, akpm, hch, colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
On the one hand, unify the bio split code, preparing to fix disordered
split IO; on the other hand, fix the missing blkcg_bio_issue_init() and
trace_block_split() for split IO.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-crypto-fallback.c | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)
diff --git a/block/blk-crypto-fallback.c b/block/blk-crypto-fallback.c
index 005c9157ffb3..e6ed50d9b00f 100644
--- a/block/blk-crypto-fallback.c
+++ b/block/blk-crypto-fallback.c
@@ -223,17 +223,12 @@ static bool blk_crypto_fallback_split_bio_if_needed(struct bio **bio_ptr)
break;
}
if (num_sectors < bio_sectors(bio)) {
- struct bio *split_bio;
-
- split_bio = bio_split(bio, num_sectors, GFP_NOIO,
- &crypto_bio_split);
- if (IS_ERR(split_bio)) {
- bio->bi_status = BLK_STS_RESOURCE;
+ bio = bio_submit_split_bioset(bio, num_sectors,
+ &crypto_bio_split);
+ if (!bio)
return false;
- }
- bio_chain(split_bio, bio);
- submit_bio_noacct(bio);
- *bio_ptr = split_bio;
+
+ *bio_ptr = bio;
}
return true;
--
2.39.2
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH RFC v2 08/10] block: skip unnecessary checks for split bio
2025-08-28 6:57 [PATCH RFC v2 00/10] block: fix disordered IO in the case recursive split Yu Kuai
` (6 preceding siblings ...)
2025-08-28 6:57 ` [PATCH RFC v2 07/10] blk-crypto: " Yu Kuai
@ 2025-08-28 6:57 ` Yu Kuai
2025-08-30 0:58 ` Damien Le Moal
2025-08-28 6:57 ` [PATCH RFC v2 09/10] block: fix disordered IO in the case recursive split Yu Kuai
2025-08-28 6:57 ` [PATCH RFC v2 10/10] md/raid0: convert raid0_make_request() to use bio_submit_split_bioset() Yu Kuai
9 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2025-08-28 6:57 UTC (permalink / raw)
To: axboe, tj, josef, song, neil, akpm, hch, colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Lots of checks are already done while submitting this bio the first
time, and there is no need to check them again when the bio is
resubmitted after a split.
Hence factor out a helper, submit_split_bio_noacct(), for resubmitting a
bio after splitting; only should_fail_bio() and blk_throtl_bio() are
kept.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-core.c | 15 +++++++++++++++
block/blk-merge.c | 2 +-
block/blk.h | 1 +
3 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 4201504158a1..37836446f365 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -765,6 +765,21 @@ static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
return BLK_STS_OK;
}
+void submit_split_bio_noacct(struct bio *bio)
+{
+ might_sleep();
+
+ if (should_fail_bio(bio)) {
+ bio_io_error(bio);
+ return;
+ }
+
+ if (blk_throtl_bio(bio))
+ return;
+
+ submit_bio_noacct_nocheck(bio);
+}
+
/**
* submit_bio_noacct - re-submit a bio to the block device layer for I/O
* @bio: The bio describing the location in memory and on the device.
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3d6dc9cc4f61..fa2c3d98b277 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -131,7 +131,7 @@ struct bio *bio_submit_split_bioset(struct bio *bio, int split_sectors,
bio_chain(split, bio);
trace_block_split(split, bio->bi_iter.bi_sector);
WARN_ON_ONCE(bio_zone_write_plugging(bio));
- submit_bio_noacct(bio);
+ submit_split_bio_noacct(bio);
return split;
}
diff --git a/block/blk.h b/block/blk.h
index 46f566f9b126..80375374ef55 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -54,6 +54,7 @@ bool blk_queue_start_drain(struct request_queue *q);
bool __blk_freeze_queue_start(struct request_queue *q,
struct task_struct *owner);
int __bio_queue_enter(struct request_queue *q, struct bio *bio);
+void submit_split_bio_noacct(struct bio *bio);
void submit_bio_noacct_nocheck(struct bio *bio);
void bio_await_chain(struct bio *bio);
--
2.39.2
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH RFC v2 09/10] block: fix disordered IO in the case recursive split
2025-08-28 6:57 [PATCH RFC v2 00/10] block: fix disordered IO in the case recursive split Yu Kuai
` (7 preceding siblings ...)
2025-08-28 6:57 ` [PATCH RFC v2 08/10] block: skip unnecessary checks for split bio Yu Kuai
@ 2025-08-28 6:57 ` Yu Kuai
2025-08-30 1:02 ` Damien Le Moal
2025-08-28 6:57 ` [PATCH RFC v2 10/10] md/raid0: convert raid0_make_request() to use bio_submit_split_bioset() Yu Kuai
9 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2025-08-28 6:57 UTC (permalink / raw)
To: axboe, tj, josef, song, neil, akpm, hch, colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Currently, a split bio is chained to the original bio, and the original
bio is resubmitted to the tail of current->bio_list, waiting for the
split bio to be issued. However, if the split bio gets split again, the
IO order will be messed up. For example, in raid456 IO will first be
split by max_sectors from md_submit_bio(), and then later be split
again by chunksize for internal handling:
For example, assume max_sectors is 1M, and chunksize is 512k
1) issue a 2M IO:
bio issuing: 0+2M
current->bio_list: NULL
2) md_submit_bio() split by max_sector:
bio issuing: 0+1M
current->bio_list: 1M+1M
3) chunk_aligned_read() split by chunksize:
bio issuing: 0+512k
current->bio_list: 1M+1M -> 512k+512k
4) after the first bio is issued, __submit_bio_noacct() will continue
issuing the next bio:
bio issuing: 1M+1M
current->bio_list: 512k+512k
bio issued: 0+512k
5) chunk_aligned_read() split by chunksize:
bio issuing: 1M+512k
current->bio_list: 512k+512k -> 1536k+512k
bio issued: 0+512k
6) no split afterwards, finally the issue order is:
0+512k -> 1M+512k -> 512k+512k -> 1536k+512k
This behaviour causes a large read IO on raid456 to end up as small
discontinuous IO on the underlying disks. Fix this problem by placing the
split bio at the head of current->bio_list.
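The ordering above can be reproduced with a small userspace simulation (mock types, not kernel code; numbers taken from the example: max_sectors 1M, chunksize 512k, units in KiB):

```c
#include <assert.h>
#include <string.h>

struct fake_bio { int sector, sectors; };
struct fake_list { struct fake_bio b[16]; int n; };

static void add_tail(struct fake_list *l, struct fake_bio b)
{
	l->b[l->n++] = b;
}

static void add_head(struct fake_list *l, struct fake_bio b)
{
	memmove(&l->b[1], &l->b[0], l->n * sizeof(b));
	l->b[0] = b;
	l->n++;
}

static struct fake_bio pop_head(struct fake_list *l)
{
	struct fake_bio b = l->b[0];

	l->n--;
	memmove(&l->b[0], &l->b[1], l->n * sizeof(b));
	return b;
}

/*
 * Submit one 2M read and drain the list. Each pop is split first by
 * max_sectors (like md_submit_bio()) and then by chunksize (like
 * chunk_aligned_read()); the remainder is requeued at the tail
 * (current behaviour) or at the head (this patch). Returns the number
 * of leaf bios issued; their starting sectors go into issued[].
 */
static int simulate(int head_insert, int issued[])
{
	const int limits[2] = { 1024, 512 };	/* max_sectors, chunk */
	struct fake_list list = { .n = 0 };
	int n = 0;

	add_tail(&list, (struct fake_bio){ 0, 2048 });
	while (list.n) {
		struct fake_bio bio = pop_head(&list);

		for (int i = 0; i < 2; i++) {
			if (bio.sectors > limits[i]) {
				struct fake_bio rest = {
					bio.sector + limits[i],
					bio.sectors - limits[i],
				};

				if (head_insert)
					add_head(&list, rest);
				else
					add_tail(&list, rest);
				bio.sectors = limits[i];
			}
		}
		issued[n++] = bio.sector;	/* leaf bio issued */
	}
	return n;
}
```

With tail insertion the issue order is 0, 1M, 512k, 1536k as in step 6) above; with head insertion it becomes the sequential 0, 512k, 1M, 1536k.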
Test script: tested on an 8-disk raid5 with a 64k chunksize
dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct
Test results:
Before this patch
1) iostat results:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
md0 52430.00 3276.87 0.00 0.00 0.62 64.00 32.60 80.10
sd* 4487.00 409.00 2054.00 31.40 0.82 93.34 3.68 71.20
2) blktrace G stage:
8,0 0 486445 11.357392936 843 G R 14071424 + 128 [dd]
8,0 0 486451 11.357466360 843 G R 14071168 + 128 [dd]
8,0 0 486454 11.357515868 843 G R 14071296 + 128 [dd]
8,0 0 486468 11.357968099 843 G R 14072192 + 128 [dd]
8,0 0 486474 11.358031320 843 G R 14071936 + 128 [dd]
8,0 0 486480 11.358096298 843 G R 14071552 + 128 [dd]
8,0 0 486490 11.358303858 843 G R 14071808 + 128 [dd]
3) io seek for sdx:
Noted io seek is the result from blktrace D stage, statistic of:
ABS((offset of next IO) - (offset + len of previous IO))
Read|Write seek
cnt 55175, zero cnt 25079
>=(KB) .. <(KB) : count ratio |distribution |
0 .. 1 : 25079 45.5% |########################################|
1 .. 2 : 0 0.0% | |
2 .. 4 : 0 0.0% | |
4 .. 8 : 0 0.0% | |
8 .. 16 : 0 0.0% | |
16 .. 32 : 0 0.0% | |
32 .. 64 : 12540 22.7% |##################### |
64 .. 128 : 2508 4.5% |##### |
128 .. 256 : 0 0.0% | |
256 .. 512 : 10032 18.2% |################# |
512 .. 1024 : 5016 9.1% |######### |
After this patch:
1) iostat results:
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
md0 87965.00 5271.88 0.00 0.00 0.16 61.37 14.03 90.60
sd* 6020.00 658.44 5117.00 45.95 0.44 112.00 2.68 86.50
2) blktrace G stage:
8,0 0 206296 5.354894072 664 G R 7156992 + 128 [dd]
8,0 0 206305 5.355018179 664 G R 7157248 + 128 [dd]
8,0 0 206316 5.355204438 664 G R 7157504 + 128 [dd]
8,0 0 206319 5.355241048 664 G R 7157760 + 128 [dd]
8,0 0 206333 5.355500923 664 G R 7158016 + 128 [dd]
8,0 0 206344 5.355837806 664 G R 7158272 + 128 [dd]
8,0 0 206353 5.355960395 664 G R 7158528 + 128 [dd]
8,0 0 206357 5.356020772 664 G R 7158784 + 128 [dd]
3) io seek for sdx
Read|Write seek
cnt 28644, zero cnt 21483
>=(KB) .. <(KB) : count ratio |distribution |
0 .. 1 : 21483 75.0% |########################################|
1 .. 2 : 0 0.0% | |
2 .. 4 : 0 0.0% | |
4 .. 8 : 0 0.0% | |
8 .. 16 : 0 0.0% | |
16 .. 32 : 0 0.0% | |
32 .. 64 : 7161 25.0% |############## |
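The seek statistic in the tables above can be sketched as a one-line helper (illustrative only; units in KB as in the tables):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Seek distance between two consecutive IOs, as defined above:
 * ABS((offset of next IO) - (offset + len of previous IO)).
 * A fully sequential stream yields 0.
 */
static long long seek_kb(long long prev_off_kb, long long prev_len_kb,
			 long long next_off_kb)
{
	return llabs(next_off_kb - (prev_off_kb + prev_len_kb));
}
```

A zero count in the distribution therefore corresponds to back-to-back sequential IOs, which is why the zero bucket grows from 45.5% to 75.0% with the patch applied.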
BTW, this looks like a long-term problem from day one, and large
sequential read IO is a pretty common case, e.g. video playback.
Even with this patch, IO in this test case is merged to at most 128k due
to the block layer plug limit BLK_PLUG_FLUSH_SIZE; increasing this limit
can yield even better performance. However, we'll figure out how to do
this properly later.
Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
Reported-by: Tie Ren <tieren@fnnas.com>
Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
block/blk-core.c | 21 ++++++++++++++-------
block/blk-throttle.c | 2 +-
block/blk.h | 2 +-
3 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 37836446f365..b643d3b1e9fe 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -725,7 +725,7 @@ static void __submit_bio_noacct_mq(struct bio *bio)
current->bio_list = NULL;
}
-void submit_bio_noacct_nocheck(struct bio *bio)
+void submit_bio_noacct_nocheck(struct bio *bio, bool split)
{
blk_cgroup_bio_start(bio);
blkcg_bio_issue_init(bio);
@@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
* to collect a list of requests submited by a ->submit_bio method while
* it is active, and then process them after it returned.
*/
- if (current->bio_list)
- bio_list_add(¤t->bio_list[0], bio);
- else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
+ if (current->bio_list) {
+ if (split)
+ bio_list_add_head(¤t->bio_list[0], bio);
+ else
+ bio_list_add(¤t->bio_list[0], bio);
+ } else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
__submit_bio_noacct_mq(bio);
- else
+ } else {
__submit_bio_noacct(bio);
+ }
}
static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
@@ -769,6 +773,9 @@ void submit_split_bio_noacct(struct bio *bio)
{
might_sleep();
+ /* This helper should only be called from submit_bio context */
+ WARN_ON_ONCE(!current->bio_list);
+
if (should_fail_bio(bio)) {
bio_io_error(bio);
return;
@@ -777,7 +784,7 @@ void submit_split_bio_noacct(struct bio *bio)
if (blk_throtl_bio(bio))
return;
- submit_bio_noacct_nocheck(bio);
+ submit_bio_noacct_nocheck(bio, true);
}
/**
@@ -886,7 +893,7 @@ void submit_bio_noacct(struct bio *bio)
if (blk_throtl_bio(bio))
return;
- submit_bio_noacct_nocheck(bio);
+ submit_bio_noacct_nocheck(bio, false);
return;
not_supported:
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 397b6a410f9e..ead7b0eb4846 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1224,7 +1224,7 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
if (!bio_list_empty(&bio_list_on_stack)) {
blk_start_plug(&plug);
while ((bio = bio_list_pop(&bio_list_on_stack)))
- submit_bio_noacct_nocheck(bio);
+ submit_bio_noacct_nocheck(bio, false);
blk_finish_plug(&plug);
}
}
diff --git a/block/blk.h b/block/blk.h
index 80375374ef55..da120c74c4b5 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -55,7 +55,7 @@ bool __blk_freeze_queue_start(struct request_queue *q,
struct task_struct *owner);
int __bio_queue_enter(struct request_queue *q, struct bio *bio);
void submit_split_bio_noacct(struct bio *bio);
-void submit_bio_noacct_nocheck(struct bio *bio);
+void submit_bio_noacct_nocheck(struct bio *bio, bool split);
void bio_await_chain(struct bio *bio);
static inline bool blk_try_enter_queue(struct request_queue *q, bool pm)
--
2.39.2
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH RFC v2 10/10] md/raid0: convert raid0_make_request() to use bio_submit_split_bioset()
2025-08-28 6:57 [PATCH RFC v2 00/10] block: fix disordered IO in the case recursive split Yu Kuai
` (8 preceding siblings ...)
2025-08-28 6:57 ` [PATCH RFC v2 09/10] block: fix disordered IO in the case recursive split Yu Kuai
@ 2025-08-28 6:57 ` Yu Kuai
9 siblings, 0 replies; 28+ messages in thread
From: Yu Kuai @ 2025-08-28 6:57 UTC (permalink / raw)
To: axboe, tj, josef, song, neil, akpm, hch, colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yukuai1,
yi.zhang, yangerkun, johnny.chenyi
From: Yu Kuai <yukuai3@huawei.com>
Currently, raid0_make_request() remaps the original bio to the underlying
disks to prevent disordered IO. Now that bio_submit_split_bioset() puts the
original bio at the head of current->bio_list, it's safe to convert to this
helper and bios will still be ordered.
On the one hand this unifies bio split code; on the other hand it fixes the
missing blkcg_bio_issue_init() and trace_block_split() for split IO.
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/raid0.c | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 4dcc5133d679..8773f633299e 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -607,17 +607,9 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
: sector_div(sector, chunk_sects));
if (sectors < bio_sectors(bio)) {
- struct bio *split = bio_split(bio, sectors, GFP_NOIO,
- &mddev->bio_set);
-
- if (IS_ERR(split)) {
- bio->bi_status = errno_to_blk_status(PTR_ERR(split));
- bio_endio(bio);
+ bio = bio_submit_split_bioset(bio, sectors, &mddev->bio_set);
+ if (!bio)
return true;
- }
- bio_chain(split, bio);
- raid0_map_submit_bio(mddev, bio);
- bio = split;
}
raid0_map_submit_bio(mddev, bio);
--
2.39.2
* Re: [PATCH RFC v2 01/10] block: factor out a helper bio_submit_split_bioset()
2025-08-28 6:57 ` [PATCH RFC v2 01/10] block: factor out a helper bio_submit_split_bioset() Yu Kuai
@ 2025-08-30 0:37 ` Damien Le Moal
2025-08-30 4:03 ` Yu Kuai
0 siblings, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2025-08-30 0:37 UTC (permalink / raw)
To: Yu Kuai, axboe, tj, josef, song, neil, akpm, hch, colyli, hare,
tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
On 8/28/25 15:57, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> No functional changes are intended, some drivers like mdraid will split
> bio by internal processing, prepare to unify bio split codes.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Looks good to me. A few nits below.
> ---
> block/blk-merge.c | 63 ++++++++++++++++++++++++++++--------------
> include/linux/blkdev.h | 2 ++
> 2 files changed, 44 insertions(+), 21 deletions(-)
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 70d704615be5..3d6dc9cc4f61 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -104,34 +104,55 @@ static unsigned int bio_allowed_max_sectors(const struct queue_limits *lim)
> return round_down(UINT_MAX, lim->logical_block_size) >> SECTOR_SHIFT;
> }
>
> +/**
> + * bio_submit_split_bioset - Submit a bio, splitting it at a designated sector
> + * @bio: the original bio to be submitted and split
> + * @split_sectors: the sector count at which to split
> + * @bs: the bio set used for allocating the new split bio
> + *
> + * The original bio is modified to contain the remaining sectors and submitted.
> + * The caller is responsible for submitting the returned bio.
> + *
> + * If succeed, the newly allocated bio representing the initial part will be
> + * returned, on failure NULL will be returned and original bio will fail.
> + */
> +struct bio *bio_submit_split_bioset(struct bio *bio, int split_sectors,
> + struct bio_set *bs)
While at it, it would be nice to have split_sectors be unsigned. That would
avoid the check in bio_submit_split().
> +{
> + struct bio *split = bio_split(bio, split_sectors, GFP_NOIO, bs);
> +
> + if (IS_ERR(split)) {
> + bio->bi_status = errno_to_blk_status(PTR_ERR(split));
> + bio_endio(bio);
> + return NULL;
> + }
> +
> + blkcg_bio_issue_init(split);
> + bio_chain(split, bio);
> + trace_block_split(split, bio->bi_iter.bi_sector);
> + WARN_ON_ONCE(bio_zone_write_plugging(bio));
> + submit_bio_noacct(bio);
> +
> + return split;
> +}
> +EXPORT_SYMBOL_GPL(bio_submit_split_bioset);
> +
> static struct bio *bio_submit_split(struct bio *bio, int split_sectors)
> {
> - if (unlikely(split_sectors < 0))
> - goto error;
> + if (unlikely(split_sectors < 0)) {
> + bio->bi_status = errno_to_blk_status(split_sectors);
> + bio_endio(bio);
> + return NULL;
> + }
See above.
>
> if (split_sectors) {
> - struct bio *split;
> -
> - split = bio_split(bio, split_sectors, GFP_NOIO,
> - &bio->bi_bdev->bd_disk->bio_split);
> - if (IS_ERR(split)) {
> - split_sectors = PTR_ERR(split);
> - goto error;
> - }
> - split->bi_opf |= REQ_NOMERGE;
> - blkcg_bio_issue_init(split);
> - bio_chain(split, bio);
> - trace_block_split(split, bio->bi_iter.bi_sector);
> - WARN_ON_ONCE(bio_zone_write_plugging(bio));
> - submit_bio_noacct(bio);
> - return split;
> + bio = bio_submit_split_bioset(bio, split_sectors,
> + &bio->bi_bdev->bd_disk->bio_split);
> + if (bio)
> + bio->bi_opf |= REQ_NOMERGE;
I think that setting REQ_NOMERGE should be done in bio_submit_split_bioset().
> }
>
> return bio;
> -error:
> - bio->bi_status = errno_to_blk_status(split_sectors);
> - bio_endio(bio);
> - return NULL;
> }
>
> struct bio *bio_split_discard(struct bio *bio, const struct queue_limits *lim,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index fe1797bbec42..be4b3adf3989 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -999,6 +999,8 @@ extern int blk_register_queue(struct gendisk *disk);
> extern void blk_unregister_queue(struct gendisk *disk);
> void submit_bio_noacct(struct bio *bio);
> struct bio *bio_split_to_limits(struct bio *bio);
> +struct bio *bio_submit_split_bioset(struct bio *bio, int split_sectors,
> + struct bio_set *bs);
>
> extern int blk_lld_busy(struct request_queue *q);
> extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags);
--
Damien Le Moal
Western Digital Research
* Re: [PATCH RFC v2 02/10] md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset()
2025-08-28 6:57 ` [PATCH RFC v2 02/10] md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset() Yu Kuai
@ 2025-08-30 0:41 ` Damien Le Moal
2025-08-30 4:10 ` Yu Kuai
0 siblings, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2025-08-30 0:41 UTC (permalink / raw)
To: Yu Kuai, axboe, tj, josef, song, neil, akpm, hch, colyli, hare,
tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
On 8/28/25 15:57, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> On the one hand unify bio split code, prepare to fix disordered split
> IO; On the other hand fix missing blkcg_bio_issue_init() and
> trace_block_split() for split IO.
Hmmm... Shouldn't that be a prep patch with a fixes tag for backport ?
Because that "fix" here is not done directly but is the result of calling
bio_submit_split_bioset().
>
> Noted commit 319ff40a5427 ("md/raid0: Fix performance regression for large
> sequential writes") already fix disordered split IO by converting bio to
> underlying disks before submit_bio_noacct(), with the respect
> md_submit_bio() already split by sectors, and raid0_make_request() will
> split at most once for unaligned IO. This is a bit hacky and we'll convert
> this to solution in general later.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> drivers/md/raid0.c | 19 +++++++------------
> 1 file changed, 7 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index f1d8811a542a..4dcc5133d679 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -463,21 +463,16 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
> zone = find_zone(conf, &start);
>
> if (bio_end_sector(bio) > zone->zone_end) {
> - struct bio *split = bio_split(bio,
> - zone->zone_end - bio->bi_iter.bi_sector, GFP_NOIO,
> - &mddev->bio_set);
> -
> - if (IS_ERR(split)) {
> - bio->bi_status = errno_to_blk_status(PTR_ERR(split));
> - bio_endio(bio);
> + bio = bio_submit_split_bioset(bio,
> + zone->zone_end - bio->bi_iter.bi_sector,
Can this ever be negative (of course not, I think)? But if
bio_submit_split_bioset() is changed to have an unsigned int sectors count,
maybe add a sanity check before calling bio_submit_split_bioset()?
> + &mddev->bio_set);
> + if (!bio)
> return;
> - }
> - bio_chain(split, bio);
> - submit_bio_noacct(bio);
> - bio = split;
> +
> end = zone->zone_end;
> - } else
> + } else {
> end = bio_end_sector(bio);
> + }
>
> orig_end = end;
> if (zone != conf->strip_zone)
--
Damien Le Moal
Western Digital Research
* Re: [PATCH RFC v2 03/10] md/raid1: convert to use bio_submit_split_bioset()
2025-08-28 6:57 ` [PATCH RFC v2 03/10] md/raid1: convert " Yu Kuai
@ 2025-08-30 0:43 ` Damien Le Moal
0 siblings, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2025-08-30 0:43 UTC (permalink / raw)
To: Yu Kuai, axboe, tj, josef, song, neil, akpm, hch, colyli, hare,
tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
On 8/28/25 15:57, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> On the one hand unify bio split code, prepare to fix disordered split
> IO; On the other hand fix missing blkcg_bio_issue_init() and
> trace_block_split() for split IO.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Looks OK to me.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
> index d236ef179cfb..64548057583b 100644
> --- a/drivers/md/raid1.h
> +++ b/drivers/md/raid1.h
> @@ -178,7 +178,9 @@ enum r1bio_state {
> * any write was successful. Otherwise we call when
> * any write-behind write succeeds, otherwise we call
> * with failure when last write completes (and all failed).
> - * Record that bi_end_io was called with this flag...
> + *
> + * And for bio_split errors, record that bi_end_io was called with
> + * this flag...
Nit: s/.../.
> */
> R1BIO_Returned,
> /* If a write for this request means we can clear some
--
Damien Le Moal
Western Digital Research
* Re: [PATCH RFC v2 04/10] md/raid10: convert read/write to use bio_submit_split_bioset()
2025-08-28 6:57 ` [PATCH RFC v2 04/10] md/raid10: convert read/write " Yu Kuai
@ 2025-08-30 0:48 ` Damien Le Moal
2025-08-30 4:18 ` Yu Kuai
0 siblings, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2025-08-30 0:48 UTC (permalink / raw)
To: Yu Kuai, axboe, tj, josef, song, neil, akpm, hch, colyli, hare,
tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
On 8/28/25 15:57, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> On the one hand unify bio split code, prepare to fix disordered split
> IO; On the other hand fix missing blkcg_bio_issue_init() and
> trace_block_split() for split IO.
>
> Noted discard is not handled, because discard is only splited for
s/splited/split
> unaligned head and tail, and this can be considered slow path, the
> disorder here does not matter much.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> drivers/md/raid10.c | 51 +++++++++++++++++++--------------------------
> drivers/md/raid10.h | 2 ++
> 2 files changed, 23 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index b60c30bfb6c7..0e7d2a67fca6 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -322,10 +322,12 @@ static void raid_end_bio_io(struct r10bio *r10_bio)
> struct bio *bio = r10_bio->master_bio;
> struct r10conf *conf = r10_bio->mddev->private;
>
> - if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
> - bio->bi_status = BLK_STS_IOERR;
> + if (!test_and_set_bit(R10BIO_Returned, &r10_bio->state)) {
> + if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
> + bio->bi_status = BLK_STS_IOERR;
> + bio_endio(bio);
> + }
This change / the R10BIO_Returned flag is not mentioned in the commit message.
Please explain why it is needed to add for switching to using
bio_submit_split_bioset(), which in itself should not introduce functional
changes. Looks like this needs to be split into different patches...
--
Damien Le Moal
Western Digital Research
* Re: [PATCH RFC v2 05/10] md/raid5: convert to use bio_submit_split_bioset()
2025-08-28 6:57 ` [PATCH RFC v2 05/10] md/raid5: convert " Yu Kuai
@ 2025-08-30 0:50 ` Damien Le Moal
0 siblings, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2025-08-30 0:50 UTC (permalink / raw)
To: Yu Kuai, axboe, tj, josef, song, neil, akpm, hch, colyli, hare,
tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
On 8/28/25 15:57, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> On the one hand unify bio split code, prepare to fix disordered split
> IO; On the other hand fix missing blkcg_bio_issue_init() and
> trace_block_split() for split IO.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Same comment about the missing blkcg_bio_issue_init() call... Separate fix patch
first ?
Other than that, this looks OK to me.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH RFC v2 06/10] md/md-linear: convert to use bio_submit_split_bioset()
2025-08-28 6:57 ` [PATCH RFC v2 06/10] md/md-linear: " Yu Kuai
@ 2025-08-30 0:51 ` Damien Le Moal
0 siblings, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2025-08-30 0:51 UTC (permalink / raw)
To: Yu Kuai, axboe, tj, josef, song, neil, akpm, hch, colyli, hare,
tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
On 8/28/25 15:57, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> On the one hand unify bio split code, prepare to fix disordered split
> IO; On the other hand fix missing blkcg_bio_issue_init() and
> trace_block_split() for split IO.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Same comment about the missing blkcg_bio_issue_init() call.
Other than that, this looks OK to me.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH RFC v2 07/10] blk-crypto: convert to use bio_submit_split_bioset()
2025-08-28 6:57 ` [PATCH RFC v2 07/10] blk-crypto: " Yu Kuai
@ 2025-08-30 0:55 ` Damien Le Moal
0 siblings, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2025-08-30 0:55 UTC (permalink / raw)
To: Yu Kuai, axboe, tj, josef, song, neil, akpm, hch, colyli, hare,
tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
On 8/28/25 15:57, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> On the one hand unify bio split code, prepare to fix disordered split
> IO; On the other hand fix missing blkcg_bio_issue_init() and
> trace_block_split() for split IO.
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Again same comment about the missing blkcg_bio_issue_init() call.
Other than that, this looks OK to me.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH RFC v2 08/10] block: skip unnecessary checks for split bio
2025-08-28 6:57 ` [PATCH RFC v2 08/10] block: skip unnecessary checks for split bio Yu Kuai
@ 2025-08-30 0:58 ` Damien Le Moal
2025-08-30 4:22 ` Yu Kuai
0 siblings, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2025-08-30 0:58 UTC (permalink / raw)
To: Yu Kuai, axboe, tj, josef, song, neil, akpm, hch, colyli, hare,
tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
On 8/28/25 15:57, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> Lots of checks are already done while submitting this bio the first
> time, and there is no need to check them again when this bio is
> resubmitted after split.
>
> Hence factor out a helper submit_split_bio_noacct() for resubmitting
> bio after splitting, only should_fail_bio() and blk_throtl_bio() are
> keeped.
s/keeped/kept
>
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> block/blk-core.c | 15 +++++++++++++++
> block/blk-merge.c | 2 +-
> block/blk.h | 1 +
> 3 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4201504158a1..37836446f365 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -765,6 +765,21 @@ static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
> return BLK_STS_OK;
> }
>
> +void submit_split_bio_noacct(struct bio *bio)
> +{
> + might_sleep();
> +
> + if (should_fail_bio(bio)) {
> + bio_io_error(bio);
> + return;
> + }
> +
> + if (blk_throtl_bio(bio))
> + return;
> +
> + submit_bio_noacct_nocheck(bio);
> +}
> +
> /**
> * submit_bio_noacct - re-submit a bio to the block device layer for I/O
> * @bio: The bio describing the location in memory and on the device.
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 3d6dc9cc4f61..fa2c3d98b277 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -131,7 +131,7 @@ struct bio *bio_submit_split_bioset(struct bio *bio, int split_sectors,
> bio_chain(split, bio);
> trace_block_split(split, bio->bi_iter.bi_sector);
> WARN_ON_ONCE(bio_zone_write_plugging(bio));
> - submit_bio_noacct(bio);
> + submit_split_bio_noacct(bio);
If this helper is used only here, I would just open-code it here as otherwise it
is confusing: its name has "split" in it but nothing of its code checks that the
BIO was actually split.
>
> return split;
> }
> diff --git a/block/blk.h b/block/blk.h
> index 46f566f9b126..80375374ef55 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -54,6 +54,7 @@ bool blk_queue_start_drain(struct request_queue *q);
> bool __blk_freeze_queue_start(struct request_queue *q,
> struct task_struct *owner);
> int __bio_queue_enter(struct request_queue *q, struct bio *bio);
> +void submit_split_bio_noacct(struct bio *bio);
> void submit_bio_noacct_nocheck(struct bio *bio);
> void bio_await_chain(struct bio *bio);
>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH RFC v2 09/10] block: fix disordered IO in the case recursive split
2025-08-28 6:57 ` [PATCH RFC v2 09/10] block: fix disordered IO in the case recursive split Yu Kuai
@ 2025-08-30 1:02 ` Damien Le Moal
2025-08-30 4:28 ` Yu Kuai
0 siblings, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2025-08-30 1:02 UTC (permalink / raw)
To: Yu Kuai, axboe, tj, josef, song, neil, akpm, hch, colyli, hare,
tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
On 8/28/25 15:57, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> Currently, split bio will be chained to original bio, and original bio
> will be resubmitted to the tail of current->bio_list, waiting for
> split bio to be issued. However, if split bio get split again, the IO
> order will be messed up, for example, in raid456 IO will first be
> split by max_sector from md_submit_bio(), and then later be split
> again by chunksize for internal handling:
>
> For example, assume max_sectors is 1M, and chunksize is 512k
>
> 1) issue a 2M IO:
>
> bio issuing: 0+2M
> current->bio_list: NULL
>
> 2) md_submit_bio() split by max_sector:
>
> bio issuing: 0+1M
> current->bio_list: 1M+1M
>
> 3) chunk_aligned_read() split by chunksize:
>
> bio issuing: 0+512k
> current->bio_list: 1M+1M -> 512k+512k
>
> 4) after first bio issued, __submit_bio_noacct() will contuine issuing
> next bio:
>
> bio issuing: 1M+1M
> current->bio_list: 512k+512k
> bio issued: 0+512k
>
> 5) chunk_aligned_read() split by chunksize:
>
> bio issuing: 1M+512k
> current->bio_list: 512k+512k -> 1536k+512k
> bio issued: 0+512k
>
> 6) no split afterwards, finally the issue order is:
>
> 0+512k -> 1M+512k -> 512k+512k -> 1536k+512k
>
> This behaviour will cause large IO read on raid456 endup to be small
> discontinuous IO in underlying disks. Fix this problem by placing split
> bio to the head of current->bio_list.
>
> Test script: test on 8 disk raid5 with 64k chunksize
> dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct
>
> Test results:
> Before this patch
> 1) iostat results:
> Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
> md0 52430.00 3276.87 0.00 0.00 0.62 64.00 32.60 80.10
> sd* 4487.00 409.00 2054.00 31.40 0.82 93.34 3.68 71.20
> 2) blktrace G stage:
> 8,0 0 486445 11.357392936 843 G R 14071424 + 128 [dd]
> 8,0 0 486451 11.357466360 843 G R 14071168 + 128 [dd]
> 8,0 0 486454 11.357515868 843 G R 14071296 + 128 [dd]
> 8,0 0 486468 11.357968099 843 G R 14072192 + 128 [dd]
> 8,0 0 486474 11.358031320 843 G R 14071936 + 128 [dd]
> 8,0 0 486480 11.358096298 843 G R 14071552 + 128 [dd]
> 8,0 0 486490 11.358303858 843 G R 14071808 + 128 [dd]
> 3) io seek for sdx:
> Noted io seek is the result from blktrace D stage, statistic of:
> ABS((offset of next IO) - (offset + len of previous IO))
>
> Read|Write seek
> cnt 55175, zero cnt 25079
> >=(KB) .. <(KB) : count ratio |distribution |
> 0 .. 1 : 25079 45.5% |########################################|
> 1 .. 2 : 0 0.0% | |
> 2 .. 4 : 0 0.0% | |
> 4 .. 8 : 0 0.0% | |
> 8 .. 16 : 0 0.0% | |
> 16 .. 32 : 0 0.0% | |
> 32 .. 64 : 12540 22.7% |##################### |
> 64 .. 128 : 2508 4.5% |##### |
> 128 .. 256 : 0 0.0% | |
> 256 .. 512 : 10032 18.2% |################# |
> 512 .. 1024 : 5016 9.1% |######### |
>
> After this patch:
> 1) iostat results:
> Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
> md0 87965.00 5271.88 0.00 0.00 0.16 61.37 14.03 90.60
> sd* 6020.00 658.44 5117.00 45.95 0.44 112.00 2.68 86.50
> 2) blktrace G stage:
> 8,0 0 206296 5.354894072 664 G R 7156992 + 128 [dd]
> 8,0 0 206305 5.355018179 664 G R 7157248 + 128 [dd]
> 8,0 0 206316 5.355204438 664 G R 7157504 + 128 [dd]
> 8,0 0 206319 5.355241048 664 G R 7157760 + 128 [dd]
> 8,0 0 206333 5.355500923 664 G R 7158016 + 128 [dd]
> 8,0 0 206344 5.355837806 664 G R 7158272 + 128 [dd]
> 8,0 0 206353 5.355960395 664 G R 7158528 + 128 [dd]
> 8,0 0 206357 5.356020772 664 G R 7158784 + 128 [dd]
> 3) io seek for sdx
> Read|Write seek
> cnt 28644, zero cnt 21483
> >=(KB) .. <(KB) : count ratio |distribution |
> 0 .. 1 : 21483 75.0% |########################################|
> 1 .. 2 : 0 0.0% | |
> 2 .. 4 : 0 0.0% | |
> 4 .. 8 : 0 0.0% | |
> 8 .. 16 : 0 0.0% | |
> 16 .. 32 : 0 0.0% | |
> 32 .. 64 : 7161 25.0% |############## |
>
> BTW, this looks like a long term problem from day one, and large
> sequential IO read is pretty common case like video playing.
>
> And even with this patch, in this test case IO is merged to at most 128k
> is due to block layer plug limit BLK_PLUG_FLUSH_SIZE, increase such
> limit can get even better performance. However, we'll figure out how to do
> this properly later.
>
> Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
> Reported-by: Tie Ren <tieren@fnnas.com>
> Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> block/blk-core.c | 21 ++++++++++++++-------
> block/blk-throttle.c | 2 +-
> block/blk.h | 2 +-
> 3 files changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 37836446f365..b643d3b1e9fe 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -725,7 +725,7 @@ static void __submit_bio_noacct_mq(struct bio *bio)
> current->bio_list = NULL;
> }
>
> -void submit_bio_noacct_nocheck(struct bio *bio)
> +void submit_bio_noacct_nocheck(struct bio *bio, bool split)
> {
> blk_cgroup_bio_start(bio);
> blkcg_bio_issue_init(bio);
> @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
> * to collect a list of requests submited by a ->submit_bio method while
> * it is active, and then process them after it returned.
> */
> - if (current->bio_list)
> - bio_list_add(¤t->bio_list[0], bio);
> - else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
> + if (current->bio_list) {
> + if (split)
> + bio_list_add_head(¤t->bio_list[0], bio);
> + else
> + bio_list_add(¤t->bio_list[0], bio);
This really needs a comment clarifying why we do an add at tail instead of
keeping the original order with an add at head. I am also scared that this may
break sequential write ordering for zoned devices.
> + } else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
> __submit_bio_noacct_mq(bio);
> - else
> + } else {
> __submit_bio_noacct(bio);
> + }
> }
--
Damien Le Moal
Western Digital Research
* Re: [PATCH RFC v2 01/10] block: factor out a helper bio_submit_split_bioset()
2025-08-30 0:37 ` Damien Le Moal
@ 2025-08-30 4:03 ` Yu Kuai
0 siblings, 0 replies; 28+ messages in thread
From: Yu Kuai @ 2025-08-30 4:03 UTC (permalink / raw)
To: Damien Le Moal, Yu Kuai, axboe, tj, josef, song, neil, akpm, hch,
colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
Hi,
On 2025/8/30 8:37, Damien Le Moal wrote:
> On 8/28/25 15:57, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> No functional changes are intended, some drivers like mdraid will split
>> bio by internal processing, prepare to unify bio split codes.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Looks good to me. A few nits below.
>
>> ---
>> block/blk-merge.c | 63 ++++++++++++++++++++++++++++--------------
>> include/linux/blkdev.h | 2 ++
>> 2 files changed, 44 insertions(+), 21 deletions(-)
>>
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 70d704615be5..3d6dc9cc4f61 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -104,34 +104,55 @@ static unsigned int bio_allowed_max_sectors(const struct queue_limits *lim)
>> return round_down(UINT_MAX, lim->logical_block_size) >> SECTOR_SHIFT;
>> }
>>
>> +/**
>> + * bio_submit_split_bioset - Submit a bio, splitting it at a designated sector
>> + * @bio: the original bio to be submitted and split
>> + * @split_sectors: the sector count at which to split
>> + * @bs: the bio set used for allocating the new split bio
>> + *
>> + * The original bio is modified to contain the remaining sectors and submitted.
>> + * The caller is responsible for submitting the returned bio.
>> + *
>> + * If succeed, the newly allocated bio representing the initial part will be
>> + * returned, on failure NULL will be returned and original bio will fail.
>> + */
>> +struct bio *bio_submit_split_bioset(struct bio *bio, int split_sectors,
>> + struct bio_set *bs)
> While at it, it would be nice to have split_sectors be unsigned. That would
> avoid the check in bio_submit_split().
Sure, I can change this to unsigned. However, there are callers that return
an error code, and bio_submit_split() will handle that error code; if we want
to avoid the check here, we'll have to handle the error before
bio_submit_split() as well. Is this what you mean?
>
>> +{
>> + struct bio *split = bio_split(bio, split_sectors, GFP_NOIO, bs);
>> +
>> + if (IS_ERR(split)) {
>> + bio->bi_status = errno_to_blk_status(PTR_ERR(split));
>> + bio_endio(bio);
>> + return NULL;
>> + }
>> +
>> + blkcg_bio_issue_init(split);
>> + bio_chain(split, bio);
>> + trace_block_split(split, bio->bi_iter.bi_sector);
>> + WARN_ON_ONCE(bio_zone_write_plugging(bio));
>> + submit_bio_noacct(bio);
>> +
>> + return split;
>> +}
>> +EXPORT_SYMBOL_GPL(bio_submit_split_bioset);
>> +
>> static struct bio *bio_submit_split(struct bio *bio, int split_sectors)
>> {
>> - if (unlikely(split_sectors < 0))
>> - goto error;
>> + if (unlikely(split_sectors < 0)) {
>> + bio->bi_status = errno_to_blk_status(split_sectors);
>> + bio_endio(bio);
>> + return NULL;
>> + }
> See above.
>
>>
>> if (split_sectors) {
>> - struct bio *split;
>> -
>> - split = bio_split(bio, split_sectors, GFP_NOIO,
>> - &bio->bi_bdev->bd_disk->bio_split);
>> - if (IS_ERR(split)) {
>> - split_sectors = PTR_ERR(split);
>> - goto error;
>> - }
>> - split->bi_opf |= REQ_NOMERGE;
>> - blkcg_bio_issue_init(split);
>> - bio_chain(split, bio);
>> - trace_block_split(split, bio->bi_iter.bi_sector);
>> - WARN_ON_ONCE(bio_zone_write_plugging(bio));
>> - submit_bio_noacct(bio);
>> - return split;
>> + bio = bio_submit_split_bioset(bio, split_sectors,
>> + &bio->bi_bdev->bd_disk->bio_split);
>> + if (bio)
>> + bio->bi_opf |= REQ_NOMERGE;
> I think that setting REQ_NOMERGE should be done in bio_submit_split_bioset().
In mdraid, we expect the bio can still be merged in the underlying disks
after splitting by chunksize, and setting it and then clearing it in mdraid
is not necessary; I think this is better.
Thanks,
Kuai
>> }
>>
>> return bio;
>> -error:
>> - bio->bi_status = errno_to_blk_status(split_sectors);
>> - bio_endio(bio);
>> - return NULL;
>> }
>>
>> struct bio *bio_split_discard(struct bio *bio, const struct queue_limits *lim,
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index fe1797bbec42..be4b3adf3989 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -999,6 +999,8 @@ extern int blk_register_queue(struct gendisk *disk);
>> extern void blk_unregister_queue(struct gendisk *disk);
>> void submit_bio_noacct(struct bio *bio);
>> struct bio *bio_split_to_limits(struct bio *bio);
>> +struct bio *bio_submit_split_bioset(struct bio *bio, int split_sectors,
>> + struct bio_set *bs);
>>
>> extern int blk_lld_busy(struct request_queue *q);
>> extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags);
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH RFC v2 02/10] md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset()
2025-08-30 0:41 ` Damien Le Moal
@ 2025-08-30 4:10 ` Yu Kuai
2025-08-30 4:38 ` Damien Le Moal
0 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2025-08-30 4:10 UTC (permalink / raw)
To: Damien Le Moal, Yu Kuai, axboe, tj, josef, song, neil, akpm, hch,
colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
Hi,
On 2025/8/30 8:41, Damien Le Moal wrote:
> On 8/28/25 15:57, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> On the one hand unify bio split code, prepare to fix disordered split
>> IO; On the other hand fix missing blkcg_bio_issue_init() and
>> trace_block_split() for split IO.
> Hmmm... Shouldn't that be a prep patch with a fixes tag for backport ?
> Because that "fix" here is not done directly but is the result of calling
> bio_submit_split_bioset().
I can add a fixes tag, since blkcg_bio_issue_init() and trace_block_split() are missed.
However, if we consider stable backport, should we fix this directly in the caller
first, as that is easier to backport? This patch can then be considered just a
cleanup.
>> Noted commit 319ff40a5427 ("md/raid0: Fix performance regression for large
>> sequential writes") already fix disordered split IO by converting bio to
>> underlying disks before submit_bio_noacct(), with the respect
>> md_submit_bio() already split by sectors, and raid0_make_request() will
>> split at most once for unaligned IO. This is a bit hacky and we'll convert
>> this to solution in general later.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>> drivers/md/raid0.c | 19 +++++++------------
>> 1 file changed, 7 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
>> index f1d8811a542a..4dcc5133d679 100644
>> --- a/drivers/md/raid0.c
>> +++ b/drivers/md/raid0.c
>> @@ -463,21 +463,16 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
>> zone = find_zone(conf, &start);
>>
>> if (bio_end_sector(bio) > zone->zone_end) {
>> - struct bio *split = bio_split(bio,
>> - zone->zone_end - bio->bi_iter.bi_sector, GFP_NOIO,
>> - &mddev->bio_set);
>> -
>> - if (IS_ERR(split)) {
>> - bio->bi_status = errno_to_blk_status(PTR_ERR(split));
>> - bio_endio(bio);
>> + bio = bio_submit_split_bioset(bio,
>> + zone->zone_end - bio->bi_iter.bi_sector,
> Can this ever be negative (of course not I think)? But if
> bio_submit_split_bioset() is changed to have an unsigned int sectors count,
> maybe add a sanity check before calling bio_submit_split_bioset() ?
Yes, this can never be negative.
Thanks,
Kuai
>
>> + &mddev->bio_set);
>> + if (!bio)
>> return;
>> - }
>> - bio_chain(split, bio);
>> - submit_bio_noacct(bio);
>> - bio = split;
>> +
>> end = zone->zone_end;
>> - } else
>> + } else {
>> end = bio_end_sector(bio);
>> + }
>>
>> orig_end = end;
>> if (zone != conf->strip_zone)
>
* Re: [PATCH RFC v2 04/10] md/raid10: convert read/write to use bio_submit_split_bioset()
2025-08-30 0:48 ` Damien Le Moal
@ 2025-08-30 4:18 ` Yu Kuai
0 siblings, 0 replies; 28+ messages in thread
From: Yu Kuai @ 2025-08-30 4:18 UTC (permalink / raw)
To: Damien Le Moal, Yu Kuai, axboe, tj, josef, song, neil, akpm, hch,
colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
Hi,
On 2025/8/30 8:48, Damien Le Moal wrote:
> On 8/28/25 15:57, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> On the one hand unify bio split code, prepare to fix disordered split
>> IO; On the other hand fix missing blkcg_bio_issue_init() and
>> trace_block_split() for split IO.
>>
>> Noted discard is not handled, because discard is only splited for
> s/splited/split
>
>> unaligned head and tail, and this can be considered slow path, the
>> disorder here does not matter much.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>> drivers/md/raid10.c | 51 +++++++++++++++++++--------------------------
>> drivers/md/raid10.h | 2 ++
>> 2 files changed, 23 insertions(+), 30 deletions(-)
>>
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index b60c30bfb6c7..0e7d2a67fca6 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -322,10 +322,12 @@ static void raid_end_bio_io(struct r10bio *r10_bio)
>> struct bio *bio = r10_bio->master_bio;
>> struct r10conf *conf = r10_bio->mddev->private;
>>
>> - if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
>> - bio->bi_status = BLK_STS_IOERR;
>> + if (!test_and_set_bit(R10BIO_Returned, &r10_bio->state)) {
>> + if (!test_bit(R10BIO_Uptodate, &r10_bio->state))
>> + bio->bi_status = BLK_STS_IOERR;
>> + bio_endio(bio);
>> + }
> This change / the R10BIO_Returned flag is not mentioned in the commit message.
> Please explain why it is needed to add for switching to using
> bio_submit_split_bioset(), which in itself should not introduce functional
> changes. Looks like this needs to be split into different patches...
>
OK, this flag is actually borrowed from the raid1 flag. I can make it a separate
patch :)
Thanks,
Kuai
* Re: [PATCH RFC v2 08/10] block: skip unnecessary checks for split bio
2025-08-30 0:58 ` Damien Le Moal
@ 2025-08-30 4:22 ` Yu Kuai
0 siblings, 0 replies; 28+ messages in thread
From: Yu Kuai @ 2025-08-30 4:22 UTC (permalink / raw)
To: Damien Le Moal, Yu Kuai, axboe, tj, josef, song, neil, akpm, hch,
colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
Hi,
On 2025/8/30 8:58, Damien Le Moal wrote:
> On 8/28/25 15:57, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Lots of checks are already done while submitting this bio the first
>> time, and there is no need to check them again when this bio is
>> resubmitted after split.
>>
>> Hence factor out a helper submit_split_bio_noacct() for resubmitting
>> bio after splitting, only should_fail_bio() and blk_throtl_bio() are
>> keeped.
> s/keeped/kept
>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>> block/blk-core.c | 15 +++++++++++++++
>> block/blk-merge.c | 2 +-
>> block/blk.h | 1 +
>> 3 files changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 4201504158a1..37836446f365 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -765,6 +765,21 @@ static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
>> return BLK_STS_OK;
>> }
>>
>> +void submit_split_bio_noacct(struct bio *bio)
>> +{
>> + might_sleep();
>> +
>> + if (should_fail_bio(bio)) {
>> + bio_io_error(bio);
>> + return;
>> + }
>> +
>> + if (blk_throtl_bio(bio))
>> + return;
>> +
>> + submit_bio_noacct_nocheck(bio);
>> +}
>> +
>> /**
>> * submit_bio_noacct - re-submit a bio to the block device layer for I/O
>> * @bio: The bio describing the location in memory and on the device.
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 3d6dc9cc4f61..fa2c3d98b277 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -131,7 +131,7 @@ struct bio *bio_submit_split_bioset(struct bio *bio, int split_sectors,
>> bio_chain(split, bio);
>> trace_block_split(split, bio->bi_iter.bi_sector);
>> WARN_ON_ONCE(bio_zone_write_plugging(bio));
>> - submit_bio_noacct(bio);
>> + submit_split_bio_noacct(bio);
> If this helper is used only here, I would just open-code it here as otherwise it
> is confusing: its name has "split" in it but nothing of its code checks that the
> BIO was actually split.
I agree, and I actually tried this first; the problem I met is that should_fail_bio()
is defined internally in blk-core.c. Perhaps I should add a comment about this function?
Thanks,
Kuai
>>
>> return split;
>> }
>> diff --git a/block/blk.h b/block/blk.h
>> index 46f566f9b126..80375374ef55 100644
>> --- a/block/blk.h
>> +++ b/block/blk.h
>> @@ -54,6 +54,7 @@ bool blk_queue_start_drain(struct request_queue *q);
>> bool __blk_freeze_queue_start(struct request_queue *q,
>> struct task_struct *owner);
>> int __bio_queue_enter(struct request_queue *q, struct bio *bio);
>> +void submit_split_bio_noacct(struct bio *bio);
>> void submit_bio_noacct_nocheck(struct bio *bio);
>> void bio_await_chain(struct bio *bio);
>>
>
* Re: [PATCH RFC v2 09/10] block: fix disordered IO in the case recursive split
2025-08-30 1:02 ` Damien Le Moal
@ 2025-08-30 4:28 ` Yu Kuai
2025-09-01 2:40 ` Yu Kuai
0 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2025-08-30 4:28 UTC (permalink / raw)
To: Damien Le Moal, Yu Kuai, axboe, tj, josef, song, neil, akpm, hch,
colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
Hi,
On 2025/8/30 9:02, Damien Le Moal wrote:
> On 8/28/25 15:57, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Currently, split bio will be chained to original bio, and original bio
>> will be resubmitted to the tail of current->bio_list, waiting for
>> split bio to be issued. However, if split bio get split again, the IO
>> order will be messed up, for example, in raid456 IO will first be
>> split by max_sector from md_submit_bio(), and then later be split
>> again by chunksize for internal handling:
>>
>> For example, assume max_sectors is 1M, and chunksize is 512k
>>
>> 1) issue a 2M IO:
>>
>> bio issuing: 0+2M
>> current->bio_list: NULL
>>
>> 2) md_submit_bio() split by max_sector:
>>
>> bio issuing: 0+1M
>> current->bio_list: 1M+1M
>>
>> 3) chunk_aligned_read() split by chunksize:
>>
>> bio issuing: 0+512k
>> current->bio_list: 1M+1M -> 512k+512k
>>
>> 4) after first bio issued, __submit_bio_noacct() will contuine issuing
>> next bio:
>>
>> bio issuing: 1M+1M
>> current->bio_list: 512k+512k
>> bio issued: 0+512k
>>
>> 5) chunk_aligned_read() split by chunksize:
>>
>> bio issuing: 1M+512k
>> current->bio_list: 512k+512k -> 1536k+512k
>> bio issued: 0+512k
>>
>> 6) no split afterwards, finally the issue order is:
>>
>> 0+512k -> 1M+512k -> 512k+512k -> 1536k+512k
>>
>> This behaviour will cause large IO read on raid456 endup to be small
>> discontinuous IO in underlying disks. Fix this problem by placing split
>> bio to the head of current->bio_list.
>>
>> Test script: test on 8 disk raid5 with 64k chunksize
>> dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct
>>
>> Test results:
>> Before this patch
>> 1) iostat results:
>> Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
>> md0 52430.00 3276.87 0.00 0.00 0.62 64.00 32.60 80.10
>> sd* 4487.00 409.00 2054.00 31.40 0.82 93.34 3.68 71.20
>> 2) blktrace G stage:
>> 8,0 0 486445 11.357392936 843 G R 14071424 + 128 [dd]
>> 8,0 0 486451 11.357466360 843 G R 14071168 + 128 [dd]
>> 8,0 0 486454 11.357515868 843 G R 14071296 + 128 [dd]
>> 8,0 0 486468 11.357968099 843 G R 14072192 + 128 [dd]
>> 8,0 0 486474 11.358031320 843 G R 14071936 + 128 [dd]
>> 8,0 0 486480 11.358096298 843 G R 14071552 + 128 [dd]
>> 8,0 0 486490 11.358303858 843 G R 14071808 + 128 [dd]
>> 3) io seek for sdx:
>> Noted io seek is the result from blktrace D stage, statistic of:
>> ABS((offset of next IO) - (offset + len of previous IO))
>>
>> Read|Write seek
>> cnt 55175, zero cnt 25079
>> >=(KB) .. <(KB) : count ratio |distribution |
>> 0 .. 1 : 25079 45.5% |########################################|
>> 1 .. 2 : 0 0.0% | |
>> 2 .. 4 : 0 0.0% | |
>> 4 .. 8 : 0 0.0% | |
>> 8 .. 16 : 0 0.0% | |
>> 16 .. 32 : 0 0.0% | |
>> 32 .. 64 : 12540 22.7% |##################### |
>> 64 .. 128 : 2508 4.5% |##### |
>> 128 .. 256 : 0 0.0% | |
>> 256 .. 512 : 10032 18.2% |################# |
>> 512 .. 1024 : 5016 9.1% |######### |
>>
>> After this patch:
>> 1) iostat results:
>> Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util
>> md0 87965.00 5271.88 0.00 0.00 0.16 61.37 14.03 90.60
>> sd* 6020.00 658.44 5117.00 45.95 0.44 112.00 2.68 86.50
>> 2) blktrace G stage:
>> 8,0 0 206296 5.354894072 664 G R 7156992 + 128 [dd]
>> 8,0 0 206305 5.355018179 664 G R 7157248 + 128 [dd]
>> 8,0 0 206316 5.355204438 664 G R 7157504 + 128 [dd]
>> 8,0 0 206319 5.355241048 664 G R 7157760 + 128 [dd]
>> 8,0 0 206333 5.355500923 664 G R 7158016 + 128 [dd]
>> 8,0 0 206344 5.355837806 664 G R 7158272 + 128 [dd]
>> 8,0 0 206353 5.355960395 664 G R 7158528 + 128 [dd]
>> 8,0 0 206357 5.356020772 664 G R 7158784 + 128 [dd]
>> 3) io seek for sdx
>> Read|Write seek
>> cnt 28644, zero cnt 21483
>> >=(KB) .. <(KB) : count ratio |distribution |
>> 0 .. 1 : 21483 75.0% |########################################|
>> 1 .. 2 : 0 0.0% | |
>> 2 .. 4 : 0 0.0% | |
>> 4 .. 8 : 0 0.0% | |
>> 8 .. 16 : 0 0.0% | |
>> 16 .. 32 : 0 0.0% | |
>> 32 .. 64 : 7161 25.0% |############## |
>>
>> BTW, this looks like a long term problem from day one, and large
>> sequential IO read is pretty common case like video playing.
>>
>> And even with this patch, in this test case IO is merged to at most 128k
>> is due to block layer plug limit BLK_PLUG_FLUSH_SIZE, increase such
>> limit can get even better performance. However, we'll figure out how to do
>> this properly later.
>>
>> Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls")
>> Reported-by: Tie Ren <tieren@fnnas.com>
>> Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>> block/blk-core.c | 21 ++++++++++++++-------
>> block/blk-throttle.c | 2 +-
>> block/blk.h | 2 +-
>> 3 files changed, 16 insertions(+), 9 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 37836446f365..b643d3b1e9fe 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -725,7 +725,7 @@ static void __submit_bio_noacct_mq(struct bio *bio)
>> current->bio_list = NULL;
>> }
>>
>> -void submit_bio_noacct_nocheck(struct bio *bio)
>> +void submit_bio_noacct_nocheck(struct bio *bio, bool split)
>> {
>> blk_cgroup_bio_start(bio);
>> blkcg_bio_issue_init(bio);
>> @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
>> * to collect a list of requests submited by a ->submit_bio method while
>> * it is active, and then process them after it returned.
>> */
>> - if (current->bio_list)
>> - bio_list_add(¤t->bio_list[0], bio);
>> - else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
>> + if (current->bio_list) {
>> + if (split)
>> + bio_list_add_head(¤t->bio_list[0], bio);
>> + else
>> + bio_list_add(¤t->bio_list[0], bio);
> This really needs a comment clarifying why we do an add at tail instead of
> keeping the original order with an add at head. I am also scared that this may
> break sequential write ordering for zoned devices.
I think add at head is exactly what we do here to keep the original order for
the case of bio split. Other than split, if the caller does generate multiple
sequential bios, we should keep the order by adding at the tail.
Not sure about zoned devices for now; I'll have a look at the details.
Thanks,
Kuai
>
>> + } else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
>> __submit_bio_noacct_mq(bio);
>> - else
>> + } else {
>> __submit_bio_noacct(bio);
>> + }
>> }
>
>
* Re: [PATCH RFC v2 02/10] md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset()
2025-08-30 4:10 ` Yu Kuai
@ 2025-08-30 4:38 ` Damien Le Moal
0 siblings, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2025-08-30 4:38 UTC (permalink / raw)
To: Yu Kuai, Yu Kuai, axboe, tj, josef, song, neil, akpm, hch, colyli,
hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yukuai3, yi.zhang,
yangerkun, johnny.chenyi
On 8/30/25 13:10, Yu Kuai wrote:
> Hi,
>
> On 2025/8/30 8:41, Damien Le Moal wrote:
>> On 8/28/25 15:57, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> On the one hand unify bio split code, prepare to fix disordered split
>>> IO; On the other hand fix missing blkcg_bio_issue_init() and
>>> trace_block_split() for split IO.
>> Hmmm... Shouldn't that be a prep patch with a fixes tag for backport ?
>> Because that "fix" here is not done directly but is the result of calling
>> bio_submit_split_bioset().
>
> I can add a fixes tag, since blkcg_bio_issue_init() and trace_block_split() are missed.
> However, if we consider stable backport, should we fix this directly in the caller
> first, as that is easier to backport? This patch can then be considered just a
> cleanup.
That is what I was suggesting: fix the blkcg issue first with a fixes tag, and
then do the conversion to bio_submit_split_bioset() in a later patch that is
not to be backported.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH RFC v2 09/10] block: fix disordered IO in the case recursive split
2025-08-30 4:28 ` Yu Kuai
@ 2025-09-01 2:40 ` Yu Kuai
2025-09-01 6:51 ` Damien Le Moal
0 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2025-09-01 2:40 UTC (permalink / raw)
To: Yu Kuai, Damien Le Moal, Yu Kuai, axboe, tj, josef, song, neil,
akpm, hch, colyli, hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yi.zhang,
yangerkun, johnny.chenyi, yukuai (C)
Hi,
On 2025/08/30 12:28, Yu Kuai wrote:
>>> @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
>>> * to collect a list of requests submited by a ->submit_bio
>>> method while
>>> * it is active, and then process them after it returned.
>>> */
>>> - if (current->bio_list)
>>> - bio_list_add(¤t->bio_list[0], bio);
>>> - else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
>>> + if (current->bio_list) {
>>> + if (split)
>>> + bio_list_add_head(¤t->bio_list[0], bio);
>>> + else
>>> + bio_list_add(¤t->bio_list[0], bio);
>> This really needs a comment clarifying why we do an add at tail
>> instead of
>> keeping the original order with an add at head. I am also scared that
>> this may
>> break sequential write ordering for zoned devices.
>
> I think add at head is exactly what we do here to keep the original order
> for the case of bio split. Other than split, if the caller does generate
> multiple sequential bios, we should keep the order by adding at the tail.
>
> Not sure about zoned devices for now, I'll have a look in details.
For zoned devices, can we somehow trigger this recursive split? I
suspect bio disorder will appear in this case, but I don't know for
now and I can't find a way to reproduce it.
Perhaps I can bypass zoned devices for now, and if we really meet the
recursive split case and there is a problem, we can fix it later:
if (split && !bdev_is_zoned(bio->bi_bdev))
	bio_list_add_head()
Thanks,
Kuai
* Re: [PATCH RFC v2 09/10] block: fix disordered IO in the case recursive split
2025-09-01 2:40 ` Yu Kuai
@ 2025-09-01 6:51 ` Damien Le Moal
0 siblings, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2025-09-01 6:51 UTC (permalink / raw)
To: Yu Kuai, Yu Kuai, axboe, tj, josef, song, neil, akpm, hch, colyli,
hare, tieren
Cc: linux-block, linux-kernel, cgroups, linux-raid, yi.zhang,
yangerkun, johnny.chenyi, yukuai (C)
On 9/1/25 11:40 AM, Yu Kuai wrote:
> Hi,
>
> On 2025/08/30 12:28, Yu Kuai wrote:
>>>> @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
>>>> * to collect a list of requests submited by a ->submit_bio method while
>>>> * it is active, and then process them after it returned.
>>>> */
>>>> - if (current->bio_list)
>>>> - bio_list_add(¤t->bio_list[0], bio);
>>>> - else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
>>>> + if (current->bio_list) {
>>>> + if (split)
>>>> + bio_list_add_head(¤t->bio_list[0], bio);
>>>> + else
>>>> + bio_list_add(¤t->bio_list[0], bio);
>>> This really needs a comment clarifying why we do an add at tail instead of
>>> keeping the original order with an add at head. I am also scared that this may
>>> break sequential write ordering for zoned devices.
>>
>> I think add at head is exactly what we do here to keep the original order for
>> the case of bio split. Other than split, if the caller does generate multiple
>> sequential bios, we should keep the order by adding at the tail.
>>
>> Not sure about zoned devices for now, I'll have a look in details.
>
> For zoned devices, can we somehow trigger this recursive split? I
> suspect bio disorder will appear in this case, but I don't know for
> now and I can't find a way to reproduce it.
dm-linear can be stacked on e.g. dm-crypt, or the reverse, so recursive
splitting may be possible. Since for DM everything is zone aligned, it may be
hard to find a reproducer; however, dm-crypt will always split BIOs to
BIO_MAX_VECS << PAGE_SECTORS_SHIFT sectors, so it may be possible with very
large BIOs. I would need to try, but I am really overloaded with other things right now.
>
> Perhaps I can bypass zoned devices for now, and if we really met the
> recursive split case and there is a problem, we can fix it later:
>
> if (split && !bdev_is_zoned(bio->bi_bdev))
> bio_list_add_head()
>
> Thanks,
> Kuai
>
--
Damien Le Moal
Western Digital Research
end of thread, other threads:[~2025-09-01 6:54 UTC | newest]
Thread overview: 28+ messages
2025-08-28 6:57 [PATCH RFC v2 00/10] block: fix disordered IO in the case recursive split Yu Kuai
2025-08-28 6:57 ` [PATCH RFC v2 01/10] block: factor out a helper bio_submit_split_bioset() Yu Kuai
2025-08-30 0:37 ` Damien Le Moal
2025-08-30 4:03 ` Yu Kuai
2025-08-28 6:57 ` [PATCH RFC v2 02/10] md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset() Yu Kuai
2025-08-30 0:41 ` Damien Le Moal
2025-08-30 4:10 ` Yu Kuai
2025-08-30 4:38 ` Damien Le Moal
2025-08-28 6:57 ` [PATCH RFC v2 03/10] md/raid1: convert " Yu Kuai
2025-08-30 0:43 ` Damien Le Moal
2025-08-28 6:57 ` [PATCH RFC v2 04/10] md/raid10: convert read/write " Yu Kuai
2025-08-30 0:48 ` Damien Le Moal
2025-08-30 4:18 ` Yu Kuai
2025-08-28 6:57 ` [PATCH RFC v2 05/10] md/raid5: convert " Yu Kuai
2025-08-30 0:50 ` Damien Le Moal
2025-08-28 6:57 ` [PATCH RFC v2 06/10] md/md-linear: " Yu Kuai
2025-08-30 0:51 ` Damien Le Moal
2025-08-28 6:57 ` [PATCH RFC v2 07/10] blk-crypto: " Yu Kuai
2025-08-30 0:55 ` Damien Le Moal
2025-08-28 6:57 ` [PATCH RFC v2 08/10] block: skip unnecessary checks for split bio Yu Kuai
2025-08-30 0:58 ` Damien Le Moal
2025-08-30 4:22 ` Yu Kuai
2025-08-28 6:57 ` [PATCH RFC v2 09/10] block: fix disordered IO in the case recursive split Yu Kuai
2025-08-30 1:02 ` Damien Le Moal
2025-08-30 4:28 ` Yu Kuai
2025-09-01 2:40 ` Yu Kuai
2025-09-01 6:51 ` Damien Le Moal
2025-08-28 6:57 ` [PATCH RFC v2 10/10] md/raid0: convert raid0_make_request() to use bio_submit_split_bioset() Yu Kuai
This is a public inbox; see mirroring instructions for how to clone and mirror
all data and code used for this inbox, as well as URLs for NNTP newsgroup(s).