* [PATCH v5 00/12] md: align bio to io_opt for better performance
@ 2026-01-14 17:12 Yu Kuai
2026-01-14 17:12 ` [PATCH v5 01/12] md/raid5: fix raid5_run() to return error when log_init() fails Yu Kuai
` (12 more replies)
0 siblings, 13 replies; 32+ messages in thread
From: Yu Kuai @ 2026-01-14 17:12 UTC (permalink / raw)
To: linux-raid; +Cc: yukuai, linan122, xni, dan.carpenter
This patchset optimizes MD RAID performance by aligning bios to the
optimal I/O size before splitting. When I/O is aligned to io_opt,
raid5 can perform full stripe writes without needing to read extra
data for parity calculation, significantly improving bandwidth.
Patch 1: Fix a bug in raid5_run() error handling
Patches 2-4: Cleanup - merge boolean fields into mddev_flags
Patches 5-6: Preparation - use mempool for stripe_request_ctx and
ensure max_sectors >= io_opt
Patches 7-8: Core - add bio alignment infrastructure
Patches 9-11: Enable bio alignment for raid5, raid10, and raid0
Patch 12: Fix abnormal io_opt from member disks
Performance improvement on a 32-disk raid5 with 64KiB chunks:
dd if=/dev/zero of=/dev/md0 bs=100M oflag=direct
Before: 782 MB/s
After: 1.1 GB/s
Changes in v5:
- Add patch 1 to fix raid5_run() returning success when log_init() fails
- Patch 12: Fix stale commit message (remove mention of MD_STACK_IO_OPT flag)
Changes in v4:
- Patch 12: Simplify by checking rdev_is_mddev() first, remove
MD_STACK_IO_OPT flag
Changes in v3:
- Patch 5: Remove unnecessary NULL check before mempool_destroy()
- Patch 7: Use sector_div() instead of roundup()/rounddown() to fix
64-bit division issue on 32-bit platforms
Changes in v2:
- Fix mempool in patch 5
- Add prep cleanup patches, 2-4
- Add patch 12 to fix abnormal io_opt
- Add Link tags to patches
Yu Kuai (12):
md/raid5: fix raid5_run() to return error when log_init() fails
md: merge mddev has_superblock into mddev_flags
md: merge mddev faillast_dev into mddev_flags
md: merge mddev serialize_policy into mddev_flags
md/raid5: use mempool to allocate stripe_request_ctx
md/raid5: make sure max_sectors is not less than io_opt
md: support to align bio to limits
md: add a helper md_config_align_limits()
md/raid5: align bio to io_opt
md/raid10: align bio to io_opt
md/raid0: align bio to io_opt
md: fix abnormal io_opt from member disks
drivers/md/md-bitmap.c | 4 +-
drivers/md/md.c | 118 +++++++++++++++++++++++++++++++++++------
drivers/md/md.h | 30 +++++++++--
drivers/md/raid0.c | 6 ++-
drivers/md/raid1-10.c | 5 --
drivers/md/raid1.c | 13 ++---
drivers/md/raid10.c | 10 ++--
drivers/md/raid5.c | 95 +++++++++++++++++++++++----------
drivers/md/raid5.h | 3 ++
9 files changed, 217 insertions(+), 67 deletions(-)
--
2.51.0
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v5 01/12] md/raid5: fix raid5_run() to return error when log_init() fails
2026-01-14 17:12 [PATCH v5 00/12] md: align bio to io_opt for better performance Yu Kuai
@ 2026-01-14 17:12 ` Yu Kuai
2026-01-15 1:28 ` Li Nan
` (2 more replies)
2026-01-14 17:12 ` [PATCH v5 02/12] md: merge mddev has_superblock into mddev_flags Yu Kuai
` (11 subsequent siblings)
12 siblings, 3 replies; 32+ messages in thread
From: Yu Kuai @ 2026-01-14 17:12 UTC (permalink / raw)
To: linux-raid; +Cc: yukuai, linan122, xni, dan.carpenter
Since commit f63f17350e53 ("md/raid5: use the atomic queue limit
update APIs"), the abort path in raid5_run() returns 'ret' instead of
-EIO. However, if log_init() fails, 'ret' is still 0 from the previous
successful call, causing raid5_run() to return success despite the
failure.
Fix this by capturing the return value from log_init().
Fixes: f63f17350e53 ("md/raid5: use the atomic queue limit update APIs")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202601130531.LGfcZsa4-lkp@intel.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
drivers/md/raid5.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e57ce3295292..39bec4d199a1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8055,7 +8055,8 @@ static int raid5_run(struct mddev *mddev)
goto abort;
}
- if (log_init(conf, journal_dev, raid5_has_ppl(conf)))
+ ret = log_init(conf, journal_dev, raid5_has_ppl(conf));
+ if (ret)
goto abort;
return 0;
--
2.51.0
* [PATCH v5 02/12] md: merge mddev has_superblock into mddev_flags
2026-01-14 17:12 [PATCH v5 00/12] md: align bio to io_opt for better performance Yu Kuai
2026-01-14 17:12 ` [PATCH v5 01/12] md/raid5: fix raid5_run() to return error when log_init() fails Yu Kuai
@ 2026-01-14 17:12 ` Yu Kuai
2026-01-16 15:06 ` Christoph Hellwig
2026-01-14 17:12 ` [PATCH v5 03/12] md: merge mddev faillast_dev " Yu Kuai
` (10 subsequent siblings)
12 siblings, 1 reply; 32+ messages in thread
From: Yu Kuai @ 2026-01-14 17:12 UTC (permalink / raw)
To: linux-raid; +Cc: yukuai, linan122, xni, dan.carpenter
There is no need to use a separate field in struct mddev. No functional
changes.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
---
drivers/md/md.c | 6 +++---
drivers/md/md.h | 3 ++-
2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index e5922a682953..91a30ed6b01e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6463,7 +6463,7 @@ int md_run(struct mddev *mddev)
* the only valid external interface is through the md
* device.
*/
- mddev->has_superblocks = false;
+ clear_bit(MD_HAS_SUPERBLOCK, &mddev->flags);
rdev_for_each(rdev, mddev) {
if (test_bit(Faulty, &rdev->flags))
continue;
@@ -6476,7 +6476,7 @@ int md_run(struct mddev *mddev)
}
if (rdev->sb_page)
- mddev->has_superblocks = true;
+ set_bit(MD_HAS_SUPERBLOCK, &mddev->flags);
/* perform some consistency tests on the device.
* We don't want the data to overlap the metadata,
@@ -9086,7 +9086,7 @@ void md_write_start(struct mddev *mddev, struct bio *bi)
rcu_read_unlock();
if (did_change)
sysfs_notify_dirent_safe(mddev->sysfs_state);
- if (!mddev->has_superblocks)
+ if (!test_bit(MD_HAS_SUPERBLOCK, &mddev->flags))
return;
wait_event(mddev->sb_wait,
!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 6985f2829bbd..b4c9aa600edd 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -340,6 +340,7 @@ struct md_cluster_operations;
* array is ready yet.
* @MD_BROKEN: This is used to stop writes and mark array as failed.
* @MD_DELETED: This device is being deleted
+ * @MD_HAS_SUPERBLOCK: There is a persistent sb on member disks.
*
* change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added
*/
@@ -356,6 +357,7 @@ enum mddev_flags {
MD_BROKEN,
MD_DO_DELETE,
MD_DELETED,
+ MD_HAS_SUPERBLOCK,
};
enum mddev_sb_flags {
@@ -623,7 +625,6 @@ struct mddev {
/* The sequence number for sync thread */
atomic_t sync_seq;
- bool has_superblocks:1;
bool fail_last_dev:1;
bool serialize_policy:1;
};
--
2.51.0
* [PATCH v5 03/12] md: merge mddev faillast_dev into mddev_flags
2026-01-14 17:12 [PATCH v5 00/12] md: align bio to io_opt for better performance Yu Kuai
2026-01-14 17:12 ` [PATCH v5 01/12] md/raid5: fix raid5_run() to return error when log_init() fails Yu Kuai
2026-01-14 17:12 ` [PATCH v5 02/12] md: merge mddev has_superblock into mddev_flags Yu Kuai
@ 2026-01-14 17:12 ` Yu Kuai
2026-01-14 17:12 ` [PATCH v5 04/12] md: merge mddev serialize_policy " Yu Kuai
` (9 subsequent siblings)
12 siblings, 0 replies; 32+ messages in thread
From: Yu Kuai @ 2026-01-14 17:12 UTC (permalink / raw)
To: linux-raid; +Cc: yukuai, linan122, xni, dan.carpenter
There is no need to use a separate field in struct mddev. No functional
changes.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
---
drivers/md/md.c | 10 ++++++----
drivers/md/md.h | 3 ++-
drivers/md/raid0.c | 3 ++-
drivers/md/raid1.c | 4 ++--
drivers/md/raid10.c | 4 ++--
drivers/md/raid5.c | 5 ++++-
6 files changed, 18 insertions(+), 11 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 91a30ed6b01e..be0d33fbf988 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5865,11 +5865,11 @@ __ATTR(consistency_policy, S_IRUGO | S_IWUSR, consistency_policy_show,
static ssize_t fail_last_dev_show(struct mddev *mddev, char *page)
{
- return sprintf(page, "%d\n", mddev->fail_last_dev);
+ return sprintf(page, "%d\n", test_bit(MD_FAILLAST_DEV, &mddev->flags));
}
/*
- * Setting fail_last_dev to true to allow last device to be forcibly removed
+ * Setting MD_FAILLAST_DEV to allow the last device to be forcibly removed
* from RAID1/RAID10.
*/
static ssize_t
@@ -5882,8 +5882,10 @@ fail_last_dev_store(struct mddev *mddev, const char *buf, size_t len)
if (ret)
return ret;
- if (value != mddev->fail_last_dev)
- mddev->fail_last_dev = value;
+ if (value)
+ set_bit(MD_FAILLAST_DEV, &mddev->flags);
+ else
+ clear_bit(MD_FAILLAST_DEV, &mddev->flags);
return len;
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b4c9aa600edd..297a104fba88 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -341,6 +341,7 @@ struct md_cluster_operations;
* @MD_BROKEN: This is used to stop writes and mark array as failed.
* @MD_DELETED: This device is being deleted
* @MD_HAS_SUPERBLOCK: There is a persistent sb on member disks.
+ * @MD_FAILLAST_DEV: Allow last rdev to be removed.
*
* change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added
*/
@@ -358,6 +359,7 @@ enum mddev_flags {
MD_DO_DELETE,
MD_DELETED,
MD_HAS_SUPERBLOCK,
+ MD_FAILLAST_DEV,
};
enum mddev_sb_flags {
@@ -625,7 +627,6 @@ struct mddev {
/* The sequence number for sync thread */
atomic_t sync_seq;
- bool fail_last_dev:1;
bool serialize_policy:1;
};
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 985c377356eb..4d567fcf6a7c 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -27,7 +27,8 @@ module_param(default_layout, int, 0644);
(1L << MD_JOURNAL_CLEAN) | \
(1L << MD_FAILFAST_SUPPORTED) |\
(1L << MD_HAS_PPL) | \
- (1L << MD_HAS_MULTIPLE_PPLS))
+ (1L << MD_HAS_MULTIPLE_PPLS) | \
+ (1L << MD_FAILLAST_DEV))
/*
* inform the user of the raid configuration
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 57d50465eed1..98b5c93810bb 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1746,7 +1746,7 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
* - &mddev->degraded is bumped.
*
* @rdev is marked as &Faulty excluding case when array is failed and
- * &mddev->fail_last_dev is off.
+ * MD_FAILLAST_DEV is not set.
*/
static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
{
@@ -1759,7 +1759,7 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
(conf->raid_disks - mddev->degraded) == 1) {
set_bit(MD_BROKEN, &mddev->flags);
- if (!mddev->fail_last_dev) {
+ if (!test_bit(MD_FAILLAST_DEV, &mddev->flags)) {
conf->recovery_disabled = mddev->recovery_disabled;
spin_unlock_irqrestore(&conf->device_lock, flags);
return;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 84be4cc7e873..09328e032f14 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1990,7 +1990,7 @@ static int enough(struct r10conf *conf, int ignore)
* - &mddev->degraded is bumped.
*
* @rdev is marked as &Faulty excluding case when array is failed and
- * &mddev->fail_last_dev is off.
+ * MD_FAILLAST_DEV is not set.
*/
static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
{
@@ -2002,7 +2002,7 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
set_bit(MD_BROKEN, &mddev->flags);
- if (!mddev->fail_last_dev) {
+ if (!test_bit(MD_FAILLAST_DEV, &mddev->flags)) {
spin_unlock_irqrestore(&conf->device_lock, flags);
return;
}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 39bec4d199a1..e6a399c52ea0 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -56,7 +56,10 @@
#include "md-bitmap.h"
#include "raid5-log.h"
-#define UNSUPPORTED_MDDEV_FLAGS (1L << MD_FAILFAST_SUPPORTED)
+#define UNSUPPORTED_MDDEV_FLAGS \
+ ((1L << MD_FAILFAST_SUPPORTED) | \
+ (1L << MD_FAILLAST_DEV))
+
#define cpu_to_group(cpu) cpu_to_node(cpu)
#define ANY_GROUP NUMA_NO_NODE
--
2.51.0
* [PATCH v5 04/12] md: merge mddev serialize_policy into mddev_flags
2026-01-14 17:12 [PATCH v5 00/12] md: align bio to io_opt for better performance Yu Kuai
` (2 preceding siblings ...)
2026-01-14 17:12 ` [PATCH v5 03/12] md: merge mddev faillast_dev " Yu Kuai
@ 2026-01-14 17:12 ` Yu Kuai
2026-01-14 17:12 ` [PATCH v5 05/12] md/raid5: use mempool to allocate stripe_request_ctx Yu Kuai
` (8 subsequent siblings)
12 siblings, 0 replies; 32+ messages in thread
From: Yu Kuai @ 2026-01-14 17:12 UTC (permalink / raw)
To: linux-raid; +Cc: yukuai, linan122, xni, dan.carpenter
There is no need to use a separate field in struct mddev. No functional
changes.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
---
drivers/md/md-bitmap.c | 4 ++--
drivers/md/md.c | 20 ++++++++++++--------
drivers/md/md.h | 4 ++--
drivers/md/raid0.c | 3 ++-
drivers/md/raid1.c | 4 ++--
drivers/md/raid5.c | 3 ++-
6 files changed, 22 insertions(+), 16 deletions(-)
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 84b7e2af6dba..dbe4c4b9a1da 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -2085,7 +2085,7 @@ static void bitmap_destroy(struct mddev *mddev)
return;
bitmap_wait_behind_writes(mddev);
- if (!mddev->serialize_policy)
+ if (!test_bit(MD_SERIALIZE_POLICY, &mddev->flags))
mddev_destroy_serial_pool(mddev, NULL);
mutex_lock(&mddev->bitmap_info.mutex);
@@ -2809,7 +2809,7 @@ backlog_store(struct mddev *mddev, const char *buf, size_t len)
mddev->bitmap_info.max_write_behind = backlog;
if (!backlog && mddev->serial_info_pool) {
/* serial_info_pool is not needed if backlog is zero */
- if (!mddev->serialize_policy)
+ if (!test_bit(MD_SERIALIZE_POLICY, &mddev->flags))
mddev_destroy_serial_pool(mddev, NULL);
} else if (backlog && !mddev->serial_info_pool) {
/* serial_info_pool is needed since backlog is not zero */
diff --git a/drivers/md/md.c b/drivers/md/md.c
index be0d33fbf988..21b0bc3088d2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -279,7 +279,8 @@ void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev)
rdev_for_each(temp, mddev) {
if (!rdev) {
- if (!mddev->serialize_policy ||
+ if (!test_bit(MD_SERIALIZE_POLICY,
+ &mddev->flags) ||
!rdev_need_serial(temp))
rdev_uninit_serial(temp);
else
@@ -5898,11 +5899,12 @@ static ssize_t serialize_policy_show(struct mddev *mddev, char *page)
if (mddev->pers == NULL || (mddev->pers->head.id != ID_RAID1))
return sprintf(page, "n/a\n");
else
- return sprintf(page, "%d\n", mddev->serialize_policy);
+ return sprintf(page, "%d\n",
+ test_bit(MD_SERIALIZE_POLICY, &mddev->flags));
}
/*
- * Setting serialize_policy to true to enforce write IO is not reordered
+ * Setting MD_SERIALIZE_POLICY enforces that write IO is not reordered
* for raid1.
*/
static ssize_t
@@ -5915,7 +5917,7 @@ serialize_policy_store(struct mddev *mddev, const char *buf, size_t len)
if (err)
return err;
- if (value == mddev->serialize_policy)
+ if (value == test_bit(MD_SERIALIZE_POLICY, &mddev->flags))
return len;
err = mddev_suspend_and_lock(mddev);
@@ -5927,11 +5929,13 @@ serialize_policy_store(struct mddev *mddev, const char *buf, size_t len)
goto unlock;
}
- if (value)
+ if (value) {
mddev_create_serial_pool(mddev, NULL);
- else
+ set_bit(MD_SERIALIZE_POLICY, &mddev->flags);
+ } else {
mddev_destroy_serial_pool(mddev, NULL);
- mddev->serialize_policy = value;
+ clear_bit(MD_SERIALIZE_POLICY, &mddev->flags);
+ }
unlock:
mddev_unlock_and_resume(mddev);
return err ?: len;
@@ -6828,7 +6832,7 @@ static void __md_stop_writes(struct mddev *mddev)
md_update_sb(mddev, 1);
}
/* disable policy to guarantee rdevs free resources for serialization */
- mddev->serialize_policy = 0;
+ clear_bit(MD_SERIALIZE_POLICY, &mddev->flags);
mddev_destroy_serial_pool(mddev, NULL);
}
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 297a104fba88..6ee18045f41c 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -342,6 +342,7 @@ struct md_cluster_operations;
* @MD_DELETED: This device is being deleted
* @MD_HAS_SUPERBLOCK: There is a persistent sb on member disks.
* @MD_FAILLAST_DEV: Allow last rdev to be removed.
+ * @MD_SERIALIZE_POLICY: Enforce that write IO is not reordered; only used by raid1.
*
* change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added
*/
@@ -360,6 +361,7 @@ enum mddev_flags {
MD_DELETED,
MD_HAS_SUPERBLOCK,
MD_FAILLAST_DEV,
+ MD_SERIALIZE_POLICY,
};
enum mddev_sb_flags {
@@ -626,8 +628,6 @@ struct mddev {
/* The sequence number for sync thread */
atomic_t sync_seq;
-
- bool serialize_policy:1;
};
enum recovery_flags {
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 4d567fcf6a7c..d83b2b1c0049 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -28,7 +28,8 @@ module_param(default_layout, int, 0644);
(1L << MD_FAILFAST_SUPPORTED) |\
(1L << MD_HAS_PPL) | \
(1L << MD_HAS_MULTIPLE_PPLS) | \
- (1L << MD_FAILLAST_DEV))
+ (1L << MD_FAILLAST_DEV) | \
+ (1L << MD_SERIALIZE_POLICY))
/*
* inform the user of the raid configuration
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 98b5c93810bb..f4c7004888af 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -542,7 +542,7 @@ static void raid1_end_write_request(struct bio *bio)
call_bio_endio(r1_bio);
}
}
- } else if (rdev->mddev->serialize_policy)
+ } else if (test_bit(MD_SERIALIZE_POLICY, &rdev->mddev->flags))
remove_serial(rdev, lo, hi);
if (r1_bio->bios[mirror] == NULL)
rdev_dec_pending(rdev, conf->mddev);
@@ -1644,7 +1644,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
mbio = bio_alloc_clone(rdev->bdev, bio, GFP_NOIO,
&mddev->bio_set);
- if (mddev->serialize_policy)
+ if (test_bit(MD_SERIALIZE_POLICY, &mddev->flags))
wait_for_serialization(rdev, r1_bio);
}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e6a399c52ea0..37325a053fb4 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -58,7 +58,8 @@
#define UNSUPPORTED_MDDEV_FLAGS \
((1L << MD_FAILFAST_SUPPORTED) | \
- (1L << MD_FAILLAST_DEV))
+ (1L << MD_FAILLAST_DEV) | \
+ (1L << MD_SERIALIZE_POLICY))
#define cpu_to_group(cpu) cpu_to_node(cpu)
--
2.51.0
* [PATCH v5 05/12] md/raid5: use mempool to allocate stripe_request_ctx
2026-01-14 17:12 [PATCH v5 00/12] md: align bio to io_opt for better performance Yu Kuai
` (3 preceding siblings ...)
2026-01-14 17:12 ` [PATCH v5 04/12] md: merge mddev serialize_policy " Yu Kuai
@ 2026-01-14 17:12 ` Yu Kuai
2026-01-14 17:12 ` [PATCH v5 06/12] md/raid5: make sure max_sectors is not less than io_opt Yu Kuai
` (7 subsequent siblings)
12 siblings, 0 replies; 32+ messages in thread
From: Yu Kuai @ 2026-01-14 17:12 UTC (permalink / raw)
To: linux-raid; +Cc: yukuai, linan122, xni, dan.carpenter
On the one hand, stripe_request_ctx is 72 bytes, which is rather large
for a stack variable.
On the other hand, the sectors_to_do bitmap has a fixed size, so
max_hw_sectors of a raid5 array is capped at 256 * 4k = 1MiB, making
full stripe IO impossible for arrays where chunk_size * data_disks is
larger. Allocating the ctx at runtime makes it possible to remove this
limit.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
---
drivers/md/md.h | 4 +++
drivers/md/raid1-10.c | 5 ----
drivers/md/raid5.c | 61 +++++++++++++++++++++++++++----------------
drivers/md/raid5.h | 2 ++
4 files changed, 45 insertions(+), 27 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 6ee18045f41c..b8c5dec12b62 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -22,6 +22,10 @@
#include <trace/events/block.h>
#define MaxSector (~(sector_t)0)
+/*
+ * Number of guaranteed raid bios in case of extreme VM load:
+ */
+#define NR_RAID_BIOS 256
enum md_submodule_type {
MD_PERSONALITY = 0,
diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c
index 521625756128..c33099925f23 100644
--- a/drivers/md/raid1-10.c
+++ b/drivers/md/raid1-10.c
@@ -3,11 +3,6 @@
#define RESYNC_BLOCK_SIZE (64*1024)
#define RESYNC_PAGES ((RESYNC_BLOCK_SIZE + PAGE_SIZE-1) / PAGE_SIZE)
-/*
- * Number of guaranteed raid bios in case of extreme VM load:
- */
-#define NR_RAID_BIOS 256
-
/* when we get a read error on a read-only array, we redirect to another
* device without failing the first device, or trying to over-write to
* correct the read error. To keep track of bad blocks on a per-bio
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 37325a053fb4..b250c6c9e72b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6084,13 +6084,13 @@ static sector_t raid5_bio_lowest_chunk_sector(struct r5conf *conf,
static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
{
DEFINE_WAIT_FUNC(wait, woken_wake_function);
- bool on_wq;
struct r5conf *conf = mddev->private;
- sector_t logical_sector;
- struct stripe_request_ctx ctx = {};
const int rw = bio_data_dir(bi);
+ struct stripe_request_ctx *ctx;
+ sector_t logical_sector;
enum stripe_result res;
int s, stripe_cnt;
+ bool on_wq;
if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
int ret = log_handle_flush_request(conf, bi);
@@ -6102,11 +6102,6 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
return true;
}
/* ret == -EAGAIN, fallback */
- /*
- * if r5l_handle_flush_request() didn't clear REQ_PREFLUSH,
- * we need to flush journal device
- */
- ctx.do_flush = bi->bi_opf & REQ_PREFLUSH;
}
md_write_start(mddev, bi);
@@ -6129,16 +6124,25 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
}
logical_sector = bi->bi_iter.bi_sector & ~((sector_t)RAID5_STRIPE_SECTORS(conf)-1);
- ctx.first_sector = logical_sector;
- ctx.last_sector = bio_end_sector(bi);
bi->bi_next = NULL;
- stripe_cnt = DIV_ROUND_UP_SECTOR_T(ctx.last_sector - logical_sector,
+ ctx = mempool_alloc(conf->ctx_pool, GFP_NOIO);
+ memset(ctx, 0, sizeof(*ctx));
+ ctx->first_sector = logical_sector;
+ ctx->last_sector = bio_end_sector(bi);
+ /*
+ * if r5l_handle_flush_request() didn't clear REQ_PREFLUSH,
+ * we need to flush journal device
+ */
+ if (unlikely(bi->bi_opf & REQ_PREFLUSH))
+ ctx->do_flush = true;
+
+ stripe_cnt = DIV_ROUND_UP_SECTOR_T(ctx->last_sector - logical_sector,
RAID5_STRIPE_SECTORS(conf));
- bitmap_set(ctx.sectors_to_do, 0, stripe_cnt);
+ bitmap_set(ctx->sectors_to_do, 0, stripe_cnt);
pr_debug("raid456: %s, logical %llu to %llu\n", __func__,
- bi->bi_iter.bi_sector, ctx.last_sector);
+ bi->bi_iter.bi_sector, ctx->last_sector);
/* Bail out if conflicts with reshape and REQ_NOWAIT is set */
if ((bi->bi_opf & REQ_NOWAIT) &&
@@ -6146,6 +6150,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
bio_wouldblock_error(bi);
if (rw == WRITE)
md_write_end(mddev);
+ mempool_free(ctx, conf->ctx_pool);
return true;
}
md_account_bio(mddev, &bi);
@@ -6164,10 +6169,10 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
add_wait_queue(&conf->wait_for_reshape, &wait);
on_wq = true;
}
- s = (logical_sector - ctx.first_sector) >> RAID5_STRIPE_SHIFT(conf);
+ s = (logical_sector - ctx->first_sector) >> RAID5_STRIPE_SHIFT(conf);
while (1) {
- res = make_stripe_request(mddev, conf, &ctx, logical_sector,
+ res = make_stripe_request(mddev, conf, ctx, logical_sector,
bi);
if (res == STRIPE_FAIL || res == STRIPE_WAIT_RESHAPE)
break;
@@ -6184,9 +6189,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
* raid5_activate_delayed() from making progress
* and thus deadlocking.
*/
- if (ctx.batch_last) {
- raid5_release_stripe(ctx.batch_last);
- ctx.batch_last = NULL;
+ if (ctx->batch_last) {
+ raid5_release_stripe(ctx->batch_last);
+ ctx->batch_last = NULL;
}
wait_woken(&wait, TASK_UNINTERRUPTIBLE,
@@ -6194,21 +6199,23 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
continue;
}
- s = find_next_bit_wrap(ctx.sectors_to_do, stripe_cnt, s);
+ s = find_next_bit_wrap(ctx->sectors_to_do, stripe_cnt, s);
if (s == stripe_cnt)
break;
- logical_sector = ctx.first_sector +
+ logical_sector = ctx->first_sector +
(s << RAID5_STRIPE_SHIFT(conf));
}
if (unlikely(on_wq))
remove_wait_queue(&conf->wait_for_reshape, &wait);
- if (ctx.batch_last)
- raid5_release_stripe(ctx.batch_last);
+ if (ctx->batch_last)
+ raid5_release_stripe(ctx->batch_last);
if (rw == WRITE)
md_write_end(mddev);
+
+ mempool_free(ctx, conf->ctx_pool);
if (res == STRIPE_WAIT_RESHAPE) {
md_free_cloned_bio(bi);
return false;
@@ -7376,6 +7383,9 @@ static void free_conf(struct r5conf *conf)
bioset_exit(&conf->bio_split);
kfree(conf->stripe_hashtbl);
kfree(conf->pending_data);
+
+ mempool_destroy(conf->ctx_pool);
+
kfree(conf);
}
@@ -8059,6 +8069,13 @@ static int raid5_run(struct mddev *mddev)
goto abort;
}
+ conf->ctx_pool = mempool_create_kmalloc_pool(NR_RAID_BIOS,
+ sizeof(struct stripe_request_ctx));
+ if (!conf->ctx_pool) {
+ ret = -ENOMEM;
+ goto abort;
+ }
+
ret = log_init(conf, journal_dev, raid5_has_ppl(conf));
if (ret)
goto abort;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index eafc6e9ed6ee..6e3f07119fa4 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -690,6 +690,8 @@ struct r5conf {
struct list_head pending_list;
int pending_data_cnt;
struct r5pending_data *next_pending_data;
+
+ mempool_t *ctx_pool;
};
#if PAGE_SIZE == DEFAULT_STRIPE_SIZE
--
2.51.0
* [PATCH v5 06/12] md/raid5: make sure max_sectors is not less than io_opt
2026-01-14 17:12 [PATCH v5 00/12] md: align bio to io_opt for better performance Yu Kuai
` (4 preceding siblings ...)
2026-01-14 17:12 ` [PATCH v5 05/12] md/raid5: use mempool to allocate stripe_request_ctx Yu Kuai
@ 2026-01-14 17:12 ` Yu Kuai
2026-01-14 17:12 ` [PATCH v5 07/12] md: support to align bio to limits Yu Kuai
` (6 subsequent siblings)
12 siblings, 0 replies; 32+ messages in thread
From: Yu Kuai @ 2026-01-14 17:12 UTC (permalink / raw)
To: linux-raid; +Cc: yukuai, linan122, xni, dan.carpenter
Otherwise, even if the user issues IO of io_opt size, such IO will be
split by max_sectors before it is submitted to raid5, and consequently
full stripe IO is impossible.
Note that dm-raid5 is not covered by this change and still has this
problem.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
---
drivers/md/raid5.c | 38 ++++++++++++++++++++++++++++----------
drivers/md/raid5.h | 1 +
2 files changed, 29 insertions(+), 10 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b250c6c9e72b..8a7fed91d46b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -777,14 +777,14 @@ struct stripe_request_ctx {
/* last sector in the request */
sector_t last_sector;
+ /* the request had REQ_PREFLUSH, cleared after the first stripe_head */
+ bool do_flush;
+
/*
* bitmap to track stripe sectors that have been added to stripes
* add one to account for unaligned requests
*/
- DECLARE_BITMAP(sectors_to_do, RAID5_MAX_REQ_STRIPES + 1);
-
- /* the request had REQ_PREFLUSH, cleared after the first stripe_head */
- bool do_flush;
+ unsigned long sectors_to_do[];
};
/*
@@ -6127,7 +6127,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
bi->bi_next = NULL;
ctx = mempool_alloc(conf->ctx_pool, GFP_NOIO);
- memset(ctx, 0, sizeof(*ctx));
+ memset(ctx, 0, conf->ctx_size);
ctx->first_sector = logical_sector;
ctx->last_sector = bio_end_sector(bi);
/*
@@ -7741,6 +7741,25 @@ static int only_parity(int raid_disk, int algo, int raid_disks, int max_degraded
return 0;
}
+static int raid5_create_ctx_pool(struct r5conf *conf)
+{
+ struct stripe_request_ctx *ctx;
+ int size;
+
+ if (mddev_is_dm(conf->mddev))
+ size = BITS_TO_LONGS(RAID5_MAX_REQ_STRIPES);
+ else
+ size = BITS_TO_LONGS(
+ queue_max_hw_sectors(conf->mddev->gendisk->queue) >>
+ RAID5_STRIPE_SHIFT(conf));
+
+ conf->ctx_size = struct_size(ctx, sectors_to_do, size);
+ conf->ctx_pool = mempool_create_kmalloc_pool(NR_RAID_BIOS,
+ conf->ctx_size);
+
+ return conf->ctx_pool ? 0 : -ENOMEM;
+}
+
static int raid5_set_limits(struct mddev *mddev)
{
struct r5conf *conf = mddev->private;
@@ -7797,6 +7816,8 @@ static int raid5_set_limits(struct mddev *mddev)
* Limit the max sectors based on this.
*/
lim.max_hw_sectors = RAID5_MAX_REQ_STRIPES << RAID5_STRIPE_SHIFT(conf);
+ if ((lim.max_hw_sectors << 9) < lim.io_opt)
+ lim.max_hw_sectors = lim.io_opt >> 9;
/* No restrictions on the number of segments in the request */
lim.max_segments = USHRT_MAX;
@@ -8069,12 +8090,9 @@ static int raid5_run(struct mddev *mddev)
goto abort;
}
- conf->ctx_pool = mempool_create_kmalloc_pool(NR_RAID_BIOS,
- sizeof(struct stripe_request_ctx));
- if (!conf->ctx_pool) {
- ret = -ENOMEM;
+ ret = raid5_create_ctx_pool(conf);
+ if (ret)
goto abort;
- }
ret = log_init(conf, journal_dev, raid5_has_ppl(conf));
if (ret)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 6e3f07119fa4..ddfe65237888 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -692,6 +692,7 @@ struct r5conf {
struct r5pending_data *next_pending_data;
mempool_t *ctx_pool;
+ int ctx_size;
};
#if PAGE_SIZE == DEFAULT_STRIPE_SIZE
--
2.51.0
* [PATCH v5 07/12] md: support to align bio to limits
2026-01-14 17:12 [PATCH v5 00/12] md: align bio to io_opt for better performance Yu Kuai
` (5 preceding siblings ...)
2026-01-14 17:12 ` [PATCH v5 06/12] md/raid5: make sure max_sectors is not less than io_opt Yu Kuai
@ 2026-01-14 17:12 ` Yu Kuai
2026-01-16 15:08 ` Christoph Hellwig
2026-01-14 17:12 ` [PATCH v5 08/12] md: add a helper md_config_align_limits() Yu Kuai
` (5 subsequent siblings)
12 siblings, 1 reply; 32+ messages in thread
From: Yu Kuai @ 2026-01-14 17:12 UTC (permalink / raw)
To: linux-raid; +Cc: yukuai, linan122, xni, dan.carpenter
For personalities that report an optimal IO size, users can get the best
IO bandwidth if they issue IO of this size. However, there is also an
implicit condition: the IO must be aligned to the optimal IO size as
well.
Currently, a bio is only split by limits; if the bio offset is not
aligned to the limits, none of the split bios will be aligned. This
patch adds a new feature to align the bio to limits first, and following
patches will enable this for each personality where necessary.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
---
drivers/md/md.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++
drivers/md/md.h | 2 ++
2 files changed, 56 insertions(+)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 21b0bc3088d2..731ec800f5cb 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -428,6 +428,56 @@ bool md_handle_request(struct mddev *mddev, struct bio *bio)
}
EXPORT_SYMBOL(md_handle_request);
+static struct bio *__md_bio_align_to_limits(struct mddev *mddev,
+ struct bio *bio)
+{
+ unsigned int max_sectors = mddev->gendisk->queue->limits.max_sectors;
+ sector_t start = bio->bi_iter.bi_sector;
+ sector_t end = start + bio_sectors(bio);
+ sector_t align_start;
+ sector_t align_end;
+ u32 rem;
+
+ /* calculate align_start = roundup(start, max_sectors) */
+ align_start = start;
+ rem = sector_div(align_start, max_sectors);
+ /* already aligned */
+ if (!rem)
+ return bio;
+
+ align_start = start + max_sectors - rem;
+
+ /* calculate align_end = rounddown(end, max_sectors) */
+ align_end = end;
+ rem = sector_div(align_end, max_sectors);
+ align_end = end - rem;
+
+ /* bio is too small to split */
+ if (align_end <= align_start)
+ return bio;
+
+ return bio_submit_split_bioset(bio, align_start - start,
+ &mddev->gendisk->bio_split);
+}
+
+static struct bio *md_bio_align_to_limits(struct mddev *mddev, struct bio *bio)
+{
+ if (!test_bit(MD_BIO_ALIGN, &mddev->flags))
+ return bio;
+
+ /* atomic writes can't be split */
+ if (bio->bi_opf & REQ_ATOMIC)
+ return bio;
+
+ switch (bio_op(bio)) {
+ case REQ_OP_READ:
+ case REQ_OP_WRITE:
+ return __md_bio_align_to_limits(mddev, bio);
+ default:
+ return bio;
+ }
+}
+
static void md_submit_bio(struct bio *bio)
{
const int rw = bio_data_dir(bio);
@@ -443,6 +493,10 @@ static void md_submit_bio(struct bio *bio)
return;
}
+ bio = md_bio_align_to_limits(mddev, bio);
+ if (!bio)
+ return;
+
bio = bio_split_to_limits(bio);
if (!bio)
return;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b8c5dec12b62..e7aba83b708b 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -347,6 +347,7 @@ struct md_cluster_operations;
* @MD_HAS_SUPERBLOCK: There is persistence sb in member disks.
* @MD_FAILLAST_DEV: Allow last rdev to be removed.
* @MD_SERIALIZE_POLICY: Enforce write IO is not reordered, just used by raid1.
+ * @MD_BIO_ALIGN: Bio issued to the array will align to io_opt before split.
*
* change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added
*/
@@ -366,6 +367,7 @@ enum mddev_flags {
MD_HAS_SUPERBLOCK,
MD_FAILLAST_DEV,
MD_SERIALIZE_POLICY,
+ MD_BIO_ALIGN,
};
enum mddev_sb_flags {
--
2.51.0
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v5 08/12] md: add a helper md_config_align_limits()
2026-01-14 17:12 [PATCH v5 00/12] md: align bio to io_opt for better performance Yu Kuai
` (6 preceding siblings ...)
2026-01-14 17:12 ` [PATCH v5 07/12] md: support to align bio to limits Yu Kuai
@ 2026-01-14 17:12 ` Yu Kuai
2026-01-14 17:12 ` [PATCH v5 09/12] md/raid5: align bio to io_opt Yu Kuai
` (4 subsequent siblings)
12 siblings, 0 replies; 32+ messages in thread
From: Yu Kuai @ 2026-01-14 17:12 UTC (permalink / raw)
To: linux-raid; +Cc: yukuai, linan122, xni, dan.carpenter
This helper will be used by personalities that want to align bios to
io_opt to get the best IO bandwidth.
Also add the new flag to UNSUPPORTED_MDDEV_FLAGS for now; following
patches will enable it for each personality.
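The rounding the helper performs can be sketched as follows (an illustrative userspace function on bare integers; the real helper operates on struct queue_limits and also sets the MD_BIO_ALIGN flag):

```c
#include <assert.h>
#include <stdint.h>

/* Make max_hw_sectors a whole multiple of io_opt: raise it to io_opt
 * if it is smaller, otherwise round it down to an io_opt multiple, so
 * every maximum-size request covers whole optimal-IO units. */
uint32_t align_max_hw_sectors(uint32_t max_hw_sectors, uint32_t io_opt_bytes)
{
	uint32_t io_opt_sectors = io_opt_bytes >> 9;

	if (max_hw_sectors < io_opt_sectors)
		return io_opt_sectors;
	return max_hw_sectors - max_hw_sectors % io_opt_sectors;
}
```

For the 32-disk raid5 example later in the series (io_opt = 31 * 64 KiB, i.e. 3968 sectors), a 1 MiB (2048-sector) max_hw_sectors is raised to 3968.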
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
---
drivers/md/md.h | 11 +++++++++++
drivers/md/raid0.c | 3 ++-
drivers/md/raid1.c | 3 ++-
drivers/md/raid5.c | 3 ++-
4 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/drivers/md/md.h b/drivers/md/md.h
index e7aba83b708b..ddf989f2a139 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -1091,6 +1091,17 @@ static inline bool rdev_blocked(struct md_rdev *rdev)
return false;
}
+static inline void md_config_align_limits(struct mddev *mddev,
+ struct queue_limits *lim)
+{
+ if ((lim->max_hw_sectors << 9) < lim->io_opt)
+ lim->max_hw_sectors = lim->io_opt >> 9;
+ else
+ lim->max_hw_sectors = rounddown(lim->max_hw_sectors,
+ lim->io_opt >> 9);
+ set_bit(MD_BIO_ALIGN, &mddev->flags);
+}
+
#define mddev_add_trace_msg(mddev, fmt, args...) \
do { \
if (!mddev_is_dm(mddev)) \
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index d83b2b1c0049..f3814a69cd13 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -29,7 +29,8 @@ module_param(default_layout, int, 0644);
(1L << MD_HAS_PPL) | \
(1L << MD_HAS_MULTIPLE_PPLS) | \
(1L << MD_FAILLAST_DEV) | \
- (1L << MD_SERIALIZE_POLICY))
+ (1L << MD_SERIALIZE_POLICY) | \
+ (1L << MD_BIO_ALIGN))
/*
* inform the user of the raid configuration
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index f4c7004888af..1a957dba2640 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -42,7 +42,8 @@
((1L << MD_HAS_JOURNAL) | \
(1L << MD_JOURNAL_CLEAN) | \
(1L << MD_HAS_PPL) | \
- (1L << MD_HAS_MULTIPLE_PPLS))
+ (1L << MD_HAS_MULTIPLE_PPLS) | \
+ (1L << MD_BIO_ALIGN))
static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8a7fed91d46b..d4a44fe0b5a5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -59,7 +59,8 @@
#define UNSUPPORTED_MDDEV_FLAGS \
((1L << MD_FAILFAST_SUPPORTED) | \
(1L << MD_FAILLAST_DEV) | \
- (1L << MD_SERIALIZE_POLICY))
+ (1L << MD_SERIALIZE_POLICY) | \
+ (1L << MD_BIO_ALIGN))
#define cpu_to_group(cpu) cpu_to_node(cpu)
--
2.51.0
* [PATCH v5 09/12] md/raid5: align bio to io_opt
2026-01-14 17:12 [PATCH v5 00/12] md: align bio to io_opt for better performance Yu Kuai
` (7 preceding siblings ...)
2026-01-14 17:12 ` [PATCH v5 08/12] md: add a helper md_config_align_limits() Yu Kuai
@ 2026-01-14 17:12 ` Yu Kuai
2026-01-14 17:12 ` [PATCH v5 10/12] md/raid10: " Yu Kuai
` (3 subsequent siblings)
12 siblings, 0 replies; 32+ messages in thread
From: Yu Kuai @ 2026-01-14 17:12 UTC (permalink / raw)
To: linux-raid; +Cc: yukuai, linan122, xni, dan.carpenter
The raid5 internal implementation is such that if a write bio is aligned
to io_opt, a full stripe write will be used, which is best for bandwidth
because there is no need to read extra data to build the new xor data.
Simple test in my VM, 32-disk raid5 with 64kb chunksize:
dd if=/dev/zero of=/dev/md0 bs=100M oflag=direct
Before this patch: 782 MB/s
With this patch: 1.1 GB/s
BTW, there are still other bottlenecks related to the stripe handler that
require further optimization.
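For reference, the full-stripe size in this test comes from simple arithmetic (a hypothetical helper just for illustration; in the kernel, io_opt is derived from the chunk size and the number of data disks):

```c
#include <assert.h>
#include <stdint.h>

/* A raid5 stripe has one parity chunk, so nr_disks - 1 chunks of data;
 * writes aligned to this size need no read-modify-write for parity. */
uint32_t raid5_full_stripe_bytes(uint32_t nr_disks, uint32_t chunk_bytes)
{
	return (nr_disks - 1) * chunk_bytes;
}
```

32 disks with a 64 KiB chunk gives 31 * 64 KiB = 1984 KiB per full stripe write.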
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
---
drivers/md/raid5.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d4a44fe0b5a5..bbcf4b1127e7 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -59,8 +59,7 @@
#define UNSUPPORTED_MDDEV_FLAGS \
((1L << MD_FAILFAST_SUPPORTED) | \
(1L << MD_FAILLAST_DEV) | \
- (1L << MD_SERIALIZE_POLICY) | \
- (1L << MD_BIO_ALIGN))
+ (1L << MD_SERIALIZE_POLICY))
#define cpu_to_group(cpu) cpu_to_node(cpu)
@@ -7817,8 +7816,7 @@ static int raid5_set_limits(struct mddev *mddev)
* Limit the max sectors based on this.
*/
lim.max_hw_sectors = RAID5_MAX_REQ_STRIPES << RAID5_STRIPE_SHIFT(conf);
- if ((lim.max_hw_sectors << 9) < lim.io_opt)
- lim.max_hw_sectors = lim.io_opt >> 9;
+ md_config_align_limits(mddev, &lim);
/* No restrictions on the number of segments in the request */
lim.max_segments = USHRT_MAX;
--
2.51.0
* [PATCH v5 10/12] md/raid10: align bio to io_opt
2026-01-14 17:12 [PATCH v5 00/12] md: align bio to io_opt for better performance Yu Kuai
` (8 preceding siblings ...)
2026-01-14 17:12 ` [PATCH v5 09/12] md/raid5: align bio to io_opt Yu Kuai
@ 2026-01-14 17:12 ` Yu Kuai
2026-01-14 17:12 ` [PATCH v5 11/12] md/raid0: " Yu Kuai
` (2 subsequent siblings)
12 siblings, 0 replies; 32+ messages in thread
From: Yu Kuai @ 2026-01-14 17:12 UTC (permalink / raw)
To: linux-raid; +Cc: yukuai, linan122, xni, dan.carpenter
The impact is not as significant for raid10 as for raid5; however, it's
still more appropriate to issue IOs evenly to the underlying disks.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
---
drivers/md/raid10.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 09328e032f14..2c6b65b83724 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -4008,6 +4008,8 @@ static int raid10_set_queue_limits(struct mddev *mddev)
err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
if (err)
return err;
+
+ md_config_align_limits(mddev, &lim);
return queue_limits_set(mddev->gendisk->queue, &lim);
}
--
2.51.0
* [PATCH v5 11/12] md/raid0: align bio to io_opt
2026-01-14 17:12 [PATCH v5 00/12] md: align bio to io_opt for better performance Yu Kuai
` (9 preceding siblings ...)
2026-01-14 17:12 ` [PATCH v5 10/12] md/raid10: " Yu Kuai
@ 2026-01-14 17:12 ` Yu Kuai
2026-01-14 17:12 ` [PATCH v5 12/12] md: fix abnormal io_opt from member disks Yu Kuai
2026-01-15 23:38 ` [PATCH v5 00/12] md: align bio to io_opt for better performance John Stoffel
12 siblings, 0 replies; 32+ messages in thread
From: Yu Kuai @ 2026-01-14 17:12 UTC (permalink / raw)
To: linux-raid; +Cc: yukuai, linan122, xni, dan.carpenter
The impact is not as significant for raid0 as for raid5; however, it's
still more appropriate to issue IOs evenly to the underlying disks.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
---
drivers/md/raid0.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index f3814a69cd13..0ae44e3bfff2 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -29,8 +29,7 @@ module_param(default_layout, int, 0644);
(1L << MD_HAS_PPL) | \
(1L << MD_HAS_MULTIPLE_PPLS) | \
(1L << MD_FAILLAST_DEV) | \
- (1L << MD_SERIALIZE_POLICY) | \
- (1L << MD_BIO_ALIGN))
+ (1L << MD_SERIALIZE_POLICY))
/*
* inform the user of the raid configuration
@@ -398,6 +397,8 @@ static int raid0_set_limits(struct mddev *mddev)
err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
if (err)
return err;
+
+ md_config_align_limits(mddev, &lim);
return queue_limits_set(mddev->gendisk->queue, &lim);
}
--
2.51.0
* [PATCH v5 12/12] md: fix abnormal io_opt from member disks
2026-01-14 17:12 [PATCH v5 00/12] md: align bio to io_opt for better performance Yu Kuai
` (10 preceding siblings ...)
2026-01-14 17:12 ` [PATCH v5 11/12] md/raid0: " Yu Kuai
@ 2026-01-14 17:12 ` Yu Kuai
2026-01-16 15:08 ` Christoph Hellwig
2026-01-15 23:38 ` [PATCH v5 00/12] md: align bio to io_opt for better performance John Stoffel
12 siblings, 1 reply; 32+ messages in thread
From: Yu Kuai @ 2026-01-14 17:12 UTC (permalink / raw)
To: linux-raid; +Cc: yukuai, linan122, xni, dan.carpenter
It's reported that mpt3sas can report an abnormal io_opt; as a consequence,
the md array will end up with an abnormal io_opt as well, due to the
lcm_not_zero() in blk_stack_limits().
Some personalities configure an optimal IO size, which indicates that
users can get the best IO bandwidth if they issue IO with this size, and
we don't want that io_opt to be overridden by member disks with an
abnormal io_opt.
Fix this problem by checking if the member disk is an mdraid array. If
not, keep the io_opt configured by the personality and ignore the io_opt
from the member disk.
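A sketch of how one member's odd io_opt inflates the stacked value: blk_stack_limits() combines io_opt with lcm_not_zero(), reimplemented here in userspace with illustrative numbers (2031616 bytes is a 31 * 64k full stripe; 16776704 bytes models a member advertising a 32767-sector geometry):

```c
#include <assert.h>
#include <stdint.h>

uint64_t gcd64(uint64_t a, uint64_t b)
{
	while (b) {
		uint64_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/* Mirrors the kernel's lcm_not_zero(): zero means "no preference". */
uint64_t lcm_not_zero64(uint64_t a, uint64_t b)
{
	if (!a)
		return b;
	if (!b)
		return a;
	return a / gcd64(a, b) * b;
}
```

lcm_not_zero64(2031616, 16776704) is 2147418112 -- roughly 2 GiB, nothing like the personality's intended full-stripe io_opt, which is why the patch keeps the personality's value when the member is not itself an md array.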
Reported-by: Filippo Giunchedi <filippo@debian.org>
Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1121006
Reported-by: Coly Li <colyli@fnnas.com>
Closes: https://lore.kernel.org/all/20250817152645.7115-1-colyli@kernel.org/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
---
drivers/md/md.c | 28 +++++++++++++++++++++++++++-
drivers/md/md.h | 3 ++-
drivers/md/raid1.c | 2 +-
drivers/md/raid10.c | 4 ++--
4 files changed, 32 insertions(+), 5 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 731ec800f5cb..6c0fb09c26dc 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6200,18 +6200,33 @@ static const struct kobj_type md_ktype = {
int mdp_major = 0;
+static bool rdev_is_mddev(struct md_rdev *rdev)
+{
+ return rdev->bdev->bd_disk->fops == &md_fops;
+}
+
/* stack the limit for all rdevs into lim */
int mddev_stack_rdev_limits(struct mddev *mddev, struct queue_limits *lim,
unsigned int flags)
{
struct md_rdev *rdev;
+ bool io_opt_configured = lim->io_opt;
rdev_for_each(rdev, mddev) {
+ unsigned int io_opt = lim->io_opt;
+
queue_limits_stack_bdev(lim, rdev->bdev, rdev->data_offset,
mddev->gendisk->disk_name);
if ((flags & MDDEV_STACK_INTEGRITY) &&
!queue_limits_stack_integrity_bdev(lim, rdev->bdev))
return -EINVAL;
+
+ /*
+ * If member disk is not mdraid array, keep the io_opt
+ * from personality and ignore io_opt from member disk.
+ */
+ if (!rdev_is_mddev(rdev) && io_opt_configured)
+ lim->io_opt = io_opt;
}
/*
@@ -6230,9 +6245,11 @@ int mddev_stack_rdev_limits(struct mddev *mddev, struct queue_limits *lim,
EXPORT_SYMBOL_GPL(mddev_stack_rdev_limits);
/* apply the extra stacking limits from a new rdev into mddev */
-int mddev_stack_new_rdev(struct mddev *mddev, struct md_rdev *rdev)
+int mddev_stack_new_rdev(struct mddev *mddev, struct md_rdev *rdev,
+ bool io_opt_configured)
{
struct queue_limits lim;
+ unsigned int io_opt;
if (mddev_is_dm(mddev))
return 0;
@@ -6245,6 +6262,8 @@ int mddev_stack_new_rdev(struct mddev *mddev, struct md_rdev *rdev)
}
lim = queue_limits_start_update(mddev->gendisk->queue);
+ io_opt = lim.io_opt;
+
queue_limits_stack_bdev(&lim, rdev->bdev, rdev->data_offset,
mddev->gendisk->disk_name);
@@ -6255,6 +6274,13 @@ int mddev_stack_new_rdev(struct mddev *mddev, struct md_rdev *rdev)
return -ENXIO;
}
+ /*
+ * If member disk is not mdraid array, keep the io_opt from
+ * personality and ignore io_opt from member disk.
+ */
+ if (!rdev_is_mddev(rdev) && io_opt_configured)
+ lim.io_opt = io_opt;
+
return queue_limits_commit_update(mddev->gendisk->queue, &lim);
}
EXPORT_SYMBOL_GPL(mddev_stack_new_rdev);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index ddf989f2a139..80c527b3777d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -1041,7 +1041,8 @@ int do_md_run(struct mddev *mddev);
#define MDDEV_STACK_INTEGRITY (1u << 0)
int mddev_stack_rdev_limits(struct mddev *mddev, struct queue_limits *lim,
unsigned int flags);
-int mddev_stack_new_rdev(struct mddev *mddev, struct md_rdev *rdev);
+int mddev_stack_new_rdev(struct mddev *mddev, struct md_rdev *rdev,
+ bool io_opt_configured);
void mddev_update_io_opt(struct mddev *mddev, unsigned int nr_stripes);
extern const struct block_device_operations md_fops;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 1a957dba2640..f3f3086f27fa 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1944,7 +1944,7 @@ static int raid1_add_disk(struct mddev *mddev, struct md_rdev *rdev)
for (mirror = first; mirror <= last; mirror++) {
p = conf->mirrors + mirror;
if (!p->rdev) {
- err = mddev_stack_new_rdev(mddev, rdev);
+ err = mddev_stack_new_rdev(mddev, rdev, false);
if (err)
return err;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 2c6b65b83724..a6edc91e7a9a 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2139,7 +2139,7 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
continue;
}
- err = mddev_stack_new_rdev(mddev, rdev);
+ err = mddev_stack_new_rdev(mddev, rdev, true);
if (err)
return err;
p->head_position = 0;
@@ -2157,7 +2157,7 @@ static int raid10_add_disk(struct mddev *mddev, struct md_rdev *rdev)
clear_bit(In_sync, &rdev->flags);
set_bit(Replacement, &rdev->flags);
rdev->raid_disk = repl_slot;
- err = mddev_stack_new_rdev(mddev, rdev);
+ err = mddev_stack_new_rdev(mddev, rdev, true);
if (err)
return err;
conf->fullsync = 1;
--
2.51.0
* Re: [PATCH v5 01/12] md/raid5: fix raid5_run() to return error when log_init() fails
2026-01-14 17:12 ` [PATCH v5 01/12] md/raid5: fix raid5_run() to return error when log_init() fails Yu Kuai
@ 2026-01-15 1:28 ` Li Nan
2026-01-15 2:29 ` Xiao Ni
2026-01-16 15:05 ` Christoph Hellwig
2 siblings, 0 replies; 32+ messages in thread
From: Li Nan @ 2026-01-15 1:28 UTC (permalink / raw)
To: Yu Kuai, linux-raid; +Cc: xni, dan.carpenter
On 2026/1/15 1:12, Yu Kuai wrote:
> Since commit f63f17350e53 ("md/raid5: use the atomic queue limit
> update APIs"), the abort path in raid5_run() returns 'ret' instead of
> -EIO. However, if log_init() fails, 'ret' is still 0 from the previous
> successful call, causing raid5_run() to return success despite the
> failure.
>
> Fix this by capturing the return value from log_init().
>
> Fixes: f63f17350e53 ("md/raid5: use the atomic queue limit update APIs")
> Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
> Closes: https://lore.kernel.org/r/202601130531.LGfcZsa4-lkp@intel.com/
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> drivers/md/raid5.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index e57ce3295292..39bec4d199a1 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -8055,7 +8055,8 @@ static int raid5_run(struct mddev *mddev)
> goto abort;
> }
>
> - if (log_init(conf, journal_dev, raid5_has_ppl(conf)))
> + ret = log_init(conf, journal_dev, raid5_has_ppl(conf));
> + if (ret)
> goto abort;
>
> return 0;
Reviewed-by: Li Nan <linan122@huawei.com>
--
Thanks,
Nan
* Re: [PATCH v5 01/12] md/raid5: fix raid5_run() to return error when log_init() fails
2026-01-14 17:12 ` [PATCH v5 01/12] md/raid5: fix raid5_run() to return error when log_init() fails Yu Kuai
2026-01-15 1:28 ` Li Nan
@ 2026-01-15 2:29 ` Xiao Ni
2026-01-16 15:05 ` Christoph Hellwig
2 siblings, 0 replies; 32+ messages in thread
From: Xiao Ni @ 2026-01-15 2:29 UTC (permalink / raw)
To: Yu Kuai; +Cc: linux-raid, linan122, dan.carpenter
On Thu, Jan 15, 2026 at 1:17 AM Yu Kuai <yukuai@fnnas.com> wrote:
>
> Since commit f63f17350e53 ("md/raid5: use the atomic queue limit
> update APIs"), the abort path in raid5_run() returns 'ret' instead of
> -EIO. However, if log_init() fails, 'ret' is still 0 from the previous
> successful call, causing raid5_run() to return success despite the
> failure.
>
> Fix this by capturing the return value from log_init().
>
> Fixes: f63f17350e53 ("md/raid5: use the atomic queue limit update APIs")
> Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
> Closes: https://lore.kernel.org/r/202601130531.LGfcZsa4-lkp@intel.com/
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
> drivers/md/raid5.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index e57ce3295292..39bec4d199a1 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -8055,7 +8055,8 @@ static int raid5_run(struct mddev *mddev)
> goto abort;
> }
>
> - if (log_init(conf, journal_dev, raid5_has_ppl(conf)))
> + ret = log_init(conf, journal_dev, raid5_has_ppl(conf));
> + if (ret)
> goto abort;
>
> return 0;
> --
> 2.51.0
>
>
Reviewed-by: Xiao Ni <xni@redhat.com>
* Re: [PATCH v5 00/12] md: align bio to io_opt for better performance
2026-01-14 17:12 [PATCH v5 00/12] md: align bio to io_opt for better performance Yu Kuai
` (11 preceding siblings ...)
2026-01-14 17:12 ` [PATCH v5 12/12] md: fix abnormal io_opt from member disks Yu Kuai
@ 2026-01-15 23:38 ` John Stoffel
12 siblings, 0 replies; 32+ messages in thread
From: John Stoffel @ 2026-01-15 23:38 UTC (permalink / raw)
To: Yu Kuai; +Cc: linux-raid, linan122, xni, dan.carpenter
>>>>> "Yu" == Yu Kuai <yukuai@fnnas.com> writes:
> This patchset optimizes MD RAID performance by aligning bios to the
> optimal I/O size before splitting. When I/O is aligned to io_opt,
> raid5 can perform full stripe writes without needing to read extra
> data for parity calculation, significantly improving bandwidth.
> Patch 1: Fix a bug in raid5_run() error handling
> Patches 2-4: Cleanup - merge boolean fields into mddev_flags
> Patches 5-6: Preparation - use mempool for stripe_request_ctx and
> ensure max_sectors >= io_opt
> Patches 7-8: Core - add bio alignment infrastructure
> Patches 9-11: Enable bio alignment for raid5, raid10, and raid0
> Patch 12: Fix abnormal io_opt from member disks
> Performance improvement on 32-disk raid5 with 64kb chunk:
> dd if=/dev/zero of=/dev/md0 bs=100M oflag=direct
> Before: 782 MB/s
> After: 1.1 GB/s
My only comment is how is performance impacted at other block sizes?
And smaller RAID5 arrays? What about RAID6?
And more importantly, are random disk writes impacted? It's great
that you have gotten streaming direct writes faster, but have other
writes slowed down for the common case?
> Changes in v5:
> - Add patch 1 to fix raid5_run() returning success when log_init() fails
> - Patch 12: Fix stale commit message (remove mention of MD_STACK_IO_OPT flag)
> Changes in v4:
> - Patch 12: Simplify by checking rdev_is_mddev() first, remove
> MD_STACK_IO_OPT flag
> Changes in v3:
> - Patch 5: Remove unnecessary NULL check before mempool_destroy()
> - Patch 7: Use sector_div() instead of roundup()/rounddown() to fix
> 64-bit division issue on 32-bit platforms
> Changes in v2:
> - Fix mempool in patch 5
> - Add prep cleanup patches, 2-4
> - Add patch 12 to fix abnormal io_opt
> - Add Link tags to patches
> Yu Kuai (12):
> md/raid5: fix raid5_run() to return error when log_init() fails
> md: merge mddev has_superblock into mddev_flags
> md: merge mddev faillast_dev into mddev_flags
> md: merge mddev serialize_policy into mddev_flags
> md/raid5: use mempool to allocate stripe_request_ctx
> md/raid5: make sure max_sectors is not less than io_opt
> md: support to align bio to limits
> md: add a helper md_config_align_limits()
> md/raid5: align bio to io_opt
> md/raid10: align bio to io_opt
> md/raid0: align bio to io_opt
> md: fix abnormal io_opt from member disks
> drivers/md/md-bitmap.c | 4 +-
> drivers/md/md.c | 118 +++++++++++++++++++++++++++++++++++------
> drivers/md/md.h | 30 +++++++++--
> drivers/md/raid0.c | 6 ++-
> drivers/md/raid1-10.c | 5 --
> drivers/md/raid1.c | 13 ++---
> drivers/md/raid10.c | 10 ++--
> drivers/md/raid5.c | 95 +++++++++++++++++++++++----------
> drivers/md/raid5.h | 3 ++
> 9 files changed, 217 insertions(+), 67 deletions(-)
> --
> 2.51.0
* Re: [PATCH v5 01/12] md/raid5: fix raid5_run() to return error when log_init() fails
2026-01-14 17:12 ` [PATCH v5 01/12] md/raid5: fix raid5_run() to return error when log_init() fails Yu Kuai
2026-01-15 1:28 ` Li Nan
2026-01-15 2:29 ` Xiao Ni
@ 2026-01-16 15:05 ` Christoph Hellwig
2 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2026-01-16 15:05 UTC (permalink / raw)
To: Yu Kuai; +Cc: linux-raid, linan122, xni, dan.carpenter
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v5 02/12] md: merge mddev has_superblock into mddev_flags
2026-01-14 17:12 ` [PATCH v5 02/12] md: merge mddev has_superblock into mddev_flags Yu Kuai
@ 2026-01-16 15:06 ` Christoph Hellwig
2026-01-18 11:30 ` Yu Kuai
0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2026-01-16 15:06 UTC (permalink / raw)
To: Yu Kuai; +Cc: linux-raid, linan122, xni, dan.carpenter
On Thu, Jan 15, 2026 at 01:12:30AM +0800, Yu Kuai wrote:
> There is no need to use a separate field in struct mddev; there are no
> functional changes.
It seems to me that right now the bitfields are persistent "features"
while the bits are state. This might not matter much, but it seems like
there is some rationale behind the current version.
* Re: [PATCH v5 07/12] md: support to align bio to limits
2026-01-14 17:12 ` [PATCH v5 07/12] md: support to align bio to limits Yu Kuai
@ 2026-01-16 15:08 ` Christoph Hellwig
2026-01-18 11:40 ` Yu Kuai
0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2026-01-16 15:08 UTC (permalink / raw)
To: Yu Kuai; +Cc: linux-raid, linan122, xni, dan.carpenter
On Thu, Jan 15, 2026 at 01:12:35AM +0800, Yu Kuai wrote:
> For personalities that report an optimal IO size, users can get the best
> IO bandwidth if they issue IO with this size. However, there is also an
> implicit condition that the IO must be aligned to the optimal IO size.
>
> Currently, a bio will only be split by limits; if the bio offset is not
> aligned to the limits, none of the split bios will be aligned either. This
> patch adds a new feature to align bios to the limits first, and following
> patches will enable it for each personality where necessary.
This feels a bit odd and mixes up different things as right now
nothing in the block layer splits to the io_opt size. If you want
a boundary to split on, the chunk_size limit seems to be what
you want, and the existing code would do the work based on that.
* Re: [PATCH v5 12/12] md: fix abnormal io_opt from member disks
2026-01-14 17:12 ` [PATCH v5 12/12] md: fix abnormal io_opt from member disks Yu Kuai
@ 2026-01-16 15:08 ` Christoph Hellwig
2026-01-17 3:28 ` Coly Li
0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2026-01-16 15:08 UTC (permalink / raw)
To: Yu Kuai; +Cc: linux-raid, linan122, xni, dan.carpenter
On Thu, Jan 15, 2026 at 01:12:40AM +0800, Yu Kuai wrote:
> It's reported that mpt3sas can report an abnormal io_opt; as a consequence,
> the md array will end up with an abnormal io_opt as well, due to the
How do you define "abnormal"?
* Re: [PATCH v5 12/12] md: fix abnormal io_opt from member disks
2026-01-16 15:08 ` Christoph Hellwig
@ 2026-01-17 3:28 ` Coly Li
2026-01-19 6:48 ` Christoph Hellwig
0 siblings, 1 reply; 32+ messages in thread
From: Coly Li @ 2026-01-17 3:28 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Yu Kuai, linux-raid, linan122, xni, dan.carpenter
> On 16 Jan 2026, at 23:08, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Jan 15, 2026 at 01:12:40AM +0800, Yu Kuai wrote:
>> It's reported that mpt3sas can report an abnormal io_opt; as a consequence,
>> the md array will end up with an abnormal io_opt as well, due to the
>
> How do you define "abnormal"?
E.g. a spinning hard drive connected to this HBA card reports its max_sectors as 32767 sectors.
This is around 16MB, which is too large for a normal hard drive.
Coly Li
* Re: [PATCH v5 02/12] md: merge mddev has_superblock into mddev_flags
2026-01-16 15:06 ` Christoph Hellwig
@ 2026-01-18 11:30 ` Yu Kuai
0 siblings, 0 replies; 32+ messages in thread
From: Yu Kuai @ 2026-01-18 11:30 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-raid, linan122, xni, dan.carpenter, yukuai
Hi,
On 2026/1/16 23:06, Christoph Hellwig wrote:
> On Thu, Jan 15, 2026 at 01:12:30AM +0800, Yu Kuai wrote:
>> There is no need to use a separate field in struct mddev; there are no
>> functional changes.
> It seems to me that right now the bitfields are persistent "features"
> while the bits are state. This might not matter much, but it seems like
> there is some rationale behind the current version.
I don't think so; there have been flags like MD_CLOSING and
MD_ARRAY_FIRST_USE that are set only in memory for a long time now.
Anyway, it'll make sense to split feature flags and state flags later.
>
--
Thanks,
Kuai
* Re: [PATCH v5 07/12] md: support to align bio to limits
2026-01-16 15:08 ` Christoph Hellwig
@ 2026-01-18 11:40 ` Yu Kuai
2026-01-19 6:47 ` Christoph Hellwig
0 siblings, 1 reply; 32+ messages in thread
From: Yu Kuai @ 2026-01-18 11:40 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-raid, linan122, xni, dan.carpenter, yukuai
Hi,
On 2026/1/16 23:08, Christoph Hellwig wrote:
> On Thu, Jan 15, 2026 at 01:12:35AM +0800, Yu Kuai wrote:
>> For personalities that report an optimal IO size, users can get the best
>> IO bandwidth if they issue IO with this size. However, there is also an
>> implicit condition that the IO must be aligned to the optimal IO size.
>>
>> Currently, a bio will only be split by limits; if the bio offset is not
>> aligned to the limits, none of the split bios will be aligned either. This
>> patch adds a new feature to align bios to the limits first, and following
>> patches will enable it for each personality where necessary.
> This feels a bit odd and mixes up different things as right now
> nothing in the block layer splits to the io_opt size. If you want
> a boundary to split on, the chunk_size limit seems to be what
> you want, and the existing code would do the work based on that.
No, chunk_sectors and io_opt are different, and aligning IO to io_opt
is not a general idea; for now this is the only requirement in mdraid.
Not sure if you remember, Coly used to send a version that aligned IO to
io_opt separately, and we discussed it and both agreed that aligning
max_sectors to io_opt and then aligning IO to max_sectors in mdraid is
better.
For now, the current setting of max_sectors, and splitting IO to
max_sectors, does not make much sense for raid0/10/456.
>
--
Thanks,
Kuai
* Re: [PATCH v5 07/12] md: support to align bio to limits
2026-01-18 11:40 ` Yu Kuai
@ 2026-01-19 6:47 ` Christoph Hellwig
2026-01-19 7:21 ` Yu Kuai
0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2026-01-19 6:47 UTC (permalink / raw)
To: Yu Kuai; +Cc: Christoph Hellwig, linux-raid, linan122, xni, dan.carpenter
On Sun, Jan 18, 2026 at 07:40:23PM +0800, Yu Kuai wrote:
> No, the chunk_sectors and io_opt are different, and align io to io_opt
> is not a general idea, for now this is the only requirement in mdraid.
The chunk size was added for (hardware) devices that require I/O split at
a fixed granularity for performance reasons. Which seems to be exactly
what you want here.
This has nothing to do with max_sectors.
* Re: [PATCH v5 12/12] md: fix abnormal io_opt from member disks
2026-01-17 3:28 ` Coly Li
@ 2026-01-19 6:48 ` Christoph Hellwig
2026-01-19 7:24 ` Yu Kuai
0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2026-01-19 6:48 UTC (permalink / raw)
To: Coly Li
Cc: Christoph Hellwig, Yu Kuai, linux-raid, linan122, xni,
dan.carpenter
On Sat, Jan 17, 2026 at 11:28:49AM +0800, Coly Li wrote:
> > On 16 Jan 2026, at 23:08, Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, Jan 15, 2026 at 01:12:40AM +0800, Yu Kuai wrote:
> >> It's reported that mpt3sas can report an abnormal io_opt; as a consequence,
> >> the md array will end up with an abnormal io_opt as well, due to the
> >
> > How do you define "abnormal"?
>
> E.g. a spinning hard drive connected to this HBA card reports its max_sectors as 32767 sectors.
> This is around 16MB, which is too large for a normal hard drive.
Which is larger than what we'd expect for the HDD itself, where it
should be around 1MB. But HBAs do weird stuff, so it might actually
be correct here. Have you talked to the mpt3sas maintainers?
* Re: [PATCH v5 07/12] md: support to align bio to limits
2026-01-19 6:47 ` Christoph Hellwig
@ 2026-01-19 7:21 ` Yu Kuai
2026-01-19 7:27 ` Christoph Hellwig
0 siblings, 1 reply; 32+ messages in thread
From: Yu Kuai @ 2026-01-19 7:21 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-raid, linan122, xni, dan.carpenter, yukuai
Hi,
On 2026/1/19 14:47, Christoph Hellwig wrote:
> On Sun, Jan 18, 2026 at 07:40:23PM +0800, Yu Kuai wrote:
>> No, the chunk_sectors and io_opt are different, and align io to io_opt
>> is not a general idea, for now this is the only requirement in mdraid.
> The chunk size was added for (hardware) devices that require I/O split at
> a fixed granularity for performance reasons. Which seems to be exactly
> what you want here.
>
> This has nothing to do with max_sectors.
For example, for a 32-disk raid5 array with chunksize=64k, the current
queue limits are:
chunk_sectors = 64k
io_min = 64k
io_opt = 64 * 31k
max_sectors = 1M
It's correct to split I/O at the 64k boundary to avoid performance issues;
however, splitting at the 64 * 31k boundary is what we want for the best
bandwidth.
So if we simply change chunk_sectors to 64 * 31k, it will be incorrect,
because the 64k boundary is still necessary for small IO.
So I'm not quite sure: are you suggesting the following solution?
Add a max_chunk_sectors, which in this case would be 64 * 31k, and for I/O:
1) First check and split to the chunk_sectors boundary;
2) If the I/O is already aligned to chunk_sectors, then check and split to
the max_chunk_sectors boundary (BTW, this is what this set is trying to do);
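A hypothetical sketch of the two-level boundary described in 1) and 2) above; max_chunk_sectors is only a name proposed in this mail, not an existing queue limit:

```c
#include <assert.h>
#include <stdint.h>

/* Next split point for an I/O starting at 'start': unaligned I/O is
 * first cut at the chunk_sectors boundary; once chunk aligned, it is
 * cut at the larger full-stripe (max_chunk_sectors) boundary. */
uint64_t next_split_boundary(uint64_t start, uint32_t chunk_sectors,
			     uint32_t max_chunk_sectors)
{
	if (start % chunk_sectors)
		return start + chunk_sectors - start % chunk_sectors;
	return start + max_chunk_sectors - start % max_chunk_sectors;
}
```

With chunk_sectors = 128 (64k) and max_chunk_sectors = 3968 (64 * 31k), an I/O at sector 100 is first cut at 128; from 128 onward the cuts fall on 3968-sector full-stripe boundaries.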
>
--
Thanks,
Kuai
* Re: [PATCH v5 12/12] md: fix abnormal io_opt from member disks
2026-01-19 6:48 ` Christoph Hellwig
@ 2026-01-19 7:24 ` Yu Kuai
2026-01-19 8:36 ` Christoph Hellwig
0 siblings, 1 reply; 32+ messages in thread
From: Yu Kuai @ 2026-01-19 7:24 UTC (permalink / raw)
To: Christoph Hellwig, Coly Li
Cc: linux-raid, linan122, xni, dan.carpenter, yukuai
Hi,
On 2026/1/19 14:48, Christoph Hellwig wrote:
> On Sat, Jan 17, 2026 at 11:28:49AM +0800, Coly Li wrote:
>>> On Jan 16, 2026, at 23:08, Christoph Hellwig <hch@infradead.org> wrote:
>>>
>>> On Thu, Jan 15, 2026 at 01:12:40AM +0800, Yu Kuai wrote:
>>>> It's reported that mpt3sas can report an abnormal io_opt; as a consequence,
>>>> the md array will end up with an abnormal io_opt as well, due to the
>>> How do you define "abnormal"?
>> E.g. a spinning hard drive connected to this HBA card reports its max_sectors as 32767 sectors.
>> This is around 16MB and too large for a normal hard drive.
> Which is larger than what we'd expect for the HDD itself, where it
> should be around 1MB. But HBAs do weird stuff, so it might actually
> be correct here. Have you talked to the mpt3sas maintainers?
We CC'd them in several previous threads; however, they're not responding. :(
>
--
Thanks,
Kuai
* Re: [PATCH v5 07/12] md: support to align bio to limits
2026-01-19 7:21 ` Yu Kuai
@ 2026-01-19 7:27 ` Christoph Hellwig
2026-01-19 7:43 ` Yu Kuai
0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2026-01-19 7:27 UTC (permalink / raw)
To: Yu Kuai; +Cc: Christoph Hellwig, linux-raid, linan122, xni, dan.carpenter
On Mon, Jan 19, 2026 at 03:21:14PM +0800, Yu Kuai wrote:
> Hi,
>
> On 2026/1/19 14:47, Christoph Hellwig wrote:
> > On Sun, Jan 18, 2026 at 07:40:23PM +0800, Yu Kuai wrote:
> >> No, the chunk_sectors and io_opt are different, and aligning I/O to io_opt
> >> is not a general idea; for now this is the only requirement in mdraid.
> > The chunk size was added for (hardware) devices that require I/O split at
> > a fixed granularity for performance reasons. Which seems to be exactly
> > what you want here.
> >
> > This has nothing to do with max_sectors.
>
> For example, take a 32-disk raid5 array with chunksize=64k; currently the queue
> limits are:
>
> chunk_sectors = 64k
> io_min = 64k
> io_opt = 31 * 64k
> max_sectors = 1M
>
> It's correct to split I/O at the 64k boundary to avoid performance issues;
> however, splitting at the 31 * 64k boundary is what we want to get the best
> bandwidth.
>
> So if we simply change chunk_sectors to 31 * 64k, it will be incorrect, because
> the 64k boundary is still necessary for small I/O.
What do you mean by "necessary for small IO"?
* Re: [PATCH v5 07/12] md: support to align bio to limits
2026-01-19 7:27 ` Christoph Hellwig
@ 2026-01-19 7:43 ` Yu Kuai
2026-01-19 8:27 ` Christoph Hellwig
0 siblings, 1 reply; 32+ messages in thread
From: Yu Kuai @ 2026-01-19 7:43 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-raid, linan122, xni, dan.carpenter, yukuai
On 2026/1/19 15:27, Christoph Hellwig wrote:
> On Mon, Jan 19, 2026 at 03:21:14PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> On 2026/1/19 14:47, Christoph Hellwig wrote:
>>> On Sun, Jan 18, 2026 at 07:40:23PM +0800, Yu Kuai wrote:
>>>> No, the chunk_sectors and io_opt are different, and aligning I/O to io_opt
>>>> is not a general idea; for now this is the only requirement in mdraid.
>>> The chunk size was added for (hardware) devices that require I/O split at
>>> a fixed granularity for performance reasons. Which seems to be exactly
>>> what you want here.
>>>
>>> This has nothing to do with max_sectors.
>> For example, take a 32-disk raid5 array with chunksize=64k; currently the queue
>> limits are:
>>
>> chunk_sectors = 64k
>> io_min = 64k
>> io_opt = 31 * 64k
>> max_sectors = 1M
>>
>> It's correct to split I/O at the 64k boundary to avoid performance issues;
>> however, splitting at the 31 * 64k boundary is what we want to get the best
>> bandwidth.
>>
>> So if we simply change chunk_sectors to 31 * 64k, it will be incorrect, because
>> the 64k boundary is still necessary for small I/O.
> What do you mean by "necessary for small IO"?
io_min and io_opt are quite similar in mdraid: I/O aligned with io_opt gets the best
bandwidth, and I/O aligned with io_min gets the best IOPS. Currently chunk_sectors
is the same as io_min, and bio_split_rw() will try to align I/O to io_min. I don't
think we want to remove this behavior.
>
--
Thanks,
Kuai
* Re: [PATCH v5 07/12] md: support to align bio to limits
2026-01-19 7:43 ` Yu Kuai
@ 2026-01-19 8:27 ` Christoph Hellwig
2026-01-19 9:15 ` Yu Kuai
0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2026-01-19 8:27 UTC (permalink / raw)
To: Yu Kuai
Cc: Christoph Hellwig, linux-raid, linan122, xni, dan.carpenter,
Martin K. Petersen
On Mon, Jan 19, 2026 at 03:43:34PM +0800, Yu Kuai wrote:
> >> 64k boundary is still necessary for small IO.
> > What do you mean by "necessary for small IO"?
>
> io_min and io_opt are quite similar in mdraid: I/O aligned with io_opt
> gets the best bandwidth, and I/O aligned with io_min gets the best
> IOPS. Currently chunk_sectors is the same as io_min, and
> bio_split_rw() will try to align I/O to io_min. I don't think we want
> to remove this behavior.
I'm still confused. Let's go back to your example:
32 disks raid5 array with chunksize=64k.
Let's look at writes first:
Each I/O that is fully aligned to 31 * 64k can be handled without a
read-modify-write cycle, so splitting I/O at that boundary makes perfect
sense. Below that there really should not be much difference, i.e.
splitting anything at the 64k boundary is not useful. So you want
chunk_sectors to apply at the 31 * 64k boundary, and io_opt as well.
And probably io_min too (all just looking at writes).
For non-degraded reads, not much should matter. All reads should be
reasonably efficient; splitting at 64k boundaries is going to make the
implementation trivial, but will make you rely heavily on plugging
below, and also means you use quite a lot of lower bios.
For degraded reads, each I/O will always read 31 * 64k. Splitting at
31 * 64k makes the implementation much easier.
I guess you want different boundaries for reads and writes?
Note that io_opt and io_min really are just values for the caller and
do not affect splitting decisions themselves. Of course the underlying
factors should be related.
* Re: [PATCH v5 12/12] md: fix abnormal io_opt from member disks
2026-01-19 7:24 ` Yu Kuai
@ 2026-01-19 8:36 ` Christoph Hellwig
0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2026-01-19 8:36 UTC (permalink / raw)
To: Yu Kuai
Cc: Christoph Hellwig, Coly Li, linux-raid, linan122, xni,
dan.carpenter
On Mon, Jan 19, 2026 at 03:24:19PM +0800, Yu Kuai wrote:
> >> E.g. a spinning hard drive connected to this HBA card reports its max_sectors as 32767 sectors.
> >> This is around 16MB and too large for a normal hard drive.
> > Which is larger than what we'd expect for the HDD itself, where it
> > should be around 1MB. But HBAs do weird stuff, so it might actually
> > be correct here. Have you talked to the mpt3sas maintainers?
>
> We CC'd them in several previous threads; however, they're not responding. :(
In that case we'll have to assume the value is intended, especially
if the HBA is running in RAID mode, which would be the only sensible path
to modify this value anyway.
* Re: [PATCH v5 07/12] md: support to align bio to limits
2026-01-19 8:27 ` Christoph Hellwig
@ 2026-01-19 9:15 ` Yu Kuai
0 siblings, 0 replies; 32+ messages in thread
From: Yu Kuai @ 2026-01-19 9:15 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-raid, linan122, xni, dan.carpenter, Martin K. Petersen,
yukuai
Hi,
On 2026/1/19 16:27, Christoph Hellwig wrote:
> On Mon, Jan 19, 2026 at 03:43:34PM +0800, Yu Kuai wrote:
>>>> 64k boundary is still necessary for small IO.
>>> What do you mean by "necessary for small IO"?
>> io_min and io_opt are quite similar in mdraid: I/O aligned with io_opt
>> gets the best bandwidth, and I/O aligned with io_min gets the best
>> IOPS. Currently chunk_sectors is the same as io_min, and
>> bio_split_rw() will try to align I/O to io_min. I don't think we want
>> to remove this behavior.
> I'm still confused. Let's go back to your example:
>
> 32 disks raid5 array with chunksize=64k.
>
> Let's look at writes first:
>
> Each I/O that is fully aligned to 31 * 64k can be handled without a
> read-modify-write cycle, so splitting I/O at that boundary makes perfect
> sense. Below that there really should not be much difference, i.e.
> splitting anything at the 64k boundary is not useful. So you want
> chunk_sectors to apply at the 31 * 64k boundary, and io_opt as well.
> And probably io_min too (all just looking at writes).
This sounds reasonable; however, I'm not 100% sure that splitting at the 64k
boundary is not useful, so I must run some tests to confirm. This behavior has
existed for quite a long time.
>
> For non-degraded reads, not much should matter. All reads should be
> reasonably efficient; splitting at 64k boundaries is going to make the
> implementation trivial, but will make you rely heavily on plugging
> below, and also means you use quite a lot of lower bios.
Correct. BTW, even if we don't split at the 64k boundary in bio_split_rw(), raid5
will still try to split at the 64k boundary in chunk_aligned_read(), and this does
rely heavily on plugging below.
BTW, the current plug limit really is too low for a huge array with 32+ member
disks: only 32 requests and at most 128k per request. However, I still can't find
a better solution other than simply increasing the limits.
>
> For degraded reads, each I/O will always read 31 * 64k. Splitting at
> 31 * 64k makes the implementation much easier.
I don't think this is correct: each I/O is handled by stripes, so a
4k read to the removed disk only needs to read the matching 4k from the other
disks. Anyway, this does not matter.
>
> I guess you want different boundaries for reads and writes?
Yes, this is still a potential demand, I'll test and take a detailed look
at other personalities.
>
> Note that io_opt and io_min really just are values for the caller and
> not affect splitting decisions themselves. Of course the underlying
> factors should be related.
Thanks for the explanation; I'll follow up soon after testing.
--
Thanks,
Kuai
Thread overview: 32+ messages
2026-01-14 17:12 [PATCH v5 00/12] md: align bio to io_opt for better performance Yu Kuai
2026-01-14 17:12 ` [PATCH v5 01/12] md/raid5: fix raid5_run() to return error when log_init() fails Yu Kuai
2026-01-15 1:28 ` Li Nan
2026-01-15 2:29 ` Xiao Ni
2026-01-16 15:05 ` Christoph Hellwig
2026-01-14 17:12 ` [PATCH v5 02/12] md: merge mddev has_superblock into mddev_flags Yu Kuai
2026-01-16 15:06 ` Christoph Hellwig
2026-01-18 11:30 ` Yu Kuai
2026-01-14 17:12 ` [PATCH v5 03/12] md: merge mddev faillast_dev " Yu Kuai
2026-01-14 17:12 ` [PATCH v5 04/12] md: merge mddev serialize_policy " Yu Kuai
2026-01-14 17:12 ` [PATCH v5 05/12] md/raid5: use mempool to allocate stripe_request_ctx Yu Kuai
2026-01-14 17:12 ` [PATCH v5 06/12] md/raid5: make sure max_sectors is not less than io_opt Yu Kuai
2026-01-14 17:12 ` [PATCH v5 07/12] md: support to align bio to limits Yu Kuai
2026-01-16 15:08 ` Christoph Hellwig
2026-01-18 11:40 ` Yu Kuai
2026-01-19 6:47 ` Christoph Hellwig
2026-01-19 7:21 ` Yu Kuai
2026-01-19 7:27 ` Christoph Hellwig
2026-01-19 7:43 ` Yu Kuai
2026-01-19 8:27 ` Christoph Hellwig
2026-01-19 9:15 ` Yu Kuai
2026-01-14 17:12 ` [PATCH v5 08/12] md: add a helper md_config_align_limits() Yu Kuai
2026-01-14 17:12 ` [PATCH v5 09/12] md/raid5: align bio to io_opt Yu Kuai
2026-01-14 17:12 ` [PATCH v5 10/12] md/raid10: " Yu Kuai
2026-01-14 17:12 ` [PATCH v5 11/12] md/raid0: " Yu Kuai
2026-01-14 17:12 ` [PATCH v5 12/12] md: fix abnormal io_opt from member disks Yu Kuai
2026-01-16 15:08 ` Christoph Hellwig
2026-01-17 3:28 ` Coly Li
2026-01-19 6:48 ` Christoph Hellwig
2026-01-19 7:24 ` Yu Kuai
2026-01-19 8:36 ` Christoph Hellwig
2026-01-15 23:38 ` [PATCH v5 00/12] md: align bio to io_opt for better performance John Stoffel