* [PATCH v8 1/4] md: delete md_redundancy_group when array is becoming inactive
2025-10-30 6:28 [PATCH v8 0/4] make logical block size configurable linan666
@ 2025-10-30 6:28 ` linan666
2025-11-03 0:27 ` Xiao Ni
2025-10-30 6:28 ` [PATCH v8 2/4] md: init bioset in mddev_init linan666
` (2 subsequent siblings)
3 siblings, 1 reply; 12+ messages in thread
From: linan666 @ 2025-10-30 6:28 UTC (permalink / raw)
To: corbet, song, yukuai, linan122, hare, xni
Cc: linux-doc, linux-kernel, linux-raid, linan666, yangerkun,
yi.zhang
From: Li Nan <linan122@huawei.com>
'md_redundancy_group' is created in md_run() and deleted in del_gendisk(),
but the two are not paired. Writing inactive/active to the sysfs
array_state attribute can trigger md_run() multiple times without an
intervening del_gendisk(), leading to duplicate creation as below:
sysfs: cannot create duplicate filename '/devices/virtual/block/md0/md/sync_action'
Call Trace:
dump_stack_lvl+0x9f/0x120
dump_stack+0x14/0x20
sysfs_warn_dup+0x96/0xc0
sysfs_add_file_mode_ns+0x19c/0x1b0
internal_create_group+0x213/0x830
sysfs_create_group+0x17/0x20
md_run+0x856/0xe60
? __x64_sys_openat+0x23/0x30
do_md_run+0x26/0x1d0
array_state_store+0x559/0x760
md_attr_store+0xc9/0x1e0
sysfs_kf_write+0x6f/0xa0
kernfs_fop_write_iter+0x141/0x2a0
vfs_write+0x1fc/0x5a0
ksys_write+0x79/0x180
__x64_sys_write+0x1d/0x30
x64_sys_call+0x2818/0x2880
do_syscall_64+0xa9/0x580
entry_SYSCALL_64_after_hwframe+0x4b/0x53
md: cannot register extra attributes for md0
Creation of the group depends on 'pers', so its lifecycle cannot be
aligned with the gendisk. Fix this issue by triggering
'md_redundancy_group' deletion when the array is becoming inactive.
Fixes: 790abe4d77af ("md: remove/add redundancy group only in level change")
Signed-off-by: Li Nan <linan122@huawei.com>
---
drivers/md/md.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index fa13eb02874e..f6fd55a1637b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6878,6 +6878,10 @@ static int do_md_stop(struct mddev *mddev, int mode)
if (!md_is_rdwr(mddev))
set_disk_ro(disk, 0);
+ if (mode == 2 && mddev->pers->sync_request &&
+ mddev->to_remove == NULL)
+ mddev->to_remove = &md_redundancy_group;
+
__md_stop_writes(mddev);
__md_stop(mddev);
--
2.39.2
^ permalink raw reply related [flat|nested] 12+ messages in thread

* Re: [PATCH v8 1/4] md: delete md_redundancy_group when array is becoming inactive
2025-10-30 6:28 ` [PATCH v8 1/4] md: delete md_redundancy_group when array is becoming inactive linan666
@ 2025-11-03 0:27 ` Xiao Ni
0 siblings, 0 replies; 12+ messages in thread
From: Xiao Ni @ 2025-11-03 0:27 UTC (permalink / raw)
To: linan666
Cc: corbet, song, yukuai, linan122, hare, linux-doc, linux-kernel,
linux-raid, yangerkun, yi.zhang
On Thu, Oct 30, 2025 at 2:36 PM <linan666@huaweicloud.com> wrote:
>
> From: Li Nan <linan122@huawei.com>
>
> 'md_redundancy_group' is created in md_run() and deleted in del_gendisk(),
> but the two are not paired. Writing inactive/active to the sysfs
> array_state attribute can trigger md_run() multiple times without an
> intervening del_gendisk(), leading to duplicate creation as below:
>
> sysfs: cannot create duplicate filename '/devices/virtual/block/md0/md/sync_action'
> Call Trace:
> dump_stack_lvl+0x9f/0x120
> dump_stack+0x14/0x20
> sysfs_warn_dup+0x96/0xc0
> sysfs_add_file_mode_ns+0x19c/0x1b0
> internal_create_group+0x213/0x830
> sysfs_create_group+0x17/0x20
> md_run+0x856/0xe60
> ? __x64_sys_openat+0x23/0x30
> do_md_run+0x26/0x1d0
> array_state_store+0x559/0x760
> md_attr_store+0xc9/0x1e0
> sysfs_kf_write+0x6f/0xa0
> kernfs_fop_write_iter+0x141/0x2a0
> vfs_write+0x1fc/0x5a0
> ksys_write+0x79/0x180
> __x64_sys_write+0x1d/0x30
> x64_sys_call+0x2818/0x2880
> do_syscall_64+0xa9/0x580
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
> md: cannot register extra attributes for md0
>
> Creation of the group depends on 'pers', so its lifecycle cannot be
> aligned with the gendisk. Fix this issue by triggering
> 'md_redundancy_group' deletion when the array is becoming inactive.
>
> Fixes: 790abe4d77af ("md: remove/add redundancy group only in level change")
> Signed-off-by: Li Nan <linan122@huawei.com>
> ---
> drivers/md/md.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index fa13eb02874e..f6fd55a1637b 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -6878,6 +6878,10 @@ static int do_md_stop(struct mddev *mddev, int mode)
> if (!md_is_rdwr(mddev))
> set_disk_ro(disk, 0);
>
> + if (mode == 2 && mddev->pers->sync_request &&
> + mddev->to_remove == NULL)
> + mddev->to_remove = &md_redundancy_group;
> +
> __md_stop_writes(mddev);
> __md_stop(mddev);
>
> --
> 2.39.2
>
Looks good to me.
Reviewed-by: Xiao Ni <xni@redhat.com>
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v8 2/4] md: init bioset in mddev_init
2025-10-30 6:28 [PATCH v8 0/4] make logical block size configurable linan666
2025-10-30 6:28 ` [PATCH v8 1/4] md: delete md_redundancy_group when array is becoming inactive linan666
@ 2025-10-30 6:28 ` linan666
2025-11-03 1:23 ` Xiao Ni
2025-10-30 6:28 ` [PATCH v8 3/4] md/raid0: Move queue limit setup before r0conf initialization linan666
2025-10-30 6:28 ` [PATCH v8 4/4] md: allow configuring logical block size linan666
3 siblings, 1 reply; 12+ messages in thread
From: linan666 @ 2025-10-30 6:28 UTC (permalink / raw)
To: corbet, song, yukuai, linan122, hare, xni
Cc: linux-doc, linux-kernel, linux-raid, linan666, yangerkun,
yi.zhang
From: Li Nan <linan122@huawei.com>
IO operations may be needed before md_run(), such as updating metadata
after a sysfs write. Without the biosets initialized, this triggers a
NULL pointer dereference as below:
BUG: kernel NULL pointer dereference, address: 0000000000000020
Call Trace:
md_update_sb+0x658/0xe00
new_level_store+0xc5/0x120
md_attr_store+0xc9/0x1e0
sysfs_kf_write+0x6f/0xa0
kernfs_fop_write_iter+0x141/0x2a0
vfs_write+0x1fc/0x5a0
ksys_write+0x79/0x180
__x64_sys_write+0x1d/0x30
x64_sys_call+0x2818/0x2880
do_syscall_64+0xa9/0x580
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Reproducer
```
mdadm -CR /dev/md0 -l1 -n2 /dev/sd[cd]
echo inactive > /sys/block/md0/md/array_state
echo 10 > /sys/block/md0/md/new_level
```
mddev_init() can only be called once per mddev, thus the
bioset_initialized() checks can be removed.
Fixes: d981ed841930 ("md: Add new_level sysfs interface")
Signed-off-by: Li Nan <linan122@huawei.com>
---
drivers/md/md.c | 69 +++++++++++++++++++++++--------------------------
1 file changed, 33 insertions(+), 36 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index f6fd55a1637b..dffc6a482181 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -730,6 +730,8 @@ static void mddev_clear_bitmap_ops(struct mddev *mddev)
int mddev_init(struct mddev *mddev)
{
+ int err = 0;
+
if (!IS_ENABLED(CONFIG_MD_BITMAP))
mddev->bitmap_id = ID_BITMAP_NONE;
else
@@ -741,10 +743,23 @@ int mddev_init(struct mddev *mddev)
if (percpu_ref_init(&mddev->writes_pending, no_op,
PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
- percpu_ref_exit(&mddev->active_io);
- return -ENOMEM;
+ err = -ENOMEM;
+ goto exit_active_io;
}
+ err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
+ if (err)
+ goto exit_writes_pending;
+
+ err = bioset_init(&mddev->sync_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
+ if (err)
+ goto exit_bio_set;
+
+ err = bioset_init(&mddev->io_clone_set, BIO_POOL_SIZE,
+ offsetof(struct md_io_clone, bio_clone), 0);
+ if (err)
+ goto exit_sync_set;
+
/* We want to start with the refcount at zero */
percpu_ref_put(&mddev->writes_pending);
@@ -773,11 +788,24 @@ int mddev_init(struct mddev *mddev)
INIT_WORK(&mddev->del_work, mddev_delayed_delete);
return 0;
+
+exit_sync_set:
+ bioset_exit(&mddev->sync_set);
+exit_bio_set:
+ bioset_exit(&mddev->bio_set);
+exit_writes_pending:
+ percpu_ref_exit(&mddev->writes_pending);
+exit_active_io:
+ percpu_ref_exit(&mddev->active_io);
+ return err;
}
EXPORT_SYMBOL_GPL(mddev_init);
void mddev_destroy(struct mddev *mddev)
{
+ bioset_exit(&mddev->bio_set);
+ bioset_exit(&mddev->sync_set);
+ bioset_exit(&mddev->io_clone_set);
percpu_ref_exit(&mddev->active_io);
percpu_ref_exit(&mddev->writes_pending);
}
@@ -6393,29 +6421,9 @@ int md_run(struct mddev *mddev)
nowait = nowait && bdev_nowait(rdev->bdev);
}
- if (!bioset_initialized(&mddev->bio_set)) {
- err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
- if (err)
- return err;
- }
- if (!bioset_initialized(&mddev->sync_set)) {
- err = bioset_init(&mddev->sync_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
- if (err)
- goto exit_bio_set;
- }
-
- if (!bioset_initialized(&mddev->io_clone_set)) {
- err = bioset_init(&mddev->io_clone_set, BIO_POOL_SIZE,
- offsetof(struct md_io_clone, bio_clone), 0);
- if (err)
- goto exit_sync_set;
- }
-
pers = get_pers(mddev->level, mddev->clevel);
- if (!pers) {
- err = -EINVAL;
- goto abort;
- }
+ if (!pers)
+ return -EINVAL;
if (mddev->level != pers->head.id) {
mddev->level = pers->head.id;
mddev->new_level = pers->head.id;
@@ -6426,8 +6434,7 @@ int md_run(struct mddev *mddev)
pers->start_reshape == NULL) {
/* This personality cannot handle reshaping... */
put_pers(pers);
- err = -EINVAL;
- goto abort;
+ return -EINVAL;
}
if (pers->sync_request) {
@@ -6554,12 +6561,6 @@ int md_run(struct mddev *mddev)
mddev->private = NULL;
put_pers(pers);
md_bitmap_destroy(mddev);
-abort:
- bioset_exit(&mddev->io_clone_set);
-exit_sync_set:
- bioset_exit(&mddev->sync_set);
-exit_bio_set:
- bioset_exit(&mddev->bio_set);
return err;
}
EXPORT_SYMBOL_GPL(md_run);
@@ -6784,10 +6785,6 @@ static void __md_stop(struct mddev *mddev)
mddev->private = NULL;
put_pers(pers);
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-
- bioset_exit(&mddev->bio_set);
- bioset_exit(&mddev->sync_set);
- bioset_exit(&mddev->io_clone_set);
}
void md_stop(struct mddev *mddev)
--
2.39.2
^ permalink raw reply related [flat|nested] 12+ messages in thread

* Re: [PATCH v8 2/4] md: init bioset in mddev_init
2025-10-30 6:28 ` [PATCH v8 2/4] md: init bioset in mddev_init linan666
@ 2025-11-03 1:23 ` Xiao Ni
2025-11-03 12:32 ` Li Nan
0 siblings, 1 reply; 12+ messages in thread
From: Xiao Ni @ 2025-11-03 1:23 UTC (permalink / raw)
To: linan666
Cc: corbet, song, yukuai, linan122, hare, linux-doc, linux-kernel,
linux-raid, yangerkun, yi.zhang
On Thu, Oct 30, 2025 at 2:36 PM <linan666@huaweicloud.com> wrote:
>
> From: Li Nan <linan122@huawei.com>
>
> IO operations may be needed before md_run(), such as updating metadata
> after writing sysfs. Without bioset, this triggers a NULL pointer
> dereference as below:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000020
> Call Trace:
> md_update_sb+0x658/0xe00
> new_level_store+0xc5/0x120
> md_attr_store+0xc9/0x1e0
> sysfs_kf_write+0x6f/0xa0
> kernfs_fop_write_iter+0x141/0x2a0
> vfs_write+0x1fc/0x5a0
> ksys_write+0x79/0x180
> __x64_sys_write+0x1d/0x30
> x64_sys_call+0x2818/0x2880
> do_syscall_64+0xa9/0x580
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> Reproducer
> ```
> mdadm -CR /dev/md0 -l1 -n2 /dev/sd[cd]
> echo inactive > /sys/block/md0/md/array_state
> echo 10 > /sys/block/md0/md/new_level
> ```
>
Hi Li Nan
> mddev_init() can only be called once per mddev, no need to test if bioset
> has been initialized anymore.
The patch looks good to me. But I don't understand the message here.
This patch changes the alloc/free bioset positions. What's the meaning
of "no need to test if bioset has been initialized anymore"?
Regards
Xiao
>
> Fixes: d981ed841930 ("md: Add new_level sysfs interface")
> Signed-off-by: Li Nan <linan122@huawei.com>
> ---
> drivers/md/md.c | 69 +++++++++++++++++++++++--------------------------
> 1 file changed, 33 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f6fd55a1637b..dffc6a482181 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -730,6 +730,8 @@ static void mddev_clear_bitmap_ops(struct mddev *mddev)
>
> int mddev_init(struct mddev *mddev)
> {
> + int err = 0;
> +
> if (!IS_ENABLED(CONFIG_MD_BITMAP))
> mddev->bitmap_id = ID_BITMAP_NONE;
> else
> @@ -741,10 +743,23 @@ int mddev_init(struct mddev *mddev)
>
> if (percpu_ref_init(&mddev->writes_pending, no_op,
> PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
> - percpu_ref_exit(&mddev->active_io);
> - return -ENOMEM;
> + err = -ENOMEM;
> + goto exit_active_io;
> }
>
> + err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
> + if (err)
> + goto exit_writes_pending;
> +
> + err = bioset_init(&mddev->sync_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
> + if (err)
> + goto exit_bio_set;
> +
> + err = bioset_init(&mddev->io_clone_set, BIO_POOL_SIZE,
> + offsetof(struct md_io_clone, bio_clone), 0);
> + if (err)
> + goto exit_sync_set;
> +
> /* We want to start with the refcount at zero */
> percpu_ref_put(&mddev->writes_pending);
>
> @@ -773,11 +788,24 @@ int mddev_init(struct mddev *mddev)
> INIT_WORK(&mddev->del_work, mddev_delayed_delete);
>
> return 0;
> +
> +exit_sync_set:
> + bioset_exit(&mddev->sync_set);
> +exit_bio_set:
> + bioset_exit(&mddev->bio_set);
> +exit_writes_pending:
> + percpu_ref_exit(&mddev->writes_pending);
> +exit_active_io:
> + percpu_ref_exit(&mddev->active_io);
> + return err;
> }
> EXPORT_SYMBOL_GPL(mddev_init);
>
> void mddev_destroy(struct mddev *mddev)
> {
> + bioset_exit(&mddev->bio_set);
> + bioset_exit(&mddev->sync_set);
> + bioset_exit(&mddev->io_clone_set);
> percpu_ref_exit(&mddev->active_io);
> percpu_ref_exit(&mddev->writes_pending);
> }
> @@ -6393,29 +6421,9 @@ int md_run(struct mddev *mddev)
> nowait = nowait && bdev_nowait(rdev->bdev);
> }
>
> - if (!bioset_initialized(&mddev->bio_set)) {
> - err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
> - if (err)
> - return err;
> - }
> - if (!bioset_initialized(&mddev->sync_set)) {
> - err = bioset_init(&mddev->sync_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
> - if (err)
> - goto exit_bio_set;
> - }
> -
> - if (!bioset_initialized(&mddev->io_clone_set)) {
> - err = bioset_init(&mddev->io_clone_set, BIO_POOL_SIZE,
> - offsetof(struct md_io_clone, bio_clone), 0);
> - if (err)
> - goto exit_sync_set;
> - }
> -
> pers = get_pers(mddev->level, mddev->clevel);
> - if (!pers) {
> - err = -EINVAL;
> - goto abort;
> - }
> + if (!pers)
> + return -EINVAL;
> if (mddev->level != pers->head.id) {
> mddev->level = pers->head.id;
> mddev->new_level = pers->head.id;
> @@ -6426,8 +6434,7 @@ int md_run(struct mddev *mddev)
> pers->start_reshape == NULL) {
> /* This personality cannot handle reshaping... */
> put_pers(pers);
> - err = -EINVAL;
> - goto abort;
> + return -EINVAL;
> }
>
> if (pers->sync_request) {
> @@ -6554,12 +6561,6 @@ int md_run(struct mddev *mddev)
> mddev->private = NULL;
> put_pers(pers);
> md_bitmap_destroy(mddev);
> -abort:
> - bioset_exit(&mddev->io_clone_set);
> -exit_sync_set:
> - bioset_exit(&mddev->sync_set);
> -exit_bio_set:
> - bioset_exit(&mddev->bio_set);
> return err;
> }
> EXPORT_SYMBOL_GPL(md_run);
> @@ -6784,10 +6785,6 @@ static void __md_stop(struct mddev *mddev)
> mddev->private = NULL;
> put_pers(pers);
> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> -
> - bioset_exit(&mddev->bio_set);
> - bioset_exit(&mddev->sync_set);
> - bioset_exit(&mddev->io_clone_set);
> }
>
> void md_stop(struct mddev *mddev)
> --
> 2.39.2
>
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: [PATCH v8 2/4] md: init bioset in mddev_init
2025-11-03 1:23 ` Xiao Ni
@ 2025-11-03 12:32 ` Li Nan
2025-11-04 1:20 ` Xiao Ni
0 siblings, 1 reply; 12+ messages in thread
From: Li Nan @ 2025-11-03 12:32 UTC (permalink / raw)
To: Xiao Ni, linan666
Cc: corbet, song, yukuai, hare, linux-doc, linux-kernel, linux-raid,
yangerkun, yi.zhang
On 2025/11/3 9:23, Xiao Ni wrote:
> On Thu, Oct 30, 2025 at 2:36 PM <linan666@huaweicloud.com> wrote:
>>
>> From: Li Nan <linan122@huawei.com>
>>
>> IO operations may be needed before md_run(), such as updating metadata
>> after writing sysfs. Without bioset, this triggers a NULL pointer
>> dereference as below:
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000020
>> Call Trace:
>> md_update_sb+0x658/0xe00
>> new_level_store+0xc5/0x120
>> md_attr_store+0xc9/0x1e0
>> sysfs_kf_write+0x6f/0xa0
>> kernfs_fop_write_iter+0x141/0x2a0
>> vfs_write+0x1fc/0x5a0
>> ksys_write+0x79/0x180
>> __x64_sys_write+0x1d/0x30
>> x64_sys_call+0x2818/0x2880
>> do_syscall_64+0xa9/0x580
>> entry_SYSCALL_64_after_hwframe+0x4b/0x53
>>
>> Reproducer
>> ```
>> mdadm -CR /dev/md0 -l1 -n2 /dev/sd[cd]
>> echo inactive > /sys/block/md0/md/array_state
>> echo 10 > /sys/block/md0/md/new_level
>> ```
>>
>
> Hi Li Nan
>
>> mddev_init() can only be called once per mddev, no need to test if bioset
>> has been initialized anymore.
>
> The patch looks good to me. But I don't understand the message here.
> This patch changes the alloc/free bioset positions. What's the meaning
> of "no need to test if bioset has been initialized anymore"?
>
> Regards
> Xiao
Hi Xiao
Thanks for your review.
Sorry for causing any misunderstanding.
Old code:
- if (!bioset_initialized(&mddev->bio_set)) {
- err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
New code:
+ err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
bioset_initialized() is removed. Can I describe it as:
mddev_init() can only be called once per mddev, thus bioset_initialized()
can be removed.
>>
>> Fixes: d981ed841930 ("md: Add new_level sysfs interface")
>> Signed-off-by: Li Nan <linan122@huawei.com>
>> ---
>> drivers/md/md.c | 69 +++++++++++++++++++++++--------------------------
>> 1 file changed, 33 insertions(+), 36 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index f6fd55a1637b..dffc6a482181 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -730,6 +730,8 @@ static void mddev_clear_bitmap_ops(struct mddev *mddev)
>>
>> int mddev_init(struct mddev *mddev)
>> {
>> + int err = 0;
>> +
>> if (!IS_ENABLED(CONFIG_MD_BITMAP))
>> mddev->bitmap_id = ID_BITMAP_NONE;
>> else
>> @@ -741,10 +743,23 @@ int mddev_init(struct mddev *mddev)
>>
>> if (percpu_ref_init(&mddev->writes_pending, no_op,
>> PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
>> - percpu_ref_exit(&mddev->active_io);
>> - return -ENOMEM;
>> + err = -ENOMEM;
>> + goto exit_active_io;
>> }
>>
>> + err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
>> + if (err)
>> + goto exit_writes_pending;
>> +
>> + err = bioset_init(&mddev->sync_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
>> + if (err)
>> + goto exit_bio_set;
>> +
>> + err = bioset_init(&mddev->io_clone_set, BIO_POOL_SIZE,
>> + offsetof(struct md_io_clone, bio_clone), 0);
>> + if (err)
>> + goto exit_sync_set;
>> +
>> /* We want to start with the refcount at zero */
>> percpu_ref_put(&mddev->writes_pending);
>>
>> @@ -773,11 +788,24 @@ int mddev_init(struct mddev *mddev)
>> INIT_WORK(&mddev->del_work, mddev_delayed_delete);
>>
>> return 0;
>> +
>> +exit_sync_set:
>> + bioset_exit(&mddev->sync_set);
>> +exit_bio_set:
>> + bioset_exit(&mddev->bio_set);
>> +exit_writes_pending:
>> + percpu_ref_exit(&mddev->writes_pending);
>> +exit_active_io:
>> + percpu_ref_exit(&mddev->active_io);
>> + return err;
>> }
>> EXPORT_SYMBOL_GPL(mddev_init);
>>
>> void mddev_destroy(struct mddev *mddev)
>> {
>> + bioset_exit(&mddev->bio_set);
>> + bioset_exit(&mddev->sync_set);
>> + bioset_exit(&mddev->io_clone_set);
>> percpu_ref_exit(&mddev->active_io);
>> percpu_ref_exit(&mddev->writes_pending);
>> }
>> @@ -6393,29 +6421,9 @@ int md_run(struct mddev *mddev)
>> nowait = nowait && bdev_nowait(rdev->bdev);
>> }
>>
>> - if (!bioset_initialized(&mddev->bio_set)) {
>> - err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
>> - if (err)
>> - return err;
>> - }
>> - if (!bioset_initialized(&mddev->sync_set)) {
>> - err = bioset_init(&mddev->sync_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
>> - if (err)
>> - goto exit_bio_set;
>> - }
>> -
>> - if (!bioset_initialized(&mddev->io_clone_set)) {
>> - err = bioset_init(&mddev->io_clone_set, BIO_POOL_SIZE,
>> - offsetof(struct md_io_clone, bio_clone), 0);
>> - if (err)
>> - goto exit_sync_set;
>> - }
>> -
>> pers = get_pers(mddev->level, mddev->clevel);
>> - if (!pers) {
>> - err = -EINVAL;
>> - goto abort;
>> - }
>> + if (!pers)
>> + return -EINVAL;
>> if (mddev->level != pers->head.id) {
>> mddev->level = pers->head.id;
>> mddev->new_level = pers->head.id;
>> @@ -6426,8 +6434,7 @@ int md_run(struct mddev *mddev)
>> pers->start_reshape == NULL) {
>> /* This personality cannot handle reshaping... */
>> put_pers(pers);
>> - err = -EINVAL;
>> - goto abort;
>> + return -EINVAL;
>> }
>>
>> if (pers->sync_request) {
>> @@ -6554,12 +6561,6 @@ int md_run(struct mddev *mddev)
>> mddev->private = NULL;
>> put_pers(pers);
>> md_bitmap_destroy(mddev);
>> -abort:
>> - bioset_exit(&mddev->io_clone_set);
>> -exit_sync_set:
>> - bioset_exit(&mddev->sync_set);
>> -exit_bio_set:
>> - bioset_exit(&mddev->bio_set);
>> return err;
>> }
>> EXPORT_SYMBOL_GPL(md_run);
>> @@ -6784,10 +6785,6 @@ static void __md_stop(struct mddev *mddev)
>> mddev->private = NULL;
>> put_pers(pers);
>> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>> -
>> - bioset_exit(&mddev->bio_set);
>> - bioset_exit(&mddev->sync_set);
>> - bioset_exit(&mddev->io_clone_set);
>> }
>>
>> void md_stop(struct mddev *mddev)
>> --
>> 2.39.2
>>
>
>
> .
--
Thanks,
Nan
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: [PATCH v8 2/4] md: init bioset in mddev_init
2025-11-03 12:32 ` Li Nan
@ 2025-11-04 1:20 ` Xiao Ni
0 siblings, 0 replies; 12+ messages in thread
From: Xiao Ni @ 2025-11-04 1:20 UTC (permalink / raw)
To: Li Nan
Cc: corbet, song, yukuai, hare, linux-doc, linux-kernel, linux-raid,
yangerkun, yi.zhang
On Mon, Nov 3, 2025 at 8:32 PM Li Nan <linan666@huaweicloud.com> wrote:
>
>
>
> On 2025/11/3 9:23, Xiao Ni wrote:
> > On Thu, Oct 30, 2025 at 2:36 PM <linan666@huaweicloud.com> wrote:
> >>
> >> From: Li Nan <linan122@huawei.com>
> >>
> >> IO operations may be needed before md_run(), such as updating metadata
> >> after writing sysfs. Without bioset, this triggers a NULL pointer
> >> dereference as below:
> >>
> >> BUG: kernel NULL pointer dereference, address: 0000000000000020
> >> Call Trace:
> >> md_update_sb+0x658/0xe00
> >> new_level_store+0xc5/0x120
> >> md_attr_store+0xc9/0x1e0
> >> sysfs_kf_write+0x6f/0xa0
> >> kernfs_fop_write_iter+0x141/0x2a0
> >> vfs_write+0x1fc/0x5a0
> >> ksys_write+0x79/0x180
> >> __x64_sys_write+0x1d/0x30
> >> x64_sys_call+0x2818/0x2880
> >> do_syscall_64+0xa9/0x580
> >> entry_SYSCALL_64_after_hwframe+0x4b/0x53
> >>
> >> Reproducer
> >> ```
> >> mdadm -CR /dev/md0 -l1 -n2 /dev/sd[cd]
> >> echo inactive > /sys/block/md0/md/array_state
> >> echo 10 > /sys/block/md0/md/new_level
> >> ```
> >>
> >
> > Hi Li Nan
> >
> >> mddev_init() can only be called once per mddev, no need to test if bioset
> >> has been initialized anymore.
> >
> > The patch looks good to me. But I don't understand the message here.
> > This patch changes the alloc/free bioset positions. What's the meaning
> > of "no need to test if bioset has been initialized anymore"?
> >
> > Regards
> > Xiao
>
> Hi Xiao
>
> Thanks for your review.
>
> Sorry for causing any misunderstanding.
> Old code:
> - if (!bioset_initialized(&mddev->bio_set)) {
> - err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
>
> New code:
> + err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
>
> bioset_initialized() is removed. Can I describe it as:
> mddev_init() can only be called once per mddev, thus bioset_initialized()
> can be removed.
I see, thanks very much for the explanation. The description is good to me.
Thanks
Xiao
>
> >>
> >> Fixes: d981ed841930 ("md: Add new_level sysfs interface")
> >> Signed-off-by: Li Nan <linan122@huawei.com>
> >> ---
> >> drivers/md/md.c | 69 +++++++++++++++++++++++--------------------------
> >> 1 file changed, 33 insertions(+), 36 deletions(-)
> >>
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index f6fd55a1637b..dffc6a482181 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -730,6 +730,8 @@ static void mddev_clear_bitmap_ops(struct mddev *mddev)
> >>
> >> int mddev_init(struct mddev *mddev)
> >> {
> >> + int err = 0;
> >> +
> >> if (!IS_ENABLED(CONFIG_MD_BITMAP))
> >> mddev->bitmap_id = ID_BITMAP_NONE;
> >> else
> >> @@ -741,10 +743,23 @@ int mddev_init(struct mddev *mddev)
> >>
> >> if (percpu_ref_init(&mddev->writes_pending, no_op,
> >> PERCPU_REF_ALLOW_REINIT, GFP_KERNEL)) {
> >> - percpu_ref_exit(&mddev->active_io);
> >> - return -ENOMEM;
> >> + err = -ENOMEM;
> >> + goto exit_active_io;
> >> }
> >>
> >> + err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
> >> + if (err)
> >> + goto exit_writes_pending;
> >> +
> >> + err = bioset_init(&mddev->sync_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
> >> + if (err)
> >> + goto exit_bio_set;
> >> +
> >> + err = bioset_init(&mddev->io_clone_set, BIO_POOL_SIZE,
> >> + offsetof(struct md_io_clone, bio_clone), 0);
> >> + if (err)
> >> + goto exit_sync_set;
> >> +
> >> /* We want to start with the refcount at zero */
> >> percpu_ref_put(&mddev->writes_pending);
> >>
> >> @@ -773,11 +788,24 @@ int mddev_init(struct mddev *mddev)
> >> INIT_WORK(&mddev->del_work, mddev_delayed_delete);
> >>
> >> return 0;
> >> +
> >> +exit_sync_set:
> >> + bioset_exit(&mddev->sync_set);
> >> +exit_bio_set:
> >> + bioset_exit(&mddev->bio_set);
> >> +exit_writes_pending:
> >> + percpu_ref_exit(&mddev->writes_pending);
> >> +exit_active_io:
> >> + percpu_ref_exit(&mddev->active_io);
> >> + return err;
> >> }
> >> EXPORT_SYMBOL_GPL(mddev_init);
> >>
> >> void mddev_destroy(struct mddev *mddev)
> >> {
> >> + bioset_exit(&mddev->bio_set);
> >> + bioset_exit(&mddev->sync_set);
> >> + bioset_exit(&mddev->io_clone_set);
> >> percpu_ref_exit(&mddev->active_io);
> >> percpu_ref_exit(&mddev->writes_pending);
> >> }
> >> @@ -6393,29 +6421,9 @@ int md_run(struct mddev *mddev)
> >> nowait = nowait && bdev_nowait(rdev->bdev);
> >> }
> >>
> >> - if (!bioset_initialized(&mddev->bio_set)) {
> >> - err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
> >> - if (err)
> >> - return err;
> >> - }
> >> - if (!bioset_initialized(&mddev->sync_set)) {
> >> - err = bioset_init(&mddev->sync_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
> >> - if (err)
> >> - goto exit_bio_set;
> >> - }
> >> -
> >> - if (!bioset_initialized(&mddev->io_clone_set)) {
> >> - err = bioset_init(&mddev->io_clone_set, BIO_POOL_SIZE,
> >> - offsetof(struct md_io_clone, bio_clone), 0);
> >> - if (err)
> >> - goto exit_sync_set;
> >> - }
> >> -
> >> pers = get_pers(mddev->level, mddev->clevel);
> >> - if (!pers) {
> >> - err = -EINVAL;
> >> - goto abort;
> >> - }
> >> + if (!pers)
> >> + return -EINVAL;
> >> if (mddev->level != pers->head.id) {
> >> mddev->level = pers->head.id;
> >> mddev->new_level = pers->head.id;
> >> @@ -6426,8 +6434,7 @@ int md_run(struct mddev *mddev)
> >> pers->start_reshape == NULL) {
> >> /* This personality cannot handle reshaping... */
> >> put_pers(pers);
> >> - err = -EINVAL;
> >> - goto abort;
> >> + return -EINVAL;
> >> }
> >>
> >> if (pers->sync_request) {
> >> @@ -6554,12 +6561,6 @@ int md_run(struct mddev *mddev)
> >> mddev->private = NULL;
> >> put_pers(pers);
> >> md_bitmap_destroy(mddev);
> >> -abort:
> >> - bioset_exit(&mddev->io_clone_set);
> >> -exit_sync_set:
> >> - bioset_exit(&mddev->sync_set);
> >> -exit_bio_set:
> >> - bioset_exit(&mddev->bio_set);
> >> return err;
> >> }
> >> EXPORT_SYMBOL_GPL(md_run);
> >> @@ -6784,10 +6785,6 @@ static void __md_stop(struct mddev *mddev)
> >> mddev->private = NULL;
> >> put_pers(pers);
> >> clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> >> -
> >> - bioset_exit(&mddev->bio_set);
> >> - bioset_exit(&mddev->sync_set);
> >> - bioset_exit(&mddev->io_clone_set);
> >> }
> >>
> >> void md_stop(struct mddev *mddev)
> >> --
> >> 2.39.2
> >>
> >
> >
> > .
>
> --
> Thanks,
> Nan
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v8 3/4] md/raid0: Move queue limit setup before r0conf initialization
2025-10-30 6:28 [PATCH v8 0/4] make logical block size configurable linan666
2025-10-30 6:28 ` [PATCH v8 1/4] md: delete md_redundancy_group when array is becoming inactive linan666
2025-10-30 6:28 ` [PATCH v8 2/4] md: init bioset in mddev_init linan666
@ 2025-10-30 6:28 ` linan666
2025-11-03 1:47 ` Xiao Ni
2025-10-30 6:28 ` [PATCH v8 4/4] md: allow configuring logical block size linan666
3 siblings, 1 reply; 12+ messages in thread
From: linan666 @ 2025-10-30 6:28 UTC (permalink / raw)
To: corbet, song, yukuai, linan122, hare, xni
Cc: linux-doc, linux-kernel, linux-raid, linan666, yangerkun,
yi.zhang
From: Li Nan <linan122@huawei.com>
Prepare for making the logical block size configurable. This change has
no impact until the logical block size becomes configurable.

Move raid0_set_limits() before create_strip_zones(). This is safe because
the fields modified in create_strip_zones() do not involve mddev
configuration, and the rdev modifications made there are not used in
raid0_set_limits().

'blksize' in create_strip_zones() now fetches the mddev's logical block
size, which is already the maximum across all rdevs, so the later max()
can be removed.
Signed-off-by: Li Nan <linan122@huawei.com>
---
drivers/md/raid0.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index e443e478645a..fbf763401521 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -68,7 +68,7 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
struct strip_zone *zone;
int cnt;
struct r0conf *conf = kzalloc(sizeof(*conf), GFP_KERNEL);
- unsigned blksize = 512;
+ unsigned int blksize = queue_logical_block_size(mddev->gendisk->queue);
*private_conf = ERR_PTR(-ENOMEM);
if (!conf)
@@ -84,9 +84,6 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
sector_div(sectors, mddev->chunk_sectors);
rdev1->sectors = sectors * mddev->chunk_sectors;
- blksize = max(blksize, queue_logical_block_size(
- rdev1->bdev->bd_disk->queue));
-
rdev_for_each(rdev2, mddev) {
pr_debug("md/raid0:%s: comparing %pg(%llu)"
" with %pg(%llu)\n",
@@ -405,6 +402,12 @@ static int raid0_run(struct mddev *mddev)
if (md_check_no_bitmap(mddev))
return -EINVAL;
+ if (!mddev_is_dm(mddev)) {
+ ret = raid0_set_limits(mddev);
+ if (ret)
+ return ret;
+ }
+
/* if private is not null, we are here after takeover */
if (mddev->private == NULL) {
ret = create_strip_zones(mddev, &conf);
@@ -413,11 +416,6 @@ static int raid0_run(struct mddev *mddev)
mddev->private = conf;
}
conf = mddev->private;
- if (!mddev_is_dm(mddev)) {
- ret = raid0_set_limits(mddev);
- if (ret)
- return ret;
- }
/* calculate array device size */
md_set_array_sectors(mddev, raid0_size(mddev, 0, 0));
--
2.39.2
^ permalink raw reply related [flat|nested] 12+ messages in thread

* Re: [PATCH v8 3/4] md/raid0: Move queue limit setup before r0conf initialization
2025-10-30 6:28 ` [PATCH v8 3/4] md/raid0: Move queue limit setup before r0conf initialization linan666
@ 2025-11-03 1:47 ` Xiao Ni
0 siblings, 0 replies; 12+ messages in thread
From: Xiao Ni @ 2025-11-03 1:47 UTC (permalink / raw)
To: linan666
Cc: corbet, song, yukuai, linan122, hare, linux-doc, linux-kernel,
linux-raid, yangerkun, yi.zhang
On Thu, Oct 30, 2025 at 2:36 PM <linan666@huaweicloud.com> wrote:
>
> From: Li Nan <linan122@huawei.com>
>
> Prepare for making logical blocksize configurable. This change has no
> impact until logical block size becomes configurable.
>
> Move raid0_set_limits() before create_strip_zones(). It is safe as fields
> modified in create_strip_zones() do not involve mddev configuration, and
> rdev modifications there are not used in raid0_set_limits().
>
> 'blksize' in create_strip_zones() fetches mddev's logical block size,
> which is already the maximum across all rdevs, so the later max() can be
> removed.
>
> Signed-off-by: Li Nan <linan122@huawei.com>
> ---
> drivers/md/raid0.c | 16 +++++++---------
> 1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index e443e478645a..fbf763401521 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -68,7 +68,7 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
> struct strip_zone *zone;
> int cnt;
> struct r0conf *conf = kzalloc(sizeof(*conf), GFP_KERNEL);
> - unsigned blksize = 512;
> + unsigned int blksize = queue_logical_block_size(mddev->gendisk->queue);
>
> *private_conf = ERR_PTR(-ENOMEM);
> if (!conf)
> @@ -84,9 +84,6 @@ static int create_strip_zones(struct mddev *mddev, struct r0conf **private_conf)
> sector_div(sectors, mddev->chunk_sectors);
> rdev1->sectors = sectors * mddev->chunk_sectors;
>
> - blksize = max(blksize, queue_logical_block_size(
> - rdev1->bdev->bd_disk->queue));
> -
> rdev_for_each(rdev2, mddev) {
> pr_debug("md/raid0:%s: comparing %pg(%llu)"
> " with %pg(%llu)\n",
> @@ -405,6 +402,12 @@ static int raid0_run(struct mddev *mddev)
> if (md_check_no_bitmap(mddev))
> return -EINVAL;
>
> + if (!mddev_is_dm(mddev)) {
> + ret = raid0_set_limits(mddev);
> + if (ret)
> + return ret;
> + }
> +
> /* if private is not null, we are here after takeover */
> if (mddev->private == NULL) {
> ret = create_strip_zones(mddev, &conf);
> @@ -413,11 +416,6 @@ static int raid0_run(struct mddev *mddev)
> mddev->private = conf;
> }
> conf = mddev->private;
> - if (!mddev_is_dm(mddev)) {
> - ret = raid0_set_limits(mddev);
> - if (ret)
> - return ret;
> - }
>
> /* calculate array device size */
> md_set_array_sectors(mddev, raid0_size(mddev, 0, 0));
> --
> 2.39.2
>
Looks good to me.
Reviewed-by: Xiao Ni <xni@redhat.com>
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v8 4/4] md: allow configuring logical block size
2025-10-30 6:28 [PATCH v8 0/4] make logical block size configurable linan666
` (2 preceding siblings ...)
2025-10-30 6:28 ` [PATCH v8 3/4] md/raid0: Move queue limit setup before r0conf initialization linan666
@ 2025-10-30 6:28 ` linan666
2025-11-03 3:11 ` Xiao Ni
3 siblings, 1 reply; 12+ messages in thread
From: linan666 @ 2025-10-30 6:28 UTC (permalink / raw)
To: corbet, song, yukuai, linan122, hare, xni
Cc: linux-doc, linux-kernel, linux-raid, linan666, yangerkun,
yi.zhang
From: Li Nan <linan122@huawei.com>
Previously, a raid array used the maximum logical block size (LBS)
of all member disks. Adding a disk with a larger LBS at runtime could
unexpectedly increase the array's LBS, risking corruption of existing
partitions. This can be reproduced by:
```
# LBS of sd[de] is 512 bytes, sdf is 4096 bytes.
mdadm -CRq /dev/md0 -l1 -n3 /dev/sd[de] missing --assume-clean
# LBS is 512
cat /sys/block/md0/queue/logical_block_size
# create partition md0p1
parted -s /dev/md0 mklabel gpt mkpart primary 1MiB 100%
lsblk | grep md0p1
# LBS becomes 4096 after adding sdf
mdadm --add -q /dev/md0 /dev/sdf
cat /sys/block/md0/queue/logical_block_size
# partition lost
partprobe /dev/md0
lsblk | grep md0p1
```
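Why the partition "disappears": GPT locates its header at LBA 1, and that LBA's byte offset scales with the logical block size, so a table written while the device reported 512-byte blocks is simply not found once it reports 4096-byte blocks. A tiny illustration (editor's sketch, not part of the patch):

```python
# Illustration only: GPT addresses its on-disk structures in logical
# blocks (LBAs), so the byte offset of the header at LBA 1 tracks
# the logical block size reported by the device.
def gpt_header_offset(logical_block_size):
    """Byte offset of the GPT header, which lives at LBA 1."""
    return 1 * logical_block_size

# A header written while the array reported 512-byte blocks...
offset_at_creation = gpt_header_offset(512)
# ...is not where the scanner looks after the LBS grows to 4096.
offset_after_grow = gpt_header_offset(4096)
assert offset_at_creation != offset_after_grow
```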
Simply rejecting disks with a larger LBS is inflexible. In some scenarios,
only disks with a 512-byte LBS are available at creation time, but disks
with a 4KB LBS may be added to the array later.
Making the LBS configurable is the best way to handle this scenario.
After this patch, the raid will:
- store LBS in disk metadata
- add a read-write sysfs 'mdX/logical_block_size'
Future mdadm should support setting LBS via metadata field during RAID
creation and the new sysfs. Though the kernel allows runtime LBS changes,
users should avoid modifying it after creating partitions or filesystems
to prevent compatibility issues.
Only 1.x metadata supports a configurable LBS. 0.90 metadata initializes
all fields to default values at auto-detect. Supporting 0.90 would require
more extensive changes, and no such use case has been observed.
Note that many RAID paths rely on PAGE_SIZE alignment, including for
metadata I/O. An LBS larger than PAGE_SIZE would cause metadata
read/write failures, so such a configuration must be rejected.
Signed-off-by: Li Nan <linan122@huawei.com>
---
Documentation/admin-guide/md.rst | 7 +++
drivers/md/md.h | 1 +
include/uapi/linux/raid/md_p.h | 3 +-
drivers/md/md-linear.c | 1 +
drivers/md/md.c | 77 ++++++++++++++++++++++++++++++++
drivers/md/raid0.c | 1 +
drivers/md/raid1.c | 1 +
drivers/md/raid10.c | 1 +
drivers/md/raid5.c | 1 +
9 files changed, 92 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index 1c2eacc94758..0f143acd2db7 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -238,6 +238,13 @@ All md devices contain:
the number of devices in a raid4/5/6, or to support external
metadata formats which mandate such clipping.
+ logical_block_size
+ Configure the array's logical block size in bytes. This attribute
+ is only supported for 1.x meta. The value should be written before
+ starting the array. The final array LBS will use the max value
+ between this configuration and all combined device's LBS. Note that
+ LBS cannot exceed PAGE_SIZE before RAID supports folio.
+
reshape_position
This is either ``none`` or a sector number within the devices of
the array where ``reshape`` is up to. If this is set, the three
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 38a7c2fab150..a6b3cb69c28c 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -432,6 +432,7 @@ struct mddev {
sector_t array_sectors; /* exported array size */
int external_size; /* size managed
* externally */
+ unsigned int logical_block_size;
__u64 events;
/* If the last 'event' was simply a clean->dirty transition, and
* we didn't write it to the spares, then it is safe and simple
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index ac74133a4768..310068bb2a1d 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -291,7 +291,8 @@ struct mdp_superblock_1 {
__le64 resync_offset; /* data before this offset (from data_offset) known to be in sync */
__le32 sb_csum; /* checksum up to devs[max_dev] */
__le32 max_dev; /* size of devs[] array to consider */
- __u8 pad3[64-32]; /* set to 0 when writing */
+ __le32 logical_block_size; /* same as q->limits->logical_block_size */
+ __u8 pad3[64-36]; /* set to 0 when writing */
/* device state information. Indexed by dev_number.
* 2 bytes per device
diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
index 7033d982d377..50d4a419a16e 100644
--- a/drivers/md/md-linear.c
+++ b/drivers/md/md-linear.c
@@ -72,6 +72,7 @@ static int linear_set_limits(struct mddev *mddev)
md_init_stacking_limits(&lim);
lim.max_hw_sectors = mddev->chunk_sectors;
+ lim.logical_block_size = mddev->logical_block_size;
lim.max_write_zeroes_sectors = mddev->chunk_sectors;
lim.max_hw_wzeroes_unmap_sectors = mddev->chunk_sectors;
lim.io_min = mddev->chunk_sectors << 9;
diff --git a/drivers/md/md.c b/drivers/md/md.c
index dffc6a482181..d78e9e52c951 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1993,6 +1993,7 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *freshest, struc
mddev->layout = le32_to_cpu(sb->layout);
mddev->raid_disks = le32_to_cpu(sb->raid_disks);
mddev->dev_sectors = le64_to_cpu(sb->size);
+ mddev->logical_block_size = le32_to_cpu(sb->logical_block_size);
mddev->events = ev1;
mddev->bitmap_info.offset = 0;
mddev->bitmap_info.space = 0;
@@ -2202,6 +2203,7 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
sb->chunksize = cpu_to_le32(mddev->chunk_sectors);
sb->level = cpu_to_le32(mddev->level);
sb->layout = cpu_to_le32(mddev->layout);
+ sb->logical_block_size = cpu_to_le32(mddev->logical_block_size);
if (test_bit(FailFast, &rdev->flags))
sb->devflags |= FailFast1;
else
@@ -5930,6 +5932,68 @@ static struct md_sysfs_entry md_serialize_policy =
__ATTR(serialize_policy, S_IRUGO | S_IWUSR, serialize_policy_show,
serialize_policy_store);
+static int mddev_set_logical_block_size(struct mddev *mddev,
+ unsigned int lbs)
+{
+ int err = 0;
+ struct queue_limits lim;
+
+ if (queue_logical_block_size(mddev->gendisk->queue) >= lbs) {
+ pr_err("%s: Cannot set LBS smaller than mddev LBS %u\n",
+ mdname(mddev), lbs);
+ return -EINVAL;
+ }
+
+ lim = queue_limits_start_update(mddev->gendisk->queue);
+ lim.logical_block_size = lbs;
+ pr_info("%s: logical_block_size is changed, data may be lost\n",
+ mdname(mddev));
+ err = queue_limits_commit_update(mddev->gendisk->queue, &lim);
+ if (err)
+ return err;
+
+ mddev->logical_block_size = lbs;
+ /* New lbs will be written to superblock after array is running */
+ set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
+ return 0;
+}
+
+static ssize_t
+lbs_show(struct mddev *mddev, char *page)
+{
+ return sprintf(page, "%u\n", mddev->logical_block_size);
+}
+
+static ssize_t
+lbs_store(struct mddev *mddev, const char *buf, size_t len)
+{
+ unsigned int lbs;
+ int err = -EBUSY;
+
+ /* Only 1.x meta supports configurable LBS */
+ if (mddev->major_version == 0)
+ return -EINVAL;
+
+ if (mddev->pers)
+ return -EBUSY;
+
+ err = kstrtouint(buf, 10, &lbs);
+ if (err < 0)
+ return -EINVAL;
+
+ err = mddev_lock(mddev);
+ if (err)
+ goto unlock;
+
+ err = mddev_set_logical_block_size(mddev, lbs);
+
+unlock:
+ mddev_unlock(mddev);
+ return err ?: len;
+}
+
+static struct md_sysfs_entry md_logical_block_size =
+__ATTR(logical_block_size, 0644, lbs_show, lbs_store);
static struct attribute *md_default_attrs[] = {
&md_level.attr,
@@ -5952,6 +6016,7 @@ static struct attribute *md_default_attrs[] = {
&md_consistency_policy.attr,
&md_fail_last_dev.attr,
&md_serialize_policy.attr,
+ &md_logical_block_size.attr,
NULL,
};
@@ -6082,6 +6147,17 @@ int mddev_stack_rdev_limits(struct mddev *mddev, struct queue_limits *lim,
return -EINVAL;
}
+ /*
+ * Before RAID adding folio support, the logical_block_size
+ * should be smaller than the page size.
+ */
+ if (lim->logical_block_size > PAGE_SIZE) {
+ pr_err("%s: logical_block_size must not be larger than PAGE_SIZE\n",
+ mdname(mddev));
+ return -EINVAL;
+ }
+ mddev->logical_block_size = lim->logical_block_size;
+
return 0;
}
EXPORT_SYMBOL_GPL(mddev_stack_rdev_limits);
@@ -6693,6 +6769,7 @@ static void md_clean(struct mddev *mddev)
mddev->chunk_sectors = 0;
mddev->ctime = mddev->utime = 0;
mddev->layout = 0;
+ mddev->logical_block_size = 0;
mddev->max_disks = 0;
mddev->events = 0;
mddev->can_decrease_events = 0;
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index fbf763401521..47aee1b1d4d1 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -380,6 +380,7 @@ static int raid0_set_limits(struct mddev *mddev)
lim.max_hw_sectors = mddev->chunk_sectors;
lim.max_write_zeroes_sectors = mddev->chunk_sectors;
lim.max_hw_wzeroes_unmap_sectors = mddev->chunk_sectors;
+ lim.logical_block_size = mddev->logical_block_size;
lim.io_min = mddev->chunk_sectors << 9;
lim.io_opt = lim.io_min * mddev->raid_disks;
lim.chunk_sectors = mddev->chunk_sectors;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 64bfe8ca5b38..167768edaec1 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3212,6 +3212,7 @@ static int raid1_set_limits(struct mddev *mddev)
md_init_stacking_limits(&lim);
lim.max_write_zeroes_sectors = 0;
lim.max_hw_wzeroes_unmap_sectors = 0;
+ lim.logical_block_size = mddev->logical_block_size;
lim.features |= BLK_FEAT_ATOMIC_WRITES;
err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
if (err)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 6b2d4b7057ae..71bfed3b798d 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -4000,6 +4000,7 @@ static int raid10_set_queue_limits(struct mddev *mddev)
md_init_stacking_limits(&lim);
lim.max_write_zeroes_sectors = 0;
lim.max_hw_wzeroes_unmap_sectors = 0;
+ lim.logical_block_size = mddev->logical_block_size;
lim.io_min = mddev->chunk_sectors << 9;
lim.chunk_sectors = mddev->chunk_sectors;
lim.io_opt = lim.io_min * raid10_nr_stripes(conf);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index aa404abf5d17..92473850f381 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7747,6 +7747,7 @@ static int raid5_set_limits(struct mddev *mddev)
stripe = roundup_pow_of_two(data_disks * (mddev->chunk_sectors << 9));
md_init_stacking_limits(&lim);
+ lim.logical_block_size = mddev->logical_block_size;
lim.io_min = mddev->chunk_sectors << 9;
lim.io_opt = lim.io_min * (conf->raid_disks - conf->max_degraded);
lim.features |= BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE;
--
2.39.2
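For reference, the acceptance rules this patch enforces for writes to the new attribute (lbs_store(), plus the PAGE_SIZE check applied later in mddev_stack_rdev_limits()) can be restated in user space. This is an editor's sketch, not kernel code; PAGE_SIZE = 4096 is an assumption of the sketch:

```python
# Sketch (illustration only): when a write to mdX/logical_block_size
# is accepted, mirroring the checks in the patch above.
PAGE_SIZE = 4096  # assumption; architecture-dependent in the kernel

def lbs_store_ok(major_version, running, current_lbs, new_lbs):
    if major_version == 0:      # only 1.x metadata supports this
        return False
    if running:                 # array must not be started yet
        return False
    if current_lbs >= new_lbs:  # cannot shrink or keep the queue's LBS
        return False
    if new_lbs > PAGE_SIZE:     # rejected when limits are stacked
        return False
    return True

assert lbs_store_ok(1, False, 512, 4096)
assert not lbs_store_ok(0, False, 512, 4096)  # 0.90 metadata
assert not lbs_store_ok(1, True, 512, 4096)   # array already running
```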
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v8 4/4] md: allow configuring logical block size
2025-10-30 6:28 ` [PATCH v8 4/4] md: allow configuring logical block size linan666
@ 2025-11-03 3:11 ` Xiao Ni
2025-11-03 13:09 ` Li Nan
0 siblings, 1 reply; 12+ messages in thread
From: Xiao Ni @ 2025-11-03 3:11 UTC (permalink / raw)
To: linan666
Cc: corbet, song, yukuai, linan122, hare, linux-doc, linux-kernel,
linux-raid, yangerkun, yi.zhang
On Thu, Oct 30, 2025 at 2:36 PM <linan666@huaweicloud.com> wrote:
>
> From: Li Nan <linan122@huawei.com>
>
> Previously, raid array used the maximum logical block size (LBS)
> of all member disks. Adding a larger LBS disk at runtime could
> unexpectedly increase RAID's LBS, risking corruption of existing
> partitions. This can be reproduced by:
>
> ```
> # LBS of sd[de] is 512 bytes, sdf is 4096 bytes.
> mdadm -CRq /dev/md0 -l1 -n3 /dev/sd[de] missing --assume-clean
>
> # LBS is 512
> cat /sys/block/md0/queue/logical_block_size
>
> # create partition md0p1
> parted -s /dev/md0 mklabel gpt mkpart primary 1MiB 100%
> lsblk | grep md0p1
>
> # LBS becomes 4096 after adding sdf
> mdadm --add -q /dev/md0 /dev/sdf
> cat /sys/block/md0/queue/logical_block_size
>
> # partition lost
> partprobe /dev/md0
> lsblk | grep md0p1
> ```
>
> Simply restricting larger-LBS disks is inflexible. In some scenarios,
> only disks with 512 bytes LBS are available currently, but later, disks
> with 4KB LBS may be added to the array.
>
> Making LBS configurable is the best way to solve this scenario.
> After this patch, the raid will:
> - store LBS in disk metadata
> - add a read-write sysfs 'mdX/logical_block_size'
>
> Future mdadm should support setting LBS via metadata field during RAID
> creation and the new sysfs. Though the kernel allows runtime LBS changes,
> users should avoid modifying it after creating partitions or filesystems
> to prevent compatibility issues.
>
> Only 1.x metadata supports configurable LBS. 0.90 metadata inits all
> fields to default values at auto-detect. Supporting 0.90 would require
> more extensive changes and no such use case has been observed.
>
> Note that many RAID paths rely on PAGE_SIZE alignment, including for
> metadata I/O. A larger LBS than PAGE_SIZE will result in metadata
> read/write failures. So this config should be prevented.
>
> Signed-off-by: Li Nan <linan122@huawei.com>
> ---
> Documentation/admin-guide/md.rst | 7 +++
> drivers/md/md.h | 1 +
> include/uapi/linux/raid/md_p.h | 3 +-
> drivers/md/md-linear.c | 1 +
> drivers/md/md.c | 77 ++++++++++++++++++++++++++++++++
> drivers/md/raid0.c | 1 +
> drivers/md/raid1.c | 1 +
> drivers/md/raid10.c | 1 +
> drivers/md/raid5.c | 1 +
> 9 files changed, 92 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> index 1c2eacc94758..0f143acd2db7 100644
> --- a/Documentation/admin-guide/md.rst
> +++ b/Documentation/admin-guide/md.rst
> @@ -238,6 +238,13 @@ All md devices contain:
> the number of devices in a raid4/5/6, or to support external
> metadata formats which mandate such clipping.
>
> + logical_block_size
> + Configure the array's logical block size in bytes. This attribute
> + is only supported for 1.x meta. The value should be written before
> + starting the array. The final array LBS will use the max value
> + between this configuration and all combined device's LBS. Note that
> + LBS cannot exceed PAGE_SIZE before RAID supports folio.
> +
> reshape_position
> This is either ``none`` or a sector number within the devices of
> the array where ``reshape`` is up to. If this is set, the three
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 38a7c2fab150..a6b3cb69c28c 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -432,6 +432,7 @@ struct mddev {
> sector_t array_sectors; /* exported array size */
> int external_size; /* size managed
> * externally */
> + unsigned int logical_block_size;
> __u64 events;
> /* If the last 'event' was simply a clean->dirty transition, and
> * we didn't write it to the spares, then it is safe and simple
> diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
> index ac74133a4768..310068bb2a1d 100644
> --- a/include/uapi/linux/raid/md_p.h
> +++ b/include/uapi/linux/raid/md_p.h
> @@ -291,7 +291,8 @@ struct mdp_superblock_1 {
> __le64 resync_offset; /* data before this offset (from data_offset) known to be in sync */
> __le32 sb_csum; /* checksum up to devs[max_dev] */
> __le32 max_dev; /* size of devs[] array to consider */
> - __u8 pad3[64-32]; /* set to 0 when writing */
> + __le32 logical_block_size; /* same as q->limits->logical_block_size */
> + __u8 pad3[64-36]; /* set to 0 when writing */
>
> /* device state information. Indexed by dev_number.
> * 2 bytes per device
> diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
> index 7033d982d377..50d4a419a16e 100644
> --- a/drivers/md/md-linear.c
> +++ b/drivers/md/md-linear.c
> @@ -72,6 +72,7 @@ static int linear_set_limits(struct mddev *mddev)
>
> md_init_stacking_limits(&lim);
> lim.max_hw_sectors = mddev->chunk_sectors;
> + lim.logical_block_size = mddev->logical_block_size;
> lim.max_write_zeroes_sectors = mddev->chunk_sectors;
> lim.max_hw_wzeroes_unmap_sectors = mddev->chunk_sectors;
> lim.io_min = mddev->chunk_sectors << 9;
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index dffc6a482181..d78e9e52c951 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1993,6 +1993,7 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *freshest, struc
> mddev->layout = le32_to_cpu(sb->layout);
> mddev->raid_disks = le32_to_cpu(sb->raid_disks);
> mddev->dev_sectors = le64_to_cpu(sb->size);
> + mddev->logical_block_size = le32_to_cpu(sb->logical_block_size);
> mddev->events = ev1;
> mddev->bitmap_info.offset = 0;
> mddev->bitmap_info.space = 0;
> @@ -2202,6 +2203,7 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
> sb->chunksize = cpu_to_le32(mddev->chunk_sectors);
> sb->level = cpu_to_le32(mddev->level);
> sb->layout = cpu_to_le32(mddev->layout);
> + sb->logical_block_size = cpu_to_le32(mddev->logical_block_size);
> if (test_bit(FailFast, &rdev->flags))
> sb->devflags |= FailFast1;
> else
> @@ -5930,6 +5932,68 @@ static struct md_sysfs_entry md_serialize_policy =
> __ATTR(serialize_policy, S_IRUGO | S_IWUSR, serialize_policy_show,
> serialize_policy_store);
>
> +static int mddev_set_logical_block_size(struct mddev *mddev,
> + unsigned int lbs)
> +{
> + int err = 0;
> + struct queue_limits lim;
> +
> + if (queue_logical_block_size(mddev->gendisk->queue) >= lbs) {
> + pr_err("%s: Cannot set LBS smaller than mddev LBS %u\n",
> + mdname(mddev), lbs);
> + return -EINVAL;
> + }
> +
> + lim = queue_limits_start_update(mddev->gendisk->queue);
> + lim.logical_block_size = lbs;
> + pr_info("%s: logical_block_size is changed, data may be lost\n",
> + mdname(mddev));
> + err = queue_limits_commit_update(mddev->gendisk->queue, &lim);
> + if (err)
> + return err;
> +
> + mddev->logical_block_size = lbs;
> + /* New lbs will be written to superblock after array is running */
> + set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);
> + return 0;
> +}
> +
> +static ssize_t
> +lbs_show(struct mddev *mddev, char *page)
> +{
> + return sprintf(page, "%u\n", mddev->logical_block_size);
> +}
> +
> +static ssize_t
> +lbs_store(struct mddev *mddev, const char *buf, size_t len)
> +{
> + unsigned int lbs;
> + int err = -EBUSY;
> +
> + /* Only 1.x meta supports configurable LBS */
> + if (mddev->major_version == 0)
> + return -EINVAL;
> +
> + if (mddev->pers)
> + return -EBUSY;
> +
> + err = kstrtouint(buf, 10, &lbs);
> + if (err < 0)
> + return -EINVAL;
> +
> + err = mddev_lock(mddev);
> + if (err)
> + goto unlock;
> +
> + err = mddev_set_logical_block_size(mddev, lbs);
> +
> +unlock:
> + mddev_unlock(mddev);
> + return err ?: len;
> +}
> +
> +static struct md_sysfs_entry md_logical_block_size =
> +__ATTR(logical_block_size, 0644, lbs_show, lbs_store);
>
> static struct attribute *md_default_attrs[] = {
> &md_level.attr,
> @@ -5952,6 +6016,7 @@ static struct attribute *md_default_attrs[] = {
> &md_consistency_policy.attr,
> &md_fail_last_dev.attr,
> &md_serialize_policy.attr,
> + &md_logical_block_size.attr,
> NULL,
> };
>
> @@ -6082,6 +6147,17 @@ int mddev_stack_rdev_limits(struct mddev *mddev, struct queue_limits *lim,
> return -EINVAL;
> }
>
> + /*
> + * Before RAID adding folio support, the logical_block_size
> + * should be smaller than the page size.
> + */
> + if (lim->logical_block_size > PAGE_SIZE) {
> + pr_err("%s: logical_block_size must not larger than PAGE_SIZE\n",
> + mdname(mddev));
> + return -EINVAL;
> + }
> + mddev->logical_block_size = lim->logical_block_size;
> +
> return 0;
> }
> EXPORT_SYMBOL_GPL(mddev_stack_rdev_limits);
> @@ -6693,6 +6769,7 @@ static void md_clean(struct mddev *mddev)
> mddev->chunk_sectors = 0;
> mddev->ctime = mddev->utime = 0;
> mddev->layout = 0;
> + mddev->logical_block_size = 0;
> mddev->max_disks = 0;
> mddev->events = 0;
> mddev->can_decrease_events = 0;
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index fbf763401521..47aee1b1d4d1 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -380,6 +380,7 @@ static int raid0_set_limits(struct mddev *mddev)
> lim.max_hw_sectors = mddev->chunk_sectors;
> lim.max_write_zeroes_sectors = mddev->chunk_sectors;
> lim.max_hw_wzeroes_unmap_sectors = mddev->chunk_sectors;
> + lim.logical_block_size = mddev->logical_block_size;
> lim.io_min = mddev->chunk_sectors << 9;
> lim.io_opt = lim.io_min * mddev->raid_disks;
> lim.chunk_sectors = mddev->chunk_sectors;
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 64bfe8ca5b38..167768edaec1 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -3212,6 +3212,7 @@ static int raid1_set_limits(struct mddev *mddev)
> md_init_stacking_limits(&lim);
> lim.max_write_zeroes_sectors = 0;
> lim.max_hw_wzeroes_unmap_sectors = 0;
> + lim.logical_block_size = mddev->logical_block_size;
> lim.features |= BLK_FEAT_ATOMIC_WRITES;
> err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
> if (err)
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 6b2d4b7057ae..71bfed3b798d 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -4000,6 +4000,7 @@ static int raid10_set_queue_limits(struct mddev *mddev)
> md_init_stacking_limits(&lim);
> lim.max_write_zeroes_sectors = 0;
> lim.max_hw_wzeroes_unmap_sectors = 0;
> + lim.logical_block_size = mddev->logical_block_size;
> lim.io_min = mddev->chunk_sectors << 9;
> lim.chunk_sectors = mddev->chunk_sectors;
> lim.io_opt = lim.io_min * raid10_nr_stripes(conf);
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index aa404abf5d17..92473850f381 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7747,6 +7747,7 @@ static int raid5_set_limits(struct mddev *mddev)
> stripe = roundup_pow_of_two(data_disks * (mddev->chunk_sectors << 9));
>
> md_init_stacking_limits(&lim);
> + lim.logical_block_size = mddev->logical_block_size;
> lim.io_min = mddev->chunk_sectors << 9;
> lim.io_opt = lim.io_min * (conf->raid_disks - conf->max_degraded);
> lim.features |= BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE;
> --
> 2.39.2
>
Hi Li Nan
The problem can't be fixed if there is no user space (mdadm) patch, right?
The patch looks good to me.
Reviewed-by: Xiao Ni <xni@redhat.com>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v8 4/4] md: allow configuring logical block size
2025-11-03 3:11 ` Xiao Ni
@ 2025-11-03 13:09 ` Li Nan
0 siblings, 0 replies; 12+ messages in thread
From: Li Nan @ 2025-11-03 13:09 UTC (permalink / raw)
To: Xiao Ni, linan666
Cc: corbet, song, yukuai, hare, linux-doc, linux-kernel, linux-raid,
yangerkun, yi.zhang
On 2025/11/3 11:11, Xiao Ni wrote:
> On Thu, Oct 30, 2025 at 2:36 PM <linan666@huaweicloud.com> wrote:
>>
>> From: Li Nan <linan122@huawei.com>
>>
>> Previously, raid array used the maximum logical block size (LBS)
>> of all member disks. Adding a larger LBS disk at runtime could
>> unexpectedly increase RAID's LBS, risking corruption of existing
>> partitions. This can be reproduced by:
>>
>> ```
>> # LBS of sd[de] is 512 bytes, sdf is 4096 bytes.
>> mdadm -CRq /dev/md0 -l1 -n3 /dev/sd[de] missing --assume-clean
>>
>> # LBS is 512
>> cat /sys/block/md0/queue/logical_block_size
>>
>> # create partition md0p1
>> parted -s /dev/md0 mklabel gpt mkpart primary 1MiB 100%
>> lsblk | grep md0p1
>>
>> # LBS becomes 4096 after adding sdf
>> mdadm --add -q /dev/md0 /dev/sdf
>> cat /sys/block/md0/queue/logical_block_size
>>
>> # partition lost
>> partprobe /dev/md0
>> lsblk | grep md0p1
>> ```
>>
>> Simply restricting larger-LBS disks is inflexible. In some scenarios,
>> only disks with 512 bytes LBS are available currently, but later, disks
>> with 4KB LBS may be added to the array.
>>
>> Making LBS configurable is the best way to solve this scenario.
>> After this patch, the raid will:
>> - store LBS in disk metadata
>> - add a read-write sysfs 'mdX/logical_block_size'
>>
>> Future mdadm should support setting LBS via metadata field during RAID
>> creation and the new sysfs. Though the kernel allows runtime LBS changes,
>> users should avoid modifying it after creating partitions or filesystems
>> to prevent compatibility issues.
>>
>> Only 1.x metadata supports configurable LBS. 0.90 metadata inits all
>> fields to default values at auto-detect. Supporting 0.90 would require
>> more extensive changes and no such use case has been observed.
>>
>> Note that many RAID paths rely on PAGE_SIZE alignment, including for
>> metadata I/O. A larger LBS than PAGE_SIZE will result in metadata
>> read/write failures. So this config should be prevented.
>>
>> Signed-off-by: Li Nan <linan122@huawei.com>
> Hi Li Nan
>
Hi Xiao,
Thanks for your review.
> The problem can't be fixed if there is no user space (mdadm) patch, right?
>
Yeah, mdadm should be updated at the same time. Guanghao will send an
mdadm patch later.
> The patch Looks good to me.
> Reviewed-by: Xiao Ni <xni@redhat.com>
>
Sorry for the trouble. I sent v9 with some changes to the
Documentation. Could you please review the v9 patch when you have time?
>
--
Thanks,
Nan
^ permalink raw reply [flat|nested] 12+ messages in thread