linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] block: fix kobject double initialization in add_disk
@ 2025-08-07  7:20 Zheng Qixing
  2025-08-07  8:42 ` Yu Kuai
  2025-08-07 11:47 ` Nilay Shroff
  0 siblings, 2 replies; 8+ messages in thread
From: Zheng Qixing @ 2025-08-07  7:20 UTC (permalink / raw)
  To: axboe
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun, houtao1,
	zhengqixing

From: Zheng Qixing <zhengqixing@huawei.com>

Device-mapper can call add_disk() multiple times for the same gendisk
due to its two-phase creation process (dm create + dm load). This leads
to kobject double initialization errors when the underlying iSCSI devices
become temporarily unavailable and then reappear.

However, if the first add_disk() call fails and is retried, the queue_kobj
gets initialized twice, causing:

kobject: kobject (ffff88810c27bb90): tried to init an initialized object,
something is seriously wrong.
 Call Trace:
  <TASK>
  dump_stack_lvl+0x5b/0x80
  kobject_init.cold+0x43/0x51
  blk_register_queue+0x46/0x280
  add_disk_fwnode+0xb5/0x280
  dm_setup_md_queue+0x194/0x1c0
  table_load+0x297/0x2d0
  ctl_ioctl+0x2a2/0x480
  dm_ctl_ioctl+0xe/0x20
  __x64_sys_ioctl+0xc7/0x110
  do_syscall_64+0x72/0x390
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix this by separating kobject initialization from sysfs registration:
 - Initialize queue_kobj early during gendisk allocation
 - add_disk() only adds the already-initialized kobject to sysfs
 - del_gendisk() removes from sysfs but doesn't destroy the kobject
 - Final cleanup happens when the disk is released

Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs")
Reported-by: Li Lingfeng <lilingfeng3@huawei.com>
Closes: https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@huawei.com/
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
---
 block/blk-sysfs.c | 4 +---
 block/blk.h       | 1 +
 block/genhd.c     | 2 ++
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 396cded255ea..37d8654faff9 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -847,7 +847,7 @@ static void blk_queue_release(struct kobject *kobj)
 	/* nothing to do here, all data is associated with the parent gendisk */
 }
 
-static const struct kobj_type blk_queue_ktype = {
+const struct kobj_type blk_queue_ktype = {
 	.default_groups = blk_queue_attr_groups,
 	.sysfs_ops	= &queue_sysfs_ops,
 	.release	= blk_queue_release,
@@ -875,7 +875,6 @@ int blk_register_queue(struct gendisk *disk)
 	struct request_queue *q = disk->queue;
 	int ret;
 
-	kobject_init(&disk->queue_kobj, &blk_queue_ktype);
 	ret = kobject_add(&disk->queue_kobj, &disk_to_dev(disk)->kobj, "queue");
 	if (ret < 0)
 		goto out_put_queue_kobj;
@@ -986,5 +985,4 @@ void blk_unregister_queue(struct gendisk *disk)
 		elevator_set_none(q);
 
 	blk_debugfs_remove(disk);
-	kobject_put(&disk->queue_kobj);
 }
diff --git a/block/blk.h b/block/blk.h
index 0a2eccf28ca4..46f566f9b126 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -29,6 +29,7 @@ struct elevator_tags;
 /* Max future timer expiry for timeouts */
 #define BLK_MAX_TIMEOUT		(5 * HZ)
 
+extern const struct kobj_type blk_queue_ktype;
 extern struct dentry *blk_debugfs_root;
 
 struct blk_flush_queue {
diff --git a/block/genhd.c b/block/genhd.c
index c26733f6324b..9bbc38d12792 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1303,6 +1303,7 @@ static void disk_release(struct device *dev)
 	disk_free_zone_resources(disk);
 	xa_destroy(&disk->part_tbl);
 
+	kobject_put(&disk->queue_kobj);
 	disk->queue->disk = NULL;
 	blk_put_queue(disk->queue);
 
@@ -1486,6 +1487,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 	INIT_LIST_HEAD(&disk->slave_bdevs);
 #endif
 	mutex_init(&disk->rqos_state_mutex);
+	kobject_init(&disk->queue_kobj, &blk_queue_ktype);
 	return disk;
 
 out_erase_part0:
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] block: fix kobject double initialization in add_disk
  2025-08-07  7:20 [PATCH] block: fix kobject double initialization in add_disk Zheng Qixing
@ 2025-08-07  8:42 ` Yu Kuai
  2025-08-07 11:47 ` Nilay Shroff
  1 sibling, 0 replies; 8+ messages in thread
From: Yu Kuai @ 2025-08-07  8:42 UTC (permalink / raw)
  To: Zheng Qixing, axboe
  Cc: linux-block, linux-kernel, yi.zhang, yangerkun, houtao1,
	zhengqixing, yukuai (C)

在 2025/08/07 15:20, Zheng Qixing 写道:
> From: Zheng Qixing<zhengqixing@huawei.com>
> 
> Device-mapper can call add_disk() multiple times for the same gendisk
> due to its two-phase creation process (dm create + dm load). This leads
> to kobject double initialization errors when the underlying iSCSI devices
> become temporarily unavailable and then reappear.
> 
> However, if the first add_disk() call fails and is retried, the queue_kobj
> gets initialized twice, causing:
> 
> kobject: kobject (ffff88810c27bb90): tried to init an initialized object,
> something is seriously wrong.
>   Call Trace:
>    <TASK>
>    dump_stack_lvl+0x5b/0x80
>    kobject_init.cold+0x43/0x51
>    blk_register_queue+0x46/0x280
>    add_disk_fwnode+0xb5/0x280
>    dm_setup_md_queue+0x194/0x1c0
>    table_load+0x297/0x2d0
>    ctl_ioctl+0x2a2/0x480
>    dm_ctl_ioctl+0xe/0x20
>    __x64_sys_ioctl+0xc7/0x110
>    do_syscall_64+0x72/0x390
>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> Fix this by separating kobject initialization from sysfs registration:
>   - Initialize queue_kobj early during gendisk allocation
>   - add_disk() only adds the already-initialized kobject to sysfs
>   - del_gendisk() removes from sysfs but doesn't destroy the kobject
>   - Final cleanup happens when the disk is released
> 
> Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs")
> Reported-by: Li Lingfeng<lilingfeng3@huawei.com>
> Closes:https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@huawei.com/
> Signed-off-by: Zheng Qixing<zhengqixing@huawei.com>
> ---
>   block/blk-sysfs.c | 4 +---
>   block/blk.h       | 1 +
>   block/genhd.c     | 2 ++
>   3 files changed, 4 insertions(+), 3 deletions(-)

LGTM, the kobject_init() is called when queue is allocated before the
fix tag.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block: fix kobject double initialization in add_disk
  2025-08-07  7:20 [PATCH] block: fix kobject double initialization in add_disk Zheng Qixing
  2025-08-07  8:42 ` Yu Kuai
@ 2025-08-07 11:47 ` Nilay Shroff
  2025-08-07 13:44   ` Zheng Qixing
  1 sibling, 1 reply; 8+ messages in thread
From: Nilay Shroff @ 2025-08-07 11:47 UTC (permalink / raw)
  To: Zheng Qixing, axboe
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun, houtao1,
	zhengqixing



On 8/7/25 12:50 PM, Zheng Qixing wrote:
> From: Zheng Qixing <zhengqixing@huawei.com>
> 
> Device-mapper can call add_disk() multiple times for the same gendisk
> due to its two-phase creation process (dm create + dm load). This leads
> to kobject double initialization errors when the underlying iSCSI devices
> become temporarily unavailable and then reappear.
> 
> However, if the first add_disk() call fails and is retried, the queue_kobj
> gets initialized twice, causing:
> 
> kobject: kobject (ffff88810c27bb90): tried to init an initialized object,
> something is seriously wrong.
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x5b/0x80
>   kobject_init.cold+0x43/0x51
>   blk_register_queue+0x46/0x280
>   add_disk_fwnode+0xb5/0x280
>   dm_setup_md_queue+0x194/0x1c0
>   table_load+0x297/0x2d0
>   ctl_ioctl+0x2a2/0x480
>   dm_ctl_ioctl+0xe/0x20
>   __x64_sys_ioctl+0xc7/0x110
>   do_syscall_64+0x72/0x390
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> Fix this by separating kobject initialization from sysfs registration:
>  - Initialize queue_kobj early during gendisk allocation
>  - add_disk() only adds the already-initialized kobject to sysfs
>  - del_gendisk() removes from sysfs but doesn't destroy the kobject
>  - Final cleanup happens when the disk is released
> 
> Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs")
> Reported-by: Li Lingfeng <lilingfeng3@huawei.com>
> Closes: https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@huawei.com/
> Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
> ---
>  block/blk-sysfs.c | 4 +---
>  block/blk.h       | 1 +
>  block/genhd.c     | 2 ++
>  3 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 396cded255ea..37d8654faff9 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -847,7 +847,7 @@ static void blk_queue_release(struct kobject *kobj)
>  	/* nothing to do here, all data is associated with the parent gendisk */
>  }
>  
> -static const struct kobj_type blk_queue_ktype = {
> +const struct kobj_type blk_queue_ktype = {
>  	.default_groups = blk_queue_attr_groups,
>  	.sysfs_ops	= &queue_sysfs_ops,
>  	.release	= blk_queue_release,
> @@ -875,7 +875,6 @@ int blk_register_queue(struct gendisk *disk)
>  	struct request_queue *q = disk->queue;
>  	int ret;
>  
> -	kobject_init(&disk->queue_kobj, &blk_queue_ktype);
>  	ret = kobject_add(&disk->queue_kobj, &disk_to_dev(disk)->kobj, "queue");
>  	if (ret < 0)
>  		goto out_put_queue_kobj;

If the kobject_add() fails here, then we jump to the label out_put_queue_kobj,
where we release/put disk->queue_kobj. That would decrement the kref of 
disk->queue_kobj and possibly bring it to zero. 
Next time, when we call add_disk() again without invoking kobject_init() 
(because the initialization is now moved outside add_disk()), the refcount
of disk->queue_kobj — which was previously released — would now go for a
toss. Wouldn't that lead to use-after-free or inconsistent state?

> @@ -986,5 +985,4 @@ void blk_unregister_queue(struct gendisk *disk)
>  		elevator_set_none(q);
>  
>  	blk_debugfs_remove(disk);
> -	kobject_put(&disk->queue_kobj);
>  }

I'm thinking a case where add_disk() fails after the queue is registered. 
In that case, we call blk_unregister_queue() — which would ideally put()
the disk->queue_kobj.
But if we skip that put() in blk_unregister_queue() (and that's what we do
above), and then later retry add_disk(), wouldn’t kobject_add() from 
blk_register_queue() complain loudly — since we’re trying to add a kobject
that was already added previously?

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block: fix kobject double initialization in add_disk
  2025-08-07 11:47 ` Nilay Shroff
@ 2025-08-07 13:44   ` Zheng Qixing
  2025-08-08  0:48     ` Yu Kuai
  0 siblings, 1 reply; 8+ messages in thread
From: Zheng Qixing @ 2025-08-07 13:44 UTC (permalink / raw)
  To: Nilay Shroff, axboe
  Cc: linux-block, linux-kernel, yukuai3, yi.zhang, yangerkun, houtao1,
	zhengqixing

Hi,


在 2025/8/7 19:47, Nilay Shroff 写道:
>
> On 8/7/25 12:50 PM, Zheng Qixing wrote:
>> From: Zheng Qixing <zhengqixing@huawei.com>
>>
>> Device-mapper can call add_disk() multiple times for the same gendisk
>> due to its two-phase creation process (dm create + dm load). This leads
>> to kobject double initialization errors when the underlying iSCSI devices
>> become temporarily unavailable and then reappear.
>>
>> However, if the first add_disk() call fails and is retried, the queue_kobj
>> gets initialized twice, causing:
>>
>> kobject: kobject (ffff88810c27bb90): tried to init an initialized object,
>> something is seriously wrong.
>>   Call Trace:
>>    <TASK>
>>    dump_stack_lvl+0x5b/0x80
>>    kobject_init.cold+0x43/0x51
>>    blk_register_queue+0x46/0x280
>>    add_disk_fwnode+0xb5/0x280
>>    dm_setup_md_queue+0x194/0x1c0
>>    table_load+0x297/0x2d0
>>    ctl_ioctl+0x2a2/0x480
>>    dm_ctl_ioctl+0xe/0x20
>>    __x64_sys_ioctl+0xc7/0x110
>>    do_syscall_64+0x72/0x390
>>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> Fix this by separating kobject initialization from sysfs registration:
>>   - Initialize queue_kobj early during gendisk allocation
>>   - add_disk() only adds the already-initialized kobject to sysfs
>>   - del_gendisk() removes from sysfs but doesn't destroy the kobject
>>   - Final cleanup happens when the disk is released
>>
>> Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs")
>> Reported-by: Li Lingfeng <lilingfeng3@huawei.com>
>> Closes: https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@huawei.com/
>> Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
>> ---
>>   block/blk-sysfs.c | 4 +---
>>   block/blk.h       | 1 +
>>   block/genhd.c     | 2 ++
>>   3 files changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index 396cded255ea..37d8654faff9 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -847,7 +847,7 @@ static void blk_queue_release(struct kobject *kobj)
>>   	/* nothing to do here, all data is associated with the parent gendisk */
>>   }
>>   
>> -static const struct kobj_type blk_queue_ktype = {
>> +const struct kobj_type blk_queue_ktype = {
>>   	.default_groups = blk_queue_attr_groups,
>>   	.sysfs_ops	= &queue_sysfs_ops,
>>   	.release	= blk_queue_release,
>> @@ -875,7 +875,6 @@ int blk_register_queue(struct gendisk *disk)
>>   	struct request_queue *q = disk->queue;
>>   	int ret;
>>   
>> -	kobject_init(&disk->queue_kobj, &blk_queue_ktype);
>>   	ret = kobject_add(&disk->queue_kobj, &disk_to_dev(disk)->kobj, "queue");
>>   	if (ret < 0)
>>   		goto out_put_queue_kobj;
> If the kobject_add() fails here, then we jump to the label out_put_queue_kobj,
> where we release/put disk->queue_kobj. That would decrement the kref of
> disk->queue_kobj and possibly bring it to zero.


Since we remove the kobject_init() into alloc disk, when the 
kobject_add() fails here,

it should return without kobject_del/put().


If kobject_add() succeeds but later steps fail, we should call 
kobject_del() to rollback.


The current error handling with kobject_put() in blk_register_queue() is 
indeed problematic.


> Next time, when we call add_disk() again without invoking kobject_init()
> (because the initialization is now moved outside add_disk()), the refcount
> of disk->queue_kobj — which was previously released — would now go for a
> toss. Wouldn't that lead to use-after-free or inconsistent state?
>
>> @@ -986,5 +985,4 @@ void blk_unregister_queue(struct gendisk *disk)
>>   		elevator_set_none(q);
>>   
>>   	blk_debugfs_remove(disk);
>> -	kobject_put(&disk->queue_kobj);
>>   }
> I'm thinking a case where add_disk() fails after the queue is registered.
> In that case, we call blk_unregister_queue() — which would ideally put()
> the disk->queue_kobj.
> But if we skip that put() in blk_unregister_queue() (and that's what we do
> above), and then later retry add_disk(), wouldn’t kobject_add() from
> blk_register_queue() complain loudly — since we’re trying to add a kobject
> that was already added previously?


blk_unregister_queue() calls kobject_del(), then the sysfs state is 
properly cleaned up

and retry should work fine.


>
> Thanks,
> --Nilay


Thanks,

Qixing


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block: fix kobject double initialization in add_disk
  2025-08-07 13:44   ` Zheng Qixing
@ 2025-08-08  0:48     ` Yu Kuai
  2025-08-08  8:09       ` Nilay Shroff
  0 siblings, 1 reply; 8+ messages in thread
From: Yu Kuai @ 2025-08-08  0:48 UTC (permalink / raw)
  To: Zheng Qixing, Nilay Shroff, axboe
  Cc: linux-block, linux-kernel, yi.zhang, yangerkun, houtao1,
	zhengqixing, yukuai (C)

Hi,

在 2025/08/07 21:44, Zheng Qixing 写道:
> Hi,
> 
> 
> 在 2025/8/7 19:47, Nilay Shroff 写道:
>>
>> On 8/7/25 12:50 PM, Zheng Qixing wrote:
>>> From: Zheng Qixing <zhengqixing@huawei.com>
>>>
>>> Device-mapper can call add_disk() multiple times for the same gendisk
>>> due to its two-phase creation process (dm create + dm load). This leads
>>> to kobject double initialization errors when the underlying iSCSI 
>>> devices
>>> become temporarily unavailable and then reappear.
>>>
>>> However, if the first add_disk() call fails and is retried, the 
>>> queue_kobj
>>> gets initialized twice, causing:
>>>
>>> kobject: kobject (ffff88810c27bb90): tried to init an initialized 
>>> object,
>>> something is seriously wrong.
>>>   Call Trace:
>>>    <TASK>
>>>    dump_stack_lvl+0x5b/0x80
>>>    kobject_init.cold+0x43/0x51
>>>    blk_register_queue+0x46/0x280
>>>    add_disk_fwnode+0xb5/0x280
>>>    dm_setup_md_queue+0x194/0x1c0
>>>    table_load+0x297/0x2d0
>>>    ctl_ioctl+0x2a2/0x480
>>>    dm_ctl_ioctl+0xe/0x20
>>>    __x64_sys_ioctl+0xc7/0x110
>>>    do_syscall_64+0x72/0x390
>>>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> Fix this by separating kobject initialization from sysfs registration:
>>>   - Initialize queue_kobj early during gendisk allocation
>>>   - add_disk() only adds the already-initialized kobject to sysfs
>>>   - del_gendisk() removes from sysfs but doesn't destroy the kobject
>>>   - Final cleanup happens when the disk is released
>>>
>>> Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from 
>>> sysfs")
>>> Reported-by: Li Lingfeng <lilingfeng3@huawei.com>
>>> Closes: 
>>> https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@huawei.com/ 
>>>
>>> Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
>>> ---
>>>   block/blk-sysfs.c | 4 +---
>>>   block/blk.h       | 1 +
>>>   block/genhd.c     | 2 ++
>>>   3 files changed, 4 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>> index 396cded255ea..37d8654faff9 100644
>>> --- a/block/blk-sysfs.c
>>> +++ b/block/blk-sysfs.c
>>> @@ -847,7 +847,7 @@ static void blk_queue_release(struct kobject *kobj)
>>>       /* nothing to do here, all data is associated with the parent 
>>> gendisk */
>>>   }
>>> -static const struct kobj_type blk_queue_ktype = {
>>> +const struct kobj_type blk_queue_ktype = {
>>>       .default_groups = blk_queue_attr_groups,
>>>       .sysfs_ops    = &queue_sysfs_ops,
>>>       .release    = blk_queue_release,
>>> @@ -875,7 +875,6 @@ int blk_register_queue(struct gendisk *disk)
>>>       struct request_queue *q = disk->queue;
>>>       int ret;
>>> -    kobject_init(&disk->queue_kobj, &blk_queue_ktype);
>>>       ret = kobject_add(&disk->queue_kobj, &disk_to_dev(disk)->kobj, 
>>> "queue");
>>>       if (ret < 0)
>>>           goto out_put_queue_kobj;
>> If the kobject_add() fails here, then we jump to the label 
>> out_put_queue_kobj,
>> where we release/put disk->queue_kobj. That would decrement the kref of
>> disk->queue_kobj and possibly bring it to zero.
> 
> 
> Since we remove the kobject_init() into alloc disk, when the 
> kobject_add() fails here,
> 
> it should return without kobject_del/put().

Yes, sorry I didn't noticed.
> 
> 
> If kobject_add() succeeds but later steps fail, we should call 
> kobject_del() to rollback.
> 
> 
> The current error handling with kobject_put() in blk_register_queue() is 
> indeed problematic.
> 
> 
>> Next time, when we call add_disk() again without invoking kobject_init()
>> (because the initialization is now moved outside add_disk()), the 
>> refcount
>> of disk->queue_kobj — which was previously released — would now go for a
>> toss. Wouldn't that lead to use-after-free or inconsistent state?
>>
>>> @@ -986,5 +985,4 @@ void blk_unregister_queue(struct gendisk *disk)
>>>           elevator_set_none(q);
>>>       blk_debugfs_remove(disk);
>>> -    kobject_put(&disk->queue_kobj);
>>>   }
>> I'm thinking a case where add_disk() fails after the queue is registered.
>> In that case, we call blk_unregister_queue() — which would ideally put()
>> the disk->queue_kobj.
>> But if we skip that put() in blk_unregister_queue() (and that's what 
>> we do
>> above), and then later retry add_disk(), wouldn’t kobject_add() from
>> blk_register_queue() complain loudly — since we’re trying to add a 
>> kobject
>> that was already added previously?

This is exactly the problem reported orginally, now is the same
procedures before 2bd85221a625:

1) allocate memory: kobject_init
2) register queue: kobject_add
3) unregister queue: kobject_del
4) free memory: kobject_put

Noted that kobject_add is corresponding to kobject_del, and they don't
grab/release kobject reference. 2) and 3) can be executed multiple
times, the only thing that I noticed is the following uevent:

kobject_uevent(&disk->queue_kobj, KOBJ_ADD);

Looks like the uevent is only valid for the first one, if first add_disk
failed and then queue is registered again, there won't be uevent again,
see state_add_uevent_sent.

However, this is probably fine.
> 
> 
> blk_unregister_queue() calls kobject_del(), then the sysfs state is 
> properly cleaned up
> 
> and retry should work fine.

Thanks,
Kuai

> 
> 
>>
>> Thanks,
>> --Nilay
> 
> 
> Thanks,
> 
> Qixing
> 
> .
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block: fix kobject double initialization in add_disk
  2025-08-08  0:48     ` Yu Kuai
@ 2025-08-08  8:09       ` Nilay Shroff
  2025-08-08  8:34         ` Yu Kuai
  0 siblings, 1 reply; 8+ messages in thread
From: Nilay Shroff @ 2025-08-08  8:09 UTC (permalink / raw)
  To: Yu Kuai, Zheng Qixing, axboe
  Cc: linux-block, linux-kernel, yi.zhang, yangerkun, houtao1,
	zhengqixing, yukuai (C)



On 8/8/25 6:18 AM, Yu Kuai wrote:
> Hi,
> 
> 在 2025/08/07 21:44, Zheng Qixing 写道:
>> Hi,
>>
>>
>> 在 2025/8/7 19:47, Nilay Shroff 写道:
>>>
>>> On 8/7/25 12:50 PM, Zheng Qixing wrote:
>>>> From: Zheng Qixing <zhengqixing@huawei.com>
>>>>
>>>> Device-mapper can call add_disk() multiple times for the same gendisk
>>>> due to its two-phase creation process (dm create + dm load). This leads
>>>> to kobject double initialization errors when the underlying iSCSI devices
>>>> become temporarily unavailable and then reappear.
>>>>
>>>> However, if the first add_disk() call fails and is retried, the queue_kobj
>>>> gets initialized twice, causing:
>>>>
>>>> kobject: kobject (ffff88810c27bb90): tried to init an initialized object,
>>>> something is seriously wrong.
>>>>   Call Trace:
>>>>    <TASK>
>>>>    dump_stack_lvl+0x5b/0x80
>>>>    kobject_init.cold+0x43/0x51
>>>>    blk_register_queue+0x46/0x280
>>>>    add_disk_fwnode+0xb5/0x280
>>>>    dm_setup_md_queue+0x194/0x1c0
>>>>    table_load+0x297/0x2d0
>>>>    ctl_ioctl+0x2a2/0x480
>>>>    dm_ctl_ioctl+0xe/0x20
>>>>    __x64_sys_ioctl+0xc7/0x110
>>>>    do_syscall_64+0x72/0x390
>>>>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>
>>>> Fix this by separating kobject initialization from sysfs registration:
>>>>   - Initialize queue_kobj early during gendisk allocation
>>>>   - add_disk() only adds the already-initialized kobject to sysfs
>>>>   - del_gendisk() removes from sysfs but doesn't destroy the kobject
>>>>   - Final cleanup happens when the disk is released
>>>>
>>>> Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs")
>>>> Reported-by: Li Lingfeng <lilingfeng3@huawei.com>
>>>> Closes: https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@huawei.com/
>>>> Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
>>>> ---
>>>>   block/blk-sysfs.c | 4 +---
>>>>   block/blk.h       | 1 +
>>>>   block/genhd.c     | 2 ++
>>>>   3 files changed, 4 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>> index 396cded255ea..37d8654faff9 100644
>>>> --- a/block/blk-sysfs.c
>>>> +++ b/block/blk-sysfs.c
>>>> @@ -847,7 +847,7 @@ static void blk_queue_release(struct kobject *kobj)
>>>>       /* nothing to do here, all data is associated with the parent gendisk */
>>>>   }
>>>> -static const struct kobj_type blk_queue_ktype = {
>>>> +const struct kobj_type blk_queue_ktype = {
>>>>       .default_groups = blk_queue_attr_groups,
>>>>       .sysfs_ops    = &queue_sysfs_ops,
>>>>       .release    = blk_queue_release,
>>>> @@ -875,7 +875,6 @@ int blk_register_queue(struct gendisk *disk)
>>>>       struct request_queue *q = disk->queue;
>>>>       int ret;
>>>> -    kobject_init(&disk->queue_kobj, &blk_queue_ktype);
>>>>       ret = kobject_add(&disk->queue_kobj, &disk_to_dev(disk)->kobj, "queue");
>>>>       if (ret < 0)
>>>>           goto out_put_queue_kobj;
>>> If the kobject_add() fails here, then we jump to the label out_put_queue_kobj,
>>> where we release/put disk->queue_kobj. That would decrement the kref of
>>> disk->queue_kobj and possibly bring it to zero.
>>
>>
>> Since we remove the kobject_init() into alloc disk, when the kobject_add() fails here,
>>
>> it should return without kobject_del/put().
> 
> Yes, sorry I didn't noticed.
>>
>>
>> If kobject_add() succeeds but later steps fail, we should call kobject_del() to rollback.
>>
>>
>> The current error handling with kobject_put() in blk_register_queue() is indeed problematic.
>>
>>
>>> Next time, when we call add_disk() again without invoking kobject_init()
>>> (because the initialization is now moved outside add_disk()), the refcount
>>> of disk->queue_kobj — which was previously released — would now go for a
>>> toss. Wouldn't that lead to use-after-free or inconsistent state?
>>>
>>>> @@ -986,5 +985,4 @@ void blk_unregister_queue(struct gendisk *disk)
>>>>           elevator_set_none(q);
>>>>       blk_debugfs_remove(disk);
>>>> -    kobject_put(&disk->queue_kobj);
>>>>   }
>>> I'm thinking a case where add_disk() fails after the queue is registered.
>>> In that case, we call blk_unregister_queue() — which would ideally put()
>>> the disk->queue_kobj.
>>> But if we skip that put() in blk_unregister_queue() (and that's what we do
>>> above), and then later retry add_disk(), wouldn’t kobject_add() from
>>> blk_register_queue() complain loudly — since we’re trying to add a kobject
>>> that was already added previously?
> 
> This is exactly the problem reported orginally, now is the same
> procedures before 2bd85221a625:
> 
> 1) allocate memory: kobject_init
> 2) register queue: kobject_add
> 3) unregister queue: kobject_del
> 4) free memory: kobject_put
> 
> Noted that kobject_add is corresponding to kobject_del, and they don't
> grab/release kobject reference. 2) and 3) can be executed multiple
> times, the only thing that I noticed is the following uevent:
> 
> kobject_uevent(&disk->queue_kobj, KOBJ_ADD);
> 
> Looks like the uevent is only valid for the first one, if first add_disk
> failed and then queue is registered again, there won't be uevent again,
> see state_add_uevent_sent.
> 
Why do you think so? We always send the "add" uevent when we register 
queue and then send "remove" uevent when the queue is unregistered. This
behavior should remain intact with this change. The kobj->state_add_uevent_sent
is only used for sending "remove" event during automatic cleanup of kobject.
It shouldn't have any side effect while we re-register queue or retry 
add_disk(). Or am I missing something here?

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block: fix kobject double initialization in add_disk
  2025-08-08  8:09       ` Nilay Shroff
@ 2025-08-08  8:34         ` Yu Kuai
  2025-08-08  9:42           ` Nilay Shroff
  0 siblings, 1 reply; 8+ messages in thread
From: Yu Kuai @ 2025-08-08  8:34 UTC (permalink / raw)
  To: Nilay Shroff, Yu Kuai, Zheng Qixing, axboe
  Cc: linux-block, linux-kernel, yi.zhang, yangerkun, houtao1,
	zhengqixing, yukuai (C)

Hi,

在 2025/08/08 16:09, Nilay Shroff 写道:
> 
> 
> On 8/8/25 6:18 AM, Yu Kuai wrote:
>> Hi,
>>
>> 在 2025/08/07 21:44, Zheng Qixing 写道:
>>> Hi,
>>>
>>>
>>> 在 2025/8/7 19:47, Nilay Shroff 写道:
>>>>
>>>> On 8/7/25 12:50 PM, Zheng Qixing wrote:
>>>>> From: Zheng Qixing <zhengqixing@huawei.com>
>>>>>
>>>>> Device-mapper can call add_disk() multiple times for the same gendisk
>>>>> due to its two-phase creation process (dm create + dm load). This leads
>>>>> to kobject double initialization errors when the underlying iSCSI devices
>>>>> become temporarily unavailable and then reappear.
>>>>>
>>>>> However, if the first add_disk() call fails and is retried, the queue_kobj
>>>>> gets initialized twice, causing:
>>>>>
>>>>> kobject: kobject (ffff88810c27bb90): tried to init an initialized object,
>>>>> something is seriously wrong.
>>>>>    Call Trace:
>>>>>     <TASK>
>>>>>     dump_stack_lvl+0x5b/0x80
>>>>>     kobject_init.cold+0x43/0x51
>>>>>     blk_register_queue+0x46/0x280
>>>>>     add_disk_fwnode+0xb5/0x280
>>>>>     dm_setup_md_queue+0x194/0x1c0
>>>>>     table_load+0x297/0x2d0
>>>>>     ctl_ioctl+0x2a2/0x480
>>>>>     dm_ctl_ioctl+0xe/0x20
>>>>>     __x64_sys_ioctl+0xc7/0x110
>>>>>     do_syscall_64+0x72/0x390
>>>>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>>
>>>>> Fix this by separating kobject initialization from sysfs registration:
>>>>>    - Initialize queue_kobj early during gendisk allocation
>>>>>    - add_disk() only adds the already-initialized kobject to sysfs
>>>>>    - del_gendisk() removes from sysfs but doesn't destroy the kobject
>>>>>    - Final cleanup happens when the disk is released
>>>>>
>>>>> Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs")
>>>>> Reported-by: Li Lingfeng <lilingfeng3@huawei.com>
>>>>> Closes: https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@huawei.com/
>>>>> Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
>>>>> ---
>>>>>    block/blk-sysfs.c | 4 +---
>>>>>    block/blk.h       | 1 +
>>>>>    block/genhd.c     | 2 ++
>>>>>    3 files changed, 4 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>>> index 396cded255ea..37d8654faff9 100644
>>>>> --- a/block/blk-sysfs.c
>>>>> +++ b/block/blk-sysfs.c
>>>>> @@ -847,7 +847,7 @@ static void blk_queue_release(struct kobject *kobj)
>>>>>        /* nothing to do here, all data is associated with the parent gendisk */
>>>>>    }
>>>>> -static const struct kobj_type blk_queue_ktype = {
>>>>> +const struct kobj_type blk_queue_ktype = {
>>>>>        .default_groups = blk_queue_attr_groups,
>>>>>        .sysfs_ops    = &queue_sysfs_ops,
>>>>>        .release    = blk_queue_release,
>>>>> @@ -875,7 +875,6 @@ int blk_register_queue(struct gendisk *disk)
>>>>>        struct request_queue *q = disk->queue;
>>>>>        int ret;
>>>>> -    kobject_init(&disk->queue_kobj, &blk_queue_ktype);
>>>>>        ret = kobject_add(&disk->queue_kobj, &disk_to_dev(disk)->kobj, "queue");
>>>>>        if (ret < 0)
>>>>>            goto out_put_queue_kobj;
>>>> If the kobject_add() fails here, then we jump to the label out_put_queue_kobj,
>>>> where we release/put disk->queue_kobj. That would decrement the kref of
>>>> disk->queue_kobj and possibly bring it to zero.
>>>
>>>
>>> Since we remove the kobject_init() into alloc disk, when the kobject_add() fails here,
>>>
>>> it should return without kobject_del/put().
>>
>> Yes, sorry I didn't noticed.
>>>
>>>
>>> If kobject_add() succeeds but later steps fail, we should call kobject_del() to rollback.
>>>
>>>
>>> The current error handling with kobject_put() in blk_register_queue() is indeed problematic.
>>>
>>>
>>>> Next time, when we call add_disk() again without invoking kobject_init()
>>>> (because the initialization is now moved outside add_disk()), the refcount
>>>> of disk->queue_kobj — which was previously released — would now go for a
>>>> toss. Wouldn't that lead to use-after-free or inconsistent state?
>>>>
>>>>> @@ -986,5 +985,4 @@ void blk_unregister_queue(struct gendisk *disk)
>>>>>            elevator_set_none(q);
>>>>>        blk_debugfs_remove(disk);
>>>>> -    kobject_put(&disk->queue_kobj);
>>>>>    }
>>>> I'm thinking a case where add_disk() fails after the queue is registered.
>>>> In that case, we call blk_unregister_queue() — which would ideally put()
>>>> the disk->queue_kobj.
>>>> But if we skip that put() in blk_unregister_queue() (and that's what we do
>>>> above), and then later retry add_disk(), wouldn’t kobject_add() from
>>>> blk_register_queue() complain loudly — since we’re trying to add a kobject
>>>> that was already added previously?
>>
>> This is exactly the problem reported orginally, now is the same
>> procedures before 2bd85221a625:
>>
>> 1) allocate memory: kobject_init
>> 2) register queue: kobject_add
>> 3) unregister queue: kobject_del
>> 4) free memory: kobject_put
>>
>> Noted that kobject_add is corresponding to kobject_del, and they don't
>> grab/release kobject reference. 2) and 3) can be executed multiple
>> times, the only thing that I noticed is the following uevent:
>>
>> kobject_uevent(&disk->queue_kobj, KOBJ_ADD);
>>
>> Looks like the uevent is only valid for the first one, if first add_disk
>> failed and then queue is registered again, there won't be uevent again,
>> see state_add_uevent_sent.
>>
> Why do you think so? We always send the "add" uevent when we register
> queue and then send "remove" uevent when the queue is unregistered. This
> behavior should remain intact with this change. The kobj->state_add_uevent_sent
> is only used for sending "remove" event during automatic cleanup of kobject.
> It shouldn't have any side effect while we re-register queue or retry
> add_disk(). Or am I missing something here?

Yes, turns out I misread the code after I tested and found that user
didn't get the queue uevent. Just dig deeper and realize this is due
to dev_uevent_filter() from the top devices_kset stop the queue kobj
uevent.

So, if I don't misread the code now, looks like the kobject_uevent()
call to queue_kobj is useless.

Thanks,
Kuai

> 
> Thanks,
> --Nilay
> .
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] block: fix kobject double initialization in add_disk
  2025-08-08  8:34         ` Yu Kuai
@ 2025-08-08  9:42           ` Nilay Shroff
  0 siblings, 0 replies; 8+ messages in thread
From: Nilay Shroff @ 2025-08-08  9:42 UTC (permalink / raw)
  To: Yu Kuai, Zheng Qixing, axboe
  Cc: linux-block, linux-kernel, yi.zhang, yangerkun, houtao1,
	zhengqixing, yukuai (C)



On 8/8/25 2:04 PM, Yu Kuai wrote:
> Hi,
> 
> 在 2025/08/08 16:09, Nilay Shroff 写道:
>>
>>
>> On 8/8/25 6:18 AM, Yu Kuai wrote:
>>> Hi,
>>>
>>> 在 2025/08/07 21:44, Zheng Qixing 写道:
>>>> Hi,
>>>>
>>>>
>>>> 在 2025/8/7 19:47, Nilay Shroff 写道:
>>>>>
>>>>> On 8/7/25 12:50 PM, Zheng Qixing wrote:
>>>>>> From: Zheng Qixing <zhengqixing@huawei.com>
>>>>>>
>>>>>> Device-mapper can call add_disk() multiple times for the same gendisk
>>>>>> due to its two-phase creation process (dm create + dm load). This leads
>>>>>> to kobject double initialization errors when the underlying iSCSI devices
>>>>>> become temporarily unavailable and then reappear.
>>>>>>
>>>>>> However, if the first add_disk() call fails and is retried, the queue_kobj
>>>>>> gets initialized twice, causing:
>>>>>>
>>>>>> kobject: kobject (ffff88810c27bb90): tried to init an initialized object,
>>>>>> something is seriously wrong.
>>>>>>    Call Trace:
>>>>>>     <TASK>
>>>>>>     dump_stack_lvl+0x5b/0x80
>>>>>>     kobject_init.cold+0x43/0x51
>>>>>>     blk_register_queue+0x46/0x280
>>>>>>     add_disk_fwnode+0xb5/0x280
>>>>>>     dm_setup_md_queue+0x194/0x1c0
>>>>>>     table_load+0x297/0x2d0
>>>>>>     ctl_ioctl+0x2a2/0x480
>>>>>>     dm_ctl_ioctl+0xe/0x20
>>>>>>     __x64_sys_ioctl+0xc7/0x110
>>>>>>     do_syscall_64+0x72/0x390
>>>>>>     entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>>>
>>>>>> Fix this by separating kobject initialization from sysfs registration:
>>>>>>    - Initialize queue_kobj early during gendisk allocation
>>>>>>    - add_disk() only adds the already-initialized kobject to sysfs
>>>>>>    - del_gendisk() removes from sysfs but doesn't destroy the kobject
>>>>>>    - Final cleanup happens when the disk is released
>>>>>>
>>>>>> Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs")
>>>>>> Reported-by: Li Lingfeng <lilingfeng3@huawei.com>
>>>>>> Closes: https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@huawei.com/
>>>>>> Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
>>>>>> ---
>>>>>>    block/blk-sysfs.c | 4 +---
>>>>>>    block/blk.h       | 1 +
>>>>>>    block/genhd.c     | 2 ++
>>>>>>    3 files changed, 4 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>>>> index 396cded255ea..37d8654faff9 100644
>>>>>> --- a/block/blk-sysfs.c
>>>>>> +++ b/block/blk-sysfs.c
>>>>>> @@ -847,7 +847,7 @@ static void blk_queue_release(struct kobject *kobj)
>>>>>>        /* nothing to do here, all data is associated with the parent gendisk */
>>>>>>    }
>>>>>> -static const struct kobj_type blk_queue_ktype = {
>>>>>> +const struct kobj_type blk_queue_ktype = {
>>>>>>        .default_groups = blk_queue_attr_groups,
>>>>>>        .sysfs_ops    = &queue_sysfs_ops,
>>>>>>        .release    = blk_queue_release,
>>>>>> @@ -875,7 +875,6 @@ int blk_register_queue(struct gendisk *disk)
>>>>>>        struct request_queue *q = disk->queue;
>>>>>>        int ret;
>>>>>> -    kobject_init(&disk->queue_kobj, &blk_queue_ktype);
>>>>>>        ret = kobject_add(&disk->queue_kobj, &disk_to_dev(disk)->kobj, "queue");
>>>>>>        if (ret < 0)
>>>>>>            goto out_put_queue_kobj;
>>>>> If the kobject_add() fails here, then we jump to the label out_put_queue_kobj,
>>>>> where we release/put disk->queue_kobj. That would decrement the kref of
>>>>> disk->queue_kobj and possibly bring it to zero.
>>>>
>>>>
>>>> Since we remove the kobject_init() into alloc disk, when the kobject_add() fails here,
>>>>
>>>> it should return without kobject_del/put().
>>>
>>> Yes, sorry I didn't noticed.
>>>>
>>>>
>>>> If kobject_add() succeeds but later steps fail, we should call kobject_del() to rollback.
>>>>
>>>>
>>>> The current error handling with kobject_put() in blk_register_queue() is indeed problematic.
>>>>
>>>>
>>>>> Next time, when we call add_disk() again without invoking kobject_init()
>>>>> (because the initialization is now moved outside add_disk()), the refcount
>>>>> of disk->queue_kobj — which was previously released — would now go for a
>>>>> toss. Wouldn't that lead to use-after-free or inconsistent state?
>>>>>
>>>>>> @@ -986,5 +985,4 @@ void blk_unregister_queue(struct gendisk *disk)
>>>>>>            elevator_set_none(q);
>>>>>>        blk_debugfs_remove(disk);
>>>>>> -    kobject_put(&disk->queue_kobj);
>>>>>>    }
>>>>> I'm thinking a case where add_disk() fails after the queue is registered.
>>>>> In that case, we call blk_unregister_queue() — which would ideally put()
>>>>> the disk->queue_kobj.
>>>>> But if we skip that put() in blk_unregister_queue() (and that's what we do
>>>>> above), and then later retry add_disk(), wouldn’t kobject_add() from
>>>>> blk_register_queue() complain loudly — since we’re trying to add a kobject
>>>>> that was already added previously?
>>>
>>> This is exactly the problem reported orginally, now is the same
>>> procedures before 2bd85221a625:
>>>
>>> 1) allocate memory: kobject_init
>>> 2) register queue: kobject_add
>>> 3) unregister queue: kobject_del
>>> 4) free memory: kobject_put
>>>
>>> Noted that kobject_add is corresponding to kobject_del, and they don't
>>> grab/release kobject reference. 2) and 3) can be executed multiple
>>> times, the only thing that I noticed is the following uevent:
>>>
>>> kobject_uevent(&disk->queue_kobj, KOBJ_ADD);
>>>
>>> Looks like the uevent is only valid for the first one, if first add_disk
>>> failed and then queue is registered again, there won't be uevent again,
>>> see state_add_uevent_sent.
>>>
>> Why do you think so? We always send the "add" uevent when we register
>> queue and then send "remove" uevent when the queue is unregistered. This
>> behavior should remain intact with this change. The kobj->state_add_uevent_sent
>> is only used for sending "remove" event during automatic cleanup of kobject.
>> It shouldn't have any side effect while we re-register queue or retry
>> add_disk(). Or am I missing something here?
> 
> Yes, turns out I misread the code after I tested and found that user
> didn't get the queue uevent. Just dig deeper and realize this is due
> to dev_uevent_filter() from the top devices_kset stop the queue kobj
> uevent.
> 
> So, if I don't misread the code now, looks like the kobject_uevent()
> call to queue_kobj is useless.
> 
Yes that's correct. And so this patch doesn't alter the queue_kobj  uevent behavior and so we're good here.
Thanks,
--Nilay



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-08-08  9:43 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-07  7:20 [PATCH] block: fix kobject double initialization in add_disk Zheng Qixing
2025-08-07  8:42 ` Yu Kuai
2025-08-07 11:47 ` Nilay Shroff
2025-08-07 13:44   ` Zheng Qixing
2025-08-08  0:48     ` Yu Kuai
2025-08-08  8:09       ` Nilay Shroff
2025-08-08  8:34         ` Yu Kuai
2025-08-08  9:42           ` Nilay Shroff

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).