From: Nilay Shroff <nilay@linux.ibm.com>
To: Yu Kuai <yukuai1@huaweicloud.com>,
Zheng Qixing <zhengqixing@huaweicloud.com>,
axboe@kernel.dk
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
yi.zhang@huawei.com, yangerkun@huawei.com, houtao1@huawei.com,
zhengqixing@huawei.com, "yukuai (C)" <yukuai3@huawei.com>
Subject: Re: [PATCH] block: fix kobject double initialization in add_disk
Date: Fri, 8 Aug 2025 13:39:27 +0530 [thread overview]
Message-ID: <bc3b132d-f017-4bcd-a3f4-3ea344d67a04@linux.ibm.com> (raw)
In-Reply-To: <c5036a51-ffd5-4eab-f1a5-369adff3a291@huaweicloud.com>
On 8/8/25 6:18 AM, Yu Kuai wrote:
> Hi,
>
> 在 2025/08/07 21:44, Zheng Qixing 写道:
>> Hi,
>>
>>
>> 在 2025/8/7 19:47, Nilay Shroff 写道:
>>>
>>> On 8/7/25 12:50 PM, Zheng Qixing wrote:
>>>> From: Zheng Qixing <zhengqixing@huawei.com>
>>>>
>>>> Device-mapper can call add_disk() multiple times for the same gendisk
>>>> due to its two-phase creation process (dm create + dm load). This leads
>>>> to kobject double initialization errors when the underlying iSCSI devices
>>>> become temporarily unavailable and then reappear.
>>>>
>>>> However, if the first add_disk() call fails and is retried, the queue_kobj
>>>> gets initialized twice, causing:
>>>>
>>>> kobject: kobject (ffff88810c27bb90): tried to init an initialized object,
>>>> something is seriously wrong.
>>>> Call Trace:
>>>> <TASK>
>>>> dump_stack_lvl+0x5b/0x80
>>>> kobject_init.cold+0x43/0x51
>>>> blk_register_queue+0x46/0x280
>>>> add_disk_fwnode+0xb5/0x280
>>>> dm_setup_md_queue+0x194/0x1c0
>>>> table_load+0x297/0x2d0
>>>> ctl_ioctl+0x2a2/0x480
>>>> dm_ctl_ioctl+0xe/0x20
>>>> __x64_sys_ioctl+0xc7/0x110
>>>> do_syscall_64+0x72/0x390
>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>
>>>> Fix this by separating kobject initialization from sysfs registration:
>>>> - Initialize queue_kobj early during gendisk allocation
>>>> - add_disk() only adds the already-initialized kobject to sysfs
>>>> - del_gendisk() removes from sysfs but doesn't destroy the kobject
>>>> - Final cleanup happens when the disk is released
>>>>
>>>> Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs")
>>>> Reported-by: Li Lingfeng <lilingfeng3@huawei.com>
>>>> Closes: https://lore.kernel.org/all/83591d0b-2467-433c-bce0-5581298eb161@huawei.com/
>>>> Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
>>>> ---
>>>> block/blk-sysfs.c | 4 +---
>>>> block/blk.h | 1 +
>>>> block/genhd.c | 2 ++
>>>> 3 files changed, 4 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>>> index 396cded255ea..37d8654faff9 100644
>>>> --- a/block/blk-sysfs.c
>>>> +++ b/block/blk-sysfs.c
>>>> @@ -847,7 +847,7 @@ static void blk_queue_release(struct kobject *kobj)
>>>> /* nothing to do here, all data is associated with the parent gendisk */
>>>> }
>>>> -static const struct kobj_type blk_queue_ktype = {
>>>> +const struct kobj_type blk_queue_ktype = {
>>>> .default_groups = blk_queue_attr_groups,
>>>> .sysfs_ops = &queue_sysfs_ops,
>>>> .release = blk_queue_release,
>>>> @@ -875,7 +875,6 @@ int blk_register_queue(struct gendisk *disk)
>>>> struct request_queue *q = disk->queue;
>>>> int ret;
>>>> - kobject_init(&disk->queue_kobj, &blk_queue_ktype);
>>>> ret = kobject_add(&disk->queue_kobj, &disk_to_dev(disk)->kobj, "queue");
>>>> if (ret < 0)
>>>> goto out_put_queue_kobj;
>>> If the kobject_add() fails here, then we jump to the label out_put_queue_kobj,
>>> where we release/put disk->queue_kobj. That would decrement the kref of
>>> disk->queue_kobj and possibly bring it to zero.
>>
>>
>> Since we remove the kobject_init() into alloc disk, when the kobject_add() fails here,
>>
>> it should return without kobject_del/put().
>
> Yes, sorry I didn't noticed.
>>
>>
>> If kobject_add() succeeds but later steps fail, we should call kobject_del() to rollback.
>>
>>
>> The current error handling with kobject_put() in blk_register_queue() is indeed problematic.
>>
>>
>>> Next time, when we call add_disk() again without invoking kobject_init()
>>> (because the initialization is now moved outside add_disk()), the refcount
>>> of disk->queue_kobj — which was previously released — would now go for a
>>> toss. Wouldn't that lead to use-after-free or inconsistent state?
>>>
>>>> @@ -986,5 +985,4 @@ void blk_unregister_queue(struct gendisk *disk)
>>>> elevator_set_none(q);
>>>> blk_debugfs_remove(disk);
>>>> - kobject_put(&disk->queue_kobj);
>>>> }
>>> I'm thinking a case where add_disk() fails after the queue is registered.
>>> In that case, we call blk_unregister_queue() — which would ideally put()
>>> the disk->queue_kobj.
>>> But if we skip that put() in blk_unregister_queue() (and that's what we do
>>> above), and then later retry add_disk(), wouldn’t kobject_add() from
>>> blk_register_queue() complain loudly — since we’re trying to add a kobject
>>> that was already added previously?
>
> This is exactly the problem reported orginally, now is the same
> procedures before 2bd85221a625:
>
> 1) allocate memory: kobject_init
> 2) register queue: kobject_add
> 3) unregister queue: kobject_del
> 4) free memory: kobject_put
>
> Noted that kobject_add is corresponding to kobject_del, and they don't
> grab/release kobject reference. 2) and 3) can be executed multiple
> times, the only thing that I noticed is the following uevent:
>
> kobject_uevent(&disk->queue_kobj, KOBJ_ADD);
>
> Looks like the uevent is only valid for the first one, if first add_disk
> failed and then queue is registered again, there won't be uevent again,
> see state_add_uevent_sent.
>
Why do you think so? We always send the "add" uevent when we register
queue and then send "remove" uevent when the queue is unregistered. This
behavior should remain intact with this change. The kobj->state_add_uevent_sent
is only used for sending "remove" event during automatic cleanup of kobject.
It shouldn't have any side effect while we re-register queue or retry
add_disk(). Or am I missing something here?
Thanks,
--Nilay
next prev parent reply other threads:[~2025-08-08 8:09 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-07 7:20 [PATCH] block: fix kobject double initialization in add_disk Zheng Qixing
2025-08-07 8:42 ` Yu Kuai
2025-08-07 11:47 ` Nilay Shroff
2025-08-07 13:44 ` Zheng Qixing
2025-08-08 0:48 ` Yu Kuai
2025-08-08 8:09 ` Nilay Shroff [this message]
2025-08-08 8:34 ` Yu Kuai
2025-08-08 9:42 ` Nilay Shroff
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=bc3b132d-f017-4bcd-a3f4-3ea344d67a04@linux.ibm.com \
--to=nilay@linux.ibm.com \
--cc=axboe@kernel.dk \
--cc=houtao1@huawei.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=yangerkun@huawei.com \
--cc=yi.zhang@huawei.com \
--cc=yukuai1@huaweicloud.com \
--cc=yukuai3@huawei.com \
--cc=zhengqixing@huawei.com \
--cc=zhengqixing@huaweicloud.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).