* [PATCH 0/1] cover letter
@ 2025-10-29 21:08 Casey Chen
2025-10-29 21:08 ` [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer Casey Chen
0 siblings, 1 reply; 12+ messages in thread
From: Casey Chen @ 2025-10-29 21:08 UTC (permalink / raw)
To: linux-nvme, linux-block; +Cc: yzhong, sconnor, axboe, mkhalfella, Casey Chen
When a controller is deleted (e.g., via sysfs delete_controller), the
admin queue is freed while userspace may still have an open fd to the
namespace block device. Userspace can issue IOCTLs on the open fd
that access the freed admin queue through the stale ns->ctrl->admin_q
pointer, causing a use-after-free.
Fix this by taking an additional reference on the admin queue during
namespace allocation and releasing it during namespace cleanup.
We can easily reproduce this issue with the following experiment.
1. Add a 10s delay in nvme_submit_user_cmd() before it allocates the request:
@@ -175,6 +176,10 @@ static int nvme_submit_user_cmd(struct request_queue *q,
u32 effects;
int ret;
+ pr_info("About to sleep for 10 seconds\n");
+ msleep(10000);
+ pr_info("Done sleeping for 10 seconds\n");
+
req = nvme_alloc_user_request(q, cmd, 0, 0);
if (IS_ERR(req))
return PTR_ERR(req);
2. Run an nvme command to send an admin command through the block device:
$ strace -vf nvme read --start-block=0 --data-size=4096 /dev/nvme12n1
3. Right after issuing the nvme command, remove the nvme device from sysfs:
$ echo 1 > /sys/bus/pci/devices/0000\:ce\:00.0/remove
Output from strace:
openat(AT_FDCWD, "/dev/nvme12n1", O_RDONLY) = 3
fstat(3, {st_dev=makedev(0, 0x6), st_ino=711, st_mode=S_IFBLK|0660, st_nlink=1, st_uid=0, st_gid=6, st_blksize=4096, st_blocks=0, st_rdev=makedev(0x103, 0xa4), st_atime=1761700579 /* 2025-10-29T01:16:19.769373519+0000 */, st_atime_nsec=769373519, st_mtime=1761700579 /* 2025-10-29T01:16:19.769373519+0000 */, st_mtime_nsec=769373519, st_ctime=1761700579 /* 2025-10-29T01:16:19.769373519+0000 */, st_ctime_nsec=769373519}) = 0
ioctl(3, NVME_IOCTL_ID, 0) = 3
ioctl(3, NVME_IOCTL_ADMIN_CMD, "\x06\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x60\x54\x00\x00\x00\x00\x00"... => "\x06\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x60\x54\x00\x00\x00\x00\x00"...) = -1 ENODEV (No such device)
write(2, "identify namespace: No such devi"..., 34identify namespace: No such device) = 34
write(2, "\n", 1
) = 1
close(3) = 0
exit_group(1) = ?
+++ exited with 1 +++
strace shows it stuck at the identify namespace admin command, which then fails with ENODEV.
Output from KASAN:
[ 360.958500] ==================================================================
[ 360.959310] BUG: KASAN: slab-use-after-free in blk_queue_enter+0x41c/0x4a0
[ 360.960213] Read of size 8 at addr ffff88c0a53819f8 by task nvme/3287
[ 360.962096] CPU: 67 UID: 0 PID: 3287 Comm: nvme Tainted: G E 6.13.2-ga1582f1a031e #15
[ 360.962103] Tainted: [E]=UNSIGNED_MODULE
[ 360.962105] Hardware name: Jabil /EGS 2S MB1, BIOS 1.00 06/18/2025
[ 360.962107] Call Trace:
[ 360.962110] <TASK>
[ 360.962112] dump_stack_lvl+0x4f/0x60
[ 360.962120] print_report+0xc4/0x620
[ 360.962128] ? _raw_spin_lock_irqsave+0x70/0xb0
[ 360.962135] ? _raw_read_unlock_irqrestore+0x30/0x30
[ 360.962139] ? blk_queue_enter+0x41c/0x4a0
[ 360.962143] kasan_report+0xab/0xe0
[ 360.962147] ? blk_queue_enter+0x41c/0x4a0
[ 360.962151] blk_queue_enter+0x41c/0x4a0
[ 360.962155] ? __irq_work_queue_local+0x75/0x1d0
[ 360.962162] ? blk_queue_start_drain+0x70/0x70
[ 360.962166] ? irq_work_queue+0x18/0x20
[ 360.962170] ? vprintk_emit.part.0+0x1cc/0x350
[ 360.962176] ? wake_up_klogd_work_func+0x60/0x60
[ 360.962180] blk_mq_alloc_request+0x2b7/0x6b0
[ 360.962186] ? __blk_mq_alloc_requests+0x1060/0x1060
[ 360.962190] ? __switch_to+0x5b7/0x1060
[ 360.962198] nvme_submit_user_cmd+0xa9/0x330
[ 360.962204] nvme_user_cmd.isra.0+0x240/0x3f0
[ 360.962208] ? force_sigsegv+0xe0/0xe0
[ 360.962215] ? nvme_user_cmd64+0x400/0x400
[ 360.962218] ? vfs_fileattr_set+0x9b0/0x9b0
[ 360.962224] ? cgroup_update_frozen_flag+0x24/0x1c0
[ 360.962231] ? cgroup_leave_frozen+0x204/0x330
[ 360.962236] ? nvme_ioctl+0x7c/0x2c0
[ 360.962238] blkdev_ioctl+0x1a8/0x4d0
[ 360.962242] ? blkdev_common_ioctl+0x1930/0x1930
[ 360.962244] ? fdget+0x54/0x380
[ 360.962251] __x64_sys_ioctl+0x129/0x190
[ 360.962255] do_syscall_64+0x5b/0x160
[ 360.962260] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 360.962267] RIP: 0033:0x7f765f703b0b
[ 360.962271] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 52 0f 00 f7 d8 64 89 01 48
[ 360.962275] RSP: 002b:00007ffe2cefe808 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 360.962280] RAX: ffffffffffffffda RBX: 00007ffe2cefe860 RCX: 00007f765f703b0b
[ 360.962283] RDX: 00007ffe2cefe860 RSI: 00000000c0484e41 RDI: 0000000000000003
[ 360.962285] RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000000
[ 360.962287] R10: 00007f765f611d50 R11: 0000000000000202 R12: 0000000000000003
[ 360.962289] R13: 00000000c0484e41 R14: 0000000000000001 R15: 00007ffe2cefea60
[ 360.962293] </TASK>
[ 361.046089] Allocated by task 970:
[ 361.048182] kasan_save_stack+0x1c/0x40
[ 361.050278] kasan_save_track+0x10/0x30
[ 361.052339] __kasan_slab_alloc+0x7d/0x80
[ 361.054396] kmem_cache_alloc_node_noprof+0xe5/0x3a0
[ 361.056485] blk_alloc_queue+0x31/0x700
[ 361.058573] blk_mq_alloc_queue+0x13e/0x210
[ 361.060660] nvme_alloc_admin_tag_set+0x32e/0x620
[ 361.062760] nvme_probe+0x951/0x1810
[ 361.064836] local_pci_probe+0xc6/0x170
[ 361.066909] work_for_cpu_fn+0x4e/0xa0
[ 361.068968] process_one_work+0x5a4/0xe00
[ 361.071035] worker_thread+0x8a2/0x1560
[ 361.073103] kthread+0x284/0x350
[ 361.075159] ret_from_fork+0x2d/0x70
[ 361.077217] ret_from_fork_asm+0x11/0x20
[ 361.081256] Freed by task 0:
[ 361.083263] kasan_save_stack+0x1c/0x40
[ 361.085300] kasan_save_track+0x10/0x30
[ 361.087299] kasan_save_free_info+0x37/0x50
[ 361.089287] __kasan_slab_free+0x4b/0x60
[ 361.091277] kmem_cache_free+0x11a/0x5c0
[ 361.093266] rcu_core+0x6da/0xe50
[ 361.095232] handle_softirqs+0x196/0x570
[ 361.097193] __irq_exit_rcu+0xb6/0xf0
[ 361.099097] sysvec_apic_timer_interrupt+0x6e/0x90
[ 361.100943] asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 361.104458] Last potentially related work creation:
[ 361.106185] kasan_save_stack+0x1c/0x40
[ 361.107893] __kasan_record_aux_stack+0xba/0xd0
[ 361.109628] __call_rcu_common.constprop.0+0x70/0x790
[ 361.111393] nvme_remove_admin_tag_set+0x94/0x1b0
[ 361.113166] nvme_remove+0xce/0x350
[ 361.114879] pci_device_remove+0x65/0x110
[ 361.116620] device_release_driver_internal+0x34d/0x670
[ 361.118410] pci_stop_bus_device+0x106/0x150
[ 361.120209] pci_stop_and_remove_bus_device_locked+0x16/0x30
[ 361.122072] remove_store+0xab/0xc0
[ 361.123929] kernfs_fop_write_iter+0x2f2/0x550
[ 361.125792] vfs_write+0x54b/0xbd0
[ 361.127631] ksys_write+0xe0/0x190
[ 361.129459] do_syscall_64+0x5b/0x160
[ 361.131274] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 361.134937] Second to last potentially related work creation:
[ 361.136852] kasan_save_stack+0x1c/0x40
[ 361.138760] __kasan_record_aux_stack+0xba/0xd0
[ 361.140685] insert_work+0x2b/0x1f0
[ 361.142612] __queue_work.part.0+0x516/0xb20
[ 361.144564] queue_work_on+0x5a/0x70
[ 361.146526] call_timer_fn+0x25/0x190
[ 361.148475] __run_timers+0x522/0x850
[ 361.150433] run_timer_softirq+0x117/0x270
[ 361.152402] handle_softirqs+0x196/0x570
[ 361.154398] __irq_exit_rcu+0xb6/0xf0
[ 361.156402] sysvec_apic_timer_interrupt+0x6e/0x90
[ 361.158439] asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 361.162525] The buggy address belongs to the object at ffff88c0a53819b0
[ 361.162525] which belongs to the cache request_queue of size 968
[ 361.166831] The buggy address is located 72 bytes inside of
[ 361.166831] freed 968-byte region [ffff88c0a53819b0, ffff88c0a5381d78)
[ 361.173361] The buggy address belongs to the physical page:
[ 361.175584] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x40a5380
[ 361.177895] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 361.180251] flags: 0x15ffff0000000040(head|node=1|zone=2|lastcpupid=0x1ffff)
[ 361.182652] page_type: f5(slab)
[ 361.185024] raw: 15ffff0000000040 ffff88810b26e800 dead000000000122 0000000000000000
[ 361.187525] raw: 0000000000000000 00000000801d001d 00000001f5000000 0000000000000000
[ 361.190033] head: 15ffff0000000040 ffff88810b26e800 dead000000000122 0000000000000000
[ 361.192551] head: 0000000000000000 00000000801d001d 00000001f5000000 0000000000000000
[ 361.195057] head: 15ffff0000000003 ffffea010294e001 ffffffffffffffff 0000000000000000
[ 361.197564] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[ 361.200047] page dumped because: kasan: bad access detected
[ 361.205007] Memory state around the buggy address:
[ 361.207539] ffff88c0a5381880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 361.210162] ffff88c0a5381900: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc
[ 361.212762] >ffff88c0a5381980: fc fc fc fc fc fc fa fb fb fb fb fb fb fb fb fb
[ 361.215346] ^
[ 361.217944] ffff88c0a5381a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 361.220542] ffff88c0a5381a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 361.223104] ==================================================================
[ 361.225739] Disabling lock debugging due to kernel taint
Yuanyuan Zhong (1):
nvme: fix use-after-free of admin queue via stale pointer
drivers/nvme/host/core.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
--
2.49.0
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer
2025-10-29 21:08 [PATCH 0/1] cover letter Casey Chen
@ 2025-10-29 21:08 ` Casey Chen
2025-10-29 23:00 ` Chaitanya Kulkarni
2025-10-30 0:39 ` Keith Busch
0 siblings, 2 replies; 12+ messages in thread
From: Casey Chen @ 2025-10-29 21:08 UTC (permalink / raw)
To: linux-nvme, linux-block; +Cc: yzhong, sconnor, axboe, mkhalfella, Casey Chen
From: Yuanyuan Zhong <yzhong@purestorage.com>
When a controller is deleted (e.g., via sysfs delete_controller), the
admin queue is freed while userspace may still have an open fd to the
namespace block device. Userspace can issue IOCTLs on the open fd
that access the freed admin queue through the stale ns->ctrl->admin_q
pointer, causing a use-after-free.
Fix this by taking an additional reference on the admin queue during
namespace allocation and releasing it during namespace cleanup.
Signed-off-by: Casey Chen <cachen@purestorage.com>
Signed-off-by: Seamus Connor <sconnor@purestorage.com>
---
drivers/nvme/host/core.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8d8af58e79d1..184a6096a2be 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -687,6 +687,7 @@ static void nvme_free_ns(struct kref *kref)
{
struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
+ blk_put_queue(ns->ctrl->admin_q);
put_disk(ns->disk);
nvme_put_ns_head(ns->head);
nvme_put_ctrl(ns->ctrl);
@@ -3903,9 +3904,14 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
struct gendisk *disk;
int node = ctrl->numa_node;
+ if (!blk_get_queue(ctrl->admin_q)) {
+ dev_err(ctrl->device, "failed to get admin_q %p\n", ctrl->admin_q);
+ return;
+ }
+
ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
if (!ns)
- return;
+ goto out_put_admin_q;
if (ctrl->opts && ctrl->opts->data_digest)
lim.features |= BLK_FEAT_STABLE_WRITES;
@@ -4002,6 +4008,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
put_disk(disk);
out_free_ns:
kfree(ns);
+ out_put_admin_q:
+ blk_put_queue(ctrl->admin_q);
}
static void nvme_ns_remove(struct nvme_ns *ns)
--
2.49.0
* Re: [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer
2025-10-29 21:08 ` [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer Casey Chen
@ 2025-10-29 23:00 ` Chaitanya Kulkarni
2025-10-29 23:33 ` Ming Lei
2025-10-30 0:39 ` Keith Busch
1 sibling, 1 reply; 12+ messages in thread
From: Chaitanya Kulkarni @ 2025-10-29 23:00 UTC (permalink / raw)
To: Casey Chen, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org
Cc: yzhong@purestorage.com, sconnor@purestorage.com, axboe@kernel.dk,
mkhalfella@purestorage.com
On 10/29/25 2:08 PM, Casey Chen wrote:
> From: Yuanyuan Zhong <yzhong@purestorage.com>
>
> When a controller is deleted (e.g., via sysfs delete_controller), the
> admin queue is freed while userspace may still have open fd to the
> namespace block device. Userspace can issue IOCTLs on the open fd
> that access the freed admin queue through the stale ns->ctrl->admin_q
> pointer, causing a use-after-free.
>
> Fix this by taking an additional reference on the admin queue during
> namespace allocation and releasing it during namespace cleanup.
>
> Signed-off-by: Casey Chen <cachen@purestorage.com>
> Signed-off-by: Seamus Connor <sconnor@purestorage.com>
> ---
> drivers/nvme/host/core.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 8d8af58e79d1..184a6096a2be 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -687,6 +687,7 @@ static void nvme_free_ns(struct kref *kref)
> {
> struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
>
> + blk_put_queue(ns->ctrl->admin_q);
> put_disk(ns->disk);
> nvme_put_ns_head(ns->head);
> nvme_put_ctrl(ns->ctrl);
> @@ -3903,9 +3904,14 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
> struct gendisk *disk;
> int node = ctrl->numa_node;
>
Would it be a good idea to add a comment at both places to explain why
we are taking this additional reference, since this is specifically
needed for userspace?
> + if (!blk_get_queue(ctrl->admin_q)) {
> + dev_err(ctrl->device, "failed to get admin_q %p\n", ctrl->admin_q);
> + return;
> + }
> +
> ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
-ck
* Re: [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer
2025-10-29 23:00 ` Chaitanya Kulkarni
@ 2025-10-29 23:33 ` Ming Lei
0 siblings, 0 replies; 12+ messages in thread
From: Ming Lei @ 2025-10-29 23:33 UTC (permalink / raw)
To: Chaitanya Kulkarni
Cc: Casey Chen, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, yzhong@purestorage.com,
sconnor@purestorage.com, axboe@kernel.dk,
mkhalfella@purestorage.com
On Wed, Oct 29, 2025 at 11:00:29PM +0000, Chaitanya Kulkarni wrote:
> On 10/29/25 2:08 PM, Casey Chen wrote:
> > From: Yuanyuan Zhong <yzhong@purestorage.com>
> >
> > When a controller is deleted (e.g., via sysfs delete_controller), the
> > admin queue is freed while userspace may still have open fd to the
> > namespace block device. Userspace can issue IOCTLs on the open fd
> > that access the freed admin queue through the stale ns->ctrl->admin_q
> > pointer, causing a use-after-free.
> >
> > Fix this by taking an additional reference on the admin queue during
> > namespace allocation and releasing it during namespace cleanup.
> >
> > Signed-off-by: Casey Chen <cachen@purestorage.com>
> > Signed-off-by: Seamus Connor <sconnor@purestorage.com>
> > ---
> > drivers/nvme/host/core.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 8d8af58e79d1..184a6096a2be 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -687,6 +687,7 @@ static void nvme_free_ns(struct kref *kref)
> > {
> > struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
> >
> > + blk_put_queue(ns->ctrl->admin_q);
> > put_disk(ns->disk);
> > nvme_put_ns_head(ns->head);
> > nvme_put_ctrl(ns->ctrl);
> > @@ -3903,9 +3904,14 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
> > struct gendisk *disk;
> > int node = ctrl->numa_node;
> >
>
> would be a good idea to add a comment at both places to explain why we
> are taking this additional reference ? since this is specifically needed
> for userspace.
Yes, it needs documentation. I believe it is because a reference on
`struct nvme_ctrl` doesn't cover ctrl->admin_queue & ctrl->admin_tagset.
Thanks,
Ming
* Re: [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer
2025-10-29 21:08 ` [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer Casey Chen
2025-10-29 23:00 ` Chaitanya Kulkarni
@ 2025-10-30 0:39 ` Keith Busch
2025-10-30 8:12 ` Chaitanya Kulkarni
1 sibling, 1 reply; 12+ messages in thread
From: Keith Busch @ 2025-10-30 0:39 UTC (permalink / raw)
To: Casey Chen; +Cc: linux-nvme, linux-block, yzhong, sconnor, axboe, mkhalfella
On Wed, Oct 29, 2025 at 03:08:53PM -0600, Casey Chen wrote:
>
> Fix this by taking an additional reference on the admin queue during
> namespace allocation and releasing it during namespace cleanup.
Since the namespaces already hold references on the controller, would it
be simpler to move the controller's final blk_put_queue to the final
ctrl free? This should have the same lifetime as your patch, but with
simpler ref counting:
---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fa4181d7de736..0b83d82f67e75 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4901,7 +4901,6 @@ void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
*/
nvme_stop_keep_alive(ctrl);
blk_mq_destroy_queue(ctrl->admin_q);
- blk_put_queue(ctrl->admin_q);
if (ctrl->ops->flags & NVME_F_FABRICS) {
blk_mq_destroy_queue(ctrl->fabrics_q);
blk_put_queue(ctrl->fabrics_q);
@@ -5045,6 +5044,7 @@ static void nvme_free_ctrl(struct device *dev)
container_of(dev, struct nvme_ctrl, ctrl_device);
struct nvme_subsystem *subsys = ctrl->subsys;
+ blk_put_queue(ctrl->admin_q);
if (!subsys || ctrl->instance != subsys->instance)
ida_free(&nvme_instance_ida, ctrl->instance);
nvme_free_cels(ctrl);
--
* Re: [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer
2025-10-30 0:39 ` Keith Busch
@ 2025-10-30 8:12 ` Chaitanya Kulkarni
2025-10-31 2:54 ` Keith Busch
2025-10-31 19:19 ` Casey Chen
0 siblings, 2 replies; 12+ messages in thread
From: Chaitanya Kulkarni @ 2025-10-30 8:12 UTC (permalink / raw)
To: Keith Busch, Casey Chen
Cc: linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
yzhong@purestorage.com, sconnor@purestorage.com, axboe@kernel.dk,
mkhalfella@purestorage.com, Ming Lei
On 10/29/25 17:39, Keith Busch wrote:
> On Wed, Oct 29, 2025 at 03:08:53PM -0600, Casey Chen wrote:
>> Fix this by taking an additional reference on the admin queue during
>> namespace allocation and releasing it during namespace cleanup.
> Since the namespaces already hold references on the controller, would it
> be simpler to move the controller's final blk_put_queue to the final
> ctrl free? This should have the same lifetime as your patch, but with
> simpler ref counting:
>
> ---
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fa4181d7de736..0b83d82f67e75 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4901,7 +4901,6 @@ void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
> */
> nvme_stop_keep_alive(ctrl);
> blk_mq_destroy_queue(ctrl->admin_q);
> - blk_put_queue(ctrl->admin_q);
> if (ctrl->ops->flags & NVME_F_FABRICS) {
> blk_mq_destroy_queue(ctrl->fabrics_q);
> blk_put_queue(ctrl->fabrics_q);
> @@ -5045,6 +5044,7 @@ static void nvme_free_ctrl(struct device *dev)
> container_of(dev, struct nvme_ctrl, ctrl_device);
> struct nvme_subsystem *subsys = ctrl->subsys;
>
> + blk_put_queue(ctrl->admin_q);
> if (!subsys || ctrl->instance != subsys->instance)
> ida_free(&nvme_instance_ida, ctrl->instance);
> nvme_free_cels(ctrl);
> --
>
The above is a much better approach: it doesn't rely on taking an extra
refcount but uses the existing count to protect against the UAF.
I've added the comments that are very much needed here,
totally untested :-
nvme: fix UAF when accessing admin queue after removal
Fix a use-after-free where userspace IOCTLs can access ctrl->admin_q
after it has been freed during controller removal.
The Race Condition:
Thread 1 (userspace IOCTL) Thread 2 (sysfs remove)
-------------------------- -------------------
open(/dev/nvme0n1) -> fd=3
ioctl(3, NVME_IOCTL_ADMIN_CMD)
nvme_ioctl()
nvme_user_cmd()
echo 1 > .../remove
pci_device_remove()
nvme_remove()
nvme_remove_admin_tag_set()
blk_put_queue(admin_q)
[RCU grace period]
blk_free_queue(admin_q)
kmem_cache_free() <- FREED
nvme_submit_user_cmd(ns->ctrl->admin_q) <- STALE POINTER
blk_mq_alloc_request(admin_q)
blk_queue_enter(admin_q)
*** USE-AFTER-FREE ***
The admin queue is freed in nvme_remove_admin_tag_set() while userspace
may still hold open file descriptors to namespace devices. These open
file descriptors can issue IOCTLs that dereference ctrl->admin_q after
it has been freed.
Defer blk_put_queue(ctrl->admin_q) from nvme_remove_admin_tag_set() to
nvme_free_ctrl(). Since each namespace holds a controller reference via
nvme_get_ctrl()/nvme_put_ctrl(), the controller will only be freed after
all namespaces (and their open file descriptors) are released. This
guarantees admin_q remains allocated while it may still be accessed.
After blk_mq_destroy_queue() in nvme_remove_admin_tag_set(), the queue
is marked dying (QUEUE_FLAG_DYING), so new IOCTL attempts fail safely
at blk_queue_enter() with -ENODEV. The queue structure remains valid for
pointer dereference until nvme_free_ctrl() is called.
---
drivers/nvme/host/core.c | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 734ad725e6f4..dbbcf99dbef8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4897,7 +4897,19 @@ void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
*/
nvme_stop_keep_alive(ctrl);
blk_mq_destroy_queue(ctrl->admin_q);
- blk_put_queue(ctrl->admin_q);
+ /*
+ * Defer blk_put_queue() to nvme_free_ctrl() to prevent use-after-free.
+ *
+ * Userspace may hold open file descriptors to namespace devices and
+ * issue IOCTLs that dereference ctrl->admin_q after controller removal
+ * starts. Since each namespace holds a controller reference, deferring
+ * the final queue release ensures admin_q remains allocated until all
+ * namespace references are released.
+ *
+ * blk_mq_destroy_queue() above marks the queue dying (QUEUE_FLAG_DYING),
+ * causing new requests to fail at blk_queue_enter() with -ENODEV while
+ * keeping the structure valid for pointer access.
+ */
if (ctrl->ops->flags & NVME_F_FABRICS) {
blk_mq_destroy_queue(ctrl->fabrics_q);
blk_put_queue(ctrl->fabrics_q);
@@ -5041,6 +5053,14 @@ static void nvme_free_ctrl(struct device *dev)
container_of(dev, struct nvme_ctrl, ctrl_device);
struct nvme_subsystem *subsys = ctrl->subsys;
+ /*
+ * Release admin_q's final reference. All namespace references have
+ * been released at this point. A NULL check is needed to handle
+ * allocation failure in nvme_alloc_admin_tag_set().
+ */
+ if (ctrl->admin_q)
+ blk_put_queue(ctrl->admin_q);
+
if (!subsys || ctrl->instance != subsys->instance)
ida_free(&nvme_instance_ida, ctrl->instance);
nvme_free_cels(ctrl);
-ck
* Re: [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer
2025-10-30 8:12 ` Chaitanya Kulkarni
@ 2025-10-31 2:54 ` Keith Busch
2025-10-31 8:16 ` Chaitanya Kulkarni
2025-10-31 19:19 ` Casey Chen
1 sibling, 1 reply; 12+ messages in thread
From: Keith Busch @ 2025-10-31 2:54 UTC (permalink / raw)
To: Chaitanya Kulkarni
Cc: Casey Chen, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, yzhong@purestorage.com,
sconnor@purestorage.com, axboe@kernel.dk,
mkhalfella@purestorage.com, Ming Lei
On Thu, Oct 30, 2025 at 08:12:53AM +0000, Chaitanya Kulkarni wrote:
> On 10/29/25 17:39, Keith Busch wrote:
> I've added required comments that are very much needed here,
> totally untested :-
Honestly that seems a bit verbose. The code mostly speaks for itself
with this move: the controller's request_queue reference remains active
as long as there's an active controller reference.
* Re: [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer
2025-10-31 2:54 ` Keith Busch
@ 2025-10-31 8:16 ` Chaitanya Kulkarni
0 siblings, 0 replies; 12+ messages in thread
From: Chaitanya Kulkarni @ 2025-10-31 8:16 UTC (permalink / raw)
To: Keith Busch
Cc: Casey Chen, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, yzhong@purestorage.com,
sconnor@purestorage.com, axboe@kernel.dk,
mkhalfella@purestorage.com, Ming Lei
On 10/30/25 19:54, Keith Busch wrote:
> On Thu, Oct 30, 2025 at 08:12:53AM +0000, Chaitanya Kulkarni wrote:
>> On 10/29/25 17:39, Keith Busch wrote:
>> I've added required comments that are very much needed here,
>> totally untested :-
> Honestly that seems a bit verbose. The code mostly speaks for itself
> with this move: the controller's request_queue reference remains active
> as long as there's an active controller reference.
whatever works for everyone :).
-ck
* Re: [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer
2025-10-30 8:12 ` Chaitanya Kulkarni
2025-10-31 2:54 ` Keith Busch
@ 2025-10-31 19:19 ` Casey Chen
2025-10-31 19:31 ` Chaitanya Kulkarni
1 sibling, 1 reply; 12+ messages in thread
From: Casey Chen @ 2025-10-31 19:19 UTC (permalink / raw)
To: Chaitanya Kulkarni
Cc: Keith Busch, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, yzhong@purestorage.com,
sconnor@purestorage.com, axboe@kernel.dk,
mkhalfella@purestorage.com, Ming Lei
On Thu, Oct 30, 2025 at 1:12 AM Chaitanya Kulkarni
<chaitanyak@nvidia.com> wrote:
>
> On 10/29/25 17:39, Keith Busch wrote:
> > On Wed, Oct 29, 2025 at 03:08:53PM -0600, Casey Chen wrote:
> >> Fix this by taking an additional reference on the admin queue during
> >> namespace allocation and releasing it during namespace cleanup.
> > Since the namespaces already hold references on the controller, would it
> > be simpler to move the controller's final blk_put_queue to the final
> > ctrl free? This should have the same lifetime as your patch, but with
> > simpler ref counting:
> >
> > ---
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index fa4181d7de736..0b83d82f67e75 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -4901,7 +4901,6 @@ void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
> > */
> > nvme_stop_keep_alive(ctrl);
> > blk_mq_destroy_queue(ctrl->admin_q);
> > - blk_put_queue(ctrl->admin_q);
> > if (ctrl->ops->flags & NVME_F_FABRICS) {
> > blk_mq_destroy_queue(ctrl->fabrics_q);
> > blk_put_queue(ctrl->fabrics_q);
> > @@ -5045,6 +5044,7 @@ static void nvme_free_ctrl(struct device *dev)
> > container_of(dev, struct nvme_ctrl, ctrl_device);
> > struct nvme_subsystem *subsys = ctrl->subsys;
> >
> > + blk_put_queue(ctrl->admin_q);
> > if (!subsys || ctrl->instance != subsys->instance)
> > ida_free(&nvme_instance_ida, ctrl->instance);
> > nvme_free_cels(ctrl);
> > --
> >
>
> above is much better approach that doesn't rely on taking extra
> ref count but using existing count to protect the UAF.
> I've added required comments that are very much needed here,
> totally untested :-
>
> nvme: fix UAF when accessing admin queue after removal
>
> Fix a use-after-free where userspace IOCTLs can access ctrl->admin_q
> after it has been freed during controller removal.
>
> The Race Condition:
>
> Thread 1 (userspace IOCTL) Thread 2 (sysfs remove)
> -------------------------- -------------------
> open(/dev/nvme0n1) -> fd=3
> ioctl(3, NVME_IOCTL_ADMIN_CMD)
> nvme_ioctl()
> nvme_user_cmd()
> echo 1 > .../remove
> pci_device_remove()
> nvme_remove()
> nvme_remove_admin_tag_set()
> blk_put_queue(admin_q)
> [RCU grace period]
> blk_free_queue(admin_q)
> kmem_cache_free() <- FREED
> nvme_submit_user_cmd(ns->ctrl->admin_q) <- STALE POINTER
> blk_mq_alloc_request(admin_q)
> blk_queue_enter(admin_q)
> *** USE-AFTER-FREE ***
>
>
> The admin queue is freed in nvme_remove_admin_tag_set() while userspace
> may still hold open file descriptors to namespace devices. These open
> file descriptors can issue IOCTLs that dereference ctrl->admin_q after
> it has been freed.
>
> Defer blk_put_queue(ctrl->admin_q) from nvme_remove_admin_tag_set() to
> nvme_free_ctrl(). Since each namespace holds a controller reference via
> nvme_get_ctrl()/nvme_put_ctrl(), the controller will only be freed after
> all namespaces (and their open file descriptors) are released. This
> guarantees admin_q remains allocated while it may still be accessed.
>
> After blk_mq_destroy_queue() in nvme_remove_admin_tag_set(), the queue
> is marked dying (QUEUE_FLAG_DYING), so new IOCTL attempts fail safely
> at blk_queue_enter() with -ENODEV. The queue structure remains valid for
> pointer dereference until nvme_free_ctrl() is called.
>
> ---
> drivers/nvme/host/core.c | 22 +++++++++++++++++++++-
> 1 file changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 734ad725e6f4..dbbcf99dbef8 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4897,7 +4897,19 @@ void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
> */
> nvme_stop_keep_alive(ctrl);
> blk_mq_destroy_queue(ctrl->admin_q);
> - blk_put_queue(ctrl->admin_q);
> + /**
> + * Defer blk_put_queue() to nvme_free_ctrl() to prevent use-after-free.
> + *
> + * Userspace may hold open file descriptors to namespace devices and
> + * issue IOCTLs that dereference ctrl->admin_q after controller removal
> + * starts. Since each namespace holds a controller reference, deferring
> + * the final queue release ensures admin_q remains allocated until all
> + * namespace references are released.
> + *
> + * blk_mq_destroy_queue() above marks the queue dying (QUEUE_FLAG_DYING),
> + * causing new requests to fail at blk_queue_enter() with -ENODEV while
> + * keeping the structure valid for pointer access.
> + */
> if (ctrl->ops->flags & NVME_F_FABRICS) {
> blk_mq_destroy_queue(ctrl->fabrics_q);
> blk_put_queue(ctrl->fabrics_q);
> @@ -5041,6 +5053,14 @@ static void nvme_free_ctrl(struct device *dev)
> container_of(dev, struct nvme_ctrl, ctrl_device);
> struct nvme_subsystem *subsys = ctrl->subsys;
>
> + /**
> + * Release admin_q's final reference. All namespace references have
> + * been released at this point. A NULL check is needed to handle
> + * allocation failure in nvme_alloc_admin_tag_set().
> + */
> + if (ctrl->admin_q)
> + blk_put_queue(ctrl->admin_q);
> +
> if (!subsys || ctrl->instance != subsys->instance)
> ida_free(&nvme_instance_ida, ctrl->instance);
> nvme_free_cels(ctrl);
>
> -ck
>
>
>
Thanks Chaitanya. I tested your fix and all tests look good. We're
looking forward to your final version.
* Re: [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer
2025-10-31 19:19 ` Casey Chen
@ 2025-10-31 19:31 ` Chaitanya Kulkarni
2025-11-04 22:39 ` Casey Chen
0 siblings, 1 reply; 12+ messages in thread
From: Chaitanya Kulkarni @ 2025-10-31 19:31 UTC (permalink / raw)
To: Casey Chen
Cc: Keith Busch, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, yzhong@purestorage.com,
sconnor@purestorage.com, axboe@kernel.dk,
mkhalfella@purestorage.com, Ming Lei
On 10/31/25 12:19, Casey Chen wrote:
> On Thu, Oct 30, 2025 at 1:12 AM Chaitanya Kulkarni
> <chaitanyak@nvidia.com> wrote:
>> On 10/29/25 17:39, Keith Busch wrote:
>>> On Wed, Oct 29, 2025 at 03:08:53PM -0600, Casey Chen wrote:
>>>> Fix this by taking an additional reference on the admin queue during
>>>> namespace allocation and releasing it during namespace cleanup.
>>> Since the namespaces already hold references on the controller, would it
>>> be simpler to move the controller's final blk_put_queue to the final
>>> ctrl free? This should have the same lifetime as your patch, but with
>>> simpler ref counting:
>>>
>>> ---
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index fa4181d7de736..0b83d82f67e75 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -4901,7 +4901,6 @@ void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
>>> */
>>> nvme_stop_keep_alive(ctrl);
>>> blk_mq_destroy_queue(ctrl->admin_q);
>>> - blk_put_queue(ctrl->admin_q);
>>> if (ctrl->ops->flags & NVME_F_FABRICS) {
>>> blk_mq_destroy_queue(ctrl->fabrics_q);
>>> blk_put_queue(ctrl->fabrics_q);
>>> @@ -5045,6 +5044,7 @@ static void nvme_free_ctrl(struct device *dev)
>>> container_of(dev, struct nvme_ctrl, ctrl_device);
>>> struct nvme_subsystem *subsys = ctrl->subsys;
>>>
>>> + blk_put_queue(ctrl->admin_q);
>>> if (!subsys || ctrl->instance != subsys->instance)
>>> ida_free(&nvme_instance_ida, ctrl->instance);
>>> nvme_free_cels(ctrl);
>>> --
>>>
>> above is much better approach that doesn't rely on taking extra
>> ref count but using existing count to protect the UAF.
>> I've added required comments that are very much needed here,
>> totally untested :-
>>
>> nvme: fix UAF when accessing admin queue after removal
>>
>> Fix a use-after-free where userspace IOCTLs can access ctrl->admin_q
>> after it has been freed during controller removal.
>>
>> The Race Condition:
>>
>> Thread 1 (userspace IOCTL) Thread 2 (sysfs remove)
>> -------------------------- -------------------
>> open(/dev/nvme0n1) -> fd=3
>> ioctl(3, NVME_IOCTL_ADMIN_CMD)
>> nvme_ioctl()
>> nvme_user_cmd()
>> echo 1 > .../remove
>> pci_device_remove()
>> nvme_remove()
>> nvme_remove_admin_tag_set()
>> blk_put_queue(admin_q)
>> [RCU grace period]
>> blk_free_queue(admin_q)
>> kmem_cache_free() <- FREED
>> nvme_submit_user_cmd(ns->ctrl->admin_q) <- STALE POINTER
>> blk_mq_alloc_request(admin_q)
>> blk_queue_enter(admin_q)
>> *** USE-AFTER-FREE ***
>>
>>
>> The admin queue is freed in nvme_remove_admin_tag_set() while userspace
>> may still hold open file descriptors to namespace devices. These open
>> file descriptors can issue IOCTLs that dereference ctrl->admin_q after
>> it has been freed.
>>
>> Defer blk_put_queue(ctrl->admin_q) from nvme_remove_admin_tag_set() to
>> nvme_free_ctrl(). Since each namespace holds a controller reference via
>> nvme_get_ctrl()/nvme_put_ctrl(), the controller will only be freed after
>> all namespaces (and their open file descriptors) are released. This
>> guarantees admin_q remains allocated while it may still be accessed.
>>
>> After blk_mq_destroy_queue() in nvme_remove_admin_tag_set(), the queue
>> is marked dying (QUEUE_FLAG_DYING), so new IOCTL attempts fail safely
>> at blk_queue_enter() with -ENODEV. The queue structure remains valid for
>> pointer dereference until nvme_free_ctrl() is called.
>>
>> ---
>> drivers/nvme/host/core.c | 22 +++++++++++++++++++++-
>> 1 file changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 734ad725e6f4..dbbcf99dbef8 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -4897,7 +4897,19 @@ void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
>> */
>> nvme_stop_keep_alive(ctrl);
>> blk_mq_destroy_queue(ctrl->admin_q);
>> - blk_put_queue(ctrl->admin_q);
>> + /*
>> + * Defer blk_put_queue() to nvme_free_ctrl() to prevent use-after-free.
>> + *
>> + * Userspace may hold open file descriptors to namespace devices and
>> + * issue IOCTLs that dereference ctrl->admin_q after controller removal
>> + * starts. Since each namespace holds a controller reference, deferring
>> + * the final queue release ensures admin_q remains allocated until all
>> + * namespace references are released.
>> + *
>> + * blk_mq_destroy_queue() above marks the queue dying (QUEUE_FLAG_DYING),
>> + * causing new requests to fail at blk_queue_enter() with -ENODEV while
>> + * keeping the structure valid for pointer access.
>> + */
>> if (ctrl->ops->flags & NVME_F_FABRICS) {
>> blk_mq_destroy_queue(ctrl->fabrics_q);
>> blk_put_queue(ctrl->fabrics_q);
>> @@ -5041,6 +5053,14 @@ static void nvme_free_ctrl(struct device *dev)
>> container_of(dev, struct nvme_ctrl, ctrl_device);
>> struct nvme_subsystem *subsys = ctrl->subsys;
>>
>> + /*
>> + * Release admin_q's final reference. All namespace references have
>> + * been released at this point. The NULL check handles allocation
>> + * failure in nvme_alloc_admin_tag_set().
>> + */
>> + if (ctrl->admin_q)
>> + blk_put_queue(ctrl->admin_q);
>> +
>> if (!subsys || ctrl->instance != subsys->instance)
>> ida_free(&nvme_instance_ida, ctrl->instance);
>> nvme_free_cels(ctrl);
>>
>> -ck
>>
>>
>>
> Thanks Chaitanya. I tested your fix and all tests look good. We're
> looking forward to your final version.
This is Keith's patch. I just tested it and added the commit log comments.
-ck
* Re: [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer
2025-10-31 19:31 ` Chaitanya Kulkarni
@ 2025-11-04 22:39 ` Casey Chen
2025-11-04 23:00 ` Keith Busch
0 siblings, 1 reply; 12+ messages in thread
From: Casey Chen @ 2025-11-04 22:39 UTC (permalink / raw)
To: Chaitanya Kulkarni
Cc: Keith Busch, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, yzhong@purestorage.com,
sconnor@purestorage.com, axboe@kernel.dk,
mkhalfella@purestorage.com, Ming Lei
On Fri, Oct 31, 2025 at 12:31 PM Chaitanya Kulkarni
<chaitanyak@nvidia.com> wrote:
>
> On 10/31/25 12:19, Casey Chen wrote:
> > On Thu, Oct 30, 2025 at 1:12 AM Chaitanya Kulkarni
> > <chaitanyak@nvidia.com> wrote:
> >> On 10/29/25 17:39, Keith Busch wrote:
> >>> On Wed, Oct 29, 2025 at 03:08:53PM -0600, Casey Chen wrote:
> >>>> Fix this by taking an additional reference on the admin queue during
> >>>> namespace allocation and releasing it during namespace cleanup.
> >>> Since the namespaces already hold references on the controller, would it
> >>> be simpler to move the controller's final blk_put_queue to the final
> >>> ctrl free? This should have the same lifetime as your patch, but with
> >>> simpler ref counting:
> >>>
> >>> ---
> >>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >>> index fa4181d7de736..0b83d82f67e75 100644
> >>> --- a/drivers/nvme/host/core.c
> >>> +++ b/drivers/nvme/host/core.c
> >>> @@ -4901,7 +4901,6 @@ void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
> >>> */
> >>> nvme_stop_keep_alive(ctrl);
> >>> blk_mq_destroy_queue(ctrl->admin_q);
> >>> - blk_put_queue(ctrl->admin_q);
> >>> if (ctrl->ops->flags & NVME_F_FABRICS) {
> >>> blk_mq_destroy_queue(ctrl->fabrics_q);
> >>> blk_put_queue(ctrl->fabrics_q);
> >>> @@ -5045,6 +5044,7 @@ static void nvme_free_ctrl(struct device *dev)
> >>> container_of(dev, struct nvme_ctrl, ctrl_device);
> >>> struct nvme_subsystem *subsys = ctrl->subsys;
> >>>
> >>> + blk_put_queue(ctrl->admin_q);
> >>> if (!subsys || ctrl->instance != subsys->instance)
> >>> ida_free(&nvme_instance_ida, ctrl->instance);
> >>> nvme_free_cels(ctrl);
> >>> --
> >>>
> >> above is much better approach that doesn't rely on taking extra
> >> ref count but using existing count to protect the UAF.
> >> I've added required comments that are very much needed here,
> >> totally untested :-
> >>
> >> nvme: fix UAF when accessing admin queue after removal
> >>
> >> Fix a use-after-free where userspace IOCTLs can access ctrl->admin_q
> >> after it has been freed during controller removal.
> >>
> >> The Race Condition:
> >>
> >> Thread 1 (userspace IOCTL) Thread 2 (sysfs remove)
> >> -------------------------- -------------------
> >> open(/dev/nvme0n1) -> fd=3
> >> ioctl(3, NVME_IOCTL_ADMIN_CMD)
> >> nvme_ioctl()
> >> nvme_user_cmd()
> >> echo 1 > .../remove
> >> pci_device_remove()
> >> nvme_remove()
> >> nvme_remove_admin_tag_set()
> >> blk_put_queue(admin_q)
> >> [RCU grace period]
> >> blk_free_queue(admin_q)
> >> kmem_cache_free() <- FREED
> >> nvme_submit_user_cmd(ns->ctrl->admin_q) <- STALE POINTER
> >> blk_mq_alloc_request(admin_q)
> >> blk_queue_enter(admin_q)
> >> *** USE-AFTER-FREE ***
> >>
> >>
> >> The admin queue is freed in nvme_remove_admin_tag_set() while userspace
> >> may still hold open file descriptors to namespace devices. These open
> >> file descriptors can issue IOCTLs that dereference ctrl->admin_q after
> >> it has been freed.
> >>
> >> Defer blk_put_queue(ctrl->admin_q) from nvme_remove_admin_tag_set() to
> >> nvme_free_ctrl(). Since each namespace holds a controller reference via
> >> nvme_get_ctrl()/nvme_put_ctrl(), the controller will only be freed after
> >> all namespaces (and their open file descriptors) are released. This
> >> guarantees admin_q remains allocated while it may still be accessed.
> >>
> >> After blk_mq_destroy_queue() in nvme_remove_admin_tag_set(), the queue
> >> is marked dying (QUEUE_FLAG_DYING), so new IOCTL attempts fail safely
> >> at blk_queue_enter() with -ENODEV. The queue structure remains valid for
> >> pointer dereference until nvme_free_ctrl() is called.
> >>
> >> ---
> >> drivers/nvme/host/core.c | 22 +++++++++++++++++++++-
> >> 1 file changed, 21 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >> index 734ad725e6f4..dbbcf99dbef8 100644
> >> --- a/drivers/nvme/host/core.c
> >> +++ b/drivers/nvme/host/core.c
> >> @@ -4897,7 +4897,19 @@ void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
> >> */
> >> nvme_stop_keep_alive(ctrl);
> >> blk_mq_destroy_queue(ctrl->admin_q);
> >> - blk_put_queue(ctrl->admin_q);
> >> + /*
> >> + * Defer blk_put_queue() to nvme_free_ctrl() to prevent use-after-free.
> >> + *
> >> + * Userspace may hold open file descriptors to namespace devices and
> >> + * issue IOCTLs that dereference ctrl->admin_q after controller removal
> >> + * starts. Since each namespace holds a controller reference, deferring
> >> + * the final queue release ensures admin_q remains allocated until all
> >> + * namespace references are released.
> >> + *
> >> + * blk_mq_destroy_queue() above marks the queue dying (QUEUE_FLAG_DYING),
> >> + * causing new requests to fail at blk_queue_enter() with -ENODEV while
> >> + * keeping the structure valid for pointer access.
> >> + */
> >> if (ctrl->ops->flags & NVME_F_FABRICS) {
> >> blk_mq_destroy_queue(ctrl->fabrics_q);
> >> blk_put_queue(ctrl->fabrics_q);
> >> @@ -5041,6 +5053,14 @@ static void nvme_free_ctrl(struct device *dev)
> >> container_of(dev, struct nvme_ctrl, ctrl_device);
> >> struct nvme_subsystem *subsys = ctrl->subsys;
> >>
> >> + /*
> >> + * Release admin_q's final reference. All namespace references have
> >> + * been released at this point. The NULL check handles allocation
> >> + * failure in nvme_alloc_admin_tag_set().
> >> + */
> >> + if (ctrl->admin_q)
> >> + blk_put_queue(ctrl->admin_q);
> >> +
> >> if (!subsys || ctrl->instance != subsys->instance)
> >> ida_free(&nvme_instance_ida, ctrl->instance);
> >> nvme_free_cels(ctrl);
> >>
> >> -ck
> >>
> >>
> >>
> > Thanks Chaitanya. I tested your fix and all tests look good. We're
> > looking forward to your final version.
>
>
> This is Keith's patch. I just tested it and added the commit log comments.
>
>
Hi Keith,
Could you make a patch based on this? That way we can backport the
upstream patch instead of carrying our own. Thanks
> -ck
>
>
* Re: [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer
2025-11-04 22:39 ` Casey Chen
@ 2025-11-04 23:00 ` Keith Busch
0 siblings, 0 replies; 12+ messages in thread
From: Keith Busch @ 2025-11-04 23:00 UTC (permalink / raw)
To: Casey Chen
Cc: Chaitanya Kulkarni, linux-nvme@lists.infradead.org,
linux-block@vger.kernel.org, yzhong@purestorage.com,
sconnor@purestorage.com, axboe@kernel.dk,
mkhalfella@purestorage.com, Ming Lei
On Tue, Nov 04, 2025 at 02:39:47PM -0800, Casey Chen wrote:
> Hi Keith,
>
> Could you make a patch based on this? That way we can backport the
> upstream patch instead of carrying our own. Thanks
Sure, just sent it out. Wasn't sure if you were planning to do a v2 or
just use what I proposed.
end of thread, other threads:[~2025-11-04 23:00 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-29 21:08 [PATCH 0/1] cover letter Casey Chen
2025-10-29 21:08 ` [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer Casey Chen
2025-10-29 23:00 ` Chaitanya Kulkarni
2025-10-29 23:33 ` Ming Lei
2025-10-30 0:39 ` Keith Busch
2025-10-30 8:12 ` Chaitanya Kulkarni
2025-10-31 2:54 ` Keith Busch
2025-10-31 8:16 ` Chaitanya Kulkarni
2025-10-31 19:19 ` Casey Chen
2025-10-31 19:31 ` Chaitanya Kulkarni
2025-11-04 22:39 ` Casey Chen
2025-11-04 23:00 ` Keith Busch