Linux-NVME Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Chaitanya Kulkarni <chaitanyak@nvidia.com>
To: Keith Busch <kbusch@kernel.org>, Casey Chen <cachen@purestorage.com>
Cc: "linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	"yzhong@purestorage.com" <yzhong@purestorage.com>,
	"sconnor@purestorage.com" <sconnor@purestorage.com>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"mkhalfella@purestorage.com" <mkhalfella@purestorage.com>,
	Ming Lei <ming.lei@redhat.com>
Subject: Re: [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer
Date: Thu, 30 Oct 2025 08:12:53 +0000	[thread overview]
Message-ID: <9669f8a9-11ad-4911-9e03-00758e1d9957@nvidia.com> (raw)
In-Reply-To: <aQKzxpJp98Po_pch@kbusch-mbp>

On 10/29/25 17:39, Keith Busch wrote:
> On Wed, Oct 29, 2025 at 03:08:53PM -0600, Casey Chen wrote:
>> Fix this by taking an additional reference on the admin queue during
>> namespace allocation and releasing it during namespace cleanup.
> Since the namespaces already hold references on the controller, would it
> be simpler to move the controller's final blk_put_queue to the final
> ctrl free? This should have the same lifetime as your patch, but with
> simpler ref counting:
>
> ---
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fa4181d7de736..0b83d82f67e75 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4901,7 +4901,6 @@ void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
>           */
>          nvme_stop_keep_alive(ctrl);
>          blk_mq_destroy_queue(ctrl->admin_q);
> -       blk_put_queue(ctrl->admin_q);
>          if (ctrl->ops->flags & NVME_F_FABRICS) {
>                  blk_mq_destroy_queue(ctrl->fabrics_q);
>                  blk_put_queue(ctrl->fabrics_q);
> @@ -5045,6 +5044,7 @@ static void nvme_free_ctrl(struct device *dev)
>                  container_of(dev, struct nvme_ctrl, ctrl_device);
>          struct nvme_subsystem *subsys = ctrl->subsys;
>
> +       blk_put_queue(ctrl->admin_q);
>          if (!subsys || ctrl->instance != subsys->instance)
>                  ida_free(&nvme_instance_ida, ctrl->instance);
>          nvme_free_cels(ctrl);
> --
>

above is much better approach that doesn't rely on taking extra
ref count but using existing count to protect the UAF.
I've added required comments that are very much needed here,
totally untested :-

nvme: fix UAF when accessing admin queue after removal

Fix a use-after-free where userspace IOCTLs can access ctrl->admin_q
after it has been freed during controller removal.

The Race Condition:

   Thread 1 (userspace IOCTL)          Thread 2 (sysfs remove)
   --------------------------          -------------------
   open(/dev/nvme0n1) -> fd=3
   ioctl(3, NVME_IOCTL_ADMIN_CMD)
     nvme_ioctl()
     nvme_user_cmd()
                                        echo 1 > .../remove
                                        pci_device_remove()
                                        nvme_remove()
  nvme_remove_admin_tag_set()
                                          blk_put_queue(admin_q)
                                          [RCU grace period]
                                          blk_free_queue(admin_q)
                                            kmem_cache_free() <- FREED
     nvme_submit_user_cmd(ns->ctrl->admin_q) <- STALE POINTER
       blk_mq_alloc_request(admin_q)
         blk_queue_enter(admin_q)
           *** USE-AFTER-FREE ***


The admin queue is freed in nvme_remove_admin_tag_set() while userspace
may still hold open file descriptors to namespace devices. These open
file descriptors can issue IOCTLs that dereference ctrl->admin_q after
it has been freed.

Defer blk_put_queue(ctrl->admin_q) from nvme_remove_admin_tag_set() to
nvme_free_ctrl(). Since each namespace holds a controller reference via
nvme_get_ctrl()/nvme_put_ctrl(), the controller will only be freed after
all namespaces (and their open file descriptors) are released. This
guarantees admin_q remains allocated while it may still be accessed.

After blk_mq_destroy_queue() in nvme_remove_admin_tag_set(), the queue
is marked dying (QUEUE_FLAG_DYING), so new IOCTL attempts fail safely
at blk_queue_enter() with -ENODEV. The queue structure remains valid for
pointer dereference until nvme_free_ctrl() is called.

---
  drivers/nvme/host/core.c | 22 +++++++++++++++++++++-
  1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 734ad725e6f4..dbbcf99dbef8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4897,7 +4897,19 @@ void nvme_remove_admin_tag_set(struct nvme_ctrl 
*ctrl)
       */
      nvme_stop_keep_alive(ctrl);
      blk_mq_destroy_queue(ctrl->admin_q);
-    blk_put_queue(ctrl->admin_q);
+    /**
+     * Defer blk_put_queue() to nvme_free_ctrl() to prevent use-after-free.
+     *
+     * Userspace may hold open file descriptors to namespace devices and
+     * issue IOCTLs that dereference ctrl->admin_q after controller removal
+     * starts. Since each namespace holds a controller reference, deferring
+     * the final queue release ensures admin_q remains allocated until all
+     * namespace references are released.
+     *
+     * blk_mq_destroy_queue() above marks the queue dying 
(QUEUE_FLAG_DYING),
+     * causing new requests to fail at blk_queue_enter() with -ENODEV while
+     * keeping the structure valid for pointer access.
+     */
      if (ctrl->ops->flags & NVME_F_FABRICS) {
          blk_mq_destroy_queue(ctrl->fabrics_q);
          blk_put_queue(ctrl->fabrics_q);
@@ -5041,6 +5053,14 @@ static void nvme_free_ctrl(struct device *dev)
          container_of(dev, struct nvme_ctrl, ctrl_device);
      struct nvme_subsystem *subsys = ctrl->subsys;

+    /**
+     * Release admin_q's final reference. All namespace references have
+     * been released at this point. NULL check is needed for to handle
+     * allocation failure in nvme_alloc_admin_tag_set().
+     */
+    if (ctrl->admin_q)
+        blk_put_queue(ctrl->admin_q);
+
      if (!subsys || ctrl->instance != subsys->instance)
          ida_free(&nvme_instance_ida, ctrl->instance);
      nvme_free_cels(ctrl);

-ck




  reply	other threads:[~2025-10-30  8:13 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-29 21:08 [PATCH 0/1] cover letter Casey Chen
2025-10-29 21:08 ` [PATCH 1/1] nvme: fix use-after-free of admin queue via stale pointer Casey Chen
2025-10-29 23:00   ` Chaitanya Kulkarni
2025-10-29 23:33     ` Ming Lei
2025-10-30  0:39   ` Keith Busch
2025-10-30  8:12     ` Chaitanya Kulkarni [this message]
2025-10-31  2:54       ` Keith Busch
2025-10-31  8:16         ` Chaitanya Kulkarni
2025-10-31 19:19       ` Casey Chen
2025-10-31 19:31         ` Chaitanya Kulkarni
2025-11-04 22:39           ` Casey Chen
2025-11-04 23:00             ` Keith Busch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9669f8a9-11ad-4911-9e03-00758e1d9957@nvidia.com \
    --to=chaitanyak@nvidia.com \
    --cc=axboe@kernel.dk \
    --cc=cachen@purestorage.com \
    --cc=kbusch@kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=ming.lei@redhat.com \
    --cc=mkhalfella@purestorage.com \
    --cc=sconnor@purestorage.com \
    --cc=yzhong@purestorage.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox