linux-nvme.lists.infradead.org archive mirror
* [PATCH 1/1] nvme: correctly account for namespace head reference counter
@ 2025-06-20  8:02 Nilay Shroff
  2025-06-23  8:18 ` Daniel Wagner
  2025-06-23 12:20 ` Christoph Hellwig
  0 siblings, 2 replies; 8+ messages in thread
From: Nilay Shroff @ 2025-06-20  8:02 UTC (permalink / raw)
  To: linux-nvme
  Cc: hch, kbusch, sagi, hare, dwagner, axboe, shinichiro.kawasaki,
	gjoyce

The blktests nvme/058 test manifests an issue where the NVMe subsystem
kobject entry remains stale in sysfs, causing a failure during
subsequent NVMe module reloads[1]. Specifically, when attempting to
register a new NVMe subsystem, the driver encounters a kobject name
collision because a stale kobject still exists. Note that nvme/058
itself doesn't report any failure and the test case passes; it is only
during subsequent NVMe module reloads that the stale nvme subsystem
kobject entry in sysfs causes the observed symptom[1].

This issue stems from an imbalance in the get/put usage of the namespace
head (nshead) reference counter. The nshead holds a reference to the
associated NVMe subsystem. If the nshead reference is not properly
released, it prevents the cleanup of the subsystem's kobject, leaving
a stale nvme subsystem entry behind in sysfs.
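
The dependency looks roughly like this (simplified sketch of the
release chain in drivers/nvme/host/core.c; the per-head teardown
details are elided):

        static void nvme_free_ns_head(struct kref *ref)
        {
                struct nvme_ns_head *head =
                        container_of(ref, struct nvme_ns_head, ref);

                /* ... per-head teardown ... */
                nvme_put_subsystem(head->subsys); /* may drop last subsys ref */
                kfree(head);
        }

        void nvme_put_ns_head(struct nvme_ns_head *head)
        {
                kref_put(&head->ref, nvme_free_ns_head);
        }

The subsystem, and with it the subsystem kobject in sysfs, can only be
released once every nshead referencing it has dropped its reference.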

During the failre case, the last namespace path referencing an nshead
is removed, but the nshead reference is not released. This occurs
because the release logic currently only puts the nshead reference
when its state is LIVE. However, in configurations where ANA
(Asymmetric Namespace Access) is enabled, a namespace may be
associated with an ANA state that is neither optimized nor
non-optimized. In that case the nshead may never transition to LIVE,
and the corresponding nshead reference is then never dropped. In fact,
nvme/058 associates some of the nvme namespaces with an inaccessible
ANA state; the nshead is created, but its state never transitions to
LIVE. The current logic therefore leaks the nshead reference for
non-LIVE states.
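
For reference, the put is currently gated on the head having gone live,
roughly like this (simplified sketch; if memory serves the guard is the
NVME_NSHEAD_DISK_LIVE flag, and the cdev/gendisk teardown in between is
elided):

        static void nvme_remove_head(struct nvme_ns_head *head)
        {
                if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
                        /* cdev removal, srcu sync, del_gendisk() ... */
                        nvme_put_ns_head(head); /* never reached for non-LIVE heads */
                }
        }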

In another scenario, during namespace allocation, the driver first
allocates an nshead and then issues an Identify Namespace command. If
this command fails — which can happen in tests like nvme/058 that
rapidly enable and disable namespaces — we must release the reference
to the newly allocated nshead. However, this release is currently
missing in the failure path, causing an nshead reference leak.
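
A simplified view of that flow in nvme_alloc_ns() (sketch; the exact
call sites may differ slightly, but the Identify data is fetched only
after the head has been set up):

        if (nvme_init_ns_head(ns, info))   /* allocates/links the nshead */
                goto out_cleanup_disk;
        ...
        if (nvme_update_ns_info(ns, info)) /* issues Identify Namespace */
                goto out_unlink_ns;        /* head reference must be dropped */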

To fix this, we now unconditionally release the nshead reference when
the last nvme path referencing to the nshead is removed, regardless of
the head’s state. Also during idnetify namespace failure case we now
properly release the nshead reference. This ensures proper cleanup of
the nshead and, consequently, of the NVMe subsystem and its associated
kobject.

This change prevents stale kobject entries from lingering in sysfs and
eliminates the module reload failures observed just after running
nvme/058.

[1] https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/

Reported-by: yi.zhang@redhat.com
Closes: https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/
Fixes: 62188639ec16 ("nvme-multipath: introduce delayed removal of the multipath head node")
Tested-by: yi.zhang@redhat.com
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/core.c      | 7 ++++++-
 drivers/nvme/host/multipath.c | 2 +-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 92697f98c601..b3e280149a25 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4089,6 +4089,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
 	struct nvme_ns *ns;
 	struct gendisk *disk;
 	int node = ctrl->numa_node;
+	bool last_path = false;
 
 	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
 	if (!ns)
@@ -4181,9 +4182,13 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
  out_unlink_ns:
 	mutex_lock(&ctrl->subsys->lock);
 	list_del_rcu(&ns->siblings);
-	if (list_empty(&ns->head->list))
+	if (list_empty(&ns->head->list)) {
 		list_del_init(&ns->head->entry);
+		last_path = true;
+	}
 	mutex_unlock(&ctrl->subsys->lock);
+	if (last_path)
+		nvme_put_ns_head(ns->head);
 	nvme_put_ns_head(ns->head);
  out_cleanup_disk:
 	put_disk(disk);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index e040e467f9fa..5f52b50907a0 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -690,8 +690,8 @@ static void nvme_remove_head(struct nvme_ns_head *head)
 		nvme_cdev_del(&head->cdev, &head->cdev_device);
 		synchronize_srcu(&head->srcu);
 		del_gendisk(head->disk);
-		nvme_put_ns_head(head);
 	}
+	nvme_put_ns_head(head);
 }
 
 static void nvme_remove_head_work(struct work_struct *work)
-- 
2.49.0




* Re: [PATCH 1/1] nvme: correctly account for namespace head reference counter
  2025-06-20  8:02 [PATCH 1/1] nvme: correctly account for namespace head reference counter Nilay Shroff
@ 2025-06-23  8:18 ` Daniel Wagner
  2025-06-24 13:30   ` Nilay Shroff
  2025-06-23 12:20 ` Christoph Hellwig
  1 sibling, 1 reply; 8+ messages in thread
From: Daniel Wagner @ 2025-06-23  8:18 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: linux-nvme, hch, kbusch, sagi, hare, axboe, shinichiro.kawasaki,
	gjoyce

Hi Nilay,

On Fri, Jun 20, 2025 at 01:32:43PM +0530, Nilay Shroff wrote:
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4089,6 +4089,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>  	struct nvme_ns *ns;
>  	struct gendisk *disk;
>  	int node = ctrl->numa_node;
> +	bool last_path = false;
>  
>  	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
>  	if (!ns)
> @@ -4181,9 +4182,13 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>   out_unlink_ns:
>  	mutex_lock(&ctrl->subsys->lock);
>  	list_del_rcu(&ns->siblings);
> -	if (list_empty(&ns->head->list))
> +	if (list_empty(&ns->head->list)) {
>  		list_del_init(&ns->head->entry);
> +		last_path = true;
> +	}
>  	mutex_unlock(&ctrl->subsys->lock);
> +	if (last_path)
> +		nvme_put_ns_head(ns->head);
>  	nvme_put_ns_head(ns->head);

Could the put be moved in front of 'goto out_unlink_ns'?

Thanks,
Daniel



* Re: [PATCH 1/1] nvme: correctly account for namespace head reference counter
  2025-06-20  8:02 [PATCH 1/1] nvme: correctly account for namespace head reference counter Nilay Shroff
  2025-06-23  8:18 ` Daniel Wagner
@ 2025-06-23 12:20 ` Christoph Hellwig
  2025-06-24 14:24   ` Nilay Shroff
  1 sibling, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2025-06-23 12:20 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: linux-nvme, hch, kbusch, sagi, hare, dwagner, axboe,
	shinichiro.kawasaki, gjoyce

On Fri, Jun 20, 2025 at 01:32:43PM +0530, Nilay Shroff wrote:
> During the failre

failure?

> To fix this, we now unconditionally release the nshead reference when
> the last nvme path referencing to the nshead is removed, regardless of
> the head’s state. Also during idnetify

identify

Otherwise this looks good.



* Re: [PATCH 1/1] nvme: correctly account for namespace head reference counter
  2025-06-23  8:18 ` Daniel Wagner
@ 2025-06-24 13:30   ` Nilay Shroff
  2025-06-24 13:42     ` Daniel Wagner
  0 siblings, 1 reply; 8+ messages in thread
From: Nilay Shroff @ 2025-06-24 13:30 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: linux-nvme, hch, kbusch, sagi, hare, axboe, shinichiro.kawasaki,
	gjoyce



On 6/23/25 1:48 PM, Daniel Wagner wrote:
> Hi Nilay,
> 
> On Fri, Jun 20, 2025 at 01:32:43PM +0530, Nilay Shroff wrote:
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -4089,6 +4089,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>  	struct nvme_ns *ns;
>>  	struct gendisk *disk;
>>  	int node = ctrl->numa_node;
>> +	bool last_path = false;
>>  
>>  	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
>>  	if (!ns)
>> @@ -4181,9 +4182,13 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>   out_unlink_ns:
>>  	mutex_lock(&ctrl->subsys->lock);
>>  	list_del_rcu(&ns->siblings);
>> -	if (list_empty(&ns->head->list))
>> +	if (list_empty(&ns->head->list)) {
>>  		list_del_init(&ns->head->entry);
>> +		last_path = true;
>> +	}
>>  	mutex_unlock(&ctrl->subsys->lock);
>> +	if (last_path)
>> +		nvme_put_ns_head(ns->head);
>>  	nvme_put_ns_head(ns->head);
> 
> Could the put be moved in front of 'goto out_unlink_ns'?

I believe you would want to move the second nvme_put_ns_head()
above as the first statement under out_unlink_ns label as shown
below:

out_unlink_ns:
	nvme_put_ns_head(ns->head);
	mutex_lock(&ctrl->subsys->lock);
	list_del_rcu(&ns->siblings);
	if (list_empty(&ns->head->list)) {
		list_del_init(&ns->head->entry);
                last_path = true;
        }
	mutex_unlock(&ctrl->subsys->lock);
        if (last_path)
                nvme_put_ns_head(ns->head);

If that's the case, and this is indeed the last reference to ns->head,
then nvme_put_ns_head() could potentially free the ns->head object. 
If that happens, any access to ns->head afterward — such as ns->head->list
— would result in a use-after-free and potentially trigger a kernel oops,
correct?
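
Annotated, the risk with that ordering would be (sketch of the snippet
above, not actual code):

        out_unlink_ns:
                nvme_put_ns_head(ns->head);        /* may free ns->head here */
                mutex_lock(&ctrl->subsys->lock);
                list_del_rcu(&ns->siblings);       /* unlinks from ns->head->list */
                if (list_empty(&ns->head->list)) { /* dereferences freed memory */
                        ...
                }
                mutex_unlock(&ctrl->subsys->lock);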

Thanks,
--Nilay




* Re: [PATCH 1/1] nvme: correctly account for namespace head reference counter
  2025-06-24 13:30   ` Nilay Shroff
@ 2025-06-24 13:42     ` Daniel Wagner
  0 siblings, 0 replies; 8+ messages in thread
From: Daniel Wagner @ 2025-06-24 13:42 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: linux-nvme, hch, kbusch, sagi, hare, axboe, shinichiro.kawasaki,
	gjoyce

On Tue, Jun 24, 2025 at 07:00:39PM +0530, Nilay Shroff wrote:
> > Could the put be moved in front of 'goto out_unlink_ns'?
> 
> I believe you would want to move the second nvme_put_ns_head()
> above as the first statement under out_unlink_ns label as shown
> below:
> 
> out_unlink_ns:
> 	nvme_put_ns_head(ns->head);
> 	mutex_lock(&ctrl->subsys->lock);
> 	list_del_rcu(&ns->siblings);
> 	if (list_empty(&ns->head->list)) {
> 		list_del_init(&ns->head->entry);
>                 last_path = true;
>         }
> 	mutex_unlock(&ctrl->subsys->lock);
>         if (last_path)
>                 nvme_put_ns_head(ns->head);
> 
> If that's the case, and this is indeed the last reference to ns->head,
> then nvme_put_ns_head() could potentially free the ns->head object. 
> If that happens, any access to ns->head afterward — such as ns->head->list
> — would result in a use-after-free and potentially trigger a kernel oops,
> correct?

Ah sorry, brain fart, you are right.



* Re: [PATCH 1/1] nvme: correctly account for namespace head reference counter
  2025-06-23 12:20 ` Christoph Hellwig
@ 2025-06-24 14:24   ` Nilay Shroff
  2025-06-25  6:10     ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Nilay Shroff @ 2025-06-24 14:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, kbusch, sagi, hare, dwagner, axboe,
	shinichiro.kawasaki, gjoyce



On 6/23/25 5:50 PM, Christoph Hellwig wrote:
> On Fri, Jun 20, 2025 at 01:32:43PM +0530, Nilay Shroff wrote:
>> During the failre
> 
> failure?
> 
>> To fix this, we now unconditionally release the nshead reference when
>> the last nvme path referencing to the nshead is removed, regardless of
>> the head’s state. Also during idnetify
> 
> identify
> 
> Otherwise this looks good.

Thanks! I'll update the above typos; however, today I was testing some code
after disabling CONFIG_NVME_MULTIPATH and I realized that although the
namespace head (ns->head) is still created, its associated head->disk is
not initialized in this configuration. As a result, ns->head only has a
single reference — the one from kref_init() during its creation. Therefore,
when we delete the namespace or handle an error case, we must ensure that
we only call nvme_put_ns_head() once to properly release the reference and
avoid a potential double-free. So I updated this patch as shown below (please
note that this is equally applicable when we insmod nvme_core with module
param "multipath" set to false):

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 92697f98c601..033fe0a12727 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4089,6 +4089,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
        struct nvme_ns *ns;
        struct gendisk *disk;
        int node = ctrl->numa_node;
+       bool last_path = false;
 
        ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
        if (!ns)
@@ -4181,9 +4182,14 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
  out_unlink_ns:
        mutex_lock(&ctrl->subsys->lock);
        list_del_rcu(&ns->siblings);
-       if (list_empty(&ns->head->list))
+       if (list_empty(&ns->head->list)) {
                list_del_init(&ns->head->entry);
+               if (ns->head->disk)
+                       last_path = true;
+       }
        mutex_unlock(&ctrl->subsys->lock);
+       if (last_path)
+               nvme_put_ns_head(ns->head);
        nvme_put_ns_head(ns->head);
  out_cleanup_disk:
        put_disk(disk);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index e040e467f9fa..c7644631bb1a 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -690,8 +690,8 @@ static void nvme_remove_head(struct nvme_ns_head *head)
                nvme_cdev_del(&head->cdev, &head->cdev_device);
                synchronize_srcu(&head->srcu);
                del_gendisk(head->disk);
-               nvme_put_ns_head(head);
        }
+       nvme_put_ns_head(head);
 }
 
 static void nvme_remove_head_work(struct work_struct *work)
@@ -1291,6 +1291,9 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 {
        bool remove = false;
 
+       if (!head->disk)
+               return;
+
        mutex_lock(&head->subsys->lock);
        /*
         * We are called when all paths have been removed, and at that point
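
The reason head->disk can be NULL here is that the multipath node's
gendisk is only allocated when multipath is actually in use; roughly
(from memory, the exact guard in nvme_mpath_alloc_disk() may differ,
and with CONFIG_NVME_MULTIPATH disabled the function is only a stub
returning 0):

        int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
        {
                /* ... requeue list/work initialization ... */
                if (!(ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL) || !multipath)
                        return 0;               /* head->disk stays NULL */
                /* ... queue limits (lim) setup elided ... */
                head->disk = blk_alloc_disk(&lim, ctrl->numa_node);
                /* ... */
        }

So in that configuration the head only ever holds its initial
reference, and exactly one nvme_put_ns_head() must be issued.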

So I'll send out patchv2 with the above changes.

Thanks,
--Nilay




* Re: [PATCH 1/1] nvme: correctly account for namespace head reference counter
  2025-06-24 14:24   ` Nilay Shroff
@ 2025-06-25  6:10     ` Christoph Hellwig
  2025-06-25  6:12       ` Nilay Shroff
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2025-06-25  6:10 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: Christoph Hellwig, linux-nvme, kbusch, sagi, hare, dwagner, axboe,
	shinichiro.kawasaki, gjoyce

On Tue, Jun 24, 2025 at 07:54:29PM +0530, Nilay Shroff wrote:
> Thanks! I'll update the above typos; however, today I was testing some code
> after disabling CONFIG_NVME_MULTIPATH and I realized that although the
> namespace head (ns->head) is still created, its associated head->disk is
> not initialized in this configuration.

Yes.


> So I'll send out patchv2 with the above changes.

Thanks!  Please also add a comment to nvme_alloc_ns as the logic is
fairly complex now and future readers will benefit from an explanation.




* Re: [PATCH 1/1] nvme: correctly account for namespace head reference counter
  2025-06-25  6:10     ` Christoph Hellwig
@ 2025-06-25  6:12       ` Nilay Shroff
  0 siblings, 0 replies; 8+ messages in thread
From: Nilay Shroff @ 2025-06-25  6:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, kbusch, sagi, hare, dwagner, axboe,
	shinichiro.kawasaki, gjoyce



On 6/25/25 11:40 AM, Christoph Hellwig wrote:
> On Tue, Jun 24, 2025 at 07:54:29PM +0530, Nilay Shroff wrote:
>> Thanks! I'll update the above typos; however, today I was testing some code
>> after disabling CONFIG_NVME_MULTIPATH and I realized that although the
>> namespace head (ns->head) is still created, its associated head->disk is
>> not initialized in this configuration.
> 
> Yes.
> 
> 
>> So I'll send out patchv2 with the above changes.
> 
> Thanks!  Please also add a comment to nvme_alloc_ns as the logic is
> fairly complex now and future readers will benefit from an explanation.
> 
Yeah will do the needful.
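
Something along these lines, for instance (a draft; exact wording to be
settled in v2):

        /*
         * If this was the only path to the head and a multipath node
         * (head->disk) exists, an additional reference is held on
         * behalf of that node and must be dropped here as well.
         * Without a multipath node (CONFIG_NVME_MULTIPATH disabled or
         * multipath=N) the head only holds its initial reference, so
         * the single nvme_put_ns_head() below is sufficient.
         */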

Thanks,
--Nilay



