* [PATCH 0/3] nvme-multipath: fix deadlock in device_add_disk()
@ 2024-10-07 10:01 Hannes Reinecke
  2024-10-07 10:01 ` [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state() Hannes Reinecke
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Hannes Reinecke @ 2024-10-07 10:01 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

Hi all,

I have a testcase which repeatedly disables a namespace on the target,
assigns a new UUID (to simulate namespace remapping), and enables that
namespace again.
To throw in more fun, these namespaces also have their ANA group ID
changed to simulate namespaces moving around in a cluster, where only
the paths local to the cluster node are active and all other paths are
inaccessible.

Essentially it's doing something like:

echo 0 > ${ns}/enable
<random delay>
echo "<dev>" > ${ns}/device_path
echo "<grpid>" > ${ns}/ana_grpid
uuidgen > ${ns}/device_uuid
echo 1 > ${ns}/enable

ie a testcase similar to the previous patchset, only this time I'm
just toggling 'enable'/'disable' without removing the namespace from
the target.
This causes lockups in device_add_disk(), as the partition scan keeps
retrying I/O and never completes.
Funnily enough, the very same issue should have been fixed by
ecca390e8056 ("nvme: fix deadlock in disconnect during scan_work
and/or ana_work"), but that fix seems to be incomplete.

As usual, comments and reviews are welcome.

Hannes Reinecke (3):
  nvme-multipath: simplify loop in nvme_update_ana_state()
  nvme-multipath: cannot disconnect controller on stuck partition scan
  nvme-multipath: skip failed paths during partition scan

 drivers/nvme/host/multipath.c | 51 ++++++++++++++++++++++++++---------
 drivers/nvme/host/nvme.h      |  1 +
 2 files changed, 40 insertions(+), 12 deletions(-)

-- 
2.35.3



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state()
  2024-10-07 10:01 [PATCH 0/3] nvme-multipath: fix deadlock in device_add_disk() Hannes Reinecke
@ 2024-10-07 10:01 ` Hannes Reinecke
  2024-10-07 15:46   ` Keith Busch
  2024-10-20 23:33   ` Sagi Grimberg
  2024-10-07 10:01 ` [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan Hannes Reinecke
  2024-10-07 10:01 ` [PATCH 3/3] nvme-multipath: skip failed paths during " Hannes Reinecke
  2 siblings, 2 replies; 25+ messages in thread
From: Hannes Reinecke @ 2024-10-07 10:01 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

Currently nvme_update_ana_state() iterates over all namespaces
with the srcu lock held, and calls the 'cb' function if a
match on the list of namespaces from the ANA log is found.
This assumes that the iteration itself is relatively quick,
and that the namespace state doesn't change during iteration.
For more complex callbacks (eg if the 'cb' function triggers
nvme_mpath_set_live(), which then kicks off a partition scan) the
namespace state might change while iterating, and the srcu-protected
section becomes really long.
So change the loop to iterate over the entries in the ANA log
and call nvme_find_get_ns() on each entry.
With that the 'cb' function is executed outside of the srcu-protected
section, and normal reference counting rules apply.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/multipath.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index f72c5a6a2d8e..61f8ae199288 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -745,7 +745,6 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
 	u32 nr_nsids = le32_to_cpu(desc->nnsids), n = 0;
 	unsigned *nr_change_groups = data;
 	struct nvme_ns *ns;
-	int srcu_idx;
 
 	dev_dbg(ctrl->device, "ANA group %d: %s.\n",
 			le32_to_cpu(desc->grpid),
@@ -757,21 +756,16 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
 	if (!nr_nsids)
 		return 0;
 
-	srcu_idx = srcu_read_lock(&ctrl->srcu);
-	list_for_each_entry_rcu(ns, &ctrl->namespaces, list) {
+	for(n = 0; n < nr_nsids; n++) {
 		unsigned nsid;
-again:
+
 		nsid = le32_to_cpu(desc->nsids[n]);
-		if (ns->head->ns_id < nsid)
-			continue;
-		if (ns->head->ns_id == nsid)
+		ns = nvme_find_get_ns(ctrl, nsid);
+		if (ns) {
 			nvme_update_ns_ana_state(desc, ns);
-		if (++n == nr_nsids)
-			break;
-		if (ns->head->ns_id > nsid)
-			goto again;
+			nvme_put_ns(ns);
+		}
 	}
-	srcu_read_unlock(&ctrl->srcu, srcu_idx);
 	return 0;
 }
 
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-07 10:01 [PATCH 0/3] nvme-multipath: fix deadlock in device_add_disk() Hannes Reinecke
  2024-10-07 10:01 ` [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state() Hannes Reinecke
@ 2024-10-07 10:01 ` Hannes Reinecke
  2024-10-07 18:19   ` Keith Busch
  2024-10-07 10:01 ` [PATCH 3/3] nvme-multipath: skip failed paths during " Hannes Reinecke
  2 siblings, 1 reply; 25+ messages in thread
From: Hannes Reinecke @ 2024-10-07 10:01 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

When a namespace state changes during the partition scan triggered via
nvme_scan_ns()->nvme_mpath_set_live()->device_add_disk(),
I/O might be returned with a path error, causing it to be retried on
other paths. But if this happens to be the last path, the scanning
process is stuck.
Trying to disconnect this controller will call

nvme_unquiesce_io_queues()
flush_work(&ctrl->scan_work)

where the first should abort/retry all I/O pending during the scan so
that the subsequent flush_work() can succeed.
However, we explicitly do _not_ treat paths from deleted controllers as
disabled in nvme_path_is_disabled(), so I/O on these devices
will be _retried_, not aborted, and the scanning process
remains stuck. The process trying to disconnect the
controller will then be stuck in flush_work(), and that controller
and all its namespaces become unusable until the system is rebooted.

Fixes: ecca390e8056 ("nvme: fix deadlock in disconnect during scan_work and/or ana_work")

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/multipath.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 61f8ae199288..f03ef983a75f 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -239,6 +239,13 @@ static bool nvme_path_is_disabled(struct nvme_ns *ns)
 {
 	enum nvme_ctrl_state state = nvme_ctrl_state(ns->ctrl);
 
+	/*
+	 * Skip deleted controllers for I/O from partition scan
+	 */
+	if (state == NVME_CTRL_DELETING &&
+	    mutex_is_locked(&ns->ctrl->scan_lock))
+		return true;
+
 	/*
 	 * We don't treat NVME_CTRL_DELETING as a disabled path as I/O should
 	 * still be able to complete assuming that the controller is connected.
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/3] nvme-multipath: skip failed paths during partition scan
  2024-10-07 10:01 [PATCH 0/3] nvme-multipath: fix deadlock in device_add_disk() Hannes Reinecke
  2024-10-07 10:01 ` [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state() Hannes Reinecke
  2024-10-07 10:01 ` [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan Hannes Reinecke
@ 2024-10-07 10:01 ` Hannes Reinecke
  2024-10-08  6:40   ` Christoph Hellwig
  2 siblings, 1 reply; 25+ messages in thread
From: Hannes Reinecke @ 2024-10-07 10:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke,
	Hannes Reinecke

From: Hannes Reinecke <hare@suse.de>

When an I/O error is encountered during scanning (ie while the
scan_lock is held) we should avoid using this path until scanning
has finished, to avoid deadlocks with device_add_disk().
So set a new flag NVME_NS_SCAN_FAILED if a failover happened during
scanning, and skip this path in nvme_available_path().
Then we can check whether that bit is set after device_add_disk() has
returned, and remove the disk again if no available paths are found.
That allows the device to be recreated via the 'rescan' sysfs attribute
once I/O errors no longer occur.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/multipath.c | 26 ++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h      |  1 +
 2 files changed, 27 insertions(+)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index f03ef983a75f..4113d38606a4 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -102,6 +102,13 @@ void nvme_failover_req(struct request *req)
 		queue_work(nvme_wq, &ns->ctrl->ana_work);
 	}
 
+	/*
+	 * Do not use this path during scanning
+	 * to avoid deadlocks in device_add_disk()
+	 */
+	if (mutex_is_locked(&ns->ctrl->scan_lock))
+		set_bit(NVME_NS_SCAN_FAILED, &ns->flags);
+
 	spin_lock_irqsave(&ns->head->requeue_lock, flags);
 	for (bio = req->bio; bio; bio = bio->bi_next) {
 		bio_set_dev(bio, ns->head->disk->part0);
@@ -434,6 +441,10 @@ static bool nvme_available_path(struct nvme_ns_head *head)
 	list_for_each_entry_rcu(ns, &head->list, siblings) {
 		if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
 			continue;
+		if (test_bit(NVME_NS_SCAN_FAILED, &ns->flags) &&
+		    mutex_is_locked(&ns->ctrl->scan_lock))
+			continue;
+
 		switch (nvme_ctrl_state(ns->ctrl)) {
 		case NVME_CTRL_LIVE:
 		case NVME_CTRL_RESETTING:
@@ -659,6 +670,20 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
 			clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags);
 			return;
 		}
+		/*
+		 * If there is no available path and NVME_NS_SCAN_FAILED is
+		 * set an error occurred during partition scan triggered
+		 * by device_add_disk(), and the disk is most certainly
+		 * not live.
+		 */
+		if (!nvme_available_path(head) &&
+		    test_and_clear_bit(NVME_NS_SCAN_FAILED, &ns->flags)) {
+			dev_dbg(ns->ctrl->device, "delete gendisk for nsid %d\n",
+				head->ns_id);
+			clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags);
+			del_gendisk(head->disk);
+			return;
+		}
 		nvme_add_ns_head_cdev(head);
 	}
 
@@ -732,6 +757,7 @@ static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
 	ns->ana_grpid = le32_to_cpu(desc->grpid);
 	ns->ana_state = desc->state;
 	clear_bit(NVME_NS_ANA_PENDING, &ns->flags);
+	clear_bit(NVME_NS_SCAN_FAILED, &ns->flags);
 	/*
 	 * nvme_mpath_set_live() will trigger I/O to the multipath path device
 	 * and in turn to this path device.  However we cannot accept this I/O
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 50515ad0f9d6..a4f99873ecb7 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -527,6 +527,7 @@ struct nvme_ns {
 #define NVME_NS_ANA_PENDING	2
 #define NVME_NS_FORCE_RO	3
 #define NVME_NS_READY		4
+#define NVME_NS_SCAN_FAILED	5
 
 	struct cdev		cdev;
 	struct device		cdev_device;
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state()
  2024-10-07 10:01 ` [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state() Hannes Reinecke
@ 2024-10-07 15:46   ` Keith Busch
  2024-10-08  6:39     ` Christoph Hellwig
  2024-10-20 23:33   ` Sagi Grimberg
  1 sibling, 1 reply; 25+ messages in thread
From: Keith Busch @ 2024-10-07 15:46 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Christoph Hellwig, Sagi Grimberg, linux-nvme

On Mon, Oct 07, 2024 at 12:01:32PM +0200, Hannes Reinecke wrote:
> Currently nvme_update_ana_state() iterates over all namespaces
> with the srcu lock held, and calls the 'cb' function if a
> match on the list of namespaces from the ANA log is found.
> This assumes that the iteration itself is relatively quick,
> and the namespace state itself doesn't change during iteration.
> For more complex functions (eg if the 'cb' function triggers
> mpath_set_live() which then kicks off a partition scan) the
> namespace state might change while iterating, and the rcu-protected
> area becomes really long.
> So change the loop to iterated over the entries in the ANA log,
> and call nvme_find_get_ns() on each entry.
> With that the 'cb' function is executed outside of the RCU
> protected area, and normal reference counting rules apply.

This patch looks good to me.

Your commit log is word wrapping a bit short, though. We standardize on
72 characters.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-07 10:01 ` [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan Hannes Reinecke
@ 2024-10-07 18:19   ` Keith Busch
  2024-10-08  6:43     ` Christoph Hellwig
  2024-10-08  7:17     ` Hannes Reinecke
  0 siblings, 2 replies; 25+ messages in thread
From: Keith Busch @ 2024-10-07 18:19 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Christoph Hellwig, Sagi Grimberg, linux-nvme

On Mon, Oct 07, 2024 at 12:01:33PM +0200, Hannes Reinecke wrote:
> @@ -239,6 +239,13 @@ static bool nvme_path_is_disabled(struct nvme_ns *ns)
>  {
>  	enum nvme_ctrl_state state = nvme_ctrl_state(ns->ctrl);
>  
> +	/*
> +	 * Skip deleted controllers for I/O from partition scan
> +	 */
> +	if (state == NVME_CTRL_DELETING &&
> +	    mutex_is_locked(&ns->ctrl->scan_lock))
> +		return true;

This feels off to me, using these seemingly unrelated dependencies to
make these kinds of decisions.

We talked a couple of weeks ago about suppressing the partition scanning
during nvme scan_work. I know you said there was some reason it wouldn't
work, but could you check the below? It seems okay to me.

---
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 48e7a8906d012..82cb1eb3a773b 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -586,6 +586,12 @@ static void nvme_requeue_work(struct work_struct *work)
 		container_of(work, struct nvme_ns_head, requeue_work);
 	struct bio *bio, *next;
 
+	if (test_and_clear_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state)) {
+		mutex_lock(&head->disk->open_mutex);
+		bdev_disk_changed(head->disk, false);
+		mutex_unlock(&head->disk->open_mutex);
+	}
+
 	spin_lock_irq(&head->requeue_lock);
 	next = bio_list_get(&head->requeue_list);
 	spin_unlock_irq(&head->requeue_lock);
@@ -629,6 +635,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 		return PTR_ERR(head->disk);
 	head->disk->fops = &nvme_ns_head_ops;
 	head->disk->private_data = head;
+	set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state);
 	sprintf(head->disk->disk_name, "nvme%dn%d",
 			ctrl->subsys->instance, head->instance);
 	return 0;
--


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state()
  2024-10-07 15:46   ` Keith Busch
@ 2024-10-08  6:39     ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2024-10-08  6:39 UTC (permalink / raw)
  To: Keith Busch; +Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme

On Mon, Oct 07, 2024 at 09:46:16AM -0600, Keith Busch wrote:
> This patch looks good to me.
> 
> Your commit log is word wrapping a bit short, though. We standardize on
> 72 characters.

The for loop is also missing a space before the opening brace while
we're nitpicking.

The reason the code was written this way is that when there are a lot
of namespaces in an update we now need to do a full list scan for each
one, which can add a lot of overhead.  So I guess we'll need to see
why it is important to change it to counter that tradeoff.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/3] nvme-multipath: skip failed paths during partition scan
  2024-10-07 10:01 ` [PATCH 3/3] nvme-multipath: skip failed paths during " Hannes Reinecke
@ 2024-10-08  6:40   ` Christoph Hellwig
  2024-10-20 23:38     ` Sagi Grimberg
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2024-10-08  6:40 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme,
	Hannes Reinecke

On Mon, Oct 07, 2024 at 12:01:34PM +0200, Hannes Reinecke wrote:
> From: Hannes Reinecke <hare@suse.de>
> 
> When an I/O error is encountered during scanning (ie when the
> scan_lock is held) we should avoid using this path until scanning
> is finished to avoid deadlocks with device_add_disk().
> So set a new flag NVME_NS_SCAN_FAILED if a failover happened during
> scanning, and skip this path in nvme_available_paths().
> Then we can check if that bit is set after device_add_disk() returned,
> and remove the disk again if no available paths are found.
> That allows the device to be recreated via the 'rescan' sysfs attribute
> once no I/O errors occur anymore.
> 
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
>  drivers/nvme/host/multipath.c | 26 ++++++++++++++++++++++++++
>  drivers/nvme/host/nvme.h      |  1 +
>  2 files changed, 27 insertions(+)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index f03ef983a75f..4113d38606a4 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -102,6 +102,13 @@ void nvme_failover_req(struct request *req)
>  		queue_work(nvme_wq, &ns->ctrl->ana_work);
>  	}
>  
> +	/*
> +	 * Do not use this path during scanning
> +	 * to avoid deadlocks in device_add_disk()
> +	 */
> +	if (mutex_is_locked(&ns->ctrl->scan_lock))
> +		set_bit(NVME_NS_SCAN_FAILED, &ns->flags);

Err, no.  mutex_is_locked is never a valid way to detect the calling
context - obviously someone else could also be holding it.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-07 18:19   ` Keith Busch
@ 2024-10-08  6:43     ` Christoph Hellwig
  2024-10-08  7:17     ` Hannes Reinecke
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2024-10-08  6:43 UTC (permalink / raw)
  To: Keith Busch; +Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme

On Mon, Oct 07, 2024 at 12:19:23PM -0600, Keith Busch wrote:
> On Mon, Oct 07, 2024 at 12:01:33PM +0200, Hannes Reinecke wrote:
> > @@ -239,6 +239,13 @@ static bool nvme_path_is_disabled(struct nvme_ns *ns)
> >  {
> >  	enum nvme_ctrl_state state = nvme_ctrl_state(ns->ctrl);
> >  
> > +	/*
> > +	 * Skip deleted controllers for I/O from partition scan
> > +	 */
> > +	if (state == NVME_CTRL_DELETING &&
> > +	    mutex_is_locked(&ns->ctrl->scan_lock))
> > +		return true;
> 
> This feels off to me, using these seemingly unrelated dependencies to
> make these kinds of decisions.
> 
> We talked a couple weeks ago about suppressing the parition scanning
> during nvme scan_work. I know you said there was some reason it wouldn't
> work, but could you check the below? It seems okay to me.
> 
> ---
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 48e7a8906d012..82cb1eb3a773b 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -586,6 +586,12 @@ static void nvme_requeue_work(struct work_struct *work)
>  		container_of(work, struct nvme_ns_head, requeue_work);
>  	struct bio *bio, *next;
>  
> +	if (test_and_clear_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state)) {
> +		mutex_lock(&head->disk->open_mutex);
> +		bdev_disk_changed(head->disk, false);
> +		mutex_unlock(&head->disk->open_mutex);
> +	}
> +

Doing the partition scanning from the requeue_work feels a little
counter-intuitive. I guess this should work because requeue_work is
scheduled from the end of nvme_mpath_set_live() and gives us a context
outside of scan_lock.

It'll need really good comments to explain this.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-07 18:19   ` Keith Busch
  2024-10-08  6:43     ` Christoph Hellwig
@ 2024-10-08  7:17     ` Hannes Reinecke
  2024-10-08 20:41       ` Keith Busch
  1 sibling, 1 reply; 25+ messages in thread
From: Hannes Reinecke @ 2024-10-08  7:17 UTC (permalink / raw)
  To: Keith Busch, Hannes Reinecke; +Cc: Christoph Hellwig, Sagi Grimberg, linux-nvme

On 10/7/24 20:19, Keith Busch wrote:
> On Mon, Oct 07, 2024 at 12:01:33PM +0200, Hannes Reinecke wrote:
>> @@ -239,6 +239,13 @@ static bool nvme_path_is_disabled(struct nvme_ns *ns)
>>   {
>>   	enum nvme_ctrl_state state = nvme_ctrl_state(ns->ctrl);
>>   
>> +	/*
>> +	 * Skip deleted controllers for I/O from partition scan
>> +	 */
>> +	if (state == NVME_CTRL_DELETING &&
>> +	    mutex_is_locked(&ns->ctrl->scan_lock))
>> +		return true;
> 
> This feels off to me, using these seemingly unrelated dependencies to
> make these kinds of decisions.
> 
> We talked a couple weeks ago about suppressing the parition scanning
> during nvme scan_work. I know you said there was some reason it wouldn't
> work, but could you check the below? It seems okay to me.
> 
> ---
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 48e7a8906d012..82cb1eb3a773b 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -586,6 +586,12 @@ static void nvme_requeue_work(struct work_struct *work)
>   		container_of(work, struct nvme_ns_head, requeue_work);
>   	struct bio *bio, *next;
>   
> +	if (test_and_clear_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state)) {
> +		mutex_lock(&head->disk->open_mutex);
> +		bdev_disk_changed(head->disk, false);
> +		mutex_unlock(&head->disk->open_mutex);
> +	}
> +
>   	spin_lock_irq(&head->requeue_lock);
>   	next = bio_list_get(&head->requeue_list);
>   	spin_unlock_irq(&head->requeue_lock);
> @@ -629,6 +635,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
>   		return PTR_ERR(head->disk);
>   	head->disk->fops = &nvme_ns_head_ops;
>   	head->disk->private_data = head;
> +	set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state);
>   	sprintf(head->disk->disk_name, "nvme%dn%d",
>   			ctrl->subsys->instance, head->instance);
>   	return 0;
> --
Hmm. Sadly, not really. Still stuck in partition scanning.
Guess we'll need to check if we have paths available before triggering 
partition scan; let me check ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-08  7:17     ` Hannes Reinecke
@ 2024-10-08 20:41       ` Keith Busch
  2024-10-09  6:23         ` Hannes Reinecke
  0 siblings, 1 reply; 25+ messages in thread
From: Keith Busch @ 2024-10-08 20:41 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme

On Tue, Oct 08, 2024 at 09:17:47AM +0200, Hannes Reinecke wrote:
> Hmm. Sadly, not really. Still stuck in partition scanning.
>
> Guess we'll need to check if we have paths available before triggering
> partition scan; let me check ...

I think you must have two paths, and the 2nd path cleared the suppress
bit before the 1st one finished its call to device_add_disk(). How
about this one?

---
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 48e7a8906d012..755495f387168 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -586,6 +586,13 @@ static void nvme_requeue_work(struct work_struct *work)
 		container_of(work, struct nvme_ns_head, requeue_work);
 	struct bio *bio, *next;
 
+	if (disk_live(head->disk) &&
+	    test_and_clear_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state)) {
+		mutex_lock(&head->disk->open_mutex);
+		bdev_disk_changed(head->disk, false);
+		mutex_unlock(&head->disk->open_mutex);
+	}
+
 	spin_lock_irq(&head->requeue_lock);
 	next = bio_list_get(&head->requeue_list);
 	spin_unlock_irq(&head->requeue_lock);
@@ -629,6 +636,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 		return PTR_ERR(head->disk);
 	head->disk->fops = &nvme_ns_head_ops;
 	head->disk->private_data = head;
+	set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state);
 	sprintf(head->disk->disk_name, "nvme%dn%d",
 			ctrl->subsys->instance, head->instance);
 	return 0;
--


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-08 20:41       ` Keith Busch
@ 2024-10-09  6:23         ` Hannes Reinecke
  2024-10-09 16:33           ` Keith Busch
                             ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Hannes Reinecke @ 2024-10-09  6:23 UTC (permalink / raw)
  To: Keith Busch; +Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme

On 10/8/24 22:41, Keith Busch wrote:
> On Tue, Oct 08, 2024 at 09:17:47AM +0200, Hannes Reinecke wrote:
>> Hmm. Sadly, not really. Still stuck in partition scanning.
>>
>> Guess we'll need to check if we have paths available before triggering
>> partition scan; let me check ...
> 
> I think you must have two paths, and the 2nd path cleared the suppress
> bit before the 1st one finished its called to device_add_disk(). How
> about this one?
> 
Nope. Still stuck, this time in bdev_disk_changed().
With my testcase _all_ paths return NS_NOT_READY during partition scan, 
so I/O is constantly bounced between paths, and partition scan never 
returns. Doesn't matter where you call it, it's stuck.

The only chance we have is to modify I/O handling during scanning
(cf my new patchset).

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-09  6:23         ` Hannes Reinecke
@ 2024-10-09 16:33           ` Keith Busch
  2024-10-09 17:10           ` Keith Busch
  2024-10-09 17:32           ` Keith Busch
  2 siblings, 0 replies; 25+ messages in thread
From: Keith Busch @ 2024-10-09 16:33 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme

On Wed, Oct 09, 2024 at 08:23:45AM +0200, Hannes Reinecke wrote:
> On 10/8/24 22:41, Keith Busch wrote:
> > On Tue, Oct 08, 2024 at 09:17:47AM +0200, Hannes Reinecke wrote:
> > > Hmm. Sadly, not really. Still stuck in partition scanning.
> > > 
> > > Guess we'll need to check if we have paths available before triggering
> > > partition scan; let me check ...
> > 
> > I think you must have two paths, and the 2nd path cleared the suppress
> > bit before the 1st one finished its called to device_add_disk(). How
> > about this one?
> > 
> Nope. Still stuck, this time in bdev_disc_changed().
> With my testcase _all_ paths return NS_NOT_READY during partition scan, so
> I/O is constantly bounced between paths, and partition scan never returns.
> Doesn't matter where you call it, it's stuck.

Oh right... We need the requeue_work to end the bios when the disk is
dead, but the work can't proceed because it's waiting on its own bios.
Darn, I guess this scheme would need yet another work queue to do it.
 
> The only chance we have is to modify I/O handling during scanning
> (cf my new patchset).

I hoped we could avoid special error handling paths, but I'll take
another look at that approach.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-09  6:23         ` Hannes Reinecke
  2024-10-09 16:33           ` Keith Busch
@ 2024-10-09 17:10           ` Keith Busch
  2024-10-09 17:32           ` Keith Busch
  2 siblings, 0 replies; 25+ messages in thread
From: Keith Busch @ 2024-10-09 17:10 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme

On Wed, Oct 09, 2024 at 08:23:45AM +0200, Hannes Reinecke wrote:
> With my testcase _all_ paths return NS_NOT_READY during partition scan, so
> I/O is constantly bounced between paths, and partition scan never returns.
> Doesn't matter where you call it, it's stuck.

I'm trying to recreate this, so I just want to clarify the return status.
NS_NOT_READY isn't a path error, so it doesn't use failover. It eventually
exhausts its retries, so I don't see a problem with that status.

If I synthesize a pathing error though, then yeah, I can recreate a
problem.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-09  6:23         ` Hannes Reinecke
  2024-10-09 16:33           ` Keith Busch
  2024-10-09 17:10           ` Keith Busch
@ 2024-10-09 17:32           ` Keith Busch
  2024-10-10  6:16             ` Hannes Reinecke
                               ` (2 more replies)
  2 siblings, 3 replies; 25+ messages in thread
From: Keith Busch @ 2024-10-09 17:32 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme

On Wed, Oct 09, 2024 at 08:23:45AM +0200, Hannes Reinecke wrote:
> With my testcase _all_ paths return NS_NOT_READY during partition scan, so
> I/O is constantly bounced between paths, and partition scan never returns.
> Doesn't matter where you call it, it's stuck.

Assuming you mean a different status, like NVME_SC_INTERNAL_PATH_ERROR, I
can recreate a stuck scan. Let me take one more shot at fixing this by
deferring the suppressed scan to another context. This one is successful
in my tests:

---
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 48e7a8906d012..e1fd53ff2c0ca 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -580,6 +580,20 @@ static int nvme_add_ns_head_cdev(struct nvme_ns_head *head)
 	return ret;
 }
 
+static void nvme_scan_work(struct work_struct *work)
+{
+	struct nvme_ns_head *head =
+		container_of(work, struct nvme_ns_head, scan_work);
+
+	if (WARN_ON_ONCE(!test_and_clear_bit(GD_SUPPRESS_PART_SCAN,
+					     &head->disk->state)))
+		return;
+
+	mutex_lock(&head->disk->open_mutex);
+	bdev_disk_changed(head->disk, false);
+	mutex_unlock(&head->disk->open_mutex);
+}
+
 static void nvme_requeue_work(struct work_struct *work)
 {
 	struct nvme_ns_head *head =
@@ -606,6 +620,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 	bio_list_init(&head->requeue_list);
 	spin_lock_init(&head->requeue_lock);
 	INIT_WORK(&head->requeue_work, nvme_requeue_work);
+	INIT_WORK(&head->scan_work, nvme_scan_work);
 
 	/*
 	 * Add a multipath node if the subsystems supports multiple controllers.
@@ -629,6 +644,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 		return PTR_ERR(head->disk);
 	head->disk->fops = &nvme_ns_head_ops;
 	head->disk->private_data = head;
+	set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state);
 	sprintf(head->disk->disk_name, "nvme%dn%d",
 			ctrl->subsys->instance, head->instance);
 	return 0;
@@ -655,6 +671,7 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
 			return;
 		}
 		nvme_add_ns_head_cdev(head);
+		kblockd_schedule_work(&head->scan_work);
 	}
 
 	mutex_lock(&head->lock);
@@ -974,14 +991,14 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
 		return;
 	if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
 		nvme_cdev_del(&head->cdev, &head->cdev_device);
+		/*
+		 * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
+		 * to allow multipath to fail all I/O.
+		 */
+		synchronize_srcu(&head->srcu);
+		kblockd_schedule_work(&head->requeue_work);
 		del_gendisk(head->disk);
 	}
-	/*
-	 * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
-	 * to allow multipath to fail all I/O.
-	 */
-	synchronize_srcu(&head->srcu);
-	kblockd_schedule_work(&head->requeue_work);
 }
 
 void nvme_mpath_remove_disk(struct nvme_ns_head *head)
@@ -991,6 +1008,7 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 	/* make sure all pending bios are cleaned up */
 	kblockd_schedule_work(&head->requeue_work);
 	flush_work(&head->requeue_work);
+	flush_work(&head->scan_work);
 	put_disk(head->disk);
 }
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 313a4f978a2cf..ff3d092dbdb6d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -494,6 +494,7 @@ struct nvme_ns_head {
 	struct bio_list		requeue_list;
 	spinlock_t		requeue_lock;
 	struct work_struct	requeue_work;
+	struct work_struct	scan_work;
 	struct mutex		lock;
 	unsigned long		flags;
 #define NVME_NSHEAD_DISK_LIVE	0
--


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-09 17:32           ` Keith Busch
@ 2024-10-10  6:16             ` Hannes Reinecke
  2024-10-10  7:18               ` Hannes Reinecke
  2024-10-10  8:17             ` Christoph Hellwig
  2024-10-10  8:57             ` Hannes Reinecke
  2 siblings, 1 reply; 25+ messages in thread
From: Hannes Reinecke @ 2024-10-10  6:16 UTC (permalink / raw)
  To: Keith Busch; +Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme

On 10/9/24 19:32, Keith Busch wrote:
> On Wed, Oct 09, 2024 at 08:23:45AM +0200, Hannes Reinecke wrote:
>> With my testcase _all_ paths return NS_NOT_READY during partition scan, so
>> I/O is constantly bounced between paths, and partition scan never returns.
>> Doesn't matter where you call it, it's stuck.
> 
> Assuming you mean a differet status, like NVME_SC_INTERNAL_PATH_ERROR, I
> can recreate a stuck scan. Let me get one more shot at fixing this by
> suppressing the scan to another context. This one is successful in my
> tests:
> 
> ---
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 48e7a8906d012..e1fd53ff2c0ca 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -580,6 +580,20 @@ static int nvme_add_ns_head_cdev(struct nvme_ns_head *head)
>   	return ret;
[ .. ]

Piling workqueues on workqueues ... but anyway, I'll give it a spin.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-10  6:16             ` Hannes Reinecke
@ 2024-10-10  7:18               ` Hannes Reinecke
  0 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2024-10-10  7:18 UTC (permalink / raw)
  To: Keith Busch; +Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme

On 10/10/24 08:16, Hannes Reinecke wrote:
> On 10/9/24 19:32, Keith Busch wrote:
>> On Wed, Oct 09, 2024 at 08:23:45AM +0200, Hannes Reinecke wrote:
>>> With my testcase _all_ paths return NS_NOT_READY during partition 
>>> scan, so
>>> I/O is constantly bounced between paths, and partition scan never 
>>> returns.
>>> Doesn't matter where you call it, it's stuck.
>>
>> Assuming you mean a differet status, like NVME_SC_INTERNAL_PATH_ERROR, I
>> can recreate a stuck scan. Let me get one more shot at fixing this by
>> suppressing the scan to another context. This one is successful in my
>> tests:
>>
>> ---
>> diff --git a/drivers/nvme/host/multipath.c 
>> b/drivers/nvme/host/multipath.c
>> index 48e7a8906d012..e1fd53ff2c0ca 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -580,6 +580,20 @@ static int nvme_add_ns_head_cdev(struct 
>> nvme_ns_head *head)
>>       return ret;
> [ .. ]
> 
> Piling workqueues on workqueues ... but anyway, I'll give it a spin.
> 
And it even works. Care to send a proper patch?
You can add my 'Tested-by:' to it, too.
Thanks for doing this.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-09 17:32           ` Keith Busch
  2024-10-10  6:16             ` Hannes Reinecke
@ 2024-10-10  8:17             ` Christoph Hellwig
  2024-10-10  8:57             ` Hannes Reinecke
  2 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2024-10-10  8:17 UTC (permalink / raw)
  To: Keith Busch
  Cc: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig,
	Sagi Grimberg, linux-nvme

On Wed, Oct 09, 2024 at 11:32:32AM -0600, Keith Busch wrote:
> On Wed, Oct 09, 2024 at 08:23:45AM +0200, Hannes Reinecke wrote:
> > With my testcase _all_ paths return NS_NOT_READY during partition scan, so
> > I/O is constantly bounced between paths, and partition scan never returns.
> > Doesn't matter where you call it, it's stuck.
> 
> Assuming you mean a differet status, like NVME_SC_INTERNAL_PATH_ERROR, I
> can recreate a stuck scan. Let me get one more shot at fixing this by
> suppressing the scan to another context. This one is successful in my
> tests:
> 
> ---
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 48e7a8906d012..e1fd53ff2c0ca 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -580,6 +580,20 @@ static int nvme_add_ns_head_cdev(struct nvme_ns_head *head)
>  	return ret;
>  }
>  
> +static void nvme_scan_work(struct work_struct *work)

scan_work feels a little too generic.  Maybe part_scan_work?

Otherwise not pretty, but by far the best idea so far, so I guess we
should go with it.  Can you also add a big fat comment explaining the
rationale for the actual submission?



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-09 17:32           ` Keith Busch
  2024-10-10  6:16             ` Hannes Reinecke
  2024-10-10  8:17             ` Christoph Hellwig
@ 2024-10-10  8:57             ` Hannes Reinecke
  2024-10-15 14:33               ` Keith Busch
  2 siblings, 1 reply; 25+ messages in thread
From: Hannes Reinecke @ 2024-10-10  8:57 UTC (permalink / raw)
  To: Keith Busch; +Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme

On 10/9/24 19:32, Keith Busch wrote:
> On Wed, Oct 09, 2024 at 08:23:45AM +0200, Hannes Reinecke wrote:
>> With my testcase _all_ paths return NS_NOT_READY during partition scan, so
>> I/O is constantly bounced between paths, and partition scan never returns.
>> Doesn't matter where you call it, it's stuck.
> 
> Assuming you mean a differet status, like NVME_SC_INTERNAL_PATH_ERROR, I
> can recreate a stuck scan. Let me get one more shot at fixing this by
> suppressing the scan to another context. This one is successful in my
> tests:
> 
> ---
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 48e7a8906d012..e1fd53ff2c0ca 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -580,6 +580,20 @@ static int nvme_add_ns_head_cdev(struct nvme_ns_head *head)
>   	return ret;
>   }
>   
> +static void nvme_scan_work(struct work_struct *work)
> +{
> +	struct nvme_ns_head *head =
> +		container_of(work, struct nvme_ns_head, scan_work);
> +
> +	if (WARN_ON_ONCE(!test_and_clear_bit(GD_SUPPRESS_PART_SCAN,
> +					     &head->disk->state)))
> +		return;
> +
> +	mutex_lock(&head->disk->open_mutex);
> +	bdev_disk_changed(head->disk, false);
> +	mutex_unlock(&head->disk->open_mutex);
> +}
> +
>   static void nvme_requeue_work(struct work_struct *work)
>   {
>   	struct nvme_ns_head *head =
> @@ -606,6 +620,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
>   	bio_list_init(&head->requeue_list);
>   	spin_lock_init(&head->requeue_lock);
>   	INIT_WORK(&head->requeue_work, nvme_requeue_work);
> +	INIT_WORK(&head->scan_work, nvme_scan_work);
>   
>   	/*
>   	 * Add a multipath node if the subsystems supports multiple controllers.
> @@ -629,6 +644,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
>   		return PTR_ERR(head->disk);
>   	head->disk->fops = &nvme_ns_head_ops;
>   	head->disk->private_data = head;
> +	set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state);
>   	sprintf(head->disk->disk_name, "nvme%dn%d",
>   			ctrl->subsys->instance, head->instance);
>   	return 0;
> @@ -655,6 +671,7 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>   			return;
>   		}
>   		nvme_add_ns_head_cdev(head);
> +		kblockd_schedule_work(&head->scan_work);
>   	}
>   
>   	mutex_lock(&head->lock);
> @@ -974,14 +991,14 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
>   		return;
>   	if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>   		nvme_cdev_del(&head->cdev, &head->cdev_device);
> +		/*
> +		 * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
> +		 * to allow multipath to fail all I/O.
> +		 */
> +		synchronize_srcu(&head->srcu);
> +		kblockd_schedule_work(&head->requeue_work);
>   		del_gendisk(head->disk);
>   	}

I guess we need to split the 'test_and_clear_bit()' into a 'test_bit()'
when testing for the condition and a 'clear_bit()' after del_gendisk()
(sketch below).

Otherwise we'd have a race condition with nvme_mpath_set_live():

         if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
                 rc = device_add_disk(&head->subsys->dev, head->disk,
                                      nvme_ns_attr_groups);

which could sneak in from another controller just after we cleared the
NVME_NSHEAD_DISK_LIVE flag, causing device_add_disk() to fail because
the same name is already registered, or nvme_cdev_del() to emit kernel
warnings because the cdev was never registered.
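
Untested sketch, just to illustrate the split I mean (same calls as in
your patch, only the flag handling differs):

	if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
		nvme_cdev_del(&head->cdev, &head->cdev_device);
		synchronize_srcu(&head->srcu);
		kblockd_schedule_work(&head->requeue_work);
		del_gendisk(head->disk);
		/*
		 * Clear the flag only after del_gendisk() so that a
		 * concurrent nvme_mpath_set_live() cannot call
		 * device_add_disk() while the disk is still being
		 * torn down.
		 */
		clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags);
	}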

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-10  8:57             ` Hannes Reinecke
@ 2024-10-15 14:33               ` Keith Busch
  2024-10-15 14:56                 ` Hannes Reinecke
  0 siblings, 1 reply; 25+ messages in thread
From: Keith Busch @ 2024-10-15 14:33 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme

On Thu, Oct 10, 2024 at 10:57:23AM +0200, Hannes Reinecke wrote:
> On 10/9/24 19:32, Keith Busch wrote:
> > @@ -974,14 +991,14 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
> >   		return;
> >   	if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
> >   		nvme_cdev_del(&head->cdev, &head->cdev_device);
> > +		/*
> > +		 * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
> > +		 * to allow multipath to fail all I/O.
> > +		 */
> > +		synchronize_srcu(&head->srcu);
> > +		kblockd_schedule_work(&head->requeue_work);
> >   		del_gendisk(head->disk);
> >   	}
> 
> I guess we need to split 'test_and_clear_bit()' into a 'test_bit()' when
> testing for the condition and a 'clear_bit()' after del_gendisk().
> 
> Otherwise we're having a race condition with nvme_mpath_set_live:
> 
>         if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>                 rc = device_add_disk(&head->subsys->dev, head->disk,
>                                      nvme_ns_attr_groups);
> 
> which could sneak in from another controller just after we cleared the
> NVME_NSHEAD_DISK_LIVE flag, causing device_add_disk() to fail as the
> same name is already registered.
> Or nvme_cdev_del() to display nice kernel warnings as the cdev was
> not registered.

Is that actually happening for you? I don't think it's supposed to
happen because mpath_shutdown_disk only happens if the head is detached
from the subsystem, so no other thread should possibly be able to
access that head's disk for a different controller path. That
new controller should have to allocate an entirely new head.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-15 14:33               ` Keith Busch
@ 2024-10-15 14:56                 ` Hannes Reinecke
  2024-10-15 15:10                   ` Keith Busch
  0 siblings, 1 reply; 25+ messages in thread
From: Hannes Reinecke @ 2024-10-15 14:56 UTC (permalink / raw)
  To: Keith Busch; +Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme

On 10/15/24 16:33, Keith Busch wrote:
> On Thu, Oct 10, 2024 at 10:57:23AM +0200, Hannes Reinecke wrote:
>> On 10/9/24 19:32, Keith Busch wrote:
>>> @@ -974,14 +991,14 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
>>>    		return;
>>>    	if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>    		nvme_cdev_del(&head->cdev, &head->cdev_device);
>>> +		/*
>>> +		 * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
>>> +		 * to allow multipath to fail all I/O.
>>> +		 */
>>> +		synchronize_srcu(&head->srcu);
>>> +		kblockd_schedule_work(&head->requeue_work);
>>>    		del_gendisk(head->disk);
>>>    	}
>>
>> I guess we need to split 'test_and_clear_bit()' into a 'test_bit()' when
>> testing for the condition and a 'clear_bit()' after del_gendisk().
>>
>> Otherwise we're having a race condition with nvme_mpath_set_live:
>>
>>          if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>                  rc = device_add_disk(&head->subsys->dev, head->disk,
>>                                       nvme_ns_attr_groups);
>>
>> which could sneak in from another controller just after we cleared the
>> NVME_NSHEAD_DISK_LIVE flag, causing device_add_disk() to fail as the
>> same name is already registered.
>> Or nvme_cdev_del() to display nice kernel warnings as the cdev was
>> not registered.
> 
> Is that actually happening for you? I don't think it's supposed to
> happen because mpath_shutdwon_disk only happens if the head is detached
> from the subsystem, so no other thread should be able to possibly
> attempt to access that head's disk for a different controller path. That
> new controller should have to allocate an entirely new head.

Na, forget it. Not only does it not happen, but we also need the
'test_and_set_bit()' at the start, otherwise we can't abort I/O.
So all's fine with your patch, and it even survives my testcase.

(The first to do so even after several runs, I might add :-)

Care to send an official patch?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-15 14:56                 ` Hannes Reinecke
@ 2024-10-15 15:10                   ` Keith Busch
  2024-10-20 23:37                     ` Sagi Grimberg
  0 siblings, 1 reply; 25+ messages in thread
From: Keith Busch @ 2024-10-15 15:10 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg, linux-nvme

On Tue, Oct 15, 2024 at 04:56:08PM +0200, Hannes Reinecke wrote:
> On 10/15/24 16:33, Keith Busch wrote:
> > Is that actually happening for you? I don't think it's supposed to
> > happen because mpath_shutdwon_disk only happens if the head is detached
> > from the subsystem, so no other thread should be able to possibly
> > attempt to access that head's disk for a different controller path. That
> > new controller should have to allocate an entirely new head.
> 
> Na, forget it. Not only doesn't it happen, but we need the
> 'test_and_set_bit()' at the start otherwise we can't abort I/O.
> So all's fine with your patch, and it even survives my testcase.
> 
> (The first to do so even after several runs, I might add :-)
> 
> Care to send an official patch?

Should be in your mailbox. Here's the link:

  https://lore.kernel.org/linux-nvme/20241015143136.810779-1-kbusch@meta.com/T/#u


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state()
  2024-10-07 10:01 ` [PATCH 1/3] nvme-multipath: simplify loop in nvme_update_ana_state() Hannes Reinecke
  2024-10-07 15:46   ` Keith Busch
@ 2024-10-20 23:33   ` Sagi Grimberg
  1 sibling, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2024-10-20 23:33 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme




On 07/10/2024 13:01, Hannes Reinecke wrote:
> Currently nvme_update_ana_state() iterates over all namespaces
> with the srcu lock held, and calls the 'cb' function if a
> match on the list of namespaces from the ANA log is found.
> This assumes that the iteration itself is relatively quick,
> and the namespace state itself doesn't change during iteration.
> For more complex functions (eg if the 'cb' function triggers
> mpath_set_live() which then kicks off a partition scan) the
> namespace state might change while iterating, and the rcu-protected
> area becomes really long.
> So change the loop to iterated over the entries in the ANA log,
> and call nvme_find_get_ns() on each entry.
> With that the 'cb' function is executed outside of the RCU
> protected area, and normal reference counting rules apply.
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
>   drivers/nvme/host/multipath.c | 18 ++++++------------
>   1 file changed, 6 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index f72c5a6a2d8e..61f8ae199288 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -745,7 +745,6 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
>   	u32 nr_nsids = le32_to_cpu(desc->nnsids), n = 0;
>   	unsigned *nr_change_groups = data;
>   	struct nvme_ns *ns;
> -	int srcu_idx;
>   
>   	dev_dbg(ctrl->device, "ANA group %d: %s.\n",
>   			le32_to_cpu(desc->grpid),
> @@ -757,21 +756,16 @@ static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
>   	if (!nr_nsids)
>   		return 0;
>   
> -	srcu_idx = srcu_read_lock(&ctrl->srcu);
> -	list_for_each_entry_rcu(ns, &ctrl->namespaces, list) {
> +	for(n = 0; n < nr_nsids; n++) {
>   		unsigned nsid;
> -again:
> +
>   		nsid = le32_to_cpu(desc->nsids[n]);
> -		if (ns->head->ns_id < nsid)
> -			continue;
> -		if (ns->head->ns_id == nsid)
> +		ns = nvme_find_get_ns(ctrl, nsid);
> +		if (ns) {
>   			nvme_update_ns_ana_state(desc, ns);
> -		if (++n == nr_nsids)
> -			break;
> -		if (ns->head->ns_id > nsid)
> -			goto again;
> +			nvme_put_ns(ns);
> +		}
>   	}
> -	srcu_read_unlock(&ctrl->srcu, srcu_idx);

You also made the loop n^2 now... which *may* be an issue for an
environment with a lot of namespaces. It is simpler though...

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
  2024-10-15 15:10                   ` Keith Busch
@ 2024-10-20 23:37                     ` Sagi Grimberg
  0 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2024-10-20 23:37 UTC (permalink / raw)
  To: Keith Busch, Hannes Reinecke
  Cc: Hannes Reinecke, Christoph Hellwig, linux-nvme




On 15/10/2024 18:10, Keith Busch wrote:
> On Tue, Oct 15, 2024 at 04:56:08PM +0200, Hannes Reinecke wrote:
>> On 10/15/24 16:33, Keith Busch wrote:
>>> Is that actually happening for you? I don't think it's supposed to
>>> happen because mpath_shutdwon_disk only happens if the head is detached
>>> from the subsystem, so no other thread should be able to possibly
>>> attempt to access that head's disk for a different controller path. That
>>> new controller should have to allocate an entirely new head.
>> Na, forget it. Not only doesn't it happen, but we need the
>> 'test_and_set_bit()' at the start otherwise we can't abort I/O.
>> So all's fine with your patch, and it even survives my testcase.
>>
>> (The first to do so even after several runs, I might add :-)
>>
>> Care to send an official patch?
> Should be in your mailbox. Here's the link:
>
>    https://lore.kernel.org/linux-nvme/20241015143136.810779-1-kbusch@meta.com/T/#u

I'm glad we didn't take the original version of this patch :)


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/3] nvme-multipath: skip failed paths during partition scan
  2024-10-08  6:40   ` Christoph Hellwig
@ 2024-10-20 23:38     ` Sagi Grimberg
  0 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2024-10-20 23:38 UTC (permalink / raw)
  To: Christoph Hellwig, Hannes Reinecke
  Cc: Keith Busch, linux-nvme, Hannes Reinecke


>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index f03ef983a75f..4113d38606a4 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -102,6 +102,13 @@ void nvme_failover_req(struct request *req)
>>   		queue_work(nvme_wq, &ns->ctrl->ana_work);
>>   	}
>>   
>> +	/*
>> +	 * Do not use this path during scanning
>> +	 * to avoid deadlocks in device_add_disk()
>> +	 */
>> +	if (mutex_is_locked(&ns->ctrl->scan_lock))
>> +		set_bit(NVME_NS_SCAN_FAILED, &ns->flags);
> Err, no.  mutex_is_locked is never a valid way to detect the calling
> context - obviously someone else could also be holding it.
>

Strongly agree here.


^ permalink raw reply	[flat|nested] 25+ messages in thread
