* nvme/pcie hot plug results in /dev name change
@ 2023-01-20 19:50 John Meneghini
2023-01-20 21:42 ` Keith Busch
0 siblings, 1 reply; 19+ messages in thread
From: John Meneghini @ 2023-01-20 19:50 UTC (permalink / raw)
To: linux-nvme@lists.infradead.org
We have a customer complaining that when they hot unplug/plug their nvme-pci device, the device name changes.
For example, when they hot remove /dev/nvme1n1 physically, then reinsert it, it gets identified as /dev/nvme4n1.
IIRC there is nothing about nvme_init_ctrl that guarantees the ctrl->instance will be the same because
ida_alloc() will only return the next free id in the bitmap.
The fact that nvme_remove -> nvme_init_ctrl often results in the same ctrl->instance is just
happenstance, and there is nothing that guarantees /dev name consistency in Linux, especially in
the case of hot unplug/plug where the controller may very well be marked NVME_CTRL_DEAD due to the
ungraceful disconnect.
True?
--
John A. Meneghini
Senior Principal Platform Storage Engineer
RHEL SST - Platform Storage Group
jmeneghi@redhat.com
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: nvme/pcie hot plug results in /dev name change
2023-01-20 19:50 nvme/pcie hot plug results in /dev name change John Meneghini
@ 2023-01-20 21:42 ` Keith Busch
2023-01-21 7:01 ` Christoph Hellwig
0 siblings, 1 reply; 19+ messages in thread
From: Keith Busch @ 2023-01-20 21:42 UTC (permalink / raw)
To: John Meneghini; +Cc: linux-nvme@lists.infradead.org
On Fri, Jan 20, 2023 at 02:50:46PM -0500, John Meneghini wrote:
> We have a customer complaining that when they hot unplug/plug their nvme-pci device the device name changes.
>
> For example, when they hot remove /dev/nvme1n1 physically, then reinsert it, it gets identified as /dev/nvme4n1.
>
> IIRC there is nothing about nvme_init_ctrl that guarantees the ctrl->instance will be the same because
> ida_alloc() will only return the next free id in the bitmap.
>
> The fact that nvme_remove -> nvme_init_ctrl often results in the same
> ctrl->instance is just happenstance, and there is nothing that guarantees
> /dev name consistency in Linux, especially in the case of hot unplug/plug
> where the controller may very well be marked NVME_CTRL_DEAD due to the
> ungraceful disconnect.
That is correct. We don't know the identity of the device at the point
we have to assign it an instance number, so the hot added one will just
get the first available unique number. If you need a consistent name, we
have the persistent naming rules that should create those links in
/dev/disk/by-id/.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: nvme/pcie hot plug results in /dev name change
2023-01-20 21:42 ` Keith Busch
@ 2023-01-21 7:01 ` Christoph Hellwig
2023-01-29 10:28 ` Ming Lei
0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2023-01-21 7:01 UTC (permalink / raw)
To: Keith Busch; +Cc: John Meneghini, linux-nvme@lists.infradead.org
On Fri, Jan 20, 2023 at 02:42:23PM -0700, Keith Busch wrote:
> That is correct. We don't know the identity of the device at the point
> we have to assign it an instance number, so the hot added one will just
> get the first available unique number. If you need a consistent name, we
> have the persistent naming rules that should create those links in
> /dev/disk/by-id/.
Note that this is a bit of a problem under a file system or stacking driver
that handles failing drives (e.g. btrfs or md raid), which holds on to
the "old" device file and then fails to find the new one. I had a
customer complaint about that as well :)
The first hack was to force-run the multipath code that can keep the
node alive. That works, but it is really ugly, especially when dealing
with corner cases such as overlapping nsids between different
controllers.
In the long run I think we'll need to:
- send a notification to the holder if a device is hot removed from
the block layer so that it can clean up
- make the upper layers look for the replugged device
I've been working on some of this for a while but haven't made much
progress due to other commitments.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: nvme/pcie hot plug results in /dev name change
2023-01-21 7:01 ` Christoph Hellwig
@ 2023-01-29 10:28 ` Ming Lei
2023-01-31 16:38 ` Keith Busch
0 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2023-01-29 10:28 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, John Meneghini, linux-nvme@lists.infradead.org,
ming.lei
On Fri, Jan 20, 2023 at 11:01:53PM -0800, Christoph Hellwig wrote:
> On Fri, Jan 20, 2023 at 02:42:23PM -0700, Keith Busch wrote:
> > That is correct. We don't know the identity of the device at the point
> > we have to assign it an instance number, so the hot added one will just
> > get the first available unique number. If you need a consistent name, we
> > have the persistent naming rules that should create those links in
> > /dev/disk/by-id/.
>
> Note that this is a bit of a problem under a file system or stacking driver
> that handles failing drives (e.g. btrfs or md raid), which holds on to
> the "old" device file and then fails to find the new one. I had a
> customer complaint about that as well :)
>
> The first hack was to force-run the multipath code that can keep the
> node alive. That works, but it is really ugly, especially when dealing
> with corner cases such as overlapping nsids between different
> controllers.
>
> In the long run I think we'll need to:
> - send a notification to the holder if a device is hot removed from
> the block layer so that it can clean up
When the disk is deleted, a notification is already sent to userspace
via a udev/kobject uevent, so the user can unmount the original FS, or
DM/MD userspace tools can handle the device removal.
> - make the upper layers look for the replugged device
>
> I've been working on some of this for a while but haven't made much
> progress due to other commitments.
Persistent block device naming is supposed to be handled by userspace,
e.g. via udev rules [1].
[1] https://wiki.archlinux.org/title/persistent_block_device_naming
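[Editorial note: a rule of the kind referred to above could look roughly like the following. The file name, the alias directory, and the match on the namespace's "wwid" sysfs attribute are this example's assumptions; distro-shipped rules such as 60-persistent-storage.rules already create the /dev/disk/by-id links in a similar way.]

```
# /etc/udev/rules.d/99-local-nvme.rules -- illustrative only.
# Create a stable alias based on the namespace identity, regardless of
# which nvmeXnY name the kernel assigned on this probe.
KERNEL=="nvme*n*", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", \
  ATTR{wwid}=="?*", SYMLINK+="disk/by-local-wwid/$attr{wwid}"
```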
Thanks,
Ming
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: nvme/pcie hot plug results in /dev name change
2023-01-29 10:28 ` Ming Lei
@ 2023-01-31 16:38 ` Keith Busch
2023-02-01 2:33 ` Ming Lei
0 siblings, 1 reply; 19+ messages in thread
From: Keith Busch @ 2023-01-31 16:38 UTC (permalink / raw)
To: Ming Lei
Cc: Christoph Hellwig, John Meneghini, linux-nvme@lists.infradead.org
On Sun, Jan 29, 2023 at 06:28:05PM +0800, Ming Lei wrote:
> On Fri, Jan 20, 2023 at 11:01:53PM -0800, Christoph Hellwig wrote:
> > On Fri, Jan 20, 2023 at 02:42:23PM -0700, Keith Busch wrote:
> > > That is correct. We don't know the identity of the device at the point
> > > we have to assign it an instance number, so the hot added one will just
> > > get the first available unique number. If you need a consistent name, we
> > > have the persistent naming rules that should create those links in
> > > /dev/disk/by-id/.
> >
> > Note that this is a bit of a problem under a file system or stacking driver
> > that handles failing drives (e.g. btrfs or md raid), which holds on to
> > the "old" device file, and then fails to find the new one. I had a
> > customer complaint for that as well :)
> >
> > The first hack was to force-run the multipath code that can keep the
> > node alive. That works, but it is really ugly, especially when dealing
> > with corner cases such as overlapping nsids between different
> > controllers.
> >
> > In the long run I think we'll need to:
> > - send a notification to the holder if a device is hot removed from
> > the block layer so that it can clean up
>
> When the disk is deleted, a notification is already sent to userspace
> via a udev/kobject uevent, so the user can unmount the original FS, or
> DM/MD userspace tools can handle the device removal.
>
> > - make the upper layers look for the replugged device
> >
> > I've been working on some of this for a while but haven't made much
> > progress due to other commitments.
>
> Persistent block device naming is supposed to be handled by userspace,
> e.g. via udev rules.
Come to think of it, I actually have heard many complaints about this behavior.
Requiring user space to deal with the teardown and restore of their open files and
mount points on a transient link loss can be inconvenient. Example use cases
are firmware activation requiring a Subsystem Reset, or a PCIe error
containment event. Those cause the links to bounce, which can trigger hot plug
events on some platforms.
The native nvme multipath code looks like it could be leveraged to improve that
user experience if we wanted to make that layer an option for non-multipath
devices.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: nvme/pcie hot plug results in /dev name change
2023-01-31 16:38 ` Keith Busch
@ 2023-02-01 2:33 ` Ming Lei
2023-02-01 6:27 ` Christoph Hellwig
0 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2023-02-01 2:33 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, John Meneghini, linux-nvme@lists.infradead.org,
ming.lei
On Tue, Jan 31, 2023 at 09:38:47AM -0700, Keith Busch wrote:
> On Sun, Jan 29, 2023 at 06:28:05PM +0800, Ming Lei wrote:
> > On Fri, Jan 20, 2023 at 11:01:53PM -0800, Christoph Hellwig wrote:
> > > On Fri, Jan 20, 2023 at 02:42:23PM -0700, Keith Busch wrote:
> > > > That is correct. We don't know the identity of the device at the point
> > > > we have to assign it an instance number, so the hot added one will just
> > > > get the first available unique number. If you need a consistent name, we
> > > > have the persistent naming rules that should create those links in
> > > > /dev/disk/by-id/.
> > >
> > > Note that this is a bit of a problem under a file system or stacking driver
> > > that handles failing drives (e.g. btrfs or md raid), which holds on to
> > > the "old" device file, and then fails to find the new one. I had a
> > > customer complaint for that as well :)
> > >
> > > The first hack was to force-run the multipath code that can keep the
> > > node alive. That works, but it is really ugly, especially when dealing
> > > with corner cases such as overlapping nsids between different
> > > controllers.
> > >
> > > In the long run I think we'll need to:
> > > - send a notification to the holder if a device is hot removed from
> > > the block layer so that it can clean up
> >
> > When the disk is deleted, a notification is already sent to userspace
> > via a udev/kobject uevent, so the user can unmount the original FS, or
> > DM/MD userspace tools can handle the device removal.
> >
> > > - make the upper layers look for the replugged device
> > >
> > > I've been working on some of this for a while but haven't made much
> > > progress due to other commitments.
> >
> > Persistent block device naming is supposed to be handled by userspace,
> > e.g. via udev rules.
>
> Come to think of it, I actually have heard many complaints about this behavior.
> Requiring user space to deal with the teardown and restore of their open files and
> mount points on a transient link loss can be inconvenient. Example use cases
If an IO error is returned to the FS, I guess an unmount may have to be done, since
it might be a metadata IO. But if userspace has a persistent device name, it is
easy for userspace to handle the unmount and re-setup.
> are firmware activation requiring a Subsystem Reset, or a PCIe error
> containment event. Those cause the links to bounce, which can trigger hot plug
> events on some platforms.
The above isn't unique to nvme; it is just easier for nvme-pci to
handle timeouts/errors by removing the device, IMO.
> The native nvme multipath code looks like it could be leveraged to improve that
> user experience if we wanted to make that layer an option for non-multipath
> devices.
Can you share the basic idea? Will nvme multipath hold the IO error and
not propagate it to the upper layer until the new device is probed? What if the
new device is probed late, and the IO has already timed out and an error has been
returned to the upper layer?
Thanks,
Ming
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: nvme/pcie hot plug results in /dev name change
2023-02-01 2:33 ` Ming Lei
@ 2023-02-01 6:27 ` Christoph Hellwig
2023-02-01 16:31 ` Keith Busch
2023-02-15 22:18 ` Keith Busch
0 siblings, 2 replies; 19+ messages in thread
From: Christoph Hellwig @ 2023-02-01 6:27 UTC (permalink / raw)
To: Ming Lei
Cc: Keith Busch, Christoph Hellwig, John Meneghini,
linux-nvme@lists.infradead.org
[-- Attachment #1: Type: text/plain, Size: 554 bytes --]
On Wed, Feb 01, 2023 at 10:33:15AM +0800, Ming Lei wrote:
> > The native nvme multipath code looks like it could be leveraged to improve that
> > user experience if we wanted to make that layer an option for non-multipath
> > devices.
>
> Can you share the basic idea? Will nvme multipath hold the IO error and
> not propagate it to the upper layer until the new device is probed? What if the
> new device is probed late, and the IO has already timed out and an error has been
> returned to the upper layer?
Attached is what I did, but I'm really not sure it is the right final
approach.
[-- Attachment #2: 0001-nvme-split-ANA-attribute-handling-from-nvme_ns_id_at.patch --]
[-- Type: text/plain, Size: 6332 bytes --]
From 7016fc46f29463871b68348088384b3d0ceeba91 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Sun, 30 Oct 2022 18:17:28 +0100
Subject: nvme: split ANA attribute handling from nvme_ns_id_attr_group
Use a separate attribute group for the ANA attributes that aren't shown
on the multipath-device instead of just hiding them at runtime.
While this is a little more code, it keeps the separation clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/nvme/host/core.c | 22 +++++++-------------
drivers/nvme/host/multipath.c | 38 ++++++++++++++++++++++++++++++++---
drivers/nvme/host/nvme.h | 14 ++-----------
3 files changed, 44 insertions(+), 30 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 14568ae308fba5..2eab4d1b521edb 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3432,10 +3432,6 @@ static struct attribute *nvme_ns_id_attrs[] = {
&dev_attr_nguid.attr,
&dev_attr_eui.attr,
&dev_attr_nsid.attr,
-#ifdef CONFIG_NVME_MULTIPATH
- &dev_attr_ana_grpid.attr,
- &dev_attr_ana_state.attr,
-#endif
NULL,
};
@@ -3458,24 +3454,20 @@ static umode_t nvme_ns_id_attrs_are_visible(struct kobject *kobj,
if (!memchr_inv(ids->eui64, 0, sizeof(ids->eui64)))
return 0;
}
-#ifdef CONFIG_NVME_MULTIPATH
- if (a == &dev_attr_ana_grpid.attr || a == &dev_attr_ana_state.attr) {
- if (dev_to_disk(dev)->fops != &nvme_bdev_ops) /* per-path attr */
- return 0;
- if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl))
- return 0;
- }
-#endif
+
return a->mode;
}
-static const struct attribute_group nvme_ns_id_attr_group = {
+const struct attribute_group nvme_ns_id_attr_group = {
.attrs = nvme_ns_id_attrs,
.is_visible = nvme_ns_id_attrs_are_visible,
};
-const struct attribute_group *nvme_ns_id_attr_groups[] = {
+static const struct attribute_group *nvme_disk_attr_groups[] = {
&nvme_ns_id_attr_group,
+#ifdef CONFIG_NVME_MULTIPATH
+ &nvme_ns_ana_attr_group,
+#endif
NULL,
};
@@ -4250,7 +4242,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
up_write(&ctrl->namespaces_rwsem);
nvme_get_ctrl(ctrl);
- if (device_add_disk(ctrl->device, ns->disk, nvme_ns_id_attr_groups))
+ if (device_add_disk(ctrl->device, ns->disk, nvme_disk_attr_groups))
goto out_cleanup_ns_from_list;
if (!nvme_ns_head_multipath(ns->head))
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 0ea7e441e080f2..96ea23f1e374f0 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -50,6 +50,11 @@ void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
subsys->iopolicy = iopolicy;
}
+static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
+{
+ return ctrl->ana_log_buf != NULL;
+}
+
void nvme_mpath_unfreeze(struct nvme_subsystem *subsys)
{
struct nvme_ns_head *h;
@@ -443,6 +448,11 @@ static const struct file_operations nvme_ns_head_chr_fops = {
.uring_cmd_iopoll = nvme_ns_head_chr_uring_cmd_iopoll,
};
+static const struct attribute_group *nvme_mpath_disk_attr_groups[] = {
+ &nvme_ns_id_attr_group,
+ NULL,
+};
+
static int nvme_add_ns_head_cdev(struct nvme_ns_head *head)
{
int ret;
@@ -539,7 +549,7 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
*/
if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
rc = device_add_disk(&head->subsys->dev, head->disk,
- nvme_ns_id_attr_groups);
+ nvme_mpath_disk_attr_groups);
if (rc) {
clear_bit(NVME_NSHEAD_DISK_LIVE, &ns->flags);
return;
@@ -780,7 +790,7 @@ static ssize_t ana_grpid_show(struct device *dev, struct device_attribute *attr,
{
return sysfs_emit(buf, "%d\n", nvme_get_ns_from_dev(dev)->ana_grpid);
}
-DEVICE_ATTR_RO(ana_grpid);
+static DEVICE_ATTR_RO(ana_grpid);
static ssize_t ana_state_show(struct device *dev, struct device_attribute *attr,
char *buf)
@@ -789,7 +799,29 @@ static ssize_t ana_state_show(struct device *dev, struct device_attribute *attr,
return sysfs_emit(buf, "%s\n", nvme_ana_state_names[ns->ana_state]);
}
-DEVICE_ATTR_RO(ana_state);
+static DEVICE_ATTR_RO(ana_state);
+
+static struct attribute *nvme_ns_ana_attrs[] = {
+ &dev_attr_ana_grpid.attr,
+ &dev_attr_ana_state.attr,
+ NULL,
+};
+
+static umode_t nvme_ns_ana_attr_is_visible(struct kobject *kobj,
+ struct attribute *a, int n)
+{
+ struct device *dev = container_of(kobj, struct device, kobj);
+
+ if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl))
+ return 0;
+
+ return a->mode;
+}
+
+const struct attribute_group nvme_ns_ana_attr_group = {
+ .attrs = nvme_ns_ana_attrs,
+ .is_visible = nvme_ns_ana_attr_is_visible,
+};
static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
struct nvme_ana_group_desc *desc, void *data)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 00d3b438efe3f3..c656643ba0eead 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -855,17 +855,13 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
-extern const struct attribute_group *nvme_ns_id_attr_groups[];
+extern const struct attribute_group nvme_ns_id_attr_group;
+extern const struct attribute_group nvme_ns_ana_attr_group;
extern const struct pr_ops nvme_pr_ops;
extern const struct block_device_operations nvme_ns_head_ops;
struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
#ifdef CONFIG_NVME_MULTIPATH
-static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
-{
- return ctrl->ana_log_buf != NULL;
-}
-
void nvme_mpath_unfreeze(struct nvme_subsystem *subsys);
void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys);
void nvme_mpath_start_freeze(struct nvme_subsystem *subsys);
@@ -894,16 +890,10 @@ static inline void nvme_trace_bio_complete(struct request *req)
}
extern bool multipath;
-extern struct device_attribute dev_attr_ana_grpid;
-extern struct device_attribute dev_attr_ana_state;
extern struct device_attribute subsys_attr_iopolicy;
#else
#define multipath false
-static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
-{
- return false;
-}
static inline void nvme_failover_req(struct request *req)
{
}
--
2.39.0
[-- Attachment #3: 0002-nvme-allow-pinning-shared-namespaces.patch --]
[-- Type: text/plain, Size: 3763 bytes --]
From 76e56c0c5a1e6bb3d9d4d11255a5bc4db634e28b Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Sun, 30 Oct 2022 18:39:53 +0100
Subject: nvme: allow pinning shared namespaces
Add a sysfs attribute to pin a ns_head. While pinned the ns_head
will not go away even if no namespace currently exists.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
drivers/nvme/host/core.c | 11 ++++--
drivers/nvme/host/multipath.c | 63 +++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 1 +
3 files changed, 72 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2eab4d1b521edb..32f2e394dd6dc9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3897,8 +3897,12 @@ static struct nvme_ns_head *nvme_find_ns_head(struct nvme_ctrl *ctrl,
*/
if (h->ns_id != nsid || !nvme_is_unique_nsid(ctrl, h))
continue;
- if (!list_empty(&h->list) && nvme_tryget_ns_head(h))
- return h;
+ if (list_empty(&h->list) &&
+ !test_bit(NVME_NSHEAD_PINNED, &h->flags))
+ continue;
+ if (!nvme_tryget_ns_head(h))
+ continue;
+ return h;
}
return NULL;
@@ -4294,7 +4298,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
mutex_lock(&ns->ctrl->subsys->lock);
list_del_rcu(&ns->siblings);
- if (list_empty(&ns->head->list)) {
+ if (list_empty(&ns->head->list) &&
+ !test_bit(NVME_NSHEAD_PINNED, &ns->head->flags)) {
list_del_init(&ns->head->entry);
last_path = true;
}
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 96ea23f1e374f0..9dbbd3a4398598 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -448,8 +448,71 @@ static const struct file_operations nvme_ns_head_chr_fops = {
.uring_cmd_iopoll = nvme_ns_head_chr_uring_cmd_iopoll,
};
+static ssize_t pin_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+ struct nvme_ns_head *head = disk->private_data;
+
+ return sysfs_emit(buf, "%d\n",
+ test_bit(NVME_NSHEAD_PINNED, &head->flags));
+}
+
+static ssize_t pin_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+ struct nvme_ns_head *head = disk->private_data;
+ bool shutdown = false;
+ ssize_t ret = 0;
+ bool pin;
+
+ ret = kstrtobool(buf, &pin);
+ if (ret)
+ return ret;
+
+ if (pin) {
+ if (!nvme_tryget_ns_head(head))
+ return -EINVAL;
+ if (test_and_set_bit(NVME_NSHEAD_PINNED, &head->flags)) {
+ ret = -EINVAL;
+ goto out_put_head;
+ }
+ return count;
+ }
+
+ if (!test_and_clear_bit(NVME_NSHEAD_PINNED, &head->flags))
+ return -EINVAL;
+
+ mutex_lock(&head->subsys->lock);
+ if (list_empty(&head->list)) {
+ list_del_init(&head->entry);
+ shutdown = true;
+ }
+ mutex_unlock(&head->subsys->lock);
+ if (shutdown)
+ nvme_mpath_shutdown_disk(head);
+
+out_put_head:
+ nvme_put_ns_head(head);
+ if (ret)
+ return ret;
+ return count;
+}
+static DEVICE_ATTR_RW(pin);
+
+static struct attribute *nvme_mpath_disk_attrs[] = {
+ &dev_attr_pin.attr,
+ NULL,
+};
+
+const struct attribute_group nvme_mpath_disk_attr_group = {
+ .attrs = nvme_mpath_disk_attrs,
+};
+
static const struct attribute_group *nvme_mpath_disk_attr_groups[] = {
&nvme_ns_id_attr_group,
+ &nvme_mpath_disk_attr_group,
NULL,
};
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index c656643ba0eead..1e0ec8ef4d842a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -442,6 +442,7 @@ struct nvme_ns_head {
struct mutex lock;
unsigned long flags;
#define NVME_NSHEAD_DISK_LIVE 0
+#define NVME_NSHEAD_PINNED 1
struct nvme_ns __rcu *current_path[];
#endif
};
--
2.39.0
[-- Attachment #4: 0003-HACK-allow-running-multipath-code-for-non-shared-nam.patch --]
[-- Type: text/plain, Size: 1896 bytes --]
From 9ceb1ab63aeab757dab5f78f62a2b467dc34e103 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Mon, 31 Oct 2022 10:34:44 +0100
Subject: HACK: allow running multipath code for non-shared namespaces
---
drivers/nvme/host/core.c | 4 ++--
drivers/nvme/host/multipath.c | 3 +--
2 files changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 32f2e394dd6dc9..3e4ddbc373db80 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1447,7 +1447,7 @@ static int nvme_ns_info_from_identify(struct nvme_ctrl *ctrl,
if (ret)
return ret;
info->anagrpid = id->anagrpid;
- info->is_shared = id->nmic & NVME_NS_NMIC_SHARED;
+ info->is_shared = true;
info->is_readonly = id->nsattr & NVME_NS_ATTR_RO;
info->is_ready = true;
if (ctrl->quirks & NVME_QUIRK_BOGUS_NID) {
@@ -1483,7 +1483,7 @@ static int nvme_ns_info_from_id_cs_indep(struct nvme_ctrl *ctrl,
ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, id, sizeof(*id));
if (!ret) {
info->anagrpid = id->anagrpid;
- info->is_shared = id->nmic & NVME_NS_NMIC_SHARED;
+ info->is_shared = true;
info->is_readonly = id->nsattr & NVME_NS_ATTR_RO;
info->is_ready = id->nstat & NVME_NSTAT_NRDY;
}
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 9dbbd3a4398598..0cb8d5c853156e 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -562,8 +562,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
* We also do this for private namespaces as the namespace sharing flag
* could change after a rescan.
*/
- if (!(ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL) ||
- !nvme_is_unique_nsid(ctrl, head) || !multipath)
+ if (!nvme_is_unique_nsid(ctrl, head) || !multipath)
return 0;
head->disk = blk_alloc_disk(ctrl->numa_node);
--
2.39.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: nvme/pcie hot plug results in /dev name change
2023-02-01 6:27 ` Christoph Hellwig
@ 2023-02-01 16:31 ` Keith Busch
2023-02-13 14:01 ` Sagi Grimberg
2023-02-15 22:18 ` Keith Busch
1 sibling, 1 reply; 19+ messages in thread
From: Keith Busch @ 2023-02-01 16:31 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Ming Lei, John Meneghini, linux-nvme@lists.infradead.org
On Tue, Jan 31, 2023 at 10:27:24PM -0800, Christoph Hellwig wrote:
> On Wed, Feb 01, 2023 at 10:33:15AM +0800, Ming Lei wrote:
> > > The native nvme multipath code looks like it could be leveraged to improve that
> > > user experience if we wanted to make that layer an option for non-multipath
> > > devices.
> >
> > Can you share the basic idea? Will nvme multipath hold the IO error and
> > not propagate it to the upper layer until the new device is probed? What if the
> > new device is probed late, and the IO has already timed out and an error has been
> > returned to the upper layer?
>
> Attached is what I did, but I'm really not sure it is the right final
> approach.
Also not sure if this is going to work out, but it looks like a good start to
me.
Instead of the user pinning the virtual device so that it never goes away, though, I
considered a user-tunable "device missing delay" parameter to debounce link
events. New IOs would be deferred to the requeue_list while the timer is
active, and then EIO'ed if the timer expires without a successful LIVE
controller attachment. The use cases I'm considering are short bounces from
transient link resets, so I'd expect timers to range from a few seconds to maybe a
minute.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: nvme/pcie hot plug results in /dev name change
2023-02-01 16:31 ` Keith Busch
@ 2023-02-13 14:01 ` Sagi Grimberg
2023-02-13 16:32 ` Keith Busch
0 siblings, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2023-02-13 14:01 UTC (permalink / raw)
To: Keith Busch, Christoph Hellwig
Cc: Ming Lei, John Meneghini, linux-nvme@lists.infradead.org
>>>> The native nvme multipath code looks like it could be leveraged to improve that
>>>> user experience if we wanted to make that layer an option for non-multipath
>>>> devices.
>>>
>>> Can you share the basic idea? Will nvme multipath hold the IO error and
>>> not propagate it to the upper layer until the new device is probed? What if the
>>> new device is probed late, and the IO has already timed out and an error has been
>>> returned to the upper layer?
>>
>> Attached is what I did, but I'm really not sure it is the right final
>> approach.
>
> Also not sure if this is going to work out, but it looks like a good start to
> me.
>
> Instead of the user pinning the virtual device so that it never goes away, though, I
> considered a user-tunable "device missing delay" parameter to debounce link
> events. New IOs would be deferred to the requeue_list while the timer is
> active, and then EIO'ed if the timer expires without a successful LIVE
> controller attachment. The use cases I'm considering are short bounces from
> transient link resets, so I'd expect timers to range from a few seconds to maybe a
> minute.
Isn't this equivalent to dm-mpath's queue_if_no_path or no_path_timeout?
We can keep the mpath device around, but if not, what is the desired
behavior from the upper layers?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: nvme/pcie hot plug results in /dev name change
2023-02-13 14:01 ` Sagi Grimberg
@ 2023-02-13 16:32 ` Keith Busch
2023-02-14 0:04 ` Ming Lei
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Keith Busch @ 2023-02-13 16:32 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Christoph Hellwig, Ming Lei, John Meneghini,
linux-nvme@lists.infradead.org
On Mon, Feb 13, 2023 at 04:01:03PM +0200, Sagi Grimberg wrote:
> > Also not sure if this is going to work out, but it looks like a good start to
> > me.
> >
> > Instead of the user pinning the virtual device so that it never goes away, though, I
> > considered a user-tunable "device missing delay" parameter to debounce link
> > events. New IOs would be deferred to the requeue_list while the timer is
> > active, and then EIO'ed if the timer expires without a successful LIVE
> > controller attachment. The use cases I'm considering are short bounces from
> > transient link resets, so I'd expect timers to range from a few seconds to maybe a
> > minute.
>
> Isn't this equivalent to dm-mpath's queue_if_no_path or no_path_timeout?
Similar, but generic to non-multipath devices.
> We can keep the mpath device around, but if not, what is the desired
> behavior from the upper layers?
I don't think we're looking for any behavioral changes in the upper layers.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: nvme/pcie hot plug results in /dev name change
2023-02-13 16:32 ` Keith Busch
@ 2023-02-14 0:04 ` Ming Lei
2023-02-14 9:18 ` Sagi Grimberg
2023-02-14 16:17 ` Keith Busch
2023-02-14 9:04 ` Sagi Grimberg
2023-02-15 6:18 ` Christoph Hellwig
2 siblings, 2 replies; 19+ messages in thread
From: Ming Lei @ 2023-02-14 0:04 UTC (permalink / raw)
To: Keith Busch
Cc: Sagi Grimberg, Christoph Hellwig, John Meneghini,
linux-nvme@lists.infradead.org, ming.lei
On Mon, Feb 13, 2023 at 09:32:35AM -0700, Keith Busch wrote:
> On Mon, Feb 13, 2023 at 04:01:03PM +0200, Sagi Grimberg wrote:
> > > Also not sure if this is going to work out, but it looks like a good start to
> > > me.
> > >
> > > Instead of the user pinning the virtual device so that it never goes away, though, I
> > > considered a user-tunable "device missing delay" parameter to debounce link
> > > events. New IOs would be deferred to the requeue_list while the timer is
> > > active, and then EIO'ed if the timer expires without a successful LIVE
> > > controller attachment. The use cases I'm considering are short bounces from
> > > transient link resets, so I'd expect timers to range from a few seconds to maybe a
> > > minute.
> >
> > Isn't this equivalent to dm-mpath's queue_if_no_path or no_path_timeout?
>
> Similar, but generic to non-multipath devices.
>
> > We can keep the mpath device around, but if not, what is the desired
> > behavior from the upper layers?
>
> I don't think we're looking for any behavioral changes in the upper layers.
That also means that whether or not the nvme mpath layer is added, the upper
layer still has to handle this kind of failure, so what difference does the
added nvme-mpath make?
Thanks,
Ming
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: nvme/pcie hot plug results in /dev name change
2023-02-13 16:32 ` Keith Busch
2023-02-14 0:04 ` Ming Lei
@ 2023-02-14 9:04 ` Sagi Grimberg
2023-02-15 6:18 ` Christoph Hellwig
2 siblings, 0 replies; 19+ messages in thread
From: Sagi Grimberg @ 2023-02-14 9:04 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, Ming Lei, John Meneghini,
linux-nvme@lists.infradead.org
>>> Also not sure if this is going to work out, but it looks like a good start to
>>> me.
>>>
>>> Instead of user pinning the virtual device so that it never goes away though, I
>>> considered a user tunable "device missing delay" parameter to debounce link
>>> events. New IOs would be deferred to the requeue_list while the timer is
>>> active, and then EIO'ed if the timer expires without a successful LIVE
>>> controller attachment. The use cases I'm considering are short bounces from
>>> transient link resets, so I'd expect timers to be from a few seconds to maybe a
>>> minute.
>>
>> Isn't this equivalent to dm-mpath queue_if_no_path or no_path_timeout ?
>
> Similar, but generic to non-multipath devices.
>
>> We can keep the mpath device around, but if not, what is the desired
>> behavior from the upper layers?
>
> I don't think we're looking for any behavioral changes in the upper layers.
I'm looking at what Christoph wrote, that in the long term upper layers
will need to look for replugged devices. Take the simple example of a
disk with a mounted FS going away: today the FS may get an error and go
read-only or fail, needing a manual remount after the disk comes
back.
What exactly is the proposal for change here?
* Re: nvme/pcie hot plug results in /dev name change
2023-02-14 0:04 ` Ming Lei
@ 2023-02-14 9:18 ` Sagi Grimberg
2023-02-15 1:04 ` Ming Lei
2023-02-14 16:17 ` Keith Busch
1 sibling, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2023-02-14 9:18 UTC (permalink / raw)
To: Ming Lei, Keith Busch
Cc: Christoph Hellwig, John Meneghini, linux-nvme@lists.infradead.org
>>>> Also not sure if this is going to work out, but it looks like a good start to
>>>> me.
>>>>
>>>> Instead of user pinning the virtual device so that it never goes away though, I
>>>> considered a user tunable "device missing delay" parameter to debounce link
>>>> events. New IOs would be deferred to the requeue_list while the timer is
>>>> active, and then EIO'ed if the timer expires without a successful LIVE
>>>> controller attachment. The use cases I'm considering are short bounces from
>>>> transient link resets, so I'd expect timers to be from a few seconds to maybe a
>>>> minute.
>>>
>>> Isn't this equivalent to dm-mpath queue_if_no_path or no_path_timeout ?
>>
>> Similar, but generic to non-multipath devices.
>>
>>> We can keep the mpath device around, but if not, what is the desired
>>> behavior from the upper layers?
>>
>> I don't think we're looking for any behavioral changes in the upper layers.
>
> That also means that whether or not the nvme mpath layer is added, the
> upper layer still has to handle this kind of failure, so what difference
> does the added nvme-mpath make?
What do you expect from the upper layer? The device went away...
The suggestion is that when the device comes back the upper layer will
detect it and reconfigure it (restore it to a raid/dm or remount the
FS)?
* Re: nvme/pcie hot plug results in /dev name change
2023-02-14 0:04 ` Ming Lei
2023-02-14 9:18 ` Sagi Grimberg
@ 2023-02-14 16:17 ` Keith Busch
2023-02-15 1:11 ` Ming Lei
1 sibling, 1 reply; 19+ messages in thread
From: Keith Busch @ 2023-02-14 16:17 UTC (permalink / raw)
To: Ming Lei
Cc: Sagi Grimberg, Christoph Hellwig, John Meneghini,
linux-nvme@lists.infradead.org
On Tue, Feb 14, 2023 at 08:04:22AM +0800, Ming Lei wrote:
> On Mon, Feb 13, 2023 at 09:32:35AM -0700, Keith Busch wrote:
> > On Mon, Feb 13, 2023 at 04:01:03PM +0200, Sagi Grimberg wrote:
> > > > Also not sure if this is going to work out, but it looks like a good start to
> > > > me.
> > > >
> > > > Instead of user pinning the virtual device so that it never goes away though, I
> > > > considered a user tunable "device missing delay" parameter to debounce link
> > > > events. New IOs would be deferred to the requeue_list while the timer is
> > > > active, and then EIO'ed if the timer expires without a successful LIVE
> > > > controller attachment. The use cases I'm considering are short bounces from
> > > > transient link resets, so I'd expect timers to be from a few seconds to maybe a
> > > > minute.
> > >
> > > Isn't this equivalent to dm-mpath queue_if_no_path or no_path_timeout ?
> >
> > Similar, but generic to non-multipath devices.
> >
> > > We can keep the mpath device around, but if not, what is the desired
> > > behavior from the upper layers?
> >
> > I don't think we're looking for any behavioral changes in the upper layers.
>
> That also means that whether or not the nvme mpath layer is added, the
> upper layer still has to handle this kind of failure, so what difference
> does the added nvme-mpath make?
I don't understand. There's no difference here for the upper layers. It's just
a different timing from when a disconnect starts failing IO. The upper layers
can keep doing what they're already doing in either case.
* Re: nvme/pcie hot plug results in /dev name change
2023-02-14 9:18 ` Sagi Grimberg
@ 2023-02-15 1:04 ` Ming Lei
0 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2023-02-15 1:04 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Keith Busch, Christoph Hellwig, John Meneghini,
linux-nvme@lists.infradead.org, ming.lei
On Tue, Feb 14, 2023 at 11:18:20AM +0200, Sagi Grimberg wrote:
>
> > > > > Also not sure if this is going to work out, but it looks like a good start to
> > > > > me.
> > > > >
> > > > > Instead of user pinning the virtual device so that it never goes away though, I
> > > > > considered a user tunable "device missing delay" parameter to debounce link
> > > > > events. New IOs would be deferred to the requeue_list while the timer is
> > > > > active, and then EIO'ed if the timer expires without a successful LIVE
> > > > > controller attachment. The use cases I'm considering are short bounces from
> > > > > transient link resets, so I'd expect timers to be from a few seconds to maybe a
> > > > > minute.
> > > >
> > > > Isn't this equivalent to dm-mpath queue_if_no_path or no_path_timeout ?
> > >
> > > Similar, but generic to non-multipath devices.
> > > > We can keep the mpath device around, but if not, what is the desired
> > > > behavior from the upper layers?
> > >
> > > I don't think we're looking for any behavioral changes in the upper layers.
> > That also means that whether or not the nvme mpath layer is added, the
> > upper layer still has to handle this kind of failure, so what difference
> > does the added nvme-mpath make?
>
> What do you expect from the upper layer? The device went away...
From a developer's viewpoint, the FS has to be re-configured
(re-mounted after the new device comes back).
>
> The suggestion is that when the device comes back the upper layer will
> detect it and reconfigure it (restore it to a raid/dm or remount the
> FS)?
I understand userspace is capable of recognizing that the same device has
come back, and userspace can be notified of device add/del events. Why
bother the kernel with that?
Also, even if the mpath device is invented for this purpose, we can't
guarantee that the mpath dev won't go away and come back, so userspace
has to deal with that situation anyway, right?
Also, the issue isn't unique to nvme, so it shouldn't be implemented in
nvme-mpath, assuming agreement is reached to solve it in the kernel.
Thanks,
Ming
* Re: nvme/pcie hot plug results in /dev name change
2023-02-14 16:17 ` Keith Busch
@ 2023-02-15 1:11 ` Ming Lei
0 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2023-02-15 1:11 UTC (permalink / raw)
To: Keith Busch
Cc: Sagi Grimberg, Christoph Hellwig, John Meneghini,
linux-nvme@lists.infradead.org, ming.lei
On Tue, Feb 14, 2023 at 09:17:15AM -0700, Keith Busch wrote:
> On Tue, Feb 14, 2023 at 08:04:22AM +0800, Ming Lei wrote:
> > On Mon, Feb 13, 2023 at 09:32:35AM -0700, Keith Busch wrote:
> > > On Mon, Feb 13, 2023 at 04:01:03PM +0200, Sagi Grimberg wrote:
> > > > > Also not sure if this is going to work out, but it looks like a good start to
> > > > > me.
> > > > >
> > > > > Instead of user pinning the virtual device so that it never goes away though, I
> > > > > considered a user tunable "device missing delay" parameter to debounce link
> > > > > events. New IOs would be deferred to the requeue_list while the timer is
> > > > > active, and then EIO'ed if the timer expires without a successful LIVE
> > > > > controller attachment. The use cases I'm considering are short bounces from
> > > > > transient link resets, so I'd expect timers to be from a few seconds to maybe a
> > > > > minute.
> > > >
> > > > Isn't this equivalent to dm-mpath queue_if_no_path or no_path_timeout ?
> > >
> > > Similar, but generic to non-multipath devices.
> > >
> > > > We can keep the mpath device around, but if not, what is the desired
> > > > behavior from the upper layers?
> > >
> > > I don't think we're looking for any behavioral changes in the upper layers.
> >
> > That also means that whether or not the nvme mpath layer is added, the
> > upper layer still has to handle this kind of failure, so what difference
> > does the added nvme-mpath make?
>
> I don't understand. There's no difference here for the upper layers. It's just
> a different timing from when a disconnect starts failing IO. The upper layers
If that isn't a real disconnect, it is probably better to solve it in
nvme error handling, with device removal done only as a last
resort.
> can keep doing what they're already doing in either case.
Because any device (including the to-be-invented mpath device) may go
away and come back, userspace needs to take measures to cover that if a
persistent device name is required.
Thanks,
Ming
* Re: nvme/pcie hot plug results in /dev name change
2023-02-13 16:32 ` Keith Busch
2023-02-14 0:04 ` Ming Lei
2023-02-14 9:04 ` Sagi Grimberg
@ 2023-02-15 6:18 ` Christoph Hellwig
2023-02-15 8:56 ` Sagi Grimberg
2 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2023-02-15 6:18 UTC (permalink / raw)
To: Keith Busch
Cc: Sagi Grimberg, Christoph Hellwig, Ming Lei, John Meneghini,
linux-nvme@lists.infradead.org
On Mon, Feb 13, 2023 at 09:32:35AM -0700, Keith Busch wrote:
> > Isn't this equivalent to dm-mpath queue_if_no_path or no_path_timeout ?
>
> Similar, but generic to non-multipath devices.
Exactly!
> > We can keep the mpath device around, but if not, what is the desired
> > behavior from the upper layers?
>
> I don't think we're looking for any behavioral changes in the upper layers.
There are really two different things, which would both be a problem:
- make the device not go away if we expect it to come back, by using
the multipath code
- tell the upper layer that a device went away when it actually did
Right now if a user has a block device open, it will keep a reference to
it forever. If we send a notification to the upper layer, not only can
it release that reference, but it also knows the device went away.
This really helps to e.g. rebalance data in a RAID or RAID-like setup.
* Re: nvme/pcie hot plug results in /dev name change
2023-02-15 6:18 ` Christoph Hellwig
@ 2023-02-15 8:56 ` Sagi Grimberg
0 siblings, 0 replies; 19+ messages in thread
From: Sagi Grimberg @ 2023-02-15 8:56 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch
Cc: Ming Lei, John Meneghini, linux-nvme@lists.infradead.org
>>> Isn't this equivalent to dm-mpath queue_if_no_path or no_path_timeout ?
>>
>> Similiar, but generic to non-multipath devices.
>
> Exactly!
>
>>> We can keep the mpath device around, but if not, what is the desired
>>> behavior from the upper layers?
>>
>> I don't think we're looking for any behavioral changes in the upper layers.
>
> There's really two different things, which would both a problem:
>
> - make the device not go away if we expect it to come back by using
> the multipath code
How do you know in the driver what to expect? How can anyone know
what to expect? There is also the question of what to do with the
I/O coming in: queue it or fail it, and when...
> - tell the upper layer that a devie went away when it actually did
Don't we already send a udev event KOBJ_REMOVE when a block device goes
away?
IIRC LVM already removes its stuff if the underlying block device
goes away (although I do recall that the dm did leave the device
handle open)...
> Right now if user has a block device open, it will keep a reference to
> it forever. If we send a notification to the upper layer not only can
> it release that refrence, but it also knows the device went away.
--
KERNEL[520130.230115] remove /devices/virtual/bdi/252:0 (bdi)
KERNEL[520130.230261] remove /devices/virtual/block/nullb0 (block)
--
> This really helps to e.g. rebalance data in a RAID or RAID-like setup.
I thought that RAID already tracked this sort of event...
* Re: nvme/pcie hot plug results in /dev name change
2023-02-01 6:27 ` Christoph Hellwig
2023-02-01 16:31 ` Keith Busch
@ 2023-02-15 22:18 ` Keith Busch
1 sibling, 0 replies; 19+ messages in thread
From: Keith Busch @ 2023-02-15 22:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Ming Lei, John Meneghini, linux-nvme@lists.infradead.org
On Tue, Jan 31, 2023 at 10:27:24PM -0800, Christoph Hellwig wrote:
> On Wed, Feb 01, 2023 at 10:33:15AM +0800, Ming Lei wrote:
> > > The native nvme multipath looks like it could be leveraged to improving that
> > > user experience if we wanted to make that layer an option for non-multipath
> > > devices.
> >
> > Can you share the basic idea? Will nvme mulitpath hold the io error and
> > not propagate to upper layer until new device is probed? What if the
> > new device is probed late, and IO has been timed out and err is returned
> > to upper layer?
>
> Attached is what I did, but I'm really not sure it is the right final
> approach.
Just adding my proof-of-concept attempt to this below. It uses a hard-coded
10-second delay before proceeding with deletion. There may be some unlikely
race conditions with this, but the concept works. I tested mounting a
single-path nvme namespace and quick remove-add sequences while running fio
on the filesystem. Everything proceeded seamlessly with no need to remount.
---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8698410aeb843..faccbd8bbb310 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3974,7 +3974,7 @@ static struct nvme_ns_head *nvme_find_ns_head(struct nvme_ctrl *ctrl,
*/
if (h->ns_id != nsid || !nvme_is_unique_nsid(ctrl, h))
continue;
- if (!list_empty(&h->list) && nvme_tryget_ns_head(h))
+ if (nvme_tryget_ns_head(h))
return h;
}
@@ -4185,7 +4185,8 @@ static int nvme_init_ns_head(struct nvme_ns *ns, struct nvme_ns_info *info)
}
} else {
ret = -EINVAL;
- if (!info->is_shared || !head->shared) {
+ if ((!info->is_shared || !head->shared) &&
+ !list_empty(&head->list)) {
dev_err(ctrl->device,
"Duplicate unshared namespace %d\n",
info->nsid);
@@ -4350,8 +4351,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
static void nvme_ns_remove(struct nvme_ns *ns)
{
- bool last_path = false;
-
if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
return;
@@ -4371,10 +4370,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
mutex_lock(&ns->ctrl->subsys->lock);
list_del_rcu(&ns->siblings);
- if (list_empty(&ns->head->list)) {
- list_del_init(&ns->head->entry);
- last_path = true;
- }
mutex_unlock(&ns->ctrl->subsys->lock);
/* guarantee not available in head->list */
@@ -4388,8 +4383,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
list_del_init(&ns->list);
up_write(&ns->ctrl->namespaces_rwsem);
- if (last_path)
- nvme_mpath_shutdown_disk(ns->head);
+ nvme_mpath_shutdown_disk(ns->head);
nvme_put_ns(ns);
}
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index fc39d01e7b63b..c74d0326b285c 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -387,7 +387,8 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
trace_block_bio_remap(bio, disk_devt(ns->head->disk),
bio->bi_iter.bi_sector);
submit_bio_noacct(bio);
- } else if (nvme_available_path(head)) {
+ } else if (nvme_available_path(head) ||
+ delayed_work_pending(&head->remove_work)) {
dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
spin_lock_irq(&head->requeue_lock);
@@ -505,6 +506,23 @@ static void nvme_requeue_work(struct work_struct *work)
}
}
+static void nvme_remove_head_work(struct work_struct *work)
+{
+ struct nvme_ns_head *head = container_of(to_delayed_work(work),
+ struct nvme_ns_head, remove_work);
+
+ if (!list_empty(&head->list))
+ return;
+
+ list_del_init(&head->entry);
+ kblockd_schedule_work(&head->requeue_work);
+ if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
+ nvme_cdev_del(&head->cdev, &head->cdev_device);
+ del_gendisk(head->disk);
+ }
+ nvme_put_ns_head(head);
+}
+
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
{
bool vwc = false;
@@ -513,14 +531,14 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
bio_list_init(&head->requeue_list);
spin_lock_init(&head->requeue_lock);
INIT_WORK(&head->requeue_work, nvme_requeue_work);
+ INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
/*
* Add a multipath node if the subsystems supports multiple controllers.
* We also do this for private namespaces as the namespace sharing flag
* could change after a rescan.
*/
- if (!(ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL) ||
- !nvme_is_unique_nsid(ctrl, head) || !multipath)
+ if (!nvme_is_unique_nsid(ctrl, head) || !multipath)
return 0;
head->disk = blk_alloc_disk(ctrl->numa_node);
@@ -553,6 +571,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
if (ctrl->vwc & NVME_CTRL_VWC_PRESENT)
vwc = true;
blk_queue_write_cache(head->disk->queue, vwc, vwc);
+ nvme_tryget_ns_head(head);
return 0;
}
@@ -871,13 +890,9 @@ void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
{
- if (!head->disk)
+ if (!list_empty(&head->list) || !head->disk)
return;
- kblockd_schedule_work(&head->requeue_work);
- if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
- nvme_cdev_del(&head->cdev, &head->cdev_device);
- del_gendisk(head->disk);
- }
+ queue_delayed_work(nvme_wq, &head->remove_work, 10 * HZ);
}
void nvme_mpath_remove_disk(struct nvme_ns_head *head)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index bf46f122e9e1e..f6d7d826cd48a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -446,6 +446,9 @@ struct nvme_ns_head {
struct work_struct requeue_work;
struct mutex lock;
unsigned long flags;
+
+ struct delayed_work remove_work;
+
#define NVME_NSHEAD_DISK_LIVE 0
struct nvme_ns __rcu *current_path[];
#endif
--
Thread overview: 19+ messages
2023-01-20 19:50 nvme/pcie hot plug results in /dev name change John Meneghini
2023-01-20 21:42 ` Keith Busch
2023-01-21 7:01 ` Christoph Hellwig
2023-01-29 10:28 ` Ming Lei
2023-01-31 16:38 ` Keith Busch
2023-02-01 2:33 ` Ming Lei
2023-02-01 6:27 ` Christoph Hellwig
2023-02-01 16:31 ` Keith Busch
2023-02-13 14:01 ` Sagi Grimberg
2023-02-13 16:32 ` Keith Busch
2023-02-14 0:04 ` Ming Lei
2023-02-14 9:18 ` Sagi Grimberg
2023-02-15 1:04 ` Ming Lei
2023-02-14 16:17 ` Keith Busch
2023-02-15 1:11 ` Ming Lei
2023-02-14 9:04 ` Sagi Grimberg
2023-02-15 6:18 ` Christoph Hellwig
2023-02-15 8:56 ` Sagi Grimberg
2023-02-15 22:18 ` Keith Busch