public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCHv2 0/3] improve NVMe multipath handling
@ 2025-04-25 10:33 Nilay Shroff
  2025-04-25 10:33 ` [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-04-25 10:33 UTC (permalink / raw)
  To: linux-nvme, linux-block
  Cc: hch, kbusch, hare, sagi, jmeneghi, axboe, martin.petersen, gjoyce

Hi,

This patch series introduces improvements to NVMe multipath handling by
refining the removal behavior of the multipath head node and simplifying
configuration options. The idea/POC for this change was originally
proposed by Christoph[1] and Keith[2]. I built upon their original
idea/POC to implement this series.

The first patch in the series addresses an issue where the multipath
head node of a PCIe NVMe disk is removed immediately when all disk paths
are lost. This can cause problems in scenarios such as:
- Hot removal and re-addition of a disk.
- Transient PCIe link failures that trigger re-enumeration,
  briefly removing and restoring the disk.

In such cases, premature removal of the head node may result in a device
node name change, requiring applications to reopen device handles if
they were performing I/O during the failure. To mitigate this, we
introduce a delayed removal mechanism. Instead of removing the head node
immediately, the system waits for a configurable timeout, allowing the
disk to recover. If the disk comes back online within this window, the
head node remains unchanged, ensuring uninterrupted workloads.

A new sysfs attribute, delayed_removal_secs, allows users to configure
this timeout. By default, it is set to 0 seconds, preserving the
existing behavior unless explicitly changed.
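
For example, the timeout could be configured like this (a hypothetical
sketch; the head node name nvme0n1 is illustrative and the attribute is
only visible on non-fabric head nodes):

```shell
# Allow up to 60 seconds for the disk behind head node nvme0n1 to
# recover before the head node is removed (device name is an example):
echo 60 > /sys/block/nvme0n1/delayed_removal_secs

# Read back the current setting; 0 (the default) keeps the old behavior:
cat /sys/block/nvme0n1/delayed_removal_secs
```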

The second patch in the series introduces the multipath_head_always
module param. When this option is set, it forces the creation of a
multipath head disk node even for single-ported NVMe disks or private
namespaces, and thus allows delayed head node removal. This helps handle
transient PCIe link failures transparently even for a single-ported NVMe
disk or a private namespace.
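
A sketch of how the parameter might be enabled (the module and parameter
names are those introduced by this series; the config file path is
illustrative):

```shell
# At module load time:
modprobe nvme_core multipath_head_always=Y

# Or persistently, via a modprobe.d drop-in:
echo "options nvme_core multipath_head_always=Y" > /etc/modprobe.d/nvme.conf

# Or on the kernel command line (for built-in nvme_core):
#   nvme_core.multipath_head_always=Y
```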

The third patch in the series makes no functional changes; it merely
renames a few functions to improve code readability and better align
their names with their actual roles.

These changes should help improve NVMe multipath reliability and simplify
configuration. Feedback and testing are welcome!

[1] https://lore.kernel.org/linux-nvme/Y9oGTKCFlOscbPc2@infradead.org/
[2] https://lore.kernel.org/linux-nvme/Y+1aKcQgbskA2tra@kbusch-mbp.dhcp.thefacebook.com/

Changes from v1:
    - Renamed delayed_shutdown_sec to delayed_removal_secs, as "shutdown"
      has a special meaning when used with an NVMe device (Martin Petersen)
    - Instead of always adding the mpath head disk node by default, added a
      new module option, nvme_core.multipath_head_always, which when set
      creates the mpath head disk node (even for a private namespace or a
      namespace backed by a single-ported NVMe disk). This way we can
      preserve the old default behavior. (hch)
    - Renamed the nvme_mpath_shutdown_disk function because, in the NVMe
      context, the term "shutdown" has a specific technical meaning. (hch)
    - Dropped the change that removed the multipath module param, as this
      param is still useful and used for many different things.

Link to v1: https://lore.kernel.org/all/20250321063901.747605-1-nilay@linux.ibm.com/

Nilay Shroff (3):
  nvme-multipath: introduce delayed removal of the multipath head node
  nvme: introduce multipath_head_always module param
  nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk

 drivers/nvme/host/core.c      |  18 ++--
 drivers/nvme/host/multipath.c | 193 ++++++++++++++++++++++++++++++----
 drivers/nvme/host/nvme.h      |  20 +++-
 drivers/nvme/host/sysfs.c     |  13 +++
 4 files changed, 211 insertions(+), 33 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node
  2025-04-25 10:33 [RFC PATCHv2 0/3] improve NVMe multipath handling Nilay Shroff
@ 2025-04-25 10:33 ` Nilay Shroff
  2025-04-25 14:43   ` Christoph Hellwig
  2025-04-25 22:26   ` Sagi Grimberg
  2025-04-25 10:33 ` [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param Nilay Shroff
  2025-04-25 10:33 ` [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk Nilay Shroff
  2 siblings, 2 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-04-25 10:33 UTC (permalink / raw)
  To: linux-nvme, linux-block
  Cc: hch, kbusch, hare, sagi, jmeneghi, axboe, martin.petersen, gjoyce

Currently, the multipath head node of a PCIe NVMe disk is removed
immediately as soon as all paths of the disk are removed. However,
this can cause issues in scenarios where:

- Disk hot-removal followed by re-addition.
- Transient PCIe link failures that trigger re-enumeration,
  temporarily removing and then restoring the disk.

In these cases, removing the head node prematurely may lead to a head
disk node name change upon re-addition, requiring applications to
reopen their handles if they were performing I/O during the failure.

To address this, introduce a delayed removal mechanism for the head
disk node. During a transient failure, instead of removing the head
disk node immediately, the system waits for a configurable timeout,
allowing the disk to recover.

During a transient disk failure, if the application sends any I/O, we
queue it instead of failing it immediately. If the disk comes back
online within the timeout, the queued I/Os are resubmitted to the disk,
ensuring seamless operation. If the disk cannot recover from the
failure, the queued I/Os are failed to completion and the application
receives the error.

This way, if the disk comes back online within the configured period,
the head node remains unchanged, ensuring uninterrupted workloads
without requiring applications to reopen device handles.

A new sysfs attribute named "delayed_removal_secs" is added under the
head disk block device for users who wish to configure the delayed
removal timeout for the head disk node. The default value of this
attribute is zero seconds, ensuring no behavior change unless
explicitly configured.

Link: https://lore.kernel.org/linux-nvme/Y9oGTKCFlOscbPc2@infradead.org/
Link: https://lore.kernel.org/linux-nvme/Y+1aKcQgbskA2tra@kbusch-mbp.dhcp.thefacebook.com/
Suggested-by: Keith Busch <kbusch@kernel.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
[nilay: reworked based on the original idea/POC from Christoph and Keith]
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/core.c      |  16 ++---
 drivers/nvme/host/multipath.c | 119 +++++++++++++++++++++++++++++++---
 drivers/nvme/host/nvme.h      |  12 ++++
 drivers/nvme/host/sysfs.c     |  13 ++++
 4 files changed, 141 insertions(+), 19 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index eb6ea8acb3cc..5755069e6974 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3560,7 +3560,7 @@ static struct nvme_ns_head *nvme_find_ns_head(struct nvme_ctrl *ctrl,
 		 */
 		if (h->ns_id != nsid || !nvme_is_unique_nsid(ctrl, h))
 			continue;
-		if (!list_empty(&h->list) && nvme_tryget_ns_head(h))
+		if (nvme_tryget_ns_head(h))
 			return h;
 	}
 
@@ -3688,6 +3688,8 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 	ratelimit_state_init(&head->rs_nuse, 5 * HZ, 1);
 	ratelimit_set_flags(&head->rs_nuse, RATELIMIT_MSG_ON_RELEASE);
 	kref_init(&head->ref);
+	if (ctrl->opts)
+		nvme_mpath_set_fabrics(head);
 
 	if (head->ids.csi) {
 		ret = nvme_get_effects_log(ctrl, head->ids.csi, &head->effects);
@@ -3804,7 +3806,8 @@ static int nvme_init_ns_head(struct nvme_ns *ns, struct nvme_ns_info *info)
 		}
 	} else {
 		ret = -EINVAL;
-		if (!info->is_shared || !head->shared) {
+		if ((!info->is_shared || !head->shared) &&
+		    !list_empty(&head->list)) {
 			dev_err(ctrl->device,
 				"Duplicate unshared namespace %d\n",
 				info->nsid);
@@ -3986,8 +3989,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
 
 static void nvme_ns_remove(struct nvme_ns *ns)
 {
-	bool last_path = false;
-
 	if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
 		return;
 
@@ -4007,10 +4008,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 
 	mutex_lock(&ns->ctrl->subsys->lock);
 	list_del_rcu(&ns->siblings);
-	if (list_empty(&ns->head->list)) {
-		list_del_init(&ns->head->entry);
-		last_path = true;
-	}
 	mutex_unlock(&ns->ctrl->subsys->lock);
 
 	/* guarantee not available in head->list */
@@ -4028,8 +4025,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	mutex_unlock(&ns->ctrl->namespaces_lock);
 	synchronize_srcu(&ns->ctrl->srcu);
 
-	if (last_path)
-		nvme_mpath_shutdown_disk(ns->head);
+	nvme_mpath_shutdown_disk(ns->head);
 	nvme_put_ns(ns);
 }
 
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 250f3da67cc9..68318337c275 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -442,6 +442,23 @@ static bool nvme_available_path(struct nvme_ns_head *head)
 			break;
 		}
 	}
+
+	/*
+	 * If "head->delayed_removal_secs" is configured (i.e., non-zero), do
+	 * not immediately fail I/O. Instead, requeue the I/O for the configured
+	 * duration, anticipating that if there's a transient link failure then
+	 * it may recover within this time window. This parameter is exported to
+	 * userspace via sysfs, and its default value is zero.
+	 * Note: This option is not exposed to the users for NVMeoF connections.
+	 * In fabric-based setups, fabric link failure is managed through other
+	 * parameters such as "reconnect_delay" and "max_reconnects", which
+	 * are handled at the respective fabric driver layer. Therefore,
+	 * "head->delayed_removal_secs" is intended exclusively for
+	 * non-fabric (e.g., PCIe) multipath configurations.
+	 */
+	if (head->delayed_removal_secs)
+		return true;
+
 	return false;
 }
 
@@ -617,6 +634,40 @@ static void nvme_requeue_work(struct work_struct *work)
 	}
 }
 
+static void nvme_remove_head(struct nvme_ns_head *head)
+{
+	if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
+		/*
+		 * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
+		 * to allow multipath to fail all I/O.
+		 */
+		kblockd_schedule_work(&head->requeue_work);
+
+		nvme_cdev_del(&head->cdev, &head->cdev_device);
+		synchronize_srcu(&head->srcu);
+		del_gendisk(head->disk);
+		nvme_put_ns_head(head);
+	}
+}
+
+static void nvme_remove_head_work(struct work_struct *work)
+{
+	struct nvme_ns_head *head = container_of(to_delayed_work(work),
+			struct nvme_ns_head, remove_work);
+	bool shutdown = false;
+
+	mutex_lock(&head->subsys->lock);
+	if (list_empty(&head->list)) {
+		list_del_init(&head->entry);
+		shutdown = true;
+	}
+	mutex_unlock(&head->subsys->lock);
+	if (shutdown)
+		nvme_remove_head(head);
+
+	module_put(THIS_MODULE);
+}
+
 int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 {
 	struct queue_limits lim;
@@ -626,6 +677,8 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 	spin_lock_init(&head->requeue_lock);
 	INIT_WORK(&head->requeue_work, nvme_requeue_work);
 	INIT_WORK(&head->partition_scan_work, nvme_partition_scan_work);
+	INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
+	head->delayed_removal_secs = 0;
 
 	/*
 	 * Add a multipath node if the subsystems supports multiple controllers.
@@ -659,6 +712,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 	set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state);
 	sprintf(head->disk->disk_name, "nvme%dn%d",
 			ctrl->subsys->instance, head->instance);
+	nvme_tryget_ns_head(head);
 	return 0;
 }
 
@@ -1015,6 +1069,40 @@ static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr
 }
 DEVICE_ATTR_RO(numa_nodes);
 
+static ssize_t delayed_removal_secs_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+	struct nvme_ns_head *head = disk->private_data;
+	int ret;
+
+	mutex_lock(&head->subsys->lock);
+	ret = sysfs_emit(buf, "%u\n", head->delayed_removal_secs);
+	mutex_unlock(&head->subsys->lock);
+	return ret;
+}
+
+static ssize_t delayed_removal_secs_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+	struct nvme_ns_head *head = disk->private_data;
+	unsigned int sec;
+	int ret;
+
+	ret = kstrtouint(buf, 0, &sec);
+	if (ret < 0)
+		return ret;
+
+	mutex_lock(&head->subsys->lock);
+	head->delayed_removal_secs = sec;
+	mutex_unlock(&head->subsys->lock);
+
+	return count;
+}
+
+DEVICE_ATTR_RW(delayed_removal_secs);
+
 static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
 		struct nvme_ana_group_desc *desc, void *data)
 {
@@ -1138,18 +1226,31 @@ void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
 
 void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
 {
-	if (!head->disk)
-		return;
-	if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
-		nvme_cdev_del(&head->cdev, &head->cdev_device);
+	bool shutdown = false;
+
+	mutex_lock(&head->subsys->lock);
+
+	if (!list_empty(&head->list))
+		goto out;
+
+	if (head->delayed_removal_secs) {
 		/*
-		 * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
-		 * to allow multipath to fail all I/O.
+		 * Ensure that no one could remove this module while the head
+		 * remove work is pending.
 		 */
-		synchronize_srcu(&head->srcu);
-		kblockd_schedule_work(&head->requeue_work);
-		del_gendisk(head->disk);
+		if (!try_module_get(THIS_MODULE))
+			goto out;
+		queue_delayed_work(nvme_wq, &head->remove_work,
+				head->delayed_removal_secs * HZ);
+	} else {
+		list_del_init(&head->entry);
+		if (head->disk)
+			shutdown = true;
 	}
+out:
+	mutex_unlock(&head->subsys->lock);
+	if (shutdown)
+		nvme_remove_head(head);
 }
 
 void nvme_mpath_remove_disk(struct nvme_ns_head *head)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 51e078642127..74cd569882ce 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -503,7 +503,10 @@ struct nvme_ns_head {
 	struct work_struct	partition_scan_work;
 	struct mutex		lock;
 	unsigned long		flags;
+	struct delayed_work	remove_work;
+	unsigned int		delayed_removal_secs;
 #define NVME_NSHEAD_DISK_LIVE	0
+#define NVME_NSHEAD_FABRICS	1
 	struct nvme_ns __rcu	*current_path[];
 #endif
 };
@@ -986,12 +989,17 @@ extern struct device_attribute dev_attr_ana_grpid;
 extern struct device_attribute dev_attr_ana_state;
 extern struct device_attribute dev_attr_queue_depth;
 extern struct device_attribute dev_attr_numa_nodes;
+extern struct device_attribute dev_attr_delayed_removal_secs;
 extern struct device_attribute subsys_attr_iopolicy;
 
 static inline bool nvme_disk_is_ns_head(struct gendisk *disk)
 {
 	return disk->fops == &nvme_ns_head_ops;
 }
+static inline void nvme_mpath_set_fabrics(struct nvme_ns_head *head)
+{
+	set_bit(NVME_NSHEAD_FABRICS, &head->flags);
+}
 #else
 #define multipath false
 static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
@@ -1078,6 +1086,10 @@ static inline void nvme_mpath_end_request(struct request *rq)
 static inline bool nvme_disk_is_ns_head(struct gendisk *disk)
 {
 	return false;
+}
+static inline void nvme_mpath_set_fabrics(struct nvme_ns_head *head)
+{
+
 }
 #endif /* CONFIG_NVME_MULTIPATH */
 
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 6d31226f7a4f..51633670d177 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -260,6 +260,7 @@ static struct attribute *nvme_ns_attrs[] = {
 	&dev_attr_ana_state.attr,
 	&dev_attr_queue_depth.attr,
 	&dev_attr_numa_nodes.attr,
+	&dev_attr_delayed_removal_secs.attr,
 #endif
 	&dev_attr_io_passthru_err_log_enabled.attr,
 	NULL,
@@ -296,6 +297,18 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
 		if (nvme_disk_is_ns_head(dev_to_disk(dev)))
 			return 0;
 	}
+	if (a == &dev_attr_delayed_removal_secs.attr) {
+		struct nvme_ns_head *head = dev_to_ns_head(dev);
+		struct gendisk *disk = dev_to_disk(dev);
+
+		/*
+		 * This attribute is only valid for the head node and for
+		 * non-fabric setups.
+		 */
+		if (!nvme_disk_is_ns_head(disk) ||
+				test_bit(NVME_NSHEAD_FABRICS, &head->flags))
+			return 0;
+	}
 #endif
 	return a->mode;
 }
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
  2025-04-25 10:33 [RFC PATCHv2 0/3] improve NVMe multipath handling Nilay Shroff
  2025-04-25 10:33 ` [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
@ 2025-04-25 10:33 ` Nilay Shroff
  2025-04-25 14:45   ` Christoph Hellwig
  2025-04-28  6:57   ` Hannes Reinecke
  2025-04-25 10:33 ` [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk Nilay Shroff
  2 siblings, 2 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-04-25 10:33 UTC (permalink / raw)
  To: linux-nvme, linux-block
  Cc: hch, kbusch, hare, sagi, jmeneghi, axboe, martin.petersen, gjoyce

Currently, a multipath head disk node is not created for single-ported
NVMe adapters or private namespaces. However, creating a head node in
these cases can help transparently handle transient PCIe link failures.
Without a head node, features like delayed removal cannot be leveraged,
making it difficult to tolerate such link failures. To address this,
this commit introduces nvme_core module parameter multipath_head_always.

When this param is set to true, it forces the creation of a multipath
head node regardless NVMe disk or namespace type. So this option allows
the use of delayed removal of head node functionality even for single-
ported NVMe disks and private namespaces and thus helps transparently
handling transient PCIe link failures.

By default multipath_head_always is set to false, thus preserving the
existing behavior. Setting it to true enables improved fault tolerance
in PCIe setups. Moreover, please note that enabling this option would
also implicitly enable nvme_core.multipath.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++----
 1 file changed, 63 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 68318337c275..1acdbbddfe01 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -10,10 +10,59 @@
 #include "nvme.h"
 
 bool multipath = true;
-module_param(multipath, bool, 0444);
+bool multipath_head_always;	/* default is false */
+
+static int multipath_param_set(const char *val, const struct kernel_param *kp)
+{
+	int ret;
+
+	ret = param_set_bool(val, kp);
+	if (ret)
+		return ret;
+
+	if (multipath_head_always && !*(bool *)kp->arg) {
+		pr_err("Can't disable multipath when multipath_head_always is configured.\n");
+		*(bool *)kp->arg = true;
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static const struct kernel_param_ops multipath_param_ops = {
+	.set = multipath_param_set,
+	.get = param_get_bool,
+};
+
+module_param_cb(multipath, &multipath_param_ops, &multipath, 0444);
 MODULE_PARM_DESC(multipath,
 	"turn on native support for multiple controllers per subsystem");
 
+static int multipath_head_always_set(const char *val,
+		const struct kernel_param *kp)
+{
+	int ret;
+
+	ret = param_set_bool(val, kp);
+	if (ret < 0)
+		return ret;
+
+	if (*(bool *)kp->arg)
+		multipath = true;
+
+	return 0;
+}
+
+static const struct kernel_param_ops multipath_head_always_param_ops = {
+	.set = multipath_head_always_set,
+	.get = param_get_bool,
+};
+
+module_param_cb(multipath_head_always, &multipath_head_always_param_ops,
+		&multipath_head_always, 0444);
+MODULE_PARM_DESC(multipath_head_always,
+	"create multipath head node always; note that this also implicitly enables native multipath support");
+
 static const char *nvme_iopolicy_names[] = {
 	[NVME_IOPOLICY_NUMA]	= "numa",
 	[NVME_IOPOLICY_RR]	= "round-robin",
@@ -681,13 +730,20 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 	head->delayed_removal_secs = 0;
 
 	/*
-	 * Add a multipath node if the subsystems supports multiple controllers.
-	 * We also do this for private namespaces as the namespace sharing flag
-	 * could change after a rescan.
+	 * If multipath_head_always is configured then we add a multipath head
+	 * disk node irrespective of whether the disk is single/multi ported
+	 * or the namespace is shared/private.
 	 */
-	if (!(ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL) ||
-	    !nvme_is_unique_nsid(ctrl, head) || !multipath)
-		return 0;
+	if (!multipath_head_always) {
+		/*
+		 * Add a multipath node if the subsystems supports multiple
+		 * controllers. We also do this for private namespaces as the
+		 * namespace sharing flag could change after a rescan.
+		 */
+		if (!(ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL) ||
+		    !nvme_is_unique_nsid(ctrl, head) || !multipath)
+			return 0;
+	}
 
 	blk_set_stacking_limits(&lim);
 	lim.dma_alignment = 3;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  2025-04-25 10:33 [RFC PATCHv2 0/3] improve NVMe multipath handling Nilay Shroff
  2025-04-25 10:33 ` [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
  2025-04-25 10:33 ` [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param Nilay Shroff
@ 2025-04-25 10:33 ` Nilay Shroff
  2025-04-25 14:46   ` Christoph Hellwig
  2025-04-25 22:27   ` Sagi Grimberg
  2 siblings, 2 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-04-25 10:33 UTC (permalink / raw)
  To: linux-nvme, linux-block
  Cc: hch, kbusch, hare, sagi, jmeneghi, axboe, martin.petersen, gjoyce

In the NVMe context, the term "shutdown" has a specific technical
meaning. To avoid confusion, this commit renames the
nvme_mpath_shutdown_disk function to nvme_mpath_remove_disk to better
reflect its purpose (i.e. removing the disk from the system). However,
nvme_mpath_remove_disk was already in use, and its functionality is
related to releasing or putting the head node disk. To resolve this
naming conflict and improve clarity, the existing
nvme_mpath_remove_disk function is also renamed to nvme_mpath_put_disk.

This renaming improves code readability and better aligns function
names with their actual roles.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/core.c      |  4 ++--
 drivers/nvme/host/multipath.c | 16 ++++++++--------
 drivers/nvme/host/nvme.h      |  8 ++++----
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5755069e6974..306ac36eb812 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -664,7 +664,7 @@ static void nvme_free_ns_head(struct kref *ref)
 	struct nvme_ns_head *head =
 		container_of(ref, struct nvme_ns_head, ref);
 
-	nvme_mpath_remove_disk(head);
+	nvme_mpath_put_disk(head);
 	ida_free(&head->subsys->ns_ida, head->instance);
 	cleanup_srcu_struct(&head->srcu);
 	nvme_put_subsystem(head->subsys);
@@ -4025,7 +4025,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	mutex_unlock(&ns->ctrl->namespaces_lock);
 	synchronize_srcu(&ns->ctrl->srcu);
 
-	nvme_mpath_shutdown_disk(ns->head);
+	nvme_mpath_remove_disk(ns->head);
 	nvme_put_ns(ns);
 }
 
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 1acdbbddfe01..a392b4825dc3 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -703,15 +703,15 @@ static void nvme_remove_head_work(struct work_struct *work)
 {
 	struct nvme_ns_head *head = container_of(to_delayed_work(work),
 			struct nvme_ns_head, remove_work);
-	bool shutdown = false;
+	bool remove = false;
 
 	mutex_lock(&head->subsys->lock);
 	if (list_empty(&head->list)) {
 		list_del_init(&head->entry);
-		shutdown = true;
+		remove = true;
 	}
 	mutex_unlock(&head->subsys->lock);
-	if (shutdown)
+	if (remove)
 		nvme_remove_head(head);
 
 	module_put(THIS_MODULE);
@@ -1280,9 +1280,9 @@ void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
 #endif
 }
 
-void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
+void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 {
-	bool shutdown = false;
+	bool remove = false;
 
 	mutex_lock(&head->subsys->lock);
 
@@ -1301,15 +1301,15 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
 	} else {
 		list_del_init(&head->entry);
 		if (head->disk)
-			shutdown = true;
+			remove = true;
 	}
 out:
 	mutex_unlock(&head->subsys->lock);
-	if (shutdown)
+	if (remove)
 		nvme_remove_head(head);
 }
 
-void nvme_mpath_remove_disk(struct nvme_ns_head *head)
+void nvme_mpath_put_disk(struct nvme_ns_head *head)
 {
 	if (!head->disk)
 		return;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 74cd569882ce..591df076522d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -963,7 +963,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
 void nvme_mpath_add_sysfs_link(struct nvme_ns_head *ns);
 void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns);
 void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid);
-void nvme_mpath_remove_disk(struct nvme_ns_head *head);
+void nvme_mpath_put_disk(struct nvme_ns_head *head);
 int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id);
 void nvme_mpath_init_ctrl(struct nvme_ctrl *ctrl);
 void nvme_mpath_update(struct nvme_ctrl *ctrl);
@@ -972,7 +972,7 @@ void nvme_mpath_stop(struct nvme_ctrl *ctrl);
 bool nvme_mpath_clear_current_path(struct nvme_ns *ns);
 void nvme_mpath_revalidate_paths(struct nvme_ns *ns);
 void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
-void nvme_mpath_shutdown_disk(struct nvme_ns_head *head);
+void nvme_mpath_remove_disk(struct nvme_ns_head *head);
 void nvme_mpath_start_request(struct request *rq);
 void nvme_mpath_end_request(struct request *rq);
 
@@ -1020,7 +1020,7 @@ static inline int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,
 static inline void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
 {
 }
-static inline void nvme_mpath_remove_disk(struct nvme_ns_head *head)
+static inline void nvme_mpath_put_disk(struct nvme_ns_head *head)
 {
 }
 static inline void nvme_mpath_add_sysfs_link(struct nvme_ns *ns)
@@ -1039,7 +1039,7 @@ static inline void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
 static inline void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
 {
 }
-static inline void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
+static inline void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 {
 }
 static inline void nvme_trace_bio_complete(struct request *req)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node
  2025-04-25 10:33 ` [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
@ 2025-04-25 14:43   ` Christoph Hellwig
  2025-04-28  7:05     ` Nilay Shroff
  2025-04-25 22:26   ` Sagi Grimberg
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:43 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: linux-nvme, linux-block, hch, kbusch, hare, sagi, jmeneghi, axboe,
	martin.petersen, gjoyce

On Fri, Apr 25, 2025 at 04:03:08PM +0530, Nilay Shroff wrote:
> @@ -4007,10 +4008,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>  
>  	mutex_lock(&ns->ctrl->subsys->lock);
>  	list_del_rcu(&ns->siblings);
> -	if (list_empty(&ns->head->list)) {
> -		list_del_init(&ns->head->entry);
> -		last_path = true;
> -	}
>  	mutex_unlock(&ns->ctrl->subsys->lock);
>  
>  	/* guarantee not available in head->list */
> @@ -4028,8 +4025,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>  	mutex_unlock(&ns->ctrl->namespaces_lock);
>  	synchronize_srcu(&ns->ctrl->srcu);
>  
> -	if (last_path)
> -		nvme_mpath_shutdown_disk(ns->head);
> +	nvme_mpath_shutdown_disk(ns->head);

This now removes the head list deletion from the first critical
section into nvme_mpath_shutdown_disk.  I remember we had to do it
the way it currently is done because we were running into issues
otherwise, the commit history might help a bit with what the issues
were.

> +	if (a == &dev_attr_delayed_removal_secs.attr) {
> +		struct nvme_ns_head *head = dev_to_ns_head(dev);
> +		struct gendisk *disk = dev_to_disk(dev);
> +
> +		/*
> +		 * This attribute is only valid for head node and non-fabric
> +		 * setup.
> +		 */
> +		if (!nvme_disk_is_ns_head(disk) ||
> +				test_bit(NVME_NSHEAD_FABRICS, &head->flags))
> +			return 0;
> +	}

Didn't we say we also want to allow fabrics to use it if explicitly
configured?


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
  2025-04-25 10:33 ` [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param Nilay Shroff
@ 2025-04-25 14:45   ` Christoph Hellwig
  2025-04-29  6:26     ` Nilay Shroff
  2025-04-28  6:57   ` Hannes Reinecke
  1 sibling, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:45 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: linux-nvme, linux-block, hch, kbusch, hare, sagi, jmeneghi, axboe,
	martin.petersen, gjoyce

On Fri, Apr 25, 2025 at 04:03:09PM +0530, Nilay Shroff wrote:
> +static int multipath_param_set(const char *val, const struct kernel_param *kp)
> +{
> +	int ret;
> +
> +	ret = param_set_bool(val, kp);
> +	if (ret)
> +		return ret;
> +
> +	if (multipath_head_always && !*(bool *)kp->arg) {
> +		pr_err("Can't disable multipath when multipath_head_always is configured.\n");
> +		*(bool *)kp->arg = true;

This reads much nicer if you add a local bool * variable and avoid
all the casting, i.e.

	bool *arg = kp->arg;

	...

	if (multipath_head_always && !*arg) {
		pr_err("Can't disable multipath when multipath_head_always is configured.\n");
		*arg = true;

> +static int multipath_head_always_set(const char *val,
> +		const struct kernel_param *kp)
> +{
> +	int ret;
> +
> +	ret = param_set_bool(val, kp);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (*(bool *)kp->arg)
> +		multipath = true;

Same here.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  2025-04-25 10:33 ` [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk Nilay Shroff
@ 2025-04-25 14:46   ` Christoph Hellwig
  2025-04-25 22:27   ` Sagi Grimberg
  1 sibling, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:46 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: linux-nvme, linux-block, hch, kbusch, hare, sagi, jmeneghi, axboe,
	martin.petersen, gjoyce

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node
  2025-04-25 10:33 ` [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
  2025-04-25 14:43   ` Christoph Hellwig
@ 2025-04-25 22:26   ` Sagi Grimberg
  2025-04-28  7:39     ` Nilay Shroff
  1 sibling, 1 reply; 18+ messages in thread
From: Sagi Grimberg @ 2025-04-25 22:26 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme, linux-block
  Cc: hch, kbusch, hare, jmeneghi, axboe, martin.petersen, gjoyce

>   
> @@ -3688,6 +3688,8 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
>   	ratelimit_state_init(&head->rs_nuse, 5 * HZ, 1);
>   	ratelimit_set_flags(&head->rs_nuse, RATELIMIT_MSG_ON_RELEASE);
>   	kref_init(&head->ref);
> +	if (ctrl->opts)

While this check is correct, perhaps checking ctrl->ops->flags &
NVME_F_FABRICS would be clearer.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  2025-04-25 10:33 ` [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk Nilay Shroff
  2025-04-25 14:46   ` Christoph Hellwig
@ 2025-04-25 22:27   ` Sagi Grimberg
  1 sibling, 0 replies; 18+ messages in thread
From: Sagi Grimberg @ 2025-04-25 22:27 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme, linux-block
  Cc: hch, kbusch, hare, jmeneghi, axboe, martin.petersen, gjoyce

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
  2025-04-25 10:33 ` [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param Nilay Shroff
  2025-04-25 14:45   ` Christoph Hellwig
@ 2025-04-28  6:57   ` Hannes Reinecke
  2025-04-28  7:39     ` Nilay Shroff
  1 sibling, 1 reply; 18+ messages in thread
From: Hannes Reinecke @ 2025-04-28  6:57 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme, linux-block
  Cc: hch, kbusch, sagi, jmeneghi, axboe, martin.petersen, gjoyce

On 4/25/25 12:33, Nilay Shroff wrote:
> Currently, a multipath head disk node is not created for single-ported
> NVMe adapters or private namespaces. However, creating a head node in
> these cases can help transparently handle transient PCIe link failures.
> Without a head node, features like delayed removal cannot be leveraged,
> making it difficult to tolerate such link failures. To address this,
> this commit introduces nvme_core module parameter multipath_head_always.
> 
> When this param is set to true, it forces the creation of a multipath
> head node regardless NVMe disk or namespace type. So this option allows
> the use of delayed removal of head node functionality even for single-
> ported NVMe disks and private namespaces and thus helps transparently
> handling transient PCIe link failures.
> 
> By default multipath_head_always is set to false, thus preserving the
> existing behavior. Setting it to true enables improved fault tolerance
> in PCIe setups. Moreover, please note that enabling this option would
> also implicitly enable nvme_core.multipath.
> 
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
>   drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++----
>   1 file changed, 63 insertions(+), 7 deletions(-)
> 
I really would model this according to dm-multipath where we have the
'fail_if_no_path' flag.
This can be set for PCIe devices to retain the current behaviour
(which we need for things like 'md' on top of NVMe) whenever
this flag is set.

And it might be an idea to rename this flag to 'multipath_always_on',
so 'multipath_head_always' might be confusing for people not familiar
with the internal layout of the nvme multipath driver.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node
  2025-04-25 14:43   ` Christoph Hellwig
@ 2025-04-28  7:05     ` Nilay Shroff
  0 siblings, 0 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-04-28  7:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, linux-block, kbusch, hare, sagi, jmeneghi, axboe,
	martin.petersen, gjoyce



On 4/25/25 8:13 PM, Christoph Hellwig wrote:
> On Fri, Apr 25, 2025 at 04:03:08PM +0530, Nilay Shroff wrote:
>> @@ -4007,10 +4008,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>  
>>  	mutex_lock(&ns->ctrl->subsys->lock);
>>  	list_del_rcu(&ns->siblings);
>> -	if (list_empty(&ns->head->list)) {
>> -		list_del_init(&ns->head->entry);
>> -		last_path = true;
>> -	}
>>  	mutex_unlock(&ns->ctrl->subsys->lock);
>>  
>>  	/* guarantee not available in head->list */
>> @@ -4028,8 +4025,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>  	mutex_unlock(&ns->ctrl->namespaces_lock);
>>  	synchronize_srcu(&ns->ctrl->srcu);
>>  
>> -	if (last_path)
>> -		nvme_mpath_shutdown_disk(ns->head);
>> +	nvme_mpath_shutdown_disk(ns->head);
> 
> This now removes the head list deletion from the first critical
> section into nvme_mpath_shutdown_disk.  I remember we had to do it
> the way it currently is done because we were running into issues
> otherwise, the commit history might help a bit with what the issues
> were.
> 
Thank you for the pointers! Yes, I checked the commit history and found
commit 9edceaf43050 ("nvme: avoid race in shutdown namespace removal"), which
avoids the race between nvme_find_ns_head and nvme_ns_remove. Looking at
the commit history, it seems we also need to handle that case here. I think if
the user doesn't configure delayed removal of the head node, the existing
behavior should be preserved to avoid the above-mentioned race. However, if the
user configures delayed head node removal, we would delay the head->entry
removal until the delayed removal timeout elapses. I'll handle this case in the
next patch.
 
>> +	if (a == &dev_attr_delayed_removal_secs.attr) {
>> +		struct nvme_ns_head *head = dev_to_ns_head(dev);
>> +		struct gendisk *disk = dev_to_disk(dev);
>> +
>> +		/*
>> +		 * This attribute is only valid for head node and non-fabric
>> +		 * setup.
>> +		 */
>> +		if (!nvme_disk_is_ns_head(disk) ||
>> +				test_bit(NVME_NSHEAD_FABRICS, &head->flags))
>> +			return 0;
>> +	}
> 
> Didn't we say we also want to allow fabrics to use it if explicitly
> configured?
> 
Hmm, I think there's some confusion here. I thought we discussed that for fabric
setups, fabric link failures are managed through other parameters such as
"reconnect_delay" and "max_reconnects", which are handled at the respective
fabric driver layer. Therefore, head->delayed_removal_secs is intended
exclusively for non-fabric (e.g., PCIe) multipath configurations.

Even if we were to support delayed head node removal for fabric setups, I
wonder who would handle the fabric (re-)connect command? Once we exhaust the
number of reconnect attempts, we stop the reconnect work and delete the fabric
controller.

So do you think we should still expose head->delayed_removal_secs for fabric
setups? If yes, it would add further delay (without any reconnect attempt) for
fabric link failures, wouldn't it?

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
  2025-04-28  6:57   ` Hannes Reinecke
@ 2025-04-28  7:39     ` Nilay Shroff
  2025-04-29  5:49       ` Hannes Reinecke
  0 siblings, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-04-28  7:39 UTC (permalink / raw)
  To: Hannes Reinecke, linux-nvme, linux-block
  Cc: hch, kbusch, sagi, jmeneghi, axboe, martin.petersen, gjoyce



On 4/28/25 12:27 PM, Hannes Reinecke wrote:
> On 4/25/25 12:33, Nilay Shroff wrote:
>> Currently, a multipath head disk node is not created for single-ported
>> NVMe adapters or private namespaces. However, creating a head node in
>> these cases can help transparently handle transient PCIe link failures.
>> Without a head node, features like delayed removal cannot be leveraged,
>> making it difficult to tolerate such link failures. To address this,
>> this commit introduces nvme_core module parameter multipath_head_always.
>>
>> When this param is set to true, it forces the creation of a multipath
>> head node regardless NVMe disk or namespace type. So this option allows
>> the use of delayed removal of head node functionality even for single-
>> ported NVMe disks and private namespaces and thus helps transparently
>> handling transient PCIe link failures.
>>
>> By default multipath_head_always is set to false, thus preserving the
>> existing behavior. Setting it to true enables improved fault tolerance
>> in PCIe setups. Moreover, please note that enabling this option would
>> also implicitly enable nvme_core.multipath.
>>
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>>   drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++----
>>   1 file changed, 63 insertions(+), 7 deletions(-)
>>
> I really would model this according to dm-multipath where we have the
> 'fail_if_no_path' flag.
> This can be set for PCIe devices to retain the current behaviour
> (which we need for things like 'md' on top of NVMe) whenever
> this flag is set.
> 
Okay, so you mean that when the sysfs attribute "delayed_removal_secs"
under the head disk node is _NOT_ configured (or delayed_removal_secs
is set to zero), the internal flag "fail_if_no_path" is set to true.
In the other case, when "delayed_removal_secs" is set to a non-zero
value, we set "fail_if_no_path" to false. Is that correct?

> And it might be an idea to rename this flag to 'multipath_always_on',
> so 'multipath_head_always' might be confusing for people not familiar
> with the internal layout of the nvme multipath driver.
> 
Okay, I like this "multipath_always_on" module param. I'd rename
it in the next patch.

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node
  2025-04-25 22:26   ` Sagi Grimberg
@ 2025-04-28  7:39     ` Nilay Shroff
  0 siblings, 0 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-04-28  7:39 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme, linux-block
  Cc: hch, kbusch, hare, jmeneghi, axboe, martin.petersen, gjoyce



On 4/26/25 3:56 AM, Sagi Grimberg wrote:
>>   @@ -3688,6 +3688,8 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
>>       ratelimit_state_init(&head->rs_nuse, 5 * HZ, 1);
>>       ratelimit_set_flags(&head->rs_nuse, RATELIMIT_MSG_ON_RELEASE);
>>       kref_init(&head->ref);
>> +    if (ctrl->opts)
> 
> While this check is correct, perhaps checking ctrl->ops->flags & NVME_F_FABRICS
> would be clearer.
> 
Sure, I'd update this in the next patch.

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
  2025-04-28  7:39     ` Nilay Shroff
@ 2025-04-29  5:49       ` Hannes Reinecke
  2025-04-29  6:24         ` Nilay Shroff
  0 siblings, 1 reply; 18+ messages in thread
From: Hannes Reinecke @ 2025-04-29  5:49 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme, linux-block
  Cc: hch, kbusch, sagi, jmeneghi, axboe, martin.petersen, gjoyce

On 4/28/25 09:39, Nilay Shroff wrote:
> 
> 
> On 4/28/25 12:27 PM, Hannes Reinecke wrote:
>> On 4/25/25 12:33, Nilay Shroff wrote:
>>> Currently, a multipath head disk node is not created for single-ported
>>> NVMe adapters or private namespaces. However, creating a head node in
>>> these cases can help transparently handle transient PCIe link failures.
>>> Without a head node, features like delayed removal cannot be leveraged,
>>> making it difficult to tolerate such link failures. To address this,
>>> this commit introduces nvme_core module parameter multipath_head_always.
>>>
>>> When this param is set to true, it forces the creation of a multipath
>>> head node regardless NVMe disk or namespace type. So this option allows
>>> the use of delayed removal of head node functionality even for single-
>>> ported NVMe disks and private namespaces and thus helps transparently
>>> handling transient PCIe link failures.
>>>
>>> By default multipath_head_always is set to false, thus preserving the
>>> existing behavior. Setting it to true enables improved fault tolerance
>>> in PCIe setups. Moreover, please note that enabling this option would
>>> also implicitly enable nvme_core.multipath.
>>>
>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>> ---
>>>    drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++----
>>>    1 file changed, 63 insertions(+), 7 deletions(-)
>>>
>> I really would model this according to dm-multipath where we have the
>> 'fail_if_no_path' flag.
>> This can be set for PCIe devices to retain the current behaviour
>> (which we need for things like 'md' on top of NVMe) whenever
>> this flag is set.
>>
> Okay so you meant that when sysfs attribute "delayed_removal_secs"
> under head disk node is _NOT_ configured (or delayed_removal_secs
> is set to zero) we have internal flag "fail_if_no_path" is set to
> true. However in other case when "delayed_removal_secs" is set to
> a non-zero value we set "fail_if_no_path" to false. Is that correct?
> 
Don't make it overly complicated.
'fail_if_no_path' (and the inverse 'queue_if_no_path') can both be
mapped onto delayed_removal_secs; if the value is '0' then the head
disk is immediately removed (the 'fail_if_no_path' case), and if it's
-1 it is never removed (the 'queue_if_no_path' case).

Question, though: How does it interact with the existing 
'ctrl_loss_tmo'? Both describe essentially the same situation...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
  2025-04-29  5:49       ` Hannes Reinecke
@ 2025-04-29  6:24         ` Nilay Shroff
  2025-04-29  7:01           ` Hannes Reinecke
  0 siblings, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-04-29  6:24 UTC (permalink / raw)
  To: Hannes Reinecke, linux-nvme, linux-block
  Cc: hch, kbusch, sagi, jmeneghi, axboe, martin.petersen, gjoyce



On 4/29/25 11:19 AM, Hannes Reinecke wrote:
> On 4/28/25 09:39, Nilay Shroff wrote:
>>
>>
>> On 4/28/25 12:27 PM, Hannes Reinecke wrote:
>>> On 4/25/25 12:33, Nilay Shroff wrote:
>>>> Currently, a multipath head disk node is not created for single-ported
>>>> NVMe adapters or private namespaces. However, creating a head node in
>>>> these cases can help transparently handle transient PCIe link failures.
>>>> Without a head node, features like delayed removal cannot be leveraged,
>>>> making it difficult to tolerate such link failures. To address this,
>>>> this commit introduces nvme_core module parameter multipath_head_always.
>>>>
>>>> When this param is set to true, it forces the creation of a multipath
>>>> head node regardless NVMe disk or namespace type. So this option allows
>>>> the use of delayed removal of head node functionality even for single-
>>>> ported NVMe disks and private namespaces and thus helps transparently
>>>> handling transient PCIe link failures.
>>>>
>>>> By default multipath_head_always is set to false, thus preserving the
>>>> existing behavior. Setting it to true enables improved fault tolerance
>>>> in PCIe setups. Moreover, please note that enabling this option would
>>>> also implicitly enable nvme_core.multipath.
>>>>
>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>> ---
>>>>    drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++----
>>>>    1 file changed, 63 insertions(+), 7 deletions(-)
>>>>
>>> I really would model this according to dm-multipath where we have the
>>> 'fail_if_no_path' flag.
>>> This can be set for PCIe devices to retain the current behaviour
>>> (which we need for things like 'md' on top of NVMe) whenever
>>> this flag is set.
>>>
>> Okay so you meant that when sysfs attribute "delayed_removal_secs"
>> under head disk node is _NOT_ configured (or delayed_removal_secs
>> is set to zero) we have internal flag "fail_if_no_path" is set to
>> true. However in other case when "delayed_removal_secs" is set to
>> a non-zero value we set "fail_if_no_path" to false. Is that correct?
>>
> Don't make it overly complicated.
> 'fail_if_no_path' (and the inverse 'queue_if_no_path') can both be
> mapped onto delayed_removal_secs; if the value is '0' then the head
> disk is immediately removed (the 'fail_if_no_path' case), and if it's
> -1 it is never removed (the 'queue_if_no_path' case).
> 
Yes, if the value of delayed_removal_secs is 0, the head is immediately
removed. However, if delayed_removal_secs is anything but zero (i.e.
greater than zero, as delayed_removal_secs is unsigned), the head is
removed only after delayed_removal_secs has elapsed and the disk has
failed to recover from the transient link failure. We never pin the
head node indefinitely.

> Question, though: How does it interact with the existing 'ctrl_loss_tmo'? Both describe essentially the same situation...
> 
The delayed_removal_secs attribute is modeled for NVMe PCIe adapters, so it
really doesn't interact or interfere with ctrl_loss_tmo, which is a fabric
controller option.

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
  2025-04-25 14:45   ` Christoph Hellwig
@ 2025-04-29  6:26     ` Nilay Shroff
  0 siblings, 0 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-04-29  6:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, linux-block, kbusch, hare, sagi, jmeneghi, axboe,
	martin.petersen, gjoyce



On 4/25/25 8:15 PM, Christoph Hellwig wrote:
> On Fri, Apr 25, 2025 at 04:03:09PM +0530, Nilay Shroff wrote:
>> +static int multipath_param_set(const char *val, const struct kernel_param *kp)
>> +{
>> +	int ret;
>> +
>> +	ret = param_set_bool(val, kp);
>> +	if (ret)
>> +		return ret;
>> +
>> +	if (multipath_head_always && !*(bool *)kp->arg) {
>> +		pr_err("Can't disable multipath when multipath_head_always is configured.\n");
>> +		*(bool *)kp->arg = true;
> 
> This reads much nicer if you add a local bool * variable and avoid
> all the casting, i.e.
> 
> 	bool *arg = kp->arg;
> 
> 	...
> 
> 	if (multipath_head_always && !*arg) {
> 		pr_err("Can't disable multipath when multipath_head_always is configured.\n");
> 		*arg = true;
> 
>> +static int multipath_head_always_set(const char *val,
>> +		const struct kernel_param *kp)
>> +{
>> +	int ret;
>> +
>> +	ret = param_set_bool(val, kp);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	if (*(bool *)kp->arg)
>> +		multipath = true;
> 
> Same here.
> 
Agreed, will change in the next patch.

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
  2025-04-29  6:24         ` Nilay Shroff
@ 2025-04-29  7:01           ` Hannes Reinecke
  2025-04-29  7:15             ` Nilay Shroff
  0 siblings, 1 reply; 18+ messages in thread
From: Hannes Reinecke @ 2025-04-29  7:01 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme, linux-block
  Cc: hch, kbusch, sagi, jmeneghi, axboe, martin.petersen, gjoyce

On 4/29/25 08:24, Nilay Shroff wrote:
> 
> 
> On 4/29/25 11:19 AM, Hannes Reinecke wrote:
>> On 4/28/25 09:39, Nilay Shroff wrote:
>>>
>>>
>>> On 4/28/25 12:27 PM, Hannes Reinecke wrote:
>>>> On 4/25/25 12:33, Nilay Shroff wrote:
>>>>> Currently, a multipath head disk node is not created for single-ported
>>>>> NVMe adapters or private namespaces. However, creating a head node in
>>>>> these cases can help transparently handle transient PCIe link failures.
>>>>> Without a head node, features like delayed removal cannot be leveraged,
>>>>> making it difficult to tolerate such link failures. To address this,
>>>>> this commit introduces nvme_core module parameter multipath_head_always.
>>>>>
>>>>> When this param is set to true, it forces the creation of a multipath
>>>>> head node regardless NVMe disk or namespace type. So this option allows
>>>>> the use of delayed removal of head node functionality even for single-
>>>>> ported NVMe disks and private namespaces and thus helps transparently
>>>>> handling transient PCIe link failures.
>>>>>
>>>>> By default multipath_head_always is set to false, thus preserving the
>>>>> existing behavior. Setting it to true enables improved fault tolerance
>>>>> in PCIe setups. Moreover, please note that enabling this option would
>>>>> also implicitly enable nvme_core.multipath.
>>>>>
>>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>>> ---
>>>>>     drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++----
>>>>>     1 file changed, 63 insertions(+), 7 deletions(-)
>>>>>
>>>> I really would model this according to dm-multipath where we have the
>>>> 'fail_if_no_path' flag.
>>>> This can be set for PCIe devices to retain the current behaviour
>>>> (which we need for things like 'md' on top of NVMe) whenever
>>>> this flag is set.
>>>>
>>> Okay so you meant that when sysfs attribute "delayed_removal_secs"
>>> under head disk node is _NOT_ configured (or delayed_removal_secs
>>> is set to zero) we have internal flag "fail_if_no_path" is set to
>>> true. However in other case when "delayed_removal_secs" is set to
>>> a non-zero value we set "fail_if_no_path" to false. Is that correct?
>>>
>> Don't make it overly complicated.
>> 'fail_if_no_path' (and the inverse 'queue_if_no_path') can both be
>> mapped onto delayed_removal_secs; if the value is '0' then the head
>> disk is immediately removed (the 'fail_if_no_path' case), and if it's
>> -1 it is never removed (the 'queue_if_no_path' case).
>>
> Yes if the value of delayed_removal_secs is 0 then the head is immediately
> removed, however if value of delayed_removal_secs is anything but zero
> (i.e. greater than zero as delayed_removal_secs is unsigned) then head
> is removed only after delayed_removal_secs is elapsed and hence disk
> couldn't recover from transient link failure. We never pin head node
> indefinitely.
> 
>> Question, though: How does it interact with the existing 'ctrl_loss_tmo'? Both describe essentially the same situation...
>>
> The delayed_removal_secs is modeled for NVMe PCIe adapter. So it really
> doesn't interact or interfere with ctrl_loss_tmo which is fabric controller
> option.
> 
Not so sure here.
You _could_ expand the scope for ctrl_loss_tmo to PCI, too;
as most PCI devices will only ever have one controller, 'ctrl_loss_tmo'
will be identical to 'delayed_removal_secs'.

So I guess my question is: is there a value for fabrics to control
the lifetime of struct ns_head independent of the lifetime of the
controller?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
  2025-04-29  7:01           ` Hannes Reinecke
@ 2025-04-29  7:15             ` Nilay Shroff
  0 siblings, 0 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-04-29  7:15 UTC (permalink / raw)
  To: Hannes Reinecke, linux-nvme, linux-block
  Cc: hch, kbusch, sagi, jmeneghi, axboe, martin.petersen, gjoyce



On 4/29/25 12:31 PM, Hannes Reinecke wrote:
> On 4/29/25 08:24, Nilay Shroff wrote:
>>
>>
>> On 4/29/25 11:19 AM, Hannes Reinecke wrote:
>>> On 4/28/25 09:39, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 4/28/25 12:27 PM, Hannes Reinecke wrote:
>>>>> On 4/25/25 12:33, Nilay Shroff wrote:
>>>>>> Currently, a multipath head disk node is not created for single-ported
>>>>>> NVMe adapters or private namespaces. However, creating a head node in
>>>>>> these cases can help transparently handle transient PCIe link failures.
>>>>>> Without a head node, features like delayed removal cannot be leveraged,
>>>>>> making it difficult to tolerate such link failures. To address this,
>>>>>> this commit introduces nvme_core module parameter multipath_head_always.
>>>>>>
>>>>>> When this param is set to true, it forces the creation of a multipath
>>>>>> head node regardless NVMe disk or namespace type. So this option allows
>>>>>> the use of delayed removal of head node functionality even for single-
>>>>>> ported NVMe disks and private namespaces and thus helps transparently
>>>>>> handling transient PCIe link failures.
>>>>>>
>>>>>> By default multipath_head_always is set to false, thus preserving the
>>>>>> existing behavior. Setting it to true enables improved fault tolerance
>>>>>> in PCIe setups. Moreover, please note that enabling this option would
>>>>>> also implicitly enable nvme_core.multipath.
>>>>>>
>>>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>>>> ---
>>>>>>     drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++----
>>>>>>     1 file changed, 63 insertions(+), 7 deletions(-)
>>>>>>
>>>>> I really would model this according to dm-multipath where we have the
>>>>> 'fail_if_no_path' flag.
>>>>> This can be set for PCIe devices to retain the current behaviour
>>>>> (which we need for things like 'md' on top of NVMe) whenever
>>>>> this flag is set.
>>>>>
>>>> Okay so you meant that when sysfs attribute "delayed_removal_secs"
>>>> under head disk node is _NOT_ configured (or delayed_removal_secs
>>>> is set to zero) we have internal flag "fail_if_no_path" is set to
>>>> true. However in other case when "delayed_removal_secs" is set to
>>>> a non-zero value we set "fail_if_no_path" to false. Is that correct?
>>>>
>>> Don't make it overly complicated.
>>> 'fail_if_no_path' (and the inverse 'queue_if_no_path') can both be
>>> mapped onto delayed_removal_secs; if the value is '0' then the head
>>> disk is immediately removed (the 'fail_if_no_path' case), and if it's
>>> -1 it is never removed (the 'queue_if_no_path' case).
>>>
>> Yes if the value of delayed_removal_secs is 0 then the head is immediately
>> removed, however if value of delayed_removal_secs is anything but zero
>> (i.e. greater than zero as delayed_removal_secs is unsigned) then head
>> is removed only after delayed_removal_secs is elapsed and hence disk
>> couldn't recover from transient link failure. We never pin head node
>> indefinitely.
>>
>>> Question, though: How does it interact with the existing 'ctrl_loss_tmo'? Both describe essentially the same situation...
>>>
>> The delayed_removal_secs is modeled for NVMe PCIe adapter. So it really
>> doesn't interact or interfere with ctrl_loss_tmo which is fabric controller
>> option.
>>
> Not so sure here.
> You _could_ expand the scope for ctrl_loss_tmo to PCI, too;
> as most PCI devices will only ever have one controller 'ctrl_loss_tmo'
> will be identical to 'delayed_removal_secs'.
> 
> So I guess my question is: is there a value for fabrics to control
> the lifetime of struct ns_head independent of the lifetime of the
> controller?
> 
The ctrl_loss_tmo option doesn't actually control the lifetime of the
ns_head. In fact, ctrl_loss_tmo allows fabric I/O commands to fail fast
so that they don't get stuck while the host NVMe-oF controller is in the
reconnect state. Users may not want to wait long when the fabric
controller loses its connection with the target and enters the reconnect
state. Typically, the default reconnect timeout is 10 minutes, which is
way longer than the expected timeout of 30 seconds for any I/O command
to fail.
You may find more details in commit 8c4dfea97f15 ("nvme-fabrics:
reject I/O to offline device"), which implements ctrl_loss_tmo.

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-04-29  7:16 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-25 10:33 [RFC PATCHv2 0/3] improve NVMe multipath handling Nilay Shroff
2025-04-25 10:33 ` [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
2025-04-25 14:43   ` Christoph Hellwig
2025-04-28  7:05     ` Nilay Shroff
2025-04-25 22:26   ` Sagi Grimberg
2025-04-28  7:39     ` Nilay Shroff
2025-04-25 10:33 ` [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param Nilay Shroff
2025-04-25 14:45   ` Christoph Hellwig
2025-04-29  6:26     ` Nilay Shroff
2025-04-28  6:57   ` Hannes Reinecke
2025-04-28  7:39     ` Nilay Shroff
2025-04-29  5:49       ` Hannes Reinecke
2025-04-29  6:24         ` Nilay Shroff
2025-04-29  7:01           ` Hannes Reinecke
2025-04-29  7:15             ` Nilay Shroff
2025-04-25 10:33 ` [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk Nilay Shroff
2025-04-25 14:46   ` Christoph Hellwig
2025-04-25 22:27   ` Sagi Grimberg

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox