* [RFC PATCH 0/2] improve NVMe multipath handling
@ 2025-03-21 6:37 Nilay Shroff
2025-03-21 6:37 ` [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
2025-03-21 6:37 ` [RFC PATCH 2/2] nvme-multipath: remove multipath module param Nilay Shroff
0 siblings, 2 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-03-21 6:37 UTC (permalink / raw)
To: linux-nvme, linux-block; +Cc: hch, kbusch, hare, sagi, jmeneghi, axboe, gjoyce
Hi,
This patch series introduces improvements to NVMe multipath handling by
refining the removal behavior of the multipath head node and simplifying
configuration options. The idea/POC for this change was originally
proposed by Christoph[1] and Keith[2]. I built upon their original
idea/POC and implemented this series.
The first patch in the series addresses an issue where the multipath
head node of a PCIe NVMe disk is removed immediately when all disk paths
are lost. This can cause problems in scenarios such as:
- Hot removal and re-addition of a disk.
- Transient PCIe link failures that trigger re-enumeration,
briefly removing and restoring the disk.
In such cases, premature removal of the head node may result in a device
node name change, requiring applications to reopen device handles if
they were performing I/O during the failure. To mitigate this, we
introduce a delayed removal mechanism. Instead of removing the head node
immediately, the system waits for a configurable timeout, allowing the
disk to recover. If the disk comes back online within this window, the
head node remains unchanged, ensuring uninterrupted workloads.
A new sysfs attribute, delayed_shutdown_sec, allows users to configure
this timeout. By default, it is set to 0 seconds, preserving the
existing behavior unless explicitly changed.
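For illustration (the device name below is just an example; the attribute
lives under the head disk node's sysfs directory), the timeout could be
configured from userspace as follows:

  # default is 0, i.e. the head node is removed immediately
  cat /sys/block/nvme0n1/delayed_shutdown_sec
  0
  # keep the head node around for up to 60 seconds after the last path is lost
  echo 60 > /sys/block/nvme0n1/delayed_shutdown_sec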
Additionally, please note that this change now always creates a head disk
node for all types of NVMe disks (single-ported or multi-ported) as well
as for shared/private namespaces, unless the multipath nvme-core module
parameter is explicitly set to false or CONFIG_NVME_MULTIPATH is disabled.
The second patch removes the multipath module parameter from nvme-core,
making native NVMe multipath support explicit. With the first patch's
changes, the multipath head node is always created, even for single-port
NVMe disks, when CONFIG_NVME_MULTIPATH is configured. Since this behavior
is now the default, the multipath module parameter may no longer be
needed. IMO, CONFIG_NVME_MULTIPATH (native multipath) should be the
default and non-native multipath should ideally be deprecated by now;
however, I didn't remove CONFIG_NVME_MULTIPATH in this series, so users
who still prefer non-native multipath can disable CONFIG_NVME_MULTIPATH
at compile time. Having said that, if everyone agrees, we may deprecate
non-native multipath support for NVMe.
These changes should help improve NVMe multipath reliability and simplify
configuration. Feedback and testing are welcome!
PS: Yes, I know this RFC is late, but the intention is to get feedback and
suggestions at the upcoming LSF/MM/BPF summit. This might be used as a
reference implementation for discussion. I also saw that we've already
got a timeslot where John is going to talk about removing the NVMe
multipath config option. Maybe we could include it in that discussion, if
everyone agrees.
Thanks!
--Nilay
Nilay Shroff (2):
nvme-multipath: introduce delayed removal of the multipath head node
nvme-multipath: remove multipath module param
drivers/nvme/host/core.c | 36 ++++------
drivers/nvme/host/multipath.c | 127 ++++++++++++++++++++++++++--------
drivers/nvme/host/nvme.h | 5 +-
drivers/nvme/host/sysfs.c | 13 ++++
4 files changed, 132 insertions(+), 49 deletions(-)
--
2.47.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node
2025-03-21 6:37 [RFC PATCH 0/2] improve NVMe multipath handling Nilay Shroff
@ 2025-03-21 6:37 ` Nilay Shroff
2025-03-22 1:48 ` Martin K. Petersen
` (2 more replies)
2025-03-21 6:37 ` [RFC PATCH 2/2] nvme-multipath: remove multipath module param Nilay Shroff
1 sibling, 3 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-03-21 6:37 UTC (permalink / raw)
To: linux-nvme, linux-block; +Cc: hch, kbusch, hare, sagi, jmeneghi, axboe, gjoyce
Currently, the multipath head node of a PCIe NVMe disk is removed as
soon as all paths of the disk are removed. However, this can cause
issues in scenarios such as:
- Hot removal of the disk followed by re-addition.
- Transient PCIe link failures that trigger re-enumeration,
temporarily removing and then restoring the disk.
In these cases, removing the head node prematurely may lead to a head
disk node name change upon re-addition, requiring applications to
reopen their handles if they were performing I/O during the failure.
To address this, introduce a delayed removal mechanism for the head
disk node. During a transient failure, instead of removing the head
disk node immediately, the system waits for a configurable timeout,
allowing the disk to recover. Additionally, please note that the head disk node is
now always created for all types of NVMe disks (single-ported or multi-
ported), unless the multipath module parameter is explicitly set to
false or CONFIG_NVME_MULTIPATH is disabled.
During a transient disk failure, if an application sends any I/O, we
queue it instead of failing it immediately. If the disk comes back
online within the timeout, the queued I/Os are resubmitted to the disk,
ensuring seamless operation. If the disk cannot recover from the
failure, the queued I/Os are failed to completion and the application
receives the error.
This way, if the disk comes back online within the configured period,
the head node remains unchanged, ensuring uninterrupted workloads
without requiring applications to reopen device handles.
A new sysfs attribute, named "delayed_shutdown_sec", is added for users
who wish to configure the time for delayed removal of the head disk node.
The default value of this attribute is 0 seconds, ensuring no
behavior change unless explicitly configured.
Link: https://lore.kernel.org/linux-nvme/Y9oGTKCFlOscbPc2@infradead.org/
Link: https://lore.kernel.org/linux-nvme/Y+1aKcQgbskA2tra@kbusch-mbp.dhcp.thefacebook.com/
Suggested-by: Keith Busch <kbusch@kernel.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
[nilay: reworked based on the original idea/POC from Christoph and Keith]
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 18 +++---
drivers/nvme/host/multipath.c | 118 +++++++++++++++++++++++++++++-----
drivers/nvme/host/nvme.h | 4 ++
drivers/nvme/host/sysfs.c | 13 ++++
4 files changed, 127 insertions(+), 26 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 870314c52107..e798809a8325 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3562,7 +3562,7 @@ static struct nvme_ns_head *nvme_find_ns_head(struct nvme_ctrl *ctrl,
*/
if (h->ns_id != nsid || !nvme_is_unique_nsid(ctrl, h))
continue;
- if (!list_empty(&h->list) && nvme_tryget_ns_head(h))
+ if (nvme_tryget_ns_head(h))
return h;
}
@@ -3690,6 +3690,10 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
ratelimit_state_init(&head->rs_nuse, 5 * HZ, 1);
ratelimit_set_flags(&head->rs_nuse, RATELIMIT_MSG_ON_RELEASE);
kref_init(&head->ref);
+#ifdef CONFIG_NVME_MULTIPATH
+ if (ctrl->ops->flags & NVME_F_FABRICS)
+ set_bit(NVME_NSHEAD_FABRICS, &head->flags);
+#endif
if (head->ids.csi) {
ret = nvme_get_effects_log(ctrl, head->ids.csi, &head->effects);
@@ -3806,7 +3810,8 @@ static int nvme_init_ns_head(struct nvme_ns *ns, struct nvme_ns_info *info)
}
} else {
ret = -EINVAL;
- if (!info->is_shared || !head->shared) {
+ if ((!info->is_shared || !head->shared) &&
+ !list_empty(&head->list)) {
dev_err(ctrl->device,
"Duplicate unshared namespace %d\n",
info->nsid);
@@ -3988,8 +3993,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
static void nvme_ns_remove(struct nvme_ns *ns)
{
- bool last_path = false;
-
if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
return;
@@ -4009,10 +4012,6 @@ static void nvme_ns_remove(struct nvme_ns *ns)
mutex_lock(&ns->ctrl->subsys->lock);
list_del_rcu(&ns->siblings);
- if (list_empty(&ns->head->list)) {
- list_del_init(&ns->head->entry);
- last_path = true;
- }
mutex_unlock(&ns->ctrl->subsys->lock);
/* guarantee not available in head->list */
@@ -4030,8 +4029,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
mutex_unlock(&ns->ctrl->namespaces_lock);
synchronize_srcu(&ns->ctrl->srcu);
- if (last_path)
- nvme_mpath_shutdown_disk(ns->head);
+ nvme_mpath_shutdown_disk(ns->head);
nvme_put_ns(ns);
}
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 6b12ca80aa27..0f54889bd483 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -421,7 +421,6 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
return nvme_numa_path(head);
}
}
-
static bool nvme_available_path(struct nvme_ns_head *head)
{
struct nvme_ns *ns;
@@ -429,6 +428,16 @@ static bool nvme_available_path(struct nvme_ns_head *head)
if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
return NULL;
+ /*
+ * For non-fabric controllers we support delayed removal of head disk
+ * node. If we reached up to here then it means that head disk is still
+ * alive and so we assume here that even if there's no path available
+ * maybe due to the transient link failure, we could queue up the IO
+ * and later when path becomes ready we re-submit queued IO.
+ */
+ if (!(test_bit(NVME_NSHEAD_FABRICS, &head->flags)))
+ return true;
+
list_for_each_entry_srcu(ns, &head->list, siblings,
srcu_read_lock_held(&head->srcu)) {
if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
@@ -444,7 +453,6 @@ static bool nvme_available_path(struct nvme_ns_head *head)
}
return false;
}
-
static void nvme_ns_head_submit_bio(struct bio *bio)
{
struct nvme_ns_head *head = bio->bi_bdev->bd_disk->private_data;
@@ -617,6 +625,40 @@ static void nvme_requeue_work(struct work_struct *work)
}
}
+static void nvme_remove_head(struct nvme_ns_head *head)
+{
+ if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
+ /*
+ * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
+ * to allow multipath to fail all I/O.
+ */
+ kblockd_schedule_work(&head->requeue_work);
+
+ nvme_cdev_del(&head->cdev, &head->cdev_device);
+ synchronize_srcu(&head->srcu);
+ del_gendisk(head->disk);
+ nvme_put_ns_head(head);
+ }
+}
+
+static void nvme_remove_head_work(struct work_struct *work)
+{
+ struct nvme_ns_head *head = container_of(to_delayed_work(work),
+ struct nvme_ns_head, remove_work);
+ bool shutdown = false;
+
+ mutex_lock(&head->subsys->lock);
+ if (list_empty(&head->list)) {
+ list_del_init(&head->entry);
+ shutdown = true;
+ }
+ mutex_unlock(&head->subsys->lock);
+ if (shutdown)
+ nvme_remove_head(head);
+
+ module_put(THIS_MODULE);
+}
+
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
{
struct queue_limits lim;
@@ -626,14 +668,15 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
spin_lock_init(&head->requeue_lock);
INIT_WORK(&head->requeue_work, nvme_requeue_work);
INIT_WORK(&head->partition_scan_work, nvme_partition_scan_work);
+ INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
+ head->delayed_shutdown_sec = 0;
/*
- * Add a multipath node if the subsystems supports multiple controllers.
- * We also do this for private namespaces as the namespace sharing flag
- * could change after a rescan.
+ * A head disk node is always created for all types of NVMe disks
+ * (single-ported and multi-ported), unless the multipath module
+ * parameter is explicitly set to false.
*/
- if (!(ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL) ||
- !nvme_is_unique_nsid(ctrl, head) || !multipath)
+ if (!multipath)
return 0;
blk_set_stacking_limits(&lim);
@@ -659,6 +702,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state);
sprintf(head->disk->disk_name, "nvme%dn%d",
ctrl->subsys->instance, head->instance);
+ nvme_tryget_ns_head(head);
return 0;
}
@@ -1015,6 +1059,40 @@ static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr
}
DEVICE_ATTR_RO(numa_nodes);
+static ssize_t delayed_shutdown_sec_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+ struct nvme_ns_head *head = disk->private_data;
+ int ret;
+
+ mutex_lock(&head->subsys->lock);
+ ret = sysfs_emit(buf, "%u\n", head->delayed_shutdown_sec);
+ mutex_unlock(&head->subsys->lock);
+ return ret;
+}
+
+static ssize_t delayed_shutdown_sec_store(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t count)
+{
+ struct gendisk *disk = dev_to_disk(dev);
+ struct nvme_ns_head *head = disk->private_data;
+ unsigned int sec;
+ int ret;
+
+ ret = kstrtouint(buf, 0, &sec);
+ if (ret < 0)
+ return ret;
+
+ mutex_lock(&head->subsys->lock);
+ head->delayed_shutdown_sec = sec;
+ mutex_unlock(&head->subsys->lock);
+
+ return count;
+}
+
+DEVICE_ATTR_RW(delayed_shutdown_sec);
+
static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
struct nvme_ana_group_desc *desc, void *data)
{
@@ -1138,18 +1216,26 @@ void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
{
- if (!head->disk)
- return;
- if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
- nvme_cdev_del(&head->cdev, &head->cdev_device);
+ mutex_lock(&head->subsys->lock);
+
+ if (!list_empty(&head->list) || !head->disk)
+ goto out;
+
+ if (head->delayed_shutdown_sec) {
/*
- * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
- * to allow multipath to fail all I/O.
+ * Ensure that no one could remove this module while the head
+ * remove work is pending.
*/
- synchronize_srcu(&head->srcu);
- kblockd_schedule_work(&head->requeue_work);
- del_gendisk(head->disk);
+ if (!try_module_get(THIS_MODULE))
+ goto out;
+ queue_delayed_work(nvme_wq, &head->remove_work,
+ head->delayed_shutdown_sec * HZ);
+ } else {
+ list_del_init(&head->entry);
+ nvme_remove_head(head);
}
+out:
+ mutex_unlock(&head->subsys->lock);
}
void nvme_mpath_remove_disk(struct nvme_ns_head *head)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 51e078642127..4375357b8cd7 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -503,7 +503,10 @@ struct nvme_ns_head {
struct work_struct partition_scan_work;
struct mutex lock;
unsigned long flags;
+ struct delayed_work remove_work;
+ unsigned int delayed_shutdown_sec;
#define NVME_NSHEAD_DISK_LIVE 0
+#define NVME_NSHEAD_FABRICS 1
struct nvme_ns __rcu *current_path[];
#endif
};
@@ -986,6 +989,7 @@ extern struct device_attribute dev_attr_ana_grpid;
extern struct device_attribute dev_attr_ana_state;
extern struct device_attribute dev_attr_queue_depth;
extern struct device_attribute dev_attr_numa_nodes;
+extern struct device_attribute dev_attr_delayed_shutdown_sec;
extern struct device_attribute subsys_attr_iopolicy;
static inline bool nvme_disk_is_ns_head(struct gendisk *disk)
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 6d31226f7a4f..170897349093 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -260,6 +260,7 @@ static struct attribute *nvme_ns_attrs[] = {
&dev_attr_ana_state.attr,
&dev_attr_queue_depth.attr,
&dev_attr_numa_nodes.attr,
+ &dev_attr_delayed_shutdown_sec.attr,
#endif
&dev_attr_io_passthru_err_log_enabled.attr,
NULL,
@@ -296,6 +297,18 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
if (nvme_disk_is_ns_head(dev_to_disk(dev)))
return 0;
}
+ if (a == &dev_attr_delayed_shutdown_sec.attr) {
+ struct nvme_ns_head *head = dev_to_ns_head(dev);
+ struct gendisk *disk = dev_to_disk(dev);
+
+ /*
+ * This attribute is only valid for head node and non-fabric
+ * controllers.
+ */
+ if (!nvme_disk_is_ns_head(disk) ||
+ test_bit(NVME_NSHEAD_FABRICS, &head->flags))
+ return 0;
+ }
#endif
return a->mode;
}
--
2.47.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC PATCH 2/2] nvme-multipath: remove multipath module param
2025-03-21 6:37 [RFC PATCH 0/2] improve NVMe multipath handling Nilay Shroff
2025-03-21 6:37 ` [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
@ 2025-03-21 6:37 ` Nilay Shroff
2025-03-25 15:09 ` John Meneghini
2025-04-07 14:45 ` Christoph Hellwig
1 sibling, 2 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-03-21 6:37 UTC (permalink / raw)
To: linux-nvme, linux-block; +Cc: hch, kbusch, hare, sagi, jmeneghi, axboe, gjoyce
Remove the multipath module parameter from nvme-core and make native
NVMe multipath support explicit. Since we now always create a multipath
head disk node, even for single-port NVMe disks, when CONFIG_NVME_MULTIPATH
is enabled, this module parameter is no longer needed to
toggle the behavior.
Users who prefer non-native multipath must disable CONFIG_NVME_MULTIPATH
at compile time.
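For example, in the kernel .config this simply means:

  # CONFIG_NVME_MULTIPATH is not set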
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 18 +++++++-----------
drivers/nvme/host/multipath.c | 17 ++---------------
drivers/nvme/host/nvme.h | 1 -
3 files changed, 9 insertions(+), 27 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e798809a8325..50c170425141 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3823,14 +3823,13 @@ static int nvme_init_ns_head(struct nvme_ns *ns, struct nvme_ns_info *info)
info->nsid);
goto out_put_ns_head;
}
-
- if (!multipath) {
- dev_warn(ctrl->device,
- "Found shared namespace %d, but multipathing not supported.\n",
- info->nsid);
- dev_warn_once(ctrl->device,
- "Support for shared namespaces without CONFIG_NVME_MULTIPATH is deprecated and will be removed in Linux 6.0.\n");
- }
+#ifndef CONFIG_NVME_MULTIPATH
+ dev_warn(ctrl->device,
+ "Found shared namespace %d, but multipathing not supported.\n",
+ info->nsid);
+ dev_warn_once(ctrl->device,
+ "Support for shared namespaces without CONFIG_NVME_MULTIPATH is deprecated and will be removed in Linux 6.0.\n");
+#endif
}
list_add_tail_rcu(&ns->siblings, &head->list);
@@ -3929,9 +3928,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
sprintf(disk->disk_name, "nvme%dc%dn%d", ctrl->subsys->instance,
ctrl->instance, ns->head->instance);
disk->flags |= GENHD_FL_HIDDEN;
- } else if (multipath) {
- sprintf(disk->disk_name, "nvme%dn%d", ctrl->subsys->instance,
- ns->head->instance);
} else {
sprintf(disk->disk_name, "nvme%dn%d", ctrl->instance,
ns->head->instance);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 0f54889bd483..84211f64d178 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -9,11 +9,6 @@
#include <trace/events/block.h>
#include "nvme.h"
-bool multipath = true;
-module_param(multipath, bool, 0444);
-MODULE_PARM_DESC(multipath,
- "turn on native support for multiple controllers per subsystem");
-
static const char *nvme_iopolicy_names[] = {
[NVME_IOPOLICY_NUMA] = "numa",
[NVME_IOPOLICY_RR] = "round-robin",
@@ -671,14 +666,6 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
head->delayed_shutdown_sec = 0;
- /*
- * A head disk node is always created for all types of NVMe disks
- * (single-ported and multi-ported), unless the multipath module
- * parameter is explicitly set to false.
- */
- if (!multipath)
- return 0;
-
blk_set_stacking_limits(&lim);
lim.dma_alignment = 3;
lim.features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT | BLK_FEAT_POLL;
@@ -1262,8 +1249,8 @@ int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
size_t ana_log_size;
int error = 0;
- /* check if multipath is enabled and we have the capability */
- if (!multipath || !ctrl->subsys ||
+ /* check if controller has ANA capability */
+ if (!ctrl->subsys ||
!(ctrl->subsys->cmic & NVME_CTRL_CMIC_ANA))
return 0;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 4375357b8cd7..fba686b91976 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -997,7 +997,6 @@ static inline bool nvme_disk_is_ns_head(struct gendisk *disk)
return disk->fops == &nvme_ns_head_ops;
}
#else
-#define multipath false
static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
{
return false;
--
2.47.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node
2025-03-21 6:37 ` [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
@ 2025-03-22 1:48 ` Martin K. Petersen
2025-03-22 22:08 ` Nilay Shroff
2025-03-25 15:21 ` John Meneghini
2025-04-07 14:44 ` Christoph Hellwig
2 siblings, 1 reply; 18+ messages in thread
From: Martin K. Petersen @ 2025-03-22 1:48 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-nvme, linux-block, hch, kbusch, hare, sagi, jmeneghi, axboe,
gjoyce
Nilay,
> A new sysfs attribute, named "delayed_shutdown_sec" is added for user
> who wish to configure time for the delayed removal of head disk node.
Shutdown has a fairly specific meaning in the context of NVMe devices.
Maybe "defer_removal_secs" or "delay_removal_secs"?
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node
2025-03-22 1:48 ` Martin K. Petersen
@ 2025-03-22 22:08 ` Nilay Shroff
0 siblings, 0 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-03-22 22:08 UTC (permalink / raw)
To: Martin K. Petersen
Cc: linux-nvme, linux-block, hch, kbusch, hare, sagi, jmeneghi, axboe,
gjoyce
On 3/22/25 7:18 AM, Martin K. Petersen wrote:
>
> Nilay,
>
>> A new sysfs attribute, named "delayed_shutdown_sec" is added for user
>> who wish to configure time for the delayed removal of head disk node.
>
> Shutdown has a fairly specific meaning in the context of NVMe devices.
> Maybe "defer_removal_secs" or "delay_removal_secs"?
>
Yeah 'delay_removal_secs' sounds good!
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 2/2] nvme-multipath: remove multipath module param
2025-03-21 6:37 ` [RFC PATCH 2/2] nvme-multipath: remove multipath module param Nilay Shroff
@ 2025-03-25 15:09 ` John Meneghini
2025-04-07 14:45 ` Christoph Hellwig
1 sibling, 0 replies; 18+ messages in thread
From: John Meneghini @ 2025-03-25 15:09 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme, linux-block
Cc: hch, kbusch, hare, sagi, axboe, gjoyce, jmeneghi
On 3/21/25 2:37 AM, Nilay Shroff wrote:
> Remove the multipath module parameter from nvme-core and make native
> NVMe multipath support explicit. Since we now always create a multipath
> head disk node, even for single-port NVMe disks, when CONFIG_NVME_
> MULTIPATH is enabled, this module parameter is no longer needed to
> toggle the behavior.
>
> Users who prefer non-native multipath must disable CONFIG_NVME_MULTIPATH
> at compile time.
This patch is not needed if you use my patch at:
https://lore.kernel.org/linux-nvme/20250322232848.225140-1-jmeneghi@redhat.com/T/#m7a6ea63be64731f19df4e17fda7db2b18d8551c7
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> drivers/nvme/host/core.c | 18 +++++++-----------
> drivers/nvme/host/multipath.c | 17 ++---------------
> drivers/nvme/host/nvme.h | 1 -
> 3 files changed, 9 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index e798809a8325..50c170425141 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3823,14 +3823,13 @@ static int nvme_init_ns_head(struct nvme_ns *ns, struct nvme_ns_info *info)
> info->nsid);
> goto out_put_ns_head;
> }
> -
> - if (!multipath) {
> - dev_warn(ctrl->device,
> - "Found shared namespace %d, but multipathing not supported.\n",
> - info->nsid);
> - dev_warn_once(ctrl->device,
> - "Support for shared namespaces without CONFIG_NVME_MULTIPATH is deprecated and will be removed in Linux 6.0.\n");
> - }
> +#ifndef CONFIG_NVME_MULTIPATH
> + dev_warn(ctrl->device,
> + "Found shared namespace %d, but multipathing not supported.\n",
> + info->nsid);
> + dev_warn_once(ctrl->device,
> + "Support for shared namespaces without CONFIG_NVME_MULTIPATH is deprecated and will be removed in Linux 6.0.\n");
I think this message needs to change. See my patch at:
https://lore.kernel.org/linux-nvme/20250322232848.225140-1-jmeneghi@redhat.com/T/#md2f1d9706f4cbeb8bd53e562d1036195c0599fe1
> +#endif
> }
>
> list_add_tail_rcu(&ns->siblings, &head->list);
> @@ -3929,9 +3928,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
> sprintf(disk->disk_name, "nvme%dc%dn%d", ctrl->subsys->instance,
> ctrl->instance, ns->head->instance);
> disk->flags |= GENHD_FL_HIDDEN;
> - } else if (multipath) {
> - sprintf(disk->disk_name, "nvme%dn%d", ctrl->subsys->instance,
> - ns->head->instance);
I don't think we should remove the 'multipath' variable here; we just need
to control the user's ability to change it.
> } else {
> sprintf(disk->disk_name, "nvme%dn%d", ctrl->instance,
> ns->head->instance);
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node
2025-03-21 6:37 ` [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
2025-03-22 1:48 ` Martin K. Petersen
@ 2025-03-25 15:21 ` John Meneghini
2025-04-07 14:44 ` Christoph Hellwig
2 siblings, 0 replies; 18+ messages in thread
From: John Meneghini @ 2025-03-25 15:21 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme, linux-block
Cc: hch, kbusch, hare, sagi, axboe, gjoyce, jmeneghi
Thanks for the patch Nilay!
I've rebased this patch onto my changes at:
https://lore.kernel.org/linux-nvme/20250322232848.225140-1-jmeneghi@redhat.com/T/
This patch applies and compiles cleanly with no problems.
I will test this out after LSF/MM with my fabrics testbed and get back to you.
linux(nilay_mp) > git log --oneline -5
0834c24969a1 (HEAD -> nilay_mp) nvme-multipath: introduce delayed removal of the multipath head node
f9ca6ae944e4 (johnm/config_ana5, config_ana5) nvme: update the multipath warning in nvme_init_ns_head
79feb9e51d89 nvme-multipath: add the NVME_MULTIPATH_PARAM config option
7e2928cacd71 nvme-multipath: change the NVME_MULTIPATH config option
64ea88e3afa8 (tag: nvme-6.15-2025-03-20, nvme/nvme-6.15, nvme-6.15) nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi)
/John
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node
2025-03-21 6:37 ` [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
2025-03-22 1:48 ` Martin K. Petersen
2025-03-25 15:21 ` John Meneghini
@ 2025-04-07 14:44 ` Christoph Hellwig
2025-04-08 14:07 ` Nilay Shroff
2 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2025-04-07 14:44 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-nvme, linux-block, hch, kbusch, hare, sagi, jmeneghi, axboe,
gjoyce
> @@ -3690,6 +3690,10 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
> ratelimit_state_init(&head->rs_nuse, 5 * HZ, 1);
> ratelimit_set_flags(&head->rs_nuse, RATELIMIT_MSG_ON_RELEASE);
> kref_init(&head->ref);
> +#ifdef CONFIG_NVME_MULTIPATH
> + if (ctrl->ops->flags & NVME_F_FABRICS)
> + set_bit(NVME_NSHEAD_FABRICS, &head->flags);
> +#endif
We might want to make the flags unconditional or move this into a helper
to avoid the ifdef'ery if we keep the flag (see below).
> - if (last_path)
> - nvme_mpath_shutdown_disk(ns->head);
> + nvme_mpath_shutdown_disk(ns->head);
I guess this function is where the shutdown naming came from, and it
probably was a bad idea even back then..
Maybe throw in an extra patch to rename it as well.
> + /*
> + * For non-fabric controllers we support delayed removal of head disk
> + * node. If we reached up to here then it means that head disk is still
> + * alive and so we assume here that even if there's no path available
> + * maybe due to the transient link failure, we could queue up the IO
> + * and later when path becomes ready we re-submit queued IO.
> + */
> + if (!(test_bit(NVME_NSHEAD_FABRICS, &head->flags)))
> + return true;
Why is this conditional on fabrics or not? The same rationale should
apply as much if not more for fabrics controllers.
Also no need for the set of braces around the test_bit() call.
> }
>
> +static void nvme_remove_head(struct nvme_ns_head *head)
> +{
> + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
> + /*
> + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
> + * to allow multipath to fail all I/O.
> + */
Full sentences are supposed to start with a capitalized word.
(yes, I saw this just move, but it's a good chance to fix it)
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 2/2] nvme-multipath: remove multipath module param
2025-03-21 6:37 ` [RFC PATCH 2/2] nvme-multipath: remove multipath module param Nilay Shroff
2025-03-25 15:09 ` John Meneghini
@ 2025-04-07 14:45 ` Christoph Hellwig
2025-04-08 14:35 ` Nilay Shroff
1 sibling, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2025-04-07 14:45 UTC (permalink / raw)
To: Nilay Shroff
Cc: linux-nvme, linux-block, hch, kbusch, hare, sagi, jmeneghi, axboe,
gjoyce
On Fri, Mar 21, 2025 at 12:07:23PM +0530, Nilay Shroff wrote:
> Remove the multipath module parameter from nvme-core and make native
> NVMe multipath support explicit. Since we now always create a multipath
> head disk node, even for single-port NVMe disks, when CONFIG_NVME_
> MULTIPATH is enabled, this module parameter is no longer needed to
> toggle the behavior.
>
> Users who prefer non-native multipath must disable CONFIG_NVME_MULTIPATH
> at compile time.
Hmm, I actually missed that in the last patch. I fear that is a huge
change people don't expect. I suspect we need to make the creation
of the head node and the delayed removal an opt-in and not the default
to keep existing behavior.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node
2025-04-07 14:44 ` Christoph Hellwig
@ 2025-04-08 14:07 ` Nilay Shroff
2025-04-09 10:43 ` Christoph Hellwig
0 siblings, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-04-08 14:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-nvme, linux-block, kbusch, hare, sagi, jmeneghi, axboe,
gjoyce
On 4/7/25 8:14 PM, Christoph Hellwig wrote:
>> @@ -3690,6 +3690,10 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
>> ratelimit_state_init(&head->rs_nuse, 5 * HZ, 1);
>> ratelimit_set_flags(&head->rs_nuse, RATELIMIT_MSG_ON_RELEASE);
>> kref_init(&head->ref);
>> +#ifdef CONFIG_NVME_MULTIPATH
>> + if (ctrl->ops->flags & NVME_F_FABRICS)
>> + set_bit(NVME_NSHEAD_FABRICS, &head->flags);
>> +#endif
>
> We might want to make the flags unconditional or move this into a helper
> to avoid the ifdef'ery if we keep the flag (see below).
Yes okay, this could be moved into a helper so that we can avoid ifdef'ery.
>
>> - if (last_path)
>> - nvme_mpath_shutdown_disk(ns->head);
>> + nvme_mpath_shutdown_disk(ns->head);
>
> I guess this function is where the shutdown naming came from, and it
> probably was a bad idea even back then..
>
> Maybe throw in an extra patch to rename it as well.
>
Sure will do.
>> + /*
>> + * For non-fabric controllers we support delayed removal of head disk
>> + * node. If we reached up to here then it means that head disk is still
>> + * alive and so we assume here that even if there's no path available
>> + * maybe due to the transient link failure, we could queue up the IO
>> + * and later when path becomes ready we re-submit queued IO.
>> + */
>> + if (!(test_bit(NVME_NSHEAD_FABRICS, &head->flags)))
>> + return true;
>
> Why is this conditional on fabrics or not? The same rationale should
> apply as much if not more for fabrics controllers.
>
For fabrics we already have options like "reconnect_delay" and
"max_reconnects". So in case of fabric link failures, we delay
the removal of the head disk node based on those options.
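For illustration only (not part of this series), those knobs are set at
connect time with nvme-cli, e.g.:

  # retry every 10s for up to 600s before giving up on the controller
  nvme connect -t tcp -a 192.168.1.10 -s 4420 -n <subsysnqn> \
      --reconnect-delay=10 --ctrl-loss-tmo=600

so a fabrics head node already survives transient link failures for the
configured window.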
> Also no need for the set of braces around the test_bit() call.
>
Yeah agreed!
>> }
>>
>> +static void nvme_remove_head(struct nvme_ns_head *head)
>> +{
>> + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>> + /*
>> + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
>> + * to allow multipath to fail all I/O.
>> + */
>
> Full sentence are supposed to start with a capitalized word.
>
> (yes, I saw this just move, but it's a good chance to fix it)
>
Yes, I will fix it in the next patch.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 2/2] nvme-multipath: remove multipath module param
2025-04-07 14:45 ` Christoph Hellwig
@ 2025-04-08 14:35 ` Nilay Shroff
2025-04-09 10:45 ` Christoph Hellwig
0 siblings, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-04-08 14:35 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-nvme, linux-block, kbusch, hare, sagi, jmeneghi, axboe,
gjoyce
On 4/7/25 8:15 PM, Christoph Hellwig wrote:
> On Fri, Mar 21, 2025 at 12:07:23PM +0530, Nilay Shroff wrote:
>> Remove the multipath module parameter from nvme-core and make native
>> NVMe multipath support explicit. Since we now always create a multipath
>> head disk node, even for single-port NVMe disks, when CONFIG_NVME_
>> MULTIPATH is enabled, this module parameter is no longer needed to
>> toggle the behavior.
>>
>> Users who prefer non-native multipath must disable CONFIG_NVME_MULTIPATH
>> at compile time.
>
> Hmm, I actually missed that in the last patch. I fear that is a huge
> change people don't expect. I suspect we need to make the creation
> of the head node and the delayed removal an opt-in and not the default
> to keep existing behavior.
Okay, we can add an option to avoid making this behavior "the default".
So do you recommend adding a module option for opting into this behavior
change, or something else?
BTW, we're not removing the nvme_core.multipath module option, as we
discussed during LSFMM; Keith and others were against removing it. So I'd
drop this second patch in the series.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node
2025-04-08 14:07 ` Nilay Shroff
@ 2025-04-09 10:43 ` Christoph Hellwig
2025-04-18 10:45 ` Nilay Shroff
0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2025-04-09 10:43 UTC (permalink / raw)
To: Nilay Shroff
Cc: Christoph Hellwig, linux-nvme, linux-block, kbusch, hare, sagi,
jmeneghi, axboe, gjoyce
On Tue, Apr 08, 2025 at 07:37:48PM +0530, Nilay Shroff wrote:
> >> + * For non-fabric controllers we support delayed removal of head disk
> >> + * node. If we reached up to here then it means that head disk is still
> >> + * alive and so we assume here that even if there's no path available
> >> + * maybe due to the transient link failure, we could queue up the IO
> >> + * and later when path becomes ready we re-submit queued IO.
> >> + */
> >> + if (!(test_bit(NVME_NSHEAD_FABRICS, &head->flags)))
> >> + return true;
> >
> > Why is this conditional on fabrics or not? The same rationale should
> > apply as much if not more for fabrics controllers.
> >
> For fabrics we already have options like "reconnect_delay" and
> "max_reconnects". So in case of fabric link failures, we delay
> the removal of the head disk node based on those options.
Yes. But having entirely different behavior for creating a multipath
node and removing it still seems like a rather bad idea.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 2/2] nvme-multipath: remove multipath module param
2025-04-08 14:35 ` Nilay Shroff
@ 2025-04-09 10:45 ` Christoph Hellwig
2025-04-18 14:22 ` Nilay Shroff
0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2025-04-09 10:45 UTC (permalink / raw)
To: Nilay Shroff
Cc: Christoph Hellwig, linux-nvme, linux-block, kbusch, hare, sagi,
jmeneghi, axboe, gjoyce
On Tue, Apr 08, 2025 at 08:05:20PM +0530, Nilay Shroff wrote:
> Okay, we can add an option to avoid making this behavior "the default".
> So do you recommend adding a module option for opting in this behavior
> change or something else?
I guess a module option as default makes sense. I'd still love to figure
out a way to have per-controller options of some kind as e.g. this
option makes very little sense for thunderbolt-attached external devices.
But unfortunately I'm a bit lost what a good interface for that would be.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node
2025-04-09 10:43 ` Christoph Hellwig
@ 2025-04-18 10:45 ` Nilay Shroff
2025-04-22 7:36 ` Christoph Hellwig
0 siblings, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-04-18 10:45 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-nvme, linux-block, kbusch, hare, sagi, jmeneghi, axboe,
gjoyce
On 4/9/25 4:13 PM, Christoph Hellwig wrote:
> On Tue, Apr 08, 2025 at 07:37:48PM +0530, Nilay Shroff wrote:
>>>> + * For non-fabric controllers we support delayed removal of head disk
>>>> + * node. If we reached up to here then it means that head disk is still
>>>> + * alive and so we assume here that even if there's no path available
>>>> + * maybe due to the transient link failure, we could queue up the IO
>>>> + * and later when path becomes ready we re-submit queued IO.
>>>> + */
>>>> + if (!(test_bit(NVME_NSHEAD_FABRICS, &head->flags)))
>>>> + return true;
>>>
>>> Why is this conditional on fabrics or not? The same rationale should
>>> apply as much if not more for fabrics controllers.
>>>
>> For fabrics we already have options like "reconnect_delay" and
>> "max_reconnects". So in case of fabric link failures, we delay
>> the removal of the head disk node based on those options.
>
> Yes. But having entirely different behavior for creating a multipath
> node and removing it still seems like a rather bad idea.
>
Yes, agreed. But as you know, for NVMeoF the connection is retried (for
instance, when the fabric link goes down) at the respective fabric driver
layer. However, for NVMe over PCIe, our proposal is to handle the delayed
removal of the multipath head node in the NVMe stacked (multipath) driver
layer. If we want to unify this behavior across PCIe and fabric controllers,
we could allow setting the delayed-removal-of-head-node option for fabric
connections as well. However, since fabrics already supports such behavior
at the driver layer (by virtue of the reconnect_delay and max_reconnects
options), IMO it may not be worthwhile to add this (delayed removal of the
multipath head node) support for fabric connections.
If the currently proposed implementation seems too conditional on fabrics,
we could change it so that there are no explicit fabrics checks in the
nvme_available_path function; it could be rewritten as follows:
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 0f54889bd483..47a3723605f6 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -428,16 +428,6 @@ static bool nvme_available_path(struct nvme_ns_head *head)
if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
return NULL;
- /*
- * For non-fabric controllers we support delayed removal of head disk
- * node. If we reached up to here then it means that head disk is still
- * alive and so we assume here that even if there's no path available
- * maybe due to the transient link failure, we could queue up the IO
- * and later when path becomes ready we re-submit queued IO.
- */
- if (!(test_bit(NVME_NSHEAD_FABRICS, &head->flags)))
- return true;
-
list_for_each_entry_srcu(ns, &head->list, siblings,
srcu_read_lock_held(&head->srcu)) {
if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
@@ -451,6 +441,10 @@ static bool nvme_available_path(struct nvme_ns_head *head)
break;
}
}
+
+ if (head->delayed_shutdown_sec)
+ return true;
+
return false;
}
static void nvme_ns_head_submit_bio(struct bio *bio)
Please note that head->delayed_shutdown_sec is exported to users via sysfs.
The default value of this variable is zero, so we allow delayed removal of
the head node only when head->delayed_shutdown_sec is set to a non-zero
value. Moreover, as discussed above, we may not want to export this option
to users if the head node refers to a fabric connection. So this option
would be exclusively available to non-fabric multipath connections.
Thanks,
--Nilay
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 2/2] nvme-multipath: remove multipath module param
2025-04-09 10:45 ` Christoph Hellwig
@ 2025-04-18 14:22 ` Nilay Shroff
2025-04-22 7:36 ` Christoph Hellwig
0 siblings, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-04-18 14:22 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-nvme, linux-block, kbusch, hare, sagi, jmeneghi, axboe,
gjoyce
On 4/9/25 4:15 PM, Christoph Hellwig wrote:
> On Tue, Apr 08, 2025 at 08:05:20PM +0530, Nilay Shroff wrote:
>> Okay, we can add an option to avoid making this behavior "the default".
>> So do you recommend adding a module option for opting in this behavior
>> change or something else?
>
> I guess a module option as default makes sense. I'd still love to figure
> out a way to have per-controller options of some kind as e.g. this
> option make very little sense for thunderbolt-attached external devices.
>
> But unfortunately I'm a bit lost what a good interface for that would be.
>
>
I don't know how to make this option per-controller because, as you know,
the head node typically refers to namespace paths, and each path in turn
refers to a different controller. If we were to make this option
per-controller, how would we handle the case where one controller has the
option set but another controller doesn't? It could be confusing.
How about a module option, "nvme_core.multipath_head_always"? The default
would be set to false. It then becomes a two-step process:
1. modprobe nvme_core multipath_head_always=Y && modprobe nvme
2. echo "<val>" > /sys/block/nvme0XnY/delayed_shutdown_sec
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node
2025-04-18 10:45 ` Nilay Shroff
@ 2025-04-22 7:36 ` Christoph Hellwig
2025-04-22 9:52 ` Nilay Shroff
0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2025-04-22 7:36 UTC (permalink / raw)
To: Nilay Shroff
Cc: Christoph Hellwig, linux-nvme, linux-block, kbusch, hare, sagi,
jmeneghi, axboe, gjoyce
On Fri, Apr 18, 2025 at 04:15:55PM +0530, Nilay Shroff wrote:
> }
> +
> + if (head->delayed_shutdown_sec)
> + return true;
> +
Please add a comment explaining this check, but otherwise that sounds
fine. Your text below is a good start for that.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 2/2] nvme-multipath: remove multipath module param
2025-04-18 14:22 ` Nilay Shroff
@ 2025-04-22 7:36 ` Christoph Hellwig
0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2025-04-22 7:36 UTC (permalink / raw)
To: Nilay Shroff
Cc: Christoph Hellwig, linux-nvme, linux-block, kbusch, hare, sagi,
jmeneghi, axboe, gjoyce
On Fri, Apr 18, 2025 at 07:52:06PM +0530, Nilay Shroff wrote:
> I don't know how to make this option per-controller as you know the
> head node, typically, refers to namespace paths and each path then
> refers to different controller. So if we were to make this option
> per controller then how could we handle it in case one controller has
> this option set but then the another controller doesn't set this
> option. It could be confusing.
>
> How about module option "nvme_core.multipath_head_always"? The default is
> set to false. So now it becomes two step process:
> 1. modprobe nvme_core multipath_head_always=Y && modprobe nvme
> 2. echo "<val>" > /sys/block/nvme0XnY/delayed_shutdown_sec
That's probably the best we can do.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node
2025-04-22 7:36 ` Christoph Hellwig
@ 2025-04-22 9:52 ` Nilay Shroff
0 siblings, 0 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-04-22 9:52 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-nvme, linux-block, kbusch, hare, sagi, jmeneghi, axboe,
gjoyce
On 4/22/25 1:06 PM, Christoph Hellwig wrote:
> On Fri, Apr 18, 2025 at 04:15:55PM +0530, Nilay Shroff wrote:
>> }
>> +
>> + if (head->delayed_shutdown_sec)
>> + return true;
>> +
>
> Please add a comment explaining this check, but otherwise that sounds
> fine. Your text below is a good start for that.
>
yup, this will be added in the next version of the patch.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, newest: 2025-04-22 9:52 UTC
Thread overview: 18+ messages
2025-03-21 6:37 [RFC PATCH 0/2] improve NVMe multipath handling Nilay Shroff
2025-03-21 6:37 ` [RFC PATCH 1/2] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
2025-03-22 1:48 ` Martin K. Petersen
2025-03-22 22:08 ` Nilay Shroff
2025-03-25 15:21 ` John Meneghini
2025-04-07 14:44 ` Christoph Hellwig
2025-04-08 14:07 ` Nilay Shroff
2025-04-09 10:43 ` Christoph Hellwig
2025-04-18 10:45 ` Nilay Shroff
2025-04-22 7:36 ` Christoph Hellwig
2025-04-22 9:52 ` Nilay Shroff
2025-03-21 6:37 ` [RFC PATCH 2/2] nvme-multipath: remove multipath module param Nilay Shroff
2025-03-25 15:09 ` John Meneghini
2025-04-07 14:45 ` Christoph Hellwig
2025-04-08 14:35 ` Nilay Shroff
2025-04-09 10:45 ` Christoph Hellwig
2025-04-18 14:22 ` Nilay Shroff
2025-04-22 7:36 ` Christoph Hellwig