* [RFC PATCHv2 0/3] improve NVMe multipath handling
@ 2025-04-25 10:33 Nilay Shroff
2025-04-25 10:33 ` [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-04-25 10:33 UTC (permalink / raw)
To: linux-nvme, linux-block
Cc: hch, kbusch, hare, sagi, jmeneghi, axboe, martin.petersen, gjoyce
Hi,
This patch series introduces improvements to NVMe multipath handling by
refining the removal behavior of the multipath head node and simplifying
configuration options. The idea/POC for this change was originally
proposed by Christoph[1] and Keith[2]. I built upon their original
idea/POC to implement this series.
The first patch in the series addresses an issue where the multipath
head node of a PCIe NVMe disk is removed immediately when all disk paths
are lost. This can cause problems in scenarios such as:
- Hot removal and re-addition of a disk.
- Transient PCIe link failures that trigger re-enumeration,
briefly removing and restoring the disk.
In such cases, premature removal of the head node may result in a device
node name change, requiring applications to reopen device handles if
they were performing I/O during the failure. To mitigate this, we
introduce a delayed removal mechanism. Instead of removing the head node
immediately, the system waits for a configurable timeout, allowing the
disk to recover. If the disk comes back online within this window, the
head node remains unchanged, ensuring uninterrupted workloads.
A new sysfs attribute, delayed_removal_secs, allows users to configure
this timeout. By default, it is set to 0 seconds, preserving the
existing behavior unless explicitly changed.
The second patch in the series introduces the multipath_head_always
module param. When this option is set, it forces the creation of a
multipath head disk node even for single-ported NVMe disks or private
namespaces, and thus allows delayed head node removal. This helps handle
transient PCIe link failures transparently even for a single-ported NVMe
disk or a private namespace.
The third patch in the series makes no functional changes; it merely
renames a few functions, which improves code readability and better
aligns function names with their actual roles.
These changes should help improve NVMe multipath reliability and simplify
configuration. Feedback and testing are welcome!
[1] https://lore.kernel.org/linux-nvme/Y9oGTKCFlOscbPc2@infradead.org/
[2] https://lore.kernel.org/linux-nvme/Y+1aKcQgbskA2tra@kbusch-mbp.dhcp.thefacebook.com/
Changes from v1:
- Renamed delayed_shutdown_sec to delayed_removal_secs, as "shutdown"
has a special meaning when used with an NVMe device (Martin Petersen)
- Instead of always adding the mpath head disk node by default, added a
new module option nvme_core.multipath_head_always which, when set,
creates the mpath head disk node (even for a private namespace or a
namespace backed by a single-ported NVMe disk). This way we preserve
the old default behavior. (hch)
- Renamed the nvme_mpath_shutdown_disk function because, in the NVMe
context, the term "shutdown" has a specific technical meaning. (hch)
- Undid the changes which removed the multipath module param, as this
param is still useful and used for many different things.
Link to v1: https://lore.kernel.org/all/20250321063901.747605-1-nilay@linux.ibm.com/
Nilay Shroff (3):
nvme-multipath: introduce delayed removal of the multipath head node
nvme: introduce multipath_head_always module param
nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
drivers/nvme/host/core.c | 18 ++--
drivers/nvme/host/multipath.c | 193 ++++++++++++++++++++++++++++++----
drivers/nvme/host/nvme.h | 20 +++-
drivers/nvme/host/sysfs.c | 13 +++
4 files changed, 211 insertions(+), 33 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 18+ messages in thread* [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node 2025-04-25 10:33 [RFC PATCHv2 0/3] improve NVMe multipath handling Nilay Shroff @ 2025-04-25 10:33 ` Nilay Shroff 2025-04-25 14:43 ` Christoph Hellwig 2025-04-25 22:26 ` Sagi Grimberg 2025-04-25 10:33 ` [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param Nilay Shroff 2025-04-25 10:33 ` [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk Nilay Shroff 2 siblings, 2 replies; 18+ messages in thread From: Nilay Shroff @ 2025-04-25 10:33 UTC (permalink / raw) To: linux-nvme, linux-block Cc: hch, kbusch, hare, sagi, jmeneghi, axboe, martin.petersen, gjoyce Currently, the multipath head node of a PCIe NVMe disk is removed immediately as soon as all paths of the disk are removed. However, this can cause issues in scenarios where: - The disk hot-removal followed by re-addition. - Transient PCIe link failures that trigger re-enumeration, temporarily removing and then restoring the disk. In these cases, removing the head node prematurely may lead to a head disk node name change upon re-addition, requiring applications to reopen their handles if they were performing I/O during the failure. To address this, introduce a delayed removal mechanism of head disk node. During transient failure, instead of immediate removal of head disk node, the system waits for a configurable timeout, allowing the disk to recover. During transient disk failure, if application sends any IO then we queue it instead of failing such IO immediately. If the disk comes back online within the timeout, the queued IOs are resubmitted to the disk ensuring seamless operation. In case disk couldn't recover from the failure then queued IOs are failed to its completion and application receives the error. 
So this way, if disk comes back online within the configured period, the head node remains unchanged, ensuring uninterrupted workloads without requiring applications to reopen device handles. A new sysfs attribute, named "delayed_removal_secs" is added under head disk blkdev for user who wish to configure time for the delayed removal of head disk node. The default value of this attribute is set to zero second ensuring no behavior change unless explicitly configured. Link: https://lore.kernel.org/linux-nvme/Y9oGTKCFlOscbPc2@infradead.org/ Link: https://lore.kernel.org/linux-nvme/Y+1aKcQgbskA2tra@kbusch-mbp.dhcp.thefacebook.com/ Suggested-by: Keith Busch <kbusch@kernel.org> Suggested-by: Christoph Hellwig <hch@infradead.org> [nilay: reworked based on the original idea/POC from Christoph and Keith] Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> --- drivers/nvme/host/core.c | 16 ++--- drivers/nvme/host/multipath.c | 119 +++++++++++++++++++++++++++++++--- drivers/nvme/host/nvme.h | 12 ++++ drivers/nvme/host/sysfs.c | 13 ++++ 4 files changed, 141 insertions(+), 19 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index eb6ea8acb3cc..5755069e6974 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3560,7 +3560,7 @@ static struct nvme_ns_head *nvme_find_ns_head(struct nvme_ctrl *ctrl, */ if (h->ns_id != nsid || !nvme_is_unique_nsid(ctrl, h)) continue; - if (!list_empty(&h->list) && nvme_tryget_ns_head(h)) + if (nvme_tryget_ns_head(h)) return h; } @@ -3688,6 +3688,8 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl, ratelimit_state_init(&head->rs_nuse, 5 * HZ, 1); ratelimit_set_flags(&head->rs_nuse, RATELIMIT_MSG_ON_RELEASE); kref_init(&head->ref); + if (ctrl->opts) + nvme_mpath_set_fabrics(head); if (head->ids.csi) { ret = nvme_get_effects_log(ctrl, head->ids.csi, &head->effects); @@ -3804,7 +3806,8 @@ static int nvme_init_ns_head(struct nvme_ns *ns, struct nvme_ns_info *info) } } else { ret = 
-EINVAL; - if (!info->is_shared || !head->shared) { + if ((!info->is_shared || !head->shared) && + !list_empty(&head->list)) { dev_err(ctrl->device, "Duplicate unshared namespace %d\n", info->nsid); @@ -3986,8 +3989,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info) static void nvme_ns_remove(struct nvme_ns *ns) { - bool last_path = false; - if (test_and_set_bit(NVME_NS_REMOVING, &ns->flags)) return; @@ -4007,10 +4008,6 @@ static void nvme_ns_remove(struct nvme_ns *ns) mutex_lock(&ns->ctrl->subsys->lock); list_del_rcu(&ns->siblings); - if (list_empty(&ns->head->list)) { - list_del_init(&ns->head->entry); - last_path = true; - } mutex_unlock(&ns->ctrl->subsys->lock); /* guarantee not available in head->list */ @@ -4028,8 +4025,7 @@ static void nvme_ns_remove(struct nvme_ns *ns) mutex_unlock(&ns->ctrl->namespaces_lock); synchronize_srcu(&ns->ctrl->srcu); - if (last_path) - nvme_mpath_shutdown_disk(ns->head); + nvme_mpath_shutdown_disk(ns->head); nvme_put_ns(ns); } diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 250f3da67cc9..68318337c275 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -442,6 +442,23 @@ static bool nvme_available_path(struct nvme_ns_head *head) break; } } + + /* + * If "head->delayed_removal_secs" is configured (i.e., non-zero), do + * not immediately fail I/O. Instead, requeue the I/O for the configured + * duration, anticipating that if there's a transient link failure then + * it may recover within this time window. This parameter is exported to + * userspace via sysfs, and its default value is zero. + * Note: This option is not exposed to the users for NVMeoF connections. + * In fabric-based setups, fabric link failure is managed through other + * parameters such as "reconnect_delay" and "max_reconnects" which is + * handled at respective fabric driver layer. 
Therefor, head->delayed_ + * removal_secs" is intended exclusively for non-fabric (e.g., PCIe) + * multipath configurations. + */ + if (head->delayed_removal_secs) + return true; + return false; } @@ -617,6 +634,40 @@ static void nvme_requeue_work(struct work_struct *work) } } +static void nvme_remove_head(struct nvme_ns_head *head) +{ + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { + /* + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared + * to allow multipath to fail all I/O. + */ + kblockd_schedule_work(&head->requeue_work); + + nvme_cdev_del(&head->cdev, &head->cdev_device); + synchronize_srcu(&head->srcu); + del_gendisk(head->disk); + nvme_put_ns_head(head); + } +} + +static void nvme_remove_head_work(struct work_struct *work) +{ + struct nvme_ns_head *head = container_of(to_delayed_work(work), + struct nvme_ns_head, remove_work); + bool shutdown = false; + + mutex_lock(&head->subsys->lock); + if (list_empty(&head->list)) { + list_del_init(&head->entry); + shutdown = true; + } + mutex_unlock(&head->subsys->lock); + if (shutdown) + nvme_remove_head(head); + + module_put(THIS_MODULE); +} + int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) { struct queue_limits lim; @@ -626,6 +677,8 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) spin_lock_init(&head->requeue_lock); INIT_WORK(&head->requeue_work, nvme_requeue_work); INIT_WORK(&head->partition_scan_work, nvme_partition_scan_work); + INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work); + head->delayed_removal_secs = 0; /* * Add a multipath node if the subsystems supports multiple controllers. 
@@ -659,6 +712,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state); sprintf(head->disk->disk_name, "nvme%dn%d", ctrl->subsys->instance, head->instance); + nvme_tryget_ns_head(head); return 0; } @@ -1015,6 +1069,40 @@ static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr } DEVICE_ATTR_RO(numa_nodes); +static ssize_t delayed_removal_secs_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct gendisk *disk = dev_to_disk(dev); + struct nvme_ns_head *head = disk->private_data; + int ret; + + mutex_lock(&head->subsys->lock); + ret = sysfs_emit(buf, "%u\n", head->delayed_removal_secs); + mutex_unlock(&head->subsys->lock); + return ret; +} + +static ssize_t delayed_removal_secs_store(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) +{ + struct gendisk *disk = dev_to_disk(dev); + struct nvme_ns_head *head = disk->private_data; + unsigned int sec; + int ret; + + ret = kstrtouint(buf, 0, &sec); + if (ret < 0) + return ret; + + mutex_lock(&head->subsys->lock); + head->delayed_removal_secs = sec; + mutex_unlock(&head->subsys->lock); + + return count; +} + +DEVICE_ATTR_RW(delayed_removal_secs); + static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl, struct nvme_ana_group_desc *desc, void *data) { @@ -1138,18 +1226,31 @@ void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid) void nvme_mpath_shutdown_disk(struct nvme_ns_head *head) { - if (!head->disk) - return; - if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { - nvme_cdev_del(&head->cdev, &head->cdev_device); + bool shutdown = false; + + mutex_lock(&head->subsys->lock); + + if (!list_empty(&head->list)) + goto out; + + if (head->delayed_removal_secs) { /* - * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared - * to allow multipath to fail all I/O. 
+ * Ensure that no one could remove this module while the head + * remove work is pending. */ - synchronize_srcu(&head->srcu); - kblockd_schedule_work(&head->requeue_work); - del_gendisk(head->disk); + if (!try_module_get(THIS_MODULE)) + goto out; + queue_delayed_work(nvme_wq, &head->remove_work, + head->delayed_removal_secs * HZ); + } else { + list_del_init(&head->entry); + if (head->disk) + shutdown = true; } +out: + mutex_unlock(&head->subsys->lock); + if (shutdown) + nvme_remove_head(head); } void nvme_mpath_remove_disk(struct nvme_ns_head *head) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 51e078642127..74cd569882ce 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -503,7 +503,10 @@ struct nvme_ns_head { struct work_struct partition_scan_work; struct mutex lock; unsigned long flags; + struct delayed_work remove_work; + unsigned int delayed_removal_secs; #define NVME_NSHEAD_DISK_LIVE 0 +#define NVME_NSHEAD_FABRICS 1 struct nvme_ns __rcu *current_path[]; #endif }; @@ -986,12 +989,17 @@ extern struct device_attribute dev_attr_ana_grpid; extern struct device_attribute dev_attr_ana_state; extern struct device_attribute dev_attr_queue_depth; extern struct device_attribute dev_attr_numa_nodes; +extern struct device_attribute dev_attr_delayed_removal_secs; extern struct device_attribute subsys_attr_iopolicy; static inline bool nvme_disk_is_ns_head(struct gendisk *disk) { return disk->fops == &nvme_ns_head_ops; } +static inline void nvme_mpath_set_fabrics(struct nvme_ns_head *head) +{ + set_bit(NVME_NSHEAD_FABRICS, &head->flags); +} #else #define multipath false static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl) @@ -1078,6 +1086,10 @@ static inline void nvme_mpath_end_request(struct request *rq) static inline bool nvme_disk_is_ns_head(struct gendisk *disk) { return false; +} +static inline void nvme_mpath_set_fabrics(struct nvme_ns_head *head) +{ + } #endif /* CONFIG_NVME_MULTIPATH */ diff --git 
a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c index 6d31226f7a4f..51633670d177 100644 --- a/drivers/nvme/host/sysfs.c +++ b/drivers/nvme/host/sysfs.c @@ -260,6 +260,7 @@ static struct attribute *nvme_ns_attrs[] = { &dev_attr_ana_state.attr, &dev_attr_queue_depth.attr, &dev_attr_numa_nodes.attr, + &dev_attr_delayed_removal_secs.attr, #endif &dev_attr_io_passthru_err_log_enabled.attr, NULL, @@ -296,6 +297,18 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj, if (nvme_disk_is_ns_head(dev_to_disk(dev))) return 0; } + if (a == &dev_attr_delayed_removal_secs.attr) { + struct nvme_ns_head *head = dev_to_ns_head(dev); + struct gendisk *disk = dev_to_disk(dev); + + /* + * This attribute is only valid for head node and non-fabric + * setup. + */ + if (!nvme_disk_is_ns_head(disk) || + test_bit(NVME_NSHEAD_FABRICS, &head->flags)) + return 0; + } #endif return a->mode; } -- 2.49.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node 2025-04-25 10:33 ` [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff @ 2025-04-25 14:43 ` Christoph Hellwig 2025-04-28 7:05 ` Nilay Shroff 2025-04-25 22:26 ` Sagi Grimberg 1 sibling, 1 reply; 18+ messages in thread From: Christoph Hellwig @ 2025-04-25 14:43 UTC (permalink / raw) To: Nilay Shroff Cc: linux-nvme, linux-block, hch, kbusch, hare, sagi, jmeneghi, axboe, martin.petersen, gjoyce On Fri, Apr 25, 2025 at 04:03:08PM +0530, Nilay Shroff wrote: > @@ -4007,10 +4008,6 @@ static void nvme_ns_remove(struct nvme_ns *ns) > > mutex_lock(&ns->ctrl->subsys->lock); > list_del_rcu(&ns->siblings); > - if (list_empty(&ns->head->list)) { > - list_del_init(&ns->head->entry); > - last_path = true; > - } > mutex_unlock(&ns->ctrl->subsys->lock); > > /* guarantee not available in head->list */ > @@ -4028,8 +4025,7 @@ static void nvme_ns_remove(struct nvme_ns *ns) > mutex_unlock(&ns->ctrl->namespaces_lock); > synchronize_srcu(&ns->ctrl->srcu); > > - if (last_path) > - nvme_mpath_shutdown_disk(ns->head); > + nvme_mpath_shutdown_disk(ns->head); This now removes the head list deletion from the first critical section into nvme_mpath_shutdown_disk. I remember we had to do it the way it currently is done because we were running into issues otherwise, the commit history might help a bit with what the issues were. > + if (a == &dev_attr_delayed_removal_secs.attr) { > + struct nvme_ns_head *head = dev_to_ns_head(dev); > + struct gendisk *disk = dev_to_disk(dev); > + > + /* > + * This attribute is only valid for head node and non-fabric > + * setup. > + */ > + if (!nvme_disk_is_ns_head(disk) || > + test_bit(NVME_NSHEAD_FABRICS, &head->flags)) > + return 0; > + } Didn't we say we also want to allow fabrics to use it if explicitly configured? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node 2025-04-25 14:43 ` Christoph Hellwig @ 2025-04-28 7:05 ` Nilay Shroff 0 siblings, 0 replies; 18+ messages in thread From: Nilay Shroff @ 2025-04-28 7:05 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-nvme, linux-block, kbusch, hare, sagi, jmeneghi, axboe, martin.petersen, gjoyce On 4/25/25 8:13 PM, Christoph Hellwig wrote: > On Fri, Apr 25, 2025 at 04:03:08PM +0530, Nilay Shroff wrote: >> @@ -4007,10 +4008,6 @@ static void nvme_ns_remove(struct nvme_ns *ns) >> >> mutex_lock(&ns->ctrl->subsys->lock); >> list_del_rcu(&ns->siblings); >> - if (list_empty(&ns->head->list)) { >> - list_del_init(&ns->head->entry); >> - last_path = true; >> - } >> mutex_unlock(&ns->ctrl->subsys->lock); >> >> /* guarantee not available in head->list */ >> @@ -4028,8 +4025,7 @@ static void nvme_ns_remove(struct nvme_ns *ns) >> mutex_unlock(&ns->ctrl->namespaces_lock); >> synchronize_srcu(&ns->ctrl->srcu); >> >> - if (last_path) >> - nvme_mpath_shutdown_disk(ns->head); >> + nvme_mpath_shutdown_disk(ns->head); > > This now removes the head list deletion from the first critical > section into nvme_mpath_shutdown_disk. I remember we had to do it > the way it currently is done because we were running into issues > otherwise, the commit history might help a bit with what the issues > were. > Thank you for the pointers! Yes, I checked the commit history and found this commit 9edceaf43050 ("nvme: avoid race in shutdown namespace removal") which avoids the race between nvme_find_ns_head and nvme_ns_remove. And looking at the commit history it seems we also need to handle that case here. I think if user doesn't configure delayed removal of head node then the existing behavior should be preserved to avoid above mentioned race. However if user configures the delayed head node removal then we would delay the head->entry removal until the delayed node removal time elapses. 
I'd handle this case in next patch. >> + if (a == &dev_attr_delayed_removal_secs.attr) { >> + struct nvme_ns_head *head = dev_to_ns_head(dev); >> + struct gendisk *disk = dev_to_disk(dev); >> + >> + /* >> + * This attribute is only valid for head node and non-fabric >> + * setup. >> + */ >> + if (!nvme_disk_is_ns_head(disk) || >> + test_bit(NVME_NSHEAD_FABRICS, &head->flags)) >> + return 0; >> + } > > Didn't we say we also want to allow fabrics to use it if explicitly > configured? > Hmm, I think there's some confusion here. I thought we discussed that for fabric setup, fabric link failure is managed through other parameters such as "reconnect_delay" and "max_reconnects" which is handled at respective fabric driver layer. Therefor, head->delayed_removal_secs is intended exclusively for non-fabric (e.g., PCIe) multipath configurations. Even if we were to support delayed head node removal for fabric setup then I wonder who would handle fabric (re-)connect command? As once we exhaust number of reconnect attempts, we stop the reconnect work and delete the fabric controller. So do you think we should still expose head->delayed_removal_secs for fabric setup? If yes, then it'd add further additional delay (without re-connect attempt) for fabric link failures, isn't it? Thanks, --Nilay ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node 2025-04-25 10:33 ` [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff 2025-04-25 14:43 ` Christoph Hellwig @ 2025-04-25 22:26 ` Sagi Grimberg 2025-04-28 7:39 ` Nilay Shroff 1 sibling, 1 reply; 18+ messages in thread From: Sagi Grimberg @ 2025-04-25 22:26 UTC (permalink / raw) To: Nilay Shroff, linux-nvme, linux-block Cc: hch, kbusch, hare, jmeneghi, axboe, martin.petersen, gjoyce > > @@ -3688,6 +3688,8 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl, > ratelimit_state_init(&head->rs_nuse, 5 * HZ, 1); > ratelimit_set_flags(&head->rs_nuse, RATELIMIT_MSG_ON_RELEASE); > kref_init(&head->ref); > + if (ctrl->opts) While this check is correct, perhaps checking ctrl->ops->flags & NVME_F_FABRICS would be clearer. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node 2025-04-25 22:26 ` Sagi Grimberg @ 2025-04-28 7:39 ` Nilay Shroff 0 siblings, 0 replies; 18+ messages in thread From: Nilay Shroff @ 2025-04-28 7:39 UTC (permalink / raw) To: Sagi Grimberg, linux-nvme, linux-block Cc: hch, kbusch, hare, jmeneghi, axboe, martin.petersen, gjoyce On 4/26/25 3:56 AM, Sagi Grimberg wrote: >> @@ -3688,6 +3688,8 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl, >> ratelimit_state_init(&head->rs_nuse, 5 * HZ, 1); >> ratelimit_set_flags(&head->rs_nuse, RATELIMIT_MSG_ON_RELEASE); >> kref_init(&head->ref); >> + if (ctrl->opts) > > While this check is correct, perhaps checking ctrl->ops->flags & NVME_F_FABRICS > would be clearer. > Sure, I'd update this in the next patch. Thanks, --Nilay ^ permalink raw reply [flat|nested] 18+ messages in thread
* [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param 2025-04-25 10:33 [RFC PATCHv2 0/3] improve NVMe multipath handling Nilay Shroff 2025-04-25 10:33 ` [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff @ 2025-04-25 10:33 ` Nilay Shroff 2025-04-25 14:45 ` Christoph Hellwig 2025-04-28 6:57 ` Hannes Reinecke 2025-04-25 10:33 ` [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk Nilay Shroff 2 siblings, 2 replies; 18+ messages in thread From: Nilay Shroff @ 2025-04-25 10:33 UTC (permalink / raw) To: linux-nvme, linux-block Cc: hch, kbusch, hare, sagi, jmeneghi, axboe, martin.petersen, gjoyce Currently, a multipath head disk node is not created for single-ported NVMe adapters or private namespaces. However, creating a head node in these cases can help transparently handle transient PCIe link failures. Without a head node, features like delayed removal cannot be leveraged, making it difficult to tolerate such link failures. To address this, this commit introduces nvme_core module parameter multipath_head_always. When this param is set to true, it forces the creation of a multipath head node regardless NVMe disk or namespace type. So this option allows the use of delayed removal of head node functionality even for single- ported NVMe disks and private namespaces and thus helps transparently handling transient PCIe link failures. By default multipath_head_always is set to false, thus preserving the existing behavior. Setting it to true enables improved fault tolerance in PCIe setups. Moreover, please note that enabling this option would also implicitly enable nvme_core.multipath. 
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> --- drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++---- 1 file changed, 63 insertions(+), 7 deletions(-) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 68318337c275..1acdbbddfe01 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -10,10 +10,59 @@ #include "nvme.h" bool multipath = true; -module_param(multipath, bool, 0444); +bool multipath_head_always; /* default is flase */ + +static int multipath_param_set(const char *val, const struct kernel_param *kp) +{ + int ret; + + ret = param_set_bool(val, kp); + if (ret) + return ret; + + if (multipath_head_always && !*(bool *)kp->arg) { + pr_err("Can't disable multipath when multipath_head_always is configured.\n"); + *(bool *)kp->arg = true; + return -EINVAL; + } + + return 0; +} + +static const struct kernel_param_ops multipath_param_ops = { + .set = multipath_param_set, + .get = param_get_bool, +}; + +module_param_cb(multipath, &multipath_param_ops, &multipath, 0444); MODULE_PARM_DESC(multipath, "turn on native support for multiple controllers per subsystem"); +static int multipath_head_always_set(const char *val, + const struct kernel_param *kp) +{ + int ret; + + ret = param_set_bool(val, kp); + if (ret < 0) + return ret; + + if (*(bool *)kp->arg) + multipath = true; + + return 0; +} + +static const struct kernel_param_ops multipath_head_always_param_ops = { + .set = multipath_head_always_set, + .get = param_get_bool, +}; + +module_param_cb(multipath_head_always, &multipath_head_always_param_ops, + &multipath_head_always, 0444); +MODULE_PARM_DESC(multipath_head_always, + "create multipath head node always; note that this also implicitly enables native multipath support"); + static const char *nvme_iopolicy_names[] = { [NVME_IOPOLICY_NUMA] = "numa", [NVME_IOPOLICY_RR] = "round-robin", @@ -681,13 +730,20 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) 
head->delayed_removal_secs = 0; /* - * Add a multipath node if the subsystems supports multiple controllers. - * We also do this for private namespaces as the namespace sharing flag - * could change after a rescan. + * If multipath_head_always is configured then we add a multipath head + * disk node irrespective of disk is single/multi ported or namespace is + * shared/private. */ - if (!(ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL) || - !nvme_is_unique_nsid(ctrl, head) || !multipath) - return 0; + if (!multipath_head_always) { + /* + * Add a multipath node if the subsystems supports multiple + * controllers. We also do this for private namespaces as the + * namespace sharing flag could change after a rescan. + */ + if (!(ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL) || + !nvme_is_unique_nsid(ctrl, head) || !multipath) + return 0; + } blk_set_stacking_limits(&lim); lim.dma_alignment = 3; -- 2.49.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param 2025-04-25 10:33 ` [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param Nilay Shroff @ 2025-04-25 14:45 ` Christoph Hellwig 2025-04-29 6:26 ` Nilay Shroff 2025-04-28 6:57 ` Hannes Reinecke 1 sibling, 1 reply; 18+ messages in thread From: Christoph Hellwig @ 2025-04-25 14:45 UTC (permalink / raw) To: Nilay Shroff Cc: linux-nvme, linux-block, hch, kbusch, hare, sagi, jmeneghi, axboe, martin.petersen, gjoyce On Fri, Apr 25, 2025 at 04:03:09PM +0530, Nilay Shroff wrote: > +static int multipath_param_set(const char *val, const struct kernel_param *kp) > +{ > + int ret; > + > + ret = param_set_bool(val, kp); > + if (ret) > + return ret; > + > + if (multipath_head_always && !*(bool *)kp->arg) { > + pr_err("Can't disable multipath when multipath_head_always is configured.\n"); > + *(bool *)kp->arg = true; This reads much nicer if you add a local bool * variable and avoid all the casting, i.e. bool *arg = kp->arg; ... if (multipath_head_always && !*kp->arg) { pr_err("Can't disable multipath when multipath_head_always is configured.\n"); *arg = true; > +static int multipath_head_always_set(const char *val, > + const struct kernel_param *kp) > +{ > + int ret; > + > + ret = param_set_bool(val, kp); > + if (ret < 0) > + return ret; > + > + if (*(bool *)kp->arg) > + multipath = true; Same here. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param 2025-04-25 14:45 ` Christoph Hellwig @ 2025-04-29 6:26 ` Nilay Shroff 0 siblings, 0 replies; 18+ messages in thread From: Nilay Shroff @ 2025-04-29 6:26 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-nvme, linux-block, kbusch, hare, sagi, jmeneghi, axboe, martin.petersen, gjoyce On 4/25/25 8:15 PM, Christoph Hellwig wrote: > On Fri, Apr 25, 2025 at 04:03:09PM +0530, Nilay Shroff wrote: >> +static int multipath_param_set(const char *val, const struct kernel_param *kp) >> +{ >> + int ret; >> + >> + ret = param_set_bool(val, kp); >> + if (ret) >> + return ret; >> + >> + if (multipath_head_always && !*(bool *)kp->arg) { >> + pr_err("Can't disable multipath when multipath_head_always is configured.\n"); >> + *(bool *)kp->arg = true; > > This reads much nicer if you add a local bool * variable and avoid > all the casting, i.e. > > bool *arg = kp->arg; > > ... > > if (multipath_head_always && !*kp->arg) { > pr_err("Can't disable multipath when multipath_head_always is configured.\n"); > *arg = true; > >> +static int multipath_head_always_set(const char *val, >> + const struct kernel_param *kp) >> +{ >> + int ret; >> + >> + ret = param_set_bool(val, kp); >> + if (ret < 0) >> + return ret; >> + >> + if (*(bool *)kp->arg) >> + multipath = true; > > Same here. > Agreed, will change in the next patch. Thanks, --Nilay ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
  2025-04-25 10:33 ` [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param Nilay Shroff
  2025-04-25 14:45   ` Christoph Hellwig
@ 2025-04-28  6:57   ` Hannes Reinecke
  2025-04-28  7:39     ` Nilay Shroff
  1 sibling, 1 reply; 18+ messages in thread
From: Hannes Reinecke @ 2025-04-28  6:57 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme, linux-block
  Cc: hch, kbusch, sagi, jmeneghi, axboe, martin.petersen, gjoyce

On 4/25/25 12:33, Nilay Shroff wrote:
> Currently, a multipath head disk node is not created for single-ported
> NVMe adapters or private namespaces. However, creating a head node in
> these cases can help transparently handle transient PCIe link failures.
> Without a head node, features like delayed removal cannot be leveraged,
> making it difficult to tolerate such link failures. To address this,
> this commit introduces nvme_core module parameter multipath_head_always.
> 
> When this param is set to true, it forces the creation of a multipath
> head node regardless of NVMe disk or namespace type. So this option
> allows the use of the delayed head node removal functionality even for
> single-ported NVMe disks and private namespaces and thus helps
> transparently handle transient PCIe link failures.
> 
> By default multipath_head_always is set to false, thus preserving the
> existing behavior. Setting it to true enables improved fault tolerance
> in PCIe setups. Moreover, please note that enabling this option would
> also implicitly enable nvme_core.multipath.
> 
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
>  drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++----
>  1 file changed, 63 insertions(+), 7 deletions(-)
> 
I really would model this according to dm-multipath, where we have the
'fail_if_no_path' flag. This can be set for PCIe devices to retain the
current behaviour (which we need for things like 'md' on top of NVMe)
whenever this flag is set.

And it might be an idea to rename this flag to 'multipath_always_on',
as 'multipath_head_always' might be confusing for people not familiar
with the internal layout of the nvme multipath driver.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
  2025-04-28  6:57   ` Hannes Reinecke
@ 2025-04-28  7:39     ` Nilay Shroff
  2025-04-29  5:49       ` Hannes Reinecke
  0 siblings, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-04-28  7:39 UTC (permalink / raw)
  To: Hannes Reinecke, linux-nvme, linux-block
  Cc: hch, kbusch, sagi, jmeneghi, axboe, martin.petersen, gjoyce

On 4/28/25 12:27 PM, Hannes Reinecke wrote:
> On 4/25/25 12:33, Nilay Shroff wrote:
>> Currently, a multipath head disk node is not created for single-ported
>> NVMe adapters or private namespaces. However, creating a head node in
>> these cases can help transparently handle transient PCIe link failures.
>> Without a head node, features like delayed removal cannot be leveraged,
>> making it difficult to tolerate such link failures. To address this,
>> this commit introduces nvme_core module parameter multipath_head_always.
>>
>> When this param is set to true, it forces the creation of a multipath
>> head node regardless of NVMe disk or namespace type. So this option
>> allows the use of the delayed head node removal functionality even for
>> single-ported NVMe disks and private namespaces and thus helps
>> transparently handle transient PCIe link failures.
>>
>> By default multipath_head_always is set to false, thus preserving the
>> existing behavior. Setting it to true enables improved fault tolerance
>> in PCIe setups. Moreover, please note that enabling this option would
>> also implicitly enable nvme_core.multipath.
>>
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>>  drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++----
>>  1 file changed, 63 insertions(+), 7 deletions(-)
>>
> I really would model this according to dm-multipath, where we have the
> 'fail_if_no_path' flag.
> This can be set for PCIe devices to retain the current behaviour
> (which we need for things like 'md' on top of NVMe) whenever this
> flag is set.
> 
Okay, so you mean that when the sysfs attribute "delayed_removal_secs"
under the head disk node is _NOT_ configured (or delayed_removal_secs
is set to zero), the internal flag "fail_if_no_path" is set to true.
However, in the other case, when "delayed_removal_secs" is set to a
non-zero value, we set "fail_if_no_path" to false. Is that correct?

> And it might be an idea to rename this flag to 'multipath_always_on',
> as 'multipath_head_always' might be confusing for people not familiar
> with the internal layout of the nvme multipath driver.
> 
Okay, I like this "multipath_always_on" module param. I'd rename it in
the next patch.

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
  2025-04-28  7:39     ` Nilay Shroff
@ 2025-04-29  5:49       ` Hannes Reinecke
  2025-04-29  6:24         ` Nilay Shroff
  0 siblings, 1 reply; 18+ messages in thread
From: Hannes Reinecke @ 2025-04-29  5:49 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme, linux-block
  Cc: hch, kbusch, sagi, jmeneghi, axboe, martin.petersen, gjoyce

On 4/28/25 09:39, Nilay Shroff wrote:
> On 4/28/25 12:27 PM, Hannes Reinecke wrote:
>> On 4/25/25 12:33, Nilay Shroff wrote:
>>> Currently, a multipath head disk node is not created for single-ported
>>> NVMe adapters or private namespaces. However, creating a head node in
>>> these cases can help transparently handle transient PCIe link failures.
>>> Without a head node, features like delayed removal cannot be leveraged,
>>> making it difficult to tolerate such link failures. To address this,
>>> this commit introduces nvme_core module parameter multipath_head_always.
>>>
>>> When this param is set to true, it forces the creation of a multipath
>>> head node regardless of NVMe disk or namespace type. So this option
>>> allows the use of the delayed head node removal functionality even for
>>> single-ported NVMe disks and private namespaces and thus helps
>>> transparently handle transient PCIe link failures.
>>>
>>> By default multipath_head_always is set to false, thus preserving the
>>> existing behavior. Setting it to true enables improved fault tolerance
>>> in PCIe setups. Moreover, please note that enabling this option would
>>> also implicitly enable nvme_core.multipath.
>>>
>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>> ---
>>>  drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++----
>>>  1 file changed, 63 insertions(+), 7 deletions(-)
>>>
>> I really would model this according to dm-multipath, where we have the
>> 'fail_if_no_path' flag.
>> This can be set for PCIe devices to retain the current behaviour
>> (which we need for things like 'md' on top of NVMe) whenever this
>> flag is set.
>>
> Okay, so you mean that when the sysfs attribute "delayed_removal_secs"
> under the head disk node is _NOT_ configured (or delayed_removal_secs
> is set to zero), the internal flag "fail_if_no_path" is set to true.
> However, in the other case, when "delayed_removal_secs" is set to a
> non-zero value, we set "fail_if_no_path" to false. Is that correct?
> 
Don't make it overly complicated.
'fail_if_no_path' (and the inverse 'queue_if_no_path') can both be
mapped onto delayed_removal_secs; if the value is '0' then the head
disk is immediately removed (the 'fail_if_no_path' case), and if it's
-1 it is never removed (the 'queue_if_no_path' case).

Question, though: how does it interact with the existing
'ctrl_loss_tmo'? Both describe essentially the same situation...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
  2025-04-29  5:49       ` Hannes Reinecke
@ 2025-04-29  6:24         ` Nilay Shroff
  2025-04-29  7:01           ` Hannes Reinecke
  0 siblings, 1 reply; 18+ messages in thread
From: Nilay Shroff @ 2025-04-29  6:24 UTC (permalink / raw)
  To: Hannes Reinecke, linux-nvme, linux-block
  Cc: hch, kbusch, sagi, jmeneghi, axboe, martin.petersen, gjoyce

On 4/29/25 11:19 AM, Hannes Reinecke wrote:
> On 4/28/25 09:39, Nilay Shroff wrote:
>> On 4/28/25 12:27 PM, Hannes Reinecke wrote:
>>> On 4/25/25 12:33, Nilay Shroff wrote:
>>>> Currently, a multipath head disk node is not created for single-ported
>>>> NVMe adapters or private namespaces. However, creating a head node in
>>>> these cases can help transparently handle transient PCIe link failures.
>>>> Without a head node, features like delayed removal cannot be leveraged,
>>>> making it difficult to tolerate such link failures. To address this,
>>>> this commit introduces nvme_core module parameter multipath_head_always.
>>>>
>>>> When this param is set to true, it forces the creation of a multipath
>>>> head node regardless of NVMe disk or namespace type. So this option
>>>> allows the use of the delayed head node removal functionality even for
>>>> single-ported NVMe disks and private namespaces and thus helps
>>>> transparently handle transient PCIe link failures.
>>>>
>>>> By default multipath_head_always is set to false, thus preserving the
>>>> existing behavior. Setting it to true enables improved fault tolerance
>>>> in PCIe setups. Moreover, please note that enabling this option would
>>>> also implicitly enable nvme_core.multipath.
>>>>
>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>> ---
>>>>  drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++----
>>>>  1 file changed, 63 insertions(+), 7 deletions(-)
>>>>
>>> I really would model this according to dm-multipath, where we have the
>>> 'fail_if_no_path' flag.
>>> This can be set for PCIe devices to retain the current behaviour
>>> (which we need for things like 'md' on top of NVMe) whenever this
>>> flag is set.
>>>
>> Okay, so you mean that when the sysfs attribute "delayed_removal_secs"
>> under the head disk node is _NOT_ configured (or delayed_removal_secs
>> is set to zero), the internal flag "fail_if_no_path" is set to true.
>> However, in the other case, when "delayed_removal_secs" is set to a
>> non-zero value, we set "fail_if_no_path" to false. Is that correct?
>>
> Don't make it overly complicated.
> 'fail_if_no_path' (and the inverse 'queue_if_no_path') can both be
> mapped onto delayed_removal_secs; if the value is '0' then the head
> disk is immediately removed (the 'fail_if_no_path' case), and if it's
> -1 it is never removed (the 'queue_if_no_path' case).
> 
Yes, if the value of delayed_removal_secs is 0 then the head is
immediately removed. However, if the value of delayed_removal_secs is
anything but zero (i.e. greater than zero, as delayed_removal_secs is
unsigned), then the head is removed only after delayed_removal_secs has
elapsed and the disk couldn't recover from the transient link failure.
We never pin the head node indefinitely.

> Question, though: how does it interact with the existing
> 'ctrl_loss_tmo'? Both describe essentially the same situation...
> 
The delayed_removal_secs attribute is modeled for NVMe PCIe adapters,
so it really doesn't interact or interfere with ctrl_loss_tmo, which is
a fabric controller option.

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
  2025-04-29  6:24         ` Nilay Shroff
@ 2025-04-29  7:01           ` Hannes Reinecke
  2025-04-29  7:15             ` Nilay Shroff
  0 siblings, 1 reply; 18+ messages in thread
From: Hannes Reinecke @ 2025-04-29  7:01 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme, linux-block
  Cc: hch, kbusch, sagi, jmeneghi, axboe, martin.petersen, gjoyce

On 4/29/25 08:24, Nilay Shroff wrote:
> On 4/29/25 11:19 AM, Hannes Reinecke wrote:
>> On 4/28/25 09:39, Nilay Shroff wrote:
>>> On 4/28/25 12:27 PM, Hannes Reinecke wrote:
>>>> On 4/25/25 12:33, Nilay Shroff wrote:
>>>>> Currently, a multipath head disk node is not created for single-ported
>>>>> NVMe adapters or private namespaces. However, creating a head node in
>>>>> these cases can help transparently handle transient PCIe link failures.
>>>>> Without a head node, features like delayed removal cannot be leveraged,
>>>>> making it difficult to tolerate such link failures. To address this,
>>>>> this commit introduces nvme_core module parameter multipath_head_always.
>>>>>
>>>>> When this param is set to true, it forces the creation of a multipath
>>>>> head node regardless of NVMe disk or namespace type. So this option
>>>>> allows the use of the delayed head node removal functionality even for
>>>>> single-ported NVMe disks and private namespaces and thus helps
>>>>> transparently handle transient PCIe link failures.
>>>>>
>>>>> By default multipath_head_always is set to false, thus preserving the
>>>>> existing behavior. Setting it to true enables improved fault tolerance
>>>>> in PCIe setups. Moreover, please note that enabling this option would
>>>>> also implicitly enable nvme_core.multipath.
>>>>>
>>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>>> ---
>>>>>  drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++----
>>>>>  1 file changed, 63 insertions(+), 7 deletions(-)
>>>>>
>>>> I really would model this according to dm-multipath, where we have the
>>>> 'fail_if_no_path' flag.
>>>> This can be set for PCIe devices to retain the current behaviour
>>>> (which we need for things like 'md' on top of NVMe) whenever this
>>>> flag is set.
>>>>
>>> Okay, so you mean that when the sysfs attribute "delayed_removal_secs"
>>> under the head disk node is _NOT_ configured (or delayed_removal_secs
>>> is set to zero), the internal flag "fail_if_no_path" is set to true.
>>> However, in the other case, when "delayed_removal_secs" is set to a
>>> non-zero value, we set "fail_if_no_path" to false. Is that correct?
>>>
>> Don't make it overly complicated.
>> 'fail_if_no_path' (and the inverse 'queue_if_no_path') can both be
>> mapped onto delayed_removal_secs; if the value is '0' then the head
>> disk is immediately removed (the 'fail_if_no_path' case), and if it's
>> -1 it is never removed (the 'queue_if_no_path' case).
>>
> Yes, if the value of delayed_removal_secs is 0 then the head is
> immediately removed. However, if the value of delayed_removal_secs is
> anything but zero (i.e. greater than zero, as delayed_removal_secs is
> unsigned), then the head is removed only after delayed_removal_secs has
> elapsed and the disk couldn't recover from the transient link failure.
> We never pin the head node indefinitely.
> 
>> Question, though: how does it interact with the existing
>> 'ctrl_loss_tmo'? Both describe essentially the same situation...
>>
> The delayed_removal_secs attribute is modeled for NVMe PCIe adapters,
> so it really doesn't interact or interfere with ctrl_loss_tmo, which is
> a fabric controller option.
> 
Not so sure here.
You _could_ expand the scope of ctrl_loss_tmo to PCI, too; as most PCI
devices will only ever have one controller, 'ctrl_loss_tmo' will be
identical to 'delayed_removal_secs'.

So I guess my question is: is there a value for fabrics to control the
lifetime of struct ns_head independent of the lifetime of the
controller?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
  2025-04-29  7:01           ` Hannes Reinecke
@ 2025-04-29  7:15             ` Nilay Shroff
  0 siblings, 0 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-04-29  7:15 UTC (permalink / raw)
  To: Hannes Reinecke, linux-nvme, linux-block
  Cc: hch, kbusch, sagi, jmeneghi, axboe, martin.petersen, gjoyce

On 4/29/25 12:31 PM, Hannes Reinecke wrote:
> On 4/29/25 08:24, Nilay Shroff wrote:
>> On 4/29/25 11:19 AM, Hannes Reinecke wrote:
>>> On 4/28/25 09:39, Nilay Shroff wrote:
>>>> On 4/28/25 12:27 PM, Hannes Reinecke wrote:
>>>>> On 4/25/25 12:33, Nilay Shroff wrote:
>>>>>> Currently, a multipath head disk node is not created for single-ported
>>>>>> NVMe adapters or private namespaces. However, creating a head node in
>>>>>> these cases can help transparently handle transient PCIe link failures.
>>>>>> Without a head node, features like delayed removal cannot be leveraged,
>>>>>> making it difficult to tolerate such link failures. To address this,
>>>>>> this commit introduces nvme_core module parameter multipath_head_always.
>>>>>>
>>>>>> When this param is set to true, it forces the creation of a multipath
>>>>>> head node regardless of NVMe disk or namespace type. So this option
>>>>>> allows the use of the delayed head node removal functionality even for
>>>>>> single-ported NVMe disks and private namespaces and thus helps
>>>>>> transparently handle transient PCIe link failures.
>>>>>>
>>>>>> By default multipath_head_always is set to false, thus preserving the
>>>>>> existing behavior. Setting it to true enables improved fault tolerance
>>>>>> in PCIe setups. Moreover, please note that enabling this option would
>>>>>> also implicitly enable nvme_core.multipath.
>>>>>>
>>>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>>>> ---
>>>>>>  drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++----
>>>>>>  1 file changed, 63 insertions(+), 7 deletions(-)
>>>>>>
>>>>> I really would model this according to dm-multipath, where we have the
>>>>> 'fail_if_no_path' flag.
>>>>> This can be set for PCIe devices to retain the current behaviour
>>>>> (which we need for things like 'md' on top of NVMe) whenever this
>>>>> flag is set.
>>>>>
>>>> Okay, so you mean that when the sysfs attribute "delayed_removal_secs"
>>>> under the head disk node is _NOT_ configured (or delayed_removal_secs
>>>> is set to zero), the internal flag "fail_if_no_path" is set to true.
>>>> However, in the other case, when "delayed_removal_secs" is set to a
>>>> non-zero value, we set "fail_if_no_path" to false. Is that correct?
>>>>
>>> Don't make it overly complicated.
>>> 'fail_if_no_path' (and the inverse 'queue_if_no_path') can both be
>>> mapped onto delayed_removal_secs; if the value is '0' then the head
>>> disk is immediately removed (the 'fail_if_no_path' case), and if it's
>>> -1 it is never removed (the 'queue_if_no_path' case).
>>>
>> Yes, if the value of delayed_removal_secs is 0 then the head is
>> immediately removed. However, if the value of delayed_removal_secs is
>> anything but zero (i.e. greater than zero, as delayed_removal_secs is
>> unsigned), then the head is removed only after delayed_removal_secs has
>> elapsed and the disk couldn't recover from the transient link failure.
>> We never pin the head node indefinitely.
>>
>>> Question, though: how does it interact with the existing
>>> 'ctrl_loss_tmo'? Both describe essentially the same situation...
>>>
>> The delayed_removal_secs attribute is modeled for NVMe PCIe adapters,
>> so it really doesn't interact or interfere with ctrl_loss_tmo, which is
>> a fabric controller option.
>>
> Not so sure here.
> You _could_ expand the scope of ctrl_loss_tmo to PCI, too; as most PCI
> devices will only ever have one controller, 'ctrl_loss_tmo' will be
> identical to 'delayed_removal_secs'.
> 
> So I guess my question is: is there a value for fabrics to control the
> lifetime of struct ns_head independent of the lifetime of the
> controller?
> 
The ctrl_loss_tmo option doesn't actually control the lifetime of
ns_head. In fact, ctrl_loss_tmo allows fabric I/O commands to fail fast
so that they don't get stuck while the host NVMe-oF controller is in
the reconnect state. Users may not want to wait long while the fabric
controller sits in the reconnect state after losing its connection to
the target. Typically, the default reconnect timeout is 10 minutes,
which is way longer than the expected timeout of 30 seconds for any I/O
command to fail. You may find more details in commit 8c4dfea97f15
("nvme-fabrics: reject I/O to offline device"), which implements
ctrl_loss_tmo.

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 18+ messages in thread
* [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  2025-04-25 10:33 [RFC PATCHv2 0/3] improve NVMe multipath handling Nilay Shroff
  2025-04-25 10:33 ` [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
  2025-04-25 10:33 ` [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param Nilay Shroff
@ 2025-04-25 10:33 ` Nilay Shroff
  2025-04-25 14:46   ` Christoph Hellwig
  2025-04-25 22:27   ` Sagi Grimberg
  2 siblings, 2 replies; 18+ messages in thread
From: Nilay Shroff @ 2025-04-25 10:33 UTC (permalink / raw)
  To: linux-nvme, linux-block
  Cc: hch, kbusch, hare, sagi, jmeneghi, axboe, martin.petersen, gjoyce

In the NVMe context, the term "shutdown" has a specific technical
meaning. To avoid confusion, this commit renames the
nvme_mpath_shutdown_disk function to nvme_mpath_remove_disk to better
reflect its purpose (i.e. removing the disk from the system).

However, nvme_mpath_remove_disk was already in use, and its
functionality is related to releasing or putting the head node disk. To
resolve this naming conflict and improve clarity, the existing
nvme_mpath_remove_disk function is also renamed to nvme_mpath_put_disk.

This renaming improves code readability and better aligns function
names with their actual roles.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/core.c      |  4 ++--
 drivers/nvme/host/multipath.c | 16 ++++++++--------
 drivers/nvme/host/nvme.h      |  8 ++++----
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5755069e6974..306ac36eb812 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -664,7 +664,7 @@ static void nvme_free_ns_head(struct kref *ref)
 	struct nvme_ns_head *head =
 		container_of(ref, struct nvme_ns_head, ref);
 
-	nvme_mpath_remove_disk(head);
+	nvme_mpath_put_disk(head);
 	ida_free(&head->subsys->ns_ida, head->instance);
 	cleanup_srcu_struct(&head->srcu);
 	nvme_put_subsystem(head->subsys);
@@ -4025,7 +4025,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	mutex_unlock(&ns->ctrl->namespaces_lock);
 	synchronize_srcu(&ns->ctrl->srcu);
 
-	nvme_mpath_shutdown_disk(ns->head);
+	nvme_mpath_remove_disk(ns->head);
 	nvme_put_ns(ns);
 }
 
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 1acdbbddfe01..a392b4825dc3 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -703,15 +703,15 @@ static void nvme_remove_head_work(struct work_struct *work)
 {
 	struct nvme_ns_head *head = container_of(to_delayed_work(work),
 			struct nvme_ns_head, remove_work);
-	bool shutdown = false;
+	bool remove = false;
 
 	mutex_lock(&head->subsys->lock);
 	if (list_empty(&head->list)) {
 		list_del_init(&head->entry);
-		shutdown = true;
+		remove = true;
 	}
 	mutex_unlock(&head->subsys->lock);
-	if (shutdown)
+	if (remove)
 		nvme_remove_head(head);
 	module_put(THIS_MODULE);
 
@@ -1280,9 +1280,9 @@ void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
 #endif
 }
 
-void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
+void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 {
-	bool shutdown = false;
+	bool remove = false;
 
 	mutex_lock(&head->subsys->lock);
 
@@ -1301,15 +1301,15 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
 	} else {
 		list_del_init(&head->entry);
 		if (head->disk)
-			shutdown = true;
+			remove = true;
 	}
 out:
 	mutex_unlock(&head->subsys->lock);
-	if (shutdown)
+	if (remove)
 		nvme_remove_head(head);
 }
 
-void nvme_mpath_remove_disk(struct nvme_ns_head *head)
+void nvme_mpath_put_disk(struct nvme_ns_head *head)
 {
 	if (!head->disk)
 		return;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 74cd569882ce..591df076522d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -963,7 +963,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
 void nvme_mpath_add_sysfs_link(struct nvme_ns_head *ns);
 void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns);
 void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid);
-void nvme_mpath_remove_disk(struct nvme_ns_head *head);
+void nvme_mpath_put_disk(struct nvme_ns_head *head);
 int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id);
 void nvme_mpath_init_ctrl(struct nvme_ctrl *ctrl);
 void nvme_mpath_update(struct nvme_ctrl *ctrl);
@@ -972,7 +972,7 @@ void nvme_mpath_stop(struct nvme_ctrl *ctrl);
 bool nvme_mpath_clear_current_path(struct nvme_ns *ns);
 void nvme_mpath_revalidate_paths(struct nvme_ns *ns);
 void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
-void nvme_mpath_shutdown_disk(struct nvme_ns_head *head);
+void nvme_mpath_remove_disk(struct nvme_ns_head *head);
 void nvme_mpath_start_request(struct request *rq);
 void nvme_mpath_end_request(struct request *rq);
 
@@ -1020,7 +1020,7 @@ static inline int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,
 static inline void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
 {
 }
-static inline void nvme_mpath_remove_disk(struct nvme_ns_head *head)
+static inline void nvme_mpath_put_disk(struct nvme_ns_head *head)
 {
 }
 static inline void nvme_mpath_add_sysfs_link(struct nvme_ns *ns)
@@ -1039,7 +1039,7 @@ static inline void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
 static inline void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
 {
 }
-static inline void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
+static inline void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 {
 }
 static inline void nvme_trace_bio_complete(struct request *req)
-- 
2.49.0

^ permalink raw reply related	[flat|nested] 18+ messages in thread
* Re: [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  2025-04-25 10:33 ` [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk Nilay Shroff
@ 2025-04-25 14:46   ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2025-04-25 14:46 UTC (permalink / raw)
  To: Nilay Shroff
  Cc: linux-nvme, linux-block, hch, kbusch, hare, sagi, jmeneghi,
	axboe, martin.petersen, gjoyce

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  2025-04-25 10:33 ` [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk Nilay Shroff
  2025-04-25 14:46   ` Christoph Hellwig
@ 2025-04-25 22:27   ` Sagi Grimberg
  1 sibling, 0 replies; 18+ messages in thread
From: Sagi Grimberg @ 2025-04-25 22:27 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme, linux-block
  Cc: hch, kbusch, hare, jmeneghi, axboe, martin.petersen, gjoyce

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 18+ messages in thread
end of thread, other threads:[~2025-04-29  7:16 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-25 10:33 [RFC PATCHv2 0/3] improve NVMe multipath handling Nilay Shroff
2025-04-25 10:33 ` [RFC PATCHv2 1/3] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
2025-04-25 14:43   ` Christoph Hellwig
2025-04-28  7:05     ` Nilay Shroff
2025-04-25 22:26   ` Sagi Grimberg
2025-04-28  7:39     ` Nilay Shroff
2025-04-25 10:33 ` [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param Nilay Shroff
2025-04-25 14:45   ` Christoph Hellwig
2025-04-29  6:26     ` Nilay Shroff
2025-04-28  6:57   ` Hannes Reinecke
2025-04-28  7:39     ` Nilay Shroff
2025-04-29  5:49       ` Hannes Reinecke
2025-04-29  6:24         ` Nilay Shroff
2025-04-29  7:01           ` Hannes Reinecke
2025-04-29  7:15             ` Nilay Shroff
2025-04-25 10:33 ` [RFC PATCHv2 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk Nilay Shroff
2025-04-25 14:46   ` Christoph Hellwig
2025-04-25 22:27   ` Sagi Grimberg