From: Nilay Shroff <nilay@linux.ibm.com>
To: Hannes Reinecke <hare@suse.de>, linux-nvme@lists.infradead.org
Cc: hch@lst.de, kbusch@kernel.org, sagi@grimberg.me,
jmeneghi@redhat.com, axboe@kernel.dk, martin.petersen@oracle.com,
gjoyce@ibm.com
Subject: Re: [RFC PATCHv3 1/3] nvme-multipath: introduce delayed removal of the multipath head node
Date: Tue, 6 May 2025 13:58:00 +0530
Message-ID: <bf6ba108-1ef9-4722-8091-4bb0ad0582f8@linux.ibm.com>
In-Reply-To: <dd7330a2-6499-4af7-af3a-489ab86d2b5f@suse.de>
On 5/5/25 12:09 PM, Hannes Reinecke wrote:
> On 5/4/25 19:50, Nilay Shroff wrote:
>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index eb6ea8acb3cc..0f6b01f10c70 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3560,7 +3560,7 @@ static struct nvme_ns_head *nvme_find_ns_head(struct nvme_ctrl *ctrl,
>> */
>> if (h->ns_id != nsid || !nvme_is_unique_nsid(ctrl, h))
>> continue;
>> - if (!list_empty(&h->list) && nvme_tryget_ns_head(h))
>> + if (nvme_tryget_ns_head(h))
>> return h;
>> }
>> @@ -3688,6 +3688,8 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
>> ratelimit_state_init(&head->rs_nuse, 5 * HZ, 1);
>> ratelimit_set_flags(&head->rs_nuse, RATELIMIT_MSG_ON_RELEASE);
>> kref_init(&head->ref);
>> + if (ctrl->ops->flags & NVME_F_FABRICS)
>> + nvme_mpath_set_fabrics(head);
>>
>
> Please make this a separate patch.
Yes, sure, I will do that when I spin the next patchset.
>
>> if (head->ids.csi) {
>> ret = nvme_get_effects_log(ctrl, head->ids.csi, &head->effects);
>> @@ -3804,7 +3806,8 @@ static int nvme_init_ns_head(struct nvme_ns *ns, struct nvme_ns_info *info)
>> }
>> } else {
>> ret = -EINVAL;
>> - if (!info->is_shared || !head->shared) {
>> + if ((!info->is_shared || !head->shared) &&
>> + !list_empty(&head->list)) {
>> dev_err(ctrl->device,
>> "Duplicate unshared namespace %d\n",
>> info->nsid);
>> @@ -4008,7 +4011,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>> mutex_lock(&ns->ctrl->subsys->lock);
>> list_del_rcu(&ns->siblings);
>> if (list_empty(&ns->head->list)) {
>> - list_del_init(&ns->head->entry);
>> + if (!nvme_mpath_queue_if_no_path(ns->head))
>> + list_del_init(&ns->head->entry);
>> last_path = true;
>> }
>> mutex_unlock(&ns->ctrl->subsys->lock);
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 250f3da67cc9..f3f981c9f3ec 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -442,7 +442,24 @@ static bool nvme_available_path(struct nvme_ns_head *head)
>> break;
>> }
>> }
>> - return false;
>> +
>> + /*
>> + * If "head->delayed_removal_secs" is configured (i.e., non-zero), do
>> + * not immediately fail I/O. Instead, requeue the I/O for the configured
>> + * duration, anticipating that if there's a transient link failure then
>> + * it may recover within this time window. This parameter is exported to
>> + * userspace via sysfs, and its default value is zero. It is internally
>> + * mapped to NVME_NSHEAD_QUEUE_IF_NO_PATH. When delayed_removal_secs is
>> + * non-zero, this flag is set to true. When zero, the flag is cleared.
>> + *
>> + * Note: This option is not exposed to users for NVMe-oF connections.
>> + * In fabric-based setups, fabric link failure is managed through other
>> + * parameters such as "reconnect_delay" and "max_reconnects", which are
>> + * handled at the respective fabric driver layer. Therefore,
>> + * "head->delayed_removal_secs" is intended exclusively for non-fabric
>> + * (e.g., PCIe) multipath configurations.
>> + */
>> + return nvme_mpath_queue_if_no_path(head);
>> }
>> static void nvme_ns_head_submit_bio(struct bio *bio)
>> @@ -617,6 +634,40 @@ static void nvme_requeue_work(struct work_struct *work)
>> }
>> }
>> +static void nvme_remove_head(struct nvme_ns_head *head)
>> +{
>> + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>> + /*
>> + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
>> + * to allow multipath to fail all I/O.
>> + */
>> + kblockd_schedule_work(&head->requeue_work);
>> +
>> + nvme_cdev_del(&head->cdev, &head->cdev_device);
>> + synchronize_srcu(&head->srcu);
>> + del_gendisk(head->disk);
>> + nvme_put_ns_head(head);
>> + }
>> +}
>> +
>> +static void nvme_remove_head_work(struct work_struct *work)
>> +{
>> + struct nvme_ns_head *head = container_of(to_delayed_work(work),
>> + struct nvme_ns_head, remove_work);
>> + bool shutdown = false;
>> +
>
> Don't you need to check here if the 'QUEUE_IF_NO_PATH' bit is still
> cleared?
No, we don't need to do that, because the QUEUE_IF_NO_PATH flag is
tied to the head->delayed_removal_secs setting. Specifically, the
QUEUE_IF_NO_PATH flag is set when head->delayed_removal_secs is
configured to a non-zero value, and it is cleared only when
head->delayed_removal_secs is reset to zero.

For example, if the user sets delayed_removal_secs to a non-zero
value for a head node, the QUEUE_IF_NO_PATH flag is set. Even if
all paths to a shared namespace are lost, the flag remains set
until delayed_removal_secs is reset to zero.

Hence, instead of relying solely on this flag, here we check whether
head->list is empty after the delayed removal timeout has elapsed.
If head->list is not empty, it indicates that a path to the namespace
has been recovered (e.g., the link failure was resolved in time), and
therefore we should not remove the head node.

In short, in this function checking whether head->list is empty is
sufficient to determine whether the head node should be removed or
retained.
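
Just to make the sysfs/flag mapping concrete, the store side of the
delayed_removal_secs attribute does something along these lines (a
simplified sketch, not the literal patch code; the exact locking and
helpers may differ):

static ssize_t delayed_removal_secs_store(struct device *dev,
                struct device_attribute *attr, const char *buf,
                size_t count)
{
        struct gendisk *disk = dev_to_disk(dev);
        struct nvme_ns_head *head = disk->private_data;
        unsigned int secs;
        int ret;

        ret = kstrtouint(buf, 0, &secs);
        if (ret < 0)
                return ret;

        mutex_lock(&head->subsys->lock);
        head->delayed_removal_secs = secs;
        /* map the sysfs value onto the flag, as described above */
        if (secs)
                set_bit(NVME_NSHEAD_QUEUE_IF_NO_PATH, &head->flags);
        else
                clear_bit(NVME_NSHEAD_QUEUE_IF_NO_PATH, &head->flags);
        mutex_unlock(&head->subsys->lock);

        return count;
}

So for instance writing 30 to the head node's delayed_removal_secs
attribute (e.g., under /sys/block/nvmeXnY/) sets the flag, and writing
0 clears it again.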
>
>> + mutex_lock(&head->subsys->lock);
>> + if (list_empty(&head->list)) {
>> + list_del_init(&head->entry);
>> + shutdown = true;
>> + }
>> + mutex_unlock(&head->subsys->lock);
>> + if (shutdown)
>> + nvme_remove_head(head);
>> +
>> + module_put(THIS_MODULE);
>> +}
>> +
>> int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
>> {
>> struct queue_limits lim;
>> @@ -626,6 +677,8 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
>> spin_lock_init(&head->requeue_lock);
>> INIT_WORK(&head->requeue_work, nvme_requeue_work);
>> INIT_WORK(&head->partition_scan_work, nvme_partition_scan_work);
>> + INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
>> + head->delayed_removal_secs = 0;
>> /*
>> * Add a multipath node if the subsystems supports multiple controllers.
>> @@ -659,6 +712,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
>> set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state);
>> sprintf(head->disk->disk_name, "nvme%dn%d",
>> ctrl->subsys->instance, head->instance);
>> + nvme_tryget_ns_head(head);
>
> What is that doing here?
> Why do you take an additional reference?
We take an additional reference to the head node here to ensure that
it isn't freed immediately when all paths to the shared namespace
are removed. The final reference to the head node is only released when:
- All paths to the namespace are lost, and
- No new path to the namespace has appeared even after the configured
delayed_removal_secs (if set) has elapsed.
This ensures that the head node remains valid during the grace
period, allowing recovery if a path becomes available again.
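
For completeness, the lifecycle of that extra reference looks roughly
like this (again a sketch based on this series, not the literal patch
code): when the last path goes away, we either drop the reference
immediately or hand it over to the delayed work:

void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
{
        if (!nvme_mpath_queue_if_no_path(head)) {
                /* no grace period configured: nvme_remove_head()
                 * runs now and drops the extra reference */
                nvme_remove_head(head);
                return;
        }
        if (!try_module_get(THIS_MODULE))
                return;
        /*
         * Keep the head alive for the grace period; the head->list
         * check in nvme_remove_head_work() then decides whether a
         * path came back in time or the reference is finally dropped.
         */
        mod_delayed_work(nvme_wq, &head->remove_work,
                         head->delayed_removal_secs * HZ);
}

The try_module_get() here pairs with the module_put() you can see at
the end of nvme_remove_head_work() in the hunk above.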
Thanks,
--Nilay