From: Nilay Shroff <nilay@linux.ibm.com>
To: Sagi Grimberg <sagi@grimberg.me>, linux-nvme@lists.infradead.org
Cc: dwagner@suse.de, hch@lst.de, kbusch@kernel.org, axboe@fb.com,
gjoyce@linux.ibm.com, Hannes Reinecke <hare@suse.de>
Subject: Re: [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info
Date: Wed, 23 Oct 2024 19:01:14 +0530 [thread overview]
Message-ID: <a6bca823-646e-450b-8356-d9626339764d@linux.ibm.com> (raw)
In-Reply-To: <97c8466d-ee4c-4009-9d5e-725436c963f1@grimberg.me>
On 10/23/24 15:28, Sagi Grimberg wrote:
>>> Also, do the sysfs files always exist? What does numa_nodes output in case
>>> the policy is round-robin? Or what would the queue_depth file produce?
>>>
>> Assuming the kernel is compiled with CONFIG_NVME_MULTIPATH, the sysfs files
>> "numa_nodes" and "queue_depth" are always created once the gendisk node
>> (nvmeXcYnZ) for the per-path namespace device is added.
>>
>> If the io-policy is round-robin, the numa_nodes file would still show the
>> NUMA nodes that prefer the respective per-path namespace device, just as it
>> does when the policy is set to numa. For instance, reading
>> "/sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes" shows the NUMA nodes
>> that prefer the path nvme1c1n1 when an I/O workload targets nvme1n1.
>>
>> The queue_depth file would contain the value 0 if the io-policy is anything
>> but queue-depth. The reason is that the queue_depth file emits the number of
>> active requests currently queued up on the nvme/fabrics queue (i.e.
>> ctrl->nr_active); however, if the io-policy is not queue-depth, ->nr_active
>> typically remains zero.
>>
>> BTW, we don't expect anyone to read the numa_nodes file when the nvme
>> subsystem io-policy is NOT numa. Similarly, we don't expect the user to
>> review the queue_depth file when the io-policy is NOT set to queue-depth.
>> Isn't that so?
>
> I'm asking what it shows in case it _is_ read. The queue_depth and numa_nodes
> output seems to be a "lie" in case the policy is round-robin.
>
Yes, the output of queue_depth and numa_nodes is not useful when the io-policy
is round-robin. So do you suggest avoiding emitting any output when the
io-policy is round-robin and the user reads the numa_nodes or queue_depth file?
Similarly, when the io-policy is numa, we should refrain from emitting anything
if the user accesses the queue_depth file, and likewise when the io-policy is
queue-depth and the user accesses the numa_nodes file.
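
For example, something along these lines (just a rough sketch on my side; it
assumes it is acceptable for the show routine to peek at the subsystem
io-policy via ns->head->subsys->iopolicy):

static ssize_t queue_depth_show(struct device *dev,
		struct device_attribute *attr, char *buf)
{
	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);

	/* Emit nothing unless the subsystem io-policy is queue-depth. */
	if (READ_ONCE(ns->head->subsys->iopolicy) != NVME_IOPOLICY_QD)
		return 0;

	return sysfs_emit(buf, "%d\n", atomic_read(&ns->ctrl->nr_active));
}

numa_nodes_show() would similarly return early (emitting nothing) unless the
io-policy is NVME_IOPOLICY_NUMA. Returning 0 simply makes the read come back
empty instead of printing a potentially misleading value.
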
>>
>>>
>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>> ---
>>>> drivers/nvme/host/core.c | 3 ++
>>>> drivers/nvme/host/multipath.c | 71 +++++++++++++++++++++++++++++++++++
>>>> drivers/nvme/host/nvme.h | 20 ++++++++--
>>>> drivers/nvme/host/sysfs.c | 20 ++++++++++
>>>> 4 files changed, 110 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index 983909a600ad..6be29fd64236 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -3951,6 +3951,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>> if (!nvme_ns_head_multipath(ns->head))
>>>> nvme_cdev_del(&ns->cdev, &ns->cdev_device);
>>>> +
>>>> + nvme_mpath_remove_sysfs_link(ns);
>>>> +
>>>> del_gendisk(ns->disk);
>>>> mutex_lock(&ns->ctrl->namespaces_lock);
>>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>>> index 518e22dd4f9b..7d9c36a7a261 100644
>>>> --- a/drivers/nvme/host/multipath.c
>>>> +++ b/drivers/nvme/host/multipath.c
>>>> @@ -654,6 +654,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>>>> nvme_add_ns_head_cdev(head);
>>>> }
>>>> + nvme_mpath_add_sysfs_link(ns);
>>>> +
>>>> mutex_lock(&head->lock);
>>>> if (nvme_path_is_optimized(ns)) {
>>>> int node, srcu_idx;
>>>> @@ -922,6 +924,39 @@ static ssize_t ana_state_show(struct device *dev, struct device_attribute *attr,
>>>> }
>>>> DEVICE_ATTR_RO(ana_state);
>>>> +static ssize_t queue_depth_show(struct device *dev, struct device_attribute *attr,
>>>> + char *buf)
>>>> +{
>>>> + struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
>>>> +
>>>> + return sysfs_emit(buf, "%d\n", atomic_read(&ns->ctrl->nr_active));
>>>> +}
>>>> +DEVICE_ATTR_RO(queue_depth);
>>>> +
>>>> +static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr,
>>>> + char *buf)
>>>> +{
>>>> + int node, srcu_idx;
>>>> + nodemask_t numa_nodes;
>>>> + struct nvme_ns *current_ns;
>>>> + struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
>>>> + struct nvme_ns_head *head = ns->head;
>>>> +
>>>> + nodes_clear(numa_nodes);
>>>> +
>>>> + srcu_idx = srcu_read_lock(&head->srcu);
>>>> + for_each_node(node) {
>>>> + current_ns = srcu_dereference(head->current_path[node],
>>>> + &head->srcu);
>>>> + if (ns == current_ns)
>>>> + node_set(node, numa_nodes);
>>>> + }
>>>> + srcu_read_unlock(&head->srcu, srcu_idx);
>>>> +
>>>> + return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&numa_nodes));
>>>> +}
>>>> +DEVICE_ATTR_RO(numa_nodes);
>>>> +
>>>> static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
>>>> struct nvme_ana_group_desc *desc, void *data)
>>>> {
>>>> @@ -934,6 +969,42 @@ static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
>>>> return -ENXIO; /* just break out of the loop */
>>>> }
>>>> +void nvme_mpath_add_sysfs_link(struct nvme_ns *ns)
>>>> +{
>>>> + struct device *target;
>>>> + struct kobject *kobj;
>>>> + int rc;
>>>> +
>>>> + if (test_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
>>>> + return;
>>>> +
>>>> + target = disk_to_dev(ns->disk);
>>>> + kobj = &disk_to_dev(ns->head->disk)->kobj;
>>>> + rc = sysfs_add_link_to_group(kobj, nvme_ns_mpath_attr_group.name,
>>>> + &target->kobj, dev_name(target));
>>>> + if (unlikely(rc)) {
>>>> + dev_err(disk_to_dev(ns->head->disk),
>>>> + "failed to create link to %s\n",
>>>> + dev_name(target));
>>>> + } else
>>>> + set_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
>>> This complication is not clear at all, not sure why this is needed, and it is
>>> not documented.
>> The nvme_mpath_add_sysfs_link function is invoked from nvme_mpath_set_live() when a
>> new namespace is allocated and its ANA state is set to OPTIMIZED or NON-OPTIMIZED.
>> nvme_mpath_add_sysfs_link simply creates a link from the head disk-node (nvmeXnY)
>> device to the target nvme path-device (nvmeXcYnZ). If this looks complex, I will
>> add commentary to the code in the next patch.
>
> Still not clear why we need a flag for this? The usage suggest that we may call add/remove sysfs link
> multiple times... What about inaccessible paths? or paths that transition to/from inaccessible? They
> will add/remove their corresponding sysfs links?
The NVME_NS_SYSFS_ATTR_LINK flag is there to avoid creating the link multiple
times when the ANA state of an nvme path transitions from optimized to
non-optimized, or vice-versa. Whenever the ANA state of a path transitions to
live (i.e. optimized or non-optimized), we call nvme_mpath_set_live(), which in
turn calls nvme_mpath_add_sysfs_link(). If the sysfs link for that path already
exists and we attempt to recreate it, sysfs_add_link_to_group() complains
loudly about the duplicate by printing a stack dump; see sysfs_warn_dup().
For an inaccessible path, the current implementation does not create a sysfs
link. However, in an earlier discussion Hannes recommended[1] always creating
the link regardless of the path's ANA state, including inaccessible. I am
preparing the next patch iteration, where I will address this comment.
[1]: https://lore.kernel.org/all/3c92b9ae-1223-4787-90e3-955f82161269@suse.de/
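
To close the loop on the flag: the remove side clears it again so the link can
be recreated if the path later comes back to live. Roughly like this (a sketch
that mirrors the add side above; the actual nvme_mpath_remove_sysfs_link() in
the patch may differ in details):

void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns)
{
	struct device *target;
	struct kobject *kobj;

	if (!test_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
		return;

	target = disk_to_dev(ns->disk);
	kobj = &disk_to_dev(ns->head->disk)->kobj;
	sysfs_remove_link_from_group(kobj, nvme_ns_mpath_attr_group.name,
			dev_name(target));
	clear_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
}

With the flag in place, repeated optimized <-> non-optimized transitions add
the link only once, and the removal in nvme_ns_remove() is a no-op for a path
that never went live (and hence never got a link).
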
Thanks,
--Nilay