* [PATCHv4 RFC 0/1] Add visibility for native NVMe multipath using sysfs
@ 2024-09-11 6:26 Nilay Shroff
2024-09-11 6:26 ` [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info Nilay Shroff
2024-09-24 6:41 ` [PATCHv4 RFC 0/1] Add visibility for native NVMe multipath using sysfs Nilay Shroff
0 siblings, 2 replies; 14+ messages in thread
From: Nilay Shroff @ 2024-09-11 6:26 UTC (permalink / raw)
To: linux-nvme; +Cc: dwagner, hch, kbusch, sagi, axboe, gjoyce, Nilay Shroff
Hi,
This patch proposes adding new sysfs attributes to provide visibility
into native multipath I/O.
The first version of this RFC[1] proposed using debugfs for visibility;
however, the general feedback was to instead export the multipath I/O
information using sysfs attributes and then later parse and format those
sysfs attributes using libnvme/nvme-cli.
The second version of this RFC[2] used sysfs; however, the sysfs attribute
file contained multiple lines of output, and the feedback was to instead
follow the principle of one value per attribute.
The third version of this RFC[3] follows the one-value-per-attribute
principle. There was a review comment about using the srcu read lock while
dereferencing the namespace for each node, which is protected by the srcu
lock.
So now the fourth version of this RFC ensures that the namespace
dereference code is protected by the srcu read lock.
As we know, NVMe native multipath supports three different I/O policies
(numa, round-robin and queue-depth) for selecting the I/O path; however, we
don't have any visibility into which path is being selected by the multipath
code for forwarding I/O. This RFC helps add that visibility by adding new
sysfs attribute files named "numa_nodes" and "queue_depth" under each
namespace block device path /sys/block/nvmeXcYnZ/. We also create a
"multipath" sysfs directory under the head disk node and then, from this
directory, add a link to each namespace path device this head disk node
points to.
Please find below output generated with this proposed RFC patch applied on
a system with two multi-controller PCIe NVMe disks attached to it. This
system is also an NVMf-TCP host which is connected to an NVMf-TCP target
over two NIC cards. This system has four numa nodes online when the below
output was captured:
# cat /sys/devices/system/node/online
0-3
# lscpu
<snip>
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s):
NUMA node1 CPU(s): 0-7
NUMA node2 CPU(s): 8-31
NUMA node3 CPU(s): 32-63
<snip>
Please note that NUMA node 0, though online, doesn't have any CPUs
currently assigned to it.
# nvme list -v
Subsystem Subsystem-NQN Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys1 nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057 nvme0, nvme1
nvme-subsys3 nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1 nvme2, nvme3
nvme-subsys4 nvmet_subsystem nvme4, nvme5
Device Cntlid SN MN FR TxPort Address Slot Subsystem Namespaces
---------------- ------ -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme0 66 S6RTNE0R900057 3.2TB NVMe Gen4 U.2 SSD III REV.SN66 pcie 052e:78:00.0 U50EE.001.WZS000E-P3-C14-R1 nvme-subsys1 nvme1n1
nvme1 65 S6RTNE0R900057 3.2TB NVMe Gen4 U.2 SSD III REV.SN66 pcie 058e:78:00.0 U50EE.001.WZS000E-P3-C14-R2 nvme-subsys1 nvme1n1
nvme2 2 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0524:28:00.0 U50EE.001.WZS000E-P3-C4-R1 nvme-subsys3 nvme3n1
nvme3 1 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0584:28:00.0 U50EE.001.WZS000E-P3-C4-R2 nvme-subsys3 nvme3n1
nvme4 1 a224673364d1dcb6fab9 Linux 6.9.0-rc tcp traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100 nvme-subsys4 nvme4n1
nvme5 2 a224673364d1dcb6fab9 Linux 6.9.0-rc tcp traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100 nvme-subsys4 nvme4n1
Device Generic NSID Usage Format Controllers
----------------- ----------------- ---------- -------------------------- ---------------- ----------------
/dev/nvme1n1 /dev/ng1n1 0x1 5.75 GB / 5.75 GB 4 KiB + 0 B nvme0, nvme1
/dev/nvme3n1 /dev/ng3n1 0x2 0.00 B / 5.75 GB 4 KiB + 0 B nvme2, nvme3
/dev/nvme4n1 /dev/ng4n1 0x1 5.75 GB / 5.75 GB 4 KiB + 0 B nvme4, nvme5
# nvme show-topology
nvme-subsys1 - NQN=nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057
hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
iopolicy=numa
\
+- ns 1
\
+- nvme0 pcie 052e:78:00.0 live optimized
+- nvme1 pcie 058e:78:00.0 live optimized
nvme-subsys3 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1
hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
iopolicy=round-robin
\
+- ns 2
\
+- nvme2 pcie 0524:28:00.0 live optimized
+- nvme3 pcie 0584:28:00.0 live optimized
nvme-subsys4 - NQN=nvmet_subsystem
hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
iopolicy=queue-depth
\
+- ns 1
\
+- nvme4 tcp traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100 live optimized
+- nvme5 tcp traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100 live optimized
As we can see above, we have three shared namespaces created. In terms of
iopolicy, we have "numa" configured for nvme-subsys1, "round-robin"
configured for nvme-subsys3 and "queue-depth" configured for nvme-subsys4.
Now, under each namespace "head disk node", we create a sysfs attribute
group named "multipath". The "multipath" group then links to each path
this head disk node points to:
# tree /sys/block/nvme1n1/multipath/
/sys/block/nvme1n1/multipath/
├── nvme1c0n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme0/nvme1c0n1
└── nvme1c1n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme1/nvme1c1n1
# tree /sys/block/nvme3n1/multipath/
/sys/block/nvme3n1/multipath/
├── nvme3c2n1 -> ../../../../../pci0524:28/0524:28:00.0/nvme/nvme2/nvme3c2n1
└── nvme3c3n1 -> ../../../../../pci0584:28/0584:28:00.0/nvme/nvme3/nvme3c3n1
# tree /sys/block/nvme4n1/multipath/
/sys/block/nvme4n1/multipath/
├── nvme4c4n1 -> ../../../../nvme-fabrics/ctl/nvme4/nvme4c4n1
└── nvme4c5n1 -> ../../../../nvme-fabrics/ctl/nvme5/nvme4c5n1
One can easily infer from the above output that for the "round-robin"
I/O policy, configured under nvme-subsys3, the I/O workload targeted at
nvme3n1 would toggle across nvme3c2n1 and nvme3c3n1 assuming the ana state
of each path is optimized (as can be seen in the output of show-topology).
For the numa I/O policy, configured under nvme-subsys1, the "numa_nodes"
attribute file shows the NUMA nodes preferred by the respective
namespace path. The numa_nodes value is a comma-delimited list of nodes
or an A-B range of nodes.
# cat /sys/block/nvme1n1/multipath/nvme1c0n1/numa_nodes
0-1
# cat /sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes
2-3
From the above output, one can easily infer that I/O workload targeted at
nvme1n1 and running on numa nodes 0 and 1 would use path nvme1c0n1.
Similarly, I/O workload running on numa nodes 2 and 3 would use path
nvme1c1n1.
For queue-depth I/O policy, configured under nvme-subsys4, the "queue_depth"
attribute file shows the number of active/in-flight I/O requests currently
queued for each path.
# cat /sys/block/nvme4n1/multipath/nvme4c4n1/queue_depth
518
# cat /sys/block/nvme4n1/multipath/nvme4c5n1/queue_depth
504
From the above output, one can easily infer that I/O workload targeted at
nvme4n1 uses two paths nvme4c4n1 and nvme4c5n1 and the current queue depth
of each path is 518 and 504 respectively.
[1] https://lore.kernel.org/all/20240722093124.42581-1-nilay@linux.ibm.com/
[2] https://lore.kernel.org/all/20240809173030.2281021-2-nilay@linux.ibm.com/
[3] https://lore.kernel.org/all/20240903135228.283820-1-nilay@linux.ibm.com/
Changes since v3:
- Protect the namespace dereference code with srcu read lock (Daniel Wagner)
Changes since v2:
- Use one value per sysfs attribute (Keith Busch)
Changes since v1:
- Use sysfs to export multipath I/O information instead of debugfs
Nilay Shroff (1):
nvme-multipath: Add sysfs attributes for showing multipath info
drivers/nvme/host/core.c | 3 ++
drivers/nvme/host/multipath.c | 69 +++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 20 ++++++++--
drivers/nvme/host/sysfs.c | 20 ++++++++++
4 files changed, 108 insertions(+), 4 deletions(-)
--
2.45.2
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info
2024-09-11 6:26 [PATCHv4 RFC 0/1] Add visibility for native NVMe multipath using sysfs Nilay Shroff
@ 2024-09-11 6:26 ` Nilay Shroff
2024-10-07 10:14 ` Hannes Reinecke
2024-10-20 23:17 ` Sagi Grimberg
2024-09-24 6:41 ` [PATCHv4 RFC 0/1] Add visibility for native NVMe multipath using sysfs Nilay Shroff
1 sibling, 2 replies; 14+ messages in thread
From: Nilay Shroff @ 2024-09-11 6:26 UTC (permalink / raw)
To: linux-nvme; +Cc: dwagner, hch, kbusch, sagi, axboe, gjoyce, Nilay Shroff
NVMe native multipath supports different I/O policies for selecting the
I/O path; however, we don't have any visibility into which path is being
selected by the multipath code for forwarding I/O.
This patch helps add that visibility by adding new sysfs attribute files
named "numa_nodes" and "queue_depth" under each namespace block device
path /sys/block/nvmeXcYnZ/. We also create a "multipath" sysfs directory
under head disk node and then from this directory add a link to each
namespace path device this head disk node points to.
For instance, /sys/block/nvmeXnY/multipath/ would contain a soft link to
each path the head disk node <nvmeXnY> points to:
$ ls -1 /sys/block/nvme1n1/multipath/
nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
For the round-robin I/O policy, we can easily infer from the above output
that I/O workload targeted at nvme1n1 would toggle across paths nvme1c1n1
and nvme1c3n1.
For the numa I/O policy, the "numa_nodes" attribute file shows the NUMA
nodes preferred by the respective block device path. The numa_nodes value
is a comma-delimited list of nodes or an A-B range of nodes.
For queue-depth I/O policy, the "queue_depth" attribute file shows the
number of active/in-flight I/O requests currently queued for each path.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 3 ++
drivers/nvme/host/multipath.c | 71 +++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 20 ++++++++--
drivers/nvme/host/sysfs.c | 20 ++++++++++
4 files changed, 110 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 983909a600ad..6be29fd64236 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3951,6 +3951,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
if (!nvme_ns_head_multipath(ns->head))
nvme_cdev_del(&ns->cdev, &ns->cdev_device);
+
+ nvme_mpath_remove_sysfs_link(ns);
+
del_gendisk(ns->disk);
mutex_lock(&ns->ctrl->namespaces_lock);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 518e22dd4f9b..7d9c36a7a261 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -654,6 +654,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
nvme_add_ns_head_cdev(head);
}
+ nvme_mpath_add_sysfs_link(ns);
+
mutex_lock(&head->lock);
if (nvme_path_is_optimized(ns)) {
int node, srcu_idx;
@@ -922,6 +924,39 @@ static ssize_t ana_state_show(struct device *dev, struct device_attribute *attr,
}
DEVICE_ATTR_RO(ana_state);
+static ssize_t queue_depth_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+
+ return sysfs_emit(buf, "%d\n", atomic_read(&ns->ctrl->nr_active));
+}
+DEVICE_ATTR_RO(queue_depth);
+
+static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ int node, srcu_idx;
+ nodemask_t numa_nodes;
+ struct nvme_ns *current_ns;
+ struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+ struct nvme_ns_head *head = ns->head;
+
+ nodes_clear(numa_nodes);
+
+ srcu_idx = srcu_read_lock(&head->srcu);
+ for_each_node(node) {
+ current_ns = srcu_dereference(head->current_path[node],
+ &head->srcu);
+ if (ns == current_ns)
+ node_set(node, numa_nodes);
+ }
+ srcu_read_unlock(&head->srcu, srcu_idx);
+
+ return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&numa_nodes));
+}
+DEVICE_ATTR_RO(numa_nodes);
+
static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
struct nvme_ana_group_desc *desc, void *data)
{
@@ -934,6 +969,42 @@ static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
return -ENXIO; /* just break out of the loop */
}
+void nvme_mpath_add_sysfs_link(struct nvme_ns *ns)
+{
+ struct device *target;
+ struct kobject *kobj;
+ int rc;
+
+ if (test_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
+ return;
+
+ target = disk_to_dev(ns->disk);
+ kobj = &disk_to_dev(ns->head->disk)->kobj;
+ rc = sysfs_add_link_to_group(kobj, nvme_ns_mpath_attr_group.name,
+ &target->kobj, dev_name(target));
+ if (unlikely(rc)) {
+ dev_err(disk_to_dev(ns->head->disk),
+ "failed to create link to %s\n",
+ dev_name(target));
+ } else
+ set_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
+}
+
+void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns)
+{
+ struct device *target;
+ struct kobject *kobj;
+
+ if (!test_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
+ return;
+
+ target = disk_to_dev(ns->disk);
+ kobj = &disk_to_dev(ns->head->disk)->kobj;
+ sysfs_remove_link_from_group(kobj, nvme_ns_mpath_attr_group.name,
+ dev_name(target));
+ clear_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
+}
+
void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
{
if (nvme_ctrl_use_ana(ns->ctrl)) {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index da57947130cc..dfd3b23bcc6f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -522,10 +522,11 @@ struct nvme_ns {
struct nvme_ns_head *head;
unsigned long flags;
-#define NVME_NS_REMOVING 0
-#define NVME_NS_ANA_PENDING 2
-#define NVME_NS_FORCE_RO 3
-#define NVME_NS_READY 4
+#define NVME_NS_REMOVING 0
+#define NVME_NS_ANA_PENDING 2
+#define NVME_NS_FORCE_RO 3
+#define NVME_NS_READY 4
+#define NVME_NS_SYSFS_ATTR_LINK 5
struct cdev cdev;
struct device cdev_device;
@@ -921,6 +922,7 @@ int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
extern const struct attribute_group *nvme_ns_attr_groups[];
+extern const struct attribute_group nvme_ns_mpath_attr_group;
extern const struct pr_ops nvme_pr_ops;
extern const struct block_device_operations nvme_ns_head_ops;
extern const struct attribute_group nvme_dev_attrs_group;
@@ -943,6 +945,8 @@ void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys);
void nvme_failover_req(struct request *req);
void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
+void nvme_mpath_add_sysfs_link(struct nvme_ns *ns);
+void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns);
void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid);
void nvme_mpath_remove_disk(struct nvme_ns_head *head);
int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id);
@@ -968,6 +972,8 @@ static inline void nvme_trace_bio_complete(struct request *req)
extern bool multipath;
extern struct device_attribute dev_attr_ana_grpid;
extern struct device_attribute dev_attr_ana_state;
+extern struct device_attribute dev_attr_queue_depth;
+extern struct device_attribute dev_attr_numa_nodes;
extern struct device_attribute subsys_attr_iopolicy;
static inline bool nvme_disk_is_ns_head(struct gendisk *disk)
@@ -997,6 +1003,12 @@ static inline void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
static inline void nvme_mpath_remove_disk(struct nvme_ns_head *head)
{
}
+static inline void nvme_mpath_add_sysfs_link(struct nvme_ns *ns)
+{
+}
+static inline void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns)
+{
+}
static inline bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
{
return false;
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index ba05faaac562..92bcdd4f6072 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -258,6 +258,8 @@ static struct attribute *nvme_ns_attrs[] = {
#ifdef CONFIG_NVME_MULTIPATH
&dev_attr_ana_grpid.attr,
&dev_attr_ana_state.attr,
+ &dev_attr_queue_depth.attr,
+ &dev_attr_numa_nodes.attr,
#endif
&dev_attr_io_passthru_err_log_enabled.attr,
NULL,
@@ -290,6 +292,10 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl))
return 0;
}
+ if (a == &dev_attr_queue_depth.attr || a == &dev_attr_numa_nodes.attr) {
+ if (nvme_disk_is_ns_head(dev_to_disk(dev)))
+ return 0;
+ }
#endif
return a->mode;
}
@@ -299,8 +305,22 @@ static const struct attribute_group nvme_ns_attr_group = {
.is_visible = nvme_ns_attrs_are_visible,
};
+#ifdef CONFIG_NVME_MULTIPATH
+static struct attribute *nvme_ns_mpath_attrs[] = {
+ NULL,
+};
+
+const struct attribute_group nvme_ns_mpath_attr_group = {
+ .name = "multipath",
+ .attrs = nvme_ns_mpath_attrs,
+};
+#endif
+
const struct attribute_group *nvme_ns_attr_groups[] = {
&nvme_ns_attr_group,
+#ifdef CONFIG_NVME_MULTIPATH
+ &nvme_ns_mpath_attr_group,
+#endif
NULL,
};
--
2.45.2
* Re: [PATCHv4 RFC 0/1] Add visibility for native NVMe multipath using sysfs
2024-09-11 6:26 [PATCHv4 RFC 0/1] Add visibility for native NVMe multipath using sysfs Nilay Shroff
2024-09-11 6:26 ` [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info Nilay Shroff
@ 2024-09-24 6:41 ` Nilay Shroff
1 sibling, 0 replies; 14+ messages in thread
From: Nilay Shroff @ 2024-09-24 6:41 UTC (permalink / raw)
To: linux-nvme; +Cc: dwagner, hch, kbusch, sagi, axboe, gjoyce
A gentle ping about this RFC. Does it look okay, or are there any further comments?
Thanks,
--Nilay
On 9/11/24 11:56, Nilay Shroff wrote:
> <snip>
* Re: [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info
2024-09-11 6:26 ` [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info Nilay Shroff
@ 2024-10-07 10:14 ` Hannes Reinecke
2024-10-07 13:47 ` Nilay Shroff
2024-10-20 23:17 ` Sagi Grimberg
1 sibling, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2024-10-07 10:14 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: dwagner, hch, kbusch, sagi, axboe, gjoyce
On 9/11/24 08:26, Nilay Shroff wrote:
> NVMe native multipath supports different IO policies for selecting I/O
> path, however we don't have any visibility about which path is being
> selected by multipath code for forwarding I/O.
> <snip>
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 518e22dd4f9b..7d9c36a7a261 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -654,6 +654,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
> nvme_add_ns_head_cdev(head);
> }
>
> + nvme_mpath_add_sysfs_link(ns);
> +
> mutex_lock(&head->lock);
> if (nvme_path_is_optimized(ns)) {
> int node, srcu_idx;
Nearly there.
You can only call 'nvme_mpath_add_sysfs_link()' if the gendisk on the
head has been created.
And there is one branch in nvme_mpath_add_disk():
if (desc.state) {
/* found the group desc: update */
nvme_update_ns_ana_state(&desc, ns);
which does not go via nvme_mpath_set_live(), yet a device link would
need to be create here, too.
But you can't call nvme_mpath_add_sysfs_link() from
nvme_mpath_add_disk(), as the actual gendisk might only be created
later on during ANA log parsing.
It is a tangle, and I haven't found a good way out of this.
But I am _very much_ in favour of having these links, so please
update your patch.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info
2024-10-07 10:14 ` Hannes Reinecke
@ 2024-10-07 13:47 ` Nilay Shroff
2024-10-07 14:04 ` Hannes Reinecke
0 siblings, 1 reply; 14+ messages in thread
From: Nilay Shroff @ 2024-10-07 13:47 UTC (permalink / raw)
To: Hannes Reinecke, linux-nvme; +Cc: dwagner, hch, kbusch, sagi, axboe, gjoyce
On 10/7/24 15:44, Hannes Reinecke wrote:
> On 9/11/24 08:26, Nilay Shroff wrote:
>> NVMe native multipath supports different IO policies for selecting I/O
>> path, however we don't have any visibility about which path is being
>> selected by multipath code for forwarding I/O.
>> <snip>
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 518e22dd4f9b..7d9c36a7a261 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -654,6 +654,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>> nvme_add_ns_head_cdev(head);
>> }
>> + nvme_mpath_add_sysfs_link(ns);
>> +
>> mutex_lock(&head->lock);
>> if (nvme_path_is_optimized(ns)) {
>> int node, srcu_idx;
> Nearly there.
Thank you for your review comments!
>
> You can only call 'nvme_mpath_add_sysfs_link()' if the gendisk on the head has been created.
>
> And there is one branch in nvme_mpath_add_disk():
>
> if (desc.state) {
> /* found the group desc: update */
> nvme_update_ns_ana_state(&desc, ns);
>
> which does not go via nvme_mpath_set_live(), yet a device link would need to be created here, too.
> But you can't call nvme_mpath_add_sysfs_link() from nvme_mpath_add_disk(), as the actual gendisk might only be created
> later on during ANA log parsing.
>
> It is a tangle, and I haven't found a good way out of this.
> But I am _very much_ in favour of having these links, so please
> update your patch.
In case the disk supports an ANA group then yes, it would go through nvme_mpath_add_disk()->nvme_update_ns_ana_state(),
and later nvme_update_ns_ana_state() would also fall through to nvme_mpath_set_live(), where we call
nvme_mpath_add_sysfs_link().
So I think that in any case, while a multipath namespace is being created, it has to go through the
nvme_mpath_set_live() function. And as we see in nvme_mpath_set_live(), we only create the
sysfs link after the gendisk on the head is created. Do you agree with this? Or please let
me know if you have any further questions.
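As an aside, the create-once guard under discussion (only create the link after the head gendisk exists, and never create it twice) can be modelled in userspace. The following is a hypothetical Python sketch, not kernel code; all names are illustrative, and the kernel itself uses test_bit()/set_bit() on the NVME_NS_SYSFS_ATTR_LINK flag:

```python
# Toy model of nvme_mpath_add_sysfs_link()'s guard logic: the link is
# created at most once per path, and only after the head gendisk exists.

class Path:
    def __init__(self, name):
        self.name = name
        self.link_created = False  # models the NVME_NS_SYSFS_ATTR_LINK bit

class HeadNode:
    def __init__(self):
        self.gendisk_ready = False  # set once nvme_mpath_set_live() runs
        self.links = set()          # models /sys/block/nvmeXnY/multipath/*

def mpath_add_sysfs_link(head, path):
    if path.link_created:       # already linked: nothing to do
        return
    if not head.gendisk_ready:  # head gendisk not created yet
        return
    head.links.add(path.name)
    path.link_created = True

head = HeadNode()
p = Path("nvme1c1n1")
mpath_add_sysfs_link(head, p)   # too early: head disk not ready, no link
assert head.links == set()
head.gendisk_ready = True       # nvme_mpath_set_live() has run
mpath_add_sysfs_link(head, p)
mpath_add_sysfs_link(head, p)   # repeat call is a no-op (idempotent)
assert head.links == {"nvme1c1n1"} and p.link_created
```

The flag makes repeated calls harmless, which matters because nvme_mpath_set_live() can run more than once for the same path as ANA states change.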
> Cheers,
>
> Hannes
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info
2024-10-07 13:47 ` Nilay Shroff
@ 2024-10-07 14:04 ` Hannes Reinecke
2024-10-07 15:33 ` Nilay Shroff
0 siblings, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2024-10-07 14:04 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: dwagner, hch, kbusch, sagi, axboe, gjoyce
On 10/7/24 15:47, Nilay Shroff wrote:
>
>
> On 10/7/24 15:44, Hannes Reinecke wrote:
>> On 9/11/24 08:26, Nilay Shroff wrote:
>>> NVMe native multipath supports different IO policies for selecting I/O
>>> path, however we don't have any visibility about which path is being
>>> selected by multipath code for forwarding I/O.
>>> This patch helps add that visibility by adding new sysfs attribute files
>>> named "numa_nodes" and "queue_depth" under each namespace block device
>>> path /sys/block/nvmeXcYnZ/. We also create a "multipath" sysfs directory
>>> under head disk node and then from this directory add a link to each
>>> namespace path device this head disk node points to.
>>>
>>> For instance, /sys/block/nvmeXnY/multipath/ would create a soft link to
>>> each path the head disk node <nvmeXnY> points to:
>>>
>>> $ ls -1 /sys/block/nvme1n1/
>>> nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
>>> nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
>>>
>>> For round-robin I/O policy, we could easily infer from the above output
>>> that I/O workload targeted to nvme3n1 would toggle across paths nvme1c1n1
>>> and nvme1c3n1.
>>>
>>> For numa I/O policy, the "numa_nodes" attribute file shows the numa nodes
>>> being preferred by the respective block device path. The numa nodes value
>>> is comma delimited list of nodes or A-B range of nodes.
>>>
>>> For queue-depth I/O policy, the "queue_depth" attribute file shows the
>>> number of active/in-flight I/O requests currently queued for each path.
>>>
>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>> ---
>>> drivers/nvme/host/core.c | 3 ++
>>> drivers/nvme/host/multipath.c | 71 +++++++++++++++++++++++++++++++++++
>>> drivers/nvme/host/nvme.h | 20 ++++++++--
>>> drivers/nvme/host/sysfs.c | 20 ++++++++++
>>> 4 files changed, 110 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 983909a600ad..6be29fd64236 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -3951,6 +3951,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>> if (!nvme_ns_head_multipath(ns->head))
>>> nvme_cdev_del(&ns->cdev, &ns->cdev_device);
>>> +
>>> + nvme_mpath_remove_sysfs_link(ns);
>>> +
>>> del_gendisk(ns->disk);
>>> mutex_lock(&ns->ctrl->namespaces_lock);
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>> index 518e22dd4f9b..7d9c36a7a261 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -654,6 +654,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>>> nvme_add_ns_head_cdev(head);
>>> }
>>> + nvme_mpath_add_sysfs_link(ns);
>>> +
>>> mutex_lock(&head->lock);
>>> if (nvme_path_is_optimized(ns)) {
>>> int node, srcu_idx;
>> Nearly there.
> Thank you for your review comments!
>
>>
>> You can only call 'nvme_mpath_add_sysfs_link()' if the gendisk on the head had been created.
>>
>> And there is one branch in nvme_mpath_add_disk():
>>
>> if (desc.state) {
>> /* found the group desc: update */
>> nvme_update_ns_ana_state(&desc, ns);
>>
>> which does not go via nvme_mpath_set_live(), yet a device link would need to be created here, too.
>> But you can't call nvme_mpath_add_sysfs_link() from nvme_mpath_add_disk(), as the actual gendisk
>> might only be created later on during ANA log parsing.
>> It is a tangle, and I haven't found a good way out of this.
>> But I am _very much_ in favour of having these links, so please
>> update your patch.
> In case disk supports ANA group then yes it would go through nvme_mpath_add_disk()->nvme_update_ns_ana_state();
> and later nvme_update_ns_ana_state() would also fall through function nvme_mpath_set_live where we call
> nvme_mpath_add_sysfs_link().
>
> So I think that in any case while multipath namespace is being created it has to go through
> nvme_mpath_set_live function. And as we see in nvme_mpath_set_live function, we only create
> sysfs link after the gendisk on the head is created. Do you agree with this? Or please let
> me know if you have any further question.
>
But it doesn't:
if (nvme_state_is_live(ns->ana_state) &&
nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
nvme_mpath_set_live(ns);
so if a namespace is being created for which nvme_state_is_live()
returns 'false', nvme_mpath_set_live() is not called, and no links are
created.
This can happen e.g. if the first namespace to be encountered is in any
state other than 'optimized' or 'non-optimized'.
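The gating described here hinges on which ANA states count as "live": in the kernel, nvme_state_is_live() returns true only for the optimized and non-optimized states. A toy userspace model of that check (Python; names are illustrative, not the kernel API):

```python
# Only "optimized" and "non-optimized" ANA states are treated as live.
# A path in any other state (e.g. inaccessible, persistent-loss, change)
# never reaches nvme_mpath_set_live(), so no sysfs link is created for it.
LIVE_STATES = {"optimized", "non-optimized"}

def ana_state_is_live(state: str) -> bool:
    """Userspace model of the kernel's nvme_state_is_live() check."""
    return state in LIVE_STATES

assert ana_state_is_live("optimized")
assert ana_state_is_live("non-optimized")
assert not ana_state_is_live("inaccessible")
```

So if the first path discovered is, say, "inaccessible", the whole set-live branch (including any link creation hooked into it) is skipped until a later ANA state change.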
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info
2024-10-07 14:04 ` Hannes Reinecke
@ 2024-10-07 15:33 ` Nilay Shroff
2024-10-16 3:19 ` Nilay Shroff
0 siblings, 1 reply; 14+ messages in thread
From: Nilay Shroff @ 2024-10-07 15:33 UTC (permalink / raw)
To: Hannes Reinecke, linux-nvme; +Cc: dwagner, hch, kbusch, sagi, axboe, gjoyce
On 10/7/24 19:34, Hannes Reinecke wrote:
> On 10/7/24 15:47, Nilay Shroff wrote:
>>
>>
>> On 10/7/24 15:44, Hannes Reinecke wrote:
>>> On 9/11/24 08:26, Nilay Shroff wrote:
>>>> NVMe native multipath supports different IO policies for selecting I/O
>>>> path, however we don't have any visibility about which path is being
>>>> selected by multipath code for forwarding I/O.
>>>> This patch helps add that visibility by adding new sysfs attribute files
>>>> named "numa_nodes" and "queue_depth" under each namespace block device
>>>> path /sys/block/nvmeXcYnZ/. We also create a "multipath" sysfs directory
>>>> under head disk node and then from this directory add a link to each
>>>> namespace path device this head disk node points to.
>>>>
>>>> For instance, /sys/block/nvmeXnY/multipath/ would create a soft link to
>>>> each path the head disk node <nvmeXnY> points to:
>>>>
>>>> $ ls -1 /sys/block/nvme1n1/
>>>> nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
>>>> nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
>>>>
>>>> For round-robin I/O policy, we could easily infer from the above output
>>>> that I/O workload targeted to nvme3n1 would toggle across paths nvme1c1n1
>>>> and nvme1c3n1.
>>>>
>>>> For numa I/O policy, the "numa_nodes" attribute file shows the numa nodes
>>>> being preferred by the respective block device path. The numa nodes value
>>>> is comma delimited list of nodes or A-B range of nodes.
>>>>
>>>> For queue-depth I/O policy, the "queue_depth" attribute file shows the
>>>> number of active/in-flight I/O requests currently queued for each path.
>>>>
>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>> ---
>>>> drivers/nvme/host/core.c | 3 ++
>>>> drivers/nvme/host/multipath.c | 71 +++++++++++++++++++++++++++++++++++
>>>> drivers/nvme/host/nvme.h | 20 ++++++++--
>>>> drivers/nvme/host/sysfs.c | 20 ++++++++++
>>>> 4 files changed, 110 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index 983909a600ad..6be29fd64236 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -3951,6 +3951,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>> if (!nvme_ns_head_multipath(ns->head))
>>>> nvme_cdev_del(&ns->cdev, &ns->cdev_device);
>>>> +
>>>> + nvme_mpath_remove_sysfs_link(ns);
>>>> +
>>>> del_gendisk(ns->disk);
>>>> mutex_lock(&ns->ctrl->namespaces_lock);
>>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>>> index 518e22dd4f9b..7d9c36a7a261 100644
>>>> --- a/drivers/nvme/host/multipath.c
>>>> +++ b/drivers/nvme/host/multipath.c
>>>> @@ -654,6 +654,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>>>> nvme_add_ns_head_cdev(head);
>>>> }
>>>> + nvme_mpath_add_sysfs_link(ns);
>>>> +
>>>> mutex_lock(&head->lock);
>>>> if (nvme_path_is_optimized(ns)) {
>>>> int node, srcu_idx;
>>> Nearly there.
>> Thank you for your review comments!
>>
>>>
>>> You can only call 'nvme_mpath_add_sysfs_link()' if the gendisk on the head had been created.
>>>
>>> And there is one branch in nvme_mpath_add_disk():
>>>
>>> if (desc.state) {
>>> /* found the group desc: update */
>>> nvme_update_ns_ana_state(&desc, ns);
>>>
>>> which does not go via nvme_mpath_set_live(), yet a device link would need to be created here, too.
>>> But you can't call nvme_mpath_add_sysfs_link() from nvme_mpath_add_disk(), as the actual gendisk
>>> might only be created later on during ANA log parsing.
>>> It is a tangle, and I haven't found a good way out of this.
>>> But I am _very much_ in favour of having these links, so please
>>> update your patch.
>> In case disk supports ANA group then yes it would go through nvme_mpath_add_disk()->nvme_update_ns_ana_state();
>> and later nvme_update_ns_ana_state() would also fall through function nvme_mpath_set_live where we call
>> nvme_mpath_add_sysfs_link().
>>
>> So I think that in any case while multipath namespace is being created it has to go through
>> nvme_mpath_set_live function. And as we see in nvme_mpath_set_live function, we only create
>> sysfs link after the gendisk on the head is created. Do you agree with this? Or please let
>> me know if you have any further question.
>>
> But it doesn't:
>
> if (nvme_state_is_live(ns->ana_state) &&
> nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
> nvme_mpath_set_live(ns);
>
> so if a namespace is being created for which nvme_state_is_live() returns 'false' nvme_mpath_set_live() is not called, and no links are being created.
> This can happen eg. if the first namespace to be encountered is in any other state than 'optimized' or 'non-optimized'.
>
OK, I got what you're suggesting here. So in this particular case, when the ANA state of a shared namespace
is neither "optimized" nor "non-optimized", we would have the gendisk for the shared namespace (i.e.
nvmeXcYnZ) created, but we don't yet have the gendisk for the corresponding head node (i.e. nvmeXnY) created.
So without the gendisk for the head node created, how could we create a link from it to the namespace node?
The link from the gendisk head node would eventually be created when the ANA state of the namespace transitions
to the "optimized" or "non-optimized" state. I think it isn't possible anyway to have the multipathing function
enabled until the gendisk for the head node is created, is it? So I don't yet understand why we really
need the device link created if the gendisk for the head node isn't ready. Am I missing anything here?
> Cheers,
>
> Hannes
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info
2024-10-07 15:33 ` Nilay Shroff
@ 2024-10-16 3:19 ` Nilay Shroff
2024-10-16 6:52 ` Hannes Reinecke
0 siblings, 1 reply; 14+ messages in thread
From: Nilay Shroff @ 2024-10-16 3:19 UTC (permalink / raw)
To: Hannes Reinecke, linux-nvme; +Cc: dwagner, hch, kbusch, sagi, axboe, gjoyce
Hi Hannes,
On 10/7/24 21:03, Nilay Shroff wrote:
>
>
> On 10/7/24 19:34, Hannes Reinecke wrote:
>> On 10/7/24 15:47, Nilay Shroff wrote:
>>>
>>>
>>> On 10/7/24 15:44, Hannes Reinecke wrote:
>>>> On 9/11/24 08:26, Nilay Shroff wrote:
>>>>> NVMe native multipath supports different IO policies for selecting I/O
>>>>> path, however we don't have any visibility about which path is being
>>>>> selected by multipath code for forwarding I/O.
>>>>> This patch helps add that visibility by adding new sysfs attribute files
>>>>> named "numa_nodes" and "queue_depth" under each namespace block device
>>>>> path /sys/block/nvmeXcYnZ/. We also create a "multipath" sysfs directory
>>>>> under head disk node and then from this directory add a link to each
>>>>> namespace path device this head disk node points to.
>>>>>
>>>>> For instance, /sys/block/nvmeXnY/multipath/ would create a soft link to
>>>>> each path the head disk node <nvmeXnY> points to:
>>>>>
>>>>> $ ls -1 /sys/block/nvme1n1/
>>>>> nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
>>>>> nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
>>>>>
>>>>> For round-robin I/O policy, we could easily infer from the above output
>>>>> that I/O workload targeted to nvme3n1 would toggle across paths nvme1c1n1
>>>>> and nvme1c3n1.
>>>>>
>>>>> For numa I/O policy, the "numa_nodes" attribute file shows the numa nodes
>>>>> being preferred by the respective block device path. The numa nodes value
>>>>> is comma delimited list of nodes or A-B range of nodes.
>>>>>
>>>>> For queue-depth I/O policy, the "queue_depth" attribute file shows the
>>>>> number of active/in-flight I/O requests currently queued for each path.
>>>>>
>>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>>> ---
>>>>> drivers/nvme/host/core.c | 3 ++
>>>>> drivers/nvme/host/multipath.c | 71 +++++++++++++++++++++++++++++++++++
>>>>> drivers/nvme/host/nvme.h | 20 ++++++++--
>>>>> drivers/nvme/host/sysfs.c | 20 ++++++++++
>>>>> 4 files changed, 110 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>> index 983909a600ad..6be29fd64236 100644
>>>>> --- a/drivers/nvme/host/core.c
>>>>> +++ b/drivers/nvme/host/core.c
>>>>> @@ -3951,6 +3951,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>>> if (!nvme_ns_head_multipath(ns->head))
>>>>> nvme_cdev_del(&ns->cdev, &ns->cdev_device);
>>>>> +
>>>>> + nvme_mpath_remove_sysfs_link(ns);
>>>>> +
>>>>> del_gendisk(ns->disk);
>>>>> mutex_lock(&ns->ctrl->namespaces_lock);
>>>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>>>> index 518e22dd4f9b..7d9c36a7a261 100644
>>>>> --- a/drivers/nvme/host/multipath.c
>>>>> +++ b/drivers/nvme/host/multipath.c
>>>>> @@ -654,6 +654,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>>>>> nvme_add_ns_head_cdev(head);
>>>>> }
>>>>> + nvme_mpath_add_sysfs_link(ns);
>>>>> +
>>>>> mutex_lock(&head->lock);
>>>>> if (nvme_path_is_optimized(ns)) {
>>>>> int node, srcu_idx;
>>>> Nearly there.
>>> Thank you for your review comments!
>>>
>>>>
>>>> You can only call 'nvme_mpath_add_sysfs_link()' if the gendisk on the head had been created.
>>>>
>>>> And there is one branch in nvme_mpath_add_disk():
>>>>
>>>> if (desc.state) {
>>>> /* found the group desc: update */
>>>> nvme_update_ns_ana_state(&desc, ns);
>>>>
>>>> which does not go via nvme_mpath_set_live(), yet a device link would need to be created here, too.
>>>> But you can't call nvme_mpath_add_sysfs_link() from nvme_mpath_add_disk(), as the actual gendisk
>>>> might only be created later on during ANA log parsing.
>>>> It is a tangle, and I haven't found a good way out of this.
>>>> But I am _very much_ in favour of having these links, so please
>>>> update your patch.
>>> In case disk supports ANA group then yes it would go through nvme_mpath_add_disk()->nvme_update_ns_ana_state();
>>> and later nvme_update_ns_ana_state() would also fall through function nvme_mpath_set_live where we call
>>> nvme_mpath_add_sysfs_link().
>>>
>>> So I think that in any case while multipath namespace is being created it has to go through
>>> nvme_mpath_set_live function. And as we see in nvme_mpath_set_live function, we only create
>>> sysfs link after the gendisk on the head is created. Do you agree with this? Or please let
>>> me know if you have any further question.
>>>
>> But it doesn't:
>>
>> if (nvme_state_is_live(ns->ana_state) &&
>> nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
>> nvme_mpath_set_live(ns);
>>
>> so if a namespace is being created for which nvme_state_is_live() returns 'false' nvme_mpath_set_live() is not called, and no links are being created.
>> This can happen eg. if the first namespace to be encountered is in any other state than 'optimized' or 'non-optimized'.
>>
>
> OK I got what you're suggesting here. So in this particular case when ANA state of a shared namespace
> is neither "optimized" nor "non-optimized", we would have gendisk for shared namespace (i.e.
> nvmeXcYnZ) created but we don't have yet gendisk for corresponding head node (i.e. nvmeXnY) created.
> So without the gendisk for head node created, how could we create a link from it to the namespace node?
>
> The link from gendisk head node would be eventually created when the ANA state of the namespace transition
> to "optimized" or "non-optimized" state. I think, it's not anyways possible to have multipathing function
> enabled until the gendisk for the head node created, isn't it? So I don't yet understand why do we really
> need device link created if we don't have gendisk for head node ready? Am I missing anything here?
>
>> Cheers,
>>
>> Hannes
>
A gentle ping about this thread. Does the change look okay, or do you have further comments?
Am I missing something here? If so, can you please help me with it?
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info
2024-10-16 3:19 ` Nilay Shroff
@ 2024-10-16 6:52 ` Hannes Reinecke
2024-10-21 12:24 ` Nilay Shroff
0 siblings, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2024-10-16 6:52 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: dwagner, hch, kbusch, sagi, axboe, gjoyce
On 10/16/24 05:19, Nilay Shroff wrote:
> Hi Hannes,
>
> On 10/7/24 21:03, Nilay Shroff wrote:
>>
>>
>> On 10/7/24 19:34, Hannes Reinecke wrote:
>>> On 10/7/24 15:47, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 10/7/24 15:44, Hannes Reinecke wrote:
>>>>> On 9/11/24 08:26, Nilay Shroff wrote:
>>>>>> NVMe native multipath supports different IO policies for selecting I/O
>>>>>> path, however we don't have any visibility about which path is being
>>>>>> selected by multipath code for forwarding I/O.
>>>>>> This patch helps add that visibility by adding new sysfs attribute files
>>>>>> named "numa_nodes" and "queue_depth" under each namespace block device
>>>>>> path /sys/block/nvmeXcYnZ/. We also create a "multipath" sysfs directory
>>>>>> under head disk node and then from this directory add a link to each
>>>>>> namespace path device this head disk node points to.
>>>>>>
>>>>>> For instance, /sys/block/nvmeXnY/multipath/ would create a soft link to
>>>>>> each path the head disk node <nvmeXnY> points to:
>>>>>>
>>>>>> $ ls -1 /sys/block/nvme1n1/
>>>>>> nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
>>>>>> nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
>>>>>>
>>>>>> For round-robin I/O policy, we could easily infer from the above output
>>>>>> that I/O workload targeted to nvme3n1 would toggle across paths nvme1c1n1
>>>>>> and nvme1c3n1.
>>>>>>
>>>>>> For numa I/O policy, the "numa_nodes" attribute file shows the numa nodes
>>>>>> being preferred by the respective block device path. The numa nodes value
>>>>>> is comma delimited list of nodes or A-B range of nodes.
>>>>>>
>>>>>> For queue-depth I/O policy, the "queue_depth" attribute file shows the
>>>>>> number of active/in-flight I/O requests currently queued for each path.
>>>>>>
>>>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>>>> ---
>>>>>> drivers/nvme/host/core.c | 3 ++
>>>>>> drivers/nvme/host/multipath.c | 71 +++++++++++++++++++++++++++++++++++
>>>>>> drivers/nvme/host/nvme.h | 20 ++++++++--
>>>>>> drivers/nvme/host/sysfs.c | 20 ++++++++++
>>>>>> 4 files changed, 110 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>>> index 983909a600ad..6be29fd64236 100644
>>>>>> --- a/drivers/nvme/host/core.c
>>>>>> +++ b/drivers/nvme/host/core.c
>>>>>> @@ -3951,6 +3951,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>>>> if (!nvme_ns_head_multipath(ns->head))
>>>>>> nvme_cdev_del(&ns->cdev, &ns->cdev_device);
>>>>>> +
>>>>>> + nvme_mpath_remove_sysfs_link(ns);
>>>>>> +
>>>>>> del_gendisk(ns->disk);
>>>>>> mutex_lock(&ns->ctrl->namespaces_lock);
>>>>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>>>>> index 518e22dd4f9b..7d9c36a7a261 100644
>>>>>> --- a/drivers/nvme/host/multipath.c
>>>>>> +++ b/drivers/nvme/host/multipath.c
>>>>>> @@ -654,6 +654,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>>>>>> nvme_add_ns_head_cdev(head);
>>>>>> }
>>>>>> + nvme_mpath_add_sysfs_link(ns);
>>>>>> +
>>>>>> mutex_lock(&head->lock);
>>>>>> if (nvme_path_is_optimized(ns)) {
>>>>>> int node, srcu_idx;
>>>>> Nearly there.
>>>> Thank you for your review comments!
>>>>
>>>>>
>>>>> You can only call 'nvme_mpath_add_sysfs_link()' if the gendisk on the head had been created.
>>>>>
>>>>> And there is one branch in nvme_mpath_add_disk():
>>>>>
>>>>> if (desc.state) {
>>>>> /* found the group desc: update */
>>>>> nvme_update_ns_ana_state(&desc, ns);
>>>>>
>>>>> which does not go via nvme_mpath_set_live(), yet a device link would need to be created here, too.
>>>>> But you can't call nvme_mpath_add_sysfs_link() from nvme_mpath_add_disk(), as the actual gendisk
>>>>> might only be created later on during ANA log parsing.
>>>>> It is a tangle, and I haven't found a good way out of this.
>>>>> But I am _very much_ in favour of having these links, so please
>>>>> update your patch.
>>>> In case disk supports ANA group then yes it would go through nvme_mpath_add_disk()->nvme_update_ns_ana_state();
>>>> and later nvme_update_ns_ana_state() would also fall through function nvme_mpath_set_live where we call
>>>> nvme_mpath_add_sysfs_link().
>>>>
>>>> So I think that in any case while multipath namespace is being created it has to go through
>>>> nvme_mpath_set_live function. And as we see in nvme_mpath_set_live function, we only create
>>>> sysfs link after the gendisk on the head is created. Do you agree with this? Or please let
>>>> me know if you have any further question.
>>>>
>>> But it doesn't:
>>>
>>> if (nvme_state_is_live(ns->ana_state) &&
>>> nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
>>> nvme_mpath_set_live(ns);
>>>
>>> so if a namespace is being created for which nvme_state_is_live() returns 'false' nvme_mpath_set_live() is not called, and no links are being created.
>>> This can happen eg. if the first namespace to be encountered is in any other state than 'optimized' or 'non-optimized'.
>>>
>>
>> OK I got what you're suggesting here. So in this particular case when ANA state of a shared namespace
>> is neither "optimized" nor "non-optimized", we would have gendisk for shared namespace (i.e.
>> nvmeXcYnZ) created but we don't have yet gendisk for corresponding head node (i.e. nvmeXnY) created.
>> So without the gendisk for head node created, how could we create a link from it to the namespace node?
>>
>> The link from gendisk head node would be eventually created when the ANA state of the namespace transition
>> to "optimized" or "non-optimized" state. I think, it's not anyways possible to have multipathing function
>> enabled until the gendisk for the head node created, isn't it? So I don't yet understand why do we really
>> need device link created if we don't have gendisk for head node ready? Am I missing anything here?
>>
You are right, we will create 'the' link for the gendisk node once
'nvme_mpath_set_live()' is called. But:
Then we would have _one_ link (for the active path) only.
All other paths (i.e. the non-active ones) might already be present, but
won't get a link.
I'd love to have links to all paths present, active or not.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info
2024-09-11 6:26 ` [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info Nilay Shroff
2024-10-07 10:14 ` Hannes Reinecke
@ 2024-10-20 23:17 ` Sagi Grimberg
2024-10-21 13:37 ` Nilay Shroff
1 sibling, 1 reply; 14+ messages in thread
From: Sagi Grimberg @ 2024-10-20 23:17 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: dwagner, hch, kbusch, axboe, gjoyce
On 11/09/2024 9:26, Nilay Shroff wrote:
> NVMe native multipath supports different IO policies for selecting I/O
> path, however we don't have any visibility about which path is being
> selected by multipath code for forwarding I/O.
> This patch helps add that visibility by adding new sysfs attribute files
> named "numa_nodes" and "queue_depth" under each namespace block device
> path /sys/block/nvmeXcYnZ/. We also create a "multipath" sysfs directory
> under head disk node and then from this directory add a link to each
> namespace path device this head disk node points to.
>
> For instance, /sys/block/nvmeXnY/multipath/ would create a soft link to
> each path the head disk node <nvmeXnY> points to:
>
> $ ls -1 /sys/block/nvme1n1/
> nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
> nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
>
> For round-robin I/O policy, we could easily infer from the above output
> that I/O workload targeted to nvme3n1 would toggle across paths nvme1c1n1
> and nvme1c3n1.
>
> For numa I/O policy, the "numa_nodes" attribute file shows the numa nodes
> being preferred by the respective block device path. The numa nodes value
> is comma delimited list of nodes or A-B range of nodes.
>
> For queue-depth I/O policy, the "queue_depth" attribute file shows the
> number of active/in-flight I/O requests currently queued for each path.
Would it be possible to separate this into 3 patches: one for sysfs links, one
for numa, and one for queue_depth?
Also, do the sysfs files always exist? What does numa_nodes output in case
the policy is round-robin? Or what would the queue_depth file produce?
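For reference on the output format being asked about: the patch emits the numa_nodes value with the kernel's "%*pbl" bitmap format, which prints the node mask as a comma-delimited list with A-B ranges (e.g. "0,2-3"), and an empty mask as an empty line. A small userspace parser, assuming that format (hypothetical helper, not part of the patch):

```python
def parse_node_list(s: str) -> set:
    """Parse a kernel bitmap-list string such as '0,2-3' into a set of
    node IDs. An empty string means no NUMA node currently prefers this
    path (e.g. under a policy other than numa)."""
    nodes = set()
    s = s.strip()
    if not s:
        return nodes
    for part in s.split(","):
        if "-" in part:  # 'A-B' range of nodes
            lo, hi = part.split("-")
            nodes.update(range(int(lo), int(hi) + 1))
        else:            # single node
            nodes.add(int(part))
    return nodes

assert parse_node_list("0,2-3") == {0, 2, 3}
assert parse_node_list("") == set()
```

In practice this would be applied to the contents of /sys/block/nvmeXcYnZ/numa_nodes after stripping the trailing newline.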
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> drivers/nvme/host/core.c | 3 ++
> drivers/nvme/host/multipath.c | 71 +++++++++++++++++++++++++++++++++++
> drivers/nvme/host/nvme.h | 20 ++++++++--
> drivers/nvme/host/sysfs.c | 20 ++++++++++
> 4 files changed, 110 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 983909a600ad..6be29fd64236 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3951,6 +3951,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>
> if (!nvme_ns_head_multipath(ns->head))
> nvme_cdev_del(&ns->cdev, &ns->cdev_device);
> +
> + nvme_mpath_remove_sysfs_link(ns);
> +
> del_gendisk(ns->disk);
>
> mutex_lock(&ns->ctrl->namespaces_lock);
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 518e22dd4f9b..7d9c36a7a261 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -654,6 +654,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
> nvme_add_ns_head_cdev(head);
> }
>
> + nvme_mpath_add_sysfs_link(ns);
> +
> mutex_lock(&head->lock);
> if (nvme_path_is_optimized(ns)) {
> int node, srcu_idx;
> @@ -922,6 +924,39 @@ static ssize_t ana_state_show(struct device *dev, struct device_attribute *attr,
> }
> DEVICE_ATTR_RO(ana_state);
>
> +static ssize_t queue_depth_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> +
> + return sysfs_emit(buf, "%d\n", atomic_read(&ns->ctrl->nr_active));
> +}
> +DEVICE_ATTR_RO(queue_depth);
> +
> +static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + int node, srcu_idx;
> + nodemask_t numa_nodes;
> + struct nvme_ns *current_ns;
> + struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> + struct nvme_ns_head *head = ns->head;
> +
> + nodes_clear(numa_nodes);
> +
> + srcu_idx = srcu_read_lock(&head->srcu);
> + for_each_node(node) {
> + current_ns = srcu_dereference(head->current_path[node],
> + &head->srcu);
> + if (ns == current_ns)
> + node_set(node, numa_nodes);
> + }
> + srcu_read_unlock(&head->srcu, srcu_idx);
> +
> + return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&numa_nodes));
> +}
> +DEVICE_ATTR_RO(numa_nodes);
> +
> static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
> struct nvme_ana_group_desc *desc, void *data)
> {
> @@ -934,6 +969,42 @@ static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
> return -ENXIO; /* just break out of the loop */
> }
>
> +void nvme_mpath_add_sysfs_link(struct nvme_ns *ns)
> +{
> + struct device *target;
> + struct kobject *kobj;
> + int rc;
> +
> + if (test_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
> + return;
> +
> + target = disk_to_dev(ns->disk);
> + kobj = &disk_to_dev(ns->head->disk)->kobj;
> + rc = sysfs_add_link_to_group(kobj, nvme_ns_mpath_attr_group.name,
> + &target->kobj, dev_name(target));
> + if (unlikely(rc)) {
> + dev_err(disk_to_dev(ns->head->disk),
> + "failed to create link to %s\n",
> + dev_name(target));
> + } else
> + set_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
This complication is not clear at all, not sure why this is needed, and it is not documented.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info
2024-10-16 6:52 ` Hannes Reinecke
@ 2024-10-21 12:24 ` Nilay Shroff
0 siblings, 0 replies; 14+ messages in thread
From: Nilay Shroff @ 2024-10-21 12:24 UTC (permalink / raw)
To: Hannes Reinecke, linux-nvme; +Cc: dwagner, hch, kbusch, sagi, axboe, gjoyce
On 10/16/24 12:22, Hannes Reinecke wrote:
> On 10/16/24 05:19, Nilay Shroff wrote:
>> Hi Hannes,
>>
>> On 10/7/24 21:03, Nilay Shroff wrote:
>>>
>>>
>>> On 10/7/24 19:34, Hannes Reinecke wrote:
>>>> On 10/7/24 15:47, Nilay Shroff wrote:
>>>>>
>>>>>
>>>>> On 10/7/24 15:44, Hannes Reinecke wrote:
>>>>>> On 9/11/24 08:26, Nilay Shroff wrote:
>>>>>>> NVMe native multipath supports different IO policies for selecting I/O
>>>>>>> path, however we don't have any visibility about which path is being
>>>>>>> selected by multipath code for forwarding I/O.
>>>>>>> This patch helps add that visibility by adding new sysfs attribute files
>>>>>>> named "numa_nodes" and "queue_depth" under each namespace block device
>>>>>>> path /sys/block/nvmeXcYnZ/. We also create a "multipath" sysfs directory
>>>>>>> under head disk node and then from this directory add a link to each
>>>>>>> namespace path device this head disk node points to.
>>>>>>>
>>>>>>> For instance, /sys/block/nvmeXnY/multipath/ would create a soft link to
>>>>>>> each path the head disk node <nvmeXnY> points to:
>>>>>>>
>>>>>>> $ ls -1 /sys/block/nvme1n1/
>>>>>>> nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
>>>>>>> nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
>>>>>>>
>>>>>>> For round-robin I/O policy, we could easily infer from the above output
>>>>>>> that I/O workload targeted to nvme3n1 would toggle across paths nvme1c1n1
>>>>>>> and nvme1c3n1.
>>>>>>>
>>>>>>> For numa I/O policy, the "numa_nodes" attribute file shows the numa nodes
>>>>>>> being preferred by the respective block device path. The numa nodes value
>>>>>>> is comma delimited list of nodes or A-B range of nodes.
>>>>>>>
>>>>>>> For queue-depth I/O policy, the "queue_depth" attribute file shows the
>>>>>>> number of active/in-flight I/O requests currently queued for each path.
>>>>>>>
>>>>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>>>>> ---
>>>>>>> drivers/nvme/host/core.c | 3 ++
>>>>>>> drivers/nvme/host/multipath.c | 71 +++++++++++++++++++++++++++++++++++
>>>>>>> drivers/nvme/host/nvme.h | 20 ++++++++--
>>>>>>> drivers/nvme/host/sysfs.c | 20 ++++++++++
>>>>>>> 4 files changed, 110 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>>>> index 983909a600ad..6be29fd64236 100644
>>>>>>> --- a/drivers/nvme/host/core.c
>>>>>>> +++ b/drivers/nvme/host/core.c
>>>>>>> @@ -3951,6 +3951,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>>>>> if (!nvme_ns_head_multipath(ns->head))
>>>>>>> nvme_cdev_del(&ns->cdev, &ns->cdev_device);
>>>>>>> +
>>>>>>> + nvme_mpath_remove_sysfs_link(ns);
>>>>>>> +
>>>>>>> del_gendisk(ns->disk);
>>>>>>> mutex_lock(&ns->ctrl->namespaces_lock);
>>>>>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>>>>>> index 518e22dd4f9b..7d9c36a7a261 100644
>>>>>>> --- a/drivers/nvme/host/multipath.c
>>>>>>> +++ b/drivers/nvme/host/multipath.c
>>>>>>> @@ -654,6 +654,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>>>>>>> nvme_add_ns_head_cdev(head);
>>>>>>> }
>>>>>>> + nvme_mpath_add_sysfs_link(ns);
>>>>>>> +
>>>>>>> mutex_lock(&head->lock);
>>>>>>> if (nvme_path_is_optimized(ns)) {
>>>>>>> int node, srcu_idx;
>>>>>> Nearly there.
>>>>> Thank you for your review comments!
>>>>>
>>>>>>
>>>>>> You can only call 'nvme_mpath_add_sysfs_link()' if the gendisk on the head had been created.
>>>>>>
>>>>>> And there is one branch in nvme_mpath_add_disk():
>>>>>>
>>>>>> if (desc.state) {
>>>>>> /* found the group desc: update */
>>>>>> nvme_update_ns_ana_state(&desc, ns);
>>>>>>
>>>>>> which does not go via nvme_mpath_set_live(), yet a device link would need to be create here, too.
>>>>>> But you can't call nvme_mpath_add_sysfs_link() from nvme_mpath_add_disk(), as the actual gendisk
>>>>>> might only be created later on during ANA log parsing.>>
>>>>>> It is a tangle, and I haven't found a good way out of this.
>>>>>> But I am _very much_ in favour of having these links, so please
>>>>>> update your patch.
>>>>>> In case disk supports ANA group then yes it would go through nvme_mpath_add_disk()->nvme_update_ns_ana_state();
>>>>>> and later nvme_update_ns_ana_state() would also fall through function nvme_mpath_set_live where we call
>>>>>> nvme_mpath_add_sysfs_link().
>>>>>
>>>>> So I think that in any case while multipath namespace is being created it has to go through
>>>>> nvme_mpath_set_live function. And as we see in nvme_mpath_set_live function, we only create
>>>>> sysfs link after the gendisk on the head is created. Do you agree with this? Or please let
>>>>> me know if you have any further question.
>>>>>
>>>> But it doesn't:
>>>>
>>>> if (nvme_state_is_live(ns->ana_state) &&
>>>> nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
>>>> nvme_mpath_set_live(ns);
>>>>
>>>> so if a namespace is being created for which nvme_state_is_live() returns 'false' nvme_mpath_set_live() is not called, and no links are being created.
>>>> This can happen eg. if the first namespace to be encountered is in any other state than 'optimized' or 'non-optimized'.
>>>>
>>>
>>> OK I got what you're suggesting here. So in this particular case when ANA state of a shared namespace
>>> is neither "optimized" nor "non-optimized", we would have gendisk for shared namepspace (i.e.
>>> nvmeXcYnZ) created but we don't have yet gendisk for corresponding head node (i.e. nvmeXnY) created.
>>> So without the gendisk for head node created, how could we create a link from it to the namespace node?
>>>
>>> The link from gendisk head node would be eventually created when the ANA state of the namespace transition
>>> to "optimized" or "non-optimized" state. I think, it's not anyways possible to have multipathing function
>>> enabled until the gendisk for the head node created, isn't it? So I don't yet understand why do we really
>>> need device link created if we don't have gendisk for head not ready? Am I missing anything here?
>>>
> You are right, we will create 'the' link for the gendisk node once 'nvme_mpath_set_live()' is called. But:
> Then we would have _one_ link (for the active path) only.
> All other paths (ie the non-active ones) might already be present, but won't get a link.
>
> I'd love to have links to all paths present, active or not.
>
OK, I think I got what you're asking here. We can make it work the way you want;
I will address it in the next patch.
Thanks,
--Nilay
* Re: [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info
2024-10-20 23:17 ` Sagi Grimberg
@ 2024-10-21 13:37 ` Nilay Shroff
2024-10-23 9:58 ` Sagi Grimberg
0 siblings, 1 reply; 14+ messages in thread
From: Nilay Shroff @ 2024-10-21 13:37 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme; +Cc: dwagner, hch, kbusch, axboe, gjoyce
On 10/21/24 04:47, Sagi Grimberg wrote:
>
>
>
> On 11/09/2024 9:26, Nilay Shroff wrote:
>> NVMe native multipath supports different IO policies for selecting I/O
>> path, however we don't have any visibility about which path is being
>> selected by multipath code for forwarding I/O.
>> This patch helps add that visibility by adding new sysfs attribute files
>> named "numa_nodes" and "queue_depth" under each namespace block device
>> path /sys/block/nvmeXcYnZ/. We also create a "multipath" sysfs directory
>> under head disk node and then from this directory add a link to each
>> namespace path device this head disk node points to.
>>
>> For instance, /sys/block/nvmeXnY/multipath/ would create a soft link to
>> each path the head disk node <nvmeXnY> points to:
>>
>> $ ls -1 /sys/block/nvme1n1/
>> nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
>> nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
>>
>> For round-robin I/O policy, we could easily infer from the above output
>> that I/O workload targeted to nvme3n1 would toggle across paths nvme1c1n1
>> and nvme1c3n1.
>>
>> For numa I/O policy, the "numa_nodes" attribute file shows the numa nodes
>> being preferred by the respective block device path. The numa nodes value
>> is comma delimited list of nodes or A-B range of nodes.
>>
>> For queue-depth I/O policy, the "queue_depth" attribute file shows the
>> number of active/in-flight I/O requests currently queued for each path.
>
> Would it be possible to separate this to 3 patches, one for sysfs links, one
> for numa, and one for queue_depth?
Yes, it should be possible to split this patch into three separate patches
if you prefer.
>
> Also, are the sysfs files always exist? what do numa_nodes output in case
> the policy is round-robin? or what would queue_depth file produce?
>
Assuming the kernel is compiled with CONFIG_NVME_MULTIPATH, the sysfs files
"numa_nodes" and "queue_depth" are always created once the gendisk node
(nvmeXcYnZ) for the per-path namespace device is added.
The numa_nodes file shows the numa nodes preferring the respective per-path
namespace device when the io-policy is set to numa. For instance, reading
"/sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes" shows the numa nodes
preferring the path nvme1c1n1 when an I/O workload is targeted to nvme1n1.
The queue_depth file contains the value 0 if the io-policy is anything but
queue-depth. The reason is that the queue_depth file emits the number of
active requests currently queued up in the nvme/fabric queue (i.e.
ctrl->nr_active), and if the io-policy is not queue-depth then ->nr_active
typically remains zero.
BTW, we don't expect the numa_nodes file to be read when the nvme subsystem
io-policy is NOT numa. Similarly, we don't expect the user to review the
queue_depth file when the io-policy is NOT set to queue-depth. Isn't that so?
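As an aside, the numa_nodes value mentioned above uses the kernel's "%*pbl"
nodemask formatting, i.e. a comma-delimited list of nodes with A-B ranges
(e.g. "0-1,3"). The following is a hypothetical userspace Python sketch of
how such a range list could be parsed and produced; it is an illustration of
the format only, not part of the patch:

```python
def parse_node_list(s):
    """Parse a kernel "%*pbl" range list such as "0-1,3" into node numbers."""
    nodes = []
    s = s.strip()
    if not s:
        return nodes
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            nodes.extend(range(int(lo), int(hi) + 1))
        else:
            nodes.append(int(part))
    return nodes


def format_node_list(nodes):
    """Inverse: collapse sorted node numbers back into an A-B,C range list."""
    out = []
    nodes = sorted(nodes)
    i = 0
    while i < len(nodes):
        # extend j while the run of node numbers stays consecutive
        j = i
        while j + 1 < len(nodes) and nodes[j + 1] == nodes[j] + 1:
            j += 1
        out.append(str(nodes[i]) if i == j else f"{nodes[i]}-{nodes[j]}")
        i = j + 1
    return ",".join(out)
```

For example, parse_node_list("0-1,3") yields [0, 1, 3], and
format_node_list([0, 1, 3]) yields "0-1,3" again.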
>
>
>>
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>> drivers/nvme/host/core.c | 3 ++
>> drivers/nvme/host/multipath.c | 71 +++++++++++++++++++++++++++++++++++
>> drivers/nvme/host/nvme.h | 20 ++++++++--
>> drivers/nvme/host/sysfs.c | 20 ++++++++++
>> 4 files changed, 110 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 983909a600ad..6be29fd64236 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3951,6 +3951,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>> if (!nvme_ns_head_multipath(ns->head))
>> nvme_cdev_del(&ns->cdev, &ns->cdev_device);
>> +
>> + nvme_mpath_remove_sysfs_link(ns);
>> +
>> del_gendisk(ns->disk);
>> mutex_lock(&ns->ctrl->namespaces_lock);
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 518e22dd4f9b..7d9c36a7a261 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -654,6 +654,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>> nvme_add_ns_head_cdev(head);
>> }
>> + nvme_mpath_add_sysfs_link(ns);
>> +
>> mutex_lock(&head->lock);
>> if (nvme_path_is_optimized(ns)) {
>> int node, srcu_idx;
>> @@ -922,6 +924,39 @@ static ssize_t ana_state_show(struct device *dev, struct device_attribute *attr,
>> }
>> DEVICE_ATTR_RO(ana_state);
>> +static ssize_t queue_depth_show(struct device *dev, struct device_attribute *attr,
>> + char *buf)
>> +{
>> + struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
>> +
>> + return sysfs_emit(buf, "%d\n", atomic_read(&ns->ctrl->nr_active));
>> +}
>> +DEVICE_ATTR_RO(queue_depth);
>> +
>> +static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr,
>> + char *buf)
>> +{
>> + int node, srcu_idx;
>> + nodemask_t numa_nodes;
>> + struct nvme_ns *current_ns;
>> + struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
>> + struct nvme_ns_head *head = ns->head;
>> +
>> + nodes_clear(numa_nodes);
>> +
>> + srcu_idx = srcu_read_lock(&head->srcu);
>> + for_each_node(node) {
>> + current_ns = srcu_dereference(head->current_path[node],
>> + &head->srcu);
>> + if (ns == current_ns)
>> + node_set(node, numa_nodes);
>> + }
>> + srcu_read_unlock(&head->srcu, srcu_idx);
>> +
>> + return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&numa_nodes));
>> +}
>> +DEVICE_ATTR_RO(numa_nodes);
>> +
>> static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
>> struct nvme_ana_group_desc *desc, void *data)
>> {
>> @@ -934,6 +969,42 @@ static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
>> return -ENXIO; /* just break out of the loop */
>> }
>> +void nvme_mpath_add_sysfs_link(struct nvme_ns *ns)
>> +{
>> + struct device *target;
>> + struct kobject *kobj;
>> + int rc;
>> +
>> + if (test_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
>> + return;
>> +
>> + target = disk_to_dev(ns->disk);
>> + kobj = &disk_to_dev(ns->head->disk)->kobj;
>> + rc = sysfs_add_link_to_group(kobj, nvme_ns_mpath_attr_group.name,
>> + &target->kobj, dev_name(target));
>> + if (unlikely(rc)) {
>> + dev_err(disk_to_dev(ns->head->disk),
>> + "failed to create link to %s\n",
>> + dev_name(target));
>> + } else
>> + set_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
>
> This complication is not clear at all, not sure why this is needed, and it is
> not documented.
The nvme_mpath_add_sysfs_link function is invoked from nvme_mpath_set_live() when a
new namespace is allocated and its ANA state is set to OPTIMIZED or NON-OPTIMIZED.
It simply creates the link to the target nvme path-device (nvmeXcYnZ) from the head
disk-node (nvmeXnY) device. If this looks complex, I will add a comment in the code
in the next patch iteration.
Thanks,
--Nilay
* Re: [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info
2024-10-21 13:37 ` Nilay Shroff
@ 2024-10-23 9:58 ` Sagi Grimberg
2024-10-23 13:31 ` Nilay Shroff
0 siblings, 1 reply; 14+ messages in thread
From: Sagi Grimberg @ 2024-10-23 9:58 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: dwagner, hch, kbusch, axboe, gjoyce
>> Also, are the sysfs files always exist? what do numa_nodes output in case
>> the policy is round-robin? or what would queue_depth file produce?
>>
> Assuming kernel is compiled with CONFIG_NVME_MULTIPATH, the sysfs files
> "numa_nodes" and "round_robin" would be always created once gendisk node
> (nvmeXCYnZ) for the per-path namespace device is added.
>
> In case io-policy is round-robin then numa_nodes file would show the numa
> nodes preferred for the respective per-path namespace device in case the
> policy is set to numa. For instance, if we read the file
> "/sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes" then it shows the numa
> nodes preferring the path nvme1c1n1 when I/O workload is targeted to nvme1n1.
>
> The queue_depth file would contain the value 0 if the io-policy is anything
> but queue-depth. The reason being queue_depth file emits the output of num of
> active requests currently queued up in the nvme/fabric queue (i.e. ctrl->nr_active)
> however if the io-policy is not queue-depth then ->nr_active would typically remains
> zero.
>
> BTW, we don't expect to read the numa_nodes file when the nvme subsystem
> io-policy is NOT numa. Similarly we don't expect user to review the
> queue_depth file when io-policy is NOT set to queue-depth. Isn't it?
I'm asking what it shows in case it _is_ read. The queue_depth and numa_nodes
output seems to be a "lie" in case the policy is round-robin.
>
>>
>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>> ---
>>> drivers/nvme/host/core.c | 3 ++
>>> drivers/nvme/host/multipath.c | 71 +++++++++++++++++++++++++++++++++++
>>> drivers/nvme/host/nvme.h | 20 ++++++++--
>>> drivers/nvme/host/sysfs.c | 20 ++++++++++
>>> 4 files changed, 110 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 983909a600ad..6be29fd64236 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -3951,6 +3951,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>> if (!nvme_ns_head_multipath(ns->head))
>>> nvme_cdev_del(&ns->cdev, &ns->cdev_device);
>>> +
>>> + nvme_mpath_remove_sysfs_link(ns);
>>> +
>>> del_gendisk(ns->disk);
>>> mutex_lock(&ns->ctrl->namespaces_lock);
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>> index 518e22dd4f9b..7d9c36a7a261 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -654,6 +654,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>>> nvme_add_ns_head_cdev(head);
>>> }
>>> + nvme_mpath_add_sysfs_link(ns);
>>> +
>>> mutex_lock(&head->lock);
>>> if (nvme_path_is_optimized(ns)) {
>>> int node, srcu_idx;
>>> @@ -922,6 +924,39 @@ static ssize_t ana_state_show(struct device *dev, struct device_attribute *attr,
>>> }
>>> DEVICE_ATTR_RO(ana_state);
>>> +static ssize_t queue_depth_show(struct device *dev, struct device_attribute *attr,
>>> + char *buf)
>>> +{
>>> + struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
>>> +
>>> + return sysfs_emit(buf, "%d\n", atomic_read(&ns->ctrl->nr_active));
>>> +}
>>> +DEVICE_ATTR_RO(queue_depth);
>>> +
>>> +static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr,
>>> + char *buf)
>>> +{
>>> + int node, srcu_idx;
>>> + nodemask_t numa_nodes;
>>> + struct nvme_ns *current_ns;
>>> + struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
>>> + struct nvme_ns_head *head = ns->head;
>>> +
>>> + nodes_clear(numa_nodes);
>>> +
>>> + srcu_idx = srcu_read_lock(&head->srcu);
>>> + for_each_node(node) {
>>> + current_ns = srcu_dereference(head->current_path[node],
>>> + &head->srcu);
>>> + if (ns == current_ns)
>>> + node_set(node, numa_nodes);
>>> + }
>>> + srcu_read_unlock(&head->srcu, srcu_idx);
>>> +
>>> + return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&numa_nodes));
>>> +}
>>> +DEVICE_ATTR_RO(numa_nodes);
>>> +
>>> static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
>>> struct nvme_ana_group_desc *desc, void *data)
>>> {
>>> @@ -934,6 +969,42 @@ static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
>>> return -ENXIO; /* just break out of the loop */
>>> }
>>> +void nvme_mpath_add_sysfs_link(struct nvme_ns *ns)
>>> +{
>>> + struct device *target;
>>> + struct kobject *kobj;
>>> + int rc;
>>> +
>>> + if (test_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
>>> + return;
>>> +
>>> + target = disk_to_dev(ns->disk);
>>> + kobj = &disk_to_dev(ns->head->disk)->kobj;
>>> + rc = sysfs_add_link_to_group(kobj, nvme_ns_mpath_attr_group.name,
>>> + &target->kobj, dev_name(target));
>>> + if (unlikely(rc)) {
>>> + dev_err(disk_to_dev(ns->head->disk),
>>> + "failed to create link to %s\n",
>>> + dev_name(target));
>>> + } else
>>> + set_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
>> This complication is not clear at all, not sure why this is needed, and it is
>> not documented.
> The nvme_mpath_add_sysfs_link function is invoked from nvme_mpath_set_live() when a
> new namespace is allocated and its ANA state is set to OPTIMIZED or NON-OPTIMIZED.
> The nvme_mpath_add_sysfs_link function simply creates the link to the target nvme
> path-device (nvmeXCYnZ) from head disk-node (nvmeXnY) device. If this looks complex
> then I would add commentary in the code in the next patch.
Still not clear why we need a flag for this? The usage suggests that we may call
add/remove sysfs link multiple times... What about inaccessible paths? Or paths
that transition to/from inaccessible? Will they add/remove their corresponding
sysfs links?
* Re: [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info
2024-10-23 9:58 ` Sagi Grimberg
@ 2024-10-23 13:31 ` Nilay Shroff
0 siblings, 0 replies; 14+ messages in thread
From: Nilay Shroff @ 2024-10-23 13:31 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme
Cc: dwagner, hch, kbusch, axboe, gjoyce, Hannes Reinecke
On 10/23/24 15:28, Sagi Grimberg wrote:
>>> Also, are the sysfs files always exist? what do numa_nodes output in case
>>> the policy is round-robin? or what would queue_depth file produce?
>>>
>> Assuming kernel is compiled with CONFIG_NVME_MULTIPATH, the sysfs files
>> "numa_nodes" and "round_robin" would be always created once gendisk node
>> (nvmeXCYnZ) for the per-path namespace device is added.
>>
>> In case io-policy is round-robin then numa_nodes file would show the numa
>> nodes preferred for the respective per-path namespace device in case the
>> policy is set to numa. For instance, if we read the file
>> "/sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes" then it shows the numa
>> nodes preferring the path nvme1c1n1 when I/O workload is targeted to nvme1n1.
>>
>> The queue_depth file would contain the value 0 if the io-policy is anything
>> but queue-depth. The reason being queue_depth file emits the output of num of
>> active requests currently queued up in the nvme/fabric queue (i.e. ctrl->nr_active)
>> however if the io-policy is not queue-depth then ->nr_active would typically remains
>> zero.
>>
>> BTW, we don't expect to read the numa_nodes file when the nvme subsystem
>> io-policy is NOT numa. Similarly we don't expect user to review the
>> queue_depth file when io-policy is NOT set to queue-depth. Isn't it?
>
> I'm asking what does it show in case it _is_ read. queue_depth and numa_nodes
> output seems to be a "lie" in case the policy is round-robin.
>
Yes, the output of queue_depth and numa_nodes is not useful when the io-policy
is round-robin. So do you suggest avoiding emitting any output when the io-policy
is round-robin and the user reads the numa_nodes or queue_depth file?
Similarly, when the io-policy is numa, should we refrain from emitting anything
if the user accesses the queue_depth file, and likewise when the io-policy is
queue-depth and the user accesses the numa_nodes file?
>>
>>>
>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>> ---
>>>> drivers/nvme/host/core.c | 3 ++
>>>> drivers/nvme/host/multipath.c | 71 +++++++++++++++++++++++++++++++++++
>>>> drivers/nvme/host/nvme.h | 20 ++++++++--
>>>> drivers/nvme/host/sysfs.c | 20 ++++++++++
>>>> 4 files changed, 110 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index 983909a600ad..6be29fd64236 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -3951,6 +3951,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>> if (!nvme_ns_head_multipath(ns->head))
>>>> nvme_cdev_del(&ns->cdev, &ns->cdev_device);
>>>> +
>>>> + nvme_mpath_remove_sysfs_link(ns);
>>>> +
>>>> del_gendisk(ns->disk);
>>>> mutex_lock(&ns->ctrl->namespaces_lock);
>>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>>> index 518e22dd4f9b..7d9c36a7a261 100644
>>>> --- a/drivers/nvme/host/multipath.c
>>>> +++ b/drivers/nvme/host/multipath.c
>>>> @@ -654,6 +654,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>>>> nvme_add_ns_head_cdev(head);
>>>> }
>>>> + nvme_mpath_add_sysfs_link(ns);
>>>> +
>>>> mutex_lock(&head->lock);
>>>> if (nvme_path_is_optimized(ns)) {
>>>> int node, srcu_idx;
>>>> @@ -922,6 +924,39 @@ static ssize_t ana_state_show(struct device *dev, struct device_attribute *attr,
>>>> }
>>>> DEVICE_ATTR_RO(ana_state);
>>>> +static ssize_t queue_depth_show(struct device *dev, struct device_attribute *attr,
>>>> + char *buf)
>>>> +{
>>>> + struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
>>>> +
>>>> + return sysfs_emit(buf, "%d\n", atomic_read(&ns->ctrl->nr_active));
>>>> +}
>>>> +DEVICE_ATTR_RO(queue_depth);
>>>> +
>>>> +static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr,
>>>> + char *buf)
>>>> +{
>>>> + int node, srcu_idx;
>>>> + nodemask_t numa_nodes;
>>>> + struct nvme_ns *current_ns;
>>>> + struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
>>>> + struct nvme_ns_head *head = ns->head;
>>>> +
>>>> + nodes_clear(numa_nodes);
>>>> +
>>>> + srcu_idx = srcu_read_lock(&head->srcu);
>>>> + for_each_node(node) {
>>>> + current_ns = srcu_dereference(head->current_path[node],
>>>> + &head->srcu);
>>>> + if (ns == current_ns)
>>>> + node_set(node, numa_nodes);
>>>> + }
>>>> + srcu_read_unlock(&head->srcu, srcu_idx);
>>>> +
>>>> + return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&numa_nodes));
>>>> +}
>>>> +DEVICE_ATTR_RO(numa_nodes);
>>>> +
>>>> static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
>>>> struct nvme_ana_group_desc *desc, void *data)
>>>> {
>>>> @@ -934,6 +969,42 @@ static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
>>>> return -ENXIO; /* just break out of the loop */
>>>> }
>>>> +void nvme_mpath_add_sysfs_link(struct nvme_ns *ns)
>>>> +{
>>>> + struct device *target;
>>>> + struct kobject *kobj;
>>>> + int rc;
>>>> +
>>>> + if (test_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
>>>> + return;
>>>> +
>>>> + target = disk_to_dev(ns->disk);
>>>> + kobj = &disk_to_dev(ns->head->disk)->kobj;
>>>> + rc = sysfs_add_link_to_group(kobj, nvme_ns_mpath_attr_group.name,
>>>> + &target->kobj, dev_name(target));
>>>> + if (unlikely(rc)) {
>>>> + dev_err(disk_to_dev(ns->head->disk),
>>>> + "failed to create link to %s\n",
>>>> + dev_name(target));
>>>> + } else
>>>> + set_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
>>> This complication is not clear at all, not sure why this is needed, and it is
>>> not documented.
>> The nvme_mpath_add_sysfs_link function is invoked from nvme_mpath_set_live() when a
>> new namespace is allocated and its ANA state is set to OPTIMIZED or NON-OPTIMIZED.
>> The nvme_mpath_add_sysfs_link function simply creates the link to the target nvme
>> path-device (nvmeXCYnZ) from head disk-node (nvmeXnY) device. If this looks complex
>> then I would add commentary in the code in the next patch.
>
> Still not clear why we need a flag for this? The usage suggest that we may call add/remove sysfs link
> multiple times... What about inaccessible paths? or paths that transition to/from inaccessible? They
> will add/remove their corresponding sysfs links?
The NVME_NS_SYSFS_ATTR_LINK flag is useful to avoid creating the link multiple times
when the ana state of an nvme path transitions from the optimized to the non-optimized
state or vice-versa. When the ana state of an nvme path transitions to live (i.e.
optimized or non-optimized), we call nvme_mpath_set_live() and from that function we
call nvme_mpath_add_sysfs_link().
If the sysfs link for the nvme path already exists and we attempt to re-create it,
sysfs_add_link_to_group() complains loudly about the anomaly by printing a stack
dump; see sysfs_warn_dup().
For an inaccessible path, the current implementation doesn't create a sysfs link.
However, in an earlier discussion, Hannes recommended[1] always creating the link
to a path regardless of its ana state, inaccessible or otherwise. So I am preparing
the next patch iteration where I will address this comment.
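To illustrate the duplicate-link problem in userspace terms, here is a
hypothetical Python analogue of the NVME_NS_SYSFS_ATTR_LINK guard: a per-path
flag keeps a repeated "set live" transition from re-creating an existing
symlink (all names here are illustrative stand-ins, not the kernel API):

```python
import os
import tempfile


class NvmePath:
    """Illustrative stand-in for a per-path namespace device."""

    def __init__(self, name):
        self.name = name
        self.link_created = False  # analogue of NVME_NS_SYSFS_ATTR_LINK


def add_sysfs_link(head_dir, target_dir, path):
    # Idempotent: a second "set live" transition (e.g. optimized ->
    # non-optimized -> optimized) must not try to re-create the link;
    # in real sysfs that would trigger sysfs_warn_dup().
    if path.link_created:
        return
    os.symlink(target_dir, os.path.join(head_dir, path.name))
    path.link_created = True


# Usage: calling twice is safe -- the second call is a no-op, whereas
# without the flag os.symlink() would raise FileExistsError.
head = tempfile.mkdtemp()
target = tempfile.mkdtemp()
p = NvmePath("nvme1c1n1")
add_sysfs_link(head, target, p)
add_sysfs_link(head, target, p)
```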
[1]: https://lore.kernel.org/all/3c92b9ae-1223-4787-90e3-955f82161269@suse.de/
Thanks,
--Nilay
Thread overview: 14+ messages
2024-09-11 6:26 [PATCHv4 RFC 0/1] Add visibility for native NVMe multipath using sysfs Nilay Shroff
2024-09-11 6:26 ` [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info Nilay Shroff
2024-10-07 10:14 ` Hannes Reinecke
2024-10-07 13:47 ` Nilay Shroff
2024-10-07 14:04 ` Hannes Reinecke
2024-10-07 15:33 ` Nilay Shroff
2024-10-16 3:19 ` Nilay Shroff
2024-10-16 6:52 ` Hannes Reinecke
2024-10-21 12:24 ` Nilay Shroff
2024-10-20 23:17 ` Sagi Grimberg
2024-10-21 13:37 ` Nilay Shroff
2024-10-23 9:58 ` Sagi Grimberg
2024-10-23 13:31 ` Nilay Shroff
2024-09-24 6:41 ` [PATCHv4 RFC 0/1] Add visibility for native NVMe multipath using sysfs Nilay Shroff