* [PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs
@ 2024-10-30 10:41 Nilay Shroff
2024-10-30 10:41 ` [PATCHv5 RFC 1/3] nvme-multipah: Add visibility for round-robin io-policy Nilay Shroff
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Nilay Shroff @ 2024-10-30 10:41 UTC (permalink / raw)
To: linux-nvme; +Cc: dwagner, sagi, hare, kbusch, hch, axboe, gjoyce, Nilay Shroff
Hi,
This RFC proposes adding new sysfs attributes that provide visibility into
nvme native multipath I/O.
The changes are divided into three patches.
The first patch adds visibility for the round-robin io-policy.
The second patch adds visibility for the numa io-policy.
The third patch adds visibility for the queue-depth io-policy.
As we know, NVMe native multipath supports three different io policies
(numa, round-robin and queue-depth) for selecting the I/O path; however, we
have no visibility into which path the multipath code selects for
forwarding I/O. This RFC adds that visibility through new sysfs attribute
files named "numa_nodes" and "queue_depth" under each namespace block
device path /sys/block/nvmeXcYnZ/. We also create a "multipath" sysfs
directory under the head disk node and, from this directory, add a link to
each namespace path device this head disk node points to.
Please find below the output generated with this proposed RFC series
applied on a system with two multi-controller PCIe NVMe disks attached to
it. This system is also an NVMf-TCP host which is connected to an NVMf-TCP
target over two NIC cards. The system had four numa nodes online when the
below output was captured:
# cat /sys/devices/system/node/online
0-3
# lscpu
<snip>
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s):
NUMA node1 CPU(s): 0-7
NUMA node2 CPU(s): 8-31
NUMA node3 CPU(s): 32-63
<snip>
Please note that NUMA node 0, though online, doesn't have any CPU
currently assigned to it.
# nvme list -v
Subsystem Subsystem-NQN Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys1 nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057 nvme0, nvme1
nvme-subsys3 nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1 nvme2, nvme3
nvme-subsys4 nvmet_subsystem nvme4, nvme5
Device Cntlid SN MN FR TxPort Address Slot Subsystem Namespaces
---------------- ------ -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme0 66 S6RTNE0R900057 3.2TB NVMe Gen4 U.2 SSD III REV.SN66 pcie 052e:78:00.0 U50EE.001.WZS000E-P3-C14-R1 nvme-subsys1 nvme1n1
nvme1 65 S6RTNE0R900057 3.2TB NVMe Gen4 U.2 SSD III REV.SN66 pcie 058e:78:00.0 U50EE.001.WZS000E-P3-C14-R2 nvme-subsys1 nvme1n1
nvme2 2 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0524:28:00.0 U50EE.001.WZS000E-P3-C4-R1 nvme-subsys3 nvme3n1
nvme3 1 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0584:28:00.0 U50EE.001.WZS000E-P3-C4-R2 nvme-subsys3 nvme3n1
nvme4 1 a224673364d1dcb6fab9 Linux 6.9.0-rc tcp traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100 nvme-subsys4 nvme4n1
nvme5 2 a224673364d1dcb6fab9 Linux 6.9.0-rc tcp traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100 nvme-subsys4 nvme4n1
Device Generic NSID Usage Format Controllers
----------------- ----------------- ---------- -------------------------- ---------------- ----------------
/dev/nvme1n1 /dev/ng1n1 0x1 5.75 GB / 5.75 GB 4 KiB + 0 B nvme0, nvme1
/dev/nvme3n1 /dev/ng3n1 0x2 0.00 B / 5.75 GB 4 KiB + 0 B nvme2, nvme3
/dev/nvme4n1 /dev/ng4n1 0x1 5.75 GB / 5.75 GB 4 KiB + 0 B nvme4, nvme5
# nvme show-topology
nvme-subsys1 - NQN=nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057
hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
iopolicy=numa
\
+- ns 1
\
+- nvme0 pcie 052e:78:00.0 live optimized
+- nvme1 pcie 058e:78:00.0 live optimized
nvme-subsys3 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1
hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
iopolicy=round-robin
\
+- ns 2
\
+- nvme2 pcie 0524:28:00.0 live optimized
+- nvme3 pcie 0584:28:00.0 live optimized
nvme-subsys4 - NQN=nvmet_subsystem
hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
iopolicy=queue-depth
\
+- ns 1
\
+- nvme4 tcp traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100 live optimized
+- nvme5 tcp traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100 live optimized
As we can see above, we have three shared namespaces created. In terms of
iopolicy, we have "numa" configured for nvme-subsys1, "round-robin"
configured for nvme-subsys3 and "queue-depth" configured for nvme-subsys4.
Now, under each namespace head disk node, we create a sysfs attribute
group named "multipath". The "multipath" group then links to each path
this head disk node points to:
# tree /sys/block/nvme1n1/multipath/
/sys/block/nvme1n1/multipath/
├── nvme1c0n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme0/nvme1c0n1
└── nvme1c1n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme1/nvme1c1n1
# tree /sys/block/nvme3n1/multipath/
/sys/block/nvme3n1/multipath/
├── nvme3c2n1 -> ../../../../../pci0524:28/0524:28:00.0/nvme/nvme2/nvme3c2n1
└── nvme3c3n1 -> ../../../../../pci0584:28/0584:28:00.0/nvme/nvme3/nvme3c3n1
# tree /sys/block/nvme4n1/multipath/
/sys/block/nvme4n1/multipath/
├── nvme4c4n1 -> ../../../../nvme-fabrics/ctl/nvme4/nvme4c4n1
└── nvme4c5n1 -> ../../../../nvme-fabrics/ctl/nvme5/nvme4c5n1
One can easily infer from the above output that, for the "round-robin"
I/O policy configured under nvme-subsys3, I/O workload targeted at nvme3n1
would alternate between nvme3c2n1 and nvme3c3n1, assuming the ANA state of
each path is optimized (as can be seen in the output of show-topology).
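With these links in place, the existing per-path "ana_state" attribute is
reachable straight from the head node, so the assumption above can be
verified in one shot. A purely illustrative check, reusing the device names
from the example above:

# for p in /sys/block/nvme3n1/multipath/*; do echo "$(basename "$p"): $(cat "$p"/ana_state)"; done
nvme3c2n1: optimized
nvme3c3n1: optimized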
For the numa I/O policy, configured under nvme-subsys1, the "numa_nodes"
attribute file shows the numa nodes preferred by the respective namespace
path. The numa_nodes value is a comma-delimited list of nodes or an A-B
range of nodes.
# cat /sys/block/nvme1n1/multipath/nvme1c0n1/numa_nodes
0-1
# cat /sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes
2-3
From the above output, one can easily infer that I/O workload targeted at
nvme1n1 and running on numa nodes 0 and 1 would use path nvme1c0n1.
Similarly, I/O workload running on numa nodes 2 and 3 would use path
nvme1c1n1.
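For a one-shot dump of the whole node-to-path mapping, grep across the
links works as well (purely illustrative; the device names are the ones
from the example above):

# grep . /sys/block/nvme1n1/multipath/*/numa_nodes
/sys/block/nvme1n1/multipath/nvme1c0n1/numa_nodes:0-1
/sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes:2-3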
For the queue-depth I/O policy, configured under nvme-subsys4, the "queue_depth"
attribute file shows the number of active/in-flight I/O requests currently
queued for each path.
# cat /sys/block/nvme4n1/multipath/nvme4c4n1/queue_depth
518
# cat /sys/block/nvme4n1/multipath/nvme4c5n1/queue_depth
504
From the above output, one can easily infer that I/O workload targeted at
nvme4n1 uses the two paths nvme4c4n1 and nvme4c5n1, whose current queue
depths are 518 and 504 respectively.
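Since "queue_depth" is a point-in-time snapshot, sampling it repeatedly
gives a feel for how the queue-depth policy balances load across the
paths. A minimal, purely illustrative way to do that with the device names
from the example above:

# watch -n 1 'grep . /sys/block/nvme4n1/multipath/*/queue_depth'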
Changes since v4:
- Ensure that we create sysfs link from head gendisk node to each path
device irrespective of the ANA state of the path (Hannes Reinecke)
- Split the patch into three patch series and add commentary in the
code so that it's easy to read and understand the core logic (Sagi
Grimberg)
- Don't show any output if user reads "numa_nodes" file and configured
iopolicy is anything but numa; similarly don't emit any output if user
reads "queue_depth" file and configured iopolicy is anything but
queue-depth (Sagi Grimberg)
Changes since v3:
- Protect the namespace dereference code with srcu read lock (Daniel Wagner)
Changes since v2:
- Use one value per one sysfs attribute (Keith Busch)
Changes since v1:
- Use sysfs to export multipath I/O information instead of debugfs
Nilay Shroff (3):
nvme-multipah: Add visibility for round-robin io-policy
nvme-multipath: Add visibility for numa io-policy
nvme-multipath: Add visibility for queue-depth io-policy
drivers/nvme/host/core.c | 3 +
drivers/nvme/host/multipath.c | 120 ++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 20 ++++--
drivers/nvme/host/sysfs.c | 20 ++++++
4 files changed, 159 insertions(+), 4 deletions(-)
--
2.45.2
* [PATCHv5 RFC 1/3] nvme-multipah: Add visibility for round-robin io-policy
2024-10-30 10:41 [PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs Nilay Shroff
@ 2024-10-30 10:41 ` Nilay Shroff
2024-12-10 9:22 ` Daniel Wagner
2024-10-30 10:41 ` [PATCHv5 RFC 2/3] nvme-multipath: Add visibility for numa io-policy Nilay Shroff
` (2 subsequent siblings)
3 siblings, 1 reply; 12+ messages in thread
From: Nilay Shroff @ 2024-10-30 10:41 UTC (permalink / raw)
To: linux-nvme; +Cc: dwagner, sagi, hare, kbusch, hch, axboe, gjoyce, Nilay Shroff
This patch adds nvme native multipath visibility for the round-robin
io-policy. It creates a "multipath" sysfs directory under the head gendisk
device node directory and then, from the "multipath" directory, adds a link
to each namespace path device the head node refers to.
For instance, if we have a shared namespace accessible from two different
controllers/paths then we create a soft link to each path device from head
disk node as shown below:
$ ls -l /sys/block/nvme1n1/multipath/
nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
In the above example, nvme1n1 is head gendisk node created for a shared
namespace and the namespace is accessible from nvme1c1n1 and nvme1c3n1
paths.
For round-robin I/O policy, we could easily infer from the above output
that I/O workload targeted to nvme1n1 would toggle across paths nvme1c1n1
and nvme1c3n1.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 3 ++
drivers/nvme/host/multipath.c | 81 +++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 18 ++++++--
drivers/nvme/host/sysfs.c | 14 ++++++
4 files changed, 112 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 84cb859a911d..c923bd2c5468 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3940,6 +3940,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
if (!nvme_ns_head_multipath(ns->head))
nvme_cdev_del(&ns->cdev, &ns->cdev_device);
+
+ nvme_mpath_remove_sysfs_link(ns);
+
del_gendisk(ns->disk);
mutex_lock(&ns->ctrl->namespaces_lock);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 6a15873055b9..43c87a946cc7 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -682,6 +682,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
kblockd_schedule_work(&head->partition_scan_work);
}
+ nvme_mpath_add_sysfs_link(ns->head);
+
mutex_lock(&head->lock);
if (nvme_path_is_optimized(ns)) {
int node, srcu_idx;
@@ -764,6 +766,25 @@ static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
if (nvme_state_is_live(ns->ana_state) &&
nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
nvme_mpath_set_live(ns);
+ else {
+ /*
+ * Add sysfs link from multipath head gendisk node to path
+ * device gendisk node.
+ * If path's ana state is live (i.e. state is either optimized
+ * or non-optimized) while we alloc the ns then sysfs link would
+ * be created from nvme_mpath_set_live(). In that case we would
+ * not fallthrough this code path. However for the path's ana
+ * state other than live, we call nvme_mpath_set_live() only
+ * after ana state transitioned to the live state. But we still
+ * want to create the sysfs link from head node to a path device
+ * irrespective of the path's ana state.
+ * If we reach through here then it means that path's ana state
+ * is not live but still create the sysfs link to this path from
+ * head node if head node of the path has already come alive.
+ */
+ if (test_bit(NVME_NSHEAD_DISK_LIVE, &ns->head->flags))
+ nvme_mpath_add_sysfs_link(ns->head);
+ }
}
static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
@@ -962,6 +983,66 @@ static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
return -ENXIO; /* just break out of the loop */
}
+void nvme_mpath_add_sysfs_link(struct nvme_ns_head *head)
+{
+ struct device *target;
+ int rc, srcu_idx;
+ struct nvme_ns *ns;
+ struct kobject *kobj = &disk_to_dev(head->disk)->kobj;
+
+ /*
+ * loop through each ns chained through the head->list and create the
+ * sysfs link from head node to the ns path node
+ */
+ srcu_idx = srcu_read_lock(&head->srcu);
+
+ list_for_each_entry_rcu(ns, &head->list, siblings) {
+ /*
+ * Avoid creating link if it already exists for the given path.
+ * When path ana state transitions from optimized to non-
+ * optimized or vice-versa, the nvme_mpath_set_live() is
+ * invoked, which in turn calls this function. Now if the sysfs
+ * link already exists for the given path and we attempt to re-
+ * create the link then sysfs code would warn about it loudly.
+ * So we evaluate NVME_NS_SYSFS_ATTR_LINK flag here to ensure
+ * that we're not creating duplicate link.
+ */
+ if (test_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
+ continue;
+
+ target = disk_to_dev(ns->disk);
+ /*
+ * Create sysfs link from head gendisk kobject @kobj to the
+ * ns path gendisk kobject @target->kobj.
+ */
+ rc = sysfs_add_link_to_group(kobj, nvme_ns_mpath_attr_group.name,
+ &target->kobj, dev_name(target));
+ if (unlikely(rc)) {
+ dev_err(disk_to_dev(ns->head->disk),
+ "failed to create link to %s\n",
+ dev_name(target));
+ } else
+ set_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
+ }
+
+ srcu_read_unlock(&head->srcu, srcu_idx);
+}
+
+void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns)
+{
+ struct device *target;
+ struct kobject *kobj;
+
+ if (!test_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
+ return;
+
+ target = disk_to_dev(ns->disk);
+ kobj = &disk_to_dev(ns->head->disk)->kobj;
+ sysfs_remove_link_from_group(kobj, nvme_ns_mpath_attr_group.name,
+ dev_name(target));
+ clear_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
+}
+
void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
{
if (nvme_ctrl_use_ana(ns->ctrl)) {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 093cb423f536..ff234bce58ef 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -528,10 +528,11 @@ struct nvme_ns {
struct nvme_ns_head *head;
unsigned long flags;
-#define NVME_NS_REMOVING 0
-#define NVME_NS_ANA_PENDING 2
-#define NVME_NS_FORCE_RO 3
-#define NVME_NS_READY 4
+#define NVME_NS_REMOVING 0
+#define NVME_NS_ANA_PENDING 2
+#define NVME_NS_FORCE_RO 3
+#define NVME_NS_READY 4
+#define NVME_NS_SYSFS_ATTR_LINK 5
struct cdev cdev;
struct device cdev_device;
@@ -927,6 +928,7 @@ int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
extern const struct attribute_group *nvme_ns_attr_groups[];
+extern const struct attribute_group nvme_ns_mpath_attr_group;
extern const struct pr_ops nvme_pr_ops;
extern const struct block_device_operations nvme_ns_head_ops;
extern const struct attribute_group nvme_dev_attrs_group;
@@ -949,6 +951,8 @@ void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys);
void nvme_failover_req(struct request *req);
void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
+void nvme_mpath_add_sysfs_link(struct nvme_ns_head *ns);
+void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns);
void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid);
void nvme_mpath_remove_disk(struct nvme_ns_head *head);
int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id);
@@ -1003,6 +1007,12 @@ static inline void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
static inline void nvme_mpath_remove_disk(struct nvme_ns_head *head)
{
}
+static inline void nvme_mpath_add_sysfs_link(struct nvme_ns *ns)
+{
+}
+static inline void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns)
+{
+}
static inline bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
{
return false;
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index b68a9e5f1ea3..5a23e23b0d01 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -299,8 +299,22 @@ static const struct attribute_group nvme_ns_attr_group = {
.is_visible = nvme_ns_attrs_are_visible,
};
+#ifdef CONFIG_NVME_MULTIPATH
+static struct attribute *nvme_ns_mpath_attrs[] = {
+ NULL,
+};
+
+const struct attribute_group nvme_ns_mpath_attr_group = {
+ .name = "multipath",
+ .attrs = nvme_ns_mpath_attrs,
+};
+#endif
+
const struct attribute_group *nvme_ns_attr_groups[] = {
&nvme_ns_attr_group,
+#ifdef CONFIG_NVME_MULTIPATH
+ &nvme_ns_mpath_attr_group,
+#endif
NULL,
};
--
2.45.2
* [PATCHv5 RFC 2/3] nvme-multipath: Add visibility for numa io-policy
2024-10-30 10:41 [PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs Nilay Shroff
2024-10-30 10:41 ` [PATCHv5 RFC 1/3] nvme-multipah: Add visibility for round-robin io-policy Nilay Shroff
@ 2024-10-30 10:41 ` Nilay Shroff
2024-10-30 10:41 ` [PATCHv5 RFC 3/3] nvme-multipath: Add visibility for queue-depth io-policy Nilay Shroff
2024-11-12 4:37 ` [PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs Nilay Shroff
3 siblings, 0 replies; 12+ messages in thread
From: Nilay Shroff @ 2024-10-30 10:41 UTC (permalink / raw)
To: linux-nvme; +Cc: dwagner, sagi, hare, kbusch, hch, axboe, gjoyce, Nilay Shroff
This patch adds nvme native multipath visibility for the numa io-policy.
It adds a new attribute file named "numa_nodes" under the namespace gendisk
device path node, which prints the list of numa nodes preferred by the
given namespace path. The numa_nodes value is a comma-delimited list of
nodes or an A-B range of nodes.
For instance, if we have a shared namespace accessible from two different
controllers/paths then listing the "multipath" directory under the head
node of the shared namespace shows the following output:
$ ls -l /sys/block/nvme1n1/multipath/
nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
In the above example, nvme1n1 is the head gendisk node created for a shared
namespace and this namespace is accessible through the nvme1c1n1 and
nvme1c3n1 paths. For the numa io-policy we can then refer to the
"numa_nodes" attribute file created under each namespace path:
$ cat /sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes
0-1
$ cat /sys/block/nvme1n1/multipath/nvme1c3n1/numa_nodes
2-3
From the above output, we infer that I/O workload targeted at nvme1n1
and running on numa nodes 0 and 1 would prefer using path nvme1c1n1.
Similarly, I/O workload running on numa nodes 2 and 3 would prefer
using path nvme1c3n1. Reading the "numa_nodes" file when the configured
io-policy is anything but numa shows no output.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/multipath.c | 27 +++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 1 +
drivers/nvme/host/sysfs.c | 5 +++++
3 files changed, 33 insertions(+)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 43c87a946cc7..7bfba6637916 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -971,6 +971,33 @@ static ssize_t ana_state_show(struct device *dev, struct device_attribute *attr,
}
DEVICE_ATTR_RO(ana_state);
+static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ int node, srcu_idx;
+ nodemask_t numa_nodes;
+ struct nvme_ns *current_ns;
+ struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+ struct nvme_ns_head *head = ns->head;
+
+ if (head->subsys->iopolicy != NVME_IOPOLICY_NUMA)
+ return 0;
+
+ nodes_clear(numa_nodes);
+
+ srcu_idx = srcu_read_lock(&head->srcu);
+ for_each_node(node) {
+ current_ns = srcu_dereference(head->current_path[node],
+ &head->srcu);
+ if (ns == current_ns)
+ node_set(node, numa_nodes);
+ }
+ srcu_read_unlock(&head->srcu, srcu_idx);
+
+ return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&numa_nodes));
+}
+DEVICE_ATTR_RO(numa_nodes);
+
static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
struct nvme_ana_group_desc *desc, void *data)
{
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ff234bce58ef..02c0eb7b462c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -978,6 +978,7 @@ static inline void nvme_trace_bio_complete(struct request *req)
extern bool multipath;
extern struct device_attribute dev_attr_ana_grpid;
extern struct device_attribute dev_attr_ana_state;
+extern struct device_attribute dev_attr_numa_nodes;
extern struct device_attribute subsys_attr_iopolicy;
static inline bool nvme_disk_is_ns_head(struct gendisk *disk)
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 5a23e23b0d01..5a5ee0beb166 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -258,6 +258,7 @@ static struct attribute *nvme_ns_attrs[] = {
#ifdef CONFIG_NVME_MULTIPATH
&dev_attr_ana_grpid.attr,
&dev_attr_ana_state.attr,
+ &dev_attr_numa_nodes.attr,
#endif
&dev_attr_io_passthru_err_log_enabled.attr,
NULL,
@@ -290,6 +291,10 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl))
return 0;
}
+ if (a == &dev_attr_numa_nodes.attr) {
+ if (nvme_disk_is_ns_head(dev_to_disk(dev)))
+ return 0;
+ }
#endif
return a->mode;
}
--
2.45.2
* [PATCHv5 RFC 3/3] nvme-multipath: Add visibility for queue-depth io-policy
2024-10-30 10:41 [PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs Nilay Shroff
2024-10-30 10:41 ` [PATCHv5 RFC 1/3] nvme-multipah: Add visibility for round-robin io-policy Nilay Shroff
2024-10-30 10:41 ` [PATCHv5 RFC 2/3] nvme-multipath: Add visibility for numa io-policy Nilay Shroff
@ 2024-10-30 10:41 ` Nilay Shroff
2024-11-12 4:37 ` [PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs Nilay Shroff
3 siblings, 0 replies; 12+ messages in thread
From: Nilay Shroff @ 2024-10-30 10:41 UTC (permalink / raw)
To: linux-nvme; +Cc: dwagner, sagi, hare, kbusch, hch, axboe, gjoyce, Nilay Shroff
This patch adds nvme native multipath visibility for the queue-depth
io-policy. It adds a new attribute file named "queue_depth" under the
namespace device path node which prints the number of active/in-flight
I/O requests currently queued for the given path.
For instance, if we have a shared namespace accessible from two different
controllers/paths then listing the "multipath" directory under the head
block node of the shared namespace shows the following output:
$ ls -l /sys/block/nvme1n1/multipath/
nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
In the above example, nvme1n1 is the head gendisk node created for a shared
namespace and the namespace is accessible through the nvme1c1n1 and
nvme1c3n1 paths. For the queue-depth io-policy we can then refer to the
"queue_depth" attribute file created under each namespace path:
$ cat /sys/block/nvme1n1/multipath/nvme1c1n1/queue_depth
518
$ cat /sys/block/nvme1n1/multipath/nvme1c3n1/queue_depth
504
From the above output, we can infer that I/O workload targeted at nvme1n1
uses the two paths nvme1c1n1 and nvme1c3n1, whose current queue depths are
518 and 504 respectively. Reading the "queue_depth" file when the
configured io-policy is anything but queue-depth shows no output.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/multipath.c | 12 ++++++++++++
drivers/nvme/host/nvme.h | 1 +
drivers/nvme/host/sysfs.c | 3 ++-
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 7bfba6637916..4f0381a0b191 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -971,6 +971,18 @@ static ssize_t ana_state_show(struct device *dev, struct device_attribute *attr,
}
DEVICE_ATTR_RO(ana_state);
+static ssize_t queue_depth_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+
+ if (ns->head->subsys->iopolicy != NVME_IOPOLICY_QD)
+ return 0;
+
+ return sysfs_emit(buf, "%d\n", atomic_read(&ns->ctrl->nr_active));
+}
+DEVICE_ATTR_RO(queue_depth);
+
static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 02c0eb7b462c..792820261db1 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -978,6 +978,7 @@ static inline void nvme_trace_bio_complete(struct request *req)
extern bool multipath;
extern struct device_attribute dev_attr_ana_grpid;
extern struct device_attribute dev_attr_ana_state;
+extern struct device_attribute dev_attr_queue_depth;
extern struct device_attribute dev_attr_numa_nodes;
extern struct device_attribute subsys_attr_iopolicy;
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 5a5ee0beb166..498659efacba 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -258,6 +258,7 @@ static struct attribute *nvme_ns_attrs[] = {
#ifdef CONFIG_NVME_MULTIPATH
&dev_attr_ana_grpid.attr,
&dev_attr_ana_state.attr,
+ &dev_attr_queue_depth.attr,
&dev_attr_numa_nodes.attr,
#endif
&dev_attr_io_passthru_err_log_enabled.attr,
@@ -291,7 +292,7 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl))
return 0;
}
- if (a == &dev_attr_numa_nodes.attr) {
+ if (a == &dev_attr_queue_depth.attr || a == &dev_attr_numa_nodes.attr) {
if (nvme_disk_is_ns_head(dev_to_disk(dev)))
return 0;
}
--
2.45.2
* Re: [PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs
2024-10-30 10:41 [PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs Nilay Shroff
` (2 preceding siblings ...)
2024-10-30 10:41 ` [PATCHv5 RFC 3/3] nvme-multipath: Add visibility for queue-depth io-policy Nilay Shroff
@ 2024-11-12 4:37 ` Nilay Shroff
2024-11-29 12:19 ` Nilay Shroff
3 siblings, 1 reply; 12+ messages in thread
From: Nilay Shroff @ 2024-11-12 4:37 UTC (permalink / raw)
To: linux-nvme; +Cc: dwagner, sagi, hare, kbusch, hch, axboe, gjoyce
Hi Hannes and Sagi,
A gentle ping... I have addressed your suggestions in this patch series.
Does this now look okay to you or do you have any further suggestions/comments?
Thanks,
--Nilay
On 10/30/24 16:11, Nilay Shroff wrote:
> [full cover letter quoted; snipped]
* Re: [PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs
2024-11-12 4:37 ` [PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs Nilay Shroff
@ 2024-11-29 12:19 ` Nilay Shroff
2024-12-10 7:03 ` Nilay Shroff
0 siblings, 1 reply; 12+ messages in thread
From: Nilay Shroff @ 2024-11-29 12:19 UTC (permalink / raw)
To: Hannes Reinecke, Sagi Grimberg
Cc: dwagner, sagi, hare, kbusch, hch, gjoyce, axboe@fb.com,
linux-nvme@lists.infradead.org
Hi Hannes and Sagi,
A gentle ping on this. Did you get a chance to look through this?
Please let me know if you still have any further comments.
Thanks,
--Nilay
On 11/12/24 10:07, Nilay Shroff wrote:
> [previous ping and full cover letter quoted; snipped]
* Re: [PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs
2024-11-29 12:19 ` Nilay Shroff
@ 2024-12-10 7:03 ` Nilay Shroff
2024-12-10 9:34 ` Daniel Wagner
0 siblings, 1 reply; 12+ messages in thread
From: Nilay Shroff @ 2024-12-10 7:03 UTC (permalink / raw)
To: Hannes Reinecke, Daniel Wagner, Keith Busch
Cc: hch, gjoyce, axboe@fb.com, linux-nvme@lists.infradead.org,
Sagi Grimberg
Hi Hannes, Keith, Daniel,
A gentle ping on this. This has been pending for quite some time now.
So would you please help? I have addressed all your comments.
Please let me know if you have any further comments.
Thanks,
--Nilay
On 11/29/24 17:49, Nilay Shroff wrote:
> [previous pings and full cover letter quoted; snipped]
* Re: [PATCHv5 RFC 1/3] nvme-multipah: Add visibility for round-robin io-policy
2024-10-30 10:41 ` [PATCHv5 RFC 1/3] nvme-multipah: Add visibility for round-robin io-policy Nilay Shroff
@ 2024-12-10 9:22 ` Daniel Wagner
2024-12-10 13:02 ` Nilay Shroff
2024-12-10 15:18 ` Hannes Reinecke
0 siblings, 2 replies; 12+ messages in thread
From: Daniel Wagner @ 2024-12-10 9:22 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-nvme, sagi, hare, kbusch, hch, axboe, gjoyce
There is a typo in the subject.
On Wed, Oct 30, 2024 at 04:11:41PM +0530, Nilay Shroff wrote:
> This patch adds nvme native multipath visibility for the round-robin
> io-policy. It creates a "multipath" sysfs directory under the head gendisk
> device node directory and then, from the "multipath" directory, adds a link
> to each namespace path device the head node refers to.
>
> For instance, if we have a shared namespace accessible from two different
> controllers/paths then we create a soft link to each path device from head
> disk node as shown below:
>
> $ ls -l /sys/block/nvme1n1/multipath/
> nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
> nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
>
> In the above example, nvme1n1 is head gendisk node created for a shared
> namespace and the namespace is accessible from nvme1c1n1 and nvme1c3n1
> paths.
>
> For round-robin I/O policy, we could easily infer from the above output
> that I/O workload targeted to nvme1n1 would toggle across paths nvme1c1n1
> and nvme1c3n1.
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> drivers/nvme/host/core.c | 3 ++
> drivers/nvme/host/multipath.c | 81 +++++++++++++++++++++++++++++++++++
> drivers/nvme/host/nvme.h | 18 ++++++--
> drivers/nvme/host/sysfs.c | 14 ++++++
> 4 files changed, 112 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 84cb859a911d..c923bd2c5468 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3940,6 +3940,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>
> if (!nvme_ns_head_multipath(ns->head))
> nvme_cdev_del(&ns->cdev, &ns->cdev_device);
> +
> + nvme_mpath_remove_sysfs_link(ns);
> +
> del_gendisk(ns->disk);
>
> mutex_lock(&ns->ctrl->namespaces_lock);
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 6a15873055b9..43c87a946cc7 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -682,6 +682,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
> kblockd_schedule_work(&head->partition_scan_work);
> }
>
> + nvme_mpath_add_sysfs_link(ns->head);
> +
> mutex_lock(&head->lock);
> if (nvme_path_is_optimized(ns)) {
> int node, srcu_idx;
> @@ -764,6 +766,25 @@ static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
> if (nvme_state_is_live(ns->ana_state) &&
> nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
> nvme_mpath_set_live(ns);
> + else {
> + /*
> + * Add sysfs link from multipath head gendisk node to path
> + * device gendisk node.
> + * If path's ana state is live (i.e. state is either optimized
> + * or non-optimized) while we alloc the ns then sysfs link would
> + * be created from nvme_mpath_set_live(). In that case we would
> + * not fallthrough this code path. However for the path's ana
> + * state other than live, we call nvme_mpath_set_live() only
> + * after ana state transitioned to the live state. But we still
> + * want to create the sysfs link from head node to a path device
> + * irrespective of the path's ana state.
> + * If we reach through here then it means that path's ana state
> + * is not live but still create the sysfs link to this path from
> + * head node if head node of the path has already come alive.
> + */
> + if (test_bit(NVME_NSHEAD_DISK_LIVE, &ns->head->flags))
> + nvme_mpath_add_sysfs_link(ns->head);
> + }
Remind me, why do we add the link here? Couldn't we add it earlier so
that we don't have to check the ctrl state? At least I read this as saying
that we want to add the link independent of the ctrl state.
Or do you want to say that whenever the link exists the path is usable? I'm
trying to figure out whether this is a good idea when nvme-cli/libnvme
iterates over sysfs and races with this; what's the user experience?
I am sorry if this has already been discussed. I just completely lost
the state :)
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs
2024-12-10 7:03 ` Nilay Shroff
@ 2024-12-10 9:34 ` Daniel Wagner
2024-12-11 9:29 ` Nilay Shroff
0 siblings, 1 reply; 12+ messages in thread
From: Daniel Wagner @ 2024-12-10 9:34 UTC (permalink / raw)
To: Nilay Shroff
Cc: Hannes Reinecke, Keith Busch, hch, gjoyce, axboe@fb.com,
linux-nvme@lists.infradead.org, Sagi Grimberg
On Tue, Dec 10, 2024 at 12:33:45PM +0530, Nilay Shroff wrote:
> >>> As we know, NVMe native multipath supports three different io policies
> >>> (numa, round-robin and queue-depth) for selecting I/O path, however, we
> >>> don't have any visibility about which path is being selected by multipath
> >>> code for forwarding I/O. This RFC helps add that visibility by adding new
> >>> sysfs attribute files named "numa_nodes" and "queue_depth" under each
> >>> namespace block device path /sys/block/nvmeXcYnZ/. We also create a
> >>> "multipath" sysfs directory under head disk node and then from this
> >>> directory add a link to each namespace path device this head disk node
> >>> points to.
> >>>
> >>> Please find below output generated with this proposed RFC patch applied on
> >>> a system with two multi-controller PCIe NVMe disks attached to it. This
> >>> system is also an NVMf-TCP host which is connected to an NVMf-TCP target
> >>> over two NIC cards. This system has four numa nodes online when the below
> >>> output was captured:
Looks good to me from libnvme's perspective. I just have a question about
the lifetime of the new link, but I might just be missing the point.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHv5 RFC 1/3] nvme-multipah: Add visibility for round-robin io-policy
2024-12-10 9:22 ` Daniel Wagner
@ 2024-12-10 13:02 ` Nilay Shroff
2024-12-10 15:18 ` Hannes Reinecke
1 sibling, 0 replies; 12+ messages in thread
From: Nilay Shroff @ 2024-12-10 13:02 UTC (permalink / raw)
To: Daniel Wagner; +Cc: linux-nvme, sagi, hare, kbusch, hch, =axboe, gjoyce
On 12/10/24 14:52, Daniel Wagner wrote:
> There is a typo in the subject.
Oh yes, I will fix it in the next version.
>
> On Wed, Oct 30, 2024 at 04:11:41PM +0530, Nilay Shroff wrote:
>> [...]
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 6a15873055b9..43c87a946cc7 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -682,6 +682,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>> kblockd_schedule_work(&head->partition_scan_work);
>> }
>>
>> + nvme_mpath_add_sysfs_link(ns->head);
>> +
>> mutex_lock(&head->lock);
>> if (nvme_path_is_optimized(ns)) {
>> int node, srcu_idx;
>> @@ -764,6 +766,25 @@ static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
>> if (nvme_state_is_live(ns->ana_state) &&
>> nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
>> nvme_mpath_set_live(ns);
>> + else {
>> + /*
>> + * Add sysfs link from multipath head gendisk node to path
>> + * device gendisk node.
>> + * If path's ana state is live (i.e. state is either optimized
>> + * or non-optimized) while we alloc the ns then sysfs link would
>> + * be created from nvme_mpath_set_live(). In that case we would
>> + * not fallthrough this code path. However for the path's ana
>> + * state other than live, we call nvme_mpath_set_live() only
>> + * after ana state transitioned to the live state. But we still
>> + * want to create the sysfs link from head node to a path device
>> + * irrespctive of the path's ana state.
>> + * If we reach through here then it means that path's ana state
>> + * is not live but still create the sysfs link to this path from
>> + * head node if head node of the path has already come alive.
>> + */
>> + if (test_bit(NVME_NSHEAD_DISK_LIVE, &ns->head->flags))
>> + nvme_mpath_add_sysfs_link(ns->head);
>> + }
>
> Remind me, why do we add the link here? Couldn't we add it earlier so
> that we don't have to check the ctrl state? At least I read this as saying
> that we want to add the link independent of the ctrl state.
>
Yes, we always add the link earlier when the path's ANA state is LIVE (i.e. the
ANA state is either optimized or non-optimized). However, there's a scenario where
we allocate the namespace path while its ANA state is anything but LIVE, and in
that case we fall through to the above code path.
Assume we create a shared namespace and attach it to two different controllers.
One path of this shared namespace advertises its ANA state as non-LIVE and the
other path advertises its ANA state as LIVE.
Now, if we first allocate the namespace path which advertises its ANA state as
non-LIVE, the head disk node wouldn't come live, and without the head disk node
we can't create the sysfs link to that namespace path. Later, when we allocate the
second path, which advertises its ANA state as LIVE, the head disk node comes
live, and so we can now create the sysfs link to both paths.
So it's important to check the state of the head disk node in the above code path.
The idea here is that we always want to create the sysfs link from the head disk
node to each namespace path irrespective of the path's ANA state. However, we can
only do that once at least one path advertises its ANA state as LIVE, so that the
corresponding head disk node comes live and we can then create the sysfs links to
all paths. In fact, this requirement was proposed by Hannes[1] and so I
implemented it in the v5 patchset.
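For reference, below is a simplified sketch of what the add-side helper does. It
is not the exact code from the patch (error paths and a few checks are trimmed),
but it reuses the NVME_NS_SYSFS_ATTR_LINK flag and the nvme_ns_mpath_attr_group
group that the removal helper quoted later in this thread also uses:
void nvme_mpath_add_sysfs_link(struct nvme_ns_head *head)
{
	struct nvme_ns *ns;
	struct kobject *kobj;
	int srcu_idx;

	/* No live head gendisk yet, so there is nothing to hang the links off. */
	if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
		return;

	kobj = &disk_to_dev(head->disk)->kobj;

	srcu_idx = srcu_read_lock(&head->srcu);
	list_for_each_entry_rcu(ns, &head->list, siblings) {
		struct device *target = disk_to_dev(ns->disk);

		/* Create each link only once, regardless of the path's ANA state. */
		if (test_and_set_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
			continue;

		if (sysfs_add_link_to_group(kobj, nvme_ns_mpath_attr_group.name,
					    &target->kobj, dev_name(target)))
			clear_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
	}
	srcu_read_unlock(&head->srcu, srcu_idx);
}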
> Or do you want to say that whenever the link exists the path is usable? I'm
> trying to figure out whether this is a good idea when nvme-cli/libnvme
> iterates over sysfs and races with this; what's the user experience?
>
If a link from the head node to a path exists, it doesn't always mean the path
is usable. We need to read the path's ANA state (the ana_state attribute exposed
under the path device's sysfs directory) to determine whether the path is usable
or not.
> I am sorry if this has already been discussed. I just completely lost
> the state :)
No worries..:) Please let me know if you have any further comments/questions.
[1] https://lore.kernel.org/all/3c92b9ae-1223-4787-90e3-955f82161269@suse.de/
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHv5 RFC 1/3] nvme-multipah: Add visibility for round-robin io-policy
2024-12-10 9:22 ` Daniel Wagner
2024-12-10 13:02 ` Nilay Shroff
@ 2024-12-10 15:18 ` Hannes Reinecke
1 sibling, 0 replies; 12+ messages in thread
From: Hannes Reinecke @ 2024-12-10 15:18 UTC (permalink / raw)
To: Daniel Wagner, Nilay Shroff; +Cc: linux-nvme, sagi, kbusch, hch, =axboe, gjoyce
On 12/10/24 10:22, Daniel Wagner wrote:
> There is a typo in the subject.
>
> On Wed, Oct 30, 2024 at 04:11:41PM +0530, Nilay Shroff wrote:
>> [...]
>
> Remind me, why do we add the link here? Couldn't we add it earlier so
> that we don't have to check the ctrl state? At least I read this as saying
> that we want to add the link independent of the ctrl state.
>
> Or do you want to say that whenever the link exists the path is usable? I'm
> trying to figure out whether this is a good idea when nvme-cli/libnvme
> iterates over sysfs and races with this; what's the user experience?
>
> I am sorry if this has already been discussed. I just completely lost
> the state :)
nvme_mpath_set_live() is the culprit.
We only present the nvme head device node to userspace once the first
path goes live.
That injects a bit of complexity when one wants to create the
links; we cannot create the symlinks at the point where a new path is registered.
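Schematically (a simplified sketch, not the exact upstream code; error handling
is omitted):
static void nvme_mpath_set_live(struct nvme_ns *ns)
{
	struct nvme_ns_head *head = ns->head;

	/*
	 * The head gendisk is registered only when the first path for this
	 * namespace goes live; before that there is no head kobject under
	 * which the multipath/ symlinks could be created.
	 */
	if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
		if (device_add_disk(&head->subsys->dev, head->disk, NULL))
			return;
	}

	/* ... only from this point on can the symlinks be added ... */
}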
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs
2024-12-10 9:34 ` Daniel Wagner
@ 2024-12-11 9:29 ` Nilay Shroff
0 siblings, 0 replies; 12+ messages in thread
From: Nilay Shroff @ 2024-12-11 9:29 UTC (permalink / raw)
To: Daniel Wagner
Cc: Hannes Reinecke, Keith Busch, hch, gjoyce, axboe@fb.com,
linux-nvme@lists.infradead.org, Sagi Grimberg
On 12/10/24 15:04, Daniel Wagner wrote:
> On Tue, Dec 10, 2024 at 12:33:45PM +0530, Nilay Shroff wrote:
>>>>> [...]
>
> Looks good to me from libnvme's perspective. I just have a question about
> the lifetime of the new link, but I might just be missing the point.
>
When a namespace path device is removed, we also remove the corresponding
link (if it exists) from the head disk node to the path device. Please refer
to the function below, from the first patch of the series, which is invoked
from nvme_ns_remove() to remove the sysfs link:
void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns)
{
	struct device *target;
	struct kobject *kobj;

	if (!test_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
		return;

	target = disk_to_dev(ns->disk);
	kobj = &disk_to_dev(ns->head->disk)->kobj;
	sysfs_remove_link_from_group(kobj, nvme_ns_mpath_attr_group.name,
			dev_name(target));
	clear_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
}
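Note that the NVME_NS_SYSFS_ATTR_LINK test makes the removal a no-op for a path
whose link was never created (for example when the head disk node never came
live), so link creation and removal stay symmetric.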
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads: [~2024-12-11 9:30 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-30 10:41 [PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs Nilay Shroff
2024-10-30 10:41 ` [PATCHv5 RFC 1/3] nvme-multipah: Add visibility for round-robin io-policy Nilay Shroff
2024-12-10 9:22 ` Daniel Wagner
2024-12-10 13:02 ` Nilay Shroff
2024-12-10 15:18 ` Hannes Reinecke
2024-10-30 10:41 ` [PATCHv5 RFC 2/3] nvme-multipath: Add visibility for numa io-policy Nilay Shroff
2024-10-30 10:41 ` [PATCHv5 RFC 3/3] nvme-multipath: Add visibility for queue-depth io-policy Nilay Shroff
2024-11-12 4:37 ` [PATCHv5 RFC 0/3] Add visibility for native NVMe multipath using sysfs Nilay Shroff
2024-11-29 12:19 ` Nilay Shroff
2024-12-10 7:03 ` Nilay Shroff
2024-12-10 9:34 ` Daniel Wagner
2024-12-11 9:29 ` Nilay Shroff