* [PATCHv6 RFC 0/3] Add visibility for native NVMe multipath using sysfs
@ 2024-12-13 4:18 Nilay Shroff
2024-12-13 4:18 ` [PATCHv6 RFC 1/3] nvme-multipath: Add visibility for round-robin io-policy Nilay Shroff
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: Nilay Shroff @ 2024-12-13 4:18 UTC (permalink / raw)
To: dwagner, hare; +Cc: kbusch, hch, sagi, axboe, gjoyce, linux-nvme
Hi,
This RFC propose adding new sysfs attributes for adding visibility of
nvme native multipath I/O.
The changes are divided into three patches.
The first patch adds visibility for round-robin io-policy.
The second patch adds visibility for numa io-policy.
The third patch adds the visibility for queue-depth io-policy.
As we know, NVMe native multipath supports three different io policies
(numa, round-robin and queue-depth) for selecting I/O path, however, we
don't have any visibility about which path is being selected by multipath
code for forwarding I/O. This RFC helps add that visibility by adding new
sysfs attribute files named "numa_nodes" and "queue_depth" under each
namespace block device path /sys/block/nvmeXcYnZ/. We also create a
"multipath" sysfs directory under head disk node and then from this
directory add a link to each namespace path device this head disk node
points to.
Please find below output generated with this proposed RFC patch applied on
a system with two multi-controller PCIe NVMe disks attached to it. This
system is also an NVMf-TCP host which is connected to an NVMf-TCP target
over two NIC cards. This system has four numa nodes online when the below
output was captured:
# cat /sys/devices/system/node/online
0-3
# lscpu
<snip>
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s):
NUMA node1 CPU(s): 0-7
NUMA node2 CPU(s): 8-31
NUMA node3 CPU(s): 32-63
<snip>
Please note that numa node 0 though online, doesn't have any CPU
currently assigned to it.
# nvme list -v
Subsystem Subsystem-NQN Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys1 nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057 nvme0, nvme1
nvme-subsys3 nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1 nvme2, nvme3
nvme-subsys4 nvmet_subsystem nvme4, nvme5
Device Cntlid SN MN FR TxPort Address Slot Subsystem Namespaces
---------------- ------ -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme0 66 S6RTNE0R900057 3.2TB NVMe Gen4 U.2 SSD III REV.SN66 pcie 052e:78:00.0 U50EE.001.WZS000E-P3-C14-R1 nvme-subsys1 nvme1n1
nvme1 65 S6RTNE0R900057 3.2TB NVMe Gen4 U.2 SSD III REV.SN66 pcie 058e:78:00.0 U50EE.001.WZS000E-P3-C14-R2 nvme-subsys1 nvme1n1
nvme2 2 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0524:28:00.0 U50EE.001.WZS000E-P3-C4-R1 nvme-subsys3 nvme3n1
nvme3 1 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0584:28:00.0 U50EE.001.WZS000E-P3-C4-R2 nvme-subsys3 nvme3n1
nvme4 1 a224673364d1dcb6fab9 Linux 6.9.0-rc tcp traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100 nvme-subsys4 nvme4n1
nvme5 2 a224673364d1dcb6fab9 Linux 6.9.0-rc tcp traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100 nvme-subsys4 nvme4n1
Device Generic NSID Usage Format Controllers
----------------- ----------------- ---------- -------------------------- ---------------- ----------------
/dev/nvme1n1 /dev/ng1n1 0x1 5.75 GB / 5.75 GB 4 KiB + 0 B nvme0, nvme1
/dev/nvme3n1 /dev/ng3n1 0x2 0.00 B / 5.75 GB 4 KiB + 0 B nvme2, nvme3
/dev/nvme4n1 /dev/ng4n1 0x1 5.75 GB / 5.75 GB 4 KiB + 0 B nvme4, nvme5
# nvme show-topology
nvme-subsys1 - NQN=nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057
hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
iopolicy=numa
\
+- ns 1
\
+- nvme0 pcie 052e:78:00.0 live optimized
+- nvme1 pcie 058e:78:00.0 live optimized
nvme-subsys3 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1
hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
iopolicy=round-robin
\
+- ns 2
\
+- nvme2 pcie 0524:28:00.0 live optimized
+- nvme3 pcie 0584:28:00.0 live optimized
nvme-subsys4 - NQN=nvmet_subsystem
hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
iopolicy=queue-depth
\
+- ns 1
\
+- nvme4 tcp traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100 live optimized
+- nvme5 tcp traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100 live optimized
As we could see above, we've three shared namespaces created. In terms of
iopolicy, we have "numa" configured for nvme-subsys1, "round-robin"
configured for nvme-subsys3 and "queue-depth" configured for nvme-subsys4.
Now, under each namespace "head disk node", we create a sysfs group
attribute named "multipath". The "multipath" group then points to the
each path this head disk node points to:
# tree /sys/block/nvme1n1/multipath/
/sys/block/nvme1n1/multipath/
├── nvme1c0n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme0/nvme1c0n1
└── nvme1c1n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme1/nvme1c1n1
# tree /sys/block/nvme3n1/multipath/
/sys/block/nvme3n1/multipath/
├── nvme3c2n1 -> ../../../../../pci0524:28/0524:28:00.0/nvme/nvme2/nvme3c2n1
└── nvme3c3n1 -> ../../../../../pci0584:28/0584:28:00.0/nvme/nvme3/nvme3c3n1
# tree /sys/block/nvme4n1/multipath/
/sys/block/nvme4n1/multipath/
├── nvme4c4n1 -> ../../../../nvme-fabrics/ctl/nvme4/nvme4c4n1
└── nvme4c5n1 -> ../../../../nvme-fabrics/ctl/nvme5/nvme4c5n1
One can easily infer from the above output that for the "round-robin"
I/O policy, configured under nvme-subsys3, the I/O workload targeted at
nvme3n1 would toggle across nvme3c2n1 and nvme3c3n1 assuming the ana state
of each path is optimized (as can be seen in the output of show-topology).
For numa I/O policy, configured under nvme-subsys1, the "numa_nodes"
attribute file shows the numa nodes being preferred by the respective
namespace path. The numa nodes value is comma delimited list of nodes or
A-B range of nodes.
# cat /sys/block/nvme1n1/multipath/nvme1c0n1/numa_nodes
0-1
# cat /sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes
2-3
From the above output, one can easily infer that I/O workload targeted at
nvme1n1 and running on numa nodes 0 and 1 would use path nvme1c0n1.
Similarly, I/O workload running on numa nodes 2 and 3 would use path
nvme1c1n1.
For queue-depth I/O policy, configured under nvme-subsys4, the "queue_depth"
attribute file shows the number of active/in-flight I/O requests currently
queued for each path.
# cat /sys/block/nvme4n1/multipath/nvme4c4n1/queue_depth
518
# cat /sys/block/nvme4n1/multipath/nvme4c5n1/queue_depth
504
From the above output, one can easily infer that I/O workload targeted at
nvme4n1 uses two paths nvme4c4n1 and nvme4c5n1 and the current queue depth
of each path is 518 and 504 respectively.
Changes since v5:
- Fix typo in the subject line of the first patch in the series (Danied
Wagner)
- Link to v5: https://lore.kernel.org/all/20241030104156.747675-1-nilay@linux.ibm.com/
changes since v4:
- Ensure that we create sysfs link from head gendisk node to each path
device irrespective of the ANA state of the path (Hannes Reinecke)
- Split the patch into three patch series and add commentary in the
code so that it's easy to read and understand the core logic (Sagi
Grimberg)
- Don't show any output if user reads "numa_nodes" file and configured
iopolicy is anything but numa; similarly don't emit any output if user
reads "queue_depth" file and configured iopolicy is anything but
queue-depth (Sagi Grimberg)
- Link to v4: https://lore.kernel.org/all/20240911062653.1060056-1-nilay@linux.ibm.com/
Changes since v3:
- Protect the namespace dereference code with srcu read lock (Daniel Wagner)
- Link to v3: https://lore.kernel.org/all/20240903135228.283820-1-nilay@linux.ibm.com/
Changes since v2:
- Use one value per one sysfs attribute (Keith Busch)
- Link to v2: https://lore.kernel.org/all/20240809173030.2281021-1-nilay@linux.ibm.com/
Changes since v1:
- Use sysfs to export multipath I/O information instead of debugfs
- Link to v1: https://lore.kernel.org/all/20240722093124.42581-1-nilay@linux.ibm.com/
Nilay Shroff (3):
nvme-multipath: Add visibility for round-robin io-policy
nvme-multipath: Add visibility for numa io-policy
nvme-multipath: Add visibility for queue-depth io-policy
drivers/nvme/host/core.c | 3 +
drivers/nvme/host/multipath.c | 120 ++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 20 ++++--
drivers/nvme/host/sysfs.c | 20 ++++++
4 files changed, 159 insertions(+), 4 deletions(-)
--
2.45.2
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCHv6 RFC 1/3] nvme-multipath: Add visibility for round-robin io-policy
2024-12-13 4:18 [PATCHv6 RFC 0/3] Add visibility for native NVMe multipath using sysfs Nilay Shroff
@ 2024-12-13 4:18 ` Nilay Shroff
2024-12-24 11:01 ` Sagi Grimberg
2024-12-13 4:18 ` [PATCHv6 RFC 2/3] nvme-multipath: Add visibility for numa io-policy Nilay Shroff
` (2 subsequent siblings)
3 siblings, 1 reply; 11+ messages in thread
From: Nilay Shroff @ 2024-12-13 4:18 UTC (permalink / raw)
To: dwagner, hare; +Cc: kbusch, hch, sagi, axboe, gjoyce, linux-nvme
This patch helps add nvme native multipath visibility for round-robin
io-policy. It creates a "multipath" sysfs directory under head gendisk
device node directory and then from "multipath" directory it adds a link
to each namespace path device the head node refers.
For instance, if we have a shared namespace accessible from two different
controllers/paths then we create a soft link to each path device from head
disk node as shown below:
$ ls -l /sys/block/nvme1n1/multipath/
nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
In the above example, nvme1n1 is head gendisk node created for a shared
namespace and the namespace is accessible from nvme1c1n1 and nvme1c3n1
paths.
For round-robin I/O policy, we could easily infer from the above output
that I/O workload targeted to nvme1n1 would toggle across paths nvme1c1n1
and nvme1c3n1.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 3 ++
drivers/nvme/host/multipath.c | 81 +++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 18 ++++++--
drivers/nvme/host/sysfs.c | 14 ++++++
4 files changed, 112 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d169a30eb935..df4cc8a27385 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3982,6 +3982,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
if (!nvme_ns_head_multipath(ns->head))
nvme_cdev_del(&ns->cdev, &ns->cdev_device);
+
+ nvme_mpath_remove_sysfs_link(ns);
+
del_gendisk(ns->disk);
mutex_lock(&ns->ctrl->namespaces_lock);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a85d190942bd..8e2865df2f33 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -686,6 +686,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
kblockd_schedule_work(&head->partition_scan_work);
}
+ nvme_mpath_add_sysfs_link(ns->head);
+
mutex_lock(&head->lock);
if (nvme_path_is_optimized(ns)) {
int node, srcu_idx;
@@ -768,6 +770,25 @@ static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
if (nvme_state_is_live(ns->ana_state) &&
nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
nvme_mpath_set_live(ns);
+ else {
+ /*
+ * Add sysfs link from multipath head gendisk node to path
+ * device gendisk node.
+ * If path's ana state is live (i.e. state is either optimized
+ * or non-optimized) while we alloc the ns then sysfs link would
+ * be created from nvme_mpath_set_live(). In that case we would
+ * not fallthrough this code path. However for the path's ana
+ * state other than live, we call nvme_mpath_set_live() only
+ * after ana state transitioned to the live state. But we still
+ * want to create the sysfs link from head node to a path device
+ * irrespctive of the path's ana state.
+ * If we reach through here then it means that path's ana state
+ * is not live but still create the sysfs link to this path from
+ * head node if head node of the path has already come alive.
+ */
+ if (test_bit(NVME_NSHEAD_DISK_LIVE, &ns->head->flags))
+ nvme_mpath_add_sysfs_link(ns->head);
+ }
}
static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
@@ -967,6 +988,66 @@ static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
return -ENXIO; /* just break out of the loop */
}
+void nvme_mpath_add_sysfs_link(struct nvme_ns_head *head)
+{
+ struct device *target;
+ int rc, srcu_idx;
+ struct nvme_ns *ns;
+ struct kobject *kobj = &disk_to_dev(head->disk)->kobj;
+
+ /*
+ * loop through each ns chained through the head->list and create the
+ * sysfs link from head node to the ns path node
+ */
+ srcu_idx = srcu_read_lock(&head->srcu);
+
+ list_for_each_entry_rcu(ns, &head->list, siblings) {
+ /*
+ * Avoid creating link if it already exists for the given path.
+ * When path ana state transitions from optimized to non-
+ * optimized or vice-versa, the nvme_mpath_set_live() is
+ * invoked which in truns call this function. Now if the sysfs
+ * link already exists for the given path and we attempt to re-
+ * create the link then sysfs code would warn about it loudly.
+ * So we evaluate NVME_NS_SYSFS_ATTR_LINK flag here to ensure
+ * that we're not creating duplicate link.
+ */
+ if (test_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
+ continue;
+
+ target = disk_to_dev(ns->disk);
+ /*
+ * Create sysfs link from head gendisk kobject @kobj to the
+ * ns path gendisk kobject @target->kobj.
+ */
+ rc = sysfs_add_link_to_group(kobj, nvme_ns_mpath_attr_group.name,
+ &target->kobj, dev_name(target));
+ if (unlikely(rc)) {
+ dev_err(disk_to_dev(ns->head->disk),
+ "failed to create link to %s\n",
+ dev_name(target));
+ } else
+ set_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
+ }
+
+ srcu_read_unlock(&head->srcu, srcu_idx);
+}
+
+void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns)
+{
+ struct device *target;
+ struct kobject *kobj;
+
+ if (!test_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags))
+ return;
+
+ target = disk_to_dev(ns->disk);
+ kobj = &disk_to_dev(ns->head->disk)->kobj;
+ sysfs_remove_link_from_group(kobj, nvme_ns_mpath_attr_group.name,
+ dev_name(target));
+ clear_bit(NVME_NS_SYSFS_ATTR_LINK, &ns->flags);
+}
+
void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
{
if (nvme_ctrl_use_ana(ns->ctrl)) {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 611b02c8a8b3..995e33864d81 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -529,10 +529,11 @@ struct nvme_ns {
struct nvme_ns_head *head;
unsigned long flags;
-#define NVME_NS_REMOVING 0
-#define NVME_NS_ANA_PENDING 2
-#define NVME_NS_FORCE_RO 3
-#define NVME_NS_READY 4
+#define NVME_NS_REMOVING 0
+#define NVME_NS_ANA_PENDING 2
+#define NVME_NS_FORCE_RO 3
+#define NVME_NS_READY 4
+#define NVME_NS_SYSFS_ATTR_LINK 5
struct cdev cdev;
struct device cdev_device;
@@ -928,6 +929,7 @@ int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo);
int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
extern const struct attribute_group *nvme_ns_attr_groups[];
+extern const struct attribute_group nvme_ns_mpath_attr_group;
extern const struct pr_ops nvme_pr_ops;
extern const struct block_device_operations nvme_ns_head_ops;
extern const struct attribute_group nvme_dev_attrs_group;
@@ -950,6 +952,8 @@ void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys);
void nvme_failover_req(struct request *req);
void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
+void nvme_mpath_add_sysfs_link(struct nvme_ns_head *ns);
+void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns);
void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid);
void nvme_mpath_remove_disk(struct nvme_ns_head *head);
int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id);
@@ -1004,6 +1008,12 @@ static inline void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
static inline void nvme_mpath_remove_disk(struct nvme_ns_head *head)
{
}
+static inline void nvme_mpath_add_sysfs_link(struct nvme_ns *ns)
+{
+}
+static inline void nvme_mpath_remove_sysfs_link(struct nvme_ns *ns)
+{
+}
static inline bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
{
return false;
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index b68a9e5f1ea3..5a23e23b0d01 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -299,8 +299,22 @@ static const struct attribute_group nvme_ns_attr_group = {
.is_visible = nvme_ns_attrs_are_visible,
};
+#ifdef CONFIG_NVME_MULTIPATH
+static struct attribute *nvme_ns_mpath_attrs[] = {
+ NULL,
+};
+
+const struct attribute_group nvme_ns_mpath_attr_group = {
+ .name = "multipath",
+ .attrs = nvme_ns_mpath_attrs,
+};
+#endif
+
const struct attribute_group *nvme_ns_attr_groups[] = {
&nvme_ns_attr_group,
+#ifdef CONFIG_NVME_MULTIPATH
+ &nvme_ns_mpath_attr_group,
+#endif
NULL,
};
--
2.45.2
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCHv6 RFC 2/3] nvme-multipath: Add visibility for numa io-policy
2024-12-13 4:18 [PATCHv6 RFC 0/3] Add visibility for native NVMe multipath using sysfs Nilay Shroff
2024-12-13 4:18 ` [PATCHv6 RFC 1/3] nvme-multipath: Add visibility for round-robin io-policy Nilay Shroff
@ 2024-12-13 4:18 ` Nilay Shroff
2024-12-24 11:18 ` Sagi Grimberg
2024-12-13 4:18 ` [PATCHv6 RFC 3/3] nvme-multipath: Add visibility for queue-depth io-policy Nilay Shroff
2025-01-08 16:47 ` [PATCHv6 RFC 0/3] Add visibility for native NVMe multipath using sysfs Keith Busch
3 siblings, 1 reply; 11+ messages in thread
From: Nilay Shroff @ 2024-12-13 4:18 UTC (permalink / raw)
To: dwagner, hare; +Cc: kbusch, hch, sagi, axboe, gjoyce, linux-nvme
This patch helps add nvme native multipath visibility for numa io-policy.
It adds a new attribute file named "numa_nodes" under namespace gendisk
device path node which prints the list of numa nodes preferred by the
given namespace path. The numa nodes value is comma delimited list of
nodes or A-B range of nodes.
For instance, if we have a shared namespace accessible from two different
controllers/paths then accessing head node of the shared namespace would
show the following output:
$ ls -l /sys/block/nvme1n1/multipath/
nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
In the above example, nvme1n1 is head gendisk node created for a shared
namespace and this namespace is accessible from nvme1c1n1 and nvme1c3n1
paths. For numa io-policy we can then refer the "numa_nodes" attribute
file created under each namespace path:
$ cat /sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes
0-1
$ cat /sys/block/nvme1n1/multipath/nvme1c3n1/numa_nodes
2-3
From the above output, we infer that I/O workload targeted at nvme1n1
and running on numa nodes 0 and 1 would prefer using path nvme1c1n1.
Similarly, I/O workload running on numa nodes 2 and 3 would prefer
using path nvme1c3n1. Reading "numa_nodes" file when configured
io-policy is anything but numa would show no output.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/multipath.c | 27 +++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 1 +
drivers/nvme/host/sysfs.c | 5 +++++
3 files changed, 33 insertions(+)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 8e2865df2f33..6bf5e01b04f4 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -976,6 +976,33 @@ static ssize_t ana_state_show(struct device *dev, struct device_attribute *attr,
}
DEVICE_ATTR_RO(ana_state);
+static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ int node, srcu_idx;
+ nodemask_t numa_nodes;
+ struct nvme_ns *current_ns;
+ struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+ struct nvme_ns_head *head = ns->head;
+
+ if (head->subsys->iopolicy != NVME_IOPOLICY_NUMA)
+ return 0;
+
+ nodes_clear(numa_nodes);
+
+ srcu_idx = srcu_read_lock(&head->srcu);
+ for_each_node(node) {
+ current_ns = srcu_dereference(head->current_path[node],
+ &head->srcu);
+ if (ns == current_ns)
+ node_set(node, numa_nodes);
+ }
+ srcu_read_unlock(&head->srcu, srcu_idx);
+
+ return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&numa_nodes));
+}
+DEVICE_ATTR_RO(numa_nodes);
+
static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
struct nvme_ana_group_desc *desc, void *data)
{
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 995e33864d81..3265e07b2998 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -979,6 +979,7 @@ static inline void nvme_trace_bio_complete(struct request *req)
extern bool multipath;
extern struct device_attribute dev_attr_ana_grpid;
extern struct device_attribute dev_attr_ana_state;
+extern struct device_attribute dev_attr_numa_nodes;
extern struct device_attribute subsys_attr_iopolicy;
static inline bool nvme_disk_is_ns_head(struct gendisk *disk)
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 5a23e23b0d01..5a5ee0beb166 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -258,6 +258,7 @@ static struct attribute *nvme_ns_attrs[] = {
#ifdef CONFIG_NVME_MULTIPATH
&dev_attr_ana_grpid.attr,
&dev_attr_ana_state.attr,
+ &dev_attr_numa_nodes.attr,
#endif
&dev_attr_io_passthru_err_log_enabled.attr,
NULL,
@@ -290,6 +291,10 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl))
return 0;
}
+ if (a == &dev_attr_numa_nodes.attr) {
+ if (nvme_disk_is_ns_head(dev_to_disk(dev)))
+ return 0;
+ }
#endif
return a->mode;
}
--
2.45.2
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCHv6 RFC 3/3] nvme-multipath: Add visibility for queue-depth io-policy
2024-12-13 4:18 [PATCHv6 RFC 0/3] Add visibility for native NVMe multipath using sysfs Nilay Shroff
2024-12-13 4:18 ` [PATCHv6 RFC 1/3] nvme-multipath: Add visibility for round-robin io-policy Nilay Shroff
2024-12-13 4:18 ` [PATCHv6 RFC 2/3] nvme-multipath: Add visibility for numa io-policy Nilay Shroff
@ 2024-12-13 4:18 ` Nilay Shroff
2024-12-24 11:18 ` Sagi Grimberg
2025-01-08 16:47 ` [PATCHv6 RFC 0/3] Add visibility for native NVMe multipath using sysfs Keith Busch
3 siblings, 1 reply; 11+ messages in thread
From: Nilay Shroff @ 2024-12-13 4:18 UTC (permalink / raw)
To: dwagner, hare; +Cc: kbusch, hch, sagi, axboe, gjoyce, linux-nvme
This patch helps add nvme native multipath visibility for queue-depth
io-policy. It adds a new attribute file named "queue_depth" under
namespace device path node which would print the number of active/
in-flight I/O requests currently queued for the given path.
For instance, if we have a shared namespace accessible from two different
controllers/paths then accessing head block node of the shared namespace
would show the following output:
$ ls -l /sys/block/nvme1n1/multipath/
nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
In the above example, nvme1n1 is head gendisk node created for a shared
namespace and the namespace is accessible from nvme1c1n1 and nvme1c3n1
paths. For queue-depth io-policy we can then refer the "queue_depth"
attribute file created under each namespace path:
$ cat /sys/block/nvme1n1/multipath/nvme1c1n1/queue_depth
518
$cat /sys/block/nvme1n1/multipath/nvme1c3n1/queue_depth
504
From the above output, we can infer that I/O workload targeted at nvme1n1
uses two paths nvme1c1n1 and nvme1c3n1 and the current queue depth of each
path is 518 and 504 respectively. Reading "queue_depth" file when
configured io-policy is anything but queue-depth would show no output.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/multipath.c | 12 ++++++++++++
drivers/nvme/host/nvme.h | 1 +
drivers/nvme/host/sysfs.c | 3 ++-
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 6bf5e01b04f4..bd340b1524d8 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -976,6 +976,18 @@ static ssize_t ana_state_show(struct device *dev, struct device_attribute *attr,
}
DEVICE_ATTR_RO(ana_state);
+static ssize_t queue_depth_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+
+ if (ns->head->subsys->iopolicy != NVME_IOPOLICY_QD)
+ return 0;
+
+ return sysfs_emit(buf, "%d\n", atomic_read(&ns->ctrl->nr_active));
+}
+DEVICE_ATTR_RO(queue_depth);
+
static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr,
char *buf)
{
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 3265e07b2998..e860d8609bb7 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -979,6 +979,7 @@ static inline void nvme_trace_bio_complete(struct request *req)
extern bool multipath;
extern struct device_attribute dev_attr_ana_grpid;
extern struct device_attribute dev_attr_ana_state;
+extern struct device_attribute dev_attr_queue_depth;
extern struct device_attribute dev_attr_numa_nodes;
extern struct device_attribute subsys_attr_iopolicy;
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 5a5ee0beb166..498659efacba 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -258,6 +258,7 @@ static struct attribute *nvme_ns_attrs[] = {
#ifdef CONFIG_NVME_MULTIPATH
&dev_attr_ana_grpid.attr,
&dev_attr_ana_state.attr,
+ &dev_attr_queue_depth.attr,
&dev_attr_numa_nodes.attr,
#endif
&dev_attr_io_passthru_err_log_enabled.attr,
@@ -291,7 +292,7 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl))
return 0;
}
- if (a == &dev_attr_numa_nodes.attr) {
+ if (a == &dev_attr_queue_depth.attr || a == &dev_attr_numa_nodes.attr) {
if (nvme_disk_is_ns_head(dev_to_disk(dev)))
return 0;
}
--
2.45.2
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCHv6 RFC 1/3] nvme-multipath: Add visibility for round-robin io-policy
2024-12-13 4:18 ` [PATCHv6 RFC 1/3] nvme-multipath: Add visibility for round-robin io-policy Nilay Shroff
@ 2024-12-24 11:01 ` Sagi Grimberg
2024-12-24 11:31 ` Nilay Shroff
0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2024-12-24 11:01 UTC (permalink / raw)
To: Nilay Shroff, dwagner, hare; +Cc: kbusch, hch, axboe, gjoyce, linux-nvme
On 13/12/2024 6:18, Nilay Shroff wrote:
> This patch helps add nvme native multipath visibility for round-robin
> io-policy. It creates a "multipath" sysfs directory under head gendisk
> device node directory and then from "multipath" directory it adds a link
> to each namespace path device the head node refers.
>
> For instance, if we have a shared namespace accessible from two different
> controllers/paths then we create a soft link to each path device from head
> disk node as shown below:
>
> $ ls -l /sys/block/nvme1n1/multipath/
> nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
> nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
>
> In the above example, nvme1n1 is head gendisk node created for a shared
> namespace and the namespace is accessible from nvme1c1n1 and nvme1c3n1
> paths.
>
> For round-robin I/O policy, we could easily infer from the above output
> that I/O workload targeted to nvme1n1 would toggle across paths nvme1c1n1
> and nvme1c3n1.
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> drivers/nvme/host/core.c | 3 ++
> drivers/nvme/host/multipath.c | 81 +++++++++++++++++++++++++++++++++++
> drivers/nvme/host/nvme.h | 18 ++++++--
> drivers/nvme/host/sysfs.c | 14 ++++++
> 4 files changed, 112 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index d169a30eb935..df4cc8a27385 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3982,6 +3982,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>
> if (!nvme_ns_head_multipath(ns->head))
> nvme_cdev_del(&ns->cdev, &ns->cdev_device);
> +
> + nvme_mpath_remove_sysfs_link(ns);
> +
> del_gendisk(ns->disk);
>
> mutex_lock(&ns->ctrl->namespaces_lock);
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index a85d190942bd..8e2865df2f33 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -686,6 +686,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
> kblockd_schedule_work(&head->partition_scan_work);
> }
>
> + nvme_mpath_add_sysfs_link(ns->head);
Why do you add the link in set_live? Why not always set this link? it is
after all, another
path to the device?
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCHv6 RFC 2/3] nvme-multipath: Add visibility for numa io-policy
2024-12-13 4:18 ` [PATCHv6 RFC 2/3] nvme-multipath: Add visibility for numa io-policy Nilay Shroff
@ 2024-12-24 11:18 ` Sagi Grimberg
0 siblings, 0 replies; 11+ messages in thread
From: Sagi Grimberg @ 2024-12-24 11:18 UTC (permalink / raw)
To: Nilay Shroff, dwagner, hare; +Cc: kbusch, hch, axboe, gjoyce, linux-nvme
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCHv6 RFC 3/3] nvme-multipath: Add visibility for queue-depth io-policy
2024-12-13 4:18 ` [PATCHv6 RFC 3/3] nvme-multipath: Add visibility for queue-depth io-policy Nilay Shroff
@ 2024-12-24 11:18 ` Sagi Grimberg
0 siblings, 0 replies; 11+ messages in thread
From: Sagi Grimberg @ 2024-12-24 11:18 UTC (permalink / raw)
To: Nilay Shroff, dwagner, hare; +Cc: kbusch, hch, axboe, gjoyce, linux-nvme
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCHv6 RFC 1/3] nvme-multipath: Add visibility for round-robin io-policy
2024-12-24 11:01 ` Sagi Grimberg
@ 2024-12-24 11:31 ` Nilay Shroff
0 siblings, 0 replies; 11+ messages in thread
From: Nilay Shroff @ 2024-12-24 11:31 UTC (permalink / raw)
To: Sagi Grimberg, dwagner, hare; +Cc: kbusch, hch, axboe, gjoyce, linux-nvme
On 12/24/24 16:31, Sagi Grimberg wrote:
>
>
>
> On 13/12/2024 6:18, Nilay Shroff wrote:
>> This patch helps add nvme native multipath visibility for round-robin
>> io-policy. It creates a "multipath" sysfs directory under head gendisk
>> device node directory and then from "multipath" directory it adds a link
>> to each namespace path device the head node refers.
>>
>> For instance, if we have a shared namespace accessible from two different
>> controllers/paths then we create a soft link to each path device from head
>> disk node as shown below:
>>
>> $ ls -l /sys/block/nvme1n1/multipath/
>> nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
>> nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
>>
>> In the above example, nvme1n1 is head gendisk node created for a shared
>> namespace and the namespace is accessible from nvme1c1n1 and nvme1c3n1
>> paths.
>>
>> For round-robin I/O policy, we could easily infer from the above output
>> that I/O workload targeted to nvme1n1 would toggle across paths nvme1c1n1
>> and nvme1c3n1.
>>
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>> drivers/nvme/host/core.c | 3 ++
>> drivers/nvme/host/multipath.c | 81 +++++++++++++++++++++++++++++++++++
>> drivers/nvme/host/nvme.h | 18 ++++++--
>> drivers/nvme/host/sysfs.c | 14 ++++++
>> 4 files changed, 112 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index d169a30eb935..df4cc8a27385 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3982,6 +3982,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>> if (!nvme_ns_head_multipath(ns->head))
>> nvme_cdev_del(&ns->cdev, &ns->cdev_device);
>> +
>> + nvme_mpath_remove_sysfs_link(ns);
>> +
>> del_gendisk(ns->disk);
>> mutex_lock(&ns->ctrl->namespaces_lock);
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index a85d190942bd..8e2865df2f33 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -686,6 +686,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>> kblockd_schedule_work(&head->partition_scan_work);
>> }
>> + nvme_mpath_add_sysfs_link(ns->head);
>
> Why do you add the link in set_live? Why not always set this link? it is after all, another
> path to the device?
>
It's because we would be able to create the link from head disk node to a device path node
only after the head node comes alive (i.e. head disk node entry is created under sysfs).
For instance, in the below example,
$ ls -l /sys/block/nvme1n1/multipath/
nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
The nvme1n1 (head disk node) is added under sysfs (/sys/block/nvme1n1) from
the following code path:
nvme_mpath_set_live()
-> device_add_disk()
-> add_disk_fwnode()
-> device_add()
From device_add() we add the head disk node entry under sysfs.
So it's essential for us to only add the link in nvme_mpath_set_live() after
the head disk node comes live. Hope it clarifies. Please let me know if you've
any further query.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCHv6 RFC 0/3] Add visibility for native NVMe multipath using sysfs
2024-12-13 4:18 [PATCHv6 RFC 0/3] Add visibility for native NVMe multipath using sysfs Nilay Shroff
` (2 preceding siblings ...)
2024-12-13 4:18 ` [PATCHv6 RFC 3/3] nvme-multipath: Add visibility for queue-depth io-policy Nilay Shroff
@ 2025-01-08 16:47 ` Keith Busch
2025-01-10 15:47 ` Keith Busch
3 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2025-01-08 16:47 UTC (permalink / raw)
To: Nilay Shroff; +Cc: dwagner, hare, hch, sagi, axboe, gjoyce, linux-nvme
On Fri, Dec 13, 2024 at 09:48:33AM +0530, Nilay Shroff wrote:
> This RFC propose adding new sysfs attributes for adding visibility of
> nvme native multipath I/O.
>
> The changes are divided into three patches.
> The first patch adds visibility for round-robin io-policy.
> The second patch adds visibility for numa io-policy.
> The third patch adds the visibility for queue-depth io-policy.
Thanks, applied to nvme-6.14.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCHv6 RFC 0/3] Add visibility for native NVMe multipath using sysfs
2025-01-08 16:47 ` [PATCHv6 RFC 0/3] Add visibility for native NVMe multipath using sysfs Keith Busch
@ 2025-01-10 15:47 ` Keith Busch
2025-01-12 12:18 ` Nilay Shroff
0 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2025-01-10 15:47 UTC (permalink / raw)
To: Nilay Shroff; +Cc: dwagner, hare, hch, sagi, axboe, gjoyce, linux-nvme
On Wed, Jan 08, 2025 at 09:47:48AM -0700, Keith Busch wrote:
> On Fri, Dec 13, 2024 at 09:48:33AM +0530, Nilay Shroff wrote:
> > This RFC propose adding new sysfs attributes for adding visibility of
> > nvme native multipath I/O.
> >
> > The changes are divided into three patches.
> > The first patch adds visibility for round-robin io-policy.
> > The second patch adds visibility for numa io-policy.
> > The third patch adds the visibility for queue-depth io-policy.
>
> Thanks, applied to nvme-6.14.
I think I have to back this out of nvme-6.14 for now. This appears to be
causing a problem with blktests, test case trtype = loop nvme/058, as
reported by Chaitanya.
Here's a snippet of the kernel messages related to this:
[ 9031.706759] sysfs: cannot create duplicate filename '/devices/virtual/nvme-subsystem/nvme-subsys1/nvme1n2/multipath/nvme1c4n2'
[ 9031.706767] CPU: 41 UID: 0 PID: 52494 Comm: kworker/u192:61 Tainted:G W O N 6.13.0-rc4nvme+ #109
[ 9031.706775] Tainted: [W]=WARN, [O]=OOT_MODULE, [N]=TEST
[ 9031.706777] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 9031.706781] Workqueue: async async_run_entry_fn
[ 9031.706790] Call Trace:
[ 9031.706795] <TASK>
[ 9031.706798] dump_stack_lvl+0x94/0xb0
[ 9031.706806] sysfs_warn_dup+0x5b/0x70
[ 9031.706812] sysfs_do_create_link_sd+0xce/0xe0
[ 9031.706817] sysfs_add_link_to_group+0x35/0x60
[ 9031.706823] nvme_mpath_add_sysfs_link+0xc3/0x160 [nvme_core]
[ 9031.706848] nvme_mpath_set_live+0xb9/0x1f0 [nvme_core]
[ 9031.706865] nvme_mpath_add_disk+0x10b/0x130 [nvme_core]
[ 9031.706883] nvme_alloc_ns+0x8d5/0xc80 [nvme_core]
[ 9031.706904] nvme_scan_ns+0x280/0x350 [nvme_core]
[ 9031.706920] ? do_raw_spin_unlock+0x4e/0xc0
[ 9031.706929] async_run_entry_fn+0x31/0x130
[ 9031.706934] process_one_work+0x1f9/0x630
[ 9031.706943] worker_thread+0x191/0x330
[ 9031.706948] ? __pfx_worker_thread+0x10/0x10
[ 9031.706952] kthread+0xe1/0x120
[ 9031.706956] ? __pfx_kthread+0x10/0x10
[ 9031.706959] ret_from_fork+0x31/0x50
[ 9031.706965] ? __pfx_kthread+0x10/0x10
[ 9031.706968] ret_from_fork_asm+0x1a/0x30
[ 9031.706980] </TASK>
[ 9031.707062] block nvme1n2: failed to create link to nvme1c4n2
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCHv6 RFC 0/3] Add visibility for native NVMe multipath using sysfs
2025-01-10 15:47 ` Keith Busch
@ 2025-01-12 12:18 ` Nilay Shroff
0 siblings, 0 replies; 11+ messages in thread
From: Nilay Shroff @ 2025-01-12 12:18 UTC (permalink / raw)
To: Keith Busch; +Cc: dwagner, hare, hch, sagi, axboe, gjoyce, linux-nvme
On 1/10/25 9:17 PM, Keith Busch wrote:
> On Wed, Jan 08, 2025 at 09:47:48AM -0700, Keith Busch wrote:
>> On Fri, Dec 13, 2024 at 09:48:33AM +0530, Nilay Shroff wrote:
>>> This RFC propose adding new sysfs attributes for adding visibility of
>>> nvme native multipath I/O.
>>>
>>> The changes are divided into three patches.
>>> The first patch adds visibility for round-robin io-policy.
>>> The second patch adds visibility for numa io-policy.
>>> The third patch adds the visibility for queue-depth io-policy.
>>
>> Thanks, applied to nvme-6.14.
>
> I think I have to back this out of nvme-6.14 for now. This appears to be
> causing a problem with blktests, test case trtype = loop nvme/058, as
> reported by Chaitanya.
>
> Here's a snippet of the kernel messages related to this:
>
> [ 9031.706759] sysfs: cannot create duplicate filename '/devices/virtual/nvme-subsystem/nvme-subsys1/nvme1n2/multipath/nvme1c4n2'
> [ 9031.706767] CPU: 41 UID: 0 PID: 52494 Comm: kworker/u192:61 Tainted:G W O N 6.13.0-rc4nvme+ #109
> [ 9031.706775] Tainted: [W]=WARN, [O]=OOT_MODULE, [N]=TEST
> [ 9031.706777] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 9031.706781] Workqueue: async async_run_entry_fn
> [ 9031.706790] Call Trace:
> [ 9031.706795] <TASK>
> [ 9031.706798] dump_stack_lvl+0x94/0xb0
> [ 9031.706806] sysfs_warn_dup+0x5b/0x70
> [ 9031.706812] sysfs_do_create_link_sd+0xce/0xe0
> [ 9031.706817] sysfs_add_link_to_group+0x35/0x60
> [ 9031.706823] nvme_mpath_add_sysfs_link+0xc3/0x160 [nvme_core]
> [ 9031.706848] nvme_mpath_set_live+0xb9/0x1f0 [nvme_core]
> [ 9031.706865] nvme_mpath_add_disk+0x10b/0x130 [nvme_core]
> [ 9031.706883] nvme_alloc_ns+0x8d5/0xc80 [nvme_core]
> [ 9031.706904] nvme_scan_ns+0x280/0x350 [nvme_core]
> [ 9031.706920] ? do_raw_spin_unlock+0x4e/0xc0
> [ 9031.706929] async_run_entry_fn+0x31/0x130
> [ 9031.706934] process_one_work+0x1f9/0x630
> [ 9031.706943] worker_thread+0x191/0x330
> [ 9031.706948] ? __pfx_worker_thread+0x10/0x10
> [ 9031.706952] kthread+0xe1/0x120
> [ 9031.706956] ? __pfx_kthread+0x10/0x10
> [ 9031.706959] ret_from_fork+0x31/0x50
> [ 9031.706965] ? __pfx_kthread+0x10/0x10
> [ 9031.706968] ret_from_fork_asm+0x1a/0x30
> [ 9031.706980] </TASK>
> [ 9031.707062] block nvme1n2: failed to create link to nvme1c4n2
>
>
Thank you for the report! Yes indeed it failed with trtype=loop and nvme/058.
I further investigated it and found that nvme/058 creates 3 shared namespaces
and then attaches those namespaces to 6 different controllers. Later it rapidly
(in quick succession) unmaps and then maps those namespaces in random order.
So that causes multiple nvme paths being simultaneously added/removed in the
host. During those simultaneous add/remove operations, sometimes we trap into
the observed symptom as reported above.
So we have to protect nvme_mpath_add_sysfs_link() from simultaneous add/remove
ns paths. Fortunately it's not so difficult to protect it. There're two things
we need to ensure:
1. Don't try to recreate the sysfs link if it's already created:
The current code uses flag NVME_NS_SYSFS_ATTR_LINK which is marked against ns->flags
when link from head node to the ns path node is added. The current code uses test_bit()
to evaluate if that flag is set or not. If it's not set then we create link and then
mark NVME_NS_SYSFS_ATTR_LINK against ns->flags. However this is not safe and we need to
replace test_bit() with test_and_set_bit() which helps set the flag atomically.
2. Don't create the link from head node to ns path node if disk is not yet added:
As sysfs link is created between kobjects of the head dev node and ns path dev node,
we have to ensure that it's only after device_add_disk() successfully returns for
both head disk and path node disk, we attempt to create the link otherwise sysfs/kernfs
would complain loudly. So we just need to test against GD_ADDED flag for both head disk
and path disk and it's only after respective disks are added, we attempt create sysfs link.
I have made above two changes and tested the code against blktest nvme/058. Now I see
that test pass without any issue. I tested the script for hundreds of times and it
passed each iteration. So with the above two changes, I will spin a patch and submit
upstream. Please help review the same.
Thanks,
--Nilay
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2025-01-12 12:18 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-13 4:18 [PATCHv6 RFC 0/3] Add visibility for native NVMe multipath using sysfs Nilay Shroff
2024-12-13 4:18 ` [PATCHv6 RFC 1/3] nvme-multipath: Add visibility for round-robin io-policy Nilay Shroff
2024-12-24 11:01 ` Sagi Grimberg
2024-12-24 11:31 ` Nilay Shroff
2024-12-13 4:18 ` [PATCHv6 RFC 2/3] nvme-multipath: Add visibility for numa io-policy Nilay Shroff
2024-12-24 11:18 ` Sagi Grimberg
2024-12-13 4:18 ` [PATCHv6 RFC 3/3] nvme-multipath: Add visibility for queue-depth io-policy Nilay Shroff
2024-12-24 11:18 ` Sagi Grimberg
2025-01-08 16:47 ` [PATCHv6 RFC 0/3] Add visibility for native NVMe multipath using sysfs Keith Busch
2025-01-10 15:47 ` Keith Busch
2025-01-12 12:18 ` Nilay Shroff
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox