* [RFC PATCH 0/6] nvme multipath eBPF path selector
@ 2025-07-29  7:06 hare
  2025-07-29  7:06 ` [PATCH 1/6] nvme-multipath: do not assign ->current_path in __nvme_find_path() hare
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: hare @ 2025-07-29  7:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke

From: Hannes Reinecke <hare@kernel.org>

Hi all,

there are discussions about deploying more complex I/O scheduling
algorithms for NVMe, but then there's the question of whether we really
want to carry these in the kernel.
That sounded like an ideal testbed for eBPF struct_ops to me.
Taking a cue from Ming Lei's patchset for eBPF on ublk (thanks, Ming!)
I've started messing around with eBPF.

So here's a patchset to implement nvme multipath eBPF path selectors.
The idea is quite simple: the eBPF 'struct_ops' program provides a
'select_path' function, which selects an nvme_ns struct to use for
the I/O starting at a given sector.
Unfortunately eBPF doesn't allow passing pointers, _and_ the definitions
for 'struct nvme_ns_head' and 'struct nvme_ns' are internal to the
nvme subsystem. So I kept those structures as opaque pointers for
eBPF, and introduced a 'nvme_bpf_iter' structure as a path iterator.
There are two functions 'nvme_bpf_first_path' and 'nvme_bpf_next_path'
which can be used for an open-coded loop over all paths.
I've also added sample code as an example of how the loop can be coded.
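
To illustrate, a minimal selector that always picks the last available
path could look like the (untested) sketch below; the boilerplate
includes are as in the sample from the last patch, and
'nvme_bpf_count_paths()' is a third helper added by this series to give
the verifier a loop bound:

SEC("struct_ops")
int BPF_PROG(pick_last, struct nvme_bpf_iter *iter, sector_t sector)
{
	u32 i, num_paths;
	int cntlid, next;

	/* position the iterator on the first path */
	cntlid = nvme_bpf_first_path(iter);
	if (cntlid < 0)
		return cntlid;

	/* walk the remaining paths with a verifier-bounded loop */
	num_paths = nvme_bpf_count_paths(iter);
	bpf_for(i, 1, num_paths) {
		next = nvme_bpf_next_path(iter);
		if (next < 0)
			break;
		cntlid = next;
	}

	/* controller ID of the selected path */
	return cntlid;
}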

It's all pretty rudimentary (as I'm sure people will need accessors
to get to any namespace or controller details), but that's why I sent
it out as an RFC. And I am by no means an eBPF expert, so I'd be
glad for any corrections or suggestions for a better eBPF integration.

The entire patchset can be found at:
git.kernel.org:/pub/scm/linux/kernel/git/hare/scsi-devel.git
branch nvme-bpf

As usual, reviews and comments are welcome.

Hannes Reinecke (6):
  nvme-multipath: do not assign ->current_path in __nvme_find_path()
  nvme: export nvme_find_get_subsystem()/nvme_put_subsystem()
  nvme: add per-namespace iopolicy sysfs attribute
  nvme: add 'sector' parameter to nvme_find_path()
  nvme-bpf: eBPF struct_ops path selectors
  tools/testing/selftests: add sample nvme bpf path selector

 drivers/nvme/host/Kconfig                     |   9 +
 drivers/nvme/host/Makefile                    |   1 +
 drivers/nvme/host/bpf.h                       |  33 ++
 drivers/nvme/host/bpf_ops.c                   | 347 ++++++++++++++++++
 drivers/nvme/host/core.c                      |  17 +-
 drivers/nvme/host/ioctl.c                     |   7 +-
 drivers/nvme/host/multipath.c                 |  69 +++-
 drivers/nvme/host/nvme.h                      |  11 +-
 drivers/nvme/host/pr.c                        |   2 +-
 drivers/nvme/host/sysfs.c                     |   9 +-
 include/linux/nvme-bpf.h                      |  54 +++
 .../selftests/bpf/progs/bpf_nvme_simple.c     |  52 +++
 12 files changed, 585 insertions(+), 26 deletions(-)
 create mode 100644 drivers/nvme/host/bpf.h
 create mode 100644 drivers/nvme/host/bpf_ops.c
 create mode 100644 include/linux/nvme-bpf.h
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_nvme_simple.c

-- 
2.43.0




* [PATCH 1/6] nvme-multipath: do not assign ->current_path in __nvme_find_path()
  2025-07-29  7:06 [RFC PATCH 0/6] nvme multipath eBPF path selector hare
@ 2025-07-29  7:06 ` hare
  2025-07-29  7:06 ` [PATCH 2/6] nvme: export nvme_find_get_subsystem()/nvme_put_subsystem() hare
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: hare @ 2025-07-29  7:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke,
	Hannes Reinecke

From: Hannes Reinecke <hare@suse.de>

__nvme_find_path() is the fallback if no path has been selected yet, but
caching the result in ->current_path[] is only relevant for the 'numa'
and 'round-robin' iopolicies. So move the assignment out of
__nvme_find_path() to avoid a pointless store for the other iopolicies.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/multipath.c | 40 +++++++++++++++++++++++------------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 3da980dc60d9..116b2e71d339 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -342,8 +342,6 @@ static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node)
 
 	if (!found)
 		found = fallback;
-	if (found)
-		rcu_assign_pointer(head->current_path[node], found);
 	return found;
 }
 
@@ -364,8 +362,13 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
 	struct nvme_ns *old = srcu_dereference(head->current_path[node],
 					       &head->srcu);
 
-	if (unlikely(!old))
-		return __nvme_find_path(head, node);
+	if (unlikely(!old)) {
+		ns = __nvme_find_path(head, node);
+		if (ns)
+			rcu_assign_pointer(head->current_path[node], ns);
+		return ns;
+	}
+
 
 	if (list_is_singular(&head->list)) {
 		if (nvme_path_is_disabled(old))
@@ -455,9 +458,11 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
 
 	ns = srcu_dereference(head->current_path[node], &head->srcu);
 	if (unlikely(!ns))
-		return __nvme_find_path(head, node);
-	if (unlikely(!nvme_path_is_optimized(ns)))
-		return __nvme_find_path(head, node);
+		ns = __nvme_find_path(head, node);
+	else if (unlikely(!nvme_path_is_optimized(ns)))
+		ns = __nvme_find_path(head, node);
+	if (ns)
+		rcu_assign_pointer(head->current_path[node], found);
 	return ns;
 }
 
@@ -798,12 +803,21 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
 
 	mutex_lock(&head->lock);
 	if (nvme_path_is_optimized(ns)) {
-		int node, srcu_idx;
-
-		srcu_idx = srcu_read_lock(&head->srcu);
-		for_each_online_node(node)
-			__nvme_find_path(head, node);
-		srcu_read_unlock(&head->srcu, srcu_idx);
+		int srcu_idx, iopolicy;
+
+		iopolicy = READ_ONCE(head->subsys->iopolicy);
+		if (iopolicy == NVME_IOPOLICY_NUMA ||
+		    iopolicy == NVME_IOPOLICY_RR) {
+			int node;
+
+			srcu_idx = srcu_read_lock(&head->srcu);
+			for_each_online_node(node) {
+				struct nvme_ns *ns = __nvme_find_path(head, node);
+				if (ns)
+					rcu_assign_pointer(head->current_path[node], ns);
+			}
+			srcu_read_unlock(&head->srcu, srcu_idx);
+		}
 	}
 	mutex_unlock(&head->lock);
 
-- 
2.43.0




* [PATCH 2/6] nvme: export nvme_find_get_subsystem()/nvme_put_subsystem()
  2025-07-29  7:06 [RFC PATCH 0/6] nvme multipath eBPF path selector hare
  2025-07-29  7:06 ` [PATCH 1/6] nvme-multipath: do not assign ->current_path in __nvme_find_path() hare
@ 2025-07-29  7:06 ` hare
  2025-07-29  7:06 ` [PATCH 3/6] nvme: add per-namespace iopolicy sysfs attribute hare
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: hare @ 2025-07-29  7:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke

From: Hannes Reinecke <hare@kernel.org>

Export functions to find and release a subsystem based on the
subsystem NQN.
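
A later patch uses these from the eBPF attach code; the intended usage
is the usual get/put pairing, roughly:

	struct nvme_subsystem *subsys;

	subsys = nvme_find_get_subsystem(subsysnqn);
	if (!subsys)
		return -EINVAL;

	mutex_lock(&subsys->lock);
	/* ... walk subsys->nsheads ... */
	mutex_unlock(&subsys->lock);

	nvme_put_subsystem(subsys);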

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/core.c | 14 ++++++++++++--
 drivers/nvme/host/nvme.h |  2 ++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 895fb163d48e..a2f3da453af4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -147,7 +147,6 @@ static const struct class nvme_ns_chr_class = {
 	.name = "nvme-generic",
 };
 
-static void nvme_put_subsystem(struct nvme_subsystem *subsys);
 static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
 					   unsigned nsid);
 static void nvme_update_keep_alive(struct nvme_ctrl *ctrl,
@@ -3115,7 +3114,7 @@ static void nvme_destroy_subsystem(struct kref *ref)
 	put_device(&subsys->dev);
 }
 
-static void nvme_put_subsystem(struct nvme_subsystem *subsys)
+void nvme_put_subsystem(struct nvme_subsystem *subsys)
 {
 	kref_put(&subsys->ref, nvme_destroy_subsystem);
 }
@@ -3148,6 +3147,17 @@ static struct nvme_subsystem *__nvme_find_get_subsystem(const char *subsysnqn)
 	return NULL;
 }
 
+struct nvme_subsystem *nvme_find_get_subsystem(const char *subsysnqn)
+{
+	struct nvme_subsystem *subsys;
+
+	mutex_lock(&nvme_subsystems_lock);
+	subsys = __nvme_find_get_subsystem(subsysnqn);
+	mutex_unlock(&nvme_subsystems_lock);
+	return subsys;
+}
+EXPORT_SYMBOL_GPL(nvme_find_get_subsystem);
+
 static inline bool nvme_discovery_ctrl(struct nvme_ctrl *ctrl)
 {
 	return ctrl->opts && ctrl->opts->discovery_nqn;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 7df2ea21851f..f1eb8ae57c84 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -948,6 +948,8 @@ extern const struct attribute_group *nvme_subsys_attrs_groups[];
 extern const struct attribute_group *nvme_dev_attr_groups[];
 extern const struct block_device_operations nvme_bdev_ops;
 
+struct nvme_subsystem *nvme_find_get_subsystem(const char *subsysnqn);
+void nvme_put_subsystem(struct nvme_subsystem *subsys);
 void nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl);
 struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
 #ifdef CONFIG_NVME_MULTIPATH
-- 
2.43.0




* [PATCH 3/6] nvme: add per-namespace iopolicy sysfs attribute
  2025-07-29  7:06 [RFC PATCH 0/6] nvme multipath eBPF path selector hare
  2025-07-29  7:06 ` [PATCH 1/6] nvme-multipath: do not assign ->current_path in __nvme_find_path() hare
  2025-07-29  7:06 ` [PATCH 2/6] nvme: export nvme_find_get_subsystem()/nvme_put_subsystem() hare
@ 2025-07-29  7:06 ` hare
  2025-07-29  7:06 ` [PATCH 4/6] nvme: add 'sector' parameter to nvme_find_path() hare
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: hare @ 2025-07-29  7:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke

From: Hannes Reinecke <hare@kernel.org>

Add a per-namespace 'iopolicy' sysfs attribute to display iopolicies
which differ from the subsystem-wide setting.
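
With an eBPF selector attached the attribute identifies the selector by
its UUID, otherwise it shows one of the subsystem iopolicy names
('numa', 'round-robin' or 'queue-depth'). For the sample selector from
the last patch, reading the attribute on the ns-head disk would show
'bpf(86ee41d5-256b-45d0-a481-5e35f602f511)'.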

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/multipath.c | 14 ++++++++++++++
 drivers/nvme/host/nvme.h      |  1 +
 drivers/nvme/host/sysfs.c     |  7 +++++++
 3 files changed, 22 insertions(+)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 116b2e71d339..696c2f817bed 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -1137,6 +1137,20 @@ static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr
 }
 DEVICE_ATTR_RO(numa_nodes);
 
+static ssize_t iopolicy_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct gendisk *disk = dev_to_disk(dev);
+	struct nvme_ns_head *head = disk->private_data;
+	int iopolicy;
+
+	if (nvme_bpf_enabled(head))
+		return sysfs_emit(buf, "bpf(%pUb)\n", &head->bpf_ops->uuid);
+	iopolicy = READ_ONCE(head->subsys->iopolicy);
+	return sysfs_emit(buf, "%s\n", nvme_iopolicy_names[iopolicy]);
+}
+DEVICE_ATTR_RO(iopolicy);
+
 static ssize_t delayed_removal_secs_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f1eb8ae57c84..2aff8df55d1c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -994,6 +994,7 @@ extern struct device_attribute dev_attr_ana_grpid;
 extern struct device_attribute dev_attr_ana_state;
 extern struct device_attribute dev_attr_queue_depth;
 extern struct device_attribute dev_attr_numa_nodes;
+extern struct device_attribute dev_attr_iopolicy;
 extern struct device_attribute dev_attr_delayed_removal_secs;
 extern struct device_attribute subsys_attr_iopolicy;
 
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 29430949ce2f..378107cf7a21 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -260,6 +260,7 @@ static struct attribute *nvme_ns_attrs[] = {
 	&dev_attr_ana_state.attr,
 	&dev_attr_queue_depth.attr,
 	&dev_attr_numa_nodes.attr,
+	&dev_attr_iopolicy.attr,
 	&dev_attr_delayed_removal_secs.attr,
 #endif
 	&dev_attr_io_passthru_err_log_enabled.attr,
@@ -297,6 +298,12 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
 		if (nvme_disk_is_ns_head(dev_to_disk(dev)))
 			return 0;
 	}
+	if (a == &dev_attr_iopolicy.attr) {
+		struct gendisk *disk = dev_to_disk(dev);
+
+		if (!nvme_disk_is_ns_head(disk))
+			return 0;
+	}
 	if (a == &dev_attr_delayed_removal_secs.attr) {
 		struct gendisk *disk = dev_to_disk(dev);
 
-- 
2.43.0




* [PATCH 4/6] nvme: add 'sector' parameter to nvme_find_path()
  2025-07-29  7:06 [RFC PATCH 0/6] nvme multipath eBPF path selector hare
                   ` (2 preceding siblings ...)
  2025-07-29  7:06 ` [PATCH 3/6] nvme: add per-namespace iopolicy sysfs attribute hare
@ 2025-07-29  7:06 ` hare
  2025-07-29  7:06 ` [PATCH 5/6] nvme-bpf: eBPF struct_ops path selectors hare
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: hare @ 2025-07-29  7:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke

From: Hannes Reinecke <hare@kernel.org>

An nvme multipath iopolicy might need to make scheduling decisions based
on the starting sector of the I/O, so add an argument 'sector' to
nvme_find_path().

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/ioctl.c     | 7 ++++---
 drivers/nvme/host/multipath.c | 8 ++++----
 drivers/nvme/host/nvme.h      | 2 +-
 drivers/nvme/host/pr.c        | 2 +-
 drivers/nvme/host/sysfs.c     | 2 +-
 5 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 6b3ac8ae3f34..e210908ad78b 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -716,7 +716,8 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
 		flags |= NVME_IOCTL_PARTITION;
 
 	srcu_idx = srcu_read_lock(&head->srcu);
-	ns = nvme_find_path(head);
+	/* TBD: extract LBA and size to get the routing right */
+	ns = nvme_find_path(head, 0);
 	if (!ns)
 		goto out_unlock;
 
@@ -747,7 +748,7 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
 	int srcu_idx, ret = -EWOULDBLOCK;
 
 	srcu_idx = srcu_read_lock(&head->srcu);
-	ns = nvme_find_path(head);
+	ns = nvme_find_path(head, 0);
 	if (!ns)
 		goto out_unlock;
 
@@ -767,7 +768,7 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
 	struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
 	struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
 	int srcu_idx = srcu_read_lock(&head->srcu);
-	struct nvme_ns *ns = nvme_find_path(head);
+	struct nvme_ns *ns = nvme_find_path(head, 0);
 	int ret = -EINVAL;
 
 	if (ns)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 696c2f817bed..dee40bd73449 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -466,7 +466,7 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
 	return ns;
 }
 
-inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
+inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head, sector_t sector)
 {
 	switch (READ_ONCE(head->subsys->iopolicy)) {
 	case NVME_IOPOLICY_QD:
@@ -528,7 +528,7 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
 		return;
 
 	srcu_idx = srcu_read_lock(&head->srcu);
-	ns = nvme_find_path(head);
+	ns = nvme_find_path(head, bio->bi_iter.bi_sector);
 	if (likely(ns)) {
 		bio_set_dev(bio, ns->disk->part0);
 		bio->bi_opf |= REQ_NVME_MPATH;
@@ -570,7 +570,7 @@ static int nvme_ns_head_get_unique_id(struct gendisk *disk, u8 id[16],
 	int srcu_idx, ret = -EWOULDBLOCK;
 
 	srcu_idx = srcu_read_lock(&head->srcu);
-	ns = nvme_find_path(head);
+	ns = nvme_find_path(head, 0);
 	if (ns)
 		ret = nvme_ns_get_unique_id(ns, id, type);
 	srcu_read_unlock(&head->srcu, srcu_idx);
@@ -586,7 +586,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
 	int srcu_idx, ret = -EWOULDBLOCK;
 
 	srcu_idx = srcu_read_lock(&head->srcu);
-	ns = nvme_find_path(head);
+	ns = nvme_find_path(head, sector);
 	if (ns)
 		ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
 	srcu_read_unlock(&head->srcu, srcu_idx);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2aff8df55d1c..73b966a6653a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -951,7 +951,7 @@ extern const struct block_device_operations nvme_bdev_ops;
 struct nvme_subsystem *nvme_find_get_subsystem(const char *subsysnqn);
 void nvme_put_subsystem(struct nvme_subsystem *subsys);
 void nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl);
-struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
+struct nvme_ns *nvme_find_path(struct nvme_ns_head *head, sector_t sector);
 #ifdef CONFIG_NVME_MULTIPATH
 static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
 {
diff --git a/drivers/nvme/host/pr.c b/drivers/nvme/host/pr.c
index ca6a74607b13..ba8cee2d6001 100644
--- a/drivers/nvme/host/pr.c
+++ b/drivers/nvme/host/pr.c
@@ -54,7 +54,7 @@ static int nvme_send_ns_head_pr_command(struct block_device *bdev,
 {
 	struct nvme_ns_head *head = bdev->bd_disk->private_data;
 	int srcu_idx = srcu_read_lock(&head->srcu);
-	struct nvme_ns *ns = nvme_find_path(head);
+	struct nvme_ns *ns = nvme_find_path(head, 0);
 	int ret = -EWOULDBLOCK;
 
 	if (ns) {
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 378107cf7a21..ebbdf512a31d 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -194,7 +194,7 @@ static int ns_head_update_nuse(struct nvme_ns_head *head)
 		return 0;
 
 	srcu_idx = srcu_read_lock(&head->srcu);
-	ns = nvme_find_path(head);
+	ns = nvme_find_path(head, 0);
 	if (!ns)
 		goto out_unlock;
 
-- 
2.43.0




* [PATCH 5/6] nvme-bpf: eBPF struct_ops path selectors
  2025-07-29  7:06 [RFC PATCH 0/6] nvme multipath eBPF path selector hare
                   ` (3 preceding siblings ...)
  2025-07-29  7:06 ` [PATCH 4/6] nvme: add 'sector' parameter to nvme_find_path() hare
@ 2025-07-29  7:06 ` hare
  2025-07-29  7:06 ` [PATCH 6/6] tools/testing/selftests: add sample nvme bpf path selector hare
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: hare @ 2025-07-29  7:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke

From: Hannes Reinecke <hare@kernel.org>

Add support for eBPF struct_ops based path selectors. Due to eBPF
limitations we cannot pass a 'struct nvme_ns' in as an argument nor
return one from a function, so the eBPF path selectors use a
'struct nvme_bpf_iter' to iterate over all paths in a struct nvme_ns_head.
That satisfies the constraints of the eBPF verifier and allows us to
provide two helper functions, 'nvme_bpf_first_path()' and
'nvme_bpf_next_path()', to iterate over all paths in a namespace.
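
Attaching a selector from userspace follows the standard struct_ops
flow; a rough (untested) sketch with libbpf, where the object file and
map names assume the sample from the next patch:

	struct bpf_object *obj;
	struct bpf_map *map;
	struct bpf_link *link;

	obj = bpf_object__open_file("bpf_nvme_simple.bpf.o", NULL);
	if (!obj || bpf_object__load(obj))
		return -1;
	map = bpf_object__find_map_by_name(obj, "bpf_nvme_simple");
	if (!map)
		return -1;
	/* registration ends up in nvme_bpf_reg() */
	link = bpf_map__attach_struct_ops(map);
	if (!link)
		return -1;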

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/Kconfig     |   9 +
 drivers/nvme/host/Makefile    |   1 +
 drivers/nvme/host/bpf.h       |  33 ++++
 drivers/nvme/host/bpf_ops.c   | 347 ++++++++++++++++++++++++++++++++++
 drivers/nvme/host/core.c      |   3 +
 drivers/nvme/host/multipath.c |   9 +-
 drivers/nvme/host/nvme.h      |   6 +-
 include/linux/nvme-bpf.h      |  54 ++++++
 8 files changed, 460 insertions(+), 2 deletions(-)
 create mode 100644 drivers/nvme/host/bpf.h
 create mode 100644 drivers/nvme/host/bpf_ops.c
 create mode 100644 include/linux/nvme-bpf.h

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 31974c7dd20c..7cc1f3898712 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -122,6 +122,15 @@ config NVME_HOST_AUTH
 
 	  If unsure, say N.
 
+config NVME_BPF
+	bool "NVMe multipath BPF path selector"
+	depends on NVME_MULTIPATH
+	depends on BPF_SYSCALL
+	help
+	  Provide support for eBPF multipath path selectors
+
+	  If unsure, say N.
+
 config NVME_APPLE
 	tristate "Apple ANS2 NVM Express host driver"
 	depends on OF && BLOCK
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index 6414ec968f99..f81d6349faf6 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -18,6 +18,7 @@ nvme-core-$(CONFIG_BLK_DEV_ZONED)	+= zns.o
 nvme-core-$(CONFIG_FAULT_INJECTION_DEBUG_FS)	+= fault_inject.o
 nvme-core-$(CONFIG_NVME_HWMON)		+= hwmon.o
 nvme-core-$(CONFIG_NVME_HOST_AUTH)	+= auth.o
+nvme-core-$(CONFIG_NVME_BPF)		+= bpf_ops.o
 
 nvme-y					+= pci.o
 
diff --git a/drivers/nvme/host/bpf.h b/drivers/nvme/host/bpf.h
new file mode 100644
index 000000000000..f819332d5293
--- /dev/null
+++ b/drivers/nvme/host/bpf.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef NVME_INT_BPF_HEADER
+#define NVME_INT_BPF_HEADER
+
+#ifdef CONFIG_NVME_BPF
+#include <linux/filter.h>
+#include <linux/nvme-bpf.h>
+
+static inline bool nvme_bpf_enabled(struct nvme_ns_head *head)
+{
+	return !!(srcu_dereference(head->bpf_ops, &head->srcu));
+}
+
+void nvme_bpf_detach(struct nvme_ns_head *head);
+struct nvme_ns *nvme_bpf_select_path(struct nvme_ns_head *head, sector_t sector);
+
+int __init nvme_bpf_struct_ops_init(void);
+
+#else
+
+static inline bool nvme_bpf_enabled(struct nvme_ns_head *head)
+{
+	return false;
+}
+
+static inline void nvme_bpf_detach(struct nvme_ns_head *head) {}
+static inline struct nvme_ns *nvme_bpf_select_path(struct nvme_ns_head *head, sector_t sector)
+{
+	return NULL;
+}
+
+#endif
+#endif
diff --git a/drivers/nvme/host/bpf_ops.c b/drivers/nvme/host/bpf_ops.c
new file mode 100644
index 000000000000..5413541f6f22
--- /dev/null
+++ b/drivers/nvme/host/bpf_ops.c
@@ -0,0 +1,347 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Hannes Reinecke, SUSE */
+
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include "nvme.h"
+#include "bpf.h"
+
+static struct btf *nvme_bpf_ops_btf;
+static char nvme_bpf_ops_name[] = "nvme_bpf_ops";
+
+static int nvme_bpf_ops_init(struct btf *btf)
+{
+	nvme_bpf_ops_btf = btf;
+	return 0;
+}
+
+static bool nvme_bpf_ops_is_valid_access(int off, int size,
+					  enum bpf_access_type type,
+					  const struct bpf_prog *prog,
+					  struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+BTF_ID_LIST(nvme_bpf_ops_args_ids)
+BTF_ID(struct, nvme_ns_head)
+BTF_ID(struct, nvme_ns)
+BTF_ID(struct, nvme_bpf_iter)
+
+static int nvme_bpf_ops_btf_struct_access(struct bpf_verifier_log *log,
+					  const struct bpf_reg_state *reg,
+					  int off, int size)
+{
+	const struct btf_type *nhit, *nit, *niter, *t;
+
+	nhit = btf_type_by_id(reg->btf, nvme_bpf_ops_args_ids[0]);
+	nit = btf_type_by_id(reg->btf, nvme_bpf_ops_args_ids[1]);
+	niter = btf_type_by_id(reg->btf, nvme_bpf_ops_args_ids[2]);
+
+	t = btf_type_by_id(reg->btf, reg->btf_id);
+	if (t != nhit && t != niter) {
+		bpf_log(log, "write access to struct %d is not supported\n", reg->btf_id);
+		return -EACCES;
+	}
+	if (t == niter) {
+		/* Allow writes to the 'head' element */
+		if (off >= offsetof(struct nvme_bpf_iter, head) &&
+		    off + size <= offsetofend(struct nvme_bpf_iter, head))
+			return NOT_INIT;
+	} else {
+		/* Allow writes to the 'bpf_ops' element */
+		if (off >= offsetof(struct nvme_ns_head, bpf_ops) &&
+		    off + size <= offsetofend(struct nvme_ns_head, bpf_ops)) {
+			return NOT_INIT;
+		}
+	}
+	bpf_log(log, "write access for struct %s at off %d with size %d\n",
+		nvme_bpf_ops_name, off, size);
+	return -EACCES;
+}
+
+static const struct bpf_verifier_ops nvme_bpf_verifier_ops = {
+	.get_func_proto = bpf_base_func_proto,
+	.is_valid_access = nvme_bpf_ops_is_valid_access,
+	.btf_struct_access = nvme_bpf_ops_btf_struct_access,
+};
+
+static int nvme_bpf_ops_check_member(const struct btf_type *t,
+				     const struct btf_member *member,
+				     const struct bpf_prog *prog)
+{
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+	switch (moff) {
+	case offsetof(struct nvme_bpf_ops, select_path):
+		break;
+	default:
+		if (prog->sleepable)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int nvme_bpf_ops_init_member(const struct btf_type *t,
+				    const struct btf_member *member,
+				    void *kdata, const void *udata)
+{
+	const struct nvme_bpf_ops *uops;
+	struct nvme_bpf_ops *kops;
+	u32 moff;
+
+	uops = (const struct nvme_bpf_ops *)udata;
+	kops = (struct nvme_bpf_ops *)kdata;
+
+	moff = __btf_member_bit_offset(t, member) / 8;
+
+	switch (moff) {
+	case offsetof(struct nvme_bpf_ops, subsysnqn):
+		memcpy(kops->subsysnqn, uops->subsysnqn,
+		       sizeof(kops->subsysnqn));
+		return 1;
+	case offsetof(struct nvme_bpf_ops, nsid):
+		kops->nsid = uops->nsid;
+		return 1;
+	case offsetof(struct nvme_bpf_ops, uuid):
+		if (uuid_is_null(&uops->uuid))
+			uuid_gen(&kops->uuid);
+		else
+			uuid_copy(&kops->uuid, &uops->uuid);
+		return 1;
+	}
+	return 0;
+}
+
+static int nvme_bpf_reg(void *kdata, struct bpf_link *link)
+{
+	struct nvme_bpf_ops *ops = kdata;
+	struct nvme_ns_head *head;
+	struct nvme_subsystem *subsys = NULL;
+
+	pr_debug("%s: register %s nsid %d\n",
+		 __func__, ops->subsysnqn, ops->nsid);
+
+	subsys = nvme_find_get_subsystem(ops->subsysnqn);
+	if (!subsys)
+		return -EINVAL;
+
+	mutex_lock(&subsys->lock);
+	list_for_each_entry(head, &subsys->nsheads, entry) {
+		if (head->ns_id != ops->nsid)
+			continue;
+		if (head->bpf_ops) {
+			pr_debug("%s: instance %d already attached\n",
+				 __func__, head->instance);
+			continue;
+		}
+		if (nvme_tryget_ns_head(head)) {
+			mutex_lock(&head->lock);
+			ops->head = head;
+			rcu_assign_pointer(head->bpf_ops, ops);
+			mutex_unlock(&head->lock);
+			pr_debug("%s: attached to %d\n",
+				 __func__, head->instance);
+			synchronize_srcu(&head->srcu);
+			break;
+		}
+	}
+	mutex_unlock(&subsys->lock);
+	nvme_put_subsystem(subsys);
+
+	return 0;
+}
+
+static void nvme_bpf_unreg(void *kdata, struct bpf_link *link)
+{
+	struct nvme_bpf_ops *ops = kdata;
+	struct nvme_ns_head *head;
+
+	if (ops->head) {
+		head = ops->head;
+		pr_debug("%s: unregistered from %d\n",
+			 __func__, head->instance);
+		mutex_lock(&head->lock);
+		rcu_assign_pointer(head->bpf_ops, NULL);
+		ops->head = NULL;
+		mutex_unlock(&head->lock);
+		nvme_put_ns_head(head);
+		synchronize_srcu(&head->srcu);
+	}
+}
+
+void nvme_bpf_detach(struct nvme_ns_head *head)
+{
+	struct nvme_bpf_ops *ops =
+		srcu_dereference(head->bpf_ops, &head->srcu);
+
+	if (ops) {
+		mutex_lock(&head->lock);
+		rcu_assign_pointer(head->bpf_ops, NULL);
+		list_del_init(&head->bpf_list);
+		mutex_unlock(&head->lock);
+		nvme_put_ns_head(head);
+	}
+}
+
+static int __nvme_bpf_select_path(struct nvme_bpf_iter *iter,
+				  sector_t sector)
+{
+	return -ENXIO;
+}
+
+static struct nvme_bpf_ops __bpf_nvme_bpf_ops = {
+	.uuid = {},
+	.subsysnqn = "",
+	.nsid = UINT_MAX,
+	.select_path = __nvme_bpf_select_path,
+	.head = NULL,
+};
+
+struct nvme_ns *nvme_bpf_select_path(struct nvme_ns_head *head,
+				     sector_t sector)
+{
+	struct nvme_ns *ns = NULL;
+	struct nvme_bpf_ops *ops =
+		srcu_dereference(head->bpf_ops, &head->srcu);
+	struct nvme_bpf_iter iter = {
+		.head = head,
+	};
+	s32 cntlid;
+
+	if (ops) {
+		cntlid = ops->select_path(&iter, sector);
+		if (cntlid < 0)
+			return NULL;
+		if (iter.curr) {
+			ns = iter.curr;
+			if (ns->ctrl->cntlid == cntlid)
+				return ns;
+		}
+	}
+	return NULL;
+}
+
+__bpf_kfunc_start_defs();
+
+/**
+ * nvme_bpf_first_path - select the first path from a nvme bpf path iterator
+ * @iter: nvme_bpf path iterator
+ *
+ * Initializes @iter with the first nvme namespace path (if present) and
+ * returns the controller id of the first nvme namespace path or
+ * -ENXIO if no namespace path is present.
+ */
+__bpf_kfunc int nvme_bpf_first_path(struct nvme_bpf_iter *iter)
+{
+	struct nvme_ns *ns;
+
+	if (!iter || !iter->head)
+		return -EINVAL;
+	if (!nvme_bpf_enabled(iter->head))
+		return -EPERM;
+
+	ns = list_first_or_null_rcu(&iter->head->list, struct nvme_ns, siblings);
+	iter->curr = ns;
+	iter->prev = NULL;
+	return ns ? ns->ctrl->cntlid : -ENXIO;
+}
+EXPORT_SYMBOL_GPL(nvme_bpf_first_path);
+
+/**
+ * nvme_bpf_next_path - select the next path from a nvme bpf path iterator
+ * @iter: nvme_bpf path iterator
+ *
+ * Advances @iter to the next namespace path, storing the previous path
+ * in iter->prev. Returns the controller id of the new current path,
+ * -ENXIO if no current path is set, or -EAGAIN if there is no next path.
+ */
+__bpf_kfunc int nvme_bpf_next_path(struct nvme_bpf_iter *iter)
+{
+	struct nvme_ns *ns, *old;
+
+	if (!iter || !iter->head)
+		return -EINVAL;
+	if (!nvme_bpf_enabled(iter->head))
+		return -EPERM;
+	if (!iter->curr)
+		return -ENXIO;
+	old = iter->curr;
+	ns = list_next_or_null_rcu(&iter->head->list, &old->siblings, struct nvme_ns,
+				   siblings);
+	iter->prev = old;
+	iter->curr = ns;
+	return ns ? ns->ctrl->cntlid : -EAGAIN;
+}
+EXPORT_SYMBOL_GPL(nvme_bpf_next_path);
+
+/**
+ * nvme_bpf_count_paths - count the number of paths in a nvme bpf path iterator
+ * @iter: nvme_bpf namespace path iterator
+ *
+ * Returns number of paths in @iter
+ */
+__bpf_kfunc u32 nvme_bpf_count_paths(struct nvme_bpf_iter *iter)
+{
+	struct nvme_ns *ns;
+	u32 num = 0;
+
+	if (!iter || !iter->head)
+		return 0;
+	if (!nvme_bpf_enabled(iter->head))
+		return num;
+
+	ns = list_first_or_null_rcu(&iter->head->list, struct nvme_ns, siblings);
+	while (ns) {
+		num++;
+		ns = list_next_or_null_rcu(&iter->head->list, &ns->siblings, struct nvme_ns,
+					   siblings);
+	}
+	return num;
+}
+EXPORT_SYMBOL_GPL(nvme_bpf_count_paths);
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(nvme_bpf_kfunc_set_ids)
+BTF_ID_FLAGS(func, nvme_bpf_first_path, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, nvme_bpf_next_path, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, nvme_bpf_count_paths, KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(nvme_bpf_kfunc_set_ids)
+
+static const struct btf_kfunc_id_set nvme_bpf_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set = &nvme_bpf_kfunc_set_ids,
+};
+
+static struct bpf_struct_ops bpf_nvme_bpf_ops = {
+	.verifier_ops = &nvme_bpf_verifier_ops,
+	.init = nvme_bpf_ops_init,
+	.check_member = nvme_bpf_ops_check_member,
+	.init_member = nvme_bpf_ops_init_member,
+	.reg = nvme_bpf_reg,
+	.unreg = nvme_bpf_unreg,
+	.name = nvme_bpf_ops_name,
+	.cfi_stubs = &__bpf_nvme_bpf_ops,
+	.owner = THIS_MODULE,
+};
+
+int __init nvme_bpf_struct_ops_init(void)
+{
+	int ret;
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+					&nvme_bpf_kfunc_set);
+	if (ret) {
+		pr_err("Failed to register nvme_bpf_kfunc_set, error %d\n", ret);
+		return ret;
+	}
+	ret = register_bpf_struct_ops(&bpf_nvme_bpf_ops, nvme_bpf_ops);
+	if (ret)
+		pr_err("Failed to register nvme_bpf_ops, error %d\n", ret);
+	else
+		pr_info("nvme_bpf_ops registered\n");
+	return ret;
+}
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a2f3da453af4..e4f69b2f946b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -27,6 +27,7 @@
 #include "nvme.h"
 #include "fabrics.h"
 #include <linux/nvme-auth.h>
+#include "bpf.h"
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -5381,6 +5382,8 @@ static int __init nvme_core_init(void)
 	result = nvme_init_auth();
 	if (result)
 		goto destroy_ns_chr;
+	if (IS_ENABLED(CONFIG_NVME_BPF))
+		nvme_bpf_struct_ops_init();
 	return 0;
 
 destroy_ns_chr:
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index dee40bd73449..e2c6b13591c4 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -8,6 +8,7 @@
 #include <linux/vmalloc.h>
 #include <trace/events/block.h>
 #include "nvme.h"
+#include "bpf.h"
 
 bool multipath = true;
 static bool multipath_always_on;
@@ -462,12 +463,15 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
 	else if (unlikely(!nvme_path_is_optimized(ns)))
 		ns = __nvme_find_path(head, node);
 	if (ns)
-		rcu_assign_pointer(head->current_path[node], found);
+		rcu_assign_pointer(head->current_path[node], ns);
 	return ns;
 }
 
 inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head, sector_t sector)
 {
+	if (nvme_bpf_enabled(head))
+		return nvme_bpf_select_path(head, sector);
+
 	switch (READ_ONCE(head->subsys->iopolicy)) {
 	case NVME_IOPOLICY_QD:
 		return nvme_queue_depth_path(head);
@@ -693,6 +697,7 @@ static void nvme_remove_head(struct nvme_ns_head *head)
 		kblockd_schedule_work(&head->requeue_work);
 
 		nvme_cdev_del(&head->cdev, &head->cdev_device);
+		nvme_bpf_detach(head);
 		synchronize_srcu(&head->srcu);
 		del_gendisk(head->disk);
 	}
@@ -727,6 +732,8 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 	INIT_WORK(&head->requeue_work, nvme_requeue_work);
 	INIT_WORK(&head->partition_scan_work, nvme_partition_scan_work);
 	INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
+	if (IS_ENABLED(CONFIG_NVME_BPF))
+		INIT_LIST_HEAD(&head->bpf_list);
 	head->delayed_removal_secs = 0;
 
 	/*
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 73b966a6653a..3498620d650b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -17,7 +17,7 @@
 #include <linux/wait.h>
 #include <linux/t10-pi.h>
 #include <linux/ratelimit_types.h>
-
+#include <linux/bpf.h>
 #include <trace/events/block.h>
 
 extern const struct pr_ops nvme_pr_ops;
@@ -499,6 +499,10 @@ struct nvme_ns_head {
 
 	u16			nr_plids;
 	u16			*plids;
+#ifdef CONFIG_NVME_BPF
+	struct list_head	bpf_list;
+	struct nvme_bpf_ops __rcu *bpf_ops;
+#endif
 #ifdef CONFIG_NVME_MULTIPATH
 	struct bio_list		requeue_list;
 	spinlock_t		requeue_lock;
diff --git a/include/linux/nvme-bpf.h b/include/linux/nvme-bpf.h
new file mode 100644
index 000000000000..687b96e101ef
--- /dev/null
+++ b/include/linux/nvme-bpf.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2025 Hannes Reinecke, SUSE Software Solutions
+ */
+
+#ifndef _NVME_BPF_H
+#define _NVME_BPF_H
+
+struct nvme_ns_head;
+struct nvme_ns;
+
+/**
+ * struct nvme_bpf_iter - Iterator for select_path BPF function
+ * @head: namespace head to iterate over
+ * @curr: current namespace path
+ * @prev: previous namespace path
+ */
+struct nvme_bpf_iter {
+	struct nvme_ns_head *head;
+	struct nvme_ns *curr;
+	struct nvme_ns *prev;
+};
+
+/**
+ * struct nvme_bpf_ops - A BPF struct_ops with callbacks implementing
+ * 			an nvme bpf path selector
+ * @uuid: ops uuid
+ * @subsysnqn: NQN of the subsystem to attach to
+ * @nsid: namespace ID within the subsystem to attach to
+ * @select_path: callback for selecting the path for a given sector
+ */
+struct nvme_bpf_ops {
+	/* UUID to distinguish different instances */
+	uuid_t			uuid;
+
+	/* Subsystem NQN */
+	char			subsysnqn[256];
+
+	/* Namespace ID number or -1 if valid for all namespaces */
+	int		nsid;
+
+	/* Return the controller ID of the selected path or -1 if not found */
+	int		(*select_path)(struct nvme_bpf_iter *, sector_t);
+
+	/* private: don't show in doc, must be the last field */
+	struct nvme_ns_head *head;
+};
+
+int nvme_bpf_first_path(struct nvme_bpf_iter *iter);
+int nvme_bpf_next_path(struct nvme_bpf_iter *iter);
+u32 nvme_bpf_count_paths(struct nvme_bpf_iter *iter);
+
+#endif
+
-- 
2.43.0




* [PATCH 6/6] tools/testing/selftests: add sample nvme bpf path selector
  2025-07-29  7:06 [RFC PATCH 0/6] nvme multipath eBPF path selector hare
                   ` (4 preceding siblings ...)
  2025-07-29  7:06 ` [PATCH 5/6] nvme-bpf: eBPF struct_ops path selectors hare
@ 2025-07-29  7:06 ` hare
  2025-07-30  2:03   ` Geliang Tang
  2025-07-29  7:54 ` [RFC PATCH 0/6] nvme multipath eBPF " Christoph Hellwig
  2025-07-30  2:03 ` Geliang Tang
  7 siblings, 1 reply; 13+ messages in thread
From: hare @ 2025-07-29  7:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke

From: Hannes Reinecke <hare@kernel.org>

Add a simple nvme bpf path selector to demonstrate the namespace
path iteration.
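
The selector stripes I/O across the available paths: chunk
n = sector / simple_blocksize is routed to path (n % num_paths).
With two paths and the default blocksize of 1048576 sectors this means
sectors [0, 1048576) go to the first path, [1048576, 2097152) to the
second, and so on, wrapping around.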

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 .../selftests/bpf/progs/bpf_nvme_simple.c     | 52 +++++++++++++++++++
 1 file changed, 52 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_nvme_simple.c

diff --git a/tools/testing/selftests/bpf/progs/bpf_nvme_simple.c b/tools/testing/selftests/bpf/progs/bpf_nvme_simple.c
new file mode 100644
index 000000000000..c9cafb6bd253
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_nvme_simple.c
@@ -0,0 +1,52 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * simple nvme ebpf path selector
+ *
+ * Simulates a striped RAID layout with a chunk size of 'simple_blocksize' sectors
+ */
+
+#include <vmlinux.h>
+#include <errno.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+static sector_t simple_offset = 0;
+static sector_t simple_blocksize = 1048576;
+
+SEC("struct_ops")
+int BPF_PROG(simple_select, struct nvme_bpf_iter *iter, sector_t sector)
+{
+	sector_t offset = simple_offset;
+	sector_t block_size = simple_blocksize;
+	u32 num_blks, num_paths, num_iter, i;
+	int cntlid;
+
+	if (sector > offset)
+		sector -= offset;
+	cntlid = nvme_bpf_first_path(iter);
+	if (cntlid < 0)
+		return cntlid;
+	if (!block_size || sector < block_size)
+		return cntlid;
+
+	num_blks = (sector / block_size);
+	num_paths = nvme_bpf_count_paths(iter);
+	num_iter = num_blks % num_paths;
+	bpf_for (i, 0, num_iter) {
+		cntlid = nvme_bpf_next_path(iter);
+		if (cntlid < 0)
+			break;
+	}
+	return cntlid;
+}
+
+SEC(".struct_ops")
+struct nvme_bpf_ops bpf_nvme_simple = {
+	.uuid = { 0x86, 0xee, 0x41, 0xd5, 0x25, 0x6b, 0x45, 0xd0, 0xa4, 0x81, 0x5e, 0x35, 0xf6, 0x02, 0xf5, 0x11 },
+	.subsysnqn = "blktests-subsystem-1",
+	.nsid = 1,
+	.select_path = (void *)simple_select,
+};
-- 
2.43.0




* Re: [RFC PATCH 0/6] nvme multipath eBPF path selector
  2025-07-29  7:06 [RFC PATCH 0/6] nvme multipath eBPF path selector hare
                   ` (5 preceding siblings ...)
  2025-07-29  7:06 ` [PATCH 6/6] tools/testing/selftests: add sample nvme bpf path selector hare
@ 2025-07-29  7:54 ` Christoph Hellwig
  2025-07-29 14:53   ` Mike Christie
  2025-07-30  2:03 ` Geliang Tang
  7 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2025-07-29  7:54 UTC (permalink / raw)
  To: hare; +Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, linux-nvme

On Tue, Jul 29, 2025 at 09:06:47AM +0200, hare@kernel.org wrote:
> From: Hannes Reinecke <hare@kernel.org>
> 
> Hi all,
> 
> there are discussions about deploying more complex I/O scheduling
> algorithms for NVMe,

Are "we"?  Where?

> but then there's the question of whether we really
> want to carry these in the kernel.

If it makes sense, "we" do, yes.

"We" don't want to export random crap for sure.




* Re: [RFC PATCH 0/6] nvme multipath eBPF path selector
  2025-07-29  7:54 ` [RFC PATCH 0/6] nvme multipath eBPF " Christoph Hellwig
@ 2025-07-29 14:53   ` Mike Christie
  2025-07-30 14:06     ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Mike Christie @ 2025-07-29 14:53 UTC (permalink / raw)
  To: Christoph Hellwig, hare; +Cc: Keith Busch, Sagi Grimberg, linux-nvme

On 7/29/25 2:54 AM, Christoph Hellwig wrote:
> On Tue, Jul 29, 2025 at 09:06:47AM +0200, hare@kernel.org wrote:
>> From: Hannes Reinecke <hare@kernel.org>
>>
>> Hi all,
>>
>> there are discussions about deploying more complex I/O scheduling
>> algorithms for NVMe,
> 
> Are "we"?  Where?

I think that's me. I was originally looking into eBPF path selectors
for ADNN:

https://www.snia.org/educational-library/adaptive-distributed-nvme-namespaces-2020

I was bugging Hannes about what happened to the spec. The idea was
just that if the spec is dead, then something like eBPF path selectors
could be an option.

I'm not currently looking at it, because I got busy with other
work. But, maybe the ceph people would want it for their nvme frontend
like in the video. For the ceph iscsi project, I had looked at SCSI
referrals, and it didn't fit (I can't remember the details), and ADNN
looked promising.


> 
>> but then there's the question of whether we really
>> want to carry these in the kernel.
> 
> If it makes sense, "we" do, yes.
> 
> "We" don't want to export random crap for sure.
> 
> 




* Re: [RFC PATCH 0/6] nvme multipath eBPF path selector
  2025-07-29  7:06 [RFC PATCH 0/6] nvme multipath eBPF path selector hare
                   ` (6 preceding siblings ...)
  2025-07-29  7:54 ` [RFC PATCH 0/6] nvme multipath eBPF " Christoph Hellwig
@ 2025-07-30  2:03 ` Geliang Tang
  7 siblings, 0 replies; 13+ messages in thread
From: Geliang Tang @ 2025-07-30  2:03 UTC (permalink / raw)
  To: hare, Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme

Hi Hannes,

On Tue, 2025-07-29 at 09:06 +0200, hare@kernel.org wrote:
> From: Hannes Reinecke <hare@kernel.org>
> 
> Hi all,
> 
> there are discussions about deploying more complex I/O scheduling
> algorithms for NVMe, but then there's the question of whether we really
> want to carry these in the kernel.
> That sounded like an ideal testbed for eBPF struct_ops to me.
> Taking a cue from Ming Lei's patchset for eBPF on ublk (thanks,
> Ming!)
> I've started messing around with eBPF.

I happen to have experience in this area and would like to participate
in the development of nvme-bpf. I have previously developed the MPTCP
BPF packet scheduler [1], which is already in the export branch of the
MPTCP repository [2].

[1]
https://github.com/multipath-tcp/mptcp_net-next/issues/75

[2]
https://github.com/multipath-tcp/mptcp_net-next/commit/e83320eb669f48effae8a2d203d834ca2454308a
https://github.com/multipath-tcp/mptcp_net-next/commit/397b7213a2e45bc0c188d5fefa0889899657716f
https://github.com/multipath-tcp/mptcp_net-next/commit/0c59c5d43f6babd016bdbbf00365257ea57796e9

> 
> So here's a patchset to implement nvme multipath eBPF path selectors.
> The idea is quite simple: the eBPF 'struct_ops' program provides a
> 'select_path' function, which selects an nvme_ns struct to use for
> the I/O starting at a given sector.
> Unfortunately eBPF doesn't allow passing pointers, _and_ the
> definitions
> for 'struct nvme_ns_head' and 'struct nvme_ns' are internal to the
> nvme subsystem. So I kept those structures as opaque pointers for
> eBPF, and introduced a 'nvme_bpf_iter' structure as a path iterator.
> There are two functions 'nvme_bpf_first_path' and
> 'nvme_bpf_next_path'
> which can be used for an open-coded loop over all paths.
> I've also added sample code as an example of how the loop can be coded.
> 
> It's all pretty rudimentary (as I'm sure people will need accessors
> to get to any namespace or controller details), but that's why I sent
> it out as an RFC. And I am by no means an eBPF expert, so I'd be
> glad for any corrections or suggestions for a better eBPF
> integration.
> 
> The entire patchset can be found at:
> git.kernel.org:/pub/scm/linux/kernel/git/hare/scsi-devel.git
> branch nvme-bpf
> 
> As usual, reviews and comments are welcome.
> 
> Hannes Reinecke (6):
>   nvme-multipath: do not assign ->current_path in __nvme_find_path()
>   nvme: export nvme_find_get_subsystem()/nvme_put_subsystem()
>   nvme: add per-namespace iopolicy sysfs attribute
>   nvme: add 'sector' parameter to nvme_find_path()
>   nvme-bpf: eBPF struct_ops path selectors
>   tools/testing/selftests: add sample nvme bpf path selector
> 
>  drivers/nvme/host/Kconfig                     |   9 +
>  drivers/nvme/host/Makefile                    |   1 +
>  drivers/nvme/host/bpf.h                       |  33 ++
>  drivers/nvme/host/bpf_ops.c                   | 347
> ++++++++++++++++++
>  drivers/nvme/host/core.c                      |  17 +-
>  drivers/nvme/host/ioctl.c                     |   7 +-
>  drivers/nvme/host/multipath.c                 |  69 +++-
>  drivers/nvme/host/nvme.h                      |  11 +-
>  drivers/nvme/host/pr.c                        |   2 +-
>  drivers/nvme/host/sysfs.c                     |   9 +-
>  include/linux/nvme-bpf.h                      |  54 +++
>  .../selftests/bpf/progs/bpf_nvme_simple.c     |  52 +++
>  12 files changed, 585 insertions(+), 26 deletions(-)
>  create mode 100644 drivers/nvme/host/bpf.h
>  create mode 100644 drivers/nvme/host/bpf_ops.c
>  create mode 100644 include/linux/nvme-bpf.h
>  create mode 100644
> tools/testing/selftests/bpf/progs/bpf_nvme_simple.c



* Re: [PATCH 6/6] tools/testing/selftests: add sample nvme bpf path selector
  2025-07-29  7:06 ` [PATCH 6/6] tools/testing/selftests: add sample nvme bpf path selector hare
@ 2025-07-30  2:03   ` Geliang Tang
  2025-07-30  5:56     ` Hannes Reinecke
  0 siblings, 1 reply; 13+ messages in thread
From: Geliang Tang @ 2025-07-30  2:03 UTC (permalink / raw)
  To: hare, Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme

Hi Hannes,

I tried to compile this BPF program, but got a compilation error. To
fix it, NVME_BPF and its dependencies need to be enabled in
tools/testing/selftests/bpf/config:

+++ b/tools/testing/selftests/bpf/config
@@ -121,3 +121,8 @@ CONFIG_XDP_SOCKETS=y
 CONFIG_XFRM_INTERFACE=y
 CONFIG_TCP_CONG_DCTCP=y
 CONFIG_TCP_CONG_BBR=y
+CONFIG_NVME_CORE=y
+CONFIG_NVME_FABRICS=y
+CONFIG_NVME_TCP=y
+CONFIG_NVME_MULTIPATH=y
+CONFIG_NVME_BPF=y

On Tue, 2025-07-29 at 09:06 +0200, hare@kernel.org wrote:
> From: Hannes Reinecke <hare@kernel.org>
> 
> Add a simple nvme bpf path selector to demonstrate the namespace
> path iteration.
> 
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
>  .../selftests/bpf/progs/bpf_nvme_simple.c     | 52
> +++++++++++++++++++

We also need to create an nvme test program in
tools/testing/selftests/bpf/prog_tests/ to load and verify this
bpf_nvme_simple BPF program, just like I did in
tools/testing/selftests/bpf/prog_tests/mptcp.c. I can try to implement
it if you think it's necessary.
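
Roughly something like this minimal (untested) sketch, assuming a
generated skeleton and that the subsystem named in the .struct_ops map
exists when attaching:

#include <test_progs.h>
#include "bpf_nvme_simple.skel.h"

void test_nvme_bpf(void)
{
	struct bpf_nvme_simple *skel;
	struct bpf_link *link;

	skel = bpf_nvme_simple__open_and_load();
	if (!ASSERT_OK_PTR(skel, "open_and_load"))
		return;

	/* attaching triggers nvme_bpf_reg() in the kernel */
	link = bpf_map__attach_struct_ops(skel->maps.bpf_nvme_simple);
	ASSERT_OK_PTR(link, "attach_struct_ops");

	bpf_link__destroy(link);
	bpf_nvme_simple__destroy(skel);
}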

Thanks,
-Geliang

>  1 file changed, 52 insertions(+)
>  create mode 100644
> tools/testing/selftests/bpf/progs/bpf_nvme_simple.c
> 
> diff --git a/tools/testing/selftests/bpf/progs/bpf_nvme_simple.c
> b/tools/testing/selftests/bpf/progs/bpf_nvme_simple.c
> new file mode 100644
> index 000000000000..c9cafb6bd253
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bpf_nvme_simple.c
> @@ -0,0 +1,52 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * simple nvme ebpf path selector
> + *
> + * Simulates a striped RAID layout with a chunk size of 'simple_blocksize' sectors
> + */
> +
> +#include <vmlinux.h>
> +#include <errno.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +static sector_t simple_offset = 0;
> +static sector_t simple_blocksize = 1048576;
> +
> +SEC("struct_ops")
> +int BPF_PROG(simple_select, struct nvme_bpf_iter *iter, sector_t
> sector)
> +{
> +	sector_t offset = simple_offset;
> +	sector_t block_size = simple_blocksize;
> +	u32 num_blks, num_paths, num_iter, i;
> +	int cntlid;
> +
> +	if (sector > offset)
> +		sector -= offset;
> +	cntlid = nvme_bpf_first_path(iter);
> +	if (cntlid < 0)
> +		return cntlid;
> +	if (!block_size || sector < block_size)
> +		return cntlid;
> +
> +	num_blks = (sector / block_size);
> +	num_paths = nvme_bpf_count_paths(iter);
> +	num_iter = num_blks % num_paths;
> +	bpf_for (i, 0, num_iter) {
> +		cntlid = nvme_bpf_next_path(iter);
> +		if (cntlid < 0)
> +			break;
> +	}
> +	return cntlid;
> +}
> +
> +SEC(".struct_ops")
> +struct nvme_bpf_ops bpf_nvme_simple = {
> +	.uuid = { 0x86, 0xee, 0x41, 0xd5, 0x25, 0x6b, 0x45, 0xd0,
> 0xa4, 0x81, 0x5e, 0x35, 0xf6, 0x02, 0xf5, 0x11 },
> +	.subsysnqn = "blktests-subsystem-1",
> +	.nsid = 1,
> +	.select_path = (void *)simple_select,
> +};



* Re: [PATCH 6/6] tools/testing/selftests: add sample nvme bpf path selector
  2025-07-30  2:03   ` Geliang Tang
@ 2025-07-30  5:56     ` Hannes Reinecke
  0 siblings, 0 replies; 13+ messages in thread
From: Hannes Reinecke @ 2025-07-30  5:56 UTC (permalink / raw)
  To: Geliang Tang, hare, Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, linux-nvme

On 7/30/25 04:03, Geliang Tang wrote:
> Hi Hannes,
> 
> I tried to compile this BPF program, but got a compilation error. To
> fix it, NVME_BPF and its dependencies need to be enabled in
> tools/testing/selftests/bpf/config:
> 
> +++ b/tools/testing/selftests/bpf/config
> @@ -121,3 +121,8 @@ CONFIG_XDP_SOCKETS=y
>   CONFIG_XFRM_INTERFACE=y
>   CONFIG_TCP_CONG_DCTCP=y
>   CONFIG_TCP_CONG_BBR=y
> +CONFIG_NVME_CORE=y
> +CONFIG_NVME_FABRICS=y
> +CONFIG_NVME_TCP=y
> +CONFIG_NVME_MULTIPATH=y
> +CONFIG_NVME_BPF=y
> 
Yeah; that's correct.
Thanks for that.

That's one thing I forgot to mention: currently one has to compile in
NVME_CORE and NVME_BPF; one will get linking errors otherwise.
Will see if I can fix it up.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [RFC PATCH 0/6] nvme multipath eBPF path selector
  2025-07-29 14:53   ` Mike Christie
@ 2025-07-30 14:06     ` Christoph Hellwig
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2025-07-30 14:06 UTC (permalink / raw)
  To: Mike Christie
  Cc: Christoph Hellwig, hare, Keith Busch, Sagi Grimberg, linux-nvme

On Tue, Jul 29, 2025 at 09:53:09AM -0500, Mike Christie wrote:
> On 7/29/25 2:54 AM, Christoph Hellwig wrote:
> > On Tue, Jul 29, 2025 at 09:06:47AM +0200, hare@kernel.org wrote:
> >> From: Hannes Reinecke <hare@kernel.org>
> >>
> >> Hi all,
> >>
> >> there are discussions about deploying more complex I/O scheduling
> >> algorithms for NVMe,
> > 
> > Are "we"?  Where?
> 
> I think that's me. I was originally looking into eBPF path selectors
> for ADNN:
> 
> https://www.snia.org/educational-library/adaptive-distributed-nvme-namespaces-2020

ADNN needs to use eBPF, but controller-provided eBPF, not eBPF supplied
by the user.  This has been clearly communicated to the group in 2022,
but as usual FMDS is drifting off into their own parallel universe, not
only on the issue of path selection.  Based on these factors ADNN in its
current trajectory is unlikely to ever be implemented in Linux.



