[RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy

* [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy
@ 2025-11-05 10:33 Nilay Shroff
  2025-11-05 10:33 ` [RFC PATCHv5 1/7] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
                   ` (8 more replies)
  0 siblings, 9 replies; 28+ messages in thread
From: Nilay Shroff @ 2025-11-05 10:33 UTC (permalink / raw)
  To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce

Hi,

This series introduces a new adaptive I/O policy for NVMe native
multipath. Existing policies such as numa, round-robin, and queue-depth
are static and do not adapt to real-time transport performance. The numa
policy selects the path closest to the NUMA node of the current CPU,
optimizing memory and path locality, but ignores actual path
performance. The round-robin policy distributes I/O evenly across all
paths, providing fairness but no performance awareness. The queue-depth
policy reacts to instantaneous queue occupancy, avoiding heavily loaded
paths, but does not account for actual latency, throughput, or link
speed.

The new adaptive policy addresses these gaps by selecting paths
dynamically based on measured I/O latency for both PCIe and fabrics.
Latency is derived by passively sampling I/O completions. Each path is
assigned a weight proportional to its latency score, and I/Os are then
forwarded accordingly. As conditions change (e.g. latency spikes,
bandwidth differences), path weights are updated, automatically steering
traffic toward better-performing paths.

Early results show reduced tail latency under mixed workloads and
improved throughput by exploiting higher-speed links more effectively.
For example, with NVMf/TCP using two paths (one throttled with ~30 ms
delay), fio results with random read/write/rw workloads (direct I/O)
showed:

        numa         round-robin   queue-depth  adaptive
        -----------  -----------   -----------  ---------
READ:   50.0 MiB/s   105 MiB/s     230 MiB/s    350 MiB/s
WRITE:  65.9 MiB/s   125 MiB/s     385 MiB/s    446 MiB/s
RW:     R:30.6 MiB/s R:56.5 MiB/s  R:122 MiB/s  R:175 MiB/s
        W:30.7 MiB/s W:56.5 MiB/s  W:122 MiB/s  W:175 MiB/s

This patchset includes a total of 7 patches:
[PATCH 1/7] block: expose blk_stat_{enable,disable}_accounting()
  - Make blk_stat APIs available to block drivers.
  - Needed for per-path latency measurement in adaptive policy.

[PATCH 2/7] nvme-multipath: add support for adaptive I/O policy
  - Implement path scoring based on latency (EWMA).
  - Distribute I/O proportionally to per-path weights.

[PATCH 3/7] nvme: add generic debugfs support
  - Introduce generic debugfs support for NVMe module

[PATCH 4/7] nvme-multipath: add debugfs attribute adaptive_ewma_shift
  - Adds a debugfs attribute to control ewma shift

[PATCH 5/7] nvme-multipath: add debugfs attribute adaptive_weight_timeout
  - Adds a debugfs attribute to control path weight calculation timeout

[PATCH 6/7] nvme-multipath: add debugfs attribute adaptive_stat
  - Add “adaptive_stat” under per-path and head debugfs directories to
    expose adaptive policy state and statistics.

[PATCH 7/7] nvme-multipath: add documentation for adaptive I/O policy
  - Includes documentation for adaptive I/O multipath policy.

As usual, feedback and suggestions are most welcome!

Thanks!

Changes from v4:
  - Added patch #7 which includes the documentation for adaptive I/O
    policy. (Guixin Liu)
Link to v4: https://lore.kernel.org/all/20251104104533.138481-1-nilay@linux.ibm.com/    

Changes from v3:
  - Rename the adaptive APIs (which actually enable/disable the
    adaptive policy) to reflect the work they do. Also removed
    the misleading use of "current_path" from the adaptive policy code
    (Hannes Reinecke)
  - Move adaptive_ewma_shift and adaptive_weight_timeout attributes from
    sysfs to debugfs (Hannes Reinecke)
Link to v3: https://lore.kernel.org/all/20251027092949.961287-1-nilay@linux.ibm.com/

Changes from v2:
  - Added a new patch to allow the user to configure the EWMA shift
    through sysfs (Hannes Reinecke)
  - Added a new patch to allow user to configure path weight
    calculation timeout (Hannes Reinecke)
  - Distinguish between read/write and other commands (e.g.
    admin commands) and calculate a path weight for other commands
    which is separate from the read/write weight. (Hannes Reinecke)
  - Normalize per-path weight in the range from 0-128 instead
    of 0-100 (Hannes Reinecke)
  - Restructure and optimize adaptive I/O forwarding code to use
    one loop instead of two (Hannes Reinecke)
Link to v2: https://lore.kernel.org/all/20251009100608.1699550-1-nilay@linux.ibm.com/

Changes from v1:
  - Ensure that the completion of I/O occurs on the same CPU as the
    submitting I/O CPU (Hannes Reinecke)
  - Remove adapter link speed from the path weight calculation
    (Hannes Reinecke)
  - Add adaptive I/O stat under debugfs instead of current sysfs
    (Hannes Reinecke)
  - Move path weight calculation to a workqueue from IO completion
    code path
Link to v1: https://lore.kernel.org/all/20250921111234.863853-1-nilay@linux.ibm.com/

Nilay Shroff (7):
  block: expose blk_stat_{enable,disable}_accounting() to drivers
  nvme-multipath: add support for adaptive I/O policy
  nvme: add generic debugfs support
  nvme-multipath: add debugfs attribute adaptive_ewma_shift
  nvme-multipath: add debugfs attribute adaptive_weight_timeout
  nvme-multipath: add debugfs attribute adaptive_stat
  nvme-multipath: add documentation for adaptive I/O policy

 Documentation/admin-guide/nvme-multipath.rst |  19 +
 block/blk-stat.h                             |   4 -
 drivers/nvme/host/Makefile                   |   2 +-
 drivers/nvme/host/core.c                     |  22 +-
 drivers/nvme/host/debugfs.c                  | 335 +++++++++++++++
 drivers/nvme/host/ioctl.c                    |  31 +-
 drivers/nvme/host/multipath.c                | 430 ++++++++++++++++++-
 drivers/nvme/host/nvme.h                     |  86 +++-
 drivers/nvme/host/pr.c                       |   6 +-
 drivers/nvme/host/sysfs.c                    |   2 +-
 include/linux/blk-mq.h                       |   4 +
 11 files changed, 913 insertions(+), 28 deletions(-)
 create mode 100644 drivers/nvme/host/debugfs.c

-- 
2.51.0




* [RFC PATCHv5 1/7] block: expose blk_stat_{enable,disable}_accounting() to drivers
  2025-11-05 10:33 [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
@ 2025-11-05 10:33 ` Nilay Shroff
  2025-12-12 12:16   ` Sagi Grimberg
  2025-11-05 10:33 ` [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy Nilay Shroff
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Nilay Shroff @ 2025-11-05 10:33 UTC (permalink / raw)
  To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce

The functions blk_stat_enable_accounting() and
blk_stat_disable_accounting() are currently exported, but their
prototypes are only defined in a private header. Move these prototypes
into a common header so that block drivers can directly use these APIs.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 block/blk-stat.h       | 4 ----
 include/linux/blk-mq.h | 4 ++++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-stat.h b/block/blk-stat.h
index 9e05bf18d1be..f5d95dd8c0e9 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -67,10 +67,6 @@ void blk_free_queue_stats(struct blk_queue_stats *);
 
 void blk_stat_add(struct request *rq, u64 now);
 
-/* record time/size info in request but not add a callback */
-void blk_stat_enable_accounting(struct request_queue *q);
-void blk_stat_disable_accounting(struct request_queue *q);
-
 /**
  * blk_stat_alloc_callback() - Allocate a block statistics callback.
  * @timer_fn: Timer callback function.
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b25d12545f46..f647444643b8 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -735,6 +735,10 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
 
 bool blk_mq_queue_inflight(struct request_queue *q);
 
+/* record time/size info in request but not add a callback */
+void blk_stat_enable_accounting(struct request_queue *q);
+void blk_stat_disable_accounting(struct request_queue *q);
+
 enum {
 	/* return when out of requests */
 	BLK_MQ_REQ_NOWAIT	= (__force blk_mq_req_flags_t)(1 << 0),
-- 
2.51.0




* [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2025-11-05 10:33 [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
  2025-11-05 10:33 ` [RFC PATCHv5 1/7] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
@ 2025-11-05 10:33 ` Nilay Shroff
  2025-12-12 13:04   ` Sagi Grimberg
  2025-11-05 10:33 ` [RFC PATCHv5 3/7] nvme: add generic debugfs support Nilay Shroff
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Nilay Shroff @ 2025-11-05 10:33 UTC (permalink / raw)
  To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce

This commit introduces a new I/O policy named "adaptive". Users can
configure it by writing "adaptive" to
"/sys/class/nvme-subsystem/nvme-subsystemX/iopolicy".

The adaptive policy dynamically distributes I/O based on measured
completion latency. The main idea is to calculate latency for each path,
derive a weight, and then proportionally forward I/O according to those
weights.

To ensure scalability, path latency is measured per-CPU. Each CPU
maintains its own statistics, and I/O forwarding uses these per-CPU
values. Every ~15 seconds, a simple average latency of the per-CPU
batched samples is computed and fed into an Exponentially Weighted
Moving Average (EWMA):

avg_latency = div_u64(batch, batch_count);
new_ewma_latency = (prev_ewma_latency * (WEIGHT-1) + avg_latency)/WEIGHT
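
The two formulas above can be sketched in user-space C as follows. This
is only an illustration, not the kernel code: the helper names
batch_average() and ewma_update() and the "shift" parameter (WEIGHT
expressed as 1 << shift) are assumptions made for this example, as is
seeding the average with the first sample when no history exists.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: simple average of one batch of latency samples. */
static uint64_t batch_average(uint64_t batch_sum_ns, uint64_t batch_count)
{
	assert(batch_count > 0);
	return batch_sum_ns / batch_count;
}

/*
 * Hypothetical helper: EWMA update with WEIGHT = 1 << shift.
 * shift = 3 gives WEIGHT = 8, i.e. 7/8 old value + 1/8 new average.
 */
static uint64_t ewma_update(uint64_t prev_ns, uint64_t avg_ns,
			    unsigned int shift)
{
	uint64_t weight = 1ULL << shift;

	assert(shift < 32);
	/* Assumption: the first average seeds the EWMA (no history yet). */
	if (!prev_ns)
		return avg_ns;
	return (prev_ns * (weight - 1) + avg_ns) >> shift;
}
```

For instance, a previous smoothed latency of 800 ns and a new batch
average of 1600 ns give (800 * 7 + 1600) / 8 = 900 ns.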

With WEIGHT = 8, the EWMA assigns 7/8 (~87.5%) weight to the previous
latency value and 1/8 (~12.5%) to the most recent latency. This
smoothing reduces jitter, adapts quickly to changing conditions,
avoids storing historical samples, and works well for both low and
high I/O rates. Path weights are then derived from the smoothed (EWMA)
latency as follows (example with two paths A and B):

    path_A_score = NSEC_PER_SEC / path_A_ewma_latency
    path_B_score = NSEC_PER_SEC / path_B_ewma_latency
    total_score  = path_A_score + path_B_score

    path_A_weight = (path_A_score * 64) / total_score
    path_B_weight = (path_B_score * 64) / total_score

where:
  - path_X_ewma_latency is the smoothed latency of a path in nanoseconds
  - NSEC_PER_SEC is used as a scaling factor since valid latencies
    are < 1 second
  - weights are normalized to a 0–64 scale across all paths.
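
As a rough user-space illustration of the score and weight derivation
described above, consider the sketch below. The function name
path_weights(), the WEIGHT_SCALE constant (taken from the 0–64 scale
mentioned above) and the array-based interface are assumptions made for
this example only. Clamping the result to a minimum of 1 is also an
assumption of the sketch, so that a slow path still receives some I/O.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define WEIGHT_SCALE 64	/* assumed upper bound of the normalized weight */

/*
 * Hypothetical helper: derive a weight for each path from its smoothed
 * latency (in nanoseconds). Lower latency yields a higher weight.
 */
static void path_weights(const uint64_t *ewma_ns, uint32_t *weight,
			 size_t npaths)
{
	uint64_t total_score = 0;
	size_t i;

	/* First pass: score = NSEC_PER_SEC / latency, summed over paths. */
	for (i = 0; i < npaths; i++) {
		assert(ewma_ns[i] > 0 && ewma_ns[i] <= NSEC_PER_SEC);
		total_score += NSEC_PER_SEC / ewma_ns[i];
	}

	/* Second pass: normalize each score to the 0..WEIGHT_SCALE range. */
	for (i = 0; i < npaths; i++) {
		uint64_t score = NSEC_PER_SEC / ewma_ns[i];
		uint64_t w = score * WEIGHT_SCALE / total_score;

		/* Assumption: never let a measured path drop to weight 0. */
		weight[i] = w ? (uint32_t)w : 1;
	}
}
```

With smoothed latencies of 1 ms and 3 ms, the scores are 1000 and 333,
so the integer weights come out as 48 and 15.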

Path credits are refilled based on this weight, with one credit
consumed per I/O. When all credits are consumed, the credits are
refilled again based on the current weight. This ensures that I/O is
distributed across paths proportionally to their calculated weight.
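
The credit scheme can be modelled in user space roughly as shown below.
This is a simplified sketch, not the kernel implementation: struct path,
select_path() and the index-based path list are hypothetical, and the
model ignores per-CPU state, ANA state and disabled paths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one path. */
struct path {
	uint32_t weight;	/* share derived from smoothed latency */
	uint32_t credit;	/* I/Os this path may still take */
};

/*
 * Pick the path for the next I/O, starting from index *cur. A path
 * with credit left consumes one credit and is selected. A path without
 * credit is refilled from its weight and skipped for now, so the
 * refilled credit is used once the other paths run dry.
 */
static size_t select_path(struct path *paths, size_t npaths, size_t *cur)
{
	size_t i = *cur;

	assert(npaths > 0);
	for (;;) {
		/* A non-zero weight guarantees the loop terminates. */
		assert(paths[i].weight > 0);
		if (paths[i].credit > 0) {
			paths[i].credit--;
			*cur = i;
			return i;
		}
		paths[i].credit = paths[i].weight;
		i = (i + 1) % npaths;
	}
}
```

With two paths of weight 3 and 1 and no initial credit, eight
consecutive calls select the first path six times and the second path
twice, matching the 3:1 weight ratio.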

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/core.c      |  15 +-
 drivers/nvme/host/ioctl.c     |  31 ++-
 drivers/nvme/host/multipath.c | 425 ++++++++++++++++++++++++++++++++--
 drivers/nvme/host/nvme.h      |  74 +++++-
 drivers/nvme/host/pr.c        |   6 +-
 drivers/nvme/host/sysfs.c     |   2 +-
 6 files changed, 530 insertions(+), 23 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fa4181d7de73..47f375c63d2d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -672,6 +672,9 @@ static void nvme_free_ns_head(struct kref *ref)
 	cleanup_srcu_struct(&head->srcu);
 	nvme_put_subsystem(head->subsys);
 	kfree(head->plids);
+#ifdef CONFIG_NVME_MULTIPATH
+	free_percpu(head->adp_path);
+#endif
 	kfree(head);
 }
 
@@ -689,6 +692,7 @@ static void nvme_free_ns(struct kref *kref)
 {
 	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
 
+	nvme_free_ns_stat(ns);
 	put_disk(ns->disk);
 	nvme_put_ns_head(ns->head);
 	nvme_put_ctrl(ns->ctrl);
@@ -4137,6 +4141,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
 	if (nvme_init_ns_head(ns, info))
 		goto out_cleanup_disk;
 
+	if (nvme_alloc_ns_stat(ns))
+		goto out_unlink_ns;
+
 	/*
 	 * If multipathing is enabled, the device name for all disks and not
 	 * just those that represent shared namespaces needs to be based on the
@@ -4161,7 +4168,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
 	}
 
 	if (nvme_update_ns_info(ns, info))
-		goto out_unlink_ns;
+		goto out_free_ns_stat;
 
 	mutex_lock(&ctrl->namespaces_lock);
 	/*
@@ -4170,7 +4177,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
 	 */
 	if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
 		mutex_unlock(&ctrl->namespaces_lock);
-		goto out_unlink_ns;
+		goto out_free_ns_stat;
 	}
 	nvme_ns_add_to_ctrl_list(ns);
 	mutex_unlock(&ctrl->namespaces_lock);
@@ -4201,6 +4208,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
 	list_del_rcu(&ns->list);
 	mutex_unlock(&ctrl->namespaces_lock);
 	synchronize_srcu(&ctrl->srcu);
+out_free_ns_stat:
+	nvme_free_ns_stat(ns);
  out_unlink_ns:
 	mutex_lock(&ctrl->subsys->lock);
 	list_del_rcu(&ns->siblings);
@@ -4244,6 +4253,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	 */
 	synchronize_srcu(&ns->head->srcu);
 
+	nvme_mpath_cancel_adaptive_path_weight_work(ns);
+
 	/* wait for concurrent submissions */
 	if (nvme_mpath_clear_current_path(ns))
 		synchronize_srcu(&ns->head->srcu);
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index c212fa952c0f..759d147d9930 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -700,18 +700,29 @@ static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
 int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
 		unsigned int cmd, unsigned long arg)
 {
+	u8 opcode;
 	struct nvme_ns_head *head = bdev->bd_disk->private_data;
 	bool open_for_write = mode & BLK_OPEN_WRITE;
 	void __user *argp = (void __user *)arg;
 	struct nvme_ns *ns;
 	int srcu_idx, ret = -EWOULDBLOCK;
 	unsigned int flags = 0;
+	unsigned int op_type = NVME_STAT_OTHER;
 
 	if (bdev_is_partition(bdev))
 		flags |= NVME_IOCTL_PARTITION;
 
+	if (cmd == NVME_IOCTL_SUBMIT_IO) {
+		if (get_user(opcode, (u8 *)argp))
+			return -EFAULT;
+		if (opcode == nvme_cmd_write)
+			op_type = NVME_STAT_WRITE;
+		else if (opcode == nvme_cmd_read)
+			op_type = NVME_STAT_READ;
+	}
+
 	srcu_idx = srcu_read_lock(&head->srcu);
-	ns = nvme_find_path(head);
+	ns = nvme_find_path(head, op_type);
 	if (!ns)
 		goto out_unlock;
 
@@ -733,6 +744,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
 long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
 		unsigned long arg)
 {
+	u8 opcode;
 	bool open_for_write = file->f_mode & FMODE_WRITE;
 	struct cdev *cdev = file_inode(file)->i_cdev;
 	struct nvme_ns_head *head =
@@ -740,9 +752,19 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
 	void __user *argp = (void __user *)arg;
 	struct nvme_ns *ns;
 	int srcu_idx, ret = -EWOULDBLOCK;
+	unsigned int op_type = NVME_STAT_OTHER;
+
+	if (cmd == NVME_IOCTL_SUBMIT_IO) {
+		if (get_user(opcode, (u8 *)argp))
+			return -EFAULT;
+		if (opcode == nvme_cmd_write)
+			op_type = NVME_STAT_WRITE;
+		else if (opcode == nvme_cmd_read)
+			op_type = NVME_STAT_READ;
+	}
 
 	srcu_idx = srcu_read_lock(&head->srcu);
-	ns = nvme_find_path(head);
+	ns = nvme_find_path(head, op_type);
 	if (!ns)
 		goto out_unlock;
 
@@ -762,7 +784,10 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
 	struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
 	struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
 	int srcu_idx = srcu_read_lock(&head->srcu);
-	struct nvme_ns *ns = nvme_find_path(head);
+	const struct nvme_uring_cmd *cmd = io_uring_sqe_cmd(ioucmd->sqe);
+	struct nvme_ns *ns = nvme_find_path(head,
+			READ_ONCE(cmd->opcode) & 1 ?
+			NVME_STAT_WRITE : NVME_STAT_READ);
 	int ret = -EINVAL;
 
 	if (ns)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 543e17aead12..55dc28375662 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -6,6 +6,9 @@
 #include <linux/backing-dev.h>
 #include <linux/moduleparam.h>
 #include <linux/vmalloc.h>
+#include <linux/blk-mq.h>
+#include <linux/math64.h>
+#include <linux/rculist.h>
 #include <trace/events/block.h>
 #include "nvme.h"
 
@@ -66,9 +69,10 @@ MODULE_PARM_DESC(multipath_always_on,
 	"create multipath node always except for private namespace with non-unique nsid; note that this also implicitly enables native multipath support");
 
 static const char *nvme_iopolicy_names[] = {
-	[NVME_IOPOLICY_NUMA]	= "numa",
-	[NVME_IOPOLICY_RR]	= "round-robin",
-	[NVME_IOPOLICY_QD]      = "queue-depth",
+	[NVME_IOPOLICY_NUMA]	 = "numa",
+	[NVME_IOPOLICY_RR]	 = "round-robin",
+	[NVME_IOPOLICY_QD]       = "queue-depth",
+	[NVME_IOPOLICY_ADAPTIVE] = "adaptive",
 };
 
 static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -83,6 +87,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
 		iopolicy = NVME_IOPOLICY_RR;
 	else if (!strncmp(val, "queue-depth", 11))
 		iopolicy = NVME_IOPOLICY_QD;
+	else if (!strncmp(val, "adaptive", 8))
+		iopolicy = NVME_IOPOLICY_ADAPTIVE;
 	else
 		return -EINVAL;
 
@@ -198,6 +204,204 @@ void nvme_mpath_start_request(struct request *rq)
 }
 EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
 
+static void nvme_mpath_weight_work(struct work_struct *weight_work)
+{
+	int cpu, srcu_idx;
+	u32 weight;
+	struct nvme_ns *ns;
+	struct nvme_path_stat *stat;
+	struct nvme_path_work *work = container_of(weight_work,
+			struct nvme_path_work, weight_work);
+	struct nvme_ns_head *head = work->ns->head;
+	int op_type = work->op_type;
+	u64 total_score = 0;
+
+	cpu = get_cpu();
+
+	srcu_idx = srcu_read_lock(&head->srcu);
+	list_for_each_entry_srcu(ns, &head->list, siblings,
+			srcu_read_lock_held(&head->srcu)) {
+
+		stat = &this_cpu_ptr(ns->info)[op_type].stat;
+		if (!READ_ONCE(stat->slat_ns)) {
+			stat->score = 0;
+			continue;
+		}
+		/*
+		 * Compute the path score as the inverse of smoothed
+		 * latency, scaled by NSEC_PER_SEC. Floating point
+		 * math is unavailable in the kernel, so fixed-point
+		 * scaling is used instead. NSEC_PER_SEC is chosen
+		 * because valid latencies are always < 1 second; longer
+		 * latencies are ignored.
+		 */
+		stat->score = div_u64(NSEC_PER_SEC, READ_ONCE(stat->slat_ns));
+
+		/* Compute total score. */
+		total_score += stat->score;
+	}
+
+	if (!total_score)
+		goto out;
+
+	/*
+	 * After computing the total score, we derive per-path weight
+	 * (normalized to the range 0–64). The weight represents the
+	 * relative share of I/O the path should receive.
+	 *
+	 *   - lower smoothed latency -> higher weight
+	 *   - higher smoothed latency -> lower weight
+	 *
+	 * Next, while forwarding I/O, we assign "credits" to each path
+	 * based on its weight (please also refer to nvme_adaptive_path()):
+	 *   - Initially, credits = weight.
+	 *   - Each time an I/O is dispatched on a path, its credits are
+	 *     decremented by one.
+	 *   - When a path runs out of credits, it becomes temporarily
+	 *     ineligible until credit is refilled.
+	 *
+	 * I/O distribution is therefore governed by available credits,
+	 * ensuring that over time the proportion of I/O sent to each
+	 * path matches its weight (and thus its performance).
+	 */
+	list_for_each_entry_srcu(ns, &head->list, siblings,
+			srcu_read_lock_held(&head->srcu)) {
+
+		stat = &this_cpu_ptr(ns->info)[op_type].stat;
+		weight = div_u64(stat->score * 64, total_score);
+
+		/*
+		 * Ensure the path weight never drops below 1. A weight
+		 * of 0 is used only for newly added paths. During
+		 * bootstrap, a few I/Os are sent to such paths to
+		 * establish an initial weight. Enforcing a minimum
+		 * weight of 1 guarantees that no path is forgotten and
+		 * that each path is probed at least occasionally.
+		 */
+		if (!weight)
+			weight = 1;
+
+		WRITE_ONCE(stat->weight, weight);
+	}
+out:
+	srcu_read_unlock(&head->srcu, srcu_idx);
+	put_cpu();
+}
+
+/*
+ * Formula to calculate the EWMA (Exponentially Weighted Moving Average):
+ * ewma = (old_ewma * ((1 << EWMA_SHIFT) - 1) + new_sample) >> EWMA_SHIFT
+ * For instance, with EWMA_SHIFT = 3, this assigns 7/8 (~87.5 %) weight to
+ * the existing/old ewma and 1/8 (~12.5%) weight to the new sample.
+ */
+static inline u64 ewma_update(u64 old, u64 new)
+{
+	return (old * ((1 << NVME_DEFAULT_ADP_EWMA_SHIFT) - 1)
+			+ new) >> NVME_DEFAULT_ADP_EWMA_SHIFT;
+}
+
+static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
+{
+	int cpu;
+	unsigned int op_type;
+	struct nvme_path_info *info;
+	struct nvme_path_stat *stat;
+	u64 now, latency, slat_ns, avg_lat_ns;
+	struct nvme_ns_head *head = ns->head;
+
+	if (list_is_singular(&head->list))
+		return;
+
+	now = ktime_get_ns();
+	latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
+	if (!latency)
+		return;
+
+	/*
+	 * As the completion code path is serialized (i.e. the same completion
+	 * queue is never processed simultaneously on multiple CPUs), we can
+	 * safely access the per-CPU nvme path stat here from another CPU (in
+	 * case the completion CPU differs from the submission CPU).
+	 * The only field which could be accessed simultaneously here is the
+	 * path ->weight, which may be accessed by this function as well as by
+	 * the I/O submission path during path selection, and we protect
+	 * ->weight using READ_ONCE/WRITE_ONCE. This may not be 100% accurate,
+	 * but it doesn't need to be, as the path credit is refilled anyway,
+	 * based on path weight, once the path consumes all its credits. And
+	 * we limit path weight/credit to a maximum of 64. Please also refer
+	 * to nvme_adaptive_path().
+	 */
+	cpu = blk_mq_rq_cpu(rq);
+	op_type = nvme_data_dir(req_op(rq));
+	info = &per_cpu_ptr(ns->info, cpu)[op_type];
+	stat = &info->stat;
+
+	/*
+	 * If latency > ~1s then ignore this sample to prevent EWMA from being
+	 * skewed by pathological outliers (multi-second waits, controller
+	 * timeouts etc.). This keeps path scores representative of normal
+	 * performance and avoids instability from rare spikes. If such high
+	 * latency is real, ANA state reporting or keep-alive error counters
+	 * will mark the path unhealthy and remove it from the head node list,
+	 * so we safely skip such sample here.
+	 */
+	if (unlikely(latency > NSEC_PER_SEC)) {
+		stat->nr_ignored++;
+		dev_warn_ratelimited(ns->ctrl->device,
+			"ignoring sample with >1s latency (possible controller stall or timeout)\n");
+		return;
+	}
+
+	/*
+	 * Accumulate latency samples and increment the batch count for each
+	 * ~15 second interval. When the interval expires, compute the simple
+	 * average latency over that window, then update the smoothed (EWMA)
+	 * latency. The path weight is recalculated based on this smoothed
+	 * latency.
+	 */
+	stat->batch += latency;
+	stat->batch_count++;
+	stat->nr_samples++;
+
+	if (now > stat->last_weight_ts &&
+	    (now - stat->last_weight_ts) >= NVME_DEFAULT_ADP_WEIGHT_TIMEOUT) {
+
+		stat->last_weight_ts = now;
+
+		/*
+		 * Find simple average latency for the last epoch (~15 sec
+		 * interval).
+		 */
+		avg_lat_ns = div_u64(stat->batch, stat->batch_count);
+
+		/*
+		 * Calculate smooth/EWMA (Exponentially Weighted Moving Average)
+		 * latency. EWMA is preferred over simple average latency
+		 * because it smooths naturally, reduces jitter from sudden
+		 * spikes, and adapts faster to changing conditions. It also
+		 * avoids storing historical samples, and works well for both
+		 * slow and fast I/O rates.
+		 * Formula:
+		 * slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
+		 * With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
+		 * existing latency and 1/8 (~12.5%) weight to the new latency.
+		 */
+		if (unlikely(!stat->slat_ns))
+			WRITE_ONCE(stat->slat_ns, avg_lat_ns);
+		else {
+			slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
+			WRITE_ONCE(stat->slat_ns, slat_ns);
+		}
+
+		stat->batch = stat->batch_count = 0;
+
+		/*
+		 * Defer calculation of the path weight in per-cpu workqueue.
+		 */
+		schedule_work_on(cpu, &info->work.weight_work);
+	}
+}
+
 void nvme_mpath_end_request(struct request *rq)
 {
 	struct nvme_ns *ns = rq->q->queuedata;
@@ -205,6 +409,9 @@ void nvme_mpath_end_request(struct request *rq)
 	if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)
 		atomic_dec_if_positive(&ns->ctrl->nr_active);
 
+	if (test_bit(NVME_NS_PATH_STAT, &ns->flags))
+		nvme_mpath_add_sample(rq, ns);
+
 	if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
 		return;
 	bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
@@ -238,6 +445,62 @@ static const char *nvme_ana_state_names[] = {
 	[NVME_ANA_CHANGE]		= "change",
 };
 
+static void nvme_mpath_reset_adaptive_path_stat(struct nvme_ns *ns)
+{
+	int i, cpu;
+	struct nvme_path_stat *stat;
+
+	for_each_possible_cpu(cpu) {
+		for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
+			stat = &per_cpu_ptr(ns->info, cpu)[i].stat;
+			memset(stat, 0, sizeof(struct nvme_path_stat));
+		}
+	}
+}
+
+void nvme_mpath_cancel_adaptive_path_weight_work(struct nvme_ns *ns)
+{
+	int i, cpu;
+	struct nvme_path_info *info;
+
+	if (!test_bit(NVME_NS_PATH_STAT, &ns->flags))
+		return;
+
+	for_each_online_cpu(cpu) {
+		for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
+			info = &per_cpu_ptr(ns->info, cpu)[i];
+			cancel_work_sync(&info->work.weight_work);
+		}
+	}
+}
+
+static bool nvme_mpath_enable_adaptive_path_policy(struct nvme_ns *ns)
+{
+	struct nvme_ns_head *head = ns->head;
+
+	if (!head->disk || head->subsys->iopolicy != NVME_IOPOLICY_ADAPTIVE)
+		return false;
+
+	if (test_and_set_bit(NVME_NS_PATH_STAT, &ns->flags))
+		return false;
+
+	blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, ns->queue);
+	blk_stat_enable_accounting(ns->queue);
+	return true;
+}
+
+static bool nvme_mpath_disable_adaptive_path_policy(struct nvme_ns *ns)
+{
+
+	if (!test_and_clear_bit(NVME_NS_PATH_STAT, &ns->flags))
+		return false;
+
+	blk_stat_disable_accounting(ns->queue);
+	blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, ns->queue);
+	nvme_mpath_reset_adaptive_path_stat(ns);
+	return true;
+}
+
 bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
 {
 	struct nvme_ns_head *head = ns->head;
@@ -253,6 +516,8 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
 			changed = true;
 		}
 	}
+	if (nvme_mpath_disable_adaptive_path_policy(ns))
+		changed = true;
 out:
 	return changed;
 }
@@ -271,6 +536,45 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
 	srcu_read_unlock(&ctrl->srcu, srcu_idx);
 }
 
+int nvme_alloc_ns_stat(struct nvme_ns *ns)
+{
+	int i, cpu;
+	struct nvme_path_work *work;
+	gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
+
+	if (!ns->head->disk)
+		return 0;
+
+	ns->info = __alloc_percpu_gfp(NVME_NUM_STAT_GROUPS *
+			sizeof(struct nvme_path_info),
+			__alignof__(struct nvme_path_info), gfp);
+	if (!ns->info)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
+			work = &per_cpu_ptr(ns->info, cpu)[i].work;
+			work->ns = ns;
+			work->op_type = i;
+			INIT_WORK(&work->weight_work, nvme_mpath_weight_work);
+		}
+	}
+
+	return 0;
+}
+
+static void nvme_mpath_set_ctrl_paths(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+	int srcu_idx;
+
+	srcu_idx = srcu_read_lock(&ctrl->srcu);
+	list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
+				srcu_read_lock_held(&ctrl->srcu))
+		nvme_mpath_enable_adaptive_path_policy(ns);
+	srcu_read_unlock(&ctrl->srcu, srcu_idx);
+}
+
 void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
 {
 	struct nvme_ns_head *head = ns->head;
@@ -283,6 +587,8 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
 				 srcu_read_lock_held(&head->srcu)) {
 		if (capacity != get_capacity(ns->disk))
 			clear_bit(NVME_NS_READY, &ns->flags);
+
+		nvme_mpath_reset_adaptive_path_stat(ns);
 	}
 	srcu_read_unlock(&head->srcu, srcu_idx);
 
@@ -407,6 +713,92 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
 	return found;
 }
 
+static inline bool nvme_state_is_live(enum nvme_ana_state state)
+{
+	return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
+}
+
+static struct nvme_ns *nvme_adaptive_path(struct nvme_ns_head *head,
+		unsigned int op_type)
+{
+	struct nvme_ns *ns, *start, *found = NULL;
+	struct nvme_path_stat *stat;
+	u32 weight;
+	int cpu;
+
+	cpu = get_cpu();
+	ns = *this_cpu_ptr(head->adp_path);
+	if (unlikely(!ns)) {
+		ns = list_first_or_null_rcu(&head->list,
+				struct nvme_ns, siblings);
+		if (unlikely(!ns))
+			goto out;
+	}
+found_ns:
+	start = ns;
+	while (nvme_path_is_disabled(ns) ||
+			!nvme_state_is_live(ns->ana_state)) {
+		ns = list_next_entry_circular(ns, &head->list, siblings);
+
+		/*
+		 * If we iterate through all paths in the list but find each
+		 * path in list is either disabled or dead then bail out.
+		 */
+		if (ns == start)
+			goto out;
+	}
+
+	stat = &this_cpu_ptr(ns->info)[op_type].stat;
+
+	/*
+	 * When the head path-list is singular we don't calculate the
+	 * weight of the only path, as an optimization, since there is no
+	 * need to forward I/O to more than one path. Another possibility
+	 * is that the path is newly added and we don't know its weight
+	 * yet. So we go round-robin for each such path and forward I/O
+	 * to it. Once we start getting responses for such I/Os, the path
+	 * weight calculation kicks in and we then start using path
+	 * credit for forwarding I/O.
+	 */
+	weight = READ_ONCE(stat->weight);
+	if (!weight) {
+		found = ns;
+		goto out;
+	}
+
+	/*
+	 * To keep path selection logic simple, we don't distinguish
+	 * between ANA optimized and non-optimized states. The non-
+	 * optimized path is expected to have a lower weight, and
+	 * therefore fewer credits. As a result, only a small number of
+	 * I/Os will be forwarded to paths in the non-optimized state.
+	 */
+	if (stat->credit > 0) {
+		--stat->credit;
+		found = ns;
+		goto out;
+	} else {
+		/*
+		 * Refill credit from path weight and move to next path. The
+		 * refilled credit of the current path will be used next when
+		 * all remaining paths exhaust their credits.
+		 */
+		weight = READ_ONCE(stat->weight);
+		stat->credit = weight;
+		ns = list_next_entry_circular(ns, &head->list, siblings);
+		if (likely(ns))
+			goto found_ns;
+	}
+out:
+	if (found) {
+		stat->sel++;
+		*this_cpu_ptr(head->adp_path) = found;
+	}
+
+	put_cpu();
+	return found;
+}
+
 static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
 {
 	struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
@@ -463,9 +855,12 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
 	return ns;
 }
 
-inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
+inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head,
+		unsigned int op_type)
 {
 	switch (READ_ONCE(head->subsys->iopolicy)) {
+	case NVME_IOPOLICY_ADAPTIVE:
+		return nvme_adaptive_path(head, op_type);
 	case NVME_IOPOLICY_QD:
 		return nvme_queue_depth_path(head);
 	case NVME_IOPOLICY_RR:
@@ -525,7 +920,7 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
 		return;
 
 	srcu_idx = srcu_read_lock(&head->srcu);
-	ns = nvme_find_path(head);
+	ns = nvme_find_path(head, nvme_data_dir(bio_op(bio)));
 	if (likely(ns)) {
 		bio_set_dev(bio, ns->disk->part0);
 		bio->bi_opf |= REQ_NVME_MPATH;
@@ -567,7 +962,7 @@ static int nvme_ns_head_get_unique_id(struct gendisk *disk, u8 id[16],
 	int srcu_idx, ret = -EWOULDBLOCK;
 
 	srcu_idx = srcu_read_lock(&head->srcu);
-	ns = nvme_find_path(head);
+	ns = nvme_find_path(head, NVME_STAT_OTHER);
 	if (ns)
 		ret = nvme_ns_get_unique_id(ns, id, type);
 	srcu_read_unlock(&head->srcu, srcu_idx);
@@ -583,7 +978,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
 	int srcu_idx, ret = -EWOULDBLOCK;
 
 	srcu_idx = srcu_read_lock(&head->srcu);
-	ns = nvme_find_path(head);
+	ns = nvme_find_path(head, NVME_STAT_OTHER);
 	if (ns)
 		ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
 	srcu_read_unlock(&head->srcu, srcu_idx);
@@ -725,6 +1120,9 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 	INIT_WORK(&head->partition_scan_work, nvme_partition_scan_work);
 	INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
 	head->delayed_removal_secs = 0;
+	head->adp_path = alloc_percpu_gfp(struct nvme_ns*, GFP_KERNEL);
+	if (!head->adp_path)
+		return -ENOMEM;
 
 	/*
 	 * If "multipath_always_on" is enabled, a multipath node is added
@@ -809,6 +1207,10 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
 	}
 	mutex_unlock(&head->lock);
 
+	mutex_lock(&nvme_subsystems_lock);
+	nvme_mpath_enable_adaptive_path_policy(ns);
+	mutex_unlock(&nvme_subsystems_lock);
+
 	synchronize_srcu(&head->srcu);
 	kblockd_schedule_work(&head->requeue_work);
 }
@@ -857,11 +1259,6 @@ static int nvme_parse_ana_log(struct nvme_ctrl *ctrl, void *data,
 	return 0;
 }
 
-static inline bool nvme_state_is_live(enum nvme_ana_state state)
-{
-	return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
-}
-
 static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
 		struct nvme_ns *ns)
 {
@@ -1039,10 +1436,12 @@ static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys,
 
 	WRITE_ONCE(subsys->iopolicy, iopolicy);
 
-	/* iopolicy changes clear the mpath by design */
+	/* iopolicy changes clear/reset the mpath by design */
 	mutex_lock(&nvme_subsystems_lock);
 	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
 		nvme_mpath_clear_ctrl_paths(ctrl);
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+		nvme_mpath_set_ctrl_paths(ctrl);
 	mutex_unlock(&nvme_subsystems_lock);
 
 	pr_notice("subsysnqn %s iopolicy changed from %s to %s\n",
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 102fae6a231c..715c7053054c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -28,7 +28,10 @@ extern unsigned int nvme_io_timeout;
 extern unsigned int admin_timeout;
 #define NVME_ADMIN_TIMEOUT	(admin_timeout * HZ)
 
-#define NVME_DEFAULT_KATO	5
+#define NVME_DEFAULT_KATO		5
+
+#define NVME_DEFAULT_ADP_EWMA_SHIFT	3
+#define NVME_DEFAULT_ADP_WEIGHT_TIMEOUT	(15 * NSEC_PER_SEC)
 
 #ifdef CONFIG_ARCH_NO_SG_CHAIN
 #define  NVME_INLINE_SG_CNT  0
@@ -421,6 +424,7 @@ enum nvme_iopolicy {
 	NVME_IOPOLICY_NUMA,
 	NVME_IOPOLICY_RR,
 	NVME_IOPOLICY_QD,
+	NVME_IOPOLICY_ADAPTIVE,
 };
 
 struct nvme_subsystem {
@@ -459,6 +463,37 @@ struct nvme_ns_ids {
 	u8	csi;
 };
 
+enum nvme_stat_group {
+	NVME_STAT_READ,
+	NVME_STAT_WRITE,
+	NVME_STAT_OTHER,
+	NVME_NUM_STAT_GROUPS
+};
+
+struct nvme_path_stat {
+	u64 nr_samples;		/* total num of samples processed */
+	u64 nr_ignored;		/* num. of samples ignored */
+	u64 slat_ns;		/* smoothed (ewma) latency in nanoseconds */
+	u64 score;		/* score used for weight calculation */
+	u64 last_weight_ts;	/* timestamp of the last weight calculation */
+	u64 sel;		/* num of times this path is selected for I/O */
+	u64 batch;		/* accumulated latency sum for current window */
+	u32 batch_count;	/* num of samples accumulated in current window */
+	u32 weight;		/* path weight */
+	u32 credit;		/* path credit for I/O forwarding */
+};
+
+struct nvme_path_work {
+	struct nvme_ns *ns;		/* owning namespace */
+	struct work_struct weight_work;	/* deferred work for weight calculation */
+	int op_type;			/* op type : READ/WRITE/OTHER */
+};
+
+struct nvme_path_info {
+	struct nvme_path_stat stat;	/* path statistics */
+	struct nvme_path_work work;	/* background worker context */
+};
+
 /*
  * Anchor structure for namespaces.  There is one for each namespace in a
  * NVMe subsystem that any of our controllers can see, and the namespace
@@ -508,6 +543,9 @@ struct nvme_ns_head {
 	unsigned long		flags;
 	struct delayed_work	remove_work;
 	unsigned int		delayed_removal_secs;
+
+	struct nvme_ns * __percpu	*adp_path;
+
 #define NVME_NSHEAD_DISK_LIVE		0
 #define NVME_NSHEAD_QUEUE_IF_NO_PATH	1
 	struct nvme_ns __rcu	*current_path[];
@@ -534,6 +572,7 @@ struct nvme_ns {
 #ifdef CONFIG_NVME_MULTIPATH
 	enum nvme_ana_state ana_state;
 	u32 ana_grpid;
+	struct nvme_path_info __percpu *info;
 #endif
 	struct list_head siblings;
 	struct kref kref;
@@ -545,6 +584,7 @@ struct nvme_ns {
 #define NVME_NS_FORCE_RO		3
 #define NVME_NS_READY			4
 #define NVME_NS_SYSFS_ATTR_LINK	5
+#define NVME_NS_PATH_STAT		6
 
 	struct cdev		cdev;
 	struct device		cdev_device;
@@ -949,7 +989,17 @@ extern const struct attribute_group *nvme_dev_attr_groups[];
 extern const struct block_device_operations nvme_bdev_ops;
 
 void nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl);
-struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
+struct nvme_ns *nvme_find_path(struct nvme_ns_head *head, unsigned int op_type);
+static inline int nvme_data_dir(const enum req_op op)
+{
+	if (op == REQ_OP_READ)
+		return NVME_STAT_READ;
+	else if (op_is_write(op))
+		return NVME_STAT_WRITE;
+	else
+		return NVME_STAT_OTHER;
+}
+
 #ifdef CONFIG_NVME_MULTIPATH
 static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
 {
@@ -972,12 +1022,14 @@ void nvme_mpath_init_ctrl(struct nvme_ctrl *ctrl);
 void nvme_mpath_update(struct nvme_ctrl *ctrl);
 void nvme_mpath_uninit(struct nvme_ctrl *ctrl);
 void nvme_mpath_stop(struct nvme_ctrl *ctrl);
+void nvme_mpath_cancel_adaptive_path_weight_work(struct nvme_ns *ns);
 bool nvme_mpath_clear_current_path(struct nvme_ns *ns);
 void nvme_mpath_revalidate_paths(struct nvme_ns *ns);
 void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
 void nvme_mpath_remove_disk(struct nvme_ns_head *head);
 void nvme_mpath_start_request(struct request *rq);
 void nvme_mpath_end_request(struct request *rq);
+int nvme_alloc_ns_stat(struct nvme_ns *ns);
 
 static inline void nvme_trace_bio_complete(struct request *req)
 {
@@ -1005,6 +1057,13 @@ static inline bool nvme_mpath_queue_if_no_path(struct nvme_ns_head *head)
 		return true;
 	return false;
 }
+static inline void nvme_free_ns_stat(struct nvme_ns *ns)
+{
+	if (!ns->head->disk)
+		return;
+
+	free_percpu(ns->info);
+}
 #else
 #define multipath false
 static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
@@ -1096,6 +1155,17 @@ static inline bool nvme_mpath_queue_if_no_path(struct nvme_ns_head *head)
 {
 	return false;
 }
+static inline void nvme_mpath_cancel_adaptive_path_weight_work(
+		struct nvme_ns *ns)
+{
+}
+static inline int nvme_alloc_ns_stat(struct nvme_ns *ns)
+{
+	return 0;
+}
+static inline void nvme_free_ns_stat(struct nvme_ns *ns)
+{
+}
 #endif /* CONFIG_NVME_MULTIPATH */
 
 int nvme_ns_get_unique_id(struct nvme_ns *ns, u8 id[16],
diff --git a/drivers/nvme/host/pr.c b/drivers/nvme/host/pr.c
index ca6a74607b13..7aca2186c462 100644
--- a/drivers/nvme/host/pr.c
+++ b/drivers/nvme/host/pr.c
@@ -53,10 +53,12 @@ static int nvme_send_ns_head_pr_command(struct block_device *bdev,
 		struct nvme_command *c, void *data, unsigned int data_len)
 {
 	struct nvme_ns_head *head = bdev->bd_disk->private_data;
-	int srcu_idx = srcu_read_lock(&head->srcu);
-	struct nvme_ns *ns = nvme_find_path(head);
+	int srcu_idx;
+	struct nvme_ns *ns;
 	int ret = -EWOULDBLOCK;
 
+	srcu_idx = srcu_read_lock(&head->srcu);
+	ns = nvme_find_path(head, NVME_STAT_OTHER);
 	if (ns) {
 		c->common.nsid = cpu_to_le32(ns->head->ns_id);
 		ret = nvme_submit_sync_cmd(ns->queue, c, data, data_len);
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 29430949ce2f..1cbab90ed42e 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -194,7 +194,7 @@ static int ns_head_update_nuse(struct nvme_ns_head *head)
 		return 0;
 
 	srcu_idx = srcu_read_lock(&head->srcu);
-	ns = nvme_find_path(head);
+	ns = nvme_find_path(head, NVME_STAT_OTHER);
 	if (!ns)
 		goto out_unlock;
 
-- 
2.51.0




* [RFC PATCHv5 3/7] nvme: add generic debugfs support
  2025-11-05 10:33 [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
  2025-11-05 10:33 ` [RFC PATCHv5 1/7] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
  2025-11-05 10:33 ` [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy Nilay Shroff
@ 2025-11-05 10:33 ` Nilay Shroff
  2025-11-05 10:33 ` [RFC PATCHv5 4/7] nvme-multipath: add debugfs attribute adaptive_ewma_shift Nilay Shroff
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Nilay Shroff @ 2025-11-05 10:33 UTC (permalink / raw)
  To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce

Add generic infrastructure for creating and managing debugfs files in
the NVMe module. This introduces helper APIs that allow NVMe drivers to
register and unregister debugfs entries, along with a reusable attribute
structure for defining new debugfs files.

The implementation uses seq_file interfaces to safely expose per-NS and
per-NS-head statistics, while supporting both simple show callbacks and
full seq_operations.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/Makefile    |   2 +-
 drivers/nvme/host/core.c      |   3 +
 drivers/nvme/host/debugfs.c   | 138 ++++++++++++++++++++++++++++++++++
 drivers/nvme/host/multipath.c |   2 +
 drivers/nvme/host/nvme.h      |  10 +++
 5 files changed, 154 insertions(+), 1 deletion(-)
 create mode 100644 drivers/nvme/host/debugfs.c

diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index 6414ec968f99..7962dfc3b2ad 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -10,7 +10,7 @@ obj-$(CONFIG_NVME_FC)			+= nvme-fc.o
 obj-$(CONFIG_NVME_TCP)			+= nvme-tcp.o
 obj-$(CONFIG_NVME_APPLE)		+= nvme-apple.o
 
-nvme-core-y				+= core.o ioctl.o sysfs.o pr.o
+nvme-core-y				+= core.o ioctl.o sysfs.o pr.o debugfs.o
 nvme-core-$(CONFIG_NVME_VERBOSE_ERRORS)	+= constants.o
 nvme-core-$(CONFIG_TRACING)		+= trace.o
 nvme-core-$(CONFIG_NVME_MULTIPATH)	+= multipath.o
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 47f375c63d2d..c15dfcaf3de2 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4187,6 +4187,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
 	if (device_add_disk(ctrl->device, ns->disk, nvme_ns_attr_groups))
 		goto out_cleanup_ns_from_list;
 
+	nvme_debugfs_register(ns->disk);
+
 	if (!nvme_ns_head_multipath(ns->head))
 		nvme_add_ns_cdev(ns);
 
@@ -4276,6 +4278,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 
 	nvme_mpath_remove_sysfs_link(ns);
 
+	nvme_debugfs_unregister(ns->disk);
 	del_gendisk(ns->disk);
 
 	mutex_lock(&ns->ctrl->namespaces_lock);
diff --git a/drivers/nvme/host/debugfs.c b/drivers/nvme/host/debugfs.c
new file mode 100644
index 000000000000..6bb57c4b5c3b
--- /dev/null
+++ b/drivers/nvme/host/debugfs.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2025 IBM Corporation
+ *	Nilay Shroff <nilay@linux.ibm.com>
+ */
+
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+#include "nvme.h"
+
+struct nvme_debugfs_attr {
+	const char *name;
+	umode_t mode;
+	int (*show)(void *data, struct seq_file *m);
+	ssize_t (*write)(void *data, const char __user *buf, size_t count,
+			loff_t *ppos);
+	const struct seq_operations *seq_ops;
+};
+
+struct nvme_debugfs_ctx {
+	void *data;
+	struct nvme_debugfs_attr *attr;
+	int srcu_idx;
+};
+
+static int nvme_debugfs_show(struct seq_file *m, void *v)
+{
+	struct nvme_debugfs_ctx *ctx = m->private;
+	void *data = ctx->data;
+	struct nvme_debugfs_attr *attr = ctx->attr;
+
+	return attr->show(data, m);
+}
+
+static int nvme_debugfs_open(struct inode *inode, struct file *file)
+{
+	void *data = inode->i_private;
+	struct nvme_debugfs_attr *attr = debugfs_get_aux(file);
+	struct nvme_debugfs_ctx *ctx;
+	struct seq_file *m;
+	int ret;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (WARN_ON_ONCE(!ctx))
+		return -ENOMEM;
+
+	ctx->data = data;
+	ctx->attr = attr;
+
+	if (attr->seq_ops) {
+		ret = seq_open(file, attr->seq_ops);
+		if (ret) {
+			kfree(ctx);
+			return ret;
+		}
+		m = file->private_data;
+		m->private = ctx;
+		return ret;
+	}
+
+	if (WARN_ON_ONCE(!attr->show)) {
+		kfree(ctx);
+		return -EPERM;
+	}
+
+	return single_open(file, nvme_debugfs_show, ctx);
+}
+
+static ssize_t nvme_debugfs_write(struct file *file, const char __user *buf,
+			size_t count, loff_t *ppos)
+{
+	struct seq_file *m = file->private_data;
+	struct nvme_debugfs_ctx *ctx = m->private;
+	struct nvme_debugfs_attr *attr = ctx->attr;
+
+	if (!attr->write)
+		return -EPERM;
+
+	return attr->write(ctx->data, buf, count, ppos);
+}
+
+static int nvme_debugfs_release(struct inode *inode, struct file *file)
+{
+	struct seq_file *m = file->private_data;
+	struct nvme_debugfs_ctx *ctx = m->private;
+	struct nvme_debugfs_attr *attr = ctx->attr;
+	int ret;
+
+	if (attr->seq_ops)
+		ret = seq_release(inode, file);
+	else
+		ret = single_release(inode, file);
+
+	kfree(ctx);
+	return ret;
+}
+
+static const struct file_operations nvme_debugfs_fops = {
+	.owner   = THIS_MODULE,
+	.open    = nvme_debugfs_open,
+	.read    = seq_read,
+	.write   = nvme_debugfs_write,
+	.llseek  = seq_lseek,
+	.release = nvme_debugfs_release,
+};
+
+
+static const struct nvme_debugfs_attr nvme_mpath_debugfs_attrs[] = {
+	{},
+};
+
+static const struct nvme_debugfs_attr nvme_ns_debugfs_attrs[] = {
+	{},
+};
+
+static void nvme_debugfs_create_files(struct request_queue *q,
+		const struct nvme_debugfs_attr *attr, void *data)
+{
+	if (WARN_ON_ONCE(!q->debugfs_dir))
+		return;
+
+	for (; attr->name; attr++)
+		debugfs_create_file_aux(attr->name, attr->mode, q->debugfs_dir,
+				data, (void *)attr, &nvme_debugfs_fops);
+}
+
+void nvme_debugfs_register(struct gendisk *disk)
+{
+	const struct nvme_debugfs_attr *attr;
+
+	if (nvme_disk_is_ns_head(disk))
+		attr = nvme_mpath_debugfs_attrs;
+	else
+		attr = nvme_ns_debugfs_attrs;
+
+	nvme_debugfs_create_files(disk->queue, attr, disk->private_data);
+}
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 55dc28375662..047dd9da9cbf 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -1086,6 +1086,7 @@ static void nvme_remove_head(struct nvme_ns_head *head)
 
 		nvme_cdev_del(&head->cdev, &head->cdev_device);
 		synchronize_srcu(&head->srcu);
+		nvme_debugfs_unregister(head->disk);
 		del_gendisk(head->disk);
 	}
 	nvme_put_ns_head(head);
@@ -1192,6 +1193,7 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
 		}
 		nvme_add_ns_head_cdev(head);
 		kblockd_schedule_work(&head->partition_scan_work);
+		nvme_debugfs_register(head->disk);
 	}
 
 	nvme_mpath_add_sysfs_link(ns->head);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 715c7053054c..1c1ec2a7f9ad 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -1000,6 +1000,16 @@ static inline int nvme_data_dir(const enum req_op op)
 		return NVME_STAT_OTHER;
 }
 
+void nvme_debugfs_register(struct gendisk *disk);
+static inline void nvme_debugfs_unregister(struct gendisk *disk)
+{
+	/*
+	 * Nothing to do for now. When the request queue is unregistered,
+	 * all files under q->debugfs_dir are recursively deleted.
+	 * This is just a placeholder; the compiler will optimize it out.
+	 */
+}
+
 #ifdef CONFIG_NVME_MULTIPATH
 static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
 {
-- 
2.51.0




* [RFC PATCHv5 4/7] nvme-multipath: add debugfs attribute adaptive_ewma_shift
  2025-11-05 10:33 [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
                   ` (2 preceding siblings ...)
  2025-11-05 10:33 ` [RFC PATCHv5 3/7] nvme: add generic debugfs support Nilay Shroff
@ 2025-11-05 10:33 ` Nilay Shroff
  2025-11-05 10:33 ` [RFC PATCHv5 5/7] nvme-multipath: add debugfs attribute adaptive_weight_timeout Nilay Shroff
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Nilay Shroff @ 2025-11-05 10:33 UTC (permalink / raw)
  To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce

By default, the EWMA (Exponentially Weighted Moving Average) shift
value, used for smoothing latency samples for the adaptive iopolicy, is
set to 3. The EWMA is calculated using the following formula:

    ewma = (old * ((1 << ewma_shift) - 1) + new) >> ewma_shift;

The default value of 3 assigns 7/8 (87.5%) weight to the existing EWMA
value and 1/8 (12.5%) weight to the new latency sample. This provides a
stable average that smooths out short-term variations.

However, different workloads may require faster or slower adaptation to
changing conditions. This commit introduces a new debugfs attribute,
adaptive_ewma_shift, allowing users to tune the weighting factor.

For example:
  - adaptive_ewma_shift = 2 => 75% old, 25% new
  - adaptive_ewma_shift = 1 => 50% old, 50% new
  - adaptive_ewma_shift = 0 => 0% old, 100% new
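
The effect of the shift can be checked with a small userspace sketch of
the formula above. This is only an illustration: ewma_update_sketch() is
a hypothetical name, the fixed-width types stand in for the kernel's
u64/u32, and the assert mirrors the upper bound of 8 that the debugfs
store handler enforces.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace sketch of the EWMA update (hypothetical helper, not the
 * kernel function). A larger shift gives the old value more weight.
 */
static inline uint64_t ewma_update_sketch(uint64_t old, uint64_t new_sample,
					  uint32_t ewma_shift)
{
	/* The debugfs attribute rejects shift values greater than 8. */
	assert(ewma_shift <= 8);

	return (old * ((1ULL << ewma_shift) - 1) + new_sample) >> ewma_shift;
}
```

With an old EWMA of 1000 ns and a new sample of 2000 ns, the sketch
yields 1125 ns for shift 3, 1250 ns for shift 2, 1500 ns for shift 1,
and 2000 ns for shift 0.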

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/core.c      |  3 +++
 drivers/nvme/host/debugfs.c   | 46 +++++++++++++++++++++++++++++++++++
 drivers/nvme/host/multipath.c |  8 +++---
 drivers/nvme/host/nvme.h      |  1 +
 4 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c15dfcaf3de2..43b9b0d6cbdf 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3913,6 +3913,9 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 	head->ids = info->ids;
 	head->shared = info->is_shared;
 	head->rotational = info->is_rotational;
+#ifdef CONFIG_NVME_MULTIPATH
+	head->adp_ewma_shift = NVME_DEFAULT_ADP_EWMA_SHIFT;
+#endif
 	ratelimit_state_init(&head->rs_nuse, 5 * HZ, 1);
 	ratelimit_set_flags(&head->rs_nuse, RATELIMIT_MSG_ON_RELEASE);
 	kref_init(&head->ref);
diff --git a/drivers/nvme/host/debugfs.c b/drivers/nvme/host/debugfs.c
index 6bb57c4b5c3b..e3c37041e8f2 100644
--- a/drivers/nvme/host/debugfs.c
+++ b/drivers/nvme/host/debugfs.c
@@ -105,8 +105,54 @@ static const struct file_operations nvme_debugfs_fops = {
 	.release = nvme_debugfs_release,
 };
 
+#ifdef CONFIG_NVME_MULTIPATH
+static int nvme_adp_ewma_shift_show(void *data, struct seq_file *m)
+{
+	struct nvme_ns_head *head = data;
+
+	seq_printf(m, "%u\n", READ_ONCE(head->adp_ewma_shift));
+	return 0;
+}
+
+static ssize_t nvme_adp_ewma_shift_store(void *data, const char __user *ubuf,
+		size_t count, loff_t *ppos)
+{
+	struct nvme_ns_head *head = data;
+	char kbuf[8];
+	u32 res;
+	int ret;
+	size_t len;
+	char *arg;
+
+	len = min(sizeof(kbuf) - 1, count);
+
+	if (copy_from_user(kbuf, ubuf, len))
+		return -EFAULT;
+
+	kbuf[len] = '\0';
+	arg = strstrip(kbuf);
+
+	ret = kstrtou32(arg, 0, &res);
+	if (ret)
+		return ret;
+
+	/*
+	 * Values greater than 8 are nonsensical, as they effectively assign
+	 * zero weight to new samples.
+	 */
+	if (res > 8)
+		return -EINVAL;
+
+	WRITE_ONCE(head->adp_ewma_shift, res);
+	return count;
+}
+#endif
 
 static const struct nvme_debugfs_attr nvme_mpath_debugfs_attrs[] = {
+#ifdef CONFIG_NVME_MULTIPATH
+		{"adaptive_ewma_shift", 0600, nvme_adp_ewma_shift_show,
+			nvme_adp_ewma_shift_store},
+#endif
 	{},
 };
 
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 047dd9da9cbf..c7470cc8844e 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -294,10 +294,9 @@ static void nvme_mpath_weight_work(struct work_struct *weight_work)
  * For instance, with EWMA_SHIFT = 3, this assigns 7/8 (~87.5 %) weight to
  * the existing/old ewma and 1/8 (~12.5%) weight to the new sample.
  */
-static inline u64 ewma_update(u64 old, u64 new)
+static inline u64 ewma_update(u64 old, u64 new, u32 ewma_shift)
 {
-	return (old * ((1 << NVME_DEFAULT_ADP_EWMA_SHIFT) - 1)
-			+ new) >> NVME_DEFAULT_ADP_EWMA_SHIFT;
+	return (old * ((1 << ewma_shift) - 1) + new) >> ewma_shift;
 }
 
 static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
@@ -389,7 +388,8 @@ static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
 		if (unlikely(!stat->slat_ns))
 			WRITE_ONCE(stat->slat_ns, avg_lat_ns);
 		else {
-			slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
+			slat_ns = ewma_update(stat->slat_ns, avg_lat_ns,
+					READ_ONCE(head->adp_ewma_shift));
 			WRITE_ONCE(stat->slat_ns, slat_ns);
 		}
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1c1ec2a7f9ad..97de45634f08 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -545,6 +545,7 @@ struct nvme_ns_head {
 	unsigned int		delayed_removal_secs;
 
 	struct nvme_ns * __percpu	*adp_path;
+	u32				adp_ewma_shift;
 
 #define NVME_NSHEAD_DISK_LIVE		0
 #define NVME_NSHEAD_QUEUE_IF_NO_PATH	1
-- 
2.51.0




* [RFC PATCHv5 5/7] nvme-multipath: add debugfs attribute adaptive_weight_timeout
  2025-11-05 10:33 [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
                   ` (3 preceding siblings ...)
  2025-11-05 10:33 ` [RFC PATCHv5 4/7] nvme-multipath: add debugfs attribute adaptive_ewma_shift Nilay Shroff
@ 2025-11-05 10:33 ` Nilay Shroff
  2025-11-05 10:33 ` [RFC PATCHv5 6/7] nvme-multipath: add debugfs attribute adaptive_stat Nilay Shroff
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Nilay Shroff @ 2025-11-05 10:33 UTC (permalink / raw)
  To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce

By default, the adaptive I/O policy accumulates latency samples over a
15-second window. When this window expires, the driver computes the
average latency and updates the smoothed (EWMA) latency value. The
path weight is then recalculated based on this data.

A 15-second window provides a good balance for most workloads, as it
helps smooth out transient latency spikes and produces a more stable
path weight profile. However, some workloads may benefit from faster
or slower adaptation to changing latency conditions.

This commit introduces a new debugfs attribute, adaptive_weight_timeout,
which allows users to configure the path weight calculation interval
based on their workload requirements.
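
The windowing behaviour described above can be modelled in userspace
roughly as follows. This is a simplified sketch, not the driver code:
struct window_sketch and window_add_sample() are hypothetical names, the
statistics are not per-CPU, and resetting the batch after each window is
an assumption made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000ULL	/* stand-in for NSEC_PER_SEC */

/* Hypothetical, simplified stand-in for the per-path statistics. */
struct window_sketch {
	uint64_t batch;			/* latency sum for the current window */
	uint32_t batch_count;		/* samples in the current window */
	uint64_t last_weight_ts;	/* time of the last window expiry */
	uint64_t slat_ns;		/* smoothed (EWMA) latency */
};

/* Returns true when the window expired and slat_ns was refreshed. */
static bool window_add_sample(struct window_sketch *w, uint64_t now_ns,
			      uint64_t lat_ns, uint64_t timeout_ns,
			      uint32_t ewma_shift)
{
	uint64_t avg;

	assert(ewma_shift <= 8);

	w->batch += lat_ns;
	w->batch_count++;

	/* Keep accumulating until the configured window has elapsed. */
	if (now_ns <= w->last_weight_ts ||
	    now_ns - w->last_weight_ts < timeout_ns)
		return false;

	w->last_weight_ts = now_ns;
	avg = w->batch / w->batch_count;

	/* Assumption: start a fresh batch for the next window. */
	w->batch = 0;
	w->batch_count = 0;

	if (!w->slat_ns)
		w->slat_ns = avg;	/* the first window seeds the EWMA */
	else
		w->slat_ns = (w->slat_ns * ((1ULL << ewma_shift) - 1) + avg)
				>> ewma_shift;
	return true;
}
```

A shorter timeout makes the sketch refresh slat_ns more often from fewer
samples, while a longer one averages more samples per update.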

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/core.c      |  1 +
 drivers/nvme/host/debugfs.c   | 40 ++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/multipath.c |  7 ++++--
 drivers/nvme/host/nvme.h      |  1 +
 4 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 43b9b0d6cbdf..d3828c4812fc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3915,6 +3915,7 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
 	head->rotational = info->is_rotational;
 #ifdef CONFIG_NVME_MULTIPATH
 	head->adp_ewma_shift = NVME_DEFAULT_ADP_EWMA_SHIFT;
+	head->adp_weight_timeout = NVME_DEFAULT_ADP_WEIGHT_TIMEOUT;
 #endif
 	ratelimit_state_init(&head->rs_nuse, 5 * HZ, 1);
 	ratelimit_set_flags(&head->rs_nuse, RATELIMIT_MSG_ON_RELEASE);
diff --git a/drivers/nvme/host/debugfs.c b/drivers/nvme/host/debugfs.c
index e3c37041e8f2..e382fa411b13 100644
--- a/drivers/nvme/host/debugfs.c
+++ b/drivers/nvme/host/debugfs.c
@@ -146,12 +146,50 @@ static ssize_t nvme_adp_ewma_shift_store(void *data, const char __user *ubuf,
 	WRITE_ONCE(head->adp_ewma_shift, res);
 	return count;
 }
+
+static int nvme_adp_weight_timeout_show(void *data, struct seq_file *m)
+{
+	struct nvme_ns_head *head = data;
+
+	seq_printf(m, "%llu\n",
+		div_u64(READ_ONCE(head->adp_weight_timeout), NSEC_PER_SEC));
+	return 0;
+}
+
+static ssize_t nvme_adp_weight_timeout_store(void *data,
+		const char __user *ubuf,
+		size_t count, loff_t *ppos)
+{
+	struct nvme_ns_head *head = data;
+	char kbuf[8];
+	u32 res;
+	int ret;
+	size_t len;
+	char *arg;
+
+	len = min(sizeof(kbuf) - 1, count);
+
+	if (copy_from_user(kbuf, ubuf, len))
+		return -EFAULT;
+
+	kbuf[len] = '\0';
+	arg = strstrip(kbuf);
+
+	ret = kstrtou32(arg, 0, &res);
+	if (ret)
+		return ret;
+
+	WRITE_ONCE(head->adp_weight_timeout, res * NSEC_PER_SEC);
+	return count;
+}
 #endif
 
 static const struct nvme_debugfs_attr nvme_mpath_debugfs_attrs[] = {
 #ifdef CONFIG_NVME_MULTIPATH
-		{"adaptive_ewma_shift", 0600, nvme_adp_ewma_shift_show,
+	{"adaptive_ewma_shift", 0600, nvme_adp_ewma_shift_show,
 			nvme_adp_ewma_shift_store},
+	{"adaptive_weight_timeout", 0600, nvme_adp_weight_timeout_show,
+			nvme_adp_weight_timeout_store},
 #endif
 	{},
 };
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index c7470cc8844e..e70a7d5cf036 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -362,8 +362,11 @@ static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
 	stat->batch_count++;
 	stat->nr_samples++;
 
-	if (now > stat->last_weight_ts &&
-	    (now - stat->last_weight_ts) >= NVME_DEFAULT_ADP_WEIGHT_TIMEOUT) {
+	if (now > stat->last_weight_ts) {
+		u64 timeout = READ_ONCE(head->adp_weight_timeout);
+
+		if ((now - stat->last_weight_ts) < timeout)
+			return;
 
 		stat->last_weight_ts = now;
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 97de45634f08..53d868cccbeb 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -546,6 +546,7 @@ struct nvme_ns_head {
 
 	struct nvme_ns * __percpu	*adp_path;
 	u32				adp_ewma_shift;
+	u64				adp_weight_timeout;
 
 #define NVME_NSHEAD_DISK_LIVE		0
 #define NVME_NSHEAD_QUEUE_IF_NO_PATH	1
-- 
2.51.0




* [RFC PATCHv5 6/7] nvme-multipath: add debugfs attribute adaptive_stat
  2025-11-05 10:33 [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
                   ` (4 preceding siblings ...)
  2025-11-05 10:33 ` [RFC PATCHv5 5/7] nvme-multipath: add debugfs attribute adaptive_weight_timeout Nilay Shroff
@ 2025-11-05 10:33 ` Nilay Shroff
  2025-11-05 10:33 ` [RFC PATCHv5 7/7] nvme-multipath: add documentation for adaptive I/O policy Nilay Shroff
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Nilay Shroff @ 2025-11-05 10:33 UTC (permalink / raw)
  To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce

This commit introduces a new debugfs attribute, "adaptive_stat", under
both per-path and head debugfs directories (defined under
/sys/kernel/debug/block/). This attribute provides visibility into the
internal state of the adaptive I/O policy to aid in debugging and
performance analysis.

For per-path entries, "adaptive_stat" reports the corresponding path
statistics such as I/O weight, selection count, processed samples, and
ignored samples.

For head entries, it reports per-CPU statistics for each reachable path,
including I/O weight, remaining credit, path score, smoothed (EWMA)
latency, selection count, processed samples, and ignored samples.

These additions enhance observability of the adaptive I/O path selection
behavior and help diagnose imbalance or instability in multipath
performance.
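
For the per-path file, the per-CPU weights are folded into a single
value by averaging over the CPUs that currently hold a non-zero weight,
rounded to the closest integer. A minimal userspace sketch of that
aggregation, assuming a plain array in place of the per-CPU data
(avg_nonzero_weight() is a hypothetical name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Average the weights of all CPUs that have a non-zero weight, rounding
 * to the closest integer. Returns 0 when no CPU has a weight yet.
 */
static uint32_t avg_nonzero_weight(const uint32_t *weights, size_t ncpus)
{
	uint64_t sum = 0;
	uint32_t n = 0;
	size_t i;

	assert(weights || !ncpus);

	for (i = 0; i < ncpus; i++) {
		sum += weights[i];
		if (weights[i])
			n++;
	}

	if (!n)
		return 0;

	/* Round to closest, as DIV_U64_ROUND_CLOSEST() does in the patch. */
	return (uint32_t)((sum + n / 2) / n);
}
```

CPUs that never submitted I/O on the path keep a zero weight and are
left out of the divisor, so they do not drag the reported average down.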

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/debugfs.c | 113 ++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)

diff --git a/drivers/nvme/host/debugfs.c b/drivers/nvme/host/debugfs.c
index e382fa411b13..28de4a8e2333 100644
--- a/drivers/nvme/host/debugfs.c
+++ b/drivers/nvme/host/debugfs.c
@@ -182,6 +182,115 @@ static ssize_t nvme_adp_weight_timeout_store(void *data,
 	WRITE_ONCE(head->adp_weight_timeout, res * NSEC_PER_SEC);
 	return count;
 }
+
+static void *nvme_mpath_adp_stat_start(struct seq_file *m, loff_t *pos)
+{
+	struct nvme_ns *ns;
+	struct nvme_debugfs_ctx *ctx = m->private;
+	struct nvme_ns_head *head = ctx->data;
+
+	/* Remember srcu index, so we can unlock later. */
+	ctx->srcu_idx = srcu_read_lock(&head->srcu);
+	ns = list_first_or_null_rcu(&head->list, struct nvme_ns, siblings);
+
+	while (*pos && ns) {
+		ns = list_next_or_null_rcu(&head->list, &ns->siblings,
+				struct nvme_ns, siblings);
+		(*pos)--;
+	}
+
+	return ns;
+}
+
+static void *nvme_mpath_adp_stat_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct nvme_ns *ns = v;
+	struct nvme_debugfs_ctx *ctx = m->private;
+	struct nvme_ns_head *head = ctx->data;
+
+	(*pos)++;
+
+	return list_next_or_null_rcu(&head->list, &ns->siblings,
+			struct nvme_ns, siblings);
+}
+
+static void nvme_mpath_adp_stat_stop(struct seq_file *m, void *v)
+{
+	struct nvme_debugfs_ctx *ctx = m->private;
+	struct nvme_ns_head *head = ctx->data;
+	int srcu_idx = ctx->srcu_idx;
+
+	srcu_read_unlock(&head->srcu, srcu_idx);
+}
+
+static int nvme_mpath_adp_stat_show(struct seq_file *m, void *v)
+{
+	int i, cpu;
+	struct nvme_path_stat *stat;
+	struct nvme_ns *ns = v;
+
+	seq_printf(m, "%s:\n", ns->disk->disk_name);
+	for_each_online_cpu(cpu) {
+		seq_printf(m, "cpu %d : ", cpu);
+		for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
+			stat = &per_cpu_ptr(ns->info, cpu)[i].stat;
+			seq_printf(m, "%u %u %llu %llu %llu %llu %llu ",
+				stat->weight, stat->credit, stat->score,
+				stat->slat_ns, stat->sel,
+				stat->nr_samples, stat->nr_ignored);
+		}
+		seq_putc(m, '\n');
+	}
+	return 0;
+}
+
+static const struct seq_operations nvme_mpath_adp_stat_seq_ops = {
+	.start = nvme_mpath_adp_stat_start,
+	.next  = nvme_mpath_adp_stat_next,
+	.stop  = nvme_mpath_adp_stat_stop,
+	.show  = nvme_mpath_adp_stat_show
+};
+
+static void adp_stat_read_all(struct nvme_ns *ns, struct nvme_path_stat *batch)
+{
+	int i, cpu;
+	u32 ncpu[NVME_NUM_STAT_GROUPS] = {0};
+	struct nvme_path_stat *stat;
+
+	for_each_online_cpu(cpu) {
+		for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
+			stat = &per_cpu_ptr(ns->info, cpu)[i].stat;
+			batch[i].sel += stat->sel;
+			batch[i].nr_samples += stat->nr_samples;
+			batch[i].nr_ignored += stat->nr_ignored;
+			batch[i].weight += stat->weight;
+			if (stat->weight)
+				ncpu[i]++;
+		}
+	}
+
+	for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
+		if (!ncpu[i])
+			continue;
+		batch[i].weight = DIV_U64_ROUND_CLOSEST(batch[i].weight,
+				ncpu[i]);
+	}
+}
+
+static int nvme_ns_adp_stat_show(void *data, struct seq_file *m)
+{
+	int i;
+	struct nvme_path_stat stat[NVME_NUM_STAT_GROUPS] = {0};
+	struct nvme_ns *ns = (struct nvme_ns *)data;
+
+	adp_stat_read_all(ns, stat);
+	for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
+		seq_printf(m, "%u %llu %llu %llu ",
+			stat[i].weight, stat[i].sel,
+			stat[i].nr_samples, stat[i].nr_ignored);
+	}
+	return 0;
+}
 #endif
 
 static const struct nvme_debugfs_attr nvme_mpath_debugfs_attrs[] = {
@@ -190,11 +299,15 @@ static const struct nvme_debugfs_attr nvme_mpath_debugfs_attrs[] = {
 			nvme_adp_ewma_shift_store},
 	{"adaptive_weight_timeout", 0600, nvme_adp_weight_timeout_show,
 			nvme_adp_weight_timeout_store},
+	{"adaptive_stat", 0400, .seq_ops = &nvme_mpath_adp_stat_seq_ops},
 #endif
 	{},
 };
 
 static const struct nvme_debugfs_attr nvme_ns_debugfs_attrs[] = {
+#ifdef CONFIG_NVME_MULTIPATH
+	{"adaptive_stat", 0400, nvme_ns_adp_stat_show},
+#endif
 	{},
 };
 
-- 
2.51.0




* [RFC PATCHv5 7/7] nvme-multipath: add documentation for adaptive I/O policy
  2025-11-05 10:33 [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
                   ` (5 preceding siblings ...)
  2025-11-05 10:33 ` [RFC PATCHv5 6/7] nvme-multipath: add debugfs attribute adaptive_stat Nilay Shroff
@ 2025-11-05 10:33 ` Nilay Shroff
  2025-12-09 13:56 ` [RFC PATCHv5 0/7] nvme-multipath: introduce " Nilay Shroff
  2025-12-12 12:08 ` Sagi Grimberg
  8 siblings, 0 replies; 28+ messages in thread
From: Nilay Shroff @ 2025-11-05 10:33 UTC (permalink / raw)
  To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce

Update the nvme-multipath documentation to describe the adaptive I/O
policy, its behavior, and when it is suitable for use.

Suggested-by: Guixin Liu <kanie@linux.alibaba.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 Documentation/admin-guide/nvme-multipath.rst | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/Documentation/admin-guide/nvme-multipath.rst b/Documentation/admin-guide/nvme-multipath.rst
index 97ca1ccef459..7befaab01cf5 100644
--- a/Documentation/admin-guide/nvme-multipath.rst
+++ b/Documentation/admin-guide/nvme-multipath.rst
@@ -70,3 +70,22 @@ When to use the queue-depth policy:
   1. High load with small I/Os: Effectively balances load across paths when
      the load is high, and I/O operations consist of small, relatively
      fixed-sized requests.
+
+Adaptive
+--------
+
+The adaptive policy manages I/O requests based on path latency. It periodically
+calculates a weight for each path and distributes I/O accordingly. Paths with
+higher latency receive lower weights, resulting in fewer I/O requests being sent
+to them, while paths with lower latency handle a proportionally larger share of
+the I/O load.
+
+When to use the adaptive policy
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. Homogeneous Path Performance: Utilizes all available paths efficiently when
+   their performance characteristics (e.g., latency, bandwidth) are similar.
+
+2. Heterogeneous Path Performance: Dynamically distributes I/O based on per-path
+   performance characteristics. Paths with lower latency receive a higher share
+   of I/O compared to those with higher latency.
-- 
2.51.0




* Re: [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy
  2025-11-05 10:33 [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
                   ` (6 preceding siblings ...)
  2025-11-05 10:33 ` [RFC PATCHv5 7/7] nvme-multipath: add documentation for adaptive I/O policy Nilay Shroff
@ 2025-12-09 13:56 ` Nilay Shroff
  2025-12-12 12:08 ` Sagi Grimberg
  8 siblings, 0 replies; 28+ messages in thread
From: Nilay Shroff @ 2025-12-09 13:56 UTC (permalink / raw)
  To: Keith Busch
  Cc: hare, hch, sagi, dwagner, axboe, kanie, gjoyce,
	linux-nvme@lists.infradead.org

Hi Keith,

Just a gentle ping on this one...

It has been reviewed and has been ready for some time now, and I wanted to check if you
had any remaining feedback or concerns, or if you could consider pulling it
into nvme-next.

Link to the latest version for convenience:
https://lore.kernel.org/all/20251105103347.86059-1-nilay@linux.ibm.com/

Please let me know if there's anything further needed on my side.

Thanks,
--Nilay

On 11/5/25 4:03 PM, Nilay Shroff wrote:
> Hi,
> 
> This series introduces a new adaptive I/O policy for NVMe native
> multipath. Existing policies such as numa, round-robin, and queue-depth
> are static and do not adapt to real-time transport performance. The numa
> selects the path closest to the NUMA node of the current CPU, optimizing
> memory and path locality, but ignores actual path performance. The
> round-robin distributes I/O evenly across all paths, providing fairness
> but not performance awareness. The queue-depth reacts to instantaneous
> queue occupancy, avoiding heavily loaded paths, but does not account for
> actual latency, throughput, or link speed.
> 
> The new adaptive policy addresses these gaps by selecting paths dynamically
> based on measured I/O latency for both PCIe and fabrics. Latency is
> derived by passively sampling I/O completions. Each path is assigned a
> weight proportional to its latency score, and I/Os are then forwarded
> accordingly. As conditions change (e.g. latency spikes, bandwidth
> differences), path weights are updated, automatically steering traffic
> toward better-performing paths.
> 
> Early results show reduced tail latency under mixed workloads and
> improved throughput by exploiting higher-speed links more effectively.
> For example, with NVMf/TCP using two paths (one throttled with ~30 ms
> delay), fio results with random read/write/rw workloads (direct I/O)
> showed:
> 
>         numa         round-robin   queue-depth  adaptive
>         -----------  -----------   -----------  ---------
> READ:   50.0 MiB/s   105 MiB/s     230 MiB/s    350 MiB/s
> WRITE:  65.9 MiB/s   125 MiB/s     385 MiB/s    446 MiB/s
> RW:     R:30.6 MiB/s R:56.5 MiB/s  R:122 MiB/s  R:175 MiB/s
>         W:30.7 MiB/s W:56.5 MiB/s  W:122 MiB/s  W:175 MiB/s
> 
> This patchset includes a total of 7 patches:
> [PATCH 1/7] block: expose blk_stat_{enable,disable}_accounting()
>   - Make blk_stat APIs available to block drivers.
>   - Needed for per-path latency measurement in adaptive policy.
> 
> [PATCH 2/7] nvme-multipath: add adaptive I/O policy
>   - Implement path scoring based on latency (EWMA).
>   - Distribute I/O proportionally to per-path weights.
> 
> [PATCH 3/7] nvme: add generic debugfs support
>   - Introduce generic debugfs support for NVMe module
> 
> [PATCH 4/7] nvme-multipath: add debugfs attribute adaptive_ewma_shift
>   - Adds a debugfs attribute to control ewma shift
> 
> [PATCH 5/7] nvme-multipath: add debugfs attribute adaptive_weight_timeout
>   - Adds a debugfs attribute to control path weight calculation timeout
> 
> [PATCH 6/7] nvme-multipath: add debugfs attribute adaptive_stat
>   - Add “adaptive_stat” under per-path and head debugfs directories to
>     expose adaptive policy state and statistics.
> 
> [PATCH 7/7] nvme-multipath: add documentation for adaptive I/O policy
>   - Includes documentation for adaptive I/O multipath policy.
> 
> As usual, feedback and suggestions are most welcome!
> 
> Thanks!
> 
> Changes from v4:
>   - Added patch #7 which includes the documentation for adaptive I/O
>     policy. (Guixin Liu)
> Link to v4: https://lore.kernel.org/all/20251104104533.138481-1-nilay@linux.ibm.com/    
> 
> Changes from v3:
>   - Update the adaptive APIs name (which actually enable/disable
>     adaptive policy) to reflect the actual work it does. Also removed
>     the misleading use of "current_path" from the adaptive policy code
>     (Hannes Reinecke)
>   - Move adaptive_ewma_shift and adaptive_weight_timeout attributes from
>     sysfs to debugfs (Hannes Reinecke)
> Link to v3: https://lore.kernel.org/all/20251027092949.961287-1-nilay@linux.ibm.com/
> 
> Changes from v2:
>   - Added a new patch to allow user to configure EWMA shift
>     through sysfs (Hannes Reinecke)
>   - Added a new patch to allow user to configure path weight
>     calculation timeout (Hannes Reinecke)
>   - Distinguish between read/write and other commands (e.g.
>     admin command) and calculate path weight for other commands
>     which is separate from read/write weight. (Hannes Reinecke)
>   - Normalize per-path weight in the range from 0-128 instead
>     of 0-100 (Hannes Reinecke)
>   - Restructure and optimize adaptive I/O forwarding code to use
>     one loop instead of two (Hannes Reinecke)
> Link to v2: https://lore.kernel.org/all/20251009100608.1699550-1-nilay@linux.ibm.com/
> 
> Changes from v1:
>   - Ensure that the completion of I/O occurs on the same CPU as the
>     submitting I/O CPU (Hannes Reinecke)
>   - Remove adapter link speed from the path weight calculation
>     (Hannes Reinecke)
>   - Add adaptive I/O stat under debugfs instead of current sysfs
>     (Hannes Reinecke)
>   - Move path weight calculation to a workqueue from IO completion
>     code path
> Link to v1: https://lore.kernel.org/all/20250921111234.863853-1-nilay@linux.ibm.com/
> 
> Nilay Shroff (7):
>   block: expose blk_stat_{enable,disable}_accounting() to drivers
>   nvme-multipath: add support for adaptive I/O policy
>   nvme: add generic debugfs support
>   nvme-multipath: add debugfs attribute adaptive_ewma_shift
>   nvme-multipath: add debugfs attribute adaptive_weight_timeout
>   nvme-multipath: add debugfs attribute adaptive_stat
>   nvme-multipath: add documentation for adaptive I/O policy
> 
>  Documentation/admin-guide/nvme-multipath.rst |  19 +
>  block/blk-stat.h                             |   4 -
>  drivers/nvme/host/Makefile                   |   2 +-
>  drivers/nvme/host/core.c                     |  22 +-
>  drivers/nvme/host/debugfs.c                  | 335 +++++++++++++++
>  drivers/nvme/host/ioctl.c                    |  31 +-
>  drivers/nvme/host/multipath.c                | 430 ++++++++++++++++++-
>  drivers/nvme/host/nvme.h                     |  86 +++-
>  drivers/nvme/host/pr.c                       |   6 +-
>  drivers/nvme/host/sysfs.c                    |   2 +-
>  include/linux/blk-mq.h                       |   4 +
>  11 files changed, 913 insertions(+), 28 deletions(-)
>  create mode 100644 drivers/nvme/host/debugfs.c
> 




* Re: [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy
  2025-11-05 10:33 [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
                   ` (7 preceding siblings ...)
  2025-12-09 13:56 ` [RFC PATCHv5 0/7] nvme-multipath: introduce " Nilay Shroff
@ 2025-12-12 12:08 ` Sagi Grimberg
  2025-12-13  8:22   ` Nilay Shroff
  8 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-12 12:08 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme; +Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce



On 05/11/2025 12:33, Nilay Shroff wrote:
> Hi,
>
> This series introduces a new adaptive I/O policy for NVMe native
> multipath. Existing policies such as numa, round-robin, and queue-depth
> are static and do not adapt to real-time transport performance.

It can be argued that queue-depth is a proxy for latency.

>   The numa
> selects the path closest to the NUMA node of the current CPU, optimizing
> memory and path locality, but ignores actual path performance. The
> round-robin distributes I/O evenly across all paths, providing fairness
> but not performance awareness. The queue-depth reacts to instantaneous
> queue occupancy, avoiding heavily loaded paths, but does not account for
> actual latency, throughput, or link speed.
>
> The new adaptive policy addresses these gaps by selecting paths dynamically
> based on measured I/O latency for both PCIe and fabrics.

Adaptive is not a good name. Maybe weighted-latency or wplat (weighted path
latency) or something like that.

>   Latency is
> derived by passively sampling I/O completions. Each path is assigned a
> weight proportional to its latency score, and I/Os are then forwarded
> accordingly. As conditions change (e.g. latency spikes, bandwidth
> differences), path weights are updated, automatically steering traffic
> toward better-performing paths.
>
> Early results show reduced tail latency under mixed workloads and
> improved throughput by exploiting higher-speed links more effectively.
> For example, with NVMf/TCP using two paths (one throttled with ~30 ms
> delay), fio results with random read/write/rw workloads (direct I/O)
> showed:
>
>          numa         round-robin   queue-depth  adaptive
>          -----------  -----------   -----------  ---------
> READ:   50.0 MiB/s   105 MiB/s     230 MiB/s    350 MiB/s
> WRITE:  65.9 MiB/s   125 MiB/s     385 MiB/s    446 MiB/s
> RW:     R:30.6 MiB/s R:56.5 MiB/s  R:122 MiB/s  R:175 MiB/s
>          W:30.7 MiB/s W:56.5 MiB/s  W:122 MiB/s  W:175 MiB/s

Seems like a nice gain.
Can you please test for the normal symmetric paths case? Would like
to see the trade-off...



* Re: [RFC PATCHv5 1/7] block: expose blk_stat_{enable,disable}_accounting() to drivers
  2025-11-05 10:33 ` [RFC PATCHv5 1/7] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
@ 2025-12-12 12:16   ` Sagi Grimberg
  0 siblings, 0 replies; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-12 12:16 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme; +Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce



On 05/11/2025 12:33, Nilay Shroff wrote:
> The functions blk_stat_enable_accounting() and
> blk_stat_disable_accounting() are currently exported, but their
> prototypes are only defined in a private header. Move these prototypes
> into a common header so that block drivers can directly use these APIs.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>



* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2025-11-05 10:33 ` [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy Nilay Shroff
@ 2025-12-12 13:04   ` Sagi Grimberg
  2025-12-13  7:27     ` Nilay Shroff
  0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-12 13:04 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme; +Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce



On 05/11/2025 12:33, Nilay Shroff wrote:
> This commit introduces a new I/O policy named "adaptive". Users can
> configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
> subsystemX/iopolicy"
>
> The adaptive policy dynamically distributes I/O based on measured
> completion latency. The main idea is to calculate latency for each path,
> derive a weight, and then proportionally forward I/O according to those
> weights.
>
> To ensure scalability, path latency is measured per-CPU. Each CPU
> maintains its own statistics, and I/O forwarding uses these per-CPU
> values.

So a given cpu would select path-a vs. another cpu that may select path-b?
How does that play with fewer queues than cpu cores? What happens to cores
that have low traffic?

> Every ~15 seconds, a simple average latency of per-CPU batched
> samples is computed and fed into an Exponentially Weighted Moving
> Average (EWMA):

I suggest having the iopolicy name reflect EWMA. Maybe "ewma-lat"?

>
> avg_latency = div_u64(batch, batch_count);
> new_ewma_latency = (prev_ewma_latency * (WEIGHT-1) + avg_latency)/WEIGHT
>
> With WEIGHT = 8, this assigns 7/8 (~87.5%) weight to the previous
> latency value and 1/8 (~12.5%) to the most recent latency. This
> smoothing reduces jitter, adapts quickly to changing conditions,
> avoids storing historical samples, and works well for both low and
> high I/O rates.

This weight was based on empirical measurements?
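
For reference, the smoothing step quoted above can be tried out in user space.
The following is only a minimal sketch, not the patch's code: the function name
ewma_update_sketch() and the explicit shift parameter are hypothetical (the
patch uses a fixed NVME_DEFAULT_ADP_EWMA_SHIFT), and a shift of 3 is assumed to
correspond to WEIGHT = 8.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space mirror of the quoted EWMA formula:
 *   new = (old * (WEIGHT - 1) + sample) / WEIGHT, with WEIGHT = 1 << shift.
 * The shift is a parameter here purely for illustration.
 */
static uint64_t ewma_update_sketch(uint64_t old, uint64_t sample,
				   unsigned int shift)
{
	/* Keep the shift small so the multiplication stays well defined. */
	assert(shift > 0 && shift < 32);

	return (old * ((1ULL << shift) - 1) + sample) >> shift;
}
```

With shift = 3, a previous value of 800 ns and a new average of 1600 ns give
(800 * 7 + 1600) / 8 = 900 ns, i.e. the smoothed value moves one eighth of the
way toward the new sample.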

>   Path weights are then derived from the smoothed (EWMA)
> latency as follows (example with two paths A and B):
>
>      path_A_score = NSEC_PER_SEC / path_A_ewma_latency
>      path_B_score = NSEC_PER_SEC / path_B_ewma_latency
>      total_score  = path_A_score + path_B_score
>
>      path_A_weight = (path_A_score * 64) / total_score
>      path_B_weight = (path_B_score * 64) / total_score

What happens to R/W mixed workloads? What happens when the I/O pattern
has a distribution of block sizes?

I think that in order to understand how a non-trivial path selector works,
we need thorough testing across a variety of I/O patterns.
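
To make the score-to-weight derivation concrete, here is a small user-space
sketch. It assumes the scale factor of 64 and the minimum weight of 1 that
appear in the quoted nvme_mpath_weight_work(); the names path_score_sketch(),
path_weights_sketch() and the SKETCH_* constants are hypothetical and exist
only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC	1000000000ULL
#define SKETCH_WEIGHT_SCALE	64	/* assumed from the quoted code */

/* Score is the inverse of the smoothed latency; 0 means "no data yet". */
static uint64_t path_score_sketch(uint64_t slat_ns)
{
	return slat_ns ? SKETCH_NSEC_PER_SEC / slat_ns : 0;
}

/*
 * Fill weight[i] for n paths from their smoothed latencies. If no path has
 * a score, the weights are left untouched.
 */
static void path_weights_sketch(const uint64_t *slat_ns, uint32_t *weight,
				size_t n)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += path_score_sketch(slat_ns[i]);
	if (!total)
		return;

	for (i = 0; i < n; i++) {
		uint32_t w = (uint32_t)(path_score_sketch(slat_ns[i]) *
					SKETCH_WEIGHT_SCALE / total);

		/* Never let a path drop to 0, so it is still probed. */
		weight[i] = w ? w : 1;
	}
}
```

For two paths with smoothed latencies of 1 ms and 3 ms, the scores are 1000
and 333, which yields weights of 48 and 15. Integer truncation means the
weights need not sum to exactly 64.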

>
> where:
>    - path_X_ewma_latency is the smoothed latency of a path in nanoseconds
>    - NSEC_PER_SEC is used as a scaling factor since valid latencies
>      are < 1 second
>    - weights are normalized to a 0–64 scale across all paths.
>
> Path credits are refilled based on this weight, with one credit
> consumed per I/O. When all credits are consumed, the credits are
> refilled again based on the current weight. This ensures that I/O is
> distributed across paths proportionally to their calculated weight.
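
The credit scheme described above can be sketched in user space as follows.
This is not the patch's nvme_adaptive_path(), which is not shown in this part
of the thread; struct sketch_path and pick_path_sketch() are hypothetical, and
the simple "first path with credit wins" scan is an assumption made to keep the
example short.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-path state: a weight and the credits left this round. */
struct sketch_path {
	uint32_t weight;
	uint32_t credit;
};

/*
 * Return the index of the path to use for the next I/O, or -1 if no path
 * has a non-zero weight. One credit is consumed per I/O; once every path
 * is out of credit, all credits are refilled from the current weights.
 */
static int pick_path_sketch(struct sketch_path *p, size_t n)
{
	size_t i;
	int pass;

	for (pass = 0; pass < 2; pass++) {
		for (i = 0; i < n; i++) {
			if (p[i].credit) {
				p[i].credit--;
				return (int)i;
			}
		}
		/* All paths exhausted: refill and scan once more. */
		for (i = 0; i < n; i++)
			p[i].credit = p[i].weight;
	}
	return -1;
}
```

With weights 3 and 1, eight consecutive picks send six I/Os to the first path
and two to the second. This simple scan drains one path before moving to the
next, so the distribution is proportional only over a full refill cycle.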
>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
>   drivers/nvme/host/core.c      |  15 +-
>   drivers/nvme/host/ioctl.c     |  31 ++-
>   drivers/nvme/host/multipath.c | 425 ++++++++++++++++++++++++++++++++--
>   drivers/nvme/host/nvme.h      |  74 +++++-
>   drivers/nvme/host/pr.c        |   6 +-
>   drivers/nvme/host/sysfs.c     |   2 +-
>   6 files changed, 530 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fa4181d7de73..47f375c63d2d 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -672,6 +672,9 @@ static void nvme_free_ns_head(struct kref *ref)
>   	cleanup_srcu_struct(&head->srcu);
>   	nvme_put_subsystem(head->subsys);
>   	kfree(head->plids);
> +#ifdef CONFIG_NVME_MULTIPATH
> +	free_percpu(head->adp_path);
> +#endif
>   	kfree(head);
>   }
>   
> @@ -689,6 +692,7 @@ static void nvme_free_ns(struct kref *kref)
>   {
>   	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
>   
> +	nvme_free_ns_stat(ns);
>   	put_disk(ns->disk);
>   	nvme_put_ns_head(ns->head);
>   	nvme_put_ctrl(ns->ctrl);
> @@ -4137,6 +4141,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>   	if (nvme_init_ns_head(ns, info))
>   		goto out_cleanup_disk;
>   
> +	if (nvme_alloc_ns_stat(ns))
> +		goto out_unlink_ns;
> +
>   	/*
>   	 * If multipathing is enabled, the device name for all disks and not
>   	 * just those that represent shared namespaces needs to be based on the
> @@ -4161,7 +4168,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>   	}
>   
>   	if (nvme_update_ns_info(ns, info))
> -		goto out_unlink_ns;
> +		goto out_free_ns_stat;
>   
>   	mutex_lock(&ctrl->namespaces_lock);
>   	/*
> @@ -4170,7 +4177,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>   	 */
>   	if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
>   		mutex_unlock(&ctrl->namespaces_lock);
> -		goto out_unlink_ns;
> +		goto out_free_ns_stat;
>   	}
>   	nvme_ns_add_to_ctrl_list(ns);
>   	mutex_unlock(&ctrl->namespaces_lock);
> @@ -4201,6 +4208,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>   	list_del_rcu(&ns->list);
>   	mutex_unlock(&ctrl->namespaces_lock);
>   	synchronize_srcu(&ctrl->srcu);
> +out_free_ns_stat:
> +	nvme_free_ns_stat(ns);
>    out_unlink_ns:
>   	mutex_lock(&ctrl->subsys->lock);
>   	list_del_rcu(&ns->siblings);
> @@ -4244,6 +4253,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>   	 */
>   	synchronize_srcu(&ns->head->srcu);
>   
> +	nvme_mpath_cancel_adaptive_path_weight_work(ns);
> +

I personally think that the check on path stats should be done at the
call site and not in the function itself.

>   	/* wait for concurrent submissions */
>   	if (nvme_mpath_clear_current_path(ns))
>   		synchronize_srcu(&ns->head->srcu);
> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> index c212fa952c0f..759d147d9930 100644
> --- a/drivers/nvme/host/ioctl.c
> +++ b/drivers/nvme/host/ioctl.c
> @@ -700,18 +700,29 @@ static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
>   int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>   		unsigned int cmd, unsigned long arg)
>   {
> +	u8 opcode;
>   	struct nvme_ns_head *head = bdev->bd_disk->private_data;
>   	bool open_for_write = mode & BLK_OPEN_WRITE;
>   	void __user *argp = (void __user *)arg;
>   	struct nvme_ns *ns;
>   	int srcu_idx, ret = -EWOULDBLOCK;
>   	unsigned int flags = 0;
> +	unsigned int op_type = NVME_STAT_OTHER;
>   
>   	if (bdev_is_partition(bdev))
>   		flags |= NVME_IOCTL_PARTITION;
>   
> +	if (cmd == NVME_IOCTL_SUBMIT_IO) {
> +		if (get_user(opcode, (u8 *)argp))
> +			return -EFAULT;
> +		if (opcode == nvme_cmd_write)
> +			op_type = NVME_STAT_WRITE;
> +		else if (opcode == nvme_cmd_read)
> +			op_type = NVME_STAT_READ;
> +	}
> +
>   	srcu_idx = srcu_read_lock(&head->srcu);
> -	ns = nvme_find_path(head);
> +	ns = nvme_find_path(head, op_type);

Perhaps it would be easier to review if you split passing the opcode to
nvme_find_path() into a prep patch (explaining that the new iopolicy will
leverage it).

>   	if (!ns)
>   		goto out_unlock;
>   
> @@ -733,6 +744,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>   long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>   		unsigned long arg)
>   {
> +	u8 opcode;
>   	bool open_for_write = file->f_mode & FMODE_WRITE;
>   	struct cdev *cdev = file_inode(file)->i_cdev;
>   	struct nvme_ns_head *head =
> @@ -740,9 +752,19 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>   	void __user *argp = (void __user *)arg;
>   	struct nvme_ns *ns;
>   	int srcu_idx, ret = -EWOULDBLOCK;
> +	unsigned int op_type = NVME_STAT_OTHER;
> +
> +	if (cmd == NVME_IOCTL_SUBMIT_IO) {
> +		if (get_user(opcode, (u8 *)argp))
> +			return -EFAULT;
> +		if (opcode == nvme_cmd_write)
> +			op_type = NVME_STAT_WRITE;
> +		else if (opcode == nvme_cmd_read)
> +			op_type = NVME_STAT_READ;
> +	}
>   
>   	srcu_idx = srcu_read_lock(&head->srcu);
> -	ns = nvme_find_path(head);
> +	ns = nvme_find_path(head, op_type);
>   	if (!ns)
>   		goto out_unlock;
>   
> @@ -762,7 +784,10 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
>   	struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>   	struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
>   	int srcu_idx = srcu_read_lock(&head->srcu);
> -	struct nvme_ns *ns = nvme_find_path(head);
> +	const struct nvme_uring_cmd *cmd = io_uring_sqe_cmd(ioucmd->sqe);
> +	struct nvme_ns *ns = nvme_find_path(head,
> +			READ_ONCE(cmd->opcode) & 1 ?
> +			NVME_STAT_WRITE : NVME_STAT_READ);
>   	int ret = -EINVAL;
>   
>   	if (ns)
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 543e17aead12..55dc28375662 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -6,6 +6,9 @@
>   #include <linux/backing-dev.h>
>   #include <linux/moduleparam.h>
>   #include <linux/vmalloc.h>
> +#include <linux/blk-mq.h>
> +#include <linux/math64.h>
> +#include <linux/rculist.h>
>   #include <trace/events/block.h>
>   #include "nvme.h"
>   
> @@ -66,9 +69,10 @@ MODULE_PARM_DESC(multipath_always_on,
>   	"create multipath node always except for private namespace with non-unique nsid; note that this also implicitly enables native multipath support");
>   
>   static const char *nvme_iopolicy_names[] = {
> -	[NVME_IOPOLICY_NUMA]	= "numa",
> -	[NVME_IOPOLICY_RR]	= "round-robin",
> -	[NVME_IOPOLICY_QD]      = "queue-depth",
> +	[NVME_IOPOLICY_NUMA]	 = "numa",
> +	[NVME_IOPOLICY_RR]	 = "round-robin",
> +	[NVME_IOPOLICY_QD]       = "queue-depth",
> +	[NVME_IOPOLICY_ADAPTIVE] = "adaptive",
>   };
>   
>   static int iopolicy = NVME_IOPOLICY_NUMA;
> @@ -83,6 +87,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>   		iopolicy = NVME_IOPOLICY_RR;
>   	else if (!strncmp(val, "queue-depth", 11))
>   		iopolicy = NVME_IOPOLICY_QD;
> +	else if (!strncmp(val, "adaptive", 8))
> +		iopolicy = NVME_IOPOLICY_ADAPTIVE;
>   	else
>   		return -EINVAL;
>   
> @@ -198,6 +204,204 @@ void nvme_mpath_start_request(struct request *rq)
>   }
>   EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
>   
> +static void nvme_mpath_weight_work(struct work_struct *weight_work)
> +{
> +	int cpu, srcu_idx;
> +	u32 weight;
> +	struct nvme_ns *ns;
> +	struct nvme_path_stat *stat;
> +	struct nvme_path_work *work = container_of(weight_work,
> +			struct nvme_path_work, weight_work);
> +	struct nvme_ns_head *head = work->ns->head;
> +	int op_type = work->op_type;
> +	u64 total_score = 0;
> +
> +	cpu = get_cpu();
> +
> +	srcu_idx = srcu_read_lock(&head->srcu);
> +	list_for_each_entry_srcu(ns, &head->list, siblings,
> +			srcu_read_lock_held(&head->srcu)) {
> +
> +		stat = &this_cpu_ptr(ns->info)[op_type].stat;
> +		if (!READ_ONCE(stat->slat_ns)) {
> +			stat->score = 0;
> +			continue;
> +		}
> +		/*
> +		 * Compute the path score as the inverse of smoothed
> +		 * latency, scaled by NSEC_PER_SEC. Floating point
> +		 * math is unavailable in the kernel, so fixed-point
> +		 * scaling is used instead. NSEC_PER_SEC is chosen
> +		 * because valid latencies are always < 1 second; longer
> +		 * latencies are ignored.
> +		 */
> +		stat->score = div_u64(NSEC_PER_SEC, READ_ONCE(stat->slat_ns));
> +
> +		/* Compute total score. */
> +		total_score += stat->score;
> +	}
> +
> +	if (!total_score)
> +		goto out;
> +
> +	/*
> +	 * After computing the total score, we derive per-path weight
> +	 * (normalized to the range 0–64). The weight represents the
> +	 * relative share of I/O the path should receive.
> +	 *
> +	 *   - lower smoothed latency -> higher weight
> +	 *   - higher smoothed latency -> lower weight
> +	 *
> +	 * Next, while forwarding I/O, we assign "credits" to each path
> +	 * based on its weight (please also refer to nvme_adaptive_path()):
> +	 *   - Initially, credits = weight.
> +	 *   - Each time an I/O is dispatched on a path, its credits are
> +	 *     decremented proportionally.
> +	 *   - When a path runs out of credits, it becomes temporarily
> +	 *     ineligible until credit is refilled.
> +	 *
> +	 * I/O distribution is therefore governed by available credits,
> +	 * ensuring that over time the proportion of I/O sent to each
> +	 * path matches its weight (and thus its performance).
> +	 */
> +	list_for_each_entry_srcu(ns, &head->list, siblings,
> +			srcu_read_lock_held(&head->srcu)) {
> +
> +		stat = &this_cpu_ptr(ns->info)[op_type].stat;
> +		weight = div_u64(stat->score * 64, total_score);
> +
> +		/*
> +		 * Ensure the path weight never drops below 1. A weight
> +		 * of 0 is used only for newly added paths. During
> +		 * bootstrap, a few I/Os are sent to such paths to
> +		 * establish an initial weight. Enforcing a minimum
> +		 * weight of 1 guarantees that no path is forgotten and
> +		 * that each path is probed at least occasionally.
> +		 */
> +		if (!weight)
> +			weight = 1;
> +
> +		WRITE_ONCE(stat->weight, weight);
> +	}
> +out:
> +	srcu_read_unlock(&head->srcu, srcu_idx);
> +	put_cpu();
> +}
> +
> +/*
> + * Formula to calculate the EWMA (Exponentially Weighted Moving Average):
> + * ewma = (old_ewma * ((1 << EWMA_SHIFT) - 1) + new_sample) >> EWMA_SHIFT
> + * For instance, with EWMA_SHIFT = 3, this assigns 7/8 (~87.5%) weight to
> + * the existing/old ewma and 1/8 (~12.5%) weight to the new sample.
> + */
> +static inline u64 ewma_update(u64 old, u64 new)

It is a calculation function, let's call it calc_ewma_update.
> +{
> +	return (old * ((1 << NVME_DEFAULT_ADP_EWMA_SHIFT) - 1)
> +			+ new) >> NVME_DEFAULT_ADP_EWMA_SHIFT;
> +}
> +
> +static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
> +{
> +	int cpu;
> +	unsigned int op_type;
> +	struct nvme_path_info *info;
> +	struct nvme_path_stat *stat;
> +	u64 now, latency, slat_ns, avg_lat_ns;
> +	struct nvme_ns_head *head = ns->head;
> +
> +	if (list_is_singular(&head->list))
> +		return;
> +
> +	now = ktime_get_ns();
> +	latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
> +	if (!latency)
> +		return;
> +
> +	/*
> +	 * As completion code path is serialized(i.e. no same completion queue
> +	 * update code could run simultaneously on multiple cpu) we can safely
> +	 * access per cpu nvme path stat here from another cpu (in case the
> +	 * completion cpu is different from submission cpu).
> +	 * The only field which could be accessed simultaneously here is the
> +	 * path ->weight which may be accessed by this function as well as I/O
> +	 * submission path during path selection logic and we protect ->weight
> +	 * using READ_ONCE/WRITE_ONCE. Yes this may not be 100% accurate but
> +	 * we also don't need to be so accurate here as the path credit would
> +	 * be anyways refilled, based on path weight, once path consumes all
> +	 * its credits. And we limit path weight/credit to a maximum of 64.
> +	 * Please also refer to nvme_adaptive_path().
> +	 */
> +	cpu = blk_mq_rq_cpu(rq);
> +	op_type = nvme_data_dir(req_op(rq));
> +	info = &per_cpu_ptr(ns->info, cpu)[op_type];

info is really really really confusing and generic and not representative
of what "info" it is used for. Maybe path_lat? Or path_stats? Anything is
better than info.

> +	stat = &info->stat;
> +
> +	/*
> +	 * If latency > ~1s then ignore this sample to prevent EWMA from being
> +	 * skewed by pathological outliers (multi-second waits, controller
> +	 * timeouts etc.). This keeps path scores representative of normal
> +	 * performance and avoids instability from rare spikes. If such high
> +	 * latency is real, ANA state reporting or keep-alive error counters
> +	 * will mark the path unhealthy and remove it from the head node list,
> +	 * so we safely skip such sample here.
> +	 */
> +	if (unlikely(latency > NSEC_PER_SEC)) {
> +		stat->nr_ignored++;
> +		dev_warn_ratelimited(ns->ctrl->device,
> +			"ignoring sample with >1s latency (possible controller stall or timeout)\n");
> +		return;
> +	}
> +
> +	/*
> +	 * Accumulate latency samples and increment the batch count for each
> +	 * ~15 second interval. When the interval expires, compute the simple
> +	 * average latency over that window, then update the smoothed (EWMA)
> +	 * latency. The path weight is recalculated based on this smoothed
> +	 * latency.
> +	 */
> +	stat->batch += latency;
> +	stat->batch_count++;
> +	stat->nr_samples++;
> +
> +	if (now > stat->last_weight_ts &&
> +	    (now - stat->last_weight_ts) >= NVME_DEFAULT_ADP_WEIGHT_TIMEOUT) {
> +
> +		stat->last_weight_ts = now;
> +
> +		/*
> +		 * Find simple average latency for the last epoch (~15 sec
> +		 * interval).
> +		 */
> +		avg_lat_ns = div_u64(stat->batch, stat->batch_count);
> +
> +		/*
> +		 * Calculate smooth/EWMA (Exponentially Weighted Moving Average)
> +		 * latency. EWMA is preferred over simple average latency
> +		 * because it smooths naturally, reduces jitter from sudden
> +		 * spikes, and adapts faster to changing conditions. It also
> +		 * avoids storing historical samples, and works well for both
> +		 * slow and fast I/O rates.
> +		 * Formula:
> +		 * slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
> +		 * With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
> +		 * existing latency and 1/8 (~12.5%) weight to the new latency.
> +		 */
> +		if (unlikely(!stat->slat_ns))
> +			WRITE_ONCE(stat->slat_ns, avg_lat_ns);
> +		else {
> +			slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
> +			WRITE_ONCE(stat->slat_ns, slat_ns);
> +		}
> +
> +		stat->batch = stat->batch_count = 0;
> +
> +		/*
> +		 * Defer calculation of the path weight in per-cpu workqueue.
> +		 */
> +		schedule_work_on(cpu, &info->work.weight_work);

I'm unsure if the percpu is a good choice here. Don't you want it per hctx
at least? Workloads tend to bounce quite a bit between cpu cores... we have
systems with hundreds of cpu cores.

> +	}
> +}
> +
>   void nvme_mpath_end_request(struct request *rq)
>   {
>   	struct nvme_ns *ns = rq->q->queuedata;
> @@ -205,6 +409,9 @@ void nvme_mpath_end_request(struct request *rq)
>   	if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)
>   		atomic_dec_if_positive(&ns->ctrl->nr_active);
>   
> +	if (test_bit(NVME_NS_PATH_STAT, &ns->flags))
> +		nvme_mpath_add_sample(rq, ns);
> +

Is doing all this work for EVERY completion really worth it?
It sounds kinda like overkill.

>   	if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
>   		return;
>   	bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
> @@ -238,6 +445,62 @@ static const char *nvme_ana_state_names[] = {
>   	[NVME_ANA_CHANGE]		= "change",
>   };
>   
> +static void nvme_mpath_reset_adaptive_path_stat(struct nvme_ns *ns)
> +{
> +	int i, cpu;
> +	struct nvme_path_stat *stat;
> +
> +	for_each_possible_cpu(cpu) {
> +		for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
> +			stat = &per_cpu_ptr(ns->info, cpu)[i].stat;
> +			memset(stat, 0, sizeof(struct nvme_path_stat));
> +		}
> +	}
> +}
> +
> +void nvme_mpath_cancel_adaptive_path_weight_work(struct nvme_ns *ns)
> +{
> +	int i, cpu;
> +	struct nvme_path_info *info;
> +
> +	if (!test_bit(NVME_NS_PATH_STAT, &ns->flags))
> +		return;
> +
> +	for_each_online_cpu(cpu) {
> +		for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
> +			info = &per_cpu_ptr(ns->info, cpu)[i];
> +			cancel_work_sync(&info->work.weight_work);
> +		}
> +	}
> +}
> +
> +static bool nvme_mpath_enable_adaptive_path_policy(struct nvme_ns *ns)
> +{
> +	struct nvme_ns_head *head = ns->head;
> +
> +	if (!head->disk || head->subsys->iopolicy != NVME_IOPOLICY_ADAPTIVE)
> +		return false;
> +
> +	if (test_and_set_bit(NVME_NS_PATH_STAT, &ns->flags))
> +		return false;
> +
> +	blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, ns->queue);

This is an undocumented change...

> +	blk_stat_enable_accounting(ns->queue);
> +	return true;
> +}
> +
> +static bool nvme_mpath_disable_adaptive_path_policy(struct nvme_ns *ns)
> +{
> +
> +	if (!test_and_clear_bit(NVME_NS_PATH_STAT, &ns->flags))
> +		return false;
> +
> +	blk_stat_disable_accounting(ns->queue);
> +	blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, ns->queue);
> +	nvme_mpath_reset_adaptive_path_stat(ns);
> +	return true;
> +}
> +
>   bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
>   {
>   	struct nvme_ns_head *head = ns->head;
> @@ -253,6 +516,8 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
>   			changed = true;
>   		}
>   	}
> +	if (nvme_mpath_disable_adaptive_path_policy(ns))
> +		changed = true;

I don't understand why you are setting changed here. It indicates whether
the current_path was changed, so this doesn't make sense to me.

>   out:
>   	return changed;
>   }
> @@ -271,6 +536,45 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
>   	srcu_read_unlock(&ctrl->srcu, srcu_idx);
>   }
>   
> +int nvme_alloc_ns_stat(struct nvme_ns *ns)
> +{
> +	int i, cpu;
> +	struct nvme_path_work *work;
> +	gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
> +
> +	if (!ns->head->disk)
> +		return 0;
> +
> +	ns->info = __alloc_percpu_gfp(NVME_NUM_STAT_GROUPS *
> +			sizeof(struct nvme_path_info),
> +			__alignof__(struct nvme_path_info), gfp);
> +	if (!ns->info)
> +		return -ENOMEM;
> +
> +	for_each_possible_cpu(cpu) {
> +		for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
> +			work = &per_cpu_ptr(ns->info, cpu)[i].work;
> +			work->ns = ns;
> +			work->op_type = i;
> +			INIT_WORK(&work->weight_work, nvme_mpath_weight_work);
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static void nvme_mpath_set_ctrl_paths(struct nvme_ctrl *ctrl)

Does this function set any ctrl paths? The code is very confusing.

> +{
> +	struct nvme_ns *ns;
> +	int srcu_idx;
> +
> +	srcu_idx = srcu_read_lock(&ctrl->srcu);
> +	list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
> +				srcu_read_lock_held(&ctrl->srcu))
> +		nvme_mpath_enable_adaptive_path_policy(ns);
> +	srcu_read_unlock(&ctrl->srcu, srcu_idx);

It seems to enable the iopolicy on all of the ctrl's namespaces.
The enable function should also have a more explicit name, such as
nvme_enable_ns_lat_sampling or something like that.

> +}
> +
>   void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>   {
>   	struct nvme_ns_head *head = ns->head;
> @@ -283,6 +587,8 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>   				 srcu_read_lock_held(&head->srcu)) {
>   		if (capacity != get_capacity(ns->disk))
>   			clear_bit(NVME_NS_READY, &ns->flags);
> +
> +		nvme_mpath_reset_adaptive_path_stat(ns);
>   	}
>   	srcu_read_unlock(&head->srcu, srcu_idx);
>   
> @@ -407,6 +713,92 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
>   	return found;
>   }
>   
> +static inline bool nvme_state_is_live(enum nvme_ana_state state)
> +{
> +	return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
> +}
> +
> +static struct nvme_ns *nvme_adaptive_path(struct nvme_ns_head *head,
> +		unsigned int op_type)
> +{
> +	struct nvme_ns *ns, *start, *found = NULL;
> +	struct nvme_path_stat *stat;
> +	u32 weight;
> +	int cpu;
> +
> +	cpu = get_cpu();
> +	ns = *this_cpu_ptr(head->adp_path);
> +	if (unlikely(!ns)) {
> +		ns = list_first_or_null_rcu(&head->list,
> +				struct nvme_ns, siblings);
> +		if (unlikely(!ns))
> +			goto out;
> +	}
> +found_ns:
> +	start = ns;
> +	while (nvme_path_is_disabled(ns) ||
> +			!nvme_state_is_live(ns->ana_state)) {
> +		ns = list_next_entry_circular(ns, &head->list, siblings);
> +
> +		/*
> +		 * If we iterate through all paths in the list but find each
> +		 * path in list is either disabled or dead then bail out.
> +		 */
> +		if (ns == start)
> +			goto out;
> +	}
> +
> +	stat = &this_cpu_ptr(ns->info)[op_type].stat;
> +
> +	/*
> +	 * When the head path-list is singular we don't calculate the
> +	 * weight of the only path, as an optimization, since we don't
> +	 * need to forward I/O to more than one path. The other
> +	 * possibility is that the path is newly added and we don't know
> +	 * its weight yet. So we go round-robin over each such path and
> +	 * forward I/O to it. Once we start getting responses for such
> +	 * I/Os, the path weight calculation kicks in and we start
> +	 * using path credit for forwarding I/O.
> +	 */
> +	weight = READ_ONCE(stat->weight);
> +	if (!weight) {
> +		found = ns;
> +		goto out;
> +	}
> +
> +	/*
> +	 * To keep path selection logic simple, we don't distinguish
> +	 * between ANA optimized and non-optimized states. The non-
> +	 * optimized path is expected to have a lower weight, and
> +	 * therefore fewer credits. As a result, only a small number of
> +	 * I/Os will be forwarded to paths in the non-optimized state.
> +	 */
> +	if (stat->credit > 0) {
> +		--stat->credit;
> +		found = ns;
> +		goto out;
> +	} else {
> +		/*
> +		 * Refill credit from path weight and move to next path. The
> +		 * refilled credit of the current path will be used next when
> +		 * all remaining paths exhaust their credits.
> +		 */
> +		weight = READ_ONCE(stat->weight);
> +		stat->credit = weight;
> +		ns = list_next_entry_circular(ns, &head->list, siblings);
> +		if (likely(ns))
> +			goto found_ns;
> +	}
> +out:
> +	if (found) {
> +		stat->sel++;
> +		*this_cpu_ptr(head->adp_path) = found;
> +	}
> +
> +	put_cpu();
> +	return found;
> +}
> +
>   static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
>   {
>   	struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
> @@ -463,9 +855,12 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
>   	return ns;
>   }
>   
> -inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
> +inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head,
> +		unsigned int op_type)
>   {
>   	switch (READ_ONCE(head->subsys->iopolicy)) {
> +	case NVME_IOPOLICY_ADAPTIVE:
> +		return nvme_adaptive_path(head, op_type);
>   	case NVME_IOPOLICY_QD:
>   		return nvme_queue_depth_path(head);
>   	case NVME_IOPOLICY_RR:
> @@ -525,7 +920,7 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
>   		return;
>   
>   	srcu_idx = srcu_read_lock(&head->srcu);
> -	ns = nvme_find_path(head);
> +	ns = nvme_find_path(head, nvme_data_dir(bio_op(bio)));
>   	if (likely(ns)) {
>   		bio_set_dev(bio, ns->disk->part0);
>   		bio->bi_opf |= REQ_NVME_MPATH;
> @@ -567,7 +962,7 @@ static int nvme_ns_head_get_unique_id(struct gendisk *disk, u8 id[16],
>   	int srcu_idx, ret = -EWOULDBLOCK;
>   
>   	srcu_idx = srcu_read_lock(&head->srcu);
> -	ns = nvme_find_path(head);
> +	ns = nvme_find_path(head, NVME_STAT_OTHER);
>   	if (ns)
>   		ret = nvme_ns_get_unique_id(ns, id, type);
>   	srcu_read_unlock(&head->srcu, srcu_idx);
> @@ -583,7 +978,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
>   	int srcu_idx, ret = -EWOULDBLOCK;
>   
>   	srcu_idx = srcu_read_lock(&head->srcu);
> -	ns = nvme_find_path(head);
> +	ns = nvme_find_path(head, NVME_STAT_OTHER);
>   	if (ns)
>   		ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
>   	srcu_read_unlock(&head->srcu, srcu_idx);
> @@ -725,6 +1120,9 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
>   	INIT_WORK(&head->partition_scan_work, nvme_partition_scan_work);
>   	INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
>   	head->delayed_removal_secs = 0;
> +	head->adp_path = alloc_percpu_gfp(struct nvme_ns*, GFP_KERNEL);
> +	if (!head->adp_path)
> +		return -ENOMEM;
>   
>   	/*
>   	 * If "multipath_always_on" is enabled, a multipath node is added
> @@ -809,6 +1207,10 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>   	}
>   	mutex_unlock(&head->lock);
>   
> +	mutex_lock(&nvme_subsystems_lock);
> +	nvme_mpath_enable_adaptive_path_policy(ns);
> +	mutex_unlock(&nvme_subsystems_lock);
> +
>   	synchronize_srcu(&head->srcu);
>   	kblockd_schedule_work(&head->requeue_work);
>   }
> @@ -857,11 +1259,6 @@ static int nvme_parse_ana_log(struct nvme_ctrl *ctrl, void *data,
>   	return 0;
>   }
>   
> -static inline bool nvme_state_is_live(enum nvme_ana_state state)
> -{
> -	return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
> -}
> -
>   static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
>   		struct nvme_ns *ns)
>   {
> @@ -1039,10 +1436,12 @@ static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys,
>   
>   	WRITE_ONCE(subsys->iopolicy, iopolicy);
>   
> -	/* iopolicy changes clear the mpath by design */
> +	/* iopolicy changes clear/reset the mpath by design */
>   	mutex_lock(&nvme_subsystems_lock);
>   	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
>   		nvme_mpath_clear_ctrl_paths(ctrl);
> +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
> +		nvme_mpath_set_ctrl_paths(ctrl);
>   	mutex_unlock(&nvme_subsystems_lock);
>   
>   	pr_notice("subsysnqn %s iopolicy changed from %s to %s\n",
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 102fae6a231c..715c7053054c 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -28,7 +28,10 @@ extern unsigned int nvme_io_timeout;
>   extern unsigned int admin_timeout;
>   #define NVME_ADMIN_TIMEOUT	(admin_timeout * HZ)
>   
> -#define NVME_DEFAULT_KATO	5
> +#define NVME_DEFAULT_KATO		5
> +
> +#define NVME_DEFAULT_ADP_EWMA_SHIFT	3
> +#define NVME_DEFAULT_ADP_WEIGHT_TIMEOUT	(15 * NSEC_PER_SEC)

You need these defines outside of nvme-mpath?

>   
>   #ifdef CONFIG_ARCH_NO_SG_CHAIN
>   #define  NVME_INLINE_SG_CNT  0
> @@ -421,6 +424,7 @@ enum nvme_iopolicy {
>   	NVME_IOPOLICY_NUMA,
>   	NVME_IOPOLICY_RR,
>   	NVME_IOPOLICY_QD,
> +	NVME_IOPOLICY_ADAPTIVE,
>   };
>   
>   struct nvme_subsystem {
> @@ -459,6 +463,37 @@ struct nvme_ns_ids {
>   	u8	csi;
>   };
>   
> +enum nvme_stat_group {
> +	NVME_STAT_READ,
> +	NVME_STAT_WRITE,
> +	NVME_STAT_OTHER,
> +	NVME_NUM_STAT_GROUPS
> +};

I see you have stats per I/O direction, but not per I/O size. I wonder
how that plays into this iopolicy.

> +
> +struct nvme_path_stat {
> +	u64 nr_samples;		/* total num of samples processed */
> +	u64 nr_ignored;		/* num. of samples ignored */
> +	u64 slat_ns;		/* smoothed (ewma) latency in nanoseconds */
> +	u64 score;		/* score used for weight calculation */
> +	u64 last_weight_ts;	/* timestamp of the last weight calculation */
> +	u64 sel;		/* num of times this path is selected for I/O */
> +	u64 batch;		/* accumulated latency sum for current window */
> +	u32 batch_count;	/* num of samples accumulated in current window */
> +	u32 weight;		/* path weight */
> +	u32 credit;		/* path credit for I/O forwarding */
> +};

I'm still not convinced that having this be per-cpu-per-ns really makes 
sense.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2025-12-12 13:04   ` Sagi Grimberg
@ 2025-12-13  7:27     ` Nilay Shroff
  2025-12-15 23:36       ` Sagi Grimberg
  0 siblings, 1 reply; 28+ messages in thread
From: Nilay Shroff @ 2025-12-13  7:27 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme
  Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce



On 12/12/25 6:34 PM, Sagi Grimberg wrote:
> 
> 
> On 05/11/2025 12:33, Nilay Shroff wrote:
>> This commit introduces a new I/O policy named "adaptive". Users can
>> configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
>> subsystemX/iopolicy"
>>
>> The adaptive policy dynamically distributes I/O based on measured
>> completion latency. The main idea is to calculate latency for each path,
>> derive a weight, and then proportionally forward I/O according to those
>> weights.
>>
>> To ensure scalability, path latency is measured per-CPU. Each CPU
>> maintains its own statistics, and I/O forwarding uses these per-CPU
>> values.
> 
> So a given cpu would select path-a vs. another cpu that may select path-b?
> How does that play with less queues than cpu cores? what happens to cores
> that have low traffic?
> 
The path-selection logic does not depend on the relationship between the number
of CPUs and the number of hardware queues. It simply selects a path based on the
per-CPU path score/credit, which reflects the relative performance of each available
path.
For example, assume we have two paths (A and B) to the same shared namespace. 
For each CPU, we maintain a smoothed latency estimate for every path. From these
latency values we derive a per-path score or credit. The credit represents the relative
share of I/O that each path should receive: a path with lower observed latency gets more
credit, and a path with higher latency gets less.

I/O distribution is thus governed directly by the available credits on that CPU. When the
NVMe multipath driver performs path selection, it chooses the path with sufficient credits,
updates the bio’s bdev to correspond to that path, and submits the bio. Only after this
point does the block layer map the bio to an hctx through the usual ctx->hctx mapping (i.e., 
matching the issuing CPU to the appropriate hardware queue). In other words, the multipath
policy runs above the block-layer queueing logic, and the number of hardware queues does
not affect how paths are scored or selected.
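
To make the credit mechanism concrete, here is a minimal userspace sketch of
the selection loop described above. It is an illustration only, not the patch
code: struct demo_path and demo_select_path() are hypothetical names, an array
index stands in for the per-CPU "current path" pointer, and ANA state and
disabled-path checks are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for one path's per-CPU state. */
struct demo_path {
	uint32_t weight;	/* relative share; 0 means "not measured yet" */
	uint32_t credit;	/* I/Os left before yielding to the next path */
	uint64_t sel;		/* number of times this path was chosen */
};

/*
 * Pick a path for one I/O, starting at index *cur (the path remembered
 * from the previous call). Returns the chosen index, or -1 if n == 0.
 */
static int demo_select_path(struct demo_path *paths, size_t n, size_t *cur)
{
	size_t tries;

	/* At most n refills can precede a successful pick, so 2n is enough. */
	for (tries = 0; tries < 2 * n; tries++) {
		struct demo_path *p = &paths[*cur];

		/* An unmeasured path (weight 0) is used without credits. */
		if (p->weight == 0 || p->credit > 0) {
			if (p->weight)
				p->credit--;	/* one credit per I/O */
			p->sel++;
			return (int)*cur;
		}

		/* Out of credit: refill from weight and try the next path. */
		p->credit = p->weight;
		*cur = (*cur + 1) % n;
	}
	return -1;
}
```

With two paths weighted 48 and 16, repeated calls send 48 I/Os to the first
path, then 16 to the second, and so on, which gives the 3:1 split that the
weights ask for. None of this involves the hctx mapping, which happens later
in the block layer.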

>> Every ~15 seconds, a simple average latency of per-CPU batched
>> samples are computed and fed into an Exponentially Weighted Moving
>> Average (EWMA):
> 
> I suggest to have iopolicy name reflect ewma. maybe "ewma-lat"?

Okay that sounds good! Shall we name it "ewma-lat" or "weighted-lat"? 

> 
>>
>> avg_latency = div_u64(batch, batch_count);
>> new_ewma_latency = (prev_ewma_latency * (WEIGHT-1) + avg_latency)/WEIGHT
>>
>> With WEIGHT = 8, this assigns 7/8 (~87.5%) weight to the previous
>> latency value and 1/8 (~12.5%) to the most recent latency. This
>> smoothing reduces jitter, adapts quickly to changing conditions,
>> avoids storing historical samples, and works well for both low and
>> high I/O rates.
> 
> This weight was based on empirical measurements?
> 
Yes, correct. For that reason we also allow the user to configure WEIGHT, if needed.
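
For reference, the smoothing step with WEIGHT = 8 (shift of 3) can be sketched
in plain C as below. This is a simplified illustration: demo_ewma_step() and
DEMO_EWMA_SHIFT are hypothetical names, and the sketch folds the "first sample
seeds the average" case into the helper, whereas the patch handles that case
in the caller.

```c
#include <stdint.h>

#define DEMO_EWMA_SHIFT 3	/* WEIGHT = 1 << 3 = 8 */

/*
 * One EWMA step: keep 7/8 of the old value and add 1/8 of the new sample.
 * An old value of 0 means "no history yet", so the sample is taken as is.
 */
static uint64_t demo_ewma_step(uint64_t old, uint64_t sample)
{
	if (!old)
		return sample;

	return (old * ((1ULL << DEMO_EWMA_SHIFT) - 1) + sample)
			>> DEMO_EWMA_SHIFT;
}
```

For example, with a previous smoothed latency of 800 ns and a new window
average of 1600 ns, the result is (800 * 7 + 1600) / 8 = 900 ns, so a single
slow window only moves the estimate by 1/8 of the difference.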

>>   Path weights are then derived from the smoothed (EWMA)
>> latency as follows (example with two paths A and B):
>>
>>      path_A_score = NSEC_PER_SEC / path_A_ewma_latency
>>      path_B_score = NSEC_PER_SEC / path_B_ewma_latency
>>      total_score  = path_A_score + path_B_score
>>
>>      path_A_weight = (path_A_score * 64) / total_score
>>      path_B_weight = (path_B_score * 64) / total_score
> 
> What happens to R/W mixed workloads? What happens when the I/O pattern
> has a distribution of block sizes?
> 

We maintain separate metrics for READ and WRITE traffic, and during path
selection we use the appropriate metric depending on the I/O type.

Regarding block-size variability: the current implementation does not yet
account for I/O size. This is an important point — thank you for raising it.
I discussed this today with Hannes at LPC, and we agreed that a practical
approach is to normalize latency per 512-byte block. For our purposes, we 
do not need an exact latency value; a relative latency metric is sufficient,
as it ultimately feeds into path scoring. A path with higher latency ends up
with a lower score, and a path with lower latency gets a higher score — the 
exact absolute values are less important than maintaining consistent proportional
relationships.

Normalizing latency per 512 bytes gives us a stable, size-aware metric that scales
across different I/O block sizes. Normalizing a latency number per 512-byte block
should be easy, and I'll implement that in the next patch version.
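
As a rough sketch of what such a normalization could look like (an assumption
for illustration, not the code planned for the next version), the helper below
divides a completion latency by the number of 512-byte blocks in the request.
demo_latency_per_block() is a hypothetical name; zero-length requests are
treated as one block so that the division is always defined.

```c
#include <stdint.h>

#define DEMO_BLOCK_SHIFT 9	/* 512-byte blocks */

/*
 * Scale a completion latency by the request size so that large and small
 * I/Os feed comparable values into the smoothed latency.
 */
static uint64_t demo_latency_per_block(uint64_t latency_ns, uint32_t bytes)
{
	/* Round up to whole blocks; widen first to avoid 32-bit overflow. */
	uint64_t blocks = ((uint64_t)bytes + (1U << DEMO_BLOCK_SHIFT) - 1)
				>> DEMO_BLOCK_SHIFT;

	if (!blocks)
		blocks = 1;

	return latency_ns / blocks;
}
```

A 4096-byte request that completes in 4000 ns would contribute 500 ns per
block, the same as a 512-byte request that completes in 500 ns.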

> I think that in order to understand how a non-trivial path selector works we need
> thorough testing in a variety of I/O patterns.
> 
Yes, that was done by running fio with different I/O engines, I/O types (read, write, r/w) and
different block sizes. I tested it using NVMe PCIe and nvmf-tcp. The tests were performed
for both direct and buffered I/O. I also ran blktests with the adaptive I/O policy configured.
If you would still like me to run anything further, let me know.

>>
>> where:
>>    - path_X_ewma_latency is the smoothed latency of a path in nanoseconds
>>    - NSEC_PER_SEC is used as a scaling factor since valid latencies
>>      are < 1 second
>>    - weights are normalized to a 0–64 scale across all paths.
>>
>> Path credits are refilled based on this weight, with one credit
>> consumed per I/O. When all credits are consumed, the credits are
>> refilled again based on the current weight. This ensures that I/O is
>> distributed across paths proportionally to their calculated weight.
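
The score and weight derivation above can be sketched in userspace C as
follows. This is an illustration under assumptions: demo_path_score() and
demo_compute_weights() are hypothetical names, the arrays stand in for the
per-path statistics, and the 64 scale and the minimum weight of 1 follow the
quoted nvme_mpath_weight_work() code.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC	1000000000ULL
#define DEMO_WEIGHT_SCALE	64

/* Score is the inverse of the smoothed latency; 0 means "no data". */
static uint64_t demo_path_score(uint64_t slat_ns)
{
	return slat_ns ? DEMO_NSEC_PER_SEC / slat_ns : 0;
}

/*
 * Turn per-path smoothed latencies into weights on a 0..64 scale.
 * If no path has any data yet, the weights are left untouched.
 */
static void demo_compute_weights(const uint64_t *slat_ns, uint32_t *weight,
				 size_t n)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += demo_path_score(slat_ns[i]);

	if (!total)
		return;

	for (i = 0; i < n; i++) {
		uint64_t w = demo_path_score(slat_ns[i]) * DEMO_WEIGHT_SCALE
				/ total;

		/* Never drop below 1, so every path is still probed. */
		weight[i] = w ? (uint32_t)w : 1;
	}
}
```

With smoothed latencies of 1 ms and 3 ms the scores are 1000 and 333, which
gives weights of 48 and 15. With 100 us against 30 ms the slow path would
round down to 0 and is therefore raised to the minimum weight of 1.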
>>
>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>>   drivers/nvme/host/core.c      |  15 +-
>>   drivers/nvme/host/ioctl.c     |  31 ++-
>>   drivers/nvme/host/multipath.c | 425 ++++++++++++++++++++++++++++++++--
>>   drivers/nvme/host/nvme.h      |  74 +++++-
>>   drivers/nvme/host/pr.c        |   6 +-
>>   drivers/nvme/host/sysfs.c     |   2 +-
>>   6 files changed, 530 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index fa4181d7de73..47f375c63d2d 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -672,6 +672,9 @@ static void nvme_free_ns_head(struct kref *ref)
>>       cleanup_srcu_struct(&head->srcu);
>>       nvme_put_subsystem(head->subsys);
>>       kfree(head->plids);
>> +#ifdef CONFIG_NVME_MULTIPATH
>> +    free_percpu(head->adp_path);
>> +#endif
>>       kfree(head);
>>   }
>>   @@ -689,6 +692,7 @@ static void nvme_free_ns(struct kref *kref)
>>   {
>>       struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
>>   +    nvme_free_ns_stat(ns);
>>       put_disk(ns->disk);
>>       nvme_put_ns_head(ns->head);
>>       nvme_put_ctrl(ns->ctrl);
>> @@ -4137,6 +4141,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>       if (nvme_init_ns_head(ns, info))
>>           goto out_cleanup_disk;
>>   +    if (nvme_alloc_ns_stat(ns))
>> +        goto out_unlink_ns;
>> +
>>       /*
>>        * If multipathing is enabled, the device name for all disks and not
>>        * just those that represent shared namespaces needs to be based on the
>> @@ -4161,7 +4168,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>       }
>>         if (nvme_update_ns_info(ns, info))
>> -        goto out_unlink_ns;
>> +        goto out_free_ns_stat;
>>         mutex_lock(&ctrl->namespaces_lock);
>>       /*
>> @@ -4170,7 +4177,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>        */
>>       if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
>>           mutex_unlock(&ctrl->namespaces_lock);
>> -        goto out_unlink_ns;
>> +        goto out_free_ns_stat;
>>       }
>>       nvme_ns_add_to_ctrl_list(ns);
>>       mutex_unlock(&ctrl->namespaces_lock);
>> @@ -4201,6 +4208,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>       list_del_rcu(&ns->list);
>>       mutex_unlock(&ctrl->namespaces_lock);
>>       synchronize_srcu(&ctrl->srcu);
>> +out_free_ns_stat:
>> +    nvme_free_ns_stat(ns);
>>    out_unlink_ns:
>>       mutex_lock(&ctrl->subsys->lock);
>>       list_del_rcu(&ns->siblings);
>> @@ -4244,6 +4253,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>        */
>>       synchronize_srcu(&ns->head->srcu);
>>   +    nvme_mpath_cancel_adaptive_path_weight_work(ns);
>> +
> 
> I personally think that the check on path stats should be done in the call-site
> and not in the function itself.
Hmm, can you please elaborate on this point further? I'm afraid I don't follow
your point here.

> 
>>       /* wait for concurrent submissions */
>>       if (nvme_mpath_clear_current_path(ns))
>>           synchronize_srcu(&ns->head->srcu);
>> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
>> index c212fa952c0f..759d147d9930 100644
>> --- a/drivers/nvme/host/ioctl.c
>> +++ b/drivers/nvme/host/ioctl.c
>> @@ -700,18 +700,29 @@ static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
>>   int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>>           unsigned int cmd, unsigned long arg)
>>   {
>> +    u8 opcode;
>>       struct nvme_ns_head *head = bdev->bd_disk->private_data;
>>       bool open_for_write = mode & BLK_OPEN_WRITE;
>>       void __user *argp = (void __user *)arg;
>>       struct nvme_ns *ns;
>>       int srcu_idx, ret = -EWOULDBLOCK;
>>       unsigned int flags = 0;
>> +    unsigned int op_type = NVME_STAT_OTHER;
>>         if (bdev_is_partition(bdev))
>>           flags |= NVME_IOCTL_PARTITION;
>>   +    if (cmd == NVME_IOCTL_SUBMIT_IO) {
>> +        if (get_user(opcode, (u8 *)argp))
>> +            return -EFAULT;
>> +        if (opcode == nvme_cmd_write)
>> +            op_type = NVME_STAT_WRITE;
>> +        else if (opcode == nvme_cmd_read)
>> +            op_type = NVME_STAT_READ;
>> +    }
>> +
>>       srcu_idx = srcu_read_lock(&head->srcu);
>> -    ns = nvme_find_path(head);
>> +    ns = nvme_find_path(head, op_type);
> 
> Perhaps it would be easier to review if you split passing opcode to nvme_find_path()
> to a prep patch (explaining that the new iopolicy will leverage it)
> 
Sure, makes sense. I'll split this into prep patch as you suggested.
>>       if (!ns)
>>           goto out_unlock;
>>   @@ -733,6 +744,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>>   long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>>           unsigned long arg)
>>   {
>> +    u8 opcode;
>>       bool open_for_write = file->f_mode & FMODE_WRITE;
>>       struct cdev *cdev = file_inode(file)->i_cdev;
>>       struct nvme_ns_head *head =
>> @@ -740,9 +752,19 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>>       void __user *argp = (void __user *)arg;
>>       struct nvme_ns *ns;
>>       int srcu_idx, ret = -EWOULDBLOCK;
>> +    unsigned int op_type = NVME_STAT_OTHER;
>> +
>> +    if (cmd == NVME_IOCTL_SUBMIT_IO) {
>> +        if (get_user(opcode, (u8 *)argp))
>> +            return -EFAULT;
>> +        if (opcode == nvme_cmd_write)
>> +            op_type = NVME_STAT_WRITE;
>> +        else if (opcode == nvme_cmd_read)
>> +            op_type = NVME_STAT_READ;
>> +    }
>>         srcu_idx = srcu_read_lock(&head->srcu);
>> -    ns = nvme_find_path(head);
>> +    ns = nvme_find_path(head, op_type);
>>       if (!ns)
>>           goto out_unlock;
>>   @@ -762,7 +784,10 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
>>       struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>>       struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
>>       int srcu_idx = srcu_read_lock(&head->srcu);
>> -    struct nvme_ns *ns = nvme_find_path(head);
>> +    const struct nvme_uring_cmd *cmd = io_uring_sqe_cmd(ioucmd->sqe);
>> +    struct nvme_ns *ns = nvme_find_path(head,
>> +            READ_ONCE(cmd->opcode) & 1 ?
>> +            NVME_STAT_WRITE : NVME_STAT_READ);
>>       int ret = -EINVAL;
>>         if (ns)
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 543e17aead12..55dc28375662 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -6,6 +6,9 @@
>>   #include <linux/backing-dev.h>
>>   #include <linux/moduleparam.h>
>>   #include <linux/vmalloc.h>
>> +#include <linux/blk-mq.h>
>> +#include <linux/math64.h>
>> +#include <linux/rculist.h>
>>   #include <trace/events/block.h>
>>   #include "nvme.h"
>>   @@ -66,9 +69,10 @@ MODULE_PARM_DESC(multipath_always_on,
>>       "create multipath node always except for private namespace with non-unique nsid; note that this also implicitly enables native multipath support");
>>     static const char *nvme_iopolicy_names[] = {
>> -    [NVME_IOPOLICY_NUMA]    = "numa",
>> -    [NVME_IOPOLICY_RR]    = "round-robin",
>> -    [NVME_IOPOLICY_QD]      = "queue-depth",
>> +    [NVME_IOPOLICY_NUMA]     = "numa",
>> +    [NVME_IOPOLICY_RR]     = "round-robin",
>> +    [NVME_IOPOLICY_QD]       = "queue-depth",
>> +    [NVME_IOPOLICY_ADAPTIVE] = "adaptive",
>>   };
>>     static int iopolicy = NVME_IOPOLICY_NUMA;
>> @@ -83,6 +87,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>>           iopolicy = NVME_IOPOLICY_RR;
>>       else if (!strncmp(val, "queue-depth", 11))
>>           iopolicy = NVME_IOPOLICY_QD;
>> +    else if (!strncmp(val, "adaptive", 8))
>> +        iopolicy = NVME_IOPOLICY_ADAPTIVE;
>>       else
>>           return -EINVAL;
>>   @@ -198,6 +204,204 @@ void nvme_mpath_start_request(struct request *rq)
>>   }
>>   EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
>>   +static void nvme_mpath_weight_work(struct work_struct *weight_work)
>> +{
>> +    int cpu, srcu_idx;
>> +    u32 weight;
>> +    struct nvme_ns *ns;
>> +    struct nvme_path_stat *stat;
>> +    struct nvme_path_work *work = container_of(weight_work,
>> +            struct nvme_path_work, weight_work);
>> +    struct nvme_ns_head *head = work->ns->head;
>> +    int op_type = work->op_type;
>> +    u64 total_score = 0;
>> +
>> +    cpu = get_cpu();
>> +
>> +    srcu_idx = srcu_read_lock(&head->srcu);
>> +    list_for_each_entry_srcu(ns, &head->list, siblings,
>> +            srcu_read_lock_held(&head->srcu)) {
>> +
>> +        stat = &this_cpu_ptr(ns->info)[op_type].stat;
>> +        if (!READ_ONCE(stat->slat_ns)) {
>> +            stat->score = 0;
>> +            continue;
>> +        }
>> +        /*
>> +         * Compute the path score as the inverse of smoothed
>> +         * latency, scaled by NSEC_PER_SEC. Floating point
>> +         * math is unavailable in the kernel, so fixed-point
>> +         * scaling is used instead. NSEC_PER_SEC is chosen
>> +         * because valid latencies are always < 1 second; longer
>> +         * latencies are ignored.
>> +         */
>> +        stat->score = div_u64(NSEC_PER_SEC, READ_ONCE(stat->slat_ns));
>> +
>> +        /* Compute total score. */
>> +        total_score += stat->score;
>> +    }
>> +
>> +    if (!total_score)
>> +        goto out;
>> +
>> +    /*
>> +     * After computing the total score, we derive per-path weight
>> +     * (normalized to the range 0–64). The weight represents the
>> +     * relative share of I/O the path should receive.
>> +     *
>> +     *   - lower smoothed latency -> higher weight
>> +     *   - higher smoothed latency -> lower weight
>> +     *
>> +     * Next, while forwarding I/O, we assign "credits" to each path
>> +     * based on its weight (please also refer nvme_adaptive_path()):
>> +     *   - Initially, credits = weight.
>> +     *   - Each time an I/O is dispatched on a path, its credits are
>> +     *     decremented proportionally.
>> +     *   - When a path runs out of credits, it becomes temporarily
>> +     *     ineligible until credit is refilled.
>> +     *
>> +     * I/O distribution is therefore governed by available credits,
>> +     * ensuring that over time the proportion of I/O sent to each
>> +     * path matches its weight (and thus its performance).
>> +     */
>> +    list_for_each_entry_srcu(ns, &head->list, siblings,
>> +            srcu_read_lock_held(&head->srcu)) {
>> +
>> +        stat = &this_cpu_ptr(ns->info)[op_type].stat;
>> +        weight = div_u64(stat->score * 64, total_score);
>> +
>> +        /*
>> +         * Ensure the path weight never drops below 1. A weight
>> +         * of 0 is used only for newly added paths. During
>> +         * bootstrap, a few I/Os are sent to such paths to
>> +         * establish an initial weight. Enforcing a minimum
>> +         * weight of 1 guarantees that no path is forgotten and
>> +         * that each path is probed at least occasionally.
>> +         */
>> +        if (!weight)
>> +            weight = 1;
>> +
>> +        WRITE_ONCE(stat->weight, weight);
>> +    }
>> +out:
>> +    srcu_read_unlock(&head->srcu, srcu_idx);
>> +    put_cpu();
>> +}
>> +
>> +/*
>> + * Formula to calculate the EWMA (Exponentially Weighted Moving Average):
>> + * ewma = (old_ewma * ((1 << EWMA_SHIFT) - 1) + new) >> EWMA_SHIFT
>> + * For instance, with EWMA_SHIFT = 3, this assigns 7/8 (~87.5%) weight to
>> + * the existing/old ewma and 1/8 (~12.5%) weight to the new sample.
>> + */
>> +static inline u64 ewma_update(u64 old, u64 new)
> 
> it is a calculation function, lets call it calc_ewma_update
Yeah, will do this in next patch version.

>> +{
>> +    return (old * ((1 << NVME_DEFAULT_ADP_EWMA_SHIFT) - 1)
>> +            + new) >> NVME_DEFAULT_ADP_EWMA_SHIFT;
>> +}
>> +
>> +static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
>> +{
>> +    int cpu;
>> +    unsigned int op_type;
>> +    struct nvme_path_info *info;
>> +    struct nvme_path_stat *stat;
>> +    u64 now, latency, slat_ns, avg_lat_ns;
>> +    struct nvme_ns_head *head = ns->head;
>> +
>> +    if (list_is_singular(&head->list))
>> +        return;
>> +
>> +    now = ktime_get_ns();
>> +    latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
>> +    if (!latency)
>> +        return;
>> +
>> +    /*
>> +     * As completion code path is serialized(i.e. no same completion queue
>> +     * update code could run simultaneously on multiple cpu) we can safely
>> +     * access per cpu nvme path stat here from another cpu (in case the
>> +     * completion cpu is different from submission cpu).
>> +     * The only field which could be accessed simultaneously here is the
>> +     * path ->weight which may be accessed by this function as well as I/O
>> +     * submission path during path selection logic and we protect ->weight
>> +     * using READ_ONCE/WRITE_ONCE. Yes this may not be 100% accurate but
>> +     * we also don't need to be so accurate here as the path credit would
>> +     * be anyways refilled, based on path weight, once path consumes all
>> +     * its credits. And we limit path weight/credit max up to 64. Please
>> +     * also refer nvme_adaptive_path().
>> +     */
>> +    cpu = blk_mq_rq_cpu(rq);
>> +    op_type = nvme_data_dir(req_op(rq));
>> +    info = &per_cpu_ptr(ns->info, cpu)[op_type];
> 
> info is really really really confusing and generic and not representative of what
> "info" it is used for. maybe path_lat? or path_stats? anything is better than info.
> 
Maybe I am too used to this code, so I never noticed it. But yes, agreed, I
will rename it to @path_lat.

>> +    stat = &info->stat;
>> +
>> +    /*
>> +     * If latency > ~1s then ignore this sample to prevent EWMA from being
>> +     * skewed by pathological outliers (multi-second waits, controller
>> +     * timeouts etc.). This keeps path scores representative of normal
>> +     * performance and avoids instability from rare spikes. If such high
>> +     * latency is real, ANA state reporting or keep-alive error counters
>> +     * will mark the path unhealthy and remove it from the head node list,
>> +     * so we safely skip such sample here.
>> +     */
>> +    if (unlikely(latency > NSEC_PER_SEC)) {
>> +        stat->nr_ignored++;
>> +        dev_warn_ratelimited(ns->ctrl->device,
>> +            "ignoring sample with >1s latency (possible controller stall or timeout)\n");
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Accumulate latency samples and increment the batch count for each
>> +     * ~15 second interval. When the interval expires, compute the simple
>> +     * average latency over that window, then update the smoothed (EWMA)
>> +     * latency. The path weight is recalculated based on this smoothed
>> +     * latency.
>> +     */
>> +    stat->batch += latency;
>> +    stat->batch_count++;
>> +    stat->nr_samples++;
>> +
>> +    if (now > stat->last_weight_ts &&
>> +        (now - stat->last_weight_ts) >= NVME_DEFAULT_ADP_WEIGHT_TIMEOUT) {
>> +
>> +        stat->last_weight_ts = now;
>> +
>> +        /*
>> +         * Find simple average latency for the last epoch (~15 sec
>> +         * interval).
>> +         */
>> +        avg_lat_ns = div_u64(stat->batch, stat->batch_count);
>> +
>> +        /*
>> +         * Calculate smooth/EWMA (Exponentially Weighted Moving Average)
>> +         * latency. EWMA is preferred over simple average latency
>> +         * because it smooths naturally, reduces jitter from sudden
>> +         * spikes, and adapts faster to changing conditions. It also
>> +         * avoids storing historical samples, and works well for both
>> +         * slow and fast I/O rates.
>> +         * Formula:
>> +         * slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
>> +         * With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
>> +         * existing latency and 1/8 (~12.5%) weight to the new latency.
>> +         */
>> +        if (unlikely(!stat->slat_ns))
>> +            WRITE_ONCE(stat->slat_ns, avg_lat_ns);
>> +        else {
>> +            slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
>> +            WRITE_ONCE(stat->slat_ns, slat_ns);
>> +        }
>> +
>> +        stat->batch = stat->batch_count = 0;
>> +
>> +        /*
>> +         * Defer calculation of the path weight in per-cpu workqueue.
>> +         */
>> +        schedule_work_on(cpu, &info->work.weight_work);
> 
> I'm unsure if the percpu is a good choice here. Don't you want it per hctx at least?
> workloads tend to bounce quite a bit between cpu cores... we have systems with hundreds of
> cpu cores.
As I explained earlier, in the NVMe multipath driver code we don't know the
hctx when we choose a path. The ctx to hctx mapping happens later in the block
layer, while submitting the bio. Here we calculate the path score per-cpu and
use it when choosing the path to forward I/O to.
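
To make the per-CPU scoring concrete, here is a rough userspace sketch, not
the patch code. It applies the score/weight formula from the commit message
(0..100 scale) to one CPU's view of the paths and then forwards I/O with the
credit refill scheme. The names struct path_sel, calc_path_weights() and
pick_path() are hypothetical, and the sketch assumes ANA state and disabled
paths are handled elsewhere.

```c
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* One CPU's view of a path (hypothetical layout, for illustration only). */
struct path_sel {
    uint64_t slat_ns;   /* smoothed (EWMA) latency, 0 if not yet known */
    uint32_t weight;    /* share of I/O on a 0..100 scale */
    uint32_t credit;    /* I/Os left before moving to the next path */
};

/* score = NSEC_PER_SEC / latency; weight = score * 100 / total_score */
static void calc_path_weights(struct path_sel *p, size_t nr_paths)
{
    uint64_t total_score = 0;
    size_t i;

    for (i = 0; i < nr_paths; i++)
        if (p[i].slat_ns)
            total_score += NSEC_PER_SEC / p[i].slat_ns;

    for (i = 0; i < nr_paths; i++) {
        uint64_t score = p[i].slat_ns ? NSEC_PER_SEC / p[i].slat_ns : 0;

        p[i].weight = total_score ?
            (uint32_t)(score * 100 / total_score) : 0;
    }
}

/*
 * Consume one credit from the current path. When the path has no credit
 * left, refill it from its weight and move on to the next path. Returns
 * the chosen path index, or -1 if no path has a nonzero weight.
 */
static int pick_path(struct path_sel *p, size_t nr_paths, size_t *cur)
{
    size_t tries;

    if (!nr_paths)
        return -1;

    for (tries = 0; tries <= nr_paths; tries++) {
        struct path_sel *c = &p[*cur];

        if (c->credit) {
            c->credit--;
            return (int)*cur;
        }
        c->credit = c->weight;
        *cur = (*cur + 1) % nr_paths;
    }
    return -1;
}
```

With smoothed latencies of 1 ms and 3 ms, this sketch yields weights of 75
and 24, so roughly three I/Os go to the faster path for each one sent to the
slower path.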

> 
>> +    }
>> +}
>> +
>>   void nvme_mpath_end_request(struct request *rq)
>>   {
>>       struct nvme_ns *ns = rq->q->queuedata;
>> @@ -205,6 +409,9 @@ void nvme_mpath_end_request(struct request *rq)
>>       if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)
>>           atomic_dec_if_positive(&ns->ctrl->nr_active);
>>
>> +    if (test_bit(NVME_NS_PATH_STAT, &ns->flags))
>> +        nvme_mpath_add_sample(rq, ns);
>> +
> 
> Doing all this work for EVERY completion is really worth it?
> sounds kinda like an overkill.
We don't really do much in nvme_mpath_add_sample() other than adding the
latency sample to the batch. The real work, where we calculate the path
score, is done once every ~15 seconds, in a separate workqueue. So we don't
do any heavy lifting here during I/O completion processing.
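
As a rough illustration of that split, here is a userspace sketch, not the
patch code. The hot path only accumulates; the average and EWMA update run
once per ~15 second window. struct lat_window and add_sample() are
hypothetical names, ewma_update() is an assumption written from the formula
in the code comment (WEIGHT = 8), and the >1s outlier filter and the deferred
weight work are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define EWMA_SHIFT 3                        /* WEIGHT = 1 << 3 = 8 */
#define WINDOW_NS  (15ULL * 1000000000ULL)  /* ~15 second epoch */

struct lat_window {
    uint64_t batch;        /* latency sum for the current window */
    uint32_t batch_count;  /* samples in the current window */
    uint64_t slat_ns;      /* smoothed latency, 0 until the first window */
    uint64_t last_ts;      /* timestamp of the last window close */
};

/* slat = (prev * (WEIGHT - 1) + new) / WEIGHT */
static uint64_t ewma_update(uint64_t prev, uint64_t new_lat)
{
    return (prev * ((1ULL << EWMA_SHIFT) - 1) + new_lat) >> EWMA_SHIFT;
}

/* Returns true when the window expired and slat_ns was refreshed. */
static bool add_sample(struct lat_window *w, uint64_t now_ns, uint64_t lat_ns)
{
    uint64_t avg;

    /* Per-completion cost: two additions and one comparison. */
    w->batch += lat_ns;
    w->batch_count++;

    if (now_ns < w->last_ts || now_ns - w->last_ts < WINDOW_NS)
        return false;

    /* Once per window: simple average, then fold it into the EWMA. */
    avg = w->batch / w->batch_count;
    w->slat_ns = w->slat_ns ? ewma_update(w->slat_ns, avg) : avg;
    w->batch = 0;
    w->batch_count = 0;
    w->last_ts = now_ns;
    return true;
}
```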

> 
>>       if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
>>           return;
>>       bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
>> @@ -238,6 +445,62 @@ static const char *nvme_ana_state_names[] = {
>>       [NVME_ANA_CHANGE]        = "change",
>>   };
>>
>> +static void nvme_mpath_reset_adaptive_path_stat(struct nvme_ns *ns)
>> +{
>> +    int i, cpu;
>> +    struct nvme_path_stat *stat;
>> +
>> +    for_each_possible_cpu(cpu) {
>> +        for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>> +            stat = &per_cpu_ptr(ns->info, cpu)[i].stat;
>> +            memset(stat, 0, sizeof(struct nvme_path_stat));
>> +        }
>> +    }
>> +}
>> +
>> +void nvme_mpath_cancel_adaptive_path_weight_work(struct nvme_ns *ns)
>> +{
>> +    int i, cpu;
>> +    struct nvme_path_info *info;
>> +
>> +    if (!test_bit(NVME_NS_PATH_STAT, &ns->flags))
>> +        return;
>> +
>> +    for_each_online_cpu(cpu) {
>> +        for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>> +            info = &per_cpu_ptr(ns->info, cpu)[i];
>> +            cancel_work_sync(&info->work.weight_work);
>> +        }
>> +    }
>> +}
>> +
>> +static bool nvme_mpath_enable_adaptive_path_policy(struct nvme_ns *ns)
>> +{
>> +    struct nvme_ns_head *head = ns->head;
>> +
>> +    if (!head->disk || head->subsys->iopolicy != NVME_IOPOLICY_ADAPTIVE)
>> +        return false;
>> +
>> +    if (test_and_set_bit(NVME_NS_PATH_STAT, &ns->flags))
>> +        return false;
>> +
>> +    blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, ns->queue);
> 
> This is an undocumented change...
Sure, I will add a comment to this code in the next patch version.

> 
>> +    blk_stat_enable_accounting(ns->queue);
>> +    return true;
>> +}
>> +
>> +static bool nvme_mpath_disable_adaptive_path_policy(struct nvme_ns *ns)
>> +{
>> +
>> +    if (!test_and_clear_bit(NVME_NS_PATH_STAT, &ns->flags))
>> +        return false;
>> +
>> +    blk_stat_disable_accounting(ns->queue);
>> +    blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, ns->queue);
>> +    nvme_mpath_reset_adaptive_path_stat(ns);
>> +    return true;
>> +}
>> +
>>   bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
>>   {
>>       struct nvme_ns_head *head = ns->head;
>> @@ -253,6 +516,8 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
>>               changed = true;
>>           }
>>       }
>> +    if (nvme_mpath_disable_adaptive_path_policy(ns))
>> +        changed = true;
> 
> Don't understand why you are setting changed here? It relates to whether the
> current_path was changed. Doesn't make sense to me.
> 
I think you are correct. We don't have any rcu update here for the adaptive
path. Will remove this.

>>   out:
>>       return changed;
>>   }
>> @@ -271,6 +536,45 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
>>       srcu_read_unlock(&ctrl->srcu, srcu_idx);
>>   }
>>
>> +int nvme_alloc_ns_stat(struct nvme_ns *ns)
>> +{
>> +    int i, cpu;
>> +    struct nvme_path_work *work;
>> +    gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
>> +
>> +    if (!ns->head->disk)
>> +        return 0;
>> +
>> +    ns->info = __alloc_percpu_gfp(NVME_NUM_STAT_GROUPS *
>> +            sizeof(struct nvme_path_info),
>> +            __alignof__(struct nvme_path_info), gfp);
>> +    if (!ns->info)
>> +        return -ENOMEM;
>> +
>> +    for_each_possible_cpu(cpu) {
>> +        for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>> +            work = &per_cpu_ptr(ns->info, cpu)[i].work;
>> +            work->ns = ns;
>> +            work->op_type = i;
>> +            INIT_WORK(&work->weight_work, nvme_mpath_weight_work);
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void nvme_mpath_set_ctrl_paths(struct nvme_ctrl *ctrl)
> 
> Does this function set any ctrl paths? your code is very confusing.
> 
Here "ctrl paths" means that we iterate through each of the controller's
namespace paths and set/enable the adaptive path parameters required for each
path. Moreover, we already have the corresponding nvme_mpath_clear_ctrl_paths()
function, which resets/clears the per-path parameters while changing the I/O
policy.

>> +{
>> +    struct nvme_ns *ns;
>> +    int srcu_idx;
>> +
>> +    srcu_idx = srcu_read_lock(&ctrl->srcu);
>> +    list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
>> +                srcu_read_lock_held(&ctrl->srcu))
>> +        nvme_mpath_enable_adaptive_path_policy(ns);
>> +    srcu_read_unlock(&ctrl->srcu, srcu_idx);
> 
> seems like it enables the iopolicy on all ctrl namespaces.
> the enable should also be more explicit like:
> nvme_enable_ns_lat_sampling or something like that.
> 
Okay, I'll rename it as you suggested.

>> +}
>> +
>>   void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>>   {
>>       struct nvme_ns_head *head = ns->head;
>> @@ -283,6 +587,8 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>>                    srcu_read_lock_held(&head->srcu)) {
>>           if (capacity != get_capacity(ns->disk))
>>               clear_bit(NVME_NS_READY, &ns->flags);
>> +
>> +        nvme_mpath_reset_adaptive_path_stat(ns);
>>       }
>>       srcu_read_unlock(&head->srcu, srcu_idx);
>>
>> @@ -407,6 +713,92 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
>>       return found;
>>   }
>>
>> +static inline bool nvme_state_is_live(enum nvme_ana_state state)
>> +{
>> +    return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
>> +}
>> +
>> +static struct nvme_ns *nvme_adaptive_path(struct nvme_ns_head *head,
>> +        unsigned int op_type)
>> +{
>> +    struct nvme_ns *ns, *start, *found = NULL;
>> +    struct nvme_path_stat *stat;
>> +    u32 weight;
>> +    int cpu;
>> +
>> +    cpu = get_cpu();
>> +    ns = *this_cpu_ptr(head->adp_path);
>> +    if (unlikely(!ns)) {
>> +        ns = list_first_or_null_rcu(&head->list,
>> +                struct nvme_ns, siblings);
>> +        if (unlikely(!ns))
>> +            goto out;
>> +    }
>> +found_ns:
>> +    start = ns;
>> +    while (nvme_path_is_disabled(ns) ||
>> +            !nvme_state_is_live(ns->ana_state)) {
>> +        ns = list_next_entry_circular(ns, &head->list, siblings);
>> +
>> +        /*
>> +         * If we iterate through all paths in the list but find each
>> +         * path in list is either disabled or dead then bail out.
>> +         */
>> +        if (ns == start)
>> +            goto out;
>> +    }
>> +
>> +    stat = &this_cpu_ptr(ns->info)[op_type].stat;
>> +
>> +    /*
>> +     * When the head path-list is singular we don't calculate the
>> +     * only path weight for optimization as we don't need to forward
>> +     * I/O to more than one path. Another possibility is when the
>> +     * path is newly added, we don't know its weight. So we go
>> +     * round-robin for each such path and forward I/O to it. Once we
>> +     * start getting responses for such I/Os, the path weight calculation
>> +     * would kick in and then we start using path credit for
>> +     * forwarding I/O.
>> +     */
>> +    weight = READ_ONCE(stat->weight);
>> +    if (!weight) {
>> +        found = ns;
>> +        goto out;
>> +    }
>> +
>> +    /*
>> +     * To keep path selection logic simple, we don't distinguish
>> +     * between ANA optimized and non-optimized states. The non-
>> +     * optimized path is expected to have a lower weight, and
>> +     * therefore fewer credits. As a result, only a small number of
>> +     * I/Os will be forwarded to paths in the non-optimized state.
>> +     */
>> +    if (stat->credit > 0) {
>> +        --stat->credit;
>> +        found = ns;
>> +        goto out;
>> +    } else {
>> +        /*
>> +         * Refill credit from path weight and move to next path. The
>> +         * refilled credit of the current path will be used next when
>> +         * all remaining paths exhaust their credits.
>> +         */
>> +        weight = READ_ONCE(stat->weight);
>> +        stat->credit = weight;
>> +        ns = list_next_entry_circular(ns, &head->list, siblings);
>> +        if (likely(ns))
>> +            goto found_ns;
>> +    }
>> +out:
>> +    if (found) {
>> +        stat->sel++;
>> +        *this_cpu_ptr(head->adp_path) = found;
>> +    }
>> +
>> +    put_cpu();
>> +    return found;
>> +}
>> +
>>   static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
>>   {
>>       struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
>> @@ -463,9 +855,12 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
>>       return ns;
>>   }
>>
>> -inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>> +inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head,
>> +        unsigned int op_type)
>>   {
>>       switch (READ_ONCE(head->subsys->iopolicy)) {
>> +    case NVME_IOPOLICY_ADAPTIVE:
>> +        return nvme_adaptive_path(head, op_type);
>>       case NVME_IOPOLICY_QD:
>>           return nvme_queue_depth_path(head);
>>       case NVME_IOPOLICY_RR:
>> @@ -525,7 +920,7 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
>>           return;
>>
>>       srcu_idx = srcu_read_lock(&head->srcu);
>> -    ns = nvme_find_path(head);
>> +    ns = nvme_find_path(head, nvme_data_dir(bio_op(bio)));
>>       if (likely(ns)) {
>>           bio_set_dev(bio, ns->disk->part0);
>>           bio->bi_opf |= REQ_NVME_MPATH;
>> @@ -567,7 +962,7 @@ static int nvme_ns_head_get_unique_id(struct gendisk *disk, u8 id[16],
>>       int srcu_idx, ret = -EWOULDBLOCK;
>>
>>       srcu_idx = srcu_read_lock(&head->srcu);
>> -    ns = nvme_find_path(head);
>> +    ns = nvme_find_path(head, NVME_STAT_OTHER);
>>       if (ns)
>>           ret = nvme_ns_get_unique_id(ns, id, type);
>>       srcu_read_unlock(&head->srcu, srcu_idx);
>> @@ -583,7 +978,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
>>       int srcu_idx, ret = -EWOULDBLOCK;
>>
>>       srcu_idx = srcu_read_lock(&head->srcu);
>> -    ns = nvme_find_path(head);
>> +    ns = nvme_find_path(head, NVME_STAT_OTHER);
>>       if (ns)
>>           ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
>>       srcu_read_unlock(&head->srcu, srcu_idx);
>> @@ -725,6 +1120,9 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
>>       INIT_WORK(&head->partition_scan_work, nvme_partition_scan_work);
>>       INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
>>       head->delayed_removal_secs = 0;
>> +    head->adp_path = alloc_percpu_gfp(struct nvme_ns*, GFP_KERNEL);
>> +    if (!head->adp_path)
>> +        return -ENOMEM;
>>
>>       /*
>>        * If "multipath_always_on" is enabled, a multipath node is added
>> @@ -809,6 +1207,10 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>>       }
>>       mutex_unlock(&head->lock);
>>
>> +    mutex_lock(&nvme_subsystems_lock);
>> +    nvme_mpath_enable_adaptive_path_policy(ns);
>> +    mutex_unlock(&nvme_subsystems_lock);
>> +
>>       synchronize_srcu(&head->srcu);
>>       kblockd_schedule_work(&head->requeue_work);
>>   }
>> @@ -857,11 +1259,6 @@ static int nvme_parse_ana_log(struct nvme_ctrl *ctrl, void *data,
>>       return 0;
>>   }
>>
>> -static inline bool nvme_state_is_live(enum nvme_ana_state state)
>> -{
>> -    return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
>> -}
>> -
>>   static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
>>           struct nvme_ns *ns)
>>   {
>> @@ -1039,10 +1436,12 @@ static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys,
>>
>>       WRITE_ONCE(subsys->iopolicy, iopolicy);
>>
>> -    /* iopolicy changes clear the mpath by design */
>> +    /* iopolicy changes clear/reset the mpath by design */
>>       mutex_lock(&nvme_subsystems_lock);
>>       list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
>>           nvme_mpath_clear_ctrl_paths(ctrl);
>> +    list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
>> +        nvme_mpath_set_ctrl_paths(ctrl);
>>       mutex_unlock(&nvme_subsystems_lock);
>>         pr_notice("subsysnqn %s iopolicy changed from %s to %s\n",
>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>> index 102fae6a231c..715c7053054c 100644
>> --- a/drivers/nvme/host/nvme.h
>> +++ b/drivers/nvme/host/nvme.h
>> @@ -28,7 +28,10 @@ extern unsigned int nvme_io_timeout;
>>   extern unsigned int admin_timeout;
>>   #define NVME_ADMIN_TIMEOUT    (admin_timeout * HZ)
>>
>> -#define NVME_DEFAULT_KATO    5
>> +#define NVME_DEFAULT_KATO        5
>> +
>> +#define NVME_DEFAULT_ADP_EWMA_SHIFT    3
>> +#define NVME_DEFAULT_ADP_WEIGHT_TIMEOUT    (15 * NSEC_PER_SEC)
> 
> You need these defines outside of nvme-mpath?
> 
Currently, those macros are used in nvme/host/core.c.
I can move them inside that source file.

>>
>>   #ifdef CONFIG_ARCH_NO_SG_CHAIN
>>   #define  NVME_INLINE_SG_CNT  0
>> @@ -421,6 +424,7 @@ enum nvme_iopolicy {
>>       NVME_IOPOLICY_NUMA,
>>       NVME_IOPOLICY_RR,
>>       NVME_IOPOLICY_QD,
>> +    NVME_IOPOLICY_ADAPTIVE,
>>   };
>>
>>   struct nvme_subsystem {
>> @@ -459,6 +463,37 @@ struct nvme_ns_ids {
>>       u8    csi;
>>   };
>>
>> +enum nvme_stat_group {
>> +    NVME_STAT_READ,
>> +    NVME_STAT_WRITE,
>> +    NVME_STAT_OTHER,
>> +    NVME_NUM_STAT_GROUPS
>> +};
> 
> I see you have stats per io direction. However you don't have it per IO size. I wonder
> how this plays into this iopolicy.
> 
Yes, you're correct, and as mentioned earlier we'd measure latency per
512-byte block.

>> +
>> +struct nvme_path_stat {
>> +    u64 nr_samples;        /* total num of samples processed */
>> +    u64 nr_ignored;        /* num. of samples ignored */
>> +    u64 slat_ns;        /* smoothed (ewma) latency in nanoseconds */
>> +    u64 score;        /* score used for weight calculation */
>> +    u64 last_weight_ts;    /* timestamp of the last weight calculation */
>> +    u64 sel;        /* num of times this path is selected for I/O */
>> +    u64 batch;        /* accumulated latency sum for current window */
>> +    u32 batch_count;    /* num of samples accumulated in current window */
>> +    u32 weight;        /* path weight */
>> +    u32 credit;        /* path credit for I/O forwarding */
>> +};
> 
> I'm still not convinced that having this be per-cpu-per-ns really makes sense.

I understand your concern about whether it really makes sense to keep this 
per-cpu-per-ns, and I see your point that you would prefer maintaining the
stat per-hctx instead of per-CPU.

However, as mentioned earlier, during path selection we cannot reliably map an
I/O to a specific hctx, so using per-hctx statistics becomes problematic in 
practice. On the other hand, maintaining the metrics per-CPU has an additional
advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
the NUMA distance between the workload’s CPU and the I/O controller. This means
that on multi-node systems, the policy can automatically favor I/O paths/controllers
that are local/near to the CPU issuing the request, which may lead to better
latency characteristics.

Really appreciate your feedback/comments!

Thanks,
--Nilay




* Re: [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy
  2025-12-12 12:08 ` Sagi Grimberg
@ 2025-12-13  8:22   ` Nilay Shroff
  0 siblings, 0 replies; 28+ messages in thread
From: Nilay Shroff @ 2025-12-13  8:22 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme
  Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce



On 12/12/25 5:38 PM, Sagi Grimberg wrote:
> 
> 
> On 05/11/2025 12:33, Nilay Shroff wrote:
>> Hi,
>>
>> This series introduces a new adaptive I/O policy for NVMe native
>> multipath. Existing policies such as numa, round-robin, and queue-depth
>> are static and do not adapt to real-time transport performance.
> 
> It can be argued that queue-depth is a proxy of latency.
> 
>>   The numa
>> selects the path closest to the NUMA node of the current CPU, optimizing
>> memory and path locality, but ignores actual path performance. The
>> round-robin distributes I/O evenly across all paths, providing fairness
>> but not performance awareness. The queue-depth reacts to instantaneous
>> queue occupancy, avoiding heavily loaded paths, but does not account for
>> actual latency, throughput, or link speed.
>>
>> The new adaptive policy addresses these gaps selecting paths dynamically
>> based on measured I/O latency for both PCIe and fabrics.
> 
> Adaptive is not a good name. Maybe weighted-latency or wplat (weighted path latency)
> or something like that.
> 
Yeah, I also talked to Hannes about this, and he suggested naming it either
"weighted-latency" or "ewma-latency". Which do you prefer?

>>   Latency is
>> derived by passively sampling I/O completions. Each path is assigned a
>> weight proportional to its latency score, and I/Os are then forwarded
>> accordingly. As condition changes (e.g. latency spikes, bandwidth
>> differences), path weights are updated, automatically steering traffic
>> toward better-performing paths.
>>
>> Early results show reduced tail latency under mixed workloads and
>> improved throughput by exploiting higher-speed links more effectively.
>> For example, with NVMf/TCP using two paths (one throttled with ~30 ms
>> delay), fio results with random read/write/rw workloads (direct I/O)
>> showed:
>>
>>          numa         round-robin   queue-depth  adaptive
>>          -----------  -----------   -----------  ---------
>> READ:   50.0 MiB/s   105 MiB/s     230 MiB/s    350 MiB/s
>> WRITE:  65.9 MiB/s   125 MiB/s     385 MiB/s    446 MiB/s
>> RW:     R:30.6 MiB/s R:56.5 MiB/s  R:122 MiB/s  R:175 MiB/s
>>          W:30.7 MiB/s W:56.5 MiB/s  W:122 MiB/s  W:175 MiB/s
> 
> Seems like a nice gain.
> Can you please test for the normal symmetric paths case? Would like
> to see the trade-off...
Yes, I've already tested that. I currently don’t have access to the system,
but based on my earlier runs, the performance for the symmetric-path case
was noticeably better than in the NUMA scenario, and roughly in the same
(or slightly better) range as the round-robin/qdepth I/O policy. I will
share those numbers later, once I regain access.

Thanks,
--Nilay  




* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2025-12-13  7:27     ` Nilay Shroff
@ 2025-12-15 23:36       ` Sagi Grimberg
  2025-12-18 11:19         ` Nilay Shroff
  0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-15 23:36 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme; +Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce



On 13/12/2025 9:27, Nilay Shroff wrote:
>
> On 12/12/25 6:34 PM, Sagi Grimberg wrote:
>>
>> On 05/11/2025 12:33, Nilay Shroff wrote:
>>> This commit introduces a new I/O policy named "adaptive". Users can
>>> configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
>>> subsystemX/iopolicy"
>>>
>>> The adaptive policy dynamically distributes I/O based on measured
>>> completion latency. The main idea is to calculate latency for each path,
>>> derive a weight, and then proportionally forward I/O according to those
>>> weights.
>>>
>>> To ensure scalability, path latency is measured per-CPU. Each CPU
>>> maintains its own statistics, and I/O forwarding uses these per-CPU
>>> values.
>> So a given cpu would select path-a vs. another cpu that may select path-b?
>> How does that play with less queues than cpu cores? what happens to cores
>> that have low traffic?
>>
> The path-selection logic does not depend on the relationship between the number
> of CPUs and the number of hardware queues. It simply selects a path based on the
> per-CPU path score/credit, which reflects the relative performance of each available
> path.
> For example, assume we have two paths (A and B) to the same shared namespace.
> For each CPU, we maintain a smoothed latency estimate for every path. From these
> latency values we derive a per-path score or credit. The credit represents the relative
> share of I/O that each path should receive: a path with lower observed latency gets more
> credit, and a path with higher latency gets less.

I understand that the stats are maintained per-cpu, but I am not sure
that having per-cpu path weights makes sense. That means that with paths
a, b, and c, cpu0 will have one set of weights and cpu1 will have another.

What if a given cpu happened to schedule some other application in a way
that impacts completion latency? Won't that skew the sampling? That is not
related to the path at all, and it is possibly more noticeable with tcp,
which completes in a kthread context.

What do we lose if the 15-second weight assignment averages the samples
from all cpus? Won't that mitigate, to some extent, the issue of
non-path-related latency skew?
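
One way to read this suggestion, as a userspace sketch only: keep the cheap
per-cpu accumulation, but fold all cpus into a single per-path average when
the window closes. struct cpu_batch and path_avg_latency_ns() are
hypothetical names, not anything in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* What each cpu accumulated for one path during the current window. */
struct cpu_batch {
    uint64_t lat_sum_ns;  /* sum of sampled latencies */
    uint32_t count;       /* number of samples */
};

/*
 * Sample-weighted average over all cpus: a cpu that saw few I/Os, or whose
 * completions were delayed by unrelated scheduling, contributes only in
 * proportion to its sample count. Returns 0 when there are no samples.
 */
static uint64_t path_avg_latency_ns(const struct cpu_batch *b, size_t nr_cpus)
{
    uint64_t sum = 0, count = 0;
    size_t i;

    for (i = 0; i < nr_cpus; i++) {
        sum += b[i].lat_sum_ns;
        count += b[i].count;
    }
    return count ? sum / count : 0;
}
```

The resulting single value per path would then feed the EWMA and weight
calculation, so every cpu sees the same weights for a given path.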

>
> I/O distribution is thus governed directly by the available credits on that CPU. When the
> NVMe multipath driver performs path selection, it chooses the path with sufficient credits,
> updates the bio’s bdev to correspond to that path, and submits the bio. Only after this
> point does the block layer map the bio to an hctx through the usual ctx->hctx mapping (i.e.,
> matching the issuing CPU to the appropriate hardware queue). In other words, the multipath
> policy runs above the block-layer queueing logic, and the number of hardware queues does
> not affect how paths are scored or selected.

This is potentially another problem. An application may jump between cpu
cores due to scheduling constraints. In that case, how does the path
selection policy adhere to the path weights?

What I'm trying to say is that path selection should inherently reflect
the path, not the cpu core that was accessing it. What I am concerned
about is how this behaves in the real world. Your tests run with a very
distinct, artificial path variance, and they do not include other workloads
running on the system that can impact completion latency.

It is possible that what I'm raising here is not a real concern, but I
think we need to be able to demonstrate that.

>
>>> Every ~15 seconds, a simple average latency of the per-CPU batched
>>> samples is computed and fed into an Exponentially Weighted Moving
>>> Average (EWMA):
>> I suggest to have iopolicy name reflect ewma. maybe "ewma-lat"?
> Okay that sounds good! Shall we name it "ewma-lat" or "weighted-lat"?

weighted-lat is simpler.

>
>    Path weights are then derived from the smoothed (EWMA)
> latency as follows (example with two paths A and B):
>
>       path_A_score = NSEC_PER_SEC / path_A_ewma_latency
>       path_B_score = NSEC_PER_SEC / path_B_ewma_latency
>       total_score  = path_A_score + path_B_score
>
>       path_A_weight = (path_A_score * 100) / total_score
>       path_B_weight = (path_B_score * 100) / total_score
>
>> What happens to R/W mixed workloads? What happens when the I/O pattern
>> has a distribution of block sizes?
>>
> We maintain separate metrics for READ and WRITE traffic, and during path
> selection we use the appropriate metric depending on the I/O type.
>
> Regarding block-size variability: the current implementation does not yet
> account for I/O size. This is an important point — thank you for raising it.
> I discussed this today with Hannes at LPC, and we agreed that a practical
> approach is to normalize latency per 512-byte block. For our purposes, we
> do not need an exact latency value; a relative latency metric is sufficient,
> as it ultimately feeds into path scoring. A path with higher latency ends up
> with a lower score, and a path with lower latency gets a higher score — the
> exact absolute values are less important than maintaining consistent proportional
> relationships.

I am not sure that normalizing to 512-byte blocks is a good proxy. I think
that large I/O will have a much lower amortized latency per 512-byte block,
which could create a false bias toward placing a high weight on a path that
happened to host large I/Os, no?

In my mind, having buckets for I/O sizes would probably give a better
approximation of the path weights, wouldn't it?
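
A minimal sketch of what such buckets could look like, in userspace C. The
4 KiB and 64 KiB thresholds, the three-bucket split and all names here are
assumptions for illustration; nothing in the patch defines them.

```c
#include <stdint.h>

/* Hypothetical I/O size classes; the boundaries are arbitrary examples. */
enum io_size_bucket {
    IO_SIZE_SMALL,   /* up to 4 KiB */
    IO_SIZE_MEDIUM,  /* up to 64 KiB */
    IO_SIZE_LARGE,   /* anything bigger */
    IO_SIZE_NR_BUCKETS
};

struct bucket_stat {
    uint64_t lat_sum_ns;  /* sum of latencies seen in this size class */
    uint32_t count;       /* number of samples in this size class */
};

static enum io_size_bucket io_size_to_bucket(uint32_t bytes)
{
    if (bytes <= 4096)
        return IO_SIZE_SMALL;
    if (bytes <= 65536)
        return IO_SIZE_MEDIUM;
    return IO_SIZE_LARGE;
}

/*
 * Account a completion in the bucket matching its size, so that large and
 * small I/Os are only ever compared against their own size class.
 */
static void bucket_add_sample(struct bucket_stat *b, uint32_t bytes,
                              uint64_t lat_ns)
{
    struct bucket_stat *s = &b[io_size_to_bucket(bytes)];

    s->lat_sum_ns += lat_ns;
    s->count++;
}
```

Path selection would then look up the weight for the bucket of the bio being
submitted, at the cost of one more dimension in the per-path statistics.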


>
> Normalizing latency per 512 bytes gives us a stable, size-aware metric that scales
> across different I/O block sizes. I think it's easy to normalize a latency number
> per 512-byte block, and I'll implement that in the next patch version.

I am not sure. Maybe it is.
The main issue I have here is that you are trying to find asymmetry
between paths, but you are adding entropy by not taking a few factors
into account:
1. I/O size
2. cpu scheduling
3. application cpu affinity changes over time

Now I don't know if these aspects actually make a difference, or whether
this is just hypothetical, but I think we need to cover them when we test
the proposed iopolicy...

>> I think that in order to understand how a non-trivial path selector works we need
>> thorough testing in a variety of I/O patterns.
>>
> Yes that was done running fio with different I/O engines, I/O types (read, write, r/w) and
> different block sizes. I tested it using NVMe pcie and nvmf-tcp. The tests were performed
> for both direct and buffered I/O. Also I ran blktests configuring adaptive I/O policy.
> Still if you prefer running anything further let me know.

Maybe run with higher nice values? Or run other processes on the host in
parallel? Maybe processes that also make heavier use of the network?

I don't think this is a viable approach for pcie in reality; most likely
it is exclusive to fabrics.

>
>>> where:
>>>     - path_X_ewma_latency is the smoothed latency of a path in nanoseconds
>>>     - NSEC_PER_SEC is used as a scaling factor since valid latencies
>>>       are < 1 second
>>>     - weights are normalized to a 0–64 scale across all paths.
>>>
>>> Path credits are refilled based on this weight, with one credit
>>> consumed per I/O. When all credits are consumed, the credits are
>>> refilled again based on the current weight. This ensures that I/O is
>>> distributed across paths proportionally to their calculated weight.
>>>
>>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>> ---
>>>    drivers/nvme/host/core.c      |  15 +-
>>>    drivers/nvme/host/ioctl.c     |  31 ++-
>>>    drivers/nvme/host/multipath.c | 425 ++++++++++++++++++++++++++++++++--
>>>    drivers/nvme/host/nvme.h      |  74 +++++-
>>>    drivers/nvme/host/pr.c        |   6 +-
>>>    drivers/nvme/host/sysfs.c     |   2 +-
>>>    6 files changed, 530 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index fa4181d7de73..47f375c63d2d 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -672,6 +672,9 @@ static void nvme_free_ns_head(struct kref *ref)
>>>        cleanup_srcu_struct(&head->srcu);
>>>        nvme_put_subsystem(head->subsys);
>>>        kfree(head->plids);
>>> +#ifdef CONFIG_NVME_MULTIPATH
>>> +    free_percpu(head->adp_path);
>>> +#endif
>>>        kfree(head);
>>>    }
>>>
>>> @@ -689,6 +692,7 @@ static void nvme_free_ns(struct kref *kref)
>>>    {
>>>        struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
>>>
>>> +    nvme_free_ns_stat(ns);
>>>        put_disk(ns->disk);
>>>        nvme_put_ns_head(ns->head);
>>>        nvme_put_ctrl(ns->ctrl);
>>> @@ -4137,6 +4141,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>>        if (nvme_init_ns_head(ns, info))
>>>            goto out_cleanup_disk;
>>>
>>> +    if (nvme_alloc_ns_stat(ns))
>>> +        goto out_unlink_ns;
>>> +
>>>        /*
>>>         * If multipathing is enabled, the device name for all disks and not
>>>         * just those that represent shared namespaces needs to be based on the
>>> @@ -4161,7 +4168,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>>        }
>>>
>>>        if (nvme_update_ns_info(ns, info))
>>> -        goto out_unlink_ns;
>>> +        goto out_free_ns_stat;
>>>
>>>        mutex_lock(&ctrl->namespaces_lock);
>>>        /*
>>> @@ -4170,7 +4177,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>>         */
>>>        if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
>>>            mutex_unlock(&ctrl->namespaces_lock);
>>> -        goto out_unlink_ns;
>>> +        goto out_free_ns_stat;
>>>        }
>>>        nvme_ns_add_to_ctrl_list(ns);
>>>        mutex_unlock(&ctrl->namespaces_lock);
>>> @@ -4201,6 +4208,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>>        list_del_rcu(&ns->list);
>>>        mutex_unlock(&ctrl->namespaces_lock);
>>>        synchronize_srcu(&ctrl->srcu);
>>> +out_free_ns_stat:
>>> +    nvme_free_ns_stat(ns);
>>>     out_unlink_ns:
>>>        mutex_lock(&ctrl->subsys->lock);
>>>        list_del_rcu(&ns->siblings);
>>> @@ -4244,6 +4253,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>         */
>>>        synchronize_srcu(&ns->head->srcu);
>>>    +    nvme_mpath_cancel_adaptive_path_weight_work(ns);
>>> +
>> I personally think that the check on path stats should be done in the call-site
>> and not in the function itself.
> Hmm, can you please elaborate on this point further? I think, I am unable to get
> your point here.

nvme_mpath_cancel_adaptive_path_weight_work() may or may not do anything, depending on
whether path stats are enabled. I'd prefer that this check be made here, at the call
site, and not inside the function.
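
To illustrate the shape I mean, here is a minimal userspace sketch, not the patch code. The struct and function names (ns_model, cancel_weight_work, ns_remove_model) are hypothetical stand-ins for struct nvme_ns, the cancel helper and nvme_ns_remove(); the bool models test_bit(NVME_NS_PATH_STAT, &ns->flags).

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct nvme_ns and its NVME_NS_PATH_STAT flag. */
struct ns_model {
	bool path_stat_enabled;	/* models test_bit(NVME_NS_PATH_STAT, ...) */
	int cancel_calls;	/* counts how often the cancel helper ran */
};

/* The helper does its work unconditionally; it no longer tests the flag. */
static void cancel_weight_work(struct ns_model *ns)
{
	ns->cancel_calls++;
}

/* The caller decides whether the helper is relevant at all. */
static void ns_remove_model(struct ns_model *ns)
{
	if (ns->path_stat_enabled)
		cancel_weight_work(ns);
}
```

With this layout a reader of the removal path sees immediately that the cancel step is conditional, without having to open the helper.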



>
>>>        /* wait for concurrent submissions */
>>>        if (nvme_mpath_clear_current_path(ns))
>>>            synchronize_srcu(&ns->head->srcu);
>>> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
>>> index c212fa952c0f..759d147d9930 100644
>>> --- a/drivers/nvme/host/ioctl.c
>>> +++ b/drivers/nvme/host/ioctl.c
>>> @@ -700,18 +700,29 @@ static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
>>>    int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>>>            unsigned int cmd, unsigned long arg)
>>>    {
>>> +    u8 opcode;
>>>        struct nvme_ns_head *head = bdev->bd_disk->private_data;
>>>        bool open_for_write = mode & BLK_OPEN_WRITE;
>>>        void __user *argp = (void __user *)arg;
>>>        struct nvme_ns *ns;
>>>        int srcu_idx, ret = -EWOULDBLOCK;
>>>        unsigned int flags = 0;
>>> +    unsigned int op_type = NVME_STAT_OTHER;
>>>          if (bdev_is_partition(bdev))
>>>            flags |= NVME_IOCTL_PARTITION;
>>>    +    if (cmd == NVME_IOCTL_SUBMIT_IO) {
>>> +        if (get_user(opcode, (u8 *)argp))
>>> +            return -EFAULT;
>>> +        if (opcode == nvme_cmd_write)
>>> +            op_type = NVME_STAT_WRITE;
>>> +        else if (opcode == nvme_cmd_read)
>>> +            op_type = NVME_STAT_READ;
>>> +    }
>>> +
>>>        srcu_idx = srcu_read_lock(&head->srcu);
>>> -    ns = nvme_find_path(head);
>>> +    ns = nvme_find_path(head, op_type);
>> Perhaps it would be easier to review if you split passing opcode to nvme_find_path()
>> to a prep patch (explaining that the new iopolicy will leverage it)
>>
> Sure, makes sense. I'll split this into a prep patch as you suggested.
>>>        if (!ns)
>>>            goto out_unlock;
>>>    @@ -733,6 +744,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>>>    long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>>>            unsigned long arg)
>>>    {
>>> +    u8 opcode;
>>>        bool open_for_write = file->f_mode & FMODE_WRITE;
>>>        struct cdev *cdev = file_inode(file)->i_cdev;
>>>        struct nvme_ns_head *head =
>>> @@ -740,9 +752,19 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>>>        void __user *argp = (void __user *)arg;
>>>        struct nvme_ns *ns;
>>>        int srcu_idx, ret = -EWOULDBLOCK;
>>> +    unsigned int op_type = NVME_STAT_OTHER;
>>> +
>>> +    if (cmd == NVME_IOCTL_SUBMIT_IO) {
>>> +        if (get_user(opcode, (u8 *)argp))
>>> +            return -EFAULT;
>>> +        if (opcode == nvme_cmd_write)
>>> +            op_type = NVME_STAT_WRITE;
>>> +        else if (opcode == nvme_cmd_read)
>>> +            op_type = NVME_STAT_READ;
>>> +    }
>>>          srcu_idx = srcu_read_lock(&head->srcu);
>>> -    ns = nvme_find_path(head);
>>> +    ns = nvme_find_path(head, op_type);
>>>        if (!ns)
>>>            goto out_unlock;
>>>    @@ -762,7 +784,10 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
>>>        struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>>>        struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
>>>        int srcu_idx = srcu_read_lock(&head->srcu);
>>> -    struct nvme_ns *ns = nvme_find_path(head);
>>> +    const struct nvme_uring_cmd *cmd = io_uring_sqe_cmd(ioucmd->sqe);
>>> +    struct nvme_ns *ns = nvme_find_path(head,
>>> +            READ_ONCE(cmd->opcode) & 1 ?
>>> +            NVME_STAT_WRITE : NVME_STAT_READ);
>>>        int ret = -EINVAL;
>>>          if (ns)
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>> index 543e17aead12..55dc28375662 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -6,6 +6,9 @@
>>>    #include <linux/backing-dev.h>
>>>    #include <linux/moduleparam.h>
>>>    #include <linux/vmalloc.h>
>>> +#include <linux/blk-mq.h>
>>> +#include <linux/math64.h>
>>> +#include <linux/rculist.h>
>>>    #include <trace/events/block.h>
>>>    #include "nvme.h"
>>>    @@ -66,9 +69,10 @@ MODULE_PARM_DESC(multipath_always_on,
>>>        "create multipath node always except for private namespace with non-unique nsid; note that this also implicitly enables native multipath support");
>>>      static const char *nvme_iopolicy_names[] = {
>>> -    [NVME_IOPOLICY_NUMA]    = "numa",
>>> -    [NVME_IOPOLICY_RR]    = "round-robin",
>>> -    [NVME_IOPOLICY_QD]      = "queue-depth",
>>> +    [NVME_IOPOLICY_NUMA]     = "numa",
>>> +    [NVME_IOPOLICY_RR]     = "round-robin",
>>> +    [NVME_IOPOLICY_QD]       = "queue-depth",
>>> +    [NVME_IOPOLICY_ADAPTIVE] = "adaptive",
>>>    };
>>>      static int iopolicy = NVME_IOPOLICY_NUMA;
>>> @@ -83,6 +87,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>>>            iopolicy = NVME_IOPOLICY_RR;
>>>        else if (!strncmp(val, "queue-depth", 11))
>>>            iopolicy = NVME_IOPOLICY_QD;
>>> +    else if (!strncmp(val, "adaptive", 8))
>>> +        iopolicy = NVME_IOPOLICY_ADAPTIVE;
>>>        else
>>>            return -EINVAL;
>>>    @@ -198,6 +204,204 @@ void nvme_mpath_start_request(struct request *rq)
>>>    }
>>>    EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
>>>    +static void nvme_mpath_weight_work(struct work_struct *weight_work)
>>> +{
>>> +    int cpu, srcu_idx;
>>> +    u32 weight;
>>> +    struct nvme_ns *ns;
>>> +    struct nvme_path_stat *stat;
>>> +    struct nvme_path_work *work = container_of(weight_work,
>>> +            struct nvme_path_work, weight_work);
>>> +    struct nvme_ns_head *head = work->ns->head;
>>> +    int op_type = work->op_type;
>>> +    u64 total_score = 0;
>>> +
>>> +    cpu = get_cpu();
>>> +
>>> +    srcu_idx = srcu_read_lock(&head->srcu);
>>> +    list_for_each_entry_srcu(ns, &head->list, siblings,
>>> +            srcu_read_lock_held(&head->srcu)) {
>>> +
>>> +        stat = &this_cpu_ptr(ns->info)[op_type].stat;
>>> +        if (!READ_ONCE(stat->slat_ns)) {
>>> +            stat->score = 0;
>>> +            continue;
>>> +        }
>>> +        /*
>>> +         * Compute the path score as the inverse of smoothed
>>> +         * latency, scaled by NSEC_PER_SEC. Floating point
>>> +         * math is unavailable in the kernel, so fixed-point
>>> +         * scaling is used instead. NSEC_PER_SEC is chosen
>>> +         * because valid latencies are always < 1 second; longer
>>> +         * latencies are ignored.
>>> +         */
>>> +        stat->score = div_u64(NSEC_PER_SEC, READ_ONCE(stat->slat_ns));
>>> +
>>> +        /* Compute total score. */
>>> +        total_score += stat->score;
>>> +    }
>>> +
>>> +    if (!total_score)
>>> +        goto out;
>>> +
>>> +    /*
>>> +     * After computing the total score, we derive per-path weight
>>> +     * (normalized to the range 0–64). The weight represents the
>>> +     * relative share of I/O the path should receive.
>>> +     *
>>> +     *   - lower smoothed latency -> higher weight
>>> +     *   - higher smoothed latency -> lower weight
>>> +     *
>>> +     * Next, while forwarding I/O, we assign "credits" to each path
>>> +     * based on its weight (please also refer nvme_adaptive_path()):
>>> +     *   - Initially, credits = weight.
>>> +     *   - Each time an I/O is dispatched on a path, its credits are
>>> +     *     decremented proportionally.
>>> +     *   - When a path runs out of credits, it becomes temporarily
>>> +     *     ineligible until credit is refilled.
>>> +     *
>>> +     * I/O distribution is therefore governed by available credits,
>>> +     * ensuring that over time the proportion of I/O sent to each
>>> +     * path matches its weight (and thus its performance).
>>> +     */
>>> +    list_for_each_entry_srcu(ns, &head->list, siblings,
>>> +            srcu_read_lock_held(&head->srcu)) {
>>> +
>>> +        stat = &this_cpu_ptr(ns->info)[op_type].stat;
>>> +        weight = div_u64(stat->score * 64, total_score);
>>> +
>>> +        /*
>>> +         * Ensure the path weight never drops below 1. A weight
>>> +         * of 0 is used only for newly added paths. During
>>> +         * bootstrap, a few I/Os are sent to such paths to
>>> +         * establish an initial weight. Enforcing a minimum
>>> +         * weight of 1 guarantees that no path is forgotten and
>>> +         * that each path is probed at least occasionally.
>>> +         */
>>> +        if (!weight)
>>> +            weight = 1;
>>> +
>>> +        WRITE_ONCE(stat->weight, weight);
>>> +    }
>>> +out:
>>> +    srcu_read_unlock(&head->srcu, srcu_idx);
>>> +    put_cpu();
>>> +}
>>> +
>>> +/*
>>> + * Formula to calculate the EWMA (Exponentially Weighted Moving Average):
>>> + * ewma = (old_ewma * ((1 << EWMA_SHIFT) - 1) + new_sample) >> EWMA_SHIFT
>>> + * For instance, with EWMA_SHIFT = 3, this assigns 7/8 (~87.5 %) weight to
>>> + * the existing/old ewma and 1/8 (~12.5%) weight to the new sample.
>>> + */
>>> +static inline u64 ewma_update(u64 old, u64 new)
>> it is a calculation function, let's call it calc_ewma_update
> Yeah, will do this in next patch version.
>
>>> +{
>>> +    return (old * ((1 << NVME_DEFAULT_ADP_EWMA_SHIFT) - 1)
>>> +            + new) >> NVME_DEFAULT_ADP_EWMA_SHIFT;
>>> +}
>>> +
>>> +static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
>>> +{
>>> +    int cpu;
>>> +    unsigned int op_type;
>>> +    struct nvme_path_info *info;
>>> +    struct nvme_path_stat *stat;
>>> +    u64 now, latency, slat_ns, avg_lat_ns;
>>> +    struct nvme_ns_head *head = ns->head;
>>> +
>>> +    if (list_is_singular(&head->list))
>>> +        return;
>>> +
>>> +    now = ktime_get_ns();
>>> +    latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
>>> +    if (!latency)
>>> +        return;
>>> +
>>> +    /*
>>> +     * As completion code path is serialized(i.e. no same completion queue
>>> +     * update code could run simultaneously on multiple cpu) we can safely
>>> +     * access per cpu nvme path stat here from another cpu (in case the
>>> +     * completion cpu is different from submission cpu).
>>> +     * The only field which could be accessed simultaneously here is the
>>> +     * path ->weight which may be accessed by this function as well as I/O
>>> +     * submission path during path selection logic and we protect ->weight
>>> +     * using READ_ONCE/WRITE_ONCE. Yes this may not be 100% accurate but
>>> +     * we also don't need to be so accurate here as the path credit would
>>> +     * be anyways refilled, based on path weight, once path consumes all
>>> +     * its credits. And we limit path weight/credit max up to 64. Please
>>> +     * also refer nvme_adaptive_path().
>>> +     */
>>> +    cpu = blk_mq_rq_cpu(rq);
>>> +    op_type = nvme_data_dir(req_op(rq));
>>> +    info = &per_cpu_ptr(ns->info, cpu)[op_type];
>> info is really really really confusing and generic and not representative of what
>> "info" it is used for. maybe path_lat? or path_stats? anything is better than info.
>>
> Maybe I am too used to this code, so I never noticed it. But yes, agreed, I
> will rename it to @path_lat.
>
>>> +    stat = &info->stat;
>>> +
>>> +    /*
>>> +     * If latency > ~1s then ignore this sample to prevent EWMA from being
>>> +     * skewed by pathological outliers (multi-second waits, controller
>>> +     * timeouts etc.). This keeps path scores representative of normal
>>> +     * performance and avoids instability from rare spikes. If such high
>>> +     * latency is real, ANA state reporting or keep-alive error counters
>>> +     * will mark the path unhealthy and remove it from the head node list,
>>> +     * so we safely skip such sample here.
>>> +     */
>>> +    if (unlikely(latency > NSEC_PER_SEC)) {
>>> +        stat->nr_ignored++;
>>> +        dev_warn_ratelimited(ns->ctrl->device,
>>> +            "ignoring sample with >1s latency (possible controller stall or timeout)\n");
>>> +        return;
>>> +    }
>>> +
>>> +    /*
>>> +     * Accumulate latency samples and increment the batch count for each
>>> +     * ~15 second interval. When the interval expires, compute the simple
>>> +     * average latency over that window, then update the smoothed (EWMA)
>>> +     * latency. The path weight is recalculated based on this smoothed
>>> +     * latency.
>>> +     */
>>> +    stat->batch += latency;
>>> +    stat->batch_count++;
>>> +    stat->nr_samples++;
>>> +
>>> +    if (now > stat->last_weight_ts &&
>>> +        (now - stat->last_weight_ts) >= NVME_DEFAULT_ADP_WEIGHT_TIMEOUT) {
>>> +
>>> +        stat->last_weight_ts = now;
>>> +
>>> +        /*
>>> +         * Find simple average latency for the last epoch (~15 sec
>>> +         * interval).
>>> +         */
>>> +        avg_lat_ns = div_u64(stat->batch, stat->batch_count);
>>> +
>>> +        /*
>>> +         * Calculate smooth/EWMA (Exponentially Weighted Moving Average)
>>> +         * latency. EWMA is preferred over simple average latency
>>> +         * because it smooths naturally, reduces jitter from sudden
>>> +         * spikes, and adapts faster to changing conditions. It also
>>> +         * avoids storing historical samples, and works well for both
>>> +         * slow and fast I/O rates.
>>> +         * Formula:
>>> +         * slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
>>> +         * With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
>>> +         * existing latency and 1/8 (~12.5%) weight to the new latency.
>>> +         */
>>> +        if (unlikely(!stat->slat_ns))
>>> +            WRITE_ONCE(stat->slat_ns, avg_lat_ns);
>>> +        else {
>>> +            slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
>>> +            WRITE_ONCE(stat->slat_ns, slat_ns);
>>> +        }
>>> +
>>> +        stat->batch = stat->batch_count = 0;
>>> +
>>> +        /*
>>> +         * Defer calculation of the path weight in per-cpu workqueue.
>>> +         */
>>> +        schedule_work_on(cpu, &info->work.weight_work);
>> I'm unsure if the percpu is a good choice here. Don't you want it per hctx at least?
>> workloads tend to bounce quite a bit between cpu cores... we have systems with hundreds of
>> cpu cores.
> As I explained earlier, in NVMe multipath driver code we don't know hctx while
> we choose path. The ctx to hctx mapping happens later in the block layer while
> submitting bio.

yes, hctx is not really relevant.

>   Here we calculate the path score per-cpu and utilize it while
> choosing path to forward I/O.
>
>>> +    }
>>> +}
>>> +
>>>    void nvme_mpath_end_request(struct request *rq)
>>>    {
>>>        struct nvme_ns *ns = rq->q->queuedata;
>>> @@ -205,6 +409,9 @@ void nvme_mpath_end_request(struct request *rq)
>>>        if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)
>>>            atomic_dec_if_positive(&ns->ctrl->nr_active);
>>>    +    if (test_bit(NVME_NS_PATH_STAT, &ns->flags))
>>> +        nvme_mpath_add_sample(rq, ns);
>>> +
>> Is doing all this work for EVERY completion really worth it?
>> Sounds like overkill.
> We don't really do much in nvme_mpath_add_sample() other than just
> adding the latency sample to the batch. The real work, where we calculate
> the path score, is done once every ~15 seconds and that is done
> in a separate workqueue. So we don't do any heavy lifting here during
> I/O completion processing.
>
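
For reference, the split described above (cheap accumulation per completion, EWMA update only when the window expires) can be sketched in userspace C roughly as follows. This is a simplified model, not the patch code: path_stat_model, add_sample() and the three macros are illustrative names, and the deferred weight recalculation is reduced to a return value.

```c
#include <stdint.h>

#define EWMA_SHIFT	3				/* new sample gets 1/8 weight */
#define WINDOW_NS	(15ULL * 1000000000ULL)		/* ~15 second window */
#define MAX_LATENCY_NS	1000000000ULL			/* samples above 1 s are dropped */

struct path_stat_model {
	uint64_t batch;		/* latency sum for the current window */
	uint32_t batch_count;	/* samples in the current window */
	uint64_t slat_ns;	/* smoothed latency; 0 means no estimate yet */
	uint64_t last_ts;	/* time of the last window expiry */
};

static uint64_t ewma_update(uint64_t old, uint64_t sample)
{
	return (old * ((1ULL << EWMA_SHIFT) - 1) + sample) >> EWMA_SHIFT;
}

/* Returns 1 if the window expired and slat_ns was updated, else 0. */
static int add_sample(struct path_stat_model *st, uint64_t now, uint64_t latency)
{
	uint64_t avg;

	if (!latency || latency > MAX_LATENCY_NS)
		return 0;

	/* Per-completion cost: two additions and a comparison. */
	st->batch += latency;
	st->batch_count++;

	if (now <= st->last_ts || now - st->last_ts < WINDOW_NS)
		return 0;

	/* Window expired: fold the window average into the EWMA. */
	st->last_ts = now;
	avg = st->batch / st->batch_count;
	st->slat_ns = st->slat_ns ? ewma_update(st->slat_ns, avg) : avg;
	st->batch = 0;
	st->batch_count = 0;
	return 1;
}
```

In this model a previous estimate of 800 ns combined with a window average of 1600 ns gives (800 * 7 + 1600) >> 3 = 900 ns, so a single window moves the estimate by one eighth of the difference.
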
>>>        if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
>>>            return;
>>>        bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
>>> @@ -238,6 +445,62 @@ static const char *nvme_ana_state_names[] = {
>>>        [NVME_ANA_CHANGE]        = "change",
>>>    };
>>>    +static void nvme_mpath_reset_adaptive_path_stat(struct nvme_ns *ns)
>>> +{
>>> +    int i, cpu;
>>> +    struct nvme_path_stat *stat;
>>> +
>>> +    for_each_possible_cpu(cpu) {
>>> +        for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>>> +            stat = &per_cpu_ptr(ns->info, cpu)[i].stat;
>>> +            memset(stat, 0, sizeof(struct nvme_path_stat));
>>> +        }
>>> +    }
>>> +}
>>> +
>>> +void nvme_mpath_cancel_adaptive_path_weight_work(struct nvme_ns *ns)
>>> +{
>>> +    int i, cpu;
>>> +    struct nvme_path_info *info;
>>> +
>>> +    if (!test_bit(NVME_NS_PATH_STAT, &ns->flags))
>>> +        return;
>>> +
>>> +    for_each_online_cpu(cpu) {
>>> +        for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>>> +            info = &per_cpu_ptr(ns->info, cpu)[i];
>>> +            cancel_work_sync(&info->work.weight_work);
>>> +        }
>>> +    }
>>> +}
>>> +
>>> +static bool nvme_mpath_enable_adaptive_path_policy(struct nvme_ns *ns)
>>> +{
>>> +    struct nvme_ns_head *head = ns->head;
>>> +
>>> +    if (!head->disk || head->subsys->iopolicy != NVME_IOPOLICY_ADAPTIVE)
>>> +        return false;
>>> +
>>> +    if (test_and_set_bit(NVME_NS_PATH_STAT, &ns->flags))
>>> +        return false;
>>> +
>>> +    blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, ns->queue);
>> This is an undocumented change...
> Sure, I will add a comment to this code in the next patch version.
>
>>> +    blk_stat_enable_accounting(ns->queue);
>>> +    return true;
>>> +}
>>> +
>>> +static bool nvme_mpath_disable_adaptive_path_policy(struct nvme_ns *ns)
>>> +{
>>> +
>>> +    if (!test_and_clear_bit(NVME_NS_PATH_STAT, &ns->flags))
>>> +        return false;
>>> +
>>> +    blk_stat_disable_accounting(ns->queue);
>>> +    blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, ns->queue);
>>> +    nvme_mpath_reset_adaptive_path_stat(ns);
>>> +    return true;
>>> +}
>>> +
>>>    bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
>>>    {
>>>        struct nvme_ns_head *head = ns->head;
>>> @@ -253,6 +516,8 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
>>>                changed = true;
>>>            }
>>>        }
>>> +    if (nvme_mpath_disable_adaptive_path_policy(ns))
>>> +        changed = true;
>> Don't understand why you are setting changed here? It indicates whether the current_path
>> was changed. Doesn't make sense to me.
>>
> I think you were correct. We don't have any rcu update here for adaptive path.
> Will remove this.
>
>>>    out:
>>>        return changed;
>>>    }
>>> @@ -271,6 +536,45 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
>>>        srcu_read_unlock(&ctrl->srcu, srcu_idx);
>>>    }
>>>    +int nvme_alloc_ns_stat(struct nvme_ns *ns)
>>> +{
>>> +    int i, cpu;
>>> +    struct nvme_path_work *work;
>>> +    gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
>>> +
>>> +    if (!ns->head->disk)
>>> +        return 0;
>>> +
>>> +    ns->info = __alloc_percpu_gfp(NVME_NUM_STAT_GROUPS *
>>> +            sizeof(struct nvme_path_info),
>>> +            __alignof__(struct nvme_path_info), gfp);
>>> +    if (!ns->info)
>>> +        return -ENOMEM;
>>> +
>>> +    for_each_possible_cpu(cpu) {
>>> +        for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>>> +            work = &per_cpu_ptr(ns->info, cpu)[i].work;
>>> +            work->ns = ns;
>>> +            work->op_type = i;
>>> +            INIT_WORK(&work->weight_work, nvme_mpath_weight_work);
>>> +        }
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static void nvme_mpath_set_ctrl_paths(struct nvme_ctrl *ctrl)
>> Does this function set any ctrl paths? your code is very confusing.
>>
> Here ctrl path means that we iterate through each controller's namespace paths
> and then set/enable the adaptive path parameters required for each path.
> Moreover, we already have the corresponding nvme_mpath_clear_ctrl_paths()
> function, which resets/clears the per-path parameters while changing the I/O
> policy.
>
>>> +{
>>> +    struct nvme_ns *ns;
>>> +    int srcu_idx;
>>> +
>>> +    srcu_idx = srcu_read_lock(&ctrl->srcu);
>>> +    list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
>>> +                srcu_read_lock_held(&ctrl->srcu))
>>> +        nvme_mpath_enable_adaptive_path_policy(ns);
>>> +    srcu_read_unlock(&ctrl->srcu, srcu_idx);
>> seems like it enables the iopolicy on all ctrl namespaces.
>> the enable should also be more explicit like:
>> nvme_enable_ns_lat_sampling or something like that.
>>
> okay, I'll rename it to the appropriate function name, as you suggested.
>
>>> +}
>>> +
>>>    void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>>>    {
>>>        struct nvme_ns_head *head = ns->head;
>>> @@ -283,6 +587,8 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>>>                     srcu_read_lock_held(&head->srcu)) {
>>>            if (capacity != get_capacity(ns->disk))
>>>                clear_bit(NVME_NS_READY, &ns->flags);
>>> +
>>> +        nvme_mpath_reset_adaptive_path_stat(ns);
>>>        }
>>>        srcu_read_unlock(&head->srcu, srcu_idx);
>>>    @@ -407,6 +713,92 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
>>>        return found;
>>>    }
>>>    +static inline bool nvme_state_is_live(enum nvme_ana_state state)
>>> +{
>>> +    return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
>>> +}
>>> +
>>> +static struct nvme_ns *nvme_adaptive_path(struct nvme_ns_head *head,
>>> +        unsigned int op_type)
>>> +{
>>> +    struct nvme_ns *ns, *start, *found = NULL;
>>> +    struct nvme_path_stat *stat;
>>> +    u32 weight;
>>> +    int cpu;
>>> +
>>> +    cpu = get_cpu();
>>> +    ns = *this_cpu_ptr(head->adp_path);
>>> +    if (unlikely(!ns)) {
>>> +        ns = list_first_or_null_rcu(&head->list,
>>> +                struct nvme_ns, siblings);
>>> +        if (unlikely(!ns))
>>> +            goto out;
>>> +    }
>>> +found_ns:
>>> +    start = ns;
>>> +    while (nvme_path_is_disabled(ns) ||
>>> +            !nvme_state_is_live(ns->ana_state)) {
>>> +        ns = list_next_entry_circular(ns, &head->list, siblings);
>>> +
>>> +        /*
>>> +         * If we iterate through all paths in the list but find each
>>> +         * path in list is either disabled or dead then bail out.
>>> +         */
>>> +        if (ns == start)
>>> +            goto out;
>>> +    }
>>> +
>>> +    stat = &this_cpu_ptr(ns->info)[op_type].stat;
>>> +
>>> +    /*
>>> +     * When the head path-list is singular we don't calculate the
>>> +     * only path weight for optimization as we don't need to forward
>>> +     * I/O to more than one path. Another possibility is when the
>>> +     * path is newly added and we don't know its weight. So we go
>>> +     * round-robin for each such path and forward I/O to it. Once we
>>> +     * start getting responses for such I/Os, the path weight
>>> +     * calculation would kick in and then we start using path credit
>>> +     * for forwarding I/O.
>>> +     */
>>> +    weight = READ_ONCE(stat->weight);
>>> +    if (!weight) {
>>> +        found = ns;
>>> +        goto out;
>>> +    }
>>> +
>>> +    /*
>>> +     * To keep path selection logic simple, we don't distinguish
>>> +     * between ANA optimized and non-optimized states. The non-
>>> +     * optimized path is expected to have a lower weight, and
>>> +     * therefore fewer credits. As a result, only a small number of
>>> +     * I/Os will be forwarded to paths in the non-optimized state.
>>> +     */
>>> +    if (stat->credit > 0) {
>>> +        --stat->credit;
>>> +        found = ns;
>>> +        goto out;
>>> +    } else {
>>> +        /*
>>> +         * Refill credit from path weight and move to next path. The
>>> +         * refilled credit of the current path will be used next when
>>> +         * all remaining paths exhaust their credits.
>>> +         */
>>> +        weight = READ_ONCE(stat->weight);
>>> +        stat->credit = weight;
>>> +        ns = list_next_entry_circular(ns, &head->list, siblings);
>>> +        if (likely(ns))
>>> +            goto found_ns;
>>> +    }
>>> +out:
>>> +    if (found) {
>>> +        stat->sel++;
>>> +        *this_cpu_ptr(head->adp_path) = found;
>>> +    }
>>> +
>>> +    put_cpu();
>>> +    return found;
>>> +}
>>> +
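
The credit scheme in nvme_adaptive_path() above can be modelled in userspace C as below. This is a simplified sketch under stated assumptions, not the patch code: paths live in a plain array instead of an SRCU list, "live" folds together the disabled and ANA state checks, and path_model / pick_path() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

struct path_model {
	uint32_t weight;	/* 0 means not yet measured */
	uint32_t credit;	/* I/Os left before this path must be refilled */
	int live;		/* usable: enabled and in a live ANA state */
};

/* Pick the next path index starting from *cur, or -1 if none is usable. */
static int pick_path(struct path_model *p, size_t n, size_t *cur)
{
	size_t i, step;

	if (!n)
		return -1;

	/* A live path is visited at most twice: once to refill, once to use. */
	for (step = 0, i = *cur % n; step < 2 * n; step++, i = (i + 1) % n) {
		if (!p[i].live)
			continue;
		if (!p[i].weight || p[i].credit > 0) {
			if (p[i].weight)
				p[i].credit--;	/* unmeasured paths use no credit */
			*cur = i;
			return (int)i;
		}
		p[i].credit = p[i].weight;	/* refill, then try the next path */
	}
	return -1;
}
```

With two live paths of weight 3 and 1 and empty credits, eight consecutive picks in this model land six times on the first path and twice on the second, which is the 3:1 split the weights ask for.
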
>>>    static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
>>>    {
>>>        struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
>>> @@ -463,9 +855,12 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
>>>        return ns;
>>>    }
>>>    -inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>> +inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head,
>>> +        unsigned int op_type)
>>>    {
>>>        switch (READ_ONCE(head->subsys->iopolicy)) {
>>> +    case NVME_IOPOLICY_ADAPTIVE:
>>> +        return nvme_adaptive_path(head, op_type);
>>>        case NVME_IOPOLICY_QD:
>>>            return nvme_queue_depth_path(head);
>>>        case NVME_IOPOLICY_RR:
>>> @@ -525,7 +920,7 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
>>>            return;
>>>          srcu_idx = srcu_read_lock(&head->srcu);
>>> -    ns = nvme_find_path(head);
>>> +    ns = nvme_find_path(head, nvme_data_dir(bio_op(bio)));
>>>        if (likely(ns)) {
>>>            bio_set_dev(bio, ns->disk->part0);
>>>            bio->bi_opf |= REQ_NVME_MPATH;
>>> @@ -567,7 +962,7 @@ static int nvme_ns_head_get_unique_id(struct gendisk *disk, u8 id[16],
>>>        int srcu_idx, ret = -EWOULDBLOCK;
>>>          srcu_idx = srcu_read_lock(&head->srcu);
>>> -    ns = nvme_find_path(head);
>>> +    ns = nvme_find_path(head, NVME_STAT_OTHER);
>>>        if (ns)
>>>            ret = nvme_ns_get_unique_id(ns, id, type);
>>>        srcu_read_unlock(&head->srcu, srcu_idx);
>>> @@ -583,7 +978,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
>>>        int srcu_idx, ret = -EWOULDBLOCK;
>>>          srcu_idx = srcu_read_lock(&head->srcu);
>>> -    ns = nvme_find_path(head);
>>> +    ns = nvme_find_path(head, NVME_STAT_OTHER);
>>>        if (ns)
>>>            ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
>>>        srcu_read_unlock(&head->srcu, srcu_idx);
>>> @@ -725,6 +1120,9 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
>>>        INIT_WORK(&head->partition_scan_work, nvme_partition_scan_work);
>>>        INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
>>>        head->delayed_removal_secs = 0;
>>> +    head->adp_path = alloc_percpu_gfp(struct nvme_ns*, GFP_KERNEL);
>>> +    if (!head->adp_path)
>>> +        return -ENOMEM;
>>>          /*
>>>         * If "multipath_always_on" is enabled, a multipath node is added
>>> @@ -809,6 +1207,10 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>>>        }
>>>        mutex_unlock(&head->lock);
>>>    +    mutex_lock(&nvme_subsystems_lock);
>>> +    nvme_mpath_enable_adaptive_path_policy(ns);
>>> +    mutex_unlock(&nvme_subsystems_lock);
>>> +
>>>        synchronize_srcu(&head->srcu);
>>>        kblockd_schedule_work(&head->requeue_work);
>>>    }
>>> @@ -857,11 +1259,6 @@ static int nvme_parse_ana_log(struct nvme_ctrl *ctrl, void *data,
>>>        return 0;
>>>    }
>>>    -static inline bool nvme_state_is_live(enum nvme_ana_state state)
>>> -{
>>> -    return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
>>> -}
>>> -
>>>    static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
>>>            struct nvme_ns *ns)
>>>    {
>>> @@ -1039,10 +1436,12 @@ static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys,
>>>          WRITE_ONCE(subsys->iopolicy, iopolicy);
>>>    -    /* iopolicy changes clear the mpath by design */
>>> +    /* iopolicy changes clear/reset the mpath by design */
>>>        mutex_lock(&nvme_subsystems_lock);
>>>        list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
>>>            nvme_mpath_clear_ctrl_paths(ctrl);
>>> +    list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
>>> +        nvme_mpath_set_ctrl_paths(ctrl);
>>>        mutex_unlock(&nvme_subsystems_lock);
>>>          pr_notice("subsysnqn %s iopolicy changed from %s to %s\n",
>>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>>> index 102fae6a231c..715c7053054c 100644
>>> --- a/drivers/nvme/host/nvme.h
>>> +++ b/drivers/nvme/host/nvme.h
>>> @@ -28,7 +28,10 @@ extern unsigned int nvme_io_timeout;
>>>    extern unsigned int admin_timeout;
>>>    #define NVME_ADMIN_TIMEOUT    (admin_timeout * HZ)
>>>    -#define NVME_DEFAULT_KATO    5
>>> +#define NVME_DEFAULT_KATO        5
>>> +
>>> +#define NVME_DEFAULT_ADP_EWMA_SHIFT    3
>>> +#define NVME_DEFAULT_ADP_WEIGHT_TIMEOUT    (15 * NSEC_PER_SEC)
>> You need these defines outside of nvme-mpath?
>>
> Currently, those macros are used in nvme/host/core.c.
> I can move them inside that source file.
>
>>>      #ifdef CONFIG_ARCH_NO_SG_CHAIN
>>>    #define  NVME_INLINE_SG_CNT  0
>>> @@ -421,6 +424,7 @@ enum nvme_iopolicy {
>>>        NVME_IOPOLICY_NUMA,
>>>        NVME_IOPOLICY_RR,
>>>        NVME_IOPOLICY_QD,
>>> +    NVME_IOPOLICY_ADAPTIVE,
>>>    };
>>>      struct nvme_subsystem {
>>> @@ -459,6 +463,37 @@ struct nvme_ns_ids {
>>>        u8    csi;
>>>    };
>>>    +enum nvme_stat_group {
>>> +    NVME_STAT_READ,
>>> +    NVME_STAT_WRITE,
>>> +    NVME_STAT_OTHER,
>>> +    NVME_NUM_STAT_GROUPS
>>> +};
>> I see you have stats per io direction. However you don't have it per IO size. I wonder
>> how this plays into this iopolicy.
>>
> Yes you're correct, and as mentioned earlier we'd measure latency per
> 512-byte block size.
>
>>> +
>>> +struct nvme_path_stat {
>>> +    u64 nr_samples;        /* total num of samples processed */
>>> +    u64 nr_ignored;        /* num. of samples ignored */
>>> +    u64 slat_ns;        /* smoothed (ewma) latency in nanoseconds */
>>> +    u64 score;        /* score used for weight calculation */
>>> +    u64 last_weight_ts;    /* timestamp of the last weight calculation */
>>> +    u64 sel;        /* num of times this path is selected for I/O */
>>> +    u64 batch;        /* accumulated latency sum for current window */
>>> +    u32 batch_count;    /* num of samples accumulated in current window */
>>> +    u32 weight;        /* path weight */
>>> +    u32 credit;        /* path credit for I/O forwarding */
>>> +};
>> I'm still not convinced that having this be per-cpu-per-ns really makes sense.
> I understand your concern about whether it really makes sense to keep this
> per-cpu-per-ns, and I see your point that you would prefer maintaining the
> stat per-hctx instead of per-CPU.
>
> However, as mentioned earlier, during path selection we cannot reliably map an
> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
> practice. On the other hand, maintaining the metrics per-CPU has an additional
> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
> the NUMA distance between the workload’s CPU and the I/O controller. This means
> that on multi-node systems, the policy can automatically favor I/O paths/controllers
> that are local/near to the CPU issuing the request, which may lead to better
> latency characteristics.

With this I tend to agree, but per-cpu has lots of other churn IMO.
Maybe the answer is that path weights are maintained per NUMA node?
Then accessing these weights in the fast path is still cheap enough?
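
Either way the arithmetic stays the same; only the number of tables changes. A rough userspace sketch of the score and weight computation from the patch, with the table passed in as plain arrays (compute_weights() and WEIGHT_SCALE are illustrative names, not from the patch):

```c
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC	1000000000ULL
#define WEIGHT_SCALE	64

/*
 * slat_ns[i] is the smoothed latency of path i (0 means no estimate yet);
 * weight[i] receives the normalized weight. Whether one such table exists
 * per CPU or per NUMA node only changes how many copies are kept.
 */
static void compute_weights(const uint64_t *slat_ns, uint32_t *weight, size_t n)
{
	uint64_t total = 0;
	size_t i;

	/* Score is inverse latency, scaled by NSEC_PER_SEC. */
	for (i = 0; i < n; i++)
		if (slat_ns[i])
			total += NSEC_PER_SEC / slat_ns[i];

	if (!total)
		return;

	for (i = 0; i < n; i++) {
		uint64_t score = slat_ns[i] ? NSEC_PER_SEC / slat_ns[i] : 0;
		uint32_t w = (uint32_t)(score * WEIGHT_SCALE / total);

		weight[i] = w ? w : 1;	/* never starve a path completely */
	}
}
```

For two paths with smoothed latencies of 1 ms and 3 ms this gives weights 48 and 15; for 100 us against 30 ms it gives 63 and 1, where the 1 comes from the minimum-weight clamp.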


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2025-12-15 23:36       ` Sagi Grimberg
@ 2025-12-18 11:19         ` Nilay Shroff
  2025-12-18 13:46           ` Hannes Reinecke
  2025-12-25 12:28           ` Sagi Grimberg
  0 siblings, 2 replies; 28+ messages in thread
From: Nilay Shroff @ 2025-12-18 11:19 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme
  Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce



On 12/16/25 5:06 AM, Sagi Grimberg wrote:
> 
> 
> On 13/12/2025 9:27, Nilay Shroff wrote:
>>
>> On 12/12/25 6:34 PM, Sagi Grimberg wrote:
>>>
>>> On 05/11/2025 12:33, Nilay Shroff wrote:
>>>> This commit introduces a new I/O policy named "adaptive". Users can
>>>> configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
>>>> subsystemX/iopolicy"
>>>>
>>>> The adaptive policy dynamically distributes I/O based on measured
>>>> completion latency. The main idea is to calculate latency for each path,
>>>> derive a weight, and then proportionally forward I/O according to those
>>>> weights.
>>>>
>>>> To ensure scalability, path latency is measured per-CPU. Each CPU
>>>> maintains its own statistics, and I/O forwarding uses these per-CPU
>>>> values.
>>> So a given cpu would select path-a vs. another cpu that may select path-b?
>>> How does that play with less queues than cpu cores? what happens to cores
>>> that have low traffic?
>>>
>> The path-selection logic does not depend on the relationship between the number
>> of CPUs and the number of hardware queues. It simply selects a path based on the
>> per-CPU path score/credit, which reflects the relative performance of each available
>> path.
>> For example, assume we have two paths (A and B) to the same shared namespace.
>> For each CPU, we maintain a smoothed latency estimate for every path. From these
>> latency values we derive a per-path score or credit. The credit represents the relative
>> share of I/O that each path should receive: a path with lower observed latency gets more
>> credit, and a path with higher latency gets less.
> 
> I understand that the stats are maintained per-cpu, however I am not sure that having a
> per-cpu path weights make sense. meaning that if we have paths a,b,c and for cpu0 we'll
> have one set of weights and for cpu1 we'll have another set of weights.
> 
> What if a given cpu happened to schedule some other application in a way that impacts
> completion latency? Won't that skew the sampling? That is not related to the path at all. That
> is possibly more noticeable in tcp, which completes in a kthread context.
> 
> What do we lose if the 15-second weight assignment averages all the cpus' sampling? Won't
> that mitigate to some extent the issue of non-path related latency skew?
> 
You’re right — what you’re describing is indeed possible. The intent of the adaptive policy, 
however, is to measure end-to-end I/O latency, rather than isolating only the raw path or
transport latency.
The observed completion latency intentionally includes all components that affect I/O from
the host’s perspective: path latency, fabric or protocol stack latency (for example, TCP/IP),
scheduler-induced delays, and the target device’s own I/O latency. By capturing the full 
end-to-end behavior, the policy reflects the actual cost of issuing I/O on a given path.
Scheduler-related latency can vary over time due to workload placement or CPU contention,
and this variability is accounted for by the design. Since per-path weights are recalculated
periodically (for example, every 15 seconds), any sustained changes in CPU load or scheduling
behavior are naturally incorporated into the path scoring. As a result, the policy can 
automatically adapt/adjust and rebalance I/O toward paths that are performing better under
current system conditions.
In short, while per-CPU sampling may include effects beyond the physical path itself, this is
intentional and allows the adaptive policy to respond in real time to changing end-to-end
performance characteristics.
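
The periodic update described above can be sketched in userspace C as follows.
This is a minimal illustration, not the patch itself: the shift value of 3 is
assumed from the 7/8 : 1/8 split mentioned in the patch comments, fold_window()
is a hypothetical helper name, and the empty-window guard is an addition made
only for this sketch.

```c
#include <stdint.h>

/* Assumed: a shift of 3 gives the 7/8 old + 1/8 new split from the patch comments. */
#define ADP_EWMA_SHIFT 3

/* Same arithmetic as the ewma_update() helper quoted in the patch. */
static uint64_t ewma_update(uint64_t old_ns, uint64_t new_ns)
{
	return (old_ns * ((1ULL << ADP_EWMA_SHIFT) - 1) + new_ns) >> ADP_EWMA_SHIFT;
}

/*
 * Hypothetical helper: fold one window's samples into the smoothed latency.
 * The first window that has samples seeds the estimate directly.
 */
static uint64_t fold_window(uint64_t slat_ns, uint64_t batch_sum_ns,
			    uint32_t batch_count)
{
	uint64_t avg_ns;

	if (!batch_count)	/* guard added for this sketch only */
		return slat_ns;

	avg_ns = batch_sum_ns / batch_count;
	return slat_ns ? ewma_update(slat_ns, avg_ns) : avg_ns;
}
```

For example, with a smoothed latency of 1000 ns and a window averaging 9000 ns,
fold_window(1000, 9000, 1) yields 2000 ns. A single bad window moves the
estimate by one eighth of the gap, so only a sustained change shifts the
weights noticeably.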

>>
>> I/O distribution is thus governed directly by the available credits on that CPU. When the
>> NVMe multipath driver performs path selection, it chooses the path with sufficient credits,
>> updates the bio’s bdev to correspond to that path, and submits the bio. Only after this
>> point does the block layer map the bio to an hctx through the usual ctx->hctx mapping (i.e.,
>> matching the issuing CPU to the appropriate hardware queue). In other words, the multipath
>> policy runs above the block-layer queueing logic, and the number of hardware queues does
>> not affect how paths are scored or selected.
> 
> This is potentially another problem. application may jump between cpu cores due to scheduling
> constraints. In this case, how is the path selection policy adhering to the path weights?
> 
> What I'm trying to say here is that the path selection should be inherently reflective on the path,
> not the cpu core that was accessing this path. What I am concerned about, is how this behaves
> in the real-world. Your tests are running in a very distinct artificial path variance, and it does not include
> other workloads that are running on the system that can impact completion latency.
> 
> It is possible that what I'm raising here is not a real concern, but I think we need to be able to demonstrate
> that.
> 

In real-world systems, as stated earlier, the completion latency is influenced not only by
the physical path but also by system load, scheduler behavior, and transport stack processing.
By incorporating all of these factors into the latency measurement, the adaptive policy reflects
the true cost of issuing I/O on a given path under current conditions. This allows it to respond
to both path-level and system-level congestion.

In practice, during experiments with two paths (A and B), I observed that when additional latency—
whether introduced via the path itself or through system load—was present on path A, subsequent I/O
was automatically steered toward path B. Once conditions on path A improved, the policy rebalanced
I/O based on the updated path weights. This behavior demonstrates that the policy adapts dynamically
and remains effective even in the presence of CPU migration and competing workloads.
Overall, while per-CPU sampling may appear counterintuitive at first, it enables the policy to capture
real-world end-to-end performance and continuously adjust I/O distribution in response to changing
system and path conditions.
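
The score, weight and credit steps from the patch description can be sketched
as below. This is a userspace approximation under stated assumptions:
compute_weights() follows the arithmetic of the quoted
nvme_mpath_weight_work() (scale of 64, minimum weight of 1), while pick_path()
is a hypothetical, simplified stand-in, since nvme_adaptive_path() itself is
not shown in this part of the thread.

```c
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define WEIGHT_SCALE 64	/* scale used in the quoted weight calculation */

/* Score is the inverse of smoothed latency; 0 means "no estimate yet". */
static uint64_t path_score(uint64_t slat_ns)
{
	return slat_ns ? NSEC_PER_SEC / slat_ns : 0;
}

/* Derive per-path weights from smoothed latencies (in nanoseconds). */
static void compute_weights(const uint64_t *slat_ns, uint32_t *weight, size_t n)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += path_score(slat_ns[i]);
	if (!total)
		return;	/* nothing measured yet: leave weights untouched */

	for (i = 0; i < n; i++) {
		uint32_t w = (uint32_t)(path_score(slat_ns[i]) * WEIGHT_SCALE / total);

		weight[i] = w ? w : 1;	/* never let a path drop to 0 */
	}
}

/*
 * Hypothetical selector: take the first path that still has credit; once
 * all credits are spent, refill every path from its weight and retry.
 * Returns the path index, or -1 if every weight is 0.
 */
static int pick_path(uint32_t *credit, const uint32_t *weight, size_t n)
{
	size_t i;
	int pass;

	for (pass = 0; pass < 2; pass++) {
		for (i = 0; i < n; i++) {
			if (credit[i]) {
				credit[i]--;	/* one credit per I/O */
				return (int)i;
			}
		}
		for (i = 0; i < n; i++)
			credit[i] = weight[i];
	}
	return -1;
}
```

With path A at 200 us and path B at 30 ms, the scores are 5000 and 33, which
gives weights of 63 and 1. Over 64 consecutive selections this sketch sends 63
I/Os to A and 1 to B. It does so in bursts, which a real selector would
likely want to interleave.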

>>
>>>> Every ~15 seconds, a simple average latency of per-CPU batched
>>>> samples are computed and fed into an Exponentially Weighted Moving
>>>> Average (EWMA):
>>> I suggest to have iopolicy name reflect ewma. maybe "ewma-lat"?
>> Okay that sounds good! Shall we name it "ewma-lat" or "weighted-lat"?
> 
> weighted-lat is simpler.
> 
Okay, I'll rename it to "weighted-lat".

>>
>>    Path weights are then derived from the smoothed (EWMA)
>> latency as follows (example with two paths A and B):
>>
>>       path_A_score = NSEC_PER_SEC / path_A_ewma_latency
>>       path_B_score = NSEC_PER_SEC / path_B_ewma_latency
>>       total_score  = path_A_score + path_B_score
>>
>>       path_A_weight = (path_A_score * 64) / total_score
>>       path_B_weight = (path_B_score * 64) / total_score
>>
>>> What happens to R/W mixed workloads? What happens when the I/O pattern
>>> has a distribution of block sizes?
>>>
>> We maintain separate metrics for READ and WRITE traffic, and during path
>> selection we use the appropriate metric depending on the I/O type.
>>
>> Regarding block-size variability: the current implementation does not yet
>> account for I/O size. This is an important point — thank you for raising it.
>> I discussed this today with Hannes at LPC, and we agreed that a practical
>> approach is to normalize latency per 512-byte block. For our purposes, we
>> do not need an exact latency value; a relative latency metric is sufficient,
>> as it ultimately feeds into path scoring. A path with higher latency ends up
>> with a lower score, and a path with lower latency gets a higher score — the
>> exact absolute values are less important than maintaining consistent proportional
>> relationships.
> 
> I am not sure that normalizing to 512-byte blocks is a good proxy. I think that large IO will
> have much lower amortized latency per 512-byte block, which could create a false bias
> to place a high weight on a path, if that path happened to host large I/Os, no?
> 
Hmm, I think yes, good point, I think for nvme over fabrics this could be true.
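
The bias can be seen with a small worked example. The helper below is a
hypothetical sketch of per-512-byte normalization (the name and the round-up
of the block count are assumptions), and the latencies used afterwards are
made-up numbers chosen only to show the effect.

```c
#include <stdint.h>

/* Latency per 512-byte block; block count is rounded up, minimum 1 block. */
static uint64_t lat_per_512(uint64_t latency_ns, uint32_t bytes)
{
	uint64_t blocks = ((uint64_t)bytes + 511) / 512;

	if (!blocks)
		blocks = 1;
	return latency_ns / blocks;
}
```

A 4 KiB I/O completing in 100 us normalizes to 12500 ns per block, while a
128 KiB I/O completing in 400 us normalizes to 1562 ns per block. The larger
request took four times longer end to end, yet a path that happened to carry
it would look eight times faster under this metric.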

> in my mind having buckets for I/O sizes would probably give a better approximation for
> the paths weights won't it?
> 
Okay, so how about dividing I/O sizes into the 4 buckets shown below?

small      <= 4k
medium     4k-64k
large      64k-128k
very-large >128k
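
One way these buckets could be expressed in code is sketched below. The enum
and function names are hypothetical, and treating each upper bound as
inclusive is an assumption, since the proposal does not say which bucket an
I/O of exactly 64k or 128k falls into.

```c
#include <stdint.h>

/* Hypothetical bucket indices; IO_NUM_BUCKETS could size a per-bucket stat array. */
enum io_size_bucket {
	IO_SMALL,	/* <= 4k */
	IO_MEDIUM,	/* > 4k and <= 64k */
	IO_LARGE,	/* > 64k and <= 128k */
	IO_VERY_LARGE,	/* > 128k */
	IO_NUM_BUCKETS
};

/* Map an I/O size in bytes to its bucket (upper bounds assumed inclusive). */
static enum io_size_bucket io_size_to_bucket(uint32_t bytes)
{
	if (bytes <= 4 * 1024)
		return IO_SMALL;
	if (bytes <= 64 * 1024)
		return IO_MEDIUM;
	if (bytes <= 128 * 1024)
		return IO_LARGE;
	return IO_VERY_LARGE;
}
```

Latency samples would then be accumulated per (direction, bucket) pair, so
that a path's score for small reads is never diluted by large writes.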

> 
>>
>> Normalizing latency per 512 bytes gives us a stable, size-aware metric that scales
>> across different I/O block sizes. I think that it's easy to normalize a latency number
>> per 512 bytes block and I'd implement that in next patch version.
> 
> I am not sure. maybe it is.
> The main issue I have here, is that you are trying to find asymmetry between paths,
> however you are adding entropy with few factors by not taking into account:
> 1. I/O size
> 2. cpu scheduling
> 3. application cpu affinity changes over time
> 
> Now I don't know if these aspects actually make a difference, or it may be just hypothetical, but
> I think we need to add these aspects when we test the proposed iopolicy...
> 
As stated earlier, as we measure end-to-end latency, it helps account for both cpu scheduling
and other application workload specific delays while choosing the path. And regarding I/O 
size variation, as you suggested, I proposed using the different bucket sizes mentioned above.

>>> I think that in order to understand how a non-trivial path selector works we need
>>> thorough testing in a variety of I/O patterns.
>>>
>> Yes that was done running fio with different I/O engines, I/O types (read, write, r/w) and
>> different block sizes. I tested it using NVMe pcie and nvmf-tcp. The tests were performed
>> for both direct and buffered I/O. Also I ran blktests configuring adaptive I/O policy.
>> Still if you prefer running anything further let me know.
> 
> Maybe run with higher nice values? or run other processes on the host in parallel? maybe processes
> that also makes heavier use of the network?
> 
Okay, I'll run such additional workloads while testing this iopolicy.
In fact, you'll find the results of one such experiment I performed
at the end of this email.

> I don't think this is a viable approach for pcie in reality, most likely this is exclusive to fabrics.
> 
>>
>>>> where:
>>>>     - path_X_ewma_latency is the smoothed latency of a path in nanoseconds
>>>>     - NSEC_PER_SEC is used as a scaling factor since valid latencies
>>>>       are < 1 second
>>>>     - weights are normalized to a 0–64 scale across all paths.
>>>>
>>>> Path credits are refilled based on this weight, with one credit
>>>> consumed per I/O. When all credits are consumed, the credits are
>>>> refilled again based on the current weight. This ensures that I/O is
>>>> distributed across paths proportionally to their calculated weight.
>>>>
>>>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>> ---
>>>>    drivers/nvme/host/core.c      |  15 +-
>>>>    drivers/nvme/host/ioctl.c     |  31 ++-
>>>>    drivers/nvme/host/multipath.c | 425 ++++++++++++++++++++++++++++++++--
>>>>    drivers/nvme/host/nvme.h      |  74 +++++-
>>>>    drivers/nvme/host/pr.c        |   6 +-
>>>>    drivers/nvme/host/sysfs.c     |   2 +-
>>>>    6 files changed, 530 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index fa4181d7de73..47f375c63d2d 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -672,6 +672,9 @@ static void nvme_free_ns_head(struct kref *ref)
>>>>        cleanup_srcu_struct(&head->srcu);
>>>>        nvme_put_subsystem(head->subsys);
>>>>        kfree(head->plids);
>>>> +#ifdef CONFIG_NVME_MULTIPATH
>>>> +    free_percpu(head->adp_path);
>>>> +#endif
>>>>        kfree(head);
>>>>    }
>>>>    @@ -689,6 +692,7 @@ static void nvme_free_ns(struct kref *kref)
>>>>    {
>>>>        struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
>>>>    +    nvme_free_ns_stat(ns);
>>>>        put_disk(ns->disk);
>>>>        nvme_put_ns_head(ns->head);
>>>>        nvme_put_ctrl(ns->ctrl);
>>>> @@ -4137,6 +4141,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>>>        if (nvme_init_ns_head(ns, info))
>>>>            goto out_cleanup_disk;
>>>>    +    if (nvme_alloc_ns_stat(ns))
>>>> +        goto out_unlink_ns;
>>>> +
>>>>        /*
>>>>         * If multipathing is enabled, the device name for all disks and not
>>>>         * just those that represent shared namespaces needs to be based on the
>>>> @@ -4161,7 +4168,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>>>        }
>>>>          if (nvme_update_ns_info(ns, info))
>>>> -        goto out_unlink_ns;
>>>> +        goto out_free_ns_stat;
>>>>          mutex_lock(&ctrl->namespaces_lock);
>>>>        /*
>>>> @@ -4170,7 +4177,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>>>         */
>>>>        if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
>>>>            mutex_unlock(&ctrl->namespaces_lock);
>>>> -        goto out_unlink_ns;
>>>> +        goto out_free_ns_stat;
>>>>        }
>>>>        nvme_ns_add_to_ctrl_list(ns);
>>>>        mutex_unlock(&ctrl->namespaces_lock);
>>>> @@ -4201,6 +4208,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>>>        list_del_rcu(&ns->list);
>>>>        mutex_unlock(&ctrl->namespaces_lock);
>>>>        synchronize_srcu(&ctrl->srcu);
>>>> +out_free_ns_stat:
>>>> +    nvme_free_ns_stat(ns);
>>>>     out_unlink_ns:
>>>>        mutex_lock(&ctrl->subsys->lock);
>>>>        list_del_rcu(&ns->siblings);
>>>> @@ -4244,6 +4253,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>>         */
>>>>        synchronize_srcu(&ns->head->srcu);
>>>>    +    nvme_mpath_cancel_adaptive_path_weight_work(ns);
>>>> +
>>> I personally think that the check on path stats should be done in the call-site
>>> and not in the function itself.
>> Hmm, can you please elaborate on this point further? I think, I am unable to get
>> your point here.
> 
> nvme_mpath_cancel_adaptive_path_weight_work may do something or it won't, I'd prefer that
> this check will be made here and not in the function.
> 
Okay, got it. I'll make that path stat check at the call-site.
> 
>>
>>>>        /* wait for concurrent submissions */
>>>>        if (nvme_mpath_clear_current_path(ns))
>>>>            synchronize_srcu(&ns->head->srcu);
>>>> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
>>>> index c212fa952c0f..759d147d9930 100644
>>>> --- a/drivers/nvme/host/ioctl.c
>>>> +++ b/drivers/nvme/host/ioctl.c
>>>> @@ -700,18 +700,29 @@ static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
>>>>    int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>>>>            unsigned int cmd, unsigned long arg)
>>>>    {
>>>> +    u8 opcode;
>>>>        struct nvme_ns_head *head = bdev->bd_disk->private_data;
>>>>        bool open_for_write = mode & BLK_OPEN_WRITE;
>>>>        void __user *argp = (void __user *)arg;
>>>>        struct nvme_ns *ns;
>>>>        int srcu_idx, ret = -EWOULDBLOCK;
>>>>        unsigned int flags = 0;
>>>> +    unsigned int op_type = NVME_STAT_OTHER;
>>>>          if (bdev_is_partition(bdev))
>>>>            flags |= NVME_IOCTL_PARTITION;
>>>>    +    if (cmd == NVME_IOCTL_SUBMIT_IO) {
>>>> +        if (get_user(opcode, (u8 *)argp))
>>>> +            return -EFAULT;
>>>> +        if (opcode == nvme_cmd_write)
>>>> +            op_type = NVME_STAT_WRITE;
>>>> +        else if (opcode == nvme_cmd_read)
>>>> +            op_type = NVME_STAT_READ;
>>>> +    }
>>>> +
>>>>        srcu_idx = srcu_read_lock(&head->srcu);
>>>> -    ns = nvme_find_path(head);
>>>> +    ns = nvme_find_path(head, op_type);
>>> Perhaps it would be easier to review if you split passing opcode to nvme_find_path()
>>> to a prep patch (explaining that the new iopolicy will leverage it)
>>>
>> Sure, makes sense. I'll split this into prep patch as you suggested.
>>>>        if (!ns)
>>>>            goto out_unlock;
>>>>    @@ -733,6 +744,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>>>>    long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>>>>            unsigned long arg)
>>>>    {
>>>> +    u8 opcode;
>>>>        bool open_for_write = file->f_mode & FMODE_WRITE;
>>>>        struct cdev *cdev = file_inode(file)->i_cdev;
>>>>        struct nvme_ns_head *head =
>>>> @@ -740,9 +752,19 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>>>>        void __user *argp = (void __user *)arg;
>>>>        struct nvme_ns *ns;
>>>>        int srcu_idx, ret = -EWOULDBLOCK;
>>>> +    unsigned int op_type = NVME_STAT_OTHER;
>>>> +
>>>> +    if (cmd == NVME_IOCTL_SUBMIT_IO) {
>>>> +        if (get_user(opcode, (u8 *)argp))
>>>> +            return -EFAULT;
>>>> +        if (opcode == nvme_cmd_write)
>>>> +            op_type = NVME_STAT_WRITE;
>>>> +        else if (opcode == nvme_cmd_read)
>>>> +            op_type = NVME_STAT_READ;
>>>> +    }
>>>>          srcu_idx = srcu_read_lock(&head->srcu);
>>>> -    ns = nvme_find_path(head);
>>>> +    ns = nvme_find_path(head, op_type);
>>>>        if (!ns)
>>>>            goto out_unlock;
>>>>    @@ -762,7 +784,10 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
>>>>        struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>>>>        struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
>>>>        int srcu_idx = srcu_read_lock(&head->srcu);
>>>> -    struct nvme_ns *ns = nvme_find_path(head);
>>>> +    const struct nvme_uring_cmd *cmd = io_uring_sqe_cmd(ioucmd->sqe);
>>>> +    struct nvme_ns *ns = nvme_find_path(head,
>>>> +            READ_ONCE(cmd->opcode) & 1 ?
>>>> +            NVME_STAT_WRITE : NVME_STAT_READ);
>>>>        int ret = -EINVAL;
>>>>          if (ns)
>>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>>> index 543e17aead12..55dc28375662 100644
>>>> --- a/drivers/nvme/host/multipath.c
>>>> +++ b/drivers/nvme/host/multipath.c
>>>> @@ -6,6 +6,9 @@
>>>>    #include <linux/backing-dev.h>
>>>>    #include <linux/moduleparam.h>
>>>>    #include <linux/vmalloc.h>
>>>> +#include <linux/blk-mq.h>
>>>> +#include <linux/math64.h>
>>>> +#include <linux/rculist.h>
>>>>    #include <trace/events/block.h>
>>>>    #include "nvme.h"
>>>>    @@ -66,9 +69,10 @@ MODULE_PARM_DESC(multipath_always_on,
>>>>        "create multipath node always except for private namespace with non-unique nsid; note that this also implicitly enables native multipath support");
>>>>      static const char *nvme_iopolicy_names[] = {
>>>> -    [NVME_IOPOLICY_NUMA]    = "numa",
>>>> -    [NVME_IOPOLICY_RR]    = "round-robin",
>>>> -    [NVME_IOPOLICY_QD]      = "queue-depth",
>>>> +    [NVME_IOPOLICY_NUMA]     = "numa",
>>>> +    [NVME_IOPOLICY_RR]     = "round-robin",
>>>> +    [NVME_IOPOLICY_QD]       = "queue-depth",
>>>> +    [NVME_IOPOLICY_ADAPTIVE] = "adaptive",
>>>>    };
>>>>      static int iopolicy = NVME_IOPOLICY_NUMA;
>>>> @@ -83,6 +87,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>>>>            iopolicy = NVME_IOPOLICY_RR;
>>>>        else if (!strncmp(val, "queue-depth", 11))
>>>>            iopolicy = NVME_IOPOLICY_QD;
>>>> +    else if (!strncmp(val, "adaptive", 8))
>>>> +        iopolicy = NVME_IOPOLICY_ADAPTIVE;
>>>>        else
>>>>            return -EINVAL;
>>>>    @@ -198,6 +204,204 @@ void nvme_mpath_start_request(struct request *rq)
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
>>>>    +static void nvme_mpath_weight_work(struct work_struct *weight_work)
>>>> +{
>>>> +    int cpu, srcu_idx;
>>>> +    u32 weight;
>>>> +    struct nvme_ns *ns;
>>>> +    struct nvme_path_stat *stat;
>>>> +    struct nvme_path_work *work = container_of(weight_work,
>>>> +            struct nvme_path_work, weight_work);
>>>> +    struct nvme_ns_head *head = work->ns->head;
>>>> +    int op_type = work->op_type;
>>>> +    u64 total_score = 0;
>>>> +
>>>> +    cpu = get_cpu();
>>>> +
>>>> +    srcu_idx = srcu_read_lock(&head->srcu);
>>>> +    list_for_each_entry_srcu(ns, &head->list, siblings,
>>>> +            srcu_read_lock_held(&head->srcu)) {
>>>> +
>>>> +        stat = &this_cpu_ptr(ns->info)[op_type].stat;
>>>> +        if (!READ_ONCE(stat->slat_ns)) {
>>>> +            stat->score = 0;
>>>> +            continue;
>>>> +        }
>>>> +        /*
>>>> +         * Compute the path score as the inverse of smoothed
>>>> +         * latency, scaled by NSEC_PER_SEC. Floating point
>>>> +         * math is unavailable in the kernel, so fixed-point
>>>> +         * scaling is used instead. NSEC_PER_SEC is chosen
>>>> +         * because valid latencies are always < 1 second; longer
>>>> +         * latencies are ignored.
>>>> +         */
>>>> +        stat->score = div_u64(NSEC_PER_SEC, READ_ONCE(stat->slat_ns));
>>>> +
>>>> +        /* Compute total score. */
>>>> +        total_score += stat->score;
>>>> +    }
>>>> +
>>>> +    if (!total_score)
>>>> +        goto out;
>>>> +
>>>> +    /*
>>>> +     * After computing the total score, we derive per-path weight
>>>> +     * (normalized to the range 0–64). The weight represents the
>>>> +     * relative share of I/O the path should receive.
>>>> +     *
>>>> +     *   - lower smoothed latency -> higher weight
>>>> +     *   - higher smoothed latency -> lower weight
>>>> +     *
>>>> +     * Next, while forwarding I/O, we assign "credits" to each path
>>>> +     * based on its weight (please also refer nvme_adaptive_path()):
>>>> +     *   - Initially, credits = weight.
>>>> +     *   - Each time an I/O is dispatched on a path, its credits are
>>>> +     *     decremented proportionally.
>>>> +     *   - When a path runs out of credits, it becomes temporarily
>>>> +     *     ineligible until credit is refilled.
>>>> +     *
>>>> +     * I/O distribution is therefore governed by available credits,
>>>> +     * ensuring that over time the proportion of I/O sent to each
>>>> +     * path matches its weight (and thus its performance).
>>>> +     */
>>>> +    list_for_each_entry_srcu(ns, &head->list, siblings,
>>>> +            srcu_read_lock_held(&head->srcu)) {
>>>> +
>>>> +        stat = &this_cpu_ptr(ns->info)[op_type].stat;
>>>> +        weight = div_u64(stat->score * 64, total_score);
>>>> +
>>>> +        /*
>>>> +         * Ensure the path weight never drops below 1. A weight
>>>> +         * of 0 is used only for newly added paths. During
>>>> +         * bootstrap, a few I/Os are sent to such paths to
>>>> +         * establish an initial weight. Enforcing a minimum
>>>> +         * weight of 1 guarantees that no path is forgotten and
>>>> +         * that each path is probed at least occasionally.
>>>> +         */
>>>> +        if (!weight)
>>>> +            weight = 1;
>>>> +
>>>> +        WRITE_ONCE(stat->weight, weight);
>>>> +    }
>>>> +out:
>>>> +    srcu_read_unlock(&head->srcu, srcu_idx);
>>>> +    put_cpu();
>>>> +}
>>>> +
>>>> +/*
>>>> + * Formula to calculate the EWMA (Exponentially Weighted Moving Average):
>>>> + * ewma = (old_ewma * ((1 << EWMA_SHIFT) - 1) + new_sample) >> EWMA_SHIFT
>>>> + * For instance, with EWMA_SHIFT = 3, this assigns 7/8 (~87.5 %) weight to
>>>> + * the existing/old ewma and 1/8 (~12.5%) weight to the new sample.
>>>> + */
>>>> +static inline u64 ewma_update(u64 old, u64 new)
>>> it is a calculation function, lets call it calc_ewma_update
>> Yeah, will do this in next patch version.
>>
>>>> +{
>>>> +    return (old * ((1 << NVME_DEFAULT_ADP_EWMA_SHIFT) - 1)
>>>> +            + new) >> NVME_DEFAULT_ADP_EWMA_SHIFT;
>>>> +}
>>>> +
>>>> +static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
>>>> +{
>>>> +    int cpu;
>>>> +    unsigned int op_type;
>>>> +    struct nvme_path_info *info;
>>>> +    struct nvme_path_stat *stat;
>>>> +    u64 now, latency, slat_ns, avg_lat_ns;
>>>> +    struct nvme_ns_head *head = ns->head;
>>>> +
>>>> +    if (list_is_singular(&head->list))
>>>> +        return;
>>>> +
>>>> +    now = ktime_get_ns();
>>>> +    latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
>>>> +    if (!latency)
>>>> +        return;
>>>> +
>>>> +    /*
>>>> +     * As completion code path is serialized(i.e. no same completion queue
>>>> +     * update code could run simultaneously on multiple cpu) we can safely
>>>> +     * access per cpu nvme path stat here from another cpu (in case the
>>>> +     * completion cpu is different from submission cpu).
>>>> +     * The only field which could be accessed simultaneously here is the
>>>> +     * path ->weight which may be accessed by this function as well as I/O
>>>> +     * submission path during path selection logic and we protect ->weight
>>>> +     * using READ_ONCE/WRITE_ONCE. Yes this may not be 100% accurate but
>>>> +     * we also don't need to be so accurate here as the path credit would
>>>> +     * be anyways refilled, based on path weight, once path consumes all
>>>> +     * its credits. And we limit path weight/credit max up to 64. Please
>>>> +     * also refer nvme_adaptive_path().
>>>> +     */
>>>> +    cpu = blk_mq_rq_cpu(rq);
>>>> +    op_type = nvme_data_dir(req_op(rq));
>>>> +    info = &per_cpu_ptr(ns->info, cpu)[op_type];
>>> info is really really really confusing and generic and not representative of what
>>> "info" it is used for. maybe path_lat? or path_stats? anything is better than info.
>>>
>> Maybe I am too used to this code and so I never realized it. But yes, agreed, I
>> will make it @path_lat.
>>
>>>> +    stat = &info->stat;
>>>> +
>>>> +    /*
>>>> +     * If latency > ~1s then ignore this sample to prevent EWMA from being
>>>> +     * skewed by pathological outliers (multi-second waits, controller
>>>> +     * timeouts etc.). This keeps path scores representative of normal
>>>> +     * performance and avoids instability from rare spikes. If such high
>>>> +     * latency is real, ANA state reporting or keep-alive error counters
>>>> +     * will mark the path unhealthy and remove it from the head node list,
>>>> +     * so we safely skip such sample here.
>>>> +     */
>>>> +    if (unlikely(latency > NSEC_PER_SEC)) {
>>>> +        stat->nr_ignored++;
>>>> +        dev_warn_ratelimited(ns->ctrl->device,
>>>> +            "ignoring sample with >1s latency (possible controller stall or timeout)\n");
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * Accumulate latency samples and increment the batch count for each
>>>> +     * ~15 second interval. When the interval expires, compute the simple
>>>> +     * average latency over that window, then update the smoothed (EWMA)
>>>> +     * latency. The path weight is recalculated based on this smoothed
>>>> +     * latency.
>>>> +     */
>>>> +    stat->batch += latency;
>>>> +    stat->batch_count++;
>>>> +    stat->nr_samples++;
>>>> +
>>>> +    if (now > stat->last_weight_ts &&
>>>> +        (now - stat->last_weight_ts) >= NVME_DEFAULT_ADP_WEIGHT_TIMEOUT) {
>>>> +
>>>> +        stat->last_weight_ts = now;
>>>> +
>>>> +        /*
>>>> +         * Find simple average latency for the last epoch (~15 sec
>>>> +         * interval).
>>>> +         */
>>>> +        avg_lat_ns = div_u64(stat->batch, stat->batch_count);
>>>> +
>>>> +        /*
>>>> +         * Calculate smooth/EWMA (Exponentially Weighted Moving Average)
>>>> +         * latency. EWMA is preferred over simple average latency
>>>> +         * because it smooths naturally, reduces jitter from sudden
>>>> +         * spikes, and adapts faster to changing conditions. It also
>>>> +         * avoids storing historical samples, and works well for both
>>>> +         * slow and fast I/O rates.
>>>> +         * Formula:
>>>> +         * slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
>>>> +         * With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
>>>> +         * existing latency and 1/8 (~12.5%) weight to the new latency.
>>>> +         */
>>>> +        if (unlikely(!stat->slat_ns))
>>>> +            WRITE_ONCE(stat->slat_ns, avg_lat_ns);
>>>> +        else {
>>>> +            slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
>>>> +            WRITE_ONCE(stat->slat_ns, slat_ns);
>>>> +        }
>>>> +
>>>> +        stat->batch = stat->batch_count = 0;
>>>> +
>>>> +        /*
>>>> +         * Defer calculation of the path weight in per-cpu workqueue.
>>>> +         */
>>>> +        schedule_work_on(cpu, &info->work.weight_work);
>>> I'm unsure if the percpu is a good choice here. Don't you want it per hctx at least?
>>> workloads tend to bounce quite a bit between cpu cores... we have systems with hundreds of
>>> cpu cores.
>> As I explained earlier, in NVMe multipath driver code we don't know hctx while
>> we choose path. The ctx to hctx mapping happens later in the block layer while
>> submitting bio.
> 
> yes, hctx is not really relevant.
> 
>>   Here we calculate the path score per-cpu and utilize it while
>> choosing path to forward I/O.
>>
>>>> +    }
>>>> +}
>>>> +
>>>>    void nvme_mpath_end_request(struct request *rq)
>>>>    {
>>>>        struct nvme_ns *ns = rq->q->queuedata;
>>>> @@ -205,6 +409,9 @@ void nvme_mpath_end_request(struct request *rq)
>>>>        if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)
>>>>            atomic_dec_if_positive(&ns->ctrl->nr_active);
>>>>    +    if (test_bit(NVME_NS_PATH_STAT, &ns->flags))
>>>> +        nvme_mpath_add_sample(rq, ns);
>>>> +
>>> Doing all this work for EVERY completion is really worth it?
>>> sounds kinda like an overkill.
>> We don't really do much in nvme_mpath_add_sample() other than just
>> adding a latency sample into the batch. The real work where we calculate
>> the path score is done once every ~15 seconds and that is done
>> under a separate workqueue. So we don't do any heavy lifting here during
>> I/O completion processing.
>>
>>>>        if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
>>>>            return;
>>>>        bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
>>>> @@ -238,6 +445,62 @@ static const char *nvme_ana_state_names[] = {
>>>>        [NVME_ANA_CHANGE]        = "change",
>>>>    };
>>>>    +static void nvme_mpath_reset_adaptive_path_stat(struct nvme_ns *ns)
>>>> +{
>>>> +    int i, cpu;
>>>> +    struct nvme_path_stat *stat;
>>>> +
>>>> +    for_each_possible_cpu(cpu) {
>>>> +        for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>>>> +            stat = &per_cpu_ptr(ns->info, cpu)[i].stat;
>>>> +            memset(stat, 0, sizeof(struct nvme_path_stat));
>>>> +        }
>>>> +    }
>>>> +}
>>>> +
>>>> +void nvme_mpath_cancel_adaptive_path_weight_work(struct nvme_ns *ns)
>>>> +{
>>>> +    int i, cpu;
>>>> +    struct nvme_path_info *info;
>>>> +
>>>> +    if (!test_bit(NVME_NS_PATH_STAT, &ns->flags))
>>>> +        return;
>>>> +
>>>> +    for_each_online_cpu(cpu) {
>>>> +        for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>>>> +            info = &per_cpu_ptr(ns->info, cpu)[i];
>>>> +            cancel_work_sync(&info->work.weight_work);
>>>> +        }
>>>> +    }
>>>> +}
>>>> +
>>>> +static bool nvme_mpath_enable_adaptive_path_policy(struct nvme_ns *ns)
>>>> +{
>>>> +    struct nvme_ns_head *head = ns->head;
>>>> +
>>>> +    if (!head->disk || head->subsys->iopolicy != NVME_IOPOLICY_ADAPTIVE)
>>>> +        return false;
>>>> +
>>>> +    if (test_and_set_bit(NVME_NS_PATH_STAT, &ns->flags))
>>>> +        return false;
>>>> +
>>>> +    blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, ns->queue);
>>> This is an undocumented change...
>> Sure, I will add a comment to this code in the next patch version.
>>
>>>> +    blk_stat_enable_accounting(ns->queue);
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static bool nvme_mpath_disable_adaptive_path_policy(struct nvme_ns *ns)
>>>> +{
>>>> +
>>>> +    if (!test_and_clear_bit(NVME_NS_PATH_STAT, &ns->flags))
>>>> +        return false;
>>>> +
>>>> +    blk_stat_disable_accounting(ns->queue);
>>>> +    blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, ns->queue);
>>>> +    nvme_mpath_reset_adaptive_path_stat(ns);
>>>> +    return true;
>>>> +}
>>>> +
>>>>    bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
>>>>    {
>>>>        struct nvme_ns_head *head = ns->head;
>>>> @@ -253,6 +516,8 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
>>>>                changed = true;
>>>>            }
>>>>        }
>>>> +    if (nvme_mpath_disable_adaptive_path_policy(ns))
>>>> +        changed = true;
>>> I don't understand why you are setting changed here. It relates to whether
>>> the current_path was changed, so this doesn't make sense to me.
>>>
>> I think you were correct. We don't have any rcu update here for adaptive path.
>> Will remove this.
>>
>>>>    out:
>>>>        return changed;
>>>>    }
>>>> @@ -271,6 +536,45 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
>>>>        srcu_read_unlock(&ctrl->srcu, srcu_idx);
>>>>    }
>>>>    +int nvme_alloc_ns_stat(struct nvme_ns *ns)
>>>> +{
>>>> +    int i, cpu;
>>>> +    struct nvme_path_work *work;
>>>> +    gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
>>>> +
>>>> +    if (!ns->head->disk)
>>>> +        return 0;
>>>> +
>>>> +    ns->info = __alloc_percpu_gfp(NVME_NUM_STAT_GROUPS *
>>>> +            sizeof(struct nvme_path_info),
>>>> +            __alignof__(struct nvme_path_info), gfp);
>>>> +    if (!ns->info)
>>>> +        return -ENOMEM;
>>>> +
>>>> +    for_each_possible_cpu(cpu) {
>>>> +        for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>>>> +            work = &per_cpu_ptr(ns->info, cpu)[i].work;
>>>> +            work->ns = ns;
>>>> +            work->op_type = i;
>>>> +            INIT_WORK(&work->weight_work, nvme_mpath_weight_work);
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static void nvme_mpath_set_ctrl_paths(struct nvme_ctrl *ctrl)
>>> Does this function set any ctrl paths? your code is very confusing.
>>>
>> Here "ctrl paths" means that we iterate through each of the controller's
>> namespace paths and set/enable the adaptive path parameters required for
>> each path. Moreover, we already have the corresponding
>> nvme_mpath_clear_ctrl_paths() function, which resets/clears the per-path
>> parameters while changing the I/O policy.
>>
>>>> +{
>>>> +    struct nvme_ns *ns;
>>>> +    int srcu_idx;
>>>> +
>>>> +    srcu_idx = srcu_read_lock(&ctrl->srcu);
>>>> +    list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
>>>> +                srcu_read_lock_held(&ctrl->srcu))
>>>> +        nvme_mpath_enable_adaptive_path_policy(ns);
>>>> +    srcu_read_unlock(&ctrl->srcu, srcu_idx);
>>> seems like it enables the iopolicy on all ctrl namespaces.
>>> the enable should also be more explicit like:
>>> nvme_enable_ns_lat_sampling or something like that.
>>>
>> okay, I'll rename it to the appropriate function name, as you suggested.
>>
>>>> +}
>>>> +
>>>>    void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>>>>    {
>>>>        struct nvme_ns_head *head = ns->head;
>>>> @@ -283,6 +587,8 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>>>>                     srcu_read_lock_held(&head->srcu)) {
>>>>            if (capacity != get_capacity(ns->disk))
>>>>                clear_bit(NVME_NS_READY, &ns->flags);
>>>> +
>>>> +        nvme_mpath_reset_adaptive_path_stat(ns);
>>>>        }
>>>>        srcu_read_unlock(&head->srcu, srcu_idx);
>>>>    @@ -407,6 +713,92 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
>>>>        return found;
>>>>    }
>>>>    +static inline bool nvme_state_is_live(enum nvme_ana_state state)
>>>> +{
>>>> +    return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
>>>> +}
>>>> +
>>>> +static struct nvme_ns *nvme_adaptive_path(struct nvme_ns_head *head,
>>>> +        unsigned int op_type)
>>>> +{
>>>> +    struct nvme_ns *ns, *start, *found = NULL;
>>>> +    struct nvme_path_stat *stat;
>>>> +    u32 weight;
>>>> +    int cpu;
>>>> +
>>>> +    cpu = get_cpu();
>>>> +    ns = *this_cpu_ptr(head->adp_path);
>>>> +    if (unlikely(!ns)) {
>>>> +        ns = list_first_or_null_rcu(&head->list,
>>>> +                struct nvme_ns, siblings);
>>>> +        if (unlikely(!ns))
>>>> +            goto out;
>>>> +    }
>>>> +found_ns:
>>>> +    start = ns;
>>>> +    while (nvme_path_is_disabled(ns) ||
>>>> +            !nvme_state_is_live(ns->ana_state)) {
>>>> +        ns = list_next_entry_circular(ns, &head->list, siblings);
>>>> +
>>>> +        /*
>>>> +         * If we iterate through all paths in the list but find each
>>>> +         * path in list is either disabled or dead then bail out.
>>>> +         */
>>>> +        if (ns == start)
>>>> +            goto out;
>>>> +    }
>>>> +
>>>> +    stat = &this_cpu_ptr(ns->info)[op_type].stat;
>>>> +
>>>> +    /*
>>>> +     * When the head path-list is singular we don't calculate the
>>>> +     * only path weight for optimization as we don't need to forward
>>>> +     * I/O to more than one path. The other possibility is when the
>>>> +     * path is newly added and we don't know its weight. So we go
>>>> +     * round-robin for each such path and forward I/O to it. Once we
>>>> +     * start getting responses for such I/Os, the path weight
>>>> +     * calculation kicks in and then we start using path credit for
>>>> +     * forwarding I/O.
>>>> +     */
>>>> +    weight = READ_ONCE(stat->weight);
>>>> +    if (!weight) {
>>>> +        found = ns;
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * To keep path selection logic simple, we don't distinguish
>>>> +     * between ANA optimized and non-optimized states. The non-
>>>> +     * optimized path is expected to have a lower weight, and
>>>> +     * therefore fewer credits. As a result, only a small number of
>>>> +     * I/Os will be forwarded to paths in the non-optimized state.
>>>> +     */
>>>> +    if (stat->credit > 0) {
>>>> +        --stat->credit;
>>>> +        found = ns;
>>>> +        goto out;
>>>> +    } else {
>>>> +        /*
>>>> +         * Refill credit from path weight and move to next path. The
>>>> +         * refilled credit of the current path will be used next when
>>>> +         * all remaining paths exhaust their credits.
>>>> +         */
>>>> +        weight = READ_ONCE(stat->weight);
>>>> +        stat->credit = weight;
>>>> +        ns = list_next_entry_circular(ns, &head->list, siblings);
>>>> +        if (likely(ns))
>>>> +            goto found_ns;
>>>> +    }
>>>> +out:
>>>> +    if (found) {
>>>> +        stat->sel++;
>>>> +        *this_cpu_ptr(head->adp_path) = found;
>>>> +    }
>>>> +
>>>> +    put_cpu();
>>>> +    return found;
>>>> +}
>>>> +
>>>>    static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
>>>>    {
>>>>        struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
>>>> @@ -463,9 +855,12 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
>>>>        return ns;
>>>>    }
>>>>    -inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>>> +inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head,
>>>> +        unsigned int op_type)
>>>>    {
>>>>        switch (READ_ONCE(head->subsys->iopolicy)) {
>>>> +    case NVME_IOPOLICY_ADAPTIVE:
>>>> +        return nvme_adaptive_path(head, op_type);
>>>>        case NVME_IOPOLICY_QD:
>>>>            return nvme_queue_depth_path(head);
>>>>        case NVME_IOPOLICY_RR:
>>>> @@ -525,7 +920,7 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
>>>>            return;
>>>>          srcu_idx = srcu_read_lock(&head->srcu);
>>>> -    ns = nvme_find_path(head);
>>>> +    ns = nvme_find_path(head, nvme_data_dir(bio_op(bio)));
>>>>        if (likely(ns)) {
>>>>            bio_set_dev(bio, ns->disk->part0);
>>>>            bio->bi_opf |= REQ_NVME_MPATH;
>>>> @@ -567,7 +962,7 @@ static int nvme_ns_head_get_unique_id(struct gendisk *disk, u8 id[16],
>>>>        int srcu_idx, ret = -EWOULDBLOCK;
>>>>          srcu_idx = srcu_read_lock(&head->srcu);
>>>> -    ns = nvme_find_path(head);
>>>> +    ns = nvme_find_path(head, NVME_STAT_OTHER);
>>>>        if (ns)
>>>>            ret = nvme_ns_get_unique_id(ns, id, type);
>>>>        srcu_read_unlock(&head->srcu, srcu_idx);
>>>> @@ -583,7 +978,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
>>>>        int srcu_idx, ret = -EWOULDBLOCK;
>>>>          srcu_idx = srcu_read_lock(&head->srcu);
>>>> -    ns = nvme_find_path(head);
>>>> +    ns = nvme_find_path(head, NVME_STAT_OTHER);
>>>>        if (ns)
>>>>            ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
>>>>        srcu_read_unlock(&head->srcu, srcu_idx);
>>>> @@ -725,6 +1120,9 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
>>>>        INIT_WORK(&head->partition_scan_work, nvme_partition_scan_work);
>>>>        INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
>>>>        head->delayed_removal_secs = 0;
>>>> +    head->adp_path = alloc_percpu_gfp(struct nvme_ns*, GFP_KERNEL);
>>>> +    if (!head->adp_path)
>>>> +        return -ENOMEM;
>>>>          /*
>>>>         * If "multipath_always_on" is enabled, a multipath node is added
>>>> @@ -809,6 +1207,10 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>>>>        }
>>>>        mutex_unlock(&head->lock);
>>>>    +    mutex_lock(&nvme_subsystems_lock);
>>>> +    nvme_mpath_enable_adaptive_path_policy(ns);
>>>> +    mutex_unlock(&nvme_subsystems_lock);
>>>> +
>>>>        synchronize_srcu(&head->srcu);
>>>>        kblockd_schedule_work(&head->requeue_work);
>>>>    }
>>>> @@ -857,11 +1259,6 @@ static int nvme_parse_ana_log(struct nvme_ctrl *ctrl, void *data,
>>>>        return 0;
>>>>    }
>>>>    -static inline bool nvme_state_is_live(enum nvme_ana_state state)
>>>> -{
>>>> -    return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
>>>> -}
>>>> -
>>>>    static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
>>>>            struct nvme_ns *ns)
>>>>    {
>>>> @@ -1039,10 +1436,12 @@ static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys,
>>>>          WRITE_ONCE(subsys->iopolicy, iopolicy);
>>>>    -    /* iopolicy changes clear the mpath by design */
>>>> +    /* iopolicy changes clear/reset the mpath by design */
>>>>        mutex_lock(&nvme_subsystems_lock);
>>>>        list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
>>>>            nvme_mpath_clear_ctrl_paths(ctrl);
>>>> +    list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
>>>> +        nvme_mpath_set_ctrl_paths(ctrl);
>>>>        mutex_unlock(&nvme_subsystems_lock);
>>>>          pr_notice("subsysnqn %s iopolicy changed from %s to %s\n",
>>>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>>>> index 102fae6a231c..715c7053054c 100644
>>>> --- a/drivers/nvme/host/nvme.h
>>>> +++ b/drivers/nvme/host/nvme.h
>>>> @@ -28,7 +28,10 @@ extern unsigned int nvme_io_timeout;
>>>>    extern unsigned int admin_timeout;
>>>>    #define NVME_ADMIN_TIMEOUT    (admin_timeout * HZ)
>>>>    -#define NVME_DEFAULT_KATO    5
>>>> +#define NVME_DEFAULT_KATO        5
>>>> +
>>>> +#define NVME_DEFAULT_ADP_EWMA_SHIFT    3
>>>> +#define NVME_DEFAULT_ADP_WEIGHT_TIMEOUT    (15 * NSEC_PER_SEC)
>>> You need these defines outside of nvme-mpath?
>>>
>> Currently, those macros are used in nvme/host/core.c.
>> I can move them inside that source file.
>>
>>>>      #ifdef CONFIG_ARCH_NO_SG_CHAIN
>>>>    #define  NVME_INLINE_SG_CNT  0
>>>> @@ -421,6 +424,7 @@ enum nvme_iopolicy {
>>>>        NVME_IOPOLICY_NUMA,
>>>>        NVME_IOPOLICY_RR,
>>>>        NVME_IOPOLICY_QD,
>>>> +    NVME_IOPOLICY_ADAPTIVE,
>>>>    };
>>>>      struct nvme_subsystem {
>>>> @@ -459,6 +463,37 @@ struct nvme_ns_ids {
>>>>        u8    csi;
>>>>    };
>>>>    +enum nvme_stat_group {
>>>> +    NVME_STAT_READ,
>>>> +    NVME_STAT_WRITE,
>>>> +    NVME_STAT_OTHER,
>>>> +    NVME_NUM_STAT_GROUPS
>>>> +};
>>> I see you have stats per io direction. However you don't have it per IO size. I wonder
>>> how this plays into this iopolicy.
>>>
>> Yes, you're correct, and as mentioned earlier we'd measure latency per
>> 512-byte block.
>>
>>>> +
>>>> +struct nvme_path_stat {
>>>> +    u64 nr_samples;        /* total num of samples processed */
>>>> +    u64 nr_ignored;        /* num. of samples ignored */
>>>> +    u64 slat_ns;        /* smoothed (ewma) latency in nanoseconds */
>>>> +    u64 score;        /* score used for weight calculation */
>>>> +    u64 last_weight_ts;    /* timestamp of the last weight calculation */
>>>> +    u64 sel;        /* num of times this path is selected for I/O */
>>>> +    u64 batch;        /* accumulated latency sum for current window */
>>>> +    u32 batch_count;    /* num of samples accumulated in current window */
>>>> +    u32 weight;        /* path weight */
>>>> +    u32 credit;        /* path credit for I/O forwarding */
>>>> +};
>>> I'm still not convinced that having this be per-cpu-per-ns really makes sense.
>> I understand your concern about whether it really makes sense to keep this
>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>> stat per-hctx instead of per-CPU.
>>
>> However, as mentioned earlier, during path selection we cannot reliably map an
>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>> that are local/near to the CPU issuing the request, which may lead to better
>> latency characteristics.
> 
> With this I tend to agree. but per-cpu has lots of other churns IMO.
> Maybe the answer is that paths weights are maintained per NUMA node?
> then accessing these weights in the fast-path is still cheap enough?

That’s a fair point, and I agree that per-CPU accounting can introduce additional
variability. However, moving to per-NUMA path weights would implicitly narrow the
scope of what we are trying to measure, as it would largely exclude components of
end-to-end latency that arise from scheduler behavior and application-level scheduling
effects. As discussed earlier, the intent of the adaptive policy is to capture the
actual I/O cost observed by the workload, which includes not only path and controller
locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
maintaining per-CPU path weights remains a better fit for the stated goal. It also
offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
preserving a true end-to-end view of path latency, agreed?

I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
host system has 32 cpus, so iperf3 was configured with 32 parallel TCP streams.
And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
ioengine=io_uring. Below are the aggregated throughput results observed under
different NVMe multipath I/O policies:

        numa         round-robin   queue-depth  adaptive
        -----------  -----------   -----------  ---------
READ:   61.1 MiB/s   87.2 MiB/s    93.1 MiB/s   107 MiB/s
WRITE:  95.8 MiB/s   138 MiB/s     159 MiB/s    179 MiB/s
RW:     R:29.8 MiB/s R:53.1 MiB/s  R:58.8 MiB/s R:66.6 MiB/s
        W:29.6 MiB/s W:52.7 MiB/s  W:58.2 MiB/s W:65.9 MiB/s

These results show that under combined CPU and network stress, the adaptive I/O policy
consistently delivers higher throughput across read, write, and mixed workloads when
compared against existing policies.
 
Thanks,
--Nilay




* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2025-12-18 11:19         ` Nilay Shroff
@ 2025-12-18 13:46           ` Hannes Reinecke
  2025-12-23 14:50             ` Nilay Shroff
  2025-12-25 12:28           ` Sagi Grimberg
  1 sibling, 1 reply; 28+ messages in thread
From: Hannes Reinecke @ 2025-12-18 13:46 UTC (permalink / raw)
  To: Nilay Shroff, Sagi Grimberg, linux-nvme
  Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce

On 12/18/25 12:19, Nilay Shroff wrote:
> 
> 
> On 12/16/25 5:06 AM, Sagi Grimberg wrote:
>>
>>
>> On 13/12/2025 9:27, Nilay Shroff wrote:
>>>
>>> On 12/12/25 6:34 PM, Sagi Grimberg wrote:
>>>>
>>>> On 05/11/2025 12:33, Nilay Shroff wrote:
>>>>> This commit introduces a new I/O policy named "adaptive". Users can
>>>>> configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
>>>>> subsystemX/iopolicy"
>>>>>
>>>>> The adaptive policy dynamically distributes I/O based on measured
>>>>> completion latency. The main idea is to calculate latency for each path,
>>>>> derive a weight, and then proportionally forward I/O according to those
>>>>> weights.
>>>>>
>>>>> To ensure scalability, path latency is measured per-CPU. Each CPU
>>>>> maintains its own statistics, and I/O forwarding uses these per-CPU
>>>>> values.
>>>> So a given cpu would select path-a while another cpu may select path-b?
>>>> How does that play with fewer queues than cpu cores? What happens to cores
>>>> that have low traffic?
>>>>
>>> The path-selection logic does not depend on the relationship between the number
>>> of CPUs and the number of hardware queues. It simply selects a path based on the
>>> per-CPU path score/credit, which reflects the relative performance of each available
>>> path.
>>> For example, assume we have two paths (A and B) to the same shared namespace.
>>> For each CPU, we maintain a smoothed latency estimate for every path. From these
>>> latency values we derive a per-path score or credit. The credit represents the relative
>>> share of I/O that each path should receive: a path with lower observed latency gets more
>>> credit, and a path with higher latency gets less.
>>
>> I understand that the stats are maintained per-cpu, however I am not sure that having
>> per-cpu path weights makes sense, meaning that if we have paths a,b,c, then for cpu0 we'll
>> have one set of weights and for cpu1 we'll have another set of weights.
>>
>> What if a given cpu happened to schedule some other application in a way that impacts
>> completion latency? Won't that skew the sampling? That is not related to the path at all. That
>> is possibly more noticeable in tcp, which completes in a kthread context.
>>
>> What do we lose if the 15-second weight assignment averages the sampling of all the cpus? Won't
>> that mitigate to some extent the issue of non-path related latency skew?
>>
> You’re right — what you’re describing is indeed possible. The intent of the adaptive policy,
> however, is to measure end-to-end I/O latency, rather than isolating only the raw path or
> transport latency.
> The observed completion latency intentionally includes all components that affect I/O from
> the host’s perspective: path latency, fabric or protocol stack latency (for example, TCP/IP),
> scheduler-induced delays, and the target device’s own I/O latency. By capturing the full
> end-to-end behavior, the policy reflects the actual cost of issuing I/O on a given path.
> Scheduler-related latency can vary over time due to workload placement or CPU contention,
> and this variability is accounted for by the design. Since per-path weights are recalculated
> periodically (for example, every 15 seconds), any sustained changes in CPU load or scheduling
> behavior are naturally incorporated into the path scoring. As a result, the policy can
> automatically adapt/adjust and rebalance I/O toward paths that are performing better under
> current system conditions.
> In short, while per-CPU sampling may include effects beyond the physical path itself, this is
> intentional and allows the adaptive policy to respond in real time to changing end-to-end
> performance characteristics.
> 
That was not the point.
Thing is, we _cannot_ move I/O away from a given CPU. Once I/O 
originates from a given CPU, it will stay on that CPU irrespective of 
the path taken.
Remember: the I/O scheduler decides which path a given I/O should take,
not which CPU any given I/O should run on.
So if a specific CPU has increased latency due to additional tasks /
interrupts running on it, it will show up _on all paths_, but only for
weights on that CPU.
And Sagi's point was that it would skew the measurement.

Which it certainly does.
But on the other hand _all_ I/O on this CPU will be affected, and we
don't have cross-talk to other CPUs (as this is a percpu counter).
So the only change would be that we're seeing increased numbers here;
the relation between paths won't change.
(Except in the really pathological case where the added latency is so
high that the path latency will get lost in the noise. But then it
wouldn't matter anyway as it'll be slow as hell.)
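
To make that concrete, here is a minimal user-space sketch (not the
patch code) that applies the weight formula quoted from the commit
message to two paths. The helper name path_weights() and the fallback
to an even split are assumptions made for illustration only.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical helper (not from the patch): derive percentage weights
 * for two paths from their smoothed latencies, following the formula
 * quoted from the commit message:
 *   score  = NSEC_PER_SEC / latency
 *   weight = score * 100 / total_score
 * Callers must pass non-zero latencies.
 */
static void path_weights(uint64_t lat_a_ns, uint64_t lat_b_ns,
			 uint32_t *w_a, uint32_t *w_b)
{
	uint64_t score_a = NSEC_PER_SEC / lat_a_ns;
	uint64_t score_b = NSEC_PER_SEC / lat_b_ns;
	uint64_t total = score_a + score_b;

	/* Assumption: fall back to an even split if both scores round to 0. */
	if (!total) {
		*w_a = 50;
		*w_b = 50;
		return;
	}

	*w_a = (uint32_t)(score_a * 100 / total);
	*w_b = (uint32_t)(score_b * 100 / total);
}
```

With 1 ms on path A and 3 ms on path B this yields weights 75 and 24.
Adding a common 2 ms of CPU-induced delay to both (3 ms and 5 ms)
yields 62 and 37: the ordering of the paths is kept, but with this
formula the weights move closer together as the common delay grows,
which is the "lost in the noise" case in the limit.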

>>>
>>> I/O distribution is thus governed directly by the available credits on that CPU. When the
>>> NVMe multipath driver performs path selection, it chooses the path with sufficient credits,
>>> updates the bio’s bdev to correspond to that path, and submits the bio. Only after this
>>> point does the block layer map the bio to an hctx through the usual ctx->hctx mapping (i.e.,
>>> matching the issuing CPU to the appropriate hardware queue). In other words, the multipath
>>> policy runs above the block-layer queueing logic, and the number of hardware queues does
>>> not affect how paths are scored or selected.
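
For readers following along, the credit handling described above can be
modelled in user space roughly as below. This is only a sketch under
simplifying assumptions: struct path_model and pick_path() are
hypothetical names, there is no per-CPU state, and ANA state and
disabled paths are ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the per-path state; not the kernel struct. */
struct path_model {
	uint32_t weight;	/* relative share of I/O for this path */
	uint32_t credit;	/* I/Os left before moving to the next path */
};

/*
 * Pick the index of the path for the next I/O, starting at *cur.
 * A path with credit left is used and its credit is decremented.
 * A path without credit gets its credit refilled from its weight and
 * the search moves on to the next path in the circular list.
 */
static size_t pick_path(struct path_model *paths, size_t nr, size_t *cur)
{
	for (;;) {
		struct path_model *p = &paths[*cur];

		/* Weight not known yet: just use this path. */
		if (!p->weight)
			return *cur;

		if (p->credit > 0) {
			p->credit--;
			return *cur;
		}

		/* Out of credit: refill and try the next path. */
		p->credit = p->weight;
		*cur = (*cur + 1) % nr;
	}
}
```

With two paths weighted 75 and 25 and all credits starting at zero, 100
consecutive calls pick the first path 75 times and the second 25 times.
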
>>
>> This is potentially another problem. An application may jump between cpu cores due to scheduling
>> constraints. In this case, how does the path selection policy adhere to the path weights?
>>
>> What I'm trying to say here is that the path selection should inherently reflect the path,
>> not the cpu core that was accessing this path. What I am concerned about is how this behaves
>> in the real world. Your tests run with a very distinct, artificial path variance, and they do not include
>> other workloads running on the system that can impact completion latency.
>>
>> It is possible that what I'm raising here is not a real concern, but I think we need to be able to demonstrate
>> that.
>>
> 
> In real-world systems, as stated earlier, the completion latency is influenced not only by
> the physical path but also by system load, scheduler behavior, and transport stack processing.
> By incorporating all of these factors into the latency measurement, the adaptive policy reflects
> the true cost of issuing I/O on a given path under current conditions. This allows it to respond
> to both path-level and system-level congestion.
> 
> In practice, during experiments with two paths (A and B), I observed that when additional latency—
> whether introduced via the path itself or through system load—was present on path A, subsequent I/O
> was automatically steered toward path B. Once conditions on path A improved, the policy rebalanced
> I/O based on the updated path weights. This behavior demonstrates that the policy adapts dynamically
> and remains effective even in the presence of CPU migration and competing workloads.
> Overall, while per-CPU sampling may appear counterintuitive at first, it enables the policy to capture
> real-world end-to-end performance and continuously adjust I/O distribution in response to changing
> system and path conditions.
> 
>>>
>>>>> Every ~15 seconds, a simple average latency of per-CPU batched
>>>>> samples are computed and fed into an Exponentially Weighted Moving
>>>>> Average (EWMA):
>>>> I suggest to have iopolicy name reflect ewma. maybe "ewma-lat"?
>>> Okay that sounds good! Shall we name it "ewma-lat" or "weighted-lat"?
>>
>> weighted-lat is simpler.
> Okay, I'll rename it to "weighted-lat".
>
>>>
>>>     Path weights are then derived from the smoothed (EWMA)
>>> latency as follows (example with two paths A and B):
>>>
>>>        path_A_score = NSEC_PER_SEC / path_A_ewma_latency
>>>        path_B_score = NSEC_PER_SEC / path_B_ewma_latency
>>>        total_score  = path_A_score + path_B_score
>>>
>>>        path_A_weight = (path_A_score * 100) / total_score
>>>        path_B_weight = (path_B_score * 100) / total_score
>>>
>>>> What happens to R/W mixed workloads? What happens when the I/O pattern
>>>> has a distribution of block sizes?
>>>>
>>> We maintain separate metrics for READ and WRITE traffic, and during path
>>> selection we use the appropriate metric depending on the I/O type.
>>>
>>> Regarding block-size variability: the current implementation does not yet
>>> account for I/O size. This is an important point — thank you for raising it.
>>> I discussed this today with Hannes at LPC, and we agreed that a practical
>>> approach is to normalize latency per 512-byte block. For our purposes, we
>>> do not need an exact latency value; a relative latency metric is sufficient,
>>> as it ultimately feeds into path scoring. A path with higher latency ends up
>>> with a lower score, and a path with lower latency gets a higher score — the
>>> exact absolute values are less important than maintaining consistent proportional
>>> relationships.
>>
>> I am not sure that normalizing to 512-byte blocks is a good proxy. I think that large I/O will
>> have much lower amortized latency per 512-byte block, which could create a false bias
>> to place a high weight on a path if that path happened to host large I/Os, no?
>>
> Hmm, I think yes, good point, I think for nvme over fabrics this could be true.
> 
Although technically we are then measuring two different things (I/O
latency vs block latency). But yeah, block latency might be better
suited for the normal case; I do wonder, though, if for high-speed
links we do see a difference as the data transfer time is getting
really fast...
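
As a rough illustration of the difference between the two metrics, a
per-512-byte normalization could look like the sketch below. The helper
name lat_per_512b_ns() and the rounding choices are assumptions, not
taken from the patch.

```c
#include <stdint.h>

#define BLOCK_BYTES 512ULL

/*
 * Hypothetical helper: convert a per-I/O completion latency into a
 * latency per 512-byte block. The I/O size is rounded up to whole
 * blocks, and a zero-length I/O is treated as a single block.
 */
static uint64_t lat_per_512b_ns(uint64_t lat_ns, uint32_t io_bytes)
{
	uint64_t blocks = (io_bytes + BLOCK_BYTES - 1) / BLOCK_BYTES;

	if (!blocks)
		blocks = 1;

	return lat_ns / blocks;
}
```

With made-up numbers: a 4 KiB I/O completing in 100 us gives 12500 ns
per block, while a 128 KiB I/O completing in 400 us gives 1562 ns per
block. The path carrying the large I/O looks about eight times better
per block although each of its I/Os took four times longer, which is
the bias Sagi describes.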

[ .. ]
>>> I understand your concern about whether it really makes sense to keep this
>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>> stat per-hctx instead of per-CPU.
>>>
>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>> that are local/near to the CPU issuing the request, which may lead to better
>>> latency characteristics.
>>
>> With this I tend to agree. but per-cpu has lots of other churns IMO.
>> Maybe the answer is that paths weights are maintained per NUMA node?
>> then accessing these weights in the fast-path is still cheap enough?
> 
> That’s a fair point, and I agree that per-CPU accounting can introduce additional
> variability. However, moving to per-NUMA path weights would implicitly narrow the
> scope of what we are trying to measure, as it would largely exclude components of
> end-to-end latency that arise from scheduler behavior and application-level scheduling
> effects. As discussed earlier, the intent of the adaptive policy is to capture the
> actual I/O cost observed by the workload, which includes not only path and controller
> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
> maintaining per-CPU path weights remains a better fit for the stated goal. It also
> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
> preserving a true end-to-end view of path latency, agreed?
> 
Well, for fabrics you can easily have several paths connected to the 
same NUMA node (like in the classical 'two initiator ports 
cross-connected to two target ports', resulting in four paths in total.
But two of these paths will always be on the same NUMA node).
So that doesn't work out.

> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
> host system has 32 cpus, so iperf3 was configured with 32 parallel TCP streams.
> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
> ioengine=io_uring. Below are the aggregated throughput results observed under
> different NVMe multipath I/O policies:
> 
>          numa         round-robin   queue-depth  adaptive
>          -----------  -----------   -----------  ---------
> READ:   61.1 MiB/s   87.2 MiB/s    93.1 MiB/s   107 MiB/s
> WRITE:  95.8 MiB/s   138 MiB/s     159 MiB/s    179 MiB/s
> RW:     R:29.8 MiB/s R:53.1 MiB/s  R:58.8 MiB/s R:66.6 MiB/s
>          W:29.6 MiB/s W:52.7 MiB/s  W:58.2 MiB/s W:65.9 MiB/s
> 
> These results show that under combined CPU and network stress, the adaptive I/O policy
> consistently delivers higher throughput across read, write, and mixed workloads when
> compared against existing policies.
>   
And that is probably the best argument; we should put it under stress 
with various scenarios. I must admit I am _really_ in favour of this
iopolicy, as it would be able to handle any temporary issues on the 
fabric (or backend) without the need of additional signalling.
Talk to me about FPIN ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2025-12-18 13:46           ` Hannes Reinecke
@ 2025-12-23 14:50             ` Nilay Shroff
  2025-12-25 12:45               ` Sagi Grimberg
  0 siblings, 1 reply; 28+ messages in thread
From: Nilay Shroff @ 2025-12-23 14:50 UTC (permalink / raw)
  To: Hannes Reinecke, Sagi Grimberg, linux-nvme
  Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce


[...]
>>> I am not sure that normalizing to 512-byte blocks is a good proxy. I think that large I/O will
>>> have much lower amortized latency per 512-byte block, which could create a false bias
>>> to place a high weight on a path if that path happened to host large I/Os, no?
>>>
>> Hmm, I think yes, good point, I think for nvme over fabrics this could be true.
>>
> Although technically we are then measuring two different things (I/O
> latency vs block latency). But yeah, block latency might be better
> suited for the normal case; I do wonder, though, if for high-speed
> links we do see a difference as the data transfer time is getting
> really fast...
> 
For a high-speed/high-bandwidth NIC the transfer would be very fast,
though I think for a very large I/O size we would see a higher latency due
to TCP segmentation and reassembly.

On my nvmf-tcp testbed, I do see the latency differences as shown below 
for varying I/O size (captured for random-read direct I/O workload):

I/O-size    Avg-latency(usec)
--------    -----------------
512         12113
1k          10058
2k          11246
4k          12458
8k          12189
16k         11617
32k         17686
64k         28504
128k        59013
256k        118984
512k        233428
1M          460000

As can be seen, for smaller block sizes (512B–16K), latency remains relatively
stable in the ~10–12 ms range. Starting at 32K and more noticeably at 64K and
above, latency increases significantly and roughly doubles with each step in
block size. Based on this data, I propose using coarse-grained I/O size buckets
to preserve latency characteristics while avoiding excessive fragmentation of
statistics. The suggested bucket layout is as follows:

Bucket		block-size-range
small		512B-32k
medium		32k-64k
large-64k	64k-128k
large-128k	128k-256k
large-256k	256k-512k
large-512k	512k-1M
very-large	>=1M

In this model,
- A single small bucket captures latency for I/O sizes where latency remains
  largely uniform.
- A medium bucket captures the transition region.
- Separate large buckets preserve the rapidly increasing latency behavior
  observed for larger block sizes.
- A very-large bucket handles any I/O beyond 1M.

This approach allows the adaptive policy to retain meaningful latency distinctions
across I/O size regimes while keeping the number of buckets manageable and
statistically stable. Does that make sense?
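
For illustration only, the size-to-bucket mapping could be sketched in C as
below. This is a rough sketch, not code from the series: the enum and function
names are made up, and it assumes each range is half-open, so that a 32k I/O
lands in the medium bucket where the latency increase begins.

```c
#include <stddef.h>

/* Hypothetical bucket indices following the layout proposed above. */
enum lat_bucket {
	LAT_BUCKET_SMALL,	/* below 32k */
	LAT_BUCKET_MEDIUM,	/* 32k up to, not including, 64k */
	LAT_BUCKET_LARGE_64K,	/* 64k up to, not including, 128k */
	LAT_BUCKET_LARGE_128K,	/* 128k up to, not including, 256k */
	LAT_BUCKET_LARGE_256K,	/* 256k up to, not including, 512k */
	LAT_BUCKET_LARGE_512K,	/* 512k up to, not including, 1M */
	LAT_BUCKET_VERY_LARGE,	/* 1M and above */
	LAT_BUCKET_COUNT
};

/* Map an I/O size in bytes to its bucket (assumed half-open ranges). */
static enum lat_bucket lat_bucket_for_size(size_t bytes)
{
	/* Exclusive upper bound, in bytes, of every bucket but the last. */
	static const size_t upper[] = {
		32u << 10, 64u << 10, 128u << 10,
		256u << 10, 512u << 10, 1u << 20
	};
	int i;

	for (i = 0; i < LAT_BUCKET_VERY_LARGE; i++)
		if (bytes < upper[i])
			return (enum lat_bucket)i;

	return LAT_BUCKET_VERY_LARGE;
}
```

With this mapping, sizes from 512B to 16k share LAT_BUCKET_SMALL, and anything
of 1M or more falls into LAT_BUCKET_VERY_LARGE.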

> [ .. ]
>>>> I understand your concern about whether it really makes sense to keep this
>>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>>> stat per-hctx instead of per-CPU.
>>>>
>>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>>> that are local/near to the CPU issuing the request, which may lead to better
>>>> latency characteristics.
>>>
>>> With this I tend to agree. but per-cpu has lots of other churns IMO.
>>> Maybe the answer is that paths weights are maintained per NUMA node?
>>> then accessing these weights in the fast-path is still cheap enough?
>>
>> That’s a fair point, and I agree that per-CPU accounting can introduce additional
>> variability. However, moving to per-NUMA path weights would implicitly narrow the
>> scope of what we are trying to measure, as it would largely exclude components of
>> end-to-end latency that arise from scheduler behavior and application-level scheduling
>> effects. As discussed earlier, the intent of the adaptive policy is to capture the
>> actual I/O cost observed by the workload, which includes not only path and controller
>> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
>> maintaining per-CPU path weights remains a better fit for the stated goal. It also
>> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
>> preserving a true end-to-end view of path latency, agreed?
>>
> Well, for fabrics you can easily have several paths connected to the same NUMA node (like in the classical 'two initiator ports cross-connected to two target ports', resulting in four paths in total.
> But two of these paths will always be on the same NUMA node).
> So that doesn't work out.
> 
>> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
>> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
>> host system has 32 cpus, so iperf3 was configured with 32 parallel TCP streams.
>> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
>> ioengine=io_uring. Below are the aggregated throughput results observed under
>> different NVMe multipath I/O policies:
>>
>>          numa         round-robin   queue-depth  adaptive
>>          -----------  -----------   -----------  ---------
>> READ:   61.1 MiB/s   87.2 MiB/s    93.1 MiB/s   107 MiB/s
>> WRITE:  95.8 MiB/s   138 MiB/s     159 MiB/s    179 MiB/s
>> RW:     R:29.8 MiB/s R:53.1 MiB/s  R:58.8 MiB/s R:66.6 MiB/s
>>          W:29.6 MiB/s W:52.7 MiB/s  W:58.2 MiB/s W:65.9 MiB/s
>>
>> These results show that under combined CPU and network stress, the adaptive I/O policy
>> consistently delivers higher throughput across read, write, and mixed workloads when
>> compared against existing policies.
>>   
> And that is probably the best argument; we should put it under stress with various scenarios. I must admit I am _really_ in favour of this
> iopolicy, as it would be able to handle any temporary issues on the fabric (or backend) without the need of additional signalling.
> Talk to me about FPIN ...
> 

I ran additional experiments on the NVMe-oF TCP testbed. The host has 32 cpus so fio
was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and ioengine=io_uring. 
Below are the aggregated throughput results observed under different NVMe multipath
I/O policies.

i) Stressing all 32 cpus using stress-ng 

All 32 CPUs were stressed using:
# stress-ng --cpu 0 --cpu-method all -t 60m

         numa          round-robin   queue-depth   adaptive
         -----------   -----------   -----------   -----------
READ:    159 MiB/s     193 MiB/s     215 MiB/s     255 MiB/s
WRITE:   188 MiB/s     186 MiB/s     195 MiB/s     199 MiB/s
RW:      R:83.4 MiB/s  R:101 MiB/s   R:104 MiB/s   R:111 MiB/s
         W:83.3 MiB/s  W:101 MiB/s   W:105 MiB/s   W:112 MiB/s

ii) Symmetric paths (No CPU stress and no induced network load):

         numa          round-robin   queue-depth   adaptive
         -----------   -----------   -----------   -----------
READ:    171 MiB/s     298 MiB/s     320 MiB/s     348 MiB/s
WRITE:   229 MiB/s     419 MiB/s     442 MiB/s     460 MiB/s
RW:      R:93.0 MiB/s  R:166 MiB/s   R:171 MiB/s   R:179 MiB/s
         W:94.2 MiB/s  W:168 MiB/s   W:168 MiB/s   W:178 MiB/s

These results show that the adaptive I/O policy consistently delivers higher
throughput under CPU stress and asymmetric path conditions. In the case of
symmetric paths, the adaptive policy achieves throughput comparable to, or
slightly better than, that of the existing policies.

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2025-12-18 11:19         ` Nilay Shroff
  2025-12-18 13:46           ` Hannes Reinecke
@ 2025-12-25 12:28           ` Sagi Grimberg
  1 sibling, 0 replies; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-25 12:28 UTC (permalink / raw)
  To: Nilay Shroff, linux-nvme; +Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce



On 18/12/2025 13:19, Nilay Shroff wrote:
>
> On 12/16/25 5:06 AM, Sagi Grimberg wrote:
>>
>> On 13/12/2025 9:27, Nilay Shroff wrote:
>>> On 12/12/25 6:34 PM, Sagi Grimberg wrote:
>>>> On 05/11/2025 12:33, Nilay Shroff wrote:
>>>>> This commit introduces a new I/O policy named "adaptive". Users can
>>>>> configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
>>>>> subsystemX/iopolicy"
>>>>>
>>>>> The adaptive policy dynamically distributes I/O based on measured
>>>>> completion latency. The main idea is to calculate latency for each path,
>>>>> derive a weight, and then proportionally forward I/O according to those
>>>>> weights.
>>>>>
>>>>> To ensure scalability, path latency is measured per-CPU. Each CPU
>>>>> maintains its own statistics, and I/O forwarding uses these per-CPU
>>>>> values.
>>>> So a given cpu would select path-a vs. another cpu that may select path-b?
>>>> How does that play with less queues than cpu cores? what happens to cores
>>>> that have low traffic?
>>>>
>>> The path-selection logic does not depend on the relationship between the number
>>> of CPUs and the number of hardware queues. It simply selects a path based on the
>>> per-CPU path score/credit, which reflects the relative performance of each available
>>> path.
>>> For example, assume we have two paths (A and B) to the same shared namespace.
>>> For each CPU, we maintain a smoothed latency estimate for every path. From these
>>> latency values we derive a per-path score or credit. The credit represents the relative
>>> share of I/O that each path should receive: a path with lower observed latency gets more
>>> credit, and a path with higher latency gets less.
>> I understand that the stats are maintained per-cpu, however I am not sure that having a
>> per-cpu path weights make sense. meaning that if we have paths a,b,c and for cpu0 we'll
>> have one set of weights and for cpu1 we'll have another set of weights.
>>
>> What if the a given cpu happened to schedule some other application in a way that impacts
>> completion latency? won't that skew the sampling? that is not related to the path at all. That
>> is possibly more noticeable in tcp which completes in a kthread context.
>>
>> What do we lose if the 15 seconds weight assignment, averages all the cpus sampling? won't
>> that mitigate to some extent the issue of non-path related latency skew?
>>
> You’re right — what you’re describing is indeed possible. The intent of the adaptive policy,
> however, is to measure end-to-end I/O latency, rather than isolating only the raw path or
> transport latency.
> The observed completion latency intentionally includes all components that affect I/O from
> the host’s perspective: path latency, fabric or protocol stack latency (for example, TCP/IP),
> scheduler-induced delays, and the target device’s own I/O latency. By capturing the full
> end-to-end behavior, the policy reflects the actual cost of issuing I/O on a given path.
> Scheduler-related latency can vary over time due to workload placement or CPU contention,
> and this variability is accounted for by the design. Since per-path weights are recalculated
> periodically (for example, every 15 seconds), any sustained changes in CPU load or scheduling
> behavior are naturally incorporated into the path scoring. As a result, the policy can
> automatically adapt/adjust and rebalance I/O toward paths that are performing better under
> current system conditions.
> In short, while per-CPU sampling may include effects beyond the physical path itself, this is
> intentional and allows the adaptive policy to respond in real time to changing end-to-end
> performance characteristics.

The issue is that you are crediting latency to a path where portions of
it (or maybe even the majority) may be completely unrelated to the path
at all. What I mean is that you are accounting for things that are
unrelated to the path selection.

In my mind, it would be better to amortize the cpu-local aspects of the
path selection (e.g. average out latency across cpus, or across the cpus
of a numa-node) when calculating credits, and then have all cpus use the
same credits.

>
>>> I/O distribution is thus governed directly by the available credits on that CPU. When the
>>> NVMe multipath driver performs path selection, it chooses the path with sufficient credits,
>>> updates the bio’s bdev to correspond to that path, and submits the bio. Only after this
>>> point does the block layer map the bio to an hctx through the usual ctx->hctx mapping (i.e.,
>>> matching the issuing CPU to the appropriate hardware queue). In other words, the multipath
>>> policy runs above the block-layer queueing logic, and the number of hardware queues does
>>> not affect how paths are scored or selected.
>> This is potentially another problem. application may jump between cpu cores due to scheduling
>> constraints. In this case, how is the path selection policy adhering to the path weights?
>>
>> What I'm trying to say here is that the path selection should be inherently reflective on the path,
>> not the cpu core that was accessing this path. What I am concerned about, is how this behaves
>> in the real-world. Your tests are running in a very distinct artificial path variance, and it does not include
>> other workloads that are running on the system that can impact completion latency.
>>
>> It is possible that what I'm raising here is not a real concern, but I think we need to be able to demonstrate
>> that.
>>
> In real-world systems, as stated earlier, the completion latency is influenced not only by
> the physical path but also by system load, scheduler behavior, and transport stack processing.
> By incorporating all of these factors into the latency measurement, the adaptive policy reflects
> the true cost of issuing I/O on a given path under current conditions. This allows it to respond
> to both path-level and system-level congestion.
>
> In practice, during experiments with two paths (A and B), I observed that when additional latency—
> whether introduced via the path itself or through system load—was present on path A, subsequent I/O
> was automatically steered toward path B. Once conditions on path A improved, the policy rebalanced
> I/O based on the updated path weights. This behavior demonstrates that the policy adapts dynamically
> and remains effective even in the presence of CPU migration and competing workloads.
> Overall, while per-CPU sampling may appear counterintuitive at first, it enables the policy to capture
> real-world end-to-end performance and continuously adjust I/O distribution in response to changing
> system and path conditions.

I just don't understand how the presence of additional workloads or the
system cpu load distribution should affect the path that you select. I
mean, you could choose the "worst" path while running on a cpu that
happens to run just your thread, and score it better than the "best"
path, which you were unfortunate enough to sample on a cpu that is
currently task switching between multiple cpu-intensive threads...
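
To make the concern concrete, here is a toy sketch with made-up numbers
(nothing here is measured, and the struct and helper names are hypothetical).
It models an observed completion latency as path latency plus a cpu-local
delay and checks whether the slower path ends up looking faster:

```c
/* All values are hypothetical and chosen only for illustration. */
struct ex_sample {
	unsigned int path_us;		/* latency attributable to the path */
	unsigned int cpu_delay_us;	/* scheduling delay on the sampling cpu */
};

/* What a per-cpu sampler would record for one completion. */
static unsigned int ex_observed_us(struct ex_sample s)
{
	return s.path_us + s.cpu_delay_us;
}

/* Returns 1 when the genuinely faster path is observed as the slower one. */
static int ex_ranking_inverted(struct ex_sample best, struct ex_sample worst)
{
	return ex_observed_us(best) > ex_observed_us(worst);
}
```

With a 100 usec path sampled on a cpu that adds 400 usec of scheduling delay,
and a 300 usec path sampled on an idle cpu, ex_ranking_inverted() returns 1:
the observed values are 500 vs 300 usec.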

>> With this I tend to agree. but per-cpu has lots of other churns IMO.
>> Maybe the answer is that paths weights are maintained per NUMA node?
>> then accessing these weights in the fast-path is still cheap enough?
> That’s a fair point, and I agree that per-CPU accounting can introduce additional
> variability. However, moving to per-NUMA path weights would implicitly narrow the
> scope of what we are trying to measure, as it would largely exclude components of
> end-to-end latency that arise from scheduler behavior and application-level scheduling
> effects.

Not sure I agree. I argue that it will help you clean up noise that is
unrelated to the evaluation of "path quality".

>   As discussed earlier, the intent of the adaptive policy is to capture the
> actual I/O cost observed by the workload, which includes not only path and controller
> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
> maintaining per-CPU path weights remains a better fit for the stated goal. It also
> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
> preserving a true end-to-end view of path latency, agreed?

It's not intuitive to me why it is not just adding noise.

>
> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
> host system has 32 cpus, so iperf3 was configured with 32 parallel TCP streams.
> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
> ioengine=io_uring. Below are the aggregated throughput results observed under
> different NVMe multipath I/O policies:
>
>          numa         round-robin   queue-depth  adaptive
>          -----------  -----------   -----------  ---------
> READ:   61.1 MiB/s   87.2 MiB/s    93.1 MiB/s   107 MiB/s
> WRITE:  95.8 MiB/s   138 MiB/s     159 MiB/s    179 MiB/s
> RW:     R:29.8 MiB/s R:53.1 MiB/s  R:58.8 MiB/s R:66.6 MiB/s
>          W:29.6 MiB/s W:52.7 MiB/s  W:58.2 MiB/s W:65.9 MiB/s
>
> These results show that under combined CPU and network stress, the adaptive I/O policy
> consistently delivers higher throughput across read, write, and mixed workloads when
> compared against existing policies.

I'm not arguing about other IO policies or the comparison against them.
We are discussing your implementation.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2025-12-23 14:50             ` Nilay Shroff
@ 2025-12-25 12:45               ` Sagi Grimberg
  2025-12-26 18:16                 ` Nilay Shroff
  0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-25 12:45 UTC (permalink / raw)
  To: Nilay Shroff, Hannes Reinecke, linux-nvme
  Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce



On 23/12/2025 16:50, Nilay Shroff wrote:
> [...]
>>>> I am not sure that normalizing to 512 blocks is a good proxy. I think that large IO will
>>>> have much lower amortized latency per 512 block. which could create an false bias
>>>> to place a high weight on a path, if that path happened to host large I/Os no?
>>>>
>>> Hmm, I think yes, good point, I think for nvme over fabrics this could be true.
>>>
>> Although technically we are then measure two different things (IO latency vs block latency). But yeah, block latency might be better
>> suited for the normal case; I do wonder, though, if for high-speed
>> links we do see a difference as the data transfer time is getting
>> really fast...
>>
> For a high speed/bandwidth NIC card the transfer speed would be very fast,
> though I think for a very large I/O size, we would see a higher latency due
> to tcp segmentation and re-assembly.
>
> On my nvmf-tcp testbed, I do see the latency differences as shown below
> for varying I/O size (captured for random-read direct I/O workload):
> I/O-size	Avg-latency(usec)
>   512            12113
>   1k             10058
>   2k             11246
>   4k             12458
>   8k             12189
>   16k            11617
>   32k            17686
>   64k            28504
>   128k           59013
>   256k           118984
>   512k           233428
>   1M             460000
>
> As can be seen, for smaller block sizes (512B–16K), latency remains relatively
> stable in the ~10–12 ms range. Starting at 32K and more noticeably at 64K and
> above, latency increases significantly and roughly doubles with each step in
> block size. Based on this data, I propose using coarse-grained I/O size buckets
> to preserve latency characteristics while avoiding excessive fragmentation of
> statistics. The suggested bucket layout is as follows:
>
> Bucket		block-size-range
> small		512B-32k
> medium		32k-64k
> large-64k	64k-128k
> large-128k	128k-256k
> large-256k	256k-512k
> large-512k	512k-1M
> very-large	>=1M
>
> In this model,
> - A single small bucket captures latency for I/O sizes where latency remains
>    largely uniform.
> - A medium bucket captures the transition region.
> - Separate large buckets preserve the rapidly increasing latency behavior
>    observed for larger block sizes.
> - A very-large bucket handles any I/O beyond 1M.
>
> This approach allows the adaptive policy to retain meaningful latency distinctions across
> I/O size regimes while keeping the number of buckets manageable and statistically stable,
> make sense?

Yes

>
>> [ .. ]
>>>>> I understand your concern about whether it really makes sense to keep this
>>>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>>>> stat per-hctx instead of per-CPU.
>>>>>
>>>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>>>> that are local/near to the CPU issuing the request, which may lead to better
>>>>> latency characteristics.
>>>> With this I tend to agree. but per-cpu has lots of other churns IMO.
>>>> Maybe the answer is that paths weights are maintained per NUMA node?
>>>> then accessing these weights in the fast-path is still cheap enough?
>>> That’s a fair point, and I agree that per-CPU accounting can introduce additional
>>> variability. However, moving to per-NUMA path weights would implicitly narrow the
>>> scope of what we are trying to measure, as it would largely exclude components of
>>> end-to-end latency that arise from scheduler behavior and application-level scheduling
>>> effects. As discussed earlier, the intent of the adaptive policy is to capture the
>>> actual I/O cost observed by the workload, which includes not only path and controller
>>> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
>>> maintaining per-CPU path weights remains a better fit for the stated goal. It also
>>> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
>>> preserving a true end-to-end view of path latency, agreed?
>>>
>> Well, for fabrics you can easily have several paths connected to the same NUMA node (like in the classical 'two initiator ports cross-connected to two target ports', resulting in four paths in total.
>> But two of these paths will always be on the same NUMA node).
>> So that doesn't work out.
>>
>>> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
>>> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
>>> host system has 32 cpus, so iperf3 was configured with 32 parallel TCP streams.
>>> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
>>> ioengine=io_uring. Below are the aggregated throughput results observed under
>>> different NVMe multipath I/O policies:
>>>
>>>           numa         round-robin   queue-depth  adaptive
>>>           -----------  -----------   -----------  ---------
>>> READ:   61.1 MiB/s   87.2 MiB/s    93.1 MiB/s   107 MiB/s
>>> WRITE:  95.8 MiB/s   138 MiB/s     159 MiB/s    179 MiB/s
>>> RW:     R:29.8 MiB/s R:53.1 MiB/s  R:58.8 MiB/s R:66.6 MiB/s
>>>           W:29.6 MiB/s W:52.7 MiB/s  W:58.2 MiB/s W:65.9 MiB/s
>>>
>>> These results show that under combined CPU and network stress, the adaptive I/O policy
>>> consistently delivers higher throughput across read, write, and mixed workloads when
>>> compared against existing policies.
>>>    
>> And that is probably the best argument; we should put it under stress with various scenarios. I must admit I am _really_ in favour of this
>> iopolicy, as it would be able to handle any temporary issues on the fabric (or backend) without the need of additional signalling.
>> Talk to me about FPIN ...
>>
> I ran additional experiments on the NVMe-oF TCP testbed. The host has 32 cpus so fio
> was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and ioengine=io_uring.
> Below are the aggregated throughput results observed under different NVMe multipath
> I/O policies.
>
> i) Stressing all 32 cpus using stress-ng
>
> All 32 CPUs were stressed using:
> # stress-ng --cpu 0 --cpu-method all -t 60m
>
>           numa          round-robin   queue-depth  adaptive
>           -----------   -----------   -----------  ---------
> READ:    159 MiB/s     193 MiB/s     215 MiB/s    255 MiB/s
> WRITE:   188 MiB/s     186 MiB/s     195 MiB/s    199 MiB/s
> RW:      R:83.4 MiB/s  R:101 MiB/s   R:104 MiB/s  R: 111 MiB/s
>           W:83.3 MiB/s  W:101 MiB/s   W:105 MiB/s  W: 112 MiB/s
>
> ii) Symmetric paths (No CPU stress and no induced network load):
>
>           numa          round-robin   queue-depth   adaptive
>           -----------   -----------   -----------   ---------
> READ:    171 MiB/s     298 MiB/s     320 MiB/s     348 MiB/s
> WRITE:   229 MiB/s     419 MiB/s     442 MiB/s     460 MiB/s
> RW:     R: 93.0 MiB/s  R: 166 MiB/s  R: 171 MiB/s  R: 179 MiB/s
>          W: 94.2 MiB/s  W: 168 MiB/s  W: 168 MiB/s  W: 178 MiB/s
>
> These results show that the adaptive I/O policy consistently delivers higher
> throughput under CPU stress and asymmetric path conditions. In case of symmetric
> paths the adaptive policy achieves throughput comparable to—or slightly
> better than—existing policies.

I still think that accounting uncorrelated latency is not the best
approach here.

My intuition tells me that:
1. averaging latencies over the numa-node
2. calculating weights
3. distributing the new weights to each cpu in the numa-node

is a better approach. It is hard to evaluate without adding some randomness.
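
The three steps could look roughly like the sketch below. It is only an
illustration: the names, the fixed array sizes, and the choice of weights
inversely proportional to the averaged latency are assumptions, not something
taken from the patch.

```c
#include <stdint.h>

#define EX_NR_CPUS   4		/* hypothetical: cpus in one numa-node */
#define EX_NR_PATHS  2		/* hypothetical: paths to the namespace */
#define EX_SCALE     1000000ULL	/* fixed-point scale for 1/latency */

/*
 * lat_us[cpu][path]: smoothed per-cpu latency in usec (input).
 * weight[cpu][path]: resulting share of I/O in percent (output).
 */
static void ex_node_rebalance(uint64_t lat_us[EX_NR_CPUS][EX_NR_PATHS],
			      unsigned int weight[EX_NR_CPUS][EX_NR_PATHS])
{
	uint64_t inv[EX_NR_PATHS], inv_sum = 0;
	int cpu, p;

	for (p = 0; p < EX_NR_PATHS; p++) {
		uint64_t sum = 0, avg;

		/* 1. average the per-cpu latencies over the numa-node */
		for (cpu = 0; cpu < EX_NR_CPUS; cpu++)
			sum += lat_us[cpu][p];
		avg = sum / EX_NR_CPUS;
		if (!avg)
			avg = 1;	/* avoid dividing by zero */

		/* 2a. a lower averaged latency gives a larger inverse score */
		inv[p] = EX_SCALE / avg;
		inv_sum += inv[p];
	}

	for (p = 0; p < EX_NR_PATHS; p++) {
		/* 2b. normalise to a percentage; even split if all scores are 0 */
		unsigned int w = inv_sum ?
			(unsigned int)(inv[p] * 100 / inv_sum) :
			100 / EX_NR_PATHS;

		/* 3. every cpu in the numa-node gets the same weight */
		for (cpu = 0; cpu < EX_NR_CPUS; cpu++)
			weight[cpu][p] = w;
	}
}
```

For example, with per-cpu latencies of {100,300}, {500,300}, {100,300} and
{100,300} usec, the node averages are 200 and 300 usec and every cpu ends up
with weights 60 and 39, including the second cpu, whose local samples alone
would have favoured the second path.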

Can you please run benchmarks with
`blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode`?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2025-12-25 12:45               ` Sagi Grimberg
@ 2025-12-26 18:16                 ` Nilay Shroff
  2025-12-27  9:33                   ` Sagi Grimberg
  2025-12-27  9:37                   ` Sagi Grimberg
  0 siblings, 2 replies; 28+ messages in thread
From: Nilay Shroff @ 2025-12-26 18:16 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, linux-nvme
  Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce



On 12/25/25 6:15 PM, Sagi Grimberg wrote:
> 
> 
> On 23/12/2025 16:50, Nilay Shroff wrote:
>> [...]
>>>>> I am not sure that normalizing to 512 blocks is a good proxy. I think that large IO will
>>>>> have much lower amortized latency per 512 block. which could create an false bias
>>>>> to place a high weight on a path, if that path happened to host large I/Os no?
>>>>>
>>>> Hmm, I think yes, good point, I think for nvme over fabrics this could be true.
>>>>
>>> Although technically we are then measure two different things (IO latency vs block latency). But yeah, block latency might be better
>>> suited for the normal case; I do wonder, though, if for high-speed
>>> links we do see a difference as the data transfer time is getting
>>> really fast...
>>>
>> For a high speed/bandwidth NIC card the transfer speed would be very fast,
>> though I think for a very large I/O size, we would see a higher latency due
>> to tcp segmentation and re-assembly.
>>
>> On my nvmf-tcp testbed, I do see the latency differences as shown below
>> for varying I/O size (captured for random-read direct I/O workload):
>> I/O-size    Avg-latency(usec)
>>   512            12113
>>   1k             10058
>>   2k             11246
>>   4k             12458
>>   8k             12189
>>   16k            11617
>>   32k            17686
>>   64k            28504
>>   128k           59013
>>   256k           118984
>>   512k           233428
>>   1M             460000
>>
>> As can be seen, for smaller block sizes (512B–16K), latency remains relatively
>> stable in the ~10–12 ms range. Starting at 32K and more noticeably at 64K and
>> above, latency increases significantly and roughly doubles with each step in
>> block size. Based on this data, I propose using coarse-grained I/O size buckets
>> to preserve latency characteristics while avoiding excessive fragmentation of
>> statistics. The suggested bucket layout is as follows:
>>
>> Bucket        block-size-range
>> small        512B-32k
>> medium        32k-64k
>> large-64k    64k-128k
>> large-128k    128k-256k
>> large-256k    256k-512k
>> large-512k    512k-1M
>> very-large    >=1M
>>
>> In this model,
>> - A single small bucket captures latency for I/O sizes where latency remains
>>    largely uniform.
>> - A medium bucket captures the transition region.
>> - Separate large buckets preserve the rapidly increasing latency behavior
>>    observed for larger block sizes.
>> - A very-large bucket handles any I/O beyond 1M.
>>
>> This approach allows the adaptive policy to retain meaningful latency distinctions across
>> I/O size regimes while keeping the number of buckets manageable and statistically stable,
>> make sense?
> 
> Yes
> 
>>
>>> [ .. ]
>>>>>> I understand your concern about whether it really makes sense to keep this
>>>>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>>>>> stat per-hctx instead of per-CPU.
>>>>>>
>>>>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>>>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>>>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>>>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>>>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>>>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>>>>> that are local/near to the CPU issuing the request, which may lead to better
>>>>>> latency characteristics.
>>>>> With this I tend to agree. but per-cpu has lots of other churns IMO.
>>>>> Maybe the answer is that paths weights are maintained per NUMA node?
>>>>> then accessing these weights in the fast-path is still cheap enough?
>>>> That’s a fair point, and I agree that per-CPU accounting can introduce additional
>>>> variability. However, moving to per-NUMA path weights would implicitly narrow the
>>>> scope of what we are trying to measure, as it would largely exclude components of
>>>> end-to-end latency that arise from scheduler behavior and application-level scheduling
>>>> effects. As discussed earlier, the intent of the adaptive policy is to capture the
>>>> actual I/O cost observed by the workload, which includes not only path and controller
>>>> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
>>>> maintaining per-CPU path weights remains a better fit for the stated goal. It also
>>>> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
>>>> preserving a true end-to-end view of path latency, agreed?
>>>>
>>> Well, for fabrics you can easily have several paths connected to the same NUMA node (like in the classical 'two initiator ports cross-connected to two target ports', resulting in four paths in total.
>>> But two of these paths will always be on the same NUMA node).
>>> So that doesn't work out.
>>>
>>>> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
>>>> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
>>>> host system has 32 cpus, so iperf3 was configured with 32 parallel TCP streams.
>>>> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
>>>> ioengine=io_uring. Below are the aggregated throughput results observed under
>>>> different NVMe multipath I/O policies:
>>>>
>>>>           numa         round-robin   queue-depth  adaptive
>>>>           -----------  -----------   -----------  ---------
>>>> READ:   61.1 MiB/s   87.2 MiB/s    93.1 MiB/s   107 MiB/s
>>>> WRITE:  95.8 MiB/s   138 MiB/s     159 MiB/s    179 MiB/s
>>>> RW:     R:29.8 MiB/s R:53.1 MiB/s  R:58.8 MiB/s R:66.6 MiB/s
>>>>           W:29.6 MiB/s W:52.7 MiB/s  W:58.2 MiB/s W:65.9 MiB/s
>>>>
>>>> These results show that under combined CPU and network stress, the adaptive I/O policy
>>>> consistently delivers higher throughput across read, write, and mixed workloads when
>>>> compared against existing policies.
>>>>    
>>> And that is probably the best argument; we should put it under stress with various scenarios. I must admit I am _really_ in favour of this
>>> iopolicy, as it would be able to handle any temporary issues on the fabric (or backend) without the need of additional signalling.
>>> Talk to me about FPIN ...
>>>
>> I ran additional experiments on the NVMe-oF TCP testbed. The host has 32 cpus so fio
>> was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and ioengine=io_uring.
>> Below are the aggregated throughput results observed under different NVMe multipath
>> I/O policies.
>>
>> i) Stressing all 32 cpus using stress-ng
>>
>> All 32 CPUs were stressed using:
>> # stress-ng --cpu 0 --cpu-method all -t 60m
>>
>>           numa          round-robin   queue-depth  adaptive
>>           -----------   -----------   -----------  ---------
>> READ:    159 MiB/s     193 MiB/s     215 MiB/s    255 MiB/s
>> WRITE:   188 MiB/s     186 MiB/s     195 MiB/s    199 MiB/s
>> RW:      R:83.4 MiB/s  R:101 MiB/s   R:104 MiB/s  R: 111 MiB/s
>>           W:83.3 MiB/s  W:101 MiB/s   W:105 MiB/s  W: 112 MiB/s
>>
>> ii) Symmetric paths (No CPU stress and no induced network load):
>>
>>           numa          round-robin   queue-depth   adaptive
>>           -----------   -----------   -----------   ---------
>> READ:    171 MiB/s     298 MiB/s     320 MiB/s     348 MiB/s
>> WRITE:   229 MiB/s     419 MiB/s     442 MiB/s     460 MiB/s
>> RW:     R: 93.0 MiB/s  R: 166 MiB/s  R: 171 MiB/s  R: 179 MiB/s
>>          W: 94.2 MiB/s  W: 168 MiB/s  W: 168 MiB/s  W: 178 MiB/s
>>
>> These results show that the adaptive I/O policy consistently delivers higher
>> throughput under CPU stress and asymmetric path conditions. In case of symmetric
>> paths the adaptive policy achieves throughput comparable to—or slightly
>> better than—existing policies.
> 
> I still think that accounting uncorrelated latency is the best approach here.
> 
> My intuition tells me that:
> 1. averaging latencies over numa-node
> 2. calculating weights
> 3. distribute new weights per-cpu in the numa-node
> 
> Is a better approach. It is hard to evaluate without adding some randomness.
> 
> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?

Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
file I used for the test, followed by the observed throughput result for reference.

Job file:
=========

[global]
time_based
runtime=120
group_reporting=1

[cpu]
ioengine=cpuio
cpuload=85
cpumode=qsort
numjobs=32

[disk]
ioengine=io_uring
filename=/dev/nvme1n2
rw=<randread/randwrite/randrw>
bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
iodepth=32
numjobs=32
direct=1

Throughput:
===========

         numa          round-robin   queue-depth    adaptive
         -----------   -----------   -----------    ---------
READ:    1120 MiB/s    2241 MiB/s    2233 MiB/s     2215 MiB/s
WRITE:   1107 MiB/s    1875 MiB/s    1847 MiB/s     1892 MiB/s
RW:      R:1001 MiB/s  R:1047 MiB/s  R:1086 MiB/s   R:1112 MiB/s
         W:999  MiB/s  W:1045 MiB/s  W:1084 MiB/s   W:1111 MiB/s

When comparing the results, I did not observe a significant throughput
difference between the queue-depth, round-robin, and adaptive policies.
With random I/O of mixed sizes, the adaptive policy appears to average
out the varying latency values and distribute I/O reasonably evenly
across the active paths (assuming symmetric paths).

Next I'll implement I/O size buckets and also per-NUMA-node weights, and
then rerun the tests and share the results. Let's see if these changes help
further improve the throughput numbers for the adaptive policy. We can then
review the results again and discuss further.

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2025-12-26 18:16                 ` Nilay Shroff
@ 2025-12-27  9:33                   ` Sagi Grimberg
  2025-12-27  9:37                   ` Sagi Grimberg
  1 sibling, 0 replies; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-27  9:33 UTC (permalink / raw)
  To: Nilay Shroff, Hannes Reinecke, linux-nvme
  Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce



On 26/12/2025 20:16, Nilay Shroff wrote:
>
> On 12/25/25 6:15 PM, Sagi Grimberg wrote:
>>
>> On 23/12/2025 16:50, Nilay Shroff wrote:
>>> [...]
>>>>>> I am not sure that normalizing to 512 blocks is a good proxy. I think that large IO will
>>>>>> have much lower amortized latency per 512 block, which could create a false bias
>>>>>> to place a high weight on a path, if that path happened to host large I/Os no?
>>>>>>
>>>>> Hmm, I think yes, good point, I think for nvme over fabrics this could be true.
>>>>>
>>>> Although technically we are then measuring two different things (IO latency vs block latency). But yeah, block latency might be better
>>>> suited for the normal case; I do wonder, though, if for high-speed
>>>> links we do see a difference as the data transfer time is getting
>>>> really fast...
>>>>
>>> For a high speed/bandwidth NIC card the transfer speed would be very fast,
>>> though I think for a very large I/O size, we would see a higher latency due
>>> to tcp segmentation and re-assembly.
>>>
>>> On my nvmf-tcp testbed, I do see the latency differences as shown below
>>> for varying I/O size (captured for random-read direct I/O workload):
>>> I/O-size    Avg-latency(usec)
>>>    512            12113
>>>    1k             10058
>>>    2k             11246
>>>    4k             12458
>>>    8k             12189
>>>    16k            11617
>>>    32k            17686
>>>    64k            28504
>>>    128k           59013
>>>    256k           118984
>>>    512k           233428
>>>    1M             460000
>>>
>>> As can be seen, for smaller block sizes (512B–16K), latency remains relatively
>>> stable in the ~10–12 ms range. Starting at 32K and more noticeably at 64K and
>>> above, latency increases significantly and roughly doubles with each step in
>>> block size. Based on this data, I propose using coarse-grained I/O size buckets
>>> to preserve latency characteristics while avoiding excessive fragmentation of
>>> statistics. The suggested bucket layout is as follows:
>>>
>>> Bucket        block-size-range
>>> small        512B-32k
>>> medium        32k-64k
>>> large-64k    64k-128k
>>> large-128k    128k-256k
>>> large-256k    256k-512k
>>> large-512k    512k-1M
>>> very-large    >=1M
>>>
>>> In this model,
>>> - A single small bucket captures latency for I/O sizes where latency remains
>>>     largely uniform.
>>> - A medium bucket captures the transition region.
>>> - Separate large buckets preserve the rapidly increasing latency behavior
>>>     observed for larger block sizes.
>>> - A very-large bucket handles any I/O beyond 1M.
>>>
>>> This approach allows the adaptive policy to retain meaningful latency distinctions across
>>> I/O size regimes while keeping the number of buckets manageable and statistically stable,
>>> make sense?
>> Yes
>>
>>>> [ .. ]
>>>>>>> I understand your concern about whether it really makes sense to keep this
>>>>>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>>>>>> stat per-hctx instead of per-CPU.
>>>>>>>
>>>>>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>>>>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>>>>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>>>>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>>>>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>>>>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>>>>>> that are local/near to the CPU issuing the request, which may lead to better
>>>>>>> latency characteristics.
>>>>>> With this I tend to agree. but per-cpu has lots of other churns IMO.
>>>>>> Maybe the answer is that paths weights are maintained per NUMA node?
>>>>>> then accessing these weights in the fast-path is still cheap enough?
>>>>> That’s a fair point, and I agree that per-CPU accounting can introduce additional
>>>>> variability. However, moving to per-NUMA path weights would implicitly narrow the
>>>>> scope of what we are trying to measure, as it would largely exclude components of
>>>>> end-to-end latency that arise from scheduler behavior and application-level scheduling
>>>>> effects. As discussed earlier, the intent of the adaptive policy is to capture the
>>>>> actual I/O cost observed by the workload, which includes not only path and controller
>>>>> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
>>>>> maintaining per-CPU path weights remains a better fit for the stated goal. It also
>>>>> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
>>>>> preserving a true end-to-end view of path latency, agreed?
>>>>>
>>>> Well, for fabrics you can easily have several paths connected to the same NUMA node (like in the classical 'two initiator ports cross-connected to two target ports', resulting in four paths in total.
>>>> But two of these paths will always be on the same NUMA node).
>>>> So that doesn't work out.
>>>>
>>>>> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
>>>>> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
>>>>> host system has 32 cpus, so iperf3 was configured with 32 parallel TCP streams.
>>>>> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
>>>>> ioengine=io_uring. Below are the aggregated throughput results observed under
>>>>> different NVMe multipath I/O policies:
>>>>>
>>>>>            numa         round-robin   queue-depth  adaptive
>>>>>            -----------  -----------   -----------  ---------
>>>>> READ:   61.1 MiB/s   87.2 MiB/s    93.1 MiB/s   107 MiB/s
>>>>> WRITE:  95.8 MiB/s   138 MiB/s     159 MiB/s    179 MiB/s
>>>>> RW:     R:29.8 MiB/s R:53.1 MiB/s  R:58.8 MiB/s R:66.6 MiB/s
>>>>>            W:29.6 MiB/s W:52.7 MiB/s  W:58.2 MiB/s W:65.9 MiB/s
>>>>>
>>>>> These results show that under combined CPU and network stress, the adaptive I/O policy
>>>>> consistently delivers higher throughput across read, write, and mixed workloads when
>>>>> compared against existing policies.
>>>>>     
>>>> And that is probably the best argument; we should put it under stress with various scenarios. I must admit I am _really_ in favour of this
>>>> iopolicy, as it would be able to handle any temporary issues on the fabric (or backend) without the need of additional signalling.
>>>> Talk to me about FPIN ...
>>>>
>>> I ran additional experiments on the NVMe-oF TCP testbed. The host has 32 cpus so fio
>>> was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and ioengine=io_uring.
>>> Below are the aggregated throughput results observed under different NVMe multipath
>>> I/O policies.
>>>
>>> i) Stressing all 32 cpus using stress-ng
>>>
>>> All 32 CPUs were stressed using:
>>> # stress-ng --cpu 0 --cpu-method all -t 60m
>>>
>>>            numa          round-robin   queue-depth  adaptive
>>>            -----------   -----------   -----------  ---------
>>> READ:    159 MiB/s     193 MiB/s     215 MiB/s    255 MiB/s
>>> WRITE:   188 MiB/s     186 MiB/s     195 MiB/s    199 MiB/s
>>> RW:      R:83.4 MiB/s  R:101 MiB/s   R:104 MiB/s  R: 111 MiB/s
>>>            W:83.3 MiB/s  W:101 MiB/s   W:105 MiB/s  W: 112 MiB/s
>>>
>>> ii) Symmetric paths (No CPU stress and no induced network load):
>>>
>>>            numa          round-robin   queue-depth   adaptive
>>>            -----------   -----------   -----------   ---------
>>> READ:    171 MiB/s     298 MiB/s     320 MiB/s     348 MiB/s
>>> WRITE:   229 MiB/s     419 MiB/s     442 MiB/s     460 MiB/s
>>> RW:     R: 93.0 MiB/s  R: 166 MiB/s  R: 171 MiB/s  R: 179 MiB/s
>>>           W: 94.2 MiB/s  W: 168 MiB/s  W: 168 MiB/s  W: 178 MiB/s
>>>
>>> These results show that the adaptive I/O policy consistently delivers higher
>>> throughput under CPU stress and asymmetric path conditions. In case of symmetric
>>> paths the adaptive policy achieves throughput comparable to—or slightly
>>> better than—existing policies.
>> I still think that accounting uncorrelated latency is the best approach here.
>>
>> My intuition tells me that:
>> 1. averaging latencies over numa-node
>> 2. calculating weights
>> 3. distribute new weights per-cpu in the numa-node
>>
>> Is a better approach. It is hard to evaluate without adding some randomness.
>>
>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
> file I used for the test, followed by the observed throughput result for reference.
>
> Job file:
> =========
>
> [global]
> time_based
> runtime=120
> group_reporting=1
>
> [cpu]
> ioengine=cpuio
> cpuload=85
> cpumode=qsort
> numjobs=32
>
> [disk]
> ioengine=io_uring
> filename=/dev/nvme1n2
> rw=<randread/randwrite/randrw>
> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
> iodepth=32
> numjobs=32
> direct=1
>
> Throughput:
> ===========
>
>           numa          round-robin   queue-depth    adaptive
>           -----------   -----------   -----------    ---------
> READ:    1120 MiB/s    2241 MiB/s    2233 MiB/s     2215 MiB/s
> WRITE:   1107 MiB/s    1875 MiB/s    1847 MiB/s     1892 MiB/s
> RW:      R:1001 MiB/s  R:1047 MiB/s  R:1086 MiB/s   R:1112 MiB/s
>           W:999  MiB/s  W:1045 MiB/s  W:1084 MiB/s   W:1111 MiB/s
>
> When comparing the results, I did not observe a significant throughput
> difference between the queue-depth, round-robin, and adaptive policies.
> With random I/O of mixed sizes, the adaptive policy appears to average
> out the varying latency values and distribute I/O reasonably evenly
> across the active paths (assuming symmetric paths).
>
> Next I'd implement I/O size buckets and also per-numa node weight and
> then rerun tests and share the result. Let's see if these changes help
> further improve the throughput number for adaptive policy. We may then
> again review the results and discuss further.

Two comments:


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2025-12-26 18:16                 ` Nilay Shroff
  2025-12-27  9:33                   ` Sagi Grimberg
@ 2025-12-27  9:37                   ` Sagi Grimberg
  2026-01-04  9:07                     ` Nilay Shroff
  1 sibling, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-27  9:37 UTC (permalink / raw)
  To: Nilay Shroff, Hannes Reinecke, linux-nvme
  Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce


>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
> file I used for the test, followed by the observed throughput result for reference.
>
> Job file:
> =========
>
> [global]
> time_based
> runtime=120
> group_reporting=1
>
> [cpu]
> ioengine=cpuio
> cpuload=85
> cpumode=qsort
> numjobs=32
>
> [disk]
> ioengine=io_uring
> filename=/dev/nvme1n2
> rw=<randread/randwrite/randrw>
> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
> iodepth=32
> numjobs=32
> direct=1
>
> Throughput:
> ===========
>
>           numa          round-robin   queue-depth    adaptive
>           -----------   -----------   -----------    ---------
> READ:    1120 MiB/s    2241 MiB/s    2233 MiB/s     2215 MiB/s
> WRITE:   1107 MiB/s    1875 MiB/s    1847 MiB/s     1892 MiB/s
> RW:      R:1001 MiB/s  R:1047 MiB/s  R:1086 MiB/s   R:1112 MiB/s
>           W:999  MiB/s  W:1045 MiB/s  W:1084 MiB/s   W:1111 MiB/s
>
> When comparing the results, I did not observe a significant throughput
> difference between the queue-depth, round-robin, and adaptive policies.
> With random I/O of mixed sizes, the adaptive policy appears to average
> out the varying latency values and distribute I/O reasonably evenly
> across the active paths (assuming symmetric paths).
>
> Next I'd implement I/O size buckets and also per-numa node weight and
> then rerun tests and share the result. Let's see if these changes help
> further improve the throughput number for adaptive policy. We may then
> again review the results and discuss further.
>
> Thanks,
> --Nilay

Two comments:
1. I'd bias the read split slightly towards small block sizes, and the
write split towards larger block sizes.
2. I'd also suggest measuring a variant where the weight calculation is
averaged over all cores of a NUMA node and the result is then set per-CPU
(such that the datapath does not introduce serialization).
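
Suggestion 2 above can be sketched roughly as follows. This is only an
illustrative Python model, not the kernel code: the function name
`numa_weights`, the input layout, and the choice of making a weight
inversely proportional to the averaged latency (normalized to 100) are
assumptions made for this example.

```python
from collections import defaultdict


def numa_weights(latency_us, cpu_to_node):
    """Return {cpu: {path: weight}} from {cpu: {path: avg latency in usec}}.

    Assumption: weight is inversely proportional to the node-wide average
    latency of a path, and the weights of a node sum to 100.
    """
    # Step 1: accumulate per-path latency over all CPUs of each NUMA node.
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for cpu, per_path in latency_us.items():
        node = cpu_to_node[cpu]
        for path, lat in per_path.items():
            sums[node][path] += lat
            counts[node][path] += 1

    # Step 2: compute one set of weights per NUMA node.
    node_weights = {}
    for node, per_path in sums.items():
        # counts / sum is the reciprocal of the average latency.
        inverse = {p: counts[node][p] / s for p, s in per_path.items()}
        total = sum(inverse.values())
        node_weights[node] = {p: 100.0 * v / total for p, v in inverse.items()}

    # Step 3: hand every CPU its own copy of its node's weights.
    return {cpu: dict(node_weights[cpu_to_node[cpu]]) for cpu in latency_us}


# Hypothetical sample data: CPUs 0 and 1 on node 0, CPU 2 on node 1.
latency = {
    0: {"A": 100.0, "B": 200.0},
    1: {"A": 300.0, "B": 200.0},
    2: {"A": 100.0, "B": 300.0},
}
print(numa_weights(latency, {0: 0, 1: 0, 2: 1}))
```

In this model the averaging happens only when weights are recomputed, and
each CPU then reads its own copy, which is the property the suggestion asks
for (no serialization in the datapath).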


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2025-12-27  9:37                   ` Sagi Grimberg
@ 2026-01-04  9:07                     ` Nilay Shroff
  2026-01-04 21:06                       ` Sagi Grimberg
  0 siblings, 1 reply; 28+ messages in thread
From: Nilay Shroff @ 2026-01-04  9:07 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, linux-nvme
  Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce



On 12/27/25 3:07 PM, Sagi Grimberg wrote:
> 
>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>> file I used for the test, followed by the observed throughput result for reference.
>>
>> Job file:
>> =========
>>
>> [global]
>> time_based
>> runtime=120
>> group_reporting=1
>>
>> [cpu]
>> ioengine=cpuio
>> cpuload=85
>> cpumode=qsort
>> numjobs=32
>>
>> [disk]
>> ioengine=io_uring
>> filename=/dev/nvme1n2
>> rw=<randread/randwrite/randrw>
>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>> iodepth=32
>> numjobs=32
>> direct=1
>>
>> Throughput:
>> ===========
>>
>>           numa          round-robin   queue-depth    adaptive
>>           -----------   -----------   -----------    ---------
>> READ:    1120 MiB/s    2241 MiB/s    2233 MiB/s     2215 MiB/s
>> WRITE:   1107 MiB/s    1875 MiB/s    1847 MiB/s     1892 MiB/s
>> RW:      R:1001 MiB/s  R:1047 MiB/s  R:1086 MiB/s   R:1112 MiB/s
>>           W:999  MiB/s  W:1045 MiB/s  W:1084 MiB/s   W:1111 MiB/s
>>
>> When comparing the results, I did not observe a significant throughput
>> difference between the queue-depth, round-robin, and adaptive policies.
>> With random I/O of mixed sizes, the adaptive policy appears to average
>> out the varying latency values and distribute I/O reasonably evenly
>> across the active paths (assuming symmetric paths).
>>
>> Next I'd implement I/O size buckets and also per-numa node weight and
>> then rerun tests and share the result. Let's see if these changes help
>> further improve the throughput number for adaptive policy. We may then
>> again review the results and discuss further.
>>
>> Thanks,
>> --Nilay
> 
> two comments:
> 1. I'd make reads split slightly biased towards small block sizes, and writes biased towards larger block sizes
> 2. I'd also suggest to measure having weights calculation averaged out on all numa-node cores and then set percpu (such that
> the datapath does not introduce serialization).

Thanks for the suggestions. I ran experiments incorporating both points—
biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
weight calculation—using the following setup.

Job file:
=========
[global]
time_based
runtime=120
group_reporting=1

[cpu]
ioengine=cpuio
cpuload=85
numjobs=32

[disk]
ioengine=io_uring
filename=/dev/nvme1n1
rw=<randread/randwrite/randrw>
bssplit=<based-on-I/O-pattern-type>[1]
iodepth=32
numjobs=32
direct=1
==========

[1] Block-size distributions:
    randread  => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
    randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
    randrw    => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
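
For reference, the mean request size implied by each distribution can be
worked out from the bssplit string. A minimal Python sketch, assuming fio's
default of 1k = 1024 bytes; `parse_size` and `mean_block_size` are
hypothetical helpers for this example and not part of fio:

```python
def parse_size(text):
    """Convert a size such as '512', '4k' or '1m' to bytes (1k = 1024)."""
    units = {"k": 1024, "m": 1024 ** 2}
    text = text.strip().lower()
    if text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def mean_block_size(bssplit):
    """Weighted mean block size in bytes for a 'size/percent:...' string."""
    total = 0.0
    for entry in bssplit.split(":"):
        size, percent = entry.split("/")
        total += parse_size(size) * float(percent) / 100.0
    return total


splits = {
    "randread": "512/30:4k/25:8k/20:16k/15:32k/10",
    "randwrite": "4k/10:64k/20:128k/30:256k/40",
    "randrw": "512/20:4k/25:32k/25:64k/20:128k/5:256k/5",
}
for name, split in splits.items():
    print(f"{name:10s} {mean_block_size(split) / 1024:.1f} KiB")
```

Under these assumptions the read mix averages about 8.4 KiB per request,
the write mix 154 KiB, and the mixed case about 41 KiB, so absolute
throughput is not directly comparable between the READ, WRITE, and RW rows.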

Results:
=======

i) Symmetric paths + system load
   (CPU stress using cpuload):

         per-CPU   per-CPU-IO-buckets       per-NUMA   per-NUMA-IO-buckets
         (MiB/s)         (MiB/s)             (MiB/s)        (MiB/s)     
         -------   -------------------      --------   -------------------
READ:    636          621                   613           618  
WRITE:   1832         1847                  1840          1852
RW:      R:872        R:869                 R:866         R:874   
         W:872        W:870                 W:867         W:876 

ii) Asymmetric paths + system load
   (CPU stress using cpuload and iperf3 traffic for inducing network congestion):

         per-CPU   per-CPU-IO-buckets       per-NUMA   per-NUMA-IO-buckets
         (MiB/s)         (MiB/s)             (MiB/s)        (MiB/s)     
         -------   -------------------      --------   -------------------
READ:    553          543                   540           533  
WRITE:   1705         1670                  1710          1655
RW:      R:769        R:771                 R:784         R:772   
         W:768        W:767                 W:785         W:771 


Looking at the above results,
- Per-CPU vs per-CPU with I/O buckets:
  The per-CPU implementation already averages latency effectively across CPUs.
  Introducing per-CPU I/O buckets does not provide a meaningful throughput
  improvement and remains largely comparable.

- Per-CPU vs per-NUMA aggregation:
  Calculating or averaging weights at the NUMA level does not significantly
  improve throughput over per-CPU weight calculation. Across both symmetric
  and asymmetric scenarios, the results remain very close.
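
To put "very close" into numbers, the relative change against the per-CPU
column can be computed directly from the tables above. A small Python
sketch; `rel_delta` is just a helper for this example, and only two of the
rows are shown:

```python
def rel_delta(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Throughput in MiB/s, copied from the result tables above.
rows = {
    "symmetric READ": {
        "per-CPU": 636, "per-CPU-IO-buckets": 621,
        "per-NUMA": 613, "per-NUMA-IO-buckets": 618,
    },
    "asymmetric WRITE": {
        "per-CPU": 1705, "per-CPU-IO-buckets": 1670,
        "per-NUMA": 1710, "per-NUMA-IO-buckets": 1655,
    },
}

for name, row in rows.items():
    base = row["per-CPU"]
    for variant, mibs in row.items():
        print(f"{name:17s} {variant:20s} {rel_delta(base, mibs):+.2f}%")
```

The largest deviations in these two rows are roughly -3.6% (symmetric READ,
per-NUMA) and -2.9% (asymmetric WRITE, per-NUMA-IO-buckets).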

So, based on the above results and assessment, unless there are additional
scenarios or metrics of interest, shall we proceed with per-CPU weight
calculation for this new I/O policy?

Thanks,
--Nilay


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2026-01-04  9:07                     ` Nilay Shroff
@ 2026-01-04 21:06                       ` Sagi Grimberg
  2026-01-06 14:16                         ` Nilay Shroff
  2026-01-07 11:15                         ` Hannes Reinecke
  0 siblings, 2 replies; 28+ messages in thread
From: Sagi Grimberg @ 2026-01-04 21:06 UTC (permalink / raw)
  To: Nilay Shroff, Hannes Reinecke, linux-nvme
  Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce



On 04/01/2026 11:07, Nilay Shroff wrote:
>
> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>>> file I used for the test, followed by the observed throughput result for reference.
>>>
>>> Job file:
>>> =========
>>>
>>> [global]
>>> time_based
>>> runtime=120
>>> group_reporting=1
>>>
>>> [cpu]
>>> ioengine=cpuio
>>> cpuload=85
>>> cpumode=qsort
>>> numjobs=32
>>>
>>> [disk]
>>> ioengine=io_uring
>>> filename=/dev/nvme1n2
>>> rw=<randread/randwrite/randrw>
>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>> iodepth=32
>>> numjobs=32
>>> direct=1
>>>
>>> Throughput:
>>> ===========
>>>
>>>            numa          round-robin   queue-depth    adaptive
>>>            -----------   -----------   -----------    ---------
>>> READ:    1120 MiB/s    2241 MiB/s    2233 MiB/s     2215 MiB/s
>>> WRITE:   1107 MiB/s    1875 MiB/s    1847 MiB/s     1892 MiB/s
>>> RW:      R:1001 MiB/s  R:1047 MiB/s  R:1086 MiB/s   R:1112 MiB/s
>>>            W:999  MiB/s  W:1045 MiB/s  W:1084 MiB/s   W:1111 MiB/s
>>>
>>> When comparing the results, I did not observe a significant throughput
>>> difference between the queue-depth, round-robin, and adaptive policies.
>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>> out the varying latency values and distribute I/O reasonably evenly
>>> across the active paths (assuming symmetric paths).
>>>
>>> Next I'd implement I/O size buckets and also per-numa node weight and
>>> then rerun tests and share the result. Let's see if these changes help
>>> further improve the throughput number for adaptive policy. We may then
>>> again review the results and discuss further.
>>>
>>> Thanks,
>>> --Nilay
>> two comments:
>> 1. I'd make reads split slightly biased towards small block sizes, and writes biased towards larger block sizes
>> 2. I'd also suggest to measure having weights calculation averaged out on all numa-node cores and then set percpu (such that
>> the datapath does not introduce serialization).
> Thanks for the suggestions. I ran experiments incorporating both points—
> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
> weight calculation—using the following setup.
>
> Job file:
> =========
> [global]
> time_based
> runtime=120
> group_reporting=1
>
> [cpu]
> ioengine=cpuio
> cpuload=85
> numjobs=32
>
> [disk]
> ioengine=io_uring
> filename=/dev/nvme1n1
> rw=<randread/randwrite/randrw>
> bssplit=<based-on-I/O-pattern-type>[1]
> iodepth=32
> numjobs=32
> direct=1
> ==========
>
> [1] Block-size distributions:
>      randread  => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
>      randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
>      randrw    => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>
> Results:
> =======
>
> i) Symmetric paths + system load
>     (CPU stress using cpuload):
>
>           per-CPU   per-CPU-IO-buckets       per-NUMA   per-NUMA-IO-buckets
>           (MiB/s)         (MiB/s)             (MiB/s)        (MiB/s)
>           -------   -------------------      --------   -------------------
> READ:    636          621                   613           618
> WRITE:   1832         1847                  1840          1852
> RW:      R:872        R:869                 R:866         R:874
>           W:872        W:870                 W:867         W:876
>
> ii) Asymmetric paths + system load
>     (CPU stress using cpuload and iperf3 traffic for inducing network congestion):
>
>           per-CPU   per-CPU-IO-buckets       per-NUMA   per-NUMA-IO-buckets
>           (MiB/s)         (MiB/s)             (MiB/s)        (MiB/s)
>           -------   -------------------      --------   -------------------
> READ:    553          543                   540           533
> WRITE:   1705         1670                  1710          1655
> RW:      R:769        R:771                 R:784         R:772
>           W:768        W:767                 W:785         W:771
>
>
> Looking at the above results,
> - Per-CPU vs per-CPU with I/O buckets:
>    The per-CPU implementation already averages latency effectively across CPUs.
>    Introducing per-CPU I/O buckets does not provide a meaningful throughput
>    improvement and remains largely comparable.
>
> - Per-CPU vs per-NUMA aggregation:
>    Calculating or averaging weights at the NUMA level does not significantly
>    improve throughput over per-CPU weight calculation. Across both symmetric
>    and asymmetric scenarios, the results remain very close.
>
> So now based on above results and assessment, unless there are additional
> scenarios or metrics of interest, shall we proceed with per-CPU weight
> calculation for this new I/O policy?

I think it is counterintuitive that bucketing I/O sizes does not
present any advantage. Don't you?
Maybe the test is not a good enough representation...

Let's also test what happens with multiple clients against the same
subsystem.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2026-01-04 21:06                       ` Sagi Grimberg
@ 2026-01-06 14:16                         ` Nilay Shroff
  2026-02-02 13:33                           ` Nilay Shroff
  2026-01-07 11:15                         ` Hannes Reinecke
  1 sibling, 1 reply; 28+ messages in thread
From: Nilay Shroff @ 2026-01-06 14:16 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, linux-nvme
  Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce



On 1/5/26 2:36 AM, Sagi Grimberg wrote:
> 
> 
> On 04/01/2026 11:07, Nilay Shroff wrote:
>>
>> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>>>> file I used for the test, followed by the observed throughput result for reference.
>>>>
>>>> Job file:
>>>> =========
>>>>
>>>> [global]
>>>> time_based
>>>> runtime=120
>>>> group_reporting=1
>>>>
>>>> [cpu]
>>>> ioengine=cpuio
>>>> cpuload=85
>>>> cpumode=qsort
>>>> numjobs=32
>>>>
>>>> [disk]
>>>> ioengine=io_uring
>>>> filename=/dev/nvme1n2
>>>> rw=<randread/randwrite/randrw>
>>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>>> iodepth=32
>>>> numjobs=32
>>>> direct=1
>>>>
>>>> Throughput:
>>>> ===========
>>>>
>>>>            numa          round-robin   queue-depth    adaptive
>>>>            -----------   -----------   -----------    ---------
>>>> READ:    1120 MiB/s    2241 MiB/s    2233 MiB/s     2215 MiB/s
>>>> WRITE:   1107 MiB/s    1875 MiB/s    1847 MiB/s     1892 MiB/s
>>>> RW:      R:1001 MiB/s  R:1047 MiB/s  R:1086 MiB/s   R:1112 MiB/s
>>>>            W:999  MiB/s  W:1045 MiB/s  W:1084 MiB/s   W:1111 MiB/s
>>>>
>>>> When comparing the results, I did not observe a significant throughput
>>>> difference between the queue-depth, round-robin, and adaptive policies.
>>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>>> out the varying latency values and distribute I/O reasonably evenly
>>>> across the active paths (assuming symmetric paths).
>>>>
>>>> Next I'd implement I/O size buckets and also per-numa node weight and
>>>> then rerun tests and share the result. Let's see if these changes help
>>>> further improve the throughput number for adaptive policy. We may then
>>>> again review the results and discuss further.
>>>>
>>>> Thanks,
>>>> --Nilay
>>> two comments:
>>> 1. I'd make reads split slightly biased towards small block sizes, and writes biased towards larger block sizes
>>> 2. I'd also suggest to measure having weights calculation averaged out on all numa-node cores and then set percpu (such that
>>> the datapath does not introduce serialization).
>> Thanks for the suggestions. I ran experiments incorporating both points—
>> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
>> weight calculation—using the following setup.
>>
>> Job file:
>> =========
>> [global]
>> time_based
>> runtime=120
>> group_reporting=1
>>
>> [cpu]
>> ioengine=cpuio
>> cpuload=85
>> numjobs=32
>>
>> [disk]
>> ioengine=io_uring
>> filename=/dev/nvme1n1
>> rw=<randread/randwrite/randrw>
>> bssplit=<based-on-I/O-pattern-type>[1]
>> iodepth=32
>> numjobs=32
>> direct=1
>> ==========
>>
>> [1] Block-size distributions:
>>      randread  => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
>>      randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
>>      randrw    => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>>
>> Results:
>> =======
>>
>> i) Symmetric paths + system load
>>     (CPU stress using cpuload):
>>
>>           per-CPU   per-CPU-IO-buckets       per-NUMA   per-NUMA-IO-buckets
>>           (MiB/s)         (MiB/s)             (MiB/s)        (MiB/s)
>>           -------   -------------------      --------   -------------------
>> READ:    636          621                   613           618
>> WRITE:   1832         1847                  1840          1852
>> RW:      R:872        R:869                 R:866         R:874
>>           W:872        W:870                 W:867         W:876
>>
>> ii) Asymmetric paths + system load
>>     (CPU stress using cpuload and iperf3 traffic for inducing network congestion):
>>
>>           per-CPU   per-CPU-IO-buckets       per-NUMA   per-NUMA-IO-buckets
>>           (MiB/s)         (MiB/s)             (MiB/s)        (MiB/s)
>>           -------   -------------------      --------   -------------------
>> READ:    553          543                   540           533
>> WRITE:   1705         1670                  1710          1655
>> RW:      R:769        R:771                 R:784         R:772
>>           W:768        W:767                 W:785         W:771
>>
>>
>> Looking at the above results,
>> - Per-CPU vs per-CPU with I/O buckets:
>>    The per-CPU implementation already averages latency effectively across CPUs.
>>    Introducing per-CPU I/O buckets does not provide a meaningful throughput
>>    improvement and remains largely comparable.
>>
>> - Per-CPU vs per-NUMA aggregation:
>>    Calculating or averaging weights at the NUMA level does not significantly
>>    improve throughput over per-CPU weight calculation. Across both symmetric
>>    and asymmetric scenarios, the results remain very close.
>>
>> So now based on above results and assessment, unless there are additional
>> scenarios or metrics of interest, shall we proceed with per-CPU weight
>> calculation for this new I/O policy?
> 
> I think it is counter intuitive that bucketing I/O sizes does not present any advantage. Don't you?
> Maybe the test is not good enough of a representation...
> 
Hmm, you were right. I thought the same, but I couldn't find
any test that demonstrates the advantage of I/O buckets. So
today I spent some time thinking about scenarios that could
show the worth of I/O buckets, and came up with the following
use case.

Size-dependent path behavior:

1. Example:
   Path A: good for ≤16k, bad for ≥32k
   Path B: good for all 

   Now running mixed I/O (bssplit => 16k/75:64k/25),

   Without buckets:
   Path B looks better overall, so the scheduler forwards more I/Os
   of every size towards path B.

   With buckets:
   small I/Os are distributed across paths A and B
   large I/Os favor path B

   So in theory, throughput should improve with buckets.

2. Example:
   Path A: good for ≤16k, bad for ≥32k
   Path B: opposite

   Without buckets:
   latency averages cancel out
   scheduler sees “paths are equal”

   With buckets:
   small I/O bucket favors A
   large I/O bucket favors B

   Again, in theory, throughput should improve with buckets.
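
The second example can be illustrated with a minimal sketch. It is not the
patch code: the latency numbers are made up, and weighting each path by the
inverse of its mean latency is an assumption chosen only to show how a
size-blind average hides the difference that per-bucket averages expose.

```python
# Hypothetical mean latencies in microseconds per path and I/O size class.
# These values are invented to mirror example 2; they are not measurements.
LATENCY_US = {
    "A": {"small": 100, "large": 900},
    "B": {"small": 900, "large": 100},
}


def weights(latency_by_path):
    """Weights inversely proportional to latency, normalized to sum to 1."""
    inverse = {path: 1.0 / lat for path, lat in latency_by_path.items()}
    total = sum(inverse.values())
    return {path: inv / total for path, inv in inverse.items()}


def unbucketed_weights(latency, small_fraction):
    """One weight per path, derived from a size-blind average latency."""
    average = {
        path: small_fraction * lat["small"] + (1 - small_fraction) * lat["large"]
        for path, lat in latency.items()
    }
    return weights(average)


def bucketed_weights(latency):
    """One weight per path for each I/O size bucket."""
    return {
        bucket: weights({path: lat[bucket] for path, lat in latency.items()})
        for bucket in ("small", "large")
    }


# Assume half of the I/Os on each path are small and half are large.
flat = unbucketed_weights(LATENCY_US, small_fraction=0.5)
per_bucket = bucketed_weights(LATENCY_US)
print("without buckets:", flat)
print("with buckets:   ", per_bucket)
```

With these made-up numbers the size-blind weights are 0.5 for each path,
while the bucketed weights send about 90% of small I/Os to path A and about
90% of large I/Os to path B.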

With that in mind, I ran another experiment; the results are shown
below. I injected additional delay on one path for larger packets
(>=32k) and mixed I/O sizes with bssplit => 16k/75:64k/25. So with
this test, we have:
Path A: good for ≤16k, bad for ≥32k
Path B: good for all  

         per-CPU   per-CPU-IO-buckets       per-NUMA   per-NUMA-IO-buckets
         (MiB/s)         (MiB/s)             (MiB/s)        (MiB/s)
         -------   -------------------      --------   -------------------
READ:    550       622                      523        615
WRITE:   726       829                      747        834
RW:      R:324     R:381                    R:306      R:375
         W:323     W:381                    W:306      W:374

So yes, I/O buckets can be useful for the scenario tested
above. Regarding per-CPU vs per-NUMA weight calculation, do
you agree that per-CPU should be good enough for this policy,
given that per-NUMA does not improve performance much above?
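
For reference, the relative gain from I/O buckets in the table above can be
worked out with a short script. The throughput values are copied from the
table; the helper name gain_pct and the RW-R/RW-W row labels are only
illustrative.

```python
# Throughput in MiB/s, copied from the size-dependent delay experiment above.
COLUMNS = ("per-CPU", "per-CPU-IO-buckets", "per-NUMA", "per-NUMA-IO-buckets")
RESULTS = {
    "READ":  (550, 622, 523, 615),
    "WRITE": (726, 829, 747, 834),
    "RW-R":  (324, 381, 306, 375),
    "RW-W":  (323, 381, 306, 374),
}


def gain_pct(base, new):
    """Relative change from base to new, in percent."""
    return 100.0 * (new - base) / base


gains = {}
for workload, (cpu, cpu_buckets, numa, numa_buckets) in RESULTS.items():
    gains[workload] = (gain_pct(cpu, cpu_buckets), gain_pct(numa, numa_buckets))
    print(f"{workload:6s} per-CPU buckets: {gains[workload][0]:+.1f}%   "
          f"per-NUMA buckets: {gains[workload][1]:+.1f}%")
```

For this scenario the bucketed variants come out roughly 13% to 18% ahead in
the per-CPU case and roughly 12% to 23% ahead in the per-NUMA case, while the
two bucketed columns stay within a couple of percent of each other.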


> Lets also test what happens with multiple clients against the same subsystem.
Yes, this is a good test to run. I will run it and post the results.

Thanks,
--Nilay



* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2026-01-04 21:06                       ` Sagi Grimberg
  2026-01-06 14:16                         ` Nilay Shroff
@ 2026-01-07 11:15                         ` Hannes Reinecke
  1 sibling, 0 replies; 28+ messages in thread
From: Hannes Reinecke @ 2026-01-07 11:15 UTC (permalink / raw)
  To: Sagi Grimberg, Nilay Shroff, linux-nvme
  Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce

On 1/4/26 22:06, Sagi Grimberg wrote:
> 
> 
> On 04/01/2026 11:07, Nilay Shroff wrote:
>>
>> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>>>> file I used for the test, followed by the observed throughput result for reference.
>>>>
>>>> Job file:
>>>> =========
>>>>
>>>> [global]
>>>> time_based
>>>> runtime=120
>>>> group_reporting=1
>>>>
>>>> [cpu]
>>>> ioengine=cpuio
>>>> cpuload=85
>>>> cpumode=qsort
>>>> numjobs=32
>>>>
>>>> [disk]
>>>> ioengine=io_uring
>>>> filename=/dev/nvme1n2
>>>> rw=<randread/randwrite/randrw>
>>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>>> iodepth=32
>>>> numjobs=32
>>>> direct=1
>>>>
>>>> Throughput:
>>>> ===========
>>>>
>>>>            numa          round-robin   queue-depth    adaptive
>>>>            -----------   -----------   -----------    ---------
>>>> READ:    1120 MiB/s    2241 MiB/s    2233 MiB/s     2215 MiB/s
>>>> WRITE:   1107 MiB/s    1875 MiB/s    1847 MiB/s     1892 MiB/s
>>>> RW:      R:1001 MiB/s  R:1047 MiB/s  R:1086 MiB/s   R:1112 MiB/s
>>>>            W:999  MiB/s  W:1045 MiB/s  W:1084 MiB/s   W:1111 MiB/s
>>>>
>>>> When comparing the results, I did not observe a significant throughput
>>>> difference between the queue-depth, round-robin, and adaptive policies.
>>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>>> out the varying latency values and distribute I/O reasonably evenly
>>>> across the active paths (assuming symmetric paths).
>>>>
>>>> Next I'd implement I/O size buckets and also per-numa node weight and
>>>> then rerun tests and share the result. Lets see if these changes help
>>>> further improve the throughput number for adaptive policy. We may then
>>>> again review the results and discuss further.
>>>>
>>>> Thanks,
>>>> --Nilay
>>> two comments:
>>> 1. I'd make reads split slightly biased towards small block sizes, 
>>> and writes biased towards larger block sizes
>>> 2. I'd also suggest to measure having weights calculation averaged 
>>> out on all numa-node cores and then set percpu (such that
>>> the datapath does not introduce serialization).
>> Thanks for the suggestions. I ran experiments incorporating both points—
>> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
>> weight calculation—using the following setup.
>>
>> Job file:
>> =========
>> [global]
>> time_based
>> runtime=120
>> group_reporting=1
>>
>> [cpu]
>> ioengine=cpuio
>> cpuload=85
>> numjobs=32
>>
>> [disk]
>> ioengine=io_uring
>> filename=/dev/nvme1n1
>> rw=<randread/randwrite/randrw>
>> bssplit=<based-on-I/O-pattern-type>[1]
>> iodepth=32
>> numjobs=32
>> direct=1
>> ==========
>>
>> [1] Block-size distributions:
>>      randread  => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
>>      randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
>>      randrw    => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>>
>> Results:
>> =======
>>
>> i) Symmetric paths + system load
>>     (CPU stress using cpuload):
>>
>>           per-CPU   per-CPU-IO-buckets       per-NUMA   per-NUMA-IO-buckets
>>           (MiB/s)         (MiB/s)             (MiB/s)        (MiB/s)
>>           -------   -------------------      --------   -------------------
>> READ:    636          621                   613           618
>> WRITE:   1832         1847                  1840          1852
>> RW:      R:872        R:869                 R:866         R:874
>>           W:872        W:870                 W:867         W:876
>>
>> ii) Asymmetric paths + system load
>>     (CPU stress using cpuload and iperf3 traffic for inducing network congestion):
>>
>>           per-CPU   per-CPU-IO-buckets       per-NUMA   per-NUMA-IO-buckets
>>           (MiB/s)         (MiB/s)             (MiB/s)        (MiB/s)
>>           -------   -------------------      --------   -------------------
>> READ:    553          543                   540           533
>> WRITE:   1705         1670                  1710          1655
>> RW:      R:769        R:771                 R:784         R:772
>>           W:768        W:767                 W:785         W:771
>>
>>
>> Looking at the above results,
>> - Per-CPU vs per-CPU with I/O buckets:
>>    The per-CPU implementation already averages latency effectively across CPUs.
>>    Introducing per-CPU I/O buckets does not provide a meaningful throughput
>>    improvement and remains largely comparable.
>>
>> - Per-CPU vs per-NUMA aggregation:
>>    Calculating or averaging weights at the NUMA level does not significantly
>>    improve throughput over per-CPU weight calculation. Across both symmetric
>>    and asymmetric scenarios, the results remain very close.
>>
>> So now based on above results and assessment, unless there are additional
>> scenarios or metrics of interest, shall we proceed with per-CPU weight
>> calculation for this new I/O policy?
> 
> I think it is counter intuitive that bucketing I/O sizes does not 
> present any advantage. Don't you?
> Maybe the test is not good enough of a representation...
> 
> Lets also test what happens with multiple clients against the same 
> subsystem.

I am not sure if focussing on NUMA nodes will bring us an advantage 
here. NUMA nodes would present an advantage if we can keep I/Os to
different controllers on different NUMA nodes; but with TCP this
is rarely possible (just think of two connections to different
controllers via the same interface ...), so I really think we
should keep the counters per-cpu.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
  2026-01-06 14:16                         ` Nilay Shroff
@ 2026-02-02 13:33                           ` Nilay Shroff
  0 siblings, 0 replies; 28+ messages in thread
From: Nilay Shroff @ 2026-02-02 13:33 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, linux-nvme
  Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce



On 1/6/26 7:46 PM, Nilay Shroff wrote:
> 
> 
> On 1/5/26 2:36 AM, Sagi Grimberg wrote:
>>
>>
>> On 04/01/2026 11:07, Nilay Shroff wrote:
>>>
>>> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>>>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>>>>> file I used for the test, followed by the observed throughput result for reference.
>>>>>
>>>>> Job file:
>>>>> =========
>>>>>
>>>>> [global]
>>>>> time_based
>>>>> runtime=120
>>>>> group_reporting=1
>>>>>
>>>>> [cpu]
>>>>> ioengine=cpuio
>>>>> cpuload=85
>>>>> cpumode=qsort
>>>>> numjobs=32
>>>>>
>>>>> [disk]
>>>>> ioengine=io_uring
>>>>> filename=/dev/nvme1n2
>>>>> rw=<randread/randwrite/randrw>
>>>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>>>> iodepth=32
>>>>> numjobs=32
>>>>> direct=1
>>>>>
>>>>> Throughput:
>>>>> ===========
>>>>>
>>>>>            numa          round-robin   queue-depth    adaptive
>>>>>            -----------   -----------   -----------    ---------
>>>>> READ:    1120 MiB/s    2241 MiB/s    2233 MiB/s     2215 MiB/s
>>>>> WRITE:   1107 MiB/s    1875 MiB/s    1847 MiB/s     1892 MiB/s
>>>>> RW:      R:1001 MiB/s  R:1047 MiB/s  R:1086 MiB/s   R:1112 MiB/s
>>>>>            W:999  MiB/s  W:1045 MiB/s  W:1084 MiB/s   W:1111 MiB/s
>>>>>
>>>>> When comparing the results, I did not observe a significant throughput
>>>>> difference between the queue-depth, round-robin, and adaptive policies.
>>>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>>>> out the varying latency values and distribute I/O reasonably evenly
>>>>> across the active paths (assuming symmetric paths).
>>>>>
>>>>> Next I'd implement I/O size buckets and also per-numa node weight and
>>>>> then rerun tests and share the result. Lets see if these changes help
>>>>> further improve the throughput number for adaptive policy. We may then
>>>>> again review the results and discuss further.
>>>>>
>>>>> Thanks,
>>>>> --Nilay
>>>> two comments:
>>>> 1. I'd make reads split slightly biased towards small block sizes, and writes biased towards larger block sizes
>>>> 2. I'd also suggest to measure having weights calculation averaged out on all numa-node cores and then set percpu (such that
>>>> the datapath does not introduce serialization).
>>> Thanks for the suggestions. I ran experiments incorporating both points—
>>> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
>>> weight calculation—using the following setup.
>>>
>>> Job file:
>>> =========
>>> [global]
>>> time_based
>>> runtime=120
>>> group_reporting=1
>>>
>>> [cpu]
>>> ioengine=cpuio
>>> cpuload=85
>>> numjobs=32
>>>
>>> [disk]
>>> ioengine=io_uring
>>> filename=/dev/nvme1n1
>>> rw=<randread/randwrite/randrw>
>>> bssplit=<based-on-I/O-pattern-type>[1]
>>> iodepth=32
>>> numjobs=32
>>> direct=1
>>> ==========
>>>
>>> [1] Block-size distributions:
>>>      randread  => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
>>>      randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
>>>      randrw    => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>>>
>>> Results:
>>> =======
>>>
>>> i) Symmetric paths + system load
>>>     (CPU stress using cpuload):
>>>
>>>           per-CPU   per-CPU-IO-buckets       per-NUMA   per-NUMA-IO-buckets
>>>           (MiB/s)         (MiB/s)             (MiB/s)        (MiB/s)
>>>           -------   -------------------      --------   -------------------
>>> READ:    636          621                   613           618
>>> WRITE:   1832         1847                  1840          1852
>>> RW:      R:872        R:869                 R:866         R:874
>>>           W:872        W:870                 W:867         W:876
>>>
>>> ii) Asymmetric paths + system load
>>>     (CPU stress using cpuload and iperf3 traffic for inducing network congestion):
>>>
>>>           per-CPU   per-CPU-IO-buckets       per-NUMA   per-NUMA-IO-buckets
>>>           (MiB/s)         (MiB/s)             (MiB/s)        (MiB/s)
>>>           -------   -------------------      --------   -------------------
>>> READ:    553          543                   540           533
>>> WRITE:   1705         1670                  1710          1655
>>> RW:      R:769        R:771                 R:784         R:772
>>>           W:768        W:767                 W:785         W:771
>>>
>>>
>>> Looking at the above results,
>>> - Per-CPU vs per-CPU with I/O buckets:
>>>    The per-CPU implementation already averages latency effectively across CPUs.
>>>    Introducing per-CPU I/O buckets does not provide a meaningful throughput
>>>    improvement and remains largely comparable.
>>>
>>> - Per-CPU vs per-NUMA aggregation:
>>>    Calculating or averaging weights at the NUMA level does not significantly
>>>    improve throughput over per-CPU weight calculation. Across both symmetric
>>>    and asymmetric scenarios, the results remain very close.
>>>
>>> So now based on above results and assessment, unless there are additional
>>> scenarios or metrics of interest, shall we proceed with per-CPU weight
>>> calculation for this new I/O policy?
>>
>> I think it is counter intuitive that bucketing I/O sizes does not present any advantage. Don't you?
>> Maybe the test is not good enough of a representation...
>>
> Hmm you were correct, I also thought the same but I couldn't find 
> any test which could prove the advantage using I/O buckets. Then
> today I spend some time thinking about the scenarios which could
> prove the worth using I/O buckets. After some thought I came up
> with following use case.
> 
> Size-dependent path behavior:
> 
> 1. Example:
>    Path A: good for ≤16k, bad for ≥32k
>    Path B: good for all 
> 
>    Now running mixed I/O (bssplit => 16k/75:64k/25),
> 
>    Without buckets:
>    Path B looks good; scheduler forwards more I/Os towards path B.
> 
>    With buckets:
>    small I/Os are distributed across path A and B
>    large I/Os favor path B
> 
>    So in theory, throughput shall improve with buckets.
> 
> 2. Example:
>    Path A: good for ≤16k, bad for ≥32k
>    Path B: opposite
> 
>    Without buckets:
>    latency averages cancel out
>    scheduler sees “paths are equal”
> 
>    With buckets:
>    small I/O bucket favors A
>    large I/O bucket favors B
> 
>    Again in theory, throughput shall improve with buckets.
> 
> So with the above thought, I ran another experiment and results
> are shown below:
> 
> Injecting additional delay on one path for larger packets (>=32k)
> and mixing I/Os with bssplit => 16k/75:64k/25. So with this
> test, we have,
> Path A: good for ≤16k, bad for ≥32k
> Path B: good for all  
> 
>          per-CPU   per-CPU-IO-buckets       per-NUMA   per-NUMA-IO-buckets
>          (MiB/s)         (MiB/s)             (MiB/s)        (MiB/s)
>          -------   -------------------      --------   -------------------
> READ:    550          622                   523         615
> WRITE:   726          829                   747         834
> RW:      R:324        R:381                 R: 306     R:375
>          W:323        W:381                 W: 306     W:374
> 
> So yes I/O buckets could be useful for the scenario tested 
> above. And regarding per-CPU vs per-NUMA weight calculation
> do you agree per-CPU should be good enough for this policy
> as we saw above per-NUMA doesn't help improve much performance?
> 
> 
>> Lets also test what happens with multiple clients against the same subsystem.
> Yes this is a good test to run, I will test and post result.
> 

I was finally able to run tests with two nvmf-tcp hosts connected
to the same nvmf-tcp target. Apologies for the delay; setting up this
topology took some time, partly due to non-technical infrastructure
challenges after our recent lab relocation.

The goal of these tests was to evaluate per-CPU vs per-NUMA weight calculation,
with and without I/O size buckets, under multi-client contention.

I ran tests (randread, randwrite and randrw) with mixed I/O sizes (using
bssplit) and added CPU stress on the hosts using cpuload, as in my
earlier tests. The test results and observations follow.

Workload characteristics:
=========================
- Workloads tested: randread, randwrite, randrw
- Mixed I/O sizes using bssplit
- CPU stress induced using cpuload
- Both hosts run workloads simultaneously

Job file:
=========
[global]
time_based
runtime=120
group_reporting=1

[cpu]
ioengine=cpuio
cpuload=85
numjobs=32

[disk]
ioengine=io_uring
filename=/dev/nvme1n1
rw=<randread/randwrite/randrw>
bssplit=<based-on-I/O-pattern-type>[1]
iodepth=32
numjobs=32
direct=1
ramp_time=120

[1] Block-size distributions:
      randread  => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
      randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
      randrw    => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
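
To get a feel for how different the three distributions are, the average
block size implied by each bssplit string can be computed with a small
helper. This is only a sketch: the function names are hypothetical, only the
"k" suffix is handled, and "k" is assumed to mean 1024 bytes.

```python
def parse_size(text):
    """Convert a size such as '512' or '4k' to bytes (only 'k' is handled)."""
    text = text.strip().lower()
    if text.endswith("k"):
        return int(text[:-1]) * 1024
    return int(text)


def parse_bssplit(spec):
    """Parse 'size/percent:size/percent:...' into (bytes, percent) pairs."""
    pairs = []
    for entry in spec.split(":"):
        size, percent = entry.split("/")
        pairs.append((parse_size(size), int(percent)))
    return pairs


def mean_block_size(spec):
    """Percentage-weighted mean block size in bytes."""
    pairs = parse_bssplit(spec)
    total_percent = sum(percent for _, percent in pairs)
    return sum(size * percent for size, percent in pairs) / total_percent


SPLITS = {
    "randread": "512/30:4k/25:8k/20:16k/15:32k/10",
    "randwrite": "4k/10:64k/20:128k/30:256k/40",
    "randrw": "512/20:4k/25:32k/25:64k/20:128k/5:256k/5",
}

for name, spec in SPLITS.items():
    print(f"{name:9s} mean block size: {mean_block_size(spec) / 1024:.1f} KiB")
```

Under those assumptions the mean block size is about 8.4 KiB for randread,
154 KiB for randwrite and about 41.1 KiB for randrw, which reflects the
small-read and large-write bias of the splits.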

Test topology: 
==============
1. Two nvmf-tcp hosts connected to the same nvmf-tcp target
2. Each host connects to target using two symmetric paths
3. System load on each host is induced using cpuload (as shown in jobfile)
4. Both hosts run I/O workloads concurrently

Results:
=======
Host1:
          per-CPU   per-CPU-IO-buckets       per-NUMA   per-NUMA-IO-buckets
          (MiB/s)         (MiB/s)             (MiB/s)        (MiB/s)
          -------   -------------------      --------   -------------------
READ:     153       164                      166        131
WRITE:    839       837                      889        839
RW:       R:249     R:255                    R:226      R:256
          W:247     W:254                    W:225      W:253

Host2:

          per-CPU   per-CPU-IO-buckets       per-NUMA   per-NUMA-IO-buckets
          (MiB/s)         (MiB/s)             (MiB/s)        (MiB/s)
          -------   -------------------      --------   -------------------
READ:     268          258                     279         268
WRITE:    1012         992                     880         1017
RW:       R:386        R:410                   R:401       R:405
          W:385        W:409                   W:399       W:405
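
Since both hosts run concurrently against the same target, one way to read
the two tables together is to add them up per variant. The sketch below does
only that; the values are copied from the tables above, and it assumes that
the Host1 and Host2 numbers in a given column come from the same run.

```python
# Throughput in MiB/s, copied from the Host1 and Host2 tables above.
VARIANTS = ("per-CPU", "per-CPU-IO-buckets", "per-NUMA", "per-NUMA-IO-buckets")
HOST1 = {
    "READ":  (153, 164, 166, 131),
    "WRITE": (839, 837, 889, 839),
    "RW-R":  (249, 255, 226, 256),
    "RW-W":  (247, 254, 225, 253),
}
HOST2 = {
    "READ":  (268, 258, 279, 268),
    "WRITE": (1012, 992, 880, 1017),
    "RW-R":  (386, 410, 401, 405),
    "RW-W":  (385, 409, 399, 405),
}

# Sum the two hosts column by column for each workload.
combined = {
    workload: tuple(a + b for a, b in zip(HOST1[workload], HOST2[workload]))
    for workload in HOST1
}

for workload, totals in combined.items():
    cells = "  ".join(f"{name}={value}" for name, value in zip(VARIANTS, totals))
    print(f"{workload:6s} {cells}")
```

Summed this way, the per-CPU and per-CPU-IO-buckets variants are close: 421
vs 422 MiB/s for READ and 1851 vs 1829 MiB/s for WRITE.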


These results give me the same impression as the earlier, similar tests between
a single nvmf-tcp host and target. Looking at the above results:

Per-CPU vs per-CPU with I/O buckets:
- The per-CPU implementation already averages latency effectively across CPUs.
- Introducing per-CPU I/O buckets does not provide a meaningful throughput 
  improvement in the general case.
- Results remain largely comparable across workloads and hosts.
- However, as shown in earlier experiments with I/O size–dependent path behavior,
  I/O buckets can provide measurable benefits in specific scenarios.

Per-CPU vs per-NUMA aggregation:
- Calculating or averaging weights at the NUMA level does not significantly improve
  throughput over per-CPU weight calculation.
- This holds true even under multi-host contention.

Based on all the tests conducted so far, including symmetric and asymmetric paths,
CPU stress, size-dependent path behavior, and multi-client access to the same target,
the results suggest that we should move forward with a per-CPU implementation using
I/O buckets. That said, I am open to any further feedback, suggestions, or additional
scenarios that might be worth evaluating.

Thanks,
--Nilay



end of thread, other threads:[~2026-02-02 13:34 UTC | newest]

Thread overview: 28+ messages
2025-11-05 10:33 [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 1/7] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
2025-12-12 12:16   ` Sagi Grimberg
2025-11-05 10:33 ` [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy Nilay Shroff
2025-12-12 13:04   ` Sagi Grimberg
2025-12-13  7:27     ` Nilay Shroff
2025-12-15 23:36       ` Sagi Grimberg
2025-12-18 11:19         ` Nilay Shroff
2025-12-18 13:46           ` Hannes Reinecke
2025-12-23 14:50             ` Nilay Shroff
2025-12-25 12:45               ` Sagi Grimberg
2025-12-26 18:16                 ` Nilay Shroff
2025-12-27  9:33                   ` Sagi Grimberg
2025-12-27  9:37                   ` Sagi Grimberg
2026-01-04  9:07                     ` Nilay Shroff
2026-01-04 21:06                       ` Sagi Grimberg
2026-01-06 14:16                         ` Nilay Shroff
2026-02-02 13:33                           ` Nilay Shroff
2026-01-07 11:15                         ` Hannes Reinecke
2025-12-25 12:28           ` Sagi Grimberg
2025-11-05 10:33 ` [RFC PATCHv5 3/7] nvme: add generic debugfs support Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 4/7] nvme-multipath: add debugfs attribute adaptive_ewma_shift Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 5/7] nvme-multipath: add debugfs attribute adaptive_weight_timeout Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 6/7] nvme-multipath: add debugfs attribute adaptive_stat Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 7/7] nvme-multipath: add documentation for adaptive I/O policy Nilay Shroff
2025-12-09 13:56 ` [RFC PATCHv5 0/7] nvme-multipath: introduce " Nilay Shroff
2025-12-12 12:08 ` Sagi Grimberg
2025-12-13  8:22   ` Nilay Shroff
