* [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy
@ 2025-11-05 10:33 Nilay Shroff
From: Nilay Shroff @ 2025-11-05 10:33 UTC
To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce
Hi,
This series introduces a new adaptive I/O policy for NVMe native
multipath. Existing policies such as numa, round-robin, and queue-depth
are static and do not adapt to real-time transport performance. The numa
policy selects the path closest to the NUMA node of the current CPU,
optimizing memory and path locality, but ignores actual path performance.
The round-robin policy distributes I/O evenly across all paths, providing
fairness but not performance awareness. The queue-depth policy reacts to
instantaneous queue occupancy, avoiding heavily loaded paths, but does not
account for actual latency, throughput, or link speed.
The new adaptive policy addresses these gaps by selecting paths dynamically
based on measured I/O latency for both PCIe and fabrics. Latency is
derived by passively sampling I/O completions. Each path is assigned a
weight proportional to its latency score, and I/Os are then forwarded
accordingly. As conditions change (e.g. latency spikes, bandwidth
differences), path weights are updated, automatically steering traffic
toward better-performing paths.
Early results show reduced tail latency under mixed workloads and
improved throughput by exploiting higher-speed links more effectively.
For example, with NVMf/TCP using two paths (one throttled with ~30 ms
delay), fio results with random read/write/rw workloads (direct I/O)
showed:
        numa          round-robin   queue-depth   adaptive
        -----------   -----------   -----------   ---------
READ:   50.0 MiB/s    105 MiB/s     230 MiB/s     350 MiB/s
WRITE:  65.9 MiB/s    125 MiB/s     385 MiB/s     446 MiB/s
RW:     R:30.6 MiB/s  R:56.5 MiB/s  R:122 MiB/s   R:175 MiB/s
        W:30.7 MiB/s  W:56.5 MiB/s  W:122 MiB/s   W:175 MiB/s
This patchset includes a total of 7 patches:
[PATCH 1/7] block: expose blk_stat_{enable,disable}_accounting()
- Make blk_stat APIs available to block drivers.
- Needed for per-path latency measurement in adaptive policy.
[PATCH 2/7] nvme-multipath: add adaptive I/O policy
- Implement path scoring based on latency (EWMA).
- Distribute I/O proportionally to per-path weights.
[PATCH 3/7] nvme: add generic debugfs support
- Introduce generic debugfs support for NVMe module
[PATCH 4/7] nvme-multipath: add debugfs attribute adaptive_ewma_shift
- Adds a debugfs attribute to control ewma shift
[PATCH 5/7] nvme-multipath: add debugfs attribute adaptive_weight_timeout
- Adds a debugfs attribute to control path weight calculation timeout
[PATCH 6/7] nvme-multipath: add debugfs attribute adaptive_stat
- Add “adaptive_stat” under per-path and head debugfs directories to
expose adaptive policy state and statistics.
[PATCH 7/7] nvme-multipath: add documentation for adaptive I/O policy
- Includes documentation for adaptive I/O multipath policy.
As usual, feedback and suggestions are most welcome!
Thanks!
Changes from v4:
- Added patch #7 which includes the documentation for adaptive I/O
policy. (Guixin Liu)
Link to v4: https://lore.kernel.org/all/20251104104533.138481-1-nilay@linux.ibm.com/
Changes from v3:
- Update the adaptive APIs name (which actually enable/disable
adaptive policy) to reflect the actual work it does. Also removed
the misleading use of "current_path" from the adaptive policy code
(Hannes Reinecke)
- Move adaptive_ewma_shift and adaptive_weight_timeout attributes from
sysfs to debugfs (Hannes Reinecke)
Link to v3: https://lore.kernel.org/all/20251027092949.961287-1-nilay@linux.ibm.com/
Changes from v2:
- Added a new patch to allow the user to configure the EWMA shift
through sysfs (Hannes Reinecke)
- Added a new patch to allow user to configure path weight
calculation timeout (Hannes Reinecke)
- Distinguish between read/write and other commands (e.g.
admin commands) and calculate a path weight for other commands
that is separate from the read/write weight. (Hannes Reinecke)
- Normalize per-path weight in the range from 0-128 instead
of 0-100 (Hannes Reinecke)
- Restructure and optimize adaptive I/O forwarding code to use
one loop instead of two (Hannes Reinecke)
Link to v2: https://lore.kernel.org/all/20251009100608.1699550-1-nilay@linux.ibm.com/
Changes from v1:
- Ensure that the completion of I/O occurs on the same CPU as the
submitting I/O CPU (Hannes Reinecke)
- Remove adapter link speed from the path weight calculation
(Hannes Reinecke)
- Add adaptive I/O stat under debugfs instead of current sysfs
(Hannes Reinecke)
- Move path weight calculation to a workqueue from IO completion
code path
Link to v1: https://lore.kernel.org/all/20250921111234.863853-1-nilay@linux.ibm.com/
Nilay Shroff (7):
block: expose blk_stat_{enable,disable}_accounting() to drivers
nvme-multipath: add support for adaptive I/O policy
nvme: add generic debugfs support
nvme-multipath: add debugfs attribute adaptive_ewma_shift
nvme-multipath: add debugfs attribute adaptive_weight_timeout
nvme-multipath: add debugfs attribute adaptive_stat
nvme-multipath: add documentation for adaptive I/O policy
Documentation/admin-guide/nvme-multipath.rst | 19 +
block/blk-stat.h | 4 -
drivers/nvme/host/Makefile | 2 +-
drivers/nvme/host/core.c | 22 +-
drivers/nvme/host/debugfs.c | 335 +++++++++++++++
drivers/nvme/host/ioctl.c | 31 +-
drivers/nvme/host/multipath.c | 430 ++++++++++++++++++-
drivers/nvme/host/nvme.h | 86 +++-
drivers/nvme/host/pr.c | 6 +-
drivers/nvme/host/sysfs.c | 2 +-
include/linux/blk-mq.h | 4 +
11 files changed, 913 insertions(+), 28 deletions(-)
create mode 100644 drivers/nvme/host/debugfs.c
--
2.51.0
* [RFC PATCHv5 1/7] block: expose blk_stat_{enable,disable}_accounting() to drivers
From: Nilay Shroff @ 2025-11-05 10:33 UTC
To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce
The functions blk_stat_enable_accounting() and
blk_stat_disable_accounting() are currently exported, but their
prototypes are only defined in a private header. Move these prototypes
into a common header so that block drivers can directly use these APIs.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-stat.h | 4 ----
include/linux/blk-mq.h | 4 ++++
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/block/blk-stat.h b/block/blk-stat.h
index 9e05bf18d1be..f5d95dd8c0e9 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -67,10 +67,6 @@ void blk_free_queue_stats(struct blk_queue_stats *);
void blk_stat_add(struct request *rq, u64 now);
-/* record time/size info in request but not add a callback */
-void blk_stat_enable_accounting(struct request_queue *q);
-void blk_stat_disable_accounting(struct request_queue *q);
-
/**
* blk_stat_alloc_callback() - Allocate a block statistics callback.
* @timer_fn: Timer callback function.
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b25d12545f46..f647444643b8 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -735,6 +735,10 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
bool blk_mq_queue_inflight(struct request_queue *q);
+/* record time/size info in request but not add a callback */
+void blk_stat_enable_accounting(struct request_queue *q);
+void blk_stat_disable_accounting(struct request_queue *q);
+
enum {
/* return when out of requests */
BLK_MQ_REQ_NOWAIT = (__force blk_mq_req_flags_t)(1 << 0),
--
2.51.0
* [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
From: Nilay Shroff @ 2025-11-05 10:33 UTC
To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce
This commit introduces a new I/O policy named "adaptive". Users can
configure it by writing "adaptive" to
"/sys/class/nvme-subsystem/nvme-subsystemX/iopolicy".
The adaptive policy dynamically distributes I/O based on measured
completion latency. The main idea is to calculate latency for each path,
derive a weight, and then proportionally forward I/O according to those
weights.
To ensure scalability, path latency is measured per-CPU. Each CPU
maintains its own statistics, and I/O forwarding uses these per-CPU
values. Every ~15 seconds, the simple average of the per-CPU batched
latency samples is computed and fed into an Exponentially Weighted
Moving Average (EWMA):
avg_latency = div_u64(batch, batch_count);
new_ewma_latency = (prev_ewma_latency * (WEIGHT-1) + avg_latency)/WEIGHT
With WEIGHT = 8, this assigns 7/8 (~87.5%) weight to the previous
latency value and 1/8 (~12.5%) to the most recent latency. This
smoothing reduces jitter, adapts quickly to changing conditions,
avoids storing historical samples, and works well for both low and
high I/O rates. Path weights are then derived from the smoothed (EWMA)
latency as follows (example with two paths A and B):
path_A_score = NSEC_PER_SEC / path_A_ewma_latency
path_B_score = NSEC_PER_SEC / path_B_ewma_latency
total_score = path_A_score + path_B_score
path_A_weight = (path_A_score * 64) / total_score
path_B_weight = (path_B_score * 64) / total_score
where:
- path_X_ewma_latency is the smoothed latency of a path in nanoseconds
- NSEC_PER_SEC is used as a scaling factor since valid latencies
are < 1 second
- weights are normalized to a 0–64 scale across all paths.
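The arithmetic above can be sketched in plain userspace C. This is only
an illustration under stated assumptions, not the code from this patch:
the names EWMA_SHIFT, WEIGHT_SCALE, path_score() and path_weight() are
made up for the example, a shift of 3 stands in for WEIGHT = 8, and the
0–64 normalization described above is used for the weight.

```c
#include <stdint.h>

#define EWMA_SHIFT   3              /* hypothetical: 2^3 = 8, i.e. WEIGHT = 8 */
#define NSEC_PER_SEC 1000000000ULL  /* scaling factor for the score */
#define WEIGHT_SCALE 64             /* hypothetical name for the 0-64 range */

/* new_ewma = (old * (WEIGHT - 1) + sample) / WEIGHT, with WEIGHT = 2^shift */
static uint64_t ewma_update(uint64_t old, uint64_t sample)
{
	return (old * ((1ULL << EWMA_SHIFT) - 1) + sample) >> EWMA_SHIFT;
}

/* Score is the inverse of the smoothed latency; 0 means "no data yet". */
static uint64_t path_score(uint64_t slat_ns)
{
	return slat_ns ? NSEC_PER_SEC / slat_ns : 0;
}

/* Normalize a score against the sum of all scores, never below 1. */
static uint32_t path_weight(uint64_t score, uint64_t total_score)
{
	uint32_t weight;

	if (!total_score)
		return 0;	/* nothing measured on any path */

	weight = (uint32_t)(score * WEIGHT_SCALE / total_score);
	return weight ? weight : 1;
}
```

With a 1 ms path and a 30 ms path, the scores come out as 1000 and 33,
and the resulting weights as 61 and 2, so the slow path still receives
an occasional I/O and its latency keeps being sampled.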
Path credits are refilled based on this weight, with one credit
consumed per I/O. When all credits are consumed, the credits are
refilled again based on the current weight. This ensures that I/O is
distributed across paths proportionally to their calculated weight.
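The credit scheme can be illustrated with a small userspace sketch. It
is a simplification under stated assumptions: struct path, pick_path()
and the array-plus-cursor layout are hypothetical stand-ins for the
per-CPU state and the sibling list, and ANA/disabled-path checks are
left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified per-path state for illustration only. */
struct path {
	uint32_t weight;	/* 0 means "not measured yet" */
	uint32_t credit;	/* remaining I/Os before a refill */
	uint64_t selected;	/* how often this path was picked */
};

/*
 * Pick a path starting at *cursor. A path with credit left consumes one
 * credit and is used. A path without credit is refilled from its weight
 * and skipped, so it is used again once the others are exhausted.
 * Returns the chosen index, or -1 if there are no paths.
 */
static int pick_path(struct path *paths, size_t n, size_t *cursor)
{
	size_t i, tries;

	if (!n)
		return -1;

	i = *cursor;
	/* n + 1 tries: one full refill round plus the final pick */
	for (tries = 0; tries <= n; tries++) {
		struct path *p = &paths[i];

		if (!p->weight || p->credit > 0) {
			if (p->weight)
				p->credit--;
			p->selected++;
			*cursor = i;
			return (int)i;
		}
		p->credit = p->weight;	/* refill, then try the next path */
		i = (i + 1) % n;
	}
	return -1;
}
```

Starting from two paths with weights 3 and 1 and no credits, eight
consecutive picks land six times on the first path and twice on the
second, matching the 3:1 weight ratio.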
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 15 +-
drivers/nvme/host/ioctl.c | 31 ++-
drivers/nvme/host/multipath.c | 425 ++++++++++++++++++++++++++++++++--
drivers/nvme/host/nvme.h | 74 +++++-
drivers/nvme/host/pr.c | 6 +-
drivers/nvme/host/sysfs.c | 2 +-
6 files changed, 530 insertions(+), 23 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fa4181d7de73..47f375c63d2d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -672,6 +672,9 @@ static void nvme_free_ns_head(struct kref *ref)
cleanup_srcu_struct(&head->srcu);
nvme_put_subsystem(head->subsys);
kfree(head->plids);
+#ifdef CONFIG_NVME_MULTIPATH
+ free_percpu(head->adp_path);
+#endif
kfree(head);
}
@@ -689,6 +692,7 @@ static void nvme_free_ns(struct kref *kref)
{
struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
+ nvme_free_ns_stat(ns);
put_disk(ns->disk);
nvme_put_ns_head(ns->head);
nvme_put_ctrl(ns->ctrl);
@@ -4137,6 +4141,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
if (nvme_init_ns_head(ns, info))
goto out_cleanup_disk;
+ if (nvme_alloc_ns_stat(ns))
+ goto out_unlink_ns;
+
/*
* If multipathing is enabled, the device name for all disks and not
* just those that represent shared namespaces needs to be based on the
@@ -4161,7 +4168,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
}
if (nvme_update_ns_info(ns, info))
- goto out_unlink_ns;
+ goto out_free_ns_stat;
mutex_lock(&ctrl->namespaces_lock);
/*
@@ -4170,7 +4177,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
*/
if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
mutex_unlock(&ctrl->namespaces_lock);
- goto out_unlink_ns;
+ goto out_free_ns_stat;
}
nvme_ns_add_to_ctrl_list(ns);
mutex_unlock(&ctrl->namespaces_lock);
@@ -4201,6 +4208,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
list_del_rcu(&ns->list);
mutex_unlock(&ctrl->namespaces_lock);
synchronize_srcu(&ctrl->srcu);
+out_free_ns_stat:
+ nvme_free_ns_stat(ns);
out_unlink_ns:
mutex_lock(&ctrl->subsys->lock);
list_del_rcu(&ns->siblings);
@@ -4244,6 +4253,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
*/
synchronize_srcu(&ns->head->srcu);
+ nvme_mpath_cancel_adaptive_path_weight_work(ns);
+
/* wait for concurrent submissions */
if (nvme_mpath_clear_current_path(ns))
synchronize_srcu(&ns->head->srcu);
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index c212fa952c0f..759d147d9930 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -700,18 +700,29 @@ static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
unsigned int cmd, unsigned long arg)
{
+ u8 opcode;
struct nvme_ns_head *head = bdev->bd_disk->private_data;
bool open_for_write = mode & BLK_OPEN_WRITE;
void __user *argp = (void __user *)arg;
struct nvme_ns *ns;
int srcu_idx, ret = -EWOULDBLOCK;
unsigned int flags = 0;
+ unsigned int op_type = NVME_STAT_OTHER;
if (bdev_is_partition(bdev))
flags |= NVME_IOCTL_PARTITION;
+ if (cmd == NVME_IOCTL_SUBMIT_IO) {
+ if (get_user(opcode, (u8 *)argp))
+ return -EFAULT;
+ if (opcode == nvme_cmd_write)
+ op_type = NVME_STAT_WRITE;
+ else if (opcode == nvme_cmd_read)
+ op_type = NVME_STAT_READ;
+ }
+
srcu_idx = srcu_read_lock(&head->srcu);
- ns = nvme_find_path(head);
+ ns = nvme_find_path(head, op_type);
if (!ns)
goto out_unlock;
@@ -733,6 +744,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
+ u8 opcode;
bool open_for_write = file->f_mode & FMODE_WRITE;
struct cdev *cdev = file_inode(file)->i_cdev;
struct nvme_ns_head *head =
@@ -740,9 +752,19 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
void __user *argp = (void __user *)arg;
struct nvme_ns *ns;
int srcu_idx, ret = -EWOULDBLOCK;
+ unsigned int op_type = NVME_STAT_OTHER;
+
+ if (cmd == NVME_IOCTL_SUBMIT_IO) {
+ if (get_user(opcode, (u8 *)argp))
+ return -EFAULT;
+ if (opcode == nvme_cmd_write)
+ op_type = NVME_STAT_WRITE;
+ else if (opcode == nvme_cmd_read)
+ op_type = NVME_STAT_READ;
+ }
srcu_idx = srcu_read_lock(&head->srcu);
- ns = nvme_find_path(head);
+ ns = nvme_find_path(head, op_type);
if (!ns)
goto out_unlock;
@@ -762,7 +784,10 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
int srcu_idx = srcu_read_lock(&head->srcu);
- struct nvme_ns *ns = nvme_find_path(head);
+ const struct nvme_uring_cmd *cmd = io_uring_sqe_cmd(ioucmd->sqe);
+ struct nvme_ns *ns = nvme_find_path(head,
+ READ_ONCE(cmd->opcode) & 1 ?
+ NVME_STAT_WRITE : NVME_STAT_READ);
int ret = -EINVAL;
if (ns)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 543e17aead12..55dc28375662 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -6,6 +6,9 @@
#include <linux/backing-dev.h>
#include <linux/moduleparam.h>
#include <linux/vmalloc.h>
+#include <linux/blk-mq.h>
+#include <linux/math64.h>
+#include <linux/rculist.h>
#include <trace/events/block.h>
#include "nvme.h"
@@ -66,9 +69,10 @@ MODULE_PARM_DESC(multipath_always_on,
"create multipath node always except for private namespace with non-unique nsid; note that this also implicitly enables native multipath support");
static const char *nvme_iopolicy_names[] = {
- [NVME_IOPOLICY_NUMA] = "numa",
- [NVME_IOPOLICY_RR] = "round-robin",
- [NVME_IOPOLICY_QD] = "queue-depth",
+ [NVME_IOPOLICY_NUMA] = "numa",
+ [NVME_IOPOLICY_RR] = "round-robin",
+ [NVME_IOPOLICY_QD] = "queue-depth",
+ [NVME_IOPOLICY_ADAPTIVE] = "adaptive",
};
static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -83,6 +87,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
iopolicy = NVME_IOPOLICY_RR;
else if (!strncmp(val, "queue-depth", 11))
iopolicy = NVME_IOPOLICY_QD;
+ else if (!strncmp(val, "adaptive", 8))
+ iopolicy = NVME_IOPOLICY_ADAPTIVE;
else
return -EINVAL;
@@ -198,6 +204,204 @@ void nvme_mpath_start_request(struct request *rq)
}
EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
+static void nvme_mpath_weight_work(struct work_struct *weight_work)
+{
+ int cpu, srcu_idx;
+ u32 weight;
+ struct nvme_ns *ns;
+ struct nvme_path_stat *stat;
+ struct nvme_path_work *work = container_of(weight_work,
+ struct nvme_path_work, weight_work);
+ struct nvme_ns_head *head = work->ns->head;
+ int op_type = work->op_type;
+ u64 total_score = 0;
+
+ cpu = get_cpu();
+
+ srcu_idx = srcu_read_lock(&head->srcu);
+ list_for_each_entry_srcu(ns, &head->list, siblings,
+ srcu_read_lock_held(&head->srcu)) {
+
+ stat = &this_cpu_ptr(ns->info)[op_type].stat;
+ if (!READ_ONCE(stat->slat_ns)) {
+ stat->score = 0;
+ continue;
+ }
+ /*
+ * Compute the path score as the inverse of smoothed
+ * latency, scaled by NSEC_PER_SEC. Floating point
+ * math is unavailable in the kernel, so fixed-point
+ * scaling is used instead. NSEC_PER_SEC is chosen
+ * because valid latencies are always < 1 second; longer
+ * latencies are ignored.
+ */
+ stat->score = div_u64(NSEC_PER_SEC, READ_ONCE(stat->slat_ns));
+
+ /* Compute total score. */
+ total_score += stat->score;
+ }
+
+ if (!total_score)
+ goto out;
+
+ /*
+ * After computing the total score, we derive per-path weight
+ * (normalized to the range 0–64). The weight represents the
+ * relative share of I/O the path should receive.
+ *
+ * - lower smoothed latency -> higher weight
+ * - higher smoothed latency -> lower weight
+ *
+ * Next, while forwarding I/O, we assign "credits" to each path
+ * based on its weight (please also refer nvme_adaptive_path()):
+ * - Initially, credits = weight.
+ * - Each time an I/O is dispatched on a path, its credits are
+ * decremented proportionally.
+ * - When a path runs out of credits, it becomes temporarily
+ * ineligible until credit is refilled.
+ *
+ * I/O distribution is therefore governed by available credits,
+ * ensuring that over time the proportion of I/O sent to each
+ * path matches its weight (and thus its performance).
+ */
+ list_for_each_entry_srcu(ns, &head->list, siblings,
+ srcu_read_lock_held(&head->srcu)) {
+
+ stat = &this_cpu_ptr(ns->info)[op_type].stat;
+ weight = div_u64(stat->score * 64, total_score);
+
+ /*
+ * Ensure the path weight never drops below 1. A weight
+ * of 0 is used only for newly added paths. During
+ * bootstrap, a few I/Os are sent to such paths to
+ * establish an initial weight. Enforcing a minimum
+ * weight of 1 guarantees that no path is forgotten and
+ * that each path is probed at least occasionally.
+ */
+ if (!weight)
+ weight = 1;
+
+ WRITE_ONCE(stat->weight, weight);
+ }
+out:
+ srcu_read_unlock(&head->srcu, srcu_idx);
+ put_cpu();
+}
+
+/*
+ * Formula to calculate the EWMA (Exponentially Weighted Moving Average):
+ * ewma = (old_ewma * ((1 << EWMA_SHIFT) - 1) + new_sample) >> EWMA_SHIFT
+ * For instance, with EWMA_SHIFT = 3, this assigns 7/8 (~87.5 %) weight to
+ * the existing/old ewma and 1/8 (~12.5%) weight to the new sample.
+ */
+static inline u64 ewma_update(u64 old, u64 new)
+{
+ return (old * ((1 << NVME_DEFAULT_ADP_EWMA_SHIFT) - 1)
+ + new) >> NVME_DEFAULT_ADP_EWMA_SHIFT;
+}
+
+static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
+{
+ int cpu;
+ unsigned int op_type;
+ struct nvme_path_info *info;
+ struct nvme_path_stat *stat;
+ u64 now, latency, slat_ns, avg_lat_ns;
+ struct nvme_ns_head *head = ns->head;
+
+ if (list_is_singular(&head->list))
+ return;
+
+ now = ktime_get_ns();
+ latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
+ if (!latency)
+ return;
+
+ /*
+ * As completion code path is serialized(i.e. no same completion queue
+ * update code could run simultaneously on multiple cpu) we can safely
+ * access per cpu nvme path stat here from another cpu (in case the
+ * completion cpu is different from submission cpu).
+ * The only field which could be accessed simultaneously here is the
+ * path ->weight which may be accessed by this function as well as I/O
+ * submission path during path selection logic and we protect ->weight
+ * using READ_ONCE/WRITE_ONCE. Yes this may not be 100% accurate but
+ * we also don't need to be so accurate here as the path credit would
+ * be anyways refilled, based on path weight, once path consumes all
+ * its credits. And we limit path weight/credit max up to 64. Please
+ * also refer nvme_adaptive_path().
+ */
+ cpu = blk_mq_rq_cpu(rq);
+ op_type = nvme_data_dir(req_op(rq));
+ info = &per_cpu_ptr(ns->info, cpu)[op_type];
+ stat = &info->stat;
+
+ /*
+ * If latency > ~1s then ignore this sample to prevent EWMA from being
+ * skewed by pathological outliers (multi-second waits, controller
+ * timeouts etc.). This keeps path scores representative of normal
+ * performance and avoids instability from rare spikes. If such high
+ * latency is real, ANA state reporting or keep-alive error counters
+ * will mark the path unhealthy and remove it from the head node list,
+ * so we safely skip such sample here.
+ */
+ if (unlikely(latency > NSEC_PER_SEC)) {
+ stat->nr_ignored++;
+ dev_warn_ratelimited(ns->ctrl->device,
+ "ignoring sample with >1s latency (possible controller stall or timeout)\n");
+ return;
+ }
+
+ /*
+ * Accumulate latency samples and increment the batch count for each
+ * ~15 second interval. When the interval expires, compute the simple
+ * average latency over that window, then update the smoothed (EWMA)
+ * latency. The path weight is recalculated based on this smoothed
+ * latency.
+ */
+ stat->batch += latency;
+ stat->batch_count++;
+ stat->nr_samples++;
+
+ if (now > stat->last_weight_ts &&
+ (now - stat->last_weight_ts) >= NVME_DEFAULT_ADP_WEIGHT_TIMEOUT) {
+
+ stat->last_weight_ts = now;
+
+ /*
+ * Find simple average latency for the last epoch (~15 sec
+ * interval).
+ */
+ avg_lat_ns = div_u64(stat->batch, stat->batch_count);
+
+ /*
+ * Calculate smooth/EWMA (Exponentially Weighted Moving Average)
+ * latency. EWMA is preferred over simple average latency
+ * because it smooths naturally, reduces jitter from sudden
+ * spikes, and adapts faster to changing conditions. It also
+ * avoids storing historical samples, and works well for both
+ * slow and fast I/O rates.
+ * Formula:
+ * slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
+ * With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
+ * existing latency and 1/8 (~12.5%) weight to the new latency.
+ */
+ if (unlikely(!stat->slat_ns))
+ WRITE_ONCE(stat->slat_ns, avg_lat_ns);
+ else {
+ slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
+ WRITE_ONCE(stat->slat_ns, slat_ns);
+ }
+
+ stat->batch = stat->batch_count = 0;
+
+ /*
+ * Defer calculation of the path weight in per-cpu workqueue.
+ */
+ schedule_work_on(cpu, &info->work.weight_work);
+ }
+}
+
void nvme_mpath_end_request(struct request *rq)
{
struct nvme_ns *ns = rq->q->queuedata;
@@ -205,6 +409,9 @@ void nvme_mpath_end_request(struct request *rq)
if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)
atomic_dec_if_positive(&ns->ctrl->nr_active);
+ if (test_bit(NVME_NS_PATH_STAT, &ns->flags))
+ nvme_mpath_add_sample(rq, ns);
+
if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
return;
bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
@@ -238,6 +445,62 @@ static const char *nvme_ana_state_names[] = {
[NVME_ANA_CHANGE] = "change",
};
+static void nvme_mpath_reset_adaptive_path_stat(struct nvme_ns *ns)
+{
+ int i, cpu;
+ struct nvme_path_stat *stat;
+
+ for_each_possible_cpu(cpu) {
+ for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
+ stat = &per_cpu_ptr(ns->info, cpu)[i].stat;
+ memset(stat, 0, sizeof(struct nvme_path_stat));
+ }
+ }
+}
+
+void nvme_mpath_cancel_adaptive_path_weight_work(struct nvme_ns *ns)
+{
+ int i, cpu;
+ struct nvme_path_info *info;
+
+ if (!test_bit(NVME_NS_PATH_STAT, &ns->flags))
+ return;
+
+ for_each_online_cpu(cpu) {
+ for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
+ info = &per_cpu_ptr(ns->info, cpu)[i];
+ cancel_work_sync(&info->work.weight_work);
+ }
+ }
+}
+
+static bool nvme_mpath_enable_adaptive_path_policy(struct nvme_ns *ns)
+{
+ struct nvme_ns_head *head = ns->head;
+
+ if (!head->disk || head->subsys->iopolicy != NVME_IOPOLICY_ADAPTIVE)
+ return false;
+
+ if (test_and_set_bit(NVME_NS_PATH_STAT, &ns->flags))
+ return false;
+
+ blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, ns->queue);
+ blk_stat_enable_accounting(ns->queue);
+ return true;
+}
+
+static bool nvme_mpath_disable_adaptive_path_policy(struct nvme_ns *ns)
+{
+
+ if (!test_and_clear_bit(NVME_NS_PATH_STAT, &ns->flags))
+ return false;
+
+ blk_stat_disable_accounting(ns->queue);
+ blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, ns->queue);
+ nvme_mpath_reset_adaptive_path_stat(ns);
+ return true;
+}
+
bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
{
struct nvme_ns_head *head = ns->head;
@@ -253,6 +516,8 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
changed = true;
}
}
+ if (nvme_mpath_disable_adaptive_path_policy(ns))
+ changed = true;
out:
return changed;
}
@@ -271,6 +536,45 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
srcu_read_unlock(&ctrl->srcu, srcu_idx);
}
+int nvme_alloc_ns_stat(struct nvme_ns *ns)
+{
+ int i, cpu;
+ struct nvme_path_work *work;
+ gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
+
+ if (!ns->head->disk)
+ return 0;
+
+ ns->info = __alloc_percpu_gfp(NVME_NUM_STAT_GROUPS *
+ sizeof(struct nvme_path_info),
+ __alignof__(struct nvme_path_info), gfp);
+ if (!ns->info)
+ return -ENOMEM;
+
+ for_each_possible_cpu(cpu) {
+ for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
+ work = &per_cpu_ptr(ns->info, cpu)[i].work;
+ work->ns = ns;
+ work->op_type = i;
+ INIT_WORK(&work->weight_work, nvme_mpath_weight_work);
+ }
+ }
+
+ return 0;
+}
+
+static void nvme_mpath_set_ctrl_paths(struct nvme_ctrl *ctrl)
+{
+ struct nvme_ns *ns;
+ int srcu_idx;
+
+ srcu_idx = srcu_read_lock(&ctrl->srcu);
+ list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
+ srcu_read_lock_held(&ctrl->srcu))
+ nvme_mpath_enable_adaptive_path_policy(ns);
+ srcu_read_unlock(&ctrl->srcu, srcu_idx);
+}
+
void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
{
struct nvme_ns_head *head = ns->head;
@@ -283,6 +587,8 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
srcu_read_lock_held(&head->srcu)) {
if (capacity != get_capacity(ns->disk))
clear_bit(NVME_NS_READY, &ns->flags);
+
+ nvme_mpath_reset_adaptive_path_stat(ns);
}
srcu_read_unlock(&head->srcu, srcu_idx);
@@ -407,6 +713,92 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
return found;
}
+static inline bool nvme_state_is_live(enum nvme_ana_state state)
+{
+ return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
+}
+
+static struct nvme_ns *nvme_adaptive_path(struct nvme_ns_head *head,
+ unsigned int op_type)
+{
+ struct nvme_ns *ns, *start, *found = NULL;
+ struct nvme_path_stat *stat;
+ u32 weight;
+ int cpu;
+
+ cpu = get_cpu();
+ ns = *this_cpu_ptr(head->adp_path);
+ if (unlikely(!ns)) {
+ ns = list_first_or_null_rcu(&head->list,
+ struct nvme_ns, siblings);
+ if (unlikely(!ns))
+ goto out;
+ }
+found_ns:
+ start = ns;
+ while (nvme_path_is_disabled(ns) ||
+ !nvme_state_is_live(ns->ana_state)) {
+ ns = list_next_entry_circular(ns, &head->list, siblings);
+
+ /*
+ * If we iterate through all paths in the list but find each
+ * path in list is either disabled or dead then bail out.
+ */
+ if (ns == start)
+ goto out;
+ }
+
+ stat = &this_cpu_ptr(ns->info)[op_type].stat;
+
+ /*
+ * When the head path-list is singular we don't calculate the
+ * only path weight for optimization as we don't need to forward
+ * I/O to more than one path. The other possibility is when the
+ * path is newly added and we don't know its weight. So we go round-
+ * robin for each such path and forward I/O to it. Once we start
+ * getting response for such I/Os, the path weight calculation
+ * would kick in and then we start using path credit for
+ * forwarding I/O.
+ */
+ weight = READ_ONCE(stat->weight);
+ if (!weight) {
+ found = ns;
+ goto out;
+ }
+
+ /*
+ * To keep path selection logic simple, we don't distinguish
+ * between ANA optimized and non-optimized states. The non-
+ * optimized path is expected to have a lower weight, and
+ * therefore fewer credits. As a result, only a small number of
+ * I/Os will be forwarded to paths in the non-optimized state.
+ */
+ if (stat->credit > 0) {
+ --stat->credit;
+ found = ns;
+ goto out;
+ } else {
+ /*
+ * Refill credit from path weight and move to next path. The
+ * refilled credit of the current path will be used next when
+ * all remaining paths exhaust their credits.
+ */
+ weight = READ_ONCE(stat->weight);
+ stat->credit = weight;
+ ns = list_next_entry_circular(ns, &head->list, siblings);
+ if (likely(ns))
+ goto found_ns;
+ }
+out:
+ if (found) {
+ stat->sel++;
+ *this_cpu_ptr(head->adp_path) = found;
+ }
+
+ put_cpu();
+ return found;
+}
+
static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
{
struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
@@ -463,9 +855,12 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
return ns;
}
-inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
+inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head,
+ unsigned int op_type)
{
switch (READ_ONCE(head->subsys->iopolicy)) {
+ case NVME_IOPOLICY_ADAPTIVE:
+ return nvme_adaptive_path(head, op_type);
case NVME_IOPOLICY_QD:
return nvme_queue_depth_path(head);
case NVME_IOPOLICY_RR:
@@ -525,7 +920,7 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
return;
srcu_idx = srcu_read_lock(&head->srcu);
- ns = nvme_find_path(head);
+ ns = nvme_find_path(head, nvme_data_dir(bio_op(bio)));
if (likely(ns)) {
bio_set_dev(bio, ns->disk->part0);
bio->bi_opf |= REQ_NVME_MPATH;
@@ -567,7 +962,7 @@ static int nvme_ns_head_get_unique_id(struct gendisk *disk, u8 id[16],
int srcu_idx, ret = -EWOULDBLOCK;
srcu_idx = srcu_read_lock(&head->srcu);
- ns = nvme_find_path(head);
+ ns = nvme_find_path(head, NVME_STAT_OTHER);
if (ns)
ret = nvme_ns_get_unique_id(ns, id, type);
srcu_read_unlock(&head->srcu, srcu_idx);
@@ -583,7 +978,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
int srcu_idx, ret = -EWOULDBLOCK;
srcu_idx = srcu_read_lock(&head->srcu);
- ns = nvme_find_path(head);
+ ns = nvme_find_path(head, NVME_STAT_OTHER);
if (ns)
ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
srcu_read_unlock(&head->srcu, srcu_idx);
@@ -725,6 +1120,9 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
INIT_WORK(&head->partition_scan_work, nvme_partition_scan_work);
INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
head->delayed_removal_secs = 0;
+ head->adp_path = alloc_percpu_gfp(struct nvme_ns*, GFP_KERNEL);
+ if (!head->adp_path)
+ return -ENOMEM;
/*
* If "multipath_always_on" is enabled, a multipath node is added
@@ -809,6 +1207,10 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
}
mutex_unlock(&head->lock);
+ mutex_lock(&nvme_subsystems_lock);
+ nvme_mpath_enable_adaptive_path_policy(ns);
+ mutex_unlock(&nvme_subsystems_lock);
+
synchronize_srcu(&head->srcu);
kblockd_schedule_work(&head->requeue_work);
}
@@ -857,11 +1259,6 @@ static int nvme_parse_ana_log(struct nvme_ctrl *ctrl, void *data,
return 0;
}
-static inline bool nvme_state_is_live(enum nvme_ana_state state)
-{
- return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
-}
-
static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
struct nvme_ns *ns)
{
@@ -1039,10 +1436,12 @@ static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys,
WRITE_ONCE(subsys->iopolicy, iopolicy);
- /* iopolicy changes clear the mpath by design */
+ /* iopolicy changes clear/reset the mpath by design */
mutex_lock(&nvme_subsystems_lock);
list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
nvme_mpath_clear_ctrl_paths(ctrl);
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+ nvme_mpath_set_ctrl_paths(ctrl);
mutex_unlock(&nvme_subsystems_lock);
pr_notice("subsysnqn %s iopolicy changed from %s to %s\n",
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 102fae6a231c..715c7053054c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -28,7 +28,10 @@ extern unsigned int nvme_io_timeout;
extern unsigned int admin_timeout;
#define NVME_ADMIN_TIMEOUT (admin_timeout * HZ)
-#define NVME_DEFAULT_KATO 5
+#define NVME_DEFAULT_KATO 5
+
+#define NVME_DEFAULT_ADP_EWMA_SHIFT 3
+#define NVME_DEFAULT_ADP_WEIGHT_TIMEOUT (15 * NSEC_PER_SEC)
#ifdef CONFIG_ARCH_NO_SG_CHAIN
#define NVME_INLINE_SG_CNT 0
@@ -421,6 +424,7 @@ enum nvme_iopolicy {
NVME_IOPOLICY_NUMA,
NVME_IOPOLICY_RR,
NVME_IOPOLICY_QD,
+ NVME_IOPOLICY_ADAPTIVE,
};
struct nvme_subsystem {
@@ -459,6 +463,37 @@ struct nvme_ns_ids {
u8 csi;
};
+enum nvme_stat_group {
+ NVME_STAT_READ,
+ NVME_STAT_WRITE,
+ NVME_STAT_OTHER,
+ NVME_NUM_STAT_GROUPS
+};
+
+struct nvme_path_stat {
+ u64 nr_samples; /* total num of samples processed */
+ u64 nr_ignored; /* num. of samples ignored */
+ u64 slat_ns; /* smoothed (ewma) latency in nanoseconds */
+ u64 score; /* score used for weight calculation */
+ u64 last_weight_ts; /* timestamp of the last weight calculation */
+ u64 sel; /* num of times this path is selected for I/O */
+ u64 batch; /* accumulated latency sum for current window */
+ u32 batch_count; /* num of samples accumulated in current window */
+ u32 weight; /* path weight */
+ u32 credit; /* path credit for I/O forwarding */
+};
+
+struct nvme_path_work {
+ struct nvme_ns *ns; /* owning namespace */
+ struct work_struct weight_work; /* deferred work for weight calculation */
+ int op_type; /* op type : READ/WRITE/OTHER */
+};
+
+struct nvme_path_info {
+ struct nvme_path_stat stat; /* path statistics */
+ struct nvme_path_work work; /* background worker context */
+};
+
/*
* Anchor structure for namespaces. There is one for each namespace in a
* NVMe subsystem that any of our controllers can see, and the namespace
@@ -508,6 +543,9 @@ struct nvme_ns_head {
unsigned long flags;
struct delayed_work remove_work;
unsigned int delayed_removal_secs;
+
+ struct nvme_ns * __percpu *adp_path;
+
#define NVME_NSHEAD_DISK_LIVE 0
#define NVME_NSHEAD_QUEUE_IF_NO_PATH 1
struct nvme_ns __rcu *current_path[];
@@ -534,6 +572,7 @@ struct nvme_ns {
#ifdef CONFIG_NVME_MULTIPATH
enum nvme_ana_state ana_state;
u32 ana_grpid;
+ struct nvme_path_info __percpu *info;
#endif
struct list_head siblings;
struct kref kref;
@@ -545,6 +584,7 @@ struct nvme_ns {
#define NVME_NS_FORCE_RO 3
#define NVME_NS_READY 4
#define NVME_NS_SYSFS_ATTR_LINK 5
+#define NVME_NS_PATH_STAT 6
struct cdev cdev;
struct device cdev_device;
@@ -949,7 +989,17 @@ extern const struct attribute_group *nvme_dev_attr_groups[];
extern const struct block_device_operations nvme_bdev_ops;
void nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl);
-struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
+struct nvme_ns *nvme_find_path(struct nvme_ns_head *head, unsigned int op_type);
+static inline int nvme_data_dir(const enum req_op op)
+{
+ if (op == REQ_OP_READ)
+ return NVME_STAT_READ;
+ else if (op_is_write(op))
+ return NVME_STAT_WRITE;
+ else
+ return NVME_STAT_OTHER;
+}
+
#ifdef CONFIG_NVME_MULTIPATH
static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
{
@@ -972,12 +1022,14 @@ void nvme_mpath_init_ctrl(struct nvme_ctrl *ctrl);
void nvme_mpath_update(struct nvme_ctrl *ctrl);
void nvme_mpath_uninit(struct nvme_ctrl *ctrl);
void nvme_mpath_stop(struct nvme_ctrl *ctrl);
+void nvme_mpath_cancel_adaptive_path_weight_work(struct nvme_ns *ns);
bool nvme_mpath_clear_current_path(struct nvme_ns *ns);
void nvme_mpath_revalidate_paths(struct nvme_ns *ns);
void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
void nvme_mpath_remove_disk(struct nvme_ns_head *head);
void nvme_mpath_start_request(struct request *rq);
void nvme_mpath_end_request(struct request *rq);
+int nvme_alloc_ns_stat(struct nvme_ns *ns);
static inline void nvme_trace_bio_complete(struct request *req)
{
@@ -1005,6 +1057,13 @@ static inline bool nvme_mpath_queue_if_no_path(struct nvme_ns_head *head)
return true;
return false;
}
+static inline void nvme_free_ns_stat(struct nvme_ns *ns)
+{
+ if (!ns->head->disk)
+ return;
+
+ free_percpu(ns->info);
+}
#else
#define multipath false
static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
@@ -1096,6 +1155,17 @@ static inline bool nvme_mpath_queue_if_no_path(struct nvme_ns_head *head)
{
return false;
}
+static inline void nvme_mpath_cancel_adaptive_path_weight_work(
+ struct nvme_ns *ns)
+{
+}
+static inline int nvme_alloc_ns_stat(struct nvme_ns *ns)
+{
+ return 0;
+}
+static inline void nvme_free_ns_stat(struct nvme_ns *ns)
+{
+}
#endif /* CONFIG_NVME_MULTIPATH */
int nvme_ns_get_unique_id(struct nvme_ns *ns, u8 id[16],
diff --git a/drivers/nvme/host/pr.c b/drivers/nvme/host/pr.c
index ca6a74607b13..7aca2186c462 100644
--- a/drivers/nvme/host/pr.c
+++ b/drivers/nvme/host/pr.c
@@ -53,10 +53,12 @@ static int nvme_send_ns_head_pr_command(struct block_device *bdev,
struct nvme_command *c, void *data, unsigned int data_len)
{
struct nvme_ns_head *head = bdev->bd_disk->private_data;
- int srcu_idx = srcu_read_lock(&head->srcu);
- struct nvme_ns *ns = nvme_find_path(head);
+ int srcu_idx;
+ struct nvme_ns *ns;
int ret = -EWOULDBLOCK;
+ srcu_idx = srcu_read_lock(&head->srcu);
+ ns = nvme_find_path(head, NVME_STAT_OTHER);
if (ns) {
c->common.nsid = cpu_to_le32(ns->head->ns_id);
ret = nvme_submit_sync_cmd(ns->queue, c, data, data_len);
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 29430949ce2f..1cbab90ed42e 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -194,7 +194,7 @@ static int ns_head_update_nuse(struct nvme_ns_head *head)
return 0;
srcu_idx = srcu_read_lock(&head->srcu);
- ns = nvme_find_path(head);
+ ns = nvme_find_path(head, NVME_STAT_OTHER);
if (!ns)
goto out_unlock;
--
2.51.0
* [RFC PATCHv5 3/7] nvme: add generic debugfs support
From: Nilay Shroff @ 2025-11-05 10:33 UTC (permalink / raw)
To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce
Add generic infrastructure for creating and managing debugfs files in
the NVMe module. This introduces helper APIs that allow NVMe drivers to
register and unregister debugfs entries, along with a reusable attribute
structure for defining new debugfs files.
The implementation uses seq_file interfaces to safely expose per-NS and
per-NS-head statistics, while supporting both simple show callbacks and
full seq_operations.
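The table-driven registration described above can be sketched in plain userspace C. This is only an illustration of the pattern (an attribute array terminated by an empty sentinel entry, walked until the name is NULL); the names demo_attr, demo_show_shift, demo_attrs and demo_find_attr are hypothetical and are not part of the patch or of the kernel debugfs API.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace stand-in for an attribute descriptor. */
struct demo_attr {
	const char *name;	/* file name; NULL marks the end of the table */
	int (*show)(void *data, char *buf, size_t len);
};

/* Example show callback: prints an unsigned value followed by a newline. */
static int demo_show_shift(void *data, char *buf, size_t len)
{
	return snprintf(buf, len, "%u\n", *(unsigned int *)data);
}

/* Attribute table; the trailing empty entry is the sentinel. */
static const struct demo_attr demo_attrs[] = {
	{ "adaptive_ewma_shift", demo_show_shift },
	{ NULL, NULL },
};

/* Walk the table until the sentinel and return the matching entry, if any. */
static const struct demo_attr *demo_find_attr(const struct demo_attr *attr,
					      const char *name)
{
	for (; attr->name; attr++)
		if (strcmp(attr->name, name) == 0)
			return attr;
	return NULL;
}
```

The sentinel keeps the registration loop free of an explicit array length, so adding a new file only requires adding one table entry.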
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/Makefile | 2 +-
drivers/nvme/host/core.c | 3 +
drivers/nvme/host/debugfs.c | 138 ++++++++++++++++++++++++++++++++++
drivers/nvme/host/multipath.c | 2 +
drivers/nvme/host/nvme.h | 10 +++
5 files changed, 154 insertions(+), 1 deletion(-)
create mode 100644 drivers/nvme/host/debugfs.c
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index 6414ec968f99..7962dfc3b2ad 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -10,7 +10,7 @@ obj-$(CONFIG_NVME_FC) += nvme-fc.o
obj-$(CONFIG_NVME_TCP) += nvme-tcp.o
obj-$(CONFIG_NVME_APPLE) += nvme-apple.o
-nvme-core-y += core.o ioctl.o sysfs.o pr.o
+nvme-core-y += core.o ioctl.o sysfs.o pr.o debugfs.o
nvme-core-$(CONFIG_NVME_VERBOSE_ERRORS) += constants.o
nvme-core-$(CONFIG_TRACING) += trace.o
nvme-core-$(CONFIG_NVME_MULTIPATH) += multipath.o
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 47f375c63d2d..c15dfcaf3de2 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4187,6 +4187,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
if (device_add_disk(ctrl->device, ns->disk, nvme_ns_attr_groups))
goto out_cleanup_ns_from_list;
+ nvme_debugfs_register(ns->disk);
+
if (!nvme_ns_head_multipath(ns->head))
nvme_add_ns_cdev(ns);
@@ -4276,6 +4278,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
nvme_mpath_remove_sysfs_link(ns);
+ nvme_debugfs_unregister(ns->disk);
del_gendisk(ns->disk);
mutex_lock(&ns->ctrl->namespaces_lock);
diff --git a/drivers/nvme/host/debugfs.c b/drivers/nvme/host/debugfs.c
new file mode 100644
index 000000000000..6bb57c4b5c3b
--- /dev/null
+++ b/drivers/nvme/host/debugfs.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2025 IBM Corporation
+ * Nilay Shroff <nilay@linux.ibm.com>
+ */
+
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+#include "nvme.h"
+
+struct nvme_debugfs_attr {
+ const char *name;
+ umode_t mode;
+ int (*show)(void *data, struct seq_file *m);
+ ssize_t (*write)(void *data, const char __user *buf, size_t count,
+ loff_t *ppos);
+ const struct seq_operations *seq_ops;
+};
+
+struct nvme_debugfs_ctx {
+ void *data;
+ struct nvme_debugfs_attr *attr;
+ int srcu_idx;
+};
+
+static int nvme_debugfs_show(struct seq_file *m, void *v)
+{
+ struct nvme_debugfs_ctx *ctx = m->private;
+ void *data = ctx->data;
+ struct nvme_debugfs_attr *attr = ctx->attr;
+
+ return attr->show(data, m);
+}
+
+static int nvme_debugfs_open(struct inode *inode, struct file *file)
+{
+ void *data = inode->i_private;
+ struct nvme_debugfs_attr *attr = debugfs_get_aux(file);
+ struct nvme_debugfs_ctx *ctx;
+ struct seq_file *m;
+ int ret;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (WARN_ON_ONCE(!ctx))
+ return -ENOMEM;
+
+ ctx->data = data;
+ ctx->attr = attr;
+
+ if (attr->seq_ops) {
+ ret = seq_open(file, attr->seq_ops);
+ if (ret) {
+ kfree(ctx);
+ return ret;
+ }
+ m = file->private_data;
+ m->private = ctx;
+ return ret;
+ }
+
+ if (WARN_ON_ONCE(!attr->show)) {
+ kfree(ctx);
+ return -EPERM;
+ }
+
+ return single_open(file, nvme_debugfs_show, ctx);
+}
+
+static ssize_t nvme_debugfs_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct seq_file *m = file->private_data;
+ struct nvme_debugfs_ctx *ctx = m->private;
+ struct nvme_debugfs_attr *attr = ctx->attr;
+
+ if (!attr->write)
+ return -EPERM;
+
+ return attr->write(ctx->data, buf, count, ppos);
+}
+
+static int nvme_debugfs_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *m = file->private_data;
+ struct nvme_debugfs_ctx *ctx = m->private;
+ struct nvme_debugfs_attr *attr = ctx->attr;
+ int ret;
+
+ if (attr->seq_ops)
+ ret = seq_release(inode, file);
+ else
+ ret = single_release(inode, file);
+
+ kfree(ctx);
+ return ret;
+}
+
+static const struct file_operations nvme_debugfs_fops = {
+ .owner = THIS_MODULE,
+ .open = nvme_debugfs_open,
+ .read = seq_read,
+ .write = nvme_debugfs_write,
+ .llseek = seq_lseek,
+ .release = nvme_debugfs_release,
+};
+
+
+static const struct nvme_debugfs_attr nvme_mpath_debugfs_attrs[] = {
+ {},
+};
+
+static const struct nvme_debugfs_attr nvme_ns_debugfs_attrs[] = {
+ {},
+};
+
+static void nvme_debugfs_create_files(struct request_queue *q,
+ const struct nvme_debugfs_attr *attr, void *data)
+{
+ if (WARN_ON_ONCE(!q->debugfs_dir))
+ return;
+
+ for (; attr->name; attr++)
+ debugfs_create_file_aux(attr->name, attr->mode, q->debugfs_dir,
+ data, (void *)attr, &nvme_debugfs_fops);
+}
+
+void nvme_debugfs_register(struct gendisk *disk)
+{
+ const struct nvme_debugfs_attr *attr;
+
+ if (nvme_disk_is_ns_head(disk))
+ attr = nvme_mpath_debugfs_attrs;
+ else
+ attr = nvme_ns_debugfs_attrs;
+
+ nvme_debugfs_create_files(disk->queue, attr, disk->private_data);
+}
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 55dc28375662..047dd9da9cbf 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -1086,6 +1086,7 @@ static void nvme_remove_head(struct nvme_ns_head *head)
nvme_cdev_del(&head->cdev, &head->cdev_device);
synchronize_srcu(&head->srcu);
+ nvme_debugfs_unregister(head->disk);
del_gendisk(head->disk);
}
nvme_put_ns_head(head);
@@ -1192,6 +1193,7 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
}
nvme_add_ns_head_cdev(head);
kblockd_schedule_work(&head->partition_scan_work);
+ nvme_debugfs_register(head->disk);
}
nvme_mpath_add_sysfs_link(ns->head);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 715c7053054c..1c1ec2a7f9ad 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -1000,6 +1000,16 @@ static inline int nvme_data_dir(const enum req_op op)
return NVME_STAT_OTHER;
}
+void nvme_debugfs_register(struct gendisk *disk);
+static inline void nvme_debugfs_unregister(struct gendisk *disk)
+{
+ /*
+ * Nothing to do for now. When the request queue is unregistered,
+ * all files under q->debugfs_dir are recursively deleted.
+ * This is just a placeholder; the compiler will optimize it out.
+ */
+}
+
#ifdef CONFIG_NVME_MULTIPATH
static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
{
--
2.51.0
* [RFC PATCHv5 4/7] nvme-multipath: add debugfs attribute adaptive_ewma_shift
From: Nilay Shroff @ 2025-11-05 10:33 UTC (permalink / raw)
To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce
By default, the EWMA (Exponentially Weighted Moving Average) shift
value, which the adaptive iopolicy uses to smooth latency samples, is
set to 3. The EWMA is calculated using the following formula:
ewma = (old * ((1 << ewma_shift) - 1) + new) >> ewma_shift;
The default value of 3 assigns 87.5% (7/8) weight to the existing EWMA
value and 12.5% (1/8) weight to the new latency sample. This provides a
stable average that smooths out short-term variations.
However, different workloads may require faster or slower adaptation to
changing conditions. This commit introduces a new debugfs attribute,
adaptive_ewma_shift, allowing users to tune the weighting factor.
For example:
- adaptive_ewma_shift = 2 => 75% old, 25% new
- adaptive_ewma_shift = 1 => 50% old, 50% new
- adaptive_ewma_shift = 0 => 0% old, 100% new
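As a worked illustration of the formula and the shift values listed above, the same arithmetic can be written as a small userspace C function. This is a sketch only: it uses stdint types instead of the kernel's u64/u32, and the sample values in the note below (an old EWMA of 800 ns and a new sample of 1600 ns) are made up for the example.

```c
#include <stdint.h>

/*
 * Shift-based EWMA: keeps ((2^shift - 1) / 2^shift) of the old value and
 * (1 / 2^shift) of the new sample. With shift == 0 the old value is
 * dropped entirely and the result is the new sample.
 */
static uint64_t ewma_update(uint64_t old, uint64_t new_sample,
			    uint32_t ewma_shift)
{
	return (old * ((1ULL << ewma_shift) - 1) + new_sample) >> ewma_shift;
}
```

With old = 800 and new_sample = 1600, a shift of 3 gives (800 * 7 + 1600) / 8 = 900, a shift of 2 gives 1000, a shift of 1 gives 1200, and a shift of 0 gives 1600, which matches the percentages listed above.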
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 3 +++
drivers/nvme/host/debugfs.c | 46 +++++++++++++++++++++++++++++++++++
drivers/nvme/host/multipath.c | 8 +++---
drivers/nvme/host/nvme.h | 1 +
4 files changed, 54 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c15dfcaf3de2..43b9b0d6cbdf 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3913,6 +3913,9 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
head->ids = info->ids;
head->shared = info->is_shared;
head->rotational = info->is_rotational;
+#ifdef CONFIG_NVME_MULTIPATH
+ head->adp_ewma_shift = NVME_DEFAULT_ADP_EWMA_SHIFT;
+#endif
ratelimit_state_init(&head->rs_nuse, 5 * HZ, 1);
ratelimit_set_flags(&head->rs_nuse, RATELIMIT_MSG_ON_RELEASE);
kref_init(&head->ref);
diff --git a/drivers/nvme/host/debugfs.c b/drivers/nvme/host/debugfs.c
index 6bb57c4b5c3b..e3c37041e8f2 100644
--- a/drivers/nvme/host/debugfs.c
+++ b/drivers/nvme/host/debugfs.c
@@ -105,8 +105,54 @@ static const struct file_operations nvme_debugfs_fops = {
.release = nvme_debugfs_release,
};
+#ifdef CONFIG_NVME_MULTIPATH
+static int nvme_adp_ewma_shift_show(void *data, struct seq_file *m)
+{
+ struct nvme_ns_head *head = data;
+
+ seq_printf(m, "%u\n", READ_ONCE(head->adp_ewma_shift));
+ return 0;
+}
+
+static ssize_t nvme_adp_ewma_shift_store(void *data, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct nvme_ns_head *head = data;
+ char kbuf[8];
+ u32 res;
+ int ret;
+ size_t len;
+ char *arg;
+
+ len = min(sizeof(kbuf) - 1, count);
+
+ if (copy_from_user(kbuf, ubuf, len))
+ return -EFAULT;
+
+ kbuf[len] = '\0';
+ arg = strstrip(kbuf);
+
+ ret = kstrtou32(arg, 0, &res);
+ if (ret)
+ return ret;
+
+ /*
+ * Values greater than 8 are nonsensical, as they effectively assign
+ * zero weight to new samples.
+ */
+ if (res > 8)
+ return -EINVAL;
+
+ WRITE_ONCE(head->adp_ewma_shift, res);
+ return count;
+}
+#endif
static const struct nvme_debugfs_attr nvme_mpath_debugfs_attrs[] = {
+#ifdef CONFIG_NVME_MULTIPATH
+ {"adaptive_ewma_shift", 0600, nvme_adp_ewma_shift_show,
+ nvme_adp_ewma_shift_store},
+#endif
{},
};
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 047dd9da9cbf..c7470cc8844e 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -294,10 +294,9 @@ static void nvme_mpath_weight_work(struct work_struct *weight_work)
* For instance, with EWMA_SHIFT = 3, this assigns 7/8 (~87.5 %) weight to
* the existing/old ewma and 1/8 (~12.5%) weight to the new sample.
*/
-static inline u64 ewma_update(u64 old, u64 new)
+static inline u64 ewma_update(u64 old, u64 new, u32 ewma_shift)
{
- return (old * ((1 << NVME_DEFAULT_ADP_EWMA_SHIFT) - 1)
- + new) >> NVME_DEFAULT_ADP_EWMA_SHIFT;
+ return (old * ((1 << ewma_shift) - 1) + new) >> ewma_shift;
}
static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
@@ -389,7 +388,8 @@ static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
if (unlikely(!stat->slat_ns))
WRITE_ONCE(stat->slat_ns, avg_lat_ns);
else {
- slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
+ slat_ns = ewma_update(stat->slat_ns, avg_lat_ns,
+ READ_ONCE(head->adp_ewma_shift));
WRITE_ONCE(stat->slat_ns, slat_ns);
}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1c1ec2a7f9ad..97de45634f08 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -545,6 +545,7 @@ struct nvme_ns_head {
unsigned int delayed_removal_secs;
struct nvme_ns * __percpu *adp_path;
+ u32 adp_ewma_shift;
#define NVME_NSHEAD_DISK_LIVE 0
#define NVME_NSHEAD_QUEUE_IF_NO_PATH 1
--
2.51.0
* [RFC PATCHv5 5/7] nvme-multipath: add debugfs attribute adaptive_weight_timeout
From: Nilay Shroff @ 2025-11-05 10:33 UTC (permalink / raw)
To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce
By default, the adaptive I/O policy accumulates latency samples over a
15-second window. When this window expires, the driver computes the
average latency and updates the smoothed (EWMA) latency value. The
path weight is then recalculated based on this data.
A 15-second window provides a good balance for most workloads, as it
helps smooth out transient latency spikes and produces a more stable
path weight profile. However, some workloads may benefit from faster
or slower adaptation to changing latency conditions.
This commit introduces a new debugfs attribute, adaptive_weight_timeout,
which allows users to configure the path weight calculation interval
based on their workload requirements.
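The windowing behaviour described above can be sketched as follows. This is a simplified userspace model, not the driver code: the struct demo_window and the function demo_add_sample are hypothetical names, and resetting the batch accumulator when the window expires is an assumption of this sketch. It only shows how samples accumulate until the configured timeout elapses, after which the window average is folded into the smoothed latency.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-path window state (all times in nanoseconds). */
struct demo_window {
	uint64_t batch;		/* latency sum for the current window */
	uint32_t batch_count;	/* samples in the current window */
	uint64_t last_ts;	/* time of the last recalculation */
	uint64_t slat_ns;	/* smoothed (EWMA) latency */
};

/*
 * Account one latency sample. Returns true when the window has expired
 * and the smoothed latency was updated, false while still accumulating.
 */
static bool demo_add_sample(struct demo_window *w, uint64_t now_ns,
			    uint64_t lat_ns, uint64_t timeout_ns,
			    uint32_t ewma_shift)
{
	uint64_t avg;

	w->batch += lat_ns;
	w->batch_count++;

	if (now_ns <= w->last_ts || now_ns - w->last_ts < timeout_ns)
		return false;

	avg = w->batch / w->batch_count;
	if (!w->slat_ns)	/* first window: seed the EWMA directly */
		w->slat_ns = avg;
	else
		w->slat_ns = (w->slat_ns * ((1ULL << ewma_shift) - 1) + avg)
				>> ewma_shift;

	/* Start a new window (assumed behaviour for this sketch). */
	w->last_ts = now_ns;
	w->batch = 0;
	w->batch_count = 0;
	return true;
}
```

A shorter timeout makes the smoothed latency follow changes sooner but on fewer samples per window; a longer one averages over more samples before each update.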
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 1 +
drivers/nvme/host/debugfs.c | 40 ++++++++++++++++++++++++++++++++++-
drivers/nvme/host/multipath.c | 7 ++++--
drivers/nvme/host/nvme.h | 1 +
4 files changed, 46 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 43b9b0d6cbdf..d3828c4812fc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3915,6 +3915,7 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
head->rotational = info->is_rotational;
#ifdef CONFIG_NVME_MULTIPATH
head->adp_ewma_shift = NVME_DEFAULT_ADP_EWMA_SHIFT;
+ head->adp_weight_timeout = NVME_DEFAULT_ADP_WEIGHT_TIMEOUT;
#endif
ratelimit_state_init(&head->rs_nuse, 5 * HZ, 1);
ratelimit_set_flags(&head->rs_nuse, RATELIMIT_MSG_ON_RELEASE);
diff --git a/drivers/nvme/host/debugfs.c b/drivers/nvme/host/debugfs.c
index e3c37041e8f2..e382fa411b13 100644
--- a/drivers/nvme/host/debugfs.c
+++ b/drivers/nvme/host/debugfs.c
@@ -146,12 +146,50 @@ static ssize_t nvme_adp_ewma_shift_store(void *data, const char __user *ubuf,
WRITE_ONCE(head->adp_ewma_shift, res);
return count;
}
+
+static int nvme_adp_weight_timeout_show(void *data, struct seq_file *m)
+{
+ struct nvme_ns_head *head = data;
+
+ seq_printf(m, "%llu\n",
+ div_u64(READ_ONCE(head->adp_weight_timeout), NSEC_PER_SEC));
+ return 0;
+}
+
+static ssize_t nvme_adp_weight_timeout_store(void *data,
+ const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ struct nvme_ns_head *head = data;
+ char kbuf[8];
+ u32 res;
+ int ret;
+ size_t len;
+ char *arg;
+
+ len = min(sizeof(kbuf) - 1, count);
+
+ if (copy_from_user(kbuf, ubuf, len))
+ return -EFAULT;
+
+ kbuf[len] = '\0';
+ arg = strstrip(kbuf);
+
+ ret = kstrtou32(arg, 0, &res);
+ if (ret)
+ return ret;
+
+ WRITE_ONCE(head->adp_weight_timeout, res * NSEC_PER_SEC);
+ return count;
+}
#endif
static const struct nvme_debugfs_attr nvme_mpath_debugfs_attrs[] = {
#ifdef CONFIG_NVME_MULTIPATH
- {"adaptive_ewma_shift", 0600, nvme_adp_ewma_shift_show,
+ {"adaptive_ewma_shift", 0600, nvme_adp_ewma_shift_show,
nvme_adp_ewma_shift_store},
+ {"adaptive_weight_timeout", 0600, nvme_adp_weight_timeout_show,
+ nvme_adp_weight_timeout_store},
#endif
{},
};
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index c7470cc8844e..e70a7d5cf036 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -362,8 +362,11 @@ static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
stat->batch_count++;
stat->nr_samples++;
- if (now > stat->last_weight_ts &&
- (now - stat->last_weight_ts) >= NVME_DEFAULT_ADP_WEIGHT_TIMEOUT) {
+ if (now > stat->last_weight_ts) {
+ u64 timeout = READ_ONCE(head->adp_weight_timeout);
+
+ if ((now - stat->last_weight_ts) < timeout)
+ return;
stat->last_weight_ts = now;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 97de45634f08..53d868cccbeb 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -546,6 +546,7 @@ struct nvme_ns_head {
struct nvme_ns * __percpu *adp_path;
u32 adp_ewma_shift;
+ u64 adp_weight_timeout;
#define NVME_NSHEAD_DISK_LIVE 0
#define NVME_NSHEAD_QUEUE_IF_NO_PATH 1
--
2.51.0
* [RFC PATCHv5 6/7] nvme-multipath: add debugfs attribute adaptive_stat
From: Nilay Shroff @ 2025-11-05 10:33 UTC (permalink / raw)
To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce
This commit introduces a new debugfs attribute, "adaptive_stat", under
both the per-path and head debugfs directories (located under
/sys/kernel/debug/block/). This attribute provides visibility into the
internal state of the adaptive I/O policy to aid in debugging and
performance analysis.
For per-path entries, "adaptive_stat" reports the corresponding path
statistics such as I/O weight, selection count, processed samples, and
ignored samples.
For head entries, it reports per-CPU statistics for each reachable path,
including I/O weight, path score, smoothed (EWMA) latency, selection
count, processed samples, and ignored samples.
These additions enhance observability of the adaptive I/O path selection
behavior and help diagnose imbalance or instability in multipath
performance.
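The per-path summary described above combines per-CPU counters into one line. A minimal userspace sketch of that aggregation, assuming counters are summed and the weight is averaged (rounded to nearest) over only those CPUs that have a non-zero weight, could look like this. The names demo_stat and demo_aggregate are hypothetical, and a plain array stands in for the kernel's per-CPU storage.

```c
#include <stdint.h>

/* Hypothetical subset of the per-CPU path statistics. */
struct demo_stat {
	uint64_t sel;		/* times the path was selected */
	uint64_t nr_samples;	/* samples processed */
	uint64_t nr_ignored;	/* samples ignored */
	uint32_t weight;	/* path weight seen by this CPU */
};

/*
 * Sum the counters over all CPUs and average the weight over the CPUs
 * that actually have a weight, rounding to the nearest integer.
 */
static struct demo_stat demo_aggregate(const struct demo_stat *percpu,
				       int ncpus)
{
	struct demo_stat sum = {0};
	uint64_t weight_sum = 0;
	uint32_t active = 0;
	int cpu;

	for (cpu = 0; cpu < ncpus; cpu++) {
		sum.sel += percpu[cpu].sel;
		sum.nr_samples += percpu[cpu].nr_samples;
		sum.nr_ignored += percpu[cpu].nr_ignored;
		weight_sum += percpu[cpu].weight;
		if (percpu[cpu].weight)
			active++;
	}

	if (active)
		sum.weight = (uint32_t)((weight_sum + active / 2) / active);

	return sum;
}
```

Skipping CPUs with a zero weight keeps idle CPUs, which never computed a weight for the path, from dragging the reported average toward zero.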
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/debugfs.c | 113 ++++++++++++++++++++++++++++++++++++
1 file changed, 113 insertions(+)
diff --git a/drivers/nvme/host/debugfs.c b/drivers/nvme/host/debugfs.c
index e382fa411b13..28de4a8e2333 100644
--- a/drivers/nvme/host/debugfs.c
+++ b/drivers/nvme/host/debugfs.c
@@ -182,6 +182,115 @@ static ssize_t nvme_adp_weight_timeout_store(void *data,
WRITE_ONCE(head->adp_weight_timeout, res * NSEC_PER_SEC);
return count;
}
+
+static void *nvme_mpath_adp_stat_start(struct seq_file *m, loff_t *pos)
+{
+ struct nvme_ns *ns;
+ struct nvme_debugfs_ctx *ctx = m->private;
+ struct nvme_ns_head *head = ctx->data;
+
+ /* Remember srcu index, so we can unlock later. */
+ ctx->srcu_idx = srcu_read_lock(&head->srcu);
+ ns = list_first_or_null_rcu(&head->list, struct nvme_ns, siblings);
+
+ while (*pos && ns) {
+ ns = list_next_or_null_rcu(&head->list, &ns->siblings,
+ struct nvme_ns, siblings);
+ (*pos)--;
+ }
+
+ return ns;
+}
+
+static void *nvme_mpath_adp_stat_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct nvme_ns *ns = v;
+ struct nvme_debugfs_ctx *ctx = m->private;
+ struct nvme_ns_head *head = ctx->data;
+
+ (*pos)++;
+
+ return list_next_or_null_rcu(&head->list, &ns->siblings,
+ struct nvme_ns, siblings);
+}
+
+static void nvme_mpath_adp_stat_stop(struct seq_file *m, void *v)
+{
+ struct nvme_debugfs_ctx *ctx = m->private;
+ struct nvme_ns_head *head = ctx->data;
+ int srcu_idx = ctx->srcu_idx;
+
+ srcu_read_unlock(&head->srcu, srcu_idx);
+}
+
+static int nvme_mpath_adp_stat_show(struct seq_file *m, void *v)
+{
+ int i, cpu;
+ struct nvme_path_stat *stat;
+ struct nvme_ns *ns = v;
+
+ seq_printf(m, "%s:\n", ns->disk->disk_name);
+ for_each_online_cpu(cpu) {
+ seq_printf(m, "cpu %d : ", cpu);
+ for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
+ stat = &per_cpu_ptr(ns->info, cpu)[i].stat;
+ seq_printf(m, "%u %u %llu %llu %llu %llu %llu ",
+ stat->weight, stat->credit, stat->score,
+ stat->slat_ns, stat->sel,
+ stat->nr_samples, stat->nr_ignored);
+ }
+ seq_putc(m, '\n');
+ }
+ return 0;
+}
+
+static const struct seq_operations nvme_mpath_adp_stat_seq_ops = {
+ .start = nvme_mpath_adp_stat_start,
+ .next = nvme_mpath_adp_stat_next,
+ .stop = nvme_mpath_adp_stat_stop,
+ .show = nvme_mpath_adp_stat_show
+};
+
+static void adp_stat_read_all(struct nvme_ns *ns, struct nvme_path_stat *batch)
+{
+ int i, cpu;
+ u32 ncpu[NVME_NUM_STAT_GROUPS] = {0};
+ struct nvme_path_stat *stat;
+
+ for_each_online_cpu(cpu) {
+ for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
+ stat = &per_cpu_ptr(ns->info, cpu)[i].stat;
+ batch[i].sel += stat->sel;
+ batch[i].nr_samples += stat->nr_samples;
+ batch[i].nr_ignored += stat->nr_ignored;
+ batch[i].weight += stat->weight;
+ if (stat->weight)
+ ncpu[i]++;
+ }
+ }
+
+ for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
+ if (!ncpu[i])
+ continue;
+ batch[i].weight = DIV_U64_ROUND_CLOSEST(batch[i].weight,
+ ncpu[i]);
+ }
+}
+
+static int nvme_ns_adp_stat_show(void *data, struct seq_file *m)
+{
+ int i;
+ struct nvme_path_stat stat[NVME_NUM_STAT_GROUPS] = {0};
+ struct nvme_ns *ns = (struct nvme_ns *)data;
+
+ adp_stat_read_all(ns, stat);
+ for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
+ seq_printf(m, "%u %llu %llu %llu ",
+ stat[i].weight, stat[i].sel,
+ stat[i].nr_samples, stat[i].nr_ignored);
+ }
+ return 0;
+}
#endif
static const struct nvme_debugfs_attr nvme_mpath_debugfs_attrs[] = {
@@ -190,11 +299,15 @@ static const struct nvme_debugfs_attr nvme_mpath_debugfs_attrs[] = {
nvme_adp_ewma_shift_store},
{"adaptive_weight_timeout", 0600, nvme_adp_weight_timeout_show,
nvme_adp_weight_timeout_store},
+ {"adaptive_stat", 0400, .seq_ops = &nvme_mpath_adp_stat_seq_ops},
#endif
{},
};
static const struct nvme_debugfs_attr nvme_ns_debugfs_attrs[] = {
+#ifdef CONFIG_NVME_MULTIPATH
+ {"adaptive_stat", 0400, nvme_ns_adp_stat_show},
+#endif
{},
};
--
2.51.0
* [RFC PATCHv5 7/7] nvme-multipath: add documentation for adaptive I/O policy
From: Nilay Shroff @ 2025-11-05 10:33 UTC (permalink / raw)
To: linux-nvme; +Cc: hare, hch, kbusch, sagi, dwagner, axboe, kanie, gjoyce
Update the nvme-multipath documentation to describe the adaptive I/O
policy, its behavior, and when it is suitable for use.
Suggested-by: Guixin Liu <kanie@linux.alibaba.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
Documentation/admin-guide/nvme-multipath.rst | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/Documentation/admin-guide/nvme-multipath.rst b/Documentation/admin-guide/nvme-multipath.rst
index 97ca1ccef459..7befaab01cf5 100644
--- a/Documentation/admin-guide/nvme-multipath.rst
+++ b/Documentation/admin-guide/nvme-multipath.rst
@@ -70,3 +70,22 @@ When to use the queue-depth policy:
1. High load with small I/Os: Effectively balances load across paths when
the load is high, and I/O operations consist of small, relatively
fixed-sized requests.
+
+Adaptive
+--------
+
+The adaptive policy manages I/O requests based on path latency. It periodically
+calculates a weight for each path and distributes I/O accordingly. Paths with
+higher latency receive lower weights, resulting in fewer I/O requests being sent
+to them, while paths with lower latency handle a proportionally larger share of
+the I/O load.
+
+When to use the adaptive policy
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. Homogeneous Path Performance: Utilizes all available paths efficiently when
+ their performance characteristics (e.g., latency, bandwidth) are similar.
+
+2. Heterogeneous Path Performance: Dynamically distributes I/O based on per-path
+ performance characteristics. Paths with lower latency receive a higher share
+ of I/O compared to those with higher latency.
--
2.51.0
* Re: [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy
From: Nilay Shroff @ 2025-12-09 13:56 UTC (permalink / raw)
To: Keith Busch
Cc: hare, hch, sagi, dwagner, axboe, kanie, gjoyce,
linux-nvme@lists.infradead.org
Hi Keith,
Just a gentle ping on this one...
It has been reviewed and has been ready for some time now, and I wanted to check
whether you have any remaining feedback or concerns, or whether you could
consider pulling it into nvme-next.
Link to the latest version for convenience:
https://lore.kernel.org/all/20251105103347.86059-1-nilay@linux.ibm.com/
Please let me know if there's anything further needed on my side.
Thanks,
--Nilay
On 11/5/25 4:03 PM, Nilay Shroff wrote:
> Hi,
>
> This series introduces a new adaptive I/O policy for NVMe native
> multipath. Existing policies such as numa, round-robin, and queue-depth
> are static and do not adapt to real-time transport performance. The numa
> selects the path closest to the NUMA node of the current CPU, optimizing
> memory and path locality, but ignores actual path performance. The
> round-robin distributes I/O evenly across all paths, providing fairness
> but not performance awareness. The queue-depth reacts to instantaneous
> queue occupancy, avoiding heavily loaded paths, but does not account for
> actual latency, throughput, or link speed.
>
> The new adaptive policy addresses these gaps by selecting paths dynamically
> based on measured I/O latency for both PCIe and fabrics. Latency is
> derived by passively sampling I/O completions. Each path is assigned a
> weight proportional to its latency score, and I/Os are then forwarded
> accordingly. As conditions change (e.g. latency spikes, bandwidth
> differences), path weights are updated, automatically steering traffic
> toward better-performing paths.
>
> Early results show reduced tail latency under mixed workloads and
> improved throughput by exploiting higher-speed links more effectively.
> For example, with NVMf/TCP using two paths (one throttled with ~30 ms
> delay), fio results with random read/write/rw workloads (direct I/O)
> showed:
>
> numa round-robin queue-depth adaptive
> ----------- ----------- ----------- ---------
> READ: 50.0 MiB/s 105 MiB/s 230 MiB/s 350 MiB/s
> WRITE: 65.9 MiB/s 125 MiB/s 385 MiB/s 446 MiB/s
> RW: R:30.6 MiB/s R:56.5 MiB/s R:122 MiB/s R:175 MiB/s
> W:30.7 MiB/s W:56.5 MiB/s W:122 MiB/s W:175 MiB/s
>
> This patchset includes a total of 7 patches:
> [PATCH 1/7] block: expose blk_stat_{enable,disable}_accounting()
> - Make blk_stat APIs available to block drivers.
> - Needed for per-path latency measurement in adaptive policy.
>
> [PATCH 2/7] nvme-multipath: add adaptive I/O policy
> - Implement path scoring based on latency (EWMA).
> - Distribute I/O proportionally to per-path weights.
>
> [PATCH 3/7] nvme: add generic debugfs support
> - Introduce generic debugfs support for NVMe module
>
> [PATCH 4/7] nvme-multipath: add debugfs attribute adaptive_ewma_shift
> - Adds a debugfs attribute to control ewma shift
>
> [PATCH 5/7] nvme-multipath: add debugfs attribute adaptive_weight_timeout
> - Adds a debugfs attribute to control path weight calculation timeout
>
> [PATCH 6/7] nvme-multipath: add debugfs attribute adaptive_stat
> - Add “adaptive_stat” under per-path and head debugfs directories to
> expose adaptive policy state and statistics.
>
> [PATCH 7/7] nvme-multipath: add documentation for adaptive I/O policy
> - Includes documentation for adaptive I/O multipath policy.
>
> As usual, feedback and suggestions are most welcome!
>
> Thanks!
>
> Changes from v4:
> - Added patch #7 which includes the documentation for adaptive I/O
> policy. (Guixin Liu)
> Link to v4: https://lore.kernel.org/all/20251104104533.138481-1-nilay@linux.ibm.com/
>
> Changes from v3:
> - Update the adaptive APIs name (which actually enable/disable
> adaptive policy) to reflect the actual work it does. Also removed
> the misleading use of "current_path" from the adaptive policy code
> (Hannes Reinecke)
> - Move adaptive_ewma_shift and adaptive_weight_timeout attributes from
> sysfs to debugfs (Hannes Reinecke)
> Link to v3: https://lore.kernel.org/all/20251027092949.961287-1-nilay@linux.ibm.com/
>
> Changes from v2:
> - Added a new patch to allow the user to configure the EWMA shift
> through sysfs (Hannes Reinecke)
> - Added a new patch to allow user to configure path weight
> calculation timeout (Hannes Reinecke)
> - Distinguish between read/write and other commands (e.g.
> admin commands) and calculate a path weight for other commands
> that is separate from the read/write weight. (Hannes Reinecke)
> - Normalize per-path weight in the range from 0-128 instead
> of 0-100 (Hannes Reinecke)
> - Restructure and optimize adaptive I/O forwarding code to use
> one loop instead of two (Hannes Reinecke)
> Link to v2: https://lore.kernel.org/all/20251009100608.1699550-1-nilay@linux.ibm.com/
>
> Changes from v1:
> - Ensure that the completion of I/O occurs on the same CPU as the
> submitting I/O CPU (Hannes Reinecke)
> - Remove adapter link speed from the path weight calculation
> (Hannes Reinecke)
> - Add adaptive I/O stat under debugfs instead of current sysfs
> (Hannes Reinecke)
> - Move path weight calculation to a workqueue from IO completion
> code path
> Link to v1: https://lore.kernel.org/all/20250921111234.863853-1-nilay@linux.ibm.com/
>
> Nilay Shroff (7):
> block: expose blk_stat_{enable,disable}_accounting() to drivers
> nvme-multipath: add support for adaptive I/O policy
> nvme: add generic debugfs support
> nvme-multipath: add debugfs attribute adaptive_ewma_shift
> nvme-multipath: add debugfs attribute adaptive_weight_timeout
> nvme-multipath: add debugfs attribute adaptive_stat
> nvme-multipath: add documentation for adaptive I/O policy
>
> Documentation/admin-guide/nvme-multipath.rst | 19 +
> block/blk-stat.h | 4 -
> drivers/nvme/host/Makefile | 2 +-
> drivers/nvme/host/core.c | 22 +-
> drivers/nvme/host/debugfs.c | 335 +++++++++++++++
> drivers/nvme/host/ioctl.c | 31 +-
> drivers/nvme/host/multipath.c | 430 ++++++++++++++++++-
> drivers/nvme/host/nvme.h | 86 +++-
> drivers/nvme/host/pr.c | 6 +-
> drivers/nvme/host/sysfs.c | 2 +-
> include/linux/blk-mq.h | 4 +
> 11 files changed, 913 insertions(+), 28 deletions(-)
> create mode 100644 drivers/nvme/host/debugfs.c
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy
2025-11-05 10:33 [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
` (7 preceding siblings ...)
2025-12-09 13:56 ` [RFC PATCHv5 0/7] nvme-multipath: introduce " Nilay Shroff
@ 2025-12-12 12:08 ` Sagi Grimberg
2025-12-13 8:22 ` Nilay Shroff
8 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-12 12:08 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce
On 05/11/2025 12:33, Nilay Shroff wrote:
> Hi,
>
> This series introduces a new adaptive I/O policy for NVMe native
> multipath. Existing policies such as numa, round-robin, and queue-depth
> are static and do not adapt to real-time transport performance.
It can be argued that queue-depth is a proxy of latency.
> The numa
> selects the path closest to the NUMA node of the current CPU, optimizing
> memory and path locality, but ignores actual path performance. The
> round-robin distributes I/O evenly across all paths, providing fairness
> but not performance awareness. The queue-depth reacts to instantaneous
> queue occupancy, avoiding heavily loaded paths, but does not account for
> actual latency, throughput, or link speed.
>
> The new adaptive policy addresses these gaps by selecting paths dynamically
> based on measured I/O latency for both PCIe and fabrics.
Adaptive is not a good name. Maybe weighted-latency or wplat (weighted
path latency), or something like that.
> Latency is
> derived by passively sampling I/O completions. Each path is assigned a
> weight proportional to its latency score, and I/Os are then forwarded
> accordingly. As conditions change (e.g. latency spikes, bandwidth
> differences), path weights are updated, automatically steering traffic
> toward better-performing paths.
>
> Early results show reduced tail latency under mixed workloads and
> improved throughput by exploiting higher-speed links more effectively.
> For example, with NVMf/TCP using two paths (one throttled with ~30 ms
> delay), fio results with random read/write/rw workloads (direct I/O)
> showed:
>
> numa round-robin queue-depth adaptive
> ----------- ----------- ----------- ---------
> READ: 50.0 MiB/s 105 MiB/s 230 MiB/s 350 MiB/s
> WRITE: 65.9 MiB/s 125 MiB/s 385 MiB/s 446 MiB/s
> RW: R:30.6 MiB/s R:56.5 MiB/s R:122 MiB/s R:175 MiB/s
> W:30.7 MiB/s W:56.5 MiB/s W:122 MiB/s W:175 MiB/s
Seems like a nice gain.
Can you please test for the normal symmetric paths case? Would like
to see the trade-off...
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCHv5 1/7] block: expose blk_stat_{enable,disable}_accounting() to drivers
2025-11-05 10:33 ` [RFC PATCHv5 1/7] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
@ 2025-12-12 12:16 ` Sagi Grimberg
0 siblings, 0 replies; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-12 12:16 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce
On 05/11/2025 12:33, Nilay Shroff wrote:
> The functions blk_stat_enable_accounting() and
> blk_stat_disable_accounting() are currently exported, but their
> prototypes are only defined in a private header. Move these prototypes
> into a common header so that block drivers can directly use these APIs.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2025-11-05 10:33 ` [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy Nilay Shroff
@ 2025-12-12 13:04 ` Sagi Grimberg
2025-12-13 7:27 ` Nilay Shroff
0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-12 13:04 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce
On 05/11/2025 12:33, Nilay Shroff wrote:
> This commit introduces a new I/O policy named "adaptive". Users can
> configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
> subsystemX/iopolicy"
>
> The adaptive policy dynamically distributes I/O based on measured
> completion latency. The main idea is to calculate latency for each path,
> derive a weight, and then proportionally forward I/O according to those
> weights.
>
> To ensure scalability, path latency is measured per-CPU. Each CPU
> maintains its own statistics, and I/O forwarding uses these per-CPU
> values.
So a given cpu would select path-a while another cpu may select path-b?
How does that play with fewer queues than cpu cores? What happens to cores
that have low traffic?
> Every ~15 seconds, a simple average latency of per-CPU batched
> samples are computed and fed into an Exponentially Weighted Moving
> Average (EWMA):
I suggest having the iopolicy name reflect EWMA. Maybe "ewma-lat"?
>
> avg_latency = div_u64(batch, batch_count);
> new_ewma_latency = (prev_ewma_latency * (WEIGHT-1) + avg_latency)/WEIGHT
>
> With WEIGHT = 8, this assigns 7/8 (~87.5%) weight to the previous
> latency value and 1/8 (~12.5%) to the most recent latency. This
> smoothing reduces jitter, adapts quickly to changing conditions,
> avoids storing historical samples, and works well for both low and
> high I/O rates.
Was this weight chosen based on empirical measurements?
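
For reference, the update rule quoted above can be sketched in plain
user-space C. This is only an illustration under stated assumptions: the
names ewma_latency and EWMA_SHIFT are hypothetical (not taken from the
patch), and the shift of 3 (WEIGHT = 8) follows the quoted description.

```c
#include <stdint.h>

/* Hypothetical constant for this sketch: a shift of 3 means WEIGHT = 1 << 3 = 8. */
#define EWMA_SHIFT 3

/*
 * Blend a new average-latency sample into the previous smoothed value:
 *   new = (prev * (WEIGHT - 1) + sample) / WEIGHT, with WEIGHT = 8.
 * A previous value of 0 is treated as "no history yet", so the first
 * sample seeds the average directly.
 */
static uint64_t ewma_latency(uint64_t prev_ns, uint64_t sample_ns)
{
	if (prev_ns == 0)
		return sample_ns;
	return (prev_ns * ((1u << EWMA_SHIFT) - 1) + sample_ns) >> EWMA_SHIFT;
}
```

With a previous value of 800 ns and a new batch average of 1600 ns this
gives (800 * 7 + 1600) / 8 = 900 ns, so a single slow batch moves the
estimate by only one eighth of the difference.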
> Path weights are then derived from the smoothed (EWMA)
> latency as follows (example with two paths A and B):
>
> path_A_score = NSEC_PER_SEC / path_A_ewma_latency
> path_B_score = NSEC_PER_SEC / path_B_ewma_latency
> total_score = path_A_score + path_B_score
>
> path_A_weight = (path_A_score * 64) / total_score
> path_B_weight = (path_B_score * 64) / total_score
What happens with mixed R/W workloads? What happens when the I/O pattern
has a distribution of block sizes?
I think that in order to understand how a non-trivial path selector works,
we need thorough testing with a variety of I/O patterns.
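
To make the score and weight arithmetic concrete, here is a minimal
user-space sketch. It is not the patch's code: compute_weights and
WEIGHT_SCALE are hypothetical names, the scale of 64 is assumed from the
"stat->score * 64" expression in the quoted diff, and the floor of 1
mirrors the minimum weight the patch enforces.

```c
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define WEIGHT_SCALE 64 /* assumed scale, taken from the quoted diff */

/*
 * Derive per-path weights from smoothed latencies (in ns).
 * score  = NSEC_PER_SEC / latency   (0 if the path has no samples yet)
 * weight = score * WEIGHT_SCALE / total_score, floored at 1
 * Returns 0 and leaves weights untouched if no path has a usable
 * latency, 1 otherwise.
 */
static int compute_weights(const uint64_t *ewma_ns, uint32_t *weights,
			   size_t npaths)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < npaths; i++)
		if (ewma_ns[i])
			total += NSEC_PER_SEC / ewma_ns[i];
	if (!total)
		return 0;

	for (i = 0; i < npaths; i++) {
		uint64_t score = ewma_ns[i] ? NSEC_PER_SEC / ewma_ns[i] : 0;
		uint32_t w = (uint32_t)(score * WEIGHT_SCALE / total);

		/* Keep every path probed at least occasionally. */
		weights[i] = w ? w : 1;
	}
	return 1;
}
```

For two paths with smoothed latencies of 1 ms and 3 ms the scores are
1000 and 333, which gives weights of 48 and 15. Integer truncation means
the weights need not sum exactly to 64.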
>
> where:
> - path_X_ewma_latency is the smoothed latency of a path in nanoseconds
> - NSEC_PER_SEC is used as a scaling factor since valid latencies
> are < 1 second
> - weights are normalized to a 0–64 scale across all paths.
>
> Path credits are refilled based on this weight, with one credit
> consumed per I/O. When all credits are consumed, the credits are
> refilled again based on the current weight. This ensures that I/O is
> distributed across paths proportionally to their calculated weight.
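
The credit scheme described above can be sketched roughly as follows.
This is a simplified user-space illustration and not the patch's
nvme_adaptive_path(): struct path and pick_path are hypothetical names,
and ANA state, disabled paths and per-CPU state are all left out.

```c
#include <stddef.h>
#include <stdint.h>

struct path {
	uint32_t weight; /* relative share of I/O, e.g. 1..64 */
	uint32_t credit; /* I/Os left before this path must be refilled */
};

/*
 * Pick the path for the next I/O, starting the search at index *cur.
 * A path with credit left is used and charged one credit. A path that
 * has run dry is refilled from its weight and skipped for this pick,
 * so the refilled credit is spent on a later visit.
 * Returns the chosen index, or -1 if there are no paths or all
 * weights are 0.
 */
static int pick_path(struct path *paths, size_t npaths, size_t *cur)
{
	size_t tries;

	/* Two full rounds suffice: one to refill, one to find credit. */
	for (tries = 0; tries < 2 * npaths; tries++) {
		struct path *p = &paths[*cur];

		if (p->credit > 0) {
			p->credit--;
			return (int)*cur;
		}
		p->credit = p->weight;
		*cur = (*cur + 1) % npaths;
	}
	return -1;
}
```

With weights 3 and 1, eight consecutive picks send six I/Os to the first
path and two to the second, i.e. the 3:1 split the weights ask for.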
>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> drivers/nvme/host/core.c | 15 +-
> drivers/nvme/host/ioctl.c | 31 ++-
> drivers/nvme/host/multipath.c | 425 ++++++++++++++++++++++++++++++++--
> drivers/nvme/host/nvme.h | 74 +++++-
> drivers/nvme/host/pr.c | 6 +-
> drivers/nvme/host/sysfs.c | 2 +-
> 6 files changed, 530 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fa4181d7de73..47f375c63d2d 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -672,6 +672,9 @@ static void nvme_free_ns_head(struct kref *ref)
> cleanup_srcu_struct(&head->srcu);
> nvme_put_subsystem(head->subsys);
> kfree(head->plids);
> +#ifdef CONFIG_NVME_MULTIPATH
> + free_percpu(head->adp_path);
> +#endif
> kfree(head);
> }
>
> @@ -689,6 +692,7 @@ static void nvme_free_ns(struct kref *kref)
> {
> struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
>
> + nvme_free_ns_stat(ns);
> put_disk(ns->disk);
> nvme_put_ns_head(ns->head);
> nvme_put_ctrl(ns->ctrl);
> @@ -4137,6 +4141,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
> if (nvme_init_ns_head(ns, info))
> goto out_cleanup_disk;
>
> + if (nvme_alloc_ns_stat(ns))
> + goto out_unlink_ns;
> +
> /*
> * If multipathing is enabled, the device name for all disks and not
> * just those that represent shared namespaces needs to be based on the
> @@ -4161,7 +4168,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
> }
>
> if (nvme_update_ns_info(ns, info))
> - goto out_unlink_ns;
> + goto out_free_ns_stat;
>
> mutex_lock(&ctrl->namespaces_lock);
> /*
> @@ -4170,7 +4177,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
> */
> if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
> mutex_unlock(&ctrl->namespaces_lock);
> - goto out_unlink_ns;
> + goto out_free_ns_stat;
> }
> nvme_ns_add_to_ctrl_list(ns);
> mutex_unlock(&ctrl->namespaces_lock);
> @@ -4201,6 +4208,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
> list_del_rcu(&ns->list);
> mutex_unlock(&ctrl->namespaces_lock);
> synchronize_srcu(&ctrl->srcu);
> +out_free_ns_stat:
> + nvme_free_ns_stat(ns);
> out_unlink_ns:
> mutex_lock(&ctrl->subsys->lock);
> list_del_rcu(&ns->siblings);
> @@ -4244,6 +4253,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
> */
> synchronize_srcu(&ns->head->srcu);
>
> + nvme_mpath_cancel_adaptive_path_weight_work(ns);
> +
I personally think that the check on path stats should be done at the
call site and not in the function itself.
> /* wait for concurrent submissions */
> if (nvme_mpath_clear_current_path(ns))
> synchronize_srcu(&ns->head->srcu);
> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> index c212fa952c0f..759d147d9930 100644
> --- a/drivers/nvme/host/ioctl.c
> +++ b/drivers/nvme/host/ioctl.c
> @@ -700,18 +700,29 @@ static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
> int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
> unsigned int cmd, unsigned long arg)
> {
> + u8 opcode;
> struct nvme_ns_head *head = bdev->bd_disk->private_data;
> bool open_for_write = mode & BLK_OPEN_WRITE;
> void __user *argp = (void __user *)arg;
> struct nvme_ns *ns;
> int srcu_idx, ret = -EWOULDBLOCK;
> unsigned int flags = 0;
> + unsigned int op_type = NVME_STAT_OTHER;
>
> if (bdev_is_partition(bdev))
> flags |= NVME_IOCTL_PARTITION;
>
> + if (cmd == NVME_IOCTL_SUBMIT_IO) {
> + if (get_user(opcode, (u8 *)argp))
> + return -EFAULT;
> + if (opcode == nvme_cmd_write)
> + op_type = NVME_STAT_WRITE;
> + else if (opcode == nvme_cmd_read)
> + op_type = NVME_STAT_READ;
> + }
> +
> srcu_idx = srcu_read_lock(&head->srcu);
> - ns = nvme_find_path(head);
> + ns = nvme_find_path(head, op_type);
Perhaps it would be easier to review if you split passing the opcode to
nvme_find_path() into a prep patch (explaining that the new iopolicy will
leverage it).
> if (!ns)
> goto out_unlock;
>
> @@ -733,6 +744,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
> long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
> unsigned long arg)
> {
> + u8 opcode;
> bool open_for_write = file->f_mode & FMODE_WRITE;
> struct cdev *cdev = file_inode(file)->i_cdev;
> struct nvme_ns_head *head =
> @@ -740,9 +752,19 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
> void __user *argp = (void __user *)arg;
> struct nvme_ns *ns;
> int srcu_idx, ret = -EWOULDBLOCK;
> + unsigned int op_type = NVME_STAT_OTHER;
> +
> + if (cmd == NVME_IOCTL_SUBMIT_IO) {
> + if (get_user(opcode, (u8 *)argp))
> + return -EFAULT;
> + if (opcode == nvme_cmd_write)
> + op_type = NVME_STAT_WRITE;
> + else if (opcode == nvme_cmd_read)
> + op_type = NVME_STAT_READ;
> + }
>
> srcu_idx = srcu_read_lock(&head->srcu);
> - ns = nvme_find_path(head);
> + ns = nvme_find_path(head, op_type);
> if (!ns)
> goto out_unlock;
>
> @@ -762,7 +784,10 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
> struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
> struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
> int srcu_idx = srcu_read_lock(&head->srcu);
> - struct nvme_ns *ns = nvme_find_path(head);
> + const struct nvme_uring_cmd *cmd = io_uring_sqe_cmd(ioucmd->sqe);
> + struct nvme_ns *ns = nvme_find_path(head,
> + READ_ONCE(cmd->opcode) & 1 ?
> + NVME_STAT_WRITE : NVME_STAT_READ);
> int ret = -EINVAL;
>
> if (ns)
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 543e17aead12..55dc28375662 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -6,6 +6,9 @@
> #include <linux/backing-dev.h>
> #include <linux/moduleparam.h>
> #include <linux/vmalloc.h>
> +#include <linux/blk-mq.h>
> +#include <linux/math64.h>
> +#include <linux/rculist.h>
> #include <trace/events/block.h>
> #include "nvme.h"
>
> @@ -66,9 +69,10 @@ MODULE_PARM_DESC(multipath_always_on,
> "create multipath node always except for private namespace with non-unique nsid; note that this also implicitly enables native multipath support");
>
> static const char *nvme_iopolicy_names[] = {
> - [NVME_IOPOLICY_NUMA] = "numa",
> - [NVME_IOPOLICY_RR] = "round-robin",
> - [NVME_IOPOLICY_QD] = "queue-depth",
> + [NVME_IOPOLICY_NUMA] = "numa",
> + [NVME_IOPOLICY_RR] = "round-robin",
> + [NVME_IOPOLICY_QD] = "queue-depth",
> + [NVME_IOPOLICY_ADAPTIVE] = "adaptive",
> };
>
> static int iopolicy = NVME_IOPOLICY_NUMA;
> @@ -83,6 +87,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
> iopolicy = NVME_IOPOLICY_RR;
> else if (!strncmp(val, "queue-depth", 11))
> iopolicy = NVME_IOPOLICY_QD;
> + else if (!strncmp(val, "adaptive", 8))
> + iopolicy = NVME_IOPOLICY_ADAPTIVE;
> else
> return -EINVAL;
>
> @@ -198,6 +204,204 @@ void nvme_mpath_start_request(struct request *rq)
> }
> EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
>
> +static void nvme_mpath_weight_work(struct work_struct *weight_work)
> +{
> + int cpu, srcu_idx;
> + u32 weight;
> + struct nvme_ns *ns;
> + struct nvme_path_stat *stat;
> + struct nvme_path_work *work = container_of(weight_work,
> + struct nvme_path_work, weight_work);
> + struct nvme_ns_head *head = work->ns->head;
> + int op_type = work->op_type;
> + u64 total_score = 0;
> +
> + cpu = get_cpu();
> +
> + srcu_idx = srcu_read_lock(&head->srcu);
> + list_for_each_entry_srcu(ns, &head->list, siblings,
> + srcu_read_lock_held(&head->srcu)) {
> +
> + stat = &this_cpu_ptr(ns->info)[op_type].stat;
> + if (!READ_ONCE(stat->slat_ns)) {
> + stat->score = 0;
> + continue;
> + }
> + /*
> + * Compute the path score as the inverse of smoothed
> + * latency, scaled by NSEC_PER_SEC. Floating point
> + * math is unavailable in the kernel, so fixed-point
> + * scaling is used instead. NSEC_PER_SEC is chosen
> + * because valid latencies are always < 1 second; longer
> + * latencies are ignored.
> + */
> + stat->score = div_u64(NSEC_PER_SEC, READ_ONCE(stat->slat_ns));
> +
> + /* Compute total score. */
> + total_score += stat->score;
> + }
> +
> + if (!total_score)
> + goto out;
> +
> + /*
> + * After computing the total score, we derive per-path weight
> + * (normalized to the range 0–64). The weight represents the
> + * relative share of I/O the path should receive.
> + *
> + * - lower smoothed latency -> higher weight
> + * - higher smoothed latency -> lower weight
> + *
> + * Next, while forwarding I/O, we assign "credits" to each path
> + * based on its weight (please also refer nvme_adaptive_path()):
> + * - Initially, credits = weight.
> + * - Each time an I/O is dispatched on a path, its credits are
> + * decremented proportionally.
> + * - When a path runs out of credits, it becomes temporarily
> + * ineligible until credit is refilled.
> + *
> + * I/O distribution is therefore governed by available credits,
> + * ensuring that over time the proportion of I/O sent to each
> + * path matches its weight (and thus its performance).
> + */
> + list_for_each_entry_srcu(ns, &head->list, siblings,
> + srcu_read_lock_held(&head->srcu)) {
> +
> + stat = &this_cpu_ptr(ns->info)[op_type].stat;
> + weight = div_u64(stat->score * 64, total_score);
> +
> + /*
> + * Ensure the path weight never drops below 1. A weight
> + * of 0 is used only for newly added paths. During
> + * bootstrap, a few I/Os are sent to such paths to
> + * establish an initial weight. Enforcing a minimum
> + * weight of 1 guarantees that no path is forgotten and
> + * that each path is probed at least occasionally.
> + */
> + if (!weight)
> + weight = 1;
> +
> + WRITE_ONCE(stat->weight, weight);
> + }
> +out:
> + srcu_read_unlock(&head->srcu, srcu_idx);
> + put_cpu();
> +}
> +
> +/*
> + * Formula to calculate the EWMA (Exponentially Weighted Moving Average):
> + * ewma = (old_ewma * ((1 << EWMA_SHIFT) - 1) + new_sample) >> EWMA_SHIFT
> + * For instance, with EWMA_SHIFT = 3, this assigns 7/8 (~87.5 %) weight to
> + * the existing/old ewma and 1/8 (~12.5%) weight to the new sample.
> + */
> +static inline u64 ewma_update(u64 old, u64 new)
It is a calculation function, so let's call it calc_ewma_update.
> +{
> + return (old * ((1 << NVME_DEFAULT_ADP_EWMA_SHIFT) - 1)
> + + new) >> NVME_DEFAULT_ADP_EWMA_SHIFT;
> +}
> +
> +static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
> +{
> + int cpu;
> + unsigned int op_type;
> + struct nvme_path_info *info;
> + struct nvme_path_stat *stat;
> + u64 now, latency, slat_ns, avg_lat_ns;
> + struct nvme_ns_head *head = ns->head;
> +
> + if (list_is_singular(&head->list))
> + return;
> +
> + now = ktime_get_ns();
> + latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
> + if (!latency)
> + return;
> +
> + /*
> + * As completion code path is serialized(i.e. no same completion queue
> + * update code could run simultaneously on multiple cpu) we can safely
> + * access per cpu nvme path stat here from another cpu (in case the
> + * completion cpu is different from submission cpu).
> + * The only field which could be accessed simultaneously here is the
> + * path ->weight which may be accessed by this function as well as I/O
> + * submission path during path selection logic and we protect ->weight
> + * using READ_ONCE/WRITE_ONCE. Yes this may not be 100% accurate but
> + * we also don't need to be so accurate here as the path credit would
> + * be anyways refilled, based on path weight, once path consumes all
> + * its credits. And we limit path weight/credit max up to 64. Please
> + * also refer nvme_adaptive_path().
> + */
> + cpu = blk_mq_rq_cpu(rq);
> + op_type = nvme_data_dir(req_op(rq));
> + info = &per_cpu_ptr(ns->info, cpu)[op_type];
"info" is really, really confusing and generic, and not representative of
what "info" it is used for. Maybe path_lat? Or path_stats? Anything is
better than info.
> + stat = &info->stat;
> +
> + /*
> + * If latency > ~1s then ignore this sample to prevent EWMA from being
> + * skewed by pathological outliers (multi-second waits, controller
> + * timeouts etc.). This keeps path scores representative of normal
> + * performance and avoids instability from rare spikes. If such high
> + * latency is real, ANA state reporting or keep-alive error counters
> + * will mark the path unhealthy and remove it from the head node list,
> + * so we safely skip such sample here.
> + */
> + if (unlikely(latency > NSEC_PER_SEC)) {
> + stat->nr_ignored++;
> + dev_warn_ratelimited(ns->ctrl->device,
> + "ignoring sample with >1s latency (possible controller stall or timeout)\n");
> + return;
> + }
> +
> + /*
> + * Accumulate latency samples and increment the batch count for each
> + * ~15 second interval. When the interval expires, compute the simple
> + * average latency over that window, then update the smoothed (EWMA)
> + * latency. The path weight is recalculated based on this smoothed
> + * latency.
> + */
> + stat->batch += latency;
> + stat->batch_count++;
> + stat->nr_samples++;
> +
> + if (now > stat->last_weight_ts &&
> + (now - stat->last_weight_ts) >= NVME_DEFAULT_ADP_WEIGHT_TIMEOUT) {
> +
> + stat->last_weight_ts = now;
> +
> + /*
> + * Find simple average latency for the last epoch (~15 sec
> + * interval).
> + */
> + avg_lat_ns = div_u64(stat->batch, stat->batch_count);
> +
> + /*
> + * Calculate smooth/EWMA (Exponentially Weighted Moving Average)
> + * latency. EWMA is preferred over simple average latency
> + * because it smooths naturally, reduces jitter from sudden
> + * spikes, and adapts faster to changing conditions. It also
> + * avoids storing historical samples, and works well for both
> + * slow and fast I/O rates.
> + * Formula:
> + * slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
> + * With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
> + * existing latency and 1/8 (~12.5%) weight to the new latency.
> + */
> + if (unlikely(!stat->slat_ns))
> + WRITE_ONCE(stat->slat_ns, avg_lat_ns);
> + else {
> + slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
> + WRITE_ONCE(stat->slat_ns, slat_ns);
> + }
> +
> + stat->batch = stat->batch_count = 0;
> +
> + /*
> + * Defer calculation of the path weight in per-cpu workqueue.
> + */
> + schedule_work_on(cpu, &info->work.weight_work);
I'm unsure whether percpu is a good choice here. Don't you want it per
hctx at least? Workloads tend to bounce quite a bit between cpu cores, and
we have systems with hundreds of cpu cores.
> + }
> +}
> +
> void nvme_mpath_end_request(struct request *rq)
> {
> struct nvme_ns *ns = rq->q->queuedata;
> @@ -205,6 +409,9 @@ void nvme_mpath_end_request(struct request *rq)
> if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)
> atomic_dec_if_positive(&ns->ctrl->nr_active);
>
> + if (test_bit(NVME_NS_PATH_STAT, &ns->flags))
> + nvme_mpath_add_sample(rq, ns);
> +
Is doing all this work for EVERY completion really worth it?
It sounds like overkill.
> if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
> return;
> bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
> @@ -238,6 +445,62 @@ static const char *nvme_ana_state_names[] = {
> [NVME_ANA_CHANGE] = "change",
> };
>
> +static void nvme_mpath_reset_adaptive_path_stat(struct nvme_ns *ns)
> +{
> + int i, cpu;
> + struct nvme_path_stat *stat;
> +
> + for_each_possible_cpu(cpu) {
> + for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
> + stat = &per_cpu_ptr(ns->info, cpu)[i].stat;
> + memset(stat, 0, sizeof(struct nvme_path_stat));
> + }
> + }
> +}
> +
> +void nvme_mpath_cancel_adaptive_path_weight_work(struct nvme_ns *ns)
> +{
> + int i, cpu;
> + struct nvme_path_info *info;
> +
> + if (!test_bit(NVME_NS_PATH_STAT, &ns->flags))
> + return;
> +
> + for_each_online_cpu(cpu) {
> + for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
> + info = &per_cpu_ptr(ns->info, cpu)[i];
> + cancel_work_sync(&info->work.weight_work);
> + }
> + }
> +}
> +
> +static bool nvme_mpath_enable_adaptive_path_policy(struct nvme_ns *ns)
> +{
> + struct nvme_ns_head *head = ns->head;
> +
> + if (!head->disk || head->subsys->iopolicy != NVME_IOPOLICY_ADAPTIVE)
> + return false;
> +
> + if (test_and_set_bit(NVME_NS_PATH_STAT, &ns->flags))
> + return false;
> +
> + blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, ns->queue);
This is an undocumented change...
> + blk_stat_enable_accounting(ns->queue);
> + return true;
> +}
> +
> +static bool nvme_mpath_disable_adaptive_path_policy(struct nvme_ns *ns)
> +{
> +
> + if (!test_and_clear_bit(NVME_NS_PATH_STAT, &ns->flags))
> + return false;
> +
> + blk_stat_disable_accounting(ns->queue);
> + blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, ns->queue);
> + nvme_mpath_reset_adaptive_path_stat(ns);
> + return true;
> +}
> +
> bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
> {
> struct nvme_ns_head *head = ns->head;
> @@ -253,6 +516,8 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
> changed = true;
> }
> }
> + if (nvme_mpath_disable_adaptive_path_policy(ns))
> + changed = true;
I don't understand why you are setting changed here. It relates to whether
the current_path was changed, so this doesn't make sense to me.
> out:
> return changed;
> }
> @@ -271,6 +536,45 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
> srcu_read_unlock(&ctrl->srcu, srcu_idx);
> }
>
> +int nvme_alloc_ns_stat(struct nvme_ns *ns)
> +{
> + int i, cpu;
> + struct nvme_path_work *work;
> + gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
> +
> + if (!ns->head->disk)
> + return 0;
> +
> + ns->info = __alloc_percpu_gfp(NVME_NUM_STAT_GROUPS *
> + sizeof(struct nvme_path_info),
> + __alignof__(struct nvme_path_info), gfp);
> + if (!ns->info)
> + return -ENOMEM;
> +
> + for_each_possible_cpu(cpu) {
> + for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
> + work = &per_cpu_ptr(ns->info, cpu)[i].work;
> + work->ns = ns;
> + work->op_type = i;
> + INIT_WORK(&work->weight_work, nvme_mpath_weight_work);
> + }
> + }
> +
> + return 0;
> +}
> +
> +static void nvme_mpath_set_ctrl_paths(struct nvme_ctrl *ctrl)
Does this function set any ctrl paths? Your code is very confusing.
> +{
> + struct nvme_ns *ns;
> + int srcu_idx;
> +
> + srcu_idx = srcu_read_lock(&ctrl->srcu);
> + list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
> + srcu_read_lock_held(&ctrl->srcu))
> + nvme_mpath_enable_adaptive_path_policy(ns);
> + srcu_read_unlock(&ctrl->srcu, srcu_idx);
It seems like it enables the iopolicy on all ctrl namespaces.
The enable should also be more explicit, like nvme_enable_ns_lat_sampling
or something like that.
> +}
> +
> void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
> {
> struct nvme_ns_head *head = ns->head;
> @@ -283,6 +587,8 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
> srcu_read_lock_held(&head->srcu)) {
> if (capacity != get_capacity(ns->disk))
> clear_bit(NVME_NS_READY, &ns->flags);
> +
> + nvme_mpath_reset_adaptive_path_stat(ns);
> }
> srcu_read_unlock(&head->srcu, srcu_idx);
>
> @@ -407,6 +713,92 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
> return found;
> }
>
> +static inline bool nvme_state_is_live(enum nvme_ana_state state)
> +{
> + return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
> +}
> +
> +static struct nvme_ns *nvme_adaptive_path(struct nvme_ns_head *head,
> + unsigned int op_type)
> +{
> + struct nvme_ns *ns, *start, *found = NULL;
> + struct nvme_path_stat *stat;
> + u32 weight;
> + int cpu;
> +
> + cpu = get_cpu();
> + ns = *this_cpu_ptr(head->adp_path);
> + if (unlikely(!ns)) {
> + ns = list_first_or_null_rcu(&head->list,
> + struct nvme_ns, siblings);
> + if (unlikely(!ns))
> + goto out;
> + }
> +found_ns:
> + start = ns;
> + while (nvme_path_is_disabled(ns) ||
> + !nvme_state_is_live(ns->ana_state)) {
> + ns = list_next_entry_circular(ns, &head->list, siblings);
> +
> + /*
> + * If we iterate through all paths in the list but find each
> + * path in list is either disabled or dead then bail out.
> + */
> + if (ns == start)
> + goto out;
> + }
> +
> + stat = &this_cpu_ptr(ns->info)[op_type].stat;
> +
> + /*
> + * When the head path-list is singular we don't calculate the
> + * only path weight for optimization as we don't need to forward
> + * I/O to more than one path. The another possibility is whenthe
> + * path is newly added, we don't know its weight. So we go round
> + * -robin for each such path and forward I/O to it.Once we start
> + * getting response for such I/Os, the path weight calculation
> + * would kick in and then we start using path credit for
> + * forwarding I/O.
> + */
> + weight = READ_ONCE(stat->weight);
> + if (!weight) {
> + found = ns;
> + goto out;
> + }
> +
> + /*
> + * To keep path selection logic simple, we don't distinguish
> + * between ANA optimized and non-optimized states. The non-
> + * optimized path is expected to have a lower weight, and
> + * therefore fewer credits. As a result, only a small number of
> + * I/Os will be forwarded to paths in the non-optimized state.
> + */
> + if (stat->credit > 0) {
> + --stat->credit;
> + found = ns;
> + goto out;
> + } else {
> + /*
> + * Refill credit from path weight and move to next path. The
> + * refilled credit of the current path will be used next when
> + * all remainng paths exhaust its credits.
> + */
> + weight = READ_ONCE(stat->weight);
> + stat->credit = weight;
> + ns = list_next_entry_circular(ns, &head->list, siblings);
> + if (likely(ns))
> + goto found_ns;
> + }
> +out:
> + if (found) {
> + stat->sel++;
> + *this_cpu_ptr(head->adp_path) = found;
> + }
> +
> + put_cpu();
> + return found;
> +}
> +
> static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
> {
> struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
> @@ -463,9 +855,12 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
> return ns;
> }
>
> -inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
> +inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head,
> + unsigned int op_type)
> {
> switch (READ_ONCE(head->subsys->iopolicy)) {
> + case NVME_IOPOLICY_ADAPTIVE:
> + return nvme_adaptive_path(head, op_type);
> case NVME_IOPOLICY_QD:
> return nvme_queue_depth_path(head);
> case NVME_IOPOLICY_RR:
> @@ -525,7 +920,7 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
> return;
>
> srcu_idx = srcu_read_lock(&head->srcu);
> - ns = nvme_find_path(head);
> + ns = nvme_find_path(head, nvme_data_dir(bio_op(bio)));
> if (likely(ns)) {
> bio_set_dev(bio, ns->disk->part0);
> bio->bi_opf |= REQ_NVME_MPATH;
> @@ -567,7 +962,7 @@ static int nvme_ns_head_get_unique_id(struct gendisk *disk, u8 id[16],
> int srcu_idx, ret = -EWOULDBLOCK;
>
> srcu_idx = srcu_read_lock(&head->srcu);
> - ns = nvme_find_path(head);
> + ns = nvme_find_path(head, NVME_STAT_OTHER);
> if (ns)
> ret = nvme_ns_get_unique_id(ns, id, type);
> srcu_read_unlock(&head->srcu, srcu_idx);
> @@ -583,7 +978,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
> int srcu_idx, ret = -EWOULDBLOCK;
>
> srcu_idx = srcu_read_lock(&head->srcu);
> - ns = nvme_find_path(head);
> + ns = nvme_find_path(head, NVME_STAT_OTHER);
> if (ns)
> ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
> srcu_read_unlock(&head->srcu, srcu_idx);
> @@ -725,6 +1120,9 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
> INIT_WORK(&head->partition_scan_work, nvme_partition_scan_work);
> INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
> head->delayed_removal_secs = 0;
> + head->adp_path = alloc_percpu_gfp(struct nvme_ns*, GFP_KERNEL);
> + if (!head->adp_path)
> + return -ENOMEM;
>
> /*
> * If "multipath_always_on" is enabled, a multipath node is added
> @@ -809,6 +1207,10 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
> }
> mutex_unlock(&head->lock);
>
> + mutex_lock(&nvme_subsystems_lock);
> + nvme_mpath_enable_adaptive_path_policy(ns);
> + mutex_unlock(&nvme_subsystems_lock);
> +
> synchronize_srcu(&head->srcu);
> kblockd_schedule_work(&head->requeue_work);
> }
> @@ -857,11 +1259,6 @@ static int nvme_parse_ana_log(struct nvme_ctrl *ctrl, void *data,
> return 0;
> }
>
> -static inline bool nvme_state_is_live(enum nvme_ana_state state)
> -{
> - return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
> -}
> -
> static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
> struct nvme_ns *ns)
> {
> @@ -1039,10 +1436,12 @@ static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys,
>
> WRITE_ONCE(subsys->iopolicy, iopolicy);
>
> - /* iopolicy changes clear the mpath by design */
> + /* iopolicy changes clear/reset the mpath by design */
> mutex_lock(&nvme_subsystems_lock);
> list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
> nvme_mpath_clear_ctrl_paths(ctrl);
> + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
> + nvme_mpath_set_ctrl_paths(ctrl);
> mutex_unlock(&nvme_subsystems_lock);
>
> pr_notice("subsysnqn %s iopolicy changed from %s to %s\n",
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 102fae6a231c..715c7053054c 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -28,7 +28,10 @@ extern unsigned int nvme_io_timeout;
> extern unsigned int admin_timeout;
> #define NVME_ADMIN_TIMEOUT (admin_timeout * HZ)
>
> -#define NVME_DEFAULT_KATO 5
> +#define NVME_DEFAULT_KATO 5
> +
> +#define NVME_DEFAULT_ADP_EWMA_SHIFT 3
> +#define NVME_DEFAULT_ADP_WEIGHT_TIMEOUT (15 * NSEC_PER_SEC)
You need these defines outside of nvme-mpath?
>
> #ifdef CONFIG_ARCH_NO_SG_CHAIN
> #define NVME_INLINE_SG_CNT 0
> @@ -421,6 +424,7 @@ enum nvme_iopolicy {
> NVME_IOPOLICY_NUMA,
> NVME_IOPOLICY_RR,
> NVME_IOPOLICY_QD,
> + NVME_IOPOLICY_ADAPTIVE,
> };
>
> struct nvme_subsystem {
> @@ -459,6 +463,37 @@ struct nvme_ns_ids {
> u8 csi;
> };
>
> +enum nvme_stat_group {
> + NVME_STAT_READ,
> + NVME_STAT_WRITE,
> + NVME_STAT_OTHER,
> + NVME_NUM_STAT_GROUPS
> +};
I see you have stats per I/O direction. However, you don't have them per I/O
size. I wonder how this plays into this iopolicy.
> +
> +struct nvme_path_stat {
> + u64 nr_samples; /* total num of samples processed */
> + u64 nr_ignored; /* num. of samples ignored */
> + u64 slat_ns; /* smoothed (ewma) latency in nanoseconds */
> + u64 score; /* score used for weight calculation */
> + u64 last_weight_ts; /* timestamp of the last weight calculation */
> + u64 sel; /* num of times this path is selcted for I/O */
> + u64 batch; /* accumulated latency sum for current window */
> + u32 batch_count; /* num of samples accumulated in current window */
> + u32 weight; /* path weight */
> + u32 credit; /* path credit for I/O forwarding */
> +};
I'm still not convinced that having this be per-cpu-per-ns really makes
sense.
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2025-12-12 13:04 ` Sagi Grimberg
@ 2025-12-13 7:27 ` Nilay Shroff
2025-12-15 23:36 ` Sagi Grimberg
0 siblings, 1 reply; 28+ messages in thread
From: Nilay Shroff @ 2025-12-13 7:27 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme
Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce
On 12/12/25 6:34 PM, Sagi Grimberg wrote:
>
>
> On 05/11/2025 12:33, Nilay Shroff wrote:
>> This commit introduces a new I/O policy named "adaptive". Users can
>> configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
>> subsystemX/iopolicy"
>>
>> The adaptive policy dynamically distributes I/O based on measured
>> completion latency. The main idea is to calculate latency for each path,
>> derive a weight, and then proportionally forward I/O according to those
>> weights.
>>
>> To ensure scalability, path latency is measured per-CPU. Each CPU
>> maintains its own statistics, and I/O forwarding uses these per-CPU
>> values.
>
> So a given cpu would select path-a vs. another cpu that may select path-b?
> How does that play with less queues than cpu cores? what happens to cores
> that have low traffic?
>
The path-selection logic does not depend on the relationship between the number
of CPUs and the number of hardware queues. It simply selects a path based on the
per-CPU path score/credit, which reflects the relative performance of each available
path.
For example, assume we have two paths (A and B) to the same shared namespace.
For each CPU, we maintain a smoothed latency estimate for every path. From these
latency values we derive a per-path score or credit. The credit represents the relative
share of I/O that each path should receive: a path with lower observed latency gets more
credit, and a path with higher latency gets less.
I/O distribution is thus governed directly by the available credits on that CPU. When the
NVMe multipath driver performs path selection, it chooses the path with sufficient credits,
updates the bio’s bdev to correspond to that path, and submits the bio. Only after this
point does the block layer map the bio to an hctx through the usual ctx->hctx mapping (i.e.,
matching the issuing CPU to the appropriate hardware queue). In other words, the multipath
policy runs above the block-layer queueing logic, and the number of hardware queues does
not affect how paths are scored or selected.
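To make the credit mechanism concrete, here is a minimal userspace sketch of the selection loop as seen by one CPU. It is a simplified model and not the patch code: struct path_model, select_path() and the array-based path list are illustrative assumptions, and SRCU, per-CPU accessors and ANA state are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one CPU's per-path state (names are illustrative). */
struct path_model {
	uint32_t weight;	/* refill amount; 0 means "not measured yet" */
	uint32_t credit;	/* I/Os this path may still take before a refill */
	int live;		/* nonzero if the path is usable */
};

/*
 * Select a path starting at index *cur. Returns the selected index and
 * stores it in *cur, or returns -1 if no path is live.
 */
int select_path(struct path_model *p, size_t n, size_t *cur)
{
	size_t i, step;

	if (n == 0)
		return -1;
	i = *cur % n;

	/*
	 * Two passes are enough: the first pass may only refill empty
	 * paths, and the second pass then finds one with credit.
	 */
	for (step = 0; step < 2 * n; step++, i = (i + 1) % n) {
		if (!p[i].live)
			continue;
		if (p[i].weight == 0) {
			/* Unmeasured path: send I/O so latency samples arrive. */
			*cur = i;
			return (int)i;
		}
		if (p[i].credit > 0) {
			p[i].credit--;
			*cur = i;
			return (int)i;
		}
		/* Out of credit: refill from the weight, try the next path. */
		p[i].credit = p[i].weight;
	}
	return -1;
}
```

With two live paths of weight 3 and 1, eight consecutive selections in this model send six I/Os to the first path and two to the second, i.e. the 3:1 share the weights ask for. Nothing in the loop refers to hardware queues.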
>> Every ~15 seconds, a simple average latency of per-CPU batched
>> samples are computed and fed into an Exponentially Weighted Moving
>> Average (EWMA):
>
> I suggest to have iopolicy name reflect ewma. maybe "ewma-lat"?
Okay that sounds good! Shall we name it "ewma-lat" or "weighted-lat"?
>
>>
>> avg_latency = div_u64(batch, batch_count);
>> new_ewma_latency = (prev_ewma_latency * (WEIGHT-1) + avg_latency)/WEIGHT
>>
>> With WEIGHT = 8, this assigns 7/8 (~87.5%) weight to the previous
>> latency value and 1/8 (~12.5%) to the most recent latency. This
>> smoothing reduces jitter, adapts quickly to changing conditions,
>> avoids storing historical samples, and works well for both low and
>> high I/O rates.
>
> This weight was based on empirical measurements?
>
Yes, correct, and so we also allow the user to configure WEIGHT if needed.
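For reference, the smoothing step quoted above can be sketched in plain C as follows. This is only an illustration of the formula with WEIGHT = 8 expressed as a shift of 3; the function names window_average() and ewma_step() are assumptions made for this example.

```c
#include <stdint.h>

#define EWMA_SHIFT 3	/* WEIGHT = 1 << EWMA_SHIFT = 8 */

/* Simple average of the samples batched during one window. */
uint64_t window_average(uint64_t batch_sum, uint32_t batch_count)
{
	return batch_count ? batch_sum / batch_count : 0;
}

/* new = (old * (WEIGHT - 1) + avg) / WEIGHT; the divide is a right shift. */
uint64_t ewma_step(uint64_t old_ewma, uint64_t avg)
{
	return (old_ewma * ((1ULL << EWMA_SHIFT) - 1) + avg) >> EWMA_SHIFT;
}
```

For example, if the smoothed latency is 800 ns and the next window averages 1600 ns, one step moves the estimate to 900 ns, so a single noisy window shifts the estimate by only one eighth of the difference.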
>> Path weights are then derived from the smoothed (EWMA)
>> latency as follows (example with two paths A and B):
>>
>> path_A_score = NSEC_PER_SEC / path_A_ewma_latency
>> path_B_score = NSEC_PER_SEC / path_B_ewma_latency
>> total_score = path_A_score + path_B_score
>>
>> path_A_weight = (path_A_score * 100) / total_score
>> path_B_weight = (path_B_score * 100) / total_score
>
> What happens to R/W mixed workloads? What happens when the I/O pattern
> has a distribution of block sizes?
>
We maintain separate metrics for READ and WRITE traffic, and during path
selection we use the appropriate metric depending on the I/O type.
Regarding block-size variability: the current implementation does not yet
account for I/O size. This is an important point, thank you for raising it.
I discussed this today with Hannes at LPC, and we agreed that a practical
approach is to normalize latency per 512-byte block. For our purposes, we
do not need an exact latency value; a relative latency metric is sufficient,
as it ultimately feeds into path scoring. A path with higher latency ends up
with a lower score, and a path with lower latency gets a higher score. The
exact absolute values are less important than maintaining consistent proportional
relationships.
Normalizing latency per 512 bytes gives us a stable, size-aware metric that scales
across different I/O block sizes. Normalizing a latency number per 512-byte block
is easy to do, and I will implement it in the next patch version.
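As a rough idea of what the per-512-byte normalization could look like (a userspace sketch only, not code from any posted patch version; the helper name latency_per_block() and the choice to round partial blocks up are assumptions):

```c
#include <stdint.h>

#define BLOCK_BYTES 512ULL

/*
 * Scale a completion latency by the I/O size so that small and large
 * requests feed comparable values into the path score.
 */
uint64_t latency_per_block(uint64_t latency_ns, uint64_t io_bytes)
{
	/* Round partial blocks up (an assumption of this sketch). */
	uint64_t blocks = (io_bytes + BLOCK_BYTES - 1) / BLOCK_BYTES;

	if (blocks == 0)
		blocks = 1;	/* zero-length commands count as one block */
	return latency_ns / blocks;
}
```

With this, a 4 KiB I/O that completes in 800 us and a 512-byte I/O that completes in 100 us both contribute 100 us per block, so neither makes its path look slower than the other.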
> I think that in order to understand how a non-trivial path selector works we need
> thorough testing in a variety of I/O patterns.
>
Yes, that was done by running fio with different I/O engines, I/O types (read, write, r/w) and
different block sizes. I tested it using NVMe pcie and nvmf-tcp. The tests were performed
for both direct and buffered I/O. I also ran blktests with the adaptive I/O policy configured.
Still, if you would like me to run anything further, let me know.
>>
>> where:
>> - path_X_ewma_latency is the smoothed latency of a path in nanoseconds
>> - NSEC_PER_SEC is used as a scaling factor since valid latencies
>> are < 1 second
>> - weights are normalized to a 0–64 scale across all paths.
>>
>> Path credits are refilled based on this weight, with one credit
>> consumed per I/O. When all credits are consumed, the credits are
>> refilled again based on the current weight. This ensures that I/O is
>> distributed across paths proportionally to their calculated weight.
>>
>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>> drivers/nvme/host/core.c | 15 +-
>> drivers/nvme/host/ioctl.c | 31 ++-
>> drivers/nvme/host/multipath.c | 425 ++++++++++++++++++++++++++++++++--
>> drivers/nvme/host/nvme.h | 74 +++++-
>> drivers/nvme/host/pr.c | 6 +-
>> drivers/nvme/host/sysfs.c | 2 +-
>> 6 files changed, 530 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index fa4181d7de73..47f375c63d2d 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -672,6 +672,9 @@ static void nvme_free_ns_head(struct kref *ref)
>> cleanup_srcu_struct(&head->srcu);
>> nvme_put_subsystem(head->subsys);
>> kfree(head->plids);
>> +#ifdef CONFIG_NVME_MULTIPATH
>> + free_percpu(head->adp_path);
>> +#endif
>> kfree(head);
>> }
>> @@ -689,6 +692,7 @@ static void nvme_free_ns(struct kref *kref)
>> {
>> struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
>> + nvme_free_ns_stat(ns);
>> put_disk(ns->disk);
>> nvme_put_ns_head(ns->head);
>> nvme_put_ctrl(ns->ctrl);
>> @@ -4137,6 +4141,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>> if (nvme_init_ns_head(ns, info))
>> goto out_cleanup_disk;
>> + if (nvme_alloc_ns_stat(ns))
>> + goto out_unlink_ns;
>> +
>> /*
>> * If multipathing is enabled, the device name for all disks and not
>> * just those that represent shared namespaces needs to be based on the
>> @@ -4161,7 +4168,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>> }
>> if (nvme_update_ns_info(ns, info))
>> - goto out_unlink_ns;
>> + goto out_free_ns_stat;
>> mutex_lock(&ctrl->namespaces_lock);
>> /*
>> @@ -4170,7 +4177,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>> */
>> if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
>> mutex_unlock(&ctrl->namespaces_lock);
>> - goto out_unlink_ns;
>> + goto out_free_ns_stat;
>> }
>> nvme_ns_add_to_ctrl_list(ns);
>> mutex_unlock(&ctrl->namespaces_lock);
>> @@ -4201,6 +4208,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>> list_del_rcu(&ns->list);
>> mutex_unlock(&ctrl->namespaces_lock);
>> synchronize_srcu(&ctrl->srcu);
>> +out_free_ns_stat:
>> + nvme_free_ns_stat(ns);
>> out_unlink_ns:
>> mutex_lock(&ctrl->subsys->lock);
>> list_del_rcu(&ns->siblings);
>> @@ -4244,6 +4253,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>> */
>> synchronize_srcu(&ns->head->srcu);
>> + nvme_mpath_cancel_adaptive_path_weight_work(ns);
>> +
>
> I personally think that the check on path stats should be done in the call-site
> and not in the function itself.
Hmm, can you please elaborate on this point further? I am not sure I understand
your point here.
>
>> /* wait for concurrent submissions */
>> if (nvme_mpath_clear_current_path(ns))
>> synchronize_srcu(&ns->head->srcu);
>> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
>> index c212fa952c0f..759d147d9930 100644
>> --- a/drivers/nvme/host/ioctl.c
>> +++ b/drivers/nvme/host/ioctl.c
>> @@ -700,18 +700,29 @@ static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
>> int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>> unsigned int cmd, unsigned long arg)
>> {
>> + u8 opcode;
>> struct nvme_ns_head *head = bdev->bd_disk->private_data;
>> bool open_for_write = mode & BLK_OPEN_WRITE;
>> void __user *argp = (void __user *)arg;
>> struct nvme_ns *ns;
>> int srcu_idx, ret = -EWOULDBLOCK;
>> unsigned int flags = 0;
>> + unsigned int op_type = NVME_STAT_OTHER;
>> if (bdev_is_partition(bdev))
>> flags |= NVME_IOCTL_PARTITION;
>> + if (cmd == NVME_IOCTL_SUBMIT_IO) {
>> + if (get_user(opcode, (u8 *)argp))
>> + return -EFAULT;
>> + if (opcode == nvme_cmd_write)
>> + op_type = NVME_STAT_WRITE;
>> + else if (opcode == nvme_cmd_read)
>> + op_type = NVME_STAT_READ;
>> + }
>> +
>> srcu_idx = srcu_read_lock(&head->srcu);
>> - ns = nvme_find_path(head);
>> + ns = nvme_find_path(head, op_type);
>
> Perhaps it would be easier to review if you split passing opcode to nvme_find_path()
> to a prep patch (explaining that the new iopolicy will leverage it)
>
Sure, makes sense. I'll split this into a prep patch as you suggested.
>> if (!ns)
>> goto out_unlock;
>> @@ -733,6 +744,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>> long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>> unsigned long arg)
>> {
>> + u8 opcode;
>> bool open_for_write = file->f_mode & FMODE_WRITE;
>> struct cdev *cdev = file_inode(file)->i_cdev;
>> struct nvme_ns_head *head =
>> @@ -740,9 +752,19 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>> void __user *argp = (void __user *)arg;
>> struct nvme_ns *ns;
>> int srcu_idx, ret = -EWOULDBLOCK;
>> + unsigned int op_type = NVME_STAT_OTHER;
>> +
>> + if (cmd == NVME_IOCTL_SUBMIT_IO) {
>> + if (get_user(opcode, (u8 *)argp))
>> + return -EFAULT;
>> + if (opcode == nvme_cmd_write)
>> + op_type = NVME_STAT_WRITE;
>> + else if (opcode == nvme_cmd_read)
>> + op_type = NVME_STAT_READ;
>> + }
>> srcu_idx = srcu_read_lock(&head->srcu);
>> - ns = nvme_find_path(head);
>> + ns = nvme_find_path(head, op_type);
>> if (!ns)
>> goto out_unlock;
>> @@ -762,7 +784,10 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
>> struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>> struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
>> int srcu_idx = srcu_read_lock(&head->srcu);
>> - struct nvme_ns *ns = nvme_find_path(head);
>> + const struct nvme_uring_cmd *cmd = io_uring_sqe_cmd(ioucmd->sqe);
>> + struct nvme_ns *ns = nvme_find_path(head,
>> + READ_ONCE(cmd->opcode) & 1 ?
>> + NVME_STAT_WRITE : NVME_STAT_READ);
>> int ret = -EINVAL;
>> if (ns)
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 543e17aead12..55dc28375662 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -6,6 +6,9 @@
>> #include <linux/backing-dev.h>
>> #include <linux/moduleparam.h>
>> #include <linux/vmalloc.h>
>> +#include <linux/blk-mq.h>
>> +#include <linux/math64.h>
>> +#include <linux/rculist.h>
>> #include <trace/events/block.h>
>> #include "nvme.h"
>> @@ -66,9 +69,10 @@ MODULE_PARM_DESC(multipath_always_on,
>> "create multipath node always except for private namespace with non-unique nsid; note that this also implicitly enables native multipath support");
>> static const char *nvme_iopolicy_names[] = {
>> - [NVME_IOPOLICY_NUMA] = "numa",
>> - [NVME_IOPOLICY_RR] = "round-robin",
>> - [NVME_IOPOLICY_QD] = "queue-depth",
>> + [NVME_IOPOLICY_NUMA] = "numa",
>> + [NVME_IOPOLICY_RR] = "round-robin",
>> + [NVME_IOPOLICY_QD] = "queue-depth",
>> + [NVME_IOPOLICY_ADAPTIVE] = "adaptive",
>> };
>> static int iopolicy = NVME_IOPOLICY_NUMA;
>> @@ -83,6 +87,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>> iopolicy = NVME_IOPOLICY_RR;
>> else if (!strncmp(val, "queue-depth", 11))
>> iopolicy = NVME_IOPOLICY_QD;
>> + else if (!strncmp(val, "adaptive", 8))
>> + iopolicy = NVME_IOPOLICY_ADAPTIVE;
>> else
>> return -EINVAL;
>> @@ -198,6 +204,204 @@ void nvme_mpath_start_request(struct request *rq)
>> }
>> EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
>> +static void nvme_mpath_weight_work(struct work_struct *weight_work)
>> +{
>> + int cpu, srcu_idx;
>> + u32 weight;
>> + struct nvme_ns *ns;
>> + struct nvme_path_stat *stat;
>> + struct nvme_path_work *work = container_of(weight_work,
>> + struct nvme_path_work, weight_work);
>> + struct nvme_ns_head *head = work->ns->head;
>> + int op_type = work->op_type;
>> + u64 total_score = 0;
>> +
>> + cpu = get_cpu();
>> +
>> + srcu_idx = srcu_read_lock(&head->srcu);
>> + list_for_each_entry_srcu(ns, &head->list, siblings,
>> + srcu_read_lock_held(&head->srcu)) {
>> +
>> + stat = &this_cpu_ptr(ns->info)[op_type].stat;
>> + if (!READ_ONCE(stat->slat_ns)) {
>> + stat->score = 0;
>> + continue;
>> + }
>> + /*
>> + * Compute the path score as the inverse of smoothed
>> + * latency, scaled by NSEC_PER_SEC. Floating point
>> + * math is unavailable in the kernel, so fixed-point
>> + * scaling is used instead. NSEC_PER_SEC is chosen
>> + * because valid latencies are always < 1 second; longer
>> + * latencies are ignored.
>> + */
>> + stat->score = div_u64(NSEC_PER_SEC, READ_ONCE(stat->slat_ns));
>> +
>> + /* Compute total score. */
>> + total_score += stat->score;
>> + }
>> +
>> + if (!total_score)
>> + goto out;
>> +
>> + /*
>> + * After computing the total slatency, we derive per-path weight
>> + * (normalized to the range 0–64). The weight represents the
>> + * relative share of I/O the path should receive.
>> + *
>> + * - lower smoothed latency -> higher weight
>> + * - higher smoothed slatency -> lower weight
>> + *
>> + * Next, while forwarding I/O, we assign "credits" to each path
>> + * based on its weight (please also refer nvme_adaptive_path()):
>> + * - Initially, credits = weight.
>> + * - Each time an I/O is dispatched on a path, its credits are
>> + * decremented proportionally.
>> + * - When a path runs out of credits, it becomes temporarily
>> + * ineligible until credit is refilled.
>> + *
>> + * I/O distribution is therefore governed by available credits,
>> + * ensuring that over time the proportion of I/O sent to each
>> + * path matches its weight (and thus its performance).
>> + */
>> + list_for_each_entry_srcu(ns, &head->list, siblings,
>> + srcu_read_lock_held(&head->srcu)) {
>> +
>> + stat = &this_cpu_ptr(ns->info)[op_type].stat;
>> + weight = div_u64(stat->score * 64, total_score);
>> +
>> + /*
>> + * Ensure the path weight never drops below 1. A weight
>> + * of 0 is used only for newly added paths. During
>> + * bootstrap, a few I/Os are sent to such paths to
>> + * establish an initial weight. Enforcing a minimum
>> + * weight of 1 guarantees that no path is forgotten and
>> + * that each path is probed at least occasionally.
>> + */
>> + if (!weight)
>> + weight = 1;
>> +
>> + WRITE_ONCE(stat->weight, weight);
>> + }
>> +out:
>> + srcu_read_unlock(&head->srcu, srcu_idx);
>> + put_cpu();
>> +}
>> +
>> +/*
>> + * Formula to calculate the EWMA (Exponentially Weighted Moving Average):
>> + * ewma = (old_ewma * (EWMA_SHIFT - 1) + (EWMA_SHIFT)) / EWMA_SHIFT
>> + * For instance, with EWMA_SHIFT = 3, this assigns 7/8 (~87.5 %) weight to
>> + * the existing/old ewma and 1/8 (~12.5%) weight to the new sample.
>> + */
>> +static inline u64 ewma_update(u64 old, u64 new)
>
> it is a calculation function, lets call it calc_ewma_update
Yeah, I will do this in the next patch version.
>> +{
>> + return (old * ((1 << NVME_DEFAULT_ADP_EWMA_SHIFT) - 1)
>> + + new) >> NVME_DEFAULT_ADP_EWMA_SHIFT;
>> +}
>> +
>> +static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
>> +{
>> + int cpu;
>> + unsigned int op_type;
>> + struct nvme_path_info *info;
>> + struct nvme_path_stat *stat;
>> + u64 now, latency, slat_ns, avg_lat_ns;
>> + struct nvme_ns_head *head = ns->head;
>> +
>> + if (list_is_singular(&head->list))
>> + return;
>> +
>> + now = ktime_get_ns();
>> + latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
>> + if (!latency)
>> + return;
>> +
>> + /*
>> + * As completion code path is serialized(i.e. no same completion queue
>> + * update code could run simultaneously on multiple cpu) we can safely
>> + * access per cpu nvme path stat here from another cpu (in case the
>> + * completion cpu is different from submission cpu).
>> + * The only field which could be accessed simultaneously here is the
>> + * path ->weight which may be accessed by this function as well as I/O
>> + * submission path during path selection logic and we protect ->weight
>> + * using READ_ONCE/WRITE_ONCE. Yes this may not be 100% accurate but
>> + * we also don't need to be so accurate here as the path credit would
>> + * be anyways refilled, based on path weight, once path consumes all
>> + * its credits. And we limit path weight/credit max up to 100. Please
>> + * also refer nvme_adaptive_path().
>> + */
>> + cpu = blk_mq_rq_cpu(rq);
>> + op_type = nvme_data_dir(req_op(rq));
>> + info = &per_cpu_ptr(ns->info, cpu)[op_type];
>
> info is really really really confusing and generic and not representative of what
> "info" it is used for. maybe path_lat? or path_stats? anything is better than info.
>
Maybe I am just used to this code and so I never noticed it. But yes, agreed, I
will rename it to @path_lat.
>> + stat = &info->stat;
>> +
>> + /*
>> + * If latency > ~1s then ignore this sample to prevent EWMA from being
>> + * skewed by pathological outliers (multi-second waits, controller
>> + * timeouts etc.). This keeps path scores representative of normal
>> + * performance and avoids instability from rare spikes. If such high
>> + * latency is real, ANA state reporting or keep-alive error counters
>> + * will mark the path unhealthy and remove it from the head node list,
>> + * so we safely skip such sample here.
>> + */
>> + if (unlikely(latency > NSEC_PER_SEC)) {
>> + stat->nr_ignored++;
>> + dev_warn_ratelimited(ns->ctrl->device,
>> + "ignoring sample with >1s latency (possible controller stall or timeout)\n");
>> + return;
>> + }
>> +
>> + /*
>> + * Accumulate latency samples and increment the batch count for each
>> + * ~15 second interval. When the interval expires, compute the simple
>> + * average latency over that window, then update the smoothed (EWMA)
>> + * latency. The path weight is recalculated based on this smoothed
>> + * latency.
>> + */
>> + stat->batch += latency;
>> + stat->batch_count++;
>> + stat->nr_samples++;
>> +
>> + if (now > stat->last_weight_ts &&
>> + (now - stat->last_weight_ts) >= NVME_DEFAULT_ADP_WEIGHT_TIMEOUT) {
>> +
>> + stat->last_weight_ts = now;
>> +
>> + /*
>> + * Find simple average latency for the last epoch (~15 sec
>> + * interval).
>> + */
>> + avg_lat_ns = div_u64(stat->batch, stat->batch_count);
>> +
>> + /*
>> + * Calculate smooth/EWMA (Exponentially Weighted Moving Average)
>> + * latency. EWMA is preferred over simple average latency
>> + * because it smooths naturally, reduces jitter from sudden
>> + * spikes, and adapts faster to changing conditions. It also
>> + * avoids storing historical samples, and works well for both
>> + * slow and fast I/O rates.
>> + * Formula:
>> + * slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
>> + * With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
>> + * existing latency and 1/8 (~12.5%) weight to the new latency.
>> + */
>> + if (unlikely(!stat->slat_ns))
>> + WRITE_ONCE(stat->slat_ns, avg_lat_ns);
>> + else {
>> + slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
>> + WRITE_ONCE(stat->slat_ns, slat_ns);
>> + }
>> +
>> + stat->batch = stat->batch_count = 0;
>> +
>> + /*
>> + * Defer calculation of the path weight in per-cpu workqueue.
>> + */
>> + schedule_work_on(cpu, &info->work.weight_work);
>
> I'm unsure if the percpu is a good choice here. Don't you want it per hctx at least?
> workloads tend to bounce quite a bit between cpu cores... we have systems with hundreds of
> cpu cores.
As I explained earlier, in the NVMe multipath driver code we don't know the hctx
when we choose a path. The ctx to hctx mapping happens later in the block layer while
submitting the bio. Here we calculate the path score per-cpu and utilize it while
choosing the path to forward I/O to.
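To illustrate how one CPU's smoothed latencies turn into path weights, here is a small userspace sketch of the score and weight arithmetic described in the commit message and the code comments (score = NSEC_PER_SEC / latency, weight scaled to 64, minimum weight of 1). The function name compute_weights() and the plain arrays are assumptions for this example; the patch itself walks the head's sibling list under SRCU.

```c
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define WEIGHT_SCALE 64

/*
 * slat_ns[i] is the smoothed latency of path i on this CPU (0 = no data).
 * weight[i] receives the path's share of I/O on a 1..64 scale.
 */
void compute_weights(const uint64_t *slat_ns, uint32_t *weight, size_t n)
{
	uint64_t total = 0;
	size_t i;

	/* Score is the inverse of latency, in fixed point. */
	for (i = 0; i < n; i++)
		if (slat_ns[i])
			total += NSEC_PER_SEC / slat_ns[i];

	if (!total)
		return;		/* nothing measured yet, keep old weights */

	for (i = 0; i < n; i++) {
		uint64_t score = slat_ns[i] ? NSEC_PER_SEC / slat_ns[i] : 0;
		uint32_t w = (uint32_t)(score * WEIGHT_SCALE / total);

		/* Never drop below 1 so that every path is still probed. */
		weight[i] = w ? w : 1;
	}
}
```

For two paths with smoothed latencies of 1 ms and 3 ms, the scores are 1000 and 333, which gives weights 48 and 15. A path at 100 us next to one at 30 ms ends up at 63 versus the minimum of 1.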
>
>> + }
>> +}
>> +
>> void nvme_mpath_end_request(struct request *rq)
>> {
>> struct nvme_ns *ns = rq->q->queuedata;
>> @@ -205,6 +409,9 @@ void nvme_mpath_end_request(struct request *rq)
>> if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)
>> atomic_dec_if_positive(&ns->ctrl->nr_active);
>> + if (test_bit(NVME_NS_PATH_STAT, &ns->flags))
>> + nvme_mpath_add_sample(rq, ns);
>> +
>
> Doing all this work for EVERY completion is really worth it?
> sounds kinda like an overkill.
We don't really do much in nvme_mpath_add_sample() other than
adding the latency sample to the batch. The real work, where we calculate
the path score, is done once every ~15 seconds, and that is done
in a separate workqueue. So we don't do any heavy lifting here during
I/O completion processing.
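The split between the per-completion work and the once-per-window work can be modelled roughly like this. It is a userspace sketch under simplifying assumptions: struct lat_window and add_sample() are illustrative names, the state is a single struct rather than per-CPU data, and the return value stands in for scheduling the weight work.

```c
#include <stdint.h>

#define NSEC_PER_SEC_ULL 1000000000ULL
#define WINDOW_NS (15 * NSEC_PER_SEC_ULL)

/* Hypothetical per-path, per-CPU sampling state. */
struct lat_window {
	uint64_t batch;		/* latency sum for the current window */
	uint32_t batch_count;	/* samples in the current window */
	uint64_t last_ts;	/* time of the last window fold */
	uint64_t slat_ns;	/* smoothed (EWMA) latency */
};

/* Returns 1 when the window expired and weights should be recomputed. */
int add_sample(struct lat_window *w, uint64_t now_ns, uint64_t latency_ns)
{
	uint64_t avg;

	/* Skip empty samples and outliers above one second. */
	if (latency_ns == 0 || latency_ns > NSEC_PER_SEC_ULL)
		return 0;

	/* Per-completion cost: two additions and a timestamp compare. */
	w->batch += latency_ns;
	w->batch_count++;
	if (now_ns <= w->last_ts || now_ns - w->last_ts < WINDOW_NS)
		return 0;

	/* Once per window: fold the average into the EWMA (7/8 old, 1/8 new). */
	w->last_ts = now_ns;
	avg = w->batch / w->batch_count;
	w->slat_ns = w->slat_ns ? (w->slat_ns * 7 + avg) >> 3 : avg;
	w->batch = 0;
	w->batch_count = 0;
	return 1;
}
```

In this model the common case returns after a couple of arithmetic operations; the division and the EWMA update run only when a window has expired.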
>
>> if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
>> return;
>> bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
>> @@ -238,6 +445,62 @@ static const char *nvme_ana_state_names[] = {
>> [NVME_ANA_CHANGE] = "change",
>> };
>> +static void nvme_mpath_reset_adaptive_path_stat(struct nvme_ns *ns)
>> +{
>> + int i, cpu;
>> + struct nvme_path_stat *stat;
>> +
>> + for_each_possible_cpu(cpu) {
>> + for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>> + stat = &per_cpu_ptr(ns->info, cpu)[i].stat;
>> + memset(stat, 0, sizeof(struct nvme_path_stat));
>> + }
>> + }
>> +}
>> +
>> +void nvme_mpath_cancel_adaptive_path_weight_work(struct nvme_ns *ns)
>> +{
>> + int i, cpu;
>> + struct nvme_path_info *info;
>> +
>> + if (!test_bit(NVME_NS_PATH_STAT, &ns->flags))
>> + return;
>> +
>> + for_each_online_cpu(cpu) {
>> + for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>> + info = &per_cpu_ptr(ns->info, cpu)[i];
>> + cancel_work_sync(&info->work.weight_work);
>> + }
>> + }
>> +}
>> +
>> +static bool nvme_mpath_enable_adaptive_path_policy(struct nvme_ns *ns)
>> +{
>> + struct nvme_ns_head *head = ns->head;
>> +
>> + if (!head->disk || head->subsys->iopolicy != NVME_IOPOLICY_ADAPTIVE)
>> + return false;
>> +
>> + if (test_and_set_bit(NVME_NS_PATH_STAT, &ns->flags))
>> + return false;
>> +
>> + blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, ns->queue);
>
> This is an undocumented change...
Sure, I will add a comment to this code in the next patch version.
>
>> + blk_stat_enable_accounting(ns->queue);
>> + return true;
>> +}
>> +
>> +static bool nvme_mpath_disable_adaptive_path_policy(struct nvme_ns *ns)
>> +{
>> +
>> + if (!test_and_clear_bit(NVME_NS_PATH_STAT, &ns->flags))
>> + return false;
>> +
>> + blk_stat_disable_accounting(ns->queue);
>> + blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, ns->queue);
>> + nvme_mpath_reset_adaptive_path_stat(ns);
>> + return true;
>> +}
>> +
>> bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
>> {
>> struct nvme_ns_head *head = ns->head;
>> @@ -253,6 +516,8 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
>> changed = true;
>> }
>> }
>> + if (nvme_mpath_disable_adaptive_path_policy(ns))
>> + changed = true;
>
> Don't understand why you are setting changed here? it relates to of the current_path
> was changed. doesn't make sense to me.
>
I think you are correct. We don't have any RCU update here for the adaptive path,
so I will remove this.
>> out:
>> return changed;
>> }
>> @@ -271,6 +536,45 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
>> srcu_read_unlock(&ctrl->srcu, srcu_idx);
>> }
>> +int nvme_alloc_ns_stat(struct nvme_ns *ns)
>> +{
>> + int i, cpu;
>> + struct nvme_path_work *work;
>> + gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
>> +
>> + if (!ns->head->disk)
>> + return 0;
>> +
>> + ns->info = __alloc_percpu_gfp(NVME_NUM_STAT_GROUPS *
>> + sizeof(struct nvme_path_info),
>> + __alignof__(struct nvme_path_info), gfp);
>> + if (!ns->info)
>> + return -ENOMEM;
>> +
>> + for_each_possible_cpu(cpu) {
>> + for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>> + work = &per_cpu_ptr(ns->info, cpu)[i].work;
>> + work->ns = ns;
>> + work->op_type = i;
>> + INIT_WORK(&work->weight_work, nvme_mpath_weight_work);
>> + }
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static void nvme_mpath_set_ctrl_paths(struct nvme_ctrl *ctrl)
>
> Does this function set any ctrl paths? your code is very confusing.
>
Here "ctrl paths" means that we iterate through each of the controller's
namespace paths and set/enable the adaptive path parameters required for each
path. Moreover, we already have the corresponding nvme_mpath_clear_ctrl_paths()
function, which resets/clears the per-path parameters when the I/O policy
changes.
>> +{
>> + struct nvme_ns *ns;
>> + int srcu_idx;
>> +
>> + srcu_idx = srcu_read_lock(&ctrl->srcu);
>> + list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
>> + srcu_read_lock_held(&ctrl->srcu))
>> + nvme_mpath_enable_adaptive_path_policy(ns);
>> + srcu_read_unlock(&ctrl->srcu, srcu_idx);
>
> seems like it enables the iopolicy on all ctrl namespaces.
> the enable should also be more explicit like:
> nvme_enable_ns_lat_sampling or something like that.
>
okay, I'll rename it to the appropriate function name, as you suggested.
>> +}
>> +
>> void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>> {
>> struct nvme_ns_head *head = ns->head;
>> @@ -283,6 +587,8 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>> srcu_read_lock_held(&head->srcu)) {
>> if (capacity != get_capacity(ns->disk))
>> clear_bit(NVME_NS_READY, &ns->flags);
>> +
>> + nvme_mpath_reset_adaptive_path_stat(ns);
>> }
>> srcu_read_unlock(&head->srcu, srcu_idx);
>> @@ -407,6 +713,92 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
>> return found;
>> }
>> +static inline bool nvme_state_is_live(enum nvme_ana_state state)
>> +{
>> + return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
>> +}
>> +
>> +static struct nvme_ns *nvme_adaptive_path(struct nvme_ns_head *head,
>> + unsigned int op_type)
>> +{
>> + struct nvme_ns *ns, *start, *found = NULL;
>> + struct nvme_path_stat *stat;
>> + u32 weight;
>> + int cpu;
>> +
>> + cpu = get_cpu();
>> + ns = *this_cpu_ptr(head->adp_path);
>> + if (unlikely(!ns)) {
>> + ns = list_first_or_null_rcu(&head->list,
>> + struct nvme_ns, siblings);
>> + if (unlikely(!ns))
>> + goto out;
>> + }
>> +found_ns:
>> + start = ns;
>> + while (nvme_path_is_disabled(ns) ||
>> + !nvme_state_is_live(ns->ana_state)) {
>> + ns = list_next_entry_circular(ns, &head->list, siblings);
>> +
>> + /*
>> + * If we iterate through all paths in the list but find each
>> + * path in list is either disabled or dead then bail out.
>> + */
>> + if (ns == start)
>> + goto out;
>> + }
>> +
>> + stat = &this_cpu_ptr(ns->info)[op_type].stat;
>> +
>> + /*
>> + * When the head path-list is singular, we don't calculate the
>> + * weight of the only path, as an optimization: there is no other
>> + * path to forward I/O to. The other possibility is that the path
>> + * is newly added and we don't know its weight yet. So we go
>> + * round-robin for each such path and forward I/O to it. Once we
>> + * start getting responses for such I/Os, the path weight
>> + * calculation kicks in and we then start using path credit for
>> + * forwarding I/O.
>> + */
>> + weight = READ_ONCE(stat->weight);
>> + if (!weight) {
>> + found = ns;
>> + goto out;
>> + }
>> +
>> + /*
>> + * To keep path selection logic simple, we don't distinguish
>> + * between ANA optimized and non-optimized states. The non-
>> + * optimized path is expected to have a lower weight, and
>> + * therefore fewer credits. As a result, only a small number of
>> + * I/Os will be forwarded to paths in the non-optimized state.
>> + */
>> + if (stat->credit > 0) {
>> + --stat->credit;
>> + found = ns;
>> + goto out;
>> + } else {
>> + /*
>> + * Refill credit from path weight and move to next path. The
>> + * refilled credit of the current path will be used next when
>> + * all remaining paths exhaust their credits.
>> + */
>> + weight = READ_ONCE(stat->weight);
>> + stat->credit = weight;
>> + ns = list_next_entry_circular(ns, &head->list, siblings);
>> + if (likely(ns))
>> + goto found_ns;
>> + }
>> +out:
>> + if (found) {
>> + stat->sel++;
>> + *this_cpu_ptr(head->adp_path) = found;
>> + }
>> +
>> + put_cpu();
>> + return found;
>> +}
>> +
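The credit scheme in nvme_adaptive_path() above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: struct path and pick_path() are hypothetical names, an array index stands in for the sibling list, and the per-CPU state, SRCU and ANA/disabled checks of the real function are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for one path's per-CPU state. */
struct path {
	uint32_t weight;	/* share of I/O; 0 means "not measured yet" */
	uint32_t credit;	/* I/Os left before the path must be refilled */
	uint64_t sel;		/* number of times this path was picked */
};

/*
 * Pick the path for the next I/O, starting the search at *cur.
 * Returns the index of the chosen path, or -1 if there are no paths.
 */
static int pick_path(struct path *paths, size_t n, size_t *cur)
{
	size_t i, tries;

	if (n == 0)
		return -1;

	i = *cur;
	/* Each path is visited at most twice: once to refill, once to spend. */
	for (tries = 0; tries < 2 * n; tries++) {
		struct path *p = &paths[i];

		/* An unmeasured path (weight 0) is used without spending credit. */
		if (p->weight == 0 || p->credit > 0) {
			if (p->weight)
				p->credit--;
			p->sel++;
			*cur = i;
			return (int)i;
		}

		/* Out of credit: refill from the weight and try the next path. */
		p->credit = p->weight;
		i = (i + 1) % n;
	}
	return -1;
}
```

With two paths weighted 3 and 1 and the search starting at the first path, eight consecutive picks land six times on the first path and twice on the second, i.e. the 3:1 split the weights ask for.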
>> static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
>> {
>> struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
>> @@ -463,9 +855,12 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
>> return ns;
>> }
>> -inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>> +inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head,
>> + unsigned int op_type)
>> {
>> switch (READ_ONCE(head->subsys->iopolicy)) {
>> + case NVME_IOPOLICY_ADAPTIVE:
>> + return nvme_adaptive_path(head, op_type);
>> case NVME_IOPOLICY_QD:
>> return nvme_queue_depth_path(head);
>> case NVME_IOPOLICY_RR:
>> @@ -525,7 +920,7 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
>> return;
>> srcu_idx = srcu_read_lock(&head->srcu);
>> - ns = nvme_find_path(head);
>> + ns = nvme_find_path(head, nvme_data_dir(bio_op(bio)));
>> if (likely(ns)) {
>> bio_set_dev(bio, ns->disk->part0);
>> bio->bi_opf |= REQ_NVME_MPATH;
>> @@ -567,7 +962,7 @@ static int nvme_ns_head_get_unique_id(struct gendisk *disk, u8 id[16],
>> int srcu_idx, ret = -EWOULDBLOCK;
>> srcu_idx = srcu_read_lock(&head->srcu);
>> - ns = nvme_find_path(head);
>> + ns = nvme_find_path(head, NVME_STAT_OTHER);
>> if (ns)
>> ret = nvme_ns_get_unique_id(ns, id, type);
>> srcu_read_unlock(&head->srcu, srcu_idx);
>> @@ -583,7 +978,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
>> int srcu_idx, ret = -EWOULDBLOCK;
>> srcu_idx = srcu_read_lock(&head->srcu);
>> - ns = nvme_find_path(head);
>> + ns = nvme_find_path(head, NVME_STAT_OTHER);
>> if (ns)
>> ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
>> srcu_read_unlock(&head->srcu, srcu_idx);
>> @@ -725,6 +1120,9 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
>> INIT_WORK(&head->partition_scan_work, nvme_partition_scan_work);
>> INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
>> head->delayed_removal_secs = 0;
>> + head->adp_path = alloc_percpu_gfp(struct nvme_ns*, GFP_KERNEL);
>> + if (!head->adp_path)
>> + return -ENOMEM;
>> /*
>> * If "multipath_always_on" is enabled, a multipath node is added
>> @@ -809,6 +1207,10 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>> }
>> mutex_unlock(&head->lock);
>> + mutex_lock(&nvme_subsystems_lock);
>> + nvme_mpath_enable_adaptive_path_policy(ns);
>> + mutex_unlock(&nvme_subsystems_lock);
>> +
>> synchronize_srcu(&head->srcu);
>> kblockd_schedule_work(&head->requeue_work);
>> }
>> @@ -857,11 +1259,6 @@ static int nvme_parse_ana_log(struct nvme_ctrl *ctrl, void *data,
>> return 0;
>> }
>> -static inline bool nvme_state_is_live(enum nvme_ana_state state)
>> -{
>> - return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
>> -}
>> -
>> static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
>> struct nvme_ns *ns)
>> {
>> @@ -1039,10 +1436,12 @@ static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys,
>> WRITE_ONCE(subsys->iopolicy, iopolicy);
>> - /* iopolicy changes clear the mpath by design */
>> + /* iopolicy changes clear/reset the mpath by design */
>> mutex_lock(&nvme_subsystems_lock);
>> list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
>> nvme_mpath_clear_ctrl_paths(ctrl);
>> + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
>> + nvme_mpath_set_ctrl_paths(ctrl);
>> mutex_unlock(&nvme_subsystems_lock);
>> pr_notice("subsysnqn %s iopolicy changed from %s to %s\n",
>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>> index 102fae6a231c..715c7053054c 100644
>> --- a/drivers/nvme/host/nvme.h
>> +++ b/drivers/nvme/host/nvme.h
>> @@ -28,7 +28,10 @@ extern unsigned int nvme_io_timeout;
>> extern unsigned int admin_timeout;
>> #define NVME_ADMIN_TIMEOUT (admin_timeout * HZ)
>> -#define NVME_DEFAULT_KATO 5
>> +#define NVME_DEFAULT_KATO 5
>> +
>> +#define NVME_DEFAULT_ADP_EWMA_SHIFT 3
>> +#define NVME_DEFAULT_ADP_WEIGHT_TIMEOUT (15 * NSEC_PER_SEC)
>
> You need these defines outside of nvme-mpath?
>
Currently, those macros are used in nvme/host/core.c.
I can move them inside that source file.
>> #ifdef CONFIG_ARCH_NO_SG_CHAIN
>> #define NVME_INLINE_SG_CNT 0
>> @@ -421,6 +424,7 @@ enum nvme_iopolicy {
>> NVME_IOPOLICY_NUMA,
>> NVME_IOPOLICY_RR,
>> NVME_IOPOLICY_QD,
>> + NVME_IOPOLICY_ADAPTIVE,
>> };
>> struct nvme_subsystem {
>> @@ -459,6 +463,37 @@ struct nvme_ns_ids {
>> u8 csi;
>> };
>> +enum nvme_stat_group {
>> + NVME_STAT_READ,
>> + NVME_STAT_WRITE,
>> + NVME_STAT_OTHER,
>> + NVME_NUM_STAT_GROUPS
>> +};
>
> I see you have stats per io direction. However you don't have it per IO size. I wonder
> how this plays into this iopolicy.
>
Yes, you're correct, and as mentioned earlier we'd measure latency per
512-byte block.
>> +
>> +struct nvme_path_stat {
>> + u64 nr_samples; /* total num of samples processed */
>> + u64 nr_ignored; /* num. of samples ignored */
>> + u64 slat_ns; /* smoothed (ewma) latency in nanoseconds */
>> + u64 score; /* score used for weight calculation */
>> + u64 last_weight_ts; /* timestamp of the last weight calculation */
>> + u64 sel; /* num of times this path is selected for I/O */
>> + u64 batch; /* accumulated latency sum for current window */
>> + u32 batch_count; /* num of samples accumulated in current window */
>> + u32 weight; /* path weight */
>> + u32 credit; /* path credit for I/O forwarding */
>> +};
>
> I'm still not convinced that having this be per-cpu-per-ns really makes sense.
I understand your concern about whether it really makes sense to keep this
per-cpu-per-ns, and I see your point that you would prefer maintaining the
stat per-hctx instead of per-CPU.
However, as mentioned earlier, during path selection we cannot reliably map an
I/O to a specific hctx, so using per-hctx statistics becomes problematic in
practice. On the other hand, maintaining the metrics per-CPU has an additional
advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
the NUMA distance between the workload’s CPU and the I/O controller. This means
that on multi-node systems, the policy can automatically favor I/O paths/controllers
that are local/near to the CPU issuing the request, which may lead to better
latency characteristics.
Really appreciate your feedback/comments!
Thanks,
--Nilay
* Re: [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy
2025-12-12 12:08 ` Sagi Grimberg
@ 2025-12-13 8:22 ` Nilay Shroff
0 siblings, 0 replies; 28+ messages in thread
From: Nilay Shroff @ 2025-12-13 8:22 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme
Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce
On 12/12/25 5:38 PM, Sagi Grimberg wrote:
>
>
> On 05/11/2025 12:33, Nilay Shroff wrote:
>> Hi,
>>
>> This series introduces a new adaptive I/O policy for NVMe native
>> multipath. Existing policies such as numa, round-robin, and queue-depth
>> are static and do not adapt to real-time transport performance.
>
> It can be argued that queue-depth is a proxy of latency.
>
>> The numa
>> selects the path closest to the NUMA node of the current CPU, optimizing
>> memory and path locality, but ignores actual path performance. The
>> round-robin distributes I/O evenly across all paths, providing fairness
>> but not performance awareness. The queue-depth reacts to instantaneous
>> queue occupancy, avoiding heavily loaded paths, but does not account for
>> actual latency, throughput, or link speed.
>>
>> The new adaptive policy addresses these gaps selecting paths dynamically
>> based on measured I/O latency for both PCIe and fabrics.
>
> Adaptive is not a good name. Maybe weighted-latency or wplat (weighted path latency)
> or something like that.
>
Yeah, I also talked to Hannes about this and he suggested naming it either "weighted-latency"
or "ewma-latency". Which do you prefer?
>> Latency is
>> derived by passively sampling I/O completions. Each path is assigned a
>> weight proportional to its latency score, and I/Os are then forwarded
>> accordingly. As condition changes (e.g. latency spikes, bandwidth
>> differences), path weights are updated, automatically steering traffic
>> toward better-performing paths.
>>
>> Early results show reduced tail latency under mixed workloads and
>> improved throughput by exploiting higher-speed links more effectively.
>> For example, with NVMf/TCP using two paths (one throttled with ~30 ms
>> delay), fio results with random read/write/rw workloads (direct I/O)
>> showed:
>>
>> numa round-robin queue-depth adaptive
>> ----------- ----------- ----------- ---------
>> READ: 50.0 MiB/s 105 MiB/s 230 MiB/s 350 MiB/s
>> WRITE: 65.9 MiB/s 125 MiB/s 385 MiB/s 446 MiB/s
>> RW: R:30.6 MiB/s R:56.5 MiB/s R:122 MiB/s R:175 MiB/s
>> W:30.7 MiB/s W:56.5 MiB/s W:122 MiB/s W:175 MiB/s
>
> Seems like a nice gain.
> Can you please test for the normal symmetric paths case? Would like
> to see the trade-off...
Yes, I've already tested that. I currently don’t have access to the system,
but based on my earlier runs, the performance for the symmetric-path case
was noticeably better than in the NUMA scenario, and roughly in the same
(or slightly better) range as the round-robin/queue-depth I/O policies. I will
share those numbers later once I have access again.
Thanks,
--Nilay
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2025-12-13 7:27 ` Nilay Shroff
@ 2025-12-15 23:36 ` Sagi Grimberg
2025-12-18 11:19 ` Nilay Shroff
0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-15 23:36 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce
On 13/12/2025 9:27, Nilay Shroff wrote:
>
> On 12/12/25 6:34 PM, Sagi Grimberg wrote:
>>
>> On 05/11/2025 12:33, Nilay Shroff wrote:
>>> This commit introduces a new I/O policy named "adaptive". Users can
>>> configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
>>> subsystemX/iopolicy"
>>>
>>> The adaptive policy dynamically distributes I/O based on measured
>>> completion latency. The main idea is to calculate latency for each path,
>>> derive a weight, and then proportionally forward I/O according to those
>>> weights.
>>>
>>> To ensure scalability, path latency is measured per-CPU. Each CPU
>>> maintains its own statistics, and I/O forwarding uses these per-CPU
>>> values.
>> So a given cpu would select path-a vs. another cpu that may select path-b?
>> How does that play with less queues than cpu cores? what happens to cores
>> that have low traffic?
>>
> The path-selection logic does not depend on the relationship between the number
> of CPUs and the number of hardware queues. It simply selects a path based on the
> per-CPU path score/credit, which reflects the relative performance of each available
> path.
> For example, assume we have two paths (A and B) to the same shared namespace.
> For each CPU, we maintain a smoothed latency estimate for every path. From these
> latency values we derive a per-path score or credit. The credit represents the relative
> share of I/O that each path should receive: a path with lower observed latency gets more
> credit, and a path with higher latency gets less.
I understand that the stats are maintained per-cpu; however, I am not sure
that having per-cpu path weights makes sense. It means that if we have paths
a, b, c, then for cpu0 we'll have one set of weights and for cpu1 we'll have
another set of weights.
What if a given cpu happened to schedule some other application in a way
that impacts completion latency? Won't that skew the sampling? That is not
related to the path at all. It is possibly more noticeable in tcp, which
completes in a kthread context.
What do we lose if the 15-second weight assignment averages the samples of
all the cpus? Won't that mitigate, to some extent, the issue of
non-path-related latency skew?
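The averaging asked about above could look roughly like this. It is a userspace sketch, not code from the patch: struct cpu_sample and avg_latency_all_cpus() are hypothetical names, and only the batch/batch_count fields mirror struct nvme_path_stat.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU window of latency samples for one path. */
struct cpu_sample {
	uint64_t batch;		/* summed latency (ns) in the current window */
	uint32_t batch_count;	/* number of samples in the current window */
};

/*
 * Average the latency of one path over all CPUs, weighting each CPU
 * by how many samples it contributed. Returns 0 if nothing was sampled.
 */
static uint64_t avg_latency_all_cpus(const struct cpu_sample *s, size_t ncpus)
{
	uint64_t sum = 0, count = 0;
	size_t i;

	for (i = 0; i < ncpus; i++) {
		sum += s[i].batch;
		count += s[i].batch_count;
	}
	return count ? sum / count : 0;
}
```

A CPU that saw no I/O in the window contributes nothing, and a CPU whose completions were delayed by unrelated scheduling is diluted by the samples from the other CPUs.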
>
> I/O distribution is thus governed directly by the available credits on that CPU. When the
> NVMe multipath driver performs path selection, it chooses the path with sufficient credits,
> updates the bio’s bdev to correspond to that path, and submits the bio. Only after this
> point does the block layer map the bio to an hctx through the usual ctx->hctx mapping (i.e.,
> matching the issuing CPU to the appropriate hardware queue). In other words, the multipath
> policy runs above the block-layer queueing logic, and the number of hardware queues does
> not affect how paths are scored or selected.
This is potentially another problem. An application may jump between cpu
cores due to scheduling constraints. In this case, how does the path selection
policy adhere to the path weights?
What I'm trying to say here is that the path selection should inherently
reflect the path, not the cpu core that was accessing this path. What I am
concerned about is how this behaves in the real world. Your tests run with a
very distinct, artificial path variance, and they do not include other
workloads running on the system that can impact completion latency.
It is possible that what I'm raising here is not a real concern, but I think
we need to be able to demonstrate that.
>
>>> Every ~15 seconds, a simple average latency of per-CPU batched
>>> samples are computed and fed into an Exponentially Weighted Moving
>>> Average (EWMA):
>> I suggest to have iopolicy name reflect ewma. maybe "ewma-lat"?
> Okay that sounds good! Shall we name it "ewma-lat" or "weighted-lat"?
weighted-lat is simpler.
>
> Path weights are then derived from the smoothed (EWMA)
> latency as follows (example with two paths A and B):
>
> path_A_score = NSEC_PER_SEC / path_A_ewma_latency
> path_B_score = NSEC_PER_SEC / path_B_ewma_latency
> total_score = path_A_score + path_B_score
>
> path_A_weight = (path_A_score * 100) / total_score
> path_B_weight = (path_B_score * 100) / total_score
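As a worked example of the score and weight formulas above, here is a small userspace sketch. path_score() and path_weight() are hypothetical helper names. The scale is a parameter because the formula above multiplies by 100, while the patch code multiplies by 64 and clamps the result to a minimum of 1.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Score is the inverse of the smoothed latency, scaled by NSEC_PER_SEC. */
static uint64_t path_score(uint64_t ewma_latency_ns)
{
	return ewma_latency_ns ? NSEC_PER_SEC / ewma_latency_ns : 0;
}

/*
 * Weight is the path's share of the total score, scaled to 'scale'.
 * As in the patch, a measured path never gets a weight below 1.
 */
static uint32_t path_weight(uint64_t score, uint64_t total_score,
			    uint32_t scale)
{
	uint64_t w;

	if (!total_score)
		return 0;
	w = score * scale / total_score;
	return w ? (uint32_t)w : 1;
}
```

For illustrative (made-up) latencies of 1 ms on path A and 30 ms on path B, the scores are 1000 and 33. With a scale of 100 the weights come out as 96 and 3; with a scale of 64 they are 61 and 2. Integer division means the weights need not add up to the scale exactly.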
>
>> What happens to R/W mixed workloads? What happens when the I/O pattern
>> has a distribution of block sizes?
>>
> We maintain separate metrics for READ and WRITE traffic, and during path
> selection we use the appropriate metric depending on the I/O type.
>
> Regarding block-size variability: the current implementation does not yet
> account for I/O size. This is an important point — thank you for raising it.
> I discussed this today with Hannes at LPC, and we agreed that a practical
> approach is to normalize latency per 512-byte block. For our purposes, we
> do not need an exact latency value; a relative latency metric is sufficient,
> as it ultimately feeds into path scoring. A path with higher latency ends up
> with a lower score, and a path with lower latency gets a higher score — the
> exact absolute values are less important than maintaining consistent proportional
> relationships.
I am not sure that normalizing to 512-byte blocks is a good proxy. I think that
large I/O will have a much lower amortized latency per 512-byte block, which
could create a false bias towards placing a high weight on a path if that path
happened to host large I/Os, no?
In my mind, having buckets for I/O sizes would probably give a better
approximation of the path weights, wouldn't it?
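To make the amortization concern concrete, here is a small userspace sketch. lat_per_512b() and size_bucket() are hypothetical helpers, and the bucket boundaries are arbitrary assumptions rather than anything proposed in this thread.

```c
#include <stdint.h>

/* Latency normalized per 512-byte block (partial blocks round up). */
static uint64_t lat_per_512b(uint64_t latency_ns, uint32_t bytes)
{
	uint32_t blocks = (bytes + 511) / 512;

	return blocks ? latency_ns / blocks : latency_ns;
}

/* Alternative: keep separate statistics per I/O size class. */
static unsigned int size_bucket(uint32_t bytes)
{
	if (bytes <= 4096)
		return 0;	/* small: up to 4 KiB */
	if (bytes <= 65536)
		return 1;	/* medium: up to 64 KiB */
	if (bytes <= 1048576)
		return 2;	/* large: up to 1 MiB */
	return 3;		/* very large */
}
```

With made-up numbers, a 4 KiB I/O that takes 100 us normalizes to 12500 ns per block, while a 1 MiB I/O that takes 2 ms normalizes to 976 ns per block. A path that happened to carry the large I/Os would look more than ten times "faster" under per-block normalization, whereas with buckets the two samples would never be compared against each other.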
>
> Normalizing latency per 512 bytes gives us a stable, size-aware metric that scales
> across different I/O block sizes. I think that it's easy to normalize a latency number
> per 512 bytes block and I'd implement that in next patch version.
I am not sure. Maybe it is.
The main issue I have here is that you are trying to find asymmetry between
paths, yet you are adding entropy by not taking a few factors into account:
1. I/O size
2. cpu scheduling
3. application cpu affinity changes over time
Now, I don't know if these aspects actually make a difference, or if this is
just hypothetical, but I think we need to cover these aspects when we test the
proposed iopolicy...
>> I think that in order to understand how a non-trivial path selector works we need
>> thorough testing in a variety of I/O patterns.
>>
> Yes that was done running fio with different I/O engines, I/O types (read, write, r/w) and
> different block sizes. I tested it using NVMe pcie and nvmf-tcp. The tests were performed
> for both direct and buffered I/O. Also I ran blktests configuring adaptive I/O policy.
> Still if you prefer running anything further let me know.
Maybe run with higher nice values? Or run other processes on the host in
parallel? Maybe processes that also make heavier use of the network?
I don't think this is a viable approach for pcie in reality; most likely
it is exclusive to fabrics.
>
>>> where:
>>> - path_X_ewma_latency is the smoothed latency of a path in nanoseconds
>>> - NSEC_PER_SEC is used as a scaling factor since valid latencies
>>> are < 1 second
>>> - weights are normalized to a 0–64 scale across all paths.
>>>
>>> Path credits are refilled based on this weight, with one credit
>>> consumed per I/O. When all credits are consumed, the credits are
>>> refilled again based on the current weight. This ensures that I/O is
>>> distributed across paths proportionally to their calculated weight.
>>>
>>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>> ---
>>> drivers/nvme/host/core.c | 15 +-
>>> drivers/nvme/host/ioctl.c | 31 ++-
>>> drivers/nvme/host/multipath.c | 425 ++++++++++++++++++++++++++++++++--
>>> drivers/nvme/host/nvme.h | 74 +++++-
>>> drivers/nvme/host/pr.c | 6 +-
>>> drivers/nvme/host/sysfs.c | 2 +-
>>> 6 files changed, 530 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index fa4181d7de73..47f375c63d2d 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -672,6 +672,9 @@ static void nvme_free_ns_head(struct kref *ref)
>>> cleanup_srcu_struct(&head->srcu);
>>> nvme_put_subsystem(head->subsys);
>>> kfree(head->plids);
>>> +#ifdef CONFIG_NVME_MULTIPATH
>>> + free_percpu(head->adp_path);
>>> +#endif
>>> kfree(head);
>>> }
>>> @@ -689,6 +692,7 @@ static void nvme_free_ns(struct kref *kref)
>>> {
>>> struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
>>> + nvme_free_ns_stat(ns);
>>> put_disk(ns->disk);
>>> nvme_put_ns_head(ns->head);
>>> nvme_put_ctrl(ns->ctrl);
>>> @@ -4137,6 +4141,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>> if (nvme_init_ns_head(ns, info))
>>> goto out_cleanup_disk;
>>> + if (nvme_alloc_ns_stat(ns))
>>> + goto out_unlink_ns;
>>> +
>>> /*
>>> * If multipathing is enabled, the device name for all disks and not
>>> * just those that represent shared namespaces needs to be based on the
>>> @@ -4161,7 +4168,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>> }
>>> if (nvme_update_ns_info(ns, info))
>>> - goto out_unlink_ns;
>>> + goto out_free_ns_stat;
>>> mutex_lock(&ctrl->namespaces_lock);
>>> /*
>>> @@ -4170,7 +4177,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>> */
>>> if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
>>> mutex_unlock(&ctrl->namespaces_lock);
>>> - goto out_unlink_ns;
>>> + goto out_free_ns_stat;
>>> }
>>> nvme_ns_add_to_ctrl_list(ns);
>>> mutex_unlock(&ctrl->namespaces_lock);
>>> @@ -4201,6 +4208,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>> list_del_rcu(&ns->list);
>>> mutex_unlock(&ctrl->namespaces_lock);
>>> synchronize_srcu(&ctrl->srcu);
>>> +out_free_ns_stat:
>>> + nvme_free_ns_stat(ns);
>>> out_unlink_ns:
>>> mutex_lock(&ctrl->subsys->lock);
>>> list_del_rcu(&ns->siblings);
>>> @@ -4244,6 +4253,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>> */
>>> synchronize_srcu(&ns->head->srcu);
>>> + nvme_mpath_cancel_adaptive_path_weight_work(ns);
>>> +
>> I personally think that the check on path stats should be done in the call-site
>> and not in the function itself.
> Hmm, can you please elaborate on this point further? I think, I am unable to get
> your point here.
nvme_mpath_cancel_adaptive_path_weight_work() may or may not do anything; I'd prefer that
this check be made here, at the call site, and not in the function.
>
>>> /* wait for concurrent submissions */
>>> if (nvme_mpath_clear_current_path(ns))
>>> synchronize_srcu(&ns->head->srcu);
>>> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
>>> index c212fa952c0f..759d147d9930 100644
>>> --- a/drivers/nvme/host/ioctl.c
>>> +++ b/drivers/nvme/host/ioctl.c
>>> @@ -700,18 +700,29 @@ static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
>>> int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>>> unsigned int cmd, unsigned long arg)
>>> {
>>> + u8 opcode;
>>> struct nvme_ns_head *head = bdev->bd_disk->private_data;
>>> bool open_for_write = mode & BLK_OPEN_WRITE;
>>> void __user *argp = (void __user *)arg;
>>> struct nvme_ns *ns;
>>> int srcu_idx, ret = -EWOULDBLOCK;
>>> unsigned int flags = 0;
>>> + unsigned int op_type = NVME_STAT_OTHER;
>>> if (bdev_is_partition(bdev))
>>> flags |= NVME_IOCTL_PARTITION;
>>> + if (cmd == NVME_IOCTL_SUBMIT_IO) {
>>> + if (get_user(opcode, (u8 *)argp))
>>> + return -EFAULT;
>>> + if (opcode == nvme_cmd_write)
>>> + op_type = NVME_STAT_WRITE;
>>> + else if (opcode == nvme_cmd_read)
>>> + op_type = NVME_STAT_READ;
>>> + }
>>> +
>>> srcu_idx = srcu_read_lock(&head->srcu);
>>> - ns = nvme_find_path(head);
>>> + ns = nvme_find_path(head, op_type);
>> Perhaps it would be easier to review if you split passing opcode to nvme_find_path()
>> to a prep patch (explaining that the new iopolicy will leverage it)
>>
> Sure, makes sense. I'll split this into prep patch as you suggested.
>>> if (!ns)
>>> goto out_unlock;
>>> @@ -733,6 +744,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>>> long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>>> unsigned long arg)
>>> {
>>> + u8 opcode;
>>> bool open_for_write = file->f_mode & FMODE_WRITE;
>>> struct cdev *cdev = file_inode(file)->i_cdev;
>>> struct nvme_ns_head *head =
>>> @@ -740,9 +752,19 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>>> void __user *argp = (void __user *)arg;
>>> struct nvme_ns *ns;
>>> int srcu_idx, ret = -EWOULDBLOCK;
>>> + unsigned int op_type = NVME_STAT_OTHER;
>>> +
>>> + if (cmd == NVME_IOCTL_SUBMIT_IO) {
>>> + if (get_user(opcode, (u8 *)argp))
>>> + return -EFAULT;
>>> + if (opcode == nvme_cmd_write)
>>> + op_type = NVME_STAT_WRITE;
>>> + else if (opcode == nvme_cmd_read)
>>> + op_type = NVME_STAT_READ;
>>> + }
>>> srcu_idx = srcu_read_lock(&head->srcu);
>>> - ns = nvme_find_path(head);
>>> + ns = nvme_find_path(head, op_type);
>>> if (!ns)
>>> goto out_unlock;
>>> @@ -762,7 +784,10 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
>>> struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>>> struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
>>> int srcu_idx = srcu_read_lock(&head->srcu);
>>> - struct nvme_ns *ns = nvme_find_path(head);
>>> + const struct nvme_uring_cmd *cmd = io_uring_sqe_cmd(ioucmd->sqe);
>>> + struct nvme_ns *ns = nvme_find_path(head,
>>> + READ_ONCE(cmd->opcode) & 1 ?
>>> + NVME_STAT_WRITE : NVME_STAT_READ);
>>> int ret = -EINVAL;
>>> if (ns)
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>> index 543e17aead12..55dc28375662 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -6,6 +6,9 @@
>>> #include <linux/backing-dev.h>
>>> #include <linux/moduleparam.h>
>>> #include <linux/vmalloc.h>
>>> +#include <linux/blk-mq.h>
>>> +#include <linux/math64.h>
>>> +#include <linux/rculist.h>
>>> #include <trace/events/block.h>
>>> #include "nvme.h"
>>> @@ -66,9 +69,10 @@ MODULE_PARM_DESC(multipath_always_on,
>>> "create multipath node always except for private namespace with non-unique nsid; note that this also implicitly enables native multipath support");
>>> static const char *nvme_iopolicy_names[] = {
>>> - [NVME_IOPOLICY_NUMA] = "numa",
>>> - [NVME_IOPOLICY_RR] = "round-robin",
>>> - [NVME_IOPOLICY_QD] = "queue-depth",
>>> + [NVME_IOPOLICY_NUMA] = "numa",
>>> + [NVME_IOPOLICY_RR] = "round-robin",
>>> + [NVME_IOPOLICY_QD] = "queue-depth",
>>> + [NVME_IOPOLICY_ADAPTIVE] = "adaptive",
>>> };
>>> static int iopolicy = NVME_IOPOLICY_NUMA;
>>> @@ -83,6 +87,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>>> iopolicy = NVME_IOPOLICY_RR;
>>> else if (!strncmp(val, "queue-depth", 11))
>>> iopolicy = NVME_IOPOLICY_QD;
>>> + else if (!strncmp(val, "adaptive", 8))
>>> + iopolicy = NVME_IOPOLICY_ADAPTIVE;
>>> else
>>> return -EINVAL;
>>> @@ -198,6 +204,204 @@ void nvme_mpath_start_request(struct request *rq)
>>> }
>>> EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
>>> +static void nvme_mpath_weight_work(struct work_struct *weight_work)
>>> +{
>>> + int cpu, srcu_idx;
>>> + u32 weight;
>>> + struct nvme_ns *ns;
>>> + struct nvme_path_stat *stat;
>>> + struct nvme_path_work *work = container_of(weight_work,
>>> + struct nvme_path_work, weight_work);
>>> + struct nvme_ns_head *head = work->ns->head;
>>> + int op_type = work->op_type;
>>> + u64 total_score = 0;
>>> +
>>> + cpu = get_cpu();
>>> +
>>> + srcu_idx = srcu_read_lock(&head->srcu);
>>> + list_for_each_entry_srcu(ns, &head->list, siblings,
>>> + srcu_read_lock_held(&head->srcu)) {
>>> +
>>> + stat = &this_cpu_ptr(ns->info)[op_type].stat;
>>> + if (!READ_ONCE(stat->slat_ns)) {
>>> + stat->score = 0;
>>> + continue;
>>> + }
>>> + /*
>>> + * Compute the path score as the inverse of smoothed
>>> + * latency, scaled by NSEC_PER_SEC. Floating point
>>> + * math is unavailable in the kernel, so fixed-point
>>> + * scaling is used instead. NSEC_PER_SEC is chosen
>>> + * because valid latencies are always < 1 second; longer
>>> + * latencies are ignored.
>>> + */
>>> + stat->score = div_u64(NSEC_PER_SEC, READ_ONCE(stat->slat_ns));
>>> +
>>> + /* Compute total score. */
>>> + total_score += stat->score;
>>> + }
>>> +
>>> + if (!total_score)
>>> + goto out;
>>> +
>>> + /*
>>> + * After computing the total score, we derive per-path weight
>>> + * (normalized to the range 0–64). The weight represents the
>>> + * relative share of I/O the path should receive.
>>> + *
>>> + * - lower smoothed latency -> higher weight
>>> + * - higher smoothed latency -> lower weight
>>> + *
>>> + * Next, while forwarding I/O, we assign "credits" to each path
>>> + * based on its weight (please also refer nvme_adaptive_path()):
>>> + * - Initially, credits = weight.
>>> + * - Each time an I/O is dispatched on a path, its credits are
>>> + * decremented proportionally.
>>> + * - When a path runs out of credits, it becomes temporarily
>>> + * ineligible until credit is refilled.
>>> + *
>>> + * I/O distribution is therefore governed by available credits,
>>> + * ensuring that over time the proportion of I/O sent to each
>>> + * path matches its weight (and thus its performance).
>>> + */
>>> + list_for_each_entry_srcu(ns, &head->list, siblings,
>>> + srcu_read_lock_held(&head->srcu)) {
>>> +
>>> + stat = &this_cpu_ptr(ns->info)[op_type].stat;
>>> + weight = div_u64(stat->score * 64, total_score);
>>> +
>>> + /*
>>> + * Ensure the path weight never drops below 1. A weight
>>> + * of 0 is used only for newly added paths. During
>>> + * bootstrap, a few I/Os are sent to such paths to
>>> + * establish an initial weight. Enforcing a minimum
>>> + * weight of 1 guarantees that no path is forgotten and
>>> + * that each path is probed at least occasionally.
>>> + */
>>> + if (!weight)
>>> + weight = 1;
>>> +
>>> + WRITE_ONCE(stat->weight, weight);
>>> + }
>>> +out:
>>> + srcu_read_unlock(&head->srcu, srcu_idx);
>>> + put_cpu();
>>> +}
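
For readers following the score/weight arithmetic in the function above, here is a minimal userspace sketch of the same fixed-point steps. It is not the patch code: path_score(), path_weight() and WEIGHT_SCALE are illustrative names, and the latencies in the note below are made up.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC	1000000000ULL
#define WEIGHT_SCALE	64	/* assumed to match the "* 64" in the quoted code */

/* Score is the inverse of the smoothed latency, scaled by NSEC_PER_SEC. */
static uint64_t path_score(uint64_t slat_ns)
{
	return slat_ns ? NSEC_PER_SEC / slat_ns : 0;	/* 0 = no samples yet */
}

/* Weight is this path's share of the total score, floored at 1. */
static uint32_t path_weight(uint64_t score, uint64_t total_score)
{
	uint64_t w;

	assert(total_score > 0 && score <= total_score);
	w = score * WEIGHT_SCALE / total_score;
	return w ? (uint32_t)w : 1;	/* never let a sampled path starve */
}
```

With smoothed latencies of 1 ms and 3 ms the scores are 1000 and 333, and the weights come out as 48 and 15. Integer truncation means the weights need not add up to exactly 64.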
>>> +
>>> +/*
>>> + * Formula to calculate the EWMA (Exponentially Weighted Moving Average):
>>> + * ewma = (old_ewma * ((1 << EWMA_SHIFT) - 1) + new_sample) >> EWMA_SHIFT
>>> + * For instance, with EWMA_SHIFT = 3, this assigns 7/8 (~87.5%) weight to
>>> + * the existing/old ewma and 1/8 (~12.5%) weight to the new sample.
>>> + */
>>> +static inline u64 ewma_update(u64 old, u64 new)
>> it is a calculation function, let's call it calc_ewma_update
> Yeah, will do this in next patch version.
>
>>> +{
>>> + return (old * ((1 << NVME_DEFAULT_ADP_EWMA_SHIFT) - 1)
>>> + + new) >> NVME_DEFAULT_ADP_EWMA_SHIFT;
>>> +}
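
The shift-based update above can be tried out in userspace. This is a sketch only: EWMA_SHIFT stands in for NVME_DEFAULT_ADP_EWMA_SHIFT, and ewma_converge() is a hypothetical helper that is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define EWMA_SHIFT 3	/* stand-in for NVME_DEFAULT_ADP_EWMA_SHIFT */

/* new = (old * (2^SHIFT - 1) + sample) / 2^SHIFT, integer math only */
static uint64_t ewma_update(uint64_t old, uint64_t sample)
{
	/* guard against overflow of the intermediate sum */
	assert(old <= (UINT64_MAX >> EWMA_SHIFT));
	assert(sample <= (UINT64_MAX >> EWMA_SHIFT));
	return (old * ((1ULL << EWMA_SHIFT) - 1) + sample) >> EWMA_SHIFT;
}

/* Fold the same sample in 'rounds' times to see how fast the average follows. */
static uint64_t ewma_converge(uint64_t start, uint64_t sample, unsigned int rounds)
{
	uint64_t v = start;

	while (rounds--)
		v = ewma_update(v, sample);
	return v;
}
```

Starting from 800 ns, one 1600 ns sample moves the average to 900 ns, so a single outlier window shifts the estimate by only 1/8 of the difference.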
>>> +
>>> +static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
>>> +{
>>> + int cpu;
>>> + unsigned int op_type;
>>> + struct nvme_path_info *info;
>>> + struct nvme_path_stat *stat;
>>> + u64 now, latency, slat_ns, avg_lat_ns;
>>> + struct nvme_ns_head *head = ns->head;
>>> +
>>> + if (list_is_singular(&head->list))
>>> + return;
>>> +
>>> + now = ktime_get_ns();
>>> + latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
>>> + if (!latency)
>>> + return;
>>> +
>>> + /*
>>> + * As completion code path is serialized(i.e. no same completion queue
>>> + * update code could run simultaneously on multiple cpu) we can safely
>>> + * access per cpu nvme path stat here from another cpu (in case the
>>> + * completion cpu is different from submission cpu).
>>> + * The only field which could be accessed simultaneously here is the
>>> + * path ->weight which may be accessed by this function as well as I/O
>>> + * submission path during path selection logic and we protect ->weight
>>> + * using READ_ONCE/WRITE_ONCE. Yes this may not be 100% accurate but
>>> + * we also don't need to be so accurate here as the path credit would
>>> + * be anyways refilled, based on path weight, once path consumes all
>>> + * its credits. And we limit path weight/credit to a maximum of 64. Please
>>> + * also refer nvme_adaptive_path().
>>> + */
>>> + cpu = blk_mq_rq_cpu(rq);
>>> + op_type = nvme_data_dir(req_op(rq));
>>> + info = &per_cpu_ptr(ns->info, cpu)[op_type];
>> info is really really really confusing and generic and not representative of what
>> "info" it is used for. maybe path_lat? or path_stats? anything is better than info.
>>
> Maybe I am too used to this code, so I never noticed it. But yes, agreed: I
> will rename it to @path_lat.
>
>>> + stat = &info->stat;
>>> +
>>> + /*
>>> + * If latency > ~1s then ignore this sample to prevent EWMA from being
>>> + * skewed by pathological outliers (multi-second waits, controller
>>> + * timeouts etc.). This keeps path scores representative of normal
>>> + * performance and avoids instability from rare spikes. If such high
>>> + * latency is real, ANA state reporting or keep-alive error counters
>>> + * will mark the path unhealthy and remove it from the head node list,
>>> + * so we safely skip such sample here.
>>> + */
>>> + if (unlikely(latency > NSEC_PER_SEC)) {
>>> + stat->nr_ignored++;
>>> + dev_warn_ratelimited(ns->ctrl->device,
>>> + "ignoring sample with >1s latency (possible controller stall or timeout)\n");
>>> + return;
>>> + }
>>> +
>>> + /*
>>> + * Accumulate latency samples and increment the batch count for each
>>> + * ~15 second interval. When the interval expires, compute the simple
>>> + * average latency over that window, then update the smoothed (EWMA)
>>> + * latency. The path weight is recalculated based on this smoothed
>>> + * latency.
>>> + */
>>> + stat->batch += latency;
>>> + stat->batch_count++;
>>> + stat->nr_samples++;
>>> +
>>> + if (now > stat->last_weight_ts &&
>>> + (now - stat->last_weight_ts) >= NVME_DEFAULT_ADP_WEIGHT_TIMEOUT) {
>>> +
>>> + stat->last_weight_ts = now;
>>> +
>>> + /*
>>> + * Find simple average latency for the last epoch (~15 sec
>>> + * interval).
>>> + */
>>> + avg_lat_ns = div_u64(stat->batch, stat->batch_count);
>>> +
>>> + /*
>>> + * Calculate smooth/EWMA (Exponentially Weighted Moving Average)
>>> + * latency. EWMA is preferred over simple average latency
>>> + * because it smooths naturally, reduces jitter from sudden
>>> + * spikes, and adapts faster to changing conditions. It also
>>> + * avoids storing historical samples, and works well for both
>>> + * slow and fast I/O rates.
>>> + * Formula:
>>> + * slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
>>> + * With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
>>> + * existing latency and 1/8 (~12.5%) weight to the new latency.
>>> + */
>>> + if (unlikely(!stat->slat_ns))
>>> + WRITE_ONCE(stat->slat_ns, avg_lat_ns);
>>> + else {
>>> + slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
>>> + WRITE_ONCE(stat->slat_ns, slat_ns);
>>> + }
>>> +
>>> + stat->batch = stat->batch_count = 0;
>>> +
>>> + /*
>>> + * Defer calculation of the path weight in per-cpu workqueue.
>>> + */
>>> + schedule_work_on(cpu, &info->work.weight_work);
>> I'm unsure if the percpu is a good choice here. Don't you want it per hctx at least?
>> workloads tend to bounce quite a bit between cpu cores... we have systems with hundreds of
>> cpu cores.
> As I explained earlier, in NVMe multipath driver code we don't know hctx while
> we choose path. The ctx to hctx mapping happens later in the block layer while
> submitting bio.
yes, hctx is not really relevant.
> Here we calculate the path score per-cpu and utilize it while
> choosing path to forward I/O.
>
>>> + }
>>> +}
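
The batching done by nvme_mpath_add_sample() can be sketched in userspace as follows. This is a simplified model, not the patch code: struct lat_window, add_sample() and window_demo() are illustrative names, timestamps are assumed monotonic, and the deferred weight work is left out.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC	1000000000ULL
#define WINDOW_NS	(15 * NSEC_PER_SEC)	/* ~15 s, as in the patch */
#define EWMA_SHIFT	3

struct lat_window {
	uint64_t sum_ns;	/* latency accumulated in the current window */
	uint32_t count;		/* samples accumulated in the current window */
	uint64_t start_ns;	/* when the current window began */
	uint64_t slat_ns;	/* smoothed latency, 0 until the first window closes */
};

/* Returns 1 when this sample closed a window and slat_ns was updated. */
static int add_sample(struct lat_window *w, uint64_t now_ns, uint64_t lat_ns)
{
	uint64_t avg;

	w->sum_ns += lat_ns;
	w->count++;
	if (now_ns - w->start_ns < WINDOW_NS)
		return 0;	/* common case: just accumulate */

	assert(w->count > 0);
	avg = w->sum_ns / w->count;	/* simple average over the window */
	w->slat_ns = w->slat_ns ?
		(w->slat_ns * ((1ULL << EWMA_SHIFT) - 1) + avg) >> EWMA_SHIFT : avg;
	w->sum_ns = 0;
	w->count = 0;
	w->start_ns = now_ns;
	return 1;
}

/* Two windows: the first averages 200 ns, the second 1000 ns. */
static uint64_t window_demo(void)
{
	struct lat_window w = { 0, 0, 0, 0 };

	add_sample(&w, 1 * NSEC_PER_SEC, 100);
	add_sample(&w, 2 * NSEC_PER_SEC, 300);
	add_sample(&w, 16 * NSEC_PER_SEC, 200);		/* closes window 1 */
	add_sample(&w, 20 * NSEC_PER_SEC, 1000);
	add_sample(&w, 31 * NSEC_PER_SEC, 1000);	/* closes window 2 */
	return w.slat_ns;
}
```

In this model, the per-completion cost is two additions and a compare. The division and the EWMA step run only when a window closes.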
>>> +
>>> void nvme_mpath_end_request(struct request *rq)
>>> {
>>> struct nvme_ns *ns = rq->q->queuedata;
>>> @@ -205,6 +409,9 @@ void nvme_mpath_end_request(struct request *rq)
>>> if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)
>>> atomic_dec_if_positive(&ns->ctrl->nr_active);
>>> + if (test_bit(NVME_NS_PATH_STAT, &ns->flags))
>>> + nvme_mpath_add_sample(rq, ns);
>>> +
>> Doing all this work for EVERY completion is really worth it?
>> sounds kinda like an overkill.
> We don't really do much in nvme_mpath_add_sample() other than
> adding the latency sample to the batch. The real work of calculating
> the path score is done once every ~15 seconds, and in a separate
> workqueue. So we don't do any heavy lifting here during
> I/O completion processing.
>
>>> if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
>>> return;
>>> bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
>>> @@ -238,6 +445,62 @@ static const char *nvme_ana_state_names[] = {
>>> [NVME_ANA_CHANGE] = "change",
>>> };
>>> +static void nvme_mpath_reset_adaptive_path_stat(struct nvme_ns *ns)
>>> +{
>>> + int i, cpu;
>>> + struct nvme_path_stat *stat;
>>> +
>>> + for_each_possible_cpu(cpu) {
>>> + for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>>> + stat = &per_cpu_ptr(ns->info, cpu)[i].stat;
>>> + memset(stat, 0, sizeof(struct nvme_path_stat));
>>> + }
>>> + }
>>> +}
>>> +
>>> +void nvme_mpath_cancel_adaptive_path_weight_work(struct nvme_ns *ns)
>>> +{
>>> + int i, cpu;
>>> + struct nvme_path_info *info;
>>> +
>>> + if (!test_bit(NVME_NS_PATH_STAT, &ns->flags))
>>> + return;
>>> +
>>> + for_each_online_cpu(cpu) {
>>> + for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>>> + info = &per_cpu_ptr(ns->info, cpu)[i];
>>> + cancel_work_sync(&info->work.weight_work);
>>> + }
>>> + }
>>> +}
>>> +
>>> +static bool nvme_mpath_enable_adaptive_path_policy(struct nvme_ns *ns)
>>> +{
>>> + struct nvme_ns_head *head = ns->head;
>>> +
>>> + if (!head->disk || head->subsys->iopolicy != NVME_IOPOLICY_ADAPTIVE)
>>> + return false;
>>> +
>>> + if (test_and_set_bit(NVME_NS_PATH_STAT, &ns->flags))
>>> + return false;
>>> +
>>> + blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, ns->queue);
>> This is an undocumented change...
> Sure, I will add a comment to this code in the next patch version.
>
>>> + blk_stat_enable_accounting(ns->queue);
>>> + return true;
>>> +}
>>> +
>>> +static bool nvme_mpath_disable_adaptive_path_policy(struct nvme_ns *ns)
>>> +{
>>> +
>>> + if (!test_and_clear_bit(NVME_NS_PATH_STAT, &ns->flags))
>>> + return false;
>>> +
>>> + blk_stat_disable_accounting(ns->queue);
>>> + blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, ns->queue);
>>> + nvme_mpath_reset_adaptive_path_stat(ns);
>>> + return true;
>>> +}
>>> +
>>> bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
>>> {
>>> struct nvme_ns_head *head = ns->head;
>>> @@ -253,6 +516,8 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
>>> changed = true;
>>> }
>>> }
>>> + if (nvme_mpath_disable_adaptive_path_policy(ns))
>>> + changed = true;
>> I don't understand why you are setting changed here. It relates to whether the current_path
>> was changed. Doesn't make sense to me.
>>
> I think you were correct. We don't have any rcu update here for adaptive path.
> Will remove this.
>
>>> out:
>>> return changed;
>>> }
>>> @@ -271,6 +536,45 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
>>> srcu_read_unlock(&ctrl->srcu, srcu_idx);
>>> }
>>> +int nvme_alloc_ns_stat(struct nvme_ns *ns)
>>> +{
>>> + int i, cpu;
>>> + struct nvme_path_work *work;
>>> + gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
>>> +
>>> + if (!ns->head->disk)
>>> + return 0;
>>> +
>>> + ns->info = __alloc_percpu_gfp(NVME_NUM_STAT_GROUPS *
>>> + sizeof(struct nvme_path_info),
>>> + __alignof__(struct nvme_path_info), gfp);
>>> + if (!ns->info)
>>> + return -ENOMEM;
>>> +
>>> + for_each_possible_cpu(cpu) {
>>> + for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>>> + work = &per_cpu_ptr(ns->info, cpu)[i].work;
>>> + work->ns = ns;
>>> + work->op_type = i;
>>> + INIT_WORK(&work->weight_work, nvme_mpath_weight_work);
>>> + }
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static void nvme_mpath_set_ctrl_paths(struct nvme_ctrl *ctrl)
>> Does this function set any ctrl paths? your code is very confusing.
>>
> Here ctrl path means that we iterate through each of the controller's namespace paths
> and set/enable the adaptive path parameters required for each path.
> Moreover, we already have the corresponding nvme_mpath_clear_ctrl_paths()
> function, which resets/clears the per-path parameters while changing the I/O
> policy.
>
>>> +{
>>> + struct nvme_ns *ns;
>>> + int srcu_idx;
>>> +
>>> + srcu_idx = srcu_read_lock(&ctrl->srcu);
>>> + list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
>>> + srcu_read_lock_held(&ctrl->srcu))
>>> + nvme_mpath_enable_adaptive_path_policy(ns);
>>> + srcu_read_unlock(&ctrl->srcu, srcu_idx);
>> seems like it enables the iopolicy on all ctrl namespaces.
>> the enable should also be more explicit like:
>> nvme_enable_ns_lat_sampling or something like that.
>>
> okay, I'll rename it to the appropriate function name, as you suggested.
>
>>> +}
>>> +
>>> void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>>> {
>>> struct nvme_ns_head *head = ns->head;
>>> @@ -283,6 +587,8 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>>> srcu_read_lock_held(&head->srcu)) {
>>> if (capacity != get_capacity(ns->disk))
>>> clear_bit(NVME_NS_READY, &ns->flags);
>>> +
>>> + nvme_mpath_reset_adaptive_path_stat(ns);
>>> }
>>> srcu_read_unlock(&head->srcu, srcu_idx);
>>> @@ -407,6 +713,92 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
>>> return found;
>>> }
>>> +static inline bool nvme_state_is_live(enum nvme_ana_state state)
>>> +{
>>> + return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
>>> +}
>>> +
>>> +static struct nvme_ns *nvme_adaptive_path(struct nvme_ns_head *head,
>>> + unsigned int op_type)
>>> +{
>>> + struct nvme_ns *ns, *start, *found = NULL;
>>> + struct nvme_path_stat *stat;
>>> + u32 weight;
>>> + int cpu;
>>> +
>>> + cpu = get_cpu();
>>> + ns = *this_cpu_ptr(head->adp_path);
>>> + if (unlikely(!ns)) {
>>> + ns = list_first_or_null_rcu(&head->list,
>>> + struct nvme_ns, siblings);
>>> + if (unlikely(!ns))
>>> + goto out;
>>> + }
>>> +found_ns:
>>> + start = ns;
>>> + while (nvme_path_is_disabled(ns) ||
>>> + !nvme_state_is_live(ns->ana_state)) {
>>> + ns = list_next_entry_circular(ns, &head->list, siblings);
>>> +
>>> + /*
>>> + * If we iterate through all paths in the list but find each
>>> + * path in list is either disabled or dead then bail out.
>>> + */
>>> + if (ns == start)
>>> + goto out;
>>> + }
>>> +
>>> + stat = &this_cpu_ptr(ns->info)[op_type].stat;
>>> +
>>> + /*
>>> + * When the head path-list is singular we don't calculate the
>>> + * only path weight for optimization as we don't need to forward
>>> + * I/O to more than one path. The other possibility is when the
>>> + * path is newly added, we don't know its weight. So we go
>>> + * round-robin for each such path and forward I/O to it. Once we
>>> + * start getting response for such I/Os, the path weight calculation
>>> + * would kick in and then we start using path credit for
>>> + * forwarding I/O.
>>> + */
>>> + weight = READ_ONCE(stat->weight);
>>> + if (!weight) {
>>> + found = ns;
>>> + goto out;
>>> + }
>>> +
>>> + /*
>>> + * To keep path selection logic simple, we don't distinguish
>>> + * between ANA optimized and non-optimized states. The non-
>>> + * optimized path is expected to have a lower weight, and
>>> + * therefore fewer credits. As a result, only a small number of
>>> + * I/Os will be forwarded to paths in the non-optimized state.
>>> + */
>>> + if (stat->credit > 0) {
>>> + --stat->credit;
>>> + found = ns;
>>> + goto out;
>>> + } else {
>>> + /*
>>> + * Refill credit from path weight and move to next path. The
>>> + * refilled credit of the current path will be used next when
>>> + * all remaining paths exhaust their credits.
>>> + */
>>> + weight = READ_ONCE(stat->weight);
>>> + stat->credit = weight;
>>> + ns = list_next_entry_circular(ns, &head->list, siblings);
>>> + if (likely(ns))
>>> + goto found_ns;
>>> + }
>>> +out:
>>> + if (found) {
>>> + stat->sel++;
>>> + *this_cpu_ptr(head->adp_path) = found;
>>> + }
>>> +
>>> + put_cpu();
>>> + return found;
>>> +}
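
To see how the credit scheme in nvme_adaptive_path() turns weights into an I/O split, here is a small userspace model. It is a sketch under simplifying assumptions: all paths are live, every weight is already >= 1 (the bootstrap case with weight 0 is omitted), and struct path, pick_path() and ios_on_first_path() are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

struct path {
	uint32_t weight;	/* share of I/O this path should get */
	uint32_t credit;	/* I/Os left before the next refill */
	uint64_t selected;	/* how many times this path was picked */
};

/* Pick a path starting at *cur; consume one credit from it. */
static int pick_path(struct path *paths, int n, int *cur)
{
	for (int tries = 0; tries <= n; tries++) {
		struct path *p = &paths[*cur];

		assert(p->weight >= 1);	/* bootstrap case not modelled */
		if (p->credit > 0) {
			p->credit--;
			p->selected++;
			return *cur;
		}
		/* Out of credit: refill from weight and move to the next path. */
		p->credit = p->weight;
		*cur = (*cur + 1) % n;
	}
	return -1;	/* not reached while every weight is >= 1 */
}

/* Dispatch nr_ios over two paths and report how many went to the first. */
static uint64_t ios_on_first_path(uint32_t w0, uint32_t w1, unsigned int nr_ios)
{
	struct path paths[2] = { { w0, 0, 0 }, { w1, 0, 0 } };
	int cur = 0;

	while (nr_ios--)
		pick_path(paths, 2, &cur);
	return paths[0].selected;
}
```

With weights 48 and 16, each run of 64 I/Os sends 48 to the first path and 16 to the second. In this model the I/Os go out in bursts per path; they are not interleaved.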
>>> +
>>> static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
>>> {
>>> struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
>>> @@ -463,9 +855,12 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
>>> return ns;
>>> }
>>> -inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>> +inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head,
>>> + unsigned int op_type)
>>> {
>>> switch (READ_ONCE(head->subsys->iopolicy)) {
>>> + case NVME_IOPOLICY_ADAPTIVE:
>>> + return nvme_adaptive_path(head, op_type);
>>> case NVME_IOPOLICY_QD:
>>> return nvme_queue_depth_path(head);
>>> case NVME_IOPOLICY_RR:
>>> @@ -525,7 +920,7 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
>>> return;
>>> srcu_idx = srcu_read_lock(&head->srcu);
>>> - ns = nvme_find_path(head);
>>> + ns = nvme_find_path(head, nvme_data_dir(bio_op(bio)));
>>> if (likely(ns)) {
>>> bio_set_dev(bio, ns->disk->part0);
>>> bio->bi_opf |= REQ_NVME_MPATH;
>>> @@ -567,7 +962,7 @@ static int nvme_ns_head_get_unique_id(struct gendisk *disk, u8 id[16],
>>> int srcu_idx, ret = -EWOULDBLOCK;
>>> srcu_idx = srcu_read_lock(&head->srcu);
>>> - ns = nvme_find_path(head);
>>> + ns = nvme_find_path(head, NVME_STAT_OTHER);
>>> if (ns)
>>> ret = nvme_ns_get_unique_id(ns, id, type);
>>> srcu_read_unlock(&head->srcu, srcu_idx);
>>> @@ -583,7 +978,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
>>> int srcu_idx, ret = -EWOULDBLOCK;
>>> srcu_idx = srcu_read_lock(&head->srcu);
>>> - ns = nvme_find_path(head);
>>> + ns = nvme_find_path(head, NVME_STAT_OTHER);
>>> if (ns)
>>> ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
>>> srcu_read_unlock(&head->srcu, srcu_idx);
>>> @@ -725,6 +1120,9 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
>>> INIT_WORK(&head->partition_scan_work, nvme_partition_scan_work);
>>> INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
>>> head->delayed_removal_secs = 0;
>>> + head->adp_path = alloc_percpu_gfp(struct nvme_ns*, GFP_KERNEL);
>>> + if (!head->adp_path)
>>> + return -ENOMEM;
>>> /*
>>> * If "multipath_always_on" is enabled, a multipath node is added
>>> @@ -809,6 +1207,10 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>>> }
>>> mutex_unlock(&head->lock);
>>> + mutex_lock(&nvme_subsystems_lock);
>>> + nvme_mpath_enable_adaptive_path_policy(ns);
>>> + mutex_unlock(&nvme_subsystems_lock);
>>> +
>>> synchronize_srcu(&head->srcu);
>>> kblockd_schedule_work(&head->requeue_work);
>>> }
>>> @@ -857,11 +1259,6 @@ static int nvme_parse_ana_log(struct nvme_ctrl *ctrl, void *data,
>>> return 0;
>>> }
>>> -static inline bool nvme_state_is_live(enum nvme_ana_state state)
>>> -{
>>> - return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
>>> -}
>>> -
>>> static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
>>> struct nvme_ns *ns)
>>> {
>>> @@ -1039,10 +1436,12 @@ static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys,
>>> WRITE_ONCE(subsys->iopolicy, iopolicy);
>>> - /* iopolicy changes clear the mpath by design */
>>> + /* iopolicy changes clear/reset the mpath by design */
>>> mutex_lock(&nvme_subsystems_lock);
>>> list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
>>> nvme_mpath_clear_ctrl_paths(ctrl);
>>> + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
>>> + nvme_mpath_set_ctrl_paths(ctrl);
>>> mutex_unlock(&nvme_subsystems_lock);
>>> pr_notice("subsysnqn %s iopolicy changed from %s to %s\n",
>>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>>> index 102fae6a231c..715c7053054c 100644
>>> --- a/drivers/nvme/host/nvme.h
>>> +++ b/drivers/nvme/host/nvme.h
>>> @@ -28,7 +28,10 @@ extern unsigned int nvme_io_timeout;
>>> extern unsigned int admin_timeout;
>>> #define NVME_ADMIN_TIMEOUT (admin_timeout * HZ)
>>> -#define NVME_DEFAULT_KATO 5
>>> +#define NVME_DEFAULT_KATO 5
>>> +
>>> +#define NVME_DEFAULT_ADP_EWMA_SHIFT 3
>>> +#define NVME_DEFAULT_ADP_WEIGHT_TIMEOUT (15 * NSEC_PER_SEC)
>> You need these defines outside of nvme-mpath?
>>
> Currently, those macros are used in nvme/host/core.c.
> I can move them inside that source file.
>
>>> #ifdef CONFIG_ARCH_NO_SG_CHAIN
>>> #define NVME_INLINE_SG_CNT 0
>>> @@ -421,6 +424,7 @@ enum nvme_iopolicy {
>>> NVME_IOPOLICY_NUMA,
>>> NVME_IOPOLICY_RR,
>>> NVME_IOPOLICY_QD,
>>> + NVME_IOPOLICY_ADAPTIVE,
>>> };
>>> struct nvme_subsystem {
>>> @@ -459,6 +463,37 @@ struct nvme_ns_ids {
>>> u8 csi;
>>> };
>>> +enum nvme_stat_group {
>>> + NVME_STAT_READ,
>>> + NVME_STAT_WRITE,
>>> + NVME_STAT_OTHER,
>>> + NVME_NUM_STAT_GROUPS
>>> +};
>> I see you have stats per io direction. However you don't have it per IO size. I wonder
>> how this plays into this iopolicy.
>>
> Yes, you're correct. As mentioned earlier, we'd measure latency per
> 512-byte block.
>
>>> +
>>> +struct nvme_path_stat {
>>> + u64 nr_samples; /* total num of samples processed */
>>> + u64 nr_ignored; /* num. of samples ignored */
>>> + u64 slat_ns; /* smoothed (ewma) latency in nanoseconds */
>>> + u64 score; /* score used for weight calculation */
>>> + u64 last_weight_ts; /* timestamp of the last weight calculation */
>>> + u64 sel; /* num of times this path is selected for I/O */
>>> + u64 batch; /* accumulated latency sum for current window */
>>> + u32 batch_count; /* num of samples accumulated in current window */
>>> + u32 weight; /* path weight */
>>> + u32 credit; /* path credit for I/O forwarding */
>>> +};
>> I'm still not convinced that having this be per-cpu-per-ns really makes sense.
> I understand your concern about whether it really makes sense to keep this
> per-cpu-per-ns, and I see your point that you would prefer maintaining the
> stat per-hctx instead of per-CPU.
>
> However, as mentioned earlier, during path selection we cannot reliably map an
> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
> practice. On the other hand, maintaining the metrics per-CPU has an additional
> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
> the NUMA distance between the workload’s CPU and the I/O controller. This means
> that on multi-node systems, the policy can automatically favor I/O paths/controllers
> that are local/near to the CPU issuing the request, which may lead to better
> latency characteristics.
With this I tend to agree. but per-cpu has lots of other churns IMO.
Maybe the answer is that paths weights are maintained per NUMA node?
then accessing these weights in the fast-path is still cheap enough?
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2025-12-15 23:36 ` Sagi Grimberg
@ 2025-12-18 11:19 ` Nilay Shroff
2025-12-18 13:46 ` Hannes Reinecke
2025-12-25 12:28 ` Sagi Grimberg
0 siblings, 2 replies; 28+ messages in thread
From: Nilay Shroff @ 2025-12-18 11:19 UTC (permalink / raw)
To: Sagi Grimberg, linux-nvme
Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce
On 12/16/25 5:06 AM, Sagi Grimberg wrote:
>
>
> On 13/12/2025 9:27, Nilay Shroff wrote:
>>
>> On 12/12/25 6:34 PM, Sagi Grimberg wrote:
>>>
>>> On 05/11/2025 12:33, Nilay Shroff wrote:
>>>> This commit introduces a new I/O policy named "adaptive". Users can
>>>> configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
>>>> subsystemX/iopolicy"
>>>>
>>>> The adaptive policy dynamically distributes I/O based on measured
>>>> completion latency. The main idea is to calculate latency for each path,
>>>> derive a weight, and then proportionally forward I/O according to those
>>>> weights.
>>>>
>>>> To ensure scalability, path latency is measured per-CPU. Each CPU
>>>> maintains its own statistics, and I/O forwarding uses these per-CPU
>>>> values.
>>> So a given cpu would select path-a vs. another cpu that may select path-b?
>>> How does that play with less queues than cpu cores? what happens to cores
>>> that have low traffic?
>>>
>> The path-selection logic does not depend on the relationship between the number
>> of CPUs and the number of hardware queues. It simply selects a path based on the
>> per-CPU path score/credit, which reflects the relative performance of each available
>> path.
>> For example, assume we have two paths (A and B) to the same shared namespace.
>> For each CPU, we maintain a smoothed latency estimate for every path. From these
>> latency values we derive a per-path score or credit. The credit represents the relative
>> share of I/O that each path should receive: a path with lower observed latency gets more
>> credit, and a path with higher latency gets less.
>
> I understand that the stats are maintained per-cpu, however I am not sure that having a
> per-cpu path weights make sense. meaning that if we have paths a,b,c and for cpu0 we'll
> have one set of weights and for cpu1 we'll have another set of weights.
>
> What if a given cpu happened to schedule some other application in a way that impacts
> completion latency? won't that skew the sampling? that is not related to the path at all. That
> is possibly more noticeable in tcp which completes in a kthread context.
>
> What do we lose if the 15 seconds weight assignment averages the sampling of all the cpus? won't
> that mitigate to some extent the issue of non-path related latency skew?
>
You’re right — what you’re describing is indeed possible. The intent of the adaptive policy,
however, is to measure end-to-end I/O latency, rather than isolating only the raw path or
transport latency.
The observed completion latency intentionally includes all components that affect I/O from
the host’s perspective: path latency, fabric or protocol stack latency (for example, TCP/IP),
scheduler-induced delays, and the target device’s own I/O latency. By capturing the full
end-to-end behavior, the policy reflects the actual cost of issuing I/O on a given path.
Scheduler-related latency can vary over time due to workload placement or CPU contention,
and this variability is accounted for by the design. Since per-path weights are recalculated
periodically (for example, every 15 seconds), any sustained changes in CPU load or scheduling
behavior are naturally incorporated into the path scoring. As a result, the policy can
automatically adapt/adjust and rebalance I/O toward paths that are performing better under
current system conditions.
In short, while per-CPU sampling may include effects beyond the physical path itself, this is
intentional and allows the adaptive policy to respond in real time to changing end-to-end
performance characteristics.
>>
>> I/O distribution is thus governed directly by the available credits on that CPU. When the
>> NVMe multipath driver performs path selection, it chooses the path with sufficient credits,
>> updates the bio’s bdev to correspond to that path, and submits the bio. Only after this
>> point does the block layer map the bio to an hctx through the usual ctx->hctx mapping (i.e.,
>> matching the issuing CPU to the appropriate hardware queue). In other words, the multipath
>> policy runs above the block-layer queueing logic, and the number of hardware queues does
>> not affect how paths are scored or selected.
>
> This is potentially another problem. application may jump between cpu cores due to scheduling
> constraints. In this case, how is the path selection policy adhering to the path weights?
>
> What I'm trying to say here is that the path selection should be inherently reflective on the path,
> not the cpu core that was accessing this path. What I am concerned about, is how this behaves
> in the real-world. Your tests are running in a very distinct artificial path variance, and it does not include
> other workloads that are running on the system that can impact completion latency.
>
> It is possible that what I'm raising here is not a real concern, but I think we need to be able to demonstrate
> that.
>
In real-world systems, as stated earlier, the completion latency is influenced not only by
the physical path but also by system load, scheduler behavior, and transport stack processing.
By incorporating all of these factors into the latency measurement, the adaptive policy reflects
the true cost of issuing I/O on a given path under current conditions. This allows it to respond
to both path-level and system-level congestion.
In practice, during experiments with two paths (A and B), I observed that when additional latency—
whether introduced via the path itself or through system load—was present on path A, subsequent I/O
was automatically steered toward path B. Once conditions on path A improved, the policy rebalanced
I/O based on the updated path weights. This behavior demonstrates that the policy adapts dynamically
and remains effective even in the presence of CPU migration and competing workloads.
Overall, while per-CPU sampling may appear counterintuitive at first, it enables the policy to capture
real-world end-to-end performance and continuously adjust I/O distribution in response to changing
system and path conditions.
>>
>>>> Every ~15 seconds, a simple average latency of per-CPU batched
>>>> samples is computed and fed into an Exponentially Weighted Moving
>>>> Average (EWMA):
>>> I suggest to have iopolicy name reflect ewma. maybe "ewma-lat"?
>> Okay that sounds good! Shall we name it "ewma-lat" or "weighted-lat"?
>
> weighted-lat is simpler.
Okay, I'll rename it to "weighted-lat".
>
>>
>> Path weights are then derived from the smoothed (EWMA)
>> latency as follows (example with two paths A and B):
>>
>> path_A_score = NSEC_PER_SEC / path_A_ewma_latency
>> path_B_score = NSEC_PER_SEC / path_B_ewma_latency
>> total_score = path_A_score + path_B_score
>>
>> path_A_weight = (path_A_score * 64) / total_score
>> path_B_weight = (path_B_score * 64) / total_score
>>
>>> What happens to R/W mixed workloads? What happens when the I/O pattern
>>> has a distribution of block sizes?
>>>
>> We maintain separate metrics for READ and WRITE traffic, and during path
>> selection we use the appropriate metric depending on the I/O type.
>>
>> Regarding block-size variability: the current implementation does not yet
>> account for I/O size. This is an important point — thank you for raising it.
>> I discussed this today with Hannes at LPC, and we agreed that a practical
>> approach is to normalize latency per 512-byte block. For our purposes, we
>> do not need an exact latency value; a relative latency metric is sufficient,
>> as it ultimately feeds into path scoring. A path with higher latency ends up
>> with a lower score, and a path with lower latency gets a higher score — the
>> exact absolute values are less important than maintaining consistent proportional
>> relationships.
>
> I am not sure that normalizing to 512 blocks is a good proxy. I think that large IO will
> have much lower amortized latency per 512 block. which could create an false bias
> to place a high weight on a path, if that path happened to host large I/Os no?
>
Hmm, yes, good point. I think this could be true for nvme over fabrics.
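
To make the amortization concern concrete, here is a small sketch of per-512-byte normalization. lat_per_sector_ns() is a hypothetical helper, and the latency figures in the note below are made up purely for illustration; they are not measurements.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512ULL

/* Latency divided by the number of 512-byte sectors in the I/O. */
static uint64_t lat_per_sector_ns(uint64_t latency_ns, uint64_t bytes)
{
	assert(bytes >= SECTOR_SIZE);	/* at least one full sector */
	return latency_ns / (bytes / SECTOR_SIZE);
}
```

A 4 KiB I/O completing in 100 us normalizes to 12500 ns per sector, while a 128 KiB I/O completing in 400 us normalizes to 1562 ns per sector. A path that happens to carry the large I/Os would therefore look about 8x faster even if both paths were identical.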
> in my mind having buckets for I/O sizes would probably give a better approximation for
> the paths weights won't it?
>
Okay, so how about dividing I/O sizes in the 4 buckets as shown below?

small      <= 4k
medium     4k-64k
large      64k-128k
very-large >128k
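
A possible classification helper for the four buckets proposed above could look like the sketch below. The enum and function names are hypothetical. It assumes that each upper bound is inclusive (so exactly 64k is "medium"), which the proposal above does not spell out.

```c
#include <assert.h>
#include <stdint.h>

enum io_size_bucket {
	IO_SMALL,	/* <= 4k */
	IO_MEDIUM,	/* > 4k, <= 64k */
	IO_LARGE,	/* > 64k, <= 128k */
	IO_VERY_LARGE,	/* > 128k */
	IO_NUM_BUCKETS
};

static_assert(IO_NUM_BUCKETS == 4, "four size buckets expected");

/* Map an I/O size in bytes to its bucket; upper bounds are inclusive. */
static enum io_size_bucket io_size_to_bucket(uint64_t bytes)
{
	if (bytes <= 4 * 1024)
		return IO_SMALL;
	if (bytes <= 64 * 1024)
		return IO_MEDIUM;
	if (bytes <= 128 * 1024)
		return IO_LARGE;
	return IO_VERY_LARGE;
}
```

The bucket index could then serve as one more dimension of the per-path stat array, next to the existing read/write/other grouping.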
>
>>
>> Normalizing latency per 512 bytes gives us a stable, size-aware metric that scales
>> across different I/O block sizes. I think that it's easy to normalize a latency number
>> per 512 bytes block and I'd implement that in next patch version.
>
> I am not sure. maybe it is.
> The main issue I have here, is that you are trying to find asymmetry between paths,
> however you are adding entropy with few factors by not taking into account:
> 1. I/O size
> 2. cpu scheduling
> 3. application cpu affinity changes over time
>
> Now I don't know if these aspects actually make a difference, or it may be just hypothetical, but
> I think we need to add these aspects when we test the proposed iopolicy...
>
As stated earlier, as we measure end-to-end latency, it helps account for both cpu scheduling
and other application workload specific delays while choosing the path. And regarding I/O
size variation, as you suggested, I proposed using the different bucket sizes mentioned above.
>>> I think that in order to understand how a non-trivial path selector works we need
>>> thorough testing in a variety of I/O patterns.
>>>
>> Yes that was done running fio with different I/O engines, I/O types (read, write, r/w) and
>> different block sizes. I tested it using NVMe pcie and nvmf-tcp. The tests were performed
>> for both direct and buffered I/O. Also I ran blktests configuring adaptive I/O policy.
>> Still if you prefer running anything further let me know.
>
> Maybe run with higher nice values? or run other processes on the host in parallel? maybe processes
> that also makes heavier use of the network?
>
Okay, I'll run such additional workloads while testing this iopolicy.
In fact, you'll find the result of one such experiment I performed
at the end of this email.
> I don't think this is a viable approach for pcie in reality, most likely this is exclusive to fabrics.
>
>>
>>>> where:
>>>> - path_X_ewma_latency is the smoothed latency of a path in nanoseconds
>>>> - NSEC_PER_SEC is used as a scaling factor since valid latencies
>>>> are < 1 second
>>>> - weights are normalized to a 0–64 scale across all paths.
>>>>
>>>> Path credits are refilled based on this weight, with one credit
>>>> consumed per I/O. When all credits are consumed, the credits are
>>>> refilled again based on the current weight. This ensures that I/O is
>>>> distributed across paths proportionally to their calculated weight.
>>>>
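As a reading aid for the weight/credit description above, here is a rough
userspace model of the arithmetic. It is only a sketch, not the patch code:
`struct path_model`, `model_update_weights()` and `model_pick_path()` are
names made up for illustration, and the model ignores ANA state, per-CPU
data and locking. It assumes score = NSEC_PER_SEC / EWMA latency, weights
normalized to 64 with a floor of 1, and one credit consumed per I/O.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC	1000000000ULL
#define MAX_WEIGHT	64

/* Hypothetical userspace stand-in for the per-path statistics. */
struct path_model {
	uint64_t ewma_lat_ns;	/* smoothed latency, 0 = no sample yet */
	uint64_t score;		/* inverse latency, fixed-point */
	uint32_t weight;	/* share of I/O, 0 = not measured yet */
	uint32_t credit;	/* I/Os left before moving to next path */
};

/* Derive a weight for every path from its smoothed latency. */
static void model_update_weights(struct path_model *p, size_t n)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		p[i].score = p[i].ewma_lat_ns ?
			NSEC_PER_SEC / p[i].ewma_lat_ns : 0;
		total += p[i].score;
	}
	if (!total)
		return;

	for (i = 0; i < n; i++) {
		uint32_t w = (uint32_t)(p[i].score * MAX_WEIGHT / total);

		/* A floor of 1 keeps every path probed occasionally. */
		p[i].weight = w ? w : 1;
	}
}

/*
 * Pick the path for the next I/O, starting at *cur. A path with credit
 * left is charged one credit; an exhausted path is refilled from its
 * weight and skipped until the others have used up their credits.
 */
static size_t model_pick_path(struct path_model *p, size_t n, size_t *cur)
{
	assert(n > 0);

	for (;;) {
		struct path_model *c = &p[*cur];

		if (!c->weight)		/* unmeasured path: use it as is */
			return *cur;
		if (c->credit > 0) {
			c->credit--;
			return *cur;
		}
		c->credit = c->weight;
		*cur = (*cur + 1) % n;
	}
}
```

With two paths at 1 ms and 3 ms smoothed latency, this model yields weights
of 48 and 15, so roughly three out of four I/Os go to the faster path.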
>>>> Reviewed-by: Hannes Reinecke <hare@suse.de>
>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>> ---
>>>> drivers/nvme/host/core.c | 15 +-
>>>> drivers/nvme/host/ioctl.c | 31 ++-
>>>> drivers/nvme/host/multipath.c | 425 ++++++++++++++++++++++++++++++++--
>>>> drivers/nvme/host/nvme.h | 74 +++++-
>>>> drivers/nvme/host/pr.c | 6 +-
>>>> drivers/nvme/host/sysfs.c | 2 +-
>>>> 6 files changed, 530 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index fa4181d7de73..47f375c63d2d 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -672,6 +672,9 @@ static void nvme_free_ns_head(struct kref *ref)
>>>> cleanup_srcu_struct(&head->srcu);
>>>> nvme_put_subsystem(head->subsys);
>>>> kfree(head->plids);
>>>> +#ifdef CONFIG_NVME_MULTIPATH
>>>> + free_percpu(head->adp_path);
>>>> +#endif
>>>> kfree(head);
>>>> }
>>>> @@ -689,6 +692,7 @@ static void nvme_free_ns(struct kref *kref)
>>>> {
>>>> struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
>>>> + nvme_free_ns_stat(ns);
>>>> put_disk(ns->disk);
>>>> nvme_put_ns_head(ns->head);
>>>> nvme_put_ctrl(ns->ctrl);
>>>> @@ -4137,6 +4141,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>>> if (nvme_init_ns_head(ns, info))
>>>> goto out_cleanup_disk;
>>>> + if (nvme_alloc_ns_stat(ns))
>>>> + goto out_unlink_ns;
>>>> +
>>>> /*
>>>> * If multipathing is enabled, the device name for all disks and not
>>>> * just those that represent shared namespaces needs to be based on the
>>>> @@ -4161,7 +4168,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>>> }
>>>> if (nvme_update_ns_info(ns, info))
>>>> - goto out_unlink_ns;
>>>> + goto out_free_ns_stat;
>>>> mutex_lock(&ctrl->namespaces_lock);
>>>> /*
>>>> @@ -4170,7 +4177,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>>> */
>>>> if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
>>>> mutex_unlock(&ctrl->namespaces_lock);
>>>> - goto out_unlink_ns;
>>>> + goto out_free_ns_stat;
>>>> }
>>>> nvme_ns_add_to_ctrl_list(ns);
>>>> mutex_unlock(&ctrl->namespaces_lock);
>>>> @@ -4201,6 +4208,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>>>> list_del_rcu(&ns->list);
>>>> mutex_unlock(&ctrl->namespaces_lock);
>>>> synchronize_srcu(&ctrl->srcu);
>>>> +out_free_ns_stat:
>>>> + nvme_free_ns_stat(ns);
>>>> out_unlink_ns:
>>>> mutex_lock(&ctrl->subsys->lock);
>>>> list_del_rcu(&ns->siblings);
>>>> @@ -4244,6 +4253,8 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>> */
>>>> synchronize_srcu(&ns->head->srcu);
>>>> + nvme_mpath_cancel_adaptive_path_weight_work(ns);
>>>> +
>>> I personally think that the check on path stats should be done in the call-site
>>> and not in the function itself.
>> Hmm, can you please elaborate on this point further? I think, I am unable to get
>> your point here.
>
> nvme_mpath_cancel_adaptive_path_weight_work may do something or it won't, I'd prefer that
> this check will be made here and not in the function.
>
Okay, got it. I'll make that path stat check in the call-site.
>
>>
>>>> /* wait for concurrent submissions */
>>>> if (nvme_mpath_clear_current_path(ns))
>>>> synchronize_srcu(&ns->head->srcu);
>>>> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
>>>> index c212fa952c0f..759d147d9930 100644
>>>> --- a/drivers/nvme/host/ioctl.c
>>>> +++ b/drivers/nvme/host/ioctl.c
>>>> @@ -700,18 +700,29 @@ static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd,
>>>> int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>>>> unsigned int cmd, unsigned long arg)
>>>> {
>>>> + u8 opcode;
>>>> struct nvme_ns_head *head = bdev->bd_disk->private_data;
>>>> bool open_for_write = mode & BLK_OPEN_WRITE;
>>>> void __user *argp = (void __user *)arg;
>>>> struct nvme_ns *ns;
>>>> int srcu_idx, ret = -EWOULDBLOCK;
>>>> unsigned int flags = 0;
>>>> + unsigned int op_type = NVME_STAT_OTHER;
>>>> if (bdev_is_partition(bdev))
>>>> flags |= NVME_IOCTL_PARTITION;
>>>> + if (cmd == NVME_IOCTL_SUBMIT_IO) {
>>>> + if (get_user(opcode, (u8 *)argp))
>>>> + return -EFAULT;
>>>> + if (opcode == nvme_cmd_write)
>>>> + op_type = NVME_STAT_WRITE;
>>>> + else if (opcode == nvme_cmd_read)
>>>> + op_type = NVME_STAT_READ;
>>>> + }
>>>> +
>>>> srcu_idx = srcu_read_lock(&head->srcu);
>>>> - ns = nvme_find_path(head);
>>>> + ns = nvme_find_path(head, op_type);
>>> Perhaps it would be easier to review if you split passing opcode to nvme_find_path()
>>> to a prep patch (explaining that the new iopolicy will leverage it)
>>>
>> Sure, makes sense. I'll split this into prep patch as you suggested.
>>>> if (!ns)
>>>> goto out_unlock;
>>>> @@ -733,6 +744,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>>>> long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>>>> unsigned long arg)
>>>> {
>>>> + u8 opcode;
>>>> bool open_for_write = file->f_mode & FMODE_WRITE;
>>>> struct cdev *cdev = file_inode(file)->i_cdev;
>>>> struct nvme_ns_head *head =
>>>> @@ -740,9 +752,19 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>>>> void __user *argp = (void __user *)arg;
>>>> struct nvme_ns *ns;
>>>> int srcu_idx, ret = -EWOULDBLOCK;
>>>> + unsigned int op_type = NVME_STAT_OTHER;
>>>> +
>>>> + if (cmd == NVME_IOCTL_SUBMIT_IO) {
>>>> + if (get_user(opcode, (u8 *)argp))
>>>> + return -EFAULT;
>>>> + if (opcode == nvme_cmd_write)
>>>> + op_type = NVME_STAT_WRITE;
>>>> + else if (opcode == nvme_cmd_read)
>>>> + op_type = NVME_STAT_READ;
>>>> + }
>>>> srcu_idx = srcu_read_lock(&head->srcu);
>>>> - ns = nvme_find_path(head);
>>>> + ns = nvme_find_path(head, op_type);
>>>> if (!ns)
>>>> goto out_unlock;
>>>> @@ -762,7 +784,10 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
>>>> struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>>>> struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
>>>> int srcu_idx = srcu_read_lock(&head->srcu);
>>>> - struct nvme_ns *ns = nvme_find_path(head);
>>>> + const struct nvme_uring_cmd *cmd = io_uring_sqe_cmd(ioucmd->sqe);
>>>> + struct nvme_ns *ns = nvme_find_path(head,
>>>> + READ_ONCE(cmd->opcode) & 1 ?
>>>> + NVME_STAT_WRITE : NVME_STAT_READ);
>>>> int ret = -EINVAL;
>>>> if (ns)
>>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>>> index 543e17aead12..55dc28375662 100644
>>>> --- a/drivers/nvme/host/multipath.c
>>>> +++ b/drivers/nvme/host/multipath.c
>>>> @@ -6,6 +6,9 @@
>>>> #include <linux/backing-dev.h>
>>>> #include <linux/moduleparam.h>
>>>> #include <linux/vmalloc.h>
>>>> +#include <linux/blk-mq.h>
>>>> +#include <linux/math64.h>
>>>> +#include <linux/rculist.h>
>>>> #include <trace/events/block.h>
>>>> #include "nvme.h"
>>>> @@ -66,9 +69,10 @@ MODULE_PARM_DESC(multipath_always_on,
>>>> "create multipath node always except for private namespace with non-unique nsid; note that this also implicitly enables native multipath support");
>>>> static const char *nvme_iopolicy_names[] = {
>>>> - [NVME_IOPOLICY_NUMA] = "numa",
>>>> - [NVME_IOPOLICY_RR] = "round-robin",
>>>> - [NVME_IOPOLICY_QD] = "queue-depth",
>>>> + [NVME_IOPOLICY_NUMA] = "numa",
>>>> + [NVME_IOPOLICY_RR] = "round-robin",
>>>> + [NVME_IOPOLICY_QD] = "queue-depth",
>>>> + [NVME_IOPOLICY_ADAPTIVE] = "adaptive",
>>>> };
>>>> static int iopolicy = NVME_IOPOLICY_NUMA;
>>>> @@ -83,6 +87,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>>>> iopolicy = NVME_IOPOLICY_RR;
>>>> else if (!strncmp(val, "queue-depth", 11))
>>>> iopolicy = NVME_IOPOLICY_QD;
>>>> + else if (!strncmp(val, "adaptive", 8))
>>>> + iopolicy = NVME_IOPOLICY_ADAPTIVE;
>>>> else
>>>> return -EINVAL;
>>>> @@ -198,6 +204,204 @@ void nvme_mpath_start_request(struct request *rq)
>>>> }
>>>> EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
>>>> +static void nvme_mpath_weight_work(struct work_struct *weight_work)
>>>> +{
>>>> + int cpu, srcu_idx;
>>>> + u32 weight;
>>>> + struct nvme_ns *ns;
>>>> + struct nvme_path_stat *stat;
>>>> + struct nvme_path_work *work = container_of(weight_work,
>>>> + struct nvme_path_work, weight_work);
>>>> + struct nvme_ns_head *head = work->ns->head;
>>>> + int op_type = work->op_type;
>>>> + u64 total_score = 0;
>>>> +
>>>> + cpu = get_cpu();
>>>> +
>>>> + srcu_idx = srcu_read_lock(&head->srcu);
>>>> + list_for_each_entry_srcu(ns, &head->list, siblings,
>>>> + srcu_read_lock_held(&head->srcu)) {
>>>> +
>>>> + stat = &this_cpu_ptr(ns->info)[op_type].stat;
>>>> + if (!READ_ONCE(stat->slat_ns)) {
>>>> + stat->score = 0;
>>>> + continue;
>>>> + }
>>>> + /*
>>>> + * Compute the path score as the inverse of smoothed
>>>> + * latency, scaled by NSEC_PER_SEC. Floating point
>>>> + * math is unavailable in the kernel, so fixed-point
>>>> + * scaling is used instead. NSEC_PER_SEC is chosen
>>>> + * because valid latencies are always < 1 second; longer
>>>> + * latencies are ignored.
>>>> + */
>>>> + stat->score = div_u64(NSEC_PER_SEC, READ_ONCE(stat->slat_ns));
>>>> +
>>>> + /* Compute total score. */
>>>> + total_score += stat->score;
>>>> + }
>>>> +
>>>> + if (!total_score)
>>>> + goto out;
>>>> +
>>>> + /*
>>>> + * After computing the total score, we derive per-path weight
>>>> + * (normalized to the range 0–64). The weight represents the
>>>> + * relative share of I/O the path should receive.
>>>> + *
>>>> + * - lower smoothed latency -> higher weight
>>>> + * - higher smoothed latency -> lower weight
>>>> + *
>>>> + * Next, while forwarding I/O, we assign "credits" to each path
>>>> + * based on its weight (please also refer to nvme_adaptive_path()):
>>>> + * - Initially, credits = weight.
>>>> + * - Each time an I/O is dispatched on a path, one of its
>>>> + * credits is consumed.
>>>> + * - When a path runs out of credits, it becomes temporarily
>>>> + * ineligible until its credits are refilled.
>>>> + *
>>>> + * I/O distribution is therefore governed by available credits,
>>>> + * ensuring that over time the proportion of I/O sent to each
>>>> + * path matches its weight (and thus its performance).
>>>> + */
>>>> + list_for_each_entry_srcu(ns, &head->list, siblings,
>>>> + srcu_read_lock_held(&head->srcu)) {
>>>> +
>>>> + stat = &this_cpu_ptr(ns->info)[op_type].stat;
>>>> + weight = div_u64(stat->score * 64, total_score);
>>>> +
>>>> + /*
>>>> + * Ensure the path weight never drops below 1. A weight
>>>> + * of 0 is used only for newly added paths. During
>>>> + * bootstrap, a few I/Os are sent to such paths to
>>>> + * establish an initial weight. Enforcing a minimum
>>>> + * weight of 1 guarantees that no path is forgotten and
>>>> + * that each path is probed at least occasionally.
>>>> + */
>>>> + if (!weight)
>>>> + weight = 1;
>>>> +
>>>> + WRITE_ONCE(stat->weight, weight);
>>>> + }
>>>> +out:
>>>> + srcu_read_unlock(&head->srcu, srcu_idx);
>>>> + put_cpu();
>>>> +}
>>>> +
>>>> +/*
>>>> + * Formula to calculate the EWMA (Exponentially Weighted Moving Average):
>>>> + * ewma = (old_ewma * ((1 << EWMA_SHIFT) - 1) + new_sample) >> EWMA_SHIFT
>>>> + * For instance, with EWMA_SHIFT = 3, this assigns 7/8 (~87.5%) weight to
>>>> + * the existing/old ewma and 1/8 (~12.5%) weight to the new sample.
>>>> + */
>>>> +static inline u64 ewma_update(u64 old, u64 new)
>>> it is a calculation function, lets call it calc_ewma_update
>> Yeah, will do this in next patch version.
>>
>>>> +{
>>>> + return (old * ((1 << NVME_DEFAULT_ADP_EWMA_SHIFT) - 1)
>>>> + + new) >> NVME_DEFAULT_ADP_EWMA_SHIFT;
>>>> +}
>>>> +
>>>> +static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
>>>> +{
>>>> + int cpu;
>>>> + unsigned int op_type;
>>>> + struct nvme_path_info *info;
>>>> + struct nvme_path_stat *stat;
>>>> + u64 now, latency, slat_ns, avg_lat_ns;
>>>> + struct nvme_ns_head *head = ns->head;
>>>> +
>>>> + if (list_is_singular(&head->list))
>>>> + return;
>>>> +
>>>> + now = ktime_get_ns();
>>>> + latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
>>>> + if (!latency)
>>>> + return;
>>>> +
>>>> + /*
>>>> + * As completion code path is serialized(i.e. no same completion queue
>>>> + * update code could run simultaneously on multiple cpu) we can safely
>>>> + * access per cpu nvme path stat here from another cpu (in case the
>>>> + * completion cpu is different from submission cpu).
>>>> + * The only field which could be accessed simultaneously here is the
>>>> + * path ->weight which may be accessed by this function as well as I/O
>>>> + * submission path during path selection logic and we protect ->weight
>>>> + * using READ_ONCE/WRITE_ONCE. Yes this may not be 100% accurate but
>>>> + * we also don't need to be so accurate here as the path credit would
>>>> + * be anyways refilled, based on path weight, once path consumes all
>>>> + * its credits. And we limit path weight/credit max up to 64. Please
>>>> + * also refer to nvme_adaptive_path().
>>>> + */
>>>> + cpu = blk_mq_rq_cpu(rq);
>>>> + op_type = nvme_data_dir(req_op(rq));
>>>> + info = &per_cpu_ptr(ns->info, cpu)[op_type];
>>> info is really really really confusing and generic and not representative of what
>>> "info" it is used for. maybe path_lat? or path_stats? anything is better than info.
>>>
>> Maybe I am too used to this code and so I never realized it. But yes, agreed, I
>> will make it @path_lat.
>>
>>>> + stat = &info->stat;
>>>> +
>>>> + /*
>>>> + * If latency > ~1s then ignore this sample to prevent EWMA from being
>>>> + * skewed by pathological outliers (multi-second waits, controller
>>>> + * timeouts etc.). This keeps path scores representative of normal
>>>> + * performance and avoids instability from rare spikes. If such high
>>>> + * latency is real, ANA state reporting or keep-alive error counters
>>>> + * will mark the path unhealthy and remove it from the head node list,
>>>> + * so we safely skip such sample here.
>>>> + */
>>>> + if (unlikely(latency > NSEC_PER_SEC)) {
>>>> + stat->nr_ignored++;
>>>> + dev_warn_ratelimited(ns->ctrl->device,
>>>> + "ignoring sample with >1s latency (possible controller stall or timeout)\n");
>>>> + return;
>>>> + }
>>>> +
>>>> + /*
>>>> + * Accumulate latency samples and increment the batch count for each
>>>> + * ~15 second interval. When the interval expires, compute the simple
>>>> + * average latency over that window, then update the smoothed (EWMA)
>>>> + * latency. The path weight is recalculated based on this smoothed
>>>> + * latency.
>>>> + */
>>>> + stat->batch += latency;
>>>> + stat->batch_count++;
>>>> + stat->nr_samples++;
>>>> +
>>>> + if (now > stat->last_weight_ts &&
>>>> + (now - stat->last_weight_ts) >= NVME_DEFAULT_ADP_WEIGHT_TIMEOUT) {
>>>> +
>>>> + stat->last_weight_ts = now;
>>>> +
>>>> + /*
>>>> + * Find simple average latency for the last epoch (~15 sec
>>>> + * interval).
>>>> + */
>>>> + avg_lat_ns = div_u64(stat->batch, stat->batch_count);
>>>> +
>>>> + /*
>>>> + * Calculate smooth/EWMA (Exponentially Weighted Moving Average)
>>>> + * latency. EWMA is preferred over simple average latency
>>>> + * because it smooths naturally, reduces jitter from sudden
>>>> + * spikes, and adapts faster to changing conditions. It also
>>>> + * avoids storing historical samples, and works well for both
>>>> + * slow and fast I/O rates.
>>>> + * Formula:
>>>> + * slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
>>>> + * With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
>>>> + * existing latency and 1/8 (~12.5%) weight to the new latency.
>>>> + */
>>>> + if (unlikely(!stat->slat_ns))
>>>> + WRITE_ONCE(stat->slat_ns, avg_lat_ns);
>>>> + else {
>>>> + slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
>>>> + WRITE_ONCE(stat->slat_ns, slat_ns);
>>>> + }
>>>> +
>>>> + stat->batch = stat->batch_count = 0;
>>>> +
>>>> + /*
>>>> + * Defer calculation of the path weight in per-cpu workqueue.
>>>> + */
>>>> + schedule_work_on(cpu, &info->work.weight_work);
>>> I'm unsure if the percpu is a good choice here. Don't you want it per hctx at least?
>>> workloads tend to bounce quite a bit between cpu cores... we have systems with hundreds of
>>> cpu cores.
>> As I explained earlier, in NVMe multipath driver code we don't know hctx while
>> we choose path. The ctx to hctx mapping happens later in the block layer while
>> submitting bio.
>
> yes, hctx is not really relevant.
>
>> Here we calculate the path score per-cpu and utilize it while
>> choosing path to forward I/O.
>>
>>>> + }
>>>> +}
>>>> +
>>>> void nvme_mpath_end_request(struct request *rq)
>>>> {
>>>> struct nvme_ns *ns = rq->q->queuedata;
>>>> @@ -205,6 +409,9 @@ void nvme_mpath_end_request(struct request *rq)
>>>> if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)
>>>> atomic_dec_if_positive(&ns->ctrl->nr_active);
>>>> + if (test_bit(NVME_NS_PATH_STAT, &ns->flags))
>>>> + nvme_mpath_add_sample(rq, ns);
>>>> +
>>> Doing all this work for EVERY completion is really worth it?
>>> sounds kinda like an overkill.
>> We don't really do much in nvme_mpath_add_sample() other than just
>> adding the latency sample to the batch. The real work, where we calculate
>> the path score, is done once every ~15 seconds and that is done
>> in a separate workqueue. So we don't do any heavy lifting here during
>> I/O completion processing.
>>
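To make the cost argument above concrete, the per-completion work can be
modelled in userspace roughly as below. This is a simplified sketch under
stated assumptions, not the patch code: `struct lat_model`, `model_ewma()`
and `model_add_sample()` are illustrative names, timestamps are passed in
by the caller, and the deferred weight recalculation is reduced to a return
value. Each completion costs two additions and a comparison; the division
and EWMA update run only when the ~15 second window closes.

```c
#include <assert.h>
#include <stdint.h>

#define EWMA_SHIFT	3				/* new sample gets 1/8 */
#define WINDOW_NS	(15ULL * 1000000000ULL)		/* ~15 second window */

/* Hypothetical userspace stand-in for the per-path latency state. */
struct lat_model {
	uint64_t batch;		/* latency sum for the current window */
	uint32_t batch_count;	/* samples in the current window */
	uint64_t slat_ns;	/* smoothed (EWMA) latency */
	uint64_t last_ts;	/* when the window last closed */
};

/* 7/8 weight to the old value, 1/8 to the new sample (shift of 3). */
static uint64_t model_ewma(uint64_t old, uint64_t sample)
{
	return (old * ((1ULL << EWMA_SHIFT) - 1) + sample) >> EWMA_SHIFT;
}

/*
 * Record one completion. Returns 1 when the window closed and the
 * smoothed latency was updated (the point at which the weight work
 * would be scheduled), 0 otherwise.
 */
static int model_add_sample(struct lat_model *m, uint64_t now, uint64_t lat)
{
	uint64_t avg;

	m->batch += lat;
	m->batch_count++;

	if (now <= m->last_ts || now - m->last_ts < WINDOW_NS)
		return 0;

	avg = m->batch / m->batch_count;
	m->slat_ns = m->slat_ns ? model_ewma(m->slat_ns, avg) : avg;
	m->batch = 0;
	m->batch_count = 0;
	m->last_ts = now;
	return 1;
}
```

For example, a first window averaging 1000 ns followed by a window averaging
2000 ns moves the smoothed latency from 1000 ns to (7 * 1000 + 2000) / 8 =
1125 ns, so a single slow window shifts the estimate only modestly.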
>>>> if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
>>>> return;
>>>> bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
>>>> @@ -238,6 +445,62 @@ static const char *nvme_ana_state_names[] = {
>>>> [NVME_ANA_CHANGE] = "change",
>>>> };
>>>> +static void nvme_mpath_reset_adaptive_path_stat(struct nvme_ns *ns)
>>>> +{
>>>> + int i, cpu;
>>>> + struct nvme_path_stat *stat;
>>>> +
>>>> + for_each_possible_cpu(cpu) {
>>>> + for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>>>> + stat = &per_cpu_ptr(ns->info, cpu)[i].stat;
>>>> + memset(stat, 0, sizeof(struct nvme_path_stat));
>>>> + }
>>>> + }
>>>> +}
>>>> +
>>>> +void nvme_mpath_cancel_adaptive_path_weight_work(struct nvme_ns *ns)
>>>> +{
>>>> + int i, cpu;
>>>> + struct nvme_path_info *info;
>>>> +
>>>> + if (!test_bit(NVME_NS_PATH_STAT, &ns->flags))
>>>> + return;
>>>> +
>>>> + for_each_online_cpu(cpu) {
>>>> + for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>>>> + info = &per_cpu_ptr(ns->info, cpu)[i];
>>>> + cancel_work_sync(&info->work.weight_work);
>>>> + }
>>>> + }
>>>> +}
>>>> +
>>>> +static bool nvme_mpath_enable_adaptive_path_policy(struct nvme_ns *ns)
>>>> +{
>>>> + struct nvme_ns_head *head = ns->head;
>>>> +
>>>> + if (!head->disk || head->subsys->iopolicy != NVME_IOPOLICY_ADAPTIVE)
>>>> + return false;
>>>> +
>>>> + if (test_and_set_bit(NVME_NS_PATH_STAT, &ns->flags))
>>>> + return false;
>>>> +
>>>> + blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, ns->queue);
>>> This is an undocumented change...
>> Sure, I would add comment in this code in the next patch version.
>>
>>>> + blk_stat_enable_accounting(ns->queue);
>>>> + return true;
>>>> +}
>>>> +
>>>> +static bool nvme_mpath_disable_adaptive_path_policy(struct nvme_ns *ns)
>>>> +{
>>>> +
>>>> + if (!test_and_clear_bit(NVME_NS_PATH_STAT, &ns->flags))
>>>> + return false;
>>>> +
>>>> + blk_stat_disable_accounting(ns->queue);
>>>> + blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, ns->queue);
>>>> + nvme_mpath_reset_adaptive_path_stat(ns);
>>>> + return true;
>>>> +}
>>>> +
>>>> bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
>>>> {
>>>> struct nvme_ns_head *head = ns->head;
>>>> @@ -253,6 +516,8 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
>>>> changed = true;
>>>> }
>>>> }
>>>> + if (nvme_mpath_disable_adaptive_path_policy(ns))
>>>> + changed = true;
>>> Don't understand why you are setting changed here? it relates to whether the current_path
>>> was changed. doesn't make sense to me.
>>>
>> I think you were correct. We don't have any rcu update here for adaptive path.
>> Will remove this.
>>
>>>> out:
>>>> return changed;
>>>> }
>>>> @@ -271,6 +536,45 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
>>>> srcu_read_unlock(&ctrl->srcu, srcu_idx);
>>>> }
>>>> +int nvme_alloc_ns_stat(struct nvme_ns *ns)
>>>> +{
>>>> + int i, cpu;
>>>> + struct nvme_path_work *work;
>>>> + gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
>>>> +
>>>> + if (!ns->head->disk)
>>>> + return 0;
>>>> +
>>>> + ns->info = __alloc_percpu_gfp(NVME_NUM_STAT_GROUPS *
>>>> + sizeof(struct nvme_path_info),
>>>> + __alignof__(struct nvme_path_info), gfp);
>>>> + if (!ns->info)
>>>> + return -ENOMEM;
>>>> +
>>>> + for_each_possible_cpu(cpu) {
>>>> + for (i = 0; i < NVME_NUM_STAT_GROUPS; i++) {
>>>> + work = &per_cpu_ptr(ns->info, cpu)[i].work;
>>>> + work->ns = ns;
>>>> + work->op_type = i;
>>>> + INIT_WORK(&work->weight_work, nvme_mpath_weight_work);
>>>> + }
>>>> + }
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static void nvme_mpath_set_ctrl_paths(struct nvme_ctrl *ctrl)
>>> Does this function set any ctrl paths? your code is very confusing.
>>>
>> Here ctrl path means that we iterate through each of the controller's namespace
>> paths and then set/enable the adaptive path parameters required for each path.
>> Moreover, we already have the corresponding nvme_mpath_clear_ctrl_paths()
>> function, which resets/clears the per-path parameters while changing the I/O
>> policy.
>>
>>>> +{
>>>> + struct nvme_ns *ns;
>>>> + int srcu_idx;
>>>> +
>>>> + srcu_idx = srcu_read_lock(&ctrl->srcu);
>>>> + list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
>>>> + srcu_read_lock_held(&ctrl->srcu))
>>>> + nvme_mpath_enable_adaptive_path_policy(ns);
>>>> + srcu_read_unlock(&ctrl->srcu, srcu_idx);
>>> seems like it enables the iopolicy on all ctrl namespaces.
>>> the enable should also be more explicit like:
>>> nvme_enable_ns_lat_sampling or something like that.
>>>
>> okay, I'll rename it to the appropriate function name, as you suggested.
>>
>>>> +}
>>>> +
>>>> void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>>>> {
>>>> struct nvme_ns_head *head = ns->head;
>>>> @@ -283,6 +587,8 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
>>>> srcu_read_lock_held(&head->srcu)) {
>>>> if (capacity != get_capacity(ns->disk))
>>>> clear_bit(NVME_NS_READY, &ns->flags);
>>>> +
>>>> + nvme_mpath_reset_adaptive_path_stat(ns);
>>>> }
>>>> srcu_read_unlock(&head->srcu, srcu_idx);
>>>> @@ -407,6 +713,92 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
>>>> return found;
>>>> }
>>>> +static inline bool nvme_state_is_live(enum nvme_ana_state state)
>>>> +{
>>>> + return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
>>>> +}
>>>> +
>>>> +static struct nvme_ns *nvme_adaptive_path(struct nvme_ns_head *head,
>>>> + unsigned int op_type)
>>>> +{
>>>> + struct nvme_ns *ns, *start, *found = NULL;
>>>> + struct nvme_path_stat *stat;
>>>> + u32 weight;
>>>> + int cpu;
>>>> +
>>>> + cpu = get_cpu();
>>>> + ns = *this_cpu_ptr(head->adp_path);
>>>> + if (unlikely(!ns)) {
>>>> + ns = list_first_or_null_rcu(&head->list,
>>>> + struct nvme_ns, siblings);
>>>> + if (unlikely(!ns))
>>>> + goto out;
>>>> + }
>>>> +found_ns:
>>>> + start = ns;
>>>> + while (nvme_path_is_disabled(ns) ||
>>>> + !nvme_state_is_live(ns->ana_state)) {
>>>> + ns = list_next_entry_circular(ns, &head->list, siblings);
>>>> +
>>>> + /*
>>>> + * If we iterate through all paths in the list but find each
>>>> + * path in list is either disabled or dead then bail out.
>>>> + */
>>>> + if (ns == start)
>>>> + goto out;
>>>> + }
>>>> +
>>>> + stat = &this_cpu_ptr(ns->info)[op_type].stat;
>>>> +
>>>> + /*
>>>> + * When the head path-list is singular we don't calculate the
>>>> + * weight of the only path, as an optimization, since we don't
>>>> + * need to forward I/O to more than one path. The other
>>>> + * possibility is that the path is newly added and we don't know
>>>> + * its weight yet. So we go round-robin for each such path and
>>>> + * forward I/O to it. Once we start getting responses for such
>>>> + * I/Os, the path weight calculation kicks in and then we start
>>>> + * using path credits for forwarding I/O.
>>>> + */
>>>> + weight = READ_ONCE(stat->weight);
>>>> + if (!weight) {
>>>> + found = ns;
>>>> + goto out;
>>>> + }
>>>> +
>>>> + /*
>>>> + * To keep path selection logic simple, we don't distinguish
>>>> + * between ANA optimized and non-optimized states. The non-
>>>> + * optimized path is expected to have a lower weight, and
>>>> + * therefore fewer credits. As a result, only a small number of
>>>> + * I/Os will be forwarded to paths in the non-optimized state.
>>>> + */
>>>> + if (stat->credit > 0) {
>>>> + --stat->credit;
>>>> + found = ns;
>>>> + goto out;
>>>> + } else {
>>>> + /*
>>>> + * Refill credit from path weight and move to next path. The
>>>> + * refilled credit of the current path will be used next when
>>>> + * all remaining paths exhaust their credits.
>>>> + */
>>>> + weight = READ_ONCE(stat->weight);
>>>> + stat->credit = weight;
>>>> + ns = list_next_entry_circular(ns, &head->list, siblings);
>>>> + if (likely(ns))
>>>> + goto found_ns;
>>>> + }
>>>> +out:
>>>> + if (found) {
>>>> + stat->sel++;
>>>> + *this_cpu_ptr(head->adp_path) = found;
>>>> + }
>>>> +
>>>> + put_cpu();
>>>> + return found;
>>>> +}
>>>> +
>>>> static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
>>>> {
>>>> struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
>>>> @@ -463,9 +855,12 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
>>>> return ns;
>>>> }
>>>> -inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>>> +inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head,
>>>> + unsigned int op_type)
>>>> {
>>>> switch (READ_ONCE(head->subsys->iopolicy)) {
>>>> + case NVME_IOPOLICY_ADAPTIVE:
>>>> + return nvme_adaptive_path(head, op_type);
>>>> case NVME_IOPOLICY_QD:
>>>> return nvme_queue_depth_path(head);
>>>> case NVME_IOPOLICY_RR:
>>>> @@ -525,7 +920,7 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
>>>> return;
>>>> srcu_idx = srcu_read_lock(&head->srcu);
>>>> - ns = nvme_find_path(head);
>>>> + ns = nvme_find_path(head, nvme_data_dir(bio_op(bio)));
>>>> if (likely(ns)) {
>>>> bio_set_dev(bio, ns->disk->part0);
>>>> bio->bi_opf |= REQ_NVME_MPATH;
>>>> @@ -567,7 +962,7 @@ static int nvme_ns_head_get_unique_id(struct gendisk *disk, u8 id[16],
>>>> int srcu_idx, ret = -EWOULDBLOCK;
>>>> srcu_idx = srcu_read_lock(&head->srcu);
>>>> - ns = nvme_find_path(head);
>>>> + ns = nvme_find_path(head, NVME_STAT_OTHER);
>>>> if (ns)
>>>> ret = nvme_ns_get_unique_id(ns, id, type);
>>>> srcu_read_unlock(&head->srcu, srcu_idx);
>>>> @@ -583,7 +978,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
>>>> int srcu_idx, ret = -EWOULDBLOCK;
>>>> srcu_idx = srcu_read_lock(&head->srcu);
>>>> - ns = nvme_find_path(head);
>>>> + ns = nvme_find_path(head, NVME_STAT_OTHER);
>>>> if (ns)
>>>> ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
>>>> srcu_read_unlock(&head->srcu, srcu_idx);
>>>> @@ -725,6 +1120,9 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
>>>> INIT_WORK(&head->partition_scan_work, nvme_partition_scan_work);
>>>> INIT_DELAYED_WORK(&head->remove_work, nvme_remove_head_work);
>>>> head->delayed_removal_secs = 0;
>>>> + head->adp_path = alloc_percpu_gfp(struct nvme_ns*, GFP_KERNEL);
>>>> + if (!head->adp_path)
>>>> + return -ENOMEM;
>>>> /*
>>>> * If "multipath_always_on" is enabled, a multipath node is added
>>>> @@ -809,6 +1207,10 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>>>> }
>>>> mutex_unlock(&head->lock);
>>>> + mutex_lock(&nvme_subsystems_lock);
>>>> + nvme_mpath_enable_adaptive_path_policy(ns);
>>>> + mutex_unlock(&nvme_subsystems_lock);
>>>> +
>>>> synchronize_srcu(&head->srcu);
>>>> kblockd_schedule_work(&head->requeue_work);
>>>> }
>>>> @@ -857,11 +1259,6 @@ static int nvme_parse_ana_log(struct nvme_ctrl *ctrl, void *data,
>>>> return 0;
>>>> }
>>>> -static inline bool nvme_state_is_live(enum nvme_ana_state state)
>>>> -{
>>>> - return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
>>>> -}
>>>> -
>>>> static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
>>>> struct nvme_ns *ns)
>>>> {
>>>> @@ -1039,10 +1436,12 @@ static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys,
>>>> WRITE_ONCE(subsys->iopolicy, iopolicy);
>>>> - /* iopolicy changes clear the mpath by design */
>>>> + /* iopolicy changes clear/reset the mpath by design */
>>>> mutex_lock(&nvme_subsystems_lock);
>>>> list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
>>>> nvme_mpath_clear_ctrl_paths(ctrl);
>>>> + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
>>>> + nvme_mpath_set_ctrl_paths(ctrl);
>>>> mutex_unlock(&nvme_subsystems_lock);
>>>> pr_notice("subsysnqn %s iopolicy changed from %s to %s\n",
>>>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>>>> index 102fae6a231c..715c7053054c 100644
>>>> --- a/drivers/nvme/host/nvme.h
>>>> +++ b/drivers/nvme/host/nvme.h
>>>> @@ -28,7 +28,10 @@ extern unsigned int nvme_io_timeout;
>>>> extern unsigned int admin_timeout;
>>>> #define NVME_ADMIN_TIMEOUT (admin_timeout * HZ)
>>>> -#define NVME_DEFAULT_KATO 5
>>>> +#define NVME_DEFAULT_KATO 5
>>>> +
>>>> +#define NVME_DEFAULT_ADP_EWMA_SHIFT 3
>>>> +#define NVME_DEFAULT_ADP_WEIGHT_TIMEOUT (15 * NSEC_PER_SEC)
>>> You need these defines outside of nvme-mpath?
>>>
>> Currently, those macros are used in nvme/host/core.c.
>> I can move them inside that source file.
>>
>>>> #ifdef CONFIG_ARCH_NO_SG_CHAIN
>>>> #define NVME_INLINE_SG_CNT 0
>>>> @@ -421,6 +424,7 @@ enum nvme_iopolicy {
>>>> NVME_IOPOLICY_NUMA,
>>>> NVME_IOPOLICY_RR,
>>>> NVME_IOPOLICY_QD,
>>>> + NVME_IOPOLICY_ADAPTIVE,
>>>> };
>>>> struct nvme_subsystem {
>>>> @@ -459,6 +463,37 @@ struct nvme_ns_ids {
>>>> u8 csi;
>>>> };
>>>> +enum nvme_stat_group {
>>>> + NVME_STAT_READ,
>>>> + NVME_STAT_WRITE,
>>>> + NVME_STAT_OTHER,
>>>> + NVME_NUM_STAT_GROUPS
>>>> +};
>>> I see you have stats per io direction. However you don't have it per IO size. I wonder
>>> how this plays into this iopolicy.
>>>
>> Yes, you're correct, and as mentioned earlier we'd measure latency per
>> 512-byte block.
>>
>>>> +
>>>> +struct nvme_path_stat {
>>>> + u64 nr_samples; /* total num of samples processed */
>>>> + u64 nr_ignored; /* num. of samples ignored */
>>>> + u64 slat_ns; /* smoothed (ewma) latency in nanoseconds */
>>>> + u64 score; /* score used for weight calculation */
>>>> + u64 last_weight_ts; /* timestamp of the last weight calculation */
>>>> + u64 sel; /* num of times this path is selected for I/O */
>>>> + u64 batch; /* accumulated latency sum for current window */
>>>> + u32 batch_count; /* num of samples accumulated in current window */
>>>> + u32 weight; /* path weight */
>>>> + u32 credit; /* path credit for I/O forwarding */
>>>> +};
>>> I'm still not convinced that having this be per-cpu-per-ns really makes sense.
>> I understand your concern about whether it really makes sense to keep this
>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>> stat per-hctx instead of per-CPU.
>>
>> However, as mentioned earlier, during path selection we cannot reliably map an
>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>> that are local/near to the CPU issuing the request, which may lead to better
>> latency characteristics.
>
> With this I tend to agree. but per-cpu has lots of other churns IMO.
> Maybe the answer is that paths weights are maintained per NUMA node?
> then accessing these weights in the fast-path is still cheap enough?
That’s a fair point, and I agree that per-CPU accounting can introduce additional
variability. However, moving to per-NUMA path weights would implicitly narrow the
scope of what we are trying to measure, as it would largely exclude components of
end-to-end latency that arise from scheduler behavior and application-level scheduling
effects. As discussed earlier, the intent of the adaptive policy is to capture the
actual I/O cost observed by the workload, which includes not only path and controller
locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
maintaining per-CPU path weights remains a better fit for the stated goal. It also
offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
preserving a true end-to-end view of path latency, agreed?
I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
host system has 32 cpus, so iperf3 was configured with 32 parallel TCP streams.
And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
ioengine=io_uring. Below are the aggregated throughput results observed under
different NVMe multipath I/O policies:
        numa           round-robin    queue-depth    adaptive
        -----------    -----------    -----------    ---------
READ:   61.1 MiB/s     87.2 MiB/s     93.1 MiB/s     107 MiB/s
WRITE:  95.8 MiB/s     138 MiB/s      159 MiB/s      179 MiB/s
RW:     R:29.8 MiB/s   R:53.1 MiB/s   R:58.8 MiB/s   R:66.6 MiB/s
        W:29.6 MiB/s   W:52.7 MiB/s   W:58.2 MiB/s   W:65.9 MiB/s
These results show that under combined CPU and network stress, the adaptive I/O policy
consistently delivers higher throughput across read, write, and mixed workloads when
compared against existing policies.
Thanks,
--Nilay
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2025-12-18 11:19 ` Nilay Shroff
@ 2025-12-18 13:46 ` Hannes Reinecke
2025-12-23 14:50 ` Nilay Shroff
2025-12-25 12:28 ` Sagi Grimberg
1 sibling, 1 reply; 28+ messages in thread
From: Hannes Reinecke @ 2025-12-18 13:46 UTC (permalink / raw)
To: Nilay Shroff, Sagi Grimberg, linux-nvme
Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce
On 12/18/25 12:19, Nilay Shroff wrote:
>
>
> On 12/16/25 5:06 AM, Sagi Grimberg wrote:
>>
>>
>> On 13/12/2025 9:27, Nilay Shroff wrote:
>>>
>>> On 12/12/25 6:34 PM, Sagi Grimberg wrote:
>>>>
>>>> On 05/11/2025 12:33, Nilay Shroff wrote:
>>>>> This commit introduces a new I/O policy named "adaptive". Users can
>>>>> configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
>>>>> subsystemX/iopolicy"
>>>>>
>>>>> The adaptive policy dynamically distributes I/O based on measured
>>>>> completion latency. The main idea is to calculate latency for each path,
>>>>> derive a weight, and then proportionally forward I/O according to those
>>>>> weights.
>>>>>
>>>>> To ensure scalability, path latency is measured per-CPU. Each CPU
>>>>> maintains its own statistics, and I/O forwarding uses these per-CPU
>>>>> values.
>>>> So a given cpu would select path-a vs. another cpu that may select path-b?
>>>> How does that play with less queues than cpu cores? what happens to cores
>>>> that have low traffic?
>>>>
>>> The path-selection logic does not depend on the relationship between the number
>>> of CPUs and the number of hardware queues. It simply selects a path based on the
>>> per-CPU path score/credit, which reflects the relative performance of each available
>>> path.
>>> For example, assume we have two paths (A and B) to the same shared namespace.
>>> For each CPU, we maintain a smoothed latency estimate for every path. From these
>>> latency values we derive a per-path score or credit. The credit represents the relative
>>> share of I/O that each path should receive: a path with lower observed latency gets more
>>> credit, and a path with higher latency gets less.
>>
>> I understand that the stats are maintained per-cpu, however I am not sure that having a
>> per-cpu path weights make sense. meaning that if we have paths a,b,c and for cpu0 we'll
>> have one set of weights and for cpu1 we'll have another set of weights.
>>
>> What if a given cpu happened to schedule some other application in a way that impacts
>> completion latency? Won't that skew the sampling? That is not related to the path at all. It
>> is possibly more noticeable in tcp, which completes in a kthread context.
>>
>> What do we lose if the 15-second weight assignment averages all the cpus' sampling? Won't
>> that mitigate to some extent the issue of non-path related latency skew?
>>
> You’re right — what you’re describing is indeed possible. The intent of the adaptive policy,
> however, is to measure end-to-end I/O latency, rather than isolating only the raw path or
> transport latency.
> The observed completion latency intentionally includes all components that affect I/O from
> the host’s perspective: path latency, fabric or protocol stack latency (for example, TCP/IP),
> scheduler-induced delays, and the target device’s own I/O latency. By capturing the full
> end-to-end behavior, the policy reflects the actual cost of issuing I/O on a given path.
> Scheduler-related latency can vary over time due to workload placement or CPU contention,
> and this variability is accounted for by the design. Since per-path weights are recalculated
> periodically (for example, every 15 seconds), any sustained changes in CPU load or scheduling
> behavior are naturally incorporated into the path scoring. As a result, the policy can
> automatically adapt/adjust and rebalance I/O toward paths that are performing better under
> current system conditions.
> In short, while per-CPU sampling may include effects beyond the physical path itself, this is
> intentional and allows the adaptive policy to respond in real time to changing end-to-end
> performance characteristics.
>
That was not the point.
Thing is, we _cannot_ move I/O away from a given CPU. Once I/O
originates from a given CPU, it will stay on that CPU irrespective of
the path taken.
Remember: the I/O scheduler decides which path a given I/O should take,
not which CPU any given I/O should run on.
So if a specific CPU has increased latency due to additional tasks /
interrupts running on it, it will show up _on all paths_, but only for
weights on that CPU.
And Sagi's point was that it would skew the measurement.
Which it certainly does.
But on the other hand _all_ I/O on this CPU will be affected, and we
don't have crosstalk to other CPUs (as this is a percpu counter).
So the only change would be that we're seeing increased numbers here;
the relation between paths won't change.
(Except in the really pathological case where the added latency is so
high that the path latency will get lost in the noise. But then it
wouldn't matter anyway as it'll be slow as hell.)
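
To put rough numbers on this argument, here is a minimal sketch that applies
the score/weight formula quoted from the patch description
(score = NSEC_PER_SEC / latency, weight = score * 100 / total_score). It is
not code from the patch: calc_weights() and MAX_PATHS are hypothetical names
used only for illustration, and the latencies are made up.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC	1000000000ULL
#define MAX_PATHS	8	/* hypothetical limit, for this sketch only */

/*
 * Derive percentage weights from smoothed per-path latencies (in ns)
 * using the formula quoted in this thread. Integer division truncates,
 * so the weights may sum to slightly less than 100.
 */
static void calc_weights(const uint64_t *lat_ns, uint32_t *weight, int npaths)
{
	uint64_t score[MAX_PATHS];
	uint64_t total = 0;
	int i;

	assert(npaths > 0 && npaths <= MAX_PATHS);

	for (i = 0; i < npaths; i++) {
		/* keep every score non-zero so that total is never 0 */
		assert(lat_ns[i] > 0 && lat_ns[i] <= NSEC_PER_SEC);
		score[i] = NSEC_PER_SEC / lat_ns[i];
		total += score[i];
	}

	for (i = 0; i < npaths; i++)
		weight[i] = (uint32_t)(score[i] * 100 / total);
}
```

With path latencies of 1 ms and 2 ms this yields weights of 66 and 33. If the
issuing CPU adds a constant 10 ms to both paths (11 ms and 12 ms), the weights
become 52 and 47: the ordering of the paths is preserved, but the split moves
toward even as the CPU-local delay starts to dominate the path latency.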
>>>
>>> I/O distribution is thus governed directly by the available credits on that CPU. When the
>>> NVMe multipath driver performs path selection, it chooses the path with sufficient credits,
>>> updates the bio’s bdev to correspond to that path, and submits the bio. Only after this
>>> point does the block layer map the bio to an hctx through the usual ctx->hctx mapping (i.e.,
>>> matching the issuing CPU to the appropriate hardware queue). In other words, the multipath
>>> policy runs above the block-layer queueing logic, and the number of hardware queues does
>>> not affect how paths are scored or selected.
>>
>> This is potentially another problem. application may jump between cpu cores due to scheduling
>> constraints. In this case, how is the path selection policy adhering to the path weights?
>>
>> What I'm trying to say here is that the path selection should be inherently reflective on the path,
>> not the cpu core that was accessing this path. What I am concerned about, is how this behaves
>> in the real-world. Your tests are running in a very distinct artificial path variance, and it does not include
>> other workloads that are running on the system that can impact completion latency.
>>
>> It is possible that what I'm raising here is not a real concern, but I think we need to be able to demonstrate
>> that.
>>
>
> In real-world systems, as stated earlier, the completion latency is influenced not only by
> the physical path but also by system load, scheduler behavior, and transport stack processing.
> By incorporating all of these factors into the latency measurement, the adaptive policy reflects
> the true cost of issuing I/O on a given path under current conditions. This allows it to respond
> to both path-level and system-level congestion.
>
> In practice, during experiments with two paths (A and B), I observed that when additional latency—
> whether introduced via the path itself or through system load—was present on path A, subsequent I/O
> was automatically steered toward path B. Once conditions on path A improved, the policy rebalanced
> I/O based on the updated path weights. This behavior demonstrates that the policy adapts dynamically
> and remains effective even in the presence of CPU migration and competing workloads.
> Overall, while per-CPU sampling may appear counterintuitive at first, it enables the policy to capture
> real-world end-to-end performance and continuously adjust I/O distribution in response to changing
> system and path conditions.
>
>>>
>>>>> Every ~15 seconds, a simple average latency of the per-CPU batched
>>>>> samples is computed and fed into an Exponentially Weighted Moving
>>>>> Average (EWMA):
>>>> I suggest to have iopolicy name reflect ewma. maybe "ewma-lat"?
>>> Okay that sounds good! Shall we name it "ewma-lat" or "weighted-lat"?
>>
>> weighted-lat is simpler.
> Okay, I'll rename it to "weighted-lat".
>
>>>
>>> Path weights are then derived from the smoothed (EWMA)
>>> latency as follows (example with two paths A and B):
>>>
>>> path_A_score = NSEC_PER_SEC / path_A_ewma_latency
>>> path_B_score = NSEC_PER_SEC / path_B_ewma_latency
>>> total_score = path_A_score + path_B_score
>>>
>>> path_A_weight = (path_A_score * 100) / total_score
>>> path_B_weight = (path_B_score * 100) / total_score
>>>
>>>> What happens to R/W mixed workloads? What happens when the I/O pattern
>>>> has a distribution of block sizes?
>>>>
>>> We maintain separate metrics for READ and WRITE traffic, and during path
>>> selection we use the appropriate metric depending on the I/O type.
>>>
>>> Regarding block-size variability: the current implementation does not yet
>>> account for I/O size. This is an important point — thank you for raising it.
>>> I discussed this today with Hannes at LPC, and we agreed that a practical
>>> approach is to normalize latency per 512-byte block. For our purposes, we
>>> do not need an exact latency value; a relative latency metric is sufficient,
>>> as it ultimately feeds into path scoring. A path with higher latency ends up
>>> with a lower score, and a path with lower latency gets a higher score — the
>>> exact absolute values are less important than maintaining consistent proportional
>>> relationships.
>>
>> I am not sure that normalizing to 512-byte blocks is a good proxy. I think that large I/O will
>> have much lower amortized latency per 512-byte block, which could create a false bias
>> toward placing a high weight on a path if that path happened to host large I/Os, no?
>>
> Hmm, I think yes, good point, I think for nvme over fabrics this could be true.
>
Although technically we are then measuring two different things (IO
latency vs block latency). But yeah, block latency might be better
suited for the normal case; I do wonder, though, if for high-speed
links we do see a difference as the data transfer time is getting
really fast...
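
For illustration, the difference between the two metrics can be sketched as
below. This is only an assumed form of the "normalize latency per 512-byte
block" idea discussed above, not code from the patch; lat_per_512b() is a
hypothetical helper and the sample numbers are invented.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Block latency: the completion latency of one I/O divided by the
 * number of 512-byte blocks it transferred (rounded up).
 */
static uint64_t lat_per_512b(uint64_t lat_ns, uint32_t io_bytes)
{
	uint64_t nr_blocks;

	assert(io_bytes > 0);
	nr_blocks = ((uint64_t)io_bytes + 511) / 512;
	return lat_ns / nr_blocks;
}
```

Taking a 4 KiB I/O that completes in 100000 ns and a 1 MiB I/O that completes
in 400000 ns as made-up samples, the per-I/O latency is four times higher for
the large request, while the per-block figure drops from 12500 ns to 195 ns.
A path that happens to carry mostly large requests would therefore look much
better under the normalized metric.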
[ .. ]
>>> I understand your concern about whether it really makes sense to keep this
>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>> stat per-hctx instead of per-CPU.
>>>
>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>> that are local/near to the CPU issuing the request, which may lead to better
>>> latency characteristics.
>>
>> With this I tend to agree. but per-cpu has lots of other churns IMO.
>> Maybe the answer is that paths weights are maintained per NUMA node?
>> then accessing these weights in the fast-path is still cheap enough?
>
> That’s a fair point, and I agree that per-CPU accounting can introduce additional
> variability. However, moving to per-NUMA path weights would implicitly narrow the
> scope of what we are trying to measure, as it would largely exclude components of
> end-to-end latency that arise from scheduler behavior and application-level scheduling
> effects. As discussed earlier, the intent of the adaptive policy is to capture the
> actual I/O cost observed by the workload, which includes not only path and controller
> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
> maintaining per-CPU path weights remains a better fit for the stated goal. It also
> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
> preserving a true end-to-end view of path latency, agreed?
>
Well, for fabrics you can easily have several paths connected to the
same NUMA node (like in the classical 'two initiator ports
cross-connected to two target ports', resulting in four paths in total.
But two of these paths will always be on the same NUMA node).
So that doesn't work out.
> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
> host system has 32 cpus, so iperf3 was configured with 32 parallel TCP streams.
> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
> ioengine=io_uring. Below are the aggregated throughput results observed under
> different NVMe multipath I/O policies:
>
>         numa           round-robin    queue-depth    adaptive
>         -----------    -----------    -----------    ---------
> READ:   61.1 MiB/s     87.2 MiB/s     93.1 MiB/s     107 MiB/s
> WRITE:  95.8 MiB/s     138 MiB/s      159 MiB/s      179 MiB/s
> RW:     R:29.8 MiB/s   R:53.1 MiB/s   R:58.8 MiB/s   R:66.6 MiB/s
>         W:29.6 MiB/s   W:52.7 MiB/s   W:58.2 MiB/s   W:65.9 MiB/s
>
> These results show that under combined CPU and network stress, the adaptive I/O policy
> consistently delivers higher throughput across read, write, and mixed workloads when
> compared against existing policies.
>
And that is probably the best argument; we should put it under stress
with various scenarios. I must admit I am _really_ in favour of this
iopolicy, as it would be able to handle any temporary issues on the
fabric (or backend) without the need of additional signalling.
Talk to me about FPIN ...
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2025-12-18 13:46 ` Hannes Reinecke
@ 2025-12-23 14:50 ` Nilay Shroff
2025-12-25 12:45 ` Sagi Grimberg
0 siblings, 1 reply; 28+ messages in thread
From: Nilay Shroff @ 2025-12-23 14:50 UTC (permalink / raw)
To: Hannes Reinecke, Sagi Grimberg, linux-nvme
Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce
[...]
>>> I am not sure that normalizing to 512-byte blocks is a good proxy. I think that large I/O will
>>> have much lower amortized latency per 512-byte block, which could create a false bias
>>> toward placing a high weight on a path if that path happened to host large I/Os, no?
>>>
>> Hmm, I think yes, good point, I think for nvme over fabrics this could be true.
>>
> Although technically we are then measuring two different things (IO
> latency vs block latency). But yeah, block latency might be better
> suited for the normal case; I do wonder, though, if for high-speed
> links we do see a difference as the data transfer time is getting
> really fast...
>
For a high-speed/high-bandwidth NIC the transfer would be very fast,
though I think for a very large I/O size we would see higher latency due
to TCP segmentation and reassembly.
On my nvmf-tcp testbed, I do see the latency differences as shown below
for varying I/O size (captured for random-read direct I/O workload):
I/O-size    Avg-latency(usec)
512         12113
1k          10058
2k          11246
4k          12458
8k          12189
16k         11617
32k         17686
64k         28504
128k        59013
256k        118984
512k        233428
1M          460000
As can be seen, for smaller block sizes (512B–16K), latency remains relatively
stable in the ~10–12 ms range. Starting at 32K and more noticeably at 64K and
above, latency increases significantly and roughly doubles with each step in
block size. Based on this data, I propose using coarse-grained I/O size buckets
to preserve latency characteristics while avoiding excessive fragmentation of
statistics. The suggested bucket layout is as follows:
Bucket        block-size-range
small         512B-32k
medium        32k-64k
large-64k     64k-128k
large-128k    128k-256k
large-256k    256k-512k
large-512k    512k-1M
very-large    >=1M
In this model,
- A single small bucket captures latency for I/O sizes where latency remains
largely uniform.
- A medium bucket captures the transition region.
- Separate large buckets preserve the rapidly increasing latency behavior
observed for larger block sizes.
- A very-large bucket handles any I/O beyond 1M.
This approach allows the adaptive policy to retain meaningful latency distinctions across
I/O size regimes while keeping the number of buckets manageable and statistically stable.
Does that make sense?
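
As a rough sketch of how the bucket lookup could look (not code from the
patch): the enum and io_size_to_bucket() below are hypothetical names, and
because the ranges in the table share their boundary values, the sketch
assumes each range is half-open, so that an I/O of exactly 32k falls into
"medium", exactly 64k into "large-64k", and so on.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bucket indices mirroring the table above. */
enum io_size_bucket {
	BUCKET_SMALL,		/* below 32k */
	BUCKET_MEDIUM,		/* 32k up to, but not including, 64k */
	BUCKET_LARGE_64K,	/* 64k up to, but not including, 128k */
	BUCKET_LARGE_128K,	/* 128k up to, but not including, 256k */
	BUCKET_LARGE_256K,	/* 256k up to, but not including, 512k */
	BUCKET_LARGE_512K,	/* 512k up to, but not including, 1M */
	BUCKET_VERY_LARGE,	/* 1M and above */
	NR_BUCKETS,
};

/* Map an I/O size in bytes to its statistics bucket. */
static enum io_size_bucket io_size_to_bucket(uint32_t bytes)
{
	/* exclusive upper bound of every bucket except the last one */
	static const uint32_t upper[] = {
		32u << 10, 64u << 10, 128u << 10,
		256u << 10, 512u << 10, 1u << 20,
	};
	int i;

	assert(sizeof(upper) / sizeof(upper[0]) == NR_BUCKETS - 1);

	for (i = 0; i < NR_BUCKETS - 1; i++)
		if (bytes < upper[i])
			return (enum io_size_bucket)i;

	return BUCKET_VERY_LARGE;
}
```

With seven buckets the linear scan stays cheap; the per-path statistics would
then be kept per bucket (and per I/O direction) instead of as a single value.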
> [ .. ]
>>>> I understand your concern about whether it really makes sense to keep this
>>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>>> stat per-hctx instead of per-CPU.
>>>>
>>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>>> that are local/near to the CPU issuing the request, which may lead to better
>>>> latency characteristics.
>>>
>>> With this I tend to agree. but per-cpu has lots of other churns IMO.
>>> Maybe the answer is that paths weights are maintained per NUMA node?
>>> then accessing these weights in the fast-path is still cheap enough?
>>
>> That’s a fair point, and I agree that per-CPU accounting can introduce additional
>> variability. However, moving to per-NUMA path weights would implicitly narrow the
>> scope of what we are trying to measure, as it would largely exclude components of
>> end-to-end latency that arise from scheduler behavior and application-level scheduling
>> effects. As discussed earlier, the intent of the adaptive policy is to capture the
>> actual I/O cost observed by the workload, which includes not only path and controller
>> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
>> maintaining per-CPU path weights remains a better fit for the stated goal. It also
>> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
>> preserving a true end-to-end view of path latency, agreed?
>>
> Well, for fabrics you can easily have several paths connected to the
> same NUMA node (like in the classical 'two initiator ports
> cross-connected to two target ports', resulting in four paths in total.
> But two of these paths will always be on the same NUMA node).
> So that doesn't work out.
>
>> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
>> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
>> host system has 32 cpus, so iperf3 was configured with 32 parallel TCP streams.
>> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
>> ioengine=io_uring. Below are the aggregated throughput results observed under
>> different NVMe multipath I/O policies:
>>
>>         numa           round-robin    queue-depth    adaptive
>>         -----------    -----------    -----------    ---------
>> READ:   61.1 MiB/s     87.2 MiB/s     93.1 MiB/s     107 MiB/s
>> WRITE:  95.8 MiB/s     138 MiB/s      159 MiB/s      179 MiB/s
>> RW:     R:29.8 MiB/s   R:53.1 MiB/s   R:58.8 MiB/s   R:66.6 MiB/s
>>         W:29.6 MiB/s   W:52.7 MiB/s   W:58.2 MiB/s   W:65.9 MiB/s
>>
>> These results show that under combined CPU and network stress, the adaptive I/O policy
>> consistently delivers higher throughput across read, write, and mixed workloads when
>> compared against existing policies.
>>
> And that is probably the best argument; we should put it under stress
> with various scenarios. I must admit I am _really_ in favour of this
> iopolicy, as it would be able to handle any temporary issues on the
> fabric (or backend) without the need of additional signalling.
> Talk to me about FPIN ...
>
I ran additional experiments on the NVMe-oF TCP testbed. The host has 32 cpus so fio
was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and ioengine=io_uring.
Below are the aggregated throughput results observed under different NVMe multipath
I/O policies.
i) Stressing all 32 cpus using stress-ng
All 32 CPUs were stressed using:
# stress-ng --cpu 0 --cpu-method all -t 60m
        numa           round-robin    queue-depth    adaptive
        -----------    -----------    -----------    ---------
READ:   159 MiB/s      193 MiB/s      215 MiB/s      255 MiB/s
WRITE:  188 MiB/s      186 MiB/s      195 MiB/s      199 MiB/s
RW:     R:83.4 MiB/s   R:101 MiB/s    R:104 MiB/s    R:111 MiB/s
        W:83.3 MiB/s   W:101 MiB/s    W:105 MiB/s    W:112 MiB/s
ii) Symmetric paths (No CPU stress and no induced network load):
        numa           round-robin    queue-depth    adaptive
        -----------    -----------    -----------    ---------
READ:   171 MiB/s      298 MiB/s      320 MiB/s      348 MiB/s
WRITE:  229 MiB/s      419 MiB/s      442 MiB/s      460 MiB/s
RW:     R:93.0 MiB/s   R:166 MiB/s    R:171 MiB/s    R:179 MiB/s
        W:94.2 MiB/s   W:168 MiB/s    W:168 MiB/s    W:178 MiB/s
These results show that the adaptive I/O policy consistently delivers higher
throughput under CPU stress and asymmetric path conditions. In case of symmetric
paths the adaptive policy achieves throughput comparable to—or slightly
better than—existing policies.
Thanks,
--Nilay
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2025-12-18 11:19 ` Nilay Shroff
2025-12-18 13:46 ` Hannes Reinecke
@ 2025-12-25 12:28 ` Sagi Grimberg
1 sibling, 0 replies; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-25 12:28 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: hare, hch, kbusch, dwagner, axboe, kanie, gjoyce
On 18/12/2025 13:19, Nilay Shroff wrote:
>
> On 12/16/25 5:06 AM, Sagi Grimberg wrote:
>>
>> On 13/12/2025 9:27, Nilay Shroff wrote:
>>> On 12/12/25 6:34 PM, Sagi Grimberg wrote:
>>>> On 05/11/2025 12:33, Nilay Shroff wrote:
>>>>> This commit introduces a new I/O policy named "adaptive". Users can
>>>>> configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
>>>>> subsystemX/iopolicy"
>>>>>
>>>>> The adaptive policy dynamically distributes I/O based on measured
>>>>> completion latency. The main idea is to calculate latency for each path,
>>>>> derive a weight, and then proportionally forward I/O according to those
>>>>> weights.
>>>>>
>>>>> To ensure scalability, path latency is measured per-CPU. Each CPU
>>>>> maintains its own statistics, and I/O forwarding uses these per-CPU
>>>>> values.
>>>> So a given cpu would select path-a vs. another cpu that may select path-b?
>>>> How does that play with less queues than cpu cores? what happens to cores
>>>> that have low traffic?
>>>>
>>> The path-selection logic does not depend on the relationship between the number
>>> of CPUs and the number of hardware queues. It simply selects a path based on the
>>> per-CPU path score/credit, which reflects the relative performance of each available
>>> path.
>>> For example, assume we have two paths (A and B) to the same shared namespace.
>>> For each CPU, we maintain a smoothed latency estimate for every path. From these
>>> latency values we derive a per-path score or credit. The credit represents the relative
>>> share of I/O that each path should receive: a path with lower observed latency gets more
>>> credit, and a path with higher latency gets less.
>> I understand that the stats are maintained per-cpu, however I am not sure that having a
>> per-cpu path weights make sense. meaning that if we have paths a,b,c and for cpu0 we'll
>> have one set of weights and for cpu1 we'll have another set of weights.
>>
>> What if a given cpu happened to schedule some other application in a way that impacts
>> completion latency? Won't that skew the sampling? That is not related to the path at all. It
>> is possibly more noticeable in tcp, which completes in a kthread context.
>>
>> What do we lose if the 15-second weight assignment averages all the cpus' sampling? Won't
>> that mitigate to some extent the issue of non-path related latency skew?
>>
> You’re right — what you’re describing is indeed possible. The intent of the adaptive policy,
> however, is to measure end-to-end I/O latency, rather than isolating only the raw path or
> transport latency.
> The observed completion latency intentionally includes all components that affect I/O from
> the host’s perspective: path latency, fabric or protocol stack latency (for example, TCP/IP),
> scheduler-induced delays, and the target device’s own I/O latency. By capturing the full
> end-to-end behavior, the policy reflects the actual cost of issuing I/O on a given path.
> Scheduler-related latency can vary over time due to workload placement or CPU contention,
> and this variability is accounted for by the design. Since per-path weights are recalculated
> periodically (for example, every 15 seconds), any sustained changes in CPU load or scheduling
> behavior are naturally incorporated into the path scoring. As a result, the policy can
> automatically adapt/adjust and rebalance I/O toward paths that are performing better under
> current system conditions.
> In short, while per-CPU sampling may include effects beyond the physical path itself, this is
> intentional and allows the adaptive policy to respond in real time to changing end-to-end
> performance characteristics.
The issue is that you are crediting latency to a path where portions of
it (or maybe even the majority) may be completely unrelated to the path
at all. What I mean is that you are accounting for things that are
unrelated to the path selection.
In my mind, it would be better to amortize the cpu-local aspects of the
path selection (e.g. average out latency across cpus, or across the cpus
of a numa node) when calculating credits, and then have all cpus use the
same credits.
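
A minimal sketch of the averaging step suggested here, under stated
assumptions: avg_path_latency() is a hypothetical helper, the inputs stand in
for the per-CPU slat_ns and nr_samples fields of struct nvme_path_stat, and
skipping CPUs without samples is a choice made for this sketch rather than
something proposed in the thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Average one path's per-CPU smoothed latency over the CPUs that
 * actually collected samples, so that all CPUs can derive their
 * credits from the same value. Returns 0 if no CPU has samples.
 */
static uint64_t avg_path_latency(const uint64_t *slat_ns,
				 const uint64_t *nr_samples, int nr_cpus)
{
	uint64_t sum = 0;
	int used = 0;
	int cpu;

	assert(nr_cpus > 0);

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (!nr_samples[cpu])
			continue;	/* no estimate from this CPU */
		sum += slat_ns[cpu];
		used++;
	}

	return used ? sum / used : 0;
}
```

The same loop restricted to the CPUs of one NUMA node would give the
per-node variant. An unweighted mean lets a lightly loaded CPU count as much
as a busy one; weighting by nr_samples would be an alternative.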
>
>>> I/O distribution is thus governed directly by the available credits on that CPU. When the
>>> NVMe multipath driver performs path selection, it chooses the path with sufficient credits,
>>> updates the bio’s bdev to correspond to that path, and submits the bio. Only after this
>>> point does the block layer map the bio to an hctx through the usual ctx->hctx mapping (i.e.,
>>> matching the issuing CPU to the appropriate hardware queue). In other words, the multipath
>>> policy runs above the block-layer queueing logic, and the number of hardware queues does
>>> not affect how paths are scored or selected.
>> This is potentially another problem. application may jump between cpu cores due to scheduling
>> constraints. In this case, how is the path selection policy adhering to the path weights?
>>
>> What I'm trying to say here is that the path selection should be inherently reflective on the path,
>> not the cpu core that was accessing this path. What I am concerned about, is how this behaves
>> in the real-world. Your tests are running in a very distinct artificial path variance, and it does not include
>> other workloads that are running on the system that can impact completion latency.
>>
>> It is possible that what I'm raising here is not a real concern, but I think we need to be able to demonstrate
>> that.
>>
> In real-world systems, as stated earlier, the completion latency is influenced not only by
> the physical path but also by system load, scheduler behavior, and transport stack processing.
> By incorporating all of these factors into the latency measurement, the adaptive policy reflects
> the true cost of issuing I/O on a given path under current conditions. This allows it to respond
> to both path-level and system-level congestion.
>
> In practice, during experiments with two paths (A and B), I observed that when additional latency—
> whether introduced via the path itself or through system load—was present on path A, subsequent I/O
> was automatically steered toward path B. Once conditions on path A improved, the policy rebalanced
> I/O based on the updated path weights. This behavior demonstrates that the policy adapts dynamically
> and remains effective even in the presence of CPU migration and competing workloads.
> Overall, while per-CPU sampling may appear counterintuitive at first, it enables the policy to capture
> real-world end-to-end performance and continuously adjust I/O distribution in response to changing
> system and path conditions.
I just don't understand how the presence of additional workloads or
system cpu load distribution should affect the path that you select. I
mean, you can choose the "worst" path but run on a cpu that happens to
run just your thread, and score it maybe better than the "best" path if
you are unfortunate enough to run on a cpu that is currently task
switching multiple cpu-intensive threads...
>> With this I tend to agree. but per-cpu has lots of other churns IMO.
>> Maybe the answer is that paths weights are maintained per NUMA node?
>> then accessing these weights in the fast-path is still cheap enough?
> That’s a fair point, and I agree that per-CPU accounting can introduce additional
> variability. However, moving to per-NUMA path weights would implicitly narrow the
> scope of what we are trying to measure, as it would largely exclude components of
> end-to-end latency that arise from scheduler behavior and application-level scheduling
> effects.
Not sure I agree. I argue that it will help you clean up noise which is
unrelated to the evaluation of "path quality".
> As discussed earlier, the intent of the adaptive policy is to capture the
> actual I/O cost observed by the workload, which includes not only path and controller
> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
> maintaining per-CPU path weights remains a better fit for the stated goal. It also
> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
> preserving a true end-to-end view of path latency, agreed?
It's not intuitive to me why it is not just adding noise.
>
> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
> host system has 32 cpus, so iperf3 was configured with 32 parallel TCP streams.
> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
> ioengine=io_uring. Below are the aggregated throughput results observed under
> different NVMe multipath I/O policies:
>
>         numa           round-robin    queue-depth    adaptive
>         -----------    -----------    -----------    ---------
> READ:   61.1 MiB/s     87.2 MiB/s     93.1 MiB/s     107 MiB/s
> WRITE:  95.8 MiB/s     138 MiB/s      159 MiB/s      179 MiB/s
> RW:     R:29.8 MiB/s   R:53.1 MiB/s   R:58.8 MiB/s   R:66.6 MiB/s
>         W:29.6 MiB/s   W:52.7 MiB/s   W:58.2 MiB/s   W:65.9 MiB/s
>
> These results show that under combined CPU and network stress, the adaptive I/O policy
> consistently delivers higher throughput across read, write, and mixed workloads when
> compared against existing policies.
I'm not arguing about other I/O policies or the comparison against them.
We are discussing your implementation.
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2025-12-23 14:50 ` Nilay Shroff
@ 2025-12-25 12:45 ` Sagi Grimberg
2025-12-26 18:16 ` Nilay Shroff
0 siblings, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-25 12:45 UTC (permalink / raw)
To: Nilay Shroff, Hannes Reinecke, linux-nvme
Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce
On 23/12/2025 16:50, Nilay Shroff wrote:
> [...]
>>>> I am not sure that normalizing to 512 blocks is a good proxy. I think that large IO will
>>>> have much lower amortized latency per 512 block, which could create a false bias
>>>> to place a high weight on a path, if that path happened to host large I/Os no?
>>>>
>>> Hmm, I think yes, good point, I think for nvme over fabrics this could be true.
>>>
>> Although technically we are then measure two different things (IO latency vs block latency). But yeah, block latency might be better
>> suited for the normal case; I do wonder, though, if for high-speed
>> links we do see a difference as the data transfer time is getting
>> really fast...
>>
> For a high speed/bandwidth NIC card the transfer speed would be very fast,
> though I think for a very large I/O size, we would see a higher latency due
> to tcp segmentation and re-assembly.
>
> On my nvmf-tcp testbed, I do see the latency differences as shown below
> for varying I/O size (captured for random-read direct I/O workload):
> I/O-size Avg-latency(usec)
> 512 12113
> 1k 10058
> 2k 11246
> 4k 12458
> 8k 12189
> 16k 11617
> 32k 17686
> 64k 28504
> 128k 59013
> 256k 118984
> 512k 233428
> 1M 460000
>
> As can be seen, for smaller block sizes (512B–16K), latency remains relatively
> stable in the ~10–12 ms range. Starting at 32K and more noticeably at 64K and
> above, latency increases significantly and roughly doubles with each step in
> block size. Based on this data, I propose using coarse-grained I/O size buckets
> to preserve latency characteristics while avoiding excessive fragmentation of
> statistics. The suggested bucket layout is as follows:
>
> Bucket block-size-range
> small 512B-32k
> medium 32k-64k
> large-64k 64k-128k
> large-128k 128k-256k
> large-256k 256k-512k
> large-512k 512k-1M
> very-large >=1M
>
> In this model,
> - A single small bucket captures latency for I/O sizes where latency remains
> largely uniform.
> - A medium bucket captures the transition region.
> - Separate large buckets preserve the rapidly increasing latency behavior
> observed for larger block sizes.
> - A very-large bucket handles any I/O beyond 1M.
>
> This approach allows the adaptive policy to retain meaningful latency distinctions across
> I/O size regimes while keeping the number of buckets manageable and statistically stable,
> make sense?
Yes
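As a side calculation on the averages quoted above, the sketch below
amortizes each latency over the number of 512-byte blocks in the I/O. It
only restates the quoted table in Python; the names AVG_LATENCY_USEC and
latency_per_sector are made up for illustration and are not part of the
patch series.

```python
# Average latencies (usec) copied from the table quoted above,
# keyed by I/O size in bytes.
AVG_LATENCY_USEC = {
    512: 12113,
    1024: 10058,
    2048: 11246,
    4096: 12458,
    8192: 12189,
    16384: 11617,
    32768: 17686,
    65536: 28504,
    131072: 59013,
    262144: 118984,
    524288: 233428,
    1048576: 460000,
}

SECTOR = 512  # normalization unit discussed in the thread


def latency_per_sector(io_size, latency_usec):
    """Latency amortized over the number of 512-byte blocks in one I/O."""
    return latency_usec / (io_size // SECTOR)


per_sector = {
    size: latency_per_sector(size, lat)
    for size, lat in AVG_LATENCY_USEC.items()
}

for size, value in per_sector.items():
    print(f"{size:>8} B  {value:10.1f} usec per 512 B")
```

With these numbers a 1M I/O comes out at roughly 225 usec per 512-byte
block against about 12113 usec for a 512-byte I/O, which is the bias
towards large I/O that the per-512-block normalization was questioned
for.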
>
>> [ .. ]
>>>>> I understand your concern about whether it really makes sense to keep this
>>>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>>>> stat per-hctx instead of per-CPU.
>>>>>
>>>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>>>> that are local/near to the CPU issuing the request, which may lead to better
>>>>> latency characteristics.
>>>> With this I tend to agree. but per-cpu has lots of other churns IMO.
>>>> Maybe the answer is that paths weights are maintained per NUMA node?
>>>> then accessing these weights in the fast-path is still cheap enough?
>>> That’s a fair point, and I agree that per-CPU accounting can introduce additional
>>> variability. However, moving to per-NUMA path weights would implicitly narrow the
>>> scope of what we are trying to measure, as it would largely exclude components of
>>> end-to-end latency that arise from scheduler behavior and application-level scheduling
>>> effects. As discussed earlier, the intent of the adaptive policy is to capture the
>>> actual I/O cost observed by the workload, which includes not only path and controller
>>> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
>>> maintaining per-CPU path weights remains a better fit for the stated goal. It also
>>> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
>>> preserving a true end-to-end view of path latency, agreed?
>>>
>> Well, for fabrics you can easily have several paths connected to the same NUMA node (like in the classical 'two initiator ports cross-connected to two target ports', resulting in four paths in total.
>> But two of these paths will always be on the same NUMA node).
>> So that doesn't work out.
>>
>>> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
>>> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
>>> host system has 32 cpus, so iperf3 was configured with 32 parallel TCP streams.
>>> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
>>> ioengine=io_uring. Below are the aggregated throughput results observed under
>>> different NVMe multipath I/O policies:
>>>
>>> numa round-robin queue-depth adaptive
>>> ----------- ----------- ----------- ---------
>>> READ: 61.1 MiB/s 87.2 MiB/s 93.1 MiB/s 107 MiB/s
>>> WRITE: 95.8 MiB/s 138 MiB/s 159 MiB/s 179 MiB/s
>>> RW: R:29.8 MiB/s R:53.1 MiB/s R:58.8 MiB/s R:66.6 MiB/s
>>> W:29.6 MiB/s W:52.7 MiB/s W:58.2 MiB/s W:65.9 MiB/s
>>>
>>> These results show that under combined CPU and network stress, the adaptive I/O policy
>>> consistently delivers higher throughput across read, write, and mixed workloads when
>>> compared against existing policies.
>>>
>> And that is probably the best argument; we should put it under stress with various scenarios. I must admit I am _really_ in favour of this
>> iopolicy, as it would be able to handle any temporary issues on the fabric (or backend) without the need of additional signalling.
>> Talk to me about FPIN ...
>>
> I ran additional experiments on the NVMe-oF TCP testbed. The host has 32 cpus so fio
> was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and ioengine=io_uring.
> Below are the aggregated throughput results observed under different NVMe multipath
> I/O policies.
>
> i) Stressing all 32 cpus using stress-ng
>
> All 32 CPUs were stressed using:
> # stress-ng --cpu 0 --cpu-method all -t 60m
>
> numa round-robin queue-depth adaptive
> ----------- ----------- ----------- ---------
> READ: 159 MiB/s 193 MiB/s 215 MiB/s 255 MiB/s
> WRITE: 188 MiB/s 186 MiB/s 195 MiB/s 199 MiB/s
> RW: R:83.4 MiB/s R:101 MiB/s R:104 MiB/s R: 111 MiB/s
> W:83.3 MiB/s W:101 MiB/s W:105 MiB/s W: 112 MiB/s
>
> ii) Symmetric paths (No CPU stress and no induced network load):
>
> numa round-robin queue-depth adaptive
> ----------- ----------- ----------- ---------
> READ: 171 MiB/s 298 MiB/s 320 MiB/s 348 MiB/s
> WRITE: 229 MiB/s 419 MiB/s 442 MiB/s 460 MiB/s
> RW: R: 93.0 MiB/s R: 166 MiB/s R: 171 MiB/s R: 179 MiB/s
> W: 94.2 MiB/s W: 168 MiB/s W: 168 MiB/s W: 178 MiB/s
>
> These results show that the adaptive I/O policy consistently delivers higher
> throughput under CPU stress and asymmetric path conditions. In case of symmetric
> paths the adaptive policy achieves throughput comparable to—or slightly
> better than—existing policies.
I still think that accounting uncorrelated latency is the best approach here.
My intuition tells me that:
1. averaging latencies over the numa-node
2. calculating weights
3. distributing the new weights per-cpu in the numa-node
is a better approach. It is hard to evaluate without adding some randomness.
Can you please run benchmarks with
`blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode`?
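A rough sketch of the three steps above, in Python for brevity rather
than kernel C. The input layout, the function name numa_weights, and in
particular the choice of weights inversely proportional to the averaged
latency and scaled to sum to 100 are assumptions for illustration; the
thread does not spell out the weight formula.

```python
from collections import defaultdict


def numa_weights(cpu_latency, cpu_to_node):
    """
    cpu_latency: {cpu: {path: average latency in usec}} (hypothetical layout,
                 latencies assumed to be > 0)
    cpu_to_node: {cpu: NUMA node id}
    Returns {cpu: {path: weight}}, identical for all CPUs of a node.
    """
    # Step 1: average the per-CPU latencies over each NUMA node.
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for cpu, paths in cpu_latency.items():
        node = cpu_to_node[cpu]
        for path, latency in paths.items():
            sums[node][path] += latency
            counts[node][path] += 1

    # Step 2: one weight set per node. ASSUMPTION: weight is inversely
    # proportional to the averaged latency, normalized to sum to 100.
    node_weight = {}
    for node, path_sums in sums.items():
        inverse = {
            path: counts[node][path] / total_latency
            for path, total_latency in path_sums.items()
        }
        scale = 100.0 / sum(inverse.values())
        node_weight[node] = {path: inv * scale for path, inv in inverse.items()}

    # Step 3: give every CPU of the node its own copy, so the fast path
    # only reads CPU-local data.
    return {
        cpu: dict(node_weight[node])
        for cpu, node in cpu_to_node.items()
        if node in node_weight
    }


if __name__ == "__main__":
    latency = {0: {"A": 100.0, "B": 300.0}, 1: {"A": 100.0, "B": 500.0}}
    print(numa_weights(latency, {0: 0, 1: 0}))
```

Copying the node-level result into per-CPU storage keeps the lookup in
the I/O path local, while the averaging itself can run outside the fast
path.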
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2025-12-25 12:45 ` Sagi Grimberg
@ 2025-12-26 18:16 ` Nilay Shroff
2025-12-27 9:33 ` Sagi Grimberg
2025-12-27 9:37 ` Sagi Grimberg
0 siblings, 2 replies; 28+ messages in thread
From: Nilay Shroff @ 2025-12-26 18:16 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, linux-nvme
Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce
On 12/25/25 6:15 PM, Sagi Grimberg wrote:
>
>
> On 23/12/2025 16:50, Nilay Shroff wrote:
>> [...]
>>>>> I am not sure that normalizing to 512 blocks is a good proxy. I think that large IO will
>>>>> have much lower amortized latency per 512 block, which could create a false bias
>>>>> to place a high weight on a path, if that path happened to host large I/Os no?
>>>>>
>>>> Hmm, I think yes, good point, I think for nvme over fabrics this could be true.
>>>>
>>> Although technically we are then measure two different things (IO latency vs block latency). But yeah, block latency might be better
>>> suited for the normal case; I do wonder, though, if for high-speed
>>> links we do see a difference as the data transfer time is getting
>>> really fast...
>>>
>> For a high speed/bandwidth NIC card the transfer speed would be very fast,
>> though I think for a very large I/O size, we would see a higher latency due
>> to tcp segmentation and re-assembly.
>>
>> On my nvmf-tcp testbed, I do see the latency differences as shown below
>> for varying I/O size (captured for random-read direct I/O workload):
>> I/O-size Avg-latency(usec)
>> 512 12113
>> 1k 10058
>> 2k 11246
>> 4k 12458
>> 8k 12189
>> 16k 11617
>> 32k 17686
>> 64k 28504
>> 128k 59013
>> 256k 118984
>> 512k 233428
>> 1M 460000
>>
>> As can be seen, for smaller block sizes (512B–16K), latency remains relatively
>> stable in the ~10–12 ms range. Starting at 32K and more noticeably at 64K and
>> above, latency increases significantly and roughly doubles with each step in
>> block size. Based on this data, I propose using coarse-grained I/O size buckets
>> to preserve latency characteristics while avoiding excessive fragmentation of
>> statistics. The suggested bucket layout is as follows:
>>
>> Bucket block-size-range
>> small 512B-32k
>> medium 32k-64k
>> large-64k 64k-128k
>> large-128k 128k-256k
>> large-256k 256k-512k
>> large-512k 512k-1M
>> very-large >=1M
>>
>> In this model,
>> - A single small bucket captures latency for I/O sizes where latency remains
>> largely uniform.
>> - A medium bucket captures the transition region.
>> - Separate large buckets preserve the rapidly increasing latency behavior
>> observed for larger block sizes.
>> - A very-large bucket handles any I/O beyond 1M.
>>
>> This approach allows the adaptive policy to retain meaningful latency distinctions across
>> I/O size regimes while keeping the number of buckets manageable and statistically stable,
>> make sense?
>
> Yes
>
>>
>>> [ .. ]
>>>>>> I understand your concern about whether it really makes sense to keep this
>>>>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>>>>> stat per-hctx instead of per-CPU.
>>>>>>
>>>>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>>>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>>>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>>>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>>>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>>>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>>>>> that are local/near to the CPU issuing the request, which may lead to better
>>>>>> latency characteristics.
>>>>> With this I tend to agree. but per-cpu has lots of other churns IMO.
>>>>> Maybe the answer is that paths weights are maintained per NUMA node?
>>>>> then accessing these weights in the fast-path is still cheap enough?
>>>> That’s a fair point, and I agree that per-CPU accounting can introduce additional
>>>> variability. However, moving to per-NUMA path weights would implicitly narrow the
>>>> scope of what we are trying to measure, as it would largely exclude components of
>>>> end-to-end latency that arise from scheduler behavior and application-level scheduling
>>>> effects. As discussed earlier, the intent of the adaptive policy is to capture the
>>>> actual I/O cost observed by the workload, which includes not only path and controller
>>>> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
>>>> maintaining per-CPU path weights remains a better fit for the stated goal. It also
>>>> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
>>>> preserving a true end-to-end view of path latency, agreed?
>>>>
>>> Well, for fabrics you can easily have several paths connected to the same NUMA node (like in the classical 'two initiator ports cross-connected to two target ports', resulting in four paths in total.
>>> But two of these paths will always be on the same NUMA node).
>>> So that doesn't work out.
>>>
>>>> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
>>>> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
>>>> host system has 32 cpus, so iperf3 was configured with 32 parallel TCP streams.
>>>> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
>>>> ioengine=io_uring. Below are the aggregated throughput results observed under
>>>> different NVMe multipath I/O policies:
>>>>
>>>> numa round-robin queue-depth adaptive
>>>> ----------- ----------- ----------- ---------
>>>> READ: 61.1 MiB/s 87.2 MiB/s 93.1 MiB/s 107 MiB/s
>>>> WRITE: 95.8 MiB/s 138 MiB/s 159 MiB/s 179 MiB/s
>>>> RW: R:29.8 MiB/s R:53.1 MiB/s R:58.8 MiB/s R:66.6 MiB/s
>>>> W:29.6 MiB/s W:52.7 MiB/s W:58.2 MiB/s W:65.9 MiB/s
>>>>
>>>> These results show that under combined CPU and network stress, the adaptive I/O policy
>>>> consistently delivers higher throughput across read, write, and mixed workloads when
>>>> compared against existing policies.
>>>>
>>> And that is probably the best argument; we should put it under stress with various scenarios. I must admit I am _really_ in favour of this
>>> iopolicy, as it would be able to handle any temporary issues on the fabric (or backend) without the need of additional signalling.
>>> Talk to me about FPIN ...
>>>
>> I ran additional experiments on the NVMe-oF TCP testbed. The host has 32 cpus so fio
>> was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and ioengine=io_uring.
>> Below are the aggregated throughput results observed under different NVMe multipath
>> I/O policies.
>>
>> i) Stressing all 32 cpus using stress-ng
>>
>> All 32 CPUs were stressed using:
>> # stress-ng --cpu 0 --cpu-method all -t 60m
>>
>> numa round-robin queue-depth adaptive
>> ----------- ----------- ----------- ---------
>> READ: 159 MiB/s 193 MiB/s 215 MiB/s 255 MiB/s
>> WRITE: 188 MiB/s 186 MiB/s 195 MiB/s 199 MiB/s
>> RW: R:83.4 MiB/s R:101 MiB/s R:104 MiB/s R: 111 MiB/s
>> W:83.3 MiB/s W:101 MiB/s W:105 MiB/s W: 112 MiB/s
>>
>> ii) Symmetric paths (No CPU stress and no induced network load):
>>
>> numa round-robin queue-depth adaptive
>> ----------- ----------- ----------- ---------
>> READ: 171 MiB/s 298 MiB/s 320 MiB/s 348 MiB/s
>> WRITE: 229 MiB/s 419 MiB/s 442 MiB/s 460 MiB/s
>> RW: R: 93.0 MiB/s R: 166 MiB/s R: 171 MiB/s R: 179 MiB/s
>> W: 94.2 MiB/s W: 168 MiB/s W: 168 MiB/s W: 178 MiB/s
>>
>> These results show that the adaptive I/O policy consistently delivers higher
>> throughput under CPU stress and asymmetric path conditions. In case of symmetric
>> paths the adaptive policy achieves throughput comparable to—or slightly
>> better than—existing policies.
>
> I still think that accounting uncorrelated latency is the best approach here.
>
> My intuition tells me that:
> 1. averaging latencies over numa-node
> 2. calculating weights
> 3. distribute new weights per-cpu in the numa-node
>
> Is a better approach. It is hard to evaluate without adding some randomness.
>
> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
file I used for the test, followed by the observed throughput result for reference.
Job file:
=========

[global]
time_based
runtime=120
group_reporting=1

[cpu]
ioengine=cpuio
cpuload=85
cpumode=qsort
numjobs=32

[disk]
ioengine=io_uring
filename=/dev/nvme1n2
rw=<randread/randwrite/randrw>
bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
iodepth=32
numjobs=32
direct=1

Throughput:
===========

        numa          round-robin   queue-depth   adaptive
        -----------   -----------   -----------   -----------
READ:   1120 MiB/s    2241 MiB/s    2233 MiB/s    2215 MiB/s
WRITE:  1107 MiB/s    1875 MiB/s    1847 MiB/s    1892 MiB/s
RW:     R:1001 MiB/s  R:1047 MiB/s  R:1086 MiB/s  R:1112 MiB/s
        W:999 MiB/s   W:1045 MiB/s  W:1084 MiB/s  W:1111 MiB/s

When comparing the results, I did not observe a significant throughput
difference between the queue-depth, round-robin, and adaptive policies.
With random I/O of mixed sizes, the adaptive policy appears to average
out the varying latency values and distribute I/O reasonably evenly
across the active paths (assuming symmetric paths).

Next, I'll implement I/O size buckets and per-NUMA-node weights, then
rerun the tests and share the results. Let's see whether these changes
further improve the throughput numbers for the adaptive policy. We can
then review the results and discuss further.
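For reference, the bucket layout quoted earlier in this message could be
expressed as a simple lookup. This is only a sketch in Python, not the
planned kernel code: the helper name io_size_bucket is hypothetical, and
since the quoted ranges share their end points, it assumes each range
includes its lower bound and excludes its upper bound (so 32k lands in
"medium", consistent with ">=1M" for "very-large").

```python
import bisect

KIB = 1024

# Inclusive lower bounds of every bucket after "small" (ASSUMPTION: ranges
# are half-open, [lower, upper)).
_LOWER_BOUNDS = [32 * KIB, 64 * KIB, 128 * KIB, 256 * KIB, 512 * KIB, 1024 * KIB]
_BUCKET_NAMES = [
    "small",
    "medium",
    "large-64k",
    "large-128k",
    "large-256k",
    "large-512k",
    "very-large",
]


def io_size_bucket(nbytes):
    """Return the bucket name for an I/O of nbytes bytes."""
    # bisect_right counts how many lower bounds are <= nbytes, which is
    # exactly the index of the bucket the size falls into.
    return _BUCKET_NAMES[bisect.bisect_right(_LOWER_BOUNDS, nbytes)]


if __name__ == "__main__":
    for size in (512, 16 * KIB, 32 * KIB, 256 * KIB, 1024 * KIB):
        print(size, io_size_bucket(size))
```

Latency statistics would then be kept per bucket instead of being
normalized to 512-byte blocks.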
Thanks,
--Nilay
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2025-12-26 18:16 ` Nilay Shroff
@ 2025-12-27 9:33 ` Sagi Grimberg
2025-12-27 9:37 ` Sagi Grimberg
1 sibling, 0 replies; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-27 9:33 UTC (permalink / raw)
To: Nilay Shroff, Hannes Reinecke, linux-nvme
Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce
On 26/12/2025 20:16, Nilay Shroff wrote:
>
> On 12/25/25 6:15 PM, Sagi Grimberg wrote:
>>
>> On 23/12/2025 16:50, Nilay Shroff wrote:
>>> [...]
>>>>>> I am not sure that normalizing to 512 blocks is a good proxy. I think that large IO will
>>>>>> have much lower amortized latency per 512 block, which could create a false bias
>>>>>> to place a high weight on a path, if that path happened to host large I/Os no?
>>>>>>
>>>>> Hmm, I think yes, good point, I think for nvme over fabrics this could be true.
>>>>>
>>>> Although technically we are then measure two different things (IO latency vs block latency). But yeah, block latency might be better
>>>> suited for the normal case; I do wonder, though, if for high-speed
>>>> links we do see a difference as the data transfer time is getting
>>>> really fast...
>>>>
>>> For a high speed/bandwidth NIC card the transfer speed would be very fast,
>>> though I think for a very large I/O size, we would see a higher latency due
>>> to tcp segmentation and re-assembly.
>>>
>>> On my nvmf-tcp testbed, I do see the latency differences as shown below
>>> for varying I/O size (captured for random-read direct I/O workload):
>>> I/O-size Avg-latency(usec)
>>> 512 12113
>>> 1k 10058
>>> 2k 11246
>>> 4k 12458
>>> 8k 12189
>>> 16k 11617
>>> 32k 17686
>>> 64k 28504
>>> 128k 59013
>>> 256k 118984
>>> 512k 233428
>>> 1M 460000
>>>
>>> As can be seen, for smaller block sizes (512B–16K), latency remains relatively
>>> stable in the ~10–12 ms range. Starting at 32K and more noticeably at 64K and
>>> above, latency increases significantly and roughly doubles with each step in
>>> block size. Based on this data, I propose using coarse-grained I/O size buckets
>>> to preserve latency characteristics while avoiding excessive fragmentation of
>>> statistics. The suggested bucket layout is as follows:
>>>
>>> Bucket block-size-range
>>> small 512B-32k
>>> medium 32k-64k
>>> large-64k 64k-128k
>>> large-128k 128k-256k
>>> large-256k 256k-512k
>>> large-512k 512k-1M
>>> very-large >=1M
>>>
>>> In this model,
>>> - A single small bucket captures latency for I/O sizes where latency remains
>>> largely uniform.
>>> - A medium bucket captures the transition region.
>>> - Separate large buckets preserve the rapidly increasing latency behavior
>>> observed for larger block sizes.
>>> - A very-large bucket handles any I/O beyond 1M.
>>>
>>> This approach allows the adaptive policy to retain meaningful latency distinctions across
>>> I/O size regimes while keeping the number of buckets manageable and statistically stable,
>>> make sense?
>> Yes
>>
>>>> [ .. ]
>>>>>>> I understand your concern about whether it really makes sense to keep this
>>>>>>> per-cpu-per-ns, and I see your point that you would prefer maintaining the
>>>>>>> stat per-hctx instead of per-CPU.
>>>>>>>
>>>>>>> However, as mentioned earlier, during path selection we cannot reliably map an
>>>>>>> I/O to a specific hctx, so using per-hctx statistics becomes problematic in
>>>>>>> practice. On the other hand, maintaining the metrics per-CPU has an additional
>>>>>>> advantage: on a NUMA-aware system, the measured I/O latency naturally reflects
>>>>>>> the NUMA distance between the workload’s CPU and the I/O controller. This means
>>>>>>> that on multi-node systems, the policy can automatically favor I/O paths/controllers
>>>>>>> that are local/near to the CPU issuing the request, which may lead to better
>>>>>>> latency characteristics.
>>>>>> With this I tend to agree. but per-cpu has lots of other churns IMO.
>>>>>> Maybe the answer is that paths weights are maintained per NUMA node?
>>>>>> then accessing these weights in the fast-path is still cheap enough?
>>>>> That’s a fair point, and I agree that per-CPU accounting can introduce additional
>>>>> variability. However, moving to per-NUMA path weights would implicitly narrow the
>>>>> scope of what we are trying to measure, as it would largely exclude components of
>>>>> end-to-end latency that arise from scheduler behavior and application-level scheduling
>>>>> effects. As discussed earlier, the intent of the adaptive policy is to capture the
>>>>> actual I/O cost observed by the workload, which includes not only path and controller
>>>>> locality but also fabric, stack, and scheduling effects. From that perspective, IMO,
>>>>> maintaining per-CPU path weights remains a better fit for the stated goal. It also
>>>>> offers a dual advantage: naturally reflecting NUMA locality on a per-CPU basis while
>>>>> preserving a true end-to-end view of path latency, agreed?
>>>>>
>>>> Well, for fabrics you can easily have several paths connected to the same NUMA node (like in the classical 'two initiator ports cross-connected to two target ports', resulting in four paths in total.
>>>> But two of these paths will always be on the same NUMA node).
>>>> So that doesn't work out.
>>>>
>>>>> I conducted an experiment on my NVMe-oF TCP testbed while simultaneously running
>>>>> iperf3 TCP traffic to introduce both CPU and network load alongside fio. The
>>>>> host system has 32 cpus, so iperf3 was configured with 32 parallel TCP streams.
>>>>> And fio was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and
>>>>> ioengine=io_uring. Below are the aggregated throughput results observed under
>>>>> different NVMe multipath I/O policies:
>>>>>
>>>>> numa round-robin queue-depth adaptive
>>>>> ----------- ----------- ----------- ---------
>>>>> READ: 61.1 MiB/s 87.2 MiB/s 93.1 MiB/s 107 MiB/s
>>>>> WRITE: 95.8 MiB/s 138 MiB/s 159 MiB/s 179 MiB/s
>>>>> RW: R:29.8 MiB/s R:53.1 MiB/s R:58.8 MiB/s R:66.6 MiB/s
>>>>> W:29.6 MiB/s W:52.7 MiB/s W:58.2 MiB/s W:65.9 MiB/s
>>>>>
>>>>> These results show that under combined CPU and network stress, the adaptive I/O policy
>>>>> consistently delivers higher throughput across read, write, and mixed workloads when
>>>>> compared against existing policies.
>>>>>
>>>> And that is probably the best argument; we should put it under stress with various scenarios. I must admit I am _really_ in favour of this
>>>> iopolicy, as it would be able to handle any temporary issues on the fabric (or backend) without the need of additional signalling.
>>>> Talk to me about FPIN ...
>>>>
>>> I ran additional experiments on the NVMe-oF TCP testbed. The host has 32 cpus so fio
>>> was configured with numjobs=32, iodepth=32, bs=4K, direct I/O, and ioengine=io_uring.
>>> Below are the aggregated throughput results observed under different NVMe multipath
>>> I/O policies.
>>>
>>> i) Stressing all 32 cpus using stress-ng
>>>
>>> All 32 CPUs were stressed using:
>>> # stress-ng --cpu 0 --cpu-method all -t 60m
>>>
>>> numa round-robin queue-depth adaptive
>>> ----------- ----------- ----------- ---------
>>> READ: 159 MiB/s 193 MiB/s 215 MiB/s 255 MiB/s
>>> WRITE: 188 MiB/s 186 MiB/s 195 MiB/s 199 MiB/s
>>> RW: R:83.4 MiB/s R:101 MiB/s R:104 MiB/s R: 111 MiB/s
>>> W:83.3 MiB/s W:101 MiB/s W:105 MiB/s W: 112 MiB/s
>>>
>>> ii) Symmetric paths (No CPU stress and no induced network load):
>>>
>>> numa round-robin queue-depth adaptive
>>> ----------- ----------- ----------- ---------
>>> READ: 171 MiB/s 298 MiB/s 320 MiB/s 348 MiB/s
>>> WRITE: 229 MiB/s 419 MiB/s 442 MiB/s 460 MiB/s
>>> RW: R: 93.0 MiB/s R: 166 MiB/s R: 171 MiB/s R: 179 MiB/s
>>> W: 94.2 MiB/s W: 168 MiB/s W: 168 MiB/s W: 178 MiB/s
>>>
>>> These results show that the adaptive I/O policy consistently delivers higher
>>> throughput under CPU stress and asymmetric path conditions. In case of symmetric
>>> paths the adaptive policy achieves throughput comparable to—or slightly
>>> better than—existing policies.
>> I still think that accounting uncorrelated latency is the best approach here.
>>
>> My intuition tells me that:
>> 1. averaging latencies over numa-node
>> 2. calculating weights
>> 3. distribute new weights per-cpu in the numa-node
>>
>> Is a better approach. It is hard to evaluate without adding some randomness.
>>
>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
> file I used for the test, followed by the observed throughput result for reference.
>
> Job file:
> =========
>
> [global]
> time_based
> runtime=120
> group_reporting=1
>
> [cpu]
> ioengine=cpuio
> cpuload=85
> cpumode=qsort
> numjobs=32
>
> [disk]
> ioengine=io_uring
> filename=/dev/nvme1n2
> rw=<randread/randwrite/randrw>
> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
> iodepth=32
> numjobs=32
> direct=1
>
> Throughput:
> ===========
>
> numa round-robin queue-depth adaptive
> ----------- ----------- ----------- ---------
> READ: 1120 MiB/s 2241 MiB/s 2233 MiB/s 2215 MiB/s
> WRITE: 1107 MiB/s 1875 MiB/s 1847 MiB/s 1892 MiB/s
> RW: R:1001 MiB/s R:1047 MiB/s R:1086 MiB/s R:1112 MiB/s
> W:999 MiB/s W:1045 MiB/s W:1084 MiB/s W:1111 MiB/s
>
> When comparing the results, I did not observe a significant throughput
> difference between the queue-depth, round-robin, and adaptive policies.
> With random I/O of mixed sizes, the adaptive policy appears to average
> out the varying latency values and distribute I/O reasonably evenly
> across the active paths (assuming symmetric paths).
>
> Next I'd implement I/O size buckets and also per-numa node weight and
> then rerun tests and share the result. Lets see if these changes help
> further improve the throughput number for adaptive policy. We may then
> again review the results and discuss further.
Two comments:
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2025-12-26 18:16 ` Nilay Shroff
2025-12-27 9:33 ` Sagi Grimberg
@ 2025-12-27 9:37 ` Sagi Grimberg
2026-01-04 9:07 ` Nilay Shroff
1 sibling, 1 reply; 28+ messages in thread
From: Sagi Grimberg @ 2025-12-27 9:37 UTC (permalink / raw)
To: Nilay Shroff, Hannes Reinecke, linux-nvme
Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce
>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
> file I used for the test, followed by the observed throughput result for reference.
>
> Job file:
> =========
>
> [global]
> time_based
> runtime=120
> group_reporting=1
>
> [cpu]
> ioengine=cpuio
> cpuload=85
> cpumode=qsort
> numjobs=32
>
> [disk]
> ioengine=io_uring
> filename=/dev/nvme1n2
> rw=<randread/randwrite/randrw>
> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
> iodepth=32
> numjobs=32
> direct=1
>
> Throughput:
> ===========
>
> numa round-robin queue-depth adaptive
> ----------- ----------- ----------- ---------
> READ: 1120 MiB/s 2241 MiB/s 2233 MiB/s 2215 MiB/s
> WRITE: 1107 MiB/s 1875 MiB/s 1847 MiB/s 1892 MiB/s
> RW: R:1001 MiB/s R:1047 MiB/s R:1086 MiB/s R:1112 MiB/s
> W:999 MiB/s W:1045 MiB/s W:1084 MiB/s W:1111 MiB/s
>
> When comparing the results, I did not observe a significant throughput
> difference between the queue-depth, round-robin, and adaptive policies.
> With random I/O of mixed sizes, the adaptive policy appears to average
> out the varying latency values and distribute I/O reasonably evenly
> across the active paths (assuming symmetric paths).
>
> Next I'd implement I/O size buckets and also per-numa node weight and
> then rerun tests and share the result. Lets see if these changes help
> further improve the throughput number for adaptive policy. We may then
> again review the results and discuss further.
>
> Thanks,
> --Nilay
Two comments:
1. I'd bias the read split slightly towards small block sizes, and the
   write split towards larger block sizes.
2. I'd also suggest measuring with the weight calculation averaged over
   all cores of the numa-node and then set per-cpu (such that the
   datapath does not introduce serialization).
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2025-12-27 9:37 ` Sagi Grimberg
@ 2026-01-04 9:07 ` Nilay Shroff
2026-01-04 21:06 ` Sagi Grimberg
0 siblings, 1 reply; 28+ messages in thread
From: Nilay Shroff @ 2026-01-04 9:07 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, linux-nvme
Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce
On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>
>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>> file I used for the test, followed by the observed throughput result for reference.
>>
>> Job file:
>> =========
>>
>> [global]
>> time_based
>> runtime=120
>> group_reporting=1
>>
>> [cpu]
>> ioengine=cpuio
>> cpuload=85
>> cpumode=qsort
>> numjobs=32
>>
>> [disk]
>> ioengine=io_uring
>> filename=/dev/nvme1n2
>> rw=<randread/randwrite/randrw>
>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>> iodepth=32
>> numjobs=32
>> direct=1
>>
>> Throughput:
>> ===========
>>
>> numa round-robin queue-depth adaptive
>> ----------- ----------- ----------- ---------
>> READ: 1120 MiB/s 2241 MiB/s 2233 MiB/s 2215 MiB/s
>> WRITE: 1107 MiB/s 1875 MiB/s 1847 MiB/s 1892 MiB/s
>> RW: R:1001 MiB/s R:1047 MiB/s R:1086 MiB/s R:1112 MiB/s
>> W:999 MiB/s W:1045 MiB/s W:1084 MiB/s W:1111 MiB/s
>>
>> When comparing the results, I did not observe a significant throughput
>> difference between the queue-depth, round-robin, and adaptive policies.
>> With random I/O of mixed sizes, the adaptive policy appears to average
>> out the varying latency values and distribute I/O reasonably evenly
>> across the active paths (assuming symmetric paths).
>>
>> Next I'd implement I/O size buckets and also per-numa node weight and
>> then rerun tests and share the result. Lets see if these changes help
>> further improve the throughput number for adaptive policy. We may then
>> again review the results and discuss further.
>>
>> Thanks,
>> --Nilay
>
> two comments:
> 1. I'd make reads split slightly biased towards small block sizes, and writes biased towards larger block sizes
> 2. I'd also suggest to measure having weights calculation averaged out on all numa-node cores and then set percpu (such that
> the datapath does not introduce serialization).
Thanks for the suggestions. I ran experiments incorporating both points—
biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
weight calculation—using the following setup.
Job file:
=========
[global]
time_based
runtime=120
group_reporting=1
[cpu]
ioengine=cpuio
cpuload=85
numjobs=32
[disk]
ioengine=io_uring
filename=/dev/nvme1n1
rw=<randread/randwrite/randrw>
bssplit=<based-on-I/O-pattern-type>[1]
iodepth=32
numjobs=32
direct=1
==========
[1] Block-size distributions:
randread => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
randrw => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
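For a rough sense of how different these three mixes are, the expected I/O size of each bssplit string can be computed directly. The sketch below is an illustration only; the helper names are hypothetical, and it assumes fio's default of 1k = 1024 bytes.

```python
def parse_size(text):
    """Convert a fio-style size such as '512' or '4k' to bytes (1k = 1024)."""
    text = text.strip().lower()
    if text.endswith("k"):
        return int(text[:-1]) * 1024
    return int(text)

def mean_block_size(bssplit):
    """Expected I/O size in bytes for a 'size/percent:size/percent' string."""
    total = 0.0
    percent_sum = 0
    for entry in bssplit.split(":"):
        size, percent = entry.split("/")
        total += parse_size(size) * int(percent)
        percent_sum += int(percent)
    return total / percent_sum

for name, split in [
    ("randread", "512/30:4k/25:8k/20:16k/15:32k/10"),
    ("randwrite", "4k/10:64k/20:128k/30:256k/40"),
    ("randrw", "512/20:4k/25:32k/25:64k/20:128k/5:256k/5"),
]:
    print(f"{name}: {mean_block_size(split) / 1024:.2f} KiB")
```

The means come out at roughly 8.4 KiB for randread, 154 KiB for randwrite and 41 KiB for randrw, which is worth keeping in mind when comparing MiB/s figures across the three workloads.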
Results:
=======
i) Symmetric paths + system load
(CPU stress using cpuload):
          per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
          (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
          -------   ------------------   --------   -------------------
READ:     636       621                  613        618
WRITE:    1832      1847                 1840       1852
RW:       R:872     R:869                R:866      R:874
          W:872     W:870                W:867      W:876
ii) Asymmetric paths + system load
(CPU stress using cpuload and iperf3 traffic for inducing network congestion):
          per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
          (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
          -------   ------------------   --------   -------------------
READ:     553       543                  540        533
WRITE:    1705      1670                 1710       1655
RW:       R:769     R:771                R:784      R:772
          W:768     W:767                W:785      W:771
Looking at the above results:
- Per-CPU vs per-CPU with I/O buckets:
  The per-CPU implementation already averages latency effectively across CPUs.
  Introducing per-CPU I/O buckets does not provide a meaningful throughput
  improvement; the results remain largely comparable.
- Per-CPU vs per-NUMA aggregation:
  Calculating or averaging weights at the NUMA level does not significantly
  improve throughput over per-CPU weight calculation. Across both symmetric
  and asymmetric scenarios, the results remain very close.
So, based on the above results and assessment, and unless there are additional
scenarios or metrics of interest, shall we proceed with per-CPU weight
calculation for this new I/O policy?
Thanks,
--Nilay
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2026-01-04 9:07 ` Nilay Shroff
@ 2026-01-04 21:06 ` Sagi Grimberg
2026-01-06 14:16 ` Nilay Shroff
2026-01-07 11:15 ` Hannes Reinecke
0 siblings, 2 replies; 28+ messages in thread
From: Sagi Grimberg @ 2026-01-04 21:06 UTC (permalink / raw)
To: Nilay Shroff, Hannes Reinecke, linux-nvme
Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce
On 04/01/2026 11:07, Nilay Shroff wrote:
>
> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>>> file I used for the test, followed by the observed throughput result for reference.
>>>
>>> Job file:
>>> =========
>>>
>>> [global]
>>> time_based
>>> runtime=120
>>> group_reporting=1
>>>
>>> [cpu]
>>> ioengine=cpuio
>>> cpuload=85
>>> cpumode=qsort
>>> numjobs=32
>>>
>>> [disk]
>>> ioengine=io_uring
>>> filename=/dev/nvme1n2
>>> rw=<randread/randwrite/randrw>
>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>> iodepth=32
>>> numjobs=32
>>> direct=1
>>>
>>> Throughput:
>>> ===========
>>>
>>> numa round-robin queue-depth adaptive
>>> ----------- ----------- ----------- ---------
>>> READ: 1120 MiB/s 2241 MiB/s 2233 MiB/s 2215 MiB/s
>>> WRITE: 1107 MiB/s 1875 MiB/s 1847 MiB/s 1892 MiB/s
>>> RW: R:1001 MiB/s R:1047 MiB/s R:1086 MiB/s R:1112 MiB/s
>>> W:999 MiB/s W:1045 MiB/s W:1084 MiB/s W:1111 MiB/s
>>>
>>> When comparing the results, I did not observe a significant throughput
>>> difference between the queue-depth, round-robin, and adaptive policies.
>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>> out the varying latency values and distribute I/O reasonably evenly
>>> across the active paths (assuming symmetric paths).
>>>
>>> Next I'd implement I/O size buckets and also per-numa node weight and
>>> then rerun tests and share the result. Lets see if these changes help
>>> further improve the throughput number for adaptive policy. We may then
>>> again review the results and discuss further.
>>>
>>> Thanks,
>>> --Nilay
>> two comments:
>> 1. I'd make reads split slightly biased towards small block sizes, and writes biased towards larger block sizes
>> 2. I'd also suggest to measure having weights calculation averaged out on all numa-node cores and then set percpu (such that
>> the datapath does not introduce serialization).
> Thanks for the suggestions. I ran experiments incorporating both points—
> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
> weight calculation—using the following setup.
>
> Job file:
> =========
> [global]
> time_based
> runtime=120
> group_reporting=1
>
> [cpu]
> ioengine=cpuio
> cpuload=85
> numjobs=32
>
> [disk]
> ioengine=io_uring
> filename=/dev/nvme1n1
> rw=<randread/randwrite/randrw>
> bssplit=<based-on-I/O-pattern-type>[1]
> iodepth=32
> numjobs=32
> direct=1
> ==========
>
> [1] Block-size distributions:
> randread => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
> randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
> randrw => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>
> Results:
> =======
>
> i) Symmetric paths + system load
> (CPU stress using cpuload):
>
> per-CPU per-CPU-IO-buckets per-NUMA per-NUMA-IO-buckets
> (MiB/s) (MiB/s) (MiB/s) (MiB/s)
> ------- ------------------- -------- -------------------
> READ: 636 621 613 618
> WRITE: 1832 1847 1840 1852
> RW: R:872 R:869 R:866 R:874
> W:872 W:870 W:867 W:876
>
> ii) Asymmetric paths + system load
> (CPU stress using cpuload and iperf3 traffic for inducing network congestion):
>
> per-CPU per-CPU-IO-buckets per-NUMA per-NUMA-IO-buckets
> (MiB/s) (MiB/s) (MiB/s) (MiB/s)
> ------- ------------------- -------- -------------------
> READ: 553 543 540 533
> WRITE: 1705 1670 1710 1655
> RW: R:769 R:771 R:784 R:772
> W:768 W:767 W:785 W:771
>
>
> Looking at the above results,
> - Per-CPU vs per-CPU with I/O buckets:
> The per-CPU implementation already averages latency effectively across CPUs.
> Introducing per-CPU I/O buckets does not provide a meaningful throughput
> improvement and remains largely comparable.
>
> - Per-CPU vs per-NUMA aggregation:
> Calculating or averaging weights at the NUMA level does not significantly
> improve throughput over per-CPU weight calculation. Across both symmetric
> and asymmetric scenarios, the results remain very close.
>
> So now based on above results and assessment, unless there are additional
> scenarios or metrics of interest, shall we proceed with per-CPU weight
> calculation for this new I/O policy?
I think it is counterintuitive that bucketing I/O sizes does not
present any advantage. Don't you?
Maybe the test is not a good enough representation...
Let's also test what happens with multiple clients against the same
subsystem.
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2026-01-04 21:06 ` Sagi Grimberg
@ 2026-01-06 14:16 ` Nilay Shroff
2026-02-02 13:33 ` Nilay Shroff
2026-01-07 11:15 ` Hannes Reinecke
1 sibling, 1 reply; 28+ messages in thread
From: Nilay Shroff @ 2026-01-06 14:16 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, linux-nvme
Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce
On 1/5/26 2:36 AM, Sagi Grimberg wrote:
>
>
> On 04/01/2026 11:07, Nilay Shroff wrote:
>>
>> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>>>> file I used for the test, followed by the observed throughput result for reference.
>>>>
>>>> Job file:
>>>> =========
>>>>
>>>> [global]
>>>> time_based
>>>> runtime=120
>>>> group_reporting=1
>>>>
>>>> [cpu]
>>>> ioengine=cpuio
>>>> cpuload=85
>>>> cpumode=qsort
>>>> numjobs=32
>>>>
>>>> [disk]
>>>> ioengine=io_uring
>>>> filename=/dev/nvme1n2
>>>> rw=<randread/randwrite/randrw>
>>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>>> iodepth=32
>>>> numjobs=32
>>>> direct=1
>>>>
>>>> Throughput:
>>>> ===========
>>>>
>>>> numa round-robin queue-depth adaptive
>>>> ----------- ----------- ----------- ---------
>>>> READ: 1120 MiB/s 2241 MiB/s 2233 MiB/s 2215 MiB/s
>>>> WRITE: 1107 MiB/s 1875 MiB/s 1847 MiB/s 1892 MiB/s
>>>> RW: R:1001 MiB/s R:1047 MiB/s R:1086 MiB/s R:1112 MiB/s
>>>> W:999 MiB/s W:1045 MiB/s W:1084 MiB/s W:1111 MiB/s
>>>>
>>>> When comparing the results, I did not observe a significant throughput
>>>> difference between the queue-depth, round-robin, and adaptive policies.
>>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>>> out the varying latency values and distribute I/O reasonably evenly
>>>> across the active paths (assuming symmetric paths).
>>>>
>>>> Next I'd implement I/O size buckets and also per-numa node weight and
>>>> then rerun tests and share the result. Lets see if these changes help
>>>> further improve the throughput number for adaptive policy. We may then
>>>> again review the results and discuss further.
>>>>
>>>> Thanks,
>>>> --Nilay
>>> two comments:
>>> 1. I'd make reads split slightly biased towards small block sizes, and writes biased towards larger block sizes
>>> 2. I'd also suggest to measure having weights calculation averaged out on all numa-node cores and then set percpu (such that
>>> the datapath does not introduce serialization).
>> Thanks for the suggestions. I ran experiments incorporating both points—
>> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
>> weight calculation—using the following setup.
>>
>> Job file:
>> =========
>> [global]
>> time_based
>> runtime=120
>> group_reporting=1
>>
>> [cpu]
>> ioengine=cpuio
>> cpuload=85
>> numjobs=32
>>
>> [disk]
>> ioengine=io_uring
>> filename=/dev/nvme1n1
>> rw=<randread/randwrite/randrw>
>> bssplit=<based-on-I/O-pattern-type>[1]
>> iodepth=32
>> numjobs=32
>> direct=1
>> ==========
>>
>> [1] Block-size distributions:
>> randread => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
>> randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
>> randrw => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>>
>> Results:
>> =======
>>
>> i) Symmetric paths + system load
>> (CPU stress using cpuload):
>>
>> per-CPU per-CPU-IO-buckets per-NUMA per-NUMA-IO-buckets
>> (MiB/s) (MiB/s) (MiB/s) (MiB/s)
>> ------- ------------------- -------- -------------------
>> READ: 636 621 613 618
>> WRITE: 1832 1847 1840 1852
>> RW: R:872 R:869 R:866 R:874
>> W:872 W:870 W:867 W:876
>>
>> ii) Asymmetric paths + system load
>> (CPU stress using cpuload and iperf3 traffic for inducing network congestion):
>>
>> per-CPU per-CPU-IO-buckets per-NUMA per-NUMA-IO-buckets
>> (MiB/s) (MiB/s) (MiB/s) (MiB/s)
>> ------- ------------------- -------- -------------------
>> READ: 553 543 540 533
>> WRITE: 1705 1670 1710 1655
>> RW: R:769 R:771 R:784 R:772
>> W:768 W:767 W:785 W:771
>>
>>
>> Looking at the above results,
>> - Per-CPU vs per-CPU with I/O buckets:
>> The per-CPU implementation already averages latency effectively across CPUs.
>> Introducing per-CPU I/O buckets does not provide a meaningful throughput
>> improvement and remains largely comparable.
>>
>> - Per-CPU vs per-NUMA aggregation:
>> Calculating or averaging weights at the NUMA level does not significantly
>> improve throughput over per-CPU weight calculation. Across both symmetric
>> and asymmetric scenarios, the results remain very close.
>>
>> So now based on above results and assessment, unless there are additional
>> scenarios or metrics of interest, shall we proceed with per-CPU weight
>> calculation for this new I/O policy?
>
> I think it is counter intuitive that bucketing I/O sizes does not present any advantage. Don't you?
> Maybe the test is not good enough of a representation...
>
Hmm, you were correct. I thought the same, but I couldn't find
any test that could prove the advantage of using I/O buckets. Then
today I spent some time thinking about scenarios that could
prove the worth of I/O buckets. After some thought I came up
with the following use case.
Size-dependent path behavior:
1. Example:
   Path A: good for ≤16k, bad for ≥32k
   Path B: good for all
   Now running mixed I/O (bssplit => 16k/75:64k/25),
   Without buckets:
     Path B looks good; the scheduler forwards more I/Os towards path B.
   With buckets:
     small I/Os are distributed across paths A and B
     large I/Os favor path B
   So in theory, throughput should improve with buckets.
2. Example:
   Path A: good for ≤16k, bad for ≥32k
   Path B: opposite
   Without buckets:
     latency averages cancel out
     scheduler sees “paths are equal”
   With buckets:
     small I/O bucket favors A
     large I/O bucket favors B
   Again in theory, throughput should improve with buckets.
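The second example can be made concrete with a toy model. Everything in the sketch below is an assumption for illustration: the latency numbers are invented, the mix is taken as an even split so that the averages cancel exactly, and weights are taken as proportional to the inverse of the observed latency, which may differ from what the patch actually computes.

```python
def inverse_latency_weights(latency_us):
    """Assumed weighting: each path's share is proportional to 1/latency."""
    inv = {path: 1.0 / lat for path, lat in latency_us.items()}
    norm = sum(inv.values())
    return {path: v / norm for path, v in inv.items()}

def blended_latency(per_size_latency_us, mix):
    """Average latency a path shows when I/O sizes are not tracked separately."""
    return sum(per_size_latency_us[size] * share for size, share in mix.items())

# Hypothetical latencies (microseconds): the two paths are mirror images.
latency = {
    "A": {"small": 100.0, "large": 400.0},
    "B": {"small": 400.0, "large": 100.0},
}
mix = {"small": 0.5, "large": 0.5}  # assumed even mix of small and large I/Os

# Without buckets: one latency figure per path, so both paths look identical.
flat = inverse_latency_weights(
    {path: blended_latency(lat, mix) for path, lat in latency.items()}
)

# With buckets: one set of weights per I/O size class.
bucketed = {
    size: inverse_latency_weights({path: latency[path][size] for path in latency})
    for size in mix
}

print("no buckets:", flat)
print("buckets:   ", bucketed)
```

In this model the unbucketed weights split I/O evenly between A and B, while the bucketed weights send most small I/Os to A and most large I/Os to B. The model ignores the feedback between path choice and measured latency, so it only shows the direction of the effect, not its size.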
Based on the two examples above, I ran another experiment; the results
are shown below.
I injected additional delay on one path for larger packets (>=32k)
and mixed I/Os with bssplit => 16k/75:64k/25. So with this
test, we have:
Path A: good for ≤16k, bad for ≥32k
Path B: good for all
          per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
          (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
          -------   ------------------   --------   -------------------
READ:     550       622                  523        615
WRITE:    726       829                  747        834
RW:       R:324     R:381                R:306      R:375
          W:323     W:381                W:306      W:374
So yes, I/O buckets could be useful for the scenario tested
above. Regarding per-CPU vs per-NUMA weight calculation,
do you agree that per-CPU should be good enough for this policy,
since, as we saw above, per-NUMA doesn't improve performance much?
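To put a number on the benefit, the relative gain of each bucketed variant over its non-bucketed counterpart can be computed from the table above. The snippet is just arithmetic on the posted figures; the helper name is arbitrary.

```python
# Throughput in MiB/s, copied from the size-dependent delay results above.
results = {
    "READ":   {"per-CPU": 550, "per-CPU-IO-buckets": 622,
               "per-NUMA": 523, "per-NUMA-IO-buckets": 615},
    "WRITE":  {"per-CPU": 726, "per-CPU-IO-buckets": 829,
               "per-NUMA": 747, "per-NUMA-IO-buckets": 834},
    "RW (R)": {"per-CPU": 324, "per-CPU-IO-buckets": 381,
               "per-NUMA": 306, "per-NUMA-IO-buckets": 375},
    "RW (W)": {"per-CPU": 323, "per-CPU-IO-buckets": 381,
               "per-NUMA": 306, "per-NUMA-IO-buckets": 374},
}

def gain_percent(baseline, candidate):
    """Relative change of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0

for workload, row in results.items():
    cpu = gain_percent(row["per-CPU"], row["per-CPU-IO-buckets"])
    numa = gain_percent(row["per-NUMA"], row["per-NUMA-IO-buckets"])
    print(f"{workload:7s} per-CPU buckets: {cpu:+.1f}%   per-NUMA buckets: {numa:+.1f}%")
```

For this scenario the bucketed variants come out between roughly 11% and 23% ahead, depending on the workload.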
> Lets also test what happens with multiple clients against the same subsystem.
Yes, this is a good test to run. I will test and post the results.
Thanks,
--Nilay
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2026-01-04 21:06 ` Sagi Grimberg
2026-01-06 14:16 ` Nilay Shroff
@ 2026-01-07 11:15 ` Hannes Reinecke
1 sibling, 0 replies; 28+ messages in thread
From: Hannes Reinecke @ 2026-01-07 11:15 UTC (permalink / raw)
To: Sagi Grimberg, Nilay Shroff, linux-nvme
Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce
On 1/4/26 22:06, Sagi Grimberg wrote:
>
>
> On 04/01/2026 11:07, Nilay Shroff wrote:
>>
>> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/
>>>>> `cpuload`/`cpuchunks`/`cpumode` ?
>>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode.
>>>> Below is the job
>>>> file I used for the test, followed by the observed throughput result
>>>> for reference.
>>>>
>>>> Job file:
>>>> =========
>>>>
>>>> [global]
>>>> time_based
>>>> runtime=120
>>>> group_reporting=1
>>>>
>>>> [cpu]
>>>> ioengine=cpuio
>>>> cpuload=85
>>>> cpumode=qsort
>>>> numjobs=32
>>>>
>>>> [disk]
>>>> ioengine=io_uring
>>>> filename=/dev/nvme1n2
>>>> rw=<randread/randwrite/randrw>
>>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>>> iodepth=32
>>>> numjobs=32
>>>> direct=1
>>>>
>>>> Throughput:
>>>> ===========
>>>>
>>>> numa round-robin queue-depth adaptive
>>>> ----------- ----------- ----------- ---------
>>>> READ: 1120 MiB/s 2241 MiB/s 2233 MiB/s 2215 MiB/s
>>>> WRITE: 1107 MiB/s 1875 MiB/s 1847 MiB/s 1892 MiB/s
>>>> RW: R:1001 MiB/s R:1047 MiB/s R:1086 MiB/s R:1112 MiB/s
>>>> W:999 MiB/s W:1045 MiB/s W:1084 MiB/s W:1111 MiB/s
>>>>
>>>> When comparing the results, I did not observe a significant throughput
>>>> difference between the queue-depth, round-robin, and adaptive policies.
>>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>>> out the varying latency values and distribute I/O reasonably evenly
>>>> across the active paths (assuming symmetric paths).
>>>>
>>>> Next I'd implement I/O size buckets and also per-numa node weight and
>>>> then rerun tests and share the result. Lets see if these changes help
>>>> further improve the throughput number for adaptive policy. We may then
>>>> again review the results and discuss further.
>>>>
>>>> Thanks,
>>>> --Nilay
>>> two comments:
>>> 1. I'd make reads split slightly biased towards small block sizes,
>>> and writes biased towards larger block sizes
>>> 2. I'd also suggest to measure having weights calculation averaged
>>> out on all numa-node cores and then set percpu (such that
>>> the datapath does not introduce serialization).
>> Thanks for the suggestions. I ran experiments incorporating both points—
>> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
>> weight calculation—using the following setup.
>>
>> Job file:
>> =========
>> [global]
>> time_based
>> runtime=120
>> group_reporting=1
>>
>> [cpu]
>> ioengine=cpuio
>> cpuload=85
>> numjobs=32
>>
>> [disk]
>> ioengine=io_uring
>> filename=/dev/nvme1n1
>> rw=<randread/randwrite/randrw>
>> bssplit=<based-on-I/O-pattern-type>[1]
>> iodepth=32
>> numjobs=32
>> direct=1
>> ==========
>>
>> [1] Block-size distributions:
>> randread => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
>> randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
>> randrw => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>>
>> Results:
>> =======
>>
>> i) Symmetric paths + system load
>> (CPU stress using cpuload):
>>
>>           per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
>>           (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
>>           -------   ------------------   --------   -------------------
>> READ:     636       621                  613        618
>> WRITE:    1832      1847                 1840       1852
>> RW:       R:872     R:869                R:866      R:874
>>           W:872     W:870                W:867      W:876
>>
>> ii) Asymmetric paths + system load
>> (CPU stress using cpuload and iperf3 traffic for inducing network
>> congestion):
>>
>>           per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
>>           (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
>>           -------   ------------------   --------   -------------------
>> READ:     553       543                  540        533
>> WRITE:    1705      1670                 1710       1655
>> RW:       R:769     R:771                R:784      R:772
>>           W:768     W:767                W:785      W:771
>>
>>
>> Looking at the above results,
>> - Per-CPU vs per-CPU with I/O buckets:
>> The per-CPU implementation already averages latency effectively
>> across CPUs.
>> Introducing per-CPU I/O buckets does not provide a meaningful
>> throughput
>> improvement and remains largely comparable.
>>
>> - Per-CPU vs per-NUMA aggregation:
>> Calculating or averaging weights at the NUMA level does not
>> significantly
>> improve throughput over per-CPU weight calculation. Across both
>> symmetric
>> and asymmetric scenarios, the results remain very close.
>>
>> So now based on above results and assessment, unless there are additional
>> scenarios or metrics of interest, shall we proceed with per-CPU weight
>> calculation for this new I/O policy?
>
> I think it is counter intuitive that bucketing I/O sizes does not
> present any advantage. Don't you?
> Maybe the test is not good enough of a representation...
>
> Lets also test what happens with multiple clients against the same
> subsystem.
I am not sure if focussing on NUMA nodes will bring us an advantage
here. NUMA nodes would present an advantage if we can keep I/Os to
different controllers on different NUMA nodes; but with TCP this
is rarely possible (just think of two connections to different
controllers via the same interface ...), so I really think we
should keep the counters per-cpu.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy
2026-01-06 14:16 ` Nilay Shroff
@ 2026-02-02 13:33 ` Nilay Shroff
0 siblings, 0 replies; 28+ messages in thread
From: Nilay Shroff @ 2026-02-02 13:33 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, linux-nvme
Cc: hch, kbusch, dwagner, axboe, kanie, gjoyce
On 1/6/26 7:46 PM, Nilay Shroff wrote:
>
>
> On 1/5/26 2:36 AM, Sagi Grimberg wrote:
>>
>>
>> On 04/01/2026 11:07, Nilay Shroff wrote:
>>>
>>> On 12/27/25 3:07 PM, Sagi Grimberg wrote:
>>>>>> Can you please run benchmarks with `blocksize_range`/`bssplit`/`cpuload`/`cpuchunks`/`cpumode` ?
>>>>> Okay, so I ran the benchmark using bssplit, cpuload, and cpumode. Below is the job
>>>>> file I used for the test, followed by the observed throughput result for reference.
>>>>>
>>>>> Job file:
>>>>> =========
>>>>>
>>>>> [global]
>>>>> time_based
>>>>> runtime=120
>>>>> group_reporting=1
>>>>>
>>>>> [cpu]
>>>>> ioengine=cpuio
>>>>> cpuload=85
>>>>> cpumode=qsort
>>>>> numjobs=32
>>>>>
>>>>> [disk]
>>>>> ioengine=io_uring
>>>>> filename=/dev/nvme1n2
>>>>> rw=<randread/randwrite/randrw>
>>>>> bssplit=4k/10:32k/10:64k/10:128k/30:256k/10:512k/30
>>>>> iodepth=32
>>>>> numjobs=32
>>>>> direct=1
>>>>>
>>>>> Throughput:
>>>>> ===========
>>>>>
>>>>> numa round-robin queue-depth adaptive
>>>>> ----------- ----------- ----------- ---------
>>>>> READ: 1120 MiB/s 2241 MiB/s 2233 MiB/s 2215 MiB/s
>>>>> WRITE: 1107 MiB/s 1875 MiB/s 1847 MiB/s 1892 MiB/s
>>>>> RW: R:1001 MiB/s R:1047 MiB/s R:1086 MiB/s R:1112 MiB/s
>>>>> W:999 MiB/s W:1045 MiB/s W:1084 MiB/s W:1111 MiB/s
>>>>>
>>>>> When comparing the results, I did not observe a significant throughput
>>>>> difference between the queue-depth, round-robin, and adaptive policies.
>>>>> With random I/O of mixed sizes, the adaptive policy appears to average
>>>>> out the varying latency values and distribute I/O reasonably evenly
>>>>> across the active paths (assuming symmetric paths).
>>>>>
>>>>> Next I'd implement I/O size buckets and also per-numa node weight and
>>>>> then rerun tests and share the result. Lets see if these changes help
>>>>> further improve the throughput number for adaptive policy. We may then
>>>>> again review the results and discuss further.
>>>>>
>>>>> Thanks,
>>>>> --Nilay
>>>> two comments:
>>>> 1. I'd make reads split slightly biased towards small block sizes, and writes biased towards larger block sizes
>>>> 2. I'd also suggest to measure having weights calculation averaged out on all numa-node cores and then set percpu (such that
>>>> the datapath does not introduce serialization).
>>> Thanks for the suggestions. I ran experiments incorporating both points—
>>> biasing I/O sizes by operation type and comparing per-CPU vs per-NUMA
>>> weight calculation—using the following setup.
>>>
>>> Job file:
>>> =========
>>> [global]
>>> time_based
>>> runtime=120
>>> group_reporting=1
>>>
>>> [cpu]
>>> ioengine=cpuio
>>> cpuload=85
>>> numjobs=32
>>>
>>> [disk]
>>> ioengine=io_uring
>>> filename=/dev/nvme1n1
>>> rw=<randread/randwrite/randrw>
>>> bssplit=<based-on-I/O-pattern-type>[1]
>>> iodepth=32
>>> numjobs=32
>>> direct=1
>>> ==========
>>>
>>> [1] Block-size distributions:
>>> randread => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
>>> randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
>>> randrw => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
>>>
>>> Results:
>>> =======
>>>
>>> i) Symmetric paths + system load
>>> (CPU stress using cpuload):
>>>
>>> per-CPU per-CPU-IO-buckets per-NUMA per-NUMA-IO-buckets
>>> (MiB/s) (MiB/s) (MiB/s) (MiB/s)
>>> ------- ------------------- -------- -------------------
>>> READ: 636 621 613 618
>>> WRITE: 1832 1847 1840 1852
>>> RW: R:872 R:869 R:866 R:874
>>> W:872 W:870 W:867 W:876
>>>
>>> ii) Asymmetric paths + system load
>>> (CPU stress using cpuload and iperf3 traffic for inducing network congestion):
>>>
>>> per-CPU per-CPU-IO-buckets per-NUMA per-NUMA-IO-buckets
>>> (MiB/s) (MiB/s) (MiB/s) (MiB/s)
>>> ------- ------------------- -------- -------------------
>>> READ: 553 543 540 533
>>> WRITE: 1705 1670 1710 1655
>>> RW: R:769 R:771 R:784 R:772
>>> W:768 W:767 W:785 W:771
>>>
>>>
>>> Looking at the above results,
>>> - Per-CPU vs per-CPU with I/O buckets:
>>> The per-CPU implementation already averages latency effectively across CPUs.
>>> Introducing per-CPU I/O buckets does not provide a meaningful throughput
>>> improvement and remains largely comparable.
>>>
>>> - Per-CPU vs per-NUMA aggregation:
>>> Calculating or averaging weights at the NUMA level does not significantly
>>> improve throughput over per-CPU weight calculation. Across both symmetric
>>> and asymmetric scenarios, the results remain very close.
>>>
>>> So now based on above results and assessment, unless there are additional
>>> scenarios or metrics of interest, shall we proceed with per-CPU weight
>>> calculation for this new I/O policy?
>>
>> I think it is counter intuitive that bucketing I/O sizes does not present any advantage. Don't you?
>> Maybe the test is not good enough of a representation...
>>
> Hmm you were correct, I also thought the same but I couldn't find
> any test which could prove the advantage using I/O buckets. Then
> today I spend some time thinking about the scenarios which could
> prove the worth using I/O buckets. After some thought I came up
> with following use case.
>
> Size-dependent path behavior:
>
> 1. Example:
> Path A: good for ≤16k, bad for ≥32k
> Path B: good for all
>
> Now running mixed I/O (bssplit => 16k/75:64k/25),
>
> Without buckets:
> Path B looks good; scheduler forwards more I/Os towards path B.
>
> With buckets:
> small I/Os are distributed across path A and B
> large I/Os favor path B
>
> So in theory, throughput shall improve with buckets.
>
> 2. Example:
> Path A: good for ≤16k, bad for ≥32k
> Path B: opposite
>
> Without buckets:
> latency averages cancel out
> scheduler sees “paths are equal”
>
> With buckets:
> small I/O bucket favors A
> large I/O bucket favors B
>
> Again in theory, throughput shall improve with buckets.
>
> So with the above thought, I ran another experiment and results
> are shown below:
>
> Injecting additional delay on one path for larger packets (>=32k)
> and mixing I/Os with bssplit => 16k/75:64k/25. So with this
> test, we have,
> Path A: good for ≤16k, bad for ≥32k
> Path B: good for all
>
> per-CPU per-CPU-IO-buckets per-NUMA per-NUMA-IO-buckets
> (MiB/s) (MiB/s) (MiB/s) (MiB/s)
> ------- ------------------- -------- -------------------
> READ: 550 622 523 615
> WRITE: 726 829 747 834
> RW: R:324 R:381 R: 306 R:375
> W:323 W:381 W: 306 W:374
>
> So yes I/O buckets could be useful for the scenario tested
> above. And regarding per-CPU vs per-NUMA weight calculation
> do you agree per-CPU should be good enough for this policy
> as we saw above per-NUMA doesn't help improve much performance?
>
>
>> Lets also test what happens with multiple clients against the same subsystem.
> Yes this is a good test to run, I will test and post result.
>
Finally, I was able to run tests with two nvmf-tcp hosts connected
to the same nvmf-tcp target. Apologies for the delay; setting up this
topology took some time, partly due to recent non-technical infrastructure
challenges after our lab relocation.
The goal of these tests was to evaluate per-CPU vs per-NUMA weight calculation,
with and without I/O size buckets, under multi-client contention.
I ran tests (randread, randwrite and randrw) with mixed I/O sizes (using bssplit)
and added CPU stress on the hosts using cpuload, as in my
earlier tests. The test results and observations follow.
Workload characteristics:
=========================
- Workloads tested: randread, randwrite, randrw
- Mixed I/O sizes using bssplit
- CPU stress induced using cpuload
- Both hosts run workloads simultaneously
Job file:
=========
[global]
time_based
runtime=120
group_reporting=1
[cpu]
ioengine=cpuio
cpuload=85
numjobs=32
[disk]
ioengine=io_uring
filename=/dev/nvme1n1
rw=<randread/randwrite/randrw>
bssplit=<based-on-I/O-pattern-type>[1]
iodepth=32
numjobs=32
direct=1
ramp-time=120
[1] Block-size distributions:
randread => bssplit = 512/30:4k/25:8k/20:16k/15:32k/10
randwrite => bssplit = 4k/10:64k/20:128k/30:256k/40
randrw => bssplit = 512/20:4k/25:32k/25:64k/20:128k/5:256k/5
Test topology:
==============
1. Two nvmf-tcp hosts connected to the same nvmf-tcp target
2. Each host connects to target using two symmetric paths
3. System load on each host is induced using cpuload (as shown in jobfile)
4. Both hosts run I/O workloads concurrently
Results:
=======
Host1:
          per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
          (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
          -------   ------------------   --------   -------------------
READ:     153       164                  166        131
WRITE:    839       837                  889        839
RW:       R:249     R:255                R:226      R:256
          W:247     W:254                W:225      W:253
Host2:
          per-CPU   per-CPU-IO-buckets   per-NUMA   per-NUMA-IO-buckets
          (MiB/s)   (MiB/s)              (MiB/s)    (MiB/s)
          -------   ------------------   --------   -------------------
READ:     268       258                  279        268
WRITE:    1012      992                  880        1017
RW:       R:386     R:410                R:401      R:405
          W:385     W:409                W:399      W:405
From the above results, I get the same impression as earlier, when I ran
similar tests between a single nvmf-tcp host and the target. Looking at the above results:
Per-CPU vs per-CPU with I/O buckets:
- The per-CPU implementation already averages latency effectively across CPUs.
- Introducing per-CPU I/O buckets does not provide a meaningful throughput
  improvement in the general case.
- Results remain largely comparable across workloads and hosts.
- However, as shown in earlier experiments with I/O size-dependent path behavior,
  I/O buckets can provide measurable benefits in specific scenarios.
Per-CPU vs per-NUMA aggregation:
- Calculating or averaging weights at the NUMA level does not significantly improve
  throughput over per-CPU weight calculation.
- This holds true even under multi-host contention.
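Since both hosts share one target, it can also help to look at the combined throughput per variant. The snippet below simply sums the Host1 and Host2 tables; it assumes both hosts ran the same variant during each run, which is not stated explicitly above.

```python
COLUMNS = ["per-CPU", "per-CPU-IO-buckets", "per-NUMA", "per-NUMA-IO-buckets"]

# Throughput in MiB/s, copied from the Host1 and Host2 tables above.
host1 = {
    "READ":   [153, 164, 166, 131],
    "WRITE":  [839, 837, 889, 839],
    "RW (R)": [249, 255, 226, 256],
    "RW (W)": [247, 254, 225, 253],
}
host2 = {
    "READ":   [268, 258, 279, 268],
    "WRITE":  [1012, 992, 880, 1017],
    "RW (R)": [386, 410, 401, 405],
    "RW (W)": [385, 409, 399, 405],
}

def combine(a, b):
    """Sum two result tables cell by cell."""
    return {workload: [x + y for x, y in zip(a[workload], b[workload])]
            for workload in a}

total = combine(host1, host2)
for workload, row in total.items():
    cells = "  ".join(f"{name}={value}" for name, value in zip(COLUMNS, row))
    print(f"{workload:7s} {cells}")
```

Summed this way, the four variants stay within roughly 10% of each other for every workload, with no variant leading consistently.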
Based on all the tests conducted so far, including symmetric and asymmetric paths,
CPU stress, size-dependent path behavior, and multi-client access to the same target,
the results suggest that we should move forward with a per-CPU implementation using
I/O buckets. That said, I am open to any further feedback, suggestions, or additional
scenarios that might be worth evaluating.
Thanks,
--Nilay
end of thread
Thread overview: 28+ messages
2025-11-05 10:33 [RFC PATCHv5 0/7] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 1/7] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
2025-12-12 12:16 ` Sagi Grimberg
2025-11-05 10:33 ` [RFC PATCHv5 2/7] nvme-multipath: add support for adaptive I/O policy Nilay Shroff
2025-12-12 13:04 ` Sagi Grimberg
2025-12-13 7:27 ` Nilay Shroff
2025-12-15 23:36 ` Sagi Grimberg
2025-12-18 11:19 ` Nilay Shroff
2025-12-18 13:46 ` Hannes Reinecke
2025-12-23 14:50 ` Nilay Shroff
2025-12-25 12:45 ` Sagi Grimberg
2025-12-26 18:16 ` Nilay Shroff
2025-12-27 9:33 ` Sagi Grimberg
2025-12-27 9:37 ` Sagi Grimberg
2026-01-04 9:07 ` Nilay Shroff
2026-01-04 21:06 ` Sagi Grimberg
2026-01-06 14:16 ` Nilay Shroff
2026-02-02 13:33 ` Nilay Shroff
2026-01-07 11:15 ` Hannes Reinecke
2025-12-25 12:28 ` Sagi Grimberg
2025-11-05 10:33 ` [RFC PATCHv5 3/7] nvme: add generic debugfs support Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 4/7] nvme-multipath: add debugfs attribute adaptive_ewma_shift Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 5/7] nvme-multipath: add debugfs attribute adaptive_weight_timeout Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 6/7] nvme-multipath: add debugfs attribute adaptive_stat Nilay Shroff
2025-11-05 10:33 ` [RFC PATCHv5 7/7] nvme-multipath: add documentation for adaptive I/O policy Nilay Shroff
2025-12-09 13:56 ` [RFC PATCHv5 0/7] nvme-multipath: introduce " Nilay Shroff
2025-12-12 12:08 ` Sagi Grimberg
2025-12-13 8:22 ` Nilay Shroff