* [RFC PATCH 0/5] nvme-multipath: introduce adaptive I/O policy
@ 2025-09-21 11:12 Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 1/5] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Nilay Shroff @ 2025-09-21 11:12 UTC (permalink / raw)
To: linux-nvme; +Cc: kbusch, hch, sagi, axboe, hare, dwagner, gjoyce
Hi,
This series introduces a new adaptive I/O policy for NVMe native
multipath. The existing policies (numa, round-robin, and queue-depth) are
static and do not adapt to real-time transport performance. The numa
policy selects the path closest to the NUMA node of the current CPU,
optimizing memory and path locality, but ignores actual path performance.
The round-robin policy distributes I/O evenly across all paths, providing
fairness but not performance awareness. The queue-depth policy reacts to
instantaneous queue occupancy, avoiding heavily loaded paths, but does
not account for actual latency, throughput, or link speed.
The new adaptive policy addresses these gaps by selecting paths dynamically
based on measured I/O latency and, for fabrics, the negotiated link
speed. Latency is derived by passively sampling I/O completions. Link
speed is queried from the adapter and factored into path scoring. Each
path is assigned a weight proportional to its score, and I/Os are then
forwarded accordingly. As conditions change (e.g. latency spikes,
bandwidth differences), path weights are updated, automatically
steering traffic toward better-performing paths.
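To illustrate the weight calculation, here is a minimal user-space sketch
with hypothetical smoothed latencies (the in-kernel implementation lives
in nvme_mpath_add_sample() in patch 2/5):

#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

int main(void)
{
	/* hypothetical smoothed (EWMA) latencies: 200 us and 600 us */
	uint64_t slat_ns[2] = { 200000, 600000 };
	uint64_t score[2], total = 0;
	int i;

	for (i = 0; i < 2; i++) {
		/* score is the inverse of latency, scaled by NSEC_PER_SEC */
		score[i] = NSEC_PER_SEC / slat_ns[i];
		total += score[i];
	}
	for (i = 0; i < 2; i++)
		printf("path %d weight = %llu\n", i,
		       (unsigned long long)(score[i] * 100 / total));
	/* prints 75 and 24 here, i.e. roughly a 3:1 I/O split */
	return 0;
}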
Early results show reduced tail latency under mixed workloads and
improved throughput by exploiting higher-speed links more effectively.
For example, with NVMf/TCP using two paths (one throttled with ~30 ms
delay), fio results with random read/write/rw workloads (direct I/O)
showed:
              numa          round-robin   queue-depth   adaptive
              -----------   -----------   -----------   -----------
READ:         50.0 MiB/s    105 MiB/s     230 MiB/s     350 MiB/s
WRITE:        65.9 MiB/s    125 MiB/s     385 MiB/s     446 MiB/s
RW:           R:30.6 MiB/s  R:56.5 MiB/s  R:122 MiB/s   R:175 MiB/s
              W:30.7 MiB/s  W:56.5 MiB/s  W:122 MiB/s   W:175 MiB/s
This patchset includes a total of 5 patches:
[PATCH 1/5] block: expose blk_stat_{enable,disable}_accounting()
- Make blk_stat APIs available to block drivers.
- Needed for per-path latency measurement in adaptive policy.
[PATCH 2/5] nvme-multipath: add adaptive I/O policy
- Implement path scoring based on latency (EWMA).
- Distribute I/O proportionally to per-path weights.
[PATCH 3/5] nvme-multipath: add sysfs attribute for adaptive policy
- Introduce "adp_stat" under nvme path block device.
- Provide observability of latency, weight, and selection stats.
[PATCH 4/5] nvme-tcp: export NIC link speed
- Retrieve negotiated link speed (Mbps) from the adapter.
- Expose via sysfs for visibility/debugging.
[PATCH 5/5] nvme-multipath: factor link speed into path scoring
- Adjust adaptive path weights using link speed as a multiplier.
- Favor higher bandwidth links while still considering latency.
Currently, link speed reporting is implemented only for TCP NICs.
Support for Fibre Channel adapters will follow in a future patch.
As usual, feedback and suggestions are most welcome!
Thanks!
Nilay Shroff (5):
block: expose blk_stat_{enable,disable}_accounting() to drivers
nvme-multipath: add support for adaptive I/O policy
nvme-multipath: add sysfs attribute for adaptive I/O policy
nvmf-tcp: add support for retrieving adapter link speed
nvme-multipath: factor fabric link speed into path score
block/blk-stat.h | 4 -
drivers/nvme/host/core.c | 10 +-
drivers/nvme/host/ioctl.c | 7 +-
drivers/nvme/host/multipath.c | 441 +++++++++++++++++++++++++++++++++-
drivers/nvme/host/nvme.h | 38 ++-
drivers/nvme/host/pr.c | 6 +-
drivers/nvme/host/sysfs.c | 12 +-
drivers/nvme/host/tcp.c | 66 +++++
include/linux/blk-mq.h | 4 +
9 files changed, 562 insertions(+), 26 deletions(-)
--
2.51.0
* [RFC PATCH 1/5] block: expose blk_stat_{enable,disable}_accounting() to drivers
2025-09-21 11:12 [RFC PATCH 0/5] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
@ 2025-09-21 11:12 ` Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 2/5] nvme-multipath: add support for adaptive I/O policy Nilay Shroff
` (3 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Nilay Shroff @ 2025-09-21 11:12 UTC (permalink / raw)
To: linux-nvme; +Cc: kbusch, hch, sagi, axboe, hare, dwagner, gjoyce
The functions blk_stat_enable_accounting() and
blk_stat_disable_accounting() are currently exported, but their
prototypes are only defined in a private header. Move these prototypes
into a common header so that block drivers can directly use these APIs.
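For example (illustrative only, mirroring how the adaptive policy later
in this series uses it), a driver can now simply do:

	#include <linux/blk-mq.h>

	blk_stat_enable_accounting(q);	/* start recording time/size info */
	...
	blk_stat_disable_accounting(q);	/* stop recording */

without reaching into the private block/blk-stat.h header.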
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
block/blk-stat.h | 4 ----
include/linux/blk-mq.h | 4 ++++
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/block/blk-stat.h b/block/blk-stat.h
index 9e05bf18d1be..f5d95dd8c0e9 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -67,10 +67,6 @@ void blk_free_queue_stats(struct blk_queue_stats *);
void blk_stat_add(struct request *rq, u64 now);
-/* record time/size info in request but not add a callback */
-void blk_stat_enable_accounting(struct request_queue *q);
-void blk_stat_disable_accounting(struct request_queue *q);
-
/**
* blk_stat_alloc_callback() - Allocate a block statistics callback.
* @timer_fn: Timer callback function.
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2a5a828f19a0..e35e91ca2284 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -732,6 +732,10 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
bool blk_mq_queue_inflight(struct request_queue *q);
+/* record time/size info in request but not add a callback */
+void blk_stat_enable_accounting(struct request_queue *q);
+void blk_stat_disable_accounting(struct request_queue *q);
+
enum {
/* return when out of requests */
BLK_MQ_REQ_NOWAIT = (__force blk_mq_req_flags_t)(1 << 0),
--
2.51.0
* [RFC PATCH 2/5] nvme-multipath: add support for adaptive I/O policy
2025-09-21 11:12 [RFC PATCH 0/5] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 1/5] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
@ 2025-09-21 11:12 ` Nilay Shroff
2025-09-22 7:30 ` Hannes Reinecke
2025-09-21 11:12 ` [RFC PATCH 3/5] nvme-multipath: add sysfs attribute " Nilay Shroff
` (2 subsequent siblings)
4 siblings, 1 reply; 16+ messages in thread
From: Nilay Shroff @ 2025-09-21 11:12 UTC (permalink / raw)
To: linux-nvme; +Cc: kbusch, hch, sagi, axboe, hare, dwagner, gjoyce
This commit introduces a new I/O policy named "adaptive". Users can
configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
subsystemX/iopolicy".
The adaptive policy dynamically distributes I/O based on measured
completion latency. The main idea is to calculate latency for each path,
derive a weight, and then proportionally forward I/O according to those
weights.
To ensure scalability, path latency is measured per-CPU. Each CPU
maintains its own statistics, and I/O forwarding uses these per-CPU
values. Every ~15 seconds, a simple average latency of the per-CPU batched
samples is computed and fed into an Exponentially Weighted Moving
Average (EWMA):
avg_latency = div_u64(batch, batch_count);
new_ewma_latency = (prev_ewma_latency * (WEIGHT-1) + avg_latency)/WEIGHT
With WEIGHT = 8, this assigns 7/8 (~87.5%) weight to the previous
latency value and 1/8 (~12.5%) to the most recent latency. This
smoothing reduces jitter, adapts quickly to changing conditions,
avoids storing historical samples, and works well for both low and
high I/O rates. Path weights are then derived from the smoothed (EWMA)
latency as follows (example with two paths A and B):
path_A_score = NSEC_PER_SEC / path_A_ewma_latency
path_B_score = NSEC_PER_SEC / path_B_ewma_latency
total_score = path_A_score + path_B_score
path_A_weight = (path_A_score * 100) / total_score
path_B_weight = (path_B_score * 100) / total_score
where:
- path_X_ewma_latency is the smoothed latency of a path in ns
- NSEC_PER_SEC is used as a scaling factor since valid latencies
are < 1 second
- weights are normalized to a 0–100 scale across all paths.
Path credits are refilled based on this weight, with one credit
consumed per I/O. When all credits are consumed, the credits are
refilled again based on the current weight. This ensures that I/O is
distributed across paths proportionally to their calculated weight.
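As a worked example (hypothetical numbers): if a path's smoothed latency
is 500 us and the new ~15 second window average is 100 us, the EWMA
becomes (500 * 7 + 100) / 8 = 450 us after that window, then 406 us after
the next one, converging toward 100 us within a handful of epochs. If the
resulting weights on a given CPU come out as 75 for path A and 25 for
path B, path A is refilled with 75 credits and path B with 25; each
submission consumes one credit, so roughly three out of every four I/Os
issued from that CPU go to path A until all credits are spent and the
refill repeats with the then-current weights.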
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 10 +-
drivers/nvme/host/ioctl.c | 7 +-
drivers/nvme/host/multipath.c | 353 ++++++++++++++++++++++++++++++++--
drivers/nvme/host/nvme.h | 33 +++-
drivers/nvme/host/pr.c | 6 +-
drivers/nvme/host/sysfs.c | 2 +-
6 files changed, 389 insertions(+), 22 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6b7493934535..6cd1d2c3e6ee 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -689,6 +689,7 @@ static void nvme_free_ns(struct kref *kref)
{
struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
+ nvme_mpath_free_stat(ns);
put_disk(ns->disk);
nvme_put_ns_head(ns->head);
nvme_put_ctrl(ns->ctrl);
@@ -4132,6 +4133,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
if (nvme_init_ns_head(ns, info))
goto out_cleanup_disk;
+ if (nvme_mpath_alloc_stat(ns))
+ goto out_unlink_ns;
+
/*
* If multipathing is enabled, the device name for all disks and not
* just those that represent shared namespaces needs to be based on the
@@ -4156,7 +4160,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
}
if (nvme_update_ns_info(ns, info))
- goto out_unlink_ns;
+ goto out_free_ns_stat;
mutex_lock(&ctrl->namespaces_lock);
/*
@@ -4165,7 +4169,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
*/
if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
mutex_unlock(&ctrl->namespaces_lock);
- goto out_unlink_ns;
+ goto out_free_ns_stat;
}
nvme_ns_add_to_ctrl_list(ns);
mutex_unlock(&ctrl->namespaces_lock);
@@ -4196,6 +4200,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
list_del_rcu(&ns->list);
mutex_unlock(&ctrl->namespaces_lock);
synchronize_srcu(&ctrl->srcu);
+out_free_ns_stat:
+ nvme_mpath_free_stat(ns);
out_unlink_ns:
mutex_lock(&ctrl->subsys->lock);
list_del_rcu(&ns->siblings);
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 6b3ac8ae3f34..aab7b795a168 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -716,7 +716,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
flags |= NVME_IOCTL_PARTITION;
srcu_idx = srcu_read_lock(&head->srcu);
- ns = nvme_find_path(head);
+ ns = nvme_find_path(head, open_for_write ? WRITE : READ);
if (!ns)
goto out_unlock;
@@ -747,7 +747,7 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
int srcu_idx, ret = -EWOULDBLOCK;
srcu_idx = srcu_read_lock(&head->srcu);
- ns = nvme_find_path(head);
+ ns = nvme_find_path(head, open_for_write ? WRITE : READ);
if (!ns)
goto out_unlock;
@@ -767,7 +767,8 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
int srcu_idx = srcu_read_lock(&head->srcu);
- struct nvme_ns *ns = nvme_find_path(head);
+ struct nvme_ns *ns = nvme_find_path(head,
+ ioucmd->file->f_mode & FMODE_WRITE ? WRITE : READ);
int ret = -EINVAL;
if (ns)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 3da980dc60d9..4f56a2bf7ea3 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -6,6 +6,8 @@
#include <linux/backing-dev.h>
#include <linux/moduleparam.h>
#include <linux/vmalloc.h>
+#include <linux/blk-mq.h>
+#include <linux/math64.h>
#include <trace/events/block.h>
#include "nvme.h"
@@ -66,9 +68,10 @@ MODULE_PARM_DESC(multipath_always_on,
"create multipath node always except for private namespace with non-unique nsid; note that this also implicitly enables native multipath support");
static const char *nvme_iopolicy_names[] = {
- [NVME_IOPOLICY_NUMA] = "numa",
- [NVME_IOPOLICY_RR] = "round-robin",
- [NVME_IOPOLICY_QD] = "queue-depth",
+ [NVME_IOPOLICY_NUMA] = "numa",
+ [NVME_IOPOLICY_RR] = "round-robin",
+ [NVME_IOPOLICY_QD] = "queue-depth",
+ [NVME_IOPOLICY_ADAPTIVE] = "adaptive",
};
static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -83,6 +86,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
iopolicy = NVME_IOPOLICY_RR;
else if (!strncmp(val, "queue-depth", 11))
iopolicy = NVME_IOPOLICY_QD;
+ else if (!strncmp(val, "adaptive", 8))
+ iopolicy = NVME_IOPOLICY_ADAPTIVE;
else
return -EINVAL;
@@ -196,6 +201,190 @@ void nvme_mpath_start_request(struct request *rq)
}
EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
+int nvme_mpath_alloc_stat(struct nvme_ns *ns)
+{
+ gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
+
+ if (!ns->head->disk)
+ return 0;
+
+ ns->cpu_stat = __alloc_percpu_gfp(
+ 2 * sizeof(struct nvme_path_stat),
+ __alignof__(struct nvme_path_stat),
+ gfp);
+ if (!ns->cpu_stat)
+ return -ENOMEM;
+
+ return 0;
+}
+
+#define NVME_EWMA_SHIFT 3
+static inline u64 ewma_update(u64 old, u64 new)
+{
+ return (old * ((1 << NVME_EWMA_SHIFT) - 1) + new) >> NVME_EWMA_SHIFT;
+}
+
+static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
+{
+ int cpu, srcu_idx;
+ unsigned int rw;
+ struct nvme_path_stat *stat;
+ struct nvme_ns *cur_ns;
+ u32 weight;
+ u64 now, latency, avg_lat_ns;
+ u64 total_score = 0;
+ struct nvme_ns_head *head = ns->head;
+
+ if (list_is_singular(&head->list))
+ return;
+
+ now = ktime_get_ns();
+ latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
+ if (!latency)
+ return;
+
+ /*
+ * As the completion path is serialized (i.e. the same completion queue's
+ * update code never runs simultaneously on multiple CPUs) we can safely
+ * access per cpu nvme path stat here from another cpu (in case the
+ * completion cpu is different from submission cpu).
+ * The only field which could be accessed simultaneously here is the
+ * path ->weight which may be accessed by this function as well as I/O
+ * submission path during path selection logic and we protect ->weight
+ * using READ_ONCE/WRITE_ONCE. Yes this may not be 100% accurate but
+ * we also don't need to be so accurate here as the path credit would
+ * be anyways refilled, based on path weight, once path consumes all
+ * its credits. And we limit path weight/credit max up to 100. Please
+ * also refer nvme_adaptive_path().
+ */
+ cpu = blk_mq_rq_cpu(rq);
+ rw = rq_data_dir(rq);
+ stat = &per_cpu_ptr(ns->cpu_stat, cpu)[rw];
+
+ /*
+ * If latency > ~1s then ignore this sample to prevent EWMA from being
+ * skewed by pathological outliers (multi-second waits, controller
+ * timeouts etc.). This keeps path scores representative of normal
+ * performance and avoids instability from rare spikes. If such high
+ * latency is real, ANA state reporting or keep-alive error counters
+ * will mark the path unhealthy and remove it from the head node list,
+ * so we safely skip such sample here.
+ */
+ if (unlikely(latency > NSEC_PER_SEC)) {
+ stat->nr_ignored++;
+ return;
+ }
+
+ /*
+ * Accumulate latency samples and increment the batch count for each
+ * ~15 second interval. When the interval expires, compute the simple
+ * average latency over that window, then update the smoothed (EWMA)
+ * latency. The path weight is recalculated based on this smoothed
+ * latency.
+ */
+ stat->batch += latency;
+ stat->batch_count++;
+ stat->nr_samples++;
+
+ if (now > stat->last_weight_ts &&
+ (now - stat->last_weight_ts) >= 15 * NSEC_PER_SEC) {
+
+ stat->last_weight_ts = now;
+
+ /*
+ * Find simple average latency for the last epoch (~15 sec
+ * interval).
+ */
+ avg_lat_ns = div_u64(stat->batch, stat->batch_count);
+
+ /*
+ * Calculate smooth/EWMA (Exponentially Weighted Moving Average)
+ * latency. EWMA is preferred over simple average latency
+ * because it smooths naturally, reduces jitter from sudden
+ * spikes, and adapts faster to changing conditions. It also
+ * avoids storing historical samples, and works well for both
+ * slow and fast I/O rates.
+ * Formula:
+ * slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
+ * With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
+ * existing latency and 1/8 (~12.5%) weight to the new latency.
+ */
+ if (unlikely(!stat->slat_ns))
+ stat->slat_ns = avg_lat_ns;
+ else
+ stat->slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
+
+ stat->batch = stat->batch_count = 0;
+
+ srcu_idx = srcu_read_lock(&head->srcu);
+ list_for_each_entry_srcu(cur_ns, &head->list, siblings,
+ srcu_read_lock_held(&head->srcu)) {
+ stat = &per_cpu_ptr(cur_ns->cpu_stat, cpu)[rw];
+ if (!stat->slat_ns)
+ continue;
+
+ /*
+ * Compute the path score (inverse of smoothed latency),
+ * scaled by NSEC_PER_SEC. Floating point math is not
+ * available in the kernel, so fixed-point scaling is
+ * used instead. NSEC_PER_SEC is chosen as the scale
+ * because valid latencies are always < 1 second; and
+ * we ignore longer latencies.
+ */
+ stat->score = div_u64(NSEC_PER_SEC, stat->slat_ns);
+
+ /* Compute total score. */
+ total_score += stat->score;
+ }
+
+ if (!total_score)
+ goto out;
+
+ /*
+ * After computing the total score, we derive per-path weight
+ * (normalized to the range 0–100). The weight represents the
+ * relative share of I/O the path should receive.
+ *
+ * - lower smoothed latency -> higher weight
+ * - higher smoothed latency -> lower weight
+ *
+ * Next, while forwarding I/O, we assign "credits" to each path
+ * based on its weight (please also refer nvme_adaptive_path()):
+ * - Initially, credits = weight.
+ * - Each time an I/O is dispatched on a path, its credits are
+ * decremented proportionally.
+ * - When a path runs out of credits, it becomes temporarily
+ * ineligible until credit is refilled.
+ *
+ * I/O distribution is therefore governed by available credits,
+ * ensuring that over time the proportion of I/O sent to each
+ * path matches its weight (and thus its performance).
+ */
+ list_for_each_entry_srcu(cur_ns, &head->list, siblings,
+ srcu_read_lock_held(&head->srcu)) {
+
+ stat = &per_cpu_ptr(cur_ns->cpu_stat, cpu)[rw];
+ weight = div_u64(stat->score * 100, total_score);
+
+ /*
+ * Ensure the path weight never drops below 1. A weight
+ * of 0 is used only for newly added paths. During
+ * bootstrap, a few I/Os are sent to such paths to
+ * establish an initial weight. Enforcing a minimum
+ * weight of 1 guarantees that no path is forgotten and
+ * that each path is probed at least occasionally.
+ */
+ if (!weight)
+ weight = 1;
+
+ WRITE_ONCE(stat->weight, weight);
+ stat->score = 0;
+ }
+out:
+ srcu_read_unlock(&head->srcu, srcu_idx);
+ }
+}
+
void nvme_mpath_end_request(struct request *rq)
{
struct nvme_ns *ns = rq->q->queuedata;
@@ -203,6 +392,9 @@ void nvme_mpath_end_request(struct request *rq)
if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)
atomic_dec_if_positive(&ns->ctrl->nr_active);
+ if (test_bit(NVME_NS_PATH_STAT, &ns->flags))
+ nvme_mpath_add_sample(rq, ns);
+
if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
return;
bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
@@ -236,6 +428,41 @@ static const char *nvme_ana_state_names[] = {
[NVME_ANA_CHANGE] = "change",
};
+static void nvme_mpath_reset_current_stat(struct nvme_ns *ns)
+{
+ int cpu;
+ struct nvme_path_stat *stat;
+
+ for_each_possible_cpu(cpu) {
+ stat = per_cpu_ptr(ns->cpu_stat, cpu);
+ memset(stat, 0, 2 * sizeof(struct nvme_path_stat));
+ }
+}
+
+static bool nvme_mpath_set_current_adaptive_path(struct nvme_ns *ns)
+{
+ struct nvme_ns_head *head = ns->head;
+
+ if (!head->disk || head->subsys->iopolicy != NVME_IOPOLICY_ADAPTIVE)
+ return false;
+
+ if (test_and_set_bit(NVME_NS_PATH_STAT, &ns->flags))
+ return false;
+
+ blk_stat_enable_accounting(ns->queue);
+ return true;
+}
+
+static bool nvme_mpath_clear_current_adaptive_path(struct nvme_ns *ns)
+{
+ if (!test_and_clear_bit(NVME_NS_PATH_STAT, &ns->flags))
+ return false;
+
+ blk_stat_disable_accounting(ns->queue);
+ nvme_mpath_reset_current_stat(ns);
+ return true;
+}
+
bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
{
struct nvme_ns_head *head = ns->head;
@@ -251,6 +478,8 @@ bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
changed = true;
}
}
+ if (nvme_mpath_clear_current_adaptive_path(ns))
+ changed = true;
out:
return changed;
}
@@ -269,6 +498,18 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
srcu_read_unlock(&ctrl->srcu, srcu_idx);
}
+static void nvme_mpath_set_ctrl_paths(struct nvme_ctrl *ctrl)
+{
+ struct nvme_ns *ns;
+ int srcu_idx;
+
+ srcu_idx = srcu_read_lock(&ctrl->srcu);
+ list_for_each_entry_srcu(ns, &ctrl->namespaces, list,
+ srcu_read_lock_held(&ctrl->srcu))
+ nvme_mpath_set_current_adaptive_path(ns);
+ srcu_read_unlock(&ctrl->srcu, srcu_idx);
+}
+
void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
{
struct nvme_ns_head *head = ns->head;
@@ -281,6 +522,8 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
srcu_read_lock_held(&head->srcu)) {
if (capacity != get_capacity(ns->disk))
clear_bit(NVME_NS_READY, &ns->flags);
+
+ nvme_mpath_reset_current_stat(ns);
}
srcu_read_unlock(&head->srcu, srcu_idx);
@@ -405,6 +648,86 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
return found;
}
+static inline bool nvme_state_is_live(enum nvme_ana_state state)
+{
+ return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
+}
+
+static struct nvme_ns *nvme_adaptive_path(struct nvme_ns_head *head,
+ unsigned int rw)
+{
+ struct nvme_ns *ns, *found = NULL;
+ struct nvme_path_stat *stat;
+ u32 weight;
+ int refill = 0;
+
+ get_cpu();
+retry:
+ list_for_each_entry_srcu(ns, &head->list, siblings,
+ srcu_read_lock_held(&head->srcu)) {
+
+ if (nvme_path_is_disabled(ns) ||
+ !nvme_state_is_live(ns->ana_state))
+ continue;
+
+ stat = &this_cpu_ptr(ns->cpu_stat)[rw];
+
+ /*
+ * When the head path list is singular we skip calculating the
+ * weight of the only path as an optimization, since there is no
+ * other path to forward I/O to. The other possibility is a newly
+ * added path whose weight is not yet known. Such paths are served
+ * round-robin so that they receive some I/O. Once completions for
+ * those I/Os start arriving, the path weight calculation kicks in
+ * and, from then on, the path credits are used for forwarding
+ * I/O.
+ */
+ weight = READ_ONCE(stat->weight);
+ if (unlikely(!weight)) {
+ found = ns;
+ goto out;
+ }
+
+ /*
+ * To keep path selection logic simple, we don't distinguish
+ * between ANA optimized and non-optimized states. The non-
+ * optimized path is expected to have a lower weight, and
+ * therefore fewer credits. As a result, only a small number of
+ * I/Os will be forwarded to paths in the non-optimized state.
+ */
+ if (stat->credit > 0) {
+ --stat->credit;
+ found = ns;
+ goto out;
+ }
+ }
+
+ if (!found && !list_empty(&head->list)) {
+ /*
+ * Refill credits and retry.
+ */
+ list_for_each_entry_srcu(ns, &head->list, siblings,
+ srcu_read_lock_held(&head->srcu)) {
+ if (nvme_path_is_disabled(ns) ||
+ !nvme_state_is_live(ns->ana_state))
+ continue;
+
+ stat = &this_cpu_ptr(ns->cpu_stat)[rw];
+ weight = READ_ONCE(stat->weight);
+ stat->credit = weight;
+ refill = 1;
+ }
+ if (refill)
+ goto retry;
+ }
+out:
+ if (found)
+ stat->sel++;
+
+ put_cpu();
+ return found;
+}
+
static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
{
struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
@@ -461,9 +784,12 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
return ns;
}
-inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
+inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head,
+ unsigned int rw)
{
switch (READ_ONCE(head->subsys->iopolicy)) {
+ case NVME_IOPOLICY_ADAPTIVE:
+ return nvme_adaptive_path(head, rw);
case NVME_IOPOLICY_QD:
return nvme_queue_depth_path(head);
case NVME_IOPOLICY_RR:
@@ -523,7 +849,7 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
return;
srcu_idx = srcu_read_lock(&head->srcu);
- ns = nvme_find_path(head);
+ ns = nvme_find_path(head, bio_data_dir(bio));
if (likely(ns)) {
bio_set_dev(bio, ns->disk->part0);
bio->bi_opf |= REQ_NVME_MPATH;
@@ -565,7 +891,7 @@ static int nvme_ns_head_get_unique_id(struct gendisk *disk, u8 id[16],
int srcu_idx, ret = -EWOULDBLOCK;
srcu_idx = srcu_read_lock(&head->srcu);
- ns = nvme_find_path(head);
+ ns = nvme_find_path(head, READ);
if (ns)
ret = nvme_ns_get_unique_id(ns, id, type);
srcu_read_unlock(&head->srcu, srcu_idx);
@@ -581,7 +907,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
int srcu_idx, ret = -EWOULDBLOCK;
srcu_idx = srcu_read_lock(&head->srcu);
- ns = nvme_find_path(head);
+ ns = nvme_find_path(head, READ);
if (ns)
ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
srcu_read_unlock(&head->srcu, srcu_idx);
@@ -807,6 +1133,10 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
}
mutex_unlock(&head->lock);
+ mutex_lock(&nvme_subsystems_lock);
+ nvme_mpath_set_current_adaptive_path(ns);
+ mutex_unlock(&nvme_subsystems_lock);
+
synchronize_srcu(&head->srcu);
kblockd_schedule_work(&head->requeue_work);
}
@@ -855,11 +1185,6 @@ static int nvme_parse_ana_log(struct nvme_ctrl *ctrl, void *data,
return 0;
}
-static inline bool nvme_state_is_live(enum nvme_ana_state state)
-{
- return state == NVME_ANA_OPTIMIZED || state == NVME_ANA_NONOPTIMIZED;
-}
-
static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
struct nvme_ns *ns)
{
@@ -1037,10 +1362,12 @@ static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys,
WRITE_ONCE(subsys->iopolicy, iopolicy);
- /* iopolicy changes clear the mpath by design */
+ /* iopolicy changes clear/reset the mpath by design */
mutex_lock(&nvme_subsystems_lock);
list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
nvme_mpath_clear_ctrl_paths(ctrl);
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+ nvme_mpath_set_ctrl_paths(ctrl);
mutex_unlock(&nvme_subsystems_lock);
pr_notice("subsysnqn %s iopolicy changed from %s to %s\n",
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index cfd2b5b90b91..aa3f681d7376 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -421,6 +421,7 @@ enum nvme_iopolicy {
NVME_IOPOLICY_NUMA,
NVME_IOPOLICY_RR,
NVME_IOPOLICY_QD,
+ NVME_IOPOLICY_ADAPTIVE,
};
struct nvme_subsystem {
@@ -459,6 +460,19 @@ struct nvme_ns_ids {
u8 csi;
};
+struct nvme_path_stat {
+ u64 nr_samples; /* total num of samples processed */
+ u64 nr_ignored; /* num. of samples ignored */
+ u64 slat_ns; /* smoothed (ewma) latency in nanoseconds */
+ u64 score; /* score used for weight calculation */
+ u64 last_weight_ts; /* timestamp of the last weight calculation */
+ u64 sel; /* num of times this path is selected for I/O */
+ u64 batch; /* accumulated latency sum for current window */
+ u32 batch_count; /* num of samples accumulated in current window */
+ u32 weight; /* path weight */
+ u32 credit; /* path credit for I/O forwarding */
+};
+
/*
* Anchor structure for namespaces. There is one for each namespace in a
* NVMe subsystem that any of our controllers can see, and the namespace
@@ -534,6 +548,7 @@ struct nvme_ns {
#ifdef CONFIG_NVME_MULTIPATH
enum nvme_ana_state ana_state;
u32 ana_grpid;
+ struct nvme_path_stat __percpu *cpu_stat;
#endif
struct list_head siblings;
struct kref kref;
@@ -545,6 +560,7 @@ struct nvme_ns {
#define NVME_NS_FORCE_RO 3
#define NVME_NS_READY 4
#define NVME_NS_SYSFS_ATTR_LINK 5
+#define NVME_NS_PATH_STAT 6
struct cdev cdev;
struct device cdev_device;
@@ -949,7 +965,7 @@ extern const struct attribute_group *nvme_dev_attr_groups[];
extern const struct block_device_operations nvme_bdev_ops;
void nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl);
-struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
+struct nvme_ns *nvme_find_path(struct nvme_ns_head *head, unsigned int rw);
#ifdef CONFIG_NVME_MULTIPATH
static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
{
@@ -977,6 +993,7 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns);
void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
void nvme_mpath_remove_disk(struct nvme_ns_head *head);
void nvme_mpath_start_request(struct request *rq);
+int nvme_mpath_alloc_stat(struct nvme_ns *ns);
void nvme_mpath_end_request(struct request *rq);
static inline void nvme_trace_bio_complete(struct request *req)
@@ -1005,6 +1022,13 @@ static inline bool nvme_mpath_queue_if_no_path(struct nvme_ns_head *head)
return true;
return false;
}
+static inline void nvme_mpath_free_stat(struct nvme_ns *ns)
+{
+ if (!ns->head->disk)
+ return;
+
+ free_percpu(ns->cpu_stat);
+}
#else
#define multipath false
static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
@@ -1096,6 +1120,13 @@ static inline bool nvme_mpath_queue_if_no_path(struct nvme_ns_head *head)
{
return false;
}
+static inline int nvme_mpath_alloc_stat(struct nvme_ns *ns)
+{
+ return 0;
+}
+static inline void nvme_mpath_free_stat(struct nvme_ns *ns)
+{
+}
#endif /* CONFIG_NVME_MULTIPATH */
int nvme_ns_get_unique_id(struct nvme_ns *ns, u8 id[16],
diff --git a/drivers/nvme/host/pr.c b/drivers/nvme/host/pr.c
index ca6a74607b13..9f23793dc12f 100644
--- a/drivers/nvme/host/pr.c
+++ b/drivers/nvme/host/pr.c
@@ -53,10 +53,12 @@ static int nvme_send_ns_head_pr_command(struct block_device *bdev,
struct nvme_command *c, void *data, unsigned int data_len)
{
struct nvme_ns_head *head = bdev->bd_disk->private_data;
- int srcu_idx = srcu_read_lock(&head->srcu);
- struct nvme_ns *ns = nvme_find_path(head);
+ int srcu_idx;
+ struct nvme_ns *ns;
int ret = -EWOULDBLOCK;
+ srcu_idx = srcu_read_lock(&head->srcu);
+ ns = nvme_find_path(head, nvme_is_write(c) ? WRITE : READ);
if (ns) {
c->common.nsid = cpu_to_le32(ns->head->ns_id);
ret = nvme_submit_sync_cmd(ns->queue, c, data, data_len);
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 29430949ce2f..4f9607e9698a 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -194,7 +194,7 @@ static int ns_head_update_nuse(struct nvme_ns_head *head)
return 0;
srcu_idx = srcu_read_lock(&head->srcu);
- ns = nvme_find_path(head);
+ ns = nvme_find_path(head, READ);
if (!ns)
goto out_unlock;
--
2.51.0
* [RFC PATCH 3/5] nvme-multipath: add sysfs attribute for adaptive I/O policy
2025-09-21 11:12 [RFC PATCH 0/5] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 1/5] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 2/5] nvme-multipath: add support for adaptive I/O policy Nilay Shroff
@ 2025-09-21 11:12 ` Nilay Shroff
2025-09-22 7:35 ` Hannes Reinecke
2025-09-21 11:12 ` [RFC PATCH 4/5] nvmf-tcp: add support for retrieving adapter link speed Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 5/5] nvme-multipath: factor fabric link speed into path score Nilay Shroff
4 siblings, 1 reply; 16+ messages in thread
From: Nilay Shroff @ 2025-09-21 11:12 UTC (permalink / raw)
To: linux-nvme; +Cc: kbusch, hch, sagi, axboe, hare, dwagner, gjoyce
This commit introduces a new sysfs attribute, "adp_stat", under the
nvme path block device. This attribute provides visibility into the
state of the adaptive I/O policy and is intended to aid debugging and
observability. We now also calculate a per-path aggregated smoothed
(EWMA) latency so that it can be reported under this new attribute.
The attribute reports per-path aggregated statistics, including I/O
weight, smoothed (EWMA) latency, selection count, processed samples,
and ignored samples.
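For example, reading the attribute might produce something like the
following (path name and values are hypothetical):

	$ cat /sys/block/nvme0c0n1/adp_stat
	74 412345 98304 98310 2 26 1287654 32768 32771 0

where the first five fields are the READ weight, smoothed latency (ns),
selection count, processed samples and ignored samples, and the last
five fields are the same statistics for WRITE.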
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/multipath.c | 77 ++++++++++++++++++++++++++++++++++-
drivers/nvme/host/nvme.h | 2 +
drivers/nvme/host/sysfs.c | 5 +++
3 files changed, 82 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 4f56a2bf7ea3..84c64605d05c 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -224,6 +224,22 @@ static inline u64 ewma_update(u64 old, u64 new)
return (old * ((1 << NVME_EWMA_SHIFT) - 1) + new) >> NVME_EWMA_SHIFT;
}
+static inline void path_ewma_update(atomic64_t *ptr, u64 new)
+{
+ u64 old, slat;
+
+ /*
+ * Since multiple CPUs may update the per-path smoothed (EWMA)
+ * latency concurrently, we use an atomic compare-and-exchange
+ * loop to safely apply the update without losing intermediate
+ * changes.
+ */
+ do {
+ old = atomic64_read(ptr);
+ slat = ewma_update(old, new);
+ } while (atomic64_cmpxchg(ptr, old, slat) != old);
+}
+
static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
{
int cpu, srcu_idx;
@@ -308,11 +324,18 @@ static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
* slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
* With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
* existing latency and 1/8 (~12.5%) weight to the new latency.
+ *
+ * Note that we also calculate the per-path smoothed (EWMA)
+ * latency here, which is then exposed as an aggregated per-path
+ * latency via sysfs for observability/debugging.
*/
- if (unlikely(!stat->slat_ns))
+ if (unlikely(!stat->slat_ns)) {
stat->slat_ns = avg_lat_ns;
- else
+ atomic64_set(&ns->slat_ns[rw], avg_lat_ns);
+ } else {
stat->slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
+ path_ewma_update(&ns->slat_ns[rw], avg_lat_ns);
+ }
stat->batch = stat->batch_count = 0;
@@ -437,6 +460,7 @@ static void nvme_mpath_reset_current_stat(struct nvme_ns *ns)
stat = per_cpu_ptr(ns->cpu_stat, cpu);
memset(stat, 0, 2 * sizeof(struct nvme_path_stat));
}
+ memset(ns->slat_ns, 0, sizeof(ns->slat_ns));
}
static bool nvme_mpath_set_current_adaptive_path(struct nvme_ns *ns)
@@ -1450,6 +1474,55 @@ static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr
}
DEVICE_ATTR_RO(numa_nodes);
+static void adp_stat_read_all(struct nvme_ns *ns, struct nvme_path_stat *batch)
+{
+ int i, cpu;
+ int ncpu[2] = {0};
+ struct nvme_path_stat *stat;
+
+ for_each_online_cpu(cpu) {
+ stat = per_cpu_ptr(ns->cpu_stat, cpu);
+
+ for (i = 0; i < 2; i++) {
+ if (stat[i].weight) {
+ batch[i].weight += stat[i].weight;
+ batch[i].sel += stat[i].sel;
+ batch[i].nr_samples += stat[i].nr_samples;
+ batch[i].nr_ignored += stat[i].nr_ignored;
+ ncpu[i]++;
+ }
+ }
+ }
+
+ for (i = 0; i < 2; i++) {
+ if (!ncpu[i])
+ continue;
+ batch[i].weight = DIV_U64_ROUND_CLOSEST(batch[i].weight, ncpu[i]);
+ }
+}
+
+static ssize_t adp_stat_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct nvme_path_stat stat[2] = {0};
+ struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+
+ adp_stat_read_all(ns, stat);
+ return sysfs_emit(buf, "%u %llu %llu %llu %llu %u %llu %llu %llu %llu\n",
+ stat[READ].weight,
+ atomic64_read(&ns->slat_ns[READ]),
+ stat[READ].sel,
+ stat[READ].nr_samples,
+ stat[READ].nr_ignored,
+ stat[WRITE].weight,
+ atomic64_read(&ns->slat_ns[WRITE]),
+ stat[WRITE].sel,
+ stat[WRITE].nr_samples,
+ stat[WRITE].nr_ignored);
+
+}
+DEVICE_ATTR_RO(adp_stat);
+
static ssize_t delayed_removal_secs_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index aa3f681d7376..22445cf4f5d5 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -548,6 +548,7 @@ struct nvme_ns {
#ifdef CONFIG_NVME_MULTIPATH
enum nvme_ana_state ana_state;
u32 ana_grpid;
+ atomic64_t slat_ns[2]; /* path smoothed (EWMA) latency in nanoseconds */
struct nvme_path_stat __percpu *cpu_stat;
#endif
struct list_head siblings;
@@ -1009,6 +1010,7 @@ extern struct device_attribute dev_attr_ana_grpid;
extern struct device_attribute dev_attr_ana_state;
extern struct device_attribute dev_attr_queue_depth;
extern struct device_attribute dev_attr_numa_nodes;
+extern struct device_attribute dev_attr_adp_stat;
extern struct device_attribute dev_attr_delayed_removal_secs;
extern struct device_attribute subsys_attr_iopolicy;
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 4f9607e9698a..cb04539e2e2c 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -260,6 +260,7 @@ static struct attribute *nvme_ns_attrs[] = {
&dev_attr_ana_state.attr,
&dev_attr_queue_depth.attr,
&dev_attr_numa_nodes.attr,
+ &dev_attr_adp_stat.attr,
&dev_attr_delayed_removal_secs.attr,
#endif
&dev_attr_io_passthru_err_log_enabled.attr,
@@ -303,6 +304,10 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
if (!nvme_disk_is_ns_head(disk))
return 0;
}
+ if (a == &dev_attr_adp_stat.attr) {
+ if (nvme_disk_is_ns_head(dev_to_disk(dev)))
+ return 0;
+ }
#endif
return a->mode;
}
--
2.51.0
* [RFC PATCH 4/5] nvmf-tcp: add support for retrieving adapter link speed
2025-09-21 11:12 [RFC PATCH 0/5] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
` (2 preceding siblings ...)
2025-09-21 11:12 ` [RFC PATCH 3/5] nvme-multipath: add sysfs attribute " Nilay Shroff
@ 2025-09-21 11:12 ` Nilay Shroff
2025-09-22 7:38 ` Hannes Reinecke
2025-09-21 11:12 ` [RFC PATCH 5/5] nvme-multipath: factor fabric link speed into path score Nilay Shroff
4 siblings, 1 reply; 16+ messages in thread
From: Nilay Shroff @ 2025-09-21 11:12 UTC (permalink / raw)
To: linux-nvme; +Cc: kbusch, hch, sagi, axboe, hare, dwagner, gjoyce
Add support for retrieving the negotiated NIC link speed (in Mbps).
This value can be factored into path scoring for the adaptive I/O
policy. For visibility and debugging, a new sysfs attribute "speed"
is also added under the NVMe path block device.
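For example (path name and value are hypothetical), a path whose TCP
connection runs over a 25GbE NIC would report:

	$ cat /sys/block/nvme0c0n1/speed
	25000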
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/multipath.c | 11 ++++++
drivers/nvme/host/nvme.h | 3 ++
drivers/nvme/host/sysfs.c | 5 +++
drivers/nvme/host/tcp.c | 66 +++++++++++++++++++++++++++++++++++
4 files changed, 85 insertions(+)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 84c64605d05c..bcceb0fceb94 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -548,6 +548,8 @@ void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
clear_bit(NVME_NS_READY, &ns->flags);
nvme_mpath_reset_current_stat(ns);
+ if (ns->ctrl->ops->get_link_speed)
+ ns->speed = ns->ctrl->ops->get_link_speed(ns->ctrl);
}
srcu_read_unlock(&head->srcu, srcu_idx);
@@ -1566,6 +1568,15 @@ static ssize_t delayed_removal_secs_store(struct device *dev,
DEVICE_ATTR_RW(delayed_removal_secs);
+static ssize_t speed_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+
+ return sysfs_emit(buf, "%u\n", ns->speed);
+}
+DEVICE_ATTR_RO(speed);
+
static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
struct nvme_ana_group_desc *desc, void *data)
{
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 22445cf4f5d5..665f4a4cb52b 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -548,6 +548,7 @@ struct nvme_ns {
#ifdef CONFIG_NVME_MULTIPATH
enum nvme_ana_state ana_state;
u32 ana_grpid;
+ u32 speed; /* path link speed (in Mbps) for fabrics */
atomic64_t slat_ns[2]; /* path smoothed (EWMA) latency in nanoseconds */
struct nvme_path_stat __percpu *cpu_stat;
#endif
@@ -593,6 +594,7 @@ struct nvme_ctrl_ops {
void (*delete_ctrl)(struct nvme_ctrl *ctrl);
void (*stop_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
+ u32 (*get_link_speed)(struct nvme_ctrl *ctrl);
void (*print_device_info)(struct nvme_ctrl *ctrl);
bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
};
@@ -1012,6 +1014,7 @@ extern struct device_attribute dev_attr_queue_depth;
extern struct device_attribute dev_attr_numa_nodes;
extern struct device_attribute dev_attr_adp_stat;
extern struct device_attribute dev_attr_delayed_removal_secs;
+extern struct device_attribute dev_attr_speed;
extern struct device_attribute subsys_attr_iopolicy;
static inline bool nvme_disk_is_ns_head(struct gendisk *disk)
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index cb04539e2e2c..5858c2426efd 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -262,6 +262,7 @@ static struct attribute *nvme_ns_attrs[] = {
&dev_attr_numa_nodes.attr,
&dev_attr_adp_stat.attr,
&dev_attr_delayed_removal_secs.attr,
+ &dev_attr_speed.attr,
#endif
&dev_attr_io_passthru_err_log_enabled.attr,
NULL,
@@ -308,6 +309,10 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
if (nvme_disk_is_ns_head(dev_to_disk(dev)))
return 0;
}
+ if (a == &dev_attr_speed.attr) {
+ if (nvme_disk_is_ns_head(dev_to_disk(dev)))
+ return 0;
+ }
#endif
return a->mode;
}
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index c0fe8cfb7229..694f8cbe080d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -11,6 +11,8 @@
#include <linux/crc32.h>
#include <linux/nvme-tcp.h>
#include <linux/nvme-keyring.h>
+#include <linux/ethtool.h>
+#include <net/ip6_route.h>
#include <net/sock.h>
#include <net/tcp.h>
#include <net/tls.h>
@@ -2825,6 +2827,69 @@ static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
return len;
}
+static u32 nvme_tcp_get_link_speed(struct nvme_ctrl *ctrl)
+{
+ struct net *net;
+ struct sock *sk;
+ struct dst_entry *dst;
+ struct ethtool_link_ksettings cmd;
+ struct nvme_tcp_queue *queue = &to_tcp_ctrl(ctrl)->queues[0];
+ u32 speed = 0;
+
+ if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
+ return 0;
+
+ rtnl_lock();
+ sk = queue->sock->sk;
+ /*
+ * First try to get cached dst entry, if it's not available then
+ * fallback to route lookup.
+ */
+ dst = sk_dst_get(sk);
+ if (likely(dst)) {
+ if (!__ethtool_get_link_ksettings(dst->dev, &cmd))
+ speed = cmd.base.speed;
+ dst_release(dst);
+ } else {
+ net = sock_net(sk);
+
+ if (sk->sk_family == AF_INET) {
+ struct rtable *rt;
+ struct flowi4 fl4;
+ struct inet_sock *inet = inet_sk(sk);
+
+ inet_sk_init_flowi4(inet, &fl4);
+ rt = ip_route_output_flow(net, &fl4, sk);
+ if (IS_ERR(rt))
+ goto out;
+ if (!__ethtool_get_link_ksettings(rt->dst.dev, &cmd))
+ speed = cmd.base.speed;
+ ip_rt_put(rt);
+ }
+#if (IS_ENABLED(CONFIG_IPV6))
+ else if (sk->sk_family == AF_INET6) {
+ struct flowi6 fl6;
+ struct ipv6_pinfo *np = inet6_sk(sk);
+
+ fl6.saddr = np->saddr;
+ fl6.daddr = sk->sk_v6_daddr;
+ fl6.flowi6_oif = sk->sk_bound_dev_if;
+ fl6.flowi6_proto = sk->sk_protocol;
+
+ dst = ip6_route_output(net, sk, &fl6);
+ if (dst->error)
+ goto out;
+ if (!__ethtool_get_link_ksettings(dst->dev, &cmd))
+ speed = cmd.base.speed;
+ dst_release(dst);
+ }
+#endif
+ }
+out:
+ rtnl_unlock();
+ return speed;
+}
+
static const struct blk_mq_ops nvme_tcp_mq_ops = {
.queue_rq = nvme_tcp_queue_rq,
.commit_rqs = nvme_tcp_commit_rqs,
@@ -2858,6 +2923,7 @@ static const struct nvme_ctrl_ops nvme_tcp_ctrl_ops = {
.submit_async_event = nvme_tcp_submit_async_event,
.delete_ctrl = nvme_tcp_delete_ctrl,
.get_address = nvme_tcp_get_address,
+ .get_link_speed = nvme_tcp_get_link_speed,
.stop_ctrl = nvme_tcp_stop_ctrl,
};
--
2.51.0
* [RFC PATCH 5/5] nvme-multipath: factor fabric link speed into path score
2025-09-21 11:12 [RFC PATCH 0/5] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
` (3 preceding siblings ...)
2025-09-21 11:12 ` [RFC PATCH 4/5] nvmf-tcp: add support for retrieving adapter link speed Nilay Shroff
@ 2025-09-21 11:12 ` Nilay Shroff
4 siblings, 0 replies; 16+ messages in thread
From: Nilay Shroff @ 2025-09-21 11:12 UTC (permalink / raw)
To: linux-nvme; +Cc: kbusch, hch, sagi, axboe, hare, dwagner, gjoyce
If the fabric adapter link speed is known, include it when calculating
the path score for the adaptive I/O policy. Paths with higher link
speed receive proportionally higher scores, while paths with lower link
speed receive lower scores.
For example, in a multipath topology with two paths—one with higher
link speed but higher latency, and another with lower link speed but
lower latency—the scoring formula balances these factors. The result
ensures that path selection does not blindly favor high link speed, but
adjusts scores based on both link speed and latency to achieve
proportional distribution.
The updated path scoring formula is:
path_X_score = link_speed_X * (NSEC_PER_SEC / path_X_ewma_latency)
where:
- link_speed_X is the negotiated link speed of the fabric adapter
(in Mbps),
- path_X_ewma_latency is the smoothed latency (ns) derived from I/O
completions,
- NSEC_PER_SEC is used as a scaling factor.
Weights are then normalized across all paths:
path_X_weight = (path_X_score * 100) / total_score
This ensures that both lower latency and higher link speed contribute
positively to path selection, while still distributing I/O
proportionally when conditions differ across paths.
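As a worked example (hypothetical numbers): consider path A on a 25000
Mbps link with a smoothed latency of 400 us, and path B on a 10000 Mbps
link with a smoothed latency of 200 us. Then:

	path_A_score = 25000 * (NSEC_PER_SEC / 400000) = 62500000
	path_B_score = 10000 * (NSEC_PER_SEC / 200000) = 50000000
	total_score  = 112500000

	path_A_weight = (62500000 * 100) / 112500000 = 55
	path_B_weight = (50000000 * 100) / 112500000 = 44

Without the link-speed factor the weights would have been roughly 33 and
66 in favor of the lower-latency path B; factoring in the faster link
shifts the split toward path A while still rewarding B's lower latency.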
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/multipath.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index bcceb0fceb94..6ab42350284d 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -246,7 +246,7 @@ static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
unsigned int rw;
struct nvme_path_stat *stat;
struct nvme_ns *cur_ns;
- u32 weight;
+ u32 weight, speed;
u64 now, latency, avg_lat_ns;
u64 total_score = 0;
struct nvme_ns_head *head = ns->head;
@@ -347,14 +347,18 @@ static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
continue;
/*
- * Compute the path score (inverse of smoothed latency),
- * scaled by NSEC_PER_SEC. Floating point math is not
- * available in the kernel, so fixed-point scaling is
- * used instead. NSEC_PER_SEC is chosen as the scale
- * because valid latencies are always < 1 second; and
- * we ignore longer latencies.
+ * Compute the path score as the inverse of smoothed
+ * latency, scaled by NSEC_PER_SEC. If the device speed
+ * is known, it is factored in: higher speed increases
+ * the score, lower speed decreases it. Floating point
+ * math is unavailable in the kernel, so fixed-point
+ * scaling is used instead. NSEC_PER_SEC is chosen
+ * because valid latencies are always < 1 second; longer
+ * latencies are ignored.
*/
- stat->score = div_u64(NSEC_PER_SEC, stat->slat_ns);
+ speed = cur_ns->speed ? : 1;
+ stat->score = speed * div_u64(NSEC_PER_SEC,
+ stat->slat_ns);
/* Compute total score. */
total_score += stat->score;
--
2.51.0
* Re: [RFC PATCH 2/5] nvme-multipath: add support for adaptive I/O policy
2025-09-21 11:12 ` [RFC PATCH 2/5] nvme-multipath: add support for adaptive I/O policy Nilay Shroff
@ 2025-09-22 7:30 ` Hannes Reinecke
2025-09-23 3:43 ` Nilay Shroff
0 siblings, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2025-09-22 7:30 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, sagi, axboe, dwagner, gjoyce
On 9/21/25 13:12, Nilay Shroff wrote:
> This commit introduces a new I/O policy named "adaptive". Users can
> configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
> subsystemX/iopolicy"
>
> The adaptive policy dynamically distributes I/O based on measured
> completion latency. The main idea is to calculate latency for each path,
> derive a weight, and then proportionally forward I/O according to those
> weights.
>
> To ensure scalability, path latency is measured per-CPU. Each CPU
> maintains its own statistics, and I/O forwarding uses these per-CPU
> values. Every ~15 seconds, a simple average latency of per-CPU batched
> samples are computed and fed into an Exponentially Weighted Moving
> Average (EWMA):
>
> avg_latency = div_u64(batch, batch_count);
> new_ewma_latency = (prev_ewma_latency * (WEIGHT-1) + avg_latency)/WEIGHT
>
> With WEIGHT = 8, this assigns 7/8 (~87.5%) weight to the previous
> latency value and 1/8 (~12.5%) to the most recent latency. This
> smoothing reduces jitter, adapts quickly to changing conditions,
> avoids storing historical samples, and works well for both low and
> high I/O rates. Path weights are then derived from the smoothed (EWMA)
> latency as follows (example with two paths A and B):
>
> path_A_score = NSEC_PER_SEC / path_A_ewma_latency
> path_B_score = NSEC_PER_SEC / path_B_ewma_latency
> total_score = path_A_score + path_B_score
>
> path_A_weight = (path_A_score * 100) / total_score
> path_B_weight = (path_B_score * 100) / total_score
>
> where:
> - path_X_ewma_latency is the smoothed latency of a path in ns
> - NSEC_PER_SEC is used as a scaling factor since valid latencies
> are < 1 second
> - weights are normalized to a 0–100 scale across all paths.
>
> Path credits are refilled based on this weight, with one credit
> consumed per I/O. When all credits are consumed, the credits are
> refilled again based on the current weight. This ensures that I/O is
> distributed across paths proportionally to their calculated weight.
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> drivers/nvme/host/core.c | 10 +-
> drivers/nvme/host/ioctl.c | 7 +-
> drivers/nvme/host/multipath.c | 353 ++++++++++++++++++++++++++++++++--
> drivers/nvme/host/nvme.h | 33 +++-
> drivers/nvme/host/pr.c | 6 +-
> drivers/nvme/host/sysfs.c | 2 +-
> 6 files changed, 389 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 6b7493934535..6cd1d2c3e6ee 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -689,6 +689,7 @@ static void nvme_free_ns(struct kref *kref)
> {
> struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
>
> + nvme_mpath_free_stat(ns);
> put_disk(ns->disk);
> nvme_put_ns_head(ns->head);
> nvme_put_ctrl(ns->ctrl);
> @@ -4132,6 +4133,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
> if (nvme_init_ns_head(ns, info))
> goto out_cleanup_disk;
>
> + if (nvme_mpath_alloc_stat(ns))
> + goto out_unlink_ns;
> +
> /*
> * If multipathing is enabled, the device name for all disks and not
> * just those that represent shared namespaces needs to be based on the
> @@ -4156,7 +4160,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
> }
>
> if (nvme_update_ns_info(ns, info))
> - goto out_unlink_ns;
> + goto out_free_ns_stat;
>
> mutex_lock(&ctrl->namespaces_lock);
> /*
> @@ -4165,7 +4169,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
> */
> if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
> mutex_unlock(&ctrl->namespaces_lock);
> - goto out_unlink_ns;
> + goto out_free_ns_stat;
> }
> nvme_ns_add_to_ctrl_list(ns);
> mutex_unlock(&ctrl->namespaces_lock);
> @@ -4196,6 +4200,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
> list_del_rcu(&ns->list);
> mutex_unlock(&ctrl->namespaces_lock);
> synchronize_srcu(&ctrl->srcu);
> +out_free_ns_stat:
> + nvme_mpath_free_stat(ns);
> out_unlink_ns:
> mutex_lock(&ctrl->subsys->lock);
> list_del_rcu(&ns->siblings);
> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> index 6b3ac8ae3f34..aab7b795a168 100644
> --- a/drivers/nvme/host/ioctl.c
> +++ b/drivers/nvme/host/ioctl.c
> @@ -716,7 +716,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
> flags |= NVME_IOCTL_PARTITION;
>
> srcu_idx = srcu_read_lock(&head->srcu);
> - ns = nvme_find_path(head);
> + ns = nvme_find_path(head, open_for_write ? WRITE : READ);
> if (!ns)
> goto out_unlock;
>
> @@ -747,7 +747,7 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
> int srcu_idx, ret = -EWOULDBLOCK;
>
> srcu_idx = srcu_read_lock(&head->srcu);
> - ns = nvme_find_path(head);
> + ns = nvme_find_path(head, open_for_write ? WRITE : READ);
> if (!ns)
> goto out_unlock;
>
> @@ -767,7 +767,8 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
> struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
> struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
> int srcu_idx = srcu_read_lock(&head->srcu);
> - struct nvme_ns *ns = nvme_find_path(head);
> + struct nvme_ns *ns = nvme_find_path(head,
> + ioucmd->file->f_mode & FMODE_WRITE ? WRITE : READ);
> int ret = -EINVAL;
>
> if (ns)
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 3da980dc60d9..4f56a2bf7ea3 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -6,6 +6,8 @@
> #include <linux/backing-dev.h>
> #include <linux/moduleparam.h>
> #include <linux/vmalloc.h>
> +#include <linux/blk-mq.h>
> +#include <linux/math64.h>
> #include <trace/events/block.h>
> #include "nvme.h"
>
> @@ -66,9 +68,10 @@ MODULE_PARM_DESC(multipath_always_on,
> "create multipath node always except for private namespace with non-unique nsid; note that this also implicitly enables native multipath support");
>
> static const char *nvme_iopolicy_names[] = {
> - [NVME_IOPOLICY_NUMA] = "numa",
> - [NVME_IOPOLICY_RR] = "round-robin",
> - [NVME_IOPOLICY_QD] = "queue-depth",
> + [NVME_IOPOLICY_NUMA] = "numa",
> + [NVME_IOPOLICY_RR] = "round-robin",
> + [NVME_IOPOLICY_QD] = "queue-depth",
> + [NVME_IOPOLICY_ADAPTIVE] = "adaptive",
> };
>
> static int iopolicy = NVME_IOPOLICY_NUMA;
> @@ -83,6 +86,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
> iopolicy = NVME_IOPOLICY_RR;
> else if (!strncmp(val, "queue-depth", 11))
> iopolicy = NVME_IOPOLICY_QD;
> + else if (!strncmp(val, "adaptive", 8))
> + iopolicy = NVME_IOPOLICY_ADAPTIVE;
> else
> return -EINVAL;
>
> @@ -196,6 +201,190 @@ void nvme_mpath_start_request(struct request *rq)
> }
> EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
>
> +int nvme_mpath_alloc_stat(struct nvme_ns *ns)
> +{
> + gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
> +
> + if (!ns->head->disk)
> + return 0;
> +
> + ns->cpu_stat = __alloc_percpu_gfp(
> + 2 * sizeof(struct nvme_path_stat),
> + __alignof__(struct nvme_path_stat),
> + gfp);
> + if (!ns->cpu_stat)
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> +#define NVME_EWMA_SHIFT 3
> +static inline u64 ewma_update(u64 old, u64 new)
> +{
> + return (old * ((1 << NVME_EWMA_SHIFT) - 1) + new) >> NVME_EWMA_SHIFT;
> +}
> +
> +static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
> +{
> + int cpu, srcu_idx;
> + unsigned int rw;
> + struct nvme_path_stat *stat;
> + struct nvme_ns *cur_ns;
> + u32 weight;
> + u64 now, latency, avg_lat_ns;
> + u64 total_score = 0;
> + struct nvme_ns_head *head = ns->head;
> +
> + if (list_is_singular(&head->list))
> + return;
> +
> + now = ktime_get_ns();
> + latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
> + if (!latency)
> + return;
> +
> + /*
> + * As completion code path is serialized(i.e. no same completion queue
> + * update code could run simultaneously on multiple cpu) we can safely
> + * access per cpu nvme path stat here from another cpu (in case the
> + * completion cpu is different from submission cpu).
> + * The only field which could be accessed simultaneously here is the
> + * path ->weight which may be accessed by this function as well as I/O
> + * submission path during path selection logic and we protect ->weight
> + * using READ_ONCE/WRITE_ONCE. Yes this may not be 100% accurate but
> + * we also don't need to be so accurate here as the path credit would
> + * be anyways refilled, based on path weight, once path consumes all
> + * its credits. And we limit path weight/credit max up to 100. Please
> + * also refer nvme_adaptive_path().
> + */
> + cpu = blk_mq_rq_cpu(rq);
> + rw = rq_data_dir(rq);
> + stat = &per_cpu_ptr(ns->cpu_stat, cpu)[rw];
> +
This is a tad awkward for setups where #CPUs > #paths.
> + /*
> + * If latency > ~1s then ignore this sample to prevent EWMA from being
> + * skewed by pathological outliers (multi-second waits, controller
> + * timeouts etc.). This keeps path scores representative of normal
> + * performance and avoids instability from rare spikes. If such high
> + * latency is real, ANA state reporting or keep-alive error counters
> + * will mark the path unhealthy and remove it from the head node list,
> + * so we safely skip such sample here.
> + */
> + if (unlikely(latency > NSEC_PER_SEC)) {
> + stat->nr_ignored++;
> + return;
> + }
> +
> + /*
> + * Accumulate latency samples and increment the batch count for each
> + * ~15 second interval. When the interval expires, compute the simple
> + * average latency over that window, then update the smoothed (EWMA)
> + * latency. The path weight is recalculated based on this smoothed
> + * latency.
> + */
> + stat->batch += latency;
> + stat->batch_count++;
> + stat->nr_samples++;
> +
> + if (now > stat->last_weight_ts &&
> + (now - stat->last_weight_ts) >= 15 * NSEC_PER_SEC) {
> +
> + stat->last_weight_ts = now;
> +
> + /*
> + * Find simple average latency for the last epoch (~15 sec
> + * interval).
> + */
> + avg_lat_ns = div_u64(stat->batch, stat->batch_count);
> +
> + /*
> + * Calculate smooth/EWMA (Exponentially Weighted Moving Average)
> + * latency. EWMA is preferred over simple average latency
> + * because it smooths naturally, reduces jitter from sudden
> + * spikes, and adapts faster to changing conditions. It also
> + * avoids storing historical samples, and works well for both
> + * slow and fast I/O rates.
> + * Formula:
> + * slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
> + * With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
> + * existing latency and 1/8 (~12.5%) weight to the new latency.
> + */
> + if (unlikely(!stat->slat_ns))
> + stat->slat_ns = avg_lat_ns;
> + else
> + stat->slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
> +
> + stat->batch = stat->batch_count = 0;
> +
> + srcu_idx = srcu_read_lock(&head->srcu);
> + list_for_each_entry_srcu(cur_ns, &head->list, siblings,
> + srcu_read_lock_held(&head->srcu)) {
And this is even more awkward as we need to iterate over all paths
(during completion!).
Do we really need to do this?
What would happen if we just measure the latency on the local CPU
and do away with this loop?
We would have fewer samples, true, but we would be able to
differentiate not only between distinct path latencies but also between
different CPU latencies; I would think this would be a bonus for
multi-socket machines.
_And_ we wouldn't need to worry about path failures, which is bound
to expose some race conditions if we need to iterate paths at the
same time as path failures are being handled.
But nevertheless: great job!
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [RFC PATCH 3/5] nvme-multipath: add sysfs attribute for adaptive I/O policy
2025-09-21 11:12 ` [RFC PATCH 3/5] nvme-multipath: add sysfs attribute " Nilay Shroff
@ 2025-09-22 7:35 ` Hannes Reinecke
2025-09-23 3:53 ` Nilay Shroff
0 siblings, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2025-09-22 7:35 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, sagi, axboe, dwagner, gjoyce
On 9/21/25 13:12, Nilay Shroff wrote:
> This commit introduces a new sysfs attribute, "adp_stat", under the
> nvme path block device. This attribute provides visibility into the
> state of the adaptive I/O policy and is intended to aid debugging and
> observability. We now also calculate the per-path aggregated smoothed
> (EWMA) latency for reporting it under this new attribute.
>
> The attribute reports per-path aggregated statistics, including I/O
> weight, smoothed (EWMA) latency, selection count, processed samples,
> and ignored samples.
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> drivers/nvme/host/multipath.c | 77 ++++++++++++++++++++++++++++++++++-
> drivers/nvme/host/nvme.h | 2 +
> drivers/nvme/host/sysfs.c | 5 +++
> 3 files changed, 82 insertions(+), 2 deletions(-)
>
Wouldn't this be better off if situated in the debugfs directly?
Exposing the stats is not really crucial to operations, and mainly
for debugging purposes only.
Exposing the weight from the EWMA algorithm, OTOH, really does influence
the performance, and might be an idea to expose.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [RFC PATCH 4/5] nvmf-tcp: add support for retrieving adapter link speed
2025-09-21 11:12 ` [RFC PATCH 4/5] nvmf-tcp: add support for retrieving adapter link speed Nilay Shroff
@ 2025-09-22 7:38 ` Hannes Reinecke
2025-09-23 9:33 ` Nilay Shroff
0 siblings, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2025-09-22 7:38 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, sagi, axboe, dwagner, gjoyce
On 9/21/25 13:12, Nilay Shroff wrote:
> Add support for retrieving the negotiated NIC link speed (in Mbps).
> This value can be factored into path scoring for the adaptive I/O
> policy. For visibility and debugging, a new sysfs attribute "speed"
> is also added under the NVMe path block device.
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> drivers/nvme/host/multipath.c | 11 ++++++
> drivers/nvme/host/nvme.h | 3 ++
> drivers/nvme/host/sysfs.c | 5 +++
> drivers/nvme/host/tcp.c | 66 +++++++++++++++++++++++++++++++++++
> 4 files changed, 85 insertions(+)
>
Why not for FC? We can easily extract the link speed from there, too ...
But why do we need to do that? We already calculated the weighted
average, so we _know_ the latency of each path. And then it's
pretty much immaterial if a path runs with a given speed; if the
latency is lower, that path is being preferred.
Irrespective of the speed, which might be deceptive anyway as
you'll only ever be able to retrieve the speed of the local
link, not of the entire path.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [RFC PATCH 2/5] nvme-multipath: add support for adaptive I/O policy
2025-09-22 7:30 ` Hannes Reinecke
@ 2025-09-23 3:43 ` Nilay Shroff
2025-09-23 7:03 ` Hannes Reinecke
0 siblings, 1 reply; 16+ messages in thread
From: Nilay Shroff @ 2025-09-23 3:43 UTC (permalink / raw)
To: Hannes Reinecke, linux-nvme; +Cc: kbusch, hch, sagi, axboe, dwagner, gjoyce
On 9/22/25 1:00 PM, Hannes Reinecke wrote:
> On 9/21/25 13:12, Nilay Shroff wrote:
>> This commit introduces a new I/O policy named "adaptive". Users can
>> configure it by writing "adaptive" to "/sys/class/nvme-subsystem/nvme-
>> subsystemX/iopolicy"
>>
>> The adaptive policy dynamically distributes I/O based on measured
>> completion latency. The main idea is to calculate latency for each path,
>> derive a weight, and then proportionally forward I/O according to those
>> weights.
>>
>> To ensure scalability, path latency is measured per-CPU. Each CPU
>> maintains its own statistics, and I/O forwarding uses these per-CPU
>> values. Every ~15 seconds, a simple average latency of per-CPU batched
>> samples are computed and fed into an Exponentially Weighted Moving
>> Average (EWMA):
>>
>> avg_latency = div_u64(batch, batch_count);
>> new_ewma_latency = (prev_ewma_latency * (WEIGHT-1) + avg_latency)/WEIGHT
>>
>> With WEIGHT = 8, this assigns 7/8 (~87.5%) weight to the previous
>> latency value and 1/8 (~12.5%) to the most recent latency. This
>> smoothing reduces jitter, adapts quickly to changing conditions,
>> avoids storing historical samples, and works well for both low and
>> high I/O rates. Path weights are then derived from the smoothed (EWMA)
>> latency as follows (example with two paths A and B):
>>
>> path_A_score = NSEC_PER_SEC / path_A_ewma_latency
>> path_B_score = NSEC_PER_SEC / path_B_ewma_latency
>> total_score = path_A_score + path_B_score
>>
>> path_A_weight = (path_A_score * 100) / total_score
>> path_B_weight = (path_B_score * 100) / total_score
>>
>> where:
>> - path_X_ewma_latency is the smoothed latency of a path in ns
>> - NSEC_PER_SEC is used as a scaling factor since valid latencies
>> are < 1 second
>> - weights are normalized to a 0–100 scale across all paths.
>>
>> Path credits are refilled based on this weight, with one credit
>> consumed per I/O. When all credits are consumed, the credits are
>> refilled again based on the current weight. This ensures that I/O is
>> distributed across paths proportionally to their calculated weight.
>>
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>> drivers/nvme/host/core.c | 10 +-
>> drivers/nvme/host/ioctl.c | 7 +-
>> drivers/nvme/host/multipath.c | 353 ++++++++++++++++++++++++++++++++--
>> drivers/nvme/host/nvme.h | 33 +++-
>> drivers/nvme/host/pr.c | 6 +-
>> drivers/nvme/host/sysfs.c | 2 +-
>> 6 files changed, 389 insertions(+), 22 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 6b7493934535..6cd1d2c3e6ee 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -689,6 +689,7 @@ static void nvme_free_ns(struct kref *kref)
>> {
>> struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
>> + nvme_mpath_free_stat(ns);
>> put_disk(ns->disk);
>> nvme_put_ns_head(ns->head);
>> nvme_put_ctrl(ns->ctrl);
>> @@ -4132,6 +4133,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>> if (nvme_init_ns_head(ns, info))
>> goto out_cleanup_disk;
>> + if (nvme_mpath_alloc_stat(ns))
>> + goto out_unlink_ns;
>> +
>> /*
>> * If multipathing is enabled, the device name for all disks and not
>> * just those that represent shared namespaces needs to be based on the
>> @@ -4156,7 +4160,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>> }
>> if (nvme_update_ns_info(ns, info))
>> - goto out_unlink_ns;
>> + goto out_free_ns_stat;
>> mutex_lock(&ctrl->namespaces_lock);
>> /*
>> @@ -4165,7 +4169,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>> */
>> if (test_bit(NVME_CTRL_FROZEN, &ctrl->flags)) {
>> mutex_unlock(&ctrl->namespaces_lock);
>> - goto out_unlink_ns;
>> + goto out_free_ns_stat;
>> }
>> nvme_ns_add_to_ctrl_list(ns);
>> mutex_unlock(&ctrl->namespaces_lock);
>> @@ -4196,6 +4200,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
>> list_del_rcu(&ns->list);
>> mutex_unlock(&ctrl->namespaces_lock);
>> synchronize_srcu(&ctrl->srcu);
>> +out_free_ns_stat:
>> + nvme_mpath_free_stat(ns);
>> out_unlink_ns:
>> mutex_lock(&ctrl->subsys->lock);
>> list_del_rcu(&ns->siblings);
>> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
>> index 6b3ac8ae3f34..aab7b795a168 100644
>> --- a/drivers/nvme/host/ioctl.c
>> +++ b/drivers/nvme/host/ioctl.c
>> @@ -716,7 +716,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>> flags |= NVME_IOCTL_PARTITION;
>> srcu_idx = srcu_read_lock(&head->srcu);
>> - ns = nvme_find_path(head);
>> + ns = nvme_find_path(head, open_for_write ? WRITE : READ);
>> if (!ns)
>> goto out_unlock;
>> @@ -747,7 +747,7 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>> int srcu_idx, ret = -EWOULDBLOCK;
>> srcu_idx = srcu_read_lock(&head->srcu);
>> - ns = nvme_find_path(head);
>> + ns = nvme_find_path(head, open_for_write ? WRITE : READ);
>> if (!ns)
>> goto out_unlock;
>> @@ -767,7 +767,8 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
>> struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>> struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
>> int srcu_idx = srcu_read_lock(&head->srcu);
>> - struct nvme_ns *ns = nvme_find_path(head);
>> + struct nvme_ns *ns = nvme_find_path(head,
>> + ioucmd->file->f_mode & FMODE_WRITE ? WRITE : READ);
>> int ret = -EINVAL;
>> if (ns)
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 3da980dc60d9..4f56a2bf7ea3 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -6,6 +6,8 @@
>> #include <linux/backing-dev.h>
>> #include <linux/moduleparam.h>
>> #include <linux/vmalloc.h>
>> +#include <linux/blk-mq.h>
>> +#include <linux/math64.h>
>> #include <trace/events/block.h>
>> #include "nvme.h"
>> @@ -66,9 +68,10 @@ MODULE_PARM_DESC(multipath_always_on,
>> "create multipath node always except for private namespace with non-unique nsid; note that this also implicitly enables native multipath support");
>> static const char *nvme_iopolicy_names[] = {
>> - [NVME_IOPOLICY_NUMA] = "numa",
>> - [NVME_IOPOLICY_RR] = "round-robin",
>> - [NVME_IOPOLICY_QD] = "queue-depth",
>> + [NVME_IOPOLICY_NUMA] = "numa",
>> + [NVME_IOPOLICY_RR] = "round-robin",
>> + [NVME_IOPOLICY_QD] = "queue-depth",
>> + [NVME_IOPOLICY_ADAPTIVE] = "adaptive",
>> };
>> static int iopolicy = NVME_IOPOLICY_NUMA;
>> @@ -83,6 +86,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>> iopolicy = NVME_IOPOLICY_RR;
>> else if (!strncmp(val, "queue-depth", 11))
>> iopolicy = NVME_IOPOLICY_QD;
>> + else if (!strncmp(val, "adaptive", 8))
>> + iopolicy = NVME_IOPOLICY_ADAPTIVE;
>> else
>> return -EINVAL;
>> @@ -196,6 +201,190 @@ void nvme_mpath_start_request(struct request *rq)
>> }
>> EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
>> +int nvme_mpath_alloc_stat(struct nvme_ns *ns)
>> +{
>> + gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
>> +
>> + if (!ns->head->disk)
>> + return 0;
>> +
>> + ns->cpu_stat = __alloc_percpu_gfp(
>> + 2 * sizeof(struct nvme_path_stat),
>> + __alignof__(struct nvme_path_stat),
>> + gfp);
>> + if (!ns->cpu_stat)
>> + return -ENOMEM;
>> +
>> + return 0;
>> +}
>> +
>> +#define NVME_EWMA_SHIFT 3
>> +static inline u64 ewma_update(u64 old, u64 new)
>> +{
>> + return (old * ((1 << NVME_EWMA_SHIFT) - 1) + new) >> NVME_EWMA_SHIFT;
>> +}
>> +
>> +static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
>> +{
>> + int cpu, srcu_idx;
>> + unsigned int rw;
>> + struct nvme_path_stat *stat;
>> + struct nvme_ns *cur_ns;
>> + u32 weight;
>> + u64 now, latency, avg_lat_ns;
>> + u64 total_score = 0;
>> + struct nvme_ns_head *head = ns->head;
>> +
>> + if (list_is_singular(&head->list))
>> + return;
>> +
>> + now = ktime_get_ns();
>> + latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
>> + if (!latency)
>> + return;
>> +
>> + /*
>> + * As completion code path is serialized(i.e. no same completion queue
>> + * update code could run simultaneously on multiple cpu) we can safely
>> + * access per cpu nvme path stat here from another cpu (in case the
>> + * completion cpu is different from submission cpu).
>> + * The only field which could be accessed simultaneously here is the
>> + * path ->weight which may be accessed by this function as well as I/O
>> + * submission path during path selection logic and we protect ->weight
>> + * using READ_ONCE/WRITE_ONCE. Yes this may not be 100% accurate but
>> + * we also don't need to be so accurate here as the path credit would
>> + * be anyways refilled, based on path weight, once path consumes all
>> + * its credits. And we limit path weight/credit max up to 100. Please
>> + * also refer nvme_adaptive_path().
>> + */
>> + cpu = blk_mq_rq_cpu(rq);
>> + rw = rq_data_dir(rq);
>> + stat = &per_cpu_ptr(ns->cpu_stat, cpu)[rw];
>> +
>
> This is tad awkward for setups where #CPUs > #paths.
>> + /*
>> + * If latency > ~1s then ignore this sample to prevent EWMA from being
>> + * skewed by pathological outliers (multi-second waits, controller
>> + * timeouts etc.). This keeps path scores representative of normal
>> + * performance and avoids instability from rare spikes. If such high
>> + * latency is real, ANA state reporting or keep-alive error counters
>> + * will mark the path unhealthy and remove it from the head node list,
>> + * so we safely skip such sample here.
>> + */
>> + if (unlikely(latency > NSEC_PER_SEC)) {
>> + stat->nr_ignored++;
>> + return;
>> + }
>> +
>> + /*
>> + * Accumulate latency samples and increment the batch count for each
>> + * ~15 second interval. When the interval expires, compute the simple
>> + * average latency over that window, then update the smoothed (EWMA)
>> + * latency. The path weight is recalculated based on this smoothed
>> + * latency.
>> + */
>> + stat->batch += latency;
>> + stat->batch_count++;
>> + stat->nr_samples++;
>> +
>> + if (now > stat->last_weight_ts &&
>> + (now - stat->last_weight_ts) >= 15 * NSEC_PER_SEC) {
>> +
>> + stat->last_weight_ts = now;
>> +
>> + /*
>> + * Find simple average latency for the last epoch (~15 sec
>> + * interval).
>> + */
>> + avg_lat_ns = div_u64(stat->batch, stat->batch_count);
>> +
>> + /*
>> + * Calculate smooth/EWMA (Exponentially Weighted Moving Average)
>> + * latency. EWMA is preferred over simple average latency
>> + * because it smooths naturally, reduces jitter from sudden
>> + * spikes, and adapts faster to changing conditions. It also
>> + * avoids storing historical samples, and works well for both
>> + * slow and fast I/O rates.
>> + * Formula:
>> + * slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
>> + * With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
>> + * existing latency and 1/8 (~12.5%) weight to the new latency.
>> + */
>> + if (unlikely(!stat->slat_ns))
>> + stat->slat_ns = avg_lat_ns;
>> + else
>> + stat->slat_ns = ewma_update(stat->slat_ns, avg_lat_ns);
>> +
>> + stat->batch = stat->batch_count = 0;
>> +
>> + srcu_idx = srcu_read_lock(&head->srcu);
>> + list_for_each_entry_srcu(cur_ns, &head->list, siblings,
>> + srcu_read_lock_held(&head->srcu)) {
>
> And this is even more awkward as we need to iterate over all paths
> (during completion!).
>
Hmm yes, but we only iterate once every ~15 seconds per CPU, so the overhead is minimal.
Typically we don’t have a large number of paths to deal with: enterprise SSDs usually
expose at most two controllers, and even in fabrics setups the path count is usually
limited to around 4–6. So the loop should run quite fast.
Also, looping in itself isn’t unusual — for example, the queue-depth I/O policy already
iterates over all paths in the submission path to check queue depth before dispatching each
I/O. That said, if looping in the completion path is still a concern, we could consider
moving this into a dedicated worker thread instead. What do you think?
> Do we really need to do this?
> What would happen if we just measure the latency on the local CPU
> and do away with this loop?
> We would have less samples, true, but we would even be able to
> not only differentiate between distinct path latency but also between
> different CPU latencies; I would think this being a bonus for
> multi-socket machines.
>
The idea is to keep the per-CPU view consistent for each path. As we know,
in NVMe/fabrics multipath, submission and completion CPUs don’t necessarily
match (depends on the host’s irq/core mapping). And so if we were to measure
the latency/EWMA locally per-cpu then the per-CPU accumulator might be biased
towards the completion CPU, not the submission CPU. For instance, if submission
is on CPU A but completion lands on CPU B, then CPU A’s weights never reflect
its I/O experience — they’ll be skewed by how interrupts get steered.
So on multi-socket/NUMA systems, depending on topology, calculating local
per-cpu ewma/latency may or may not line up. For example:
- If #cpu <= #vectors supported by the NVMe disk, then we typically have a 1:1 mapping
between submission and completion queues, so all completions for a queue are steered
to the same CPU that submits, and per-CPU stats are accurate.
- But when #CPUs > #vectors, completions may be centralized or spread differently. In that
case, the per-CPU latency view can be distorted — e.g., CPU A may submit, but CPU B takes
completions, so CPU A’s weights never reflect its own I/O behavior.
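That's why, in the patch, the sample is always charged to the submitting CPU even when
the completion runs elsewhere; roughly, as a small illustrative helper (this just mirrors
the hunk above, with comments added; the helper name is made up):

	/* illustration only: pick the stat slot for a completed request */
	static struct nvme_path_stat *nvme_adp_stat_slot(struct nvme_ns *ns,
							 struct request *rq)
	{
		/*
		 * blk_mq_rq_cpu() returns the CPU the request was queued on,
		 * so the sample is charged to the submitting CPU even when
		 * this runs in a completion context on another CPU. Using
		 * smp_processor_id() here instead would credit the completion
		 * CPU and skew the per-CPU view.
		 */
		return &per_cpu_ptr(ns->cpu_stat, blk_mq_rq_cpu(rq))[rq_data_dir(rq)];
	}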
> _And_ we wouldn't need to worry about path failures, which is bound
> to expose some race conditions if we need to iterate paths at the
> same time than path failures are being handled.
>
Yes, agreed, we may have some race here and so the path score/weight may be
skewed when that happens, but then that'd be auto-corrected in the next epoch
(after ~15 sec) when we re-calculate the path weight/score again, wouldn't it?
> But nevertheless: great job!
Thank you :)
--Nilay
* Re: [RFC PATCH 3/5] nvme-multipath: add sysfs attribute for adaptive I/O policy
2025-09-22 7:35 ` Hannes Reinecke
@ 2025-09-23 3:53 ` Nilay Shroff
0 siblings, 0 replies; 16+ messages in thread
From: Nilay Shroff @ 2025-09-23 3:53 UTC (permalink / raw)
To: Hannes Reinecke, linux-nvme; +Cc: kbusch, hch, sagi, axboe, dwagner, gjoyce
On 9/22/25 1:05 PM, Hannes Reinecke wrote:
> On 9/21/25 13:12, Nilay Shroff wrote:
>> This commit introduces a new sysfs attribute, "adp_stat", under the
>> nvme path block device. This attribute provides visibility into the
>> state of the adaptive I/O policy and is intended to aid debugging and
>> observability. We now also calculate the per-path aggregated smoothed
>> (EWMA) latency for reporting it under this new attribute.
>>
>> The attribute reports per-path aggregated statistics, including I/O
>> weight, smoothed (EWMA) latency, selection count, processed samples,
>> and ignored samples.
>>
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>> drivers/nvme/host/multipath.c | 77 ++++++++++++++++++++++++++++++++++-
>> drivers/nvme/host/nvme.h | 2 +
>> drivers/nvme/host/sysfs.c | 5 +++
>> 3 files changed, 82 insertions(+), 2 deletions(-)
>>
> Wouldn't this be better off if situated in the debugfs directly?
> Exposing the stats is not really crucial to operations, and mainly
> for debugging purposes only.
>
> Exposing the weight from the EWMA algorithm, OTOH, really does influence
> the performance, and might be an idea to expose.
>
Yes, I think exposing this under debugfs is a good idea.
Maybe we could also expose the per-CPU stats under debugfs.
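Just to illustrate what I have in mind, a rough (untested) sketch; the
ns_debugfs_dir dentry is a placeholder here and only the READ slot is
printed, to keep it short:

	static int adp_stat_show(struct seq_file *s, void *unused)
	{
		struct nvme_ns *ns = s->private;
		int cpu;

		for_each_possible_cpu(cpu) {
			struct nvme_path_stat *stat = per_cpu_ptr(ns->cpu_stat, cpu);

			seq_printf(s, "cpu%d: slat_ns=%llu samples=%llu ignored=%llu\n",
				   cpu, stat[READ].slat_ns, stat[READ].nr_samples,
				   stat[READ].nr_ignored);
		}
		return 0;
	}
	DEFINE_SHOW_ATTRIBUTE(adp_stat);

	/* during namespace setup, assuming we keep a per-ns debugfs dentry */
	debugfs_create_file("adp_stat", 0444, ns_debugfs_dir, ns, &adp_stat_fops);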
Thanks,
--Nilay
* Re: [RFC PATCH 2/5] nvme-multipath: add support for adaptive I/O policy
2025-09-23 3:43 ` Nilay Shroff
@ 2025-09-23 7:03 ` Hannes Reinecke
2025-09-23 10:56 ` Nilay Shroff
0 siblings, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2025-09-23 7:03 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, sagi, axboe, dwagner, gjoyce
On 9/23/25 05:43, Nilay Shroff wrote:
>
>
> On 9/22/25 1:00 PM, Hannes Reinecke wrote:
>> On 9/21/25 13:12, Nilay Shroff wrote:
[ .. ]
>>> + srcu_idx = srcu_read_lock(&head->srcu);
>>> + list_for_each_entry_srcu(cur_ns, &head->list, siblings,
>>> + srcu_read_lock_held(&head->srcu)) {
>>
>> And this is even more awkward as we need to iterate over all paths
>> (during completion!).
>>
> Hmm yes, but we only iterate once every ~15 seconds per CPU, so the overhead is minimal.
> Typically we don’t have a large number of paths to deal with: enterprise SSDs usually
> expose at most two controllers, and even in fabrics setups the path count is usually
> limited to around 4–6. So the loop should run quite fast.
Hmm. Not from my experience. There is at least one implementation from a
rather substantial array vendor exposing up to low hundreds of queues.
> Also, looping in itself isn’t unusual — for example, the queue-depth I/O policy already
> iterates over all paths in the submission path to check queue depth before dispatching each
> I/O. That said, if looping in the completion path is still a concern, we could consider
> moving this into a dedicated worker thread instead. What do you think?
>
Not sure if that's a good idea; either the worker thread runs
asynchronously to the completion and then we have to deal with reliably
adding up numbers, or we're running synchronously and lose performance.
Still think that _not_ iterating and just adding up single-cpu latencies
might be worthwhile.
>> Do we really need to do this?
>> What would happen if we just measure the latency on the local CPU
>> and do away with this loop?
>> We would have less samples, true, but we would even be able to
>> not only differentiate between distinct path latency but also between
>> different CPU latencies; I would think this being a bonus for
>> multi-socket machines.
>>
> The idea is to keep per-cpu view consistent for each path. As we know,
> in NVMe/fabrics multipath, submission and completion CPUs don’t necessarily
> match (depends on the host’s irq/core mapping). And so if we were to measure
> the latency/EWMA locally per-cpu then the per-CPU accumulator might be biased
> towards the completion CPU, not the submission CPU. For instance, if submission
> is on CPU A but completion lands on CPU B, then CPU A’s weights never reflect
> it's I/O experience — they’ll be skewed by how interrupts get steered.
>
True. Problem is that for the #CPUs > #queues case we're setting up a CPU
affinity group, and interrupts are directed to one of the CPUs in that
group. I had hoped that the blk-mq code would raise a softirq in that
case and call .end_request on the CPU registered in the request itself.
That probably needs to be evaluated.
> So on a multi socket/NUMA systems, depending on topology, calculating local
> per-cpu ewma/latency may or may not line up. For example:
>
> - If we have #cpu <= #vectors supported by NVMe disk then typically
> we have 1:1 mapping between submission and completion queues and hence all completions for
> a queue are steered to the same CPU that also submits, then per-CPU stats are accurate.
>
> - But when #CPUs > #vectors, completions may be centralized or spread differently. In that
> case, the per-CPU latency view can be distorted — e.g., CPU A may submit, but CPU B takes
> completions, so CPU A’s weights never reflect its own I/O behavior.
>
See above. We might check whether blk-mq doesn't already cover this case.
Thing is, I actually _do_ want to measure per-CPU latency.
On a multi-socket system it really does matter whether an I/O is run
from a CPU on the socket attached to the PCI device, or from an
off-socket CPU. If we are calculating just the per-path latency
we completely miss that (as blk-mq will spread out across _all_
cpus), but if we are measuring a per-cpu latency we will end up
with a differential matrix where cpus with the lowest latency
will be preferred.
So if we have a system with two sockets and two PCI HBAs, each
connected to a different socket, using per-path latency will be
spreading out I/Os across all cpus. Using per-cpu latency will
direct I/Os to the cpus with the lowest latency, preferring
the local cpus.
>> _And_ we wouldn't need to worry about path failures, which is bound
>> to expose some race conditions if we need to iterate paths at the
>> same time than path failures are being handled.
>>
> Yes agreed we may have some race here and so the path score/weight may be
> skewed when that happens but then that'd be auto-corrected in the next epoc
> (after ~15 sec) when we re-calculate the path weight/score again, isn't it?
>
Let's see. I still would want to check if we can't do per-cpu
statistics, as that would automatically avoid any races :-)
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [RFC PATCH 4/5] nvmf-tcp: add support for retrieving adapter link speed
2025-09-22 7:38 ` Hannes Reinecke
@ 2025-09-23 9:33 ` Nilay Shroff
2025-09-23 10:27 ` Hannes Reinecke
0 siblings, 1 reply; 16+ messages in thread
From: Nilay Shroff @ 2025-09-23 9:33 UTC (permalink / raw)
To: Hannes Reinecke, linux-nvme; +Cc: kbusch, hch, sagi, axboe, dwagner, gjoyce
On 9/22/25 1:08 PM, Hannes Reinecke wrote:
> On 9/21/25 13:12, Nilay Shroff wrote:
>> Add support for retrieving the negotiated NIC link speed (in Mbps).
>> This value can be factored into path scoring for the adaptive I/O
>> policy. For visibility and debugging, a new sysfs attribute "speed"
>> is also added under the NVMe path block device.
>>
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>> drivers/nvme/host/multipath.c | 11 ++++++
>> drivers/nvme/host/nvme.h | 3 ++
>> drivers/nvme/host/sysfs.c | 5 +++
>> drivers/nvme/host/tcp.c | 66 +++++++++++++++++++++++++++++++++++
>> 4 files changed, 85 insertions(+)
>>
> Why not for FC? We can easily extract the link speed from there, too ...
>
Yes, it's easy to get the speed for FC. I just wanted to get feedback from
the community about this idea first, and so didn't include it. But I will do
that in a future patchset.
> But why do we need to do that? We already calculated the weighted
> average, so we _know_ the latency of each path. And then it's
> pretty much immaterial if a path runs with a given speed; if the
> latency is lower, that path is being preferred.
> Irrespective of the speed, which might be deceptive anyway as
> you'll only ever be able to retrieve the speed of the local
> link, not of the entire path.
>
Consider a scenario with two paths: one over a high-capacity link
(e.g. 1000 Mbps) and another over a much smaller link (e.g. 10 Mbps).
If both paths report the same latency, the current formula would
assign them identical weights. But in reality, the higher-capacity
path can sustain a larger number of I/Os compared to the lower-
capacity one.
In such cases, factoring in link speed allows us to assign proportionally
higher weight to the higher-capacity path. At the same time, if that same
path exhibits higher latency, it will be penalized accordingly, ensuring
the final score balances both latency and bandwidth.
So, including link speed in the weight calculation helps capture both
dimensions—latency sensitivity and throughput capacity—leading to a more
accurate and proportional I/O distribution.
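To put numbers on it (treating the negotiated speed as a simple linear
multiplier, purely for illustration): with an equal EWMA latency of 1 ms, both
paths score NSEC_PER_SEC / 1e6 = 1000, and latency alone gives a 50/50 split.
Scaling the scores by 1000 Mbps vs 10 Mbps yields 1,000,000 vs 10,000, i.e.
weights of roughly 99 and 1, so the fat link carries ~99% of the I/O until its
latency rises and pulls its weight back down.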
Thanks,
--Nilay
* Re: [RFC PATCH 4/5] nvmf-tcp: add support for retrieving adapter link speed
2025-09-23 9:33 ` Nilay Shroff
@ 2025-09-23 10:27 ` Hannes Reinecke
2025-09-23 17:58 ` Nilay Shroff
0 siblings, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2025-09-23 10:27 UTC (permalink / raw)
To: Nilay Shroff, linux-nvme; +Cc: kbusch, hch, sagi, axboe, dwagner, gjoyce
On 9/23/25 11:33, Nilay Shroff wrote:
>
>
> On 9/22/25 1:08 PM, Hannes Reinecke wrote:
>> On 9/21/25 13:12, Nilay Shroff wrote:
>>> Add support for retrieving the negotiated NIC link speed (in Mbps).
>>> This value can be factored into path scoring for the adaptive I/O
>>> policy. For visibility and debugging, a new sysfs attribute "speed"
>>> is also added under the NVMe path block device.
>>>
>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>> ---
>>> drivers/nvme/host/multipath.c | 11 ++++++
>>> drivers/nvme/host/nvme.h | 3 ++
>>> drivers/nvme/host/sysfs.c | 5 +++
>>> drivers/nvme/host/tcp.c | 66 +++++++++++++++++++++++++++++++++++
>>> 4 files changed, 85 insertions(+)
>>>
>> Why not for FC? We can easily extract the link speed from there, too ...
>>
> Yes it's easy to get the speed for FC. I just wanted to get feedback from
> the community about this idea and so didn't include it. But I will do that
> in the future patchset.
>
>> But why do we need to do that? We already calculated the weighted
>> average, so we _know_ the latency of each path. And then it's
>> pretty much immaterial if a path runs with a given speed; if the
>> latency is lower, that path is being preferred.
>> Irrespective of the speed, which might be deceptive anyway as
>> you'll only ever be able to retrieve the speed of the local
>> link, not of the entire path.
>>
> Consider a scenario with two paths: one over a high-capacity link
> (e.g. 1000 Mbps) and another over a much smaller link (e.g. 10 Mbps).
> If both paths report the same latency, the current formula would
> assign them identical weights. But in reality, the higher-capacity
> path can sustain a larger number of I/Os compared to the lower-
> capacity one.
>
That would be correct if the transfer speed is assumed to be negligible.
But I would assume that we do transfer mainly in units of PAGE_SIZE,
so with 4k PAGE_SIZE we'll spend roughly 3.3 ms on a 10Mbps link, but ~33 us on a
1000Mbps link. That actually is one of the issues we're facing with
measuring latency: we only have access to the combined latency
(submission/data transfer/completion), so it's really hard to separate
them out.
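For reference, the back-of-the-envelope numbers (raw wire time only, no
protocol or queueing overhead assumed):

	4 KiB = 32768 bits
	10 Mbps:   32768 / 10^7 s ≈ 3.3 ms per page
	1000 Mbps: 32768 / 10^9 s ≈ 33 us per page

so the ratio stays 100:1 regardless of the absolute values.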
> In such cases, factoring in link speed allows us to assign proportionally
> higher weight to the higher-capacity path. At the same time, if that same
> path exhibits higher latency, it will be penalized accordingly, ensuring
> the final score balances both latency and bandwidth.
>
See above. If we could measure them separately, yes. But we can't.
> So, including link speed in the weight calculation helps capture both
> dimensions—latency sensitivity and throughput capacity—leading to a more
> accurate and proportional I/O distribution.
>
Would be true if we could measure it properly. But we can only get
the speed on the local link; everything behind that is anyone's guess, and
it would skew measurements even more if we assume the same link speed
for the entire path.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [RFC PATCH 2/5] nvme-multipath: add support for adaptive I/O policy
2025-09-23 7:03 ` Hannes Reinecke
@ 2025-09-23 10:56 ` Nilay Shroff
0 siblings, 0 replies; 16+ messages in thread
From: Nilay Shroff @ 2025-09-23 10:56 UTC (permalink / raw)
To: Hannes Reinecke, linux-nvme; +Cc: kbusch, hch, sagi, axboe, dwagner, gjoyce
On 9/23/25 12:33 PM, Hannes Reinecke wrote:
> On 9/23/25 05:43, Nilay Shroff wrote:
>>
>>
>> On 9/22/25 1:00 PM, Hannes Reinecke wrote:
>>> On 9/21/25 13:12, Nilay Shroff wrote:
> [ .. ]
>>>> + srcu_idx = srcu_read_lock(&head->srcu);
>>>> + list_for_each_entry_srcu(cur_ns, &head->list, siblings,
>>>> + srcu_read_lock_held(&head->srcu)) {
>>>
>>> And this is even more awkward as we need to iterate over all paths
>>> (during completion!).
>>>
>> Hmm yes, but we only iterate once every ~15 seconds per CPU, so the overhead is minimal.
>> Typically we don’t have a large number of paths to deal with: enterprise SSDs usually
>> expose at most two controllers, and even in fabrics setups the path count is usually
>> limited to around 4–6. So the loop should run quite fast.
>
> Hmm. Not from my experience. There is at least one implementation from a
> rather substantial array vendor exposing up to low hundreds of queues.
>
Sorry, but I think we're discussing two different things here. I'm referring
to the nvme paths (i.e. nvmeXcYnz) and you're probably referring to the
DMA queues or hw I/O queues. If that's true then yes, some enterprise SSDs
may expose low hundreds of queues; for instance, on my SSD I see 128 queues
being supported:
# nvme get-feature -f 0x7 /dev/nvme0 -H
get-feature:0x07 (Number of Queues), Current value:0x007f007f
Number of IO Completion Queues Allocated (NCQA): 128
Number of IO Submission Queues Allocated (NSQA): 128
But when it comes to the nvme paths, that count should be limited (mostly
single digit). So my point was that iterating those paths (and that's what
the above code implements) shouldn't be expensive. But please correct me if
my assumption is wrong.
>> Also, looping in itself isn’t unusual — for example, the queue-depth I/O policy already
>> iterates over all paths in the submission path to check queue depth before dispatching each
>> I/O. That said, if looping in the completion path is still a concern, we could consider
>> moving this into a dedicated worker thread instead. What do you think?
>>
>
> Not sure if that's a good idea; either the worker thread runs
> asynchronous to the completion and then we have to deal with reliably
> adding up numbers, or we're running synchronous and lose performance.
> Still think that _not_ iterating and just adding up single-cpu latencies
> might be worthwhile.
>
Yes, the worker thread would run asynchronously with completion, and we can
deal with it in one of two ways:
- If accuracy is critical, we could use atomic counters (e.g. atomic_cmpxchg)
to update batch counts and calculate path weights asynchronously in the worker.
- If a small skew is acceptable, snapshot-and-reset is sufficient. Since EWMA
smoothing absorbs micro-errors, missing 1–2 samples in a 15s window at real
I/O rates is negligible. Many kernel subsystems (block stats, net flow stats,
etc.) already rely on snapshot-and-reset without atomic, exactly for this reason.
IMO, for the adaptive policy, we don't need atomic_xchg() unless we're chasing
perfect stats. Snapshot-and-reset of per-CPU accumulators is good enough, because
EWMA smoothing and weight recalculation will easily absorb micro-errors. So which
approach do you prefer, in case we go with the worker thread?
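For the snapshot-and-reset variant I have something roughly like this in mind
(untested sketch; the adp_work member is invented and only the READ slot is
handled, just to show the shape):

	static void nvme_adp_weight_work(struct work_struct *work)
	{
		struct nvme_ns *ns = container_of(to_delayed_work(work),
						  struct nvme_ns, adp_work);
		int cpu;

		for_each_online_cpu(cpu) {
			struct nvme_path_stat *stat = per_cpu_ptr(ns->cpu_stat, cpu);
			u64 batch = stat[READ].batch;
			u64 count = stat[READ].batch_count;

			/*
			 * Snapshot and reset; a racing completion may slip a
			 * sample in between, which the EWMA absorbs.
			 */
			stat[READ].batch = 0;
			stat[READ].batch_count = 0;

			if (count)
				stat[READ].slat_ns = ewma_update(stat[READ].slat_ns,
								 div_u64(batch, count));
		}
		/* per-path weight recalculation would go here */
		schedule_delayed_work(&ns->adp_work, 15 * HZ);
	}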
>>> Do we really need to do this?
>>> What would happen if we just measure the latency on the local CPU
>>> and do away with this loop?
>>> We would have less samples, true, but we would even be able to
>>> not only differentiate between distinct path latency but also between
>>> different CPU latencies; I would think this being a bonus for
>>> multi-socket machines.
>>>
>> The idea is to keep per-cpu view consistent for each path. As we know,
>> in NVMe/fabrics multipath, submission and completion CPUs don’t necessarily
>> match (depends on the host’s irq/core mapping). And so if we were to measure
>> the latency/EWMA locally per-cpu then the per-CPU accumulator might be biased
>> towards the completion CPU, not the submission CPU. For instance, if submission
>> is on CPU A but completion lands on CPU B, then CPU A’s weights never reflect
>> it's I/O experience — they’ll be skewed by how interrupts get steered.
>>
> True. Problem is that for the #CPUs > #queues we're setting up a cpu
> affinity group, and interrupts are directed to one of the CPU in that
> group. I had hoped that the blk-mq code would raise a softirq in that
> case and call .end_request on the cpu registered in the request itself.
> Probably need to be evaluated.
>
I already evaluated this. If completion ends up on a CPU which is different
from the submission CPU, then yes, the block layer may raise a softirq to steer
the completion to the same CPU as the submission. But this is not always
guaranteed. For instance, if poll queues are used for submission, then
submission and completion on the same CPU is not guaranteed. Similarly, if
a threaded interrupt is configured, this is not guaranteed either. In fact
blk-mq also supports a flag named QUEUE_FLAG_SAME_FORCE which forces completion
on the same CPU as the submission CPU. But again, this flag also doesn't always
work if we have poll queues or a threaded interrupt is configured.
Please refer to blk_mq_complete_request_remote().
>> So on a multi socket/NUMA systems, depending on topology, calculating local
>> per-cpu ewma/latency may or may not line up. For example:
>>
>> - If we have #cpu <= #vectors supported by NVMe disk then typically
>> we have 1:1 mapping between submission and completion queues and hence all completions for
>> a queue are steered to the same CPU that also submits, then per-CPU stats are accurate.
>>
>> - But when #CPUs > #vectors, completions may be centralized or spread differently. In that
>> case, the per-CPU latency view can be distorted — e.g., CPU A may submit, but CPU B takes
>> completions, so CPU A’s weights never reflect its own I/O behavior.
>>
> See above. We might check if blk-mq doesn't cover for this case already.
> Thing is, I actually _do_ want to measure per-CPU latency.
> On a multi-socket system it really does matter whether an I/O is run
> from a CPU on the socket attached to the PCI device, or from an
> off-socket CPU. If we are calculating just the per-path latency
> we completely miss that (as blk-mq will spread out across _all_
> cpus), but if we are measuring a per-cpu latency we will end up
> with a differential matrix where cpus with the lowest latency
> will be preferred.
> So if we have a system with two sockets and two PCI HBAs, each
> connected to a different socket, using per-path latency will be
> spreading out I/Os across all cpus. Using per-cpu latency will
> direct I/Os to the cpus with the lowest latency, preferring
> the local cpus.
>
Yes, agreed, and so this proposed patch measures per-CPU latency and
_NOT_ per-path latency. Measuring the per-CPU latency and using it for
forwarding I/O is ideal for multi-socket/NUMA systems.
>>> _And_ we wouldn't need to worry about path failures, which is bound
>>> to expose some race conditions if we need to iterate paths at the
>>> same time than path failures are being handled.
>>>
>> Yes agreed we may have some race here and so the path score/weight may be
>> skewed when that happens but then that'd be auto-corrected in the next epoc
>> (after ~15 sec) when we re-calculate the path weight/score again, isn't it?
>>
> Let's see. I still would want to check if we can't do per-cpu
> statistics, as that would automatically avoid any races :-)
>
As I mentioned above, there are certain use cases (like poll queues, threaded
interrupts etc.) where submission and completion could happen on different CPUs,
so we have to account for such cases and cross over to the submitting CPU's
accumulator when adding up the batch latency.
Thanks,
--Nilay
* Re: [RFC PATCH 4/5] nvmf-tcp: add support for retrieving adapter link speed
2025-09-23 10:27 ` Hannes Reinecke
@ 2025-09-23 17:58 ` Nilay Shroff
0 siblings, 0 replies; 16+ messages in thread
From: Nilay Shroff @ 2025-09-23 17:58 UTC (permalink / raw)
To: Hannes Reinecke, linux-nvme; +Cc: kbusch, hch, sagi, axboe, dwagner, gjoyce
On 9/23/25 3:57 PM, Hannes Reinecke wrote:
> On 9/23/25 11:33, Nilay Shroff wrote:
>>
>>
>> On 9/22/25 1:08 PM, Hannes Reinecke wrote:
>>> On 9/21/25 13:12, Nilay Shroff wrote:
>>>> Add support for retrieving the negotiated NIC link speed (in Mbps).
>>>> This value can be factored into path scoring for the adaptive I/O
>>>> policy. For visibility and debugging, a new sysfs attribute "speed"
>>>> is also added under the NVMe path block device.
>>>>
>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>> ---
>>>> drivers/nvme/host/multipath.c | 11 ++++++
>>>> drivers/nvme/host/nvme.h | 3 ++
>>>> drivers/nvme/host/sysfs.c | 5 +++
>>>> drivers/nvme/host/tcp.c | 66 +++++++++++++++++++++++++++++++++++
>>>> 4 files changed, 85 insertions(+)
>>>>
>>> Why not for FC? We can easily extract the link speed from there, too ...
>>>
>> Yes it's easy to get the speed for FC. I just wanted to get feedback from
>> the community about this idea and so didn't include it. But I will do that
>> in the future patchset.
>>
>>> But why do we need to do that? We already calculated the weighted
>>> average, so we _know_ the latency of each path. And then it's
>>> pretty much immaterial if a path runs with a given speed; if the
>>> latency is lower, that path is being preferred.
>>> Irrespective of the speed, which might be deceptive anyway as
>>> you'll only ever be able to retrieve the speed of the local
>>> link, not of the entire path.
>>>
>> Consider a scenario with two paths: one over a high-capacity link
>> (e.g. 1000 Mbps) and another over a much smaller link (e.g. 10 Mbps).
>> If both paths report the same latency, the current formula would
>> assign them identical weights. But in reality, the higher-capacity
>> path can sustain a larger number of I/Os compared to the lower-
>> capacity one.
>>
> That would be correct if the transfer speed is assumed to be negligible.
> But I would assume that we do transfer mainly in units of PAGE_SIZE,
> so with 4k PAGE_SIZE we'll spend roughly 3.3 ms on a 10Mbps link, but ~33 us on a
> 1000Mbps link. That actually is one of the issues we're facing with
> measuring latency: we only have access to the combined latency
> (submission/data transfer/completion), so it's really hard to separate
> them out.
>
>> In such cases, factoring in link speed allows us to assign proportionally
>> higher weight to the higher-capacity path. At the same time, if that same
>> path exhibits higher latency, it will be penalized accordingly, ensuring
>> the final score balances both latency and bandwidth.
>>
> See above. If we could measure them separately, yes. But we can't.
>
>> So, including link speed in the weight calculation helps capture both
>> dimensions—latency sensitivity and throughput capacity—leading to a more
>> accurate and proportional I/O distribution.
>>
> Would be true if we could measure it properly. But we can only get
> the speed on the local link; everything behind that is anyone's guess, and
> it would skew measurements even more if we assume the same link speed
> for the entire path.
>
Yes, you brought up a very good point, and I agree that we can't reliably
determine the end-to-end path capacity. Assuming the same link speed
beyond the first hop may not always be correct and could easily skew
the measurement.
Given that limitation, I agree it would be better to exclude link speed
from the path scoring formula. Without a way to accurately capture the
full path capacity, incorporating only the local link speed risks making
the scoring misleading rather than more accurate.
Thanks,
--Nilay
Thread overview: 16+ messages
2025-09-21 11:12 [RFC PATCH 0/5] nvme-multipath: introduce adaptive I/O policy Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 1/5] block: expose blk_stat_{enable,disable}_accounting() to drivers Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 2/5] nvme-multipath: add support for adaptive I/O policy Nilay Shroff
2025-09-22 7:30 ` Hannes Reinecke
2025-09-23 3:43 ` Nilay Shroff
2025-09-23 7:03 ` Hannes Reinecke
2025-09-23 10:56 ` Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 3/5] nvme-multipath: add sysfs attribute " Nilay Shroff
2025-09-22 7:35 ` Hannes Reinecke
2025-09-23 3:53 ` Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 4/5] nvmf-tcp: add support for retrieving adapter link speed Nilay Shroff
2025-09-22 7:38 ` Hannes Reinecke
2025-09-23 9:33 ` Nilay Shroff
2025-09-23 10:27 ` Hannes Reinecke
2025-09-23 17:58 ` Nilay Shroff
2025-09-21 11:12 ` [RFC PATCH 5/5] nvme-multipath: factor fabric link speed into path score Nilay Shroff