* [PATCH 0/2] nvme: Introduce service-time iopolicy
@ 2026-06-17 11:45 Guixin Liu
  2026-06-17 11:45 ` [PATCH 1/2] nvme-multipath: add service-time I/O policy Guixin Liu
  2026-06-17 11:45 ` [PATCH 2/2] docs: nvme-multipath: document " Guixin Liu
  0 siblings, 2 replies; 3+ messages in thread
From: Guixin Liu @ 2026-06-17 11:45 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Jonathan Corbet, Shuah Khan
  Cc: linux-nvme, linux-doc

Hi all,

This series adds a service-time iopolicy to NVMe native multipath.
Please review; all comments are welcome.

Guixin Liu (2):
  nvme-multipath: add service-time I/O policy
  docs: nvme-multipath: document service-time I/O policy

 Documentation/admin-guide/nvme-multipath.rst |  31 +++-
 drivers/nvme/host/multipath.c                | 165 ++++++++++++++++++-
 drivers/nvme/host/nvme.h                     |   6 +
 drivers/nvme/host/sysfs.c                    |   5 +-
 4 files changed, 202 insertions(+), 5 deletions(-)

-- 
2.43.7



* [PATCH 1/2] nvme-multipath: add service-time I/O policy
  2026-06-17 11:45 [PATCH 0/2] nvme: Introduce service-time iopolicy Guixin Liu
@ 2026-06-17 11:45 ` Guixin Liu
  2026-06-17 11:45 ` [PATCH 2/2] docs: nvme-multipath: document " Guixin Liu
  1 sibling, 0 replies; 3+ messages in thread
From: Guixin Liu @ 2026-06-17 11:45 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Jonathan Corbet, Shuah Khan
  Cc: linux-nvme, linux-doc

Add a new "service-time" I/O policy for NVMe native multipath, adapted
from the DM multipath service-time path selector (dm-ps-service-time).

Unlike the existing "queue-depth" policy which only counts the number of
in-flight I/Os, service-time estimates each path's service time by
dividing the total in-flight I/O size by a configurable relative
throughput weight:

    service_time = in_flight_size / relative_throughput

This provides more accurate load balancing when I/O sizes vary
significantly across paths, and allows users to assign throughput
weights to paths with different performance characteristics.
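
To illustrate the difference between counting requests and counting
bytes, here is a minimal user-space sketch. It is not code from this
patch: struct path_load and both helper functions are hypothetical
names, and both paths are assumed to have equal throughput weights.

```c
#include <stddef.h>

/* Hypothetical per-path snapshot; not the kernel's struct nvme_ctrl. */
struct path_load {
	unsigned int nr_active;   /* in-flight request count (queue-depth view) */
	size_t in_flight_bytes;   /* in-flight byte total (service-time view) */
};

/* Index (0 or 1) of the path with fewer in-flight requests; ties pick 0. */
static int pick_by_queue_depth(const struct path_load p[2])
{
	return p[1].nr_active < p[0].nr_active ? 1 : 0;
}

/* Index (0 or 1) of the path with fewer in-flight bytes; ties pick 0. */
static int pick_by_bytes(const struct path_load p[2])
{
	return p[1].in_flight_bytes < p[0].in_flight_bytes ? 1 : 0;
}
```

With two 128K requests in flight on path 0 (256 KiB) and eight 4K
requests on path 1 (32 KiB), the request count favors path 0 while the
byte count favors path 1.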

The comparison algorithm is adapted directly from dm-ps-service-time.
It uses cross-multiplication to avoid division and protects against
overflow by shifting the in-flight sizes down when they are large.
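
The comparison can be sketched in user-space C as follows. This is an
illustration under assumptions, not the patch code: the function and
macro names are hypothetical, the "incoming" size is left out, and the
result is normalized to -1/0/1 rather than returned as a raw
difference, so that the sign stays well defined for any size_t input.

```c
#include <stddef.h>

/* 2^7 = 128 exceeds the maximum weight of 100, so a size shifted down
 * by 7 bits can be multiplied by any valid weight without overflow. */
#define WEIGHT_SHIFT	7
#define MAX_SIZE	((size_t)-1 >> WEIGHT_SHIFT)

/* Three-way compare of two unsigned weights: <0, 0 or >0. */
static int cmp_weight(unsigned int a, unsigned int b)
{
	return (a > b) - (a < b);
}

/*
 * Returns <0 if path 1 has the lower estimated service time,
 * >0 if path 2 does, and 0 if they are equal.
 */
static int compare_service_time(size_t sz1, unsigned int tp1,
				size_t sz2, unsigned int tp2)
{
	size_t st1, st2;

	/* Same weight: the smaller in-flight size wins. */
	if (tp1 == tp2)
		return (sz1 > sz2) - (sz1 < sz2);

	/* Same load, or one weight is zero: the higher weight wins. */
	if (sz1 == sz2 || !tp1 || !tp2)
		return cmp_weight(tp2, tp1);

	/* sz1 / tp1 < sz2 / tp2  <=>  sz1 * tp2 < sz2 * tp1 (tp > 0). */
	if (sz1 >= MAX_SIZE || sz2 >= MAX_SIZE) {
		sz1 >>= WEIGHT_SHIFT;
		sz2 >>= WEIGHT_SHIFT;
	}
	st1 = sz1 * tp2;
	st2 = sz2 * tp1;
	if (st1 != st2)
		return st1 < st2 ? -1 : 1;

	/* Equal service time: the higher weight wins. */
	return cmp_weight(tp2, tp1);
}
```

For example, 400 KiB in flight at weight 4 compares as better than
150 KiB in flight at weight 1, since 400 * 1 < 150 * 4.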

Per-controller state:
  - in_flight_size: total bytes of in-flight I/Os (atomic64_t)
  - relative_throughput: configurable weight 0-100, default 1
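
The byte accounting can be pictured with the following user-space
sketch built on C11 atomics. All names here are hypothetical stand-ins
for the kernel structures; the point is only that a per-request flag
makes the add and the subtract happen at most once per request.

```c
#include <stdatomic.h>
#include <stdint.h>

#define REQ_COUNTED_BYTES	(1u << 0)	/* hypothetical request flag */

struct path_state {
	atomic_int_fast64_t in_flight_size;	/* bytes currently in flight */
};

struct request_state {
	uint32_t bytes;		/* size of this request */
	unsigned int flags;
};

/* Account the request's bytes once, even if start is called again. */
static void request_start(struct path_state *p, struct request_state *rq)
{
	if (rq->flags & REQ_COUNTED_BYTES)
		return;
	atomic_fetch_add(&p->in_flight_size, rq->bytes);
	rq->flags |= REQ_COUNTED_BYTES;
}

/* Undo the accounting only if this request was counted. */
static void request_end(struct path_state *p, struct request_state *rq)
{
	if (!(rq->flags & REQ_COUNTED_BYTES))
		return;
	atomic_fetch_sub(&p->in_flight_size, rq->bytes);
	rq->flags &= ~REQ_COUNTED_BYTES;
}
```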

Sysfs interface (per path device):
  - in_flight_bytes (ro): current in-flight byte count
  - relative_throughput (rw): path throughput weight

Paths with relative_throughput set to 0 are not selected while other
paths with positive values are available, matching DM service-time
semantics.
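
A simplified sketch of this fallback rule, in user-space C, is shown
below. It is an assumption-laden illustration rather than the patch
code: struct path and select_path() are hypothetical, ANA state is
reduced to a single "usable" flag, and tie-breaking and overflow
handling are omitted.

```c
#include <stddef.h>

/* Hypothetical path descriptor for illustration only. */
struct path {
	size_t in_flight_bytes;
	unsigned int weight;	/* relative throughput, 0 means standby */
	int usable;		/* nonzero if the path may carry I/O */
};

/* Returns the index of the chosen path, or -1 if no path is usable. */
static int select_path(const struct path *paths, int n)
{
	int best = -1, fallback = -1;

	for (int i = 0; i < n; i++) {
		if (!paths[i].usable)
			continue;

		/* Zero-weight paths are remembered only as a last resort. */
		if (!paths[i].weight) {
			if (fallback < 0)
				fallback = i;
			continue;
		}

		/* bytes_i / w_i < bytes_best / w_best, cross-multiplied. */
		if (best < 0 ||
		    (unsigned long long)paths[i].in_flight_bytes *
				paths[best].weight <
		    (unsigned long long)paths[best].in_flight_bytes *
				paths[i].weight)
			best = i;
	}

	return best >= 0 ? best : fallback;
}
```

An idle path with weight 0 is therefore skipped as long as any usable
path with a positive weight exists, however loaded that path is.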

Usage:
  echo service-time > /sys/module/nvme_core/parameters/iopolicy
  echo 4 > /sys/block/nvmeXcYnZ/relative_throughput

Test environment:
  Two QEMU/KVM VMs connected via NVMe-oF TCP, 32 vCPU / 32G RAM each.
  Target exports a 1G RAM disk (/dev/ram0) via nvmet-tcp subsystem
  "nqn.test.nvme-st" on two ports (192.168.1.20:4420 and
  192.168.2.20:4420). Initiator connects both paths to form a native
  NVMe multipath device. VM interconnects use QEMU socket networking
  (point-to-point virtual Ethernet).

  Asymmetric bandwidth is created using tc-tbf on the target side:
    Path 1 (enp0s3): no traffic shaping (measured ~237 MiB/s)
    Path 2 (enp0s4): tc qdisc add dev enp0s4 root tbf rate 500mbit
                     burst 128kb latency 5ms (measured ~56 MiB/s)
  Bandwidth ratio is approximately 4:1.

Test 1 - Uniform 4k random read (iodepth=64, numjobs=4, 15s):

  iopolicy              IOPS   BW(MiB/s)  avg_lat(us)  lat_stddev(us)
  --------------------  -----  ---------  -----------  --------------
  round-robin           59.1k  231        4316         3109
  queue-depth           59.0k  230        4323         3046
  service-time 1:1      59.3k  232        4302         3057
  service-time 4:1      61.0k  238        4179         1294

  With uniform I/O size, service-time 1:1 performs similarly to
  queue-depth since per-byte and per-request tracking are equivalent
  for fixed-size I/Os. Relative to queue-depth, service-time 4:1 with
  weights matching the bandwidth ratio improves IOPS by 3.4% and
  reduces latency stddev by 57%.

Test 2 - Mixed random read 4k/128k (bssplit=4k/50:128k/50,
         iodepth=64, numjobs=4, 15s):

  iopolicy              IOPS  BW(MiB/s)  avg_lat(us)  lat_stddev(us)
  --------------------  ----  ---------  -----------  --------------
  round-robin           7689  276        33209        46817
  queue-depth           7642  276        33429        45868
  service-time 1:1      7456  269        34279        47433
  service-time 4:1      7747  277        32998        27102

  With mixed I/O sizes on asymmetric paths, queue-depth treats 128K
  and 4K requests identically when counting in-flight depth, leading
  to suboptimal load distribution. service-time tracks actual bytes
  in flight and, with 4:1 weights, produces a much narrower latency
  distribution: latency stddev drops 41% relative to queue-depth
  (27102 us vs 45868 us).

Standby path (relative_throughput=0):
  Setting a path's relative_throughput to 0 correctly excludes it
  from selection. With the slow path set to 0, all I/O is directed
  to the fast path only (IO split 118052:0), verified via diskstats.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
---
 drivers/nvme/host/multipath.c | 165 +++++++++++++++++++++++++++++++++-
 drivers/nvme/host/nvme.h      |   6 ++
 drivers/nvme/host/sysfs.c     |   5 +-
 3 files changed, 173 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 263161cb8ac0..81fff2f20d23 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -69,6 +69,7 @@ static const char *nvme_iopolicy_names[] = {
 	[NVME_IOPOLICY_NUMA]	= "numa",
 	[NVME_IOPOLICY_RR]	= "round-robin",
 	[NVME_IOPOLICY_QD]      = "queue-depth",
+	[NVME_IOPOLICY_ST]	= "service-time",
 };
 
 static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -83,6 +84,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
 		iopolicy = NVME_IOPOLICY_RR;
 	else if (!strncmp(val, "queue-depth", 11))
 		iopolicy = NVME_IOPOLICY_QD;
+	else if (!strncmp(val, "service-time", 12))
+		iopolicy = NVME_IOPOLICY_ST;
 	else
 		return -EINVAL;
 
@@ -97,7 +100,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
 module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
 	&iopolicy, 0644);
 MODULE_PARM_DESC(iopolicy,
-	"Default multipath I/O policy; 'numa' (default), 'round-robin' or 'queue-depth'");
+	"Default multipath I/O policy; 'numa' (default), 'round-robin', 'queue-depth' or 'service-time'");
 
 void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
 {
@@ -168,11 +171,16 @@ void nvme_mpath_start_request(struct request *rq)
 {
 	struct nvme_ns *ns = rq->q->queuedata;
 	struct gendisk *disk = ns->head->disk;
+	int iopolicy = READ_ONCE(ns->head->subsys->iopolicy);
 
-	if ((READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) &&
+	if (iopolicy == NVME_IOPOLICY_QD &&
 	    !(nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)) {
 		atomic_inc(&ns->ctrl->nr_active);
 		nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
+	} else if (iopolicy == NVME_IOPOLICY_ST &&
+		   !(nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE_BYTES)) {
+		atomic64_add(blk_rq_bytes(rq), &ns->ctrl->in_flight_size);
+		nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE_BYTES;
 	}
 
 	if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq) ||
@@ -191,6 +199,10 @@ void nvme_mpath_end_request(struct request *rq)
 
 	if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)
 		atomic_dec_if_positive(&ns->ctrl->nr_active);
+	if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE_BYTES) {
+		atomic64_sub(blk_rq_bytes(rq), &ns->ctrl->in_flight_size);
+		nvme_req(rq)->flags &= ~NVME_MPATH_CNT_ACTIVE_BYTES;
+	}
 
 	if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
 		return;
@@ -427,6 +439,109 @@ static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
 	return best_opt ? best_opt : best_nonopt;
 }
 
+#define ST_MAX_RELATIVE_THROUGHPUT	100
+#define ST_MAX_RELATIVE_THROUGHPUT_SHIFT	7
+#define ST_MAX_INFLIGHT_SIZE \
+	((size_t)-1 >> ST_MAX_RELATIVE_THROUGHPUT_SHIFT)
+
+/*
+ * Compare estimated service time of two paths.
+ * Returns negative if ns1 is better, positive if ns2 is better, 0 if equal.
+ *
+ * Service time = (in_flight_size + incoming) / relative_throughput
+ * Cross-multiply to avoid division.
+ */
+static int nvme_st_compare_load(struct nvme_ns *ns1, struct nvme_ns *ns2,
+				size_t incoming)
+{
+	size_t sz1, sz2, st1, st2;
+	unsigned int tp1 = ns1->ctrl->relative_throughput;
+	unsigned int tp2 = ns2->ctrl->relative_throughput;
+
+	sz1 = atomic64_read(&ns1->ctrl->in_flight_size);
+	sz2 = atomic64_read(&ns2->ctrl->in_flight_size);
+
+	/* Case 1: same throughput — compare load directly */
+	if (tp1 == tp2)
+		return sz1 - sz2;
+
+	/*
+	 * Case 2a: same load — prefer higher throughput.
+	 * Case 2b: one path has zero throughput — prefer the other.
+	 */
+	if (sz1 == sz2 || !tp1 || !tp2)
+		return tp2 - tp1;
+
+	/*
+	 * Case 3: general comparison via cross-multiplication.
+	 *   st1 = (sz1 + incoming) / tp1
+	 *   st2 = (sz2 + incoming) / tp2
+	 * Equivalent (since tp > 0):
+	 *   (sz1 + incoming) * tp2 <=> (sz2 + incoming) * tp1
+	 */
+	sz1 += incoming;
+	sz2 += incoming;
+	if (unlikely(sz1 >= ST_MAX_INFLIGHT_SIZE ||
+		     sz2 >= ST_MAX_INFLIGHT_SIZE)) {
+		sz1 >>= ST_MAX_RELATIVE_THROUGHPUT_SHIFT;
+		sz2 >>= ST_MAX_RELATIVE_THROUGHPUT_SHIFT;
+	}
+	st1 = sz1 * tp2;
+	st2 = sz2 * tp1;
+	if (st1 != st2)
+		return st1 - st2;
+
+	/* Case 4: equal service time — prefer higher throughput */
+	return tp2 - tp1;
+}
+
+static struct nvme_ns *nvme_service_time_path(struct nvme_ns_head *head)
+{
+	struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
+	struct nvme_ns *fallback_opt = NULL, *fallback_nonopt = NULL;
+
+	list_for_each_entry_srcu(ns, &head->list, siblings,
+				 srcu_read_lock_held(&head->srcu)) {
+		if (nvme_path_is_disabled(ns))
+			continue;
+
+		/*
+		 * Paths with relative_throughput == 0 are only used when no
+		 * path with a positive value is available (matching DM
+		 * service-time semantics).
+		 */
+		if (!ns->ctrl->relative_throughput) {
+			if (ns->ana_state == NVME_ANA_OPTIMIZED && !fallback_opt)
+				fallback_opt = ns;
+			else if (ns->ana_state == NVME_ANA_NONOPTIMIZED &&
+				 !fallback_nonopt)
+				fallback_nonopt = ns;
+			continue;
+		}
+
+		switch (ns->ana_state) {
+		case NVME_ANA_OPTIMIZED:
+			if (!best_opt ||
+			    nvme_st_compare_load(ns, best_opt, 0) < 0)
+				best_opt = ns;
+			break;
+		case NVME_ANA_NONOPTIMIZED:
+			if (!best_nonopt ||
+			    nvme_st_compare_load(ns, best_nonopt, 0) < 0)
+				best_nonopt = ns;
+			break;
+		default:
+			break;
+		}
+	}
+
+	if (best_opt)
+		return best_opt;
+	if (best_nonopt)
+		return best_nonopt;
+	return fallback_opt ? fallback_opt : fallback_nonopt;
+}
+
 static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
 {
 	return nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE &&
@@ -453,6 +568,8 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 		return nvme_queue_depth_path(head);
 	case NVME_IOPOLICY_RR:
 		return nvme_round_robin_path(head);
+	case NVME_IOPOLICY_ST:
+		return nvme_service_time_path(head);
 	default:
 		return nvme_numa_path(head);
 	}
@@ -1081,6 +1198,47 @@ static ssize_t queue_depth_show(struct device *dev,
 }
 DEVICE_ATTR_RO(queue_depth);
 
+static ssize_t in_flight_bytes_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+
+	if (ns->head->subsys->iopolicy != NVME_IOPOLICY_ST)
+		return 0;
+
+	return sysfs_emit(buf, "%lld\n", atomic64_read(&ns->ctrl->in_flight_size));
+}
+DEVICE_ATTR_RO(in_flight_bytes);
+
+static ssize_t relative_throughput_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+
+	if (ns->head->subsys->iopolicy != NVME_IOPOLICY_ST)
+		return 0;
+
+	return sysfs_emit(buf, "%u\n", ns->ctrl->relative_throughput);
+}
+
+static ssize_t relative_throughput_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
+	unsigned int val;
+	int ret;
+
+	ret = kstrtouint(buf, 0, &val);
+	if (ret < 0)
+		return ret;
+	if (val > ST_MAX_RELATIVE_THROUGHPUT)
+		return -EINVAL;
+
+	ns->ctrl->relative_throughput = val;
+	return count;
+}
+DEVICE_ATTR_RW(relative_throughput);
+
 static ssize_t numa_nodes_show(struct device *dev, struct device_attribute *attr,
 		char *buf)
 {
@@ -1341,6 +1499,9 @@ int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 
 	/* initialize this in the identify path to cover controller resets */
 	atomic_set(&ctrl->nr_active, 0);
+	atomic64_set(&ctrl->in_flight_size, 0);
+	if (!ctrl->relative_throughput)
+		ctrl->relative_throughput = 1;
 
 	if (!ctrl->max_namespaces ||
 	    ctrl->max_namespaces > le32_to_cpu(id->nn)) {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ccd5e05dac98..2b2627e0d3ce 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -261,6 +261,7 @@ enum {
 	NVME_REQ_USERCMD		= (1 << 1),
 	NVME_MPATH_IO_STATS		= (1 << 2),
 	NVME_MPATH_CNT_ACTIVE		= (1 << 3),
+	NVME_MPATH_CNT_ACTIVE_BYTES	= (1 << 4),
 };
 
 static inline struct nvme_request *nvme_req(struct request *req)
@@ -426,6 +427,8 @@ struct nvme_ctrl {
 	struct timer_list anatt_timer;
 	struct work_struct ana_work;
 	atomic_t nr_active;
+	atomic64_t in_flight_size;
+	u8 relative_throughput;
 #endif
 
 #ifdef CONFIG_NVME_HOST_AUTH
@@ -477,6 +480,7 @@ enum nvme_iopolicy {
 	NVME_IOPOLICY_NUMA,
 	NVME_IOPOLICY_RR,
 	NVME_IOPOLICY_QD,
+	NVME_IOPOLICY_ST,
 };
 
 struct nvme_subsystem {
@@ -1059,6 +1063,8 @@ extern bool multipath;
 extern struct device_attribute dev_attr_ana_grpid;
 extern struct device_attribute dev_attr_ana_state;
 extern struct device_attribute dev_attr_queue_depth;
+extern struct device_attribute dev_attr_in_flight_bytes;
+extern struct device_attribute dev_attr_relative_throughput;
 extern struct device_attribute dev_attr_numa_nodes;
 extern struct device_attribute dev_attr_delayed_removal_secs;
 extern struct device_attribute subsys_attr_iopolicy;
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index e59758616f27..6309af224c93 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -259,6 +259,8 @@ static struct attribute *nvme_ns_attrs[] = {
 	&dev_attr_ana_grpid.attr,
 	&dev_attr_ana_state.attr,
 	&dev_attr_queue_depth.attr,
+	&dev_attr_in_flight_bytes.attr,
+	&dev_attr_relative_throughput.attr,
 	&dev_attr_numa_nodes.attr,
 	&dev_attr_delayed_removal_secs.attr,
 #endif
@@ -293,7 +295,8 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
 		if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl))
 			return 0;
 	}
-	if (a == &dev_attr_queue_depth.attr || a == &dev_attr_numa_nodes.attr) {
+	if (a == &dev_attr_queue_depth.attr || a == &dev_attr_in_flight_bytes.attr ||
+	    a == &dev_attr_relative_throughput.attr || a == &dev_attr_numa_nodes.attr) {
 		if (nvme_disk_is_ns_head(dev_to_disk(dev)))
 			return 0;
 	}
-- 
2.43.7



* [PATCH 2/2] docs: nvme-multipath: document service-time I/O policy
  2026-06-17 11:45 [PATCH 0/2] nvme: Introduce service-time iopolicy Guixin Liu
  2026-06-17 11:45 ` [PATCH 1/2] nvme-multipath: add service-time I/O policy Guixin Liu
@ 2026-06-17 11:45 ` Guixin Liu
  1 sibling, 0 replies; 3+ messages in thread
From: Guixin Liu @ 2026-06-17 11:45 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Jonathan Corbet, Shuah Khan
  Cc: linux-nvme, linux-doc

Add documentation for the service-time path selection policy, including
its algorithm overview, sysfs attributes (in_flight_bytes and
relative_throughput), and guidance on when to use it over queue-depth.

Document that setting relative_throughput to 0 makes the path a standby
that is only used when no path with a positive value is available.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
---
 Documentation/admin-guide/nvme-multipath.rst | 31 ++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/nvme-multipath.rst b/Documentation/admin-guide/nvme-multipath.rst
index 97ca1ccef459..2acfceaf3d65 100644
--- a/Documentation/admin-guide/nvme-multipath.rst
+++ b/Documentation/admin-guide/nvme-multipath.rst
@@ -24,8 +24,8 @@ Policies
 
 All policies follow the ANA (Asymmetric Namespace Access) mechanism, meaning
 that when an optimized path is available, it will be chosen over a non-optimized
-one. Current the NVMe multipath policies include numa(default), round-robin and
-queue-depth.
+one. Currently the NVMe multipath policies include numa (default), round-robin,
+queue-depth and service-time.
 
 To set the desired policy (e.g., round-robin), use one of the following methods:
    1. echo -n "round-robin" > /sys/module/nvme_core/parameters/iopolicy
@@ -70,3 +70,30 @@ When to use the queue-depth policy:
   1. High load with small I/Os: Effectively balances load across paths when
      the load is high, and I/O operations consist of small, relatively
      fixed-sized requests.
+
+
+Service-Time
+------------
+
+The service-time policy selects the path with the lowest estimated service time.
+It calculates service time as ``in_flight_bytes / relative_throughput`` for each
+path, preferring the path that would complete I/O fastest. Unlike queue-depth
+which counts requests regardless of size, service-time tracks actual bytes in
+flight, making it aware of I/O sizes.
+
+Each path exposes two sysfs attributes under
+``/sys/class/nvme/nvmeX/nvmeXcYnZ/``:
+
+  - ``in_flight_bytes`` (read-only): Current bytes in flight on this path.
+  - ``relative_throughput`` (read-write): Relative throughput weight for this
+    path, default 1. The valid range is 0-100. Set higher values for faster
+    paths. If set to 0, the path is not selected while other paths with
+    positive values are available.
+
+When to use the service-time policy:
+  1. Asymmetric Link Speeds: When paths have different bandwidths, set
+     ``relative_throughput`` proportionally (e.g., 2 for a link twice as fast)
+     to steer more traffic to faster paths.
+  2. Mixed I/O Sizes: When workloads mix small and large I/Os (e.g., 4K and
+     128K), service-time distributes load more accurately than queue-depth
+     because it accounts for actual bytes rather than request count.
-- 
2.43.7

