Linux-NVME Archive on lore.kernel.org
* [PATCH RESEND v3 0/2] Introduce service-time multipath policy and document
@ 2024-11-14  8:49 Guixin Liu
  2024-11-14  8:49 ` [PATCH RESEND v3 1/2] nvme-multipath: introduce service-time iopolicy Guixin Liu
  2024-11-14  8:49 ` [PATCH RESEND v3 2/2] docs, nvme: add a nvme-multipath document Guixin Liu
  0 siblings, 2 replies; 5+ messages in thread
From: Guixin Liu @ 2024-11-14  8:49 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg; +Cc: linux-nvme

Hi,
  This patchset introduces a new multipath policy, service-time, and also
adds a new document describing nvme-multipath and all supported policies.

Changes from v2 to v3:
- Add more detail to the message body (Christoph).
- Add a document to introduce nvme-multipath and all policies supported.

Changes from v1 to v2:
- Use atomic64_t to replace atomic_t (Keith).

Guixin Liu (2):
  nvme-multipath: introduce service-time iopolicy
  docs, nvme: add a nvme-multipath document

 Documentation/nvme/nvme-multipath.rst | 83 +++++++++++++++++++++++++++
 drivers/nvme/host/multipath.c         | 53 ++++++++++++++++-
 drivers/nvme/host/nvme.h              |  3 +
 3 files changed, 138 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/nvme/nvme-multipath.rst

-- 
2.43.0



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH RESEND v3 1/2] nvme-multipath: introduce service-time iopolicy
  2024-11-14  8:49 [PATCH RESEND v3 0/2] Introduce service-time multipath policy and document Guixin Liu
@ 2024-11-14  8:49 ` Guixin Liu
  2024-11-14 10:53   ` Chaitanya Kulkarni
  2024-11-14  8:49 ` [PATCH RESEND v3 2/2] docs, nvme: add a nvme-multipath document Guixin Liu
  1 sibling, 1 reply; 5+ messages in thread
From: Guixin Liu @ 2024-11-14  8:49 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg; +Cc: linux-nvme

In workloads with varying random I/O sizes, the mix of I/O sizes in
flight on each path can lead to slower processing and higher latency
on heavily loaded paths.

The service-time policy can dispatch I/O to the path with the lowest
total amount of currently processed I/O, ensuring that new I/O can be
sent to less-loaded paths when some paths are overloaded, thereby
achieving lower latency and higher throughput.

Introduce an atomic64_t inflight_size to record the total I/O size
that each path is processing, and choose the path with the lowest
inflight_size to send new I/O to.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
---
 drivers/nvme/host/multipath.c | 53 ++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/nvme.h      |  3 ++
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 6a15873055b9..fcd3b2108152 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -18,6 +18,7 @@ static const char *nvme_iopolicy_names[] = {
 	[NVME_IOPOLICY_NUMA]	= "numa",
 	[NVME_IOPOLICY_RR]	= "round-robin",
 	[NVME_IOPOLICY_QD]      = "queue-depth",
+	[NVME_IOPOLICY_ST]	= "service-time",
 };
 
 static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -32,6 +33,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
 		iopolicy = NVME_IOPOLICY_RR;
 	else if (!strncmp(val, "queue-depth", 11))
 		iopolicy = NVME_IOPOLICY_QD;
+	else if (!strncmp(val, "service-time", 12))
+		iopolicy = NVME_IOPOLICY_ST;
 	else
 		return -EINVAL;
 
@@ -46,7 +49,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
 module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
 	&iopolicy, 0644);
 MODULE_PARM_DESC(iopolicy,
-	"Default multipath I/O policy; 'numa' (default), 'round-robin' or 'queue-depth'");
+	"Default multipath I/O policy; 'numa' (default), 'round-robin', 'queue-depth' or 'service-time'");
 
 void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
 {
@@ -136,6 +139,11 @@ void nvme_mpath_start_request(struct request *rq)
 		nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
 	}
 
+	if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_ST) {
+		atomic64_add(blk_rq_bytes(rq), &ns->ctrl->inflight_size);
+		nvme_req(rq)->flags |= NVME_MPATH_CNT_IOSIZE;
+	}
+
 	if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
 		return;
 
@@ -152,6 +160,9 @@ void nvme_mpath_end_request(struct request *rq)
 	if (nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE)
 		atomic_dec_if_positive(&ns->ctrl->nr_active);
 
+	if (nvme_req(rq)->flags & NVME_MPATH_CNT_IOSIZE)
+		atomic64_sub(blk_rq_bytes(rq), &ns->ctrl->inflight_size);
+
 	if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
 		return;
 	bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
@@ -405,9 +416,48 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
 	return ns;
 }
 
+static struct nvme_ns *nvme_service_time_path(struct nvme_ns_head *head)
+{
+	struct nvme_ns *opt = NULL, *nonopt = NULL, *ns;
+	unsigned int min_inflight_nonopt = UINT_MAX;
+	unsigned int min_inflight_opt = UINT_MAX;
+	unsigned int inflight;
+
+	list_for_each_entry_rcu(ns, &head->list, siblings) {
+		if (nvme_path_is_disabled(ns))
+			continue;
+
+		inflight = atomic64_read(&ns->ctrl->inflight_size);
+
+		switch (ns->ana_state) {
+		case NVME_ANA_OPTIMIZED:
+			if (inflight < min_inflight_opt) {
+				min_inflight_opt = inflight;
+				opt = ns;
+			}
+			break;
+		case NVME_ANA_NONOPTIMIZED:
+			if (inflight < min_inflight_nonopt) {
+				min_inflight_nonopt = inflight;
+				nonopt = ns;
+			}
+			break;
+		default:
+			break;
+		}
+
+		if (min_inflight_opt == 0)
+			return opt;
+	}
+
+	return opt ? opt : nonopt;
+}
+
 inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 {
 	switch (READ_ONCE(head->subsys->iopolicy)) {
+	case NVME_IOPOLICY_ST:
+		return nvme_service_time_path(head);
 	case NVME_IOPOLICY_QD:
 		return nvme_queue_depth_path(head);
 	case NVME_IOPOLICY_RR:
@@ -1040,6 +1090,7 @@ int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 
 	/* initialize this in the identify path to cover controller resets */
 	atomic_set(&ctrl->nr_active, 0);
+	atomic64_set(&ctrl->inflight_size, 0);
 
 	if (!ctrl->max_namespaces ||
 	    ctrl->max_namespaces > le32_to_cpu(id->nn)) {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 093cb423f536..bf6c74fdc9ba 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -202,6 +202,7 @@ enum {
 	NVME_REQ_USERCMD		= (1 << 1),
 	NVME_MPATH_IO_STATS		= (1 << 2),
 	NVME_MPATH_CNT_ACTIVE		= (1 << 3),
+	NVME_MPATH_CNT_IOSIZE		= (1 << 4),
 };
 
 static inline struct nvme_request *nvme_req(struct request *req)
@@ -367,6 +368,7 @@ struct nvme_ctrl {
 	struct timer_list anatt_timer;
 	struct work_struct ana_work;
 	atomic_t nr_active;
+	atomic64_t inflight_size;
 #endif
 
 #ifdef CONFIG_NVME_HOST_AUTH
@@ -416,6 +418,7 @@ enum nvme_iopolicy {
 	NVME_IOPOLICY_NUMA,
 	NVME_IOPOLICY_RR,
 	NVME_IOPOLICY_QD,
+	NVME_IOPOLICY_ST,
 };
 
 struct nvme_subsystem {
-- 
2.43.0




* [PATCH RESEND v3 2/2] docs, nvme: add a nvme-multipath document
  2024-11-14  8:49 [PATCH RESEND v3 0/2] Introduce service-time multipath policy and document Guixin Liu
  2024-11-14  8:49 ` [PATCH RESEND v3 1/2] nvme-multipath: introduce service-time iopolicy Guixin Liu
@ 2024-11-14  8:49 ` Guixin Liu
  1 sibling, 0 replies; 5+ messages in thread
From: Guixin Liu @ 2024-11-14  8:49 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg; +Cc: linux-nvme

This adds a document about nvme-multipath and the policies supported
by the Linux NVMe host driver, including each policy's best-fit scenarios.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
---
 Documentation/nvme/nvme-multipath.rst | 83 +++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)
 create mode 100644 Documentation/nvme/nvme-multipath.rst

diff --git a/Documentation/nvme/nvme-multipath.rst b/Documentation/nvme/nvme-multipath.rst
new file mode 100644
index 000000000000..bb9ef6fc3e9d
--- /dev/null
+++ b/Documentation/nvme/nvme-multipath.rst
@@ -0,0 +1,83 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Linux NVMe multipath
+====================
+
+This document describes NVMe multipath and the path selection policies
+supported by the Linux NVMe host driver.
+
+
+Introduction
+============
+
+The NVMe multipath feature in Linux integrates namespaces with the same
+identifier into a single block device. Using multipath enhances the reliability
+and stability of I/O access while improving bandwidth performance. When a user
+sends I/O to this merged block device, the multipath mechanism selects one of
+the underlying block devices (paths) according to the configured policy.
+Different policies result in different path selections.
+
+
+Policies
+========
+
+All policies follow the ANA (Asymmetric Namespace Access) mechanism, meaning
+that when an optimized path is available, it will be chosen over a non-optimized
+one. Currently, the NVMe multipath policies are numa (default), round-robin,
+queue-depth and service-time.
+
+To set the desired policy (e.g., round-robin), use one of the following methods:
+   1. echo -n "round-robin" > /sys/module/nvme_core/parameters/iopolicy
+   2. or add "nvme_core.iopolicy=round-robin" to the kernel command line.
+
+
+NUMA
+----
+
+The NUMA policy selects the path closest to the NUMA node of the current CPU for
+I/O distribution. This policy maintains the nearest paths to each NUMA node
+based on network interface connections.
+
+When to use the NUMA policy:
+  1. Multi-core Systems: Optimizes memory access in multi-core and
+     multi-processor systems, especially under NUMA architecture.
+  2. High Affinity Workloads: Binds I/O processing to the CPU to reduce
+     communication and data transfer delays across nodes.
+
+
+Round-Robin
+-----------
+
+The round-robin policy distributes I/O requests evenly across all paths to
+enhance throughput and resource utilization. Each I/O operation is sent to the
+next path in sequence.
+
+When to use the round-robin policy:
+  1. Balanced Workloads: Effective for balanced and predictable workloads with
+     similar I/O size and type.
+  2. Homogeneous Path Performance: Utilizes all paths efficiently when
+     performance characteristics (e.g., latency, bandwidth) are similar.
+
+
+Queue-Depth
+-----------
+
+The queue-depth policy manages I/O requests based on the current queue depth
+of each path, selecting the path with the least number of in-flight I/Os.
+
+When to use the queue-depth policy:
+  1. High load with small I/Os: Effectively balances load across paths when
+     the load is high, and I/O operations consist of small, relatively
+     fixed-sized requests.
+
+
+Service-Time
+------------
+
+The service-time policy distributes I/O requests based on the total size of
+in-flight I/Os, selecting the path with the least total size.
+
+When to use the service-time policy:
+  1. Highly Variable Workloads: Adapts to unpredictable and varying I/O patterns
+     by directing requests to the most responsive paths.
-- 
2.43.0




* Re: [PATCH RESEND v3 1/2] nvme-multipath: introduce service-time iopolicy
  2024-11-14  8:49 ` [PATCH RESEND v3 1/2] nvme-multipath: introduce service-time iopolicy Guixin Liu
@ 2024-11-14 10:53   ` Chaitanya Kulkarni
  2024-11-15  1:42     ` Guixin Liu
  0 siblings, 1 reply; 5+ messages in thread
From: Chaitanya Kulkarni @ 2024-11-14 10:53 UTC (permalink / raw)
  To: Guixin Liu
  Cc: linux-nvme@lists.infradead.org, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg

On 11/14/24 00:49, Guixin Liu wrote:
>   
> +static struct nvme_ns *nvme_service_time_path(struct nvme_ns_head *head)
> +{
> +	struct nvme_ns *opt = NULL, *nonopt = NULL, *ns;
> +	unsigned int min_inflight_nonopt = UINT_MAX;
> +	unsigned int min_inflight_opt = UINT_MAX;
> +	unsigned int inflight;
> +
> +	list_for_each_entry_rcu(ns, &head->list, siblings) {
> +		if (nvme_path_is_disabled(ns))
> +			continue;
> +
> +		inflight = atomic64_read(&ns->ctrl->inflight_size);

atomic64_read() has return type s64; is there any particular reason not
to use s64? Or even u64, as is done in many places in the kernel? Or is
the conversion from s64 to unsigned int intentional?

-ck




* Re: [PATCH RESEND v3 1/2] nvme-multipath: introduce service-time iopolicy
  2024-11-14 10:53   ` Chaitanya Kulkarni
@ 2024-11-15  1:42     ` Guixin Liu
  0 siblings, 0 replies; 5+ messages in thread
From: Guixin Liu @ 2024-11-15  1:42 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: linux-nvme@lists.infradead.org, Keith Busch, Jens Axboe,
	Christoph Hellwig, Sagi Grimberg


在 2024/11/14 18:53, Chaitanya Kulkarni 写道:
> On 11/14/24 00:49, Guixin Liu wrote:
>>    
>> +static struct nvme_ns *nvme_service_time_path(struct nvme_ns_head *head)
>> +{
>> +	struct nvme_ns *opt = NULL, *nonopt = NULL, *ns;
>> +	unsigned int min_inflight_nonopt = UINT_MAX;
>> +	unsigned int min_inflight_opt = UINT_MAX;
>> +	unsigned int inflight;
>> +
>> +	list_for_each_entry_rcu(ns, &head->list, siblings) {
>> +		if (nvme_path_is_disabled(ns))
>> +			continue;
>> +
>> +		inflight = atomic64_read(&ns->ctrl->inflight_size);
> atomic64_read has return type s64, is there any particular reason not to use
> s64 ? OR
>
> even u64 as it is used in many places in kernel ? OR
>
> conversion from s64 to unsigned int is done intentionally ?
>
> -ck
>
Changed in v4, thanks.

Best Regards,

Guixin Liu



