Linux-NVME Archive on lore.kernel.org
* [PATCHv2 0/3] nvme-tcp: improve scalability
@ 2024-07-08  7:10 Hannes Reinecke
  2024-07-08  7:10 ` [PATCH 1/3] nvme-tcp: improve rx/tx fairness Hannes Reinecke
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Hannes Reinecke @ 2024-07-08  7:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

Hi all,

for workloads with a lot of controllers we run into workqueue contention,
where the single workqueue is not able to service requests fast enough,
leading to spurious I/O errors and connect resets during high load.
This patchset improves the situation by improving the fairness between
rx and tx scheduling, introducing per-controller workqueues,
and distributing the load according to the blk-mq cpu mapping.
With this we reduce the spurious I/O errors and improve the overall
performance for highly contended workloads.

All performance numbers are derived from the 'tiobench-example.fio'
sample from the fio sources, running on a 96 core machine with one
subsystem and two paths, each path exposing 32 queues.
Backend is nvmet using an Intel DC P3700 NVMe SSD.

Changes since the initial submission:
- Make the changes independent from the 'wq_unbound' parameter
- Drop changes to the workqueue
- Add patch to improve rx/tx fairness

Hannes Reinecke (3):
  nvme-tcp: improve rx/tx fairness
  nvme-tcp: align I/O cpu with blk-mq mapping
  nvme-tcp: per-controller I/O workqueues

 drivers/nvme/host/tcp.c | 135 ++++++++++++++++++++++++++++------------
 1 file changed, 95 insertions(+), 40 deletions(-)

-- 
2.35.3



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/3] nvme-tcp: improve rx/tx fairness
  2024-07-08  7:10 [PATCHv2 0/3] nvme-tcp: improve scalability Hannes Reinecke
@ 2024-07-08  7:10 ` Hannes Reinecke
  2024-07-08 11:57   ` Sagi Grimberg
  2024-07-08  7:10 ` [PATCH 2/3] nvme-tcp: align I/O cpu with blk-mq mapping Hannes Reinecke
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2024-07-08  7:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

We need to restrict both sides, rx and tx, to only run for a certain time
to ensure that we're not blocking the other side and inducing starvation.
So pass in a 'deadline' value to nvme_tcp_send_all() and nvme_tcp_try_recv()
and break out of the loop if the deadline is reached.

As we now have a timestamp we can also use it to print out a warning
if the actual time spent exceeds the deadline.

Performance comparison:
               baseline rx/tx fairness
4k seq write:  449MiB/s 480MiB/s
4k rand write: 410MiB/s 481MiB/s
4k seq read:   478MiB/s 481MiB/s
4k rand read:  547MiB/s 480MiB/s

Random read is ever so disappointing, but that will be fixed with the later
patches.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/tcp.c | 38 +++++++++++++++++++++++++++++---------
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 0873b3949355..f621d3ba89b2 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -153,6 +153,7 @@ struct nvme_tcp_queue {
 	size_t			data_remaining;
 	size_t			ddgst_remaining;
 	unsigned int		nr_cqe;
+	unsigned long		deadline;
 
 	/* send state */
 	struct nvme_tcp_request *request;
@@ -359,14 +360,18 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
 	}
 }
 
-static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
+static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue,
+				    unsigned long deadline)
 {
 	int ret;
 
 	/* drain the send queue as much as we can... */
 	do {
 		ret = nvme_tcp_try_send(queue);
+		if (time_after(jiffies, deadline))
+			break;
 	} while (ret > 0);
+	return ret;
 }
 
 static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue *queue)
@@ -385,6 +390,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 		bool sync, bool last)
 {
 	struct nvme_tcp_queue *queue = req->queue;
+	unsigned long deadline = jiffies + msecs_to_jiffies(1);
 	bool empty;
 
 	empty = llist_add(&req->lentry, &queue->req_list) &&
@@ -397,7 +403,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 	 */
 	if (queue->io_cpu == raw_smp_processor_id() &&
 	    sync && empty && mutex_trylock(&queue->send_mutex)) {
-		nvme_tcp_send_all(queue);
+		nvme_tcp_send_all(queue, deadline);
 		mutex_unlock(&queue->send_mutex);
 	}
 
@@ -959,9 +965,14 @@ static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
 			nvme_tcp_error_recovery(&queue->ctrl->ctrl);
 			return result;
 		}
+		if (time_after(jiffies, queue->deadline)) {
+			desc->count = 0;
+			break;
+		}
+
 	}
 
-	return consumed;
+	return consumed - len;
 }
 
 static void nvme_tcp_data_ready(struct sock *sk)
@@ -1258,7 +1269,7 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
 	return ret;
 }
 
-static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
+static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue, unsigned long deadline)
 {
 	struct socket *sock = queue->sock;
 	struct sock *sk = sock->sk;
@@ -1269,6 +1280,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
 	rd_desc.count = 1;
 	lock_sock(sk);
 	queue->nr_cqe = 0;
+	queue->deadline = deadline;
 	consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
 	release_sock(sk);
 	return consumed;
@@ -1278,14 +1290,15 @@ static void nvme_tcp_io_work(struct work_struct *w)
 {
 	struct nvme_tcp_queue *queue =
 		container_of(w, struct nvme_tcp_queue, io_work);
-	unsigned long deadline = jiffies + msecs_to_jiffies(1);
+	unsigned long tx_deadline = jiffies + msecs_to_jiffies(1);
+	unsigned long rx_deadline = tx_deadline + msecs_to_jiffies(1), overrun;
 
 	do {
 		bool pending = false;
 		int result;
 
 		if (mutex_trylock(&queue->send_mutex)) {
-			result = nvme_tcp_try_send(queue);
+			result = nvme_tcp_send_all(queue, tx_deadline);
 			mutex_unlock(&queue->send_mutex);
 			if (result > 0)
 				pending = true;
@@ -1293,7 +1306,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
 				break;
 		}
 
-		result = nvme_tcp_try_recv(queue);
+		result = nvme_tcp_try_recv(queue, rx_deadline);
 		if (result > 0)
 			pending = true;
 		else if (unlikely(result < 0))
@@ -1302,7 +1315,13 @@ static void nvme_tcp_io_work(struct work_struct *w)
 		if (!pending || !queue->rd_enabled)
 			return;
 
-	} while (!time_after(jiffies, deadline)); /* quota is exhausted */
+	} while (!time_after(jiffies, rx_deadline)); /* quota is exhausted */
+
+	overrun = jiffies - rx_deadline;
+	if (nvme_tcp_queue_id(queue) > 0 &&
+	    overrun > msecs_to_jiffies(10))
+		dev_dbg(queue->ctrl->ctrl.device, "queue %d: queue stall (%u msecs)\n",
+			nvme_tcp_queue_id(queue), jiffies_to_msecs(overrun));
 
 	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 }
@@ -2666,6 +2685,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
 {
 	struct nvme_tcp_queue *queue = hctx->driver_data;
 	struct sock *sk = queue->sock->sk;
+	unsigned long deadline = jiffies + msecs_to_jiffies(1);
 
 	if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
 		return 0;
@@ -2673,7 +2693,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
 	set_bit(NVME_TCP_Q_POLLING, &queue->flags);
 	if (sk_can_busy_loop(sk) && skb_queue_empty_lockless(&sk->sk_receive_queue))
 		sk_busy_loop(sk, true);
-	nvme_tcp_try_recv(queue);
+	nvme_tcp_try_recv(queue, deadline);
 	clear_bit(NVME_TCP_Q_POLLING, &queue->flags);
 	return queue->nr_cqe;
 }
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/3] nvme-tcp: align I/O cpu with blk-mq mapping
  2024-07-08  7:10 [PATCHv2 0/3] nvme-tcp: improve scalability Hannes Reinecke
  2024-07-08  7:10 ` [PATCH 1/3] nvme-tcp: improve rx/tx fairness Hannes Reinecke
@ 2024-07-08  7:10 ` Hannes Reinecke
  2024-07-08 12:08   ` Sagi Grimberg
  2024-07-08  7:10 ` [PATCH 3/3] nvme-tcp: per-controller I/O workqueues Hannes Reinecke
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2024-07-08  7:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

We should align the 'io_cpu' setting with the blk-mq
cpu mapping to ensure that we're not bouncing threads
when doing I/O. To avoid cpu contention this patch also
adds an atomic counter for the number of queues on each
cpu to distribute the load across all CPUs in the blk-mq cpu set.
Additionally we should always set the 'io_cpu' value, as
in the WQ_UNBOUND case it'll be treated as a hint anyway.

Performance comparison:
               baseline rx/tx    blk-mq align
4k seq write:  449MiB/s 480MiB/s 524MiB/s
4k rand write: 410MiB/s 481MiB/s 524MiB/s
4k seq read:   478MiB/s 481MiB/s 566MiB/s
4k rand read:  547MiB/s 480MiB/s 511MiB/s

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/tcp.c | 65 +++++++++++++++++++++++++++++++----------
 1 file changed, 49 insertions(+), 16 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index f621d3ba89b2..a5c42a7b4bee 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -26,6 +26,8 @@
 
 struct nvme_tcp_queue;
 
+static atomic_t nvme_tcp_cpu_queues[NR_CPUS];
+
 /* Define the socket priority to use for connections were it is desirable
  * that the NIC consider performing optimized packet processing or filtering.
  * A non-zero value being sufficient to indicate general consideration of any
@@ -1578,20 +1580,42 @@ static bool nvme_tcp_poll_queue(struct nvme_tcp_queue *queue)
 static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
 {
 	struct nvme_tcp_ctrl *ctrl = queue->ctrl;
-	int qid = nvme_tcp_queue_id(queue);
-	int n = 0;
-
-	if (nvme_tcp_default_queue(queue))
-		n = qid - 1;
-	else if (nvme_tcp_read_queue(queue))
-		n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] - 1;
-	else if (nvme_tcp_poll_queue(queue))
+	struct blk_mq_tag_set *set = &ctrl->tag_set;
+	int qid = nvme_tcp_queue_id(queue) - 1;
+	unsigned int *mq_map = NULL;;
+	int n = 0, cpu, io_cpu, min_queues = WORK_CPU_UNBOUND;
+
+	if (nvme_tcp_default_queue(queue)) {
+		mq_map = set->map[HCTX_TYPE_DEFAULT].mq_map;
+		n = qid;
+	} else if (nvme_tcp_read_queue(queue)) {
+		mq_map = set->map[HCTX_TYPE_READ].mq_map;
+		n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT];
+	} else if (nvme_tcp_poll_queue(queue)) {
+		mq_map = set->map[HCTX_TYPE_POLL].mq_map;
 		n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] -
-				ctrl->io_queues[HCTX_TYPE_READ] - 1;
-	if (wq_unbound)
-		queue->io_cpu = WORK_CPU_UNBOUND;
-	else
-		queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
+				ctrl->io_queues[HCTX_TYPE_READ];
+	}
+
+	if (WARN_ON(!mq_map))
+		return;
+	for_each_online_cpu(cpu) {
+		int num_queues;
+
+		if (mq_map[cpu] != qid)
+			continue;
+		num_queues = atomic_read(&nvme_tcp_cpu_queues[cpu]);
+		if (num_queues < min_queues) {
+			min_queues = num_queues;
+			io_cpu = cpu;
+		}
+	}
+	if (io_cpu != queue->io_cpu) {
+		queue->io_cpu = io_cpu;
+		atomic_inc(&nvme_tcp_cpu_queues[io_cpu]);
+	}
+	dev_dbg(ctrl->ctrl.device, "queue %d: using cpu %d\n",
+		qid, queue->io_cpu);
 }
 
 static void nvme_tcp_tls_done(void *data, int status, key_serial_t pskid)
@@ -1735,7 +1759,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
 
 	queue->sock->sk->sk_allocation = GFP_ATOMIC;
 	queue->sock->sk->sk_use_task_frag = false;
-	nvme_tcp_set_queue_io_cpu(queue);
+	queue->io_cpu = WORK_CPU_UNBOUND;
 	queue->request = NULL;
 	queue->data_remaining = 0;
 	queue->ddgst_remaining = 0;
@@ -1847,6 +1871,10 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
 	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
 	nvme_tcp_restore_sock_ops(queue);
 	cancel_work_sync(&queue->io_work);
+	if (queue->io_cpu != WORK_CPU_UNBOUND) {
+		atomic_dec(&nvme_tcp_cpu_queues[queue->io_cpu]);
+		queue->io_cpu = WORK_CPU_UNBOUND;
+	}
 }
 
 static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
@@ -1891,9 +1919,10 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
 	nvme_tcp_init_recv_ctx(queue);
 	nvme_tcp_setup_sock_ops(queue);
 
-	if (idx)
+	if (idx) {
+		nvme_tcp_set_queue_io_cpu(queue);
 		ret = nvmf_connect_io_queue(nctrl, idx);
-	else
+	} else
 		ret = nvmf_connect_admin_queue(nctrl);
 
 	if (!ret) {
@@ -2920,6 +2949,7 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
 static int __init nvme_tcp_init_module(void)
 {
 	unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_SYSFS;
+	int cpu;
 
 	BUILD_BUG_ON(sizeof(struct nvme_tcp_hdr) != 8);
 	BUILD_BUG_ON(sizeof(struct nvme_tcp_cmd_pdu) != 72);
@@ -2937,6 +2967,9 @@ static int __init nvme_tcp_init_module(void)
 	if (!nvme_tcp_wq)
 		return -ENOMEM;
 
+	for_each_possible_cpu(cpu)
+		atomic_set(&nvme_tcp_cpu_queues[cpu], 0);
+
 	nvmf_register_transport(&nvme_tcp_transport);
 	return 0;
 }
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/3] nvme-tcp: per-controller I/O workqueues
  2024-07-08  7:10 [PATCHv2 0/3] nvme-tcp: improve scalability Hannes Reinecke
  2024-07-08  7:10 ` [PATCH 1/3] nvme-tcp: improve rx/tx fairness Hannes Reinecke
  2024-07-08  7:10 ` [PATCH 2/3] nvme-tcp: align I/O cpu with blk-mq mapping Hannes Reinecke
@ 2024-07-08  7:10 ` Hannes Reinecke
  2024-07-08 12:12   ` Sagi Grimberg
  2024-07-10 11:56 ` [PATCHv2 0/3] nvme-tcp: improve scalability Sagi Grimberg
  2024-07-16  6:31 ` Sagi Grimberg
  4 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2024-07-08  7:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke,
	Hannes Reinecke

From: Hannes Reinecke <hare@suse.de>

Implement per-controller I/O workqueues to reduce workqueue contention
during I/O and improve I/O performance.

Performance comparison:
               baseline rx/tx    blk-mq   multiple workqueues
4k seq write:  449MiB/s 480MiB/s 524MiB/s 540MiB/s
4k rand write: 410MiB/s 481MiB/s 524MiB/s 539MiB/s
4k seq read:   478MiB/s 481MiB/s 566MiB/s 582MiB/s
4k rand read:  547MiB/s 480MiB/s 511MiB/s 633MiB/s

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/tcp.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a5c42a7b4bee..fc8f682f686d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -194,6 +194,7 @@ struct nvme_tcp_ctrl {
 	struct sockaddr_storage src_addr;
 	struct nvme_ctrl	ctrl;
 
+	struct workqueue_struct	*io_wq;
 	struct work_struct	err_work;
 	struct delayed_work	connect_work;
 	struct nvme_tcp_request async_req;
@@ -202,7 +203,6 @@ struct nvme_tcp_ctrl {
 
 static LIST_HEAD(nvme_tcp_ctrl_list);
 static DEFINE_MUTEX(nvme_tcp_ctrl_mutex);
-static struct workqueue_struct *nvme_tcp_wq;
 static const struct blk_mq_ops nvme_tcp_mq_ops;
 static const struct blk_mq_ops nvme_tcp_admin_mq_ops;
 static int nvme_tcp_try_send(struct nvme_tcp_queue *queue);
@@ -410,7 +410,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 	}
 
 	if (last && nvme_tcp_queue_has_pending(queue))
-		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+		queue_work_on(queue->io_cpu, queue->ctrl->io_wq, &queue->io_work);
 }
 
 static void nvme_tcp_process_req_list(struct nvme_tcp_queue *queue)
@@ -987,7 +987,7 @@ static void nvme_tcp_data_ready(struct sock *sk)
 	queue = sk->sk_user_data;
 	if (likely(queue && queue->rd_enabled) &&
 	    !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
-		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+		queue_work_on(queue->io_cpu, queue->ctrl->io_wq, &queue->io_work);
 	read_unlock_bh(&sk->sk_callback_lock);
 }
 
@@ -999,7 +999,7 @@ static void nvme_tcp_write_space(struct sock *sk)
 	queue = sk->sk_user_data;
 	if (likely(queue && sk_stream_is_writeable(sk))) {
 		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+		queue_work_on(queue->io_cpu, queue->ctrl->io_wq, &queue->io_work);
 	}
 	read_unlock_bh(&sk->sk_callback_lock);
 }
@@ -1325,7 +1325,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
 		dev_dbg(queue->ctrl->ctrl.device, "queue %d: queue stall (%u msecs)\n",
 			nvme_tcp_queue_id(queue), jiffies_to_msecs(overrun));
 
-	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+	queue_work_on(queue->io_cpu, queue->ctrl->io_wq, &queue->io_work);
 }
 
 static void nvme_tcp_free_crypto(struct nvme_tcp_queue *queue)
@@ -2486,6 +2486,8 @@ static void nvme_tcp_free_ctrl(struct nvme_ctrl *nctrl)
 
 	nvmf_free_options(nctrl->opts);
 free_ctrl:
+	destroy_workqueue(ctrl->io_wq);
+
 	kfree(ctrl->queues);
 	kfree(ctrl);
 }
@@ -2676,7 +2678,7 @@ static void nvme_tcp_commit_rqs(struct blk_mq_hw_ctx *hctx)
 	struct nvme_tcp_queue *queue = hctx->driver_data;
 
 	if (!llist_empty(&queue->req_list))
-		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+		queue_work_on(queue->io_cpu, queue->ctrl->io_wq, &queue->io_work);
 }
 
 static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
@@ -2812,6 +2814,7 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
 		struct nvmf_ctrl_options *opts)
 {
 	struct nvme_tcp_ctrl *ctrl;
+	unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_SYSFS;
 	int ret;
 
 	ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
@@ -2883,6 +2886,15 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
 	if (ret)
 		goto out_kfree_queues;
 
+	if (wq_unbound)
+		wq_flags |= WQ_UNBOUND;
+	ctrl->io_wq = alloc_workqueue("nvme_tcp_wq_%d", wq_flags, 0,
+				      ctrl->ctrl.instance);
+	if (!ctrl->io_wq) {
+		nvme_put_ctrl(&ctrl->ctrl);
+		return ERR_PTR(-ENOMEM);
+	}
+
 	return ctrl;
 out_kfree_queues:
 	kfree(ctrl->queues);
@@ -2948,7 +2960,6 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
 
 static int __init nvme_tcp_init_module(void)
 {
-	unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_SYSFS;
 	int cpu;
 
 	BUILD_BUG_ON(sizeof(struct nvme_tcp_hdr) != 8);
@@ -2960,13 +2971,6 @@ static int __init nvme_tcp_init_module(void)
 	BUILD_BUG_ON(sizeof(struct nvme_tcp_icresp_pdu) != 128);
 	BUILD_BUG_ON(sizeof(struct nvme_tcp_term_pdu) != 24);
 
-	if (wq_unbound)
-		wq_flags |= WQ_UNBOUND;
-
-	nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq", wq_flags, 0);
-	if (!nvme_tcp_wq)
-		return -ENOMEM;
-
 	for_each_possible_cpu(cpu)
 		atomic_set(&nvme_tcp_cpu_queues[cpu], 0);
 
@@ -2985,8 +2989,6 @@ static void __exit nvme_tcp_cleanup_module(void)
 		nvme_delete_ctrl(&ctrl->ctrl);
 	mutex_unlock(&nvme_tcp_ctrl_mutex);
 	flush_workqueue(nvme_delete_wq);
-
-	destroy_workqueue(nvme_tcp_wq);
 }
 
 module_init(nvme_tcp_init_module);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] nvme-tcp: improve rx/tx fairness
  2024-07-08  7:10 ` [PATCH 1/3] nvme-tcp: improve rx/tx fairness Hannes Reinecke
@ 2024-07-08 11:57   ` Sagi Grimberg
  2024-07-08 13:21     ` Hannes Reinecke
  0 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2024-07-08 11:57 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

Hey Hannes, thanks for doing this.

On 08/07/2024 10:10, Hannes Reinecke wrote:
> We need to restrict both sides, rx and tx, to only run for a certain time
> to ensure that we're not blocking the other side and inducing starvation.
> So pass in a 'deadline' value to nvme_tcp_send_all()

Please split the addition of nvme_tcp_send_all() to a separate prep patch.

>   and nvme_tcp_try_recv()
> and break out of the loop if the deadline is reached.

I think we want to limit the rx/tx in pdus/bytes. This will also allow us
to possibly do burst rx from data-ready.

>
> As we now have a timestamp we can also use it to print out a warning
> if the actual time spent exceeds the deadline.
>
> Performance comparison:
>                 baseline rx/tx fairness
> 4k seq write:  449MiB/s 480MiB/s
> 4k rand write: 410MiB/s 481MiB/s
> 4k seq read:   478MiB/s 481MiB/s
> 4k rand read:  547MiB/s 480MiB/s
>
> Random read is ever so disappointing, but that will be fixed with the later
> patches.

That is a significant decline in relative perf. I'm counting 12.5%...
Can you explain why that is?

How does this look for multiple controllers?



>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
>   drivers/nvme/host/tcp.c | 38 +++++++++++++++++++++++++++++---------
>   1 file changed, 29 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 0873b3949355..f621d3ba89b2 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -153,6 +153,7 @@ struct nvme_tcp_queue {
>   	size_t			data_remaining;
>   	size_t			ddgst_remaining;
>   	unsigned int		nr_cqe;
> +	unsigned long		deadline;

I don't see why you need to keep this in the queue struct. You could have
easily initialized it in the read_descriptor_t and tested against it.

>   
>   	/* send state */
>   	struct nvme_tcp_request *request;
> @@ -359,14 +360,18 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
>   	}
>   }
>   
> -static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
> +static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue,
> +				    unsigned long deadline)
>   {
>   	int ret;
>   
>   	/* drain the send queue as much as we can... */
>   	do {
>   		ret = nvme_tcp_try_send(queue);
> +		if (time_after(jiffies, deadline))
> +			break;
>   	} while (ret > 0);
> +	return ret;

I think you want a different interface, nvme_tcp_send_budgeted(queue, 
budget).
I don't know what you pass here, but jiffies is a rather large 
granularity...

>   }
>   
>   static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue *queue)
> @@ -385,6 +390,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>   		bool sync, bool last)
>   {
>   	struct nvme_tcp_queue *queue = req->queue;
> +	unsigned long deadline = jiffies + msecs_to_jiffies(1);
>   	bool empty;
>   
>   	empty = llist_add(&req->lentry, &queue->req_list) &&
> @@ -397,7 +403,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>   	 */
>   	if (queue->io_cpu == raw_smp_processor_id() &&
>   	    sync && empty && mutex_trylock(&queue->send_mutex)) {
> -		nvme_tcp_send_all(queue);
> +		nvme_tcp_send_all(queue, deadline);
>   		mutex_unlock(&queue->send_mutex);
>   	}

Umm, spend up to a millisecond in queue_request? Sounds like way too
much...
Did you ever see this deadline exceeded? Sends should be rather quick...

>   
> @@ -959,9 +965,14 @@ static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>   			nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>   			return result;
>   		}
> +		if (time_after(jiffies, queue->deadline)) {
> +			desc->count = 0;
> +			break;
> +		}
> +

That is still not right.
You don't want to spend a full deadline reading from the socket, and
then spend a full deadline writing to the socket...

You want the io_work to take a full deadline, and send budgets of
try_send and try_recv. And set it to sane counts. Say 8 pdus, or 64k
bytes. We want to get to some magic value that presents a sane
behavior, that confidently fits inside a deadline, and is fair.
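
Roughly along these lines (just a sketch; the helper name and the
assumption that the low-level send path reports back the number of
bytes it pushed are illustrative, not the current interface):

static int nvme_tcp_send_budgeted(struct nvme_tcp_queue *queue,
				  size_t budget)
{
	size_t sent = 0;
	int ret;

	do {
		/* assumed to return the number of bytes queued, 0 if idle */
		ret = nvme_tcp_try_send(queue);
		if (ret <= 0)
			break;
		sent += ret;
	} while (sent < budget);	/* e.g. 8 PDUs worth, or 64k bytes */

	return ret;
}

io_work would then alternate such a budgeted send with a similarly
budgeted receive until the overall deadline expires.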

>   	}
>   
> -	return consumed;
> +	return consumed - len;
>   }
>   
>   static void nvme_tcp_data_ready(struct sock *sk)
> @@ -1258,7 +1269,7 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
>   	return ret;
>   }
>   
> -static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
> +static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue, unsigned long deadline)
>   {
>   	struct socket *sock = queue->sock;
>   	struct sock *sk = sock->sk;
> @@ -1269,6 +1280,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>   	rd_desc.count = 1;
>   	lock_sock(sk);
>   	queue->nr_cqe = 0;
> +	queue->deadline = deadline;
>   	consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>   	release_sock(sk);
>   	return consumed;
> @@ -1278,14 +1290,15 @@ static void nvme_tcp_io_work(struct work_struct *w)
>   {
>   	struct nvme_tcp_queue *queue =
>   		container_of(w, struct nvme_tcp_queue, io_work);
> -	unsigned long deadline = jiffies + msecs_to_jiffies(1);
> +	unsigned long tx_deadline = jiffies + msecs_to_jiffies(1);
> +	unsigned long rx_deadline = tx_deadline + msecs_to_jiffies(1), overrun;
>   
>   	do {
>   		bool pending = false;
>   		int result;
>   
>   		if (mutex_trylock(&queue->send_mutex)) {
> -			result = nvme_tcp_try_send(queue);
> +			result = nvme_tcp_send_all(queue, tx_deadline);
>   			mutex_unlock(&queue->send_mutex);
>   			if (result > 0)
>   				pending = true;
> @@ -1293,7 +1306,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
>   				break;
>   		}
>   
> -		result = nvme_tcp_try_recv(queue);
> +		result = nvme_tcp_try_recv(queue, rx_deadline);

I think you want a more frequent substitution of sends/receives. The
granularity of a 1ms budget may be too coarse?

>   		if (result > 0)
>   			pending = true;
>   		else if (unlikely(result < 0))
> @@ -1302,7 +1315,13 @@ static void nvme_tcp_io_work(struct work_struct *w)
>   		if (!pending || !queue->rd_enabled)
>   			return;
>   
> -	} while (!time_after(jiffies, deadline)); /* quota is exhausted */
> +	} while (!time_after(jiffies, rx_deadline)); /* quota is exhausted */
> +
> +	overrun = jiffies - rx_deadline;
> +	if (nvme_tcp_queue_id(queue) > 0 &&
> +	    overrun > msecs_to_jiffies(10))
> +		dev_dbg(queue->ctrl->ctrl.device, "queue %d: queue stall (%u msecs)\n",
> +			nvme_tcp_queue_id(queue), jiffies_to_msecs(overrun));

Umm, ok. why 10? why not 2? or 3?
Do you expect io_work to spend more time executing?

>   
>   	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>   }
> @@ -2666,6 +2685,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
>   {
>   	struct nvme_tcp_queue *queue = hctx->driver_data;
>   	struct sock *sk = queue->sock->sk;
> +	unsigned long deadline = jiffies + msecs_to_jiffies(1);
>   
>   	if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
>   		return 0;
> @@ -2673,7 +2693,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
>   	set_bit(NVME_TCP_Q_POLLING, &queue->flags);
>   	if (sk_can_busy_loop(sk) && skb_queue_empty_lockless(&sk->sk_receive_queue))
>   		sk_busy_loop(sk, true);
> -	nvme_tcp_try_recv(queue);
> +	nvme_tcp_try_recv(queue, deadline);

spend a millisecond in nvme_tcp_poll() ??
Isn't it too long?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/3] nvme-tcp: align I/O cpu with blk-mq mapping
  2024-07-08  7:10 ` [PATCH 2/3] nvme-tcp: align I/O cpu with blk-mq mapping Hannes Reinecke
@ 2024-07-08 12:08   ` Sagi Grimberg
  2024-07-08 12:43     ` Hannes Reinecke
  0 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2024-07-08 12:08 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme



On 08/07/2024 10:10, Hannes Reinecke wrote:
> We should align the 'io_cpu' setting with the blk-mq
> cpu mapping to ensure that we're not bouncing threads
> when doing I/O. To avoid cpu contention this patch also
> adds an atomic counter for the number of queues on each
> cpu to distribute the load across all CPUs in the blk-mq cpu set.
> Additionally we should always set the 'io_cpu' value, as
> in the WQ_UNBOUND case it'll be treated as a hint anyway.
>
> Performance comparison:
>                 baseline rx/tx    blk-mq align
> 4k seq write:  449MiB/s 480MiB/s 524MiB/s
> 4k rand write: 410MiB/s 481MiB/s 524MiB/s
> 4k seq read:   478MiB/s 481MiB/s 566MiB/s
> 4k rand read:  547MiB/s 480MiB/s 511MiB/s
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
>   drivers/nvme/host/tcp.c | 65 +++++++++++++++++++++++++++++++----------
>   1 file changed, 49 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index f621d3ba89b2..a5c42a7b4bee 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -26,6 +26,8 @@
>   
>   struct nvme_tcp_queue;
>   
> +static atomic_t nvme_tcp_cpu_queues[NR_CPUS];
> +
>   /* Define the socket priority to use for connections were it is desirable
>    * that the NIC consider performing optimized packet processing or filtering.
>    * A non-zero value being sufficient to indicate general consideration of any
> @@ -1578,20 +1580,42 @@ static bool nvme_tcp_poll_queue(struct nvme_tcp_queue *queue)
>   static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
>   {
>   	struct nvme_tcp_ctrl *ctrl = queue->ctrl;
> -	int qid = nvme_tcp_queue_id(queue);
> -	int n = 0;
> -
> -	if (nvme_tcp_default_queue(queue))
> -		n = qid - 1;
> -	else if (nvme_tcp_read_queue(queue))
> -		n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] - 1;
> -	else if (nvme_tcp_poll_queue(queue))
> +	struct blk_mq_tag_set *set = &ctrl->tag_set;
> +	int qid = nvme_tcp_queue_id(queue) - 1;

Umm, it's not the qid, why change it? I mean it looks harmless, but
I don't see why.

> +	unsigned int *mq_map = NULL;;
> +	int n = 0, cpu, io_cpu, min_queues = WORK_CPU_UNBOUND;

min_queues initialization looks very very weird. I can't even parse it.
besides, why is it declared in this scope?

> +
> +	if (nvme_tcp_default_queue(queue)) {
> +		mq_map = set->map[HCTX_TYPE_DEFAULT].mq_map;
> +		n = qid;
> +	} else if (nvme_tcp_read_queue(queue)) {
> +		mq_map = set->map[HCTX_TYPE_READ].mq_map;
> +		n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT];
> +	} else if (nvme_tcp_poll_queue(queue)) {
> +		mq_map = set->map[HCTX_TYPE_POLL].mq_map;
>   		n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] -
> -				ctrl->io_queues[HCTX_TYPE_READ] - 1;
> -	if (wq_unbound)
> -		queue->io_cpu = WORK_CPU_UNBOUND;
> -	else
> -		queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
> +				ctrl->io_queues[HCTX_TYPE_READ];
> +	}
> +
> +	if (WARN_ON(!mq_map))
> +		return;
> +	for_each_online_cpu(cpu) {
> +		int num_queues;
> +
> +		if (mq_map[cpu] != qid)
> +			continue;
> +		num_queues = atomic_read(&nvme_tcp_cpu_queues[cpu]);
> +		if (num_queues < min_queues) {
> +			min_queues = num_queues;
> +			io_cpu = cpu;
> +		}
> +	}
> +	if (io_cpu != queue->io_cpu) {
> +		queue->io_cpu = io_cpu;
> +		atomic_inc(&nvme_tcp_cpu_queues[io_cpu]);
> +	}

Why is that conditioned ? why not always assign and inc the counter?

> +	dev_dbg(ctrl->ctrl.device, "queue %d: using cpu %d\n",
> +		qid, queue->io_cpu);
>   }
>   
>   static void nvme_tcp_tls_done(void *data, int status, key_serial_t pskid)
> @@ -1735,7 +1759,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
>   
>   	queue->sock->sk->sk_allocation = GFP_ATOMIC;
>   	queue->sock->sk->sk_use_task_frag = false;
> -	nvme_tcp_set_queue_io_cpu(queue);
> +	queue->io_cpu = WORK_CPU_UNBOUND;
>   	queue->request = NULL;
>   	queue->data_remaining = 0;
>   	queue->ddgst_remaining = 0;
> @@ -1847,6 +1871,10 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
>   	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>   	nvme_tcp_restore_sock_ops(queue);
>   	cancel_work_sync(&queue->io_work);
> +	if (queue->io_cpu != WORK_CPU_UNBOUND) {
> +		atomic_dec(&nvme_tcp_cpu_queues[queue->io_cpu]);
> +		queue->io_cpu = WORK_CPU_UNBOUND;
> +	}

Does anything change if we still set the io_cpu in wq_unbound case?
If not, I think we should always do it and simplify the code.

>   }
>   
>   static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
> @@ -1891,9 +1919,10 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
>   	nvme_tcp_init_recv_ctx(queue);
>   	nvme_tcp_setup_sock_ops(queue);
>   
> -	if (idx)
> +	if (idx) {
> +		nvme_tcp_set_queue_io_cpu(queue);
>   		ret = nvmf_connect_io_queue(nctrl, idx);
> -	else
> +	} else
>   		ret = nvmf_connect_admin_queue(nctrl);
>   
>   	if (!ret) {
> @@ -2920,6 +2949,7 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
>   static int __init nvme_tcp_init_module(void)
>   {
>   	unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_SYSFS;
> +	int cpu;
>   
>   	BUILD_BUG_ON(sizeof(struct nvme_tcp_hdr) != 8);
>   	BUILD_BUG_ON(sizeof(struct nvme_tcp_cmd_pdu) != 72);
> @@ -2937,6 +2967,9 @@ static int __init nvme_tcp_init_module(void)
>   	if (!nvme_tcp_wq)
>   		return -ENOMEM;
>   
> +	for_each_possible_cpu(cpu)
> +		atomic_set(&nvme_tcp_cpu_queues[cpu], 0);
> +

Why?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/3] nvme-tcp: per-controller I/O workqueues
  2024-07-08  7:10 ` [PATCH 3/3] nvme-tcp: per-controller I/O workqueues Hannes Reinecke
@ 2024-07-08 12:12   ` Sagi Grimberg
  2024-07-08 12:48     ` Hannes Reinecke
  0 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2024-07-08 12:12 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig, Tejun Heo
  Cc: Keith Busch, linux-nvme, Hannes Reinecke



On 08/07/2024 10:10, Hannes Reinecke wrote:
> From: Hannes Reinecke <hare@suse.de>
>
> Implement per-controller I/O workqueues to reduce workqueue contention
> during I/O and improve I/O performance.
>
> Performance comparison:
>                 baseline rx/tx    blk-mq   multiple workqueues
> 4k seq write:  449MiB/s 480MiB/s 524MiB/s 540MiB/s
> 4k rand write: 410MiB/s 481MiB/s 524MiB/s 539MiB/s
> 4k seq read:   478MiB/s 481MiB/s 566MiB/s 582MiB/s
> 4k rand read:  547MiB/s 480MiB/s 511MiB/s 633MiB/s

I am still puzzled by this one.

This is for 2 controllers? or more?
It is interesting that the rand read sees a higher boost than the seq read.
Is this a nature of the SSD? What happens with null_blk ?

CCing Tejun. Is it possible that using two different workqueues
for a symmetrical workload is better than a single global workqueue?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/3] nvme-tcp: align I/O cpu with blk-mq mapping
  2024-07-08 12:08   ` Sagi Grimberg
@ 2024-07-08 12:43     ` Hannes Reinecke
  2024-07-08 14:38       ` Sagi Grimberg
  0 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2024-07-08 12:43 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 7/8/24 14:08, Sagi Grimberg wrote:
> 
> 
> On 08/07/2024 10:10, Hannes Reinecke wrote:
>> We should align the 'io_cpu' setting with the blk-mq
>> cpu mapping to ensure that we're not bouncing threads
>> when doing I/O. To avoid cpu contention this patch also
>> adds an atomic counter for the number of queues on each
>> cpu to distribute the load across all CPUs in the blk-mq cpu set.
>> Additionally we should always set the 'io_cpu' value, as
>> in the WQ_UNBOUND case it'll be treated as a hint anyway.
>>
>> Performance comparison:
>>                 baseline rx/tx    blk-mq align
>> 4k seq write:  449MiB/s 480MiB/s 524MiB/s
>> 4k rand write: 410MiB/s 481MiB/s 524MiB/s
>> 4k seq read:   478MiB/s 481MiB/s 566MiB/s
>> 4k rand read:  547MiB/s 480MiB/s 511MiB/s
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>>   drivers/nvme/host/tcp.c | 65 +++++++++++++++++++++++++++++++----------
>>   1 file changed, 49 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index f621d3ba89b2..a5c42a7b4bee 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -26,6 +26,8 @@
>>   struct nvme_tcp_queue;
>> +static atomic_t nvme_tcp_cpu_queues[NR_CPUS];
>> +
>>   /* Define the socket priority to use for connections were it is 
>> desirable
>>    * that the NIC consider performing optimized packet processing or 
>> filtering.
>>    * A non-zero value being sufficient to indicate general 
>> consideration of any
>> @@ -1578,20 +1580,42 @@ static bool nvme_tcp_poll_queue(struct 
>> nvme_tcp_queue *queue)
>>   static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
>>   {
>>       struct nvme_tcp_ctrl *ctrl = queue->ctrl;
>> -    int qid = nvme_tcp_queue_id(queue);
>> -    int n = 0;
>> -
>> -    if (nvme_tcp_default_queue(queue))
>> -        n = qid - 1;
>> -    else if (nvme_tcp_read_queue(queue))
>> -        n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] - 1;
>> -    else if (nvme_tcp_poll_queue(queue))
>> +    struct blk_mq_tag_set *set = &ctrl->tag_set;
>> +    int qid = nvme_tcp_queue_id(queue) - 1;
> 
> umm, its not the qid, why change it? I mean it looks harmless, but
> I don't see why.
> 
>> +    unsigned int *mq_map = NULL;;
>> +    int n = 0, cpu, io_cpu, min_queues = WORK_CPU_UNBOUND;
> 
> min_queues initialization looks very very weird. I can't even parse it.
> besides, why is it declared in this scope?
> 
Because it's evaluated in the loop, with the value carried over across
loop iterations.
WORK_CPU_UNBOUND is the init value, to detect if it had been set at all.

>> +
>> +    if (nvme_tcp_default_queue(queue)) {
>> +        mq_map = set->map[HCTX_TYPE_DEFAULT].mq_map;
>> +        n = qid;
>> +    } else if (nvme_tcp_read_queue(queue)) {
>> +        mq_map = set->map[HCTX_TYPE_READ].mq_map;
>> +        n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT];
>> +    } else if (nvme_tcp_poll_queue(queue)) {
>> +        mq_map = set->map[HCTX_TYPE_POLL].mq_map;
>>           n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] -
>> -                ctrl->io_queues[HCTX_TYPE_READ] - 1;
>> -    if (wq_unbound)
>> -        queue->io_cpu = WORK_CPU_UNBOUND;
>> -    else
>> -        queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, 
>> false);
>> +                ctrl->io_queues[HCTX_TYPE_READ];
>> +    }
>> +
>> +    if (WARN_ON(!mq_map))
>> +        return;
>> +    for_each_online_cpu(cpu) {
>> +        int num_queues;
>> +
>> +        if (mq_map[cpu] != qid)
>> +            continue;
>> +        num_queues = atomic_read(&nvme_tcp_cpu_queues[cpu]);
>> +        if (num_queues < min_queues) {
>> +            min_queues = num_queues;
>> +            io_cpu = cpu;
>> +        }
>> +    }
>> +    if (io_cpu != queue->io_cpu) {
>> +        queue->io_cpu = io_cpu;
>> +        atomic_inc(&nvme_tcp_cpu_queues[io_cpu]);
>> +    }
> 
> Why is that conditioned ? why not always assign and inc the counter?
> 
Because we have the conditional

if (mq_map[cpu] != qid)

above, so technically we might not find the correct mapping.
Might be worthwhile doing a WARN_ON() here, but it might trigger
with CPU hotplug...

>> +    dev_dbg(ctrl->ctrl.device, "queue %d: using cpu %d\n",
>> +        qid, queue->io_cpu);
>>   }
>>   static void nvme_tcp_tls_done(void *data, int status, key_serial_t 
>> pskid)
>> @@ -1735,7 +1759,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl 
>> *nctrl, int qid,
>>       queue->sock->sk->sk_allocation = GFP_ATOMIC;
>>       queue->sock->sk->sk_use_task_frag = false;
>> -    nvme_tcp_set_queue_io_cpu(queue);
>> +    queue->io_cpu = WORK_CPU_UNBOUND;
>>       queue->request = NULL;
>>       queue->data_remaining = 0;
>>       queue->ddgst_remaining = 0;
>> @@ -1847,6 +1871,10 @@ static void __nvme_tcp_stop_queue(struct 
>> nvme_tcp_queue *queue)
>>       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>>       nvme_tcp_restore_sock_ops(queue);
>>       cancel_work_sync(&queue->io_work);
>> +    if (queue->io_cpu != WORK_CPU_UNBOUND) {
>> +        atomic_dec(&nvme_tcp_cpu_queues[queue->io_cpu]);
>> +        queue->io_cpu = WORK_CPU_UNBOUND;
>> +    }
> 
> Does anything change if we still set the io_cpu in wq_unbound case?
> If not, I think we should always do it and simplify the code.
> 
See above. WORK_CPU_UNBOUND means that we failed to find the cpu
in the mq_map. Yes, I know, unlikely, but not impossible.

>>   }
>>   static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
>> @@ -1891,9 +1919,10 @@ static int nvme_tcp_start_queue(struct 
>> nvme_ctrl *nctrl, int idx)
>>       nvme_tcp_init_recv_ctx(queue);
>>       nvme_tcp_setup_sock_ops(queue);
>> -    if (idx)
>> +    if (idx) {
>> +        nvme_tcp_set_queue_io_cpu(queue);
>>           ret = nvmf_connect_io_queue(nctrl, idx);
>> -    else
>> +    } else
>>           ret = nvmf_connect_admin_queue(nctrl);
>>       if (!ret) {
>> @@ -2920,6 +2949,7 @@ static struct nvmf_transport_ops 
>> nvme_tcp_transport = {
>>   static int __init nvme_tcp_init_module(void)
>>   {
>>       unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_SYSFS;
>> +    int cpu;
>>       BUILD_BUG_ON(sizeof(struct nvme_tcp_hdr) != 8);
>>       BUILD_BUG_ON(sizeof(struct nvme_tcp_cmd_pdu) != 72);
>> @@ -2937,6 +2967,9 @@ static int __init nvme_tcp_init_module(void)
>>       if (!nvme_tcp_wq)
>>           return -ENOMEM;
>> +    for_each_possible_cpu(cpu)
>> +        atomic_set(&nvme_tcp_cpu_queues[cpu], 0);
>> +
> 
> Why?

Don't we need to initialize the atomic counters?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/3] nvme-tcp: per-controller I/O workqueues
  2024-07-08 12:12   ` Sagi Grimberg
@ 2024-07-08 12:48     ` Hannes Reinecke
  2024-07-08 14:41       ` Sagi Grimberg
  0 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2024-07-08 12:48 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig, Tejun Heo
  Cc: Keith Busch, linux-nvme

On 7/8/24 14:12, Sagi Grimberg wrote:
> 
> 
> On 08/07/2024 10:10, Hannes Reinecke wrote:
>> From: Hannes Reinecke <hare@suse.de>
>>
>> Implement per-controller I/O workqueues to reduce workqueue contention
>> during I/O and improve I/O performance.
>>
>> Performance comparison:
>>                 baseline rx/tx    blk-mq   multiple workqueues
>> 4k seq write:  449MiB/s 480MiB/s 524MiB/s 540MiB/s
>> 4k rand write: 410MiB/s 481MiB/s 524MiB/s 539MiB/s
>> 4k seq read:   478MiB/s 481MiB/s 566MiB/s 582MiB/s
>> 4k rand read:  547MiB/s 480MiB/s 511MiB/s 633MiB/s
> 
> I am still puzzled by this one.
> 
> This is for 2 controllers? or more?
> It is interesting that the rand read sees a higher boost than the seq read.
> Is this a nature of the SSD? What happens with null_blk ?
> 
See the patchset description.
Two controllers, one subsystem.

> CCing Tejun. Is it possible that using two different workqueues
> for a symmetrical workload is better than a single global workqueue?

Yes, that is the implication.
It might simply be due to the fact that the number of workers in a
workqueue pool is limited to 512:

workqueue.h:      WQ_MAX_ACTIVE           = 512,    /* I like 512, better ideas? */

so by using a workqueue per controller we effectively raise the number
of possible workers. And we're reducing the concurrency between
controllers, as each controller can now schedule I/O independently of
the other.

Let me check what would happen if I increase MAX_ACTIVE ...
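
Presumably something like this (a sketch of the single-workqueue
alternative only; today we pass 0, which means WQ_DFL_ACTIVE, i.e. 256):

	nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq", wq_flags, WQ_MAX_ACTIVE);
	if (!nvme_tcp_wq)
		return -ENOMEM;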

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] nvme-tcp: improve rx/tx fairness
  2024-07-08 11:57   ` Sagi Grimberg
@ 2024-07-08 13:21     ` Hannes Reinecke
  2024-07-08 14:25       ` Sagi Grimberg
  0 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2024-07-08 13:21 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 7/8/24 13:57, Sagi Grimberg wrote:
> Hey Hannes, thanks for doing this.
> 
> On 08/07/2024 10:10, Hannes Reinecke wrote:
>> We need to restrict both sides, rx and tx, to only run for a certain time
>> to ensure that we're not blocking the other side and inducing starvation.
>> So pass in a 'deadline' value to nvme_tcp_send_all()
> 
> Please split the addition of nvme_tcp_send_all() to a separate prep patch.
> 
Okay, no problem.

>>   and nvme_tcp_try_recv()
>> and break out of the loop if the deadline is reached.
> 
> I think we want to limit the rx/tx in pdus/bytes. This will also allow us
> to possibly do burst rx from data-ready.
> 
PDUs are not the best scheduling boundary here, as each PDU can be of a
different size, and the network interface most definitely is limited by
the number of bytes transferred.

>>
>> As we now have a timestamp we can also use it to print out a warning
>> if the actual time spent exceeds the deadline.
>>
>> Performance comparison:
>>                 baseline rx/tx fairness
>> 4k seq write:  449MiB/s 480MiB/s
>> 4k rand write: 410MiB/s 481MiB/s
>> 4k seq read:   478MiB/s 481MiB/s
>> 4k rand read:  547MiB/s 480MiB/s
>>
>> Random read is ever so disappointing, but that will be fixed with the 
>> later
>> patches.
> 
> That is a significant decline in relative perf. I'm counting 12.5%...
> Can you explain why that is?
> 
Not really :-(
But then fairness cuts both ways; so I am not surprised that some 
workloads suffer here.

> How does this look for multiple controllers?
> 
Haven't really checked (yet); it would make a rather weak case if
we killed performance just to scale better ...

> 
> 
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>>   drivers/nvme/host/tcp.c | 38 +++++++++++++++++++++++++++++---------
>>   1 file changed, 29 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 0873b3949355..f621d3ba89b2 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -153,6 +153,7 @@ struct nvme_tcp_queue {
>>       size_t            data_remaining;
>>       size_t            ddgst_remaining;
>>       unsigned int        nr_cqe;
>> +    unsigned long        deadline;
> 
> I don't see why you need to keep this in the queue struct. You could have
> easily initialized it in the read_descriptor_t and tested against it.
> 
Because I wanted to interrupt the receive side for large data transfers, 
and haven't found a way to pass the deadline into ->read_sock().

>>       /* send state */
>>       struct nvme_tcp_request *request;
>> @@ -359,14 +360,18 @@ static inline void nvme_tcp_advance_req(struct 
>> nvme_tcp_request *req,
>>       }
>>   }
>> -static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>> +static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue,
>> +                    unsigned long deadline)
>>   {
>>       int ret;
>>       /* drain the send queue as much as we can... */
>>       do {
>>           ret = nvme_tcp_try_send(queue);
>> +        if (time_after(jiffies, deadline))
>> +            break;
>>       } while (ret > 0);
>> +    return ret;
> 
> I think you want a different interface, nvme_tcp_send_budgeted(queue, 
> budget).
> I don't know what you pass here, but jiffies is a rather large 
> granularity...
> 
Hmm. I've been using jiffies as the io_work loop had been counting
in jiffies. But let me check what would happen if I move that over to
PDU / size counting.

>>   }
>>   static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue 
>> *queue)
>> @@ -385,6 +390,7 @@ static inline void nvme_tcp_queue_request(struct 
>> nvme_tcp_request *req,
>>           bool sync, bool last)
>>   {
>>       struct nvme_tcp_queue *queue = req->queue;
>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>       bool empty;
>>       empty = llist_add(&req->lentry, &queue->req_list) &&
>> @@ -397,7 +403,7 @@ static inline void nvme_tcp_queue_request(struct 
>> nvme_tcp_request *req,
>>        */
>>       if (queue->io_cpu == raw_smp_processor_id() &&
>>           sync && empty && mutex_trylock(&queue->send_mutex)) {
>> -        nvme_tcp_send_all(queue);
>> +        nvme_tcp_send_all(queue, deadline);
>>           mutex_unlock(&queue->send_mutex);
>>       }
> 
>> Umm, spend up to a millisecond in queue_request? Sounds like way too
>> much...
>> Did you ever see this deadline exceeded? Sends should be rather quick...
> 
Of _course_ it's too long. That's kinda the point.
But I'm seeing network latency up to 4000 msecs (!) on my test setup,
so _that_ is the least of my worries ...

>> @@ -959,9 +965,14 @@ static int nvme_tcp_recv_skb(read_descriptor_t 
>> *desc, struct sk_buff *skb,
>>               nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>               return result;
>>           }
>> +        if (time_after(jiffies, queue->deadline)) {
>> +            desc->count = 0;
>> +            break;
>> +        }
>> +
> 
> That is still not right.
> You don't want to spend a full deadline reading from the socket, and 
> then spend a full deadline writing to the socket...
> 
Yes, and no.
Problem is that the current code serializes writes and reads.
Essentially for each iteration we first do a write, and then a read.
If we spend the full deadline on writes we will need to reschedule,
but we then _again_ start with writes. This leads to a heavy preference
for writing, and a negative performance impact.

> You want the io_work to take a full deadline, and send budgets of 
> try_send and try_recv. And set it to sane counts. Say 8 pdus, or
> 64k bytes. We want to get to some magic value that presents a
> sane behavior, that confidently fits inside a deadline, and is fair.
> 
Easier said than done.
Biggest problem is that most of the latency increase comes from the
actual 'sendmsg()' and 'read_sock()' calls.
And the only way of inhibiting that would be to check _beforehand_ whether
we can issue the call in the first place.
(That's why I did the SOCK_NOSPACE tests in the previous patchsets).
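
Roughly like this (a sketch only, not the exact code from that series):

	/*
	 * Skip the potentially expensive sendmsg() when the socket has
	 * no room; the write_space() callback will kick io_work again
	 * once space frees up.
	 */
	if (!sk_stream_is_writeable(queue->sock->sk))
		return 0;
	ret = nvme_tcp_try_send(queue);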

>>       }
>> -    return consumed;
>> +    return consumed - len;
>>   }
>>   static void nvme_tcp_data_ready(struct sock *sk)
>> @@ -1258,7 +1269,7 @@ static int nvme_tcp_try_send(struct 
>> nvme_tcp_queue *queue)
>>       return ret;
>>   }
>> -static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>> +static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue, unsigned 
>> long deadline)
>>   {
>>       struct socket *sock = queue->sock;
>>       struct sock *sk = sock->sk;
>> @@ -1269,6 +1280,7 @@ static int nvme_tcp_try_recv(struct 
>> nvme_tcp_queue *queue)
>>       rd_desc.count = 1;
>>       lock_sock(sk);
>>       queue->nr_cqe = 0;
>> +    queue->deadline = deadline;
>>       consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>>       release_sock(sk);
>>       return consumed;
>> @@ -1278,14 +1290,15 @@ static void nvme_tcp_io_work(struct 
>> work_struct *w)
>>   {
>>       struct nvme_tcp_queue *queue =
>>           container_of(w, struct nvme_tcp_queue, io_work);
>> -    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>> +    unsigned long tx_deadline = jiffies + msecs_to_jiffies(1);
>> +    unsigned long rx_deadline = tx_deadline + msecs_to_jiffies(1), 
>> overrun;
>>       do {
>>           bool pending = false;
>>           int result;
>>           if (mutex_trylock(&queue->send_mutex)) {
>> -            result = nvme_tcp_try_send(queue);
>> +            result = nvme_tcp_send_all(queue, tx_deadline);
>>               mutex_unlock(&queue->send_mutex);
>>               if (result > 0)
>>                   pending = true;
>> @@ -1293,7 +1306,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>                   break;
>>           }
>> -        result = nvme_tcp_try_recv(queue);
>> +        result = nvme_tcp_try_recv(queue, rx_deadline);
> 
> I think you want a more frequent substitution of sends/receives. the 
> granularity of 1ms budget may be too coarse?
> 
The problem is not the granularity, the problem is the latency spikes
I'm seeing when issuing 'sendmsg' or 'read_sock'.

>>           if (result > 0)
>>               pending = true;
>>           else if (unlikely(result < 0))
>> @@ -1302,7 +1315,13 @@ static void nvme_tcp_io_work(struct work_struct 
>> *w)
>>           if (!pending || !queue->rd_enabled)
>>               return;
>> -    } while (!time_after(jiffies, deadline)); /* quota is exhausted */
>> +    } while (!time_after(jiffies, rx_deadline)); /* quota is 
>> exhausted */
>> +
>> +    overrun = jiffies - rx_deadline;
>> +    if (nvme_tcp_queue_id(queue) > 0 &&
>> +        overrun > msecs_to_jiffies(10))
>> +        dev_dbg(queue->ctrl->ctrl.device, "queue %d: queue stall (%u 
>> msecs)\n",
>> +            nvme_tcp_queue_id(queue), jiffies_to_msecs(overrun));
> 
> Umm, ok. why 10? why not 2? or 3?
> Do you expect io_work to spend more time executing?
> 
Eg 4000 msecs like on my testbed?
Yes.

>>       queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>   }
>> @@ -2666,6 +2685,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx 
>> *hctx, struct io_comp_batch *iob)
>>   {
>>       struct nvme_tcp_queue *queue = hctx->driver_data;
>>       struct sock *sk = queue->sock->sk;
>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>       if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
>>           return 0;
>> @@ -2673,7 +2693,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx 
>> *hctx, struct io_comp_batch *iob)
>>       set_bit(NVME_TCP_Q_POLLING, &queue->flags);
>>       if (sk_can_busy_loop(sk) && 
>> skb_queue_empty_lockless(&sk->sk_receive_queue))
>>           sk_busy_loop(sk, true);
>> -    nvme_tcp_try_recv(queue);
>> +    nvme_tcp_try_recv(queue, deadline);
> 
> spend a millisecond in nvme_tcp_poll() ??
> Isn't it too long?
Haven't tried with polling, so can't honestly answer.

In the end, it all boils down to numbers.
I have 2 controllers with 32 queues and 256 requests each.
Running on a 10GigE link one should be getting a net throughput
of 1GB/s, or, with 4k requests, 256k IOPS.
Or a latency of 0.25 milliseconds per command.
In the optimal case. As I'm seeing a bandwidth of around
500MB/s, I'm looking at a latency of 0.5 milliseconds _in the ideal case_.
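
(Spelling the arithmetic out, assuming the per-command figure is meant
per queue: 1GB/s divided by 4k is ~256k IOPS; spread over 2 x 32 = 64
queues that is ~4k IOPS per queue, i.e. ~0.25 msecs per command on each
queue.)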

So no, I don't think the 1 millisecond is too long.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] nvme-tcp: improve rx/tx fairness
  2024-07-08 13:21     ` Hannes Reinecke
@ 2024-07-08 14:25       ` Sagi Grimberg
  2024-07-08 15:50         ` Hannes Reinecke
  0 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2024-07-08 14:25 UTC (permalink / raw)
  To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme



On 08/07/2024 16:21, Hannes Reinecke wrote:
> On 7/8/24 13:57, Sagi Grimberg wrote:
>> Hey Hannes, thanks for doing this.
>>
>> On 08/07/2024 10:10, Hannes Reinecke wrote:
>>> We need to restrict both sides, rx and tx, to only run for a certain
>>> time
>>> to ensure that we're not blocking the other side and inducing starvation.
>>> So pass in a 'deadline' value to nvme_tcp_send_all()
>>
>> Please split the addition of nvme_tcp_send_all() to a separate prep 
>> patch.
>>
> Okay, no problem.
>
>>>   and nvme_tcp_try_recv()
>>> and break out of the loop if the deadline is reached.
>>
>> I think we want to limit the rx/tx in pdus/bytes. This will also 
>> allow us
>> to possibly do burst rx from data-ready.
>>
> PDUs is not the best scheduling boundary here, as each PDU can be of 
> different size, and the network interface most definitely is limited by
> the number of bytes transferred.

Well, PDUs are at least a proxy for bytes in terms of limits, but I agree
bytes may be more appropriate.

>
>>>
>>> As we now have a timestamp we can also use it to print out a warning
>>> if the actual time spent exceeds the deadline.
>>>
>>> Performance comparison:
>>>                 baseline rx/tx fairness
>>> 4k seq write:  449MiB/s 480MiB/s
>>> 4k rand write: 410MiB/s 481MiB/s
>>> 4k seq read:   478MiB/s 481MiB/s
>>> 4k rand read:  547MiB/s 480MiB/s
>>>
>>> Random read is ever so disappointing, but that will be fixed with 
>>> the later
>>> patches.
>>
>> That is a significant decline in relative perf. I'm counting 12.5%...
>> Can you explain why that is?
>>
> Not really :-(
> But then fairness cuts both ways; so I am not surprised that some 
> workloads suffer here.

Well, 12.5% is rather steep, don't you agree?
I'm even more surprised that seq/rand read make a difference...

>
>
>> How does this look for multiple controllers?
>>
> Haven't really checked (yet); it would make a rather weak case if
> we killed performance just to scale better ...
>
>>
>>
>>>
>>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>> ---
>>>   drivers/nvme/host/tcp.c | 38 +++++++++++++++++++++++++++++---------
>>>   1 file changed, 29 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 0873b3949355..f621d3ba89b2 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -153,6 +153,7 @@ struct nvme_tcp_queue {
>>>       size_t            data_remaining;
>>>       size_t            ddgst_remaining;
>>>       unsigned int        nr_cqe;
>>> +    unsigned long        deadline;
>>
>> I don't see why you need to keep this in the queue struct. You could
>> easily initialize it in the read_descriptor_t and test against it.
>>
> Because I wanted to interrupt the receive side for large data 
> transfers, and haven't found a way to pass the deadline into 
> ->read_sock().

desc.count is for you to interpret however you want it to be no?

>
>>>       /* send state */
>>>       struct nvme_tcp_request *request;
>>> @@ -359,14 +360,18 @@ static inline void nvme_tcp_advance_req(struct 
>>> nvme_tcp_request *req,
>>>       }
>>>   }
>>> -static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>>> +static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue,
>>> +                    unsigned long deadline)
>>>   {
>>>       int ret;
>>>       /* drain the send queue as much as we can... */
>>>       do {
>>>           ret = nvme_tcp_try_send(queue);
>>> +        if (time_after(jiffies, deadline))
>>> +            break;
>>>       } while (ret > 0);
>>> +    return ret;
>>
>> I think you want a different interface, nvme_tcp_send_budgeted(queue, 
>> budget).
>> I don't know what you pass here, but jiffies is a rather large 
>> granularity...
>>
> Hmm. I've been using jiffies as the io_work loop had been counting
> in jiffies. But let me check what would happen if I move that over to
> PDU / size counting.

Cool.

>
>>>   }
>>>   static inline bool nvme_tcp_queue_has_pending(struct 
>>> nvme_tcp_queue *queue)
>>> @@ -385,6 +390,7 @@ static inline void nvme_tcp_queue_request(struct 
>>> nvme_tcp_request *req,
>>>           bool sync, bool last)
>>>   {
>>>       struct nvme_tcp_queue *queue = req->queue;
>>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>       bool empty;
>>>       empty = llist_add(&req->lentry, &queue->req_list) &&
>>> @@ -397,7 +403,7 @@ static inline void nvme_tcp_queue_request(struct 
>>> nvme_tcp_request *req,
>>>        */
>>>       if (queue->io_cpu == raw_smp_processor_id() &&
>>>           sync && empty && mutex_trylock(&queue->send_mutex)) {
>>> -        nvme_tcp_send_all(queue);
>>> +        nvme_tcp_send_all(queue, deadline);
>>>           mutex_unlock(&queue->send_mutex);
>>>       }
>>
>> Umm, spend up to a millisecond in queue_request? Sounds like way
>> too much...
>> Did you ever see this deadline exceeded? sends should be rather quick...
>>
> Of _course_ it's too long. That's kinda the point.
> But I'm seeing network latency up to 4000 msecs (!) on my test setup,
> so _that_ is the least of my worries ...

This has nothing to do with the network latency afaict. This is just a
matter of how long it takes to write to the socket buffer (or in our case,
construct skbs and add them to the socket, as TX is zero copy).

>
>>> @@ -959,9 +965,14 @@ static int nvme_tcp_recv_skb(read_descriptor_t 
>>> *desc, struct sk_buff *skb,
>>> nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>>               return result;
>>>           }
>>> +        if (time_after(jiffies, queue->deadline)) {
>>> +            desc->count = 0;
>>> +            break;
>>> +        }
>>> +
>>
>> That is still not right.
>> You don't want to spend a full deadline reading from the socket, and 
>> then spend a full deadline writing to the socket...
>>
> Yes, and no.
> Problem is that the current code serializes writes and reads.
> Essentially for each iteration we first do a write, and then a read.
> If we spend the full deadline on write we will need to reschedule,
> but we then _again_ start with writes. This leads to a heavy preference
> for writing, and a negative performance impact.

Yes, the assumption we should take is that the tx/rx limits allow switching
between the two at least once or twice in the io_work while loop, and that
they get a fair'ish share of the quota.
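
A rough sketch of that direction, reusing the existing helpers; the
_budgeted() variants and the budget values are made up and would need
tuning:

#define NVME_TCP_SEND_BUDGET    SZ_64K  /* made-up per-round tx limit */
#define NVME_TCP_RECV_BUDGET    SZ_64K  /* made-up per-round rx limit */

static void nvme_tcp_io_work(struct work_struct *w)
{
        struct nvme_tcp_queue *queue =
                container_of(w, struct nvme_tcp_queue, io_work);
        unsigned long deadline = jiffies + msecs_to_jiffies(1);

        do {
                bool pending = false;
                int result;

                /* bounded tx round, at most NVME_TCP_SEND_BUDGET bytes */
                if (mutex_trylock(&queue->send_mutex)) {
                        result = nvme_tcp_try_send_budgeted(queue,
                                        NVME_TCP_SEND_BUDGET);
                        mutex_unlock(&queue->send_mutex);
                        if (result > 0)
                                pending = true;
                        else if (unlikely(result < 0))
                                break;
                }

                /* bounded rx round, at most NVME_TCP_RECV_BUDGET bytes */
                result = nvme_tcp_try_recv_budgeted(queue,
                                NVME_TCP_RECV_BUDGET);
                if (result > 0)
                        pending = true;
                else if (unlikely(result < 0))
                        return;

                if (!pending || !queue->rd_enabled)
                        return;
                /* the 1ms deadline only bounds the overall loop */
        } while (!time_after(jiffies, deadline));

        queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}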

>
>> You want the io_work to take a full deadline, and send budgets of 
>> try_send and try_recv. And set it to sane counts. Say 8 pdus, or
>> 64k bytes. We want to get to some magic value that presents a
>> sane behavior, that confidently fits inside a deadline, and is fair.
>>
> Easier said than done.
> Biggest problem is that most of the latency increase comes from the
> actual 'sendmsg()' and 'read_sock()' calls.

This is a bit surprising, because read_sock reads data that is already
processed by the tcp stack, and ready to simply copy back to the user
pages. Same for sendmsg, the socket is non-blocking, so I would not
expect it to stall for unusually long periods. Maybe there is an issue
hiding in how we call send/recv?

> And the only way of inhibiting that would be to check _prior_ whether
> we can issue the call in the first place.
> (That's why I did the SOCK_NOSPACE tests in the previous patchsets).

The socket is non-blocking. I always thought that SOCK_NOSPACE is 
serving blocking
sockets... afaik, there shouldn't be a real difference between checking 
NOSPACE vs.
attempting sendmsg on a non-blocking socket... But I could be wrong here..

>
>>>       }
>>> -    return consumed;
>>> +    return consumed - len;
>>>   }
>>>   static void nvme_tcp_data_ready(struct sock *sk)
>>> @@ -1258,7 +1269,7 @@ static int nvme_tcp_try_send(struct 
>>> nvme_tcp_queue *queue)
>>>       return ret;
>>>   }
>>> -static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>>> +static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue, unsigned 
>>> long deadline)
>>>   {
>>>       struct socket *sock = queue->sock;
>>>       struct sock *sk = sock->sk;
>>> @@ -1269,6 +1280,7 @@ static int nvme_tcp_try_recv(struct 
>>> nvme_tcp_queue *queue)
>>>       rd_desc.count = 1;
>>>       lock_sock(sk);
>>>       queue->nr_cqe = 0;
>>> +    queue->deadline = deadline;
>>>       consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>>>       release_sock(sk);
>>>       return consumed;
>>> @@ -1278,14 +1290,15 @@ static void nvme_tcp_io_work(struct 
>>> work_struct *w)
>>>   {
>>>       struct nvme_tcp_queue *queue =
>>>           container_of(w, struct nvme_tcp_queue, io_work);
>>> -    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>> +    unsigned long tx_deadline = jiffies + msecs_to_jiffies(1);
>>> +    unsigned long rx_deadline = tx_deadline + msecs_to_jiffies(1), 
>>> overrun;
>>>       do {
>>>           bool pending = false;
>>>           int result;
>>>           if (mutex_trylock(&queue->send_mutex)) {
>>> -            result = nvme_tcp_try_send(queue);
>>> +            result = nvme_tcp_send_all(queue, tx_deadline);
>>>               mutex_unlock(&queue->send_mutex);
>>>               if (result > 0)
>>>                   pending = true;
>>> @@ -1293,7 +1306,7 @@ static void nvme_tcp_io_work(struct 
>>> work_struct *w)
>>>                   break;
>>>           }
>>> -        result = nvme_tcp_try_recv(queue);
>>> +        result = nvme_tcp_try_recv(queue, rx_deadline);
>>
>> I think you want a more frequent substitution of sends/receives. the 
>> granularity of 1ms budget may be too coarse?
>>
> The problem is not the granularity, the problem is the latency spikes
> I'm seeing when issuing 'sendmsg' or 'read_sock'.

I think that it's possible that there is another underlying issue.
Because these calls
should NOT be blocked for a long period of time for non-blocking sockets.

>
>>>           if (result > 0)
>>>               pending = true;
>>>           else if (unlikely(result < 0))
>>> @@ -1302,7 +1315,13 @@ static void nvme_tcp_io_work(struct 
>>> work_struct *w)
>>>           if (!pending || !queue->rd_enabled)
>>>               return;
>>> -    } while (!time_after(jiffies, deadline)); /* quota is exhausted */
>>> +    } while (!time_after(jiffies, rx_deadline)); /* quota is 
>>> exhausted */
>>> +
>>> +    overrun = jiffies - rx_deadline;
>>> +    if (nvme_tcp_queue_id(queue) > 0 &&
>>> +        overrun > msecs_to_jiffies(10))
>>> +        dev_dbg(queue->ctrl->ctrl.device, "queue %d: queue stall 
>>> (%u msecs)\n",
>>> +            nvme_tcp_queue_id(queue), jiffies_to_msecs(overrun));
>>
>> Umm, ok. why 10? why not 2? or 3?
>> Do you expect io_work to spend more time executing?
>>
> Eg 4000 msecs like on my testbed?

If you see ANYTHING in the IO path stalling for 4 seconds then we have a 
bigger issue here.

> Yes.
>
>>>       queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>>   }
>>> @@ -2666,6 +2685,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx 
>>> *hctx, struct io_comp_batch *iob)
>>>   {
>>>       struct nvme_tcp_queue *queue = hctx->driver_data;
>>>       struct sock *sk = queue->sock->sk;
>>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>       if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
>>>           return 0;
>>> @@ -2673,7 +2693,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx 
>>> *hctx, struct io_comp_batch *iob)
>>>       set_bit(NVME_TCP_Q_POLLING, &queue->flags);
>>>       if (sk_can_busy_loop(sk) && 
>>> skb_queue_empty_lockless(&sk->sk_receive_queue))
>>>           sk_busy_loop(sk, true);
>>> -    nvme_tcp_try_recv(queue);
>>> +    nvme_tcp_try_recv(queue, deadline);
>>
>> spend a millisecond in nvme_tcp_poll() ??
>> Isn't it too long?
> Haven't tried with polling, so can't honestly answer.
>
> In the end, it all boils down to numbers.
> I'm having 2 controllers with 32 queues and 256 requests.
> Running on a 10GigE link one should be getting a net throughput
> of 1GB/s, or, with 4k requests, 256k IOPS.
> Or a latency of 0.25 milliseconds per command.
> In the optimal case. As I'm seeing a bandwidth of around
> 500MB/s, I'm looking at a latency 0.5 milliseconds _in the ideal case_.
>
> So no, I don't think the 1 milliseconds is too long.

nvme_tcp_poll is NOT designed to poll for the entirety of an individual
IO lifetime. It's a non-selective polling interface wired to io_uring.
It should peek the socket, consume whatever is there, and break. It most
definitely should not spend 1ms in this check...


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/3] nvme-tcp: align I/O cpu with blk-mq mapping
  2024-07-08 12:43     ` Hannes Reinecke
@ 2024-07-08 14:38       ` Sagi Grimberg
  0 siblings, 0 replies; 22+ messages in thread
From: Sagi Grimberg @ 2024-07-08 14:38 UTC (permalink / raw)
  To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme



On 08/07/2024 15:43, Hannes Reinecke wrote:
> On 7/8/24 14:08, Sagi Grimberg wrote:
>>
>>
>> On 08/07/2024 10:10, Hannes Reinecke wrote:
>>> We should align the 'io_cpu' setting with the blk-mq
>>> cpu mapping to ensure that we're not bouncing threads
>>> when doing I/O. To avoid cpu contention this patch also
>>> adds an atomic counter for the number of queues on each
>>> cpu to distribute the load across all CPUs in the blk-mq cpu set.
>>> Additionally we should always set the 'io_cpu' value, as
>>> in the WQ_UNBOUND case it'll be treated as a hint anyway.
>>>
>>> Performance comparison:
>>>                 baseline rx/tx    blk-mq align
>>> 4k seq write:  449MiB/s 480MiB/s 524MiB/s
>>> 4k rand write: 410MiB/s 481MiB/s 524MiB/s
>>> 4k seq read:   478MiB/s 481MiB/s 566MiB/s
>>> 4k rand read:  547MiB/s 480MiB/s 511MiB/s
>>>
>>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>> ---
>>>   drivers/nvme/host/tcp.c | 65 
>>> +++++++++++++++++++++++++++++++----------
>>>   1 file changed, 49 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index f621d3ba89b2..a5c42a7b4bee 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -26,6 +26,8 @@
>>>   struct nvme_tcp_queue;
>>> +static atomic_t nvme_tcp_cpu_queues[NR_CPUS];
>>> +
>>>   /* Define the socket priority to use for connections were it is 
>>> desirable
>>>    * that the NIC consider performing optimized packet processing or 
>>> filtering.
>>>    * A non-zero value being sufficient to indicate general 
>>> consideration of any
>>> @@ -1578,20 +1580,42 @@ static bool nvme_tcp_poll_queue(struct 
>>> nvme_tcp_queue *queue)
>>>   static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
>>>   {
>>>       struct nvme_tcp_ctrl *ctrl = queue->ctrl;
>>> -    int qid = nvme_tcp_queue_id(queue);
>>> -    int n = 0;
>>> -
>>> -    if (nvme_tcp_default_queue(queue))
>>> -        n = qid - 1;
>>> -    else if (nvme_tcp_read_queue(queue))
>>> -        n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] - 1;
>>> -    else if (nvme_tcp_poll_queue(queue))
>>> +    struct blk_mq_tag_set *set = &ctrl->tag_set;
>>> +    int qid = nvme_tcp_queue_id(queue) - 1;
>>
>> umm, it's not the qid, why change it? I mean it looks harmless, but
>> I don't see why.
>>
>>> +    unsigned int *mq_map = NULL;;
>>> +    int n = 0, cpu, io_cpu, min_queues = WORK_CPU_UNBOUND;
>>
>> min_queues initialization looks very very weird. I can't even parse it.
>> besides, why is it declared in this scope?
>>
> Because it's evaluated in the loop, with the value carried over across
> loop iterations.

Right. missed that.

> WORK_CPU_UNBOUND is the init value, to detect if it had been set at all.

As far as I read this patch, min_queues and num_queues are a quantity of
queues. So it is bizarre to see it initialized to the completely unrelated
value of WORK_CPU_UNBOUND. It's really confusing. Still not sure what it
is, tbh.

>
>>> +
>>> +    if (nvme_tcp_default_queue(queue)) {
>>> +        mq_map = set->map[HCTX_TYPE_DEFAULT].mq_map;
>>> +        n = qid;
>>> +    } else if (nvme_tcp_read_queue(queue)) {
>>> +        mq_map = set->map[HCTX_TYPE_READ].mq_map;
>>> +        n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT];
>>> +    } else if (nvme_tcp_poll_queue(queue)) {
>>> +        mq_map = set->map[HCTX_TYPE_POLL].mq_map;
>>>           n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] -
>>> -                ctrl->io_queues[HCTX_TYPE_READ] - 1;
>>> -    if (wq_unbound)
>>> -        queue->io_cpu = WORK_CPU_UNBOUND;
>>> -    else
>>> -        queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, 
>>> -1, false);
>>> +                ctrl->io_queues[HCTX_TYPE_READ];
>>> +    }
>>> +
>>> +    if (WARN_ON(!mq_map))
>>> +        return;
>>> +    for_each_online_cpu(cpu) {
>>> +        int num_queues;
>>> +
>>> +        if (mq_map[cpu] != qid)
>>> +            continue;
>>> +        num_queues = atomic_read(&nvme_tcp_cpu_queues[cpu]);
>>> +        if (num_queues < min_queues) {
>>> +            min_queues = num_queues;
>>> +            io_cpu = cpu;
>>> +        }
>>> +    }
>>> +    if (io_cpu != queue->io_cpu) {
>>> +        queue->io_cpu = io_cpu;
>>> +        atomic_inc(&nvme_tcp_cpu_queues[io_cpu]);
>>> +    }
>>
>> Why is that conditioned ? why not always assign and inc the counter?
>>
> Because we're having the conditional
>
> if (mq_map[cpu] != qid)
>
> above, so technically we might not find the correct mapping.
> Might be worthwhile doing a WARN_ON() here, but it might trigger
> with CPU hotplug...

Writing code that looks like:
     if (a != b) {
         a = b;
     }

is silly.

IIUC it is probably because you initialized the queue->io_cpu to
WORK_CPU_UNBOUND and now you don't want to touch it. But it is
still kinda ugly.

I suggest that you always select an io_cpu and always account it.
You said yourself that if the wq is unbound, it doesn't really matter what
the io_cpu is...

And I think that a code comment would help here. It's not the most trivial
code...
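
Roughly something like this sketch (the mq_map/qid selection is left out
and passed in for brevity; the fallback is purely illustrative):

/*
 * Walk the blk-mq map for this queue and pick the online cpu that
 * currently hosts the fewest nvme-tcp queues, then account for it
 * unconditionally.  min_queues is a queue count, so it starts at
 * INT_MAX rather than WORK_CPU_UNBOUND.
 */
static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue,
                                      unsigned int *mq_map, int qid)
{
        int cpu, io_cpu = WORK_CPU_UNBOUND, min_queues = INT_MAX;

        for_each_online_cpu(cpu) {
                int num_queues;

                if (mq_map[cpu] != qid)
                        continue;
                num_queues = atomic_read(&nvme_tcp_cpu_queues[cpu]);
                if (num_queues < min_queues) {
                        min_queues = num_queues;
                        io_cpu = cpu;
                }
        }
        /* no match in the map (cpu hotplug?): fall back to the current cpu */
        if (io_cpu == WORK_CPU_UNBOUND)
                io_cpu = raw_smp_processor_id();

        queue->io_cpu = io_cpu;
        atomic_inc(&nvme_tcp_cpu_queues[io_cpu]);
}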

>
>>> +    dev_dbg(ctrl->ctrl.device, "queue %d: using cpu %d\n",
>>> +        qid, queue->io_cpu);
>>>   }
>>>   static void nvme_tcp_tls_done(void *data, int status, key_serial_t 
>>> pskid)
>>> @@ -1735,7 +1759,7 @@ static int nvme_tcp_alloc_queue(struct 
>>> nvme_ctrl *nctrl, int qid,
>>>       queue->sock->sk->sk_allocation = GFP_ATOMIC;
>>>       queue->sock->sk->sk_use_task_frag = false;
>>> -    nvme_tcp_set_queue_io_cpu(queue);
>>> +    queue->io_cpu = WORK_CPU_UNBOUND;
>>>       queue->request = NULL;
>>>       queue->data_remaining = 0;
>>>       queue->ddgst_remaining = 0;
>>> @@ -1847,6 +1871,10 @@ static void __nvme_tcp_stop_queue(struct 
>>> nvme_tcp_queue *queue)
>>>       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>>>       nvme_tcp_restore_sock_ops(queue);
>>>       cancel_work_sync(&queue->io_work);
>>> +    if (queue->io_cpu != WORK_CPU_UNBOUND) {
>>> + atomic_dec(&nvme_tcp_cpu_queues[queue->io_cpu]);
>>> +        queue->io_cpu = WORK_CPU_UNBOUND;
>>> +    }
>>
>> Does anything change if we still set the io_cpu in wq_unbound case?
>> If not, I think we should always do it and simplify the code.
>>
> See above. WORK_CPU_UNBOUND means that we failed to find the cpu
> in the mq_map. Yes, I know, unlikely, but not impossible.

If you always select a cpu, it should be fine.

>
>>>   }
>>>   static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
>>> @@ -1891,9 +1919,10 @@ static int nvme_tcp_start_queue(struct 
>>> nvme_ctrl *nctrl, int idx)
>>>       nvme_tcp_init_recv_ctx(queue);
>>>       nvme_tcp_setup_sock_ops(queue);
>>> -    if (idx)
>>> +    if (idx) {
>>> +        nvme_tcp_set_queue_io_cpu(queue);
>>>           ret = nvmf_connect_io_queue(nctrl, idx);
>>> -    else
>>> +    } else
>>>           ret = nvmf_connect_admin_queue(nctrl);
>>>       if (!ret) {
>>> @@ -2920,6 +2949,7 @@ static struct nvmf_transport_ops 
>>> nvme_tcp_transport = {
>>>   static int __init nvme_tcp_init_module(void)
>>>   {
>>>       unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_SYSFS;
>>> +    int cpu;
>>>       BUILD_BUG_ON(sizeof(struct nvme_tcp_hdr) != 8);
>>>       BUILD_BUG_ON(sizeof(struct nvme_tcp_cmd_pdu) != 72);
>>> @@ -2937,6 +2967,9 @@ static int __init nvme_tcp_init_module(void)
>>>       if (!nvme_tcp_wq)
>>>           return -ENOMEM;
>>> +    for_each_possible_cpu(cpu)
>>> +        atomic_set(&nvme_tcp_cpu_queues[cpu], 0);
>>> +
>>
>> Why?
>
> Don't we need to initialize the atomic counters?

I thought this was the module __exit function, sorry.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/3] nvme-tcp: per-controller I/O workqueues
  2024-07-08 12:48     ` Hannes Reinecke
@ 2024-07-08 14:41       ` Sagi Grimberg
  0 siblings, 0 replies; 22+ messages in thread
From: Sagi Grimberg @ 2024-07-08 14:41 UTC (permalink / raw)
  To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig, Tejun Heo
  Cc: Keith Busch, linux-nvme



On 08/07/2024 15:48, Hannes Reinecke wrote:
> On 7/8/24 14:12, Sagi Grimberg wrote:
>>
>>
>> On 08/07/2024 10:10, Hannes Reinecke wrote:
>>> From: Hannes Reinecke <hare@suse.de>
>>>
>>> Implement per-controller I/O workqueues to reduce workqueue contention
>>> during I/O and improve I/O performance.
>>>
>>> Performance comparison:
>>>                 baseline rx/tx    blk-mq   multiple workqueues
>>> 4k seq write:  449MiB/s 480MiB/s 524MiB/s 540MiB/s
>>> 4k rand write: 410MiB/s 481MiB/s 524MiB/s 539MiB/s
>>> 4k seq read:   478MiB/s 481MiB/s 566MiB/s 582MiB/s
>>> 4k rand read:  547MiB/s 480MiB/s 511MiB/s 633MiB/s
>>
>> I am still puzzled by this one.
>>
>> This is for 2 controllers? or more?
It is interesting that the rand read sees a higher boost than the seq read.
Is this in the nature of the SSD? What happens with null_blk?
>>
> See the patchset description.
> Two controllers, one subsystem.

Can we try with null_blk as well?
Not that I'm dismissing the nvme numbers.

>
>> CCing Tejun. Is it possible that using two different workqueues
>> for a symmetrical workload is better than a single global workqueue?
>
> Yes, that is the implication.
> It might simply be due to the fact that the number of workers in a
> workqueue pool is limited to 512:
>
> workqueue.h:      WQ_MAX_ACTIVE           = 512,    /* I like 512, 
> better ideas? */
>
> so by using a workqueue per controller we effectively raise the number
> of possible workers. And we're reducing the concurrency between 
> controllers, as each controller can now schedule I/O independent on
> the other.
>
> Let me check what would happen if I increase MAX_ACTIVE ...

Umm, do you have a concurrency of more than 512 active? You have 64 queues
in your setup, no?
Maybe I'm missing something...
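
For reference, the per-controller workqueue side of this boils down to
roughly the following sketch ('io_wq' as a new nvme_tcp_ctrl member is a
made-up name, error handling elided):

/*
 * Give each controller its own I/O workqueue instead of the global
 * nvme_tcp_wq, so the worker pool and max_active limit are per
 * controller and controllers no longer compete for the same pool.
 */
static int nvme_tcp_alloc_io_wq(struct nvme_tcp_ctrl *ctrl)
{
        ctrl->io_wq = alloc_workqueue("nvme_tcp_wq_%d",
                        WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_SYSFS,
                        0, ctrl->ctrl.instance);
        return ctrl->io_wq ? 0 : -ENOMEM;
}

static void nvme_tcp_free_io_wq(struct nvme_tcp_ctrl *ctrl)
{
        if (ctrl->io_wq) {
                destroy_workqueue(ctrl->io_wq);
                ctrl->io_wq = NULL;
        }
}

/* ...with queue_work_on() then using queue->ctrl->io_wq instead of nvme_tcp_wq */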


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] nvme-tcp: improve rx/tx fairness
  2024-07-08 14:25       ` Sagi Grimberg
@ 2024-07-08 15:50         ` Hannes Reinecke
  2024-07-08 19:31           ` Sagi Grimberg
  0 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2024-07-08 15:50 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 7/8/24 16:25, Sagi Grimberg wrote:
> 
> 
> On 08/07/2024 16:21, Hannes Reinecke wrote:
>> On 7/8/24 13:57, Sagi Grimberg wrote:
>>> Hey Hannes, thanks for doing this.
>>>
[ .. ]
>>> That is a significant decline in relative perf. I'm counting 12.5%...
>>> Can you explain why that is?
>>>
>> Not really :-(
>> But then fairness cuts both ways; so I am not surprised that some 
>> workloads suffer here.
> 
> Well, 12.5% is rather steep, don't you agree?
> I'm even more surprised that seq/rand read make a difference...
> 
Yeah; but I guess I found the issue now (we should interrupt data 
transfers if the deadline expires, not just pdu transfers).

[ .. ]
>>>
>>>> I don't see why you need to keep this in the queue struct. You could
>>>> easily initialize it in the read_descriptor_t and test against it.
>>>
>> Because I wanted to interrupt the receive side for large data 
>> transfers, and haven't found a way to pass the deadline into 
>> ->read_sock().
> 
> desc.count is for you to interpret however you want it to be no?
> 
Except for the magic 'desc.count == 0' value; but yeah, that should work.
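
A rough sketch of that variant, with desc->count carrying the remaining
byte budget instead of the current 0/1 flag (NVME_TCP_RECV_BUDGET is a
made-up limit):

static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
{
        struct socket *sock = queue->sock;
        struct sock *sk = sock->sk;
        read_descriptor_t rd_desc;
        int consumed;

        rd_desc.arg.data = queue;
        rd_desc.count = NVME_TCP_RECV_BUDGET;   /* bytes allowed this round */
        lock_sock(sk);
        queue->nr_cqe = 0;
        consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
        release_sock(sk);
        return consumed;
}

/* ...and in nvme_tcp_recv_skb(), after consuming 'len' bytes: */
        if (len >= desc->count)
                desc->count = 0;        /* budget exhausted, read_sock() stops */
        else
                desc->count -= len;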

>>
>>>>       /* send state */
>>>>       struct nvme_tcp_request *request;
>>>> @@ -359,14 +360,18 @@ static inline void nvme_tcp_advance_req(struct 
>>>> nvme_tcp_request *req,
>>>>       }
>>>>   }
>>>> -static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>>>> +static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue,
>>>> +                    unsigned long deadline)
>>>>   {
>>>>       int ret;
>>>>       /* drain the send queue as much as we can... */
>>>>       do {
>>>>           ret = nvme_tcp_try_send(queue);
>>>> +        if (time_after(jiffies, deadline))
>>>> +            break;
>>>>       } while (ret > 0);
>>>> +    return ret;
>>>
>>> I think you want a different interface, nvme_tcp_send_budgeted(queue, 
>>> budget).
>>> I don't know what you pass here, but jiffies is a rather large 
>>> granularity...
>>>
>> Hmm. I've been using jiffies as the io_work loop had been counting
>> in jiffies. But let me check what would happen if I move that over to
>> PDU / size counting.
> 
> Cool.
> 
>>
>>>>   }
>>>>   static inline bool nvme_tcp_queue_has_pending(struct 
>>>> nvme_tcp_queue *queue)
>>>> @@ -385,6 +390,7 @@ static inline void nvme_tcp_queue_request(struct 
>>>> nvme_tcp_request *req,
>>>>           bool sync, bool last)
>>>>   {
>>>>       struct nvme_tcp_queue *queue = req->queue;
>>>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>>       bool empty;
>>>>       empty = llist_add(&req->lentry, &queue->req_list) &&
>>>> @@ -397,7 +403,7 @@ static inline void nvme_tcp_queue_request(struct 
>>>> nvme_tcp_request *req,
>>>>        */
>>>>       if (queue->io_cpu == raw_smp_processor_id() &&
>>>>           sync && empty && mutex_trylock(&queue->send_mutex)) {
>>>> -        nvme_tcp_send_all(queue);
>>>> +        nvme_tcp_send_all(queue, deadline);
>>>>           mutex_unlock(&queue->send_mutex);
>>>>       }
>>>
>>> Umm, spend up to a millisecond in queue_request? Sounds like way
>>> too much...
>>> Did you ever see this deadline exceeded? sends should be rather quick...
>>>
>> Of _course_ it's too long. That's kinda the point.
>> But I'm seeing network latency up to 4000 msecs (!) on my test setup,
>> so _that_ is the least of my worries ...
> 
> This has nothing to do with the network latency afaict. This is just a 
> matter of how long does it take to write to the socket buffer (or in our case, 
> construct skbs and add them to the socket as TX is zero copy).
> 
Not the transfer itself? I would have thought so ...

>>
>>>> @@ -959,9 +965,14 @@ static int nvme_tcp_recv_skb(read_descriptor_t 
>>>> *desc, struct sk_buff *skb,
>>>> nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>>>               return result;
>>>>           }
>>>> +        if (time_after(jiffies, queue->deadline)) {
>>>> +            desc->count = 0;
>>>> +            break;
>>>> +        }
>>>> +
>>>
>>> That is still not right.
>>> You don't want to spend a full deadline reading from the socket, and 
>>> then spend a full deadline writing to the socket...
>>>
>> Yes, and no.
>> Problem is that the current code serializes writes  and reads.
>> Essentially for each iteration we first do a write, and then a read.
>> If we spend the full deadline on write we will need to reschedule,
>> but we then _again_ start with writes. This leads to a heavy preference
>> for writing, and negative performance impact.
> 
> Yes, the assumption we should take is that the tx/rx limits allow to 
> switch between the
> two at least once or twice in the io_work while loop, and that they get 
> a fair'ish share
> of quota.
> 
>>
>>> You want the io_work to take a full deadline, and send budgets of 
>>> try_send and try_recv. And set it to sane counts. Say 8 pdus, or
>>> 64k bytes. We want to get to some magic value that presents a
>>> sane behavior, that confidently fits inside a deadline, and is fair.
>>>
>> Easier said than done.
>> Biggest problem is that most of the latency increase comes from the
>> actual 'sendmsg()' and 'read_sock()' calls.
> 
> This is a bit surprising. because read_sock reads data that is already
> processed by the tcp stack, and ready to simply copy back to the user
> pages. Same for sendmsg, the socket is non-blocking, so I would not
> expect it to stall for unusually long periods. Maybe there is an issue
> hiding in how we call send/recv?
> 
Hmm. That could be.

>> And the only way of inhibiting that would be to check _prior_ whether
>> we can issue the call in the first place.
>> (That's why I did the SOCK_NOSPACE tests in the previous patchsets).
> 
> The socket is non-blocking. I always thought that SOCK_NOSPACE is 
> serving blocking
> sockets... afaik, there shouldn't be a real difference between checking 
> NOSPACE vs.
> attempting sendmsg on a non-blocking socket... But I could be wrong here..
> 

Weellll ... if 'SOCK_NOSPACE' is for blocking sockets, why do we even 
get the 'write_space()' callback?
It gets triggered quite often, and checking for the SOCK_NOSPACE bit 
before sending drops the number of invocations quite significantly.
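
The kind of check meant here is roughly the following sketch (helper name
made up, placement illustrative):

/*
 * Skip the sendmsg() attempt when the tcp stack has already flagged
 * that there is no space, and rely on the write_space() callback to
 * reschedule io_work once space frees up again.
 */
static inline bool nvme_tcp_queue_writeable(struct nvme_tcp_queue *queue)
{
        struct sock *sk = queue->sock->sk;

        if (test_bit(SOCK_NOSPACE, &queue->sock->flags) &&
            !sk_stream_is_writeable(sk))
                return false;
        return true;
}

/* ...used before attempting a send, e.g.:
 *      if (nvme_tcp_queue_writeable(queue) &&
 *          mutex_trylock(&queue->send_mutex)) { ... }
 */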

[  .. ]
>>>>                   break;
>>>>           }
>>>> -        result = nvme_tcp_try_recv(queue);
>>>> +        result = nvme_tcp_try_recv(queue, rx_deadline);
>>>
>>> I think you want a more frequent substitution of sends/receives. the 
>>> granularity of 1ms budget may be too coarse?
>>>
>> The problem is not the granularity, the problem is the latency spikes
>> I'm seeing when issuing 'sendmsg' or 'read_sock'.
> 
> I think that its possible that there is another underlying issue. 
> Because these calls
> should NOT be blocked for a long period of time for non-blocking sockets.
> 
I did some instrumentation, and that certainly pointed in that direction.
But I will have to redo that to get some meaningful data here.
Another point really might be the granularity; jiffies isn't a good
measurement here (even with HZ=1000 we'll only have millisecond
granularity, so it might be that we're incurring fluctuations due to the
jiffies calculation itself).
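
A ktime-based variant of the budget would avoid the per-tick quantization;
a sketch (loop body elided):

static void nvme_tcp_io_work(struct work_struct *w)
{
        struct nvme_tcp_queue *queue =
                container_of(w, struct nvme_tcp_queue, io_work);
        /* ~1ms budget, measured in nanoseconds instead of jiffies */
        ktime_t deadline = ktime_add_us(ktime_get(), 1000);

        do {
                /* ...bounded tx/rx rounds as before... */
        } while (ktime_before(ktime_get(), deadline));

        queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}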

>>
>>>>           if (result > 0)
>>>>               pending = true;
>>>>           else if (unlikely(result < 0))
>>>> @@ -1302,7 +1315,13 @@ static void nvme_tcp_io_work(struct 
>>>> work_struct *w)
>>>>           if (!pending || !queue->rd_enabled)
>>>>               return;
>>>> -    } while (!time_after(jiffies, deadline)); /* quota is exhausted */
>>>> +    } while (!time_after(jiffies, rx_deadline)); /* quota is 
>>>> exhausted */
>>>> +
>>>> +    overrun = jiffies - rx_deadline;
>>>> +    if (nvme_tcp_queue_id(queue) > 0 &&
>>>> +        overrun > msecs_to_jiffies(10))
>>>> +        dev_dbg(queue->ctrl->ctrl.device, "queue %d: queue stall 
>>>> (%u msecs)\n",
>>>> +            nvme_tcp_queue_id(queue), jiffies_to_msecs(overrun));
>>>
>>> Umm, ok. why 10? why not 2? or 3?
>>> Do you expect io_work to spend more time executing?
>>>
>> Eg 4000 msecs like on my testbed?
> 
> If you see ANYTHING in the IO path stalling for 4 seconds then we have a 
> bigger issue here.
> 
You don't say ...

>> Yes.
>>
>>>>       queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>>>   }
>>>> @@ -2666,6 +2685,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx 
>>>> *hctx, struct io_comp_batch *iob)
>>>>   {
>>>>       struct nvme_tcp_queue *queue = hctx->driver_data;
>>>>       struct sock *sk = queue->sock->sk;
>>>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>>       if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
>>>>           return 0;
>>>> @@ -2673,7 +2693,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx 
>>>> *hctx, struct io_comp_batch *iob)
>>>>       set_bit(NVME_TCP_Q_POLLING, &queue->flags);
>>>>       if (sk_can_busy_loop(sk) && 
>>>> skb_queue_empty_lockless(&sk->sk_receive_queue))
>>>>           sk_busy_loop(sk, true);
>>>> -    nvme_tcp_try_recv(queue);
>>>> +    nvme_tcp_try_recv(queue, deadline);
>>>
>>> spend a millisecond in nvme_tcp_poll() ??
>>> Isn't it too long?
>> Haven't tried with polling, so can't honestly answer.
>>
>> In the end, it all boils down to numbers.
>> I'm having 2 controllers with 32 queues and 256 requests.
>> Running on a 10GigE link one should be getting a net throughput
>> of 1GB/s, or, with 4k requests, 256k IOPS.
>> Or a latency of 0.25 milliseconds per command.
>> In the optimal case. As I'm seeing a bandwidth of around
>> 500MB/s, I'm looking at a latency 0.5 milliseconds _in the ideal case_.
>>
>> So no, I don't think the 1 milliseconds is too long.
> 
> nvme_tcp_poll is NOT designed to poll for the entirety or an individual 
> IO lifetime. Its
> a non-selective polling interface wired to io_uring. It should peek the 
> socket, consume what ever
> is there, and break. It most definitely should not spend 1ms in this 
> check...
Oh, I know that.
That's why I said: "Haven't tried with polling."

The calculation is independent of polling.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] nvme-tcp: improve rx/tx fairness
  2024-07-08 15:50         ` Hannes Reinecke
@ 2024-07-08 19:31           ` Sagi Grimberg
  2024-07-09  6:51             ` Hannes Reinecke
  0 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2024-07-08 19:31 UTC (permalink / raw)
  To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme



On 08/07/2024 18:50, Hannes Reinecke wrote:
> On 7/8/24 16:25, Sagi Grimberg wrote:
>>
>>
>> On 08/07/2024 16:21, Hannes Reinecke wrote:
>>> On 7/8/24 13:57, Sagi Grimberg wrote:
>>>> Hey Hannes, thanks for doing this.
>>>>
> [ .. ]
>>>> That is a significant decline in relative perf. I'm counting 12.5%...
>>>> Can you explain why that is?
>>>>
>>> Not really :-(
>>> But then fairness cuts both ways; so I am not surprised that some 
>>> workloads suffer here.
>>
>> Well, 12.5% is rather steep, don't you agree?
>> I'm even more surprised that seq/rand read make a difference...
>>
> Yeah; but I guess I found the issue now (we should interrupt data 
> transfers if the deadline expires, not just pdu transfers).
>
> [ .. ]
>>>>
>>>> I don't see why you need to keep this in the queue struct. You could
>>>> easily initialize it in the read_descriptor_t and test against it.
>>>>
>>> Because I wanted to interrupt the receive side for large data 
>>> transfers, and haven't found a way to pass the deadline into 
>>> ->read_sock().
>>
>> desc.count is for you to interpret however you want it to be no?
>>
> Except for the magic 'desc.count == 0' value; but yeah, that should work.
>
>>>
>>>>>       /* send state */
>>>>>       struct nvme_tcp_request *request;
>>>>> @@ -359,14 +360,18 @@ static inline void 
>>>>> nvme_tcp_advance_req(struct nvme_tcp_request *req,
>>>>>       }
>>>>>   }
>>>>> -static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>>>>> +static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue,
>>>>> +                    unsigned long deadline)
>>>>>   {
>>>>>       int ret;
>>>>>       /* drain the send queue as much as we can... */
>>>>>       do {
>>>>>           ret = nvme_tcp_try_send(queue);
>>>>> +        if (time_after(jiffies, deadline))
>>>>> +            break;
>>>>>       } while (ret > 0);
>>>>> +    return ret;
>>>>
>>>> I think you want a different interface, 
>>>> nvme_tcp_send_budgeted(queue, budget).
>>>> I don't know what you pass here, but jiffies is a rather large 
>>>> granularity...
>>>>
>>> Hmm. I've been using jiffies as the io_work loop had been counting
>>> in jiffies. But let me check what would happen if I move that over to
>>> PDU / size counting.
>>
>> Cool.
>>
>>>
>>>>>   }
>>>>>   static inline bool nvme_tcp_queue_has_pending(struct 
>>>>> nvme_tcp_queue *queue)
>>>>> @@ -385,6 +390,7 @@ static inline void 
>>>>> nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>>>>           bool sync, bool last)
>>>>>   {
>>>>>       struct nvme_tcp_queue *queue = req->queue;
>>>>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>>>       bool empty;
>>>>>       empty = llist_add(&req->lentry, &queue->req_list) &&
>>>>> @@ -397,7 +403,7 @@ static inline void 
>>>>> nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>>>>        */
>>>>>       if (queue->io_cpu == raw_smp_processor_id() &&
>>>>>           sync && empty && mutex_trylock(&queue->send_mutex)) {
>>>>> -        nvme_tcp_send_all(queue);
>>>>> +        nvme_tcp_send_all(queue, deadline);
>>>>>           mutex_unlock(&queue->send_mutex);
>>>>>       }
>>>>
>>>> Umm, spend up to a millisecond in queue_request? Sounds like
>>>> way too much...
>>>> Did you ever see this deadline exceeded? sends should be rather 
>>>> quick...
>>>>
>>> Of _course_ it's too long. That's kinda the point.
>>> But I'm seeing network latency up to 4000 msecs (!) on my test setup,
>>> so _that_ is the least of my worries ...
>>
>> This has nothing to do with the network latency afaict. This is just 
>> a matter of how long does it take to write to the socket buffer (or 
>> in our case, construct skbs and add them to the socket as TX is zero 
>> copy).
>>
> Not the transfer itself? I would have thought so ...

It may trigger pushing packets down to [tcp/ip]_output under some conditions,
but that is also not something that should block for long.

>
>>>
>>>>> @@ -959,9 +965,14 @@ static int 
>>>>> nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>>>>> nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>>>>               return result;
>>>>>           }
>>>>> +        if (time_after(jiffies, queue->deadline)) {
>>>>> +            desc->count = 0;
>>>>> +            break;
>>>>> +        }
>>>>> +
>>>>
>>>> That is still not right.
>>>> You don't want to spend a full deadline reading from the socket, 
>>>> and then spend a full deadline writing to the socket...
>>>>
>>> Yes, and no.
>>> Problem is that the current code serializes writes  and reads.
>>> Essentially for each iteration we first do a write, and then a read.
>>> If we spend the full deadline on write we will need to reschedule,
>>> but we then _again_ start with writes. This leads to a heavy preference
>>> for writing, and negative performance impact.
>>
>> Yes, the assumption we should take is that the tx/rx limits allow to 
>> switch between the
>> two at least once or twice in the io_work while loop, and that they 
>> get a fair'ish share
>> of quota.
>>
>>>
>>>> You want the io_work to take a full deadline, and send budgets of 
>>>> try_send and try_recv. And set it to sane counts. Say 8 pdus, or
>>>> 64k bytes. We want to get to some magic value that presents a
>>>> sane behavior, that confidently fits inside a deadline, and is fair.
>>>>
>>> Easier said than done.
>>> Biggest problem is that most of the latency increase comes from the
>>> actual 'sendmsg()' and 'read_sock()' calls.
>>
>> This is a bit surprising. because read_sock reads data that is already
>> processed by the tcp stack, and ready to simply copy back to the user
>> pages. Same for sendmsg, the socket is non-blocking, so I would not
>> expect it to stall for unusually long periods. Maybe there is an issue
>> hiding in how we call send/recv?
>>
> Hmm. That could be.
>
>>> And the only way of inhibiting that would be to check _prior_ whether
>>> we can issue the call in the first place.
>>> (That's why I did the SOCK_NOSPACE tests in the previous patchsets).
>>
>> The socket is non-blocking. I always thought that SOCK_NOSPACE is 
>> serving blocking
>> sockets... afaik, there shouldn't be a real difference between 
>> checking NOSPACE vs.
>> attempting sendmsg on a non-blocking socket... But I could be wrong 
>> here..
>>
>
> Weellll ... if 'SOCK_NOSPACE' is for blocking sockets, why do we even 
> get the 'write_space()' callback?
> It gets triggered quite often, and checking for the SOCK_NOSPACE bit 
> before sending drops the number of invocations quite significantly.

I need to check, but where have you been testing it? In the inline path?
Thinking further, I do agree that because the io_work is shared, we may
sendmsg immediately as space becomes available instead of waiting for some
minimum amount of space.

Can you please quantify this with your testing?
How many times do we get the write_space() callback? How many times do we
get EAGAIN, and what is the perf... It would also be interesting whether
this is more apparent in specific workloads. I'm assuming it's more
apparent with large write workloads.
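
Per-queue debug counters along these lines would be enough to collect those
numbers (purely illustrative instrumentation; struct and field names are
made up):

struct nvme_tcp_queue_stats {
        atomic_t        write_space;    /* .write_space() callbacks seen  */
        atomic_t        send_calls;     /* sendmsg() attempts             */
        atomic_t        send_eagain;    /* sendmsg() returned -EAGAIN     */
};

/* in nvme_tcp_write_space(), before rescheduling io_work: */
        atomic_inc(&queue->stats.write_space);

/* in the send path, around the sendmsg() call: */
        atomic_inc(&queue->stats.send_calls);
        ret = sock_sendmsg(queue->sock, &msg);
        if (ret == -EAGAIN)
                atomic_inc(&queue->stats.send_eagain);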


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] nvme-tcp: improve rx/tx fairness
  2024-07-08 19:31           ` Sagi Grimberg
@ 2024-07-09  6:51             ` Hannes Reinecke
  2024-07-09  7:06               ` Sagi Grimberg
  0 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2024-07-09  6:51 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 7/8/24 21:31, Sagi Grimberg wrote:
> 
> 
> On 08/07/2024 18:50, Hannes Reinecke wrote:
[ .. ]
>>
>> Weellll ... if 'SOCK_NOSPACE' is for blocking sockets, why do we even 
>> get the 'write_space()' callback?
>> It gets triggered quite often, and checking for the SOCK_NOSPACE bit 
>> before sending drops the number of invocations quite significantly.
> 
> I need to check, but where have you testing it? in the inline path?
> Thinking further, I do agree that because the io_work is shared, we may
> sendmsg immediately as space is becomes available instead of some 
> minimum space.
>  > Can you please quantify this with your testing?
> How many times we get write_space() callback? how many times we get 
> EAGAIN, and what
> is the perf... Also interesting if this is more apparent in specific 
> workloads. I'm assuming its
> more apparent with large write workloads.

Stats for my testing
(second row is controller 1, third row controller 2):

write_space:
0: 0 0
1: 489 2
2: 31 11
3: 50 1
4: 299 42
5: 1 19
6: 12 0
7: 737 16
8: 636 1
9: 19 1
10: 325 20
11: 28 1
12: 14 1
13: 252 1
14: 100 12
15: 310 1
16: 454 2

queue_busy:
0: 0 0
1: 396 2
2: 22 29
3: 56 5
4: 153 99
5: 2 40
6: 18 0
7: 632 5
8: 590 13
9: 129 2
10: 66 18
11: 91 8
12: 56 14
13: 464 7
14: 24 15
15: 574 8
16: 516 9

The send latency is actually pretty good, and around 150 us on 
controller 1 (with two notable exceptions of 480 us on queue 5 and 646 
us on queue 6) and around 100 us on controller 2 (again with exceptions 
of 444 us on queue 5 and 643 us on queue 6). Each queue processing 
around 150k requests.

Receive latency is far more consistent; each queue has around 150 us on 
controller 1 and 140 us on controller 2.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] nvme-tcp: improve rx/tx fairness
  2024-07-09  6:51             ` Hannes Reinecke
@ 2024-07-09  7:06               ` Sagi Grimberg
  0 siblings, 0 replies; 22+ messages in thread
From: Sagi Grimberg @ 2024-07-09  7:06 UTC (permalink / raw)
  To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme



On 09/07/2024 9:51, Hannes Reinecke wrote:
> On 7/8/24 21:31, Sagi Grimberg wrote:
>>
>>
>> On 08/07/2024 18:50, Hannes Reinecke wrote:
> [ .. ]
>>>
>>> Weellll ... if 'SOCK_NOSPACE' is for blocking sockets, why do we 
>>> even get the 'write_space()' callback?
>>> It gets triggered quite often, and checking for the SOCK_NOSPACE bit 
>>> before sending drops the number of invocations quite significantly.
>>
>> I need to check, but where have you testing it? in the inline path?
>> Thinking further, I do agree that because the io_work is shared, we may
>> sendmsg immediately as space is becomes available instead of some 
>> minimum space.
>>  > Can you please quantify this with your testing?
>> How many times we get write_space() callback? how many times we get 
>> EAGAIN, and what
>> is the perf... Also interesting if this is more apparent in specific 
>> workloads. I'm assuming its
>> more apparent with large write workloads.
>
> Stats for my testing
> (second row is controller 1, third row controller 2):

Which row? you mean column?

>
> write_space:
> 0: 0 0
> 1: 489 2
> 10: 325 20
> 11: 28 1
> 12: 14 1
> 13: 252 1
> 14: 100 12
> 15: 310 1
> 16: 454 2
> 2: 31 11
> 3: 50 1
> 4: 299 42
> 5: 1 19
> 6: 12 0
> 7: 737 16
> 8: 636 1
> 9: 19 1

These are incrementing counters where ctrl1 and ctrl2 got the 
.write_space() callbacks?
What are 0-16? queues? I thought you had 32 queues...

>
> queue_busy:
> 0: 0 0
> 1: 396 2
> 10: 66 18
> 11: 91 8
> 12: 56 14
> 13: 464 7
> 14: 24 15
> 15: 574 8
> 16: 516 9
> 2: 22 29
> 3: 56 5
> 4: 153 99
> 5: 2 40
> 6: 18 0
> 7: 632 5
> 8: 590 13
> 9: 129 2

What is queue_busy?

>
> The send latency is actually pretty good, and around 150 us on 
> controller 1

150us for what? the average duration of sendmsg call alone?

i.e.
start
sendmsg
end

> (with two notable exceptions of 480 us on queue 5 and 646 us on queue 
> 6) and around 100 us on controller 2 (again with exceptions of 444 us 
> on queue 5 and 643 us on queue 6). Each queue processing around 150k 
> requests.

Are you referring to the above measurements? I'm having trouble 
understanding what you wrote...

>
> Receive latency is far more consistent; each queue has around 150 us 
> on controller 1 and 140 us on controller 2.

Again, latency of what?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCHv2 0/3] nvme-tcp: improve scalability
  2024-07-08  7:10 [PATCHv2 0/3] nvme-tcp: improve scalability Hannes Reinecke
                   ` (2 preceding siblings ...)
  2024-07-08  7:10 ` [PATCH 3/3] nvme-tcp: per-controller I/O workqueues Hannes Reinecke
@ 2024-07-10 11:56 ` Sagi Grimberg
  2024-07-10 14:06   ` Hannes Reinecke
  2024-07-16  6:31 ` Sagi Grimberg
  4 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2024-07-10 11:56 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme



On 08/07/2024 10:10, Hannes Reinecke wrote:
> Hi all,
>
> for workloads with a lot of controllers we run into workqueue contention,
> where the single workqueue is not able to service requests fast enough,
> leading to spurious I/O errors and connect resets during high load.
> This patchset improves the situation by improve the fairness between
> rx and tx scheduling, introducing per-controller workqueues,
> and distribute the load accoring to the blk-mq cpu mapping.
> With this we reduce the spurious I/O errors and improve the overall
> performance for highly contended workloads.
>
> All performance number are derived from the 'tiobench-example.fio'

Did you keep the fio file unmodified? I'd suggest running it for longer,
say 60 seconds per workload. 512 MB is a very short benchmark...


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCHv2 0/3] nvme-tcp: improve scalability
  2024-07-10 11:56 ` [PATCHv2 0/3] nvme-tcp: improve scalability Sagi Grimberg
@ 2024-07-10 14:06   ` Hannes Reinecke
  2024-07-10 14:45     ` Sagi Grimberg
  0 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2024-07-10 14:06 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 7/10/24 13:56, Sagi Grimberg wrote:
> 
> 
> On 08/07/2024 10:10, Hannes Reinecke wrote:
>> Hi all,
>>
>> for workloads with a lot of controllers we run into workqueue contention,
>> where the single workqueue is not able to service requests fast enough,
>> leading to spurious I/O errors and connect resets during high load.
>> This patchset improves the situation by improve the fairness between
>> rx and tx scheduling, introducing per-controller workqueues,
>> and distribute the load accoring to the blk-mq cpu mapping.
>> With this we reduce the spurious I/O errors and improve the overall
>> performance for highly contended workloads.
>>
>> All performance number are derived from the 'tiobench-example.fio'
> 
> Did you keep the fio file unmodified? I'd suggest to run it for longer
> say 60 seconds each workload. 512 MB is a very short benchmark...

Not for 32 queues :-)
But yeah, I can keep it running for slightly longer.

Not making much progress, mind; your 'softirq' patch definitely speeds 
up receiving, but seems to mess up the write side such that I'm
basically guaranteed to hit I/O timeouts on WRITE :-(

Keep on debugging ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCHv2 0/3] nvme-tcp: improve scalability
  2024-07-10 14:06   ` Hannes Reinecke
@ 2024-07-10 14:45     ` Sagi Grimberg
  0 siblings, 0 replies; 22+ messages in thread
From: Sagi Grimberg @ 2024-07-10 14:45 UTC (permalink / raw)
  To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme



On 10/07/2024 17:06, Hannes Reinecke wrote:
> On 7/10/24 13:56, Sagi Grimberg wrote:
>>
>>
>> On 08/07/2024 10:10, Hannes Reinecke wrote:
>>> Hi all,
>>>
>>> for workloads with a lot of controllers we run into workqueue 
>>> contention,
>>> where the single workqueue is not able to service requests fast enough,
>>> leading to spurious I/O errors and connect resets during high load.
>>> This patchset improves the situation by improve the fairness between
>>> rx and tx scheduling, introducing per-controller workqueues,
>>> and distribute the load accoring to the blk-mq cpu mapping.
>>> With this we reduce the spurious I/O errors and improve the overall
>>> performance for highly contended workloads.
>>>
>>> All performance number are derived from the 'tiobench-example.fio'
>>
>> Did you keep the fio file unmodified? I'd suggest to run it for longer
>> say 60 seconds each workload. 512 MB is a very short benchmark...
>
> Not for 32 queues :-)
> But yeah, I can keep it running for slightly longer.

How does the number of queues make a difference?
Doesn't it simply write 512MB from 4 threads?

>
> Not making much progress, mind; your 'softirq' patch definitely speeds 
> up receiving, but seem to messing up the write side such that I'm 
> basically guaranteed to hit I/O timeouts on WRITE :-(
>
> Keep on debugging ...

Hannes, I think that now that we have established that rx can starve tx,
we must pace rx. So if we do something like softirq rx, it must be an
initial, limited batch, and once we exhaust it, we schedule a workqueue to
continue processing. I'd also leave it alone for now; we'll add it once we
have a good understanding of what is going on...
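
i.e. whatever the inline rx path ends up looking like, the hand-over once
the initial batch budget is used up could be as simple as this fragment
(sketch only):

        /*
         * Initial inline rx batch is exhausted; if more data is already
         * queued, let io_work continue instead of reading on inline and
         * starving tx.
         */
        if (!skb_queue_empty_lockless(&sk->sk_receive_queue))
                queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);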


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCHv2 0/3] nvme-tcp: improve scalability
  2024-07-08  7:10 [PATCHv2 0/3] nvme-tcp: improve scalability Hannes Reinecke
                   ` (3 preceding siblings ...)
  2024-07-10 11:56 ` [PATCHv2 0/3] nvme-tcp: improve scalability Sagi Grimberg
@ 2024-07-16  6:31 ` Sagi Grimberg
  2024-07-16  7:10   ` Hannes Reinecke
  4 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2024-07-16  6:31 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme



On 08/07/2024 10:10, Hannes Reinecke wrote:
> Hi all,
>
> for workloads with a lot of controllers we run into workqueue contention,
> where the single workqueue is not able to service requests fast enough,
> leading to spurious I/O errors and connect resets during high load.
> This patchset improves the situation by improve the fairness between
> rx and tx scheduling, introducing per-controller workqueues,
> and distribute the load accoring to the blk-mq cpu mapping.
> With this we reduce the spurious I/O errors and improve the overall
> performance for highly contended workloads.
>
> All performance number are derived from the 'tiobench-example.fio'
> sample from the fio sources, running on a 96 core machine with one
> subsystem and two paths, each path exposing 32 queues.
> Backend is nvmet using an Intel DC P3700 NVMe SSD.
>
> Changes to the initial submission:
> - Make the changes independent from the 'wq_unbound' parameter
> - Drop changes to the workqueue
> - Add patch to improve rx/tx fairness

Hey Hannes, were you able to make progress here?
Thought I'd ask...


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCHv2 0/3] nvme-tcp: improve scalability
  2024-07-16  6:31 ` Sagi Grimberg
@ 2024-07-16  7:10   ` Hannes Reinecke
  0 siblings, 0 replies; 22+ messages in thread
From: Hannes Reinecke @ 2024-07-16  7:10 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 7/16/24 08:31, Sagi Grimberg wrote:
> 
> 
> On 08/07/2024 10:10, Hannes Reinecke wrote:
>> Hi all,
>>
>> for workloads with a lot of controllers we run into workqueue contention,
>> where the single workqueue is not able to service requests fast enough,
>> leading to spurious I/O errors and connect resets during high load.
>> This patchset improves the situation by improve the fairness between
>> rx and tx scheduling, introducing per-controller workqueues,
>> and distribute the load accoring to the blk-mq cpu mapping.
>> With this we reduce the spurious I/O errors and improve the overall
>> performance for highly contended workloads.
>>
>> All performance number are derived from the 'tiobench-example.fio'
>> sample from the fio sources, running on a 96 core machine with one
>> subsystem and two paths, each path exposing 32 queues.
>> Backend is nvmet using an Intel DC P3700 NVMe SSD.
>>
>> Changes to the initial submission:
>> - Make the changes independent from the 'wq_unbound' parameter
>> - Drop changes to the workqueue
>> - Add patch to improve rx/tx fairness
> 
> Hey Hannes, were you able to make progress here?
> Thought I'd ask...

Oh, well. Things are going back and forth, and occasionally sideways.
But I have something resembling a patchset; most of it is
actually debugging stuff that allows analysing latencies.
Currently trying to get some performance numbers, but should be ready
shortly.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2024-07-16  7:10 UTC | newest]

Thread overview: 22+ messages
2024-07-08  7:10 [PATCHv2 0/3] nvme-tcp: improve scalability Hannes Reinecke
2024-07-08  7:10 ` [PATCH 1/3] nvme-tcp: improve rx/tx fairness Hannes Reinecke
2024-07-08 11:57   ` Sagi Grimberg
2024-07-08 13:21     ` Hannes Reinecke
2024-07-08 14:25       ` Sagi Grimberg
2024-07-08 15:50         ` Hannes Reinecke
2024-07-08 19:31           ` Sagi Grimberg
2024-07-09  6:51             ` Hannes Reinecke
2024-07-09  7:06               ` Sagi Grimberg
2024-07-08  7:10 ` [PATCH 2/3] nvme-tcp: align I/O cpu with blk-mq mapping Hannes Reinecke
2024-07-08 12:08   ` Sagi Grimberg
2024-07-08 12:43     ` Hannes Reinecke
2024-07-08 14:38       ` Sagi Grimberg
2024-07-08  7:10 ` [PATCH 3/3] nvme-tcp: per-controller I/O workqueues Hannes Reinecke
2024-07-08 12:12   ` Sagi Grimberg
2024-07-08 12:48     ` Hannes Reinecke
2024-07-08 14:41       ` Sagi Grimberg
2024-07-10 11:56 ` [PATCHv2 0/3] nvme-tcp: improve scalability Sagi Grimberg
2024-07-10 14:06   ` Hannes Reinecke
2024-07-10 14:45     ` Sagi Grimberg
2024-07-16  6:31 ` Sagi Grimberg
2024-07-16  7:10   ` Hannes Reinecke
