* [PATCH 0/7] nvme-tcp scalability improvements
@ 2024-06-26 12:13 Hannes Reinecke
2024-06-26 12:13 ` [PATCH 1/7] nvme-tcp: align I/O cpu with blk-mq mapping Hannes Reinecke
` (7 more replies)
0 siblings, 8 replies; 15+ messages in thread
From: Hannes Reinecke @ 2024-06-26 12:13 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
From: Hannes Reinecke <hare@suse.de>
Hi all,
we have had reports from partners that nvme-tcp suffers from scalability
problems with the number of controllers; they even managed to run into
a request timeout by just connecting enough controllers to the host.
Looking into it I have found several issues with the nvme-tcp implementation:
- the 'io_cpu' assignment is static, leading to the same calculation
for each controller. Thus each queue with the same number is
assigned the same CPU, leading to CPU starvation.
- The blk-mq cpu mapping is not taken into account when calculating
the 'io_cpu' number, leading to excessive thread bouncing during I/O
- The socket state is not evaluated, so we're piling more and more
requests onto the socket even when it's already full.
This patchset addresses these issues, leading to a better I/O
distribution for several controllers.
Performance for read increases from:
4k seq read: bw=368MiB/s (386MB/s), 11.5MiB/s-12.7MiB/s
(12.1MB/s-13.3MB/s), io=16.0GiB (17.2GB), run=40444-44468msec
4k rand read: bw=360MiB/s (378MB/s), 11.3MiB/s-12.1MiB/s
(11.8MB/s-12.7MB/s), io=16.0GiB (17.2GB), run=42310-45502msec
to:
4k seq read: bw=520MiB/s (545MB/s), 16.3MiB/s-21.1MiB/s
(17.0MB/s-22.2MB/s), io=16.0GiB (17.2GB), run=24208-31505msec
4k rand read: bw=533MiB/s (559MB/s), 16.7MiB/s-22.2MiB/s
(17.5MB/s-23.3MB/s), io=16.0GiB (17.2GB), run=23014-30731msec
However, peak write performance degrades from:
4k seq write: bw=657MiB/s (689MB/s), 20.5MiB/s-20.7MiB/s
(21.5MB/s-21.8MB/s), io=16.0GiB (17.2GB), run=24678-24950msec
4k rand write: bw=687MiB/s (720MB/s), 21.5MiB/s-21.7MiB/s
(22.5MB/s-22.8MB/s), io=16.0GiB (17.2GB), run=23559-23859msec
to:
4k seq write: bw=535MiB/s (561MB/s), 16.7MiB/s-19.9MiB/s
(17.5MB/s-20.9MB/s), io=16.0GiB (17.2GB), run=25707-30624msec
4k rand write: bw=560MiB/s (587MB/s), 17.5MiB/s-22.3MiB/s
(18.4MB/s-23.4MB/s), io=16.0GiB (17.2GB), run=22977-29248msec
which is not surprising, seeing that the original implementation would
be pushing as many writes as possible to the workqueue, with complete
disregard of the utilisation of the queue (which was precisely the
issue we're addressing here).
Hannes Reinecke (5):
nvme-tcp: align I/O cpu with blk-mq mapping
nvme-tcp: distribute queue affinity
nvmet-tcp: add wq_unbound module parameter
nvme-tcp: SOCK_NOSPACE handling
nvme-tcp: make softirq_rx the default
Sagi Grimberg (2):
net: micro-optimize skb_datagram_iter
nvme-tcp: receive data in softirq
drivers/nvme/host/tcp.c | 126 ++++++++++++++++++++++++++++----------
drivers/nvme/target/tcp.c | 34 +++++++---
net/core/datagram.c | 4 +-
3 files changed, 122 insertions(+), 42 deletions(-)
--
2.35.3
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/7] nvme-tcp: align I/O cpu with blk-mq mapping
2024-06-26 12:13 [PATCH 0/7] nvme-tcp scalability improvements Hannes Reinecke
@ 2024-06-26 12:13 ` Hannes Reinecke
2024-06-26 12:13 ` [PATCH 2/7] nvme-tcp: distribute queue affinity Hannes Reinecke
` (6 subsequent siblings)
7 siblings, 0 replies; 15+ messages in thread
From: Hannes Reinecke @ 2024-06-26 12:13 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
Select the first CPU from a given blk-mq hctx mapping
on which to queue the tcp workqueue item.
This avoids thread bouncing during I/O on machines with
an uneven cpu topology.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 43 +++++++++++++++++++++++++++++------------
1 file changed, 31 insertions(+), 12 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 3be67c98c906..78fbce13a9e6 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1550,20 +1550,38 @@ static bool nvme_tcp_poll_queue(struct nvme_tcp_queue *queue)
static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
{
struct nvme_tcp_ctrl *ctrl = queue->ctrl;
- int qid = nvme_tcp_queue_id(queue);
+ struct blk_mq_tag_set *set = &ctrl->tag_set;
+ int qid = nvme_tcp_queue_id(queue) - 1;
+ unsigned int *mq_map;
int n = 0;
- if (nvme_tcp_default_queue(queue))
- n = qid - 1;
- else if (nvme_tcp_read_queue(queue))
- n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] - 1;
- else if (nvme_tcp_poll_queue(queue))
+ if (nvme_tcp_default_queue(queue)) {
+ mq_map = set->map[HCTX_TYPE_DEFAULT].mq_map;
+ n = qid;
+ } else if (nvme_tcp_read_queue(queue)) {
+ mq_map = set->map[HCTX_TYPE_READ].mq_map;
+ n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT];
+ } else if (nvme_tcp_poll_queue(queue)) {
+ mq_map = set->map[HCTX_TYPE_POLL].mq_map;
n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] -
- ctrl->io_queues[HCTX_TYPE_READ] - 1;
+ ctrl->io_queues[HCTX_TYPE_READ];
+ }
if (wq_unbound)
queue->io_cpu = WORK_CPU_UNBOUND;
- else
- queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
+ else {
+ int i;
+
+ if (WARN_ON(!mq_map))
+ return;
+ for_each_cpu(i, cpu_online_mask) {
+ if (mq_map[i] == qid) {
+ queue->io_cpu = i;
+ break;
+ }
+ }
+ dev_dbg(ctrl->ctrl.device, "queue %d: using cpu %d\n",
+ qid, queue->io_cpu);
+ }
}
static void nvme_tcp_tls_done(void *data, int status, key_serial_t pskid)
@@ -1704,7 +1722,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
queue->sock->sk->sk_allocation = GFP_ATOMIC;
queue->sock->sk->sk_use_task_frag = false;
- nvme_tcp_set_queue_io_cpu(queue);
+ queue->io_cpu = WORK_CPU_UNBOUND;
queue->request = NULL;
queue->data_remaining = 0;
queue->ddgst_remaining = 0;
@@ -1858,9 +1876,10 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
nvme_tcp_init_recv_ctx(queue);
nvme_tcp_setup_sock_ops(queue);
- if (idx)
+ if (idx) {
+ nvme_tcp_set_queue_io_cpu(queue);
ret = nvmf_connect_io_queue(nctrl, idx);
- else
+ } else
ret = nvmf_connect_admin_queue(nctrl);
if (!ret) {
--
2.35.3
* [PATCH 2/7] nvme-tcp: distribute queue affinity
2024-06-26 12:13 [PATCH 0/7] nvme-tcp scalability improvements Hannes Reinecke
2024-06-26 12:13 ` [PATCH 1/7] nvme-tcp: align I/O cpu with blk-mq mapping Hannes Reinecke
@ 2024-06-26 12:13 ` Hannes Reinecke
2024-06-26 13:38 ` Sagi Grimberg
2024-06-26 12:13 ` [PATCH 3/7] net: micro-optimize skb_datagram_iter Hannes Reinecke
` (5 subsequent siblings)
7 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2024-06-26 12:13 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
Introduce a per-cpu counter to distribute the number of queues
over all cpus in a blk-mq hwctx cpu set. The current algorithm
leads to identical cpu affinity maps for all controllers, piling
work on the same cpu for all queues with the same qid.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 31 +++++++++++++++++++++++++------
1 file changed, 25 insertions(+), 6 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 78fbce13a9e6..faab55ff86fe 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -26,6 +26,8 @@
struct nvme_tcp_queue;
+static atomic_t nvme_tcp_cpu_queues[NR_CPUS];
+
/* Define the socket priority to use for connections were it is desirable
* that the NIC consider performing optimized packet processing or filtering.
* A non-zero value being sufficient to indicate general consideration of any
@@ -1569,16 +1571,26 @@ static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
if (wq_unbound)
queue->io_cpu = WORK_CPU_UNBOUND;
else {
- int i;
+ int i, min_queues = WORK_CPU_UNBOUND, io_cpu = WORK_CPU_UNBOUND;
if (WARN_ON(!mq_map))
return;
- for_each_cpu(i, cpu_online_mask) {
- if (mq_map[i] == qid) {
- queue->io_cpu = i;
- break;
+ for_each_online_cpu(i) {
+ int num_queues;
+
+ if (mq_map[i] != qid)
+ continue;
+
+ num_queues = atomic_read(&nvme_tcp_cpu_queues[i]);
+ if (num_queues < min_queues) {
+ min_queues = num_queues;
+ io_cpu = i;
}
}
+ if (io_cpu != WORK_CPU_UNBOUND) {
+ queue->io_cpu = io_cpu;
+ atomic_inc(&nvme_tcp_cpu_queues[io_cpu]);
+ }
dev_dbg(ctrl->ctrl.device, "queue %d: using cpu %d\n",
qid, queue->io_cpu);
}
@@ -1834,6 +1846,10 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
kernel_sock_shutdown(queue->sock, SHUT_RDWR);
nvme_tcp_restore_sock_ops(queue);
cancel_work_sync(&queue->io_work);
+ if (queue->io_cpu != WORK_CPU_UNBOUND) {
+ atomic_dec(&nvme_tcp_cpu_queues[queue->io_cpu]);
+ queue->io_cpu = WORK_CPU_UNBOUND;
+ }
}
static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
@@ -2845,7 +2861,7 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
static int __init nvme_tcp_init_module(void)
{
- unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_SYSFS;
+ unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_SYSFS, i;
BUILD_BUG_ON(sizeof(struct nvme_tcp_hdr) != 8);
BUILD_BUG_ON(sizeof(struct nvme_tcp_cmd_pdu) != 72);
@@ -2863,6 +2879,9 @@ static int __init nvme_tcp_init_module(void)
if (!nvme_tcp_wq)
return -ENOMEM;
+ for_each_possible_cpu(i)
+ atomic_set(&nvme_tcp_cpu_queues[i], 0);
+
nvmf_register_transport(&nvme_tcp_transport);
return 0;
}
--
2.35.3
* [PATCH 3/7] net: micro-optimize skb_datagram_iter
2024-06-26 12:13 [PATCH 0/7] nvme-tcp scalability improvements Hannes Reinecke
2024-06-26 12:13 ` [PATCH 1/7] nvme-tcp: align I/O cpu with blk-mq mapping Hannes Reinecke
2024-06-26 12:13 ` [PATCH 2/7] nvme-tcp: distribute queue affinity Hannes Reinecke
@ 2024-06-26 12:13 ` Hannes Reinecke
2024-06-26 13:38 ` Sagi Grimberg
2024-06-26 12:13 ` [PATCH 4/7] nvme-tcp: receive data in softirq Hannes Reinecke
` (4 subsequent siblings)
7 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2024-06-26 12:13 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
From: Sagi Grimberg <sagi@grimberg.me>
We only use the mapping in a single context in a short and contained scope,
so kmap_local_page is sufficient and cheaper. This will also allow
skb_datagram_iter to be called from softirq context.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
net/core/datagram.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/core/datagram.c b/net/core/datagram.c
index e614cfd8e14a..95f242591fd2 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -417,14 +417,14 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset,
end = start + skb_frag_size(frag);
if ((copy = end - offset) > 0) {
struct page *page = skb_frag_page(frag);
- u8 *vaddr = kmap(page);
+ u8 *vaddr = kmap_local_page(page);
if (copy > len)
copy = len;
n = INDIRECT_CALL_1(cb, simple_copy_to_iter,
vaddr + skb_frag_off(frag) + offset - start,
copy, data, to);
- kunmap(page);
+ kunmap_local(vaddr);
offset += n;
if (n != copy)
goto short_copy;
--
2.35.3
* [PATCH 4/7] nvme-tcp: receive data in softirq
2024-06-26 12:13 [PATCH 0/7] nvme-tcp scalability improvements Hannes Reinecke
` (2 preceding siblings ...)
2024-06-26 12:13 ` [PATCH 3/7] net: micro-optimize skb_datagram_iter Hannes Reinecke
@ 2024-06-26 12:13 ` Hannes Reinecke
2024-06-26 12:13 ` [PATCH 5/7] nvmet-tcp: add wq_unbound module parameter Hannes Reinecke
` (3 subsequent siblings)
7 siblings, 0 replies; 15+ messages in thread
From: Hannes Reinecke @ 2024-06-26 12:13 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
From: Sagi Grimberg <sagi@grimberg.me>
Network interrupts are already delivered in softirq, so there is
no need to punt things over to a workqueue. This patch adds a
module parameter 'softirq_rx' to process rx packets directly
from the softirq context.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 52 ++++++++++++++++++++++++++++-------------
1 file changed, 36 insertions(+), 16 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index faab55ff86fe..599d4ebf888f 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -46,6 +46,13 @@ static bool wq_unbound;
module_param(wq_unbound, bool, 0644);
MODULE_PARM_DESC(wq_unbound, "Use unbound workqueue for nvme-tcp IO context (default false)");
+/*
+ * RX context runs in softirq
+ */
+static bool softirq_rx;
+module_param(softirq_rx, bool, 0644);
+MODULE_PARM_DESC(softirq_rx, "nvme-tcp RX context in softirq");
+
/*
* TLS handshake timeout
*/
@@ -957,6 +964,20 @@ static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
return consumed;
}
+static int nvme_tcp_try_recv_locked(struct nvme_tcp_queue *queue)
+{
+ struct socket *sock = queue->sock;
+ struct sock *sk = sock->sk;
+ read_descriptor_t rd_desc;
+ int consumed;
+
+ rd_desc.arg.data = queue;
+ rd_desc.count = 1;
+ queue->nr_cqe = 0;
+ consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
+ return consumed;
+}
+
static void nvme_tcp_data_ready(struct sock *sk)
{
struct nvme_tcp_queue *queue;
@@ -966,8 +987,12 @@ static void nvme_tcp_data_ready(struct sock *sk)
read_lock_bh(&sk->sk_callback_lock);
queue = sk->sk_user_data;
if (likely(queue && queue->rd_enabled) &&
- !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
- queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+ !test_bit(NVME_TCP_Q_POLLING, &queue->flags)) {
+ if (softirq_rx)
+ nvme_tcp_try_recv_locked(queue);
+ else
+ queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+ }
read_unlock_bh(&sk->sk_callback_lock);
}
@@ -1253,16 +1278,11 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
{
- struct socket *sock = queue->sock;
- struct sock *sk = sock->sk;
- read_descriptor_t rd_desc;
+ struct sock *sk = queue->sock->sk;
int consumed;
- rd_desc.arg.data = queue;
- rd_desc.count = 1;
lock_sock(sk);
- queue->nr_cqe = 0;
- consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
+ consumed = nvme_tcp_try_recv_locked(queue);
release_sock(sk);
return consumed;
}
@@ -1285,13 +1305,13 @@ static void nvme_tcp_io_work(struct work_struct *w)
else if (unlikely(result < 0))
break;
}
-
- result = nvme_tcp_try_recv(queue);
- if (result > 0)
- pending = true;
- else if (unlikely(result < 0))
- return;
-
+ if (!softirq_rx) {
+ result = nvme_tcp_try_recv(queue);
+ if (result > 0)
+ pending = true;
+ else if (unlikely(result < 0))
+ return;
+ }
if (!pending || !queue->rd_enabled)
return;
--
2.35.3
* [PATCH 5/7] nvmet-tcp: add wq_unbound module parameter
2024-06-26 12:13 [PATCH 0/7] nvme-tcp scalability improvements Hannes Reinecke
` (3 preceding siblings ...)
2024-06-26 12:13 ` [PATCH 4/7] nvme-tcp: receive data in softirq Hannes Reinecke
@ 2024-06-26 12:13 ` Hannes Reinecke
2024-06-26 13:44 ` Sagi Grimberg
2024-06-26 12:13 ` [PATCH 6/7] nvme-tcp: SOCK_NOSPACE handling Hannes Reinecke
` (2 subsequent siblings)
7 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2024-06-26 12:13 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
For high loads the default scheme of queueing work on the receiving
cpu might lead to cpu starvation and 'CPU hogged' messages.
This patch provides a 'wq_unbound' module parameter to let the
workqueue mechanism make the scheduling decisions.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/target/tcp.c | 34 ++++++++++++++++++++++++++--------
1 file changed, 26 insertions(+), 8 deletions(-)
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index d305d7162dde..572e4f474c68 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -73,6 +73,14 @@ device_param_cb(idle_poll_period_usecs, &set_param_ops,
MODULE_PARM_DESC(idle_poll_period_usecs,
"nvmet tcp io_work poll till idle time period in usecs: Default 0");
+/*
+ * Use the unbound workqueue for nvme_tcp_wq, then we can set the cpu affinity
+ * from sysfs.
+ */
+static bool wq_unbound;
+module_param(wq_unbound, bool, 0644);
+MODULE_PARM_DESC(wq_unbound, "Use unbound workqueue for nvme-tcp IO context (default false)");
+
#ifdef CONFIG_NVME_TARGET_TCP_TLS
/*
* TLS handshake timeout
@@ -566,6 +574,15 @@ static struct nvmet_tcp_cmd *nvmet_tcp_fetch_cmd(struct nvmet_tcp_queue *queue)
return queue->snd_cmd;
}
+static void nvmet_tcp_queue_work(struct nvmet_tcp_queue *queue)
+{
+ if (wq_unbound)
+ queue_work(nvmet_tcp_wq, &queue->io_work);
+ else
+ queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
+ &queue->io_work);
+}
+
static void nvmet_tcp_queue_response(struct nvmet_req *req)
{
struct nvmet_tcp_cmd *cmd =
@@ -590,7 +607,7 @@ static void nvmet_tcp_queue_response(struct nvmet_req *req)
}
llist_add(&cmd->lentry, &queue->resp_list);
- queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &cmd->queue->io_work);
+ nvmet_tcp_queue_work(queue);
}
static void nvmet_tcp_execute_request(struct nvmet_tcp_cmd *cmd)
@@ -1452,7 +1469,7 @@ static void nvmet_tcp_io_work(struct work_struct *w)
* ops activity was recorded during the do-while loop above.
*/
if (nvmet_tcp_check_queue_deadline(queue, ops) || pending)
- queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
+ nvmet_tcp_queue_work(queue);
}
static int nvmet_tcp_alloc_cmd(struct nvmet_tcp_queue *queue,
@@ -1628,8 +1645,7 @@ static void nvmet_tcp_data_ready(struct sock *sk)
if (queue->data_ready)
queue->data_ready(sk);
if (queue->state != NVMET_TCP_Q_TLS_HANDSHAKE)
- queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
- &queue->io_work);
+ nvmet_tcp_queue_work(queue);
}
read_unlock_bh(&sk->sk_callback_lock);
}
@@ -1650,7 +1666,7 @@ static void nvmet_tcp_write_space(struct sock *sk)
if (sk_stream_is_writeable(sk)) {
clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
- queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
+ nvmet_tcp_queue_work(queue);
}
out:
read_unlock_bh(&sk->sk_callback_lock);
@@ -1731,7 +1747,7 @@ static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
sock->sk->sk_write_space = nvmet_tcp_write_space;
if (idle_poll_period_usecs)
nvmet_tcp_arm_queue_deadline(queue);
- queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
+ nvmet_tcp_queue_work(queue);
}
write_unlock_bh(&sock->sk->sk_callback_lock);
@@ -2182,9 +2198,11 @@ static const struct nvmet_fabrics_ops nvmet_tcp_ops = {
static int __init nvmet_tcp_init(void)
{
int ret;
+ unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_SYSFS;
- nvmet_tcp_wq = alloc_workqueue("nvmet_tcp_wq",
- WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
+ if (wq_unbound)
+ wq_flags |= WQ_UNBOUND;
+ nvmet_tcp_wq = alloc_workqueue("nvmet_tcp_wq", wq_flags, 0);
if (!nvmet_tcp_wq)
return -ENOMEM;
--
2.35.3
* [PATCH 6/7] nvme-tcp: SOCK_NOSPACE handling
2024-06-26 12:13 [PATCH 0/7] nvme-tcp scalability improvements Hannes Reinecke
` (4 preceding siblings ...)
2024-06-26 12:13 ` [PATCH 5/7] nvmet-tcp: add wq_unbound module parameter Hannes Reinecke
@ 2024-06-26 12:13 ` Hannes Reinecke
2024-06-26 13:45 ` Sagi Grimberg
2024-06-26 12:13 ` [PATCH 7/7] nvme-tcp: make softirq_rx the default Hannes Reinecke
2024-06-26 13:37 ` [PATCH 0/7] nvme-tcp scalability improvements Sagi Grimberg
7 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2024-06-26 12:13 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
When there is no write space on the socket we shouldn't try to
push more data onto it; it'll stall anyway and lead to higher CPU
utilisation. So check for sock_wspace() before queueing new
requests and let the sock write_space() handler restart the
submission.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 30 ++++++++++++++++++++++++++----
1 file changed, 26 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 599d4ebf888f..d78cca2f05d4 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -147,6 +147,7 @@ enum nvme_tcp_recv_state {
struct nvme_tcp_ctrl;
struct nvme_tcp_queue {
struct socket *sock;
+ struct blk_mq_hw_ctx *hctx;
struct work_struct io_work;
int io_cpu;
@@ -381,6 +382,15 @@ static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
nvme_tcp_queue_has_pending(queue);
}
+static inline void nvme_tcp_queue_work(struct nvme_tcp_queue *queue)
+{
+ set_bit(SOCK_NOSPACE, &queue->sock->flags);
+ if (!sock_wspace(queue->sock->sk))
+ return;
+ clear_bit(SOCK_NOSPACE, &queue->sock->flags);
+ queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+}
+
static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
bool sync, bool last)
{
@@ -402,7 +412,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
}
if (last && nvme_tcp_queue_has_pending(queue))
- queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+ nvme_tcp_queue_work(queue);
}
static void nvme_tcp_process_req_list(struct nvme_tcp_queue *queue)
@@ -550,6 +560,7 @@ static int nvme_tcp_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
struct nvme_tcp_queue *queue = &ctrl->queues[hctx_idx + 1];
hctx->driver_data = queue;
+ queue->hctx = hctx;
return 0;
}
@@ -1004,7 +1015,10 @@ static void nvme_tcp_write_space(struct sock *sk)
queue = sk->sk_user_data;
if (likely(queue && sk_stream_is_writeable(sk))) {
clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
- queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+ if (sock_wspace(sk))
+ queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+ if (queue->hctx)
+ blk_mq_start_hw_queue(queue->hctx);
}
read_unlock_bh(&sk->sk_callback_lock);
}
@@ -1317,7 +1331,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
} while (!time_after(jiffies, deadline)); /* quota is exhausted */
- queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+ nvme_tcp_queue_work(queue);
}
static void nvme_tcp_free_crypto(struct nvme_tcp_queue *queue)
@@ -1863,6 +1877,7 @@ static void nvme_tcp_restore_sock_ops(struct nvme_tcp_queue *queue)
static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
{
+ queue->hctx = NULL;
kernel_sock_shutdown(queue->sock, SHUT_RDWR);
nvme_tcp_restore_sock_ops(queue);
cancel_work_sync(&queue->io_work);
@@ -2614,7 +2629,7 @@ static void nvme_tcp_commit_rqs(struct blk_mq_hw_ctx *hctx)
struct nvme_tcp_queue *queue = hctx->driver_data;
if (!llist_empty(&queue->req_list))
- queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+ nvme_tcp_queue_work(queue);
}
static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
@@ -2630,6 +2645,13 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
if (!nvme_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
return nvme_fail_nonready_command(&queue->ctrl->ctrl, rq);
+ set_bit(SOCK_NOSPACE, &queue->sock->flags);
+ if (!sock_wspace(queue->sock->sk)) {
+ blk_mq_stop_hw_queue(hctx);
+ return BLK_STS_DEV_RESOURCE;
+ }
+ clear_bit(SOCK_NOSPACE, &queue->sock->flags);
+
ret = nvme_tcp_setup_cmd_pdu(ns, rq);
if (unlikely(ret))
return ret;
--
2.35.3
* [PATCH 7/7] nvme-tcp: make softirq_rx the default
2024-06-26 12:13 [PATCH 0/7] nvme-tcp scalability improvements Hannes Reinecke
` (5 preceding siblings ...)
2024-06-26 12:13 ` [PATCH 6/7] nvme-tcp: SOCK_NOSPACE handling Hannes Reinecke
@ 2024-06-26 12:13 ` Hannes Reinecke
2024-06-26 13:46 ` Sagi Grimberg
2024-06-26 13:37 ` [PATCH 0/7] nvme-tcp scalability improvements Sagi Grimberg
7 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2024-06-26 12:13 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
Remove the 'softirq_rx' module parameter and make receiving in
softirq context the default.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 22 ++--------------------
1 file changed, 2 insertions(+), 20 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index d78cca2f05d4..79fda5ab49ce 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -46,13 +46,6 @@ static bool wq_unbound;
module_param(wq_unbound, bool, 0644);
MODULE_PARM_DESC(wq_unbound, "Use unbound workqueue for nvme-tcp IO context (default false)");
-/*
- * RX context runs in softirq
- */
-static bool softirq_rx;
-module_param(softirq_rx, bool, 0644);
-MODULE_PARM_DESC(softirq_rx, "nvme-tcp RX context in softirq");
-
/*
* TLS handshake timeout
*/
@@ -998,12 +991,8 @@ static void nvme_tcp_data_ready(struct sock *sk)
read_lock_bh(&sk->sk_callback_lock);
queue = sk->sk_user_data;
if (likely(queue && queue->rd_enabled) &&
- !test_bit(NVME_TCP_Q_POLLING, &queue->flags)) {
- if (softirq_rx)
- nvme_tcp_try_recv_locked(queue);
- else
- queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
- }
+ !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
+ nvme_tcp_try_recv_locked(queue);
read_unlock_bh(&sk->sk_callback_lock);
}
@@ -1319,13 +1308,6 @@ static void nvme_tcp_io_work(struct work_struct *w)
else if (unlikely(result < 0))
break;
}
- if (!softirq_rx) {
- result = nvme_tcp_try_recv(queue);
- if (result > 0)
- pending = true;
- else if (unlikely(result < 0))
- return;
- }
if (!pending || !queue->rd_enabled)
return;
--
2.35.3
* Re: [PATCH 0/7] nvme-tcp scalability improvements
2024-06-26 12:13 [PATCH 0/7] nvme-tcp scalability improvements Hannes Reinecke
` (6 preceding siblings ...)
2024-06-26 12:13 ` [PATCH 7/7] nvme-tcp: make softirq_rx the default Hannes Reinecke
@ 2024-06-26 13:37 ` Sagi Grimberg
2024-06-26 14:27 ` Hannes Reinecke
7 siblings, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2024-06-26 13:37 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme, Hannes Reinecke
On 26/06/2024 15:13, Hannes Reinecke wrote:
> From: Hannes Reinecke <hare@suse.de>
>
> Hi all,
>
> we have had reports from partners that nvme-tcp suffers from scalability
> problems with the number of controllers; they even managed to run into
> a request timeout by just connecting enough controllers to the host.
>
> Looking into it I have found several issues with the nvme-tcp implementation:
> - the 'io_cpu' assignment is static, leading to the same calculation
> for each controller. Thus each queue with the same number is
> assigned the same CPU, leading to CPU starvation.
> - The blk-mq cpu mapping is not taken into account when calculating
> the 'io_cpu' number, leading to excessive thread bouncing during I/O
> - The socket state is not evaluated, so we're piling more and more
> requests onto the socket even when it's already full.
>
> This patchset addresses these issues, leading to a better I/O
> distribution for several controllers.
Hannes, please quantify what each change gives us. Each
change on its own should present merit; otherwise we should consider
whether it is actually needed.
>
> Performance for read increases from:
> 4k seq read: bw=368MiB/s (386MB/s), 11.5MiB/s-12.7MiB/s
> (12.1MB/s-13.3MB/s), io=16.0GiB (17.2GB), run=40444-44468msec
> 4k rand read: bw=360MiB/s (378MB/s), 11.3MiB/s-12.1MiB/s
> (11.8MB/s-12.7MB/s), io=16.0GiB (17.2GB), run=42310-45502msec
> to:
> 4k seq read: bw=520MiB/s (545MB/s), 16.3MiB/s-21.1MiB/s
> (17.0MB/s-22.2MB/s), io=16.0GiB (17.2GB), run=24208-31505msec
> 4k rand read: bw=533MiB/s (559MB/s), 16.7MiB/s-22.2MiB/s
> (17.5MB/s-23.3MB/s), io=16.0GiB (17.2GB), run=23014-30731msec
>
> However, peak write performance degrades from:
> 4k seq write: bw=657MiB/s (689MB/s), 20.5MiB/s-20.7MiB/s
> (21.5MB/s-21.8MB/s), io=16.0GiB (17.2GB), run=24678-24950msec
> 4k rand write: bw=687MiB/s (720MB/s), 21.5MiB/s-21.7MiB/s
> (22.5MB/s-22.8MB/s), io=16.0GiB (17.2GB), run=23559-23859msec
> to:
> 4k seq write: bw=535MiB/s (561MB/s), 16.7MiB/s-19.9MiB/s
> (17.5MB/s-20.9MB/s), io=16.0GiB (17.2GB), run=25707-30624msec
> 4k rand write: bw=560MiB/s (587MB/s), 17.5MiB/s-22.3MiB/s
> (18.4MB/s-23.4MB/s), io=16.0GiB (17.2GB), run=22977-29248msec
>
> which is not surprising, seeing that the original implementation would
> be pushing as many writes as possible to the workqueue, with complete
> disregard of the utilisation of the queue (which was precisely the
> issue we're addressing here).
Well, I do not expect performance to degrade here. It's a noticeable drop
I'd say.
>
> Hannes Reinecke (5):
> nvme-tcp: align I/O cpu with blk-mq mapping
> nvme-tcp: distribute queue affinity
> nvmet-tcp: add wq_unbound module parameter
> nvme-tcp: SOCK_NOSPACE handling
> nvme-tcp: make softirq_rx the default
>
> Sagi Grimberg (2):
> net: micro-optimize skb_datagram_iter
> nvme-tcp: receive data in softirq
>
> drivers/nvme/host/tcp.c | 126 ++++++++++++++++++++++++++++----------
> drivers/nvme/target/tcp.c | 34 +++++++---
> net/core/datagram.c | 4 +-
> 3 files changed, 122 insertions(+), 42 deletions(-)
>
* Re: [PATCH 2/7] nvme-tcp: distribute queue affinity
2024-06-26 12:13 ` [PATCH 2/7] nvme-tcp: distribute queue affinity Hannes Reinecke
@ 2024-06-26 13:38 ` Sagi Grimberg
0 siblings, 0 replies; 15+ messages in thread
From: Sagi Grimberg @ 2024-06-26 13:38 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 26/06/2024 15:13, Hannes Reinecke wrote:
> Introduce a per-cpu counter to distribute the number of queues
> over all cpus in a blk-mq hwctx cpu set. The current algorithm
> leads to identical cpu affinity maps for all controllers, piling
> work on the same cpu for all queues with the same qid.
I think you want to squash this into the former patch. No need to have
this separation.
* Re: [PATCH 3/7] net: micro-optimize skb_datagram_iter
2024-06-26 12:13 ` [PATCH 3/7] net: micro-optimize skb_datagram_iter Hannes Reinecke
@ 2024-06-26 13:38 ` Sagi Grimberg
0 siblings, 0 replies; 15+ messages in thread
From: Sagi Grimberg @ 2024-06-26 13:38 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
This should be removed, an alternative patch was sent to netdev.
* Re: [PATCH 5/7] nvmet-tcp: add wq_unbound module parameter
2024-06-26 12:13 ` [PATCH 5/7] nvmet-tcp: add wq_unbound module parameter Hannes Reinecke
@ 2024-06-26 13:44 ` Sagi Grimberg
0 siblings, 0 replies; 15+ messages in thread
From: Sagi Grimberg @ 2024-06-26 13:44 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
Please leave this outside of the series.
On 26/06/2024 15:13, Hannes Reinecke wrote:
> For high loads the default scheme of queueing work on the receiving
> cpu might lead to cpu starvation and 'CPU hogged' messages.
In which workloads have you seen this print?
Can you place the print in the commit msg?
> This patch provides a 'wq_unbound' module parameter to let the
> workqueue mechanism make the scheduling decisions.
It was intentional that we maintain locality of the nic to the backend
drive.
I'm not saying that we may not want to do this, but we need a justification
as to why. Was this reported by users? or observed by you?
What about rdma btw? That also runs in a workqueue context. Plus with
rxe/siw it is also writing to normal udp/tcp sockets.
* Re: [PATCH 6/7] nvme-tcp: SOCK_NOSPACE handling
2024-06-26 12:13 ` [PATCH 6/7] nvme-tcp: SOCK_NOSPACE handling Hannes Reinecke
@ 2024-06-26 13:45 ` Sagi Grimberg
0 siblings, 0 replies; 15+ messages in thread
From: Sagi Grimberg @ 2024-06-26 13:45 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 26/06/2024 15:13, Hannes Reinecke wrote:
> When there is no write space on the socket we shouldn't try to
> push more data onto it; it will stall anyway and lead to higher CPU
> utilisation. So check sock_wspace() before queueing new
> requests and let the sock write_space() handler restart the
> submission.
What is the gain you get from this patch?
All I see is bit atomics added to the data path.
* Re: [PATCH 7/7] nvme-tcp: make softirq_rx the default
2024-06-26 12:13 ` [PATCH 7/7] nvme-tcp: make softirq_rx the default Hannes Reinecke
@ 2024-06-26 13:46 ` Sagi Grimberg
0 siblings, 0 replies; 15+ messages in thread
From: Sagi Grimberg @ 2024-06-26 13:46 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 26/06/2024 15:13, Hannes Reinecke wrote:
> Remove the 'softirq_rx' parameter and make it the default.
What improvement have you seen from it?
Was there a workload that suffered from this?
Should we even keep it as an option or just make it
the only behavior?
* Re: [PATCH 0/7] nvme-tcp scalability improvements
2024-06-26 13:37 ` [PATCH 0/7] nvme-tcp scalability improvements Sagi Grimberg
@ 2024-06-26 14:27 ` Hannes Reinecke
0 siblings, 0 replies; 15+ messages in thread
From: Hannes Reinecke @ 2024-06-26 14:27 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 6/26/24 15:37, Sagi Grimberg wrote:
>
>
> On 26/06/2024 15:13, Hannes Reinecke wrote:
>> From: Hannes Reinecke <hare@suse.de>
>>
>> Hi all,
>>
>> we have had reports from partners that nvme-tcp suffers from scalability
>> problems with the number of controllers; they even managed to run into
>> a request timeout by just connecting enough controllers to the host.
>>
>> Looking into it I have found several issues with the nvme-tcp
>> implementation:
>> - the 'io_cpu' assignment is static, leading to the same calculation
>> for each controller. Thus each queue with the same number is
>> assigned the same CPU, leading to CPU starvation.
>> - The blk-mq cpu mapping is not taken into account when calculating
>> the 'io_cpu' number, leading to excessive thread bouncing during I/O.
>> - The socket state is not evaluated, so we're piling more and more
>> requests onto the socket even when it's already full.
>>
>> This patchset addresses these issues, leading to a better I/O
>> distribution for several controllers.
>
> Hannes, please quantify what every change gives us. Each
> change on its own should present merit, otherwise we should consider
> whether it is actually needed.
>
Oh, I do agree with you. It's just hard to quantify, as with the current
implementation we vastly favour writes over reads. If we try to balance
things out (as these patches do) read performance goes up significantly,
and write performance goes down. So one can't really say it's a
performance regression in general; we're just shifting things around.
I'll try to get some more meaningful data, but this is a hard task as
I first need to set up a scale-out target such that I can demonstrate the
issues more clearly.
In the end I've sent out these patches primarily to show the direction
I'm investigating, in case other people want to play around with
them, too.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich