* [PATCH 1/8] nvme-tcp: switch TX deadline to microseconds and make it configurable
2024-07-16 7:36 [PATCHv3 0/8] nvme-tcp: improve scalability Hannes Reinecke
@ 2024-07-16 7:36 ` Hannes Reinecke
2024-07-17 21:03 ` Sagi Grimberg
2024-07-16 7:36 ` [PATCH 2/8] nvme-tcp: io_work stall debugging Hannes Reinecke
` (7 subsequent siblings)
8 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2024-07-16 7:36 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
The current TX deadline for the workqueue is pretty much arbitrary,
and jiffies is too coarse to provide microsecond granularity.
So make the deadline configurable via a module parameter and switch
from jiffies to ktime.
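Since the parameter is registered with mode 0644 it can also be tuned at
runtime through /sys/module/nvme_tcp/parameters/deadline; the value is in
microseconds, and the default of 1000 matches the previously intended 1 ms.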
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 0873b3949355..3cf9a9abb0e0 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -44,6 +44,10 @@ static bool wq_unbound;
module_param(wq_unbound, bool, 0644);
MODULE_PARM_DESC(wq_unbound, "Use unbound workqueue for nvme-tcp IO context (default false)");
+static int deadline = 1000;
+module_param(deadline, int, 0644);
+MODULE_PARM_DESC(deadline, "RX/TX deadline in microseconds (default: 1000)");
+
/*
* TLS handshake timeout
*/
@@ -1278,7 +1282,8 @@ static void nvme_tcp_io_work(struct work_struct *w)
{
struct nvme_tcp_queue *queue =
container_of(w, struct nvme_tcp_queue, io_work);
- unsigned long deadline = jiffies + msecs_to_jiffies(1);
+ u64 start = ktime_to_us(ktime_get());
+ u64 tx_deadline = start + deadline;
do {
bool pending = false;
@@ -1302,7 +1307,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
if (!pending || !queue->rd_enabled)
return;
- } while (!time_after(jiffies, deadline)); /* quota is exhausted */
+ } while (ktime_to_us(ktime_get()) < tx_deadline); /* quota is exhausted */
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}
--
2.35.3
* Re: [PATCH 1/8] nvme-tcp: switch TX deadline to microseconds and make it configurable
2024-07-16 7:36 ` [PATCH 1/8] nvme-tcp: switch TX deadline to microseconds and make it configurable Hannes Reinecke
@ 2024-07-17 21:03 ` Sagi Grimberg
2024-07-18 6:30 ` Hannes Reinecke
0 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2024-07-17 21:03 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 16/07/2024 10:36, Hannes Reinecke wrote:
> The current TX deadline for the workqueue is pretty much arbitrary,
> and jiffies is too coarse to provide microsecond granularity.
> So make the deadline configurable via a module parameter and switch
> from jiffies to ktime.
I thought we agreed on byte-based limits. Did you look into how that
behaves?
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
> drivers/nvme/host/tcp.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 0873b3949355..3cf9a9abb0e0 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -44,6 +44,10 @@ static bool wq_unbound;
> module_param(wq_unbound, bool, 0644);
> MODULE_PARM_DESC(wq_unbound, "Use unbound workqueue for nvme-tcp IO context (default false)");
>
> +static int deadline = 1000;
> +module_param(deadline, int, 0644);
> +MODULE_PARM_DESC(deadline, "RX/TX deadline in microseconds (default: 1000)");
> +
> /*
> * TLS handshake timeout
> */
> @@ -1278,7 +1282,8 @@ static void nvme_tcp_io_work(struct work_struct *w)
> {
> struct nvme_tcp_queue *queue =
> container_of(w, struct nvme_tcp_queue, io_work);
> - unsigned long deadline = jiffies + msecs_to_jiffies(1);
> + u64 start = ktime_to_us(ktime_get());
> + u64 tx_deadline = start + deadline;
>
> do {
> bool pending = false;
> @@ -1302,7 +1307,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
> if (!pending || !queue->rd_enabled)
> return;
>
> - } while (!time_after(jiffies, deadline)); /* quota is exhausted */
> + } while (ktime_to_us(ktime_get()) < tx_deadline); /* quota is exhausted */
>
> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> }
The rename to tx_deadline does not make sense in the context of this patch.
* Re: [PATCH 1/8] nvme-tcp: switch TX deadline to microseconds and make it configurable
2024-07-17 21:03 ` Sagi Grimberg
@ 2024-07-18 6:30 ` Hannes Reinecke
0 siblings, 0 replies; 24+ messages in thread
From: Hannes Reinecke @ 2024-07-18 6:30 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 7/17/24 23:03, Sagi Grimberg wrote:
>
>
> On 16/07/2024 10:36, Hannes Reinecke wrote:
>> The current TX deadline for the workqueue is pretty much arbitrary,
>> and jiffies is too coarse to provide microsecond granularity.
>> So make the deadline configurable via a module parameter and switch
>> from jiffies to ktime.
>
> I thought we agreed on byte-based limits. Did you look into how that
> behaves?
>
I tried, but that is not the objective of this patch.
The main reason for this patch is that we are trying to measure
sub-jiffy intervals. When running with HZ=100 (as I did) you only
get a granularity of 10 milliseconds per jiffy, and the call
'msecs_to_jiffies(1)' becomes rather questionable.
So while you can display a jiffy value in microseconds, you cannot
_measure_ anything below 10 milliseconds.
Any calculation based on such values will be inaccurate, which is
quite harmful when you try to do an analysis based on those numbers.
Switching to ktime gives better precision and with that more
reliable data for analysing latency.
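To illustrate the rounding (a stand-alone user-space sketch, not driver
code; the helper below only mimics the round-up behaviour of
msecs_to_jiffies()):

#include <stdio.h>

/* rough user-space model of msecs_to_jiffies(): round up to the next tick */
static unsigned long msecs_to_jiffies_approx(unsigned int msecs, unsigned int hz)
{
	return ((unsigned long)msecs * hz + 999) / 1000;
}

int main(void)
{
	const unsigned int hz_values[] = { 100, 250, 1000 };

	for (int i = 0; i < 3; i++) {
		unsigned int hz = hz_values[i];
		unsigned long j = msecs_to_jiffies_approx(1, hz);

		printf("HZ=%u: msecs_to_jiffies(1) = %lu jiffy(ies) = %lu ms\n",
		       hz, j, j * 1000 / hz);
	}
	return 0;
}

With HZ=100 the nominal 1 ms deadline silently becomes 10 ms, which is
exactly the kind of error you do not want when analysing
microsecond-level latency.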
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>> drivers/nvme/host/tcp.c | 9 +++++++--
>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 0873b3949355..3cf9a9abb0e0 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -44,6 +44,10 @@ static bool wq_unbound;
>> module_param(wq_unbound, bool, 0644);
>> MODULE_PARM_DESC(wq_unbound, "Use unbound workqueue for nvme-tcp IO
>> context (default false)");
>> +static int deadline = 1000;
>> +module_param(deadline, int, 0644);
>> +MODULE_PARM_DESC(deadline, "RX/TX deadline in microseconds (default:
>> 1000)");
>> +
>> /*
>> * TLS handshake timeout
>> */
>> @@ -1278,7 +1282,8 @@ static void nvme_tcp_io_work(struct work_struct *w)
>> {
>> struct nvme_tcp_queue *queue =
>> container_of(w, struct nvme_tcp_queue, io_work);
>> - unsigned long deadline = jiffies + msecs_to_jiffies(1);
>> + u64 start = ktime_to_us(ktime_get());
>> + u64 tx_deadline = start + deadline;
>> do {
>> bool pending = false;
>> @@ -1302,7 +1307,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
>> if (!pending || !queue->rd_enabled)
>> return;
>> - } while (!time_after(jiffies, deadline)); /* quota is exhausted */
>> + } while (ktime_to_us(ktime_get()) < tx_deadline); /* quota is
>> exhausted */
>> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>> }
>
> The rename to tx_deadline does not make sense in the context of this patch.
Ok, will be renaming it.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* [PATCH 2/8] nvme-tcp: io_work stall debugging
2024-07-16 7:36 [PATCHv3 0/8] nvme-tcp: improve scalability Hannes Reinecke
2024-07-16 7:36 ` [PATCH 1/8] nvme-tcp: switch TX deadline to microseconds and make it configurable Hannes Reinecke
@ 2024-07-16 7:36 ` Hannes Reinecke
2024-07-17 21:05 ` Sagi Grimberg
2024-07-16 7:36 ` [PATCH 3/8] nvme-tcp: re-init request list entries Hannes Reinecke
` (6 subsequent siblings)
8 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2024-07-16 7:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke,
Hannes Reinecke
From: Hannes Reinecke <hare@suse.de>
Add a debug message when the io workqueue exceeds the latency target
given by 'deadline'.
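Note that the message is emitted via dev_dbg(), so it will only show up
once dynamic debug is enabled for drivers/nvme/host/tcp.c, or when the
file is built with DEBUG defined.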
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 3cf9a9abb0e0..7876bf7d2fac 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1282,13 +1282,14 @@ static void nvme_tcp_io_work(struct work_struct *w)
{
struct nvme_tcp_queue *queue =
container_of(w, struct nvme_tcp_queue, io_work);
- u64 start = ktime_to_us(ktime_get());
+ u64 start = ktime_to_us(ktime_get()), overrun;
u64 tx_deadline = start + deadline;
+ bool pending = false;
do {
- bool pending = false;
int result;
+ pending = false;
if (mutex_trylock(&queue->send_mutex)) {
result = nvme_tcp_try_send(queue);
mutex_unlock(&queue->send_mutex);
@@ -1304,12 +1305,23 @@ static void nvme_tcp_io_work(struct work_struct *w)
else if (unlikely(result < 0))
return;
- if (!pending || !queue->rd_enabled)
+ if (!queue->rd_enabled)
return;
+ if (!pending)
+ goto check;
} while (ktime_to_us(ktime_get()) < tx_deadline); /* quota is exhausted */
- queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+check:
+ overrun = ktime_to_us(ktime_get()) - start;
+ if (overrun > 10 * deadline) {
+ dev_dbg(queue->ctrl->ctrl.device,
+ "queue %d: stall (%llu msecs)%s%s\n",
+ nvme_tcp_queue_id(queue), div_u64(overrun, 1000),
+ list_empty(&queue->send_list) ? " empty" : "", queue->request ? " pending" : "");
+ }
+ if (pending)
+ queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}
static void nvme_tcp_free_crypto(struct nvme_tcp_queue *queue)
--
2.35.3
* Re: [PATCH 2/8] nvme-tcp: io_work stall debugging
2024-07-16 7:36 ` [PATCH 2/8] nvme-tcp: io_work stall debugging Hannes Reinecke
@ 2024-07-17 21:05 ` Sagi Grimberg
0 siblings, 0 replies; 24+ messages in thread
From: Sagi Grimberg @ 2024-07-17 21:05 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme, Hannes Reinecke
On 16/07/2024 10:36, Hannes Reinecke wrote:
> From: Hannes Reinecke <hare@suse.de>
>
> Add a debug message when the io workqueue exceeds the latency target
> given by 'deadline'.
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
> drivers/nvme/host/tcp.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 3cf9a9abb0e0..7876bf7d2fac 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1282,13 +1282,14 @@ static void nvme_tcp_io_work(struct work_struct *w)
> {
> struct nvme_tcp_queue *queue =
> container_of(w, struct nvme_tcp_queue, io_work);
> - u64 start = ktime_to_us(ktime_get());
> + u64 start = ktime_to_us(ktime_get()), overrun;
> u64 tx_deadline = start + deadline;
> + bool pending = false;
>
> do {
> - bool pending = false;
> int result;
>
> + pending = false;
> if (mutex_trylock(&queue->send_mutex)) {
> result = nvme_tcp_try_send(queue);
> mutex_unlock(&queue->send_mutex);
> @@ -1304,12 +1305,23 @@ static void nvme_tcp_io_work(struct work_struct *w)
> else if (unlikely(result < 0))
> return;
>
> - if (!pending || !queue->rd_enabled)
> + if (!queue->rd_enabled)
> return;
> + if (!pending)
> + goto check;
...
>
> } while (ktime_to_us(ktime_get()) < tx_deadline); /* quota is exhausted */
>
> - queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> +check:
> + overrun = ktime_to_us(ktime_get()) - start;
> + if (overrun > 10 * deadline) {
> + dev_dbg(queue->ctrl->ctrl.device,
> + "queue %d: stall (%llu msecs)%s%s\n",
> + nvme_tcp_queue_id(queue), div_u64(overrun, 1000),
> + list_empty(&queue->send_list) ? " empty" : "", queue->request ? " pending" : "");
> + }
> + if (pending)
> + queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> }
>
> static void nvme_tcp_free_crypto(struct nvme_tcp_queue *queue)
* [PATCH 3/8] nvme-tcp: re-init request list entries
2024-07-16 7:36 [PATCHv3 0/8] nvme-tcp: improve scalability Hannes Reinecke
2024-07-16 7:36 ` [PATCH 1/8] nvme-tcp: switch TX deadline to microseconds and make it configurable Hannes Reinecke
2024-07-16 7:36 ` [PATCH 2/8] nvme-tcp: io_work stall debugging Hannes Reinecke
@ 2024-07-16 7:36 ` Hannes Reinecke
2024-07-17 21:23 ` Sagi Grimberg
2024-07-16 7:36 ` [PATCH 4/8] nvme-tcp: improve stall debugging Hannes Reinecke
` (5 subsequent siblings)
8 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2024-07-16 7:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke,
Hannes Reinecke
From: Hannes Reinecke <hare@suse.de>
When removing entries from a list we should re-init the list entries
so that it is possible to check whether an entry is still on the list.
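As a minimal sketch of the pattern this enables (illustrative only, not
part of the patch; the helper name is made up):

/*
 * After list_del_init() the entry points to itself, so list_empty()
 * can be used to test whether the request is still queued; a plain
 * list_del() leaves the pointers poisoned and such a check would be
 * invalid.  init_llist_node() plays the same role for the lockless
 * req_list.
 */
static inline bool nvme_tcp_request_queued(struct nvme_tcp_request *req)
{
	return !list_empty(&req->entry);
}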
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 7876bf7d2fac..04d840709d5d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -412,11 +412,13 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
static void nvme_tcp_process_req_list(struct nvme_tcp_queue *queue)
{
struct nvme_tcp_request *req;
- struct llist_node *node;
+ struct llist_node *node, *next;
- for (node = llist_del_all(&queue->req_list); node; node = node->next) {
+ for (node = llist_del_all(&queue->req_list); node; node = next) {
req = llist_entry(node, struct nvme_tcp_request, lentry);
list_add(&req->entry, &queue->send_list);
+ next = node->next;
+ init_llist_node(node);
}
}
@@ -435,7 +437,7 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
return NULL;
}
- list_del(&req->entry);
+ list_del_init(&req->entry);
return req;
}
--
2.35.3
* [PATCH 4/8] nvme-tcp: improve stall debugging
2024-07-16 7:36 [PATCHv3 0/8] nvme-tcp: improve scalability Hannes Reinecke
` (2 preceding siblings ...)
2024-07-16 7:36 ` [PATCH 3/8] nvme-tcp: re-init request list entries Hannes Reinecke
@ 2024-07-16 7:36 ` Hannes Reinecke
2024-07-17 21:11 ` Sagi Grimberg
2024-07-16 7:36 ` [PATCH 5/8] nvme-tcp: debugfs entries for latency statistics Hannes Reinecke
` (4 subsequent siblings)
8 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2024-07-16 7:36 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
Add counters for the number of send and receive calls, and an additional
counter for the number of SQEs processed.
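The counters are reset at the start of each io_work run (and in
nvme_tcp_send_all() / nvme_tcp_try_recv()) and are reported by the stall
message introduced in the previous patch.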
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 04d840709d5d..9caee99955c2 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -156,7 +156,10 @@ struct nvme_tcp_queue {
int pdu_offset;
size_t data_remaining;
size_t ddgst_remaining;
+ unsigned int nr_sqe;
+ unsigned int nr_send;
unsigned int nr_cqe;
+ unsigned int nr_recv;
/* send state */
struct nvme_tcp_request *request;
@@ -368,6 +371,8 @@ static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
int ret;
/* drain the send queue as much as we can... */
+ queue->nr_sqe = 0;
+ queue->nr_send = 0;
do {
ret = nvme_tcp_try_send(queue);
} while (ret > 0);
@@ -944,6 +949,7 @@ static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
if (unlikely(!queue->rd_enabled))
return -EFAULT;
+ queue->nr_recv++;
while (len) {
switch (nvme_tcp_recv_state(queue)) {
case NVME_TCP_RECV_PDU:
@@ -1028,6 +1034,7 @@ static void nvme_tcp_state_change(struct sock *sk)
static inline void nvme_tcp_done_send_req(struct nvme_tcp_queue *queue)
{
queue->request = NULL;
+ queue->nr_sqe++;
}
static void nvme_tcp_fail_request(struct nvme_tcp_request *req)
@@ -1071,6 +1078,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
bvec_set_page(&bvec, page, len, offset);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+ queue->nr_send++;
ret = sock_sendmsg(queue->sock, &msg);
if (ret <= 0)
return ret;
@@ -1127,6 +1135,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+ queue->nr_send++;
ret = sock_sendmsg(queue->sock, &msg);
if (unlikely(ret <= 0))
return ret;
@@ -1165,6 +1174,7 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+ queue->nr_send++;
ret = sock_sendmsg(queue->sock, &msg);
if (unlikely(ret <= 0))
return ret;
@@ -1198,6 +1208,7 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
else
msg.msg_flags |= MSG_EOR;
+ queue->nr_send++;
ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
if (unlikely(ret <= 0))
return ret;
@@ -1275,6 +1286,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
rd_desc.count = 1;
lock_sock(sk);
queue->nr_cqe = 0;
+ queue->nr_recv = 0;
consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
release_sock(sk);
return consumed;
@@ -1288,6 +1300,8 @@ static void nvme_tcp_io_work(struct work_struct *w)
u64 tx_deadline = start + deadline;
bool pending = false;
+ queue->nr_sqe = 0;
+ queue->nr_send = 0;
do {
int result;
@@ -1318,8 +1332,9 @@ static void nvme_tcp_io_work(struct work_struct *w)
overrun = ktime_to_us(ktime_get()) - start;
if (overrun > 10 * deadline) {
dev_dbg(queue->ctrl->ctrl.device,
- "queue %d: stall (%llu msecs)%s%s\n",
+ "queue %d: stall (%llu msecs) send %u sqe %u recv %u cqe %u%s%s\n",
nvme_tcp_queue_id(queue), div_u64(overrun, 1000),
+ queue->nr_send, queue->nr_sqe, queue->nr_recv, queue->nr_cqe,
list_empty(&queue->send_list) ? " empty" : "", queue->request ? " pending" : "");
}
if (pending)
--
2.35.3
* Re: [PATCH 4/8] nvme-tcp: improve stall debugging
2024-07-16 7:36 ` [PATCH 4/8] nvme-tcp: improve stall debugging Hannes Reinecke
@ 2024-07-17 21:11 ` Sagi Grimberg
0 siblings, 0 replies; 24+ messages in thread
From: Sagi Grimberg @ 2024-07-17 21:11 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 16/07/2024 10:36, Hannes Reinecke wrote:
> Add counters for the number of send and receive calls, and an additional
> counter for the number of SQEs processed.
TBH, this looks like something that should be added as tracepoints,
with something like an eBPF program that attaches to them and performs
this type of logic.
Otherwise, these counters need to be grouped in a proper stats struct to
clarify what they are used for.
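For reference, one possible shape of such a tracepoint (a rough sketch of
the generic TRACE_EVENT() pattern, with the usual trace header boilerplate
omitted; the event name and fields are invented here, none of this is in
the patch):

TRACE_EVENT(nvme_tcp_io_stall,
	TP_PROTO(int qid, u64 overrun_us, unsigned int nr_send, unsigned int nr_recv),
	TP_ARGS(qid, overrun_us, nr_send, nr_recv),
	TP_STRUCT__entry(
		__field(int, qid)
		__field(u64, overrun_us)
		__field(unsigned int, nr_send)
		__field(unsigned int, nr_recv)
	),
	TP_fast_assign(
		__entry->qid = qid;
		__entry->overrun_us = overrun_us;
		__entry->nr_send = nr_send;
		__entry->nr_recv = nr_recv;
	),
	TP_printk("queue %d stalled for %llu us (send %u recv %u)",
		  __entry->qid, __entry->overrun_us,
		  __entry->nr_send, __entry->nr_recv)
);

An eBPF or bpftrace program could then attach to the tracepoint and keep
the per-queue aggregates in user space.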
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
> drivers/nvme/host/tcp.c | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 04d840709d5d..9caee99955c2 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -156,7 +156,10 @@ struct nvme_tcp_queue {
> int pdu_offset;
> size_t data_remaining;
> size_t ddgst_remaining;
> + unsigned int nr_sqe;
> + unsigned int nr_send;
> unsigned int nr_cqe;
> + unsigned int nr_recv;
>
> /* send state */
> struct nvme_tcp_request *request;
> @@ -368,6 +371,8 @@ static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
> int ret;
>
> /* drain the send queue as much as we can... */
> + queue->nr_sqe = 0;
> + queue->nr_send = 0;
> do {
> ret = nvme_tcp_try_send(queue);
> } while (ret > 0);
> @@ -944,6 +949,7 @@ static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
> if (unlikely(!queue->rd_enabled))
> return -EFAULT;
>
> + queue->nr_recv++;
> while (len) {
> switch (nvme_tcp_recv_state(queue)) {
> case NVME_TCP_RECV_PDU:
> @@ -1028,6 +1034,7 @@ static void nvme_tcp_state_change(struct sock *sk)
> static inline void nvme_tcp_done_send_req(struct nvme_tcp_queue *queue)
> {
> queue->request = NULL;
> + queue->nr_sqe++;
> }
>
> static void nvme_tcp_fail_request(struct nvme_tcp_request *req)
> @@ -1071,6 +1078,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
>
> bvec_set_page(&bvec, page, len, offset);
> iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
> + queue->nr_send++;
> ret = sock_sendmsg(queue->sock, &msg);
> if (ret <= 0)
> return ret;
> @@ -1127,6 +1135,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>
> bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
> iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
> + queue->nr_send++;
> ret = sock_sendmsg(queue->sock, &msg);
> if (unlikely(ret <= 0))
> return ret;
> @@ -1165,6 +1174,7 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
>
> bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
> iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
> + queue->nr_send++;
> ret = sock_sendmsg(queue->sock, &msg);
> if (unlikely(ret <= 0))
> return ret;
> @@ -1198,6 +1208,7 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
> else
> msg.msg_flags |= MSG_EOR;
>
> + queue->nr_send++;
> ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
> if (unlikely(ret <= 0))
> return ret;
> @@ -1275,6 +1286,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
> rd_desc.count = 1;
> lock_sock(sk);
> queue->nr_cqe = 0;
> + queue->nr_recv = 0;
> consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
> release_sock(sk);
> return consumed;
> @@ -1288,6 +1300,8 @@ static void nvme_tcp_io_work(struct work_struct *w)
> u64 tx_deadline = start + deadline;
> bool pending = false;
>
> + queue->nr_sqe = 0;
> + queue->nr_send = 0;
> do {
> int result;
>
> @@ -1318,8 +1332,9 @@ static void nvme_tcp_io_work(struct work_struct *w)
> overrun = ktime_to_us(ktime_get()) - start;
> if (overrun > 10 * deadline) {
> dev_dbg(queue->ctrl->ctrl.device,
> - "queue %d: stall (%llu msecs)%s%s\n",
> + "queue %d: stall (%llu msecs) send %u sqe %u recv %u cqe %u%s%s\n",
> nvme_tcp_queue_id(queue), div_u64(overrun, 1000),
> + queue->nr_send, queue->nr_sqe, queue->nr_recv, queue->nr_cqe,
> list_empty(&queue->send_list) ? " empty" : "", queue->request ? " pending" : "");
> }
> if (pending)
* [PATCH 5/8] nvme-tcp: debugfs entries for latency statistics
2024-07-16 7:36 [PATCHv3 0/8] nvme-tcp: improve scalability Hannes Reinecke
` (3 preceding siblings ...)
2024-07-16 7:36 ` [PATCH 4/8] nvme-tcp: improve stall debugging Hannes Reinecke
@ 2024-07-16 7:36 ` Hannes Reinecke
2024-07-17 21:14 ` Sagi Grimberg
2024-07-16 7:36 ` [PATCH 6/8] nvme-tcp: reduce callback lock contention Hannes Reinecke
` (3 subsequent siblings)
8 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2024-07-16 7:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke,
Hannes Reinecke
From: Hannes Reinecke <hare@suse.de>
Add debugfs entries to display latency statistics:
'send_lat' displays the average latency for sendmsg() calls,
'recv_lat' the average latency for '->read_sock()' calls,
'write_space' the number of 'write_space' callbacks,
'data_ready' the number of 'data_ready' callbacks,
and 'queue_busy' the number of -EAGAIN returns from sendmsg().
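The latency averages are maintained with Welford's online algorithm (see
nvme_tcp_stat_add() below); a self-contained user-space sketch of the same
integer update, for illustration only:

#include <stdio.h>
#include <stdint.h>

struct stat_acc {
	uint64_t samples;
	int64_t mean;		/* running mean */
	uint64_t sqmean;	/* sum of squared deviations */
};

static void stat_add(struct stat_acc *st, int64_t sample)
{
	int64_t delta, delta2;

	st->samples++;
	delta = sample - st->mean;
	st->mean += delta / (int64_t)st->samples;
	delta2 = sample - st->mean;
	st->sqmean += (uint64_t)(delta * delta2);
}

int main(void)
{
	struct stat_acc st = { 0, 0, 0 };
	const int64_t samples_ns[] = { 120, 80, 95, 250, 110 };

	for (int i = 0; i < 5; i++)
		stat_add(&st, samples_ns[i]);

	/* variance estimate: sqmean / samples; the driver reports a (scaled) square root of this */
	printf("samples=%llu mean=%lld variance=%llu\n",
	       (unsigned long long)st.samples, (long long)st.mean,
	       (unsigned long long)(st.sqmean / st.samples));
	return 0;
}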
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 253 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 249 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 9caee99955c2..a758fbb3f9bb 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -21,6 +21,8 @@
#include <net/busy_poll.h>
#include <trace/events/sock.h>
+#include <linux/debugfs.h>
+
#include "nvme.h"
#include "fabrics.h"
@@ -139,6 +141,13 @@ enum nvme_tcp_recv_state {
NVME_TCP_RECV_DDGST,
};
+struct nvme_tcp_stat {
+ u64 samples;
+ u64 batch;
+ s64 mean;
+ u64 sqmean;
+};
+
struct nvme_tcp_ctrl;
struct nvme_tcp_queue {
struct socket *sock;
@@ -164,6 +173,14 @@ struct nvme_tcp_queue {
/* send state */
struct nvme_tcp_request *request;
+ /* statistics */
+ struct dentry *debugfs_dir;
+ u64 data_ready_cnt;
+ u64 write_space_cnt;
+ u64 queue_busy_cnt;
+ struct nvme_tcp_stat send_lat;
+ struct nvme_tcp_stat recv_lat;
+
u32 maxh2cdata;
size_t cmnd_capsule_len;
struct nvme_tcp_ctrl *ctrl;
@@ -198,6 +215,7 @@ struct nvme_tcp_ctrl {
struct sockaddr_storage src_addr;
struct nvme_ctrl ctrl;
+ struct dentry *debugfs_dir;
struct work_struct err_work;
struct delayed_work connect_work;
struct nvme_tcp_request async_req;
@@ -207,10 +225,163 @@ struct nvme_tcp_ctrl {
static LIST_HEAD(nvme_tcp_ctrl_list);
static DEFINE_MUTEX(nvme_tcp_ctrl_mutex);
static struct workqueue_struct *nvme_tcp_wq;
+static struct dentry *nvme_tcp_debugfs;
static const struct blk_mq_ops nvme_tcp_mq_ops;
static const struct blk_mq_ops nvme_tcp_admin_mq_ops;
static int nvme_tcp_try_send(struct nvme_tcp_queue *queue);
+static void nvme_tcp_stat_add(struct nvme_tcp_stat *stat, u64 start)
+{
+ u64 usec;
+ s64 delta, delta2;
+
+ usec = ktime_to_us(ktime_get());
+ if (usec < start)
+ return;
+ usec -= start;
+ /* Count in nanoseconds to improve precision */
+ usec *= 1000;
+ stat->batch += usec;
+ /* Welford's algorithm */
+ stat->samples++;
+ delta = (s64)usec - stat->mean;
+ stat->mean += div_s64(delta, (s64)stat->samples);
+ delta2 = (s64)usec - stat->mean;
+ stat->sqmean += (u64)(delta * delta2);
+}
+
+#define NVME_TCP_DEBUGFS_ATTR(field) \
+ static int field##_open(struct inode *inode, struct file *file) \
+ { return single_open(file, field##_show, inode->i_private); } \
+ \
+ static const struct file_operations field##_fops = { \
+ .open = field##_open, \
+ .read = seq_read, \
+ .release = single_release, \
+ }
+
+#define NVME_TCP_DEBUGFS_RW_ATTR(field) \
+ static int field##_open(struct inode *inode, struct file *file) \
+ { return single_open(file, field##_show, inode->i_private); } \
+ \
+ static const struct file_operations field##_fops = { \
+ .open = field##_open, \
+ .read = seq_read, \
+ .write = field##_write, \
+ .release = single_release, \
+ }
+
+static int nvme_tcp_queue_data_ready_show(struct seq_file *m, void *p)
+{
+ struct nvme_tcp_queue *queue = m->private;
+
+ seq_printf(m, "%llu\n", queue->data_ready_cnt);
+ return 0;
+}
+NVME_TCP_DEBUGFS_ATTR(nvme_tcp_queue_data_ready);
+
+static int nvme_tcp_queue_write_space_show(struct seq_file *m, void *p)
+{
+ struct nvme_tcp_queue *queue = m->private;
+
+ seq_printf(m, "%llu\n", queue->write_space_cnt);
+ return 0;
+}
+NVME_TCP_DEBUGFS_ATTR(nvme_tcp_queue_write_space);
+
+static int nvme_tcp_queue_queue_busy_show(struct seq_file *m, void *p)
+{
+ struct nvme_tcp_queue *queue = m->private;
+
+ seq_printf(m, "%llu\n", queue->queue_busy_cnt);
+ return 0;
+}
+NVME_TCP_DEBUGFS_ATTR(nvme_tcp_queue_queue_busy);
+
+static int nvme_tcp_queue_recv_lat_show(struct seq_file *m, void *p)
+{
+ struct nvme_tcp_queue *queue = m->private;
+ struct nvme_tcp_stat *st = &queue->recv_lat;
+ u64 var = 0;
+
+ if (st->samples > 1)
+ var = div_u64(st->sqmean, st->samples * 1000);
+
+ seq_printf(m, "%llu %llu %u\n",
+ st->samples, div_u64(st->batch, st->samples * 1000), int_sqrt64(var));
+ return 0;
+}
+NVME_TCP_DEBUGFS_ATTR(nvme_tcp_queue_recv_lat);
+
+static int nvme_tcp_queue_send_lat_show(struct seq_file *m, void *p)
+{
+ struct nvme_tcp_queue *queue = m->private;
+ struct nvme_tcp_stat *st = &queue->send_lat;
+ u64 var = 0;
+
+ if (st->samples > 1)
+ var = div_u64(st->sqmean, st->samples * 1000);
+
+ seq_printf(m, "%llu %llu %u\n",
+ st->samples, div_u64(st->batch, st->samples * 1000), int_sqrt64(var));
+ return 0;
+}
+NVME_TCP_DEBUGFS_ATTR(nvme_tcp_queue_send_lat);
+
+static int nvme_tcp_ctrl_recv_lat_show(struct seq_file *m, void *p)
+{
+ struct nvme_tcp_ctrl *ctrl = m->private;
+ struct nvme_tcp_stat *st, lat;
+ u64 mean = 0, var = 0;
+ int i;
+
+ lat.samples = 0;
+ lat.mean = 0;
+ lat.sqmean = 0;
+ for (i = 1; i < ctrl->ctrl.queue_count; i++) {
+ st = &ctrl->queues[i].recv_lat;
+ lat.mean += st->mean;
+ lat.sqmean += st->sqmean;
+ lat.samples += st->samples;
+ }
+ if (lat.samples > 1) {
+ mean = lat.mean;
+ var = int_sqrt(div_u64(lat.sqmean, lat.samples));
+ }
+
+ seq_printf(m, "%llu %llu %llu\n",
+ lat.samples, mean, var);
+ return 0;
+}
+NVME_TCP_DEBUGFS_ATTR(nvme_tcp_ctrl_recv_lat);
+
+static int nvme_tcp_ctrl_send_lat_show(struct seq_file *m, void *p)
+{
+ struct nvme_tcp_ctrl *ctrl = m->private;
+ struct nvme_tcp_stat *st, lat;
+ u64 mean = 0, var = 0;
+ int i;
+
+ lat.samples = 0;
+ lat.mean = 0;
+ lat.sqmean = 0;
+ for (i = 1; i < ctrl->ctrl.queue_count; i++) {
+ st = &ctrl->queues[i].send_lat;
+ lat.mean += st->mean;
+ lat.sqmean += st->sqmean;
+ lat.samples += st->samples;
+ }
+ if (lat.samples > 1) {
+ mean = lat.mean;
+ var = int_sqrt(div_u64(lat.sqmean, lat.samples));
+ }
+
+ seq_printf(m, "%llu %llu %llu\n",
+ lat.samples, mean, var);
+ return 0;
+}
+NVME_TCP_DEBUGFS_ATTR(nvme_tcp_ctrl_send_lat);
+
static inline struct nvme_tcp_ctrl *to_tcp_ctrl(struct nvme_ctrl *ctrl)
{
return container_of(ctrl, struct nvme_tcp_ctrl, ctrl);
@@ -985,8 +1156,10 @@ static void nvme_tcp_data_ready(struct sock *sk)
read_lock_bh(&sk->sk_callback_lock);
queue = sk->sk_user_data;
if (likely(queue && queue->rd_enabled) &&
- !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
+ !test_bit(NVME_TCP_Q_POLLING, &queue->flags)) {
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+ queue->data_ready_cnt++;
+ }
read_unlock_bh(&sk->sk_callback_lock);
}
@@ -999,6 +1172,7 @@ static void nvme_tcp_write_space(struct sock *sk)
if (likely(queue && sk_stream_is_writeable(sk))) {
clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+ queue->write_space_cnt++;
}
read_unlock_bh(&sk->sk_callback_lock);
}
@@ -1067,6 +1241,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
bool last = nvme_tcp_pdu_last_send(req, len);
int req_data_sent = req->data_sent;
int ret;
+ u64 start;
if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
msg.msg_flags |= MSG_EOR;
@@ -1079,7 +1254,9 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
bvec_set_page(&bvec, page, len, offset);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
queue->nr_send++;
+ start = ktime_to_us(ktime_get());
ret = sock_sendmsg(queue->sock, &msg);
+ nvme_tcp_stat_add(&queue->send_lat, start);
if (ret <= 0)
return ret;
@@ -1124,6 +1301,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
u8 hdgst = nvme_tcp_hdgst_len(queue);
int len = sizeof(*pdu) + hdgst - req->offset;
int ret;
+ u64 start;
if (inline_data || nvme_tcp_queue_more(queue))
msg.msg_flags |= MSG_MORE;
@@ -1136,7 +1314,9 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
queue->nr_send++;
+ start = ktime_to_us(ktime_get());
ret = sock_sendmsg(queue->sock, &msg);
+ nvme_tcp_stat_add(&queue->send_lat, start);
if (unlikely(ret <= 0))
return ret;
@@ -1165,6 +1345,7 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
u8 hdgst = nvme_tcp_hdgst_len(queue);
int len = sizeof(*pdu) - req->offset + hdgst;
int ret;
+ u64 start;
if (queue->hdr_digest && !req->offset)
nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
@@ -1175,7 +1356,9 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
queue->nr_send++;
+ start = ktime_to_us(ktime_get());
ret = sock_sendmsg(queue->sock, &msg);
+ nvme_tcp_stat_add(&queue->send_lat, start);
if (unlikely(ret <= 0))
return ret;
@@ -1202,6 +1385,7 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
.iov_base = (u8 *)&req->ddgst + req->offset,
.iov_len = NVME_TCP_DIGEST_LENGTH - req->offset
};
+ u64 start;
if (nvme_tcp_queue_more(queue))
msg.msg_flags |= MSG_MORE;
@@ -1209,7 +1393,9 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
msg.msg_flags |= MSG_EOR;
queue->nr_send++;
+ start = ktime_to_us(ktime_get());
ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
+ nvme_tcp_stat_add(&queue->send_lat, start);
if (unlikely(ret <= 0))
return ret;
@@ -1263,6 +1449,7 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
ret = nvme_tcp_try_send_ddgst(req);
done:
if (ret == -EAGAIN) {
+ queue->queue_busy_cnt++;
ret = 0;
} else if (ret < 0) {
dev_err(queue->ctrl->ctrl.device,
@@ -1281,13 +1468,16 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
struct sock *sk = sock->sk;
read_descriptor_t rd_desc;
int consumed;
+ u64 start;
rd_desc.arg.data = queue;
rd_desc.count = 1;
lock_sock(sk);
queue->nr_cqe = 0;
queue->nr_recv = 0;
+ start = ktime_to_us(ktime_get());
consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
+ nvme_tcp_stat_add(&queue->recv_lat, start);
release_sock(sk);
return consumed;
}
@@ -1331,11 +1521,18 @@ static void nvme_tcp_io_work(struct work_struct *w)
check:
overrun = ktime_to_us(ktime_get()) - start;
if (overrun > 10 * deadline) {
+ u64 slat = 0, clat = 0;
+
+ if (queue->send_lat.samples)
+ slat = queue->send_lat.mean;
+ if (queue->recv_lat.samples)
+ clat = queue->recv_lat.mean;
dev_dbg(queue->ctrl->ctrl.device,
- "queue %d: stall (%llu msecs) send %u sqe %u recv %u cqe %u%s%s\n",
+ "queue %d: stall (%llu msecs), send %u lat %llu %u sqes, recv %u lat %llu %u cqes%s%s\n",
nvme_tcp_queue_id(queue), div_u64(overrun, 1000),
- queue->nr_send, queue->nr_sqe, queue->nr_recv, queue->nr_cqe,
- list_empty(&queue->send_list) ? " empty" : "", queue->request ? " pending" : "");
+ queue->nr_send, slat, queue->nr_sqe, queue->nr_recv, clat, queue->nr_cqe,
+ list_empty(&queue->send_list) ? " empty" : "",
+ queue->request ? " pending" : "");
}
if (pending)
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
@@ -1408,6 +1605,11 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
return;
+ if (queue->debugfs_dir) {
+ debugfs_remove_recursive(queue->debugfs_dir);
+ queue->debugfs_dir = NULL;
+ }
+
if (queue->hdr_digest || queue->data_digest)
nvme_tcp_free_crypto(queue);
@@ -1696,6 +1898,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
struct nvme_tcp_queue *queue = &ctrl->queues[qid];
int ret, rcv_pdu_size;
struct file *sock_file;
+ char queue_name[32];
mutex_init(&queue->queue_lock);
queue->ctrl = ctrl;
@@ -1756,6 +1959,11 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
queue->ddgst_remaining = 0;
queue->pdu_remaining = 0;
queue->pdu_offset = 0;
+ queue->write_space_cnt = 0;
+ queue->data_ready_cnt = 0;
+ queue->queue_busy_cnt = 0;
+ memset(&queue->recv_lat, 0, sizeof(struct nvme_tcp_stat));
+ memset(&queue->send_lat, 0, sizeof(struct nvme_tcp_stat));
sk_set_memalloc(queue->sock->sk);
if (nctrl->opts->mask & NVMF_OPT_HOST_TRADDR) {
@@ -1802,6 +2010,19 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
goto err_crypto;
}
+ sprintf(queue_name, "%d", nvme_tcp_queue_id(queue));
+ queue->debugfs_dir = debugfs_create_dir(queue_name, ctrl->debugfs_dir);
+ debugfs_create_file("recv_lat", S_IRUSR, queue->debugfs_dir,
+ queue, &nvme_tcp_queue_recv_lat_fops);
+ debugfs_create_file("send_lat", S_IRUSR, queue->debugfs_dir,
+ queue, &nvme_tcp_queue_send_lat_fops);
+ debugfs_create_file("data_ready", S_IRUSR, queue->debugfs_dir,
+ queue, &nvme_tcp_queue_data_ready_fops);
+ debugfs_create_file("write_space", S_IRUSR, queue->debugfs_dir,
+ queue, &nvme_tcp_queue_write_space_fops);
+ debugfs_create_file("queue_busy", S_IRUSR, queue->debugfs_dir,
+ queue, &nvme_tcp_queue_queue_busy_fops);
+
dev_dbg(nctrl->device, "connecting queue %d\n",
nvme_tcp_queue_id(queue));
@@ -2420,7 +2641,14 @@ static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
static void nvme_tcp_delete_ctrl(struct nvme_ctrl *ctrl)
{
+ struct nvme_tcp_ctrl *nctrl = to_tcp_ctrl(ctrl);
+
nvme_tcp_teardown_ctrl(ctrl, true);
+
+ if (nctrl->debugfs_dir) {
+ debugfs_remove_recursive(nctrl->debugfs_dir);
+ nctrl->debugfs_dir = NULL;
+ }
}
static void nvme_reset_ctrl_work(struct work_struct *work)
@@ -2890,6 +3118,13 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
if (ret)
goto out_put_ctrl;
+ ctrl->debugfs_dir = debugfs_create_dir(dev_name(ctrl->ctrl.device),
+ nvme_tcp_debugfs);
+ debugfs_create_file("recv_lat", S_IRUSR, ctrl->debugfs_dir,
+ ctrl, &nvme_tcp_ctrl_recv_lat_fops);
+ debugfs_create_file("send_lat", S_IRUSR, ctrl->debugfs_dir,
+ ctrl, &nvme_tcp_ctrl_send_lat_fops);
+
if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
WARN_ON_ONCE(1);
ret = -EINTR;
@@ -2910,6 +3145,10 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
return &ctrl->ctrl;
out_uninit_ctrl:
+ if (ctrl->debugfs_dir) {
+ debugfs_remove_recursive(ctrl->debugfs_dir);
+ ctrl->debugfs_dir = NULL;
+ }
nvme_uninit_ctrl(&ctrl->ctrl);
out_put_ctrl:
nvme_put_ctrl(&ctrl->ctrl);
@@ -2951,6 +3190,10 @@ static int __init nvme_tcp_init_module(void)
if (!nvme_tcp_wq)
return -ENOMEM;
+ nvme_tcp_debugfs = debugfs_create_dir("nvme_tcp", NULL);
+ if (!nvme_tcp_debugfs)
+ return -ENOMEM;
+
nvmf_register_transport(&nvme_tcp_transport);
return 0;
}
@@ -2961,6 +3204,8 @@ static void __exit nvme_tcp_cleanup_module(void)
nvmf_unregister_transport(&nvme_tcp_transport);
+ debugfs_remove_recursive(nvme_tcp_debugfs);
+
mutex_lock(&nvme_tcp_ctrl_mutex);
list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list)
nvme_delete_ctrl(&ctrl->ctrl);
--
2.35.3
* Re: [PATCH 5/8] nvme-tcp: debugfs entries for latency statistics
2024-07-16 7:36 ` [PATCH 5/8] nvme-tcp: debugfs entries for latency statistics Hannes Reinecke
@ 2024-07-17 21:14 ` Sagi Grimberg
0 siblings, 0 replies; 24+ messages in thread
From: Sagi Grimberg @ 2024-07-17 21:14 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme, Hannes Reinecke
On 16/07/2024 10:36, Hannes Reinecke wrote:
> From: Hannes Reinecke <hare@suse.de>
>
> Add debugfs entries to display latency statistics:
> 'send_lat' displays the average latency for sendmsg() calls,
> 'recv_lat' the average latency for '->read_sock()' calls,
> 'write_space' the number of 'write_space' callbacks,
> 'data_ready' the number of 'data_ready' callbacks,
> and 'queue_busy' the number of -EAGAIN returns from sendmsg().
This is definitely something where eBPF would be a better fit, IMO.
What I'd recommend is that we add tracepoints and move this stats
logic to an eBPF program (the iovisor project would probably
welcome such a program).
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
> drivers/nvme/host/tcp.c | 253 +++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 249 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 9caee99955c2..a758fbb3f9bb 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -21,6 +21,8 @@
> #include <net/busy_poll.h>
> #include <trace/events/sock.h>
>
> +#include <linux/debugfs.h>
> +
> #include "nvme.h"
> #include "fabrics.h"
>
> @@ -139,6 +141,13 @@ enum nvme_tcp_recv_state {
> NVME_TCP_RECV_DDGST,
> };
>
> +struct nvme_tcp_stat {
> + u64 samples;
> + u64 batch;
> + s64 mean;
> + u64 sqmean;
> +};
> +
> struct nvme_tcp_ctrl;
> struct nvme_tcp_queue {
> struct socket *sock;
> @@ -164,6 +173,14 @@ struct nvme_tcp_queue {
> /* send state */
> struct nvme_tcp_request *request;
>
> + /* statistics */
> + struct dentry *debugfs_dir;
> + u64 data_ready_cnt;
> + u64 write_space_cnt;
> + u64 queue_busy_cnt;
> + struct nvme_tcp_stat send_lat;
> + struct nvme_tcp_stat recv_lat;
> +
> u32 maxh2cdata;
> size_t cmnd_capsule_len;
> struct nvme_tcp_ctrl *ctrl;
> @@ -198,6 +215,7 @@ struct nvme_tcp_ctrl {
> struct sockaddr_storage src_addr;
> struct nvme_ctrl ctrl;
>
> + struct dentry *debugfs_dir;
> struct work_struct err_work;
> struct delayed_work connect_work;
> struct nvme_tcp_request async_req;
> @@ -207,10 +225,163 @@ struct nvme_tcp_ctrl {
> static LIST_HEAD(nvme_tcp_ctrl_list);
> static DEFINE_MUTEX(nvme_tcp_ctrl_mutex);
> static struct workqueue_struct *nvme_tcp_wq;
> +static struct dentry *nvme_tcp_debugfs;
> static const struct blk_mq_ops nvme_tcp_mq_ops;
> static const struct blk_mq_ops nvme_tcp_admin_mq_ops;
> static int nvme_tcp_try_send(struct nvme_tcp_queue *queue);
>
> +static void nvme_tcp_stat_add(struct nvme_tcp_stat *stat, u64 start)
> +{
> + u64 usec;
> + s64 delta, delta2;
> +
> + usec = ktime_to_us(ktime_get());
> + if (usec < start)
> + return;
> + usec -= start;
> + /* Count in nanoseconds to improve precision */
> + usec *= 1000;
> + stat->batch += usec;
> + /* Welford's algorithm */
> + stat->samples++;
> + delta = (s64)usec - stat->mean;
> + stat->mean += div_s64(delta, (s64)stat->samples);
> + delta2 = (s64)usec - stat->mean;
> + stat->sqmean += (u64)(delta * delta2);
> +}
> +
> +#define NVME_TCP_DEBUGFS_ATTR(field) \
> + static int field##_open(struct inode *inode, struct file *file) \
> + { return single_open(file, field##_show, inode->i_private); } \
> + \
> + static const struct file_operations field##_fops = { \
> + .open = field##_open, \
> + .read = seq_read, \
> + .release = single_release, \
> + }
> +
> +#define NVME_TCP_DEBUGFS_RW_ATTR(field) \
> + static int field##_open(struct inode *inode, struct file *file) \
> + { return single_open(file, field##_show, inode->i_private); } \
> + \
> + static const struct file_operations field##_fops = { \
> + .open = field##_open, \
> + .read = seq_read, \
> + .write = field##_write, \
> + .release = single_release, \
> + }
> +
> +static int nvme_tcp_queue_data_ready_show(struct seq_file *m, void *p)
> +{
> + struct nvme_tcp_queue *queue = m->private;
> +
> + seq_printf(m, "%llu\n", queue->data_ready_cnt);
> + return 0;
> +}
> +NVME_TCP_DEBUGFS_ATTR(nvme_tcp_queue_data_ready);
> +
> +static int nvme_tcp_queue_write_space_show(struct seq_file *m, void *p)
> +{
> + struct nvme_tcp_queue *queue = m->private;
> +
> + seq_printf(m, "%llu\n", queue->write_space_cnt);
> + return 0;
> +}
> +NVME_TCP_DEBUGFS_ATTR(nvme_tcp_queue_write_space);
> +
> +static int nvme_tcp_queue_queue_busy_show(struct seq_file *m, void *p)
> +{
> + struct nvme_tcp_queue *queue = m->private;
> +
> + seq_printf(m, "%llu\n", queue->queue_busy_cnt);
> + return 0;
> +}
> +NVME_TCP_DEBUGFS_ATTR(nvme_tcp_queue_queue_busy);
> +
> +static int nvme_tcp_queue_recv_lat_show(struct seq_file *m, void *p)
> +{
> + struct nvme_tcp_queue *queue = m->private;
> + struct nvme_tcp_stat *st = &queue->recv_lat;
> + u64 var = 0;
> +
> + if (st->samples > 1)
> + var = div_u64(st->sqmean, st->samples * 1000);
> +
> + seq_printf(m, "%llu %llu %u\n",
> + st->samples, div_u64(st->batch, st->samples * 1000), int_sqrt64(var));
> + return 0;
> +}
> +NVME_TCP_DEBUGFS_ATTR(nvme_tcp_queue_recv_lat);
> +
> +static int nvme_tcp_queue_send_lat_show(struct seq_file *m, void *p)
> +{
> + struct nvme_tcp_queue *queue = m->private;
> + struct nvme_tcp_stat *st = &queue->send_lat;
> + u64 var = 0;
> +
> + if (st->samples > 1)
> + var = div_u64(st->sqmean, st->samples * 1000);
> +
> + seq_printf(m, "%llu %llu %u\n",
> + st->samples, div_u64(st->batch, st->samples * 1000), int_sqrt64(var));
> + return 0;
> +}
> +NVME_TCP_DEBUGFS_ATTR(nvme_tcp_queue_send_lat);
> +
> +static int nvme_tcp_ctrl_recv_lat_show(struct seq_file *m, void *p)
> +{
> + struct nvme_tcp_ctrl *ctrl = m->private;
> + struct nvme_tcp_stat *st, lat;
> + u64 mean = 0, var = 0;
> + int i;
> +
> + lat.samples = 0;
> + lat.mean = 0;
> + lat.sqmean = 0;
> + for (i = 1; i < ctrl->ctrl.queue_count; i++) {
> + st = &ctrl->queues[i].recv_lat;
> + lat.mean += st->mean;
> + lat.sqmean += st->sqmean;
> + lat.samples += st->samples;
> + }
> + if (lat.samples > 1) {
> + mean = lat.mean;
> + var = int_sqrt(div_u64(lat.sqmean, lat.samples));
> + }
> +
> + seq_printf(m, "%llu %llu %llu\n",
> + lat.samples, mean, var);
> + return 0;
> +}
> +NVME_TCP_DEBUGFS_ATTR(nvme_tcp_ctrl_recv_lat);
> +
> +static int nvme_tcp_ctrl_send_lat_show(struct seq_file *m, void *p)
> +{
> + struct nvme_tcp_ctrl *ctrl = m->private;
> + struct nvme_tcp_stat *st, lat;
> + u64 mean = 0, var = 0;
> + int i;
> +
> + lat.samples = 0;
> + lat.mean = 0;
> + lat.sqmean = 0;
> + for (i = 1; i < ctrl->ctrl.queue_count; i++) {
> + st = &ctrl->queues[i].send_lat;
> + lat.mean += st->mean;
> + lat.sqmean += st->sqmean;
> + lat.samples += st->samples;
> + }
> + if (lat.samples > 1) {
> + mean = lat.mean;
> + var = int_sqrt(div_u64(lat.sqmean, lat.samples));
> + }
> +
> + seq_printf(m, "%llu %llu %llu\n",
> + lat.samples, mean, var);
> + return 0;
> +}
> +NVME_TCP_DEBUGFS_ATTR(nvme_tcp_ctrl_send_lat);
> +
> static inline struct nvme_tcp_ctrl *to_tcp_ctrl(struct nvme_ctrl *ctrl)
> {
> return container_of(ctrl, struct nvme_tcp_ctrl, ctrl);
> @@ -985,8 +1156,10 @@ static void nvme_tcp_data_ready(struct sock *sk)
> read_lock_bh(&sk->sk_callback_lock);
> queue = sk->sk_user_data;
> if (likely(queue && queue->rd_enabled) &&
> - !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
> + !test_bit(NVME_TCP_Q_POLLING, &queue->flags)) {
> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> + queue->data_ready_cnt++;
> + }
> read_unlock_bh(&sk->sk_callback_lock);
> }
>
> @@ -999,6 +1172,7 @@ static void nvme_tcp_write_space(struct sock *sk)
> if (likely(queue && sk_stream_is_writeable(sk))) {
> clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> + queue->write_space_cnt++;
> }
> read_unlock_bh(&sk->sk_callback_lock);
> }
> @@ -1067,6 +1241,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
> bool last = nvme_tcp_pdu_last_send(req, len);
> int req_data_sent = req->data_sent;
> int ret;
> + u64 start;
>
> if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
> msg.msg_flags |= MSG_EOR;
> @@ -1079,7 +1254,9 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
> bvec_set_page(&bvec, page, len, offset);
> iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
> queue->nr_send++;
> + start = ktime_to_us(ktime_get());
> ret = sock_sendmsg(queue->sock, &msg);
> + nvme_tcp_stat_add(&queue->send_lat, start);
> if (ret <= 0)
> return ret;
>
> @@ -1124,6 +1301,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
> u8 hdgst = nvme_tcp_hdgst_len(queue);
> int len = sizeof(*pdu) + hdgst - req->offset;
> int ret;
> + u64 start;
>
> if (inline_data || nvme_tcp_queue_more(queue))
> msg.msg_flags |= MSG_MORE;
> @@ -1136,7 +1314,9 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
> bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
> iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
> queue->nr_send++;
> + start = ktime_to_us(ktime_get());
> ret = sock_sendmsg(queue->sock, &msg);
> + nvme_tcp_stat_add(&queue->send_lat, start);
> if (unlikely(ret <= 0))
> return ret;
>
> @@ -1165,6 +1345,7 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
> u8 hdgst = nvme_tcp_hdgst_len(queue);
> int len = sizeof(*pdu) - req->offset + hdgst;
> int ret;
> + u64 start;
>
> if (queue->hdr_digest && !req->offset)
> nvme_tcp_hdgst(queue->snd_hash, pdu, sizeof(*pdu));
> @@ -1175,7 +1356,9 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req)
> bvec_set_virt(&bvec, (void *)pdu + req->offset, len);
> iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
> queue->nr_send++;
> + start = ktime_to_us(ktime_get());
> ret = sock_sendmsg(queue->sock, &msg);
> + nvme_tcp_stat_add(&queue->send_lat, start);
> if (unlikely(ret <= 0))
> return ret;
>
> @@ -1202,6 +1385,7 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
> .iov_base = (u8 *)&req->ddgst + req->offset,
> .iov_len = NVME_TCP_DIGEST_LENGTH - req->offset
> };
> + u64 start;
>
> if (nvme_tcp_queue_more(queue))
> msg.msg_flags |= MSG_MORE;
> @@ -1209,7 +1393,9 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
> msg.msg_flags |= MSG_EOR;
>
> queue->nr_send++;
> + start = ktime_to_us(ktime_get());
> ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
> + nvme_tcp_stat_add(&queue->send_lat, start);
> if (unlikely(ret <= 0))
> return ret;
>
> @@ -1263,6 +1449,7 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
> ret = nvme_tcp_try_send_ddgst(req);
> done:
> if (ret == -EAGAIN) {
> + queue->queue_busy_cnt++;
> ret = 0;
> } else if (ret < 0) {
> dev_err(queue->ctrl->ctrl.device,
> @@ -1281,13 +1468,16 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
> struct sock *sk = sock->sk;
> read_descriptor_t rd_desc;
> int consumed;
> + u64 start;
>
> rd_desc.arg.data = queue;
> rd_desc.count = 1;
> lock_sock(sk);
> queue->nr_cqe = 0;
> queue->nr_recv = 0;
> + start = ktime_to_us(ktime_get());
> consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
> + nvme_tcp_stat_add(&queue->recv_lat, start);
> release_sock(sk);
> return consumed;
> }
> @@ -1331,11 +1521,18 @@ static void nvme_tcp_io_work(struct work_struct *w)
> check:
> overrun = ktime_to_us(ktime_get()) - start;
> if (overrun > 10 * deadline) {
> + u64 slat = 0, clat = 0;
> +
> + if (queue->send_lat.samples)
> + slat = queue->send_lat.mean;
> + if (queue->recv_lat.samples)
> + clat = queue->recv_lat.mean;
> dev_dbg(queue->ctrl->ctrl.device,
> - "queue %d: stall (%llu msecs) send %u sqe %u recv %u cqe %u%s%s\n",
> + "queue %d: stall (%llu msecs), send %u lat %llu %u sqes, recv %u lat %llu %u cqes%s%s\n",
> nvme_tcp_queue_id(queue), div_u64(overrun, 1000),
> - queue->nr_send, queue->nr_sqe, queue->nr_recv, queue->nr_cqe,
> - list_empty(&queue->send_list) ? " empty" : "", queue->request ? " pending" : "");
> + queue->nr_send, slat, queue->nr_sqe, queue->nr_recv, clat, queue->nr_cqe,
> + list_empty(&queue->send_list) ? " empty" : "",
> + queue->request ? " pending" : "");
> }
> if (pending)
> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> @@ -1408,6 +1605,11 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
> if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
> return;
>
> + if (queue->debugfs_dir) {
> + debugfs_remove_recursive(queue->debugfs_dir);
> + queue->debugfs_dir = NULL;
> + }
> +
> if (queue->hdr_digest || queue->data_digest)
> nvme_tcp_free_crypto(queue);
>
> @@ -1696,6 +1898,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
> struct nvme_tcp_queue *queue = &ctrl->queues[qid];
> int ret, rcv_pdu_size;
> struct file *sock_file;
> + char queue_name[32];
>
> mutex_init(&queue->queue_lock);
> queue->ctrl = ctrl;
> @@ -1756,6 +1959,11 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
> queue->ddgst_remaining = 0;
> queue->pdu_remaining = 0;
> queue->pdu_offset = 0;
> + queue->write_space_cnt = 0;
> + queue->data_ready_cnt = 0;
> + queue->queue_busy_cnt = 0;
> + memset(&queue->recv_lat, 0, sizeof(struct nvme_tcp_stat));
> + memset(&queue->send_lat, 0, sizeof(struct nvme_tcp_stat));
> sk_set_memalloc(queue->sock->sk);
>
> if (nctrl->opts->mask & NVMF_OPT_HOST_TRADDR) {
> @@ -1802,6 +2010,19 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
> goto err_crypto;
> }
>
> + sprintf(queue_name, "%d", nvme_tcp_queue_id(queue));
> + queue->debugfs_dir = debugfs_create_dir(queue_name, ctrl->debugfs_dir);
> + debugfs_create_file("recv_lat", S_IRUSR, queue->debugfs_dir,
> + queue, &nvme_tcp_queue_recv_lat_fops);
> + debugfs_create_file("send_lat", S_IRUSR, queue->debugfs_dir,
> + queue, &nvme_tcp_queue_send_lat_fops);
> + debugfs_create_file("data_ready", S_IRUSR, queue->debugfs_dir,
> + queue, &nvme_tcp_queue_data_ready_fops);
> + debugfs_create_file("write_space", S_IRUSR, queue->debugfs_dir,
> + queue, &nvme_tcp_queue_write_space_fops);
> + debugfs_create_file("queue_busy", S_IRUSR, queue->debugfs_dir,
> + queue, &nvme_tcp_queue_queue_busy_fops);
> +
> dev_dbg(nctrl->device, "connecting queue %d\n",
> nvme_tcp_queue_id(queue));
>
> @@ -2420,7 +2641,14 @@ static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
>
> static void nvme_tcp_delete_ctrl(struct nvme_ctrl *ctrl)
> {
> + struct nvme_tcp_ctrl *nctrl = to_tcp_ctrl(ctrl);
> +
> nvme_tcp_teardown_ctrl(ctrl, true);
> +
> + if (nctrl->debugfs_dir) {
> + debugfs_remove_recursive(nctrl->debugfs_dir);
> + nctrl->debugfs_dir = NULL;
> + }
> }
>
> static void nvme_reset_ctrl_work(struct work_struct *work)
> @@ -2890,6 +3118,13 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
> if (ret)
> goto out_put_ctrl;
>
> + ctrl->debugfs_dir = debugfs_create_dir(dev_name(ctrl->ctrl.device),
> + nvme_tcp_debugfs);
> + debugfs_create_file("recv_lat", S_IRUSR, ctrl->debugfs_dir,
> + ctrl, &nvme_tcp_ctrl_recv_lat_fops);
> + debugfs_create_file("send_lat", S_IRUSR, ctrl->debugfs_dir,
> + ctrl, &nvme_tcp_ctrl_send_lat_fops);
> +
> if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
> WARN_ON_ONCE(1);
> ret = -EINTR;
> @@ -2910,6 +3145,10 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
> return &ctrl->ctrl;
>
> out_uninit_ctrl:
> + if (ctrl->debugfs_dir) {
> + debugfs_remove_recursive(ctrl->debugfs_dir);
> + ctrl->debugfs_dir = NULL;
> + }
> nvme_uninit_ctrl(&ctrl->ctrl);
> out_put_ctrl:
> nvme_put_ctrl(&ctrl->ctrl);
> @@ -2951,6 +3190,10 @@ static int __init nvme_tcp_init_module(void)
> if (!nvme_tcp_wq)
> return -ENOMEM;
>
> + nvme_tcp_debugfs = debugfs_create_dir("nvme_tcp", NULL);
> + if (!nvme_tcp_debugfs)
> + return -ENOMEM;
> +
> nvmf_register_transport(&nvme_tcp_transport);
> return 0;
> }
> @@ -2961,6 +3204,8 @@ static void __exit nvme_tcp_cleanup_module(void)
>
> nvmf_unregister_transport(&nvme_tcp_transport);
>
> + debugfs_remove_recursive(nvme_tcp_debugfs);
> +
> mutex_lock(&nvme_tcp_ctrl_mutex);
> list_for_each_entry(ctrl, &nvme_tcp_ctrl_list, list)
> nvme_delete_ctrl(&ctrl->ctrl);
* [PATCH 6/8] nvme-tcp: reduce callback lock contention
2024-07-16 7:36 [PATCHv3 0/8] nvme-tcp: improve scalability Hannes Reinecke
` (4 preceding siblings ...)
2024-07-16 7:36 ` [PATCH 5/8] nvme-tcp: debugfs entries for latency statistics Hannes Reinecke
@ 2024-07-16 7:36 ` Hannes Reinecke
2024-07-17 21:19 ` Sagi Grimberg
2024-07-16 7:36 ` [PATCH 7/8] nvme-tcp: check for SOCK_NOSPACE before sending Hannes Reinecke
` (2 subsequent siblings)
8 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2024-07-16 7:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke,
Hannes Reinecke
From: Hannes Reinecke <hare@suse.de>
We have heavily queued tx and rx flows, so callbacks might happen
at the same time. As the callbacks influence the state machine we
really should remove contention here to not impact I/O performance.
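The callbacks turn into lockless RCU readers here, so correctness relies
on the usual publish/retire ordering on sk_user_data; roughly (a generic
sketch of the pattern, not the literal code in this patch):

	/* teardown side */
	write_lock_bh(&sk->sk_callback_lock);
	rcu_assign_sk_user_data(sk, NULL);	/* unpublish the queue */
	sk->sk_data_ready = queue->data_ready;	/* restore original callbacks */
	sk->sk_write_space = queue->write_space;
	write_unlock_bh(&sk->sk_callback_lock);
	synchronize_rcu();	/* wait out any in-flight lockless callback */
	/* only now is it safe to tear down the queue */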
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a758fbb3f9bb..9634c16d7bc0 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1153,28 +1153,28 @@ static void nvme_tcp_data_ready(struct sock *sk)
trace_sk_data_ready(sk);
- read_lock_bh(&sk->sk_callback_lock);
- queue = sk->sk_user_data;
+ rcu_read_lock();
+ queue = rcu_dereference_sk_user_data(sk);
if (likely(queue && queue->rd_enabled) &&
!test_bit(NVME_TCP_Q_POLLING, &queue->flags)) {
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
queue->data_ready_cnt++;
}
- read_unlock_bh(&sk->sk_callback_lock);
+ rcu_read_unlock();
}
static void nvme_tcp_write_space(struct sock *sk)
{
struct nvme_tcp_queue *queue;
- read_lock_bh(&sk->sk_callback_lock);
- queue = sk->sk_user_data;
+ rcu_read_lock();
+ queue = rcu_dereference_sk_user_data(sk);
if (likely(queue && sk_stream_is_writeable(sk))) {
clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
queue->write_space_cnt++;
}
- read_unlock_bh(&sk->sk_callback_lock);
+ rcu_read_unlock();
}
static void nvme_tcp_state_change(struct sock *sk)
@@ -2076,6 +2076,7 @@ static void nvme_tcp_restore_sock_ops(struct nvme_tcp_queue *queue)
sock->sk->sk_state_change = queue->state_change;
sock->sk->sk_write_space = queue->write_space;
write_unlock_bh(&sock->sk->sk_callback_lock);
+ synchronize_rcu();
}
static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
@@ -2115,6 +2116,7 @@ static void nvme_tcp_setup_sock_ops(struct nvme_tcp_queue *queue)
queue->sock->sk->sk_ll_usec = 1;
#endif
write_unlock_bh(&queue->sock->sk->sk_callback_lock);
+ synchronize_rcu();
}
static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
--
2.35.3
^ permalink raw reply related [flat|nested] 24+ messages in thread* Re: [PATCH 6/8] nvme-tcp: reduce callback lock contention
2024-07-16 7:36 ` [PATCH 6/8] nvme-tcp: reduce callback lock contention Hannes Reinecke
@ 2024-07-17 21:19 ` Sagi Grimberg
2024-07-18 6:42 ` Hannes Reinecke
0 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2024-07-17 21:19 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig, netdev
Cc: Keith Busch, linux-nvme, Hannes Reinecke
On 16/07/2024 10:36, Hannes Reinecke wrote:
> From: Hannes Reinecke <hare@suse.de>
>
> We have heavily queued tx and rx flows, so callbacks might happen
> at the same time. As the callbacks influence the state machine we
> really should remove contention here to not impact I/O performance.
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
> drivers/nvme/host/tcp.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index a758fbb3f9bb..9634c16d7bc0 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1153,28 +1153,28 @@ static void nvme_tcp_data_ready(struct sock *sk)
>
> trace_sk_data_ready(sk);
>
> - read_lock_bh(&sk->sk_callback_lock);
> - queue = sk->sk_user_data;
> + rcu_read_lock();
> + queue = rcu_dereference_sk_user_data(sk);
> if (likely(queue && queue->rd_enabled) &&
> !test_bit(NVME_TCP_Q_POLLING, &queue->flags)) {
> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> queue->data_ready_cnt++;
> }
> - read_unlock_bh(&sk->sk_callback_lock);
> + rcu_read_unlock();
Umm, this looks dangerous...
Please give a concrete (numeric) justification for this change, and
preferably a big fat comment on why it is safe to do (for either
.data_ready or .write_space).
Is there any precedent for another tcp ulp doing this? I'd like to have
the netdev folks review this change. CC'ing netdev.
> }
>
> static void nvme_tcp_write_space(struct sock *sk)
> {
> struct nvme_tcp_queue *queue;
>
> - read_lock_bh(&sk->sk_callback_lock);
> - queue = sk->sk_user_data;
> + rcu_read_lock();
> + queue = rcu_dereference_sk_user_data(sk);
> if (likely(queue && sk_stream_is_writeable(sk))) {
> clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> queue->write_space_cnt++;
> }
> - read_unlock_bh(&sk->sk_callback_lock);
> + rcu_read_unlock();
> }
>
> static void nvme_tcp_state_change(struct sock *sk)
> @@ -2076,6 +2076,7 @@ static void nvme_tcp_restore_sock_ops(struct nvme_tcp_queue *queue)
> sock->sk->sk_state_change = queue->state_change;
> sock->sk->sk_write_space = queue->write_space;
> write_unlock_bh(&sock->sk->sk_callback_lock);
> + synchronize_rcu();
> }
>
> static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
> @@ -2115,6 +2116,7 @@ static void nvme_tcp_setup_sock_ops(struct nvme_tcp_queue *queue)
> queue->sock->sk->sk_ll_usec = 1;
> #endif
> write_unlock_bh(&queue->sock->sk->sk_callback_lock);
> + synchronize_rcu();
> }
>
> static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 6/8] nvme-tcp: reduce callback lock contention
2024-07-17 21:19 ` Sagi Grimberg
@ 2024-07-18 6:42 ` Hannes Reinecke
2024-07-21 11:46 ` Sagi Grimberg
0 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2024-07-18 6:42 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig, netdev
Cc: Keith Busch, linux-nvme
On 7/17/24 23:19, Sagi Grimberg wrote:
>
>
> On 16/07/2024 10:36, Hannes Reinecke wrote:
>> From: Hannes Reinecke <hare@suse.de>
>>
>> We have heavily queued tx and rx flows, so callbacks might happen
>> at the same time. As the callbacks influence the state machine we
>> really should remove contention here to not impact I/O performance.
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>> drivers/nvme/host/tcp.c | 14 ++++++++------
>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index a758fbb3f9bb..9634c16d7bc0 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1153,28 +1153,28 @@ static void nvme_tcp_data_ready(struct sock *sk)
>> trace_sk_data_ready(sk);
>> - read_lock_bh(&sk->sk_callback_lock);
>> - queue = sk->sk_user_data;
>> + rcu_read_lock();
>> + queue = rcu_dereference_sk_user_data(sk);
>> if (likely(queue && queue->rd_enabled) &&
>> !test_bit(NVME_TCP_Q_POLLING, &queue->flags)) {
>> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>> queue->data_ready_cnt++;
>> }
>> - read_unlock_bh(&sk->sk_callback_lock);
>> + rcu_read_unlock();
>
> Umm, this looks dangerous...
>
> Please give a concrete (numeric) justification for this change, and
> preferably a big fat comment on why it is safe to do (for either
> .data_ready or .write_space).
>
> Is there any precedent for another tcp ulp doing this? I'd like to have
> the netdev folks review this change. CC'ing netdev.
>
Reasoning here is that the queue itself (and with that, the workqueue
element) will _not_ be deleted once we set 'sk_user_data' to NULL.
The shutdown sequence is:
kernel_sock_shutdown(queue->sock, SHUT_RDWR);
nvme_tcp_restore_sock_ops(queue);
cancel_work_sync(&queue->io_work);
So first we're shutting down the socket (which cancels all I/O
calls in io_work), then restoring the socket callbacks.
As these are rcu protected I'm calling synchronize_rcu() to
ensure all callbacks have left the rcu critical section on
exit.
As a final step we cancel all work, ie ensuring that
any action triggered by the callbacks has completed.
But sure, comment is fine.
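Something like this is what I have in mind for .data_ready (just a sketch,
stats counters omitted; it assumes setup/teardown also moves to
rcu_assign_sk_user_data(), which is not visible in the hunks above):

static void nvme_tcp_data_ready(struct sock *sk)
{
	struct nvme_tcp_queue *queue;

	trace_sk_data_ready(sk);

	/*
	 * sk_user_data is only cleared during queue teardown, and the
	 * queue (including io_work) is not freed before
	 * __nvme_tcp_stop_queue() has shut down the socket, restored the
	 * callbacks, run synchronize_rcu() and cancelled io_work. So any
	 * reader that observes a non-NULL sk_user_data inside this RCU
	 * read-side critical section can safely queue io_work.
	 */
	rcu_read_lock();
	queue = rcu_dereference_sk_user_data(sk);
	if (likely(queue && queue->rd_enabled) &&
	    !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
	rcu_read_unlock();
}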
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 6/8] nvme-tcp: reduce callback lock contention
2024-07-18 6:42 ` Hannes Reinecke
@ 2024-07-21 11:46 ` Sagi Grimberg
0 siblings, 0 replies; 24+ messages in thread
From: Sagi Grimberg @ 2024-07-21 11:46 UTC (permalink / raw)
To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig, netdev
Cc: Keith Busch, linux-nvme
On 18/07/2024 9:42, Hannes Reinecke wrote:
> On 7/17/24 23:19, Sagi Grimberg wrote:
>>
>>
>> On 16/07/2024 10:36, Hannes Reinecke wrote:
>>> From: Hannes Reinecke <hare@suse.de>
>>>
>>> We have heavily queued tx and rx flows, so callbacks might happen
>>> at the same time. As the callbacks influence the state machine we
>>> really should remove contention here to not impact I/O performance.
>>>
>>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>> ---
>>> drivers/nvme/host/tcp.c | 14 ++++++++------
>>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index a758fbb3f9bb..9634c16d7bc0 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1153,28 +1153,28 @@ static void nvme_tcp_data_ready(struct sock
>>> *sk)
>>> trace_sk_data_ready(sk);
>>> - read_lock_bh(&sk->sk_callback_lock);
>>> - queue = sk->sk_user_data;
>>> + rcu_read_lock();
>>> + queue = rcu_dereference_sk_user_data(sk);
>>> if (likely(queue && queue->rd_enabled) &&
>>> !test_bit(NVME_TCP_Q_POLLING, &queue->flags)) {
>>> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>> queue->data_ready_cnt++;
>>> }
>>> - read_unlock_bh(&sk->sk_callback_lock);
>>> + rcu_read_unlock();
>>
>> Umm, this looks dangerous...
>>
>> Please give a concrete (numeric) justification for this change, and
>> preferably a big fat comment on why it is safe to do (for either
>> .data_ready or .write_space).
>>
>> Is there any precedent for another tcp ulp doing this? I'd like to have
>> the netdev folks review this change. CC'ing netdev.
>>
> Reasoning here is that the queue itself (and with that, the workqueue
> element) will _not_ be deleted once we set 'sk_user_data' to NULL.
>
> The shutdown sequence is:
>
> kernel_sock_shutdown(queue->sock, SHUT_RDWR);
> nvme_tcp_restore_sock_ops(queue);
> cancel_work_sync(&queue->io_work);
>
> So first we're shutting down the socket (which cancels all I/O
> calls in io_work), then restore the socket callbacks.
> As these are rcu protected I'm calling synchronize_rcu() to
> ensure all callbacks have left the rcu-critical section on
> exit.
> At a final step we are cancelling all work, ie ensuring that
> any action triggered by the callbacks have completed.
>
> But sure, comment is fine.
I suggest that you audit all the accessors of this lock in the networking
subsystem before determining that it can be safely converted to an rcu read
critical section. I suspect that the underlying network stack assumes that
this lock is taken when the callback is invoked.
$ grep -rIn sk_callback_lock net/ | wc -l
122
$ grep -rIn sk_callback_lock kernel/ | wc -l
15
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 7/8] nvme-tcp: check for SOCK_NOSPACE before sending
2024-07-16 7:36 [PATCHv3 0/8] nvme-tcp: improve scalability Hannes Reinecke
` (5 preceding siblings ...)
2024-07-16 7:36 ` [PATCH 6/8] nvme-tcp: reduce callback lock contention Hannes Reinecke
@ 2024-07-16 7:36 ` Hannes Reinecke
2024-07-17 21:19 ` Sagi Grimberg
2024-07-16 7:36 ` [PATCH 8/8] nvme-tcp: align I/O cpu with blk-mq mapping Hannes Reinecke
2024-07-17 21:01 ` [PATCHv3 0/8] nvme-tcp: improve scalability Sagi Grimberg
8 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2024-07-16 7:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke,
Hannes Reinecke
From: Hannes Reinecke <hare@suse.de>
It's pointless to try to send if we don't have enough space; we'll
be notified by the 'write_space' callback anyway.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 9634c16d7bc0..f3a94168b2c3 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1424,7 +1424,11 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
}
req = queue->request;
+ if (test_bit(SOCK_NOSPACE, &queue->sock->flags))
+ return 0;
+
noreclaim_flag = memalloc_noreclaim_save();
+
if (req->state == NVME_TCP_SEND_CMD_PDU) {
ret = nvme_tcp_try_send_cmd_pdu(req);
if (ret <= 0)
--
2.35.3
^ permalink raw reply related [flat|nested] 24+ messages in thread* [PATCH 8/8] nvme-tcp: align I/O cpu with blk-mq mapping
2024-07-16 7:36 [PATCHv3 0/8] nvme-tcp: improve scalability Hannes Reinecke
` (6 preceding siblings ...)
2024-07-16 7:36 ` [PATCH 7/8] nvme-tcp: check for SOCK_NOSPACE before sending Hannes Reinecke
@ 2024-07-16 7:36 ` Hannes Reinecke
2024-07-17 21:34 ` Sagi Grimberg
2024-07-17 21:01 ` [PATCHv3 0/8] nvme-tcp: improve scalability Sagi Grimberg
8 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2024-07-16 7:36 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
We should align the 'io_cpu' setting with the blk-mq
cpu mapping to ensure that we're not bouncing threads
when doing I/O. To avoid cpu contention this patch also
adds an atomic counter for the number of queues on each
cpu to distribute the load across all CPUs in the blk-mq cpu set.
Additionally we should always set the 'io_cpu' value, as
in the WQ_UNBOUND case it'll be treated as a hint anyway.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 65 +++++++++++++++++++++++++++++++----------
1 file changed, 49 insertions(+), 16 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index f3a94168b2c3..a391a3f7c4d7 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -28,6 +28,8 @@
struct nvme_tcp_queue;
+static atomic_t nvme_tcp_cpu_queues[NR_CPUS];
+
/* Define the socket priority to use for connections were it is desirable
* that the NIC consider performing optimized packet processing or filtering.
* A non-zero value being sufficient to indicate general consideration of any
@@ -1799,20 +1801,42 @@ static bool nvme_tcp_poll_queue(struct nvme_tcp_queue *queue)
static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
{
struct nvme_tcp_ctrl *ctrl = queue->ctrl;
- int qid = nvme_tcp_queue_id(queue);
- int n = 0;
-
- if (nvme_tcp_default_queue(queue))
- n = qid - 1;
- else if (nvme_tcp_read_queue(queue))
- n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] - 1;
- else if (nvme_tcp_poll_queue(queue))
+ struct blk_mq_tag_set *set = &ctrl->tag_set;
+ int qid = nvme_tcp_queue_id(queue) - 1;
+ unsigned int *mq_map = NULL;
+ int n = 0, cpu, io_cpu, min_queues = WORK_CPU_UNBOUND;
+
+ if (nvme_tcp_default_queue(queue)) {
+ mq_map = set->map[HCTX_TYPE_DEFAULT].mq_map;
+ n = qid;
+ } else if (nvme_tcp_read_queue(queue)) {
+ mq_map = set->map[HCTX_TYPE_READ].mq_map;
+ n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT];
+ } else if (nvme_tcp_poll_queue(queue)) {
+ mq_map = set->map[HCTX_TYPE_POLL].mq_map;
n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] -
- ctrl->io_queues[HCTX_TYPE_READ] - 1;
- if (wq_unbound)
- queue->io_cpu = WORK_CPU_UNBOUND;
- else
- queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
+ ctrl->io_queues[HCTX_TYPE_READ];
+ }
+
+ if (WARN_ON(!mq_map))
+ return;
+ for_each_online_cpu(cpu) {
+ int num_queues;
+
+ if (mq_map[cpu] != qid)
+ continue;
+ num_queues = atomic_read(&nvme_tcp_cpu_queues[cpu]);
+ if (num_queues < min_queues) {
+ min_queues = num_queues;
+ io_cpu = cpu;
+ }
+ }
+ if (io_cpu != queue->io_cpu) {
+ queue->io_cpu = io_cpu;
+ atomic_inc(&nvme_tcp_cpu_queues[io_cpu]);
+ }
+ dev_dbg(ctrl->ctrl.device, "queue %d: using cpu %d\n",
+ qid, queue->io_cpu);
}
static void nvme_tcp_tls_done(void *data, int status, key_serial_t pskid)
@@ -1957,7 +1981,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
queue->sock->sk->sk_allocation = GFP_ATOMIC;
queue->sock->sk->sk_use_task_frag = false;
- nvme_tcp_set_queue_io_cpu(queue);
+ queue->io_cpu = WORK_CPU_UNBOUND;
queue->request = NULL;
queue->data_remaining = 0;
queue->ddgst_remaining = 0;
@@ -2088,6 +2112,10 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
kernel_sock_shutdown(queue->sock, SHUT_RDWR);
nvme_tcp_restore_sock_ops(queue);
cancel_work_sync(&queue->io_work);
+ if (queue->io_cpu != WORK_CPU_UNBOUND) {
+ atomic_dec(&nvme_tcp_cpu_queues[queue->io_cpu]);
+ queue->io_cpu = WORK_CPU_UNBOUND;
+ }
}
static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
@@ -2133,9 +2161,10 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
nvme_tcp_init_recv_ctx(queue);
nvme_tcp_setup_sock_ops(queue);
- if (idx)
+ if (idx) {
+ nvme_tcp_set_queue_io_cpu(queue);
ret = nvmf_connect_io_queue(nctrl, idx);
- else
+ } else
ret = nvmf_connect_admin_queue(nctrl);
if (!ret) {
@@ -3179,6 +3208,7 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
static int __init nvme_tcp_init_module(void)
{
unsigned int wq_flags = WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_SYSFS;
+ int cpu;
BUILD_BUG_ON(sizeof(struct nvme_tcp_hdr) != 8);
BUILD_BUG_ON(sizeof(struct nvme_tcp_cmd_pdu) != 72);
@@ -3200,6 +3230,9 @@ static int __init nvme_tcp_init_module(void)
if (!nvme_tcp_debugfs)
return -ENOMEM;
+ for_each_possible_cpu(cpu)
+ atomic_set(&nvme_tcp_cpu_queues[cpu], 0);
+
nvmf_register_transport(&nvme_tcp_transport);
return 0;
}
--
2.35.3
^ permalink raw reply related [flat|nested] 24+ messages in thread* Re: [PATCH 8/8] nvme-tcp: align I/O cpu with blk-mq mapping
2024-07-16 7:36 ` [PATCH 8/8] nvme-tcp: align I/O cpu with blk-mq mapping Hannes Reinecke
@ 2024-07-17 21:34 ` Sagi Grimberg
2024-08-13 19:36 ` Sagi Grimberg
0 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2024-07-17 21:34 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 16/07/2024 10:36, Hannes Reinecke wrote:
> We should align the 'io_cpu' setting with the blk-mq
> cpu mapping to ensure that we're not bouncing threads
> when doing I/O. To avoid cpu contention this patch also
> adds an atomic counter for the number of queues on each
> cpu to distribute the load across all CPUs in the blk-mq cpu set.
> Additionally we should always set the 'io_cpu' value, as
> in the WQ_UNBOUND case it'll be treated as a hint anyway.
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
> drivers/nvme/host/tcp.c | 65 +++++++++++++++++++++++++++++++----------
> 1 file changed, 49 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index f3a94168b2c3..a391a3f7c4d7 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -28,6 +28,8 @@
>
> struct nvme_tcp_queue;
>
> +static atomic_t nvme_tcp_cpu_queues[NR_CPUS];
> +
> /* Define the socket priority to use for connections were it is desirable
> * that the NIC consider performing optimized packet processing or filtering.
> * A non-zero value being sufficient to indicate general consideration of any
> @@ -1799,20 +1801,42 @@ static bool nvme_tcp_poll_queue(struct nvme_tcp_queue *queue)
> static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
> {
> struct nvme_tcp_ctrl *ctrl = queue->ctrl;
> - int qid = nvme_tcp_queue_id(queue);
> - int n = 0;
> -
> - if (nvme_tcp_default_queue(queue))
> - n = qid - 1;
> - else if (nvme_tcp_read_queue(queue))
> - n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] - 1;
> - else if (nvme_tcp_poll_queue(queue))
> + struct blk_mq_tag_set *set = &ctrl->tag_set;
> + int qid = nvme_tcp_queue_id(queue) - 1;
> + unsigned int *mq_map = NULL;
> + int n = 0, cpu, io_cpu, min_queues = WORK_CPU_UNBOUND;
Again, min_queues is a minimum quantity, not an id. It makes zero sense
to use WORK_CPU_UNBOUND as the initializer. Just set it to INT_MAX or
something.
> +
> + if (nvme_tcp_default_queue(queue)) {
> + mq_map = set->map[HCTX_TYPE_DEFAULT].mq_map;
> + n = qid;
> + } else if (nvme_tcp_read_queue(queue)) {
> + mq_map = set->map[HCTX_TYPE_READ].mq_map;
> + n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT];
> + } else if (nvme_tcp_poll_queue(queue)) {
> + mq_map = set->map[HCTX_TYPE_POLL].mq_map;
> n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] -
> - ctrl->io_queues[HCTX_TYPE_READ] - 1;
> - if (wq_unbound)
> - queue->io_cpu = WORK_CPU_UNBOUND;
> - else
> - queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
> + ctrl->io_queues[HCTX_TYPE_READ];
> + }
> +
> + if (WARN_ON(!mq_map))
> + return;
> + for_each_online_cpu(cpu) {
> + int num_queues;
> +
> + if (mq_map[cpu] != qid)
> + continue;
> + num_queues = atomic_read(&nvme_tcp_cpu_queues[cpu]);
> + if (num_queues < min_queues) {
> + min_queues = num_queues;
> + io_cpu = cpu;
> + }
> + }
> + if (io_cpu != queue->io_cpu) {
> + queue->io_cpu = io_cpu;
Hannes, the code may make sense to you, but not to me.
Please do not add code like:
if (a != b) {
b = a;
}
I think it is a sign that we are doing something wrong here.
> + atomic_inc(&nvme_tcp_cpu_queues[io_cpu]);
> + }
Again, why can't we always set io_cpu and increment the counter?
If the wq is unbound, it makes no difference, and if the wq is bound, that
is actually what you want to do. What am I missing?
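Something along these lines is what I'm thinking of (an untested sketch;
min_queues starting at INT_MAX and io_cpu defaulting to WORK_CPU_UNBOUND so
we never index nvme_tcp_cpu_queues[] out of bounds when no online cpu maps
to the qid):

	int cpu, io_cpu = WORK_CPU_UNBOUND, min_queues = INT_MAX;

	if (WARN_ON(!mq_map))
		return;
	for_each_online_cpu(cpu) {
		int num_queues;

		if (mq_map[cpu] != qid)
			continue;
		num_queues = atomic_read(&nvme_tcp_cpu_queues[cpu]);
		if (num_queues < min_queues) {
			min_queues = num_queues;
			io_cpu = cpu;
		}
	}
	/* always record the choice; for an unbound wq it is only a hint */
	queue->io_cpu = io_cpu;
	if (io_cpu != WORK_CPU_UNBOUND)
		atomic_inc(&nvme_tcp_cpu_queues[io_cpu]);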
> + dev_dbg(ctrl->ctrl.device, "queue %d: using cpu %d\n",
> + qid, queue->io_cpu);
> }
>
> static void nvme_tcp_tls_done(void *data, int status, key_serial_t pskid)
> @@ -1957,7 +1981,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
>
> queue->sock->sk->sk_allocation = GFP_ATOMIC;
> queue->sock->sk->sk_use_task_frag = false;
> - nvme_tcp_set_queue_io_cpu(queue);
> + queue->io_cpu = WORK_CPU_UNBOUND;
> queue->request = NULL;
> queue->data_remaining = 0;
> queue->ddgst_remaining = 0;
> @@ -2088,6 +2112,10 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
> kernel_sock_shutdown(queue->sock, SHUT_RDWR);
> nvme_tcp_restore_sock_ops(queue);
> cancel_work_sync(&queue->io_work);
> + if (queue->io_cpu != WORK_CPU_UNBOUND) {
I think that we can safely always set queue->io_cpu to a cpu. If the
unbound_wq only operates on a subset of the cores, it doesn't matter
anyway...
The rest of the patch looks good though.
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCH 8/8] nvme-tcp: align I/O cpu with blk-mq mapping
2024-07-17 21:34 ` Sagi Grimberg
@ 2024-08-13 19:36 ` Sagi Grimberg
0 siblings, 0 replies; 24+ messages in thread
From: Sagi Grimberg @ 2024-08-13 19:36 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 18/07/2024 0:34, Sagi Grimberg wrote:
>
>
> On 16/07/2024 10:36, Hannes Reinecke wrote:
>> We should align the 'io_cpu' setting with the blk-mq
>> cpu mapping to ensure that we're not bouncing threads
>> when doing I/O. To avoid cpu contention this patch also
>> adds an atomic counter for the number of queues on each
>> cpu to distribute the load across all CPUs in the blk-mq cpu set.
>> Additionally we should always set the 'io_cpu' value, as
>> in the WQ_UNBOUND case it'll be treated as a hint anyway.
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>> drivers/nvme/host/tcp.c | 65 +++++++++++++++++++++++++++++++----------
>> 1 file changed, 49 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index f3a94168b2c3..a391a3f7c4d7 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -28,6 +28,8 @@
>> struct nvme_tcp_queue;
>> +static atomic_t nvme_tcp_cpu_queues[NR_CPUS];
>> +
>> /* Define the socket priority to use for connections were it is
>> desirable
>> * that the NIC consider performing optimized packet processing or
>> filtering.
>> * A non-zero value being sufficient to indicate general
>> consideration of any
>> @@ -1799,20 +1801,42 @@ static bool nvme_tcp_poll_queue(struct
>> nvme_tcp_queue *queue)
>> static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
>> {
>> struct nvme_tcp_ctrl *ctrl = queue->ctrl;
>> - int qid = nvme_tcp_queue_id(queue);
>> - int n = 0;
>> -
>> - if (nvme_tcp_default_queue(queue))
>> - n = qid - 1;
>> - else if (nvme_tcp_read_queue(queue))
>> - n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] - 1;
>> - else if (nvme_tcp_poll_queue(queue))
>> + struct blk_mq_tag_set *set = &ctrl->tag_set;
>> + int qid = nvme_tcp_queue_id(queue) - 1;
>> + unsigned int *mq_map = NULL;
>> + int n = 0, cpu, io_cpu, min_queues = WORK_CPU_UNBOUND;
>
> Again, min_queues is a minimum quantity, not an id. It makes zero sense
> to use WORK_CPU_UNBOUND as the initializer. Just set it to INT_MAX or
> something.
>> +
>> + if (nvme_tcp_default_queue(queue)) {
>> + mq_map = set->map[HCTX_TYPE_DEFAULT].mq_map;
>> + n = qid;
>> + } else if (nvme_tcp_read_queue(queue)) {
>> + mq_map = set->map[HCTX_TYPE_READ].mq_map;
>> + n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT];
>> + } else if (nvme_tcp_poll_queue(queue)) {
>> + mq_map = set->map[HCTX_TYPE_POLL].mq_map;
>> n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] -
>> - ctrl->io_queues[HCTX_TYPE_READ] - 1;
>> - if (wq_unbound)
>> - queue->io_cpu = WORK_CPU_UNBOUND;
>> - else
>> - queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask,
>> -1, false);
>> + ctrl->io_queues[HCTX_TYPE_READ];
>> + }
>> +
>> + if (WARN_ON(!mq_map))
>> + return;
>> + for_each_online_cpu(cpu) {
>> + int num_queues;
>> +
>> + if (mq_map[cpu] != qid)
>> + continue;
>> + num_queues = atomic_read(&nvme_tcp_cpu_queues[cpu]);
>> + if (num_queues < min_queues) {
>> + min_queues = num_queues;
>> + io_cpu = cpu;
>> + }
>> + }
>> + if (io_cpu != queue->io_cpu) {
>> + queue->io_cpu = io_cpu;
>
> Hannes, the code may make sense to you, but not to me.
> Please do not add code like:
> if (a != b) {
> b = a;
> }
>
> I think it is a sign that we are doing something wrong here.
>
>> + atomic_inc(&nvme_tcp_cpu_queues[io_cpu]);
>> + }
>
> Again, why can't we always set io_cpu and increment the counter?
> If the wq is unbound, it makes no difference, and if the wq is bound,
> that
> is actually what you want to do. What am I missing?
>
>> + dev_dbg(ctrl->ctrl.device, "queue %d: using cpu %d\n",
>> + qid, queue->io_cpu);
>> }
>> static void nvme_tcp_tls_done(void *data, int status,
>> key_serial_t pskid)
>> @@ -1957,7 +1981,7 @@ static int nvme_tcp_alloc_queue(struct
>> nvme_ctrl *nctrl, int qid,
>> queue->sock->sk->sk_allocation = GFP_ATOMIC;
>> queue->sock->sk->sk_use_task_frag = false;
>> - nvme_tcp_set_queue_io_cpu(queue);
>> + queue->io_cpu = WORK_CPU_UNBOUND;
>> queue->request = NULL;
>> queue->data_remaining = 0;
>> queue->ddgst_remaining = 0;
>> @@ -2088,6 +2112,10 @@ static void __nvme_tcp_stop_queue(struct
>> nvme_tcp_queue *queue)
>> kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>> nvme_tcp_restore_sock_ops(queue);
>> cancel_work_sync(&queue->io_work);
>> + if (queue->io_cpu != WORK_CPU_UNBOUND) {
>
> I think that we can safely always set queue->io_cpu to a cpu. If the
> unbound_wq only operates on a subset of the cores, it doesn't matter
> anyway...
>
> The rest of the patch looks good though.
Hey Hannes,
From this series, I think this is the one patch that we both agree should
be addressed. The idea that we want to spread out multiple controllers/queues
over multiple cpu cores is correct.
How about addressing the comments on this patch and splitting it from the
series until we have more info on the rest?
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCHv3 0/8] nvme-tcp: improve scalability
2024-07-16 7:36 [PATCHv3 0/8] nvme-tcp: improve scalability Hannes Reinecke
` (7 preceding siblings ...)
2024-07-16 7:36 ` [PATCH 8/8] nvme-tcp: align I/O cpu with blk-mq mapping Hannes Reinecke
@ 2024-07-17 21:01 ` Sagi Grimberg
2024-07-18 6:20 ` Hannes Reinecke
8 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2024-07-17 21:01 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 16/07/2024 10:36, Hannes Reinecke wrote:
> Hi all,
>
> for workloads with a lot of controllers we run into workqueue contention,
> where the single workqueue is not able to service requests fast enough,
> leading to spurious I/O errors and connect resets during high load.
> One culprit here was lock contention on the callbacks, where we
> acquired the 'sk_callback_lock()' on every callback. As we are dealing
> with parallel rx and tx flows this induces quite a lot of contention.
> I have also added instrumentation to analyse I/O flows, by adding
> I/O stall debug messages and debugfs entries to display
> detailed statistics for each queue.
Hannes, I'm getting really confused with this...
Once again you submit a set that is a different direction almost
entirely from v1 and v2... Again without quantifying what each
change is giving us, and it makes it very hard to review and understand.
I suggest we split the changes that have consensus to a separate series
(still state what each change gets us), and understand better the rest...
>
> All performance number are derived from the 'tiobench-example.fio'
> sample from the fio sources, running on a 96 core machine with one,
> two, or four subsystem and two paths, each path exposing 32 queues.
> Backend is nvmet using an Intel DC P3700 NVMe SSD.
The patchset in v1 started by stating a performance issue when
controllers have a limited number of queues, does this test case
represent the original issue?
>
> write performance:
> baseline:
> 1 subsys, 4k seq: bw=523MiB/s (548MB/s), 16.3MiB/s-19.0MiB/s (17.1MB/s-20.0MB/s)
> 1 subsys, 4k rand: bw=502MiB/s (526MB/s), 15.7MiB/s-21.5MiB/s (16.4MB/s-22.5MB/s)
> 2 subsys, 4k seq: bw=420MiB/s (440MB/s), 2804KiB/s-4790KiB/s (2871kB/s-4905kB/s)
> 2 subsys, 4k rand: bw=416MiB/s (436MB/s), 2814KiB/s-5503KiB/s (2881kB/s-5635kB/s)
> 4 subsys, 4k seq: bw=409MiB/s (429MB/s), 1990KiB/s-8396KiB/s (2038kB/s-8598kB/s)
> 4 subsys, 4k rand: bw=386MiB/s (405MB/s), 2024KiB/s-6314KiB/s (2072kB/s-6466kB/s)
>
> patched:
> 1 subsys, 4k seq: bw=440MiB/s (461MB/s), 13.7MiB/s-16.1MiB/s (14.4MB/s-16.8MB/s)
> 1 subsys, 4k rand: bw=427MiB/s (448MB/s), 13.4MiB/s-16.2MiB/s (13.0MB/s-16.0MB/s)
That is a substantial degradation. I also keep asking: what does null_blk
look like?
> 2 subsys, 4k seq: bw=506MiB/s (531MB/s), 3581KiB/s-4493KiB/s (3667kB/s-4601kB/s)
> 2 subsys, 4k rand: bw=494MiB/s (518MB/s), 3630KiB/s-4421KiB/s (3717kB/s-4528kB/s)
> 4 subsys, 4k seq: bw=457MiB/s (479MB/s), 2564KiB/s-8297KiB/s (2625kB/s-8496kB/s)
> 4 subsys, 4k rand: bw=424MiB/s (444MB/s), 2509KiB/s-9414KiB/s (2570kB/s-9640kB/s)
There is still an observed degradation when moving from 2 to 4
subsystems, what is the
cause of it?
>
> read performance:
> baseline:
> 1 subsys, 4k seq: bw=389MiB/s (408MB/s), 12.2MiB/s-18.1MiB/s (12.7MB/s-18.0MB/s)
> 1 subsys, 4k rand: bw=430MiB/s (451MB/s), 13.5MiB/s-19.2MiB/s (14.1MB/s-20.2MB/s)
> 2 subsys, 4k seq: bw=377MiB/s (395MB/s), 2603KiB/s-3987KiB/s (2666kB/s-4083kB/s)
> 2 subsys, 4k rand: bw=377MiB/s (395MB/s), 2431KiB/s-5403KiB/s (2489kB/s-5533kB/s)
> 4 subsys, 4k seq: bw=139MiB/s (146MB/s), 197KiB/s-11.1MiB/s (202kB/s-11.6MB/s)
> 4 subsys, 4k rand: bw=352MiB/s (369MB/s), 1360KiB/s-13.9MiB/s (1392kB/s-14.6MB/s)
>
> patched:
> 1 subsys, 4k seq: bw=405MiB/s (425MB/s), 2.7MiB/s-14.7MiB/s (13.3MB/s-15.4MB/s)
> 1 subsys, 4k rand: bw=427MiB/s (447MB/s), 13.3MiB/s-16.1MiB/s (13.0MB/s-16.9MB/s)
> 2 subsys, 4k seq: bw=411MiB/s (431MB/s), 2462KiB/s-4523KiB/s (2522kB/s-4632kB/s)
> 2 subsys, 4k rand: bw=392MiB/s (411MB/s), 2258KiB/s-4220KiB/s (2312kB/s-4321kB/s)
> 4 subsys, 4k seq: bw=378MiB/s (397MB/s), 1859KiB/s-8110KiB/s (1904kB/s-8305kB/s)
> 4 subsys, 4k rand: bw=326MiB/s (342MB/s), 1781KiB/s-4499KiB/s (1823kB/s-4607kB/s)
Same question here, your patches do not seem to eliminate the overall
loss of efficiency.
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [PATCHv3 0/8] nvme-tcp: improve scalability
2024-07-17 21:01 ` [PATCHv3 0/8] nvme-tcp: improve scalability Sagi Grimberg
@ 2024-07-18 6:20 ` Hannes Reinecke
2024-07-21 12:05 ` Sagi Grimberg
0 siblings, 1 reply; 24+ messages in thread
From: Hannes Reinecke @ 2024-07-18 6:20 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 7/17/24 23:01, Sagi Grimberg wrote:
>
>
> On 16/07/2024 10:36, Hannes Reinecke wrote:
>> Hi all,
>>
>> for workloads with a lot of controllers we run into workqueue contention,
>> where the single workqueue is not able to service requests fast enough,
>> leading to spurious I/O errors and connect resets during high load.
>> One culprit here was lock contention on the callbacks, where we
>> acquired the 'sk_callback_lock()' on every callback. As we are dealing
>> with parallel rx and tx flows this induces quite a lot of contention.
>> I have also added instrumentation to analyse I/O flows, by adding
>> I/O stall debug messages and debugfs entries to display
>> detailed statistics for each queue.
>
> Hannes, I'm getting really confused with this...
>
> Once again you submit a set that is a different direction almost
> entirely from v1 and v2... Again without quantifying what each
> change is giving us, and it makes it very hard to review and understand.
>
I fully concur. But the previous patchsets turned out not to give a
substantial improvement when scaling up.
And getting reliable performance numbers is _really_ hard, as there is
quite a high fluctuation in them. This was the reason for including
the statistics patches; there we get direct insight into the I/O path
latency, and can directly measure the impact of the changes.
And that's also how I found the lock contention on the callbacks...
> I suggest we split the changes that have consensus to a separate series
> (still state what each change gets us), and understand better the rest...
>
The patchset here is (apart from the statistics patches) just the one
removing lock contention on callbacks (which _really_ were causing
issues), and the alignment with blk-mq, for which I do see an improvement.
All other patches posted previously turned out to increase the latency
(thanks to the statistics patches), so I left them out from this round.
>>
>> All performance number are derived from the 'tiobench-example.fio'
>> sample from the fio sources, running on a 96 core machine with one,
>> two, or four subsystem and two paths, each path exposing 32 queues.
>> Backend is nvmet using an Intel DC P3700 NVMe SSD.
>
> The patchset in v1 started by stating a performance issue when
> controllers have a limited number of queues, does this test case
> represent the original issue?
>
Oh, but it does. The entire test is run on a machine with 96 cores.
>>
>> write performance:
>> baseline:
>> 1 subsys, 4k seq: bw=523MiB/s (548MB/s), 16.3MiB/s-19.0MiB/s
>> (17.1MB/s-20.0MB/s)
>> 1 subsys, 4k rand: bw=502MiB/s (526MB/s), 15.7MiB/s-21.5MiB/s
>> (16.4MB/s-22.5MB/s)
>> 2 subsys, 4k seq: bw=420MiB/s (440MB/s), 2804KiB/s-4790KiB/s
>> (2871kB/s-4905kB/s)
>> 2 subsys, 4k rand: bw=416MiB/s (436MB/s), 2814KiB/s-5503KiB/s
>> (2881kB/s-5635kB/s)
>> 4 subsys, 4k seq: bw=409MiB/s (429MB/s), 1990KiB/s-8396KiB/s
>> (2038kB/s-8598kB/s)
>> 4 subsys, 4k rand: bw=386MiB/s (405MB/s), 2024KiB/s-6314KiB/s
>> (2072kB/s-6466kB/s)
>>
>> patched:
>> 1 subsys, 4k seq: bw=440MiB/s (461MB/s), 13.7MiB/s-16.1MiB/s
>> (14.4MB/s-16.8MB/s)
>> 1 subsys, 4k rand: bw=427MiB/s (448MB/s), 13.4MiB/s-16.2MiB/s
>> (13.0MB/s-16.0MB/s)
>
> That is a substantial degradation. I also keep asking: what does null_blk
> look like?
>
Tested, and doesn't make a difference. Similar numbers.
Surprisingly, but there you are.
>> 2 subsys, 4k seq: bw=506MiB/s (531MB/s), 3581KiB/s-4493KiB/s
>> (3667kB/s-4601kB/s)
>> 2 subsys, 4k rand: bw=494MiB/s (518MB/s), 3630KiB/s-4421KiB/s
>> (3717kB/s-4528kB/s)
>> 4 subsys, 4k seq: bw=457MiB/s (479MB/s), 2564KiB/s-8297KiB/s
>> (2625kB/s-8496kB/s)
>> 4 subsys, 4k rand: bw=424MiB/s (444MB/s), 2509KiB/s-9414KiB/s
>> (2570kB/s-9640kB/s)
>
> There is still an observed degradation when moving from 2 to 4
> subsystems, what is the cause of it?
>
All subsystems are running over the same 10GigE link, so some performance
degradation is to be expected as contention on the link increases.
>>
>> read performance:
>> baseline:
>> 1 subsys, 4k seq: bw=389MiB/s (408MB/s), 12.2MiB/s-18.1MiB/s
>> (12.7MB/s-18.0MB/s)
>> 1 subsys, 4k rand: bw=430MiB/s (451MB/s), 13.5MiB/s-19.2MiB/s
>> (14.1MB/s-20.2MB/s)
>> 2 subsys, 4k seq: bw=377MiB/s (395MB/s), 2603KiB/s-3987KiB/s
>> (2666kB/s-4083kB/s)
>> 2 subsys, 4k rand: bw=377MiB/s (395MB/s), 2431KiB/s-5403KiB/s
>> (2489kB/s-5533kB/s)
>> 4 subsys, 4k seq: bw=139MiB/s (146MB/s), 197KiB/s-11.1MiB/s
>> (202kB/s-11.6MB/s)
>> 4 subsys, 4k rand: bw=352MiB/s (369MB/s), 1360KiB/s-13.9MiB/s
>> (1392kB/s-14.6MB/s)
>>
>> patched:
>> 1 subsys, 4k seq: bw=405MiB/s (425MB/s), 2.7MiB/s-14.7MiB/s
>> (13.3MB/s-15.4MB/s)
>> 1 subsys, 4k rand: bw=427MiB/s (447MB/s), 13.3MiB/s-16.1MiB/s
>> (13.0MB/s-16.9MB/s)
>> 2 subsys, 4k seq: bw=411MiB/s (431MB/s), 2462KiB/s-4523KiB/s
>> (2522kB/s-4632kB/s)
>> 2 subsys, 4k rand: bw=392MiB/s (411MB/s), 2258KiB/s-4220KiB/s
>> (2312kB/s-4321kB/s)
>> 4 subsys, 4k seq: bw=378MiB/s (397MB/s), 1859KiB/s-8110KiB/s
>> (1904kB/s-8305kB/s)
>> 4 subsys, 4k rand: bw=326MiB/s (342MB/s), 1781KiB/s-4499KiB/s
>> (1823kB/s-4607kB/s)
>
> Same question here, your patches do not seem to eliminate the overall
> loss of efficiency.
Never claimed that, and really I can't see that we can.
All subsystems are running over the same link, so we have to push more
independent frames across it, and we will suffer from higher contention
here. A performance degradation when scaling up subsystems is unavoidable.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCHv3 0/8] nvme-tcp: improve scalability
2024-07-18 6:20 ` Hannes Reinecke
@ 2024-07-21 12:05 ` Sagi Grimberg
0 siblings, 0 replies; 24+ messages in thread
From: Sagi Grimberg @ 2024-07-21 12:05 UTC (permalink / raw)
To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme
On 18/07/2024 9:20, Hannes Reinecke wrote:
> On 7/17/24 23:01, Sagi Grimberg wrote:
>>
>>
>> On 16/07/2024 10:36, Hannes Reinecke wrote:
>>> Hi all,
>>>
>>> for workloads with a lot of controllers we run into workqueue
>>> contention,
>>> where the single workqueue is not able to service requests fast enough,
>>> leading to spurious I/O errors and connect resets during high load.
>>> One culprit here was lock contention on the callbacks, where we
>>> acquired the 'sk_callback_lock()' on every callback. As we are dealing
>>> with parallel rx and tx flows this induces quite a lot of contention.
>>> I have also added instrumentation to analyse I/O flows, by adding
>>> I/O stall debug messages and debugfs entries to display
>>> detailed statistics for each queue.
>>
>> Hannes, I'm getting really confused with this...
>>
>> Once again you submit a set that is a different direction almost
>> entirely from v1 and v2... Again without quantifying what each
>> change is giving us, and it makes it very hard to review and understand.
>>
> I fully concur. But the previous patchsets turned out not to give a
> substantial improvement when scaling up.
I thought they did, at least the numbers you listed did (to some extent
afaict).
> And getting reliable performance numbers is _really_ hard, as there is
> quite a high fluctuation in them. This was the reason for including
> the statistics patches; there we get direct insight into the I/O path
> latency, and can directly measure the impact of the changes.
>
> And that's also how I found the lock contention on the callbacks...
I am still not clear at all that we can simply omit this lock. I would like
the networking folks to tell us if it is safe to do, and an audit of whether
any other socket consumer does this (and if not, should they?)
>
>> I suggest we split the changes that have consensus to a separate series
>> (still state what each change gets us), and understand better the
>> rest...
>>
> The patchset here is (apart from the statistics patches) just the one
> removing lock contention on callbacks (which _really_ were causing
> issues), and the alignment with blk-mq, for which I do see an
> improvement.
Let's move forward with the blk-mq alignment patches (after addressing
the review comments).
Then we can continue to hunt this down.
>
> All other patches posted previously turned out to increase the latency
> (thanks to the statistics patches), so I left them out from this round.
I think that the cover letter should call that out. It does not reflect
your previous posting.
So there is no mutual interference between rx and tx? at all?
>
>>>
>>> All performance number are derived from the 'tiobench-example.fio'
>>> sample from the fio sources, running on a 96 core machine with one,
>>> two, or four subsystem and two paths, each path exposing 32 queues.
>>> Backend is nvmet using an Intel DC P3700 NVMe SSD.
>>
>> The patchset in v1 started by stating a performance issue when
>> controllers have a limited number of queues, does this test case
>> represent the original issue?
>>
> Oh, but it does. The entire test is run on a machine with 96 cores.
>
>>>
>>> write performance:
>>> baseline:
>>> 1 subsys, 4k seq: bw=523MiB/s (548MB/s), 16.3MiB/s-19.0MiB/s
>>> (17.1MB/s-20.0MB/s)
>>> 1 subsys, 4k rand: bw=502MiB/s (526MB/s), 15.7MiB/s-21.5MiB/s
>>> (16.4MB/s-22.5MB/s)
>>> 2 subsys, 4k seq: bw=420MiB/s (440MB/s), 2804KiB/s-4790KiB/s
>>> (2871kB/s-4905kB/s)
>>> 2 subsys, 4k rand: bw=416MiB/s (436MB/s), 2814KiB/s-5503KiB/s
>>> (2881kB/s-5635kB/s)
>>> 4 subsys, 4k seq: bw=409MiB/s (429MB/s), 1990KiB/s-8396KiB/s
>>> (2038kB/s-8598kB/s)
>>> 4 subsys, 4k rand: bw=386MiB/s (405MB/s), 2024KiB/s-6314KiB/s
>>> (2072kB/s-6466kB/s)
>>>
>>> patched:
>>> 1 subsys, 4k seq: bw=440MiB/s (461MB/s), 13.7MiB/s-16.1MiB/s
>>> (14.4MB/s-16.8MB/s)
>>> 1 subsys, 4k rand: bw=427MiB/s (448MB/s), 13.4MiB/s-16.2MiB/s
>>> (13.0MB/s-16.0MB/s)
>>
>> That is a substantial degradation. I also keep asking: what does
>> null_blk look like?
>>
> Tested, and doesn't make a difference. Similar numbers.
> Surprisingly, but there you are.
So in effect you should be reaching ~1+GB/s throughput in this test and HW,
and yet you see less than half of it, even with null_blk?
Is there a configuration in which you _are_ able to saturate the wire?
>
>>> 2 subsys, 4k seq: bw=506MiB/s (531MB/s), 3581KiB/s-4493KiB/s
>>> (3667kB/s-4601kB/s)
>>> 2 subsys, 4k rand: bw=494MiB/s (518MB/s), 3630KiB/s-4421KiB/s
>>> (3717kB/s-4528kB/s)
>>> 4 subsys, 4k seq: bw=457MiB/s (479MB/s), 2564KiB/s-8297KiB/s
>>> (2625kB/s-8496kB/s)
>>> 4 subsys, 4k rand: bw=424MiB/s (444MB/s), 2509KiB/s-9414KiB/s
>>> (2570kB/s-9640kB/s)
>>
>> There is still an observed degradation when moving from 2 to 4
>> subsystems, what is the cause of it?
>>
> All subsystems are running over the same 10GigE link, so some performance
> degradation is to be expected as contention on the link increases.
Not sure why it's so expected. I mean, obviously it is the case now, but
why is it intrinsically expected?
>
>>>
>>> read performance:
>>> baseline:
>>> 1 subsys, 4k seq: bw=389MiB/s (408MB/s), 12.2MiB/s-18.1MiB/s
>>> (12.7MB/s-18.0MB/s)
>>> 1 subsys, 4k rand: bw=430MiB/s (451MB/s), 13.5MiB/s-19.2MiB/s
>>> (14.1MB/s-20.2MB/s)
>>> 2 subsys, 4k seq: bw=377MiB/s (395MB/s), 2603KiB/s-3987KiB/s
>>> (2666kB/s-4083kB/s)
>>> 2 subsys, 4k rand: bw=377MiB/s (395MB/s), 2431KiB/s-5403KiB/s
>>> (2489kB/s-5533kB/s)
>>> 4 subsys, 4k seq: bw=139MiB/s (146MB/s), 197KiB/s-11.1MiB/s
>>> (202kB/s-11.6MB/s)
>>> 4 subsys, 4k rand: bw=352MiB/s (369MB/s), 1360KiB/s-13.9MiB/s
>>> (1392kB/s-14.6MB/s)
>>>
>>> patched:
>>> 1 subsys, 4k seq: bw=405MiB/s (425MB/s), 2.7MiB/s-14.7MiB/s
>>> (13.3MB/s-15.4MB/s)
>>> 1 subsys, 4k rand: bw=427MiB/s (447MB/s), 13.3MiB/s-16.1MiB/s
>>> (13.0MB/s-16.9MB/s)
>>> 2 subsys, 4k seq: bw=411MiB/s (431MB/s), 2462KiB/s-4523KiB/s
>>> (2522kB/s-4632kB/s)
>>> 2 subsys, 4k rand: bw=392MiB/s (411MB/s), 2258KiB/s-4220KiB/s
>>> (2312kB/s-4321kB/s)
>>> 4 subsys, 4k seq: bw=378MiB/s (397MB/s), 1859KiB/s-8110KiB/s
>>> (1904kB/s-8305kB/s)
>>> 4 subsys, 4k rand: bw=326MiB/s (342MB/s), 1781KiB/s-4499KiB/s
>>> (1823kB/s-4607kB/s)
>>
>> Same question here, your patches do not seem to eliminate the overall
>> loss of efficiency.
>
> Never claimed that, and really I can't see that we can.
> All subsystems are running over the same link, so we are having to push
> more independent frames across it, and we will suffer from higher
> contention here. A performance degradation when scaling up subsystems
> is unavoidable.
I took another look at the fio jobfile. Is my understanding correct that
there are only 4 inflight requests at once (numjobs=4)? Maybe I'm missing
something here, but I'm lost on where the high load you are talking about
comes from. Isn't this a latency optimization exercise?
Not questioning the necessity of it, I would just like us to be on the same
page, as you previously gave the impression that there is very high
concurrency and load on the system here (for example when we asked Tejun
about workqueues per controller).
^ permalink raw reply [flat|nested] 24+ messages in thread