* [PATCH 0/3] Some datapath optimizations for nvme-tcp host driver
@ 2020-06-19 0:30 Sagi Grimberg
2020-06-19 0:30 ` [PATCH 1/3] nvme-tcp: have queue prod/cons send list become a llist Sagi Grimberg
` (3 more replies)
0 siblings, 4 replies; 6+ messages in thread
From: Sagi Grimberg @ 2020-06-19 0:30 UTC (permalink / raw)
To: linux-nvme, Christoph Hellwig, Keith Busch
Cc: Anil Vasudevan, Mark Wunderlich
Hey, some datapath micro-optimizations for nvme-tcp host driver.
Optimizations are:
1. make requests list for send operations a lockless list, and have the
I/O thread pull from it to a local send_list in batches
2. Implement request plugging information
3. optimize msg flags to signal the stack more is coming if
some more requests are expected to be sent soon.
A simple benchmark reveals a nice win in iops in high qd (32) with
batching applied (8 requests per batch). Still QD=1 latency is not
impacted.
Tested on ConnectX-4 device (LRO disabled).
CPU: Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
Baseline:
========
QDepth/Batch | IOPs [k]
--------------------------
1/1 | 43.2
32/8 | 106
with patchset:
=============
QDepth/Batch | IOPs [k]
--------------------------
1/1 | 43.3
32/8 | 165
Do note that in an alternative setup (different NIC, faster CPU) the
tests did not show a noticable difference, but overall the optimizations
do make the datapath somewhat more efficient.
Sagi Grimberg (3):
nvme-tcp: have queue prod/cons send list become a llist
nvme-tcp: leverage request plugging
nvme-tcp: optimize network stack with setting msg flags according to
batch size
drivers/nvme/host/tcp.c | 78 +++++++++++++++++++++++++++++++----------
1 file changed, 60 insertions(+), 18 deletions(-)
--
2.25.1
_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH 1/3] nvme-tcp: have queue prod/cons send list become a llist
2020-06-19 0:30 [PATCH 0/3] Some datapath optimizations for nvme-tcp host driver Sagi Grimberg
@ 2020-06-19 0:30 ` Sagi Grimberg
2020-06-24 16:48 ` Christoph Hellwig
2020-06-19 0:30 ` [PATCH 2/3] nvme-tcp: leverage request plugging Sagi Grimberg
` (2 subsequent siblings)
3 siblings, 1 reply; 6+ messages in thread
From: Sagi Grimberg @ 2020-06-19 0:30 UTC (permalink / raw)
To: linux-nvme, Christoph Hellwig, Keith Busch
Cc: Anil Vasudevan, Mark Wunderlich
The queue processing will splice to a queue local list, this should
alleviate some contention on the send_list lock, but also prepares
us to the next patch where we look on these lists for network stack
flag optimization.
Remove queue lock as its unused anymore.
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/tcp.c | 39 +++++++++++++++++++++++++++++----------
1 file changed, 29 insertions(+), 10 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 79ef2b8e2b3c..93fc6d9f97df 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -46,6 +46,7 @@ struct nvme_tcp_request {
u32 pdu_sent;
u16 ttag;
struct list_head entry;
+ struct llist_node lentry;
__le32 ddgst;
struct bio *curr_bio;
@@ -75,8 +76,8 @@ struct nvme_tcp_queue {
struct work_struct io_work;
int io_cpu;
- spinlock_t lock;
struct mutex send_mutex;
+ struct llist_head req_list;
struct list_head send_list;
/* recv state */
@@ -266,10 +267,8 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
struct nvme_tcp_queue *queue = req->queue;
bool empty;
- spin_lock(&queue->lock);
- empty = list_empty(&queue->send_list) && !queue->request;
- list_add_tail(&req->entry, &queue->send_list);
- spin_unlock(&queue->lock);
+ empty = llist_add(&req->lentry, &queue->req_list) &&
+ list_empty(&queue->send_list) && !queue->request;
/*
* if we're the first on the send_list and we can try to send
@@ -285,18 +284,38 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
}
}
+static void nvme_tcp_process_req_list(struct nvme_tcp_queue *queue)
+{
+ struct nvme_tcp_request *req;
+ struct llist_node *node;
+
+ node = llist_del_all(&queue->req_list);
+ if (!node)
+ return;
+
+ while (node) {
+ req = llist_entry(node, struct nvme_tcp_request, lentry);
+ list_add(&req->entry, &queue->send_list);
+ node = node->next;
+ }
+}
+
static inline struct nvme_tcp_request *
nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
{
struct nvme_tcp_request *req;
- spin_lock(&queue->lock);
req = list_first_entry_or_null(&queue->send_list,
struct nvme_tcp_request, entry);
- if (req)
- list_del(&req->entry);
- spin_unlock(&queue->lock);
+ if (!req) {
+ nvme_tcp_process_req_list(queue);
+ req = list_first_entry_or_null(&queue->send_list,
+ struct nvme_tcp_request, entry);
+ if (unlikely(!req))
+ return NULL;
+ }
+ list_del(&req->entry);
return req;
}
@@ -1342,8 +1361,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
int ret, rcv_pdu_size;
queue->ctrl = ctrl;
+ init_llist_head(&queue->req_list);
INIT_LIST_HEAD(&queue->send_list);
- spin_lock_init(&queue->lock);
mutex_init(&queue->send_mutex);
INIT_WORK(&queue->io_work, nvme_tcp_io_work);
queue->queue_size = queue_size;
--
2.25.1
_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH 2/3] nvme-tcp: leverage request plugging
2020-06-19 0:30 [PATCH 0/3] Some datapath optimizations for nvme-tcp host driver Sagi Grimberg
2020-06-19 0:30 ` [PATCH 1/3] nvme-tcp: have queue prod/cons send list become a llist Sagi Grimberg
@ 2020-06-19 0:30 ` Sagi Grimberg
2020-06-19 0:30 ` [PATCH 3/3] nvme-tcp: optimize network stack with setting msg flags according to batch size Sagi Grimberg
2020-06-24 16:51 ` [PATCH 0/3] Some datapath optimizations for nvme-tcp host driver Christoph Hellwig
3 siblings, 0 replies; 6+ messages in thread
From: Sagi Grimberg @ 2020-06-19 0:30 UTC (permalink / raw)
To: linux-nvme, Christoph Hellwig, Keith Busch
Cc: Anil Vasudevan, Mark Wunderlich
blk-mq request plugging can improve the execution of our pipeline.
When we queue a request we actually trigger our I/O worker thread
yielding a context switch by definition. However if we know that
there are more requests in the pipe that are coming, we are better
off not trigger our I/O worker and only do that for the last request
in the batch (bd->last). By having it, we improve efficiency by
amortizing context switches over a batch of requests.
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/tcp.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 93fc6d9f97df..76fd587acc2e 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -262,7 +262,7 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
}
static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
- bool sync)
+ bool sync, bool last)
{
struct nvme_tcp_queue *queue = req->queue;
bool empty;
@@ -279,7 +279,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
sync && empty && mutex_trylock(&queue->send_mutex)) {
nvme_tcp_try_send(queue);
mutex_unlock(&queue->send_mutex);
- } else {
+ } else if (last) {
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}
}
@@ -614,7 +614,7 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
req->state = NVME_TCP_SEND_H2C_PDU;
req->offset = 0;
- nvme_tcp_queue_request(req, false);
+ nvme_tcp_queue_request(req, false, true);
return 0;
}
@@ -2123,7 +2123,7 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
ctrl->async_req.curr_bio = NULL;
ctrl->async_req.data_len = 0;
- nvme_tcp_queue_request(&ctrl->async_req, true);
+ nvme_tcp_queue_request(&ctrl->async_req, true, true);
}
static enum blk_eh_timer_return
@@ -2235,6 +2235,14 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
return 0;
}
+static void nvme_tcp_commit_rqs(struct blk_mq_hw_ctx *hctx)
+{
+ struct nvme_tcp_queue *queue = hctx->driver_data;
+
+ if (!llist_empty(&queue->req_list))
+ queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+}
+
static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
@@ -2254,7 +2262,7 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
blk_mq_start_request(rq);
- nvme_tcp_queue_request(req, true);
+ nvme_tcp_queue_request(req, true, bd->last);
return BLK_STS_OK;
}
@@ -2322,6 +2330,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx)
static const struct blk_mq_ops nvme_tcp_mq_ops = {
.queue_rq = nvme_tcp_queue_rq,
+ .commit_rqs = nvme_tcp_commit_rqs,
.complete = nvme_complete_rq,
.init_request = nvme_tcp_init_request,
.exit_request = nvme_tcp_exit_request,
--
2.25.1
_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH 3/3] nvme-tcp: optimize network stack with setting msg flags according to batch size
2020-06-19 0:30 [PATCH 0/3] Some datapath optimizations for nvme-tcp host driver Sagi Grimberg
2020-06-19 0:30 ` [PATCH 1/3] nvme-tcp: have queue prod/cons send list become a llist Sagi Grimberg
2020-06-19 0:30 ` [PATCH 2/3] nvme-tcp: leverage request plugging Sagi Grimberg
@ 2020-06-19 0:30 ` Sagi Grimberg
2020-06-24 16:51 ` [PATCH 0/3] Some datapath optimizations for nvme-tcp host driver Christoph Hellwig
3 siblings, 0 replies; 6+ messages in thread
From: Sagi Grimberg @ 2020-06-19 0:30 UTC (permalink / raw)
To: linux-nvme, Christoph Hellwig, Keith Busch
Cc: Anil Vasudevan, Mark Wunderlich
If we have a long list of request to send, signal the network stack
that more is coming (MSG_MORE). If we have nothing else, signal MSG_EOR.
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/host/tcp.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 76fd587acc2e..0a13a0812dbc 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -79,6 +79,7 @@ struct nvme_tcp_queue {
struct mutex send_mutex;
struct llist_head req_list;
struct list_head send_list;
+ bool more_requests;
/* recv state */
void *pdu;
@@ -277,7 +278,9 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
*/
if (queue->io_cpu == smp_processor_id() &&
sync && empty && mutex_trylock(&queue->send_mutex)) {
+ queue->more_requests = !last;
nvme_tcp_try_send(queue);
+ queue->more_requests = false;
mutex_unlock(&queue->send_mutex);
} else if (last) {
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
@@ -880,6 +883,12 @@ static void nvme_tcp_state_change(struct sock *sk)
read_unlock(&sk->sk_callback_lock);
}
+static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
+{
+ return !list_empty(&queue->send_list) ||
+ !llist_empty(&queue->req_list) || queue->more_requests;
+}
+
static inline void nvme_tcp_done_send_req(struct nvme_tcp_queue *queue)
{
queue->request = NULL;
@@ -901,7 +910,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
bool last = nvme_tcp_pdu_last_send(req, len);
int ret, flags = MSG_DONTWAIT;
- if (last && !queue->data_digest)
+ if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
flags |= MSG_EOR;
else
flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
@@ -948,7 +957,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
int flags = MSG_DONTWAIT;
int ret;
- if (inline_data)
+ if (inline_data || nvme_tcp_queue_more(queue))
flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
else
flags |= MSG_EOR;
@@ -1013,12 +1022,17 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req)
{
struct nvme_tcp_queue *queue = req->queue;
int ret;
- struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
+ struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
struct kvec iov = {
.iov_base = &req->ddgst + req->offset,
.iov_len = NVME_TCP_DIGEST_LENGTH - req->offset
};
+ if (nvme_tcp_queue_more(queue))
+ msg.msg_flags |= MSG_MORE;
+ else
+ msg.msg_flags |= MSG_EOR;
+
ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
if (unlikely(ret <= 0))
return ret;
--
2.25.1
_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: have queue prod/cons send list become a llist
2020-06-19 0:30 ` [PATCH 1/3] nvme-tcp: have queue prod/cons send list become a llist Sagi Grimberg
@ 2020-06-24 16:48 ` Christoph Hellwig
0 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2020-06-24 16:48 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Keith Busch, Anil Vasudevan, Mark Wunderlich, Christoph Hellwig,
linux-nvme
On Thu, Jun 18, 2020 at 05:30:22PM -0700, Sagi Grimberg wrote:
> +static void nvme_tcp_process_req_list(struct nvme_tcp_queue *queue)
> +{
> + struct nvme_tcp_request *req;
> + struct llist_node *node;
> +
> + node = llist_del_all(&queue->req_list);
> + if (!node)
> + return;
> +
> + while (node) {
> + req = llist_entry(node, struct nvme_tcp_request, lentry);
> + list_add(&req->entry, &queue->send_list);
> + node = node->next;
> + }
The if is not needed as the while covers this.
Otherwise this looks good, I'ĺl fix it up when applying.
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 0/3] Some datapath optimizations for nvme-tcp host driver
2020-06-19 0:30 [PATCH 0/3] Some datapath optimizations for nvme-tcp host driver Sagi Grimberg
` (2 preceding siblings ...)
2020-06-19 0:30 ` [PATCH 3/3] nvme-tcp: optimize network stack with setting msg flags according to batch size Sagi Grimberg
@ 2020-06-24 16:51 ` Christoph Hellwig
3 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2020-06-24 16:51 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Keith Busch, Anil Vasudevan, Mark Wunderlich, Christoph Hellwig,
linux-nvme
Applied to nvme-5.9.
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2020-06-24 16:51 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-06-19 0:30 [PATCH 0/3] Some datapath optimizations for nvme-tcp host driver Sagi Grimberg
2020-06-19 0:30 ` [PATCH 1/3] nvme-tcp: have queue prod/cons send list become a llist Sagi Grimberg
2020-06-24 16:48 ` Christoph Hellwig
2020-06-19 0:30 ` [PATCH 2/3] nvme-tcp: leverage request plugging Sagi Grimberg
2020-06-19 0:30 ` [PATCH 3/3] nvme-tcp: optimize network stack with setting msg flags according to batch size Sagi Grimberg
2020-06-24 16:51 ` [PATCH 0/3] Some datapath optimizations for nvme-tcp host driver Christoph Hellwig
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.