* [PATCH 0/3] Fix request completion holes
From: Sagi Grimberg @ 2017-10-31 8:55 UTC
We have two holes in nvme-rdma when completing requests:

1. We never wait for the send work request to complete before completing
a request. It is possible that the HCA retries a send operation (due
to a dropped ack) after the nvme cqe has already arrived back at the host.
If we unmap the host buffer upon reception of the cqe, the HCA might
get iommu errors when attempting to access an unmapped host buffer.
We must also wait for the send completion before completing a request;
most of the time it will arrive before the nvme cqe does, so we only
pay for the extra cq entry processing.

2. We don't wait for the request memory region to be fully invalidated
in case the target didn't invalidate remotely. We must wait for the local
invalidation to complete before completing the request.
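
Put differently, a request may only be completed once nothing can still
touch its buffers: the send completion has been seen, the nvme cqe has
been seen, and the rkey is no longer registered. A rough sketch of the
gating condition (the helper below is illustrative only; the flag names
match the ones introduced in patches 2 and 3):

	/*
	 * Sketch: complete the request only when nothing can still touch
	 * its buffers. Illustrative helper, not the exact patch code.
	 */
	static void nvme_rdma_maybe_end_request(struct nvme_rdma_request *req)
	{
		struct request *rq = blk_mq_rq_from_pdu(req);

		if (!req->send_completed)	/* send WR may still be retried */
			return;
		if (!req->resp_completed)	/* nvme cqe not here yet */
			return;
		if (req->mr->need_inval)	/* rkey still valid, wait for LOCAL_INV */
			return;

		nvme_end_request(rq, req->cqe.status, req->cqe.result);
	}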
Sagi Grimberg (3):
nvme-rdma: don't suppress send completions
nvme-rdma: don't complete requests before a send work request has
completed
nvme-rdma: wait for local invalidation before completing a request
drivers/nvme/host/rdma.c | 101 +++++++++++++++++++++++------------------------
1 file changed, 50 insertions(+), 51 deletions(-)
--
2.7.4
* [PATCH 1/3] nvme-rdma: don't suppress send completions
From: Sagi Grimberg @ 2017-10-31 8:55 UTC

The entire completion suppression mechanism is currently broken because
the HCA might retry a send operation (due to a dropped ack) after the
nvme transaction has already completed.

In order to handle this, we signal all send completions (besides the
async event, which does not race with anything).

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/rdma.c | 40 ++++------------------------------------
 1 file changed, 4 insertions(+), 36 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 53ad241493a1..ccbae327fe72 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1001,21 +1001,9 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
 		nvme_rdma_wr_error(cq, wc, "SEND");
 }
 
-/*
- * We want to signal completion at least every queue depth/2. This returns the
- * largest power of two that is not above half of (queue size + 1) to optimize
- * (avoid divisions).
- */
-static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
-{
-	int limit = 1 << ilog2((queue->queue_size + 1) / 2);
-
-	return (atomic_inc_return(&queue->sig_count) & (limit - 1)) == 0;
-}
-
 static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 		struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
-		struct ib_send_wr *first, bool flush)
+		struct ib_send_wr *first)
 {
 	struct ib_send_wr wr, *bad_wr;
 	int ret;
@@ -1031,24 +1019,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 	wr.sg_list = sge;
 	wr.num_sge = num_sge;
 	wr.opcode = IB_WR_SEND;
-	wr.send_flags = 0;
-
-	/*
-	 * Unsignalled send completions are another giant desaster in the
-	 * IB Verbs spec: If we don't regularly post signalled sends
-	 * the send queue will fill up and only a QP reset will rescue us.
-	 * Would have been way to obvious to handle this in hardware or
-	 * at least the RDMA stack..
-	 *
-	 * Always signal the flushes. The magic request used for the flush
-	 * sequencer is not allocated in our driver's tagset and it's
-	 * triggered to be freed by blk_cleanup_queue(). So we need to
-	 * always mark it as signaled to ensure that the "wr_cqe", which is
-	 * embedded in request's payload, is not freed when __ib_process_cq()
-	 * calls wr_cqe->done().
-	 */
-	if (nvme_rdma_queue_sig_limit(queue) || flush)
-		wr.send_flags |= IB_SEND_SIGNALED;
+	wr.send_flags = IB_SEND_SIGNALED;
 
 	if (first)
 		first->next = &wr;
@@ -1122,7 +1093,7 @@ static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg, int aer_idx)
 	ib_dma_sync_single_for_device(dev, sqe->dma, sizeof(*cmd),
 			DMA_TO_DEVICE);
 
-	ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL, false);
+	ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL);
 	WARN_ON_ONCE(ret);
 }
 
@@ -1407,7 +1378,6 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
 	struct nvme_rdma_qe *sqe = &req->sqe;
 	struct nvme_command *c = sqe->data;
-	bool flush = false;
 	struct ib_device *dev;
 	blk_status_t ret;
 	int err;
@@ -1439,10 +1409,8 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	ib_dma_sync_single_for_device(dev, sqe->dma,
 			sizeof(struct nvme_command), DMA_TO_DEVICE);
 
-	if (req_op(rq) == REQ_OP_FLUSH)
-		flush = true;
 	err = nvme_rdma_post_send(queue, sqe, req->sge, req->num_sge,
-			req->mr->need_inval ? &req->reg_wr.wr : NULL, flush);
+			req->mr->need_inval ? &req->reg_wr.wr : NULL);
 	if (unlikely(err)) {
 		nvme_rdma_unmap_data(queue, rq);
 		goto err;
--
2.7.4
* [PATCH 2/3] nvme-rdma: don't complete requests before a send work request has completed
From: Sagi Grimberg @ 2017-10-31 8:55 UTC

In order to guarantee that the HCA will never get an access violation
(either from an invalidated rkey or from the iommu) when retrying a send
operation, we must complete a request only when both the send completion
and the nvme cqe have arrived. Only then is it safe to invalidate the
rkey (if needed), unmap the host buffers, and complete the I/O.

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/rdma.c | 35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index ccbae327fe72..ae1fb66358f7 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -67,6 +67,9 @@ struct nvme_rdma_request {
 	struct nvme_request req;
 	struct ib_mr *mr;
 	struct nvme_rdma_qe sqe;
+	struct nvme_completion cqe;
+	bool send_completed;
+	bool resp_completed;
 	struct ib_sge sge[1 + NVME_RDMA_MAX_INLINE_SEGMENTS];
 	u32 num_sge;
 	int nents;
@@ -961,6 +964,8 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 	req->num_sge = 1;
 	req->inline_data = false;
 	req->mr->need_inval = false;
+	req->send_completed = false;
+	req->resp_completed = false;
 
 	c->common.flags |= NVME_CMD_SGL_METABUF;
 
@@ -997,13 +1002,25 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 
 static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
 {
-	if (unlikely(wc->status != IB_WC_SUCCESS))
+	struct nvme_rdma_qe *qe =
+		container_of(wc->wr_cqe, struct nvme_rdma_qe, cqe);
+	struct nvme_rdma_request *req =
+		container_of(qe, struct nvme_rdma_request, sqe);
+	struct request *rq = blk_mq_rq_from_pdu(req);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		nvme_rdma_wr_error(cq, wc, "SEND");
+		return;
+	}
+
+	req->send_completed = true;
+	if (req->resp_completed)
+		nvme_end_request(rq, req->cqe.status, req->cqe.result);
 }
 
 static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 		struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
-		struct ib_send_wr *first)
+		struct ib_send_wr *first, bool signal)
 {
 	struct ib_send_wr wr, *bad_wr;
 	int ret;
@@ -1019,7 +1036,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
 	wr.sg_list = sge;
 	wr.num_sge = num_sge;
 	wr.opcode = IB_WR_SEND;
-	wr.send_flags = IB_SEND_SIGNALED;
+	wr.send_flags = signal ? IB_SEND_SIGNALED : 0;
 
 	if (first)
 		first->next = &wr;
@@ -1093,7 +1110,7 @@ static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg, int aer_idx)
 	ib_dma_sync_single_for_device(dev, sqe->dma, sizeof(*cmd),
 			DMA_TO_DEVICE);
 
-	ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL);
+	ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL, false);
 	WARN_ON_ONCE(ret);
 }
 
@@ -1117,11 +1134,17 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 	if (rq->tag == tag)
 		ret = 1;
 
+	req->cqe.status = cqe->status;
+	req->cqe.result = cqe->result;
+	req->resp_completed = true;
+
 	if ((wc->wc_flags & IB_WC_WITH_INVALIDATE) &&
 	    wc->ex.invalidate_rkey == req->mr->rkey)
 		req->mr->need_inval = false;
 
-	nvme_end_request(rq, cqe->status, cqe->result);
+	if (req->send_completed)
+		nvme_end_request(rq, req->cqe.status, req->cqe.result);
+
 	return ret;
 }
 
@@ -1410,7 +1433,7 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 			sizeof(struct nvme_command), DMA_TO_DEVICE);
 
 	err = nvme_rdma_post_send(queue, sqe, req->sge, req->num_sge,
-			req->mr->need_inval ? &req->reg_wr.wr : NULL);
+			req->mr->need_inval ? &req->reg_wr.wr : NULL, true);
 	if (unlikely(err)) {
 		nvme_rdma_unmap_data(queue, rq);
 		goto err;
--
2.7.4
* [PATCH 3/3] nvme-rdma: wait for local invalidation before completing a request
From: Sagi Grimberg @ 2017-10-31 8:55 UTC

We must not complete a request before the host memory region is
invalidated. Luckily, the protocol supports send with invalidate, so we
usually don't need to perform a local invalidation ourselves; but if the
target did not invalidate the memory region for us, we must wait for the
local invalidation to complete before unmapping host memory and
completing the I/O.

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
 drivers/nvme/host/rdma.c | 42 +++++++++++++++++++++++++-----------------
 1 file changed, 25 insertions(+), 17 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index ae1fb66358f7..b7e0fb0fe913 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -818,8 +818,19 @@ static void nvme_rdma_memreg_done(struct ib_cq *cq, struct ib_wc *wc)
 
 static void nvme_rdma_inv_rkey_done(struct ib_cq *cq, struct ib_wc *wc)
 {
-	if (unlikely(wc->status != IB_WC_SUCCESS))
+	struct nvme_rdma_request *req =
+		container_of(wc->wr_cqe, struct nvme_rdma_request, reg_cqe);
+	struct request *rq = blk_mq_rq_from_pdu(req);
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		nvme_rdma_wr_error(cq, wc, "LOCAL_INV");
+		return;
+	}
+
+	req->mr->need_inval = false;
+	if (req->resp_completed && req->send_completed)
+		nvme_end_request(rq, req->cqe.status, req->cqe.result);
+
 }
 
 static int nvme_rdma_inv_rkey(struct nvme_rdma_queue *queue,
@@ -830,7 +841,7 @@ static int nvme_rdma_inv_rkey(struct nvme_rdma_queue *queue,
 		.opcode = IB_WR_LOCAL_INV,
 		.next = NULL,
 		.num_sge = 0,
-		.send_flags = 0,
+		.send_flags = IB_SEND_SIGNALED,
 		.ex.invalidate_rkey = req->mr->rkey,
 	};
 
@@ -844,24 +855,12 @@ static void nvme_rdma_unmap_data(struct nvme_rdma_queue *queue,
 		struct request *rq)
 {
 	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
-	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
 	struct nvme_rdma_device *dev = queue->device;
 	struct ib_device *ibdev = dev->dev;
-	int res;
 
 	if (!blk_rq_bytes(rq))
 		return;
 
-	if (req->mr->need_inval && test_bit(NVME_RDMA_Q_LIVE, &req->queue->flags)) {
-		res = nvme_rdma_inv_rkey(queue, req);
-		if (unlikely(res < 0)) {
-			dev_err(ctrl->ctrl.device,
-				"Queueing INV WR for rkey %#x failed (%d)\n",
-				req->mr->rkey, res);
-			nvmf_error_recovery(&queue->ctrl->ctrl);
-		}
-	}
-
 	ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents,
 			rq_data_dir(rq) == WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
 
@@ -1014,7 +1013,7 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
 	}
 
 	req->send_completed = true;
-	if (req->resp_completed)
+	if (req->resp_completed && !req->mr->need_inval)
 		nvme_end_request(rq, req->cqe.status, req->cqe.result);
 }
 
@@ -1139,10 +1138,19 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
 	req->resp_completed = true;
 
 	if ((wc->wc_flags & IB_WC_WITH_INVALIDATE) &&
-	    wc->ex.invalidate_rkey == req->mr->rkey)
+	    wc->ex.invalidate_rkey == req->mr->rkey) {
 		req->mr->need_inval = false;
+	} else if (req->mr->need_inval) {
+		ret = nvme_rdma_inv_rkey(queue, req);
+		if (unlikely(ret < 0)) {
+			dev_err(queue->ctrl->ctrl.device,
+				"Queueing INV WR for rkey %#x failed (%d)\n",
+				req->mr->rkey, ret);
+			nvmf_error_recovery(&queue->ctrl->ctrl);
+		}
+	}
 
-	if (req->send_completed)
+	if (req->send_completed && !req->mr->need_inval)
 		nvme_end_request(rq, req->cqe.status, req->cqe.result);
 
 	return ret;
--
2.7.4
* [PATCH 0/3] Fix request completion holes
From: Max Gurtovoy @ 2017-10-31 9:38 UTC

On 10/31/2017 10:55 AM, Sagi Grimberg wrote:
> We have two holes in nvme-rdma when completing request.

Thanks Sagi.

We'll run these patches in our lab, together with the performance
improvement patches we sent a few days ago, and update the results.
I guess I'll need to run it with my cq moderation patches to get less
of a performance penalty.
* [PATCH 0/3] Fix request completion holes
From: Sagi Grimberg @ 2017-10-31 11:10 UTC

> Thanks Sagi.
> We'll run these patches in our lab + our patches for the performance
> improvments that we sent few days ago and update the results.

Thanks,

> I guess I'll need to run it with my cq moderation patches to get less
> penalty in performance.

Fine, but just a heads up: any cq moderation that is not adaptive is
not something that we'd like to pursue. I have a generic adaptive
moderation library in the works, so any ideas should be integrated
into that (or something equivalent).
* [PATCH 0/3] Fix request completion holes
From: idanb @ 2017-11-01 16:02 UTC

Sagi,

Thanks for your patches. Max and I ran them, adding completion
signaling for both the send (command capsule) and the local
invalidation request.

The result was a significant reduction in IOPs due to the extra
interrupts and completion processing. Below are our READ/WRITE IOPs
results (512B, 64 jobs):

(1) No send signal and no invalidation signal w/ moderation   7.4M / 11.9M
(2) No send signal and no invalidation signal w/o moderation  7.1M / 8.7M
(3) No send signal with invalidation signal w/o moderation    5.2M / 8.3M
(4) Send signal with invalidation signal w/o moderation       3.4M / 4.6M
(5) Send signal with invalidation signal w/ moderation        6.2M / 11.8M

I understand that (1), (2) and (3) are invalid; the performance
reduction induced by mode (4) should motivate us to enable cqe/event
moderation as soon as possible.

On 10/31/2017 1:10 PM, Sagi Grimberg wrote:
>> Thanks Sagi.
>> We'll run these patches in our lab + our patches for the performance
>> improvments that we sent few days ago and update the results.
>
> Thanks,
>
>> I guess I'll need to run it with my cq moderation patches to get less
>> penalty in performance.
>
> Fine, but just a heads up, any cq moderation that is not adaptive is
> not something that we'd like to pursue. I have a generic adaptive
> moderation library in the works, so any ideas should be integrated
> into that (or something equivalent).
* [PATCH 0/3] Fix request completion holes
From: Max Gurtovoy @ 2017-11-01 16:09 UTC

On 11/1/2017 6:02 PM, idanb wrote:
> Sagi,
>
> Thanks for you patches, Max and me ran your patches adding completion
> signaling for both send (command capsule) and local invalidation request.
>
> The results were significant reduction of IOPs due to interrupts and
> completion processing, see below our results for READ/WRITE IOPs (512B,
> 64 Jobs):
>
> (1) No send signal and no invalidation signal w/ moderation 7.4M / 11.9M
> (2) No send signal and no invalidation signal w/o moderation 7.1M / 8.7M
> (3) No send signal with invalidation signal w/o moderation 5.2M / 8.3M
> (4) Send signal with invalidation signal w/o moderation 3.4M / 4.6M
> (5) Send signal with invalidation signal w/ moderation 6.2M / 11.8M
>
> I understand that (1), (2) and (3) are invalid, the performance
> reduction induce by mode (4) need to motivate us to enable cqe/event
> moderation as soon as possible.

Attached is the full report we ran with Sagi's patches (plus cq
moderation cases with cq_completions=32, cq_timeout=16, and the old
send signaling [once every queue_size/2]).

-------------- next part --------------
READ: Signal with moderation
+++++++++++++++++++++++++++++++
read (512B, 32 CPUs, 32 jobs, iodepth=128): io=61146MB, bw=3057.2MB/s, iops=6261.9K, runt= 20001msec
read (1k, 32 CPUs, 32 jobs, iodepth=128): io=110943MB, bw=5546.1MB/s, iops=5680.2K, runt= 20001msec
read (2k, 32 CPUs, 32 jobs, iodepth=128): io=163192MB, bw=8159.3MB/s, iops=4177.6K, runt= 20001msec
read (4k, 32 CPUs, 32 jobs, iodepth=128): io=177120MB, bw=8855.2MB/s, iops=2266.1K, runt= 20002msec
read (8k, 32 CPUs, 32 jobs, iodepth=128): io=187884MB, bw=9392.9MB/s, iops=1202.3K, runt= 20003msec
read (16k, 32 CPUs, 32 jobs, iodepth=128): io=190151MB, bw=9505.7MB/s, iops=608359, runt= 20004msec
read (32k, 32 CPUs, 32 jobs, iodepth=128): io=191209MB, bw=9556.7MB/s, iops=305812, runt= 20008msec
read (64k, 32 CPUs, 32 jobs, iodepth=128): io=192417MB, bw=9611.8MB/s, iops=153787, runt= 20019msec
read (512B, 32 CPUs, 64 jobs, iodepth=128): io=34871MB, bw=1743.5MB/s, iops=3570.7K, runt= 20001msec
read (1k, 32 CPUs, 64 jobs, iodepth=128): io=106594MB, bw=5329.2MB/s, iops=5457.8K, runt= 20002msec
read (2k, 32 CPUs, 64 jobs, iodepth=128): io=153952MB, bw=7696.9MB/s, iops=3940.8K, runt= 20002msec
read (4k, 32 CPUs, 64 jobs, iodepth=128): io=171229MB, bw=8560.2MB/s, iops=2191.4K, runt= 20003msec
read (8k, 32 CPUs, 64 jobs, iodepth=128): io=178982MB, bw=8946.9MB/s, iops=1145.2K, runt= 20005msec
read (16k, 32 CPUs, 64 jobs, iodepth=128): io=182606MB, bw=9126.2MB/s, iops=584076, runt= 20009msec
read (32k, 32 CPUs, 64 jobs, iodepth=128): io=183271MB, bw=9157.2MB/s, iops=293028, runt= 20014msec
read (64k, 32 CPUs, 64 jobs, iodepth=128): io=169174MB, bw=8447.3MB/s, iops=135156, runt= 20027msec

READ: Signal without moderation
++++++++++++++++++++++++++++++++
read : io=33410MB, bw=1670.5MB/s, iops=3420.1K, runt= 20001msec
read : io=66287MB, bw=3314.2MB/s, iops=3393.8K, runt= 20001msec
read : io=128976MB, bw=6448.5MB/s, iops=3301.7K, runt= 20001msec
read : io=172651MB, bw=8631.8MB/s, iops=2209.8K, runt= 20002msec
read : io=181532MB, bw=9076.2MB/s, iops=1161.8K, runt= 20001msec
read : io=194935MB, bw=9744.9MB/s, iops=623667, runt= 20004msec
read : io=192299MB, bw=9610.7MB/s, iops=307540, runt= 20009msec
read : io=192919MB, bw=9635.9MB/s, iops=154173, runt= 20021msec
read : io=31905MB, bw=1595.2MB/s, iops=3266.1K, runt= 20001msec
read : io=69284MB, bw=3464.2MB/s, iops=3547.2K, runt= 20001msec
read : io=146078MB, bw=7303.2MB/s, iops=3739.3K, runt= 20002msec
read : io=168585MB, bw=8428.4MB/s, iops=2157.6K, runt= 20003msec
read : io=181224MB, bw=9058.1MB/s, iops=1159.6K, runt= 20005msec
read : io=184970MB, bw=9244.4MB/s, iops=591638, runt= 20009msec
read : io=184823MB, bw=9234.8MB/s, iops=295510, runt= 20014msec
read : io=175313MB, bw=8753.4MB/s, iops=140054, runt= 20028msec

READ: No SEND Signal with moderation (with signal on local invalidation)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
read : io=61968MB, bw=3098.3MB/s, iops=6345.3K, runt= 20001msec
read : io=112583MB, bw=5628.9MB/s, iops=5763.1K, runt= 20001msec
read : io=167024MB, bw=8350.9MB/s, iops=4275.7K, runt= 20001msec
read : io=181638MB, bw=9081.1MB/s, iops=2324.8K, runt= 20002msec
read : io=185373MB, bw=9267.3MB/s, iops=1186.3K, runt= 20003msec
read : io=191548MB, bw=9575.6MB/s, iops=612832, runt= 20004msec
read : io=190085MB, bw=9499.2MB/s, iops=303999, runt= 20009msec
read : io=192519MB, bw=9616.9MB/s, iops=153868, runt= 20019msec
read : io=58495MB, bw=2924.6MB/s, iops=5989.6K, runt= 20001msec
read : io=109628MB, bw=5480.9MB/s, iops=5612.4K, runt= 20002msec
read : io=156339MB, bw=7816.6MB/s, iops=4002.8K, runt= 20001msec
read : io=171421MB, bw=8569.9MB/s, iops=2193.9K, runt= 20003msec
read : io=179565MB, bw=8976.3MB/s, iops=1148.1K, runt= 20005msec
read : io=182728MB, bw=9132.8MB/s, iops=584495, runt= 20008msec
read : io=184840MB, bw=9235.2MB/s, iops=295551, runt= 20013msec
read : io=180544MB, bw=9015.8MB/s, iops=144240, runt= 20027msec

READ: No SEND Signal without moderation (with signal on local invalidation)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
read : io=40289MB, bw=2014.4MB/s, iops=4125.5K, runt= 20001msec
read : io=78689MB, bw=3934.3MB/s, iops=4028.7K, runt= 20001msec
read : io=143805MB, bw=7189.1MB/s, iops=3681.3K, runt= 20001msec
read : io=173558MB, bw=8677.4MB/s, iops=2221.4K, runt= 20002msec
read : io=186197MB, bw=9308.5MB/s, iops=1191.5K, runt= 20003msec
read : io=192793MB, bw=9637.8MB/s, iops=616812, runt= 20004msec
read : io=192399MB, bw=9616.1MB/s, iops=307714, runt= 20008msec
read : io=194790MB, bw=9731.8MB/s, iops=155707, runt= 20016msec
read : io=50862MB, bw=2542.2MB/s, iops=5208.3K, runt= 20001msec
read : io=103616MB, bw=5180.6MB/s, iops=5304.9K, runt= 20001msec
read : io=150399MB, bw=7519.3MB/s, iops=3849.9K, runt= 20002msec
read : io=173678MB, bw=8682.2MB/s, iops=2222.7K, runt= 20004msec
read : io=181481MB, bw=9071.9MB/s, iops=1161.2K, runt= 20005msec
read : io=186601MB, bw=9326.4MB/s, iops=596884, runt= 20008msec
read : io=187268MB, bw=9356.9MB/s, iops=299419, runt= 20014msec
read : io=176620MB, bw=8817.4MB/s, iops=141077, runt= 20031msec

Write: Signal with moderation
+++++++++++++++++++++++++++++++
write: io=115326MB, bw=5766.2MB/s, iops=11809K, runt= 20001msec
write: io=148240MB, bw=7411.7MB/s, iops=7589.6K, runt= 20001msec
write: io=161919MB, bw=8095.6MB/s, iops=4144.1K, runt= 20001msec
write: io=171162MB, bw=8557.3MB/s, iops=2190.7K, runt= 20002msec
write: io=199720MB, bw=9984.6MB/s, iops=1278.2K, runt= 20003msec
write: io=206792MB, bw=10338MB/s, iops=661600, runt= 20004msec
write: io=209549MB, bw=10474MB/s, iops=335161, runt= 20007msec
write: io=208180MB, bw=10402MB/s, iops=166427, runt= 20014msec
write: io=46900MB, bw=2344.1MB/s, iops=4802.4K, runt= 20001msec
write: io=105243MB, bw=5261.1MB/s, iops=5388.2K, runt= 20001msec
write: io=160543MB, bw=8026.4MB/s, iops=4109.6K, runt= 20002msec
write: io=174457MB, bw=8721.2MB/s, iops=2232.7K, runt= 20004msec
write: io=192598MB, bw=9627.2MB/s, iops=1232.4K, runt= 20004msec
write: io=201587MB, bw=10076MB/s, iops=644851, runt= 20007msec
write: io=205567MB, bw=10272MB/s, iops=328693, runt= 20013msec
write: io=208508MB, bw=10412MB/s, iops=166590, runt= 20026msec

Write: Signal without moderation
+++++++++++++++++++++++++++++++++
write: io=43445MB, bw=2172.2MB/s, iops=4448.6K, runt= 20001msec
write: io=93361MB, bw=4667.9MB/s, iops=4779.9K, runt= 20001msec
write: io=160929MB, bw=8046.6MB/s, iops=4119.6K, runt= 20001msec
write: io=171522MB, bw=8575.3MB/s, iops=2195.3K, runt= 20002msec
write: io=198314MB, bw=9914.2MB/s, iops=1269.2K, runt= 20003msec
write: io=206332MB, bw=10315MB/s, iops=660129, runt= 20004msec
write: io=207733MB, bw=10383MB/s, iops=332240, runt= 20008msec
write: io=208141MB, bw=10399MB/s, iops=166379, runt= 20016msec
write: io=45763MB, bw=2288.2MB/s, iops=4685.9K, runt= 20001msec
write: io=100083MB, bw=5003.1MB/s, iops=5123.1K, runt= 20001msec
write: io=159084MB, bw=7953.5MB/s, iops=4072.2K, runt= 20002msec
write: io=173523MB, bw=8674.5MB/s, iops=2220.7K, runt= 20004msec
write: io=189560MB, bw=9476.2MB/s, iops=1212.1K, runt= 20004msec
write: io=199250MB, bw=9958.6MB/s, iops=637344, runt= 20008msec
write: io=204977MB, bw=10243MB/s, iops=327767, runt= 20012msec
write: io=208619MB, bw=10417MB/s, iops=166678, runt= 20026msec

Write: No SEND Signal with moderation (with signal on local invalidation)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
write: io=116784MB, bw=5838.1MB/s, iops=11958K, runt= 20001msec
write: io=143658MB, bw=7182.6MB/s, iops=7354.1K, runt= 20001msec
write: io=161343MB, bw=8066.4MB/s, iops=4129.1K, runt= 20002msec
write: io=170761MB, bw=8537.3MB/s, iops=2185.6K, runt= 20002msec
write: io=199668MB, bw=9982.5MB/s, iops=1277.8K, runt= 20002msec
write: io=207307MB, bw=10363MB/s, iops=663248, runt= 20004msec
write: io=209903MB, bw=10491MB/s, iops=335727, runt= 20007msec
write: io=208268MB, bw=10406MB/s, iops=166489, runt= 20015msec
write: io=87286MB, bw=4364.1MB/s, iops=8937.7K, runt= 20001msec
write: io=145686MB, bw=7283.6MB/s, iops=7458.4K, runt= 20002msec
write: io=160259MB, bw=8012.2MB/s, iops=4102.3K, runt= 20002msec
write: io=174527MB, bw=8724.7MB/s, iops=2233.5K, runt= 20004msec
write: io=193620MB, bw=9679.5MB/s, iops=1238.1K, runt= 20004msec
write: io=202051MB, bw=10099MB/s, iops=646336, runt= 20007msec
write: io=206876MB, bw=10337MB/s, iops=330770, runt= 20014msec
write: io=208138MB, bw=10394MB/s, iops=166302, runt= 20025msec

Write: No SEND Signal without moderation (with signal on local invalidation)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
write: io=73271MB, bw=3663.4MB/s, iops=7502.6K, runt= 20001msec
write: io=129638MB, bw=6481.6MB/s, iops=6637.2K, runt= 20001msec
write: io=161002MB, bw=8049.3MB/s, iops=4121.3K, runt= 20002msec
write: io=170759MB, bw=8537.1MB/s, iops=2185.5K, runt= 20002msec
write: io=198565MB, bw=9927.3MB/s, iops=1270.7K, runt= 20002msec
write: io=207304MB, bw=10363MB/s, iops=663239, runt= 20004msec
write: io=207251MB, bw=10358MB/s, iops=331468, runt= 20008msec
write: io=208262MB, bw=10405MB/s, iops=166484, runt= 20015msec
write: io=81719MB, bw=4085.8MB/s, iops=8367.6K, runt= 20001msec
write: io=145468MB, bw=7273.2MB/s, iops=7447.6K, runt= 20001msec
write: io=159924MB, bw=7995.5MB/s, iops=4093.7K, runt= 20002msec
write: io=174306MB, bw=8713.6MB/s, iops=2230.7K, runt= 20004msec
write: io=186670MB, bw=9333.3MB/s, iops=1194.7K, runt= 20001msec
write: io=201011MB, bw=10047MB/s, iops=642978, runt= 20008msec
write: io=206175MB, bw=10303MB/s, iops=329681, runt= 20012msec
write: io=208853MB, bw=10430MB/s, iops=166882, runt= 20024msec
* [PATCH 0/3] Fix request completion holes
From: Jason Gunthorpe @ 2017-11-01 16:50 UTC

On Wed, Nov 01, 2017 at 06:02:38PM +0200, idanb wrote:
> (3) No send signal with invalidation signal w/o moderation 5.2M / 8.3M

Actually, isn't this valid and reasonable? You don't need to see the
SEND completion, the only thing that matters is the invalidate
completion, and the invalidate will naturally order after the send
is finished, yes?

Jason
* [PATCH 0/3] Fix request completion holes
From: Sagi Grimberg @ 2017-11-01 17:31 UTC

>> (3) No send signal with invalidation signal w/o moderation 5.2M / 8.3M
>
> Actually, isn't this valid and reasonable? You don't need to see the
> SEND completion, the only thing that matters is the invalidate
> completion, and the invalidate will naturally order after the send
> is finished, yes?

The default mode utilizes remote invalidation, so no, it's not valid.

Without remote invalidation I'm not sure, I don't remember the ordering
rules off the top of my head. Idan, does a local invalidate completion
guarantee a prior send completion (of a non-related buffer)? I vaguely
remember that explicit fencing is required to guarantee that.

I guess we could avoid signaling send completions when we are in local
invalidate mode, but I'm liking this less and less...
* [PATCH 0/3] Fix request completion holes
From: Jason Gunthorpe @ 2017-11-01 17:58 UTC

On Wed, Nov 01, 2017 at 07:31:39PM +0200, Sagi Grimberg wrote:
> the default mode utilize remote invalidation, so no, its not valid.

Well, usually the ULP design should allow some things to be reaped
async, and latency sensitive things to be sync.

Eg if you have a SEND using a local buffer with a RKEY, then you'd
declare the RKEY data completed when SEND_WITH_INVALIDATE is returned.

Recycling the lkey command buffer is an async process and can wait
unsignaled until something signalled comes to push its completion
through.

This approach reduces the # of CQEs required at the expense of
delaying reaping the lkey buffer.

> Without remote invalidate I'm not sure, don't remember the ordering
> rules from the top of my head. Idan, does a local invalidate completion
> guarantee a prior send completion (of a non-related buffer)? I vaguely
> remember that explicit fencing is required to guarantee that.

This was the case I thought Idan was testing..

Local invalidate is defined to always be ordered by the spec, so it
is required to guarantee that the SEND is completed.

> I guess we could not signal send completions if we are in a local
> invalidate mode, but I'm liking this less and less...

If the SEND reaping can be deferred as above, then the local
invalidate completion becomes the signaled CQE that allows the SEND
to be reaped.

Jason
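
A rough sketch of the deferred-reaping scheme Jason describes, reusing
the send_completed/resp_completed flags from this series (the handler
below is illustrative, not code from the patches): the command SEND is
left unsignaled, and the signaled LOCAL_INV completion is taken to imply
it, relying on in-order completion semantics of the send queue.

	/*
	 * Illustrative only: let the signaled LOCAL_INV completion stand in
	 * for the (unsignaled) command SEND posted before it.
	 */
	static void nvme_rdma_inv_rkey_done_sketch(struct ib_cq *cq, struct ib_wc *wc)
	{
		struct nvme_rdma_request *req =
			container_of(wc->wr_cqe, struct nvme_rdma_request, reg_cqe);
		struct request *rq = blk_mq_rq_from_pdu(req);

		if (unlikely(wc->status != IB_WC_SUCCESS)) {
			nvme_rdma_wr_error(cq, wc, "LOCAL_INV");
			return;
		}

		req->mr->need_inval = false;
		/*
		 * The LOCAL_INV completed, so any SEND posted earlier on the same
		 * send queue has completed as well; no separate SEND CQE is needed.
		 */
		req->send_completed = true;
		if (req->resp_completed)
			nvme_end_request(rq, req->cqe.status, req->cqe.result);
	}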
* [PATCH 0/3] Fix request completion holes
From: Sagi Grimberg @ 2017-11-02 8:06 UTC

> Well, usually the ULP design should allow some things to be reaped
> async, and latency senstive things to be sync.
>
> Eg if you have a SEND using a local buffer with a RKEY, then you'd
> declare the RKEY data completed when SEND_WITH_INVALIDATE is returned.
>
> Recycling the lkey command buffer is an async process and can wait
> unsignaled until something signalled comes to push its completion
> through.

Not when using inline data with the send, which is the main issue
here. If we inline data into the command, we use the local dma lkey,
which does not even have a local invalidate following it.

> This approach reduces the # of CQEs required at the expense of
> delaying reaping the lkey buffer.

Yes, I understand the advantage of suppressing send completions..
But I'm not very fond of non-trivial behavior depending on all sorts
of overly complicated conditions.

>> Without remote invalidate I'm not sure, don't remember the ordering
>> rules from the top of my head. Idan, does a local invalidate completion
>> guarantee a prior send completion (of a non-related buffer)? I vaguely
>> remember that explicit fencing is required to guarantee that.
>
> This was the case I thought Idan was testing..

Still doesn't mean it is guaranteed to do the right thing :)

> Local invalidate is defined to always be ordered by the spec, so it
> is required to guarentee that the SEND is completed.

So a local invalidate completion is guaranteed to come after all the
completions prior to it in the send queue?
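
For context, a sketch of posting an inline-data command capsule mapped
with the local DMA lkey (illustrative only; the helper name and
parameters are not taken from the driver). Because no rkey is advertised,
no invalidation will ever follow, so the SEND completion is the only
event that can tell us the HCA is done with these buffers.

	static int nvme_rdma_post_inline_cmd_sketch(struct nvme_rdma_queue *queue,
			struct nvme_rdma_request *req, u64 sqe_dma, u64 data_dma,
			u32 data_len)
	{
		struct ib_sge sge[2];
		struct ib_send_wr wr = {}, *bad_wr;

		sge[0].addr = sqe_dma;		/* command capsule */
		sge[0].length = sizeof(struct nvme_command);
		sge[0].lkey = queue->device->pd->local_dma_lkey;

		sge[1].addr = data_dma;		/* small write payload, sent inline */
		sge[1].length = data_len;
		sge[1].lkey = queue->device->pd->local_dma_lkey;

		wr.wr_cqe = &req->sqe.cqe;
		wr.sg_list = sge;
		wr.num_sge = 2;
		wr.opcode = IB_WR_SEND;
		/* nothing will be invalidated later, so the SEND must be signaled */
		wr.send_flags = IB_SEND_SIGNALED;

		return ib_post_send(queue->qp, &wr, &bad_wr);
	}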
* [PATCH 0/3] Fix request completion holes
From: Jason Gunthorpe @ 2017-11-02 15:12 UTC

On Thu, Nov 02, 2017 at 10:06:30AM +0200, Sagi Grimberg wrote:
> Not when using inline data with the send, which is the main issue
> here. if we inline data to the command, we will use the local
> dma lkey, which does not even have a local invalidate following it.

Does nvme use inline data send and RKEY transfer in the same SEND?
Then it would need to signal the SEND if remote invalidate is used,
otherwise it only needs to signal the local invalidate for the RKEY..

> So local invalidate completion is guaranteed to come after all the
> completions prior to it in the send queue?

IBA spec says so..

Jason
* [PATCH 0/3] Fix request completion holes
From: Sagi Grimberg @ 2017-11-02 15:23 UTC

> Does nvme use inline data send and RKEY transfer in the same SEND?

No, we use the local dma lkey for inline data.

> Then it would need to signal the SEND if remote invalidate is used,
> otherwise it only needs to signal the local invalidate for the RKEY..

Also if it's simply using the local dma lkey without sending any rkey.
* [PATCH 0/3] Fix request completion holes
From: Jason Gunthorpe @ 2017-11-02 15:51 UTC

On Thu, Nov 02, 2017 at 05:23:36PM +0200, Sagi Grimberg wrote:
>> Then it would need to signal the SEND if remote invalidate is used,
>> otherwise it only needs to signal the local invalidate for the RKEY..
>
> Also if its simply using the local dma lkey without sending any rkey.

Right, OK, to be clear, my emails were only about the case where there
was a SEND and INVALIDATE required together, I wasn't discussing SEND
alone, that is already as good as it can be I think :)

Thanks,
Jason
* [PATCH 0/3] Fix request completion holes
From: Sagi Grimberg @ 2017-11-02 16:15 UTC

> Right, OK, to be clear, my emails were only about the case where there
> was a SEND and INVALIDATE required together, I wasn't discussing SEND
> alone, that is already as good as it can be I think :)

Not arguing here :)
* [PATCH 0/3] Fix request completion holes
From: Steve Wise @ 2017-11-02 16:18 UTC

> On Thu, Nov 02, 2017 at 10:06:30AM +0200, Sagi Grimberg wrote:
> > So local invalidate completion is guaranteed to come after all the
> > completions prior to it in the send queue?
>
> IBA spec says so..

iWARP spec too. This is in regard to completion ordering though. The
local invalidate send WR must have the IB_SEND_FENCE flag set if you
want it to only be executed after all prior send WRs are executed.
Either way, the completions are always inserted into the cq
in-sq-submission-order, and a signaled completion implies completions
of all prior unsignaled WRs.

Steve.
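
A small sketch of the fenced, signaled local invalidate Steve describes
(illustrative only, not part of this series; the fields mirror the
driver's existing nvme_rdma_inv_rkey() with IB_SEND_FENCE added):

	static int nvme_rdma_inv_rkey_fenced_sketch(struct nvme_rdma_queue *queue,
			struct nvme_rdma_request *req)
	{
		struct ib_send_wr *bad_wr;
		struct ib_send_wr wr = {
			.opcode = IB_WR_LOCAL_INV,
			.wr_cqe = &req->reg_cqe,
			.num_sge = 0,
			/* signaled: we get a CQE; fenced: only executes after all
			 * previously posted WRs on this send queue have completed */
			.send_flags = IB_SEND_SIGNALED | IB_SEND_FENCE,
			.ex.invalidate_rkey = req->mr->rkey,
		};

		return ib_post_send(queue->qp, &wr, &bad_wr);
	}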
* [PATCH 0/3] Fix request completion holes
From: Jason Gunthorpe @ 2017-11-02 16:36 UTC

On Thu, Nov 02, 2017 at 11:18:42AM -0500, Steve Wise wrote:
> iWARP spec too. This is in regard to completion ordering though. The local
> invalidate send WR must have the IB_SEND_FENCE flag set if you want it to only
> be executed after all prior send WRs are executed.

Oh right, yes, you need to have the local invalidate fence to guarantee
that, but usually a local operation will not refer to a RKEY, so
execution ordering won't matter.

But this is a good general point: doesn't a ULP need to set FENCE on
SEND, eg:

  RDMA READ (rkey)
  RDMA READ (rkey)
  SEND w/ FENCE (tell remote to invalidate rkey)

Otherwise IBA Table 79 says RDMA READ can pass SEND and we have a
situation where the rkey has become invalidated when the remote is
still trying to use it.

IBA actually explicitly calls this out (pg 538):

  RDMA Read operations may be executed at the target after subsequent
  Send and Invalidate operation already performed the invalidation at
  the target. That may cause the RDMA Read operation to fail. Setting
  the Fence Indicator on the subsequent operations guarantees that the
  RDMA Read will fully complete before the invalidation takes place.

None of our ULPs use IB_SEND_FENCE - so this is a bug!

> Either way, the completions are always inserted into the cq
> in-sq-submission-order and a signaled completion implies completions
> of all prior unsignaled WRs.

Yes, exactly what is relevant for this nvme case..

Jason
* [PATCH 0/3] Fix request completion holes
From: Steve Wise @ 2017-11-02 16:53 UTC

> Otherwise IBA Table 79 says RDMA READ can pass SEND and we have a
> situation where the rkey has become invalidated when the remote is
> still trying to use it.
>
> None of our ULPs use IB_SEND_FENCE - so this is a bug!

I think the nvmf target doesn't post the SEND_W_INVALIDATE (NVMF CQE)
until the RDMA READ(s) (and back end BIO submission) complete.

Steve.
* [PATCH 0/3] Fix request completion holes
From: Jason Gunthorpe @ 2017-11-02 16:54 UTC

On Thu, Nov 02, 2017 at 11:53:44AM -0500, Steve Wise wrote:
> I think the nvmf target doesn't post the SEND_W_INVALIDATE (NVMF CQE) until the
> RDMA READ(s) (and back end BIO submission) complete.

Ah! That makes sense for a storage use case.

Thanks,
Jason
* [PATCH 0/3] Fix request completion holes
From: Sagi Grimberg @ 2017-11-01 17:26 UTC

> (4) Send signal with invalidation signal w/o moderation 3.4M / 4.6M

What about remote invalidation? That is how it will usually be used...

> (5) Send signal with invalidation signal w/ moderation 6.2M / 11.8M

What about latency effects? Any noticeable change?

> I understand that (1), (2) and (3) are invalid, the performance
> reduction induce by mode (4) need to motivate us to enable cqe/event
> moderation as soon as possible.

Well good, it means you can help test my adaptive coalescing code :)

May I ask what moderation parameters you used? It will be useful
to pin down the profile levels the adaptive moderator would bounce
between.
* [PATCH 0/3] Fix request completion holes
From: Max Gurtovoy @ 2017-11-01 22:23 UTC

On 11/1/2017 7:26 PM, Sagi Grimberg wrote:
>> (4) Send signal with invalidation signal w/o moderation 3.4M / 4.6M
>
> What about remote invalidation? That is how it will usually be used...

We haven't tested remote invalidation yet.

>> (5) Send signal with invalidation signal w/ moderation 6.2M / 11.8M
>
> What about latency effects? Any noticeable change?
>
>> I understand that (1), (2) and (3) are invalid, the performance
>> reduction induce by mode (4) need to motivate us to enable cqe/event
>> moderation as soon as possible.
>
> Well good, it means you can help test my adaptive coalescing code :)

Of course.

> May I ask what moderation parameters did you use? it will be useful
> to pin down the profile levels the adaptive moderator would bounce
> between.

Sure, I mentioned the params earlier (cq_completions=32 cq_timeout=16).

Sagi,
As we can see from our results, the changes made in this patchset must
be pushed together with some cq moderation (adaptive, or a manual option
meanwhile). In the testing I created a module parameter, but we can
create a different option for the user to configure.
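
For reference, static moderation like Max describes can be applied
through the verbs CQ moderation hook, roughly as below (a sketch only;
not every provider implements modify_cq, and the parameter names simply
mirror the ones used in this thread):

	/*
	 * Sketch: raise a CQ event only after cq_completions CQEs or after
	 * cq_timeout (in device-specific units) has elapsed. Minimal error
	 * handling; provider support is not guaranteed.
	 */
	static int nvme_rdma_set_cq_moderation_sketch(struct ib_cq *cq,
			u16 cq_completions, u16 cq_timeout)
	{
		int ret;

		ret = ib_modify_cq(cq, cq_completions, cq_timeout);
		if (ret)
			pr_warn("CQ moderation not applied (err %d)\n", ret);
		return ret;
	}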
* [PATCH 0/3] Fix request completion holes
From: Steve Wise @ 2017-11-02 17:55 UTC

> Well good, it means you can help test my adaptive coalescing code :)
>
> May I ask what moderation parameters did you use? it will be useful
> to pin down the profile levels the adaptive moderator would bounce
> between.

Hey Sagi, I can help too on testing this with iw_cxgb4. Where is your
current WIP?

Steve.