* [PATCH 0/3] Fix request completion holes
@ 2017-10-31 8:55 Sagi Grimberg
[not found] ` <1509440122-1190-1-git-send-email-sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
0 siblings, 1 reply; 23+ messages in thread
From: Sagi Grimberg @ 2017-10-31 8:55 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
Cc: Christoph Hellwig, Jason Gunthorpe, idanb-VPRAkNaXOzVWk0Htik3J/w
We have two holes in nvme-rdma when completing request.
1. We never wait for send work request to complete before completing
a request. It is possible that the HCA retries a send operation (due
to dropped ack) after the nvme cqe has already arrived back to the host.
If we unmap the host buffer upon reception of the cqe, the HCA might
get iommu errors when attempting to access an unmapped host buffer.
We must wait also for the send completion before completing a request,
most of the time it will be before the nvme cqe has arrived back so
we pay only for the extra cq entry processing.
2. We don't wait for the request memory region to be fully invalidated
in case the target didn't invalidate remotely. We must wait for the local
invalidation to complete before completing the request.
Sagi Grimberg (3):
nvme-rdma: don't suppress send completions
nvme-rdma: don't complete requests before a send work request has
completed
nvme-rdma: wait for local invalidation before completing a request
drivers/nvme/host/rdma.c | 101 +++++++++++++++++++++++------------------------
1 file changed, 50 insertions(+), 51 deletions(-)
--
2.7.4
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 23+ messages in thread[parent not found: <1509440122-1190-1-git-send-email-sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>]
* [PATCH 1/3] nvme-rdma: don't suppress send completions [not found] ` <1509440122-1190-1-git-send-email-sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> @ 2017-10-31 8:55 ` Sagi Grimberg 2017-10-31 8:55 ` [PATCH 2/3] nvme-rdma: don't complete requests before a send work request has completed Sagi Grimberg ` (2 subsequent siblings) 3 siblings, 0 replies; 23+ messages in thread From: Sagi Grimberg @ 2017-10-31 8:55 UTC (permalink / raw) To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Cc: Christoph Hellwig, Jason Gunthorpe, idanb-VPRAkNaXOzVWk0Htik3J/w The entire completions suppress mechanism is currently broken because the HCA might retry a send operation (due to dropped ack) after the nvme transaction has completed. In order to handle this, we signal all send completions (besides async event which is not racing anything). Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> --- drivers/nvme/host/rdma.c | 40 ++++------------------------------------ 1 file changed, 4 insertions(+), 36 deletions(-) diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 53ad241493a1..ccbae327fe72 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -1001,21 +1001,9 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc) nvme_rdma_wr_error(cq, wc, "SEND"); } -/* - * We want to signal completion at least every queue depth/2. This returns the - * largest power of two that is not above half of (queue size + 1) to optimize - * (avoid divisions). - */ -static inline bool nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue) -{ - int limit = 1 << ilog2((queue->queue_size + 1) / 2); - - return (atomic_inc_return(&queue->sig_count) & (limit - 1)) == 0; -} - static int nvme_rdma_post_send(struct nvme_rdma_queue *queue, struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge, - struct ib_send_wr *first, bool flush) + struct ib_send_wr *first) { struct ib_send_wr wr, *bad_wr; int ret; @@ -1031,24 +1019,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue, wr.sg_list = sge; wr.num_sge = num_sge; wr.opcode = IB_WR_SEND; - wr.send_flags = 0; - - /* - * Unsignalled send completions are another giant desaster in the - * IB Verbs spec: If we don't regularly post signalled sends - * the send queue will fill up and only a QP reset will rescue us. - * Would have been way to obvious to handle this in hardware or - * at least the RDMA stack.. - * - * Always signal the flushes. The magic request used for the flush - * sequencer is not allocated in our driver's tagset and it's - * triggered to be freed by blk_cleanup_queue(). So we need to - * always mark it as signaled to ensure that the "wr_cqe", which is - * embedded in request's payload, is not freed when __ib_process_cq() - * calls wr_cqe->done(). - */ - if (nvme_rdma_queue_sig_limit(queue) || flush) - wr.send_flags |= IB_SEND_SIGNALED; + wr.send_flags = IB_SEND_SIGNALED; if (first) first->next = ≀ @@ -1122,7 +1093,7 @@ static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg, int aer_idx) ib_dma_sync_single_for_device(dev, sqe->dma, sizeof(*cmd), DMA_TO_DEVICE); - ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL, false); + ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL); WARN_ON_ONCE(ret); } @@ -1407,7 +1378,6 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx, struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq); struct nvme_rdma_qe *sqe = &req->sqe; struct nvme_command *c = sqe->data; - bool flush = false; struct ib_device *dev; blk_status_t ret; int err; @@ -1439,10 +1409,8 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx, ib_dma_sync_single_for_device(dev, sqe->dma, sizeof(struct nvme_command), DMA_TO_DEVICE); - if (req_op(rq) == REQ_OP_FLUSH) - flush = true; err = nvme_rdma_post_send(queue, sqe, req->sge, req->num_sge, - req->mr->need_inval ? &req->reg_wr.wr : NULL, flush); + req->mr->need_inval ? &req->reg_wr.wr : NULL); if (unlikely(err)) { nvme_rdma_unmap_data(queue, rq); goto err; -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 2/3] nvme-rdma: don't complete requests before a send work request has completed [not found] ` <1509440122-1190-1-git-send-email-sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> 2017-10-31 8:55 ` [PATCH 1/3] nvme-rdma: don't suppress send completions Sagi Grimberg @ 2017-10-31 8:55 ` Sagi Grimberg 2017-10-31 8:55 ` [PATCH 3/3] nvme-rdma: wait for local invalidation before completing a request Sagi Grimberg 2017-10-31 9:38 ` [PATCH 0/3] Fix request completion holes Max Gurtovoy 3 siblings, 0 replies; 23+ messages in thread From: Sagi Grimberg @ 2017-10-31 8:55 UTC (permalink / raw) To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Cc: Christoph Hellwig, Jason Gunthorpe, idanb-VPRAkNaXOzVWk0Htik3J/w In order to guarantee that the HCA will never get an access violation (either from invalidated rkey or from iommu) when retrying a send operation we must complete a request only when both send completion and the nvme cqe has arrived. Only then we are safe to invalidate the rkey (if needed), unmap the host buffers, and complete the IO. Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> --- drivers/nvme/host/rdma.c | 35 +++++++++++++++++++++++++++++------ 1 file changed, 29 insertions(+), 6 deletions(-) diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index ccbae327fe72..ae1fb66358f7 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -67,6 +67,9 @@ struct nvme_rdma_request { struct nvme_request req; struct ib_mr *mr; struct nvme_rdma_qe sqe; + struct nvme_completion cqe; + bool send_completed; + bool resp_completed; struct ib_sge sge[1 + NVME_RDMA_MAX_INLINE_SEGMENTS]; u32 num_sge; int nents; @@ -961,6 +964,8 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue, req->num_sge = 1; req->inline_data = false; req->mr->need_inval = false; + req->send_completed = false; + req->resp_completed = false; c->common.flags |= NVME_CMD_SGL_METABUF; @@ -997,13 +1002,25 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue, static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc) { - if (unlikely(wc->status != IB_WC_SUCCESS)) + struct nvme_rdma_qe *qe = + container_of(wc->wr_cqe, struct nvme_rdma_qe, cqe); + struct nvme_rdma_request *req = + container_of(qe, struct nvme_rdma_request, sqe); + struct request *rq = blk_mq_rq_from_pdu(req); + + if (unlikely(wc->status != IB_WC_SUCCESS)) { nvme_rdma_wr_error(cq, wc, "SEND"); + return; + } + + req->send_completed = true; + if (req->resp_completed) + nvme_end_request(rq, req->cqe.status, req->cqe.result); } static int nvme_rdma_post_send(struct nvme_rdma_queue *queue, struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge, - struct ib_send_wr *first) + struct ib_send_wr *first, bool signal) { struct ib_send_wr wr, *bad_wr; int ret; @@ -1019,7 +1036,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue, wr.sg_list = sge; wr.num_sge = num_sge; wr.opcode = IB_WR_SEND; - wr.send_flags = IB_SEND_SIGNALED; + wr.send_flags = signal ? IB_SEND_SIGNALED : 0; if (first) first->next = ≀ @@ -1093,7 +1110,7 @@ static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg, int aer_idx) ib_dma_sync_single_for_device(dev, sqe->dma, sizeof(*cmd), DMA_TO_DEVICE); - ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL); + ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL, false); WARN_ON_ONCE(ret); } @@ -1117,11 +1134,17 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue, if (rq->tag == tag) ret = 1; + req->cqe.status = cqe->status; + req->cqe.result = cqe->result; + req->resp_completed = true; + if ((wc->wc_flags & IB_WC_WITH_INVALIDATE) && wc->ex.invalidate_rkey == req->mr->rkey) req->mr->need_inval = false; - nvme_end_request(rq, cqe->status, cqe->result); + if (req->send_completed) + nvme_end_request(rq, req->cqe.status, req->cqe.result); + return ret; } @@ -1410,7 +1433,7 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx, sizeof(struct nvme_command), DMA_TO_DEVICE); err = nvme_rdma_post_send(queue, sqe, req->sge, req->num_sge, - req->mr->need_inval ? &req->reg_wr.wr : NULL); + req->mr->need_inval ? &req->reg_wr.wr : NULL, true); if (unlikely(err)) { nvme_rdma_unmap_data(queue, rq); goto err; -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 3/3] nvme-rdma: wait for local invalidation before completing a request [not found] ` <1509440122-1190-1-git-send-email-sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> 2017-10-31 8:55 ` [PATCH 1/3] nvme-rdma: don't suppress send completions Sagi Grimberg 2017-10-31 8:55 ` [PATCH 2/3] nvme-rdma: don't complete requests before a send work request has completed Sagi Grimberg @ 2017-10-31 8:55 ` Sagi Grimberg 2017-10-31 9:38 ` [PATCH 0/3] Fix request completion holes Max Gurtovoy 3 siblings, 0 replies; 23+ messages in thread From: Sagi Grimberg @ 2017-10-31 8:55 UTC (permalink / raw) To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Cc: Christoph Hellwig, Jason Gunthorpe, idanb-VPRAkNaXOzVWk0Htik3J/w We must not complete a request before the host memory region is invalidated. Luckily we have send with invalidate protocol support so we usually don't need to execute it, but in case the target did not invalidate a memory region for us, we must wait for the invalidation to complete before unmapping host memory and completing the I/O. Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> --- drivers/nvme/host/rdma.c | 42 +++++++++++++++++++++++++----------------- 1 file changed, 25 insertions(+), 17 deletions(-) diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index ae1fb66358f7..b7e0fb0fe913 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -818,8 +818,19 @@ static void nvme_rdma_memreg_done(struct ib_cq *cq, struct ib_wc *wc) static void nvme_rdma_inv_rkey_done(struct ib_cq *cq, struct ib_wc *wc) { - if (unlikely(wc->status != IB_WC_SUCCESS)) + struct nvme_rdma_request *req = + container_of(wc->wr_cqe, struct nvme_rdma_request, reg_cqe); + struct request *rq = blk_mq_rq_from_pdu(req); + + if (unlikely(wc->status != IB_WC_SUCCESS)) { nvme_rdma_wr_error(cq, wc, "LOCAL_INV"); + return; + } + + req->mr->need_inval = false; + if (req->resp_completed && req->send_completed) + nvme_end_request(rq, req->cqe.status, req->cqe.result); + } static int nvme_rdma_inv_rkey(struct nvme_rdma_queue *queue, @@ -830,7 +841,7 @@ static int nvme_rdma_inv_rkey(struct nvme_rdma_queue *queue, .opcode = IB_WR_LOCAL_INV, .next = NULL, .num_sge = 0, - .send_flags = 0, + .send_flags = IB_SEND_SIGNALED, .ex.invalidate_rkey = req->mr->rkey, }; @@ -844,24 +855,12 @@ static void nvme_rdma_unmap_data(struct nvme_rdma_queue *queue, struct request *rq) { struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq); - struct nvme_rdma_ctrl *ctrl = queue->ctrl; struct nvme_rdma_device *dev = queue->device; struct ib_device *ibdev = dev->dev; - int res; if (!blk_rq_bytes(rq)) return; - if (req->mr->need_inval && test_bit(NVME_RDMA_Q_LIVE, &req->queue->flags)) { - res = nvme_rdma_inv_rkey(queue, req); - if (unlikely(res < 0)) { - dev_err(ctrl->ctrl.device, - "Queueing INV WR for rkey %#x failed (%d)\n", - req->mr->rkey, res); - nvmf_error_recovery(&queue->ctrl->ctrl); - } - } - ib_dma_unmap_sg(ibdev, req->sg_table.sgl, req->nents, rq_data_dir(rq) == WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE); @@ -1014,7 +1013,7 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc) } req->send_completed = true; - if (req->resp_completed) + if (req->resp_completed && !req->mr->need_inval) nvme_end_request(rq, req->cqe.status, req->cqe.result); } @@ -1139,10 +1138,19 @@ static int nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue, req->resp_completed = true; if ((wc->wc_flags & IB_WC_WITH_INVALIDATE) && - wc->ex.invalidate_rkey == req->mr->rkey) + wc->ex.invalidate_rkey == req->mr->rkey) { req->mr->need_inval = false; + } else if (req->mr->need_inval) { + ret = nvme_rdma_inv_rkey(queue, req); + if (unlikely(ret < 0)) { + dev_err(queue->ctrl->ctrl.device, + "Queueing INV WR for rkey %#x failed (%d)\n", + req->mr->rkey, ret); + nvmf_error_recovery(&queue->ctrl->ctrl); + } + } - if (req->send_completed) + if (req->send_completed && !req->mr->need_inval) nvme_end_request(rq, req->cqe.status, req->cqe.result); return ret; -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 0/3] Fix request completion holes [not found] ` <1509440122-1190-1-git-send-email-sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> ` (2 preceding siblings ...) 2017-10-31 8:55 ` [PATCH 3/3] nvme-rdma: wait for local invalidation before completing a request Sagi Grimberg @ 2017-10-31 9:38 ` Max Gurtovoy [not found] ` <8abcde91-9150-2982-3900-078619bcdac0-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 3 siblings, 1 reply; 23+ messages in thread From: Max Gurtovoy @ 2017-10-31 9:38 UTC (permalink / raw) To: Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Cc: Jason Gunthorpe, idanb-VPRAkNaXOzVWk0Htik3J/w, Christoph Hellwig On 10/31/2017 10:55 AM, Sagi Grimberg wrote: > We have two holes in nvme-rdma when completing request. > Thanks Sagi. We'll run these patches in our lab + our patches for the performance improvments that we sent few days ago and update the results. I guess I'll need to run it with my cq moderation patches to get less penalty in performance. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <8abcde91-9150-2982-3900-078619bcdac0-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH 0/3] Fix request completion holes [not found] ` <8abcde91-9150-2982-3900-078619bcdac0-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2017-10-31 11:10 ` Sagi Grimberg [not found] ` <a950a671-22a1-432c-555e-8309c8a64a88-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Sagi Grimberg @ 2017-10-31 11:10 UTC (permalink / raw) To: Max Gurtovoy, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Cc: Jason Gunthorpe, idanb-VPRAkNaXOzVWk0Htik3J/w, Christoph Hellwig > Thanks Sagi. > We'll run these patches in our lab + our patches for the performance > improvments that we sent few days ago and update the results. Thanks, > I guess I'll need to run it with my cq moderation patches to get less > penalty in performance. Fine, but just a heads up, any cq moderation that is not adaptive is not something that we'd like to pursue. I have a generic adaptive moderation library in the works, so any ideas should be integrated into that (or something equivalent). -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <a950a671-22a1-432c-555e-8309c8a64a88-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>]
* Re: [PATCH 0/3] Fix request completion holes [not found] ` <a950a671-22a1-432c-555e-8309c8a64a88-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> @ 2017-11-01 16:02 ` idanb [not found] ` <d9dcdb05-d001-53d8-7ed7-5fafc1709f4a-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: idanb @ 2017-11-01 16:02 UTC (permalink / raw) To: Sagi Grimberg, Max Gurtovoy, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Cc: Jason Gunthorpe, Christoph Hellwig Sagi, Thanks for you patches, Max and me ran your patches adding completion signaling for both send (command capsule) and local invalidation request. The results were significant reduction of IOPs due to interrupts and completion processing, see below our results for READ/WRITE IOPs (512B, 64 Jobs): (1) No send signal and no invalidation signal w/ moderation 7.4M / 11.9M (2) No send signal and no invalidation signal w/o moderation 7.1M / 8.7M (3) No send signal with invalidation signal w/o moderation 5.2M / 8.3M (4) Send signal with invalidation signal w/o moderation 3.4M / 4.6M (5) Send signal with invalidation signal w/ moderation 6.2M / 11.8M I understand that (1), (2) and (3) are invalid, the performance reduction induce by mode (4) need to motivate us to enable cqe/event moderation as soon as possible. On 10/31/2017 1:10 PM, Sagi Grimberg wrote: >> Thanks Sagi. >> We'll run these patches in our lab + our patches for the performance >> improvments that we sent few days ago and update the results. > > Thanks, > >> I guess I'll need to run it with my cq moderation patches to get less >> penalty in performance. > > Fine, but just a heads up, any cq moderation that is not adaptive is > not something > that we'd like to pursue. I have a generic adaptive moderation library > in the > works, so any ideas should be integrated into that (or something > equivalent). -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <d9dcdb05-d001-53d8-7ed7-5fafc1709f4a-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH 0/3] Fix request completion holes [not found] ` <d9dcdb05-d001-53d8-7ed7-5fafc1709f4a-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> @ 2017-11-01 16:09 ` Max Gurtovoy 2017-11-01 16:50 ` Jason Gunthorpe 2017-11-01 17:26 ` Sagi Grimberg 2 siblings, 0 replies; 23+ messages in thread From: Max Gurtovoy @ 2017-11-01 16:09 UTC (permalink / raw) To: idanb, Sagi Grimberg, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Cc: Jason Gunthorpe, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 1076 bytes --] On 11/1/2017 6:02 PM, idanb wrote: > Sagi, > > Thanks for you patches, Max and me ran your patches adding completion > signaling for both send (command capsule) and local invalidation request. > > The results were significant reduction of IOPs due to interrupts and > completion processing, see below our results for READ/WRITE IOPs (512B, > 64 Jobs): > > (1) No send signal and no invalidation signal w/ moderation 7.4M / 11.9M > > (2) No send signal and no invalidation signal w/o moderation 7.1M / 8.7M > > (3) No send signal with invalidation signal w/o moderation 5.2M / 8.3M > > (4) Send signal with invalidation signal w/o moderation 3.4M / 4.6M > > (5) Send signal with invalidation signal w/ moderation 6.2M / 11.8M > > > I understand that (1), (2) and (3) are invalid, the performance > reduction induce by mode (4) need to motivate us to enable cqe/event > moderation as soon as possible. > attached the full report we ran with Sagi's patches (+ cq moderation cases with cq_completions=32 cq_timeout=16 + old send signal [once in queue_size/2]) [-- Attachment #2: final_report.log --] [-- Type: text/plain, Size: 9663 bytes --] READ: Signal with moderation +++++++++++++++++++++++++++++++ read (512B, 32 CPUs, 32 jobs, iodepth=128): io=61146MB, bw=3057.2MB/s, iops=6261.9K, runt= 20001msec read (1k, 32 CPUs, 32 jobs, iodepth=128) : io=110943MB, bw=5546.1MB/s, iops=5680.2K, runt= 20001msec read (2k, 32 CPUs, 32 jobs, iodepth=128): io=163192MB, bw=8159.3MB/s, iops=4177.6K, runt= 20001msec read (4k, 32 CPUs, 32 jobs, iodepth=128): io=177120MB, bw=8855.2MB/s, iops=2266.1K, runt= 20002msec read (8k, 32 CPUs, 32 jobs, iodepth=128): io=187884MB, bw=9392.9MB/s, iops=1202.3K, runt= 20003msec read (16k, 32 CPUs, 32 jobs, iodepth=128): io=190151MB, bw=9505.7MB/s, iops=608359, runt= 20004msec read (32k, 32 CPUs, 32 jobs, iodepth=128): io=191209MB, bw=9556.7MB/s, iops=305812, runt= 20008msec read (64k, 32 CPUs, 32 jobs, iodepth=128): io=192417MB, bw=9611.8MB/s, iops=153787, runt= 20019msec read (512B, 32 CPUs, 64 jobs, iodepth=128): io=34871MB, bw=1743.5MB/s, iops=3570.7K, runt= 20001msec read (1k, 32 CPUs, 64 jobs, iodepth=128): io=106594MB, bw=5329.2MB/s, iops=5457.8K, runt= 20002msec read (2k, 32 CPUs, 64 jobs, iodepth=128): io=153952MB, bw=7696.9MB/s, iops=3940.8K, runt= 20002msec read (4k, 32 CPUs, 64 jobs, iodepth=128): io=171229MB, bw=8560.2MB/s, iops=2191.4K, runt= 20003msec read (8k, 32 CPUs, 64 jobs, iodepth=128): io=178982MB, bw=8946.9MB/s, iops=1145.2K, runt= 20005msec read (16k, 32 CPUs, 64 jobs, iodepth=128): io=182606MB, bw=9126.2MB/s, iops=584076, runt= 20009msec read (32k, 32 CPUs, 64 jobs, iodepth=128): io=183271MB, bw=9157.2MB/s, iops=293028, runt= 20014msec read (64k, 32 CPUs, 64 jobs, iodepth=128): io=169174MB, bw=8447.3MB/s, iops=135156, runt= 20027msec READ: Signal without moderation ++++++++++++++++++++++++++++++++ read : io=33410MB, bw=1670.5MB/s, iops=3420.1K, runt= 20001msec read : io=66287MB, bw=3314.2MB/s, iops=3393.8K, runt= 20001msec read : io=128976MB, bw=6448.5MB/s, iops=3301.7K, runt= 20001msec read : io=172651MB, bw=8631.8MB/s, iops=2209.8K, runt= 20002msec read : io=181532MB, bw=9076.2MB/s, iops=1161.8K, runt= 20001msec read : io=194935MB, bw=9744.9MB/s, iops=623667, runt= 20004msec read : io=192299MB, bw=9610.7MB/s, iops=307540, runt= 20009msec read : io=192919MB, bw=9635.9MB/s, iops=154173, runt= 20021msec read : io=31905MB, bw=1595.2MB/s, iops=3266.1K, runt= 20001msec read : io=69284MB, bw=3464.2MB/s, iops=3547.2K, runt= 20001msec read : io=146078MB, bw=7303.2MB/s, iops=3739.3K, runt= 20002msec read : io=168585MB, bw=8428.4MB/s, iops=2157.6K, runt= 20003msec read : io=181224MB, bw=9058.1MB/s, iops=1159.6K, runt= 20005msec read : io=184970MB, bw=9244.4MB/s, iops=591638, runt= 20009msec read : io=184823MB, bw=9234.8MB/s, iops=295510, runt= 20014msec read : io=175313MB, bw=8753.4MB/s, iops=140054, runt= 20028msec READ: No SEND Signal with moderation (with signal on local invalidation) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ read : io=61968MB, bw=3098.3MB/s, iops=6345.3K, runt= 20001msec read : io=112583MB, bw=5628.9MB/s, iops=5763.1K, runt= 20001msec read : io=167024MB, bw=8350.9MB/s, iops=4275.7K, runt= 20001msec read : io=181638MB, bw=9081.1MB/s, iops=2324.8K, runt= 20002msec read : io=185373MB, bw=9267.3MB/s, iops=1186.3K, runt= 20003msec read : io=191548MB, bw=9575.6MB/s, iops=612832, runt= 20004msec read : io=190085MB, bw=9499.2MB/s, iops=303999, runt= 20009msec read : io=192519MB, bw=9616.9MB/s, iops=153868, runt= 20019msec read : io=58495MB, bw=2924.6MB/s, iops=5989.6K, runt= 20001msec read : io=109628MB, bw=5480.9MB/s, iops=5612.4K, runt= 20002msec read : io=156339MB, bw=7816.6MB/s, iops=4002.8K, runt= 20001msec read : io=171421MB, bw=8569.9MB/s, iops=2193.9K, runt= 20003msec read : io=179565MB, bw=8976.3MB/s, iops=1148.1K, runt= 20005msec read : io=182728MB, bw=9132.8MB/s, iops=584495, runt= 20008msec read : io=184840MB, bw=9235.2MB/s, iops=295551, runt= 20013msec read : io=180544MB, bw=9015.8MB/s, iops=144240, runt= 20027msec READ: No SEND Signal without moderation (with signal on local invalidation) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ read : io=40289MB, bw=2014.4MB/s, iops=4125.5K, runt= 20001msec read : io=78689MB, bw=3934.3MB/s, iops=4028.7K, runt= 20001msec read : io=143805MB, bw=7189.1MB/s, iops=3681.3K, runt= 20001msec read : io=173558MB, bw=8677.4MB/s, iops=2221.4K, runt= 20002msec read : io=186197MB, bw=9308.5MB/s, iops=1191.5K, runt= 20003msec read : io=192793MB, bw=9637.8MB/s, iops=616812, runt= 20004msec read : io=192399MB, bw=9616.1MB/s, iops=307714, runt= 20008msec read : io=194790MB, bw=9731.8MB/s, iops=155707, runt= 20016msec read : io=50862MB, bw=2542.2MB/s, iops=5208.3K, runt= 20001msec read : io=103616MB, bw=5180.6MB/s, iops=5304.9K, runt= 20001msec read : io=150399MB, bw=7519.3MB/s, iops=3849.9K, runt= 20002msec read : io=173678MB, bw=8682.2MB/s, iops=2222.7K, runt= 20004msec read : io=181481MB, bw=9071.9MB/s, iops=1161.2K, runt= 20005msec read : io=186601MB, bw=9326.4MB/s, iops=596884, runt= 20008msec read : io=187268MB, bw=9356.9MB/s, iops=299419, runt= 20014msec read : io=176620MB, bw=8817.4MB/s, iops=141077, runt= 20031msec Write: Signal with moderation +++++++++++++++++++++++++++++++ write: io=115326MB, bw=5766.2MB/s, iops=11809K, runt= 20001msec write: io=148240MB, bw=7411.7MB/s, iops=7589.6K, runt= 20001msec write: io=161919MB, bw=8095.6MB/s, iops=4144.1K, runt= 20001msec write: io=171162MB, bw=8557.3MB/s, iops=2190.7K, runt= 20002msec write: io=199720MB, bw=9984.6MB/s, iops=1278.2K, runt= 20003msec write: io=206792MB, bw=10338MB/s, iops=661600, runt= 20004msec write: io=209549MB, bw=10474MB/s, iops=335161, runt= 20007msec write: io=208180MB, bw=10402MB/s, iops=166427, runt= 20014msec write: io=46900MB, bw=2344.1MB/s, iops=4802.4K, runt= 20001msec write: io=105243MB, bw=5261.1MB/s, iops=5388.2K, runt= 20001msec write: io=160543MB, bw=8026.4MB/s, iops=4109.6K, runt= 20002msec write: io=174457MB, bw=8721.2MB/s, iops=2232.7K, runt= 20004msec write: io=192598MB, bw=9627.2MB/s, iops=1232.4K, runt= 20004msec write: io=201587MB, bw=10076MB/s, iops=644851, runt= 20007msec write: io=205567MB, bw=10272MB/s, iops=328693, runt= 20013msec write: io=208508MB, bw=10412MB/s, iops=166590, runt= 20026msec Write: Signal without moderation +++++++++++++++++++++++++++++++++ write: io=43445MB, bw=2172.2MB/s, iops=4448.6K, runt= 20001msec write: io=93361MB, bw=4667.9MB/s, iops=4779.9K, runt= 20001msec write: io=160929MB, bw=8046.6MB/s, iops=4119.6K, runt= 20001msec write: io=171522MB, bw=8575.3MB/s, iops=2195.3K, runt= 20002msec write: io=198314MB, bw=9914.2MB/s, iops=1269.2K, runt= 20003msec write: io=206332MB, bw=10315MB/s, iops=660129, runt= 20004msec write: io=207733MB, bw=10383MB/s, iops=332240, runt= 20008msec write: io=208141MB, bw=10399MB/s, iops=166379, runt= 20016msec write: io=45763MB, bw=2288.2MB/s, iops=4685.9K, runt= 20001msec write: io=100083MB, bw=5003.1MB/s, iops=5123.1K, runt= 20001msec write: io=159084MB, bw=7953.5MB/s, iops=4072.2K, runt= 20002msec write: io=173523MB, bw=8674.5MB/s, iops=2220.7K, runt= 20004msec write: io=189560MB, bw=9476.2MB/s, iops=1212.1K, runt= 20004msec write: io=199250MB, bw=9958.6MB/s, iops=637344, runt= 20008msec write: io=204977MB, bw=10243MB/s, iops=327767, runt= 20012msec write: io=208619MB, bw=10417MB/s, iops=166678, runt= 20026msec Write: No SEND Signal with moderation (with signal on local invalidation) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ write: io=116784MB, bw=5838.1MB/s, iops=11958K, runt= 20001msec write: io=143658MB, bw=7182.6MB/s, iops=7354.1K, runt= 20001msec write: io=161343MB, bw=8066.4MB/s, iops=4129.1K, runt= 20002msec write: io=170761MB, bw=8537.3MB/s, iops=2185.6K, runt= 20002msec write: io=199668MB, bw=9982.5MB/s, iops=1277.8K, runt= 20002msec write: io=207307MB, bw=10363MB/s, iops=663248, runt= 20004msec write: io=209903MB, bw=10491MB/s, iops=335727, runt= 20007msec write: io=208268MB, bw=10406MB/s, iops=166489, runt= 20015msec write: io=87286MB, bw=4364.1MB/s, iops=8937.7K, runt= 20001msec write: io=145686MB, bw=7283.6MB/s, iops=7458.4K, runt= 20002msec write: io=160259MB, bw=8012.2MB/s, iops=4102.3K, runt= 20002msec write: io=174527MB, bw=8724.7MB/s, iops=2233.5K, runt= 20004msec write: io=193620MB, bw=9679.5MB/s, iops=1238.1K, runt= 20004msec write: io=202051MB, bw=10099MB/s, iops=646336, runt= 20007msec write: io=206876MB, bw=10337MB/s, iops=330770, runt= 20014msec write: io=208138MB, bw=10394MB/s, iops=166302, runt= 20025msec Write: No SEND Signal without moderation (with signal on local invalidation) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ write: io=73271MB, bw=3663.4MB/s, iops=7502.6K, runt= 20001msec write: io=129638MB, bw=6481.6MB/s, iops=6637.2K, runt= 20001msec write: io=161002MB, bw=8049.3MB/s, iops=4121.3K, runt= 20002msec write: io=170759MB, bw=8537.1MB/s, iops=2185.5K, runt= 20002msec write: io=198565MB, bw=9927.3MB/s, iops=1270.7K, runt= 20002msec write: io=207304MB, bw=10363MB/s, iops=663239, runt= 20004msec write: io=207251MB, bw=10358MB/s, iops=331468, runt= 20008msec write: io=208262MB, bw=10405MB/s, iops=166484, runt= 20015msec write: io=81719MB, bw=4085.8MB/s, iops=8367.6K, runt= 20001msec write: io=145468MB, bw=7273.2MB/s, iops=7447.6K, runt= 20001msec write: io=159924MB, bw=7995.5MB/s, iops=4093.7K, runt= 20002msec write: io=174306MB, bw=8713.6MB/s, iops=2230.7K, runt= 20004msec write: io=186670MB, bw=9333.3MB/s, iops=1194.7K, runt= 20001msec write: io=201011MB, bw=10047MB/s, iops=642978, runt= 20008msec write: io=206175MB, bw=10303MB/s, iops=329681, runt= 20012msec write: io=208853MB, bw=10430MB/s, iops=166882, runt= 20024msec ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/3] Fix request completion holes [not found] ` <d9dcdb05-d001-53d8-7ed7-5fafc1709f4a-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 2017-11-01 16:09 ` Max Gurtovoy @ 2017-11-01 16:50 ` Jason Gunthorpe [not found] ` <20171101165036.GD1030-uk2M96/98Pc@public.gmane.org> 2017-11-01 17:26 ` Sagi Grimberg 2 siblings, 1 reply; 23+ messages in thread From: Jason Gunthorpe @ 2017-11-01 16:50 UTC (permalink / raw) To: idanb Cc: Sagi Grimberg, Max Gurtovoy, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig On Wed, Nov 01, 2017 at 06:02:38PM +0200, idanb wrote: > (3) No send signal with invalidation signal w/o moderation 5.2M / 8.3M Actually, isn't this valid and reasonable? You don't need to see the SEND completion, the only thing that matters is the invalidate completion, and the invalidate will naturally order after the send is finished, yes? Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <20171101165036.GD1030-uk2M96/98Pc@public.gmane.org>]
* Re: [PATCH 0/3] Fix request completion holes [not found] ` <20171101165036.GD1030-uk2M96/98Pc@public.gmane.org> @ 2017-11-01 17:31 ` Sagi Grimberg [not found] ` <5f3955d2-9116-5f18-2299-cc697947d599-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Sagi Grimberg @ 2017-11-01 17:31 UTC (permalink / raw) To: Jason Gunthorpe, idanb Cc: Max Gurtovoy, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig >> (3) No send signal with invalidation signal w/o moderation 5.2M / 8.3M > > Actually, isn't this valid and reasonable? You don't need to see the > SEND completion, the only thing that matters is the invalidate > completion, and the invalidate will naturally order after the send > is finished, yes? the default mode utilize remote invalidation, so no, its not valid. Without remote invalidate I'm not sure, don't remember the ordering rules from the top of my head. Idan, does a local invalidate completion guarantee a prior send completion (of a non-related buffer)? I vaguely remember that explicit fencing is required to guarantee that. I guess we could not signal send completions if we are in a local invalidate mode, but I'm liking this less and less... -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <5f3955d2-9116-5f18-2299-cc697947d599-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>]
* Re: [PATCH 0/3] Fix request completion holes [not found] ` <5f3955d2-9116-5f18-2299-cc697947d599-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> @ 2017-11-01 17:58 ` Jason Gunthorpe [not found] ` <20171101175819.GG1030-uk2M96/98Pc@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Jason Gunthorpe @ 2017-11-01 17:58 UTC (permalink / raw) To: Sagi Grimberg Cc: idanb, Max Gurtovoy, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig On Wed, Nov 01, 2017 at 07:31:39PM +0200, Sagi Grimberg wrote: > the default mode utilize remote invalidation, so no, its not valid. Well, usually the ULP design should allow some things to be reaped async, and latency senstive things to be sync. Eg if you have a SEND using a local buffer with a RKEY, then you'd declare the RKEY data completed when SEND_WITH_INVALIDATE is returned. Recycling the lkey command buffer is an async process and can wait unsignaled until something signalled comes to push its completion through. This approach reduces the # of CQEs required at the expense of delaying reaping the lkey buffer. > Without remote invalidate I'm not sure, don't remember the ordering > rules from the top of my head. Idan, does a local invalidate completion > guarantee a prior send completion (of a non-related buffer)? I vaguely > remember that explicit fencing is required to guarantee that. This was the case I thought Idan was testing.. Local invalidate is defined to always be ordered by the spec, so it is required to guarentee that the SEND is completed. > I guess we could not signal send completions if we are in a local > invalidate mode, but I'm liking this less and less... If the SEND reaping can be defered as above, then the local invalidate completion becomes the signaled CQE that allows the SEND to be reaped. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <20171101175819.GG1030-uk2M96/98Pc@public.gmane.org>]
* Re: [PATCH 0/3] Fix request completion holes [not found] ` <20171101175819.GG1030-uk2M96/98Pc@public.gmane.org> @ 2017-11-02 8:06 ` Sagi Grimberg [not found] ` <5810bb05-fffd-a0f2-3509-9d9b89a2ef32-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Sagi Grimberg @ 2017-11-02 8:06 UTC (permalink / raw) To: Jason Gunthorpe Cc: idanb, Max Gurtovoy, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig >> the default mode utilize remote invalidation, so no, its not valid. > > Well, usually the ULP design should allow some things to be reaped > async, and latency senstive things to be sync. > > Eg if you have a SEND using a local buffer with a RKEY, then you'd > declare the RKEY data completed when SEND_WITH_INVALIDATE is returned. > > Recycling the lkey command buffer is an async process and can wait > unsignaled until something signalled comes to push its completion > through. Not when using inline data with the send, which is the main issue here. if we inline data to the command, we will use the local dma lkey, which does not even have a local invalidate following it. > This approach reduces the # of CQEs required at the expense of > delaying reaping the lkey buffer. Yes, I understand the advantage of suppressing send completions.. But I'm not very fond of non-trivial behavior depending on all sort of overly complicated conditions. >> Without remote invalidate I'm not sure, don't remember the ordering >> rules from the top of my head. Idan, does a local invalidate completion >> guarantee a prior send completion (of a non-related buffer)? I vaguely >> remember that explicit fencing is required to guarantee that. > > This was the case I thought Idan was testing.. Still doesn't mean it is guaranteed to do the right thing :) > Local invalidate is defined to always be ordered by the spec, so it > is required to guarentee that the SEND is completed. So local invalidate completion is guaranteed to come after all the completions prior to it in the send queue? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <5810bb05-fffd-a0f2-3509-9d9b89a2ef32-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>]
* Re: [PATCH 0/3] Fix request completion holes [not found] ` <5810bb05-fffd-a0f2-3509-9d9b89a2ef32-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> @ 2017-11-02 15:12 ` Jason Gunthorpe [not found] ` <20171102151254.GE18874-uk2M96/98Pc@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Jason Gunthorpe @ 2017-11-02 15:12 UTC (permalink / raw) To: Sagi Grimberg Cc: idanb, Max Gurtovoy, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig On Thu, Nov 02, 2017 at 10:06:30AM +0200, Sagi Grimberg wrote: > > >>the default mode utilize remote invalidation, so no, its not valid. > > > >Well, usually the ULP design should allow some things to be reaped > >async, and latency senstive things to be sync. > > > >Eg if you have a SEND using a local buffer with a RKEY, then you'd > >declare the RKEY data completed when SEND_WITH_INVALIDATE is returned. > > > >Recycling the lkey command buffer is an async process and can wait > >unsignaled until something signalled comes to push its completion > >through. > > Not when using inline data with the send, which is the main issue > here. if we inline data to the command, we will use the local > dma lkey, which does not even have a local invalidate following it. Does nvme use inline data send and RKEY transfer in the same SEND? Then it would need to signal the SEND if remote invalidate is used, otherwise it only needs to signal the local invalidate for the RKEY.. > >Local invalidate is defined to always be ordered by the spec, so it > >is required to guarentee that the SEND is completed. > So local invalidate completion is guaranteed to come after all the > completions prior to it in the send queue? IBA spec says so.. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <20171102151254.GE18874-uk2M96/98Pc@public.gmane.org>]
* Re: [PATCH 0/3] Fix request completion holes [not found] ` <20171102151254.GE18874-uk2M96/98Pc@public.gmane.org> @ 2017-11-02 15:23 ` Sagi Grimberg [not found] ` <6626939a-3626-181f-ccea-0f6482e7a3fc-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> 2017-11-02 16:18 ` Steve Wise 1 sibling, 1 reply; 23+ messages in thread From: Sagi Grimberg @ 2017-11-02 15:23 UTC (permalink / raw) To: Jason Gunthorpe Cc: idanb, Max Gurtovoy, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig >> Not when using inline data with the send, which is the main issue >> here. if we inline data to the command, we will use the local >> dma lkey, which does not even have a local invalidate following it. > > Does nvme use inline data send and RKEY transfer in the same SEND? No, we use local dma lkey for inline data. > Then it would need to signal the SEND if remote invalidate is used, > otherwise it only needs to signal the local invalidate for the RKEY.. Also if its simply using the local dma lkey without sending any rkey. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <6626939a-3626-181f-ccea-0f6482e7a3fc-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>]
* Re: [PATCH 0/3] Fix request completion holes [not found] ` <6626939a-3626-181f-ccea-0f6482e7a3fc-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> @ 2017-11-02 15:51 ` Jason Gunthorpe [not found] ` <20171102155156.GG18874-uk2M96/98Pc@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Jason Gunthorpe @ 2017-11-02 15:51 UTC (permalink / raw) To: Sagi Grimberg Cc: idanb, Max Gurtovoy, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig On Thu, Nov 02, 2017 at 05:23:36PM +0200, Sagi Grimberg wrote: > >Then it would need to signal the SEND if remote invalidate is used, > >otherwise it only needs to signal the local invalidate for the RKEY.. > > Also if its simply using the local dma lkey without sending any rkey. Right, OK, to be clear, my emails were only about the case where there was a SEND and INVALIDATE required together, I wasn't discussing SEND alone, that is already as good as it can be I think :) Thanks, Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <20171102155156.GG18874-uk2M96/98Pc@public.gmane.org>]
* Re: [PATCH 0/3] Fix request completion holes [not found] ` <20171102155156.GG18874-uk2M96/98Pc@public.gmane.org> @ 2017-11-02 16:15 ` Sagi Grimberg 0 siblings, 0 replies; 23+ messages in thread From: Sagi Grimberg @ 2017-11-02 16:15 UTC (permalink / raw) To: Jason Gunthorpe Cc: idanb, Max Gurtovoy, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig >>> Then it would need to signal the SEND if remote invalidate is used, >>> otherwise it only needs to signal the local invalidate for the RKEY.. >> >> Also if its simply using the local dma lkey without sending any rkey. > > Right, OK, to be clear, my emails were only about the case where there > was a SEND and INVALIDATE required together, I wasn't discussing SEND > alone, that is already as good as it can be I think :) Not arguing here :) -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: [PATCH 0/3] Fix request completion holes [not found] ` <20171102151254.GE18874-uk2M96/98Pc@public.gmane.org> 2017-11-02 15:23 ` Sagi Grimberg @ 2017-11-02 16:18 ` Steve Wise 2017-11-02 16:36 ` Jason Gunthorpe 1 sibling, 1 reply; 23+ messages in thread From: Steve Wise @ 2017-11-02 16:18 UTC (permalink / raw) To: 'Jason Gunthorpe', 'Sagi Grimberg' Cc: 'idanb', 'Max Gurtovoy', linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, 'Christoph Hellwig' > -----Original Message----- > From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma- > owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Jason Gunthorpe > Sent: Thursday, November 02, 2017 10:13 AM > To: Sagi Grimberg > Cc: idanb; Max Gurtovoy; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; linux- > nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org; Christoph Hellwig > Subject: Re: [PATCH 0/3] Fix request completion holes > > On Thu, Nov 02, 2017 at 10:06:30AM +0200, Sagi Grimberg wrote: > > > > >>the default mode utilize remote invalidation, so no, its not valid. > > > > > >Well, usually the ULP design should allow some things to be reaped > > >async, and latency senstive things to be sync. > > > > > >Eg if you have a SEND using a local buffer with a RKEY, then you'd > > >declare the RKEY data completed when SEND_WITH_INVALIDATE is returned. > > > > > >Recycling the lkey command buffer is an async process and can wait > > >unsignaled until something signalled comes to push its completion > > >through. > > > > Not when using inline data with the send, which is the main issue > > here. if we inline data to the command, we will use the local > > dma lkey, which does not even have a local invalidate following it. > > Does nvme use inline data send and RKEY transfer in the same SEND? > Then it would need to signal the SEND if remote invalidate is used, > otherwise it only needs to signal the local invalidate for the RKEY.. > > > >Local invalidate is defined to always be ordered by the spec, so it > > >is required to guarentee that the SEND is completed. > > > So local invalidate completion is guaranteed to come after all the > > completions prior to it in the send queue? > > IBA spec says so.. > iWARP spec too. This is in regard to completion ordering though. The local invalidate send WR must have the IB_SEND_FENCE flag set if you want it to only be executed after all prior send WRs are executed. Either way, the completions are always inserted into the cq in-sq-submission-order and a signaled completion implies completions of all prior unsignaled WRs. Steve. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/3] Fix request completion holes 2017-11-02 16:18 ` Steve Wise @ 2017-11-02 16:36 ` Jason Gunthorpe [not found] ` <20171102163614.GK18874-uk2M96/98Pc@public.gmane.org> 0 siblings, 1 reply; 23+ messages in thread From: Jason Gunthorpe @ 2017-11-02 16:36 UTC (permalink / raw) To: Steve Wise Cc: 'Sagi Grimberg', 'idanb', 'Max Gurtovoy', linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, 'Christoph Hellwig', chuck.lever-QHcLZuEGTsvQT0dZR+AlfA On Thu, Nov 02, 2017 at 11:18:42AM -0500, Steve Wise wrote: > iWARP spec too. This is in regard to completion ordering though. The local > invalidate send WR must have the IB_SEND_FENCE flag set if you want it to only > be executed after all prior send WRs are executed. Oh right, yes, you need to have local invalidate fence to guarentee that, but usually a local operation will not refer to a RKEY, so execution ordering wont matter. But this is a good general point, doesn't a ULP need to set FENCE on SEND, eg: RDMA READ (rkey) RDMA READ (rkey) SEND w/ FENCE (tell remote to invalidate rkey) Otherwise IBA Table 79 says RDMA READ can pass SEND and we have a situation where the rkey has become invalidated when the remote is still trying to use it. IBA actually explicitly calls this out (pg 538): RDMA Read operations may be executed at the target after subse- quent Send and Invalidate operation already performed the invalida- tion at the target. That may cause the RDMA Read operation to fail. Setting the Fence Indicator on the subsequent operations guarantees that the RDMA Read will fully complete before the invalidation takes place. None of our ULPs use IB_SEND_FENCE - so this is a bug! > Either way, the completions are always inserted into the cq > in-sq-submission-order and a signaled completion implies completions > of all prior unsignaled WRs. Yes, excactly what is relevant for this nvme case.. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <20171102163614.GK18874-uk2M96/98Pc@public.gmane.org>]
* RE: [PATCH 0/3] Fix request completion holes [not found] ` <20171102163614.GK18874-uk2M96/98Pc@public.gmane.org> @ 2017-11-02 16:53 ` Steve Wise 2017-11-02 16:54 ` Jason Gunthorpe 0 siblings, 1 reply; 23+ messages in thread From: Steve Wise @ 2017-11-02 16:53 UTC (permalink / raw) To: 'Jason Gunthorpe' Cc: 'Sagi Grimberg', 'idanb', 'Max Gurtovoy', linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, 'Christoph Hellwig', chuck.lever-QHcLZuEGTsvQT0dZR+AlfA > > > iWARP spec too. This is in regard to completion ordering though. The local > > invalidate send WR must have the IB_SEND_FENCE flag set if you want it to only > > be executed after all prior send WRs are executed. > > Oh right, yes, you need to have local invalidate fence to guarentee > that, but usually a local operation will not refer to a RKEY, so > execution ordering wont matter. > > But this is a good general point, doesn't a ULP need to set FENCE on > SEND, eg: > > RDMA READ (rkey) > RDMA READ (rkey) > SEND w/ FENCE (tell remote to invalidate rkey) > > Otherwise IBA Table 79 says RDMA READ can pass SEND and we have a > situation where the rkey has become invalidated when the remote is > still trying to use it. > > IBA actually explicitly calls this out (pg 538): > > RDMA Read operations may be executed at the target after subse- > quent Send and Invalidate operation already performed the invalida- > tion at the target. That may cause the RDMA Read operation to fail. > Setting the Fence Indicator on the subsequent operations guarantees > that the RDMA Read will fully complete before the invalidation takes > place. > > None of our ULPs use IB_SEND_FENCE - so this is a bug! > I think the nvmf target doesn't post the SEND_W_INVALIDATE (NVMF CQE) until the RDMA READ(s) (and back end BIO submission) complete. Steve. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/3] Fix request completion holes 2017-11-02 16:53 ` Steve Wise @ 2017-11-02 16:54 ` Jason Gunthorpe 0 siblings, 0 replies; 23+ messages in thread From: Jason Gunthorpe @ 2017-11-02 16:54 UTC (permalink / raw) To: Steve Wise Cc: 'Sagi Grimberg', 'idanb', 'Max Gurtovoy', linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, 'Christoph Hellwig', chuck.lever-QHcLZuEGTsvQT0dZR+AlfA On Thu, Nov 02, 2017 at 11:53:44AM -0500, Steve Wise wrote: > I think the nvmf target doesn't post the SEND_W_INVALIDATE (NVMF CQE) until the > RDMA READ(s) (and back end BIO submission) complete. Ah! That makes sense for a storage use case Thanks, Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/3] Fix request completion holes [not found] ` <d9dcdb05-d001-53d8-7ed7-5fafc1709f4a-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> 2017-11-01 16:09 ` Max Gurtovoy 2017-11-01 16:50 ` Jason Gunthorpe @ 2017-11-01 17:26 ` Sagi Grimberg [not found] ` <ae5a6be3-f9c3-9926-ab0d-48b0a99cdb35-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> 2 siblings, 1 reply; 23+ messages in thread From: Sagi Grimberg @ 2017-11-01 17:26 UTC (permalink / raw) To: idanb, Max Gurtovoy, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Cc: Jason Gunthorpe, Christoph Hellwig > (4) Send signal with invalidation signal w/o moderation 3.4M / 4.6M What about remote invalidation? That is how it will usually be used... > (5) Send signal with invalidation signal w/ moderation 6.2M / 11.8M What about latency effects? Any noticeable change? > I understand that (1), (2) and (3) are invalid, the performance > reduction induce by mode (4) need to motivate us to enable cqe/event > moderation as soon as possible. Well good, it means you can help test my adaptive coalescing code :) May I ask what moderation parameters did you use? it will be useful to pin down the profile levels the adaptive moderator would bounce between. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
[parent not found: <ae5a6be3-f9c3-9926-ab0d-48b0a99cdb35-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>]
* Re: [PATCH 0/3] Fix request completion holes [not found] ` <ae5a6be3-f9c3-9926-ab0d-48b0a99cdb35-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> @ 2017-11-01 22:23 ` Max Gurtovoy 2017-11-02 17:55 ` Steve Wise 1 sibling, 0 replies; 23+ messages in thread From: Max Gurtovoy @ 2017-11-01 22:23 UTC (permalink / raw) To: Sagi Grimberg, idanb, linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Cc: Jason Gunthorpe, Christoph Hellwig On 11/1/2017 7:26 PM, Sagi Grimberg wrote: > >> (4) Send signal with invalidation signal w/o moderation 3.4M / 4.6M > > What about remote invalidation? That is how it will usually be used... We haven't tested remote invalidation yet. > >> (5) Send signal with invalidation signal w/ moderation 6.2M / 11.8M > > What about latency effects? Any noticeable change? > >> I understand that (1), (2) and (3) are invalid, the performance >> reduction induce by mode (4) need to motivate us to enable cqe/event >> moderation as soon as possible. > > Well good, it means you can help test my adaptive coalescing code :) Of course. > > May I ask what moderation parameters did you use? it will be useful > to pin down the profile levels the adaptive moderator would bounce > between. Sure, I mentioned the params earlier (cq_completions=32 cq_timeout=16). Sagi, As we can see from our results, the changes made in this patchset must be pushed together with some cq moderation (adaptive or manual option meanwhile). In the testing, I created a module parameter but we can create a different option for the user to configure. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: [PATCH 0/3] Fix request completion holes [not found] ` <ae5a6be3-f9c3-9926-ab0d-48b0a99cdb35-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> 2017-11-01 22:23 ` Max Gurtovoy @ 2017-11-02 17:55 ` Steve Wise 1 sibling, 0 replies; 23+ messages in thread From: Steve Wise @ 2017-11-02 17:55 UTC (permalink / raw) To: 'Sagi Grimberg', 'idanb', 'Max Gurtovoy', linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r Cc: 'Jason Gunthorpe', 'Christoph Hellwig' > What about latency effects? Any noticeable change? > > > I understand that (1), (2) and (3) are invalid, the performance > > reduction induce by mode (4) need to motivate us to enable cqe/event > > moderation as soon as possible. > > Well good, it means you can help test my adaptive coalescing code :) > > May I ask what moderation parameters did you use? it will be useful > to pin down the profile levels the adaptive moderator would bounce > between. > Hey Sagi, I can help too on testing this with iw_cxgb4. Where is your current WIP? Steve. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2017-11-02 17:55 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-10-31 8:55 [PATCH 0/3] Fix request completion holes Sagi Grimberg
[not found] ` <1509440122-1190-1-git-send-email-sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-10-31 8:55 ` [PATCH 1/3] nvme-rdma: don't suppress send completions Sagi Grimberg
2017-10-31 8:55 ` [PATCH 2/3] nvme-rdma: don't complete requests before a send work request has completed Sagi Grimberg
2017-10-31 8:55 ` [PATCH 3/3] nvme-rdma: wait for local invalidation before completing a request Sagi Grimberg
2017-10-31 9:38 ` [PATCH 0/3] Fix request completion holes Max Gurtovoy
[not found] ` <8abcde91-9150-2982-3900-078619bcdac0-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-10-31 11:10 ` Sagi Grimberg
[not found] ` <a950a671-22a1-432c-555e-8309c8a64a88-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-01 16:02 ` idanb
[not found] ` <d9dcdb05-d001-53d8-7ed7-5fafc1709f4a-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-11-01 16:09 ` Max Gurtovoy
2017-11-01 16:50 ` Jason Gunthorpe
[not found] ` <20171101165036.GD1030-uk2M96/98Pc@public.gmane.org>
2017-11-01 17:31 ` Sagi Grimberg
[not found] ` <5f3955d2-9116-5f18-2299-cc697947d599-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-01 17:58 ` Jason Gunthorpe
[not found] ` <20171101175819.GG1030-uk2M96/98Pc@public.gmane.org>
2017-11-02 8:06 ` Sagi Grimberg
[not found] ` <5810bb05-fffd-a0f2-3509-9d9b89a2ef32-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-02 15:12 ` Jason Gunthorpe
[not found] ` <20171102151254.GE18874-uk2M96/98Pc@public.gmane.org>
2017-11-02 15:23 ` Sagi Grimberg
[not found] ` <6626939a-3626-181f-ccea-0f6482e7a3fc-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-02 15:51 ` Jason Gunthorpe
[not found] ` <20171102155156.GG18874-uk2M96/98Pc@public.gmane.org>
2017-11-02 16:15 ` Sagi Grimberg
2017-11-02 16:18 ` Steve Wise
2017-11-02 16:36 ` Jason Gunthorpe
[not found] ` <20171102163614.GK18874-uk2M96/98Pc@public.gmane.org>
2017-11-02 16:53 ` Steve Wise
2017-11-02 16:54 ` Jason Gunthorpe
2017-11-01 17:26 ` Sagi Grimberg
[not found] ` <ae5a6be3-f9c3-9926-ab0d-48b0a99cdb35-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-11-01 22:23 ` Max Gurtovoy
2017-11-02 17:55 ` Steve Wise
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox