From: Devesh Sharma <devesh.sharma@avagotech.com>
To: Sagi Grimberg <sagig@dev.mellanox.co.il>
Cc: Chuck Lever <chuck.lever@oracle.com>,
linux-rdma@vger.kernel.org,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH v1 08/14] xprtrdma: Acquire MRs in rpcrdma_register_external()
Date: Fri, 8 May 2015 20:54:07 +0530
Message-ID: <CANjDDBiGLcaAofGwz6OGEXUUE_b2rcZepv0ebvTc-XNVEBq5Mw@mail.gmail.com>
In-Reply-To: <554B3EEB.7070302@dev.mellanox.co.il>
On Thu, May 7, 2015 at 4:01 PM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
> On 5/4/2015 8:57 PM, Chuck Lever wrote:
>>
>> Acquiring 64 MRs in rpcrdma_buffer_get() while holding the buffer
>> pool lock is expensive, and unnecessary because most modern adapters
>> can transfer 100s of KBs of payload using just a single MR.
>>
>> Instead, acquire MRs one-at-a-time as chunks are registered, and
>> return them to rb_mws immediately during deregistration.
>>
>> Note: commit 539431a437d2 ("xprtrdma: Don't invalidate FRMRs if
>> registration fails") is reverted: There is now a valid case where
>> registration can fail (with -ENOMEM) but the QP is still in RTS.
>>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> net/sunrpc/xprtrdma/frwr_ops.c | 120
>> ++++++++++++++++++++++++++++------------
>> net/sunrpc/xprtrdma/rpc_rdma.c | 3 -
>> net/sunrpc/xprtrdma/verbs.c | 21 -------
>> 3 files changed, 86 insertions(+), 58 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c
>> b/net/sunrpc/xprtrdma/frwr_ops.c
>> index a06d9a3..6f93a89 100644
>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>> @@ -11,6 +11,62 @@
>> * but most complex memory registration mode.
>> */
>>
>> +/* Normal operation
>> + *
>> + * A Memory Region is prepared for RDMA READ or WRITE using a FAST_REG
>> + * Work Request (frmr_op_map). When the RDMA operation is finished, this
>> + * Memory Region is invalidated using a LOCAL_INV Work Request
>> + * (frmr_op_unmap).
>> + *
>> + * Typically these Work Requests are not signaled, and neither are RDMA
>> + * SEND Work Requests (with the exception of signaling occasionally to
>> + * prevent provider work queue overflows). This greatly reduces HCA
>> + * interrupt workload.
>> + *
>> + * As an optimization, frwr_op_unmap marks MRs INVALID before the
>> + * LOCAL_INV WR is posted. If posting succeeds, the MR is placed on
>> + * rb_mws immediately so that no work (like managing a linked list
>> + * under a spinlock) is needed in the completion upcall.
>> + *
>> + * But this means that frwr_op_map() can occasionally encounter an MR
>> + * that is INVALID but the LOCAL_INV WR has not completed. Work Queue
>> + * ordering prevents a subsequent FAST_REG WR from executing against
>> + * that MR while it is still being invalidated.
>> + */
>> +
>> +/* Transport recovery
>> + *
>> + * ->op_map and the transport connect worker cannot run at the same
>> + * time, but ->op_unmap can fire while the transport connect worker
>> + * is running. Thus MR recovery is handled in ->op_map, to guarantee
>> + * that recovered MRs are owned by a sending RPC, and not one where
>> + * ->op_unmap could fire at the same time transport reconnect is
>> + * being done.
>> + *
>> + * When the underlying transport disconnects, MRs are left in one of
>> + * three states:
>> + *
>> + * INVALID: The MR was not in use before the QP entered ERROR state.
>> + * (Or, the LOCAL_INV WR has not completed or flushed yet).
>> + *
>> + * STALE: The MR was being registered or unregistered when the QP
>> + * entered ERROR state, and the pending WR was flushed.
>> + *
>> + * VALID: The MR was registered before the QP entered ERROR state.
>> + *
>> + * When frwr_op_map encounters STALE and VALID MRs, they are recovered
>> + * with ib_dereg_mr and then are re-initialized. Because MR recovery
>> + * allocates fresh resources, it is deferred to a workqueue, and the
>> + * recovered MRs are placed back on the rb_mws list when recovery is
>> + * complete. frwr_op_map allocates another MR for the current RPC while
>> + * the broken MR is reset.
>> + *
>> + * To ensure that frwr_op_map doesn't encounter an MR that is marked
>> + * INVALID but that is about to be flushed due to a previous transport
>> + * disconnect, the transport connect worker attempts to drain all
>> + * pending send queue WRs before the transport is reconnected.
>> + */
>> +
>> #include "xprt_rdma.h"
>>
>> #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>> @@ -250,9 +306,9 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg,
>> struct ib_device *device = ia->ri_device;
>> enum dma_data_direction direction = rpcrdma_data_dir(writing);
>> struct rpcrdma_mr_seg *seg1 = seg;
>> - struct rpcrdma_mw *mw = seg1->rl_mw;
>> - struct rpcrdma_frmr *frmr = &mw->r.frmr;
>> - struct ib_mr *mr = frmr->fr_mr;
>> + struct rpcrdma_mw *mw;
>> + struct rpcrdma_frmr *frmr;
>> + struct ib_mr *mr;
>> struct ib_send_wr fastreg_wr, *bad_wr;
>> u8 key;
>> int len, pageoff;
>> @@ -261,12 +317,25 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg,
>> u64 pa;
>> int page_no;
>>
>> + mw = seg1->rl_mw;
>> + seg1->rl_mw = NULL;
>> + do {
>> + if (mw)
>> + __frwr_queue_recovery(mw);
>> + mw = rpcrdma_get_mw(r_xprt);
>> + if (!mw)
>> + return -ENOMEM;
>> + } while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
>> + frmr = &mw->r.frmr;
>> + frmr->fr_state = FRMR_IS_VALID;
>> +
>> pageoff = offset_in_page(seg1->mr_offset);
>> seg1->mr_offset -= pageoff; /* start of page */
>> seg1->mr_len += pageoff;
>> len = -pageoff;
>> if (nsegs > ia->ri_max_frmr_depth)
>> nsegs = ia->ri_max_frmr_depth;
>> +
>> for (page_no = i = 0; i < nsegs;) {
>> rpcrdma_map_one(device, seg, direction);
>> pa = seg->mr_dma;
>> @@ -285,8 +354,6 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg,
>> dprintk("RPC: %s: Using frmr %p to map %d segments (%d
>> bytes)\n",
>> __func__, mw, i, len);
>>
>> - frmr->fr_state = FRMR_IS_VALID;
>> -
>> memset(&fastreg_wr, 0, sizeof(fastreg_wr));
>> fastreg_wr.wr_id = (unsigned long)(void *)mw;
>> fastreg_wr.opcode = IB_WR_FAST_REG_MR;
>> @@ -298,6 +365,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg,
>> fastreg_wr.wr.fast_reg.access_flags = writing ?
>> IB_ACCESS_REMOTE_WRITE |
>> IB_ACCESS_LOCAL_WRITE :
>> IB_ACCESS_REMOTE_READ;
>> + mr = frmr->fr_mr;
>> key = (u8)(mr->rkey & 0x000000FF);
>> ib_update_fast_reg_key(mr, ++key);
>> fastreg_wr.wr.fast_reg.rkey = mr->rkey;
>> @@ -307,6 +375,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg,
>> if (rc)
>> goto out_senderr;
>>
>> + seg1->rl_mw = mw;
>> seg1->mr_rkey = mr->rkey;
>> seg1->mr_base = seg1->mr_dma + pageoff;
>> seg1->mr_nsegs = i;
>> @@ -315,10 +384,9 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg,
>>
>> out_senderr:
>> dprintk("RPC: %s: ib_post_send status %i\n", __func__, rc);
>> - ib_update_fast_reg_key(mr, --key);
>> - frmr->fr_state = FRMR_IS_INVALID;
>> while (i--)
>> rpcrdma_unmap_one(device, --seg);
>> + __frwr_queue_recovery(mw);
>> return rc;
>> }
>>
>> @@ -330,15 +398,19 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg)
>> {
>> struct rpcrdma_mr_seg *seg1 = seg;
>> struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>> + struct rpcrdma_mw *mw = seg1->rl_mw;
>> struct ib_send_wr invalidate_wr, *bad_wr;
>> int rc, nsegs = seg->mr_nsegs;
>>
>> - seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
>> + dprintk("RPC: %s: FRMR %p\n", __func__, mw);
>> +
>> + seg1->rl_mw = NULL;
>> + mw->r.frmr.fr_state = FRMR_IS_INVALID;
>>
>> memset(&invalidate_wr, 0, sizeof(invalidate_wr));
>> - invalidate_wr.wr_id = (unsigned long)(void *)seg1->rl_mw;
>> + invalidate_wr.wr_id = (unsigned long)(void *)mw;
>> invalidate_wr.opcode = IB_WR_LOCAL_INV;
>> - invalidate_wr.ex.invalidate_rkey =
>> seg1->rl_mw->r.frmr.fr_mr->rkey;
>> + invalidate_wr.ex.invalidate_rkey = mw->r.frmr.fr_mr->rkey;
>> DECR_CQCOUNT(&r_xprt->rx_ep);
>>
>> while (seg1->mr_nsegs--)
>> @@ -348,12 +420,13 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct
>> rpcrdma_mr_seg *seg)
>> read_unlock(&ia->ri_qplock);
>> if (rc)
>> goto out_err;
>> +
>> + rpcrdma_put_mw(r_xprt, mw);
>> return nsegs;
>>
>> out_err:
>> - /* Force rpcrdma_buffer_get() to retry */
>> - seg1->rl_mw->r.frmr.fr_state = FRMR_IS_STALE;
>> dprintk("RPC: %s: ib_post_send status %i\n", __func__, rc);
>> + __frwr_queue_recovery(mw);
>> return nsegs;
>> }
>>
>> @@ -370,29 +443,6 @@ out_err:
>> static void
>> frwr_op_reset(struct rpcrdma_xprt *r_xprt)
>> {
>> - struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>> - struct ib_device *device = r_xprt->rx_ia.ri_device;
>> - unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
>> - struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
>> - struct rpcrdma_mw *r;
>> - int rc;
>> -
>> - list_for_each_entry(r, &buf->rb_all, mw_all) {
>> - if (r->r.frmr.fr_state == FRMR_IS_INVALID)
>> - continue;
>> -
>> - __frwr_release(r);
>> - rc = __frwr_init(r, pd, device, depth);
>> - if (rc) {
>> - dprintk("RPC: %s: mw %p left %s\n",
>> - __func__, r,
>> - (r->r.frmr.fr_state == FRMR_IS_STALE ?
>> - "stale" : "valid"));
>> - continue;
>> - }
>> -
>> - r->r.frmr.fr_state = FRMR_IS_INVALID;
>> - }
>> }
>>
>> static void
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c
>> b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index 98a3b95..35ead0b 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -284,9 +284,6 @@ rpcrdma_create_chunks(struct rpc_rqst *rqst, struct
>> xdr_buf *target,
>> return (unsigned char *)iptr - (unsigned char *)headerp;
>>
>> out:
>> - if (r_xprt->rx_ia.ri_memreg_strategy == RPCRDMA_FRMR)
>> - return n;
>> -
>> for (pos = 0; nchunks--;)
>> pos += r_xprt->rx_ia.ri_ops->ro_unmap(r_xprt,
>>
>> &req->rl_segments[pos]);
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index 8a43c7ef..5226161 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -1343,12 +1343,11 @@ rpcrdma_buffer_get_frmrs(struct rpcrdma_req *req,
>> struct rpcrdma_buffer *buf,
>> struct rpcrdma_req *
>> rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
>> {
>> - struct rpcrdma_ia *ia = rdmab_to_ia(buffers);
>> - struct list_head stale;
>> struct rpcrdma_req *req;
>> unsigned long flags;
>>
>> spin_lock_irqsave(&buffers->rb_lock, flags);
>> +
>> if (buffers->rb_send_index == buffers->rb_max_requests) {
>> spin_unlock_irqrestore(&buffers->rb_lock, flags);
>> dprintk("RPC: %s: out of request buffers\n",
>> __func__);
>> @@ -1367,17 +1366,7 @@ rpcrdma_buffer_get(struct rpcrdma_buffer *buffers)
>> }
>> buffers->rb_send_bufs[buffers->rb_send_index++] = NULL;
>>
>> - INIT_LIST_HEAD(&stale);
>> - switch (ia->ri_memreg_strategy) {
>> - case RPCRDMA_FRMR:
>> - req = rpcrdma_buffer_get_frmrs(req, buffers, &stale);
>> - break;
>> - default:
>> - break;
>> - }
>> spin_unlock_irqrestore(&buffers->rb_lock, flags);
>> - if (!list_empty(&stale))
>> - rpcrdma_retry_flushed_linv(&stale, buffers);
>> return req;
>> }
>>
>> @@ -1389,18 +1378,10 @@ void
>> rpcrdma_buffer_put(struct rpcrdma_req *req)
>> {
>> struct rpcrdma_buffer *buffers = req->rl_buffer;
>> - struct rpcrdma_ia *ia = rdmab_to_ia(buffers);
>> unsigned long flags;
>>
>> spin_lock_irqsave(&buffers->rb_lock, flags);
>> rpcrdma_buffer_put_sendbuf(req, buffers);
>> - switch (ia->ri_memreg_strategy) {
>> - case RPCRDMA_FRMR:
>> - rpcrdma_buffer_put_mrs(req, buffers);
>> - break;
>> - default:
>> - break;
>> - }
>> spin_unlock_irqrestore(&buffers->rb_lock, flags);
>> }
>>
>>
>
> Don't you need a call to flush_workqueue(frwr_recovery_wq) when you're
> about to destroy the endpoint (and the buffers and the MRs...)?
I agree with Sagi here: in xprt_rdma_destroy(), before calling
rpcrdma_buffer_destroy(), flushing the recovery workqueue and cancelling
any pending work seem necessary; a minimal sketch of that ordering
follows below.
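
To make the suggestion concrete, here is a minimal sketch of the
teardown ordering being asked for. It builds on the frwr_recovery_wq
workqueue and __frwr_recovery_worker() introduced in patch 07/14; the
helper name frwr_destroy_recovery_wq() and the exact place it would be
called from are my assumptions, not code in this series.

/* Sketch only: frwr_recovery_wq is the workqueue from patch 07/14; the
 * helper name and its call site are assumptions for illustration.
 */
static void
frwr_destroy_recovery_wq(void)
{
	struct workqueue_struct *wq = frwr_recovery_wq;

	if (!wq)
		return;
	frwr_recovery_wq = NULL;

	/* Wait for any in-flight __frwr_recovery_worker() items so no MW
	 * is touched after rpcrdma_buffer_destroy() frees it, then tear
	 * the workqueue down.
	 */
	flush_workqueue(wq);
	destroy_workqueue(wq);
}

Calling this from the transport destroy path (or on module unload)
before rpcrdma_buffer_destroy() would guarantee that every MW queued
for recovery has either been fully reset and returned to rb_mws, or its
work item has run to completion, before the buffer pool disappears.
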
Separately, with this optimization, is it possible that the server
starts traffic on an rkey whose FAST_REG WR has not yet completed? That
would cause an access-error async event on the client side, and the
server would see flush errors.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
-Regards
Devesh